Emergent Musical Properties of a Transformer Under Contrastive Self-Supervised Learning

Author: Yuexuan KONG; Gabriel Meseguer-Brocal; Vincent Lostanlen; Mathieu Lagrange; Romain Hennequin

Publisher: Zenodo

DOI: 10.5281/zenodo.17706383

Source: https://zenodo.org/records/17706383/files/000028.pdf

EMERGENT MUSICAL PROPERTIES OF A TRANSFORMER UNDER
CONTRASTIVE SELF-SUPERVISED LEARNING
Yuexuan Kong 1,2, Gabriel Meseguer-Brocal 1, Vincent Lostanlen 2,
Mathieu Lagrange 2, Romain Hennequin 1
1 Deezer Research, Paris, France
2 Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
ABSTRACT
In music information retrieval (MIR), contrastive self-supervised learning for general-purpose representation models is effective for global tasks such as automatic tagging. However, for local tasks such as chord estimation, it is widely assumed that contrastively trained general-purpose self-supervised models are inadequate and that more sophisticated SSL is necessary; e.g., masked modeling. Our paper challenges this assumption by revealing the potential of contrastive SSL paired with a transformer in local MIR tasks. We consider a lightweight vision transformer with one-dimensional patches in the time–frequency domain (ViT-1D) and train it with simple contrastive SSL through normalized temperature-scaled cross-entropy loss (NT-Xent). Although NT-Xent operates only over the class token, we observe that, potentially thanks to weight sharing, informative musical properties emerge in ViT-1D's sequence tokens. On global tasks, the temporal average of class and sequence tokens offers a performance increase compared to the class token alone, showing useful properties in the sequence tokens. On local tasks, sequence tokens perform unexpectedly well, despite not being specifically trained for. Furthermore, high-level musical features such as onsets emerge from layer-wise attention maps, and self-similarity matrices show that different layers capture different musical dimensions. Our paper does not focus on improving performance but advances the musical interpretation of transformers and sheds light on some overlooked abilities of contrastive SSL paired with transformers for sequence modeling in MIR.
1. INTRODUCTION
We may categorize tasks in music information retrieval (MIR) as either local or global. Global tasks, such as music tagging and key estimation, are time-shift invariant and require a single prediction per piece of music. Local tasks, such as beat tracking and chord estimation, are time-shift equivariant and require frame-wise predictions, with a frame rate typically higher than 1 Hz [1].

© . Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: , “Emergent musical properties of a transformer under contrastive self-supervised learning”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
To address these tasks, self-supervised learning (SSL) has recently emerged as a powerful alternative to supervised learning in MIR. SSL enables a model to learn informative representations through a pretext task without requiring labeled data. While these pretext tasks may not have direct practical relevance, solving them requires the model to capture one or various musical dimensions [2–7]. In general-purpose models, these learned representations are then useful for many different downstream tasks, requiring only a small amount of supervision.
In general-purpose SSL for MIR, CLMR [8] and MULE [9] marked a first step forward, following the adoption of contrastive pretext tasks in computer vision [10, 11]. In contrastive learning, a loss encourages the model to project the samples of a positive pair close together in the embedding space and to push negative samples far apart. Their results showed the potential of contrastive SSL to generalize across various global music tasks. However, due to the properties of convolutional neural networks and global pooling layers, both models capture global music representations that summarize the entire sequence rather than preserving information at each time step. More general-purpose SSL research further developed contrastive pretext tasks by using a momentum-based paradigm [12], combining different musical stems [13], analyzing transformation in embedding space [14], and developing more effective training strategies [15]. The aforementioned papers evaluate their systems only on global tasks. In contrast, the potential of general-purpose contrastive SSL on local tasks remains understudied.
Generative modeling and masked modeling are widely used for general-purpose SSL at the frame level. Generative models such as Jukebox [16] and Music2Latent [17] have shown that various musical dimensions are captured in their embedding space used for generation, by evaluating on multiple MIR tasks. MERT [18] is a music representation learning model that resembles the masking training scheme of HuBERT [19] from speech. M2D [20] employs a joint-embedding predictive architecture (JEPA) that jointly predicts from both a masked sample and the original sample. MusicFM [21] conducts a comparative study of different masked modeling approaches. Although these models handle both global and local tasks well, they have two shortcomings. First, large-scale architectures are necessary: the number of parameters typically ranges from 58M (Music2Latent) to 5B (Jukebox). Second, training these models depends on sophisticated techniques such as exponential moving averages, teacher–student distillation, and multiple loss functions, requiring careful fine-tuning of hyperparameters and large computational resources.

[Figure 1 diagram. Contrastive pretext task (left): View 1 and View 2 are each split into patches, summed with a positional embedding to form tokens 0, 1, …, T, and passed through ViT-1D; NT-Xent is computed on the class tokens. Downstream tasks (right): sequence tokens (Seq) feed an MLP for the local tasks (beat tracking, chord estimation); the class token (Cls) or the average (Avg) feeds an MLP for the global tasks (tagging, key estimation).]

Figure 1: Contrastive pre-training and probing for downstream tasks. The inputs (left) are mel-spectrograms, patched along vertical slices of frequency bins per time frame. Positional encoding and a class token (learnable parameters with the average of the sequence tokens) are added. NT-Xent loss is applied only to the class token. For downstream tasks, sequence tokens (Seq) (excluding the class token) are used for local tasks, while the class token (Cls) or the average of all tokens (Avg) is used for global tasks.
Transformers have been applied to contrastive pretext tasks [22, 23] using the AST architecture [24] and to multimodal audio-text learning [25, 26]. In these cases, contrastive loss is applied only to the class token; i.e., a learnable token attached at the beginning of the sequence. Optimization of the loss brings paired audio-audio or audio-text samples close together in the embedding space. Computer vision researchers have reported emergent properties when training Vision Transformers (ViTs) [27]. Crucially, such properties do not emerge through supervised pretraining [28]. This approach has proven valuable, not only for global tasks such as classification, but also for local tasks such as image segmentation [28, 29]. Attention maps are also studied to show the emergent local properties, providing insights into the local patterns and features which are learned during training. However, to our knowledge, these emergent properties in transformer tokens have not yet been explored on local tasks of music.
In general-purpose SSL, we notice a gap between contrastive SSL and masked modeling in MIR, particularly regarding the ability of a contrastive pretext task to capture both global and local properties. This gap leads us to the following questions: have we moved on too quickly from contrastive SSL to more complex approaches? Does it still hold untapped potential when paired with a transformer? To answer them, we proceed in the following ways:
Pretext task. We use a lightweight ViT with 1-D spectrogram patches as token inputs (ViT-1D). We train ViT-1D with a normalized temperature-scaled cross-entropy loss (NT-Xent) applied only to the class token of positive and negative pairs (Section 2).

Downstream tasks. We evaluate the effectiveness of both the class token and sequence tokens on local and global downstream tasks. While the class token is time-invariant due to the pretext task formulation, we show that sequence tokens capture local musical properties (Section 3).

Emergent properties. To understand how local properties are captured, we conduct qualitative and quantitative analyses of attention maps (Section 5) and self-similarity matrices (Section 6).¹
2. CONTRASTIVE PRETEXT TASK
Patching details: We compute the mel-frequency spectrogram of a segment of duration equal to $d = 4$ seconds, obtaining matrices $x$, with 128 frequency bins and a frame rate of $\xi = 31.5$ Hz. Unlike standard ViT, which uses 2D patches, we extract 1D patches by taking all 128 mel bins from a single frame and apply one convolutional layer $p$, projecting into an embedding of size $(H_p, W_p) = (192, 1)$ for each patch $x_p = p(x)$ and $x_p \in \mathbb{R}^{H_p \times W_p}$. By using 1D patches, each patch is directly connected to all the frequency bins in a time frame. We obtain the patch sequence $x_p$ as $[x_p^1, x_p^2, \ldots, x_p^T]$ with $T = d\xi = 126$, where each patch corresponds to one time frame in the mel-spectrogram. This sequence input to a transformer is commonly named as sequence tokens [27].
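As a rough illustration of the patching step, the sketch below slices a mel-spectrogram into one patch per time frame and checks that $T = d\xi = 126$. The function names `num_tokens` and `extract_1d_patches` and the nested-list layout are assumptions made for this example, and the learned convolutional projection $p$ to 192 dimensions is deliberately left out.

```python
# Illustrative sketch (not the authors' code): 1D patching of a mel-spectrogram.
N_MELS = 128          # frequency bins per frame (from the text)
DURATION_S = 4.0      # segment duration d in seconds
FRAME_RATE_HZ = 31.5  # frame rate xi in Hz


def num_tokens(duration_s=DURATION_S, frame_rate_hz=FRAME_RATE_HZ):
    """Number of sequence tokens T = d * xi."""
    return round(duration_s * frame_rate_hz)


def extract_1d_patches(mel):
    """Turn a mel-spectrogram stored as [n_mels][n_frames] into a list of
    n_frames patches, each holding all mel bins of one time frame."""
    n_frames = len(mel[0])
    return [[row[t] for row in mel] for t in range(n_frames)]


# Hypothetical all-zero spectrogram with the shapes given in the text.
T = num_tokens()
mel = [[0.0] * T for _ in range(N_MELS)]
patches = extract_1d_patches(mel)
print(T, len(patches), len(patches[0]))
```

Each patch spans the full frequency axis, so the token index is purely a time index.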
Encoder architecture: We use the original ViT implementation of the smallest version as encoder (with our patching method), with an embedding dimension of 192, 12 transformer blocks and 3 attention heads. Unlike what is commonly done in SSL, we attach no disposable projection head to the transformer encoder, in order to focus on the emergent properties purely in the transformer. This choice possibly reduces the overall performance of the model, as adding such a head during pretext training benefits downstream tasks [10]. We denote our encoder by $f_\theta$. We prepend a class token, composed of learnable parameters and the average of the other tokens, at the beginning of $x_p$. Then, a 2D sinusoidal positional encoding on the frequency and time dimensions is added to all patches including the class token, obtaining the final input tokens of $f_\theta$ as $[z_0^0, z_0^1, \ldots, z_0^T]$ with $T = 126$. We define the output token sequence of transformer block $k$ at time $t$ as $z_k^t$ where $0 < k \leq 12$ and $0 \leq t \leq T$. Notably, the class token ($t = 0$) is processed in the same manner as sequence tokens: it shares weights with them in the multi-layer perceptron layers following the attention block. This design allows the class token to integrate and summarize information [27]. The output of the model is $[z_L^0, z_L^1, \ldots, z_L^T]$ where $z_L^0$ is the class token and $L = 12$.

Table 1: Downstream performance comparison of different models on two global tasks (music tagging and key estimation) and two local tasks (beat tracking and chord estimation). The row Cls corresponds to probing with the class token while the row Avg refers to probing with the average of all tokens. Seq refers to probing only with sequence tokens.

                                        GLOBAL                              LOCAL
Model            #Param   Dim     Probe   Music tagging    Key est.   Probe   Beat tracking   Chord est.
                                          mAP      ROC     W.Acc              F-score         Acc
ViT-1D           5.3M     192     Cls     0.400    0.888   0.509      Seq     0.723           0.319
                                  Avg     0.417    0.896   0.622
CLMR-like [30]   2.8M     1024            0.427    0.898   0.459              0.313           0.148
M2D [20]         89M      3840            0.479    0.918   0.531              0.794           0.322

¹ Code, checkpoint and more examples of Sections 5 and 6 can be found at https://github.com/deezer/emergent-musical-properties-transformer/tree/main.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Normalized temperature-scaled cross entropy loss (NT-Xent). For each piece of music, we extract two disjoint segments A and B of four seconds each as a pair of positive samples. No data augmentation is applied. All other segments from the same batch are negative samples. To study the emerging properties in the sequence tokens, we apply the loss only on the class tokens [22, 25, 26], defining our loss function for each pair as:

$$\mathcal{L}_{A,B}(f_\theta) = -\log \frac{\exp\left(\mathrm{sim}(z_{L,A}^0, z_{L,B}^0)/\tau\right)}{\sum_{k \neq B} \exp\left(\mathrm{sim}(z_{L,A}^0, z_{L,k}^0)/\tau\right)} \qquad (1)$$

where $z_{L,A}^0$ and $z_{L,B}^0$ are the class tokens of the positive pair of segments A and B, sim is the cosine similarity function, and $\tau = 0.1$ is the temperature parameter.
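To make the loss concrete, here is a minimal standard-library sketch of NT-Xent for one positive pair of class tokens. It follows the usual SimCLR convention, in which the denominator sums over every segment in the batch except the anchor itself; whether this matches the exact index set of Eq. (1) is an assumption, as are the function names `cosine_similarity` and `nt_xent_pair_loss`.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_pair_loss(class_tokens, anchor, positive, tau=0.1):
    """NT-Xent loss for one (anchor, positive) pair of class tokens.

    class_tokens: one class-token vector per segment in the batch.
    Assumption: the denominator covers all segments except the anchor
    (SimCLR convention), so the positive is included in it.
    """
    z_a = class_tokens[anchor]
    logits = {
        k: cosine_similarity(z_a, z_k) / tau
        for k, z_k in enumerate(class_tokens)
        if k != anchor
    }
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits.values())
    log_denominator = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_denominator - logits[positive]


# Toy batch of four 2-D "class tokens" (hypothetical values).
batch = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nt_xent_pair_loss(batch, anchor=0, positive=1))  # well-aligned pair
print(nt_xent_pair_loss(batch, anchor=0, positive=2))  # poorly aligned pair
```

With $\tau = 0.1$, cosine similarities are scaled by a factor of 10 before the softmax, so the well-aligned pair yields a far smaller loss than the poorly aligned one.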
Pre-training details: We pretrain on a subset of Deezer's catalog of music, with a batch size of 256 pairs of 4-second segments, a base learning rate of $3 \times 10^{-4}$ with a cosine decay until $5 \times 10^{-7}$, and train for 300 epochs.
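The learning-rate schedule can be sketched as follows. The text only gives the base rate, the final rate, and the number of epochs; updating once per epoch and using no warm-up are assumptions of this example, and `cosine_decay_lr` is a hypothetical helper name.

```python
import math

BASE_LR = 3e-4    # base learning rate (from the text)
FINAL_LR = 5e-7   # value reached at the end of the decay (from the text)
NUM_EPOCHS = 300  # number of training epochs (from the text)


def cosine_decay_lr(epoch, num_epochs=NUM_EPOCHS, base_lr=BASE_LR, final_lr=FINAL_LR):
    """Learning rate at a 0-indexed epoch under cosine decay.

    Assumptions: one update per epoch and no warm-up phase.
    """
    progress = epoch / (num_epochs - 1)  # 0.0 at the start, 1.0 at the last epoch
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cosine


for epoch in (0, 150, 299):
    print(epoch, cosine_decay_lr(epoch))
```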
3. DOWNSTREAM TASKS
We focus on two types of downstream tasks, commonly used in general-purpose SSL for MIR. We select music tagging and key estimation as representative global tasks and we choose beat tracking and chord estimation as examples of local tasks. A good performance on these four tasks requires the model to encode both harmonic and rhythmic representations, and high-level musical concepts, on both local and global levels.
As discussed in Section 1, to the best of our knowledge, no prior study has explored the emerging properties of sequence tokens on local tasks for a transformer trained with a contrastive learning framework. As a matter of fact, testing on local tasks may seem counterintuitive: positive samples are simply two segments from the same piece of music, without any explicit alignment of beats or chords, so the class tokens are trained to be time-shift invariant. This raises the possibility that local musical information may not be expected in the token sequence, since there are no constraints to encourage this. However, our results challenge this assumption, showing that meaningful local representations do emerge despite the lack of direct supervision at the frame level.
3.1 Music tagging

Datasets. We use MagnaTagATune [31] with the split proposed by Lee et al. [32].

Training methods. We compare two different probing methods that both use a single linear layer: 1) Probe only on the class token $z_L^0$ (referred to as Cls in Figure 1 and in Table 1); 2) Probe on the average of the entire token sequence $[z_L^0, z_L^1, \ldots, z_L^T]$, including the class token (referred to as Avg in Figure 1 and in Table 1).

Metrics. We use the area under the receiver operating characteristic curve (ROC-AUC) and mean average precision (mAP) in their macro-aggregated versions.
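The probing inputs named in Figure 1 and Table 1 differ only in how the output tokens are pooled before the linear layer. The sketch below illustrates the three options under the assumption that the output sequence is stored as a list of vectors with the class token at index 0; `pool_tokens` is a hypothetical helper, not a function from the authors' repository.

```python
def pool_tokens(tokens, mode):
    """Select the probing input from the output sequence [z^0_L, ..., z^T_L].

    tokens: list of T+1 vectors; index 0 is the class token.
    mode:   "cls" -> class token only (global tasks)
            "avg" -> average of all tokens, class token included (global tasks)
            "seq" -> frame-wise sequence tokens without the class token (local tasks)
    """
    if mode == "cls":
        return tokens[0]
    if mode == "avg":
        dim = len(tokens[0])
        return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    if mode == "seq":
        return tokens[1:]
    raise ValueError(f"unknown mode: {mode}")


# Toy output with T = 2 sequence tokens of dimension 2 (hypothetical values).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pool_tokens(toy, "cls"), pool_tokens(toy, "avg"), pool_tokens(toy, "seq"))
```

Cls and Avg both produce a single vector per segment, whereas Seq keeps one vector per time frame.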
3.2 Music key estimation

Datasets. We use FMAKv2 [4] with a 9:1 split between training and validation. FMAKv2, a derivative of FMAK [33], contains 5489 songs from the Free Music Archive [34], spanning multiple genres. We test on Giant Steps [35], a dataset of 604 electronic dance music tracks.

Training methods. We compare the same two training methods (Avg and Cls) as in music tagging.

Metrics. We use the weighted accuracy from mir_eval [36], which assigns weights to some key prediction errors.
3.3 Beat tracking

Datasets. We use the Ballroom dataset [37], which contains 698 ballroom songs, with a 9:1 split of training and validation. For testing, we use GTZAN Rhythm [38], which includes beat annotations of 998 songs across 10 genres.

Training methods. We exclude the class token and use only the sequence tokens $[z_L^1, \ldots, z_L^T]$, referred to as Seq in Figure 1 and Table 1. Since beat tracking typically requires a higher frame rate than 31.5 Hz, we attach two independent heads to each $z_L^t$, doubling the frame rate to 63 Hz. Additionally, we apply a standard smoothing method of beat tracking, where we increase the values of the two neighboring frames to 0.5 instead of 0.
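The target smoothing described above can be sketched as follows. The 63 Hz frame rate and the 0.5 neighbor value come from the text; rounding beat times to the nearest frame, skipping beats outside the segment, and the helper name `beat_targets` are assumptions of this example.

```python
def beat_targets(beat_times_s, num_frames, frame_rate_hz=63.0):
    """Frame-wise beat targets: 1.0 on beat frames, 0.5 on their two neighbors.

    beat_times_s: beat positions in seconds, relative to the segment start.
    num_frames:   number of output frames (252 for 4 s at 63 Hz).
    """
    targets = [0.0] * num_frames
    for time_s in beat_times_s:
        frame = round(time_s * frame_rate_hz)
        if not 0 <= frame < num_frames:
            continue  # beat falls outside the segment
        targets[frame] = 1.0
        for neighbor in (frame - 1, frame + 1):
            if 0 <= neighbor < num_frames:
                # Never lower a frame that already holds a beat.
                targets[neighbor] = max(targets[neighbor], 0.5)
    return targets


# Hypothetical beats at 120 BPM inside a 4-second segment.
targets = beat_targets([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5], num_frames=252)
print(sum(1 for v in targets if v == 1.0), sum(1 for v in targets if v == 0.5))
```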
Metrics. We apply a Dynamic Bayesian Network (DBN) for post-processing to obtain beat locations [39]. We use the F-score with a tolerance window of 70 ms as evaluation metric, from the mir_eval package [36].
3.4 Chord estimation

Datasets. We collect 124 songs from the Real World Computing Pop (RWC-POP) and Schubert Winterreise Dataset (SWD) [40, 41], limited to one performance per song. We apply an 8:1:1 split between training, validation, and test. We consider 24 classes of major and minor chords, exclude those that cannot be mapped to these classes (e.g., suspended chords), and include a "no chord" class, resulting in 25 classes. The chord vocabulary is the same as used in MusicFM [21].

Training methods. We exclude the class token and use only the sequence tokens $[z_L^1, \ldots, z_L^T]$. The frame rate of the encoder is sufficient for chord estimation, therefore only a single linear layer is used.

Metrics. We use frame-level accuracy over 25 classes.
4. RESULTS ON DOWNSTREAM TASKS
Using the frozen output of the pretrained ViT-1D as input to a trainable linear layer for each task, we study whether sequence tokens capture local properties, despite the class token's time-invariance. We also assess their contribution to global tasks. We compare this to two reference models, pretrained with contrastive learning and masked modeling, evaluating them on the same downstream tasks. It is important to note that during pretext training, our model does not include a projection head after the backbone, which is a common technique used to boost performance on downstream tasks [8, 9, 30]. However, we omit it in order to study more directly the emergent properties of the transformer backbone.
CLMR [8] is a general-purpose contrastive framework introduced for musical representation learning. However, CLMR is trained using pitch shift as data augmentation for positive samples, making it invariant to pitch shift and resulting in low performance on tonality-related tasks. Therefore, for a fairer comparison, we use a CLMR-like contrastively trained ResNet [30], without any data augmentation, trained on the same dataset as ViT-1D, with the same sampling rate and audio length. For both local tasks, we upsample its resolution from 0.25 Hz to the resolution of ViT-1D by attaching the necessary number of linear layers. It is important to note that this upsampling process results in many more parameters for downstream training than ViT-1D, which uses only one linear layer for chord estimation and two for beat tracking.
M2D [20] employs a JEPA architecture, which combines masked modeling with a teacher–student framework trained on general audio. Among the masked modeling models cited in Section 1, it has the fewest parameters. We use its results as a reference. It produces frame-wise predictions at a rate of 6.3 Hz. To adapt it to beat tracking and chord estimation, we train multiple independent linear layers to upsample to the same frame rate as ViT-1D.
Figure 2: Attention matrices from the 3rd, 9th, and 12th transformer blocks (left to right). Lighter colors at position [i, j] indicate more attention from token i to token j. Diagonal lines in the left figure show local attention, while vertical lines across the map in the right figure indicate a shift to global attention in deeper layers.
We observe two key findings from Table 1. 1) Sequence tokens show better performance than the CLMR-like model and comparable results to M2D on local tasks. This suggests the emergence of local and temporal musical representations, in contrast to the time-invariant nature of the class token. The music tagging performance lags behind CLMR-like and M2D; however, both models have a much larger embedding dimension, and the CLMR-like model uses a projection head after the backbone. 2) For global tasks, performance improves when averaging the class token and all sequence tokens together. This implies that the information encoded in sequence tokens is not entirely captured by the class token alone and that incorporating sequence tokens contributes positively to global tasks.
Local musical properties in the sequence tokens improve performance on global downstream tasks and yield unexpectedly good results for local tasks. This shows that, despite ViT-1D being trained with NT-Xent loss only on the class token and the positive sampling strategy making it time-invariant, useful local musical properties still emerge in the sequence tokens. This raises interest in further analyzing the emergent properties across different transformer layers through attention maps (Section 5) and self-similarity matrices (Section 6).
5. PROPERTIES IN ATTENTION MAPS
We study the emergent properties of tokens in the transformer across different layers. ViT-1D has 12 layers in total. We select the 3rd, 6th, 9th, and 12th layers as representative points, as they are evenly spaced from shallower to deeper layers. A more comprehensive analysis of all 12 layers, as well as the potential performance gains from leveraging all layers, is left for future work.
5.1 Qualitative analysis of attention maps

The attention mechanism directs attention to meaningful token positions during training. Attention maps are calculated via scaled dot-product self-attention [42]:

$$M_k^h(Q_k^h, K_k^h) = \mathrm{softmax}\left(\frac{Q_k^h {K_k^h}^{\top}}{\sqrt{d}}\right) \qquad (2)$$

where $0 < k \leq L = 12$ is the depth of the transformer block, $0 < h \leq 3$ is the index of the head and $d = 64$ is the dimension of the embeddings of each head. This results in an attention map $M_k^h \in \mathbb{R}^{(T+1) \times (T+1)}$ for head $h$ at the $k$-th transformer block, where $T = 126$. In this section, we aim to explore the question: with a simple contrastive pretext task applied to the class token, can attention be guided toward musically meaningful positions in the sequence?
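Eq. (2) can be illustrated with a small standard-library sketch that builds one attention map from per-head queries and keys. The random queries and keys below are placeholders rather than activations of a trained model, and `attention_map` is a hypothetical helper name.

```python
import math
import random


def attention_map(queries, keys):
    """Compute softmax(Q K^T / sqrt(d)) row by row.

    queries, keys: lists of T+1 vectors of per-head dimension d.
    Returns a (T+1) x (T+1) list of lists whose rows sum to 1.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    rows = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Placeholder inputs with the shapes from the text: T = 126 tokens plus the
# class token, and d = 64 dimensions per head.
rng = random.Random(0)
T, d = 126, 64
Q = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T + 1)]
K = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T + 1)]
M = attention_map(Q, K)
print(len(M), len(M[0]))
```

Row $i$ of the result is a probability distribution over the tokens that token $i$ attends to, which is why a bright vertical line in a map marks a token that many queries attend to.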
We show the attention matrices from the 3rd, 9th and 12th layers for a 4-second polyphonic sample from RWC-Pop in Figure 2. Although we also study the 6th layer, it is omitted from the figure due to space limits. Similar properties are common across other samples. The value of $M_k^h$ at row $i$ and column $j$ represents the attention of token $z_k^i$ on $z_k^j$.
We observe that in the shallow layers of the attention maps, as seen by the presence of short vertical lines along the diagonals, attention is primarily distributed to neighbor tokens. For a given token, only nearby tokens receive attention. However, in deeper layers, vertical lines extend across the entire attention map, indicating that attention is distributed more uniformly across all tokens. For a given token, tokens across the whole sequence can receive attention. A transition from local to global attention is observed from shallower to deeper layers, as is also observed in transformers trained for sentence embeddings [43].
5.2 Alignment of attention maps with onset events

To quantitatively assess the emergence of temporal properties in attention maps, we use an attention head of an attention block to infer onset event timestamps. We use the MUS subset of the Midi-Aligned Piano Sounds (MAPS-MUS) dataset [44], which contains 30 polyphonic classical piano recordings with aligned MIDI annotations. We select this dataset and task because it is simple to build precise hypotheses and interpret attention maps when a single instrument is present and when time-aligned symbolic information is available. However, it remains polyphonic, ensuring that the task is still non-trivial.
Among the 4 layers we study, similar properties are shown across many heads from the 9th and 12th layers, and also across multiple models initialized differently. We choose the attention matrix of an attention head from the 9th layer, referred to as $M_{i,j}$ in the following. We exclude the class token, average the attention map per column, and obtain a pseudo-activation function $a(i) = \frac{1}{126} \sum_{j=1}^{126} M_{i,j}$, where $0 < i \leq 126$. This approach is motivated by the observation of vertical lines in deeper layers, indicating that tokens receiving higher attention are similar to all tokens. Figure 3 shows $M_{i,j}$ and $a(i)$ (left) and a ViT-1D at random initialization (right) for a specific sample. The attention map of the randomly initialized model exhibits a very narrow value range, and the activation function is almost flat, indicating that no meaningful attention was placed at the beginning of the training.
We use the peak picking function from SciPy [45] on $a(i)$ to obtain onset positions, and evaluate them with the F-score implemented in mir_eval [36], using a tolerance window of 70 ms. For the sake of comparison, we compare this method with the spectral flux implementation in librosa and with a ViT-1D at random initialization.

Figure 3: Attention matrices of a trained ViT-1D (top left) and a randomly initialized one (top right), with brighter color indicating higher attention. The scales differ between the top figures, as shown in the bottom plots, which display averaged attention matrices by column. The bottom left shows clear peaks, while the bottom right has similar maximum and minimum values, indicating that attention is evenly distributed at random initialization and becomes more focused on temporal positions during training.

            Att. map    Random    Spectral flux
F-score     0.877       0.501     0.720

Table 2: F-score of onset detection (MAPS-MUS dataset) after peak picking from an attention map of trained ViT-1D (left), compared with a ViT-1D at random initialization (center) and a feature engineering baseline (right).
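The onset-inference procedure can be sketched in a few lines. Two points are assumptions of this example rather than facts from the paper: the average is taken over queries, so that each value measures the attention a token receives (the column-wise reading of the text), and a plain local-maximum rule with a hypothetical `threshold` parameter stands in for the SciPy peak picker used by the authors.

```python
def pseudo_activation(attention):
    """Column-wise mean of an attention map after removing the class token.

    attention: (T+1) x (T+1) list of lists, class token at index 0.
    Returns T values; entry j is the attention received by sequence token j,
    averaged over all sequence-token queries.
    """
    seq = [row[1:] for row in attention[1:]]  # drop class-token row and column
    n = len(seq)
    return [sum(seq[i][j] for i in range(n)) / n for j in range(n)]


def pick_peaks(activation, threshold):
    """Indices of strict local maxima above a threshold (simplified stand-in
    for a library peak picker)."""
    return [
        i
        for i in range(1, len(activation) - 1)
        if activation[i] > threshold
        and activation[i] > activation[i - 1]
        and activation[i] > activation[i + 1]
    ]


def frames_to_seconds(frames, frame_rate_hz=31.5):
    """Convert token indices to timestamps at the encoder frame rate."""
    return [f / frame_rate_hz for f in frames]


# Toy 5 x 5 map in which every query attends mostly to the second sequence token.
toy_attention = [[0.2] * 5] + [[0.1, 0.1, 0.6, 0.1, 0.1] for _ in range(4)]
activation = pseudo_activation(toy_attention)
onsets = frames_to_seconds(pick_peaks(activation, threshold=0.3))
print(activation, onsets)
```

The resulting timestamps would then be compared with annotated onsets using a tolerance window, as done with mir_eval in the text.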
The comparison between the attention map and the spectral flux method shows a strong alignment between attention and onset events. Furthermore, the temporal properties useful for onset event detection do not appear at random initialization; rather, they emerge during training. A contrastive pretext task applied to the class token alone directs the attention from random to musically relevant positions, without the need for specific training to do so.
6. PROPERTIES IN SELF-SIMILARITY MATRICES OF TOKENS
6.1 Qualitative analysis

We extract intermediate tokens $[z_k^1, \ldots, z_k^T]$ at layers $k = 3, 6, 9, 12$ (same as Section 5, denoted $z_3$ to $z_{12}$), along with tokens from a randomly initialized ViT-1D model, denoted $z_r$. For each $z_k$, we compute a self-similarity matrix (SSM) $S_k[i, j] = \mathrm{sim}(z_k[i], z_k[j])$ using cosine similarity. Due to space limits, SSMs of the 6th and 9th layers are omitted but are available in the GitHub repository.
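A self-similarity matrix as defined above can be computed with the following standard-library sketch. The toy tokens are hypothetical, and `self_similarity_matrix` is an illustrative helper name rather than a function from the authors' repository.

```python
import math


def self_similarity_matrix(tokens):
    """S[i][j] = cosine similarity between token i and token j.

    tokens: list of T vectors (sequence tokens of one layer).
    """
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    n = len(tokens)
    return [
        [
            sum(a * b for a, b in zip(tokens[i], tokens[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy sequence alternating between two directions, like a regular pulse.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
for row in self_similarity_matrix(toy_tokens):
    print(row)
```

In this toy case, tokens two frames apart are identical up to scale, so the high values line up on a subdiagonal, which is the kind of pattern associated with rhythmic regularity in the analysis that follows. Runs of consecutive similar tokens would instead produce square blocks along the main diagonal.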
We show $S_3$, $S_{12}$, and $S_r$ on two audio samples of 4 seconds in Figure 4 ($S_6$ and $S_9$ are omitted due to space limits). Sample 1 (top row) is a monophonic song sample from the RWC Pop dataset, in which a clear melody line is present. Sample 2 (bottom row) is a sample from Ballroom, in which clear beats are played by percussive instruments. We observe several properties of tokens from different layers in Figure 4:
Randomly initialized model contains harmonic information. $S_r$ of sample 1 contains block structures that correspond to note events, suggesting the model captures harmonic features early on. For sample 2, it fails to capture rhythmic events, which should manifest as evenly spaced subdiagonal structures. Notably, we observe that $S_r$ closely resembles the SSM of the model's input, the mel-spectrograms. This observation suggests that more harmonic information is present than rhythmic information in $S_r$. The reason might be that the ViT-1D model incorporates skip connections between transformer blocks, which causes the output of a randomly initialized model to closely mirror the mel-spectrogram.

Figure 4: Self-similarity matrices (SSM) of tokens at different layers. Sample 1 (top row) is a melody-dominant audio sample. Sample 2 (bottom row) is a percussion-only sample. From left to right: layer 3, layer 12, and output of a randomly initialized ViT-1D. We observe a clearer block pattern on the SSM of layer 3 (top left), which signifies that harmonic properties are captured in the tokens, and clearer subdiagonals on the SSM of layer 12 (bottom middle), which signifies that rhythmic properties are more dominant.
Different layers encode different information. $S_3$ exhibits clearer block-like structures than $S_{12}$ or $S_r$ for sample 1, suggesting that similar harmonic frames are embedded by similar representations. In contrast, $S_{12}$, which corresponds to a deeper layer, reveals clearer subdiagonal structures than the others for sample 2. These subdiagonals are characteristic of rhythmic patterns that show the regularity of beats. Shifting from harmonic to rhythmic features could reflect the hierarchical nature of the model, where harmonic features like pitch are learned in shallow layers, while higher-level abstractions and rhythmic features are learned in the deeper layers.
6.2 Downstream results on stacked tokens

We aim to investigate 1) which properties emerge aside from those inherited from the mel-spectrograms; 2) whether tokens from intermediate transformer blocks are beneficial for downstream tasks.

To achieve this, we use the same downstream training schemes and datasets described in Section 3. Specifically, we evaluate $z_r$ and explore the effect of stacking intermediate tokens $[z_3, z_6, z_9, z_{12}]$ to form representations of dimension $192 \times 4 = 768$ for downstream tasks. As observed in the literature, stacking tokens from different layers helps with certain tasks, since redundancy of information exists within the same layer [43, 46, 47]. We believe that using a weighted sum of all intermediate layers could further boost performance; we leave that for future work.

              Tagging         Key      Beat      Chord
              mAP     ROC     W.Acc    F-score   Acc
Random        .273    .807    .487     .463      .290
TrainLast     .417    .896    .622     .723      .319
TrainStack    .437    .902    .639     .728      .422

Table 3: Downstream performance by using a randomly initialized ViT-1D, the last layer of a trained ViT-1D (Section 3), and a stack of tokens from 4 different layers (Section 6.2). The testing datasets and metrics are identical to Section 3.
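Stacking the tokens of several layers amounts to concatenating, for each time frame, the vectors produced by each selected layer. A minimal sketch under the assumption that each layer's tokens are stored as a list of T vectors of dimension 192 (`stack_layers` is a hypothetical helper name):

```python
def stack_layers(layer_tokens):
    """Concatenate token vectors from several layers along the feature axis.

    layer_tokens: list of per-layer sequences, each of shape [T][dim].
    Returns a sequence of shape [T][dim * n_layers].
    """
    return [
        [value for frame_vec in frame_across_layers for value in frame_vec]
        for frame_across_layers in zip(*layer_tokens)
    ]


# Placeholder tokens for layers 3, 6, 9 and 12 with the shapes from the text.
T, DIM = 126, 192
layers = [[[float(k)] * DIM for _ in range(T)] for k in (3, 6, 9, 12)]
stacked = stack_layers(layers)
print(len(stacked), len(stacked[0]))
```

The frame rate is unchanged by stacking; only the feature dimension seen by the linear probe grows from 192 to 768.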
Table 3 shows that the model at random initialization performs slightly worse than our trained ViT-1D model on chord estimation (0.290 against 0.319) and much worse on key estimation (0.487 against 0.622), but still has a reasonable performance. Harmonic information is by design embedded in mel-spectrograms and is transmitted through the skip connections. For music tagging and beat tracking, there is a significant performance gap between the randomly initialized and trained ViT-1D. This result is expected, as mel-spectrograms are not ideal representations for high-level musical concepts or rhythmic structures. Moreover, for the trained ViT-1D, we observe that stacking tokens together significantly improves performance in chord estimation (0.319 to 0.422) and, to a smaller but still notable extent, in key estimation (0.622 to 0.639). This indicates that shallow layers contribute to both global and local harmonic tasks. In contrast, performance on music tagging and beat tracking remains similar, suggesting that the features captured in shallower layers focus less on rhythmic structures and higher-level musical concepts.

The differences in features learned at various layers highlight that applying a contrastive pretext task only to the class token can lead to emergent properties in sequence tokens at different levels.
7. CONCLUSION
In this paper, we show the ability of a general-purpose contrastive pretext task paired with a transformer to learn local musical representations. Applying NT-Xent loss only to the class token in a lightweight ViT-1D surprisingly enables sequence tokens to handle local tasks while contributing to global ones. Despite the class token's time-invariance, weight sharing and attention mechanisms allow temporal musical representations to emerge.

By analyzing attention maps, we observe that onset events can be deduced. Self-similarity matrices show that tokens from different layers capture distinct musical dimensions. Stacking intermediate tokens improves performance on harmonic tasks, highlighting the importance of shallow-layer representations for downstream tasks.

We provide exploratory insights into the emergent properties of a transformer trained contrastively. Future work could further study emergent properties in all layers and whether similar emergent properties exist in supervised pretraining. Additionally, leveraging these properties in contrastive pretraining could lead to more efficient pretraining strategies.
8. REFERENCES
[1] V. Los anlen, “Con olu ional ope a o s in he ime-
equency domain,” Ph.D. disse a ion, École no male
supé ieu e, 2017.
[2] A. Riou, S. La ne , G. Hadje es, and G. Pee e s,
“Pes o: Pi ch es ima ion wi h sel -supe ised
ansposi ion-equi a ian objec i e,” in P oc. o
heIn e na ional Socie y o Music In o ma ion
Re ie al Con e ence (ISMIR), 2023.
[3] E. Quin on, “Equi a ian sel -supe ision o musical
empo es ima ion,” in P oc. o he In e na ional Socie y
o Music In o ma ion Re ie al Con e ence (ISMIR),
2022.
[4] Y. Kong, V. Los anlen, G. Mesegue -B ocal, S. Wong,
M. Lag ange, and R. Hennequin, “STONE: Sel -
supe ised onali y es ima o ,” P oc. o he In e na-
ional Socie y o Music In o ma ion Re ie al Con e -
ence (ISMIR), 2024.
[5] Y. Kong, G. Mesegue -B ocal, V. Los anlen, M. La-
g ange, and R. Hennequin, “S-key: Sel -supe ised
lea ning o majo and mino keys om audio,” in
ICASSP 2025 - IEEE In e na ional Con e ence on
Acous ics, Speech and Signal P ocessing (ICASSP),
2025, pp. 1–5.
[6] G. Mesegue -B ocal, R. Bi ne , S. Du and, and
B. B os , “Da a cleansing wi h con as i e lea ning o
ocal no e e en anno a ions,” in P oceedings o he
21s In e na ional Socie y o Music In o ma ion Re-
ie al Con e ence, 2020.
[7] Y. Ma, A. Øland, A. Ragni, B. M. Del Se e, C. Sai is,
C. Donahue, C. Lin, C. Plachou as, E. Bene os, E. Sha-
i e al., “Founda ion models o music: A su ey,”
a Xi p ep in a Xi :2408.14340, 2024.
[8] J. Spijke e and J. A. Bu goyne, “Con as i e lea n-
ing o musical ep esen a ions,” in P oc. o he In e -
na ional Socie y o Music In o ma ion Re ie al Con-
e ence (ISMIR), 2021.
[9] M. C. McCallum, F. Ko zeniowski, S. O amas,
F. Gouyon, and A. F. Ehmann, “Supe ised and un-
supe ised lea ning o audio ep esen a ions o music
unde s anding,” 2022.
[10] T. Chen, S. Ko nbli h, M. No ouzi, and G. E. Hin on,
“A simple amewo k o con as i e lea ning o isual
ep esen a ions,” CoRR, 2020.
[11] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[12] H. Zhao, C. Zhang, and B. Z. et al., “S3T: Self-supervised pre-training with swin transformer for music classification,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[13] C. Garoufis, A. Zlatintsi, and P. Maragos, “Multi-source contrastive learning from musical audio,” in Proc. of the Sound and Music Computing Conference (SMC), May 2023.
[14] M. C. McCallum, M. E. Davies, F. Henkel, J. Kim, and S. E. Sandberg, “On the effect of data-augmentation on local embedding properties in the contrastive learning of music audio representations,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.
[15] J. Choi, S. Jang, H. Cho et al., “Towards proper contrastive self-supervised learning strategies for music audio representation,” in 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2022, pp. 1–6.
[16] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.
[17] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent: Consistency autoencoders for latent audio compression,” in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2024.
[18] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg, R. Liu, W. Chen, G. Xia, Y. Shi, W. Huang, Y. Guo, and J. Fu, “Mert: Acoustic music understanding model with large-scale self-supervised training,” in Proc. of the International Conference on Learning Representations (ICLR), 2023.
[19] W.-N. Hsu, B. Bolte, Y.-S. Chuang et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[20] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, “Masked Modeling Duo: Towards a Universal Audio Pre-training Framework,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 32, pp. 2391–2406, 2024.
[21] M. Won, Y.-N. Hung, and D. Le, “A foundation model for music informatics,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.
[22] L. Wang, P. Luc, Y. Wu, A. Recasens, L. Smaira, A. Brock, A. Jaegle, J.-B. Alayrac, S. Dieleman, J. Carreira et al., “Towards learning universal audio representations,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 4593–4597.
[23] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, “Efficient training of audio transformers with patchout,” in Proc. Interspeech 2022, 2022, pp. 2753–2757.
[24] Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio Spectrogram Transformer,” in Proc. of Interspeech, 2021, pp. 571–575.
[25] I. Manco, E. Benetos, E. Quinton, and G. Fazekas, “Contrastive audio-language learning for music,” in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2022.
[26] Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and D. P. Ellis, “MuLan: A joint embedding of music audio and natural language,” in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2022.
[27] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
[28] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proc. of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
[29] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supervision,” 2023.
[30] G. Meseguer-Brocal, D. Desblancs, and R. Hennequin, “An experimental comparison of multi-view self-supervised methods for music tagging,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.
[31] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging,” in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2009, pp. 213–218.
[32] J. Lee, J. Park, K. L. Kim, and J. Nam, “Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms,” 2017.
[33] S. Wong and G. Hernandez, “Fmak: A dataset of key and mode annotations for the free music archive–extended abstract,” in Proc. of the International Society for Music Information Retrieval Late-Breaking/Demo Session (ISMIR-LBD), 2023.
[34] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “Fma: A dataset for music analysis,” in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2017.
[35] P. Knees, Á. Faraldo, P. Herrera, R. Vogl, S. Böck, F. Hörschläger, and M. Le Goff, “Two data sets for tempo estimation and key detection in electronic dance music annotated from user corrections,” in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2015.
[36] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, “Mir_eval: A transparent implementation of common mir metrics,” in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), vol. 10, 2014, p. 2014.
[37] F. Gouyon and S. Dixon, “A review of rhythm description systems,” in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2004.
[38] U. Marchand, Q. Fresnel, and G. Peeters, “Gtzan-rhythm: Extending the gtzan test-set with beat, downbeat and swing annotations,” in Proc. of the International Conference on Music Information Retrieval Late-breaking/Demo (ISMIR-LBD), 2015.
[39] F. Krebs, S. Böck, and G. Widmer, “An efficient state-space model for joint tempo and meter tracking,” in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2015, pp. 72–78.
[40] C. Weiß, F. Zalkow, V. Arifi-Müller, M. Müller, H. V. Koops, A. Volk, and H. G. Grohganz, “Schubert winterreise dataset: A multimodal scenario for music analysis,” Journal on Computing and Cultural Heritage (JOCCH), vol. 14, no. 2, pp. 1–18, 2021.
[41] M. Goto and H. Hashiguchi, “Rwc music database: Popular, classical, and jazz music databases,” in Proc. of the International Conference on Music Information Retrieval (ISMIR), 2002.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[43] B. Wang and C.-C. J. Kuo, “Sbert-wk: A sentence embedding method by dissecting bert-based word models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2146–2157, 2020.
[44] V. Emiya, R. Badeau, and B. David, “Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2009.
[45] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python,” Nature Methods, vol. 17, pp. 261–272, 2020.
[46] A. Simoulin and B. Crabbé, “How many layers and why? An analysis of the model depth in transformers,” in Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop. Association for Computational Linguistics, Aug. 2021, pp. 221–228.
[47] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning, “What does bert look at? an analysis of bert’s attention,” in BlackBoxNLP@ACL, 2019.