EMERGENT MUSICAL PROPERTIES OF A TRANSFORMER UNDER
CONTRASTIVE SELF-SUPERVISED LEARNING
Yuexuan Kong1,2, Gabriel Meseguer-Brocal1, Vincent Lostanlen2,
Mathieu Lagrange2, Romain Hennequin1
1 Deezer Research, Paris, France
2 Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
ABSTRACT
In music information retrieval (MIR), contrastive self-supervised
learning for general-purpose representation models is effective for
global tasks such as automatic tagging. However, for local tasks such
as chord estimation, it is widely assumed that contrastively trained
general-purpose self-supervised models are inadequate and that more
sophisticated SSL is necessary; e.g., masked modeling. Our paper
challenges this assumption by revealing the potential of contrastive
SSL paired with a transformer in local MIR tasks. We consider a
lightweight vision transformer with one-dimensional patches in the
time–frequency domain (ViT-1D) and train it with simple contrastive SSL
through normalized temperature-scaled cross-entropy loss (NT-Xent).
Although NT-Xent operates only over the class token, we observe that,
potentially thanks to weight sharing, informative musical properties
emerge in ViT-1D's sequence tokens. On global tasks, the temporal
average of class and sequence tokens offers a performance increase
compared to the class token alone, showing useful properties in the
sequence tokens. On local tasks, sequence tokens perform unexpectedly
well, despite not being specifically trained for them. Furthermore,
high-level musical features such as onsets emerge from layer-wise
attention maps, and self-similarity matrices show that different layers
capture different musical dimensions. Our paper does not focus on
improving performance but advances the musical interpretation of
transformers and sheds light on some overlooked abilities of
contrastive SSL paired with transformers for sequence modeling in MIR.
1. INTRODUCTION
We may categorize tasks in music information retrieval (MIR) as either
local or global. Global tasks, such as music tagging and key
estimation, are time-shift invariant and require a single prediction
per piece of music. Local tasks, such as beat tracking and chord
estimation, are time-shift equivariant and require frame-wise
predictions, with a frame rate typically higher than 1 Hz [1].

Licensed under a Creative Commons Attribution 4.0 International License
(CC BY 4.0). Attribution: "Emergent musical properties of a transformer
under contrastive self-supervised learning", in Proc. of the 26th Int.
Society for Music Information Retrieval Conf., Daejeon, South Korea,
2025.
To address these tasks, self-supervised learning (SSL) has recently
emerged as a powerful alternative to supervised learning in MIR. SSL
enables a model to learn informative representations through a pretext
task without requiring labeled data. While these pretext tasks may not
have direct practical relevance, solving them requires the model to
capture one or various musical dimensions [2–7]. In general-purpose
models, these learned representations are then useful for many
different downstream tasks, requiring only a small amount of
supervision.
In general-purpose SSL for MIR, CLMR [8] and MULE [9] marked a first
step forward, following the adoption of contrastive pretext tasks in
computer vision [10, 11]. In contrastive learning, the loss encourages
the model to project positive pairs close together in the embedding
space and to push negative samples far apart. Their results showed the
potential of contrastive SSL to generalize across various global music
tasks. However, due to the properties of convolutional neural networks
and global pooling layers, both models capture global music
representations that summarize the entire sequence rather than
preserving information at each time step. Further general-purpose SSL
research built on contrastive pretext tasks by using a momentum-based
paradigm [12], combining different musical stems [13], analyzing
transformation in embedding space [14], and developing more effective
training strategies [15]. The aforementioned papers only evaluate their
systems on global tasks. In contrast, the potential of general-purpose
contrastive SSL on local tasks remains understudied.
Generative modeling and masked modeling are widely used for
general-purpose SSL at the frame level. Generative models such as
Jukebox [16] and Music2Latent [17] have shown, through evaluation on
multiple MIR tasks, that various musical dimensions are captured in the
embedding space they use for generation. MERT [18] is a music
representation learning model that resembles the masking training
scheme of HuBERT [19] from speech. M2D [20] employs a joint-embedding
predictive architecture (JEPA) that jointly predicts from both a masked
sample and the original sample. MusicFM [21] conducts a comparative
study of different masked modeling approaches. Although these models
handle both global and local tasks well, they have two shortcomings.
First, large-scale architectures are necessary: the number of
parameters typically ranges from 58M (Music2Latent) to 5B (Jukebox).
Second, training these models depends on sophisticated techniques such
as exponential moving averages, teacher–student distillation, and
multiple loss functions, requiring careful fine-tuning of
hyperparameters and large computational resources.

[Figure 1 diagram. Left, contrastive pretext task: View 1 and View 2
are patched, summed with a positional embedding, and passed through
ViT-1D, which outputs a class token (index 0) and sequence tokens
(indices 1 to T); NT-Xent is applied to the class token. Right,
downstream tasks: an MLP on the sequence tokens (Seq) for local tasks
(beat tracking, chord estimation); an MLP on the class token (Cls) or
the average (Avg) for global tasks (tagging, key estimation).]
Figure 1: Contrastive pre-training and probing for downstream tasks.
The inputs (left) are mel-spectrograms, patched along vertical slices
of frequency bins per time frame. Positional encoding and a class token
(learnable parameters with the average of the sequence tokens) are
added. NT-Xent loss is applied only to the class token. For downstream
tasks, sequence tokens (Seq) (excluding the class token) are used for
local tasks, while the class token (Cls) or the average of all tokens
(Avg) is used for global tasks.
Transformers have been applied to contrastive pretext tasks [22,23]
using the AST architecture [24] and to multimodal audio-text learning
[25, 26]. In these cases, contrastive loss is applied only to the class
token; i.e., a learnable token attached at the beginning of the
sequence. Optimization of the loss brings paired audio-audio or
audio-text samples close together in the embedding space. Computer
vision researchers have reported emergent properties when training
Vision Transformers (ViTs) [27]. Crucially, such properties do not
emerge through supervised pretraining [28]. This approach has proven
valuable, not only for global tasks such as classification, but also
for local tasks such as image segmentation [28,29]. Attention maps are
also studied to show the emergent local properties, providing insights
into the local patterns and features that are learned during training.
However, to our knowledge, these emergent properties in transformer
tokens have not yet been explored on local tasks for music.
In general-purpose SSL, we notice a gap between contrastive SSL and
masked modeling in MIR, particularly regarding the ability of a
contrastive pretext task to capture both global and local properties.
This gap leads us to the following questions: have we moved on too
quickly from contrastive SSL to more complex approaches? Does it still
hold untapped potential when paired with a transformer? To answer them,
we proceed in the following ways:
Pretext task. We use a lightweight ViT with 1-D spectrogram patches as
    token inputs (ViT-1D). We train ViT-1D with a normalized
    temperature-scaled cross-entropy loss (NT-Xent) applied only to the
    class token of positive and negative pairs (Section 2).
Downstream tasks. We evaluate the effectiveness of both the class token
    and sequence tokens on local and global downstream tasks. While the
    class token is time-invariant due to the pretext task formulation,
    we show that sequence tokens capture local musical properties
    (Section 3).
Emergent properties. To understand how local properties are captured,
    we conduct qualitative and quantitative analyses of attention maps
    (Section 5) and self-similarity matrices (Section 6). 1
2. CONTRASTIVE PRETEXT TASK
Patching details: We compute the mel-frequency spectrogram of a segment
of duration equal to $d = 4$ seconds, obtaining matrices $x$, with 128
frequency bins and a frame rate of $\xi = 31.5$ Hz. Unlike standard
ViT, which uses 2D patches, we extract 1D patches by taking all 128 mel
bins from a single frame and apply one convolutional layer $p$,
projecting into an embedding of size $(H_p, W_p) = (192, 1)$ for each
patch, with $x_p = p(x)$ and $x_p \in \mathbb{R}^{H_p \times W_p}$. By
using 1D patches, each patch is directly connected to all the frequency
bins in a time frame. We obtain the patch sequence $x_p$ as
$[x_p^1, x_p^2, \ldots, x_p^T]$ with $T = d\xi = 126$, where each patch
corresponds to one time frame in the mel-spectrogram. This sequence
input to a transformer is commonly referred to as sequence tokens [27].
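As a rough illustration of the 1D patching above, the sketch below maps each time frame (all mel bins) to one token. It assumes the convolution kernel spans all 128 mel bins of a single frame, which makes it equivalent to a per-frame linear map; the function name `frames_to_tokens`, the toy input, and the constant weights are hypothetical and only serve to check shapes.

```python
def frames_to_tokens(mel, weights, bias):
    """Project each time frame (all mel bins) of `mel` to one token.

    mel:     list of T frames, each a list of n_mels floats
    weights: embed_dim rows, each a list of n_mels floats (hypothetical values)
    bias:    list of embed_dim floats (hypothetical values)
    """
    tokens = []
    for frame in mel:
        # One output channel per row of `weights`: a dot product over mel bins.
        token = [sum(w * x for w, x in zip(row, frame)) + b
                 for row, b in zip(weights, bias)]
        tokens.append(token)
    return tokens


duration_s, frame_rate_hz = 4, 31.5
n_frames = int(duration_s * frame_rate_hz)  # T = d * xi
n_mels, embed_dim = 128, 192

# Toy spectrogram and averaging weights, used only to verify the shapes.
mel = [[0.01 * (t + f) for f in range(n_mels)] for t in range(n_frames)]
weights = [[1.0 / n_mels] * n_mels for _ in range(embed_dim)]
bias = [0.0] * embed_dim
tokens = frames_to_tokens(mel, weights, bias)
```

With these settings the sequence has 126 tokens of dimension 192, one per time frame, before the class token and positional encoding are added.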
Encoder architecture: We use the original ViT implementation in its
smallest version as encoder (with our patching method), with an
embedding dimension of 192, 12 transformer blocks and 3 attention
heads. Unlike what is commonly done in SSL, we attach no disposable
projection head to the transformer encoder, in order to focus on the
emergent properties purely in the transformer. This possibly reduces
the overall performance of the model, since adding such a head during
pretext training benefits downstream tasks [10]. We denote our encoder
by $f_e$. We prepend a class token, composed of learnable parameters
and the average of the other tokens, at the beginning of $x_p$. Then, a
2D sinusoidal positional encoding on the frequency and time dimensions
is added to all patches including the class token, obtaining the final
input tokens of $f_e$ as $[z_0^0, z_0^1, \ldots, z_0^T]$ with
$T = 126$. We define the output token of transformer block $k$ at time
$t$ as $z_k^t$, where $0 < k \le 12$ and $0 \le t \le T$. Notably, the
class token ($t = 0$) is processed in the same manner as sequence
tokens: it shares weights with them in the multi-layer perceptron
layers following the attention block. This design allows the class
token to integrate and summarize information [27]. The output of the
model is $[z_L^0, z_L^1, \ldots, z_L^T]$, where $z_L^0$ is the class
token and $L = 12$.

1 Code, checkpoint and more examples of Sections 5 and 6 can be found
at https://github.com/deezer/emergent-musical-properties-transformer/tree/main.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Table 1: Downstream performance comparison of different models on two
global tasks (music tagging and key estimation) and two local tasks
(beat detection and chord estimation). The row Cls corresponds to
probing with the class token while the row Avg refers to probing with
the average of all tokens. Seq refers to probing only with sequence
tokens.

                                     GLOBAL                  LOCAL
Model           #Param  Dim   Probe  Tagging         Key     Beat     Chord
                                     mAP     ROC     W.Acc   F-score  Acc
ViT-1D          5.3M    192   Cls    0.400   0.888   0.509   -        -
                              Avg    0.417   0.896   0.622   -        -
                              Seq    -       -       -       0.723    0.319
CLMR-like [30]  2.8M    1024  -      0.427   0.898   0.459   0.313    0.148
M2D [20]        89M     3840  -      0.479   0.918   0.531   0.794    0.322
Normalized temperature-scaled cross-entropy loss (NT-Xent). For each
piece of music, we extract two disjoint segments A and B of four
seconds each as a pair of positive samples. No data augmentation is
applied. All other segments from the same batch are negative samples.
To study the emerging properties in the sequence tokens, we apply the
loss only on the class tokens [22, 25, 26], defining our loss function
for each pair as:

$$
\mathcal{L}_{A,B}(f_e) = -\log
\frac{\exp\left(\mathrm{sim}(z_{L,A}^0, z_{L,B}^0)/\tau\right)}
     {\sum_{k \neq B} \exp\left(\mathrm{sim}(z_{L,A}^0, z_{L,k}^0)/\tau\right)}
\tag{1}
$$

where $z_{L,A}^0$ and $z_{L,B}^0$ are the class tokens of the positive
pair of segments A and B, sim is the cosine similarity function, and
$\tau = 0.1$ is the temperature parameter.
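To make the loss concrete, here is a minimal, dependency-free sketch of NT-Xent for a single pair of class tokens. It is not the authors' implementation. The function names are hypothetical, and the denominator follows the common NT-Xent convention of summing over every batch element other than the anchor. That convention is an assumption and may differ from the index set used in Eq. (1).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent(cls_tokens, anchor, positive, tau=0.1):
    """NT-Xent loss for one (anchor, positive) pair of class tokens.

    Assumption: the denominator runs over every segment in the batch
    except the anchor itself (standard NT-Xent convention).
    """
    z_a = cls_tokens[anchor]
    logits = {k: cosine(z_a, z_k) / tau
              for k, z_k in enumerate(cls_tokens) if k != anchor}
    log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
    # -log(exp(l_pos) / sum_k exp(l_k)) = log(sum_k exp(l_k)) - l_pos
    return log_denominator - logits[positive]


# Toy batch of 2-D "class tokens": token 1 is close to token 0, token 2 is not.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
loss_close = nt_xent(batch, anchor=0, positive=1)
loss_far = nt_xent(batch, anchor=0, positive=2)
```

The loss is small when the positive is the most similar element in the batch (`loss_close`) and large when it is not (`loss_far`), which is the behavior the pretext task rewards.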
Pre-training details. We pretrain on a subset of Deezer's catalog of
music, with a batch size of 256 pairs of 4-second segments, a base
learning rate of $3 \times 10^{-4}$ with a cosine decay down to
$5 \times 10^{-7}$, and train for 300 epochs.
3. DOWNSTREAM TASKS
We focus on two types of downstream tasks, commonly used in
general-purpose SSL for MIR. We select music tagging and key estimation
as representative global tasks and we choose beat tracking and chord
estimation as examples of local tasks. Good performance on these four
tasks requires the model to encode both harmonic and rhythmic
representations, as well as high-level musical concepts, at both local
and global levels.
As discussed in Section 1, to the best of our knowledge, no prior study
has explored, on local tasks, the emerging properties of sequence
tokens in a transformer trained with a contrastive learning framework.
As a matter of fact, testing on local tasks may seem counterintuitive:
positive samples are simply two segments from the same piece of music,
without any explicit alignment of beats or chords, so the class tokens
are trained to be time-shift invariant. One might therefore not expect
local musical information in the token sequence, since no constraint
encourages it. However, our results challenge this assumption, showing
that meaningful local representations do emerge despite the lack of
direct supervision at the frame level.
3.1 Music tagging
Datasets. We use MagnaTagATune [31] with the split proposed by Lee et
al. [32].
Training methods. We compare two different probing methods that both
use a single linear layer: 1) Probe only on the class token $z_L^0$
(referred to as Cls in Figure 1 and in Table 1); 2) Probe on the
average of the entire token sequence $[z_L^0, z_L^1, \ldots, z_L^T]$,
including the class token (referred to as Avg in Figure 1 and in
Table 1).
Metrics. We use the area under the receiver operating characteristic
curve (ROC-AUC) and mean average precision (mAP) in their
macro-aggregated versions.
3.2 Music key estimation
Datasets. We use FMAKv2 [4] with a 9:1 split between training and
validation. FMAKv2, a derivative of FMAK [33], contains 5489 songs from
the Free Music Archive [34], spanning multiple genres. We test on Giant
Steps [35], a dataset of 604 electronic dance music tracks.
Training methods. We compare the same two training methods (Avg and
Cls) as in music tagging.
Metrics. We use the weighted accuracy from mir_eval [36], which assigns
partial credit to some key prediction errors.
3.3 Beat tracking
Datasets. We use the Ballroom dataset [37], which contains 698 ballroom
songs, with a 9:1 split between training and validation. For testing,
we use GTZAN Rhythm [38], which includes beat annotations for 998 songs
across 10 genres.
Training methods. We exclude the class token and use only the sequence
tokens $[z_L^1, \ldots, z_L^T]$, referred to as Seq in Figure 1 and
Table 1. Since beat tracking typically requires a higher frame rate
than 31.5 Hz, we attach two independent heads to each $z_L^t$, doubling
the frame rate to 63 Hz. Additionally, we apply a standard smoothing
method for beat tracking, where we increase the target values of the
two frames neighboring each beat to 0.5 instead of 0.
Metrics. We apply a Dynamic Bayesian Network (DBN) for post-processing
to obtain beat locations [39]. We use the F-score with a tolerance
window of 70 ms as evaluation metric, from the mir_eval package [36].
3.4 Chord estimation
Datasets. We collect 124 songs from the Real World Computing Pop
(RWC-POP) and Schubert Winterreise Dataset (SWD) [40,41], limited to
one performance per song. We apply an 8:1:1 split between training,
validation, and test. We consider 24 classes of major and minor chords,
exclude those that cannot be mapped to these classes (e.g., suspended
chords), and include a "no chord" class, resulting in 25 classes. The
chord vocabulary is the same as used in MusicFM [21].
Training methods. We exclude the class token and use only the sequence
tokens $[z_L^1, \ldots, z_L^T]$. The frame rate of the encoder is
sufficient for chord estimation; therefore, only a single linear layer
is used.
Metrics. We use frame-level accuracy over 25 classes.
4. RESULTS ON DOWNSTREAM TASKS
Using the frozen output of the pretrained ViT-1D as input to a
trainable linear layer for each task, we study whether sequence tokens
capture local properties, despite the class token's time-invariance. We
also assess their contribution to global tasks. We compare this to two
reference models, pretrained with contrastive learning and masked
modeling, evaluating them on the same downstream tasks. It is important
to note that during pretext training, our model does not include a
projection head after the backbone, which is a common technique used to
boost performance on downstream tasks [8, 9, 30]. However, we omit it
in order to study more directly the emergent properties of the
transformer backbone.
CLMR [8] is a general-purpose contrastive framework introduced for
musical representation learning. However, CLMR is trained using pitch
shift as data augmentation for positive samples, making it invariant to
pitch shift and resulting in low performance in tonality-related tasks.
Therefore, for a fairer comparison, we use a CLMR-like contrastively
trained ResNet [30], without any data augmentation, trained on the same
dataset as ViT-1D, with the same sampling rate and audio length. For
both local tasks, we upsample its resolution from 0.25 Hz to the
resolution of ViT-1D by attaching the necessary number of linear
layers. It is important to note that this upsampling process results in
many more parameters for downstream training than for ViT-1D, which
uses only one linear layer for chord estimation and two for beat
tracking.
M2D [20] employs a JEPA architecture, which combines masked modeling
with a teacher–student framework trained on general audio. Among the
masked modeling models cited in Section 1, it has the fewest
parameters. We use its results as a reference. It produces frame-wise
predictions at a rate of 6.3 Hz. To adapt it to beat tracking and chord
estimation, we train multiple independent linear layers to upsample to
the same frame rate as ViT-1D.
Figure 2: Attention matrices from the 3rd, 9th, and 12th transformer
blocks (left to right). Lighter colors at position [i, j] indicate more
attention from token i to token j. Diagonal lines in the left figure
show local attention, while vertical lines across the map in the right
figure indicate a shift to global attention in deeper layers.
We observe two key findings from Table 1. 1) Sequence tokens show
better performance than the CLMR-like model and comparable results to
M2D on local tasks (beat tracking F-score of 0.723 versus 0.313 and
0.794; chord accuracy of 0.319 versus 0.148 and 0.322). This suggests
the emergence of local and temporal musical representations, in
contrast to the time-invariant nature of the class token. The music
tagging performance lags behind CLMR-like and M2D; however, both models
have a much larger embedding dimension, and the CLMR-like model uses a
projection head after the backbone. 2) For global tasks, performance
improves when averaging the class token and all sequence tokens
together (e.g., key estimation weighted accuracy rises from 0.509 to
0.622). This implies that the information encoded in sequence tokens is
not entirely captured by the class token alone and that incorporating
sequence tokens contributes positively to global tasks.
Local musical properties in the sequence tokens improve performance on
global downstream tasks and yield unexpectedly good results for local
tasks. This shows that, despite ViT-1D being trained with NT-Xent loss
only on the class token and the positive sampling strategy making it
time-invariant, useful local musical properties still emerge in the
sequence tokens. This motivates a further analysis of the emergent
properties across different transformer layers, through attention maps
(Section 5) and self-similarity matrices (Section 6).
5. PROPERTIES IN ATTENTION MAPS
We study the emergent properties of tokens in the transformer across
different layers. ViT-1D has 12 layers in total. We select the 3rd,
6th, 9th, and 12th layers as representative points, as they are evenly
spaced from shallower to deeper layers. A more comprehensive analysis
of all 12 layers, as well as the potential performance gains from
leveraging all layers, is left for future work.
5.1 Qualitative analysis of attention maps
The attention mechanism directs attention to meaningful token positions
during training. Attention maps are calculated via scaled dot-product
self-attention [42]:

$$
M_k^h(Q_k^h, K_k^h) = \mathrm{softmax}\left(\frac{Q_k^h \, {K_k^h}^{\top}}{\sqrt{d}}\right)
\tag{2}
$$

where $0 < k \le L = 12$ is the depth of the transformer block,
$0 < h \le 3$ is the index of the head and $d = 64$ is the dimension of
the embeddings of each head. This results in an attention map
$M_k^h \in \mathbb{R}^{(T+1) \times (T+1)}$ for head $h$ at the $k$th
transformer block, where $T = 126$. In this section, we aim to explore
the question: with a simple contrastive pretext task applied to the
class token, can attention be guided toward musically meaningful
positions in the sequence?
We show the attention matrices from the 3rd, 9th and 12th layers for a
4-second polyphonic sample from RWC-Pop in Figure 2. Although we also
study the 6th layer, it is omitted from the figure due to space limits.
Similar properties are common across other samples. The value of
$M_k^h$ at row $i$ and column $j$ represents the attention of token
$z_k^i$ on $z_k^j$.
We observe that in the shallow layers of the attention maps, as seen by
the presence of short vertical lines along the diagonals, attention is
primarily distributed to neighbor tokens. For a given token, only
nearby tokens receive attention. However, in deeper layers, vertical
lines extend across the entire attention map, indicating that attention
is distributed more uniformly across all tokens. For a given token,
tokens across the whole sequence can receive attention. A transition
from local to global attention is observed from shallower to deeper
layers, as is also observed in transformers trained for sentence
embeddings [43].
5.2 Alignment of attention maps with onset events
To quantitatively assess the emergence of temporal properties in
attention maps, we use an attention head of an attention block to infer
onset event timestamps. We use the MUS subset of the MIDI-Aligned Piano
Sounds (MAPS-MUS) dataset [44], which contains 30 polyphonic classical
piano recordings with aligned MIDI annotations. We select this dataset
and task because it is simple to build precise hypotheses and interpret
attention maps when a single instrument is present and when
time-aligned symbolic information is available. However, it remains
polyphonic, ensuring that the task is still non-trivial.
Among the 4 layers we study, similar properties are shown across many
heads from the 9th and 12th layers, and also across multiple models
initialized differently. We choose the attention matrix of an attention
head from the 9th layer, referred to as $M_{i,j}$ in the following. We
exclude the class token, average the attention map per column, and
obtain a pseudo-activation function
$a(i) = \frac{1}{126} \sum_{j=1}^{126} M_{j,i}$, where
$0 < i \le 126$; that is, $a(i)$ is the mean attention that token $i$
receives from all tokens. This approach is motivated by the observation
of vertical lines in deeper layers, indicating that tokens receiving
higher attention are similar to all tokens.
Figure 3 shows $M_{i,j}$ and $a(i)$ for a trained ViT-1D (left) and for
a ViT-1D at random initialization (right) on a specific sample. The
attention map of the randomly initialized model exhibits a very narrow
value range, and the activation function is almost flat, indicating
that no meaningful attention was placed at the beginning of training.
We use the peak picking function from SciPy [45] on $a(i)$ to obtain
onset positions, and evaluate them with the F-score as implemented in
mir_eval [36], with a tolerance window of 70 ms. For the sake of
comparison, we compare this method with the spectral flux
implementation in librosa and with a ViT-1D at random initialization.

Figure 3: Attention matrices of a trained ViT-1D (top left) and a
randomly initialized one (top right), with brighter color indicating
higher attention. The scales differ between the top figures, as shown
in the bottom plots, which display attention matrices averaged by
column. The bottom left shows clear peaks, while the bottom right has
similar maximum and minimum values, indicating that attention is evenly
distributed at random initialization and becomes more focused on
temporal positions during training.

Table 2: F-score of onset detection (MAPS-MUS dataset) after peak
picking from an attention map of trained ViT-1D (left), compared with a
ViT-1D at random initialization (center) and a feature engineering
baseline (right).

          ATT. MAP  RANDOM  SPECTRAL FLUX
F-SCORE   0.877     0.501   0.720
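The onset read-out described in this section can be sketched in a few lines. This is a simplified stand-in, not the evaluated pipeline: `pick_peaks` is a hypothetical strict-local-maximum picker used in place of SciPy's peak picking, the `min_height` threshold is an assumed parameter, and the 6-token attention map is toy data with the class token already removed.

```python
def column_average(attn):
    """Average a square attention map over rows: one value per column,
    i.e. the mean attention each token receives from all tokens."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) / n for j in range(n)]


def pick_peaks(activation, min_height):
    """Indices of strict local maxima that reach at least `min_height`."""
    return [i for i in range(1, len(activation) - 1)
            if activation[i] >= min_height
            and activation[i] > activation[i - 1]
            and activation[i] > activation[i + 1]]


def frames_to_seconds(frames, frame_rate_hz=31.5):
    """Convert token indices to timestamps at the encoder frame rate."""
    return [f / frame_rate_hz for f in frames]


# Toy map: every token attends mostly to tokens 1 and 4 (vertical lines).
attn = [[0.05, 0.40, 0.05, 0.05, 0.40, 0.05] for _ in range(6)]
activation = column_average(attn)
onset_frames = pick_peaks(activation, min_height=0.2)
onset_times = frames_to_seconds(onset_frames)
```

The resulting timestamps could then be scored against reference onsets with an F-measure and a 70 ms tolerance window.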
The comparison between the attention map and the spectral flux method
(F-score of 0.877 versus 0.720) shows a strong alignment between
attention and onset events. Furthermore, the temporal properties useful
for onset event detection do not appear at random initialization
(F-score of 0.501); rather, they emerge during training. A contrastive
pretext task applied to the class token alone directs the attention
from random to musically relevant positions, without the need for
specific training to do so.
6. PROPERTIES IN SELF-SIMILARITY MATRICES
OF TOKENS
6.1 Qualitative analysis
We extract intermediate tokens $[z_k^1, \ldots, z_k^T]$ at layers
$k = 3, 6, 9, 12$ (same as Section 5, denoted $z_3$ to $z_{12}$), along
with tokens from a randomly initialized ViT-1D model, denoted $z_r$.
For each $z_k$, we compute a self-similarity matrix (SSM)
$S_k[i, j] = \mathrm{sim}(z_k[i], z_k[j])$ using cosine similarity. Due
to space limits, SSMs of the 6th and 9th layers are omitted but are
available in the GitHub repository.
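For reference, a self-similarity matrix of this kind can be computed as in the minimal sketch below. The function name `self_similarity` and the three toy tokens are hypothetical, and the code assumes every token has non-zero norm.

```python
import math


def self_similarity(tokens):
    """Cosine self-similarity matrix S[i][j] = sim(tokens[i], tokens[j]).

    Assumes every token has non-zero norm.
    """
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    unit = [[x / n for x in tok] for tok, n in zip(tokens, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


# Tokens 0 and 1 point in the same direction; token 2 is orthogonal to both.
tokens = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
ssm = self_similarity(tokens)
```

Frames with similar content produce bright blocks in such a matrix, while periodic content produces bright subdiagonals.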
We show $S_3$, $S_{12}$ and $S_r$ on two audio samples of 4 seconds in
Figure 4. Sample 1 (top row) is a monophonic song sample from the RWC
Pop dataset in which a clear melody line is present. Sample 2 (bottom
row) is a sample from Ballroom in which clear beats are played by
percussive instruments. We observe several properties of tokens from
different layers in Figure 4:
Randomly initialized model contains harmonic information. $S_r$ of
sample 1 contains block structures that correspond to note events,
suggesting the model captures harmonic features early on. For sample 2,
it fails to capture rhythmic events, which should manifest as evenly
spaced subdiagonal structures. Notably, we observe that $S_r$ closely
resembles the SSM of the model's input, the mel-spectrograms. This
observation suggests that more harmonic information is present than
rhythmic information in $S_r$. The reason might be that the ViT-1D
model incorporates skip connections between transformer blocks, which
causes the output of a randomly initialized model to closely mirror the
mel-spectrogram.

Figure 4: Self-similarity matrices (SSM) of tokens at different layers.
Sample 1 (top row) is a melody-dominant audio sample. Sample 2 (bottom
row) is a percussion-only sample. From left to right: layer 3, layer 12
and output of a randomly initialized ViT-1D. We observe a clearer block
pattern on the SSM of layer 3 (top left), which signifies that harmonic
properties are captured in the tokens, and clearer subdiagonals on the
SSM of layer 12 (bottom middle), which signifies that rhythmic
properties are more dominant.
Different layers encode different information. $S_3$ exhibits clearer
block-like structures than $S_{12}$ or $S_r$ for sample 1, suggesting
that similar harmonic frames are embedded by similar representations.
In contrast, $S_{12}$, which corresponds to a deeper layer, reveals
clearer subdiagonal structures than the others for sample 2. These
subdiagonals are characteristic of rhythmic patterns and show the
regularity of beats. Shifting from harmonic to rhythmic features could
reflect the hierarchical nature of the model, where harmonic features
like pitch are learned in shallow layers, while higher-level
abstractions and rhythmic features are learned in the deeper layers.
6.2 Downstream results on stacked tokens
We aim to investigate 1) which properties emerge aside from those
inherited from the mel-spectrograms; 2) whether tokens from
intermediate transformer blocks are beneficial for downstream tasks.
To achieve this, we use the same downstream training schemes and
datasets described in Section 3. Specifically, we evaluate $z_r$ and
explore the effect of stacking intermediate tokens
$[z_3, z_6, z_9, z_{12}]$ to form representations of dimension
$192 \times 4 = 768$ for downstream tasks. As observed in the
literature, stacking tokens from different layers helps with certain
tasks, since redundancy of information exists within the same layer
[43,46,47]. We believe using a weighted sum of all intermediate layers
could further boost performance; we leave that to future work.

Table 3: Downstream performance using a randomly initialized ViT-1D,
the last layer of a trained ViT-1D (Section 3), and stacked tokens of 4
different layers (Section 6.2). The testing datasets and metrics are
identical to Section 3.

              TAGGING        KEY     BEAT     CHORD
              MAP    ROC     W.ACC   F-SCORE  ACC
RANDOM        .273   .807    .487    .463     .290
TRAIN LAST    .417   .896    .622    .723     .319
TRAIN STACK   .437   .902    .639    .728     .422
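The stacking of intermediate tokens can be read as a per-frame concatenation along the feature axis, which is what yields the 192 × 4 = 768 dimension. The sketch below illustrates that reading with a hypothetical helper `stack_layers` and constant toy tokens; it is an interpretation of the text, not the authors' code.

```python
def stack_layers(layer_tokens):
    """Concatenate per-frame tokens from several layers along the feature axis.

    layer_tokens: list of layers, each a list of T tokens (lists of floats).
    Returns T tokens whose dimension is the sum of the layer dimensions.
    """
    n_frames = len(layer_tokens[0])
    assert all(len(layer) == n_frames for layer in layer_tokens)
    return [[x for layer in layer_tokens for x in layer[t]]
            for t in range(n_frames)]


T, dim = 126, 192
# Toy tokens: every value in layer k equals k, to make the layout visible.
layers = [[[float(k)] * dim for _ in range(T)] for k in (3, 6, 9, 12)]
stacked = stack_layers(layers)
```

The time axis is unchanged, so the same frame-wise linear probes apply, only with a 768-dimensional input.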
Table 3 shows that the model at random initialization performs slightly
worse than our trained ViT-1D model on chord estimation (0.290 versus
0.319) and much worse on key estimation (0.487 versus 0.622), but still
reaches a reasonable performance. Harmonic information is by design
embedded in mel-spectrograms and is transmitted through the skip
connections. For music tagging and beat tracking, there is a
significant performance gap between the randomly initialized and
trained ViT-1D. This result is expected, as mel-spectrograms are not
ideal representations for high-level musical concepts or rhythmic
structures.
Moreover, for the trained ViT-1D, we observe that stacking tokens
together significantly improves performance in chord estimation (0.319
to 0.422) and, to a lesser but still noticeable extent, in key
estimation (0.622 to 0.639). This indicates that shallow layers
contribute to both global and local harmonic tasks. In contrast,
performance on music tagging and beat tracking remains similar,
suggesting the features captured in shallower layers focus less on
rhythmic structures and higher-level musical concepts.
The differences in features learned at various layers highlight that
applying a contrastive pretext task only to the class token can lead to
emergent properties in sequence tokens at different levels.
7. CONCLUSION
In this paper, we show the ability of a general-purpose contrastive
pretext task paired with a transformer to learn local musical
representations. Applying NT-Xent loss only to the class token in a
lightweight ViT-1D surprisingly enables sequence tokens to handle local
tasks while contributing to global ones. Despite the class token's
time-invariance, weight sharing and attention mechanisms allow temporal
musical representations to emerge.
By analyzing attention maps, we observe that onset events can be
deduced. Self-similarity matrices show that tokens from different
layers capture distinct musical dimensions. Stacking intermediate
tokens improves performance on harmonic tasks, highlighting the
importance of shallow-layer representations for downstream tasks.
We provide exploratory insights into the emergent properties of a
transformer trained contrastively. Future work could further study
emergent properties in all layers and whether similar emergent
properties exist in supervised pretraining. Additionally, leveraging
these properties in contrastive pretraining could lead to more
efficient pretraining strategies.
8. REFERENCES
[1] V. Los anlen, “Con olu ional ope a o s in he ime-
equency domain,” Ph.D. disse a ion, École no male
supé ieu e, 2017.
[2] A. Riou, S. La ne , G. Hadje es, and G. Pee e s,
“Pes o: Pi ch es ima ion wi h sel -supe ised
ansposi ion-equi a ian objec i e,” in P oc. o
heIn e na ional Socie y o Music In o ma ion
Re ie al Con e ence (ISMIR), 2023.
[3] E. Quin on, “Equi a ian sel -supe ision o musical
empo es ima ion,” in P oc. o he In e na ional Socie y
o Music In o ma ion Re ie al Con e ence (ISMIR),
2022.
[4] Y. Kong, V. Los anlen, G. Mesegue -B ocal, S. Wong,
M. Lag ange, and R. Hennequin, “STONE: Sel -
supe ised onali y es ima o ,” P oc. o he In e na-
ional Socie y o Music In o ma ion Re ie al Con e -
ence (ISMIR), 2024.
[5] Y. Kong, G. Mesegue -B ocal, V. Los anlen, M. La-
g ange, and R. Hennequin, “S-key: Sel -supe ised
lea ning o majo and mino keys om audio,” in
ICASSP 2025 - IEEE In e na ional Con e ence on
Acous ics, Speech and Signal P ocessing (ICASSP),
2025, pp. 1–5.
[6] G. Mesegue -B ocal, R. Bi ne , S. Du and, and
B. B os , “Da a cleansing wi h con as i e lea ning o
ocal no e e en anno a ions,” in P oceedings o he
21s In e na ional Socie y o Music In o ma ion Re-
ie al Con e ence, 2020.
[7] Y. Ma, A. Øland, A. Ragni, B. M. Del Se e, C. Sai is,
C. Donahue, C. Lin, C. Plachou as, E. Bene os, E. Sha-
i e al., “Founda ion models o music: A su ey,”
a Xi p ep in a Xi :2408.14340, 2024.
[8] J. Spijke e and J. A. Bu goyne, “Con as i e lea n-
ing o musical ep esen a ions,” in P oc. o he In e -
na ional Socie y o Music In o ma ion Re ie al Con-
e ence (ISMIR), 2021.
[9] M. C. McCallum, F. Ko zeniowski, S. O amas,
F. Gouyon, and A. F. Ehmann, “Supe ised and un-
supe ised lea ning o audio ep esen a ions o music
unde s anding,” 2022.
[10] T. Chen, S. Ko nbli h, M. No ouzi, and G. E. Hin on,
“A simple amewo k o con as i e lea ning o isual
ep esen a ions,” CoRR, 2020.
[11] X. Chen and K. He, “Explo ing simple siamese ep-
esen a ion lea ning,” in P oc. o he IEEE/CVF Con-
e ence on Compu e Vision and Pa e n Recogni ion
(CVPR), 2021.
[12] H. Zhao, C. Zhang, and B. Z. et al., “S3T: Self-supervised
pre-training with swin transformer for music
classification,” in Proc. of the IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2022.
[13] C. Garoufis, A. Zlatintsi, and P. Maragos, “Multi-source
contrastive learning from musical audio,” in
Proc. of the Sound and Music Computing Conference
(SMC), May 2023.
[14] M. C. McCallum, M. E. Davies, F. Henkel, J. Kim, and
S. E. Sandberg, “On the effect of data-augmentation on
local embedding properties in the contrastive learning
of music audio representations,” in ICASSP 2024-2024
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2024.
[15] J. Choi, S. Jang, H. Cho et al., “Towards proper
contrastive self-supervised learning strategies for music
audio representation,” in 2022 IEEE International
Conference on Multimedia and Expo (ICME). IEEE,
2022, pp. 1–6.
[16] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford,
and I. Sutskever, “Jukebox: A generative model for
music,” arXiv preprint arXiv:2005.00341, 2020.
[17] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent:
Consistency autoencoders for latent audio compression,”
Proc. of the International Society for Music
Information Retrieval Conference (ISMIR), 2024.
[18] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin,
C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg,
R. Liu, W. Chen, G. Xia, Y. Shi, W. Huang,
Y. Guo, and J. Fu, “Mert: Acoustic music understanding
model with large-scale self-supervised training,” in
Proc. of the International Conference on Learning
Representations (ICLR), 2023.
[19] W.-N. Hsu, B. Bolte, Y.-S. Chuang et al., “Hubert:
Self-supervised speech representation learning by
masked prediction of hidden units,” IEEE/ACM Transactions
on Audio, Speech, and Language Processing,
vol. 29, pp. 3451–3460, 2021.
[20] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and
K. Kashino, “Masked Modeling Duo: Towards a
Universal Audio Pre-training Framework,” IEEE/ACM
Trans. Audio, Speech, Language Process., vol. 32, pp.
2391–2406, 2024.
[21] M. Won, Y.-N. Hung, and D. Le, “A foundation model
for music informatics,” in Proc. of the IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2024.
[22] L. Wang, P. Luc, Y. Wu, A. Recasens, L. Smaira,
A. Brock, A. Jaegle, J.-B. Alayrac, S. Dieleman,
J. Carreira et al., “Towards learning universal audio
representations,” in ICASSP 2022-2022 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2022, pp. 4593–4597.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[23] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer,
“Efficient training of audio transformers with
patchout,” in Proc. Interspeech 2022, 2022, pp. 2753–
2757.
[24] Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio
Spectrogram Transformer,” in Proc. of Interspeech,
2021, pp. 571–575.
[25] I. Manco, E. Benetos, E. Quinton, and G. Fazekas,
“Contrastive audio-language learning for music,” in
Proc. of the International Society for Music Information
Retrieval Conference (ISMIR), 2022.
[26] Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and D. P.
Ellis, “MuLan: A joint embedding of music audio and
natural language,” Proc. of the International Society
for Music Information Retrieval Conference (ISMIR),
2022.
[27] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn,
X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and
N. Houlsby, “An image is worth 16x16 words: Transformers
for image recognition at scale,” in International
Conference on Learning Representations, 2021.
[28] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal,
P. Bojanowski, and A. Joulin, “Emerging properties
in self-supervised vision transformers,” in Proc. of the
IEEE/CVF international conference on computer vision,
2021, pp. 9650–9660.
[29] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo,
M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza,
F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang,
H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat,
M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou,
J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski,
“Dinov2: Learning robust visual features without
supervision,” 2023.
[30] G. Meseguer-Brocal, D. Desblancs, and R. Hennequin,
“An experimental comparison of multi-view
self-supervised methods for music tagging,” in Proc.
of the IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2024.
[31] E. Law, K. West, M. I. Mandel, M. Bay, and J. S.
Downie, “Evaluation of algorithms using games: The
case of music tagging,” in Proc. of the International
Society for Music Information Retrieval Conference
(ISMIR), 2009, pp. 213–218.
[32] J. Lee, J. Park, K. L. Kim, and J. Nam, “Sample-level
deep convolutional neural networks for music
auto-tagging using raw waveforms,” 2017.
[33] S. Wong and G. Hernandez, “Fmak: A dataset of
key and mode annotations for the free music archive–
extended abstract,” in Proc. of the International Society
for Music Information Retrieval Late-Breaking/Demo
Session (ISMIR-LBD), 2023.
[34] M. Defferrard, K. Benzi, P. Vandergheynst, and
X. Bresson, “Fma: A dataset for music analysis,” Proc.
of the International Society for Music Information
Retrieval Conference (ISMIR), 2017.
[35] P. Knees, Á. Faraldo, P. Herrera, R. Vogl, S. Böck,
F. Hörschläger, and M. Le Goff, “Two datasets for tempo
estimation and key detection in electronic dance music
annotated from user corrections,” in Proc. of the
International Society for Music Information Retrieval
Conference (ISMIR), 2015.
[36] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon,
O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel,
“Mir_eval: A transparent implementation of common
mir metrics.” in Proc. of the International Society
for Music Information Retrieval Conference (ISMIR),
vol. 10, 2014, p. 2014.
[37] F. Gouyon and S. Dixon, “A review of rhythm
description systems,” in Proc. of the International Society
for Music Information Retrieval Conference (ISMIR),
2004.
[38] U. Marchand, Q. Fresnel, and G. Peeters, “Gtzan-rhythm:
Extending the gtzan test-set with beat, downbeat
and swing annotations,” in Proc. of the International
Conference on Music Information Retrieval
Late-breaking/Demo (ISMIR-LBD), 2015.
[39] F. Krebs, S. Böck, and G. Widmer, “An efficient state-space
model for joint tempo and meter tracking.” in
Proc. of the International Society for Music Information
Retrieval Conference (ISMIR), 2015, pp. 72–78.
[40] C. Weiß, F. Zalkow, V. Arifi-Müller, M. Müller, H. V.
Koops, A. Volk, and H. G. Grohganz, “Schubert winterreise
dataset: A multimodal scenario for music analysis,”
Journal on Computing and Cultural Heritage
(JOCCH), vol. 14, no. 2, pp. 1–18, 2021.
[41] M. Goto and H. Hashiguchi, “Rwc music database:
Popular, classical, and jazz music databases,” Proc. of
the International Conference on Music Information
Retrieval Conference (ISMIR), 2002.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin,
“Attention is all you need,” Advances in neural information
processing systems, vol. 30, 2017.
[43] B. Wang and C.-C. J. Kuo, “Sbert-wk: A sentence embedding
method by dissecting bert-based word models,”
IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 28, pp. 2146–2157, 2020.
[44] V. Emiya, R. Badeau, and B. David, “Multipitch estimation
of piano sounds using a new probabilistic spectral
smoothness principle,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 18, no. 6,
pp. 1643–1654, 2009.
[45] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland,
T. Reddy, D. Cournapeau, E. Burovski, P. Peterson,
W. Weckesser, J. Bright, S. J. van der Walt,
M. Brett, J. Wilson, K. J. Millman, N. Mayorov,
A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J.
Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas,
D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen,
E. A. Quintero, C. R. Harris, A. M. Archibald, A. H.
Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy
1.0 Contributors, “SciPy 1.0: Fundamental Algorithms
for Scientific Computing in Python,” Nature Methods,
vol. 17, pp. 261–272, 2020.
[46] A. Simoulin and B. Crabbé, “How many layers and
why? An analysis of the model depth in transformers,”
in Proc. of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing:
Student Research Workshop. Association for
Computational Linguistics, Aug. 2021, pp. 221–228.
[47] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning,
“What does bert look at? an analysis of bert’s
attention,” in BlackBoxNLP@ACL, 2019.