UNIVERSAL MUSIC REPRESENTATIONS? EVALUATING FOUNDATION
MODELS ON WORLD MUSIC CORPORA
Charilaos Papaioannou¹,²,³  Emmanouil Benetos²  Alexandros Potamianos¹,³
¹ School of ECE, National Technical University of Athens, Greece
² Centre for Digital Music, Queen Mary University of London, UK
³ Archimedes, Athena Research Center, Greece
[email protected]
ABSTRACT
Foundation models have revolutionized music information retrieval, but questions
remain about their ability to generalize across diverse musical traditions. This
paper presents a comprehensive evaluation of five state-of-the-art audio foundation
models across six musical corpora spanning Western popular, Greek, Turkish, and
Indian classical traditions. We employ three complementary methodologies to
investigate these models’ cross-cultural capabilities: probing to assess inherent
representations, targeted supervised fine-tuning of 1-2 layers, and multi-label
few-shot learning for low-resource scenarios. Our analysis shows varying
cross-cultural generalization, with larger models typically outperforming on
non-Western music, though results decline for culturally distant traditions.
Notably, our approaches achieve state-of-the-art performance on five out of six
evaluated datasets, demonstrating the effectiveness of foundation models for world
music understanding. We also find that our targeted fine-tuning approach does not
consistently outperform probing across all settings, suggesting foundation models
already encode substantial musical knowledge. Our evaluation framework and
benchmarking results contribute to understanding how far current models are from
achieving universal music representations while establishing metrics for future
progress.
1. INTRODUCTION
The notion of music as a “universal language” remains contested among scholars
[1, 2]. While some musical elements transcend cultural boundaries, traditions have
evolved with distinct characteristics and semantic content [3, 4]. This tension
between universality and cultural specificity presents a complex challenge that
modern artificial intelligence approaches offer a novel lens to investigate.
Foundation models have emerged as a transformative paradigm across artificial
intelligence (AI) domains [5], including music and audio [6–8]. In music
information retrieval (MIR), these multipurpose models perform diverse tasks from
beat tracking to automatic tagging [9, 10]. Though implicitly claiming a form of
universality, they largely neglect cultural dimensions while training predominantly
on Western-centric data [10]. This raises a critical question: to what extent do
foundation models actually provide universal music representations that generalize
across diverse musical traditions?
© C. Papaioannou, E. Benetos, and A. Potamianos. Licensed under a Creative Commons
Attribution 4.0 International License (CC BY 4.0). Attribution: C. Papaioannou,
E. Benetos, and A. Potamianos, “Universal Music Representations? Evaluating
Foundation Models on World Music Corpora”, in Proc. of the 26th Int. Society for
Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In this work, we evaluate five state-of-the-art audio models across six corpora
spanning Western popular, Greek, Turkish, and Indian classical traditions, to
quantitatively assess their cross-cultural capabilities and contribute to
discussions about the universality of musical representations. We focus on
automatic music tagging as our evaluation task and employ three complementary
methodologies: (i) probing, which uses the models as frozen feature extractors with
a trainable classifier, (ii) targeted supervised fine-tuning to assess adaptation
potential, and (iii) multi-label few-shot learning to evaluate performance in
low-resource scenarios common with world music collections.
Our evaluation reveals both promising cross-cultural transfer capabilities and
remaining gaps in universal music understanding: performance decreases for
culturally distant domains, especially in low-resource scenarios. The contributions
of this work can be summarized as follows:
• This is the first comprehensive evaluation, to the best of our knowledge, of
  foundation models across culturally diverse music corpora.
• We propose a methodological evaluation framework that integrates few-shot
  learning with traditional approaches, enabling systematic assessment of model
  representations under different training setups.
• State-of-the-art results have been achieved by our approaches in five out of six
  datasets.
• We have optimized multi-label few-shot learning, significantly reducing inference
  time and making it practical for large numbers of classes.
• Our code is being made available¹ for reproducibility and to promote research on
  world music.
¹ https://github.com/pxaris/FM-music-tagging
2. RELATED WORK
Foundation models. Foundation models for music have emerged by leveraging
large-scale self-supervised or contrastive learning on extensive audio datasets,
enabling them to capture rich musical features applicable across diverse tasks.
Representative works include JukeMIR [11], which explored representations from the
Jukebox generative model [12], MULE [13], a self-supervised model pre-trained on
MusicNet dataset, and Music2Vec [14], which utilized masked prediction strategies
with student-teacher approaches. Subsequent advancements like MusicFM [9] have
scaled up both model size and training data, demonstrating effectiveness across
multiple benchmark tasks.
The landscape of current foundation models encompasses several architectural
approaches: masked acoustic modeling, MERT [6], contrastive audio-text learning
such as LAION-CLAP [7], and unified audio understanding with models like Qwen-Audio
[8]. Despite their impressive performance on standard benchmarks, their
cross-cultural generalization capabilities remain largely unexplored, particularly
regarding their effectiveness across diverse musical traditions beyond Western
contexts.
Automatic world music tagging. Automatic music tagging - predicting metadata such
as genre, mood, and instrumentation from audio signals - is typically referred to
as music auto-tagging [15–18] and constitutes a multi-label classification problem.
Architectures addressing this task have evolved from convolutional models like
VGG-ish [19] and Musicnn [20] to transformer-based approaches like AST [21] and
more recent foundation models [9].
Research on world music computational analysis has grown in recent years [22], with
studies focused on specific traditions including Turkish makam recognition
[23, 24], Indian classical music classification [25], and analysis of Iranian and
Korean traditional music [26, 27]. While a recent study applied auto-tagging across
diverse musical datasets [28], this is, to the best of our knowledge, the first
comprehensive evaluation of foundation models on world music corpora.
To address the challenges of imbalanced tags and limited data inherent in world
music research, we employ Label-Combination Prototypical Networks (LC-Protonets)
[29] for few-shot learning. This approach extends Prototypical Networks [30] by
creating prototypes for each label combination, rather than generating one
prototype per label. While established benchmarks for evaluating representations on
downstream tasks typically employ probing and fine-tuning methodologies [31–33],
our work incorporates few-shot learning as a complementary evaluation approach,
assessing foundation models’ capabilities in low-resource scenarios.
3. METHODOLOGICAL FRAMEWORK
Our methodological framework systematically evaluates whether foundation models can
effectively represent musical characteristics across diverse cultural traditions.
As shown in Figure 1, we employ three complementary methodologies: probing (Prob.),
supervised fine-tuning (SFT), and multi-label few-shot learning (ML-FSL). Probing
trains only an MLP classifier on frozen model representations, while SFT makes the
model’s last layers trainable alongside the MLP. ML-FSL extracts representations
from three contexts, i.e., pretrained model (PT), trained probing model (Prob.) and
fine-tuned model (SFT) to evaluate performance on extended tag sets under data
scarcity conditions.
Figure 1. Architectural overview of our evaluation framework showcasing three
methodologies: (1) Probing (Prob.), (2) Supervised Fine-Tuning (SFT), and (3)
Multi-Label Few-Shot Learning (ML-FSL). The diagram indicates feature extraction
points used by ML-FSL from either Pre-Trained (PT), trained Prob. or SFT models.
3.1 Models
For our evaluation, we selected five state-of-the-art audio models spanning
different architectures, pre-training approaches, and parameter scales:
MERT. We evaluate two variants of MERT [6]: MERT-95M² and MERT-330M³ with 95M and
330M parameters respectively. These transformer-based models employ masked acoustic
modeling, using an acoustic and a musical teacher, during pre-training. MERT-95M
consists of 12 layers, while MERT-330M has 24 layers.
LAION-CLAP. We include two variants: CLAP-Music⁴ (CLAP-M), trained exclusively on
music data, and CLAP-Music&Speech⁵ (CLAP-M&S), which incorporates additional speech
data [7]. Both utilize HTS-AT [34] for audio encoding, a transformer-based model
with 4 groups of swin-transformer blocks [35], with 68M audio-specific parameters
within a larger 194M parameter model.
Qwen2-Audio. The largest model in our evaluation framework, Qwen2-Audio⁶ [36],
contains 637M audio-specific parameters within an 8.4B parameter architecture and
features 32 transformer layers [37] in its audio tower.
VGG-ish. As a baseline comparison, we include VGG-ish [17, 38], a 3.6M parameter
end-to-end model trained via supervised learning on mel-spectrograms to predict
tags. For VGG-ish, we report results from the literature for the same experimental
setup used in our work [28, 29] rather than running new experiments.
² https://huggingface.co/m-a-p/MERT-v1-95M
³ https://huggingface.co/m-a-p/MERT-v1-330M
⁴ https://huggingface.co/laion/larger_clap_music
⁵ https://huggingface.co/laion/larger_clap_music_and_speech
⁶ https://huggingface.co/Qwen/Qwen2-Audio-7B
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.2 Datasets
Our evaluation spans diverse traditions from six music datasets. For Western music,
we utilize MagnaTagATune [39] (25,863 clips) and FMA-medium [40] (25,000 tracks).
For world music traditions, we incorporate the Lyra dataset [41] with 1,570
recordings of Greek folk music, and three collections from the CompMusic project
[42]: the Turkish-makam corpus [43, 44] (5,297 recordings) as well as Hindustani
[45] (1,204 recordings) and Carnatic [45] (2,612 recordings) for Indian classical
music.
Following [28], we set maximum audio durations to achieve similar sizes between
datasets and prepare their metadata for the auto-tagging task. For Probing and
Supervised Fine-Tuning, we use the standard tag sets, i.e., 50 tags for
MagnaTagATune, 30 for Lyra and Turkish-makam, and 20 for the rest of the datasets.
Our ML-FSL experiments use extended tag sets that include previously unseen
classes, summing up to: 80 tags for MagnaTagATune, 60 for Lyra and Turkish-makam,
40 for FMA-medium and Carnatic, and 35 for Hindustani, consistent with [29].
3.3 Evaluation methodologies
Probing. Our first methodology (Prob.) evaluates how well foundation models
inherently represent musical characteristics across cultures. We employ probing,
where the model remains frozen while only training a classifier on top of the
extracted representations. Specifically, we implement a shallow Multi-layer
Perceptron (MLP) with a single hidden layer of 512 units followed by a sigmoid
classification layer, optimized with binary cross-entropy loss.
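The probe described above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`init_probe`, `probe_forward`, `bce_loss`), the uniform weight initialization, and the ReLU activation in the hidden layer are all hypothetical choices, since the text specifies only the hidden size (512), the sigmoid output layer, and the binary cross-entropy loss.

```python
import math
import random


def init_probe(in_dim, n_tags, hidden=512, seed=0):
    """Create random weights for a one-hidden-layer probe (hypothetical init)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

    return {"W1": matrix(hidden, in_dim), "b1": [0.0] * hidden,
            "W2": matrix(n_tags, hidden), "b2": [0.0] * n_tags}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_forward(params, x):
    """Map one frozen embedding x to independent per-tag probabilities."""
    # Hidden layer; the ReLU non-linearity is an assumption of this sketch.
    h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
         for row, b in zip(params["W1"], params["b1"])]
    # Sigmoid output layer: one probability per tag (multi-label setting).
    return [sigmoid(sum(w * v for w, v in zip(row, h)) + b)
            for row, b in zip(params["W2"], params["b2"])]


def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over the tags of a single example."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, targets)) / len(targets)
```

Only the probe weights would be updated during training; the foundation model that produces `x` stays frozen, which is what distinguishes probing from the fine-tuning setup described next.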
Supervised Fine-Tuning. To evaluate adaptation potential, we implement targeted
supervised fine-tuning (SFT) by unfreezing a subset of model parameters. For
MERT-95M, we unfreeze the last two transformer layers, while for MERT-330M only the
last layer. For both CLAP models, we unfreeze the last group of swin-transformer
blocks of the audio encoder along with the normalization and two projection layers.
In Qwen2-Audio, we fine-tune the last layer of the audio tower along with the
normalization layer before multi-modal projection. These choices were constrained
by RAM limitations affecting both trainable parameters and hyperparameter tuning.
We use the same trainable MLP Probe architecture as in the Probing experiments,
initializing it with the weights learned during that phase. This weight
initialization strategy helps maintain previously learned knowledge while adapting
to new domains, mitigating potential catastrophic forgetting issues [46]. We also
employ learning rate warmup and cosine scheduling to ensure stable adaptation [47].
Multi-Label Few-Shot Learning. Our third methodology (ML-FSL) evaluates performance
in low-resource scenarios by employing an optimized version of LC-Protonets [29]
that is detailed in subsection 3.4. We extract representations from three different
contexts: directly from the pre-trained model (PT), from the hidden layer of the
trained MLP Probe (Prob.), and from the fine-tuned model (SFT). Notably, this
methodology involves no additional training during few-shot evaluation; the model
acts as a frozen feature extractor that maps both the few examples and the unknown
items to an embedding space where classification occurs utilizing the LC-Protonets
approach.
Figure 2. Relationship between model size and performance, averaged over Probing
and Supervised Fine-Tuning (SFT) tasks. The x-axis represents the number of
audio-specific parameters on a logarithmic scale, while the y-axis reports the mean
ROC-AUC (%) across all datasets.
3.4 Multi-label few-shot learning optimization
While the LC-Protonets method [29] offers significant performance advantages for
multi-label few-shot learning, its computational complexity increases substantially
with the number of labels due to the exponential growth of label combinations. In
this work, we introduce an optimization that significantly improves inference
efficiency while maintaining identical classification results.
The original approach creates an LC-Prototype (LCP) for each label combination
(LC-class) derived from the power sets of the few available examples’ labels. Each
available example is called a support item and it is defined by $(x_i, y_i)$, with
$x_i$ being its input feature vector and $y_i$ the set of its labels. For the set
of support items $S$, the set of all LC-classes $L$ is computed as
$L = \bigcup_{(x_i, y_i) \in S} \mathcal{P}(y_i)$, where $\mathcal{P}(y_i)$ is the
power set of the labels of the $i$-th support item, excluding the empty set. For
each LC-class $L_j$, with $j = 1, 2, \ldots, |L|$, the LCP representation $p_j$ is
computed by averaging the embeddings of all support items that include $L_j$ in
their power sets:

$$p_j = \frac{1}{|S_j|} \sum_{(x_i, y_i) \in S_j} \theta(x_i), \qquad (1)$$

where $S_j = \{(x_i, y_i) \in S \mid L_j \in \mathcal{P}(y_i)\}$, and $\theta$ the
embedding mapping model.
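To make Equation (1) concrete, the following minimal sketch enumerates the LC-classes of a small support set and averages the embeddings that contribute to each one. It is an illustration only: the names `lc_classes` and `build_lcps` are hypothetical, embeddings are assumed to be already computed (that is, the inputs play the role of $\theta(x_i)$), and plain Python lists stand in for tensors.

```python
from itertools import combinations


def lc_classes(labels):
    """Return every non-empty subset of one item's label set (its power set)."""
    labels = sorted(labels)
    return [frozenset(c)
            for r in range(1, len(labels) + 1)
            for c in combinations(labels, r)]


def build_lcps(support):
    """Compute one prototype per LC-class, following Eq. (1).

    support: list of (embedding, labels) pairs, where embedding is a list of
    floats standing in for theta(x_i) and labels is a set of tag names.
    Returns (prototypes, members): prototypes maps each LC-class to its mean
    embedding, members maps it to the indices of the contributing items (S_j).
    """
    members = {}
    for idx, (_, labels) in enumerate(support):
        for lc in lc_classes(labels):
            members.setdefault(lc, []).append(idx)

    prototypes = {}
    for lc, idxs in members.items():
        dim = len(support[idxs[0]][0])
        prototypes[lc] = tuple(
            sum(support[i][0][d] for i in idxs) / len(idxs) for d in range(dim)
        )
    return prototypes, members
```

For a support set with one item tagged {A, B} and another tagged {B}, this yields three LC-classes: {A} and {A, B} are both backed by the first item alone, while {B} averages both items. An item with n labels contributes 2^n - 1 LC-classes, which is the source of the exponential growth noted above.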
Our key insight is that multiple LC-classes often share identical LCP
representations despite representing different label combinations. This occurs
because the same set of support items contributes to multiple label combinations
derived from their power sets. For example, if a support item with labels {A, B, C}
is the only item contributing to both {A, B} and {B, C} LCPs, these LCPs will have
identical representations.
We exploit this redundancy by maintaining a dictionary structure that maps unique
LCP representations to their corresponding sets of LC-classes:

$$\text{UniqueLCPs} = \{\, p_m \mapsto \{ L_j \mid p_j = p_m \} \,\}, \qquad (2)$$

where $j = 1, 2, \ldots, |L|$ and $m = 1, 2, \ldots, M$ with $M$ being the number
of unique LCPs and $M \ll |L|$. During inference, instead of computing distances
between a query item, unseen during training, and all possible $|L|$ LCPs, we only
compute distances to the $M$ unique LCP representations. For the nearest unique
LCP, we then select the label combination with the maximum cardinality, consistent
with the original LC-Protonets method.
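The deduplication of Equation (2) and the resulting inference step can be sketched as follows. This is a simplified illustration rather than the code in the LC-Protonets repository: the function names are hypothetical, prototypes are assumed to be given as a mapping from LC-class (a `frozenset` of labels) to a tuple of floats, and identical prototypes are detected by exact tuple equality.

```python
import math


def unique_lcps(prototypes):
    """Group LC-classes that share the same prototype vector, as in Eq. (2)."""
    groups = {}
    for lc_class, proto in prototypes.items():
        groups.setdefault(tuple(proto), []).append(lc_class)
    return groups


def cosine_distance(u, v):
    """Cosine distance between two vectors (1.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def predict(query, groups):
    """Return the label combination predicted for one query embedding.

    Distances are computed only to the M unique prototypes; among the
    LC-classes mapped to the nearest one, the largest combination wins.
    """
    nearest = min(groups, key=lambda proto: cosine_distance(query, proto))
    return max(groups[nearest], key=len)
```

Because the grouping is built once from the support set, each query costs M distance computations instead of |L|, while the selected label combination is the same one the exhaustive search would return.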
Our experiments show that this approach yields speed improvements of 10× for
datasets with 20 labels, scaling to more than 100× for datasets with 60 labels,
while producing identical classification results to the original method. We apply
this optimization to the LC-Protonets repository⁷, making it practical for large
label sets.
4. EXPERIMENTAL SETUP
Experiments and resources. We conducted 5 runs with different random seeds for both
Probing and ML-FSL tasks, but a single run for SFT due to computational
constraints. SFT trainable parameters varied: 14M for MERT-95M, 13M for MERT-330M,
25M for CLAP models, and 56M for Qwen2-Audio. All experiments ran on an NVIDIA RTX
A5000 GPU, and we used Qwen2-Audio in half-precision (FP16) in all our
methodologies to fit in this card. Most SFT training completed within 24 hours,
with only 3 out of 30 experiments extending to about 36 hours.
Dataset processing. We standardized Turkish-makam, Hindustani, and Carnatic
datasets to approximately 200 hours each, matching MagnaTagATune and FMA-medium
durations [28], while Lyra remained at its original 80 hours. We followed the
training, validation, and test splits from [17, 28]. For ML-FSL, evaluation items
came exclusively from test sets [29] to prevent data leakage.
Model-specific configurations. Each foundation model required specific
preprocessing: MERT models use 30-second windows at 24kHz, CLAP models 10-second
windows at 48kHz, and Qwen2-Audio 30-second windows at 16kHz. All audio was
converted to mono and resampled to the model’s required rate.
Representation extraction strategies. For MERT models, we extract representations
by summing the time-averaged hidden states of the last four layers of the models.
For CLAP models, we extract them from the audio projection layer, which takes as
input the average-pooled layer representation of the last hidden state. For
Qwen2-Audio, we use the last hidden state embeddings averaged across all layers of
the whole model, when passing a simple text prompt that includes nothing but the
respective tags for audio processing, i.e.,
<|audio_bos|><|AUDIO|><|audio_eos|>.
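As a small illustration of the MERT strategy described above, the sketch below averages each selected layer over time and sums the results. It assumes the hidden states are already available as nested lists indexed `[layer][time][dim]`; the function name `mert_style_embedding` and this data layout are hypothetical and do not reflect the interface of any particular library.

```python
def mert_style_embedding(hidden_states, last_n=4):
    """Sum the time-averaged hidden states of the last `last_n` layers.

    hidden_states: list over layers, each a list over time steps, each a list
    of floats (one feature vector per time step).
    Returns a single vector with the same dimensionality as one time step.
    """
    selected = hidden_states[-last_n:]
    dim = len(selected[0][0])
    embedding = [0.0] * dim
    for layer in selected:
        n_steps = len(layer)
        for d in range(dim):
            # Mean over time for this layer, accumulated across layers.
            embedding[d] += sum(frame[d] for frame in layer) / n_steps
    return embedding
```

The output is a fixed-size clip-level vector regardless of audio length, which is what the probe and the few-shot classifier consume.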
⁷ https://github.com/pxaris/LC-Protonets
Model          Params         ROC-AUC (%)               mAP (%)
               Audio/Total
VGG-ish [28]   3.6M/3.6M      84.45                     50.56
                              Prob.        SFT          Prob.        SFT
MERT-95M       95M/95M        87.25±0.32   87.26        52.25±0.42   52.68
MERT-330M      330M/330M      85.40±0.68   85.69        49.62±0.83   50.47
CLAP-M         68M/194M       71.52±1.14   78.96        29.98±1.07   40.41
CLAP-M&S       68M/194M       86.78±0.31   86.15        53.12±0.87   51.99
Qwen2-Audio    637M/8.40B     88.59±0.47   89.37        56.48±0.63   58.73
Table 1. Model performance comparison averaged across all datasets for Probing and
SFT tasks. Values are averaged over multiple runs and shown with ± standard
deviations. Bold values indicate best performance per column.
These representation extraction strategies, number of fine-tuned layers, and other
design choices of our method were optimized through preliminary experiments.
Hyperparameters. For Probing, we used Adam optimizer [48] ($\beta_1 = 0.9$,
$\beta_2 = 0.999$, $\epsilon = 10^{-8}$) with learning rate $10^{-3}$, batch size
16, early stopping patience 10, and maximum 200 epochs. For SFT, we used AdamW [49]
with identical $\beta$ parameters but learning rate $10^{-4}$, model-specific batch
sizes (to fit maximum available resources) with gradient accumulation to simulate
batch size 16 across all setups, patience 5, and maximum 30 epochs. We applied
learning rate warmup and cosine scheduling for the first 5% of SFT epochs. ML-FSL
evaluations used cosine distance with an N-way K-shot setup, with N being the
number of extended tags per dataset and K equal to 3 examples per label in all
experiments. We also attempted Low-Rank Adaptation [50] initially but abandoned it
due to extensive hyperparameter tuning requirements across our 5×6 experimental
matrix.
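One common way to combine warmup with cosine scheduling is sketched below. The text does not spell out the exact schedule, so the details here are assumptions: a linear warmup over the first 5% of optimization steps up to the base learning rate of $10^{-4}$, followed by cosine decay to zero. The function name `lr_at` and its parameters are hypothetical.

```python
import math


def lr_at(step, total_steps, base_lr=1e-4, warmup_frac=0.05):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Assumes warmup covers the first `warmup_frac` of all steps and that the
    rate decays to zero at `total_steps`; both are choices of this sketch.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With gradient accumulation, `step` would count optimizer updates (one per simulated batch of 16) rather than forward passes, so that the schedule is independent of the model-specific micro-batch size.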
Evaluation metrics. For the Probing and SFT methodologies, we report area under the
receiver operating characteristic curve (ROC-AUC) and mean average precision (mAP).
These metrics are particularly well-suited for multi-label classification tasks
[51] and are consistent with prior work in music tagging [17, 28]. For ML-FSL
evaluation, we report macro-F1 (M-F1) and micro-F1 (m-F1) scores, which align with
the LC-Protonets evaluation framework [29]. F1 score is the harmonic mean of the
precision and recall scores. Macro-F1 gives equal weight to all classes, while
micro-F1 accounts for class imbalance by calculating metrics globally across all
instances.
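The difference between the two F1 averages can be seen in a short sketch. It assumes predictions and ground truth are given as sets of integer label indices per item; the function names `f1` and `macro_micro_f1` are hypothetical, and labels with no true or predicted instances are scored as 0, which is one of several possible conventions.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred, n_labels):
    """Compute (macro-F1, micro-F1) for multi-label predictions.

    y_true, y_pred: lists of sets of label indices, one set per item.
    """
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for true, pred in zip(y_true, y_pred):
        for k in range(n_labels):
            if k in pred and k in true:
                counts[k][0] += 1
            elif k in pred:
                counts[k][1] += 1
            elif k in true:
                counts[k][2] += 1
    # Macro: average the per-label F1 scores, so rare labels weigh equally.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels before computing a single F1.
    micro = f1(*[sum(c[i] for c in counts) for i in range(3)])
    return macro, micro
```

In the test case below, a frequent label is always predicted correctly and a rare label is always missed: macro-F1 drops to 0.5 while micro-F1 stays near 0.86. A gap in this direction is what one would expect when rare tags are harder to predict than frequent ones.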
5. RESULTS
5.1 Probing and Supervised Fine-Tuning
Table 1 presents the performance of the evaluated foundation models averaged across
all datasets for both Probing and SFT tasks. Overall, Qwen2-Audio achieves the
highest performance with 88.59% ROC-AUC and 56.48% mAP in Probing, further
improving to 89.37% ROC-AUC and 58.73% mAP after fine-tuning. This is followed by
MERT-95M and CLAP-Music&Speech with comparable performance, while CLAP-Music shows
significantly lower performance without speech data in its training corpus.
Model            MagnaTagATune           FMA-medium              Lyra                    Turkish-makam           Hindustani              Carnatic
                 ROC-AUC     mAP         ROC-AUC     mAP         ROC-AUC     mAP         ROC-AUC     mAP         ROC-AUC     mAP         ROC-AUC     mAP
VGG-ish [28]     91.23       45.82       88.89       49.49       80.97       48.06       86.96       56.39       84.77       60.82       73.92       42.78
Probing (Prob.)
MERT-95M         90.46±0.10  44.16±0.21  91.68±0.08  51.43±0.43  85.61±0.66  53.34±0.61  88.22±0.23  57.89±0.34  86.59±0.52  60.26±0.56  80.96±0.35  46.41±0.35
MERT-330M        89.66±0.16  41.73±0.59  90.78±0.11  48.85±0.32  84.65±0.78  51.81±0.59  85.37±0.64  52.45±1.12  84.23±1.36  58.78±2.08  77.73±1.03  44.07±0.31
CLAP-M           80.07±0.21  25.82±0.13  77.42±0.15  22.89±0.38  64.18±1.29  31.16±0.43  77.31±0.51  38.77±1.00  68.69±4.05  33.43±4.21  61.47±0.60  27.83±0.30
CLAP-M&S         92.41±0.05  48.54±0.16  94.05±0.08  59.13±0.54  87.25±0.18  56.94±0.51  86.49±0.27  54.69±0.36  82.61±1.14  55.70±3.29  77.85±0.13  43.73±0.35
Qwen2-Audio      91.17±0.13  45.58±0.21  96.60±0.07  73.38±0.28  86.44±0.81  53.50±0.65  86.64±0.42  53.38±0.79  88.45±0.83  62.42±0.99  82.22±0.56  50.59±0.88
Supervised Fine-Tuning (SFT)
MERT-95M         90.62       44.52       91.70       51.74       84.89       53.62       87.50       57.91       88.20       61.47       80.64       46.83
MERT-330M        89.55       41.93       91.12       49.56       84.74       52.54       86.17       53.80       85.49       61.33       77.05       43.66
CLAP-M           88.54       39.26       88.37       42.04       71.97       38.14       79.82       42.49       75.65       45.01       69.39       35.51
CLAP-M&S         91.77       47.54       92.86       57.11       85.35       52.86       86.69       54.93       83.73       56.91       76.51       42.58
Qwen2-Audio      92.03       48.27       97.02       75.94       87.57       57.04       87.95       56.10       88.32       64.35       83.35       50.66
(Previous) SOTA  92.7        46.54       92.4        53.7        85.4        54.3        87.7        57.7        86.5        63.1        77.0        43.9
Table 2. Model performance on individual datasets for Probing and SFT tasks. For
Probing, values are averaged over multiple runs and shown with ± standard
deviations, while SFT results are from single runs. Bold values indicate best
performance per metric and dataset. SOTA values are from [52] for MagnaTagATune and
[28] for the rest of the datasets.
Figure 2 illustrates the relationship between model size (audio-specific
parameters) and ROC-AUC performance, averaged across datasets and both Probing and
SFT tasks. A generally positive correlation is revealed, with similar trends
observed in both methodologies. Qwen2-Audio (637M parameters) consistently
outperforms smaller models, achieving 88.98% average ROC-AUC score. Surprisingly,
MERT-95M (87.25%) outperforms the much larger MERT-330M (85.55%). This is worth
noting as [33] reported that both models performed on par for auto-tagging tasks,
suggesting that our common representation extraction strategy for both MERT models
may not optimally leverage the larger model’s capacity. Another potential
explanation is that MERT-95M has been trained on open data whereas MERT-330M has
been trained with additional proprietary data with a strong Western bias [6].
When examining Probing (Prob.) performance across individual datasets, in Table 2,
we observe a consistent pattern of decreasing performance for music traditions that
are culturally distant from the data used to pre-train the respective foundation
models. Western music datasets (MagnaTagATune and FMA-medium) consistently achieve
the highest performance across all models, with ROC-AUC values reaching 96.60% for
Qwen2-Audio on FMA-medium. Greek (Lyra) and Turkish (makam) music datasets show
moderate performance, while Indian classical music (Hindustani and Carnatic)
datasets typically exhibit the lowest performance. This cultural performance gap is
especially pronounced for CLAP-Music, where the ROC-AUC drops from 80.07% for
MagnaTagATune to 61.47% for Carnatic.
Applying Supervised Fine-Tuning (SFT) generally improves performance across all
models and datasets, with an average gain of 1-2% in ROC-AUC for most models.
Notably, CLAP-Music shows the largest improvement with SFT, indicating greater
adaptation potential despite lower absolute performance. For other models, the
modest gains suggest that they require broader fine-tuning to further shift their
pre-trained representations towards different cultures.
Importantly, our approaches achieve state-of-the-art performance in five out of six
datasets, with MagnaTagATune being the only exception. However, their consistent
performance decrease towards diverse cultures suggests that their representations
are still biased toward Western musical traditions.
Model          M-F1                                   m-F1
VGG-ish [29]   30.18                                  55.09
               PT           Prob.        SFT          PT           Prob.        SFT
MERT-95M       23.90±1.52   28.05±1.74   28.28±1.80   46.59±1.57   52.16±1.43   52.56±1.63
MERT-330M      23.03±1.12   28.48±1.40   28.51±1.28   45.11±1.29   51.78±1.51   51.80±1.46
CLAP-M         17.71±1.20   18.43±1.40   21.58±1.13   38.80±1.37   39.97±1.20   46.57±1.20
CLAP-M&S       28.23±1.36   29.22±1.09   30.27±1.90   51.59±1.54   53.32±1.31   54.43±1.27
Qwen2-Audio    25.98±1.36   30.96±1.26   32.00±1.41   49.97±1.41   55.66±0.82   56.85±1.23
Table 3. ML-FSL performance averaged across datasets on extended tag sets. Results
show macro-F1 (M-F1) and micro-F1 (m-F1) across contexts (PT, Prob., SFT). Values
are means ± standard deviations. Bold indicates best performance per column.
5.2 Multi-label few-shot learning
Table 3 presents the ML-FSL evaluation results averaged across all datasets using
extended tag sets. The results show consistent performance improvements moving from
pre-trained models (PT) to trained probing models (Prob.) and then to supervised
fine-tuned models (SFT) across all foundation models. The substantial gap between
macro-F1 and micro-F1 metrics indicates considerable class imbalance in the
extended tag sets, while the increased standard deviation stems from the support
set sampling which can significantly impact the classification performance.
Qwen2-Audio demonstrates the best overall performance in the ML-FSL task with
32.00% macro-F1 and 56.85% micro-F1 after fine-tuning, followed closely by
CLAP-Music&Speech with 30.27% macro-F1 and 54.43% micro-F1. Notably, even the best
foundation model’s performance (Qwen2-Audio) is comparable to a VGG-ish feature
extractor trained via supervised learning on standard tags of each dataset. This
stands in contrast to the Probing and SFT settings (Table 1), where foundation
models clearly outperform VGG-ish, showing that ML-FSL tasks remain challenging for
them despite their extensive pre-training. Supervised learning of a VGG-ish model
on extended tag sets has not been conducted in the literature, likely due to the
scarcity of examples for infrequent tags.
Model          MagnaTagATune           FMA-medium              Lyra                    Turkish-makam           Hindustani              Carnatic
               M-F1        m-F1        M-F1        m-F1        M-F1        m-F1        M-F1        m-F1        M-F1        m-F1        M-F1        m-F1
VGG-ish [29]   26.40       37.31       29.12       45.37       46.05       69.03       30.07       56.22       31.33       58.38       18.13       64.25
Pre-Trained models (PT)
MERT-95M       18.76±1.04  28.37±1.38  16.24±0.64  35.37±0.94  46.87±2.59  66.07±2.25  20.69±1.77  40.95±1.80  25.87±2.45  51.50±1.92  14.97±0.64  57.26±1.10
MERT-330M      18.17±0.78  26.99±1.36  16.24±0.69  31.15±1.51  44.22±1.45  65.48±1.57  20.14±2.01  39.71±1.95  25.08±1.40  50.14±1.08  14.32±0.41  57.21±0.28
CLAP-M         13.10±0.84  20.00±1.15  9.65±0.29   19.77±1.31  33.56±2.88  57.14±1.49  14.33±1.10  32.12±1.37  21.06±1.63  47.38±1.60  14.55±0.43  56.42±1.30
CLAP-M&S       25.90±0.55  36.55±0.61  28.78±1.66  42.95±2.02  48.03±2.02  69.04±1.54  24.19±1.73  47.13±2.22  26.29±1.20  54.50±1.57  16.19±1.02  59.38±1.30
Qwen2-Audio    21.29±0.51  32.09±0.26  29.76±2.23  47.50±1.86  39.99±1.05  64.24±1.07  19.89±1.71  42.27±1.88  28.42±1.96  55.92±1.70  16.55±0.69  57.82±1.67
Trained Probing models (Prob.)
MERT-95M       23.77±0.85  34.71±1.03  24.62±1.19  42.96±1.30  45.80±2.76  68.16±1.81  26.14±1.73  50.00±0.70  30.75±2.95  56.41±2.18  17.25±0.98  60.70±1.55
MERT-330M      24.48±0.59  34.78±1.45  25.21±0.76  40.65±1.76  47.92±3.26  70.15±2.18  26.97±1.61  50.47±1.13  29.25±1.55  53.77±1.82  17.06±0.61  60.85±0.69
CLAP-M         14.84±0.49  22.67±1.00  11.55±0.50  22.72±1.48  34.85±4.03  57.73±1.37  16.68±0.81  36.00±1.22  18.77±1.42  44.96±0.95  13.87±1.16  55.74±1.15
CLAP-M&S       26.90±0.47  37.62±0.93  31.14±1.28  46.53±1.59  47.10±0.89  69.77±0.53  25.58±1.59  49.70±1.39  28.11±1.38  56.43±2.19  16.46±0.92  59.88±1.25
Qwen2-Audio    26.79±0.40  37.65±0.21  39.49±1.02  56.30±0.82  42.52±1.81  67.10±1.13  26.09±1.65  51.59±1.20  31.62±1.26  60.08±0.40  19.25±1.40  61.24±1.14
Supervised Fine-Tuned models (SFT)
MERT-95M       24.46±0.79  35.28±0.90  24.94±1.18  42.78±1.44  45.51±3.74  67.93±2.72  26.16±1.87  49.76±1.54  30.40±2.15  56.39±1.68  18.18±1.08  63.19±1.48
MERT-330M      23.78±0.65  33.67±0.91  24.94±1.21  39.95±1.77  48.50±2.75  70.06±2.23  26.84±1.51  50.29±1.25  30.56±1.31  55.25±1.58  16.43±0.27  61.57±1.04
CLAP-M         22.15±0.51  32.67±1.22  19.61±0.79  34.81±0.99  30.46±2.04  55.86±2.02  20.66±1.69  45.80±1.13  21.95±1.31  50.74±1.14  14.63±0.45  59.53±0.67
CLAP-M&S       26.28±0.50  37.23±1.09  30.27±1.56  46.57±1.61  48.09±4.74  69.93±2.28  28.91±1.75  53.87±1.56  31.27±2.47  57.41±0.74  16.82±0.37  61.55±0.34
Qwen2-Audio    27.67±0.25  38.57±0.18  40.10±1.29  57.17±0.95  44.13±2.45  68.34±2.38  27.61±2.37  53.98±1.55  32.52±1.23  60.26±0.89  19.97±0.87  62.76±1.43
Table 4. ML-FSL performance on extended tag sets per dataset. Results show macro-F1
(M-F1) and micro-F1 (m-F1) across three contexts. Values are means ± standard
deviations. Bold indicates best performance per column.
Figure 3. Scalability metrics of the LC-Protonets method, averaged across all
datasets. The x-axis represents the number of labels, the left y-axis shows the
number of LCPs, and the right y-axis indicates the inference time per item with
both y-axes using the same logarithmic scale.
When examining the ML-FSL results per dataset in Table 4, we observe that only on
Western datasets (MagnaTagATune and FMA-medium) does the best foundation model
(Qwen2-Audio) achieve significantly better performance than the VGG-ish baseline.
For Turkish-makam, VGG-ish representations actually outperform foundation models,
while for Lyra, Hindustani, and Carnatic, the results are comparable. This pattern
provides additional clear evidence of the implicit Western-centric bias integrated
into models due to their pre-training data.
LC-Protonets optimization. Figure 3 illustrates the scalability metrics of our
optimized LC-Protonets approach compared to the original method, averaged across
datasets. As the number of labels increases from 20 to 60, the number of
LC-Prototypes grows exponentially, from approximately 500 to over 50,000. This
growth leads the original method to a corresponding increase from 21ms to over
2,000ms inference time per query item (dashed blue line). However, our optimization
(solid blue line), leveraging the unique prototypes, mitigates the computational
complexity issues, requiring only 2ms in the 20 labels cases and rising to no more
than 20ms for 60 labels, a 100× improvement.
6. CONCLUSIONS
In this paper, we examined the universality of music representations in foundation
models through a comprehensive methodological framework evaluating five
state-of-the-art audio models across six world music corpora. Although these models
achieved better performance than previous models for diverse music traditions, we
found clear indicators of Western-centric bias.
Our incorporation of ML-FSL tasks particularly revealed this limitation. When faced
with these challenging scenarios, foundation models performed on par with
significantly smaller and simpler models, with performance notably degrading
further on non-Western datasets.
To further enable ML-FSL evaluation, we substantially optimized the computational
complexity of the utilized method, by forming unique prototypes representing
multiple label combinations. We demonstrated that this change makes it practical
for large sets of labels, a typical condition when studying world music datasets.
Future work could extend our methodological framework by incorporating Low-Rank
Adaptation (LoRA) and implementing broader supervised fine-tuning to investigate
further cultural adaptation. More tasks can also be included, such as mode
estimation, exploring the analogies between key in Western cultures and makam or
raga recognition in other cultures.
We hope this work brings attention to the cultural dimensions of foundation models
while providing a framework for quantitatively assessing progress toward truly
universal musical representations.
7. ACKNOWLEDGMENTS
We would like to thank the reviewers for their valuable and constructive feedback,
which helped us improve our study. This work has been partially supported by
project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0 funded
by the European Union under the NextGenerationEU Program.
8. REFERENCES
[1] S. A. Mehr, M. Singh, D. Knox, D. M. Ketter, D. Pickens-Jones, S. Atwood,
    C. Lucas, N. Jacoby, A. A. Egner, E. J. Hopkins, R. M. Howard et al.,
    “Universality and diversity in human song,” Science, vol. 366, 2019.
[2] P. E. Savage, S. Brown, E. Sakai, and T. E. Currie, “Statistical universals
    reveal the structures and functions of human music,” Proceedings of the
    National Academy of Sciences, vol. 112, pp. 8987–8992, 2015.
[3] S. E. Trehub, J. Becker, and I. Morley, “Cross-cultural perspectives on music
    and musicality,” Philosophical Transactions of the Royal Society B: Biological
    Sciences, vol. 370, 2015.
[4] E. H. Margulis, P. C. M. Wong, C. Turnbull, B. M. Kubit, and J. D. McAuley,
    “Narratives imagined in response to instrumental music reveal culture-bounded
    intersubjectivity,” Proceedings of the National Academy of Sciences of the
    United States of America, vol. 119, 2022.
[5] R. Bommasani, D. A. Hudson, E. Adeli, R. B. Altman, S. Arora, S. von Arx,
    M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch,
    D. Card, R. Castellon, N. S. Chatterji, A. S. Chen, K. Creel et al., “On the
    opportunities and risks of foundation models,” CoRR, vol. abs/2108.07258, 2021.
[6] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni,
    E. Benetos, N. Gyenge et al., “MERT: acoustic music understanding model with
    large-scale self-supervised training,” in ICLR. OpenReview.net, 2024.
[7] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov,
    “Large-scale contrastive language-audio pretraining with feature fusion and
    keyword-to-caption augmentation,” in ICASSP. IEEE, 2023, pp. 1–5.
[8] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou,
    “Qwen-audio: Advancing universal audio understanding via unified large-scale
    audio-language models,” CoRR, vol. abs/2311.07919, 2023.
[9] M. Won, Y. Hung, and D. Le, “A foundation model for music informatics,” in
    ICASSP. IEEE, 2024, pp. 1226–1230.
[10] Y. Ma, A. Øland, A. Ragni, B. M. Del Sette, C. Saitis, C. Donahue, C. Lin,
    C. Plachouras, E. Benetos, E. Quinton et al., “Foundation models for music: A
su ey,” CoRR, ol. abs/2408.14340, 2024.
[11] R. Cas ellon, C. Donahue, and P. Liang, “Codi ied au-
dio language modeling lea ns use ul ep esen a ions
o music in o ma ion e ie al,” in ISMIR, 2021, pp.
88–96.
[12] P. Dha iwal, H. Jun, C. Payne, J. W. Kim, A. Rad o d,
and I. Su ske e , “Jukebox: A gene a i e model o
music,” CoRR, ol. abs/2005.00341, 2020.
[13] M. C. McCallum, F. Ko zeniowski, S. O amas,
F. Gouyon, and A. F. Ehmann, “Supe ised and un-
supe ised lea ning o audio ep esen a ions o music
unde s anding,” in ISMIR, 2022, pp. 256–263.
[14] Y. Li, R. Yuan, G. Zhang, Y. Ma, C. Lin, X. Chen,
A. Ragni, H. Yin, Z. Hu, H. He, E. Bene os, N. Gyenge,
R. Liu, and J. Fu, “Map-music2 ec: A simple and e -
ec i e baseline o sel -supe ised music audio ep e-
sen a ion lea ning,” CoRR, ol. abs/2212.02508, 2022.
[15] K. Choi, “Deep neu al ne wo ks o music agging,”
Ph.D. disse a ion, Queen Ma y Uni e si y o London,
UK, 2018.
[16] T. Kim, J. Lee, and J. Nam, “Sample-le el CNN a chi-
ec u es o music au o- agging using aw wa e o ms,”
in ICASSP. IEEE, 2018, pp. 366–370.
[17] M. Won, A. Fe a o, D. Bogdano , and X. Se a, “E al-
ua ion o cnn-based au oma ic music agging models,”
CoRR, ol. abs/2006.00751, 2020.
[18] J. Lee, J. Pa k, K. L. Kim, and J. Nam, “Sample-
le el deep con olu ional neu al ne wo ks o mu-
sic au o- agging using aw wa e o ms,” CoRR, ol.
abs/1703.01789, 2017.
[19] S. He shey, S. Chaudhu i, D. P. W. Ellis, J. F. Gem-
meke, A. Jansen, R. C. Moo e, M. Plakal, D. Pla ,
R. A. Sau ous, B. Seybold e al., “CNN a chi ec u es
o la ge-scale audio classi ica ion,” in ICASSP. IEEE,
2017, pp. 131–135.
[20] J. Pons and X. Se a, “musicnn: P e- ained con olu-
ional neu al ne wo ks o music audio agging,” CoRR,
ol. abs/1909.06654, 2019.
[21] Y. Gong, Y. Chung, and J. R. Glass, “AST: audio spec-
og am ans o me ,” in In e speech. ISCA, 2021, pp.
571–575.
[22] M. Pan eli, “Compu a ional analysis o wo ld music
co po a,” Ph.D. disse a ion, Queen Ma y Uni e si y
o London, UK, 2018.
[23] E. Demi el, B. Bozku , and X. Se a, “Au oma ic
makam ecogni ion using ch oma ea u es,” in 8 h In-
e na ional Wo kshop on Folk Music Analysis, 2018,
pp. 19–24.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
309
[24] K. K. Ganguli, S. Sen ü k, and C. Guedes, “C i-
iquing ask- e sus goal-o ien ed app oaches: A case
o makam ecogni ion,” in ISMIR, 2022, pp. 369–376.
[25] A. K. Sha ma, G. Agga wal, S. Bha dwaj,
P. Chak aba i, T. Chak aba i, J. H. Abawajy,
S. Bha acha yya e al., “Classi ica ion o indian clas-
sical music wi h ime-se ies ma ching deep lea ning
app oach,” IEEE Access, ol. 9, pp. 102 041–102 052,
2021.
[26] B. Nikza and R. C. Repe o, “KDC: an open co pus o
compu a ional esea ch o das g¯
ahi music,” in ISMIR,
2022, pp. 321–328.
[27] D. Han, R. C. Repe o, and D. Jeong, “Finding o i:
Sel -supe ised lea ning o analyzing ko ean olk
song,” in ISMIR, 2023, pp. 440–447.
[28] C. Papaioannou, E. Bene os, and A. Po amianos,
“F om wes o eas : Who can unde s and he music o
he o he s be e ?” in ISMIR, 2023, pp. 311–318.
[29] ——, “LC-P o one s: Mul i-label ew-sho lea ning o
wo ld music audio agging,” IEEE Open Jou nal o
Signal P ocessing, ol. 6, pp. 138–146, 2025.
[30] J. Snell, K. Swe sky, and R. S. Zemel, “P o o ypical
ne wo ks o ew-sho lea ning,” in NIPS, 2017, pp.
4077–4087.
[31] J. Tu ian, J. Shie , H. R. Khan, B. Raj, B. W. Schulle ,
C. J. S einme z, C. Malloy, G. Tzane akis, G. Ve-
la de, K. McNally, M. Hen y, N. Pin o, C. Nou i e al.,
“HEAR: holis ic e alua ion o audio ep esen a ions,”
in Neu IPS (Compe i ion and Demos), se . P oceed-
ings o Machine Lea ning Resea ch, ol. 176. PMLR,
2021, pp. 125–145.
[32] S. Yang, P. Chi, Y. Chuang, C. J. Lai, K. Lakho ia, Y. Y.
Lin, A. T. Liu, J. Shi, X. Chang, G. Lin, T. Huang,
W. Tseng, K. Lee e al., “SUPERB: speech p ocess-
ing uni e sal pe o mance benchma k,” in In e speech.
ISCA, 2021, pp. 1194–1198.
[33] R. Yuan, Y. Ma, Y. Li, G. Zhang, X. Chen, H. Yin,
L. Zhuo, Y. Liu, J. Huang, Z. Tian, B. Deng, N. Wang,
C. Lin, E. Bene os, A. Ragni e al., “MARBLE: music
audio ep esen a ion benchma k o uni e sal e alua-
ion,” in Neu IPS, 2023.
[34] K. Chen, X. Du, B. Zhu, Z. Ma, T. Be g-Ki kpa ick,
and S. Dubno , “HTS-AT: A hie a chical oken-
seman ic audio ans o me o sound classi ica ion and
de ec ion,” in ICASSP. IEEE, 2022, pp. 646–650.
[35] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin,
and B. Guo, “Swin ans o me : Hie a chical ision
ans o me using shi ed windows,” in ICCV. IEEE,
2021, pp. 9992–10 002.
[36] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo,
Y. Leng, Y. L , J. He, J. Lin, C. Zhou, and
J. Zhou, “Qwen2-audio echnical epo ,” CoRR, ol.
abs/2407.10759, 2024.
[37] A. Vaswani, N. Shazee , N. Pa ma , J. Uszko ei ,
L. Jones, A. N. Gomez, L. Kaise , and I. Polosukhin,
“A en ion is All you Need,” in Ad ances in Neu al In-
o ma ion P ocessing Sys ems, 2017.
[38] K. Simonyan and A. Zisse man, “Ve y deep con olu-
ional ne wo ks o la ge-scale image ecogni ion,” in
ICLR, 2015.
[39] E. Law, K. Wes , M. I. Mandel, M. Bay, and J. S.
Downie, “E alua ion o algo i hms using games: The
case o music agging,” in ISMIR, 2009, pp. 387–392.
[40] M. De e a d, K. Benzi, P. Vande gheyns , and
X. B esson, “FMA: A da ase o music analysis,” in
ISMIR, 2017, pp. 316–323.
[41] C. Papaioannou, I. Valian zas, T. Giannakopoulos,
M. A. Kaliaka sos-Papakos as, and A. Po amianos, “A
da ase o g eek adi ional and olk music: Ly a,” in
ISMIR, 2022, pp. 377–383.
[42] X. Se a, “C ea ing esea ch co po a o he compu-
a ional s udy o music: he case o he compmusic
p ojec ,” in Seman ic Audio. Audio Enginee ing So-
cie y, 2014.
[43] B. Uya , H. S. A li, S. Sen ü k, B. Bozku , and
X. Se a, “A co pus o compu a ional esea ch o u k-
ish makam music,” in DL M@JCDL. ACM, 2014, pp.
1–7.
[44] S. Sen ü k, “Compu a ional analysis o audio eco d-
ings and music sco es o he desc ip ion and disco e y
o o oman- u kish makam music,” Ph.D. disse a ion,
Pompeu Fab a Uni e si y, Spain, 2017.
[45] A. S ini asamu hy, G. K. Kodu i, S. Gula i, V. Ishwa ,
and X. Se a, “Co po a o music in o ma ion esea ch
in indian a music,” in ICMC. Michigan Publishing,
2014.
[46] J. Ki kpa ick, R. Pascanu, N. C. Rabinowi z, J. Ve-
ness, G. Desja dins, A. A. Rusu, K. Milan, J. Quan,
T. Ramalho, A. G abska-Ba winska, D. Hassabis,
C. Clopa h, D. Kuma an, and R. Hadsell, “O e com-
ing ca as ophic o ge ing in neu al ne wo ks,” CoRR,
ol. abs/1612.00796, 2016.
[47] K. Gup a, B. Thé ien, A. Ib ahim, M. L. Rich e ,
Q. An hony, E. Belilo sky, I. Rish, and T. Leso ,
“Con inual p e- aining o la ge language models:
How o ( e)wa m you model?” CoRR, ol.
abs/2308.04014, 2023.
[48] D. P. Kingma and J. Ba, “Adam: A me hod o s ochas-
ic op imiza ion,” in ICLR, 2015.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
310
[49] I. Loshchilo and F. Hu e , “Decoupled weigh decay
egula iza ion,” CoRR, ol. abs/1711.05101, 2019.
[50] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li,
S. Wang, L. Wang, and W. Chen, “Lo a: Low- ank
adap a ion o la ge language models,” in ICLR. Open-
Re iew.ne , 2022.
[51] J. Da is and M. Goad ich, “The ela ionship be ween
p ecision- ecall and oc cu es,” in P oceedings o he
23 d in e na ional con e ence on Machine lea ning,
2006, pp. 233–240.
[52] Q. Huang, A. Jansen, J. Lee, R. Gan i, J. Y. Li, and
D. P. W. Ellis, “Mulan: A join embedding o music
audio and na u al language,” in ISMIR, 2022, pp. 559–
566.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
311