
CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning

Author: Angelos-Nikolaos Kanatas; Charilaos Papaioannou; Alexandros Potamianos
Publisher: Zenodo
DOI: 10.5281/zenodo.17706517
Source: https://zenodo.org/records/17706517/files/000064.pdf
CULTUREMERT: CONTINUAL PRE-TRAINING FOR CROSS-CULTURAL
MUSIC REPRESENTATION LEARNING
Angelos-Nikolaos Kanatas¹,², Charilaos Papaioannou¹,³,⁴, Alexandros Potamianos¹,⁴
¹ School of ECE, National Technical University of Athens, Greece
² Institute for Language and Speech Processing, Athena Research Center, Greece
³ Centre for Digital Music, Queen Mary University of London, UK
⁴ Archimedes, Athena Research Center, Greece
ABSTRACT
Recent advances in music foundation models have improved audio representation learning, yet their effectiveness across diverse musical traditions remains limited. We introduce CultureMERT-95M, a multi-culturally adapted foundation model developed to enhance cross-cultural music representation learning and understanding. To achieve this, we propose a two-stage continual pre-training strategy that integrates learning rate re-warming and re-decaying, enabling stable adaptation even with limited computational resources. Training on a 650-hour multi-cultural data mix, comprising Greek, Turkish, and Indian music traditions, results in an average improvement of 4.9% in ROC-AUC and AP across diverse non-Western music auto-tagging tasks, surpassing prior state-of-the-art, with minimal forgetting on Western-centric benchmarks. We further investigate task arithmetic, an alternative approach to multi-cultural adaptation that merges single-culture adapted models in the weight space. Task arithmetic performs on par with our multi-culturally trained model on non-Western auto-tagging tasks and shows no regression on Western datasets. Cross-cultural evaluation reveals that single-culture models transfer with varying effectiveness across musical traditions, whereas the multi-culturally adapted model achieves the best overall performance. To support research on world music representation learning, we publicly release CultureMERT-95M¹ and CultureMERT-TA-95M², fostering the development of more culturally aware music foundation models.
1. INTRODUCTION
Foundation models have recently emerged in the music domain [1–5], offering powerful general-purpose representations learned from large-scale audio data. These models capture broad musical characteristics and have demonstrated state-of-the-art performance across a range of music understanding tasks, reducing the need for task-specific training. By leveraging self-supervised learning (SSL) on large amounts of unlabelled music data, foundation models address data scarcity, reduce annotation costs, and improve generalization in music information retrieval (MIR) [4].

¹ https://huggingface.co/ntua-slp/CultureMERT-95M
² https://huggingface.co/ntua-slp/CultureMERT-TA-95M

© A.-N. Kanatas, C. Papaioannou, and A. Potamianos. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: A.-N. Kanatas, C. Papaioannou, and A. Potamianos, “CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Despite these advances, most existing foundation models for music have been trained primarily on Western-centric datasets, limiting their ability to represent diverse musical styles [6, 7]. Many musical traditions, including Turkish, Indian, and Greek traditional music, feature unique melodic structures, modal or tonal systems, and rhythmic patterns that are not adequately captured by these models [8–10]. Failing to model such culture-specific stylistic elements not only narrows the applicability of music foundation models, for example, in region-specific recommendation systems [11] or cultural heritage preservation, but also overlooks rich, culturally specific knowledge crucial to advancing MIR research [4]. Accordingly, there is an urgent need to develop more inclusive and culturally aware computational models [12, 13], capable of generalizing beyond Western-centric traditions and adapting effectively to diverse underrepresented musical cultures.

One promising avenue for addressing these challenges is continual pre-training (CPT), which has emerged as an effective and increasingly popular approach in large language models (LLMs) [14–22] and multimodal learning [23]. By enabling models to incrementally adapt to new domains, tasks, or languages, CPT avoids the need for full re-training, which is often impractical and computationally expensive [14, 18, 19, 22, 24]. Notably, it has been shown to match, or even surpass, training from scratch in some cases [20, 21], while also converging faster [25] and mitigating catastrophic forgetting [26]. CPT has also gained traction in the audio domain, with recent work demonstrating its effectiveness in adapting pre-trained speech models to both high- and low-resource languages [24, 27–30].

Additionally, model merging [31–34] has proven to be a simple yet effective technique for adapting pre-trained models across multiple domains by combining domain-specific parameters in weight space, without requiring additional training [35] or access to the original training data [36]. A notable method within this paradigm is task arithmetic (TA) [37], which constructs task vectors by computing the difference between the parameters of an adapted model and its pre-trained counterpart, thereby encoding domain-specific knowledge. These task vectors can then be integrated into the pre-trained model via algebraic operations in Euclidean space to create a unified model from multiple independently adapted models.

While both continual pre-training and task arithmetic have been widely explored in other domains, their application to MIR remains largely unexplored. We bridge this gap by leveraging these techniques to adapt the MERT-v1-95M music foundation model [1], originally trained on 1K hours of predominantly Western music [1, 38], to diverse musical cultures from the Eastern Mediterranean and the Indian subcontinent, while preserving performance on "Western"-centric benchmarks.

We summarize our main contributions as follows:
1. To the best of our knowledge, this is the first study to explore continual pre-training and task arithmetic for cross-cultural adaptation in MIR, demonstrating their effectiveness in music audio representation learning.
2. We propose a two-stage CPT strategy that stabilizes training, mitigates catastrophic forgetting, and facilitates effective adaptation under constrained computational resources.
3. Our multi-cultural model, CultureMERT, outperforms the original MERT-v1 by an average of 4.9% across ROC-AUC and AP on culturally diverse non-Western music tagging tasks, while exhibiting minimal forgetting on Western benchmarks.
4. Our culturally adapted models surpass previous state-of-the-art results across all evaluated non-Western music tagging tasks.
5. We analyze cross-cultural transferability, showing that single-culture adaptations exhibit varying degrees of transfer across cultural domains.

To support reproducibility and further research in cross-cultural music representation learning, we publicly release CultureMERT-95M, along with the task arithmetic variant, CultureMERT-TA-95M.
2. DATASETS
For our experiments, we use a diverse set of music datasets spanning both Western and non-Western traditions. Specifically, we adopt the MagnaTagATune (MTAT) [39] and FMA-medium [40] datasets to represent "Western"³ music. For "non-Western" traditions, we incorporate the Lyra corpus [41], featuring Greek traditional and folk music, along with three collections from the CompMusic Corpora⁴ [42]: Turkish-makam [43, 44], which, together with Lyra, represents music of the Eastern Mediterranean; and Hindustani and Carnatic music [45], representing North and South Indian classical traditions, respectively.

³ We use the term “Western” to refer to music styles predominantly rooted in Western cultures, including pop, rock, and Western classical.
⁴ https://compmusic.upf.edu/corpora

We assess our models on both Western and non-Western music tagging tasks for cross-cultural evaluation, using standard multi-label classification metrics, including the area under the receiver operating characteristic curve (ROC-AUC) and average precision (AP). Following [46, 47], we utilize the top-k tags relevant to each dataset: 50 tags for MTAT (spanning genre, instruments, and mood), 20 hierarchical genre tags for FMA-medium, 30 tags for Turkish-makam (covering makam, usul, and instruments), 20 tags for Hindustani and Carnatic (primarily reflecting raga, tala, instruments, and forms), and 30 tags for Lyra (related to genre, place, and instruments).

All audio is resampled to 24 kHz, and we adopt the same data splits as [46]. To prepare our data for continual pre-training, we extract 30-second segments from each training split of the non-Western datasets. Given the varying dataset sizes, we balance the pre-training duration across cultures to ensure proportional representation by extracting 200 hours each from the Turkish-makam, Carnatic, and Hindustani datasets, and 50 hours from Lyra due to its smaller size. Additionally, we combine these subsets to construct a unified 650-hour dataset integrating all four traditions for multi-cultural continual pre-training.
3. METHOD
The overall framework of our approach is illustrated in Figure 1, which depicts the two-stage continual pre-training strategy for CultureMERT. In this section, we first review the architecture and pre-training objective of MERT, and then present our CPT strategy for cultural adaptation. Finally, we investigate task arithmetic, an alternative approach to multi-cultural adaptation that merges culturally specialized models in weight space to construct a unified multi-cultural model, CultureMERT-TA.
3.1 MERT Pre-Training Objective
Our continual pre-training objective follows the self-supervised masked language modeling (MLM) objective of MERT^{RVQ-VAE}, where two teacher models provide the pseudo-labels: (i) an acoustic teacher, the EnCodec model [48], which discretizes audio into tokens from $K = 8$ residual vector quantization (RVQ) codebooks, each containing $C = 1024$ codewords, and (ii) a musical teacher, based on constant-Q transform (CQT) spectrogram reconstruction, encoding pitch and harmonic structure.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

MERT-v1-95M follows the HuBERT architecture [49], comprising a CNN-based feature extractor that encodes raw 24 kHz waveforms into 75 Hz frame-level representations, followed by a 12-layer Transformer encoder, producing 768-dimensional contextual embeddings. During training, a subset of frame embeddings is masked, and the model is optimized using a multi-task learning (MTL) objective, combining masked acoustic token prediction and spectrogram reconstruction. The overall training objective is:

$$\mathcal{L} = \alpha \mathcal{L}_{\text{RVQ}} + \mathcal{L}_{\text{CQT}}, \tag{1}$$

where the acoustic MLM loss $\mathcal{L}_{\text{RVQ}}$ encourages the model to predict masked RVQ-VAE tokens from $K$ codebooks, using a noise-contrastive estimation (NCE) loss:

$$\mathcal{L}_{\text{RVQ}} = \sum_{k=1}^{K} \sum_{t \in M} \log p_\theta(c_{t,k} \mid x'_t), \tag{2}$$

with $M$ denoting the set of masked time frames, $c_{t,k}$ the ground-truth discrete codeword from the $k$-th codebook at time frame $t$ extracted via the EnCodec tokenizer, and $p_\theta$ the model’s predicted token distribution:

$$p_\theta(c \mid x'_t) = \frac{\exp(\mathrm{sim}(T(o_t), e_c)/\tau)}{\sum_{c'=1}^{C} \exp(\mathrm{sim}(T(o_t), e_{c'})/\tau)}. \tag{3}$$

Here, $x'_t$ is the masked input feature, $o_t$ is the model’s output representation, $T(o_t)$ projects it to the codeword embedding space, $e_c$ is the embedding of codeword $c \in C_k$, where $k \in \{1, \dots, K\}$, $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, and $\tau = 0.1$ is a temperature scaling parameter.

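The CQT reconstruction loss $\mathcal{L}_{\text{CQT}}$ minimizes the mean squared error (MSE) between the model’s predicted, $\hat{z}_{\text{CQT},t}$, and ground-truth, $z_{\text{CQT},t}$, frame-level CQT features:

$$\mathcal{L}_{\text{CQT}} = \sum_{t \in M} \lVert z_{\text{CQT},t} - \hat{z}_{\text{CQT},t} \rVert_2^2. \tag{4}$$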
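
To make Equation (3) concrete, the following is a minimal pure-Python sketch of the temperature-scaled softmax over cosine similarities for a single masked frame and a single codebook. It is an illustration under stated assumptions, not the MERT or FAIRSEQ implementation: the function names, the toy two-dimensional codebook, and the toy output vector are hypothetical. Equation (2) is written as a sum of log-probabilities; the sketch returns the negative log-probability, which is the quantity one would minimize, and that sign convention is an assumption here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def token_distribution(projected, codebook, tau=0.1):
    """Eq. (3): softmax over cosine similarities divided by the temperature tau."""
    logits = [cosine(projected, e) / tau for e in codebook]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

def masked_token_nll(projected, codebook, target, tau=0.1):
    """Negative log-probability of the ground-truth codeword (assumed sign convention)."""
    return -math.log(token_distribution(projected, codebook, tau)[target])

# Hypothetical toy data: C = 3 codeword embeddings and one projected output T(o_t).
toy_codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_output = [0.9, 0.1]
probs = token_distribution(toy_output, toy_codebook)
```

With $\tau = 0.1$ the distribution is sharply peaked on the codeword whose embedding is closest in angle to the projected output, which is why a small temperature makes the prediction task more discriminative.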
3.2 Two-Stage Continual Pre-Training Strategy
To adapt the MERT foundation model to diverse musical traditions, we employ continual pre-training, which extends the training of a pre-trained model on new data, aiming to adapt it to a shifted domain or task while retaining prior knowledge, without re-training from scratch. In our case, this involves continually pre-training the MERT-v1-95M model, using the same pre-training objective, on culturally diverse data that introduce a significant distribution shift, as it was initially trained on predominantly Western music [1, 38]. Given this shift, naively continuing to train the model, i.e., adapting all parameters at once without resetting the learning rate, can lead to catastrophic forgetting [50] and poor adaptation [14], as confirmed by our preliminary experiments (see Table 1). To address this, we propose a two-stage strategy that stabilizes training through: (i) learning rate re-warming and re-decaying [14, 19, 21, 23, 51], and (ii) staged adaptation.

Staged Adaptation. In our preliminary experiments, we observed an initial performance drop during CPT, followed by a slow recovery phase, a phenomenon known as the stability gap [23, 52, 53]. This instability arises due to the abrupt adaptation of model parameters to a substantially shifted data distribution, which can temporarily degrade previously learned representations before stabilizing. To mitigate this, rather than full-parameter adaptation on the entire dataset in a single epoch, which induces a large plasticity gradient for a long period [53], we split training into two stages to reduce instability and ensure smoother adaptation, as illustrated in Figure 1:

[Figure 1 diagram. Stage 1: learning rate re-warming with 10% warm-up; 100-hour multi-cultural audio data (20% Music4All); MERT with 1D convolution feature extractor, codeword embeddings, and a frozen (❄) Transformer encoder; music MLM loss and acoustic MLM loss. Stage 2: learning rate re-warming with 1% warm-up; 650-hour multi-cultural audio data; MERT with 1D convolution feature extractor, Transformer encoder, and codeword embeddings; music MLM loss and acoustic MLM loss.]

Figure 1: Two-Stage Continual Pre-Training Strategy for CultureMERT. In Stage 1, a subset of parameters is trained on 100h of multi-cultural data with 20% Western music for stabilization. In Stage 2, all parameters are unfrozen and trained on the full 650h dataset. Learning rate re-warming and re-decaying is applied in both stages.

Stage 1 (Stabilization Phase): We first train on a smaller data subset [52], updating only the CNN-based feature extractor and the codeword embedding layer while keeping the Transformer encoder frozen. To reduce the distribution gap and mitigate forgetting [19, 25, 28], we incorporate a fraction of Music4All data [54], which is primarily of Western origin, into the pre-training mix, accounting for 20% of the total training data (Western replay).

Stage 2 (Full Adaptation): We unfreeze the Transformer encoder and continue training on the full dataset.

| CPT Strategy | Western Replay | Turkish-makam | MTAT |
|---|---|---|---|
| MERT-v1 (Baseline) | - | 83.2 | 89.6 |
| Single-stage | ✓ | 83.8 | 86.0 |
| Single-stage (no re-warm) | ✓ | 83.0 | 87.5 |
| Two-stage (Ours) | Stage 1 | 89.6 | 89.2 |
| Two-stage (Ours) | Both stages | 88.6 | 89.4 |

Table 1: CPT Strategy Comparison. ROC-AUC scores on Turkish-makam and MTAT datasets. Two-stage CPT outperforms single-stage adaptation, with Western replay limited to Stage 1 yielding the best trade-off between cultural adaptation and knowledge retention.

This two-stage approach is particularly motivated by computational constraints, specifically the batch size mismatch between pre-training and adaptation. MERT-v1-95M was originally trained with batch sizes of 1.5 hours per step, whereas we use a significantly smaller effective batch size of 160 seconds per step due to memory limitations. Training with this reduced batch size directly on the entire dataset with full-parameter adaptation resulted in unstable training and frequent crashes, degrading performance on both Western and non-Western benchmarks. By structuring adaptation in two stages, we strike a balance between plasticity (adaptation to non-Western traditions) and stability (retaining knowledge on Western datasets), a challenge known as the stability-plasticity dilemma [55, 56]. Intuitively, the initial stabilization phase allows lower-level acoustic representations, captured by the CNN-based feature extractor and the codeword embeddings, to adapt first and calibrate to the shifted distribution before updating high-level Transformer representations.

Learning Rate Re-Warming. To further improve adaptation stability, we apply learning rate re-warming and re-decaying in both stages. Prior work has shown that resetting the learning rate schedule, i.e., re-warming the model, during continual pre-training is crucial for preventing poor convergence and mitigating catastrophic forgetting [14, 19, 21, 23, 51]. In Stage 1, we adopt a moderately aggressive warm-up and decay schedule to encourage early adaptation of low-level representations. In Stage 2, a less aggressive schedule balances plasticity and stability during full-model training, while also reducing training instabilities.

Following this two-stage CPT strategy, we develop two types of culturally adapted models: (i) a multi-culturally adapted model, CultureMERT, trained on a culturally diverse mix spanning all four non-Western musical traditions; and (ii) single-culture adapted models, each continually pre-trained on data from a single tradition, resulting in MakamMERT, HindustaniMERT, CarnaticMERT, and LyraMERT.
3.3 Task Arithmetic for Cross-Cultural Adaptation
As an alternative to continual pre-training on multi-cultural data, we explore task arithmetic [37], a model merging method that combines culturally specialized models in weight space to construct a unified multi-cultural model. Task arithmetic operates by algebraically merging model parameters through task vector addition and negation.

In our setting, we obtain task vectors by computing the element-wise difference between the parameters of the single-culture continually pre-trained models and those of the MERT-v1 model. Formally, given the pre-trained base model with parameters $\theta_{\text{pre}}$ and a continually pre-trained model $\theta_i$ adapted to a cultural dataset $D_i$, the task vector for culture $i$ is given by $\tau_i = \theta_i - \theta_{\text{pre}}$, capturing the parameter shift induced by culture-specific adaptation.

For multi-cultural adaptation, we construct a unified model $\theta_{\text{merged}}$ by merging $N$ single-culture adapted models via task arithmetic, summing their respective task vectors $\tau_i$ with corresponding scaling factors $\lambda_i$:

$$\theta_{\text{merged}} = \theta_{\text{pre}} + \sum_{i=1}^{N} \lambda_i \tau_i, \tag{5}$$

where $\lambda_i \in \mathbb{R}$ are scalar hyperparameters that control the contribution of each task vector. Prior work typically uses a single scaling factor $\lambda$ for all task vectors, i.e., $\lambda_i = \lambda$, $\forall i$. In the special case where $\lambda = 1/N$, Equation 5 simplifies to weight averaging [31, 33, 34], in which the adapted models are merged by directly averaging their parameters.
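
As a rough illustration of Equation (5) with a shared scaling factor, the sketch below merges toy "models" represented as dictionaries mapping parameter names to lists of floats. The representation, the function names, and the toy values are assumptions made for illustration; real checkpoints would hold tensors, and this is not the authors' merging code.

```python
def task_vector(adapted, base):
    """tau_i = theta_i - theta_pre, computed element-wise per named parameter."""
    return {name: [a - b for a, b in zip(adapted[name], base[name])] for name in base}

def merge(base, adapted_models, lam):
    """Eq. (5) with lambda_i = lam for all i: theta_pre + lam * sum_i tau_i."""
    merged = {name: list(values) for name, values in base.items()}
    for adapted in adapted_models:
        tau = task_vector(adapted, base)
        for name in merged:
            merged[name] = [m + lam * t for m, t in zip(merged[name], tau[name])]
    return merged

# Hypothetical toy parameters: one base model and N = 2 adapted models.
base = {"w": [1.0, 2.0]}
model_a = {"w": [2.0, 2.0]}
model_b = {"w": [1.0, 4.0]}

# With lam = 1/N the merge reduces to averaging the adapted parameters.
averaged = merge(base, [model_a, model_b], lam=0.5)
```

Here `averaged["w"]` equals the element-wise mean of `model_a["w"]` and `model_b["w"]`, matching the weight-averaging special case; with four single-culture models the corresponding value is $\lambda = 0.25$.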
4. EXPERIMENTS
4.1 Implementation Details
In all continual pre-training setups, we initialize our models from the publicly available MERT-v1-95M⁵ pre-trained checkpoint. Training was conducted using the FAIRSEQ⁶ framework on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB of memory. All models were trained with half-precision (FP16), using 5-second audio segments as input context, randomly cropped from the extracted 30-second pre-training audio data. The weight of the acoustic loss in the pre-training objective is set to $\alpha = 10.0$. The EnCodec neural audio codec (NAC) model [48], which tokenizes audio into discrete codewords, remains frozen during continual pre-training, as in [1]. To enhance representation robustness, we apply in-batch noise mixture augmentation with a mixup probability of 0.5, and use pre-layer normalization (Pre-LN) [57] for training stability, following [1]. Other training settings mirror those of the MERT-v1-95M setup.
4.2 Probing-Based Evaluation
Following [1, 2, 58], we adopt a probing-based evaluation rather than fine-tuning, keeping the pre-trained models frozen as deep feature extractors while training only a shallow multilayer perceptron (MLP) with a single 512-dimensional hidden layer for sequence-level tasks. Our evaluation follows the MARBLE protocol [59] under constrained settings, and we apply it to both Western and non-Western music tagging tasks for cross-cultural evaluation. To process long-duration audio files, we segment them into 30-second chunks using a sliding window approach and aggregate the chunk-level predictions by averaging to obtain the final prediction for the entire audio file. For Turkish-makam, Hindustani, and Carnatic tasks, we apply a maximum duration cut as in [46] to ensure comparability with prior state-of-the-art results.
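
The chunk-and-average step described above can be sketched as follows. This is an illustrative outline only: the helper names are hypothetical, and the hop size is an assumption (the text specifies 30-second chunks and a sliding window but not the overlap, so non-overlapping windows are used here).

```python
def chunk_starts(duration_s, chunk_s=30.0, hop_s=30.0):
    """Start times (seconds) of sliding-window chunks covering one recording.

    hop_s equal to chunk_s gives non-overlapping windows (an assumption).
    """
    starts = []
    t = 0.0
    while t < duration_s:
        starts.append(t)
        t += hop_s
    return starts

def aggregate(chunk_probs):
    """Average per-tag probabilities over all chunks of one recording."""
    n = len(chunk_probs)
    return [sum(column) / n for column in zip(*chunk_probs)]

# Hypothetical example: a 95-second file and two chunk-level predictions for 2 tags.
starts = chunk_starts(95.0)
file_level = aggregate([[0.2, 0.8], [0.4, 0.6]])
```

The file-level vector returned by `aggregate` is what would be scored against the multi-label ground truth with ROC-AUC and AP.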
4.3 Continual Pre-Training Settings
Multi-Cultural CPT. In Stage 1, training runs for 2,250 steps with a 10% linear warm-up period, using 100 hours of the dataset. Optimization follows AdamW [61] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-5}$. Training employs an effective batch size of 32 recordings (160 seconds), with gradient accumulation over 8 steps. The maximum learning rate is set to $\eta_{\max} = 5 \times 10^{-4}$, followed by a cosine decay to a minimum of $\eta_{\min} = 5 \times 10^{-5}$. Gradient clipping is applied with a norm of 1.0 to prevent exploding gradients. In Stage 2, training extends to 14,625 steps with a 1% warm-up period, using the full 650-hour dataset. Optimization follows AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and $\epsilon = 10^{-5}$, maintaining the same batch size as Stage 1. The learning rate decays from a maximum value of $\eta_{\max} = 5 \times 10^{-5}$ to $\eta_{\min} = 5 \times 10^{-6}$. Gradient clipping remains at 1.0.
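
The re-warming and re-decaying schedule with the values above can be sketched as a plain function of the step index. This is a hedged illustration, not the FAIRSEQ scheduler used in the experiments: warming up linearly from near zero, rounding the warm-up length to whole steps, and using a cosine decay in Stage 2 (the text states cosine decay explicitly only for Stage 1) are all assumptions.

```python
import math

def learning_rate(step, total_steps, warmup_frac, lr_max, lr_min):
    """Linear warm-up to lr_max, then cosine decay to lr_min."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches lr_max on the last warm-up step.
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Values reported for multi-cultural CPT; each stage restarts the schedule.
STAGE_1 = dict(total_steps=2250, warmup_frac=0.10, lr_max=5e-4, lr_min=5e-5)
STAGE_2 = dict(total_steps=14625, warmup_frac=0.01, lr_max=5e-5, lr_min=5e-6)

stage_1_curve = [learning_rate(s, **STAGE_1) for s in range(STAGE_1["total_steps"])]
```

Because each stage calls the function with its own settings, the learning rate is re-warmed at the start of Stage 2 rather than continuing from the end of the Stage 1 decay.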
Single-Culture CPT. In Stage 1, we train on 60 hours for a total of 1,350 training steps. In Stage 2, we expand training to the full 200-hour dataset for 4,500 steps. We employ the same optimizers, batch size, and learning rate schedules as in the multi-cultural CPT. For Lyra, due to its smaller size (50 hours), we train on 20 hours in Stage 1 (450 steps) and then on the full dataset in Stage 2 (1,125 steps).

⁵ https://huggingface.co/m-a-p/MERT-v1-95M
⁶ https://github.com/facebookresearch/fairseq

| Model | Turkish-makam ROC | Turkish-makam AP | Hindustani ROC | Hindustani AP | Carnatic ROC | Carnatic AP | Lyra ROC | Lyra AP | FMA-medium ROC | FMA-medium AP | MagnaTagATune ROC | MagnaTagATune AP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MERT-v1 | 83.2 ±0.08 | 53.3 ±0.12 | 82.4 ±0.04 | 52.9 ±0.19 | 74.9 ±0.05 | 39.7 ±0.15 | 85.7 ±0.10 | 56.5 ±0.18 | 90.7 ±0.04 | 48.1 ±0.11 | 89.6 ±0.07 | 35.9 ±0.15 | 66.1 |
| MakamMERT | 88.7 ±0.11 | 58.8 ±0.22 | 84.5 ±0.16 | 57.8 ±0.18 | 77.6 ±0.14 | 42.7 ±0.16 | 84.6 ±0.12 | 53.2 ±0.17 | 90.3 ±0.12 | 47.1 ±0.16 | 89.0 ±0.07 | 35.6 ±0.12 | 67.5 |
| CarnaticMERT | 88.4 ±0.06 | 58.4 ±0.16 | 87.0 ±0.06 | 60.2 ±0.14 | 78.8 ±0.13 | 44.0 ±0.17 | 85.4 ±0.11 | 55.8 ±0.16 | 90.2 ±0.10 | 46.7 ±0.09 | 89.2 ±0.10 | 35.3 ±0.11 | 68.3 |
| HindustaniMERT | 88.3 ±0.12 | 58.2 ±0.16 | 87.4 ±0.11 | 60.3 ±0.16 | 77.0 ±0.12 | 42.7 ±0.16 | 84.2 ±0.13 | 52.0 ±0.15 | 90.2 ±0.13 | 46.1 ±0.10 | 89.1 ±0.09 | 35.8 ±0.13 | 67.6 |
| LyraMERT | 86.7 ±0.07 | 56.8 ±0.13 | 85.9 ±0.08 | 57.4 ±0.13 | 76.4 ±0.09 | 40.1 ±0.13 | 85.0 ±0.11 | 53.5 ±0.14 | 90.0 ±0.08 | 46.0 ±0.16 | 88.9 ±0.05 | 35.1 ±0.14 | 66.8 |
| CultureMERT | 89.6 ±0.09 | 60.6 ±0.21 | 88.2 ±0.20 | 63.5 ±0.24 | 79.2 ±0.18 | 43.1 ±0.22 | 86.9 ±0.10 | 56.7 ±0.20 | 90.7 ±0.09 | 48.1 ±0.13 | 89.4 ±0.09 | 35.9 ±0.16 | 69.3 |
| CultureMERT-TA | 89.0 ±0.12 | 61.0 ±0.18 | 87.5 ±0.10 | 59.3 ±0.13 | 79.1 ±0.11 | 43.3 ±0.13 | 87.3 ±0.08 | 57.3 ±0.19 | 90.8 ±0.06 | 49.1 ±0.15 | 89.6 ±0.10 | 36.4 ±0.14 | 69.1 |
| (Previous) SOTA | 87.7 [46] | 57.7 [46] | 86.5 [46] | 63.1 [46] | 77.0 [46] | 43.9 [46] | 85.4 [46] | 54.3 [46] | 92.4 [46] | 53.7 [46] | 92.7 [60] | 41.4 [58] | - |

Table 2: Evaluation Results (ROC-AUC and AP) of Pre-Trained and Culturally Adapted MERT Models on Diverse Music Auto-Tagging Tasks. We report averages across five random seeds, with standard deviations given after ±. The "Avg." column represents the average performance across all datasets and evaluation metrics for each model. The results highlight the impact of multi-cultural CPT and model merging via task arithmetic on cross-cultural adaptation and transfer.

[Figure 2: plot comparing MERT-v1, CultureMERT, MakamMERT, CarnaticMERT, HindustaniMERT, LyraMERT, and CultureMERT-TA across Turkish-makam, Hindustani, Carnatic, Lyra, FMA-medium, and MagnaTagATune.]

Figure 2: Cross-Cultural Transferability. Relative ROC-AUC performance across datasets, highlighting key trends in cross-cultural transfer. CultureMERT generalizes well to non-Western datasets, while task arithmetic performs on par in these settings and even surpasses both the pre-trained and multi-culturally adapted models on Western benchmarks (FMA-medium, MTAT) and Lyra.
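
As a quick consistency check on the step counts in Section 4.3, the short calculation below converts steps to hours of audio. It assumes that every optimizer step consumes the full effective batch of 160 seconds (32 recordings of 5 seconds each); under that assumption the reported step counts correspond to one pass over each stage's data. The dictionary keys and helper name are hypothetical labels for this illustration.

```python
SECONDS_PER_STEP = 32 * 5  # 32 recordings x 5-second crops = 160 s per step

def hours_covered(steps, seconds_per_step=SECONDS_PER_STEP):
    """Hours of audio consumed after `steps` optimizer steps."""
    return steps * seconds_per_step / 3600

# (reported steps, reported hours of training data) for each setup and stage.
REPORTED = {
    "multi-cultural, stage 1": (2250, 100),
    "multi-cultural, stage 2": (14625, 650),
    "single-culture, stage 1": (1350, 60),
    "single-culture, stage 2": (4500, 200),
    "Lyra, stage 1": (450, 20),
    "Lyra, stage 2": (1125, 50),
}

computed = {name: hours_covered(steps) for name, (steps, _) in REPORTED.items()}
```

For every entry, the computed hours match the reported dataset size, which suggests that each stage amounts to a single epoch under the stated assumption.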
5. RESULTS AND DISCUSSION
As shown in Table 2, CultureMERT, adapted via multi-cultural continual pre-training, consistently outperforms the original MERT-v1 model across all non-Western tasks and evaluation metrics, achieving an average improvement of 4.9%. It also surpasses the single-culture adapted models on average, suggesting that incorporating culturally diverse data during CPT benefits all non-Western traditions by improving the quality of representations for each individual culture, thereby enhancing generalization. Notably, CultureMERT achieves this with minimal forgetting on Western benchmarks (0.05% average drop across ROC-AUC and AP), demonstrating the efficacy of our approach. We further observe that single-culture adapted models tend to perform best on their respective in-domain tasks for well-resourced traditions, reaffirming the effectiveness of CPT for domain-specific adaptation [18]. However, even low-resource adaptation, as in the case of LyraMERT trained on just 50 hours, leads to noticeable gains across other non-Western tasks, indicating that even limited cultural exposure can significantly boost cross-cultural generalization. Moreover, task arithmetic performs comparably to CultureMERT on non-Western tasks and even surpasses it on Western benchmarks and Lyra, demonstrating that weight-space merging of culturally specialized models can serve as an effective, training-free alternative to multi-cultural CPT, provided such models are available. Interestingly, it also outperforms the unadapted base model by 0.4% on average across Western tasks. Notably, only the multi-cultural models, CultureMERT and CultureMERT-TA, outperform MERT-v1 on Lyra, where the latter already serves as a strong baseline. This further underscores the effectiveness of multi-cultural adaptation, particularly in low-resource and transfer settings. Finally, CultureMERT and CultureMERT-TA surpass previous state-of-the-art (SOTA) results on all non-Western music tagging tasks, with the best task arithmetic variant obtained using $\lambda = 0.2$ (see Figure 4).
5.1 Cross-Cultural Transfer
As illustrated in Figure 2, continual pre-training on one musical tradition can benefit others to varying degrees, revealing asymmetries in cross-cultural transfer effectiveness. For instance, we observe strong transfer between Turkish-makam and Carnatic music, with models adapted to either tradition generalizing well to the other. This aligns with their shared theoretical foundations as modal frameworks that emphasize microtonality and improvisation, serving similar roles in their respective cultures [62]. Additionally, the strong performance of the Carnatic-adapted model on the Hindustani domain reinforces the musical proximity between these traditions, particularly in their shared use of raga (melodic mode) and tala (rhythmic framework) [10]. Interestingly, the model adapted to Carnatic music appears to be the most consistently transferable among single-culture adaptations, achieving strong results not only within Indian classical traditions but also generalizing well to Turkish-makam and Lyra.
5.2 Token-Level Culture Similarity
To further examine cross-cultural similarities in our data, we analyze token overlap across musical traditions using both the Jensen-Shannon divergence (JSD) and cosine distance between token distributions extracted from the EnCodec model [48], which serves as our audio tokenizer. Lower values in both metrics indicate greater similarity. Our analysis, as shown in Figure 3, reveals strong token-level similarity among non-Western traditions, particularly between Hindustani and Carnatic music. In contrast, Western datasets (MTAT, FMA-medium) are highly similar to each other but notably dissimilar from non-Western traditions. Greek traditional music (Lyra), while distinct, aligns more closely with non-Western traditions than Western ones. Interestingly, these findings correlate with our results on cross-cultural transfer (Section 5.1), suggesting that token-level similarity metrics can serve as predictors of positive cross-cultural transfer. This insight has practical implications: such similarity metrics can guide the selection and refinement of pre-training data mixtures during CPT, or inform the adjustment of arithmetic operations when merging models via task arithmetic. Similar approaches to quantifying language similarity and predicting positive cross-lingual transfer, based on the similarity of extracted linguistic or acoustic tokens, have been explored in both the text [17, 63] and speech domains [29].
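
The two similarity measures can be sketched for a single codebook as follows. This is a minimal illustration, not the analysis code behind Figure 3: the helper names are hypothetical, the base-2 logarithm (which bounds the JSD in [0, 1]) is an assumption, and in the setting described above the per-codebook scores would additionally be averaged over the 8 EnCodec codebooks.

```python
import math
from collections import Counter

def token_histogram(tokens, vocab_size):
    """Normalized frequency of each codeword index in a token sequence."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts.get(i, 0) / total for i in range(vocab_size)]

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i = 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine_distance(p, q):
    """One minus the cosine similarity of two distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

# Hypothetical token sequences from two corpora over a 4-codeword vocabulary.
dist_a = token_histogram([0, 0, 1, 2], vocab_size=4)
dist_b = token_histogram([0, 1, 1, 3], vocab_size=4)
jsd_ab = js_divergence(dist_a, dist_b)
cos_ab = cosine_distance(dist_a, dist_b)
```

Both functions are symmetric in their arguments, so a single value per dataset pair and metric suffices, which is consistent with presenting the two metrics in the two triangles of one matrix.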
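[Figure 3: heatmap of pairwise distances between the token distributions of Turkish-makam, Hindustani, Carnatic, Lyra, MTAT, and FMA-medium, with one triangle per metric (Jensen-Shannon Divergence, Cosine Distance). Values as extracted, with the lower-triangle block first and the upper-triangle block second, following the order of the two metric labels.]

Lower triangle (each row lists its values against the datasets that precede it in the order above):
Hindustani: 13.8%
Carnatic: 12.8%, 10.1%
Lyra: 18.6%, 14.5%, 16.1%
MTAT: 22.5%, 19.8%, 22.6%, 19.4%
FMA-medium: 19.2%, 18.0%, 19.6%, 19.0%, 7.4%

Upper triangle (each row lists its values against the datasets that follow it in the order above):
Turkish-makam: 8.6%, 6.9%, 16.1%, 19.8%, 13.8%
Hindustani: 3.6%, 8.7%, 14.5%, 10.9%
Carnatic: 10.9%, 18.3%, 12.9%
Lyra: 13.0%, 11.9%
MTAT: 2.1%

Figure 3: Token Similarity Across Cultures. Pairwise similarity between acoustic token distributions extracted from the EnCodec NAC model [48]. Similarity scores are averaged across 8 codebooks, each containing 1024 discrete codewords (acoustic pseudo-tokens).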
5.3 Task Arithmetic Scaling Factor
A key consideration in task arithmetic is the choice of the scaling factor $\lambda$, which controls the balance between task vectors. Prior work [64, 65] has shown that suboptimal values can significantly degrade performance in multi-task model merging. We systematically evaluate different values of a shared scaling factor $\lambda \in \{0.1, 0.2, 0.25, 0.3, 0.5, 0.75, 1.0\}$, applied uniformly across all task vectors, including the special case of weight averaging ($\lambda = 0.25$). We similarly observe a consistent trend: ill-suited values, such as $\lambda = 1.0$, result in poor performance across all benchmarks, as shown in Figure 4.
[Figure 4: plot of ROC-AUC (%), axis range 70 to 90, against the scaling factor (0.1, 0.2, 0.25, 0.3, 0.5, 0.75, 1.0), with one series each for Turkish-makam, Hindustani, Carnatic, Lyra, MagnaTagATune, and FMA-medium.]

Figure 4: Effect of Scaling Factor $\lambda$ on Task Arithmetic Performance. The ROC-AUC scores across six diverse music tagging tasks demonstrate how varying $\lambda$ impacts task arithmetic when merging the four non-Western single-culture adapted models.
6. CONCLUSIONS
In this paper, we introduce CultureMERT-95M, a multi-culturally adapted music foundation model developed via continual pre-training on diverse non-Western musical traditions. We propose a two-stage CPT strategy that incorporates learning rate re-warming and staged adaptation for stable training. Cross-cultural evaluation demonstrates that CultureMERT-95M consistently outperforms the base MERT-v1-95M model on non-Western music tagging tasks, surpassing prior state-of-the-art methods while preserving performance on Western benchmarks. Additionally, we investigate task arithmetic, which offers a strong alternative to multi-cultural CPT by effectively merging culturally specialized models in weight space.

While our results are promising, several limitations remain. The frozen EnCodec tokenizer used in the MERT architecture may be suboptimal for encoding culturally diverse musical languages, as it was pre-trained on Western music. Future directions include scaling to additional musical cultures, exploring alternative architectures, extending evaluation beyond sequence-level classification tasks, conducting fine-grained ablation studies, and investigating whether the proposed two-stage CPT strategy remains necessary under less constrained computational budgets.
7. ETHICS STATEMENT
7.1 Cultural Framing and Interpretive Scope
We acknowledge the limitations of framing music within a "Western" versus "non-Western" dichotomy. While such terminology is commonly used in computational research for convenience, it risks oversimplifying the rich diversity of global musical traditions. Furthermore, this work does not aim to establish or analyze cross-cultural similarities from an ethnomusicological perspective. Our analysis of cross-cultural transferability should be considered in light of potential limitations in the representativeness and coverage of the corpora used.
7.2 Responsible Use
Careful consideration is advised before deploying these models in real-world contexts, as they may still reflect cultural and dataset-specific biases. Some of the datasets used in this work are not publicly available and were obtained under research-use agreements. The released models should not be used in commercial or generative applications without explicit attention to cultural representation, appropriate licensing, and the consent of the relevant communities or dataset curators.
8. ACKNOWLEDGMENTS
We would like to thank the reviewers and Georgios Paraskevopoulos for their valuable and constructive feedback, which helped us improve this work. We also gratefully acknowledge the Music Technology Group (MTG) at Universitat Pompeu Fabra for providing access to datasets used in this study. This work has been partially supported by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0 funded by the European Union under the NextGenerationEU Program.
9. REFERENCES
[1] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin,
C. Xiao, C. Lin, A. Ragni, E. Bene os e al., “MERT:
acous ic music unde s anding model wi h la ge-scale
sel -supe ised aining,” in The Twel h In e na ional
Con e ence on Lea ning Rep esen a ions, ICLR 2024,
Vienna, Aus ia, May 7-11, 2024. OpenRe iew.ne ,
2024.
[2] M. Won, Y. Hung, and D. Le, “A ounda ion model o
music in o ma ics,” in IEEE In e na ional Con e ence
on Acous ics, Speech and Signal P ocessing, ICASSP
2024, Seoul, Republic o Ko ea, Ap il 14-19, 2024.
IEEE, 2024, pp. 1226–1230.
[3] P. Dha iwal, H. Jun, C. Payne, J. W. Kim, A. Rad o d,
and I. Su ske e , “Jukebox: A gene a i e model o
music,” CoRR, ol. abs/2005.00341, 2020.
[4] Y. Ma, A. Øland, A. Ragni, B. M. D. Se e, C. Sai is,
C. Donahue, C. Lin, C. Plachou as, E. Bene os,
E. Quin on e al., “Founda ion models o music: A
su ey,” CoRR, ol. abs/2408.14340, 2024.
[5] W. Li, Y. Cai, Z. Wu, W. Zhang, Y. Chen, R. Qi,
M. Dong, P. Chen, X. Dong, F. Shi e al., “A su ey
o ounda ion models o music unde s anding,” CoRR,
ol. abs/2409.09601, 2024.
[6] E. Gómez, P. He e a, and F. Gómez-Ma in, “Com-
pu a ional E hnomusicology: pe spec i es and chal-
lenges,” Jou nal o New Music Resea ch, ol. 42, no. 2,
June 2013, pp. 111–112.
[7] A. Meh a, S. Chauhan, A. Djanibeko , A. Kulka ni,
G. Xia, and M. Choudhu y, “Music o all: Explo ing
mul icul u al ep esen a ions in music gene a ion mod-
els,” CoRR, ol. abs/2502.07328, 2025.
[8] T. Lidy, C. N. S. J ., O. Co nelis, F. Gouyon, A. Raube ,
C. A. A. Kaes ne , and A. L. Koe ich, “On he sui abil-
i y o s a e-o - he-a music in o ma ion e ie al me h-
ods o analyzing, ca ego izing and accessing non-
wes e n and e hnic music collec ions,” Signal P ocess.,
ol. 90, no. 4, 2010, pp. 1032–1048.
[9] G. Plaja-Roglans, T. Nu all, L. Pea son, X. Se a, and
M. Mi on, “Repe oi e-speci ic ocal pi ch da a gene -
a ion o imp o ed melodic analysis o ca na ic music,”
T ans. In . Soc. Music. In . Re ., ol. 6, no. 1, 2023, pp.
13–26.
[10] G. K. Kodu i, M. Mi on, J. Se à, and X. Se a, “Com-
pu a ional app oaches o he unde s anding o melody
in ca na ic music,” in P oceedings o he 12 h In e na-
ional Socie y o Music In o ma ion Re ie al Con e -
ence, ISMIR 2011, Miami, Flo ida, USA, Oc obe 24-
28, 2011, A. Klapu i and C. Leide , Eds. Uni e si y
o Miami, 2011, pp. 263–268.
[11] A. Fe a o, G. Fe ei a, F. Diaz, and G. Bo n, “Measu -
ing commonali y in ecommenda ion o cul u al con-
en o s eng hen cul u al ci izenship,” T ans. Recomm.
Sys ., ol. 2, no. 1, 2024, pp. 10:1–10:32.
[12] C. C. Liu, I. Gu e ych, and A. Ko honen, “Cul u ally
awa e and adap ed NLP: A axonomy and a su ey o
he s a e o he a ,” CoRR, ol. abs/2406.03930, 2024.
[13] A. Holzap el, B. L. S u m, and M. Coeckelbe gh, “E h-
ical dimensions o music in o ma ion e ie al echnol-
ogy,” T ans. In . Soc. Music. In . Re ., ol. 1, no. 1,
2018, pp. 44–55.
[14] A. Ib ahim, B. Thé ien, K. Gup a, M. L. Rich e ,
Q. G. An hony, E. Belilo sky, T. Leso , and I. Rish,
“Simple and scalable s a egies o con inually p e- ain
la ge language models,” T ans. Mach. Lea n. Res., ol.
2024, 2024.
[15] D. M. Al es, J. Pombal, N. M. Gue ei o, P. H. Ma -
ins, J. Al es, M. A. Fa ajian, B. Pe e s, R. Rei, P. Fe -
nandes, S. Ag awal e al., “Towe : An open mul-
ilingual la ge language model o ansla ion- ela ed
asks,” CoRR, ol. abs/2402.17733, 2024.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
561
[16] L. Voukou is, D. Roussis, G. Pa aske opoulos, S. So i-
anopoulos, P. P okopidis, V. Papa asileiou, A. Ka -
samanis, S. Pipe idis, and V. Ka sou os, “Mel emi: The
i s open la ge language model o g eek,” CoRR, ol.
abs/2407.20743, 2024.
[17] E. Gogoulou, T. Leso , M. Boman, and J. Ni e, “Con-
inual lea ning unde language shi ,” in Tex , Speech,
and Dialogue - 27 h In e na ional Con e ence, TSD
2024, B no, Czech Republic, Sep embe 9-13, 2024,
P oceedings, Pa I, se . Lec u e No es in Compu e
Science, E. Nö h, A. Ho ák, and P. Sojka, Eds., ol.
15048. Sp inge , 2024, pp. 71–84.
[18] S. Gu u angan, A. Ma aso ic, S. Swayamdip a, K. Lo,
I. Bel agy, D. Downey, and N. A. Smi h, “Don’ s op
p e aining: Adap language models o domains and
asks,” in P oceedings o he 58 h Annual Mee ing o
he Associa ion o Compu a ional Linguis ics, ACL
2020, Online, July 5-10, 2020, D. Ju a sky, J. Chai,
N. Schlu e , and J. R. Te eaul , Eds. Associa ion o
Compu a ional Linguis ics, 2020, pp. 8342–8360.
[19] J. Pa ma , S. Sa heesh, M. Pa wa y, M. Shoeybi, and
B. Ca anza o, “Reuse, don’ e ain: A ecipe o con-
inued p e aining o language models,” CoRR, ol.
abs/2407.07263, 2024.
[20] K. Fujii, T. Nakamu a, M. Loem, H. Iida, M. Ohi,
K. Ha o i, H. Sho a, S. Mizuki, R. Yoko a, and
N. Okazaki, “Con inual p e- aining o c oss-lingual
LLM adap a ion: Enhancing japanese language capa-
bili ies,” CoRR, ol. abs/2404.17790, 2024.
[21] K. Gup a, B. Thé ien, A. Ib ahim, M. L. Rich e ,
Q. An hony, E. Belilo sky, I. Rish, and T. Leso ,
“Con inual p e- aining o la ge language models:
How o ( e)wa m you model?” CoRR, ol.
abs/2308.04014, 2023.
[22] H. Shi, Z. Xu, H. Wang, W. Qin, W. Wang, Y. Wang,
and H. Wang, “Con inual lea ning o la ge lan-
guage models: A comp ehensi e su ey,” CoRR, ol.
abs/2404.16789, 2024.
[23] V. Udanda ao, K. Ro h, S. Dziadzio, A. P abhu,
M. Che i, O. Vinyals, O. J. Héna , S. Albanie,
Z. Aka a, and M. Be hge, “A p ac i ione ’s guide o
eal-wo ld con inual mul imodal p e aining,” in Ad-
ances in Neu al In o ma ion P ocessing Sys ems 38:
Annual Con e ence on Neu al In o ma ion P ocessing
Sys ems 2024, Neu IPS 2024, Vancou e , BC, Canada,
Decembe 10 - 15, 2024, A. Globe sons, L. Mackey,
D. Belg a e, A. Fan, U. Paque , J. M. Tomczak, and
C. Zhang, Eds., 2024.
[24] K. Nowakowski, M. P aszynski, K. Mu asaki, and
J. Nieuwazny, “Adap ing mul ilingual speech ep e-
sen a ion model o a new, unde esou ced language
h ough mul ilingual ine- uning and con inued p e-
aining,” In . P ocess. Manag., ol. 60, no. 2, 2023,
p. 103148.
[25] W. Zheng, W. Pan, X. Xu, L. Qin, L. Yue, and M. Zhou,
“B eaking language ba ie s: C oss-lingual con inual
p e- aining a scale,” in P oceedings o he 2024 Con-
e ence on Empi ical Me hods in Na u al Language
P ocessing, EMNLP 2024, Miami, FL, USA, No embe
12-16, 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen,
Eds. Associa ion o Compu a ional Linguis ics,
2024, pp. 7725–7738.
[26] A. Cossu, A. Ca a, L. C. Passa o, V. Lomonaco,
T. Tuy elaa s, and D. Bacciu, “Con inual p e- aining
mi iga es o ge ing in language and ision,” Neu al
Ne wo ks, ol. 179, 2024, p. 106492.
[27] M. DeHa en and J. Billa, “Imp o ing low- esou ce
speech ecogni ion wi h p e ained speech models:
Con inued p e aining s. semi-supe ised aining,”
CoRR, ol. abs/2207.00659, 2022.
[28] H. Zhu, G. Cheng, J. Wang, W. Hou, P. Zhang,
and Y. Yan, “Boos ing c oss-domain speech ecogni-
ion wi h sel -supe ision,” IEEE ACM T ans. Audio
Speech Lang. P ocess., ol. 32, 2024, pp. 471–485.
[29] N. San, G. Pa aske opoulos, A. A o a, X. He, P. Kau ,
O. Adams, and D. Ju a sky, “P edic ing posi i e ans-
e o imp o ed low- esou ce speech ecogni ion us-
ing acous ic pseudo- okens,” in P oceedings o he 6 h
Wo kshop on Resea ch in Compu a ional Linguis ic Ty-
pology and Mul ilingual NLP, SIGTYPE 2024, S . Ju-
lian’s, Mal a, Ma ch 22, 2024, M. Hahn, A. So okin,
R. Kuma , A. Sche bako , Y. O makho a, J. Yang,
O. Se iko , P. Rani, E. M. Pon i, S. Mu adoglu e al.,
Eds. Associa ion o Compu a ional Linguis ics,
2024, pp. 100–112.
[30] A. A. A ia, D. Demszky, T. Ògún èmí, J. Liu, and
C. Y. Espy-Wilson, “Cp -boos ed wa 2 ec2.0: To-
wa ds noise obus speech ecogni ion o class oom
en i onmen s,” CoRR, ol. abs/2409.14494, 2024.
[31] M. Wo sman, G. Ilha co, S. Y. Gad e, R. Roelo s,
R. G. Lopes, A. S. Mo cos, H. Namkoong, A. Fa hadi,
Y. Ca mon, S. Ko nbli h e al., “Model soups: a e ag-
ing weigh s o mul iple ine- uned models imp o es ac-
cu acy wi hou inc easing in e ence ime,” in In e na-
ional Con e ence on Machine Lea ning, ICML 2022,
17-23 July 2022, Bal imo e, Ma yland, USA, se . P o-
ceedings o Machine Lea ning Resea ch, K. Chaud-
hu i, S. Jegelka, L. Song, C. Szepes á i, G. Niu, and
S. Saba o, Eds., ol. 162. PMLR, 2022, pp. 23 965–
23 998.
[32] E. Yang, L. Shen, G. Guo, X. Wang, X. Cao, J. Zhang,
and D. Tao, “Model me ging in llms, mllms, and be-
yond: Me hods, heo ies, applica ions and oppo uni-
ies,” CoRR, ol. abs/2408.07666, 2024.
[33] J. Choi, D. Kim, C. Lee, and S. Hong, “Re isi -
ing weigh a e aging o model me ging,” CoRR, ol.
abs/2412.12153, 2024.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
562
[34] G. Ilha co, M. Wo sman, S. Y. Gad e, S. Song, H. Ha-
jishi zi, S. Ko nbli h, A. Fa hadi, and L. Schmid ,
“Pa ching open- ocabula y models by in e pola ing
weigh s,” in Ad ances in Neu al In o ma ion P ocess-
ing Sys ems 35: Annual Con e ence on Neu al In o -
ma ion P ocessing Sys ems 2022, Neu IPS 2022, New
O leans, LA, USA, No embe 28 - Decembe 9, 2022,
S. Koyejo, S. Mohamed, A. Aga wal, D. Belg a e,
K. Cho, and A. Oh, Eds., 2022.
[35] G. S oica, D. Bolya, J. Bjo ne , P. Ramesh, T. Hea n,
and J. Ho man, “Zipi ! me ging models om di e en
asks wi hou aining,” in The Twel h In e na ional
Con e ence on Lea ning Rep esen a ions, ICLR 2024,
Vienna, Aus ia, May 7-11, 2024. OpenRe iew.ne ,
2024.
[36] X. Jin, X. Ren, D. P eo iuc-Pie o, and P. Cheng,
“Da aless knowledge usion by me ging weigh s o
language models,” in The Ele en h In e na ional Con-
e ence on Lea ning Rep esen a ions, ICLR 2023, Ki-
gali, Rwanda, May 1-5, 2023. OpenRe iew.ne , 2023.
[37] G. Ilha co, M. T. Ribei o, M. Wo sman, L. Schmid ,
H. Hajishi zi, and A. Fa hadi, “Edi ing models wi h
ask a i hme ic,” in The Ele en h In e na ional Con e -
ence on Lea ning Rep esen a ions, ICLR 2023, Kigali,
Rwanda, May 1-5, 2023. OpenRe iew.ne , 2023.
[38] D. Li, Y. Ma, W. Wei, Q. Kong, Y. Wu, M. Che,
F. Xia, E. Bene os, and W. Li, “Me ech: Ins umen
playing echnique de ec ion using sel -supe ised p e-
ained model wi h mul i- ask ine uning,” in IEEE In-
e na ional Con e ence on Acous ics, Speech and Sig-
nal P ocessing, ICASSP 2024, Seoul, Republic o Ko-
ea, Ap il 14-19, 2024. IEEE, 2024, pp. 521–525.
[39] E. Law, K. Wes , M. I. Mandel, M. Bay, and J. S.
Downie, “E alua ion o algo i hms using games: The
case o music agging,” in P oceedings o he 10 h
In e na ional Socie y o Music In o ma ion Re ie al
Con e ence, ISMIR 2009, Kobe In e na ional Con e -
ence Cen e , Kobe, Japan, Oc obe 26-30, 2009, K. Hi-
a a, G. Tzane akis, and K. Yoshii, Eds. In e na-
ional Socie y o Music In o ma ion Re ie al, 2009,
pp. 387–392.
[40] M. De e a d, K. Benzi, P. Vande gheyns , and
X. B esson, “FMA: A da ase o music analysis,”
in P oceedings o he 18 h In e na ional Socie y o
Music In o ma ion Re ie al Con e ence, ISMIR 2017,
Suzhou, China, Oc obe 23-27, 2017, S. J. Cunning-
ham, Z. Duan, X. Hu, and D. Tu nbull, Eds., 2017, pp.
316–323.
[41] C. Papaioannou, I. Valian zas, T. Giannakopoulos,
M. A. Kaliaka sos-Papakos as, and A. Po amianos, “A
da ase o g eek adi ional and olk music: Ly a,”
in P oceedings o he 23 d In e na ional Socie y o
Music In o ma ion Re ie al Con e ence, ISMIR 2022,
Bengalu u, India, Decembe 4-8, 2022, P. Rao, H. A.
Mu hy, A. S ini asamu hy, R. M. Bi ne , R. C.
Repe o, M. Go o, X. Se a, and M. Mi on, Eds., 2022,
pp. 377–383.
[42] X. Se a, “C ea ing esea ch co po a o he compu-
a ional s udy o music: he case o he compmusic
p ojec ,” in AES In e na ional Con e ence on Seman-
ic Audio 2014, London, UK, Janua y 27-29, 2014,
C. Di ma , G. Fazekas, and S. Ewe , Eds. Audio
Enginee ing Socie y, 2014.
[43] B. Uya , H. S. A li, S. Sen ü k, B. Bozku , and
X. Se a, “A co pus o compu a ional esea ch o u k-
ish makam music,” in P oceedings o he 1s In e -
na ional Wo kshop on Digi al Lib a ies o Musicol-
ogy, DL M@JCDL 2014, London, Uni ed Kingdom,
Sep embe 12, 2014, B. Fields and K. R. Page, Eds.
ACM, 2014, pp. 1–7.
[44] S. Sen ü k, “Compu a ional analysis o audio eco d-
ings and music sco es o he desc ip ion and disco e y
o o oman- u kish makam music,” Ph.D. disse a ion,
Pompeu Fab a Uni e si y, Spain, 2017.
[45] A. S ini asamu hy, G. K. Kodu i, S. Gula i, V. Ish-
wa , and X. Se a, “Co po a o music in o ma ion e-
sea ch in indian a music,” in Music Technology mee s
Philosophy - F om Digi al Echos o Vi ual E hos:
Join P oceedings o he 40 h In e na ional Compu e
Music Con e ence, ICMC 2014, and he 11 h Sound
and Music Compu ing Con e ence, SMC 2014, A hens,
G eece, Sep embe 14-20, 2014. Michigan Publish-
ing, 2014.
[46] C. Papaioannou, E. Bene os, and A. Po amianos,
“F om wes o eas : Who can unde s and he mu-
sic o he o he s be e ?” in P oceedings o he 24 h
In e na ional Socie y o Music In o ma ion Re ie al
Con e ence, ISMIR 2023, Milan, I aly, No embe 5-9,
2023, A. Sa i, F. An onacci, M. Sandle , P. Bes agini,
S. Dixon, B. Liang, G. Richa d, and J. Pauwels, Eds.,
2023, pp. 311–318.
[47] C. Papaioannou, E. Bene os, and A. Po amianos, “LC-
P o one s: Mul i-label ew-sho lea ning o wo ld mu-
sic audio agging,” IEEE Open Jou nal o Signal P o-
cessing, ol. 6, 2025, pp. 138–146.
[48] A. Dé ossez, J. Cope , G. Synnae e, and Y. Adi, “High
ideli y neu al audio comp ession,” T ans. Mach.
Lea n. Res., ol. 2023, 2023.
[49] W. Hsu, B. Bol e, Y. H. Tsai, K. Lakho ia, R. Salakhu -
dino , and A. Mohamed, “Hube : Sel -supe ised
speech ep esen a ion lea ning by masked p edic ion o
hidden uni s,” IEEE ACM T ans. Audio Speech Lang.
P ocess., ol. 29, 2021, pp. 3451–3460.
[50] J. Ki kpa ick, R. Pascanu, N. C. Rabinowi z, J. Ve-
ness, G. Desja dins, A. A. Rusu, K. Milan, J. Quan,
T. Ramalho, A. G abska-Ba winska e al., “O e com-
ing ca as ophic o ge ing in neu al ne wo ks,” CoRR,
ol. abs/1612.00796, 2016.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
563