SLAP: SIAMESE LANGUAGE-AUDIO PRETRAINING
WITHOUT NEGATIVE SAMPLES FOR MUSIC UNDERSTANDING
Julien Guinot∗,1,2  Alain Riou∗,3  Elio Quinton2  György Fazekas1
1 Centre for Digital Music, Queen Mary University of London, U.K.
2 Music & Audio Machine Learning Lab, Universal Music Group, London, U.K.
3 LTCI, Télécom-Paris, Institut Polytechnique de Paris, France
∗ Equal contribution, correspondence to [email protected]
ABSTRACT
Joint embedding spaces have significantly advanced
music understanding and generation by linking text and
audio through multimodal contrastive learning. However,
these approaches are limited by large memory requirements,
because they rely on large batch sizes to use negative
samples effectively. Further, multimodal joint embedding
spaces suffer from a modality gap wherein embeddings
from different modalities lie in different manifolds
of the embedding space. To address these challenges, we
propose Siamese Language-Audio Pretraining (SLAP), a
novel multimodal pretraining framework that allows learning
powerful representations without negative samples.
SLAP adapts the Bootstrap Your Own Latent (BYOL)
paradigm for multimodal audio-text training, promoting
scalability in training multimodal embedding spaces.
We illustrate the ability of our model to learn meaningful
relationships between music and text. Specifically,
we show that SLAP outperforms CLAP on tasks such as
text-music retrieval and zero-shot classification. We also
observe competitive downstream performance on several
MIR tasks (genre and instrument classification, auto-tagging),
including against larger or supervised models. Additionally,
our approach has attractive properties, such as
a quantifiably reduced modality gap and improved robustness
of retrieval performance to batch size variations. Finally,
its novel formulation unlocks large-scale training on
a single GPU through gradient accumulation.
1. INTRODUCTION
Joint embedding spaces for text and audio have been foundational
in recent developments in music understanding
and generation. Such spaces are typically learned via Multimodal
Contrastive Learning (MCL), which optimizes a
pair of encoders for maximal similarity between positive
pairs, while minimizing similarity for negative pairs [1,2].
© J. Guinot, A. Riou, E. Quinton, and G. Fazekas. Licensed
under a Creative Commons Attribution 4.0 International License
(CC BY 4.0). Attribution: J. Guinot, A. Riou, E. Quinton, and G.
Fazekas, “SLAP: Siamese Language-Audio Pretraining Without Negative
Samples for Music Understanding”, in Proc. of the 26th Int. Society
for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Though widely successful, this method has some identified
drawbacks. Recent trends for joint multimodal
embedding spaces have emphasized fine-grained
understanding between individual text tokens and individual
timesteps of music [3,4] and the need for text augmentation
strategies to alleviate the lack of large-scale datasets
[5,6]. A modality gap emerging from the contrastive approach
has been observed, where modalities lie in separate
manifolds of the embedding space [7,8]. This leads
to “joint” representations not being truly joint, potentially
harming performance. Importantly, contrastive approaches
face an inherent scalability issue. Their formulation makes
them 1) dependent on batch size for representation quality
and 2) more difficult to scale than masked modelling approaches
due to the formulation of the contrastive loss [9].
These drawbacks pose issues for foundation models, which
by usage should be scalable for adaptation on large-scale
datasets. Further, text-music spaces are many-to-many:
there is often no single appropriate caption
for a piece of music, and vice-versa.
Inspired by recent advances in Self-Supervised Learning
for images and general audio [10,11], we propose
SLAP, Siamese Language-Audio Pretraining. We adapt
Bootstrap Your Own Latent (BYOL) [10] as a joint-embedding
pretraining approach without negative samples
for multimodal text-audio pretraining. We show that SLAP
alleviates both the scalability issues and the modality gap
inherent to MCL. Our contributions are:
1. We introduce a scalable, hyperparameter-robust approach
for language-audio pretraining which does not
require negative pairs to learn strong representations.
2. We outperform comparable contrastive models on
retrieval and downstream probing.
3. We show that our approach significantly decreases
the modality gap between audio and text embeddings
compared to contrastive approaches.
4. Our approach enables large batch sizes via gradient
accumulation, which was previously inaccessible
due to the formulation of the contrastive loss.
To facilitate further research in this direction, we make
our code available.¹
¹ https://github.com/Pliploop/SLAP
2. BACKGROUND
2.1 Multimodal Contrastive Learning
In contrastive learning, models learn representations by
maximizing the similarity between two views of an input
while minimizing its similarity with negative samples,
typically by optimizing an InfoNCE loss [12]. While
these views are typically crafted by randomly applying
transforms to the input data points [12,13], Multimodal
Contrastive Learning (MCL) instead relies on paired data
from different modalities, such as text-image [1], text-video
[14], audio-image [15], text-music [16,17], and text-audio
[2,18]. Contrastive learning spanning more than two
modalities has also been investigated [19–21].
MCL has been leveraged to great extent in multiple
modalities for tasks such as understanding [1], retrieval
[16], and generation [22], with many multimodal foundation
models emerging from the approach [23,24].
MCL, specifically text-audio and text-music contrastive
learning, has also been widely adopted in the field of audio
and music. Notably, CLAP [2] and subsequent developments
building upon it have demonstrated high audio understanding
performance [3,5,18,25]. Learned audio-text
representations enable effective text-audio and audio-text
retrieval by leveraging similarity metrics between modalities.
Beyond catalogue navigation, these capabilities support
retrieval-augmented audio captioning [26,27]. In
music-text representation learning, MusCALL [16] and
MuLan [17] have both achieved strong performance on
downstream tasks such as text-music retrieval, music classification,
and zero-shot classification. Finally, joint embedding
text-audio spaces have also been leveraged to condition
audio and music generative models [28–32].
2.2 Limitations of contrastive learning
Despite its widespread success, contrastive learning faces
notable limitations, typically related to its reliance on negative
samples. First, ensuring a diverse and representative
set of negative samples is crucial for stable training, which
usually necessitates a large batch size $B$ in practice. However,
computing the InfoNCE loss requires storing a $B \times B$
matrix of pairwise similarities on a single GPU, which limits
scalability. SigLIP [33] addresses this by replacing the
softmax in the criterion with a sigmoid, making distributed
training more tractable, though it still demands costly inter-device
communication. In addition, contrastive learning
implicitly assumes a uniform prior of the data distribution,
which is detrimental in long-tailed scenarios [34].
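To make the memory argument concrete, the following is a minimal pure-Python sketch of one direction (audio to text) of an InfoNCE loss. It is an illustration, not the implementation used in CLAP or in this work: the function name `info_nce`, the temperature value, and the assumption that embeddings are already L2-normalized are choices made for the example. The point to note is the `logits` matrix, which has B×B entries and couples every sample in the batch.

```python
import math


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Audio-to-text InfoNCE over a batch of B paired embeddings.

    Assumes (for this sketch) that embeddings are L2-normalized lists of floats
    and that pair i in audio_embs matches pair i in text_embs.
    """
    batch_size = len(audio_embs)
    # B x B matrix of scaled pairwise similarities: this is the memory bottleneck.
    logits = [
        [sum(a * t for a, t in zip(audio, text)) / temperature for text in text_embs]
        for audio in audio_embs
    ]
    loss = 0.0
    for i, row in enumerate(logits):
        # Numerically stable log-sum-exp over all B keys (positives and negatives).
        row_max = max(row)
        log_denominator = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        # Cross-entropy with the matching pair (the diagonal entry) as the target.
        loss += log_denominator - row[i]
    return loss / batch_size


# Toy batch of two perfectly aligned, orthogonal pairs (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0]]
example_loss = info_nce(embeddings, embeddings, temperature=1.0)
```

Because each row's denominator sums over all B keys, the loss of one pair cannot be evaluated without the rest of the batch.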
MCL, more specifically, also suffers from an issue often
referred to as the modality gap, where embeddings from
different modalities occupy non-overlapping cone manifolds
in the joint latent space [8]. [35] observe a relationship
between this gap and the initial model weights and
loss temperature, while recent findings suggest this problem
could be intrinsic to the loss formulation itself [7].
2.3 Self-supervised learning without negative samples
Recent advances in Self-Supervised Learning (SSL) have
demonstrated that contrastive learning can be outperformed
by alternative approaches that do not rely on negative
samples, at least in unimodal settings [10,36,37].
In particular, BYOL [10] leverages an asymmetric architecture
composed of a context encoder and a target encoder.
Two augmented views are passed through their respective
encoders, and a predictor network maps the context
output to match the target’s. Crucially, the target encoder
is not trained via gradient descent but updated as an Exponential
Moving Average (EMA) of the context encoder,
which effectively avoids representation collapse [38].
This approach, as well as subsequent works relying
on Transformer-based architectures, does not suffer from
the conceptual limitations described in Section 2.2, and
achieves state-of-the-art performance in image [37,39,40]
and audio [11,41–43] representation learning. However,
the reliance on EMA requires the context and target encoders
to share the same architecture, hindering the direct
application of these methods to multimodal scenarios.
In this work, we draw inspiration from these ideas to
propose a novel multimodal SSL framework that eliminates
the need for negative samples while addressing the
architectural limitations imposed by EMA-based methods.
3. SIAMESE LANGUAGE-AUDIO PRETRAINING
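The training pipeline of SLAP is depicted in Figure 1.
Consider an audio encoder $E_A$ mapping samples $x_A$ of
length $T_A$ from the audio space to a latent audio representation
$z_A$, so that $E_A : x_A \in \mathbb{R}^{T_A} \mapsto z_A \in \mathbb{R}^{d}$, and
a text encoder mapping $N$ tokens to a text latent space,
$E_T : x_T \in \mathbb{N}^{N} \mapsto z_T \in \mathbb{R}^{d}$.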
Audio and text encoders project inputs into their respective
modalities’ latent spaces. For each modality, we define a
context encoder ($E$) and an Exponential Moving Average
(EMA)-updated target encoder $\bar{E}$ with EMA rate $\tau$:
$$\bar{E} = \tau \bar{E} + (1 - \tau) E. \tag{1}$$
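As an illustration of Eq. (1), the sketch below applies the EMA rule to a flat list of floats standing in for encoder weights. The function name `ema_update` and the toy parameter values are hypothetical; only the update rule itself and the rate τ = 0.95 (reported in Section 4.3) come from the text. In a real training loop the same rule would be applied to every parameter tensor of the target encoder after each optimizer step, without gradient tracking.

```python
def ema_update(target_params, context_params, tau=0.95):
    """Return updated target parameters: tau * target + (1 - tau) * context."""
    return [
        tau * target + (1.0 - tau) * context
        for target, context in zip(target_params, context_params)
    ]


# Toy parameters standing in for encoder weights (hypothetical values).
context_params = [1.0, 2.0, 3.0]
target_params = [0.0, 0.0, 0.0]

# With a fixed context encoder, the target moves towards it geometrically:
# after n steps, target = context * (1 - tau ** n).
for _ in range(3):
    target_params = ema_update(target_params, context_params, tau=0.95)
```

A lower τ makes the target track the context encoder more closely; a τ close to 1 makes the target change slowly.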
For each modality, we also incorporate a predictor $P$,
which has been shown to be necessary to prevent trivial solutions
and collapse [10,38]. The predictor of each branch
processes the output $z$ of the respective context encoder
and aims to predict the output $\bar{z}$ of the target branch of
both modalities. We denote by $q$ the output of the predictor:
$$q_A = P_A(z_A), \qquad q_T = P_T(z_T). \tag{2}$$
The model is trained by minimizing intermodal losses,
defined as the cosine distance between the queries $q$ and
targets $\bar{z}$ from different modalities:
$$\mathcal{L}_{A \to T} = 1 - \frac{q_A \cdot \bar{z}_T}{\|q_A\| \, \|\bar{z}_T\|}, \qquad \mathcal{L}_{T \to A} = 1 - \frac{q_T \cdot \bar{z}_A}{\|q_T\| \, \|\bar{z}_A\|}, \tag{3}$$
where $\| \cdot \|$ denotes the $L_2$ norm.
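The cosine distance in Eq. (3) can be sketched for a single pair as follows. The helper name `cosine_distance` and the toy vectors are assumptions made for illustration. The sketch does not model the stop-gradient on the target branch, which an autodiff framework would need (for example by detaching the target before computing the loss).

```python
import math


def cosine_distance(query, target):
    """1 - cosine similarity between a predictor output q and a target z-bar."""
    dot = sum(a * b for a, b in zip(query, target))
    norm_query = math.sqrt(sum(a * a for a in query))
    norm_target = math.sqrt(sum(b * b for b in target))
    return 1.0 - dot / (norm_query * norm_target)


# Toy 2-D embeddings (hypothetical values).
q_audio = [1.0, 0.0]        # predictor output of the audio context branch
z_bar_text = [2.0, 0.0]     # output of the text target branch, same direction

loss_audio_to_text = cosine_distance(q_audio, z_bar_text)
```

The loss depends only on direction, not on the norm of either vector. It ranges from 0 (aligned) through 1 (orthogonal) to 2 (opposite), and each pair's loss is computed without reference to any other pair in the batch.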
Additionally, we introduce intramodal losses between
the queries $q$ and targets $\bar{z}$ within each modality, which
we find to empirically yield stronger retrieval performance
than using intermodal losses alone (see Section 5.6):
$$\mathcal{L}_{A} = 1 - \frac{q_A \cdot \bar{z}_A}{\|q_A\| \, \|\bar{z}_A\|}, \qquad \mathcal{L}_{T} = 1 - \frac{q_T \cdot \bar{z}_T}{\|q_T\| \, \|\bar{z}_T\|}. \tag{4}$$
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 1: diagrams of the CLAP and SLAP pipelines, each taking audio and a caption as input, with the example caption "A blaring metal track with stompy kicks and distorted chuggy guitar."]
Figure 1. (a) CLAP: The multimodal contrastive loss $\mathcal{L}_{\mathrm{CLAP}}$ is optimized between batches of audio/text pairs. (b) SLAP:
For each audio/text pair, we compute a query $q$ and a target $\bar{z}$ for each modality, then optimize both the intermodal losses
$\mathcal{L}_{A \to T}$ and $\mathcal{L}_{T \to A}$ and the intramodal losses $\mathcal{L}_A$ and $\mathcal{L}_T$ between the queries and the targets. As indicated by the stop-gradient
operator sg, the gradient flows only through the context branches, while the target encoders are updated via EMA.
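The final loss is a combination of intramodal and intermodal
losses with a weighting term $\lambda \in [0,1]$:
$$\mathcal{L} = \lambda \left( \mathcal{L}_{A \to T} + \mathcal{L}_{T \to A} \right) + (1 - \lambda) \left( \mathcal{L}_{A} + \mathcal{L}_{T} \right). \tag{5}$$
This approach does not require any complex machinery
to prevent collapse other than an EMA update of encoders
from both modalities and the addition of the symmetry-breaking
predictor.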
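A minimal sketch of the weighting in Eq. (5) is given below. It assumes that λ scales the sum of the two intermodal terms and (1 − λ) the sum of the two intramodal terms, which is the reading consistent with Section 5.6, where extreme values of λ remove the constraints between or within modalities. The function name `slap_loss`, the default λ, and the loss values are hypothetical.

```python
def slap_loss(loss_a2t, loss_t2a, loss_a, loss_t, lam=0.5):
    """Combine intermodal and intramodal cosine-distance losses with weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    intermodal = loss_a2t + loss_t2a   # audio query vs. text target, and vice versa
    intramodal = loss_a + loss_t       # query vs. target within each modality
    return lam * intermodal + (1.0 - lam) * intramodal


# Hypothetical per-pair loss values.
balanced = slap_loss(0.2, 0.4, 0.1, 0.3, lam=0.5)
only_intermodal = slap_loss(0.2, 0.4, 0.1, 0.3, lam=1.0)
only_intramodal = slap_loss(0.2, 0.4, 0.1, 0.3, lam=0.0)
```

Under this reading, λ = 1 trains with cross-modal terms only and λ = 0 with within-modality terms only, so neither extreme constrains both at once.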
4. EXPERIMENTAL SETUP
4.1 Datasets
The list of datasets used in this work is reported in Table 1.
We train SLAP on an internal private dataset of 260,000
pairs of full-length production-quality music tracks and
professionally annotated captions (PrivateCaps [16]). Our
retrieval testing datasets include two music-caption pair
datasets. Specifically, we use MusicCaps [32] and the
Song Describer Dataset [44] for text-music retrieval performance
evaluation. MusicCaps contains 5,521 music
clips, each accompanied by a detailed text description.
The Song Describer Dataset includes 2-minute-long permissively
licensed music clips with crowd-sourced single-sentence
captions.
For zero-shot classification and downstream probing
performance (see Sections 5.2, 5.3), we use the GTZAN
dataset [45] for music genre classification, which contains
1,000 audio tracks, each 30 seconds long, spanning 10 genres.
We also use the MagnaTagATune (MTAT) dataset
[46], an automatic tagging dataset consisting of 25,000
30-second music clips with 50 associated tags. Additionally,
we employ the OpenMic dataset [47] for fine-grained
instrument tagging, containing annotations of instrument
presence over 20,000 snippets.

Task         Dataset              # clips   Clip length
Training     PrivateCaps          260k      full length
Retrieval    MusicCaps [32]       5.5k      10 seconds
             Song Describer [44]  1k        2 minutes
Probing/ZS   GTZAN [45]           1k        30 seconds
             MagnaTagATune [46]   25k       30 seconds
             OpenMic [47]         20k       10 seconds

Table 1. Datasets used for training, retrieval, and downstream
tasks (probing and zero-shot classification).
4.2 Model architecture
We use the same architecture as in LAION-CLAP [49],
namely HTS-AT [50] for the audio encoder and
RoBERTa [51] for the text encoder. This backbone configuration
leads to an overall model size of 193M parameters.
The predictor architecture is a ReLU-activated multilayer
perceptron (MLP) with batch normalization. By default,
we employ one 4096-wide hidden layer for the MLP.
Both the prediction and projection dimensions are 512, as
in CLAP [2]. We leave the study of predictor architecture
hyperparameters as well as the influence of encoder architecture
for future work.
4.3 Training details
All models are trained on PrivateCaps using a batch size of
768 across 6 A100 GPUs with PyTorch automatic mixed
precision, for 60 epochs. We use linear warmup and cosine
decay with a maximum learning rate of $4 \times 10^{-4}$ (scaled
by batch size) and warmup over 1/10 of total epochs. We
use SpecAugment as an audio augmentation [52], which
improves performance slightly across tasks. We set the EMA
update rate $\tau = 0.95$, as we empirically find that lower values
(compared to 0.996 in [11]) yield better retrieval performance,
and leave full hyperparameter tuning for future
work. We experiment with both randomly initialized and
pretrained checkpoints for HTS-AT² and RoBERTa³, and
initialize the HTS-AT backbone from a public checkpoint
trained on AudioSet [48] when initializing from pretrained.

Song Describer
                     A→T                              T→A
                     Recall (↑)        Norm. rank (↓) Recall (↑)        Norm. rank (↓)
Model        Pre.    1     5     10    Med.  Mean     1     5     10    Med.  Mean
SLAP         ✗       3.3   13.0  19.7  5.3   12.3     3.7   12.5  19.4  5.4   13.8
CLAP         ✗       3.1   9.6   14.9  7.3   14.2     2.7   9.7   15.8  6.7   16.4
SLAP         ✓       5.7   18.1  26.6  3.2   8.9      6.0   18.1  26.4  3.6   10.5
CLAP         ✓       5.3   14.9  22.2  4.3   9.2      5.7   16.8  24.1  4.3   9.7
MusCALL      -       1.4   7.0   14.4  7.8   14.7     1.9   6.8   12.1  8.8   19.9
LAION-CLAP†  -       1.9   7.4   11.4  9.4   18.7     1.6   5.3   9.5   9.2   18.2

MusicCaps
                     A→T                              T→A
                     Recall (↑)        Norm. rank (↓) Recall (↑)        Norm. rank (↓)
Model        Pre.    1     5     10    Med.  Mean     1     5     10    Med.  Mean
SLAP         ✗       1.6   5.3   7.5   4.8   14.5     1.2   4.9   7.9   4.4   13.3
CLAP         ✗       0.9   3.4   5.2   8.8   18.2     0.7   3.2   5.2   7.5   17.4
SLAP         ✓       3.1   10.1  15.4  1.9   7.7      3.0   9.6   15.4  1.9   7.7
CLAP         ✓       2.8   8.3   10.4  3.0   10.0     2.8   8.7   14.0  2.2   7.6
MusCALL      -       1.5   7.1   10.8  9.3   17.3     0.4   1.6   3.0   12.8  21.1
LAION-CLAP†  -       3.2   10.4  16.0  2.0   6.9      3.7   11.1  17.0  1.6   5.5

† Note that LAION-CLAP is trained on a larger dataset that includes AudioSet [48], from which the audio samples of MusicCaps were extracted.
Table 2. Text-Music retrieval results. Bolded results indicate best results between CLAP and SLAP for either models
initialized from pretrained checkpoints (✓) or from scratch (✗). All metrics are in percentages.
5. RESULTS
5.1 Multimodal retrieval
We perform Audio to Text (A→T) and Text to Audio
(T→A) retrieval across the test datasets described in Section
4.1. We use predictions $q$ as key and query for SLAP
models (see Section 5.1.1) and projections $z$ for CLAP
models. We report Recall@(1,5,10), as well as Mean and
Median Normalized Rank (MNR/MdNR) [53]. We report
comparable results from relevant work in literature as well
as our reproduced CLAP model.
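The retrieval metrics can be sketched from a query-by-key similarity matrix as below. This is an illustrative sketch under stated assumptions, not the evaluation code of this work: it assumes one matching key per query (query i matches key i), and it takes the normalized rank to be the rank of the match divided by the number of keys, expressed in percent; the exact definition used in [53] may differ. All function names and the toy matrix are hypothetical.

```python
import statistics


def ranks_of_matches(similarity):
    """Rank (1 = best) of the matching key for each query; query i matches key i."""
    return [
        1 + sum(1 for score in row if score > row[i])
        for i, row in enumerate(similarity)
    ]


def recall_at_k(ranks, k):
    """Percentage of queries whose match is ranked within the top k."""
    return 100.0 * sum(1 for rank in ranks if rank <= k) / len(ranks)


def normalized_ranks(ranks, num_keys):
    """Ranks as a percentage of the number of candidate keys (assumed definition)."""
    return [100.0 * rank / num_keys for rank in ranks]


# Toy 3x3 similarity matrix (rows: audio queries, columns: text keys).
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
ranks = ranks_of_matches(similarity)
recall_1 = recall_at_k(ranks, 1)
median_norm_rank = statistics.median(normalized_ranks(ranks, num_keys=3))
mean_norm_rank = statistics.mean(normalized_ranks(ranks, num_keys=3))
```

Transposing the similarity matrix gives the opposite retrieval direction (T→A instead of A→T).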
We report retrieval results in Table 2. We find that SLAP
consistently outperforms both MusCALL [16], which was
also trained on PrivateCaps, and our own reproduced
CLAP models. Notably, Recall and Normalized Rank are
improved across all metrics except for the T→A task on
MNR. Initializing from pretrained encoders leads to significantly
better performance both for CLAP and SLAP.
We also compare our results to LAION-CLAP [49],
which shares the same architecture. On Song Describer,
all our models significantly outperform it, highlighting the
importance of high-quality training data for music understanding.
On MusicCaps, whose audios are part of its
training data, LAION-CLAP performs better; however, our
best SLAP model, initialised from pretrained checkpoints,
achieves close performance, particularly on A→T.
          Song Describer                 MusicCaps
          A→T           T→A              A→T           T→A
Anchor    R@5   MdNR    R@5   MdNR       R@5   MdNR    R@5   MdNR
z         17.8  3.6     17.0  3.8        9.6   2.1     9.4   2.3
q         18.1  3.2     18.1  3.6        10.1  1.9     9.6   1.9

Table 3. Retrieval results using queries $q$ or projection embeddings
$z$ as anchors. Best results are in bold.
² https://github.com/LAION-AI/CLAP
³ https://huggingface.co/FacebookAI/roberta-base
5.1.1 Prediction vs. projection retrieval
One variant of our approach is to use predictions instead
of projections to perform retrieval. Given the loss formulation,
both are possible since both exist in separate embedding
spaces and the pretraining objective optimizes for
four-way similarity. Here we observe the effect of using
$q$ or $z$ as query and key for retrieval. Results are reported
in Table 3. We find that using projections instead of predictions
leads to a small drop in performance, but both are
viable for retrieval tasks, meaning there is no loss of information
when compared to CLAP for projections.
           GTZAN       MTAT                    OpenMic
Model      Acc.        AUROC       mAP         mAP
SLAP       82.9        92.0        45.8        86.2
CLAP       80.9        91.7        45.1        85.8
MusCALL    74.5        91.5        44.4        79.7
MATPAC     85.9        91.6        41.1        85.4
SOTA       87.4 [54]   92.7 [17]   41.4 [55]   86.7 [56]

Table 4. Probing results on frozen representations for
genre classification (GTZAN), automatic tagging (MTAT),
and instrument tagging (OpenMic). Best scores are bolded
and second-to-best are underlined.
5.2 Downstream probing
This section verifies the quality of the learned representations
of SLAP by performing downstream probing on
a range of tasks on representations before the projection
head which outputs $z$. For downstream task evaluation,
we freeze the audio encoder and train shallow nonlinear
probes (one ReLU-activated 512-wide hidden layer
MLPs). Probes are trained with a learning rate of $10^{-4}$
with an early-stopping mechanism on validation loss.
Our results are reported in Table 4. As above, we compare
them to our CLAP baseline and MusCALL. We report
the performance of MATPAC [43], the current self-supervised
state-of-the-art on these tasks, as well. As a reference,
we also indicate state-of-the-art performances for
each task.
We observe that SLAP consistently outperforms CLAP
and MusCALL, demonstrating the advantages of our target
encoder/predictor strategy over contrastive learning.
On MTAT and OpenMic, SLAP surpasses MATPAC,
which also employs a target encoder and predictor but is
trained on audio alone. This suggests that language supervision
is particularly beneficial for tagging tasks but is less
useful for genre classification (82.9% vs. 85.9%).
On tagging tasks, SLAP achieves state-of-the-art results
on MTAT (45.8% vs. 41.1%) and bridges the gap with best
supervised methods on OpenMic (86.2% vs. 86.7%).
5.3 Zero-shot performance
Similar to shallow downstream probing, we evaluate the
zero-shot performance of SLAP by retrieving the highest
similarity prompts to audio using prompts as a class proxy
on GTZAN, MTAT, and OpenMic. We report zero-shot
accuracy on GTZAN, as well as ROC-AUC and mAP for
automatic tagging on MTAT, and mAR on OpenMic. As
previous work has shown the importance of prompt engineering
for zero-shot performance [16], we evaluate across
four prompts with which we wrap the target class: “{}”, “{}
music”, “this sounds like {}”, and “A {} track”. We report
the best of four scores across datasets for CLAP,
SLAP, and MusCALL [16].
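The prompt-wrapping procedure can be sketched as follows. The four templates are those listed above; everything else is a hypothetical stand-in. In particular, the embeddings are toy 2-D vectors, whereas in practice they would come from the text and audio branches of the trained model, and the helper names are invented for this example.

```python
import math

PROMPT_TEMPLATES = ["{}", "{} music", "this sounds like {}", "A {} track"]


def build_prompts(class_names, template):
    """Wrap each class name in a prompt template."""
    return [template.format(name) for name in class_names]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def zero_shot_predict(audio_embedding, prompt_embeddings):
    """Index of the prompt most similar to the audio embedding."""
    similarities = [cosine_similarity(audio_embedding, p) for p in prompt_embeddings]
    return max(range(len(similarities)), key=similarities.__getitem__)


classes = ["rock", "jazz"]
prompts = build_prompts(classes, PROMPT_TEMPLATES[3])

# Toy embeddings standing in for encoder outputs (hypothetical values),
# one prompt embedding per class, in the same order as `classes`.
prompt_embeddings = [[1.0, 0.2], [0.1, 1.0]]
audio_embedding = [0.2, 0.9]
predicted_class = classes[zero_shot_predict(audio_embedding, prompt_embeddings)]
```

Repeating this for each of the four templates and keeping the best score per dataset corresponds to the best-of-four protocol described above.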
We adapt another metric for zero-shot evaluation of
multilabel approaches, which we call mean Average Recall
(mAR). Previous approaches evaluate multi-tag zero-shot
performance by using tag-wise cosine similarities as logits
and feeding them to standard metrics such as AUROC and mAP.
These metrics are not robust to cosine similarity distributions
centered around 0.5 with low spreads, which occur for certain
tasks, e.g. instrumentation. mAR computes the mean rank
of retrieved ground truth tags for a given audio:
$$\mathrm{mAR} = \frac{1}{NK} \sum_{k=1}^{K} \sum_{n=1}^{N} R_n@k, \tag{6}$$
where $K$ is the number of samples in the dataset, $N$ the
number of tags, and $R_n@k$ the recall at $k$ for tag $n$. We evaluate
mAR on OpenMic. Results are reported in Table 5 for
CLAP, SLAP, and MusCALL [16].
We observe that SLAP consistently outperforms or is
on par with our CLAP baseline and MusCALL. As for retrieval,
a plausible cause is the better overlap between the
audio and text spaces of our approach compared to contrastive
ones. We measure this overlap in Section 5.4.
               GTZAN   MTAT            OpenMic
Model          Acc.    AUROC   mAP     mAR
SLAP           58.3    75.0    31.5    70.5
CLAP           51.7    75.0    26.0    70.1
MusCALL [16]   48.2    73.8    23.0    66.7

Table 5. Best-of-four zero-shot classification and tagging
results for genre and instrument classification, and automatic
tagging. Best scores are shown in bold.
5.4 Modality gap
As discussed in Section 2.2, one notable issue with MCL
is the presence of a modality gap [7,8]. While it is unclear
how detrimental the modality gap is to performance for
retrieval, it has been shown that “closing the gap” is beneficial
at least for generative applications [57]. Intuitively,
having disjoint manifolds for audio and text representations
allows their usage for cosine-similarity Max Inner Product
Search, but is not beneficial to a shared joint understanding
of text and audio. Multimodal contrastive models are
subject to a modality gap, likely attributable to the formulation
of the contrastive loss [7]. Since our method is not contrastive, we
posit that it should be less prone to creating such
a gap. Figure 2 shows the UMAP reduction of projections
$z_T, z_A$ and $q_T, q_A$ from PrivateCaps and Song Describer,
for both CLAP and SLAP. Although the modality gap is
immediately apparent for CLAP, it seems much less pronounced
for SLAP.

Figure 2. UMAP reduction of the CLAP (top) and SLAP
(bottom) embeddings obtained from PrivateCaps (left)
and Song Describer (right).

[Figure 3: two panels plotted per dataset (PC, SD, MC) for CLAP Projections, SLAP Projections, and SLAP Predictions: Centroid Distance $\Delta_{A/T}$ and Linear Separability $L_{A/T}$.]
Figure 3. Centroid Distance (left) and Linear separability
(right) of SLAP and CLAP on PrivateCaps (PC), Song
Describer (SD) and MusicCaps (MC). Lower is better.
As in [7] and [8], we measure this by computing the
linear separability $L_{A/T}$ between the respective manifolds
of audio and text in the latent space. For PrivateCaps,
MusicCaps and Song Describer, we report in Figure 3 the
training accuracy of a Logistic Regression classifier trained
to distinguish an audio embedding from a text embedding.
We also measure the Euclidean distance $\Delta_{A/T}$ between the
centroids of each modality.

[Figure 4: two panels, Recall@1 for A→T and for T→A, versus effective batch size (64, 128, 256, 512, 768), for CLAP and SLAP, with and without pretrained initialization.]
Figure 4. Scaling batch size for CLAP and SLAP.
These observations confirm the modality gap is larger,
both in terms of distance and linear separability, for CLAP
than for SLAP. Moreover, it is nearly nonexistent for SLAP
on both PC and SD. Although the gap is larger for MC
than for PC and SD, this can be attributed to a domain shift
between datasets, likely attributable to the noisiness and
longer captions of MC.
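Of the two gap measures used in this section, the centroid distance is simple enough to sketch directly. The code below computes the Euclidean distance between the mean audio embedding and the mean text embedding. The function names and toy embeddings are hypothetical, and whether embeddings are L2-normalized beforehand is left open here, since the text does not specify it. Linear separability, the training accuracy of a logistic regression classifier over the two modalities, is not shown.

```python
import math


def centroid(embeddings):
    """Coordinate-wise mean of a list of equal-length embedding vectors."""
    count = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[j] for e in embeddings) / count for j in range(dim)]


def centroid_distance(audio_embeddings, text_embeddings):
    """Euclidean distance between the audio centroid and the text centroid."""
    audio_centroid = centroid(audio_embeddings)
    text_centroid = centroid(text_embeddings)
    return math.sqrt(
        sum((a - t) ** 2 for a, t in zip(audio_centroid, text_centroid))
    )


# Toy 2-D embeddings (hypothetical values): centroids are (1, 0) and (1, 4).
audio_embeddings = [[0.0, 0.0], [2.0, 0.0]]
text_embeddings = [[1.0, 3.0], [1.0, 5.0]]
gap = centroid_distance(audio_embeddings, text_embeddings)
```

A distance close to zero means the two modalities share a common mean in the latent space. It does not by itself guarantee that the two point clouds overlap, which is why a separability measure is reported alongside it.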
5.5 Scalability properties
We compare the scalability of SLAP and CLAP, focusing
on batch size. A key limitation of contrastive models like
CLAP is that their loss depends on all negatives in a batch,
making gradient accumulation ineffective, since the gradient
does not scale linearly with batch size. In contrast,
SLAP requires no negatives and can theoretically be scaled
to vastly larger batch sizes, notably due to the lesser memory
footprint and the availability of gradient accumulation.
We study the effect of batch size on retrieval performance
by reporting A→T and T→A R@1 on PrivateCaps
for batch sizes from 64 to 768. For CLAP, batch size is
scaled directly. For SLAP, we fix a base batch of 128 and
accumulate gradients over $N = B // 128$ steps for each
target batch size $B$. We empirically confirm that gradient
accumulation in SLAP is equivalent to true batch scaling.
Results are shown in Figure 4.
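The reason accumulation works for a negative-free objective can be shown with a toy example. When the batch loss is a mean of terms that each depend on a single pair, the full-batch gradient equals the average of the micro-batch gradients. The sketch below uses a one-parameter squared-error loss as a stand-in for the per-pair cosine losses; the model, data, and function names are hypothetical. It does not model batch normalization in the predictor, whose statistics do depend on the micro-batch, so it illustrates the loss decomposition only.

```python
def mean_gradient(weight, inputs, targets):
    """Gradient of the mean of per-sample losses (weight * x - y) ** 2 w.r.t. weight."""
    per_sample = [2.0 * (weight * x - y) * x for x, y in zip(inputs, targets)]
    return sum(per_sample) / len(per_sample)


def accumulated_gradient(weight, inputs, targets, micro_batch_size):
    """Average of micro-batch gradients, as produced by gradient accumulation."""
    num_steps = len(inputs) // micro_batch_size
    total = 0.0
    for step in range(num_steps):
        start, stop = step * micro_batch_size, (step + 1) * micro_batch_size
        # Each micro-batch gradient is scaled by 1 / num_steps before summing.
        total += mean_gradient(weight, inputs[start:stop], targets[start:stop]) / num_steps
    return total


TARGET_BATCH, BASE_BATCH = 768, 128
num_accumulation_steps = TARGET_BATCH // BASE_BATCH  # N = B // 128

# Toy data (hypothetical): y = 0.5 * x + 1.
inputs = [0.01 * i for i in range(TARGET_BATCH)]
targets = [0.5 * x + 1.0 for x in inputs]

full_batch = mean_gradient(0.3, inputs, targets)
accumulated = accumulated_gradient(0.3, inputs, targets, BASE_BATCH)
```

The two gradients agree up to floating-point error. A contrastive loss does not decompose this way, because each term's softmax denominator involves every other sample in the batch.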
5.6 Role of Intermodal and Intramodal Loss terms
One key component of our approach is the multiple loss
terms between the different EMA encoders of the training
setup. We find that empirically, the absence of these losses
$\mathcal{L}_A$ and $\mathcal{L}_T$ leads to collapse of the model. We show this
in Figure 5 by reporting recall@$k$ and linear separability
between modalities for different balancing weights $\lambda$.
[Figure 5: Recall@k (k = 1, 5, 10) for A→T and T→A on MusicCaps and SongDescriber, and modality gap ($L_{A/T}$), each plotted against $\lambda$ from 0.0 to 1.0.]
Figure 5. Tuning of intramodality loss balancing weight
$\lambda$ for the SLAP objective, measuring both T→A and
A→T retrieval performance and $L_{A/T}$.
We find that within a range of $\lambda$, SLAP is robust for
retrieval. Values of $\lambda$ outside [0.2, 0.7] lead to almost systematic
collapse due to the lack of meaningful constraints
between or within modalities. Despite good retrieval robustness,
we find that any $\lambda \neq 0.5$ yields a rapidly increasing
modality gap. Only a balanced contribution of intra- and
inter-modality loss terms leads to the significant reduction
of the modality gap observed in Section 5.4.
6. CONCLUSION AND FUTURE WORK
We propose a novel approach for training multimodal joint
embedding spaces to align music and text representations:
Siamese Language-Audio Pretraining (SLAP). SLAP outperforms
contrastive models on tasks including text-music
retrieval, downstream probing, and zero-shot music understanding.
It offers key advantages over contrastive approaches,
including negative-free training for better scalability
and substantial reduction of the modality gap. Without
relying on extensive data augmentation, architectural
tuning, or hyperparameter optimization, SLAP matches
state-of-the-art performance, demonstrating its potential
as a scalable alternative to CLAP-style training.
CLIP/CLAP being a core component of many text-conditioned
models, several works have proposed to improve
them, e.g., by exploring alternative architectures [2,
58], scaling up training data [25,59], promoting semantic
or linguistic invariance [59,60] or improving the sensitivity
to fine-grained temporal events [3,5]. Importantly, these
improvements typically do not rely on the contrastive loss,
and SLAP could probably benefit from them as well.
Finally, while we demonstrate the effectiveness of our
method on music-related tasks, it is generalizable and
could be applied to other modalities such as images or general
audio. We therefore believe that this approach could
lead to many applications beyond the scope of our paper.
7. ACKNOWLEDGEMENT
This work is supported by the EPSRC UKRI Centre for
Doctoral Training in Artificial Intelligence and Music
(EP/S022694/1) and Universal Music Group.
8. REFERENCES
[1] A. Radford, J. W. Kim, C. Hallacy et al., “Learning
transferable visual models from natural language
supervision,” in International conference on machine
learning. PMLR, 2021, pp. 8748–8763.
[2] B. Elizalde, S. Deshmukh, M. Al Ismail et al., “Clap
learning audio concepts from natural language supervision,”
in ICASSP 2023-2023 IEEE International Conference
on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2023, pp. 1–5.
[3] J. Wu, W. Li, Z. Novack et al., “Collap: Contrastive
long-form language-audio pretraining with musical
temporal structure augmentation,” in ICASSP 2025-2025
IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2025,
pp. 1–5.
[4] T. Komatsu, H. Munakata, T. Hasumi et al., “Aligned
contrastive learning for text-to-music retrieval,” in
ICASSP 2025-2025 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2025, pp. 1–5.
[5] Y. Yuan, Z. Chen, X. Liu et al., “T-clap: Temporal-enhanced
contrastive language-audio pretraining,” in
2024 IEEE 34th International Workshop on Machine
Learning for Signal Processing (MLSP). IEEE, 2024,
pp. 1–6.
[6] I. Manco, J. Salamon, and O. Nieto, “Augment, drop &
swap: Improving diversity in llm captions for efficient
music-text representation learning,” in Proceedings of
the 25th International Society for Music Information
Retrieval Conference (ISMIR), 2024.
[7] A. Fahim, A. Murphy, and A. Fyshe, “It's Not a
Modality Gap: Characterizing and Addressing the
Contrastive Gap,” may 2024. [Online]. Available:
http://arxiv.org/abs/2405.18570
[8] V. W. Liang, Y. Zhang, Y. Kwon et al., “Mind the gap:
Understanding the modality gap in multi-modal contrastive
representation learning,” Advances in Neural
Information Processing Systems, vol. 35, pp. 17 612–17 625, 2022.
[9] H. Pham, Z. Dai, G. Ghiasi et al., “Combined scaling
for zero-shot transfer learning,” Neurocomputing, vol.
555, p. 126658, 2023.
[10] J.-B. Grill, F. Strub, F. Altché et al., “Bootstrap your
own latent-a new approach to self-supervised learning,”
Advances in neural information processing systems,
vol. 33, pp. 21 271–21 284, 2020.
[11] D. Niizumi, D. Takeuchi, Y. Ohishi et al., “Byol for audio:
Self-supervised learning for general-purpose audio
representation,” in 2021 International Joint Conference
on Neural Networks (IJCNN). IEEE, 2021,
pp. 1–8.
[12] T. Chen, S. Kornblith, M. Norouzi et al., “A simple
framework for contrastive learning of visual representations,”
in International conference on machine learning.
PMLR, 2020, pp. 1597–1607.
[13] J. Spijkervet and J. A. Burgoyne, “Contrastive learning
of musical representations,” in Proceedings of the 22nd
International Society for Music Information Retrieval
Conference (ISMIR), 2021.
[14] H. Akbari, L. Yuan, R. Qian, W. H. Chuang, S. F.
Chang, Y. Cui, and B. Gong, “VATT: Transformers
for Multimodal Self-Supervised Learning from Raw
Video, Audio and Text,” in Advances in Neural
Information Processing Systems, vol. 29. Neural
information processing systems foundation, apr 2021,
pp. 24 206–24 221. [Online]. Available:
https://arxiv.org/abs/2104.11178
[15] R. Arandjelovic and A. Zisserman, “Look, Listen
and Learn,” in Proceedings of the IEEE International
Conference on Computer Vision, vol. 2017-Octob.
Institute of Electrical and Electronics Engineers
Inc., may 2017, pp. 609–617. [Online]. Available:
https://arxiv.org/abs/1705.08168
[16] I. Manco, E. Benetos, E. Quinton et al., “Contrastive
audio-language learning for music,” in Proceedings of
the 23rd International Society for Music Information
Retrieval Conference, ISMIR 2022, Bengaluru, India,
December 4-8, 2022, P. Rao, H. A. Murthy, A. Srinivasamurthy,
R. M. Bittner, R. C. Repetto, M. Goto,
X. Serra, and M. Miron, Eds., 2022, pp. 640–649.
[17] Q. Huang, A. Jansen, J. Lee et al., “Mulan: A joint embedding
of music audio and natural language,” in Proceedings
of the 23rd International Society for Music
Information Retrieval Conference, ISMIR 2022, Bengaluru,
India, December 4-8, 2022, P. Rao, H. A.
Murthy, A. Srinivasamurthy, R. M. Bittner, R. C.
Repetto, M. Goto, X. Serra, and M. Miron, Eds., 2022,
pp. 559–566.
[18] G. Zhu, J. Darefsky, and Z. Duan, “Cacophony: An
improved contrastive audio-text model,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing,
2024.
[19] A. Guzhov, F. Raue, J. Hees et al., “Audioclip: Extending
clip to image, text and audio,” in ICASSP
2022-2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE,
2022, pp. 976–980.
[20] R. Girdhar, A. El-Nouby, Z. Liu et al., “Imagebind:
One embedding space to bind them all,” in Proceedings
of the IEEE/CVF conference on computer vision and
pattern recognition, 2023, pp. 15 180–15 190.
[21] G. Cicche i, E. G assucci, L. Sigillo, and D. Com-
miniello, “G amian Mul imodal Rep esen a ion Lea n-
ing and Alignmen ,” dec 2024. [Online]. A ailable:
h p://a xi .o g/abs/2412.11959
[22] A. Ramesh, P. Dha iwal, A. Nichol e al., “Hie a chi-
cal ex -condi ional image gene a ion wi h clip la en s,”
a Xi p ep in a Xi :2204.06125, ol. 1, no. 2, p. 3,
2022.
[23] J.-B. Alayrac, J. Donahue, P. Luc et al., “Flamingo:
a visual language model for few-shot learning,”
Advances in neural information processing systems,
vol. 35, pp. 23 716–23 736, 2022.
[24] Z. Kong, A. Goel, R. Badlani et al., “Audio Flamingo:
A Novel Audio Language Model with Few-Shot
Learning and Dialogue Abilities,” in Proceedings
of Machine Learning Research, vol. 235. ML
Research Press, Feb. 2024, pp. 25 125–25 148. [Online].
Available: https://arxiv.org/abs/2402.01831
[25] S. Ghosh, S. Kumar, C. K. R. Evuru et al., “Reclap:
Improving zero shot audio classification by describing
sounds,” in ICASSP 2025-2025 IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2025, pp. 1–5.
[26] X. Li, W. Chen, Z. Ma et al., “Drcap: Decoding clap
latents with retrieval-augmented generation for zero-shot
audio captioning,” in ICASSP 2025-2025 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2025, pp. 1–5.
[27] S. Ghosh, S. Kumar, C. K. R. Evuru et al., “Recap:
Retrieval-augmented audio captioning,” in ICASSP
2024-2024 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2024, pp. 1161–1165.
[28] Z. Evans, C. Carr, J. Taylor et al., “Fast
timing-conditioned latent audio diffusion,” in Proceedings of
the 41st International Conference on Machine
Learning, 2024, pp. 12 652–12 665.
[29] Z. Evans, J. D. Parker, C. Carr et al., “Long-form
music generation with latent diffusion,” in Proceedings of
the 25th International Society for Music Information
Retrieval Conference (ISMIR), 2024.
[30] J. Nistal, M. Pasini, C. Aouameur et al., “Diff-a-riff:
Musical accompaniment co-creation via latent
diffusion models,” in ISMIR, 2024.
[31] H. Liu, Z. Chen, Y. Yuan et al., “AudioLDM:
Text-to-audio generation with latent diffusion models,” in Proc.
ICML, 2023.
[32] A. Agostinelli, T. I. Denk, Z. Borsos et al.,
“MusicLM: Generating music from text,” arXiv preprint
arXiv:2301.11325, 2023.
[33] X. Zhai, B. Mustafa, A. Kolesnikov et al., “Sigmoid
Loss for Language Image Pre-Training,” in
Proceedings of the IEEE International Conference on
Computer Vision. Institute of Electrical and Electronics
Engineers Inc., Mar. 2023, pp. 11 941–11 952. [Online].
Available: https://arxiv.org/abs/2303.15343
[34] M. Assran, R. Balestriero, Q. Duval et al., “The
Hidden Uniform Cluster Prior in Self-Supervised
Learning,” in 11th International Conference on
Learning Representations, ICLR 2023, Oct. 2023.
[Online]. Available: http://arxiv.org/abs/2210.07277
[35] P. Y. Shi, M. Welle, M. Björkman et al.,
“Understanding the Modality Gap in CLIP,” in International
Conference on Learning Representations, no. 2023, 2023.
[36] X. Chen and K. He, “Exploring simple siamese
representation learning,” in Proceedings of the IEEE/CVF
conference on computer vision and pattern
recognition, 2021, pp. 15 750–15 758.
[37] M. Assran, Q. Duval, I. Misra et al., “Self-Supervised
Learning from Images with a Joint-Embedding
Predictive Architecture,” in 2023 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), Jun.
2023, pp. 15 619–15 629, ISSN: 2575-7075.
[38] Y. Tian, X. Chen, and S. Ganguli, “Understanding
Self-Supervised Learning Dynamics without Contrastive
Pairs,” Proceedings of Machine Learning Research,
vol. 139, pp. 10 268–10 278, Feb. 2021. [Online].
Available: http://arxiv.org/abs/2102.06810
[39] M. Oquab, T. Darcet, T. Moutakanni et al., “DINOv2:
Learning Robust Visual Features without Supervision,”
Apr. 2023. [Online]. Available:
http://arxiv.org/abs/2304.07193
[40] A. Baevski, W.-N. Hsu, Q. Xu et al., “data2vec: A
General Framework for Self-supervised Learning in
Speech, Vision and Language,” in Proceedings of
Machine Learning Research, Baltimore, MD, USA, 2022.
[41] X. Li and X. Li, “ATST: Audio Representation
Learning with Teacher-Student Transformer,”
in Proceedings of the Annual Conference of
the International Speech Communication
Association, INTERSPEECH, vol. 2022-Septe.
International Speech Communication Association,
Apr. 2022, pp. 4172–4176. [Online]. Available:
http://dx.doi.org/10.21437/Interspeech.2022-10126
[42] D. Niizumi, D. Takeuchi, Y. Ohishi et al., “Masked
Modeling Duo: Learning Representations by
Encouraging Both Networks to Model the Input,” in ICASSP
2023 - 2023 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). Rhodes
Island, Greece: IEEE, Jun. 2023, pp. 1–5.
[43] A. Quelennec, P. Chouteau, G. Peeters et al., “Masked
latent prediction and classification for self-supervised
audio representation learning,” in ICASSP 2025 - 2025
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2025, pp. 1–5.
[44] I. Manco, B. Weck, S. Doh et al., “The song describer
dataset: a corpus of audio captions for
music-and-language evaluation,” NeurIPS Machine Learning for
Audio Workshop, 2023.
[45] G. Tzanetakis and P. Cook, “Musical genre
classification of audio signals,” IEEE Transactions
on Speech and Audio Processing, vol. 10,
no. 5, pp. 293–302, 2002. [Online]. Available:
https://doi.org/10.1109/TSA.2002.800560
[46] E. Law, K. West, M. I. Mandel et al., “Evaluation
of Algorithms Using Games: The Case of Music
Tagging,” in Proceedings of the 10th International
Society for Music Information Retrieval Conference,
ISMIR 2009, Kobe International Conference Center,
Kobe, Japan, October 26-30, 2009. International
Society for Music Information Retrieval, 2009, pp.
387–392. [Online]. Available:
http://ismir2009.ismir.net/proceedings/OS5-5.pdf
[47] E. Humphrey, S. Durand, and B. McFee,
“Openmic-2018: An open data-set for multiple instrument
recognition.” in ISMIR, 2018, pp. 438–444.
[48] J. F. Gemmeke, D. P. W. Ellis, D. Freedman et al.,
“Audio set: An ontology and human-labeled dataset for
audio events,” in Proc. IEEE ICASSP 2017, New Orleans,
LA, 2017.
[49] Y. Wu, K. Chen, T. Zhang et al., “Large-scale
contrastive language-audio pretraining with feature
fusion and keyword-to-caption augmentation,” in Proc.
ICASSP. IEEE, 2023, pp. 1–5.
[50] K. Chen, X. Du, B. Zhu et al., “Hts-at: A hierarchical
token-semantic audio transformer for sound
classification and detection,” in ICASSP 2022-2022 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2022, pp. 646–650.
[51] Y. Liu, M. Ott, N. Goyal et al., “Roberta: A robustly
optimized bert pretraining approach,” arXiv preprint
arXiv:1907.11692, 2019.
[52] D. S. Park, W. Chan, Y. Zhang et al., “Specaugment: A
simple data augmentation method for automatic speech
recognition,” Interspeech 2019, p. 2613, 2019.
[53] S. Lattner, “Samplematch: Drum sample retrieval
by musical context,” in Proceedings of the 23rd
International Society for Music Information
Retrieval Conference, ISMIR 2022, Bengaluru, India,
December 4-8, 2022, 2022, pp. 781–788. [Online].
Available: http://arxiv.org/abs/2208.01141
https://archives.ismir.net/ismir2022/paper/000094.pdf
[54] K. Koutini, J. Schlüter, H. Eghbal-zadeh et al.,
“Efficient training of audio transformers with patchout,”
Interspeech 2022, 2022.
[55] R. Castellon, C. Donahue, and P. Liang, “Codified
audio language modeling learns useful representations
for music information retrieval,” International Society
for Music Information Retrieval (ISMIR), 2021.
[56] S. Chen, Y. Wu, C. Wang et al., “Beats: audio
pre-training with acoustic tokenizers,” in Proceedings of
the 40th International Conference on Machine
Learning, 2023, pp. 5178–5193.
[57] J. Nistal, M. Pasini, and S. Lattner, “Improving
musical accompaniment co-creation via diffusion
transformers,” in Audio Imagination: NeurIPS 2024
Workshop AI-Driven Speech, Music, and Sound Generation,
2024.
[58] M. Zhao, J. Ono, Z. Zhong et al., “On the Language
Encoder of Contrastive Cross-modal Models,” Oct.
2023. [Online]. Available:
http://arxiv.org/abs/2310.13267
[59] S. Ghosh, Z. Kong, S. Kumar et al., “Audio
Flamingo 2: An Audio-Language Model with
Long-Audio Understanding and Expert Reasoning
Abilities,” Mar. 2025. [Online]. Available:
http://arxiv.org/abs/2503.03983
[60] X. Dong, J. Bao, Y. Zheng, T. Zhang, D. Chen,
H. Yang, M. Zeng, W. Zhang, L. Yuan, D. Chen,
F. Wen, and N. Yu, “MaskCLIP: Masked
Self-Distillation Advances Contrastive Language-Image
Pretraining,” pp. 10 995–11 005, Aug. 2023. [Online].
Available: https://arxiv.org/abs/2208.12262