
Comparison of Audio Encoders for Audio-Text Contrastive Learning Representations

Author: Cárdenas Gracia, Sergio
Publisher: Zenodo
DOI: 10.5281/zenodo.17304842
Source: https://zenodo.org/records/17304842/files/Sergio-Cardenas_SMC_2025_Master_Thesis.pdf
Master thesis on Sound and Music Computing
Universitat Pompeu Fabra
Comparison of Audio Encoders for
Audio-Text Contrastive Learning
Representations
Sergio Cárdenas Gracia
Supervisor: Pablo Alonso Jiménez
Co-Supervisor: Dmitry Bogdanov
July 2025
Contents
1 Introduction
1.1 Motivation
1.2 Scope of the project
1.3 Structure of the thesis
2 State of the Art
2.1 Contrastive learning
2.2 Text encoders
2.3 Audio encoders
2.4 Model training
2.5 Training datasets
3 Methodology
3.1 Training setup
3.1.1 Dataset
3.1.2 Model architecture
3.2 Evaluation setup
3.2.1 Zero-shot classification on the GTZAN dataset
3.2.2 Multi-label classification on the MagnaTagATune dataset
3.2.3 Text-to-music retrieval on the Song Describe dataset
3.3 Experiments
3.3.1 HTSAT-base (initialized weights) + RoBERTa (frozen)
3.3.2 MAEST-10s (initialized weights) + RoBERTa (frozen)
3.3.3 Hyperparameters exploration
3.3.4 Evaluation methods
4 Results
4.1 HTSAT vs MAEST
4.2 Hyperparameters effect
4.2.1 Batch size
4.2.2 Weight Decay
4.2.3 Learning Rate Decay
4.3 Evaluation methods
5 Conclusions
5.1 Discussion
5.2 Conclusions
5.3 Future work
List of Figures
List of Tables
Bibliography
A Training and Validation Loss Graphs
Acknowledgement
I would like to express my sincere gratitude to my supervisors, Pablo Alonso and Dmitry Bogdanov, for their continuous support and invaluable guidance throughout this project. Their expertise and ambition have been essential to my learning and progress.
I am also deeply thankful to my family and friends for their unwavering support and encouragement, which have been a constant source of strength.
Finally, I would like to thank my colleagues from the Master’s in Sound and Music Computing. Sharing this journey with them has been a rewarding experience, filled with many memorable moments and mutual learning.
Chapter 2. State of the Art
While this fusion strategy shows promise for general audio, it performs poorly in the context of music. As a result, the fusion-based LAION-AI CLAP is not considered suitable for this project. Instead, the non-fusion variant of LAION-AI CLAP is chosen as the project’s baseline, as it offers a more recent and refined implementation compared to the original Microsoft CLAP.
2.2 Text encoders
Text encoders are essential for generating meaningful representations of textual data that can be matched with audio content. Current state-of-the-art text encoders for multimodal tasks are typically based on transformer architectures [4].
While OpenAI’s CLIP used a transformer-based text encoder trained from scratch, recent best practices in multimodal learning favor using pretrained language models such as BERT [5] or RoBERTa [6]. Microsoft CLAP adopts BERT as its text encoder, whereas LAION-AI CLAP uses RoBERTa.
Although other pretrained encoders have been explored in multimodal contexts, BERT and RoBERTa remain the most widely adopted for audio–text alignment, and the discussion here is therefore centered on them. Both produce high-quality, context-aware embeddings that can be aligned with audio representations. However, RoBERTa is considered a more optimized and robust variant of BERT, making it the preferred choice for current multimodal applications.
2.3 Audio encoders
Audio encoders are critical for extracting meaningful features from raw audio that can be aligned with textual descriptions.
In recent years, PANN [7] and HTSAT [8] have been widely selected as the primary audio encoders for multimodal tasks in the audio-text domain. PANN (Pre-trained Audio Neural Network) is a CNN-based audio classification model with 7 downsampling CNN blocks and 7 upsampling blocks. HTSAT (Hierarchical Token-Semantic Audio Transformer), on the other hand, is a transformer-based model that uses four groups of Swin Transformer blocks [9] to capture complex patterns in audio spectrograms with a focus on optimization.
These encoders were used in Microsoft CLAP and LAION-AI CLAP, respectively (PANN in the former and HTSAT in the latter). Among the two, HTSAT is generally considered the more suitable option, particularly due to its strong performance on a variety of audio understanding tasks.
That said, MAEST [10] (Music Audio Efficient Spectrogram Transformer) is a newer encoder specifically designed for music-related tasks. Compared to HTSAT, it is a larger and more complex model, and it has shown promising results. This makes it a valuable candidate to evaluate in a contrastive learning setup focused on music, particularly in low-data scenarios where model efficiency and generalization are critical.
2.4 Model training
Training a contrastive learning model involves minimizing a loss function that brings matching audio-text pairs closer together in the embedding space, while pushing apart non-matching pairs. To achieve this, both audio and text inputs are passed through their respective encoders and then projected into a shared embedding space using modality-specific linear projections.
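As an illustration of the objective described above, the following minimal sketch computes a symmetric cross-entropy contrastive loss over a batch of already projected embeddings. It is written in plain Python for readability. The function names, the temperature value of 0.07 and the toy embeddings are assumptions made for this example; they are not taken from the thesis or from the LAION-AI CLAP code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    audio_emb[i] and text_emb[i] form a matching pair; every other
    combination in the batch acts as a negative. The temperature is an
    assumed value, not one reported in the thesis.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    text = [l2_normalize(t) for t in text_emb]
    # Similarity matrix: rows are audio items, columns are text items.
    logits = [[sum(x * y for x, y in zip(a, t)) / temperature for t in text]
              for a in audio]
    # Transpose to obtain the text-to-audio direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch of two pairs in a 2-dimensional shared space (hypothetical values).
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, the loss is close to zero when each audio embedding points in the same direction as its paired text embedding, and it is large when the pairs are swapped, which is the behaviour the training objective relies on.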
Although training all components jointly is standard practice, this approach can be computationally expensive. To address this, several alternatives have been proposed [11, 12], which aim to reduce training costs without significantly sacrificing performance.
One practical strategy in multimodal settings is to initialize the audio encoder with pretrained weights from a related domain, such as vision or general audio classification, rather than training from scratch. Another common approach is to freeze the text encoder, especially when it has already been pretrained on large-scale language corpora. These techniques help reduce computational demands while still enabling effective contrastive learning.
2.5 Training datasets
Training contrastive learning models requires paired audio-text data, where the text can consist of tags or natural language descriptions. Typically, large-scale datasets are used to achieve strong performance in this setting.
For reference, the LAION-AI CLAP model for music was trained on the LAION-Audio-630k dataset (630,000 audios of human activities, natural sounds and audio effects), the AudioSet (approximately 2 million samples of human-labeled 10-second sound clips drawn from YouTube videos), and a combination of music-related datasets¹.
However, these datasets are either not specifically curated for music or contain noisy and low-quality annotations. Additionally, working with such large-scale datasets is not always practical due to computational limitations, especially in smaller-scale or academic environments.
¹ Check reference for more details on the music datasets: https://github.com/LAION-AI/audio-dataset/blob/main/data_collection/README.md
Chapter 3
Methodology
3.1 Training setup
Establishing a solid training setup involves carefully selecting both the data and the model architecture. As discussed in earlier sections, this project prioritizes high-quality, openly available audio-text datasets. For this purpose, the MTG-Jamendo dataset [13] is a strong fit.
3.1.1 Dataset
For training, the MTG-Jamendo split-0 is used, which is a standard partition for research. This split consists of 32,859 audio-text pairs for training and 11,101 for validation. The dataset is converted into the WebDataset format, a structure optimized for scalable machine learning workflows. In this format, each pair is represented by an audio path and a comma-separated list of tags associated with the track.
3.1.2 Model architecture
The models follow the LAION-AI CLAP architecture, which serves as the baseline for this project. This includes both the component structure and the training configuration. The architecture consists of separate encoders for audio and text, followed by linear projections into a shared embedding space.
To reduce training complexity, the text encoder is frozen during training, since it is already pretrained on large-scale language data. For the audio encoder, pretrained weights are used to avoid training from scratch.
By default, the HTSAT is used as the audio encoder. However, the MAEST encoder is also integrated into the implementation to enable comparison. This makes it possible to train models using either encoder within the same framework.
3.2 Evaluation setup
Choosing a meaningful evaluation strategy is essential for accurately assessing and comparing model performance. To this end, three distinct evaluation tasks are defined, each highlighting different aspects of multimodal learning: zero-shot classification using the GTZAN dataset [14], multi-label classification using the MagnaTagATune dataset [15], and text-to-music retrieval using the Song Describe dataset [16].
3.2.1 Zero-shot classification on the GTZAN dataset
Zero-shot classification tests a model’s ability to assign audio samples to predefined categories without task-specific training. Instead, the model matches audio to text representations of each category by comparing embeddings in a shared space.
The GTZAN dataset is well-suited for zero-shot classification, as it provides 30-second audio samples of 10 distinct music genres: blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock. Each genre includes 100 samples, except for jazz, which has 99 due to a corrupted file.
To perform zero-shot classification, a text embedding is created for each genre using a simple prompt format: “This is a {genre} song.” These serve as the class representations in the embedding space. Next, audio embeddings are computed for all tracks. For each audio embedding, similarity scores are calculated with each of the text embeddings using the dot product. The genre corresponding to the highest similarity score is selected as the predicted label.
Finally, predicted labels are compared to ground truth annotations, and the overall classification accuracy is then used to evaluate model performance on this zero-shot task.
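The zero-shot procedure can be sketched in a few lines of plain Python. The prompt template and the genre list come from the text above; the helper names and the 2-dimensional embeddings are hypothetical placeholders standing in for the outputs of the text and audio encoders.

```python
def dot(u, v):
    """Dot product between two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def zero_shot_predict(audio_emb, class_embs):
    """Return the label whose text embedding has the highest dot product."""
    return max(class_embs, key=lambda label: dot(audio_emb, class_embs[label]))

def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the ground truth labels."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

genres = ["blues", "classical", "country", "disco", "hip-hop",
          "jazz", "metal", "pop", "reggae", "rock"]
# One prompt per genre; in a real run each prompt is passed to the text encoder.
prompts = [f"This is a {genre} song." for genre in genres]

# Hypothetical embeddings for two classes and three tracks (illustration only).
class_embs = {"jazz": [0.9, 0.1], "rock": [0.1, 0.9]}
audio_embs = [[0.8, 0.2], [0.2, 0.7], [0.6, 0.5]]
labels = ["jazz", "rock", "rock"]

preds = [zero_shot_predict(a, class_embs) for a in audio_embs]
acc = accuracy(preds, labels)
```

With these placeholder values the third track is misclassified, so the accuracy is two out of three; no task-specific training is involved at any point.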
3.2.2 Multi-label classification on the MagnaTagATune dataset
Multi-label classification evaluates a model’s ability to assign multiple relevant tags to a single audio track. The MagnaTagATune dataset is well-suited for this task, containing 29-second clips annotated with one or more labels drawn from a pool of 188 music-related tags.
To simplify the task and ensure reliable evaluation, only the 50 most frequent tags are considered, following a common practice. The dataset is divided into training (15,244 samples), validation (1,529 samples), and test (4,332 samples) subsets. Audio embeddings are precomputed for all tracks.
A lightweight classifier is trained on top of these embeddings using a transfer learning approach. The classifier is a two-layer feedforward neural network (MLP) that takes an audio embedding as input and outputs probabilities for each of the 50 tags. A sigmoid activation function allows each tag to be predicted independently. Tags are assigned when their predicted probability exceeds a defined threshold, enabling multi-label predictions per track.
Evaluation is conducted using the Area Under the Receiver Operating Characteristic Curve (AUROC), which measures the model’s capacity to distinguish between classes across various decision thresholds, and the Mean Average Precision (MAP), which evaluates the ranking quality and precision across all relevant labels, providing a robust framework for assessing how well the multimodal model captures diverse musical attributes in a multi-label context.
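To make the two metrics concrete, the sketch below computes AUROC and average precision per tag and then averages over tags. It is a minimal plain-Python illustration, not the evaluation code used in the thesis: the macro averaging over tags, the 0.5 threshold, the function names and the toy probabilities are all assumptions, and the sketch assumes that every tag has at least one positive and one negative track.

```python
def auroc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

def macro_average(metric, score_matrix, label_matrix):
    """Apply a per-tag metric to each column and average the results."""
    per_tag = [metric(list(s), list(y))
               for s, y in zip(zip(*score_matrix), zip(*label_matrix))]
    return sum(per_tag) / len(per_tag)

# Rows are tracks, columns are tags (hypothetical sigmoid outputs and labels).
probs = [[0.9, 0.2], [0.6, 0.7], [0.3, 0.8], [0.1, 0.4]]
truth = [[1, 0], [0, 1], [1, 1], [0, 0]]

threshold = 0.5  # assumed value; the thesis only mentions "a defined threshold"
assigned = [[int(p > threshold) for p in row] for row in probs]

mean_auroc = macro_average(auroc, probs, truth)
mean_ap = macro_average(average_precision, probs, truth)
```

Both metrics are computed from the raw probabilities rather than from the thresholded tags, which is why they summarize behaviour across decision thresholds instead of at a single operating point.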
3.2.3 Text-to-music retrieval on the Song Describe dataset
Text-to-music retrieval evaluates the model’s ability to retrieve relevant audio tracks based on natural language queries. The Song Describe dataset is well-suited for this task, as it contains audio tracks paired with human-written captions that describe musical content. For evaluation, the validated subset of the dataset is used, which includes 746 unique audio tracks.
However, the dataset contains a total of 1,106 audio-text pairs, since some tracks are associated with multiple captions. This adds value from an evaluation perspective, as it allows the model to retrieve a correct audio track based on a caption that may not be its exact pair.
To carry out the retrieval, embeddings are computed for all audio tracks and text captions. A similarity matrix is then built using the dot product between each text and audio embedding. Each caption is treated as a query, and all tracks are ranked based on similarity scores.
Performance is evaluated using Median Rank (MedR), which measures the median position of the correct audio track across all queries, and Recall at K (R@k), which indicates the proportion of queries for which the correct track appears in the top k results. These metrics together provide a strong indication of how effectively the model aligns text descriptions with musical content.
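The following plain-Python sketch shows one way to derive MedR and R@k from a caption-by-track similarity matrix. The function names and the toy similarity values are hypothetical, and reporting R@k as a percentage is a choice made for this example rather than something stated in the text.

```python
import statistics

def rank_of_correct(similarities, correct_index):
    """1-based rank of the correct track when sorting by descending similarity."""
    target = similarities[correct_index]
    return 1 + sum(1 for s in similarities if s > target)

def retrieval_metrics(similarity_matrix, correct_tracks, ks=(1, 5, 10)):
    """Compute MedR and R@k, where each row is one caption used as a query."""
    ranks = [rank_of_correct(row, idx)
             for row, idx in zip(similarity_matrix, correct_tracks)]
    metrics = {"MedR": statistics.median(ranks)}
    for k in ks:
        # Share of queries whose correct track is within the top k, in percent.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in ranks) / len(ranks)
    return metrics

# Four captions queried against three tracks; two captions describe track 1.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
    [0.3, 0.2, 0.7],
]
correct = [0, 1, 1, 2]
metrics = retrieval_metrics(similarity, correct, ks=(1, 2))
```

Because the matrix has one row per caption and one column per unique track, tracks with several captions are handled naturally: each caption is simply a separate query pointing at the same column.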
3.3 Experiments
This section outlines the experiments conducted throughout the project.
3.3.1 HTSAT-base (initialized weights) + RoBERTa (frozen)
A CLAP model is trained using the HTSAT audio encoder with initialized pretrained weights², alongside a frozen RoBERTa text encoder. The specific HTSAT implementation used is the base version, designed to process 10-second audio clips.
This model serves as the baseline for the project. Training a model with the same encoders as LAION-AI CLAP, but on a smaller dataset, ensures a fair comparison with other experimental models.
² Pretrained weights are available at the following link: https://github.com/LAION-AI/CLAP/blob/main/README.md#reproducibility
3.3.2 MAEST-10s (initialized weights) + RoBERTa (frozen)
An experimental CLAP model is trained using the MAEST audio encoder, specifically the “discogs-maest-10s-pw-129e” variant, which also processes 10-second audio segments. As with the baseline, the RoBERTa text encoder remains frozen, and pretrained weights initialize the audio encoder. The goal of this experiment is to compare the performance of the MAEST encoder against the baseline HTSAT within a multimodal representation setting.
3.3.3 Hyperparameters exploration
To better understand the impact of training configurations, various hyperparameter settings are explored throughout the training process.
A key parameter under investigation is batch size, which is known to be directly related to model performance in contrastive learning settings. However, due to computational constraints, the batch sizes used in this project are smaller than those commonly used in large-scale contrastive learning setups.
Furthermore, several hyperparameters intended to enhance generalization in low-data contexts are examined. In particular, weight decay and learning rate decay are explored as strategies to reduce overfitting and promote more stable training under low-data conditions.
3.3.4 Evaluation methods
For evaluation, two approaches are considered for computing audio embeddings: one involves extracting embeddings from a randomly selected 10-second segment of each audio track, and the other computes embeddings for every 10-second segment and averages them to obtain the final representation. These methods help assess how the choice of segment affects the quality of audio representations.
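The two strategies can be sketched as follows. This is a minimal illustration under stated assumptions: the 48 kHz sample rate, the decision to drop a trailing segment shorter than 10 seconds, the helper names and the stand-in `toy_embed` function are all hypothetical and are not specified in the text.

```python
import random

def split_into_segments(num_samples, sample_rate, segment_seconds=10):
    """Return (start, end) sample indices of consecutive full-length segments.

    A trailing remainder shorter than one segment is dropped (an assumption).
    """
    seg_len = segment_seconds * sample_rate
    return [(start, start + seg_len)
            for start in range(0, num_samples - seg_len + 1, seg_len)]

def random_segment_embedding(segments, embed, rng):
    """Strategy 1: embed a single randomly chosen segment."""
    return embed(rng.choice(segments))

def averaged_embedding(segments, embed):
    """Strategy 2: embed every segment and average element-wise."""
    embs = [embed(seg) for seg in segments]
    return [sum(values) / len(embs) for values in zip(*embs)]

SAMPLE_RATE = 48_000  # assumed sample rate, used only for this example

def toy_embed(segment):
    """Stand-in for the audio encoder: encodes the segment start time."""
    start, _ = segment
    return [start / SAMPLE_RATE, 1.0]

# A 30-second track yields three 10-second segments.
segments = split_into_segments(30 * SAMPLE_RATE, SAMPLE_RATE)
single = random_segment_embedding(segments, toy_embed, random.Random(0))
average = averaged_embedding(segments, toy_embed)
```

The random-segment strategy costs one encoder pass per track, while the averaged strategy costs one pass per segment but uses the whole track.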
Chapter 4
Results
4.1 HTSAT vs MAEST
The first comparison between models using different audio encoders has been conducted, with both models trained using a batch size of 64 and the default hyperparameters provided by the LAION-AI CLAP implementation. The results of this evaluation are shown in Table 1.
Task                                Metric     HTSAT    MAEST
Zero-shot classification (GTZAN)    Accuracy   51.05    28.73
Multi-label classification (MTT)    AUROC      0.807    0.786
                                    MAP        0.284    0.263
Text-to-music retrieval (SD)        MedR ↓     140      198
                                    R@1        0.94     0.80
                                    R@5        4.42     2.95
                                    R@10       9.92     5.36
Table 1: Comparison of performance of models trained using HTSAT and MAEST audio encoders with batch size = 64, both initialized with pretrained weights.
The model using the HTSAT audio encoder demonstrates solid performance across the various tasks, particularly given the limited amount of training data used compared to the large-scale datasets employed in the original LAION-AI models. In contrast, the model using the MAEST audio encoder shows significantly lower performance in most evaluation tasks.
Notably, the MAEST-based model appears to suffer from overfitting, suggested by the validation loss increasing significantly while the training loss decreases, as shown in Figure 1 and Figure 2. To further explore this issue, the following section analyzes the impact of various hyperparameters, including those that may help improve generalization, such as weight decay and learning rate decay.
4.2 Hyperparameters effect
This section explores the effects of three key hyperparameters: batch size, weight decay, and learning rate decay. Batch size is evaluated using the HTSAT baseline model to understand its impact on performance in contrastive learning tasks. In contrast, weight decay, which helps prevent overfitting by discouraging large weights, and learning rate decay, which reduces the learning rate across layers to enable more stable convergence, are explored as strategies to address the overfitting observed in the MAEST-based model.
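The text does not spell out the exact form of either mechanism, so the sketch below shows one common reading of each: a layer-wise schedule in which earlier layers receive geometrically smaller learning rates, and a plain gradient step with an added weight decay term. The function names, the geometric rule and all numeric values are assumptions for illustration only and may differ from what the LAION-AI CLAP training code does.

```python
def layerwise_learning_rates(base_lr, decay, num_layers):
    """Learning rate per layer, from the first layer to the last.

    The last layer keeps base_lr; each earlier layer is scaled by one more
    factor of `decay` (an assumed geometric rule).
    """
    return [base_lr * decay ** (num_layers - 1 - layer)
            for layer in range(num_layers)]

def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One gradient step in which every weight is also pulled towards zero."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

# Three layers, halving the learning rate for each step towards the input.
rates = layerwise_learning_rates(base_lr=1e-4, decay=0.5, num_layers=3)

# With zero gradients, the update isolates the shrinking effect of weight decay.
updated = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.1)
```

In the second call the weights move from 1.0 and -2.0 towards zero even though the gradients are zero, which is the sense in which weight decay discourages large weights.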
4.2.1 Batch size
The performance of the LAION-AI CLAP implementation, using the HTSAT-base audio encoder (initialized with pretrained weights) and the RoBERTa text encoder, has been evaluated with three different batch sizes: 16, 32, and 64. The results of this comparison are shown in Table 2.
As shown in Table 2, the relationship between batch size and performance is inconsistent across tasks when using pretrained weights. While larger batch sizes show improvements in multi-label classification, performance in zero-shot classification and text-to-music retrieval is less consistent. This behaviour may result from the pretrained weights already providing structured semantic representations. As a result, the model becomes less sensitive to the number of negative samples per batch, and thus to batch size.
Chapter 5. Conclusions
pretrained weights and a frozen RoBERTa text encoder proved to be the most effective configuration for the LAION-AI CLAP framework under limited resources.
Another key insight is the critical role of data volume, not only for learning robust multimodal representations but also for achieving reliable and meaningful evaluation. Models trained with limited data remain substantially behind state-of-the-art performance, highlighting the dependence of contrastive learning methods on large datasets.
Finally, under constrained resources that restrict batch size, performance appears more influenced by factors such as audio encoder weight initialization. The inability to train with larger batch sizes is a significant limitation of this study.
5.3 Future work
A natural next step for this project is to investigate a simplified version of the MAEST-based model, for example, by removing certain layers or reducing its dimensionality, to improve its suitability for low-data settings.
Another potential direction for future work involves training models under low-data conditions but with larger batch sizes, which requires more computational resources. Additionally, expanding the amount of training data would enable a more comprehensive analysis of how both batch size and data volume influence model performance.
Finally, a more exploratory path could involve experimenting with alternative encoder combinations, including newer audio and text encoders, or investigating different approaches to learning multimodal audio-text representations beyond the conventional contrastive learning framework, such as approaches benefiting from large language models (LLMs) [17].

List of Figures
1 Training loss curves: (a) audio encoders comparison, (b) batch size comparison, (c) weight decay comparison, (d) learning rate decay comparison.
2 Validation loss curves: (a) audio encoders comparison, (b) batch size comparison, (c) weight decay comparison, (d) learning rate decay comparison.
List of Tables
1 Comparison of performance of models trained using HTSAT and MAEST audio encoders with batch size = 64, both initialized with pretrained weights.
2 Comparison of batch size performance of trained HTSAT models initialized with pretrained weights.
3 Comparison of batch size performance of trained HTSAT models initialized with random weights.
4 Comparison of weight decay effect of trained MAEST models initialized with pretrained weights.
5 Comparison of learning rate decay effect of trained MAEST models initialized with pretrained weights.
6 Comparison of evaluation methods of the LAION-AI CLAP pretrained checkpoint.
Bibliography
[1] Elizalde, B., Deshmukh, S., Al Ismail, M. & Wang, H. Clap learning audio concepts from natural language supervision. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023).
[2] Wu, Y. et al. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023).
[3] Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763 (PMLR, 2021).
[4] Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[5] Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (2019).
[6] Liu, Y. et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[7] Kong, Q. et al. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020).
[8] Chen, K. et al. Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 646–650 (IEEE, 2022).
[9] Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).
[10] Alonso-Jiménez, P., Serra, X. & Bogdanov, D. Efficient supervised training of audio transformers for music representation learning. arXiv preprint arXiv:2309.16418 (2023).
[11] Maniparambil, M. et al. Harnessing frozen unimodal encoders for flexible multimodal alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, 29847–29857 (2025).
[12] Qin, J., Liu, C., Cheng, S., Guo, Y. & Arcucci, R. Freeze the backbones: a parameter-efficient contrastive approach to robust medical vision-language pre-training. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1686–1690 (IEEE, 2024).
[13] Bogdanov, D., Won, M., Tovstogan, P., Porter, A. & Serra, X. The mtg-jamendo dataset for automatic music tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML), 1–3 (2019).
[14] Tzanetakis, G. & Cook, P. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10, 293–302 (2002).
[15] Law, E., West, K., Mandel, M. I., Bay, M. & Downie, J. S. Evaluation of algorithms using games: The case of music tagging. In ISMIR, 387–392 (2009).
[16] Manco, I. et al. The song describe dataset: A corpus of audio captions for music-and-language evaluation. arXiv preprint arXiv:2311.10057 (2023).
[17] Gardner, J., Durand, S., Stoller, D. & Bittner, R. M. Llark: A multimodal instruction-following language model for music. arXiv preprint arXiv:2310.07160 (2023).

Appendix A
Training and Validation Loss Graphs
Figure 1: Training loss curves: (a) audio encoders comparison, (b) batch size comparison, (c) weight decay comparison, (d) learning rate decay comparison.
Figure 2: Validation loss curves: (a) audio encoders comparison, (b) batch size comparison, (c) weight decay comparison, (d) learning rate decay comparison.