
Open-Domain Zero-Shot Audio Tagging: Evaluation via Semantic Embeddings

Author: Yapici, Tolga
Publisher: Zenodo
DOI: 10.5281/zenodo.17305082
Source: https://zenodo.org/records/17305082/files/Tolga_Yapici_Master_Thesis_2025.pdf
Master in Sound and Music Computing
Universitat Pompeu Fabra
Open-Domain Zero-Shot Audio Tagging:
Evaluation via Semantic Embeddings
Tolga Yapici
Supervisor: Panagiota Anastasopoulou
Co-Supervisor: Frederic Font
July 2025
Contents
1 Introduction
1.1 Motivation
1.2 Key Concepts
1.2.1 Folksonomy
1.2.2 Co-occurrence Based Tag Recommendation
1.2.3 Zero-Shot Classification
1.2.4 Audio-Text Representation Models
1.2.5 LAION-CLAP
1.3 Objectives
1.4 Scope
2 Background
2.1 Freesound Tag Recommender (RankST)
2.1.1 Co-occurrence Based Tag Recommendation
2.1.2 Rank Aggregation and Adaptive Cutoff
2.1.3 Limitations
2.2 Contrastive Language-Audio Pretraining
2.2.1 CLAP Model Architecture
2.2.2 Models & Benchmarks
2.2.3 Prompt Design
2.2.4 Challenges in Folksonomy-Based Tagging
2.3 Semantic Evaluation of Tag Recommendations
2.3.1 Limitations of Exact Matching
2.3.2 Embedding-Based Semantic Evaluation with SBERT
2.3.3 Application to Audio Tagging
3 Methods
3.1 Dataset and Preprocessing
3.1.1 BSD10k Dataset
3.1.2 Dataset Preprocessing
3.1.3 Tag Vocabulary
3.2 LAION-CLAP Embeddings
3.2.1 Audio Embeddings
3.2.2 Tag Embeddings
3.3 Tag Recommendation Systems
3.3.1 RankST
3.3.2 Zero-Shot Baseline
3.3.3 Zero-Shot with DF Weighting
3.4 Evaluation Methodology
3.4.1 Evaluation Metrics
4 Results
4.1 Zero-Shot Performance
4.2 RankST Benchmark
4.3 Results Overview
5 Discussion
5.0.1 Zero-Shot Tagging Performance
5.0.2 Impact of Embedding-Based Semantic Evaluation
5.0.3 Tagset Quality and Noise
5.0.4 Ground Truth Limitations
5.0.5 System Performance Comparison
6 Conclusion
List of Figures
List of Tables
Bibliography
A First Appendix
B Second Appendix

Dedication
To my loving family, my dear friends, and the beautiful city of Barcelona.
Acknowledgement
I thank my supervisors, Penny and Frederic, for their support and sincerity, and
my friends in the Master's program and the Music Technology Group who have
supported my growth academically and beyond.
Chapter 1. Introduction
zero-shot audio classification by comparing similarity between audio and text
embeddings, typically using cosine similarity [3][4].
1.2.5 LAION-CLAP
LAION-CLAP (Contrastive Language-Audio Pretraining) is an open-source audio-text
representation model trained on over 630,000 audio–text pairs spanning diverse
audio content. Its architecture (shown in Figure 4) uses separate encoders for
audio and text, optimized via contrastive loss to produce aligned audio-text
embeddings [4].
1.3 Objectives
This thesis aims to evaluate whether a zero-shot audio-text model can effectively
recommend tags for heterogeneous audio content without supervised training on the
target dataset. The specific objectives are:
• Benchmark the performance of LAION-CLAP for tag recommendation in a
zero-shot setting on Freesound.
• Compare zero-shot predictions with Freesound's co-occurrence-based
predictions in tagging accuracy.
• Evaluate whether semantic similarity metrics (via SBERT) capture meaningful
tags missed by exact string matching.
1.4 Scope
This thesis evaluates zero-shot audio tagging performance on a curated
heterogeneous subset of Freesound (BSD10k), representing diverse general-purpose
audio content. Evaluation focuses on methods incorporating tag weighting via
document frequency and semantic evaluation through sentence embeddings.
Performance is measured using precision, recall, and F1-score computed under both
exact string matching and semantic matching criteria. Evaluations are performed
in a zero-shot setting without fine-tuning or retraining of underlying models.
Chapter 2
Background
2.1 Freesound Tag Recommender (RankST)
2.1.1 Co-occurrence Based Tag Recommendation
The RankST algorithm (Font et al. 2012) is the foundation of Freesound's tag
recommendation system. It estimates tag relevance from tag-tag co-occurrence
statistics. Tag co-occurrence counts are used to compute the tag similarity
matrix $S = DD^{\top}$, where each element $s_{ij}$ indicates the number of
sounds in which tags $i$ and $j$ appear together [2]. Given a set of input tags,
RankST retrieves a list of candidate tags, taking the top-$N$ most similar tags
to each input tag. For example, for an input tag drums, the system may suggest
related tags such as percussion or rhythm if these frequently co-occur in the
annotations. This approach relies on the principle that if two tags frequently
co-occur, they are likely to describe related sound content.
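The co-occurrence counting described above can be sketched in a few lines of plain Python. This is only an illustration, not the Freesound implementation: the toy annotations and the helper names `cooccurrence_counts` and `top_n_similar` are hypothetical. Under the assumption that $D$ is a binary tag-by-sound matrix, each pair count below corresponds to an off-diagonal entry $s_{ij}$ of $S = DD^{\top}$.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(annotations):
    """Count, for each unordered tag pair, the sounds in which both tags appear."""
    counts = Counter()
    for tags in annotations:
        # Deduplicate and sort so every pair has one canonical (a, b) key.
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


def top_n_similar(tag, counts, n):
    """Return up to n tags that co-occur most often with `tag`."""
    neighbours = Counter()
    for (a, b), c in counts.items():
        if a == tag:
            neighbours[b] += c
        elif b == tag:
            neighbours[a] += c
    # Descending count, then alphabetical order to break ties deterministically.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


# Hypothetical annotations: one tag list per sound.
annotations = [
    ["drums", "percussion", "loop"],
    ["drums", "percussion", "rhythm"],
    ["drums", "rhythm"],
    ["rain", "field-recording"],
]
counts = cooccurrence_counts(annotations)
neighbours = top_n_similar("drums", counts, n=2)
print(neighbours)
```

With these toy annotations, percussion and rhythm each co-occur with drums in two sounds, so they are the two candidates returned for the input tag drums.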
Figure 3: Visualization of Tag Similarity Matrix S [2]
2.1.2 Rank Aggregation and Adaptive Cutoff
RankST aggregates candidate tags using rank aggregation. Each input tag produces
a candidate list $C_{inputTag_i}$ of its top-$N$ co-occurring tags. For each
candidate at position $n$ in $C_{inputTag_i}$, rank values are assigned as
$$\mathrm{rank}(\mathrm{neighbor}_n) = N - (n - 1)$$
The closest tag to every input tag is assigned $N$, the second $N - 1$, and so
forth. Using this aggregation method, similarity is represented by these rank
values, which are summed across the input tags. This gives greater weight to tags
that consistently appear at the top of the co-occurrence lists. RankST then
applies an adaptive cutoff based on the Anderson-Darling test to automatically
determine how many tags to recommend, which improves F-measure performance
compared to fixed-length outputs [2].
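The rank aggregation step can be illustrated with a short sketch. It assumes each input tag already has an ordered candidate list (closest neighbour first) and simply sums the rank values $N - (n - 1)$ across lists. The function name `aggregate_ranks` and the candidate lists are hypothetical, and the Anderson-Darling adaptive cutoff is deliberately left out.

```python
from collections import defaultdict


def aggregate_ranks(candidate_lists, n):
    """Sum rank values N - (position - 1) over all input tags' candidate lists."""
    scores = defaultdict(int)
    for candidates in candidate_lists:
        # Positions are 1-based: the closest neighbour gets N, the next N - 1, ...
        for position, tag in enumerate(candidates[:n], start=1):
            scores[tag] += n - (position - 1)
    # Highest aggregated score first; alphabetical order breaks ties.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical top-3 candidate lists for two input tags.
candidate_lists = [
    ["drums", "percussive", "loop"],  # neighbours of the input tag "drum"
    ["drums", "percussive", "hit"],   # neighbours of the input tag "percussion"
]
ranking = aggregate_ranks(candidate_lists, n=3)
print(ranking)
```

Tags that appear near the top of several lists (here drums and percussive) accumulate the highest scores, which is the behaviour the aggregation is meant to reward.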
2.1.3 Limitations
RankST relies entirely on tag metadata and is constrained by the user-generated
folksonomy and its inconsistencies. Its recommendations are limited to previously
seen tags, and the quality of the output is heavily dependent on the seed tags,
making the system sensitive to subjective or inconsistent annotations.
In addition, noise in the folksonomy propagates into the recommendations.
Spelling variations, user-specific tags, non-semantic terms, and misspellings
often appear in the output, providing no semantic utility. For instance, for the
inputs drum and percussion, the recommendations are drums, percussive, percs, and
username tokens such as sandyrb (a username tagged across many uploads), seen in
Figure 1.
Further limitations of this system are the absence of audio content analysis and
the seed requirement. As a result, this system cannot predict labels directly
from the audio signal itself and cannot operate in a cold-start setting.
2.2 Contrastive Language-Audio Pretraining
2.2.1 CLAP Model Architecture
CLAP models are trained on large corpora of paired audio–text data, enabling
them to learn joint audio–text representations by training separate encoders for
each modality [3][4]. In the LAION-CLAP architecture, the audio encoder (e.g.,
PANN or HTSAT) processes mel-spectrograms, while the text encoder (e.g., BERT or
RoBERTa) embeds natural language [4] (Figure 4).
These embeddings are projected into a joint 512-dimensional latent space and
optimized via contrastive learning, maximizing the similarity of semantically
aligned audio-text pairs while minimizing it for unrelated pairs, a method
adapted from CLIP [3][4][7]. With this framework, these models are able to learn
semantically aligned cross-modal representations, making them suitable for
downstream tasks such as text-to-audio retrieval, zero-shot audio classification,
and supervised audio classification, and enabling them to generalize well to
unseen categories without fine-tuning [4].
This framework enables inference through cosine similarity between an audio
embedding and a set of candidate text embeddings, enabling audio classification
in zero-shot settings [4]. Zero-shot performance of CLAP models has been
validated on benchmark datasets such as ESC-50, UrbanSound8K and VGGSound, where
they perform competitively with supervised models in classification accuracy
[3][4].
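The inference step described above can be sketched as follows. This is a minimal illustration in plain Python: the three-dimensional vectors are made-up stand-ins for the 512-dimensional CLAP embeddings, and the helper names `cosine` and `rank_tags` are hypothetical rather than part of the LAION-CLAP code base.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_tags(audio_embedding, tag_embeddings, k):
    """Score every candidate tag against the audio embedding and keep the top k."""
    scored = [(tag, cosine(audio_embedding, emb)) for tag, emb in tag_embeddings.items()]
    scored.sort(key=lambda kv: -kv[1])  # highest similarity first
    return scored[:k]


# Toy embeddings (assumed values, not real model outputs).
audio_embedding = [0.9, 0.1, 0.0]
tag_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
    "engine": [0.0, 0.0, 1.0],
}
top = rank_tags(audio_embedding, tag_embeddings, k=2)
print(top)
```

No training on the target labels is involved: any text that can be embedded can serve as a candidate, which is what makes the setting zero-shot.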
Figure 4: Contrastive Language-Audio Pretraining Architecture [8]
2.2.2 Models & Benchmarks
Several CLAP variants have been developed, each differing in model capacity,
training data scale, and application domain. MS-CLAP is trained on 128k diverse
audio–text pairs [3], while LAION-CLAP scales training to over 630k pairs [4].
On closed-domain benchmarks, MS-CLAP achieves 82.6% zero-shot classification
accuracy on ESC-50 and 73.2% on UrbanSound8K. LAION-CLAP demonstrates improved
performance with 89.1% accuracy on ESC-50 and 73.2% on UrbanSound8K [3][4].
LAION-CLAP's large training scale and data diversity suggest greater potential
for generalization across diverse audio content. However, on VGGSound, a large
open-domain dataset with over 310 sound classes, LAION-CLAP's accuracy drops to
29.1%, highlighting the increased challenge of classification in open-domain
settings [4].
Dataset          Domain          # Classes   CLAP (MS)   LAION-CLAP
ESC-50 [9]       Environmental   50          82.6%       89.1%
US8K [10]        Urban           10          73.2%       73.2%
VGGSound [11]    Open-domain     310+        N/A         29.1%
Table 1: Zero-shot classification accuracy of CLAP models on benchmark datasets
2.2.3 Prompt Design
LAION-CLAP is trained on audio–text pairs structured as natural language
sentences, typically of the form This is a sound of [label] [4]. At inference
time, the phrasing of candidate class tags significantly influences the model's
zero-shot classification accuracy. Prompts structured as natural language
sentences (e.g., “This is a sound of a dog barking”) align more closely with the
model's training distribution, resulting in higher audio-text similarity scores
and improved classification performance. The study by Olvera et al. (2024)
demonstrates that prompting with complete sentences and detailed acoustic
descriptions consistently outperforms using isolated labels [12]. This makes
prompt formatting a critical design choice in zero-shot tagging and
classification with CLAP-based systems.
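As a small illustration of prompt formatting, the snippet below wraps bare tags in the sentence template quoted above before they would be passed to a text encoder. The function name `build_prompts` is hypothetical, and the exact template used for the experiments in this thesis is not shown in this excerpt, so the default below is an assumption.

```python
def build_prompts(tags, template="This is a sound of {}."):
    """Wrap each bare tag in a natural-language sentence template."""
    return [template.format(tag) for tag in tags]


prompts = build_prompts(["dog barking", "rain"])
for prompt in prompts:
    print(prompt)
```

Swapping the `template` argument makes it easy to compare isolated labels (template `"{}"`) against full-sentence prompts under otherwise identical conditions.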
2.2.4 Challenges in Folksonomy-Based Tagging
Unlike standardized class labels typical of benchmark datasets, folksonomies
consist of user-generated tags that are often noisy, inconsistent, and
semantically overlapping [2]. As discussed in the Analysis of the Folksonomy of
Freesound [13], the Freesound folksonomy is characterized by a continuously
growing and largely uncontrolled vocabulary (Figure 5), where users label sounds
without constraints, leading to inconsistencies in describing audio content. In
this setting, CLAP must score thousands of candidate tags with widely varying
granularity and relevance, increasing the risk of inaccurate or irrelevant
predictions. These conditions increase the complexity of adapting zero-shot
models to heterogeneous vocabularies.
Figure 5: Number of new tags introduced every month to Freesound (2005-2012)
[13]
2.3 Semantic Evaluation of Tag Recommendations
2.3.1 Limitations of Exact Matching
Conventional evaluation of classification systems relies on exact string
matching, where a predicted class label must match the ground truth. This
approach fails to account for semantic equivalence between lexically different
tags, such as synonyms (e.g., birdsong vs. chirping), plural forms (e.g., drum
vs. drums), or spelling variants (e.g., color vs. colour). This leads to an
underestimation of system performance, particularly in tag recommendation tasks
where the vocabulary is highly variable.
2.3.2 Embedding-Based Semantic Evaluation with SBERT
Bidirectional Encoder Representations from Transformers (BERT) is a language
model that produces token-level contextual embeddings, based on the multi-layer
bidirectional Transformer encoder introduced by Vaswani et al. (2017) [14][15].
Sentence-BERT (SBERT) adapts BERT with siamese and triplet network training to
generate sentence-level embeddings that capture semantic similarity beyond
token-level representations [16]. Building on this, embedding-based approaches
have been adopted in tasks such as audio caption quality assessment, measuring
semantic similarity between predicted and reference captions [17], effectively
addressing challenges of lexical variability.
Figure 6: BERT Transformer encoder architecture based on Vaswani et al. 2017
[15][18]
Figure 7: SBERT siamese adaptation of BERT architecture [16]
Figure 8: Audio caption semantic similarity measurement framework [17]
2.3.3 Application to Audio Tagging
Despite its adoption in captioning, semantic similarity has been underexplored
as an evaluation method for tag recommendation systems. In this work, SBERT
embeddings are used to evaluate tag recommendations, which are often multi-word
concepts. Cosine similarity between embeddings of predicted and reference tags
measures semantic relevance, effectively addressing the lexical variability of
folksonomy-based ground truth and capturing semantically relevant tags missed by
exact string matching.
Chapter 3. Methods
F1-score computes the harmonic mean of Precision and Recall:
$$F1 = \frac{2 \cdot P@10 \cdot R@10}{P@10 + R@10}$$
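A minimal sketch of these metrics under exact string matching is given below. The definitions of Precision@10 and Recall@10 are not part of this excerpt, so the code assumes the common convention: hits divided by K for precision and hits divided by the number of ground-truth tags for recall. The function name and the example tags are hypothetical.

```python
def precision_recall_f1_at_k(predicted, ground_truth, k=10):
    """Exact-match Precision@k, Recall@k and their harmonic mean (F1)."""
    top_k = predicted[:k]
    truth = set(ground_truth)
    hits = sum(1 for tag in top_k if tag in truth)
    precision = hits / k                           # assumed: divide by k
    recall = hits / len(truth) if truth else 0.0   # assumed: divide by |ground truth|
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical clip: 10 predicted tags, 4 ground-truth tags, 2 exact hits.
predicted = ["drums", "loop", "bass", "kick", "snare",
             "rain", "wind", "dog", "cat", "bird"]
ground_truth = ["drums", "snare", "percussion", "beat"]
p, r, f1 = precision_recall_f1_at_k(predicted, ground_truth)
print(f"P@10={p:.3f} R@10={r:.3f} F1={f1:.3f}")
```

Here two of the ten predictions match, giving P@10 = 0.2 and R@10 = 0.5, so F1 = 2 · 0.2 · 0.5 / 0.7 ≈ 0.286.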
3.4.1.2 Semantic Matching
Semantic similarity is computed using the all-MiniLM-L6-v2 SBERT model to encode
predicted and ground-truth tags into 384-dimensional vectors. For each predicted
tag $t_p$, cosine similarity is calculated with all ground-truth tags $t_g$:
$$\mathrm{sim}(t_p, t_g) = \cos(\mathrm{embedding}_{t_p}, \mathrm{embedding}_{t_g})$$
where $\mathrm{embedding}_{t_p}$ and $\mathrm{embedding}_{t_g}$ denote SBERT
embeddings. A predicted tag is a semantic match if
$$\max_{t_g \in T_{GT}} \mathrm{sim}(t_p, t_g) \geq \tau$$
where $\tau$ is a threshold parameter. Performance is evaluated using
Precision@10, Recall@10, and F1 metrics using semantic matches.
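The matching rule can be sketched without any model dependency by working on precomputed vectors. In the snippet below, the two-dimensional embeddings are invented stand-ins for the 384-dimensional SBERT vectors, the helper names are hypothetical, and the comparison is assumed to be inclusive (similarity greater than or equal to the threshold).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_matches(pred_embeddings, truth_embeddings, tau=0.7):
    """Flag each predicted tag whose best similarity to any ground-truth tag reaches tau."""
    matches = {}
    for tag, emb in pred_embeddings.items():
        best = max((cosine(emb, g) for g in truth_embeddings.values()), default=0.0)
        matches[tag] = best >= tau  # assumed inclusive threshold
    return matches


# Toy embeddings (assumed values): "drums" lies close to "drum", "rain" does not.
pred_embeddings = {"drums": [1.0, 0.1], "rain": [0.0, 1.0]}
truth_embeddings = {"drum": [1.0, 0.0]}
matches = semantic_matches(pred_embeddings, truth_embeddings)
print(matches)
```

The number of predicted tags flagged as matches then takes the place of the exact-match hit count when computing Precision@10, Recall@10, and F1.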
t_p        t_g      Cos. sim.   Match (τ = 0.7)
beat       beat     1.000       ✓
drums      drum     0.867       ✓
rhythmic   rhythm   0.863       ✓
vocal      voice    0.824       ✓
woman      female   0.799       ✓
metallic   metal    0.884       ✓
breaking   break    0.920       ✓
Figure 13: Semantic matching example of predicted tags t_p against ground-truth
tags t_g, collected across multiple audio clips.

Chapter 4
Results
Performance is reported using Precision@10, Recall@10, and F1 metrics as detailed
in Section 3.4.1, under both exact string matching (Section 3.4.1.1) and semantic
matching (Section 3.4.1.2) conditions.
4.1 Zero-Shot Performance
4.1.0.1 Exact Matching
System                     Precision@10       Recall@10          F1
ZS Baseline                0.0051 ± 0.0240    0.0093 ± 0.0497    0.0062 ± 0.0292
ZS DF-Weighted (α = 0.7)   0.0305 ± 0.0738    0.0515 ± 0.1248    0.0367 ± 0.0876
Table 3: Zero-shot tagging performance of CLAP-based systems under exact
matching conditions. Reported as mean ± standard deviation across test clips.
4.1.0.2 Semantic Matching
System                     Precision@10       Recall@10          F1
ZS Baseline                0.0130 ± 0.0467    0.0230 ± 0.0968    0.0157 ± 0.0570
ZS DF-Weighted (α = 0.7)   0.0488 ± 0.1099    0.0837 ± 0.1891    0.0590 ± 0.1309
Table 4: Zero-shot tagging performance of CLAP-based systems under semantic
matching conditions (τ = 0.7). Reported as mean ± standard deviation across test
clips.
4.1.0.3 Performance Overview
Figure 14: F1 performance of Zero-Shot systems (exact and semantic matching)
Figure 15: Number of clips with ≥1 hit for Zero-Shot systems (exact and semantic
matching)
4.2 RankST Benchmark
4.2.0.1 Exact Matching
System          Precision@10       Recall@10          F1
RankST (k=1)    0.0808 ± 0.1285    0.1550 ± 0.2445    0.0997 ± 0.1513
RankST (k=2)    0.1333 ± 0.1441    0.2703 ± 0.2875    0.1687 ± 0.1740
RankST (k=3)    0.1774 ± 0.1598    0.3540 ± 0.3070    0.2236 ± 0.1890
Table 5: Tagging performance of RankST for k ∈ {1, 2, 3} under exact matching
conditions. Reported as mean ± standard deviation across test clips.
4.2.0.2 Semantic Matching
System          Precision@10       Recall@10          F1
RankST (k=1)    0.1127 ± 0.1609    0.2341 ± 0.3585    0.1444 ± 0.2056
RankST (k=2)    0.1734 ± 0.1747    0.3594 ± 0.3756    0.2212 ± 0.2157
RankST (k=3)    0.2181 ± 0.1826    0.4434 ± 0.3754    0.2765 ± 0.2197
Table 6: Tagging performance of RankST for k ∈ {1, 2, 3} under semantic matching
conditions (τ = 0.7). Reported as mean ± standard deviation across test clips.
4.2.0.3 Performance Overview
Figure 16: F1 performance of RankST (k ∈ {1, 2, 3}) (exact and semantic matching)
Figure 17: Number of clips with ≥1 hit for RankST (k ∈ {1, 2, 3}) (exact and
semantic matching)

4.3 Results Overview
                Exact Matching                 Semantic Matching
System          P      R      F1     Hits     P      R      F1     Hits     F1(%)    Hits(%)
ZS Baseline     0.005  0.009  0.006  72       0.013  0.023  0.016  136      +166.7   +88.9
ZS DF           0.031  0.052  0.037  293      0.049  0.084  0.059  369      +59.5    +25.9
RankST (k=1)    0.081  0.155  0.100  651      0.113  0.234  0.144  739      +44.0    +13.5
RankST (k=2)    0.133  0.270  0.169  983      0.173  0.359  0.221  1053     +30.8    +7.1
RankST (k=3)    0.177  0.354  0.224  1145     0.218  0.443  0.277  1200     +23.7    +4.8
Table 7: Performance across all systems. Reported values are mean Precision@10
(P), Recall@10 (R), F1 and clips with ≥1 correct hit (Hits). F1(%) and Hits(%)
values give the relative change from exact to semantic matching.
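The percentage columns can be reproduced from the other columns of Table 7. The short check below assumes they are computed as the relative change (semantic − exact) / exact × 100, which is consistent with the reported figures; the function name `relative_change` is hypothetical.

```python
def relative_change(exact, semantic):
    """Percentage change from the exact-matching value to the semantic one."""
    return (semantic - exact) / exact * 100.0


# (exact hits, semantic hits) per system, copied from Table 7.
hits = {
    "ZS Baseline": (72, 136),
    "ZS DF": (293, 369),
    "RankST (k=1)": (651, 739),
    "RankST (k=2)": (983, 1053),
    "RankST (k=3)": (1145, 1200),
}
hits_change = {name: round(relative_change(e, s), 1) for name, (e, s) in hits.items()}

for name, change in hits_change.items():
    print(f"{name}: {change:+.1f}%")
```

The same function applied to the rounded F1 values (for example 0.006 and 0.016 for the zero-shot baseline) yields the F1(%) column.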
Figure 18: F1 performance across all systems (exact and semantic matching)
Figure 19: Number of clips with ≥1 hit across all systems (exact and semantic
matching)
Chapter 5
Discussion
5.0.1 Zero-Shot Tagging Performance
The baseline zero-shot system achieves an F1 score of 0.006 under exact
matching, highlighting the inherent challenge of open-domain audio tagging
without task-specific training. Implementing normalized logarithmic document
frequency (DF) weighting (Section 3.3.3) yields a substantial relative
improvement, increasing the F1 score to 0.037, a 516.7% increase over the
baseline. This approach effectively down-weights rare tags, which often
correspond to uninformative labels, attenuating noise and improving
discriminative performance in zero-shot tagging.
5.0.2 Impact of Embedding-Based Semantic Evaluation
Zero-shot systems demonstrate substantial improvements under semantic evaluation
compared to exact matching. The zero-shot baseline model achieves a 166.7%
increase in F1 score and an 88.9% increase in clips with at least one correct
prediction. The DF-weighted system similarly achieves proportional gains under
semantic evaluation, with a 59.5% increase in F1 score and a 25.9% increase in
clips with at least one correct prediction.
Comparing the zero-shot baseline under exact matching (F1 = 0.006) to the
DF-weighted system under semantic matching (F1 = 0.059) reveals an 883% relative
improvement in F1, demonstrating substantial gains in zero-shot performance
through weighted tagging and semantic evaluation, which reveals latent
performance missed by exact matching.
In addition, semantic evaluation captures latent performance of RankST (up to a
44% increase in F1), highlighting the broader potential of embedding-based
evaluation in MIR tasks.
5.0.3 Tagset Quality and Noise
The zero-shot tag vocabulary, comprising 1,870 unique tags, provides broad
semantic coverage for open-domain audio tagging. Compared to curated open-domain
datasets such as FSD50K [22] (200 classes) and VGGSound [11] (310+ classes), the
zero-shot vocabulary is substantially larger. However, the extensive tagset is
not indicative of higher semantic utility; analysis reveals considerable
variation in tag quality, with a large fraction of the tagset comprising noise.
These include overly specific and semantically uninformative tags, such as proper
nouns, technical identifiers, subjective terms and fragments (Table 8). While DF
weighting provides an immediate mechanism to attenuate noisy tags, it does not
fully eliminate noise inherent to the folksonomy. This inflated vocabulary
impairs CLAP generalization, compromising zero-shot tagging performance.
Proper nouns    barcelona, japan, nasa, sony, ableton
Technical       16bit, bpm, midi, mono, s , h4n
Subjective      nice, bad, cool, yes, no
Fragments       a, el, la, c3, m, xy
Table 8: Examples of noisy tags sampled from the zero-shot tag vocabulary
5.0.4 Ground Truth Limitations
Ground truth comprises user-generated annotations exhibiting inconsistent
descriptive coverage and high sparsity. This limits evaluation, as the system may
produce relevant predictions not covered by the ground truth (Figure 20). Semantic
Bibliography
[1] 2024 in numbers | The Freesound Blog. URL https://blog.freesound.org/?p=2141.
[2] Font Corbera, F., Serrà Julià, J. & Serra, X. Folksonomy-based tag
recommendation for online audio clip sharing (2012). URL
http://hdl.handle.net/10230/22736. Publisher: International Society for Music
Information Retrieval (ISMIR).
[3] Elizalde, B., Deshmukh, S., Ismail, M. A. & Wang, H. CLAP Learning Audio
Concepts from Natural Language Supervision. In ICASSP 2023 - 2023 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
1–5 (2023). URL https://ieeexplore.ieee.org/abstract/document/10095889.
ISSN: 2379-190X.
[4] Wu, Y. et al. Large-scale Contrastive Language-Audio Pretraining with Feature
Fusion and Keyword-to-Caption Augmentation (2024). URL
http://arxiv.org/abs/2211.06687. ArXiv:2211.06687 [cs].
[5] Folksonomy :: vanderwal.net. URL https://vanderwal.net/folksonomy.html.
[6] Freesound. URL https://freesound.org/.
[7] Radford, A. et al. Learning Transferable Visual Models From Natural Language
Supervision (2021). URL http://arxiv.org/abs/2103.00020. ArXiv:2103.00020 [cs].
[8] LAION-AI/CLAP (2025). URL https://github.com/LAION-AI/CLAP.
Original-date: 2022-03-06T20:12:49Z.
[9] Piczak, K. J. ESC: Dataset for Environmental Sound Classification. In
Proceedings of the 23rd ACM international conference on Multimedia, 1015–1018
(ACM, Brisbane Australia, 2015). URL
https://dl.acm.org/doi/10.1145/2733373.2806390.
[10] Salamon, J., Jacoby, C. & Bello, J. P. A Dataset and Taxonomy for Urban
Sound Research. In Proceedings of the 22nd ACM international conference on
Multimedia, 1041–1044 (ACM, Orlando Florida USA, 2014). URL
https://dl.acm.org/doi/10.1145/2647868.2655045.
[11] Chen, H., Xie, W., Vedaldi, A. & Zisserman, A. VGGSound: A Large-scale
Audio-Visual Dataset (2020). URL http://arxiv.org/abs/2004.14368.
ArXiv:2004.14368 [cs].
[12] Olvera, M., Stamatiadis, P. & Essid, S. A sound description: Exploring
prompt templates and class descriptions to enhance zero-shot audio
classification (2024). URL http://arxiv.org/abs/2409.13676.
ArXiv:2409.13676 [cs].
[13] Font, F. & Serra, X. ANALYSIS OF THE FOLKSONOMY OF FREESOUND (2012).
[14] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding (2019). URL
http://arxiv.org/abs/1810.04805. ArXiv:1810.04805 [cs].
[15] Vaswani, A. et al. Attention Is All You Need (2023). URL
http://arxiv.org/abs/1706.03762. ArXiv:1706.03762 [cs].
[16] Reimers, N. & Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese
BERT-Networks (2019). URL http://arxiv.org/abs/1908.10084.
ArXiv:1908.10084 [cs].
[17] Mahfuz, R., Guo, Y. & Visser, E. Improving Audio Captioning Using Semantic
Similarity Metrics (2023). URL http://arxiv.org/abs/2210.16470.
ArXiv:2210.16470 [cs].
[18] Smith, B. A Complete Guide to BERT with Code (2024). URL
https://towardsdatascience.com/a-complete-guide-to-bert-with-code-9f87602e4a11/.
[19] GitHub - allholy/BSD10k. URL https://github.com/allholy/BSD10k.
[20] Anastasopoulou, P., Torrey, J., Serra, X. & Font, F. Heterogeneous sound
classification with the Broad Sound Taxonomy and Dataset (2024). URL
http://arxiv.org/abs/2410.00980. ArXiv:2410.00980 [cs] version: 1.
[21] MTG/freesound (2025). URL https://github.com/MTG/freesound.
Original-date: 2012-11-07T18:25:03Z.
[22] Fonseca, E., Favory, X., Pons, J., Font, F. & Serra, X. FSD50K: An Open
Dataset of Human-Labeled Sound Events (2022). URL
http://arxiv.org/abs/2010.00475. ArXiv:2010.00475 [cs].