Open-Domain Zero-Shot Audio Tagging: Evaluation via Semantic Embeddings

Author: Yapici, Tolga

Publisher: Zenodo

DOI: 10.5281/zenodo.17305082

Source: https://zenodo.org/records/17305082/files/Tolga_Yapici_Master_Thesis_2025.pdf

Master in Sound and Music Computing
Universitat Pompeu Fabra
Open-Domain Zero-Shot Audio Tagging:
Evaluation via Semantic Embeddings
Tolga Yapici
Supervisor: Panagiota Anastasopoulou
Co-Supervisor: Frederic Font
July 2025
Contents
1 Introduction
1.1 Motivation
1.2 Key Concepts
1.2.1 Folksonomy
1.2.2 Co-occurrence Based Tag Recommendation
1.2.3 Zero-Shot Classification
1.2.4 Audio-Text Representation Models
1.2.5 LAION-CLAP
1.3 Objectives
1.4 Scope
2 Background
2.1 Freesound Tag Recommender (RankST)
2.1.1 Co-occurrence Based Tag Recommendation
2.1.2 Rank Aggregation and Adaptive Cutoff
2.1.3 Limitations
2.2 Contrastive Language-Audio Pretraining
2.2.1 CLAP Model Architecture
2.2.2 Models & Benchmarks
2.2.3 Prompt Design
2.2.4 Challenges in Folksonomy-Based Tagging
2.3 Semantic Evaluation of Tag Recommendations
2.3.1 Limitations of Exact Matching
2.3.2 Embedding-Based Semantic Evaluation with SBERT
2.3.3 Application to Audio Tagging
3 Methods
3.1 Dataset and Preprocessing
3.1.1 BSD10k Dataset
3.1.2 Dataset Preprocessing
3.1.3 Tag Vocabulary
3.2 LAION-CLAP Embeddings
3.2.1 Audio Embeddings
3.2.2 Tag Embeddings
3.3 Tag Recommendation Systems
3.3.1 RankST
3.3.2 Zero-Shot Baseline
3.3.3 Zero-Shot with DF Weighting
3.4 Evaluation Methodology
3.4.1 Evaluation Metrics
4 Results
4.1 Zero-Shot Performance
4.2 RankST Benchmark
4.3 Results Overview
5 Discussion
5.0.1 Zero-Shot Tagging Performance
5.0.2 Impact of Embedding-Based Semantic Evaluation
5.0.3 Tagset Quality and Noise
5.0.4 Ground Truth Limitations
5.0.5 System Performance Comparison
6 Conclusion
List of Figures
List of Tables
Bibliography
A First Appendix
B Second Appendix

Dedication
To my loving family, my dear friends, and the beautiful city of Barcelona.
Acknowledgement
I thank my supervisors, Penny and Frederic, for their support and sincerity, and
my friends in the Master's program and the Music Technology Group who have
supported my growth academically and beyond.
Chapter 1. Introduction
zero-shot audio classification by comparing similarity between audio and text
embeddings, typically using cosine similarity [3][4].
1.2.5 LAION-CLAP
LAION-CLAP (Contrastive Language-Audio Pretraining) is an open-source audio-text
representation model trained on over 630,000 audio–text pairs spanning diverse
audio content. Its architecture (shown in Figure 4) uses separate encoders for audio
and text, optimized via contrastive loss to produce aligned audio-text embeddings
[4].
1.3 Objectives
This thesis aims to evaluate whether a zero-shot audio-text model can effectively
recommend tags for heterogeneous audio content without supervised training on the
target dataset. The specific objectives are:
• Benchmark the performance of LAION-CLAP for tag recommendation in a
zero-shot setting on Freesound.
• Compare zero-shot predictions with Freesound's co-occurrence-based predictions
in tagging accuracy.
• Evaluate whether semantic similarity metrics (via SBERT) capture meaningful
tags missed by exact string matching.
1.4 Scope
This thesis evaluates zero-shot audio tagging performance on a curated heterogeneous
subset of Freesound (BSD10k), representing diverse general-purpose audio content.
Evaluation focuses on methods incorporating tag weighting via document frequency
and semantic evaluation through sentence embeddings. Performance is measured
using precision, recall, and F1-score computed under both exact string matching
and semantic matching criteria. Evaluations are performed in a zero-shot setting
without fine-tuning or retraining of underlying models.
Chapter 2
Background
2.1 Freesound Tag Recommender (RankST)
2.1.1 Co-occurrence Based Tag Recommendation
The RankST algorithm (Font et al. 2012) is the foundation of Freesound's tag
recommendation system. It estimates tag relevance from tag-tag co-occurrence statistics.
Tag co-occurrence counts are used to compute the tag similarity matrix $S = DD^{\top}$,
where each element $s_{ij}$ indicates the number of sounds in which tags $i$ and $j$ appear
together [2]. Given a set of input tags, RankST retrieves a list of candidate tags,
taking the top-N most similar tags for each input tag. For example, for an input tag
drums, the system may suggest related tags such as percussion or rhythm if these
frequently co-occur in the annotations. This approach relies on the principle that if
two tags frequently co-occur, they are likely to describe related sound content.
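As a minimal sketch of this co-occurrence step, the following Python example builds $S = DD^{\top}$ for a toy folksonomy and retrieves the most similar tags for one input tag. It assumes that D is a binary tag-by-sound matrix; the toy sounds, the tag names, and the helper top_n_similar are hypothetical and not taken from the RankST implementation.

```python
import numpy as np

# Hypothetical toy folksonomy: each sound is annotated with a set of tags.
sounds = [
    {"drums", "percussion", "loop"},
    {"drums", "rhythm", "loop"},
    {"drums", "percussion"},
    {"bird", "field-recording"},
]
tags = sorted(set().union(*sounds))
index = {tag: i for i, tag in enumerate(tags)}

# Assumption: D is a binary tag-by-sound matrix, D[i, k] = 1 if sound k has tag i.
D = np.zeros((len(tags), len(sounds)), dtype=int)
for k, sound_tags in enumerate(sounds):
    for tag in sound_tags:
        D[index[tag], k] = 1

# S[i, j] counts the sounds in which tags i and j appear together.
S = D @ D.T


def top_n_similar(tag, n=2):
    """Return up to n tags that co-occur most often with `tag`."""
    row = S[index[tag]].copy()
    row[index[tag]] = -1  # exclude the input tag itself
    order = np.argsort(-row, kind="stable")[:n]
    return [tags[j] for j in order if row[j] > 0]


print(top_n_similar("drums"))
```

Ties are broken by the alphabetical tag order used to build the index, which is an arbitrary choice made only to keep the sketch deterministic.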
Figure 3: Visualization of Tag Similarity Matrix S [2]
2.1.2 Rank Aggregation and Adaptive Cutoff
RankST aggregates candidate tags using rank aggregation. Each input tag produces
a candidate list $C_{\mathrm{inputTag}_i}$ of its top-N co-occurring tags. For each candidate at
position $n$ in $C_{\mathrm{inputTag}_i}$, rank values are assigned as
\[ \mathrm{rank}(\mathrm{neighbor}_n) = N - (n - 1) \]
The closest tag to every input tag is assigned $N$, the second $N - 1$, and so forth.
Using this aggregation method, similarity is represented by these rank values, which
are summed across the input tags. This gives greater weight to tags that consistently
appear at the top of the co-occurrence lists. RankST then applies an adaptive cutoff
based on the Anderson-Darling test to automatically determine how many tags
to recommend, which improves F-measure performance compared to fixed-length
outputs [2].
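The rank aggregation step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the RankST code: the function name aggregate_ranks, the dictionary layout of the candidate lists, and the choice to drop input tags from the output are hypothetical, and the Anderson-Darling adaptive cutoff is not implemented here.

```python
from collections import defaultdict


def aggregate_ranks(candidate_lists, n):
    """Sum rank values N - (n - 1) over the candidate lists of all input tags.

    candidate_lists maps each input tag to its candidates, ordered from most
    to least similar (hypothetical data layout).
    """
    scores = defaultdict(int)
    for candidates in candidate_lists.values():
        for position, candidate in enumerate(candidates[:n], start=1):
            scores[candidate] += n - (position - 1)
    # Assumption: input tags are not recommended back to the user.
    for input_tag in candidate_lists:
        scores.pop(input_tag, None)
    # Highest aggregated rank first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


candidates = {
    "drums": ["percussion", "loop", "rhythm"],
    "beat": ["loop", "rhythm", "percussion"],
}
ranking = aggregate_ranks(candidates, n=3)
print(ranking)
```

In this toy case loop scores 2 + 3 = 5, percussion 3 + 1 = 4, and rhythm 1 + 2 = 3, so a tag that ranks well for both input tags ends up ahead of one that tops only a single list.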
2.1.3 Limitations
RankST relies entirely on tag metadata and is constrained by the user-generated
folksonomy and its inconsistencies. Its recommendations are limited to previously
seen tags, and the quality of the output is heavily dependent on the seed tags,
making the system sensitive to subjective or inconsistent annotations.
In addition, noise in the folksonomy propagates into the recommendations. Spelling
variations, user-specific tags, non-semantic terms, and misspellings often appear in
the output, providing no semantic utility. For instance, for the inputs drum and
percussion, the recommendations are drums, percussive, percs, and username tokens
such as sandyrb (a username tagged across many uploads), seen in Figure 1.
Further limitations of this system are the absence of audio content analysis and the
seed requirement. As a result, this system cannot predict labels directly from the
audio signal itself and cannot operate in a cold-start setting.
2.2 Contrastive Language-Audio Pretraining
2.2.1 CLAP Model Architecture
CLAP models are trained on large corpora of paired audio–text data, enabling
them to learn joint audio–text representations by training separate encoders for
each modality [3][4]. In the LAION-CLAP architecture, the audio encoder (e.g., PANN
or HTSAT) processes mel-spectrograms, while the text encoder (e.g., BERT or
RoBERTa) embeds natural language [4] (Figure 4).
These embeddings are projected into a joint 512-dimensional latent space and
optimized via contrastive learning, maximizing the similarity of semantically aligned
audio-text pairs while minimizing it for unrelated pairs, a method adapted from
CLIP [3][4][7]. With this framework, these models learn semantically aligned
cross-modal representations, making them suitable for downstream tasks such as
text-to-audio retrieval, zero-shot audio classification, and supervised audio
classification, and enabling them to generalize well to unseen categories without
fine-tuning [4].
This framework enables inference through cosine similarity between an audio
embedding and a set of candidate text embeddings, enabling audio classification in
zero-shot settings [4]. Zero-shot performance of CLAP models has been validated
on benchmark datasets such as ESC-50, UrbanSound8K and VGGSound, where
they perform competitively with supervised models in classification accuracy [3][4].
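The inference step described above reduces to ranking candidate text embeddings by cosine similarity to an audio embedding. The sketch below illustrates this with NumPy only. The random 512-dimensional vectors stand in for real CLAP encoder outputs, and the function names cosine_scores and zero_shot_top_k are hypothetical.

```python
import numpy as np


def cosine_scores(audio_embedding, text_embeddings):
    """Cosine similarity between one audio embedding and each text embedding."""
    audio = audio_embedding / np.linalg.norm(audio_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return texts @ audio


def zero_shot_top_k(audio_embedding, text_embeddings, labels, k=10):
    """Return the k labels whose text embeddings are closest to the audio."""
    scores = cosine_scores(audio_embedding, text_embeddings)
    order = np.argsort(-scores)[:k]
    return [(labels[i], float(scores[i])) for i in order]


# Stand-in embeddings (assumption): in practice these would come from the
# CLAP audio and text encoders; here they are random 512-dimensional vectors.
rng = np.random.default_rng(0)
labels = ["dog bark", "rain", "siren"]
text_embeddings = rng.normal(size=(3, 512))
audio_embedding = text_embeddings[1] + 0.1 * rng.normal(size=512)

predictions = zero_shot_top_k(audio_embedding, text_embeddings, labels, k=2)
print(predictions[0][0])
```

Because the stand-in audio vector is a lightly perturbed copy of the "rain" text vector, that label ranks first. No training on the candidate labels is involved, which is what makes the setting zero-shot.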
Figure 4: Contrastive Language-Audio Pretraining Architecture [8]
2.2.2 Models & Benchmarks
Several CLAP variants have been developed, each differing in model capacity,
training data scale, and application domain. MS-CLAP is trained on 128k diverse
audio–text pairs [3], while LAION-CLAP scales training to over 630k pairs [4].
On closed-domain benchmarks, MS-CLAP achieves 82.6% zero-shot classification
accuracy on ESC-50 and 73.2% on UrbanSound8K. LAION-CLAP reaches 89.1%
accuracy on ESC-50, improving on MS-CLAP, and 73.2% on UrbanSound8K
[3][4].
LAION-CLAP's larger training scale and data diversity suggest greater potential for
generalization across diverse audio content. However, on VGGSound, a large
open-domain dataset with over 310 sound classes, LAION-CLAP's accuracy drops to
29.1%, highlighting the increased challenge of classification in open-domain settings
[4].
Dataset         Domain         # Classes  CLAP (MS)  LAION-CLAP
ESC-50 [9]      Environmental  50         82.6%      89.1%
US8K [10]       Urban          10         73.2%      73.2%
VGGSound [11]   Open-domain    310+       N/A        29.1%
Table 1: Zero-shot classification accuracy of CLAP models on benchmark datasets
2.2.3 Prompt Design
LAION-CLAP is trained on audio–text pairs structured as natural language
sentences, typically of the form This is a sound of [label] [4]. At inference time, the
phrasing of candidate class tags significantly influences the model's zero-shot
classification accuracy. Prompts structured as natural language sentences (e.g., “This
is a sound of a dog barking”) align more closely with the model's training
distribution, resulting in higher audio-text similarity scores and improved classification
performance. The study by Olvera et al. (2024) demonstrates that prompting with
complete sentences and detailed acoustic descriptions consistently outperforms
using isolated labels [12]. This makes prompt formatting a critical design choice in
zero-shot tagging and classification with CLAP-based systems.
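A minimal sketch of this prompt formatting step follows, assuming the template wording quoted above. The helper name build_prompts and the example tags are hypothetical.

```python
def build_prompts(tags, template="This is a sound of {}"):
    """Wrap each candidate tag in a natural-language sentence."""
    return [template.format(tag) for tag in tags]


prompts = build_prompts(["dog barking", "rain"])
print(prompts)
```

The resulting sentences, rather than the bare tags, would then be passed to the text encoder. Keeping the template as a parameter makes it easy to compare alternative phrasings.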
2.2.4 Challenges in Folksonomy-Based Tagging
Unlike standardized class labels typical of benchmark datasets, folksonomies consist
of user-generated tags that are often noisy, inconsistent, and semantically
overlapping [2]. As discussed in the Analysis of the Folksonomy of Freesound [13], the
Freesound folksonomy is characterized by a continuously growing and largely
uncontrolled vocabulary (Figure 5), where users label sounds without constraints, leading
to inconsistencies in describing audio content. In this setting, CLAP must score
thousands of candidate tags with widely varying granularity and relevance,
increasing the risk of inaccurate or irrelevant predictions. These conditions increase the
complexity of adapting zero-shot models to heterogeneous vocabularies.
Figure 5: Number of new tags introduced every month to Freesound (2005-2012)
[13]
2.3 Semantic Evaluation of Tag Recommendations
2.3.1 Limitations of Exact Matching
Conventional evaluation of classification systems relies on exact string matching,
where a predicted class label must match the ground truth. This approach fails to
account for semantic equivalence between lexically different tags, such as synonyms
(e.g., birdsong vs. chirping), plural forms (e.g., drum vs. drums), or spelling variants
(e.g., color vs. colour). This leads to an underestimation of system performance,
particularly in tag recommendation tasks where the vocabulary is highly variable.
2.3.2 Embedding-Based Semantic Evaluation with SBERT
Bidirectional Encoder Representations from Transformers (BERT) is a language
model that produces token-level contextual embeddings, based on the multi-layer
bidirectional Transformer encoder introduced by Vaswani et al. (2017) [14][15].
Sentence-BERT (SBERT) adapts BERT with siamese and triplet network
training to generate sentence-level embeddings that capture semantic similarity beyond
token-level representations [16]. Building on this, embedding-based approaches have
been adopted in tasks such as audio caption quality assessment, measuring
semantic similarity between predicted and reference captions [17], effectively addressing
challenges of lexical variability.
Figure 6: BERT Transformer encoder architecture based on Vaswani et al. 2017
[15][18]
Figure 7: SBERT siamese adaptation of BERT architecture [16]
Figure 8: Audio caption semantic similarity measurement framework [17]
2.3.3 Application to Audio Tagging
Despite its adoption in captioning, semantic similarity has been underexplored as
an evaluation method for tag recommendation systems. In this work, SBERT
embeddings are used to evaluate tag recommendations, which are often multi-word
concepts. Cosine similarity between embeddings of predicted and reference tags
measures semantic relevance, effectively addressing the lexical variability of
folksonomy-based ground truth and capturing semantically relevant tags missed by exact string
matching.
Chapter 3. Methods
F1-score computes the harmonic mean of Precision and Recall:
\[ \mathrm{F1} = \frac{2 \cdot \mathrm{P@10} \cdot \mathrm{R@10}}{\mathrm{P@10} + \mathrm{R@10}} \]
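The F1 formula can be made concrete with a short Python sketch for a single clip under exact string matching. The definitions of P@10 and R@10 are not part of this excerpt, so the sketch assumes that precision divides the number of hits by k and recall divides it by the number of ground-truth tags. The function name and the example tags are hypothetical.

```python
def precision_recall_f1_at_k(predicted, ground_truth, k=10):
    """Exact-match Precision@k, Recall@k and F1 for a single clip.

    Assumption: precision divides hits by k, recall by the number of
    ground-truth tags.
    """
    truth = set(ground_truth)
    hits = sum(1 for tag in predicted[:k] if tag in truth)
    precision = hits / k
    recall = hits / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


predicted = ["drums", "loop", "bass", "kick", "snare",
             "beat", "synth", "hit", "clap", "tom"]
ground_truth = ["drums", "beat", "percussion", "rhythm"]
p, r, f1 = precision_recall_f1_at_k(predicted, ground_truth)
print(round(p, 3), round(r, 3), round(f1, 3))
```

With 2 hits among 10 predictions and 4 ground-truth tags, this gives P@10 = 0.2, R@10 = 0.5, and F1 = 2 · 0.2 · 0.5 / 0.7 ≈ 0.286.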
3.4.1.2 Semantic Matching
Semantic similarity is computed using the all-MiniLM-L6-v2 SBERT model to
encode predicted and ground-truth tags into 384-dimensional vectors. For each
predicted tag $t_p$, cosine similarity is calculated with all ground-truth tags $t_g$:
\[ \mathrm{sim}(t_p, t_g) = \cos(\mathrm{embedding}_{t_p}, \mathrm{embedding}_{t_g}) \]
where $\mathrm{embedding}_{t_p}$ and $\mathrm{embedding}_{t_g}$ denote SBERT embeddings. A predicted tag
is a semantic match if
\[ \max_{t_g \in T_{GT}} \mathrm{sim}(t_p, t_g) \geq \tau \]
where $\tau$ is a threshold parameter. Performance is evaluated using Precision@10,
Recall@10, and F1 metrics using semantic matches.
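The matching rule can be sketched with NumPy as follows. To keep the example self-contained, small hand-made 2-dimensional vectors stand in for the 384-dimensional SBERT embeddings; the function name semantic_matches and the toy vectors are hypothetical.

```python
import numpy as np


def semantic_matches(pred_embeddings, gt_embeddings, tau=0.7):
    """Flag predicted tags whose best cosine similarity to any
    ground-truth tag is at least tau."""
    pred = pred_embeddings / np.linalg.norm(pred_embeddings, axis=1, keepdims=True)
    gt = gt_embeddings / np.linalg.norm(gt_embeddings, axis=1, keepdims=True)
    sims = pred @ gt.T          # shape: (num_predicted, num_ground_truth)
    best = sims.max(axis=1)     # max over ground-truth tags
    return best >= tau, best


# Toy stand-ins (assumption): real inputs would be SBERT embeddings of the
# predicted and ground-truth tags.
pred_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gt_embeddings = np.array([[1.0, 0.1], [0.9, 0.3]])

matches, best = semantic_matches(pred_embeddings, gt_embeddings)
print(matches.tolist())
```

Here the first and third predicted vectors exceed the 0.7 threshold against at least one ground-truth vector, while the second does not. The match flags can then replace exact-match hits when computing Precision@10, Recall@10, and F1.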
t_p        t_g      Cos. sim.  Match (τ=0.7)
beat       beat     1.000      ✓
drums      drum     0.867      ✓
rhythmic   rhythm   0.863      ✓
vocal      voice    0.824      ✓
woman      female   0.799      ✓
metallic   metal    0.884      ✓
breaking   break    0.920      ✓
Figure 13: Semantic matching example of predicted tags t_p against ground-truth
tags t_g, collected across multiple audio clips.

Chapter 4
Results
Performance is reported using Precision@10, Recall@10, and F1 metrics as detailed
in Section 3.4.1, under both exact string matching (Section 3.4.1.1) and semantic
matching (Section 3.4.1.2) conditions.
4.1 Zero-Shot Performance
4.1.0.1 Exact Matching
System                  Precision@10     Recall@10        F1
ZS Baseline             0.0051 ± 0.0240  0.0093 ± 0.0497  0.0062 ± 0.0292
ZS DF-Weighted (α=0.7)  0.0305 ± 0.0738  0.0515 ± 0.1248  0.0367 ± 0.0876
Table 3: Zero-shot tagging performance of CLAP-based systems under exact
matching conditions. Reported as mean ± standard deviation across test clips.
4.1.0.2 Semantic Matching
System                  Precision@10     Recall@10        F1
ZS Baseline             0.0130 ± 0.0467  0.0230 ± 0.0968  0.0157 ± 0.0570
ZS DF-Weighted (α=0.7)  0.0488 ± 0.1099  0.0837 ± 0.1891  0.0590 ± 0.1309
Table 4: Zero-shot tagging performance of CLAP-based systems under semantic
matching conditions (τ=0.7). Reported as mean ± standard deviation across test
clips.
4.1.0.3 Performance Overview
Figure 14: F1 performance of Zero-Shot systems (exact and semantic matching)
Figure 15: Number of clips with ≥1 hit for Zero-Shot systems (exact and semantic
matching)
4.2 RankST Benchmark
4.2.0.1 Exact Matching
System        Precision@10     Recall@10        F1
RankST (k=1)  0.0808 ± 0.1285  0.1550 ± 0.2445  0.0997 ± 0.1513
RankST (k=2)  0.1333 ± 0.1441  0.2703 ± 0.2875  0.1687 ± 0.1740
RankST (k=3)  0.1774 ± 0.1598  0.3540 ± 0.3070  0.2236 ± 0.1890
Table 5: Tagging performance of RankST for k ∈ {1,2,3} under exact matching
conditions. Reported as mean ± standard deviation across test clips.
4.2.0.2 Semantic Matching
System        Precision@10     Recall@10        F1
RankST (k=1)  0.1127 ± 0.1609  0.2341 ± 0.3585  0.1444 ± 0.2056
RankST (k=2)  0.1734 ± 0.1747  0.3594 ± 0.3756  0.2212 ± 0.2157
RankST (k=3)  0.2181 ± 0.1826  0.4434 ± 0.3754  0.2765 ± 0.2197
Table 6: Tagging performance of RankST for k ∈ {1,2,3} under semantic matching
conditions (τ=0.7). Reported as mean ± standard deviation across test clips.
4.2.0.3 Performance Overview
Figure 16: F1 performance of RankST (k ∈ {1,2,3}) (exact and semantic matching)
Figure 17: Number of clips with ≥1 hit for RankST (k ∈ {1,2,3}) (exact and
semantic matching)

4.3 Results Overview
              Exact Matching              Semantic Matching
System        P      R      F1     Hits   P      R      F1     Hits   ΔF1(%)  ΔHits(%)
ZS Baseline   0.005  0.009  0.006  72     0.013  0.023  0.016  136    +166.7  +88.9
ZS DF         0.031  0.052  0.037  293    0.049  0.084  0.059  369    +59.5   +25.9
RankST (k=1)  0.081  0.155  0.100  651    0.113  0.234  0.144  739    +44.0   +13.5
RankST (k=2)  0.133  0.270  0.169  983    0.173  0.359  0.221  1053   +30.8   +7.1
RankST (k=3)  0.177  0.354  0.224  1145   0.218  0.443  0.277  1200   +23.7   +4.8
Table 7: Performance across all systems. Reported values are mean Precision@10
(P), Recall@10 (R), F1 and clips with ≥1 correct hit (Hits). ΔF1(%) and ΔHits(%)
values are exact vs semantic.
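The last two columns of Table 7 follow from the rounded F1 and Hits values as a relative change from exact to semantic matching. The short Python check below recomputes them; the helper name relative_change and the dictionary layout are hypothetical, while the numbers are copied from the table.

```python
def relative_change(exact, semantic):
    """Relative change from exact to semantic matching, in percent."""
    return 100.0 * (semantic - exact) / exact


# (F1 exact, F1 semantic, Hits exact, Hits semantic), copied from Table 7.
rows = {
    "ZS Baseline": (0.006, 0.016, 72, 136),
    "ZS DF": (0.037, 0.059, 293, 369),
    "RankST (k=1)": (0.100, 0.144, 651, 739),
    "RankST (k=2)": (0.169, 0.221, 983, 1053),
    "RankST (k=3)": (0.224, 0.277, 1145, 1200),
}

deltas = {
    name: (
        round(relative_change(f1_exact, f1_semantic), 1),
        round(relative_change(hits_exact, hits_semantic), 1),
    )
    for name, (f1_exact, f1_semantic, hits_exact, hits_semantic) in rows.items()
}

for name, (delta_f1, delta_hits) in deltas.items():
    print(f"{name}: F1 {delta_f1:+.1f}%, Hits {delta_hits:+.1f}%")
```

Because the inputs are the three-decimal values shown in the table, the percentages for very small scores such as the zero-shot baseline are sensitive to rounding.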
Figure 18: F1 performance across all systems (exact and semantic matching)
Figure 19: Number of clips with ≥1 hit across all systems (exact and semantic
matching)
Chapter 5
Discussion
5.0.1 Zero-Shot Tagging Performance
The baseline zero-shot system achieves an F1 score of 0.006 under exact
matching, highlighting the inherent challenge of open-domain audio tagging without
task-specific training. Implementing normalized logarithmic document frequency (DF)
weighting (Section 3.3.3) yields a substantial relative improvement, increasing the
F1 score to 0.037, a 516.7% increase over the baseline. This approach effectively
down-weights rare tags, which often correspond to uninformative labels,
attenuating noise and improving discriminative performance in zero-shot tagging.
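To illustrate how a normalized logarithmic DF weight can push rare tags down the ranking, the sketch below blends a similarity score with such a weight. The exact formulation of Section 3.3.3 is not part of this excerpt, so the normalization log(1 + df) / max log(1 + df), the linear blend controlled by alpha, and all names and numbers here are assumptions made for illustration only.

```python
import math


def normalized_log_df(document_frequency):
    """Map raw tag counts to (0, 1] via log(1 + df) / max log(1 + df)."""
    logs = {tag: math.log1p(df) for tag, df in document_frequency.items()}
    top = max(logs.values())
    return {tag: value / top for tag, value in logs.items()}


def df_weighted_scores(similarities, document_frequency, alpha=0.7):
    """Assumed blend: (1 - alpha) * similarity + alpha * DF weight."""
    weights = normalized_log_df(document_frequency)
    return {
        tag: (1 - alpha) * sim + alpha * weights[tag]
        for tag, sim in similarities.items()
    }


# Hypothetical values: a rare technical tag with a slightly higher raw
# similarity than a common, informative tag.
similarities = {"rain": 0.40, "h4n": 0.45}
document_frequency = {"rain": 500, "h4n": 3}
scores = df_weighted_scores(similarities, document_frequency)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy case the rare tag wins on raw similarity but falls behind after weighting, which is the attenuation effect described above.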
5.0.2 Impact of Embedding-Based Semantic Evaluation
Zero-shot systems demonstrate substantial improvements under semantic evaluation
compared to exact matching. The zero-shot baseline model achieves a 166.7%
increase in F1 score and an 88.9% increase in clips with at least one correct prediction.
The DF-weighted system similarly achieves proportional gains under semantic
evaluation, with a 59.5% increase in F1 score and a 25.9% increase in clips with at least
one correct prediction.
Comparing the zero-shot baseline under exact matching (F1 = 0.006) to the
DF-weighted system under semantic matching (F1 = 0.059) reveals an 883% relative
improvement in F1, demonstrating substantial gains in zero-shot performance through
weighted tagging and semantic evaluation, which reveals latent performance missed
by exact matching.
In addition, semantic evaluation captures latent performance of RankST (up to a
44% increase in F1), highlighting the broader potential of embedding-based
evaluation in MIR tasks.
5.0.3 Tagset Quality and Noise
The zero-shot tag vocabulary, comprising 1,870 unique tags, provides broad
semantic coverage for open-domain audio tagging. Compared to curated open-domain
datasets such as FSD50K [22] (200 classes) and VGGSound [11] (310+ classes), the
zero-shot vocabulary is substantially larger. However, the extensive tagset is not
indicative of higher semantic utility; analysis reveals considerable variation in tag
quality, with a large fraction of the tagset comprising noise. These include overly
specific and semantically uninformative tags, such as proper nouns, technical
identifiers, subjective terms and fragments (Table 8). While DF weighting provides an
immediate mechanism to attenuate noisy tags, it does not fully eliminate noise
inherent to the folksonomy. This inflated vocabulary impairs CLAP generalization,
compromising zero-shot tagging performance.
Proper nouns  barcelona, japan, nasa, sony, ableton
Technical     16bit, bpm, midi, mono, s , h4n
Subjective    nice, bad, cool, yes, no
Fragments     a, el, la, c3, m, xy
Table 8: Examples of noisy tags sampled from the zero-shot tag vocabulary
5.0.4 Ground Truth Limitations
Ground truth comprises user-generated annotations exhibiting inconsistent
descriptive coverage and high sparsity. This limits evaluation, as the system may
produce relevant predictions not covered by the ground truth (Figure 20). Semantic
Bibliography
[1] 2024 in numbers | The Freesound Blog. URL https://blog.freesound.org/?p=2141.
[2] Font Corbera, F., Serrà Julià, J. & Serra, X. Folksonomy-based tag recommendation for online audio clip sharing (2012). URL http://hdl.handle.net/10230/22736. Publisher: International Society for Music Information Retrieval (ISMIR).
[3] Elizalde, B., Deshmukh, S., Ismail, M. A. & Wang, H. CLAP Learning Audio Concepts from Natural Language Supervision. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (2023). URL https://ieeexplore.ieee.org/abstract/document/10095889. ISSN: 2379-190X.
[4] Wu, Y. et al. Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation (2024). URL http://arxiv.org/abs/2211.06687. ArXiv:2211.06687 [cs].
[5] Folksonomy :: vanderwal.net. URL https://vanderwal.net/folksonomy.html.
[6] Freesound. URL https://freesound.org/.
[7] Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision (2021). URL http://arxiv.org/abs/2103.00020. ArXiv:2103.00020 [cs].
[8] LAION-AI/CLAP (2025). URL https://github.com/LAION-AI/CLAP. Original-date: 2022-03-06T20:12:49Z.
[9] Piczak, K. J. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd ACM international conference on Multimedia, 1015–1018 (ACM, Brisbane Australia, 2015). URL https://dl.acm.org/doi/10.1145/2733373.2806390.
[10] Salamon, J., Jacoby, C. & Bello, J. P. A Dataset and Taxonomy for Urban Sound Research. In Proceedings of the 22nd ACM international conference on Multimedia, 1041–1044 (ACM, Orlando Florida USA, 2014). URL https://dl.acm.org/doi/10.1145/2647868.2655045.
[11] Chen, H., Xie, W., Vedaldi, A. & Zisserman, A. VGGSound: A Large-scale Audio-Visual Dataset (2020). URL http://arxiv.org/abs/2004.14368. ArXiv:2004.14368 [cs].
[12] Olvera, M., Stamatiadis, P. & Essid, S. A sound description: Exploring prompt templates and class descriptions to enhance zero-shot audio classification (2024). URL http://arxiv.org/abs/2409.13676. ArXiv:2409.13676 [cs].
[13] Font, F. & Serra, X. Analysis of the Folksonomy of Freesound (2012).
[14] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019). URL http://arxiv.org/abs/1810.04805. ArXiv:1810.04805 [cs].
[15] Vaswani, A. et al. Attention Is All You Need (2023). URL http://arxiv.org/abs/1706.03762. ArXiv:1706.03762 [cs].
[16] Reimers, N. & Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019). URL http://arxiv.org/abs/1908.10084. ArXiv:1908.10084 [cs].
[17] Mahfuz, R., Guo, Y. & Visser, E. Improving Audio Captioning Using Semantic Similarity Metrics (2023). URL http://arxiv.org/abs/2210.16470. ArXiv:2210.16470 [cs].
[18] Smith, B. A Complete Guide to BERT with Code (2024). URL https://towardsdatascience.com/a-complete-guide-to-bert-with-code-9f87602e4a11/.
[19] GitHub - allholy/BSD10k. URL https://github.com/allholy/BSD10k.
[20] Anastasopoulou, P., Torrey, J., Serra, X. & Font, F. Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset (2024). URL http://arxiv.org/abs/2410.00980. ArXiv:2410.00980 [cs] version: 1.
[21] MTG/freesound (2025). URL https://github.com/MTG/freesound. Original-date: 2012-11-07T18:25:03Z.
[22] Fonseca, E., Favory, X., Pons, J., Font, F. & Serra, X. FSD50K: An Open Dataset of Human-Labeled Sound Events (2022). URL http://arxiv.org/abs/2010.00475. ArXiv:2010.00475 [cs].
