
Research and Evaluation of Automatic Sound FX Classification in Freesound using the Universal Category System

Author: Jaideep, Madhav
Publisher: Zenodo
DOI: 10.5281/zenodo.17304417
Source: https://zenodo.org/records/17304417/files/Madhav-Jaideep_SMC_2025_Master_Thesis.pdf
Master in Sound and Music Computing
Universitat Pompeu Fabra
Research and Evaluation of Automatic
Sound FX Classification in Freesound
using the Universal Category System
Madhav Jaideep
Supervisor: Frederic Font
July 2025
Contents
1 Introduction
1.1 Motivation
1.2 Key Concepts
1.2.1 Automatic Sound Classification (ASC)
1.2.2 Semantics and Acoustic Features
1.2.3 Multimodal learning
1.2.4 Freesound
1.2.5 Universal Category System (UCS)
1.2.6 Evaluation metrics and Performance Assessment
1.3 Objectives
1.4 Structure of the Report
2 State of the Art
2.1 History of Sound Classification
2.2 Current Approaches
2.2.1 Machine learning based approaches
2.2.2 Deep learning based approaches
2.3 Datasets, Sound FX libraries and Taxonomy
2.4 Universal Category System
2.5 UCS deployment
3 Methodology
3.1 Dataset
3.1.1 PSE150K
3.1.2 Freesound Data
3.1.3 Other datasets
3.2 Feature Extraction
3.3 Classification with dataset subset (PSE8K)
3.3.1 Experimental Design and Model Implementation
3.4 Large-Scale classification using PSE150K
3.5 Evaluation on Freesound Data
3.6 Cross-Domain Classification Analysis
3.6.1 Embedding Space Visualization
3.6.2 Cross-Modal Search
3.7 Fine-Tuning using Transfer learning
4 Results
4.1 Rapid Evaluation Results
4.1.1 Audio-Only Configuration
4.1.2 Text-Only Configuration
4.1.3 Multimodal Concatenation
4.1.4 Multimodal Weighted Fusion (alpha=0.8)
4.1.5 Model Performance Analysis
4.2 Evaluation of full dataset
4.3 Evaluation of PSE trained models on Freesound Data
4.3.1 Results on Preliminary Freesound Dataset
4.3.2 Results on Extended Freesound Dataset
4.4 Fine-tuning Results
5 Discussion
5.1 Rapid Evaluation on Dataset Subset
5.2 Training and Evaluation on the full dataset
5.3 Evaluation on Freesound Data
5.3.1 Preliminary Dataset Evaluation
5.4 Comprehensive Dataset Evaluation
5.4.1 t-SNE Analysis of Embedding Spaces Across Domains
5.4.2 Category Analysis: Performance Across Sound Classes
5.5 Fine-Tuning Analysis
5.6 Future Work
6 Conclusion
List of Figures
List of Tables
Bibliography
Acknowledgement
I would like to express my sincere gratitude to my supervisor, Frederic Font, for his continuous guidance and insights throughout the course of this thesis. I am grateful for the interest he took in my work and for helping me understand everything better as I progressed. I would also like to thank Panagiota Anastasopoulou for kindly reviewing my work and providing valuable insights, which greatly helped improve the quality of this thesis and guided me towards new directions. I would also like to thank Xavier Serra for his academic guidance throughout the period of my Master's study.
I am also grateful to all my faculty and the entire Music Technology Group at Universitat Pompeu Fabra for providing an amazing environment for learning and growing, and for their generous support and resources.
I am also extremely thankful to all my peers and classmates for their encouragement, and for creating a supportive environment that made this journey both rewarding and enjoyable.
Finally, I would like to express my heartfelt gratitude to my family in India (my parents, my sister, and my partner Divya) for their unwavering love and support throughout my journey as a Master's student in Barcelona, without which this would not have been possible.
Chapter 1. Introduction
• Domain gap: The most critical issue is the significant domain mismatch between training data and real-world samples. A model trained on professionally recorded, clean audio is likely to misclassify or fail entirely when exposed to low-quality, noisy, or acoustically varying recordings from everyday settings.
• Metadata Variability: Textual information about a sound, such as its title, tags, and description, can enhance classification through multimodal models[6]. However, in community-driven platforms, this metadata is usually inconsistent or sometimes even incorrect. This metadata variability weakens the model's ability to leverage text alongside audio.
• Label Subjectivity: Human perception of sounds is very context dependent and subjective. What one person recognizes as "clapping", another might label as "applause" or "hand percussion"; what one person tags as "storm", another might tag as "rain". This ambiguity in labeling creates label noise in training data and leads to confusion in model predictions. At the same time, it also raises a philosophical question about the "correctness" of a tag in the absence of a ground truth.
• Class Imbalance: In many user-generated datasets, certain sound categories dominate while others are rare. For example, everyday sounds like footsteps, city noise, nature sounds or phone notifications may appear frequently, while niche or more context-specific sounds are usually fewer in number. This may lead to unbalanced learning, where classification systems favor overrepresented classes and neglect smaller ones.
• Scalability: Beyond research accuracy, there is also the motivation of building tools that are functionally reliable in professional environments. Sound designers, music producers, and developers increasingly rely on accurately searchable sound libraries, tagging systems, and audio retrieval tools. For these systems to be genuinely useful in real-world applications, they must function effectively even with unstructured and unpredictable datasets.
These converge on a key research question: can sound classification systems designed around professional datasets and taxonomies like UCS be adapted to work effectively on noisy and highly variable user-generated sound libraries? By investigating this problem, this thesis aims to contribute towards a better understanding of sound classification and also to the broader goal of making intelligent automatic sound classification systems that are deployable across diverse environments.
1.2 Key Concepts
1.2.1 Automatic Sound Classification (ASC)
Automatic Sound Classification refers to the use of computational algorithms to manage, classify, and retrieve sound effects by automatically assigning meaningful labels to audio recordings, mimicking the human ability to recognize and categorize sounds. Unlike humans, who classify sounds intuitively through context and experience, machines depend on large, annotated datasets created by human experts. These labeled recordings form the foundation for training models using signal processing, machine learning, or deep learning techniques. Through this process, systems learn to detect patterns in audio and perform accurate classifications. Typically, the workflow begins with pre-processing, where raw recordings are cleaned, standardized, and segmented. Next, features are extracted from the waveform in a structured format suitable for computation. Finally, machine learning or deep learning algorithms associate these features with sound labels. Automatic sound classification has broad applications, including environmental monitoring, medical diagnosis (e.g., respiratory sound analysis), music information retrieval, security systems, and biodiversity research through automated species identification.
1.2.2 Semantics and Acoustic Features
In sound classification, both the acoustic features of a sound and its semantic information, such as title, tags and description, play a vital role in understanding and categorizing audio content. Acoustic features refer to measurable information extracted from an audio waveform, such as spectral shape, spectral energy, pitch, dynamics or temporal features. These features are often represented using Mel-frequency Cepstral Coefficients (MFCCs), log-Mel spectrograms, or learned embeddings that capture the various properties of a sound and are essential for models that use audio as a basis.
On the other hand, semantic features derive from human-annotated metadata like titles, tags and descriptions, often offering context and meaning beyond a raw audio signal. For example, for a sound that contains rain noise, tags like "rain", "water" or "rainfall" help provide meaningful context and disambiguate between acoustically similar sounds.
Acoustic features typically exist in high-dimensional spaces and capture temporal-spectral patterns, while semantic features operate in conceptual spaces where relationships between categories can be learned through word embeddings or ontological structures. At the same time, acoustic features are sensitive to recording conditions, background noise, and signal quality, whereas semantic features remain consistent across different audio recordings of the same conceptual category. This makes semantic features valuable for generalization across diverse recording environments and other variable factors.
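As a rough illustration of what "measurable information extracted from an audio waveform" means, the following sketch computes two simple frame-level descriptors, RMS energy and spectral centroid, with NumPy. The function name, frame size, hop size and the synthetic test tone are assumptions chosen for the example; these two descriptors merely stand in for the richer representations (MFCCs, log-Mel spectrograms, learned embeddings) mentioned in this section.

```python
import numpy as np


def frame_features(signal, sample_rate, frame_size=1024, hop_size=512):
    """Return per-frame RMS energy and spectral centroid (in Hz)."""
    window = np.hanning(frame_size)
    # Centre frequency (Hz) of each FFT bin for a real-valued frame.
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    rms, centroid = [], []
    for start in range(0, len(signal) - frame_size + 1, hop_size):
        frame = signal[start:start + frame_size]
        # Energy descriptor: root-mean-square amplitude of the raw frame.
        rms.append(np.sqrt(np.mean(frame ** 2)))
        # Spectral-shape descriptor: magnitude-weighted mean frequency.
        magnitude = np.abs(np.fft.rfft(frame * window))
        total = magnitude.sum()
        centroid.append((freqs * magnitude).sum() / total if total > 0 else 0.0)
    return np.array(rms), np.array(centroid)


# Synthetic example: one second of a 1 kHz sine tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
rms, centroid = frame_features(tone, sr)
print(rms.mean(), centroid.mean())
```

For a pure tone the centroid sits close to the tone's frequency, while noisy or broadband recordings spread energy across the spectrum and shift it, which is one reason acoustic features are sensitive to recording conditions.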
1.2.3 Multimodal learning
Multimodal learning refers to the process of modeling information from multiple data sources or modalities. In the context of sound classification, this involves the acoustic and semantic features of a sound.
In multimodal systems, combining these two modalities can enhance classification performance, especially in noisy or ambiguous scenarios where one modality alone may be insufficient. By combining them, the system aims to exploit complementary strengths: audio features provide detailed signal-level information, while text/semantic features provide high-level information about the audio [6].
Multimodal systems show better robustness compared to unimodal approaches, as they can maintain performance even when one modality is degraded. For instance, if audio quality is poor due to background noise, semantic features from metadata can compensate, and vice versa when textual descriptions are incorrect or ambiguous.
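Two common ways of combining modalities, concatenation and weighted fusion, can be sketched as follows. This is a minimal illustration, assuming precomputed audio and text embeddings stored as NumPy arrays; the function names, the L2 normalization step, and the convention that alpha weights the audio modality are assumptions made for the example, not necessarily the implementation used in the experiments.

```python
import numpy as np


def l2_normalize(x):
    """Scale each embedding (last axis) to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, 1e-12)


def concat_fusion(audio_emb, text_emb):
    """Stack both modalities side by side; dimensions may differ."""
    return np.concatenate([l2_normalize(audio_emb), l2_normalize(text_emb)], axis=-1)


def weighted_fusion(audio_emb, text_emb, alpha=0.8):
    """Blend both modalities; alpha weights audio (assumed), 1 - alpha weights text."""
    if audio_emb.shape != text_emb.shape:
        raise ValueError("weighted fusion needs embeddings of equal shape")
    return alpha * l2_normalize(audio_emb) + (1.0 - alpha) * l2_normalize(text_emb)


# Toy embeddings: 2 sounds, 4 dimensions per modality.
audio = np.ones((2, 4))
text = np.zeros((2, 4))
text[:, 0] = 1.0

fused_concat = concat_fusion(audio, text)
fused_weighted = weighted_fusion(audio, text, alpha=0.8)
print(fused_concat.shape, fused_weighted[0])
```

Concatenation lets the classifier learn how to weigh each modality, at the cost of a larger input; weighted fusion keeps the input size fixed but requires both embeddings to share a dimension and fixes the trade-off in advance.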
1.2.4 Freesound
Freesound is an open, collaborative database platform where users can upload and share audio recordings ranging from everyday sounds to experimental synthesized textures, with over 670,000 user-contributed audio samples. It serves as a vast library and audio resource for sound designers, researchers and artists. The platform uses advanced technologies for sound classification, retrieval and search, leveraging the Essentia audio analysis library to automatically extract acoustic features from uploaded sounds and a similarity search engine for content-based retrieval. Additionally, Freesound has recently implemented the Broad Sound Taxonomy (BST)[7] as its organizational scheme. Together, these enable users to find sounds through traditional text-based searches, as well as through Query-by-Example functionality, where users can upload audio files to find acoustically similar sounds based on spectral, timbral and other acoustic characteristics.
1.2.5 Universal Category System (UCS)
The Universal Category System (UCS) is a standardized hierarchical taxonomy developed to organize and label SFX in professional audio environments. It consists of 82 top-level categories with over 700 sub-categories, allowing for consistent naming and structure across various levels, which enables a uniform structure in commercial sound libraries.
1.2.6 Evaluation metrics and Performance Assessment
Performance evaluation in automatic sound classification relies on several key metrics that provide different perspectives on system performance. These metrics are important for assessing classification performance and accuracy, and for assessing whether there are any imbalances.
Accuracy
Accuracy represents the proportion of correctly classified samples over the total number of samples:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]
where TP (True Positives), TN (True Negatives), FP (False Positives), and FN (False Negatives) represent the counts from the confusion matrix.
Balanced Accuracy
Balanced accuracy addresses the limitations of standard accuracy by computing the average recall obtained on each class, providing equal weight to each class (balanced) regardless of its frequency:
\[ \text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \]
For multi-class problems with C classes, this extends to:
\[ \text{Balanced Accuracy} = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_i}{TP_i + FN_i} \]
This metric is particularly valuable for sound classification tasks where certain categories may be underrepresented in the dataset.
F1-Score
The F1-score provides the harmonic mean of precision and recall, offering a single metric that balances both measures:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{F1-Score} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \]
The F1-score is particularly useful when both false positives and false negatives carry significant costs, making it well-suited for sound classification applications where both precision and recall are important.
Weighted F1-Score
For multi-class classification problems, the weighted F1-score computes the F1-score for each class and then calculates a weighted average based on the support (number of true instances) of each class:
\[ \text{F1-Score}_{\text{weighted}} = \frac{1}{N}\sum_{i=1}^{C} n_i \cdot \text{F1-Score}_i \]
where N is the total number of samples, C is the number of classes, n_i is the number of samples in class i, and F1-Score_i is the F1-score for class i. This metric accounts for class imbalance by giving more weight to classes with more samples, which is important in datasets with class imbalances.
Confusion Matrix
The confusion matrix provides a comprehensive view of classification performance by showing the actual versus predicted classifications for each class. For a multi-class problem with C classes, the confusion matrix M is a C × C matrix where:
\[ M_{i,j} = \text{number of samples with true label } i \text{ predicted as label } j \]
The confusion matrix enables detailed analysis of classification errors, revealing which classes are most commonly confused with each other. For sound classification, this is useful for identifying acoustically or semantically similar categories that the system struggles to differentiate.
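The metrics defined in this section can all be derived from the confusion matrix. The sketch below does so with NumPy; the function names and the toy labels are illustrative assumptions, and classes with no true samples are assigned a recall of zero, which may differ from the convention of a given library.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """M[i, j] = number of samples with true label i predicted as label j."""
    m = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m


def classification_metrics(m):
    """Accuracy, balanced accuracy and weighted F1 from a confusion matrix."""
    tp = np.diag(m).astype(float)
    fn = m.sum(axis=1) - tp        # row i holds all samples whose true label is i
    fp = m.sum(axis=0) - tp        # column j holds all samples predicted as j
    support = m.sum(axis=1)        # n_i: number of true samples per class
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = 2 * tp + fp + fn
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / m.sum(),
        "balanced_accuracy": recall.mean(),
        "weighted_f1": (support * f1).sum() / support.sum(),
    }


# Toy example with three classes of unequal size.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]
M = confusion_matrix(y_true, y_pred, num_classes=3)
metrics = classification_metrics(M)
print(M)
print(metrics)
```

In this toy case accuracy (5/7) is noticeably higher than balanced accuracy (1.75/3), because the single sample of the smallest class is misclassified; this is the kind of imbalance effect the balanced metric is meant to expose.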
1.3 Objectives
The main objective of this thesis is to investigate the practical applicability of standardized taxonomies, specifically the Universal Category System (UCS), for ASC in real-world, user-generated audio environments. While prior research has demonstrated high classification accuracy in controlled and professionally curated datasets, there is limited understanding of how these systems would perform in community-driven platforms such as Freesound. This thesis aims to bridge that gap through the following specific objectives:
• Enhance and refine already existing classifiers: To improve and refine existing classifiers, such as MLPs, for sound classification systems by using additional embeddings and data for better performance.
• Assess domain transfer effectiveness: Evaluate how well UCS-based classifiers trained on professionally curated datasets, such as Pro Sound Effects (PSE150K), generalize to user-generated content from Freesound. Analyze the performance degradation caused by domain shift and its implications for classification.
• Evaluate the role of multimodal learning: Implement and evaluate unimodal and multimodal classification architectures to assess how different modalities contribute to predictive performance. Determine the relative contribution of audio and textual metadata in classification tasks and evaluate their effectiveness.
• Provide valuable insights for practical deployment: Research and investigate possible directions towards designing more robust and generalizable ASC systems that can work with varied datasets and audio content.
• Develop a dataset for cross-domain evaluation: Construct and make publicly available a custom-built, UCS-organized Freesound dataset that allows for evaluation of domain transfer from professional to user-generated audio content. This dataset contribution aims to enable future research to conduct standardized comparisons and to understand domain adaptation challenges in sound classification systems.
The custom Freesound dataset and code are publicly available¹ for further research.
1.4 Structure of the Report
The remainder of this thesis is organized as follows:
Chapter 2: State of the Art - This chapter discusses the history of automatic sound classification systems, from traditional signal processing approaches to deep learning techniques, including the approaches currently in use. It also introduces key technologies, taxonomies and models relevant to this study.
Chapter 3: Methodology - This chapter outlines the experiments and procedures used throughout this research. It explains the dataset collection, pre-processing steps, model configurations and evaluation methodologies.
Chapter 4: Results - This chapter presents the outcomes of the experiments, including performance metrics and observations. The key findings are highlighted and compared across different models and data conditions.
Chapter 5: Discussion - This chapter describes and analyzes the results obtained from all the experiments, interpreting their implications in the context of real-world sound classification.
Chapter 6: Conclusion - This final chapter summarizes the main contributions and findings of the thesis. It discusses the broader significance of the work, along with suggestions for possible future improvements in sound classification systems.
¹ https://github.com/MadhavJ06/freesound_ucs.git
Chapter 2
State of the Art
The field of automatic sound classification has evolved significantly over the past two decades, driven by advances in signal processing, machine learning, and the growing availability of large-scale audio datasets. This chapter provides a comprehensive review of the current state of the art in automatic sound classification, with particular emphasis on sound effects classification, taxonomic systems, and multimodal approaches that combine acoustic and semantic features.
Finally, we identify possible gaps in the current literature, particularly the lack of systematic evaluation frameworks for cross-domain sound effect classification and the limited exploration of UCS-based taxonomic classification in user-generated content scenarios. This analysis establishes the foundation for the methodological contributions presented in the following chapters and places this research within the broader context of automatic sound classification and audio content organization.
2.1 History of Sound Classification
Early research in sound classification drew from various branches of study, such as cognitive science and psychoacoustics, where the goal was to understand how humans perceive and distinguish sounds. These early studies focused on identifying perceptual features such as pitch, timbre and temporal structure, which enable humans to differentiate sounds. This work laid the foundation for computational models by establishing key acoustic properties that influence auditory perception and classification.
Initial work in sound classification research focused on pitch perception. Helmholtz's place theory and Terhardt's harmonic template model[8][9] demonstrated how our brain and auditory system are stimulated differently according to the frequencies and tonal qualities in a sound. Risset and Chowning developed computational models[10][11] that analyzed sounds based on their spectral envelopes. These spectral profiles were used to distinguish between different instruments.
Theories of categorization in cognitive psychology have significantly influenced sound classification research. The Gestalt principles of Wertheimer[12] explain how humans organize auditory stimuli into meaningful patterns, such as grouping environmental sounds into a single category. Rosch's prototype theory[13] suggests that categorization is based on resemblance to a familiar or proper association of a sound, similar to how musical genres are identified and categorized by humans. Bregman introduced Auditory Scene Analysis[14], which describes how humans perceive complex soundscapes by segregating and analyzing different auditory elements. These studies provided the conceptual framework and foundations for the development of computational methods to analyze sounds based on their spectral characteristics.
2.2 Current Approaches
2.2.1 Machine learning based approaches
In the early stages of adopting machine learning (ML) methodologies for sound classification, handcrafted features such as Mel-Frequency Cepstral Coefficients (MFCCs) [15] were widely used for feature extraction. These features formed the basis for traditional machine learning models, including Hidden Markov Models (HMMs), Support Vector Machines (SVMs) and Gaussian Mixture Models (GMMs)
[...] be shared across multiple parent categories, creating a flexible vocabulary. For instance, "bells" could be part of "metal" as well as "music", the subcategory "movement" might describe both "cloth rustling" and "plastic crumpling", and "crackle" could equally apply to the sound of burning fire or sparking electricity. This shared modifier approach efficiently mirrors how sound designers naturally describe effects in real-world practice while preventing redundancy and maintaining consistency.
UCS offers a further, more granular level through Category IDs, which represent unique combinations of top-level categories and their associated subcategories. There are 457 distinct identifiers formed by algorithmically merging elements from both tiers of the hierarchy. For example, combining the top-level category "Water" with the subcategory "flow" will yield the category ID "WATERflow" or "WATflw".
Figure 2: UCS Categorization example
This hierarchical structure offers several key advantages for both human users and automatic sound classification systems. As the UCS represents a widely adopted standardization effort, evaluating a classifier's performance within its framework provides directly applicable insights for various workflows.
2.5 UCS deployment
The UCS has been widely adopted globally by professional sound libraries and audio management tools. ProSoundEffects, Soundly Pro and Krotos offer extensive industry-standard libraries (e.g. BBC Sound Effects, Hollywood Edge) that contain a diverse range of SFX for commercial use. These libraries are aligned with metadata according to the UCS taxonomy for precise searchability.
As highlighted by Alison et al.[42], even UCS-compliant systems face challenges. The study addresses the challenges of inconsistent metadata and taxonomy updates in SFX libraries by proposing context-agnostic audio embeddings trained via representation learning. These embeddings capture acoustic properties and features independent of predefined labels. This approach was seen to outperform traditional methods like OpenL3 by handling class imbalance and label noise through cross-dataset training and metric learning. While the UCS standardizes libraries, this work complements the system by creating generalized embeddings applicable to both UCS-compliant and non-compliant datasets. However, this approach does not assign fixed labels like traditional classifiers; instead, it retrieves sounds based on acoustic similarity. Used within a multimodal methodology, it could be relevant for platforms like Freesound, where user-generated tags often deviate from standards, and it provides a proper framework for evaluation.
By adopting the hierarchical taxonomy of UCS as a classification benchmark, our research can address the challenges of classification and taxonomies in Freesound, where inconsistent metadata currently limits searchability and usability. The deployment of UCS in a real-world context, such as the Freesound community-driven tagging system, will help to address many challenges. Prior work [43] explored sound classification using the Universal Category System (UCS) on the ProSoundEffects dataset, developing and evaluating automatic classifiers based on standard classification metrics to assess their performance. This research extends that work by trying to refine and improve the classifiers and by evaluating how established UCS classifiers perform under imperfect real-world conditions. By implementing UCS-based classification with Freesound data, we aim to evaluate how community-based platforms could potentially benefit from industry standards, while simultaneously providing further insights and research opportunities into how these taxonomies perform.
Chapter 3
Methodology
This chapter outlines the methodology used in this thesis to experiment with and investigate sound classification using UCS[2] in the context of Freesound. The pipeline consists of comparing different models, developing the Freesound dataset, extracting features from the datasets, evaluating and possibly refining and enhancing existing classifiers using different architectures, and cross-domain evaluation. The experiments are performed at various levels, starting from a small subset of PSE150K, on which we evaluate various architectures and try to find the most suitable ones for the experiments. A set of different models with different multimodal features was obtained from these evaluations and used to finally evaluate cross-domain performance.
3.1 Dataset
3.1.1 PSE150K
The PSE150K dataset is an audio sound effects collection that was developed by the Pro Sound Effects (PSE) company. PSE has developed one of the largest libraries of professionally recorded sounds. For our work we use a dataset that was provided by PSE to the Music Technology Group (MTG) at Pompeu Fabra University[44]. This dataset consists in total of 380,000 audio files which have been categorized according to the Universal Category System (UCS)[2] into 82 top-level categories and 753 sub-categories. From this full dataset, we use approximately 143,300 audio files along with a metadata file in .csv format that contains the mapping of each audio file to its corresponding top-level category, sub-category and category ID. We refer to this collection as the PSE150K dataset. From dataset analysis, it was noted that across the 82 top-level categories, there were 368 unique subcategories. The dataset shows significant variability in category representation, ranging from 43 samples (CERAMICS) to 10,539 samples (VEHICLES). The most populated categories are common sound effects like VEHICLES, AMBIENCE, VOICES, WATER and ANIMALS, while some ambiguous and specialized categories contain fewer samples, reflecting their more niche applications in professional usage.

Table 1: Summary of Dataset Statistics for PSE150K and PSE8K
Dataset    Top-level Categories    Sub-categories    Avg. Samples per Category
PSE150K    82                      368               ≈1747
PSE8K      82                      291               ≈97

Since 143,000 audio files are computationally expensive to perform evaluations on, a smaller subset of approximately 8,000 audio files was used to perform a preliminary rapid evaluation of sound classification systems. This subset, PSE8K, is balanced, targeting approximately 100 samples per category across all 82 top-level UCS categories. 74 of the 82 categories have 100 samples each, while the remaining categories have fewer, such as WINDOWS (62 samples) and EQUIPMENT (59 samples). This dataset has a better class distribution and was used as the base for robust classification model training and evaluation. This dataset serves as the primary dataset, representing a professionally curated and controlled dataset.
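A class-balanced subset of this kind could be drawn along the following lines. This is only a sketch: the function name, the (file, category) record layout, the fixed random seed and the toy category sizes are assumptions, not the procedure actually used to build PSE8K.

```python
import random
from collections import Counter, defaultdict


def balanced_subset(records, per_category=100, seed=0):
    """Sample at most `per_category` files from every category.

    records: iterable of (file_id, category) pairs.
    Categories with fewer files than the cap keep all of their files.
    """
    by_category = defaultdict(list)
    for file_id, category in records:
        by_category[category].append(file_id)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for category in sorted(by_category):
        files = by_category[category]
        k = min(per_category, len(files))
        subset.extend((f, category) for f in rng.sample(files, k))
    return subset


# Hypothetical metadata: one large category and two small ones.
records = (
    [(f"veh_{i}.wav", "VEHICLES") for i in range(500)]
    + [(f"win_{i}.wav", "WINDOWS") for i in range(62)]
    + [(f"cer_{i}.wav", "CERAMICS") for i in range(43)]
)
subset = balanced_subset(records, per_category=100)
counts = Counter(category for _, category in subset)
print(counts)
```

Capping rather than oversampling means small categories stay below the target, which matches the observation that a few categories in the subset have fewer than 100 samples.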
3.1.2 Freesound Data
Initial Dataset development and limitations
A custom Freesound dataset was created specifically for cross-domain evaluation in this research. The initial phase involved developing a smaller dataset comprising approximately 1,534 audio clips sourced through the Freesound API. Each audio sample was mapped to one of the 82 top-level UCS categories and accompanied by comprehensive metadata including Freesound ID, title, tags, description, and source URL. The dataset targeted approximately 20 samples per category, using generic search queries that combined top-level category names with relevant synonyms to ensure broad coverage across each class.
However, several limitations emerged during the data collection process. Since the dataset used only the MP3 files from Freesound rather than files of all formats, the collected data was constrained in both quality and availability. This limitation, combined with the generic nature of the search queries, resulted in insufficient samples for some ambiguous top-level categories, ultimately reducing the final dataset to 79 categories out of the original 82. Further analysis revealed significant inconsistencies in audio quality across samples, reflecting the diversity of recordings and environments used by different Freesound contributors. Additionally, the absence of user-based sampling functionality led to an unintended bias where multiple samples were collected from the same users, potentially compromising the dataset's diversity and representativeness for robust cross-domain evaluation.
Comp ehensi e Da ase c ea ion
In o de o mi iga e hese issues wi h he p elimina y da ase , we se ou o build a
la ge da ase ha could be used as a base o u u e wo k in c oss-domain classi ica-
ion om p o essional da ase s o mo e eal-wo ld use -gene a ed con en da ase s.
A comprehensive Freesound dataset was developed for cross-domain evaluation,
comprising approximately 9,400 audio files mapped to the 82 top-level UCS categories.
The dataset employed a query-based collection methodology utilizing the Freesound
API, targeting around 100-120 samples per category. For each UCS category,
multiple queries were created using the top-level category as the primary query and
subcategories and their synonyms as secondary queries. Each query was structured
in the following manner: "<subcategory/synonym> <primary query>". This made
the queries very extensive, covering many different possibilities in
each top-level category. The number of samples to be collected from a category was
distributed across the queries depending on the number of queries that were present
for that category. For example, if a category "AIR" had 10 queries, each
query would collect 12 samples. These queries were created and mapped across all
the categories manually using the UCS[2] categorization available on the UCS website.
In order to ensure proper distribution within a category across its queries,
fallback strategies were implemented such that if one or more queries did not collect a
sufficient number of samples, the collection was redirected towards the other search
queries that were more successful. This process was repeated until each category
had collected a satisfactory number of samples.
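The query construction and quota allocation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the collection script used in the thesis: the function names `build_queries` and `allocate_quota`, the `available` result counts, and the secondary terms in the usage note are hypothetical, and no Freesound API call is made.

```python
def build_queries(primary, secondary_terms):
    """Build one '<subcategory/synonym> <primary query>' string per secondary term."""
    return [f"{term} {primary}" for term in secondary_terms]


def allocate_quota(queries, target, available):
    """Split `target` samples evenly across the queries, then redirect any
    shortfall to queries that still have unused results (the fallback step).

    `available` maps each query to the number of usable results it returned
    (hypothetical input; in practice this would come from the search API).
    """
    base, extra = divmod(target, len(queries))
    # Even split; the first `extra` queries absorb the remainder.
    wanted = {q: base + (1 if i < extra else 0) for i, q in enumerate(queries)}
    taken = {q: min(wanted[q], available.get(q, 0)) for q in queries}
    shortfall = target - sum(taken.values())
    for q in queries:
        if shortfall == 0:
            break
        spare = available.get(q, 0) - taken[q]
        give = min(spare, shortfall)
        taken[q] += give
        shortfall -= give
    return taken
```

With ten queries and a target of 120, every query is asked for 12 samples, matching the "AIR" example; if one query only returns 2 usable results, the remaining 10 are drawn from the next query that has spare results.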
Additionally, in order to ensure quality and prevent bias in the dataset, some filtering
functionalities were implemented. The API only allows downloading full-length previews
of the audio files, not the original files in their original format.
Preference was given to high-quality formats (WAV > FLAC > MP3), and the
preview files were downloaded in .ogg format, which is of higher quality than MP3.
The chosen files were also filtered to have a minimum user rating of 3 wherever a rating
was available. In order to prevent bias, global tracking of samples per user across the
collection was implemented so that no more than 3 or 4 samples were collected per
user across the different categories. This ensured acoustic and metadata diversity in
the dataset. Along with the files, the metadata of each audio file
was also collected and stored, including the file name, Freesound ID, Freesound
URL, UCS category of the file, query used, tags, description, uploader
information, rating, and audio file information (file format, sample rate). Duplicate
downloads were not allowed. The whole process was automated in a
script, with oversight and manual supervision of the data collected.
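The filtering rules above (rating threshold, per-user cap, duplicate guard) can be illustrated with a short sketch. It is only a plausible reading of the description: the function name `select_samples` and the dictionary keys `id`, `user` and `rating` are hypothetical and do not reflect the actual script or the Freesound API response format.

```python
from collections import Counter


def select_samples(candidates, max_per_user=3, min_rating=3.0):
    """Keep candidates that pass the rating filter, are not duplicates,
    and do not push any uploader past `max_per_user` kept samples."""
    per_user = Counter()   # global count of kept samples per uploader
    seen_ids = set()       # sound ids already kept (duplicate guard)
    kept = []
    for sound in candidates:
        rating = sound.get("rating")
        # The rating filter only applies where a rating is available.
        if rating is not None and rating < min_rating:
            continue
        if sound["id"] in seen_ids:
            continue
        if per_user[sound["user"]] >= max_per_user:
            continue
        per_user[sound["user"]] += 1
        seen_ids.add(sound["id"])
        kept.append(sound)
    return kept
```

Because the counter is shared across all categories, calling the function once over the whole candidate stream enforces the cap globally rather than per category.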
The resulting dataset achieved approximately 120 samples per category, with some
ambiguous or specialized categories yielding fewer samples due to limited availability
of suitable content on the platform, as shown in Figure 3. This comprehensive dataset
represents a significant improvement over the preliminary version, establishing a robust
foundation for evaluating cross-domain automatic sound classification
performance between professionally curated sound libraries and user-generated
audio content from collaborative platforms.
Figure 3: Category Distribution in Freesound dataset
A notable variation emerged between the professional PSE dataset and the
user-generated Freesound dataset in data quality and consistency. The audio quality
in the Freesound dataset exhibited significant variability across different
contributors, reflecting the diverse recording equipment, acoustic environments, and
expertise of individual users. This contrasts with the PSE dataset, which generally
contains more consistent types of recordings and audio quality across the board.
A key distinction between the two datasets may lie in the use of manual labeling
rather than automatic querying.
Similarly, the metadata quality revealed significant differences between the two
datasets. While the PSE dataset maintains comprehensive and standardized
metadata with systematic file naming conventions and structured tagging methodologies,
the Freesound dataset's user-contributed nature showed highly inconsistent
metadata. This inconsistency was reflected throughout the dataset: some users would tag
their files with extensive, detailed descriptions, while others provided only a few words.
File names ranged from alphanumeric codes to overly verbose descriptions.
This stark contrast between the professional, standardized PSE dataset and
the chaotic reality of user-generated content illustrates one of the biggest
challenges in audio classification research: training models on professional, curated
data only to deploy them in the unpredictable world of user-contributed uploads.
3.1.3 Other datasets
In order to compare UCS with other taxonomies, two other datasets were evaluated.
The first is BSD10K (Broad Sound Dataset)[4], which contains 10,309 Freesound audio clips
manually annotated according to the Broad Sound Taxonomy, which has 5 top-level
categories and 23 subcategories. The second is FSD50K
(Freesound Dataset 50K)[39], which contains 51,197 Freesound audio clips that are
assigned to different classes aligned with the AudioSet ontology[5]. The AudioSet
ontology is a hierarchical taxonomy of 632 sound event classes, developed by Google.
3.2 Feature Extraction
Feature extraction for the datasets was done using LAION CLAP[34]. The CLAP
embeddings were used to represent both audio and textual data in a common latent
space. The CLAP model provides a 512-dimensional embedding for each audio
clip and its corresponding text metadata. CLAP embeddings were extracted directly
from the waveforms of the audio files using a pre-trained LAION-CLAP encoder for
PSE150K and the Freesound data alike. The processing chain begins with audio
loading via librosa[45] at a target sampling rate of 48 kHz, followed by conversion of
the audio into mel spectrograms, which is handled by the CLAP model. Each audio
file is passed through CLAP's audio encoder to produce a 512-dimensional feature
vector that captures semantic audio representations.
In order to experiment with and investigate different modalities, text metadata was
processed and extracted from the CSV file that held information about each audio file,
including its file name, top-level category and subcategory. For the PSE150K
dataset, the metadata was professionally curated and had proper structure, whereas
for real-world data like Freesound, the metadata ranged from detailed descriptions
and tags to minimal words or incomplete descriptions. The text feature extraction
generated descriptions of audio samples. The meaningful keywords from filenames
and descriptions were extracted and assembled into structured descriptions such as
"Audio sample with characteristics (keywords) and type (subcategory)" along with
some simple tags, the top-level categories and subcategories. These generated text
descriptions are passed through CLAP's text encoder to produce 512-dimensional
embeddings that share the same space as the audio embeddings. This enabled
multimodal comparisons.
The system employs two multimodal fusion strategies to combine audio and text
embeddings. Concatenation fusion merges the 512-dimensional audio and text
vectors into a 1024-dimensional representation, preserving all information from both
modalities. Weighted fusion maintains the original 512-dimensional space by
combining modalities with learnable weights, defaulting to 80% audio and 20% text
(alpha = 0.8). The same feature extraction pipeline was used on both the PSE150K
and Freesound datasets.
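The two fusion strategies can be written down in a few lines. The sketch below is dependency-free Python for illustration only; in practice the embeddings would be NumPy or PyTorch arrays. The function names are hypothetical, and the weight is treated as a fixed `alpha` rather than a learnable parameter, which is a simplification of the description above.

```python
def concat_fusion(audio_vec, text_vec):
    """Concatenate the two embeddings: 512 + 512 -> 1024 dimensions."""
    return list(audio_vec) + list(text_vec)


def weighted_fusion(audio_vec, text_vec, alpha=0.8):
    """Weighted sum that stays in the shared 512-dimensional space.

    alpha is the audio weight; (1 - alpha) is the text weight.
    """
    if len(audio_vec) != len(text_vec):
        raise ValueError("audio and text embeddings must have the same length")
    return [alpha * a + (1.0 - alpha) * t for a, t in zip(audio_vec, text_vec)]
```

Concatenation leaves it to the downstream classifier to weigh the modalities, whereas weighted fusion fixes their relative contribution before the classifier sees the vector.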
The PSE8K subset consisted of 7,990 audio files in total. To ensure consistency
and completeness, the file names without their extensions were cross-referenced
against the files listed in the metadata file of the embeddings
extracted with CLAP. This matching procedure revealed that valid embeddings were
available for 5,568 audio files within the subset. Since this subset was derived from
the bigger full data set of 350K audio files, some files ended up without
extracted embeddings. However, it was confirmed that the
categories were still reasonably well balanced and not substantially skewed.
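The cross-referencing step amounts to comparing file stems. A minimal sketch, assuming both lists are plain file names; the function name `match_embeddings` and the example extensions are hypothetical:

```python
from pathlib import Path


def match_embeddings(subset_files, embedded_files):
    """Split subset files into those with and without an extracted embedding,
    comparing file names with their extensions removed."""
    embedded_stems = {Path(name).stem for name in embedded_files}
    matched = [f for f in subset_files if Path(f).stem in embedded_stems]
    missing = [f for f in subset_files if Path(f).stem not in embedded_stems]
    return matched, missing
```

Counting the `matched` list per category afterwards is one way to check that the reduced subset is still balanced.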
3.3 Classification with dataset subset (PSE8K)
3.3.1 Experimental Design and Model Implementation
To establish baseline performance and explore optimal architectures, initial
experiments were conducted on the PSE8K subset, which provided a suitable foundation
for rapid evaluation and iterative model development. Several classification models
were tested to assess their ability to handle the unique challenges posed by both
audio and text data, as well as their combined multimodal embeddings.
The first model evaluated was the K-Nearest Neighbors (KNN) classifier,
serving as a non-parametric baseline model for classification evaluations.
Classification decisions are made by finding the k = 3 nearest neighbors of a
query point in the embedding space based on cosine distance. Majority voting among
these neighbors determines the predicted class. To handle the high-dimensional
CLAP embeddings effectively, StandardScaler normalization was applied to
standardize the feature space before computation. This model supports multimodal
fusion by concatenating audio and text embeddings into a 1024-dimensional vector,
and also supports audio-only or text-only modes using 512-dimensional inputs. The KNN
approach does not involve an explicit training phase but provides valuable insight
into the separability of the embedding space and robustness against class imbalance
through local majority voting.
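The decision rule of this baseline can be made concrete with a small, dependency-free sketch. It shows only the cosine-distance neighbor search and majority vote with k = 3; the standardization step is omitted, and the function names are hypothetical. A library implementation such as scikit-learn's `KNeighborsClassifier` would normally be used instead.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: cosine_distance(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tie, Counter keeps insertion order, so the closest neighbor wins.
    return votes.most_common(1)[0][0]
```

Because there is no training phase, the cost is paid at query time: every prediction compares the query with all stored embeddings.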
The next model implemented was the Multi-Layer Perceptron (MLP), a
feedforward neural network featuring two hidden layers containing
1024 and 512 neurons, respectively, with the tanh activation function for
non-linearity. Training uses the Adam optimizer with an adaptive
learning rate schedule initialized at α = 0.001. L2 regularization with penalty
weight λ = 0.001 reduces overfitting, and early stopping based on
validation accuracy gives better convergence control. Input features are preprocessed
through StandardScaler, while categorical targets are encoded with LabelEncoder.
The MLP similarly supports the audio-only, text-only and the multimodal con-
mance. The evaluation used standardized feature extraction pipelines, employing
the CLAP[34] model to generate audio and text embeddings, which served as inputs
to various classification architectures. This approach enabled rapid benchmarking
of different modality-specific and multimodal fusion models under a consistent
framework, yielding initial performance results which help guide the subsequent
tasks in the project.
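The tables in this section report accuracy, balanced accuracy, macro F1 and weighted F1. As a reminder of how these four numbers relate, the following dependency-free sketch computes them from label lists. It is an illustrative implementation rather than the evaluation code of the thesis; it assumes the set of classes is taken from the true labels, and the function name is hypothetical.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy, balanced accuracy, macro F1 and weighted F1."""
    classes = sorted(set(y_true))
    support = Counter(y_true)  # number of true samples per class
    recalls, f1s = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = support[c] - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn)
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    n = len(y_true)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        # Balanced accuracy is the unweighted mean of per-class recall.
        "balanced_accuracy": sum(recalls) / len(classes),
        "f1_macro": sum(f1s) / len(classes),
        "f1_weighted": sum(f * support[c] for f, c in zip(f1s, classes)) / n,
    }
```

Balanced accuracy and macro F1 give every category the same weight, so they fall below plain accuracy when small categories are classified worse than large ones.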
4.1.1 Audio-Only Configuration
Table 2: Audio-Only Model Performance on PSE8K
Model Type     Accuracy   Balanced Acc   F1 Macro   F1 Weighted
KNN            0.5202     0.4885         0.4554     0.4913
MLP            0.5701     0.5142         0.4942     0.5499
Transformer    0.5393     0.4829         0.4682     0.5245
Standard CA    0.5509     0.5042         0.4845     0.5365
Patch CA       0.5298     0.4818         0.4631     0.5134
Best Audio-Only: MLP (Accuracy = 0.5701)
4.1.2 Text-Only Configuration
Table 3: Text-Only Model Performance on PSE8K
Model Type     Accuracy   Balanced Acc   F1 Macro   F1 Weighted
KNN            0.8431     0.8432         0.8427     0.8410
MLP            0.9097     0.9106         0.9106     0.9092
Transformer    0.9069     0.9079         0.9078     0.9074
Standard CA    0.9042     0.9045         0.9034     0.9036
Patch CA       0.8972     0.8984         0.8976     0.8966
Best Text-Only: MLP (Accuracy = 0.9097)

4.1.3 Multimodal Concatenation
Table 4: Multimodal Concatenation Model Performance on PSE8K
Model Type     Accuracy   Balanced Acc   F1 Macro   F1 Weighted
KNN            0.7965     0.7572         0.7466     0.7854
MLP            0.9002     0.8773         0.8733     0.8941
Transformer    0.9079     0.8844         0.8760     0.9057
Standard CA    0.9021     0.8797         0.8784     0.8963
Patch CA       0.8983     0.8663         0.8647     0.8944
Best Multimodal Concatenation: Transformer (Accuracy = 0.9079)
4.1.4 Multimodal Weighted Fusion (alpha=0.8)
Table 5: Multimodal Weighted Fusion Model Performance on PSE8K
Model Type     Accuracy   Balanced Acc   F1 Macro   F1 Weighted
KNN            0.5739     0.5425         0.5090     0.5486
MLP            0.6833     0.6379         0.6346     0.6699
Transformer    0.6583     0.6064         0.5807     0.6409
Standard CA    0.8848     0.8617         0.8600     0.8776
Patch CA       0.8983     0.8775         0.8760     0.8925
Best Multimodal Weighted Fusion: Patch Cross-Attention (Accuracy = 0.8983)
4.1.5 Model Performance Analysis
Audio-Only Embeddings Among the audio-only models, the MLP achieved
the highest accuracy at 57.01%, with a balanced accuracy of 51.42%, outperforming
the other architectures, including K-Nearest Neighbors (KNN) and the Transformer variants.
The MLP also recorded the leading F1 scores, with a Macro F1 of 49.42% and a
Weighted F1 of 54.99%, indicating its better capability in capturing prominent audio
features from the CLAP embeddings. The KNN model showed the lowest
accuracy, at 52.02%.
Text-Only Embeddings Text-based models demonstrated significantly higher
performance than their audio-only counterparts. The MLP model again achieved the
highest scores, with an accuracy of 90.97%, a balanced accuracy of 91.06%, and Macro
and Weighted F1 scores of about 91%, showcasing the model's ability to learn from
metadata and the textual description features. The Transformer and Cross-Attention
models performed similarly well, with negligible difference compared to the MLP,
achieving accuracies of roughly 90%. The KNN model, while lowest among these,
still reached an accuracy of 84.31%.
Figure 4: Balanced Accuracy across different models
Multimodal Concatenation Combining audio and text embeddings via simple
concatenation improved performance over the audio-only configuration for all models,
while remaining comparable to the text-only results. The Transformer
model topped this group with a 90.79% accuracy and a balanced accuracy of 88.44%,
followed closely by the MLP and Cross-Attention models. The KNN model also showed
gains over audio-only mode, reaching nearly 80% accuracy, possibly indicating some
advantage in using multimodal embeddings.
Multimodal Weighted Fusion (alpha=0.8) When applying weighted fusion
with a dominant audio weight of 0.8, the Cross-Attention models performed best by
a notable margin. Patch Cross-Attention attained an accuracy of 89.83%, a balanced
accuracy of 87.75%, and F1 scores above 87%. The Standard Cross-Attention model
similarly recorded high scores, at around 88% accuracy. The MLP
and Transformer saw their accuracies drop to about 68% and 66%, respectively, compared
to concatenation. KNN was the weakest model in this configuration, at around 57%
accuracy.
Overall, these results demonstrate that the MLP achieved the best accuracy in the
audio-only and text-only configurations, and stayed within one percentage point of the
best model under concatenation, despite its relatively simple architecture compared to
more complex models such as Transformers and Cross-Attention networks. The clear
exception is the weighted multimodal
fusion setting, where the cross-attention models exhibited much stronger performance.
Further details and analysis of this exception are provided in the Discussion section.
4.2 Evaluation of full dataset
The full PSE150K dataset was evaluated using the best-performing MLP model
identified from the subset experiments across different embedding configurations.
The training and evaluation in this section are conducted exclusively with MLP
architectures, leveraging their demonstrated effectiveness. This approach allows us to
focus on the strengths of the MLP model while scaling evaluation to the entire dataset,
providing a comprehensive assessment of model performance on a broader dataset.
The trained MLP models are further used for cross-domain evaluation tasks,
including experiments on the Freesound dataset.
Table 6: PSE150K Model Performance (MLP Architectures)
Model                 Accuracy   Bal Acc   F1 Macro   F1 Weighted
Text-Only             0.9833     0.9796    0.9805     0.9833
Multimodal Concat     0.9838     0.9808    0.9810     0.9837
Multimodal Weighted   0.8344     0.8070    0.8037     0.8335
Audio-Only            0.6746     0.6270    0.6409     0.6720
The performance of the MLP model evaluated across four embedding configurations
on the full PSE150K dataset is presented in Table 6 and visualized in the confusion
matrices shown in Figure 5. The multimodal concatenation configuration yielded the
highest accuracy at 98.38%, with corresponding balanced accuracy and F1 scores
exceeding 98%. The text-only configuration performed almost as well, with an
accuracy of 98.33% and a balanced accuracy of 97.96%. In both these settings, the
confusion matrices (Figure 5b and 5c) display strong diagonal dominance,
indicating near-perfect classification across nearly all categories. However, this
performance may overestimate the model's true effectiveness given the complexity
of the classification task, and
the results should be interpreted with caution.
In contrast, the multimodal weighted fusion configuration, employing an audio
weight of 0.8, resulted in reduced performance, with an accuracy of 83.44% and a
balanced accuracy of 80.70%. The corresponding confusion matrix (Figure 5d) reveals
an increase in off-diagonal errors, suggesting higher misclassification rates compared
to concatenated or text-only embeddings.
The audio-only configuration produced the lowest metrics, recording an accuracy
of 67.46% and a balanced accuracy of 62.70%. Its confusion matrix (Figure 5a)
further illustrates a significant presence of off-diagonal elements, highlighting increased
confusion between categories when relying solely on audio features, which more
accurately reflects the challenges typical of audio classification tasks.
Collectively, these results provide valuable insight into the behavior of the different
embedding modalities. The audio-based CLAP embeddings appear to offer a more
realistic representation of the classification challenges inherent in audio data, while
the text embeddings tend to produce more optimistic performance estimates that
may overstate model capabilities. Building on these observations, the trained models
were applied to cross-domain evaluation on the Freesound dataset to further assess
their generalization capabilities beyond the original training distribution.
Figure 5: Confusion matrices of MLP evaluation for (a) Audio-only, (b) Text-only,
(c) Multimodal concatenated, and (d) Multimodal weighted configurations.

4.3 Evaluation of PSE trained models on Freesound Data
The PSE-trained models were evaluated on two versions of the Freesound dataset: a
preliminary subset and an extended, more comprehensive collection. This evaluation
involved a number of different experiments designed to compare model performance
and provide detailed analysis across different embedding configurations. The aim
was to assess the generalization capability of the trained models when applied to
external, real-world audio datasets with varying characteristics.
4.3.1 Results on Preliminary Freesound Dataset
The evaluation results on the preliminary Freesound-UCS dataset are described
below. The overall model performance across all embedding modalities and fusion
strategies is significantly lower compared to the PSE150K evaluations, reflecting the
increased complexity and variability of the Freesound data.
Among the different configurations, the multimodal concatenation model achieved the
highest accuracy of 16.71%, alongside balanced accuracy and F1 metrics
exceeding 15%. The audio-only model also performed competitively, with an accuracy of
16.02%, marginally below the multimodal concatenation. The multimodal weighted
fusion model yielded a slightly lower accuracy of around 15.5%, and the text-only
model was the lowest at around 12.0%. This preliminary evaluation
provides a valuable baseline for further development of an extended Freesound
dataset with a more diverse sample set, allowing for more comprehensive and reliable
evaluation of model performance.
4.3.2 Results on Extended Freesound Dataset
The extended Freesound UCS dataset used for evaluation comprises approximately
9,400 audio files across the 82 top-level UCS[2] sound categories, with an average
of about 120 samples per category. This dataset offers a diverse and representative
collection of real-world audio samples, making it a challenging benchmark for
large-scale, multi-class sound classification.
The evaluation results shown in Table 7 reveal that model performance on this dataset,
while modest, reflects the inherent complexity and variability within Freesound
samples. Among the four embedding configurations tested, the multimodal weighted
fusion approach achieved the highest accuracy of approximately 28.7%, a balanced
accuracy of around 28%, and a weighted F1 score of about 27.5%. Multimodal
concatenation followed with comparable but slightly lower performance, with an accuracy
of around 26.0%. The audio-only model showed moderate effectiveness with an
accuracy near 23.8%, and the text-only model recorded the lowest metrics, with an accuracy
of about 20.6%.
Table 7: Freesound-UCS Evaluation Results Summary
Model                 Accuracy   Bal Acc   F1 Macro   F1 Weighted
Audio-Only            0.2383     0.2323    0.2209     0.2262
Text-Only             0.2057     0.1974    0.1902     0.1971
Multimodal Concat     0.2602     0.2507    0.2412     0.2484
Multimodal Weighted   0.2872     0.2801    0.2688     0.2751
These results indicate a trend where combining audio and textual features
enhances classification capability despite the dataset's complexity. As shown
by the confusion matrices in Figure 6, all model configurations exhibit substantial
off-diagonal errors, highlighting frequent misclassifications across many of the 82
top-level categories. While the multimodal approaches outperform single-modality
models, the absolute scores remain modest, indicating persistent challenges in achieving
robust generalization.
The relatively low scores across all models point to the major challenge of
generalization when moving from controlled, professional datasets to a diverse and
heterogeneous collection like the Freesound UCS dataset. The rich diversity of
sounds, annotation variability, and domain mismatch between training and testing
data likely contribute to these outcomes.
Overall, these findings provide a comprehensive perspective on model performance
and generalization under realistic automatic sound classification conditions. The
results and visual analysis suggest key directions for future research, particularly in
refining embeddings, improving fusion strategies, and addressing domain adaptation
challenges for large-scale, multi-class sound recognition. Taken together, the results
provide a foundation for interpreting model behavior in sound
classification and the broader implications for cross-domain sound classification.
These aspects are explored in greater detail in the following Discussion
section.
Figure 6: Confusion matrices of evaluation on Freesound dataset (a) Audio-only,
(b) Text-only, (c) Multimodal concatenated, and (d) Multimodal weighted configurations.
4.4 Fine-tuning Results
Fine-tuning consistently improved model performance across all configurations and
training data percentages. The accuracies obtained with the different amounts of
fine-tuning data are shown in Figure 7. With just 10% of the Freesound UCS data, each
model slightly outperformed its original PSE baseline, and further improvements were
observed as more training data was used. The audio-only model accuracy increased
moderately, from 24.3% at 10% data to 25.0% at 50%. For the text-only model,
gains were a bit more pronounced, rising from 21.2% to 23.7%. Multimodal
models achieved the best results overall: the concatenation model rose from 26.8%
to 29.0%, and the weighted multimodal model achieved the highest accuracy, from
29.6% to 30.2%, as training data increased from 10% to 50%.
These findings demonstrate that fine-tuning is especially effective for multimodal
systems and that performance benefits scale with the size of the target dataset.
embeddings from CLAP [34].
CLAP embeddings themselves are inherently multimodal in nature, as CLAP is
trained to capture both acoustic and semantic information. However, the CLAP
[34] embeddings alone gave an accuracy of only 23%, which is lower than that of the
multimodal fusion models. This performance boost likely stems from the additional cues
provided by training on the text metadata, which can supplement audio-derived
representations in cases where textual or acoustic features alone are insufficient.
However, overall the results are still nowhere near ideal for the task of Automatic
Sound Classification, which underscores the challenges in generalizing across sounds when
the data is unstructured, varying in quality and noisy in nature. The confusion
matrices (see Figure 6) highlight that there are still substantial misclassifications,
reflecting high category overlap and the nuances of real-world sounds.
5.4.1 t-SNE Analysis of Embedding Spaces Across Domains
Figure 8 presents a t-SNE visualization comparing PSE (blue) and Freesound (red)
embeddings across audio-only, text-only, multimodal concatenation, and weighted
multimodal configurations. This analysis offers an intuitive window into the
distributional properties and domain overlap of each embedding type.
Audio-only (Top Left). The t-SNE plot shows fairly high overlap and close
clusters between the PSE [44] and Freesound [1] embeddings, with some clear regions
where dataset-specific clusters remain. This suggests that CLAP audio
embeddings capture some similar features, and that acoustic variability between curated (PSE)
and user-generated (Freesound) data produces little separation. The somewhat
intermixed pattern indicates that audio embeddings provide limited domain
generalization, echoing the observed model performance. However, the overlap does not
necessarily indicate that the embeddings align meaningfully with the intended
semantic categories. Rather, it may simply reflect low-level acoustic similarities while
differences in other factors remain underrepresented.

Figure 8: t-SNE analysis of PSE vs Freesound Embeddings
Text-only (Top Right). For text-only embeddings, the separation between the
two datasets is much more pronounced. PSE samples cluster tightly towards the
center, reflecting clean and uniform metadata, while Freesound points are more
distributed, indicative of noise, heterogeneity, and inconsistent metadata. This explains
the significant domain gap and the lower performance with the text-only model.
Multimodal Concatenation (Bottom Left). The multimodal concatenated
space exhibits more overlap than text-only, yet some distinction between domains
persists. Here, the combined cues from audio and text bring the two domains closer,
but simple fusion does not completely align the respective embedding distributions.
Weighted Multimodal Fusion (Bottom Right). The t-SNE plot for weighted
multimodal embeddings (with audio weight α = 0.8) resembles the audio-only
configuration, showing increased mixing of the two datasets. While more domain overlap
is achieved, clear separation remains, signifying that even strong audio fusion cannot
fully bridge domain gaps present in real-world data.
Overall, the t-SNE analysis visualizes the challenges of domain adaptation: only
partial embedding overlap is achieved across configurations, with audio or multimodal
representations providing the greatest (but still incomplete) generalization. Note
that the Freesound vs PSE analysis used only 10,000 PSE
samples out of the full dataset, together with the full Freesound dataset, in order to
maintain comparability.
5.4.2 Category Analysis: Performance Across Sound Classes
To further understand model strengths and weaknesses, a detailed category-wise
analysis was conducted, as depicted in Figures 9 and 10.
Figure 9 presents the accuracy scores of audio-only, text-only, and multimodal
models across all categories. The plot reveals variation in performance between categories
and modalities. Some categories, such as Fireworks, User Interface, and Water,
exhibit strong performance, often showing clear benefits from multimodal integration
(green bars). In contrast, several categories (e.g., Natural Disaster, Destruction,
Archived) show consistently low accuracy across all configurations, suggesting that
both modalities struggle to capture discriminative features for these more
ambiguous classes. Note that the "archived" category in the UCS system is a
special top-level label meant for sound files that are stored or set aside, not
classified by semantic or audio characteristics, which makes it less relevant to the current
study. The audio-only and multimodal fusion configurations show higher
per-category performance across the board, while the text-only embeddings show good
performance only in certain categories like Cartoon and Cloth.
Figure 10 highlights this further by relating multimodal accuracy to the number
of samples per category and annotating categories based on which modality
dominated performance. Categories marked in blue (audio dominant) are those where
acoustic features are the main contributor to correct predictions, while red (text
dominant) points indicate categories where metadata and textual cues give models
an edge. Green points (multimodal benefit) mark classes where fusing
modalities yields the best results. Many categories, especially those with larger sample
sizes, benefit from multimodal approaches, yet a sizable number of classes remain
challenging, regardless of modality or dataset size.
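One simple way to derive such per-category labels from per-category accuracies is sketched below. The decision rule (multimodal must beat the better single modality, otherwise the stronger single modality wins) is an assumption for illustration and may differ from the rule used to produce Figure 10; the function name and the accuracy values in the usage note are invented.

```python
def dominant_modality(acc_by_category, margin=0.0):
    """Label each category as 'multimodal benefit', 'audio dominant'
    or 'text dominant' from its per-modality accuracies.

    acc_by_category maps category -> {"audio": a, "text": t, "multimodal": m}.
    """
    labels = {}
    for category, acc in acc_by_category.items():
        best_single = max(acc["audio"], acc["text"])
        if acc["multimodal"] > best_single + margin:
            labels[category] = "multimodal benefit"
        elif acc["audio"] >= acc["text"]:
            labels[category] = "audio dominant"
        else:
            labels[category] = "text dominant"
    return labels
```

For example, a category with made-up accuracies of 0.5 (audio), 0.3 (text) and 0.6 (multimodal) would be labeled "multimodal benefit", while one with 0.2, 0.4 and 0.35 would be labeled "text dominant".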
Figure 9: Category-wise accuracy of audio-only, text-only, and multimodal models.
Figure 10: Multimodal accuracy vs. sample count by category.
Taken together, these analyses reveal several key trends: (1) classification success
is highly category-dependent, (2) the gains from multimodal fusion
are not uniform and depend on the complementary nature of audio and textual
information per class, and (3) for especially difficult or ambiguous categories, neither
modality nor increased data alone solves the challenge, pointing to the need for
richer representations or targeted data augmentation. This analysis illustrates the
potential and limitations of current multimodal systems, possibly guiding future
work towards more robust strategies.
5.5 Fine-Tuning Analysis
Figure 11: Comparison of Fine-tuned Models Based on Accuracy.
Fine-tuning was done on the PSE-trained models with three subsets of different sizes
taken from the comprehensive Freesound dataset: 10%, 25%, and 50%. Figures 11
and 12 show the impact of fine-tuning the models. As shown in Figure 11, even modest
increments in the amount of data yield measurable improvements over the baseline
performance for all the model configurations. Multimodal concatenation and
text-only models benefit the most, with gains of roughly 2 to 3 percentage points at the
largest fine-tuning split, while audio-only and multimodal weighted models see more modest
but consistent improvement.
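The text does not state how the 10%, 25% and 50% subsets were drawn. One reasonable option, shown here purely as an assumption, is a per-category (stratified) random draw so that every UCS category stays represented; the function name and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(items, fraction, seed=0):
    """Pick about `fraction` of the items from every category.

    `items` is a list of (item_id, category) pairs.
    """
    by_category = defaultdict(list)
    for item_id, category in items:
        by_category[category].append(item_id)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    chosen = []
    for category in sorted(by_category):
        ids = by_category[category]
        n_pick = max(1, round(len(ids) * fraction))  # at least one per category
        chosen.extend(rng.sample(ids, n_pick))
    return chosen
```

Calling the function with fractions 0.10, 0.25 and 0.50 yields three fine-tuning sets of increasing size; the files not selected can serve as the held-out evaluation set for that split.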
Figure 12: Fine-tuning Accuracy Improvement Trends.
The line plot of accuracy gains points to the importance of using slightly larger
fine-tuning sets. All modalities display steadily increasing improvements, with the
largest effect seen for configurations that use "more" textual data. This reinforces
how large the difference in textual metadata is between the two datasets,
PSE[44] and Freesound[1].
This points to the significance of ideally employing both professionally curated
datasets and more varied, real-world datasets for domain adaptation, which will
enable models to generalize better between them and remain robust in practical
applications where data (sounds, in this case) come in all shapes and sizes.
5.6 Future Work
This study highlights key directions for advancing automatic sound classification:
•Improved Embeddings: Current audio and text embeddings capture limited
real-world variability. Future work should focus on developing more robust,
semantically rich representations to enhance generalization, possibly by
experimenting with feature extractors other than CLAP[34].
•Advanced Fusion Techniques: While multimodal fusion improves accuracy,
more dynamic and adaptive fusion methods are needed to better leverage
modality-specific strengths on a per-sample basis.
•Diverse and Realistic Data: Evaluation on heterogeneous datasets like
Freesound UCS is a crucial step in creating robust Automatic Sound
Classification Systems that can also handle well-structured data.
Expanding dataset diversity and quality will help models better reflect
real-world complexities.
•Domain Adaptation and Robustness: Addressing domain mismatch,
annotation noise, and acoustic diversity remains a challenge. Incorporating
domain adaptation, label alignment, and contextual supervision will be
important for scalable sound classification systems.
•Category-aware Modeling: Although a complex and difficult task,
developing category-specific strategies and targeted data augmentation may further
enhance performance on challenging or underrepresented sound classes.
Chap e 6
Conclusion
This thesis set out to investigate the challenges of automatic sound classification
when moving from controlled, professionally curated datasets to the messier
realities of real-world data. On the curated professional dataset, audio embeddings and
audio-text models performed strongly, showing the promise of current architectures.
Yet when applied to the more varied and unpredictable Freesound dataset,
performance dropped noticeably, pointing to the obvious issue of domain gaps. Semantic
audio embeddings showed modest but realistic performance, while multimodal
fusion (audio+text) configurations performed relatively better. By contrast, text-only
embeddings, though overly optimistic on the professional dataset, failed
substantially on Freesound data. To explore this further, fine-tuning on different amounts
of Freesound data showed clear improvements, especially for multimodal models,
highlighting how critical adaptation is when facing real-world variability. Together,
these findings suggest that curated datasets provide a strong foundation, but true
robustness comes from handling noisy, diverse data to better prepare models for the
complexity of everyday audio.
Ultimately, this thesis contributes to the ongoing challenges in sound classification
and emphasizes the need for robust embeddings and domain-aware training
strategies as essential steps toward building scalable, transferable, and reliable Automatic
Sound FX Classification Systems for real-world use.
List of Figures
1 CLAP (Contrastive Language-Audio Pre-training) Architecture
2 UCS Categorization example
3 Category Distribution in Freesound dataset
4 Balanced Accuracy across different models
5 Confusion matrices for MLP evaluation of (a) Audio-only, (b) Text-only, (c) Multimodal concatenated, and (d) Multimodal weighted configurations.
6 Confusion matrices for evaluation on Freesound dataset (a) Audio-only, (b) Text-only, (c) Multimodal concatenated, and (d) Multimodal weighted configurations.
7 Fine-Tuned model accuracies
8 t-SNE analysis of PSE vs Freesound Embeddings
9 Category-wise accuracy for audio-only, text-only, and multimodal models.
10 Multimodal accuracy vs. sample count by category.
11 Comparison of Fine-tuned Models Based on Accuracy.
12 Fine-tuning Accuracy Improvement Trends.
List of Tables
1 Summary of Dataset Statistics for PSE150K and PSE8K
2 Audio-Only Model Performance on PSE8K
3 Text-Only Model Performance on PSE8K
4 Multimodal Concatenation Model Performance on PSE8K
5 Multimodal Weighted Fusion Model Performance on PSE8K
6 PSE150K Model Performance (MLP Architectures)
7 Freesound-UCS Evaluation Results Summary