
Visualization of the output of Sound Event Detection algorithms in Freesound

Author: Marcé Forns, Joaquim
Publisher: Zenodo
DOI: 10.5281/zenodo.17304575
Source: https://zenodo.org/records/17304575/files/Quim_Marce_SMC_2025_Master_Thesis.pdf
Master in Sound and Music Computing
Universitat Pompeu Fabra
Supervisor: Frederic Font
July 2025
Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Structure of the Report
2 State of the art
2.1 Foundations of Sound Event detection
2.1.1 Problem definitions
2.1.2 Model architectures
2.1.3 Evaluation metrics
2.2 Datasets and benchmarks
2.2.1 DCASE Datasets
2.2.2 FSD50K
2.3 Post-processing of Sound Event Detection outputs
2.3.1 General techniques
2.3.2 Learning based post-processing
2.3.3 In the context of Freesound
2.4 User interfaces for sound annotations
2.4.1 General visualization techniques
2.4.2 SED systems user interfaces
2.4.3 Freesound interfaces and tools
3 Methods
3.1 Postprocessing proposals
3.1.1 Empirical observation on FSD50K dataset
3.1.2 Tag recommendation methods applied to SED techniques
3.1.3 Hierarchy filtering for human-readable results
3.1.4 Future optimization proposals
3.2 Visualization of SED in Freesound
3.2.1 General considerations
3.2.2 Visualization drafts
3.2.3 Best user interfaces
3.3 User satisfaction experiment design
3.3.1 Dataset Files
3.3.2 Experiment design
4 Results
4.1 Postprocessing approaches
4.2 Qualitative insights on detections
4.3 User Satisfaction Survey Results
4.3.1 Annotation accuracy
4.3.2 Visualization effectiveness
4.3.3 Qualitative Feedback
5 Discussion
5.1 Postprocessing insights
5.2 Visualization Considerations
5.3 Future work
5.4 Conclusions
List of Figures
List of Tables
Bibliography
A Resources and Code availability
B Filtering results

Dedication
I would like to dedicate this work to birds.
Acknowledgement
I would like to express my sincere gratitude to:
• My supervisor, Frederic Font, for his utmost patience, unlimited curiosity and detailed attention and support over the past months.
• My family, for always being supportive.
Chapter 2
State of the art
This chapter reviews the fundamentals and latest advancements in the literature relevant to this work. Given that the scope of this thesis involves two main topics, the post-processing of the outputs of sound event detection algorithms and their visualization and user interface design, they are treated separately.
2.1 Foundations of Sound Event detection
2.1.1 Problem definitions
Sound event detection (SED) is a rapidly evolving subfield of the computational analysis of sound scenes that involves the automatic identification and classification of acoustic events in real-world audio recordings. SED aims not only at recognizing the presence of sound events, i.e. sounds produced by a single source (a dog barking, a car passing by, etc.), but also at localizing these events in the temporal domain. This subsection introduces the core problem formulations involved in sound event detection: audio tagging versus event detection and segmentation, and the distinction between monophonic and polyphonic detection.
Audio Tagging vs Event Detection
Sound scene analysis research is fundamentally divided into audio tagging and event detection. On the one hand, audio tagging refers to the task of assigning one or more labels to an audio clip to determine which classes are present, without specifying any temporal information. This is usually treated as a multi-label classification task and is appropriate for weakly labeled datasets, where annotations are provided for each audio clip as a whole. Common benchmarks such as AudioSet [1] and FSD50K [2] provide large-scale data for this goal. On the other hand, sound event detection requires not only identifying which classes are present in an audio clip but also delimiting their time boundaries, i.e. predicting the onset and offset of each event along the temporal domain. This requirement transforms the problem into a localization task evaluated using event-based metrics such as F1-Score and error rate (ER) [3]. Therefore, event detection is a more demanding and complex task than tagging, especially when dealing with real-world auditory scenes, where multiple events tend to overlap.
Monophonic vs Polyphonic Detections
Another division in sound event detection refers to the number of sounds present at a given time in an audio clip. In monophonic environments, only one sound event is assumed to occur at a time, which simplifies the modeling process. However, as mentioned above, many real-world recordings contain overlapping sounds. One just needs to think of a domestic environment such as a kitchen, where it is practically impossible for sound events not to overlap (cooking, speech, home appliances, etc.). Polyphonic sound event detection addresses this complexity by assuming that multiple sound events overlap both in time and in frequency [4]. This task is particularly important for datasets like DESED and FSD50K, which are relevant to this research and will be inspected in more detail later in this thesis.
Weakly vs Strongly Labeled Data
The annotation quality of the available datasets tends to influence the modeling strategy chosen in each case. Annotations can be separated into weak and strong labels. Strong labels include precise timestamps for each sound event, which facilitates supervised learning of temporal localization, but their main drawback is that they are quite difficult to obtain. By contrast, weak labels only contain clip-level information regarding the classes present in each clip, a feature that makes them much easier to collect but at the same time poses a significant challenge for learning good temporal patterns. Recent advances in Multiple Instance Learning (MIL) and attention mechanisms have attempted to fill this gap [5].
2.1.2 Model architectures
The performance of SED systems is highly dependent on the model architectures they use. In recent years, Convolutional Neural Networks (CNNs) and their recurrent extensions have become a popular approach due to their ability to learn robust audio features and representations from spectrogram inputs. This section provides a brief overview of important models used in SED, focusing specifically on the FSD-SINet model, which is crucial to the Freesound analysis pipeline.
Convolutional Neural Networks and CRNNs
CNNs form the main foundation of many SED systems by combining local time-frequency feature extraction with global feature integration. Their capacity for translation invariance enables them to detect important audio characteristics independently of slight temporal or spectral shifts [4]. To better model temporal dependencies across frames, CNNs are usually combined with Recurrent Neural Networks, resulting in Convolutional Recurrent Neural Networks (CRNNs) that can capture a broader context [6]. CRNNs have demonstrated strong performance in polyphonic SED benchmarks such as the DCASE challenges.
The FSD-SINet Model: Increasing Shift-Invariance
The FSD-SINet model developed by Fonseca and Serra [7] represents a notable improvement for the Freesound analysis pipeline. It addresses imperfect shift-invariance, which is the main limitation of standard CNNs in SED. Standard convolutions can be sensitive to small shifts in the input signal, causing detection issues. FSD-SINet, however, integrates Shift-Invariant Convolutions (SINets) that augment the conventional CNN layers to improve robustness to problematic shifts, stabilizing the feature extraction process. The resulting predictions of this model provide improved generalization on the weakly labeled dataset FSD50K.
2.1.3 Evaluation metrics
Evaluating SED systems requires choosing metrics that can properly assess both sound event classification and the detection of the temporal boundaries of each event.
F1 Score
The F1 Score is still one of the most popular metrics used for SED evaluation. It is the harmonic mean of precision and recall, and therefore balances and penalizes both false positives and false negatives. The two most frequently used variants are:
• F1-Macro: This metric computes the F1 score independently for each class and averages the results, so it gives the same importance to all classes regardless of their frequency in the dataset.
• Per-class F1: This metric reports the F1 score of each class separately, so that the strengths and weaknesses of the model are visible for each category and improvements can be targeted.
F1-based metrics usually treat sound events as binary occurrences in segments of fixed length, which may not take temporal localization accuracy into account.
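As a minimal sketch of the two variants above, the following Python snippet computes per-class F1 and F1-Macro from clip-level (or segment-level) label sets. The function names and the toy labels are illustrative assumptions, not part of the thesis code.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_class_f1(y_true, y_pred, classes):
    """y_true / y_pred hold one set of class labels per clip (or per fixed-length segment)."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        scores[c] = f1_score(tp, fp, fn)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores.values()) / len(scores)


# Toy example (made-up labels, for illustration only).
truth = [{"Dog"}, {"Dog", "Speech"}, {"Speech"}, set()]
pred = [{"Dog"}, {"Speech"}, {"Speech", "Dog"}, set()]
scores = per_class_f1(truth, pred, ["Dog", "Speech"])
print(scores, macro_f1(scores))
```

Because the macro average is unweighted, a rare class that is detected poorly lowers the score as much as a frequent one.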
Event-based metrics
Apart from segment-level metrics such as the F1 Scores mentioned above, SED evaluation often uses event-based metrics for the sake of temporal accuracy. These metrics compare the time boundaries of the predicted events against the ground truth annotations [8].
• Event-based F1 Score and ER: Proposed in the DCASE challenges, these metrics evaluate the accuracy of detections based on onsets and offsets, handling missed, false and overlapping events.
• Polyphonic Sound Detection Score (PSDS): This metric has been introduced recently to provide better handling of polyphonic and overlapping events. PSDS integrates both segment-level and event-level detections, offering increased robustness for real-world applications.
Although PSDS provides richer evaluation results, F1-Macro and per-class F1 remain standard and popular choices for benchmarking and comparative analysis in many studies of SED systems.
Mean Average Precision
mAP is a widely used clip-based metric in multi-label classification and detection tasks. It averages the average precision (AP) across all classes, and it is especially helpful for evaluating how well a model ranks classes over entire audio clips, which is a common practice in weakly labeled sound event detection tasks [7].
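The ranking-based nature of mAP can be illustrated with a short sketch. It assumes the common definition of AP as the mean of the precision values at the rank of each positive clip; the function names and the toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive clip."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """per_class maps a class name to (clip scores, clip labels)."""
    aps = [average_precision(s, l) for s, l in per_class.values()]
    return sum(aps) / len(aps)


# Toy clip-level scores for two classes over four clips (made-up values).
per_class = {
    "Dog": ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),
    "Speech": ([0.2, 0.7, 0.6, 0.1], [0, 1, 1, 0]),
}
ap_dog = average_precision(*per_class["Dog"])
m_ap = mean_average_precision(per_class)
```

In this toy case the "Speech" ranking is perfect (AP of 1), while "Dog" is penalized because a negative clip is ranked above one of its positives.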
2.2 Datasets and benchmarks
2.2.1 DCASE Datasets
The development and benchmarking of sound event detection systems, as in other related fields of study, rely on standardized datasets that provide annotated audio for both training and evaluation. Over the past decade, the SED community has converged around a set of datasets that contain diverse acoustic environments, scenes and events, framed within the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge.
DCASE Challenge Datasets
The DCASE series, held annually since 2013, has played a fundamental role in establishing common benchmarks for SED and other related tasks. These challenges have introduced several datasets focused on different aspects of audio scene analysis, such as audio scene classification, single- and multi-channel event detection, and weakly supervised learning [8]. A particularly influential subset is DCASE Task 4, which addresses polyphonic sound event detection under weakly labeled (or unlabeled) conditions, a good approach for dealing with real-world situations and constraints. In the context of DCASE Task 4, the DESED (Domestic Environment Sound Event Detection) dataset has been notably important. This dataset was developed to simulate realistic domestic acoustic environments while remaining controllable and reproducible. DESED contains real and synthetic recordings organized as seen in Figure 1:
• The synthetic subset is generated by mixing isolated sound events into diverse soundscapes. It provides strong temporal annotations, making it a very good fit for supervised learning and for evaluating temporal precision.
• The real subset includes unlabeled, weakly labeled, and strongly labeled data, separated into training and validation for the development set, plus the public evaluation set.
This dual composition makes it possible to test models under different learning conditions and has become a standard benchmark for evaluating models.
Figure 1: DESED dataset composition overview
2.2.2 FSD50K
FSD50K is a large-scale open dataset designed for multi-label sound classification systems and audio tagging, with a strong emphasis on diversity, realism and scalability. Released in 2020 by Fonseca et al., it is extracted from the Freesound platform and can be used as a critical benchmark for audio classification models trained with real-world sound content [2].
Dataset composition
FSD50K consists of over 51,000 audio clips extracted from Freesound. These clips are labeled using a subset of the AudioSet ontology [1] comprising 200 unbalanced sound classes. The labels are defined either by sound source (e.g. human voice) or by means of production (e.g. bang), and can be sorted hierarchically according to the ontology. The dataset is split into development and evaluation sets, with weak labels, metadata, and clip lengths ranging from 0.3 to 30 seconds, reflecting the variability and unpredictability of real-world acoustic scenarios. It is important to highlight that all audio files are drawn from an open domain, and that the category annotations of FSD50K were validated through a dedicated annotation tool with over 600 contributors.
2.3 Post-processing of Sound Event Detection outputs
2.3.1 General techniques
Post-processing plays a crucial role in transforming raw outputs from SED models into human-readable data, so that the event detections can be properly interpreted by end users. This section reviews the most popular techniques, namely thresholding and median filtering.
Thresholding
The most basic yet common post-processing method is probability thresholding. Given that SED models tend to output a matrix of frame-wise class probabilities, either a fixed or a class-dependent threshold can be applied to decide whether the confidence of a class is high enough for the sound event to be considered active at a given frame. This binarization converts the model output into hard decisions, from which the time boundaries of each sound event follow. While it is a straightforward technique, using a global threshold (e.g., 0.5) may underperform for unbalanced sets or when the confidence tends to vary a lot within a single class. For this reason, class-dependent thresholding is often a reasonable approach, with the threshold of each class tuned on validation data.
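The difference between a global and a class-dependent threshold can be sketched as follows. The helper `binarize`, the class names, the probabilities and the tuned value 0.35 are all illustrative assumptions.

```python
def binarize(probs, thresholds, default=0.5):
    """probs: {class_name: [frame probabilities]}.
    thresholds: {class_name: threshold}; classes not listed use `default`."""
    return {
        name: [1 if p > thresholds.get(name, default) else 0 for p in frames]
        for name, frames in probs.items()
    }


probs = {
    "Applause": [0.20, 0.55, 0.60, 0.30],
    "Boiling": [0.40, 0.45, 0.48, 0.10],
}

# Global threshold of 0.5 for every class: "Boiling" is never detected.
global_decision = binarize(probs, {})

# Hypothetical per-class threshold tuned on validation data.
class_decision = binarize(probs, {"Boiling": 0.35})
```

With the global threshold the low-confidence class is silenced entirely, whereas the per-class value recovers its first three frames without changing the decisions for the other class.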
Median filtering
Another popular method for smoothing sound event detection outputs is median filtering. This approach provides better handling of sudden isolated detections or gaps, as seen in Figure 2. Median filtering is applied to the binary predictions over time, after thresholding, using a sliding window of odd size that typically spans from 3 to 11 frames. The kernel size of the median filter is sometimes determined empirically, depending on the expected behavior of the sound classes involved. In some implementations, class-dependent median filtering is used to better match the temporal behavior of the different classes considered.
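A minimal pure-Python sketch of median filtering on one class's binary frame sequence is given below. Padding the edges by repeating the first and last values is an assumption made here for simplicity; library implementations may pad differently (for example with zeros).

```python
from statistics import median


def median_filter(binary, kernel_size):
    """Sliding-window median over a 0/1 sequence.
    Edges are padded by repeating the first and last values (an assumed convention)."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    if not binary:
        return []
    half = kernel_size // 2
    padded = [binary[0]] * half + list(binary) + [binary[-1]] * half
    return [int(median(padded[i:i + kernel_size])) for i in range(len(binary))]


raw = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0]
smoothed = median_filter(raw, 3)
print(smoothed)  # -> [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
```

With a kernel of 3 frames, the isolated detection at index 2 is removed and the one-frame gap at index 7 is filled, which merges two short detections into a single event.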
Figure 2: Real (green) vs. predicted (orange) boundaries of an audio event before (top) and after (bottom) median filtering.
Although these are the two most popular approaches to sound event detection post-processing, there are other auxiliary steps, such as enforcing a minimum event duration or merging temporally adjacent events, that can also be helpful.
2.3.2 Learning based post-processing
Even though conventional post-processing techniques such as thresholding and median filtering are effective for improving the predictions of SED models, they sometimes fail to capture the complexity of real-world sound scenes. To address this limitation, recent research has explored more sophisticated post-processing methods. A notable example is the approach proposed by Giannakopoulos et al. (2022), in which Reinforcement Learning is used to optimize the post-processing pipeline of an SED model. In this context, a learning agent learns to configure the parameters of the thresholding and median filtering operations so as to optimize evaluation metrics such as F1 score or error rate (ER) [9]. More specifically, it learns per-class values for the thresholds and the median filtering window sizes. The agent is trained using Policy Gradient methods to find the optimal configuration. This approach has shown improvements over manual tuning in experiments on DCASE datasets, highlighting the power of adaptive, data-driven post-processing. As the SED field evolves, the use of learning-based models in the post-processing pipeline is a promising direction for future work.
2.3.3 In the context of Freesound
The Freesound analysis pipeline is responsible for extracting certain properties from sounds using different analyzers and storing the results in the database. Within this pipeline, the main analyzer regarding SED is FSD-SINet [7], which is why this work focuses on its results for post-processing and later visualization. However, other notable analyzers and models in the current pipeline are worth mentioning:
• YAMNet model: a pre-trained deep net that predicts 521 audio event classes from the AudioSet ontology.
• BirdNET Analyzer, which has become especially popular among the community of ornithologists due to its ability to identify specific species of birds by their calls and songs [10].
Postprocessing of FSD-SINet analyzer
The current postprocessing pipeline of this model is quite straightforward and is one of the main areas of improvement targeted by this work. A static threshold, currently set to 0.7, is used to filter the confidences generated by the FSD-SINet model. After this step, time boundaries are defined following the 0.5-second hop size that divides the audio clip into frames. By checking the onsets and offsets of each detection, consecutive events are assembled into a single event detection when the offset of one matches the onset of the next.
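The steps described above can be sketched as follows. This is not the Freesound implementation: the function name, the event dictionary layout, the strict comparison against the threshold, and keeping the maximum confidence of the merged frames are all assumptions made for illustration. Only the 0.7 threshold and the 0.5-second hop come from the text.

```python
HOP_SECONDS = 0.5  # frame hop stated in the text
THRESHOLD = 0.7    # static threshold stated in the text


def frames_to_events(confidences, class_name, threshold=THRESHOLD, hop=HOP_SECONDS):
    """Turn the per-frame confidences of one class into merged events."""
    events = []
    for index, conf in enumerate(confidences):
        if conf <= threshold:
            continue  # frame filtered out by the static threshold
        start, end = index * hop, (index + 1) * hop
        if events and events[-1]["end"] == start:
            # The previous offset matches this onset: extend the same event.
            events[-1]["end"] = end
            # Assumed aggregation: keep the highest confidence of the merged frames.
            events[-1]["confidence"] = max(events[-1]["confidence"], conf)
        else:
            events.append(
                {"name": class_name, "start": start, "end": end, "confidence": conf}
            )
    return events


events = frames_to_events([0.1, 0.8, 0.9, 0.2, 0.75], "Dog")
```

For this input, frames 1 and 2 are merged into one event from 0.5 s to 1.5 s, and frame 4 yields a separate event from 2.0 s to 2.5 s.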
As discussed in the previous section, this postprocessing approach can be improved by making it class-dependent and by including popular methods such as median filtering. It is important to mention that the Freesound database contains not only the sound event detections resulting from this analysis but also the raw data of the top-10 detected classes of each analyzed audio clip.
Chapter 3
Methods
3.1.1 Empirical observation on FSD50K dataset
Regarding the use of a static threshold for the hard decision-making process on the output probabilities, the option of using specific, per-class thresholds arises. Given the diversity of both the Freesound audio collection and the set of classes present in the FSD50K dataset, it is reasonable to expect that an adaptive threshold will improve the global performance of the analyzer. For example, the distinctiveness of a sound event labeled as applause intuitively calls for a less strict threshold than more ambiguous labels such as boiling or gurgling. Apart from obtaining an adaptive threshold per class, the lack of a median filtering step in the post-processing raises the question of what could be done in this respect too. Following the previous idea, setting a specific window size per class helps smooth the output detections, given that in the temporal domain there is also a wide range of behaviors depending on the class (compare, for example, Thunder and Thunderstorm).
In order to obtain adaptive parameters for both operations, a simple observational approach is proposed. It consists of taking a set of different kernel sizes and computing thresholds with different restriction parameters per class and audio clip. By postprocessing the evaluation set with each group of variables and comparing the results to the ground-truth annotations, the best threshold and kernel size can be selected for each class.
Threshold computations
To determine class-specific decision thresholds, a multi-stage computation process is employed. For each raw analysis file and corresponding class, an array of frame-level thresholds is extracted based on the classes marked as active in the ground truth annotations.
Once this per-frame threshold progression is obtained for a given class and audio clip, a segmentation process is applied. Within each segment, the mean and standard deviation of the activation scores are computed and combined to form a set of representative thresholds:

\[
\text{frameThresholds} = \{\, \operatorname{avg}(s) + \operatorname{std}(s) \mid s \in \text{segments} \,\}
\]
A file-level threshold is then computed by averaging these frame-level values:
\[
\text{fileThreshold} = \operatorname{avg}(\text{frameThresholds},\ \text{weighted} = \text{False})
\]
At this stage, both a simple average and an average weighted by the length of the active segments are considered. The outcome of this process is a set of N file-level thresholds for each class, where N denotes the number of analysis files available in the dataset.
To aggregate these into a single, representative threshold per class, a percentile-based reduction is applied across the N values. This enables the derivation of class-specific thresholds with varying degrees of restrictiveness.
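The computation described in this subsection can be sketched in a few lines of Python. Several details are assumptions, since the text does not fix them: segments are taken as already extracted lists of activation scores, the population standard deviation is used, and the percentile uses linear interpolation. All function names are hypothetical.

```python
from statistics import mean, pstdev


def file_threshold(segments, weighted=False):
    """Average of avg(s) + std(s) over the segments of one file and class.
    With weighted=True each segment counts proportionally to its length."""
    values = [mean(s) + pstdev(s) for s in segments]  # population std: an assumption
    if not weighted:
        return mean(values)
    lengths = [len(s) for s in segments]
    return sum(v * n for v, n in zip(values, lengths)) / sum(lengths)


def percentile(values, q):
    """Percentile of a non-empty list with linear interpolation (q in 0..100).
    The interpolation convention is an assumption."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


# One file with two active segments for a given class (made-up scores).
segments = [[0.6, 0.8], [0.9]]
simple = file_threshold(segments)
weighted = file_threshold(segments, weighted=True)

# Reduction across the N file-level thresholds of one class (made-up values).
class_threshold = percentile([0.5, 0.7, 0.9, 0.6], 50)
```

A higher percentile yields a stricter class threshold, which is how the different degrees of restrictiveness mentioned above are obtained.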
Postprocessing subsets
At this point of the postprocessing pipeline, a total of 2P threshold sets are available, where P denotes the number of percentiles considered in the previous step. Each threshold is computed using both the weighted and the simple averaging approach, resulting in two variants per percentile.
In order to determine the optimal combination of threshold and median filter kernel size per class, a binary prediction matrix must be obtained for every file and threshold set. However, the raw activation outputs provided by the analysis files only contain scores for the top 10 detected classes per frame. To reconstruct the complete activation matrix, missing frame-class combinations are assigned a value of 0.
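The zero-filling step can be sketched as follows. The input layout (one dictionary of class scores per frame) is a hypothetical stand-in for the raw analysis files, whose actual format is not described here.

```python
def dense_activations(frames, classes):
    """frames: one {class_name: score} dict per frame, holding only the
    top-scoring classes of that frame (hypothetical layout).
    Returns {class_name: [score per frame]} with 0.0 for missing pairs."""
    return {c: [frame.get(c, 0.0) for frame in frames] for c in classes}


frames = [{"Dog": 0.9, "Bark": 0.8}, {"Dog": 0.7}, {"Speech": 0.6}]
matrix = dense_activations(frames, ["Bark", "Dog", "Speech"])
```

Since a class missing from the top 10 of a frame scored lower than every class that was kept, treating it as 0 never turns an inactive frame into an active one after thresholding.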
Given the reconstructed activation matrix A, predictions are computed for each threshold set T, kernel size k ∈ K, and class i ∈ C as follows:
\[
\text{filePredictions}_{k,T} = \operatorname{MedianFilter}_k\big(\{\, \mathbb{1}[a_i > T_i] \mid i \in C,\ a \in A \,\}\big), \quad \forall k \in K
\]
This procedure results in 2 · P · |K| prediction sets per file, where |K| is the number of kernel sizes. In the current configuration, the set of kernel sizes is K = {1, 3, 5, 7} and the percentiles considered are {50, 60, 70, 80} (P = 4), resulting in 2 · 4 · 4 = 32 different postprocessing configurations.
Finally, for each class, the combination of threshold and kernel size that maximizes the F1-score on validation data is selected and recorded in the final postprocessing set.
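As a rough sketch of the selection step, the grid of configurations and the per-class choice can be written as below. The configuration tuples, the function name and the F1 values in the table are made up purely for illustration and are not results from this work.

```python
from itertools import product

AVERAGING = ("simple", "weighted")
PERCENTILES = (50, 60, 70, 80)
KERNEL_SIZES = (1, 3, 5, 7)

# Every (averaging, percentile, kernel size) combination: 2 * 4 * 4 = 32.
CONFIGS = list(product(AVERAGING, PERCENTILES, KERNEL_SIZES))


def best_config_per_class(f1_table):
    """f1_table: {class_name: {config: F1 on validation data}}.
    Returns the configuration with the highest F1 for each class."""
    return {name: max(scores, key=scores.get) for name, scores in f1_table.items()}


# Illustrative numbers only, restricted to two configurations per class.
f1_table = {
    "Applause": {("simple", 50, 3): 0.81, ("weighted", 70, 5): 0.77},
    "Boiling": {("simple", 50, 3): 0.42, ("weighted", 70, 5): 0.55},
}
selected = best_config_per_class(f1_table)
```

Because the choice is made independently per class, the final postprocessing set may mix averaging modes, percentiles and kernel sizes across classes.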
3.1.2 Tag recommendation methods applied to SED techniques
Sound Event Detection (SED) systems operate across a wide variety of sounds, each characterized by different levels of polyphony, ambient conditions and recording equipment. Given this diversity, the per-class parameter selection methods described in the previous section may lack scalability and generalizability, especially when applied to large-scale, heterogeneous audio collections such as Freesound.
To address this limitation, it becomes important to think of adaptive strategies that do not rely on statistical observations from other sounds. Since the ground truth annotations provide only weak labels, the SED task can, in this context, be approached similarly to an audio tagging problem.
In this field of study, the tag selection methods for recommendation purposes proposed by Font, 2015 [15] are a solid baseline. Among these, the most effective method described in Section 3.2.3 ("Selection of tags to recommend") is the percentage strategy, which involves selecting tags whose scores surpass a fixed percentage of the highest tag score.
Transferring this approach to the current SED task introduces certain considerations. This tag recommendation method is designed to always output at least one tag, and this assumption does not hold in SED, where many frames can contain no active sound events. Therefore, a class is only considered active at a given frame if its confidence score surpasses both a static threshold (e.g., 0.45) and 80% of the highest confidence score observed among all classes for that frame.
\[
\text{ActiveClass}(c, t) =
\begin{cases}
1 & \text{if } a_{c,t} > \tau_{\min} \text{ and } a_{c,t} > \alpha \cdot \max_{j \in C} a_{j,t} \\
0 & \text{otherwise}
\end{cases}
\]
where $a_{c,t}$ is the activation score of class $c$ at frame $t$, $\tau_{\min}$ is a fixed minimum threshold (0.45), $\alpha$ is the relative threshold parameter (0.8), and $C$ is the set of all classes.
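The rule above translates almost directly into code. The following sketch applies it to a single frame; the function name and the example scores are hypothetical, while the default values 0.45 and 0.8 are the ones given in the text.

```python
TAU_MIN = 0.45  # fixed minimum threshold
ALPHA = 0.8     # relative threshold parameter


def active_classes(frame_scores, tau_min=TAU_MIN, alpha=ALPHA):
    """frame_scores: {class_name: activation} for one frame.
    A class is active if it exceeds both tau_min and alpha times the top score."""
    if not frame_scores:
        return set()
    top = max(frame_scores.values())
    return {c for c, a in frame_scores.items() if a > tau_min and a > alpha * top}


busy_frame = active_classes({"Dog": 0.9, "Bark": 0.75, "Speech": 0.5})
quiet_frame = active_classes({"Dog": 0.4, "Wind": 0.35})
```

In the first frame "Speech" passes the absolute threshold but falls below 80% of the top score, so only "Dog" and "Bark" are kept. In the second frame nothing exceeds the absolute threshold, so no class is active, which is the behavior that the plain percentage strategy cannot produce.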
3.1.3 Hierarchy filtering for human-readable results
The current output format of the FSD-SINet model is structured as a dictionary containing the following elements:
• A list of dictionaries with event detection information, including class name, time boundaries, and confidence score.
• A list of all unique classes detected within the audio file.
• A list of embeddings resulting from the audio analysis.
While this structure provides the essential data needed to build a visualization system for Freesound users, two important considerations must be addressed to improve the human readability and interpretability of the output.
On the one hand, it is important to acknowledge that the vocabulary used by the FSD50K dataset is derived from a subset of the AudioSet Ontology, which introduces a valuable but currently overlooked detail: class hierarchy. Extracting the hierarchical level of each class can significantly help organize the presentation of sound events during the visualization stage, for a more intuitive and informative user experience. Note that, as the ground truth vocabulary does not include some 0-level classes (e.g. Sounds of things), the levels of all their child classes will be pushed up.
On the other hand, the current system seeks an integrated display of SED data within the sound players of the Freesound platform, which imposes certain limitations on screen space. To meet this requirement, avoiding redundancy in the information display is crucial. Therefore, when multiple classes that share the same time boundaries, within a tolerance of 1 frame (0.5 seconds), are found to have a parent-child relationship in the ontology, they are merged into a single one. In such cases, the child class is selected for display because it describes the detected sound event more specifically, as seen in Figures 8 and 9. The code implementing the postprocessing and filtering of FSD-SINet results can be found in Appendix A.
Figure 8: Draft display of raw detections given by the current postprocessing of the FSD-SINet model.
Figure 9: Draft display of hierarchically filtered detections.
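The parent-child merging described in this subsection can be sketched as below. This is not the code from Appendix A: the function names, the event layout and the `parent_of` mapping (a small slice of the ontology) are assumptions, and only direct parent-child pairs are handled.

```python
FRAME_SECONDS = 0.5  # tolerance of one frame, as stated in the text


def same_boundaries(a, b, tolerance=FRAME_SECONDS):
    """True when both onsets and both offsets differ by at most one frame."""
    return (
        abs(a["start"] - b["start"]) <= tolerance
        and abs(a["end"] - b["end"]) <= tolerance
    )


def filter_hierarchy(events, parent_of):
    """Drop an event when a direct child class has (almost) the same boundaries.
    parent_of: {child_class: parent_class}, a hypothetical ontology excerpt."""
    kept = []
    for event in events:
        redundant = any(
            parent_of.get(other["name"]) == event["name"]
            and same_boundaries(event, other)
            for other in events
        )
        if not redundant:
            kept.append(event)
    return kept


events = [
    {"name": "Dog", "start": 1.0, "end": 3.0},
    {"name": "Bark", "start": 1.0, "end": 2.5},
    {"name": "Speech", "start": 0.0, "end": 1.0},
]
filtered = filter_hierarchy(events, {"Bark": "Dog"})
```

Here the "Dog" detection is dropped in favor of the more specific "Bark", while the unrelated "Speech" event is left untouched.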
3.1.4 Future optimization proposals
Upon reviewing the previously discussed postprocessing strategies, it becomes clear that none of the presented implementations incorporates strong optimization techniques or learning-based approaches to refine the postprocessing parameters with the goal of improving evaluation metrics. This highlights a methodological gap, considering that optimization-driven methods are more aligned with current research trends in sound event detection.
As seen in Section 2.3.2, the reinforcement learning (RL) approach proposed by Giannakopoulos et al. (2022) [9] introduces a framework capable of optimizing the entire postprocessing pipeline of an SED model such as FSD-SINet. While promising, this approach presents several challenges that must be addressed. The most notable one is the requirement of strongly labeled annotations, something the FSD50K dataset does not provide.
Nevertheless, as mentioned in Section 2.2.1, the DESED dataset includes a subset of recordings with strong labels. Although these annotations are limited to domestic sound environments, several of the classes can be mapped to the FSD50K taxonomy. This opens the possibility of training the RL-based optimization system on the strongly labeled DESED subset and transferring the resulting parameters to the corresponding class subset in FSD50K. A proposed mapping between both subsets can be found below, in Table 1. Note that matching classes is challenging because of the ambiguity in resolving specific classes from one set to the other.
Table 1: Mapping between DESED classes and FSD50K classes
DESED class                        FSD-SINet class
Speech (0)                         Child_speech_and_kid_speaking (33)
                                   Female_speech_and_woman_speaking (75)
                                   Male_speech_and_man_speaking (111)
                                   Speech (158)
                                   Human_voice (101)
                                   Chatter (29)
                                   Conversation (43)
Dog (1)                            Dog (59)
                                   Bark (7)
                                   Domestic_animals_and_pets (60)
Cat (2)                            Domestic_animals_and_pets (60)
                                   Cat (28)
                                   Meow (116)
Alarm/bell/ringing (3)             Alarm (4)
                                   Bell (11)
                                   Ringtone (140)
                                   Bicycle_bell (13)
                                   Church_bell (38)
                                   Cowbell (45)
                                   Doorbell (63)
Dishes (4)                         Dishes_and_pots_and_pans (58)
Frying (5)                         Frying_(food) (83)
Blender (6)                        Domestic_sounds_and_home_sounds (61)
Running water (7)                  Water (187)
                                   Water_tap_and_faucet (188)
                                   Bathtub_(filling_or_washing) (10)
                                   Fill_(with_liquid) (76)
                                   Sink_(filling_or_washing) (151)
                                   Toilet_flush (175)
Vacuum cleaner (8)                 Domestic_sounds_and_home_sounds (61)
Electric shaver / toothbrush (9)   Domestic_sounds_and_home_sounds (61)
3.2 Visualization of SED in Freesound
This section presents all procedures regarding the development and implementation of sound event detection displays in Freesound, from general considerations that apply to all upcoming paradigms, to presenting draft visualizations and implementing the best candidates, and finally designing how the user satisfaction evaluation will be performed.
3.2.1 General considerations
This subsection outlines the development and implementation process for displaying sound event detection (SED) results from the FSD-SINet model within the Freesound platform. It begins with general considerations that apply across all subsequent visualization paradigms, followed by the presentation of draft display proposals, the implementation of selected candidates, and the definition of the methodology to evaluate user satisfaction.
Data features
The visualization of FSD-SINet results for a given audio clip will include the following elements:
• The class name of the detected event.
• The time boundaries indicating the temporal location of the event.
• The confidence score associated with the detection.
• The hierarchical level of the class within the ontology.
These features can be grouped based on their relevance to the user interface. On the one hand, class names and time boundaries carry the most critical information and should therefore be prioritized in any visualization approach. On the other hand, the confidence score and hierarchy level, while not crucial, can improve user understanding and interpretation of the detection results. Note that a common property across all displays is that color coding is used for class identification, accompanied by a legend appended below the player container. The set of colors is chosen to avoid interference with the default colors of the Freesound waveform display.
As a result, the specific format and encoding of the information may vary between visualization proposals, with each one presenting these parameters using different visual strategies depending on its intended use and emphasis.
Finally, it is important to mention that the displayed results correspond to the current postprocessing pipeline of the FSD-SINet model, with the addition of the ontology filter process described in Section 3.1.3.
User interaction
Similarly to the data features described above, the proposed visualization
paradigms also share a set of common user interaction functionalities. These shared
properties are designed to ensure intuitive and efficient interaction with detected
events.
• Mouse click interaction: Clicking on a displayed sound event will automatically
move the progress bar to the event's start time and begin playback from that
point. This feature is designed to facilitate user evaluation and to enable
quick verification of detected events.
• Mouse hover functionality: When hovering over a visualized event, a label will
appear showing key details such as the event's class name, confidence score and
time boundaries. This ensures that users can access detailed information without
cluttering the main interface.
• Overlay display behavior: By default, SED visualizations will not be shown for
analyzed sounds. Instead, a toggle button integrated into the audio player's
control bar will allow users to show or hide a visual overlay containing the
detection information. This layered approach aims to maintain a clean interface
while still providing access to the analysis results.
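The three shared behaviors can be summarized as state changes on the player. The sketch below models them in plain Python, independently of any front-end framework; PlayerState, Event, hover_label and the label format are hypothetical names and choices introduced for illustration, not the Freesound player implementation.

```python
from typing import NamedTuple


class Event(NamedTuple):
    class_name: str
    start: float       # seconds
    end: float         # seconds
    confidence: float  # assumed to lie in [0, 1]


def hover_label(event: Event) -> str:
    """Text of the hover label: class name, confidence and time boundaries."""
    return (f"{event.class_name} | {event.confidence:.0%} | "
            f"{event.start:.2f}s - {event.end:.2f}s")


class PlayerState:
    """Minimal model of the player state touched by the interactions."""

    def __init__(self) -> None:
        self.overlay_visible = False  # detections are hidden by default
        self.position = 0.0           # progress bar position, in seconds
        self.playing = False

    def toggle_overlay(self) -> None:
        """Toggle button in the control bar: show or hide the overlay."""
        self.overlay_visible = not self.overlay_visible

    def click(self, event: Event) -> None:
        """Clicking an event seeks to its start time and starts playback."""
        self.position = event.start
        self.playing = True


player = PlayerState()
bird = Event("Bird", 1.5, 4.0, 0.82)  # hypothetical detection
player.toggle_overlay()
player.click(bird)
```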
These interaction features are particularly relevant given that the proposed
visualizations vary in the amount of information displayed by default. They have
also been designed to let users explore the results in a meaningful and
non-intrusive manner.
3.2.2 Visualization drafts
This section introduces a series of visualization drafts, each offering a distinct
approach to presenting sound event detection results. Every proposal is outlined
with its key strengths and potential limitations to provide a comprehensive
comparison and later choose the best candidates.
VisualiSED 01: Class-wise distributed rectangles
This approach arranges labels along different vertical levels, each corresponding
to a separate sound class. Class names are displayed at the start of the player
container in their corresponding level, while individual event detections are
represented as rectangles extending over their time intervals, as seen in Figure
10. Each rectangle includes a text with the confidence score expressed as a
percentage. Whenever space is insufficient for the confidence display due to short
sound events, this information is still accessible through mouse hover
interaction. A significant strength of this design is its effective handling of
overlapping events, which tends to be a challenging task in the visualization of
SED outputs. By grouping events by class and allocating each class to a different
vertical track, overlapping is avoided. However, this approach can struggle with
sounds containing a high number of distinct classes, since every class requires
its own track.
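The class-wise arrangement amounts to assigning one vertical track per class and mapping each time interval to a horizontal span. A minimal Python sketch of that layout step follows; the function name, the pixel dimensions and the example detections are assumptions for illustration and do not reproduce the actual Freesound implementation.

```python
def layout_class_wise(detections, duration, width, track_height):
    """Compute one rectangle per detection, with one track per class.

    detections: iterable of (class_name, start, end, confidence) tuples,
    with times in seconds. duration is the clip length in seconds; width
    and track_height are in pixels.
    """
    tracks = {}  # class name -> track index, in order of first appearance
    rects = []
    for class_name, start, end, confidence in detections:
        track = tracks.setdefault(class_name, len(tracks))
        rects.append({
            "class_name": class_name,
            "x": start * width / duration,
            "y": track * track_height,
            "width": (end - start) * width / duration,
            "height": track_height,
            "text": f"{confidence:.0%}",  # confidence as a percentage
        })
    return tracks, rects


# Hypothetical detections: "Speech" and "Dog" overlap in time but are
# placed on different tracks, so their rectangles do not collide.
tracks, rects = layout_class_wise(
    [("Speech", 0.0, 2.0, 0.9), ("Dog", 1.0, 3.0, 0.75),
     ("Speech", 4.0, 5.0, 0.8)],
    duration=10.0, width=500, track_height=20,
)
```

The total height grows linearly with the number of distinct classes, which is the space limitation noted for this design.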
VisualiSED 02: Level-wise distributed rectangles
To address the space limitations identified in the previous proposal, this method
retains the same basic structure (representing events as rectangles indicating their time
3.3.1 Dataset Files
The sounds selected for the final user satisfaction experiment are chosen based on
a set of criteria to ensure a representative range of SED behaviors while keeping
the experiment efficient.
• Dataset size must be kept small (10-20 sounds) in order to make the experiment
agile enough.
• It must contain a representative sample of various classes present in the FSD50K
dataset.
• It must contain diverse files in terms of event density and variety (from field
recordings to single-event audio clips).
• Recordings must not be longer than 30 seconds for the analysis to be run
without skipping.
Based on these considerations, the following sound IDs have been selected from the
Freesound development database: 463464, 75825, 347223, 217543, 685989, 682534,
436790, 49520, 671901, 181628, 437623, 463472. For more information on their type
of content, see Table 2, which classifies the sounds according to their AudioSet
Ontology parent class and the number of sources present in the sound (single vs.
multiple).
Table 2: Experiment dataset
ID      Source    Human  Animals  Music  Ambiguous  Things  Natural
463464  Multiple  True   False    False  False      False   False
347223  Multiple  True   False    False  False      False   False
217543  Multiple  False  False    True   False      False   False
685989  Multiple  False  False    False  False      True    False
682534  Single    False  True     False  False      False   False
436790  Multiple  False  False    False  False      False   True
49520   Multiple  True   True     False  False      False   False
671901  Single    False  False    True   False      False   False
181628  Multiple  False  False    False  True       True    False
463472  Multiple  True   False    True   False      False   False
437623  Multiple  True   False    False  True       True    False
75825   Multiple  False  False    False  False      True    True
3.3.2 Experiment design
To assess user satisfaction with the selected SED visualization designs, an online
mock experiment has been conducted using Google Forms. The form presents
participants with the previously outlined set of analyzed Freesound sound URLs,
containing visualizations generated using the final set of display designs
described in the previous section.
The main objective of the experiment is to obtain a general assessment and evaluate
the clarity, usability, and perceived usefulness of each visualization approach as
well as the quality of the generated detections. Participants will be asked to
listen to each audio clip while observing the corresponding visual representations
and then answer a series of short questions to capture their impressions.
The final questionnaire is short and concise to encourage participation. For each
visualization design, there is one question addressing the quality of the event
annotations, while the remaining ones focus on the effectiveness of the visual
design: helpfulness, layout clarity, and hypothetical usefulness when exploring
sounds in Freesound, plus an optional comment box for any suggestions. The four
sounds visited for each visualization display are randomly selected. The responses
have been collected using a 0-10 grading scale and a 5-point Likert scale ranging
from Strongly agree to Strongly disagree, as follows.
1. How would you rate the accuracy and quality of the event annotations in this
visualization? 0-10 grade scale.
2. The visualization helped me understand the events present in the audio clip.
5-point Likert scale.
3. The layout and visual organization of the events were clear and easy to
interpret. 5-point Likert scale.
4. I think this visualization would be useful when exploring sounds in Freesound.
5-point Likert scale.
5. Optional: Any suggestions or observations about this visualization? Text
answer.
Additionally, at the end of the survey, there is an optional box to leave any
message or observations the user wishes to provide.
Due to the impact that the proposed changes could have on Freesound, both in terms
of visualization displays and the integration of the modifications in the
FSD-SINet analyzer, the experiment has not yet been distributed to the wider
Freesound community. Instead, it has been carried out with a smaller group of
participants close to me, including specialists in audio and UX design as well as
individuals without prior knowledge of the field.
Chapter 4
Results
4.1 Postprocessing approaches
The current postprocessing strategy of the FSD-SINet model relies on a fixed
threshold of 0.7, which is applied to retain only the most confident detections.
In this work, two alternative approaches have been explored: (i) class-dependent
thresholds and kernel sizes determined through empirical observation, and (ii) the
use of tag-recommendation techniques adapted to sound event detection.
For the evaluation of these approaches, the input consisted of the raw outputs of
the FSD-SINet model, namely the top-10 frame-wise class confidences for the
FSD50K evaluation set. Performance was measured by computing the F-Score of each
methodology, with the resulting values summarized in Table 3.
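Table 3 reports a macro-averaged F1, that is, the unweighted mean of the per-class F1 scores. The sketch below shows only that averaging step in Python. How true positives, false positives and false negatives are counted (per segment or per event, for instance) is not specified in this section, so the per-class counts are taken as given and the example numbers are hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(counts: dict) -> float:
    """Unweighted mean of the per-class F1 scores.

    counts maps each class name to a (tp, fp, fn) tuple.
    """
    scores = [f1_score(*c) for c in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical counts for two classes, for illustration only.
example_counts = {"Dog": (8, 2, 2), "Speech": (5, 5, 0)}
example_macro = macro_f1(example_counts)
```

Because every class contributes equally, rare classes with poor detections lower the macro score as much as frequent ones.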
In the first method, based on empirical observation, parameters were estimated
using the development set. For each class, this process produced a specific
percentile, a kernel size, and an indication of whether the averaging should
account for detection length. Analysis of these results showed that the most
common configuration corresponded to the 50th percentile with kernel size k = 1,
without length-weighted averaging. However, this choice of parameters is also the
least intrusive, making the approach closely resemble the baseline and therefore
reducing its potential impact.
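To see why k = 1 is the least intrusive setting, consider a generic threshold-plus-median-filter pipeline. The Python sketch below is written under stated assumptions: the threshold is applied to frame-wise confidences first, the median filter then smooths the binary activity, and the hop size and example values are hypothetical. It does not show how the per-class percentile is turned into a threshold, which is defined in an earlier section.

```python
from statistics import median


def median_filter(values, kernel_size):
    """Median filter with an odd kernel; edges are padded by repetition."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    if not values:
        return []
    half = kernel_size // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + kernel_size]) for i in range(len(values))]


def frames_to_events(confidences, threshold, kernel_size, hop_seconds):
    """Threshold frame-wise confidences, smooth them, and merge active frames."""
    active = [1 if c >= threshold else 0 for c in confidences]
    smoothed = median_filter(active, kernel_size)
    events, start = [], None
    for i, is_active in enumerate(smoothed + [0]):  # sentinel closes last event
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    return events


# Hypothetical confidences for one class, with 0.5 s between frames.
conf = [0.1, 0.8, 0.9, 0.2, 0.9, 0.8, 0.1, 0.1]
events_k1 = frames_to_events(conf, threshold=0.7, kernel_size=1, hop_seconds=0.5)
events_k3 = frames_to_events(conf, threshold=0.7, kernel_size=3, hop_seconds=0.5)
```

With kernel_size=1 the filter is the identity, so the result matches plain thresholding and yields two short events; with kernel_size=3 the one-frame gap is filled and a single longer event is returned.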
Considering the F-Score of each postprocessing approach, the postprocessing used
for the final user experiment is the current static threshold, which obtained the
highest macro F1 in Table 3, combined with the hierarchy filtering described in
Section 3.1.3.
Figure 16: Distributions of the percentile values P (50, 60, 70, 80) and kernel
size values K (1, 3, 5, 7). [Plots not reproduced; both vertical axes are labelled
Distribution and range from 0 to 1.]
Table 3: Comparison of approaches and their performance
Approach                                 Useful parameters                 Macro F1
Per-class threshold and kernel size      Percentiles = [50, 60, 70, 80];   0.3385
                                         Kernel sizes = [1, 3, 5, 7]
Tag recommendation percentual approach   Min. threshold = 0.45;            0.344
                                         Valid percentage = 0.8
Current static threshold                 Threshold = 0.7                   0.3869
4.2 Qualitative insights on detections
Given that the ultimate goal of this work is to evaluate the real-world user
experience, it is important to provide insight into how the hierarchy filtering
step described in Section 3.1.3 affects both the display and the specificity of
detections. The full collection of images can be found in Appendix B, though the
most notable effects are illustrated in Figure 17.
It is clear that the filtering process not only removes some general detections but
also organizes the remaining ones according to a hierarchy, displaying parent-child
relationships from top to bottom.
While the aim of this feature is to improve specificity and enhance usability for
the end user, it can sometimes lead to less accurate displays. In particular, a
parent class with more accurately defined time boundaries may be omitted in favor
of more specific, but less precise, child classes. This effect is illustrated in
Figure 17, where the broader Bird class better matches the full duration of the
bird singing than the more specific Bird vocalization and bird call and bird song
class.
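The trade-off can be reproduced with a toy filter. The exact rule of Section 3.1.3 is not restated here, so the Python sketch below assumes one plausible rule: a detection is dropped whenever a detection of a descendant class overlaps it in time, and the survivors are ordered from parent to child. The PARENT mapping is a simplified, hypothetical fragment of the ontology, and the time values are invented for illustration.

```python
# Hypothetical, simplified fragment of the ontology: child -> parent.
PARENT = {
    "Bird": "Animal",
    "Bird vocalization and bird call and bird song": "Bird",
}


def ancestors(name, parent=PARENT):
    """List the ancestors of a class, nearest first."""
    found = []
    while name in parent:
        name = parent[name]
        found.append(name)
    return found


def overlaps(a, b):
    """True if two (class_name, start, end) detections intersect in time."""
    return a[1] < b[2] and b[1] < a[2]


def hierarchy_filter(detections, parent=PARENT):
    """Drop detections that have an overlapping, more specific detection.

    This is an assumed rule, not necessarily the one used by the author.
    Remaining detections are sorted by ontology depth, then by start time.
    """
    kept = [
        d for d in detections
        if not any(d[0] in ancestors(o[0], parent) and overlaps(d, o)
                   for o in detections)
    ]
    return sorted(kept, key=lambda d: (len(ancestors(d[0], parent)), d[1]))


CHILD = "Bird vocalization and bird call and bird song"
raw = [("Animal", 0.0, 10.0), ("Bird", 0.0, 10.0),
       (CHILD, 2.0, 5.0), ("Bird", 12.0, 14.0)]
filtered = hierarchy_filter(raw)
```

In this toy case the 10-second Bird detection is removed in favor of a 3-second child detection, which mirrors the loss of temporal accuracy described above, while the later, non-overlapping Bird detection is kept.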
4.3 User Satisfaction Survey Results
This section presents the results of the user satisfaction survey, which evaluated
both the underlying detection quality of the FSD-SINet model and the effectiveness
of three different visualization designs for displaying these detections. A total
of 12 participants completed the survey. Each participant listened to four sounds
per visualization, randomly selected from the set of 12 sounds, ensuring all sounds
were evaluated while introducing variation in the sequence. The questionnaire
structure was defined in Section 3.3.2. The results are presented in groups by
evaluation topic, and all the materials related to the survey can be found in
Appendix A.
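With 12 sounds and three designs shown with four sounds each, one way to satisfy "all sounds evaluated" is to shuffle the sound IDs and split them into three disjoint groups. The Python sketch below illustrates that idea only; the text does not state how the randomization was actually carried out, and the function name and seed are assumptions.

```python
import random

SOUND_IDS = [463464, 75825, 347223, 217543, 685989, 682534,
             436790, 49520, 671901, 181628, 437623, 463472]
DESIGNS = ["VS01", "VS02", "VS06"]  # the three evaluated designs


def assign_sounds(sound_ids, designs, per_design, rng):
    """Randomly partition the sounds into disjoint groups, one per design."""
    if len(sound_ids) != len(designs) * per_design:
        raise ValueError("sounds must split evenly across designs")
    shuffled = list(sound_ids)
    rng.shuffle(shuffled)
    return {
        design: shuffled[i * per_design:(i + 1) * per_design]
        for i, design in enumerate(designs)
    }


# A seeded generator keeps the illustration reproducible.
assignment = assign_sounds(SOUND_IDS, DESIGNS, 4, random.Random(0))
```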
4.3.1 Annotation accuracy
Given that all visualization designs rely on the same underlying detections, this
assessment directly reflects user confidence in the model outputs. The distribution
of responses (Figure 18) shows a central tendency around the middle of the scale
(Mean = 4.31), which indicates only moderate perceived accuracy, with noticeable
variability (SD = 1.34) and a wide range of opinions. While some users considered
the event annotations reasonably accurate, others found them less convincing,
suggesting that events were sometimes missed or imprecisely localized. These
evaluations highlight the need for further refinement in detection performance and
postprocessing approaches to improve user interpretability of SED outputs.
In parallel to the survey, a small manually annotated dataset was prepared for the
twelve sounds included in the experiment; it can be found in Appendix A. These
annotations provide a reference for assessing the alignment between the FSD-SINet
detections and the actual acoustic events. Although a detailed quantitative
comparison was outside the scope of this work, preliminary observations suggest
that discrepancies between manual annotations and model outputs explain the lower
user ratings. This dataset can therefore serve as a valuable baseline for future
validation and improvement efforts.
4.3.2 Visualization effectiveness
The results of the questions referring to the performance of the different
visualization designs reveal differences in how participants perceive the quality
of the three displays. Regarding the question of whether the visualization helped
users understand the events present in the audio clip (Q2), the Class-Wise design
(VS01) received the strongest positive feedback, with 11 participants agreeing and
1 strongly agreeing, and no negative responses. The Detailed Onsets design (VS06)
was also positively received, with 8 agreeing and 3 strongly agreeing, though one
participant expressed neutrality. The Level-Wise design (VS02) showed more mixed
results, with 5 participants agreeing, 3 neutral, and 4 disagreeing, suggesting
some users found it less intuitive for interpreting event sequences. Results can be
checked in Figure 19.
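The Q2 counts reported above can be turned into a share of positive responses per design. The short Python sketch below uses only the counts stated in the text; response categories that are not mentioned are assumed to be zero, which is consistent with each design totalling 12 responses.

```python
# Q2 counts as reported in the text; unmentioned categories assumed to be 0.
Q2 = {
    "VS01 Class-Wise": {"Strongly agree": 1, "Agree": 11, "Neutral": 0, "Disagree": 0},
    "VS06 Detailed Onsets": {"Strongly agree": 3, "Agree": 8, "Neutral": 1, "Disagree": 0},
    "VS02 Level-Wise": {"Strongly agree": 0, "Agree": 5, "Neutral": 3, "Disagree": 4},
}


def positive_share(counts):
    """Fraction of responses that are 'Agree' or 'Strongly agree'."""
    total = sum(counts.values())
    return (counts["Strongly agree"] + counts["Agree"]) / total


shares = {design: positive_share(counts) for design, counts in Q2.items()}
```

Under these assumptions, all 12 responses for VS01 are positive, 11 of 12 for VS06, and 5 of 12 for VS02.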
A similar pattern emerges for layout clarity and ease of interpretation (Q3), see
Figure 20. The Class-Wise design again scored highly, with 8 agreeing and 3
strongly agreeing, whereas the Detailed Onsets design had 8 strongly agree
responses but fewer agreeing responses overall. The Level-Wise design had more
dispersed ratings, including one disagreement and two neutral responses,
indicating that hierarchical grouping of events introduces some ambiguity in
visual organization.
For perceived usefulness when exploring sounds in Freesound (Q4), both the
Class-Wise and Detailed Onsets designs were consistently rated as useful, with 8 or
more participants selecting agree or strongly agree. The Level-Wise design received
more varied responses, with some participants neutral or disagreeing, reinforcing
the observation that while hierarchical grouping may provide structure, it might
reduce interpretability and perceived usefulness in practical exploration of sound
events (see Figure 21). Overall, these results suggest that the Class-Wise and
Detailed Onsets designs were generally preferred by users, offering clearer and
more actionable visualizations for understanding sound events.
4.3.3 Qualitative Feedback
Summarizing the users' feedback across the three visualization designs,
participants generally highlighted issues with the quality and accuracy of the
detections. Many of them were missing, temporally misaligned, or overly general
(e.g., "Musical instrument" for orchestral passages, "Domestic animals" instead of
"Dog," "Human group action" instead of "Applause"). Long events were often
truncated, while background sounds such as footsteps or wind were undetected.
Several users emphasized that the weak detection quality made it difficult to
evaluate the visualizations themselves.
VisualiSED 02: Hierarchy Level-Wise Display
Users appreciated the use of confidence-dependent borders, but found overlapping
rectangles confusing and visually cluttered. Labels missing event names (relying
only on colors, with names displayed only on the legend) were seen as problematic
for accessibility. Additionally, some users reported that the display of the labels
on top of the waveform is confusing, and proposed a separate whole display similar
to the toggle between waveform and spectrogram displays.
VisualiSED 06: Detailed Onsets Display
This design was generally better received. The stacking of overlapping events was
considered a clear improvement, making the display less intrusive. However,
temporal inconsistencies remained a major issue, particularly for applause and
other continuous sounds. Some users criticized redundancy in displaying similar
labels (e.g., "Motor Vehicle" multiple times). While still affected by overly
general detections, this display was often described as the most comfortable to
use.
VisualiSED 01: Class-Wise Display
Several users preferred this design because of the clearer alignment between
detection names and their position (labels listed on the left). However, clutter
remained a concern, particularly when many events were present. The overlapping of
waveform and labels again drew criticism. Despite persistent detection issues, some
considered this an improved version of Design A.
General Feedback
Overall, users saw potential in the visualizations but stressed that the system is
not yet ready for Freesound integration. Suggestions included separating waveforms
from labels to improve readability and allowing toggling of detections via the
legend to address color-blind accessibility. While detection quality was the main
limitation, participants agreed that visualization could become a useful feature,
especially for browsing longer audio files where it could save listening time.
Figure 17: Top: Sound 682534 detections before filtering. Bottom: Sound 682534
filtered detections. Both displays follow the visualization technique described in
Section 3.2.3.
List of Figures
1 DESED dataset composition overview.
2 Real (green) vs. predicted (orange) boundaries of an audio event before (up) and
after (down) median filtering.
3 Soundcloud comments displayed along the time axis, each at a time instant.
4 Different types of sound annotations generated by Sonic Visualizer, each on a
different label. Note that each annotation can have an associated customizable
text. Text (blue), notes (red), regions (green), boxes (purple), time instants
(bright red, vertical line).
5 Left: BirdNET App in the analysis boundaries setting step. Right: BirdNET App
displaying detection results.
6 Freesound display in waveform mode.
7 Freesound display in spectrogram mode.
8 Draft display of raw detections given by the FSD-SINet model current
postprocessing.
9 Draft display of hierarchically filtered detections.
10 Draft display of VisualiSED 01.
11 Draft display of VisualiSED 02.
12 Draft display of VisualiSED 03.
13 Draft display of VisualiSED 04.
14 Draft display of VisualiSED 05.
15 Draft display of VisualiSED 06.
16 Distributions of the percentile (P) and kernel size (K) parameter values.
17 Top: Sound 682534 detections before filtering. Bottom: Sound 682534 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
18 User answers regarding quality and accuracy of detections. Mean score = 4.31,
StdDev = 1.34, Range = 6.
19 User answers regarding helpfulness of the display.
20 User answers regarding clarity of the display.
21 User answers regarding usefulness of the display.
22 Top: Sound 49520 detections before filtering. Bottom: Sound 49520 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
23 Top: Sound 75825 detections before filtering. Bottom: Sound 75825 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
24 Top: Sound 181628 detections before filtering. Bottom: Sound 181628 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
25 Top: Sound 217543 detections before filtering. Bottom: Sound 217543 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
26 Top: Sound 347223 detections before filtering. Bottom: Sound 347223 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
27 Top: Sound 436790 detections before filtering. Bottom: Sound 436790 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
28 Top: Sound 437623 detections before filtering. Bottom: Sound 437623 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
29 Top: Sound 463464 detections before filtering. Bottom: Sound 463464 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
30 Top: Sound 463472 detections before filtering. Bottom: Sound 463472 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
31 Top: Sound 671901 detections before filtering. Bottom: Sound 671901 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
32 Top: Sound 685989 detections before filtering. Bottom: Sound 685989 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
33 Top: Sound 685989 detections before filtering. Bottom: Sound 685989 filtered
detections. Both displays follow the visualization technique described in 3.2.3.
List of Tables
1 Mapping between DESED classes and FSD50K classes
2 Experiment dataset
3 Comparison of approaches and their performance
Bibliog aphy
[1] Gemmeke, J. F. e al. Audio se : An on ology and human-labeled da ase o
audio e en s. In 2017 IEEE In e na ional Con e ence on Acous ics, Speech and
Signal P ocessing (ICASSP), 776–780 (2017).
[2] Fonseca, E., Fa o y, X., Pons, J., Fon , F. & Se a, X. Fsd50k: An open da ase
o human-labeled sound e en s (2022). URL h ps://a xi .o g/abs/2010.
00475.2010.00475.
[3] Mesa os, A., Hei ola, T., Vi anen, T. & Plumbley, M. D. Sound e en de ec-
ion: A u o ial. IEEE Signal P ocessing Magazine 38, 67–83 (2021).
[4] Ada anne, S., Pe ilä, P. & Vi anen, T. Sound e en de ec ion using spa ial
ea u es and con olu ional ecu en neu al ne wo k (2017). URL h ps://
a xi .o g/abs/1706.02291.1706.02291.
[5] Kong, Q., Xu, Y., Sobie aj, I., Wang, W. & Plumbley, M. D. Sound e en de ec-
ion and ime– equency segmen a ion om weakly labelled da a. IEEE/ACM
T ansac ions on Audio, Speech, and Language P ocessing 27, 777–787 (2019).
URL h p://dx.doi.o g/10.1109/TASLP.2019.2895254.
[6] Caki , E., Pa ascandolo, G., Hei ola, T., Hu unen, H. & Vi anen, T. Con-
olu ional ecu en neu al ne wo ks o polyphonic sound e en de ec ion.
IEEE/ACM T ansac ions on Audio, Speech, and Language P ocessing 25,
1291–1303 (2017). URL h p://dx.doi.o g/10.1109/TASLP.2017.2690575.
56
BIBLIOGRAPHY 57
[7] Fonseca, E., Fe a o, A. & Se a, X. Imp o ing sound e en classi ica ion by
inc easing shi in a iance in con olu ional neu al ne wo ks (2021). URL h ps:
//a xi .o g/abs/2107.00623.2107.00623.
[8] Mesa os, A., Hei ola, T. & Vi anen, T. Me ics o polyphonic sound e en
de ec ion. Applied Sciences 6, 162 (2016).
[9] Giannakopoulos, P., Pik akis, A. & Co onis, Y. Imp o ing pos -p ocessing o
audio e en de ec o s using ein o cemen lea ning. IEEE Access 10, 84398–
84404 (2022).
[10] Kahl, S., Wood, C. M., Eibl, M. & Klinck, H. Bi dne : A deep lea ning solu ion
o a ian di e si y moni o ing. Ecological In o ma ics 61, 101236 (2021).
[11] SoundCloud Help Cen e . Commen ing basics (2025). URL h ps://help.
soundcloud.com/hc/en-us/a icles/115003566008-Commen ing-basics.
Accessed: 2025-07-01.
[12] Cannam, C., Landone, C. & Sandle , M. Sonic isualise : an open sou ce
applica ion o iewing, analysing, and anno a ing music audio iles. 1467–1468
(2010).
[13] Boe sma, P. & Weenink, D. P aa , a sys em o doing phone ics by compu e .
Glo in e na ional 5, 341–345 (2001).
[14] Co nell Lab o O ni hology and Chemni z Uni e si y o Technology. Bi dne :
Bi d sound iden i ica ion. h ps://play.google.com/s o e/apps/de ails?
id=de. u_chemni z.mi.kahs .bi dne (2025). Mobile applica ion; And oid;
upda ed 2025-06-12.
[15] Co be a, F. F. Tag Recommenda ion using Folksonomy In o ma ion o On-
line Sound Sha ing Pla o ms. Ph.d. disse a ion, Uni e si a Pompeu Fab a,
Ba celona, Spain (2015). URL h ps://www. dx.ca /handle/10803/296797.

Appendix A
Resources and Code availability
Freesound repository branch containing the main code of the project regarding
visualizations and UI/UX.
Freesound audio analyzers repository branch containing the code for the
post-processing and filtering of the outputs of the FSD-SINet model.
Useful dataset information and manual annotations in Google Sheets format.
User experiment form in Google Forms format.
User experiment results collected in a Google Sheets document.
Appendix B
Filtering results
Figure 22: Top: Sound 49520 detections before filtering. Bottom: Sound 49520
filtered detections. Both displays follow the visualization technique described in
Section 3.2.3.
Figure 23: Top: Sound 75825 detections before filtering. Bottom: Sound 75825
filtered detections. Both displays follow the visualization technique described in
Section 3.2.3.
Figure 30: Top: Sound 463472 detections before filtering. Bottom: Sound 463472
filtered detections. Both displays follow the visualization technique described in
Section 3.2.3.
Figure 31: Top: Sound 671901 detections before filtering. Bottom: Sound 671901
filtered detections. Both displays follow the visualization technique described in
Section 3.2.3.
Figure 32: Top: Sound 685989 detections before filtering. Bottom: Sound 685989
filtered detections. Both displays follow the visualization technique described in
Section 3.2.3.
Figure 33: Top: Sound 685989 detections before filtering. Bottom: Sound 685989
filtered detections. Both displays follow the visualization technique described in
Section 3.2.3.