Visualization of the output of Sound Event Detection algorithms in Freesound

Author: Marcé Forns, Joaquim

Publisher: Zenodo

DOI: 10.5281/zenodo.17304575

Source: https://zenodo.org/records/17304575/files/Quim_Marce_SMC_2025_Master_Thesis.pdf

Master in Sound and Music Computing
Universitat Pompeu Fabra
Visualization of the output of Sound
Event Detection algorithms in Freesound
Joaquim Marcé Forns
Supervisor: Frederic Font
July 2025
Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Structure of the Report
2 State of the art
2.1 Foundations of Sound Event detection
2.1.1 Problem definitions
2.1.2 Model architectures
2.1.3 Evaluation metrics
2.2 Datasets and benchmarks
2.2.1 DCASE Datasets
2.2.2 FSD50K
2.3 Post-processing of Sound Event Detection outputs
2.3.1 General techniques
2.3.2 Learning based post-processing
2.3.3 In the context of Freesound
2.4 User interfaces for sound annotations
2.4.1 General visualization techniques
2.4.2 SED systems user interfaces
2.4.3 Freesound interfaces and tools
3 Methods
3.1 Postprocessing proposals
3.1.1 Empirical observation on FSD50K dataset
3.1.2 Tag recommendation methods applied to SED techniques
3.1.3 Hierarchy filtering for human-readable results
3.1.4 Future optimization proposals
3.2 Visualization of SED in Freesound
3.2.1 General considerations
3.2.2 Visualization drafts
3.2.3 Best user interfaces
3.3 User satisfaction experiment design
3.3.1 Dataset Files
3.3.2 Experiment design
4 Results
4.1 Postprocessing approaches
4.2 Qualitative insights on detections
4.3 User Satisfaction Survey Results
4.3.1 Annotation accuracy
4.3.2 Visualization effectiveness
4.3.3 Qualitative Feedback
5 Discussion
5.1 Postprocessing insights
5.2 Visualization Considerations
5.3 Future work
5.4 Conclusions
List of Figures
List of Tables
Bibliography
A Resources and Code availability
B Filtering results

Dedication
I would like to dedicate this work to birds.
Acknowledgement
I would like to express my sincere gratitude to:
•My supervisor, Frederic Font, for his utmost patience, unlimited curiosity and
detailed attention and support over the past months.
•My family, for always being supportive.
Chapter 2
State of the art
This chapter reviews the fundamentals and latest advancements in the literature
relevant to this work. Given that the scope of this thesis involves two main
topics, the post-processing of the outputs of sound event detection algorithms
and their visualization and user interface design, they are treated separately.
2.1 Foundations of Sound Event detection
2.1.1 Problem definitions
Sound event detection (SED) is a rapidly evolving subfield of the computational
analysis of sound scenes that involves the automatic identification and
classification of acoustic events in real-world audio recordings. Sound event
detection aims not only at recognizing the presence of sound events, i.e. sounds
produced by a single source (a dog barking, a car passing by...), but also at
localizing these events in the temporal domain. This subsection introduces the
core problem formulations involved in sound event detection: audio tagging,
event detection and segmentation, and the distinction between monophonic and
polyphonic detection.
Audio Tagging vs Event Detection
Sound scene analysis research is fundamentally divided into audio tagging and
event detection. On the one hand, audio tagging refers to the task of assigning
one or more labels to an audio clip to determine which classes are present,
without specifying their temporal information. This is usually treated as a
multi-label classification task (citation needed) and is appropriate for weakly
labeled datasets, where annotations are provided for each audio clip as a whole.
Common benchmarks such as AudioSet [1] and FSD50K [2] provide large-scale data
for this goal. On the other hand, sound event detection requires not only
identifying which classes are present in an audio clip but also delimiting their
time boundaries, i.e. predicting the onset and offset of each event along the
temporal domain. This requirement transforms the problem into a localization
task evaluated using event-based metrics such as F1-Score and error rate
(ER) [3]. Therefore, event detection is a more demanding and complex task than
tagging, especially when dealing with real-world auditory scenes, where multiple
events tend to overlap.
Monophonic vs Polyphonic Detections
Another division in sound event detection refers to the number of sounds present
at any given time in an audio clip. In monophonic environments, only one sound
event is meant to occur at a time, which simplifies the modeling process.
However, as mentioned above, many real-world recordings contain overlapping
sounds. One just needs to think of a domestic environment such as a kitchen,
where it is practically impossible for sound events not to overlap (cooking,
speech, home appliances, etc.). Polyphonic sound event detection addresses this
complexity by assuming that multiple sound events overlap both temporally and
frequency-wise [4]. This task is particularly important for datasets like DESED
and FSD50K, which are relevant to this research and will be inspected in more
detail later in this thesis.
Weakly vs Strongly Labeled Data
The availability of annotation-quality datasets tends to influence the modeling
strategy chosen in each case. These annotations can be separated into weak and
strong labels. Strong labels include precise timestamps for each sound event,
which facilitates supervised learning of temporal localization, but their main
drawback is that they are quite difficult to obtain. By contrast, weak labels
only contain clip-level information regarding the classes present in each clip,
a feature that makes them much easier to collect but at the same time implies a
significant challenge for learning quality temporal patterns. Recent advances in
Multiple Instance Learning (MIL) and attention mechanisms have attempted to fill
this gap [5].
2.1.2 Model architectures
The performance of SED systems is highly dependent on the model architectures
they use. In recent years, Convolutional Neural Networks (CNNs) and their
recurrent extensions have become a popular approach due to their ability to
learn robust audio features and representations from spectrogram inputs. This
section provides a brief overview of important models used in SED, specifically
focusing on the FSD-SINet model, which is crucial for the Freesound analysis
pipeline.
Convolutional Neural Networks and CRNNs
CNNs form the main foundation of many SED systems by combining local
time-frequency feature extraction with global feature integration. Their
capacity for translation invariance enables them to detect important audio
characteristics independently of slight temporal or spectral shifts [4]. To
better model temporal dependencies across frames, CNNs are usually combined with
Recurrent Neural Networks, resulting in Convolutional Recurrent Neural Networks
(CRNNs) that can capture a better context [6]. CRNNs have demonstrated strong
performance in polyphonic SED benchmarks such as DCASE challenges.
The FSD-SINet Model: Increasing Shift-Invariance
The FSD-SINet model developed by Fonseca and Serra [7] represents a notable
improvement for the Freesound analysis pipeline. It addresses imperfect
shift-invariance, which is the main limitation of standard CNNs in SED. Standard
convolutions can be sensitive to small shifts in the input signal, causing
detection issues. FSD-SINet instead integrates Shift-Invariant Convolutions
(SINets) that augment the conventional CNN layers to improve robustness to
problematic shifts, stabilizing the feature extraction process. The resulting
predictions of this model provide improved generalization for the weakly-labeled
dataset FSD50K.
2.1.3 Evaluation metrics
Evaluating SED systems requires choosing metrics that can properly assess both
sound event classification and the detection of the events' temporal boundaries.
F1 Score
The F1 Score is still one of the most popular metrics used for SED evaluation.
It represents the harmonic mean of precision and recall, therefore balancing and
penalizing both false positives and false negatives. The two most frequently
used variants are:
•F1-Macro: This metric computes the F1 score independently for each class
and averages them, so it gives the same importance to all classes regardless of
their frequency in the dataset.
•Per-class F1: This metric calculates the F1 score for each class, so that the
strengths and weaknesses of the model are available for each category to
facilitate targeted improvements.
F1-based metrics usually treat sound events as binary occurrences in
fixed-length segments, which may not take temporal localization accuracy into
account.
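As a minimal sketch of the two variants above, the following NumPy snippet computes per-class and macro-averaged F1 from binary segment-level decisions. The function names and the (segments x classes) array layout are illustrative assumptions, not the evaluation code used in this thesis.

```python
import numpy as np

def per_class_f1(y_true, y_pred):
    """F1 per class for binary arrays of shape (n_segments, n_classes)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred, axis=0)    # true positives per class
    fp = np.sum(~y_true & y_pred, axis=0)   # false positives per class
    fn = np.sum(y_true & ~y_pred, axis=0)   # false negatives per class
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); a class with no positives at all scores 0
    # (an assumption; other conventions exclude such classes).
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    return float(np.mean(per_class_f1(y_true, y_pred)))
```

Because the macro average is unweighted, a rare class that the model handles poorly lowers the score as much as a frequent one would.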
Event-based metrics
Apart from segment-level metrics such as the F1 Scores mentioned above, SED
evaluation systems often use event-based metrics for the sake of temporal
accuracy. These metrics compare the time boundaries of the predicted events
against the ground truth annotations [8].
•Event-based F1 Score and ER: Proposed in the DCASE challenges, these metrics
evaluate the accuracy of detections based on onsets and offsets, handling
missed, false and overlapping events.
•Polyphonic Sound Detection Score (PSDS): This metric has been recently
introduced to provide better handling of polyphonic and overlapping events. PSDS
integrates both segment-level and event-level detections, offering an increase
in robustness for real-world applications.
Although PSDS provides richer evaluation results, F1-Macro and per-class F1
remain standard and popular choices for benchmarking and comparative analysis in
many studies of SED systems.
Mean Average Precision
mAP is a widely used clip-based metric in multi-label classification and
detection tasks. It measures the average precision (AP) across all classes and
is especially helpful to evaluate how well a model ranks classes over entire
audio clips, which is a common practice in weakly labeled sound event detection
tasks [7].
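To make the ranking interpretation concrete, here is a small sketch of AP and mAP over clip-level scores. It uses the common "mean of precision at each relevant item" definition of AP, which is an assumption about the exact variant; the function names and array layout are likewise illustrative.

```python
import numpy as np

def average_precision(labels, scores):
    """AP for one class: mean precision at the rank of each positive clip."""
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]                      # positives in ranked order
    if hits.sum() == 0:
        return float("nan")                   # AP is undefined without positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

def mean_average_precision(label_matrix, score_matrix):
    """mAP over classes for arrays of shape (n_clips, n_classes)."""
    label_matrix = np.asarray(label_matrix)
    score_matrix = np.asarray(score_matrix)
    aps = [average_precision(label_matrix[:, c], score_matrix[:, c])
           for c in range(label_matrix.shape[1])]
    return float(np.nanmean(aps))             # skip classes without positives
```

Note that no decision threshold is involved: only the ordering of the scores matters, which is why the metric suits weakly labeled clip-level evaluation.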
2.2 Datasets and benchmarks
2.2.1 DCASE Datasets
The development and benchmarking of Sound Event Detection systems, similarly to
other related fields of study, rely on standardized datasets that provide
annotated audio for both training and evaluation. Over the past decade, the SED
community has converged around a set of datasets that contain diverse acoustic
environments, scenes and events, framed within the Detection and Classification
of Acoustic Scenes and Events (DCASE) challenge.
DCASE Challenge Datasets
The DCASE series, held annually since 2013, has played a fundamental role in
establishing common benchmarks for SED and other related tasks. These challenges
have introduced several datasets focused on different aspects of audio scene
analysis, such as audio scene classification, single and multi-channel event
detection and weakly supervised learning [8]. A particularly influential subset
is DCASE Task 4, which addresses polyphonic sound event detection under weakly
labeled (or unlabeled) conditions, making it a good approach to deal with
real-world situations and constraints. In the context of DCASE Task 4, the DESED
(Domestic Environment Sound Event Detection) dataset has been notably important.
This dataset has been developed to simulate realistic domestic acoustic
environments while remaining controllable and reproducible. DESED contains real
and synthetic recordings organized as seen in Figure 1:
•The synthetic subset is generated by mixing isolated sound events into diverse
soundscapes. It provides strong temporal annotations, making it a really good
fit for supervised learning or for evaluating temporal precision.
•The real subset includes unlabeled, weakly, and strongly labeled data separated
into training and validation for the development set and the public evaluation
set.
This dual composition allows models to be tested under different learning
conditions and has become a standard benchmark for evaluating models.
Figure 1: DESED dataset composition overview
2.2.2 FSD50K
FSD50K is a large-scale open dataset designed for multi-label sound
classification systems and audio tagging, with a strong emphasis on diversity,
realism and scalability. Released in 2020 by Fonseca et al., it is extracted
from the Freesound platform and can be used as a critical benchmark for audio
classification models training with real-world sound content [2].
Dataset composition
FSD50K consists of over 51,000 audio clips extracted from Freesound. These clips
are labeled using a subset of the AudioSet ontology [1], using 200 unbalanced
sound classes ranging through several labels classified by sound source (human
voice) or by their means of production (bang), which can be hierarchically
sorted according to the ontology. The dataset is split into development and
evaluation sets, with weak labels, metadata, and clips with lengths ranging from
0.3 to 30 seconds, reflecting the variability and unpredictability of real-world
acoustic scenarios. It is important to highlight that all audio files are drawn
from an open domain, and that FSD50K includes an annotator-bounded tool to
validate category annotations with over 600 contributors.
2.3 Post-processing of Sound Event Detection outputs
2.3.1 General techniques
Post-processing plays a crucial role in transforming raw outputs from SED models
into human-readable data so that the event detections can be properly
interpreted by end-users. This section reviews the most popular techniques such
as thresholding and median filtering.
Thresholding
The most basic yet common method of postprocessing is probability thresholding.
Given that SED models tend to output a matrix of frame-wise class probabilities,
either a fixed or class-dependent threshold can be applied to determine if the
confidence of a present class is enough for the sound event to be active at the
given frame. This binarization of the sound event time boundaries converts the
model output into a hard decision making process. While it is a straightforward
technique, using a global threshold (e.g., 0.5) may underperform for unbalanced
sets or when the confidence tends to vary a lot for a single class. For this
reason, using class-dependent thresholding is often a reasonable approach,
tuning the threshold according to validation data.
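The difference between a global and a class-dependent threshold can be sketched in a few lines of NumPy. The function name, the (frames x classes) layout and the example values are assumptions made for illustration only.

```python
import numpy as np

def binarize(probs, thresholds=0.5):
    """Turn frame-wise class probabilities into hard decisions.

    probs:      array of shape (n_frames, n_classes)
    thresholds: a scalar (global threshold) or an array of shape
                (n_classes,) holding one threshold per class.
    """
    probs = np.asarray(probs, dtype=float)
    # NumPy broadcasting applies a per-class vector column by column.
    return probs > np.asarray(thresholds, dtype=float)

probs = [[0.6, 0.6],
         [0.4, 0.8]]
global_decisions = binarize(probs, 0.5)             # same cut for every class
per_class_decisions = binarize(probs, [0.5, 0.7])   # stricter cut for class 1
```

In this toy case the second class at the first frame (0.6) is active under the global threshold but rejected once that class gets its own, stricter value.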
Median filtering
Another popular method for smoothing sound event detection outputs is median
filtering. This approach provides better handling of sudden isolated detections
or gaps, as seen in Figure 2. Median filtering is applied to the binary
predictions over time after thresholding, using a sliding window of an odd size
that typically spans from 3 to 11 frames. The kernel size of the median filter
is sometimes determined empirically, depending on the expected behavior of the
sound classes involved in the process. In some implementations, class-dependent
median filtering is used to better match the temporal behavior of the different
classes considered.
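A minimal sketch of median filtering on one class's binary track follows. The edge-padding choice and the function name are assumptions; library routines such as SciPy's median filters offer other boundary modes.

```python
import numpy as np

def median_filter_1d(binary_track, kernel_size=3):
    """Median-filter a 0/1 track with an odd-sized sliding window."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    x = np.asarray(binary_track, dtype=int)
    half = kernel_size // 2
    # Repeat the first/last value so the output keeps the input length.
    padded = np.pad(x, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size)
    # With an odd window over 0/1 values the median is a majority vote.
    return (np.median(windows, axis=1) > 0.5).astype(int)

track = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
smoothed = median_filter_1d(track, kernel_size=3)
```

With a kernel of 3 frames, the isolated detection at frame 1 is removed and the one-frame gap at frame 6 is filled, which is the behavior described above.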
Figure 2: Real (green) vs. Predicted (orange) boundaries of an audio event
before (up) and after (down) median filtering.
Although these are the two most popular approaches to sound event detection
postprocessing, there are other auxiliary steps, such as minimum event duration
enforcement or merging temporally adjacent events, that can also be helpful.
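The two auxiliary steps can be sketched on a list of (onset, offset) pairs in seconds. The parameter names `max_gap` and `min_duration` and their values are hypothetical, chosen only for illustration.

```python
def merge_and_prune(events, max_gap=0.0, min_duration=0.0):
    """Merge events separated by at most max_gap, then drop short ones.

    events: iterable of (onset, offset) tuples in seconds, for one class.
    """
    merged = []
    for onset, offset in sorted(events):
        if merged and onset - merged[-1][1] <= max_gap:
            # Close enough to the previous event: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    # Enforce the minimum duration after merging, so that fragments
    # of one long event are not discarded individually.
    return [(on, off) for on, off in merged if off - on >= min_duration]

events = [(0.0, 0.5), (0.6, 1.5), (3.0, 3.1)]
cleaned = merge_and_prune(events, max_gap=0.2, min_duration=0.3)
```

Here the first two events are joined across the 0.1 s gap, while the isolated 0.1 s detection at the end is discarded.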
2.3.2 Learning based post-processing
Even though conventional post-processing techniques such as thresholding and
median filtering are effective for improving the predictions of SED models, they
sometimes still fail to capture the huge complexity of real-world sound scenes.
In order to address these limitations, recent research has explored more
sophisticated post-processing methods. A notable example is the approach
proposed by Giannakopoulos et al. (2022), in which Reinforcement Learning is
used to optimize the postprocessing pipeline of a SED model. In this context, a
learning agent learns to configure the parameters of the post-processing
operations of thresholding and median filtering to maximize evaluation metrics
such as F1 score or error rate (ER) [9]. More specifically, it learns to
optimize per-class values of thresholds and median filtering window sizes. The
agent is trained using Policy Gradient methods to find the optimal
configuration. This approach has shown improvements over manual tuning in
experiments on DCASE datasets, highlighting the power of adaptive data-driven
postprocessing. As the SED field evolves, the use of learning-based models in
the post-processing pipeline is a promising direction for future work.
2.3.3 In the context of Freesound
The Freesound analysis pipeline is responsible for extracting certain properties
from sounds using different analyzers and storing the results in the database.
Within it, the major analyzer regarding SED is FSD-SINet [7], which is why this
work focuses on its results for post-processing and later visualization.
However, other notable analyzers and models within the current pipeline are
worth mentioning:
•YAMNet Model: a pre-trained deep net that predicts 521 audio event classes
from the AudioSet ontology.
•BirdNET Analyzer, which has been especially popular among the community of
ornithologists due to its ability to identify specific species of birds by their
calls and songs [10].
Postprocessing of FSD-SINet analyzer
The current postprocessing pipeline of this model is quite straightforward and
is one of the main areas of improvement targeted by this work. A static
threshold, currently set to 0.7, is used to filter the confidences generated by
the FSD-SINet model. After this step, time boundaries are defined following the
hop size of 0.5 that divides the audio clip into frames. By checking both the
onsets and offsets of each detection, consecutive events are assembled into a
single event detection if the offset of one matches the onset of the next.
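The steps above can be sketched as follows. This is not the Freesound implementation: the input layout (one `{class_name: confidence}` dictionary per frame), the strict `>` comparison, reporting the mean confidence of the merged frames, and reading the 0.5 hop size as seconds are all assumptions made for illustration.

```python
def detect_events(frame_scores, threshold=0.7, hop=0.5):
    """Threshold per-frame confidences and join consecutive active frames.

    frame_scores: list with one {class_name: confidence} dict per frame.
    Returns event dicts with name, start, end (seconds) and confidence.
    """
    events, open_events = [], {}
    # A trailing empty frame closes every event still open at the end.
    for i, scores in enumerate(list(frame_scores) + [{}]):
        active = {c: s for c, s in scores.items() if s > threshold}
        for name in list(open_events):
            if name not in active:
                start, confs = open_events.pop(name)
                events.append({"name": name,
                               "start": start * hop,
                               "end": i * hop,
                               "confidence": sum(confs) / len(confs)})
        for name, conf in active.items():
            if name in open_events:
                open_events[name][1].append(conf)   # offset meets next onset
            else:
                open_events[name] = (i, [conf])     # new onset
    return sorted(events, key=lambda e: (e["start"], e["name"]))

frames = [{"Dog": 0.9, "Bark": 0.6}, {"Dog": 0.8}, {"Speech": 0.75}, {"Dog": 0.95}]
detections = detect_events(frames)
```

The two adjacent "Dog" frames become one 1.0 s event, while the later "Dog" frame, separated by a frame below threshold, is reported as a separate detection.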
As discussed in the previous section, this postprocessing approach can be
improved by making it class-dependent and including popular methods such as
median filtering. It is important to mention that the Freesound database
contains not only the resulting sound event detections from this analysis but
also the raw data of the top-10 detected classes for each analyzed audio clip.
Chapter 3
Methods
3.1.1 Empirical observation on FSD50K dataset
Given the use of a static threshold for the hard decision-making process on the
output probabilities, the question of using specific, per-class thresholds
arises. Given the diversity in both the Freesound audio collection and the set
of classes present in the FSD50K dataset, it is reasonable to expect that an
adaptive threshold will improve the global performance of the analyzer. For
example, the singularity of a sound event labeled as applause intuitively
requires a less strict threshold than more ambiguous labels such as boiling or
gurgling. Apart from obtaining an adaptive threshold per class, the lack of a
median filtering step in the post-processing raises the question of what could
be done in this respect too. Following the previous idea, setting a specific
window size per class helps smooth the output detections, given that in the
temporal domain there is also a wide range of behaviors depending on the class
(see Thunder and Thunderstorm).
In order to obtain adaptive parameters for both operations, a simple
observational approach is proposed. It consists of taking a set of different
kernel sizes and computing thresholds with different restriction parameters per
class and audio clip. By postprocessing the evaluation set with each group of
variables and comparing them to ground-truth annotations, the best threshold and
kernel sizes can be selected for each class.
Threshold computations
To determine class-specific decision thresholds, a multi-stage computation
process is employed. For each raw analysis file and corresponding class, an
array of frame-level thresholds is extracted based on the classes marked as
active in the ground truth annotations.
Once this per-frame threshold progression is obtained for a given class and
audio clip, a segmentation process is applied. Within each segment, the mean and
standard deviation of the activation scores are computed and combined to form a
set of representative thresholds:

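\[
\mathrm{frameThresholds} = \{\, \operatorname{avg}(s) + \operatorname{std}(s) \mid s \in \mathrm{segments} \,\}
\]
A file-level threshold is then computed by averaging these frame-level values:
\[
\mathrm{fileThreshold} = \operatorname{avg}(\mathrm{frameThresholds},\ \mathrm{weighted} = \mathrm{False})
\]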
At this stage, both simple averages and averages weighted by the length of the
active segments are considered. The outcome of this process is a set of N
file-level thresholds for each class, where N denotes the number of analysis
files available in the dataset.
To aggregate these into a single, representative threshold per class, a
percentile-based reduction is applied across the N values. This enables the
derivation of class-specific thresholds with varying degrees of restrictiveness.
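The three stages (per-segment statistic, file-level average, percentile across files) can be sketched as below. How segments are delimited is not specified here, so the sketch assumes they are already given as lists of activation scores; the function names and the use of the population standard deviation are likewise assumptions.

```python
import numpy as np

def file_threshold(segments, weighted=False):
    """File-level threshold for one class from its active segments.

    segments: list of 1-D sequences of activation scores, one per segment.
    """
    # One representative value per segment: mean + standard deviation.
    seg_thresholds = np.array([np.mean(s) + np.std(s) for s in segments])
    if weighted:
        # Longer segments contribute proportionally more.
        lengths = np.array([len(s) for s in segments], dtype=float)
        return float(np.average(seg_thresholds, weights=lengths))
    return float(np.mean(seg_thresholds))

def class_threshold(file_thresholds, percentile=50):
    """Reduce the N file-level thresholds of a class to a single value."""
    return float(np.percentile(file_thresholds, percentile))

segments = [[0.2, 0.4], [0.6]]
simple = file_threshold(segments)                   # (0.4 + 0.6) / 2
by_length = file_threshold(segments, weighted=True) # (2 * 0.4 + 0.6) / 3
```

A higher percentile yields a stricter class threshold, since it moves toward the largest file-level values observed for that class.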
Postprocessing subsets
At this point of the postprocessing pipeline, a total of 2P threshold sets are
available, where P denotes the number of percentiles considered in the previous
step. Each threshold is computed using both weighted and simple averaging
approaches, resulting in two variants per percentile.
In order to determine the optimal combination of thresholds and median filter
kernel size per class, a binary prediction matrix must be obtained for every
file and threshold set. However, the raw activation outputs provided by the
analysis files only contain scores for the top 10 detected classes per frame. To
reconstruct the complete activation matrix, missing frame-class combinations are
assigned a value of 0.
Given the reconstructed activation matrix A, predictions are computed for each
threshold set T, kernel size k ∈ K, and class i ∈ C as follows:
\[
\mathrm{filePredictions}_{k,T} = \mathrm{MedianFilter}_k\left(\{\, \mathbb{1}[a_i > T_i] \mid i \in C,\ a \in A \,\}\right), \quad \forall k \in K
\]
This procedure results in 2 · P · |K| prediction sets per file. In the current
configuration, the set of kernel sizes is defined as K = {1, 3, 5, 7} and the
percentiles considered are {50, 60, 70, 80} (P = 4), resulting in 2 · 4 · 4 = 32
different postprocessing configurations.
Finally, for each class, the combination of threshold and kernel size that
maximizes the F1-score on validation data is selected and recorded in the final
postprocessing set.
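As a rough sketch of the bookkeeping described in this subsection, the snippet below rebuilds a dense activation matrix from top-10 style raw output and enumerates the configuration grid. The input layout (one `{class_name: score}` dictionary per frame) and all names are assumptions for illustration.

```python
from itertools import product

import numpy as np

def reconstruct_activations(frame_scores, class_names):
    """Dense (n_frames, n_classes) matrix; missing entries are set to 0."""
    index = {name: i for i, name in enumerate(class_names)}
    A = np.zeros((len(frame_scores), len(class_names)))
    for t, scores in enumerate(frame_scores):
        for name, value in scores.items():
            A[t, index[name]] = value
    return A

KERNEL_SIZES = [1, 3, 5, 7]
PERCENTILES = [50, 60, 70, 80]
AVERAGING = ["simple", "weighted"]

# Every (averaging, percentile, kernel) triple is one candidate
# postprocessing configuration: 2 * 4 * 4 = 32 in total.
CONFIGS = list(product(AVERAGING, PERCENTILES, KERNEL_SIZES))

raw = [{"Dog": 0.9}, {"Cat": 0.4, "Dog": 0.2}]
A = reconstruct_activations(raw, ["Cat", "Dog", "Speech"])
```

One consequence of the zero-filling is that a class outside the top 10 at a given frame can never exceed a positive threshold there, so it is always treated as inactive.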
3.1.2 Tag recommendation methods applied to SED techniques
Sound Event Detection (SED) systems operate across a wide variety of sounds,
each characterized by different levels of polyphony, ambient conditions and
recording equipment. Given this diversity, the per-class parameter selection
methods described in the previous section may lack scalability and
generalizability, especially when applied to large-scale, heterogeneous audio
collections such as Freesound.
To address this limitation, it becomes important to think of adaptive strategies
that do not rely on statistical observations from other sounds. Since the ground
truth annotations provide only weak labels, the SED task can, in this context,
be approached similarly to an audio tagging problem.
In this field of study, the tag selection methods for recommendation purposes
proposed by Font, 2015 [15] are a solid baseline. Among these, the most
effective method described in Section 3.2.3 of that work ("Selection of tags to
recommend") is the percentage strategy, which involves selecting tags whose
scores surpass a fixed percentage of the highest tag score.
Exporting this approach to the current SED task introduces certain
considerations. In fact, this tag recommendation method is designed to always
output at least one tag, and this assumption does not hold in SED, where many
frames can contain no active sound events. Therefore, a class is only considered
active at a given frame if its confidence score surpasses both a static
threshold (e.g., 0.45) and 80% of the highest confidence score observed among
all classes for that frame.
\[
\mathrm{ActiveClass}(c, t) =
\begin{cases}
1 & \text{if } a_{c,t} > \tau_{\min} \text{ and } a_{c,t} > \alpha \cdot \max_{j \in C} a_{j,t} \\
0 & \text{otherwise}
\end{cases}
\]
where $a_{c,t}$ is the activation score of class $c$ at frame $t$, $\tau_{\min}$
is a fixed minimum threshold (0.45), $\alpha$ is the relative threshold
parameter (0.8), and $C$ is the set of all classes.
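The rule above translates directly into a vectorized check. The following is a sketch under the assumption that activations are stored as a (frames x classes) NumPy array; the function name is illustrative.

```python
import numpy as np

def active_classes(A, tau_min=0.45, alpha=0.8):
    """Boolean mask of active classes per frame.

    A: activation matrix of shape (n_frames, n_classes).
    A class is active when its score exceeds both the absolute
    threshold tau_min and alpha times the frame's highest score.
    """
    A = np.asarray(A, dtype=float)
    frame_max = A.max(axis=1, keepdims=True)   # highest score per frame
    return (A > tau_min) & (A > alpha * frame_max)

A = [[0.90, 0.75, 0.50],   # two classes close to the maximum
     [0.40, 0.30, 0.10],   # nothing above tau_min: silent frame
     [0.50, 0.39, 0.00]]   # only the top class survives
mask = active_classes(A)
```

The absolute threshold is what allows a frame to end up with no active class at all, which the original percentage strategy alone would not permit.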
3.1.3 Hierarchy filtering for human-readable results
The current output format of the FSD-SINet model is structured as a dictionary
containing the following elements:
•A list of dictionaries with event detection information, including class name,
time boundaries, and confidence score.
•A list of all unique classes detected within the audio file.
•A list of embeddings resulting from the audio analysis.
While this structure provides the essential data needed to build a visualization
system for Freesound users, two important considerations must be addressed to
improve the human readability and interpretability of the output.
On the one hand, it is important to acknowledge that the vocabulary used by the
FSD50K dataset is derived from a subset of the AudioSet Ontology, which
introduces a valuable but currently overlooked detail: class hierarchy.
Extracting the hierarchical level of each class can significantly help organize
the presentation of sound events during the visualization stage for a more
intuitive and informative user experience. Note that, as the ground truth
vocabulary does not include some 0-level classes (e.g. Sounds of things), the
levels of all their child classes are pushed up.
On the other hand, the current system seeks an integrated display of SED data
within the sound players in the Freesound platform, which poses certain
limitations in screen space. To meet this requirement, avoiding redundancy in
the information display is crucial. Therefore, when multiple classes that share
the same time boundaries within a difference of 1 frame (0.5 seconds) are found
to have a parent-child relationship in the ontology, they are merged into a
single one. In such cases, the child class is selected for display because it
describes the detected sound event more specifically, as seen in Figures 8 and
9. The code implementing the postprocessing and filtering of FSD-SINet results
can be found in Appendix A.
Figure 8: Draft display of raw detections given by the FSD-SINet model current
postprocessing.
Figure 9: Draft display of hierarchically filtered detections.
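The merging rule can be sketched as follows. This is not the code from Appendix A: the event dictionary keys, the single-parent `parent_of` map (the real ontology allows several parents) and the choice to treat any ancestor like a direct parent are assumptions made for illustration.

```python
def filter_hierarchy(events, parent_of, tolerance=0.5):
    """Drop an event when a descendant class covers the same time span.

    events:    list of dicts with "name", "start" and "end" (seconds).
    parent_of: hypothetical {child_class: parent_class} excerpt of the ontology.
    tolerance: maximum boundary difference, one frame (0.5 s) by default.
    """
    def is_ancestor(ancestor, name):
        node = parent_of.get(name)
        while node is not None:
            if node == ancestor:
                return True
            node = parent_of.get(node)
        return False

    dropped = set()
    for i, general in enumerate(events):
        for j, specific in enumerate(events):
            if i == j:
                continue
            same_span = (abs(general["start"] - specific["start"]) <= tolerance
                         and abs(general["end"] - specific["end"]) <= tolerance)
            if same_span and is_ancestor(general["name"], specific["name"]):
                dropped.add(i)   # keep only the more specific class
    return [e for i, e in enumerate(events) if i not in dropped]

events = [{"name": "Dog", "start": 0.0, "end": 2.0},
          {"name": "Bark", "start": 0.0, "end": 1.5},
          {"name": "Speech", "start": 0.0, "end": 2.0}]
ontology = {"Bark": "Dog", "Dog": "Domestic_animals_and_pets"}
filtered = filter_hierarchy(events, ontology)
```

In this toy case "Dog" is removed in favor of "Bark", while "Speech" stays because it is unrelated in the ontology despite sharing the same boundaries.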
3.1.4 Future optimization proposals
Upon reviewing the previously discussed postprocessing strategies, it becomes
clear that none of the presented implementations incorporate strong optimization
techniques or learning-based approaches to refine postprocessing parameters with
the goal of improving evaluation metrics. This highlights a methodological gap,
considering that optimization-driven methods are more aligned with current
research trends in sound event detection.
As seen in Section 2.3.2, the reinforcement learning (RL) approach proposed by
Giannakopoulos et al. (2022) [9] introduces a framework capable of optimizing
the entire postprocessing pipeline of a SED model such as FSD-SINet. While
promising, this approach presents several challenges that must be addressed. The
most notable one is the requirement of strongly labeled annotations, something
the FSD50K dataset does not provide.
Nevertheless, as mentioned in Section 2.2.1, the DESED dataset includes a subset
of recordings with strong labels. Although these annotations are limited to
domestic sound environments, several of the classes can be mapped to the FSD50K
taxonomy. This opens the possibility of training the RL-based optimization
system on the strongly labeled DESED subset and exporting the resulting
parameters to the corresponding class subset in FSD50K. A proposed mapping
between both subsets can be found below, in Table 1. Note that matching classes
is challenging due to the ambiguity in resolving specific classes from one set
to the other.
Table 1: Mapping between DESED classes and FSD50K classes

DESED class | FSD-SINet class
Speech (0) | Child_speech_and_kid_speaking (33), Female_speech_and_woman_speaking (75), Male_speech_and_man_speaking (111), Speech (158), Human_voice (101), Chatter (29), Conversation (43)
Dog (1) | Dog (59), Bark (7), Domestic_animals_and_pets (60)
Cat (2) | Domestic_animals_and_pets (60), Cat (28), Meow (116)
Alarm/bell/ringing (3) | Alarm (4), Bell (11), Ringtone (140), Bicycle_bell (13), Church_bell (38), Cowbell (45), Doorbell (63)
Dishes (4) | Dishes_and_pots_and_pans (58)
Frying (5) | Frying_(food) (83)
Blender (6) | Domestic_sounds_and_home_sounds (61)
Running water (7) | Water (187), Water_tap_and_faucet (188), Bathtub_(filling_or_washing) (10), Fill_(with_liquid) (76), Sink_(filling_or_washing) (151), Toilet_flush (175)
Vacuum cleaner (8) | Domestic_sounds_and_home_sounds (61)
Electric shaver/toothbrush (9) | Domestic_sounds_and_home_sounds (61)
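To illustrate how per-class parameters tuned on DESED might be exported through this mapping, the sketch below uses a small excerpt of Table 1 as a dictionary. The function name, the parameter dictionaries and their values are hypothetical; no such parameters are reported here.

```python
# Excerpt of Table 1 (class indices omitted for brevity).
DESED_TO_FSD50K = {
    "Dog": ["Dog", "Bark", "Domestic_animals_and_pets"],
    "Cat": ["Domestic_animals_and_pets", "Cat", "Meow"],
    "Dishes": ["Dishes_and_pots_and_pans"],
    "Frying": ["Frying_(food)"],
}

def export_parameters(desed_params, mapping=DESED_TO_FSD50K):
    """Collect, for each FSD50K class, the settings of every DESED class mapped to it."""
    fsd_params = {}
    for desed_class, params in desed_params.items():
        for fsd_class in mapping.get(desed_class, []):
            fsd_params.setdefault(fsd_class, []).append(params)
    return fsd_params

# Hypothetical tuned values, for illustration only.
tuned = {"Dog": {"threshold": 0.6, "kernel": 3},
         "Cat": {"threshold": 0.5, "kernel": 5}}
exported = export_parameters(tuned)
```

Keeping a list per FSD50K class makes the ambiguity noted above explicit: "Domestic_animals_and_pets" receives candidate settings from both "Dog" and "Cat", and some rule would be needed to choose between or combine them.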
3.2 Visualization of SED in Freesound
This section presents all procedures regarding the development and
implementation of sound event detection displays in Freesound, from general
considerations to take into account for all upcoming paradigms, to presenting
draft visualizations and implementing the best candidates, and finally designing
how the user satisfaction evaluation will be performed.
3.2.1 General considerations
This subsection outlines the development and implementation process for
displaying sound event detection (SED) results from the FSD-SINet model within
the Freesound platform. It begins with general considerations that apply across
all subsequent visualization paradigms, followed by the presentation of draft
display proposals, the implementation of selected candidates, and the definition
of the methodology to evaluate user satisfaction.
Data features
The visualization of FSD-SINet results for a given audio clip will include the
following elements:
•The class name of the detected event.
•The time boundaries indicating the temporal location of the event.
•The confidence score associated with the detection.
•The hierarchical level of the class within the ontology.
These features can be grouped based on their relevance to the user interface. On
the one hand, class names and time boundaries carry the most critical
information and should therefore be prioritized in any visualization approach.
On the other hand, the confidence score and hierarchy level, while not crucial,
can improve user understanding and interpretation of the detection results. Note
that a common property across all displays is that color coding is used for
class identification, accompanied by a legend appended below the player
container. The set of colors is chosen to avoid interference with the default
Freesound waveform display colors.
As a result, the specific format and encoding of the information may vary
between different visualization proposals, with each one presenting these
parameters using different visual strategies depending on its intended use and
emphasis.
Finally, it is important to mention that the displayed results correspond to the
current postprocessing pipeline of the FSD-SINet model, with the addition of the
ontology filter process described in Section 3.1.3.
User interaction
Similarly to the data features described above, the proposed visualization
paradigms also share a set of common user interaction functionalities. These
shared properties are designed to ensure intuitive and efficient interaction
with detected events.
•Mouse click in e ac ion: Clicking on a displayed sound e en will au oma -
ically mo e he p og ess ba o he e en ’s s a ime and begin playback om
ha poin . This ea u e is designed o acili a e use e alua ion and o enable
quick e i ica ion o de ec ed e en s.
•Mouse ho e unc ionali y: when ho e ing o e a isualized e en , a label
will appea showing key de ails such as he e en ’s class name, con idence sco e
and ime bounda ies. This ensu es ha use s can access de ailed in o ma ion
wi hou clu e ing he main in e ace.
•Overlay display behavior: By default, SED visualizations will not be shown for analyzed sounds. Instead, a toggle button integrated into the audio player's control bar will allow users to show or hide a visual overlay containing the detection information. This layered approach aims to maintain a clean interface while still providing access to the analysis results.
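The click and hover behaviors listed above reduce to two small operations: finding the event under the cursor, and formatting its details. The sketch below illustrates that logic only. The names `Event`, `event_under_cursor`, `seek_time_on_click` and `hover_label` are hypothetical, and the tie-break for overlapping events (prefer the shortest) is an assumption; an actual player would also use the vertical cursor position.

```python
from typing import NamedTuple, Optional, Sequence


class Event(NamedTuple):
    class_name: str
    start: float       # seconds
    end: float         # seconds
    confidence: float  # in [0, 1]


def event_under_cursor(events: Sequence[Event], t: float) -> Optional[Event]:
    """Return the event whose interval contains time t.

    Assumption: when several events overlap at t, the shortest one wins.
    """
    hits = [e for e in events if e.start <= t <= e.end]
    return min(hits, key=lambda e: e.end - e.start) if hits else None


def seek_time_on_click(events: Sequence[Event], t: float) -> Optional[float]:
    """Click behavior: playback jumps to the start of the clicked event, if any."""
    hit = event_under_cursor(events, t)
    return hit.start if hit else None


def hover_label(e: Event) -> str:
    """Hover behavior: class name, confidence as a percentage, time boundaries."""
    return f"{e.class_name} | {e.confidence:.0%} | {e.start:.2f}s - {e.end:.2f}s"


events = [Event("Bird", 1.5, 4.0, 0.82), Event("Wind", 0.0, 10.0, 0.75)]
print(seek_time_on_click(events, 2.0))
print(hover_label(events[0]))
```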
These interaction features are particularly relevant given that the proposed visualizations vary in the amount of information displayed by default. They have also been designed to let users engage with the results in a meaningful and non-intrusive manner.
3.2.2 Visualization drafts
This section introduces a series of visualization drafts, each offering a distinct approach to presenting sound event detection results. Every proposal is outlined with its key strengths and potential limitations, to provide a comprehensive comparison and later choose the best candidates.
VisualiSED 01: Class-wise distributed rectangles
This approach arranges labels along different vertical levels, each corresponding to a separate sound class. Class names are displayed at the start of the player container in their corresponding level, while individual event detections are represented as rectangles extending over their time intervals, as seen in Figure 10. Each rectangle includes the confidence score as text, expressed as a percentage. Whenever space is insufficient for the confidence display due to short sound events, this information is still accessible through mouse hover interaction. A significant strength of this design is its effective handling of overlapping events, which tends to be a challenging task in the visualization of SED outputs. By grouping events by class and allocating each class to a different vertical track, overlap between classes is avoided. However, this approach can run into trouble with sounds containing a high number of distinct classes, since each class requires its own track.
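The class-wise layout described above can be sketched as a mapping from detections to rectangle geometry. This is an illustrative sketch, not the implemented display: the function name `class_wise_layout`, the `track_height` parameter, the first-seen track ordering and the output keys are all assumptions.

```python
def class_wise_layout(detections, clip_duration, track_height=20):
    """Assign one vertical track per class and compute rectangle geometry.

    detections: list of (class_name, start, end, confidence) tuples.
    Returns (tracks, rects). tracks maps class name to row index; each rect
    holds left/width as fractions of the clip duration and top in pixels.
    """
    tracks = {}
    rects = []
    for name, start, end, conf in sorted(detections, key=lambda d: d[1]):
        # Assumption: tracks are ordered by first appearance in time.
        row = tracks.setdefault(name, len(tracks))
        rects.append({
            "class_name": name,
            "left": start / clip_duration,
            "width": (end - start) / clip_duration,
            "top": row * track_height,
            "label": f"{conf:.0%}",  # confidence shown as a percentage
        })
    return tracks, rects


dets = [("Speech", 0.0, 2.0, 0.9), ("Dog", 1.0, 3.0, 0.8), ("Speech", 4.0, 5.0, 0.75)]
tracks, rects = class_wise_layout(dets, clip_duration=10.0)
```

The number of tracks equals the number of distinct classes, which makes the limitation noted above concrete: the required vertical space grows linearly with the number of classes in the clip.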
VisualiSED 02: Level-wise distributed rectangles
To address the space limitations identified in the previous proposal, this method retains the same basic structure (representing events as rectangles indicating their time
3.3.1 Dataset Files
The sounds selected for the final user satisfaction experiment are chosen based on a set of criteria to ensure a representative range of SED behaviors while keeping the experiment efficient.
•Dataset size must be kept small (10-20 sounds) in order to make the experiment agile enough.
•It must contain a representative sample of the various classes present in the FSD50K dataset.
•It must contain diverse files in terms of event density and variety (from field recordings to single-event audio clips).
•Recordings must not be longer than 30 seconds for the analysis to be run without skipping.
Based on these considerations, the following sound IDs have been selected from the Freesound development database: 463464, 75825, 347223, 217543, 685989, 682534, 436790, 49520, 671901, 181628, 437623, 463472. For more information on the type of content, see Table 2, which classifies the sounds according to their AudioSet Ontology parent class and the number of sources present in the sound (single vs. multiple).
Table 2: Experiment dataset
ID      Source    Human  Animals  Music  Ambiguous  Things  Natural
463464  Multiple  True   False    False  False      False   False
347223  Multiple  True   False    False  False      False   False
217543  Multiple  False  False    True   False      False   False
685989  Multiple  False  False    False  False      True    False
682534  Single    False  True     False  False      False   False
436790  Multiple  False  False    False  False      False   True
49520   Multiple  True   True     False  False      False   False
671901  Single    False  False    True   False      False   False
181628  Multiple  False  False    False  True       True    False
463472  Multiple  True   False    True   False      False   False
437623  Multiple  True   False    False  True       True    False
75825   Multiple  False  False    False  False      True    True
3.3.2 Experiment design
To assess user satisfaction with the selected SED visualization designs, an online mock experiment has been conducted using Google Forms. The form presents participants with the previously outlined set of analyzed Freesound sound URLs, containing visualizations generated using the final set of display designs described in the previous section.
The main objective of the experiment is to obtain a general assessment and evaluate the clarity, usability, and perceived usefulness of each visualization approach, as well as the quality of the generated detections. Participants will be asked to listen to each audio clip while observing the corresponding visual representations and then answer a series of short questions to capture their impressions.
The final questionnaire is short and concise to encourage participation. For each visualization design, there is one question addressing the quality of the event annotations, while the remaining ones focus on the effectiveness of the visual design: helpfulness, layout clarity, hypothetical usefulness when exploring sounds in Freesound, and an optional comment box for any suggestions. The 4 sounds visited for each visualization display are randomly selected. The responses have been collected using a 0-10 grading scale and a 5-point Likert scale ranging from Strongly agree to Strongly disagree, as follows.
1. How would you rate the accuracy and quality of the event annotations in this visualization? 0-10 grade scale.
2. The visualization helped me understand the events present in the audio clip. 5-point Likert scale.
3. The layout and visual organization of the events were clear and easy to interpret. 5-point Likert scale.
4. I think this visualization would be useful when exploring sounds in Freesound. 5-point Likert scale.
5. Optional: Any suggestions or observations about this visualization? Text answer.
Additionally, at the end of the survey, there is an optional box to leave any message or observations the user wishes to provide.
Due to the impact that the proposed changes could have on Freesound, both in terms of visualization displays and the integration of the modifications in the FSD-SINet analyzer, the experiment has not yet been distributed to the wider Freesound community. Instead, it has been carried out with a smaller group of participants close to me, including specialists in audio and UX design as well as individuals without prior knowledge of the field.
Chapter 4
Results
4.1 Postprocessing approaches
The current postprocessing strategy of the FSD-SINet model relies on a fixed threshold of 0.7, which is applied to retain only the most confident detections. In this work, two alternative approaches have been explored: (i) class-dependent thresholds and kernel sizes determined through empirical observation, and (ii) the use of tag-recommendation techniques adapted to sound event detection.
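To make the fixed-threshold baseline concrete, the sketch below turns the frame-wise confidences of a single class into events by keeping runs of consecutive frames at or above 0.7. It is a simplified illustration rather than the FSD-SINet analyzer code: the function name `frames_to_events`, the `hop_seconds` frame duration, the use of `>=` rather than `>`, and reporting the mean confidence of each run are all assumptions.

```python
def frames_to_events(confidences, threshold=0.7, hop_seconds=1.0):
    """Group consecutive frames with confidence >= threshold into events.

    Returns a list of (start_seconds, end_seconds, mean_confidence) tuples.
    """
    frames = list(confidences)
    events, run_start = [], None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, c in enumerate(frames + [float("-inf")]):
        active = c >= threshold
        if active and run_start is None:
            run_start = i
        elif not active and run_start is not None:
            run = frames[run_start:i]
            events.append((run_start * hop_seconds, i * hop_seconds, sum(run) / len(run)))
            run_start = None
    return events


print(frames_to_events([0.1, 0.75, 1.0, 0.2, 0.75]))
```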
For the evaluation of these approaches, the input consisted of the raw outputs of the FSD-SINet model, namely the top-10 frame-wise class confidences of the FSD50K evaluation set. Performance was measured by computing the F-Score of each methodology, with the resulting values summarized in Table 3.
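Table 3 reports a macro-averaged F1, that is, the unweighted mean of the per-class F1 scores. The sketch below shows that averaging only; how true positives, false positives and false negatives were counted in the evaluation is not restated here, so the per-class counts are taken as given. The function names and the convention that a class with no counts scores 0 are assumptions.

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of per-class F1.

    counts: dict mapping class name to a (tp, fp, fn) tuple.
    """
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Toy counts for two hypothetical classes, for illustration only.
print(macro_f1({"class_a": (2, 0, 0), "class_b": (1, 1, 1)}))
```

Because every class weighs the same in a macro average, rare classes with poor detections pull the score down as much as frequent ones.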
In the first method, based on empirical observation, parameters were estimated using the development set. For each class, this process produced a specific percentile, a kernel size, and an indication of whether the averaging should account for detection length. Analysis of these results showed that the most common configuration corresponded to the 50th percentile with kernel size k = 1, without length-weighted averaging. However, this choice of parameters is also the least intrusive, making the approach closely resemble the baseline and therefore reducing its potential impact.
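The sketch below gives one plausible reading of a per-class percentile threshold followed by median filtering with kernel size k. It is an assumption-laden illustration, not the procedure of Section 3.1.1: it assumes the threshold is the given percentile of a class's development-set confidences, uses a nearest-rank percentile, zero-pads the edges, and leaves out length-weighted averaging. All function names are hypothetical.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def median_filter(seq, k):
    """Median filter with odd kernel size k, zero-padded at both edges."""
    half = k // 2
    padded = [0] * half + list(seq) + [0] * half
    return [statistics.median(padded[i:i + k]) for i in range(len(seq))]


def smooth_class(confidences, dev_confidences, p=50, k=1):
    """Binarize one class with its percentile threshold, then median-filter."""
    thr = percentile(dev_confidences, p)
    active = [1 if c >= thr else 0 for c in confidences]
    return median_filter(active, k)


frames = [0.1, 0.9, 0.1, 0.9, 0.9, 0.9]
dev = [0.2, 0.4, 0.6, 0.8]
print(smooth_class(frames, dev, p=50, k=1))
print(smooth_class(frames, dev, p=50, k=3))
```

With k = 1 the filter returns the binarized sequence unchanged, which is consistent with the remark above that the most common configuration is also the least intrusive one.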
Considering the F-Score of each postprocessing approach, the final user experiment uses the current static threshold, which obtained the highest score, combined with the hierarchy filtering described in Section 3.1.3.
Figure 16: Distributions of the percentile values P and kernel size values K. [Two panels; x-axes: P values (50, 60, 70, 80) and K values (1, 3, 5, 7); y-axes: Distribution, from 0 to 1.]
Table 3: Comparison of approaches and their performance
Approach                                Useful parameters                                            Macro F1
Per-class threshold and kernel size     Percentiles = [50, 60, 70, 80]; Kernel sizes = [1, 3, 5, 7]  0.3385
Tag recommendation percentual approach  Min. threshold = 0.45; Valid percentage = 0.8                0.344
Current static threshold                Threshold = 0.7                                              0.3869
4.2 Qualitative insights on detections
Given that the ultimate goal of this work is to evaluate the real-world user experience, it is important to provide insight into how the hierarchy filtering step described in Section 3.1.3 affects both the display and the specificity of detections. The full collection of images can be found in Appendix B, though the most notable effects are illustrated in Figure 17.
It is clear that the filtering process not only removes some general detections but also organizes the remaining ones according to a hierarchy, displaying parent-child relationships from top to bottom.
While the aim of this feature is to improve specificity and enhance usability for the end user, it can sometimes lead to less accurate displays. In particular, a parent class with more accurately defined time boundaries may be omitted in favor of more specific, but less precise, child classes. This effect is illustrated in Figure 17, where the broader Bird class better matches the full duration of the bird singing than the more specific Bird vocalization and bird call and bird song class.
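The trade-off described above can be reproduced with a toy filter. The sketch below is not the filter of Section 3.1.3; it only assumes a behavior consistent with this description: a detection is dropped when a detected descendant class overlaps it in time, and the remaining detections are ordered from the top of the hierarchy downwards. The `PARENT` map is a hypothetical ontology excerpt and every function name is illustrative.

```python
# Hypothetical excerpt of an ontology, stored as child -> parent.
PARENT = {
    "Bird": "Wild animals",
    "Bird vocalization and bird call and bird song": "Bird",
}


def ancestors(name, parent=PARENT):
    """All ancestors of a class, from nearest to most general."""
    out = []
    while name in parent:
        name = parent[name]
        out.append(name)
    return out


def overlaps(a, b):
    """True if two (class_name, start, end) detections share any time span."""
    return a[1] < b[2] and b[1] < a[2]


def filter_hierarchy(detections, parent=PARENT):
    """Drop detections shadowed by an overlapping, more specific descendant."""
    kept = []
    for d in detections:
        shadowed = any(
            d[0] in ancestors(o[0], parent) and overlaps(d, o) for o in detections
        )
        if not shadowed:
            kept.append(d)
    # Order from the top of the hierarchy downwards, then by onset time.
    return sorted(kept, key=lambda d: (len(ancestors(d[0], parent)), d[1]))


dets = [
    ("Bird", 0.0, 10.0),
    ("Bird vocalization and bird call and bird song", 2.0, 6.0),
    ("Wind", 0.0, 3.0),
]
print(filter_hierarchy(dets))
```

In this toy input the Bird detection spanning 0 to 10 seconds is removed and only the 2 to 6 second child detection remains, which mirrors the loss of temporal coverage discussed above.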
4.3 User Satisfaction Survey Results
This section presents the results of the user satisfaction survey, which evaluated both the underlying detection quality of the FSD-SINet model and the effectiveness of three different visualization designs for displaying these detections. A total of 12 participants completed the survey. Each participant listened to four sounds per visualization, randomly selected from the set of 12 sounds, ensuring all sounds were evaluated while introducing variation in the sequence. The questionnaire structure is defined in Section 3.3.2. The results are presented in groups by evaluation topic, and all the materials related to the survey can be found in Appendix A.
4.3.1 Annotation accuracy
Given that all visualization designs rely on the same underlying detections, this assessment directly reflects user confidence in the model outputs. The distribution of responses (Figure 18) shows a central tendency around the middle of the scale (Mean = 4.31), which indicates only moderate perceived accuracy, with noticeable variability (SD = 1.34) and a wide range of opinions. While some users considered the event annotations reasonably accurate, others found them less convincing, suggesting that events were sometimes missed or imprecisely localized. These evaluations highlight the need for further refinement in detection performance and postprocessing approaches to improve user interpretability of SED outputs.
In parallel to the survey, a small manually annotated dataset was prepared for the twelve sounds included in the experiment; it can be found in Appendix A. These annotations provide a reference for assessing the alignment between the FSD-SINet detections and the actual acoustic events. Although a detailed quantitative comparison was outside the scope of this work, preliminary observations suggest that discrepancies between manual annotations and model outputs explain the lower user ratings. This dataset can therefore serve as a valuable baseline for future validation and improvement efforts.
4.3.2 Visualization effectiveness
The results of the questions referring to the performance of the different visualization designs reveal differences in how participants perceive the quality of the three displays. Regarding the question of whether the visualization helped users understand the events present in the audio clip (Q2), the Class-Wise design (VS01) received the strongest positive feedback, with 11 participants agreeing and 1 strongly agreeing, and no negative responses. The Detailed Onsets design (VS06) was also positively received, with 8 agreeing and 3 strongly agreeing, though one participant expressed neutrality. The Level-Wise design (VS02) showed more mixed results, with 5 participants agreeing, 3 neutral, and 4 disagreeing, suggesting some users found it less intuitive for interpreting event sequences. Results can be checked in Figure 19.
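The Q2 counts reported above can be condensed into a share of positive responses per design. The sketch below uses only the counts stated in this section; treating "agree" and "strongly agree" together as positive is an assumption of this summary, and the names `Q2_COUNTS` and `positive_share` are illustrative.

```python
# Q2 response counts as reported in the text (12 participants per design).
Q2_COUNTS = {
    "VS01 Class-Wise": {"strongly agree": 1, "agree": 11},
    "VS06 Detailed Onsets": {"strongly agree": 3, "agree": 8, "neutral": 1},
    "VS02 Level-Wise": {"agree": 5, "neutral": 3, "disagree": 4},
}
POSITIVE = {"strongly agree", "agree"}


def positive_share(counts):
    """Fraction of responses that are 'agree' or 'strongly agree'."""
    total = sum(counts.values())
    return sum(n for answer, n in counts.items() if answer in POSITIVE) / total


for design, counts in Q2_COUNTS.items():
    print(f"{design}: {positive_share(counts):.1%} positive")
```

Under this grouping, all Class-Wise responses are positive, 11 of 12 are positive for Detailed Onsets, and 5 of 12 for Level-Wise.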
A similar pattern emerges for layout clarity and ease of interpretation (Q3), see Figure 20. The Class-Wise design again scored highly, with 8 agreeing and 3 strongly agreeing, whereas the Detailed Onsets design had 8 strongly agree responses but fewer agreeing responses overall. The Level-Wise design had more dispersed ratings, including one disagreement and two neutral responses, indicating that hierarchical grouping of events introduces some ambiguity in visual organization.
For perceived usefulness when exploring sounds in Freesound (Q4), both the Class-Wise and Detailed Onsets designs were consistently rated as useful, with 8 or more participants selecting agree or strongly agree. The Level-Wise design received more varied responses, with some participants neutral or disagreeing, reinforcing the observation that while hierarchical grouping may provide structure, it might reduce interpretability and perceived usefulness in practical exploration of sound events (see Figure 21). Overall, these results suggest that the Class-Wise and Detailed Onsets designs were generally preferred by users, offering clearer and more actionable visualizations for understanding sound events.
4.3.3 Qualitative Feedback
Summarizing the users' feedback across the three visualization designs, participants generally highlighted issues with the quality and accuracy of the detections. Many detections were missing, temporally misaligned, or overly general (e.g., "Musical instrument" for orchestral passages, "Domestic animals" instead of "Dog," "Human group action" instead of "Applause"). Long events were often truncated, while background sounds such as footsteps or wind went undetected. Several users emphasized that the weak detection quality made it difficult to evaluate the visualizations themselves.
VisualiSED 02: Hierarchy Level-Wise Display
Users appreciated the use of confidence-dependent borders, but found overlapping rectangles confusing and visually cluttered. Labels missing event names (relying only on colors, with names displayed only on the legend) were seen as problematic for accessibility. Additionally, some users reported that displaying the labels on top of the waveform is confusing, and proposed a separate whole display, similar to the toggle between waveform and spectrogram displays.
VisualiSED 06: Detailed Onsets Display
This design was generally better received. The stacking of overlapping events was considered a clear improvement, making the display less intrusive. However, temporal inconsistencies remained a major issue, particularly for applause and other continuous sounds. Some users criticized redundancy in displaying similar labels (e.g., "Motor Vehicle" multiple times). While still affected by overly general detections, this display was often described as the most comfortable to use.
VisualiSED 01: Class-Wise Display
Several users preferred this design because of the clearer alignment between detection names and their position (labels listed on the left). However, clutter remained a concern, particularly when many events were present. The overlapping of waveform and labels again drew criticism. Despite persistent detection issues, some considered this an improved version of Design A.
General Feedback
Overall, users saw potential in the visualizations but stressed that the system is not yet ready for Freesound integration. Suggestions included separating waveforms from labels to improve readability, and allowing toggling of detections via the legend to address color-blind accessibility. While detection quality was the main limitation, participants agreed that visualization could become a useful feature, especially for browsing longer audio files, where it could save listening time.
Figure 17: Top: Sound 682534 detections before filtering. Bottom: Sound 682534 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
List of Figures
1 DESED dataset composition overview
2 Real (green) vs. predicted (orange) boundaries of an audio event before (up) and after (down) median filtering.
3 Soundcloud comments displayed along the time axis, each at a time instant.
4 Different types of sound annotations generated by Sonic Visualiser, each on a different label. Note that each annotation can have an associated customizable text. Text (blue), notes (red), regions (green), boxes (purple), time instants (bright red, vertical line).
5 Left: BirdNET App in the analysis boundaries setting step. Right: BirdNET App displaying detection results.
6 Freesound display in waveform mode.
7 Freesound display in spectrogram mode.
8 Draft display of raw detections given by the FSD-SINet model current postprocessing.
9 Draft display of hierarchically filtered detections.
10 Draft display of VisualiSED 01
11 Draft display of VisualiSED 02
12 Draft display of VisualiSED 03
13 Draft display of VisualiSED 04
14 Draft display of VisualiSED 05
15 Draft display of VisualiSED 06
16 Distributions of the percentile values P and kernel size values K.
17 Top: Sound 682534 detections before filtering. Bottom: Sound 682534 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
18 User answers regarding quality and accuracy of detections. Mean score = 4.31, StdDev = 1.34, Range = 6.
19 User answers regarding helpfulness of the display.
20 User answers regarding clarity of the display.
21 User answers regarding usefulness of the display.
22 Top: Sound 49520 detections before filtering. Bottom: Sound 49520 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
23 Top: Sound 75825 detections before filtering. Bottom: Sound 75825 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
24 Top: Sound 181628 detections before filtering. Bottom: Sound 181628 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
25 Top: Sound 217543 detections before filtering. Bottom: Sound 217543 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
26 Top: Sound 347223 detections before filtering. Bottom: Sound 347223 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
27 Top: Sound 436790 detections before filtering. Bottom: Sound 436790 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
28 Top: Sound 437623 detections before filtering. Bottom: Sound 437623 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
29 Top: Sound 463464 detections before filtering. Bottom: Sound 463464 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
30 Top: Sound 463472 detections before filtering. Bottom: Sound 463472 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
31 Top: Sound 671901 detections before filtering. Bottom: Sound 671901 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
32 Top: Sound 685989 detections before filtering. Bottom: Sound 685989 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
33 Top: Sound 685989 detections before filtering. Bottom: Sound 685989 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
List of Tables
1 Mapping between DESED classes and FSD50K classes
2 Experiment dataset
3 Comparison of approaches and their performance
Bibliog aphy
[1] Gemmeke, J. F. e al. Audio se : An on ology and human-labeled da ase o
audio e en s. In 2017 IEEE In e na ional Con e ence on Acous ics, Speech and
Signal P ocessing (ICASSP), 776–780 (2017).
[2] Fonseca, E., Fa o y, X., Pons, J., Fon , F. & Se a, X. Fsd50k: An open da ase
o human-labeled sound e en s (2022). URL h ps://a xi .o g/abs/2010.
00475.2010.00475.
[3] Mesa os, A., Hei ola, T., Vi anen, T. & Plumbley, M. D. Sound e en de ec-
ion: A u o ial. IEEE Signal P ocessing Magazine 38, 67–83 (2021).
[4] Ada anne, S., Pe ilä, P. & Vi anen, T. Sound e en de ec ion using spa ial
ea u es and con olu ional ecu en neu al ne wo k (2017). URL h ps://
a xi .o g/abs/1706.02291.1706.02291.
[5] Kong, Q., Xu, Y., Sobie aj, I., Wang, W. & Plumbley, M. D. Sound e en de ec-
ion and ime– equency segmen a ion om weakly labelled da a. IEEE/ACM
T ansac ions on Audio, Speech, and Language P ocessing 27, 777–787 (2019).
URL h p://dx.doi.o g/10.1109/TASLP.2019.2895254.
[6] Caki , E., Pa ascandolo, G., Hei ola, T., Hu unen, H. & Vi anen, T. Con-
olu ional ecu en neu al ne wo ks o polyphonic sound e en de ec ion.
IEEE/ACM T ansac ions on Audio, Speech, and Language P ocessing 25,
1291–1303 (2017). URL h p://dx.doi.o g/10.1109/TASLP.2017.2690575.
56
BIBLIOGRAPHY 57
[7] Fonseca, E., Fe a o, A. & Se a, X. Imp o ing sound e en classi ica ion by
inc easing shi in a iance in con olu ional neu al ne wo ks (2021). URL h ps:
//a xi .o g/abs/2107.00623.2107.00623.
[8] Mesa os, A., Hei ola, T. & Vi anen, T. Me ics o polyphonic sound e en
de ec ion. Applied Sciences 6, 162 (2016).
[9] Giannakopoulos, P., Pik akis, A. & Co onis, Y. Imp o ing pos -p ocessing o
audio e en de ec o s using ein o cemen lea ning. IEEE Access 10, 84398–
84404 (2022).
[10] Kahl, S., Wood, C. M., Eibl, M. & Klinck, H. Bi dne : A deep lea ning solu ion
o a ian di e si y moni o ing. Ecological In o ma ics 61, 101236 (2021).
[11] SoundCloud Help Cen e . Commen ing basics (2025). URL h ps://help.
soundcloud.com/hc/en-us/a icles/115003566008-Commen ing-basics.
Accessed: 2025-07-01.
[12] Cannam, C., Landone, C. & Sandle , M. Sonic isualise : an open sou ce
applica ion o iewing, analysing, and anno a ing music audio iles. 1467–1468
(2010).
[13] Boe sma, P. & Weenink, D. P aa , a sys em o doing phone ics by compu e .
Glo in e na ional 5, 341–345 (2001).
[14] Co nell Lab o O ni hology and Chemni z Uni e si y o Technology. Bi dne :
Bi d sound iden i ica ion. h ps://play.google.com/s o e/apps/de ails?
id=de. u_chemni z.mi.kahs .bi dne (2025). Mobile applica ion; And oid;
upda ed 2025-06-12.
[15] Co be a, F. F. Tag Recommenda ion using Folksonomy In o ma ion o On-
line Sound Sha ing Pla o ms. Ph.d. disse a ion, Uni e si a Pompeu Fab a,
Ba celona, Spain (2015). URL h ps://www. dx.ca /handle/10803/296797.

Appendix A
Resources and Code availability
Freesound repository branch containing the main code of the project regarding visualizations and UI/UX.
Freesound audio analyzers repository branch containing the code for the postprocessing and filtering of the outputs of the FSD-SINet model.
Dataset useful information and manual annotations in Google Sheets format.
User experiment form in Google Forms format.
User experiment results collected in a Google Sheets document.
Appendix B
Filtering results
Figure 22: Top: Sound 49520 detections before filtering. Bottom: Sound 49520 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
Figure 23: Top: Sound 75825 detections before filtering. Bottom: Sound 75825 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
Figure 30: Top: Sound 463472 detections before filtering. Bottom: Sound 463472 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.

Figure 31: Top: Sound 671901 detections before filtering. Bottom: Sound 671901 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
Figure 32: Top: Sound 685989 detections before filtering. Bottom: Sound 685989 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
Figure 33: Top: Sound 685989 detections before filtering. Bottom: Sound 685989 filtered detections. Both displays follow the visualization technique described in Section 3.2.3.
