MASTER’S THESIS
Interactive machine learning for music classification
Author:
Danila Alexandrovich Danilin u254724 danila.alexandrovichdanilin01@estudiant.upf.edu
Supervisors:
Dmitry Bogdanov Music Technology Group (UPF) dmitry.bogdanov[email protected]
Pablo Alonso Jimenez Music Technology Group (UPF) [email protected]
Music Technology Group, Universitat Pompeu Fabra, Barcelona
August 31, 2025
Abstract
Audio embeddings are a promising approach to music representation, in part thanks to their ability to extract complex patterns from audio data; the predictive power of audio embeddings is utilized for a semantically meaningful, two-dimensional visualization of music data in a user interface (UI) which has been developed as part of this thesis research. As a contribution to ongoing research on the intersection between music information retrieval (MIR) and interactive machine learning (IML), the UI allows users to iteratively train a classifier for numerous audio classification tasks. As part of this research, the certainty-based class prediction uncertainty (CPU) heuristic and the dataset coverage (DC) heuristic are proposed; these heuristics are shown to identify informative samples in music collections, and their efficiency is objectively evaluated by means of simulated, iterative active learning (AL) classification tasks for 6 different embedding-dataset pairs. The objective evaluations have shown promising results, in which high classification accuracies are shown to be achieved in fewer iterations in AL classification tasks.
Acknowledgments
I would like to thank my family, friends, thesis supervisors, professors and academic peers for abundantly providing motivation, support and lots of beautiful moments during the next, insightful and musical step in my academic journey; I am very grateful. I would also like to thank my teammates from the AI Song Contest in 2022 for introducing me to the field of computational musicology; for that I am also very grateful.

The fruits of the music-human symbiosis are profound, and its echoes continue to inspire humankind to connect, find meaning, and do good. In part thanks to this symbiosis, we have been able to reach the current levels of societal and technological development. A guiding star and a faithful companion in our journey through life, it continues to patiently nudge humankind to stay on track and be a meaningful part of this reality; that is the beauty of music.
Table of Contents
1 Introduction
2 Background
  2.1 Audio embeddings
    2.1.1 Overview of existing audio embedding models
    2.1.2 Latent space
  2.2 Human-in-the-loop machine learning
    2.2.1 Active learning (AL)
    2.2.2 Interactive machine learning (IML)
  2.3 Existing work on AL/IML and audio
3 Heuristics for (re-)annotation and classification tasks
  3.1 Class prediction uncertainty (CPU)
  3.2 Dataset Coverage (DC)
4 User Interface
  4.1 Dash Plotly
  4.2 Setting up the UI
  4.3 UI fundamentals: a basic music classification loop
5 Methods
  5.1 Objective evaluation
    5.1.1 Audio embeddings
    5.1.2 Datasets
    5.1.3 AL strategies
  5.2 Subjective user evaluation
    5.2.1 Questionnaire
    5.2.2 Participants
6 Results
  6.1 Objective evaluation results
    6.1.1 Overview
    6.1.2 MAEST and GTZAN
    6.1.3 MAEST and Moods MIREX
    6.1.4 MAEST and Freesound Loop Dataset
    6.1.5 CLAP and GTZAN
    6.1.6 CLAP and Moods MIREX
    6.1.7 CLAP and Freesound Loop Dataset
  6.2 Subjective evaluations
    6.2.1 Participant 1
    6.2.2 Participant 2
    6.2.3 Participant 3
    6.2.4 Participant 4
7 Discussion and future work
  7.1 CPU and DC heuristics
  7.2 User Interface
  7.3 Other directions for future research
8 References
1 Introduction
In his manifesto on the strengths and limitations of contemporary music information retrieval (MIR) research, Widmer made an interesting point: computers cannot distinguish between songs that an individual might find boring or interesting (Widmer, 2016). Understanding individual preferences, which is necessary for answering questions of a nature similar to the one posed by Widmer in 2016, requires the consideration of a very wide range of factors, including cultural, educational, regional and social background, age, and even personality traits (Bogdanov, 2013; Matrosova, 2024), but also the highly dynamic nature of listener context (Ben Sassi & Ben Yahia, 2020).
In their works from the 2010s, both Bogdanov and Widmer have foreseen that computational algorithms will have a deeper understanding of music content and individual (contextual) music preferences (Bogdanov, 2013; Widmer, 2016). In line with this foresight, there currently exist deep learning-based methods which automatically process audio signals into semantically meaningful, lower-dimensional representations (Zaman et al., 2023), commonly referred to as audio embeddings (Alonso-Jiménez et al., 2023). Audio embedding models have the ability to extract highly complex patterns from audio data, and yield higher accuracies in classification tasks compared to traditional classification methods (Zaman et al., 2023).
The exceptional performance of audio embeddings in classification tasks forms the core foundation on which this research is built; in this research, an interactive user interface (UI) is introduced, in which users are able to explore and annotate their music collections. The predictive power of audio embeddings is utilized for a semantically meaningful, two-dimensional visualization of music data, and it forms the basis of two ML-based heuristics which are proposed as part of this research: the certainty-based class prediction uncertainty (CPU) heuristic, and the dataset coverage (DC) heuristic. In the UI, users can iteratively train a classifier which can return class predictions for audio files. Consistent with the active learning (AL) paradigm, the proposed heuristics highlight informative tracks which are suitable candidates for user annotation in order to train the classifier efficiently. By annotating informative data points, as opposed to random data points, high classification accuracies can be achieved within fewer training iterations (Joshi et al., 2012; Konyushkova et al., 2017). The efficiency of the CPU and DC heuristics in highlighting informative data points is reflected in promising results from AL simulations with a number of embedding-dataset pairs, carried out as part of this research.
By allowing the user to steer the classifier training process in accordance with their own personal taxonomies, this UI is a contribution to ongoing research on the intersection between MIR and interactive machine learning (IML). Besides its contribution towards a better understanding of music collections, the identification of informative samples in music collections has the potential to advance musicological research through efficient dataset creation, among other useful applications. Furthermore, the UI is indifferent to both the chosen audio embedding and the provided music collection, granting users the freedom to inspect their music collections from different perspectives, while simultaneously accounting for future efforts made towards the development of general-purpose audio embeddings which achieve high accuracies in classification tasks (Alonso-Jiménez et al., 2023).
In this thesis document, the advantages and potential drawbacks of the proposed approach to music classification are discussed in detail, with the aim of shining a light on possible directions for future research in computational musicology. Lastly, there is an important nuance to point out; in this research, certainly not all of the musical understanding is offloaded to computational models and algorithms. You, the reader, have a crucial role to play as well. Therefore, I invite you to interact with the environment¹. I wish you a pleasant reading, and a fruitful journey in music exploration and understanding.
¹ https://github.com/danilka u/interactive-ML-music-classifier
2 Background
2.1 Audio embeddings
One of the major challenges in computational musicology is the development of data structures which appropriately represent music, and, more abstractly, encode musical meaning (Volk et al., 2011). A long-standing, related challenge in scientific literature has been the attempt to encode musicological and semantic information about audio files in descriptive representations (Serra et al., 2013). Even though audio files² accurately capture the linear progression of music pieces in the temporal domain, they do not implicitly encode abstract musical features such as genre, key, tempo and mood (Serra et al., 2013). Therefore, a common approach in the field of MIR has been the extraction of musical features at different levels of abstraction with the use of computational algorithms (Serra et al., 2013).
With the recent advancements in the field of deep learning (DL), a new, compact representation of audio has been widely researched in scientific literature, commonly referred to as audio embeddings. Audio embeddings are representations of audio signals with a relatively low dimensionality (Alonso-Jiménez et al., 2023), and they are generated from neural networks which are trained on audio datasets (Zaman et al., 2023), consequently capturing a multitude of complex musical features which are stored in a compact, vector-based form.

Figure 1: Part of a CLAP audio embedding (Chen et al., 2022; Elizalde et al., 2022; Wu et al., 2024) representing the reggae.00007 audio file from the GTZAN (Tzanetakis & Cook, 2002) dataset. The full embedding vector has a dimensionality of (1024,).
Audio embeddings have proven to be a very significant contribution to the field of MIR, in part due to their state-of-the-art performance in audio classification tasks (Zaman et al., 2023; Zhang et al., 2025). Audio embeddings outperform traditional computational classification models, which typically rely on an intermediate manual or algorithmic music feature extraction step; audio embedding models, however, do feature extraction automatically, and are often able to capture more abstract patterns which can help with distinguishing between classes more accurately (Schmid et al., 2023; Zaman et al., 2023).
The performance of audio embedding models in classification tasks is dependent on the datasets which they are trained on; to ensure that high classification accuracies are achieved, audio embedding models must be trained on very large datasets, requiring significant computational resources and human annotation efforts (Zaman et al., 2023). However, after the model has been trained, class predictions for new input data can be retrieved with relatively simple classifiers such as shallow multilayer perceptrons (MLPs) (Alonso-Jiménez et al., 2023; Schmid et al., 2023). Using knowledge from pre-trained models for downstream classification tasks is a concept commonly referred to as transfer learning (Van Den Oord et al., 2014).
2.1.1 Overview of existing audio embedding models
Current audio embedding models are generally part of one of two broad categories: task-specific audio embeddings, and general-purpose audio embeddings. A further distinction can be made with respect to the underlying neural network architectures in the embedding models; the majority of embedding models are based on convolutional neural networks (CNNs), recurrent neural networks (RNNs), autoencoders, transformers, and hybrid architectures which integrate one or more of the aforementioned approaches (Zaman et al., 2023).

² .wav, .mp3, .ogg, etc.
Task-specific audio embeddings are typically developed with the aim to achieve high accuracies in specific classification tasks; commonly researched directions include genre classification, mood classification, speech recognition, environmental sound classification, and more (Zaman et al., 2023). Examples of embedding models which achieved state-of-the-art classification accuracies in genre classification tasks include the CNN-based musicnn (Pons & Serra, 2019), and the more recent, transformer-based MAEST (Alonso-Jiménez et al., 2023) models. Both models were trained on large and annotated datasets under the supervised learning paradigm. Inspired by language learning in humans, Baevski et al. have attempted to create a model which can extract complex features from small datasets; the resulting, self-supervised wav2vec 2.0 model achieved state-of-the-art performance in speech recognition classification tasks (Baevski et al., 2020).
A significant amount of research is also being dedicated to the development of general-purpose audio embedding models which can simultaneously address speech, audio event and audio tasks (Alonso-Jiménez et al., 2023). Measuring the performance of general-purpose audio embedding models is commonly done with the HEAR (Turian et al., 2022) and HARES (Wang et al., 2021) benchmarks (Alonso-Jiménez et al., 2023; Schmid et al., 2023), which cover a variety of audio classification tasks for speech, music and environmental sound (Schmid et al., 2023).
Examples of general audio embedding models include CLAP, introduced by Elizalde et al. in 2022, a model which produces embeddings from audio and text input; a key property of the CLAP model is its ability to understand audio through natural language, and vice versa (Elizalde et al., 2022). The CLAP model achieved state-of-the-art performance in 16 downstream tasks across 8 different audio domains (Elizalde et al., 2022). Efforts have also been made to fine-tune task-specific embedding models to achieve high performance in domains beyond the one which the model was originally trained for; in their 2022 research, Ragano et al. have successfully fine-tuned the wav2vec 2.0 model to achieve high classification accuracies for music classification tasks (Ragano et al., 2022).

More recently, Schmid et al. have introduced the mn01, mn10 and mn30 models, and succeeded in reducing computational resource demands compared to earlier general audio embedding models, while simultaneously achieving state-of-the-art performance in HEAR benchmark classification tasks (Schmid et al., 2023).
2.1.2 Latent space
A powerful property of audio embeddings is that their vectors can be seen as coordinates in a multidimensional, semantically meaningful latent space; embedding vectors of 'similar' music pieces have coordinates which are close to each other. Similarity of embedding vectors in the latent space is dependent on the chosen dataset and target classification task (Tahiroğlu & Wyse, 2024); for instance, if the model was trained to predict genres, then similarity in the latent space will mostly be related to genre.
The latent space of audio embeddings has a number of useful applications. One such application is the visualization of music; in 2022, Tovstogan et al. have introduced a user interface in which audio embeddings are reduced in dimensionality to a two-dimensional, semantically meaningful representation, allowing for exploration and rediscovery within music collections (Tovstogan et al., 2022). Similarly, Lanzendörfer et al. have introduced Audio Atlas, an interactive web application for the visualization of audio data using CLAP (Elizalde et al., 2022) embeddings (Lanzendörfer et al., 2024).

Furthermore, the latent space can also be used for creative applications; in their 2024 research, Tahiroğlu & Wyse have outlined the potential of latent spaces as a tool for computational music creativity, showing the possibilities of the exploration and subsequent sonification of regions in (continuous) latent spaces (Tahiroğlu & Wyse, 2024).
Consistent with ongoing efforts for understanding the inner workings of ML models, Zhang et al. have recently published a work on interpretability in audio embedding models; in their 2025 work, Zhang et al. aim to understand the semantics of audio data which is encoded in CLAP (Elizalde et al., 2022) embeddings (Zhang et al., 2025). Earlier works on audio interpretability have approached interpretability by highlighting features in spectrograms that significantly contribute to model decisions, exemplified by the 2019 research by Won et al., and by observing the prediction output of ML models when passing large numbers of input data (Zhang et al., 2025).
2.2 Human-in-the-loop machine learning
A growing amount of research in the domain of machine learning is being carried out on the concept of human-in-the-loop machine learning, a subfield which is centered around the idea of human users interactively training machine learning algorithms (Mosqueira-Rey et al., 2022). The goal of human-in-the-loop machine learning is to integrate human knowledge, emotional state and practical capability with ML algorithms, such that algorithms are trained as quickly and effectively as possible, while still achieving good classification accuracies (Joshi et al., 2012; Konyushkova et al., 2017).
2.2.1 Active learning (AL)
Active learning (AL) is an approach by which annotation costs can potentially be reduced effectively (Wang et al., 2019). By selecting the most informative candidate data points for the iterative expansion of the training set (Maia et al., 2024), active learning allows an algorithm to be trained as quickly and effectively as possible while still achieving good classification accuracies (Joshi et al., 2012; Konyushkova et al., 2017). To achieve this goal, active learning systems allow human 'oracles' to improve the learning process by providing labels for unannotated data in real time (Sarasúa et al., 2012; Mosqueira-Rey et al., 2022).
Algorithm 1 contains a general outline of an active learning loop, adapted from the 2007 research by Schein & Ungar and slightly adjusted:

Algorithm 1 Active learning (AL) loop
Require: Dataset consisting of labeled and unlabeled data $(X, Y)$, human annotator
Initialize training set $(X_{train}, Y_{train})$
while target classifier accuracy not reached do
    Calculate AL heuristic values for all unlabeled samples
    Highlight top $k$ samples as candidates for annotation in accordance with heuristic ($X_k$)
    Let human annotator annotate the highlighted $k$ samples ($Y_k$)
    Add annotated samples to the training set: $X_{train} \leftarrow X_{train} + X_k$, $Y_{train} \leftarrow Y_{train} + Y_k$
    Train classifier on $X_{train}$, $Y_{train}$
end while
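As a rough illustration of Algorithm 1, the loop can be sketched in Python. This is a minimal sketch under stated assumptions and not the implementation used in this research: a nearest-centroid classifier stands in for the actual classifier, the heuristic is a random-score placeholder, the human annotator is simulated by reading ground-truth labels from `y`, and accuracy is measured on the samples that are still unlabeled. All function names are hypothetical.

```python
import numpy as np


def fit_centroids(X, y):
    """Stand-in classifier (assumption): one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(model, X):
    """Assign every row of X to the class of its nearest centroid."""
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def random_heuristic(model, X_pool, rng):
    """Placeholder heuristic: every unlabeled sample gets a random score."""
    return rng.random(len(X_pool))


def active_learning_loop(X, y, n_init=5, k=5, target_acc=0.95, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the training set with n_init randomly chosen samples.
    labeled = rng.choice(len(X), size=n_init, replace=False).tolist()
    history = []
    while True:
        # Train the classifier on the current training set.
        model = fit_centroids(X[labeled], y[labeled])
        pool = np.setdiff1d(np.arange(len(X)), labeled)
        if len(pool) == 0:
            break
        # Assumption: accuracy is evaluated on the remaining unlabeled pool.
        acc = float((predict(model, X[pool]) == y[pool]).mean())
        history.append(acc)
        if acc >= target_acc:
            break
        # Score the pool, highlight the top k samples, and "annotate" them
        # by looking up their ground-truth labels (simulated oracle).
        scores = random_heuristic(model, X[pool], rng)
        top_k = pool[np.argsort(scores)[::-1][:k]]
        labeled.extend(top_k.tolist())
    return history, labeled
```

Swapping `random_heuristic` for an informativeness score is the only change needed to turn this baseline into a heuristic-driven loop.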
An important premise of the AL paradigm is that, rather than augmenting the training set of the classifier with random samples, the classifier will reach some acceptable accuracy faster by selecting the most informative samples at every training iteration $i$, and adding them to the training set of the classifier at the next iteration $i + 1$ (Schein & Ungar, 2007).
The informativeness of samples in AL is often quantified through heuristics; samples which are highlighted by a heuristic are considered to be suitable candidates for the expansion of the training set in future AL iterations (Konyushkova et al., 2017). One of the most common AL heuristics is the minimization of class prediction uncertainty in classifiers; samples which have a high class prediction uncertainty get highlighted as candidates for addition to the training set in future iterations (Konyushkova et al., 2017). This approach to AL is referred to as certainty-based AL (Shuyang et al., 2017). In their 2012 research on AL for personalized music emotion recognition, Su & Fung have implemented a variation of this heuristic, in which the training set is extended with instances which have a class posterior probability closest to 0.5 (Su & Fung, 2012).
Similarly, Du et al. have connected informativeness with prediction uncertainty in their 2015 paper on strategies for improving AL algorithms (Du et al., 2015). In their paper, Du et al. refer to the best-versus-second-best (BvSB) method, in which for every unannotated data point, the delta between the two highest posterior probabilities of a binary/multi-label classifier prediction is considered as a measure of uncertainty (Du et al., 2015). Thus, the smaller the delta between the two highest posterior probabilities, the higher the prediction uncertainty of a given data point can be considered to be. The BvSB measure tends to select candidate points which have the property of being close to the hyperplane which distinguishes two or more classes (Du et al., 2015).
Additionally, several AL heuristics which are not certainty-based have been proposed in literature; in their 2012 research, Joshi et al. have proposed a heuristic which maximizes coverage of training points in a dataset, ensuring that all points have a nearest labeled data point within some bounded distance (Joshi et al., 2012). Moreover, data-driven approaches to AL have been proposed, in which candidate sample selection strategies are learned based on experience from previous AL outcomes; in their 2017 research, Konyushkova et al. have trained a regression model which can predict, for a given data point, the expected error reduction at a particular iteration in an active learning loop (Konyushkova et al., 2017).
2.2.2 Interactive machine learning (IML)
Interactive machine learning (IML) is another approach towards human-in-the-loop machine learning (Mosqueira-Rey et al., 2022). The IML approach builds upon the principles of AL, but is a more human-centered approach; in the AL paradigm, the human annotator needs to label highlighted samples, and repeat this task for a prolonged time, as a result of which human factors such as distraction and fatigue can come into play (Mosqueira-Rey et al., 2022). In IML environments, however, the human annotator is not necessarily required to only annotate highlighted data points, and human annotators can interact with the environment beyond the annotation loop, in a free and less structured manner (Mosqueira-Rey et al., 2022).
2.3 Existing work on AL/IML and audio
The potential of applying AL to MIR tasks has been identified as early as 2012 by Sarasúa et al., who used active learning techniques for music mood classification tasks (Sarasúa et al., 2012). Sarasúa et al. have employed different selection strategies, and their findings included the importance of choosing data points such that the whole dataset space is covered (Sarasúa et al., 2012), an idea which is also central to the coverage-based heuristic proposed by Joshi et al. in their 2012 research (Joshi et al., 2012).
In 2019, Wang et al. have developed a sound classification model to detect artifact noise in sound recordings from the Sounds of New York City (SONYC) project, an initiative for mitigating urban noise pollution in New York City (Wang et al., 2019). The aim was the identification of an sound artifacts, in which 15 human annotators were helped by certainty-based AL heuristics (Wang et al., 2019).
In their 2021 research, Hilasaca et al. have proposed a visual framework for data annotation; in the visual framework, audio files are clustered based on extracted numerical features, and subsequently presented to the user in an interactive 2D space, allowing the user to explore and label soundscape ecology³ data, as part of a long-term ecological research project in the Cantareira-Mantiqueira corridor in Brazil (Hilasaca et al., 2021).
³ The academic study of acoustical patterns which are emanated from landscapes (Pijanowski et al., 2011)
Audio previewing
An important addition to the UI is the implementation of audio playback on hover; by hovering over the data points in the 2D embedding space, users can play back the audio files which are part of the uploaded music collection. This can be done by providing the local folder path containing the audio files which have the same filename (without extension, e.g. .mp3/.wav) as the names stored in the id fields in the uploaded embedding file. Because the matching happens on id strings and not on file paths, in principle, the user is free to pass a local folder path which contains alternative versions of the audio (e.g. full version, snippets, alternative mixes/masters, etc.). The audio playback on hover functionality has been proposed in earlier research, including the 2020 research by Tovstogan et al. on the exploration of latent spaces of music collections (Tovstogan et al., 2020).
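The id-based matching described above can be sketched as follows. This is an illustrative sketch only: the function name `match_audio_files` and the default extension list are assumptions, and the actual UI code may resolve files differently.

```python
from pathlib import Path


def match_audio_files(ids, folder, extensions=(".mp3", ".wav")):
    """Map each id to the first file in `folder` whose stem equals the id."""
    by_stem = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in extensions:
            # Path.stem strips only the last suffix, so an id such as
            # "reggae.00007" still matches the file "reggae.00007.wav".
            by_stem.setdefault(path.stem, path)
    # Ids without a matching audio file are simply left out.
    return {i: by_stem[i] for i in ids if i in by_stem}
```

Because the lookup is keyed on the filename stem only, pointing `folder` at a directory with snippets or alternative masters works without changing the embedding file.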
5 Methods
To quantify the relevance of the user interface, as well as the relevance of the proposed heuristics for improved music classification, objective and subjective evaluations are carried out as part of this research. In the objective evaluations, the heuristics are tested by means of simulated active learning loops. In the subjective evaluations, 4 participants are asked to interact with the user interface, and to answer a number of questions related to the user interface.
5.1 Objective evaluation
With the aim of quantifying the relevance of the heuristics which have been introduced in this research, an experimental setup is proposed in which active learning (AL) classification accuracies are measured for 3 different active learning strategies: expanding the training set with random samples, expanding the training set with candidates highlighted by the CPU heuristic, and expanding the training set with candidates highlighted by the DC heuristic.
5.1.1 Audio embeddings
The active learning strategies are simulated for 2 different audio embeddings:
• MAEST (Alonso-Jiménez et al., 2023);
• CLAP (Elizalde et al., 2022; Chen et al., 2022; Wu et al., 2024).
MAEST
The MAEST (Alonso-Jiménez et al., 2023) audio embeddings have been generated with the associated GitHub repository⁶. For this audio embedding, all audio files have been converted to mono format, with a sample rate of 16 kHz.

The MAEST model which has been used for embedding generation is discogs-maest-10s-pw-129e, and it requires an audio file to have a duration of at least 10 seconds. Furthermore, consistent with the findings by Alonso-Jiménez et al. that the middle blocks of the transformer in the MAEST model feature the best representation for downstream classification tasks (Alonso-Jiménez et al., 2023), the 6th transformer block is used.

The discogs-maest-10s-pw-129e MAEST model generates an embedding vector for every 10 seconds of audio. For audio files which are longer than 10 seconds, the embedding vectors are averaged to yield one embedding vector with a dimensionality of (768,).
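The averaging step can be illustrated with a short NumPy sketch. The function name `average_embedding` is hypothetical and the random input merely stands in for real per-segment MAEST embeddings; the sketch only shows how several 768-dimensional segment vectors collapse into one track-level vector.

```python
import numpy as np


def average_embedding(segment_embeddings):
    """Collapse per-segment embeddings of shape (n_segments, dim) into (dim,)."""
    # atleast_2d lets a single (dim,) vector pass through unchanged.
    segments = np.atleast_2d(np.asarray(segment_embeddings, dtype=float))
    return segments.mean(axis=0)


# Placeholder data: a 30-second track would yield three 10-second segments.
segments = np.random.default_rng(0).normal(size=(3, 768))
track_embedding = average_embedding(segments)
print(track_embedding.shape)  # (768,)
```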
⁶ https://github.com/palonso/MAEST
CLAP
The CLAP (Elizalde et al., 2022) audio embeddings have been generated with LAION-CLAP, a GitHub repository associated with 2022 research by Chen et al. and 2024 research by Wu et al.⁷. For this audio embedding, all audio files have been converted to a sample rate of 48 kHz.

LAION-CLAP offers the possibility to choose different pretrained checkpoints; however, within the scope of this research, the default pretrained checkpoint has been used. For every audio file, the LAION-CLAP model generates embeddings with a dimensionality of (1024,).
Embedding file structure
The embedding vectors are encapsulated in a .json file, as outlined in section 4.3. Besides the embeddings, an ID, a genre/label/mood (descriptor) and a BPM are stored for every song. The ID and descriptor are retrieved from a .csv file which contains all song IDs and descriptors, and depending on the dataset this file can be generated programmatically or manually.
5.1.2 Datasets
The active learning strategies are applied to 3 different classification tasks with the following datasets:
• GTZAN (Tzanetakis & Cook, 2002);
• Moods MIREX (Hu & Downie, 2007);
• Freesound Loop Dataset (Ramires et al., 2020).
GTZAN
The GTZAN dataset consists of 1000 tracks with a duration of 30 seconds, and the dataset has been created for genre classification tasks. The number of unique classes in the GTZAN dataset is 10.
Moods MIREX
The Moods MIREX dataset consists of 269 tracks, and the dataset has been created for mood classification tasks. The number of unique classes in the Moods MIREX dataset is 5.
Freesound Loop Dataset
The Freesound Loop Dataset consists of 9455 loops which have been retrieved from Freesound (Font et al., 2013). The number of unique classes in the dataset is 6. Before generating the embeddings for the Freesound Loop Dataset, a few data preprocessing steps have been applied. First of all, all tracks which have more than one label have been filtered out. After this step, only 28 loops with the class "vocal" remained, thus this class has been removed from the data. Next, all loops which have a duration of less than 10 seconds have been filtered out. After this step, the class with the fewest remaining loops was "bass" with a loop count of 82. Therefore, the decision has been made to create a subset of 400 loops which are randomly sampled from the pool of 842 suitable loops.

The resulting subset consists of 400 samples: 78 "percussion" loops, 89 "fx" loops, 70 "bass" loops, 76 "melody" loops, and 87 "chords" loops.
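The preprocessing steps above can be sketched in a few lines of Python. This is a hedged illustration, not the script used in this research: the record layout (dictionaries with `labels` and `duration` keys), the function name `build_subset` and the fixed seed are all assumptions made for the example.

```python
import random


def build_subset(loops, min_duration=10.0, dropped_classes=("vocal",),
                 subset_size=400, seed=0):
    """Filter loops as described above, then draw a random subset."""
    suitable = [
        loop for loop in loops
        if len(loop["labels"]) == 1                     # single-label loops only
        and loop["labels"][0] not in dropped_classes    # drop the sparse class
        and loop["duration"] >= min_duration            # at least 10 seconds
    ]
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    return rng.sample(suitable, min(subset_size, len(suitable)))
```

Since the subset is drawn uniformly from the pool of suitable loops, the class counts in the subset follow the class proportions of the pool only approximately.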
⁷ https://github.com/LAION-AI/CLAP
5.1.3 AL strategies
Six different active learning (AL) strategies are evaluated; in each strategy, unannotated data points, represented by audio embedding vectors which have been extracted from the associated audio files, are either randomly chosen, or chosen in accordance with the class prediction uncertainty (CPU) or dataset coverage (DC) heuristic, and subsequently added to an expanding training set in order to iteratively train a multilayer perceptron (MLP) classifier. Within the scope of this research, the AL strategy with random sample selection is referred to as random baseline (RB).

For each chosen data point, the associated class label is extracted from the dataset, and subsequently the embedding-label pair gets added to the training set as training data. Thus, the class labels in the objective evaluation simulations are equivalent to the class labels which are found in the evaluated dataset.
The algorithm for the RB active learning strategy is as follows:

Algorithm 2 Random candidate sample selection
Require: Labeled embedding data $(X, Y)$
Let the user (re)label an initial subset $(X_{init}, Y_{init})$
Initialize $X_{acc} \leftarrow X_{init}$, $Y_{acc} \leftarrow Y_{init}$
while unlabeled samples $X \setminus X_{acc}$ remain do    ▷ All samples in $X$ that are not in $X_{acc}$
    Select $k$ random samples    ▷ $k$ is user-defined
    Add selected samples to $(X_{acc}, Y_{acc})$
    Train classifier on $(X_{acc}, Y_{acc})$
    Return label predictions for $X \setminus X_{acc}$
end while
The algorithm for the CPU active learning strategy is as follows:

Algorithm 3 Candidate selection based on minimization of class prediction uncertainty (CPU)
Require: Labeled embedding data $(X, Y)$
Let the user (re)label an initial subset $(X_{init}, Y_{init})$
Initialize $X_{acc} \leftarrow X_{init}$, $Y_{acc} \leftarrow Y_{init}$
while unlabeled samples $X \setminus X_{acc}$ remain do    ▷ All samples in $X$ that are not in $X_{acc}$
    for all $x_i \in X \setminus X_{acc}$ do
        $p \leftarrow \text{predict\_proba}(x_i)$
        $\delta_i \leftarrow \frac{|p_{max} - p_{2nd}|}{p_{2nd}} \times 100$    ▷ Class prediction uncertainty
    end for
    Select $k$ samples with the smallest $\delta_i$    ▷ $k$ is user-defined
    Add selected samples to $(X_{acc}, Y_{acc})$
    Train classifier on $(X_{acc}, Y_{acc})$
    Return label predictions for $X \setminus X_{acc}$
end while
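The score computed inside the for loop of Algorithm 3 can be sketched with NumPy as follows. This is a minimal sketch, assuming that the classifier returns one row of class probabilities per unlabeled sample; the function names and the small `eps` guard against division by zero are additions for this example and are not part of Algorithm 3.

```python
import numpy as np


def cpu_scores(proba, eps=1e-12):
    """Relative gap (in percent) between the two highest class probabilities."""
    top_two = np.sort(proba, axis=1)[:, -2:]  # ascending: [p_2nd, p_max]
    p_2nd, p_max = top_two[:, 0], top_two[:, 1]
    # eps is an assumption: it avoids dividing by zero when p_2nd == 0.
    return np.abs(p_max - p_2nd) / np.maximum(p_2nd, eps) * 100


def select_cpu_candidates(proba, k):
    """Indices of the k samples with the smallest delta (most uncertain)."""
    return np.argsort(cpu_scores(proba), kind="stable")[:k]


# Toy probabilities for three unlabeled samples and three classes.
proba = np.array([
    [0.60, 0.30, 0.10],
    [0.45, 0.40, 0.15],   # two classes almost tied: smallest delta
    [0.80, 0.10, 0.10],
])
candidates = select_cpu_candidates(proba, k=2)
```

In the toy data, the second row is the most ambiguous prediction and is therefore ranked first.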
The algorithm for the DC active learning strategy is as follows:

Algorithm 4 Candidate selection based on maximization of dataset coverage (DC)
Require: Labeled embedding data $(X, Y)$
Let the user (re)label an initial subset $(X_{init}, Y_{init})$
Initialize $X_{acc} \leftarrow X_{init}$, $Y_{acc} \leftarrow Y_{init}$
while unlabeled samples $X \setminus X_{acc}$ remain do    ▷ All samples in $X$ that are not in $X_{acc}$
    for all $x_i \in X \setminus X_{acc}$ do
        Compute cosine distances to each $x_j \in X_{acc}$
        $d_i \leftarrow \min_j \text{cosine\_distance}(x_i, x_j)$    ▷ Dataset coverage
    end for
    Select $k$ samples with the largest $d_i$    ▷ $k$ is user-defined
    Add selected samples to $(X_{acc}, Y_{acc})$
    Train classifier on $(X_{acc}, Y_{acc})$
    Return label predictions for $X \setminus X_{acc}$
end while
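The distance computation in Algorithm 4 can be sketched with NumPy as follows. The function names are hypothetical, cosine distance is taken as one minus cosine similarity, and all embedding vectors are assumed to be non-zero; the sketch ranks all unlabeled samples once per iteration, as in the algorithm above.

```python
import numpy as np


def coverage_scores(X_pool, X_labeled):
    """Cosine distance from each pool sample to its nearest labeled sample."""
    # Normalize rows so that the dot product equals the cosine similarity.
    pool = X_pool / np.linalg.norm(X_pool, axis=1, keepdims=True)
    labeled = X_labeled / np.linalg.norm(X_labeled, axis=1, keepdims=True)
    cosine_distance = 1.0 - pool @ labeled.T  # shape (n_pool, n_labeled)
    return cosine_distance.min(axis=1)


def select_dc_candidates(X_pool, X_labeled, k):
    """Indices of the k pool samples farthest from any labeled sample."""
    order = np.argsort(coverage_scores(X_pool, X_labeled), kind="stable")
    return order[::-1][:k]


# Toy 2D "embeddings": one labeled point and three unlabeled points.
X_labeled = np.array([[1.0, 0.0]])
X_pool = np.array([[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
candidates = select_dc_candidates(X_pool, X_labeled, k=2)
```

In the toy data, the point opposite the labeled vector is ranked first, followed by the orthogonal one, while the near-duplicate of the labeled point is not selected.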
In the CPU, DC and RB strategies, the training set is randomly initialized, and it is not guaranteed that all class labels occur in the initial training set. Therefore, 3 additional AL strategies are evaluated in which the initial training sets of the CPU, DC and RB strategies are initialized with stratified sampling, ensuring that every class label from the dataset occurs at least once in the initial training set. Within the scope of this research, these strategies are referred to as RB (S), CPU (S) and DC (S).
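One simple way to obtain an initial training set in which every class occurs at least once is sketched below. This is an assumption about how such an initialization could be done (one random sample per class, then a random fill), not necessarily the stratified sampling procedure used in this research; the function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_initial_indices(labels, n_init, seed=0):
    """Pick n_init indices so that every class label occurs at least once."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    if n_init < len(by_class):
        raise ValueError("n_init must be at least the number of classes")
    # One randomly chosen sample per class guarantees full class coverage.
    chosen = [rng.choice(indices) for indices in by_class.values()]
    # Fill the remaining slots with random samples from the rest of the data.
    taken = set(chosen)
    remaining = [i for i in range(len(labels)) if i not in taken]
    chosen += rng.sample(remaining, n_init - len(chosen))
    return chosen
```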
Additionally, in the objective evaluation of CLAP and the Freesound Loop Dataset, a theoretical
approximation of an upper bound of classification accuracy is calculated, in this research referred
to as the hybrid-one-look-ahead (HOLA) algorithm. At every iteration of the objective evaluation
simulation, the HOLA algorithm looks ahead one iteration, and simulates the classification accuracies
which are returned if candidates are chosen in the next iteration based on the CPU or the DC
heuristic; the aim of the HOLA algorithm is to maximize the classification accuracy in a greedy
best-first approach. The accuracies returned by the HOLA algorithm represent the potential
classification accuracy improvement which is feasible to achieve, under the condition that at every
iteration, the 'right' candidate samples are added to the training set.
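The greedy look-ahead described above can be sketched as follows. This is only an interpretation of the description, not the thesis code: the names hola_step and run_hola, the representation of samples as integer indices, and the tie-breaking rule (the first heuristic wins) are assumptions. The selectors and the accuracy function in the example are toy stand-ins for the CPU and DC heuristics and for training and evaluating a classifier.

```python
from typing import Callable, Dict, List, Optional, Tuple

Selector = Callable[[List[int], List[int]], List[int]]
Evaluator = Callable[[List[int]], float]


def hola_step(
    labeled: List[int],
    unlabeled: List[int],
    selectors: Dict[str, Selector],
    evaluate: Evaluator,
) -> Tuple[str, List[int], float]:
    """Try every heuristic for one iteration and keep the most accurate outcome."""
    best: Optional[Tuple[str, List[int], float]] = None
    for name, select in selectors.items():
        candidates = select(labeled, unlabeled)
        accuracy = evaluate(labeled + candidates)  # simulated look-ahead
        if best is None or accuracy > best[2]:
            best = (name, candidates, accuracy)
    if best is None:
        raise ValueError("at least one selector is required")
    return best


def run_hola(
    initial: List[int],
    pool: List[int],
    selectors: Dict[str, Selector],
    evaluate: Evaluator,
    iterations: int,
) -> List[Tuple[str, float]]:
    """Greedy loop; returns the chosen heuristic and accuracy per iteration."""
    labeled = list(initial)
    unlabeled = [i for i in pool if i not in set(labeled)]
    history: List[Tuple[str, float]] = []
    for _ in range(iterations):
        if not unlabeled:
            break
        name, candidates, accuracy = hola_step(labeled, unlabeled, selectors, evaluate)
        labeled += candidates
        chosen = set(candidates)
        unlabeled = [i for i in unlabeled if i not in chosen]
        history.append((name, accuracy))
    return history


# Toy stand-ins: accuracy is the share of "informative" samples already labeled.
informative = {7, 8, 9}


def toy_accuracy(labeled: List[int]) -> float:
    return len(informative & set(labeled)) / len(informative)


toy_selectors: Dict[str, Selector] = {
    "CPU": lambda labeled, unlabeled: unlabeled[:1],
    "DC": lambda labeled, unlabeled: unlabeled[-1:],
}
history = run_hola([0], list(range(10)), toy_selectors, toy_accuracy, iterations=3)
```

Because the look-ahead evaluates each option against the accuracy it actually produces, the resulting curve is an optimistic reference rather than a strategy that could be applied when the true labels are unknown.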
5.2 Subjective user evaluation
Within the scope of this research, subjective evaluations of the UI are carried out, both to provide
insights into the user-centeredness of the current implementation of the UI, and to answer the
proposed hypotheses in this research with more data at hand. In the subjective evaluations in
this research, 4 participants interact with the UI to complete a number of iterations of different
classification tasks, after which they are asked to evaluate the process and the resulting classifications
by means of a Likert scale questionnaire. The music collections are chosen by the participants, and
they are assisted in creating .json audio embedding files.
5.2.1 Questionnaire
Question                                                                Range    Step size
Questions about the user interface
Q1: Interacting with the user interface was enjoyable;                  [1, 7]   1
Q2: Interacting with the user interface was easy;                       [1, 7]   1
Q3: The buttons in the user interface have a clear purpose;             [1, 7]   1
Q4: The data tables were a useful addition to the user interface;       [1, 7]   1
Questions about the music embedding space
Q5: Similar music pieces are close to each other in the interactive     [1, 7]   1
    music data 'cloud';
Questions about the classifier and heuristics
Q6: Musically speaking, the classes predicted by the classifier for     [1, 7]   1
    unannotated data make sense to me;
Q7: Musically speaking, in the classifier probability distribution,     [1, 7]   1
    the classes predicted as second and third most likely class make
    sense to me;
Q8: The tracks highlighted by the heuristics matched your definition    [1, 7]   1
    of difficult tracks;
Q9: How many iterations did it require to reach an acceptable           [1, ∞)   1
    classifier performance?
Table 1: Questions used for hypothesis testing
When participants answer 1 (strongly disagree) or 7 (strongly agree), they are asked to provide an
additional elaboration of why they chose the rating. The questions as outlined in table 1 are grouped
by category, and the ordering of the questions may differ in the interview.
5.2.2 Participants
Participant 1: Predicting subjective emotional depth in music
Participant 1 is a male participant from The Netherlands in the age group 18-24. The participant
has proposed a classification task in which the classifier attempts to recognize nuances in perceived
emotional depth in 98 songs from his music collection. Along with the list of songs, he provided
annotations, in which every song is labeled as being part of class ED1, ED2, ED3, ED4, or ED5,
where class ED1 describes songs which are perceived to be the least emotionally deep, and class ED5
describes songs which are perceived to be the most emotionally deep.
Participant 2: Predicting music preference
Participant 2 is a male participant from The Netherlands in the age group 25-34. The participant has
proposed a classification task in which the classifier attempts to recognize music preference, described
by grades as class labels. The music collection consisted of 55 songs, and a file with song names and
corresponding grades was created. In the interaction with the UI, songs were labeled with 3, 4, 5, 6,
7, 8 or 9 as classes by the participant.
Participant 3: Distinguishing between Slavic languages
Participant 3 is a male participant from The Netherlands in the age group 18-24. The participant has
chosen a classification task in which the classifier attempts to distinguish between 7 different Slavic
languages. The audio collection contains excerpts from weather forecasts in Belarusian, Czech, Polish,
Russian, Serbian, Slovak and Ukrainian, split into 10 audio chunks of approximately 4 to 5 seconds
for each language.
Participant 4: Classifying personal music collection
Participant 4 is a male participant from The Netherlands in the age group 13-17. The participant has
chosen a classification task in which the classifier attempts to distinguish between 5 different music
labels: pop, indie, minecraft, synthwave and 1980. The music collection consists of 80 audio files from
the personal collection of the participant.
6 Results
6.1 Objective evaluation results
6.1.1 Overview
In this subsection, the results of the objective evaluation are outlined. For every embedding-dataset
pair, 25 simulations are carried out for every active learning strategy (i.e. random, CPU, DC), and the
results are averaged. The results of every embedding-dataset pair are presented by means of
a table which contains classification accuracies at various iterations, and a graph which shows the
progression of average classification accuracies over the course of iterations.
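The averaging over repeated simulations can be illustrated with a short sketch. It assumes that every run yields one accuracy value per iteration and that all runs have the same length; the function name average_accuracy_curve and the two example runs are hypothetical and are not taken from the reported results.

```python
from statistics import mean
from typing import List, Sequence


def average_accuracy_curve(runs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-iteration accuracies over repeated simulation runs."""
    if not runs:
        raise ValueError("at least one run is required")
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must cover the same number of iterations")
    return [mean(run[i] for run in runs) for i in range(length)]


# Two hypothetical runs of three iterations each (the thesis uses 25 runs).
runs = [
    [0.50, 0.60, 0.70],
    [0.70, 0.80, 0.90],
]
averaged = average_accuracy_curve(runs)
```

Each entry of the returned list corresponds to one point of an averaged accuracy curve, so the values at selected iterations can be read off for a table.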
For simulations with the GTZAN dataset, the training set is expanded with 20 data points at
every iteration, for the Moods MIREX dataset the training set is iteratively expanded with 5 data
points, and for the Freesound Loop Dataset the training set is iteratively expanded with 8 data points.
The GTZAN training set is expanded over the course of 40 iterations, the Moods MIREX training
set is expanded over the course of 44 iterations, and the Freesound Loop Dataset training set is also
expanded over the course of 40 iterations.
The HOLA/theoretical upper bound accuracy curve, as introduced in section 5.1.3, has been
computed in the CLAP-Freesound Loop Dataset simulations, and has been mentioned in the table and
visualized in the graph in section 6.1.7.
6.1.2 MAEST and GTZAN
In the objective evaluation of GTZAN and MAEST, the training set was expanded over the course
of 40 iterations, and at each iteration, the training set was expanded with 20 data points.
Algo. (↓) ; Iter. (→)   1      3      5      7      10     15     20     25     30     35     40
Non-stratified
RB                      0.597  0.766  0.792  0.806  0.829  0.857  0.873  0.876  0.886  0.890  0.888
DC                      0.597  0.762  0.807  0.837  0.869  0.879  0.897  0.901  0.901  0.899  0.889
CPU                     0.597  0.769  0.808  0.841  0.863  0.870  0.885  0.888  0.891  0.894  0.888
Stratified
CPU (S)                 0.614  0.766  0.808  0.838  0.861  0.873  0.879  0.884  0.891  0.899  0.890
DC (S)                  0.614  0.766  0.796  0.824  0.855  0.875  0.894  0.898  0.897  0.897  0.890
RB (S)                  0.614  0.763  0.794  0.805  0.836  0.854  0.872  0.876  0.888  0.894  0.894
Table 2: Classification accuracy progress for MAEST-GTZAN over the course of 40 iterations
The following graph shows the progression of classification accuracies of the algorithms without
stratified sampling:
Figure 9: MAEST-GTZAN accuracy curve (non-stratified)
And the following graph shows the progression of classification accuracies of the algorithms with
stratified sampling:
Figure 10: MAEST-GTZAN accuracy curve (stratified)
6.1.3 MAEST and Moods MIREX
Algo. (↓) ; Iter. (→)   1      3      5      7      10     15     20     25     30     35     40     44
Non-stratified
RB                      0.236  0.241  0.256  0.273  0.273  0.273  0.275  0.294  0.313  0.310  0.319  0.319
DC                      0.236  0.254  0.251  0.248  0.259  0.274  0.256  0.283  0.297  0.310  0.329  0.315
CPU                     0.236  0.259  0.257  0.272  0.300  0.297  0.296  0.319  0.312  0.312  0.306  0.318
Stratified
CPU (S)                 0.241  0.238  0.262  0.274  0.291  0.292  0.306  0.320  0.319  0.317  0.319  0.321
DC (S)                  0.241  0.245  0.245  0.232  0.263  0.263  0.277  0.295  0.298  0.316  0.325  0.318
RB (S)                  0.241  0.240  0.260  0.269  0.282  0.282  0.318  0.313  0.318  0.315  0.309  0.322
Table 3: Classification accuracy progress for MAEST-Moods MIREX over the course of 44 iterations
The following graph shows the progression of classification accuracies of the algorithms without
stratified sampling:
Figure 11: MAEST-Moods MIREX accuracy curve (non-stratified)
And the following graph shows the progression of classification accuracies of the algorithms with
stratified sampling:
Figure 12: MAEST-Moods MIREX accuracy curve (stratified)
6.1.4 MAEST and Freesound Loop Dataset
Algo. (↓) ; Iter. (→)   1      3      5      7      10     15     20     25     30     35     40
Non-stratified
RB                      0.228  0.226  0.240  0.245  0.230  0.245  0.240  0.246  0.246  0.250  0.243
DC                      0.228  0.225  0.228  0.228  0.252  0.249  0.249  0.254  0.243  0.239  0.238
CPU                     0.228  0.236  0.236  0.246  0.252  0.256  0.246  0.242  0.248  0.238  0.240
Stratified
RB (S)                  0.230  0.221  0.233  0.236  0.238  0.237  0.239  0.238  0.231  0.242  0.240
DC (S)                  0.230  0.252  0.234  0.240  0.264  0.248  0.247  0.239  0.250  0.241  0.241
CPU (S)                 0.230  0.243  0.246  0.240  0.244  0.241  0.242  0.243  0.242  0.236  0.242
Table 4: Classification accuracy progress for MAEST-Freesound Loop Dataset over the course of 40
iterations
The following graph shows the progression of classification accuracies of the algorithms without
stratified sampling:
Figure 13: MAEST-Freesound Loop Dataset accuracy curve (non-stratified)
And the following graph shows the progression of classification accuracies of the algorithms with
stratified sampling:
Figure 14: MAEST-Freesound Loop Dataset accuracy curve (stratified)
6.1.5 CLAP and GTZAN
Algo. (↓) ; Iter. (→)   1      3      5      7      10     15     20     25     30     35     40
Non-stratified
RB                      0.606  0.710  0.734  0.754  0.778  0.798  0.800  0.806  0.816  0.820  0.829
DC                      0.606  0.693  0.733  0.771  0.787  0.810  0.818  0.816  0.824  0.826  0.828
CPU                     0.606  0.728  0.754  0.776  0.795  0.803  0.812  0.813  0.814  0.822  0.829
Stratified
CPU (S)                 0.628  0.726  0.752  0.765  0.783  0.796  0.802  0.811  0.815  0.821  0.829
DC (S)                  0.628  0.693  0.735  0.755  0.794  0.800  0.812  0.817  0.821  0.826  0.829
RB (S)                  0.628  0.690  0.727  0.747  0.763  0.780  0.798  0.808  0.820  0.820  0.829
Table 5: Classification accuracy progress for CLAP-GTZAN over the course of 40 iterations
The following graph shows the progression of classification accuracies of the algorithms without
stratified sampling:
Figure 15: CLAP-GTZAN accuracy curve (non-stratified)
And the following graph shows the progression of classification accuracies of the algorithms with
stratified sampling:
Figure 16: CLAP-GTZAN accuracy curve (stratified)
the semantic meaning of sound can be effectively communicated to the user through additional visual
cues such as the video thumbnail and images retrieved from audio-to-image algorithms (Ishibashi et
al., 2020).
Furthermore, in the current implementation of the UI, all music embedding representations are
shown as unique data points, which may lead to a cluttered visualization of data in the case of large
music databases. One idea could be the aggregation of data clouds into combined 'bubbles', which
can be clicked, in order to subsequently zoom in on a segment of the data. An interesting example of
this idea is already implemented in the 2025 version of geotag-based sound exploration on Freesound
(Font et al., 2013).
Figure 21: Freesound Map of Sounds (2025). Retrieved from https://freesound.org/browse/geotags/
Other possible features which could be added to the UI in future updates include: data point
removal, allowing the user to save logs which show a list of performed actions with timestamp, saving
user annotations as a 'checkpoint', allowing the user to choose different values of displayed
problematic tracks, per heuristic, and more. More directions can be found in the "Additional contributions
by the participant" subsections, in the results of the objective evaluations (section 6).
7.3 Other directions for future research
N400 is a brain wave response traditionally considered to be related to language-semantic processing,
in which the maximum amplitude in brain activity is reached about 400 ms after the onset of
a word stimulus which is incongruent with the preceding sequence of words (Gazzaniga et al., 2018).
In 2009, Daltrozzo & Schön sought to find out if the N400 effect is also elicited by musical stimuli
(Daltrozzo & Schön, 2009). The research by Daltrozzo & Schön yielded the first evidence that musical
information can also elicit N400 responses, given that an N400 brain response was observed when
1-second, subjectively semantically unrelated music excerpts were played back-to-back (Koelsch, 2011).
Thus, the N400 brain response, elicited by instant audio previews which are subjectively
inconsistent with personal taxonomies, could possibly function as a powerful catalyst for quickly finding
outliers or candidates for (re-)annotation. Although out of the scope of this research, examining the
relationship between audio previewing and the N400 might be an interesting direction for future
research. To my knowledge, the connection between the N400 effect and audio previewing in interactive
user interfaces has been neither made nor explored in other research yet.
8 References
• Alonso-Jiménez, P., Bogdanov, D., Pons, J., & Serra, X. (2020). TensorFlow Audio Models in Essentia. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2003.07393
• Alonso-Jiménez, P., Serra, X., & Bogdanov, D. (2023). Efficient supervised training of audio transformers for music representation learning. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2309.16418
• Ben Sassi, I., & Ben Yahia, S. (2020). How does context influence music preferences: a user-based study of the effects of contextual information on users' preferred music. Multimedia Systems, 27(2), 143–160. https://doi.org/10.1007/s00530-020-00717-x
• Bogdanov, D. (2013). From music similarity to music recommendation: computational approaches based on audio features and metadata. http://mtg.upf.edu/node/2817
• Bogdanov, D., Lizarraga-Seijas, X., Alonso-Jiménez, P., & Serra, X. (2022). MusAV: A dataset of relative arousal-valence annotations for validation of audio models. International Society for Music Information Retrieval Conference (ISMIR 2022). http://hdl.handle.net/10230/54181
• Chen, K., Du, X., Zhu, B., Ma, Z., Berg-Kirkpatrick, T., & Dubnov, S. (2022). HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection. https://doi.org/10.48550/arXiv.2202.00874
• Daltrozzo, J., & Schön, D. (2009). Is conceptual processing in music automatic? An electrophysiological approach. Brain Research, 1270, 88–94. https://doi.org/10.1016/j.brainres.2009.03.019
• Du, B., Wang, Z., Zhang, L., Zhang, L., Liu, W., Shen, J., & Tao, D. (2015). Exploring representativeness and informativeness for active learning. IEEE Transactions on Cybernetics, 47(1), 14–26. https://doi.org/10.1109/tcyb.2015.2496974
• Elizalde, B., Deshmukh, S., Al Ismail, M., & Wang, H. (2022). CLAP: Learning Audio Concepts From Natural Language Supervision. https://doi.org/10.48550/arXiv.2206.04769
• Font, F., Roma, G., & Serra, X. (2013). Freesound technical demo. MM'13. Proceedings of the 21st ACM International Conference on Multimedia; 2013 Oct 21-25; Barcelona, Spain. https://doi.org/10.1145/2502081.2502245
• Gazzaniga, M., Ivry, R. B., & Mangun, G. R. (2018). Cognitive Neuroscience: Fifth International Student Edition. W.W. Norton & Company.
• Hilasaca, L. H., Ribeiro, M. C., & Minghim, R. (2021). Visual Active Learning for Labeling: A case for Soundscape Ecology data. Information, 12(7), 265. https://doi.org/10.3390/info12070265
• Hu, X. & Downie, J. S. (2007). "Exploring mood metadata: relationships with genre, artist and usage metadata", Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR'07, Vienna, Austria, 2007.
• Huzaifah, M. & Wyse, L. (2021). Deep Generative Models for Musical Audio Synthesis. In: Miranda, E.R. (eds) Handbook of Artificial Intelligence for Music. Springer, Cham. https://doi.org/10.1007/978-3-030-72116-9_22
• Ishibashi, T., Nakao, Y., & Sugano, Y. (2020). Investigating audio data visualization for interactive sound recognition. IUI '20: Proceedings of the 25th International Conference on Intelligent User Interfaces. https://doi.org/10.1145/3377325.3377483
• Joshi, A. J., Porikli, F., & Papanikolopoulos, N. (2012). Coverage optimized active learning for k-NN classifiers. 2012 IEEE International Conference on Robotics and Automation. https://doi.org/10.1109/icra.2012.6225054
• Koelsch, S. (2011). Toward a neural basis of music perception – a review and updated model. Frontiers in Psychology, 2. https://doi.org/10.3389/fpsyg.2011.00110
• Konyushkova, K., Sznitman, R., & Fua, P. (2017). Learning Active Learning from Data. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1703.03365
• Lanzendörfer, L. A., Grötschla, F., Valizada, U., & Wattenhofer, R. (2024). Audio Atlas: Visualizing and Exploring Audio Datasets. Extended Abstracts for the Late-Breaking Demo Session of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024. https://doi.org/10.48550/arxiv.2412.00591
• Maia, L. S., Rocamora, M., Biscainho, L. W. P., & Fuentes, M. (2024). Selective annotation of few data for beat tracking of Latin American music using rhythmic features. Transactions of the International Society for Music Information Retrieval, 7(1), 99–112. https://doi.org/10.5334/tismir.170
• Matrosova, K. (2024). Modeling and Influencing Music Preferences on Streaming Platforms. Computer Science [cs]. Université Sorbonne Paris Nord, 2024. https://hal.science/tel-04865002v2
• McInnes, L., & Healy, J. (2018). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1802.03426
• Mosqueira-Rey, E., Hernández-Pereira, E., Alonso-Ríos, D., Bobes-Bascarán, J., & Fernández-Leal, Á. (2022). Human-in-the-loop machine learning: a state of the art. Artificial Intelligence Review, 56(4), 3005–3054. https://doi.org/10.1007/s10462-022-10246-w
• North, A. C., & Hargreaves, D. J. (1996). Situational influences on reported musical preference. Psychomusicology Music Mind and Brain, 15(1–2), 30–45. https://doi.org/10.1037/h0094081
• Pijanowski, B. C., Villanueva-Rivera, L. J., Dumyahn, S. L., Farina, A., Krause, B. L., Napoletano, B. M., Gage, S. H., & Pieretti, N. (2011). Soundscape Ecology: The science of sound in the landscape. BioScience, 61(3), 203–216. https://doi.org/10.1525/bio.2011.61.3.6
• Plotly Technologies Inc. (2025). Basic Dash Callbacks | Dash for Python Documentation | Plotly. Retrieved from https://dash.plotly.com/basic-callbacks
• Plotly Technologies Inc. (2025). Dash Core Components | Dash for Python Documentation | Plotly. Retrieved from https://dash.plotly.com/dash-core-components
• Plotly Technologies Inc. (2025). Layout | Dash for Python Documentation | Plotly. Retrieved from https://dash.plotly.com/layout
• Pons, J., & Serra, X. (2019). musicnn: Pre-trained convolutional neural networks for music audio tagging. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1909.06654
• Ragano, A., Benetos, E., & Hines, A. (2022). Learning Music Representations with wav2vec 2.0. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2210.15310
• Ramires, A., Font, F., Bogdanov, D., Smith, J. B. L., Yang, Y., Ching, J., Chen, B., Wu, Y., Wei-Han, H., & Serra, X. (2020). The Freesound Loop Dataset and Annotation Tool. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2008.11507
• Sarasúa, Á., Laurier, C., & Herrera, P. (2012). Support Vector Machine Active Learning for Music Mood Tagging. 9th International Symposium on Computer Music Modeling and Retrieval (CMMR). https://mtg.upf.edu/system/files/publications/saras%C3%BAa-CMMR-2012.pdf
• Schein, A. I., & Ungar, L. H. (2007). Active learning for logistic regression: an evaluation. Machine Learning, 68(3), 235–265. https://doi.org/10.1007/s10994-007-5019-5
• Schmid, F., Koutini, K., & Widmer, G. (2023). Low-Complexity audio embedding extractors. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2303.01879
• Serra, X., Magas M., Benetos E., Chudy M., Dixon S., Flexer A., Gómez E., Gouyon F., Herrera P., Jordà S., Paytuvi O., Peeters G., Schlüter J., Vinet H., & Widmer G., "Roadmap for Music Information ReSearch", Peeters G. (editor), 2013, Creative Commons BY-NC-ND 3.0 license, ISBN: 978-2-9540351-1-6
• Shuyang, Z., Heittola, T., & Virtanen, T. (2017). Active learning for sound event classification by clustering unlabeled data. ICASSP 2017 - 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 751–755. https://doi.org/10.1109/icassp.2017.7952256
• Su, D., & Fung, P. (2012). Personalized music emotion classification via active learning. MIRUM '12: Proceedings of the Second International ACM Workshop on Music Information Retrieval With User-centered and Multimodal Strategies, 57–62. https://doi.org/10.1145/2390848.2390864
• Tahiroğlu, K., & Wyse, L. (2024). Latent Spaces as Platforms for Sonic Creativity. International Conference on Computational Creativity (ICCC'24). https://www.researchgate.net/publication/382947199_Latent_Spaces_as_Platforms_for_Sonic_Creativity
• Tovstogan, P., Serra, X., & Bogdanov, D. (2020). Web Interface for Exploration of Latent and Tag Spaces in Music Auto-Tagging. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. https://repositori.upf.edu/handle/10230/45186
• Tovstogan, P., Serra, X., & Bogdanov, D. (2022). Visualization of Deep Audio Embeddings for Music Exploration and Rediscovery. Proceedings of the 19th Sound and Music Computing Conference, June 5-12th, 2022, Saint-Étienne (France). http://hdl.handle.net/10230/53710
• Turian, J., Shier, J., Khan, H. R., Raj, B., Schuller, B. W., Steinmetz, C. J., Malloy, C., Tzanetakis, G., Velarde, G., McNally, K., Henry, M., Pinto, N., Noufi, C., Clough, C., Herremans, D., Fonseca, E., Engel, J., Salamon, J., Esling, P., Manocha, P., Watanabe, S., Jin, Z., & Bisk, Y. (2022). HEAR: Holistic Evaluation of Audio Representations. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2203.03022
• Tzanetakis, G. & Cook, P. (2002). Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, 10, 293–302. https://doi.org/10.1109/TSA.2002.800560
• Van Den Oord, A., Dieleman, S., & Schrauwen, B. (2014). Transfer learning by supervised pre-training for audio-based music classification. International Symposium/Conference on Music Information Retrieval, 29–34. https://biblio.ugent.be/publication/5973853
• Volk, A., Wiering, F., & Van Kranenburg, P. (2011). Unfolding the potential of computational musicology. International Conference on Informatics and Semiotics in Organisations, 137–144. https://pure.knaw.nl/ws/files/475092/avolk_paper_iciso2011.pdf
• Volk, A., & Van Kranenburg, P. (2012). Melodic similarity among folk songs: An annotation study on similarity-based categorization in music. Musicae Scientiae, 16(3), 317–339. https://doi.org/10.1177/1029864912448329
• Wang, L., Luc, P., Wu, Y., Recasens, A., Smaira, L., Brock, A., Jaegle, A., Alayrac, J., Dieleman, S., Carreira, J., & Van Den Oord, A. (2021). Towards learning universal audio representations. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2111.12124
• Wang, Y., Mendez, A. E. M., Cartwright, M., & Bello, J. P. (2019). Active Learning for Efficient Audio Annotation and Classification with a Large Amount of Unlabeled Data. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 880–884. https://doi.org/10.1109/icassp.2019.8683063
• Widmer, G. (2016). Getting closer to the essence of music: The Con Espressione Manifesto. ACM Transactions on Intelligent Systems and Technology, 8(2), 1–13. https://doi.org/10.1145/2899004
• Wilkie, K., Holland, S., & Mulholland, P. (2010). What Can the Language of Musicians Tell Us about Music Interaction Design? Computer Music Journal, 34(4), 34–48. https://doi.org/10.1162/comj_a_00024
• Won, M., Chun, S., & Serra, X. (2019). Toward Interpretable Music Tagging with Self-Attention. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1906.04972
• Wu, Y., Chen, K., Zhang, T., Hui, Y., Nezhurina, M., Berg-Kirkpatrick, T., & Dubnov, S. (2024). Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. https://doi.org/10.48550/arXiv.2211.06687
• Zaman, K., Sah, M., Direkoglu, C., & Unoki, M. (2023). A survey of audio classification using deep learning. IEEE Access, 11, 106620–106649. https://doi.org/10.1109/access.2023.3318015
• Zhang, A., Thomaz, E., & Lu, L. (2025). Transformation of audio embeddings into interpretable, concept-based representations. https://doi.org/10.48550/arXiv.2504.14076