
Exploring the Feasibility of LLMs for Automated Music Emotion Annotation

Author: Meng Yang; Jon McCormack; Maria Teresa Llano; Wanchao Su
Publisher: Zenodo
DOI: 10.5281/zenodo.17706355
Source: https://zenodo.org/records/17706355/files/000018.pdf
EXPLORING THE FEASIBILITY OF LLMS FOR AUTOMATED MUSIC
EMOTION ANNOTATION
Meng Yang¹, Jon McCormack¹, Maria Teresa Llano², Wanchao Su¹
¹ SensiLab, Monash University, Melbourne, Australia
² University of Sussex, Brighton, United Kingdom
{Meng.Yang, Jon.McCormack, Wanchao.Su}@monash.edu, [email protected]
ABSTRACT
Current approaches to music emotion annotation remain heavily reliant on manual labelling, a process that imposes significant resource and labour burdens, severely limiting the scale of available annotated data. This study examines the feasibility and reliability of employing a large language model (GPT-4o) for music emotion annotation. We annotated GiantMIDI-Piano, a classical MIDI piano music dataset, in a four-quadrant valence-arousal framework using GPT-4o, and compared the results against annotations provided by three human experts. We conducted extensive evaluations to assess the performance and reliability of GPT-generated music emotion annotations, including standard accuracy, weighted accuracy that accounts for inter-expert agreement, inter-annotator agreement metrics, and distributional similarity of the generated labels. While GPT’s annotation performance fell short of human experts in overall accuracy and exhibited less nuance in categorizing specific emotional states, inter-rater reliability metrics indicate that GPT’s variability remains within the range of natural disagreement among experts. These findings underscore both the limitations and potential of GPT-based annotation: despite its current shortcomings relative to human performance, its cost-effectiveness and efficiency render it a promising scalable alternative for music emotion annotation.
1. INTRODUCTION
Music is widely recognized as a medium for conveying complex human emotions and experiences, making emotion-related research a focal point in the Music Information Retrieval (MIR) community [1]. Most empirical advances in emotion-related MIR start with a prerequisite step: securing a sufficiently large, reliable set of emotion labels. Existing datasets like DEAM [2], Emotify [3], VGMIDI [4] and EMOPIA [5] were all created through intensive manual annotation campaigns. While indispensable, human labelling is slow and costly, so most of these datasets plateau at a few thousand items. These scale limits, in turn, constrain downstream research: modern deep architectures demand far more data than the community can currently afford to label by hand. Recent advances in large language models (LLMs) have transformed text understanding, making it possible to infer music’s perceived emotion from extrinsic textual sources: metadata, lyrics, and contextual descriptions. Some of what shapes listeners’ perceived emotion is encoded outside the sound itself: composer biographies, genre conventions, and the historical context of composition [6]. Although LLMs cannot “hear” melody, harmony, or timbre, they can parse these documents and extract the affective stance they imply. In vocal music, lyrics already serve as an effective textual proxy and have underpinned successful music emotion recognition (MER) studies [7–10]. Instrumental works, however, lack built-in semantic cues; for them, metadata becomes the primary linguistic window into a composer’s expressive intent. Previous research has shown correlations between perceived emotion and historical and cultural context [11], motivating our use of metadata-driven LLM inference to annotate perceived emotion at scale for instrumental music.

© M. Yang, J. McCormack, T. Llano, and W. Su. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: M. Yang, J. McCormack, T. Llano, and W. Su, “Exploring the Feasibility of LLMs for Automated Music Emotion Annotation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

In this study, we explore a novel annotation methodology that employs a large language model (GPT-4o) as an automated annotator for the perceived emotion of music. Our approach uses the title and composer of a music piece as search keywords to retrieve relevant web results, providing the LLM with extracted textual content as context and enabling it to infer an appropriate emotion label based on the available information. We apply our method to annotate GiantMIDI-Piano [12], a classical piano dataset with 10,855 MIDI music pieces, using a four-quadrant valence-arousal framework. To evaluate this GPT-based annotation method, we randomly selected 100 samples from each of the four emotion categories, totalling 400 samples, and obtained annotations for each sample from three human experts. We then compared the GPT-generated labels against the expert annotations using a comprehensive evaluation framework that includes binary and weighted accuracy metrics, inter-annotator reliability measures (Cohen’s Kappa and Fleiss’ Kappa), and distributional similarity analyses via Jensen–Shannon divergence.
Our findings show that although GPT-4o does not yet match human experts in overall accuracy or nuanced emotional categorization, its inter-rater variability falls within the range of natural disagreement among experts. These results highlight both the challenges and potential of LLM-based annotation: while further refinements are needed before it can fully replace manual annotation, the method’s cost-effectiveness and efficiency make it a promising approach for large-scale music emotion annotation.
The main contributions of this paper are as follows: (1) We propose a cost-effective approach to music emotion annotation by leveraging GPT’s text-based inference capability, reducing the reliance on time-consuming and costly manual labelling. (2) We develop a comprehensive evaluation framework incorporating accuracy metrics, inter-annotator agreement measures, and distributional analyses to assess and compare the performance of GPT-generated and expert annotations.
2. RELATED WORK
2.1 Music Emotion Annotation
Early Music Information Retrieval (MIR) research on emotion, such as Music Emotion Recognition (MER), has relied heavily on manually annotated datasets. For example, CAL500 [13] contains 502 songs, each annotated with multiple human-provided emotion labels, and the DEAM dataset [2] includes 1,802 music excerpts with continuous and static arousal-valence annotations. While these corpora have proven invaluable for developing and evaluating models and tasks, manual emotion annotation requires multiple human listeners per track, making it both costly and labour-intensive [14–16]. As a result, most datasets are small in scale, typically comprising only hundreds to thousands of songs, and are often limited to specific genres [2–5, 13], constraining the performance of data-intensive models. Models trained on such limited datasets struggle to generalize, and the lack of large-scale, standardized datasets complicates benchmarking across different studies.
Various alternative approaches have been proposed to address these challenges. Gamified annotation techniques such as MoodSwings [17], and crowd-sourced tagging platforms like Last.fm and AllMusic, can expedite label collection, but they often introduce new problems such as data sparsity, biased sampling, unclear taxonomy, and a lack of label quality assurance [14]. The need for scalable, cost-effective, and consistent annotation methods has led researchers to explore advanced AI solutions, including LLMs, to assist or automate the labelling process.
2.2 LLM-Based Annotation
Recently, LLMs have revolutionized text-based annotation and classification tasks [18]. Unlike task-specific classifiers, LLMs are pre-trained on massive text corpora and can perform labelling through natural language prompts without task-specific retraining. Studies have shown that these models can match or even occasionally surpass the accuracy and consistency of crowd-sourced or expert annotations, primarily by applying labelling criteria more uniformly and reducing the impact of subjective interpretation [19, 20]. Moreover, once deployed, LLMs annotate data rapidly and at relatively low cost, rendering them highly suitable for large-scale applications [21].
Textual sources have long served as a source of emotional evidence for music–emotion studies. Early work inferred emotion directly from lyrics [7–10, 22], while more recent efforts have mined user-generated discourse (YouTube comments, tweets, Reddit threads) to annotate pieces along the valence–arousal plane, achieving moderate reliability with transformer models [23]. These strategies, however, presuppose plentiful public discussion or vocal content and therefore miss much of the instrumental and lesser-known classical repertoire. Prior research shows that metadata (composer background, genre, stylistic school, and historical context) also correlates with music emotion [11], which offers a broadly applicable foundation for automatic emotion annotation. Our study builds on this insight: we employ LLMs to annotate perceived emotion labels directly from contextual metadata, thereby extending annotated resources to music without lyrics.
3. METHODOLOGY
3.1 Data Preparation
We employed the GiantMIDI-Piano dataset [12], a classical piano MIDI collection comprising 10,855 files from 2,786 composers. For each piece, we collected contextual information by web-crawling a curated list of music information sources¹. We extracted metadata, such as genre, style, composer biography, historical and cultural context, and the composer’s creative intent, which were then incorporated into the prompt provided to GPT-4o for annotating the perceived emotion of the music.
3.2 GPT-based Annotation
The collected text metadata was used to automate emotion annotation for the music in our dataset. Emotion labels were assigned according to Russell’s valence–arousal model [24], following the quadrant-based scheme used in EMOPIA [5], and categorized into four discrete quadrants: High Valence–High Arousal (HVHA), High Valence–Low Arousal (HVLA), Low Valence–High Arousal (LVHA), and Low Valence–Low Arousal (LVLA). GPT-4o was supplied with contextual information via a structured prompt that explicitly instructed it to infer the perceived emotional content of each piece solely from the provided text, thereby minimizing hallucination. If the model determined that the available information was insufficient for a reliable annotation, it was instructed to return the label “not enough information.” The prompt is presented in Figure 1. To ensure that GPT-4o selected the most reliable label based on the context and did not introduce random fluctuations, we set the model’s temperature to 0. Following the annotation process, a total of 9,803 musical pieces were assigned valid emotion labels.
¹ e.g., en.wikipedia.org, imslp.org, naxos.com, allmusic.com, classical-music.com, gramophone.co.uk
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
You are a music expert tasked with annotating
the emotion conveyed by a musical piece.
Based on the context provided below, select
one valence-arousal label that best describes
the music's emotional state. Choose only from
the following options:
HVHA: High Valence, High Arousal
HVLA: High Valence, Low Arousal
LVHA: Low Valence, High Arousal
LVLA: Low Valence, Low Arousal
If the context does not clearly indicate a
specific emotion, make reasonable inferences
using as much information as possible about
the genre, style, historical context etc.
Only if no useful clues are available, answer
“not enough information”.
{context}
Figure 1. The Prompt for Music Emotion Annotation.
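
The prompt in Figure 1 can be turned into a small template-and-parser pair. The following is a minimal sketch, not the authors' implementation: the names `build_prompt` and `parse_label` and the parsing rules (exactly one quadrant code must appear in the reply) are assumptions made for illustration, and the call to the model itself is deliberately left out.

```python
from typing import Optional

VALID_LABELS = ("HVHA", "HVLA", "LVHA", "LVLA")
NO_INFO = "not enough information"

# Prompt text transcribed from Figure 1; {context} is filled in per piece.
PROMPT_TEMPLATE = """You are a music expert tasked with annotating the emotion conveyed by a musical piece.
Based on the context provided below, select one valence-arousal label that best describes the music's emotional state. Choose only from the following options:
HVHA: High Valence, High Arousal
HVLA: High Valence, Low Arousal
LVHA: Low Valence, High Arousal
LVLA: Low Valence, Low Arousal
If the context does not clearly indicate a specific emotion, make reasonable inferences using as much information as possible about the genre, style, historical context etc. Only if no useful clues are available, answer "not enough information".

{context}"""


def build_prompt(context: str) -> str:
    """Insert the crawled metadata for one piece into the prompt template."""
    return PROMPT_TEMPLATE.format(context=context.strip())


def parse_label(response: str) -> Optional[str]:
    """Map a raw model reply to a quadrant code, NO_INFO, or None if ambiguous.

    Hypothetical parsing rule: the reply is accepted only when exactly one
    of the four quadrant codes occurs in it.
    """
    if NO_INFO in response.lower():
        return NO_INFO
    upper = response.upper()
    found = [label for label in VALID_LABELS if label in upper]
    return found[0] if len(found) == 1 else None
```

Returning `None` for replies that name zero or several quadrants keeps malformed outputs out of the label set so they can be re-queried or inspected by hand.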
3.3 Human Evaluation
To assess the feasibility and quality of GPT-generated annotations, we engaged three annotators with over five years of formal music training and similar cultural backgrounds. Before annotation, they calibrated their understanding of the valence–arousal framework to ensure a consistent labelling standard. We randomly sampled 100 tracks from each of the four emotion quadrants (400 total), ensuring a balanced distribution across quadrants. To reduce potential stylistic bias, the samples were selected to cover a diverse range of composers and musical styles within the classical piano repertoire. Each annotator independently labelled the perceived emotion based on listening experience. The annotations were then aggregated by majority voting to establish a human-derived gold standard. Samples for which at least two out of three annotators agreed were designated as high-confidence samples, while those lacking consensus were classified as low-confidence samples. Our human annotation results yielded 386 high-confidence samples and 14 low-confidence samples.
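
The aggregation rule described above (majority vote, with a sample counted as high-confidence when at least two of three annotators agree) can be sketched as follows. The function name `aggregate` and the `"high"`/`"low"` tags are illustrative choices, not names taken from the paper.

```python
from collections import Counter


def aggregate(expert_labels):
    """Return (gold_label, confidence) for one sample's three expert labels.

    A label chosen by at least two annotators becomes the gold standard
    ("high" confidence); with three distinct labels there is no gold label.
    """
    label, count = Counter(expert_labels).most_common(1)[0]
    if count >= 2:
        return label, "high"
    return None, "low"


print(aggregate(["HVHA", "HVHA", "LVHA"]))  # majority exists
print(aggregate(["HVHA", "HVLA", "LVHA"]))  # three distinct labels
```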
3.4 Evaluation Framework
We assessed the performance and reliability of the GPT-generated labels using a comprehensive evaluation framework that incorporates multiple metrics:
3.4.1 Accuracy
Binary Accuracy is defined as the proportion of high-confidence samples for which the GPT-generated label exactly matches the gold standard obtained via majority voting among human experts.
Given the subjectivity in music emotion annotation, a strict binary accuracy metric, where a sample is considered “correct” only if the GPT-generated label exactly matches the gold standard (i.e., the majority vote from experts), may not capture the nuances of expert disagreement. To more precisely reflect the gradations in expert agreement, we propose a Weighted Accuracy that incorporates partial consensus among experts. Specifically, let $s_i$ be the score assigned to sample $i$ based on how closely GPT’s prediction aligns with expert consensus:
• Full Consensus (3/3): All three experts agree on the same label. In this case, if GPT’s label matches the gold standard, $s_i = 1$; otherwise $s_i = 0$.
• Partial Consensus (2/3): Two experts agree on a majority label $L_m$, and one expert has a minority label $L_n$. In this case, if GPT’s label equals $L_m$, $s_i = 1$; if GPT’s label equals $L_n$, $s_i = 0.5$; otherwise $s_i = 0$ ².
• Complete Disagreement (3 distinct labels): All three experts disagree, each offering a unique label. If GPT’s label matches any one of the three experts’ labels, $s_i = 1/3$; otherwise $s_i = 0$.
Finally, the Weighted Accuracy is computed as the average score across all $N$ samples:
$$\text{Weighted Accuracy} = \frac{1}{N} \sum_{i=1}^{N} s_i$$
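
The scoring rules above translate directly into code. The sketch below is an illustration under the stated rules rather than the authors' script; the function names and the toy labels are assumptions.

```python
from collections import Counter


def sample_score(gpt_label, expert_labels):
    """Score s_i for one sample given GPT's label and three expert labels."""
    counts = Counter(expert_labels)
    top_label, top_count = counts.most_common(1)[0]
    if top_count == 3:  # full consensus
        return 1.0 if gpt_label == top_label else 0.0
    if top_count == 2:  # partial consensus: majority 1, minority 0.5
        if gpt_label == top_label:
            return 1.0
        return 0.5 if gpt_label in counts else 0.0
    # complete disagreement: any expert's label earns 1/3
    return 1.0 / 3.0 if gpt_label in counts else 0.0


def weighted_accuracy(gpt_labels, expert_label_sets):
    """Average of s_i over all samples."""
    scores = [sample_score(g, e) for g, e in zip(gpt_labels, expert_label_sets)]
    return sum(scores) / len(scores)


# Toy data (not from the paper): scores are 1, 0.5, 1/3 and 0.
gpt = ["HVHA", "LVHA", "HVLA", "LVLA"]
experts = [
    ["HVHA", "HVHA", "HVHA"],
    ["HVHA", "HVHA", "LVHA"],
    ["HVHA", "HVLA", "LVHA"],
    ["HVHA", "HVHA", "HVHA"],
]
print(weighted_accuracy(gpt, experts))
```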
3.4.2 Inter-Annotator Consistency
Cohen’s Kappa is used to measure pairwise agreement between two sets of annotations while accounting for chance agreement. For a pair of raters, it is given by:
$$\kappa = \frac{P_0 - P_e}{1 - P_e},$$
where $P_0$ is the observed agreement and $P_e$ is the expected agreement by chance. We computed Cohen’s Kappa for GPT versus the gold standard, as well as for each pair of human annotators.
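
As a worked illustration of the formula, Cohen's Kappa for two label sequences can be computed in a few lines. This is a plain-Python sketch with invented toy labels; it assumes $P_e < 1$, i.e. the two raters do not both use a single identical label throughout.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(rater_a)
    # Observed agreement P_0: fraction of samples with identical labels.
    p0 = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement P_e: product of the raters' marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    return (p0 - pe) / (1 - pe)


a = ["HVHA", "HVHA", "LVLA", "LVLA"]
b = ["HVHA", "LVLA", "LVLA", "LVLA"]
# P_0 = 3/4, P_e = (2*1 + 2*3)/16 = 1/2, so kappa = 0.5
print(cohens_kappa(a, b))
```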
Fleiss’ Kappa extends the kappa statistic to multiple raters. Given a rating matrix $R$ where each row $i$ represents a sample and each column $j$ holds the number of ratings $n_{ij}$ for category $j$, the per-item agreement is:
$$P_i = \frac{1}{n_i (n_i - 1)} \sum_{j=1}^{k} n_{ij} (n_{ij} - 1)$$
with $n_i = \sum_{j=1}^{k} n_{ij}$. The overall observed agreement is $\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$, and the chance agreement is $P_e = \sum_{j=1}^{k} p_j^2$, where $p_j$ is the proportion of ratings in category $j$ across all samples. Fleiss’ Kappa is then:
$$\kappa_F = \frac{\bar{P} - P_e}{1 - P_e}.$$
We computed Fleiss’ Kappa for both the human-only annotations and for the combined GPT and human annotations.
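
The same quantities can be computed from a count matrix. The sketch below follows the definitions above; the function name and the two-category toy matrix are illustrative assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters putting sample i in category j."""
    n_samples = len(counts)
    n_categories = len(counts[0])
    total_ratings = sum(sum(row) for row in counts)

    # Mean per-item agreement P-bar.
    p_bar = 0.0
    for row in counts:
        n_i = sum(row)
        p_bar += sum(c * (c - 1) for c in row) / (n_i * (n_i - 1))
    p_bar /= n_samples

    # Chance agreement P_e from overall category proportions p_j.
    p_j = [sum(row[j] for row in counts) / total_ratings for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Toy matrix: 4 samples, 3 raters, 2 categories.
# P_i = 1, 1, 1/3, 1/3 -> P-bar = 2/3; p_j = (0.5, 0.5) -> P_e = 0.5; kappa = 1/3
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(fleiss_kappa(toy))
```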
² Note that each sample contributes at most one credit, so the weighted accuracy remains within [0, 1], and GPT incurs no penalty when its annotation aligns with the majority.

Annotator   Accuracy   Weighted Accuracy
GPT-4o      0.710      0.788
Human1      0.833      -
Human2      0.812      -
Human3      0.869      -
Table 1. Accuracy Comparison of GPT-4o and Human Annotations with respect to the Gold Standard.

Model     Accuracy
GPT-3.5   0.430
GPT-4o    0.710
GPT-4.5   0.705
Table 2. Accuracy Comparison among GPT models.

3.4.3 Distributional Similarity
JS divergence is used to measure the similarity between two probability distributions. For each sample, let $P = (p_1, ..., p_k)$ be the aggregated expert distribution (derived from the relative frequencies of the three expert labels) and $Q = (q_1, ..., q_k)$ be the one-hot representation of the GPT prediction. The JS divergence is defined as:
$$\mathrm{JS}(P \parallel Q) = \frac{1}{2} \mathrm{KL}(P \parallel M) + \frac{1}{2} \mathrm{KL}(Q \parallel M)$$
where $M = \frac{1}{2}(P + Q)$ and the Kullback–Leibler divergence $\mathrm{KL}(P \parallel Q)$ is given by:
$$\mathrm{KL}(P \parallel Q) = \sum_{j=1}^{k} p_j \log \frac{p_j}{q_j}$$
A lower JS divergence indicates greater similarity between the GPT-generated and aggregated expert distributions.
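
For a single sample, the Jensen–Shannon divergence between the expert label distribution and a one-hot GPT prediction can be sketched as below. The natural logarithm is an assumption (the text does not state the base), terms with $p_j = 0$ are treated as contributing zero, and the helper names are illustrative.

```python
import math

LABELS = ["HVHA", "HVLA", "LVHA", "LVLA"]


def kl(p, q):
    """KL(P || Q) with the convention 0 * log(0 / q) = 0 (natural log assumed)."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)


def js(p, q):
    """Jensen-Shannon divergence with mixture M = (P + Q) / 2."""
    m = [(pj + qj) / 2 for pj, qj in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def expert_distribution(expert_labels):
    """Relative frequency of each quadrant among the expert labels."""
    return [expert_labels.count(label) / len(expert_labels) for label in LABELS]


def one_hot(label):
    """One-hot vector for a single predicted quadrant."""
    return [1.0 if candidate == label else 0.0 for candidate in LABELS]


p = expert_distribution(["HVHA", "HVHA", "LVHA"])
print(js(p, one_hot("HVHA")))
```

Because $M$ is positive wherever $P$ or $Q$ is, the divergence stays finite even though the one-hot vector contains zeros; with the natural log it is bounded above by $\log 2$.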
4. EXPERIMENTAL RESULTS AND DISCUSSION
4.1 Pre-Test: Stability and Reproducibility of GPT-4o’s Annotation
Before conducting the main evaluation, we performed a pre-test to verify the stability and reproducibility of GPT-4o’s annotations under a deterministic setting (temperature = 0). This configuration makes the model consistently select the most probable output, minimizing randomness and enabling precise evaluation. To test this, GPT-4o was prompted to annotate the same 400 samples across three independent runs. Results showed that 385 samples (96.3%) received identical labels in all runs, and the remaining 15 had two out of three consistent labels. This high level of consistency confirms that temperature = 0 yields stable, repeatable outputs, allowing us to attribute performance differences to the model’s underlying reasoning rather than stochastic variation.
4.2 Accuracy
4.2.1 Overall Annotation Performance
Table 1 reports the accuracy of GPT-4o’s annotations against a human-derived gold standard, evaluated using both binary and weighted accuracy metrics. Binary accuracy, the proportion of high-confidence samples in which GPT-4o’s label exactly matched the gold standard, was approximately 71%. In contrast, weighted accuracy, which accounts for varying degrees of expert consensus by awarding partial credit in cases of partial agreement, increased to approximately 79% (0.788). Although human annotators achieved higher binary accuracy (ranging from approximately 81% to 87% across individual raters), the performance of GPT-4o is still acceptable considering the inherent subjectivity of music emotion annotation and the advantages in cost and efficiency.
4.2.2 Comparison of GPT Model Performance
We evaluated multiple GPT models from OpenAI for music emotion annotation. As shown in Table 2, the newly released GPT-4.5 achieves 70.5% accuracy, essentially matching GPT-4o’s 71% and offering no notable improvement. In contrast, GPT-3.5 reaches only 43%, underscoring significant advancements in the GPT-4 family’s ability to accurately interpret and annotate musical emotion.
4.2.3 Context Ablation Study
To further validate the effectiveness of providing musical context, we conducted an ablation experiment in which GPT-4o was prompted to label the same 400 samples without any additional context information. In this “title-only” condition, GPT-4o had access to nothing more than the music title and composer name, and it achieved a binary accuracy of only 57%. In contrast, when the model was furnished with the context we collected from online sources describing the work’s genre, stylistic background, historical and cultural factors, and composer biography, its accuracy rose significantly to 71%. This gap underscores the importance of contextual information for disambiguating subtle emotional cues, information that purely nominal references (e.g., a piece’s title) cannot reliably convey. Therefore, the contextual metadata can be seen as a crucial signal enabling GPT-4o to better align its annotations with expert judgments.
4.2.4 Consensus-Based Subgroup Analysis
In addition to overall accuracy metrics, we analyzed performance within two expert-consensus subgroups. As shown in Table 3, in the full-consensus group (211 samples in which all experts agreed), GPT-4o correctly annotated 180 samples (85.3%) and erred on 31 samples. In the partial-consensus group (175 samples with a 2/3 expert agreement), GPT-4o matched the majority opinion in 94 samples (53.7%), aligned with the minority opinion in the remaining 60 samples, and classified 21 samples entirely incorrectly. By comparison, human annotators in the partial-consensus subgroup achieved correctness ranging from 114 to 121 out of 175 samples (65.1% to 69.1%). These findings indicate that GPT-4o performs robustly when expert consensus is strong, and in cases of ambiguity, it captures a substantial portion of expert agreement.
4.3 Inter-Annotator Reliability Analysis
Inter-annotator reliability was assessed using both Cohen’s Kappa and Fleiss’ Kappa. Table 4 and Table 5 present the pairwise Cohen’s Kappa and Fleiss’ Kappa values. The Cohen’s Kappa between GPT-4o and the gold standard (obtained via majority voting) was 0.613, indicating moderate agreement. Among the human experts, the pairwise Cohen’s Kappa values were 0.547, 0.568, and 0.569, with an average of 0.561. In addition, the pairwise Cohen’s Kappa values between GPT-4o and individual human annotators were 0.467, 0.593, and 0.607, yielding an average of 0.556. Although GPT-4o’s agreement with individual experts is slightly lower than the agreement observed among human annotators, the differences are relatively minor. At the group level, Fleiss’ Kappa was 0.561 for the human experts and 0.558 when GPT-4o was included, indicating that GPT-4o’s overall variability is comparable to that of the human raters.

Group               N     GPT-4o        Human1        Human2        Human3
Full Consensus      211   180 (85.3%)   211           211           211
Partial Consensus   175   94 (53.7%)    114 (65.1%)   115 (65.7%)   121 (69.1%)
Table 3. Accuracy of GPT-4o and Human Annotators in Full and Partial Consensus Groups.

Annotator   GPT-4o   Human1   Human2   Human3
GPT-4o      N.A.     0.467    0.593    0.607
Human1      0.467    N.A.     0.547    0.568
Human2      0.593    0.547    N.A.     0.569
Human3      0.607    0.568    0.569    N.A.
Table 4. Pairwise Cohen’s Kappa between GPT-4o and human experts.

Metric                                     Value
GPT-4o vs. Gold (Cohen’s Kappa)            0.613
Average GPT-4o vs. Human (Cohen’s Kappa)   0.556
Average Human vs. Human (Cohen’s Kappa)    0.561
Fleiss’ Kappa among Humans                 0.561
Fleiss’ Kappa, GPT-4o and Humans           0.558
Table 5. Summary of inter-annotator reliability metrics in both average Cohen’s Kappa and Fleiss’ Kappa values.

To compare the agreement levels of GPT-4o against each individual expert, we performed bootstrap-based hypothesis testing on the pairwise Cohen’s Kappa statistics. As shown in Table 6, although the average Cohen’s Kappa for GPT-4o versus gold (0.614) is close to the mean human kappa (0.593), bootstrap difference analysis reveals that GPT-4o’s agreement with the gold standard is significantly lower than that of each expert, as evidenced by the 95% confidence intervals of the differences (e.g., [-0.267, -0.135] when comparing to Human3) all lying below zero. This outcome indicates that, on an individual basis, experts agree with the gold standard more consistently than GPT-4o does. However, these pairwise discrepancies should be interpreted in the context of normal inter-rater variance: they highlight that GPT-4o tends to deviate from the gold standard more often than any single human annotator, rather than reflecting the overall group-level consistency. Indeed, a subsequent Fleiss’ Kappa analysis reveals that incorporating GPT-4o into the set of raters does not significantly alter collective agreement: its mean difference from the human-only group is 0.003 with a 95% confidence interval of [-0.019, 0.025], indicating no significant impact. Consequently, while GPT-4o’s labeling differs more often from the gold standard than that of any single human expert, its variability at the group level remains within the range of human disagreement.
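
The paper does not spell out its bootstrap procedure, so the following is only one plausible reading: a percentile bootstrap that resamples the annotated pieces with replacement and recomputes the difference between two raters' Cohen's Kappa against the gold standard. The function names, the number of resamples, the seed, and the handling of degenerate resamples are all assumptions.

```python
import random
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(rater_a)
    p0 = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    if pe == 1.0:
        # Degenerate resample (one label only); convention chosen for this sketch.
        return 1.0
    return (p0 - pe) / (1 - pe)


def bootstrap_kappa_diff(gold, rater_a, rater_b, n_boot=2000, seed=0):
    """Mean and 95% percentile CI of kappa(gold, a) - kappa(gold, b)."""
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pieces with replacement
        g = [gold[i] for i in idx]
        ka = cohens_kappa(g, [rater_a[i] for i in idx])
        kb = cohens_kappa(g, [rater_b[i] for i in idx])
        diffs.append(ka - kb)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return sum(diffs) / n_boot, (lower, upper)
```

A confidence interval that excludes zero, as in the pairwise rows of Table 6, is what supports calling the difference significant; an interval that straddles zero, as in the Fleiss' Kappa row, does not.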
4.3.1 Jensen–Shannon Divergence Analysis
Table 7 shows the squared JS divergence for interpretability, where lower values indicate greater similarity between distributions. On an individual basis, the average squared JS divergence between GPT-4o and each expert was 0.266 (GPT-4o vs. Human1), 0.205 (GPT-4o vs. Human2), and 0.194 (GPT-4o vs. Human3). When comparing GPT-4o’s predictions to the aggregated expert distribution, the average squared JS divergence was lower at 0.175. These results suggest some divergence between GPT-4o’s predictions and individual expert labels, but the overall similarity to the combined expert consensus is moderate, indicating GPT-4o captures much of the collective expert opinion despite minor discrepancies with individual judgments.

Figure 2. Normalized confusion matrix (by gold standard) comparing GPT-4o-generated emotion labels with expert consensus.

4.4 Error Analysis by Category
Error analysis based on the normalized confusion matrices (Figure 2) reveals misclassification patterns of GPT-4o relative to the gold standard. In the high valence–high arousal (HVHA) category, for instance, GPT-4o correctly annotated 73% of samples but misclassified about 16% as low valence–high arousal (LVHA). A similar trend is observed in the low valence–low arousal (LVLA) category, where GPT-4o’s performance is relatively robust and even slightly exceeds that of Human2; however, GPT-4o tends to misclassify some LVLA samples as HVLA. These results suggest that while GPT-4o is relatively adept at capturing the arousal dimension, it struggles to distinguish between positive and negative valence. Notably, this pattern of valence confusion is also present in Human2’s annotations, suggesting that even expert annotators may find it challenging to precisely distinguish the valence of some samples in this category. Additionally, GPT-4o exhibits suboptimal performance in the high valence–low arousal (HVLA) category, indicating a limited capacity to capture emotional nuances. Collectively, these findings highlight that while GPT-4o performs reasonably well in cases with clear emotional cues, its ability to discern fine-grained differences in emotion, particularly those involving the valence dimension, remains limited.

Comparison                                                   Mean Difference   95% CI
GPT-4o vs. Gold minus Human1 vs. Gold                        -0.174            [-0.250, -0.096]
GPT-4o vs. Gold minus Human2 vs. Gold                        -0.176            [-0.241, -0.113]
GPT-4o vs. Gold minus Human3 vs. Gold                        -0.200            [-0.267, -0.135]
Fleiss’ Kappa, Human Experts minus GPT-4o & Human Experts     0.003            [-0.019, 0.025]
Table 6. Bootstrap Differences in Pairwise and Group-Level Kappa.

Comparison                      Average Squared JS Divergence
GPT-4o vs. Human1               0.266
GPT-4o vs. Human2               0.205
GPT-4o vs. Human3               0.194
GPT-4o vs. Aggregated Experts   0.175
Table 7. Average Squared Jensen–Shannon Divergence between GPT-4o and Expert Annotation Distributions.
5. DISCUSSION AND FUTURE WORK
Our results indicate that while GPT-4o’s overall performance in music emotion annotation falls short of human expert-level accuracy, its variability remains within the range of natural disagreement among experts. Notably, GPT-4o tends to conflate emotional categories that differ primarily along the valence dimension, suggesting that valence is more challenging to infer from textual metadata than arousal, a difficulty also reflected in human annotations. This highlights a key limitation of our context-based method: GPT-4o annotations rely exclusively on pre-crawled textual metadata, so the accuracy and granularity of the annotations are limited by the quality and comprehensiveness of the available textual information. Additionally, metadata-based inference rests on the majority cultural consensus encoded in text. While effective in most cases, this premise can mislabel “outlier” works whose affective intent departs from stylistic norms.
A critical issue arising from our findings relates to the subjectivity of music emotion annotation tasks. Music emotion can be approached from two perspectives: perceived emotions (the emotions a listener believes the music is intended to convey) and induced emotions (the emotions the listener personally experiences while listening) [25]. Our methodology explicitly targets perceived emotions via textual contexts (cultural cues and genre conventions that shape listeners’ expectations before a note is heard) as a scalable proxy for annotation, reflecting a socially shared interpretation of what the music expresses rather than the idiosyncratic emotions it might induce. Even so, disagreement among human annotators reminds us that perceived emotion is not wholly objective, underscoring the inherent complexity in music emotion annotation.
Furthermore, relying on synthetic labels produced by large language models such as GPT-4o raises important ethical questions. Automated annotation undoubtedly offers scale and cost efficiency, yet risks eroding the nuanced interpretive judgments human experts provide. Replacing human annotation entirely could introduce systematic biases and oversimplifications, concerns that are especially salient in a domain as subjective as music–emotion research. We therefore advocate a hybrid annotation framework that combines AI-generated initial annotations with human expert oversight for ambiguous cases. For instance, GPT-4o can serve as a first-pass annotator: high-confidence predictions, those backed by clear contextual cues, can be accepted directly after sampling validation, while low-margin or “not-enough-information” cases are routed to human annotators. This division of labour harnesses the scalability of LLMs while preserving essential human insight, and concurrently addresses broader ethical concerns about the responsible use of synthetic data.
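
The first-pass routing idea can be sketched as a simple filter. Everything here is hypothetical: the paper's pipeline returns only a label, so the `margin` field (a confidence margin in [0, 1]), the `min_margin` threshold, and the names `Prediction` and `route` are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NO_INFO = "not enough information"


@dataclass
class Prediction:
    piece_id: str
    label: str
    margin: Optional[float] = None  # hypothetical confidence margin in [0, 1]


def route(predictions: List[Prediction], min_margin: float = 0.2
          ) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into auto-accepted ones and ones sent to human review."""
    accepted, needs_review = [], []
    for p in predictions:
        unsure = p.margin is None or p.margin < min_margin
        if p.label == NO_INFO or unsure:
            needs_review.append(p)  # low-margin or no-information cases
        else:
            accepted.append(p)      # still subject to sampling validation
    return accepted, needs_review
```

Treating a missing margin as "needs review" is the conservative choice: a piece is only accepted automatically when there is positive evidence that the model was confident.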
Future work should focus on several key areas to address current limitations. First, refining prompt engineering by incorporating richer contextual details, such as comprehensive composer biographies, historical and cultural narratives, and program notes, may improve GPT-4o’s ability to capture emotional nuances, particularly along the valence dimension. Second, enabling real-time online context retrieval would give the model access to more dynamic and detailed metadata, potentially improving its discriminative capabilities. Third, adopting the aforementioned hybrid human–AI annotation framework would balance efficiency and quality while also partially addressing the ethical concerns about AI-driven data annotation. Finally, exploring multi-modal approaches that integrate audio features with contextual metadata could provide a more holistic and accurate assessment of musical emotions, further bridging the gap between automated and expert-level annotations. Such models may also extend coverage to brand-new or undocumented works by enabling direct emotion inference from raw audio, complementing our current metadata-dependent pipeline.
Overall, although GPT-4o’s current performance does not entirely substitute for human expertise, its efficiency and scalability present a valuable opportunity to complement human annotations in large-scale music emotion research. Future work should refine prompting, enable online access, combine human–AI annotation, and integrate multimodal inputs to improve quality while addressing ethical and subjective challenges.
6. CONCLUSION
In this study, we presented a novel approach to automatic music emotion annotation using GPT-4o, leveraging contextual metadata extracted via web-crawling to label non-lyrical classical music. Our comprehensive evaluation, incorporating accuracy metrics, inter-annotator reliability measures, distributional similarity, and error analysis, demonstrated that while GPT-4o does not yet match human expert accuracy, its outputs are stable and its variability falls within the range of natural human disagreement. Overall, our findings highlight the potential of LLM annotation as a scalable and cost-effective tool for large-scale music emotion annotation.
7. REFERENCES
[1] Y.-H. Yang and H. H. Chen, “Machine ecogni ion o
music emo ion: A e iew,” ACM T ansac ions on In-
elligen Sys ems and Technology, ol. 3, no. 3, May
2012.
[2] A. Aljanaki, y.-h. Yang, and M. Soleymani, “De el-
oping a benchma k o emo ional analysis o music,”
PLOS ONE, ol. 12, p. e0173392, 03 2017.
[3] A. Aljanaki, F. Wie ing, and R. C. Vel kamp, “S udy-
ing emo ion induced by music h ough a c owdsou c-
ing game,” In o ma ion P ocessing & Managemen ,
ol. 52, no. 1, p. 115–128, Jan. 2016.
[4] L. N. Fe ei a and J. Whi ehead, “Lea ning o gene a e
music wi h sen imen ,” in P oceedings o he Con e -
ence o he In e na ional Socie y o Music In o ma ion
Re ie al, Del , Ne he lands, 2019, pp. 384–390.
[5] H.-T. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and Y.-
H. Yang, “Emopia: A mul i-modal pop piano da ase
o emo ion ecogni ion and emo ion-based music gen-
e a ion,” in In e na ional Socie y o Music In o ma-
ion Re ie al Con e ence, 2021.
[6] M. Barthet, G. Fazekas, and M. Sandler, “Music
emotion recognition: From content- to context-based
models,” in From Sounds to Music and Emotions,
M. Aramaki, M. Barthet, R. Kronland-Martinet, and S. Ystad,
Eds. Berlin, Heidelberg: Springer Berlin Heidelberg,
2013, pp. 228–252.
[7] R. Delbouys, R. Hennequin, F. Piccoli,
J. Royo-Letelier, and M. Moussallam, “Music mood detection
based on audio and lyrics with deep neural net,” ArXiv,
vol. abs/1809.07276, 2018.
[8] F. H. Rachman, R. Sarno, and C. Fatichah, “Music
emotion detection using weighted of audio and lyric
features,” in 2020 6th Information Technology
International Seminar (ITIS), 2020, pp. 229–233.
[9] X. Hu, K. Choi, and J. S. Downie, “A framework
for evaluating multimodal music mood classification,”
Journal of the Association for Information Science
and Technology, vol. 68, 2017. [Online]. Available:
https://api.semanticscholar.org/CorpusID:45480061
[10] Y. Agrawal, R. G. R. Shanker, and V. Alluri,
“Transformer-based approach towards music emotion
recognition from lyrics,” in Advances in Information
Retrieval: 43rd European Conference on IR Research,
ECIR 2021, Virtual Event, March 28 – April 1, 2021,
Proceedings, Part II, Berlin, Heidelberg, 2021,
pp. 167–175.
[11] X. Hu and J. S. Downie, “Exploring mood metadata:
Relationships with genre, artist and usage metadata,” in
International Society for Music Information Retrieval
Conference, 2007. [Online]. Available:
https://api.semanticscholar.org/CorpusID:16794525
[12] Q. Kong, B. Li, J. Chen, and Y. Wang,
“GiantMIDI-Piano: A large-scale MIDI dataset for classical piano
music,” Transactions of the International Society for
Music Information Retrieval, May 2022.
[13] D. Turnbull, L. Barrington, D. Torres, and
G. Lanckriet, “Towards musical query-by-semantic-description
using the CAL500 data set,” in Proceedings of the 30th
Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, ser.
SIGIR ’07. New York, NY, USA: Association for
Computing Machinery, 2007, pp. 439–446.
[14] P. L. Louro, H. Redinho, R. Santos, R. Malheiro,
R. Panda, and R. P. Paiva, “MERGE – a bimodal dataset
for static music emotion recognition,” 2025. [Online].
Available: https://arxiv.org/abs/2407.06060
[15] Y. E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton,
P. Richardson, J. J. Scott, J. A. Speck, and D. Turnbull,
“Music emotion recognition: A state of the art
review,” in International Society for Music Information
Retrieval Conference, 2010.
[16] P. Donnelly and A. Beery, “Evaluating large-language
models for dimensional music emotion prediction from
social media discourse,” in Proceedings of the 5th
International Conference on Natural Language and
Speech Processing (ICNLSP 2022). Trento, Italy:
Association for Computational Linguistics, Dec. 2022,
pp. 242–250.
[17] Y. E. Kim, E. M. Schmidt, and L. Emelle,
“Moodswings: A collaborative game for music mood
label collection,” in International Society for Music
Information Retrieval Conference, 2008. [Online].
Available: https://api.semanticscholar.org/CorpusID:14382686
[18] Z. Tan, D. Li, S. Wang, A. Beigi, B. Jiang,
A. Bhattacharjee, M. Karami, J. Li, L. Cheng, and H. Liu,
“Large language models for data annotation and
synthesis: A survey,” in Proceedings of the 2024 Conference
on Empirical Methods in Natural Language Processing.
Miami, Florida, USA: Association for Computational
Linguistics, Nov. 2024, pp. 930–957.
[19] F. Gilardi, M. Alizadeh, and M. Kubli, “ChatGPT
outperforms crowd workers for text-annotation tasks,”
Proceedings of the National Academy of Sciences of
the United States of America, vol. 120, 2023.
[20] Z. He, C.-Y. Huang, C.-K. C. Ding, S. Rohatgi, and
T.-H. K. Huang, “If in a crowdsourced data annotation
pipeline, a GPT-4,” in Proceedings of the 2024 CHI
Conference on Human Factors in Computing Systems,
ser. CHI ’24. New York, NY, USA: Association for
Computing Machinery, 2024.
[21] S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng, “Want
to reduce labeling cost? GPT-3 can help,” in
Findings of the Association for Computational Linguistics:
EMNLP 2021. Punta Cana, Dominican Republic:
Association for Computational Linguistics, Nov. 2021,
pp. 4195–4205.
[22] D. Edmonds and J. Sedoc, “Multi-emotion classification
for song lyrics,” in Proceedings of the Eleventh
Workshop on Computational Approaches to Subjectivity,
Sentiment and Social Media Analysis. Online:
Association for Computational Linguistics, Apr. 2021,
pp. 221–235.
[23] P. Donnelly and A. Beery, “Evaluating large-language
models for dimensional music emotion prediction
from social media discourse,” in Proceedings of the
5th International Conference on Natural Language
and Speech Processing (ICNLSP 2022), M. Abbas
and A. A. Freihat, Eds. Trento, Italy: Association
for Computational Linguistics, Dec. 2022, pp. 242–250.
[Online]. Available:
https://aclanthology.org/2022.icnlsp-1.28/
[24] J. Russell, “A circumplex model of affect,” Journal of
Personality and Social Psychology, vol. 39, no. 6, pp.
1161–1178, 1980.
[25] A. Gabrielsson, “Emotion perceived and emotion felt:
Same or different?” Musicae Scientiae, vol. 5, no.
1_suppl, pp. 123–147, 2001.