Are You Really Listening? Boosting Perceptual Awareness in Music-QA Benchmarks

Author: Yongyi Zang; Sean O'Brien; Taylor Berg-Kirkpatrick; Julian McAuley; Zachary Novack
Publisher: Zenodo
DOI: 10.5281/zenodo.17706385
Source: https://zenodo.org/records/17706385/files/000029.pdf
ARE YOU REALLY LISTENING?
BOOSTING PERCEPTUAL AWARENESS IN MUSIC-QA BENCHMARKS
Yongyi Zang¹, Sean O’Brien², Taylor Berg-Kirkpatrick²,
Julian McAuley², Zachary Novack²
¹ Independent Researcher, ² University of California, San Diego
[email protected], {seobrien, tberg, jmcauley, znovack}@ucsd.edu
ABSTRACT
Large Audio Language Models (LALMs), where pre-trained text LLMs are finetuned with audio input, have made remarkable progress in music understanding. However, current evaluation methodologies exhibit critical limitations: on the leading Music Question Answering benchmark, MuChoMusic, text-only LLMs without audio perception capabilities achieve surprisingly high accuracy of up to 56.4%, much higher than chance. Furthermore, when presented with random Gaussian noise instead of actual audio, LALMs still perform significantly above chance. These findings suggest existing benchmarks predominantly assess reasoning abilities rather than audio perception. To overcome this challenge, we present RUListening, a framework that enhances perceptual evaluation in Music-QA benchmarks. We introduce the Perceptual Index (PI), a quantitative metric that measures a question’s reliance on audio perception by analyzing log probability distributions from text-only language models. Using this metric, we generate synthetic, challenging distractors to create QA pairs that necessitate genuine audio perception. When applied to MuChoMusic, our filtered dataset successfully forces models to rely on perceptual information: text-only LLMs perform at chance levels, while LALMs similarly deteriorate when audio inputs are replaced with noise. These results validate our framework’s effectiveness in creating benchmarks that more accurately evaluate audio perception capabilities.
1. INTRODUCTION
Large language models (LLMs) have achieved impressive reasoning capabilities [1] and strong zero- and few-shot performance across NLP tasks [2], but are limited to only processing textual information. This constraint has driven the development of Multimodal LLMs (MLLMs), which extend LLMs to process, reason over, and generate multimodal content like images or videos [3]. Large Audio Language Models (LALMs) specifically add audio perception and reasoning capabilities to LLMs. Evaluating LALMs is challenging, as conventional metrics like BLEU [4] struggle with diverse outputs. QA frameworks like MuChoMusic [5] address this by transforming evaluation into classification tasks with predefined choices, making them well-suited to assessing music capabilities in LALMs.

Figure 1. Text-only LMs and LALMs’ performance on the Music QA benchmark MuChoMusic [5]. OpenMU is finetuned on Llama 3 8B, yet performs worse than it.

© Y. Zang, S. O’Brien, T. Berg-Kirkpatrick, J. McAuley and Z. Novack. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Zang, S. O’Brien, T. Berg-Kirkpatrick, J. McAuley and Z. Novack, “Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
However, we discover a concerning issue: text-only models often select correct answers even without multimodal input, nearly matching the performance of multimodal models. We evaluated 11 text-only LLMs against state-of-the-art LALMs on the premier Music QA benchmark MuChoMusic [5] (see Figure 1). Surprisingly, we found that text-only models can perform well even without audio perception ability, with eight models reaching accuracy over 50%, two of which are even of similar parameter size as LALMs. Even more telling, OpenMU [6], a LALM finetuned from Llama 3 8B, performs worse on this benchmark than its text-only Llama 3 8B foundation, despite having access to the audio. As mentioned in the MuChoMusic paper and per our re-evaluation (see Fig. 2), when presented with Gaussian noise as input, the LALMs only show very limited performance decline, nowhere near to chance level. We present a hypothesis for this phenomenon: the strong initialization of text-only reasoning capabilities allows LLMs to solve QA benchmarks without true audio perception, creating an illusion of understanding.

Figure 2. LALM performance with original input vs. Gaussian noise input on MuChoMusic [5].
To address this challenge, we introduce RUListening, a framework to boost existing QA benchmarking datasets, where we generate distractors that require active perception to be distinguished from correct answers. Starting with audio descriptions, questions, and correct answers, we prompt a text-only model to generate plausible yet incorrect candidates. We define "perceptual index" (PI) as the need for perceptual information, calculated from log-probabilities of distractors being selected by a text-only model. We optimize based on this metric to select four distractors per question/answer pair. We additionally employ a leave-one-out strategy for 4-fold cross-validation, ensuring robust assessment of models’ perceptual capabilities.
Empirically, filtering MuChoMusic through RUListening reduces text-only models to near-chance performance, confirming reasoning alone cannot solve these questions. When audio inputs to LALMs are replaced with Gaussian noise, their performance also plummets to near-or-below-chance levels, confirming sensitivity to perceptual abilities. Additionally, we find the PI metric (derived from a single text-only LM) strongly correlates with performance across all text-only LMs, validating our methodology’s generalizability and effectiveness at boosting genuine audio perception capabilities.
To the best of our knowledge, this represents the first research to evaluate text-only LMs on Music QA benchmarks, exploring the reasoning and perception ability of LALMs separately, and the first to propose such a methodology for boosting QA benchmarks to specifically emphasize perceptual capabilities. We believe our work advances the community’s approach to benchmarking LALMs. We open-source all code and evaluation scripts at https://github.com/yongyizang/AreYouReallyListening and RUL-MuChoMusic at https://huggingface.co/datasets/yongyizang/RUListening under MIT License to facilitate further research.
2. RELATED WORK
2.1 LALMs
Large Audio Language Models (LALMs) combine audio encoders with fine-tuned LLMs to process audio alongside text tokens. Pengi [7] pioneered this architecture, achieving state-of-the-art results on audio classification tasks. This breakthrough inspired numerous open-source models including LTU [8], LTU-AS [9], SALMONN [10], FUTGA [11], AudioGPT [12], GAMA [13], JMLA [14], and Audio Flamingo [15], plus open-access alternatives like Qwen-Audio [16] and Qwen2-Audio [17]. Research has prioritized scaling parameters and datasets over improving data quality or audio representations [18]. While these models show enhanced performance on basic tasks, they still face limitations in real-world applications [19].
2.2 Music QA Benchmarks
LALM benchmarks evaluate either specific musical attributes (tonality, genre, instrument identification) or overall music understanding through audio description and musical inquiry tasks [20–23]. For question-answer pairs, many works [5, 20, 21, 23, 24] use the MusicCaps collection [25], while others [21, 22] create new datasets by using LLMs to convert existing annotations from MusicCaps or MagnaTagaTune [26] into structured QA formats, producing datasets like MusicQA and MusicInstruct. MMAU [27] represents a recent advancement that balances information extraction (perception) and reasoning questions. Some research focuses on evaluating models trained on symbolic music representations [5, 28–30], with MuChin [28] using non-multiple-choice Chinese text and both MusicTheoryBench and ZIQI-Eval targeting text-oriented LLMs through symbolic notation rather than audio. Meanwhile, multimodal capability evaluation appears in works like AIR-Bench [31], which includes music-related assessments within broader audio comprehension, and MuChoMusic [5], which employs LLMs with human verification to generate question-answer pairs from audio descriptions, creating more robust benchmarks for comprehensive music understanding evaluation.
2.3 Multimodal Perception Benchmarks
Various benchmarks assess multimodal reasoning abilities. Beyond those discussed above, MMMU [32] provides a multi-discipline dataset for evaluating vision models’ multimodal reasoning, while Mementos [33] tests reasoning over long image sequences. However, perception assessment remains relatively underexplored compared to reasoning evaluation. Chen et al. [34] found that many vision language model benchmark questions can be answered without visual input or rely on textual components from training data. They developed a filtering methodology that uses text-only language models to answer questions and takes their accuracy as a measure of how much a question relies on the visual modality. To our knowledge, no similar work exists for audio or music language models.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3. REASONING IS ENOUGH TO SOLVE
CURRENT MUSIC QA BENCHMARK
We begin by assessing the extent to which the current Music QA benchmark requires perception. To do so, we evaluate text-only LMs, which have no perception but strong reasoning capabilities, on the MuChoMusic benchmark, comparing them against LALMs, which have both perception and reasoning capabilities. This comparison allows us to quantify the importance of perceptual abilities in successfully addressing music-related questions.
For text-only LMs, we evaluate 11 SOTA models across <3B, <8B, <32B, <72B and >72B parameter ranges: Gemma 2B and Llama 3.2 3B; Llama 3 8B [35] and Qwen 2.5 7B [36]; Mixtral 8x7B [37] and Gemma 27B [38]; Mixtral 8x22B [39], Qwen 2.5 72B, and Llama 3.1 70B; and Llama 3.1 405B and DeepSeek V3 671B [40] for large models. For LALMs, we evaluated top MuChoMusic benchmark performers including Audio Flamingo 2 [18], OpenMU [6], Qwen-Audio [16] and Qwen2-Audio [17], reporting results from original model papers or the MuChoMusic paper when available.
We observed that the standard evaluation prompt provided by MuChoMusic often resulted in the models declining to answer, with responses indicating they could not perceive the audio. To complete the evaluation, we modified the prompt to:

“Provide your best guess of this question. You must guess one, even if you did not hear the audio. Think step by step.”

This change prompted text-only LMs to generate answers. For LALMs, we couldn’t reproduce the reported results (actual performance was lower) due to fluctuations from text decoding hyperparameters. For fair comparison, we cite the numbers from their original papers.
Figure 1 presents the evaluation results. Notably, eight text-only LMs are able to achieve an overall accuracy of more than 50%, with two of them (Llama 3 8B and Qwen 2.5 7B) of similar size as LALMs. Even more surprisingly, although OpenMU was fine-tuned from Llama 3 8B, its performance falls short of the base Llama 3 8B model. As mentioned in the MuChoMusic paper ([5], see Fig. 6b) and per our re-evaluation¹ (see Fig. 2), when presented with Gaussian noise as input, the LALMs only show very limited performance decline, nowhere near chance level; one LALM even showed performance improvement. These findings suggest that reasoning capabilities alone, even without perception, may be sufficient to solve the MuChoMusic benchmark.
But how is this possible? How can text-only models answer music QAs without hearing the music? Through manual inspection, we identified two key reasons. First, MuChoMusic’s design includes a "correct but irrelevant" distractor providing contextual information. While this distractor is helpful in analyzing error modes, text-only LMs can exploit it to make educated guesses. Second, we hypothesize that text-only LMs are using their inherent "world prior" knowledge from their training. They may understand that 80s pop songs likely feature synthesizers rather than orchestras, and folk songs typically contain acoustic rather than distorted electric guitars. This broad world knowledge could have made much of the multimodal input redundant for answering questions, reducing the benchmark’s effectiveness in evaluating LALMs’ true perceptual capabilities.

¹ OpenMU [6] could not be evaluated under noise input on MuChoMusic: its language capabilities become too severely degraded and it refuses to answer the question, despite our best efforts at changing hyperparameters and prompts.
To provide a concrete understanding of this phenomenon, we display a prototypical example through Llama 3.1 70B’s response to the following question²:

"Which instrument initiates the piece?
Options: (A) Synthesizer (B) Rueful tune (C) Acoustic guitar (D) Vocals"

Llama 3.1 70B was able to reason through this question without audio input and still arrive at the correct answer (C). It starts by eliminating impossible answers:

To make an educated guess, I’ll consider common musical patterns and the options provided. Option (B) “Rueful tune” doesn’t seem to be a specific instrument, so I’ll exclude that choice.

Then it attempts to leverage its prior knowledge about the distribution of music:

Between the remaining options, I’ll consider which instrument is most likely to initiate a piece. Synthesizers (A) can be used to create a wide range of sounds, but they’re not always the first choice to start a song. Vocals (D) can initiate a piece, but it’s not uncommon for an instrument to set the tone before the vocals begin. Acoustic guitar (C) is a popular choice to start a song, especially in genres like folk, rock, and pop. It’s often used to create a simple, yet effective melody that sets the tone for the rest of the piece.
Based on this reasoning, my guess is: C.
4. CONSTRUCTING RULISTENING
4.1 Methodology
We define a Music-QA benchmark as a set of audio-question-answers triplets $(a, q, Y)$ of audio clip $a$, question $q$, and set of answers $Y$, and can further decompose $Y = \{c\} \cup D$, where $c$ is the correct answer and $D$ is the set of incorrect distractors. Under this definition, an effective benchmark for audio perception should present questions that are challenging without audio but solvable with audio access. Formally, let $p_{\text{text}}(Y \mid q)$ represent the total probability over all given answers of a text-only LM (i.e. $p_{\text{text}}(Y \mid q) = \sum_{y_i \in Y} p_{\text{text}}(y_i \mid q)$), and $p_{\text{LALM}}(Y \mid q, a)$ represent the corresponding probability of a LALM.

Ideally, if one wants to measure the multimodal perception abilities of LALMs, a Music-QA question should elicit a noticeable information gain when conditioning on the audio, i.e., $p(c \mid q, a) \gg p(c \mid q)$. Using this principle for benchmark design gives us two options for increasing the information gain: create $(a, q, Y)$ triplets that are unimodally difficult (i.e. reduce $p(c \mid q)$), or design questions and correct answers highly perceptually aligned with audio (i.e. increase $p(c \mid q, a)$). We prioritize the former as the latter is problematic: constructing new QA-pairs is unscalable with current systems, and using LALMs to automate this would contaminate the benchmark’s evaluative purpose and rely too much on questionable LALM capabilities. We therefore focus on creating benchmark items where questions challenge text-only LMs while maintaining the expert-verified relationship between $(a, q, c)$. We formalize this as finding optimal distractor sets $D^*$ that maximize the probability of text-only models selecting incorrect answers. We define the need for perceptual information as "perceptual index," or PI:

$$\mathrm{PI}(q, Y, D) = \frac{p_{\text{text}}(D \mid q)}{p_{\text{text}}(Y \mid q)} \tag{1}$$

which is equivalent to the QA-normalized error probability. This metric ranges from 0 to 1, with values close to 1 indicating questions where a text-only model is more likely to select incorrect answers (i.e., $p_{\text{text}}(D \mid q) \gg p_{\text{text}}(c \mid q)$). Since we cannot modify the audio, question, or correct answer without compromising the integrity of the expert-verified content, we restrict our optimization to finding distractor sets that maximize this perceptual index metric. Importantly, PI does not perfectly correlate with the entropy of the answer space; a high PI may reflect a model that is confidently incorrect (selecting a wrong answer with high probability), thus exhibiting low entropy. We prefer PI over entropy as our optimization target precisely because PI captures the maximum possible performance gap between modalities: the distance between being confidently wrong (high PI) and correct is necessarily larger than between being uncertain (high entropy) and correct, thereby providing a stronger signal for identifying questions that genuinely require perceptual information.

² We provide more examples of this in Appendix A.
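As an illustration of Eq. (1), the following minimal Python sketch computes PI from the log-probabilities that a text-only model assigns to each answer letter. The function name `perceptual_index`, the dictionary layout, and the example numbers are assumptions made for this sketch; they are not taken from the released code.

```python
import math


def perceptual_index(logprobs, correct_label):
    """Return PI = p_text(D|q) / p_text(Y|q) for one question.

    logprobs: dict mapping an answer label (e.g. "A") to the log-probability
              a text-only LM assigns to that label (hypothetical layout).
    correct_label: label of the correct answer c.
    """
    if correct_label not in logprobs:
        raise KeyError(f"unknown correct label: {correct_label!r}")
    # Convert log-probabilities back to probabilities.
    probs = {label: math.exp(lp) for label, lp in logprobs.items()}
    total_mass = sum(probs.values())  # p_text(Y|q)
    distractor_mass = sum(  # p_text(D|q)
        p for label, p in probs.items() if label != correct_label
    )
    return distractor_mass / total_mass


# Hypothetical log-probabilities for a four-choice question whose answer is C.
example_logprobs = {
    "A": math.log(0.1),
    "B": math.log(0.2),
    "C": math.log(0.6),
    "D": math.log(0.1),
}
example_pi = perceptual_index(example_logprobs, correct_label="C")
```

With these made-up values the distractors hold 0.4 of the probability mass, so PI is approximately 0.4. Normalizing by the total mass keeps the metric in [0, 1] even when the model places some probability on tokens other than the four answer letters.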
4.2 Generating Distractors Set
To arrive at a set of distractors $D^*$ that maximizes PI, we begin by generating a large pool of possible distractors $D$, then filter through them to arrive at the highest-PI set of distractors. We leverage the DeepSeek-V3 model to do this. We use a prompt template including question text, audio description, and correct answer, and prompt the LLM to generate multiple candidates. This process is repeated multiple times, allowing us to sample multiple batches for diversity. Finally, we apply cleaning and deduplication processes. We explicitly prompt the model to maintain stylistic consistency across answers, and only generate answers that are 1) plausible and 2) distinctly different from the correct answer. We use in-context learning examples to enforce structured output using XML tags, then extract possible distractors using regular expressions and apply text normalization.
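The extraction, normalization, and deduplication steps above could look roughly like the sketch below. The tag name `<distractor>`, the normalization rules, and the sample LLM outputs are all hypothetical; the paper states only that XML tags, regular expressions, and text normalization are used.

```python
import re

# Assumed tag name; the actual prompt template may use a different one.
TAG_PATTERN = re.compile(r"<distractor>(.*?)</distractor>", re.DOTALL)


def normalize(text):
    """Collapse whitespace and drop a trailing period (assumed rules)."""
    return re.sub(r"\s+", " ", text).strip().rstrip(".").strip()


def extract_distractors(batches, correct_answer):
    """Collect unique candidates from several raw LLM outputs."""
    # Seed the seen-set with the correct answer so it is never kept.
    seen = {normalize(correct_answer).casefold()}
    pool = []
    for raw_output in batches:
        for match in TAG_PATTERN.findall(raw_output):
            candidate = normalize(match)
            key = candidate.casefold()  # case-insensitive deduplication
            if candidate and key not in seen:
                seen.add(key)
                pool.append(candidate)
    return pool


# Hypothetical raw outputs from two sampling rounds.
sample_batches = [
    "<distractor>Electric guitar</distractor><distractor> acoustic guitar. </distractor>",
    "<distractor>electric  guitar</distractor>\n<distractor>Banjo</distractor>",
]
pool = extract_distractors(sample_batches, correct_answer="Acoustic guitar")
```

Here the second and third candidates are dropped because, after normalization, they match the correct answer and an earlier candidate respectively, leaving "Electric guitar" and "Banjo" in the pool.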
Figure 3. Semantic distribution of distractors.

To analyze the distribution of generated distractors, we employed two distinct models: T5 [41], a text-only transformer encoder, and the text branch of CLAP [42], a joint audio-text representation model. For each distractor and correct answer pair, we calculated the cosine similarity between their respective embeddings. The T5 similarity distribution captures the natural language semantic relationships, while CLAP’s text encoder reveals the music domain-specific relationships. Figure 3 presents histograms of these semantic similarity distributions. We observe that distractors cluster tightly in the text semantic space yet spread more widely in the music semantic space, indicating items that are textually similar (e.g., “Acoustic guitar” and “Electric guitar”) but musically distinct. This confirms our distractors maintain musical variety while minimizing textual differences (thus preventing leakage).
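The similarity computation itself is simple once embeddings are available. The sketch below shows only the cosine-similarity step on small made-up vectors; producing real T5 or CLAP embeddings is outside its scope, and the helper names and toy vectors are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def similarities_to_correct(correct_vec, distractor_vecs):
    """Similarity of each distractor embedding to the correct-answer embedding."""
    return [cosine_similarity(correct_vec, d) for d in distractor_vecs]


# Toy 2-D "embeddings" standing in for encoder outputs (hypothetical values).
correct_embedding = [3.0, 4.0]
distractor_embeddings = [[4.0, 3.0], [3.0, 4.0], [-4.0, 3.0]]
sims = similarities_to_correct(correct_embedding, distractor_embeddings)
```

Running the same function once on embeddings from a text encoder and once on embeddings from a music-aware text encoder gives the two distributions that the histograms compare.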
4.3 Filtering Based on Perceptual Index
After obtaining the distractor set $D$, we begin filtering for $D^*$. To calculate the probability of each answer being selected, $p_{\text{text}}(y \mid q)$, we use the log probability of a lightweight text-only LLM, Qwen 2.5 7B. Specifically, we prompt the model with:

“Provide your best guess of this question. The question is: {question} The answer candidates are: (A) ... (B) ... (C) ... (D) ... Answer without the parenthesis. The most likely answer is”

We then take the log probability of the immediate next token being A, B, C or D to represent the probability of each corresponding answer being selected, following the methodology of [43]. Empirically, we found this method worked well when evaluating <4 distractors, likely because real-world multiple-choice questions often have four choices. As such, we begin by randomly selecting sets of three distractors, then evaluate them with the correct answer in random order. For each set of four choices, we take the distractor with the highest probability; we do this recursively until we are left with four distractors. These four distractors have the highest $p_{\text{text}}(D \mid q)$, and thus reasonably approximate the set $D^*$ that yields the largest $\mathrm{PI}(q, Y, D)$. During evaluation, we implement a leave-one-out strategy: within the four distractors, we remove one at each iteration. This approach provides four distinct answer passes for each QA pair. Our design serves two purposes: (1) having 4 answers aligns with the real-world distribution of multiple-choice questions; and (2) it enhances our robustness against variations in distractors.
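The recursive selection and the leave-one-out passes can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: `score_fn` stands in for the text-only LM's next-token probabilities, the toy weights are invented, and the rule for topping up a round that would leave fewer than four survivors is an assumption, since the text does not specify it.

```python
import random


def tournament_filter(distractors, correct, score_fn, rng, keep=4, group_size=3):
    """Repeatedly keep the most probable distractor of each random group."""
    pool = list(distractors)
    while len(pool) > keep:
        rng.shuffle(pool)
        winners, losers = [], []
        for start in range(0, len(pool), group_size):
            group = pool[start:start + group_size]
            options = group + [correct]
            rng.shuffle(options)  # present the choices in random order
            probs = score_fn(options)  # maps each option to its probability
            ranked = sorted(group, key=lambda d: probs[d], reverse=True)
            winners.append(ranked[0])
            losers.extend((probs[d], d) for d in ranked[1:])
        if len(winners) < keep:
            # Assumption: top up with the best-scoring eliminated candidates.
            # Scores come from different groups, so this is only approximate.
            losers.sort(key=lambda pair: pair[0], reverse=True)
            winners.extend(d for _, d in losers[:keep - len(winners)])
        pool = winners
    return pool


def leave_one_out_passes(distractors, correct):
    """Build one answer set per removed distractor."""
    return [
        distractors[:i] + distractors[i + 1:] + [correct]
        for i in range(len(distractors))
    ]


# Hypothetical plausibility weights standing in for a real language model.
TOY_WEIGHTS = {
    "Acoustic guitar": 3.0,  # correct answer
    "Electric guitar": 5.0, "Classical guitar": 4.0, "Banjo": 2.5,
    "Mandolin": 2.0, "Ukulele": 1.5, "Harp": 1.0,
    "Piano": 0.8, "Violin": 0.6, "Flute": 0.4,
}


def toy_score(options):
    """Normalize the toy weights over the presented options."""
    total = sum(TOY_WEIGHTS[o] for o in options)
    return {o: TOY_WEIGHTS[o] / total for o in options}


correct = "Acoustic guitar"
candidates = [name for name in TOY_WEIGHTS if name != correct]
selected = tournament_filter(candidates, correct, toy_score, random.Random(0))
passes = leave_one_out_passes(selected, correct)
```

Each pass contains three distractors plus the correct answer, which matches the four-choice format described above. The sketch assumes candidate strings are unique and differ from the correct answer.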
Figure 4. Correlation between Perceptual Index (PI) and question accuracy on RUL-MuChoMusic. Text-only LMs show stronger negative correlation, indicating greater influence from lack of perception.
5. RESULTS
We evaluate the aforementioned 11 text-only LMs and the 4 top-performing LALMs on both MuChoMusic and our proposed modified version, RUL-MuChoMusic. For all models evaluated, we report both the mean performance and 95% confidence intervals.
5.1 Validity of Perceptual Index
To validate the Perceptual Index (PI) as an effective surrogate for overall LLM performance, we analyzed question-level accuracy across all 11 LLMs (44 response passes). For each question, we computed accuracy across all attempts and correlated it with the question’s PI. We observe a strong negative Pearson correlation of -0.738 as shown in Figure 4(a), indicating a highly significant relationship where high PI corresponds to low question accuracy. These results confirm PI effectively predicts text-only LMs’ ability to answer questions using solely textual information.

Similarly, calculating the correlation between PI and question-level accuracy across all 4 LALMs (16 passes) reveals a weaker negative Pearson correlation of -0.331, as shown in Fig. 4(b), suggesting the need for perception is significantly higher. This validates PI as an effective metric for optimizing distractor sets to maximize the performance gap between text-only LMs and LALMs.
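For reference, the correlation analysis can be reproduced in a few lines. The sketch below implements the Pearson coefficient directly and applies it to invented per-question values; the numbers are illustrative only and are not the paper's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-question PI values and accuracies over all answer passes.
toy_pi = [0.10, 0.40, 0.70, 0.90]
toy_accuracy = [1.00, 0.75, 0.50, 0.25]
toy_r = pearson_r(toy_pi, toy_accuracy)
```

A coefficient near -1 on such data means questions with higher PI are answered correctly less often, which is the pattern reported for text-only LMs.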
Additionally, we plot the perceptual index distribution across all questions for both MuChoMusic and RUL-MuChoMusic. The only difference between these benchmarks is the distractor set. As shown in Figure 5, MuChoMusic PI values follow an approximately Gaussian distribution with mean 0.427 and large variance, indicating many questions can be answered substantially through text modality alone without requiring music information. This aligns with our observation that text-only language models score highly on MuChoMusic. In contrast, RUL-MuChoMusic achieves a significantly higher PI distribution with mean 0.861 and lower variance, demonstrating greater dependence on music modality for correct answers. For identical questions, our generated and filtered distractors consistently increase PI compared to the original benchmark (mean increase of 0.338), with some questions showing increases exceeding 0.9.³ These results confirm our generation and filtering pipeline effectively reduces text-only answering capability, creating a more robust multimodal evaluation benchmark.

Figure 5. Distribution of PI on MuChoMusic and RUL-MuChoMusic. MuChoMusic exhibits overall less reliance on perceptual modality compared to RUL-MuChoMusic.
5.2 Benchmark Results of Text-only LMs and LALMs
We present comprehensive results of text-only LLMs and LALMs in Figure 6. Several key patterns emerge from our analysis. Across all models, we observe a consistent decrease in accuracy scores, indicating that RUL-MuChoMusic presents a greater challenge than MuChoMusic; text-only LMs perform at near-chance levels, validating our approach. Importantly, OpenMU (4th-place) outperforms its text-only subcomponent (Llama 3 8B, 12th-place), suggesting enhanced music perception capabilities. The text-only LMs that managed to place in the top-10 possess much larger parameter counts (405B, 72B, 27B, 671B, 70B, and 56B) compared to the sub-7B audio models.
³ We present examples of highest and lowest distractor PI change in Appendix C.

Figure 6. Benchmarking results on RUL-MuChoMusic. Error bar displays 95% confidence interval.

Figure 7. LALM performance with original input vs. Gaussian noise input on RUL-MuChoMusic.

Though RUListening effectively increases unimodal difficulty (see Sec. 5.1), most LALMs besides Qwen2-Audio demonstrate relatively poor performance, as multimodal difficulty was not used in construction. Due to Qwen2-Audio’s broad use across various tasks [44], its strong performance is expected. To quantitatively assess whether poor results stem from inherent model limitations or benchmark design flaws, we evaluated all LALMs using 10-second samples of random Gaussian noise to probe their sensitivity to audio input. Results appear in Figure 7. While all models previously performed above chance, noise inputs drove performance to near or below chance levels. Qwen2-Audio showed the most dramatic performance degradation, while Audio Flamingo 2 demonstrated the least sensitivity to noise, possibly related to its weaker reasoning abilities. When comparing to MuChoMusic [5], only 2 LALMs show significant degradation with noise input, yet nowhere near chance-level performance, suggesting RUListening provides stronger evaluation of audio perception.
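A noise input of the kind used in this probe can be generated with the standard library alone. In the sketch below the 16 kHz sample rate, the standard deviation, the clipping to [-1, 1], and the fixed seed are all assumptions; the paper specifies only that the clips are 10 seconds of random Gaussian noise.

```python
import random


def gaussian_noise_clip(duration_s=10.0, sample_rate=16000, std=0.1, seed=0):
    """Return a mono waveform of zero-mean Gaussian noise as a list of floats."""
    rng = random.Random(seed)  # seeded so the probe is reproducible
    num_samples = int(duration_s * sample_rate)
    # Clip to the usual floating-point audio range of [-1, 1].
    return [
        max(-1.0, min(1.0, rng.gauss(0.0, std)))
        for _ in range(num_samples)
    ]


noise_clip = gaussian_noise_clip()
```

Feeding the same questions to a model once with the real audio and once with such a clip gives the two accuracy figures that the comparison relies on.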
Examining LALM response patterns reveals additional insights.⁴ Audio Flamingo 2 [18] exhibits limited reasoning ability, often generating direct answers. In contrast, Qwen2-Audio frequently produces extended reasoning chains. This suggests reasoning capability may be crucial for success on Music QA benchmarks, as also demonstrated by recent research exploring LALM multimodal fine-tuning techniques for reasoning models.
6. DISCUSSION
While establishing RUL-MuChoMusic as a more effective perception-testing benchmark compared to MuChoMusic, we acknowledge a fundamental limitation in our work: the quality of our benchmark is inherently constrained by the quality of the provided question-answer pairs.
Our manual inspection revealed several issues with the original dataset. Some problems stem from the LLM-assisted methodology used to create question-answer pairs from human captions. Questions with IDs 448 and 665 have "Not specified in the description" as correct answers, while eight others contain phrases like "based on the description" despite no description being provided during benchmarking. Human captions sometimes include uninferable metadata: we found 17 questions/answers with "recorded in" phrases, though recording location cannot be determined solely from audio. Some questions are challenging even for human experts, such as identifying specific banjo types (question ID 730). Some correct answers inadequately address their questions; for instance, question ID 832 asks "Who is the primary vocalist in the song?" with "Male" as the correct answer, and question ID 7 asks "What is used at the very beginning to make the track sound vintage?" with the overly simplistic answer "Effect."

⁴ We include response examples in Appendix B.
To quantify this issue, we employed Claude 3.7 Sonnet to evaluate whether questions and answers made sense based on the audio text captions. The model identified 201 out of 1187 pairs (16.9%) as problematic. However, we observed that Claude itself made errors in this evaluation process. For example, it incorrectly flagged question ID 1125, claiming that the correct answer "Digital bass sound" was inconsistent with the audio description mentioning a "synthesizer bassline that is repeating."
These issues highlight a broader challenge in Music QA benchmark construction: human-written benchmarks are time-consuming to develop, yet LLMs are error-prone when used as discriminators or assistants. The question of how to effectively balance these approaches remains an important area for future research. We hope our work serves as a starting point in encouraging researchers to critically examine the effectiveness of Music QA benchmarks.
7. CONCLUSION
We introduce RUListening, a methodology for improving the perceptual relevance of LALM QA benchmarks. By demonstrating that text-only LMs outperform LALMs on existing benchmarks, we revealed that current music QA benchmarks test reasoning rather than perception. We generate distractors that maximize perceptual necessity through our Perceptual Index metric, creating a benchmark where text-only models perform at chance levels, and LALMs fall to chance level when presented with Gaussian noise input. Though QA benchmarks remain constrained by their underlying question-answer pairs, RUListening offers a practical path toward developing multimodal benchmarks that genuinely require engagement with non-textual data, an approach potentially valuable for other multimodal domains beyond music.
8. ACKNOWLEDGMENTS
We sincerely thank the co-first authors of MuChoMusic [5], Benno Weck and Ilaria Manco, for proposing a methodology for effective music QA benchmark creation and for their helpful early discussions that helped shape this paper.
9. REFERENCES
[1] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022.
[2] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020.
[3] J. Wu, W. Gan, Z. Chen, S. Wan, and P. S. Yu, “Multimodal large language models: A survey,” in 2023 IEEE International Conference on Big Data (BigData). IEEE, 2023, pp. 2247–2256.
[4] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[5] B. Weck, I. Manco, E. Benetos, E. Quinton, G. Fazekas, and D. Bogdanov, “Muchomusic: Evaluating music understanding in multimodal audio-language models,” arXiv preprint arXiv:2408.01337, 2024.
[6] M. Zhao, Z. Zhong, Z. Mao, S. Yang, W.-H. Liao, S. Takahashi, H. Wakaki, and Y. Mitsufuji, “Openmu: Your swiss army knife for music understanding,” arXiv preprint arXiv:2410.15573, 2024.
[7] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” 2023.
[8] Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. R. Glass, “Listen, think, and understand,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=nBZBPXdJlC
[9] Y. Gong, A. H. Liu, H. Luo, L. Karlinsky, and J. Glass, “Joint audio and speech understanding,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.
[10] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. MA, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=14rn7HpKVk
[11] J. Wu, Z. Novack, A. Namburi, J. Dai, H.-W. Dong, Z. Xie, C. Chen, and J. McAuley, “Futga-mir: Enhancing fine-grained and temporally-aware music understanding with music information retrieval,” 2025.
[12] R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye,
Y. Wu, Z. Hong, J. Huang, J. Liu e al., “Audiogp : Un-
de s anding and gene a ing speech, music, sound, and
alking head,” in P oceedings o he AAAI Con e ence
on A i icial In elligence, ol. 38, 2024, pp. 23 802–
23 804.
[13] S. Ghosh, S. Kuma , A. Se h, C. K. R. E u u, U. Tyagi,
S. Sakshi, O. Nie o, R. Du aiswami, and D. Manocha,
“GAMA: A la ge audio-language model wi h ad anced
audio unde s anding and complex easoning abili ies,”
in P oceedings o he 2024 Con e ence on Empi ical
Me hods in Na u al Language P ocessing, Y. Al-
Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami,
Flo ida, USA: Associa ion o Compu a ional Linguis-
ics, No . 2024, pp. 6288–6313. [Online]. A ailable:
h ps://aclan hology.o g/2024.emnlp-main.361/
[14] X. Du, Z. Yu, J. Lin, B. Zhu, and Q. Kong, “Join mu-
sic and language a en ion models o ze o-sho mu-
sic agging,” in ICASSP 2024-2024 IEEE In e na ional
Con e ence on Acous ics, Speech and Signal P ocess-
ing (ICASSP). IEEE, 2024, pp. 1126–1130.
[15] Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle,
and B. Ca anza o, “Audio lamingo: A no el audio
language model wi h ew-sho lea ning and dialogue
abili ies,” in Fo y- i s In e na ional Con e ence
on Machine Lea ning, 2024. [Online]. A ailable:
h ps://open e iew.ne / o um?id=WYi3WKZjYe
[16] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang,
Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Ad-
ancing uni e sal audio unde s anding ia uni ied
la ge-scale audio-language models,” a Xi p ep in
a Xi :2311.07919, 2023.
[17] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo,
Y. Leng, Y. L , J. He, J. Lin e al., “Qwen2-audio ech-
nical epo ,” a Xi p ep in a Xi :2407.10759, 2024.
[18] S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim,
W. Ping, R. Valle, D. Manocha, and B. Catanzaro,
“Audio flamingo 2: An audio-language model with
long-audio understanding and expert reasoning abilities,”
arXiv preprint arXiv:2503.03983, 2025.
[19] M. R. Morris, J. Sohl-Dickstein, N. Fiedel,
T. Warkentin, A. Dafoe, A. Faust, C. Farabet,
and S. Legg, “Position: Levels of AGI for
operationalizing progress on the path to AGI,” in
Proceedings of the 41st International Conference
on Machine Learning, ser. Proceedings of Machine
Learning Research, R. Salakhutdinov, Z. Kolter,
K. Heller, A. Weller, N. Oliver, J. Scarlett, and
F. Berkenkamp, Eds., vol. 235. PMLR, 21–27
Jul 2024, pp. 36 308–36 321. [Online]. Available:
https://proceedings.mlr.press/v235/morris24b.html
[20] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan,
W. Li, L. Lu, Z. MA, and C. Zhang, “SALMONN:
Towards Generic Hearing Abilities for Large Language
Models,” in The Twelfth International Conference on
Learning Representations, 2024. [Online]. Available:
https://openreview.net/forum?id=14rn7HpKVk
[21] Z. Deng, Y. Ma, Y. Liu, R. Guo, G. Zhang,
W. Chen, W. Huang, and E. Benetos, “MusiLingo:
Bridging music and text with pre-trained language
models for music captioning and query response,”
in Findings of the Association for Computational
Linguistics: NAACL 2024, K. Duh, H. Gomez, and
S. Bethard, Eds. Mexico City, Mexico: Association
for Computational Linguistics, Jun. 2024,
pp. 3643–3655. [Online]. Available:
https://aclanthology.org/2024.findings-naacl.231
[22] S. Liu, A. S. Hussain, C. Sun, and Y. Shan,
“Music understanding llama: Advancing text-to-music
generation with question answering and captioning,” in
ICASSP 2024 - 2024 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP),
2024, pp. 286–290.
[23] J. Gardner, S. Durand, D. Stoller, and R. Bittner,
“Llark: A multimodal instruction-following language
model for music,” in Proceedings of the 41st International
Conference on Machine Learning (ICML), 2024.
[24] Y. Vasilakis, R. Bittner, and J. Pauwels, “I can listen but
cannot read: An evaluation of two-tower multimodal
systems for instrument recognition,” arXiv preprint
arXiv:2407.18058, 2024.
[25] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel,
M. Verzetti, A. Caillon, Q. Huang, A. Jansen,
A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour,
and C. Frank, “Musiclm: Generating music from text,”
arXiv preprint arXiv:2301.11325, 2023.
[26] E. Law, K. West, M. Mandel, M. Bay, and
J. Stephen Downie, “Evaluation of algorithms using
games: The case of music tagging,” in Proceedings of
the 10th ISMIR Conference, 2009.
[27] S. Sakshi, U. Tyagi, S. Kumar, A. Seth,
R. Selvakumar, O. Nieto, R. Duraiswami, S. Ghosh, and
D. Manocha, “Mmau: A massive multi-task audio
understanding and reasoning benchmark,” arXiv preprint
arXiv:2410.19168, 2024.
[28] Z. Wang, S. Li, T. Zhang, Q. Wang, P. Yu, J. Luo,
Y. Liu, M. Xi, and K. Zhang, “Muchin: A chinese
colloquial description benchmark for evaluating
language models in the field of music,” arXiv preprint
arXiv:2402.09871, 2024.
[29] R. Yuan, H. Lin, Y. Wang, Z. Tian, S. Wu, T. Shen,
G. Zhang, Y. Wu, C. Liu, Z. Zhou et al., “Chatmusician:
Understanding and generating music intrinsically
with llm,” arXiv preprint arXiv:2402.16153, 2024.
[30] J. Li, L. Yang, M. Tang, C. Chen, Z. Li, P. Wang,
and H. Zhao, “The music maestro or the musically
challenged, a massive music evaluation benchmark
for large language models,” arXiv preprint
arXiv:2406.15885, 2024.
[31] Q. Yang, J. Xu, W. Liu, Y. Chu, Z. Jiang,
X. Zhou, Y. Leng, Y. Lv, Z. Zhao, C. Zhou, and
J. Zhou, “Air-bench: Benchmarking large
audio-language models via generative comprehension,”
CoRR, vol. abs/2402.07729, 2024. [Online]. Available:
https://doi.org/10.48550/arXiv.2402.07729
[32] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang,
S. Stevens, D. Jiang, W. Ren, Y. Sun et al., “Mmmu:
A massive multi-discipline multimodal understanding
and reasoning benchmark for expert agi,” in
Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2024, pp. 9556–9567.
[33] X. Wang, Y. Zhou, X. Liu, H. Lu, Y. Xu, F. He, J. Yoon,
T. Lu, G. Bertasius, M. Bansal et al., “Mementos: A
comprehensive benchmark for multimodal large
language model reasoning over image sequences,” arXiv
preprint arXiv:2401.10529, 2024.
[34] L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen,
H. Duan, J. Wang, Y. Qiao, D. Lin et al., “Are we
on the right way for evaluating large vision-language
models?” arXiv preprint arXiv:2403.20330, 2024.
[35] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey,
A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten,
A. Vaughan et al., “The llama 3 herd of models,” arXiv
preprint arXiv:2407.21783, 2024.
[36] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu,
C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5
technical report,” arXiv preprint arXiv:2412.15115, 2024.
[37] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch,
B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas,
E. B. Hanna, F. Bressand et al., “Mixtral of experts,”
arXiv preprint arXiv:2401.04088, 2024.
[38] G. Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin,
S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari,
A. Ramé et al., “Gemma 2: Improving open
language models at a practical size,” arXiv preprint
arXiv:2408.00118, 2024.
[39] “Cheaper, Better, Faster, Stronger | Mistral AI —
mistral.ai,” https://mistral.ai/news/mixtral-8x22b,
[Accessed 28-03-2025].
[40] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu,
C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan
et al., “Deepseek-v3 technical report,” arXiv preprint
arXiv:2412.19437, 2024.
[41] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
the limits of transfer learning with a unified
text-to-text transformer,” Journal of machine learning
research, vol. 21, no. 140, pp. 1–67, 2020.
[42] Y. Wu*, K. Chen*, T. Zhang*, Y. Hui*,
T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive
language-audio pretraining with feature fusion and
keyword-to-caption augmentation,” in IEEE International
Conference on Acoustics, Speech and Signal
Processing, ICASSP, 2023.
[43] N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, L. Hong,
E. H. Chi, J. Caverlee, J. McAuley, and D. Z. Cheng,
“How to train data-efficient llms,” arXiv preprint
arXiv:2402.09668, 2024.
[44] R. Yuan, H. Lin, S. Guo, G. Zhang, J. Pan, Y. Zang,
H. Liu, Y. Liang, W. Ma, X. Du et al., “Yue: Scaling
open foundation models for long-form music
generation,” arXiv preprint arXiv:2503.08638, 2025.