Are You Really Listening? Boosting Perceptual Awareness in Music-QA Benchmarks

Author: Yongyi Zang; Sean O'Brien; Taylor Berg-Kirkpatrick; Julian McAuley; Zachary Novack

Publisher: Zenodo

DOI: 10.5281/zenodo.17706385

Source: https://zenodo.org/records/17706385/files/000029.pdf

ARE YOU REALLY LISTENING?
BOOSTING PERCEPTUAL AWARENESS IN MUSIC-QA BENCHMARKS
Yongyi Zang¹, Sean O’Brien², Taylor Berg-Kirkpatrick²,
Julian McAuley², Zachary Novack²
¹ Independent Researcher  ² University of California, San Diego
[email protected], {seobrien, tberg, jmcauley, znovack}@ucsd.edu
ABSTRACT
Large Audio Language Models (LALMs), where pretrained text LLMs are finetuned
with audio input, have made remarkable progress in music understanding.
However, current evaluation methodologies exhibit critical limitations: on the
leading Music Question Answering benchmark, MuChoMusic, text-only LLMs without
audio perception capabilities achieve surprisingly high accuracy of up to
56.4%, much higher than chance. Furthermore, when presented with random
Gaussian noise instead of actual audio, LALMs still perform significantly
above chance. These findings suggest existing benchmarks predominantly assess
reasoning abilities rather than audio perception. To overcome this challenge,
we present RUListening, a framework that enhances perceptual evaluation in
Music-QA benchmarks. We introduce the Perceptual Index (PI), a quantitative
metric that measures a question’s reliance on audio perception by analyzing
log probability distributions from text-only language models. Using this
metric, we generate synthetic, challenging distractors to create QA pairs that
necessitate genuine audio perception. When applied to MuChoMusic, our filtered
dataset successfully forces models to rely on perceptual information:
text-only LLMs perform at chance levels, while LALMs similarly deteriorate
when audio inputs are replaced with noise. These results validate our
framework’s effectiveness in creating benchmarks that more accurately evaluate
audio perception capabilities.
1. INTRODUCTION
Large language models (LLMs) have achieved impressive reasoning capabilities
[1] and strong zero- and few-shot performance across NLP tasks [2], but are
limited to only processing textual information. This constraint has driven the
development of Multimodal LLMs (MLLMs), which extend LLMs to process, reason
over, and generate multimodal content like images or videos [3]. Large Audio
Language Models (LALMs) specifically add audio perception and reasoning
capabilities to LLMs. Evaluating LALMs is challenging, as conventional metrics
like BLEU [4] struggle with diverse outputs. QA frameworks like MuChoMusic [5]
address this by transforming evaluation into classification tasks with
predefined choices, making them well-suited to assessing music capabilities in
LALMs.

© Y. Zang, S. O’Brien, T. Berg-Kirkpatrick, J. McAuley and Z. Novack.
Licensed under a Creative Commons Attribution 4.0 International License
(CC BY 4.0). Attribution: Y. Zang, S. O’Brien, T. Berg-Kirkpatrick,
J. McAuley and Z. Novack, “Are you really listening? Boosting Perceptual
Awareness in Music-QA Benchmarks”, in Proc. of the 26th Int. Society for Music
Information Retrieval Conf., Daejeon, South Korea, 2025.

Figure 1. Text-only LMs and LALMs’ performance on the Music QA benchmark
MuChoMusic [5]. OpenMU is finetuned on Llama 3 8B, yet performs worse than it.
However, we discover a concerning issue: text-only models often select correct
answers even without multimodal input, nearly matching the performance of
multimodal models. We evaluated 11 text-only LLMs against state-of-the-art
LALMs on the premier Music QA benchmark MuChoMusic [5] (see Figure 1).
Surprisingly, we found that text-only models can perform well even without
audio perception ability, with eight models reaching accuracy over 50%, two of
which are even of similar parameter size as LALMs. Even more telling, OpenMU
[6], a LALM finetuned from Llama 3 8B, performs worse on this benchmark than
its text-only Llama 3 8B foundation, despite having access to the audio. As
mentioned in the MuChoMusic paper and per our re-evaluation (see Fig. 2), when
presented with Gaussian noise as input, the LALMs show only a very limited
performance decline, nowhere near chance level. We present a hypothesis for
this phenomenon: the strong initialization of text-only reasoning capabilities
allows LLMs to solve QA benchmarks without true audio perception, creating an
illusion of understanding.

Figure 2. LALM performance with original input vs. Gaussian noise input on
MuChoMusic [5].
To address this challenge, we introduce RUListening, a framework to boost
existing QA benchmarking datasets, where we generate distractors that require
active perception to be distinguished from correct answers. Starting with
audio descriptions, questions, and correct answers, we prompt a text-only
model to generate plausible yet incorrect candidates. We define "perceptual
index" (PI) as the need for perceptual information, calculated from
log-probabilities of distractors being selected by a text-only model. We
optimize based on this metric to select four distractors per question/answer
pair. We additionally employ a leave-one-out strategy for 4-fold
cross-validation, ensuring robust assessment of models’ perceptual
capabilities.
Empirically, filtering MuChoMusic through RUListening reduces text-only models
to near-chance performance, confirming reasoning alone cannot solve these
questions. When audio inputs to LALMs are replaced with Gaussian noise, their
performance also plummets to near-or-below-chance levels, confirming
sensitivity to perceptual abilities. Additionally, we find the PI metric
(derived from a single text-only LM) strongly correlates with performance
across all text-only LMs, validating our methodology’s generalizability and
effectiveness at boosting genuine audio perception capabilities.
To the best of our knowledge, this represents the first research to evaluate
text-only LMs on Music QA benchmarks, exploring the reasoning and perception
ability separately for LALMs, and the first to propose such a methodology for
boosting QA benchmarks to specifically emphasize perceptual capabilities. We
believe our work advances the community’s approach to benchmarking LALMs. We
open-source all code and evaluation scripts at
https://github.com/yongyizang/AreYouReallyListening and RUL-MuChoMusic at
https://huggingface.co/datasets/yongyizang/RUListening under MIT License to
facilitate further research.
2. RELATED WORK
2.1 LALMs
Large Audio Language Models (LALMs) combine audio encoders with fine-tuned
LLMs to process audio alongside text tokens. Pengi [7] pioneered this
architecture, achieving state-of-the-art results on audio classification
tasks. This breakthrough inspired numerous open-source models including LTU
[8], LTU-AS [9], SALMONN [10], FUTGA [11], AudioGPT [12], GAMA [13], JMLA
[14], and Audio Flamingo [15], plus open-access alternatives like Qwen-Audio
[16] and Qwen2-Audio [17]. Research has prioritized scaling parameters and
datasets over improving data quality or audio representations [18]. While
these models show enhanced performance on basic tasks, they still face
limitations in real-world applications [19].
2.2 Music QA Benchmarks
LALM benchmarks evaluate either specific musical attributes (tonality, genre,
instrument identification) or overall music understanding through audio
description and musical inquiry tasks [20–23]. For question-answer pairs, many
works [5, 20, 21, 23, 24] use the MusicCaps collection [25], while others
[21, 22] create new datasets by using LLMs to convert existing annotations
from MusicCaps or MagnaTagaTune [26] into structured QA formats, producing
datasets like MusicQA and MusicInstruct. MMAU [27] represents a recent
advancement that balances information extraction (perception) and reasoning
questions. Some research focuses on evaluating models trained on symbolic
music representations [5, 28–30], with MuChin [28] using non-multiple-choice
Chinese text and both MusicTheoryBench and ZIQI-Eval targeting text-oriented
LLMs through symbolic notation rather than audio. Meanwhile, multimodal
capability evaluation appears in works like AIR-Bench [31], which includes
music-related assessments within broader audio comprehension, and MuChoMusic
[5], which employs LLMs with human verification to generate question-answer
pairs from audio descriptions, creating more robust benchmarks for
comprehensive music understanding evaluation.
2.3 Multimodal Perception Benchmarks
Various benchmarks assess multimodal reasoning abilities. Beyond those
discussed above, MMMU [32] provides a multi-discipline dataset for evaluating
vision models’ multimodal reasoning, while Mementos [33] tests reasoning over
long image sequences. However, perception assessment remains relatively
underexplored compared to reasoning evaluation. Chen et al. [34] found that
many vision language model benchmark questions can be answered without visual
input or rely on textual components from training data. They developed a
filtering methodology that uses text-only language models to answer questions
and uses their accuracy to determine the degree to which a question relies on
the visual modality. To our knowledge, no similar work exists for audio or
music language models.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3. REASONING IS ENOUGH TO SOLVE
CURRENT MUSIC QA BENCHMARK
We begin by assessing the extent to which the current Music QA benchmark
requires perception. To do so, we evaluate text-only LMs, which have no
perception but strong reasoning capabilities, on the MuChoMusic benchmark,
comparing them against LALMs, which have both perception and reasoning
capabilities. This comparison allows us to quantify the importance of
perceptual abilities in successfully addressing music-related questions.
For text-only LMs, we evaluate 11 SOTA models across <3B, <8B, <32B, <72B and
>72B parameter ranges: Gemma 2B and Llama 3.2 3B; Llama 3 8B [35] and
Qwen 2.5 7B [36]; Mixtral 8x7B [37] and Gemma 27B [38]; Mixtral 8x22B [39],
Qwen 2.5 72B, and Llama 3.1 70B; and Llama 3.1 405B and DeepSeek V3 671B [40]
for large models. For LALMs, we evaluated top MuChoMusic benchmark performers
including Audio Flamingo 2 [18], OpenMU [6], Qwen-Audio [16] and Qwen2-Audio
[17], reporting results from the original model papers or the MuChoMusic paper
when available.
We observed that the standard evaluation prompt provided by MuChoMusic often
resulted in the models declining to answer, with responses indicating they
could not perceive the audio. To complete the evaluation, we modified the
prompt to:
“Provide your best guess of this question. You must guess one, even if you did
not hear the audio. Think step by step.”
This change prompted text-only LMs to generate answers. For LALMs, we couldn’t
reproduce the reported results (actual performance was lower) due to
fluctuations from text decoding hyperparameters. For fair comparison, we cite
the numbers from their original papers.
Figure 1 presents the evaluation results. Notably, eight text-only LMs are
able to achieve an overall accuracy of more than 50%, with two of them
(Llama 3 8B and Qwen 2.5 7B) of similar size as LALMs. Even more surprisingly,
although OpenMU was fine-tuned from Llama 3 8B, its performance falls short of
the base Llama 3 8B model. As mentioned in the MuChoMusic paper ([5], see
Fig. 6b) and per our re-evaluation¹ (see Fig. 2), when presented with Gaussian
noise as input, the LALMs show only a very limited performance decline,
nowhere near chance level; one LALM even showed performance improvement. These
findings suggest that reasoning capabilities alone, even without perception,
may be sufficient to solve the MuChoMusic benchmark.
But how is this possible? How can text-only models answer music QAs without
hearing the music? Through manual inspection, we identified two key reasons.
First, MuChoMusic’s design includes a "correct but irrelevant" distractor
providing contextual information. While this distractor is helpful in
analyzing error modes, text-only LMs can exploit it to make educated guesses.
Second, we hypothesize that text-only LMs are using their inherent "world
prior" knowledge from their training. They may understand that 80s pop songs
likely feature synthesizers rather than orchestras, and folk songs typically
contain acoustic rather than distorted electric guitars. This broad world
knowledge could have made much of the multimodal input redundant for answering
questions, reducing the benchmark’s effectiveness in evaluating LALMs’ true
perceptual capabilities.

¹ OpenMU [6] could not be evaluated under noise input for MuChoMusic, since
its language capabilities become too severely degraded and it refuses to
answer the questions, despite our best efforts at changing hyperparameters and
prompts.
To provide a concrete understanding of this phenomenon, we display a
prototypical example through Llama 3.1 70B’s response to the following
question²:
"Which instrument initiates the piece?
Options: (A) Synthesizer (B) Rueful tune (C) Acoustic guitar (D) Vocals"
Llama 3.1 70B was able to reason through this question without audio input and
still arrive at the correct answer (C). It starts by eliminating impossible
answers:
To make an educated guess, I’ll consider common musical patterns and the
options provided. Option (B) “Rueful tune” doesn’t seem to be a specific
instrument, so I’ll exclude that choice.
Then it attempts to leverage its prior knowledge about the distribution of
music:
Between the remaining options, I’ll consider which instrument is most likely
to initiate a piece. Synthesizers (A) can be used to create a wide range of
sounds, but they’re not always the first choice to start a song. Vocals (D)
can initiate a piece, but it’s not uncommon for an instrument to set the tone
before the vocals begin. Acoustic guitar (C) is a popular choice to start a
song, especially in genres like folk, rock, and pop. It’s often used to create
a simple, yet effective melody that sets the tone for the rest of the piece.
Based on this reasoning, my guess is: C.
4. CONSTRUCTING RULISTENING
4.1 Methodology
We define a Music-QA benchmark as a set of audio-question-answer triplets
$(a, q, Y)$ of audio clip $a$, question $q$, and set of answers $Y$, and can
further decompose $Y = \{c\} \cup D$, where $c$ is the correct answer and $D$
is the set of incorrect distractors. Under this definition, an effective
benchmark for audio perception should present questions that are challenging
without audio but solvable with audio access. Formally, let
$p_{\text{text}}(Y \mid q)$ represent the total probability over all given
answers for a text-only LM (i.e.
$p_{\text{text}}(Y \mid q) = \sum_{y_i \in Y} p_{\text{text}}(y_i \mid q)$),
and $p_{\text{LALM}}(Y \mid q, a)$ represent the corresponding probability for
a LALM.
² We provide more examples of this in Appendix A.

Ideally, if one wants to measure the multimodal perception abilities of LALMs,
a Music-QA question should elicit a noticeable information gain when
conditioning on the audio, i.e., $p(c \mid q, a) \gg p(c \mid q)$. Using this
principle for benchmark design gives us two options for increasing the
information gain: create $(a, q, Y)$ triplets that are unimodally difficult
(i.e. reduce $p(c \mid q)$), or design questions and correct answers highly
perceptually aligned with audio (i.e. increase $p(c \mid q, a)$). We
prioritize the former as the latter is problematic: constructing new QA-pairs
is unscalable with current systems, and using LALMs to automate this would
contaminate the benchmark’s evaluative purpose and rely too much on
questionable LALM capabilities. We therefore focus on creating benchmark items
where questions challenge text-only LMs while maintaining the expert-verified
relationship between $(a, q, c)$. We formalize this as finding optimal
distractor sets $D^*$ that maximize the probability of text-only models
selecting incorrect answers. We define the need for perceptual information as
"perceptual index," or PI:
\[
\mathrm{PI}(q, Y, D) = \frac{p_{\text{text}}(D \mid q)}{p_{\text{text}}(Y \mid q)} \tag{1}
\]
which is equivalent to the QA-normalized error probability. This metric ranges
from 0 to 1, with values close to 1 indicating questions where a text-only
model is more likely to select incorrect answers (i.e.,
$p_{\text{text}}(D \mid q) \gg p_{\text{text}}(c \mid q)$). Since we cannot
modify the audio, question, or correct answer without compromising the
integrity of the expert-verified content, we restrict our optimization to
finding distractor sets that maximize this perceptual index metric.
Importantly, PI does not perfectly correlate with the entropy of the answer
space; a high PI may reflect a model that is confidently incorrect (selecting
a wrong answer with high probability), thus exhibiting low entropy. We prefer
PI over entropy as our optimization target precisely because PI captures the
maximum possible performance gap between modalities: the distance between
being confidently wrong (high PI) and correct is necessarily larger than
between being uncertain (high entropy) and correct, thereby providing a
stronger signal for identifying questions that genuinely require perceptual
information.
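
As an illustration of Eq. (1), the following minimal sketch computes PI from
per-option log-probabilities. The function name `perceptual_index`, the option
labels, and the numeric values are assumptions made for this example and are
not taken from the released code.

```python
import math

def perceptual_index(logprobs, correct_key):
    """PI = p_text(D|q) / p_text(Y|q), from per-option log-probabilities."""
    # Convert log-probabilities back to probabilities for the listed options.
    probs = {key: math.exp(value) for key, value in logprobs.items()}
    total = sum(probs.values())  # p_text(Y|q): mass on all given answers
    # p_text(D|q): mass on every option except the correct one.
    distractor_mass = sum(p for key, p in probs.items() if key != correct_key)
    return distractor_mass / total

# Hypothetical next-token log-probs for options A-D; "C" is the correct answer.
confidently_wrong = {"A": math.log(0.70), "B": math.log(0.10),
                     "C": math.log(0.05), "D": math.log(0.05)}
text_solvable = {"A": math.log(0.05), "B": math.log(0.05),
                 "C": math.log(0.80), "D": math.log(0.05)}

pi_hard = perceptual_index(confidently_wrong, "C")  # close to 1
pi_easy = perceptual_index(text_solvable, "C")      # close to 0
```

Dividing by the summed mass over the listed options means the score is
unaffected by probability the model places on tokens outside A to D.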
4.2 Generating Distractors Set
To arrive at a set of distractors $D^*$ that maximizes PI, we begin by
generating a large pool of possible distractors $D$, then filter through them
to arrive at the highest-PI set of distractors. We leverage the DeepSeek-V3
model to do this. We use a prompt template including question text, audio
description, and correct answer, and prompt the LLM to generate multiple
candidates. This process is repeated multiple times, allowing us to sample
multiple batches for diversity. Finally, we apply cleaning and deduplication
processes. We explicitly prompt the model to maintain stylistic consistency
across answers, and to only generate answers that are 1) plausible and
2) distinctly different from the correct answer. We use in-context learning
examples to enforce structured output using XML tags, then extract possible
distractors using regular expressions and apply text normalization.
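
The extraction, normalization, and deduplication steps can be sketched as
follows. This is only an illustration: the `<distractor>` tag name, the helper
names, and the normalization rules (whitespace collapsing, case folding,
trailing-period removal) are assumptions, since the text does not specify the
exact tags or cleaning rules used.

```python
import re

# Assumed tag name; the actual XML tags used in the prompt are not specified.
TAG = re.compile(r"<distractor>(.*?)</distractor>", re.DOTALL | re.IGNORECASE)

def clean(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def dedup_key(text):
    """Case- and punctuation-insensitive key used to detect duplicates."""
    return clean(text).rstrip(".").lower()

def extract_distractors(raw_outputs, correct_answer):
    """Pool unique candidates across sampled batches, excluding the answer."""
    seen = {dedup_key(correct_answer)}
    pool = []
    for raw in raw_outputs:
        for match in TAG.findall(raw):
            candidate = clean(match)
            key = dedup_key(candidate)
            if candidate and key not in seen:
                seen.add(key)
                pool.append(candidate)
    return pool

# Two hypothetical LLM batches for the same question.
batches = [
    "<distractor>Electric guitar</distractor>\n"
    "<distractor> electric  guitar. </distractor>",
    "<distractor>Acoustic guitar</distractor><distractor>Banjo</distractor>",
]
pool = extract_distractors(batches, "Acoustic guitar")
```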
Figure 3. Semantic distribution of distractors.
To analyze the distribution of generated distractors, we employed two distinct
models: T5 [41], a text-only transformer encoder, and the text branch of CLAP
[42], a joint audio-text representation model. For each distractor and correct
answer pair, we calculated the cosine similarity between their respective
embeddings. The T5 similarity distribution captures the natural language
semantic relationships, while CLAP’s text encoder reveals the music
domain-specific relationships. Figure 3 presents histograms of these semantic
similarity distributions. We observe that distractors cluster tightly in the
text semantic space yet spread more widely in the music semantic space,
indicating items that are textually similar (e.g., “Acoustic guitar” and
“Electric guitar”) but musically distinct. This confirms our distractors
maintain musical variety while minimizing textual differences (thus preventing
leakage).
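
The similarity measure itself is simple to state in code. The sketch below
uses made-up three-dimensional vectors in place of real T5 or CLAP embeddings,
purely to show the comparison; the vector values are assumptions and carry no
empirical meaning.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for the embeddings of a correct answer and one distractor.
text_correct, text_distractor = [1.0, 0.0, 0.2], [0.9, 0.1, 0.2]
music_correct, music_distractor = [0.1, 1.0, 0.0], [0.8, 0.2, 0.3]

text_sim = cosine_similarity(text_correct, text_distractor)     # high
music_sim = cosine_similarity(music_correct, music_distractor)  # lower
```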
4.3 Filtering Based on Perceptual Index
After obtaining the distractor set $D$, we begin filtering for $D^*$. To
calculate the probability of each answer being selected,
$p_{\text{text}}(y \mid q)$, we use the log probability of a lightweight
text-only LLM, Qwen 2.5 7B. Specifically, we prompt the model with:
“Provide your best guess of this question. The question is: {question} The
answer candidates are: (A) ... (B) ... (C) ... (D) ... Answer without the
parenthesis. The most likely answer is”
We then take the log probability of the immediate next token being A, B, C or
D to represent the probability of each corresponding answer being selected,
following the methodology of [43]. Empirically, we found this method worked
well when evaluating <4 distractors, likely because real-world multiple choice
questions often have four choices. As such, we begin by randomly selecting
sets of three distractors, then evaluate them with the correct answer in
random order. For each set of four choices, we take the distractor with
highest probability; we recursively do this, until we are left with four
distractors. These four distractors have the highest
$p_{\text{text}}(D \mid q)$, and thus reasonably approximate the set $D^*$
that yields the largest $\mathrm{PI}(q, Y, D)$. During evaluation, we
implement a leave-one-out strategy: within the four distractors, we remove one
at each iteration. This approach provides four distinct answer passes for each
QA pair. Our design serves two purposes: (1) having 4 answers aligns with the
real-world distribution of multiple-choice questions; and (2) it enhances our
robustness against variations in distractors.
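
The filtering and leave-one-out steps can be sketched as below. This is a
simplified reading, not the released implementation: instead of advancing only
the winner of each group, the loop drops the weakest distractor of each random
group of three, which keeps high-probability distractors while guaranteeing
exactly four survivors. The scoring function is a stub with hypothetical
probabilities standing in for the text-only LLM's next-token probabilities.

```python
import random

def filter_distractors(pool, correct, score_fn, keep=4, seed=0):
    """Shrink a distractor pool to `keep` items using random groups of three."""
    rng = random.Random(seed)
    pool = list(pool)
    while len(pool) > keep:
        group = rng.sample(pool, 3)
        choices = group + [correct]
        rng.shuffle(choices)       # correct answer lands in a random slot
        probs = score_fn(choices)  # maps each choice to its probability
        # Simplification: eliminate the least tempting distractor of the group.
        weakest = min(group, key=lambda d: probs[d])
        pool.remove(weakest)
    return pool

def leave_one_out(distractors, correct):
    """Build one four-option pass per removed distractor."""
    passes = []
    for i in range(len(distractors)):
        kept = distractors[:i] + distractors[i + 1:]
        passes.append(kept + [correct])
    return passes

# Hypothetical text-only probabilities; a real run would query the LLM.
TOY_SCORES = {"Acoustic guitar": 0.10, "Electric guitar": 0.40,
              "Banjo": 0.30, "Mandolin": 0.25, "Sitar": 0.05,
              "Harp": 0.15, "Ukulele": 0.20, "Lute": 0.02}

def toy_score_fn(choices):
    return {choice: TOY_SCORES[choice] for choice in choices}

correct = "Acoustic guitar"
pool = [name for name in TOY_SCORES if name != correct]
selected = filter_distractors(pool, correct, toy_score_fn)
passes = leave_one_out(selected, correct)
```

Because a distractor is only dropped when two others in its group score
higher, the two most tempting distractors in the pool always survive under
this variant.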
Figure 4. Correlation between Perceptual Index (PI) and question accuracy on
RUL-MuChoMusic. Text-only LMs show stronger negative correlation, indicating
greater influence from lack of perception.
5. RESULTS
We evaluate the aforementioned 11 text-only LMs and select the top-performing
4 LALMs on MuChoMusic, and our proposed modified version RUL-MuChoMusic. For
all models evaluated, we report both the mean performance and 95% confidence
intervals.
5.1 Validity of Perceptual Index
To validate the Perceptual Index (PI) as an effective surrogate for overall
LLM performance, we analyzed question-level accuracy across all 11 LLMs
(44 response passes). For each question, we calculated the correlation between
accuracy across all attempts and the PI. We observe a strong negative Pearson
correlation of -0.738 as shown in Figure 4(a), indicating a highly significant
relationship where high PI corresponds to low question accuracy. These results
confirm PI effectively predicts text-only LMs’ ability to answer questions
using solely textual information.
Similarly, calculating the correlation between PI and question-level accuracy
across all 4 LALMs (16 passes) reveals a weaker negative Pearson correlation
of -0.331, as shown in Fig. 4(b), suggesting the need for perception is
significantly higher. This validates PI as an effective metric for optimizing
distractor sets to maximize the performance gap between text-only LMs and
LALMs.
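
The correlation analysis can be sketched as follows, assuming each question
has a PI value and a list of 0/1 outcomes over all response passes. The data
below are invented for illustration and do not reproduce the reported
coefficients; `pearson` and `question_accuracy` are hypothetical helper names.

```python
import math

def question_accuracy(outcomes):
    """Fraction of response passes that answered the question correctly."""
    return sum(outcomes) / len(outcomes)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented example: three questions, four response passes each.
pi_values = [0.2, 0.5, 0.9]
pass_outcomes = [[1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 1]]

accuracies = [question_accuracy(o) for o in pass_outcomes]
r = pearson(pi_values, accuracies)  # negative: higher PI, lower accuracy
```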
Additionally, we plot the perceptual index distribution across all questions
for both MuChoMusic and RUL-MuChoMusic. The only difference between these
benchmarks is the distractor set. As shown in Figure 5, MuChoMusic PI values
follow an approximately Gaussian distribution with mean 0.427 and large
variance, indicating many questions can be answered substantially through text
modality alone without requiring music information. This aligns with our
observation that text-only language models score highly on MuChoMusic. In
contrast, RUL-MuChoMusic achieves a significantly higher PI distribution with
mean 0.861 and lower variance, demonstrating greater dependence on music
modality for correct answers. For identical questions, our generated and
filtered distractors consistently increase PI compared to the original
benchmark (mean increase of 0.338), with some questions showing increases
exceeding 0.9.³ These results confirm our generation and filtering pipeline
effectively reduces text-only answering capability, creating a more robust
multimodal evaluation benchmark.

Figure 5. Distribution of PI on MuChoMusic and RUL-MuChoMusic. MuChoMusic
exhibits overall less reliance on perceptual modality compared to
RUL-MuChoMusic.
5.2 Benchmark Results of Text-only LMs and LALMs
We present comprehensive results of text-only LLMs and LALMs in Figure 6.
Several key patterns emerge from our analysis. Across all models, we observe a
consistent decrease in accuracy scores, indicating that RUL-MuChoMusic
presents a greater challenge than MuChoMusic; text-only LMs perform at
near-chance levels, validating our approach. Importantly, OpenMU (4th-place)
outperforms its text-only subcomponent (Llama 3 8B, 12th-place), suggesting
enhanced music perception capabilities. The text-only LMs that managed to
place in the top-10 possess much larger parameter counts (405B, 72B, 27B,
671B, 70B, and 56B) compared to the sub-7B audio models.
Though RUListening effectively increases unimodal difficulty (see Sec. 5.1),
most LALMs besides Qwen2-Audio demonstrate relatively poor performance, as
multimodal difficulty was not used in construction. Due to Qwen2-Audio’s broad
use across various tasks [44], its strong performance is expected. To
quantitatively assess whether poor results stem from inherent model
limitations or benchmark design flaws, we evaluated all LALMs using 10-second
samples of random Gaussian noise to probe their sensitivity to audio input.
Results appear in Figure 7. While all models previously performed above
chance, noise inputs drove performance to near or below chance levels.
Qwen2-Audio showed the most dramatic performance degradation, while Audio
Flamingo 2 demonstrated the least sensitivity to noise, possibly related to
its weaker reasoning abilities. When comparing to MuChoMusic [5], only 2 LALMs
show significant degradation with noise input, yet nowhere near chance-level
performance, suggesting RUListening provides stronger evaluation of audio
perception.

³ We present examples of highest and lowest distractor PI change in
Appendix C.

Figure 6. Benchmarking results on RUL-MuChoMusic. Error bar displays 95%
confidence interval.

Figure 7. LALM performance with original input vs. Gaussian noise input on
RUL-MuChoMusic.
Examining LALM response patterns reveals additional insights.⁴ Audio
Flamingo 2 [18] exhibits limited reasoning ability, often generating direct
answers. In contrast, Qwen2-Audio frequently produces extended reasoning
chains. This suggests reasoning capability may be crucial for success on Music
QA benchmarks, as also demonstrated by recent research exploring LALM
multimodal fine-tuning techniques for reasoning models.
6. DISCUSSION
While establishing RUL-MuChoMusic as a more effective perception-testing
benchmark compared to MuChoMusic, we acknowledge a fundamental limitation in
our work: the quality of our benchmark is inherently constrained by the
quality of the provided question-answer pairs.
Our manual inspection revealed several issues with the original dataset. Some
problems stem from the LLM-assisted methodology used to create question-answer
pairs from human captions. Questions with IDs 448 and 665 have "Not specified
in the description" as correct answers, while eight others contain phrases
like "based on the description" despite no description being provided during
benchmarking. Human captions sometimes include uninferable metadata: we found
17 questions/answers with "recorded in" phrases, though recording location
cannot be determined solely from audio. Some questions are challenging even
for human experts, such as identifying specific banjo types (question ID 730).
Some correct answers inadequately address their questions. For instance,
question ID 832 asks "Who is the primary vocalist in the song?" with "Male" as
the correct answer, and question ID 7 asks "What is used at the very beginning
to make the track sound vintage?" with the overly simplistic answer "Effect."

⁴ We include response examples in Appendix B.
To quantify this issue, we employed Claude 3.7 Sonnet to evaluate whether
questions and answers made sense based on the audio text captions. The model
identified 201 out of 1187 pairs (16.9%) as problematic. However, we observed
that Claude itself made errors in this evaluation process. For example, it
incorrectly flagged question ID 1125, claiming that the correct answer
"Digital bass sound" was inconsistent with the audio description mentioning a
"synthesizer bassline that is repeating."
These issues highlight a broader challenge in Music QA benchmark construction:
human-written benchmarks are time-consuming to develop, yet LLMs are
error-prone when used as discriminators or assistants. The question of how to
effectively balance these approaches remains an important area for future
research. We hope our work serves as a starting point in encouraging
researchers to critically examine the effectiveness of Music QA benchmarks.
7. CONCLUSION
We introduce RUListening, a methodology for improving the perceptual relevance
of LALM QA benchmarks. By demonstrating that text-only LMs outperform LALMs on
existing benchmarks, we revealed that current music QA benchmarks test
reasoning rather than perception. We generate distractors that maximize
perceptual necessity through our Perceptual Index metric, creating a benchmark
where text-only models perform at chance levels, and LALMs fall to chance
level when presented with Gaussian noise input. Though QA benchmarks remain
constrained by their underlying question-answer pairs, RUListening offers a
practical path toward developing multimodal benchmarks that genuinely require
engagement with non-textual data, an approach potentially valuable for other
multimodal domains beyond music.
8. ACKNOWLEDGMENTS
We sincerely thank the co-first authors of MuChoMusic [5], Benno Weck and
Ilaria Manco, for proposing a methodology for effective music QA benchmark
creation and for their helpful early discussions that helped shape this paper.
9. REFERENCES
[1] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le,
D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large
language models,” Advances in neural information processing systems, vol. 35,
pp. 24 824–24 837, 2022.
[2] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and
J. Steinhardt, “Measuring massive multitask language understanding,” arXiv
preprint arXiv:2009.03300, 2020.
[3] J. Wu, W. Gan, Z. Chen, S. Wan, and P. S. Yu, “Multimodal large language
models: A survey,” in 2023 IEEE International Conference on Big Data
(BigData). IEEE, 2023, pp. 2247–2256.
[4] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for
automatic evaluation of machine translation,” in Proceedings of the 40th
annual meeting of the Association for Computational Linguistics, 2002,
pp. 311–318.
[5] B. Weck, I. Manco, E. Benetos, E. Quinton, G. Fazekas, and D. Bogdanov,
“Muchomusic: Evaluating music understanding in multimodal audio-language
models,” arXiv preprint arXiv:2408.01337, 2024.
[6] M. Zhao, Z. Zhong, Z. Mao, S. Yang, W.-H. Liao, S. Takahashi, H. Wakaki,
and Y. Mitsufuji, “Openmu: Your swiss army knife for music understanding,”
arXiv preprint arXiv:2410.15573, 2024.
[7] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language
model for audio tasks,” 2023.
[8] Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. R. Glass, “Listen, think,
and understand,” in The Twelfth International Conference on Learning
Representations, 2024. [Online]. Available:
https://openreview.net/forum?id=nBZBPXdJlC
[9] Y. Gong, A. H. Liu, H. Luo, L. Karlinsky, and J. Glass, “Joint audio and
speech understanding,” in 2023 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU), 2023, pp. 1–8.
[10] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. MA, and
C. Zhang, “SALMONN: Towards generic hearing abilities for large language
models,” in The Twelfth International Conference on Learning Representations,
2024. [Online]. Available: https://openreview.net/forum?id=14rn7HpKVk
[11] J. Wu, Z. Novack, A. Namburi, J. Dai, H.-W. Dong, Z. Xie, C. Chen, and
J. McAuley, “Futga-mir: Enhancing fine-grained and temporally-aware music
understanding with music information retrieval,” 2025.
[12] R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y. Wu, Z. Hong,
J. Huang, J. Liu et al., “Audiogpt: Understanding and generating speech,
music, sound, and talking head,” in Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 38, 2024, pp. 23 802–23 804.
[13] S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi,
O. Nieto, R. Duraiswami, and D. Manocha, “GAMA: A large audio-language model
with advanced audio understanding and complex reasoning abilities,” in
Proceedings of the 2024 Conference on Empirical Methods in Natural Language
Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida,
USA: Association for Computational Linguistics, Nov. 2024, pp. 6288–6313.
[Online]. Available: https://aclanthology.org/2024.emnlp-main.361/
[14] X. Du, Z. Yu, J. Lin, B. Zhu, and Q. Kong, “Joint music and language
attention models for zero-shot music tagging,” in ICASSP 2024-2024 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2024, pp. 1126–1130.
[15] Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro, “Audio
flamingo: A novel audio language model with few-shot learning and dialogue
abilities,” in Forty-first International Conference on Machine Learning, 2024.
[Online]. Available: https://openreview.net/forum?id=WYi3WKZjYe
[16] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou,
“Qwen-audio: Advancing universal audio understanding via unified large-scale
audio-language models,” arXiv preprint arXiv:2311.07919, 2023.
[17] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He,
J. Lin et al., “Qwen2-audio technical report,” arXiv preprint
arXiv:2407.10759, 2024.
[18] S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle,
D. Manocha, and B. Catanzaro, “Audio flamingo 2: An audio-language model with
long-audio understanding and expert reasoning abilities,” arXiv preprint
arXiv:2503.03983, 2025.
[19] M. R. Morris, J. Sohl-Dickstein, N. Fiedel,
T. Warkentin, A. Dafoe, A. Faust, C. Farabet,
and S. Legg, “Position: Levels of AGI for
operationalizing progress on the path to AGI,” in
Proceedings of the 41st International Conference
on Machine Learning, ser. Proceedings of Machine
Learning Research, R. Salakhutdinov, Z. Kolter,
K. Heller, A. Weller, N. Oliver, J. Scarlett, and
F. Berkenkamp, Eds., vol. 235. PMLR, 21–27
Jul 2024, pp. 36 308–36 321. [Online]. Available:
https://proceedings.mlr.press/v235/morris24b.html
[20] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan,
W. Li, L. Lu, Z. MA, and C. Zhang, “SALMONN:
Towards Generic Hearing Abilities for Large Language
Models,” in The Twelfth International Conference on
Learning Representations, 2024. [Online]. Available:
https://openreview.net/forum?id=14rn7HpKVk
[21] Z. Deng, Y. Ma, Y. Liu, R. Guo, G. Zhang,
W. Chen, W. Huang, and E. Benetos, “MusiLingo:
Bridging music and text with pre-trained language
models for music captioning and query response,”
in Findings of the Association for Computational
Linguistics: NAACL 2024, K. Duh, H. Gomez, and
S. Bethard, Eds. Mexico City, Mexico: Association
for Computational Linguistics, Jun. 2024,
pp. 3643–3655. [Online]. Available:
https://aclanthology.org/2024.findings-naacl.231
[22] S. Liu, A. S. Hussain, C. Sun, and Y. Shan, “Music
understanding llama: Advancing text-to-music
generation with question answering and captioning,” in
ICASSP 2024 - 2024 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP),
2024, pp. 286–290.
[23] J. Gardner, S. Durand, D. Stoller, and R. Bittner,
“Llark: A multimodal instruction-following language
model for music,” in Proceedings of the 41st International
Conference on Machine Learning (ICML), 2024.
[24] Y. Vasilakis, R. Bittner, and J. Pauwels, “I can listen but
cannot read: An evaluation of two-tower multimodal
systems for instrument recognition,” arXiv preprint
arXiv:2407.18058, 2024.
[25] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel,
M. Verzetti, A. Caillon, Q. Huang, A. Jansen,
A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour,
and C. Frank, “Musiclm: Generating music from text,”
arXiv preprint arXiv:2301.11325, 2023.
[26] E. Law, K. West, M. Mandel, M. Bay, and
J. Stephen Downie, “Evaluation of algorithms using
games: The case of music tagging,” in Proceedings of
the 10th ISMIR Conference, 2009.
[27] S. Sakshi, U. Tyagi, S. Kumar, A. Seth,
R. Selvakumar, O. Nieto, R. Duraiswami, S. Ghosh, and
D. Manocha, “Mmau: A massive multi-task audio
understanding and reasoning benchmark,” arXiv preprint
arXiv:2410.19168, 2024.
[28] Z. Wang, S. Li, T. Zhang, Q. Wang, P. Yu, J. Luo,
Y. Liu, M. Xi, and K. Zhang, “Muchin: A chinese
colloquial description benchmark for evaluating
language models in the field of music,” arXiv preprint
arXiv:2402.09871, 2024.
[29] R. Yuan, H. Lin, Y. Wang, Z. Tian, S. Wu, T. Shen,
G. Zhang, Y. Wu, C. Liu, Z. Zhou et al., “Chatmusician:
Understanding and generating music intrinsically
with llm,” arXiv preprint arXiv:2402.16153, 2024.
[30] J. Li, L. Yang, M. Tang, C. Chen, Z. Li, P. Wang,
and H. Zhao, “The music maestro or the musically
challenged, a massive music evaluation benchmark
for large language models,” arXiv preprint
arXiv:2406.15885, 2024.
[31] Q. Yang, J. Xu, W. Liu, Y. Chu, Z. Jiang,
X. Zhou, Y. Leng, Y. Lv, Z. Zhao, C. Zhou, and
J. Zhou, “Air-bench: Benchmarking large audio-language
models via generative comprehension,”
CoRR, vol. abs/2402.07729, 2024. [Online]. Available:
https://doi.org/10.48550/arXiv.2402.07729
[32] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang,
S. Stevens, D. Jiang, W. Ren, Y. Sun et al., “Mmmu:
A massive multi-discipline multimodal understanding
and reasoning benchmark for expert agi,” in Proceedings
of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2024, pp. 9556–9567.
[33] X. Wang, Y. Zhou, X. Liu, H. Lu, Y. Xu, F. He, J. Yoon,
T. Lu, G. Bertasius, M. Bansal et al., “Mementos: A
comprehensive benchmark for multimodal large language
model reasoning over image sequences,” arXiv
preprint arXiv:2401.10529, 2024.
[34] L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen,
H. Duan, J. Wang, Y. Qiao, D. Lin et al., “Are we
on the right way for evaluating large vision-language
models?” arXiv preprint arXiv:2403.20330, 2024.
[35] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey,
A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten,
A. Vaughan et al., “The llama 3 herd of models,” arXiv
preprint arXiv:2407.21783, 2024.
[36] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu,
C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5
technical report,” arXiv preprint arXiv:2412.15115, 2024.
[37] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch,
B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas,
E. B. Hanna, F. Bressand et al., “Mixtral of experts,”
arXiv preprint arXiv:2401.04088, 2024.
[38] G. Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin,
S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari,
A. Ramé et al., “Gemma 2: Improving open
language models at a practical size,” arXiv preprint
arXiv:2408.00118, 2024.
[39] “Cheaper, Better, Faster, Stronger | Mistral AI —
mistral.ai,” https://mistral.ai/news/mixtral-8x22b,
[Accessed 28-03-2025].
[40] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu,
C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan
et al., “Deepseek-v3 technical report,” arXiv preprint
arXiv:2412.19437, 2024.
[41] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
the limits of transfer learning with a unified
text-to-text transformer,” Journal of machine learning
research, vol. 21, no. 140, pp. 1–67, 2020.
[42] Y. Wu*, K. Chen*, T. Zhang*, Y. Hui*,
T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive
language-audio pretraining with feature fusion and
keyword-to-caption augmentation,” in IEEE International
Conference on Acoustics, Speech and Signal
Processing, ICASSP, 2023.
[43] N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, L. Hong,
E. H. Chi, J. Caverlee, J. McAuley, and D. Z. Cheng,
“How to train data-efficient llms,” arXiv preprint
arXiv:2402.09668, 2024.
[44] R. Yuan, H. Lin, S. Guo, G. Zhang, J. Pan, Y. Zang,
H. Liu, Y. Liang, W. Ma, X. Du et al., “Yue: Scaling
open foundation models for long-form music
generation,” arXiv preprint arXiv:2503.08638, 2025.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
