CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

Author: Yinghao MA; Siyou Li; Juntao Yu; Emmanouil Benetos; Akira Maezawa
Publisher: Zenodo
DOI: 10.5281/zenodo.17706469
Source: https://zenodo.org/records/17706469/files/000048.pdf
CMI-BENCH: A COMPREHENSIVE BENCHMARK FOR EVALUATING
MUSIC INSTRUCTION FOLLOWING
Yinghao Ma¹, Siyou Li¹, Juntao Yu¹, Emmanouil Benetos¹, Akira Maezawa²
¹ Queen Mary University of London, London, UK
² Yamaha Corporation, Hamamatsu, Japan
ABSTRACT
Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multi-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional MIR annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking, reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-audio, SALMONN, MusiLingo, etc. Experiment results reveal significant performance gaps between LLMs and supervised models, along with their culture, chronological and gender bias, highlighting the potential and limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.
1. INTRODUCTION
The emergence of large language models (LLMs) has reshaped the landscape of natural language processing by enabling general-purpose models to solve a wide variety of tasks through instruction following. This paradigm, where models are trained not just on pre-text corpora but instruction-response pairs, has unlocked new possibilities in model generalization, few-shot learning, and cross-domain reasoning. Supervised fine-tuning (SFT), also known as instruction fine-tuning, and reinforcement learning from human feedback have further strengthened LLMs' ability to align with human intent [1].

© Y. Ma, S. Li, J. Yu, E. Benetos, and A. Maezawa. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Ma, S. Li, J. Yu, E. Benetos, and A. Maezawa, "CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following", in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

In the context of music, the instruction-following paradigm holds particular promise. Many music-related tasks are naturally multimodal and domain-specific and often lack large-scale annotated data. Instruction-tuned models can generalize to previously unseen problems such as chord generation under rhythmic constraints or personalized music recommendation based on context. Besides, by supporting in-context learning, LLMs offer a flexible path to interact with world music traditions, rare genres, and diverse user preferences, all without explicit retraining [2].
Recently, a growing number of audio LLMs [3–6] have extended LLMs with audio encoders and instruction-following capabilities. However, these models have so far been evaluated on limited tasks, relying on caption similarity metrics on datasets like [3, 4], single-choice protocols [7–9], or multiple-choice question (MCQ) protocols [10]. Despite these successes, such evaluations fail to capture the complexity of core music information retrieval (MIR) tasks and offer limited insight into real-world performance.
This work makes three key contributions. First, we reinterpret a broad range of core MIR annotations as instruction-following tasks, as illustrated in Figure 1, enabling the use of a wide range of MIR datasets, including sequential tasks, not only for evaluation but also for training and SFT of audio-text LLMs. Second, we provide a standardized benchmarking framework that includes implementations of major open-source audio-text LLMs, along with evaluation metrics aligned with prior MIR literature. Unlike earlier MCQ protocols, CMI-Bench adopts open-ended, task-specific metrics, allowing more rigorous comparisons. Last, we present an initial analysis of generalization outside training data, along with cultural and gender bias across models, uncovering potential limitations in their generalization and pointing to future directions for culturally inclusive music AI. Together, these contributions lay groundwork for systematic progress in music instruction following and its intersection with traditional MIR. The code¹ and test set audio² are available.
This paper is organized as follows. Section 2 reviews related work. Section 3 introduces the CMI-Bench design and instruction construction procedure. Section 4 describes the experimental setup, including the evaluation protocol and open-source models. Section 5 presents benchmarking results. Finally, Section 6 draws the conclusion.

¹ https://github.com/nicolaus625/CMI-bench/
² https://huggingface.co/datasets/nicolaus625/CMI-bench

Question: Estimate the arousal score of the given music on a scale from 1 to 9, where 1 represents the lowest arousal level (calm, relaxing) and 9 represents the highest arousal level (energetic, intense). Provide a numerical estimate based on your perception of the music.
Example 1, when the music is slow and gentle, suggesting a low arousal level: \nEstimated score: 2.
Example 2, when the music is energetic and driving, indicating a high arousal level: \nEstimated score: 8.
Example 3, when the music has a balanced tempo and moderate intensity: \nEstimated score: 5.
Response: Based on the description provided, I would estimate the music's arousal score as 7. The key of C minor typically carries a somber or melancholic connotation, which combined with a brisk tempo and a 4/4 time signature suggests an energetic but perhaps slightly melancholic feel. However, the addition of a jazzy element and complex chord progression might add some complexity and vibrancy to the overall mood, making it more dynamic and potentially uplifting.
Question: Please provide the instrument tag(s) of given audio.
Response: The instrument tags of this audio are 'acoustic guitar' and 'flute'.
Figure 1. Instruction-Following Format Data in CMI-Bench and example response from Qwen2-audio.

2. RELATED WORK
2.1 Instruction Following Dataset
Instruction following refers to the ability of LLMs to perform tasks based on natural language prompts that describe the task itself. This paradigm has become central to recent advances in NLP, where SFT models are trained on a wide range of instruction-response pairs. Super-NaturalInstructions [11] rewrote annotations from over 1,600 diverse NLP tasks into instruction-following formats, showing that models can generalize to unseen tasks given clear instructions. Self-Instruct [12] further advanced this approach by automatically generating diverse instruction-response pairs using the model's own outputs, while InstructGPT [13] aligned models with human intent through SFT and reinforcement learning.
These techniques have recently been extended to music, enabling instruction-following models to engage with multimodal and domain-specific tasks. MusicQA [3] and MusicInstruct [4] repurpose descriptions and tags from MIR datasets to generate Q&A pairs. Such datasets do not distinguish subtasks among instrument, emotion, genre, and caption Q&A pairs, and they are evaluated with BERT-score, which can overestimate a model's music understanding capability because it is not compared on equal terms with traditional MIR algorithms. Finally, Audio-FLAN [14] presents a large-scale instruction-tuning corpus across 80 tasks, unifying understanding and generation in audio, music, and speech. Yet many tasks are paraphrased into an MCQ format whose set of options is significantly smaller than the range of pre-defined classes or labels. Furthermore, these works do not provide a model performance benchmark on such tasks, and the evaluation metrics are not comparable with traditional MIR methods.
2.2 Instruction-Following Benchmarks for Music
While instruction following has shown great promise in natural language and vision tasks, its application in the music domain remains underexplored. ZIQI-Eval [15] is an instruction-following benchmark on textual symbolic music. AIR-Bench [7] covers a broader range of audio types, including music, but emphasizes low-level tasks such as pitch and instrument recognition and relies primarily on MCQ formats. MMAU [10] includes music reasoning, yet only covers six MIR datasets, lacks alignment with MIR-specific evaluation metrics, and reports only average scores across tasks. MuChoMusic [8] evaluates music understanding in multimodal models through 1,187 MCQs. AudioBench [16] and MusicBench [17] primarily target audio and text-to-music generation respectively, without addressing MIR tasks. MuChin [18], while valuable for colloquial descriptions, is tailored to Chinese pop song generation. Across these efforts, most benchmarks omit key MIR tasks popularized by MIREX, few support sequential tasks, and evaluation protocols often rely on multiple-choice questions rather than the task-specific metrics used in supervised MIR literature.
2.3 Audio-Textual Large Language Models
Current audio-textual LLMs typically consist of an encoder for speech, audio or music, an intermediate architecture, and an LLM backbone. LTU [19] and LTU-AS [20] focus on general audio comprehension and reasoning, incorporating the Whisper speech encoder [21]. MU-LLaMA [3], MusiLingo [4] and LLark [22] are tailored for music-related tasks, leveraging audio encoders and instructional datasets to support captioning and open-ended music question answering. Pengi [23] frames all audio tasks as text generation, unifying audio perception with LLM-based reasoning via a simple prefix-tuning strategy. GAMA [24] and GAMA-IT [24] integrate multi-layer audio features and instruction tuning (CompA-R) to support complex reasoning over general audio, including music. SALMONN-Audio [6] introduces a window-level Q-Former architecture for sequential speech and sound understanding. Qwen-Audio [5] and Qwen2-Audio [25] scale instruction tuning across over 30 audio tasks with hierarchical or natural prompts. Audio-Flamingo [26] and Audio-Flamingo2 [27] incorporate in-context learning and retrieval-based adaptation for audio-text interaction and dialogue. Beyond open-source models, the proprietary systems Gemini-2.5 Pro and GPT-4o may represent the state of the art (SOTA).
3. CMI-BENCHMARK
With CMI-Bench, we aim to address the following limitations in evaluating the music understanding capabilities of audio-text LLMs. Previous benchmarks often cover only a narrow range of tasks, and no benchmark supports sequential tasks, overlooking many classic challenges that are central to MIR research. Moreover, evaluation protocols are typically inconsistent with standard MIR metrics, which makes results difficult to compare against traditional supervised models. To address these issues, we reformulate annotations from widely-used MIR datasets into instruction-following prompts and process model outputs into formats compatible with the standard MIR Python library mir_eval [28].
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.1 Overview
Tasks                      Dataset                  Metrics                       #Test Samples
Key detection              GS [29]                  Gmean score                   2406
Emotion Regression         EMO [30]                 R²                            125
Music tagging              MagnaTagATune [31]       ROC-AUC, PR-AUC               5329
                           MTG-Top50 [32]           ROC-AUC, PR-AUC               11356
Instrument Classification  MTG-Instrument [32]      ROC-AUC, PR-AUC               5115
                           Nsynth-Instrument [33]   Accuracy                      4096
Genre classification       MTG-Genre [32]           ROC-AUC, PR-AUC               11479
                           GTZAN [34]               Accuracy                      290
Emotion tagging            MTG-Emotion [32]         ROC-AUC, PR-AUC               4231
Pitch Estimation           Nsynth-Pitch [33]        Accuracy                      4096
Singing Techniques         VocalSet [35]            Accuracy                      1140
Music Captioning           SDD [36]                 BL., ME., RO., Bert-Score     1106
                           MusicCaps [37]           BL., ME., RO., Bert-Score     2813
Lyrics Transcription       DSing [38]               WER, CER                      482
Beat tracking              GTZAN-Rhythm [34]        F_measure                     290
                           ballroom [39,40]         F_measure                     685
DownBeat tracking          GTZAN-Rhythm             F_measure                     290
                           ballroom                 F_measure                     685
Melody Extraction          MedleyDB 2 [41]          Melody Accuracy               618
Performance Technique      GuZheng_99 [42]          frame-level micro/macro-F1    94
Table 1. Overview of tasks, datasets, evaluation metrics, and the number of test samples in the CMI-Bench.
The CMI-Benchmark encompasses 14 tasks spanning multi-class, multi-label, regression, captioning, and sequential prediction challenges, evaluated across 20 diverse datasets. This benchmark integrates traditional MIR tasks with emerging music-and-language objectives, providing a robust platform to assess computational music intelligence. The tasks and datasets used in the benchmark are shown in Table 1. By standardizing splits and metrics, CMI-Bench ensures reproducibility and fair comparisons.
3.2 Self-Instruction for MIR Annotations
In this subsection, we introduce the self-instruction framework of CMI-Bench designed to unify diverse MIR tasks under a consistent NLP paradigm, outlining the design of instructions and input tailored to tasks such as key estimation, genre classification, emotion regression, instrument tagging, and temporal sequence annotations. Our approach leverages structured prompts with multi-class, regression, and sequence-based outputs, enriched with few-shot examples to guide annotation generation.
For multi-label tasks, we allow flexible outputs without providing pre-defined tags, reflecting real-world complexity. For clip-level multi-class tasks with a manageable number of categories, such as musical key estimation and genre and vocal technique classification, instructions explicitly list all possible choices. For instance, key estimation requires selecting one of 24 major and minor keys, with few-shot examples like "Bb major" to clarify the format. In cases with large class sets, such as pitch classification on short excerpts across MIDI numbers 9 to 119, we provide a definition of the MIDI standard alongside examples (e.g., "A4: 69", "Middle C (C4) = 60") to anchor the task.
Regression tasks, such as arousal estimation, adopt a numerical scale (1 to 9) with descriptive anchors: 1 for "calm, relaxing" and 9 for "energetic, intense." To better utilize the in-context learning capability of LLMs, we include few-shot examples that tie scores to musical characteristics (e.g., "slow and gentle: 2," "energetic and driving: 8"), enabling precise emotional annotation.
Temporal tasks, such as beat tracking and instrument performance technique detection, require structured outputs. Beat tracking outputs timestamps in a comma-separated format (e.g., "0.1s, 1.19s, 2.25s"), while Guzheng (a traditional Chinese zither) technique detection uses a Python-style list of tuples (e.g., "[('70.8086', '71.4817', 'Tremolo')]"), covering techniques like Vibrato and Glissando. The default output "[('0.0', '10.0', 'No Tech')]" handles cases with no detections. Melody extraction follows similar principles, balancing specificity and clarity. We forbid tuples that overlap in time for melody and (down)beat tracking, but allow them for playing technique detection.
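
As an illustration of how such a Python-style tuple string could be turned back into typed segments, the following is a minimal sketch rather than the benchmark's actual parser. The function name `parse_techniques`, the fallback constant, and the choice to drop zero-length segments are assumptions made for this example; only the output format and the "No Tech" default come from the text above.

```python
import ast

# Fallback mirroring the default output quoted in the text (a 10 s clip is assumed).
DEFAULT = [(0.0, 10.0, "No Tech")]

def parse_techniques(response):
    """Parse "[('start', 'end', 'label'), ...]" into (float, float, str) tuples.

    Hypothetical helper: malformed items are skipped, and an unparsable or
    empty response falls back to DEFAULT.
    """
    try:
        items = ast.literal_eval(response.strip())
    except (ValueError, SyntaxError, TypeError):
        return DEFAULT
    if not isinstance(items, (list, tuple)):
        return DEFAULT
    parsed = []
    for item in items:
        try:
            start, end, label = item
            start, end = float(start), float(end)
        except (TypeError, ValueError):
            continue  # wrong arity or non-numeric times
        if end > start:  # assumption: ignore zero-length or reversed segments
            parsed.append((start, end, str(label)))
    return parsed or DEFAULT

print(parse_techniques("[('70.8086', '71.4817', 'Tremolo')]"))
```

`ast.literal_eval` is used instead of `eval` so that free-form model text cannot execute code.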
Inputs are uniformly represented as audio placeholders (e.g., "<|SOA|><AUDIO><|EOA|>"), paired with metadata such as audio paths and time segments. This ensures compatibility with NLP models while preserving MIR task diversity, offering a scalable framework for future efforts.
4. EXPERIMENTS
4.1 Evaluation Protocol
To enable rigorous and fair comparison with traditional MIR systems, we design an evaluation pipeline that closely follows the original task definitions and metrics. All model outputs are automatically post-processed to conform to each task's expected format, ensuring compatibility with MIR evaluation tools such as mir_eval. Below, we detail the evaluation strategies used for each task category.
4.1.1 Classification Tasks
Multi-Class Classification. Tasks include short-clip monophonic pitch estimation, instrument classification, singing technique classification, and genre classification. We evaluate using strict string matching: a model's response is considered correct if it contains only the correct label (case-, space-, and punctuation-insensitive) and no others. For pitch classification, we additionally require the model to follow the instruction format and return MIDI numbers. Accuracy is used as the metric.
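
The strict matching rule can be sketched as follows. This is one possible reading of "contains only the correct label", not the toolkit's implementation: the helper names are hypothetical, and the longest-label-first scan is an assumption added so that a short label such as "pop" is not counted inside "hiphop".

```python
import string

_DROP = set(string.punctuation) | set(string.whitespace)

def normalize(text):
    """Lowercase and remove whitespace and ASCII punctuation."""
    return "".join(ch for ch in text.lower() if ch not in _DROP)

def matched_labels(response, labels):
    """Return the candidate labels found in the response, longest labels first."""
    resp = normalize(response)
    hits = set()
    for label in sorted(labels, key=lambda lab: len(normalize(lab)), reverse=True):
        key = normalize(label)
        if key and key in resp:
            hits.add(label)
            resp = resp.replace(key, " ")  # consume the match so substrings do not re-match
    return hits

def is_correct(response, gold, labels):
    """Correct only if the gold label is the single label mentioned."""
    return matched_labels(response, labels) == {gold}

GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]
print(is_correct("Hip-Hop.", "hiphop", GENRES))
print(is_correct("rock or metal", "rock", GENRES))
```

Under this rule a response that hedges between two labels is scored as wrong even when one of them is correct, which matches the "and no others" condition.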
Multi-Label Classification. Tasks include music tagging, genre labelling, emotion tagging, and instrument recognition. As model responses may include synonyms or free-form text, we embed both the predicted and ground-truth tag sets using the BGE encoder [43], a model optimized for retrieval and multi-label matching. Cosine similarity scores are then used to compute ROC-AUC and PR-AUC, providing a soft but semantically aligned measure of evaluation quality.
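
To make the soft-matching idea concrete, the sketch below scores every tag in a vocabulary by its cosine similarity to the model response and then computes ROC-AUC from those scores. It is only an assumption about how the pieces fit together: `embed` is a toy character-trigram stand-in for a real sentence encoder such as BGE, and the exact pairing of predicted and reference tags in the benchmark may differ.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence encoder: padded character-trigram counts."""
    padded = "  " + text.lower() + "  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tag_scores(response, vocabulary):
    """One similarity score per vocabulary tag, usable as a soft prediction."""
    resp = embed(response)
    return [cosine(resp, embed(tag)) for tag in vocabulary]

def roc_auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

vocab = ["guitar", "piano", "drums", "flute"]
scores = tag_scores("acoustic guitar and flute", vocab)
print(roc_auc(scores, [1, 0, 0, 1]))
```

Because the scores are continuous, threshold-free metrics such as ROC-AUC and PR-AUC can be computed even though the model never emits probabilities.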
4.1.2 Clip-level MIR Tasks
Key Detection. We adopt the standard weighted score metric from mir_eval.key, which accounts for musically reasonable errors, such as relative minor or parallel key.
Regression. Model outputs are constrained to integers in the range [1, 9]; if a float is returned, we take the floor. Outputs are then z-score normalized to zero mean and unit variance. If a model fails to return a value, we assign the model's mean value. The coefficient of determination (R²) is computed between predictions and annotations on arousal and valence.
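
The post-processing chain described above can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark code: the helper names are hypothetical, clipping out-of-range values to [1, 9] and falling back to the scale midpoint when no output is valid are both assumptions.

```python
import math

def clean(outputs):
    """Floor numeric outputs, clip to [1, 9], and fill failures with the mean of valid ones."""
    values = []
    for raw in outputs:
        try:
            values.append(float(min(9, max(1, math.floor(float(raw))))))
        except (TypeError, ValueError, OverflowError):
            values.append(None)  # non-numeric, NaN, or infinite output
    valid = [v for v in values if v is not None]
    mean = sum(valid) / len(valid) if valid else 5.0  # assumed fallback: scale midpoint
    return [mean if v is None else v for v in values]

def zscore(xs):
    """Normalize to zero mean and unit variance (all zeros if the input is constant)."""
    mu = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sd if sd else 0.0 for x in xs]

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mu = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mu) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

preds = zscore(clean(["7.8", "abc", 3]))
print(preds)
```

A constant prediction normalizes to all zeros and yields an R² of 0 against z-scored annotations, so any negative R² means the model did worse than always predicting the mean.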
Music Captioning. We assess caption quality using four standard NLP metrics: BLEU [44,45], METEOR [46], ROUGE [47], and Bert-Score [48].
4.1.3 Sequential MIR Tasks
Lyrics Transcription. We extract lyrics from model outputs by removing typical prefixes (e.g., "lyrics is as follows:"). Word Error Rate (WER) and Character Error Rate (CER) are computed against ground-truth lyrics.
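
For reference, WER and CER are both edit distance divided by reference length, at word and character level respectively. The sketch below assumes simple lowercasing and whitespace tokenization, which may differ from the text normalization used in the benchmark; `strip_prefix` and its prefix list are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def strip_prefix(text, prefixes=("lyrics is as follows:",)):
    """Drop a known boilerplate prefix from a model response, if present."""
    text = text.strip()
    for prefix in prefixes:
        if text.lower().startswith(prefix):
            return text[len(prefix):].strip()
    return text

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference.lower(), hypothesis.lower()) / len(reference)

print(wer("let it be", strip_prefix("Lyrics is as follows: let it be")))
print(wer("let it be", "let it be let it be"))
```

Since insertions are counted against the reference length, a model that repeats itself or produces long unrelated text can exceed 100% WER.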
(Down)Beat Tracking. Models are expected to return a list of time points for (down)beat events. We filter non-numeric outputs, sort the list by time, and apply the F-measure metric from mir_eval.beat, with a 20ms tolerance.
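
The filtering, sorting, and scoring steps can be illustrated with a self-contained sketch. It is a simplified stand-in for the mir_eval.beat F-measure, not a reimplementation of it: the two-pointer matching, the helper names, and the handling of a trailing "s" unit are assumptions for this example.

```python
import math

def parse_times(response):
    """Extract timestamps from text like '0.1s, 1.19s, 2.25s'; drop non-numeric tokens."""
    times = []
    for token in response.split(","):
        token = token.strip().rstrip("sS").strip()
        try:
            value = float(token)
        except ValueError:
            continue
        if math.isfinite(value) and value >= 0:
            times.append(value)
    return sorted(times)

def f_measure(reference, estimated, tol=0.02):
    """F-measure where each reference beat matches at most one estimate within tol seconds.

    Both inputs must be sorted in ascending order.
    """
    i = j = hits = 0
    while i < len(reference) and j < len(estimated):
        diff = estimated[j] - reference[i]
        if abs(diff) <= tol:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # estimate is too early for this reference beat
        else:
            i += 1  # reference beat was missed
    if hits == 0:
        return 0.0
    precision, recall = hits / len(estimated), hits / len(reference)
    return 2 * precision * recall / (precision + recall)

estimated = parse_times("0.51s, 1.0s, beat, 1.7s")
print(f_measure([0.5, 1.0, 1.5, 2.0], estimated))
```

With a tolerance this tight, an estimate that is only a few tens of milliseconds off counts as both a false positive and a miss.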
Melody Extraction is treated as a sequential regression task on the fundamental frequency of notes, calculated by mir_eval.melody.evaluate with a tolerance of 50 cents. Models are instructed to return a list of (time, pitch) tuples. We discard invalid tuples (e.g., missing pitches or improperly formatted entries). If multiple pitches are predicted for the same timestamp, we use only the first. Evaluation is based on frame-level accuracy.
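
A minimal sketch of the clean-up rules and the 50-cent criterion follows. It assumes pitches are given in Hz and treats non-positive pitches as invalid, neither of which is stated in the text; the function names are hypothetical, and the actual scoring in the benchmark is delegated to mir_eval.

```python
import math

def clean_melody(pairs):
    """Keep the first valid (time, pitch) pair per timestamp, sorted by time."""
    seen = {}
    for pair in pairs:
        try:
            t, f = float(pair[0]), float(pair[1])
        except (TypeError, ValueError, IndexError):
            continue  # missing pitch or badly formatted entry
        valid = math.isfinite(t) and math.isfinite(f) and t >= 0 and f > 0
        if valid and t not in seen:  # only the first pitch per timestamp is kept
            seen[t] = f
    return sorted(seen.items())

def within_cents(f_est, f_ref, tol=50.0):
    """True if two frequencies in Hz differ by at most tol cents."""
    return abs(1200.0 * math.log2(f_est / f_ref)) <= tol

print(clean_melody([(0.0, 440.0), (0.0, 450.0), ("0.5", "bad"), (0.25, 220)]))
print(within_cents(445.0, 440.0))
```

One semitone is 100 cents, so the 50-cent tolerance accepts an estimate only if it rounds to the same semitone as the reference.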
Instrument Playing Technique Detection. For the GuZheng_99 dataset, we evaluate frame-level predictions using macro- and micro-F1 scores, allowing for overlapping techniques. Invalid predictions (e.g., incorrect tuple formats) are filtered out. Empty responses are interpreted as a "no technique" prediction covering the full time range.
4.2 Models
Model #Params Sound Music Speech
Encoder Architecture Decoder
Pengi [23] 323M ✓ ✓
Audio-Flamingo [26] 2.2B ✓ ✓
LTU [19] 7B ✓ ✓
LTU-AS [20] 7B ✓ ✓ ✓
MusiLingo-long [4] 7B ✓
MU-LLaMA [3] 7B ✓
GAMA [24] 7B ✓ ✓
GAMA-IT [24] 7B ✓ ✓
Qwen-Audio-Chat [5] 8.4B ✓
Qwen2-Audio-Instruct [25] 8.4B ✓ ✓ ✓
SALMONN-Audio [6] 13B ✓ ✓ ✓
Table 2. Comparison of audio-textual LLMs by training domains. ✓ denotes coverage or presence; ✗ absence.
To provide a broad and representative evaluation, we implement and benchmark 11 audio-text LLMs with publicly available weights, listed in Table 2. Our selection covers a wide spectrum of model designs and training corpora, enabling a comprehensive comparison of instruction-following capabilities across various music-specific tasks.
5. RESULTS AND DISCUSSION
5.1 Benchmarking Results
Experiment results reveal several important observations about the current state of audio-text LLMs on MIR tasks.
5.1.1 LLMs Underperform Traditional MIR Baselines.
Although LLMs have achieved excellent results on music captioning and multiple-choice QA [4,8–10,37], all models in our study fall significantly short of the performance achieved by task-specific supervised systems when evaluated using standard MIR metrics, with the exception of music captioning. This is consistent across classification, regression, and sequential tasks. These findings suggest that instruction-following LLMs still lack the specialized precision and inductive bias of MIR models trained explicitly for each task.
5.1.2 Best Performance May Skew toward Training Set
Interestingly, the peak performance on each task is typically achieved by models whose training corpus overlaps significantly with the evaluation dataset, revealing limited generalization. Qwen2-Audio performs best on MTG-Jamendo-related tasks such as MTG-top50, MTG-Emotion, and SDD captioning, but is unremarkable on other tagging and captioning datasets. This aligns with its use of MTG-Jamendo and FMA during model development via AIR-Bench, suggesting unsatisfying generalization capability. Besides, MusiLingo performs best on MusicCaps, the same dataset it was trained on for captioning and Q&A. Lastly, GAMA scores best on MTT and NSynth-instrument and is competitive on MusicCaps, but is unremarkable on other datasets for the same tasks, reflecting bias in its SFT corpus. These results demonstrate that supervised instruction-tuned models can capture task-specific patterns well when training data is directly aligned, but their generalization to unseen or structurally different tasks remains limited.
5.1.3 All Models Perform Poorly on DSing Transcription
Despite the absence of instrumental accompaniment and the use of English lyrics, none of the models reach usable performance levels on DSing for lyrics transcription, though it is relatively clean. This result is particularly striking for models like LTU and SALMONN, which include Whisper as their audio encoder and could theoretically benefit from ASR capabilities. LyricWhiz [64] utilizes GPT-4 to post-process Whisper ASR output on the DSing dataset, providing results similar to SOTA without training.
5.1.4 Prompting Format May Impact Performance.
Prompting without the task-specific tokens used during training significantly degrades performance. Qwen-Audio performs far worse on Nsynth-Pitch than reported in its original paper. This is likely due to the absence of structured task tokens (e.g., "<|pitch|><|midi_pitch|>piano") in our prompt. Instead, CMI-Bench relies on general natural language instructions. This highlights a critical gap in current audio LLMs: without clearly defined prompting schemas, their ability to interpret instructions can be fragile and fail to generalize. By contrast, different prompts for MusiLingo do not produce a significant difference on MusicCaps.
5.1.5 Sequential Tasks Remain Challenging for All.
Tasks involving structured sequence-based outputs, such as melody extraction, instrument performance technique detection, and (down)beat tracking, are poorly handled by all evaluated models. Even Qwen-Audio, which shows relatively strong performance in genre and beat tracking, falls far short of MIR baselines, sometimes copying the input examples. We hypothesize two key reasons. For one thing, the diversity and ambiguity in how sequence tasks are phrased (e.g., timestamps, tuple formats) reduces consistency in model outputs. For another, many models have only limited exposure to audio tasks with dense temporal supervision. If pretraining data includes timestamped output and matched decoding formats, performance may improve.

Task  Metric   Qw2.    Qw.     Salm.   MusiL.  LTU     LTU-AS  MU-L.   auFla.  Gama    GamaI   Pengi   SOTA
GS-K  GES ↑    8.28    6.51    7.70    9.50    7.61    1.42    7.56    8.21    7.69    7.70    0.00    74.3 [49]
EMO   aR2 ↑    -0.75   -0.44   -0.51   -0.68   -1.14   -1.27   -0.03   -0.85   -1.08   -0.29   0.00    0.62 [50]
      vR2 ↑    -0.84   -0.78   0.0     -0.60   -1.13   -0.78   -0.12   -0.60   -1.30   -1.19   0.00    0.76 [51]
MTT   ROC ↑    66.78   66.00   59.07   63.39   65.75   65.83   68.32   68.68   81.21   78.32   66.75   92.0 [52]
      PR ↑     19.15   16.99   15.08   12.25   17.78   15.72   18.65   20.16   34.26   27.53   17.82   41.4 [50]
M-G   ROC ↑    64.44   66.39   57.71   57.48   52.22   57.14   57.36   62.83   52.50   62.49   58.23   88.0 [51]
      PR ↑     9.23    8.07    5.62    4.99    3.62    4.98    4.97    6.85    3.90    6.01    5.47    20.5 [51]
M-E   ROC ↑    60.89   59.06   50.69   53.07   51.41   52.02   54.40   55.80   51.97   58.84   53.88   78.6 [53]
      PR ↑     7.85    6.09    3.65    3.95    3.98    3.72    4.35    4.60    4.07    5.27    3.93    16.1 [53]
M-I   ROC ↑    58.90   56.95   48.78   55.63   55.34   53.02   50.81   56.99   51.15   55.16   56.09   78.8 [54]
      PR ↑     12.41   11.35   7.44    9.24    10.98   8.90    8.24    10.71   9.01    10.69   9.36    22.0 [51]
M-50  ROC ↑    64.64   63.00   53.46   57.58   53.86   54.11   54.88   60.96   52.01   60.68   57.22   84.3 [53]
      PR ↑     16.54   14.45   9.49    9.68    8.30    8.67    9.11    12.16   8.10    11.72   10.19   32.1 [53]
GTZ.  Acc. ↑   72.07   71.38   32.76   7.24    2.76    16.90   8.97    50.34   21.38   42.41   6.21    83.9 [55]
VS-T  Acc. ↑   14.91   15.18   15.61   1.23    7.11    0.53    4.56    11.32   7.72    7.89    0.00    76.9 [56]
NI    Acc. ↑   37.62   4.13    0.15    0.00    0.49    6.88    0.00    15.80   58.37   39.36   42.26   78.2 [57]
NP    Acc. ↑   1.51    0.37    0.00    0.00    0.73    0.05    0.00    0.73    0.20    0.00    5.74    89.2 [53]
SDD   BL. ↑    23.40   11.95   16.41   8.14    11.54   9.72    15.55   15.14   15.96   20.93   15.47   -
      ME. ↑    23.21   9.35    18.45   14.32   8.51    7.49    13.89   11.81   13.81   16.41   9.98    16.7 [58]
      RO. ↑    28.47   12.35   28.12   30.15   9.33    9.42    15.28   12.92   18.35   20.07   11.45   111.9 [58]
      BS. ↑    87.44   84.79   86.68   85.28   84.44   83.62   86.38   85.75   85.89   86.21   82.90   86.0 [58]
MC    BL. ↑    14.76   2.98    1.23    21.50   5.24    4.22    3.48    2.25    7.57    14.53   16.52   21.7 [4]
      ME. ↑    12.47   5.55    4.60    22.49   8.55    7.01    8.01    5.97    10.07   10.98   14.77   22.4 [58]
      RO. ↑    12.35   6.68    6.26    30.29   9.39    7.51    8.58    6.94    11.38   12.46   12.64   30.8 [4]
      BS. ↑    84.38   82.37   82.98   85.75   83.84   83.59   83.00   83.43   84.30   84.57   83.22   87.8 [58]
DS    WE. ↓    793.0   115.7   816.1   2019    235.5   191.7   191.9   275.7   225.4   152.6   343.2   12.99 [59]
      CE. ↓    818.6   96.2    760.00  2311    210.8   185.5   168.3   262.6   201.3   165.2   368.0   -
G-B   FM. ↑    7.50    23.69   11.49   0.04    0.10    0.00    0.71    3.96    0.00    1.49    0.00    88.3 [56]
G-D   FM. ↑    5.97    10.21   8.62    0.18    0.86    0.00    0.17    3.06    0.05    0.54    0.00    54.1 [60]
BR-B  FM. ↑    7.12    21.96   14.97   0.01    0.15    0.00    0.22    4.69    0.02    1.02    0.00    96.8 [61]
BR-D  FM. ↑    5.69    10.68   9.40    0.06    2.29    0.00    0.15    3.47    0.14    0.68    0.00    94.1 [61]
MDB   Acc. ↑   5.06    0.08    0.00    0.00    0.00    0.00    0.00    0.01    0.66    0.00    0.00    72.3 [62]
GZ    maF1 ↑   3.18    1.66    0.03    0.00    0.04    0.00    0.00    0.00    0.00    0.00    0.00    90.0 [63]
      miF1 ↑   0.89    0.44    0.01    0.00    0.01    0.00    0.00    0.00    0.00    0.00    0.00    80.4 [63]
Table 3. Performance of 11 open-source audio-text LLMs on CMI-Bench. Models: Qwen2-Audio (Qw2.), Qwen-Audio (Qw.), SALMONN-Audio (Salm.), MusiLingo (MusiL.), LTU, LTU-AS, MU-LLaMA (MU-L.), Audio-Flamingo (auFla.), GAMA, GAMA-IT (GamaI), Pengi. Tasks include key detection (GS-K), emotion regression (EMO), tagging (MTT, M-50), genre (M-G, GTZ.), emotion/instrument tagging (M-E, M-I), captioning (SDD, MC), lyrics transcription (DS), beat/downbeat tracking (G-B/G-D, BR-B/BR-D), melody (MDB), and Guzheng techniques (GZ). Metrics: GES, R², ROC-AUC, PR-AUC, Accuracy, BLEU (BL.), METEOR (ME.), ROUGE (RO.), BERTScore (BS.), WER/CER, FM (F-Measure), Macro-F1 (maF1), Micro-F1 (miF1). Best scores are in bold.
5.1.6 Emotion Regression Fails for All Models.
Despite clear instructions, carefully designed scales (1–9), contextual music descriptions, and few-shot examples, all models fail to provide usable predictions for arousal and valence. In fact, model outputs often cluster around meaningless values, sometimes performing worse than simply predicting the mean. Our post-processing rules convert empty or invalid outputs to dataset means, which often leads to better R² scores than the models' own predictions, highlighting the severe limitations in mapping continuous perceptual attributes from music using current audio-text LLMs.
These findings emphasize the gap between current SFT multimodal LLMs and traditional task-specific MIR systems. While open-source audio LLMs show promise in isolated tasks with aligned training data, substantial challenges remain in terms of generalization, structured output generation, and adaptation to real-world settings.

5.2 Culture and Gender Bias
We further analyze the performance of two top-performing models, Qwen2-Audio and Audio-Flamingo, on fine-grained instrument, genre, and music tag categories. While both models show competitive results overall, our breakdown highlights notable performance disparities across instrument types, cultural genres, and voice-related tags.
[Figure 2: two panels, one per model, each with a vertical axis labelled "Values" (0 to 90) and the following categories on the horizontal axis. Instrument: All-instrument, accordion, bongo, harmonica, piano, violin. Genre: All-Genre, medieval, 60s, 70s, 80s, 90s, Bossanova, Celtic, chanson, ethno, Latin, world. Vocal tags: All-Tag, female, male, woman, man, female voice, male voice, female vocal, male vocal.]
Figure 2. Fine-grained evaluation of Qwen2-Audio and Audio-Flamingo on instrument (purple), genre (yellow), and vocal (red) tag classification. The upper extremity represents the ROC-AUC value, and the lower is PR-AUC.
5.2.1 Instrument Bias on MTG-Instrument
Both models achieve high scores on piano, reflecting the strong representation of piano in most training datasets. Western instruments such as violin and accordion perform close to the average, suggesting moderate robustness across common musical timbres. However, performance drops significantly on bongo and harmonica, which are commonly associated with world music. These results point to a persistent bias toward Western instruments and limited generalization to underrepresented timbres in current pre-training corpora.
5.2.2 Cultural Genre Imbalance on MTG-Genre
Genre classification results similarly reveal systematic disparities. Both models show relatively strong performance on mainstream Western pop genres (e.g., 80s, 90s), while genres associated with world music (e.g., Bossanova, Celtic, Chanson, Ethno, Latin) and music traditions (e.g., Medieval) consistently fall below average. For example, Audio-Flamingo's performance on Bossanova and Chanson drops severely. Qwen2-Audio performs slightly better on some long-tail genres, but still shows considerable degradation. These results highlight a lack of cultural and historical diversity in the data used for instruction tuning and model pretraining.
5.2.3 Voice Tag Differences on MTT
A detailed comparison on vocal tags reveals an interesting divergence. Audio-Flamingo is consistently better at identifying female voices than male voices, indicating a possible gender-related acoustic or annotation bias. In contrast, Qwen2-Audio achieves higher ROC-AUC for female tags but lower PR-AUC, suggesting that while the model ranks positive examples correctly, its absolute predictions remain sparse or overconfident. This mismatch implies that Qwen2-Audio is sensitive to class ranking but may lack calibration in estimating tag presence probabilities, an issue worth investigating for fairness and reliability in music model deployment.
5.3 Ablation Study on Different Prompts and Trials
Figure 3. Ablation Study on Prompt Sensitivity for Genre Classification and Arousal Regression
We conduct an ablation study on prompt design using the GAMA and GAMA-IT models across two representative tasks: GTZAN genre classification and EMO arousal regression. Variant Prompts 1 and 2 are evaluated over three runs, and the bars report mean performance with the variation across runs shown as error bars. GTZAN results (bottom row) are relatively stable across prompts and have small variance for each prompt over multiple trials, indicating that most genre-related instructions are consistently followed. The low variance suggests robustness to prompt changes. In contrast, EMO-A results (top row) show relative sensitivity to prompt variation, particularly under the GAMA-IT model. This instability stems from a higher rate of invalid or non-responsive generations, which are scored as mean values during evaluation. Consequently, differences in prompt phrasing might lead to large deviations, especially when valid predictions diverge significantly from the mean score.
6. CONCLUSION
We introduce CMI-Bench, a comprehensive benchmark for evaluating audio-text LLMs across diverse MIR tasks. Our results highlight a significant performance gap between LLMs and supervised MIR systems, with the best models like Qwen2-Audio and GAMA also struggling with generalization. Sequence-based tasks, such as melody extraction and beat tracking, pose particular challenges, likely due to limited timestamped pretraining and prompt sensitivity. Fine-grained analysis also reveals cultural and gender biases tied to training data imbalances. By offering a standardized evaluation framework and toolkit, CMI-Bench bridges NLP and MIR research, providing a foundation for future advancements. Progress will hinge on improved pretraining, sequential output handling, and bias mitigation, and we hope this work spurs collaboration toward more capable music-aware LLMs.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
A. ACKNOWLEDGMENTS
Yinghao Ma is a research student at the UKRI Centre
for Doctoral Training in Artificial Intelligence and Music,
supported by UK Research and Innovation [grant number
EP/S022694/1].
Siyou Li is a research student at the Computational
Linguistics Lab at Queen Mary University of London, funded
by the QMUL-CSC PhD scholarships.
Yinghao Ma would also like to express heartfelt gratitude
to the Student Philharmonic Chinese Orchestra at the
Chinese Music Institute, Peking University (abbreviated as
CMI, unrelated to the paper title). We warmly congratulate
the orchestra on its 20th anniversary.
B. ETHICS STATEMENT
CMI-Bench repurposes existing publicly available datasets
in the MIR domain by reformatting their annotations into
instruction-following formats. No new human annotations
were collected, and no human participants were involved
in the creation of this benchmark. All data used in the
project are licensed under terms that permit non-commercial
research use. In compliance with these terms, we license
CMI-Bench under a Creative Commons
Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. To
promote long-term accessibility, we host the audio test
set on Hugging Face with clear usage restrictions for
non-commercial purposes.
The dataset primarily consists of Western, English-language
popular music, with limited inclusion of instrumental
tracks and non-English songs. Transcription tasks
are restricted to English lyrics, and world music instruments
besides Guzheng are underrepresented. We acknowledge
this cultural and linguistic skew and encourage future
extensions to improve global diversity and representation.
This work involves no safety, security, or environmental
risks. The benchmark does not require high-compute model
training or deployment of potentially harmful generative
models. We release CMI-Bench and its evaluation toolkit
to foster responsible and reproducible research in
audio-language modeling.
C. APPENDIX
Due to the page limit of the ISMIR proceedings, please refer
to our arXiv version for more information on instruction
examples, error case analysis, and further discussion.
D. REFERENCES
[1]
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang,
Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong e al.,
“A su ey o la ge language models,” a Xi p ep in
a Xi :2303.18223, ol. 1, no. 2, 2023.
[2]
Y. Ma, A. Øland, A. Ragni, B. M. Del Se e, C. Sai is,
C. Donahue, C. Lin, C. Plachou as, E. Bene os, E. Sha i
e al., “Founda ion models o music: A su ey,” a Xi
p ep in a Xi :2408.14340, 2024.
[3]
S. Liu, A. S. Hussain, C. Sun, and Y. Shan, “Music un-
de s anding llama: Ad ancing ex - o-music gene a ion
wi h ques ion answe ing and cap ioning,” in ICASSP
2024-2024 IEEE In e na ional Con e ence on Acous ics,
Speech and Signal P ocessing (ICASSP). IEEE, 2024,
pp. 286–290.
[4]
Z. Deng, Y. Ma, Y. Liu, R. Guo, G. Zhang, W. Chen,
W. Huang, and E. Bene os, “Musilingo: B idging music
and ex wi h p e- ained language models o music
cap ioning and que y esponse,” in NAACL-HLT (Find-
ings), 2024.
[5]
Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan,
C. Zhou, and J. Zhou, “Qwen-audio: Ad ancing uni e -
sal audio unde s anding ia uni ied la ge-scale audio-
language models,” a Xi p ep in a Xi :2311.07919,
2023.
[6]
C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu,
M. Zejun, and C. Zhang, “Salmonn: Towa ds gene ic
hea ing abili ies o la ge language models,” in The
Twel h In e na ional Con e ence on Lea ning Rep e-
sen a ions.
[7]
Q. Yang, J. Xu, W. Liu, Y. Chu, Z. Jiang, X. Zhou,
Y. Leng, Y. L , Z. Zhao, C. Zhou e al., “Ai -bench:
Benchma king la ge audio-language models ia gene a-
i e comp ehension,” in ACL (1), 2024.
[8]
B. Weck, I. Manco, E. Bene os, E. Quin on, G. Fazekas,
and D. Bogdano , “Muchomusic: E alua ing music
unde s anding in mul imodal audio-language models,”
in P oceedings o he 25 h In e na ional Socie y o
Music In o ma ion Re ie al Con e ence, ISMIR 2024,
San F ancisco, Cali o nia, USA and Online, No embe
10-14, 2024, B. Kaneshi o, G. J. Myso e, O. Nie o,
C. Donahue, C. A. Huang, J. H. Lee, B. McFee, and
M. C. McCallum, Eds., 2024, pp. 825–833. [Online].
A ailable: h ps://doi.o g/10.5281/zenodo.14877459
[9]
Y. Li, G. Zhang, Y. Ma, R. Yuan, K. Zhu, H. Guo,
Y. Liang, J. Liu, Z. Wang, J. Yang e al., “Omnibench:
Towa ds he u u e o uni e sal omni-language models,”
a Xi p ep in a Xi :2409.15272, 2024.
[10]
S. Sakshi, U. Tyagi, S. Kuma , A. Se h, R. Sel akuma ,
O. Nie o, R. Du aiswami, S. Ghosh, and D. Manocha,
“Mmau: A massi e mul i- ask audio unde s anding and
easoning benchma k,” in The Thi een h In e na ional
Con e ence on Lea ning Rep esen a ions.
[11]
Y. Wang, S. Mish a, P. Alipoo molabashi, Y. Ko-
di, A. Mi zaei, A. Naik, A. Ashok, A. S.
Dhanaseka an, A. A unkuma , D. S ap e al., “Supe -
na u alins uc ions: Gene aliza ion ia decla a i e in-
s uc ions on 1600+ nlp asks,” in P oceedings o he
2022 Con e ence on Empi ical Me hods in Na u al Lan-
guage P ocessing, 2022, pp. 5085–5109.
[12]
Y. Wang, Y. Ko di, S. Mish a, A. Liu, N. A. Smi h,
D. Khashabi, and H. Hajishi zi, “Sel -ins uc : Aligning
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
422
language models wi h sel -gene a ed ins uc ions,” in
P oceedings o he 61s Annual Mee ing o he Associ-
a ion o Compu a ional Linguis ics (Volume 1: Long
Pape s). Associa ion o Compu a ional Linguis ics,
2023.
[13]
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainw igh ,
P. Mishkin, C. Zhang, S. Aga wal, K. Slama, A. Ray
e al., “T aining language models o ollow ins uc ions
wi h human eedback,” Ad ances in neu al in o ma ion
p ocessing sys ems, ol. 35, pp. 27 730–27744, 2022.
[14]
L. Xue, Z. Zhou, J. Pan, Z. Li, S. Fan, Y. Ma, S. Cheng,
D. Yang, H. Guo, Y. Xiao e al., “Audio- lan: A p elim-
ina y elease,” a Xi p ep in a Xi :2502.16584, 2025.
[15]
J. Li, L. Yang, M. Tang, C. Chenchong, Z. Li, P. Wang,
and H. Zhao, “The music maes o o he musically chal-
lenged, a massi e music e alua ion benchma k o la ge
language models,” in Findings o he Associa ion o
Compu a ional Linguis ics ACL 2024, 2024, pp. 3246–
3257.
[16]
B. Wang, X. Zou, G. Lin, S. Sun, Z. Liu, W. Zhang,
Z. Liu, A. Aw, and N. F. Chen, “Audiobench: A uni e -
sal benchma k o audio la ge language models,” a Xi
p ep in a Xi :2406.16020, 2024.
[17]
J. Melecho sky, Z. Guo, D. Ghosal, N. Majumde ,
D. He emans, and S. Po ia, “Mus ango: Towa d con-
ollable ex - o-music gene a ion,” in P oceedings o
he 2024 Con e ence o he No h Ame ican Chap e o
he Associa ion o Compu a ional Linguis ics: Human
Language Technologies (Volume 1: Long Pape s), 2024,
pp. 8286–8309.
[18]
Z. Wang, S. Li, T. Zhang, Q. Wang, P. Yu, J. Luo, Y. Liu,
M. Xi, and K. Zhang, “Muchin: a chinese colloquial
desc ip ion benchma k o e alua ing language models
in he ield o music,” in P oceedings o he Thi y-Thi d
In e na ional Join Con e ence on A i icial In elligence,
2024, pp. 7771–7779.
[19]
Y. Gong, H. Luo, A. H. Liu, L. Ka linsky, and J. Glass,
“Lis en, hink, and unde s and,” in In e na ional Con e -
ence on Lea ning Rep esen a ions, 2024.
[20]
Y. Gong, A. H. Liu, H. Luo, L. Ka linsky, and
J. Glass, “Join audio and speech unde s anding,” in
2023 IEEE Au oma ic Speech Recogni ion and Unde -
s anding Wo kshop (ASRU). IEEE, 2023, pp. 1–8.
[21]
A. Rad o d, J. Kim, T. Xu, G. B ockman, C. McLea ey,
and I. Su ske e , “Robus speech ecogni ion ia la ge-
scale weak supe ision (a xi : 2212.04356). a xi ,”
2022.
[22]
J. P. Ga dne , S. Du and, D. S olle , and R. M. Bi ne ,
“Lla k: A mul imodal ins uc ion- ollowing language
model o music,” in In e na ional Con e ence on Ma-
chine Lea ning. PMLR, 2024, pp. 15 037–15 082.
[23]
S. Deshmukh, B. Elizalde, R. Singh, and H. Wang,
“Pengi: An audio language model o audio asks,”
Ad ances in Neu al In o ma ion P ocessing Sys ems,
ol. 36, pp. 18 090–18108, 2023.
[24]
S. Ghosh, S. Kuma , A. Se h, C. K. R. E u u, U. Tyagi,
S. Singh, O. Nie o, R. Du aiswami, and D. Manocha,
“Gama: A la ge audio-language model wi h ad anced
audio unde s anding and complex easoning abili ies,”
CoRR, 2024.
[25]
Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng,
Y. L , J. He, J. Lin e al., “Qwen2-audio echnical e-
po ,” a Xi p ep in a Xi :2407.10759, 2024.
[26]
Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and
B. Ca anza o, “Audio lamingo: a no el audio language
model wi h ew-sho lea ning and dialogue abili ies,” in
P oceedings o he 41s In e na ional Con e ence on
Machine Lea ning, 2024, pp. 25 125–25 148.
[27]
S. Ghosh, Z. Kong, S. Kuma , S. Sakshi, J. Kim,
W. Ping, R. Valle, D. Manocha, and B. Ca anza o,
“Audio lamingo 2: An audio-language model wi h
long-audio unde s anding and expe easoning abili-
ies,” a Xi p ep in a Xi :2503.03983, 2025.
[28]
C. Ra el, B. McFee, E. J. Humph ey, J. Salamon, O. Ni-
e o, D. Liang, D. P. Ellis, and C. C. Ra el, “Mi _e al:
A anspa en implemen a ion o common mi me ics.”
in ISMIR, 2014, pp. 367–372.
[29]
P. Knees, Á. Fa aldo Pé ez, H. Boye , R. Vogl, S. Böck,
F. Hö schläge , M. Le Go e al., “Two da a se s o
empo es ima ion and key de ec ion in elec onic dance
music anno a ed om use co ec ions,” in P oceedings
o he 16 h In e na ional Socie y o Music In o ma-
ion Re ie al Con e ence (ISMIR); 2015 Oc 26-30;
Málaga, Spain.[Málaga]: In e na ional Socie y o Mu-
sic In o ma ion Re ie al, 2015. p. 364-70. In e na-
ional Socie y o Music In o ma ion Re ie al (ISMIR),
2015.
[30]
M. Soleymani, M. N. Ca o, E. M. Schmid , C.-Y. Sha,
and Y.-H. Yang, “1000 songs o emo ional analysis o
music,” in P oceedings o he 2nd ACM in e na ional
wo kshop on C owdsou cing o mul imedia, 2013, pp.
1–6.
[31]
E. Law, K. Wes , M. I. Mandel, M. Bay, and J. S.
Downie, “E alua ion o algo i hms using games: The
case o music agging.” in ISMIR. Ci esee , 2009, pp.
387–392.
[32]
D. Bogdano , M. Won, P. To s ogan, A. Po e , and
X. Se a, “The m g-jamendo da ase o au oma ic mu-
sic agging,” in In e na ional Con e ence on Machine
Lea ning. ICML, 2019.
[33]
J. Engel, C. Resnick, A. Robe s, S. Dieleman,
M. No ouzi, D. Eck, and K. Simonyan, “Neu al au-
dio syn hesis o musical no es wi h wa ene au oen-
code s,” in In e na ional Con e ence on Machine Lea n-
ing. PMLR, 2017, pp. 1068–1077.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
423
[34]
G. Tzane akis and P. Cook, “Musical gen e classi ica-
ion o audio signals,” IEEE T ansac ions on speech and
audio p ocessing, ol. 10, no. 5, pp. 293–302, 2002.
[35]
J. Wilkins, P. See ha aman, A. Wahl, and B. Pa do,
“Vocalse : A singing oice da ase .” in ISMIR, 2018, pp.
468–474.
[36]
I. Manco, B. Weck, S. Doh, M. Won, Y. Zhang, D. Bog-
dano , Y. Wu, K. Chen, P. To s ogan, E. Bene os
e al., “The song desc ibe da ase : a co pus o au-
dio cap ions o music-and-language e alua ion,” a Xi
p ep in a Xi :2311.10057, 2023.
[37]
A. Agos inelli, T. I. Denk, Z. Bo sos, J. Engel,
M. Ve ze i, A. Caillon, Q. Huang, A. Jansen,
A. Robe s, M. Tagliasacchi e al., “Musiclm: Gene a -
ing music om ex ,” a Xi p ep in a Xi :2301.11325,
2023.
[38]
G. Roa Dabike and J. Ba ke , “Au oma ic ly ic an-
sc ip ion om ka aoke ocal acks: Resou ces and a
baseline sys em,” in P oceedings o he 20 h Annual
Con e ence o he In e na ional Speech Communica ion
Associa ion (INTERSPEECH 2019), 2019.
[39]
F. Gouyon, A. Klapu i, S. Dixon, M. Alonso, G. Tzane-
akis, C. Uhle, and P. Cano, “An expe imen al com-
pa ison o audio empo induc ion algo i hms,” IEEE
T ansac ions on Audio, Speech, and Language P ocess-
ing, ol. 14, no. 5, pp. 1832–1844, 2006.
[40]
F. K ebs, S. Böck, and G. Widme , “Rhy hmic pa e n
modeling o bea and downbea acking in musical
audio.” in Ismi , 2013, pp. 227–232.
[41]
R. Bi ne , J. Salamon, M. Tie ney, M. Mauch, C. Can-
nam, and J. Bello, “Medleydb: A mul i ack da ase o
anno a ion-in ensi e mi esea ch,” 10 2014.
[42]
D. Li, M. Che, W. Meng, Y. Wu, Y. Yu, F. Xia, and W. Li,
“F ame-le el mul i-label playing echnique de ec ion us-
ing mul i-scale ne wo k and sel -a en ion mechanism,”
in IEEE In e na ional Con e ence on Acous ics, Speech
and Signal P ocessing ICASSP 2023, Rhodes Island,
G eece, June 4-10, 2023. IEEE, 2023, pp. 1–5.
[43]
S. Xiao, Z. Liu, P. Zhang, N. Muennigho , D. Lian,
and J.-Y. Nie, “C-pack: Packed esou ces o gene al
chinese embeddings,” in P oceedings o he 47 h in-
e na ional ACM SIGIR con e ence on esea ch and
de elopmen in in o ma ion e ie al, 2024, pp. 641–
649.
[44]
K. Papineni, S. Roukos, T. Wa d, and W. jing Zhu,
“Bleu: a me hod o au oma ic e alua ion o machine
ansla ion,” in P oceedings o he 40 h annual mee ing
o he Associa ion o Compu a ional Linguis ics, 2002,
pp. 311–318.
[45]
C.-Y. Lin and F. J. Och, “ORANGE: a me hod o
e alua ing au oma ic e alua ion me ics o machine
ansla ion,” in COLING 2004: P oceedings o
he 20 h In e na ional Con e ence on Compu a ional
Linguis ics. Gene a, Swi ze land: COLING, aug
23–aug 27 2004, pp. 501–507. [Online]. A ailable:
h ps://www.aclweb.o g/an hology/C04-1072
[46]
S. Bane jee and A. La ie, “Me eo : An au oma ic me ic
o m e alua ion wi h imp o ed co ela ion wi h human
judgmen s,” in P oceedings o he acl wo kshop on
in insic and ex insic e alua ion measu es o machine
ansla ion and/o summa iza ion, 2005, pp. 65–72.
[47]
C.-Y. Lin, “ROUGE: A package o au oma ic
e alua ion o summa ies,” in Tex Summa iza ion
B anches Ou . Ba celona, Spain: Associa ion o
Compu a ional Linguis ics, Jul. 2004, pp. 74–81.
[Online]. A ailable: h ps://www.aclweb.o g/an hology/
W04-1013
[48]
T. Zhang*, V. Kisho e*, F. Wu*, K. Q. Weinbe ge ,
and Y. A zi, “Be sco e: E alua ing ex gene a ion
wi h be ,” in In e na ional Con e ence on Lea ning
Rep esen a ions, 2020. [Online]. A ailable: h ps:
//open e iew.ne / o um?id=SkeHuCVFD
[49]
F. Ko zeniowski and G. Widme , “End- o-end musical
key es ima ion using a con olu ional neu al ne wo k,”
in 2017 25 h Eu opean Signal P ocessing Con e ence
(EUSIPCO). IEEE, 2017, pp. 966–970.
[50]
R. Cas ellon, C. Donahue, and P. Liang, “Codi ied
audio language modeling lea ns use ul ep esen a-
ions o music in o ma ion e ie al,” a Xi p ep in
a Xi :2107.05677, 2021.
[51]
R. Yuan, Y. Ma, Y. Li, G. Zhang, X. Chen, H. Yin,
Y. Liu, J. Huang, Z. Tian, B. Deng e al., “Ma ble:
Music audio ep esen a ion benchma k o uni e sal
e alua ion,” Ad ances in Neu al In o ma ion P ocessing
Sys ems, ol. 36, pp. 39 626–39647, 2023.
[52]
Q. Huang, A. Jansen, J. Lee, R. Gan i, J. Y. Li, and D. P.
Ellis, “Mulan: A join embedding o music audio and
na u al language,” a Xi p ep in a Xi :2208.12415,
2022.
[53]
M. C. McCallum, F. Ko zeniowski, S. O amas,
F. Gouyon, and A. F. Ehmann, “Supe ised and un-
supe ised lea ning o audio ep esen a ions o music
unde s anding,” Ismi 2022 Hyb id Con e ence, 2022.
[54]
P. Alonso-Jiménez, X. Se a, and D. Bogdano , “Music
ep esen a ion lea ning based on edi o ial me ada a om
discogs,” 2022.
[55]
D. Niizumi, D. Takeuchi, Y. Ohishi, N. Ha ada, and
K. Kashino, “Masked modeling duo: Lea ning ep e-
sen a ions by encou aging bo h ne wo ks o model he
inpu ,” in ICASSP 2023-2023 IEEE In e na ional Con-
e ence On Acous ics, Speech And Signal P ocessing
(ICASSP). IEEE, 2023, pp. 1–5.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
424