Aligning Text-to-Music Evaluation With Human Preferences

Author: Yichen Huang; Zachary Novack; Koichi Saito; Jiatong Shi; Shinji Watanabe; Yuki Mitsufuji; John Thickstun; Chris Donahue
Publisher: Zenodo
DOI: 10.5281/zenodo.17706363
Source: https://zenodo.org/records/17706363/files/000021.pdf
ALIGNING TEXT-TO-MUSIC EVALUATION
WITH HUMAN PREFERENCES
Yichen Huang¹, Zachary Novack², Koichi Saito³, Jiatong Shi¹,
Shinji Watanabe¹, Yuki Mitsufuji³, John Thickstun⁴, Chris Donahue¹
¹ Carnegie Mellon University, ² University of California – San Diego,
³ Sony AI, ⁴ Cornell University
[email protected], [email protected]
ABSTRACT
Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, an open-source dataset of pairwise human preferences for TTM systems. We find that not only is the standard FAD setup inconsistent on both synthetic and human preference data, but that nearly all existing metrics fail to effectively capture desiderata, and are only weakly correlated with human perception. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model. We find that this metric effectively captures diverse musical desiderata (average rank correlation 0.84 for MAD vs. 0.49 for FAD) and also correlates more strongly with MusicPrefs (0.62 vs. 0.14).
1. INTRODUCTION
Recent advances in text-to-music (TTM) modeling have produced models capable of generating coherent, high-fidelity, open-ended music audio [1–6]. While the perceived quality of TTM systems has clearly improved, our evaluation methods have not kept pace with this progress. Systematic human evaluation data for generated music is limited, unlike modalities like text [7] and speech [8]. Automatic evaluations of music commonly rely on the Fréchet Audio Distance (FAD) [9]. However, FAD was originally developed for evaluating music enhancement algorithms, and has been shown to correlate poorly with human preferences on open-ended music generation [10].
© Y. Huang, Z. Novack, K. Saito, J. Shi, S. Watanabe, Y. Mitsufuji, J. Thickstun, C. Donahue. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Huang, Z. Novack, K. Saito, J. Shi, S. Watanabe, Y. Mitsufuji, J. Thickstun, C. Donahue, “Aligning Text-to-music Evaluation with Human Preferences”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In this work, we perform a systematic study of the design space of automatic evaluation metrics for open-ended Western pop music generation. We identify three components of an automatic evaluation metric: (1) a reference set of representative music that we aim to model, (2) a representation that captures salient features of music audio, and (3) a divergence metric that quantifies differences between representations of the reference set and those of the generated music. This design space includes FAD, which uses VGGish features [11] and the Fréchet distance as a divergence, as well as more recent proposals that pair Fréchet distance with richer audio representations [12, 13] such as those from CLAP [14], or replace Fréchet distance with kernelized divergence metrics such as MMD [15].
To more rigorously explore this design space, we conduct a meta-evaluation of metrics by studying their sensitivity to “common sense” desiderata: properties that we posit as desirable for any TTM system tailored to pop music generation. We codify each desideratum as a synthetic data generation process: we produce degraded music audio with controlled and increasing amounts of degradation, inducing an interpretable ordering. We then meta-evaluate a metric by measuring the rank correlation (Kendall’s τ) between its ordering of the degraded music and the “ground truth”. We propose four degradation processes, reflecting sensitivity of metrics to four desiderata: fidelity, musicality, context length, and diversity. From these findings we propose a new metric, the MAUVE Audio Divergence (MAD), which performs well in our meta-evaluation compared to FAD (τ = 0.84 vs. 0.49 respectively, see our supplementary material for comprehensive results).¹
Ultimately our goal is to identify metrics that not only align with common sense desiderata, but also with real human preferences. To measure this, we collect and release MusicPrefs, an open-source dataset of pairwise human preferences for TTM generation (concurrent with [16, 17]). We find that MAD correlates more strongly with human preferences (τ = 0.62) according to MusicPrefs, compared to traditional evaluation metrics including FAD (τ = 0.14). We can think of the meta-evaluation as training data for metric selection, and MusicPrefs as test data; from this perspective, the results of human evaluation show that MAD is not overfit to the synthetic meta-evaluation tasks.
¹ https://bit.ly/mad-metric-supplement
[Figure 1 diagram. TTM models (Model #1, Model #2, ..., Model #N) produce a generated distribution. Reference and generated audio are passed through an embedding model (old: VGGish; proposed: MERT) to obtain reference and generated embeddings, and a divergence is calculated between them (old: FAD, MMD; proposed: MAUVE), giving the automatic evaluation (MAD). Human A/B evaluation (MusicPrefs) gives win rates per model. The correlation between win rates (e.g. [Model #2: 30%, Model #1: 40%, ...]) and divergences (e.g. [Model #3: 0.6, Model #2: 0.4, ...]) is then measured.]
Figure 1: Overview of our proposed automatic evaluation metric (MAD) and open dataset of human preferences for TTM (MusicPrefs). Given a collection of open TTM models, we present a thorough analysis of different reference-based divergence metrics and embedding backbones. Then, by collecting MusicPrefs, an open source dataset of pairwise TTM human preference data, we measure how well the induced rankings of different metrics correlate with human preferences.
Our overall contributions are summarized as follows:
• We perform a systematic study of evaluation metrics for TTM, based on a broad set of audio degradation models and validated by human preferences.
• We introduce MusicPrefs, a new open source dataset of human preference data on TTM outputs.
• We propose the MAUVE Audio Divergence (MAD) metric for TTM evaluation, based on the MAUVE metric for open-ended text generation [18]. According to MusicPrefs, MAD correlates more strongly with human preferences than previous metrics.
Figure 1 provides an overview of MAD and MusicPrefs. Code and data are made available.²
2. PRELIMINARIES AND METHODS
We study reference-based evaluation metrics [19] that compare a collection of generated outputs to a reference set of representative data from the distribution we aspire to model. A reference-based metric quantifies the divergence between a distribution $q$ (ordinarily a generative model) and a reference distribution $p$ using a representative sample $x_1, \ldots, x_{N_p} \sim p$, i.e., a reference set of $N_p$ human music performances. To identify salient discrepancies between $q$ and $p$, we define an evaluation metric on features of examples extracted using an embedding model. Given $N_q$ generated outputs $y_1, \ldots, y_{N_q} \sim q$, and features $f : \text{audio} \to \mathbb{R}^d$ from an embedding model, a reference-based evaluation metric $M : q \to [0, \infty)$ is a divergence $D(f(y_1), \ldots, f(y_{N_q}) \,\|\, f(x_1), \ldots, f(x_{N_p}))$ between features of audio generated by $q$ and features of reference audio from $p$.
² Sound examples from our synthetic study and MusicPrefs can be found at https://bit.ly/mad-metric. An implementation of MAD is available at https://github.com/i-need-sleep/mad. MusicPrefs is available at https://huggingface.co/datasets/i-need-sleep/musicprefs.
The design space of reference-based evaluation metrics can therefore be characterized by (1) the reference distribution $p$ and corresponding reference set, (2) the embedding model and features used to process the audio samples, and (3) the divergence $D$ used to calculate distributional discrepancies in feature space $\mathbb{R}^d$. For example, FAD uses VGGish as the embedder, activations of the final (pre-classification) layer as features, Fréchet distance as a divergence, and doesn’t prescribe a particular reference set.
2.1 Choosing a Reference Set
Gui et al. [12] report that the choice of reference set can heavily influence the performance of automatic metrics. Moreover, the commonly used MusicCaps [3] dataset includes a notable amount of low-quality entries that can make FAD scores correlate poorly with human ratings. In our experiments, we primarily use FMA-Pop [12], a curated subset of FMA [20] emphasizing songs with high play counts under the “pop” label. We focus on pop because this is the primary genre modeled by state-of-the-art music generation systems. FMA-Pop contains 4,230 songs of 30 seconds in length each. To compare this openly available reference set to more restrictive choices, we also experiment with an internal dataset of high-quality licensed music audio containing 7,846 songs, from which we extract 30-second clips at random.
2.2 Extracting Features from Embedding Models
We study the performance of a variety of audio models as embedding models. In addition to VGGish [11], originally used for FAD, and the higher-performing audio understanding models explored in Gui et al. [12] (CLAP [14] and MERT [21]), we experiment with strong music generation models (MusicGen [4] and Jukebox [1]). We study these self-supervised models as embedding models because they may more accurately capture salient audio features overlooked by distantly-supervised music models [22].
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Given an embedding model, we consider several strategies to extract features. A natural candidate for features is the model’s internal activations. For a deep model with many layers, we must decide which layer’s activations to use as features. For audio models, which process temporal data, we must decide how to aggregate these features across time. We study four strategies for temporal aggregation of features: (1) Max-pool: the maximum of each feature across time; (2) Average-pool: the mean of each feature across time; (3) Last: features at the last time index; (4) First: features at the first time index (only considered for the bi-directional MERT model). For more details about our search space of embedding backbones, layers, and pooling methods, see Table 3 in the supplement.
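The four temporal aggregation strategies can be sketched in a few lines. This is a minimal illustration in plain Python rather than the authors' implementation; the function name `pool` and the toy activations are assumptions made for the example.

```python
from statistics import fmean


def pool(frames, strategy):
    """Aggregate a (time x feature) list of lists into one feature vector."""
    if not frames:
        raise ValueError("need at least one time step")
    columns = list(zip(*frames))  # transpose: one tuple per feature dimension
    if strategy == "max":
        return [max(col) for col in columns]
    if strategy == "average":
        return [fmean(col) for col in columns]
    if strategy == "last":
        return list(frames[-1])
    if strategy == "first":
        return list(frames[0])
    raise ValueError(f"unknown strategy: {strategy}")


# Toy activations (hypothetical): 3 time steps, 2 feature dimensions.
frames = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
pooled = {s: pool(frames, s) for s in ("max", "average", "last", "first")}
```

Whichever strategy is chosen, each clip is reduced to a single vector in the feature space, so the divergence only ever sees one point per clip.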
2.3 Calculating Divergences
We study a variety of divergences for parameterizing an evaluation metric. The key distinctions between these metrics involve how they estimate the reference and generated distributions within the chosen embedding space.
FAD [9] uses the assumption that these distributions are Gaussian and calculates the Fréchet distance between them. MMD [23] (equivalently referred to as KD or KAD [15]) constructs a kernel density estimator for these distributions. The Precision/Recall/Density/Coverage (or PRDC) metrics [19] use k-NN estimators: Precision/Density/Coverage use k-NN to estimate the support of the reference distribution $p$ and measure whether the generated samples are supported by $p$; Recall conversely uses k-NN to estimate the support of the generative distribution $q$ and measure whether the reference set is supported by $q$. MAUVE [24] uses k-means to form discrete histogram estimates of $p$ and $q$, and approximates the divergence curve $C(p, q) = \{(\exp(-c\,\mathrm{KL}(q \mid r_\lambda)), \exp(-c\,\mathrm{KL}(p \mid r_\lambda)))\}$ for $r_\lambda = \lambda p + (1 - \lambda) q$, $\lambda \in (0, 1)$. MAUVE is defined as the area under the approximated divergence curve.
As MAUVE is a score bounded on $[0, 1]$ where higher is better, for consistency with other divergence metrics such as FAD we redefine it as $-\ln(\mathrm{MAUVE})$, ranging over $[0, \infty)$, where lower is better. Among the metrics we study, MAUVE and Recall are the only metrics that explicitly estimate the generated distribution $q$ in the embedding feature space (as MMD only does this implicitly in the corresponding RKHS) with more expressivity than fitting a single Gaussian like FAD.
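To make the divergence curve concrete, the sketch below computes the area under $C(p, q)$ for two histograms that are assumed to come from an earlier k-means quantization step (not shown), and then applies the $-\ln$ transform. It is a simplified illustration, not the reference MAUVE implementation: the scaling constant `c = 5.0`, the uniform grid over $\lambda$, the two end points closing the curve, and the function names are all assumptions.

```python
import math


def kl(a, b):
    """KL(a | b) for discrete histograms; terms with a_i = 0 contribute 0."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


def mauve_from_histograms(p, q, c=5.0, num_lambdas=99):
    """Approximate area under the divergence curve C(p, q)."""
    points = [(0.0, 1.0), (1.0, 0.0)]  # assumed end points closing the curve
    for k in range(1, num_lambdas + 1):
        lam = k / (num_lambdas + 1)  # lambda strictly inside (0, 1)
        r = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
        points.append((math.exp(-c * kl(q, r)), math.exp(-c * kl(p, r))))
    points.sort(key=lambda xy: (xy[0], -xy[1]))
    # Trapezoidal rule over the points sorted by their first coordinate.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def mad_score(mauve):
    """Redefine MAUVE as a divergence: lower is better, 0 when p equals q."""
    return -math.log(mauve)


# Hypothetical cluster histograms over three k-means clusters.
reference = [0.5, 0.3, 0.2]
close_model = [0.4, 0.35, 0.25]
distant_model = [0.05, 0.05, 0.9]
close_mad = mad_score(mauve_from_histograms(reference, close_model))
distant_mad = mad_score(mauve_from_histograms(reference, distant_model))
```

Because the mixture $r_\lambda$ puts mass wherever either histogram does, both KL terms stay finite for every interior $\lambda$, even when one histogram has empty clusters.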
3. SYNTHETIC DATA META-EVALUATION
To explore the design space of reference-based music metrics, we first construct a set of four meta-evaluations designed to specifically disentangle different desiderata in TTM systems. A high quality evaluation metric intuitively should be sensitive to human interpretable degradations of our target data. Formally, if we have some ordered set of model distributions $\{q_1, q_2, \ldots, q_K\}$ in decreasing order of human perceptual quality ($q_1$ best, $q_K$ worst), then a good divergence metric $M$ should induce the following behavior: $M(q_i) < M(q_j), \forall i < j$. Thus, our synthetic meta-evaluation seeks to assess how well a given metric follows this behavior across different sets of distributions $\{q_i\}$ that codify particular desiderata.
Past works look at the sensitivity of metrics to degradations in audio fidelity by distorting with noise [9, 12]. In addition to measuring sensitivity to fidelity, here we explore three additional desiderata: musicality, context, and diversity. For each desideratum, we propose a pattern that interpretably degrades music along that axis alone. In this way, we test whether a given metric actually captures specific forms of musical degradation, which allows us to measure the sensitivity of each metric to changes in distortion strength as well as embedding backbone. Our goal is not to comprehensively capture all possible distortions a TTM model could exhibit, but instead to capture a few common sense ones to enable more objective comparison of metrics.
For each desideratum, we generate K = 11 sets of increasingly degraded audio, where each set contains 5,000 clips of 30 seconds in length. For each metric M, we measure the rank correlation (Kendall’s τ) between the ordering induced by the metric and the ground truth ordering 1, ..., K. We measure rank correlations for two reasons: (1) direct correlation metrics like Pearson’s assume linearity (which cannot be assumed for any given divergence), and (2) modern TTM divergence metrics exist primarily for researchers to compare the relative performance of generative systems (i.e., “horse-racing”); the absolute differences between models for a given divergence are generally poorly understood and not particularly meaningful. We outline each desideratum in the following subsection; see our supplement for additional details on each task.
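The meta-evaluation score for one metric on one desideratum can be sketched as follows. This is an illustrative plain-Python version of Kendall's τ (the tau-a variant, which does not correct for ties), and the two score lists are hypothetical rather than results from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    total = 0
    for (x0, y0), (x1, y1) in combinations(zip(xs, ys), 2):
        prod = (x1 - x0) * (y1 - y0)
        total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    n = len(xs)
    return total / (n * (n - 1) / 2)


levels = list(range(1, 12))  # ground-truth ordering 1, ..., K with K = 11

# Hypothetical divergence scores for q_1, ..., q_11.
monotone_scores = [level ** 2 for level in levels]       # nonlinear but ordered
one_swap_scores = [2, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11]    # one adjacent inversion

tau_monotone = kendall_tau(levels, monotone_scores)
tau_one_swap = kendall_tau(levels, one_swap_scores)
```

The first example shows why a rank correlation is used: a metric that grows nonlinearly with the degradation level still receives a perfect score as long as the ordering is right.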
3.1 Codifying desiderata via synthetic degradations
Fidelity. Fidelity is the simplest form of degradation and has been somewhat studied previously on Fréchet-based metrics [9, 12]. Here, we start with the FMA-Pop dataset, and gradually distort the audio fidelity of each sample. Specifically, we add isotropic Gaussian noise to each audio file, with increasing standard deviation denoting greater distortion (the noise added to $q_i$ has a standard deviation of $0.2 \cdot \frac{i-1}{10}$). Notably, this is the only category of degradation that is considered in previous work.
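The noise schedule can be written out directly. The sketch below is illustrative only: it treats a waveform as a plain list of floats, uses a placeholder silent clip, and does not clip or renormalize the noisy signal, since the text does not say whether the authors do.

```python
import random


def noise_std(i, max_std=0.2, num_sets=11):
    """Standard deviation for degradation level i in 1..num_sets."""
    return max_std * (i - 1) / (num_sets - 1)


def add_gaussian_noise(samples, std, rng):
    """Add independent zero-mean Gaussian noise to every audio sample."""
    return [s + rng.gauss(0.0, std) for s in samples]


# Levels q_1 ... q_11: std goes from 0.0 (clean) to 0.2 in equal steps.
schedule = [noise_std(i) for i in range(1, 12)]

rng = random.Random(0)
clean = [0.0] * 1000  # placeholder waveform (assumption), not real audio
degraded = add_gaussian_noise(clean, schedule[-1], rng)
```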
Musicality. Absent from previous acoustic music evaluations is any way to measure the “musicality” of a TTM model’s outputs. While this is inherently difficult to measure directly, we codify (Western) musicality using the notion that perceived musicality is correlated with features of the symbolic representation of music. Specifically, we posit that introducing random perturbations both rhythmically (i.e. slight changes in note timing) and harmonically (i.e. pitch changes) can contribute to degradation of musicality. We perturb subsets of notes by [−6, 6] semitones in pitch and [−0.2, 0.2] seconds in onset and offset. We use a subset of the Lakh MIDI dataset [25], perturbing the note timings and pitches with increasing probability of perturbation from 0 to 0.5 in steps of 0.05. We then render these perturbed MIDI sequences into 44.1 kHz audio using Fluidsynth [26]. In this way, we can directly measure how evaluation metrics treat changes to the musical structure independently of audio fidelity distortions.
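A note-level perturbation of this kind might look like the sketch below. Several details are assumptions, because the text gives only the ranges: shifts are drawn uniformly, pitch, onset and offset are perturbed together for a selected note, and values are clamped to stay valid. The `Note` class is a hypothetical stand-in for a MIDI note, and rendering to audio is not shown.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    pitch: int      # MIDI pitch, 0..127
    onset: float    # seconds
    offset: float   # seconds


def perturb(notes, prob, rng):
    """Perturb each note independently with probability `prob`."""
    out = []
    for note in notes:
        if rng.random() < prob:
            pitch = min(127, max(0, note.pitch + rng.randint(-6, 6)))
            onset = max(0.0, note.onset + rng.uniform(-0.2, 0.2))
            offset = max(onset, note.offset + rng.uniform(-0.2, 0.2))
            note = replace(note, pitch=pitch, onset=onset, offset=offset)
        out.append(note)
    return out


# Perturbation probabilities 0, 0.05, ..., 0.5 (11 degradation levels).
probabilities = [round(0.05 * k, 2) for k in range(11)]

melody = [Note(60, 0.0, 0.5), Note(64, 0.5, 1.0), Note(67, 1.0, 1.5)]
degraded_sets = [perturb(melody, p, random.Random(0)) for p in probabilities]
```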
[Figure 2 plot. Four panels, each showing the adjusted score (y-axis, 0.0 to 1.0) for FAD, Recall and MAUVE. Fidelity: x-axis is white noise std. (0.000 to 0.200). Musicality: x-axis is note perturbation prob. (0.0 to 0.5). Context: x-axis is max. context len. (seconds). Diversity: x-axis is num. unique prompts (log scale, 10^0 to 10^3).]
Figure 2: Aggregated results of synthetic meta-evaluation. Metric scores are normalized to [0, 1] and averaged across all embedding models. Shaded areas show standard deviations. While FAD shows large inconsistencies on Musicality and minor inconsistencies on others, Recall and MAUVE have robust performance on all desiderata.
Context. While our definitions of fidelity and musicality effectively capture local structure, they do not fully address how music fundamentally requires long-term coherence. Text-to-music (TTM) systems often introduce unique perceptual artifacts related to temporal inconsistency. To assess whether evaluation metrics detect these temporal coherence issues, we sample from MusicGen-Small [4] while controlling for context length by generating in k-step blocks where k ∈ {1, 2, 3, ..., 10, 15} seconds. For each block, the model only accesses the previous block as context. This allows us to directly manipulate temporal consistency, as shorter-context generations typically lack coherent structure and contain jarring transitions that violate musical expectations.
Diversity. While the preceding tasks all measure different senses of perceptual quality, diversity is equally important for TTM systems and their evaluation metrics [19]. Thus, here we treat the artificial reduction of diversity as the distortion factor. We use a cleaned subset of text prompts from MusicCaps [3] (detailed in Section 4) and prompt MusicGen-Small to generate 5000 segments using a decreasing prompt set size (lower diversity), specifically: {2.5K, 2K, 1K, 500, 200, 100, 50, 25, 10, 5, 1}. At the extreme, all 5k generations are based on the same prompt. In this way, we can assess how different evaluation metrics react to different levels of diversity.
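One way to build the prompt lists for each diversity level is sketched below. The text does not say how the 5,000 generations are distributed over a prompt subset, so cycling evenly through a random subset is an assumption, as are the placeholder prompt strings and the function name.

```python
import random

PROMPT_SET_SIZES = [2500, 2000, 1000, 500, 200, 100, 50, 25, 10, 5, 1]


def prompts_for_level(all_prompts, set_size, num_clips, rng):
    """Pick `set_size` distinct prompts and reuse them to fill `num_clips` slots."""
    subset = rng.sample(all_prompts, set_size)
    return [subset[i % set_size] for i in range(num_clips)]


# Placeholder prompt pool (hypothetical strings standing in for real captions).
all_prompts = [f"prompt {i}" for i in range(2617)]
rng = random.Random(0)
schedule = {size: prompts_for_level(all_prompts, size, 5000, rng)
            for size in PROMPT_SET_SIZES}
```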
3.2 Results
Here we compare a number of different evaluation metrics and embedding backbones (see Section 2). For each task and metric, we search over combinations of layers and pooling methods for each embedding backbone with FMA-Pop as the reference set. In order to assess the performance of each embedding backbone-divergence combination, we report the Kendall-Tau coefficient τ between the automatic scores and ground truth rankings (i.e. whether a given metric captures the decrease in quality as the distortion level increases for each task), which is bounded in [-1, 1].
With this framework, we investigated four high-level empirical questions: (1) How robust are different divergence metrics to changes in embedding backbones? (2) How much does embedding backbone impact metric performance? (3) How robust are such divergence metrics to changes in reference distribution? (4) How efficient are such metrics with respect to the generation set size?
Metric Robustness Across Embeddings. In Table 1 (top), we show the average Kendall τ across the four synthetic sets aggregated across embedding models, each under their best (layer, aggregation) setups. Notably, MAUVE and Recall show the best overall performance, surpassing all other metrics across embedding backbones. In particular, FAD and all other metrics struggle heavily on the musicality task, showing a distinct lack of the ability to evaluate generated music that has consistent audio quality but semantic degradations, and perform poorly on the Context and Diversity tasks. This suggests that our musicality, context, and diversity sets can be a useful tool in meta-evaluating sensitivity to more nuanced differences in music. Interestingly, Recall and Coverage, while designed as diversity measures, can also distinguish quality differences in musicality, fidelity, and context length. Precision and Density exhibit relatively poorer performance in all four aspects despite being designed as fidelity measures. This indicates that Precision and Density may have a low performance ceiling for evaluating music.
Figure 2 shows the scores by each metric against the ground truth level of distortion directly for FAD and our top performing metrics MAUVE and Recall. In particular, we find that MAUVE not only captures the gradual perturbation under each task well, but does so with considerably lower variance across embedding backbones than Recall and FAD do. Additionally, though most metrics are capable of distinguishing differences between varying intensities of Gaussian noise, corroborating previous works [9, 12], MAUVE scores more closely follow an exponential pattern, which is arguably more similar to human perception of additive Gaussian distortion.
Divergence Quality by Embedding Backbone. With MAUVE and Recall established as the more robust metrics, we now shift our attention to how different embedding models perform when applied with these metrics. Results are shown in Table 1 (bottom). We observe that embeddings from the self-supervised models MERT, Jukebox, and MusicGen perform reasonably well when paired with either MAUVE or Recall. VGGish performs notably poorly in evaluating Musicality, with practically random performance as the distortion level increases, since its classification training does not allow it to capture the more nuanced and long-term factors in music quality. CLAP similarly falls behind on Musicality and Context, presumably because of the limited expressiveness allowed by its text-audio training data.
Robustness Across Reference Sets. Since all the divergence metrics studied are reference-based metrics, the choice of reference dataset can be highly important, as the lower-quality but commonly used MusicCaps (where some excerpts are captioned as “low-quality”) results in metric scores less correlated with human preference as opposed to FMA-Pop and MusCC (a small subset of musdb18 [27]) [12]. In order to verify that the present results are robust across reference sets, we recreate our experiments using an internal set of high-quality music of similar size.
Top (divergences, aggregated across embedding models):
Metric | Fidelity | Musicality | Context | Diversity | Average | Avg. (Internal)
Recall | 1.00 ±0.00 | 0.80 ±0.24 | 0.70 ±0.21 | 0.87 ±0.06 | 0.84 ±0.11 | 0.82 ±0.16
MAUVE | 1.00 ±0.00 | 0.73 ±0.30 | 0.81 ±0.16 | 0.59 ±0.13 | 0.78 ±0.07 | 0.86 ±0.07
FAD | 0.91 ±0.22 | 0.44 ±0.57 | 0.92 ±0.13 | 0.61 ±0.20 | 0.72 ±0.14 | 0.71 ±0.20
Coverage | 1.00 ±0.00 | 0.49 ±0.40 | 0.64 ±0.30 | 0.70 ±0.14 | 0.71 ±0.07 | 0.71 ±0.11
MMD | 1.00 ±0.00 | 0.35 ±0.61 | 0.78 ±0.31 | 0.60 ±0.25 | 0.68 ±0.26 | 0.75 ±0.25
Density | 0.95 ±0.12 | 0.01 ±0.73 | −0.12 ±0.80 | 0.31 ±0.29 | 0.29 ±0.17 | 0.55 ±0.35
Precision | 0.88 ±0.23 | 0.12 ±0.43 | −0.23 ±0.67 | 0.39 ±0.28 | 0.29 ±0.26 | 0.49 ±0.25
Bottom (embedding models, aggregated across MAUVE and Recall):
Embedding | Fidelity | Musicality | Context | Diversity | Average | Avg. (Internal)
MusicGen-M | 1.00 ±0.00 | 0.98 ±0.03 | 0.85 ±0.00 | 0.76 ±0.22 | 0.90 ±0.06 | 0.92 ±0.11
MusicGen-S | 1.00 ±0.00 | 0.93 ±0.05 | 0.76 ±0.23 | 0.70 ±0.24 | 0.85 ±0.02 | 0.93 ±0.14
MERT | 1.00 ±0.00 | 0.75 ±0.21 | 0.89 ±0.05 | 0.76 ±0.22 | 0.85 ±0.02 | 0.91 ±0.22
Jukebox | 1.00 ±0.00 | 0.89 ±0.05 | 0.82 ±0.00 | 0.62 ±0.29 | 0.83 ±0.08 | 0.83 ±0.25
VGGish | 1.00 ±0.00 | 0.53 ±0.51 | 0.78 ±0.21 | 0.87 ±0.06 | 0.79 ±0.09 | 0.74 ±0.33
CLAP | 1.00 ±0.00 | 0.51 ±0.13 | 0.42 ±0.10 | 0.67 ±0.15 | 0.65 ±0.02 | 0.69 ±0.18
Table 1: Average Kendall τ rank correlation (higher is better) and standard deviation between different evaluation setups and our synthetic meta-evaluation set. Top: Aggregated correlation across embedding models, each under their best (layer, aggregation) setup. Bottom: Aggregated correlation across the more robust MAUVE and Recall metrics. All metrics are calculated against FMA-Pop except for the last column, which uses our internal set of high-quality licensed music.
Results are shown in the rightmost column in Table 1. FMA-Pop performs comparably to our set of internal music as a reference set, with the internal music leading to an overall slightly better performance. The trends we observe with FMA-Pop still hold: MAUVE and Recall are consistently higher-performing, and embeddings from self-supervised models consistently lead to better performance. This presents an important point for open research in the TTM space: while high-quality reference sets may seem optimal, FMA-Pop proves to be reasonably comparable as a reference distribution to perform generative evaluation.
Efficiency. The compute time of reference-based TTM evaluation is dominated by the time it takes to generate and embed each audio sample, as opposed to the time required to measure divergence (despite previous concerns). Accordingly, sample efficiency is the strongest determining factor of overall computational efficiency (as opposed to asymptotic behavior of the metric [15, 23]). To examine how robust each metric is with respect to sample size, we measure rank correlation on our synthetic meta-evaluation data for FAD with VGGish and CLAP embeddings, MMD with MERT embeddings, and MAUVE with MERT embeddings across exponentially decreasing generated sample sizes of {5000, 2500, 1250, 625}. We find that FAD and MAUVE are comparably robust to reduction in the sample size (similar correlation at N = 5000 and N = 625 samples), while MAUVE shows stronger overall correlations at all sample sizes. See Figure 5 in the supplementary material for more details.
3.3 MAD: MAUVE Audio Divergence
These insights from our synthetic meta-evaluation motivate a new TTM evaluation metric: MAUVE Audio Divergence (MAD). Specifically, MAD utilizes MERT to extract embeddings, which are then used to calculate MAUVE in its embedding space. Out of the stronger performing metrics MAUVE and Recall, we choose MAUVE as we observe lower variance in scores across embedding backbones, suggesting robust tolerance to changes in backbones. Among self-supervised models which perform well across MAUVE and Recall (MusicGen and MERT), we choose MERT to avoid the counterintuitive scenario of evaluating generative models with other generative models. MAD is able to capture a wide range of musical perturbations, far surpassing the current standard usage of FAD and discriminative backbones. While we recommend that practitioners use a reference set that best codifies the goals of their TTM system, we offer FMA-Pop as a default, given that it correlates well with MAD scores on our internal reference set. In our synthetic meta-evaluation, MAD (0.84 average τ) significantly outperforms the standard FAD with VGGish embeddings (0.49 average τ). Both MERT and MAUVE notably contribute to the improved performance: ablating MERT (MAUVE + VGGish) results in an average τ of 0.73 while ablating MAUVE (FAD + MERT) leads to an average τ of 0.72.
4. MEASURING HUMAN PREFERENCE
ALIGNMENT
Collecting MusicPrefs. We collected a large dataset of human preferences on music generated by state-of-the-art open weights TTM models. Our approach involved two key steps: (1) generating music samples from 7 representative models (Stable Audio Open [5], MusicGen small/medium/large [4], AudioLDM2 [28], MusicLDM [29], and Riffusion 1 [2]) using 2,617 instrumental-only text prompts derived from MusicCaps, and (2) collecting pairwise human preferences via Amazon Mechanical Turk. For each prompt, we generated 10 outputs per model using distinct random seeds, resulting in 183k total audio clips. We then collected preferences on 2,520 output pairs (120 preferences per unique model pair), asking annotators to independently judge fidelity and musicality without showing them the original prompts. Annotators could declare ties, which we discarded from analysis (25% for fidelity, 19% for musicality).
Column groups: Human (Overall, Fidelity, Musicality); Automatic divergence (FAD, FAD-CLAP, MAD); Control (CLAP). Ranks are given in parentheses.
System | Overall ↑ | Fidelity ↑ | Musicality ↑ | FAD ↓ | FAD-CLAP ↓ | MAD ↓ | CLAP ↑
MusicGen-L | 24.24 (1) | 22.96 (1) | 25.42 (1) | 5.649 (4) | 3.904 (2) | 2.744 (2) | 0.356 (3)
MusicGen-M | 17.28 (2) | 17.35 (2) | 17.08 (2) | 5.802 (5) | 3.940 (5) | 3.504 (3) | 0.337 (5)
MusicGen-S | 14.11 (3) | 14.67 (3) | 13.43 (4) | 6.032 (6) | 3.987 (6) | 3.928 (4) | 0.314 (6)
MusicLDM | 12.17 (4) | 10.45 (6) | 14.02 (3) | 5.538 (1) | 3.916 (4) | 4.713 (5) | 0.411 (1)
SAO | 11.41 (5) | 13.95 (4) | 9.19 (6) | 5.547 (2) | 3.883 (1) | 1.970 (1) | 0.356 (4)
AudioLDM2 | 10.83 (6) | 9.87 (7) | 11.74 (5) | 5.632 (3) | 3.913 (3) | 5.321 (6) | 0.378 (2)
Riffusion 1 | 9.97 (7) | 10.75 (5) | 9.12 (7) | 7.994 (7) | 4.179 (7) | 5.477 (7) | 0.185 (7)
τ (p val.) | 1.00 (0.00) | 0.71 (0.03) | 0.81 (0.01) | 0.14 (0.77) | 0.14 (0.77) | 0.62 (0.07) | 0.10 (0.76)
Table 2: Comparison of human preferences from MusicPrefs to automatic metrics for music generation models, including MAD (proposed). Human preferences are Bradley-Terry scores. We solicit musicality and fidelity preferences separately, referring to their union as “Overall”. We report standard automatic metrics: FAD using VGGish [9] and CLAP [12] embeddings, our proposed MAD which measures MAUVE [18] on MERT [21] embeddings, and CLAP score [14] which measures an orthogonal axis of adherence to text control (we do not expect it to correlate). For each metric, we induce a ranking and compute the Kendall τ rank correlation relative to overall human preferences. We find that MAD yields stronger correlation (τ = 0.62, p = 0.07) with human preferences relative to existing metrics.
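Table 2 reports human preferences as Bradley-Terry scores. As a rough illustration of how pairwise wins (with ties already discarded) can be turned into such scores, the sketch below fits a Bradley-Terry model with the standard minorization-maximization update. It is not the authors' code: the win counts are made up, the scores are normalized to sum to 1 (the scale used in Table 2 is not stated in the text), and every system is assumed to have at least one comparison.

```python
def bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths; wins[i][j] = times system i beat system j."""
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])  # diagonal entries are assumed to be 0
            denom = sum((wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Hypothetical win counts for three systems, 10 comparisons per pair.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
scores = bradley_terry(wins)
```

With only two systems the fit reduces to the empirical win rate, which gives a quick sanity check: 3 wins against 1 yields strengths of 0.75 and 0.25.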
Measuring Alignment. We assess how well our analyzed metrics align with human preferences. In particular, we focus on how the ordering induced by a given divergence metric matches the human rankings from MusicPrefs. We focus on this ranking behavior as the ability to consistently determine the relative performance of models (i.e., “horse-racing”) is integral to modern TTM research in assessing commensurate gains from paper to paper. We compare MAD to the baseline FAD (with both VGGish and CLAP backbones), as well as the induced ranking from the reference-free CLAP-Score metric [14, 30], which measures control adherence to the underlying text conditions and should be independent of human ranking.
Table 2 shows the overall induced rankings by each metric sorted according to the overall human ranking and Kendall’s τ coefficients relative to the overall rankings. Human rankings are calculated through their Bradley-Terry scores, which are linearly equivalent to Elo score and thus more accurately estimate the overall strength of each model than assessing raw win rate [7, 31, 32]. MAD shows a stronger correlation with human rankings than other metrics, nearly exactly matching the human preferences with the exception of a strong preference towards Stable Audio Open. Note that we do not expect CLAP score to correlate with MusicPrefs as annotators were not shown prompts; we include it to provide evidence that MusicPrefs measures desiderata orthogonal to control as intended.
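The τ values in the last row of Table 2 can be reproduced from the ranks printed in parentheses. The sketch below lists each metric's ranks in the overall human order (MusicGen-L first, Riffusion 1 last) and uses the fact that, for a tie-free ranking of n items, Kendall's τ equals 1 − 4·(number of inversions) / (n(n − 1)). The helper name is an assumption; the CLAP column is left out because it contains a tied score.

```python
def tau_from_ranks(ranks):
    """Kendall tau between 1..n and a tie-free rank list, via inversion count."""
    n = len(ranks)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if ranks[i] > ranks[j])
    return 1 - 4 * inversions / (n * (n - 1))


# Ranks from Table 2, listed in overall human order (rank 1 to rank 7).
table2_ranks = {
    "FAD": [4, 5, 6, 1, 2, 3, 7],
    "FAD-CLAP": [2, 5, 6, 4, 1, 3, 7],
    "MAD": [2, 3, 4, 5, 1, 6, 7],
}
taus = {name: round(tau_from_ranks(r), 2) for name, r in table2_ranks.items()}
# taus == {"FAD": 0.14, "FAD-CLAP": 0.14, "MAD": 0.62}
```

MAD's ranking contains only four inversions, all caused by Stable Audio Open being ranked first by MAD but fifth by humans, which matches the exception noted in the text.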
5. RELATED WORK
Generative music systems, and in particular audio-domain systems, have seen a renaissance in recent years driven by the wide methodological explosion in generative models, owing core advances to insights from language models [1, 4] and diffusion models [2, 5, 28, 29]. Despite similarities with the text and image domains, the space of work on generative evaluation is much less developed for TTM. While Kilgour et al. [9] and Gui et al. [12] have attempted to assess the quality of evaluation metrics in TTM systems (leading to the adoption of better FAD backbones [33–35]), these only considered Fréchet Distance as a metric, and only considered fidelity distortions in their analyses. While some TTM works have included additional metrics outside FAD and CLAP score [5, 6, 29, 36, 37], such works purely rely on the assumption that insights from the image modality [19, 23] would transfer to TTM, with no empirical verification. Vinay and Lerch [10] and Chung et al. [15] are similar, with the former focusing on benchmarking metrics for older audio synthesis models (with no strong correlation found with human perception), and the latter exploring the MMD metric used in earlier TTM works [6, 36, 38] for foley generation. Concurrently, two recent works have collected human ratings and preference data on synthetic music clips from TTM systems [16, 17]. Liu et al. [16] collect Mean Opinion Scores for each system based on overall music impression and text alignment; here we collect pairwise preferences. Grötschla et al. [17] also collect pairwise preferences for TTM, though they do not explore alignment with automatic metrics.
6. CONCLUSION
We propose MAD: a new evaluation metric for automatic evaluation of TTM models. MAD is derived from a systematic meta-evaluation that analyzes sensitivity of evaluation hyperparameters to various music generation desiderata. In terms of robustness, we find that MAUVE outperforms previously studied divergences like Fréchet Distance, and self-supervised embedding models like MERT outperform discriminative ones. We collect and release MusicPrefs, an open dataset of pairwise human TTM preferences, and use it to demonstrate that MAD strongly correlates with human preferences. While we do not recommend replacing human evaluations with MAD, automatic evaluations can provide a powerful signal for competition and hill-climbing on model performance. We hope MAD can provide such a signal for music generation research.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
7. ACKNOWLEDGEMENTS
This work was supported by funding from Sony AI.
8. REFERENCES
[1] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford,
and I. Sutskever, “Jukebox: A generative model for
music,” arXiv preprint arXiv:2005.00341, 2020.
[2] S. Forsgren and H. Martiros, “Riffusion - Stable
diffusion for real-time music generation,” 2022.
[Online]. Available: https://riffusion.com/about
[3] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel,
M. Verzetti, A. Caillon, Q. Huang, A. Jansen,
A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour,
and C. Frank, “Musiclm: Generating music from text,”
2023. [Online]. Available: https://arxiv.org/abs/2301.11325
[4] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant,
G. Synnaeve, Y. Adi, and A. Defossez, “Simple and
controllable music generation,” in Advances in Neural
Information Processing Systems, A. Oh, T. Naumann,
A. Globerson, K. Saenko, M. Hardt, and S. Levine,
Eds., vol. 36. Curran Associates, Inc., 2023, pp.
47 704–47 720.
[5] Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor,
and J. Pons, “Stable audio open,” 2024. [Online].
Available: https://arxiv.org/abs/2407.14358
[6] Z. Novack, G. Zhu, J. Casebeer, J. McAuley,
T. Berg-Kirkpatrick, and N. J. Bryan, “Presto! distilling steps
and layers for accelerating music generation,” in
International Conference on Learning Representations
(ICLR), 2025.
[7] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos,
T. Li, D. Li, B. Zhu, H. Zhang, M. I.
Jordan, J. E. Gonzalez, and I. Stoica, “Chatbot arena:
An open platform for evaluating llms by human
preference,” in ICML, 2024. [Online]. Available:
https://openreview.net/forum?id=3MW8GKNyzI
[8] W.-C. Huang, S.-W. Fu, E. Cooper, R. Zezario,
T. Toda, H.-M. Wang, J. Yamagishi, and Y. Tsao,
“The voicemos challenge 2024: Beyond speech
quality prediction,” 2024 IEEE Spoken Language
Technology Workshop, 12 2024. [Online]. Available:
https://arxiv.org/abs/2409.07001
[9] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi,
“Fréchet audio distance: A reference-free metric
for evaluating music enhancement algorithms,” in
Interspeech, 2019. [Online]. Available:
https://api.semanticscholar.org/CorpusID:202725406
[10] A. Vinay and A. Lerch, “Evaluating generative audio
systems and their metrics,” in Proceedings of the 23rd
International Society for Music Information Retrieval
Conference (ISMIR 2022), 2022.
[11] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke,
A. Jansen, R. C. Moore, M. Plakal, D. Platt,
R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss,
and K. Wilson, “Cnn architectures for large-scale audio
classification,” in 2017 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP),
2017, pp. 131–135.
[12] A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou,
“Adapting frechet audio distance for generative music
evaluation,” in Proc. IEEE ICASSP 2024, 2024.
[Online]. Available: https://arxiv.org/abs/2311.01616
[13] J. Retkowski, J. Stepniak, and M. Modrzejewski,
“Frechet music distance: A metric for generative
symbolic music evaluation,” 2025. [Online]. Available:
https://arxiv.org/abs/2412.07948
[14] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick,
and S. Dubnov, “Large-scale contrastive
language-audio pretraining with feature fusion and
keyword-to-caption augmentation,” in IEEE International
Conference on Acoustics, Speech and Signal Processing,
ICASSP, 2023.
[15] Y. Chung, P. Eu, J. Lee, K. Choi, J. Nam, and B. S.
Chon, “Kad: No more fad! an effective and efficient
evaluation metric for audio generation,” arXiv preprint
arXiv:2502.15602, 2025.
[16] C. Liu, H. Wang, J. Zhao, S. Zhao, H. Bu, X. Xu,
J. Zhou, H. Sun, and Y. Qin, “Musiceval: A generative
music dataset with expert ratings for automatic
text-to-music evaluation,” in ICASSP 2025 - 2025 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2025, pp. 1–5.
[17] F. Grötschla, A. Solak, L. A. Lanzendörfer, and
R. Wattenhofer, “Benchmarking music generation models and
metrics via human preference studies,” in ICASSP
2025 - 2025 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2025,
pp. 1–5.
[18] K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun,
S. Welleck, Y. Choi, and Z. Harchaoui, “MAUVE:
Measuring the gap between neural text and human
text using divergence frontiers,” in Advances in Neural
Information Processing Systems, A. Beygelzimer,
Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021.
[Online]. Available: https://openreview.net/forum?id=Tqx7nJp7PR
[19] M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo,
“Reliable fidelity and diversity metrics for generative
models,” in International Conference on Machine
Learning. PMLR, 2020, pp. 7176–7185.
[20] M. Defferrard, K. Benzi, P. Vandergheynst, and
X. Bresson, “FMA: A dataset for music analysis,” in
18th International Society for Music Information
Retrieval Conference (ISMIR), 2017. [Online]. Available:
https://arxiv.org/abs/1612.01840
[21] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin,
C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg,
R. Liu, W. Chen, G. Xia, Y. Shi, W. Huang,
Y. Guo, and J. Fu, “Mert: Acoustic music understanding
model with large-scale self-supervised training,”
2023.
[22] R. Castellon, C. Donahue, and P. Liang, “Codified
audio language modeling learns useful representations
for music information retrieval,” arXiv preprint
arXiv:2107.05677, 2021.
[23] S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner,
A. Chakrabarti, and S. Kumar, “Rethinking fid:
Towards a better evaluation metric for image generation,”
in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2024, pp.
9307–9315.
[24] K. Pillutla, L. Liu, J. Thickstun, S. Welleck,
S. Swayamdipta, R. Zellers, S. Oh, Y. Choi, and
Z. Harchaoui, “Mauve scores for generative models: theory
and practice,” J. Mach. Learn. Res., vol. 24, no. 1, Mar.
2024.
[25] C. Raffel, “Learning-based methods for comparing
sequences, with applications to audio-to-midi
alignment and matching,” 2016. [Online]. Available:
https://api.semanticscholar.org/CorpusID:63439223
[26] FluidSynth, “Fluidsynth: Software real-time synthesizer
based on the soundfont 2 specification,”
https://www.fluidsynth.org, 2024, accessed: 2024-11-20.
[27] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis,
and R. Bittner, “The MUSDB18 corpus for music
separation,” Dec. 2017. [Online]. Available:
https://doi.org/10.5281/zenodo.1117372
[28] H. Liu, Y. Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian,
Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley,
“Audioldm 2: Learning holistic audio generation with
self-supervised pretraining,” IEEE/ACM Transactions
on Audio, Speech, and Language Processing, vol. 32,
pp. 2871–2883, 2024.
[29] K. Chen, Y. Wu, H. Liu, M. Nezhurina,
T. Berg-Kirkpatrick, and S. Dubnov, “Musicldm: Enhancing
novelty in text-to-music generation using
beat-synchronous mixup strategies,” in ICASSP 2024 - 2024
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2024, pp. 1206–1210.
[30] R. Huang, J. Huang, D. Yang, Y. Ren, L. Liu, M. Li,
Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-an-audio:
Text-to-audio generation with prompt-enhanced diffusion
models,” arXiv preprint arXiv:2301.12661, 2023.
[31] R. A. Bradley and M. E. Terry, “Rank analysis of
incomplete block designs: I. the method of paired
comparisons,” Biometrika, vol. 39, p. 324, 1952.
[Online]. Available:
https://api.semanticscholar.org/CorpusID:125209808
[32] H. White, “Maximum likelihood estimation of
misspecified models,” Econometrica, vol. 50,
no. 1, pp. 1–25, 1982. [Online]. Available:
http://www.jstor.org/stable/1912526
[33] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and
N. J. Bryan, “DITTO: Diffusion inference-time
t-optimization for music generation,” in International
Conference on Machine Learning (ICML), 2024.
[34] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and
N. J. Bryan, “DITTO-2: Distilled diffusion inference-time
t-optimization for music generation,” in International
Society for Music Information Retrieval (ISMIR), 2024.
[35] R. Ciranni, E. Postolache, G. Mariani, M. Mancusi,
L. Cosmo, and E. Rodolà, “Cocola: Coherence-oriented
contrastive learning of musical audio
representations,” ArXiv, vol. abs/2404.16969, 2024.
[Online]. Available:
https://api.semanticscholar.org/CorpusID:269430865
[36] J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and
S. Lattner, “Diff-a-riff: Musical accompaniment
co-creation via latent diffusion models,” arXiv preprint
arXiv:2406.08384, 2024.
[37] K. Saito, D. Kim, T. Shibuya, C.-H. Lai,
Z. Zhong, Y. Takida, and Y. Mitsufuji, “SoundCTM:
Unifying score-based and consistency models
for full-band text-to-sound generation,” in
The Thirteenth International Conference on Learning
Representations, 2025. [Online]. Available:
https://openreview.net/forum?id=K K6zXbj O
[38] J. Nistal, M. Pasini, and S. Lattner, “Improving
musical accompaniment co-creation via diffusion
transformers,” arXiv preprint arXiv:2410.23005, 2024.