
Estimating Musical Surprisal From Audio in Autoregressive Diffusion Model Noise Spaces

Author: Mathias Rose Bjare; Stefan Lattner; Gerhard Widmer
Publisher: Zenodo
DOI: 10.5281/zenodo.17706559
Source: https://zenodo.org/records/17706559/files/000079.pdf
Mathias Rose Bjare¹, Stefan Lattner², Gerhard Widmer¹,³
¹ Institute of Computational Perception, Johannes Kepler University Linz, Austria
² Sony Computer Science Laboratories (CSL), Paris, France
³ LIT AI Lab, Linz Institute of Technology, Austria
ABSTRACT
Recently, the information content (IC) of predictions from a Generative Infinite-Vocabulary Transformer (GIVT) has been used to model musical expectancy and surprisal in audio. We investigate the effectiveness of such modelling using IC calculated with autoregressive diffusion models (ADMs). We empirically show that IC estimates of models based on two different diffusion ordinary differential equations (ODEs) describe diverse data better, in terms of negative log-likelihood, than a GIVT. We evaluate diffusion model IC's effectiveness in capturing surprisal aspects by examining two tasks: (1) capturing monophonic pitch surprisal, and (2) detecting segment boundaries in multi-track audio. In both tasks, the diffusion models match or exceed the performance of a GIVT. We hypothesize that the surprisal estimated at different diffusion process noise levels corresponds to the surprisal of music and audio features present at different audio granularities. Testing our hypothesis, we find that, for appropriate noise levels, the studied musical surprisal tasks' results improve. Code is provided on github.com/SonyCSLParis/audioic.
1. INTRODUCTION
Surprisal, estimated via information content (IC) or negative log-likelihood (NLL) of an autoregressive model, has been proposed as a proxy estimator of perceived musical surprise as experienced by human listeners [1–4]. With suitable models, the IC of musical events correlates with human surprise perception and complexity, including tonal and rhythmic aspects [5, 6]. Music analysis with IC enables quantitative information-theoretic hypotheses about music and music perception [7]. Furthermore, IC can serve as a conditioning signal for generative models [8–10]. Recently, [11] showed that surprisal modeling using IC calculated in the continuous audio latent space of Music2Latent [12] effectively models musical complexity, repetition reduction, and prediction of electroencephalogram (EEG) brain responses in human listeners. In [11], a GIVT model [13] is used that calculates IC using the likelihood of one-step predictions that are assumed to follow Gaussian mixture model (GMM) distributions with uncorrelated dimensions. However, this assumption may limit such models' effectiveness, given the nature of highly compressed continuous autoencoder representations, like Music2Latent.
© Mathias Rose Bjare, Stefan Lattner, and Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Mathias Rose Bjare, Stefan Lattner, and Gerhard Widmer, “Estimating Musical Surprisal from Audio in Autoregressive Diffusion Model Noise Spaces”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Diffusion models have become powerful tools in generative AI, achieving state-of-the-art results in multiple domains, including music. A key advantage is that they do not rely on strong assumptions about how the data is distributed. Recent work [14] shows that by formulating diffusion processes as ODEs, a diffusion model can not only generate new samples but also estimate how likely (or “probable”) any given data point is under the model. Remarkably, this likelihood estimate can be computed at different stages of the diffusion process, which correspond to varying levels of “noise” or abstraction in the data.
In this paper, we study the ability of diffusion models to estimate musical surprisal. We restrict our investigations to ADMs [15, 16], since musical surprisal is causal in time. We experiment with diffusion models trained on multi-track audio following the popular EDM [17] and the Rectified Flow [18] processes, which differ in how they noise details. We show that these models describe diverse music data better in terms of NLL than GIVT, consistent with audio fidelity results in generation tasks [16]. We evaluate the estimated IC's effectiveness in modelling aspects of musical surprisal in audio on two tasks that have been studied in the monophonic symbolic domain: capturing monophonic pitch surprisal, related to tonality understanding, and segment-boundary detection on multi-track audio, related to the information changes in music. We show that surprisal estimated using diffusion models captures pitch surprisal better than the GIVT model. Furthermore, we demonstrate that peaks in the surprisal function align with segment boundaries; however, additional peaks are found. Finally, we hypothesize that IC estimated at certain diffusion process noise levels can preserve the surprisal of higher-level audio features like pitch, while filtering out contributions to the IC of low-level features like timbre nuances. We support our hypothesis by showing that, for appropriate noise levels, the results of the studied musical surprisal tasks improve.
2. RELATED WORK
In the symbolic music domain, musical surprisal proxied by IC has most notably been studied with the variable-order Markov model IDyOM [2]. IDyOM modeling of human melodic expectation has been validated by numerous behavioral and neural studies [3, 19–23]. The model is, however, limited to monophonic symbolic music stimuli. In [10], the authors propose an IC-based technique for estimating surprisal in polyphonic symbolic music and show the IC to correlate with tonal and rhythmic complexity using solo piano performances.
In the audio domain, surprisal estimation typically relies on human-selected audio features. In [24], the IC of a D-REX [25, 26] model, calculated using Bayesian inference, is related to the magnetoencephalography (MEG) brain response of human participants. The Audio Oracle [27] analyzes surprisal using information rate calculated from self-similarities of audio features and identifies high surprisal at segment boundaries.
Surprisal estimation using symbolic music or audio features faces two issues: the investigation is based on limited human-selected attributes, and there is a mismatch between what the computational model sees and what a human listener hears. Both issues potentially bias the investigation.
Therefore, most similar to our approach, [11] estimates surprisal in an audio representation that preserves all features of the original audio. The authors estimate IC using the likelihood of a GIVT model [13] and show that it can predict EEG responses to sung music. However, in contrast to ADMs, the GIVT model assumes that the next-step predictions follow a particular distribution, which may limit its predictive effectiveness.
Although unrelated to temporal surprisal, [28] uses the KL-divergence of a diffusion model to approximate the likelihood of 5-second monophonic music clips. This is used to reproduce the inverted U-shape relation between the total IC of music clips and listener preference presented in [7]. The model does not rely on audio features; however, it ignores causality and memory aspects of surprisal. In addition to IC, other measures of information have been proposed for the computational study of musical surprisal and expectation [29, 30]. These, however, are impractical to calculate for continuous autoregressive models and have seen only limited perceptual validation in the literature.
3. METHOD
3.1 Information Content Modelling
Estimation of causal IC in a discrete (symbolic music) domain can be achieved effectively with GPT-style one-step prediction modelling. In this case, IC is calculated from the prediction target's log-likelihood according to an explicit (softmax) probability mass function with logits from a multi-layer perceptron (MLP) that takes as inputs a context state summarized by a causal Transformer model [31]. As a result, the IC measures the likelihood of specific (musical) events. In contrast, we aim to estimate the IC of continuous audio embeddings, using the compressed representations of the Music2Latent autoencoder [12]. In this continuous case, the probability mass cannot be modelled. Consequently, in [11], a GIVT model [13] is used, which models the probability density of next-step predictions explicitly using a GMM with parameters from an MLP that takes as input a context state, summarized by a causal Transformer. In this work, we do not require explicit density modeling. Instead, it suffices to obtain IC by log-likelihood point estimates of the observations. To that end, we calculate such point estimates using Autoregressive Diffusion Models (ADMs) [15, 16]. Similar to GPT-style causal transformers, ADMs summarize the context of past observations into a context state. However, instead of using an MLP to transform the context states into softmax logits or GMM parameters, ADMs use the context states to condition small diffusion model MLPs to generate the next continuous state.
Estimating IC using a diffusion model requires the use of the Instantaneous Change of Variables formula [32]. This formula, as detailed below, computes the log-likelihood of data points $z_0 \sim \pi_0$ (in our case, Music2Latent audio representations) that can be flowed to a known analytic distribution $\pi_1$ according to an ODE $\frac{d}{dt} z(t) = v(z(t), t)$. Finding such ODEs is non-trivial, but it turns out that neural ODEs [32] derived from diffusion processes do exactly that: flow data samples to noise samples of known isotropic Gaussian distributions.
3.2 Instantaneous Change of Variables
For data points $z_0 \sim \pi_0$ flowing in time according to $\frac{d}{dt} z(t) = v(z(t), t)$, [32] shows that the log-likelihood of the points changes according to another ODE: $\frac{d}{dt} \log p(z(t)) = -\operatorname{tr}\left(\frac{\partial v}{\partial z}(z(t), t)\right)$, given some regularity conditions. Therefore, if $z_0 \sim \pi_0$ is flowed to $z_1 \sim \pi_1$ and $\pi_1$ is known, we can evaluate $\log \pi_0(z_0)$ from $\log \pi_1(z_1)$ and the log-likelihood flow change from $\pi_0$ to $\pi_1$. Practically, the two ODEs are combined into a system of equations and solved numerically from $t_0$ to $t_1$ given the initial conditions $z(t_0) = z_0$, $\log \pi_0(z(t_0)) - \log \pi_0(z_0) = 0$:
$$\int_{t_0}^{t_1} \underbrace{\begin{bmatrix} v(z(t), t) \\ -\operatorname{tr}\left(\frac{\partial v}{\partial z}(z(t), t)\right) \end{bmatrix}}_{\text{dynamics}} dt = \underbrace{\begin{bmatrix} z_1 \\ \log \pi_1(z_1) - \log \pi_0(z_0) \end{bmatrix}}_{\text{solutions}}. \tag{1}$$
We can now obtain $\log \pi_0(z_0)$ by subtracting the solution of the 2nd ODE from $\log \pi_1(z_1)$. Here, $\log \pi_1(z_1)$ is easily evaluated using the solution of the 1st ODE and the known $\pi_1$. Furthermore, [33] shows that Eq. 1 can be calculated efficiently with reverse-mode automatic differentiation using the Skilling-Hutchinson trace estimator [34, 35], which involves using $n$ Monte Carlo runs with noise samples from a Rademacher distribution [35] to obtain an unbiased estimate of an expectation. For the approach to work, we therefore require finding ODEs that flow the data distribution $\pi_0$ to an analytic distribution $\pi_1$. In the following, we consider two diffusion model-based neural ODEs [32] learning such flows.
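
As a minimal sketch of the likelihood computation described in this section, the following standard-library Python code integrates a toy ODE together with the trace term and checks the result against a closed form. It relies on the identity $\log \pi_0(z_0) = \log \pi_1(z_1) + \int_{t_0}^{t_1} \operatorname{tr}(\partial v / \partial z)\, dt$, which follows from the log-density ODE above. Everything here is an illustrative assumption rather than the authors' implementation: the names `hutchinson_divergence`, `log_likelihood`, and `toy_field` are hypothetical, a fixed-step RK4 loop replaces the adaptive solver, a finite difference replaces reverse-mode automatic differentiation, and a linear vector field replaces the neural network.

```python
import math
import random


def hutchinson_divergence(v, z, t, n_runs, rng, eps=1e-5):
    """Estimate tr(dv/dz) at (z, t) as the mean of e^T (dv/dz) e over Rademacher e."""
    total = 0.0
    for _ in range(n_runs):
        e = [rng.choice((-1.0, 1.0)) for _ in z]
        # Central finite difference for the Jacobian-vector product (dv/dz) e;
        # this stands in for automatic differentiation in this sketch.
        v_plus = v([zi + eps * ei for zi, ei in zip(z, e)], t)
        v_minus = v([zi - eps * ei for zi, ei in zip(z, e)], t)
        jvp = [(a - b) / (2.0 * eps) for a, b in zip(v_plus, v_minus)]
        total += sum(ei * ji for ei, ji in zip(e, jvp))
    return total / n_runs


def log_likelihood(v, z0, log_pi1, t0=0.0, t1=1.0, steps=200, n_runs=4, seed=0):
    """Return log pi_0(z0) = log pi_1(z1) + integral of tr(dv/dz) dt (fixed-step RK4)."""
    rng = random.Random(seed)

    def rhs(z, t):
        return v(z, t), hutchinson_divergence(v, z, t, n_runs, rng)

    def shift(z, k, h):
        return [zi + h * ki for zi, ki in zip(z, k)]

    z, trace_integral = list(z0), 0.0
    h = (t1 - t0) / steps
    for i in range(steps):
        t = t0 + i * h
        k1, d1 = rhs(z, t)
        k2, d2 = rhs(shift(z, k1, h / 2), t + h / 2)
        k3, d3 = rhs(shift(z, k2, h / 2), t + h / 2)
        k4, d4 = rhs(shift(z, k3, h), t + h)
        z = [zi + h / 6 * (a + 2 * b + 2 * c + d)
             for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]
        trace_integral += h / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
    return log_pi1(z) + trace_integral


def standard_normal_logpdf(z):
    return -0.5 * sum(zi * zi for zi in z) - 0.5 * len(z) * math.log(2.0 * math.pi)


# Toy check: v(z, t) = a * z (elementwise) has the closed-form flow z1 = exp(a) * z0.
RATES = [0.5, -0.3]


def toy_field(z, t):
    return [a * zi for a, zi in zip(RATES, z)]


z0 = [0.4, -1.2]
estimate = log_likelihood(toy_field, z0, standard_normal_logpdf)
exact = standard_normal_logpdf([zi * math.exp(a) for a, zi in zip(RATES, z0)]) + sum(RATES)
```

For this diagonal linear field the Rademacher estimate of the trace is exact for any number of runs, so `estimate` and `exact` agree up to integration error; for a general field the estimate is noisy and averages out as `n_runs` grows.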
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.3 Probability Flow ODEs
In [17, 36], the authors define a diffusion noise (forward) process by a stochastic differential equation (SDE) that flows the data distribution $\pi_0$ to a (known) Gaussian distribution $\pi_1$ in time $t_0 \to t_1$. The dynamics of the SDE on data points $z(t_0)$ flowing to $z(t)$ can effectively be described by:
$$p_{t_0,t}(z(t) \mid z(t_0)) = \mathcal{N}\left(z(t);\, z(t_0)\, s(t),\, s(t)^2 \sigma(t)^2 I\right), \tag{2}$$
where $\sigma$ is a noise scale and $s$ a contraction chosen such that $p_{t_0,t_0}(z(t_0) \mid z(t_0)) = \pi_0$ and $p_{t_0,t_1}(z(t_1) \mid z(t_0)) = \pi_1 \approx \mathcal{N}(z(t_1); 0, \sigma_{\max}^2 I)$. Remarkably, the SDEs can be translated to deterministic processes (probability flow ODEs) that equivalently flow $\pi_0$ to $\pi_1$, given by:
$$\frac{d}{dt} z(t) = \frac{\dot{s}(t)}{s(t)}\, z(t) - s(t)^2\, \dot{\sigma}(t)\, \sigma(t)\, \nabla_z \log p\left(\frac{z(t)}{s(t)};\, \sigma(t)\right). \tag{3}$$
These ODEs, thus, fulfil the requirements of Section 3.2. Eq. 3 can be turned into a neural ODE by learning $\nabla_z \log p$ with a neural network (see Section 4.2) using score matching. In Section 4, we use the EDM initialization of Eq. 3 from [17], where $s(t) = 1$, $\sigma(t) = t$, and the process flows in time $t_0 = 0.002$ to $t_1 = 80$. This model will be referred to as EDM in our experiments.
3.4 Rectified Flow
Rectified Flow (RFF) [18] defines a process that flows the data distribution to a standard Gaussian distribution ($\pi_1 = \mathcal{N}(z(t_1); 0, I)$) by following straight lines as much as possible. Formally, given the ODE $\frac{d}{dt} z(t) = v(z(t), t)$, RFF between $\pi_0$ and $\pi_1$ is learned by the minimization:
$$\min_{v} \int_0^1 \mathbb{E}\left[\left\| (z_1 - z_0) - v(z_t, t) \right\|^2\right] dt, \tag{4}$$
where $z_0 \sim \pi_0$, $z_1 \sim \pi_1$ and $z_t = (1 - t) z_0 + t z_1$ for $t \in [0, 1]$. $z_t$ is therefore a point on the straight line connecting $z_0$ and $z_1$. Substituting $v$ with a neural network (see Section 4.2) in the ODE, we get a neural ODE. The weights are learned using sample estimates of Eq. 4 as a loss.
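
The sample estimate of Eq. 4 can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' training code: `rectified_flow_loss` and `sample_batch` are hypothetical names, the vector field is passed in as a plain function instead of a context-conditioned network, and latents are plain lists of floats.

```python
import random


def rectified_flow_loss(v, z0_batch, z1_batch, t_batch):
    """Sample estimate of Eq. 4: batch mean of ||(z1 - z0) - v(z_t, t)||^2."""
    total = 0.0
    for z0, z1, t in zip(z0_batch, z1_batch, t_batch):
        # Point on the straight line between the data point and the noise sample.
        z_t = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
        target = [b - a for a, b in zip(z0, z1)]
        pred = v(z_t, t)
        total += sum((y - p) ** 2 for y, p in zip(target, pred))
    return total / len(z0_batch)


def sample_batch(data, rng):
    """Pair each data point with noise z1 ~ N(0, I) and a time t ~ U[0, 1)."""
    z1 = [[rng.gauss(0.0, 1.0) for _ in z0] for z0 in data]
    t = [rng.random() for _ in data]
    return data, z1, t


# Deterministic example: a single pair evaluated at t = 0.25.
z0_batch, z1_batch, t_batch = [[1.0, 2.0]], [[3.0, -2.0]], [0.25]
loss_zero = rectified_flow_loss(lambda z_t, t: [0.0 for _ in z_t],
                                z0_batch, z1_batch, t_batch)
loss_perfect = rectified_flow_loss(lambda z_t, t: [2.0, -4.0],
                                   z0_batch, z1_batch, t_batch)
```

A field that always predicts the straight-line direction `z1 - z0` drives the loss to zero, which is the sense in which the learned flow follows straight lines as much as possible.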
3.5 Likelihood Estimations in Noise Space Continua
In addition to estimating likelihoods of the data distribution $\pi_0$ using the framework described above, we can also compute likelihoods at various noise levels along the noise/data continuum (traversed by varying $t$, either in the perturbation kernel of Eq. 2 for probability flow ODEs, or in $z_t$ for RFF). In both cases, rather than solving the ODE from $t_0$ to $t_1$, we solve it from the noise level $t$ to $t_1$.
It is noted that both processes introduce noise gradually into the data. As a result, high-detail information is removed first, while lower-detail information is retained at lower noise levels, and then also lost as the noise increases. Therefore, in Section 4.5 we hypothesize that the IC extracted at moderate noise levels captures the surprisal of certain lower-detail musical features, such as pitch, while filtering out the contributions of higher-detail features, like subtle timbral nuances.
Given a test example existing in $\pi_0$, there are three natural ways to obtain a "noised" version of the data at a given noise level $t$: (1) sampling from the noise process, (2) using the expected value of the noise process, or (3) solving the ODE from $t_0$ to $t$. We discard option (1) because it yields a stochastic estimate, and option (3) because we found that option (2), using the expected value, produces better results in practice.
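
Option (2) can be made concrete from the definitions in Sections 3.3 and 3.4. The mean of the perturbation kernel in Eq. 2 is $s(t)\, z(t_0)$, and the mean of $z_t = (1 - t) z_0 + t z_1$ given $z_0$ is $(1 - t) z_0$, because $z_1$ has zero mean. The sketch below encodes just these two expectations; the function names are hypothetical, and how the result is passed to the ODE solver is not specified here.

```python
def expected_noised_edm(z0, t, s=lambda t: 1.0):
    """Mean of the kernel in Eq. 2: s(t) * z0, with s(t) = 1 in the EDM setup."""
    return [s(t) * zi for zi in z0]


def expected_noised_rff(z0, t):
    """Mean of z_t = (1 - t) * z0 + t * z1 when z1 ~ N(0, I) has zero mean."""
    return [(1.0 - t) * zi for zi in z0]


latent = [0.5, -2.0]
edm_input = expected_noised_edm(latent, 50.0)
rff_input = expected_noised_rff(latent, 0.6)
```

Under this reading, the EDM input equals the clean latent because $s(t) = 1$, so the noise level enters only through the time at which the ODE solve starts, whereas the RFF input is shrunk towards zero by the factor $1 - t$.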
4. EXPERIMENTS AND RESULTS
4.1 Data
For training and evaluating our models, we use the following audio datasets and encode them into Music2Latent representations, using the public checkpoint of [12]. For model training, we use a dataset consisting of 150,000 CC licenced full-length mixed-source MP3 files from Jamendo (JAM) [37], which we split into 125k, 12.5k, and 12.5k examples for training, validating, and testing purposes, respectively. For experiments involving monophonic singing voices, we use a private dataset for fine-tuning our models, comprising vocal stems from 20k songs. For our experiments with monophonic pitch, we use a synthetic dataset (SYN) of 49 Irish folk tunes from The Session dataset [38], synthesized at constant velocity with diverse SoundFont-based instruments according to the midi-programs: “Pad 1 (new age)”, “Synth Voice”, “Acoustic Guitar (nylon)”, “Acoustic Grand Piano” and “Trumpet” for a total of 245 examples. Additionally, for each melody, the IC of IDyOM is computed (see [39] for details). Furthermore, we use the dataset of [40] (VOC), consisting of 18 recorded vocal melodies paired with IDyOM IC estimates of the transcribed melodies. For our experiment on segment boundary detection, we use the Salami dataset (SAL) [41], containing 1310 audio files, having segment annotations from one or two human annotators. The annotations are hierarchical and include the following levels: functional, uppercase, and lowercase, describing global structure with semantic segment labels, global structure, and local structures, respectively. We use all datasets to evaluate model effectiveness using NLL.
4.2 Architecture and Training Details
For the diffusion models, we use a standard 12-layer causal Transformer backbone with Pre-Layer Normalization similar to [42], rotary positional embeddings [43], and FlashAttention [44], and for the diffusion MLP, we follow the architecture of [16]. For the GIVT model, we follow the architecture presented in [11]. All models are trained with a maximum sequence length of 4600, corresponding to approximately 7 minutes of audio, and a batch size of 8 sequences, resulting in an effective batch size of up to ∼1 hour of music. We use Adam optimization for 270k steps with a learning rate of 3·10⁻⁴ for the diffusion models and 10⁻⁴ for GIVT, using a cosine schedule with a warmup of 1800 steps. For our experiments involving monophonic singing voices, we additionally fine-tune each model on the dataset mentioned in Section 4.1 until convergence.

S-MAE
n      1      2      4      8      16
EDM    .109   .078   .057   .043   .033
RFF    .109   .079   .057   .043   .033
Q-MAE
tol    1      .1     .01    .001   1e-4
EDM    .085   .078   .076   .076   .076
RFF    .076   .076   .076   .076   .076
Q-ME
tol    1      .1     .01    .001   1e-4
EDM    -.044  -.018  -.004  .000   .000
RFF    .000   -.001  .000   .000   .000
Table 1. Approximation errors of the likelihood estimation, indicated with the Skilling-Hutchinson (S) and quantization (Q) mean absolute error (MAE) and the quantization mean error (ME). The error is reported with respect to references of n = 32 runs and a tolerance of tol = 10⁻⁴ for the two error types, respectively. The results are normalized to the mean absolute NLL of the references.
4.3 ODE-based Likelihood Approximation Errors
The likelihood estimation from ODE-based diffusion models is affected by two types of approximation errors: the discretization error of the ODEs and the error of the Skilling-Hutchinson trace estimator (see Section 3.2). In both cases, the approximation error can be controlled by trading off speed. We therefore perform initial experiments on 500 examples from our validation dataset to determine a suitable trade-off. For the former, similarly to [33, 36], the error is controlled using the Runge-Kutta 5(4) [45] method. For the latter, the error of the unbiased approximation can be made arbitrarily small using enough Monte Carlo runs n. We use the scipy Runge-Kutta implementation [46] with standard parameters except for setting the tolerances to atol = rtol = tol = 10⁻³ and compute the mean absolute error (MAE) of the difference between NLL calculated with different n and NLL of a reference calculated with a very large number of runs (n = 32).
To relate the MAE to the scale of the NLL, we divide it by the reference's average absolute NLL and report the resulting measure as S-MAE in Table 1. For all n, we found that the average error is small for both models. Even when n = 1, the error is 0.109 of the average NLL. This demonstrates that it is possible to obtain a coarse estimate of a sample's NLL with minimal computational overhead compared to traditional diffusion model generation. We identify n = 4 as a good balance and fix it for further experiments.
For determining tolerance parameters tol, we similarly compare NLL calculated with different values of tol to a reference of tol = 10⁻⁵ and report this as Q-MAE in Table 1. Furthermore, we investigate the bias by plotting the mean error (ME), normalized to the average NLL, and report it as Q-ME in Table 1. We find that for tol ≤ 0.1 and 0.01 for EDM and RFF, the absolute error does not improve compared to the reference. Comparing the bias of the error, we find that while RFF is unbiased, EDM has a negative bias for tol > 0.001. The better performance of RFF is likely due to the straight flows imposed by the method, which allow the solver to take large steps. Thus, we take tol = 0.001, such that the relative MAE is 0.057.

       JAM     SAL     VOC     SYN
GIVT   0.925   1.053   1.182   0.981
EDM    0.707   0.829   0.823   0.642
RFF    0.699   0.829   0.831   0.656
Table 2. Comparison of model NLLs in the Music2Latent space on different datasets, reported in bits/dimension.
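
The normalised error measures of Section 4.3 (S-MAE, Q-MAE and Q-ME in Table 1) can be sketched as follows. The function name `normalized_errors` and the example values are illustrative assumptions; the only logic taken from the text is that the mean absolute error and the mean error are each divided by the reference's average absolute NLL.

```python
def normalized_errors(estimates, references):
    """Return (MAE, ME), each divided by the mean absolute reference NLL."""
    n = len(references)
    scale = sum(abs(r) for r in references) / n
    mae = sum(abs(e - r) for e, r in zip(estimates, references)) / n
    me = sum(e - r for e, r in zip(estimates, references)) / n
    return mae / scale, me / scale


# Made-up NLL values: cheap estimates versus a high-accuracy reference.
rel_mae, rel_me = normalized_errors([1.1, 1.9, 3.2], [1.0, 2.0, 3.0])
```

A relative MAE close to zero with a relative ME of clearly non-zero magnitude would indicate a systematic bias, which is the distinction the Q-MAE and Q-ME rows are meant to expose.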
4.4 Predictive Efficiency Comparison
Similar to previous work in density estimation models [33, 36, 47], we compare the models' predictive effectiveness (how well the models predict diverse audio data) using the average NLL reported in bits/dimension (mean negative log2-likelihood/dimension). Since all compared models estimate likelihoods in the fixed coordinate system of the Music2Latent codec, we can compare the NLL in that space. We emphasize that our reported results are, therefore, not directly comparable with those described in [11], as it uses a different version of Music2Latent.
We find that Music2Latent encodes silence into a small region of its latent space, causing the model to assign extremely low IC values to silent frames (since IC is unbounded from below for densities). Consequently, these low values pull down the average NLL without improving the models' predictive capabilities. To address this, we remove leading and trailing silence from the audio before computing NLL. Similarly, for each dataset, we discard the IC values at time steps that fall within the 1% most extreme IC values (across any model). We present the models' average NLL in Table 2.
We find that the diffusion models have much lower NLL than the GIVT model, and as such, model the one-step prediction densities more accurately. This is consistent with the findings of [16] for audio fidelity in a generative task. Comparing the NLL values of the EDM and RFF models reveals no clear winner. Interestingly, we find that the NLL of diffusion models on the SYN monophonic dataset, which is dissimilar to the training distribution JAM, is lowest. Using an information-theoretic interpretation, the low NLL indicates that the SYN dataset is less surprising regarding timbre and melody, and therefore has lower IC.
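
The two bookkeeping steps of this comparison, converting an NLL to bits/dimension and discarding extreme time steps, might look as follows. This is a sketch under explicit assumptions: the NLL is taken to be in nats (natural logarithm), `dim` is a placeholder for the latent dimensionality, and "most extreme" is interpreted here as the largest absolute deviation from each model's median IC, which is only one possible reading of the text. The names `bits_per_dim` and `kept_time_steps` are hypothetical.

```python
import math
import statistics


def bits_per_dim(nll_nats, dim):
    """Convert an NLL in nats for one latent frame to bits per dimension."""
    return nll_nats / (dim * math.log(2.0))


def kept_time_steps(ic_by_model, frac=0.01):
    """Indices of time steps that are not among the `frac` most extreme for any model.

    Assumption: "extreme" means largest |IC - median IC| within each model.
    """
    n_steps = len(next(iter(ic_by_model.values())))
    n_drop = math.ceil(frac * n_steps)
    dropped = set()
    for ic in ic_by_model.values():
        centre = statistics.median(ic)
        ranked = sorted(range(n_steps), key=lambda i: abs(ic[i] - centre), reverse=True)
        dropped.update(ranked[:n_drop])
    return [i for i in range(n_steps) if i not in dropped]


# Tiny made-up example with two models and four time steps.
ic = {"model_a": [1.0, 1.0, 1.0, -50.0], "model_b": [2.0, 9.0, 2.0, 2.0]}
kept = kept_time_steps(ic, frac=0.25)
one_bit = bits_per_dim(8 * math.log(2.0), dim=8)
```

Taking the union of dropped steps across models keeps the comparison fair, since every model is then averaged over the same set of time steps.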
4.5 Pitch Surprisal in Noise Space Continua
In [12], it is noted that a small MLP can predict pitches from the Music2Latent representations with high accuracy. Thus, it is reasonable to hypothesize that pitch is embedded in coarse structures in the Music2Latent representations. We, therefore, investigate to what extent our IC can explain pitch surprisal and whether IC estimates at different noise levels describe pitch surprisal to a greater extent.
[Figure 1: two piano-roll panels, “Acoustic Grand Piano” (top) and “Pad 1 (new age)” (bottom); y-axis: pitch, C4 to A4; x-axis: T (seconds), 0 to 8; overlaid IC curves: GIVT, EDM t = 0.002, EDM t = 50.0, RFF t = 0.0, RFF t = 0.6.]
Figure 1. Piano roll of simple melody and IC of EDM and RFF models calculated from SoundFont audio synthesized at constant note-velocity with “Acoustic Grand Piano” (top) and “Pad 1 (new age)” (bottom) instruments. The IC is shown at noise levels t = 0.002, 0.0 (corresponding to fully unnoised data), and t = 50, 0.6 (corresponding to mid-process values) for EDM, RFF, respectively. The IC is affinely transformed to a min/max-value of 0/1 for visualization.
4.5.1 IC Surprisal Qualitative
In Fig. 1, we plot a simple melody along with IC estimates on SoundFont audio synthesized with two different instruments. The melody is composed of a C major arpeggio repeated four times. On the fourth occurrence at T = 6, the pattern is modified with the out-of-tonality pitch C#. When predicting E at T = 2.5 s and forward, there is a partial match between the current context and the context seen 2 s earlier. That is, E and the following notes are expected at time T = 2.5 s and forward, which is reflected in the lower IC across all models in the investigated noise levels, compared to the ICs at times T − 2 s. At T = 6 s, the surprising out-of-tonality note prompts a big peak in IC estimates across all models and noise levels. Finally, at time T = 8 s, the melody is surprisingly abruptly terminated, reflected in a smaller final peak in IC estimates across all models. Comparing the IC estimates of the different timbres, we find that the mid-process IC estimates at t = 50.0, 0.6 are more similar across timbre than the estimates of fully-denoised data at t = 0.002, 0.0 for EDM and RFF, respectively. This is evident, for example, by the lesser variation in IC between note onsets within T = [0.5, 1.0], which is especially visible for the EDM model. This suggests that the IC estimates at moderate noise levels capture pitch surprisal rather than timbre variations more effectively than estimates from fully unnoised data, which we investigate in the following.
4.5.2 IC Surprisal Quantitative
To quantitatively validate whether the IC computed in audio can explain pitch surprisal, we would require a ground truth, which does not exist. As a proxy, we use the surprisal (IC) values predicted by the perceptually validated pitch expectancy model IDyOM (see Section 2), which operates in the symbolic domain. We conduct our experiment with the SYN and VOC datasets. We extract the IDyOM IC of each note pitch in the symbolic datasets, and pair these with IC values calculated from our models on the audio datasets. We align the latter to the notes by identifying the two Music2Latent frames that contain the note onset and calculating their average IC. Since we expect monotonic, rather than linear, relations in the IC pairs, we compare the paired estimates using Spearman's rank correlation and report the results in Table 3. We find all correlations to be significant at a 5%-significance level and, except for GIVT, positive. Comparing the GIVT to the diffusion models estimating IC of unnoised data (t = 0.002, 0.0 for EDM and RFF, respectively), we find that the correlations are higher for SYN and similar for the VOC dataset. In all cases, except for RFF-VOC, we find the highest correlations at noise levels above the unnoised setting, i.e., when IC is estimated using the noised data. For SYN, we see the highest correlations for high noise scale values t = 50.0, 0.6 (compared to the fully noised values t = 80.0, 1.0) for EDM and RFF, respectively. In particular, we find that EDM at noise level t = 50 is overall most strongly correlated with the IC of IDyOM, which supports the findings in Section 4.5.1 on similar data, where this value shows the smoothest surprisal curves, with clearly defined peaks around the note onsets. For the VOC dataset, the highest correlations occur at lower noise levels, and overall, the correlations are less pronounced than in the SYN dataset. This may be because singers are less precise when changing pitch, often using portamento to glide into a note. As a result, peaks in IC during note changes may not be driven solely by pitch shifts, but also by other cues, such as emphasis at the onset (e.g., volume changes), vowel transitions, or the presence of plosives. Masking such potentially subtle characteristics with noise may explain the observed correlation reduction.

GIVT
SYN    -.031
VOC    .147
EDM
t      2e-3   10.0   20.0   50.0   60.0
SYN    .137   .048   .190   .264   .255
VOC    .135   .213   .206   .138   .106
RFF
t      0.0    0.1    0.5    0.6    0.7
SYN    .134   .189   .216   .218   .189
VOC    .137   .133   .069   .048   .030
Table 3. Spearman's correlation between IC calculated on (symbolic) melodies using IDyOM and IC calculated at different noise levels with EDM and RFF.

4.5.3 Noise Space Continua Timbre Invariance
As shown above for SYN, the IC is more correlated with pitch surprisal at intermediate noise levels, where the fine details in the audio embeddings have been removed. We investigate to what extent this can be explained by a higher invariance to timbre irrelevant to pitch surprisal. We, therefore, investigate if IC estimated on music that contains the same note content but with different timbre is more similar at the noise levels studied above. Specifically, we use SYN and investigate Spearman's correlation between IC of all combinations of pairs of synthetics sharing the same note material (but having different timbre), and report the results in Table 4. The results are all significant at a 5% significance level. For unnoised data, the correlations are similar for the diffusion models and GIVT. However, for EDM especially, the correlations increase for noised data. We find high correlations for noise level values t = 50, 0.6 for EDM and RFF, respectively, which have the highest correlation with IDyOM (see Table 3), supporting the notion that these noise levels are more invariant to timbre.

GIVT   .380
EDM
t      2e-3   10.0   20.0   50.0   60.0
       .385   .517   .525   .522   .518
RFF
t      0.0    0.1    0.5    0.6    0.7
       .391   .307   .385   .429   .402
Table 4. Spearman's correlation between IC estimations sharing the same note material, but with different timbre.
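
Spearman's rank correlation, used for Tables 3 and 4, is the Pearson correlation of the ranks of the two series, with tied values sharing their average rank. A dependency-free sketch is given below; the function names are illustrative, the example values are made up, and the significance test mentioned in the text is not reproduced here.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# A monotonic but non-linear relation gives rho = 1; ties lower it slightly.
rho_monotonic = spearman([1, 2, 3, 4], [1, 4, 9, 16])
rho_with_tie = spearman([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])
```

Because only ranks enter the computation, any monotonic rescaling of either IC series leaves the result unchanged, which matches the motivation given for preferring it over a linear correlation.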
4.6 IC for Unsupervised Segment Boundary Detection
In the symbolic domain, IC has been used as a novelty function for segment boundary detection [48–50]. Therefore, we investigate whether big changes in IC extracted from audio also coincide with segment boundaries. We conduct an experiment where we predict Salami lowercase segment boundaries using the most significant peaks extracted from an IC novelty function. The novelty function is constructed by smoothing our IC curves with a Gaussian kernel with standard deviation σ = 5, and differencing the smoothed series. Using the off-the-shelf Röder peak picking algorithm [51] with standard parameters, we report, in Table 5, precision, recall, and F1-score on predictions that are accurate within ±0.5 seconds of the annotations. Generally, precision values are substantially lower than recall, implying that the IC novelty curves tend to have extra peaks not attributed to segment boundaries. For the GIVT, and the IC estimated with diffusion models on unnoised data, we find the F1 scores to be similar. For RFF, and to a lesser extent EDM, precision and recall increase with the noise level. This shows that the IC estimated at a coarse level aligns better with the segment boundaries.

GIVT
prec   .158
rec    .309
F1     .209
EDM
t      2e-3   17.6   40.0   60.0
prec   .159   .162   .169   .178
rec    .286   .311   .324   .345
F1     .204   .213   .222   .235
RFF
t      0.0    0.25   0.50   0.70
prec   .159   .163   .179   .198
rec    .287   .342   .380   .416
F1     .205   .221   .243   .268
Table 5. Precision, recall and F1 scores of Salami lowercase ±0.5 seconds boundary detection.
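
The novelty-function pipeline of Section 4.6 can be sketched end to end in standard-library Python. Several parts are assumptions made for illustration: σ is taken to be in frames, a plain local-maximum search stands in for the cited peak-picking algorithm (it is not that algorithm), and predictions are matched to annotations greedily and one-to-one within the ±0.5 s tolerance. All function names are hypothetical.

```python
import math


def gaussian_smooth(x, sigma=5.0):
    """Smooth a series with a truncated Gaussian kernel, renormalised at the edges."""
    radius = int(4 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    out = []
    for i in range(len(x)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(x):
                num += w * x[j]
                den += w
        out.append(num / den)
    return out


def novelty(ic, sigma=5.0):
    """First difference of the smoothed IC curve."""
    s = gaussian_smooth(ic, sigma)
    return [b - a for a, b in zip(s, s[1:])]


def local_maxima(y):
    """Indices of simple local maxima (a stand-in for a real peak picker)."""
    return [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] >= y[i + 1]]


def boundary_scores(pred, ref, tol=0.5):
    """Precision, recall, F1 with greedy one-to-one matching within +/- tol seconds."""
    unmatched = sorted(ref)
    hits = 0
    for p in sorted(pred):
        for r in unmatched:
            if abs(p - r) <= tol:
                unmatched.remove(r)
                hits += 1
                break
    prec = hits / len(pred) if pred else 0.0
    rec = hits / len(ref) if ref else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


# A step in IC at frame 50 yields a single novelty peak just before the step.
step_ic = [0.0] * 50 + [1.0] * 50
peaks = local_maxima(novelty(step_ic))
scores = boundary_scores([1.0, 5.2, 9.0], [1.3, 5.0])
```

Peak indices would still need to be converted to seconds with the frame rate of the latent representation before scoring. In the second example, the extra prediction at 9.0 s lowers precision but not recall, the same pattern the text reports for the IC novelty curves.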
5. CONCLUSION AND DISCUSSION
We investigated ADMs' ability to estimate musical surprisal and found that EDM and RFF diffusion models more effectively describe music data than a GIVT in terms of NLL. We evaluated the effectiveness of the diffusion models' IC in capturing monophonic pitch surprisal and found that these capture pitch surprisal better than a GIVT. Furthermore, we found that IC estimates of noised data increase correlation with pitch surprisal, and showed that this coincides with these estimates being more invariant to timbre. Furthermore, we showed that peaks in a novelty function derived from IC coincide with Salami lowercase segment boundaries; however, the function has additional peaks. Finally, using the IC estimated in noise space improves the segment boundary predictions regarding precision and recall. As such, diffusion models surpass GIVT models in surprisal estimation and offer additional estimates that can capture aspects important to musical surprisal.
Similarly to [11], we estimate surprisal with IC in Music2Latent representations, so their findings on musical complexity, repetition reduction, and EEG prediction are likely to extend to diffusion-based IC. This should be validated in future work and extended with other perceptual validating experiments on diverse music. Furthermore, our investigation of noise levels relevant to pitch surprisal could be extended to consider other perceptual features and their entanglement in different data representations. For instance, the IC calculated at suitable (high) noise levels in mel-spectrograms or constant-Q transformed representations may give estimations of surprisal that correlate more with pitch surprisal. Also, the exploratory investigation of optimal noise levels could be automated using a methodology similar to [52], by monitoring performance degradations of a classifier/regressor model trained to predict the feature using variably noised inputs. Finally, our pitch surprisal analysis measured IC of coarse-grained structures, but our framework also allows studying surprising changes in fine-grained structures. This might, for instance, be relevant for analyzing timbre changes or singing techniques.
6. ACKNOWLEDGMENTS
The work leading to these results was conducted in a collaboration between JKU and Sony Computer Science Laboratories Paris under a research agreement. The first and third author also acknowledge support by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement 101019375 (“Whither Music?”).
7. REFERENCES
[1] L. B. Meyer, “Meaning in music and information
theory,” The Journal of Aesthetics and Art Criticism,
vol. 15, no. 4, pp. 412–424, 1957.
[2] D. Conklin and I. H. Witten, “Multiple viewpoint
systems for music prediction,” Journal of New Music
Research, vol. 24, no. 1, pp. 51–73, 1995.
[3] M. Pearce, “The construction and evaluation of
statistical models of melodic structure in music perception
and composition,” Ph.D. dissertation, Department of
Computing, City University, London, UK, 2005.
[4] M. T. Pearce and G. A. Wiggins, “Auditory
expectation: The information dynamics of music perception
and cognition,” Top. Cogn. Sci., vol. 4, no. 4,
pp. 625–652, 2012.
[5] S. A. Sauvé and M. T. Pearce, “Information-theoretic
modeling of perceived musical complexity,” Music
Perception: An Interdisciplinary Journal, vol. 37,
no. 2, pp. 165–178, 2019.
[6] M. R. Bjare, S. Lattner, and G. Widmer, “Exploring
sampling techniques for generating melodies with
a transformer language model,” in ISMIR, 2023,
pp. 810–816.
[7] B. P. Gold, M. T. Pearce, E. Mas-Herrero, A. Dagher,
and R. J. Zatorre, “Predictability and uncertainty in the
pleasure of music: a reward for learning?” Journal of
Neuroscience, vol. 39, no. 47, pp. 9397–9409, 2019.
[8] C.-i. Wang and S. Dubnov, “Guided music synthesis
with variable markov oracle,” in AAAI, vol. 10, no. 5,
2014, pp. 55–62.
[9] T. Collins, R. Laney, A. Willis, and P. H. Garthwaite,
“Developing and evaluating computational models of
musical style,” AI EDAM, vol. 30, no. 1, pp. 16–43,
2016.
[10] M. R. Bjare, S. Lattner, and G. Widmer, “Controlling
surprisal in music generation via information content
curve matching,” in ISMIR, 2024.
[11] M. R. Bjare, G. Cantisani, S. Lattner, and G. Widmer,
“Estimating musical surprisal in audio,” in ICASSP,
2025.
[12] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent:
Consistency autoencoders for latent audio
compression,” in ISMIR, 2024.
[13] M. Tschannen, C. Eastwood, and F. Mentzer, “GIVT:
generative infinite-vocabulary transformers,” CoRR,
vol. abs/2312.02116, 2023.
[14] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever,
“Consistency models,” in ICML, vol. 202, 2023,
pp. 32 211–32 252.
[15] T. Li, Y. Tian, H. Li, M. Deng, and K. He,
“Autoregressive image generation without vector
quantization,” in NeurIPS, 2024.
[16] M. Pasini, J. Nistal, S. Lattner, and G. Fazekas,
“Continuous autoregressive models with noise
augmentation avoid error accumulation,” in Audio Imagination:
NeurIPS 2024 Workshop AI-Driven Speech, Music,
and Sound Generation, 2024.
[17] T. Karras, M. Aittala, T. Aila, and S. Laine,
“Elucidating the design space of diffusion-based generative
models,” in NeurIPS, 2022.
[18] X. Liu, C. Gong, and Q. Liu, “Flow straight and fast:
Learning to generate and transfer data with rectified
flow,” in ICLR, 2023.
[19] M. T. Pearce, M. H. Ruiz, S. Kapasi, G. A. Wiggins,
and J. Bhattacharya, “Unsupervised statistical learning
underpins computational, behavioural, and neural
manifestations of musical expectation,” NeuroImage,
vol. 50, no. 1, pp. 302–313, 2010.
[20] G. M. Di Liberto, C. Pelofi, R. Bianco, P. Patel, A. D.
Mehta, J. L. Herrero, A. De Cheveigné, S. Shamma,
and N. Mesgarani, “Cortical encoding of melodic
expectations in human temporal cortex,” Elife, vol. 9,
p. e51784, 2020.
[21] N. C. Hansen and M. T. Pearce, “Predictive uncertainty
in auditory sequence processing,” Frontiers in
psychology, vol. 5, p. 1052, 2014.
[22] R. Bianco, L. E. Ptasczynski, and D. Omigie, “Pupil
responses to pitch deviants reflect predictability of
melodic sequences,” Brain and Cognition, vol. 138,
p. 103621, 2020.
[23] T. Moldwin, O. Schwartz, and E. S. Sussman,
“Statistical learning of melodic patterns influences the brain’s
response to wrong notes,” Journal of cognitive
neuroscience, vol. 29, no. 12, pp. 2114–2122, 2017.
[24] E. Abrams, E. M. Vidal, C. Pelofi, and P. Ripollés,
“Retrieving musical information from neural data: how
cognitive features enrich acoustic ones,” in ISMIR,
2022, pp. 160–168.
[25] B. Skerritt-Davis and M. Elhilali, “Detecting change
in stochastic sound sequences,” PLoS Comput. Biol.,
vol. 14, no. 5, 2018.
[26] B. Skerritt-Davis and M. Elhilali, “A model for
statistical regularity extraction from dynamic sounds,” Acta
Acustica united with Acustica, vol. 105, no. 1, pp. 1–4,
2019.
[27] S. Dubnov, G. Assayag, and A. Cont, “Audio oracle: A
new algorithm for fast learning of audio structures,” in
ICMC, 2007, pp. 224–227.
[28] N. L. Masclef and T. A. Keller, “Deep generative
models of music expectation,” NeurIPS ML for Audio
Workshop 2023, 2023.
[29] S. Abdallah and M. Plumbley, “Information dynamics:
patterns of expectation and surprise in the perception
of music,” Connection Science, vol. 21, no. 2-3,
pp. 89–117, 2009.
[30] S. Dubnov, “Deep music information dynamics,” arXiv
preprint arXiv:2102.01133, 2021.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin,
“Attention is all you need,” in NeurIPS, 2017,
pp. 5998–6008.
[32] T. Q. Chen, Y. Rubanova, J. Bettencourt, and
D. Duvenaud, “Neural ordinary differential equations,” in
NeurIPS, 2018, pp. 6572–6583.
[33] W. Grathwohl, R. T. Q. Chen, J. Bettencourt,
I. Sutskever, and D. Duvenaud, “FFJORD: free-form
continuous dynamics for scalable reversible generative
models,” in ICLR, 2019.
[34] J. Skilling, “The eigenvalues of mega-dimensional
matrices,” Maximum Entropy and Bayesian Methods:
Cambridge, England, 1988, pp. 455–466, 1989.
[35] M. F. Hutchinson, “A stochastic estimator of the
trace of the influence matrix for laplacian smoothing
splines,” Communications in Statistics-Simulation and
Computation, vol. 18, no. 3, pp. 1059–1076, 1989.
[36] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar,
S. Ermon, and B. Poole, “Score-based generative
modeling through stochastic differential equations,” in
ICLR, 2021.
[37] Jamendo, “Jamendo Music,” https://www.jamendo.com.
[38] B. L. Sturm, J. F. Santos, O. Ben-Tal, and
I. Korshunova, “Music transcription modelling and
composition using deep learning,” in Proceedings of the
Conference on Computer Simulation of Musical Creativity,
Huddersfield, UK, 2016.
[39] M. R. Bjare, S. Lattner, and G. Widmer, “Differentiable
short-term models for efficient online learning and
prediction in monophonic music,” Trans. Int. Soc. Music.
Inf. Retr., vol. 5, no. 1, p. 190, 2022.
[40] G. Cantisani, A. Chalehchaleh, G. Di Liberto, and
S. Shamma, “Investigating the cortical tracking of
speech and music with sung speech,” in INTERSPEECH.
ISCA, 2023, pp. 5157–5161.
[41] J. B. L. Smith, J. A. Burgoyne, I. Fujinaga, D. D.
Roure, and J. S. Downie, “Design and creation of a
large-scale database of structural annotations,” in
ISMIR, 2011, pp. 555–560.
[42] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
I. Sutskever et al., “Language models are unsupervised
multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9,
2019.
[43] J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and
Y. Liu, “Roformer: Enhanced transformer with rotary
position embedding,” Neurocomputing, vol. 568,
p. 127063, 2024.
[44] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré,
“Flashattention: Fast and memory-efficient exact
attention with io-awareness,” in NeurIPS, 2022.
[45] J. R. Dormand and P. J. Prince, “A family of embedded
runge-kutta formulae,” Journal of computational and
applied mathematics, vol. 6, no. 1, pp. 19–26, 1980.
[46] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland,
T. Reddy, D. Cournapeau, E. Burovski, P. Peterson,
W. Weckesser, J. Bright, S. J. van der Walt,
M. Brett, J. Wilson, K. J. Millman, N. Mayorov,
A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J.
Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas,
D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen,
E. A. Quintero, C. R. Harris, A. M. Archibald, A. H.
Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy
1.0 Contributors, “SciPy 1.0: Fundamental Algorithms
for Scientific Computing in Python,” Nature Methods,
vol. 17, pp. 261–272, 2020.
[47] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel,
“Flow++: Improving flow-based generative models
with variational dequantization and architecture
design,” in ICML, vol. 97, 2019, pp. 2722–2730.
[48] M. T. Pearce, D. Müllensiefen, and G. A. Wiggins,
“Melodic grouping in music information retrieval:
New methods and applications,” in Advances in Music
Information Retrieval, 2010, vol. 274, pp. 364–388.
[49] S. Lattner, M. Grachten, K. Agres, and C. E. C.
Chacón, “Probabilistic segmentation of musical
sequences using restricted boltzmann machines,” in
MCM, vol. 9110, 2015, pp. 323–334.
[50] S. Lattner, C. E. C. Chacón, and M. Grachten,
“Pseudo-supervised training improves unsupervised melody
segmentation,” in IJCAI. AAAI Press, 2015,
pp. 2459–2465.
[51] M. Müller and F. Zalkow, “libfmp: A python package
for fundamentals of music processing,” J. Open Source
Softw., vol. 6, no. 63, p. 3326, 2021. [Online].
Available: https://doi.org/10.21105/joss.03326
[52] G. Daras, A. Rodriguez-Munoz, A. Klivans,
A. Torralba, and C. Daskalakis, “Ambient diffusion omni:
Training good models with bad data,” arXiv preprint
arXiv:2506.10038, 2025.