ADAPTIVE PATH OF PREDICTION: AN UNSUPERVISED METHOD FOR
MODELING NOTE-LEVEL INFORMATIONAL HIERARCHY OF
POLYPHONY
Xiaoxuan Wang
EPFL
[email protected]
Martin Rohrmeier
EPFL
[email protected]
ABSTRACT
Polyphonic music presents a unique challenge to computational modeling due to the complex interactions of multiple simultaneous musical streams and the need to capture both local and global structural relationships. We propose Adaptive Path of Prediction, a discrete diffusion model that learns the informational hierarchy of polyphony in an unsupervised manner. By training the model to find optimal note-removal paths, and to reversibly reconstruct these selectively removed notes, we reveal how critical musical events, which persist until late stages of data corruption, maximize the preserved information and guide the prediction of remaining content. Drawing on compression learning theory, we posit that such adaptively-discovered “anchor notes” reflect the system’s ability to make an explicit abstraction of polyphonic music. Our experiments demonstrate that the model converges on consistent note-importance distinctions and can achieve better reconstruction performance in selected denoising paths than random ones. Furthermore, the model’s assignment of note importance during the training process increasingly aligns with a reductive music analysis dataset, suggesting that our unsupervised framework can uncover structural hierarchies consistent with established music-theoretical views.
1. INTRODUCTION
Polyphony describes a very common musical phenomenon in which multiple streams of notes occur simultaneously. When perceiving a polyphonic piece, a cognitive system experienced with the complexity of polyphony identifies the interactions and relationships between musical structures, thereby forming an understanding of their roles and structural hierarchy [1, 2]. Using machine learning methods to model the structural understanding of polyphony is non-trivial because of the scarcity of note-level labels of structural hierarchy for supervised learning and the inherent complexity of polyphony, which demands systems capable of capturing both local and long-distance structural relationships [3, 4].

© X. Wang, and M. Rohrmeier. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: X. Wang, and M. Rohrmeier, “Adaptive Path of Prediction: An unsupervised method for modeling note-level informational hierarchy of polyphony”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
The lack of labeled data and the fact that human music acquisition usually happens without explicit instructions about structural relationships between notes highlight the importance of unsupervised learning for this problem. Most unsupervised learning models of melody, rhythm, or harmony structure (e.g., n-gram models [5]) rely on two assumptions: (a) perfect perception, meaning the model captures all input information and uses all of the information to update the parameters regardless of complexity, and (b) autoregressive prediction, meaning the model always makes left-to-right predictions using the previous context of the sequence, which only allows unidirectional dependencies $P(x_t \mid x_{<t}) \approx P(x_t \mid x_1, x_2, \ldots, x_{t-1})$.
With multiple simultaneous lines unfolding in time, our cognitive system is unlikely to process all note-level information and relationships in one pass [6]. This challenges assumption (a). However, a highly important feature of music, and music listening, is repetition; compositions often repeat musical phrases, and listeners frequently revisit pieces. Repetition allows listeners to focus on different elements to discern how notes are organized into polyphonic structures [7, 8]. During this process, notes at different time points may be explicitly remembered. This memory enables the use of bi-directional information when forming predictions, which challenges assumption (b). In other words, due to the complex nature of polyphony that invites repeated listening, a purely autoregressive method (one that lacks an explicit mechanism for retaining and reusing previously attended notes) may provide an incomplete picture of polyphonic relationships, especially from a cognitive science perspective.
Recognizing the importance of bi-directional information, a natural question to ask is how to determine which notes should be explicitly memorized. Compression learning theory [9] suggests that a cognitive system is internally rewarded when it discovers the regularities of the complex environment and learns to optimally compress it. Therefore, acquiring an adaptive abstraction and generative capability of a particular class of data (e.g., tonal music) can be highly advantageous, as it removes the need to devise a separate compression scheme for each data point. Consequently, if a system can quickly identify a set of explicit “anchor notes” that encapsulate a polyphonic phrase’s fundamental information distributions and long-distance dependencies, these anchors can guide predictions of less critical details. In other words, anchor notes are the system’s explicit abstraction of the phrase; they reflect the system’s understanding of note-level informational hierarchy.
Motivated by these considerations, we propose the Adaptive Path of Prediction, an unsupervised model that learns the informational hierarchy of notes and can explicitly report it. We adopt Diffusion Denoising [10], a generative framework that iteratively corrupts the input data and learns to reconstruct it.
In this study, we investigate an optimized path of data corruption and reconstruction of symbolic music. We replace the conventional random corruption process with a system that learns to preserve the structurally important elements to benefit the reconstruction process. Our proposed model has two parts: a Diffusion Ordering Network and a Denoising Network. Specifically, the Diffusion Ordering Network is trained using reinforcement learning, leveraging feedback from the Denoising Network to optimize the selection of the data corruption path.
To allow an accurate discrete diffusion of polyphony, we avoid sequential methods of temporal encoding. Instead, we develop a new encoding method of symbolic music: a musical graph that minimizes time dependencies between notes.
To summarize, our contributions include: (1) the first unsupervised method that learns to explicitly report note-level informational hierarchy in polyphony; (2) the first investigation of using non-autoregressive prediction to model structural cognition of polyphony; and (3) a novel graph-based symbolic representation of polyphony that minimizes temporal dependencies between notes.
2. RELATED WORKS
2.1 Compression learning theory
To maintain stability in a dynamic environment, biological agents are driven to minimize long-term sensory surprise by refining their predictive capabilities [11]. In doing so, we naturally adopt encoding strategies that enhance our ability to predict sequences of events [12]. However, given our limited cognitive resources, we must develop concise and generalized representations that capture input regularities, rather than creating a unique encoding for each event sequence [9, 13]. This drive to optimize our encoding strategy fuels our curiosity: rather than permanently increasing cognitive load, temporarily allocating resources to extract the regularities of novel stimuli ultimately refines our adaptive model of environmental patterns and reduces the cognitive load [14, 15]. In other words, our innate curiosity motivates us to encounter unfamiliar music not merely to learn its specific structure, but also to refine a flexible strategy of listening, one that enhances our ability to understand, predict, and appreciate diverse musical works.
Moreover, compression learning is inherently generative and creative. By mastering a succinct representation of regularities, the cognitive system is empowered to synthesize novel content based on an abstracted framework [16]. This capability contrasts sharply with conventional statistical models, which are typically restricted to reproducing the patterns they have already encountered [3].
2.2 Modeling music long-distance dependencies and structural hierarchy
Theoretical frameworks like musical grammars have been used to model long-distance dependencies and structural hierarchy [17, 18], yet unsupervised approaches to capturing note-level hierarchy in polyphony remain scarce. A notable exception is the Music Transformer [19], which utilizes self-attention mechanisms to capture both local and long-distance relationships between notes in polyphonic music. Although visualizations of self-attention provide insights into these relationships and the corresponding notes’ importance [20], such analyses yield an indirect interpretation of the model’s understanding. Moreover, while Transformer-based models can, in principle, represent long-range dependencies through self-attention, they still rely on an autoregressive decoding order. Our motivation is to propose a framework where the structure of prediction itself is order-adaptive and interpretable.
2.3 Diffusion denoising models
Diffusion denoising models [10, 21] have opened a new path for generative modeling. Although most early applications focused on continuous domains (e.g., images), discrete diffusion methods have now emerged to handle symbolic data [22]. Discrete diffusion has also been applied in the symbolic music generation task [23].
Diffusion denoising models generally follow an abstract-to-concrete generation scheme. Due to this nature, they have been used to generate music in explicit hierarchical steps, from form to phrases, and then melodies and accompaniments [24]. However, there are no existing studies that use the diffusion denoising model as a tool for unsupervised learning of the structural hierarchy of polyphony.
An important variant of discrete diffusion models is the autoregressive discrete diffusion model [25], which sequentially absorbs one dimension at a time in forward diffusion. It keeps the diffusion model’s non-sequential generation capability while limiting the number of diffusion steps to at most the input data’s dimensionality. Recent work in molecular graph generation has extended this framework by introducing Diffusion Ordering Networks that learn optimal, data-dependent absorption trajectories [26]. These learned paths provide a powerful mechanism for explicitly modeling structural dependencies, an ability conceptually aligned with compression learning theory. Unlike the variational autoencoder [27], which compresses data into a latent space, this approach offers the potential to produce explicit and interpretable compressed representations of music. This motivates our work, which applies discrete diffusion and learned ordering mechanisms to the unsupervised learning of note-level informational hierarchies in polyphonic music, thereby revealing the inherent structural dependencies of polyphonic compositions.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1. Illustration of a training step of our model
3. METHOD
3.1 Adaptive Ordering for Music Discrete Diffusion
Our model adapts absorbing state diffusion (as in D3PMs, [22]) to symbolic music by introducing two key modifications. First, rather than using an absorbing state mask for data corruption, we simply delete notes. We think the deletion process is conceptually close to music reduction analysis, which may encourage the model to better exploit the remaining note information and learn an explicit informational hierarchy. The feasibility of the deletion action rests on our graph-based data representation, which is discussed in Section 3.2. Second, we replace the random forward diffusion process $q(x_t \mid x_{t-1})$ with an adaptive Diffusion Ordering Network $q_\phi(n_t \mid x_t, x_0)$ [26], which selects, at each step $t$, the note $n_t$ to remove.
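As a minimal sketch of the deletion-based forward process (an illustration under stated assumptions, not the implementation used in this work), the code below removes one note per step and records the intermediate states. Notes are plain tuples rather than graph nodes, and the names `removal_trajectory`, `choose_note`, and `uniform_choice` are hypothetical. `uniform_choice` plays the role of the random forward process; a learned ordering policy would take its place.

```python
import random


def removal_trajectory(notes, choose_note, rng):
    """Delete one note per step until nothing is left.

    Returns (removed, states) where removed[t] is the note n_t selected
    from state x_t, and states[t + 1] is x_t with n_t deleted.
    `choose_note(current, original, rng)` is a stand-in for the ordering
    policy; it sees the current state and the original input.
    """
    original = tuple(notes)
    current = list(notes)
    states = [tuple(current)]
    removed = []
    while current:
        note = choose_note(tuple(current), original, rng)
        current.remove(note)
        removed.append(note)
        states.append(tuple(current))
    return removed, states


def uniform_choice(current, original, rng):
    """Random baseline: every remaining note is equally likely to be removed."""
    return rng.choice(current)


if __name__ == "__main__":
    # Hypothetical notes as (MIDI pitch, onset, offset) tuples.
    phrase = [(60, 0, 1), (64, 0, 1), (67, 1, 2)]
    order, _ = removal_trajectory(phrase, uniform_choice, random.Random(0))
    print(order)
```

The number of steps equals the number of notes, which matches the bound on diffusion steps noted for autoregressive discrete diffusion in Section 2.3.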
Our system comprises two co-evolving components: the Diffusion Ordering Network and the Denoising Network $p_\theta(n_{t-1} \mid x_t)$. The Denoising Network predicts the most recently removed note $n_{t-1}$ given the current input $x_t$, while the Diffusion Ordering Network learns a structurally-dependent note removal order that benefits reconstruction. In training, the ordering network’s direct goal is to remove less-important notes at each step, thereby indirectly preserving “anchor notes.”
Figure 1 illustrates a single training step of the Diffusion Ordering Network, where it samples $M$ trajectories. In every step $t$ of every trajectory $m$, the denoising network generates a probability distribution of its predictions of the pitch label, the onset position, and the offset position of the removed note $n_{t-1}$. The negative cost of these predictions in all steps and trajectories is accumulated and serves as the feedback for the reinforcement learning of the Diffusion Ordering Network. We therefore define the reward $R_{t,m}$ in Eq. (1) as:
$$R_{t,m} = -\left(-\frac{T-t}{T}\,\log p_\theta\!\left(n_{t-1}^{(m)} \,\middle|\, x_t^{(m)}\right)\right). \tag{1}$$
Note that $R_{t,m}$ is also weighted by the position $\frac{T-t}{T}$ in the trajectory to emphasize the effectiveness of the preserved “anchor notes” for data reconstruction.
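To make the position weighting concrete, the following sketch evaluates the reward of Eq. (1) for a single trajectory. It assumes the denoiser's log-probabilities are already available as a list; the function name `stepwise_rewards` is hypothetical and the input values are made up for illustration.

```python
import math


def stepwise_rewards(log_probs):
    """Rewards R_t for one trajectory, following Eq. (1).

    log_probs[t - 1] holds log p_theta(n_{t-1} | x_t) for t = 1..T.
    The reward is the negated weighted cost, i.e. ((T - t) / T) * log p.
    """
    T = len(log_probs)
    return [((T - t) / T) * lp for t, lp in enumerate(log_probs, start=1)]


if __name__ == "__main__":
    # Illustrative values only: the denoiser assigns probability 0.5 at every step.
    for t, r in enumerate(stepwise_rewards([math.log(0.5)] * 4), start=1):
        print(t, round(r, 4))
```

With the formula as written, the weight decreases linearly in $t$ and reaches zero at $t = T$, so the final step of a trajectory contributes no reward in this sketch.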
To reduce variance in the raw stepwise rewards, we adopt an advantage actor-critic (A2C) paradigm [28]: we introduce a Critic Network $V_\psi(x)$ that estimates the value of the current input $x$. The advantage at step $t$ of trajectory $m$ is then

$$A_{t,m} = R_{t,m} + \gamma\, V_\psi\!\left(x_{t+1}\right) - V_\psi\!\left(x_t\right), \tag{2}$$

where the discount factor $\gamma$ controls how strongly future rewards are weighted relative to immediate rewards.
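The one-step advantage of Eq. (2) can be sketched as follows. The treatment of the last step is not specified in the text, so this sketch assumes that the value following the final state is a constant `terminal_value` (zero by default); that choice, like the function name `td_advantages`, is an assumption made for illustration.

```python
def td_advantages(rewards, values, gamma=0.99, terminal_value=0.0):
    """One-step advantages A_t = R_t + gamma * V(x_{t+1}) - V(x_t).

    rewards[i] and values[i] refer to step t = i + 1 of one trajectory.
    terminal_value is used in place of V(x_{T+1}); this is an assumption
    of the sketch, not something stated in the paper.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    next_values = list(values[1:]) + [terminal_value]
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values, next_values)]


if __name__ == "__main__":
    # Made-up rewards and critic estimates for a three-step trajectory.
    print(td_advantages([-1.0, -0.5, 0.0], [-1.5, -0.5, 0.0]))
```

A positive advantage marks a removal that turned out better than the critic expected from that state, and a negative one marks a removal that turned out worse.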
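The Diffusion Ordering Network $q_\phi(n_t \mid x_t, x_0)$ plays the role of the actor. Under the advantage actor-critic framework, its parameters are updated by ascending the advantage-weighted log-likelihood:

$$\Delta\phi \leftarrow \frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} A_{t,m}\, \nabla_\phi \log q_\phi\!\left(n_t \,\middle|\, x_t, x_0\right). \tag{3}$$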
Meanwhile, the Critic Network $V_\psi(x)$ is updated by minimizing the temporal-difference error (i.e., squared advantage) between its prediction and the bootstrapped return:

$$L(\psi) = \frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} \left(A_{t,m}\right)^2. \tag{4}$$
Minimizing Eq. (4) drives the Critic Network toward an accurate estimate of the true return, which stabilizes the advantage used in the actor update (Eq. (3)). Together, these updates define the Diffusion Ordering Network’s parameter update.
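The two updates can be illustrated on a toy policy. The sketch below assumes, purely for illustration, that the actor is a softmax over one score per removable note; the gradient of the advantage-weighted log-probability with respect to those scores then has a closed form, and the critic loss is the mean squared advantage. The names `softmax`, `actor_gradient`, and `critic_loss` are hypothetical, and in practice the gradients would be taken with respect to network parameters by automatic differentiation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def actor_gradient(scores, action, advantage):
    """Gradient of advantage * log softmax(scores)[action] w.r.t. the scores.

    For a softmax policy, d log pi(a) / d score_i = 1[i == a] - pi_i.
    """
    probs = softmax(scores)
    return [advantage * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]


def critic_loss(advantages):
    """Mean squared advantage over all steps of all trajectories (cf. Eq. (4))."""
    return sum(a * a for a in advantages) / len(advantages)


if __name__ == "__main__":
    # A positive advantage raises the score of the chosen note (index 0).
    print(actor_gradient([0.0, 0.0], action=0, advantage=2.0))
    print(critic_loss([1.0, -1.0, 2.0]))
```

Ascending this gradient makes removals with positive advantage more likely and those with negative advantage less likely, which is the behavior Eq. (3) describes.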
We now turn to the Denoising Network’s updates. During the update iteration of the Denoising Network, in each step $t$ of every trajectory $m$, the Denoising Network $p_\theta$ predicts the probability of the removed note $n_{t-1}$ given the current musical input $x_t$. We measure the negative log-likelihood and accumulate it across all steps and trajectories. Here, we again use the weight factor $\frac{T-t}{T}$. Hence, the total training loss of the Denoising Network over $M$ sampled trajectories can be written as:

$$L(\theta) = \frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} -\frac{T-t}{T}\,\log p_\theta\!\left(n_{t-1}^{(m)} \,\middle|\, x_t^{(m)}\right). \tag{5}$$
To update the parameters $\theta$ of the Denoising Network, we take gradient steps that minimize $L(\theta)$:

$$\Delta\theta \leftarrow -\nabla_\theta L(\theta). \tag{6}$$

This step is performed separately from the reinforcement learning update of the Diffusion Ordering Network $q_\phi$ (Eq. (3)). By alternating the two training procedures in time, we ensure that the Denoising Network’s gradients do not interfere with the Diffusion Ordering Network’s policy gradients. Specifically, in one training iteration:
1. Sampling for the Diffusion Ordering Network: Sample $M$ trajectories using $q_\phi$ and, for each trajectory, compute the rewards $\{R_{t,m}\}$ and advantages $\{A_{t,m}\}$ (see Eqs. (1) and (2)).
2. Updating the Diffusion Ordering Network: Update the actor parameters $\phi$ and critic parameters $\psi$ (see Eqs. (3) and (4)).
3. Sampling for the Denoising Network: Sample $M$ trajectories using $q_\phi$.
4. Updating the Denoising Network: Compute the total denoising loss $L(\theta)$ over trajectories as in Eq. (5) and update the parameters $\theta$ (see Eq. (6)).
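The four steps above can be sketched as a small control-flow skeleton. All callables passed in (`sample_trajectories`, `update_ordering`, `update_denoiser`) are hypothetical placeholders for the corresponding network operations; only the order of operations is taken from the text.

```python
def training_iteration(graph, sample_trajectories, update_ordering,
                       update_denoiser, num_trajectories=16):
    """Run one alternating iteration over a single music graph.

    sample_trajectories(graph, m) -> list of m removal trajectories
    update_ordering(trajectories)  -> actor and critic update (Eqs. (3), (4))
    update_denoiser(trajectories)  -> denoiser update (Eqs. (5), (6))
    """
    # Steps 1 and 2: sample with the current ordering policy, then update it.
    ordering_batch = sample_trajectories(graph, num_trajectories)
    update_ordering(ordering_batch)

    # Steps 3 and 4: sample again (now with the updated policy), then update
    # the denoiser on the fresh batch only.
    denoising_batch = sample_trajectories(graph, num_trajectories)
    update_denoiser(denoising_batch)


if __name__ == "__main__":
    log = []
    training_iteration(
        graph="toy-graph",
        sample_trajectories=lambda g, m: log.append("sample") or [g] * m,
        update_ordering=lambda batch: log.append("update ordering"),
        update_denoiser=lambda batch: log.append("update denoiser"),
    )
    print(log)
```

Keeping the two batches separate means each network is updated on trajectories drawn for its own step, so neither update reuses gradients from the other.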
3.2 Data Representation
Previous studies proposed graph-based encodings of symbolic music by linking notes via relative temporal relationships [29, 30]. While effective for certain tasks, these encodings become inflexible when supporting dynamic operations like note deletion, where temporal edges must be constantly updated. To address this, we propose a metrical-tree-based representation that encodes timing via hierarchical metrical nodes, aiming to minimize temporal dependencies and support flexible note deletion.
Metrical nodes are generated from the measure level through successive binary and ternary divisions. The type of division is distinguished by the edge type, and node labels encode their layer information. Each pitch-labeled note node connects via onset and offset edges to the shallowest relevant metrical node, encoding its rhythmic position.
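One way to picture the metrical nodes is as positions inside a measure, each tagged with the shallowest layer at which a binary or ternary division first reaches it. The sketch below is a simplification under several assumptions: a measure is normalized to the interval [0, 1), every span is split both in two and in three at each layer, and the "relevant" node for an onset is taken to be the one whose position equals that onset. The function name `build_metrical_nodes` is hypothetical, and edge types are omitted.

```python
from fractions import Fraction


def build_metrical_nodes(depth):
    """Map each reachable position in [0, 1) to its shallowest layer.

    Layer 0 is the measure itself (position 0). Each deeper layer splits
    every span of the previous layer into two and into three equal parts.
    """
    shallowest = {Fraction(0): 0}
    spans = [(Fraction(0), Fraction(1))]
    for layer in range(1, depth + 1):
        next_spans = set()
        for start, end in spans:
            for parts in (2, 3):  # binary and ternary division
                width = (end - start) / parts
                for k in range(parts):
                    child_start = start + k * width
                    # Layers are visited in increasing order, so the first
                    # layer recorded for a position is the shallowest one.
                    shallowest.setdefault(child_start, layer)
                    next_spans.add((child_start, child_start + width))
        spans = sorted(next_spans)
    return shallowest


if __name__ == "__main__":
    nodes = build_metrical_nodes(3)
    for onset in (Fraction(0), Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)):
        print(onset, "-> layer", nodes[onset])
```

Because a note refers only to its own metrical positions, deleting another note leaves these attachments untouched, which is the property the representation is designed for.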
Note nodes interconnect through directed interval edges labeled by octave-invariant pitch intervals, facilitating the capture of long-distance musical relationships while reducing edge-type complexity. Although these interval edges assist predictive modeling, their reconstruction is not required for the Denoising Network.
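An octave-invariant interval label can be sketched as a pitch difference reduced modulo 12. The sketch assumes MIDI pitch numbers and connects every ordered pair of distinct notes; which pairs are actually linked in the training graphs is not stated here, so the full connectivity, like the names `interval_class` and `interval_edges`, is an assumption.

```python
def interval_class(pitch_from, pitch_to):
    """Directed pitch interval in semitones, reduced modulo the octave."""
    return (pitch_to - pitch_from) % 12


def interval_edges(pitches):
    """Directed edges (source index, target index, label) between all
    ordered pairs of distinct notes. Full connectivity is an assumption."""
    return [(i, j, interval_class(p, q))
            for i, p in enumerate(pitches)
            for j, q in enumerate(pitches)
            if i != j]


if __name__ == "__main__":
    # C4, E4, G4 as MIDI pitches.
    for edge in interval_edges([60, 64, 67]):
        print(edge)
```

Reducing modulo 12 caps the number of interval labels at twelve regardless of register, and the two directions of a pair carry complementary labels (for example 7 upward from C to G and 5 from G to C).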
All edges are directed for the graph neural network to flexibly capture relationships. Edges that seem bidirectional in the visualization actually represent overlapping pairs of directed edges.
Figure 2 illustrates a simplified example of this representation. For visual clarity, interval edges (with distinct labels) are colored uniformly, and only three metrical levels are shown, without triple subdivisions; the full training graphs contain four metrical levels with comprehensive divisions.
Figure 2. A simplified example of our music graph representation.
3.3 Network Architecture
For input encoding, the actor component of the Diffusion Ordering Network incorporates the original music graph and previously deleted nodes differentiated by sinusoidal positional encodings [31]. The Denoising Network introduces a special "super node" labeled "mask", connected via "mask" edges to all remaining nodes to aggregate global information. The actor and critic components of the Diffusion Ordering Network, along with the Denoising Network, share a similar encoder architecture but maintain separate parameters. Each encoder transforms node labels into 256-dimensional embeddings, which are subsequently processed through an alternating sequence (with residual connections) of three Relational Graph Convolutional Network [32] layers (corresponding to distinct graph edge types) and three Graph Attention Network [33] layers, each with 4 attention heads.
Decoding methods differ per component: the actor of the Diffusion Ordering Network decodes node embeddings through a four-layer MLP (dimension 256) with ReLU, employing an output mask to restrict selections to currently available note nodes. The critic compresses via mean pooling and a four-layer MLP with ReLU to get a scalar value estimating the graph value. Due to the complexity of the Denoising Network’s task, we apply additional refinement of embeddings through a 4-layer MLP. Decoding in this network is divided into pitch selection, directly derived from the super node embedding processed through a 3-layer MLP; and onset-offset predictions, which stack embeddings from attachable metrical nodes (metrical nodes without vertical parent onset nodes) with the super node embedding. These combined embeddings are independently processed through specialized onset and offset 3-layer MLP decoders, each finalized with softmax activation.
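The effect of the actor's output mask can be illustrated with a masked softmax: unavailable nodes receive probability zero and the remaining probabilities are renormalized. How the mask is applied inside the network is not specified in the text, so this is only one plausible reading, and the function name `masked_softmax` is hypothetical.

```python
import math


def masked_softmax(scores, available):
    """Softmax over `scores`, restricted to entries where `available` is True.

    Unavailable entries (already removed notes, metrical nodes, and so on)
    get probability exactly zero.
    """
    kept = [s for s, ok in zip(scores, available) if ok]
    if not kept:
        raise ValueError("at least one node must be available")
    top = max(kept)
    exps = [math.exp(s - top) if ok else 0.0
            for s, ok in zip(scores, available)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # The middle node has the highest score but is not selectable.
    print(masked_softmax([1.0, 5.0, 1.0], [True, False, True]))
```

Sampling from the resulting distribution can therefore never pick a node that is not a currently available note.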
3.4 Training Details
We train our model using a combined dataset comprising Bach chorales and Bach English and French Suites sourced from the Distant Listening Corpus [34] for one epoch. Four-measure musical segments are represented as graphs, employing a sliding window technique to avoid segmentation bias. During each training step, we use a batch size of 4 and sample 16 trajectories per graph for the diffusion ordering network. The discount factor $\gamma$ of Eq. (2) is set to 0.99. All networks are optimized using the AdamW optimizer with a learning rate of 2e-4. Training is conducted on an RTX 4080 Laptop GPU for 122 hours.
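The sliding-window segmentation can be sketched as follows. The window length of four measures comes from the text; the hop of one measure is an assumption, since the step size is not stated, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(measures, size=4, hop=1):
    """Overlapping segments of `size` consecutive measures.

    `hop` is the number of measures between window starts; hop=1 is an
    assumption of this sketch.
    """
    return [measures[i:i + size]
            for i in range(0, len(measures) - size + 1, hop)]


if __name__ == "__main__":
    # Measures are represented by their indices for illustration.
    print(sliding_windows(list(range(6))))
```

Because every measure can appear at each position within a window, no single choice of phrase boundaries is imposed on the training segments.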
4. EXPERIMENTS
4.1 Objective Measurements
4.1.1 Consistency of Note Importance Assignment
To evaluate whether the Diffusion Ordering Network finds a consistent strategy for ordering note removal (i.e., assigning note importance), we measure the divergence of its note-removal trajectories on unseen data. Every 20 training iterations, we perform a validation step on 8 graphs, randomly selected from the validation dataset. For each graph, the Diffusion Ordering Network samples 16 trajectories. We then compute the pairwise edit distance between these trajectories and take the averaged edit distance as a metric of trajectory divergence. For comparison, we also collect the averaged edit distance of 16 randomly sampled trajectories.
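The divergence metric can be sketched as the mean edit distance over all pairs of sampled trajectories. The text does not say which edit distance is used; this sketch assumes the Levenshtein distance with unit costs, applied to trajectories written as sequences of note identifiers. The names `edit_distance` and `trajectory_divergence` are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def trajectory_divergence(trajectories):
    """Average edit distance over all unordered pairs of trajectories."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        raise ValueError("need at least two trajectories")
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Three made-up removal orders over note ids 0..2.
    print(trajectory_divergence([[0, 1, 2], [0, 1, 2], [2, 1, 0]]))
```

Identical removal orders give a divergence of zero, so lower values indicate that independently sampled trajectories agree more closely.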
Figure 3. Comparison of trajectory divergence
As Figure 3 shows, as training proceeds, the note removal orders that the Diffusion Ordering Network produces in different sampled trajectories gradually converge. At the end of training, we test with the entire test set. The Diffusion Ordering Network achieves an average divergence of 15.25 ± 5.19, while the random baseline yields 50.64 ± 14.03. These results indicate that our model learns a note removal ordering strategy that generalizes to unseen polyphonic music.
4.1.2 Reconstruction Performance
To assess whether the note removal order identified by the Diffusion Ordering Network benefits the reconstruction performance of the Denoising Network, we train a baseline model without the Diffusion Ordering Network (with a random forward diffusion $q(x_t \mid x_{t-1})$; all other training settings remain the same), and we examine their reconstruction/denoising performance by measuring teacher-forcing negative log-likelihood (NLL) in the test set. For each test sample, we generate 16 trajectories.

Model               Raw NLL         Scaled NLL
Diffusion Ordered   9.89 ± 1.22     4.92 ± 0.60
Baseline            10.96 ± 1.05    5.64 ± 0.49

Table 1. Comparison of reconstruction performance

We evaluate both the raw and weighted NLL (see Eq. (5)); the weighted measure is reported as "Scaled NLL" in Table 1. After the same number of training iterations, our model that uses a Diffusion Ordering Network demonstrates lower NLL in both raw and weighted measures.
4.2 Visualizing the Importance of Notes
Figure 4. Minuet in G major, BWV Anh. 114 in (a) full score with order of removal, (b) top 50 percent reserved notes, and (c) reductive analysis by Kirlin [35].
Figure 4 illustrates the model-identified note information hierarchy in the first four measures of Christian Petzold’s Minuet in G major (BWV Anh. 114). In panel (a), each note is labeled by the order in which it is removed; darker notes and higher numerical labels (or “final”) indicate notes retained until late stages of the data corruption process. In panel (b), we show only the top 50 percent of these retained notes. Panel (c) shows Kirlin’s reductive analysis of the phrase following Schenkerian analysis [35]. Musical reductive analyses, such as those of Schenker, provide a framework by which ornamental notes are progressively absorbed to reveal the condensed structural foundations of a melody or harmony [36]. In this example, the notes retained by the Diffusion Ordering Network closely align with those that a human analyst (i.e., Kirlin) would label as foundational to the structure of the musical phrase.
4.3 Comparison with Reductive Music Analysis
To compare our model’s note importance assignment with music reduction analyses on a larger scale, we employ the largest Schenkerian analysis dataset available in machine-readable format [37]. Although the dataset is primarily based on note sequences (melodies or sequences that only imply multiple voices) rather than full polyphony, it still offers a useful resource to test our model.
Every 20 training iterations, we perform a validation step, where we run our Diffusion Ordering Network on the music examples of the dataset. We use Spearman’s Rho with repeated ranks to measure the correlation between the Diffusion Ordering Network’s note removal order and the hierarchical depth labels. In the annotations of [37], higher hierarchical depth indicates greater structural importance.
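The correlation measure can be sketched as the Pearson correlation of rank vectors, where "repeated ranks" is read here as assigning tied values their average rank (an assumption about the exact tie handling). The names `average_ranks`, `pearson`, and `spearman_rho` are hypothetical, and the example values are made up.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman rank correlation with average ranks for ties."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    removal_step = [1, 2, 3, 4, 5]   # later removal = retained longer
    depth_label = [0, 0, 1, 2, 2]    # coarse, tied annotation levels
    print(spearman_rho(removal_step, depth_label))
```

Tie handling matters here because the annotation assigns many notes to the same depth level, whereas the removal order gives every note a distinct position.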
Figure 5. The correlation change between note removal order and the labeled note depth from [37] during training
Figure 5 shows an increasing correlation between note removal order and depth labels. Although the final average Spearman’s Rho only reaches 0.34, it should be emphasized that our model is trained without any explicit labels of notes’ structural importance.
5. DISCUSSION
The above experiments show that the cooperative training of the Denoising Network and the Diffusion Ordering Network discovers a generalized strategy to order notes based on informational importance on unseen data. Learning this ordering can improve data reconstruction performance. Moreover, the correlation of this order with music reduction analysis shows that our unsupervised approach can at least partially uncover structural hierarchies consistent with established music theory. Taken together, these results indicate that the order found by the collaborative optimization of our two networks actually exploits the dependencies in polyphonic music, rather than reaching an arbitrary agreement on a random order.
One likely reason for the moderate correlation level is the continuous nature of the note-removal path, whereas reductive annotations commonly group notes into discrete levels of structural importance. Music reduction’s discrete grouping of importance implies that human perception of notes’ hierarchy may also be discretized, prompting future work on segmenting our continuous order of importance into discrete layers. One potential method could be analyzing the change in information content and entropy in the adaptive order of prediction. A remaining challenge is the lack of large-scale, standardized subjective ratings of note-level importance, which makes it difficult to fully validate our unsupervised framework against human perception. Consequently, exploring behavioral paradigms to access participants’ assignment of note importance hierarchies would provide deeper insights into how well the learned hierarchy reflects real-world listening experiences.
The mathematical foundation of this study references the approach described in [26], yet our focus differs substantially. Whereas [26] is primarily concerned with the efficient prediction of new molecular structures, our work aims at modeling the structural understanding of a given musical stimulus. Consequently, our method introduces three primary innovations relative to its framework: (1) a graph representation and action space tailored to musically meaningful factors, (2) an Advantage Actor-Critic (A2C) update scheme [28] rather than simple REINFORCE (without baselines) [38] to stabilize the diffusion order, and (3) newly proposed validation and visualization methods specifically designed to assess the effectiveness of the adaptive note-removal order.
Despite operating on limited computational resources and a relatively small, genre-specific dataset, our method stands out as one of the first unsupervised approaches that explicitly model note importance in polyphonic music. Compared to earlier research on music reduction and hierarchical modeling, our approach uses a fully data-driven, reinforcement-learning paradigm. Expanding the training corpus both in size and stylistic diversity could help the model learn more robust, cross-composer adaptive strategies. Meanwhile, building separate models for different genres or composers may also reveal how stylistic conventions influence hierarchical note-importance assignments. Finally, although we showcase our approach in a graph-based representation, it could readily extend to other symbolic representations (e.g., piano roll variants or OctupleMIDI [39]) that encode time with minimal between-note dependencies, potentially broadening applicability.
6. CONCLUSION
In this paper, we introduced a discrete diffusion framework, Adaptive Path of Prediction, that learns to model note-level informational hierarchy in polyphonic music without relying on any explicit labels. Through a collaborative training process, the Diffusion Ordering Network and Denoising Network converge on an adaptive note-removal path, which, in turn, enhances reconstruction performance. Our experiments demonstrate that the learned ordering aligns, at least partially, with established music reduction analyses, suggesting that the ordering successfully captures rhythmic, melodic, and harmonic structural dependencies.
Our method is readily adaptable to other symbolic representations, and it could benefit from the introduction of additional constraints reflecting music psychological or theoretical principles. Our proposed framework highlights that controlled diffusion approaches can explicitly report the learned structural hierarchies previously latent in the deep learning modeling of music. It paves the way for larger, broader, and more nuanced unsupervised models of structural analysis of music.
7. ACKNOWLEDGEMENTS
We thank all members of the Digital and Cognitive Musicology Lab at EPFL for their valuable feedback. This research was supported by the Swiss National Science Foundation within the project “Distant Listening: Transitions of Tonality” (Grant no. 215701). In part, this project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 760081 – PMSB. The authors thank Mr. Claude Latour for generously supporting this research through the Latour chair in digital musicology.
8. REFERENCES
[1] E. J. Crawley, B. E. Acker-Mills, R. E. Pastore, and S. Weil, “Change detection in multi-voice music: the role of musical structure, musical training, and task demands.” Journal of Experimental Psychology: Human Perception and Performance, vol. 28, no. 2, p. 367, 2002.
[2] S. Koelsch, M. Rohrmeier, R. Torrecuso, and S. Jentschke, “Processing of hierarchical syntactic structure in music,” Proceedings of the National Academy of Sciences, vol. 110, no. 38, pp. 15443–15448, 2013.
[3] M. Rohrmeier and S. Koelsch, “Predictive information processing in music cognition. a critical review,” International Journal of Psychophysiology, vol. 83, no. 2, pp. 164–175, 2012.
[4] C. Finkensiep and M. A. Rohrmeier, “Modeling and inferring proto-voice structure in free polyphony,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference. ISMIR, 2021, pp. 189–196.
[5] M. T. Pearce, “Statistical learning and probabilistic prediction in music cognition: mechanisms of stylistic enculturation,” Annals of the New York Academy of Sciences, vol. 1423, no. 1, pp. 378–395, 2018.
[6] K. C. Barrett, R. Ashley, D. L. Strait, E. Skoe, C. J. Limb, and N. Kraus, “Multi-voiced music bypasses attentional limitations in the brain,” Frontiers in neuroscience, vol. 15, p. 588914, 2021.
[7] E. H. Margulis, On repeat: How music plays the mind. Oxford University Press, 2013.
[8] V. N. Salimpoor, D. H. Zald, R. J. Zatorre, A. Dagher, and A. R. McIntosh, “Predictions and the brain: how musical sounds become rewarding,” Trends in cognitive sciences, vol. 19, no. 2, pp. 86–91, 2015.
[9] J. Schmidhuber, “Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes,” 2009. [Online]. Available: https://arxiv.org/abs/0812.4360
[10] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” 2020. [Online]. Available: https://arxiv.org/abs/2006.11239
[11] K. Friston, “The free-energy principle: a unified brain theory?” Nature reviews neuroscience, vol. 11, no. 2, pp. 127–138, 2010.
[12] H. B. Barlow et al., “Possible principles underlying the transformation of sensory messages,” Sensory communication, vol. 1, no. 01, pp. 217–233, 1961.
[13] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[14] J. Gottlieb, P.-Y. Oudeyer, M. Lopes, and A. Baranes, “Information-seeking, curiosity, and attention: computational and neural mechanisms,” Trends in cognitive sciences, vol. 17, no. 11, pp. 585–593, 2013.
[15] A. Modirshanechi, K. Kondrakiewicz, W. Gerstner, and S. Haesler, “Curiosity-driven exploration: foundations in neuroscience and computational modeling,” Trends in Neurosciences, vol. 46, no. 12, pp. 1054–1066, 2023.
[16] J. Schmidhuber, “Formal theory of creativity, fun, and intrinsic motivation (1990–2010),” IEEE transactions on autonomous mental development, vol. 2, no. 3, pp. 230–247, 2010.
[17] F. Lerdahl and R. S. Jackendoff, A Generative Theory of Tonal Music, reissue, with a new preface. MIT press, 1996.
[18] M. Rohrmeier, “Towards a generative syntax of tonal harmony,” Journal of Mathematics and Music, vol. 5, no. 1, pp. 35–53, 2011.
[19] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music transformer,” 2018. [Online]. Available: https://arxiv.org/abs/1809.04281
[20] A. Huang, M. Dinculescu, A. Vaswani, and D. Eck, “Visualizing music self-attention,” in Proc. NeurIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language, vol. 1, 2018, p. 4.
[21] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” 2015. [Online]. Available: https://arxiv.org/abs/1503.03585
[22] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured denoising diffusion models in discrete state-spaces,” 2023. [Online]. Available: https://arxiv.org/abs/2107.03006
[23] M. Plasser, S. Peter, and G. Widmer, “Discrete diffusion probabilistic models for symbolic music generation,” 2023. [Online]. Available: https://arxiv.org/abs/2305.09489
[24] Z. Wang, L. Min, and G. Xia, “Whole-song hierarchical generation of symbolic music using cascaded diffusion models,” 2024. [Online]. Available: https://arxiv.org/abs/2405.09901
[25] E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. van den Berg, and T. Salimans, “Autoregressive diffusion models,” 2022. [Online]. Available: https://arxiv.org/abs/2110.02037
[26] L. Kong, J. Cui, H. Sun, Y. Zhuang, B. A. Prakash, and C. Zhang, “Autoregressive diffusion model for graph generation,” 2023. [Online]. Available: https://arxiv.org/abs/2307.08849
[27] D. P. Kingma, M. Welling et al., “Auto-encoding variational bayes,” 2013.
[28] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” 2016. [Online]. Available: https://arxiv.org/abs/1602.01783
[29] D. Jeong, T. Kwon, Y. Kim, and J. Nam, “Graph neural network for music score data and modeling expressive piano performance,” in International conference on machine learning. PMLR, 2019, pp. 3060–3070.
[30] E. Karystinaios and G. Widmer, “Graphmuse: A library for symbolic music graph processing,” arXiv preprint arXiv:2407.12671, 2024.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[32] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” 2017. [Online]. Available: https://arxiv.org/abs/1703.06103
[33] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” 2018. [Online]. Available: https://arxiv.org/abs/1710.10903
[34] J. Hentschel, Y. Rammos, M. Neuwirth, and M. Rohrmeier, “The distant listening corpus,” 2024. [Online]. Available: https://doi.org/10.5281/zenodo.13845439
[35] P. B. Kirlin, A probabilistic model of hierarchical music analysis. University of Massachusetts Amherst, 2014.
[36] H. Schenker, Neue musikalische Theorien und Phantasien. Universal-edition ag, 1910, vol. 2.
[37] S. Ni-Hahn, W. Xu, J. Yin, R. Zhu, S. Mak, Y. Jiang, and C. Rudin, “A new dataset, notation software, and representation for computational schenkerian analysis,” arXiv preprint arXiv:2408.07184, 2024.
[38] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, pp. 229–256, 1992.
[39] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu, “Musicbert: Symbolic music understanding with large-scale pre-training,” arXiv preprint arXiv:2106.05630, 2021.