Adaptive Path of Prediction: An Unsupervised Method for Modeling Note-Level Informational Hierarchy of Polyphony

Author: Xiaoxuan Wang; Martin Rohrmeier

Publisher: Zenodo

DOI: 10.5281/zenodo.17706519

Source: https://zenodo.org/records/17706519/files/000065.pdf

ADAPTIVE PATH OF PREDICTION: AN UNSUPERVISED METHOD FOR
MODELING NOTE-LEVEL INFORMATIONAL HIERARCHY OF
POLYPHONY
Xiaoxuan Wang
EPFL
Martin Rohrmeier
EPFL
ABSTRACT
Polyphonic music presents a unique challenge to computational modeling due to the complex interactions of multiple simultaneous musical streams and the need to capture both local and global structural relationships. We propose Adaptive Path of Prediction, a discrete diffusion model that learns the informational hierarchy of polyphony in an unsupervised manner. By training the model to find optimal note-removal paths, and to reversibly reconstruct these selectively removed notes, we reveal how critical musical events, which persist until later stages of data corruption, maximize the preserved information and guide the prediction of remaining content. Drawing on compression learning theory, we posit that such adaptively-discovered “anchor notes” reflect the system’s ability to make an explicit abstraction of polyphonic music. Our experiments demonstrate that the model converges on consistent note-importance distinctions and can achieve better reconstruction performance in selected denoising paths than random ones. Furthermore, the model’s assignment of note importance during the training process increasingly aligns with a reductive music analysis dataset, suggesting that our unsupervised framework can uncover structural hierarchies consistent with established music-theoretical views.
1. INTRODUCTION
Polyphony describes a very common musical phenomenon in which multiple streams of notes occur simultaneously. When perceiving a polyphonic piece, a cognitive system experienced with the complexity of polyphony identifies the interactions and relationships between musical structures, thereby forming an understanding of their roles and structural hierarchy [1, 2]. Using machine learning methods to model the structural understanding of polyphony is non-trivial because of the scarcity of note-level labels of structural hierarchy for supervised learning and the inherent complexity of polyphony, which demands systems capable of capturing both local and long-distance structural relationships [3, 4].
© X. Wang, and M. Rohrmeier. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: X. Wang, and M. Rohrmeier, “Adaptive Path of Prediction: An unsupervised method for modeling note-level informational hierarchy of polyphony”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
The lack of labeled data and the fact that human music acquisition usually happens without explicit instructions about structural relationships between notes highlight the importance of unsupervised learning on this problem. Most unsupervised learning models of melody, rhythm, or harmony structure (e.g., n-gram models [5]) rely on two assumptions: (a) perfect perception, meaning the model captures all input information and uses all of the information to update the parameters regardless of complexity, and (b) autoregressive prediction, meaning the model always makes left-to-right predictions using the previous context of the sequence, which only allows unidirectional dependencies $P(x_t \mid x_{<t}) \approx P(x_t \mid x_1, x_2, \ldots, x_{t-1})$.
With multiple simultaneous lines unfolding in time, our cognitive system is unlikely to process all note-level information and relationships in one pass [6]. This challenges assumption (a). However, a highly important feature of music, and music listening, is repetition; compositions often repeat musical phrases, and listeners frequently revisit pieces. Repetition allows listeners to focus on different elements to discern how notes are organized into polyphonic structures [7, 8]. During this process, notes at different time points may be explicitly remembered. This memory enables the use of bi-directional information when forming predictions, which challenges assumption (b). In other words, due to the complex nature of polyphony that invites repeated listening, a purely autoregressive method (one that lacks an explicit mechanism for retaining and reusing previously attended notes) may provide an incomplete picture of polyphonic relationships, especially from a cognitive science perspective.
Recognizing the importance of bi-directional information, a natural question to ask is how to determine which notes should be explicitly memorized. Compression learning theory [9] suggests that a cognitive system is internally rewarded when it discovers the regularities of the complex environment and learns to optimally compress it. Therefore, acquiring an adaptive abstraction and generative capability for a particular class of data (e.g., tonal music) can be highly advantageous, as it removes the need to devise a separate compression scheme for each data point. Consequently, if a system can quickly identify a set of explicit “anchor notes” that encapsulate a polyphonic phrase’s fundamental information distributions and long-distance dependencies, these anchors can guide predictions of less critical details. In other words, anchor notes are the system’s explicit abstraction of the phrase; they reflect the system’s understanding of note-level informational hierarchy.
Motivated by these considerations, we propose the Adaptive Path of Prediction, an unsupervised model that learns the informational hierarchy of notes and can explicitly report it. We adopt Diffusion Denoising [10], a generative framework that iteratively corrupts the input data and learns to reconstruct it.
In this study, we investigate an optimized path of data corruption and reconstruction for symbolic music. We replace the conventional random corruption process with a system that learns to preserve the structurally important elements to benefit the reconstruction process. Our proposed model has two parts: a Diffusion Ordering Network and a Denoising Network. Specifically, the Diffusion Ordering Network is trained using reinforcement learning, leveraging feedback from the Denoising Network to optimize the selection of the data corruption path.
To allow an accurate discrete diffusion of polyphony, we avoid sequential methods of temporal encoding. Instead, we develop a new encoding method for symbolic music: a musical graph that minimizes time dependencies between notes.
To summarize, our contributions include: (1) the first unsupervised method that learns to explicitly report note-level informational hierarchy in polyphony; (2) the first investigation of using non-autoregressive prediction to model structural cognition of polyphony; and (3) a novel graph-based symbolic representation of polyphony that minimizes temporal dependencies between notes.
2. RELATED WORKS
2.1 Compression learning theory
To maintain stability in a dynamic environment, biological agents are driven to minimize long-term sensory surprise by refining their predictive capabilities [11]. In doing so, we naturally adopt encoding strategies that enhance our ability to predict sequences of events [12]. However, given our limited cognitive resources, we must develop concise and generalized representations that capture input regularities, rather than creating a unique encoding for each event sequence [9, 13]. This drive to optimize our encoding strategy fuels our curiosity: rather than permanently increasing cognitive load, temporarily allocating resources to extract the regularities of novel stimuli ultimately refines our adaptive model of environmental patterns and reduces the cognitive load [14, 15]. In other words, our innate curiosity motivates us to encounter unfamiliar music not merely to learn its specific structure, but also to refine a flexible strategy of listening, one that enhances our ability to understand, predict, and appreciate diverse musical works.
Moreover, compression learning is inherently generative and creative. By mastering a succinct representation of regularities, the cognitive system is empowered to synthesize novel content based on an abstracted framework [16]. This capability contrasts sharply with conventional statistical models, which are typically restricted to reproducing the patterns they have already encountered [3].
2.2 Modeling music long-distance dependencies and structural hierarchy
Theoretical frameworks like musical grammars have been used to model long-distance dependencies and structural hierarchy [17, 18], yet unsupervised approaches to capturing note-level hierarchy in polyphony remain scarce. A notable exception is the Music Transformer [19], which utilizes self-attention mechanisms to capture both local and long-distance relationships between notes in polyphonic music. Although visualizations of self-attention provide insights into these relationships and the corresponding notes’ importance [20], such analyses yield an indirect interpretation of the model’s understanding. Moreover, while Transformer-based models can, in principle, represent long-range dependencies through self-attention, they still rely on an autoregressive decoding order. Our motivation is to propose a framework where the structure of prediction itself is order-adaptive and interpretable.
2.3 Diffusion denoising models
Diffusion denoising models [10, 21] have opened a new path for generative modeling. Although most early applications focused on continuous domains (e.g., images), discrete diffusion methods have now emerged to handle symbolic data [22]. Discrete diffusion has also been applied to the symbolic music generation task [23].
Diffusion denoising models generally follow an abstract-to-concrete generation scheme. Due to this nature, they have been used to generate music in explicit hierarchical steps, from form to phrases, and melodies and accompaniments [24]. However, there are no existing studies that use the diffusion denoising model as a tool for unsupervised learning of the structural hierarchy of polyphony.
An important variant of discrete diffusion models is the autoregressive discrete diffusion model [25], which sequentially absorbs one dimension at a time in forward diffusion. It keeps the diffusion model’s non-sequential generation capability while limiting the number of diffusion steps to at most the input data’s dimensionality. Recent work in molecular graph generation has extended this framework by introducing Diffusion Ordering Networks that learn optimal, data-dependent absorption trajectories [26]. These learned paths provide a powerful mechanism for explicitly modeling structural dependencies, an ability conceptually aligned with compression learning theory. Unlike the variational autoencoder [27], which compresses data into a latent space, this approach offers the potential to produce explicit and interpretable compressed representations of music. This motivates our work, which applies discrete diffusion and learned ordering mechanisms to the unsupervised learning of note-level informational hierarchies in polyphonic music, thereby revealing the inherent structural dependencies of polyphonic compositions.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1. Illustration of a training step of our model
3. METHOD
3.1 Adaptive Ordering of Music Discrete Diffusion
Our model adapts absorbing state diffusion (as in D3PMs [22]) to symbolic music by introducing two key modifications. First, rather than using an absorbing state mask for data corruption, we simply delete notes. We think the deletion process is conceptually close to music reduction analysis, which may encourage the model to better exploit the remaining note information and learn an explicit informational hierarchy. The feasibility of the deletion action rests on our graph-based data representation, which will be discussed in the next subsection. Second, we replace the random forward diffusion process $q(x_t \mid x_{t-1})$ with an adaptive Diffusion Ordering Network $q_\phi(n_t \mid x_t, x_0)$ [26] that selects, at each step $t$, the note $n_t$ to remove.
Our system comprises two co-evolving components: the Diffusion Ordering Network and the Denoising Network $p_\theta(n_{t-1} \mid x_t)$. The Denoising Network predicts the most recently removed note $n_{t-1}$ given the current input $x_t$, while the Diffusion Ordering Network learns a structurally-dependent note removal order that benefits reconstruction. In training, the ordering network’s direct goal is to remove less-important notes at each step, thereby indirectly preserving “anchor notes.”
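The forward (corruption) process described above can be sketched as repeatedly choosing one remaining note and deleting it. This is a minimal illustration, not the authors' implementation: `score_fn` is a hypothetical stand-in for the Diffusion Ordering Network, and notes are represented as plain `(pitch, onset)` tuples rather than graph nodes.

```python
import math
import random

def sample_removal_trajectory(notes, score_fn, rng):
    """Sample one note-removal order until no notes remain.

    `score_fn(note, remaining)` is a hypothetical stand-in for the ordering
    policy: a higher score makes a note more likely to be removed now.
    """
    remaining = list(notes)
    order = []
    while remaining:
        # Softmax over the notes still present (max-shifted for stability).
        scores = [score_fn(n, remaining) for n in remaining]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        order.append(remaining.pop(pick))
    return order

# Toy usage: the toy policy prefers removing higher pitches first.
notes = [(60, 0.0), (64, 0.0), (67, 0.5), (72, 1.0)]
order = sample_removal_trajectory(notes, lambda n, rest: n[0] / 12.0, random.Random(0))
```

The notes that appear last in `order` play the role of "anchor notes" in this toy setting, since they survive the longest.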
Figure 1 illustrates a single training step of the Diffusion Ordering Network, where it samples $M$ trajectories. In every step $t$ of every trajectory $m$, the Denoising Network generates a probability distribution over its predictions of the pitch label, the onset position, and the offset position of the removed note $n_{t-1}$. The negative cost of these predictions in all steps and trajectories is accumulated and serves as the feedback for the reinforcement learning of the Diffusion Ordering Network. We therefore define the reward $R_{t,m}$ in Eq. (1) as:
$$R_{t,m} = -\left(-\frac{T-t}{T}\,\log p_\theta\!\left(n_{t-1}^{(m)} \mid x_t^{(m)}\right)\right). \tag{1}$$
Note that $R_{t,m}$ is also weighted by the position $\frac{T-t}{T}$ in the trajectory to emphasize the effectiveness of the preserved “anchor notes” for data reconstruction.
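As a small worked sketch of Eq. (1), the reward is the negated position-weighted negative log-likelihood, which simplifies to the weight times the log-probability. The function names are illustrative, and the log-probability is assumed to be a single scalar per step (for instance, the sum of the pitch, onset, and offset log-probabilities; the text does not spell out how the three are combined).

```python
import math

def step_reward(log_prob, t, T):
    """Reward of Eq. (1) for step t of a trajectory with T steps."""
    weight = (T - t) / T
    cost = -weight * log_prob   # position-weighted negative log-likelihood
    return -cost                # the reward is the negative cost

def trajectory_rewards(log_probs):
    """log_probs[t - 1] is assumed to hold log p(n_{t-1} | x_t) for t = 1..T."""
    T = len(log_probs)
    return [step_reward(lp, t, T) for t, lp in enumerate(log_probs, start=1)]

# Toy usage: the denoiser assigns probability 0.5 to every removed note.
rewards = trajectory_rewards([math.log(0.5)] * 4)
```

With a constant log-probability, the rewards shrink in magnitude along the trajectory because the weight falls from 3/4 at the first step to 0 at the last.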
To reduce variance in the raw stepwise rewards, we adopt an advantage actor-critic (A2C) paradigm [28]: we introduce a Critic Network $V_\psi(x)$ that estimates the value of the current input $x$. The advantage at step $t$ of trajectory $m$ is then
$$A_{t,m} = R_{t,m} + \gamma\, V_\psi(x_{t+1}) - V_\psi(x_t), \tag{2}$$
where the discount factor $\gamma$ controls how strongly future rewards are weighted relative to immediate rewards.
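A minimal sketch of the one-step advantage in Eq. (2), written for a single trajectory. The layout of `values` (one critic estimate per state, plus one trailing entry for the state after the last removal) is an assumption made for this illustration.

```python
def advantages(rewards, values, gamma=0.99):
    """One-step advantages A_t = R_t + gamma * V(x_{t+1}) - V(x_t).

    rewards[i] is the reward for step t = i + 1; values[i] is the critic's
    estimate for the state at that step, with one extra trailing entry for
    the state reached after the final removal (assumed layout).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [r + gamma * values[i + 1] - values[i] for i, r in enumerate(rewards)]

# Toy usage with gamma = 1 so the arithmetic is easy to follow.
adv = advantages([1.0, 2.0], [0.5, 0.5, 0.0], gamma=1.0)
```

Here the first step gives 1.0 + 0.5 - 0.5 and the second gives 2.0 + 0.0 - 0.5.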
The Diffusion Ordering Network $q_\phi(n_t \mid x_t, x_0)$ plays the role of the actor. Under the advantage actor-critic framework, its parameters are updated by ascending the advantage-weighted log-likelihood:
$$\Delta\phi \leftarrow \frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} A_{t,m}\, \nabla_\phi \log q_\phi\!\left(n_t \mid x_t, x_0\right). \tag{3}$$
Meanwhile, the Critic Network $V_\psi(x)$ is updated by minimizing the temporal-difference error (i.e., squared advantage) between its prediction and the bootstrapped return:
$$L(\psi) = \frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} \left(A_{t,m}\right)^2. \tag{4}$$
Minimizing Eq. (4) ensures that the Critic Network accurately estimates the true return, stabilizing the advantage used in the actor update (Eq. (3)). Together, these updates define the Diffusion Ordering Network’s parameter update.
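The actor and critic objectives of Eqs. (3) and (4) can be sketched as two scalar functions over nested lists indexed by trajectory and step. This is an illustration without automatic differentiation: `actor_surrogate` returns the scalar whose gradient with respect to the actor parameters would match Eq. (3) when the advantages are treated as constants, which is the usual policy-gradient convention and an assumption here.

```python
def critic_loss(adv):
    """Eq. (4): mean squared advantage over M trajectories of T steps each."""
    M, T = len(adv), len(adv[0])
    return sum(a * a for traj in adv for a in traj) / (M * T)

def actor_surrogate(adv, log_q):
    """Advantage-weighted log-likelihood, averaged over M * T steps.

    log_q[m][t] is the log-probability the ordering policy gave to the note
    it actually removed; advantages are treated as constants.
    """
    M, T = len(adv), len(adv[0])
    total = sum(a * lq for ta, tq in zip(adv, log_q) for a, lq in zip(ta, tq))
    return total / (M * T)

# Toy usage: one trajectory with two steps.
loss_psi = critic_loss([[1.0, 1.5]])
objective_phi = actor_surrogate([[1.0, 1.5]], [[-2.0, -4.0]])
```

In a framework with autograd, one would typically minimize `critic_loss` and maximize `actor_surrogate` with separate optimizers.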
We now turn to the Denoising Network’s updates. During the update iteration of the Denoising Network, in each step $t$ of every trajectory $m$, the Denoising Network $p_\theta$ predicts the probability of the removed note $n_{t-1}$ given the current musical input $x_t$. We measure the negative log-likelihood and accumulate it across all steps and trajectories. Here, we again use the weight factor $\frac{T-t}{T}$. Hence, the total training loss of the Denoising Network over $M$ sampled trajectories can be written as:
$$L(\theta) = \frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} -\frac{T-t}{T}\,\log p_\theta\!\left(n_{t-1}^{(m)} \mid x_t^{(m)}\right). \tag{5}$$
To update the parameters $\theta$ of the Denoising Network, we take gradient steps that minimize $L(\theta)$:
$$\Delta\theta \leftarrow -\nabla_\theta L(\theta). \tag{6}$$
This step is performed separately from the reinforcement learning update of the Diffusion Ordering Network $q_\phi$ (Eq. (3)). By alternating the two training procedures in time, we ensure that the Denoising Network’s gradients do not interfere with the Diffusion Ordering Network’s policy gradients. Specifically, in one training iteration:
1. Sampling of the Diffusion Ordering Network: Sample $M$ trajectories using $q_\phi$ and, for each trajectory, compute the rewards $\{R_{t,m}\}$ and advantages $\{A_{t,m}\}$ (see Eqs. (1) and (2)).
2. Updating the Diffusion Ordering Network: Update the actor parameters $\phi$ and critic parameters $\psi$ (see Eqs. (3) and (4)).
3. Sampling of the Denoising Network: Sample $M$ trajectories using $q_\phi$.
4. Updating the Denoising Network: Compute the total denoising loss $L(\theta)$ over trajectories as in Eq. (5) and update the parameters $\theta$ (see Eq. (6)).
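The four steps above can be sketched as a single function that alternates the two updates. All three callables are hypothetical placeholders supplied by the caller; the sketch only shows the control flow, in particular that the denoiser is trained on trajectories resampled after the ordering policy has been updated.

```python
def training_iteration(sample_trajectories, update_ordering, update_denoiser, M=16):
    """One alternating training iteration (control flow only).

    The callables are hypothetical: `sample_trajectories(M)` draws M removal
    trajectories from the current ordering policy, and the two update
    functions apply the actor/critic and denoiser updates respectively.
    """
    # Steps 1-2: sample with the current policy, then update actor and critic.
    update_ordering(sample_trajectories(M))
    # Steps 3-4: resample with the updated policy, then update the denoiser only.
    update_denoiser(sample_trajectories(M))

# Toy usage with stubs that record the order of calls.
log = []
training_iteration(
    sample_trajectories=lambda M: log.append(("sample", M)) or ["traj"] * M,
    update_ordering=lambda trajs: log.append(("ordering", len(trajs))),
    update_denoiser=lambda trajs: log.append(("denoiser", len(trajs))),
    M=2,
)
```

Keeping the two updates in separate calls mirrors the stated goal that the denoiser's gradients should not flow into the ordering policy.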
3.2 Data Representation
Previous studies proposed graph-based encodings of symbolic music by linking notes via relative temporal relationships [29, 30]. While effective for certain tasks, these encodings become inflexible when supporting dynamic operations like note deletion, where temporal edges must be constantly updated. To address this, we propose a metrical-tree-based representation that encodes timing via hierarchical metrical nodes, aiming to minimize temporal dependencies and support flexible note deletion.
Metrical nodes are generated from the measure level through successive binary and ternary divisions. The type of division is distinguished by the edge type, and node labels encode their layer information. Each pitch-labeled note node connects via onset and offset edges to the shallowest relevant metrical node, encoding its rhythmic position.
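One way to picture the metrical tree is as a map from positions within a measure to the shallowest level at which a metrical node exists there. The sketch below is an assumption-laden illustration: positions are fractions of a single measure, every node is divided both in two and in three at each level, and the function names are invented for this example.

```python
from fractions import Fraction

def metrical_nodes(levels=4):
    """Map each metrical position (fraction of a measure) to its shallowest level."""
    nodes = {Fraction(0): 0}                  # level 0: the measure itself
    frontier = [(Fraction(0), Fraction(1))]   # (start, span) of nodes to subdivide
    for level in range(1, levels):
        children = []
        for start, span in frontier:
            for division in (2, 3):           # binary and ternary division
                for k in range(division):
                    pos = start + span * k / division
                    nodes.setdefault(pos, level)   # keep the shallowest level seen
                    children.append((pos, span / division))
        frontier = children
    return nodes

def attach_level(position, nodes):
    """Level of the shallowest metrical node at `position`, or None if off the grid."""
    return nodes.get(Fraction(position))

# Toy usage: a three-level grid.
grid = metrical_nodes(levels=3)
```

Because a note attaches to a metrical node rather than to its neighbours, deleting a note in this picture leaves every other note's timing information untouched.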
Note nodes interconnect through directed interval edges labeled by octave-invariant pitch intervals, facilitating the capture of long-distance musical relationships while reducing edge-type complexity. Although these interval edges assist predictive modeling, their reconstruction is not required of the Denoising Network.
All edges are directed for the graph neural network to flexibly capture relationships. Edges that seem bidirectional in the visualization actually represent overlapping pairs of directed edges.
Figure 2 illustrates a simplified example of this representation. For visual clarity, interval edges (with distinct labels) are colored uniformly, and only three metrical levels are shown, without triple subdivisions; the full training graphs contain four metrical levels with comprehensive divisions.
Figure 2. A simplified example of our music graph representation.
3.3 Network Architecture
For input encoding, the actor component of the Diffusion Ordering Network incorporates the original music graph and previously deleted nodes differentiated by sinusoidal positional encodings [31]. The Denoising Network introduces a special "super node" labeled "mask", connected via "mask" edges to all remaining nodes to aggregate global information. The actor and critic components of the Diffusion Ordering Network, along with the Denoising Network, share a similar encoder architecture but maintain separate parameters. Each encoder transforms node labels into 256-dimensional embeddings, which are subsequently processed through an alternating sequence (with residual connections) of three Relational Graph Convolutional Network [32] layers (corresponding to distinct graph edge types) and three Graph Attention Network [33] layers, each with 4 attention heads.
Decoding methods differ per component: the actor of the Diffusion Ordering Network decodes node embeddings through a four-layer MLP (dimension 256) with ReLU, employing an output mask to restrict selections to currently available note nodes. The critic compresses via mean pooling and a four-layer MLP with ReLU to get a scalar value estimating the graph value. Due to the complexity of the Denoising Network’s task, we apply additional refinement to embeddings through a 4-layer MLP. Decoding in this network is divided into pitch selection, directly derived from the super node embedding processed through a 3-layer MLP; and onset-offset predictions, which stack embeddings from attachable metrical nodes (metrical nodes without vertical parent onset nodes) with the super node embedding. These combined embeddings are independently processed through specialized onset and offset 3-layer MLP decoders, each finalized with softmax activation.
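The actor's output mask can be illustrated with a masked softmax: logits of nodes that are not currently removable note nodes receive zero probability, and the rest are renormalized. This is a plain-Python sketch of that general technique, not the authors' decoder; the function name and list-based interface are assumptions.

```python
import math

def masked_softmax(logits, available):
    """Softmax restricted to entries whose `available` flag is True.

    Unavailable entries (e.g., metrical nodes or already-deleted notes) get
    probability 0. Raises ValueError if nothing is available.
    """
    top = max(l for l, ok in zip(logits, available) if ok)   # for numerical stability
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, available)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: the middle node is masked out.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
```

Sampling the next note to remove from `probs` then can never select a masked node.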
3.4 Training Details
We train our model using a combined dataset comprising Bach chorales and Bach English and French Suites sourced from the Distant Listening Corpus [34] for one epoch. Four-measure musical segments are represented as graphs, employing a sliding window technique to avoid segmentation bias. During each training step, we use a batch size of 4 and sample 16 trajectories per graph for the Diffusion Ordering Network. The discount factor $\gamma$ of Eq. (2) is set to 0.99. All networks are optimized using the AdamW optimizer with a learning rate of 2e-4. Training is conducted on an RTX 4080 Laptop GPU for 122 hours.
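The sliding-window segmentation can be sketched as enumerating overlapping measure ranges. The window length of four measures comes from the text; the hop size of one measure is an assumption for illustration, since the text does not state it.

```python
def four_measure_windows(num_measures, window=4, hop=1):
    """Return (start, end) measure index pairs, end exclusive.

    `hop=1` is an assumed step size; pieces shorter than `window` yield nothing.
    """
    return [(s, s + window) for s in range(0, num_measures - window + 1, hop)]

# Toy usage: a six-measure piece.
windows = four_measure_windows(6)
```

Overlapping windows mean that each measure appears at several positions within a segment, which is one way to reduce dependence on where segment boundaries fall.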
4. EXPERIMENTS
4.1 Objective Measurements
4.1.1 Consistency of Note Importance Assignment
To evaluate whether the Diffusion Ordering Network finds a consistent strategy of ordering note removal (i.e., assigning note importance), we measure the divergence of its note-removal trajectories on unseen data. Every 20 training iterations, we perform a validation step on 8 graphs, randomly selected from the validation dataset. For each graph, the Diffusion Ordering Network samples 16 trajectories. We then compute the pairwise edit distance between these trajectories, and take the averaged edit distance as a metric of trajectory divergence. For comparison, we also collect the averaged edit distance of 16 randomly sampled trajectories.
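The divergence metric can be sketched as the mean pairwise edit distance between trajectories, each trajectory being a sequence of note identifiers in removal order. The text does not name the edit distance variant; the sketch assumes standard Levenshtein distance with unit costs.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit-cost edits, assumed variant)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def trajectory_divergence(trajectories):
    """Mean edit distance over all unordered pairs of trajectories."""
    pairs = list(combinations(trajectories, 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

# Toy usage: two identical removal orders and one reversed order.
divergence = trajectory_divergence([[1, 2, 3], [1, 2, 3], [3, 2, 1]])
```

A policy that always removes notes in the same order has divergence 0, while independent random orders give a much larger value.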
Figure 3. Comparison of trajectory divergence
As Figure 3 shows, as the training proceeds, the Diffusion Ordering Network’s arrangement of note removal order in different sampling trajectories gradually converges. At the end of training, we test with the entire test set. The Diffusion Ordering Network achieves an average divergence of 15.25 ± 5.19, while the random baseline yields 50.64 ± 14.03. These results demonstrate that our model learns a note removal ordering strategy that generalizes to unseen polyphonic music.
4.1.2 Reconstruction Performance
To assess whether the note removal order identified by the Diffusion Ordering Network benefits the reconstruction performance of the Denoising Network, we train a baseline model without the Diffusion Ordering Network (with a random forward diffusion $q(x_t \mid x_{t-1})$; all other training settings remain the same), and we examine their reconstruction/denoising performance by measuring teacher-forcing negative log-likelihood (NLL) on the test set. For each test sample, we generate 16 trajectories.
We evaluate both the raw and weighted NLL (see Eq. (5)); the weighted measure is reported as "Scaled NLL" in Table 1. With the same number of training iterations, our model that uses a Diffusion Ordering Network demonstrates lower NLL in both raw and weighted measures.
Model              Raw NLL         Scaled NLL
Diffusion Ordered  9.89 ± 1.22     4.92 ± 0.60
Baseline           10.96 ± 1.05    5.64 ± 0.49
Table 1. Comparison of reconstruction performance
4.2 Visualizing the Importance of Notes
Figure 4. Minuet in G major, BWV Anh. 114 in (a) full score with order of removal, (b) top 50 percent reserved notes, and (c) reductive analysis by Kirlin [35].
Figure 4 illustrates the model-identified note information hierarchy in the first four measures of Christian Petzold’s Minuet in G major (BWV Anh. 114). In panel (a), each note is labeled by the order in which it is removed; darker notes and higher numerical labels (or “final”) indicate notes retained until later stages of the data corruption process. In panel (b), we show only the top 50 percent of these retained notes. Panel (c) shows Kirlin’s reductive analysis of the phrase following Schenkerian analysis [35]. Musical reductive analyses, such as those of Schenker, provide a framework by which ornamental notes are progressively absorbed to reveal the condensed structural foundations of a melody or harmony [36]. In this example, the notes retained by the Diffusion Ordering Network closely align with those that a human analyst (i.e., Kirlin) would label as foundational to the structure of the musical phrase.
4.3 Comparison with Reductive Music Analysis
To compare our model’s note importance assignment with music reduction analyses on a large scale, we employ the largest Schenkerian analysis dataset available in machine-readable format [37]. Although the dataset is primarily based on note sequences (melodies or sequences that only imply multiple voices) rather than full polyphony, it still offers a useful resource to test our model.
Every 20 training iterations, we perform a validation step, where we run our Diffusion Ordering Network on the music examples of the dataset. We use Spearman’s Rho with repeated ranks to measure the correlation between the Diffusion Ordering Network’s note removal order and the hierarchical depth labels. In the annotations of [37], higher hierarchical depth indicates greater structural importance.
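Spearman's Rho with repeated ranks can be sketched as the Pearson correlation of tie-averaged ranks. Treating "repeated ranks" as average ranks for ties is an assumption, although it is the common convention; the function names are illustrative.

```python
import math

def average_ranks(values):
    """1-based ranks where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the tie-averaged ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Toy usage: removal step per note versus a tied depth label per note.
removal_step = [1, 2, 3, 4]
depth_label = [0, 1, 1, 2]
rho = spearman_rho(removal_step, depth_label)
```

Depth labels typically contain many ties because several notes share a hierarchical level, which is why the tie handling matters here.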
Figure 5. The correlation change between note removal order and the labeled note depth from [37] during training
Figure 5 shows an increasing correlation between note removal order and depth labels. Although the final average Spearman’s Rho only reaches 0.34, it should be emphasized that our model is trained without any explicit labels of notes’ structural importance.
5. DISCUSSION
The above experiments show that the cooperative training of the Denoising Network and the Diffusion Ordering Network discovers a generalized strategy to order notes based on informational importance for unseen data. Learning this ordering can improve data reconstruction performance. Moreover, the correlation of this order with music reduction analysis shows that our unsupervised approach can at least partially uncover structural hierarchies consistent with established music theory. Taken together, these results indicate that the order found by the collaborative optimization of our two networks actually exploits the dependencies in polyphonic music, rather than reaching an arbitrary agreement on a random order.
One likely reason for the moderate correlation level is the continuous nature of the note-removal path, whereas reductive annotations commonly group notes into discrete levels of structural importance. Music reduction’s discrete grouping of importance implies that human perception of notes’ hierarchy may also be discretized, prompting future work on segmenting our continuous order of importance into discrete layers. One potential method could be analyzing the change in information content and entropy in the adaptive order of prediction. A remaining challenge is the lack of large-scale, standardized subjective ratings of note-level importance, which makes it difficult to fully validate our unsupervised framework against human perception. Consequently, exploring behavioral paradigms to access participants’ assignment of note importance hierarchies would provide deeper insights into how well the learned hierarchy reflects real-world listening experiences.
The mathematical foundation of this study references the approach described in [26], yet our focus differs substantially. Whereas [26] is primarily concerned with the efficient prediction of new molecular structures, our work aims at modeling the structural understanding of a given musical stimulus. Consequently, our method introduces three primary innovations relative to its framework: (1) a graph representation and action space tailored to musically meaningful factors, (2) an Advantage Actor-Critic (A2C) update scheme [28] rather than simple REINFORCE (without baselines) [38] to stabilize the diffusion order, and (3) newly proposed validation and visualization methods specifically designed to assess the effectiveness of the adaptive note-removal order.
Despite operating on limited computational resources and a relatively small, genre-specific dataset, our method stands out as one of the first unsupervised approaches that explicitly models note importance in polyphonic music. Compared to earlier research on music reduction and hierarchical modeling, our approach uses a fully data-driven, reinforcement-learning paradigm. Expanding the training corpus both in size and stylistic diversity could help the model learn more robust, cross-composer adaptive strategies. Meanwhile, building separate models for different genres or composers may also reveal how stylistic conventions influence hierarchical note-importance assignments. Finally, although we showcase our approach in a graph-based representation, it could readily extend to other symbolic representations (e.g., piano roll variants or OctupleMIDI [39]) that encode time with minimal between-note dependencies, potentially broadening applicability.
6. CONCLUSION
In this paper, we introduced a discrete diffusion framework, Adaptive Path of Prediction, that learns to model note-level informational hierarchy in polyphonic music without relying on any explicit labels. Through a collaborative training process, the Diffusion Ordering Network and Denoising Network converge on an adaptive note-removal path, which, in turn, enhances reconstruction performance. Our experiments demonstrate that the learned ordering aligns, at least partially, with established music reduction analyses, suggesting that the ordering successfully captures rhythmic, melodic, and harmonic structural dependencies.
Our method is readily adaptable to other symbolic representations, and it could benefit from the introduction of additional constraints reflecting music psychological or theoretical principles. Our proposed framework highlights that controlled diffusion approaches can explicitly report the learned structural hierarchies previously latent in the deep learning modeling of music. It paves the way for larger, broader, and more nuanced unsupervised models of structural analysis of music.
7. ACKNOWLEDGEMENTS
We thank all members of the Digital and Cognitive Musicology Lab at EPFL for their valuable feedback. This research was supported by the Swiss National Science Foundation within the project “Distant Listening: Transitions of Tonality” (Grant no. 215701). In part, this project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 760081 – PMSB. The authors thank Mr. Claude Latour for generously supporting this research through the Latour chair in digital musicology.
8. REFERENCES
[1] E. J. C awley, B. E. Acke -Mills, R. E. Pas o e, and
S. Weil, “Change de ec ion in mul i- oice music: he
ole o musical s uc u e, musical aining, and ask de-
mands.” Jou nal o Expe imen al Psychology: Human
Pe cep ion and Pe o mance, ol. 28, no. 2, p. 367,
2002.
[2] S. Koelsch, M. Roh meie , R. To ecuso, and
S. Jen schke, “P ocessing o hie a chical syn ac ic
s uc u e in music,” P oceedings o he Na ional
Academy o Sciences, ol. 110, no. 38, pp. 15 443–
15 448, 2013.
[3] M. Roh meie and S. Koelsch, “P edic i e in o ma ion
p ocessing in music cogni ion. a c i ical e iew,” In e -
na ional Jou nal o Psychophysiology, ol. 83, no. 2,
pp. 164–175, 2012.
[4] C. Finkensiep and M. A. Roh meie , “Modeling and
in e ing p o o- oice s uc u e in ee polyphony,” in
P oceedings o he 22nd In e na ional Socie y o Mu-
sic In o ma ion Re ie al Con e ence. ISMIR, 2021,
pp. 189–196.
[5] M. T. Pea ce, “S a is ical lea ning and p obabilis ic
p edic ion in music cogni ion: mechanisms o s ylis-
ic encul u a ion,” Annals o he New Yo k Academy o
Sciences, ol. 1423, no. 1, pp. 378–395, 2018.
[6] K. C. Ba e , R. Ashley, D. L. S ai , E. Skoe, C. J.
Limb, and N. K aus, “Mul i- oiced music bypasses a -
en ional limi a ions in he b ain,” F on ie s in neu o-
science, ol. 15, p. 588914, 2021.
[7] E. H. Ma gulis, On epea : How music plays he mind.
Ox o d Uni e si y P ess, 2013.
[8] V. N. Salimpoo , D. H. Zald, R. J. Za o e, A. Daghe ,
and A. R. McIn osh, “P edic ions and he b ain: how
musical sounds become ewa ding,” T ends in cogni-
i e sciences, ol. 19, no. 2, pp. 86–91, 2015.
[9] J. Schmidhube , “D i en by comp ession p og ess:
A simple p inciple explains essen ial aspec s o
subjec i e beau y, no el y, su p ise, in e es ingness,
a en ion, cu iosi y, c ea i i y, a , science, music,
jokes,” 2009. [Online]. A ailable: h ps://a xi .o g/
abs/0812.4360
[10] J. Ho, A. Jain, and P. Abbeel, “Denoising di usion
p obabilis ic models,” 2020. [Online]. A ailable:
h ps://a xi .o g/abs/2006.11239
[11] K. F is on, “The ee-ene gy p inciple: a uni ied b ain
heo y?” Na u e e iews neu oscience, ol. 11, no. 2,
pp. 127–138, 2010.
[12] H. B. Ba low e al., “Possible p inciples unde lying he
ans o ma ion o senso y messages,” Senso y commu-
nica ion, ol. 1, no. 01, pp. 217–233, 1961.
[13] J. Rissanen, “Modeling by sho es da a desc ip ion,”
Au oma ica, ol. 14, no. 5, pp. 465–471, 1978.
[14] J. Go lieb, P.-Y. Oudeye , M. Lopes, and A. Ba anes,
“In o ma ion-seeking, cu iosi y, and a en ion: compu-
a ional and neu al mechanisms,” T ends in cogni i e
sciences, ol. 17, no. 11, pp. 585–593, 2013.
[15] A. Modi shanechi, K. Kond akiewicz, W. Ge s ne ,
and S. Haesle , “Cu iosi y-d i en explo a ion: oun-
da ions in neu oscience and compu a ional modeling,”
T ends in Neu osciences, ol. 46, no. 12, pp. 1054–
1066, 2023.
[16] J. Schmidhube , “Fo mal heo y o c ea i i y, un, and
in insic mo i a ion (1990–2010),” IEEE ansac ions
on au onomous men al de elopmen , ol. 2, no. 3, pp.
230–247, 2010.
[17] F. Le dahl and R. S. Jackendo , A Gene a i e Theo y
o Tonal Music, eissue, wi h a new p e ace. MIT
p ess, 1996.
[18] M. Rohrmeier, “Towards a generative syntax of tonal harmony,” Journal of Mathematics and Music, vol. 5, no. 1, pp. 35–53, 2011.
[19] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music transformer,” 2018. [Online]. Available: https://arxiv.org/abs/1809.04281
[20] A. Huang, M. Dinculescu, A. Vaswani, and D. Eck, “Visualizing music self-attention,” in Proc. NeurIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language, vol. 1, 2018, p. 4.
[21] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” 2015. [Online]. Available: https://arxiv.org/abs/1503.03585
[22] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured denoising diffusion models in discrete state-spaces,” 2023. [Online]. Available: https://arxiv.org/abs/2107.03006
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[23] M. Plasser, S. Peter, and G. Widmer, “Discrete diffusion probabilistic models for symbolic music generation,” 2023. [Online]. Available: https://arxiv.org/abs/2305.09489
[24] Z. Wang, L. Min, and G. Xia, “Whole-song hierarchical generation of symbolic music using cascaded diffusion models,” 2024. [Online]. Available: https://arxiv.org/abs/2405.09901
[25] E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. van den Berg, and T. Salimans, “Autoregressive diffusion models,” 2022. [Online]. Available: https://arxiv.org/abs/2110.02037
[26] L. Kong, J. Cui, H. Sun, Y. Zhuang, B. A. Prakash, and C. Zhang, “Autoregressive diffusion model for graph generation,” 2023. [Online]. Available: https://arxiv.org/abs/2307.08849
[27] D. P. Kingma, M. Welling et al., “Auto-encoding variational bayes,” 2013.
[28] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” 2016. [Online]. Available: https://arxiv.org/abs/1602.01783
[29] D. Jeong, T. Kwon, Y. Kim, and J. Nam, “Graph neural network for music score data and modeling expressive piano performance,” in International conference on machine learning. PMLR, 2019, pp. 3060–3070.
[30] E. Karystinaios and G. Widmer, “Graphmuse: A library for symbolic music graph processing,” arXiv preprint arXiv:2407.12671, 2024.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[32] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” 2017. [Online]. Available: https://arxiv.org/abs/1703.06103
[33] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” 2018. [Online]. Available: https://arxiv.org/abs/1710.10903
[34] J. Hentschel, Y. Rammos, M. Neuwirth, and M. Rohrmeier, “The distant listening corpus,” 2024. [Online]. Available: https://doi.org/10.5281/zenodo.13845439
[35] P. B. Kirlin, A probabilistic model of hierarchical music analysis. University of Massachusetts Amherst, 2014.
[36] H. Schenker, Neue musikalische Theorien und Phantasien. Universal-edition ag, 1910, vol. 2.
[37] S. Ni-Hahn, W. Xu, J. Yin, R. Zhu, S. Mak, Y. Jiang, and C. Rudin, “A new dataset, notation software, and representation for computational schenkerian analysis,” arXiv preprint arXiv:2408.07184, 2024.
[38] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, pp. 229–256, 1992.
[39] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu, “Musicbert: Symbolic music understanding with large-scale pre-training,” arXiv preprint arXiv:2106.05630, 2021.
