Conditional Diffusion as Latent Constraints for Unconditional Symbolic Music Generation Models

Author: Matteo Pettenò; Alessandro Mezza; Alberto Bernardini

Publisher: Zenodo

DOI: 10.5281/zenodo.17706331

Source: https://zenodo.org/records/17706331/files/000006.pdf

CONDITIONAL DIFFUSION AS LATENT CONSTRAINTS FOR
CONTROLLABLE SYMBOLIC MUSIC GENERATION
Matteo Pettenò, Alessandro Ilic Mezza, Alberto Bernardini
Dipartimento di Elettronica, Informazione e Bioingegneria
Politecnico di Milano, Milan, Italy
ABSTRACT
Recent advances in latent diffusion models have demonstrated state-of-the-art performance in high-dimensional time-series data synthesis while providing flexible control through conditioning and guidance. However, existing methodologies primarily rely on musical context or natural language as the main modality of interacting with the generative process, which may not be ideal for expert users who seek precise fader-like control over specific musical attributes. In this work, we explore the application of denoising diffusion processes as plug-and-play latent constraints for unconditional symbolic music generation models. We focus on a framework that leverages a library of small conditional diffusion models operating as implicit probabilistic priors on the latents of a frozen unconditional backbone. While previous studies have explored domain-specific use cases, this work, to the best of our knowledge, is the first to demonstrate the versatility of such an approach across a diverse array of musical attributes, such as note density, pitch range, contour, and rhythm complexity. Our experiments show that diffusion-driven constraints outperform traditional attribute regularization and other latent constraints architectures, achieving significantly stronger correlations between target and generated attributes while maintaining high perceptual quality and diversity.
1. INTRODUCTION
Latent Constraints (LC) [1] refer to a set of techniques for generating conditionally from unconditional generative models. Traditional methods for enforcing user-defined constraints during model training, such as using attribute-regularization losses or training on curated subsets, require extensive labeled data, large amounts of computational power, and time-consuming hyperparameter tuning. The cost of retraining becomes increasingly prohibitive as the number of conditioning variables grows, especially when expert users need the flexibility to choose from a range of different attributes at any given time, as is the case in symbolic music generation and computer-assisted music composition.

© M. Pettenò, A. I. Mezza, and A. Bernardini. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: M. Pettenò, A. I. Mezza, and A. Bernardini, “Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Deep latent-variable models like GANs and VAEs learn to generate diverse outputs by sampling from a structured latent space. By exploiting this property, LC provides a principled framework for endowing pre-trained unsupervised models with post-hoc conditional generation capabilities. This is achieved either explicitly, by optimizing a new model that imposes the desired behavior onto latent representations [1], or implicitly, by training a small personalized model to generate only from regions of the latent space [2]. In this way, at inference time, LC models yield latents that, once decoded, result in outputs with the desired attributes. LC is also closely related to latent translation [3], which introduces neural networks that bridge multimodal representations of different pre-trained generative models, conditioning on the respective domain labels.
Latent diffusion [4] can be thought of as a class of LC. In Latent Diffusion Models (LDMs), the pre-trained autoencoder is typically understood as a way to compress the data into a lower-dimensional space where reverse diffusion is computationally feasible. By conditioning the denoising process, though, it is possible to steer the decoder to generate outputs with desired characteristics by feeding it inputs that lie in certain regions of the latent space associated with the desired attributes of the output, just like in existing LC methods.
Diffusion-based symbolic music generation models have been conditioned on various inputs, including text prompts [5, 6], musical context [7], accompaniment [8], chords [9, 10], and rhythmic textures [10]. Recent methods, such as MelodyDiffusion [9] and Polyffusion [10], apply conditional denoising on piano roll representations, treating them as image-like data. As a result, sequence modeling is tightly coupled with control, making it infeasible to seamlessly substitute one conditioning signal for another without retraining the note-generation process. Notably, this also applies to many recent controllable generation methods that are not based on diffusion [11–13].
Against this backdrop, diffusion-driven LC offer a particularly compelling approach to achieving modular fader-like control [14, 15] over multiple musical attributes with an otherwise unconditional model, allowing users to manipulate different musical features along continuous axes through a range of attribute-specific LDMs.
LDM specifications depend on the base unconditional model; Denoising Diffusion Probabilistic Models (DDPM) [16] and Denoising Diffusion Implicit Models (DDIM) [17] operate on continuous latent spaces, whereas Discrete DDPM [18] operate on tokenized representations. In particular, discrete diffusion models have recently shown promising results for symbolic music generation [19–21]. Prior work also explored latent diffusion for specific domains, such as emotion-controlled symbolic music generation either by learning from emotion-labeled data [22] or by relying on emotion classifier guidance [23]. Post-hoc control over black-box music rules has been tackled in [24] by means of stochastic control guidance, which, inspired by control theory, entails sampling several realizations of the next denoising step and selecting the one most compliant with the rule.
In this work, by looking at latent diffusion through the lens of LC, we study LDMs as plug-and-play conditioning modules. Thus, we keep the base generative model fixed and develop a library of diffusion-driven LC models (“LC-Diff”) trained on a range of non-differentiable and possibly continuous musical attributes, including contour, note density, pitch range, and rhythm complexity.
We show that, compared to attribute-regularized VAEs [25, 26] and other LC architectures [3], LC-Diff improves fidelity (measured by Fréchet Music Distance [27]) and controllability (measured by the correlation between desired attributes and those of generated samples) across all attributes considered in the present study.
2. DIFFUSION AS LATENT CONSTRAINTS
Let $z \sim p(z \mid x)$ be the latent representation of an input sequence $x$ with $N$ tokens and attribute $a \in \mathbb{R}$. Diffusion models employ a Markov chain to progressively corrupt input data with Gaussian noise and learn to reverse the process. Forward (latent) diffusion begins with the representation $z_0 = z$ and gradually adds noise following a schedule $\beta_t$, with $t = 1, \dots, T$. At each step, Gaussian noise is introduced according to

$$q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I\right). \tag{1}$$

The LC-Diff reverse diffusion process aims to navigate the latent space of a pre-trained generative model by tracing a trajectory conditional on the target attribute starting from a noise sample $z_T \sim \mathcal{N}(0, I)$. A denoising function $\epsilon_\theta$ is trained to predict the additive noise at each step. $\epsilon_\theta$ is thus conditioned on $a$ and a time variable $\xi_t$ that can be either the diffusion step $\xi_t = t$ [16] or the continuous noise level $\xi_t = \sqrt{\bar\alpha_t}$ [28], where $\bar\alpha_t = \prod_{i=1}^{t} (1 - \beta_i)$.
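By sampling the conditional distribution of $z_t$ at an arbitrary timestep $t$ in closed form

$$q(z_t \mid z_0) = \mathcal{N}\!\left(z_t; \sqrt{\bar\alpha_t}\, z_0, (1-\bar\alpha_t) I\right), \tag{2}$$

it is possible to efficiently train $\epsilon_\theta$ by optimizing random terms of the following objective [16]:

$$\mathcal{L} = \mathbb{E}_{z_0 \sim p(z \mid x),\, \epsilon \sim \mathcal{N}(0, I),\, t} \left\lVert \epsilon - \epsilon_\theta(z_t, \xi_t, a) \right\rVert^2, \tag{3}$$

where $\xi_t$ is sampled from either $\mathcal{U}(\{1, \dots, T\})$ [16] or $\mathcal{U}(\sqrt{\bar\alpha_{t-1}}, \sqrt{\bar\alpha_t})$ [28].

Figure 1: LC-Diff conditioning networks. [Diagram: two parallel branches, one for the noise level and one for the attribute. Each branch applies Sinusoidal Encoding, a Linear layer, and SiLU, followed by two Linear layers that yield Shift and Scale. The attribute branch passes through a 0/1 gate, and the Shift and Scale outputs of the two branches are summed.]

Figure 2: LC-Diff denoiser architecture. [Diagram: Linear (2048) input layer; Residual Dense Block (×3), each containing a stack (×2) of Layer Norm, Shift/Scale modulation, SiLU, and Linear (2048), plus a residual connection; Linear (256) output layer.]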
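As a rough illustration of Eqs. (1)-(3), the following sketch builds a noise schedule, draws $z_t$ in closed form, and evaluates the squared-error term for a dummy predictor. It is not the authors' implementation: the helper names are hypothetical, the latent is random rather than an encoded melody, and the schedule endpoints are the ones reported later in Section 3.4.

```python
import math
import random

def linear_beta_schedule(T, beta_start, beta_end):
    """Return beta_1..beta_T on a linear grid (index 0 holds beta_1)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i), one value per step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(z0, alpha_bar_t, rng):
    """Draw z_t ~ q(z_t | z_0) as in Eq. (2); also return the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    z_t = [scale_signal * z + scale_noise * e for z, e in zip(z0, eps)]
    return z_t, eps

def noise_loss(eps, eps_pred):
    """Squared L2 norm inside the expectation of Eq. (3), for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
betas = linear_beta_schedule(T=1000, beta_start=1e-6, beta_end=1e-2)
alpha_bars = cumulative_alpha_bar(betas)
z0 = [rng.gauss(0.0, 1.0) for _ in range(256)]  # stand-in for an encoded melody
t = rng.randrange(1, 1001)                      # t ~ U({1, ..., T})
z_t, eps = q_sample(z0, alpha_bars[t - 1], rng)
# A trained denoiser would replace this all-zero placeholder prediction.
loss_if_predicting_zero = noise_loss(eps, [0.0] * len(eps))
```

In an actual training loop, the placeholder prediction would come from the denoiser evaluated on `z_t`, the time variable, and the attribute.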
2.1 Sampling
Different sampling strategies have been explored in the literature. At inference time, DDPMs [16] involve a stochastic Markov process where, at each intermediate step, a small amount of Gaussian noise is added back in to encourage diversity in the generated samples. However, following a stochastic trajectory typically requires a large number of steps, slowing down the sampling process. DDIMs [17] differ from DDPMs by making the process deterministic. With this class of models, forward diffusion is reversed by

$$z_{t-1} = \sqrt{\bar\alpha_{t-1}}\, f(z_t, \xi_t, a) + g(z_t, \xi_t, a), \tag{4}$$

where

$$f(z_t, \xi_t, a) = \frac{z_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(z_t, \xi_t, a)}{\sqrt{\bar\alpha_t}} \tag{5}$$

attempts to directly estimate $z_0$ from the current noisy latent $z_t$, while

$$g(z_t, \xi_t, a) = \sqrt{1-\bar\alpha_{t-1}}\, \epsilon_\theta(z_t, \xi_t, a) \tag{6}$$

ensures that the trajectory toward $z_0$ follows the direction pointing to $z_t$. The deterministic nature of DDIM makes it possible to skip intermediate denoising steps and perform only $T_s \ll T$ iterations of the reverse process, thereby enabling faster inference.
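The update in Eqs. (4)-(6) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the noise prediction is passed in as a plain list, and the evenly strided choice of the $T_s$ timesteps is an assumption, since the text does not state how the subsequence is selected.

```python
import math

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update following Eqs. (4)-(6)."""
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_ab_t = math.sqrt(1.0 - alpha_bar_t)
    # Eq. (5): estimate of z_0 from the current noisy latent.
    z0_hat = [(z - sqrt_one_minus_ab_t * e) / sqrt_ab_t
              for z, e in zip(z_t, eps_pred)]
    # Eq. (6): direction pointing to z_t.
    direction = [math.sqrt(1.0 - alpha_bar_prev) * e for e in eps_pred]
    # Eq. (4): combine both terms at the previous noise level.
    return [math.sqrt(alpha_bar_prev) * z0 + d
            for z0, d in zip(z0_hat, direction)]

def strided_timesteps(T, T_s):
    """Hypothetical evenly strided subsequence of T_s steps, from T down."""
    stride = T // T_s
    return list(range(T, 0, -stride))[:T_s]
```

If the predicted noise equals the noise actually used in the forward process, a single step lands exactly on the forward sample at the previous noise level, which is a convenient sanity check for an implementation.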
2.2 Conditioning
We aim to condition $\epsilon_\theta$ on continuous musical attributes $a \in \mathbb{R}$. Similarly, Chen et al. [28] found it beneficial to condition on the noise level instead of the discrete diffusion step, which [7] later adopted for LDM-based symbolic music generation. Thus, we are left with two continuous conditioning signals to be passed on to the diffusion model.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

(%)          Contour   Note Density   Pitch Range   Complexity
NM           63.36     76.63          46.70         48.67
P&L          37.44     10.11           0.41         50.59
LC-VAE-A     60.88     97.56          34.69         33.32
LC-VAE-SE    52.02     97.33          36.84         52.00
LC-Diff      85.60     98.59          80.97         94.93
Table 1: Pearson Correlation Coefficient (PCC) between target and decoded attributes.
We inject $a$ and $\sqrt{\bar\alpha_t}$ into $\epsilon_\theta$ through dedicated networks (see Figure 1). First, we apply Sinusoidal Encoding (SE) based on Transformer positional embeddings [29]

$$\Gamma(u) = \left[\sin(\omega_i(u)), \cos(\omega_i(u))\right]_{i=0}^{d/2} \tag{7}$$

where $\omega_i(u) = \frac{s u}{b^{2i/d}}$ [7], with $d \in \mathbb{N}$ the (even) dimensionality of the embedding, $b \in \mathbb{R}$ the base frequency, and $s \in \mathbb{R}$ a frequency scaling hyperparameter. The resulting SE features are passed through a linear layer with SiLU activations. Finally, we employ Feature-wise Linear Modulation (FiLM) [30], where two fully-connected layers yield shifts and scales, respectively, that modulate the activations of the denoiser (see Figure 2). The two conditioning branches run in parallel. This is equivalent to learning a single affine transformation, where scale and shift are the sum of FiLM outputs from the attribute and noise level conditioning networks.
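A small sketch may help make the encoding of Eq. (7) and the summed modulation concrete. Several details here are assumptions rather than facts from the paper: the index runs over $i = 0, \dots, d/2 - 1$ so that the output has exactly $d$ entries, sine and cosine features are concatenated rather than interleaved, the default values of `base` and `scale` are placeholders, and the modulation is taken to be `scale * h + shift`.

```python
import math

def sinusoidal_encoding(u, d=128, base=10000.0, scale=1.0):
    """Gamma(u): d/2 sine features followed by d/2 cosine features.

    Assumes i = 0, ..., d/2 - 1; base and scale are placeholder values.
    """
    half = d // 2
    omegas = [scale * u / (base ** (2 * i / d)) for i in range(half)]
    return [math.sin(w) for w in omegas] + [math.cos(w) for w in omegas]

def film(h, shift_attr, scale_attr, shift_noise, scale_noise):
    """Modulate activations h with the summed outputs of both branches."""
    return [
        (sa + sn) * x + (ba + bn)
        for x, ba, sa, bn, sn in zip(h, shift_attr, scale_attr,
                                     shift_noise, scale_noise)
    ]
```

Because the two branches only ever enter through their sums, zeroing the attribute branch leaves a modulation that depends on the noise level alone.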
To enhance controllability over the generated samples, we also apply Classifier-Free Guidance (CFG) [31] to the noise prediction:

$$\hat\epsilon_\theta(z_t, \xi_t, a) = (1 + w)\, \epsilon_\theta(z_t, \xi_t, a) - w\, \epsilon_\theta(z_t, \xi_t), \tag{8}$$

where $\epsilon_\theta(z_t, \xi_t)$ is the unconditional noise prediction and $w \in \mathbb{R}_{\geq 0}$ is the guidance scale. To make CFG effective, the model must learn to predict noise both with and without attribute conditioning. We achieve this through conditioning dropout (depicted as 0/1 in Figure 1), i.e., setting the outputs of the attribute conditioning network to zero with a certain probability when evaluating (3).
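The guidance rule in Eq. (8) and the conditioning dropout used during training can be sketched as follows. The function names are hypothetical, and the two noise predictions are passed in as plain lists instead of being produced by a network.

```python
import random

def cfg_noise(eps_cond, eps_uncond, w):
    """Eq. (8): extrapolate from the unconditional to the conditional prediction."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

def drop_attribute_conditioning(attr_features, p_drop, rng):
    """Conditioning dropout: zero the attribute-branch outputs with probability p_drop."""
    if rng.random() < p_drop:
        return [0.0] * len(attr_features)
    return list(attr_features)

rng = random.Random(0)
guided = cfg_noise([1.0, 2.0], [0.5, 1.0], w=3.0)
always_dropped = drop_attribute_conditioning([0.3, -0.2], p_drop=1.0, rng=rng)
```

With `w = 0` the rule reduces to the conditional prediction, so the guidance scale interpolates between plain conditional sampling and increasingly strong attribute enforcement.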
3. EVALUATION
3.1 Dataset
The models are designed to learn pitch sequence representations from four-bar monophonic melodies. We construct a large-scale dataset comprising melodies extracted from 176,581 MIDI files from the Lakh MIDI Dataset [32].¹

First, we assess whether each MIDI file contains time signature changes. If any are found, we segment the file and retain only sections with a 4/4 time signature. Each MIDI event is then quantized to the nearest sixteenth note.

A melody is defined as a sequence of pitches within the standard 88-key piano range, played by an instrument mapped to a valid MIDI program. A melody is considered complete when a full measure of silence occurs. We extract only melodies spanning at least four bars and comprising at least three distinct pitches. If multiple notes sound simultaneously, we follow the approach proposed in [33] and select only the highest-pitched note to ensure monophonic sequences. Subsequently, four-bar segments are extracted using a stride of one bar.

¹ C. Raffel, 2016, “The Lakh MIDI Dataset v0.1.” [Online]. Available: https://colinraffel.com/projects/lmd

               Contour   Note Density   Pitch Range   Complexity
Uncond. VAE    41.44 (single value, not attribute-specific)
NM             35.506    58.436         30.833        47.61
P&L            49.698    67.836         40.657        87.80
LC-VAE-A       30.197    29.450         30.257        32.435
LC-VAE-SE      29.161    30.124         31.274        30.166
LC-Diff        19.299    20.559         31.695        17.51
Table 2: Fréchet Music Distance [27].
For each melody thus extracted, we compute 13 musical attributes, including those outlined in Section 3.2.

Melodies are encoded as sequences of $N = 64$ integers in $P = \{0, \dots, 129\}$, where each element represents either a MIDI note number (0-127) or one of two special tokens: note off (128) and note hold (129). The dataset is divided into training, validation, and test sets, with training data augmented through transposition by a randomly selected number of semitones within a range of ±1 octave. The final dataset, consisting of 10,126,676 unique melodies, is publicly available.²
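To make the token vocabulary concrete, here is a hedged sketch of one way a quantized melody could be turned into a 64-step sequence and transposed. The exact convention is an assumption: an onset emits the pitch, every sustained step emits note hold, and every silent step emits note off. The input format (tuples of onset, duration, and pitch on a sixteenth-note grid) and the error raised on out-of-range transpositions are hypothetical choices as well.

```python
NOTE_OFF, NOTE_HOLD, SEQ_LEN = 128, 129, 64

def encode_melody(notes, seq_len=SEQ_LEN):
    """Encode (onset_step, duration_steps, midi_pitch) tuples as tokens.

    Assumed convention: onset -> pitch, sustained step -> NOTE_HOLD,
    silent step -> NOTE_OFF.
    """
    tokens = [NOTE_OFF] * seq_len
    for onset, duration, pitch in sorted(notes):
        if not 0 <= pitch <= 127:
            raise ValueError("pitch must be a MIDI note number in 0..127")
        for step in range(onset, min(onset + duration, seq_len)):
            tokens[step] = pitch if step == onset else NOTE_HOLD
    return tokens

def transpose(tokens, semitones):
    """Shift pitch tokens by `semitones`; special tokens are left untouched."""
    shifted = []
    for tok in tokens:
        if tok < NOTE_OFF:
            tok = tok + semitones
            if not 0 <= tok <= 127:
                raise ValueError("transposition leaves the MIDI range")
        shifted.append(tok)
    return shifted

tokens = encode_melody([(0, 4, 60), (8, 2, 64)])
augmented = transpose(tokens, 12)  # one octave up, within the stated range
```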
3.2 Musical Attributes
As previously done in [25], we focus on four musical attributes: (i) Contour, which quantifies the melodic movement in a sequence, measured by averaging the pitch differences between consecutive notes; (ii) Note Density, defined as the ratio between the number of notes in the melody and the sequence length. It takes values in $[0, 1]$; (iii) Pitch Range, defined as the difference between the highest and lowest MIDI pitch values in the sequence, normalized by the range of an 88-key piano. It takes values in $\left[0, \frac{127}{88}\right]$, where values above one indicate a range exceeding A0–C8; (iv) Rhythm Complexity, evaluated using Toussaint’s metrical complexity measure [34], corrected for the total number of notes in the sequence [26]. By definition, it takes on discrete values.
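The first three attributes can be computed directly from a token sequence, as the sketch below shows. It assumes that tokens below 128 are note onsets and that Contour averages absolute pitch differences; the text only says "pitch differences", so the absolute value is an assumption. Rhythm Complexity is omitted because it depends on the metrical weights defined in [34] and the correction in [26].

```python
NOTE_OFF, NOTE_HOLD = 128, 129
PIANO_RANGE = 88  # number of keys used for normalization

def pitches(tokens):
    """Return the pitch of every note onset, in order."""
    return [t for t in tokens if t < NOTE_OFF]

def contour(tokens):
    """Mean absolute interval (in semitones) between consecutive notes."""
    p = pitches(tokens)
    if len(p) < 2:
        return 0.0
    return sum(abs(b - a) for a, b in zip(p, p[1:])) / (len(p) - 1)

def note_density(tokens):
    """Number of notes divided by the sequence length."""
    return len(pitches(tokens)) / len(tokens)

def pitch_range(tokens):
    """Highest minus lowest pitch, normalized by the 88-key piano range."""
    p = pitches(tokens)
    return (max(p) - min(p)) / PIANO_RANGE if p else 0.0

example = [60, 129, 62, 128, 67, 129, 129, 128]
```

For the short `example` sequence, the three notes (60, 62, 67) span seven semitones, so the normalized range is 7/88.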
3.3 Unconditional Generative Model
As base unconditional model, we implement a β-VAE [35] based on MusicVAE [33]. This model, previously used in LDM-based symbolic music generation [7], also enables direct comparison with existing attribute-regularized VAEs (AR-VAEs) employing the same architecture [25, 26] (see Section 3.5).
The encoder $p_\psi(z \mid x)$ consists of a two-layer bidirectional LSTM network fed with four-bar pitch sequence representations (see Section 3.1), followed by two linear layers parameterizing the latent posterior. The hierarchical decoder $q_\phi(x \mid z)$ features two unidirectional LSTMs, with the bottom-level network autoregressively estimating the distribution over the sequence values via a softmax nonlinearity [33]. As such, each pitch sequence $x \in P^N$ is first mapped onto a single latent code $z \in \mathbb{R}^M$, $M = 256$, and since decoding amounts to a next token prediction task, the standard β-VAE objective [35]

$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{p_\psi(z \mid x)}\left[\log q_\phi(x \mid z)\right] + \beta\, D_{\text{KL}}\left[p_\psi(z \mid x) \,\|\, p(z)\right], \tag{9}$$

is implemented using cross-entropy as reconstruction loss.

² M. Pettenò, Aug. 2024, “4 Bars Monophonic Melodies Dataset (Pitch Sequence),” Zenodo, doi: https://doi.org/10.5281/zenodo.13369389

(a) NM (b) P&L (c) LC-VAE-A (d) LC-VAE-SE
Figure 3: Regression plots comparing target and decoded Contour attributes across baseline methods.

Figure 4: Regression plot comparing target and decoded Contour attributes using LC-Diff.
The unconditional model is trained for 40,000 iterations on a single NVIDIA Titan RTX GPU with a batch size of 512. The objective (9) is minimized using Adam, and the learning rate is decreased exponentially from $10^{-3}$ to $10^{-5}$ with a rate of 0.9999. The hyperparameter β is annealed exponentially from 0 to $10^{-3}$, which encourages the model to prioritize accurate sequence reconstruction during the early part of the training. Similarly to [33], we apply teacher forcing within the bottom-level decoder with a probability following a logistic schedule.
3.4 Conditional Diffusion Model
With latent codes being vectors in $\mathbb{R}^M$, we implement a DDIM model with a fully-connected denoiser network.³ Shown in Figure 2, the denoiser comprises an input layer with 2048 linear units, followed by three dense residual blocks. Each residual block comprises two stacks of LayerNorm, feature-wise modulation (responsible for joint attribute and time conditioning), SiLU, and a linear layer, plus a residual connection that shortcuts the input and output of the block. Finally, the output is linearly projected back onto $\mathbb{R}^M$.

³ Source code and audio examples are available at https://mpetteno.github.io/controllable-latent-diffusion/
We set the SE dimensionality to $d = 128$. The attribute and noise level conditioning networks have 512 and 2048 units in the first linear layer and FiLM layers, respectively.

In the forward process, $\beta_t$ follows a linear schedule from $10^{-6}$ to $10^{-2}$ over $T = 1000$ steps. Conversely, the number of sampling steps is set to $T_s = 100$.

We train the model with an attribute conditioning dropout probability of 20%. We then apply CFG with a guidance scale of $w = 3.0$ [31]. In our experiments, CFG proved fundamental to achieve attribute regularization.

The resulting denoiser network has 43.1 million parameters, and converges in just about 20 training epochs, half the iterations required by the unconditional model.
3.5 AR-VAE Baseline Methods
For comparison, we consider AR-VAEs [25, 26] with the same architecture as the unconditional model described in Section 3.3. AR-VAEs incorporate regularization during training by means of a supervised multi-task learning approach, with the goal of encoding the attribute $a$ in the $i$-th dimension $z_i$ of their latent spaces. This is achieved by including an AR loss term in (9)

$$\mathcal{L}_{\text{AR-VAE}} = \mathcal{L}_{\text{VAE}} + \gamma\, \mathcal{L}_{\text{AR}}, \tag{10}$$

where $\gamma \geq 0$ is a tunable hyperparameter controlling the strength of the regularization.

Mezza et al. [26] propose the use of

$$\mathcal{L}_{\text{AR}}^{\text{NM}} = \operatorname{MAE}(z_i, \tilde{a}), \tag{11}$$

where $\operatorname{MAE}(\cdot, \cdot)$ denotes the mean absolute error, and $\tilde{a}$ is the z-score of $a$.

Pati and Lerch [25] introduced a regularization term that enforces a monotonic relationship between $a$ and $z_i$, i.e.,

$$\mathcal{L}_{\text{AR}}^{\text{P\&L}} = \operatorname{MAE}\left(\tanh(\delta D_z), \operatorname{sign}(D_a)\right), \tag{12}$$

where $D_z$ and $D_a$ are pairwise distance matrices between $z_i$ and $a$ of all samples in a batch, respectively, and $\delta > 0$ is a tunable hyperparameter. As in [25], we set $\gamma = 1$ and $\delta = 10$. The remaining training details are the same as in Section 3.3. For brevity, we will later refer to the former AR method as “NM” and to the latter as “P&L.”
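The two regularizers in Eqs. (11) and (12) can be sketched on a toy batch as follows. This is an illustration under stated assumptions: the z-score uses batch statistics (the paper does not say which statistics are used), and the pairwise "distance" matrices are taken to be signed differences, which is what makes the sign and tanh terms meaningful.

```python
import math

def mae(xs, ys):
    """Mean absolute error between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def z_scores(values):
    """Standardize values with batch mean and (population) standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against a constant batch
    return [(v - mean) / std for v in values]

def nm_loss(z_i, a):
    """Eq. (11): MAE between the regularized dimension and the standardized attribute."""
    return mae(z_i, z_scores(a))

def sign(x):
    return (x > 0) - (x < 0)

def pl_loss(z_i, a, delta=10.0):
    """Eq. (12): MAE between tanh(delta * D_z) and sign(D_a) over all pairs."""
    dz = [math.tanh(delta * (zi - zj)) for zi in z_i for zj in z_i]
    da = [float(sign(ai - aj)) for ai in a for aj in a]
    return mae(dz, da)
```

Note that the P&L term only constrains the ordering of $z_i$ with respect to $a$, while the NM term ties $z_i$ to specific standardized values.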
3.6 LC-VAE Baseline Methods
Similarly to Tian and Engel [3], we implement LC through a conditional VAE (cVAE) trained on the representations of the base unconditional model (Section 3.3). The cVAE encoder consists of four linear layers with ReLU activations, followed by two Gating Mixing Layers (GML) that parameterize the innermost latent distribution. The decoder mirrors the encoder with four linear layers with ReLU activations, followed by an output GML. Except for using 2048 units in the fully-connected layers and $M' = 128$ latent variables, the cVAE architecture is the same as in [3].

Let $z \in \mathbb{R}^M$ be the latent representations of the unconditional model, $z_c \in \mathbb{R}^{M'}$ be the latent representations of the cVAE, and $a \in \mathbb{R}$ the sequence attribute. The authors of [3] considered binary labels, and one-hot vectors were thus concatenated with $z$ and $z_c$. Instead, we deal with continuous attributes. We implement two cVAE variants that differ in how $a$ is fed into the networks. In the first variant, later referred to as LC-VAE-A, we feed $\tilde{z} = [z^T, a]^T$ to the encoder, and $\tilde{z}_c = [z_c^T, a]^T$ to the decoder. In the second variant, named LC-VAE-SE, we concatenate $z$ and $z_c$, respectively, with the attribute SE, i.e., $\tilde{z} = [z^T, \Gamma(a)]^T$ and $\tilde{z}_c = [z_c^T, \Gamma(a)]^T$.

(a) NM (b) P&L (c) LC-VAE-A (d) LC-VAE-SE
Figure 5: Regression plots comparing target and decoded Rhythm Complexity attributes across baseline methods.

Figure 6: Regression plot comparing target and decoded Rhythm Complexity attributes using LC-Diff.
4. RESULTS
4.1 Attribute-Controlled Generation
To evaluate the controllability of the generative models under scrutiny, we sample the target attributes uniformly in the range of zero to the 99th percentile of the attribute distribution of the sequences in the test set.⁴ These equally-spaced values, which we refer to as target attributes, are fed to the respective conditioning network of LC-Diff, suitably transformed and plugged into the regularized dimension $z_i$ of the AR-VAEs, and concatenated to the input vector of the LC-VAE decoder networks.

⁴ Limiting the range to the 99th percentile is meant to exclude those sequences with abnormally high attribute values. We argue that these sequences are spurious, and we attribute their existence to the choice, borrowed from [33], of extracting melodies by naïvely picking the highest note at any given time.
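The grid of target attributes described in this section can be sketched as below. The percentile uses linear interpolation between order statistics, which is one common definition and an assumption here, and the number of targets is a free parameter that the text does not specify.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def target_grid(test_attributes, num_targets):
    """Equally spaced targets from zero to the 99th percentile of the test set."""
    upper = percentile(test_attributes, 99.0)
    if num_targets == 1:
        return [0.0]
    return [upper * k / (num_targets - 1) for k in range(num_targets)]

# Toy attribute values standing in for measurements on the test set.
grid = target_grid([float(v) for v in range(101)], num_targets=4)
```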
Table 1 lists the Pearson Correlation Coefficients (PCC) between the target attributes and those computed from the generated sequences (the higher, the better). LC-Diff consistently outperforms the two AR-VAEs (NM and P&L) and LC-VAEs (both with and without SE) for all attributes considered. Notably, LC-Diff is the only method among those considered in the present study to yield correlation scores higher than 80% across the board.
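For reference, the PCC between target and decoded attribute values can be computed as in the sketch below. The function name and the toy inputs are illustrative only; the values reported in Table 1 are this coefficient expressed as a percentage.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

targets = [0.0, 1.0, 2.0, 3.0]
decoded = [0.1, 0.9, 2.2, 2.8]  # toy values, not results from the paper
pcc_percent = 100.0 * pearson(targets, decoded)
```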
As for Contour, LC-Diff achieves a PCC of 85.60%, outperforming the next-best model, NM, by over 22 percentage points. The difference is less pronounced for Note Density, where LC-Diff (98.56%) improves upon the second-best model by just 1 percentage point. Nonetheless, LC-VAE-A and LC-VAE-SE already achieve 97.56% and 97.33%, respectively, suggesting that constraining the generative model is very effective compared to AR methods when it comes to rendering the desired number of notes. LC-Diff also demonstrates significant improvements in Pitch Range and Rhythm Complexity. For Pitch Range, it achieves a PCC of 80.97%, exceeding NM (46.70%) by 34.27 percentage points, while NM itself outperforms LC-VAEs by approximately 10 percentage points. For Rhythm Complexity, LC-Diff achieves a remarkable 94.93%, surpassing LC-VAE-SE (52%) by 42.93 percentage points.
Concerning AR models, while NM directly encodes the (standardized) distribution onto the $i$-th dimension of the latent space, there is no a priori way to know the monotonic relationship learned using the P&L regularization in (12). This explains the near-zero correlation observed for Pitch Range, and, in general, the overall lower PCC.
Figures 3 through 6 show the regression plots for Contour and Rhythm Complexity. Figure 3 and Figure 4 illustrate the case of a continuous distribution, while Figure 5 and Figure 6 exemplify a case where the attribute takes on integer values. Across both attributes, LC-Diff is characterized by a lower spread and a clear linear trend. In Figure 3, all baseline models show a tendency to produce excessively high contour values, whereas LC-Diff (Figure 4) appears to mitigate the issue. Likewise, Figure 5 reveals that all models but LC-Diff (Figure 6) tend to fail when the target Complexity values are low.
4.2 Data Fidelity
To evaluate the quality of the generated sequences, we use the Fréchet Music Distance (FMD) [27], a metric that extends the family of Fréchet Inception Distance [36] and Fréchet Audio Distance [37] to the symbolic music domain. FMD was computed between 22,016 melodies from the held-out test set and an equal number of generated sequences. To prevent the FMD from measuring a spurious divergence from the real attribute distribution, we condition the generation on the attributes of the reference sequences, rather than using evenly-spaced control values as in Section 4.1. By conditioning with attributes measured from the test set, indeed, we aim to simultaneously compare the fidelity of generated sequences and how well they conform to the desired attribute distribution.

(a) $a = 1.0 \rightarrow a_g = 1.03$ (b) $a = 3.0 \rightarrow a_g = 2.95$ (c) $a = 6.0 \rightarrow a_g = 6.19$
Figure 7: Examples of MIDI files generated by controlling the Contour attribute with LC-Diff.

(a) $a = 0 \rightarrow a_g = 0$ (b) $a = 15 \rightarrow a_g = 15$ (c) $a = 33 \rightarrow a_g = 33$
Figure 8: Examples of MIDI files generated by controlling Rhythm Complexity with LC-Diff. Quarter notes are indicated by solid vertical lines; odd pulses (strong) are indicated by dashed lines; even pulses (weak) are indicated by dotted lines.
Table 2 reports the results obtained using CLaMP 2 MIDI embeddings [38] (the lower, the better). For comparison, we report the FMD between the reference test set and the output of the unconditional VAE (see Section 3.3) obtained by decoding 22,016 samples from $\mathcal{N}(0, I)$.
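For readers unfamiliar with Fréchet-type metrics, the distance between two Gaussians fitted to embedding sets has the closed form below. It is given here as background and assumes that FMD, like the Fréchet Inception and Audio Distances it extends, fits a Gaussian with mean $\mu_r$ and covariance $\Sigma_r$ to the reference embeddings and another with $\mu_g$ and $\Sigma_g$ to the generated ones; the exact estimator used in [27] may differ in its details.

```latex
d^2\big(\mathcal{N}(\mu_r, \Sigma_r), \mathcal{N}(\mu_g, \Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)
```

The first term penalizes a shift between the two embedding means, and the trace term penalizes a mismatch in spread and correlation structure.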
The results presented in Table 2 demonstrate that the proposed LC-Diff model achieves the lowest FMD values for three of the four attributes, indicating superior performance in generating samples that align more closely with the statistical properties of real sequences. Notably, LC-Diff outperforms all baselines in Contour (19.299), Note Density (20.559), and Rhythm Complexity (17.51), significantly improving over both AR-VAEs and LC-VAEs. While LC-VAE-A achieves the best Pitch Range score (30.257), LC-Diff remains competitive (31.695). Overall, all LC methods outperform the unconditional base model (41.44), showing that introducing post-hoc control leads to more consistent and structured music generation, with better alignment to the desired attributes.
Finally, Figure 7 and Figure 8 illustrate the potential diversity in the generated samples produced by LC-Diff when conditioned on low, medium, and high values of Contour and Rhythm Complexity, respectively.
5. CONCLUSIONS
In this paper, we have explored latent diffusion through the lens of Latent Constraints (LC), demonstrating the efficacy of DDIMs as plug-and-play conditioning modules for symbolic music generation. By keeping the base generative model fixed, we trained diffusion-based LC models (LC-Diff) capable of controlling a range of non-differentiable and continuous musical attributes, including contour, note density, pitch range, and rhythm complexity. Our empirical evaluations reveal that LC-Diff significantly outperforms attribute-regularized VAEs and cVAE-based LC methods in terms of both fidelity and controllability, with absolute improvements of up to 12.65 in Fréchet Music Distance and 43 percentage points in correlation between desired and generated attributes. These results highlight the potential of denoising as a powerful tool for ad hoc fader-like control over multiple musical attributes along continuous axes, effectively transforming a pre-trained unconditional model into a controllable music generation system depending on the user’s needs. Future work will focus on expanding the library of LC-Diff models to include a wider range of musical attributes and exploring the integration of user interfaces for real-time control. Future experiments could also explore attribute-controlled input transformations by applying forward diffusion to encoded representations, rather than drawing noise samples from the standard normal prior. Furthermore, we aim to investigate the potential of LC for other generative techniques, such as flow matching and consistency models.
6. REFERENCES
[1] J. Engel, M. Hoffman, and A. Roberts, “Latent constraints: Learning to generate conditionally from unconditional generative models,” in International Conference on Learning Representations, 2018.
[2] M. Dinculescu, J. Engel, and A. Roberts, “MidiMe: Personalizing a MusicVAE model with user data,” in NeurIPS Workshop on Machine Learning for Creativity and Design, 2019.
[3] Y. Tian and J. Engel, “Latent translation: Crossing modalities by bridging generative models,” arXiv preprint arXiv:1902.08261, 2019.
[4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
[5] S. Wu and M. Sun, “Exploring the efficacy of pre-trained checkpoints in text-to-music generation task,” in The AAAI-23 Workshop on Creative AI Across Modalities, 2023.
[6] P. Jajoria and J. McDermott, “Text conditioned symbolic drumbeat generation using latent diffusion models,” arXiv preprint arXiv:2408.02711, 2024.
[7] G. Mittal, J. Engel, C. Hawthorne, and I. Simon, “Symbolic music generation with diffusion models,” in Proc. of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 468–475.
[8] M. Pasini, M. Grachten, and S. Lattner, “Bass accompaniment generation via latent diffusion,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1166–1170.
[9] S. Li and Y. Sung, “MelodyDiffusion: Chord-conditioned melody generation using a transformer-based diffusion model,” Mathematics, vol. 11, no. 8, 2023.
[10] L. Min, J. Jiang, G. Xia, and J. Zhao, “Polyffusion: A diffusion model for polyphonic score generation with internal and external controls,” in Proc. of the 24th International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 231–238.
[11] L. Kawai, P. Esling, and T. Harada, “Attributes-aware deep music transformation,” in Proc. of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020, pp. 670–677.
[12] S.-L. Wu and Y.-H. Yang, “MuseMorphose: Full-song and fine-grained piano music style transfer with one transformer VAE,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1953–1967, 2023.
[13] M. E. Malandro, “Composer’s Assistant 2: Interactive multi-track MIDI infilling with fine-grained user control,” in Proc. of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024, pp. 438–445.
[14] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato, “Fader networks: Manipulating images by sliding attributes,” in Advances in Neural Information Processing Systems, 2017, pp. 5963–5972.
[15] H. H. Tan and D. Herremans, “Music FaderNets: Controllable music generation based on high-level features via low-level feature modelling,” in Proc. of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020, pp. 109–116.
[16] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 1–12.
[17] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in International Conference on Learning Representations, 2021.
[18] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured denoising diffusion models in discrete state-spaces,” in Advances in Neural Information Processing Systems, 2021, pp. 1–13.
[19] A. Lv, X. Tan, P. Lu, W. Ye, S. Zhang, J. Bian, and R. Yan, “GETMusic: Generating any music tracks with a unified representation and diffusion framework,” arXiv preprint arXiv:2305.10841, 2023.
[20] M. Plasser, S. Peter, and G. Widmer, “Discrete diffusion probabilistic models for symbolic music generation,” in Proc. of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023.
[21] J. Zhang, G. Fazekas, and C. Saitis, “Composer style-specific symbolic music generation using vector quantized discrete diffusion models,” in 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), 2024, pp. 1–6.
[22] J. Zhang, G. Fazekas, and C. Saitis, “Fast diffusion GAN model for symbolic music generation controlled by emotions,” arXiv preprint arXiv:2310.14040, 2023.
[23] M. Zhang, L. J. Ferris, L. Yue, and M. Xu, “Emotionally guided symbolic music generation using diffusion models: The AGE-DM approach,” in Proc. of the 6th ACM International Conference on Multimedia in Asia, 2024, pp. 1–5.
[24] Y. Huang, A. Ghatare, Y. Liu, Z. Hu, Q. Zhang, C. S. Sastry, S. Gururani, S. Oore, and Y. Yue, “Symbolic music generation with non-differentiable rule guided diffusion,” arXiv preprint arXiv:2402.14285, 2024.
[25] A. Pati and A. Lerch, “Attribute-based regularization of latent spaces for variational auto-encoders,” Neural Computing and Applications, vol. 33, no. 9, pp. 4429–4444, 2021.
[26] A. I. Mezza, M. Zanoni, and A. Sarti, “A latent rhythm complexity model for attribute-controlled drum pattern generation,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2023, no. 1, 2023.
[27] J. Retkowski, J. Stępniak, and M. Modrzejewski, “Frechet music distance: A metric for generative symbolic music evaluation,” arXiv preprint arXiv:2412.07948, 2024.
[28] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in International Conference on Learning Representations, 2021.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[30] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proc. of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 3942–3951.
[31] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[32] C. Raffel, “Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching,” Ph.D. dissertation, Columbia University, 2016.
[33] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, “A hierarchical latent vector model for learning long-term structure in music,” in Proc. of the 35th International Conference on Machine Learning, 2018, pp. 4364–4373.
[34] G. Toussaint, “A mathematical analysis of African, Brazilian, and Cuban clave rhythms,” in Bridges: Mathematical Connections in Art, Music, and Science, 2002, pp. 157–168.
[35] I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained variational framework,” International Conference on Learning Representations, vol. 3, 2017.
[36] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proc. of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6629–6640.
[37] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in Proc. Interspeech 2019, 2019, pp. 2350–2354.
[38] S. Wu, Y. Wang, R. Yuan, Z. Guo, X. Tan, G. Zhang, M. Zhou, J. Chen, X. Mu, Y. Gao, Y. Dong, J. Liu, X. Li, F. Yu, and M. Sun, “CLaMP 2: Multimodal music information retrieval across 101 languages using large language models,” arXiv preprint arXiv:2410.13267, 2025.
