CONDITIONAL DIFFUSION AS LATENT CONSTRAINTS FOR
CONTROLLABLE SYMBOLIC MUSIC GENERATION
Matteo Pettenò, Alessandro Ilic Mezza, Alberto Bernardini
Dipartimento di Elettronica, Informazione e Bioingegneria
Politecnico di Milano, Milan, Italy
ABSTRACT
Recent advances in latent diffusion models have demonstrated state-of-the-art performance in high-dimensional time-series data synthesis while providing flexible control through conditioning and guidance. However, existing methodologies primarily rely on musical context or natural language as the main modality of interacting with the generative process, which may not be ideal for expert users who seek precise fader-like control over specific musical attributes. In this work, we explore the application of denoising diffusion processes as plug-and-play latent constraints for unconditional symbolic music generation models. We focus on a framework that leverages a library of small conditional diffusion models operating as implicit probabilistic priors on the latents of a frozen unconditional backbone. While previous studies have explored domain-specific use cases, this work, to the best of our knowledge, is the first to demonstrate the versatility of such an approach across a diverse array of musical attributes, such as note density, pitch range, contour, and rhythm complexity. Our experiments show that diffusion-driven constraints outperform traditional attribute regularization and other latent constraints architectures, achieving significantly stronger correlations between target and generated attributes while maintaining high perceptual quality and diversity.
1. INTRODUCTION
Latent Constraints (LC) [1] refer to a set of techniques for generating conditionally from unconditional generative models. Traditional methods for enforcing user-defined constraints during model training, such as using attribute-regularization losses or training on curated subsets, require extensive labeled data, large amounts of computational power, and time-consuming hyperparameter tuning. The cost of retraining becomes increasingly prohibitive as the number of conditioning variables grows, especially when expert users need the flexibility to choose from a range of different attributes at any given time, as is the case in symbolic music generation and computer-assisted music composition.

© M. Pettenò, A. I. Mezza, and A. Bernardini. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: M. Pettenò, A. I. Mezza, and A. Bernardini, “Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Deep latent-variable models like GANs and VAEs learn to generate diverse outputs by sampling from a structured latent space. By exploiting this property, LC provides a principled framework for endowing pre-trained unsupervised models with post-hoc conditional generation capabilities. This is achieved either explicitly, by optimizing a new model that imposes the desired behavior onto latent representations [1], or implicitly, by training a small personalized model to generate only from regions of the latent space [2]. In this way, at inference time, LC models yield latents that, once decoded, result in outputs with the desired attributes. LC is also closely related to latent translation [3], which introduces neural networks that bridge multimodal representations of different pre-trained generative models, conditioning on the respective domain labels.
Latent diffusion [4] can be thought of as a class of LC. In Latent Diffusion Models (LDMs), the pre-trained autoencoder is typically understood as a way to compress the data into a lower-dimensional space where reverse diffusion is computationally feasible. By conditioning the denoising process, though, it is possible to steer the decoder to generate outputs with desired characteristics by feeding it inputs that lie in certain regions of the latent space associated with the desired attributes of the output, just like in existing LC methods.
Diffusion-based symbolic music generation models have been conditioned on various inputs, including text prompts [5, 6], musical context [7], accompaniment [8], chords [9, 10], and rhythmic textures [10]. Recent methods, such as MelodyDiffusion [9] and Polyffusion [10], apply conditional denoising on piano roll representations, treating them as image-like data. As a result, sequence modeling is tightly coupled with control, making it infeasible to seamlessly substitute one conditioning signal for another without retraining the note-generation process. Notably, this also applies to many recent controllable generation methods that are not based on diffusion [11–13].
Against this backdrop, diffusion-driven LC offer a particularly compelling approach to achieving modular fader-like control [14, 15] over multiple musical attributes with an otherwise unconditional model, allowing users to manipulate different musical features along continuous axes through a range of attribute-specific LDMs.
LDM specifications depend on the base unconditional model; Denoising Diffusion Probabilistic Models (DDPMs) [16] and Denoising Diffusion Implicit Models (DDIMs) [17] operate on continuous latent spaces, whereas Discrete DDPMs [18] operate on tokenized representations. In particular, discrete diffusion models have recently shown promising results for symbolic music generation [19–21]. Prior work also explored latent diffusion for specific domains, such as emotion-controlled symbolic music generation, either by learning from emotion-labeled data [22] or by relying on emotion classifier guidance [23]. Post-hoc control over black-box music rules has been tackled in [24] by means of stochastic control guidance, which, inspired by control theory, entails sampling several realizations of the next denoising step and selecting the one most compliant with the rule.
In this work, by looking at latent diffusion through the lens of LC, we study LDMs as plug-and-play conditioning modules. Thus, we keep the base generative model fixed and develop a library of diffusion-driven LC models (“LC-Diff”) trained on a range of non-differentiable and possibly continuous musical attributes, including contour, note density, pitch range, and rhythm complexity.

We show that, compared to attribute-regularized VAEs [25, 26] and other LC architectures [3], LC-Diff improves fidelity (measured by Fréchet Music Distance [27]) and controllability (measured by the correlation between desired attributes and those of generated samples) across all attributes considered in the present study.
2. DIFFUSION AS LATENT CONSTRAINTS
Let $z \sim p(z \mid x)$ be the latent representation of an input sequence $x$ with $N$ tokens and attribute $a \in \mathbb{R}$. Diffusion models employ a Markov chain to progressively corrupt input data with Gaussian noise and learn to reverse the process. Forward (latent) diffusion begins with the representation $z_0 = z$ and gradually adds noise following a schedule $\beta_t$, with $t = 1, \dots, T$. At each step, Gaussian noise is introduced according to

$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t \mathbf{I}\right). \tag{1}$$

The LC-Diff reverse diffusion process aims to navigate the latent space of a pre-trained generative model by tracing a trajectory conditional on the target attribute starting from a noise sample $z_T \sim \mathcal{N}(0, \mathbf{I})$. A denoising function $\epsilon_\theta$ is trained to predict the additive noise at each step. $\epsilon_\theta$ is thus conditioned on $a$ and a time variable $\xi_t$ that can be either the diffusion step $\xi_t = t$ [16] or the continuous noise level $\xi_t = \sqrt{\bar{\alpha}_t}$ [28], where $\bar{\alpha}_t = \prod_{i=1}^{t} (1 - \beta_i)$.
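By sampling the conditional distribution of $z_t$ at an arbitrary timestep $t$ in closed form

$$q(z_t \mid z_0) = \mathcal{N}\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right), \tag{2}$$

it is possible to efficiently train $\epsilon_\theta$ by optimizing random terms of the following objective [16]:

$$\mathcal{L} = \mathbb{E}_{z_0 \sim p(z \mid x),\ \epsilon \sim \mathcal{N}(0, \mathbf{I}),\ t}\, \left\| \epsilon - \epsilon_\theta(z_t, \xi_t, a) \right\|^2, \tag{3}$$

where $\xi_t$ is sampled from either $\mathcal{U}(\{1, \dots, T\})$ [16] or $\mathcal{U}(\sqrt{\bar{\alpha}_{t-1}}, \sqrt{\bar{\alpha}_t})$ [28].

[Figure 1: LC-Diff conditioning networks. Diagram blocks for each of the two branches (Noise level, Attribute): Sinusoidal Encoding, Linear, SiLU, and two Linear layers producing Shift and Scale. The attribute branch includes a 0/1 gate, and the outputs of the two branches are summed.]

[Figure 2: LC-Diff denoiser architecture. Diagram blocks: Linear (2048); Residual Dense Block (×3), each containing a Stack (×2) of LayerNorm, Scale and Shift modulation, SiLU, and Linear (2048), with a residual connection; Linear (256).]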
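The closed-form corruption in (2) and a Monte Carlo estimate of the objective in (3) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released implementation: the function names are hypothetical, the latents are random stand-ins for encoder outputs, the denoiser prediction is a zero placeholder, and the linear schedule reuses the values reported in Section 3.4.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-6, beta_end=1e-2):
    """Cumulative products alpha_bar_t of a linear beta schedule, t = 1..T."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, rng):
    """Draw z_t ~ q(z_t | z_0) as in Eq. (2); t holds one 1-based step per row."""
    a = alpha_bar[t - 1][:, None]          # shape (batch, 1)
    eps = rng.standard_normal(z0.shape)    # noise the denoiser must predict
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return z_t, eps

def denoising_loss(eps, eps_pred):
    """Batch estimate of Eq. (3): mean squared L2 norm of the noise error."""
    return np.mean(np.sum((eps - eps_pred) ** 2, axis=-1))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 256))         # stand-in for encoder latents
t = rng.integers(1, 1001, size=4)          # t ~ U({1, ..., T})
z_t, eps = q_sample(z0, t, alpha_bar, rng)
loss = denoising_loss(eps, np.zeros_like(eps))  # placeholder prediction
```

Because (2) is available in closed form, each training example needs only one random timestep and one noise draw, with no simulation of the full forward chain.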
2.1 Sampling

Different sampling strategies have been explored in the literature. At inference time, DDPMs [16] involve a stochastic Markov process where, at each intermediate step, a small amount of Gaussian noise is added back in to encourage diversity in the generated samples. However, following a stochastic trajectory typically requires a large number of steps, slowing down the sampling process. DDIMs [17] differ from DDPMs by making the process deterministic. With this class of models, forward diffusion is reversed by

$$z_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, f(z_t, \xi_t, a) + g(z_t, \xi_t, a), \tag{4}$$

where

$$f(z_t, \xi_t, a) = \frac{z_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(z_t, \xi_t, a)}{\sqrt{\bar{\alpha}_t}} \tag{5}$$

attempts to directly estimate $z_0$ from the current noisy latent $z_t$, while

$$g(z_t, \xi_t, a) = \sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_\theta(z_t, \xi_t, a) \tag{6}$$

ensures that the trajectory toward $z_0$ follows the direction pointing to $z_t$. The deterministic nature of DDIM allows skipping intermediate denoising steps and performing only $T_s \ll T$ iterations of the reverse process, thereby enabling faster inference.
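A single deterministic update following (4)–(6) can be written as below. This is a minimal sketch with a hypothetical function name; the noise prediction is passed in as an array, so any denoiser could be used. When the prediction equals the true noise, the update lands exactly on the less noisy point of the same trajectory, which the usage lines check.

```python
import numpy as np

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level alpha_bar_t to alpha_bar_prev."""
    # Eq. (5): estimate of the clean latent z_0
    z0_hat = (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Eq. (6): component pointing back toward z_t
    direction = np.sqrt(1.0 - alpha_bar_prev) * eps_pred
    # Eq. (4): combine both terms
    return np.sqrt(alpha_bar_prev) * z0_hat + direction

rng = np.random.default_rng(0)
z0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
ab_t, ab_prev = 0.3, 0.6                   # arbitrary illustrative noise levels
z_t = np.sqrt(ab_t) * z0 + np.sqrt(1.0 - ab_t) * eps
z_prev = ddim_step(z_t, eps, ab_t, ab_prev)
```

Since the update depends only on the two noise levels, they need not be adjacent in the original schedule, which is what permits running only $T_s \ll T$ iterations.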
2.2 Conditioning

We aim to condition $\epsilon_\theta$ on continuous musical attributes $a \in \mathbb{R}$. Similarly, Chen et al. [28] found it beneficial to condition on the noise level instead of the discrete diffusion step, which [7] later adopted for LDM-based symbolic music generation. Thus, we are left with two continuous conditioning signals to be passed on to the diffusion model.

(%)          Contour   Note Density   Pitch Range   Complexity
NM           63.36     76.63          46.70         48.67
P&L          37.44     10.11           0.41         50.59
LC-VAE-A     60.88     97.56          34.69         33.32
LC-VAE-SE    52.02     97.33          36.84         52.00
LC-Diff      85.60     98.59          80.97         94.93

Table 1: Pearson Correlation Coefficient (PCC) between target and decoded attributes.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
We inject $a$ and $\sqrt{\bar{\alpha}_t}$ into $\epsilon_\theta$ through dedicated networks (see Figure 1). First, we apply Sinusoidal Encoding (SE) based on Transformer positional embeddings [29]

$$\Gamma(u) = \left[\sin(\omega_i(u)),\ \cos(\omega_i(u))\right]_{i=0}^{d/2} \tag{7}$$

where $\omega_i(u) = \frac{s\,u}{b^{2i/d}}$ [7], with $d \in \mathbb{N}$ the (even) dimensionality of the embedding, $b \in \mathbb{R}$ the base frequency, and $s \in \mathbb{R}$ a frequency scaling hyperparameter. The resulting SE features are passed through a linear layer with SiLU activations. Finally, we employ Feature-wise Linear Modulation (FiLM) [30], where two fully-connected layers yield shifts and scales, respectively, that modulate the activations of the denoiser (see Figure 2). The two conditioning branches run in parallel. This is equivalent to learning a single affine transformation, where scale and shift are the sum of FiLM outputs from the attribute and noise level conditioning networks.
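The encoding in (7) and the summed FiLM modulation can be illustrated as follows. Several choices here are assumptions made for the sake of a runnable sketch: the index runs over $i = 0, \dots, d/2 - 1$ so that the output has exactly $d$ entries, sine and cosine terms are concatenated rather than interleaved, and the defaults `base=10000.0` and `scale=1.0` are placeholders, since the values of $b$ and $s$ are not stated in the text.

```python
import numpy as np

def sinusoidal_encoding(u, d=128, base=10000.0, scale=1.0):
    """Encode a scalar u into d features: d/2 sines followed by d/2 cosines."""
    i = np.arange(d // 2)
    omega = scale * u / base ** (2 * i / d)   # omega_i(u) = s * u / b^(2i/d)
    return np.concatenate([np.sin(omega), np.cos(omega)])

def film(h, shift_attr, scale_attr, shift_noise, scale_noise):
    """Modulate activations h with the summed scales and shifts of both branches."""
    return (scale_attr + scale_noise) * h + (shift_attr + shift_noise)

features = sinusoidal_encoding(0.42)          # e.g. an attribute or noise level
modulated = film(np.ones(4), 0.5, 2.0, 0.25, 1.0)
```

Summing the two branches before applying them keeps the attribute and noise-level networks independent, so zeroing the attribute branch leaves a valid noise-only modulation.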
To enhance controllability over the generated samples, we also apply Classifier-Free Guidance (CFG) [31] to the noise prediction:

$$\hat{\epsilon}_\theta(z_t, \xi_t, a) = (1 + w)\, \epsilon_\theta(z_t, \xi_t, a) - w\, \epsilon_\theta(z_t, \xi_t), \tag{8}$$

where $\epsilon_\theta(z_t, \xi_t)$ is the unconditional noise prediction and $w \in \mathbb{R}_{\geq 0}$ is the guidance scale. To make CFG effective, the model must learn to predict noise both with and without attribute conditioning. We achieve this through conditioning dropout (depicted as 0/1 in Figure 1), i.e., setting the outputs of the attribute conditioning network to zero with a certain probability when evaluating (3).
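The guidance rule in (8) and the conditioning dropout used during training can be sketched as follows. Function names are hypothetical, and the dropout is applied per batch row to an array standing in for the attribute-branch outputs.

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w=3.0):
    """Eq. (8): extrapolate from the unconditional toward the conditional prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond

def drop_condition(attr_features, p_drop, rng):
    """Zero the attribute-branch outputs of each batch row with probability p_drop."""
    keep = rng.random(attr_features.shape[0]) >= p_drop
    return attr_features * keep[:, None]

rng = np.random.default_rng(0)
attr_features = np.ones((8, 4))            # stand-in for attribute-branch outputs
train_features = drop_condition(attr_features, p_drop=0.2, rng=rng)
guided = cfg_noise(np.ones(4), np.zeros(4), w=3.0)
```

With $w = 0$ the rule reduces to the plain conditional prediction; larger values push the sample further toward the region favored by the attribute.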
3. EVALUATION
3.1 Dataset

The models are designed to learn pitch sequence representations from four-bar monophonic melodies. We construct a large-scale dataset comprising melodies extracted from 176,581 MIDI files from the Lakh MIDI Dataset [32].¹

First, we assess whether each MIDI file contains time signature changes. If any are found, we segment the file and retain only sections with a 4/4 time signature. Each MIDI event is then quantized to the nearest sixteenth note.
A melody is defined as a sequence of pitches within the standard 88-key piano range, played by an instrument mapped to a valid MIDI program. A melody is considered complete when a full measure of silence occurs. We extract only melodies spanning at least four bars and comprising at least three distinct pitches. If multiple notes sound simultaneously, we follow the approach proposed in [33] and select only the highest-pitched note to ensure monophonic sequences. Subsequently, four-bar segments are extracted using a stride of one bar.

For each melody thus extracted, we compute 13 musical attributes, including those outlined in Section 3.2.

¹ C. Raffel, 2016, “The Lakh MIDI Dataset v0.1.” [Online]. Available: https://colinraffel.com/projects/lmd

              Contour   Note Density   Pitch Range   Complexity
Uncond. VAE   41.44 (single value, independent of the attribute)
NM            35.506    58.436         30.833        47.61
P&L           49.698    67.836         40.657        87.80
LC-VAE-A      30.197    29.450         30.257        32.435
LC-VAE-SE     29.161    30.124         31.274        30.166
LC-Diff       19.299    20.559         31.695        17.51

Table 2: Fréchet Music Distance [27].
Melodies are encoded as sequences of $N = 64$ integers in $P = \{0, \dots, 129\}$, where each element represents either a MIDI note number (0-127) or one of two special tokens: note off (128) and note hold (129). The dataset is divided into training, validation, and test sets, with training data augmented through transposition by a randomly selected number of semitones within a range of ±1 octave. The final dataset, consisting of 10,126,676 unique melodies, is publicly available.²
3.2 Musical Attributes

As previously done in [25], we focus on four musical attributes: (i) Contour, which quantifies the melodic movement in a sequence, measured by averaging the pitch differences between consecutive notes; (ii) Note Density, defined as the ratio between the number of notes in the melody and the sequence length. It takes values in $[0, 1]$; (iii) Pitch Range, defined as the difference between the highest and lowest MIDI pitch values in the sequence, normalized by the range of an 88-key piano. It takes values in $\left[0, \frac{127}{88}\right]$, where values above one indicate a range exceeding A0–C8; (iv) Rhythm Complexity, evaluated using Toussaint's metrical complexity measure [34], corrected for the total number of notes in the sequence [26]. By definition, it takes on discrete values.
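To make the first three definitions concrete, the sketch below computes them from a 64-token sequence in the encoding of Section 3.1. It rests on two assumptions that the text does not settle: every token below 128 is counted as one note onset, and Contour averages the absolute pitch differences between consecutive notes. Rhythm Complexity is omitted because it requires the metrical weights of [34]. The function name and the example melody are hypothetical.

```python
import numpy as np

NOTE_OFF, NOTE_HOLD = 128, 129

def melody_attributes(tokens):
    """Contour, Note Density, and Pitch Range of a token sequence."""
    tokens = np.asarray(tokens)
    pitches = tokens[tokens < NOTE_OFF]        # onsets only; 128/129 are skipped
    density = len(pitches) / len(tokens)
    if len(pitches) == 0:
        return {"contour": 0.0, "note_density": 0.0, "pitch_range": 0.0}
    pitch_range = float(pitches.max() - pitches.min()) / 88
    # Assumption: mean absolute interval between consecutive notes
    contour = float(np.mean(np.abs(np.diff(pitches)))) if len(pitches) > 1 else 0.0
    return {"contour": contour, "note_density": density, "pitch_range": pitch_range}

# Four notes (C4, D4, E4, C4), then silence for the rest of the four bars
tokens = [60, NOTE_HOLD, 62, NOTE_HOLD, 64, NOTE_OFF, 60, NOTE_HOLD] + [NOTE_OFF] * 56
attrs = melody_attributes(tokens)
```

For this melody the intervals are 2, 2, and 4 semitones, so the sketch yields a Contour of 8/3, a Note Density of 4/64, and a Pitch Range of 4/88.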
3.3 Unconditional Generative Model

As base unconditional model, we implement a β-VAE [35] based on MusicVAE [33]. This model, previously used in LDM-based symbolic music generation [7], also enables direct comparison with existing attribute-regularized VAEs (AR-VAEs) employing the same architecture [25, 26] (see Section 3.5).
The encoder $p_\psi(z \mid x)$ consists of a two-layer bidirectional LSTM network fed with four-bar pitch sequence representations (see Section 3.1), followed by two linear layers parameterizing the latent posterior. The hierarchical decoder $q_\phi(x \mid z)$ features two unidirectional LSTMs, with the bottom-level network autoregressively estimating the distribution over the sequence values via a softmax nonlinearity [33]. As such, each pitch sequence $x \in P^N$ is first mapped onto a single latent code $z \in \mathbb{R}^M$, $M = 256$, and since decoding amounts to a next token prediction task, the standard β-VAE objective [35]

$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{p_\psi(z \mid x)}\left[\log q_\phi(x \mid z)\right] + \beta\, D_{\text{KL}}\left[p_\psi(z \mid x) \,\|\, p(z)\right], \tag{9}$$

is implemented using cross-entropy as reconstruction loss.

² M. Pettenò, Aug. 2024, “4 Bars Monophonic Melodies Dataset (Pitch Sequence),” Zenodo, doi: https://doi.org/10.5281/zenodo.13369389

[Figure 3: Regression plots comparing target and decoded Contour attributes across baseline methods. Panels: (a) NM, (b) P&L, (c) LC-VAE-A, (d) LC-VAE-SE.]

[Figure 4: Regression plot comparing target and decoded Contour attributes using LC-Diff.]
The unconditional model is trained for 40,000 iterations on a single NVIDIA Titan RTX GPU with a batch size of 512. The objective (9) is minimized using Adam, and the learning rate is decreased exponentially from $10^{-3}$ to $10^{-5}$ with a rate of 0.9999. The hyperparameter β is annealed exponentially from 0 to $10^{-3}$, which encourages the model to prioritize accurate sequence reconstruction during the early part of the training. Similarly to [33], we apply teacher forcing within the bottom-level decoder with a probability following a logistic schedule.
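One plausible reading of the two exponential schedules is sketched below. The learning-rate endpoints and decay rate come from the text; the functional form of the β annealing and its rate are assumptions (a MusicVAE-style saturating curve with a hypothetical rate of 0.9999), since only the endpoints 0 and $10^{-3}$ are stated.

```python
def learning_rate(step, lr_init=1e-3, lr_min=1e-5, decay=0.9999):
    """Exponential decay from lr_init, floored at lr_min."""
    return max(lr_min, lr_init * decay ** step)

def kl_weight(step, beta_max=1e-3, rate=0.9999):
    """Assumed annealing: beta grows from 0 and saturates toward beta_max."""
    return (1.0 - rate ** step) * beta_max

# Values at the start and at the end of the 40,000 training iterations
schedule = [(s, learning_rate(s), kl_weight(s)) for s in (0, 40_000)]
```

Under these assumptions the KL term is switched off at the first iteration and approaches its final weight only late in training, consistent with prioritizing reconstruction early on.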
3.4 Conditional Diffusion Model

With latent codes being vectors in $\mathbb{R}^M$, we implement a DDIM model with a fully-connected denoiser network.³ Shown in Figure 2, the denoiser comprises an input layer with 2048 linear units, followed by three dense residual blocks. Each residual block comprises two stacks of LayerNorm, feature-wise modulation (responsible for joint attribute and time conditioning), SiLU, and a linear layer, plus a residual connection that shortcuts the input and output of the block. Finally, the output is linearly projected back onto $\mathbb{R}^M$.

³ Source code and audio examples are available at https://mpetteno.github.io/controllable-latent-diffusion/

We set the SE dimensionality to $d = 128$. The attribute and noise level conditioning networks have 512 and 2048 units in the first linear layer and FiLM layers, respectively.

In the forward process, $\beta_t$ follows a linear schedule from $10^{-6}$ to $10^{-2}$ over $T = 1000$ steps. Conversely, the number of sampling steps is set to $T_s = 100$.
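The schedule and the reduced number of sampling steps can be combined into a complete reverse loop, sketched below. The evenly spaced choice of the $T_s$ steps is an assumption, as the text does not say how they are selected. The denoiser is a toy stand-in: it is the exact noise predictor for a data distribution concentrated on the single point $a \cdot \mathbf{1}$, which makes the expected output of the loop known in advance.

```python
import numpy as np

T, T_S = 1000, 100
betas = np.linspace(1e-6, 1e-2, T)            # linear schedule from the text
alpha_bar = np.cumprod(1.0 - betas)
# Assumption: T_S evenly spaced 0-based step indices, visited in descending order
steps = np.linspace(T - 1, 0, T_S).round().astype(int)

def toy_denoiser(z, xi, a):
    """Exact noise prediction when all data equals a * ones; xi = sqrt(alpha_bar)."""
    return (z - xi * a) / np.sqrt(1.0 - xi ** 2)

def sample(denoiser, a, dim=256, seed=0):
    """Deterministic DDIM sampling over the reduced step sequence."""
    z = np.random.default_rng(seed).standard_normal(dim)   # z_T ~ N(0, I)
    for k, t in enumerate(steps):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[steps[k + 1]] if k + 1 < len(steps) else 1.0
        eps = denoiser(z, np.sqrt(ab_t), a)                 # noise-level conditioning
        z0_hat = (z - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        z = np.sqrt(ab_prev) * z0_hat + np.sqrt(1.0 - ab_prev) * eps
    return z

z = sample(toy_denoiser, a=0.5)
```

In practice `toy_denoiser` would be replaced by the trained network, optionally wrapped with classifier-free guidance, and the returned latent would be passed to the frozen decoder.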
We train the model with an attribute conditioning dropout probability of 20%. We then apply CFG with a guidance scale of $w = 3.0$ [31]. In our experiments, CFG proved fundamental to achieving attribute regularization.

The resulting denoiser network has 43.1 million parameters, and converges in just about 20 training epochs, half the iterations required by the unconditional model.
3.5 AR-VAE Baseline Methods

For comparison, we consider AR-VAEs [25, 26] with the same architecture as the unconditional model described in Section 3.3. AR-VAEs incorporate regularization during training by means of a supervised multi-task learning approach, with the goal of encoding the attribute $a$ in the $i$-th dimension $z_i$ of their latent spaces. This is achieved by including an AR loss term in (9)

$$\mathcal{L}_{\text{AR-VAE}} = \mathcal{L}_{\text{VAE}} + \gamma\, \mathcal{L}_{\text{AR}}, \tag{10}$$

where $\gamma \geq 0$ is a tunable hyperparameter controlling the strength of the regularization.

Mezza et al. [26] propose the use of

$$\mathcal{L}_{\text{AR}}^{\text{NM}} = \mathrm{MAE}(z_i, \tilde{a}), \tag{11}$$

where $\mathrm{MAE}(\cdot, \cdot)$ denotes the mean absolute error, and $\tilde{a}$ is the z-score of $a$.

Pati and Lerch [25] introduced a regularization term that enforces a monotonic relationship between $a$ and $z_i$, i.e.,

$$\mathcal{L}_{\text{AR}}^{\text{P\&L}} = \mathrm{MAE}\left(\tanh(\delta D_z),\ \mathrm{sign}(D_a)\right), \tag{12}$$

where $D_z$ and $D_a$ are pairwise distance matrices between $z_i$ and $a$ of all samples in a batch, respectively, and $\delta > 0$ is a tunable hyperparameter. As in [25], we set $\gamma = 1$ and $\delta = 10$. The remaining training details are the same as in Section 3.3. For brevity, we will later refer to the former AR method as “NM” and to the latter as “P&L.”
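The two regularizers in (11) and (12) can be written compactly for a batch, as sketched below. Two assumptions are made: the z-score in (11) is computed from batch statistics (dataset-level statistics would be an equally valid reading), and the pairwise matrices in (12) hold signed differences, as in [25]. Function names are hypothetical.

```python
import numpy as np

def nm_loss(z_i, a):
    """Eq. (11): MAE between the regularized dimension and the z-scored attribute."""
    a_std = (a - a.mean()) / a.std()           # assumption: batch statistics
    return np.mean(np.abs(z_i - a_std))

def pl_loss(z_i, a, delta=10.0):
    """Eq. (12): MAE between tanh of latent differences and sign of attribute differences."""
    d_z = z_i[:, None] - z_i[None, :]          # pairwise signed differences
    d_a = a[:, None] - a[None, :]
    return np.mean(np.abs(np.tanh(delta * d_z) - np.sign(d_a)))

a = np.array([0.0, 1.0, 2.0, 3.0])             # toy attribute values
aligned = pl_loss(a, a)                        # latent ordered like the attribute
reversed_order = pl_loss(-a, a)                # latent ordered the opposite way
```

The P&L term only constrains the ordering of $z_i$ with respect to $a$, not its scale, which is why the value of $z_i$ corresponding to a given target attribute is not known a priori.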
3.6 LC-VAE Baseline Methods

Similarly to Tian and Engel [3], we implement LC through a conditional VAE (cVAE) trained on the representations of the base unconditional model (Section 3.3). The cVAE encoder consists of four linear layers with ReLU activations, followed by two Gating Mixing Layers (GML) that parameterize the innermost latent distribution. The decoder mirrors the encoder with four linear layers with ReLU activations, followed by an output GML. Except for using 2048 units in the fully-connected layers and $M' = 128$ latent variables, the cVAE architecture is the same as in [3].

Let $z \in \mathbb{R}^M$ be the latent representations of the unconditional model, $z_c \in \mathbb{R}^{M'}$ be the latent representations of the cVAE, and $a \in \mathbb{R}$ the sequence attribute. The authors of [3] considered binary labels, and one-hot vectors were thus concatenated with $z$ and $z_c$. Instead, we deal with continuous attributes. We implement two cVAE variants that differ in how $a$ is fed into the networks. In the first variant, later referred to as LC-VAE-A, we feed $\tilde{z} = [z^T, a]^T$ to the encoder, and $\tilde{z}_c = [z_c^T, a]^T$ to the decoder. In the second variant, named LC-VAE-SE, we concatenate $z$ and $z_c$, respectively, with the attribute SE, i.e., $\tilde{z} = [z^T, \Gamma(a)]^T$ and $\tilde{z}_c = [z_c^T, \Gamma(a)]^T$.

[Figure 5: Regression plots comparing target and decoded Rhythm Complexity attributes across baseline methods. Panels: (a) NM, (b) P&L, (c) LC-VAE-A, (d) LC-VAE-SE.]

[Figure 6: Regression plot comparing target and decoded Rhythm Complexity attributes using LC-Diff.]
4. RESULTS
4.1 Attribute-Controlled Generation

To evaluate the controllability of the generative models under scrutiny, we sample the target attributes uniformly in the range of zero to the 99th percentile of the attribute distribution of the sequences in the test set.⁴ These equally-spaced values, which we refer to as target attributes, are fed to the respective conditioning network of LC-Diff, suitably transformed and plugged into the regularized dimension $z_i$ of the AR-VAEs, and concatenated to the input vector of the LC-VAE decoder networks.

⁴ Limiting the range to the 99th percentile is meant to exclude those sequences with abnormally high attribute values. We argue that these sequences are spurious, and we attribute their existence to the choice, borrowed from [33], of extracting melodies by naïvely picking the highest note at any given time.
Table 1 lists the Pearson Correlation Coefficients (PCC) between the target attributes and those computed from the generated sequences (the higher, the better). LC-Diff consistently outperforms the two AR-VAEs (NM and P&L) and LC-VAEs (both with and without SE) for all attributes considered. Notably, LC-Diff is the only method among those considered in the present study to yield correlation scores higher than 80% across the board.
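The controllability protocol can be summarized in code: build an equally spaced grid of targets up to the 99th percentile of the test-set attribute, generate and decode one sequence per target, measure its attribute, and correlate. The sketch below covers the grid and the PCC. The number of grid points (`num=100`) is a placeholder not given in the text, and the test-set and decoded attributes are simulated with random stand-ins instead of running a model.

```python
import numpy as np

def target_grid(test_attributes, num=100):
    """Equally spaced targets from zero to the 99th percentile of the test attribute."""
    return np.linspace(0.0, np.percentile(test_attributes, 99), num)

def pcc_percent(targets, decoded):
    """Pearson correlation between target and decoded attributes, in percent."""
    return 100.0 * np.corrcoef(targets, decoded)[0, 1]

rng = np.random.default_rng(0)
test_attributes = rng.gamma(2.0, 1.5, size=5000)            # stand-in for test-set values
targets = target_grid(test_attributes)
decoded = targets + rng.normal(0.0, 0.3, size=targets.shape)  # stand-in for model outputs
score = pcc_percent(targets, decoded)
```

A score close to 100 indicates that the decoded attribute tracks the requested value almost linearly, which is the behavior the regression plots visualize.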
As for Contour, LC-Diff achieves a PCC of 85.60%, outperforming the next-best model, NM, by over 22 percentage points. The difference is less pronounced for Note Density, where LC-Diff (98.56%) improves upon the second-best model by just 1 percentage point. Nonetheless, LC-VAE-A and LC-VAE-SE already achieve 97.56% and 97.33%, respectively, suggesting that constraining the generative model is very effective compared to AR methods when it comes to rendering the desired number of notes. LC-Diff also demonstrates significant improvements in Pitch Range and Rhythm Complexity. For Pitch Range, it achieves a PCC of 80.97%, exceeding NM (46.70%) by 34.27 percentage points, while NM itself outperforms LC-VAEs by approximately 10 percentage points. For Rhythm Complexity, LC-Diff achieves a remarkable 94.93%, surpassing LC-VAE-SE (52%) by 42.93 percentage points.
Concerning AR models, while NM directly encodes the (standardized) distribution onto the $i$-th dimension of the latent space, there is no a priori way to know the monotonic relationship learned using the P&L regularization in (12). This explains the near-zero correlation observed for Pitch Range, and, in general, the overall lower PCC.
Figures 3 through 6 show the regression plots for Contour and Rhythm Complexity. Figure 3 and Figure 4 illustrate the case of a continuous distribution, while Figure 5 and Figure 6 exemplify a case where the attribute takes on integer values. Across both attributes, LC-Diff is characterized by a lower spread and a clear linear trend. In Figure 3, all baseline models show a tendency to produce excessively high contour values, whereas LC-Diff (Figure 4) appears to mitigate the issue. Likewise, Figure 5 reveals that all models but LC-Diff (Figure 6) tend to fail when the target Complexity values are low.
4.2 Data Fidelity

To evaluate the quality of the generated sequences, we use the Fréchet Music Distance (FMD) [27], a metric that extends the family of Fréchet Inception Distance [36] and Fréchet Audio Distance [37] to the symbolic music domain. FMD was computed between 22,016 melodies from the held-out test set and an equal number of generated sequences. To prevent the FMD from measuring a spurious divergence from the real attribute distribution, we condition the generation on the attributes of the reference sequences, rather than using evenly-spaced control values as in Section 4.1. By conditioning with attributes measured from the test set, indeed, we aim to simultaneously compare the fidelity of generated sequences and how well they conform to the desired attribute distribution.

[Figure 7: Examples of MIDI files generated by controlling the Contour attribute with LC-Diff. Panels: (a) $a = 1.0 \rightarrow a_g = 1.03$, (b) $a = 3.0 \rightarrow a_g = 2.95$, (c) $a = 6.0 \rightarrow a_g = 6.19$.]

[Figure 8: Examples of MIDI files generated by controlling Rhythm Complexity with LC-Diff. Quarter notes are indicated by solid vertical lines; odd pulses (strong) are indicated by dashed lines; even pulses (weak) are indicated by dotted lines. Panels: (a) $a = 0 \rightarrow a_g = 0$, (b) $a = 15 \rightarrow a_g = 15$, (c) $a = 33 \rightarrow a_g = 33$.]
Table 2 reports the results obtained using CLaMP 2 MIDI embeddings [38] (the lower, the better). For comparison, we report the FMD between the reference test set and the output of the unconditional VAE (see Section 3.3) obtained by decoding 22,016 samples from $\mathcal{N}(0, \mathbf{I})$.
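Like the other members of the Fréchet family, FMD compares two sets of embeddings through the Fréchet distance between Gaussians fitted to each set. The sketch below implements that distance with NumPy only, obtaining the trace of the matrix square root from the eigenvalues of the covariance product. The embeddings are random stand-ins; extracting CLaMP 2 embeddings is outside the scope of this example, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to the rows of x and y."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # tr((cov_x cov_y)^(1/2)) equals the sum of square roots of the product's eigenvalues
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
reference = rng.standard_normal((500, 4))      # stand-in for test-set embeddings
generated = reference + 3.0                    # same covariance, shifted mean
distance = frechet_distance(reference, generated)
```

In this toy case the covariances coincide, so the distance reduces to the squared distance between the means; real embeddings would differ in both mean and covariance.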
The results presented in Table 2 demonstrate that the proposed LC-Diff model achieves the lowest FMD values for three of the four attributes, indicating superior performance in generating samples that align more closely with the statistical properties of real sequences. Notably, LC-Diff outperforms all baselines in Contour (19.299), Note Density (20.559), and Rhythm Complexity (17.51), significantly improving over both AR-VAEs and LC-VAEs. While LC-VAE-A achieves the best Pitch Range score (30.257), LC-Diff remains competitive (31.695). Overall, all LC methods outperform the unconditional base model (41.44), showing that introducing post-hoc control leads to more consistent and structured music generation, with better alignment to the desired attributes.
Finally, Figure 7 and Figure 8 illustrate the potential diversity in the generated samples produced by LC-Diff when conditioned on low, medium, and high values of Contour and Rhythm Complexity, respectively.
5. CONCLUSIONS
In this paper, we have explored latent diffusion through the lens of Latent Constraints (LC), demonstrating the efficacy of DDIMs as plug-and-play conditioning modules for symbolic music generation. By keeping the base generative model fixed, we trained diffusion-based LC models (LC-Diff) capable of controlling a range of non-differentiable and continuous musical attributes, including contour, note density, pitch range, and rhythm complexity. Our empirical evaluations reveal that LC-Diff significantly outperforms attribute-regularized VAEs and cVAE-based LC methods in terms of both fidelity and controllability, with absolute improvements of up to 12.65 in Fréchet Music Distance and 43 percentage points in correlation between desired and generated attributes. These results highlight the potential of denoising as a powerful tool for ad hoc fader-like control over multiple musical attributes along continuous axes, effectively transforming a pre-trained unconditional model into a controllable music generation system depending on the user's needs. Future work will focus on expanding the library of LC-Diff models to include a wider range of musical attributes and exploring the integration of user interfaces for real-time control. Future experiments could also explore attribute-controlled input transformations by applying forward diffusion to encoded representations, rather than drawing noise samples from the standard normal prior. Furthermore, we aim to investigate the potential of LC for other generative techniques, such as flow matching and consistency models.
6. REFERENCES
[1] J. Engel, M. Hoffman, and A. Roberts, “Latent constraints: Learning to generate conditionally from unconditional generative models,” in International Conference on Learning Representations, 2018.
[2] M. Dinculescu, J. Engel, and A. Roberts, “MidiMe: Personalizing a MusicVAE model with user data,” in NeurIPS Workshop on Machine Learning for Creativity and Design, 2019.
[3] Y. Tian and J. Engel, “Latent translation: Crossing modalities by bridging generative models,” arXiv preprint arXiv:1902.08261, 2019.
[4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[5] S. Wu and M. Sun, “Exploring the efficacy of pre-trained checkpoints in text-to-music generation task,” in The AAAI-23 Workshop on Creative AI Across Modalities, 2023.
[6] P. Jajoria and J. McDermott, “Text conditioned symbolic drumbeat generation using latent diffusion models,” arXiv preprint arXiv:2408.02711, 2024.
[7] G. Mittal, J. Engel, C. Hawthorne, and I. Simon, “Symbolic music generation with diffusion models,” in Proc. of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 468–475.
[8] M. Pasini, M. Grachten, and S. Lattner, “Bass accompaniment generation via latent diffusion,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1166–1170.
[9] S. Li and Y. Sung, “MelodyDiffusion: Chord-conditioned melody generation using a transformer-based diffusion model,” Mathematics, vol. 11, no. 8, 2023.
[10] L. Min, J. Jiang, G. Xia, and J. Zhao, “Polyffusion: A diffusion model for polyphonic score generation with internal and external controls,” in Proc. of the 24th International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 231–238.
[11] L. Kawai, P. Esling, and T. Harada, “Attributes-aware deep music transformation,” in Proc. of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020, pp. 670–677.
[12] S.-L. Wu and Y.-H. Yang, “MuseMorphose: Full-song and fine-grained piano music style transfer with one transformer VAE,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1953–1967, 2023.
[13] M. E. Malandro, “Composer's Assistant 2: Interactive multi-track MIDI infilling with fine-grained user control,” in Proc. of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024, pp. 438–445.
[14] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato, “Fader networks: Manipulating images by sliding attributes,” in Advances in Neural Information Processing Systems, 2017, pp. 5963–5972.
[15] H. H. Tan and D. Herremans, “Music FaderNets: Controllable music generation based on high-level features via low-level feature modelling,” in Proc. of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020, pp. 109–116.
[16] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 1–12.
[17] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in International Conference on Learning Representations, 2021.
[18] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured denoising diffusion models in discrete state-spaces,” in Advances in Neural Information Processing Systems, 2021, pp. 1–13.
[19] A. Lv, X. Tan, P. Lu, W. Ye, S. Zhang, J. Bian, and R. Yan, “GETMusic: Generating any music tracks with a unified representation and diffusion framework,” arXiv preprint arXiv:2305.10841, 2023.
[20] M. Plasser, S. Peter, and G. Widmer, “Discrete diffusion probabilistic models for symbolic music generation,” in Proc. of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023.
[21] J. Zhang, G. Fazekas, and C. Saitis, “Composer style-specific symbolic music generation using vector quantized discrete diffusion models,” in 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), 2024, pp. 1–6.
[22] J. Zhang, G. Fazekas, and C. Saitis, “Fast diffusion GAN model for symbolic music generation controlled by emotions,” arXiv preprint arXiv:2310.14040, 2023.
[23] M. Zhang, L. J. Ferris, L. Yue, and M. Xu, “Emotionally guided symbolic music generation using diffusion models: The AGE-DM approach,” in Proc. of the 6th ACM International Conference on Multimedia in Asia, 2024, pp. 1–5.
[24] Y. Huang, A. Ghatare, Y. Liu, Z. Hu, Q. Zhang, C. S. Sastry, S. Gururani, S. Oore, and Y. Yue, “Symbolic music generation with non-differentiable rule guided diffusion,” arXiv preprint arXiv:2402.14285, 2024.
[25] A. Pati and A. Lerch, “Attribute-based regularization of latent spaces for variational auto-encoders,” Neural Computing and Applications, vol. 33, no. 9, pp. 4429–4444, 2021.
[26] A. I. Mezza, M. Zanoni, and A. Sarti, “A latent rhythm complexity model for attribute-controlled drum pattern generation,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2023, no. 1, 2023.
[27] J. Retkowski, J. Stępniak, and M. Modrzejewski, “Frechet music distance: A metric for generative symbolic music evaluation,” arXiv preprint arXiv:2412.07948, 2024.
[28] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in International Conference on Learning Representations, 2021.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[30] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proc. of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 3942–3951.
[31] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[32] C. Raffel, “Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching,” Ph.D. dissertation, Columbia University, 2016.
[33] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, “A hierarchical latent vector model for learning long-term structure in music,” in Proc. of the 35th International Conference on Machine Learning, 2018, pp. 4364–4373.
[34] G. Toussaint, “A mathematical analysis of African, Brazilian, and Cuban clave rhythms,” in Bridges: Mathematical Connections in Art, Music, and Science, 2002, pp. 157–168.
[35] I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained variational framework,” in International Conference on Learning Representations, vol. 3, 2017.
[36] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proc. of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6629–6640.
[37] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in Proc. Interspeech 2019, 2019, pp. 2350–2354.
[38] S. Wu, Y. Wang, R. Yuan, Z. Guo, X. Tan, G. Zhang, M. Zhou, J. Chen, X. Mu, Y. Gao, Y. Dong, J. Liu, X. Li, F. Yu, and M. Sun, “CLaMP 2: Multimodal music information retrieval across 101 languages using large language models,” arXiv preprint arXiv:2410.13267, 2025.