ADDING TEMPORAL MUSICAL CONTROLS ON TOP OF PRETRAINED
GENERATIVE MODELS
Sarah Nabi*1, Nils Demerlé*1, Geoffroy Peeters2, Frédéric Bevilacqua1, Philippe Esling1
1 UMR 9912 STMS-IRCAM, Sorbonne Université, CNRS, Paris, France
2 LTCI, Télécom-Paris, Institut Polytechnique de Paris, France
[email protected], [email protected]
ABSTRACT
Recent advances in deep generative modeling have enabled high-quality models for musical audio synthesis. However, these approaches remain difficult to control, are confined to simple, static attributes and, most importantly, entail retraining a different computationally-heavy architecture for each new control. This is inefficient and impractical as it requires substantial computational resources. In this paper, we propose a novel approach that adds time-varying musical controls on top of any pretrained generative model with an exposed latent space (e.g. neural audio codecs), without retraining or finetuning. Our method supports both discrete and continuous attributes by adapting a rectified flow approach with a latent diffusion transformer. We learn an invertible mapping between pretrained latent variables and a new space disentangling explicit control attributes and style variables that capture the remaining factors of variation. This enables both feature extraction from an input and editing of those features to generate transformed audio samples. Finally, this also introduces the ability to perform synthesis directly from the audio descriptors. We validate our method with 4 datasets going from different musical instruments up to full music recordings, on which we outperform state-of-the-art task-specific baselines in terms of both generation quality and accuracy of the control by inferring transferred attributes. Our code is available on the supporting webpage (footnote 1).
1. INTRODUCTION
Since the pioneering autoregressive model WaveNet [1], significant advances have been made in deep generative modeling for raw audio waveform synthesis. In particular, neural audio codecs [2–4] recently enabled fast high-quality audio generation with sampling rates up to 48kHz. These models directly compress the raw audio waveform into a temporal latent space that captures high-level audio features. However, these latent representations are difficult to interpret and highly entangled, precluding their use as straightforward controls [5]. To address this, existing methods usually rely on explicit conditioning techniques, bypassing the latent information [6, 7], adding adversarial regularization [8], or learning an additional model to condition the latent codes [9–11]. However, these approaches are often limited in the types of controls usable [6, 9] and, most importantly, they require retraining [8, 11] or finetuning the underlying large synthesis model [10], which is inefficient and impossible for many users as it requires a substantial amount of data and computational resources.

[Figure 1 diagram: a pretrained codec connected to a control-style space, with three operations: extract, edit, condition.]
Figure 1: Our approach enhances pretrained neural codecs with a new control-style space supporting both discrete and continuous time-varying musical controls, enabling features extraction, audio editing and conditional synthesis.

*Equal contribution
1 https://acids-ircam.github.io/platune/
© S. Nabi, N. Demerlé, G. Peeters, F. Bevilacqua and P. Esling. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S. Nabi, N. Demerlé, G. Peeters, F. Bevilacqua and P. Esling, “Adding temporal musical controls on top of pretrained generative models”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Although neural codecs address the technical difficulties of modeling high-dimensional audio waveforms, the remaining challenge is to find interpretable control over the latent variables, which are usually restrained to follow a simple Gaussian prior distribution [2]. In this paper, we aim to give end users the ability to define their own controls by reshaping the structure of the pretrained latent space. Hence, we propose a new method to add temporal musical controls on top of any pretrained generative model with an exposed latent space, without requiring retraining or finetuning. As depicted in Fig. 1, our method supports both discrete and continuous features, making it possible to combine any types of control, including both static and time-varying ones such as instrument labels, MIDI inputs or continuous descriptors. Furthermore, our method can be trained very efficiently on top of any frozen pretrained codec by adapting the PluGeN [5] approach. This method proposed to transform the entangled latent representation of pretrained generative models into a multi-dimensional control-style space where latent features are split between control variables modeling the selected attributes and style variables capturing the remaining factors of variation. It uses discrete labels to parameterize the control distributions and relies on Normalizing Flows [12] (NFs) to learn an invertible mapping between both spaces. Although PluGeN provides the ability to add new controls on pretrained synthesis models, it is limited to static label attributes and was only applied to conditional generation of static images. Here, we enhance pretrained neural audio codecs with a new control-style space by extending the PluGeN approach to time-varying musical controls. As dealing with audio signals implies more complex temporal latent spaces, and NFs suffer from training instability and lack scalability due to strong architecture constraints, we also propose a more efficient way to define the mapping between both spaces by adapting the deterministic diffusion-based rectified flow approach [13] to temporal latent spaces using a Latent Diffusion Transformer [14].
Thanks to the invertible mapping, our proposed method becomes highly versatile and enables (1) feature extraction, (2) input audio editing, as well as (3) explicit conditional synthesis, as illustrated in Fig. 1. We benchmark our model on each of these 3 tasks with 4 increasingly complex datasets composed of recordings ranging from monophonic musical instruments up to full music pieces. We show that our model efficiently extracts and improves the control of multiple complex audio features compared to state-of-the-art baselines, while maintaining the generation quality and enabling conditioning with unseen combinations of attributes.
2. BACKGROUND
2.1 Diffusion models
Deep generative models aim to model the underlying distribution $p(x)$ of a given dataset. Diffusion Models (DMs) achieve this by corrupting the data with noise and then learning to reverse this process through denoising. Given a series of noise levels $\sigma_0 < \sigma_1 < \dots < \sigma_N$, we can define a set of mollified distributions $p_{\sigma_t}(x_t) = \int p(x)\, \mathcal{N}(x_t; x, \sigma_t^2 I)\, dx$, obtained by adding i.i.d. Gaussian noise to the data. Although many formulations of diffusion exist, a common approach is to train a denoiser $D_\theta$, typically a neural network, to minimize the following denoising score matching objective

$$\mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{n \sim \mathcal{N}(0, \sigma^2 I)} \lVert D_\theta(x + n; \sigma) - x \rVert_2^2. \tag{1}$$

The model implicitly learns to approximate the score of each perturbed distribution, $s_\theta(x, t) \approx \nabla_x \log p_{\sigma_t}(x) = \frac{D_\theta(x; \sigma) - x}{\sigma^2}$, and hence can generate new samples using annealed Langevin dynamics. Starting from $x_N \sim \mathcal{N}(0, I)$, it follows the score functions towards the data by iteratively updating $x_{t-1} = x_t + \frac{1}{T} \nabla_x \log p_{\sigma_t}(x_t)$ until converging to a sample $x_0$ from $p(x)$. DMs currently achieve state-of-the-art results in image [15–17] and audio generation [18, 19], offering high-quality synthesis with stable training and strong conditioning abilities.
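The link between the denoiser and the score can be checked on a case where both are known in closed form. The sketch below is illustrative only: it assumes one-dimensional zero-mean Gaussian data with variance `data_var` (an assumption made so that the optimal denoiser is analytic), and the function names are hypothetical rather than taken from the paper's code.

```python
def optimal_denoiser(noisy, data_var, sigma):
    """MMSE denoiser for zero-mean Gaussian data of variance data_var."""
    return noisy * data_var / (data_var + sigma ** 2)


def score_from_denoiser(noisy, data_var, sigma):
    """Score estimate (D(x; sigma) - x) / sigma^2, as stated in the text."""
    return (optimal_denoiser(noisy, data_var, sigma) - noisy) / sigma ** 2


def analytic_score(noisy, data_var, sigma):
    """Exact score of the mollified density N(0, data_var + sigma^2)."""
    return -noisy / (data_var + sigma ** 2)
```

In this toy setting the two expressions agree for any input, which is the identity a trained denoiser only approximates.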
2.2 Rectified flow
Rectified Flow (RF) [13] is a recent proposal that builds upon diffusion models and optimal transport ideas, aiming to directly learn transport maps between data distributions. Unlike diffusion models, which are constrained to a Gaussian noise prior, RFs generalize to any distribution and have demonstrated superior performance in latent generation tasks [20]. Considering two arbitrary distributions $\pi_0$ and $\pi_1$, RF constructs a deterministic continuous-time flow by learning an ordinary differential equation (ODE) model that connects observations $(x_0 \sim \pi_0, x_1 \sim \pi_1)$ through straight paths. Formally, the rectified flow induced from $(x_0, x_1)$ is defined by the ODE

$$dz_t = v(z_t, t)\, dt, \tag{2}$$

where $z_t$ represents the data point at time $t \in [0, 1]$ (see footnote 2), and $v: \mathbb{R}^d \to \mathbb{R}^d$ is the drift force, typically parameterized by a neural network which is trained to minimize

$$\min_{v} \int_0^1 \mathbb{E}\, \lVert v(x_t, t) - (x_1 - x_0) \rVert^2\, dt, \quad \text{with } x_t = (1 - t)\, x_0 + t\, x_1. \tag{3}$$

This objective encourages the flow to follow straight paths between $x_0$ and $x_1$, leading to an efficient and invertible deterministic mapping between $\pi_0$ and $\pi_1$.
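As a rough illustration of the rectified flow objective, the following sketch estimates the loss by Monte Carlo for scalar samples. The names `interpolate` and `rf_loss` are hypothetical, and the drift is passed as a plain Python callable instead of a neural network.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path between x0 and x1 at time t in [0, 1]."""
    return (1.0 - t) * x0 + t * x1


def rf_loss(drift, pairs, rng):
    """Monte Carlo estimate of the rectified flow loss over (x0, x1) pairs."""
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()                 # t ~ U(0, 1)
        xt = interpolate(x0, x1, t)      # intermediate point x_t
        target = x1 - x0                 # constant velocity of the straight path
        total += (drift(xt, t) - target) ** 2
    return total / len(pairs)
```

When every pair satisfies `x1 = x0 + 3`, the constant drift `lambda x, t: 3.0` gives a zero loss, which matches the intuition that the optimal drift points along the straight line joining paired samples.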
2.3 Neural audio codecs
Neural audio codecs such as RAVE [2], EnCodec [3], or Music2Latent [4] enable fast high-quality waveform synthesis. By leveraging autoencoder schemes, they directly compress the input audio signal into a time series of latent variables $z \in \mathbb{R}^{D \times N}$ (where $D$ and $N$ are respectively the embedding and time dimensions) defined in a lower-dimensional latent space. In addition to their computational efficiency, these models are also powerful representation learning frameworks. During training, the encoder captures high-level features from the input data, while the decoder learns to reconstruct the original waveform from its latent representation. However, these latent variables are very difficult to interpret and highly entangled, which precludes straightforward control over these models.
2.4 Control of deep generative models
To provide explicit controls, prior works rely on conditioning techniques [21, 22] and directly use the audio features as additional training inputs to the synthesis model [6, 7].
2 $t$ refers to the parametrization in the ODE and not musical time
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 2 diagram, panel (a): Adding musical controls on pretrained neural audio codecs (encoder, latent space, flow, control variables and style variables, attributes, decoder). Panel (b): Mapping controls to latent trajectories using rectified flow (latent trajectory from the control-style space to the pretrained latent space).]
Figure 2: Detailed overview of our approach mapping pretrained latent codes to explicit control and style variables.
FaderRave [8] proposed to condition RAVE [2] with non-differentiable time-varying descriptors using an adversarial criterion [22], but requires retraining the codec from scratch. Recently, latent diffusion prior models were also introduced to achieve efficient conditional generation [16, 17]. The AFTER [11] model provides explicit controls and example-based style transfer on neural codecs by adapting the DiffAE [23] approach. It trains two semantic encoders with a latent diffusion model to disentangle global timbre properties and time-varying musical structure. However, most existing approaches are limited in the types of control they provide [6, 9] and require retraining [7, 8] or fine-tuning [10] the synthesis model.
Recently, PluGeN [5] enhanced pretrained generative models with multi-label attribute conditioning to control image generation without retraining. It transforms the entangled latent representation into a multi-dimensional space disentangling attributes across individual dimensions. This new space is composed of control variables $c$, defined as independent one-dimensional Gaussian mixture distributions, and style variables $s$ supposed to capture the remaining factors of variation. To learn an invertible mapping between the two spaces, PluGeN relies on Normalizing Flows [12] (NFs) preserving the dimensionality. Each example $x \in \mathbb{R}^{d_x}$ is paired with discrete multi-labels $y = (y_1, \dots, y_K)$ with $y_k \in [\![1, M_k]\!]$, where $M_k$ is the number of classes of each attribute $y_k$. The flow transforms the latent code $z \in \mathbb{R}^D$ into $(c, s) = (c_1, \dots, c_K, s_1, \dots, s_{D-K}) \in \mathbb{R}^D$ such that the target distribution corresponds to

$$p_{C,S|Y=y}(c, s) = \prod_{k=1}^{K} p_{C_k|Y_k=y_k}(c_k) \cdot p_S(s), \tag{4}$$

where style variables are modeled by a standard Gaussian

$$p_S = \mathcal{N}(0, I_{D-K}), \tag{5}$$

and control variables with a mixture of Gaussians

$$p_{C_k|Y_k=y_k} = \prod_{i=1}^{M_k} \mathcal{N}(\mu_i, \sigma_i^2)^{\mathbb{1}_{y_k=i}}, \tag{6}$$

where the $\mu_i$ are evenly distributed between -1 and 1, the $\sigma_i$ are user-defined parameters, and $\mathbb{1}_{y_k=i}$ is the indicator function which equals 1 if $y_k = i$ and 0 otherwise.
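A minimal sketch of the discrete control distribution of eq. 6 is given below. It assumes that "evenly distributed between -1 and 1" includes both endpoints and uses 0-based class indices, whereas the text counts classes from 1; both choices, as well as the helper names, are assumptions made for illustration.

```python
import random


def class_means(num_classes):
    """Evenly spaced class means in [-1, 1] (endpoints included, by assumption)."""
    if num_classes == 1:
        return [0.0]
    step = 2.0 / (num_classes - 1)
    return [-1.0 + i * step for i in range(num_classes)]


def sample_control(label, num_classes, sigma, rng):
    """Draw a control value from the Gaussian selected by the label (0-based)."""
    return rng.gauss(class_means(num_classes)[label], sigma)


def nearest_class(control, num_classes):
    """Assign a control value to the class whose mean is closest."""
    means = class_means(num_classes)
    return min(range(num_classes), key=lambda i: abs(control - means[i]))
```

Reading a label back with `nearest_class` is one simple way to decide which Gaussian of the mixture a control value belongs to.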
3. PROPOSED METHOD
Our approach aims to add temporal musical controls on top of any pretrained generative model with an exposed latent space, without requiring retraining. We adapt PluGeN [5] to time-varying attributes to create a new space that disentangles explicit control features from the remaining style variations. We introduce a rectified flow model to learn an invertible mapping between the control-style space and the pretrained latent space, which directly transports the control signals into their latent trajectory. Once trained, we can use this space to extract musical features, edit audio samples by shifting the control trajectories or transferring attributes, and perform explicit conditional synthesis. Our overall approach is depicted in Fig. 2.
3.1 Defining the time-varying control-style space
The training data $x \in \mathbb{R}^{d_x}$ are paired with $K$ attributes $a = (a^d, a^c) \in \mathbb{R}^{K \times N}$, where $a^d$ and $a^c$ refer to discrete and continuous features, resampled to align with the latent trajectory $z = E(x) \in \mathbb{R}^{D \times N}$. As in eq. 6, we parameterize the control distributions as time-evolving mixtures of Gaussians for each control variable $c_k(n)$, which leads to

$$p_{c_k|a^d_k(n)}(c_k, n) = \prod_{i=1}^{M_k} \mathcal{N}\big(c_k \mid \mu_k^{(i)}(n), \sigma_k^2\big)^{\mathbb{1}_{a^d_k(n)=i}}, \tag{7}$$

for discrete attributes, where $\mu_k^{(i)}$ corresponds to the mean of the $i$-th class control distribution, uniformly distributed between -1 and 1 for all $M_k$ classes.
For continuous ones, we consider

$$p_{c_k|a_k(n)}(c_k, n) = \mathcal{N}\big(c_k \mid \mu_k(n), \sigma_k^2\big), \tag{8}$$

where the means $\mu_k(n)$ correspond to the attribute values normalized between -1 and 1 (footnote 3). We sample the remaining style variables $s \in \mathbb{R}^{(D-K) \times N}$ from a standard Gaussian as in eq. 5 and define the control-style space with the joint distribution

$$p_{c,s|a}((c, s), n) = \prod_{k=1}^{K} p_{c_k|a_k(n)}(c_k, n) \cdot p_s(s, n). \tag{9}$$

3 The minimum and maximum values of each attribute, $a_{\min}$ and $a_{\max}$, are computed over the whole dataset.
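To make the construction of the control-style space concrete, here is a small sketch that draws one trajectory for continuous attributes: each attribute is min-max normalized to [-1, 1], used as the per-frame mean of a Gaussian, and stacked with standard Gaussian style rows. Nested lists stand in for a D x N array, and all names are hypothetical.

```python
import random


def normalize(value, a_min, a_max):
    """Min-max normalization of an attribute value to [-1, 1]."""
    return 2.0 * (value - a_min) / (a_max - a_min) - 1.0


def sample_control_style(attributes, bounds, sigma, latent_dim, rng):
    """Sample a (latent_dim x N) trajectory: K control rows, then style rows.

    attributes: K lists of N values, already aligned with the latent frames.
    bounds: K (a_min, a_max) pairs, assumed computed over the whole dataset.
    """
    num_frames = len(attributes[0])
    controls = [
        [rng.gauss(normalize(v, lo, hi), sigma) for v in row]
        for row, (lo, hi) in zip(attributes, bounds)
    ]
    style = [
        [rng.gauss(0.0, 1.0) for _ in range(num_frames)]
        for _ in range(latent_dim - len(attributes))
    ]
    return controls + style
```

Setting `sigma` to zero makes the control rows equal to the normalized attribute curves, which is a convenient sanity check of the normalization.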
3.2 Mapping temporal latent and control distributions
As depicted in Fig. 2b, we rely on a rectified flow to directly map the temporal control-style space to the pretrained latent space. We use eq. 9 to define the input distribution $\pi_0$ from Sec. 2.2, and sample a control-style trajectory $z_{t=0} = (c, s) \sim p_{c,s|a}$. We use the associated latent trajectory $z = z_{t=1} \sim p_z$ as the target signal. Then, we uniformly sample a time step $t \sim \mathcal{U}(0, 1)$ to compute the intermediate temporal representation

$$z_t = (1 - t) \times z_{t=0} + t \times z_{t=1}. \tag{10}$$

We parametrize the drift force of the flow $v_\theta$ with a Latent Diffusion Transformer (DiT) [14] and optimize

$$\mathcal{L}_\theta = \mathbb{E}_{t \sim \mathcal{U}(0,1),\; z_{t=0} \sim p_{c,s|a},\; z_{t=1} \sim p_z} \lVert v_\theta(z_t, t) - (z_{t=1} - z_{t=0}) \rVert^2. \tag{11}$$

The full training procedure is described in Algorithm 1.
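Algorithm 1 Training procedure of our method
Input: data $x \in \mathcal{D}$, attributes $a = (a^d, a^c)$, control variances $\sigma_c^2$, encoder $E$ of the pretrained audio codec, denoising diffusion model $v_\theta$, learning rate $\gamma$
for each training step do
    Encode $E(x) = z = [z_1, \dots, z_N] = z_{t=1}$
    Align $a = [a_1, \dots, a_N]$
    Compute $\mu_c = 2 \times \frac{a - a_{\min}}{a_{\max} - a_{\min}} - 1$
    Define $p_c = \mathcal{N}(\mu_c, \sigma_c^2)$
    Sample control $c \sim p_c$
    Sample style $s \sim p_s = \mathcal{N}(0, I)$
    $z_{t=0} = (c, s)$
    Sample time step $t \sim \mathcal{U}(0, 1)$
    Compute $z_t = (1 - t) \times z_{t=0} + t \times z_{t=1}$
    Compute $\mathcal{L}_\theta = \mathbb{E}\, \lVert v_\theta(z_t, t) - (z_{t=1} - z_{t=0}) \rVert^2$
    Update $\theta \leftarrow \theta - \gamma \nabla_\theta \mathcal{L}_\theta$
end for
return $v_\theta$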
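The loop of Algorithm 1 can be traced end to end with a deliberately trivial stand-in for the DiT. The sketch below is not the authors' implementation: the drift is a single constant vector `theta` (so its gradient can be written by hand), the encoder is a stub callable, and the latent has one control dimension followed by style dimensions. Under these assumptions `theta` should drift towards the average of z_{t=1} - z_{t=0}.

```python
import random


def train_constant_drift(encode, attribute, bounds, sigma_c, steps, lr, rng):
    """Toy version of the training loop with a constant drift v_theta = theta."""
    a_min, a_max = bounds
    z1 = encode()                        # latent target z_{t=1} from the frozen codec
    theta = [0.0] * len(z1)
    for _ in range(steps):
        # Normalize the attribute to [-1, 1] and sample the control variable.
        mu_c = 2.0 * (attribute - a_min) / (a_max - a_min) - 1.0
        c = rng.gauss(mu_c, sigma_c)
        # Sample the style variables from a standard Gaussian.
        s = [rng.gauss(0.0, 1.0) for _ in range(len(z1) - 1)]
        z0 = [c] + s
        # Sample a time step; the interpolated point is computed for completeness
        # but ignored by a constant drift (a real network would take it as input).
        t = rng.random()
        zt = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
        target = [b - a for a, b in zip(z0, z1)]
        # Hand-derived gradient of ||theta - target||^2, then one SGD update.
        grad = [2.0 * (th - tg) for th, tg in zip(theta, target)]
        theta = [th - lr * g for th, g in zip(theta, grad)]
    return theta
```

For example, with a stub encoder returning `[2.0, -1.0]` and an attribute of 7.5 on a [0, 10] scale (normalized mean 0.5), the learned drift ends up close to `[1.5, -1.0]`.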
3.3 Inference
Once the deterministic mapping from controls to latents is learned, we can leverage the disentangling properties of our method to infer the attributes $\hat{a}$ of a given audio sample $x$ in each pre-defined control dimension by reversing the diffusion process to predict $\hat{z}_{t=0} = (\hat{c}, s)$ from $z_{t=1} = E(x)$ by applying

$$z_{t-1} = z_t - v_\theta(z_t, t) \times dt, \tag{12}$$

for $T$ sampling steps, where $dt = \frac{1}{T}$. We can then edit the sample's features by directly manipulating the control variables, or perform conditional synthesis by sampling and mapping the new $(\tilde{c}, s)$ representation to the pretrained latent space through

$$z_{t+1} = z_t + v_\theta(z_t, t) \times dt. \tag{13}$$

Finally, we use the pretrained decoder $D$ to reconstruct the waveform $\tilde{x} = D(\tilde{z}_{t=1})$.
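The two update rules amount to Euler integration of the same drift in opposite directions. The sketch below shows an extract, edit, resynthesize round trip on plain Python lists; the `integrate` helper and the toy constant drift are illustrative assumptions, and a constant drift is chosen because it makes the Euler steps exactly invertible.

```python
def integrate(z, drift, steps, reverse=False):
    """Euler integration of the flow: forward (t: 0 -> 1) or reverse (t: 1 -> 0)."""
    dt = 1.0 / steps
    z = list(z)
    for i in range(steps):
        t = 1.0 - i * dt if reverse else i * dt
        v = drift(z, t)
        sign = -1.0 if reverse else 1.0
        z = [zi + sign * vi * dt for zi, vi in zip(z, v)]
    return z


def toy_drift(z, t):
    """Hypothetical stand-in for the trained network: a constant velocity."""
    return [1.5, -1.0]


latent = [2.0, -1.0]                                          # z_{t=1} = E(x)
control_style = integrate(latent, toy_drift, 10, reverse=True)  # (c_hat, s)
edited = [control_style[0] + 0.2] + control_style[1:]         # shift the control
new_latent = integrate(edited, toy_drift, 10)                 # map back to latents
```

With a learned, state-dependent drift the round trip is only approximately invertible, and the accuracy depends on the number of sampling steps.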
4. EXPERIMENTS
We aim to assess the control abilities of our proposed method for our 3 target tasks, namely, musical features retrieval, audio editing, and conditional synthesis. We evaluate our approach on time-varying and static attributes, both discrete and continuous, on 4 datasets ranging from monophonic instruments up to full music recordings. We provide audio examples on the supporting webpage (footnote 4).
4.1 Datasets
Monophonic instruments We benchmark our 3 tasks on the URMP [24] and MedleySolos [25] datasets composed of musical pieces played by diverse instruments. We retained only the monophonic instruments with 13 categories for URMP and 6 for MedleySolos, resulting in 2,600 and 13,000 examples of 6 and 3 seconds respectively. We define the instrument label as a control dimension and use the MIDI information extracted with BasicPitch [26] to control the melody, defined by pitch, octave, onsets and dynamics features, each encoded in an individual dimension. We also use continuous descriptors, namely centroid, bandwidth, integrated loudness (footnote 5) and higher-level timbral features, sharpness and booming (footnote 6), computed frame-wise directly from audio samples. Hence, we evaluate on both discrete and continuous, static and time-varying attributes.
Polyphonic instrument We rely on the MaestroV3 [25] dataset composed of piano recordings of classical pieces. We split the audio files into 12-second chunks, resulting in 27,000 examples. Using the music21 library (footnote 7), we compute high-level MIDI descriptors that characterise the playing techniques and melodic content, namely average note duration, note density, central pitch and pitch range. To obtain time-varying features, we compute the descriptors on a 4-second sliding window with a 0.5-second hop size.
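The sliding-window computation can be sketched as follows for one descriptor, note density. This is an assumption-laden illustration: it works from a plain list of onset times instead of music21 objects, keeps only windows that fit entirely inside the chunk (the paper does not state how boundaries are handled), and the function names are hypothetical.

```python
def sliding_windows(duration, window=4.0, hop=0.5):
    """(start, end) pairs of all windows that fit entirely in the chunk."""
    windows = []
    k = 0
    while k * hop + window <= duration + 1e-9:
        windows.append((k * hop, k * hop + window))
        k += 1
    return windows


def note_density(onsets, duration, window=4.0, hop=0.5):
    """Notes per second in each window, computed from note onset times."""
    return [
        sum(1 for onset in onsets if start <= onset < end) / window
        for start, end in sliding_windows(duration, window, hop)
    ]
```

Under these assumptions a 12-second chunk yields 17 descriptor frames, which would then be resampled to the latent frame rate.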
Full music Finally, we evaluate our method on the large-scale open source Jamendo [27] dataset, containing a diverse collection of musical recordings, covering various genres and labeled with emotion tags. To evaluate our model on high-level perceptual descriptors, we rely on Music2Emotion [28] (footnote 8), a pretrained emotion recognition model trained on the same dataset. Using a simple cross-correlation analysis, we retain four relatively independent tags among the most represented in the dataset: "dark", "fast", "emotional", "children". We use the classification output of Music2Emotion for those 4 tags, computed on a sliding window of 10 seconds to obtain a time-varying measure of each emotion. We normalize and resample those features to obtain our control signals.
4.2 Evaluation metrics
To evaluate our method, we assess its audio synthesis quality as well as its ability to correctly extract features and to resynthesize audio with accurately transformed features.

4 https://acids-ircam.github.io/platune/
5 computed with https://github.com/csteinmetz1/pyloudnorm
6 https://github.com/AudioCommons/timbral_models
7 https://www.music21.org/
8 https://github.com/AMAAI-Lab/Music2Emotion

                         Extract           Quality (FAD) ↓          Onset F1 score ↑         Instrument Acc. (%) ↑
Dataset    Model         mel ↓   inst ↑    Rec.    Edit    Synth.   Rec.    Edit    Synth.   Rec.    Edit    Synth.
Medley S.  AFTER [11]    -       -         0.308   0.330   0.415    0.422   0.355   0.322    91.7    76.9    -
Medley S.  PluGeN [5]    0.127   99.26     1.267   0.847   1.271    0.002   0.026   0.002    12.54   36.85   12.23
Medley S.  Ours          0.208   94.32     0.188   0.203   0.216    0.427   0.375   0.386    91.28   79.97   82.72
URMP       AFTER [11]    -       -         0.235   0.271   0.402    0.536   0.487   0.441    82.0    57.8    -
URMP       PluGeN [5]    0.078   86.46     1.516   0.758   1.518    0.001   0.121   0.007    8.59    21.88   8.59
URMP       Ours          0.192   58.53     0.265   0.318   0.304    0.521   0.294   0.433    77.34   28.13   36.71

Table 1: Comparison of different models for the features extraction, editing and conditioning tasks with melody and instrument controls. MSE on the normalized attributes is used for melody extraction and accuracy for instrument. Lower FAD indicates better quality, while higher onset F1 score and Instrument Accuracy indicate better performance.
Features retrieval We compare the attributes inferred by the model with the ground truth attributes by using the Mean Square Error (MSE) on the normalized control dimensions, except for the instrument, where we compute a prediction accuracy score of belonging to the correct Gaussian of the mixture.
Attributes editing and transfer We extract the control-style representations from the audio inputs and randomly swap the attribute that we aim to evaluate. For instance, we keep the melody and associated style representation but update the instrument dimension with a random pick among the available categories. This enables us to assess the disentanglement properties of our method by also giving unseen combinations of attributes. First, we evaluate the impact of this change on the generation quality using the Fréchet Audio Distance [29] (FAD) on the CLAP [30] embeddings computed with the generated waveforms. Then, we evaluate the accuracy of control by extracting the updated attribute from the generated waveform and comparing it to the target attribute. For the melody control, we extract the MIDI with BasicPitch [26] and compare it using the Onset F1 score from mir-eval, where two notes are considered identical if the pitch and onsets are the same within ±50ms of each other. For instrument prediction, we trained a classifier on CLAP embeddings of the training set, and use it to predict an accuracy score. For continuous attributes, we simply use the MSE as mentioned above.
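The matching rule behind the Onset F1 score can be illustrated with a simplified greedy matcher. This is not the mir-eval implementation, which should be used for reported numbers; the sketch assumes notes are `(onset_seconds, midi_pitch)` tuples, requires exact pitch equality, and matches each reference note to the first unused estimated note within the tolerance.

```python
def onset_f1(reference, estimated, tolerance=0.05):
    """Greedy onset/pitch F1 between two lists of (onset_seconds, midi_pitch)."""
    used = set()
    matches = 0
    for ref_onset, ref_pitch in reference:
        for j, (est_onset, est_pitch) in enumerate(estimated):
            same_note = est_pitch == ref_pitch and abs(est_onset - ref_onset) <= tolerance
            if j not in used and same_note:
                used.add(j)          # each estimated note can be matched once
                matches += 1
                break
    if matches == 0:
        return 0.0
    precision = matches / len(estimated)
    recall = matches / len(reference)
    return 2.0 * precision * recall / (precision + recall)
```

For instance, with three reference notes and four estimated notes of which two match within 50 ms, precision is 1/2, recall is 2/3 and the F1 score is 4/7.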
Conditional synthesis We evaluate conditional synthesis by randomly selecting a combination of ground truth controls and sampling a style representation from the standard Gaussian distribution to generate a new waveform. We evaluate the synthesis quality with FAD and compute the control precision similarly to editing.
4.3 Implementation details
We rely on Music2Latent [4] as pretrained neural codec, using the official implementation and weights (footnote 9), which compresses waveforms sampled at 44.1kHz into 64-dimensional time-varying latent codes with a time compression ratio of 4096. The model architecture is a Latent DiT [14] with 15M parameters for the monophonic instrument experiments, scaled up to 27M parameters for the Maestro and Jamendo datasets. All models are trained for 300k steps using the AdamW [32] optimizer with a constant learning rate of 1e-4.
9 https://github.com/SonyCSLParis/music2latent
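A quick back-of-the-envelope sketch shows what the compression ratio implies for the latent time axis, together with one possible way to resample a control curve to that length. Rounding the frame count up and using linear interpolation are assumptions made here for illustration; the codec's actual padding behaviour and the resampling used in the paper are not specified in the text.

```python
import math


def latent_frames(seconds, sample_rate=44100, ratio=4096):
    """Number of latent frames for a clip, assuming the last frame is padded."""
    return math.ceil(seconds * sample_rate / ratio)


def resample_linear(signal, num_out):
    """Linearly interpolate a control curve to num_out points."""
    if num_out == 1 or len(signal) == 1:
        return [signal[0]] * num_out
    out = []
    for i in range(num_out):
        pos = i * (len(signal) - 1) / (num_out - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out
```

At 44.1kHz with a ratio of 4096, the latent rate is roughly 10.8 frames per second, so a 3-second example corresponds to about 33 frames and a 12-second chunk to about 130.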
4.4 Baselines
PluGeN [5] The approach most closely related to our work is PluGeN, which can be evaluated on the 3 tasks similarly. However, it was applied only to static attributes of image data. To use it as a straightforward fair baseline that takes into account the time-varying features, we adapt the WaveGlow [33] architecture on top of the latent space of the pretrained Music2Latent, removing the conditioning on mel-spectrograms. This makes it possible to directly map the temporal latent trajectory to the control-style space with affine coupling flow layers as in the original paper.
AFTER [11] We train the midi-to-audio configuration of AFTER using the most recent available implementation (footnote 10). For instrument editing we use the timbre embedding of a randomly sampled example matching the target label, and for synthesis we randomly sample the timbre space.
SDEdit [31] To compare our approach with an existing editing strategy, we train a conditional rectified flow model $v_\theta(z_t, c, t)$ to map samples between a Gaussian noise distribution $\tilde{\pi}_0 = \mathcal{N}(0, I)$ and the latent data distribution $\pi_1$. To perform editing, we first invert a data sample $(z_1, c)$ to its corresponding noise $z_0$ (which can be seen as analogous to the style vectors from our proposal) following $z_{t-1} = z_t - v_\theta(z_t, c, t) \times dt$. Then, we generate an edited version with controls $c_{\text{swap}}$ through the denoising process $z_{t+1} = z_t + v_\theta(z_t, c_{\text{swap}}, t) \times dt$ starting from $z_0$.
5. RESULTS
5.1 Monophonic instruments
First, we evaluate our method on the 3 tasks for monophonic instruments (MedleySolos and URMP datasets) and present our results in Table 1 and Table 2.
10 https://github.com/acids-ircam/AFTER

              Quality (FAD) ↓          Onset F1 score ↑         Inst. Acc. (%) ↑         Cont. Feat. MSE ↓
Model         Rec.    Edit    Synth.   Rec.    Edit    Synth.   Rec.    Edit    Synth.   Rec.    Edit    Synth.
SDEdit [31]   0.131   0.154   0.289    0.552   0.264   0.267    93.8    36.2    38.5     0.042   0.048   0.058
PluGeN [5]    1.248   0.946   1.239    0.001   0.007   0.001    15.60   25.38   15.90    0.310   0.329   0.305
Ours          0.201   0.211   0.319    0.374   0.144   0.110    88.07   51.68   58.56    0.061   0.070   0.079

Table 2: Comparison on MedleySolos for editing and conditioning tasks combining both discrete and continuous controls.
5.1.1 Features retrieval
Although PluGeN generally achieves better results for the features extraction task, our method remains competitive on MedleySolos and manages to extract the correct instrument from the original waveform. We believe that the lower results on URMP are due to the lack of data in the training set, making it more difficult for the model to generalize to the 3 tasks simultaneously. Moreover, the Normalizing Flows [12] training objective of PluGeN explicitly penalizes the model on the control accuracy, while the feature extraction is implicitly learned with our method.
5.1.2 Audio Editing
As shown in Table 1, our method outperforms both PluGeN and AFTER on MedleySolos for independent melody and instrument editing. This demonstrates the strong disentanglement abilities of our approach, which remains robust to unseen combinations of attributes created by randomly swapping the target control dimension. This is especially the case for the instrument control with high prediction accuracy, which implies that all the instrument information is well-encoded in the control variable and not the style representation. On the contrary, PluGeN completely fails to follow the target controls, indicating that most of the information is encoded in the style dimension. Furthermore, our method ensures good generation quality and achieves the lowest FAD score on MedleySolos in all scenarios. However, AFTER demonstrates better performance on URMP, especially in melody control, which might also suggest a minimum amount of data required by our method.
We present our results for the combination of melody, instrument and audio descriptors in Table 2. Although SDEdit achieves better FAD and melody control, our method achieves a better instrument control, indicating that some features are difficult to edit with simple conditioning. Introducing descriptors control degrades the melody accuracy, which can be explained by correlations between controls like pitch contour and centroid.
5.1.3 Conditional synthesis
We now evaluate our model on conditional synthesis, for both paired (Rec.) and unpaired (Synth.) attributes. Our method still outperforms AFTER on MedleySolos. Interestingly, the gap between paired and unpaired melody accuracy seems to be greater for AFTER, suggesting that our method provides a better disentanglement between attributes, independently of the global conditioning strength. This is also supported by the stronger increase in FAD for AFTER, while our model maintains more consistent audio quality across configurations.

                         Extract   Generation
Dataset   Model          MSE ↓     FAD ↓    Edit ↓   Synth. ↓
Maestro   SDEdit [31]    -         0.008    0.364    0.362
Maestro   Ours           0.267     0.021    0.340    0.325
Jamendo   SDEdit [31]    -         0.054    0.068    0.060
Jamendo   Ours           0.079     0.105    0.047    0.033

Table 3: Comparison on the Jamendo and MaestroV3 datasets. Control accuracy is averaged across features.
5.2 Application to high-level controls
We present our results on the MaestroV3 and Jamendo datasets for high-level control tasks in Table 3. Our approach consistently outperforms SDEdit on the Edit and Synth. scores, with particularly significant improvements on Jamendo, which focuses on the challenging task of emotion-based editing, while SDEdit obtains a lower FAD. The underlying principle of conditional diffusion models is that the conditioning signal provides additional information during training to guide the denoiser and reduce the denoising loss. However, these models tend to exhibit lower performance when dealing with high-level features that have complex, non-trivial relationships with the data. In contrast, our method explicitly learns an invertible mapping between the control and data spaces, improving the control efficiency over high-level attributes, on top of providing control extraction capabilities. Finally, qualitative analysis reveals that, for both datasets, the style vector effectively captures the majority of the melodic content and, for Jamendo, even the lyrics and orchestration. This property enables users to generate meaningful variations of a given track or even create hybrids between emotions or playstyles.
6. CONCLUSION
We presented a new efficient method to add controls on pretrained neural audio codecs. To the best of our knowledge, this is the first approach defining an invertible mapping between arbitrary controls and audio latent codes, enabling a large variety of applications. We leave improvements of the low-shot performance to future work, aiming to provide more personalized controls on small datasets for creative applications.
7. ACKNOWLEDGMENTS
This work has been supported by the Paris Ile-de-France Région in the framework of DIM AI4IDF.
8. REFERENCES
[1] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu et al., “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, vol. 12, 2016.
[2] A. Caillon and P. Esling, “Rave: A variational autoencoder for fast and high-quality neural audio synthesis,” arXiv preprint arXiv:2111.05011, 2021.
[3] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
[4] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent: Consistency autoencoders for latent audio compression,” in International Society for Music Information Retrieval, ISMIR 2024, 2024.
[5] M. Wołczyk, M. Proszewska, Ł. Maziarka, M. Zieba, P. Wielopolski, R. Kurczab, and M. Smieja, “Plugen: Multi-label conditional generation from pre-trained models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8647–8656.
[6] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, “Ddsp: Differentiable digital signal processing,” arXiv preprint arXiv:2001.04643, 2020.
[7] Y. Wu, E. Manilow, Y. Deng, R. J. Swavely, K. Kastner, T. Cooijmans, A. Courville, A. Huang, and J. Engel, “Midi-ddsp: Hierarchical modeling of music for detailed control,” in Proceedings of the Tenth International Conference on Learning Representations (ICLR)(Online)(2022 Apr.), 2022.
[8] N. Devis, N. Demerlé, S. Nabi, D. Genova, and P. Esling, “Continuous descriptor-based control for deep audio synthesis,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[9] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 47 704–47 720, 2023.
[10] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan, “Music controlnet: Multiple time-varying controls for music generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2692–2703, 2024.
[11] N. Demerlé, P. Esling, G. Doras, and D. Genova, “Combining audio control and style transfer using latent diffusion,” in International Society for Music Information Retrieval, ISMIR 2024, 2024.
[12] D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in International conference on machine learning. PMLR, 2015, pp. 1530–1538.
[13] X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” in The Eleventh International Conference on Learning Representations (ICLR), 2023.
[14] W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205.
[15] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” Advances in neural information processing systems, vol. 35, pp. 26 565–26 577, 2022.
[16] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022.
[17] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[18] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast timing-conditioned latent audio diffusion,” in Forty-first International Conference on Machine Learning, 2024.
[19] Q. Huang, D. S. Park, T. Wang, T. I. Denk, A. Ly, N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. Frank et al., “Noise2music: Text-conditioned music generation with diffusion models,” arXiv preprint arXiv:2302.03917, 2023.
[20] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first international conference on machine learning, 2024.
[21] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” Advances in neural information processing systems, vol. 28, 2015.
[22] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato, “Fader networks: Manipulating images by sliding attributes,” Advances in neural information processing systems, vol. 30, 2017.
[23] K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn, “Diffusion autoencoders: Toward a meaningful and decodable representation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 619–10 629.
[24] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, 2018.
[25] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the maestro dataset,” in International Conference on Learning Representations.
[26] R. M. Bittner, J. J. Bosch, D. Rubinstein, G. Meseguer-Brocal, and S. Ewert, “A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Singapore, 2022.
[27] D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The mtg-jamendo dataset for automatic music tagging,” in Machine learning for music discovery workshop, international conference on machine learning (ICML 2019), 2019, pp. 1–3.
[28] J. Kang and D. Herremans, “Towards unified music emotion recognition across dimensional and categorical models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.03979
[29] D. Roblek, K. Kilgour, M. Sharifi, and M. Zuluaga, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in Proc. Interspeech, 2019, pp. 2350–2354.
[30] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[31] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “Sdedit: Guided image synthesis and editing with stochastic differential equations,” in International Conference on Learning Representations.
[32] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[33] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.