Adding Temporal Musical Controls on Top of Pretrained Generative Models

Author: Sarah Nabi; Nils Demerlé; Geoffroy Peeters; Frederic Bevilacqua; Philippe Esling

Publisher: Zenodo

DOI: 10.5281/zenodo.17706592

Source: https://zenodo.org/records/17706592/files/000091.pdf

ADDING TEMPORAL MUSICAL CONTROLS ON TOP OF PRETRAINED GENERATIVE MODELS
Sarah Nabi*¹, Nils Demerlé*¹, Geoffroy Peeters², Frédéric Bevilacqua¹, Philippe Esling¹
¹ UMR 9912 STMS-IRCAM, Sorbonne Université, CNRS, Paris, France
² LTCI, Télécom-Paris, Institut Polytechnique de Paris, France
[email protected], [email protected]
ABSTRACT
Recent advances in deep generative modeling have enabled high-quality models for musical audio synthesis. However, these approaches remain difficult to control, confined to simple, static attributes and, most importantly, entail retraining a different computationally-heavy architecture for each new control. This is inefficient and impractical as it requires substantial computational resources.
In this paper, we propose a novel approach allowing to add time-varying musical controls on top of any pretrained generative models with an exposed latent space (e.g. neural audio codecs), without retraining or finetuning. Our method supports both discrete and continuous attributes by adapting a rectified flow approach with a latent diffusion transformer. We learn an invertible mapping between pretrained latent variables and a new space disentangling explicit control attributes and style variables that capture the remaining factors of variation. This enables both feature extraction from an input, but also editing those features to generate transformed audio samples. Finally, this also introduces the ability to perform synthesis directly from the audio descriptors. We validate our method with 4 datasets going from different musical instruments up to full music recordings, on which we outperform state-of-the-art task-specific baselines in terms of both generation quality and accuracy of the control by inferring transferred attributes. Our code is available on the supporting webpage ¹.
1. INTRODUCTION
Since the pioneering autoregressive model WaveNet [1], significant advances have been made in deep generative modeling for raw audio waveform synthesis. In particular, neural audio codecs [2–4] recently enabled fast high-quality audio generation with sampling rates up to 48kHz. These models directly compress the raw audio waveform into a temporal latent space that captures high-level audio features. However, these latent representations are difficult to interpret and highly entangled, precluding their use as straightforward controls [5]. To address this, existing methods usually rely on explicit conditioning techniques, bypassing the latent information [6, 7], adding adversarial regularization [8], or learning an additional model to condition the latent codes [9–11]. However, these approaches are often limited in the types of controls usable [6, 9] and, most importantly, they require to retrain [8,11] or finetune the underlying large synthesis model [10], which is inefficient and impossible for many as it requires substantial amounts of data and computational resources.
Figure 1: Our approach enhances pretrained neural codecs with a new control-style space supporting both discrete and continuous time-varying musical controls, enabling features extraction, audio editing and conditional synthesis.
* Equal contribution
¹ https://acids-ircam.github.io/platune/
© S. Nabi, N. Demerlé, G. Peeters, F. Bevilacqua and P. Esling. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S. Nabi, N. Demerlé, G. Peeters, F. Bevilacqua and P. Esling, “Adding temporal musical controls on top of pretrained generative models”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Although neural codecs address the technical difficulties of modeling high-dimensional audio waveforms, the remaining challenge is to find interpretable control over the latent variables, which are usually restrained to follow a simple Gaussian prior distribution [2]. In this paper, we aim to give end users the ability to define their own controls by reshaping the structure of the pretrained latent space. Hence, we propose a new method to add temporal musical controls on top of any pretrained generative model with an exposed latent space without requiring retraining or finetuning. As depicted in Fig. 1, our method supports both discrete and continuous features, making it possible to combine any types of control including both static and time-varying ones such as instrument labels, MIDI inputs or continuous descriptors. Furthermore, our method can be trained very efficiently on top of any frozen pretrained codec by adapting the PluGeN [5] approach. This method proposed to transform the entangled latent representation of pretrained generative models into a multi-dimensional control-style space where latent features are split between control variables modeling the selected attributes and style variables capturing the remaining factors of variations. It uses discrete labels to parameterize the control distributions and relies on Normalizing Flows [12] (NFs) to learn an invertible mapping between both spaces. Although PluGeN provides the ability to add new controls on pretrained synthesis models, it is limited to static label attributes and was only applied to conditional generation of static images. Here, we enhance pretrained neural audio codecs with a new control-style space by extending the PluGeN approach to time-varying musical controls. As dealing with audio signals implies more complex temporal latent spaces and NFs suffer from training instability and lack scalability with strong architecture constraints, we also propose a more efficient way to define the mapping between both spaces by adapting the deterministic diffusion-based rectified flow approach [13] to temporal latent spaces using a Latent Diffusion Transformer [14].
Thanks to the invertible mapping, our proposed method becomes highly versatile and enables (1) feature extraction, (2) input audio editing, as well as (3) explicit conditional synthesis, as illustrated in Fig 1. We benchmark our model on each of these 3 tasks with 4 increasingly complex datasets composed of recordings ranging from monophonic musical instruments up to full music pieces. We show that our model efficiently extracts and improves the control of multiple complex audio features compared to state-of-the-art baselines, while maintaining the generation quality and enabling conditioning with unseen combinations of attributes.
2. BACKGROUND
2.1 Diffusion models
Deep generative models aim to model the underlying distribution $p(x)$ of a given dataset. Diffusion Models (DMs) achieve this by corrupting the data with noise and then learning to reverse this process through denoising. Given a series of noise levels $\sigma_0 < \sigma_1 < \dots < \sigma_N$, we can define a set of mollified distributions $p_{\sigma_t}(x_t) = \int p(x)\,\mathcal{N}(x_t; x, \sigma_t^2 I)\,dx$, obtained by adding i.i.d. Gaussian noise to the data. Although many formulations of diffusion exist, a common approach is to train a denoiser network $D_\theta$, typically a neural network, to minimize the following denoising score matching objective
$$\mathbb{E}_{x \sim p_{\text{data}}}\,\mathbb{E}_{n \sim \mathcal{N}(0, \sigma^2 I)}\,\lVert D(x + n; \sigma) - x \rVert_2^2. \tag{1}$$
The model implicitly learns to approximate the score of each perturbed distribution $s_\theta(x, t) \approx \nabla_x \log p_{\sigma_t}(x) = \frac{D(x; \sigma) - x}{\sigma^2}$ and hence can generate new samples using annealed Langevin dynamics. Starting from $x_N \sim \mathcal{N}(0, I)$, it follows the score functions towards the data by iteratively updating $x_{t-1} = x_t + \frac{1}{T} \nabla_x \log p_{\sigma_t}(x_t)$ until converging to a sample $x_0$ from $p(x)$. DMs currently achieve state-of-the-art results in image [15–17] and audio generation [18, 19], offering high-quality synthesis with stable training and strong conditioning abilities.
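The relation between the denoiser and the score can be checked by hand in a case where both are known in closed form. The sketch below assumes one-dimensional data drawn from a zero-mean Gaussian with variance `data_var` (a toy setting, not the paper's data), for which the optimal denoiser is the posterior mean. All function names are illustrative.

```python
def optimal_denoiser(x_noisy, data_var, sigma):
    """Posterior mean E[x | x_noisy] when x ~ N(0, data_var) and noise ~ N(0, sigma^2)."""
    return x_noisy * data_var / (data_var + sigma ** 2)


def score_from_denoiser(x_noisy, denoised, sigma):
    """Score estimate (D(x; sigma) - x) / sigma^2."""
    return (denoised - x_noisy) / sigma ** 2


def gaussian_score(x_noisy, data_var, sigma):
    """Exact score of the mollified density N(0, data_var + sigma^2)."""
    return -x_noisy / (data_var + sigma ** 2)


# Toy values chosen only for illustration.
x_noisy, data_var, sigma = 1.5, 4.0, 0.5
estimate = score_from_denoiser(x_noisy, optimal_denoiser(x_noisy, data_var, sigma), sigma)
exact = gaussian_score(x_noisy, data_var, sigma)
```

In this toy case `estimate` and `exact` agree up to floating-point error, which is the identity a trained denoiser only approximates.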
2.2 Rectified flow
Rectified Flow (RF) [13] is a recent proposal that builds upon diffusion models and optimal transport ideas, aiming to directly learn transport maps between data distributions. Unlike diffusion models, which are constrained to a Gaussian noise prior, RFs generalize to any distribution and have demonstrated superior performance in latent generation tasks [20]. Considering two arbitrary distributions $\pi_0$ and $\pi_1$, RF constructs a deterministic continuous-time flow by learning an ordinary differential equation (ODE) model that connects observations $(x_0 \sim \pi_0, x_1 \sim \pi_1)$ through straight paths. Formally, the rectified flow induced from $(x_0, x_1)$ is defined by the ODE
$$dz_t = v(z_t, t)\,dt, \tag{2}$$
where $z_t$ represents the data point at time $t \in [0, 1]$ (footnote 2), and $v : \mathbb{R}^d \to \mathbb{R}^d$ is the drift force, typically parameterized by a neural network which is trained to minimize
$$\min_v \int_0^1 \mathbb{E}\,\lVert v(x_t, t) - (x_1 - x_0) \rVert^2\,dt, \quad \text{with } x_t = (1 - t)\,x_0 + t\,x_1. \tag{3}$$
This objective encourages the flow to follow straight paths between $x_0$ and $x_1$, leading to an efficient and invertible deterministic mapping between $\pi_0$ and $\pi_1$.
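As a minimal sketch of the objective in eq. 3, the snippet below evaluates the squared error for a single pair $(x_0, x_1)$ and a single time $t$, using plain Python lists. The drift `v` is any callable taking `(x_t, t)`. The function names are illustrative and do not come from the authors' code.

```python
def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity x1 - x0 of the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def rf_loss(v, x0, x1, t):
    """Squared error between v(x_t, t) and x1 - x0 for one pair and one time."""
    prediction = v(interpolate(x0, x1, t), t)
    target = target_velocity(x0, x1)
    return sum((p - g) ** 2 for p, g in zip(prediction, target))


# An oracle drift that already knows the displacement gives zero loss.
x0, x1 = [0.0, 0.0], [2.0, -4.0]
oracle = lambda x_t, t: [2.0, -4.0]
loss_at_half = rf_loss(oracle, x0, x1, 0.5)
```

In training, this quantity would be averaged over sampled pairs and uniformly sampled times, which is what the integral and expectation in eq. 3 express.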
2.3 Neural audio codecs
Neural audio codecs such as RAVE [2], EnCodec [3], or Music2Latent [4] enable fast high-quality waveform synthesis. By leveraging autoencoder schemes, they directly compress the input audio signal into a time series of latent variables $z \in \mathbb{R}^{D \times N}$ (where $D$ and $N$ are respectively the embedding and time dimensions) defined in a lower-dimensional latent space. In addition to their computational efficiency, these models are also powerful representation learning frameworks. During training, the encoder captures high-level features from the input data, while the decoder learns to reconstruct the original waveform from its latent representation. However, these latent variables are very difficult to interpret and highly entangled, which precludes straightforward control over these models.
2.4 Control of deep generative models
To provide explicit controls, prior works rely on conditioning techniques [21, 22] and directly use the audio features as additional training inputs to the synthesis model [6, 7].
² t refers to the parametrization in the ODE and not musical time.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 2 panels: (a) Adding musical controls on pretrained neural audio codecs (encoder, latent space, flow, control variables, style variables, attributes, decoder). (b) Mapping controls to latent trajectories using rectified flow (control-style space, pretrained latent space, latent trajectory).]
Figure 2: Detailed overview of our approach mapping pretrained latent codes to explicit control and style variables.
FaderRave [8] proposed to condition RAVE [2] with non-differentiable time-varying descriptors using an adversarial criterion [22], but requires to retrain the codec from scratch. Recently, latent diffusion prior models were also introduced to achieve efficient conditional generation [16, 17]. The AFTER [11] model provides explicit controls and example-based style transfer on neural codecs by adapting the DiffAE [23] approach. They train two semantic encoders with a latent diffusion model to disentangle global timbre properties and time-varying musical structure. However, most existing approaches are limited in the types of control they provide [6, 9] and require to retrain [7,8] or fine-tune [10] the synthesis model.
Recently, PluGeN [5] enhanced pretrained generative models with multi-label attributes conditioning to control image generation without retraining. It transforms the entangled latent representation into a multi-dimensional space disentangling attributes across individual dimensions. This new space is composed of control variables $c$, defined as independent one-dimensional Gaussian mixture distributions, and style variables $s$ supposed to capture the remaining factors of variations. To learn an invertible mapping between the two spaces, PluGeN relies on Normalizing Flows [12] (NFs) preserving the dimensionality. Each example $x \in \mathbb{R}^{d_x}$ is paired with discrete multi-labels $y = (y_1, \dots, y_K) \in \llbracket 1, M_k \rrbracket^K$, where $M_k$ is the number of classes of each attribute $y_k$. The flow transforms the latent code $z \in \mathbb{R}^D$ into $(c, s) = (c_1, \dots, c_K, s_1, \dots, s_{D-K}) \in \mathbb{R}^D$ such that the target distribution corresponds to
$$p_{C,S \mid Y=y}(c, s) = \prod_{k=1}^{K} p_{C_k \mid Y_k = y_k}(c_k) \cdot p_S(s), \tag{4}$$
where style variables are modeled by a standard Gaussian
$$p_S = \mathcal{N}(0, I_{D-K}), \tag{5}$$
and control variables with a mixture of Gaussians
$$p_{C_k \mid Y_k = y_k} = \prod_{i=1}^{M_k} \mathcal{N}(\mu_i, \sigma_i^2)^{\mathbb{1}_{y_k = i}}, \tag{6}$$
where $\mu_i$ are evenly distributed between -1 and 1, $\sigma_i$ are user-defined parameters, and $\mathbb{1}_{y_k = i}$ is the indicator function which equals 1 if $y_k = i$ and 0 otherwise.
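To make eq. 6 concrete, the sketch below places one Gaussian per class on the interval [-1, 1] and draws a control value for a given label. It assumes the class means include both endpoints and that all classes share one standard deviation. The text only says the means are evenly distributed and the variances user-defined, so both choices are assumptions, and the helper names are hypothetical.

```python
import random


def class_means(num_classes):
    """Evenly spaced means in [-1, 1], endpoints included (an assumption)."""
    if num_classes == 1:
        return [0.0]
    step = 2.0 / (num_classes - 1)
    return [-1.0 + i * step for i in range(num_classes)]


def sample_control(label, num_classes, sigma, rng):
    """Draw c_k ~ N(mu_label, sigma^2); labels are 1-indexed as in the text."""
    mu = class_means(num_classes)[label - 1]
    return rng.gauss(mu, sigma)


rng = random.Random(0)
means = class_means(3)                      # three classes
c = sample_control(2, 3, sigma=0.1, rng=rng)  # a draw centred on the middle mean
```

Because the indicator exponent in eq. 6 selects exactly one component, sampling reduces to drawing from the single Gaussian attached to the observed label.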
3. PROPOSED METHOD
Our approach aims to add temporal musical controls on top of any pretrained generative model with an exposed latent space, without requiring retraining. We adapt PluGeN [5] to time-varying attributes to create a new space that disentangles explicit control features from the remaining style variations. We introduce a rectified flow model to learn an invertible mapping between the control-style space and the pretrained latent space, which directly transports the control signals into its latent trajectory. Once trained, we can use this space to extract musical features, edit audio samples by shifting the control trajectories or transferring attributes, and perform explicit conditional synthesis. Our overall approach is depicted in Fig. 2.
3.1 Defining the time-varying control-style space
The training data $x \in \mathbb{R}^{d_x}$ are paired with $K$ attributes $a = (a_d, a_c) \in \mathbb{R}^{K \times N}$, where $a_d$ and $a_c$ refer to discrete and continuous features, resampled to align with the latent trajectory $z = E(x) \in \mathbb{R}^{D \times N}$. As in eq. 6, we parameterize the control distributions as time-evolving mixtures of Gaussians for each control variable $c_k(n)$, which leads to
$$p_{c_k \mid a_{d,k}(n)}(c_k, n) = \prod_{i=1}^{M_k} \mathcal{N}\big(c_k \mid \mu_k^{(i)}(n), \sigma_k^2\big)^{\mathbb{1}_{a_{d,k}(n) = i}}, \tag{7}$$
for discrete attributes, where $\mu_k^{(i)}$ corresponds to the mean of the $i$-th class control distribution, uniformly distributed between -1 and 1 for all $M_k$ classes.
For continuous ones, we consider
$$p_{c_k \mid a_k(n)}(c_k, n) = \mathcal{N}\big(c_k \mid \mu_k(n), \sigma_k^2\big), \tag{8}$$
where the means $\mu_k(n)$ correspond to the attribute values normalized between -1 and 1 ³. We sample the remaining style variables $s \in \mathbb{R}^{(D-K) \times N}$ from a standard Gaussian as in eq. 5 and define the control-style space with the joint distribution
$$p_{c,s \mid a}\big((c, s), n\big) = \prod_{k=1}^{K} p_{c_k \mid a_k(n)}(c_k, n) \cdot p_s(s, n). \tag{9}$$
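The construction above can be sketched for continuous attributes as follows: each attribute trajectory is normalized to [-1, 1] with dataset-wide bounds, control rows are drawn around those means, and the remaining rows are filled with standard Gaussian style samples. This is a plain-Python illustration with hypothetical helper names and a single shared `sigma`. The actual implementation operates on codec latents and may differ.

```python
import random


def normalize(values, a_min, a_max):
    """Map attribute values to [-1, 1] using dataset-wide min and max."""
    return [2.0 * (v - a_min) / (a_max - a_min) - 1.0 for v in values]


def sample_control_style(mu, sigma, latent_dim, rng):
    """Return a D x N trajectory: K control rows followed by D - K style rows.

    mu is a list of K rows, each holding the N per-frame control means.
    """
    n_frames = len(mu[0])
    controls = [[rng.gauss(m, sigma) for m in row] for row in mu]
    styles = [[rng.gauss(0.0, 1.0) for _ in range(n_frames)]
              for _ in range(latent_dim - len(mu))]
    return controls + styles


rng = random.Random(0)
loudness = [0.0, 5.0, 10.0]                    # toy attribute over N = 3 frames
mu = [normalize(loudness, a_min=0.0, a_max=10.0)]
z_start = sample_control_style(mu, sigma=0.05, latent_dim=4, rng=rng)
```

The resulting `z_start` has the same shape as the codec latent trajectory, which is what lets the flow map between the two spaces without changing dimensionality.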
³ Minimum and maximum values of each attribute, $a_{\min}$ and $a_{\max}$, are computed for the whole dataset.
3.2 Mapping temporal latent and control distributions
As depicted in Fig. 2b, we rely on a rectified flow to directly map the temporal control-style space to the pretrained latent space. We use eq. 9 to define the input distribution $\pi_0$ from Sec. 2.2, and sample a control-style trajectory $z_{t=0} = (c, s) \sim p_{c,s \mid a}$. We use the associated latent trajectory $z = z_{t=1} \sim p_z$ as the target signal. Then, we uniformly sample a time step $t \sim \mathcal{U}(0, 1)$ to compute the intermediate temporal representation
$$z_t = (1 - t) \times z_{t=0} + t \times z_{t=1}. \tag{10}$$
We parametrize the drift force of the flow $v_\theta$ with a Latent Diffusion Transformer (DiT) [14] and optimize
$$\mathcal{L}_\theta = \mathbb{E}_{\substack{t \sim \mathcal{U}(0,1) \\ z_{t=0} \sim p_{c,s \mid a} \\ z_{t=1} \sim p_z}} \lVert v_\theta(z_t, t) - (z_{t=1} - z_{t=0}) \rVert^2. \tag{11}$$
The full training procedure is described in Algorithm 1.
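Algorithm 1 Training procedure of our method
1: Input: data $x \in \mathcal{D}$, attributes $a = (a_d, a_c)$, control variances $\sigma_c^2$, encoder $E$ of the pretrained audio codec, denoising diffusion model $v_\theta$, learning rate $\gamma$
2: for each training step do
3:   Encode $E(x) = z = [z_1, \dots, z_N] = z_{t=1}$
4:   Align $a = [a_1, \dots, a_N]$
5:   Compute $\mu_c = 2 \times \frac{a - a_{\min}}{a_{\max} - a_{\min}} - 1$
6:   Define $p_c = \mathcal{N}(\mu_c, \sigma_c^2)$
7:   Sample control $c \sim p_c$
8:   Sample style $s \sim p_s = \mathcal{N}(0, I)$
9:   $z_{t=0} = (c, s)$
10:   Sample time step $t \sim \mathcal{U}(0, 1)$
11:   Compute $z_t = (1 - t) \times z_{t=0} + t \times z_{t=1}$
12:   Compute $\mathcal{L}_\theta = \mathbb{E}\,\lVert v_\theta(z_t, t) - (z_{t=1} - z_{t=0}) \rVert^2$
13:   Update $\theta \leftarrow \theta - \gamma \nabla_\theta \mathcal{L}_\theta$
14: end for
15: return $v_\theta$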
3.3 Inference
Once the deterministic mapping from controls to latents is learned, we can leverage the disentangling properties of our method to infer the attributes $\hat{a}$ of a given audio sample $x$ in each pre-defined control dimension by reversing the diffusion process to predict $\hat{z}_{t=0} = (\hat{c}, s)$ from $z_{t=1} = E(x)$ by applying
$$z_{t-1} = z_t - v_\theta(z_t, t) \times dt \tag{12}$$
for $T$ sampling steps, where $dt = \frac{1}{T}$. We can then edit the sample's features by directly manipulating the control variables or perform conditional synthesis by sampling and mapping the new $(\tilde{c}, s)$ representation to the pretrained latent space through
$$z_{t+1} = z_t + v_\theta(z_t, t) \times dt. \tag{13}$$
Finally, we use the pretrained decoder $D$ to reconstruct the waveform $\tilde{x} = D(\tilde{z}_{t=1})$.
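The two update rules in eq. 12 and eq. 13 amount to Euler integration of the same ODE in opposite directions. The sketch below shows that loop on a flat list of values, with a toy constant drift standing in for the trained model. The function name, the step indexing and the toy drift are assumptions made for illustration.

```python
def integrate(v, z, n_steps, reverse=False):
    """Euler integration of dz = v(z, t) dt between t = 0 and t = 1.

    reverse=False maps control-style (t = 0) to latent (t = 1), as in eq. 13.
    reverse=True maps latent (t = 1) back to control-style (t = 0), as in eq. 12.
    """
    dt = 1.0 / n_steps
    for i in range(n_steps):
        if reverse:
            t = 1.0 - i * dt
            z = [zi - vi * dt for zi, vi in zip(z, v(z, t))]
        else:
            t = i * dt
            z = [zi + vi * dt for zi, vi in zip(z, v(z, t))]
    return z


# Toy constant drift; a trained network would replace this callable.
toy_drift = lambda z, t: [0.5, -1.0]

latent = [0.5, -1.0]
control_style = integrate(toy_drift, latent, n_steps=10, reverse=True)  # extract
control_style[0] += 0.3                                                 # edit the control entry
edited_latent = integrate(toy_drift, control_style, n_steps=10)         # resynthesize
```

With this constant drift the round trip is exact up to floating-point error, so the edit of 0.3 on the first entry reappears unchanged in `edited_latent`. A learned drift is only approximately invertible under a finite number of Euler steps.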
4. EXPERIMENTS
We aim to assess the control abilities of our proposed method for our 3 target tasks, namely, musical features retrieval, audio editing, and conditional synthesis. We evaluate our approach on time-varying and static attributes, both discrete and continuous, on 4 datasets ranging from monophonic instruments up to full music recordings. We provide audio examples on the supporting webpage ⁴.
4.1 Datasets
Monophonic instruments We benchmark our 3 tasks on the URMP [24] and MedleySolos [25] datasets composed of musical pieces played by diverse instruments. We retained only the monophonic instruments with 13 categories for URMP and 6 for MedleySolos, resulting in 2,600 and 13,000 examples of 6 and 3 seconds respectively. We define the instrument label as a control dimension and use the MIDI information extracted with BasicPitch [26] to control the melody defined by pitch, octave, onsets and dynamics features, each encoded in an individual dimension. We also use continuous descriptors, namely, centroid, bandwidth, integrated loudness ⁵ and higher-level timbral features, sharpness and booming ⁶, computed frame-wise directly from audio samples. Hence, we evaluate on both discrete and continuous, static and time-varying attributes.
Polyphonic instrument We rely on the MaestroV3 [25] dataset composed of piano recordings of classical pieces. We split the audio files into 12-second chunks, resulting in 27,000 examples. Using the music21 library ⁷, we compute high-level MIDI descriptors that characterise the playing techniques and melodic content, namely average note duration, note density, central pitch and pitch range. To obtain time-varying features, we compute the descriptors on a 4 second sliding window with a 0.5 second hop size.
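The sliding-window computation can be sketched for one descriptor, note density, using only a list of note onset times. The paper computes its descriptors with music21. The version below is a library-free illustration that assumes windows lie fully inside the chunk and that density means notes per second, neither of which is stated in the text.

```python
def note_density(onsets, duration, window=4.0, hop=0.5):
    """Notes per second in each window [start, start + window) of a chunk."""
    if duration < window:
        return []
    # Number of windows that fit entirely inside the chunk.
    n_windows = int((duration - window) / hop + 1e-9) + 1
    densities = []
    for i in range(n_windows):
        start = i * hop
        count = sum(1 for t in onsets if start <= t < start + window)
        densities.append(count / window)
    return densities


# Toy onsets (seconds) inside a 12-second chunk.
densities = note_density([0.0, 1.0, 5.0], duration=12.0)
```

Under these assumptions a 12-second chunk yields 17 values, which would then be resampled to the latent frame rate before being used as a control signal.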
Full music Finally, we evaluate our method on the large-scale open source Jamendo [27] dataset, containing a diverse collection of musical recordings, covering various genres and labeled with emotion tags. To evaluate our model on high-level perceptual descriptors, we rely on Music2Emotion [28] ⁸, a pretrained emotion recognition model trained on the same dataset. Using a simple cross-correlation analysis we retain four relatively independent tags among the most represented in the dataset: "dark", "fast", "emotional", "children". We use the classification output of Music2Emotion for those 4 tags, computed on a sliding window of 10 seconds to obtain a time-varying measure of each emotion. We normalize and resample those features to obtain our control signals.
4.2 Evaluation metrics
To evaluate our method, we assess its audio synthesis quality, but also its ability to correctly extract features and resynthesize audio with accurately transformed features.
⁴ https://acids-ircam.github.io/platune/
⁵ computed with https://github.com/csteinmetz1/pyloudnorm
⁶ https://github.com/AudioCommons/timbral_models
⁷ https://www.music21.org/
⁸ https://github.com/AMAAI-Lab/Music2Emotion

               Extract           Quality (FAD) ↓           Onset F1 score ↑          Instrument Acc. (%) ↑
Model          mel ↓    inst ↑   Rec.    Edit    Synth.    Rec.    Edit    Synth.    Rec.    Edit    Synth.
Medley S.
  AFTER [11]   -        -        0.308   0.330   0.415     0.422   0.355   0.322     91.7    76.9    -
  PluGeN [5]   0.127    99.26    1.267   0.847   1.271     0.002   0.026   0.002     12.54   36.85   12.23
  Ours         0.208    94.32    0.188   0.203   0.216     0.427   0.375   0.386     91.28   79.97   82.72
URMP
  AFTER [11]   -        -        0.235   0.271   0.402     0.536   0.487   0.441     82.0    57.8    -
  PluGeN [5]   0.078    86.46    1.516   0.758   1.518     0.001   0.121   0.007     8.59    21.88   8.59
  Ours         0.192    58.53    0.265   0.318   0.304     0.521   0.294   0.433     77.34   28.13   36.71
Table 1: Comparison of different models for the features extraction, editing and conditioning tasks with melody and instrument controls. MSE on the normalized attributes is used for melody extraction and accuracy for instrument. Lower FAD indicates better quality, while higher onset F1 score and Instrument Accuracy indicate better performance.

Features retrieval We compare the attributes inferred by the model with the ground truth attributes by using the Mean Square Error (MSE) on the normalized control dimensions, except for the instrument where we compute a prediction accuracy score of belonging to the correct Gaussian of the mixture.
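One way to read "belonging to the correct Gaussian of the mixture" is to assign an extracted control value to the class whose mean is nearest, which is the most likely component when all classes share the same variance. The sketch below implements that reading alongside the MSE. It assumes one scalar control value per example (for instance a time average), and the function names are hypothetical.

```python
def nearest_class(c, means):
    """1-indexed class whose Gaussian mean is closest to the control value c."""
    return min(range(len(means)), key=lambda i: abs(c - means[i])) + 1


def instrument_accuracy(controls, labels, means):
    """Percentage of examples assigned to their ground-truth class."""
    hits = sum(1 for c, y in zip(controls, labels) if nearest_class(c, means) == y)
    return 100.0 * hits / len(labels)


def mse(predicted, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Toy example with three classes centred at -1, 0 and 1.
means = [-1.0, 0.0, 1.0]
accuracy = instrument_accuracy([0.4, 0.6, -0.9], [2, 2, 1], means)
```

Here the second example lands closer to the third mean than to its true class, so two of the three toy examples count as correct.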
Attributes editing and transfer We extract the control-style representations from the audio inputs and randomly swap the attribute that we aim to evaluate. For instance, we keep the melody and associated style representation but update the instrument dimension with a random pick among the available categories. This enables us to assess the disentanglement properties of our method by also giving unseen combinations of attributes. First, we evaluate the impact of this change over the generation quality using the Fréchet Audio Distance [29] (FAD) on the CLAP [30] embeddings computed with the generated waveforms. Then, we evaluate the accuracy of control by extracting the updated attribute from the generated waveform and compare it to the target attribute. For the melody control, we extract the MIDI with BasicPitch [26] and compare it using the Onset F1 score from mir-eval, where two notes are considered identical if the pitch and onsets are the same within ±50ms of each other. For instrument prediction, we trained a classifier on CLAP embeddings of the training set, and use it to predict an accuracy score. For continuous attributes, we simply use the MSE as mentioned above.
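The matching rule behind the Onset F1 score can be illustrated with a short standalone function. This sketch uses greedy one-to-one matching on `(onset_seconds, midi_pitch)` pairs, which is a simplification: the evaluation library used in the paper has its own matching procedure, and this code does not call it.

```python
def onset_f1(reference, estimated, tolerance=0.05):
    """F1 over (onset_seconds, midi_pitch) notes with greedy one-to-one matching.

    A reference note is matched to the first unused estimated note that has the
    same pitch and an onset within `tolerance` seconds.
    """
    unmatched = list(estimated)
    hits = 0
    for ref_onset, ref_pitch in reference:
        for candidate in unmatched:
            if candidate[1] == ref_pitch and abs(candidate[0] - ref_onset) <= tolerance:
                unmatched.remove(candidate)
                hits += 1
                break
    if hits == 0:
        return 0.0
    precision = hits / len(estimated)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy notes: the second estimate is 200 ms late and the fourth is spurious.
reference = [(0.0, 60), (1.0, 62), (2.0, 64)]
estimated = [(0.02, 60), (1.2, 62), (2.0, 64), (3.0, 65)]
score = onset_f1(reference, estimated)
```

In this toy case two notes match, giving a precision of 1/2, a recall of 2/3, and an F1 of 4/7.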
Conditional synthesis We evaluate conditional synthesis by randomly selecting a combination of ground truth controls and sampling a style representation from the standard Gaussian distribution to generate a new waveform. We evaluate the synthesis quality with FAD and compute the control precision similarly to editing.
4.3 Implementation details
We rely on Music2Latent [4] as pretrained neural codec, using the official implementation and weights ⁹, which compresses waveforms sampled at 44.1kHz into 64-dimensional time-varying latent codes with a time compression ratio of 4096. The model architecture is a Latent DiT [14] with 15M parameters for the monophonic instrument experiments, scaled up to 27M parameters for Maestro and Jamendo datasets. All models are trained for 300k steps using the AdamW [32] optimizer with a constant learning rate of 1e-4.
⁹ https://github.com/SonyCSLParis/music2latent
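The figures above fix the temporal resolution of the control signals. The arithmetic below derives the latent frame rate and an approximate number of latent frames per chunk from the sampling rate and compression ratio given in the text. Rounding down is an assumption, since the codec's padding behaviour is not described here.

```python
SAMPLE_RATE_HZ = 44_100     # waveform sampling rate from the text
COMPRESSION_RATIO = 4096    # time compression ratio from the text


def latent_frame_rate():
    """Latent frames per second of audio."""
    return SAMPLE_RATE_HZ / COMPRESSION_RATIO


def latent_frames(duration_s):
    """Approximate number of latent frames for a chunk (floor, an assumption)."""
    return int(duration_s * SAMPLE_RATE_HZ) // COMPRESSION_RATIO


frames_per_chunk = {seconds: latent_frames(seconds) for seconds in (3, 6, 12)}
```

This gives roughly 10.8 latent frames per second, so descriptors computed with a 0.5 second hop have to be upsampled to align with the latent trajectory.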
4.4 Baselines
PluGeN [5] The approach most closely related to our work is PluGeN, which can be evaluated on the 3 tasks similarly. However, it was applied only to static attributes for image data. To use it as a straightforward fair baseline that takes into account the time-varying features, we adapt the WaveGlow [33] architecture on top of the latent space of the pretrained Music2Latent, removing the conditioning on mel-spectrograms. This enables to directly map the temporal latent trajectory to the control-style space with affine coupling flow layers as in the original paper.
AFTER [11] We train the midi-to-audio configuration of AFTER using the most recent available implementation ¹⁰. For instrument editing we use the timbre embedding of a randomly sampled example matching the target label, and for synthesis we randomly sample the timbre space.
SDEdit [31] To compare our approach with an existing editing strategy, we train a conditional rectified flow model $v_\theta(z_t, c, t)$ to map samples between a Gaussian noise distribution $\tilde{\pi}_0 \sim \mathcal{N}(0, I)$ and the latent data distribution $\pi_1$. To perform editing, we first invert a data sample $(z_1, c)$ to its corresponding noise $z_0$ (which can be seen as analogous to the style vectors from our proposal) following $z_{t-1} = z_t - v_\theta(z_t, c, t) \times dt$. Then, we generate an edited version with controls $c_{\text{swap}}$ through the denoising process $z_{t+1} = z_t + v_\theta(z_t, c_{\text{swap}}, t) \times dt$ starting from $z_0$.
5. RESULTS
5.1 Monophonic instruments
First, we evaluate our method on the 3 tasks for monophonic instruments (MedleySolos and URMP datasets) and present our results in Table 1 and Table 2.
¹⁰ https://github.com/acids-ircam/AFTER

              Quality (FAD) ↓           Onset F1 score ↑          Inst. Acc. (%) ↑          Cont. Feat. MSE ↓
Model         Rec.    Edit    Synth.    Rec.    Edit    Synth.    Rec.    Edit    Synth.    Rec.    Edit    Synth.
SDEdit [31]   0.131   0.154   0.289     0.552   0.264   0.267     93.8    36.2    38.5      0.042   0.048   0.058
PluGeN [5]    1.248   0.946   1.239     0.001   0.007   0.001     15.60   25.38   15.90     0.310   0.329   0.305
Ours          0.201   0.211   0.319     0.374   0.144   0.110     88.07   51.68   58.56     0.061   0.070   0.079
Table 2: Comparison on MedleySolos for editing and conditioning tasks combining both discrete and continuous controls.
5.1.1 Features retrieval
Although PluGeN generally achieves better results for the features extraction task, our method remains competitive on MedleySolos and manages to extract the correct instrument from the original waveform. We believe that the lower results on URMP are due to the lack of data in the training set, making it more difficult for the model to generalize to the 3 tasks simultaneously. Moreover, the Normalizing Flows [12] training objective of PluGeN explicitly penalizes the model on the control accuracy while the feature extraction is implicitly learned with our method.
5.1.2 Audio Editing
As shown in Table 1, our method outperforms both PluGeN and AFTER on Medley Solos for independent melody and instrument editing. This demonstrates the strong disentanglement abilities of our approach, which remains robust to unseen combinations of attributes created by randomly swapping the target control dimension. This is especially the case for the instrument control with high prediction accuracy, which implies that all the instrument information is well-encoded in the control variable and not the style representation. On the contrary, PluGeN completely fails to follow the target controls, indicating that most of the information is encoded in the style dimension. Furthermore, our method ensures good generation quality and achieves the lowest FAD score on MedleySolos in all scenarios. However, AFTER demonstrates better performance on URMP, especially in melody control, which might also suggest a minimum amount of data required by our method.
We present our results for the combination of melody, instrument and audio descriptors in Table 2. Although SDEdit achieves better FAD and melody control, our method achieves a better instrument control, indicating that some features are difficult to edit with simple conditioning. Introducing descriptors control degrades the melody accuracy, which can be explained by correlations between controls like pitch contour and centroid.
5.1.3 Conditional synthesis
We now evaluate our model on conditional synthesis, for both paired (Rec.) and unpaired (Synth.) attributes. Our method still outperforms AFTER on Medley Solos. Interestingly, the gap between paired and unpaired melody accuracy seems to be greater for AFTER, suggesting that our method provides a better disentanglement between attributes, independently of the global conditioning strength. This is also supported by the stronger increase in FAD for AFTER, while our model maintains more consistent audio quality across configurations.

              Extract   Generation
Model         MSE ↓     FAD ↓    Edit ↓   Synth. ↓
Maestro
  SDEdit [31] -         0.008    0.364    0.362
  Ours        0.267     0.021    0.340    0.325
Jamendo
  SDEdit [31] -         0.054    0.068    0.060
  Ours        0.079     0.105    0.047    0.033
Table 3: Comparison on the Jamendo and MaestroV3 datasets. Control accuracy is averaged across features.
5.2 Application to high-level controls
We present our results on MaestroV3 and Jamendo datasets for high-level control tasks in Table 3. Our approach consistently outperforms SDEdit, with particularly significant improvements on Jamendo, which focuses on the challenging task of emotion-based editing. The underlying principle of conditional diffusion models is that the conditioning signal provides additional information during training to guide the denoiser and reduce the denoising loss. However, these models tend to exhibit lower performance when dealing with high-level features that have complex, non-trivial relationships with the data. In contrast, our method explicitly learns an invertible mapping between the control and data spaces, improving the control efficiency over high-level attributes, on top of providing control extraction capabilities. Finally, qualitative analysis reveals that, for both datasets, the style vector effectively captures the majority of the melodic content and, for Jamendo, even the lyrics and orchestration. This property enables users to generate meaningful variations of a given track or even create hybrids between emotions or playstyles.
6. CONCLUSION
We presented a new efficient method to add controls on pretrained neural audio codecs. To the best of our knowledge, this is the first approach defining an invertible mapping between arbitrary controls and audio latent codes, enabling a large variety of applications. We leave improvements on the low-shot performance for future work, aiming to provide more personalized controls on small datasets for creative applications.
7. ACKNOWLEDGMENTS
This work has been supported by the Paris Ile-de-France Région in the framework of DIM AI4IDF.
8. REFERENCES
[1] A. Van Den Oo d, S. Dieleman, H. Zen, K. Simonyan,
O. Vinyals, A. G a es, N. Kalchb enne , A. Senio ,
K. Ka ukcuoglu e al., “Wa ene : A gene a i e model
o aw audio,” a Xi p ep in a Xi :1609.03499,
ol. 12, 2016.
[2] A. Caillon and P. Esling, “Ra e: A a ia ional au oen-
code o as and high-quali y neu al audio syn hesis,”
a Xi p ep in a Xi :2111.05011, 2021.
[3] A. Dé ossez, J. Cope , G. Synnae e, and Y. Adi, “High
ideli y neu al audio comp ession,” a Xi p ep in
a Xi :2210.13438, 2022.
[4] M. Pasini, S. La ne , and G. Fazekas, “Music2la en :
Consis ency au oencode s o la en audio comp es-
sion,” in In e na ional Socie y o Music In o ma ion
Re ie al, ISMIR 2024, 2024.
[5] M. Wołczyk, M. P oszewska, Ł. Mazia ka, M. Zieba,
P. Wielopolski, R. Ku czab, and M. Smieja, “Plugen:
Mul i-label condi ional gene a ion om p e- ained
models,” in P oceedings o he AAAI Con e ence on
A i icial In elligence, ol. 36, no. 8, 2022, pp. 8647–
8656.
[6] J. Engel, L. Han akul, C. Gu, and A. Robe s,
“Ddsp: Di e en iable digi al signal p ocessing,” a Xi
p ep in a Xi :2001.04643, 2020.
[7] Y. Wu, E. Manilow, Y. Deng, R. J. Swa ely, K. Kas -
ne , T. Cooijmans, A. Cou ille, A. Huang, and J. En-
gel, “Midi-ddsp: Hie a chical modeling o music o
de ailed con ol,” in P oceedings o he Ten h In-
e na ional Con e ence on Lea ning Rep esen a ions
(ICLR)(Online)(2022 Ap .), 2022.
[8] N. De is, N. Deme lé, S. Nabi, D. Geno a, and P. Es-
ling, “Con inuous desc ip o -based con ol o deep
audio syn hesis,” in ICASSP 2023-2023 IEEE In e -
na ional Con e ence on Acous ics, Speech and Signal
P ocessing (ICASSP). IEEE, 2023, pp. 1–5.
[9] J. Cope , F. K euk, I. Ga , T. Remez, D. Kan , G. Syn-
nae e, Y. Adi, and A. Dé ossez, “Simple and con ol-
lable music gene a ion,” Ad ances in Neu al In o ma-
ion P ocessing Sys ems, ol. 36, pp. 47 704–47 720,
2023.
[10] S.-L. Wu, C. Donahue, S. Wa anabe, and N. J. B yan,
“Music con olne : Mul iple ime- a ying con ols o
music gene a ion,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, ol. 32, pp. 2692–
2703, 2024.
[11] N. Deme lé, P. Esling, G. Do as, and D. Geno a,
“Combining audio con ol and s yle ans e using la-
en di usion,” in In e na ional Socie y o Music In-
o ma ion Re ie al, ISMIR 2024, 2024.
[12] D. Rezende and S. Mohamed, “Va ia ional in e ence
wi h no malizing lows,” in In e na ional con e ence
on machine lea ning. PMLR, 2015, pp. 1530–1538.
[13] X. Liu, C. Gong, and Q. Liu, “Flow s aigh and as :
Lea ning o gene a e and ans e da a wi h ec i ied
low,” in The Ele en h In e na ional Con e ence on
Lea ning Rep esen a ions (ICLR), 2023.
[14] W. Peebles and S. Xie, “Scalable di usion models
wi h ans o me s,” in P oceedings o he IEEE/CVF
in e na ional con e ence on compu e ision, 2023, pp.
4195–4205.
[15] T. Ka as, M. Ai ala, T. Aila, and S. Laine, “Eluci-
da ing he design space o di usion-based gene a i e
models,” Ad ances in neu al in o ma ion p ocessing
sys ems, ol. 35, pp. 26 565–26 577, 2022.
[16] A. Ramesh, P. Dha iwal, A. Nichol, C. Chu,
and M. Chen, “Hie a chical ex -condi ional im-
age gene a ion wi h clip la en s,” a Xi p ep in
a Xi :2204.06125, ol. 1, no. 2, p. 3, 2022.
[17] R. Rombach, A. Bla mann, D. Lo enz, P. Esse ,
and B. Omme , “High- esolu ion image syn hesis
wi h la en di usion models,” in P oceedings o he
IEEE/CVF con e ence on compu e ision and pa e n
ecogni ion, 2022, pp. 10 684–10 695.
[18] Z. E ans, C. Ca , J. Taylo , S. H. Hawley, and
J. Pons, “Fas iming-condi ioned la en audio di u-
sion,” in Fo y- i s In e na ional Con e ence on Ma-
chine Lea ning, 2024.
[19] Q. Huang, D. S. Pa k, T. Wang, T. I. Denk,
A. Ly, N. Chen, Z. Zhang, Z. Zhang, J. Yu,
C. F ank e al., “Noise2music: Tex -condi ioned mu-
sic gene a ion wi h di usion models,” a Xi p ep in
a Xi :2302.03917, 2023.
[20] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first international conference on machine learning, 2024.
[21] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” Advances in neural information processing systems, vol. 28, 2015.
[22] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato, “Fader networks: Manipulating images by sliding attributes,” Advances in neural information processing systems, vol. 30, 2017.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[23] K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn, “Diffusion autoencoders: Toward a meaningful and decodable representation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 619–10 629.
[24] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, 2018.
[25] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the maestro dataset,” in International Conference on Learning Representations.
[26] R. M. Bittner, J. J. Bosch, D. Rubinstein, G. Meseguer-Brocal, and S. Ewert, “A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Singapore, 2022.
[27] D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The mtg-jamendo dataset for automatic music tagging,” in Machine learning for music discovery workshop, international conference on machine learning (ICML 2019), 2019, pp. 1–3.
[28] J. Kang and D. Herremans, “Towards unified music emotion recognition across dimensional and categorical models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.03979
[29] D. Roblek, K. Kilgour, M. Sharifi, and M. Zuluaga, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in Proc. Interspeech, 2019, pp. 2350–2354.
[30] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[31] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “Sdedit: Guided image synthesis and editing with stochastic differential equations,” in International Conference on Learning Representations.
[32] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[33] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.