
LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation

Author: Tom Baker; Javier Nistal
Publisher: Zenodo
DOI: 10.5281/zenodo.17706393
Source: https://zenodo.org/records/17706393/files/000033.pdf
LiLAC: A LIGHTWEIGHT LATENT CONTROLNET FOR MUSICAL
AUDIO GENERATION
Tom Baker 1,2∗
1 University of Manchester
[email protected]
Javier Nistal 2
2 Sony CSL - Paris
[email protected]
ABSTRACT
Text-to-audio diffusion models produce high-quality and diverse music but often lack the fine-grained, time-varying controls essential for music production. ControlNet enables attaching external controls to a pre-trained generative model by cloning and fine-tuning its encoder on new conditionings. However, this approach incurs a large memory footprint and restricts users to a fixed set of controls. We propose a lightweight, modular architecture that considerably reduces parameter count while matching ControlNet in audio quality and condition adherence. Our method offers greater flexibility and significantly lower memory usage, enabling more efficient training and deployment of independent controls. We conduct extensive objective and subjective evaluations and provide numerous audio examples on the accompanying website.¹
1. INTRODUCTION
With the rise of generative models, new challenges have emerged in the field of human-computer interaction, especially in how users interact with these systems [1]. This issue is particularly significant in domains of artistic expression, such as music creation, where users need interfaces that allow for both high-level control over abstract concepts and precise manipulation of low-level details. Achieving this balance between creative freedom and technical control is critical for musicians and composers when working with generative systems.
Generative models for music have explored various control mechanisms to bridge the gap between user intention and machine output [2–6]. However, there is no clear consensus on which control modalities or signals are most effective. Currently, one of the most common methods of interacting with generative music systems is through text input [2–4], which leverages shared embedding spaces of text and audio information [7]. While this approach has facilitated significant advancements, it lacks the fine-grained control required for detailed music production.
¹ https://lightlatentcontrol.github.io
∗ Research undertaken while an intern at Sony CSL - Paris
© T. Baker and J. Nistal. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: T. Baker and J. Nistal, "LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation", in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In response to these limitations, some studies have experimented with conditioning generative models on time-varying signals, such as pitch or dynamics [5, 8]. These provide more specificity but are still constrained by a key issue: control signals are typically required during the training process of the generative model. Once the model has learned to respond to these inputs, the control mechanisms become fixed and inflexible.
Modular approaches like ControlNet [9] enable flexible, post-hoc control of image generative models and have recently been adapted to music generation [10, 11]. While effective, these techniques come with important limitations: memory-intensive clones of the network for each new control [9], or rigid multi-control schemes [12].
In this work, we adapt ControlNet's framework to musical audio generation, introducing a modular architecture that replaces cloned encoder blocks with lightweight convolutional layers. While similar parameter-efficient approaches exist in computer vision [13], our method is the first to demonstrate this for music, enabling flexible training of multiple independent control models (e.g., chroma, chords) without retraining the backbone. By decoupling controls into task-specific modules, users can deploy only the necessary conditions during inference, often with lower memory overhead than a single traditional ControlNet branch. Crucially, our evaluations, spanning objective metrics (FAD, APA) and subjective listening tests, show that this streamlined design achieves performance comparable to ControlNet in audio quality and condition adherence, establishing a practical balance between flexibility and fidelity for music generation workflows.
2. RELATED WORK
Controls for music generation models encompass diverse input modalities. Text prompting is widely used [2–4, 14–17], and recent joint text-audio embeddings allow zero-shot control without paired data [7, 18, 19], enabling edits like "make this piece of music more happy" [20, 21]. However, text can be ambiguous and less suited to precise control in music production. Finer-grained controls, such as melody [5], rhythm and dynamics [8], or timbral features [22, 23], offer more precision. Multimodal inputs like images or video also expand creative possibilities [24]. Audio-based conditioning is particularly effective for tasks like accompaniment generation and style transfer [25–31], with models like Diff-A-Riff [6, 32] and SingSong [33] leveraging input audio to guide generation. Control integration strategies vary, from training-time conditioning to inference-time guidance [34] or latent optimization [35, 36]. Inspired by ControlNet, recent methods introduce auxiliary networks for control [10, 11], while others, like Sketch2Sound [37], explore lightweight alternatives. Despite this progress, optimal strategies remain unclear. ControlNet-style designs offer modularity but are often resource-intensive or inflexible. To address this, we propose a new lightweight and modular architecture that retains ControlNet's strengths with improved efficiency.
3. BACKGROUND
This work builds on ControlNet [9] (see Sec. 3.2), a framework for introducing post-hoc controllability into pre-trained generative models. While our architecture is generalisable to any generative model, we utilise Diff-A-Riff [6] (see Sec. 3.1) as the backbone model throughout this paper. In the following sections, we provide an overview of these two architectures, laying the groundwork for the proposed methodology.
3.1 Diff-A-Riff
Diff-A-Riff [6] is a Latent Diffusion Model (LDM) designed to generate high-quality individual musical stems that align with a user-provided musical audio sample, denoted Context. The model employs a Consistency Autoencoder (CAE) [38] to compress raw audio into compact latent representations and utilises an Elucidated Diffusion Model (EDM) [39] framework. The CAE's latent audio codec reduces 48 kHz audio to a 64-dimensional encoding at ∼12 Hz. Generation can be controlled via audio references, text prompts, or interpolations of both, facilitated by a shared CLAP embedding space [7, 19]. For further details, refer to the original paper [6].
3.2 ControlNet
ControlNet [9] introduces a method to augment large pre-trained text-to-image diffusion models with new controls. It achieves this by freezing the parameters of the original model, or backbone, and introducing a so-called adaptor branch: a trainable copy of the backbone's encoding layers. This branch processes both original inputs and new conditional signals, feeding activations back through zero-initialised convolutions while reusing only the original training objective.
The decoupled architecture allows for conditioning with limited specialised data, enabling diverse controls (edges, depth maps, segmentation, poses) without compromising the pre-trained backbone's capabilities.
ControlNet has been successfully adapted to musical audio, providing time-frequency controls like pitch or loudness [10, 11]. Below, we provide a brief overview of the details relevant to this paper.
3.2.1 Architecture
The architecture is displayed in Fig. 1. For layer $l$, we utilise both a frozen encoder block $F_l(x_{l-1}, e)$ and its cloned adaptor block counterpart $G_l(\hat{x}_{l-1}, e)$. Here, $x_{l-1}$ and $\hat{x}_{l-1}$ represent the outputs from the previous frozen and control layers, respectively, while $e$ denotes the backbone's original conditional embeddings. Using zero convolutions $Z_s$, the skip connection $s_l$ is computed as:
$$s_l = F_l(x_{l-1}, e) + Z_s(G_l(\hat{x}_{l-1}, e)) \tag{1}$$
The input to the adaptor branch $\hat{x}_0$ is derived from the noised input tensor $x_0$ and the new conditional $c$ through the input zero convolution: $\hat{x}_0 = x_0 + Z_{in}(c)$.
ControlNet [9] demonstrates that using cloned encoders from the backbone model is critical for effective control signal integration, as randomly initialised convolutions compromise condition adherence, particularly when text conditioning is misaligned. Additionally, zero convolutions, used to introduce the control signal via skip connections, gradually introduce the signal during training, improving stability and output quality.
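As an illustration of Eq. (1), the following minimal NumPy sketch shows why a zero-initialised convolution leaves the backbone's skip connection untouched at the start of training. It is not the authors' implementation: the pointwise convolution helper, the toy `tanh` blocks standing in for $F_l$ and $G_l$, and the omission of the embedding $e$ are all simplifying assumptions made for this example.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """Pointwise (kernel size 1) convolution over a (channels, time) array."""
    return weight @ x + bias[:, None]

def controlnet_skip(x_prev, x_hat_prev, frozen_block, cloned_block, w_zero, b_zero):
    """Eq. (1) without the embedding e: s_l = F_l(x) + Z_s(G_l(x_hat))."""
    return frozen_block(x_prev) + conv1x1(cloned_block(x_hat_prev), w_zero, b_zero)

rng = np.random.default_rng(0)
C, T = 4, 6                                # toy channel and time dimensions
W = rng.normal(size=(C, C))
frozen = lambda x: np.tanh(W @ x)          # hypothetical stand-in for F_l
cloned = lambda x: np.tanh(W @ x)          # G_l starts as a copy of F_l

x0 = rng.normal(size=(C, T))
x_hat0 = x0 + rng.normal(size=(C, T))      # arbitrary adaptor input for the demo

# With Z_s initialised to zero, the adaptor branch contributes nothing yet.
s = controlnet_skip(x0, x_hat0, frozen, cloned, np.zeros((C, C)), np.zeros(C))
skip_unchanged = np.allclose(s, frozen(x0))
```

As the weights of $Z_s$ move away from zero during training, the adaptor's contribution is blended into the skip connection gradually.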
4. LILAC
In this section, we introduce LiLAC and detail how its architecture, conditioning mechanisms, and training methodologies diverge from ControlNet.
4.1 Proposed Architecture
In Fig. 1, we depict the basic block of LiLAC's architecture. Instead of cloning the backbone's encoder, LiLAC performs a second pass through each of the frozen encoder blocks, wrapping these by smaller convolutional layers. Specifically, we introduce three layers per block: a head layer before the frozen block, a tail layer after the frozen block, and a residual connection to preserve condition information as it passes through the frozen block.
Formally, we replace the cloned encoder block $G_l(\hat{x}, e)$ in (1) with
$$G_l(\hat{x}, e) \approx \underbrace{I_t}_{\text{tail}}\bigl(F_l(\underbrace{I_h(\hat{x})}_{\text{head}}, e)\bigr) + \underbrace{Z_r(\hat{x})}_{\text{residual}},$$
where $I$ represents the identity convolutions used as the head and tail layers (see Sec. 4.1.1), and $Z$ denotes the zero convolution used as the residual connection.
While other multi-control methodologies have proposed reintroducing the condition into each block [10], we found empirically that this approach does not improve condition adherence and adds redundant parameters.
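The wrapping described above can be sketched in a few lines of NumPy. This is a hedged illustration rather than the paper's code: the toy frozen block, the square (equal input and output channel) pointwise layers, and the omission of the embedding $e$ are assumptions. The point it demonstrates is that, with identity head and tail layers and a zero residual layer, the adaptor block initially reproduces the frozen block exactly.

```python
import numpy as np

def conv1x1(x, weight):
    """Bias-free pointwise convolution over a (channels, time) array."""
    return weight @ x

def lilac_block(x_hat, frozen_block, w_head, w_tail, w_res):
    """G_l(x_hat) ~ I_t(F_l(I_h(x_hat))) + Z_r(x_hat), with e omitted."""
    wrapped = conv1x1(frozen_block(conv1x1(x_hat, w_head)), w_tail)
    return wrapped + conv1x1(x_hat, w_res)

rng = np.random.default_rng(0)
C, T = 4, 6                                # toy channel and time dimensions
W = rng.normal(size=(C, C))
frozen = lambda x: np.tanh(W @ x)          # hypothetical stand-in for F_l
x_hat = rng.normal(size=(C, T))

# Identity head/tail and zero residual: the block mirrors the frozen block.
out = lilac_block(x_hat, frozen, np.eye(C), np.eye(C), np.zeros((C, C)))
matches_frozen = np.allclose(out, frozen(x_hat))
```

Only `w_head`, `w_tail`, and `w_res` would be trainable in this sketch, which is where the parameter savings relative to a full encoder clone come from.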
4.1.1 Identity Convolutions
As discussed in Sec. 3.2.1, ensuring the adaptor branch leverages the backbone model's knowledge while gradually introducing the conditional signal during training is crucial. We achieve this by initialising the adaptor pathway to mirror the backbone encoder through identity convolutions, $I$, in the head and tail layers, and zero convolutions, $Z$, in the residual layers.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1. ControlNet [9] (left) and LiLAC (right) adaptor blocks ($l = 1$). The noisy input signal is denoted as $x_0$, and the conditional signal as $c$. The frozen encoder block is denoted $F_1$, and its cloned copy $G_1$. Identity $I$ and zero $Z$ convolutions, along with the skip connection pathway $s_1$, are also illustrated.
We initialise any $n$-dimensional identity convolution kernel's biases to 0 and the weights $W_I[k_1, \ldots, k_n, i, j]$ as:
$$W_I[k_1, \ldots, k_n, i, j] = \begin{cases} 1 & \text{if } k_1, \ldots, k_n = \frac{K-1}{2} \text{ and } i = j \\ 0 & \text{otherwise} \end{cases}$$
where $k_1, \ldots, k_n$ index spatial positions in the kernel, and $i, j$ index input and output channels. $K$ is the kernel size and must be odd; in our case, all convolution kernels are size 1 to maintain a lightweight network.
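The initialisation rule above can be checked numerically. The sketch below is an assumption-laden illustration: the function names are hypothetical, the weight layout follows the index order $[k_1, \ldots, k_n, i, j]$ used in the text (which differs from the layout of most deep learning frameworks), and the small "same"-padded 1-D convolution exists only to verify that the kernel passes its input through unchanged.

```python
import numpy as np

def identity_conv_weights(channels, kernel_size, n_dims=1):
    """Weights indexed [k_1, ..., k_n, i, j]: 1 at the kernel centre when i == j."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    weights = np.zeros((kernel_size,) * n_dims + (channels, channels))
    centre = ((kernel_size - 1) // 2,) * n_dims
    weights[centre] = np.eye(channels)
    return weights

def conv1d_same(x, weights):
    """Zero-padded 1-D cross-correlation; x is (channels, time), weights is (K, in, out)."""
    kernel_size, _, out_channels = weights.shape
    pad = (kernel_size - 1) // 2
    padded = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((out_channels, x.shape[1]))
    for k in range(kernel_size):
        out += weights[k].T @ padded[:, k:k + x.shape[1]]
    return out

x = np.random.default_rng(0).normal(size=(3, 8))
w = identity_conv_weights(channels=3, kernel_size=3)
is_identity = np.allclose(conv1d_same(x, w), x)
```

With kernel size 1, as used in the paper, the weights reduce to a single identity matrix over the channel dimension.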
4.2 Conditioning Signals
As outlined in Section 4.1, the condition $c$ is introduced to the adaptor branch by adding it to the noised input $x_0$. However, this requires dimensional alignment between both tensors. For $n$-dimensional input data with $C$ channels, the noised input is represented as $x_0 \in \mathbb{R}^{C \times k_1 \times \ldots \times k_n}$. Conditions are provided as feature maps in the form $c \in \mathbb{R}^{N \times k_1 \times \ldots \times k_n}$, where $N$ is the number of condition channels. To align the condition with the input tensor, the initial zero convolution $Z_{in}$ also maps the condition's channels from $N$ to $C$ before they are combined.
5. EXPERIMENTAL SETUP
5.1 Data
In this work, we use the same dataset from Diff-A-Riff [6], comprising 12,000 multi-track recordings, randomly split into 1 million 10-second segments. During training, the model receives one track as the target and a random combination of the remaining tracks as musical Context.
5.1.1 Conditions
In addition to the original training data, we pre-extract the required conditioning control signals from each audio track as specified below. For evaluation purposes, we select two distinct signals to address different use cases: chroma pitch class conditioning and chord conditioning.
Chroma. A chromagram is extracted from the target single track using the librosa library [40]. The 12 chroma bins are placed along the channel dimension, and the time scale is downsampled to match the frame rate of the latent audio codec used by the backbone model (∼11.7 Hz) [38].
Chord. We extract chord symbols for each frame from mixed multi-track audio using Deep12MIR [41] and convert them into a 12-dimensional chroma-like format. Each chord is encoded as a vector where pitch classes in the chord are assigned 1, the root note 2, and remaining entries 0. This format supports all chord symbols, with harmonically related chords receiving similar embeddings, and facilitates the use of arbitrary chord shapes absent in the training data.
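The chord encoding described above is simple enough to sketch directly. In this hypothetical example, a chord is given as a root name plus a tuple of semitone intervals; that input format and the function name are assumptions for illustration, since the paper obtains chord symbols from an automatic recogniser and does not specify an intermediate representation.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def encode_chord(root, intervals):
    """12-dim vector: chord pitch classes get 1, the root gets 2, the rest 0."""
    root_index = PITCH_CLASSES.index(root)
    vector = [0] * 12
    for semitones in intervals:
        vector[(root_index + semitones) % 12] = 1
    vector[root_index] = 2  # the root is marked separately from other chord tones
    return vector

c_major = encode_chord("C", (0, 4, 7))   # C, E, G
a_minor = encode_chord("A", (0, 3, 7))   # A, C, E

# Relative major/minor chords share two active pitch classes (C and E).
shared = sum(1 for a, b in zip(c_major, a_minor) if a > 0 and b > 0)
```

Because the vector is built from pitch classes rather than a fixed chord vocabulary, any interval set can be encoded, including shapes that never appear in training.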
5.2 Training details
We follow the backbone's training methodology [6]. Conditions Context, $e$, and $c$ are all independently dropped out with a 50% probability during the training of the adaptor branch [9], encouraging the model to use the new condition, $c$, as a replacement for missing information and enabling Classifier-Free Guidance (CFG) [42] at inference. All models are trained for 2 days on a single NVIDIA RTX 3090 GPU with a batch size of 128. We use the AdamW [43] optimizer with a base learning rate of $10^{-4}$ and a cosine annealing learning rate schedule with linear warmup [44].
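The independent condition dropout can be illustrated with a short sketch. How a dropped condition is represented in the actual training code is not stated in the paper; using `None` here, along with the function name and the string placeholders for tensors, is purely an assumption for this example.

```python
import random

def drop_conditions(conditions, p_drop=0.5, rng=random):
    """Independently replace each condition with None with probability p_drop."""
    return {
        name: (None if rng.random() < p_drop else value)
        for name, value in conditions.items()
    }

# Hypothetical batch element: the strings stand in for real tensors.
example = {"context": "context_latents", "e": "clap_embedding", "c": "chroma"}

rng = random.Random(0)
draws = [drop_conditions(example, rng=rng) for _ in range(10_000)]
c_drop_rate = sum(d["c"] is None for d in draws) / len(draws)
```

Since each condition is dropped independently, the model sees every combination of present and missing conditions during training, which is what allows guidance on any subset at inference.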
5.3 Objective Evaluation
In the following section, we present two objective evaluation strategies: one comparing our methodology against baseline models, and the other analysing condition conflicts and their impact on condition adherence.
5.3.1 Metrics
We employ a subset of the metrics used in the backbone paper [6]: Fréchet Audio Distance (FAD) [45] for audio quality and diversity; Audio Prompt Adherence (APA) [46] and CLAP Score (CS) for context and text adherence, respectively. For evaluating adherence to the LiLAC-provided chromagram conditioning, we calculate the chroma Mean Squared Error (cMSE) between input and output chromagrams. All metrics are computed by averaging across five sets of 500 samples. Distribution-based metrics such as FAD and APA are measured using the CLAP embedding space [7, 19] against a background dataset of 5,000 real audio samples.
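For concreteness, a pairwise cMSE computation might look like the sketch below. The paper does not spell out normalisation or framing details, so this assumes both chromagrams are already aligned arrays of identical shape (12 bins by frames) with values on the same scale; the function name is hypothetical.

```python
import numpy as np

def chroma_mse(chroma_in, chroma_out):
    """Mean squared error between two chromagrams of identical shape."""
    a = np.asarray(chroma_in, dtype=float)
    b = np.asarray(chroma_out, dtype=float)
    if a.shape != b.shape:
        raise ValueError("chromagrams must have the same shape")
    return float(np.mean((a - b) ** 2))

target = np.zeros((12, 4))
target[0, :] = 1.0                      # a sustained C across four frames
perfect = chroma_mse(target, target)
worst_case = chroma_mse(np.zeros((12, 4)), np.ones((12, 4)))
```

In the evaluation described above, the output chromagram would be extracted from the generated audio with the same procedure used for the conditioning input.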
5.3.2 Baselines
We compare LiLAC against four baseline configurations: the backbone model in three conditioning setups, namely totally unconditioned (Diff-A-Riff*), with CLAP (Diff-A-Riff), and with both CLAP and context (Diff-A-Riff + Context), as well as the original ControlNet architecture applied to the backbone.
5.3.3 Comparison and Ablation
With the aim of keeping the architecture as light as possible, we conduct an ablation study of each of the three LiLAC adaptor layers described in Section 4.1: Head (H), Tail (T), and Residual (R). We evaluate their performance relative to the parameter overhead (see Tab. 1). Additionally, we test a model variant denoted LiLAC* that avoids a second pass through the backbone model by directly inserting the new adaptor layers as part of the backbone's forward pass, similar in idea to previous work [37] (although, notably, we do not further fine-tune our backbone model).
For the objective evaluations, we deliberately employ raw chromagram conditioning due to its high information density, enabling us to establish a more accurate upper bound on the model's conditional adherence performance. Since each audio signal produces only a single chromagram, this allows for precise and unambiguous objective comparisons using the cMSE. In contrast, chord symbols can have multiple valid interpretations for a single stem, and as there is no unambiguous single way to evaluate adherence across a variety of instrument classes, we reserve these for the subjective test (see Sec. 5.4).
5.3.4 Conflictive Conditioning
Next, we evaluate how LiLAC and ControlNet-style models respond to conflicting CLAP and chroma conditionings (e.g., a solo violin CLAP embedding with a polyphonic chroma condition). Certain conditions, especially those extracted directly from audio, can be "over-specified", that is, contain redundant information. For example, chromagrams may also include traces of timbre and pitch range due to lower chroma bin resolution at lower frequencies. We measure the change in CLAP Score across different conditioning setups to quantify how these added control conditions may override or "leak" into the backbone model's original CLAP conditioning.
We compare three setups: 1) CLAP embeddings and control signals are aligned (i.e., extracted from the same source audio signal); 2) CLAP embeddings and control signals are misaligned (sourced from different audio examples); and 3) the model receives only the control signal, with CLAP conditioning omitted, labelled none. Using these setups, we evaluate the following variables:
Impact of Architecture: We compare all proposed LiLAC configurations as well as ControlNet to assess which architecture is more susceptible to CLAP leakage, where over-specified control signals (e.g., chroma) can dominate or obscure CLAP's condition. By contrasting performance in aligned vs. misaligned configurations, we quantify how strongly each model prioritises information from the new control signals over pre-existing CLAP embeddings.
Impact of Control Signal Specificity: We evaluate three control signal types: (1) Chroma, an over-specified signal containing pitch, timing, and timbral information; (2) Thresholded Chroma, a variant where amplitudes ≥ 0.9 are clipped to 1 and all others to 0, reducing fine timbral details to minimise leakage; and (3) Chords, a well-specified signal conveying only harmonic structure. By analysing the difference in CLAP Score across these conditions, we quantify whether reducing surplus information in the conditional signal leads to more faithful adherence to the CLAP embedding.
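The thresholded chroma variant amounts to a one-line binarisation. The sketch below assumes the chromagram is normalised so that amplitudes lie in [0, 1]; the function name and the example values are illustrative only.

```python
import numpy as np

def threshold_chroma(chroma, threshold=0.9):
    """Binarise a chromagram: amplitudes >= threshold become 1, all others 0."""
    return (np.asarray(chroma, dtype=float) >= threshold).astype(float)

chroma = np.array([[0.95, 0.20, 1.00],
                   [0.90, 0.89, 0.05]])
binary = threshold_chroma(chroma)
```

Discarding the sub-threshold amplitudes removes most of the fine energy distribution across bins, which is the part of the signal the text identifies as carrying timbral traces.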
5.4 Subjective Evaluation
To subjectively evaluate LiLAC's effectiveness in conditioning the diffusion model, we conduct two listening tests loosely following MUSHRA guidelines [47] (i.e., using the true accompaniment as hidden reference in our case). In both tests, participants rate sets of audio samples on a 100-point scale (0: poor, 100: excellent) based on two criteria: audio quality and subjective adherence, respectively, for each questionnaire.² Each test comprises 10 questions, with each presenting 5 unlabeled, randomly ordered samples corresponding to LiLAC_H, LiLAC_HTR, ControlNet, the ground-truth reference track, and a negative anchor. Further details about each test are given below.
5.4.1 Subjective Audio Quality (SAQ):
The SAQ test evaluates whether the control models audibly degrade the output quality. Participants are asked to rank each set of examples based on perceived audio quality and the presence of artifacts. For each question round, we start from a reference track, randomly sampled from the validation set, and extract the chromas and CLAP embeddings. Using these as conditioning, we generate an example with each of the three evaluated models. To create the negative anchor, we apply heavy compression (16 kbps MP3) to the ground truth track. The ground truth audio is used as a positive anchor. Shared chroma and CLAP conditioning in this test ensures that all examples are comparable, as they should ideally exhibit the same pitch distribution and timbre. This approach encourages participants to focus solely on evaluating audio quality.
5.4.2 Subjective Condition Adherence (SCA):
The SCA test evaluates the models' ability to adhere to chord conditioning, chosen for its flexibility in generating varied outputs (e.g., chords, bass lines, melodies). Participants rank how well the generated output aligns harmonically with the chord conditioning extracted from the multitrack recording. For each round, we begin with a multitrack recording and isolate one instrument track as the reference. Before generation, we evaluate the remaining multitrack; if it does not contain at least one polyphonic voice throughout the test sample, then the sample lacks sufficient harmonic content to reliably discern chords and is excluded from the listening test. We extract chord symbols from the full multitrack and CLAP embeddings from the reference track, then generate an example for each tested model. The negative anchor is generated using the CLAP-conditioned Diff-A-Riff backbone, i.e., without chord or context conditioning. To facilitate comparison, the generated output is panned to the left, while the remaining multitrack is panned to the right.
² https://linktr.ee/lilactests
6. RESULTS
In what follows, we present and discuss the results of the evaluations outlined in Sec. 5.
Model                  APA ↑   FAD ↓   cMSE ↓
Diff-A-Riff* (432M)    0.63    0.812   0.206
+ Context              1.00    0.508   0.148
+ ControlNet (165M)    1.00    0.507   0.052
+ LiLAC
    H (32M)            1.00    0.506   0.057
    HT (48M)           1.00    0.507   0.056
    HR (47M)           1.00    0.509   0.056
    HTR (64M)          1.00    0.507   0.055
+ LiLAC* (47M)         1.00    0.508   0.070
Table 1. Objective metrics: Audio Prompt Adherence (APA) and Fréchet Audio Distance (FAD) for audio quality; chroma Mean Squared Error (cMSE) for chroma adherence. All models include CLAP conditioning except for Diff-A-Riff*. Additionally, all new control models are conditioned on chroma.
6.1 Objective Experiments
Table 1 shows an objective comparison between LiLAC and baseline models (see Sec. 5.3.3). The APA metric, a distribution-based score bounded in the range [0, 1] [46], shows that all models conditioned on CLAP embeddings and chroma outperform the context-free Diff-A-Riff baseline (Diff-A-Riff*) and match the performance of the context-conditioned variant (+ Context). This indicates that chromagram-based conditioning is consistently effective at guiding alignment with the multitrack, regardless of architectural variations. Analogously, FAD scores remain comparable across all setups, suggesting that adding conditioning does not degrade audio quality. In particular, the near-identical FAD scores between the Diff-A-Riff + Context baseline, ControlNet, and LiLAC models indicate that post-hoc conditioning does not compromise fidelity (low FAD, high APA) and may enhance controllability (high APA).
While these results provide a statistical perspective on the overall distribution of generated outputs and their alignment with musical context based on chroma alone, they do not directly reveal how faithfully models respond to specific chroma inputs. A pairwise metric is required to evaluate adherence on a per-example basis. For this purpose, we report cMSE. Here, ControlNet achieves the best absolute adherence to the chroma input, though all LiLAC configurations deliver competitive results while maintaining a considerably smaller parameter count (i.e., 165M in ControlNet versus 32M in LiLAC_H).³ Among them, the HTR variant shows stronger performance, with cMSE close to ControlNet. Interestingly, LiLAC*, a lightweight architecture inspired by prior work [37], exhibits limitations in controlling fine melodic nuances; thus, we exclude it from further evaluations.
³ As the size of the adaptor blocks scales with the encoder's channel dimensions, the exact parameter reduction depends on the specific backbone architecture. For Diff-A-Riff [6], our lightest configuration (LiLAC_H) uses only 19% of ControlNet's parameters. This efficiency scales with larger architectures: for Stable Audio Open [48], the parameter usage drops to 10.2%, while for image models such as SD V2 [49], it reduces dramatically to 2.6%.
Model          Aligned       Misaligned    None
Diff-A-Riff    0.65 (0.17)   0.65 (0.21)   0.17 (0.21)
+ LiLAC_H      0.67 (0.06)   0.55 (0.06)   0.57 (0.06)
+ LiLAC_HT     0.67 (0.06)   0.54 (0.06)   0.58 (0.05)
+ LiLAC_HR     0.67 (0.06)   0.54 (0.06)   0.58 (0.06)
+ LiLAC_HTR    0.67 (0.05)   0.53 (0.06)   0.58 (0.05)
+ ControlNet   0.67 (0.05)   0.52 (0.06)   0.60 (0.05)
Table 2. CS ↑ (a) (cMSE ↓) for models trained with chroma conditioning and the Diff-A-Riff baseline. Each column corresponds to a different setting: aligned pairs of chromagram and CLAP embedding, misaligned pairs, or without CLAP embedding (see Sec. 5.3.4). (a) CS ↓ in the None case.
We next investigate how the new chroma adaptors interact with the backbone's default CLAP conditioning (see Sec. 5.3.4). Table 2 reports CLAP similarity scores (CS) and chroma reconstruction error (cMSE) across various model configurations. The results indicate a degree of information overlap between CLAP embeddings and chroma conditions. When both are present but convey conflicting information, CLAP similarity scores noticeably decline, suggesting that chroma conditioning can partially overwrite or interfere with the original CLAP guidance. However, this drop is moderate, and the model still reconstructs a faithful chromagram, as reflected in the low cMSE even in the misaligned setting. This implies that the model makes reasonable sense of the conflicting conditionals, following the CLAP guidance as much as possible while still producing generations that faithfully adhere to the chroma conditioning.
Interestingly, in the CLAP-unconditioned setting, cMSE remains low while CS stays high. This supports our hypothesis of information redundancy between chromagrams and CLAP embeddings. It suggests that the model can still infer high-level cues, such as timbre, from chromagrams alone, allowing it to produce content that aligns with what CLAP would indicate, even in its absence. That said, the model's ability to maintain reasonable CLAP similarity without explicit CLAP input does not necessarily mean it reconstructs the original instrument associated with the chromagram. More likely, it infers a plausible instrument class, such as strings or pads, that fits the given pitch range and harmonic content. This suggests that the model interprets chroma conditioning sensibly: when CLAP is available, it may refine or overwrite the inferred instrument identity; when absent, it defaults to generating something timbrally coherent that aligns with the chroma structure.
Among all setups, the LiLAC architecture demonstrates the least susceptibility to CLAP interference, with its lightweight variant performing best under both conflicting and CLAP-ablated conditions. This robustness is likely due to its simplified structure, which limits the extent to which the model encodes auxiliary information alongside the CLAP embedding.

Condition      Aligned       Misaligned    None
Diff-A-Riff    0.65 (0.17)   0.65 (0.21)   0.17 (0.21)
+ Chord        0.65 (0.14)   0.64 (0.20)   0.36 (0.18)
+ Thresh       0.65 (0.08)   0.62 (0.17)   0.45 (0.10)
+ Chroma       0.67 (0.06)   0.55 (0.06)   0.57 (0.06)
Table 3. CS ↑ (a) (cMSE ↓) for LiLAC_H trained on chord, thresholded chroma, and chroma, with the Diff-A-Riff baseline (see Sec. 5.3.4). (a) CS ↓ in the None case.
To reduce redundancy with CLAP, we evaluate LiLAC_H, our lightest and best-performing variant, using compressed melodic inputs: thresholded chromagrams (thresh) and chords. Table 3 compares performance across CLAP-aligned, misaligned, and unconditioned (None) settings (see Sec. 5.3.4).
Intuitively, less specific conditioning signals should exhibit lower overlap with CLAP embeddings. This is confirmed by the results: thresholding the chromagram significantly reduces the conflict between the two modalities. Among the tested conditions, chords show the least interference, with almost no drop in CLAP similarity in the misaligned case. Notably, even without CLAP conditioning, the chord model achieves a CLAP score of 0.36, up from 0.17 in the unconditioned baseline, suggesting that a substantial amount of global musical information (e.g., tempo, tonality) is implicitly captured by the chord input alone.
However, this robustness may come at the cost of reduced influence from the conditioning. We observe that weaker conditions, like chords, are more easily overridden by CLAP. For instance, the cMSE rises sharply from 0.08 to 0.17 when CLAP is misaligned, despite the sparsity of the chord-based input. This indicates that even minimal shifts in CLAP can significantly distort the resulting chroma, especially when the conditioning is less constraining. To confirm this, we re-evaluated the cMSE under the aligned setting for chords and observed a consistent value around 0.14. This validates that the observed jump in cMSE under misalignment reflects a genuine interference effect, rather than noise or sampling variance.
Overall, these findings highlight that our proposed conditioning mechanisms, especially when simplified, can effectively cooperate with CLAP guidance. When signals are aligned, they yield coherent generations that satisfy both timbral and harmonic constraints. When conflicting, the model tends to prioritise chroma while still leveraging CLAP to guide plausible generation. Even in the absence of CLAP, the model retains the ability to infer global cues such as tempo, tonality, and timbre from abstract melodic inputs, demonstrating the versatility and robustness of the conditioning approach.
6.2 Subjective Experiments
Model         SAQ ↑         SCA ↑
Reference     61.5 ± 4.7    82.3 ± 4.3
ControlNet    56.4 ± 4.2    66.5 ± 5.1
LiLAC_H       60.6 ± 4.6    65.9 ± 4.8
LiLAC_HTR     58.7 ± 4.2    68.6 ± 4.9
Anchor        26.4 ± 4.9    12.5 ± 2.5
Table 4. Listener ratings with 95% confidence intervals for SAQ and SCA across evaluated models (see Sec. 5.4).
Table 4 presents results from our Subjective Audio Quality (SAQ) and Subjective Condition Adherence (SCA) evaluations (see Sec. 5.4). Across both tests, we collected a total of 1,250 ratings from 25 participants (11 for SCA and 14 for SAQ). In the SAQ test, all three models were rated comparably in terms of audio quality. The two LiLAC variants, especially the lightweight configuration, slightly outperformed ControlNet based on both mean score and average rank. However, these differences did not reach statistical significance, either among the three models or relative to the original stem, as indicated by the Friedman test (p > 0.10). This aligns with expectations: given that the backbone is frozen, it is unlikely that control mechanisms alone would improve audio quality. Nevertheless, the slightly better performance of LiLAC models may reflect a better preservation of pre-trained features, due to their smaller and more focused adapter layers.
The SCA results show similar outcomes. All models significantly outperform unconditional generation (p < 0.0001), and again show remarkable similarity to one another (p > 0.8). Regarding conditional adherence to the original stem, there is a more noticeable gap to the reference; however, LiLAC_HTR comes closer than other models, showing only marginally significant differences (p = 0.09).
Overall, these findings are consistent with our expectations: while audio quality remains stable across control methods, condition adherence benefits more clearly from architectural improvements like LiLAC_HTR.
7. CONCLUSION
In this paper, we introduced a lightweight and flexible control methodology for text-to-audio diffusion models, inspired by ControlNet but with a significantly reduced parameter count. Our approach supports multiple configurations to accommodate control signals of varying complexity, while maintaining, and in some cases improving upon, the audio quality and condition adherence of the original method, as demonstrated in both objective and subjective evaluations. We believe this modular and efficient framework paves the way for more expressive and musically useful control modalities in audio generation.
As future work, we plan to explore the modularity and composability of multiple LiLAC controllers operating over a shared backbone, as well as investigate whether our method generalises to non-convolutional architectures such as DiT.⁴
⁴ Subsequent to our initial submission, we have successfully applied this architecture to the DiT-based Diff-A-Riff 2 [32] and seen similar results, suggesting that our methodology generalises well beyond convolutional architectures. For this, we adapted the ControlNet methodology pioneered in [50] and again substituted the cloned transformer blocks for the pre-existing frozen blocks wrapped by our adaptor layers.
8. ETHICS STATEMENT
Sony Compu e Science Labo a o ies is commi ed o ex-
plo ing he posi i e applica ions o AI in music c ea ion.
We collabo a e wi h a is s o de elop inno a i e echnolo-
gies ha enhance c ea i i y. We uphold s ong e hical s an-
da ds and ac i ely engage wi h he music communi y and
indus y o align ou p ac ices wi h socie al alues. Ou
eam is mind ul o he ex ensi e wo k ha songw i e s and
eco ding a is s dedica e o hei c a . Ou echnology
mus espec , p o ec , and honou his commi men .
LiLAC p esen s a ligh e , mo e lexible con ol
pa adigm ha enables musicians o exe cise ine-g ained
con ol o e gene a i e model ou pu s, ensu ing comple e
a is ic agency. We hope he p esen a ion o his model will
encou age comme cial ex - o-audio model p o ide s o in-
co po a e simila addi ional con ols in o hei wo k lows,
he eby empowe ing use s wi h g ea e c ea i e au onomy.
Both LiLAC and its backbone model, Diff-A-Riff, have
been trained exclusively on datasets that were legally
acquired for internal research and development purposes.
Consequently, neither the training data nor the models can
be made publicly available. We remain committed to full
legal compliance and proactively address all ethical
considerations in our work.
9. REFERENCES
[1] J. Schneider, “Explainable Generative AI (GenXAI):
A Survey, Conceptualization, and Research Agenda,”
Artificial Intelligence Review, vol. 57, 2024.
[2] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel,
M. Verzetti, A. Caillon, Q. Huang, A. Jansen,
A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour,
and C. Frank, “MusicLM: Generating Music From
Text,” 2023.
[3] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic,
W. Wang, and M. D. Plumbley, “AudioLDM:
Text-to-Audio Generation with Latent Diffusion Models,”
in International Conference on Machine Learning
(ICML), Hawaii, United States, 2023.
[4] Z. Evans, C. J. Carr, J. Taylor, S. H. Hawley, and
J. Pons, “Fast Timing-Conditioned Latent Audio Diffusion,”
in International Conference on Machine Learning
(ICML), Vienna, Austria, 2024.
[5] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve,
Y. Adi, and A. Défossez, “Simple and Controllable
Music Generation,” in 37th Conference on Neural
Information Processing Systems (NeurIPS), New Orleans,
United States, 2023.
[6] J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and
S. Lattner, “Diff-A-Riff: Musical Accompaniment
Co-creation via Latent Diffusion Models,” in Proc. of the
25th Int. Society for Music Information Retrieval Conf.
(ISMIR), San Francisco, United States, 2024.
[7] B. Elizalde, S. Deshmukh, M. A. Ismail, and H. Wang,
“CLAP: Learning Audio Concepts From Natural Language
Supervision,” in IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP),
Rhodes Island, Greece, 2023.
[8] Y.-K. Wu, C.-Y. Chiu, and Y.-H. Yang, “JukeDrummer:
Conditional Beat-aware Audio-domain Drum Accompaniment
Generation via Transformer VQ-VAE,”
in Proc. of the 23rd Int. Society for Music Information
Retrieval Conf. (ISMIR), Bengaluru, India, 2022.
[9] L. Zhang, A. Rao, and M. Agrawala, “Adding Conditional
Control to Text-to-Image Diffusion Models,”
in IEEE International Conference on Computer Vision
(ICCV), Paris, France, 2023.
[10] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan,
“Music ControlNet: Multiple Time-varying Controls
for Music Generation,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing (TASLP),
vol. 32, Nov. 2023.
[11] S. Hou, S. Liu, R. Yuan, W. Xue, Y. Shan, M. Zhao,
and C. Zhang, “Editing Music with Melody and Text:
Using ControlNet for Diffusion Transformer,” in
International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), Hyderabad, India, 2025.
[12] S. Zhao, D. Chen, Y.-C. Chen, J. Bao, S. Hao, L. Yuan,
and K.-Y. K. Wong, “Uni-ControlNet: All-in-One
Control to Text-to-Image Diffusion Models,” in 37th
Conference on Neural Information Processing Systems
(NeurIPS), New Orleans, United States, 2023.
[13] D. Zavadski, J.-F. Feiden, and C. Rother, “ControlNet-XS:
Designing an Efficient and Effective Architecture
for Controlling Text-to-Image Diffusion Models,”
2023.
[14] F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf,
“Moûsai: Text-to-Music Generation with
Long-Context Latent Diffusion,” 2023.
[15] Q. Huang, D. S. Park, T. Wang, T. I. Denk, A. Ly,
N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. Frank, J. Engel,
Q. V. Le, W. Chan, Z. Chen, and W. Han,
“Noise2Music: Text-conditioned Music Generation
with Diffusion Models,” 2023.
[16] P. Li, B. Chen, Y. Yao, Y. Wang, A. Wang, and
A. Wang, “JEN-1: Text-Guided Universal Music Generation
with Omnidirectional Diffusion Models,” in
IEEE Conference on Artificial Intelligence (CAI),
Singapore, Singapore, 2024.
[17] H. Liu, Y. Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian,
Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley,
“AudioLDM 2: Learning Holistic Audio Generation
with Self-supervised Pretraining,” IEEE/ACM Transactions
on Audio, Speech and Language Processing,
vol. 32, May 2024.
[18] Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and
D. P. W. Ellis, “MuLan: A Joint Embedding of Music
Audio and Natural Language,” in Proc. of the 23rd Int.
Society for Music Information Retrieval Conf. (ISMIR),
Bengaluru, India, 2022.
[19] Y. Wu, K. Chen, T. Zhang, Y. Hui, M. Nezhurina,
T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale
Contrastive Language-Audio Pretraining with Feature
Fusion and Keyword-to-Caption Augmentation,” in IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP), Rhodes Island, Greece,
2023.
[20] Y. Zhang, Y. Ikemiya, G. Xia, N. Murata, M. A.
Martínez-Ramírez, W.-H. Liao, Y. Mitsufuji, and
S. Dixon, “MusicMagus: Zero-Shot Text-to-Music
Editing via Diffusion Models,” in The 33rd International
Joint Conference on Artificial Intelligence (IJCAI),
Jeju, Korea, 2024.
[21] H. Manor and T. Michaeli, “Zero-Shot Unsupervised
and Text-Based Audio Editing Using DDPM Inversion,”
in International Conference on Machine Learning
(ICML), Vienna, Austria, 2024.
[22] J. Nistal, S. Lattner, and G. Richard, “DrumGAN: Synthesis
of Drum Sounds With Timbral Feature Conditioning
Using Generative Adversarial Networks,” in
Proc. of the 21st Int. Society for Music Information
Retrieval Conf. (ISMIR), Montréal, Canada, 2020.
[23] J. Nistal, C. Aouameur, I. Velarde, and S. Lattner,
“DrumGAN VST: A Plugin for Drum Sound
Analysis/Synthesis With Autoencoding Generative
Adversarial Networks,” in International Conference on
Machine Learning (ICML), Baltimore, United States,
2022.
[24] R. Huang, J. Huang, D. Yang, Y. Ren, L. Liu, M. Li,
Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-An-Audio:
Text-To-Audio Generation with Prompt-Enhanced
Diffusion Models,” in International Conference on
Machine Learning (ICML), Hawaii, United States, 2023.
[25] S. Lattner and M. Grachten, “High-Level Control of
Drum Track Generation Using Learned Patterns of
Rhythmic Interaction,” in Workshop on Applications of
Signal Processing to Audio and Acoustics (WASPAA),
New Paltz, United States, 2019.
[26] M. Grachten, S. Lattner, and E. Deruty, “BassNet: A
Variational Gated Autoencoder for Conditional Generation
of Bass Guitar Tracks with Learned Interactive
Control,” Applied Sciences, vol. 10, 2020.
[27] J. D. Parker, J. Spijkervet, K. Kosta, F. Yesiler,
B. Kuznetsov, J.-C. Wang, M. Avent, J. Chen, and
D. Le, “StemGen: A music generation model that
listens,” in International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Seoul, Korea,
2024.
[28] M. Pasini, M. Grachten, and S. Lattner, “Bass Accompaniment
Generation via Latent Diffusion,” in
International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), Seoul, Korea, 2024.
[29] G. Mariani, I. Tallini, E. Postolache, M. Mancusi,
L. Cosmo, and E. Rodolà, “Multi-Source Diffusion
Models for Simultaneous Music Generation and
Separation,” in International Conference on Learning
Representations (ICLR), Vienna, Austria, 2024.
[30] G. Bindi and P. Esling, “Unsupervised Composable
Representations for Audio,” in Proc. of the 25th Int.
Society for Music Information Retrieval Conf. (ISMIR),
San Francisco, United States, 2024.
[31] T. Karchkhadze, M. R. Izadi, and S. Dubnov, “Simultaneous
Music Separation and Generation Using
Multi-Track Latent Diffusion Models,” 2024.
[32] J. Nistal, M. Pasini, and S. Lattner, “Improving Musical
Accompaniment Co-creation via Diffusion Transformers,”
in 38th Conference on Neural Information
Processing Systems (NeurIPS), Vancouver, Canada,
2024.
[33] C. Donahue, A. Caillon, A. Roberts, E. Manilow, P. Esling,
A. Agostinelli, M. Verzetti, I. Simon, O. Pietquin,
N. Zeghidour, and J. Engel, “SingSong: Generating
musical accompaniments from singing,” 2023.
[34] M. Levy, B. D. Giorgi, F. Weers, A. Katharopoulos,
and T. Nickson, “Controllable Music Production with
Diffusion Models and Guidance Gradients,” in 37th
Conference on Neural Information Processing Systems
(NeurIPS), New Orleans, United States, 2023.
[35] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and
N. J. Bryan, “DITTO: Diffusion Inference-Time
T-Optimization for Music Generation,” in International
Conference on Machine Learning (ICML), Vienna,
Austria, 2024.
[36] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and
N. Bryan, “DITTO-2: Distilled Diffusion Inference-Time
T-Optimization for Music Generation,” in Proc.
of the 25th Int. Society for Music Information Retrieval
Conf. (ISMIR), San Francisco, United States, 2024.
[37] H. F. García, O. Nieto, J. Salamon, B. Pardo, and
P. Seetharaman, “Sketch2Sound: Controllable Audio
Generation via Time-Varying Signals and Sonic
Imitations,” 2024.
[38] M. Pasini, S. Lattner, and G. Fazekas, “Music2Latent:
Consistency Autoencoders for Latent Audio Compression,”
in Proc. of the 25th Int. Society for Music Information
Retrieval Conf. (ISMIR), San Francisco, United
States, 2024.
[39] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating
the Design Space of Diffusion-Based Generative
Models,” in 36th Conference on Neural Information
Processing Systems (NeurIPS), New Orleans, United
States, 2022.
[40] B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar,
E. Battenberg, and O. Nieto, “Librosa: Audio and Music
Signal Analysis in Python,” in Proc. of the 14th
Python in Science Conf. (SCIPY), Austin, Texas, 2015.
[41] T. Akama, N. Polouliakh, and H. Kishi, “Deep12: Mu-
sic Analysis AI,” 2023.
[42] J. Ho and T. Salimans, “Classifier-Free Diffusion Guidance,”
in NeurIPS Workshop on Deep Generative Models
and Downstream Applications, Online, 2021.
[43] I. Loshchilov and F. Hutter, “Decoupled Weight Decay
Regularization,” in 7th International Conference
on Learning Representations (ICLR), New Orleans,
United States, 2019.
[44] I. Loshchilov and F. Hutter, “SGDR: Stochastic Gradient
Descent with Warm Restarts,” in 5th International Conference
on Learning Representations (ICLR), Toulon, France, 2017.
[45] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi,
“Fréchet Audio Distance: A Metric for Evaluating Music
Enhancement Algorithms,” in Interspeech, Graz,
Austria, 2019.
[46] M. Grachten, “Measuring Audio Prompt Adherence
with Distribution-based Embedding Distances,” 2024.
[47] ITU-R BS.1534-3, “Methods for the subjective assessment
of small impairments in audio systems,” 2015.
[48] Z. Evans, J. D. Parker, C. J. Carr, Z. Zukowski, J. Taylor,
and J. Pons, “Stable Audio Open,” 2024.
[49] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and
B. Ommer, “High-Resolution Image Synthesis with
Latent Diffusion Models,” in Computer Vision and Pattern
Recognition (CVPR), New Orleans, United States,
2022.
[50] J. Chen, Y. Wu, S. Luo, E. Xie, S. Paul, P. Luo,
H. Zhao, and Z. Li, “PIXART-δ: Fast and Controllable
Image Generation with Latent Consistency Models,”
2024.