LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation

Author: Tom Baker; Javier Nistal

Publisher: Zenodo

DOI: 10.5281/zenodo.17706393

Source: https://zenodo.org/records/17706393/files/000033.pdf

LiLAC: A LIGHTWEIGHT LATENT CONTROLNET FOR MUSICAL
AUDIO GENERATION
Tom Baker 1,2∗
1 University of Manchester
[email protected]
Javier Nistal 2
2 Sony CSL - Paris
[email protected]
ABSTRACT
Text-to-audio diffusion models produce high-quality and diverse music but often lack the fine-grained, time-varying controls essential for music production. ControlNet enables attaching external controls to a pre-trained generative model by cloning and fine-tuning its encoder on new conditionings. However, this approach incurs a large memory footprint and restricts users to a fixed set of controls. We propose a lightweight, modular architecture that considerably reduces parameter count while matching ControlNet in audio quality and condition adherence. Our method offers greater flexibility and significantly lower memory usage, enabling more efficient training and deployment of independent controls. We conduct extensive objective and subjective evaluations and provide numerous audio examples on the accompanying website.¹
1. INTRODUCTION
With the rise of generative models, new challenges have emerged in the field of human-computer interaction, especially in how users interact with these systems [1]. This issue is particularly significant in domains of artistic expression, such as music creation, where users need interfaces that allow for both high-level control over abstract concepts and precise manipulation of low-level details. Achieving this balance between creative freedom and technical control is critical for musicians and composers when working with generative systems.
Generative models for music have explored various control mechanisms to bridge the gap between user intention and machine output [2–6]. However, there is no clear consensus on which control modalities or signals are most effective. Currently, one of the most common methods of interacting with generative music systems is through text input [2–4], which leverages shared embedding spaces of text and audio information [7]. While this approach has facilitated significant advancements, it lacks the fine-grained control required for detailed music production.
¹ https://lightlatentcontrol.github.io
∗ Research undertaken while an intern at Sony CSL - Paris
© T. Baker and J. Nistal. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: T. Baker and J. Nistal, “LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In response to these limitations, some studies have experimented with conditioning generative models on time-varying signals, such as pitch or dynamics [5, 8]. These provide more specificity but are still constrained by a key issue: control signals are typically required during the training process of the generative model. Once the model has learned to respond to these inputs, the control mechanisms become fixed and inflexible.
Modular approaches like ControlNet [9] enable flexible, post-hoc control of image generative models and have recently been adapted to music generation [10, 11]. While effective, these techniques come with important limitations: memory-intensive clones of the network for each new control [9], or rigid multi-control schemes [12].
In this work, we adapt ControlNet’s framework to musical audio generation, introducing a modular architecture that replaces cloned encoder blocks with lightweight convolutional layers. While similar parameter-efficient approaches exist in computer vision [13], our method is the first to demonstrate this for music, enabling flexible training of multiple independent control models (e.g., chroma, chords) without retraining the backbone. By decoupling controls into task-specific modules, users can deploy only the necessary conditions during inference, often with lower memory overhead than a single traditional ControlNet branch. Crucially, our evaluations, spanning objective metrics (FAD, APA) and subjective listening tests, show that this streamlined design achieves performance comparable to ControlNet in audio quality and condition adherence, establishing a practical balance between flexibility and fidelity for music generation workflows.
2. RELATED WORK
Controls for music generation models encompass diverse input modalities. Text prompting is widely used [2–4, 14–17], and recent joint text-audio embeddings allow zero-shot control without paired data [7, 18, 19], enabling edits like “make this piece of music more happy” [20, 21]. However, text can be ambiguous and less suited for precise control in music production. Finer-grained controls, such as melody [5], rhythm and dynamics [8], or timbral features [22, 23], offer more precision. Multimodal inputs like images or video also expand creative possibilities [24]. Audio-based conditioning is particularly effective for tasks like accompaniment generation and style transfer [25–31], with models like Diff-A-Riff [6, 32] and SingSong [33] leveraging input audio to guide generation. Control integration strategies vary, from training-time conditioning to inference-time guidance [34] or latent optimization [35, 36]. Inspired by ControlNet, recent methods introduce auxiliary networks for control [10, 11], while others, like Sketch2Sound [37], explore lightweight alternatives. Despite this progress, optimal strategies remain unclear. ControlNet-style designs offer modularity but are often resource-intensive or inflexible. To address this, we propose a new lightweight and modular architecture that retains ControlNet’s strengths with improved efficiency.
3. BACKGROUND
This work builds on ControlNet [9] (see Sec. 3.2), a framework for introducing post-hoc controllability into pre-trained generative models. While our architecture is generalisable to any generative model, we utilise Diff-A-Riff [6] (see Sec. 3.1) as the backbone model throughout this paper. In the following sections, we provide an overview of these two architectures, laying the groundwork for the proposed methodology.
3.1 Diff-A-Riff
Diff-A-Riff [6] is a Latent Diffusion Model (LDM) designed to generate high-quality individual musical stems that align with a user-provided musical audio sample, denoted Context. The model employs a Consistency Autoencoder (CAE) [38] to compress raw audio into compact latent representations and utilises an Elucidated Diffusion Model (EDM) [39] framework. The CAE’s latent audio codec reduces 48 kHz audio to a 64-dimensional encoding at ∼12 Hz. Generation can be controlled via audio references, text prompts, or interpolations of both, facilitated by a shared CLAP embedding space [7, 19]. For further details, refer to the original paper [6].
3.2 ControlNet
ControlNet [9] introduces a method to augment large pre-trained text-to-image diffusion models with new controls. It achieves this by freezing the parameters of the original model, or backbone, and introducing a so-called adaptor branch: a trainable copy of the backbone’s encoding layers. This branch processes both original inputs and new conditional signals, feeding activations back through zero-initialised convolutions while reusing only the original training objective.
The decoupled architecture allows for conditioning with limited specialised data, enabling diverse controls (edges, depth maps, segmentation, poses) without compromising the pre-trained backbone’s capabilities.
ControlNet has been successfully adapted to musical audio, providing time-frequency controls like pitch or loudness [10, 11]. Below, we provide a brief overview of the details relevant to this paper.
3.2.1 Architecture
The architecture is displayed in Fig. 1. For layer $l$, we utilise both a frozen encoder block $F_l(x_{l-1}, e)$ and its cloned adaptor block counterpart $G_l(\hat{x}_{l-1}, e)$. Here, $x_{l-1}$ and $\hat{x}_{l-1}$ represent the outputs from the previous frozen and control layers, respectively, while $e$ denotes the backbone’s original conditional embeddings. Using zero convolutions $Z_s$, the skip connection $s_l$ is computed as:
$$s_l = F_l(x_{l-1}, e) + Z_s(G_l(\hat{x}_{l-1}, e)) \tag{1}$$
The input to the adaptor branch $\hat{x}_0$ is derived from the noised input tensor $x_0$ and the new conditional $c$ through the input zero convolution: $\hat{x}_0 = x_0 + Z_{in}(c)$.
ControlNet [9] demonstrates that using cloned encoders from the backbone model is critical for effective control signal integration, as randomly initialised convolutions compromise condition adherence, particularly when text conditioning is misaligned. Additionally, zero convolutions, used to introduce the control signal via skip connections, gradually introduce the signal during training, improving stability and output quality.
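
To make Eq. (1) concrete, the following is a minimal NumPy sketch of one ControlNet-style skip connection at initialisation. It is an illustration under stated assumptions, not the authors' implementation: `frozen_block` is a hypothetical toy stand-in for an encoder block, all convolutions are reduced to kernel size 1 on `(channels, frames)` arrays, and the tensor sizes are arbitrary.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """Kernel-size-1 convolution over a (channels, frames) array."""
    return weight @ x + bias[:, None]


def frozen_block(x, e):
    """Hypothetical toy stand-in for a frozen encoder block F_l(x, e)."""
    return np.tanh(x + e[:, None])


rng = np.random.default_rng(0)
C, N, T = 4, 12, 8  # latent channels, condition channels, frames (toy sizes)
x0 = rng.normal(size=(C, T))  # noised input
c = rng.normal(size=(N, T))   # new condition
e = rng.normal(size=C)        # backbone's original conditional embedding

# Zero convolutions: weights and biases start at zero.
W_in, b_in = np.zeros((C, N)), np.zeros(C)
W_s, b_s = np.zeros((C, C)), np.zeros(C)

x_hat0 = x0 + conv1x1(c, W_in, b_in)  # adaptor input: x0 + Z_in(c)
g1 = frozen_block(x_hat0, e)          # cloned block G_1 (equal to F_1 at init)
s1 = frozen_block(x0, e) + conv1x1(g1, W_s, b_s)  # Eq. (1)

# With zero-initialised convolutions the skip connection is untouched.
skip_unchanged_at_init = np.allclose(s1, frozen_block(x0, e))
```

Because both zero convolutions output zeros before any training step, the backbone's skip connection is initially unchanged, which is the gradual introduction of the control signal described above.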
4. LILAC
In this section, we introduce LiLAC and detail how its architecture, conditioning mechanisms, and training methodologies diverge from ControlNet.
4.1 Proposed Architecture
In Fig. 1, we depict the basic block of LiLAC’s architecture. Instead of cloning the backbone’s encoder, LiLAC performs a second pass through each of the frozen encoder blocks, wrapping these by smaller convolutional layers. Specifically, we introduce three layers per block: a head layer before the frozen block, a tail layer after the frozen block, and a residual connection to preserve condition information as it passes through the frozen block.
Formally, we replace the cloned encoder block $G_l(\hat{x}, e)$ in (1) with
$$G_l(\hat{x}, e) \approx \underbrace{I_t}_{\text{tail}}\big(F_l(\underbrace{I_h(\hat{x})}_{\text{head}}, e)\big) + \underbrace{Z_r(\hat{x})}_{\text{residual}},$$
where $I_h$ and $I_t$ represent the identity convolutions used as the head and tail layers (see Sec. 4.1.1), and $Z_r$ denotes the zero convolution used as the residual connection.
While other multi-control methodologies have proposed reintroducing the condition into each block [10], we found empirically that this approach does not improve condition adherence and adds redundant parameters.
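
The block described in this section (head layer, frozen block, tail layer, plus a residual path) can be sketched as follows. This is a simplified illustration, assuming kernel-size-1 convolutions without biases and a hypothetical toy `frozen_block`; the real encoder blocks and tensor shapes differ.

```python
import numpy as np


def conv1x1(x, weight):
    """Kernel-size-1 convolution (no bias) over a (channels, frames) array."""
    return weight @ x


def frozen_block(x, e):
    """Hypothetical toy stand-in for a frozen encoder block F_l(x, e)."""
    return np.tanh(x + e[:, None])


def lilac_block(x_hat, e, w_head, w_tail, w_res):
    """Tail(F_l(Head(x_hat), e)) + Residual(x_hat), reusing the frozen block."""
    h = conv1x1(x_hat, w_head)              # head layer before the frozen block
    f = frozen_block(h, e)                  # second pass through frozen weights
    return conv1x1(f, w_tail) + conv1x1(x_hat, w_res)  # tail + residual


rng = np.random.default_rng(0)
C, T = 4, 8  # toy sizes
x_hat = rng.normal(size=(C, T))
e = rng.normal(size=C)

# Head and tail start as identity maps, the residual starts at zero.
w_head, w_tail, w_res = np.eye(C), np.eye(C), np.zeros((C, C))

out = lilac_block(x_hat, e, w_head, w_tail, w_res)
matches_frozen_at_init = np.allclose(out, frozen_block(x_hat, e))
trainable_params = w_head.size + w_tail.size + w_res.size  # 3 * C * C
```

In this sketch only the three small weight matrices are trainable; the frozen block contributes no new parameters, which is where the saving relative to a cloned encoder comes from.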
4.1.1 Identity Convolutions
As discussed in Sec. 3.2.1, ensuring the adaptor branch leverages the backbone model’s knowledge while gradually introducing the conditional signal during training is crucial. We achieve this by initialising the adaptor pathway to mirror the backbone encoder through identity convolutions, $I$, in the head and tail layers, and zero convolutions, $Z$, in the residual layers.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1. ControlNet [9] (left) and LiLAC (right) adaptor blocks ($l=1$). The noisy input signal is denoted as $x_0$, and the conditional signal as $c$. The frozen encoder block is denoted $F_1$, and its cloned copy $G_1$. Identity $I$ and zero $Z$ convolutions, along with the skip connection pathway $s_1$, are also illustrated.
We initialise any n-dimensional identity convolution kernel biases to 0 and the weights $W_I[k_1, \ldots, k_n, i, j]$ as:
$$W_I[k_1, \ldots, k_n, i, j] = \begin{cases} 1 & k_1, \ldots, k_n = \frac{K-1}{2} \text{ and } i = j \\ 0 & \text{otherwise} \end{cases}$$
where $k_1, \ldots, k_n$ index spatial positions in the kernel, and $i, j$ index input and output channels. $K$ is the kernel size and must be odd; in our case, all convolution kernels are size 1 to maintain a lightweight network.
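
As an illustration of this initialisation, the sketch below builds an identity kernel with the index layout used above (spatial positions first, then input and output channels) and checks on a 1-D example that convolving with it returns the input unchanged. The helper names and the zero-padded `conv1d_same` routine are assumptions made for this example only.

```python
import numpy as np


def identity_kernel(channels, kernel_size, n_dims):
    """Weights W[k_1..k_n, i, j]: 1 at the kernel centre when i == j, else 0."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    w = np.zeros((kernel_size,) * n_dims + (channels, channels))
    centre = ((kernel_size - 1) // 2,) * n_dims
    w[centre] = np.eye(channels)
    return w


def conv1d_same(x, w):
    """Zero-padded 1-D convolution. x: (C_in, T), w: (K, C_in, C_out)."""
    k_size, _, c_out = w.shape
    pad = (k_size - 1) // 2
    frames = x.shape[1]
    x_padded = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, frames))
    for k in range(k_size):
        out += w[k].T @ x_padded[:, k:k + frames]
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 10))
y = conv1d_same(x, identity_kernel(channels=3, kernel_size=3, n_dims=1))
is_identity = np.allclose(y, x)

# With kernel size 1, as used in the paper, the kernel is a plain identity matrix.
w_k1 = identity_kernel(channels=2, kernel_size=1, n_dims=2)
```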
4.2 Conditioning Signals
As outlined in Section 4.1, the condition $c$ is introduced to the adaptor branch by adding it to the noised input $x_0$. However, this requires dimensional alignment between both tensors. For n-dimensional input data with $C$ channels, the noised input is represented as $x_0 \in \mathbb{R}^{C \times k_1 \times \ldots \times k_n}$. Conditions are provided as feature maps in the form $c \in \mathbb{R}^{N \times k_1 \times \ldots \times k_n}$, where $N$ is the number of condition channels. To align the condition with the input tensor, the initial zero convolution $Z_{in}$ also maps the condition’s channels from $N$ to $C$ before they are combined.
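
The channel alignment can be sketched with a kernel-size-1 convolution that maps N condition channels to C latent channels. The sizes below (12 chroma channels, 64 latent channels) follow the paper; the frame count and the `pointwise_conv` helper are assumptions for illustration.

```python
import numpy as np


def pointwise_conv(c, weight, bias):
    """Kernel-size-1 convolution mapping (N, *spatial) to (C, *spatial)."""
    out = np.tensordot(weight, c, axes=([1], [0]))
    return out + bias.reshape(-1, *([1] * (c.ndim - 1)))


rng = np.random.default_rng(0)
C, N = 64, 12  # latent channels and chroma channels
frames = 117   # assumed: roughly 10 s at ~11.7 Hz
x0 = rng.normal(size=(C, frames))
cond = rng.random(size=(N, frames))

# Z_in is zero-initialised, so the condition has no effect before training.
w_in, b_in = np.zeros((C, N)), np.zeros(C)
x_hat0 = x0 + pointwise_conv(cond, w_in, b_in)
unchanged_at_init = np.array_equal(x_hat0, x0)

# After training the weights are non-zero, but the shapes still line up.
w_trained = rng.normal(scale=0.01, size=(C, N))
x_hat0_trained = x0 + pointwise_conv(cond, w_trained, b_in)
```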
5. EXPERIMENTAL SETUP
5.1 Data
In this work, we use the same dataset from Diff-A-Riff [6], comprising 12,000 multi-track recordings, randomly split into 1 million 10-second segments. During training, the model receives one track as the target and a random combination of the remaining tracks as musical Context.
5.1.1 Conditions
In addition to the original training data, we pre-extract the required conditioning control signals from each audio track as specified below. For evaluation purposes, we select two distinct signals to address different use cases: chroma pitch class conditioning and chord conditioning.
Chroma. A chromagram is extracted from the target single track using the librosa library [40]. The 12 chroma bins are placed along the channel dimension, and the time scale is downsampled to match the frame rate of the latent audio codec used by the backbone model (∼11.7 Hz) [38].
Chord. We extract chord symbols for each frame from mixed multi-track audio using Deep12MIR [41] and convert them into a 12-dimensional chroma-like format. Each chord is encoded as a vector where pitch classes in the chord are assigned 1, the root note 2, and remaining entries 0. This format supports all chord symbols, with harmonically related chords receiving similar embeddings, and facilitates the use of arbitrary chord shapes absent in the training data.
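
The chord encoding described above can be sketched in a few lines. The pitch-class ordering starting at C and the small `CHORD_INTERVALS` table are assumptions made for this example; the paper does not specify the chord vocabulary or label format of the extractor.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Hypothetical interval table (semitones above the root), for illustration only.
CHORD_INTERVALS = {
    "maj": (0, 4, 7),
    "min": (0, 3, 7),
    "7": (0, 4, 7, 10),
    "maj7": (0, 4, 7, 11),
    "min7": (0, 3, 7, 10),
}


def encode_chord(root, quality):
    """12-dim vector: chord pitch classes = 1, root = 2, everything else = 0."""
    root_pc = PITCH_CLASSES.index(root)
    vec = [0] * 12
    for interval in CHORD_INTERVALS[quality]:
        vec[(root_pc + interval) % 12] = 1
    vec[root_pc] = 2  # the root overrides the plain chord-tone value
    return vec


c_major = encode_chord("C", "maj")
a_minor = encode_chord("A", "min")

# Pitch classes shared by the two chords (here C and E).
shared = [PITCH_CLASSES[i] for i in range(12) if c_major[i] and a_minor[i]]
```

C major and A minor share two of their three pitch classes, so their vectors overlap heavily, which illustrates how harmonically related chords end up with similar encodings.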
5.2 Training details
We follow the backbone’s training methodology [6]. Conditions Context, $e$, and $c$ are all independently dropped out with a 50% probability during the training of the adaptor branch [9], encouraging the model to use the new condition, $c$, as a replacement for missing information and enabling Classifier-Free Guidance (CFG) [42] at inference. All models are trained for 2 days on a single NVIDIA RTX 3090 GPU with a batch size of 128. We use the AdamW [43] optimizer with a base learning rate of $10^{-4}$ and a cosine annealing learning rate schedule with linear warmup [44].
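
A minimal sketch of independent condition dropout follows. Representing a dropped condition as `None` is an assumption for illustration; in practice a dropped condition is typically replaced by zeros or a learned null embedding, and the paper does not state which is used.

```python
import random


def drop_conditions(conditions, p_drop=0.5, rng=random):
    """Independently replace each condition with None with probability p_drop."""
    return {
        name: (None if rng.random() < p_drop else value)
        for name, value in conditions.items()
    }


rng = random.Random(0)  # seeded for reproducibility
batch = {"context": "context-latents", "e": "clap-embedding", "c": "chroma"}

trials = 10_000
drop_counts = {name: 0 for name in batch}
for _ in range(trials):
    kept = drop_conditions(batch, rng=rng)
    for name, value in kept.items():
        drop_counts[name] += value is None

# Each condition is dropped about half of the time, independently of the others.
drop_rates = {name: count / trials for name, count in drop_counts.items()}
```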
5.3 Objective Evaluation
In the following section, we present two objective evaluation strategies: one comparing our methodology against baseline models, and the other analysing condition conflicts and their impact on condition adherence.
5.3.1 Metrics
We employ a subset of the metrics used in the backbone paper [6]: Fréchet Audio Distance (FAD) [45] for audio quality and diversity; Audio Prompt Adherence (APA) [46] and CLAP Score (CS) for context and text adherence, respectively. For evaluating adherence to the LiLAC-provided chromagram conditioning, we calculate the chroma Mean Squared Error (cMSE) between input and output chromagrams. All metrics are computed by averaging across five sets of 500 samples. Distribution-based metrics such as FAD and APA are measured using the CLAP embedding space [7, 19] against a background dataset of 5,000 real audio samples.
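
As a sketch of the pairwise adherence metric, cMSE can be computed as the mean squared difference between two chromagrams of equal shape. The chroma extraction settings and any normalisation used in the paper are not specified here, so this function only illustrates the final averaging step.

```python
import numpy as np


def chroma_mse(chroma_in, chroma_out):
    """Mean squared error between two (12, frames) chromagrams."""
    a = np.asarray(chroma_in, dtype=float)
    b = np.asarray(chroma_out, dtype=float)
    if a.shape != b.shape:
        raise ValueError("chromagrams must have the same shape")
    return float(np.mean((a - b) ** 2))


# Toy example: the condition holds pitch class 0 for four frames.
target = np.zeros((12, 4))
target[0] = 1.0

perfect = target.copy()   # output follows the condition exactly
wrong = np.zeros((12, 4))
wrong[1] = 1.0            # output sits one semitone off throughout

cmse_perfect = chroma_mse(target, perfect)
cmse_wrong = chroma_mse(target, wrong)  # 8 mismatched cells out of 48
```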
5.3.2 Baselines
We compare LiLAC against four baseline configurations: the backbone model in three conditioning setups, namely totally unconditioned (Diff-A-Riff*), with CLAP (Diff-A-Riff), and with both CLAP and context (Diff-A-Riff + Context), as well as the original ControlNet architecture applied to the backbone.
5.3.3 Comparison and Ablation
With the aim of keeping the architecture as light as possible, we conduct an ablation study of each of the three LiLAC adaptor layers described in Section 4.1: Head (H), Tail (T), and Residual (R). We evaluate their performance relative to the parameter overhead (see Tab. 1). Additionally, we test a model variant denoted LiLAC* that avoids a second pass through the backbone model by directly inserting the new adaptor layers as part of the backbone’s forward pass, similar in idea to previous work [37] (although notably we do not further fine-tune our backbone model).
For the objective evaluations, we deliberately employ raw chromagram conditioning due to its high information density, enabling us to establish a more accurate upper bound on the model’s conditional adherence performance. Since each audio signal produces only a single chromagram, this allows for precise and unambiguous objective comparisons using the cMSE. In contrast, chord symbols can have multiple valid interpretations for a single stem, and as there is no unambiguous single way to evaluate adherence across a variety of instrument classes, we reserve these for the subjective test (see Sec. 5.4).
5.3.4 Conflictive Conditioning
Next, we evaluate how LiLAC and ControlNet-style models respond to conflicting CLAP and chroma conditionings (e.g., a solo violin CLAP embedding with a polyphonic chroma condition). Certain conditions, especially those extracted directly from audio, can be "over-specified", that is, contain redundant information. For example, chromagrams may also include traces of timbre and pitch range due to lower chroma bin resolution at lower frequencies. We measure the change in CLAP Score across different conditioning setups to quantify how these added control conditions may override or "leak" into the backbone model’s original CLAP conditioning.
We compare three setups: 1) CLAP embeddings and control signals are aligned (i.e., extracted from the same source audio signal); 2) CLAP embeddings and control signals are misaligned (sourced from different audio examples); and 3) the model receives only the control signal, with CLAP conditioning omitted, labelled none. Using these setups, we evaluate the following variables:
Impact of Architecture: We compare all proposed LiLAC configurations as well as ControlNet to assess which architecture is more susceptible to CLAP leakage, where over-specified control signals (e.g., chroma) can dominate or obscure CLAP’s condition. By contrasting performance in aligned vs. misaligned configurations, we quantify how strongly each model prioritises information from the new control signals over pre-existing CLAP embeddings.
Impact of Control Signal Specificity: We evaluate three control signal types: (1) Chroma, an over-specified signal containing pitch, timing, and timbral information; (2) Thresholded Chroma, a variant where amplitudes ≥0.9 are clipped to 1 and all others to 0, reducing fine timbral details to minimise leakage; and (3) Chords, a well-specified signal conveying only harmonic structure. By analysing the difference in CLAP Score across these conditions, we quantify whether reducing surplus information in the conditional signal leads to more faithful adherence to the CLAP embedding.
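
The thresholded chroma variant can be sketched as a simple binarisation. This assumes chroma amplitudes are already normalised to the range [0, 1]; the function name is illustrative.

```python
import numpy as np


def threshold_chroma(chroma, threshold=0.9):
    """Set amplitudes >= threshold to 1 and all others to 0."""
    return (np.asarray(chroma) >= threshold).astype(np.float32)


# Toy chromagram excerpt: 3 pitch classes x 2 frames.
frame = np.array([
    [1.00, 0.95],
    [0.89, 0.90],
    [0.20, 0.00],
])
binary = threshold_chroma(frame)
```

Only the dominant pitch classes survive, so the weaker entries that may carry timbral detail are discarded.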
5.4 Subjective Evaluation
To subjectively evaluate LiLAC’s effectiveness in conditioning the diffusion model, we conduct two listening tests loosely following MUSHRA guidelines [47] (i.e., using the true accompaniment as hidden reference in our case). In both tests, participants rate sets of audio samples on a 100-point scale (0: poor, 100: excellent) based on two criteria: audio quality and subjective adherence, respectively, for each questionnaire.² Each test comprises 10 questions, with each presenting 5 unlabeled, randomly ordered samples corresponding to LiLAC_H, LiLAC_HTR, ControlNet, the ground-truth reference track and a negative anchor. Further details about each test are given below.
5.4.1 Subjective Audio Quality (SAQ):
The SAQ test evaluates whether the control models audibly degrade the output quality. Participants are asked to rank each set of examples based on perceived audio quality and the presence of artifacts. For each question round, we start from a reference track, randomly sampled from the validation set, and extract the chromas and CLAP embeddings. Using these as conditioning, we generate an example with each of the three evaluated models. To create the negative anchor, we apply heavy compression (16 kbps MP3) to the ground truth track. The ground truth audio is used as a positive anchor. Shared chroma and CLAP conditioning in this test ensures that all examples are comparable, as they should ideally exhibit the same pitch distribution and timbre. This approach encourages participants to focus solely on evaluating audio quality.
5.4.2 Subjective Condition Adherence (SCA):
The SCA test evaluates the models’ ability to adhere to chord conditioning, chosen for its flexibility in generating varied outputs (e.g., chords, bass lines, melodies). Participants rank how well the generated output aligns harmonically with the chord conditioning extracted from the multitrack recording. For each round, we begin with a multitrack recording and isolate one instrument track as the reference. Before generation, we evaluate the remaining multitrack; if it does not contain at least one polyphonic voice throughout the test sample, then the sample lacks sufficient harmonic content to reliably discern chords and is excluded from the listening test. We extract chord symbols from the full multitrack and CLAP embeddings from the reference track, then generate an example for each tested model. The negative anchor is generated using the CLAP-conditioned Diff-A-Riff backbone, i.e., without chord or context conditioning. To facilitate comparison, the generated output is panned to the left, while the remaining multitrack is panned to the right.
² https://linktr.ee/lilactests
6. RESULTS
In what follows, we present and discuss the results of the evaluations outlined in Sec. 5.
Model                   APA ↑   FAD ↓   cMSE ↓
Diff-A-Riff* (432M)     0.63    0.812   0.206
+ Context               1.00    0.508   0.148
+ ControlNet (165M)     1.00    0.507   0.052
+ LiLAC
    H (32M)             1.00    0.506   0.057
    HT (48M)            1.00    0.507   0.056
    HR (47M)            1.00    0.509   0.056
    HTR (64M)           1.00    0.507   0.055
+ LiLAC* (47M)          1.00    0.508   0.070
Table 1. Objective metrics: Audio Prompt Adherence (APA) and Fréchet Audio Distance (FAD) for audio quality; chroma Mean Squared Error (cMSE) for chroma adherence. All models include CLAP conditioning except for Diff-A-Riff*. Additionally, all new control models are conditioned on chroma.
6.1 Objective Experiments
Table 1 shows an objective comparison between LiLAC and baseline models (see Sec. 5.3.3). The APA metric, a distribution-based score bounded in the range [0,1] [46], shows that all models conditioned on CLAP embeddings and chroma outperform the context-free Diff-A-Riff baseline (Diff-A-Riff*) and match the performance of the context-conditioned variant (+ Context). This indicates that chromagram-based conditioning is consistently effective at guiding alignment with the multitrack, regardless of architectural variations. Analogously, FAD scores remain comparable across all setups, suggesting that adding conditioning does not degrade audio quality. In particular, the near-identical FAD scores between the Diff-A-Riff + Context baseline, ControlNet and LiLAC models indicate that post-hoc conditioning does not compromise fidelity (low FAD, high APA) and may enhance controllability (high APA).
While these results provide a statistical perspective on the overall distribution of generated outputs and their alignment with musical context based on chroma alone, they do not directly reveal how faithfully models respond to specific chroma inputs. A pairwise metric is required to evaluate adherence on a per-example basis. For this purpose, we report cMSE. Here, ControlNet achieves the best absolute adherence to the chroma input, though all LiLAC configurations deliver competitive results while maintaining a considerably smaller parameter count (i.e., 165M in ControlNet versus 32M in LiLAC_H).³ Among them, the HTR variant shows stronger performance, with cMSE close to ControlNet. Interestingly, LiLAC*, a lightweight architecture inspired by prior work [37], exhibits limitations in controlling fine melodic nuances; thus, we exclude it from further evaluations.
³ As the size of the adaptor blocks scales with the encoder’s channel dimensions, the exact parameter reduction depends on the specific backbone architecture. For Diff-A-Riff [6], our lightest configuration (LiLAC_H) uses only 19% of ControlNet’s parameters. This efficiency scales with larger architectures: for Stable Audio Open [48], the parameter usage drops to 10.2%, while for image models such as SD V2 [49], it reduces dramatically to 2.6%.
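
The parameter savings quoted for Diff-A-Riff can be checked directly from the adaptor sizes listed in Table 1. The short sketch below uses those rounded figures (in millions of parameters), so the resulting percentages are approximate.

```python
# Adaptor sizes in millions of parameters, as listed in Table 1.
ADAPTOR_PARAMS_M = {
    "ControlNet": 165,
    "LiLAC_H": 32,
    "LiLAC_HT": 48,
    "LiLAC_HR": 47,
    "LiLAC_HTR": 64,
    "LiLAC*": 47,
}


def share_of_controlnet(name):
    """Adaptor size as a fraction of the ControlNet adaptor size."""
    return ADAPTOR_PARAMS_M[name] / ADAPTOR_PARAMS_M["ControlNet"]


percent_of_controlnet = {
    name: round(100 * share_of_controlnet(name), 1) for name in ADAPTOR_PARAMS_M
}
```

With these rounded sizes, LiLAC_H comes out at about 19.4% of ControlNet's adaptor parameters, in line with the 19% figure above, and the largest LiLAC configuration (HTR) at about 38.8%.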
Model           Aligned        Misaligned     None
Diff-A-Riff     0.65 (0.17)    0.65 (0.21)    0.17 (0.21)
+ LiLAC_H       0.67 (0.06)    0.55 (0.06)    0.57 (0.06)
+ LiLAC_HT      0.67 (0.06)    0.54 (0.06)    0.58 (0.05)
+ LiLAC_HR      0.67 (0.06)    0.54 (0.06)    0.58 (0.06)
+ LiLAC_HTR     0.67 (0.05)    0.53 (0.06)    0.58 (0.05)
+ ControlNet    0.67 (0.05)    0.52 (0.06)    0.60 (0.05)
Table 2. CS ↑ (cMSE ↓) for models trained with chroma conditioning and the Diff-A-Riff baseline. Each column corresponds to a different setting: aligned pairs of chromagram and CLAP embedding, misaligned pairs, or without CLAP embedding (see Sec. 5.3.4). In the None case, lower CS is better (CS ↓).
We next investigate how the new chroma adaptors interact with the backbone’s default CLAP conditioning (see Sec. 5.3.4). Table 2 reports CLAP similarity scores (CS) and chroma reconstruction error (cMSE) across various model configurations. The results indicate a degree of information overlap between CLAP embeddings and chroma conditions. When both are present but convey conflicting information, CLAP similarity scores noticeably decline, suggesting that chroma conditioning can partially overwrite or interfere with the original CLAP guidance. However, this drop is moderate, and the model still reconstructs a faithful chromagram, as reflected in the low cMSE even in the misaligned setting. This implies that the model makes reasonable sense of the conflicting conditionals, following the CLAP guidance as much as possible while still producing generations that faithfully adhere to the chroma conditioning.
Interestingly, in the CLAP-unconditioned setting, cMSE remains low while CS stays high. This supports our hypothesis of information redundancy between chromagrams and CLAP embeddings. It suggests that the model can still infer high-level cues, such as timbre, from chromagrams alone, allowing it to produce content that aligns with what CLAP would specify, even in its absence. That said, the model’s ability to maintain reasonable CLAP similarity without explicit CLAP input does not necessarily mean it reconstructs the original instrument associated with the chromagram. More likely, it infers a plausible instrument class, such as strings or pads, that fits the given pitch range and harmonic content. This suggests that the model interprets chroma conditioning sensibly: when CLAP is available, it may refine or overwrite the inferred instrument identity; when absent, it defaults to generating something timbrally coherent that aligns with the chroma structure.
Among all setups, the LiLAC architecture demonstrates the least susceptibility to CLAP interference, with its lightweight variant performing best under both conflicting and CLAP-ablated conditions. This robustness is likely due to its simplified structure, which limits the extent to which the model encodes auxiliary information alongside the CLAP embedding.
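
The comparison in the last paragraph can be reproduced from the CS values in Table 2. The sketch below computes the drop in CLAP Score from the aligned to the misaligned setting, and the CS obtained without CLAP (where lower is better); the dictionary simply transcribes the table.

```python
# CLAP Score (CS) per setting, transcribed from Table 2:
# (aligned, misaligned, none)
TABLE_2_CS = {
    "LiLAC_H": (0.67, 0.55, 0.57),
    "LiLAC_HT": (0.67, 0.54, 0.58),
    "LiLAC_HR": (0.67, 0.54, 0.58),
    "LiLAC_HTR": (0.67, 0.53, 0.58),
    "ControlNet": (0.67, 0.52, 0.60),
}

# Drop in CS when the CLAP embedding and the chroma condition conflict.
cs_drop = {
    model: round(aligned - misaligned, 2)
    for model, (aligned, misaligned, _none) in TABLE_2_CS.items()
}
least_affected = min(cs_drop, key=cs_drop.get)

# CS without any CLAP input: a lower value means less leakage from chroma.
cs_without_clap = {model: scores[2] for model, scores in TABLE_2_CS.items()}
least_leakage = min(cs_without_clap, key=cs_without_clap.get)
```

Both measures single out the lightest configuration, LiLAC_H, although the differences between models are small (a drop of 0.12 versus 0.15 for ControlNet).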
Condition       Aligned        Misaligned     None
Diff-A-Riff     0.65 (0.17)    0.65 (0.21)    0.17 (0.21)
+ Chord         0.65 (0.14)    0.64 (0.20)    0.36 (0.18)
+ Thresh        0.65 (0.08)    0.62 (0.17)    0.45 (0.10)
+ Chroma        0.67 (0.06)    0.55 (0.06)    0.57 (0.06)
Table 3. CS ↑ (cMSE ↓) for LiLAC_H trained on chord, thresholded chroma, and chroma, with the Diff-A-Riff baseline (see Sec. 5.3.4). In the None case, lower CS is better (CS ↓).
To reduce redundancy with CLAP, we evaluate LiLAC_H, our lightest and best-performing variant, using compressed melodic inputs: thresholded chromagrams (thresh) and chords. Table 3 compares performance across CLAP-aligned, misaligned, and unconditioned (None) settings (see Sec. 5.3.4).
Intuitively, less specific conditioning signals should exhibit lower overlap with CLAP embeddings. This is confirmed by the results: thresholding the chromagram significantly reduces the conflict between the two modalities. Among the tested conditions, chords show the least interference, with almost no drop in CLAP similarity in the misaligned case. Notably, even without CLAP conditioning, the chord model achieves a CLAP score of 0.36, up from 0.17 in the unconditioned baseline, suggesting that a substantial amount of global musical information (e.g., tempo, tonality) is implicitly captured by the chord input alone.
However, this robustness may come at the cost of reduced influence from the conditioning. We observe that weaker conditions, like chords, are more easily overridden by CLAP. For instance, the cMSE rises sharply from 0.08 to 0.17 when CLAP is misaligned, despite the sparsity of the chord-based input. This indicates that even minimal shifts in CLAP can significantly distort the resulting chroma, especially when the conditioning is less constraining. To confirm this, we re-evaluated the cMSE under the aligned setting for chords and observed a consistent value around 0.14. This validates that the observed jump in cMSE under misalignment reflects a genuine interference effect, rather than noise or sampling variance.
Overall, these findings highlight that our proposed conditioning mechanisms, especially when simplified, can effectively cooperate with CLAP guidance. When signals are aligned, they yield coherent generations that satisfy both timbral and harmonic constraints. When conflicting, the model tends to prioritise chroma while still leveraging CLAP to guide plausible generation. Even in the absence of CLAP, the model retains the ability to infer global cues such as tempo, tonality, and timbre from abstract melodic inputs, demonstrating the versatility and robustness of the conditioning approach.
6.2 Subjective Experiments
Table 4 presents results from our Subjective Audio Quality (SAQ) and Subjective Condition Adherence (SCA) evaluations (see Sec. 5.4). Across both tests, we collected a total of 1,250 ratings from 25 participants (11 for SCA and 14 for SAQ). In the SAQ test, all three models were rated comparably in terms of audio quality. The two LiLAC variants, especially the lightweight configuration, slightly outperformed ControlNet based on both mean score and average rank. However, these differences did not reach statistical significance, either among the three models or relative to the original stem, as indicated by the Friedman test (p>0.10). This aligns with expectations: given that the backbone is frozen, it is unlikely that control mechanisms alone would improve audio quality. Nevertheless, the slightly better performance of LiLAC models may reflect a better preservation of pre-trained features, due to their smaller and more focused adapter layers.
Model         SAQ ↑         SCA ↑
Reference     61.5 ± 4.7    82.3 ± 4.3
ControlNet    56.4 ± 4.2    66.5 ± 5.1
LiLAC_H       60.6 ± 4.6    65.9 ± 4.8
LiLAC_HTR     58.7 ± 4.2    68.6 ± 4.9
Anchor        26.4 ± 4.9    12.5 ± 2.5
Table 4. Listener ratings with 95% confidence intervals for SAQ and SCA across evaluated models (see Sec. 5.4).
The SCA results show similar outcomes. All models significantly outperform unconditional generation (p<0.0001), and again show remarkable similarity to one another (p>0.8). Regarding conditional adherence to the original stem, there is a more noticeable gap to the reference; however, LiLAC_HTR comes closer than the other models, showing only marginally significant differences (p=0.09).
Overall, these findings are consistent with our expectations: while audio quality remains stable across control methods, condition adherence benefits more clearly from architectural improvements like LiLAC_HTR.
7. CONCLUSION
In this paper, we introduced a lightweight and flexible control methodology for text-to-audio diffusion models, inspired by ControlNet but with a significantly reduced parameter count. Our approach supports multiple configurations to accommodate control signals of varying complexity, while maintaining, and in some cases improving upon, the audio quality and condition adherence of the original method, as demonstrated in both objective and subjective evaluations. We believe this modular and efficient framework paves the way for more expressive and musically useful control modalities in audio generation.
As future work, we plan to explore the modularity and composability of multiple LiLAC controllers operating over a shared backbone, as well as investigate whether our method generalises to non-convolutional architectures such as DiT.⁴
⁴ Subsequent to our initial submission, we have successfully applied this architecture to the DiT-based Diff-A-Riff 2 [32] and seen similar results, suggesting that our methodology generalises well beyond convolutional architectures. For this, we adapted the ControlNet methodology pioneered in [50] and again substituted the cloned transformer blocks for the pre-existing frozen blocks wrapped by our adaptor layers.
8. ETHICS STATEMENT
Sony Computer Science Laboratories is committed to exploring the positive applications of AI in music creation. We collaborate with artists to develop innovative technologies that enhance creativity. We uphold strong ethical standards and actively engage with the music community and industry to align our practices with societal values. Our team is mindful of the extensive work that songwriters and recording artists dedicate to their craft. Our technology must respect, protect, and honour this commitment.
LiLAC presents a lighter, more flexible control paradigm that enables musicians to exercise fine-grained control over generative model outputs, ensuring complete artistic agency. We hope the presentation of this model will encourage commercial text-to-audio model providers to incorporate similar additional controls into their workflows, thereby empowering users with greater creative autonomy.
Both LiLAC and its backbone model, Diff-A-Riff, have been trained exclusively on datasets that were legally acquired for internal research and development purposes. Consequently, neither the training data nor the models can be made publicly available. We remain committed to full legal compliance and proactively address all ethical considerations in our work.
9. REFERENCES
[1] J. Schneider, “Explainable Generative AI (GenXAI):
A Survey, Conceptualization, and Research Agenda,”
Artificial Intelligence Review, vol. 57, 2024.
[2] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel,
M. Verzetti, A. Caillon, Q. Huang, A. Jansen,
A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour,
and C. Frank, “MusicLM: Generating Music From
Text,” 2023.
[3] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic,
W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-Audio
Generation with Latent Diffusion Models,”
in International Conference on Machine Learning
(ICML), Hawaii, United States, 2023.
[4] Z. Evans, C. J. Carr, J. Taylor, S. H. Hawley, and
J. Pons, “Fast Timing-Conditioned Latent Audio Diffusion,”
in International Conference on Machine Learning
(ICML), Vienna, Austria, 2024.
[5] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve,
Y. Adi, and A. Défossez, “Simple and Controllable
Music Generation,” in 37th Conference on Neural
Information Processing Systems (NeurIPS), New Orleans,
United States, 2023.
[6] J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and
S. Lattner, “Diff-A-Riff: Musical Accompaniment Co-creation
via Latent Diffusion Models,” in Proc. of the
25th Int. Society for Music Information Retrieval Conf.
(ISMIR), San Francisco, United States, 2024.
[7] B. Elizalde, S. Deshmukh, M. A. Ismail, and H. Wang,
“CLAP: Learning Audio Concepts From Natural Language
Supervision,” in IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP),
Rhodes Island, Greece, 2023.
[8] Y.-K. Wu, C.-Y. Chiu, and Y.-H. Yang, “JukeDrummer:
Conditional Beat-aware Audio-domain Drum Accompaniment
Generation via Transformer VQ-VAE,”
in Proc. of the 23rd Int. Society for Music Information
Retrieval Conf. (ISMIR), Bengaluru, India, 2022.
[9] L. Zhang, A. Rao, and M. Agrawala, “Adding Conditional
Control to Text-to-Image Diffusion Models,”
in IEEE International Conference on Computer Vision
(ICCV), Paris, France, 2023.
[10] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan,
“Music ControlNet: Multiple Time-varying Controls
for Music Generation,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing (TASLP),
vol. 32, Nov. 2023.
[11] S. Hou, S. Liu, R. Yuan, W. Xue, Y. Shan, M. Zhao,
and C. Zhang, “Editing Music with Melody and Text:
Using ControlNet for Diffusion Transformer,” in International
Conference on Acoustics, Speech, and Signal
Processing (ICASSP), Hyderabad, India, 2025.
[12] S. Zhao, D. Chen, Y.-C. Chen, J. Bao, S. Hao, L. Yuan,
and K.-Y. K. Wong, “Uni-ControlNet: All-in-One
Control to Text-to-Image Diffusion Models,” in 37th
Conference on Neural Information Processing Systems
(NeurIPS), New Orleans, United States, 2023.
[13] D. Zavadski, J.-F. Feiden, and C. Rother, “ControlNet-XS:
Designing an Efficient and Effective Architecture
for Controlling Text-to-Image Diffusion Models,”
2023.
[14] F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf,
“Moûsai: Text-to-Music Generation with Long-Context
Latent Diffusion,” 2023.
[15] Q. Huang, D. S. Park, T. Wang, T. I. Denk, A. Ly,
N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. Frank, J. Engel,
Q. V. Le, W. Chan, Z. Chen, and W. Han,
“Noise2Music: Text-conditioned Music Generation
with Diffusion Models,” 2023.
[16] P. Li, B. Chen, Y. Yao, Y. Wang, A. Wang, and
A. Wang, “JEN-1: Text-Guided Universal Music Generation
with Omnidirectional Diffusion Models,” in
IEEE Conference on Artificial Intelligence (CAI), Singapore,
Singapore, 2024.
[17] H. Liu, Y. Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian,
Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley,
“AudioLDM 2: Learning Holistic Audio Generation
with Self-supervised Pretraining,” IEEE/ACM Transactions
on Audio, Speech and Language Processing,
vol. 32, May 2024.
[18] Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and
D. P. W. Ellis, “MuLan: A Joint Embedding of Music
Audio and Natural Language,” in Proc. of the 23rd Int.
Society for Music Information Retrieval Conf. (ISMIR),
Bengaluru, India, 2022.
[19] Y. Wu, K. Chen, T. Zhang, Y. Hui, M. Nezhurina,
T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale Contrastive
Language-Audio Pretraining with Feature Fusion
and Keyword-to-Caption Augmentation,” in IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP), Rhodes Island, Greece,
2023.
[20] Y. Zhang, Y. Ikemiya, G. Xia, N. Murata, M. A.
Martínez-Ramírez, W.-H. Liao, Y. Mitsufuji, and
S. Dixon, “MusicMagus: Zero-Shot Text-to-Music
Editing via Diffusion Models,” in The 33rd International
Joint Conference on Artificial Intelligence (IJCAI),
Jeju, Korea, 2024.
[21] H. Manor and T. Michaeli, “Zero-Shot Unsupervised
and Text-Based Audio Editing Using DDPM Inversion,”
in International Conference on Machine Learning
(ICML), Vienna, Austria, 2024.
[22] J. Nistal, S. Lattner, and G. Richard, “DrumGAN: Synthesis
of Drum Sounds With Timbral Feature Conditioning
Using Generative Adversarial Networks,” in
Proc. of the 21st Int. Society for Music Information Retrieval
Conf. (ISMIR), Montréal, Canada, 2020.
[23] J. Nistal, C. Aouameur, I. Velarde, and S. Lattner,
“DrumGAN VST: A Plugin for Drum Sound Analysis/Synthesis
With Autoencoding Generative Adversarial
Networks,” in International Conference on Machine
Learning (ICML), Baltimore, United States,
2022.
[24] R. Huang, J. Huang, D. Yang, Y. Ren, L. Liu, M. Li,
Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-An-Audio:
Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models,” in International Conference on Machine
Learning (ICML), Hawaii, United States, 2023.
[25] S. Lattner and M. Grachten, “High-Level Control of
Drum Track Generation Using Learned Patterns of
Rhythmic Interaction,” in Workshop on Applications of
Signal Processing to Audio and Acoustics (WASPAA),
New Paltz, United States, 2019.
[26] M. Grachten, S. Lattner, and E. Deruty, “BassNet: A
Variational Gated Autoencoder for Conditional Generation
of Bass Guitar Tracks with Learned Interactive
Control,” Applied Sciences, vol. 10, 2020.
[27] J. D. Parker, J. Spijkervet, K. Kosta, F. Yesiler,
B. Kuznetsov, J.-C. Wang, M. Avent, J. Chen, and
D. Le, “StemGen: A music generation model that
listens,” in International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Seoul, Korea,
2024.
[28] M. Pasini, M. Grachten, and S. Lattner, “Bass Accompaniment
Generation via Latent Diffusion,” in International
Conference on Acoustics, Speech, and Signal
Processing (ICASSP), Seoul, Korea, 2024.
[29] G. Mariani, I. Tallini, E. Postolache, M. Mancusi,
L. Cosmo, and E. Rodolà, “Multi-Source Diffusion
Models for Simultaneous Music Generation and Separation,”
in International Conference on Learning Representations
(ICLR), Vienna, Austria, 2024.
[30] G. Bindi and P. Esling, “Unsupervised Composable
Representations for Audio,” in Proc. of the 25th Int.
Society for Music Information Retrieval Conf. (ISMIR),
San Francisco, United States, 2024.
[31] T. Karchkhadze, M. R. Izadi, and S. Dubnov, “Simultaneous
Music Separation and Generation Using Multi-Track
Latent Diffusion Models,” 2024.
[32] J. Nistal, M. Pasini, and S. Lattner, “Improving Musical
Accompaniment Co-creation via Diffusion Transformers,”
in 38th Conference on Neural Information
Processing Systems (NeurIPS), Vancouver, Canada,
2024.
[33] C. Donahue, A. Caillon, A. Roberts, E. Manilow, P. Esling,
A. Agostinelli, M. Verzetti, I. Simon, O. Pietquin,
N. Zeghidour, and J. Engel, “SingSong: Generating
musical accompaniments from singing,” 2023.
[34] M. Levy, B. D. Giorgi, F. Weers, A. Katharopoulos,
and T. Nickson, “Controllable Music Production with
Diffusion Models and Guidance Gradients,” in 37th
Conference on Neural Information Processing Systems
(NeurIPS), New Orleans, United States, 2023.
[35] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and
N. J. Bryan, “DITTO: Diffusion Inference-Time T-Optimization
for Music Generation,” in International
Conference on Machine Learning (ICML), Vienna,
Austria, 2024.
[36] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and
N. Bryan, “DITTO-2: Distilled Diffusion Inference-Time
T-Optimization for Music Generation,” in Proc.
of the 25th Int. Society for Music Information Retrieval
Conf. (ISMIR), San Francisco, United States, 2024.
[37] H. F. García, O. Nieto, J. Salamon, B. Pardo, and
P. Seetharaman, “Sketch2Sound: Controllable Audio
Generation via Time-Varying Signals and Sonic Imitations,”
2024.
[38] M. Pasini, S. Lattner, and G. Fazekas, “Music2Latent:
Consistency Autoencoders for Latent Audio Compression,”
in Proc. of the 25th Int. Society for Music Information
Retrieval Conf. (ISMIR), San Francisco, United
States, 2024.
[39] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating
the Design Space of Diffusion-Based Generative
Models,” in 36th Conference on Neural Information
Processing Systems (NeurIPS), New Orleans, United
States, 2022.
[40] B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar,
E. Battenberg, and O. Nieto, “Librosa: Audio and Music
Signal Analysis in Python,” in Proc. of the 14th
Python in Science Conf. (SCIPY), Austin, Texas, 2015.
[41] T. Akama, N. Polouliakh, and H. Kishi, “Deep12: Mu-
sic Analysis AI,” 2023.
[42] J. Ho and T. Salimans, “Classifier-Free Diffusion Guidance,”
in NeurIPS Workshop on Deep Generative Models
and Downstream Applications, Online, 2021.
[43] I. Loshchilov and F. Hutter, “Decoupled Weight Decay
Regularization,” in 7th International Conference
on Learning Representations (ICLR), New Orleans,
United States, 2019.
[44] I. Loshchilov and F. Hutter, “SGDR: Stochastic Gradient
Descent with Warm Restarts,” in 5th International
Conference on Learning Representations (ICLR), Toulon,
France, 2017.
[45] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi,
“Fréchet Audio Distance: A Metric for Evaluating Music
Enhancement Algorithms,” in Interspeech, Graz,
Austria, 2019.
[46] M. Grachten, “Measuring Audio Prompt Adherence
with Distribution-based Embedding Distances,” 2024.
[47] ITU-R BS.1534-3, “Methods for the subjective assessment
of small impairments in audio systems,” 2015.
[48] Z. Evans, J. D. Parker, C. J. Carr, Z. Zukowski, J. Taylor,
and J. Pons, “Stable Audio Open,” 2024.
[49] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and
B. Ommer, “High-Resolution Image Synthesis with
Latent Diffusion Models,” in Computer Vision and Pattern
Recognition (CVPR), New Orleans, United States,
2022.
[50] J. Chen, Y. Wu, S. Luo, E. Xie, S. Paul, P. Luo,
H. Zhao, and Z. Li, “PIXART-δ: Fast and Controllable
Image Generation with Latent Consistency Models,”
2024.
