USER-GUIDED GENERATIVE SOURCE SEPARATION
Yutong Wen, Minje Kim, Paris Smaragdis
University of Illinois at Urbana-Champaign
ABSTRACT
Music source separation (MSS) aims to extract individual instrument sources from their mixture. While most existing methods focus on the widely adopted four-stem separation setup (vocals, bass, drums, and other instruments), this approach lacks the flexibility needed for real-world applications. To address this, we propose GuideSep, a diffusion-based MSS model capable of instrument-agnostic separation beyond the four-stem setup. GuideSep is conditioned on multiple inputs: a waveform mimicry condition, which can be easily provided by humming or playing the target melody, and mel-spectrogram domain masks, which offer additional guidance for separation. Unlike prior approaches that relied on fixed class labels or sound queries, our conditioning scheme, coupled with the generative approach, provides greater flexibility and applicability. Additionally, we design a mask-prediction baseline using the same model architecture to systematically compare predictive and generative approaches. Our objective and subjective evaluations demonstrate that GuideSep achieves high-quality separation while enabling more versatile instrument extraction, highlighting the potential of user participation in the diffusion-based generative process for MSS. Our code and demo page are available at https://yutongwen.github.io/GuideSep/.
1. INTRODUCTION
Music source separation (MSS) aims to separate a mixture audio into its constituent sources, typically defined by the instrument. Since the 2015 Signal Separation Evaluation Campaign (SiSEC) [1], the MSS community has largely focused on supervised models that separate songs into four stems: vocals, bass, drums, and "others", which includes all remaining instruments, a setup commonly referred to as VBDO. Under this framework, numerous recent deep neural network (DNN) models have significantly advanced performance [2–8]. While this setup provides a convenient benchmark, it lacks the flexibility needed for real-world applications: ideally, MSS systems should be able to extract any target instrument of interest.
© Y. Wen, M. Kim, and P. Smaragdis. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Wen, M. Kim, and P. Smaragdis, “User-Guided Generative Source Separation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In this regard, several works have extended MSS beyond the VBDO setup. To enable the separation of arbitrary instruments, the model must first be provided with a condition specifying the target instrument, such as instrument class labels [9–12]. In [9, 11] this conditioning method is shown to work for the VBDO setup, whereas [10] extends this approach to 13 instruments. However, class labels can be vague, as instruments like the guitar may exhibit significant variability within the same label. Moreover, new instrument classes require re-training. Another approach, query-based MSS, conditions the model using a sound example, where the model extracts sources similar to the example [13–18]. For instance, Watcharasupat et al. [16] designed a lightweight model capable of instrument-agnostic separation using a single query, while Wang et al. [18] developed a model that accepts up to five queries to improve performance stability. Despite its potential to provide rich information about the target source, query-based separation may be limited in real-world applications where high-quality queries are unavailable. Additionally, MSS models can be conditioned on the MIDI score of the target instrument [19–23]. While MIDI information provides a strong and accurate cue, it is often unavailable in many real-world scenarios, such as pop music. Bryan et al. [24–27] proposed an alternative method where users sketch a rough mask on the spectrogram of the mixture to indicate the target source. However, this approach can suffer from ambiguity, as identifying the target instrument's region in the mixture spectrogram is often challenging. Smaragdis et al. [28] leverage humming as guidance to separate a target source. Unlike label-based or sound query conditioning, humming offers users greater flexibility when interacting with the system.
In this work, we propose a guided separation (GuideSep) method, a conditional complex-spectrogram domain diffusion model designed to address music source separation beyond the VBDO setup in an instrument-agnostic manner. Building on the observations of existing methods for MSS beyond VBDO, we condition the diffusion model on multiple inputs: a waveform mimicry of a target source and mel-spectrogram domain masks. While MIDI score information is often difficult to obtain in real-world scenarios, users are capable of providing a mimicry by humming or playing the target melody with an instrument of their choice. Additionally, we introduce a rough mask on the mel-spectrogram for the users to further inform the model of the region to focus on. During inference, either or both conditions can be utilized, offering users a flexible way to specify the target source for separation from the mixture. Our diffusion model is built on EDMSound [29], a complex-spectrogram domain diffusion method designed for both unconditional and label-conditioned audio generation. We modify the model backbone to support multiple conditioning inputs.
Traditionally, audio source separation has been tackled using predictive models (see footnote 1), which map the mixture input to an estimated clean output by minimizing a point-wise loss function [31–33]. While predictive models often struggle with residual noise and artifacts [34] in enhancement tasks, generative models have the potential to produce cleaner results by directly or indirectly modeling the clean prior. In recent years, significant progress has been made in applying generative models to audio separation tasks, particularly in speech enhancement and separation [35–42]. While most MSS methods are still predictive, a few generative approaches have begun to emerge. For instance, Ge et al. proposed a flow-based model, InstGlow [43], which leverages the priors of clean sources to improve separation results within the VBDO setup. Additionally, multi-source diffusion models have been proposed for simultaneous music source separation and generation [44,45]. These approaches employ a multi-channel diffusion process to model the joint distribution of individual sources and condition on the mixture to sample individual sources during inference, enabling separation. While this formulation provides control over which instrument to synthesize or separate, it is limited to the specific set of instruments the model is trained on.
While there is growing interest in applying generative methods to MSS, to the best of our knowledge, no prior work has systematically compared generative methods with their direct counterparts. In this work, we address this gap by designing a mask-prediction baseline that shares the exact same model backbone as our diffusion model. We then conduct a systematic evaluation to analyze the differences between the two approaches.
Our contributions can be summarized as follows:
1) We propose GuideSep, one of the first diffusion-based models designed to address music source separation beyond the VBDO setup, and we release the codebase.
2) We introduce versatile, instrument-agnostic conditions (waveform mimicry conditions and mel-spectrogram domain masks) that are more practical for real-world applications.
3) We design a mask-prediction baseline using the same model architecture and conduct a systematic evaluation to analyze the differences between predictive and generative approaches.
2. THE PROPOSED GUIDESEP METHOD
GuideSep is a diffusion model conditioned by user input. Our approach leverages users' input describing a source, i.e., the raw waveform of user mimicry of a target source as well as a rough mask in the mel-spectrogram domain.
Footnote 1: Some literature refers to predictive models as discriminative or deterministic. Lemercier et al. [30] note that predictive models encompass both concepts.
Figure 1: Illustration of an example of a positive user-input mel-spectrogram mask. (a) Mel-spectrogram of the mixture, with the target source region marked. (b) Mel-spectrogram of the mixture with the user mel-mask overlay (positive mel-spectrogram mask from user sketch).
2.1 Condition signals
2.1.1 Mimicry condition
The mimicry guidance is a user-provided time-domain waveform, such as a hummed rendition of the target melody or the melody played on another instrument. Due to the lack of real-world data for training, we simulate the mimicry guidance by converting the ground-truth MIDI score of the target source to audio using the FluidSynth library [46]. Real-world mimicry inputs often include off-pitch notes, imperfect timing, and limitations in note range. Additionally, since many instruments and vocal mimicry are monophonic, extracting polyphonic sources, e.g., guitar or piano, poses significant challenges. We simulate user input via various data augmentation techniques:
• Off-pitch melodies are simulated by introducing perturbations to the MIDI notes. Each note has a 50% probability of being pitch-bent, whose amount is randomly sampled from a uniform distribution ranging from −0.4 to +0.4 semitones.
• Imperfect timing is simulated by introducing variations in the timing of MIDI notes. With a 40% probability, the start and end times of a note are shifted by up to ±30 milliseconds. The time shift of a note will also be applied to its following notes.
• Limitation of note range is simulated by randomly shifting MIDI notes up or down by one octave with a 50% probability.
• Extraction using non-polyphonic instruments: We restrict the condition melody to be monophonic to reflect real-world limitations of many instruments and humming. It encourages the model to infer missing notes of the target source using other side information, such as the mel-spectral mask. The choice of a monophonic condition was driven by our focus on human voice guidance; however, this is a limitation of the training data rather than the algorithm itself.
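The first three augmentations above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the released implementation: the `Note` class and `augment_melody` function are hypothetical names, the octave shift is assumed to apply to the whole melody at once, and a note's start and end are assumed to move by the same amount. The monophonic restriction and the FluidSynth rendering step are omitted.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    pitch: float  # MIDI pitch in semitones; a fractional part models a pitch bend
    start: float  # onset time in seconds
    end: float    # offset time in seconds


def augment_melody(notes, rng):
    """Perturb a note list to imitate an imperfect human rendition."""
    # Assumption: one octave decision (up, down, or none) for the whole melody.
    octave = rng.choice([-12, 12]) if rng.random() < 0.5 else 0
    drift = 0.0  # accumulated time shift, carried over to all following notes
    out = []
    for n in notes:
        pitch = n.pitch + octave
        if rng.random() < 0.5:                    # 50%: bend by up to +/-0.4 semitones
            pitch += rng.uniform(-0.4, 0.4)
        if rng.random() < 0.4:                    # 40%: shift by up to +/-30 ms
            drift += rng.uniform(-0.030, 0.030)
        start = max(0.0, n.start + drift)         # keep onsets non-negative
        end = max(start, n.end + drift)
        out.append(Note(pitch, start, end))
    return out
```

Calling `augment_melody(notes, random.Random(0))` returns a new list of the same length; the input notes are left untouched, so the same ground-truth score can be augmented differently at every training step.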
2.1.2 Mel-spectral masks
Our second conditioning input is the user-created mask in the mel-spectrogram domain that distinguishes regions corresponding to the target source from those of background sources. While being conceptually aligned with [24], GuideSep uses it to condition a deep generative model. Specifically, we define two types of masks: positive and negative masks to indicate the target and background source regions, respectively. The mel-spectrogram domain is chosen due to its greater interpretability and easier identification of the sources compared to the Fourier transform's linear frequency scale.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1 illustrates the process of creating a mel-spectrogram mask based on user input. We implement a user interface where users can sketch on the mel-spectrogram of the mixture with different brush sizes and confidence levels to indicate regions they believe correspond to the target source or background music. In practice, user-provided masks may exclude portions of the target source or unintentionally include regions of background sound. Additionally, in many cases, the target source significantly overlaps the background sources, further complicating the masking process. To simulate these real-world imperfections during training, we generate synthetic user input masks by applying a Gaussian filter, whose standard deviation ranges between 4 and 6, to the ground-truth mel-spectrograms of the target source and the residual sources. In addition, we randomly drop out 40% of patches.
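The mask simulation described above (Gaussian blur followed by patch dropout) could look roughly like the following stdlib-only sketch. Several details are assumptions that the text does not specify: the standard deviation is taken to be in spectrogram bins, the patch size of 8 bins is arbitrary, each patch is dropped independently with probability 0.4, and the result is normalized by its peak. The function names are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at three standard deviations."""
    radius = int(3 * sigma)
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]


def blur_1d(row, kernel):
    """Convolve one row with the kernel, replicating the edge values."""
    r, n = len(kernel) // 2, len(row)
    return [
        sum(w * row[min(max(i + j - r, 0), n - 1)] for j, w in enumerate(kernel))
        for i in range(n)
    ]


def blur_2d(spec, sigma):
    """Separable Gaussian blur over a (frequency x time) list of lists."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(r, kernel) for r in spec]
    cols = [blur_1d(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def simulate_user_mask(mel, rng, patch=8, drop_prob=0.4):
    """Blur a non-negative mel-spectrogram and zero out random square patches."""
    mask = blur_2d(mel, rng.uniform(4.0, 6.0))   # sigma drawn from [4, 6]
    peak = max(max(r) for r in mask) or 1.0      # avoid dividing by zero
    mask = [[v / peak for v in r] for r in mask]  # normalize to [0, 1]
    n_f, n_t = len(mask), len(mask[0])
    for f0 in range(0, n_f, patch):
        for t0 in range(0, n_t, patch):
            if rng.random() < drop_prob:          # assumed: independent per patch
                for f in range(f0, min(f0 + patch, n_f)):
                    for t in range(t0, min(t0 + patch, n_t)):
                        mask[f][t] = 0.0
    return mask
```

Applying `simulate_user_mask` to the target mel-spectrogram would give a positive mask, and applying it to the residual sources would give a negative mask. A real pipeline would more likely use an array library for the blur; the nested lists here only keep the sketch dependency-free.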
2.2 Conditional complex spectrogram diffusion
Diffusion probabilistic models (DPMs) [47, 48] consist of two key processes: progressively corrupting training data by adding noise until it approximates a normal distribution, and learning to reverse each step of this noise corruption using the same functional form. These models can be generalized as score-based generative models [49], which utilize an infinite number of noise scales, enabling both the forward and backward diffusion processes to be described by stochastic differential equations (SDEs). During inference, the reverse SDE is employed to generate samples numerically, starting from a standard normal distribution.
Complex spectrogram diffusion with EDM: Our work is based on EDMSound [29]. We train our diffusion model using the EDM framework [50], which reformulates the diffusion SDE in terms of noise scales rather than drift and diffusion coefficients. To ensure that the inputs to the neural network are appropriately scaled within the range $[-1, 1]$, as required by the diffusion models, we apply an amplitude transformation to the complex spectrogram inputs. Specifically, we use $\tilde{c} = \beta |c|^{\alpha} e^{i \angle c}$, as proposed in [42,51], where $\alpha \in (0, 1]$ is a compression factor that emphasizes time-frequency bins with lower energy, $\angle c$ denotes the phase of the original complex spectrogram $c$, and $\beta \in \mathbb{R}_{+}$ is a scaling factor that normalizes amplitudes approximately to the range $[0, 1]$.
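As a concrete illustration of the amplitude transformation above, the following sketch applies it to a single complex STFT bin using Python's `cmath`. The default values of `alpha` and `beta` are placeholders, since the text does not state the values used, and the inverse `decompress` is an addition for illustration (it is what one would need to map a generated spectrogram back before the inverse STFT).

```python
import cmath


def compress(c, alpha=0.5, beta=1.0):
    """Amplitude-compress one complex bin: beta * |c|**alpha * exp(i * angle(c))."""
    return beta * abs(c) ** alpha * cmath.exp(1j * cmath.phase(c))


def decompress(c_tilde, alpha=0.5, beta=1.0):
    """Invert compress(): undo the scaling and the power law, keep the phase."""
    magnitude = (abs(c_tilde) / beta) ** (1.0 / alpha)
    return magnitude * cmath.exp(1j * cmath.phase(c_tilde))
```

For example, with `alpha=0.5` a bin of magnitude 0.01 is mapped to magnitude 0.1 (before scaling by `beta`), while a bin of magnitude 1 is unchanged, which is how low-energy bins gain relative weight. The phase is preserved in both directions.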
Adding conditions to EDMSound: To adapt EDMSound to target sound extraction, we modify the network to accept conditional inputs, including the mixture signal, mimicry signal, and spectral masks. Rather than modeling $p(s \mid c)$, where $s$ is the target source and $c$ is an instrument label, we instead model $p(s \mid c_{\text{mix}}, c_{\text{mimicry}}, c_{\text{masks}})$. Here, $c_{\text{mix}}$ corresponds to the music mixture represented in the complex-valued short-time Fourier transform (STFT) domain, while $c_{\text{mimicry}}$ denotes the mimicry condition in the form of magnitude STFT, assuming phase information is a distraction when it comes to representing spectrum information. Finally, $c_{\text{masks}}$ refers to the normalized magnitudes of mel-spectrogram masks, ranged between 0 and 1.
Figure 2: Overview of GuideSep at inference time. Our model accepts the mimicry condition and mel-spectrogram domain masks as guidance from users to extract the target source from the mixture. (Diagram labels: Mixture, Mimicry, Mel-Mask, UI, Condition U-Net, conditioning, Score U-Net, Gaussian noise $\mathcal{N}(0, I)$, Diffusion Model, Extracted Target.)
The proposed architecture: Building on insights from prior works [40, 52–54], we design our model as depicted in Figure 2. The architecture comprises two primary U-Net structures. The first one, referred to as the score U-Net, aligns with the original U-Net used in EDMSound. Without conditioning input, this part of the model would perform blind audio synthesis from a Gaussian noise sample. The second module, the condition U-Net, is introduced to tame this otherwise entirely generative behavior of the score U-Net. The condition U-Net is dedicated to processing all conditional inputs, including the mixture. These two U-Nets are connected so that the output of each layer in the condition U-Net is element-wise added to its corresponding layer in the score U-Net, spanning both the downsampling and upsampling layers. Since the mel-scale masks are in a different frequency dimension compared to the magnitude and complex spectrograms, we introduce a simple 1-hidden-layer neural network to project the mel-frequency axis onto the spectrogram frequency axis. Since there are three conditions in the form of a spectrogram (mixture, mimicry, and projected masks), we concatenate them along the channel dimension and feed them as input to the condition U-Net. Our U-Net architecture is adapted from Imagen [55], chosen for its high sample quality, rapid convergence, and memory efficiency.
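Loss function: During training, we optimize the model using preconditioned denoising score matching, following [50]. The training objective is formulated as
$$\mathbb{E}_{s}\,\mathbb{E}_{n}\left[\lambda(\sigma)\,\big\lVert D\big(s + n;\ \sigma,\ c_{\text{mix}},\ c_{\text{mimicry}},\ M(c_{\text{masks}})\big) - s \big\rVert_2^2\right],$$
where $D(\cdot)$ is the EDM weighted neural network, $\sigma$ is the noise level, $\lambda(\cdot)$ is the loss weighting, which is $(\sigma^2 - \sigma_{\text{data}}^2)/(\sigma \cdot \sigma_{\text{data}}^2)$ for the EDM framework, $M(\cdot)$ denotes the 1-hidden-layer projection network, and $n \sim \mathcal{N}(0, \sigma^2 I)$ is Gaussian noise.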
Inference: Within the EDM framework, the probability flow ordinary differential equation (ODE) can be simplified into a nonlinear ODE, allowing the direct use of standard off-the-shelf ODE solvers, such as high-order Exponential Integrator (EI)-based ODE solvers [56], specifically multistep DPM solvers [56, 57], for sampling as in EDMSound.
3. EXPERIMENT
We conduct experiments using the Slakh2100 dataset [58] augmented by MoisesDB [59] for training. The Slakh2100 dataset provides an official train-validation-test split, which we utilize as well. We evaluate our model's performance using the widely adopted signal-to-distortion ratio (SDR) metrics [60,61].
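For readers unfamiliar with the metric, the basic form of SDR compares the energy of the reference source with the energy of the estimation error. The sketch below shows this plain definition only; whether the evaluation uses this exact variant or one of the other definitions from [60,61] is not stated here, so treat `sdr_db` (a hypothetical helper) as an illustration rather than the evaluation code.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps keeps the ratio finite for silent references or perfect estimates.
    return 10.0 * math.log10((signal + eps) / (error + eps))
```

Higher is better: halving the amplitude of the error raises the SDR by about 6 dB.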
3.1 Training and model details
3.1.1 The datasets
Slakh2100 is a synthetic, waveform-MIDI-aligned music dataset containing 2,100 tracks, totaling around 145 hours of audio. In our training process, instead of using the original mix from the dataset, we generate training data through random mixing. This allows for nearly infinite variations of training samples. While this approach may result in the loss of some musical context, previous work [62] has demonstrated that random mixing can improve MSS model performance. To enhance our model's performance on real-world music, we utilize the MoisesDB dataset [59] to construct background sources. MoisesDB is a comprehensive multitrack dataset designed for source separation beyond 4-stems, featuring 240 previously unreleased songs by 47 artists across twelve high-level genres, in total approximately 14 hours of audio. During random mixing, we randomly select 3 to 6 sources from the MoisesDB dataset to serve as background music, while the target source is drawn from the Slakh2100 dataset. The background and target sources are mixed at signal-to-noise ratios (SNR) ranging from −5 dB to 5 dB. All input audio is converted to a single channel and resampled to 16 kHz, and then trimmed or padded to around 4.1 seconds for batched training.
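The random mixing step can be illustrated as follows. This is a sketch under stated assumptions: the target is treated as the "signal" and the summed background as the "noise", the SNR is drawn uniformly from the stated range, and the background (rather than the target) is rescaled. The names `mix_at_snr` and `random_mixture` are hypothetical, and a silent background would need special handling to avoid a division by zero.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a waveform given as a list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(target, background, snr_db):
    """Scale the background so that target/background RMS equals snr_db, then add."""
    gain = rms(target) / (rms(background) * 10 ** (snr_db / 20))
    scaled = [gain * b for b in background]
    return [t + b for t, b in zip(target, scaled)], scaled


def random_mixture(target, stems, rng):
    """Sum 3 to 6 random background stems and mix them with the target."""
    chosen = rng.sample(stems, rng.randint(3, 6))
    background = [sum(vals) for vals in zip(*chosen)]
    snr_db = rng.uniform(-5.0, 5.0)  # assumed uniform over the stated range
    mixture, _ = mix_at_snr(target, background, snr_db)
    return mixture, snr_db
```

Because the stems and the SNR are redrawn on every call, the same target can appear in a practically unlimited number of different mixtures, which is the point of random mixing.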
3.1.2 Dropout strategies
To ensure that the model can process any combination of input types, we incorporate dropout strategies during training. This allows the model to operate with an incomplete set of conditions, such as the mimicry-only or mel-spectrogram-mask-only cases. To this end, we randomly drop out either the mimicry condition or the mel-spectrogram masks, ensuring that the model learns to predict the target source even when provided with partial conditioning information. Additionally, we empirically observe that the model benefits from a mimicry-only conditioned synthesis task, which happens when we randomly drop the mixture input $c_{\text{mix}}$ during training. This encourages the model to infer the target source from melodic guidance alone. Specifically, during training, we drop 30% of the mimicry condition, 70% of the mel-masks, and 10% of the mixture. The high dropout rate of mel-masks is intentional and tuned using the validation split, as they provide a strong cue for the target source. By reducing their presence, the model is encouraged to focus more on learning from the mimicry condition.
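A minimal sketch of this condition dropout is given below. The drop rates are the ones stated above; everything else is an assumption for illustration: each condition is dropped independently of the others, and a dropped condition is replaced by an all-zero input of the same shape. The helper names are hypothetical.

```python
import random

# Drop rates stated in the text.
DROP_PROB = {"mimicry": 0.3, "masks": 0.7, "mixture": 0.1}


def zeros_like(x):
    """All-zero copy of a 2-D list, standing in for a dropped condition."""
    return [[0.0] * len(row) for row in x]


def drop_conditions(conds, rng, drop_prob=DROP_PROB):
    """Return a new dict in which each condition is independently zeroed out."""
    out = {}
    for name, value in conds.items():
        if rng.random() < drop_prob.get(name, 0.0):
            out[name] = zeros_like(value)   # assumed: dropped means zero input
        else:
            out[name] = value
    return out
```

With independent draws, about 0.3 × 0.7 = 21% of training examples would see neither mimicry nor masks; whether the actual training excludes that case is not stated, so it is left as is here.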
3.1.3 The model architecture
For both score and condition U-Net modules, we utilize an efficient U-Net architecture adapted from the open-source Imagen implementation (see footnote 2), which is known for memory efficiency and fast convergence. Both U-Nets incorporate downsampling and upsampling blocks, each containing two ResNet blocks with a self-attention layer that uses two attention heads. The bottleneck dimension is 128. The complete model has 93.3 million trainable parameters.
The input to the condition U-Net consists of three types: a complex spectrogram $c_{\text{mix}}$, a magnitude spectrogram $c_{\text{mimicry}}$, both with a window size of 512 samples and a hop size of 256 samples, and the mel-spectrogram masks $c_{\text{masks}}$, which share the same hop size and consist of 80 mel-frequency bins. In total, it is a five-channel spectrogram input: two channels for the complex spectrogram, one for the magnitude spectrogram, and two for the positive (i.e., target source) and negative (i.e., background music) masks.
The score U-Net, as an autoencoder, defines a two-channel spectrogram as its input and output representation, where the two channels represent the real and imaginary components of the complex spectrogram, respectively. Note that at the very beginning of the sampling process, the input spectrogram to the score U-Net is noise sampled from a Gaussian distribution. Additionally, we condition the network on logarithmically scheduled noise levels $\sigma$.
3.1.4 Inference
For inference, we employ an EI-based DPM sampler [56, 57]. To ensure compatibility between the EDM framework samplers and arbitrary training objectives during inference, we implement input rescaling as needed. Specifically, we rescale both the noisy inputs and noise levels to align with the network's original training-time scales. The results, presented in Section 4, are obtained using an 8-step sampler configuration.
3.1.5 Training details
Our model is trained with a batch size of 36 and a learning rate of $1 \times 10^{-4}$ using the Adam optimizer. The training process runs for 300k updates. We used two NVIDIA L40S GPUs and trained for ten days.
3.2 Baselines
To the best of our knowledge, no existing work offers a fair comparison, as our method introduces a novel conditioning approach. However, we design a traditional mask-prediction model to compare the proposed generative approach against. The baseline shares the same twin U-Net architecture and structural details as our diffusion backbone. In particular, the input to the score U-Net portion is the magnitude spectrogram of the mixture, while the input to the condition U-Net consists of the magnitude spectrogram of the mimicry condition and the masks. The model outputs a non-binary mask by applying a sigmoid function after the output layer, which is then used to compute the target source magnitude spectrogram through element-wise multiplication with the input mixture magnitude spectrogram. The final waveform is reconstructed by combining the predicted magnitude spectrogram with the phase information from the original mixture. We train the mask-prediction baseline using the L2 reconstruction loss in the magnitude spectrogram domain, with the same learning rate, batch size, and number of updates as our diffusion model. Note that, due to the absence of time-step conditional inputs and differences in input channels, the mask-prediction baseline contains 80.3 million parameters.
Footnote 2: https://github.com/lucidrains/imagen-pytorch
Table 1: SDR (dB) results with 95% confidence interval (higher values indicate better performance) of ten instrument classes in the Slakh2100 test split. The results include GuideSep (our method) under various input conditions and the mask-prediction baseline. The best scores are highlighted in bold. For asterisk (*) please refer to Section 4.2.
Model                       Piano       Guitar      Bass        Strings     Brass       Synth       Pipe        Reed        Organ       Chromatic Percussion  Overall
Ours (full)                 8.34±0.11   10.53±0.09  11.97±0.12  9.64±0.12   9.15±0.38   9.25±0.20   15.58±0.27  13.78±0.24  13.44±0.22  11.53±0.36            10.46
Baseline (full)             7.03±0.09   8.72±0.08   8.69±0.06   9.06±0.13   8.03±0.31   8.00±0.17   14.99±0.26  11.94±0.23  11.25±0.23  9.62±0.31             8.74
Ours (mimicry only)         7.46±0.11   9.96±0.10   11.19±0.13  8.63±0.14   7.95±0.45   8.13±0.23   14.43±0.34  13.14±0.27  12.39±0.25  8.74±0.43             9.60
  w/ pseudo-masks           7.99±0.11   10.18±0.10  9.87±0.15   8.72±0.15   8.21±0.40   8.19±0.25   14.81±0.31  13.20±0.26  12.20±0.29  8.26±0.55             9.56
Ours (positive mask only)   7.86±0.11   10.17±0.09  11.45±0.13  9.48±0.12   8.97±0.38   9.14±0.19   15.19±0.28  13.42±0.25  13.08±0.23  11.09±0.38            10.09
Ours (humming)*             -           -           -           -           -           -           -           -           -           -                     13.61
Frequency (%)               20.71       27.89       17.79       15.23       2.65        4.74        2.72        3.06        3.43        1.78                  -
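The output stage of the mask-prediction baseline in Section 3.2 (sigmoid mask, element-wise multiplication with the mixture magnitude, reuse of the mixture phase, and an L2 loss on magnitudes) can be sketched as below. The network itself is not modeled: `logits` stands in for its pre-sigmoid output, the function names are hypothetical, the loss is assumed to be a mean over bins, and the inverse STFT that would produce the final waveform is omitted.

```python
import cmath
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def apply_soft_mask(logits, mixture_stft):
    """Mask the mixture magnitude per bin and keep the mixture phase."""
    estimate = []
    for logit_row, mix_row in zip(logits, mixture_stft):
        row = []
        for z, c in zip(logit_row, mix_row):
            magnitude = sigmoid(z) * abs(c)          # soft mask in (0, 1)
            row.append(cmath.rect(magnitude, cmath.phase(c)))
        estimate.append(row)
    return estimate


def l2_magnitude_loss(estimate_stft, target_stft):
    """Mean squared error between magnitude spectrograms (assumed reduction)."""
    diffs = [
        (abs(e) - abs(t)) ** 2
        for er, tr in zip(estimate_stft, target_stft)
        for e, t in zip(er, tr)
    ]
    return sum(diffs) / len(diffs)
```

Because the mask lies between 0 and 1 and the phase is copied from the mixture, such an estimate can only attenuate what is already in the mixture, which is consistent with the residual interference reported for the baseline in Section 4.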
4. EVALUATION AND DISCUSSION
We evaluate our model on the official test split of the Slakh2100 dataset. The mimicry condition signals are synthesized as described in Section 2.1, using randomly selected virtual instruments from the FluidSynth library. Similarly, the positive and negative masks are simulated following the same procedure outlined in Section 2.1. For evaluation, we group the instrument classes in Slakh2100 into ten broader categories, where drum tracks are excluded from the target sources because our synthesis method does not apply to them. The evaluation results are presented in Table 1.
In the first two rows in Table 1, we present the results of our model and the mask-prediction baseline. Both models utilize the mimicry condition and mel-spectrogram masks during inference, denoted with '(full)' in the table. The results demonstrate that our model consistently outperforms the mask-prediction baseline across all instrument classes. Given that the mask-prediction baseline shares the same model backbone, training data, and configuration as our diffusion model, the performance gap highlights the benefits of using a diffusion approach. In the listening test, we observe that the mask-prediction baseline often reconstructs target sources which still contain interferences. In contrast, while our diffusion model may occasionally exhibit inexact timbre, it generally generates cleaner target sources. This can be attributed to the diffusion model learning a prior distribution of clean sources, which biases its outputs toward cleaner results. Although our findings align with the well-known behavior of generative models, our experiments are limited to the particular choice of the diffusion model and a masking-based baseline with a matching architecture, leaving more general arguments to future work. We also observe that both models work better for the monophonic sources than the polyphonic ones, such as piano, guitar, strings, and synth, where our strictly monophonic mimicry condition is not informative enough. As a result, the models may struggle with missing notes from chords, extracting the wrong target instrument, or even extracting multiple instruments when they share a similar melody, which is common in music. For results on the real-world conditions, please refer to our demo page.
4.1 Subjective Listening Tests
In addition to the SDR results, we conduct a subjective listening test to further evaluate our model. We modify webMUSHRA [63] so the test comprises two sections: the first assesses the overall quality of the model's separation results, while the second focuses specifically on evaluating the timbre of the reconstructed target source. In the first section, each question presents the music mixture as a reference. Participants are asked to compare and rate four stimuli: the ground truth, the mixture itself (i.e., the hidden reference), and the predictions from our model and the mask-prediction baseline. Participants are unaware that one of the stimuli is the actual ground truth and are instead told that the three remaining stimuli are potential reconstructions of a target source. The participants are asked to first identify the mixture and assign it a score of 0, then rate the remaining stimuli (e.g., with 100 being a perfect match) based on how closely they resemble the target source in the mixture. This part consists of ten trials, with each mixture sample randomly selected from a different instrument class in the Slakh2100 test split. We use the mixture as a reference instead of the ground-truth source in order to measure the listener's opinion on the "synthesized" source without introducing any prejudice.
To make up for the modification introduced in the first part, the second part is dedicated to evaluating the potential artifact specific to the generative models, i.e., the timbre change. This time, each trial presents the ground-truth target source as the reference. Participants compare and rate three stimuli: the hidden reference (i.e., the ground truth itself) and the predictions from our model and the mask-prediction baseline. Here, ratings are based on timbre similarity to the reference, with 100 indicating an exact match and 0 representing a completely different timbre. Participants are instructed to focus solely on the timbre of the target source while disregarding any interference or artifacts. Both parts use the same set of music samples, but the second part is presented only after participants complete the first part to avoid bias, ensuring they remain unaware that the ground truth was included in the first part.
Figure 3: Mean MUSHRA score with 95% confidence interval for the subjective listening test on separation quality and separation timbre quality. (a) Sec. 1, MUSHRA result on separation quality. (b) Sec. 2, MUSHRA result on timbre similarity. (Plot labels: GuideSep, Baseline, GT; horizontal axis: MUSHRA score.)
A total of 13 participants took part in the subjective listening test, and the results from both parts are presented in Figure 3. In the first part, where participants rated the separation quality, our model scored 82.82 ± 2.95, the ground truth scored 90.13 ± 2.06, and the mask-prediction baseline scored 50.69 ± 3.41. Notably, despite the mask-prediction baseline having a relatively small SDR difference from our model, the listening test revealed a significant gap in perceptual evaluation. This suggests that users may perceive a cleaner target source prediction as more satisfactory, even if a slightly noisier prediction achieves a decent sample-wise similarity to the target source.
In the second part, where participants rated timbre preservation, our model scored 79.38 ± 3.76, surpassing the mask-prediction baseline, which scored 68.88 ± 4.07. In theory, the mask-prediction baseline could preserve the original timbre better, but our subjective listening test results suggest otherwise. Based on the listeners' feedback, we speculate that this outcome is influenced by the nature of the target source extraction task, where multiple sources in a musical piece may share similar melodic patterns. As a result, the mask-prediction baseline's output can be contaminated by interfering similar melodies, which can be perceived as a timbral change rather than artifacts.
4.2 Ablation
Beyond evaluating our model with both conditioning signals, we conduct an ablation study to assess its performance under different input conditions.
Mimicry-only: We evaluate the model using the '(mimicry only)' setup (Table 1). We observe a slight overall decrease in performance, indicating that while mel-masks contribute to improved performance, the model remains effective even when conditioned solely on the melody signal.
Pseudo-masks: When only the mimicry condition is available, we can generate pseudo mel-masks using the mimicry condition and the mixture. Specifically, we use the Gaussian-blurred mel-spectrogram of the mimicry condition as the positive mel-mask and the blurred mixture as the negative mel-mask, with the standard deviation set to 5. In Table 1, although the overall SDR score is slightly lower compared to using only the mimicry condition, the model performs better with pseudo-masks for 7 out of 10 instrument classes. This suggests that pseudo-masks can generally enhance the model's performance at no additional cost. The bass class is an exception, likely due to its limited high-frequency content, which sets it apart from other instruments. Consequently, the mel-spectrogram mask may be misleading in this case. A different standard deviation of the Gaussian filter could work better, although it involves an additional hyperparameter search.
Mel-masks-only: Another case is when only the mel-masks are used for conditioning. We observe that the results are generally better than those obtained using only the mimicry condition, indicating that mel-masks serve as highly effective conditioning signals.
Humming-only: Although the mimicry conditions used in our
training do not include humming, we evaluate our model
to assess its generalization to unseen mimicry conditions,
such as humming. Since we cannot easily synthesize
humming from MIDI, we utilize the HumTrans dataset [64],
a MIDI-humming aligned dataset, resulting in an
evaluation dataset of approximately 16.6 hours. Since HumTrans
melodies do not coincide with our test songs, an ideal
source separation setup is impossible to design. Instead,
we synthesize background sources by randomly mixing 3
to 6 sources from the MoisesDB dataset, following our
training procedure described in Section 3.1. The
target source is synthesized from the MIDI information
aligned to the humming, using the method outlined in
Section 2.1 with augmentation. As the virtual instruments
are sampled from the FluidSynth library, which is not
directly comparable to the Slakh2100 benchmark, we
report only an overall SDR result, whose mean is 13.61 dB.
This score exceeds the overall SDR result of our model on
the Slakh2100 benchmark, demonstrating that the mimicry
condition can generalize to humming during inference.
However, the strong performance could also be attributed
to the random mixing used during evaluation, which
simplifies the task of target source extraction for the model.
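The evaluation setup above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation script: the helper names (random_background, sdr_db) are hypothetical, the stems are random stand-ins for MoisesDB sources, and the SDR shown is the plain signal-to-distortion ratio in dB, which may differ from the exact variant used for the reported numbers.

```python
import numpy as np


def random_background(stems, rng, min_sources=3, max_sources=6):
    """Sum a random subset of 3 to 6 stems (assumed to be equal-length arrays)."""
    n = rng.integers(min_sources, max_sources + 1)  # upper bound is exclusive
    idx = rng.choice(len(stems), size=n, replace=False)
    background = np.sum([stems[i] for i in idx], axis=0)
    return background, idx


def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stems = [rng.standard_normal(16000) for _ in range(10)]  # stand-in background stems
    target = rng.standard_normal(16000)                      # stand-in synthesized target
    background, chosen = random_background(stems, rng)
    mixture = target + background
    # A separation model would estimate the target from the mixture;
    # here the mixture itself serves as a trivial (poor) estimate.
    print(len(chosen), sdr_db(target, mixture))
```

Because the background stems are drawn independently of the target melody, such mixtures lack the harmonic and rhythmic overlap of real songs, which is consistent with the caveat above that random mixing may simplify the extraction task.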
5. CONCLUSION
We introduced GuideSep, a diffusion-based music source
separation model that enables flexible, instrument-agnostic
separation using waveform mimicry conditions and
mel-spectrogram masks, and released the codebase. Our
results demonstrate that this approach achieves high-quality
separation while offering greater adaptability compared to
traditional class-based methods. Additionally, our
comparison with a mask-prediction baseline provides insights into
the strengths of generative models for MSS. This work
highlights the potential of diffusion models in advancing
more versatile and user-controllable source separation.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
6. REFERENCES
[1] N. Ono, Z. Rafii, D. Kitamura, N. Ito, and A. Liutkus, “The 2015 signal separation evaluation campaign,” in Latent Variable Analysis and Signal Separation, E. Vincent, A. Yeredor, Z. Koldovský, and P. Tichavský, Eds. Cham: Springer International Publishing, 2015, pp. 387–395.
[2] S. Rouard, F. Massa, and A. Défossez, “Hybrid transformers for music source separation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[3] J. Chen, S. Vekkot, and P. Shukla, “Music source separation based on a lightweight deep learning framework (dttnet: Dual-path tfc-tdf unet),” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 656–660.
[4] R. Sawata, N. Takahashi, S. Uhlich, S. Takahashi, and Y. Mitsufuji, “The whole is greater than the sum of its parts: improving music source separation by bridging networks,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2024, no. 1, p. 39, 2024.
[5] W. Tong, J. Zhu, J. Chen, S. Kang, T. Jiang, Y. Li, Z. Wu, and H. Meng, “SCNet: Sparse compression network for music source separation,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1276–1280.
[6] N. Takahashi and Y. Mitsufuji, “D3Net: Densely connected multidilated densenet for music source separation,” arXiv preprint arXiv:2010.01733, 2020.
[7] Y. Luo and J. Yu, “Music source separation with band-split rnn,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1893–1901, 2023.
[8] W.-T. Lu, J.-C. Wang, Q. Kong, and Y.-N. Hung, “Music source separation with band-split rope transformer,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 481–485.
[9] G. Meseguer-Brocal and G. Peeters, “Conditioned-U-Net: Introducing a control mechanism in the u-net for multiple source separations,” arXiv preprint arXiv:1907.01277, 2019.
[10] O. Slizovskaia, L. Kim, G. Haro, and E. Gomez, “End-to-end sound source separation conditioned on instrument labels,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 306–310.
[11] P. Seetharaman, G. Wichern, S. Venkataramani, and J. Le Roux, “Class-conditional embeddings for music source separation,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 301–305.
[12] D. Samuel, A. Ganeshan, and J. Naradowsky, “Meta-learning extractors for music source separation,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 816–820.
[13] P. Smaragdis, “User guided audio selection from complex sound mixtures,” in Proceedings of the 22nd annual ACM symposium on User interface software and technology, 2009, pp. 89–92.
[14] J. H. Lee, H.-S. Choi, and K. Lee, “Audio query-based music source separation,” arXiv preprint arXiv:1908.06593, 2019.
[15] E. Manilow, G. Wichern, and J. Le Roux, “Hierarchical musical instrument separation.” in ISMIR, 2020, pp. 376–383.
[16] K. N. Watcharasupat and A. Lerch, “A stem-agnostic single-decoder system for music source separation beyond four stems,” arXiv preprint arXiv:2406.18747, 2024.
[17] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Zero-shot audio source separation through query-based learning from weakly-labeled data,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 4, 2022, pp. 4441–4449.
[18] Y. Wang, D. Stoller, R. M. Bittner, and J. P. Bello, “Few-shot musical source separation,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 121–125.
[19] S. Ewert and M. B. Sandler, “Structured dropout for weak label and multi-instance learning and its application to score-informed source separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2277–2281.
[20] M. Miron, J. Janer, and E. Gómez, “Monaural score-informed source separation for classical music using convolutional neural networks.” in ISMIR, vol. 2017, 2017, pp. 55–62.
[21] M. Gover, “Score-informed source separation of choral music,” 2020.
[22] A. J. Munoz-Montoro, J. J. Carabias-Orti, P. Vera-Candeas, F. J. Canadas-Quesada, and N. Ruiz-Reyes, “Online/offline score informed music signal decomposition: application to minus one,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2019, pp. 1–30, 2019.
[23] Y.-N. Hung, G. Wichern, and J. Le Roux, “Transcription is all you need: Learning to separate musical mixtures with score as supervision,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 46–50.
[24] N. J. Bryan, G. J. Mysore, and G. Wang, “ISSE: An interactive source separation editor,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2014, pp. 257–266.
[25] N. J. Bryan and G. J. Mysore, “Interactive user-feedback for sound source separation,” in International Conference on Intelligent User-Interfaces (IUI), Workshop on Interactive Machine Learning. Santa Monica, 2013.
[26] N. Bryan and G. Mysore, “An efficient posterior regularized latent variable model for interactive sound source separation,” in International conference on machine learning. PMLR, 2013, pp. 208–216.
[27] N. J. Bryan and G. J. Mysore, “Interactive refinement of supervised and semi-supervised sound source separation estimates,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 883–887.
[28] P. Smaragdis and G. J. Mysore, “Separation by “humming”: User-guided sound extraction from monophonic mixtures,” in 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. IEEE, 2009, pp. 69–72.
[29] G. Zhu, Y. Wen, M.-A. Carbonneau, and Z. Duan, “EDMSound: Spectrogram based diffusion models for efficient and high-quality audio synthesis,” arXiv preprint arXiv:2311.08667, 2023.
[30] J.-M. Lemercier, J. Richter, S. Welker, E. Moliner, V. Välimäki, and T. Gerkmann, “Diffusion models for audio restoration: A review [special issue on model-based and data-driven audio signal processing],” IEEE Signal Processing Magazine, vol. 41, no. 6, pp. 72–84, 2025.
[31] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[32] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, “DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement,” arXiv preprint arXiv:2008.00264, 2020.
[33] H. J. Park, B. H. Kang, W. Shin, J. S. Kim, and S. W. Han, “MANNER: Multi-view attention network for noise erasure,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7842–7846.
[34] J. Pirklbauer, M. Sach, K. Fluyt, W. Tirry, W. Wardah, S. Moeller, and T. Fingscheidt, “Evaluation metrics for generative speech enhancement methods: Issues and perspectives,” in Speech Communication; 15th ITG Conference. VDE, 2023, pp. 265–269.
[35] E. Postolache, G. Mariani, M. Mancusi, A. Santilli, L. Cosmo, and E. Rodolà, “Latent autoregressive source separation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 8, 2023, pp. 9444–9452.
[36] Y. C. Subakan and P. Smaragdis, “Generative adversarial source separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 26–30.
[37] B. Chen, C. Wu, and W. Zhao, “SEPDIFF: Speech separation based on denoising diffusion model,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[38] S. Lutati, E. Nachmani, and L. Wolf, “Separate and diffuse: Using a pretrained diffusion model for improving source separation,” arXiv preprint arXiv:2301.10752, 2023.
[39] R. Scheibler, Y. Ji, S.-W. Chung, J. Byun, S. Choe, and M.-S. Choi, “Diffusion-based generative speech source separation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[40] R. Scheibler, Y. Fujita, Y. Shirahata, and T. Komatsu, “Universal score-based speech enhancement with high content preservation,” arXiv preprint arXiv:2406.12194, 2024.
[41] Z. Guo, Q. Wang, J. Du, J. Pan, Q.-F. Liu, and C.-H. Lee, “A variance-preserving interpolation approach for diffusion models with applications to single channel speech enhancement and recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
[42] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023.
[43] G. Zhu, J. Darefsky, F. Jiang, A. Selitskiy, and Z. Duan, “Music source separation with generative flow,” IEEE Signal Processing Letters, vol. 29, pp. 2288–2292, 2022.
[44] G. Mariani, I. Tallini, E. Postolache, M. Mancusi, L. Cosmo, and E. Rodolà, “Multi-source diffusion models for simultaneous music generation and separation,” arXiv preprint arXiv:2302.02257, 2023.
[45] T. Karchkhadze, M. R. Izadi, and S. Dubnov, “Simultaneous music separation and generation using multi-track latent diffusion models,” arXiv preprint arXiv:2409.12346, 2024.
[46] D. Henningsson and F. Team, “Fluidsynth real-time and thread safety challenges,” in Proceedings of the 9th International Linux Audio Conference, Maynooth University, Ireland, 2011, pp. 123–128.
[47] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[48] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning. PMLR, 2015, pp. 2256–2265.
[49] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.
[50] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” Advances in neural information processing systems, vol. 35, pp. 26 565–26 577, 2022.
[51] C. Breithaupt and R. Martin, “Analysis of the decision-directed snr estimator for speech enhancement with respect to low-snr and transient conditions,” IEEE transactions on audio, speech, and language processing, vol. 19, no. 2, pp. 277–289, 2010.
[52] J. Serrà, S. Pascual, J. Pons, R. O. Araz, and D. Scaini, “Universal speech enhancement with score-based diffusion,” arXiv preprint arXiv:2206.03065, 2022.
[53] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan, “Music ControlNet: Multiple time-varying controls for music generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2692–2703, 2024.
[54] H. F. García, O. Nieto, J. Salamon, B. Pardo, and P. Seetharaman, “Sketch2Sound: Controllable audio generation via time-varying signals and sonic imitations,” arXiv preprint arXiv:2412.08550, 2024.
[55] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36 479–36 494, 2022.
[56] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “DPM-Solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
[57] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models,” arXiv preprint arXiv:2211.01095, 2022.
[58] E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019.
[59] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl, “MoisesDB: A dataset for source separation beyond 4-stems,” arXiv preprint arXiv:2307.15913, 2023.
[60] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE transactions on audio, speech, and language processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[61] R. Scheibler, “SDR—medium rare with fast computations,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 701–705.
[62] C.-B. Jeon, G. Wichern, F. G. Germain, and J. Le Roux, “Why does music source separation benefit from cacophony?” in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). IEEE, 2024, pp. 873–877.
[63] M. Schoeffler, S. Bartoschek, F.-R. Stöter, M. Roess, S. Westphal, B. Edler, and J. Herre, “webMUSHRA—a comprehensive framework for web-based listening tests,” Journal of Open Research Software, vol. 6, no. 1, 2018.
[64] S. Liu, X. Li, D. Li, and Y. Shan, “HumTrans: A novel open-source dataset for humming melody transcription and beyond,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 7915–7919.