User-Guided Generative Source Separation

Author: Yutong Wen; Minje Kim; Paris Smaragdis

Publisher: Zenodo

DOI: 10.5281/zenodo.17706603

Source: https://zenodo.org/records/17706603/files/000096.pdf

USER-GUIDED GENERATIVE SOURCE SEPARATION
Yutong Wen Minje Kim Paris Smaragdis
University of Illinois at Urbana-Champaign
[email protected]
ABSTRACT
Music source separation (MSS) aims to extract individual instrument sources from their mixture. While most existing methods focus on the widely adopted four-stem separation setup (vocals, bass, drums, and other instruments), this approach lacks the flexibility needed for real-world applications. To address this, we propose GuideSep, a diffusion-based MSS model capable of instrument-agnostic separation beyond the four-stem setup. GuideSep is conditioned on multiple inputs: a waveform mimicry condition, which can be easily provided by humming or playing the target melody, and mel-spectrogram domain masks, which offer additional guidance for separation. Unlike prior approaches that relied on fixed class labels or sound queries, our conditioning scheme, coupled with the generative approach, provides greater flexibility and applicability. Additionally, we design a mask-prediction baseline using the same model architecture to systematically compare predictive and generative approaches. Our objective and subjective evaluations demonstrate that GuideSep achieves high-quality separation while enabling more versatile instrument extraction, highlighting the potential of user participation in the diffusion-based generative process for MSS. Our code and demo page are available at https://yutongwen.github.io/GuideSep/.
1. INTRODUCTION
Music source separation (MSS) aims to separate a mixture audio into its constituent sources, typically defined by the instrument. Since the 2015 Signal Separation Evaluation Campaign (SiSEC) [1], the MSS community has largely focused on supervised models that separate songs into four stems: vocals, bass, drums, and others, where the last stem includes all remaining instruments, a setup commonly referred to as VBDO. Under this framework, numerous recent deep neural network (DNN) models have significantly advanced performance [2–8]. While this setup provides a convenient benchmark, it lacks the flexibility needed for real-world applications: ideally, MSS systems should be able to extract any target instrument of interest.
© Y. Wen, M. Kim, and P. Smaragdis. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Wen, M. Kim, and P. Smaragdis, “User-Guided Generative Source Separation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In this regard, several works have extended MSS beyond the VBDO setup. To enable the separation of arbitrary instruments, the model must first be provided with a condition specifying the target instrument, such as instrument class labels [9–12]. In [9, 11] this conditioning method is shown to work for the VBDO setup, whereas [10] extends this approach to 13 instruments. However, class labels can be vague, as instruments like the guitar may exhibit significant variability within the same label. Moreover, new instrument classes require re-training. Another approach, query-based MSS, conditions the model using a sound example, where the model extracts sources similar to the example [13–18]. For instance, Watcharasupat et al. [16] designed a lightweight model capable of instrument-agnostic separation using a single query, while Wang et al. [18] developed a model that accepts up to five queries to improve performance stability. Despite its potential to provide rich information about the target source, query-based separation may be limited in real-world applications where high-quality queries are unavailable. Additionally, MSS models can be conditioned on the MIDI score of the target instrument [19–23]. While MIDI information provides a strong and accurate cue, it is often unavailable in many real-world scenarios, such as pop music. Bryan et al. [24–27] proposed an alternative method where users sketch a rough mask on the spectrogram of the mixture to indicate the target source. However, this approach can suffer from ambiguity, as identifying the target instrument’s region in the mixture spectrogram is often challenging. Smaragdis et al. [28] leverage humming as guidance to separate a target source. Unlike label-based or sound query conditioning, humming offers users greater flexibility when interacting with the system.
In this work, we propose a guided separation (GuideSep) method, a conditional complex-spectrogram domain diffusion model designed to address music source separation beyond the VBDO setup in an instrument-agnostic manner. Building on the observations of existing methods for MSS beyond VBDO, we condition the diffusion model on multiple inputs: a waveform mimicry of a target source and mel-spectrogram domain masks. While MIDI score information is often difficult to obtain in real-world scenarios, users are capable of providing a mimicry by humming or playing the target melody with an instrument of their choice. Additionally, we introduce a rough mask on the mel-spectrogram for the users to further inform the model of the region to focus on. During inference, either or both conditions can be utilized, offering users a flexible way to specify the target source for separation from the mixture. Our diffusion model is built on EDMSound [29], a complex-spectrogram domain diffusion method designed for both unconditional and label-conditioned audio generation. We modify the model backbone to support multiple conditioning inputs.
Traditionally, audio source separation has been tackled using predictive models¹, which map the mixture input to an estimated clean output by minimizing a point-wise loss function [31–33]. While predictive models often struggle with residual noise and artifacts [34] in enhancement tasks, generative models have the potential to produce cleaner results by directly or indirectly modeling the clean prior. In recent years, significant progress has been made in applying generative models to audio separation tasks, particularly in speech enhancement and separation [35–42]. While most music source separation (MSS) methods are still predictive, a few generative approaches have begun to emerge. For instance, Ge et al. proposed a flow-based model, InstGlow [43], which leverages the priors of clean sources to improve separation results within the VBDO setup. Additionally, multi-source diffusion models have been proposed for simultaneous music source separation and generation [44,45]. These approaches employ a multi-channel diffusion process to model the joint distribution of individual sources and condition on the mixture to sample individual sources during inference, enabling separation. While this formulation provides control over which instrument to synthesize or separate, it is limited to the specific set of instruments the model is trained on.
While there is growing interest in applying generative methods to MSS, to the best of our knowledge, no prior work has systematically compared generative methods with their direct counterparts. In this work, we address this gap by designing a mask-prediction baseline that shares the exact same model backbone as our diffusion model. We then conduct a systematic evaluation to analyze the differences between the two approaches.
Our contributions can be summarized as follows:
1) We propose GuideSep, one of the first diffusion-based models designed to address music source separation beyond the VBDO setup, and we release the codebase.
2) We introduce versatile, instrument-agnostic conditions (waveform mimicry conditions and mel-spectrogram domain masks) that are more practical for real-world applications.
3) We design a mask-prediction baseline using the same model architecture and conduct a systematic evaluation to analyze the differences between predictive and generative approaches.
2. THE PROPOSED GUIDESEP METHOD
GuideSep is a diffusion model conditioned by user input. Our approach leverages users’ input describing a source, i.e., the raw waveform of user mimicry of a target source as well as a rough mask in the mel-spectrogram domain.
¹ Some literature refers to predictive models as discriminative or deterministic. Lemercier et al. [30] note that predictive models encompass both concepts.
Figure 1: Illustration of an example of positive user-input Mel-spectrogram mask. (a) Mel-spectrogram of the mixture, with the target source region marked. (b) Mel-spectrogram of the mixture with the user Mel-mask overlay (positive Mel-spectrogram mask from user sketch).
2.1 Condition signals
2.1.1 Mimicry condition
The mimicry guidance is a user-provided time-domain waveform, such as a hummed rendition of the target melody or the melody played on another instrument. Due to the lack of real-world data for training, we simulate the mimicry guidance by converting the ground-truth MIDI score of the target source to audio using the FluidSynth library [46]. Real-world mimicry inputs often include off-pitch notes, imperfect timing, and limitations in note range. Additionally, since many instruments and vocal mimicry are monophonic, it poses significant challenges when extracting polyphonic sources, e.g., guitar or piano. We simulate user input via various data augmentation techniques:
• Off-pitch melodies are simulated by introducing perturbations to the MIDI notes. Each note has a 50% probability of being pitch-bent, whose amount is randomly sampled from a uniform distribution ranging from −0.4 to +0.4 semitones.
• Imperfect timing is simulated by introducing variations in the timing of MIDI notes. With a 40% probability, the start and end times of a note are shifted by up to ±30 milliseconds. The time shift of a note will also be applied to its following notes.
• Limitation of note range is simulated by randomly shifting MIDI notes up or down by one octave with a 50% probability.
• Extraction using non-polyphonic instruments: We restrict the condition melody to be monophonic to reflect real-world limitations of many instruments and humming. It encourages the model to infer missing notes of the target source using other side information, such as the mel-spectral mask. The choice of a monophonic condition was driven by our focus on human voice guidance; however, this is a limitation of the training data rather than the algorithm itself.
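The first three augmentations above can be sketched on a list of notes before synthesis. This is a minimal illustration, not the released implementation: the `Note` class and `augment_notes` function are hypothetical names, and several details are assumptions (the octave shift is drawn per note, the same timing offset is applied to a note's start and end, and offsets accumulate across later notes). The monophonic restriction and the FluidSynth rendering step are not shown.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    pitch: float  # MIDI pitch in semitones; fractional values model pitch bend
    start: float  # onset time in seconds
    end: float    # offset time in seconds


def augment_notes(notes, rng):
    """Perturb a monophonic note list to imitate an imperfect human mimicry."""
    augmented = []
    drift = 0.0  # accumulated timing error, carried over to all later notes
    for note in notes:
        pitch = note.pitch
        if rng.random() < 0.5:   # off-pitch: bend by up to +/-0.4 semitones
            pitch += rng.uniform(-0.4, 0.4)
        if rng.random() < 0.5:   # limited range: jump one octave up or down (assumed per note)
            pitch += rng.choice((-12.0, 12.0))
        if rng.random() < 0.4:   # imperfect timing: shift by up to +/-30 ms
            drift += rng.uniform(-0.030, 0.030)
        start = max(0.0, note.start + drift)
        end = max(start, note.end + drift)
        augmented.append(Note(pitch, start, end))
    return augmented


melody = [Note(60, 0.0, 0.5), Note(62, 0.5, 1.0), Note(64, 1.0, 1.5)]
noisy = augment_notes(melody, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing training runs.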
2.1.2 Mel-spectral masks
Our second conditioning input is the user-created mask in the mel-spectrogram domain, which distinguishes regions corresponding to the target source from those of background sources. While being conceptually aligned with [24], GuideSep uses it to condition a deep generative model. Specifically, we define two types of masks: positive and negative masks to indicate the target and background source regions, respectively. The mel-spectrogram domain is chosen due to its greater interpretability and easier identification of the sources compared to the Fourier transform’s linear frequency scale.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1 illustrates the process of creating a mel-spectrogram mask based on user input. We implement a user interface where users can sketch on the mel-spectrogram of the mixture with different brush sizes and confidence levels to indicate regions they believe correspond to the target source or background music. In practice, user-provided masks may exclude portions of the target source or unintentionally include regions of background sound. Additionally, in many cases, the target source significantly overlaps the background sources, further complicating the masking process. To simulate these real-world imperfections during training, we generate synthetic user input masks by applying a Gaussian filter, whose standard deviation ranges between 4 and 6, to the ground-truth mel-spectrograms of the target source and the residual sources. In addition, we randomly drop out 40% of patches.
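The mask simulation described above (Gaussian blur followed by patch dropout) might look roughly like the following NumPy sketch. The function names, the 8×8 patch size, the edge padding, and the peak normalization to [0, 1] are assumptions made for illustration; the text specifies only the blur standard deviation range and the 40% dropout rate.

```python
import numpy as np


def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at three standard deviations."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def blur2d(img, sigma):
    """Separable Gaussian blur with edge padding (output has the input shape)."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    out = np.asarray(img, dtype=float)
    for axis in (0, 1):
        widths = [(pad, pad) if a == axis else (0, 0) for a in (0, 1)]
        padded = np.pad(out, widths, mode="edge")
        out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), axis, padded)
    return out


def simulate_user_mask(mel, rng, patch=(8, 8), drop_prob=0.4):
    """Blur a ground-truth mel-spectrogram and zero out random patches."""
    sigma = rng.uniform(4.0, 6.0)               # blur strength, as stated in the text
    mask = blur2d(mel, sigma)
    mask = mask / (mask.max() + 1e-8)           # assumed normalization to [0, 1]
    n_mels, n_frames = mask.shape
    ph, pw = patch                              # hypothetical patch size
    grid = (-(-n_mels // ph), -(-n_frames // pw))   # ceil division
    keep = rng.random(grid) >= drop_prob        # drop about 40% of patches
    keep = np.repeat(np.repeat(keep, ph, axis=0), pw, axis=1)[:n_mels, :n_frames]
    return mask * keep
```

Applying the same routine to the target and to the residual sources would give the positive and negative masks, respectively.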
2.2 Conditional complex spectrogram diffusion
Diffusion probabilistic models (DPMs) [47, 48] consist of two key processes: progressively corrupting training data by adding noise until it approximates a normal distribution, and learning to reverse each step of this noise corruption using the same functional form. These models can be generalized as score-based generative models [49], which utilize an infinite number of noise scales, enabling both the forward and backward diffusion processes to be described by stochastic differential equations (SDEs). During inference, the reverse SDE is employed to generate samples numerically, starting from a standard normal distribution.
Complex spectrogram diffusion with EDM: Our work is based on EDMSound [29]. We train our diffusion model using the EDM framework [50], which reformulates the diffusion SDE in terms of noise scales rather than drift and diffusion coefficients. To ensure that the inputs to the neural network are appropriately scaled within the range $[-1, 1]$, as required by the diffusion models, we apply an amplitude transformation to the complex spectrogram inputs. Specifically, we use $\tilde{c} = \beta |c|^{\alpha} e^{i \angle c}$, as proposed in [42,51], where $\alpha \in (0, 1]$ is a compression factor that emphasizes time-frequency bins with lower energy, $\angle c$ denotes the phase of the original complex spectrogram $c$, and $\beta \in \mathbb{R}_{+}$ is a scaling factor that normalizes amplitudes approximately to the range $[0, 1]$.
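As a small illustration of the amplitude transformation, the sketch below compresses the magnitude of a complex spectrogram while keeping its phase, and inverts the mapping again. The function names and the default values of `alpha` and `beta` are placeholders chosen for the example; the text does not state the values used.

```python
import numpy as np


def compress(c, alpha=0.5, beta=1.0):
    """c_tilde = beta * |c|**alpha * exp(i * angle(c)); the phase is unchanged."""
    return beta * np.abs(c) ** alpha * np.exp(1j * np.angle(c))


def expand(c_tilde, alpha=0.5, beta=1.0):
    """Inverse mapping, applied to the model output before the inverse STFT."""
    return (np.abs(c_tilde) / beta) ** (1.0 / alpha) * np.exp(1j * np.angle(c_tilde))
```

With `alpha` below 1, low-energy bins are boosted relative to high-energy ones, which is the emphasis effect the paragraph describes.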
Adding conditions to EDMSound: To adapt EDMSound for target sound extraction, we modify the network to accept conditional inputs, including the mixture signal, mimicry signal, and spectral masks. Rather than modeling $p(s \mid c)$, where $s$ is the target source and $c$ is an instrument label, we instead model $p(s \mid c_{\mathrm{mix}}, c_{\mathrm{mimicry}}, c_{\mathrm{masks}})$. Here, $c_{\mathrm{mix}}$ corresponds to the music mixture represented in the complex-valued short-time Fourier transform (STFT) domain, while $c_{\mathrm{mimicry}}$ denotes the mimicry condition in the form of magnitude STFT, assuming phase information is a distraction when it comes to representing spectrum information. Finally, $c_{\mathrm{masks}}$ refers to the normalized magnitudes of mel-spectrogram masks, ranged between 0 and 1.
Figure 2: Overview of the GuideSep at inference time. Our model accepts mimicry condition and mel-spectrogram domain masks as guidance from users to extract the target source from the mixture. (Diagram labels: UI, Mixture, Mel-Mask, Mimicry, Condition U-Net, conditioning, Gaussian Noise N(0; I), Score U-Net, Diffusion Model, Extracted Target.)
The proposed architecture: Building on insights from prior works [40, 52–54], we design our model as depicted in Figure 2. The architecture comprises two primary U-Net structures. The first one, referred to as the score U-Net, aligns with the original U-Net used in EDMSound. Without the conditioning input, this part of the model would perform blind audio synthesis from a Gaussian noise sample. The second module, the condition U-Net, is introduced to tame this otherwise entirely generative behavior of the score U-Net. The condition U-Net is dedicated to processing all conditional inputs, including the mixture. These two U-Nets are connected so that the output of each layer in the condition U-Net is element-wise added to its corresponding layer in the score U-Net, spanning both the downsampling and upsampling layers. Since the mel-scale masks are in a different frequency dimension compared to the magnitude and complex spectrograms, we introduce a simple 1-hidden-layer neural network to project the mel-frequency axis onto the spectrogram frequency axis. Since there are three conditions in the form of a spectrogram (mixture, mimicry, and projected masks), we concatenate them along the channel dimension and feed them as input to the condition U-Net. Our U-Net architecture is adapted from Imagen [55], chosen for its high sample quality, rapid convergence, and memory efficiency.
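Loss function: During training, we optimize the model using preconditioned denoising score matching, following [50]. The training objective is formulated as
$$\mathbb{E}_{s}\,\mathbb{E}_{n}\left[\lambda(\sigma)\,\big\| D\big(s + n;\ \sigma, c_{\mathrm{mix}}, c_{\mathrm{mimicry}}, M(c_{\mathrm{masks}})\big) - s \big\|_2^2\right],$$
where $D(\cdot)$ is the EDM weighted neural network, $\sigma$ is the noise level, $\lambda(\cdot)$ is the loss weighting, which is $(\sigma^2 + \sigma_{\mathrm{data}}^2)/(\sigma \cdot \sigma_{\mathrm{data}})^2$ for the EDM framework, $M(\cdot)$ denotes the 1-hidden-layer projection network, and $n \sim \mathcal{N}(0, \sigma^2 I)$ is Gaussian noise.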
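A single-sample estimate of this objective can be written in a few lines. The sketch below assumes the standard EDM loss weighting $\lambda(\sigma) = (\sigma^2 + \sigma_{\mathrm{data}}^2)/(\sigma \sigma_{\mathrm{data}})^2$; `sigma_data = 0.5` is a placeholder value, and `denoiser` stands in for the conditioned network $D$ with a hypothetical call signature.

```python
import numpy as np


def edm_loss_weight(sigma, sigma_data=0.5):
    """Standard EDM weighting; sigma_data is a placeholder, not the paper's value."""
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2


def denoising_loss(denoiser, s, sigma, cond, rng, sigma_data=0.5):
    """One Monte Carlo sample of lambda(sigma) * ||D(s + n; sigma, cond) - s||^2."""
    n = rng.normal(scale=sigma, size=s.shape)   # n ~ N(0, sigma^2 I)
    pred = denoiser(s + n, sigma, cond)         # hypothetical denoiser interface
    return edm_loss_weight(sigma, sigma_data) * np.sum((pred - s) ** 2)
```

In training, this quantity would be averaged over target sources `s` and noise levels `sigma` drawn from the training distribution.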
Inference: Within the EDM framework, the probability flow ordinary differential equation (ODE) can be simplified into a nonlinear ODE, allowing the direct use of standard off-the-shelf ODE solvers, such as high-order Exponential Integrator (EI)-based ODE solvers [56], specifically multistep DPM solvers [56, 57], for sampling as in EDMSound.
3. EXPERIMENT
We conduct experiments using the Slakh2100 dataset [58] augmented by MoisesDB [59] for training. The Slakh2100 dataset provides an official train-validation-test split, which we utilize as well. We evaluate our model’s performance using the widely adopted signal-to-distortion ratio (SDR) metrics [60,61].
3.1 Training and model details
3.1.1 The datasets
Slakh2100 is a synthetic, waveform-MIDI-aligned music dataset containing 2,100 tracks, totaling around 145 hours of audio. In our training process, instead of using the original mix from the dataset, we generate training data through random mixing. This allows for nearly infinite variations of training samples. While this approach may result in the loss of some musical context, previous work [62] has demonstrated that random mixing can improve MSS model performance. To enhance our model’s performance on real-world music, we utilize the MoisesDB dataset [59] to construct background sources. MoisesDB is a comprehensive multitrack dataset designed for source separation beyond 4-stems, featuring 240 previously unreleased songs by 47 artists across twelve high-level genres, totaling approximately 14 hours of audio. During random mixing, we randomly select 3 to 6 sources from the MoisesDB dataset to serve as background music, while the target source is drawn from the Slakh2100 dataset. The background and target sources are mixed at signal-to-noise ratios (SNR) ranging from −5 dB to 5 dB. All input audio is converted to a single channel and resampled to 16 kHz, and then trimmed or padded to around 4.1 seconds for batched training.
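The random-mixing step can be sketched as follows, assuming mono stems already resampled to 16 kHz. The names `fix_length` and `random_mix`, the exact segment length (the text says only "around 4.1 seconds"), and the choice to treat the summed background as the "noise" in the SNR are assumptions for this example.

```python
import numpy as np

SAMPLE_RATE = 16_000
SEGMENT = round(4.1 * SAMPLE_RATE)  # about 4.1 s; the exact length is an assumption


def fix_length(x, n=SEGMENT):
    """Trim or zero-pad a mono signal to exactly n samples."""
    return x[:n] if len(x) >= n else np.pad(x, (0, n - len(x)))


def random_mix(target, stem_pool, rng):
    """Mix one target stem with 3 to 6 random background stems at a random SNR."""
    k = int(rng.integers(3, 7))                          # 3, 4, 5, or 6 stems
    picks = rng.choice(len(stem_pool), size=k, replace=False)
    background = sum(fix_length(stem_pool[i]) for i in picks)
    target = fix_length(target)
    snr_db = rng.uniform(-5.0, 5.0)
    p_target = np.mean(target**2)
    p_background = np.mean(background**2)
    # Scale the background so that 10*log10(p_target / p_background) equals snr_db.
    gain = np.sqrt(p_target / (p_background * 10 ** (snr_db / 10) + 1e-12))
    background = gain * background
    return target + background, target, background, snr_db
```

Because the stems and the SNR are redrawn on every call, the same pool of recordings yields a practically unlimited number of distinct training mixtures.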
3.1.2 Dropout strategies
To ensure that the model can process any combination of input types, we incorporate dropout strategies during training. This allows the model to operate with an incomplete set of conditions, such as the mimicry-only or mel-spectrogram-mask-only cases. To this end, we randomly drop out either the mimicry condition or mel-spectrogram masks, ensuring that the model learns to predict the target source even when provided with partial conditioning information. Additionally, we empirically observe that the model benefits from mimicry-only conditioned synthesis tasks, which arise when we randomly drop the mixture input $c_{\mathrm{mix}}$ during training. This encourages the model to infer the target source from melodic guidance alone. Specifically, during training, we drop 30% of the mimicry condition, 70% of the mel-masks, and 10% of the mixture. The high dropout rate of mel masks is intentional and tuned using the validation split, as they provide a strong cue for the target source. By reducing their presence, the model is encouraged to focus more on learning from the mimicry condition.
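One plausible way to realize this per-example condition dropout is shown below. It is only a sketch: the text gives the dropout rates but not the mechanism, so replacing a dropped condition with zeros and drawing the three decisions independently are both assumptions, and `drop_conditions` is a hypothetical name.

```python
import numpy as np


def drop_conditions(c_mix, c_mimicry, c_masks, rng,
                    p_mimicry=0.3, p_masks=0.7, p_mix=0.1):
    """Randomly blank out conditioning inputs for one training example."""
    if rng.random() < p_mimicry:      # 30%: no mimicry guidance
        c_mimicry = np.zeros_like(c_mimicry)
    if rng.random() < p_masks:        # 70%: no mel-mask guidance
        c_masks = np.zeros_like(c_masks)
    if rng.random() < p_mix:          # 10%: no mixture (mimicry-conditioned synthesis)
        c_mix = np.zeros_like(c_mix)
    return c_mix, c_mimicry, c_masks
```

With independent draws, a small fraction of examples lose every condition; a real training loop may want to handle or exclude that case explicitly.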
3.1.3 The model architecture
For both score and condition U-Net modules, we utilize an efficient U-Net architecture adapted from the open-source Imagen implementation², which is known for memory efficiency and fast convergence. Both U-Nets incorporate downsampling and upsampling blocks, each containing two ResNet blocks with a self-attention layer that uses two attention heads. The bottleneck dimension is 128. The complete model has 93.3 million trainable parameters. The input to the condition U-Net consists of three types: a complex spectrogram $c_{\mathrm{mix}}$ and a magnitude spectrogram $c_{\mathrm{mimicry}}$, both with a window size of 512 samples and a hop size of 256 samples, and the mel-spectrogram masks $c_{\mathrm{masks}}$, which share the same hop size and consist of 80 mel-frequency bins. Altogether, this is a five-channel spectrogram input: two for the complex spectrogram, one for the magnitude spectrogram, and two for the positive (i.e., target source) and negative (i.e., background music) masks. The score U-Net, as an autoencoder, takes a two-channel spectrogram as its input and output representation, where the two channels represent the real and imaginary components of the complex spectrogram, respectively. Note that at the very beginning of the sampling process, the input spectrogram to the score U-Net is noise sampled from a Gaussian distribution. Additionally, we condition the network on logarithmically scheduled noise levels $\sigma$.
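To make the channel bookkeeping concrete, the sketch below assembles the five-channel condition input from the three condition types. The hidden width, the ReLU activation, and the function names are assumptions; the text states only that a 1-hidden-layer network maps the 80 mel bins onto the spectrogram frequency axis (257 bins for a 512-sample window).

```python
import numpy as np

N_FREQ = 512 // 2 + 1   # 257 STFT bins for a 512-sample window
N_MELS = 80


def project_masks(masks, w1, b1, w2, b2):
    """1-hidden-layer MLP along frequency: (2, N_MELS, T) -> (2, N_FREQ, T)."""
    hidden = np.einsum("hm,cmt->cht", w1, masks) + b1[None, :, None]
    hidden = np.maximum(hidden, 0.0)            # ReLU is an assumed activation
    return np.einsum("fh,cht->cft", w2, hidden) + b2[None, :, None]


def build_condition(c_mix, c_mimicry, masks, params):
    """Stack real/imag mixture, mimicry magnitude, and projected masks: (5, N_FREQ, T)."""
    projected = project_masks(masks, *params)
    return np.concatenate(
        [c_mix.real[None], c_mix.imag[None], c_mimicry[None], projected], axis=0
    )
```

The time axis `T` is shared by all inputs because the STFTs and the mel masks use the same hop size.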
3.1.4 Inference
For inference, we employ an EI-based DPM sampler [56, 57]. To ensure compatibility between the EDM framework samplers and arbitrary training objectives during inference, we implement input rescaling as needed. Specifically, we rescale both the noisy inputs and noise levels to align with the network’s original training-time scales. The results, presented in Section 4, are obtained using an 8-step sampler configuration.
3.1.5 Training details
Our model is trained with a batch size of 36 and a learning rate of $1 \times 10^{-4}$ using the Adam optimizer. The training process runs for 300k updates. We used two NVIDIA L40S GPUs and trained for ten days.
² https://github.com/lucidrains/imagen-pytorch
Model                     | Piano     | Guitar     | Bass       | Strings   | Brass     | Synth     | Pipe       | Reed       | Organ      | Chromatic Percussion | Overall
Ours (full)               | 8.34±0.11 | 10.53±0.09 | 11.97±0.12 | 9.64±0.12 | 9.15±0.38 | 9.25±0.20 | 15.58±0.27 | 13.78±0.24 | 13.44±0.22 | 11.53±0.36           | 10.46
Baseline (full)           | 7.03±0.09 | 8.72±0.08  | 8.69±0.06  | 9.06±0.13 | 8.03±0.31 | 8.00±0.17 | 14.99±0.26 | 11.94±0.23 | 11.25±0.23 | 9.62±0.31            | 8.74
Ours (mimicry only)       | 7.46±0.11 | 9.96±0.10  | 11.19±0.13 | 8.63±0.14 | 7.95±0.45 | 8.13±0.23 | 14.43±0.34 | 13.14±0.27 | 12.39±0.25 | 8.74±0.43            | 9.60
  w/ pseudo-masks         | 7.99±0.11 | 10.18±0.10 | 9.87±0.15  | 8.72±0.15 | 8.21±0.40 | 8.19±0.25 | 14.81±0.31 | 13.20±0.26 | 12.20±0.29 | 8.26±0.55            | 9.56
Ours (positive mask only) | 7.86±0.11 | 10.17±0.09 | 11.45±0.13 | 9.48±0.12 | 8.97±0.38 | 9.14±0.19 | 15.19±0.28 | 13.42±0.25 | 13.08±0.23 | 11.09±0.38           | 10.09
Ours (humming)*           | -         | -          | -          | -         | -         | -         | -          | -          | -          | -                    | 13.61
Frequency (%)             | 20.71     | 27.89      | 17.79      | 15.23     | 2.65      | 4.74      | 2.72       | 3.06       | 3.43       | 1.78                 | -
Table 1: SDR (dB) results with 95% confidence interval (higher values indicate better performance) for ten instrument classes in the Slakh2100 test split. The results include GuideSep (our method) under various input conditions and the mask-prediction baseline. The best scores are highlighted in bold. For asterisk (*) please refer to Section 4.2.
3.2 Baselines
To the best of our knowledge, no existing work offers a fair comparison, as our method introduces a novel conditioning approach. However, we design a traditional mask-prediction model to compare the proposed generative approach against. The baseline shares the same twin U-Net architecture and structural details as our diffusion backbone. In particular, the input to the score U-Net portion is the magnitude spectrogram of the mixture, while the input to the condition U-Net consists of the magnitude spectrogram of the mimicry condition and the masks. The model outputs a non-binary mask by applying a sigmoid function after the output layer, which is then used to compute the target source magnitude spectrogram through element-wise multiplication with the input mixture magnitude spectrogram. The final waveform is reconstructed by combining the predicted magnitude spectrogram with the phase information from the original mixture. We train the mask-prediction baseline using the L2 reconstruction loss in the magnitude spectrogram domain, with the same learning rate, batch size, and number of updates as our diffusion model. Note that, due to the absence of time-step conditional inputs and differences in input channels, the mask-prediction baseline contains 80.3 million parameters.
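The output stage of such a mask-prediction baseline can be illustrated as follows. Here `logits` stands for the raw network output before the sigmoid, and the function names are hypothetical; the inverse STFT that turns the estimate back into a waveform is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def apply_mask(logits, mix_stft):
    """Soft mask on the mixture magnitude, recombined with the mixture phase."""
    magnitude = sigmoid(logits) * np.abs(mix_stft)
    return magnitude * np.exp(1j * np.angle(mix_stft))


def l2_magnitude_loss(logits, mix_stft, target_stft):
    """L2 reconstruction loss in the magnitude spectrogram domain."""
    estimate = sigmoid(logits) * np.abs(mix_stft)
    return np.mean((estimate - np.abs(target_stft)) ** 2)
```

Because the mask lies in (0, 1) and the mixture phase is reused, the estimate can only attenuate mixture bins, which is one reason residual interference from overlapping sources may remain.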
4. EVALUATION AND DISCUSSION
We evaluate our model on the official test split of the Slakh2100 dataset. The mimicry condition signals are synthesized as described in Section 2.1, using randomly selected virtual instruments from the FluidSynth library. Similarly, the positive and negative masks are simulated following the same procedure outlined in Section 2.1. For evaluation, we group the instrument classes in Slakh2100 into ten broader categories, where drum tracks are excluded from the target sources, because our synthesis method does not apply to them. The evaluation results are presented in Table 1.
In the first two rows in Table 1, we present the results of our model and the mask-prediction baseline. Both models utilize the mimicry condition and mel-spectrogram masks during inference, denoted with ‘(full)’ in the table. The results demonstrate that our model consistently outperforms the mask-prediction baseline across all instrument classes. Given that the mask-prediction baseline shares the same model backbone, training data, and configuration as our diffusion model, the performance gap highlights the benefits of using the diffusion approach. In the listening test, we observe that the mask-prediction baseline often reconstructs target sources which still contain interferences. In contrast, while our diffusion model may occasionally exhibit inexact timbre, it generally generates cleaner target sources. This can be attributed to the diffusion model learning a prior distribution of clean sources, which biases its outputs toward cleaner results. Although our findings align with the well-known behavior of generative models, our experiments are limited to the particular choice of the diffusion model and a masking-based baseline with a matching architecture, leaving more general arguments to future work. We also observe that both models work better for the monophonic sources than the polyphonic ones, such as piano, guitar, strings, and synth, where our strictly monophonic mimicry condition is not informative enough. As a result, the models may struggle with missing notes from chords, extracting the wrong target instrument, or even extracting multiple instruments when they share a similar melody, which is common in music. For results on real-world conditions, please refer to our demo page.
4.1 Subjective Listening Tests
In addition to the SDR results, we conduct a subjective listening test to further evaluate our model. We modify webMUSHRA [63] so the test comprises two sections: the first assesses the overall quality of the model’s separation results, while the second focuses specifically on evaluating the timbre of the reconstructed target source. In the first section, each question presents the music mixture as a reference. Participants are asked to compare and rate four stimuli: the ground truth, the mixture itself (i.e., the hidden reference), and the predictions from our model and the mask-prediction baseline. Participants are unaware that one of the stimuli is the actual ground truth and are instead told that the three stimuli are potential reconstructions of a target source. The participants are asked to first identify the mixture and assign it a score of 0, then rate the remaining stimuli (e.g., with 100 being a perfect match) based on how closely they resemble the target source in the mixture. This part consists of ten trials, with each mixture sample randomly selected from a different instrument class in the Slakh2100 test split. We use the mixture as a reference instead of the ground-truth source in order to measure the listener’s opinion on the “synthesized” source without introducing any prejudice.
To make up for the modification introduced in the first part, the second part is dedicated to evaluating the potential artifact specific to the generative models, i.e., the timbre change. This time, each trial presents the ground-truth target source as the reference. Participants compare and rate three stimuli: the hidden reference (i.e., the ground truth itself) and the predictions from our model and the mask-prediction baseline. However, ratings are based on timbre similarity to the reference, with 100 indicating an exact match and 0 representing a completely different timbre. Participants are instructed to focus solely on the timbre of the target source while disregarding any interference or artifacts. Both parts use the same set of music samples, but the second part is presented only after participants complete the first part to avoid bias, ensuring they remain unaware that the ground truth was included in the first part.
Figure 3: Mean MUSHRA Score with 95% confidence interval of the subjective listening test on separation quality and separation timbre quality. (a) Sec. 1, MUSHRA result on separation quality. (b) Sec. 2, MUSHRA result on timbre similarity. (Bars: GuideSep, Baseline, GT; horizontal axis: MUSHRA Score.)
A total of 13 participants took part in the subjective listening test, and the results from both parts are presented in Figure 3. In the first part, where participants rated the separation quality, our model scored 82.82±2.95, the ground truth scored 90.13±2.06, and the mask-prediction baseline scored 50.69±3.41. Notably, despite the mask-prediction baseline having a relatively small SDR difference from our model, the listening test revealed a significant gap in perceptual evaluation. This suggests that users may perceive a cleaner target source prediction as more satisfactory, even if a slightly noisier prediction achieves a decent sample-wise similarity to the target source.
In the second part, where participants rated timbre preservation, our model scored 79.38±3.76, surpassing the mask-prediction baseline, which scored 68.88±4.07. In theory, the mask-prediction baseline could preserve the original timbre better, but our subjective listening test results suggest otherwise. Based on the listeners’ feedback, we speculate that this outcome is influenced by the nature of the target source extraction task, where multiple sources in a musical piece may share similar melodic patterns. As a result, the mask-prediction baseline’s output can be contaminated by interfering similar melodies, which can be perceived as a timbral change rather than artifacts.
4.2 Ablation
Beyond evaluating our model with both conditioning signals, we conduct an ablation study to assess its performance under different input conditions.
Mimicry-only: We evaluate the model using the ‘(mimicry only)’ setup (Table 1). We observe a slight overall decrease in performance, indicating that while mel-masks contribute to improved performance, the model remains effective even when conditioned solely on the melody signal.
Pseudo-masks: When only the mimicry condition is available, we can generate pseudo mel-masks using the mimicry condition and the mixture. Specifically, we use the Gaussian-blurred mel-spectrogram of the mimicry condition as the positive mel-mask and the blurred mixture as the negative mel-mask, with the standard deviation set to 5. In Table 1, although the overall SDR score is slightly lower compared to using only the mimicry condition, the model performs better with pseudo-masks for 7 out of 10 instrument classes. This suggests that pseudo-masks can generally enhance the model’s performance at no additional cost. The bass class is an exception, likely due to its limited high-frequency content, which sets it apart from other instruments. Consequently, the mel-spectrogram mask may be misleading in this case. A different standard deviation of the Gaussian filter could work better, although finding it would require an additional hyperparameter search.
Mel-masks-only: Another case is when only the mel-masks are used for conditioning. We observe that the results are generally better than those obtained using only the mimicry condition, indicating that mel-masks serve as highly effective conditioning signals.
Humming-only: Although the mimicry conditions used in
our training do not include humming, we evaluate our model
to assess its generalization to unseen mimicry conditions,
such as humming. Since we cannot easily synthesize
humming from MIDI, we utilize the HumTrans dataset [64],
a MIDI-humming aligned dataset, resulting in an
evaluation dataset of approximately 16.6 hours. Since HumTrans
melodies do not coincide with our test songs, an ideal
source separation setup is impossible to design. Instead,
we synthesize background sources by randomly mixing 3
to 6 sources from the MoisesDB dataset, following our
training procedure described in Section 3.1. The
target source is synthesized from the MIDI information
aligned to the humming, using the method outlined in
Section 2.1 with augmentation. As the virtual instruments
are sampled from the FluidSynth library, which is not
directly comparable to the Slakh2100 benchmark, we
report only an overall SDR result, whose mean is 13.61 dB.
This score exceeds the overall SDR result of our model on
the Slakh2100 benchmark, demonstrating that the mimicry
condition can generalize to humming during inference.
However, the strong performance could also be attributed
to the random mixing used during evaluation, which
simplifies the task of target source extraction for the model.
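The random background mixing and the overall SDR described above can be sketched as follows. This is an illustrative NumPy example under stated assumptions: `make_eval_mixture` and `sdr_db` are hypothetical helper names, the stems are assumed to be equal-length mono arrays mixed at unit gain, and the SDR is the plain energy ratio between the reference and the residual rather than the exact metric implementation behind the reported numbers. Synthetic noise and a sine tone stand in for MoisesDB stems and the synthesized target.

```python
import numpy as np


def make_eval_mixture(target, stem_pool, rng, min_sources=3, max_sources=6):
    """Add a random subset of 3 to 6 background stems to the target source."""
    n_sources = int(rng.integers(min_sources, max_sources + 1))
    chosen = rng.choice(len(stem_pool), size=n_sources, replace=False)
    background = np.sum([stem_pool[i] for i in chosen], axis=0)
    return target + background, n_sources


def sdr_db(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB (simple energy-ratio form, an assumption)."""
    distortion = reference - estimate
    return 10.0 * np.log10(
        (np.sum(reference ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    )


# Hypothetical stand-in data: ten noise stems and a 220 Hz sine as the target.
rng = np.random.default_rng(0)
stem_pool = [0.1 * rng.standard_normal(16000) for _ in range(10)]
target = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000.0)

mixture, n_sources = make_eval_mixture(target, stem_pool, rng)
# Scoring the unprocessed mixture against the target gives the input SDR,
# i.e. the starting point that a separation model should improve on.
print(n_sources, sdr_db(target, mixture))
```

Because the background stems are drawn independently of the target melody, such mixtures are less musically coherent than real songs, which is consistent with the caveat above that random mixing may make the extraction task easier.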
5. CONCLUSION
We introduced GuideSep, a diffusion-based music source
separation model that enables flexible, instrument-agnostic
separation using waveform mimicry conditions and
mel-spectrogram masks, and released the codebase. Our
results demonstrate that this approach achieves high-quality
separation while offering greater adaptability compared to
traditional class-based methods. Additionally, our
comparison with a mask-prediction baseline provides insights into
the strengths of generative models for MSS. This work
highlights the potential of diffusion models in advancing
more versatile and user-controllable source separation.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
6. REFERENCES
[1] N. Ono, Z. Rafii, D. Kitamura, N. Ito, and A. Liutkus,
“The 2015 signal separation evaluation campaign,” in
Latent Variable Analysis and Signal Separation,
E. Vincent, A. Yeredor, Z. Koldovský, and
P. Tichavský, Eds. Cham: Springer International
Publishing, 2015, pp. 387–395.
[2] S. Rouard, F. Massa, and A. Défossez, “Hybrid
transformers for music source separation,” in ICASSP
2023-2023 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2023,
pp. 1–5.
[3] J. Chen, S. Vekkot, and P. Shukla, “Music source
separation based on a lightweight deep learning framework
(DTTNet: Dual-path TFC-TDF UNet),” in ICASSP 2024-2024
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2024, pp.
656–660.
[4] R. Sawata, N. Takahashi, S. Uhlich, S. Takahashi, and
Y. Mitsufuji, “The whole is greater than the sum of its
parts: improving music source separation by bridging
networks,” EURASIP Journal on Audio, Speech, and
Music Processing, vol. 2024, no. 1, p. 39, 2024.
[5] W. Tong, J. Zhu, J. Chen, S. Kang, T. Jiang, Y. Li,
Z. Wu, and H. Meng, “SCNet: Sparse compression
network for music source separation,” in ICASSP
2024-2024 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2024,
pp. 1276–1280.
[6] N. Takahashi and Y. Mitsufuji, “D3Net: Densely
connected multidilated densenet for music source
separation,” arXiv preprint arXiv:2010.01733, 2020.
[7] Y. Luo and J. Yu, “Music source separation with
band-split RNN,” IEEE/ACM Transactions on Audio, Speech,
and Language Processing, vol. 31, pp. 1893–1901,
2023.
[8] W.-T. Lu, J.-C. Wang, Q. Kong, and Y.-N. Hung,
“Music source separation with band-split RoPE transformer,”
in ICASSP 2024-2024 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2024, pp. 481–485.
[9] G. Meseguer-Brocal and G. Peeters,
“Conditioned-U-Net: Introducing a control mechanism in the
U-Net for multiple source separations,” arXiv preprint
arXiv:1907.01277, 2019.
[10] O. Slizovskaia, L. Kim, G. Haro, and E. Gomez,
“End-to-end sound source separation conditioned on
instrument labels,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2019, pp. 306–310.
[11] P. Seetharaman, G. Wichern, S. Venkataramani, and
J. Le Roux, “Class-conditional embeddings for music
source separation,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2019, pp. 301–305.
[12] D. Samuel, A. Ganeshan, and J. Naradowsky,
“Meta-learning extractors for music source separation,” in
ICASSP 2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2020, pp. 816–820.
[13] P. Smaragdis, “User guided audio selection from
complex sound mixtures,” in Proceedings of the 22nd
annual ACM symposium on User interface software and
technology, 2009, pp. 89–92.
[14] J. H. Lee, H.-S. Choi, and K. Lee, “Audio
query-based music source separation,” arXiv preprint
arXiv:1908.06593, 2019.
[15] E. Manilow, G. Wichern, and J. Le Roux,
“Hierarchical musical instrument separation,” in ISMIR, 2020, pp.
376–383.
[16] K. N. Watcharasupat and A. Lerch, “A stem-agnostic
single-decoder system for music source separation
beyond four stems,” arXiv preprint arXiv:2406.18747,
2024.
[17] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick,
and S. Dubnov, “Zero-shot audio source separation
through query-based learning from weakly-labeled
data,” in Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 36, no. 4, 2022, pp. 4441–4449.
[18] Y. Wang, D. Stoller, R. M. Bittner, and J. P. Bello,
“Few-shot musical source separation,” in ICASSP
2022-2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE,
2022, pp. 121–125.
[19] S. Ewert and M. B. Sandler, “Structured dropout for
weak label and multi-instance learning and its
application to score-informed source separation,” in 2017
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2017, pp.
2277–2281.
[20] M. Miron, J. Janer, and E. Gómez, “Monaural
score-informed source separation for classical music using
convolutional neural networks,” in ISMIR, vol. 2017,
2017, pp. 55–62.
[21] M. Gover, “Score-informed source separation of choral
music,” 2020.
[22] A. J. Munoz-Montoro, J. J. Carabias-Orti,
P. Vera-Candeas, F. J. Canadas-Quesada, and N. Ruiz-Reyes,
“Online/offline score informed music signal
decomposition: application to minus one,” EURASIP Journal
on Audio, Speech, and Music Processing, vol. 2019,
pp. 1–30, 2019.
[23] Y.-N. Hung, G. Wichern, and J. Le Roux,
“Transcription is all you need: Learning to separate musical
mixtures with score as supervision,” in ICASSP 2021-2021
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2021, pp.
46–50.
[24] N. J. Bryan, G. J. Mysore, and G. Wang, “ISSE: An
interactive source separation editor,” in Proceedings of
the SIGCHI Conference on Human Factors in
Computing Systems, 2014, pp. 257–266.
[25] N. J. Bryan and G. J. Mysore, “Interactive
user-feedback for sound source separation,” in International
Conference on Intelligent User-Interfaces (IUI),
Workshop on Interactive Machine Learning. Santa Monica,
2013.
[26] N. Bryan and G. Mysore, “An efficient posterior
regularized latent variable model for interactive sound
source separation,” in International conference on
machine learning. PMLR, 2013, pp. 208–216.
[27] N. J. Bryan and G. J. Mysore, “Interactive refinement
of supervised and semi-supervised sound source
separation estimates,” in 2013 IEEE International
Conference on Acoustics, Speech and Signal Processing.
IEEE, 2013, pp. 883–887.
[28] P. Smaragdis and G. J. Mysore, “Separation by
“humming”: User-guided sound extraction from
monophonic mixtures,” in 2009 IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics.
IEEE, 2009, pp. 69–72.
[29] G. Zhu, Y. Wen, M.-A. Carbonneau, and Z. Duan,
“EDMSound: Spectrogram based diffusion models
for efficient and high-quality audio synthesis,” arXiv
preprint arXiv:2311.08667, 2023.
[30] J.-M. Lemercier, J. Richter, S. Welker, E. Moliner,
V. Välimäki, and T. Gerkmann, “Diffusion models for
audio restoration: A review [special issue on
model-based and data-driven audio signal processing],” IEEE
Signal Processing Magazine, vol. 41, no. 6, pp. 72–84,
2025.
[31] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing
ideal time–frequency magnitude masking for speech
separation,” IEEE/ACM Transactions on Audio, Speech,
and Language Processing, vol. 27, no. 8, pp.
1256–1266, 2019.
[32] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu,
B. Zhang, and L. Xie, “DCCRN: Deep complex
convolution recurrent network for phase-aware speech
enhancement,” arXiv preprint arXiv:2008.00264, 2020.
[33] H. J. Park, B. H. Kang, W. Shin, J. S. Kim, and
S. W. Han, “MANNER: Multi-view attention network
for noise erasure,” in ICASSP 2022-2022 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2022, pp. 7842–7846.
[34] J. Pirklbauer, M. Sach, K. Fluyt, W. Tirry, W. Wardah,
S. Moeller, and T. Fingscheidt, “Evaluation metrics for
generative speech enhancement methods: Issues and
perspectives,” in Speech Communication; 15th ITG
Conference. VDE, 2023, pp. 265–269.
[35] E. Postolache, G. Mariani, M. Mancusi, A. Santilli,
L. Cosmo, and E. Rodolà, “Latent autoregressive
source separation,” in Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 37, no. 8, 2023, pp.
9444–9452.
[36] Y. C. Subakan and P. Smaragdis, “Generative
adversarial source separation,” in 2018 IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2018, pp. 26–30.
[37] B. Chen, C. Wu, and W. Zhao, “SEPDIFF: Speech
separation based on denoising diffusion model,” in
ICASSP 2023-2023 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2023, pp. 1–5.
[38] S. Lutati, E. Nachmani, and L. Wolf, “Separate and
diffuse: Using a pretrained diffusion model for improving
source separation,” arXiv preprint arXiv:2301.10752,
2023.
[39] R. Scheibler, Y. Ji, S.-W. Chung, J. Byun, S. Choe, and
M.-S. Choi, “Diffusion-based generative speech source
separation,” in ICASSP 2023-2023 IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2023, pp. 1–5.
[40] R. Scheibler, Y. Fujita, Y. Shirahata, and
T. Komatsu, “Universal score-based speech enhancement
with high content preservation,” arXiv preprint
arXiv:2406.12194, 2024.
[41] Z. Guo, Q. Wang, J. Du, J. Pan, Q.-F. Liu, and
C.-H. Lee, “A variance-preserving interpolation approach
for diffusion models with applications to single
channel speech enhancement and recognition,” IEEE/ACM
Transactions on Audio, Speech, and Language
Processing, 2024.
[42] J. Richter, S. Welker, J.-M. Lemercier, B. Lay,
and T. Gerkmann, “Speech enhancement and
dereverberation with diffusion-based generative models,”
IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 31, pp. 2351–2364, 2023.
[43] G. Zhu, J. Darefsky, F. Jiang, A. Selitskiy, and Z. Duan,
“Music source separation with generative flow,” IEEE
Signal Processing Letters, vol. 29, pp. 2288–2292,
2022.
[44] G. Mariani, I. Tallini, E. Postolache, M. Mancusi,
L. Cosmo, and E. Rodolà, “Multi-source diffusion
models for simultaneous music generation and
separation,” arXiv preprint arXiv:2302.02257, 2023.
[45] T. Karchkhadze, M. R. Izadi, and S. Dubnov,
“Simultaneous music separation and generation using
multi-track latent diffusion models,” arXiv preprint
arXiv:2409.12346, 2024.
[46] D. Henningsson and F. Team, “FluidSynth real-time
and thread safety challenges,” in Proceedings of the
9th International Linux Audio Conference, Maynooth
University, Ireland, 2011, pp. 123–128.
[47] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion
probabilistic models,” Advances in Neural Information
Processing Systems, vol. 33, pp. 6840–6851, 2020.
[48] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan,
and S. Ganguli, “Deep unsupervised learning
using nonequilibrium thermodynamics,” in International
conference on machine learning. PMLR, 2015, pp.
2256–2265.
[49] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar,
S. Ermon, and B. Poole, “Score-based generative
modeling through stochastic differential equations,” arXiv
preprint arXiv:2011.13456, 2020.
[50] T. Karras, M. Aittala, T. Aila, and S. Laine,
“Elucidating the design space of diffusion-based generative
models,” Advances in Neural Information Processing
Systems, vol. 35, pp. 26 565–26 577, 2022.
[51] C. Breithaupt and R. Martin, “Analysis of the
decision-directed SNR estimator for speech enhancement with
respect to low-SNR and transient conditions,” IEEE
Transactions on Audio, Speech, and Language Processing,
vol. 19, no. 2, pp. 277–289, 2010.
[52] J. Serrà, S. Pascual, J. Pons, R. O. Araz, and D. Scaini,
“Universal speech enhancement with score-based
diffusion,” arXiv preprint arXiv:2206.03065, 2022.
[53] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan,
“Music ControlNet: Multiple time-varying controls for
music generation,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 32, pp.
2692–2703, 2024.
[54] H. F. García, O. Nieto, J. Salamon, B. Pardo, and
P. Seetharaman, “Sketch2Sound: Controllable audio
generation via time-varying signals and sonic
imitations,” arXiv preprint arXiv:2412.08550, 2024.
[55] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang,
E. L. Denton, K. Ghasemipour, R. Gontijo Lopes,
B. Karagol Ayan, T. Salimans et al., “Photorealistic
text-to-image diffusion models with deep language
understanding,” Advances in Neural Information
Processing Systems, vol. 35, pp. 36 479–36 494, 2022.
[56] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu,
“DPM-Solver: A fast ODE solver for diffusion
probabilistic model sampling in around 10 steps,” Advances
in Neural Information Processing Systems, vol. 35, pp.
5775–5787, 2022.
[57] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu,
“DPM-Solver++: Fast solver for guided sampling of
diffusion probabilistic models,” arXiv preprint
arXiv:2211.01095, 2022.
[58] E. Manilow, G. Wichern, P. Seetharaman, and
J. Le Roux, “Cutting music source separation some
Slakh: A dataset to study the impact of training data
quality and quantity,” in Proc. IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics
(WASPAA). IEEE, 2019.
[59] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl,
“MoisesDB: A dataset for source separation beyond
4-stems,” arXiv preprint arXiv:2307.15913, 2023.
[60] E. Vincent, R. Gribonval, and C. Févotte,
“Performance measurement in blind audio source separation,”
IEEE Transactions on Audio, Speech, and Language
Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[61] R. Scheibler, “SDR—medium rare with fast
computations,” in ICASSP 2022-2022 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2022, pp. 701–705.
[62] C.-B. Jeon, G. Wichern, F. G. Germain, and J. Le Roux,
“Why does music source separation benefit from
cacophony?” in 2024 IEEE International Conference on
Acoustics, Speech, and Signal Processing Workshops
(ICASSPW). IEEE, 2024, pp. 873–877.
[63] M. Schoeffler, S. Bartoschek, F.-R. Stöter, M. Roess,
S. Westphal, B. Edler, and J. Herre,
“webMUSHRA—a comprehensive framework for
web-based listening tests,” Journal of Open Research
Software, vol. 6, no. 1, 2018.
[64] S. Liu, X. Li, D. Li, and Y. Shan, “HumTrans: A novel
open-source dataset for humming melody
transcription and beyond,” in ICASSP 2024-2024 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2024, pp. 7915–7919.
