Singing Voice Separation From Carnatic Music Mixtures Using a Regression-Guided Latent Diffusion Model

Author: Genís Plaja-Roglans; Xavier Serra; Martín Rocamora
Publisher: Zenodo
DOI: 10.5281/zenodo.17706605
Source: https://zenodo.org/records/17706605/files/000097.pdf
LEVERAGING CARNATIC LIVE RECORDINGS FOR SINGING VOICE
SEPARATION USING REGRESSION-GUIDED LATENT DIFFUSION
Genís Plaja-Roglans, Xavier Serra, Martín Rocamora
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
{genis.plaja, xavier.serra, martin.rocamora}@upf.edu
ABSTRACT
Diffusion models have demonstrated potential to separate individual sources from music mixtures in a generative fashion, enabling a new solution to this challenging problem. However, existing works require clean multi-stem data, which is scarce for several repertoires, consequently compromising generalization. We explore the potential of generative modeling to perform weakly-supervised singing voice separation for Carnatic Music, a music repertoire for which large quantities of multi-stem recordings with bleeding between sources have been collected from live performances. We pre-train a latent diffusion model to perform preliminary vocal separation conditioning on the corresponding mixture. Then, using a regressive model which is separately trained on a clean, smaller, and out-of-domain dataset, we estimate the level of bleeding in the preliminary separations and use that information to guide the diffusion model toward generating cleaner samples. The objective and perceptual evaluations show the potential of the proposed generative system for Carnatic vocal separation. Code, weights, and further materials are available online.¹
1. INTRODUCTION
Denoising diffusion probabilistic models (DDPM) are a class of generative systems that are emerging as an alternative solution to audio inverse problems such as enhancement [1], upsampling [2], and even source separation [3–5]. Music source separation (MSS) is the task of estimating the individual elements in a musical mixture [6]. Because of their conditioning flexibility and generative potential, DDPM are considered a promising solution to MSS [7]. While competitive diffusion separation systems exist [5,8], these focus on instrumental music.
Large training data is key to DDPM [1, 9, 10]; however, gathering clean, multi-stem data is challenging [11]. While large multi-stem collections recorded in live shows exist [12–16], these come with source bleeding: the other sources, room response, and other interferences leak into the individual stems. Regularly training an MSS model on such data often results in suboptimal performance [17].
¹ https://github.com/genisplaja/ldm-carnatic-separation
© G. Plaja-Roglans, X. Serra, and M. Rocamora. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: G. Plaja-Roglans, X. Serra, and M. Rocamora, “Leveraging Carnatic live recordings for singing voice separation using regression-guided latent diffusion”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In this work, we aim to leverage the inherent domain knowledge in a large collection of live multi-stem tracks with bleeding while still targeting clean separation. Carnatic Music, which is mostly enjoyed live [18], presents an interesting case of study. Prior work targeted the same objective for Carnatic vocals [17] and violin [19]. However, [17] relies on a complex heuristic compromising generalization and efficiency, while [19] uses a large, clean but private in-domain dataset. Clean Carnatic multi-stem data exist [20], but only for a small collection of 5 concerts.
We propose a generative approach to this problem, relying on latent diffusion models (LDM) [21]: the generative diffusion process operates on a compact, pre-learnt audio representation, enhancing efficiency and learning capacity [10]. We pre-train an LDM to generate singing vocals with source bleeding conditioned on music mixtures [22]. In parallel, we train a regressor to estimate the bleeding ratio in vocal signals using open, clean, non-Carnatic multi-stem data. We then refine the pre-trained LDM using a loss penalization term based on the bleeding predictions, aiming at generating cleaner vocals. Inspired by gradient guidance for diffusion models [23, 24], we subsequently propose regression-based bleeding level guidance: we steer the gradients of the bleeding estimator to inform the diffusion sampler toward the direction of cleaner separation.
Non-generative MSS systems that transform or mask time-frequency representations normally rely on access to all stems, assuming these combine linearly to the reference mixture [25]. Leveraging generative flexibility, we consider two added challenges: (1) access to the mixture and the corresponding vocal stem with bleeding only [26], and (2) the mixture alone has undergone non-linear processing.
We prioritize efficiency using a compact latent space, at the expense of signal quality and a significant penalization on separation metrics, a known problem for the evaluation of generative models [27,28]. Nevertheless, our system achieves, without the need for clean, in-domain, multi-stem samples, competitive objective generation quality and perceptual separation preference over the baselines.
2. BACKGROUND
2.1 Latent diffusion
Let $X \in \mathbb{R}^{F \times D}$ be a latent embedding with feature size $F$ and time dimension $D = T/c$, where $T$ is audio length and $c$ the compression factor of a certain latent encoder $E: x \in \mathbb{R}^{1 \times T} \rightarrow X \in \mathbb{R}^{F \times D}$. In this work, we rely on a latent forward diffusion process defined by a Markov chain of $T$ steps that converts a latent embedding $X \sim p(X)$ into a sample of Gaussian noise $\epsilon \in \mathbb{R}^{F \times D}$. The intermediate steps of this transformation are computed as [29]:
$$X_{\sigma_t} = \alpha_{\sigma_t} X_{\sigma_0} + \beta_{\sigma_t}\,\epsilon, \tag{1}$$
where $\sigma_t \in [0,1]$ is a noise schedule of $T$ values to control the transformation, while we define $\alpha_{\sigma_t} := \cos(\phi_t)$ and $\beta_{\sigma_t} := \sin(\phi_t)$, where $\phi_t := \frac{\pi}{2}\sigma_t$. Note also that $X_{\sigma_0} = X$. A model is then trained to revert this process, approximating the data distribution $p(X)$ by learning to map Gaussian samples to observations $X \sim p(X)$.
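Let $v_{\sigma_t} \in \mathbb{R}^{1 \times D}$ be the velocity objective, which corresponds to the inner variable of the diffusion process that tracks the transformation between $X_{\sigma_0}$ and $X_{\sigma_T}$. The objective $v_{\sigma_t}$ is formally computed as:
$$v_{\sigma_t} = \alpha_{\sigma_t}\,\epsilon - \beta_{\sigma_t} X_{\sigma_0}, \tag{2}$$
and estimated by neural network $m$ with parameters $\theta$:
$$\hat{v}_{\sigma_t} = m_\theta(X_{\sigma_t}, \sigma_t, C) \tag{3}$$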
Network $m_\theta$ is the generative LDM. Input $C \in \mathbb{R}^{F \times D}$ represents the conditioning signal. Diffusion systems may be trained unconditionally to sample random observations from approximated $\hat{p}(X)$, while instructions from various modalities (e.g., text prompts [9], audio signals [5], and more) can be injected to the posterior to modify the generation trajectory. However, our work focuses on a well-defined inverse problem. As a result, we inject $C$ during both training and inference, architecturally optimizing the system to tailor the diffusion trajectory relating the conditioning signal and the generator target $\hat{X}$. Let $\mathbb{E}$ denote expectation. The diffusion loss objective is defined as [29]:
$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{t \sim [0,T],\,\sigma_t,\,X_{\sigma_t}} \left\lVert \hat{v}_{\sigma_t} - v_{\sigma_t} \right\rVert_2^2 \tag{4}$$
2.2 Sampling process
The sampling process progressively models a sample pertaining to the approximated distribution $\hat{p}(X)$ by denoising a random sample of Gaussian noise. Previous works in audio generation have relied on the Denoising Diffusion Implicit Models (DDIM) sampler, achieving a satisfactory compromise between sampling steps and generation quality [30]. In DDIM sampling [29], the inference process is performed using arbitrary $T$, and it is initiated at $\sigma_t = 1$. A given sampling step $t$ is composed of a set of operations. We first run a forward pass with model $m_\theta$ as defined in Eq. (3). Using predicted velocity $\hat{v}_{\sigma_t}$, we can compute $\hat{X}_{\sigma_0}$, which corresponds to the estimated target sample at $t = 0$, and $\hat{\epsilon}_{\sigma_t} \in \mathbb{R}^{1 \times D}$, which is Gaussian noise at step $t$:
$$\hat{X}_{\sigma_0} = \alpha_{\sigma_t} X_{\sigma_t} - \beta_{\sigma_t}\,\hat{v}_{\sigma_t} \tag{5}$$
$$\hat{\epsilon}_{\sigma_t} = \beta_{\sigma_t} X_{\sigma_t} + \alpha_{\sigma_t}\,\hat{v}_{\sigma_t} \tag{6}$$
Note that, for $t \approx T$, i.e. at an early stage of the sampling process, predicted $\hat{X}_{\sigma_0}$ is expected to be noisy and limitedly consistent with signal $C$, while at $t \approx 0$, it approximates further to the final, refined separation. For $t > 0$, the input to the next sampling step is formally defined as:
$$\hat{X}_{\sigma_{t-1}} = \alpha_{\sigma_{t-1}} \hat{X}_{\sigma_0} + \beta_{\sigma_{t-1}}\,\hat{\epsilon}_{\sigma_t} \tag{7}$$
Finally, $\hat{X}_{\sigma_0}$ is decoded to the original domain using decoder $E': X \in \mathbb{R}^{F \times D} \rightarrow x \in \mathbb{R}^{1 \times T}$. Encoder $E$ and decoder $E'$ are normally pre-trained and kept frozen.
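The sampling step in Eqs. (5) to (7) can be sketched as follows. This is a toy version on flat Python lists, not the authors' sampler: the linear schedule from $\sigma = 1$ to $\sigma = 0$, the function names, and the oracle velocity (which replaces the trained network $m_\theta$ and already knows the clean target) are all assumptions for illustration.

```python
import math

def alpha_beta(sigma):
    """Return (alpha, beta) = (cos(phi), sin(phi)) with phi = pi/2 * sigma."""
    phi = 0.5 * math.pi * sigma
    return math.cos(phi), math.sin(phi)

def ddim_step(x_sigma, v_hat, sigma_t, sigma_next):
    """One sampling step: Eqs. (5) and (6), then re-noising to sigma_next with Eq. (7)."""
    a, b = alpha_beta(sigma_t)
    x0_hat = [a * x - b * v for x, v in zip(x_sigma, v_hat)]         # Eq. (5)
    eps_hat = [b * x + a * v for x, v in zip(x_sigma, v_hat)]        # Eq. (6)
    a_n, b_n = alpha_beta(sigma_next)
    x_next = [a_n * x0 + b_n * e for x0, e in zip(x0_hat, eps_hat)]  # Eq. (7)
    return x0_hat, x_next

def sample(velocity_fn, noise, num_steps):
    """Denoise `noise` from sigma = 1 to sigma = 0 on a linear schedule (an assumption)."""
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
    x = list(noise)
    for sigma_t, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v_hat = velocity_fn(x, sigma_t)  # stands in for the network forward pass, Eq. (3)
        _, x = ddim_step(x, v_hat, sigma_t, sigma_next)
    return x

# Toy check with an oracle velocity that already knows the clean target.
target = [0.5, -1.0, 2.0]

def oracle_velocity(x, sigma):
    a, b = alpha_beta(sigma)
    eps = [(xi - a * ti) / b for xi, ti in zip(x, target)]   # noise implied by Eq. (1)
    return [a * e - b * ti for e, ti in zip(eps, target)]    # Eq. (2)

result = sample(oracle_velocity, [0.3, 0.1, -0.7], num_steps=8)
```

With a perfect velocity estimate, Eq. (5) recovers the clean latent at every step, so `result` matches `target` up to floating-point error; with a trained network the estimate is only refined gradually as $\sigma_t$ decreases.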
3. METHOD
Let $A$ and $B$ represent musical repertoires or domains which differ on instrumentation, concepts, and practices. In our work, $A$ corresponds to Carnatic Music and $B$ to Western radio music (e.g. pop, rock, hip-hop, and related).
3.1 Latent encoder
We use Music2Latent 1 [31] (M2L), which is a neural codec based on a consistency model [32]. Both M2L encoder and decoder are depicted in red in Figure 1. M2L compresses signals sampled at 48kHz down to 12Hz, and produces 64-dimensional codes with 0 mean and deviation 1. The significant compression of M2L enables the development of our work in an environment with limited computational resources. M2L is trained using the MTG-Jamendo dataset [33], which includes numerous tracks of repertoire $B$, and 90 recordings tagged as indian. It also includes ≈2k vocal tracks, and ≈2k tracks with violin. We are unaware of the number of recordings mixing these sources.
The enormous compression rate of M2L comes at a cost: the authors report −3.85dB of reconstruction SI-SDR, a standard separation metric, and perceptible artifacts often arise in the reconstructions. While official code to train or fine-tune M2L is not available, we rely on the open pre-trained model, prioritizing its compression and feature learning capabilities to study the effectiveness of latent diffusion for weakly-supervised MSS. Moreover, the M2L compression rate enables us to perform our LDM study with very limited computational resources, yet training a model with a size on par with the literature [34–36].
3.2 Latent diffusion for separation
Model $m_\theta$ is a 1D attention U-Net with skip-connections. It is depicted in green in Fig. 1. It is composed of $n$ residual blocks which include two 1D convolutional layers, each preceded by GroupNorm [37] and SiLU activation [38]. A pre-defined number of blocks include time-wise self-attention to learn the relationship between different time steps and enrich context, which is crucial for MSS.
To down and upsample the features at each level in the U-Net, we add an extra layer with kernel size $k \times k$, $k$ being the time compression or expansion factor. When $k > 1$, for downsampling we double the feature channels, while halving them for the upsampling blocks.
The time-step $\sigma_t$ is projected into 1024-channeled random Fourier feature embeddings, which are processed through a 3-layer multi-layer perceptron (MLP) with GELU activations. The resulting embedding is incorporated into the model via FiLM layers.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1. Diagram of the proposed system. The LDM is first trained to generate encoded vocals with bleeding. Next, we fine-tune using the bleeding ratio loss. Finally, during sampling, we use the bleeding predictions to compute the gradients towards less bleeding and modify the generation trajectory on that direction. The order of development steps is indicated.
Several mechanisms to inject conditioning signals in diffusion U-Nets exist [1,30,34]. We find the best quality-efficiency compromise on concatenating, over the feature channels, the conditioning signal $C$ and $X_{\sigma_t}$ [39]. Previous latent diffusion work using M2L embeddings has relied on this mechanism [34]. Even if the M2L embeddings are 2D, we employ 1D convolutional layers to effectively capture temporal dependencies in the compressed representation, processing each feature vector independently without imposing artificial spatial correlations.
The network is trained relying on the objective in Eq. 4 using corresponding pairs of vocal stems with bleeding $X_{\sigma_0}$ and mixture $C$, both encoded using $E$ [22].
3.3 Bleeding level estimator
The glass ceiling of the separation LDM presented in Section 3.2 is established at the inherent source bleeding in the training data of domain $A$. However, the network may still be trained to map from mixture to the corresponding vocals with bleeding, leveraging domain knowledge [17].
Prior work has shown that a separator model trained using only data with source bleeding can be fine-tuned towards cleaner outputs by steering a bleeding estimator network [19], which predicts the ratio of bleeding in the preliminary separations, while the non-optimal separator is optimized to minimize this ratio. Building on this insight, we hypothesize that estimating bleeding ratios is less prone to severe generalization errors compared to MSS. This allows us to leverage the knowledge embedded in a pre-trained separator for repertoire $A$, while fine-tuning using the bleeding estimator trained using repertoire $B$, bypassing access to clean multi-stem data of repertoire $A$.
3.3.1 Regression-based bleeding level guidance
In an attempt to guide the pre-trained LDM to generate cleaner vocals, we leverage a regression model to guide the diffusion process using the level of source bleeding. Similarly to [19], we aim to introduce a bleeding estimator model to guide the separation system toward reducing the bleeding. Leveraging the flexibility of the sampling process of diffusion models, we propose regression-based bleeding level guidance (RG), which is inspired by classifier guidance (CG) [23], a technique to enhance quality and control in diffusion image generation. CG leverages a pre-trained classifier to estimate the class to which $X_{\sigma_t}$ belongs. Using the gradients obtained w.r.t. $X_{\sigma_t}$, we may modify the latent intermediate diffusion steps to point better toward the target class. Prior studies have also relied on gradients to tune diffusion sampling [24,40].
We use a bleeding estimator, in blue in Fig. 1, to predict the amount of source bleeding in an individual stem, represented by a floating point value $b \in [0,1]$, where 0 represents no bleeding and 1 the mixture. Using a clean multi-stem but out-of-domain dataset of repertoire $B$, we train a neural network $\varphi$ to perform this task. Since $\varphi$ is meant to be integrated within the iterative diffusion process, the bleeding prediction input is expected to be $\hat{X}_{\sigma_t}$, which is infused with Gaussian noise following Eq. 1. Therefore, the training input of $\varphi$ is an M2L-encoded vocal stem with bleeding (with ratio $b$), corrupted using Eq. 1 from the diffusion formulation. Note that the M2L codes have shown competitive performance in several downstream tasks [31]. The model is trained using L2 loss:
$$\mathcal{L}_{\varphi}(X_{\sigma_t}, \sigma_t, b) = \left\lVert \varphi(X_{\sigma_t}, \sigma_t) - b \right\rVert_2^2 \tag{8}$$
Bleeding is expected to stay consistent along an audio sample, thus the model estimates a single $\hat{b}$ per each input. Diffusion time-step $\sigma_t$ is also injected to the regressor to provide information of the current noise level [23,24].
Regression-guidance for diffusion sampling. We incorporate RG in the diffusion sampling algorithm by steering the gradients from the bleeding predictor. We predict the bleeding of the input diffusion forward variable $X_{\sigma_t}$ at each sampling step $t$, and calculate the gradients that point toward the direction of our target: 0 bleeding [40]. The extracted gradients are used to guide predicted velocity $\hat{v}_{\sigma_t}$, following the formulation below:
$$W_{\text{guid}} = \eta \cdot 10^{2} \cdot \frac{1}{t-1} \cdot \sigma_t^{2} \tag{9}$$
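$$\hat{v}^{\,\text{guid}}_{\sigma_t} = \hat{v}_{\sigma_t} + W_{\text{guid}} \cdot \nabla_{X_{\sigma_t}} \left| 0 - \varphi(X_{\sigma_t}, \sigma_t) \right| \tag{10}$$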
The gradients are normalized using per-sample L2 normalization, ensuring stable guidance. The guidance level is manually controlled by $\eta$, and is also dynamically scaled to provide less guidance in the beginning of the sampling process where $X_{\sigma_t} \approx \epsilon$, and strengthen the guidance effect toward the intermediate-to-last sampling steps [24].
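The velocity update of Eq. (10), with the per-sample L2 normalization of the gradient, can be sketched as below. Everything here is illustrative: `toy_regressor` is a hypothetical stand-in for the trained bleeding estimator, the gradient is obtained with finite differences instead of automatic differentiation, and the guidance weight is passed in as a plain number rather than computed from Eq. (9). The sketch follows the sign convention of Eq. (10) as printed.

```python
import math

def toy_regressor(x, sigma):
    """Hypothetical stand-in for the bleeding estimator; returns a value in (0, 1)."""
    mean = sum(x) / len(x)
    return 1.0 / (1.0 + math.exp(-(mean - sigma)))

def numerical_gradient(fn, x, h=1e-5):
    """Central finite differences of scalar fn with respect to each entry of x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((fn(up) - fn(down)) / (2.0 * h))
    return grad

def l2_normalize(vec, tiny=1e-12):
    """Scale a vector to unit L2 norm (per-sample normalization of the gradient)."""
    norm = math.sqrt(sum(g * g for g in vec))
    return [g / (norm + tiny) for g in vec]

def guided_velocity(v_hat, x_sigma, sigma, w_guid, regressor=toy_regressor):
    """Eq. (10): v_hat + W_guid * grad_x |0 - phi(x, sigma)|, gradient L2-normalized."""
    def distance_to_clean(x):
        return abs(0.0 - regressor(x, sigma))
    grad = l2_normalize(numerical_gradient(distance_to_clean, x_sigma))
    return [v + w_guid * g for v, g in zip(v_hat, grad)]

v_hat = [0.0, 0.0, 0.0, 0.0]          # toy predicted velocity
x_sigma = [0.2, -0.1, 0.4, 0.3]       # toy noisy latent
v_guided = guided_velocity(v_hat, x_sigma, sigma=0.5, w_guid=2.0)
```

Normalizing the gradient means the size of the update is set by the guidance weight alone, which is one way to read the "stable guidance" remark above. In practice the gradient would come from backpropagating through the regressor.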
LDM bleeding-aware fine-tuning. We incorporate a penalization loss term to penalize the pre-trained LDM using the level of bleeding in the predictions. For time-steps with low noise exposure ($\sigma_t < 0.6$), the frozen bleeding estimator predicts the bleeding ratio before and after a denoising step, denoted as $\hat{b}_{\text{pre}} = \varphi(X_{\sigma_t})$ and $\hat{b}_{\text{post}} = \varphi(\hat{X}_{\sigma_0})$, respectively. A max-margin hinge term $\max\left(0, \hat{b}_{\text{post}} - \hat{b}_{\text{pre}} + m\right)$ with margin $m = 0.05$ ensures that the model must reduce bleed by at least $m$, otherwise it is penalized. The penalization term is further weighted by $(1 - \sigma_t)^2$ to amplify its impact at later and perceptually clearer steps. Overall, the fine-tuning loss becomes:
$$\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda \cdot (1 - \sigma_t)^2 \cdot \max\left(0, \hat{b}_{\text{post}} - \hat{b}_{\text{pre}} + m\right) \tag{11}$$
We use parameter $\lambda$ to control the balance between penalization term and diffusion loss, encouraging consistent reduction in bleed while maintaining generation fidelity.
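A scalar version of the penalization in Eq. (11) is sketched below to make the gating and the margin concrete. The function names are hypothetical, and the bleeding predictions and the diffusion loss are passed in as plain numbers; in training they would come from the frozen estimator and from Eq. (4).

```python
def bleeding_penalty(b_pre, b_post, sigma, margin=0.05, sigma_max=0.6):
    """Hinge penalty weighted by (1 - sigma)^2, active only for sigma < sigma_max."""
    if sigma >= sigma_max:
        return 0.0  # high-noise steps are not penalized
    hinge = max(0.0, b_post - b_pre + margin)
    return (1.0 - sigma) ** 2 * hinge

def finetune_loss(diffusion_loss, b_pre, b_post, sigma, lam=50.0):
    """Eq. (11): diffusion loss plus lambda times the bleeding penalty."""
    return diffusion_loss + lam * bleeding_penalty(b_pre, b_post, sigma)

# Bleeding drops by only 0.02 (< margin), so the step is penalized.
small_gain = finetune_loss(1.0, b_pre=0.50, b_post=0.48, sigma=0.5)
# Bleeding drops by 0.10 (>= margin), so only the diffusion loss remains.
large_gain = finetune_loss(1.0, b_pre=0.50, b_post=0.40, sigma=0.5)
```

In the first case the hinge is 0.03, the weight is 0.25, and with $\lambda = 50$ the total is about 1.375; in the second case the hinge is zero and the loss stays at 1.0.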
3.3.2 Network details
Bleeding estimator $\varphi$ is based on a stack of dilated convolutions which are regularized using GroupNorm, SiLU activations, and dropout. The output embedding is batch-normalized and then $\sigma_t$, which is processed using the same step embedder as the LDM, is injected via summation. The resultant vector is then passed through a bidirectional LSTM, capturing temporal dependencies. To model global temporal relationships, we use multi-head self-attention over the sequence. Finally, we apply global average pooling across the time dimension and use a sigmoid-activated linear layer to produce a single scalar corresponding to the predicted bleed score ($\in [0,1]$).
4. EXPERIMENTS
4.1 Experimental setup
4.1.1 LDM separation pre-training
The LDM U-Net is 7 levels deep. The feature channels per layer are set as {128, 256, 256, 512, 512, 1024, 2048}, with 64 × 2 input channels, being the channel-wise concatenation of the M2L codes and the model input $X_{\sigma_t}$. However, the last convolutional layer outputs a 64-channeled signal, corresponding to the generated embedding to decode. The training input context is 1052000 samples (≈24s), which is compressed to 2048 samples using M2L. Time compression factors for the LDM are set to {1, 2, 1, 2, 1, 2, 1}, factor 1 representing no compression, thus we reach 128 samples in the bottleneck, trying not to over-compress the information.
The first U-Net level is composed of 1 block, while the deepest includes 4. The rest are composed of 2 blocks. The four deepest levels of the U-Net include time-wise self-attention with 8 heads, aiming at enriching context.
The LDM network has ≈365M trainable parameters, and it is trained using the 168 multi-stem recordings from Saraga Carnatic, of which 15 recordings are kept for validation, each of them from a different concert. We use the ADAM optimizer with learning rate $1 \times 10^{-5}$, and use a linear warm-up stage using a cosine schedule with an initial rate of $1.6 \times 10^{-6}$. We reach 500k training steps in two weeks on an 8GB GPU.
4.1.2 Bleeding estimator guidance
Artificial bleeding dataset. The bleeding estimator $\varphi$ is trained using musdb18hq [41], corresponding to domain $B$. We artificially create the bleeding following the pipeline described in the SDX 2023 bleeding challenge [25]. Let $S_i$ be a given source stem. The accompaniment $A$ is the weighted sum of the non-vocal sources: $A = \sum_{i=1}^{N} w_i S_i$, where $w_i \sim U(0,1)$ are independently sampled random mixing weights. We randomly filter the sources using the following categorical distribution [19,25]:
$$S_i = \begin{cases} \mathrm{BPF}(S_i, f_c^{\text{low}}, f_c^{\text{high}}) & \text{with } p = 0.4 \\ \mathrm{HPF}(S_i, f_c) & \text{with } p = 0.4 \\ S_i & \text{with } p = 0.2 \end{cases}$$
where we use a band pass filter (BPF) with low cutoff frequency $f_c^{\text{low}} \sim U(200, 600)$ Hz and high cutoff frequency $f_c^{\text{high}} \sim U(8\text{k}, 10\text{k})$ Hz, and a high-pass filter (HPF) with $f_c \sim U(900, 9\text{k})$ Hz. The order of the filters is also randomly sampled from $U(3, 8)$. Next, a bleeding ratio $b \sim U(0,1)$ is sampled and used to compute the reference mixture $M = S_v + b \cdot A$. We normalize $M$ to prevent clipping and confine all values between $[-1, 1]$.
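The mixing part of this pipeline can be sketched as follows. The sketch leaves out the random band-pass and high-pass filtering of the accompaniment sources, and it assumes that "normalize" means dividing by the peak value when the peak exceeds 1; neither detail is confirmed by the text. Signals are plain Python lists and the function name is hypothetical.

```python
import random

def make_bleeding_mixture(vocals, others, rng):
    """Return (mixture, b) with mixture = vocals + b * accompaniment, peak-limited to [-1, 1]."""
    weights = [rng.uniform(0.0, 1.0) for _ in others]             # w_i ~ U(0, 1)
    accomp = [sum(w * src[n] for w, src in zip(weights, others))
              for n in range(len(vocals))]                        # A = sum_i w_i * S_i
    b = rng.uniform(0.0, 1.0)                                     # bleeding ratio b ~ U(0, 1)
    mix = [v + b * a for v, a in zip(vocals, accomp)]             # M = S_v + b * A
    peak = max(abs(s) for s in mix)
    if peak > 1.0:                                                # assumed peak normalization
        mix = [s / peak for s in mix]
    return mix, b

rng = random.Random(0)
vocals = [0.9, -0.8, 0.1, 0.0]                  # toy vocal stem
others = [[0.5, 0.5, -0.5, 0.2],                # toy non-vocal stems
          [0.3, -0.9, 0.4, 0.1]]
mixture, ratio = make_bleeding_mixture(vocals, others, rng)
```

The sampled `ratio` is the regression target for the estimator, and `mixture` would then be encoded with M2L and noised with Eq. 1 before being fed to the network.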
Model details. The input artificial mix $M$ is encoded using M2L and merged with Gaussian noise using Eq. 1, therefore, the input channel size is 64. We use five dilated convolutions with ratios {1, 2, 4, 8, 8}. The number of filters of the convolutional stack and also the size of the hidden state of the LSTM are set to 512. The multi-head attention mechanism is configured with 8 heads.
Training scheme. The training context of the bleeding predictor is the same as that of the LDM. The bleeding estimator model totals ≈1M parameters. We use the ADAM optimizer with a learning rate of $4 \times 10^{-4}$, and train until convergence. Subsequently, the pre-trained LDM is fine-tuned for 10k steps using $\lambda = 50$ and a learning rate of $1 \times 10^{-6}$.
4.1.3 Diffusion sampling parameters
We sample using overlapping segments of ≈24s with ≈5s hop, which are subsequently combined using overlap-add. If sampling with $T = 64$ on a single TITAN X 8GB GPU, our system separates audio at an average speed of ≈0.4x the duration of the track.
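A minimal overlap-add sketch is given below for the segment recombination step. The triangular cross-fade window and the division by the accumulated weights are assumptions, since the text does not state which window is used; the function name is hypothetical, and the hop is assumed not to exceed the segment length so that every output sample is covered.

```python
def overlap_add(segments, hop):
    """Blend equal-length segments placed `hop` samples apart into one signal."""
    seg_len = len(segments[0])
    total = hop * (len(segments) - 1) + seg_len
    # Triangular cross-fade weights, strictly positive at the segment edges.
    window = [1.0 - abs(2.0 * (n + 0.5) / seg_len - 1.0) for n in range(seg_len)]
    out = [0.0] * total
    weight = [0.0] * total
    for k, seg in enumerate(segments):
        start = k * hop
        for n, sample in enumerate(seg):
            out[start + n] += window[n] * sample
            weight[start + n] += window[n]
    # Dividing by the accumulated weight keeps the overall gain at one.
    return [o / w for o, w in zip(out, weight)]

signal = [float(n) for n in range(20)]
segments = [signal[s:s + 8] for s in range(0, 13, 4)]  # segment starts: 0, 4, 8, 12
rebuilt = overlap_add(segments, hop=4)
```

When the segments agree in their overlapping regions, as in this toy case, the original signal is recovered; with independently sampled segments the weights cross-fade between neighbouring generations.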
4.2 Datasets
Saraga Carnatic (Domain A, training data). This is a collection containing 168 real multi-stem recordings (totaling ≈60h of music), including vocals, violin, and percussion instruments, missing only the tanpura, which provides the tonic from which an entire performance is built. Carnatic Music is mostly enjoyed live; therefore, to ensure ecological validity, Saraga is recorded in live performances, collecting the stems from the mixer. However, this has a drawback: the microphone of a given source captures, in the background, the other sources.
musdb18hq (Domain B, for RG). It is one of the most established open datasets for MSS. It includes 100 training and 50 testing multi-stem tracks split in vocals, bass, drums, and others. It represents a limited set of styles mostly confined to Western commercial music.
Sanidha (A, evaluation data). It is the only available open collection of clean multi-stem recordings of Carnatic Music [20]. After some exploration of this dataset, which is composed of 5 concerts, we discard 1 that has bleeding in the vocal stem, leaked through the singer's headphones. While Sanidha has not yet been shown to be a potential dataset for training over Saraga, we employ it for testing, enabling more reliable objective evaluation for this repertoire.
4.3 Evaluation metrics
The objective evaluation of generative systems for audio inverse problems is challenging [1]. In MSS, traditional definitions of source-to-distortion ratio (SDR), the standard separation metric, have been reported to often misrepresent the perceptual quality of separations [42, 43]. Moreover, SDR is significantly penalized by potential subtle differences and phase mismatches commonly present when evaluating fully-generative models. For these reasons, SDR is being less used in prior generative separation work. In the case of LDM, given the added phase reconstruction mismatch introduced by the latent encoder, not even scale-invariant SDR (SI-SDR), present in various generative separation systems [5,26], is being used [8,44].
Therefore, we rely on alternative audio quality measures that have been employed in prior work on latent diffusion for generation [34, 35] and source separation [5, 8, 44]: log-spectral distance (LSD) [35] and log mel-spectrogram L2 error [43]. These metrics are phase-independent and may be more appropriate for generative systems. We also report perceptual evaluation of speech quality (PESQ) [45], aiming at measuring intelligibility.
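To illustrate why such a measure is phase-independent, the sketch below computes one common form of the log-spectral distance from magnitude spectrograms only. The exact LSD variant used in the evaluation is not specified in the text, so the definition here (frame-wise root-mean-square difference of log10 power spectra, averaged over frames), the small floor inside the logarithm, and the function name are assumptions.

```python
import math

def log_spectral_distance(ref_mag, est_mag, floor=1e-10):
    """Mean over frames of the RMS difference between log10 power spectra."""
    per_frame = []
    for ref_frame, est_frame in zip(ref_mag, est_mag):
        sq_diff = [
            (math.log10(r * r + floor) - math.log10(e * e + floor)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        per_frame.append(math.sqrt(sum(sq_diff) / len(sq_diff)))
    return sum(per_frame) / len(per_frame)

reference = [[1.0, 2.0, 0.5], [0.3, 1.5, 2.5]]           # toy magnitudes (frames x bins)
estimate = [[10.0 * m for m in frame] for frame in reference]
lsd_same = log_spectral_distance(reference, reference)     # identical inputs
lsd_gain = log_spectral_distance(reference, estimate)      # uniform 20 dB gain
```

Identical spectrograms give a distance of zero, and a uniform tenfold magnitude gain gives a distance of about 2, since every log10 power value shifts by 2. Only magnitudes enter the computation, so a phase mismatch introduced by the latent decoder does not affect the score.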
To assess the quality of the generated signals without relying on matching audio pairs, we report the Fréchet Audio Distance (FAD) [46]. Model outputs with higher quality and lesser interferences should report lower FAD. The FAD is often computed on short chunks. However, in addition to diversity, context is important [46]. Carnatic renditions often feature prolonged improvisational segments, such as alapana and tanam, which can span several minutes. For these reasons, we split the samples into 1-minute chunks. We discard the chunks with >25% of silence, which results in ≈150 testing samples.
Dataset      L2 Loss ↓   Avg. $\hat{b}$ mix   Avg. $\hat{b}$ voc.
musdb18hq    0.054       0.89 ± 0.18          0.05 ± 0.09
Sanidha      0.098       0.80 ± 0.23          0.08 ± 0.18
Saraga       -           0.97 ± 0.07          0.25 ± 0.29
Table 1. Assessing the bleeding regressor. Ideally, the avg. $\hat{b}$ should be ≈1 for mix, and ≈0 for vocals, except for Saraga, whose vocal stems have inherent, real bleeding.
To complement the objective measures we run a perceptual test with human listeners. We conduct a preference-based experiment [1]. We split the separations into chunks of ≈15 seconds, discarding unvoiced regions, and randomly select an instance of each rendition. Using the mixture as reference, we select 6 examples from the pool, including diversity of music sections and singer gender. The participants are shown several unlabeled and randomly ordered pairs of samples of our model against other systems. We introduce comparisons between non-generative models to prevent the participants from getting familiar with model-specific artifacts.
4.4 Compared systems
We compare against the multi-source diffusion model (MSDM) [5] for separation. Since no weights for vocals are available, and the system is designed for clean multi-stem data, we train it using musdb18hq, following the instructions in the repository. We do not compare against existing LDM separators since these are not optimized for vocals [8] or code or weights are not yet available [22,44].
While not directly comparable, since these are non-generative and mask-based, we evaluate cold-diff [17] and the mixer model from [19], the baseline systems addressing the same task for the Carnatic study case. Both models are directly used through the compIAM package [47]. Also, to provide a performance bound for our model, we evaluate the M2L-reconstructed vocal stems in Sanidha.
We perform an ablation study on the bleeding fine-tuning (FT), regression guidance level (RGη) for levels $\eta \in [0, 5, 10, 20]$, and sampling steps $T \in [32, 64]$. The non fine-tuned model is trained for ≈10k more steps for a fair comparison with the FT model. For the perceptual test we use $T = 32$ and mid-guidance $\eta = 10$.
5. RESULTS
5.1 Evaluating the bleeding predictor
Evaluating the bleeding predictor is complex, since no music datasets with real, annotated bleeding exist. However, we perform two sanity checks. First, we compute the model L2 loss on artificial bleeding mixtures using musdb18hq and Sanidha. Second, we compute the average bleeding ratio on Saraga mixtures and vocal stems. To simulate the actual application of the bleeding predictor, inputs are noised using Eq. 1, uniformly sampling $\sigma_t$ values per-example. We predict the bleeding for 500 randomly sampled and voiced 12s-excerpts per each dataset.
See the results in Table 1. The regressor generalizes quite satisfactorily to the Carnatic domain. The system also discriminates real Carnatic mixtures ($\hat{b} \approx 1$) from vocal stems with bleeding ($\hat{b} = 0.25 \pm 0.29$). The high standard deviation in the predicted bleeding ratio for Saraga may be explained by the high variance in accompaniment presence in different sections of a Carnatic rendition.
Model            T     FAD ↓    LogMel L2 ↓   LSD ↓   PESQ ↑
M2L              –     0.281    2.95          1.22    2.73
cold-diff [17]   8     0.515    15.79         2.29    1.39
mixer [19]       –     0.648    13.26         1.77    1.22
MSDM [5]         150   0.791    12.52         2.00    1.78
Proposed
  no FT          32    0.637    16.74         1.80    1.15
  FT             32    0.593    16.02         1.75    1.17
  FT-RG5         32    0.587    13.36         1.68    1.22
  FT-RG10        32    0.579    12.89         1.66    1.18
  FT-RG20        32    0.626    13.44         1.67    1.19
  no FT          64    0.642    16.78         1.79    1.16
  FT             64    0.602    16.10         1.74    1.16
  FT-RG5         64    0.600    13.41         1.68    1.21
  FT-RG10        64    0.595    12.61         1.65    1.19
  FT-RG20        64    0.623    12.31         1.64    1.16
Table 2. Objective evaluation of diverse systems on audio and vocal quality measures. Arrow ↓ indicates lower is better, ↑ otherwise. In bold, we indicate the best score among all systems. We underline the best scores of the ablation. See further ablation results in the companion repo.¹
5.2 Objective evaluation
See the objective evaluation in Table 2. We observe a tangible improvement of our model when using the bleeding guidance during sampling, finding the sweet spot on RG10 for $T = 32$, and RG20 for $T = 64$, despite the metrics not always correlating. The bleeding fine-tuning loss term provides a more moderate improvement.
Our generative system outperforms the baselines on the spectral assessment metrics, while ranking second on FAD, only outperformed by cold-diff, a non-generative system. In terms of PESQ, our system scores the lowest, only leveling the mixer model when using $T = 32$ and $\eta = 10$. The performance across metrics of our system suggests that stronger guidance further cleans and brings the generation closer to the target signal overall. However, this results in a trade-off: stronger interference removal comes at the cost of degraded vocal quality. While our system shows competitive overall quality, especially when guided, it reports lower PESQ than MSDM, the generative baseline, suggesting that MSDM generations have further intelligibility but also stronger interference. Note, however, that MSDM does not generate encoded latents but directly waveforms, potentially accumulating less phase discrepancy. Nevertheless, the general low PESQ scores of all models may be explained by the fact that this metric is for speech and does not assume potential interferences, while it is unclear how it characterizes the extremely common and strong vocal ornaments in Carnatic Music.
                 Quality (%)        Interference (%)
Model            Ours     Other     Ours      Other
cold-diff [17]   5.0      95.0      97.50     2.50
mixer [19]       15.0     85.0      100.0     0.0
MSDM [5]         20.0     80.0      100.0     0.0
Table 3. Perceptual evaluation results showing the percentage of participants who preferred our system or baseline models in terms of quality and interference removal.
The results suggest that an LDM can be trained towards generating separated complex sources such as vocals, while the proposed guidance method contributes to a cleaner generation that gets closer to the target signal. The objective metrics support the expected behavior of the proposed approach, although we hypothesize potential stronger improvement if future efforts are done on fine-tuning the latent encoder and refining the bleeding estimator, as well as scaling the network up (a closely related LDM relies on >500M parameters [34]). These may contribute to refining the source quality of the generated vocals.
5.3 Perceptual assessment
A total of 20 participants took the test. Interestingly, the participant agreement is remarkable. The results are reported in Table 3. The perceptual assessment is significantly clear: our system leads in interference removal but is not able to reach the source quality of the baselines. These results agree with the objective metrics, which suggest that the overall quality and cleanliness of our generations are competitive; however, the fidelity and intelligibility of the generated vocals leave room for improvement.
6. CONCLUSIONS
We present a deep generative model to address weakly-supervised singing voice separation for Carnatic Music, leveraging pairs of in-domain mixture and vocal stems which have source bleeding because these are recorded in real live performances. We propose to train a latent diffusion model to generate vocals with bleeding conditioned on the corresponding mixture. We then guide the pre-trained generative system toward producing cleaner samples using a bleeding ratio predictor. Our system achieves competitive scores for generation quality measures, and outperforms the baselines in terms of interference removal in a preference listening test. While the proposed framework shows promise, we envision extensive future work, especially to refine the vocal quality, which is sub-optimally ranked in the results. Tailoring the latent encoder to Carnatic, improving the bleeding estimator, and performing multi-source separation are potential future research lines.
We believe that the flexibility, conditioning, and guidance capabilities of DDPM may enable approaches to tackle separation in non-optimal contexts, or even improve performance on ideal conditions. This has potential for separation of repertoires that are not commonly recorded in studios, considering underrepresented instruments, and prioritizing particular aspects such as interference removal.
7. ETHICS STATEMENT
While this work deals with generative modeling, the system is trained using fully-open data, while we address a purely inverse problem. No copyright implications should be involved. The model is architecturally developed not to generate unseen music recordings. The results of the perceptual test were included in this work with the participants’ permission, whose identities remain anonymous. No personal or sensitive data from the participants is collected and/or distributed.
8. ACKNOWLEDGEMENTS
This work is supported by IA y Música: Cátedra en Inteligencia Artificial y Música (TSI-100929-2023-1), funded by the Secretaría de Estado de Digitalización e Inteligencia Artificial, and the European Union-Next Generation EU, under the program Cátedras ENIA 2022 para la creación de cátedras universidad-empresa en IA, and IMPA: Multimodal AI for Audio Processing (PID2023-152250OB-I00), funded by the Ministry of Science, Innovation and Universities of the Spanish Government, the Agencia Estatal de Investigación (AEI) and co-financed by the European Union.
9. REFERENCES
[1] J. Serrà, S. Pascual, J. Pons, R. O. Araz, and D. Scaini, “Universal speech enhancement with score-based diffusion,” arXiv:2206.03065, 2022.
[2] J. Lee and S. Han, “NU-wave: A diffusion probabilistic model for neural audio upsampling,” in Annual Conf. of the Int. Speech Communication Assoc. (INTERSPEECH), Brno, Czech Republic, 2021, pp. 2698–2702.
[3] R. Scheibler, Y. Ji, S.-W. Chung, J. Byun, S. Choe, and M.-S. Choi, “Diffusion-based generative speech source separation,” in Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022.
[4] C.-Y. Yu, E. Postolache, E. Rodolà, and G. Fazekas, “Zero-shot duet singing voices separation with diffusion models,” in Sound Demixing Workshop (SDX), 2023.
[5] G. Mariani, I. Tallini, E. Postolache, M. Mancusi, L. Cosmo, and E. Rodolà, “Multi-source diffusion models for simultaneous music generation and separation,” in 12th Int. Conf. on Learning Representations, Vienna, Austria, 2024.
[6] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-Net convolutional networks,” in 18th Int. Society for Music Information Retrieval Conf. (ISMIR), Suzhou, China, 2017, pp. 745–751.
[7] S. Araki, N. Ito, R. Haeb-Umbach, G. Wichern, Z.-Q. Wang, and Y. Mitsufuji, “30+ years of source separation research: Achievements and future challenges,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025.
[8] T. Karchkhadze, M. Izadi, and S. Dubnov, “Simultaneous music separation and generation using multi-track latent diffusion models,” in Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025.
[9] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in 33rd Advances in Neural Information Processing Systems (NeurIPS), Online, 2020, pp. 6840–6851.
[10] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast timing-conditioned latent audio diffusion,” in Int. Conf. on Machine Learning (ICML), Vienna, Austria, 2024.
[11] E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,” in IEEE Workshop on Appli-
ca ions o Signal P ocessing o Audio and Acous ics
(WASPAA), 2019.
[12] T. P ä zlich, M. Mülle , B. Bohl, and J. Vei , “F eis-
chü z Digi al: Demos o audio- ela ed con ibu ions,”
in Demos and La e B eaking News o he In . Soci-
e y o Music In o ma ion Re ie al Con . (ISMIR),
Málaga, Spain, 2015.
[13] O. Mayo , Q. Llimona, M. Ma chini, P. Papio is, and
E. Gómez, “ epoVizz: a amewo k o emo e s o age,
b owsing, anno a ion, and exchange o mul imodal
da a,” in ACM In . Con . on Mul imedia, Ba celona,
Spain, 2013.
[14] E. Gómez, M. G ach en, A. Hanjalic, J. Jane , S. Jo dà,
C. F. Julià, C. Liem, A. Ma o ell, M. Schedl, and
G. Widme , “PHENICX: Pe o mances as Highly En-
iched and In e ac i e Conce Expe iences,” in P oc.
o he 10 h Sound and Music Compu ing Con . (SMC),
S ockholm, Sweden, 2013.
[15] A. S ini asamu hy, S. Gula i, R. C. Repe o, and
X. Se a, “Sa aga: Open da ase s o esea ch on in-
dian a music,” Empi ical Musicology Re iew, ol. 16,
no. 1, pp. 85–98, 2021.
[16] A. Shanka , G. Plaja-Roglans, T. Nu all, M. Ro-
camo a, and X. Se a, “Sa aga Audio isual: a la ge
mul imodal open da a collec ion o he analysis o
Ca na ic Music,” in P oc. o he 25 h In . Socie y
o Music In o ma ion Re ie al Con ., San F ancisco,
Uni ed S a es, 2024.
[17] G. Plaja-Roglans, M. Mi on, A. Shanka , and X. Se a,
“Ca na ic singing oice sepa a ion using cold di usion
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
836
on aining da a wi h bleeding,” in 24 h In . Socie y o
Music In o ma ion Re ie al Con . (ISMIR), Milano,
I aly, 2023.
[18] G. Plaja-Roglans, T. Nu all, L. Pea son, X. Se a, and
M. Mi on, “Repe oi e-speci ic ocal pi ch da a gene -
a ion o imp o ed melodic analysis o ca na ic music,”
T ansac ions o he In e na ional Socie y o Music In-
o ma ion Re ie al, 2023.
[19] A. Shanka , S. Schweini z, G. Plaja-Roglans, X. Se a,
and M. Rocamo a, “Disen angling o e lapping
sou ces: Imp o ing ocal and iolin sou ce sepa a ion
in ca na ic music,” in Wo kshop on Indian Music
Analysis and Gene a i e Applica ions (WIMAGA) in
ICASSP, 2025.
[20] V. V. K ishnan, N. Alben, A. A. Nai , and N. Condi -
Schul z, “Sanidha: A s udio quali y mul i-modal
da ase o ca na ic music,” in P oc. o he 25 h In . So-
cie y o Music In o ma ion Re ie al Con ., San F an-
cisco, Uni ed S a es, 2024.
[21] R. Rombach, A. Bla mann, D. Lo enz, P. Esse , and
B. Omme , “High- esolu ion image syn hesis wi h la-
en di usion models,” in Con . on Compu e Vi-
sion and Pa e n Recogni ion (CVPR), New O leans,
Louisiana, USA, 2021.
[22] G. Plaja-Roglans, Y.-N. Hung, X. Se a, and I. Pe ei a,
“E icien and as gene a i e-based singing oice sep-
a a ion using a la en di usion model,” in P oc. o he
In . Join Con . on Neu al Ne wo ks (IJCNN), Rome,
I aly, 2025.
[23] P. Dha iwal and A. Nichol, “Di usion Models Bea
GANs on Image Syn hesis,” in 35 h Con . on Neu al
In o ma ion P ocessing Sys ems (Neu IPS 2021), On-
line, 2021.
[24] A. Bansal, H.-M. Chu, A. Schwa zschild, R. Sengup a,
M. Goldblum, J. Geiping, and T. Golds ein, “Uni e sal
guidance o di usion models,” in P oc. o he 2 h In-
e na ional Con e ence on Lea ning Rep esen a ions,
2024.
[25] G. Fabb o, S. Uhlich, C.-H. Lai, W. Choi, M. Ma ínez-
Ramí ez, W. Liao, I. Gadelha, G. Ramos, E. Hsu,
H. Rod igues e al., “The Sound Demixing Challenge
2023: Music Demixing T ack,” T ansac ions o he In .
Socie y o Music In o ma ion Re ie al, 2023.
[26] G. Zhu, J. Da e sky, F. Jiang, A. Seli skiy, and
Z. Duan, “Music sou ce sepa a ion wi h gene a i e
low,” IEEE Signal P ocessing Le e s, ol. 29,
p. 2288–2292, 2022. [Online]. A ailable: h p:
//dx.doi.o g/10.1109/LSP.2022.3219355
[27] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez,
“Con en based singing oice ex ac ion om a musical
mix u e,” in In . Con . on Acous ics, Speech and Signal
P ocessing (ICASSP), Online, 2020, pp. 781–785.
[28] W. A. Jassim, J. Skoglund, M. Chinen, and A. Hines,
“WARP-Q: quali y p edic ion o gene a i e neu al
speech codecs,” in IEEE In . Con . on Acous ics,
Speech and Signal P ocessing (ICASSP), Online, 2021,
pp. 401–405.
[29] J. Song, C. Meng, and S. E mon, “Denoising di usion
implici models,” in In . Con . on Lea ning Rep esen-
a ions (ICLR), Online, 2021.
[30] F. Schneide , O. Kamal, Z. Jin, and B. Schölkop ,
“Moûsai: Tex - o-music gene a ion wi h long-con ex
la en di usion,” 2023.
[31] M. Pasini, S. La ne , and G. Fazekas, “Music2la en :
Consis ency au oencode s o la en audio comp es-
sion,” in 25 h In . Socie y o Music In o ma ion Re-
ie al Con . (ISMIR), San F ancisco, USA, 2024.
[32] Y. Song, P. Dha iwal, M. Chen, and I. Su ske e , “Con-
sis ency models,” a Xi p ep in a Xi :2303.01469,
2023.
[33] D. Bogdano , M. Won, P. To s ogan, A. Po e , and
X. Se a, “The m g-jamendo da ase o au oma ic
music agging,” in Machine Lea ning o Music
Disco e y Wo kshop, In e na ional Con e ence on
Machine Lea ning (ICML 2019), Long Beach, CA,
Uni ed S a es, 2019. [Online]. A ailable: h p:
//hdl.handle.ne /10230/42015
[34] J. Nis al, M. Pasini, C. Aouameu , M. G ach en, and
S. La ne , “Di -A-Ri : Musical Accompanimen Co-
c ea ion ia La en Di usion Models,” in 25 h In . So-
cie y o Music In o ma ion Re ie al Con . (ISMIR),
San F ancisco, USA, 2024.
[35] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic,
W. Wang, and M. D. Plumbley, “AudioLDM: Tex -
o-audio gene a ion wi h la en di usion models,” in
In . Con . on Machine Lea ning (ICML), Honolulu,
Hawaii, 2023.
[36] J.-S. Hwang, S.-H. Lee, and S.-W. Lee, “Hiddensinge :
High-quali y singing oice syn hesis ia neu al au-
dio codec and la en di usion models,” a Xi p ep in
a Xi :2306.06814, 2023.
[37] Y. Wu and K. He, “G oup no maliza ion,”
a Xi :1803.08494, 2018.
[38] S. El wing, E. Uchibe, and K. Doya, “Sigmoid-
weigh ed linea uni s o neu al ne wo k unc ion ap-
p oxima ion in ein o cemen lea ning,” Neu al ne -
wo ks, ol. 107, pp. 3–11, 2018.
[39] Y. Song and S. E mon, “Imp o ed echniques o
aining sco e-based gene a i e models,” in 34 h Con .
in Neu al In o ma ion P ocessing Sys ems (Neu IPS),
2020, pp. 12 438–12 448.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
837
[40] Y. Guo, H. Yuan, Y. Yang, M. Chen, and M. Wang,
“G adien guidance o di usion models: An op i-
miza ion pe spec i e,” in P oc. o he 38 h Con e ence
on Neu al In o ma ion P ocessing Sys ems (Neu IPS),
2024.
[41] Z. Ra ii, A. Liu kus, F.-R. S ö e , S. I. Mimilakis,
and R. Bi ne , “MUSDB18 - a co pus o music
sepa a ion,” Dec. 2017. [Online]. A ailable: h ps:
//doi.o g/10.5281/zenodo.1117372
[42] E. Cano, D. Fi zge ald, and K. B andenbu g, “E alua-
ion o quali y o sound sou ce sepa a ion algo i hms:
Human pe cep ion s quan i a i e me ics,” in 24 h
Eu opean Signal P ocessing Con . (EUSIPCO), Bu-
dapes , Hunga y, 2016, pp. 1758–1762.
[43] E. Gusó, J. Pons, S. Pascual, and J. Se à, “On loss
unc ions and e alua ion me ics o music sou ce sep-
a a ion,” IEEE In . Con . on Acous ics, Speech and Sig-
nal P ocessing (ICASSP), pp. 306–310, 2022.
[44] Y. Chae and K. Lee, “Mge-ldm: Join la en di usion
o simul aneous music gene a ion and sou ce ex ac-
ion,” a Xi :2505.23305, 2025.
[45] A. Rix, J. Bee ends, M. Hollie , and A. Heks a,
“Pe cep ual e alua ion o speech quali y (pesq)-a new
me hod o speech quali y assessmen o elephone ne -
wo ks and codecs,” in In . Con . on Acous ics, Speech,
and Signal P ocessing, 2001, pp. 749–752.
[46] A. Gui, H. Gampe , S. B aun, and D. Emmanouilidou,
“Adap ing eche audio dis ance o gene a i e music
e alua ion,” in In . Con . on Acous ics, Speech and Sig-
nal P ocessing (ICASSP), Seoul, Sou h Ko ea. IEEE,
2024, pp. 1331–1335.
[47] Genís Plaja-Roglans and Thomas Nu all and Xa ie
Se a, “compiam,” 2024. [Online]. A ailable: h ps:
//m g.gi hub.io/compIAM/
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
838