Singing Voice Separation From Carnatic Music Mixtures Using a Regression-Guided Latent Diffusion Model

Author: Genís Plaja-Roglans; Xavier Serra; Martín Rocamora

Publisher: Zenodo

DOI: 10.5281/zenodo.17706605

Source: https://zenodo.org/records/17706605/files/000097.pdf

LEVERAGING CARNATIC LIVE RECORDINGS FOR SINGING VOICE
SEPARATION USING REGRESSION-GUIDED LATENT DIFFUSION
Genís Plaja-Roglans, Xavier Serra, Martín Rocamora
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
{genis.plaja, xavier.serra, martin.rocamora}@upf.edu
ABSTRACT
Diffusion models have demonstrated potential to separate individual sources from music mixtures in a generative fashion, enabling a new solution to this challenging problem. However, existing works require clean multi-stem data, which is scarce for several repertoires, consequently compromising generalization. We explore the potential of generative modeling to perform weakly-supervised singing voice separation for Carnatic Music, a music repertoire for which large quantities of multi-stem recordings with bleeding between sources have been collected from live performances. We pre-train a latent diffusion model to perform preliminary vocal separation conditioning on the corresponding mixture. Then, using a regressive model which is separately trained on a clean, smaller, and out-of-domain dataset, we estimate the level of bleeding in the preliminary separations and use that information to guide the diffusion model toward generating cleaner samples. The objective and perceptual evaluations show the potential of the proposed generative system for Carnatic vocal separation. Code, weights, and further materials are available online.¹
1. INTRODUCTION
Denoising diffusion probabilistic models (DDPM) are a class of generative systems that are emerging as an alternative solution for audio inverse problems such as enhancement [1], upsampling [2], and even source separation [3–5]. Music source separation (MSS) is the task of estimating the individual elements in a musical mixture [6]. Because of their conditioning flexibility and generative potential, DDPM are considered a promising solution for MSS [7]. While competitive diffusion separation systems exist [5,8], these focus on instrumental music.
Large training data is key for DDPM [1, 9, 10]; however, gathering clean, multi-stem data is challenging [11]. While large multi-stem collections recorded in live shows exist [12–16], these come with source bleeding: the other sources, room response, and other interferences leak into the individual stems. Regularly training an MSS model on such data often results in suboptimal performance [17].
¹ https://github.com/genisplaja/ldm-carnatic-separation
© G. Plaja-Roglans, X. Serra, and M. Rocamora. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: G. Plaja-Roglans, X. Serra, and M. Rocamora, “Leveraging Carnatic live recordings for singing voice separation using regression-guided latent diffusion”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In this work, we aim to leverage the inherent domain knowledge in a large collection of live multi-stem tracks with bleeding while still targeting clean separation. Carnatic Music, which is mostly enjoyed live [18], presents an interesting case of study. Prior work targeted the same objective for Carnatic vocals [17] and violin [19]. However, [17] relies on a complex heuristic compromising generalization and efficiency, while [19] uses a large, clean but private in-domain dataset. Clean Carnatic multi-stem data exist [20], but only for a small collection of 5 concerts.
We propose a generative approach to this problem, relying on latent diffusion models (LDM) [21]: the generative diffusion process operates on a compact, pre-learnt audio representation, enhancing efficiency and learning capacity [10]. We pre-train an LDM to generate singing vocals with source bleeding conditioned on music mixtures [22]. In parallel, we train a regressor to estimate the bleeding ratio in vocal signals using open, clean, non-Carnatic multi-stem data. We then refine the pre-trained LDM using a loss penalization term based on the bleeding predictions, aiming at generating cleaner vocals. Inspired by gradient guidance for diffusion models [23, 24], we subsequently propose regression-based bleeding level guidance: we steer the gradients of the bleeding estimator to inform the diffusion sampler toward the direction of cleaner separation.
Non-generative MSS systems that transform or mask time-frequency representations normally rely on access to all stems, assuming these combine linearly to the reference mixture [25]. Leveraging generative flexibility, we consider two added challenges: (1) access to the mixture and the corresponding vocal stem with bleeding only [26], and (2) the mixture alone has undergone non-linear processing.
We prioritize efficiency using a compact latent space, at the expense of signal quality and a significant penalization on separation metrics, a known problem for the evaluation of generative models [27,28]. Nevertheless, our system achieves, without the need of clean, in-domain, multi-stem samples, competitive objective generation quality and perceptual separation preference over the baselines.
2. BACKGROUND
2.1 Latent diffusion
Let $X \in \mathbb{R}^{F \times D}$ be a latent embedding with feature size $F$ and time dimension $D = T/c$, where $T$ is audio length and $c$ the compression factor of a certain latent encoder $E: x \in \mathbb{R}^{1 \times T} \to X \in \mathbb{R}^{F \times D}$. In this work, we rely on a latent forward diffusion process defined by a Markov chain of $T$ steps that converts a latent embedding $X \sim p(X)$ into a sample of Gaussian noise $\epsilon \in \mathbb{R}^{F \times D}$. The intermediate steps of this transformation are computed as [29]:
$$X_{\sigma_t} = \alpha_{\sigma_t} X_{\sigma_0} + \beta_{\sigma_t}\,\epsilon, \tag{1}$$
where $\sigma_t \in [0,1]$ is a noise schedule of $T$ values to control the transformation, while we define $\alpha_{\sigma_t} := \cos(\phi_t)$ and $\beta_{\sigma_t} := \sin(\phi_t)$, where $\phi_t := \frac{\pi}{2}\sigma_t$. Note also that $X_{\sigma_0} = X$. A model is then trained to revert this process, approximating the data distribution $p(X)$ by learning to map Gaussian samples to observations $X \sim p(X)$.
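As an illustration of Eq. (1), the following minimal sketch applies the forward transformation to a toy latent stored as a flat list of floats. The function names (`alpha_beta`, `forward_diffuse`) and the toy data are assumptions made for this example and are not taken from the authors' code.

```python
import math
import random


def alpha_beta(sigma):
    """Return (alpha, beta) = (cos(phi), sin(phi)) with phi = pi/2 * sigma."""
    phi = 0.5 * math.pi * sigma
    return math.cos(phi), math.sin(phi)


def forward_diffuse(x0, eps, sigma):
    """Eq. (1): blend a clean latent x0 with noise eps at noise level sigma."""
    alpha, beta = alpha_beta(sigma)
    return [alpha * x + beta * e for x, e in zip(x0, eps)]


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # toy flattened latent
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]    # Gaussian noise sample

x_clean = forward_diffuse(x0, eps, 0.0)  # sigma = 0 returns the latent itself
x_noise = forward_diffuse(x0, eps, 1.0)  # sigma = 1 returns (numerically) pure noise
```

Because $\alpha_{\sigma_t}^2 + \beta_{\sigma_t}^2 = 1$ for every noise level, this schedule keeps the variance of the noised latent constant when the latent and the noise both have unit variance.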
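Let $v_{\sigma_t} \in \mathbb{R}^{1 \times D}$ be the velocity objective, which corresponds to the inner variable of the diffusion process which tracks the transformation between $X_{\sigma_0}$ and $X_{\sigma_T}$. The objective $v_{\sigma_t}$ is formally computed as:
$$v_{\sigma_t} = \alpha_{\sigma_t}\,\epsilon - \beta_{\sigma_t} X_{\sigma_0}, \tag{2}$$
and estimated by neural network $m$ with parameters $\theta$:
$$\hat{v}_{\sigma_t} = m_\theta(X_{\sigma_t}, \sigma_t, C) \tag{3}$$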
Network $m_\theta$ is the generative LDM. Input $C \in \mathbb{R}^{F \times D}$ represents the conditioning signal. Diffusion systems may be trained unconditionally to sample random observations from approximated $\hat{p}(X)$, while instructions from various modalities (e.g., text prompts [9], audio signals [5], and more) can be injected to the posterior to modify the generation trajectory. However, our work focuses on a well-defined inverse problem. As a result, we inject $C$ during both training and inference, architecturally optimizing the system to tailor the diffusion trajectory relating the conditioning signal and the generator target $\hat{X}$. Let $\mathbb{E}$ denote expectation. The diffusion loss objective is defined as [29]:
$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{t \sim [0,T],\, \sigma_t,\, X_{\sigma_t}} \left\| \hat{v}_{\sigma_t} - v_{\sigma_t} \right\|_2^2 \tag{4}$$
2.2 Sampling process
The sampling process progressively models a sample pertaining to the approximated distribution $\hat{p}(X)$ by denoising a random sample of Gaussian noise. Previous works in audio generation have relied on the Denoising Diffusion Implicit Models (DDIM) sampler, achieving a satisfactory compromise between sampling steps and generation quality [30]. In DDIM sampling [29], the inference process is performed using arbitrary $T$, and it is initiated at $\sigma_t = 1$. A given sampling step $t$ is composed of a set of operations:
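We first run a forward pass with model $m_\theta$ as defined in Eqn (3). Using predicted velocity $\hat{v}_{\sigma_t}$, we can compute $\hat{X}_{\sigma_0}$, which corresponds to the estimated target sample at $t = 0$, and $\hat{\epsilon}_{\sigma_t} \in \mathbb{R}^{1 \times D}$, which is Gaussian noise at step $t$:
$$\hat{X}_{\sigma_0} = \alpha_{\sigma_t} X_{\sigma_t} - \beta_{\sigma_t} \hat{v}_{\sigma_t} \tag{5}$$
$$\hat{\epsilon}_{\sigma_t} = \beta_{\sigma_t} X_{\sigma_t} + \alpha_{\sigma_t} \hat{v}_{\sigma_t} \tag{6}$$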
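The relation between Eqs. (1), (2), (5), and (6) can be checked numerically: when the velocity is known exactly, Eqs. (5) and (6) return the clean latent and the noise that produced the noised sample. The sketch below uses hypothetical helper names and toy values chosen for this example only.

```python
import math


def alpha_beta(sigma):
    """Return (cos(phi), sin(phi)) with phi = pi/2 * sigma."""
    phi = 0.5 * math.pi * sigma
    return math.cos(phi), math.sin(phi)


def noised_latent(x0, eps, sigma):
    """Eq. (1): noised latent at noise level sigma."""
    a, b = alpha_beta(sigma)
    return [a * x + b * e for x, e in zip(x0, eps)]


def velocity(x0, eps, sigma):
    """Eq. (2): velocity target for a given latent, noise, and noise level."""
    a, b = alpha_beta(sigma)
    return [a * e - b * x for x, e in zip(x0, eps)]


def recover(x_t, v, sigma):
    """Eqs. (5) and (6): estimate the clean latent and the noise from a velocity."""
    a, b = alpha_beta(sigma)
    x0_hat = [a * x - b * vi for x, vi in zip(x_t, v)]
    eps_hat = [b * x + a * vi for x, vi in zip(x_t, v)]
    return x0_hat, eps_hat


x0 = [0.5, -0.25, 1.0, 0.0]    # toy clean latent
eps = [0.1, 0.7, -1.3, 0.4]    # toy noise sample
sigma = 0.3
x_t = noised_latent(x0, eps, sigma)
x0_hat, eps_hat = recover(x_t, velocity(x0, eps, sigma), sigma)
```

Substituting Eqs. (1) and (2) into Eq. (5) gives $(\alpha_{\sigma_t}^2 + \beta_{\sigma_t}^2) X_{\sigma_0} = X_{\sigma_0}$, which is why the recovery is exact up to floating-point error; in practice the predicted velocity is only approximate, and so is the estimate.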
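Note that, for $t \approx T$, i.e. at an early stage of the sampling process, predicted $\hat{X}_{\sigma_0}$ is expected to be noisy, limitedly consistent with signal $C$, while at $t \approx 0$, it approximates further to the final, refined separation. For $t > 0$, the input to the next sampling step is formally defined as:
$$\hat{X}_{\sigma_{t-1}} = \alpha_{\sigma_{t-1}} \hat{X}_{\sigma_0} + \beta_{\sigma_{t-1}} \hat{\epsilon}_{\sigma_t} \tag{7}$$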
Finally, $\hat{X}_{\sigma_0}$ is decoded to the original domain using decoder $E': X \in \mathbb{R}^{F \times D} \to x \in \mathbb{R}^{1 \times T}$. Encoder $E$ and decoder $E'$ are normally pre-trained and kept frozen.
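The sampling steps of this section can be put together in a short loop. The sketch below is a simplified illustration and not the authors' implementation: it assumes a linearly spaced noise schedule from 1 to 0, uses a hypothetical `model(x, sigma, cond)` callable in place of $m_\theta$, and omits latent encoding and decoding. The `oracle_model` is a toy stand-in that is handed the target latent as its conditioning, so the loop can be verified end to end.

```python
import math
import random


def alpha_beta(sigma):
    """Return (cos(phi), sin(phi)) with phi = pi/2 * sigma."""
    phi = 0.5 * math.pi * sigma
    return math.cos(phi), math.sin(phi)


def ddim_sample(model, cond, noise, num_steps):
    """Run num_steps sampling steps from sigma = 1 down to sigma = 0."""
    # Assumed schedule: linearly spaced noise levels, 1 -> 0.
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
    x = list(noise)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        a, b = alpha_beta(s_cur)
        v = model(x, s_cur, cond)                             # Eq. (3)
        x0_hat = [a * xi - b * vi for xi, vi in zip(x, v)]    # Eq. (5)
        eps_hat = [b * xi + a * vi for xi, vi in zip(x, v)]   # Eq. (6)
        a_n, b_n = alpha_beta(s_next)
        x = [a_n * p + b_n * e for p, e in zip(x0_hat, eps_hat)]  # Eq. (7)
    return x  # at sigma = 0 this equals the last x0_hat


def oracle_model(x_t, sigma, cond):
    """Toy velocity predictor that already knows the target (passed as cond)."""
    a, b = alpha_beta(sigma)
    eps = [(xi - a * ci) / b for xi, ci in zip(x_t, cond)]
    return [a * e - b * ci for e, ci in zip(eps, cond)]


rng = random.Random(1)
target = [rng.uniform(-1.0, 1.0) for _ in range(16)]
start = [rng.gauss(0.0, 1.0) for _ in range(16)]
sample = ddim_sample(oracle_model, target, start, num_steps=8)
```

With a trained network in place of the oracle, the same loop would be followed by decoding the returned latent with $E'$.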
3. METHOD
Let $A$ and $B$ represent musical repertoires or domains which differ on instrumentation, concepts, and practices. In our work, $A$ corresponds to Carnatic Music and $B$ to Western radio music (e.g. pop, rock, hip-hop, and related).
3.1 Latent encoder
We use Music2Latent 1 [31] (M2L), which is a neural codec based on a consistency model [32]. Both M2L encoder and decoder are depicted in red in Figure 1. M2L compresses signals sampled at 48kHz down to 12Hz, and produces 64-dimensional codes with 0 mean and deviation 1. The significant compression of M2L enables the development of our work in an environment with limited computational resources. M2L is trained using the MTG-Jamendo dataset [33], which includes numerous tracks of repertoire B, and 90 recordings tagged as indian. It also includes ≈2k vocal tracks, and ≈2k tracks with violin. We are unaware of the number of recordings mixing these sources.
The enormous compression rate of M2L comes at a cost: authors report −3.85dB of reconstruction SI-SDR, a standard separation metric, and perceptible artifacts often arise in the reconstructions. While official code to train or fine-tune M2L is not available, we rely on the open pre-trained model, prioritizing its compression and feature learning capabilities to study the effectiveness of latent diffusion for weakly-supervised MSS. Moreover, the M2L compression rate enables us to perform our LDM study with very limited computational resources, yet training a model with a size on par with the literature [34–36].
3.2 Latent diffusion for separation
Model $m_\theta$ is a 1D attention U-Net with skip-connections. It is depicted in green in Fig. 1. It is composed of $n$ residual blocks which include two 1D convolutional layers, each preceded by GroupNorm [37] and SiLU activation [38]. A pre-defined number of blocks include time-wise self-attention to learn the relationship between different time steps and enrich context, which is crucial for MSS.
To down and upsample the features at each level in the U-Net, we add an extra layer with kernel size $k \times k$, $k$ being the time compression or expansion factor. When $k > 1$, for downsampling we double the feature channels, while halving them for the upsampling blocks.
The time-step $\sigma_t$ is projected into 1024-channel random Fourier feature embeddings, which are processed through a 3-layer multi-layer perceptron (MLP) with GELU activations. The resulting embedding is incorporated into the model via FiLM layers.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1. Diagram of the proposed system. The LDM is first trained to generate encoded vocals with bleeding. Next, we fine-tune using the bleeding ratio loss. Finally, during sampling, we use the bleeding predictions to compute the gradients towards less bleeding and modify the generation trajectory in that direction. The order of development steps is indicated.
Several mechanisms to inject conditioning signals in diffusion U-Nets exist [1,30,34]. We find the best quality-efficiency compromise on concatenating, over the feature channels, the conditioning signal $C$ and $X_{\sigma_t}$ [39]. Previous latent diffusion work using M2L embeddings has relied on this mechanism [34]. Even if the M2L embeddings are 2D, we employ 1D convolutional layers to effectively capture temporal dependencies in the compressed representation, processing each feature vector independently without imposing artificial spatial correlations.
The network is trained relying on the objective in Eq. 4 using corresponding pairs of vocal stems with bleeding $X_{\sigma_0}$ and mixture $C$, both encoded using $E$ [22].
3.3 Bleeding level estimator
The glass ceiling of the separation LDM presented in Section 3.2 is established at the inherent source bleeding in the training data of domain A. However, the network may still be trained to map from mixture to the corresponding vocals with bleeding, leveraging domain knowledge [17].
Prior work has shown that a separator model trained using only data with source bleeding can be fine-tuned towards cleaner outputs by steering a bleeding estimator network [19], which predicts the ratio of bleeding in the preliminary separations, while the non-optimal separator is optimized to minimize this ratio. Building on this insight, we hypothesize that estimating bleeding ratios is less prone to severe generalization errors compared to MSS. This allows us to leverage the knowledge embedded in a pre-trained separator for repertoire A, while fine-tuning using the bleeding estimator trained using repertoire B, bypassing access to clean multi-stem data for repertoire A.
3.3.1 Regression-based bleeding level guidance
In an attempt to guide the pre-trained LDM to generate cleaner vocals, we leverage a regression model to guide the diffusion process using the level of source bleeding.
Similarly to [19], we aim to introduce a bleeding estimator model to guide the separation system toward reducing the bleeding. Leveraging the flexibility of the sampling process of diffusion models, we propose regression-based bleeding level guidance (RG), which is inspired by classifier guidance (CG) [23], a technique to enhance quality and control in diffusion image generation. CG leverages a pre-trained classifier to estimate the class to which $X_{\sigma_t}$ belongs. Using the gradients obtained w.r.t. $X_{\sigma_t}$, we may modify the latent intermediate diffusion steps to point better toward the target class. Prior studies have also relied on gradients to tune diffusion sampling [24,40].
We use a bleeding estimator, in blue in Fig. 1, to predict the amount of source bleeding in an individual stem, represented by a floating point value $b \in [0,1]$, where 0 represents no bleeding and 1 the mixture. Using a clean multi-stem but out-of-domain dataset of repertoire B, we train a neural network $\varphi$ to perform this task. Since $\varphi$ is meant to be integrated within the iterative diffusion process, the bleeding prediction input is expected to be $\hat{X}_{\sigma_t}$, which is infused with Gaussian noise following Eq. 1. Therefore, the training input of $\varphi$ is an M2L-encoded vocal stem with bleeding (with ratio $b$), corrupted using Eq. 1 from the diffusion formulation. Note that the M2L codes have shown competitive performance in several downstream tasks [31]. The model is trained using L2 loss:
$$\mathcal{L}_{\varphi}(X_{\sigma_t}, \sigma_t, b) = \left\| \varphi(X_{\sigma_t}, \sigma_t) - b \right\|_2^2 \tag{8}$$
Bleeding is expected to stay consistent along an audio sample, thus the model estimates a single $\hat{b}$ per input. Diffusion time-step $\sigma_t$ is also injected to the regressor to provide information of the current noise level [23,24].
Regression-guidance for diffusion sampling. We incorporate RG in the diffusion sampling algorithm by steering the gradients from the bleeding predictor. We predict the bleeding of the input diffusion forward variable $X_{\sigma_t}$ at each sampling step $t$, and calculate the gradients that point toward the direction of our target: 0 bleeding [40]. The extracted gradients are used to guide predicted velocity $\hat{v}_{\sigma_t}$, following the formulation below:
Wguid =η·102·1
−1·σ 2(9)
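$$\hat{v}^{\text{guid}}_{\sigma_t} = \hat{v}_{\sigma_t} + W_{\text{guid}} \cdot \nabla_{X_{\sigma_t}} \left| 0 - \varphi(X_{\sigma_t}, \sigma_t) \right| \tag{10}$$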
The gradients are normalized using per-sample L2 normalization, ensuring stable guidance. The guidance level is manually controlled by $\eta$, and is also dynamically scaled to provide less guidance in the beginning of the sampling process, where $X_{\sigma_t} \approx \epsilon$, and strengthen the guidance effect toward the intermediate-to-last sampling steps [24].
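To make the mechanics of the guided velocity in Eq. (10) concrete, here is a minimal sketch with a toy differentiable estimator. Everything in it is an assumption for illustration: `toy_bleeding` (a sigmoid of the latent mean) stands in for the trained regressor, its gradient is written analytically instead of being obtained by automatic differentiation, and `weight` is a plain scalar rather than the scheduled guidance weight described in the text.

```python
import math


def toy_bleeding(x):
    """Hypothetical stand-in for the bleeding estimator: sigmoid of the mean."""
    mean = sum(x) / len(x)
    return 1.0 / (1.0 + math.exp(-mean))


def toy_bleeding_grad(x):
    """Analytic gradient of toy_bleeding with respect to each entry of x."""
    s = toy_bleeding(x)
    return [s * (1.0 - s) / len(x)] * len(x)


def l2_normalize(grad, tiny=1e-12):
    """Per-sample L2 normalization of a gradient vector."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / (norm + tiny) for g in grad]


def guide_velocity(v_hat, x_t, weight):
    """Eq. (10): the toy estimate is non-negative, so |0 - b_hat| = b_hat."""
    direction = l2_normalize(toy_bleeding_grad(x_t))
    return [v + weight * d for v, d in zip(v_hat, direction)]


def predict_x0(x_t, v, sigma):
    """Eq. (5): clean-latent estimate from a velocity."""
    phi = 0.5 * math.pi * sigma
    a, b = math.cos(phi), math.sin(phi)
    return [a * x - b * vi for x, vi in zip(x_t, v)]


sigma = 0.5
x_t = [0.3, -0.1, 0.8, 0.2]      # toy noised latent
v_hat = [0.1, 0.0, -0.2, 0.4]    # toy predicted velocity
plain = predict_x0(x_t, v_hat, sigma)
guided = predict_x0(x_t, guide_velocity(v_hat, x_t, weight=0.5), sigma)
```

Since Eq. (5) subtracts $\beta_{\sigma_t} \hat{v}_{\sigma_t}$, adding the bleeding gradient to the velocity moves the clean-latent estimate against that gradient, so the toy bleeding score of `guided` is lower than that of `plain`.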
LDM bleeding-aware fine-tuning. We incorporate a penalization loss term to penalize the pre-trained LDM using the level of bleeding in the predictions. For time-steps with low noise exposure ($\sigma_t < 0.6$), the frozen bleeding estimator predicts the bleeding ratio before and after a denoising step, denoted as $\hat{b}_{\text{pre}} = \varphi(X_{\sigma_t})$ and $\hat{b}_{\text{post}} = \varphi(\hat{X}_{\sigma_0})$, respectively. A max-margin hinge term $\max(0, \hat{b}_{\text{post}} - \hat{b}_{\text{pre}} + m)$ with margin $m = 0.05$ ensures that the model must reduce bleed by at least $m$, otherwise it is penalized. The penalization term is further weighted by $(1 - \sigma_t)^2$ to amplify its impact at later and perceptually clearer steps. Overall, the fine-tuning loss becomes:
$$\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda \cdot (1 - \sigma_t)^2 \cdot \max(0, \hat{b}_{\text{post}} - \hat{b}_{\text{pre}} + m) \tag{11}$$
We use parameter $\lambda$ to control the balance between penalization term and diffusion loss, encouraging consistent reduction in bleed while maintaining generation fidelity.
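The fine-tuning objective in Eq. (11) reduces to a few scalar operations once the two bleeding predictions are available. The sketch below is a hedged illustration with hypothetical function names; it reads the $\sigma_t < 0.6$ condition as a gate that switches the penalty off at noisier steps, which is an interpretation of the text and not a confirmed implementation detail. The default $\lambda = 50$ is the value reported in Section 4.1.2.

```python
def bleeding_penalty(b_pre, b_post, sigma, margin=0.05, sigma_max=0.6):
    """Weighted hinge term of Eq. (11) for one training example."""
    if sigma >= sigma_max:
        # Assumed gating: no penalty at steps with high noise exposure.
        return 0.0
    hinge = max(0.0, b_post - b_pre + margin)
    return (1.0 - sigma) ** 2 * hinge


def finetune_loss(diff_loss, b_pre, b_post, sigma, lam=50.0):
    """Eq. (11): diffusion loss plus the lambda-weighted bleeding penalty."""
    return diff_loss + lam * bleeding_penalty(b_pre, b_post, sigma)


# Bleed reduced by more than the margin: no penalty is added.
loss_good = finetune_loss(0.2, b_pre=0.5, b_post=0.3, sigma=0.2)
# Bleed unchanged: the margin itself is penalized, scaled by (1 - sigma)^2.
loss_flat = finetune_loss(0.2, b_pre=0.5, b_post=0.5, sigma=0.2)
```

In this toy usage, `loss_good` equals the diffusion loss alone, while `loss_flat` adds $50 \cdot 0.8^2 \cdot 0.05$ on top of it.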
3.3.2 Network details
Bleeding estimator $\varphi$ is based on a stack of dilated convolutions which are regularized using GroupNorm, SiLU activations, and dropout. The output embedding is batch-normalized and then $\sigma_t$, which is processed using the same step embedder as the LDM, is injected via summation. The resultant vector is then passed through a bidirectional LSTM, capturing temporal dependencies. To model global temporal relationships, we use multi-head self-attention over the sequence. Finally, we apply global average pooling across the time dimension and use a sigmoid-activated linear layer to produce a single scalar corresponding to the predicted bleed score ($\in [0,1]$).
4. EXPERIMENTS
4.1 Experimental setup
4.1.1 LDM separation pre-training
The LDM U-Net is 7 levels deep. The feature channels per layer are set as {128, 256, 256, 512, 512, 1024, 2048}, with 64 ∗ 2 input channels, being the channel-wise concatenation of the M2L codes and the model input $X_{\sigma_t}$. However, the last convolutional layer outputs a 64-channeled signal, corresponding to the generated embedding to decode. The training input context is 1052000 samples (≈24s), which is compressed to 2048 samples using M2L. Time compression factors of the LDM are set to {1, 2, 1, 2, 1, 2, 1}, factor 1 representing no compression, thus we reach 128 samples in the bottleneck, trying not to over-compress the information.
The first U-Net level is composed of 1 block, while the deepest includes 4. The rest are composed of 2 blocks. The four deepest levels of the U-Net include time-wise self-attention with 8 heads, aiming at enriching context.
The LDM network has ≈365M trainable parameters, and it is trained using the 168 multi-stem recordings from Saraga Carnatic; 15 recordings are kept for validation, each of them from a different concert. We use ADAM optimizer with learning rate $1 \cdot 10^{-5}$, and use a linear warm-up stage using a cosine schedule with an initial rate of $1.6 \cdot 10^{-6}$. We reach 500k training steps in two weeks on an 8GB GPU.
4.1.2 Bleeding estimator guidance
Artificial bleeding dataset. The bleeding estimator $\varphi$ is trained using musdb18hq [41], corresponding to domain B. We artificially create the bleeding following the pipeline described in the SDX 2023 bleeding challenge [25]. Let $S_i$ be a given source stem. The accompaniment $A$ is the weighted sum of the non-vocal sources: $A = \sum_{i=1}^{N} w_i S_i$, where $w_i \sim U(0,1)$ are independently sampled random mixing weights. We randomly filter the sources using the following categorical distribution [19,25]:
$$S_i = \begin{cases} \mathrm{BPF}(S_i, f^{\text{low}}_c, f^{\text{high}}_c) & \text{with } p = 0.4 \\ \mathrm{HPF}(S_i, f_c) & \text{with } p = 0.4 \\ S_i & \text{with } p = 0.2 \end{cases}$$
where we use band pass filter (BPF) with low cut-off frequency $f^{\text{low}}_c \sim U(200, 600)$ Hz and high cut-off frequency $f^{\text{high}}_c \sim U(8\text{k}, 10\text{k})$ Hz, and high-pass filter (HPF) with $f_c \sim U(900, 9\text{k})$ Hz. The order of the filters is also randomly sampled from $\sim U(3, 8)$. Next, a bleeding ratio $b \sim U(0,1)$ is sampled and used to compute the reference mixture $M = S_v + b \cdot A$. We normalize $M$ to prevent clipping and confine all values between $[-1, 1]$.
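The data-creation recipe above can be summarized in a short sketch. It is a simplified illustration under stated assumptions: stems are plain lists of floats, the filter order is drawn as an integer in [3, 8], the filter is only configured and not applied (a real pipeline would run a band-pass or high-pass filter on each stem), and normalization is done by peak scaling when the mixture exceeds the [−1, 1] range. All function names are hypothetical.

```python
import random


def sample_filter_config(rng):
    """Draw a filter type and parameters from the categorical distribution."""
    u = rng.random()
    order = rng.randint(3, 8)  # assumed to be an integer, bounds included
    if u < 0.4:
        return {"type": "bandpass", "order": order,
                "low_hz": rng.uniform(200.0, 600.0),
                "high_hz": rng.uniform(8000.0, 10000.0)}
    if u < 0.8:
        return {"type": "highpass", "order": order,
                "cutoff_hz": rng.uniform(900.0, 9000.0)}
    return {"type": "none"}


def make_bleeding_mix(vocal, others, rng):
    """Build M = S_v + b * A with random stem weights and bleeding ratio b."""
    weights = [rng.random() for _ in others]                  # w_i ~ U(0, 1)
    accomp = [sum(w * stem[n] for w, stem in zip(weights, others))
              for n in range(len(vocal))]                     # A = sum_i w_i S_i
    b = rng.random()                                          # b ~ U(0, 1)
    mix = [v + b * a for v, a in zip(vocal, accomp)]
    peak = max(abs(m) for m in mix)
    if peak > 1.0:                                            # assumed peak scaling
        mix = [m / peak for m in mix]
    return mix, b


rng = random.Random(7)
vocal = [rng.uniform(-1.0, 1.0) for _ in range(64)]
others = [[rng.uniform(-1.0, 1.0) for _ in range(64)] for _ in range(3)]
configs = [sample_filter_config(rng) for _ in others]  # one filter draw per stem
mix, b = make_bleeding_mix(vocal, others, rng)
```

The pair `(mix, b)` then serves as one training example for the regressor: the encoded, noised mixture is the input and `b` is the regression target.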
Model details. The input artificial mix $M$ is encoded using M2L and merged with Gaussian noise using Eq. 1, therefore, the input channel size is 64. We use five dilated convolutions with ratios {1, 2, 4, 8, 8}. The number of filters of the convolutional stack and also the size of the hidden state of the LSTM are set to 512. The multi-head attention mechanism is configured with 8 heads.
Training scheme. The training context of the bleeding predictor is the same as that of the LDM. The bleeding estimator model totals ≈1M parameters. We use ADAM optimizer with a learning rate $4 \cdot 10^{-4}$, and train until convergence. Subsequently, the pre-trained LDM is fine-tuned for 10k steps using $\lambda = 50$, using learning rate $1 \cdot 10^{-6}$.
4.1.3 Diffusion sampling parameters
We sample using overlapping segments of ≈24s with ≈5s hop, which are subsequently combined using overlap-add. If sampling with $T = 64$ on a single TITAN X 8GB GPU, our system separates audio at an average speed of ≈0.4x the duration of the track.
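A minimal sketch of segment-wise processing with overlap-add is given below. It is an illustration under assumptions, not the authors' code: segments are combined by plain averaging of overlapping samples (the window shape used in the paper is not stated), lengths are expressed in samples of a toy signal, and the per-segment separation is replaced by an identity function.

```python
def segment_starts(num_samples, seg_len, hop):
    """Start indices of overlapping segments that cover the whole signal."""
    if num_samples <= seg_len:
        return [0]
    starts = list(range(0, num_samples - seg_len + 1, hop))
    if starts[-1] + seg_len < num_samples:
        starts.append(num_samples - seg_len)  # extra segment for the tail
    return starts


def overlap_add(num_samples, segments, starts):
    """Average overlapping segments back into one signal."""
    acc = [0.0] * num_samples
    count = [0.0] * num_samples
    for seg, start in zip(segments, starts):
        for i, value in enumerate(seg):
            acc[start + i] += value
            count[start + i] += 1.0
    return [a / c for a, c in zip(acc, count)]


def separate_segment(segment):
    """Placeholder for running the sampler on one segment (identity here)."""
    return list(segment)


signal = [float(n) for n in range(100)]  # toy signal
seg_len, hop = 24, 5                     # toy lengths, mirroring the 24 s / 5 s ratio
starts = segment_starts(len(signal), seg_len, hop)
segments = [separate_segment(signal[s:s + seg_len]) for s in starts]
rebuilt = overlap_add(len(signal), segments, starts)
```

With the identity placeholder the rebuilt signal matches the input, which confirms that the segments cover every sample; with a real sampler, the averaging smooths discontinuities at segment boundaries.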
4.2 Datasets
Saraga Carnatic (Domain A, training data). This is a collection containing 168 real multi-stem recordings (totaling ≈60h of music), including vocals, violin, and percussion instruments, missing only the tanpura, which provides the tonic from which an entire performance is built. Carnatic Music is mostly enjoyed live, therefore, to ensure ecological validity, Saraga is recorded in live performances, collecting the stems from the mixer. However, this has a drawback: the microphone of a given source captures, in the background, the other sources.
musdb18hq (Domain B, for RG). It is one of the most established open datasets for MSS. It includes 100 training and 50 testing multi-stem tracks split in vocals, bass, drums, and others. It represents a limited set of styles mostly confined to Western commercial music.
Sanidha (A, evaluation data). It is the only available open collection of clean multi-stem recordings of Carnatic Music [20]. After some exploration of this dataset, which is composed of 5 concerts, we discard 1 that has bleeding in the vocal stem leaked through the singer's headphones. While Sanidha has not yet been shown to be a potential dataset for training over Saraga, we employ it for testing, enabling more reliable objective evaluation for this repertoire.
4.3 Evaluation metrics
The objective evaluation of generative systems for audio inverse problems is challenging [1]. In MSS, traditional definitions of source-to-distortion ratio (SDR), the standard separation metric, have been reported to often misrepresent the perceptual quality of separations [42, 43]. Moreover, SDR is significantly penalized by potential subtle differences and phase mismatches commonly present when evaluating fully-generative models. For these reasons, SDR is used less in prior generative separation work. In the case of LDM, given the added phase reconstruction mismatch introduced by the latent encoder, not even scale-invariant SDR (SI-SDR), present in various generative separation systems [5,26], is being used [8,44].
Therefore, we rely on alternative audio quality measures that have been employed in prior work on latent diffusion for generation [34, 35] and source separation [5, 8, 44]: log-spectral distance (LSD) [35] and log mel-spectrogram L2 error [43]. These metrics are phase-independent and may be more appropriate for generative systems. We also report perceptual evaluation of speech quality (PESQ) [45], aiming at measuring intelligibility.
To assess the quality of the generated signals without relying on matching audio pairs, we report the Fréchet Audio Distance (FAD) [46]. Model outputs with higher quality and lesser interferences should report lower FAD. The FAD is often computed on short chunks. However, in addition to diversity, context is important [46]. Carnatic renditions often feature prolonged improvisational segments, such as alapana and tanam, which can span several minutes. For these reasons, we split the samples into 1-minute chunks. We discard the chunks with >25% of silence, which results in ≈150 testing samples.
Dataset      L2 Loss↓   Avg. b̂ mix     Avg. b̂ voc.
musdb18hq    0.054      0.89 ± 0.18    0.05 ± 0.09
Sanidha      0.098      0.80 ± 0.23    0.08 ± 0.18
Saraga       -          0.97 ± 0.07    0.25 ± 0.29
Table 1. Assessing the bleeding regressor. Ideally, the avg. b̂ should be ≈1 for mix, and ≈0 for vocals, except for Saraga, whose vocal stems have inherent, real bleeding.
To complement the objective measures we run a perceptual test with human listeners. We conduct a preference-based experiment [1]. We split the separations into chunks of ≈15 seconds, discarding unvoiced regions, and randomly select an instance of each rendition. Using the mixture as reference, we select 6 examples from the pool, including diversity of music sections and singer gender. The participants are shown several unlabeled and randomly ordered pairs of samples of our model against other systems. We introduce comparisons between non-generative models to prevent the participants from getting familiar with model-specific artifacts.
4.4 Compared systems
We compare against the multi-source diffusion model (MSDM) [5] for separation. Since no weights for vocals are available, and the system is designed for clean multi-stem data, we train it using musdb18hq, following the instructions in the repository. We do not compare against existing LDM separators since these are not optimized for vocals [8] or code or weights are not yet available [22,44].
While not directly comparable, since these are non-generative and mask-based, we evaluate cold-diff [17] and the mixer model from [19], the baseline systems addressing the same task for the Carnatic study case. Both models are directly used through the compIAM package [47]. Also, to provide a performance bound for our model, we evaluate the M2L-reconstructed vocal stems in Sanidha.
We perform an ablation study on the bleeding fine-tuning (FT), regression guidance level (RGη) for levels $\eta \in [0, 5, 10, 20]$, and sampling steps $T \in [32, 64]$. The non fine-tuned model is trained for ≈10k more steps for a fair comparison with the FT model. For the perceptual test we use $T = 32$, mid-guidance $\eta = 10$.
5. RESULTS
5.1 Evaluating the bleeding predictor
Evaluating the bleeding predictor is complex, since no music datasets with real, annotated bleeding exist. However, we perform two sanity checks. First, we compute the model L2 loss on artificial bleeding mixtures using musdb18hq and Sanidha. Second, we compute the average bleeding ratio on Saraga mixtures and vocal stems. To simulate the actual application of the bleeding predictor, inputs are noised using Eq. 1, uniformly sampling $\sigma_t$ values per-example. We predict the bleeding of 500 randomly sampled and voiced 12s-excerpts per dataset.

Model            T     FAD↓    LogMel L2↓   LSD↓   PESQ↑
M2L              –     0.281    2.95        1.22   2.73
cold-diff [17]   8     0.515   15.79        2.29   1.39
mixer [19]       –     0.648   13.26        1.77   1.22
MSDM [5]         150   0.791   12.52        2.00   1.78
Proposed
  no FT          32    0.637   16.74        1.80   1.15
  FT             32    0.593   16.02        1.75   1.17
  FT-RG5         32    0.587   13.36        1.68   1.22
  FT-RG10        32    0.579   12.89        1.66   1.18
  FT-RG20        32    0.626   13.44        1.67   1.19
  no FT          64    0.642   16.78        1.79   1.16
  FT             64    0.602   16.10        1.74   1.16
  FT-RG5         64    0.600   13.41        1.68   1.21
  FT-RG10        64    0.595   12.61        1.65   1.19
  FT-RG20        64    0.623   12.31        1.64   1.16
Table 2. Objective evaluation of diverse systems on audio and vocal quality measures. Arrow ↓ indicates lower is better, ↑ otherwise. In bold, we indicate the best score among all systems. We underline the best scores of the ablation. See further ablation results in the companion repo.¹
See the results in Table 1. The regressor generalizes quite satisfactorily to the Carnatic domain. The system also discriminates real Carnatic mixtures ($\hat{b} \approx 1$) from vocal stems with bleeding ($\hat{b} = 0.25 \pm 0.29$). The high standard deviation in the predicted bleeding ratio for Saraga may be explained by the high variance in accompaniment presence in different sections of a Carnatic rendition.
5.2 Objective evaluation
See the objective evaluation in Table 2. We observe a tangible improvement of our model when using the bleeding guidance during sampling, finding the sweet spot on RG10 for $T = 32$, and RG20 for $T = 64$, although the metrics do not always correlate. The bleeding fine-tuning loss term provides a more moderate improvement.
Our generative system outperforms the baselines on the spectral assessment metrics, while ranking second on FAD, only outperformed by cold-diff, a non-generative system. In terms of PESQ, our system scores the lowest, only leveling the mixer model when using $T = 32$ and $\eta = 10$.
The performance across metrics of our system suggests that stronger guidance further cleans and brings the generation closer to the target signal overall. However, this results in a trade-off: stronger interference removal comes at the cost of degraded vocal quality. While our system shows competitive overall quality, especially when guided, it reports lower PESQ than MSDM, the generative baseline, suggesting that MSDM generations have further intelligibility but also stronger interference. Note however that MSDM does not generate encoded latents but directly waveforms, potentially accumulating less phase discrepancy. Nevertheless, the general low PESQ scores for all models may be explained by the fact that this metric is for speech and it does not assume potential interferences, while it is unclear how it characterizes the extremely common and strong vocal ornaments in Carnatic Music.
                 Quality (%)       Interference (%)
Model            Ours    Other     Ours     Other
cold-diff [17]    5.0    95.0       97.50    2.50
mixer [19]       15.0    85.0      100.0     0.0
MSDM [5]         20.0    80.0      100.0     0.0
Table 3. Perceptual evaluation results showing the percentage of participants who preferred our system or baseline models in terms of quality and interference removal.
The results suggest that an LDM can be trained towards generating separated complex sources such as vocals, while the proposed guidance method contributes to a cleaner generation that gets closer to the target signal. The objective metrics support the expected behavior of the proposed approach, although we hypothesize potential stronger improvement if future efforts are made on fine-tuning the latent encoder and refining the bleeding estimator, as well as scaling the network up (a closely related LDM relies on >500M parameters [34]). These may contribute to refining the source quality of the generated vocals.
5.3 Perceptual assessment
A total of 20 participants took the test. Interestingly, the participant agreement is remarkable. The results are reported in Table 3. The perceptual assessment is clear: our system leads in interference removal but is not able to reach the source quality of the baselines. These results agree with the objective metrics, which suggest that the overall quality and cleanliness of our generations are competitive; however, the fidelity and intelligibility of the generated vocals leave room for improvement.
6. CONCLUSIONS
We present a deep generative model to address weakly-supervised singing voice separation for Carnatic Music, leveraging pairs of in-domain mixture and vocal stems which have source bleeding because these are recorded in real live performances. We propose to train a latent diffusion model to generate vocals with bleeding conditioned on the corresponding mixture. We then guide the pre-trained generative system toward producing cleaner samples using a bleeding ratio predictor. Our system achieves competitive scores for generation quality measures, and outperforms the baselines in terms of interference removal in a preference listening test. While the proposed framework shows promise, we envision extensive future work, especially to refine the vocal quality, which is sub-optimally ranked in the results. Tailoring the latent encoder to Carnatic, improving the bleeding estimator, and performing multi-source separation are potential future research lines.
We believe that the flexibility, conditioning, and guidance capabilities of DDPM may enable approaches to tackle separation in non-optimal contexts, or even improve performance on ideal conditions. This has potential for separation of repertoires that are not commonly recorded in studios, considering underrepresented instruments, and prioritizing particular aspects such as interference removal.
7. ETHICS STATEMENT
While this work deals with generative modeling, the system is trained using fully-open data, while we address a purely inverse problem. No copyright implications should be involved. The model is architecturally developed not to generate unseen music recordings. The results of the perceptual test were included in this work with the participants' permission, whose identities remain anonymous. No personal or sensitive data from the participants is collected and/or distributed.
8. ACKNOWLEDGEMENTS
This work is supported by IA y Música: Cátedra en Inteligencia Artificial y Música (TSI-100929-2023-1), funded by the Secretaría de Estado de Digitalización e Inteligencia Artificial, and the European Union-Next Generation EU, under the program Cátedras ENIA 2022 para la creación de cátedras universidad-empresa en IA, and IMPA: Multimodal AI for Audio Processing (PID2023-152250OB-I00), funded by the Ministry of Science, Innovation and Universities of the Spanish Government, the Agencia Estatal de Investigación (AEI) and co-financed by the European Union.
9. REFERENCES
[1] J. Se à, S. Pascual, J. Pons, R. O. A az, and D. Scaini,
“Uni e sal speech enhancemen wi h sco e-based di -
usion,” a xi :2206.03065, 2022.
[2] J. Lee and S. Han, “NU-wave: A diffusion probabilistic
model for neural audio upsampling,” in Annual
Conf. of the Int. Speech Communication Assoc.
(INTERSPEECH), Brno, Czech Republic, 2021, pp.
2698–2702.
[3] R. Scheibler, Y. Ji, S.-W. Chung, J. Byun, S. Choe, and
M.-S. Choi, “Diffusion-based generative speech source
separation,” in Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP), Singapore, 2022.
[4] C.-Y. Yu, E. Postolache, E. Rodolà, and G. Fazekas,
“Zero-shot duet singing voices separation with diffusion
models,” in Sound Demixing Workshop (SDX),
2023.
[5] G. Mariani, I. Tallini, E. Postolache, M. Mancusi,
L. Cosmo, and E. Rodolà, “Multi-source diffusion
models for simultaneous music generation and separation,”
in 12th Int. Conf. on Learning Representations,
Vienna, Austria, 2024.
[6] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner,
A. Kumar, and T. Weyde, “Singing voice separation
with deep U-Net convolutional networks,” in 18th Int.
Society for Music Information Retrieval Conf. (ISMIR),
Suzhou, China, 2017, pp. 745–751.
[7] S. Araki, N. Ito, R. Haeb-Umbach, G. Wichern, Z.-Q.
Wang, and Y. Mitsufuji, “30+ years of source separation
research: Achievements and future challenges,” in
IEEE Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP), Hyderabad, India, 2025.
[8] T. Karchkhadze, M. Izadi, and S. Dubnov, “Simultaneous
music separation and generation using multi-track
latent diffusion models,” in Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), Hyderabad,
India, 2025.
[9] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion
probabilistic models,” in 33rd Advances in Neural
Information Processing Systems (NeurIPS), Online,
2020, pp. 6840–6851.
[10] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons,
“Fast timing-conditioned latent audio diffusion,” in Int.
Conf. on Machine Learning (ICML), Vienna, Austria,
2024.
[11] E. Manilow, G. Wichern, P. Seetharaman, and
J. Le Roux, “Cutting music source separation some
Slakh: A dataset to study the impact of training data
quality and quantity,” in IEEE Workshop on Applications
of Signal Processing to Audio and Acoustics
(WASPAA), 2019.
[12] T. Prätzlich, M. Müller, B. Bohl, and J. Veit,
“Freischütz Digital: Demos of audio-related contributions,”
in Demos and Late Breaking News of the Int. Society
for Music Information Retrieval Conf. (ISMIR),
Málaga, Spain, 2015.
[13] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and
E. Gómez, “repoVizz: a framework for remote storage,
browsing, annotation, and exchange of multimodal
data,” in ACM Int. Conf. on Multimedia, Barcelona,
Spain, 2013.
[14] E. Gómez, M. Grachten, A. Hanjalic, J. Janer, S. Jordà,
C. F. Julià, C. Liem, A. Martorell, M. Schedl, and
G. Widmer, “PHENICX: Performances as Highly Enriched
and Interactive Concert Experiences,” in Proc.
of the 10th Sound and Music Computing Conf. (SMC),
Stockholm, Sweden, 2013.
[15] A. Srinivasamurthy, S. Gulati, R. C. Repetto, and
X. Serra, “Saraga: Open datasets for research on
Indian art music,” Empirical Musicology Review, vol. 16,
no. 1, pp. 85–98, 2021.
[16] A. Shankar, G. Plaja-Roglans, T. Nuttall, M. Rocamora,
and X. Serra, “Saraga Audiovisual: a large
multimodal open data collection for the analysis of
Carnatic Music,” in Proc. of the 25th Int. Society
for Music Information Retrieval Conf., San Francisco,
United States, 2024.
[17] G. Plaja-Roglans, M. Miron, A. Shankar, and X. Serra,
“Carnatic singing voice separation using cold diffusion
on training data with bleeding,” in 24th Int. Society for
Music Information Retrieval Conf. (ISMIR), Milano,
Italy, 2023.
[18] G. Plaja-Roglans, T. Nuttall, L. Pearson, X. Serra, and
M. Miron, “Repertoire-specific vocal pitch data generation
for improved melodic analysis of Carnatic music,”
Transactions of the International Society for Music
Information Retrieval, 2023.
[19] A. Shankar, S. Schweinitz, G. Plaja-Roglans, X. Serra,
and M. Rocamora, “Disentangling overlapping
sources: Improving vocal and violin source separation
in Carnatic music,” in Workshop on Indian Music
Analysis and Generative Applications (WIMAGA) in
ICASSP, 2025.
[20] V. V. Krishnan, N. Alben, A. A. Nair, and N. Condit-Schultz,
“Sanidha: A studio quality multi-modal
dataset for Carnatic music,” in Proc. of the 25th Int.
Society for Music Information Retrieval Conf., San
Francisco, United States, 2024.
[21] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and
B. Ommer, “High-resolution image synthesis with latent
diffusion models,” in Conf. on Computer Vision
and Pattern Recognition (CVPR), New Orleans,
Louisiana, USA, 2021.
[22] G. Plaja-Roglans, Y.-N. Hung, X. Serra, and I. Pereira,
“Efficient and fast generative-based singing voice
separation using a latent diffusion model,” in Proc. of the
Int. Joint Conf. on Neural Networks (IJCNN), Rome,
Italy, 2025.
[23] P. Dhariwal and A. Nichol, “Diffusion Models Beat
GANs on Image Synthesis,” in 35th Conf. on Neural
Information Processing Systems (NeurIPS 2021),
Online, 2021.
[24] A. Bansal, H.-M. Chu, A. Schwarzschild, R. Sengupta,
M. Goldblum, J. Geiping, and T. Goldstein, “Universal
guidance for diffusion models,” in Proc. of the 12th
International Conference on Learning Representations,
2024.
[25] G. Fabbro, S. Uhlich, C.-H. Lai, W. Choi, M. Martínez-Ramírez,
W. Liao, I. Gadelha, G. Ramos, E. Hsu,
H. Rodrigues et al., “The Sound Demixing Challenge
2023: Music Demixing Track,” Transactions of the Int.
Society for Music Information Retrieval, 2023.
[26] G. Zhu, J. Darefsky, F. Jiang, A. Selitskiy, and
Z. Duan, “Music source separation with generative
flow,” IEEE Signal Processing Letters, vol. 29,
pp. 2288–2292, 2022. [Online]. Available:
http://dx.doi.org/10.1109/LSP.2022.3219355
[27] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez,
“Content based singing voice extraction from a musical
mixture,” in Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP), Online, 2020, pp. 781–785.
[28] W. A. Jassim, J. Skoglund, M. Chinen, and A. Hines,
“WARP-Q: quality prediction for generative neural
speech codecs,” in IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), Online, 2021,
pp. 401–405.
[29] J. Song, C. Meng, and S. Ermon, “Denoising diffusion
implicit models,” in Int. Conf. on Learning
Representations (ICLR), Online, 2021.
[30] F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf,
“Moûsai: Text-to-music generation with long-context
latent diffusion,” 2023.
[31] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent:
Consistency autoencoders for latent audio compression,”
in 25th Int. Society for Music Information
Retrieval Conf. (ISMIR), San Francisco, USA, 2024.
[32] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever,
“Consistency models,” arXiv preprint arXiv:2303.01469,
2023.
[33] D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and
X. Serra, “The MTG-Jamendo dataset for automatic
music tagging,” in Machine Learning for Music
Discovery Workshop, International Conference on
Machine Learning (ICML 2019), Long Beach, CA,
United States, 2019. [Online]. Available:
http://hdl.handle.net/10230/42015
[34] J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and
S. Lattner, “Diff-A-Riff: Musical Accompaniment
Co-creation via Latent Diffusion Models,” in 25th Int.
Society for Music Information Retrieval Conf. (ISMIR),
San Francisco, USA, 2024.
[35] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic,
W. Wang, and M. D. Plumbley, “AudioLDM:
Text-to-audio generation with latent diffusion models,” in
Int. Conf. on Machine Learning (ICML), Honolulu,
Hawaii, 2023.
[36] J.-S. Hwang, S.-H. Lee, and S.-W. Lee, “HiddenSinger:
High-quality singing voice synthesis via neural audio
codec and latent diffusion models,” arXiv preprint
arXiv:2306.06814, 2023.
[37] Y. Wu and K. He, “Group normalization,”
arXiv:1803.08494, 2018.
[38] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted
linear units for neural network function approximation
in reinforcement learning,” Neural Networks,
vol. 107, pp. 3–11, 2018.
[39] Y. Song and S. Ermon, “Improved techniques for
training score-based generative models,” in 34th Conf.
on Neural Information Processing Systems (NeurIPS),
2020, pp. 12 438–12 448.
[40] Y. Guo, H. Yuan, Y. Yang, M. Chen, and M. Wang,
“Gradient guidance for diffusion models: An optimization
perspective,” in Proc. of the 38th Conference
on Neural Information Processing Systems (NeurIPS),
2024.
[41] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis,
and R. Bittner, “MUSDB18 - a corpus for music
separation,” Dec. 2017. [Online]. Available:
https://doi.org/10.5281/zenodo.1117372
[42] E. Cano, D. Fitzgerald, and K. Brandenburg, “Evaluation
of quality of sound source separation algorithms:
Human perception vs quantitative metrics,” in 24th
European Signal Processing Conf. (EUSIPCO),
Budapest, Hungary, 2016, pp. 1758–1762.
[43] E. Gusó, J. Pons, S. Pascual, and J. Serrà, “On loss
functions and evaluation metrics for music source
separation,” in IEEE Int. Conf. on Acoustics, Speech and
Signal Processing (ICASSP), 2022, pp. 306–310.
[44] Y. Chae and K. Lee, “MGE-LDM: Joint latent diffusion
for simultaneous music generation and source
extraction,” arXiv:2505.23305, 2025.
[45] A. Rix, J. Beerends, M. Hollier, and A. Hekstra,
“Perceptual evaluation of speech quality (PESQ) - a new
method for speech quality assessment of telephone
networks and codecs,” in Int. Conf. on Acoustics, Speech,
and Signal Processing, 2001, pp. 749–752.
[46] A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou,
“Adapting Frechet audio distance for generative music
evaluation,” in Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP), Seoul, South Korea. IEEE,
2024, pp. 1331–1335.
[47] G. Plaja-Roglans, T. Nuttall, and X. Serra,
“compiam,” 2024. [Online]. Available:
https://mtg.github.io/compIAM/
