Looking Beyond Averaged Metrics in Music Source Separation

Author: Saurjya Sarkar; Victoria Moomjian; Basil Woods; Emmanouil Benetos; Mark Sandler

Publisher: Zenodo

DOI: 10.5281/zenodo.17706609

Source: https://zenodo.org/records/17706609/files/000098.pdf

PERCEPTUAL ERRORS IN MUSIC SOURCE SEPARATION:
LOOKING BEYOND SDR AVERAGES
Saurjya Sarkar (1), Victoria Moomjian (1), Basil Woods (2), Emmanouil Benetos (1), Mark Sandler (1)
(1) Queen Mary University of London, (2) AudioStrip Ltd.
[email protected], [email protected]
ABSTRACT
Music source separation extracts individual instrument/performer stems from mixed musical recordings. Performance is typically evaluated using metrics like source-to-distortion ratio (SDR), with higher values indicating better separation. However, relying on global SDR averages across test datasets provides limited insight into model performance. While improved average SDR suggests superior performance, it reveals little about specific strengths and weaknesses. Additionally, averaged metrics fail to account for SDR variance, which depends heavily on the musical characteristics of the test set. These limitations make cross-task/stem comparisons potentially misleading. To address these issues, we conducted a listening study evaluating source separation models across three tasks: 6-stem separation, Lead vs. Backing Vocal Separation, and Duet Separation. Participants assessed diverse examples, particularly those with poor objective or subjective performance. We categorised failure cases into three error types and found that while SDR generally correlates with perceptual ratings, significant deviations occur. Some errors substantially impact human perception but aren't well captured by SDR, while in other cases, listeners perceive better quality than SDR suggests. Our findings reveal nuances missed in current evaluation paradigms and highlight the need to include error categorisation and performance distribution alongside averaged metrics.
1. INTRODUCTION
Deep learning based audio source separation models are trained to separate individual sound sources or sound classes from a mixture of audio sources. The complexity of the task is determined by both the constituents of the mixture and the definition of the separation target in the context of that mixture. In the popular Music Source Separation task, which was introduced in SiSEC18 [1], the input mixture is a mixed and mastered song consisting of a diverse range of instruments; however, the target classes are vocals, drums and bass only. Models trained to separate additional instruments from musical mixtures have not performed on par with the well-established drums, bass and vocal stem (4-stem separation) task [2, 3]. While the complexity and lack of consistent data for these other separation tasks are a significant factor, the averaged metrics used to compare different models for a given task and test dataset might be misleading when comparing models across different tasks and datasets. Even within a given separation task, comparing models based on average metrics provides limited insights.
© S. Sarkar, V. Moomjian, B. Woods, E. Benetos and M. Sandler. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S. Sarkar, V. Moomjian, B. Woods, E. Benetos and M. Sandler, “Perceptual Errors in Music Source Separation: looking beyond SDR averages”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In this work, we attempt to evaluate how users perceive music source separation results across three music separation tasks: 6-stem separation, Lead vs. Backing Vocal Separation and chamber duet separation. We ask the users to rate 2 aspects of the task: the difficulty of the separation task and the quality of the separated output. We then compare these ratings across tasks and failure categories, in order to gain insights into how well objective metrics like SDR correlate with the users' perceptual ratings and to identify if and when the objective metrics deviate from the perceptual ratings. We present the participants with different types of poorly separated examples with a wide range of output SDR scores in our study. The contributions of this study include a framework for categorising performance issues in music source separation, the identification of distinct types of failures that the models suffer from, the interdependence of musical contexts and separation objectives that cause these error types, and insights into the varying impact of these error types on listeners under different musical contexts.
2. BACKGROUND
Source separation has flourished with deep learning across speech, music, and general audio domains, benefiting from standardised task descriptions and evaluation metrics established in the pre-deep-learning era through the SiSEC 2008 [4] public evaluation campaign. This campaign sought a comprehensive evaluation and understanding of source separation systems, including music source separation. While the definition of "stems" (subsets of distinct audio sources) was initially example-specific and dependent on individual production processes, SiSEC 2015 introduced the 4-stem (Vocals, Drums, Bass and Others) formalism which is still followed in the recent MDX21 [5] and SDX23 [6] challenges. These public evaluation campaigns use the objective metric SDR (source-to-distortion ratio) defined by Equation 1, where $\hat{s}_{\mathrm{separated}}$ is the estimated signal and $s_{\mathrm{target}}$ is the reference signal.
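\[
\mathrm{SDR} = \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^{2}}{\lVert \hat{s}_{\mathrm{separated}} - s_{\mathrm{target}} \rVert^{2}} \tag{1}
\]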
Since SDR is calculated on a frame-by-frame basis, the $\mathrm{SDR}_{\mathrm{song}}$ is calculated as the average across the 4 stems of a song. Models are then evaluated by averaging $\mathrm{SDR}_{\mathrm{song}}$ across all songs in a given test dataset [5]. It is well known that averaged metrics give limited insights into model performance under diverse conditions and may correlate poorly with human perception [7, 8]. The SDX23 challenge conducted listening tests on their submissions to validate the objective rankings; however, they did not find a clear correlation between the objective scores and perceptual ratings, with similar observations also reported in [9, 10].
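The two-level averaging described above can be sketched in a few lines of Python. This is a minimal illustration rather than the BSS-Eval implementation: it computes one SDR value over a whole signal instead of frame by frame, it applies the conventional factor of 10 so that the result is in decibels (an assumption about units), and the function names and the `eps` stabiliser are hypothetical.

```python
import math

def sdr_db(target, estimate, eps=1e-12):
    """SDR in decibels between a reference signal and an estimate (equal-length sequences)."""
    signal_energy = sum(t * t for t in target)
    error_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    # eps (an assumption) guards against division by zero and log(0)
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))

def song_sdr(stems):
    """Mean SDR over the stems of one song; `stems` maps stem name -> (target, estimate)."""
    scores = [sdr_db(target, estimate) for target, estimate in stems.values()]
    return sum(scores) / len(scores)

def dataset_sdr(songs):
    """Mean of the per-song SDR over all songs in a test set."""
    per_song = [song_sdr(stems) for stems in songs]
    return sum(per_song) / len(per_song)

def dataset_sdr_spread(songs):
    """Minimum and maximum per-song SDR, which the single average hides."""
    per_song = [song_sdr(stems) for stems in songs]
    return min(per_song), max(per_song)
```

Two songs with very different per-song scores can yield the same dataset average as two songs with similar scores, which is why the spread is returned separately here.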
Other works have attempted to explore the task of music source separation outside the 4-stem decomposition paradigm. Datasets and models have been presented for tasks such as 6-stem separation [2, 3], chamber ensemble separation [11], vocal harmony separation [12–14] and piano accompaniment separation [15]. While these methods also report SDR, it is difficult to compare the performance or generalisability of these tasks as they lack the level of task formalism, data availability and diversity of 4-stem music separation.
Recent works in music separation have explored the influence of musical characteristics of input mixtures on the efficacy of these source separation models. Sarkar et al. [13] report that harmonic overlap amongst sources has a negative impact on separation performance. Subsequently, Sarkar et al. [16], Özer et al. [17] and Jeon et al. [14] have observed that pitch overlaps/unisons significantly affect the quality of separation. Jeon et al. additionally report a 10 dB drop in quality achieved by ideal-ratio-masking (IRM) when comparing duets and unisons. Watcharasupat et al. [2] report that models perform poorly when extending beyond the 4-stem definition into guitars and piano, and are particularly insensitive to timbral differences, which makes the separation of mixtures with similar timbres particularly challenging. They also highlight the unreliability of separation performance for organ, synth stems and backing vocals, with highly variable results. Our work explores these themes further by conducting a listening study designed to evaluate the performance of source separation models in these challenging musical contexts.
3. EVALUATION DATA
We used separation results from 5 different models in our listening study, which were trained for one of three tasks: 6-stem separation, Lead/Backing Vocals separation and duet separation. We use audio stems from the URMP Dataset [18] for chamber ensemble duets, Bach Chorales and Barbershop Quartet Dataset (BCBQ) [19] for choral duets, and a private multi-track dataset of pop song covers which were downmixed to generate our 6-stem and Lead/Backing Vocals separation examples. The private dataset was used instead of MoisesDB [3] as the data categorisation mentioned in Section 5 was based on a pilot study conducted on the private licensed dataset.
3.1 6-Stem Separation
We use two HT-Demucs [20] models trained on different datasets for generating 6-stem separation examples. The models are trained to be able to extract the vocals, bass, drums, keys, guitars and "others" stems from pop songs. We were able to generate 2 different results for each 6-stem separation example, one using the pre-trained experimental model made available by the original authors and one using a model we trained using a private dataset (38.88 hours). For our private dataset, we use a similar stem definition as [3]. For piano/keys we include all keyed instruments including synths, organs and electric pianos. Our guitar stem includes both electric and acoustic guitar stems. The specific stem definitions used for the pre-trained model from [20] are unknown; however, it was observed that the pre-trained model only separated Electric and Acoustic Pianos into the keys stem, while separating synths into the "Other" stem. While we didn't explicitly include a 4-stem model in our analysis, the vocal, bass and drums stems from our 6-stem model cover the 4-stem scenario implicitly.
3.2 Lead/Backing Vocal (LV/BV) Separation
We trained HT-Demucs with three target channels: Lead Vocals (LV), Backing Vocals (BV), and Others (accompaniment). This is similar to the Main vs. Rest separation task presented by Jeon et al. [14]; however, our task also involves the separation of the accompaniment stem, which is included in the input mixtures. These stems were generated by us using private multitrack data (10.75 hours). The definitions of LV and BV stems were determined on a case by case basis, since no objective definition of Lead and Backing Vocals exists. Any singer that sings at least one verse in isolation is categorised as the "lead vocals" stem, such that duets would include both singers classified as the Lead Vocals. Harmonies and non-lyrical singing were classified as the "backing vocals" stem.
Even though our model was trained using the above stem definition for lead and backing vocals, it was observed that during inference, the model consistently separated the loudest monophonic singer as the lead vocals and any additional vocal layers as backing. Thus, if at any point the song contains only one singer (including non-lyrical singing), the model separated the solo singing voice as "Lead Vocals". On the other hand, in cases of all forms of harmonised singing, including duets, the model separated the loudest singer as "Lead Vocals" and the remaining singers as "Backing Vocals". This is discussed further in Section 5.
3.3 Duet Separation
For duet separation, we use a Dual-path Transformer Network (DPTNet) [21] with permutation invariant training (PIT) [22] on EnsembleSet [11] to separate 2 source mixtures of monophonic sources. This model was chosen as duet separation tasks have been typically approached using PIT and waveform [23] based methods in other works [13, 14]. Two versions of the DPTNet model were fine-tuned separately with limited amounts of target domain choral data from the BCBQ dataset and chamber ensemble data from the URMP dataset respectively, as described in [16]. The examples generated by these models include mixtures of identical instruments and the same gendered singers (classified as Monotimbral Ensembles and Monotimbral Choirs) and mixtures of distinct instrument families and different gendered singing duets (classified as Polytimbral Ensembles and Choirs).
Figure 1. Scatter plot of all test examples used in the listening study categorised by separation target and error type. The black line represents the linear fit between output SDR and Quality MOS across all test examples. The shaded region represents the standard deviation in Quality MOS observed for examples in a 3 dB SDR wide sliding window.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
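The idea behind permutation invariant training can be illustrated with a small sketch. It is not the DPTNet training code: mean squared error stands in for whatever objective the cited models use (an assumption), and the function names are hypothetical. The point it shows is that the loss is taken over the best assignment of output channels to reference sources, so no output channel is tied to a particular source in advance.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, targets):
    """Return (loss, permutation) for the best assignment of estimates to targets.

    permutation[i] is the index of the target matched to estimate i.
    """
    best = None
    for perm in permutations(range(len(targets))):
        # average the per-channel error under this candidate assignment
        loss = sum(mse(estimates[i], targets[p]) for i, p in enumerate(perm)) / len(targets)
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best
```

For two sources only two assignments need to be checked, which keeps the exhaustive search cheap in the duet setting.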
4. LISTENING STUDY
To evaluate how these different separation models perform under different task configurations, an online listening study was conducted where participants were invited to listen to a series of separation examples. They were presented with the input mixture, the target stem label and the separated outputs, including hidden reference, hidden low anchors, and separated results from one or more models. Assessors switch between these stimuli to directly compare the reference and the test signals. Participants rate each audio mixture and instrument stem on a continuous scale of perceived separation difficulty and quality, ranging from 0 to 100. The participants were presented with a calibration stage and 3 practice examples to familiarise themselves with the task and testing framework. Each question in the listening test had two parts: Difficulty and Quality.
Difficulty: This part presents listeners with an input mixture and asks them to listen to a specified source (e.g. guitar). They were instructed as follows: "Based on the mixture audio, do you think it is very difficult to separate this guitar from the other instruments? Or is it simple and easy to differentiate? You can also think about how you expect the guitar to sound on its own and how easy the guitar is to mentally focus on". They were then asked to rate how difficult they perceived the separation of that source to be. The source under evaluation varied in each question and was rated with the guidelines "Very easy" (0 to 20), "Easy" (20 to 40), "Neither easy nor difficult" (40 to 60), "Difficult" (60 to 80), and "Very difficult" (80 to 100).
Quality: In this part, listeners were given the separated result of the given stem label, as in the Difficulty section, to evaluate the model's separation performance. They assessed the quality based on how the estimated signal aligned with their expectations, cleanliness, and minimal bleed from other instruments. This was rated with the guidelines of "Bad" (0 to 20), "Poor" (20 to 40), "Fair" (40 to 60), "Good" (60 to 80), and "Excellent" (80 to 100).
4.1 Participants and Pre-screening
56 assessors took part in this test, and were musicians, researchers and/or audio engineers with musical and critical listening experience. The test was structured so that, after the initial practice and calibration stage, the order of the remaining test examples was randomised. This enabled us to use the results from participants who were unable to complete rating all the examples, as long as they had encountered 1 each of our hidden anchor types. 24 participants completed the entire test, while 32 participants partially completed the test. Of these 32 participants who partially completed the test, 6 had to be discarded from our analysis as they did not encounter enough of our hidden anchors. 3 participants had to be removed because they didn't rate the hidden reference and low anchors within the top 80-100 or bottom 0-20 range. An average of 30.4 responses per question were considered in this study, with each question receiving at least 27 responses and up to 37 responses.
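The screening rules above can be expressed as a short filter followed by a per-stimulus average. The sketch below is only one plausible reading of them: it assumes a participant is kept when every hidden reference they rated scored at least 80 and every low anchor they rated scored at most 20, and that at least one of each was encountered. The data layout and function names are hypothetical.

```python
def passes_prescreen(ratings, references, low_anchors):
    """`ratings` maps stimulus id -> score (0-100) for one participant."""
    seen_refs = [ratings[s] for s in references if s in ratings]
    seen_lows = [ratings[s] for s in low_anchors if s in ratings]
    if not seen_refs or not seen_lows:
        return False  # did not encounter one of each hidden anchor type
    # assumed thresholds: references in 80-100, low anchors in 0-20
    return all(r >= 80 for r in seen_refs) and all(r <= 20 for r in seen_lows)

def mean_opinion_scores(participants, references, low_anchors):
    """Average the ratings of all valid participants for each stimulus."""
    valid = [p for p in participants if passes_prescreen(p, references, low_anchors)]
    collected = {}
    for ratings in valid:
        for stimulus, score in ratings.items():
            collected.setdefault(stimulus, []).append(score)
    return {stimulus: sum(v) / len(v) for stimulus, v in collected.items()}
```

Because partial responses are allowed, each stimulus may be averaged over a different number of participants, which matches the varying response counts per question reported above.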
4.2 Excerpts
The test included 47 different input mixtures: 18 6-stem, 20 duet (10 chamber and 10 choral), and 11 LV/BV examples of length ranging from 3 to 9 seconds. Along with estimated outputs, 12 hidden references were included: 3 6-stem, 4 duet, and 5 LV/BV. Participants rated the quality of 104 estimated outputs: 50 were 6-stem, 40 were Duet, and 11 were LV/BV. The discrepancy between the number of mixtures and estimated outputs is because some examples consisted of multiple outputs of different stems, while in some cases, LV/BV and 6-stem separation share the same input mixtures.
Of the 6-stem mixtures, the estimated outputs under evaluation were 5 bass, 4 drums, 19 guitar, 5 others, 3 keys/piano, and 11 vocals with 3 additional hidden references of 1 guitar, 1 vocals, and 1 keys, respectively. Each of the 20 duet mixtures presented both estimated sources in different questions, along with 3 chamber ensemble anchors and 1 choral anchor. There were 8 lead vocal and 4 backing vocal sources evaluated which included 1 lead vocal anchor.
5. DATA CATEGORISATION
The estimated outputs presented in the listening study were chosen from diverse scenarios, including a subset of cherry-picked scenarios where the output SDR and the perceived quality of separation were poor or in disagreement, based on pilot studies. 30 estimated outputs were sampled from our test set as "Ordinary" where no explicit separation failure type was observed, while 64 examples were cherry-picked to exhibit one of the following error types: Bleed, Misclassification (Misclass), Noise, Spectral Degradation (Sdeg), Unison and Crossover-Swaps (X-Swap). These can be broadly categorised into three classes of errors.
5.1 Channel Swaps
We define channel swaps as cases where the separated output contains the output from a non-target stem and does not overlap with the target stem. This occurs during target stem silence or when target and non-target stems are swapped due to source confusion. These errors manifest as source misclassification (Misclass) or pitch/loudness crossover swaps (X-Swaps).
Misclass occurs when models successfully separate a source but place it in a different channel/stem than the ground truth expects. This was notable in "others" and "backing vocals" residual stems, where content that should be "drums" or "lead vocals" was misclassified. In the example of the drum vs. other misclassification, an electronic drum based pitched percussion was classified as "drums" in our ground truth, while the models predicted those notes as "other". In an example of misclassification between lead and backing vocals, the example contained non-lyrical female singing in the backing vocals, and the lead vocals included male lyrical singing with audio effects. The LV/BV model was able to effectively separate the two singers from the background music; however, the female singer was classified as the lead singer, as she was louder, and the male singer as backing.
X-Swaps are similar to scenarios of misclassification, but here the misclassification happens for only a part of the audio segment, so the separated source in the output switches within the audio segment. This occurred primarily in LV/BV and Duet separation. LV/BV models consistently separated the loudest singer as lead vocals with all others as backing. In duet separation, when the pitch trajectories of two sources crossed, models often switched source-channel alignment at the crossover point. These swaps in monotimbral (MonoT) source mixtures are less noticeable, while in polytimbral (PolyT) mixtures, they are more apparent.
5.2 Bleed
We define Bleed as scenarios where the separated output contains the output from a non-target stem in addition to the target stem. This occurs due to timbral ambiguity or spectral/pitch overlap of sources. For the case of pitch overlaps, we identify these scenarios as cases of Unison, where the target and a non-target source are playing notes in unison. In our experiments we observed that all our models fail to separate sources when they are in unison. In the case of duets and lead vs. backing vocal separation, this typically results in a chorus-like effect. Cases of unison in 6-stem separation are often difficult to identify as the note durations and dynamics are very different across source types, and it is observed that in case of unisons, the quieter instrument is typically absent in the output stem and is instead present in the louder non-target stem.
5.3 Artefacts
The final error category is for scenarios where there are significant separation artefacts present in the separated stem which may affect perceptual performance. Separation artefacts may be additive or subtractive, which are identified as Noise or Spectral Degradation (SDeg) in these examples. While no pattern was observed to cause the separated output to contain additive noise, spectral degradation was typically a byproduct of misclassification or unison. Cases of spectral degradation were identified where the fullness of the separated stem was not observed to be satisfactory.
Figure 2. Comparison of average input SNR and mean difficulty score (Difficulty MOS) across separation tasks.
6. SURVEY ANALYSIS
In this section, we analyse the responses collected from our listening study as described in Section 4 on the basis of the classifications of our data points as described in Section 5. The responses received across all valid participants in the survey are then averaged to generate a Mean Difficulty Score (Difficulty MOS) and Mean Quality Score (Quality MOS). These scores were compared with input SNR (loudness of target vs. mixture) and output SDR for each separation example, which were calculated using BSS-Eval.
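The input SNR is described above only as the loudness of the target relative to the mixture. One possible reading, sketched below, treats everything in the mixture other than the target as noise and compares energies in decibels. This definition, the function name and the `eps` stabiliser are assumptions for illustration, not the BSS-Eval computation used in the study.

```python
import math

def input_snr_db(target, mixture, eps=1e-12):
    """Energy of the target stem relative to the rest of the mixture, in dB."""
    # assumed definition: the non-target content is mixture minus target
    residual = [m - t for m, t in zip(mixture, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Under this reading, 0 dB means the target is as energetic as all other sources combined, and negative values mean the target is quieter than its accompaniment.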
6.1 Difficulty MOS
In Figure 2, we compare input SNR of a target stem in a given mixture with the Difficulty MOS to explore how users perceive the difficulty of a given source separation task. It was expected that input SNR would have a strong negative correlation with Difficulty MOS; however, no clear correlation was observed. When comparing the Difficulty MOS across different separation tasks, we observe that all examples outside the typical 4-stem music separation tasks were consistently rated to be of high difficulty. While the distributions of input SNRs across examples of Backing Vox, Choral and Ensemble Separation are very similar to those of Vocals and Bass separation examples, the former are consistently rated as highly difficult while the latter are rated to be of low difficulty. This suggests that users perceive the separation of harmonically correlated sources as highly challenging for a separation model.
6.2 Quality MOS
Although SIR and SAR were also calculated, no clear correlation between these metrics and subjective scores was observed. We trained a linear regression model to use SIR, SAR and SDR as input features to predict the Quality MOS. The coefficients learnt for SDR, SIR and SAR in the linear regression model were 1.18, 0.39 and 0.72 respectively, which supports our observation that SDR had the most influence on the Quality MOS. An Analysis of Variance (ANOVA) was performed to evaluate the effects of bleed types, mixture types, source types and SDR values on quality ratings. The statistics in ANOVA that indicate significant effects of the factors on quality ratings include the F-statistic and p-value, where a high F-value and a p-value less than 0.05 indicate a statistically significant effect. The results reveal that the SDR value factor rejects the null hypothesis with an F-value of 68.31 and a p-value of $2.83 \times 10^{-16}$, suggesting that the output SDR significantly affects quality ratings. Spearman's rank correlation further supports this, with a $\rho$ coefficient of 0.66. Given that $\rho$ ranges from -1 to 1, this result indicates a strong positive correlation between output SDR and Quality MOS.
Figure 3. Comparison of average output SDR and mean quality score (Quality MOS) across separation tasks.
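For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation computed on ranks rather than raw values. The sketch below is a plain-Python illustration with average ranks for ties; it is not the analysis code of the study, the function names are hypothetical, and the numbers used to exercise it are invented.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ordering matters, the coefficient is unaffected by the fact that SDR is measured in dB while MOS lies on a 0 to 100 scale.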
In Figure 3, we compare output SDR of the test examples with the Quality MOS across different stem types. While a large amount of variance is observed between the classes and within each of the classes for both SDR and Quality MOS, comparing the averages reveals a strong correlation between SDR and Quality MOS except for the "Other" stem. Note that these averages are calculated across multiple models and multiple tasks, and the examples were cherry-picked, so the variation in performance observed across stems may not be representative of the separation models used. It is also noteworthy that some stems, namely Drums, Guitars, Backing Vox and Choral, show a slight positive offset of 2-8% in their Quality MOS averages compared to stems like Vocals, Bass and Lead Vox. On the other hand, Keys and Ensemble examples show a negative offset of 2-6% in their Quality MOS averages.
6.3 Error Analysis
Figure 1 displays the Quality MOS score distribution across all listening study examples, annotated by instrument and error type. It includes "ordinary" examples in black, which were the examples verified not to fit our error categories from Section 5. The figure also shows a linear fit between output SDR and Quality MOS, which is calculated across all examples in the listening test, and a confidence interval calculated using the standard deviation within a 3 dB wide sliding window. Above 7 dB output SDR, strong agreement exists between Quality MOS and output SDR, while significant discrepancies appear for examples below 6 dB output SDR.
Figure 4. Scatter plot with linear fits between Quality MOS and output SDR categorised by error type.
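The two ingredients of this plot, a least-squares line and a windowed standard deviation, can be sketched as follows. This is an illustration under stated assumptions rather than the plotting code behind the figure: the window is taken to be centred on a chosen SDR value, the population standard deviation is used, and the function names are hypothetical.

```python
import statistics

def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y as a function of x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def windowed_std(x, y, centre, width=3.0):
    """Standard deviation of y for points whose x lies within width/2 of `centre`."""
    in_window = [b for a, b in zip(x, y) if abs(a - centre) <= width / 2.0]
    if len(in_window) < 2:
        return 0.0  # too few points to estimate a spread
    return statistics.pstdev(in_window)
```

Evaluating `windowed_std` on a grid of SDR centres and plotting the fitted line plus and minus that value gives a band of the kind described, which widens wherever ratings disagree for similar SDR.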
In Figure 4 the relationship between Quality MOS and output SDR is analysed for each error group. It is observed that Sdeg, Bleed and Unison exhibit positive correlation between Quality MOS and output SDR. However, Quality MOS in cases of Misclass, Noise and X-Swap does not seem to be represented well by SDR. It is also observed that Unison and Misclass are consistently rated higher than Bleed and Noise. Examples of Noise consistently result in the worst Quality MOS ratings; however, these examples do not show any distinctive influence on SAR or SDR when compared to other error types.
Figure 5 examines the average Quality MOS deviation from the linear fit between Quality MOS and output SDR across instrument types. We find that Bleed, Noise and Sdeg consistently show a negative bias for Quality MOS across all instruments. Conversely, Misclass shows a consistent positive bias, which implies that users rated examples of Misclass with a higher Quality rating than the output SDR would suggest. Interestingly, examples of X-Swap show an instrument-dependent impact on Quality rating deviation, where MonoT swaps showed a positive shift in Quality MOS whereas swaps in PolyT mixtures and Lead Vocals show a negative shift in Quality MOS. These shifts were more pronounced in Ensemble duets, and less so in Choral duets. The lower average impact on choral duets may be due to the fact that users may be more sensitive to individual vocal identities/timbres than to gender, which would have influenced the shift in the averages when duets are considered as MonoT and PolyT based on gender.
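The deviation analysis amounts to averaging residuals from a fitted line within each error category. A minimal sketch is given below, assuming the slope and intercept of the global Quality MOS versus SDR fit are already known; the function name and the tuple layout are hypothetical.

```python
def mean_deviation_by_group(examples, slope, intercept):
    """Mean residual (observed MOS minus fitted MOS) for each group.

    `examples` is an iterable of (group, sdr_db, quality_mos) tuples.
    """
    residuals = {}
    for group, sdr, mos in examples:
        predicted = slope * sdr + intercept  # MOS expected from SDR alone
        residuals.setdefault(group, []).append(mos - predicted)
    return {group: sum(r) / len(r) for group, r in residuals.items()}
```

A positive value means listeners rated the group higher than its SDR would predict, and a negative value means they rated it lower.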
7. CONCLUSION
In this study we explore the relationship between subjective ratings of difficulty and quality of music source separation tasks and their respective objective metrics, input SNR and output SDR. We show that user perception of separation difficulty is highly dependent on the musical context and separation target, and does not strongly correlate with input SNR. Our observations show that SDR on average is a representative objective metric for music source separation, but subjective metrics consistently agree with SDR only above 7 dB. On the other hand, our study highlights the large variance and potential disagreement between subjective Quality ratings and SDR for values less than 7 dB. Our classification of various error types shows that a mismatch between task definition and separation capabilities based on target ambiguity leads to disagreement between perceptual ratings and objective metrics like SDR, which results in much higher Quality MOS than expected for low SDR examples from tasks beyond 4-stem separation. We also highlight the failure of objective metrics to capture Noise as an error type, especially in examples with SDRs between 0-6 dB, which receive very low Quality ratings, resulting in very high variance in this range.
Figure 5. Average deviation from Quality MOS-SDR linear fit categorised by separation target and error type.
Our results highlight the challenges faced in music source separation tasks which explore beyond the typical 4-stem Vocals, Drums, Bass and Others tasks. Timbral ambiguity in stems like keys and guitars significantly affects separation performance. On the other hand, for the task of Lead Vocal Separation, subjective definitions of Lead Vocals are not consistent, and models are instead shown to be capable of separating the loudest monophonic source within a vocal stem. Meanwhile, for PIT-based ensemble separation, we observe that models often do not preserve source consistency across a separation frame and may suffer from channel swaps, which significantly affect SDR performance but do not always negatively influence perceptual ratings. We also show that models cannot separate unisons effectively across all 3 tasks; however, these failures are accurately captured by SDR, and the resulting influence on perceptual ratings is in agreement with SDR.
Our findings highlight the need for future source separation evaluations to report performance distribution and categorised failures alongside averaged metrics. This would improve our ability to compare models across tasks and may also shed light on how they work.
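As an illustration of what such a report could look like, the sketch below summarises SDR per failure category with quartiles and extremes next to the mean. The choice of statistics, the function names and the data layout are assumptions made for this example, not a format proposed in the study.

```python
import statistics

def summarise(scores):
    """Mean plus distribution statistics for a list of SDR values in dB."""
    if len(scores) < 2:
        q1 = median = q3 = scores[0]  # quantiles need at least two points
    else:
        q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "q1": q1, "median": median, "q3": q3,
        "min": min(scores), "max": max(scores),
    }

def report_by_category(examples):
    """`examples` is an iterable of (category, sdr_db) pairs."""
    grouped = {}
    for category, sdr in examples:
        grouped.setdefault(category, []).append(sdr)
    return {category: summarise(values) for category, values in grouped.items()}
```

Reporting the count per category alongside the quartiles also makes it visible when an average rests on only a handful of examples.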
8. ACKNOWLEDGMENTS
This work was supported by UKRI - Innovate UK (Project no. 10102241). The duet separation models were trained using the Baskerville Tier 2 HPC service (https://www.baskerville.ac.uk/), funded by the EPSRC and UKRI through the World Class Labs scheme (EP/T022221/1) and the Digital Research Infrastructure programme (EP/W032244/1) and is operated by Advanced Research Computing at the University of Birmingham. The 6-stem and LV/BV models were trained on GPUs provided and funded by Audiostrip LTD.
9. ETHICS STATEMENT
This user study obtained the necessary ethical approval (QMERC20.565.DSEECS24.049) from the Queen Mary Ethics of Research Committee. The private dataset used was purchased and licensed by AudioStrip Ltd., and they owned the appropriate rights to use as required by this study.
10. REFERENCES
[1] F.-R. Stöter, A. Liutkus, and N. Ito, “The 2018 signal separation evaluation campaign,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2018, pp. 293–305.
[2] K. N. Watcharasupat and A. Lerch, “A stem-agnostic single-decoder system for music source separation beyond four stems,” in Proceedings of the 25th International Society for Music Information Retrieval (ISMIR), San Francisco, CA, USA, Nov. 2024.
[3] I. G. Pereira, F. Araujo, F. Korzeniowski, and R. Vogl, “Moisesdb: A dataset for source separation beyond 4 stems,” in ISMIR 2023 Hybrid Conference, 2023.
[4] E. Vincent, S. Araki, and P. Bofill, “The 2008 signal separation evaluation campaign: A community-based approach to large-scale evaluation,” in International Conference on Independent Component Analysis and Signal Separation. Springer, 2009, pp. 734–741.
[5] Y. Mitsufuji, G. Fabbro, S. Uhlich, F.-R. Stöter, A. Défossez, M. Kim, W. Choi, C.-Y. Yu, and K.-W. Cheuk, “Music demixing challenge 2021,” Frontiers in Signal Processing, vol. 1, p. 808395, 2022.
[6] G. Fabb o, S. Uhlich, C.-H. Lai, W. Choi, M. Ma ínez-
Ramí ez, W. Liao, I. Gadelha, G. Ramos, E. Hsu,
H. Rod igues e al., “The sound demixing challenge
2023–music demixing ack,” T ansac ions o he In-
e na ional Socie y o Music In o ma ion Re ie al,
ol. 7, no. 1, 2024.
[7] E. Cano, D. Fi zGe ald, and K. B andenbu g, “E alu-
a ion o quali y o sound sou ce sepa a ion algo i hms:
Human pe cep ion s quan i a i e me ics,” in 2016
24 h Eu opean Signal P ocessing Con e ence (EU-
SIPCO). IEEE, 2016, pp. 1758–1762.
[8] D. Ward, H. Wierstorf, R. D. Mason, E. M. Grais, and
M. D. Plumbley, “BSS Eval or PEASS? Predicting the
perception of singing-voice separation,” in 2018 IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 2018, pp. 596–600.
[9] M. Torcoli, T. Kastner, and J. Herre, “Objective
measures of perceptual audio quality reviewed: An
evaluation of their application domain dependence,”
IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 29, pp. 1530–1541, 2021.
[10] E. Rumbold, G. Tzanetakis, and B. Pardo,
“Correlations between objective and subjective evaluations of
music source separation,” 21st Sound and Music
Computing Conference, SMC 2024, 2024.
[11] S. Sarkar, E. Benetos, and M. Sandler,
“EnsembleSet: a new high quality synthesised dataset for
chamber ensemble separation,” in Proc. of the 23rd
International Society for Music Information Retrieval
Conference (ISMIR), 2022, pp. 625–632.
[12] T. Nakamura, S. Takamichi, N. Tanji, S. Fukayama,
and H. Saruwatari, “jaCappella corpus: A Japanese
a cappella vocal ensemble corpus,” in ICASSP 2023-
2023 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2023,
pp. 1–5.
[13] S. Sarkar, E. Benetos, and M. Sandler, “Vocal
harmony separation using time-domain neural networks,”
in Proc. Interspeech 2021, 2021, pp. 3515–3519.
[14] C.-B. Jeon, H. Moon, K. Choi, B. S. Chon, and
K. Lee, “MedleyVox: An evaluation dataset for
multiple singing voices separation,” in ICASSP 2023-2023
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2023, pp.
1–5.
[15] Y. Özer, S. Schwär, V. Arifi-Müller, J. Lawrence,
E. Sen, and M. Müller, “Piano concerto dataset (PCD):
A multitrack dataset of piano concertos,” Transactions
of the International Society for Music Information
Retrieval, vol. 6, no. 1, 2023.
[16] S. Sarkar, L. Thorpe, E. Benetos, and M. Sandler,
“Leveraging synthetic data for improving chamber
ensemble separation,” in 2023 IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics
(WASPAA), 2023, pp. 1–5.
[17] Y. Özer and M. Müller, “Source separation of
piano concertos using musically motivated augmentation
techniques,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 32, pp. 1214–
1225, 2024.
[18] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma,
“Creating a multitrack classical music performance
dataset for multimodal music analysis: Challenges,
insights, and applications,” IEEE Transactions on
Multimedia, vol. 21, no. 2, pp. 522–535, 2018.
[19] R. Schramm, E. Benetos et al., “Automatic
transcription of a cappella recordings from multiple singers.”
Audio Engineering Society, 2017.
[20] S. Rouard, F. Massa, and A. Défossez, “Hybrid
transformers for music source separation,” in ICASSP 23,
2023.
[21] J. Chen, Q. Mao, and D. Liu, “Dual-Path Transformer
Network: Direct Context-Aware Modeling for
End-to-End Monaural Speech Separation,” in Proc.
Interspeech 2020, 2020, pp. 2642–2646.
[22] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen,
“Permutation invariant training of deep models for
speaker-independent multi-talker speech separation,” in 2017
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2017, pp.
241–245.
[23] Y. Luo and N. Mesgarani, “TasNet: time-domain
audio separation network for real-time, single-channel
speech separation,” in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2018, pp. 696–700.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
