PERCEPTUAL ERRORS IN MUSIC SOURCE SEPARATION:
LOOKING BEYOND SDR AVERAGES
Saurjya Sarkar¹, Victoria Moomjian¹, Basil Woods², Emmanouil Benetos¹, Mark Sandler¹
¹ Queen Mary University of London, ² AudioStrip Ltd.
[email protected], [email protected]
ABSTRACT
Music source separation extracts individual instrument/performer stems from mixed musical recordings. Performance is typically evaluated using metrics like source-to-distortion ratio (SDR), with higher values indicating better separation. However, relying on global SDR averages across test datasets provides limited insight into model performance. While improved average SDR suggests superior performance, it reveals little about specific strengths and weaknesses. Additionally, averaged metrics fail to account for SDR variance, which depends heavily on the musical characteristics of the test set. These limitations make cross-task/stem comparisons potentially misleading. To address these issues, we conducted a listening study evaluating source separation models across three tasks: 6-stem separation, Lead vs. Backing Vocal Separation, and Duet Separation. Participants assessed diverse examples, particularly those with poor objective or subjective performance. We categorised failure cases into three error types and found that while SDR generally correlates with perceptual ratings, significant deviations occur. Some errors substantially impact human perception but aren't well captured by SDR, while in other cases, listeners perceive better quality than SDR suggests. Our findings reveal nuances missed in current evaluation paradigms and highlight the need to include error categorisation and performance distribution alongside averaged metrics.
© S. Sarkar, V. Moomjian, B. Woods, E. Benetos and M. Sandler. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S. Sarkar, V. Moomjian, B. Woods, E. Benetos and M. Sandler, “Perceptual Errors in Music Source Separation: looking beyond SDR averages”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

1. INTRODUCTION
Deep learning based audio source separation models are trained to separate individual sound sources or sound classes from a mixture of audio sources. The complexity of the task is determined both by the constituents of the mixture and by the definition of the separation target in the context of that mixture. In the popular Music Source Separation task, which was introduced in SiSEC18 [1], the input mixture is a mixed and mastered song consisting of a diverse range of instruments; however, the target classes are vocals, drums and bass only. Models trained to separate additional instruments from musical mixtures have not performed on par with the well-established drums, bass and vocal stem (4-stem separation) task [2, 3]. While the complexity and lack of consistent data for these other separation tasks are a significant factor, the averaged metrics used to compare different models for a given task and test dataset might be misleading when comparing models across different tasks and datasets. Even within a given separation task, comparing models based on average metrics provides limited insights.
In this work, we attempt to evaluate how users perceive music source separation results across three music separation tasks: 6-stem separation, Lead vs. Backing Vocal Separation and chamber duet separation. We ask the users to rate 2 aspects of the task: the difficulty of the separation task and the quality of the separated output. We then compare these ratings across tasks and failure categories, in order to gain insights into how well objective metrics like SDR correlate with the users' perceptual ratings and to identify if and when the objective metrics deviate from the perceptual ratings. We present the participants with different types of poorly separated examples with a wide range of output SDR scores in our study. The contributions of this study include a framework for categorising performance issues in music source separation, the identification of distinct types of failures that the models suffer from, an account of the interdependence of musical contexts and separation objectives that causes these error types, and insights into the varying impact of these error types on listeners under different musical contexts.
2. BACKGROUND
Source separation has flourished with deep learning across speech, music, and general audio domains, benefiting from standardised task descriptions and evaluation metrics established in the pre-deep-learning era through the SiSEC 2008 [4] public evaluation campaign. This campaign sought a comprehensive evaluation and understanding of source separation systems, including music source separation. While the definition of "stems" (subsets of distinct audio sources) was initially example-specific and dependent on individual production processes, SiSEC 2015 introduced the 4-stem (Vocals, Drums, Bass and Others) formalism which is still followed in the recent MDX21 [5] and SDX23 [6] challenges. These public evaluation campaigns use the objective metric SDR (source-to-distortion ratio) defined by Equation 1, where $\hat{s}_{\mathrm{separated}}$ is the estimated signal and $s_{\mathrm{target}}$ is the reference signal.
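$$\mathrm{SDR} = 10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^{2}}{\lVert \hat{s}_{\mathrm{separated}} - s_{\mathrm{target}} \rVert^{2}} \tag{1}$$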
Since SDR is calculated on a frame-by-frame basis, $\mathrm{SDR}_{\mathrm{song}}$ is calculated as the average across the 4 stems of a song. Models are then evaluated by averaging $\mathrm{SDR}_{\mathrm{song}}$ across all songs in a given test dataset [5]. It is well known that averaged metrics give limited insights into model performance under diverse conditions and may correlate poorly with human perception [7, 8]. The SDX23 challenge conducted listening tests on its submissions to validate the objective rankings; however, they did not find a clear correlation between the objective scores and perceptual ratings, with similar observations also reported in [9, 10].
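The averaging scheme described above can be illustrated with a minimal sketch. It assumes the decibel form of Equation 1 (a factor of 10 in front of the logarithm) and computes SDR over a whole signal rather than per frame; the function names `sdr_db`, `song_sdr` and `dataset_sdr` are hypothetical and are not taken from any evaluation toolkit.

```python
import math

def sdr_db(reference, estimate, eps=1e-12):
    """SDR in dB for one stem: 10*log10(||s||^2 / ||s_hat - s||^2).

    Assumption: computed over the whole signal; eps avoids division by zero.
    """
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((e - s) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))

def song_sdr(stem_sdrs):
    """Average the per-stem SDRs of one song (dict: stem name -> SDR in dB)."""
    return sum(stem_sdrs.values()) / len(stem_sdrs)

def dataset_sdr(songs):
    """Average the song-level SDR across a list of songs."""
    return sum(song_sdr(stems) for stems in songs) / len(songs)

# Two hypothetical models with the same dataset average but very different spread.
model_a = [{"vocals": 6.0, "drums": 6.0}, {"vocals": 6.0, "drums": 6.0}]
model_b = [{"vocals": 12.0, "drums": 12.0}, {"vocals": 0.0, "drums": 0.0}]
print(dataset_sdr(model_a), dataset_sdr(model_b))
```

Both hypothetical models report the same average even though one of them fails on an entire song, which is the kind of information a single averaged figure cannot convey.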
Other works have attempted to explore the task of music source separation outside the 4-stem decomposition paradigm. Datasets and models have been presented for tasks such as 6-stem separation [2, 3], chamber ensemble separation [11], vocal harmony separation [12–14] and piano accompaniment separation [15]. While these methods also report SDR, it is difficult to compare the performance or generalisability of these tasks as they lack the level of task formalism, data availability and diversity of 4-stem music separation.
Recent works in music separation have explored the influence of musical characteristics of input mixtures on the efficacy of these source separation models. Sarkar et al. [13] report that harmonic overlap amongst sources has a negative impact on separation performance. Subsequently, Sarkar et al. [16], Özer et al. [17] and Jeon et al. [14] have observed that pitch overlaps/unisons significantly affect the quality of separation. Jeon et al. additionally report a 10 dB drop in quality achieved by ideal-ratio-masking (IRM) when comparing duets and unisons. Watcharasupat et al. [2] report that models perform poorly when extending beyond the 4-stem definition into guitars and piano, and are particularly insensitive to timbral differences, which makes the separation of mixtures with similar timbres particularly challenging. They also highlight the unreliability of separation performance for organ, synth stems and backing vocals, with highly variable results. Our work explores these themes further by conducting a listening study designed to evaluate the performance of source separation models in these challenging musical contexts.
3. EVALUATION DATA
We used separation results from 5 different models in our listening study, which were trained for one of three tasks: 6-stem separation, Lead/Backing Vocals separation and duet separation. We use audio stems from the URMP Dataset [18] for chamber ensemble duets, Bach Chorales and Barbershop Quartet Dataset (BCBQ) [19] for choral duets, and a private multi-track dataset of pop song covers which were downmixed to generate our 6-stem and Lead/Backing Vocals separation examples. The private dataset was used instead of MoisesDB [3] as the data categorisation mentioned in Section 5 was based on a pilot study conducted on the private licensed dataset.
3.1 6-Stem Separation
We use two HT-Demucs [20] models trained on different datasets for generating 6-stem separation examples. The models are trained to be able to extract the vocals, bass, drums, keys, guitars and "others" stems from pop songs. We were able to generate 2 different results for each 6-stem separation example, one using the pre-trained experimental model made available by the original authors and one using a model we trained using a private dataset (38.88 hours). For our private dataset, we use a similar stem definition as [3]. For piano/keys we include all keyed instruments including synths, organs and electric pianos. Our guitar stem includes both electric and acoustic guitar stems. The specific stem definitions used for the pre-trained model from [20] are unknown; however, it was observed that the pre-trained model only separated Electric and Acoustic Pianos into the keys stem, while separating synths into the "Other" stem. While we didn't explicitly include a 4-stem model in our analysis, the vocal, bass and drums stems from our 6-stem model cover the 4-stem scenario implicitly.
3.2 Lead/Backing Vocal (LV/BV) Separation
We trained HT-Demucs with three target channels: Lead Vocals (LV), Backing Vocals (BV), and Others (accompaniment). This is similar to the Main vs. Rest separation task presented by Jeon et al. [14]; however, our task also involves the separation of the accompaniment stem, which is included in the input mixtures. These stems were generated by us using private multitrack data (10.75 hours). The definitions of LV and BV stems were determined on a case by case basis, since no objective definition of Lead and Backing Vocals exists. Any singer that sings at least one verse in isolation is categorised as the "lead vocals" stem, such that duets would include both singers classified as the Lead Vocals. Harmonies and non-lyrical singing were classified as the "backing vocals" stem.
Even though our model was trained using the above stem definition for lead and backing vocals, it was observed that during inference, the model consistently separated the loudest monophonic singer as the lead vocals and any additional vocal layers as backing. Thus, if at any point the song contains only one singer (including non-lyrical singing), the model separated the solo singing voice as "Lead Vocals". On the other hand, in cases of all forms of harmonised singing, including duets, the model separated the loudest singer as "Lead Vocals" and the remaining singers as "Backing Vocals". This is discussed further in Section 5.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

3.3 Duet Separation
For duet separation, we use a Dual-path Transformer Network (DPTNet) [21] with permutation invariant training (PIT) [22] on EnsembleSet [11] to separate 2 source mixtures of monophonic sources. This model was chosen as duet separation tasks have been typically approached using PIT and waveform [23] based methods in other works [13, 14]. Two versions of the DPTNet model were fine-tuned separately with limited amounts of target domain choral data from the BCBQ dataset and chamber ensemble data from the URMP dataset respectively, as described in [16]. The examples generated by these models include mixtures of identical instruments and the same gendered singers (classified as Monotimbral Ensembles and Monotimbral Choirs) and mixtures of distinct instrument families and different gendered singing duets (classified as Polytimbral Ensembles and Choirs).

Figure 1. Scatter plot of all test examples used in the listening study categorised by separation target and error type. The black line represents the linear fit between output SDR and Quality MOS across all test examples. The shaded region represents the standard deviation in Quality MOS observed for examples in a 3 dB SDR wide sliding window.
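Permutation invariant training, as used for the duet models in Section 3.3, scores the estimated sources against every possible ordering of the references and keeps the best one. The sketch below illustrates the idea only: it uses mean squared error as a stand-in pairwise loss, which is an assumption, and the names `mse` and `pit_loss` are hypothetical rather than the training code used in this study.

```python
from itertools import permutations

def mse(estimate, reference):
    """Mean squared error between two equal-length signals."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)

def pit_loss(estimates, references, loss_fn=mse):
    """Return (lowest average loss, reference order that achieves it).

    Every assignment of references to output channels is tried, so the
    model is not penalised for emitting the sources in a different order.
    """
    best_loss, best_perm = None, None
    for perm in permutations(range(len(references))):
        loss = sum(loss_fn(estimates[ch], references[src])
                   for ch, src in enumerate(perm)) / len(perm)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Outputs arrive in swapped order; PIT still finds a zero-loss assignment.
refs = [[1.0, 1.0], [0.0, 0.0]]
ests = [[0.0, 0.0], [1.0, 1.0]]
print(pit_loss(ests, refs))
```

Because the best ordering is chosen per training segment, nothing in this objective ties a given source to a fixed output channel across segments, which is relevant to the channel swaps discussed in Section 5.1.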
4. LISTENING STUDY
To evaluate how these different separation models perform under different task configurations, an online listening study was conducted where participants were invited to listen to a series of separation examples. They were presented with the input mixture, the target stem label and the separated outputs, including hidden reference, hidden low anchors, and separated results from one or more models. Assessors switch between these stimuli to directly compare the reference and the test signals. Participants rate each audio mixture and instrument stem on a continuous scale of perceived separation difficulty and quality, ranging from 0 to 100. The participants were presented with a calibration stage and 3 practice examples to familiarise themselves with the task and testing framework. Each question in the listening test had two parts: Difficulty and Quality.
Difficulty: This part presents listeners with an input mixture and asks them to listen to a specified source (e.g. guitar). Listeners were instructed as follows: "Based on the mixture audio, do you think it is very difficult to separate this guitar from the other instruments? Or is it simple and easy to differentiate? You can also think about how you expect the guitar to sound on its own and how easy the guitar is to mentally focus on". They were then asked to rate how difficult they perceived the separation of that source to be. The source under evaluation varied in each question and was rated with the guidelines "Very easy" (0 to 20), "Easy" (20 to 40), "Neither easy nor difficult" (40 to 60), "Difficult" (60 to 80), and "Very difficult" (80 to 100).
Quality: In this part, listeners were given the separated result for the given stem label, as in the Difficulty section, to evaluate the model's separation performance. They assessed the quality based on how the estimated signal aligned with their expectations, cleanliness, and minimal bleed from other instruments. This was rated with the guidelines of "Bad" (0 to 20), "Poor" (20 to 40), "Fair" (40 to 60), "Good" (60 to 80), and "Excellent" (80 to 100).
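The two rating scales above share the same five 20-point bands, so a continuous score can be mapped back to its guideline label. The sketch below is only illustrative: the text does not say which band a boundary score such as 20 belongs to, so treating each band as lower-inclusive (with 100 in the top band) is an assumption, and `band_label` is a hypothetical helper.

```python
DIFFICULTY_LABELS = ["Very easy", "Easy", "Neither easy nor difficult",
                     "Difficult", "Very difficult"]
QUALITY_LABELS = ["Bad", "Poor", "Fair", "Good", "Excellent"]

def band_label(score, labels):
    """Map a 0-100 rating to one of five 20-point guideline bands.

    Assumption: bands are lower-inclusive, and a score of 100 falls in the top band.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    index = min(int(score // 20), len(labels) - 1)
    return labels[index]

print(band_label(72.5, QUALITY_LABELS))     # Good
print(band_label(35.0, DIFFICULTY_LABELS))  # Easy
```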
4.1 Participants and Pre-screening
56 assessors took part in this test, and were musicians, researchers and/or audio engineers with musical and critical listening experience. The test was structured so that, after the initial practice and calibration stage, the order of the remaining test examples was randomised. This enabled us to use the results from participants who were unable to complete rating all the examples, as long as they had encountered 1 each of our hidden anchor types. 24 participants completed the entire test, while 32 participants partially completed the test. Of these 32 participants who partially completed the test, 6 had to be discarded from our analysis as they did not encounter enough of our hidden anchors. 3 participants had to be removed because they didn't rate the hidden reference and low anchors within the top 80-100 and bottom 0-20 ranges, respectively. An average of 30.4 responses per question were considered in this study, with each question receiving at least 27 responses and up to 37 responses.
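The pre-screening rule above can be expressed as a small filter. This is a sketch under stated assumptions: the text does not say whether every anchor rating or only their average had to fall in range, so the code requires every one to do so, and the record layout (a list of `(kind, score)` pairs) and the name `keep_participant` are hypothetical.

```python
def keep_participant(ratings, reference_min=80, low_anchor_max=20):
    """Decide whether a participant's responses are retained.

    ratings: list of (kind, score), kind in {"reference", "low_anchor", "test"}.
    Assumption: every hidden reference must be rated >= reference_min and
    every low anchor <= low_anchor_max.
    """
    references = [score for kind, score in ratings if kind == "reference"]
    low_anchors = [score for kind, score in ratings if kind == "low_anchor"]
    # Partial completions are usable only if both anchor types were encountered.
    if not references or not low_anchors:
        return False
    return (all(score >= reference_min for score in references)
            and all(score <= low_anchor_max for score in low_anchors))

# Hypothetical participants for illustration only.
complete = [("reference", 95), ("low_anchor", 5), ("test", 60)]
no_anchor = [("reference", 90), ("test", 40)]
unreliable = [("reference", 55), ("low_anchor", 10), ("test", 70)]
print([keep_participant(p) for p in (complete, no_anchor, unreliable)])
```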
4.2 Excerpts
The test included 47 different input mixtures: 18 6-stem, 20 duet (10 chamber and 10 choral), and 11 LV/BV examples of length ranging from 3 to 9 seconds. Along with estimated outputs, 12 hidden references were included: 3 6-stem, 4 duet, and 5 LV/BV. Participants rated the quality of 104 estimated outputs: 50 were 6-stem, 40 were Duet, and 11 were LV/BV. The discrepancy between the number of mixtures and estimated outputs is because some examples consisted of multiple outputs of different stems, while in some cases, LV/BV and 6-stem separation share the same input mixtures.
Of the 6-stem mixtures, the estimated outputs under evaluation were 5 bass, 4 drums, 19 guitar, 5 others, 3 keys/piano, and 11 vocals with 3 additional hidden references of 1 guitar, 1 vocals, and 1 keys, respectively. Each of the 20 duet mixtures presented both estimated sources in different questions, along with 3 chamber ensemble anchors and 1 choral anchor. There were 8 lead vocal and 4 backing vocal sources evaluated which included 1 lead vocal anchor.
5. DATA CATEGORISATION
The estimated outputs presented in the listening study were chosen from diverse scenarios, including a subset of cherry-picked scenarios where the output SDR and the perceived quality of separation were poor or in disagreement, based on pilot studies. 30 estimated outputs were sampled from our test set as "Ordinary" where no explicit separation failure type was observed, while 64 examples were cherry-picked to exhibit one of the following error types: Bleed, Misclassification (Misclass), Noise, Spectral Degradation (Sdeg), Unison and Crossover-Swaps (X-Swap), which can be broadly categorised into three classes of errors.
5.1 Channel Swaps
We define channel swaps as cases where the separated output contains the output from a non-target stem and does not overlap with the target stem. This occurs during target stem silence or when target and non-target stems are swapped due to source confusion. These errors manifest as source misclassification (Misclass) or pitch/loudness crossover swaps (X-Swaps).
Misclass occurs when models successfully separate a source but place it in a different channel/stem than the ground truth expects. This was notable in "others" and "backing vocals" residual stems, where content that should be "drums" or "lead vocals" was misclassified. In the example of the drum vs. other misclassification, an electronic drum based pitched percussion was classified as "drums" in our ground truth, while the models predicted those notes as "other". In an example of misclassification between lead and backing vocals, the example contained non-lyrical female singing in the backing vocals, and the lead vocals included male lyrical singing with audio effects. The LV/BV model was able to effectively separate the two singers from the background music; however, the female singer was classified as the lead singer as she was louder, and the male singer as backing.
X-Swaps are similar to scenarios of misclassification, but here the misclassification happens for only a part of the audio segment, so the separated source in the output switches within the segment. This occurred primarily in LV/BV and Duet separation. LV/BV models consistently separated the loudest singer as lead vocals with all others as backing. In duet separation, when the pitch trajectories of two sources crossed, models often switched source-channel alignment at the crossover point. These swaps are less noticeable in monotimbral (MonoT) source mixtures, while in polytimbral (PolyT) mixtures they are more apparent.
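One way to make the notion of a crossover swap concrete for a two-source output is to ask, frame by frame, which output channel is closer to which reference, and to count how often that answer changes. This is a hedged illustration and not the procedure used to label the examples in this study (those were identified through pilot listening); the frame length, the squared-error criterion and the names `frame_assignments` and `count_swaps` are all assumptions.

```python
def squared_error(a, b):
    """Sum of squared differences between two equal-length frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def frame_assignments(est_a, est_b, ref_a, ref_b, frame_len):
    """Label each non-overlapping frame as 'straight' or 'swapped'."""
    labels = []
    for start in range(0, len(ref_a) - frame_len + 1, frame_len):
        frame = slice(start, start + frame_len)
        straight = (squared_error(est_a[frame], ref_a[frame])
                    + squared_error(est_b[frame], ref_b[frame]))
        swapped = (squared_error(est_a[frame], ref_b[frame])
                   + squared_error(est_b[frame], ref_a[frame]))
        labels.append("swapped" if swapped < straight else "straight")
    return labels

def count_swaps(labels):
    """Number of times the source-channel alignment changes between frames."""
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)

# Toy signals: the two outputs exchange content halfway through the segment.
ref_a, ref_b = [1.0] * 8, [-1.0] * 8
est_a = [1.0] * 4 + [-1.0] * 4
est_b = [-1.0] * 4 + [1.0] * 4
labels = frame_assignments(est_a, est_b, ref_a, ref_b, frame_len=4)
print(labels, count_swaps(labels))
```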
5.2 Bleed
We define Bleed as scenarios where the separated output contains the output from a non-target stem in addition to the target stem. This occurs due to timbral ambiguity or spectral/pitch overlap of sources. For the case of pitch overlaps, we identify these scenarios as cases of Unison, where the target and a non-target source are playing notes in unison. In our experiments we observed that all our models fail to separate sources when they are in unison. In the case of duets and lead vs. backing vocal separation, this typically results in a chorus-like effect. Cases of unison in 6-stem separation are often difficult to identify as the note durations and dynamics are very different across source types, and it is observed that in such cases the quieter instrument is typically absent from the output stem and is instead present in the louder non-target stem.
5.3 Artefacts
The final error category is for scenarios where there are significant separation artefacts present in the separated stem which may affect perceptual performance. Separation artefacts may be additive or subtractive, which are identified as Noise or Spectral Degradation (Sdeg) in these examples. While no pattern was observed to cause the separated output to contain additive noise, spectral degradation was typically a byproduct of misclassification or unison. Cases of spectral degradation were identified where the fullness of the separated stem was not observed to be satisfactory.

Figure 2. Comparison of average input SNR and mean difficulty score (Difficulty MOS) across separation tasks.
6. SURVEY ANALYSIS
In this section, we analyse the responses collected from our listening study as described in Section 4 on the basis of the classifications of our data points as described in Section 5. The responses received across all valid participants in the survey are then averaged to generate a Mean Difficulty Score (Difficulty MOS) and Mean Quality Score (Quality MOS). These scores were compared with input SNR (loudness of target vs. mixture) and output SDR of each separation example, which were calculated using BSS-Eval.
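The two quantities on the input side of this analysis can be sketched as follows. The exact input SNR computation is not spelled out in the text; the sketch assumes it is the energy ratio between the target stem and everything else in the mixture (mixture minus target), expressed in dB. The names `input_snr_db` and `mean_opinion_scores` are hypothetical.

```python
import math

def input_snr_db(target, mixture, eps=1e-12):
    """Input SNR in dB of a target stem relative to the rest of the mixture.

    Assumption: the 'noise' is the residual mixture - target.
    """
    target_energy = sum(t * t for t in target)
    residual_energy = sum((m - t) ** 2 for t, m in zip(target, mixture))
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))

def mean_opinion_scores(responses):
    """Average the ratings per question.

    responses: iterable of (question_id, score) from valid participants.
    """
    by_question = {}
    for question_id, score in responses:
        by_question.setdefault(question_id, []).append(score)
    return {q: sum(scores) / len(scores) for q, scores in by_question.items()}

# Target as loud as the rest of the mixture gives 0 dB input SNR.
print(input_snr_db([1.0, 1.0], [2.0, 2.0]))
print(mean_opinion_scores([("q1", 40), ("q1", 60), ("q2", 90)]))
```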
6.1 Difficulty MOS
In Figure 2, we compare input SNR of a target stem in a given mixture with the Difficulty MOS to explore how users perceive the difficulty of a given source separation task. It was expected that input SNR would have a strong negative correlation with Difficulty MOS; however, no clear correlation was observed. When comparing the Difficulty MOS across different separation tasks, we observe that all examples outside the typical 4-stem music separation tasks were consistently rated to be of high difficulty. While the distributions of input SNRs across examples of Backing Vox, Choral and Ensemble Separation are very similar to those of Vocals and Bass separation examples, the former are consistently rated as highly difficult while the latter are rated to be of low difficulty. This suggests that users perceive the separation of harmonically correlated sources to be highly challenging for a separation model.
6.2 Quality MOS
Although SIR and SAR were also calculated, no clear correlation between these metrics and subjective scores was observed. We trained a linear regression model to use SIR, SAR and SDR as input features to predict the Quality MOS. The coefficients learned for SDR, SIR and SAR in the linear regression model were 1.18, 0.39 and 0.72 respectively, which support our observation that SDR had the most influence on the Quality MOS. An Analysis of Variance (ANOVA) was performed to evaluate the effects of bleed types, mixture types, source types and SDR values on quality ratings. The statistics in ANOVA that indicate significant effects of the factors on quality ratings include the F-statistic and p-value, where a high F-value and a p-value less than 0.05 indicate a statistically significant effect. The results reveal that the SDR value factor rejects the null hypothesis with an F-value of 68.31 and a p-value of $2.83 \times 10^{-16}$, suggesting that the output SDR significantly affects quality ratings. Spearman's rank correlation further supports this, with a ρ coefficient of 0.66. Given that ρ ranges from -1 to 1, this result indicates a strong positive correlation between output SDR and Quality MOS.

Figure 3. Comparison of average output SDR and mean quality score (Quality MOS) across separation tasks.
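Spearman's rank correlation, used above to relate output SDR to Quality MOS, is the Pearson correlation computed on the ranks of the two variables. A minimal standard-library sketch is given below; tied values receive their average rank, the function names are hypothetical, and the numbers in the usage line are made up for illustration rather than taken from the study.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Illustrative values only: SDR in dB against a 0-100 quality rating.
print(spearman_rho([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 25.0, 100.0]))
```

Because only the ordering matters, ρ rewards any monotonic relationship between SDR and Quality MOS, not just a linear one.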
In Figure 3, we compare output SDR of the test examples with the Quality MOS across different stem types. While a large amount of variance is observed between the classes and within each of the classes for both SDR and Quality MOS, comparing the averages reveals a strong correlation between SDR and Quality MOS except for the "Other" stem. Note that these averages are calculated across multiple models and multiple tasks, and the examples were cherry-picked, so the variation in performance observed across stems may not be representative of the separation models used. It is also noteworthy that some stems, namely Drums, Guitars, Backing Vox and Choral, show a slight positive offset of 2-8% in their Quality MOS averages compared to stems like Vocals, Bass and Lead Vox. On the other hand, Keys and Ensemble examples show a negative offset of 2-6% in their Quality MOS averages.
6.3 Error Analysis
Figure 1 displays the Quality MOS score distribution across all listening study examples, annotated by instrument and error type. It includes "ordinary" examples in black, which were the examples verified not to fit our error categories from Section 5. The figure also shows a linear fit between output SDR and Quality MOS, which is calculated across all examples in the listening test, and a confidence interval calculated using the standard deviation within a 3 dB wide sliding window. Above 7 dB output SDR, strong agreement exists between Quality MOS and output SDR, while significant discrepancies appear for examples below 6 dB output SDR.

Figure 4. Scatter plot with linear fits between Quality MOS and output SDR categorised by error type.
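The two ingredients of Figure 1, a least-squares line relating output SDR to Quality MOS and a spread estimate inside a 3 dB wide window, can be sketched as below. Whether the sample or population standard deviation was used is not stated, so the population form here is an assumption, as are the function names and the toy numbers.

```python
from statistics import mean, pstdev

def linear_fit(sdr, mos):
    """Ordinary least-squares slope and intercept for mos ~ slope * sdr + intercept."""
    mean_sdr, mean_mos = mean(sdr), mean(mos)
    slope = (sum((s - mean_sdr) * (m - mean_mos) for s, m in zip(sdr, mos))
             / sum((s - mean_sdr) ** 2 for s in sdr))
    return slope, mean_mos - slope * mean_sdr

def windowed_std(sdr, mos, centre, width=3.0):
    """Standard deviation of MOS for examples whose SDR lies within the window.

    Assumption: population standard deviation; returns 0.0 for fewer than 2 points.
    """
    in_window = [m for s, m in zip(sdr, mos) if abs(s - centre) <= width / 2]
    return pstdev(in_window) if len(in_window) > 1 else 0.0

# Toy data for illustration only.
slope, intercept = linear_fit([0, 1, 2], [1, 3, 5])
print(slope, intercept)
print(windowed_std([0.0, 1.0, 10.0], [10.0, 30.0, 90.0], centre=0.5))
```

Sliding `centre` along the SDR axis and plotting the fit plus and minus `windowed_std` gives a band of the kind shown in Figure 1, which widens wherever listeners disagree with SDR.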
In Figure 4 the relationship between Quality MOS and output SDR is analysed for each error group. It is observed that Sdeg, Bleed and Unison exhibit positive correlation between Quality MOS and output SDR. However, Quality MOS in cases of Misclass, Noise and X-Swap does not seem to be represented well by SDR. It is also observed that Unison and Misclass are consistently rated higher than Bleed and Noise. Examples of Noise consistently result in the worst Quality MOS ratings; however, these examples do not show any distinctive influence on SAR or SDR when compared to other error types.
Figure 5 examines the average Quality MOS deviation from the linear fit between Quality MOS and output SDR across instrument types. We find that Bleed, Noise and Sdeg consistently show a negative bias for Quality MOS across all instruments. Conversely, Misclass shows a consistent positive bias, which implies that users rated examples of Misclass with a higher Quality rating than the output SDR would suggest. Interestingly, examples of X-Swap show an instrument-dependent impact on Quality rating deviation, where MonoT swaps showed a positive shift in Quality MOS whereas swaps in PolyT mixtures and Lead Vocals show a negative shift in Quality MOS. These shifts were more pronounced in Ensemble duets, and less so in Choral duets. The lower average impact on choral duets may be due to the fact that users may also be more sensitive to individual vocal identities/timbres rather than their gender, which would have influenced the shift in the averages when considered as MonoT and PolyT based on gender.
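The bias reported per error type is the average residual of Quality MOS against the SDR-based linear fit: a positive value means listeners rated the examples higher than SDR would predict. A minimal sketch follows, in which the slope, intercept and example values are made up for illustration and `mean_deviation_by_type` is a hypothetical name.

```python
def mean_deviation_by_type(examples, slope, intercept):
    """Average (observed MOS - MOS predicted from SDR) for each error type.

    examples: iterable of (error_type, output_sdr_db, quality_mos).
    slope, intercept: parameters of a linear fit of Quality MOS on output SDR.
    """
    residuals = {}
    for error_type, sdr, mos in examples:
        predicted = slope * sdr + intercept
        residuals.setdefault(error_type, []).append(mos - predicted)
    return {kind: sum(vals) / len(vals) for kind, vals in residuals.items()}

# Illustrative numbers only, not results from the listening study.
examples = [("Noise", 4.0, 20.0), ("Noise", 2.0, 10.0), ("Misclass", 0.0, 40.0)]
print(mean_deviation_by_type(examples, slope=5.0, intercept=20.0))
```

Grouping by both error type and separation target, instead of error type alone, would give the breakdown plotted in Figure 5.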
7. CONCLUSION
In this study we explore the relationship between subjective ratings of difficulty and quality of music source separation tasks and their respective objective metrics, input SNR and output SDR. We show that user perception of separation difficulty is highly dependent on the musical context and separation target, and does not strongly correlate with input SNR. Our observations show that SDR on average is a representative objective metric for music source separation, but subjective metrics consistently agree with SDR only above 7 dB. On the other hand, our study highlights the large variance and potential disagreement between subjective Quality ratings and SDR for values less than 7 dB. Our classification of various error types shows that a mismatch between task definition and separation capabilities based on target ambiguity leads to disagreement between perceptual ratings and objective metrics like SDR, which results in much higher Quality MOS than expected for low SDR examples from tasks beyond 4-stem separation. We also highlight the failure of objective metrics to capture Noise as an error type, especially in examples with SDRs between 0 and 6 dB, which receive very low Quality ratings, resulting in very high variance in this range.

Figure 5. Average deviation from Quality MOS-SDR linear fit categorised by separation target and error type.
Our results highlight the challenges faced in music source separation tasks which explore beyond the typical 4-stem Vocals, Drums, Bass and Others tasks. Timbral ambiguity in stems like keys and guitars significantly affects separation performance. On the other hand, for the task of Lead Vocal Separation, subjective definitions of Lead Vocals are not consistent, and models are instead shown to be capable of separating the loudest monophonic source within a vocal stem. Meanwhile, for PIT-based ensemble separation, we observe that models often do not preserve source consistency across a separation frame and may suffer from channel swaps, which significantly affect SDR performance but do not always negatively influence perceptual ratings. We also show that models cannot separate unisons effectively across all 3 tasks; however, these failures are accurately captured by SDR, and the resulting influence on perceptual ratings is in agreement with SDR.
Our findings highlight the need for future source separation evaluations to report performance distribution and categorised failures alongside averaged metrics. This would improve our ability to compare models across tasks and may also shed light on how they work.
8. ACKNOWLEDGMENTS
This work was supported by UKRI - Innovate UK (Project no. 10102241). The duet separation models were trained using the Baskerville Tier 2 HPC service (https://www.baskerville.ac.uk/), funded by the EPSRC and UKRI through the World Class Labs scheme (EP/T022221/1) and the Digital Research Infrastructure programme (EP/W032244/1) and is operated by Advanced Research Computing at the University of Birmingham. The 6-stem and LV/BV models were trained on GPUs provided and funded by Audiostrip LTD.
9. ETHICS STATEMENT
This user study obtained the necessary ethical approval (QMERC20.565.DSEECS24.049) from the Queen Mary Ethics of Research Committee. The private dataset used was purchased and licensed by AudioStrip Ltd., and they owned the appropriate rights for use as required by this study.
10. REFERENCES
[1] F.-R. Stöter, A. Liutkus, and N. Ito, “The 2018 signal separation evaluation campaign,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2018, pp. 293–305.
[2] K. N. Watcharasupat and A. Lerch, “A stem-agnostic single-decoder system for music source separation beyond four stems,” in Proceedings of the 25th International Society for Music Information Retrieval (ISMIR), San Francisco, CA, USA, Nov 2024.
[3] I. G. Pereira, F. Araujo, F. Korzeniowski, and R. Vogl, “Moisesdb: A dataset for source separation beyond 4 stems,” in Ismir 2023 Hybrid Conference, 2023.
[4] E. Vincent, S. Araki, and P. Bofill, “The 2008 signal separation evaluation campaign: A community-based approach to large-scale evaluation,” in International Conference on Independent Component Analysis and Signal Separation. Springer, 2009, pp. 734–741.
[5] Y. Mitsufuji, G. Fabbro, S. Uhlich, F.-R. Stöter, A. Défossez, M. Kim, W. Choi, C.-Y. Yu, and K.-W. Cheuk, “Music demixing challenge 2021,” Frontiers in Signal Processing, vol. 1, p. 808395, 2022.
[6] G. Fabbro, S. Uhlich, C.-H. Lai, W. Choi, M. Martínez-Ramírez, W. Liao, I. Gadelha, G. Ramos, E. Hsu, H. Rodrigues et al., “The sound demixing challenge 2023–music demixing track,” Transactions of the International Society for Music Information Retrieval, vol. 7, no. 1, 2024.
[7] E. Cano, D. FitzGerald, and K. Brandenburg, “Evaluation of quality of sound source separation algorithms: Human perception vs quantitative metrics,” in 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 1758–1762.
[8] D. Ward, H. Wierstorf, R. D. Mason, E. M. Grais, and M. D. Plumbley, “Bss eval or peass? predicting the perception of singing-voice separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 596–600.
[9] M. Torcoli, T. Kastner, and J. Herre, “Objective measures of perceptual audio quality reviewed: An evaluation of their application domain dependence,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1530–1541, 2021.
[10] E. Rumbold, G. Tzanetakis, and B. Pardo, “Correlations between objective and subjective evaluations of music source separation,” 21st Sound and Music Computing Conference, SMC 2024, 2024.
[11] S. Sarkar, E. Benetos, and M. Sandler, “Ensembleset: a new high quality synthesised dataset for chamber ensemble separation,” in Proc. of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022, pp. 625–632.
[12] T. Nakamura, S. Takamichi, N. Tanji, S. Fukayama, and H. Saruwatari, “jacappella corpus: A japanese a cappella vocal ensemble corpus,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[13] S. Sarkar, E. Benetos, and M. Sandler, “Vocal harmony separation using time-domain neural networks,” in Proc. Interspeech 2021, 2021, pp. 3515–3519.
[14] C.-B. Jeon, H. Moon, K. Choi, B. S. Chon, and K. Lee, “Medleyvox: An evaluation dataset for multiple singing voices separation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[15] Y. Özer, S. Schwär, V. Arifi-Müller, J. Lawrence, E. Sen, and M. Müller, “Piano concerto dataset (pcd): A multitrack dataset of piano concertos,” Transactions of the International Society for Music Information Retrieval, vol. 6, no. 1, 2023.
[16] S. Sarkar, L. Thorpe, E. Benetos, and M. Sandler, “Leveraging synthetic data for improving chamber ensemble separation,” in 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023, pp. 1–5.
[17] Y. Özer and M. Müller, “Source separation of piano concertos using musically motivated augmentation techniques,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1214–1225, 2024.
[18] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, 2018.
[19] R. Schramm, E. Benetos et al., “Automatic transcription of a cappella recordings from multiple singers.” Audio Engineering Society, 2017.
[20] S. Rouard, F. Massa, and A. Défossez, “Hybrid transformers for music source separation,” in ICASSP 23, 2023.
[21] J. Chen, Q. Mao, and D. Liu, “Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation,” in Proc. Interspeech 2020, 2020, pp. 2642–2646.
[22] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241–245.
[23] Y. Luo and N. Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 696–700.