DO MUSIC SOURCE SEPARATION MODELS PRESERVE SPATIAL
INFORMATION IN BINAURAL AUDIO?
Richa Namballa
New York University
[email protected]
Agnieszka Roginska
New York University
[email protected]
Magdalena Fuentes
New York University
[email protected]
ABSTRACT
Binaural audio remains underexplored within the music information retrieval community.
Motivated by the rising popularity of virtual and augmented reality experiences as well as
potential applications to accessibility, we investigate how well existing music source
separation (MSS) models perform on binaural audio. Although these models process two-channel
inputs, it is unclear how effectively they retain spatial information. In this work, we
evaluate how several popular MSS models preserve spatial information on both standard stereo
and novel binaural datasets. Our binaural data is synthesized using stems from MUSDB18-HQ and
open-source head-related transfer functions by positioning instrument sources randomly along
the horizontal plane. We then assess the spatial quality of the separated stems using signal
processing and interaural cue-based metrics. Our results show that stereo MSS models fail to
preserve the spatial information critical to maintaining the immersive quality of binaural
audio, and that the degradation depends on model architecture as well as the target
instrument. Finally, we highlight valuable opportunities for future work at the intersection
of MSS and immersive audio.
1. INTRODUCTION
In recent years, immersive experiences have gained popularity in various forms of media such
as video games, concerts, and movies. The shift to virtual and augmented reality (VR/AR)
requires not only realistic visual stimuli, but authentic auditory cues as well. One common
form of spatial audio used to provide the listener with directionality of sound is binaural
audio. Binaural audio goes beyond traditional gain-based stereo panning by filtering
two-channel audio to create interaural cues differing in level, time, and spectral content to
simulate the location of a source in space [1]. Furthermore, binaural audio requires
reproduction through headphones or loudspeakers equipped with crosstalk cancellation to
maintain spatial imaging integrity. Level differences resulting from the “head-shadow effect”
and the Time Difference of Arrival (TDOA) of a sound at each ear provide directional cues.
Frequency-dependent filtering, determined by the form of the listener’s head and specific ear
(pinna) shape, causes two identical sound sources positioned differently to exhibit slightly
different spectral content at each ear, further assisting localization. The two common methods
of producing binaural audio are recording with a binaural dummy head and signal processing
with a Head-Related Transfer Function (HRTF).

© R. Namballa, A. Roginska, and M. Fuentes. Licensed under a Creative Commons Attribution 4.0
International License (CC BY 4.0). Attribution: R. Namballa, A. Roginska, and M. Fuentes, “Do
Music Source Separation Models Preserve Spatial Information in Binaural Audio?”, in Proc. of
the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Beyond the increasing demand for immersive VR/AR experiences, binaural audio has significant
potential applications in accessibility. For instance, individuals who identify as
neuro-divergent or hard of hearing often benefit from enhanced auditory clarity, enabling them
to isolate and focus on specific sound sources in complex acoustic environments, facilitating
independent navigation and interaction in social and public settings. Binaural source
separation has been shown to significantly enhance auditory accessibility by reducing
background noise and emphasizing relevant auditory signals in real-time with the use of
microphone-enabled headphones [2]. In this context, music source separation (MSS) in binaural
audio could substantially improve how individuals engage with and enjoy musical environments
such as concerts, festivals, and other live performances, enabling users to isolate specific
musical elements or instruments and thus enhance their listening experience and overall
participation in music events. These tools can further be utilized for recorded binaural
content such as spatial audio captures of live performances or binaural field recordings.
Despite these potential benefits and growing interest, binaural audio processing has received
limited attention within the music information retrieval (MIR) community, particularly
concerning MSS. In this work, we investigate whether existing MSS models are able to separate
binaural mixtures into their respective stems while preserving the spatial characteristics,
which are crucial to the immersive experience provided by binaural audio. We create a binaural
MSS dataset based on the well-established MUSDB18-HQ dataset [3], and leverage several metrics
that quantify separation quality, spatial distortion, and immersiveness to evaluate these
models. Our results show that there is a considerable gap in binaural MSS performance compared
with MSS in simple stereo settings, and that this gap depends on model architecture and target
source. Lastly, we discuss the shortcomings of current metrics and identify opportunities for
future research.
2. RELATED WORK
Until now, most work on binaural source separation has been completed in the speech domain,
often overlapping with the similar task of target source extraction (TSE). In particular, the
speech research community describes the task as two-fold: source separation and localization
[4]. We focus on prior studies concerning the former.
Early two-channel source separation models were primarily signal processing-based, with a
focus on mathematical and theoretical techniques [5]. As the focus moved towards capturing
directionality, models began using psychoacoustic spatial cues to improve the performance of
the signal processing-based source estimation methods [6–11]. With the technological progress
made in computational resources, binaural source separation models shifted to using deep
learning approaches to perform source extraction in more complex environments and in real-time
[2, 12–15]. Recent deep learning systems have proposed novel loss functions aimed at
preserving the level, phase, and time differences between binaural channels, cues which are
critical to the immersive nature of binaural audio [16, 17].
To the best of our knowledge, the only published work on binaural MSS thus far concerns vocal
separation of binaural audio recorded with a dummy head [18]. Their approach uses various
hybrid combinations of single- and multichannel-source separation algorithms to extract the
vocal stems, with a focus on signal-processing methods [19–23]. The results are evaluated with
standard source separation metrics [24] and subjective listener ratings. Based on the limited
existing research in binaural MSS, we believe that there is a significant opportunity to
explore this task using deep learning methods, inspired by recent progress in the speech
community.
Regarding performance, the most common metric reported for evaluating source separation models
is the Signal to Distortion Ratio (SDR), measured in decibels (dB) [24]. Specifically, for
MSS, researchers often benchmark their models on the test set of MUSDB18-HQ and report the SDR
both overall and by instrument type [25]. SDR (and its scale-invariant version, SI-SDR [26])
aims to reflect what portion of the estimated stem corresponds to the reference stem versus
any error introduced by interference from other instruments, noise, and artifacts. While SDR
is well-established for evaluating mono and stereo tracks [27], it does not specify the amount
of spatial error introduced between channels in the model’s estimated output, which is
essential for evaluating the quality of binaural source separation. Therefore, we leverage
other metrics from the literature which reflect spatial quality.
In the immersive audio research community, there are several models used to quantify the
quality of a binaural signal, such as BAM-Q [28] and MoBi-Q [29], trained on a combination of
extracted binaural features and subjective quality ratings. We save the use of these models
for future work in binaural MSS and choose to focus on more accessible and interpretable
metrics, further explained in Section 4.1, which originate from the duplex theory of sound
localization [30]. This theory states that, along the horizontal plane (0° elevation), humans
use two auditory cues to localize the direction of a sound: the interaural time difference
(ITD) and the interaural level difference (ILD). ITD refers to the difference in time of
arrival, at each ear, of a sound emitted from a source. Generally, a sound will reach the
ipsilateral (closest to the source) ear faster than the contralateral (farthest from the
source) ear. Likewise, the ILD is the difference in a sound’s intensity as it arrives at the
ipsilateral and contralateral ears. Originally, it was believed that ILD was the primary cue
used for high frequency signals while ITD was for low frequencies [1]. However, recent studies
have shown that broadband signals require a complex interaction of the ITD and ILD to
effectively identify a sound’s location [31].
The work in [32] leverages this duplex theory of localization to propose two energy-ratio
metrics for spatial evaluation: Signal to Spatial Distortion Ratio (SSR) and Signal to
Residual Distortion (SRR). These measures are interpreted similarly to SDR, with SSR intended
as a substitute for the Image to Spatial Distortion Ratio (ISR), proposed by [27]. The spatial
error is computed by projecting the reference signal to the estimated signal and optimizing
for relative changes in gain and delay. From these projections, we can separate the distortion
in spatial information (spatial error) from errors such as interference in the estimated
signal (residual error). The ratios for SSR and SRR are defined in Section 4.1.
3. DATASET
To directly compare the performance of various MSS models on both stereo and binaural audio,
we created a binaural version of MUSDB18-HQ [3]. MUSDB18-HQ is the uncompressed, 22 kHz
bandwidth version of MUSDB18 [33] containing full-length, mixed music tracks from primarily
Western pop and rock genres as well as their respective stems separated into vocals, drums,
bass, and “other”. The training and test sets consist of 100 and 50 songs, respectively. All
audio files are stereophonic in WAV format, sampled at 44.1 kHz/16-bit. We call our binaural
dataset Binaural-MUSDB and we refer to the original MUSDB18-HQ as Stereo-MUSDB.
To construct Binaural-MUSDB, we utilized binaural synthesis to create the illusion of the
source signal emitting from a specific location around the listener [1]. We use the publicly
available SADIE II¹ database of HRTFs [34]. Each two-channel HRTF measurement contains the
auditory spatial cues which can be superimposed onto a signal such that the listener will
perceive the sound as originating from a location along the azimuth (θ) and at a given
elevation (φ). For our synthesis, we apply the HRTF measurements of subject D1 from SADIE II,
which correspond to the head and pinnae of the Neumann KU100 binaural dummy head microphone,
which is the size of the average human head.
We limited the horizontal plane to θ ∈ [−90°, +90°] along the azimuth, fixed at φ = 0°
elevation. In spatial audio, θ = 0° corresponds to the location directly in front of the
listener, equidistant from the left and right ears, as seen in Figure 1. While the duplex
theory states that humans primarily rely on ITD and ILD for binaural localization on the
horizontal plane [30], they require spectral information for disambiguating front-back
locations [35]. Since we limit source locations to the front half of the sound field, we do
not anticipate any significant differences in results using HRTFs other than the KU100’s.

[Diagram labels: −90°, 0°, +90°]
Figure 1. Binaural-MUSDB: each binaural source signal s_i is placed randomly along the
horizontal plane at an angle θ_i ∈ [−90°, +90°] with the origin located directly in front of
the listener. Every source has a minimum of 10° separation from the others, ensuring that
there is no direct spatial overlap between stems.

[Histogram: x-axis azimuth from −90° to 90°; y-axis count (0 to 14); series: vocals, bass,
drums, other]
Figure 2. Distribution of instrument positions in the test set of Binaural-MUSDB. θ
corresponds to the source’s location along the horizontal plane where 0° corresponds to the
position directly in front of the listener.

1 https://www.york.ac.uk/sadie-project/database.html

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

For every song in both the training and test sets, we assigned each source i to a static
location θ_i in increments of 10°. Angles for each stem in a single song were sampled randomly
without replacement in the order of vocals, bass, drums, and other. Furthermore, in a given
mixture, no two sources were allowed to be located at the same angle, ensuring that there was
a minimum of 10° separation (no direct overlap) between each stem. Each song was assigned only
one set of source locations. The distribution of locations across the test set can be seen in
Figure 2.
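
The placement rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `assign_angles`, the constant names, and the seed are hypothetical, and the only facts taken from the text are the 10° grid on [−90°, +90°], sampling without replacement, and the stem order.

```python
import random

# Stem order as stated in the text; constant names are illustrative only.
STEMS = ("vocals", "bass", "drums", "other")
# Candidate azimuths in degrees on a 10-degree grid from -90 to +90 (19 positions).
ANGLES = tuple(range(-90, 91, 10))


def assign_angles(rng: random.Random) -> dict:
    """Draw one distinct azimuth per stem by sampling without replacement."""
    chosen = rng.sample(ANGLES, k=len(STEMS))
    return dict(zip(STEMS, chosen))


# One static layout per song; the seed here is an arbitrary assumption.
layout = assign_angles(random.Random(0))
```

Because the angles come from a 10° grid and are drawn without replacement, the 10° minimum separation holds by construction and needs no separate check.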
We converted the original stereo stem to mono by averaging the two channels. Next, we loaded
the Head-Related Impulse Response (HRIR), the time-domain version of an HRTF, corresponding to
θ_i and convolved each HRIR channel with the mono stem signal to produce a binaural signal,
with the two channels corresponding to the left and right ears. This process is visualized in
Figure 3. Finally, we summed the binaural versions of the vocals, drums, bass and other stems
together and normalized the resulting signal to create the binaural mixtures which were used
as the input to the MSS models described in Section 4.2. The binaural synthesis was completed
for all 150 tracks with the same train-test split as Stereo-MUSDB.
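
The synthesis steps above (downmix, per-ear convolution, sum, normalize) can be sketched with NumPy. This is a sketch under stated assumptions rather than the authors' pipeline: the function names are hypothetical, loading HRIRs from SADIE II is left out, and peak normalization is assumed because the text does not say which normalization was used.

```python
import numpy as np


def binauralize(stereo_stem: np.ndarray, hrir: np.ndarray) -> np.ndarray:
    """Render a stem at the position encoded by one HRIR pair.

    stereo_stem: array of shape (n_samples, 2).
    hrir: array of shape (n_taps, 2), columns ordered [left, right].
    Returns an array of shape (n_samples + n_taps - 1, 2).
    """
    mono = stereo_stem.mean(axis=1)            # average the two channels
    left = np.convolve(mono, hrir[:, 0])       # left-ear signal
    right = np.convolve(mono, hrir[:, 1])      # right-ear signal
    return np.stack([left, right], axis=1)


def mix_and_normalize(binaural_stems, peak: float = 1.0) -> np.ndarray:
    """Sum equally sized binaural stems and scale to a target peak.

    Peak normalization is an assumption; the source only says "normalized".
    """
    mixture = np.sum(binaural_stems, axis=0)
    max_abs = np.max(np.abs(mixture))
    return mixture if max_abs == 0 else mixture * (peak / max_abs)
```

For full-length tracks an FFT-based convolution would be faster than `np.convolve`, but the output is the same.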
4. EXPERIMENTAL SETUP
4.1 Metrics
We utilize four metrics to describe the amount of distortion introduced by the MSS models,
three of which quantify the level of spatial error in the estimated stems (SSR, ∆ITD, ∆ILD)
and one that measures the remaining signal distortion due to interference and artifacts
introduced by the separation (SRR).
In binaural audio, it is crucial that the ITD and ILD of a sound remain unchanged after
separation to allow a listener to localize the source and maintain their sense of immersion.
Therefore, we quantify how well the interaural cues are preserved by measuring the change (∆)
in ITD and ILD between the estimated stem (ŝ) and the reference stem (s), as in [2]. To
compute ∆ITD, we calculate the magnitude of the difference in ITD(ŝ) and ITD(s) [36].
$$\Delta\mathrm{ITD} = \left|\mathrm{ITD}(s) - \mathrm{ITD}(\hat{s})\right| \tag{1}$$
We measure the ITD of each signal as the TDOA of the source in the left and right channels
using the frame-wise Generalized Cross Correlation with Phase Transform (GCC-PHAT) algorithm
[37], implemented by [2]. First, we segment the signal x into frames of 0.5 s in length (with
no overlap) and apply a Tukey window to each frame. Next, we calculate the GCC-PHAT C(t, τ) at
frame t, for lags τ (in samples) corresponding to the range [−1, 1] ms, and find τ*, the value
of τ which maximizes C [2, 38]. The frame-wise TDOA is computed in seconds by dividing τ* by
the sample rate f_s.
$$\mathrm{TDOA}(x, t) = \frac{1}{f_s} \cdot \arg\max_{\tau} C(t, \tau) \tag{2}$$
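
A single-frame GCC-PHAT lag estimate along the lines of Eq. (2) might look as follows. This is a hedged sketch, not the implementation from [2]: the function name is hypothetical, the Tukey window is omitted because its taper parameter is not given, and the sign convention (positive when the right channel lags the left) is an assumption.

```python
import numpy as np


def gcc_phat_tdoa(left, right, fs, max_lag_s=1e-3):
    """Estimate the TDOA (seconds) between two channels of one frame.

    Positive values mean the right channel lags the left (assumed convention).
    Windowing (e.g. a Tukey window) is intentionally left out of this sketch.
    """
    n_fft = 2 * len(left)                          # zero-pad against circular wrap-around
    spec_left = np.fft.rfft(left, n_fft)
    spec_right = np.fft.rfft(right, n_fft)
    cross = spec_right * np.conj(spec_left)        # cross-power spectrum
    cross = cross / np.maximum(np.abs(cross), 1e-12)  # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n_fft)                # generalized cross-correlation
    max_lag = int(round(max_lag_s * fs))           # +/- 1 ms search range by default
    # Reorder so that index 0 corresponds to lag -max_lag.
    cc_window = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])
    tau_star = int(np.argmax(cc_window)) - max_lag  # best lag in samples
    return tau_star / fs
```

Restricting the search to ±1 ms keeps the estimate within physically plausible interaural delays, so spurious correlation peaks at larger lags are ignored.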
The ITD of the full signal is then calculated as the weighted mode of the frame-wise TDOA.
Each weight w_t is based on the Root Mean Square (RMS) energy, where x_{t,c} is the signal at
frame t and channel c, n is the length of the frame, and k is the sample index of the frame.
$$w_t = \max_{c} \sqrt{\frac{1}{n} \sum_{k=0}^{n-1} x_{t,c}[k]^2} \tag{3}$$
Frames with a w_t less than a threshold of 5×10⁻⁴ are considered silent and excluded from the
signal’s ITD calculation. ∆ITD is presented in microseconds (µs) [2].
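
The aggregation step can be sketched as below. It assumes the usual reading of "weighted mode" (the lag whose frames carry the largest total weight); the exact tie-breaking and implementation in [2] may differ, and all function names are hypothetical. Assuming the 44.1 kHz sample rate given in Section 3, one sample of lag is 1/44100 s ≈ 22.68 µs, so ∆ITD values fall on multiples of that step.

```python
import numpy as np


def frame_weight(frame: np.ndarray) -> float:
    """RMS-based weight of one frame of shape (n, 2): the larger channel RMS."""
    return float(np.max(np.sqrt(np.mean(frame ** 2, axis=0))))


def weighted_mode_itd(lags, weights, fs, silence_threshold=5e-4):
    """ITD in seconds: the lag (in samples) with the largest summed weight.

    Frames whose weight is below the threshold are treated as silent and skipped.
    Returns None if every frame is silent (handling chosen for this sketch).
    """
    totals = {}
    for lag, weight in zip(lags, weights):
        if weight < silence_threshold:
            continue
        totals[lag] = totals.get(lag, 0.0) + weight
    if not totals:
        return None
    best_lag = max(totals, key=totals.get)
    return best_lag / fs


def delta_itd_us(itd_ref: float, itd_est: float) -> float:
    """Absolute ITD change between reference and estimate, in microseconds."""
    return abs(itd_ref - itd_est) * 1e6
```

Working with integer sample lags and converting to seconds only at the end avoids floating-point mismatches when accumulating weights per lag.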
[Diagram: MUSDB18 stereo stem -> mono stem; SADIE II HRIR (L, R); convolution -> binaural stem
(L, R) -> Binaural-MUSDB]
Figure 3. An overview of the binaural synthesis process for the Binaural-MUSDB dataset. For
every song in MUSDB18-HQ, each source is assigned a location θ along the azimuth in the
frontal portion of the horizontal plane (±90°). The corresponding HRIR (θ, elevation φ = 0) is
retrieved from the SADIE II database and each channel is convolved (∗) with the monophonic
version of the source stem. The resulting signals are the left and right channels of the
binaural version of the stem which are included in the dataset.

The ILD is computed as the decibel ratio of the sum of squares of each channel across the
entire signal. Here, x_c represents channel c of the full signal x, k is the corresponding
sample index, and N is the length of the entire signal in samples. As with ITD, we report
∆ILD.
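$$\mathrm{ILD}(x) = 10 \cdot \log_{10}\left(\frac{\sum_{k=0}^{N-1} x_L[k]^2}{\sum_{k=0}^{N-1} x_R[k]^2}\right) \tag{4}$$

$$\Delta\mathrm{ILD} = \left|\mathrm{ILD}(s) - \mathrm{ILD}(\hat{s})\right| \tag{5}$$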
For both ∆ITD and ∆ILD, a lower value indicates a higher-quality spatial preservation of the
interaural cue in the estimated stem.
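
Equations (4) and (5) translate almost directly into code. The sketch below is illustrative only: the function names are hypothetical, and the small `eps` guard against division by zero is an added assumption, not part of the definition.

```python
import numpy as np


def ild_db(x: np.ndarray, eps: float = 1e-12) -> float:
    """ILD in dB for a signal of shape (N, 2), columns ordered [left, right].

    eps is an assumed safeguard for silent channels.
    """
    energy_left = np.sum(x[:, 0] ** 2)
    energy_right = np.sum(x[:, 1] ** 2)
    return float(10.0 * np.log10((energy_left + eps) / (energy_right + eps)))


def delta_ild_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Absolute ILD change between the reference and estimated stems, in dB."""
    return abs(ild_db(reference) - ild_db(estimate))
```

As a reference point, a left channel with twice the amplitude of the right gives an energy ratio of 4, i.e. an ILD of about 6.02 dB.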
In addition to ∆ITD and ∆ILD, we compute the SSR and SRR as proposed in [32] using their
provided open-source implementation with its default parameters. Both metrics are computed
frame-wise, reporting the median value, with a window of 1 s and a hop length of 0.5 s.
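$$\mathrm{SSR}(\hat{s}; s) = 10 \cdot \log_{10}\frac{\|s\|^2}{\|e_{\mathrm{spat}}\|^2} \tag{6}$$

$$\mathrm{SRR}(\hat{s}; s) = 10 \cdot \log_{10}\frac{\|\tilde{s}\|^2}{\|e_{\mathrm{resid}}\|^2} \tag{7}$$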
The SSR is intended to capture the spatial distortion introduced by the separation (e_spat)
into the estimated stem (ŝ) while the SRR reflects only non-spatial distortion and errors such
as interference and artifacts (e_resid). Note that s̃ is the projection² of s onto ŝ, as
mentioned in Section 2. Both SSR and SRR are measured in dB and a higher value indicates less
distortion in the estimated signal.
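
Both ratios share the same energy-ratio form and the same frame-wise median aggregation. The sketch below illustrates only that shared part; it does not implement the projection or the decomposition into e_spat and e_resid, which are defined in [32]. The function names are hypothetical, the `target` and `error` arrays are assumed to be precomputed, and framing whole-signal error terms (rather than decomposing per frame) is a simplification.

```python
import numpy as np


def energy_ratio_db(target: np.ndarray, error: np.ndarray, eps: float = 1e-12) -> float:
    """10*log10 of target energy over error energy; eps is an assumed guard."""
    return float(10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(error ** 2) + eps)))


def framewise_median_ratio_db(target, error, fs, window_s=1.0, hop_s=0.5):
    """Median of per-frame energy ratios, with a 1 s window and 0.5 s hop by default."""
    win = int(round(window_s * fs))
    hop = int(round(hop_s * fs))
    if len(target) < win:
        raise ValueError("signal is shorter than one analysis window")
    ratios = [
        energy_ratio_db(target[start:start + win], error[start:start + win])
        for start in range(0, len(target) - win + 1, hop)
    ]
    return float(np.median(ratios))
```

Reporting the median rather than the mean keeps a few near-silent frames, where the ratio can swing wildly, from dominating the score.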
4.2 Models
We evaluate the performance of three well-known pre-trained MSS models on both stereo and
binaural conditions: Hybrid Transformer Demucs Fine-Tuned (htdemucs_ft) [39], OpenUnmix
(umxhq) [40], and Spleeter (spleeter:4stems) [41]. We chose these models over newer MSS models
to validate our results with [32] and because all three models have official open-source
implementations available for use. Both Demucs and OpenUnmix are trained on the Stereo-MUSDB
training set, while Spleeter is trained on a proprietary dataset. Additionally, the version of
Demucs we use is trained on an extra 800 songs not publicly identified. Each model accepts a
stereophonic mixture input and returns an estimated two-channel stem.

2 Due to space constraints, we encourage readers to reference the original publication [32]
for the precise mathematical definition of s̃.

Both OpenUnmix and Spleeter have inputs in the frequency domain, while Demucs is a hybrid
model, operating in both the waveform and spectrogram domains. Spleeter uses a U-net
architecture (CNN-based) [42] to estimate a time-frequency mask for each source and applies it
to the input mixture’s magnitude spectrogram to generate the spectrogram of the estimated stem
[43]. OpenUnmix operates similarly; however, it uses a bi-directional LSTM model (RNN-based)
to estimate the mask [44]. All three models use an L1 loss function to minimize the error
between the estimated and reference signals.
To preserve the temporal structure of the input audio, both OpenUnmix and Spleeter apply the
original input mixture’s phase to the estimated magnitude spectrogram before inversion to the
time domain to construct the final predicted stem. On the other hand, since Demucs functions
in two domains, the model has to combine the estimated time and frequency representations to
provide the final synthesized waveform. In the original hybrid version of Demucs [45], the
model required careful hyperparameter tuning to align the temporal and spectral
representations of the estimated signal so they could be summed in the waveform domain.
However, in the newest version of the model [39], the authors claim that the transformer
addresses this bottleneck through its flexible architecture.
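
The phase-reuse step described above for the frequency-domain models can be illustrated on a single frame. This is a simplified sketch, not code from OpenUnmix or Spleeter: it uses one FFT frame instead of a full STFT, the function name is hypothetical, and the mask is assumed to be given (in the real models it is predicted by the network).

```python
import numpy as np


def apply_mask_with_mixture_phase(mixture_frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mask the mixture magnitude and reuse the mixture phase, per channel.

    mixture_frame: time-domain frame of shape (n, 2).
    mask: values in [0, 1] of shape (n // 2 + 1, 2), assumed to be provided.
    Returns the estimated time-domain frame of shape (n, 2).
    """
    spectrum = np.fft.rfft(mixture_frame, axis=0)
    est_magnitude = mask * np.abs(spectrum)                          # masked magnitude
    est_spectrum = est_magnitude * np.exp(1j * np.angle(spectrum))   # mixture phase reused
    return np.fft.irfft(est_spectrum, n=mixture_frame.shape[0], axis=0)
```

Because each channel keeps its own mixture phase, the interchannel phase relationship of the mixture carries over to the estimate wherever the target source dominates a time-frequency bin.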
To compare the separation performance in stereo and binaural settings, we apply these models
to the test sets of Stereo-MUSDB and Binaural-MUSDB.
5. RESULTS AND DISCUSSION
In this section, we analyze and discuss the performance of the three MSS models by looking at
the different metrics in the binaural and stereo datasets, considering the effect on
individual instruments, and identifying the effect of spatial distortion in the different
locations along the azimuth.

Table 1. SRR results from the MSS models across the two datasets using median values. The best
results are highlighted in bold and the second best are underlined.

Dataset    Model       SRR (dB) ↑
                       Bass    Drums   Other   Vocals  Overall
Binaural   Demucs      8.90    10.58   4.10    4.37    6.91
Binaural   OpenUnmix   3.37    6.75    1.19    2.37    3.51
Binaural   Spleeter    1.53    4.71    0.11    0.00    2.01
Stereo     Demucs      8.36    9.86    6.36    6.08    7.39
Stereo     OpenUnmix   1.72    4.82    2.90    2.40    3.14
Stereo     Spleeter    1.25    4.51    3.31    2.76    3.21

5.1 Stereo vs. Binaural Performance
Based on the median SRR values shown in Table 1, we observe a relatively consistent separation
quality across the stereo and binaural datasets, suggesting that introducing spatial cues does
not dramatically impact the ability of models to isolate instruments from one another. The SRR
serves as a proxy for separation quality in spatial audio settings as it considers all
residual distortions that are not spatial. Demucs appears to outperform the other two models
in SRR for both datasets, which aligns with its original SDR-based ranking reported on the
test set of Stereo-MUSDB [40, 41, 45].
The median spatial metrics in Table 2 show that the MSS models introduce substantial spatial
distortion when applied to binaural audio. For reference, SSR values around 10 dB relate to
noticeable spatial distortion, while values below that indicate severe spatial distortion,
based on trends seen in other energy-ratio metrics [24, 32, 46]. Note that spatialization in
stereo tracks traditionally uses gain-based panning, so a median ∆ITD of 0 µs is not
unexpected. Upon closer inspection, a few ∆ITD values were nonzero, indicating that some
interchannel temporal distortion is introduced by the models, even in the stereo stems.
Demucs shows a considerable performance drop from stereo to binaural conditions, especially in
SSR, compared to the other models. A plausible explanation is that, by operating directly on
waveforms, Demucs implicitly learned stereo spatial cues based on amplitude differences and
struggled to effectively interpret the subtle spectral information characteristic of binaural
audio. In turn, OpenUnmix occasionally achieves superior results in binaural settings compared
to stereo, likely due to its frequency-domain masking approach that preserves the original
mixture’s phase, inadvertently maintaining the spatial integrity. Similarly, Spleeter, also
employing frequency-domain masking, demonstrates stable and sometimes improved performance on
binaural audio, reinforcing that preserving the original phase of a mixture can be beneficial
for spatial cue accuracy. Nevertheless, none of the models’ binaural metrics match Demucs’s
stereo performance level, demonstrating considerable room for improvement in retaining
binaural spatial cues.

[Plot: three panels showing SSR (dB), ∆ITD (µs), and ∆ILD (dB) per azimuth bin ([-90°, -60°],
(-60°, -30°], (-30°, 0°], (0°, 30°], (30°, 60°], (60°, 90°]) for htdemucs, spleeter, and umxhq]
Figure 4. Distributions of spatial metrics (SSR, ∆ITD, ∆ILD) by model and angle, aggregated
across all sources.

5.2 Performance by Angle
Figure 4 shows the overall spatial distortion across all three spatial metrics by model and
angle bin along the azimuth. We observe that SSR and ∆ILD remain relatively consistent across
angles, whereas the ITD notably distorts more the farther the source is positioned from the
origin (larger |θ|), displaying a U-shaped effect. One source of this tendency could be that
strongly lateralized signals have minimal overlap in time-domain amplitude between the left
and right channels. Cross-correlation relies on shared, correlated energy between channels so,
in these cases, even minor disturbances from separation reduce channel similarity
substantially, making accurate lag estimation challenging. This pattern could also imply that
current MSS models are better at preserving amplitude-based spatial information (e.g.,
gain-based panning) than phase-based cues, and that they are introducing temporal
disturbances. Additionally, the ∆ITD distribution highlights a potential limitation in the SSR
metric. Although it has been designed to account for all spatial distortions in accordance
with the duplex theory [30, 32], it may be more sensitive to level differences rather than
time of arrival changes (as it does not reflect the U-shaped behavior observed in ∆ITD).
Further research with synthetic signals is needed to clarify how SSR values respond to phase
distortions, whether the metric or its implementation requires revision, and how sensitive ITD
and ILD calculations are to small artifacts.

Table 2. Spatial metric results (SSR, ∆ITD, ∆ILD) from the MSS models for the two datasets
using median values. The best results are highlighted in bold and the second best are
underlined.

SSR (dB) ↑
Dataset          Model       Bass     Drums    Other    Vocals   Overall
Binaural-MUSDB   Demucs      9.13     10.39    12.62    8.70     10.59
Binaural-MUSDB   OpenUnmix   10.94    12.22    11.04    8.20     10.43
Binaural-MUSDB   Spleeter    10.63    11.86    9.96     5.22     9.86
Stereo-MUSDB     Demucs      17.18    20.63    14.11    13.42    16.01
Stereo-MUSDB     OpenUnmix   9.74     12.12    10.09    11.22    10.73
Stereo-MUSDB     Spleeter    8.69     11.54    11.31    10.18    10.78

∆ITD (µs) ↓
Dataset          Model       Bass     Drums    Other    Vocals   Overall
Binaural-MUSDB   Demucs      476.19   0.00     22.68    0.00     68.03
Binaural-MUSDB   OpenUnmix   521.54   0.00     226.76   0.00     90.7
Binaural-MUSDB   Spleeter    544.22   22.68    22.68    22.68    22.68
Stereo-MUSDB     Demucs      0.00     0.00     0.00     0.00     0.00
Stereo-MUSDB     OpenUnmix   0.00     0.00     0.00     0.00     0.00
Stereo-MUSDB     Spleeter    0.00     0.00     0.00     0.00     0.00

∆ILD (dB) ↓
Dataset          Model       Bass     Drums    Other    Vocals   Overall
Binaural-MUSDB   Demucs      0.20     0.31     0.57     0.42     0.39
Binaural-MUSDB   OpenUnmix   0.41     0.38     0.72     0.73     0.50
Binaural-MUSDB   Spleeter    0.44     0.52     0.99     0.74     0.64
Stereo-MUSDB     Demucs      0.08     0.07     0.11     0.05     0.08
Stereo-MUSDB     OpenUnmix   0.12     0.10     0.24     0.08     0.12
Stereo-MUSDB     Spleeter    0.15     0.08     0.23     0.10     0.12

5.3 Performance by Instrument
When looking at instrument-specific performance in Tables 1 and 2, we see that bass and
“other” instruments exhibit higher spatial distortion (∆ITD) compared to vocals and drums.
Bass instruments predominantly occupy narrow, low frequency bands, where localization relies
heavily on subtle time differences rather than level. Because these low-frequency sounds have
longer wavelengths, even minor phase distortions introduced during the separation process can
lead to significant perceived spatial errors. This trait is reflected in the cross-correlation
calculations of ITD, which require larger sample lags (τ) to properly align the channels.
Similarly, the “other” category often includes a diverse collection of complex and spectrally
dense instruments with broader spatial positioning, resulting in diffused or ambiguous spatial
cues.
5.4 Performance by Model
As mentioned previously, Demucs exhibits a significant performance drop from stereo to
binaural conditions in terms of spatial distortion. In contrast, the frequency-domain models,
OpenUnmix and Spleeter, display more consistent spatial performance across these two settings.
Nevertheless, all models perform well below the level achieved by Demucs in stereo, suggesting
that none are yet optimized for binaural spatial fidelity. Future research should explore
training the models directly on binaural audio and adjusting the loss functions used during
training to explicitly penalize distortions in ITD and ILD to improve spatial cue
preservation, using systems inspired by the speech community [2, 16, 17].
5.5 Perceptual Considerations
While we primarily relied on objective metrics for our evaluation, preliminary subjective
listening by the authors suggests noticeable spatial distortions, particularly affecting bass
instruments. These distortions align with our quantitative findings and indicate substantial
spatial artifacts caused by inaccuracies in phase preservation. To provide a clearer
illustration of these effects, selected audio examples demonstrating typical spatial
distortions identified in our analysis are made available on an accompanying demonstration
webpage, along with the open-source data and code repository.³
6. CONCLUSION AND FUTURE WORK
We investigated the capabilities of existing music source separation (MSS) models applied to
binaural audio. Our analysis revealed a considerable gap in MSS performance between binaural
and stereo settings. This performance disparity was influenced significantly by both the
specific architecture of the model and the target audio source. We identify several avenues of
planned future work which will address the limitations of this study and build the foundation
for subsequent binaural MSS models.
Data. The binaural data was synthesized with a random placement of sources and a single set of
HRTF measurements. We hope to examine the stability of the results concerning the random seed
initialization in the positioning of sources and the effect of their overlap. Additionally, we
can validate the impact of using diverse HRTFs (corresponding to various pinnae) when
synthesizing the data.
Metrics. We believe the current metrics require further investigation to better understand
their sensitivity to changes in phase versus level. Moreover, we can explore existing binaural
quality models established by the immersive audio community and perform a perceptual study to
validate all metrics.
Modeling. Since MSS research has progressed significantly, we hope to evaluate newer
state-of-the-art MSS models’ performance on binaural audio. We also plan to train a simple
baseline MSS model on the binaural dataset with the option of data augmentations (e.g., noise,
reverberation) to simulate diverse binaural conditions. Lastly, we will modify existing MSS
model architectures to account for the preservation of spatial cues, such as with loss
functions that minimize changes in ITD and ILD.
These paths of future research show promise in designing models specifically trained for
binaural MSS with the goal of bridging immersive audio with music information retrieval for
both cultural and accessibility applications.
3 https://richa-namballa.github.io/binaural-mss-demo/
7. REFERENCES
[1] A. Roginska and P. Geluso, Immersive Sound. Focal Press, 2017.
[2] B. Veluri, M. Itani, J. Chan, T. Yoshioka, and S. Gollakota, “Semantic hearing:
Programming acoustic scenes with binaural hearables,” in Proceedings of the 36th Annual ACM
Symposium on User Interface Software and Technology, 2023, pp. 1–15.
[3] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “MUSDB18-HQ - an
uncompressed version of MUSDB18,” Dec. 2019. [Online]. Available:
https://doi.org/10.5281/zenodo.3338373
[4] A. Deleforge and R. Horaud, “The cocktail party robot: Sound source separation and
localisation with an active binaural head,” in Proceedings of the Seventh Annual ACM/IEEE
International Conference on Human-Robot Interaction, 2012, pp. 431–438.
[5] K. Torkkola, “Blind separation for audio signals-are we there yet?” in First International
Workshop on Independent Component Analysis and Blind Source Separation, 1999, pp. 239–244.
[6] B. D. Van Veen and K. M. Buckley, “Beamforming: A versatile approach to spatial
filtering,” IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988.
[7] A. Jourjine, S. Rickard, and O. Yilmaz, “Blind separation of disjoint orthogonal signals:
Demixing N sources from 2 mixtures,” in Proceedings of the 2000 IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5. IEEE, 2000, pp. 2985–2988.
[8] H. Viste and G. Evangelista, “On the use of spatial cues to improve binaural source
separation,” in Proceedings of 6th International Conference on Digital Audio Effects
(DAFx-2003), 2003, pp. 209–213.
[9] S. Schulz and T. Herfet, “Binaural source separation in non-ideal reverberant
environments,” in Proceedings of 10th International Conference on Digital Audio Effects
(DAFx-2007), Bordeaux, France, 2007.
[10] C. Kim, K. Kumar, and R. M. Stern, “Binaural sound source separation motivated by
auditory processing,” in 2011 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2011, pp. 5072–5075.
[11] R. Abdipour, A. Akbari, M. Rahmani, and B. Nasersharif, “Binaural source separation based
on spatial cues and maximum likelihood model adaptation,” Digital Signal Processing, vol. 36,
pp. 174–183, 2015.
[12] S. Zake i and M. Ge a anchizadeh, “Supe ised binau-
al sou ce sepa a ion using audi o y a en ion de ec ion
in ealis ic scena ios,” Applied Acous ics, ol. 175, p.
107826, 2021.
[13] Y. Yang, G. Sung, S.-F. Shih, H. E dogan, C. Lee,
and M. G undmann, “Binau al angula sepa a ion ne -
wo k,” in 2024 IEEE In e na ional Con e ence on
Acous ics, Speech and Signal P ocessing (ICASSP).
IEEE, 2024, pp. 1201–1205.
[14] X. Zhang and D. Wang, “Deep lea ning based bin-
au al speech sepa a ion in e e be an en i onmen s,”
IEEE/ACM T ansac ions on Audio, Speech, and Lan-
guage P ocessing, ol. 25, no. 5, pp. 1075–1084, 2017.
[15] B. Veluri, J. Chan, M. Itani, T. Chen, T. Yoshioka,
and S. Gollakota, “Real-time target sound extraction,”
in 2023 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2023,
pp. 1–5.
[16] C. Hernandez-Olivan, M. Delcroix, T. Ochiai,
N. Tawara, T. Nakatani, and S. Araki, “Interaural time
difference loss for binaural target sound extraction,” in
2024 18th International Workshop on Acoustic Signal
Enhancement (IWAENC). IEEE, 2024, pp. 210–214.
[17] V. Tokala, E. Grinstein, M. Brookes, S. Doclo,
J. Jensen, and P. A. Naylor, “Binaural speech
enhancement using deep complex convolutional transformer
networks,” in 2024 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2024, pp. 681–685.
[18] P. Kasak, R. Jarina, D. Ticha, and M. Jakubec, “Hybrid
binaural singing voice separation,” in 2023 33rd
International Conference Radioelektronika
(RADIOELEKTRONIKA). IEEE, 2023, pp. 1–6.
[19] S. Rickard, “The DUET blind source separation
algorithm,” in Blind Speech Separation. Springer, 2007,
pp. 217–241.
[20] A. Hyvärinen and E. Oja, “Independent component
analysis: algorithms and applications,” Neural
Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[21] J. Driedger, M. Müller, and S. Disch, “Extending
harmonic-percussive separation of audio signals,” in
Proceedings of the 15th International Society for
Music Information Retrieval Conference. ISMIR, 2014,
pp. 611–616.
[22] P. Seetharaman, F. Pishdadian, and B. Pardo,
“Music/voice separation using the 2D Fourier transform,”
in 2017 IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA). IEEE,
2017, pp. 36–40.
[23] Z. Rafii and B. Pardo, “Repeating pattern extraction
technique (REPET): A simple method for music/voice
separation,” IEEE Transactions on Audio, Speech, and
Language Processing, vol. 21, no. 1, pp. 73–84, 2012.
[24] E. Vincent, R. Gribonval, and C. Févotte,
“Performance measurement in blind audio source separation,”
IEEE Transactions on Audio, Speech, and Language
Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[25] “Sound Demixing Workshop.” [Online]. Available:
https://sdx-workshop.github.io/
[26] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey,
“SDR–half-baked or well done?” in 2019 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2019, pp. 626–630.
[27] E. Vincent, H. Sawada, P. Bofill, S. Makino, and J. P.
Rosca, “First stereo audio source separation evaluation
campaign: data, algorithms and results,” in
International Conference on Independent Component Analysis
and Signal Separation. Springer, 2007, pp. 552–559.
[28] J.-H. Fleßner, R. Huber, and S. D. Ewert,
“Assessment and prediction of binaural aspects of audio
quality,” Journal of the Audio Engineering Society, vol. 65,
no. 11, pp. 929–942, 2017.
[29] J.-H. Fleßner, S. D. Ewert, B. Kollmeier, and R. Huber,
“Quality assessment of multi-channel audio processing
schemes based on a binaural auditory model,” in 2014
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2014, pp.
1340–1344.
[30] L. Rayleigh, “XII. on our perception of sound
direction,” The London, Edinburgh, and Dublin
Philosophical Magazine and Journal of Science, vol. 13, no. 74,
pp. 214–232, 1907.
[31] L. R. Bernstein, “Auditory processing of interaural
timing information: new insights,” Journal of
Neuroscience Research, vol. 66, no. 6, pp. 1035–1046, 2001.
[32] K. N. Watcharasupat and A. Lerch, “Quantifying
spatial audio quality impairment,” in 2024 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2024, pp. 746–750.
[33] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis,
and R. Bittner, “The MUSDB18 corpus for music
separation,” Dec. 2017. [Online]. Available:
https://doi.org/10.5281/zenodo.1117372
[34] C. Armstrong, L. Thresh, D. Murphy, and G. Kearney,
“A perceptual evaluation of individual and
non-individual HRTFs: A case study of the SADIE II
database,” Applied Sciences, vol. 8, no. 11, p. 2029,
2018.
[35] P. M. Hofman and A. J. Van Opstal, “Spectro-temporal
factors in two-dimensional human sound localization,”
The Journal of the Acoustical Society of America, vol.
103, no. 5, pp. 2634–2648, 1998.
[36] C. Han, Y. Luo, and N. Mesgarani, “Real-time binaural
speech separation with preserved spatial cues,” in 2020
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2020, pp.
6404–6408.
[37] C. Knapp and G. Carter, “The generalized correlation
method for estimation of time delay,” IEEE
Transactions on Acoustics, Speech, and Signal Processing,
vol. 24, no. 4, pp. 320–327, 1976.
[38] T. May, S. Van De Par, and A. Kohlrausch, “A
probabilistic model for robust localization based on a
binaural auditory front-end,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 19, no. 1, pp.
1–13, 2010.
[39] S. Rouard, F. Massa, and A. Défossez, “Hybrid
transformers for music source separation,” in 2023 IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[40] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji,
“Open-Unmix - a reference implementation for music
source separation,” Journal of Open Source Software,
vol. 4, no. 41, p. 1667, 2019. [Online]. Available:
https://doi.org/10.21105/joss.01667
[41] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam,
“Spleeter: a fast and efficient music source
separation tool with pre-trained models,” Journal of Open
Source Software, vol. 5, no. 50, p. 2154, 2020.
[42] O. Ronneberger, P. Fischer, and T. Brox, “U-net:
Convolutional networks for biomedical image
segmentation,” in Medical Image Computing and
Computer-Assisted Intervention – MICCAI 2015, 18th
International Conference, Proceedings, Part III. Munich,
Germany: Springer, 2015, pp. 234–241.
[43] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner,
A. Kumar, and T. Weyde, “Singing voice separation
with deep U-net convolutional networks,” in
Proceedings of the 18th International Society for Music
Information Retrieval Conference. Suzhou, China: ISMIR,
2017, pp. 23–27.
[44] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp,
N. Takahashi, and Y. Mitsufuji, “Improving music
source separation based on deep neural networks
through data augmentation and network blending,” in
2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2017,
pp. 261–265.
[45] A. Défossez, “Hybrid spectrogram and waveform
source separation,” in Proceedings of the Music
Demixing (MDX) Workshop, 2021.
[46] “Music source separation on MUSDB18,” March
2025. [Online]. Available:
https://paperswithcode.com/sota/music-source-separation-on-musdb18