
MB-RIRs: a Synthetic Room Impulse Response Dataset with Frequency-Dependent Absorption Coefficients

Author: Gusó, Enric; Luberadzka, Joanna; Sayin, Umut; Serra, Xavier
Publisher: Zenodo
DOI: 10.48550/arXiv.2507.09750
Source: https://zenodo.org/records/17279819/files/2507.09750v1.pdf
Enric Gusó (1,2), Joanna Luberadzka (2), Umut Sayin (2), Xavier Serra (1)
(1) Universitat Pompeu Fabra, Music Technology Group, Barcelona
(2) Eurecat, Centre Tecnològic de Catalunya, Tecnologies Multimèdia, Barcelona
Abstract: We investigate the effects of four strategies for improving the ecological validity of synthetic room impulse response (RIR) datasets for monaural Speech Enhancement (SE). We implement three features on top of the traditional image source method-based (ISM) shoebox RIRs: multiband absorption coefficients, source directivity and receiver directivity. We additionally consider mesh-based RIRs from the SoundSpaces dataset. We then train a DeepFilterNet3 model for each RIR dataset and evaluate the performance on a test set of real RIRs both objectively and subjectively. We find that RIRs which use frequency-dependent acoustic absorption coefficients (MB-RIRs) can obtain +0.51dB of SDR and a +8.9 MUSHRA score when evaluated on real RIRs. The MB-RIRs dataset is publicly available for free download.
1. INTRODUCTION
Speech enhancement (SE) is the combination of processes like dereverberation, denoising, declipping and bandwidth extension. In the last decade, deep learning-based SE methods have proven very effective, a success partially driven by the use of bigger and more diverse training datasets, e.g. the ones used in the Deep Noise Suppression Challenges (DNS) [1]. This data scaling has produced models that generalize better and mitigate cross-dataset performance differences like the ones in [2], [3]. On top of scaling size and variety, speech and noise datasets have also been extended to a sampling rate of 48kHz, further improving the results [4]–[6]. Generally speaking, the traditional approach to augmenting data is to convolve the speech signals with Room Impulse Responses (RIRs), simulating sounds in different acoustic environments.

However, most RIRs currently used in the state of the art are still the ones from the early DNS challenges [7], i.e. shoebox-like rooms with a single acoustic absorption coefficient for the entire frequency spectrum. They are typically rendered through the image source method at a 16kHz sampling rate, and were originally intended to train automatic speech recognition systems. In fact, increasing the realism of the RIRs has been shown to improve speech recognition for augmented reality glasses [8] or keyword spotting [9], even as a post-processing augmentation step [10], so the same could happen for the SE task.

More recently, the URGENT Challenge [6] acknowledged the importance of RIR generalization by using real recorded RIRs for evaluation. While it constitutes an important step for improving the ecological validity of SE models, it still uses DNS5 RIRs for training, so its models might be confined to that particular RIR coverage. The topic is also covered in [11], with a focus on augmenting existing RIRs with generative models.
The research leading to these results has received funding from the European Union's Horizon Europe programme under grant agreement No 101017884 - GuestXR project.
Dataset publicly available at https://doi.org/10.5281/zenodo.15773093
Table 1: Summary of the evaluated synthetic RIR datasets: sampling rate, reverberation time (T60) model, receiver (rec) and source (src) directivity, and rendering method.

Dataset       Sampling rate   T60      rec directivity   src directivity   Rendering
DNS5          16k             single   ✗                 ✗                 ISM
SB            48k             single   ✗                 ✗                 ISM
MB            48k             multi    ✗                 ✗                 ISM
REC+MB        48k             multi    ✓                 ✗                 ISM
SRC+REC+MB    48k             multi    ✓                 ✓                 ISM
SSPA          48k             single   ✓                 ✗                 mesh
Beyond the monaural SE task, mesh-based RIR data generation and refinement methods have been proposed in [12], [13], for tasks such as audio-visual navigation. An example of this is the SoundSpaces dataset [14], which contains a set of RIRs from rooms with complex geometries, providing a grid of positions for each room, and is rendered by path-based methods on 3D meshes. However, the performance of SE models trained on these RIRs has not been evaluated yet.
In this work we focus on the effects of the RIRs used in training for SE. We evaluate six models trained on the same speech and noise datasets but on six different RIR datasets (two baselines, three of our own making and one from SoundSpaces), with the intention of answering whether models benefit from using RIRs with frequency-dependent (multiband) absorption coefficients, whether modeling the receiver directivity or the source directivity helps monaural models to generalize [8], [9], and whether mesh-based modeling might be helpful or more suitable than the Image Source Method (ISM) [9].
To this end, we propose an evaluation of the following RIR datasets, which are summarized above in Table 1.
• DNS5: state of the art (SOTA) baseline from [7].
• SB: we re-create DNS5 at a 48kHz sampling rate using single-band absorption coefficients on the shoebox room material.
• MB: we use multi-band absorption coefficients instead.
• REC+MB: we add receiver directivity by using Head-Related Transfer Functions (HRTFs) at the rendering stage.
• SRC+REC+MB: we add source directivity by modelling average human speech directivity in the ISM.
• SSPA: we use mesh-based RIRs from SoundSpaces instead.
arXiv:2507.09750v1 [cs.SD] 13 Jul 2025
2. TRAINING DATASETS
Regarding speech and noise, we have used the whole DNS5, which contains emotional speech, readings from English audiobooks using different accents, singing from VocalSet, French, Spanish, Italian, Russian and German speech from M-AILABS Speech, German speech from Wikipedia, and Spanish from the SLR73, SLR61, SLR39, SLR75, SLR74 and SLR71 sets, which add up to a total of 583k speech utterances, and 63k noise samples from FreeSound and AudioSet [1]. Speech and noise add up to 1315 and 177 hours respectively. We have split them into 70% for training, and the remaining 30% split equally between validation and test sets, avoiding cross-set speaker contamination when possible. Below we describe the RIR training sets.
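A speaker-disjoint 70/15/15 split of the kind described above could be sketched as follows. This is only an illustration and not the authors' procedure: the helper name `speaker_disjoint_split`, the `(utterance_id, speaker_id)` input format and the greedy assignment of whole speakers are all assumptions.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(utterances, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split (utterance_id, speaker_id) pairs so that no speaker spans two sets.

    Hypothetical helper: whole speakers are assigned greedily until each set
    reaches its cumulative share of the utterances.
    """
    by_speaker = defaultdict(list)
    for utt_id, speaker in utterances:
        by_speaker[speaker].append(utt_id)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    names = ("train", "valid", "test")
    splits = {name: [] for name in names}
    total = len(utterances)
    # Cumulative utterance counts at which we move on to the next set.
    limits = (ratios[0] * total, (ratios[0] + ratios[1]) * total)

    assigned, idx = 0, 0
    for speaker in speakers:
        while idx < 2 and assigned >= limits[idx]:
            idx += 1
        splits[names[idx]].extend(by_speaker[speaker])
        assigned += len(by_speaker[speaker])
    return splits
```

Because speakers are kept whole, the resulting proportions are only approximately 70/15/15 when speakers contribute different numbers of utterances.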
2.1. Baseline RIRs (DNS5)
We have taken as baseline 60k RIRs from DNS5 (synthetic RIRs from SLR26 and real RIRs from SLR28, originally at a 16kHz sampling rate and upsampled to 48kHz). We have distributed small, medium, and large rooms into three splits: 70%, 15% and 15%, corresponding to training, validation and test sets.
2.2. Single-band absorption coefficients RIRs (SB)
With the intention of obtaining a fair comparison between the different strategies, we have created an approximation of the baseline DNS5 RIRs dataset using the Multichannel Acoustic Signal Processing (MASP) library [15], [16]. Specifically, we have generated 60k shoebox-like RIRs with single-band absorption coefficients, this time rendering at a 48kHz sampling rate. The geometric configurations of these RIRs are described in Table 2: room dimensions $\mathbf{r}$ have been sampled from uniform distributions, with $r_x$ affecting $r_y$ in order to avoid corridor-like geometries. Single-band absorption coefficients for all six walls of the shoebox-like room are computed via Sabine's formula, using $\mathbf{r}$ and the reverberation time $T_{60}$. In contrast to the multiband case (see 2.3), where $\mathbf{T}_{60}$ is a vector defining the reverberation time in each frequency band, in the single-band case $T_{60}$ is a scalar computed as the mean of the $\mathbf{T}_{60}$ vector. The $T_{60}$ used for SB can be approximated by a normal distribution. Receiver coordinates $\mathbf{rec}$ have also been set randomly for every room, avoiding positions too close to the walls. The sources $\mathbf{src}$ have been placed around the receivers, at a distance between 0.5 and 3 meters and in front of them. We have used this same set of room configurations for the rest of the evaluated datasets, except for the pre-computed SoundSpaces dataset SSPA.
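The single-band computation can be illustrated with a minimal sketch, assuming the textbook metric form of Sabine's formula, $T_{60} = 0.161\,V/(S\,\alpha)$, with one coefficient shared by all six walls. The function names, the clipping at 1.0 and the constant 0.161 are assumptions of this sketch; the MASP library may implement the conversion differently.

```python
import random


def sample_room_dimensions(rng):
    """Draw shoebox dimensions in metres following the ranges of Table 2."""
    r_x = rng.uniform(3.0, 30.0)
    r_y = r_x * rng.uniform(0.5, 1.0)  # tied to r_x to avoid corridor-like rooms
    r_z = rng.uniform(2.5, 5.0)
    return r_x, r_y, r_z


def sabine_absorption(dimensions, t60):
    """Invert Sabine's formula T60 = 0.161 * V / (S * alpha) for alpha.

    Assumes the same coefficient on all six walls and the usual metric
    constant 0.161 s/m; the result is clipped to 1.0 as a safeguard.
    """
    r_x, r_y, r_z = dimensions
    volume = r_x * r_y * r_z
    surface = 2.0 * (r_x * r_y + r_x * r_z + r_y * r_z)
    alpha = 0.161 * volume / (surface * t60)
    return min(alpha, 1.0)


rng = random.Random(0)
room = sample_room_dimensions(rng)
alpha = sabine_absorption(room, t60=0.4)
```

For a 5 m x 4 m x 3 m room and a reverberation time of 0.4 s, this gives an absorption coefficient of roughly 0.26.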
2.3. Multiband absorption coefficients RIRs (MB)
The first of the three strategies we have proposed is to increase the dataset coverage by using multiband absorption coefficients, derived from a per-band reverberation time vector $\mathbf{T}_{60}$ [9], instead of a single $T_{60}$ value as in the SB case. We have depicted the overall structure of the data generation pipeline in Figure 1. To ensure the generation of realistic parameters, we have analyzed 4495 real $T_{60}$ values from [17] in six frequency bands $\omega_n = \{125, 250, 500, 1\mathrm{k}, 2\mathrm{k}, 4\mathrm{k}\}$ in Hz. We have modeled each band as an independent $\mathrm{Gamma}(\alpha, \beta)$ distribution, fitted by minimizing the negative log-likelihood function. We have obtained shape $\alpha = \{1.72, 1.62, 1.93, 2.56, 4.17, 2.49\}$ and scale $\beta = \{0.39, 0.24, 0.14, 0.10, 0.09, 0.18\}$. MB RIRs are generated by running one ISM acoustic simulation for each sub-band, passing the sub-band RIRs through a filterbank comprised of a low-pass filter ($LPF$) for $n = 1$, band-pass filters ($BPF_{\omega_n}$) for $n = \{2, ..., N-1\}$ and a high-pass filter ($HPF$) for the highest band, and finally summing the filtered RIRs.

Table 2: Random room configurations for SB, MB, REC+MB, SRC+REC+MB. Angles are in degrees, vectors are in bold.

$r_x = U(3, 30)$                                      $rec_x = U(0.35\,r_x, 0.65\,r_x)$
$r_y = r_x \cdot U(0.5, 1)$                           $rec_y = U(0.35\,r_y, 0.65\,r_y)$
$r_z = U(2.5, 5)$                                     $rec_z = U(1, 2)$
$\|\mathbf{rec} - \mathbf{src}\| = U(0.5, 3)$         $rec_\phi = U(-45, 45)$
$\angle(\mathbf{rec}, \mathbf{src}) = U(-45, 45)$     $rec_\theta = U(-10, 10)$
$T_{60} \approx N(0.4, 0.014)$                        $T_{60} = \frac{1}{N}\sum_{n} T_{60,n}$
$\mathbf{T}_{60} = \mathrm{Gamma}(\alpha, \beta)$

Fig. 1: RIR generation setup: the same set of geometric configurations is shared among the different datasets. Continuous line follows the MB pipeline, dotted line is for SB, short dashed line stands for the REC+MB modeling and long dashed line stands for the SRC modeling.
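Drawing the per-band reverberation times from the fitted distributions of Section 2.3 can be sketched with the Python standard library, whose `random.gammavariate(alpha, beta)` takes the shape and scale in that order. The function names are illustrative, and reading the drawn values as seconds is an assumption; the ISM rendering and filterbank stages are not reproduced here.

```python
import random

BANDS_HZ = (125, 250, 500, 1000, 2000, 4000)
ALPHA = (1.72, 1.62, 1.93, 2.56, 4.17, 2.49)  # Gamma shape per band (Section 2.3)
BETA = (0.39, 0.24, 0.14, 0.10, 0.09, 0.18)   # Gamma scale per band (Section 2.3)


def sample_multiband_t60(rng):
    """Draw one reverberation time per band, each band independently."""
    return [rng.gammavariate(a, b) for a, b in zip(ALPHA, BETA)]


def single_band_t60(t60_bands):
    """Scalar T60 for the SB case: the mean of the multiband vector."""
    return sum(t60_bands) / len(t60_bands)


rng = random.Random(0)
t60_mb = sample_multiband_t60(rng)
t60_sb = single_band_t60(t60_mb)
```

The mean of each Gamma distribution is the product of shape and scale, about 0.67 at 125 Hz and about 0.26 at 1 kHz. Averaged over the six bands this is close to 0.40, in line with the scalar $T_{60}$ mean reported in Table 2.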
2.4. Receiver directivity RIRs (REC+MB)
In [8], the augmented reality glasses' directivity and the acoustical shadow of the head were shown to be relevant for speech separation and recognition. Likewise, SE from in-ear or headset devices may also benefit from modeling this directivity. Although there might be individual differences between the device types, they all share the characteristic of being placed near the human ear. We have modeled the receiver directivity in this scenario by applying a Head-Related Transfer Function (HRTF) to every reflection. To this end, we have used the same methodology as in [18]: we obtain the Spherical Harmonics (SH) expansion of the RIRs from MASP and then use a Bilateral Magnitude Least Squares (BiMagLS) [19] decoder that implicitly applies the HRTFs, here focusing our evaluation on the left ear and using a set of normal-hearing HRTFs of a Neumann KU100 dummy head.
2.5. Source directivity RIRs (SRC+REC+MB)
In SE, sources are always speech, so we have applied its average directivity by taking the radiation pattern from [20] and converting it into the azimuth and elevation lookup table $D_{\phi,\theta}$. As depicted in Figure 1, $D_{\phi,\theta}$ can be applied in the SH pipeline right after the ISM method and prior to the filterbank rendering and summation. We have weighted the amplitude of each ISM reflection with the closest $D_{\phi,\theta}$. Both angles have been obtained through acoustical reciprocity [21]: swapping $\mathbf{src}$ and $\mathbf{rec}$ coordinates in the ISM we obtain a list of reciprocal reflections whose angle with respect to the receiver can be computed directly. Each reciprocal reflection angle with respect to the receiver is equivalent to the emission angle (with respect to the source) in the original ISM.
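The reciprocity argument and the nearest-entry lookup can be illustrated with a small geometric sketch. Everything below is hypothetical: the 5-degree grid, the cardioid-like placeholder table (standing in for the measured speech pattern of [20]) and the assumption that the talker faces the +x axis. It covers a single first-order reflection rather than a full ISM.

```python
import math


def direction_angles(origin, target):
    """Azimuth and elevation (degrees) of `target` as seen from `origin`."""
    dx, dy, dz = (t - o for t, o in zip(target, origin))
    azimuth = math.degrees(math.atan2(dy, dx))
    elevation = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return azimuth, elevation


def nearest_gain(table, azimuth, elevation, step=5.0):
    """Look up the closest entry of a table sampled every `step` degrees."""
    key = (round(azimuth / step) * step, round(elevation / step) * step)
    return table[key]


# Toy cardioid-like table, a placeholder for the measured pattern of [20].
TOY_TABLE = {
    (float(az), float(el)):
        0.5 * (1.0 + math.cos(math.radians(az))) * math.cos(math.radians(el))
    for az in range(-180, 181, 5)
    for el in range(-90, 91, 5)
}

src = (1.0, 1.0, 1.0)
rec = (2.0, 1.0, 1.0)
# First-order reflection on the wall x = 0: mirroring the receiver (instead of
# the source) gives the direction in which the ray leaves the source.
rec_image = (-rec[0], rec[1], rec[2])
azimuth, elevation = direction_angles(src, rec_image)
gain = nearest_gain(TOY_TABLE, azimuth, elevation)
```

In this toy setup the reflected ray leaves the source towards the back of the talker (azimuth 180 degrees), so the placeholder pattern attenuates it completely, whereas the direct path at azimuth 0 would keep unit gain.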
2.6. Mesh-based RIRs (SoundSpaces, SSPA)
The SSPA dataset contains only 103 scenes, far fewer than the number of rooms in SB. Nevertheless, we have included SSPA in this evaluation as a well-tested reference from fields like navigation [14] and 3D SE [22]. To our knowledge, it is the only publicly available dataset which can be compared to REC+MB, because it also uses Ambisonics for rendering into binaural and multiband absorption coefficients. It additionally provides complex geometries by path-tracing through 3D meshes. To keep the number of RIRs consistent, we have taken 60k RIRs, also focusing on the left channel of the binaural downmix as in REC+MB and SRC+REC+MB.
3. EXPERIMENTAL SETUP
Given the speech, noise and the six RIR datasets described above, we have trained six DeepFilterNet3 [4] SE models (one for every RIR dataset) using its default hyperparameters, except for a smaller maximum batch size of 38 in order to fit our GPUs and a p_reverb of 1 (to apply a RIR to every utterance). We have chosen DeepFilterNet3 because it has SOTA performance while being open source and having real-time capabilities. After training for 117 to 120 epochs (depending on the early stopping), our evaluation has been three-fold. Firstly, we have applied the models to monaural noisy utterances convolved with real RIRs and computed a set of intrusive and non-intrusive metrics. Secondly, we have used a few processed samples to evaluate subjectively with a MUSHRA listening test [23]. Thirdly, we have applied the models to the headset and speakerphone DNS5 test sets to address the effects of directivity. Since these test sets contain real mixtures (no separate clean speech ground truth is available), metrics are restricted to non-intrusive ones.
3.1. Objective evaluation
Speech and noise for these simulations have been taken from the DNS5 test set splits. More precisely, we have taken 10k high-quality read speech samples from the VCTK [27] and LibriVox [28] subsets (7k and 3k respectively). We have used 397 real RIRs from the ACE [29], MIT IR Survey [30], Openair [31] and BUT Reverb [32] datasets, and the RIRs from SLR28 [7] that were not used during training. Noises and RIRs have been reused as many times as necessary to fit the speech test set size.
From each speech sample we have taken a non-silent four-second chunk $y$, convolved it with a real RIR $h$ and summed it to a noise sample $n$ using $SNR = U(0, 30)$, obtaining the noisy and reverberant $x$. We have made sure that time synchronicity between $x$ and $y$ is kept, but when addressing this we have found that certain real RIR onsets are harder to detect than their non-noisy synthetic counterparts, so the vicinity of the highest amplitude has been checked for any earlier peak that surpassed a -6dB threshold. We have also found that the broadly used correlation-based synchronization method is less robust to RIR noise than our thresholding heuristic.
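One possible reading of the thresholding heuristic is sketched below: locate the largest absolute sample, then look back over a short window for the earliest sample within 6 dB of it. The window length `search_window` is a hypothetical value, and treating "peak" as any sample above the threshold is a simplification made for this sketch, not a detail given in the text.

```python
def find_onset(rir, search_window=240, threshold_db=-6.0):
    """Return the index of the direct-sound onset of an impulse response.

    Starts from the largest absolute sample and looks back over
    `search_window` samples for the earliest one within `threshold_db`
    of that maximum. `search_window` is a hypothetical value.
    """
    peak_index = max(range(len(rir)), key=lambda i: abs(rir[i]))
    threshold = abs(rir[peak_index]) * 10.0 ** (threshold_db / 20.0)
    start = max(0, peak_index - search_window)
    for index in range(start, peak_index):
        if abs(rir[index]) >= threshold:
            return index
    return peak_index
```

For example, in `[0.0, 0.0, 0.01, 0.6, 0.2, 1.0, 0.3]` the maximum sits at index 5, but the sample at index 3 is already within 6 dB of it, so index 3 is taken as the onset.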
Note that in the traditional signal model $x = (y * h) + n$, noise can be non-reverberant while the speech is, which can constitute a limitation despite being broadly used in most of the SE literature. To further investigate this, we have also conducted the evaluation using $x = (y + n) * h$, a signal model in which both speech and noise have a more spectrally-coherent reverberation at the expense of using the exact same RIR (and therefore placing speech and noise at the exact same position in the simulation space, which is deemed unfeasible). Results on this alternative signal model are not reported here for the sake of brevity, but were found to be very similar to the results from the more traditional $x = (y * h) + n$ we report below.
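The two signal models can be contrasted with a small pure-Python sketch. The signals are random stand-ins, the RIR is a toy three-tap response, and two points are assumptions of this sketch rather than details given in the text: that the SNR is expressed in dB, and that it is measured against the reverberant speech in the traditional model and against the dry speech in the alternative one.

```python
import math
import random


def convolve(signal, rir):
    """Direct-form convolution, truncated to the length of `signal`."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            if i + j < len(out):
                out[i + j] += s * h
    return out


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def scale_noise(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db`."""
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [gain * v for v in noise]


rng = random.Random(0)
y = [rng.uniform(-1.0, 1.0) for _ in range(400)]   # stand-in for clean speech
n = [rng.uniform(-1.0, 1.0) for _ in range(400)]   # stand-in for noise
h = [1.0, 0.0, 0.5, 0.0, 0.25]                     # toy RIR
snr_db = rng.uniform(0.0, 30.0)

# Traditional model: x = (y * h) + n, only the speech is reverberant.
reverberant = convolve(y, h)
scaled = scale_noise(reverberant, n, snr_db)
x_traditional = [a + b for a, b in zip(reverberant, scaled)]

# Alternative model: x = (y + n) * h, speech and noise share the same RIR.
scaled_dry = scale_noise(y, n, snr_db)
x_alternative = convolve([a + b for a, b in zip(y, scaled_dry)], h)
```

Since convolution is linear, the alternative model is equivalent to reverberating speech and noise separately with the same RIR and summing them, which is why it places both at the same position in the room.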
Regarding metrics, we have used almost all of the ones from the URGENT Challenge [6], with the addition of the Scale-Invariant Signal-to-Distortion Ratio ($SISDR$) and $SISDR_{squim}$, $PESQ_{squim}$, $MOS_{squim}$ and $STOI_{squim}$ from [24]. Because reverberation can mislead some metrics' values and our dataset is the first dataset of this kind that we are aware of, we are reporting the results as increments with respect to the noisy and reverberant signal. Given a DeepFilterNet3 model $f$ and clean speech estimate $\hat{y} = f(x)$, we report evaluation metric $m$ increments as $\Delta m = m(\hat{y}) - m(x)$ for non-intrusive metrics and $\Delta m = m(\hat{y}, y) - m(x, y)$ for the intrusive ones that require the reference $y$. For DeepFilterNet3 absolute metrics on the commonly used Voicebank+Demand test set we refer the reader to [4].
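The increment definitions can be made concrete with a short sketch that uses a plain SI-SDR implementation as the intrusive metric. The function names and the toy signals are illustrative only; the evaluation in the paper relies on the URGENT Challenge and SQUIM tooling rather than on this code.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between an estimate and its reference."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    error_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    return 10.0 * math.log10(target_energy / error_energy)


def intrusive_increment(metric, estimate, noisy, reference):
    """Delta m = m(y_hat, y) - m(x, y)."""
    return metric(estimate, reference) - metric(noisy, reference)


def non_intrusive_increment(metric, estimate, noisy):
    """Delta m = m(y_hat) - m(x)."""
    return metric(estimate) - metric(noisy)


y = [1.0, -0.5, 0.25, 0.8, -1.0, 0.3]                         # clean reference
x = [v + d for v, d in zip(y, [0.4, -0.3, 0.5, -0.4, 0.3, -0.5])]        # noisy input
y_hat = [v + d for v, d in zip(y, [0.1, -0.05, 0.1, -0.1, 0.05, -0.1])]  # enhanced
delta = intrusive_increment(si_sdr, y_hat, x, y)
```

A positive `delta` means the enhanced signal is closer to the reference than the unprocessed input was, which is how the values in the results tables should be read.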
Fig. 2: MUSHRA scores mean and 95% confidence intervals of each model. (Horizontal axis: MUSHRA score, ticks from 20 to 60, spanning the Poor and Fair ranges; one row per model: DNS5, SB, MB, REC+MB, SRC+REC+MB, SSPA.)
3.2. Subjective evaluation
In order to further validate the objective evaluation, we have picked eight reverberant synthetic speech and noise mixtures that use real RIRs from the same set as in the objective evaluation and have conducted an online MUSHRA listening test using [23], [33], taking $y$ as the 48kHz high-quality clean reference and hidden reference, $x$ as anchor, and the six different model estimates $\hat{y}$, using a scale from 0 for Bad to 100 for Excellent. All 33 participants have declared to be expert listeners (professional audio producers or researchers with experience doing listening tests). Only one subject had to be excluded during the post-screening due to misjudging the anchor.
3.3. Computational requirements
Due to the sequential nature of the ISM and the computational demands of high-order SH, generating all the RIR datasets simultaneously has required a server with 64 CPU cores and 250GB of RAM for 10 days. Training all six DeepFilterNet3 models took 30 days on two A100s.
4. RESULTS AND DISCUSSION
Subjective results of the enhanced speech and noise mixtures convolved with real RIRs are depicted in Figure 2.

To compare the conditions, we used a pairwise t-test with a significance threshold of $p < 0.05$. MUSHRA scores show no statistically significant difference between DNS5 and SSPA (t-test: $p = 0.55$), which points out that variety and quantity of rooms are factors to take into account on top of RIR complexity. Note that SSPA additionally models complex room geometries and furniture but, despite having the same number of RIRs as SB, it contains far fewer different rooms. SB significantly outperforms the DNS5 baseline (t-test: $p = 0.003$), which highlights the importance of extending the sampling rate from 16kHz to 48kHz.

All models apart from SSPA have received significantly better scores than the DNS5 baseline (SB: $p = 3 \cdot 10^{-3}$, MB: $p = 7 \cdot 10^{-14}$, REC+MB: $p = 8 \cdot 10^{-7}$ and SRC+REC+MB: $p = 2 \cdot 10^{-14}$), indicating that both the higher sampling rate and adding MB, SRC and REC features into the acoustic simulation improve the speech enhancement. Interestingly, there are no significant differences between MB and SRC+REC+MB (t-test: $p = 0.86$), which suggests that modeling directivity does not degrade monaural SE performance when evaluated on real RIRs. We cannot assess SRC and REC directivities with the real RIR test set, but note that MB and SRC+REC+MB outperform the rest and that all multiband RIR datasets perform similarly. The scores are presented in Table 3, in addition to objective intrusive and non-intrusive results.
Similar to the MUSHRA scores, the objective results on the same set (the real RIRs test set) indicate that MB RIRs outperform the rest. However, there is no consensus among all intrusive metrics, nor between intrusive and non-intrusive metrics. While most intrusive metrics follow the subjective scores and show a higher performance for the MB and SRC+REC+MB datasets, the $SISDR$, $MCD$ and Phonetic Similarity ($PhonSim$) metrics favor REC+MB. Agreement among the non-intrusive metrics is higher than among the intrusive ones, with MB slightly outperforming the rest (t-test: $p < 0.05$), perhaps due to a bias in the NISQA and SQUIM training data, although interestingly that bias would favor MB rather than SB.

Table 3: Evaluation results on real RIRs using DNS5 read speech. Mean values ± standard deviation. The higher the better (⇑) except for Log-Spectral Distance and Mel-Cepstral Distortion (⇓). SQUIM [24], NISQA [25] and DNSMOS [26] are non-intrusive neural-based approximations of the metrics.

Metric            DNS5           SB             MB             REC+MB         SRC+REC+MB     SSPA
Intrusive
∆SISDR ⇑          1.105±3.50     1.335±3.49     1.361±3.45     1.965±4.52     1.9±4.41       0.888±3.27
∆SDR ⇑            2.817±3.17     3.298±3.02     3.328±3.00     2.674±3.19     2.706±3.17     2.025±3.11
∆LSD ⇓            1.213±1.41     1.34±1.42      1.26±1.40      0.841±1.43     0.778±1.43     1.105±1.43
∆MCD ⇓            -1.434±2.00    -1.479±1.98    -1.504±1.99    -1.542±1.97    -1.533±1.98    -1.454±2.00
∆PESQ ⇑           0.52±0.47      0.583±0.47     0.593±0.47     0.595±0.46     0.609±0.46     0.51±0.45
∆STOI ⇑           0.088±0.07     0.101±0.07     0.102±0.07     0.101±0.06     0.101±0.06     0.082±0.06
∆PhonSim ⇑        0.051±0.15     0.058±0.15     0.058±0.15     0.063±0.15     0.0626±0.15    0.061±0.15
∆SpkSim ⇑         -0.055±0.13    -0.058±0.14    -0.057±0.13    -0.034±0.12    -0.03±0.11     -0.053±0.13
∆BertSim ⇑        0.081±0.07     0.086±0.07     0.087±0.07     0.089±0.07     0.09±0.07      0.082±0.07
Non-intrusive
∆SISDRsquim ⇑     3.793±5.87     4.456±5.71     4.544±5.67     3.814±5.65     4.038±5.63     2.291±5.61
∆MOSsquim ⇑       0.547±0.71     0.537±0.71     0.541±0.71     0.523±0.70     0.506±0.71     0.541±0.70
∆MOSdnsmos ⇑      0.884±0.53     0.935±0.54     0.937±0.53     0.895±0.53     0.906±0.53     0.811±0.52
∆MOSnisqa ⇑       1.158±0.73     1.198±0.73     1.224±0.72     1.109±0.72     1.099±0.72     0.977±0.73
∆PESQsquim ⇑      0.636±0.62     0.667±0.59     0.678±0.59     0.613±0.58     0.637±0.58     0.525±0.57
∆STOIsquim ⇑      0.06±0.08      0.07±0.08      0.071±0.08     0.063±0.08     0.066±0.08     0.048±0.08
Subjective
MUSHRA ⇑          33.59±20.3     39.15±22.6     48.05±22.8     42.65±22.5     48.39±23.0     32.49±21.7
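The headline gains quoted in the abstract appear consistent with simple differences between the mean values of Table 3, as the short check below illustrates. Which baseline each figure refers to is an inference from the numbers (MB against DNS5 for SDR, MB against SB for MUSHRA), not something the text states explicitly.

```python
# Mean increments copied from Table 3.
delta_sdr = {"DNS5": 2.817, "SB": 3.298, "MB": 3.328}
mushra = {"DNS5": 33.59, "SB": 39.15, "MB": 48.05}

sdr_gain_vs_dns5 = round(delta_sdr["MB"] - delta_sdr["DNS5"], 2)   # 0.51 dB
mushra_gain_vs_sb = round(mushra["MB"] - mushra["SB"], 2)          # 8.9 points
mushra_gain_vs_dns5 = round(mushra["MB"] - mushra["DNS5"], 2)      # 14.46 points
```

Under this reading, the +8.9 MUSHRA figure isolates the effect of multiband absorption at a fixed 48kHz sampling rate, while the gain over the 16kHz DNS5 baseline is larger.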
Comprehensive results for the DNS5 headset and speakerphone test sets are shown in Table 4. A priori, one would expect SRC+REC+MB to outperform the rest for headset, but this is not the case. For both headset and speakerphone, most non-intrusive metrics show MB models slightly outperforming the rest, as they did for the real RIR test set, but differences are not statistically significant (e.g. t-test: $p = 0.9$ for $SISDR_{squim}$ between MB and SB). Therefore, we cannot find evidence of benefits from directivity modeling.

Table 4: Non-intrusive evaluation results on real noisy and reverberant recordings from the DNS5 headset and speakerphone sets. HDS stands for headset and SPK for speakerphone. The higher the better for all metrics.

Metric        Set   DNS5    SB      MB      REC+MB   SRC+REC+MB   SSPA
SISDRsquim    HDS   15.65   15.75   15.80   15.35    15.42        14.21
              SPK   16.81   16.86   17.05   16.65    16.82        15.51
MOSsquim      HDS   3.92    3.89    3.90    3.91     3.94         3.91
              SPK   4.15    4.15    4.15    4.14     4.14         4.14
MOSdnsmos     HDS   3.01    3.04    3.05    3.04     3.02         3.01
              SPK   3.03    3.05    3.06    3.03     3.04         2.99
MOSnisqa      HDS   3.54    3.64    3.68    3.61     3.56         3.43
              SPK   3.77    3.82    3.82    3.77     3.76         3.62
PESQsquim     HDS   2.58    2.59    2.62    2.55     2.54         2.45
              SPK   2.64    2.63    2.64    2.56     2.59         2.49
STOIsquim     HDS   0.92    0.92    0.92    0.92     0.92         0.91
              SPK   0.94    0.94    0.94    0.94     0.94         0.93
Although [8] showed that directivity could be exploited even in monaural models, we could not replicate the results, either because of the limitations of the NISQA and SQUIM metrics, or due to limitations in the DNS5 test set. In both cases, the lack of a clear improvement when applying directivity brings into question whether MB RIRs perform better thanks to being more domain-consistent or just because they are more diverse [34] than SB. We would like to address this in future work, perhaps by comparing acoustically implausible random $T_{60}$ RIRs with the ones from Section 2.3.

Overall, we recommend training SE models with high sampling rate multiband RIRs, potentially incorporating SRC and REC directivities depending on the use case. We make both MB and SRC+REC+MB available at https://doi.org/10.5281/zenodo.15773093.
5. CONCLUSIONS
We have presented an evaluation of three Room Impulse Response (RIR) generation techniques using the state-of-the-art, real-time capable Speech Enhancement (SE) model DeepFilterNet3. Specifically, we have shown that the idea of extending the RIR coverage to frequency-dependent acoustic absorption coefficients, which has been shown to be successful for keyword spotting, can also benefit modern SE models. We have also found that the amount and variability of RIRs can be as important as the model complexity, showing that a large number of ISM-based rooms can outperform fewer mesh-based rooms (SoundSpaces) when evaluated on real RIRs. Besides, we have shown that source and receiver directivities do not degrade performance in the monaural SE task and have the potential to increase SE performance, but we could not provide strong evidence of the improvement. In conclusion, we recommend training with a varied 48kHz multiband RIR dataset: MB-RIRs.
REFERENCES
[1] H. Dubey, A. Aazami, V. Gopal, B. Naderi, S. Braun, R. Cutler, A. Ju, M. Zohourian, M. Tang, M. Golestaneh et al., "ICASSP 2023 deep noise suppression challenge," IEEE Open Journal of Signal Processing, vol. 142, pp. 725–737, Mar. 2024.
[2] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, "Speech enhancement and dereverberation with diffusion-based generative models," IEEE Trans. Audio, Speech, Lang. Process., vol. 31, pp. 2351–2364, Jun. 2023.
[3] B. Kadıoğlu, M. Horgan, X. Liu, J. Pons, D. Darcy, and V. Kumar, "An empirical study of Conv-TasNet," in Proc. ICASSP, May 2020, pp. 7264–7268.
[4] H. Schröter, T. Rosenkranz, A. Maier et al., "DeepFilterNet: Perceptually motivated real-time speech enhancement," in Proc. Interspeech, Aug. 2023, pp. 2008–2009.
[5] W. Zhang, K. Saijo, Z.-Q. Wang, S. Watanabe, and Y. Qian, "Toward universal speech enhancement for diverse input conditions," in Proc. ASRU, Dec. 2023, pp. 1–6.
[6] W. Zhang, R. Scheibler, K. Saijo, S. Cornell, C. Li, Z. Ni, J. Pirklbauer, M. Sach, S. Watanabe, T. Fingscheidt, and Y. Qian, "URGENT challenge: Universality, robustness, and generalizability for speech enhancement," in Proc. Interspeech, Sep. 2024, pp. 4868–4872.
[7] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. ICASSP, Mar. 2017, pp. 5220–5224.
[8] R. Arakawa, M. Parvaix, C. Lai, H. Erdogan, and A. Olwal, "Quantifying the effect of simulator-based data augmentation for speech recognition on augmented reality glasses," in Proc. ICASSP, Apr. 2024, pp. 726–730.
[9] E. Bezzam, R. Scheibler, C. Cadoux, and T. Gisselbrecht, "A study on more realistic room simulation for far-field keyword spotting," in Proc. APSIPA ASC, Dec. 2020, pp. 674–680.
[10] A. Ratnarajah, Z. Tang, and D. Manocha, "TS-RIR: Translated synthetic room impulse responses for speech augmentation," in Proc. ASRU, Dec. 2021, pp. 259–266.
[11] J. Lin, G. Götz, H. S. Llopis, H. Hafsteinsson, S. Guðjónsson, D. G. Nielsen, F. Pind, P. Smaragdis, D. Manocha, J. Hershey et al., "Generative data augmentation challenge: Synthesis of room acoustics for speaker distance estimation," in Proc. ICASSPW, Apr. 2025, pp. 1–5.
[12] Z. Tang, R. Aralikatti, A. J. Ratnarajah, and D. Manocha, "GWA: A large high-quality acoustic dataset for audio processing," in Proc. SIGGRAPH, Aug. 2022, pp. 1–9.
[13] L. Kelley, D. Di Carlo, A. A. Nugraha, M. Fontaine, Y. Bando, and K. Yoshii, "RIR-in-a-box: Estimating room acoustics from 3D mesh data through shoebox approximation," in Proc. Interspeech, Sep. 2024, pp. 3255–3259.
[14] C. Chen, C. Schissler, S. Garg, P. Kobernik, A. Clegg, P. Calamia, D. Batra, P. Robinson, and K. Grauman, "SoundSpaces 2.0: A simulation platform for visual-acoustic learning," Proc. NeurIPS, vol. 35, pp. 8896–8911, Nov. 2022.
[15] A. Perez-Lopez and A. Politis, "A Python library for multichannel acoustic signal processing," in Audio Eng. Soc. Conv. 148, May 2020.
[16] A. Politis, Microphone array processing for parametric spatial audio techniques. Espoo, Finland: Aalto University, 2016.
[17] K. Sridhar, R. Cutler, A. Saabas, T. Parnamaa, M. Loide, H. Gamper, S. Braun, R. Aichner, and S. Srinivasan, "ICASSP 2021 acoustic echo cancellation challenge: Datasets, testing framework, and results," in Proc. ICASSP, Jun. 2021, pp. 151–155.
[18] E. Gusó, J. Luberadzka, M. Baig, U. Sayin, and X. Serra, "An objective evaluation of hearing aids and DNN-based binaural speech enhancement in complex acoustic scenes," in Proc. WASPAA, Oct. 2023, pp. 1–5.
[19] I. Engel, D. Goodman, and L. Picinali, "Improving binaural rendering with bilateral ambisonics and MagLS," in Annual German Conference on Acoustics, vol. 99, 2021, p. 10.
[20] T. W. Leishman, S. D. Bellows, C. M. Pincock, and J. K. Whiting, "High-resolution spherical directivity of live speech from a multiple-capture transfer function method," J. Acoust. Soc. Am., vol. 149, pp. 1507–1523, Mar. 2021.
[21] P. Samarasinghe, T. D. Abhayapala, and W. Kellermann, "Acoustic reciprocity: An extension to spherical harmonics domain," J. Acoust. Soc. Am., vol. 142, no. 4, pp. EL337–EL343, Oct. 2017.
[22] C. Marinoni, R. F. Gramaccioni, C. Chen, A. Uncini, and D. Comminiello, "Overview of the L3DAS23 challenge on audio-visual extended reality," in Proc. ICASSP, Jun. 2023, pp. 1–2.
[23] B. Series, "Method for the subjective assessment of intermediate quality level of audio systems," International Telecommunication Union Radiocommunication Assembly, vol. 2, 2014.
[24] A. Kumar, K. Tan, Z. Ni, P. Manocha, X. Zhang, E. Henderson, and B. Xu, "Torchaudio-squim: Reference-less speech quality and intelligibility measures in torchaudio," in Proc. ICASSP, May 2023, pp. 1–5.
[25] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, "NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets," in Proc. Interspeech, Aug. 2021, pp. 2958–1796.
[26] C. K. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. ICASSP, Jun. 2021, pp. 6493–6497.
[27] C. Veaux, J. Yamagishi, K. MacDonald et al., "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," University of Edinburgh. The Centre for Speech Technology Research, vol. 6, p. 15, Nov. 2017.
[28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in Proc. ICASSP, Apr. 2015, pp. 5206–5210.
[29] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor, "The ACE challenge - corpus description and performance evaluation," in Proc. WASPAA, Oct. 2015, pp. 1–5.
[30] I. Szöke, M. Skácel, L. Mošner, J. Paliesek, and J. Černocký, "Building and evaluation of a real room impulse response dataset," IEEE J. Sel. Top. Signal Process., vol. 13, pp. 863–876, Aug. 2019.
[31] S. Shelley and D. T. Murphy, "Openair: An interactive auralization web resource and database," in 129th Audio Eng. Soc. Convention, vol. II, Nov. 2010, pp. 1270–1278.
[32] I. Szöke, M. Skácel, L. Mošner, J. Paliesek, and J. Černocký, "Building and evaluation of a real room impulse response dataset," IEEE J. Sel. Top. Signal Process., vol. 13, no. 4, pp. 863–876, May 2019.
[33] D. Barry, Q. Zhang, P. W. Sun, and A. Hines, "Go listen: an end-to-end online listening test platform," Journal of Open Research Software, vol. 9, p. 20, Jul. 2021.
[34] C.-B. Jeon, G. Wichern, F. G. Germain, and J. Le Roux, "Why does music source separation benefit from cacophony?" in Proc. ICASSPW, Apr. 2024, pp. 873–877.