Proceedings of the 10th Workshop on Detection
and Classification of Acoustic Scenes and Events
(DCASE 2025)
Emmanouil Benetos, Frederic Font, Magdalena Fuentes, Irene
Martin Morato, and Martín Rocamora (eds.)
October 30-31, 2025
This work is licensed under a Creative Commons Attribution 4.0 International
License. To view a copy of this license, visit:
http://creativecommons.org/licenses/by/4.0/
Citation:
Emmanouil Benetos, Frederic Font, Magdalena Fuentes, Irene Martin Morato, and
Martín Rocamora (eds.), Proceedings of the 10th Workshop on Detection and
Classification of Acoustic Scenes and Events (DCASE 2025), Oct. 2025.
DOI: 10.5281/zenodo.17251589
ISBN: 978-84-09-77652-8
Table of contents
Efficient State-Space Model for Audio Anomaly Detection with Domain Adaptation
Emon, Jaka ia ; Anon, Taha im Rahman 1
Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge
Schmid, Florian; Primus, Paul; Heittola, Toni; Mesaros, Annamaria; Martin-Morato, Irene; Widmer, Gerhard 6
Correlation-Based Filtering for Unsupervised Anomalous Sound Detection
Bürli, Andrin; Hamdan, Sami 11
Towards Audio-based Zero-Shot Action Recognition in Kitchen Environments
Gebhard, Alexander; Triantafyllopoulos, Andreas; Tsangko, Iosif; Schuller, Björn W. 15
Handling Domain Shifts for Anomalous Sound Detection: A Review of DCASE-Related Work
Wilkinghoff, Kevin; Fujimura, Takuya; Imoto, Keisuke; Le Roux, Jonathan; Tan, Zheng-Hua; Toda, Tomoki 20
Adjusting Bias in Anomaly Scores via Variance Minimization for Domain-Generalized Discriminative Anomalous Sound Detection
Matsumoto, Masaaki; Fujimura, Takuya; Huang, WenChin; Toda, Tomoki 25
Region-Specific Audio Tagging for Spatial Sound
ZHAO, Jinzheng; Xu, Yong; Liu, Haohe; Berghi, Davide; Qian, Xinyuan; Kong, Qiuqiang; Zhao, Junqi; Plumbley, Mark; Wang, Wenwu 30
Listening or Reading? An Empirical Study of Modality Importance Analysis Across AQA Question Types
Yin, Zeyu; Cai, Yiqiang; Lyu, Xinyang; Deng, Pingsong; Li, Shengchen 35
ASDKit: A Toolkit for Comprehensive Evaluation of Anomalous Sound Detection Methods
Fujimura, Takuya; Wilkinghoff, Kevin; Imoto, Keisuke; Toda, Tomoki 40
LiB-TRAD: A Lithium Battery Thermal Runaway Acoustic Dataset for Anomaly Detection
WANG, xiaoliang; MING, ao; Chen, Meixin; JIN, hao 45
Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds
Cauzinille, Jules; Miron, Marius; Pietquin, Olivier; Hagiwara, Masato; Marxer, Ricard; Rey, Arnaud; Favre, Benoit 50
Description and Discussion on DCASE 2025 Challenge Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
Nishida, Tomoya; Noboru, Harada; Niizumi, Daisuke; Albertini, Davide; Sannino, Roberto; Pradolini, Simone; Augusti, Filippo; Imoto, Keisuke; Dohi, Kota; Purohit, Harsh; Endo, Takashi; Kawaguchi, Yohei 55
Audio-Based Pedestrian Detection in the Presence of Vehicular Noise
Kim, Yonghyun; Han, Chaeyeon; Sarode, Akash; Posner, Noah; Guhathakurta, Subhrajit; Lerch, Alexander 60
Towards Spatial Audio Understanding Via Question Answering
Sudarsanam, Parthasaarathy; Politis, Archontis 65
Comparison of Foundation Model Pre-Training Strategies and Architectures for Urban Garden Recordings
Koutsogeorgos, Parmenion; Härmä, Aki 70
Deployment of AI-based Sound Analysis Algorithms in Real-time Acoustic Sensors: Challenges and a Use Case
Sagasti, Amaia; Artís, Pere; Font, Frederic; Serra, Xavier 75
Bioacoustics on Tiny Hardware at the BioDCASE 2025 Challenge
Carmantini, Giovanni; Benhamadi, Yasmine; Carreau, Matthieu; Kwak, Minkyung; Morandi, Ilaria; Förstner, Friedrich; Hladik, Pierre-Emmanuel; Lagrange, Mathieu; Linhart, Pavel; Petrusková, Tereza; Lostanlen, Vincent; Kahl, Stefan 80
Analysing Human-Generated Captions for Audio and Visual Scenes
Martin, Irene; Sudarsanam, Parthasaarathy; Virtanen, Tuomas 85
Universal Incremental Learning for Few-Shot Bird Sound Classification
Mulimani, Manjunath; Mesaros, Annamaria 90
Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Berghi, Davide; Jackson, Philip 95
An Entropy-Guided Curriculum Learning Strategy for Data-Efficient Acoustic Scene Classification under Domain Shift
Zhang, Peihong; Liu, Yuxuan; Li, Zhixin; sang, ui; an, yizhou; cai, yiqiang; li, shengchen 100
Hierarchical and Multimodal Learning for Heterogeneous Sound Classification
Anastasopoulou, Panagiota; Dal Rí, Francesco; Serra, Xavier; Font, Frederic 105
Cross-Modal Attention Architectures for Language-Based Audio Retrieval
Calvet, Oscar; Torre Toledano, Doroteo 110
Sound Event Classification meets Data Assimilation with Distributed Fiber-Optic Sensing
Tonami, Noriyuki; Yajima, Yoshiyuki; Kohno, Wataru; Mishima, Sakiko; Kondo, Reishi; Hino, Tomoyuki 115
Synthetic data enables context-aware bioacoustic sound event detection
Hoffman, Benjamin; Robinson, David; Miron, Marius; Baglione, Vittorio; Canestrari, Daniela; Elias, Damian; Trapote, Eva; Cusimano, Maddie; Effenberger, Felix; Hagiwara, Masato; Pietquin, Olivier 120
Enhancing Multiscale Features for Efficient Acoustic Scene Classification with One-Dimensional Separate CNN
He, Yuxuan; Raake, Alexander; Abeßer, Jakob 125
Latent Multi-view Learning for Robust Environmental Sound Representations
Ding, Sivan; Wilkins, Julia; Fuentes, Magdalena; Bello, Juan Pablo 130
Robust Detection of Overlapping Bioacoustic Sound Events
Mahon, Louis; Hoffman, Benjamin; Cusimano, Maddie; Hagiwara, Masato; James, Logan; Woolley, Sarah; Effenberger, Felix; Keen, Sara; Liu, Jen-yu; Pietquin, Olivier 135
Stereo Sound Event Localization and Detection with Onscreen/Offscreen Classification
Shimada, Kazuki; Politis, Archontis; Roman, Iran; Sudarsanam, Parthasaarathy; Diaz-Guerra, David; Pandey, Ruchi; Uchida, Kengo; Koyama, Yuichiro; Takahashi, Naoya; Shibuya, Takashi; Takahashi, Shusuke; Virtanen, Tuomas; Mitsufuji, Yuki 140
A Revisit of Audio Evaluation through Human Impressions: Defining and Modeling a Multidimensional Perceptual Task
Nishijima, Hiroshi; Saito, Daisuke; Minematsu, Nobuaki 145
Sound Event Detection using Time-frequency Bounding Boxes with a Self-Supervised Audio Spectrogram Transformer
Zhu, Zhi; Sato, Yoshinao 150
A Lightweight Temporal Attention Module for Frequency Dynamic Sound Event Detection
Zhang, Yuliang 155
MIMII-Agent: Leveraging LLMs with Function Calling for Relative Evaluation of Anomalous Sound Detection
Purohit, Harsh; Nishida, Tomoya; Dohi, Kota; Endo, Takashi; Kawaguchi, Yohei 160
Whale-VAD: Whale Vocalisation Activity Detection
Geldenhuys, Christiaan; Tonitz, Günther; Niesler, Thomas 165
Description and Discussion on DCASE 2025 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes
Yasuda, Masahiro; Binh Thien, Nguyen; Harada, Noboru; Serizel, Romain; Mishra, Mayank; Delcroix, Marc; Araki, Shoko; Takeuchi, Daiki; Niizumi, Daisuke; Ohishi, Yasunori; Nakatani, Tomohiro; Kawamura, Takao; Ono, Nobutaka 170
Importance-Weighted Domain Adaptation for Sound Source Tracking
Zhong, Bingxiang; Dietzen, Thomas 175
Discriminative Anomalous Sound Detection Using Pseudo Labels, Target Signal Enhancement, and Ensemble Feature Extractors
Fujimura, Takuya; Kuroyanagi, Ibuki; Toda, Tomoki 180
A Three-Level Evaluation Protocol for Acoustic Scene Understanding of Large Language Audio Models
Harish, Dilip; Abeßer, Jakob 185
Exploiting Stereo Spatial Properties with ReCoOP Framework for Joint Sound Event Detection and Localization
Banerjee, Mohor; Nagisetty, Srikanth; Teo, Han Boon 190
Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment
Nihal, Md Ragib Amin; Yen, Benjamin; Ashizawa, Takeshi; Nakadai, Kazuhiro 195
Assessing a Domain-Adaptive Deployment Workflow for Selective Audio Recording in Wildlife Acoustic Monitoring
Azziz, Julia; Lema, Josefina; Anzibar, Maximiliano; Ziegler, Lucía; Steinfeld, Leonardo; Rocamora, Martín 200
On the Role of Training Class Distribution in Zero-Shot Audio Classification
Dogan, Duygu; Xie, Huang; Heittola, Toni; Virtanen, Tuomas 205
CochlScene Pre-Training and Device-Aware Distillation for Low-Complexity Acoustic Scene Classification
Karasin, Dominik; Olariu, Ioan-Cristian; Schöpf, Michael; Szymańska, Anna 210
Perceptual Detection of Packet Loss-Induced Audio Artifacts in Black-Box Wireless Music Systems
Guimarães, Victória; Bentes, Luiz; Pires, Ana; Freitas, Rosiane 215
Supervised Detection of Baleen Whale Calls on Edge-Compute
van Toor, Astrid 220
On Temporal Guidance and Iterative Refinement in Audio Source Separation
Morocutti, Tobias; Greif, Jonathan; Primus, Paul; Schmid, Florian; Widmer, Gerhard 225
ToyADMOS2025: The Evaluation Dataset for the DCASE2025T2 First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
Harada, Noboru; Niizumi, Daisuke; Ohishi, Yasunori; Takeuchi, Daiki; Yasuda, Masahiro 230
Self-Guided Target Sound Extraction and Classification Through Universal Sound Separation Model and Multiple Clues
Kwon, Younghoo; Lee, Dongheon; Kim, Dohwan; Choi, Jung-Woo 235
Preface
This volume collects the papers to be presented at the Detection and Classification of
Acoustic Scenes and Events (DCASE) 2025 Workshop in Barcelona, Spain, on October 30-31, 2025.
The DCASE 2025 Workshop is the tenth workshop on Detection and Classification of Acoustic Scenes
and Events, organized in conjunction with the DCASE challenge. The workshop aims to provide a
venue for researchers working on computational analysis of sound events and scene analysis to
present and discuss their results. We aim to bring together researchers from many different
universities, research organizations and companies interested in the topic and provide the opportunity
for scientific exchange of ideas and opinions.
The DCASE 2025 Workshop is jointly organized by researchers at Universitat Pompeu Fabra, Queen
Mary University of London, New York University, Tampere University, Google Research, Cochl, NEC
Corporation and Bose Corporation.
For this DCASE 2025 Workshop, 72 full papers were submitted. Each submitted paper was assigned
to three reviewers. Of these, 48 papers were accepted.
The Organizing Committee was also pleased to invite leading experts for keynote addresses: Simone
Graetzer (Ph.D. MIOA MASA Senior Research Fellow in the Acoustics Research Centre, University of
Salford, Co-Lead in EPSRC Noise Network Plus, Co-Investigator in EPSRC CDT in Sound Futures),
and Gaëtan Hadjeres (Staff Research Scientist at SonyAI).
The progress of the DCASE 2025 Workshop results from the hard work of many people to whom we
wish to extend our warm thanks, including the authors, the keynote speakers, and the reviewers,
without whom this DCASE 2025 Workshop would not exist. We also wish to thank the organizers
and participants in the DCASE Challenge tasks.
This edition of the workshop was supported by sponsorship from Google, Apple, Pro Sound Effects,
Adobe, Dolby, Hitachi, Bose, Cochl, and Mitsubishi Electric. We wish to thank them warmly for their
valuable support to this workshop and the expanding topic area.
Emmanouil Benetos, Frederic Font, Magdalena Fuentes,
Irene Martin Morato, and Martín Rocamora
Sponsors
The organization of DCASE2025 Workshop thanks the following sponsors for their valuable support.
Detection and Classification of Acoustic Scenes and Events 2025 30–31 October 2025, Barcelona, Spain
EFFICIENT STATE-SPACE MODEL FOR AUDIO ANOMALY
DETECTION WITH DOMAIN ADAPTATION
Taha im Rahman Anon1, Jaka ia Islam Emon1
1Hokkaido Denshikiki Co., Ltd., Sapporo, Hokkaido, Japan
Abstract: This paper presents a two-stage, embedding-centric framework
for Unsupervised Anomaly Sound Detection (UASD), specifically
addressing the challenges of first-shot generalization and computational
efficiency. Our approach utilizes an efficient state-space model (SSM)
backbone. Pre-training of this backbone is accelerated using a
pixel-unshuffle (space-to-depth) input transformation of spectrograms, which
reduces training time by approximately 87.5% while preserving
representation quality. Subsequently, the pre-trained model is fine-tuned with a
specialized anomaly head that uses multi-level features, combined with
pseudo outlier exposure and domain-adversarial adaptation employing a
gradient reversal layer. Our system demonstrates superior performance
over the DCASE 2025 autoencoder baseline. Machine-specific models
achieve a harmonic mean total score of 0.722. This work establishes the
efficacy of SSMs for this task and offers a scalable, robust solution for
UASD in dynamic acoustic environments.
Index Terms: audio anomaly detection, pseudo outlier-exposure,
domain-adversarial training, State Space Model, Representation learning
1. INTRODUCTION
The Detection and Classification of Acoustic Scenes and Events
(DCASE) Challenge, Task 2 in particular, has steadily raised the bar
for unsupervised anomaly sound detection (UASD). The task evolved
from plain UASD in 2020 [1], [2], through domain adaptation in 2021
[3], [4], [5], domain generalization in 2022 [6], and, most recently,
the demanding first-shot scenario in 2023–2025 [7], [8]. Current
systems typically adopt either inlier modelling with autoencoders
(AE) [9]–[11] or outlier exposure (OE) [12], [13], where auxiliary or
pseudo-outlier data improve robustness [14], [15].
Most existing backbones, including CNNs [16], [17], [18], AEs [6], diffusion
models [19], and lightweight nets such as MobileFaceNet [20], [21],
struggle to model long-range temporal context. Transformer variants
mitigate that with self-attention [22], [23], but for sequence length
$N$, their quadratic time–memory cost $O(N^2)$ limits practical use on long
audio streams. Modern state-space models (SSMs) offer a compelling
alternative: they scale linearly, $O(N)$, while retaining strong sequence
modelling power [24]. Yet, a recent survey of Task 2 work reveals
no SSM backbones to date, leaving a clear gap.
To bridge this gap, we introduce a UASD system tailored to the
first-shot, domain-generalisation setting of DCASE Task 2 [10]. Our
method couples an efficient SSM backbone with (i) a space-to-depth
[25] spectrogram rearrangement that accelerates pre-training, and (ii)
a fine-tuning stage that blends pseudo-outlier exposure with
gradient-reversal-based domain adaptation [26]. We investigate both
machine-specific and machine-generalised variants during fine-tuning.
Our contributions are:
1) First SSM backbone for DCASE Task 2. We present the
first efficient state-space model applied to the DCASE first-shot
UASD challenge.
2) Faster pre-training via space-to-depth. A novel spectrogram
rearrangement cuts pre-training time by about 87.5% without
degrading representation quality.
3) Compact anomaly head with domain adaptation. We design
a lightweight head that uses multi-level features and pair it with
pseudo OE plus a gradient-reversal layer for domain alignment.
Extensive experiments confirm that our system surpasses the official
DCASE 2025 baseline [8].
(Author footnote: Work done during internship at Hokkaido Denshikiki Co., Ltd.)
Paper structure: Section 2 details the architecture and training
procedure; Section 3 describes datasets and metrics; Section 4 reports
results and ablations; and Section 5 summarises findings and future
directions.
2. METHOD
Our method tackles audio anomaly detection using a two-stage
representation learning strategy:
Stage I: Pre-training. The first stage focuses on learning robust
general-purpose acoustic features. We pre-train an efficient
bidirectional Audio-Mamba [27] encoder using a supervised classification
objective.
Stage II: Fine-tuning with pOE and Domain Adaptation. The
pre-trained encoder is repurposed for anomaly detection in two steps. First,
we swap the classification head for an anomaly head that uses global
and intermediate features. We then fine-tune the network with a pseudo
Outlier Exposure (pOE) loss: embeddings of normal samples are pulled
toward a learned centre, while embeddings of auxiliary pseudo-outliers
are pushed away (Section 2.5). In parallel, domain-adversarial training
encourages domain-invariant representations, improving robustness
across operating conditions.
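The pull/push behaviour described for the pOE loss can be illustrated with a small sketch. The exact loss is defined in Section 2.5, which is not reproduced here, so the squared-distance pull term, the hinge-style push term, the `margin` parameter, and the function names below are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def poe_loss(z_normal, z_outlier, center, margin=1.0):
    """Hypothetical pOE-style loss: pull normals to the centre, push pseudo-outliers out.

    z_normal, z_outlier: arrays of shape (batch, dim); center: array of shape (dim,).
    The hinge form and `margin` are illustrative assumptions.
    """
    d_normal = np.linalg.norm(z_normal - center, axis=1)
    d_outlier = np.linalg.norm(z_outlier - center, axis=1)
    pull = np.mean(d_normal ** 2)                              # normals close to centre
    push = np.mean(np.maximum(0.0, margin - d_outlier) ** 2)   # outliers beyond margin
    return pull + push

def anomaly_score(z, center):
    """Distance to the centre as an anomaly score (larger means more anomalous)."""
    return np.linalg.norm(z - center, axis=1)

center = np.zeros(2)
loss_far = poe_loss(np.array([[0.0, 0.0]]), np.array([[3.0, 0.0]]), center)
loss_near = poe_loss(np.array([[0.0, 0.0]]), np.array([[0.5, 0.0]]), center)
```

In this sketch a pseudo-outlier that already lies beyond the margin contributes nothing, while one inside the margin is penalised, which is one common way to realise the "pushed away" behaviour.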
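2.1. Spectrogram Tokenization via Space-to-Depth
Let $X \in \mathbb{R}^{C \times H \times W}$ be a log-mel spectrogram ($C = 1$). We first apply
pixel-unshuffle with factor $r$ to fold local time–frequency context into
the channel dimension:
$$X' = \mathrm{PU}(X, r) \in \mathbb{R}^{r^2 C \times \frac{H}{r} \times \frac{W}{r}}. \quad (1)$$
With $r = 4$ the $128 \times 1024$ input becomes $16 \times 32 \times 256$, reducing the
token length by $r^2$ while preserving locality inside the enlarged
channel dimension, as illustrated in Fig. 1.
We then partition $X'$ into non-overlapping patches of size $p \times p$
(with $p = 16$), flatten each patch $S_i \in \mathbb{R}^{p^2 r^2 C}$ using $\mathrm{vec}(\cdot)$ and project
it linearly into a $D$-dimensional embedding,
$$E_i = W_{\mathrm{patch}}\,\mathrm{vec}(S_i) + b_{\mathrm{patch}}, \quad (2)$$
yielding the token sequence $E = [E_1, \dots, E_N] \in \mathbb{R}^{N \times D}$ with $N = \frac{HW}{p^2 r^2}$.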
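The shape bookkeeping in Eq. (1) and the token count $N$ can be checked with a short NumPy sketch. It assumes the channel ordering used by common pixel-unshuffle implementations (channel index $c\,r^2 + i\,r + j$ for the intra-block offset $(i, j)$); the paper does not state the ordering, and the helper names are illustrative.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Space-to-depth: (C, H, W) -> (C*r*r, H//r, W//r)."""
    c, h, w = x.shape
    if h % r or w % r:
        raise ValueError("H and W must be divisible by r")
    x = x.reshape(c, h // r, r, w // r, r)   # split each axis into (block, offset)
    x = x.transpose(0, 2, 4, 1, 3)           # move both offsets next to the channel axis
    return x.reshape(c * r * r, h // r, w // r)

def num_tokens(h, w, p, r=1):
    """Number of non-overlapping p x p patches after unshuffling by r."""
    return (h // r // p) * (w // r // p)

# Dummy 128 x 1024 "spectrogram" with C = 1, matching the sizes quoted in the text.
spec = np.arange(128 * 1024, dtype=np.float32).reshape(1, 128, 1024)
folded = pixel_unshuffle(spec, r=4)
```

With these sizes the folded tensor has shape 16 x 32 x 256, and the patch count drops from 512 (no unshuffle) to 32, a factor of $r^2 = 16$.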
2.2. Bidirectional State–Space Encoder
The patch-embedded token sequence $E = [E_1, \dots, E_N] \in \mathbb{R}^{N \times D}$
is processed by a stack of $K$ identical Forward-Bidirectional Audio
Mamba (AuM) blocks, as depicted in Fig. 2. Each AuM block executes
a sequence of operations to transform its input $x_t$.
First, an input projection maps $x_t$ to an intermediate representation
$\tilde{x}_t = W_{\mathrm{in}} x_t + b_{\mathrm{in}}$. This projected sequence $\tilde{X}$ is then fed into a
Table 1: Device-wise and overall accuracies of the baseline system on the development-test split.
Model                   A      B      C      S1     S2     S3     S4     S5     S6     Macro Avg. Accuracy
General Model           62.80  52.87  54.23  48.52  47.29  52.86  48.14  47.23  42.60  50.72 ± 0.47
Device-specific Models  63.98  55.85  59.09  48.68  48.74  52.72  48.14  47.23  42.60  51.89 ± 0.05
Participants are free to trade off the number of parameters against
numerical precision; for instance, the limit corresponds to 128K
parameters with 8-bit quantization or 32K parameters with 32-bit
precision. Computational complexity is capped at 30 MMACs for
processing a one-second audio segment. These constraints are designed
to reflect the capabilities of resource-constrained devices such as
the Cortex-M4 series (e.g., STM32L496@80 MHz or Arduino Nano
33@64 MHz).
4. BASELINE SYSTEM
Following the 2024 edition [5], the baseline system builds on a
simplified variant of the top-performing submission from the 2023
edition [25]. It employs a receptive-field-regularized, factorized CNN
architecture, referred to as CP-Mobile. Audio recordings are first
resampled to 32 kHz, then converted into mel spectrograms using
a 4096-point FFT with a window size of 96 ms and a hop size
of approximately 16 ms, followed by a mel scaling with 256 mel
filterbanks.
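The front-end settings above translate into sample counts as sketched below. The hop is given only as "approximately 16 ms", so the exact 16 ms value used here, the centred-framing frame count, and all names are assumptions for illustration rather than the baseline's actual configuration.

```python
SAMPLE_RATE = 32_000   # Hz, after resampling
N_FFT = 4096           # FFT size
WINDOW_MS = 96         # analysis window length
HOP_MS = 16            # assumption: the text only says "approximately 16 ms"
N_MELS = 256           # number of mel filterbanks

def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(ms * sample_rate / 1000)

def num_frames(n_samples, hop_length):
    """Frame count under centred framing (an assumption about the padding scheme)."""
    return 1 + n_samples // hop_length

win_length = ms_to_samples(WINDOW_MS)                    # window, zero-padded up to N_FFT
hop_length = ms_to_samples(HOP_MS)
frames_per_second = num_frames(SAMPLE_RATE, hop_length)
n_freq_bins = N_FFT // 2 + 1                             # one-sided spectrum before mel scaling
```

Under these assumptions a one-second clip yields a mel spectrogram of 256 bands by `frames_per_second` frames.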
As illustrated in Figure 1, the system is trained in two stages. In
the first stage, a general model is trained on data from all devices
for 150 epochs using the AdamW optimizer and a batch size of
256. To address device mismatch, Freq-MixStyle [7], [8] is applied
during training. In the second stage, for each device in the training
set, a device-specific model is created by end-to-end fine-tuning the
general model on data from that specific device for 50 epochs. During
inference, device-specific models are applied to known devices, while
the general model handles unknown ones.
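The inference-time rule can be sketched as a simple lookup with a fallback. The string placeholders standing in for models, and the choice of A, B, C, S1, S2 and S3 as the known devices (inferred from Table 1, where only S4 to S6 are unchanged by fine-tuning), are assumptions for illustration.

```python
def select_model(device_id, device_models, general_model):
    """Return the device-specific model if one exists, else the general model."""
    return device_models.get(device_id, general_model)

# Placeholders; in practice these would be the fine-tuned networks.
general_model = "general"
device_models = {d: f"finetuned_{d}" for d in ["A", "B", "C", "S1", "S2", "S3"]}

model_for_a = select_model("A", device_models, general_model)
model_for_s5 = select_model("S5", device_models, general_model)
```

A dictionary lookup keeps the routing independent of how many device-specific models were trained.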
The baseline system requires 29.4 MMACs to process a
one-second audio clip. The model uses 61,148 parameters in 16-bit (fp16)
precision, resulting in a total memory footprint of 122.3 kB for the
parameters.
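The parameter-memory arithmetic can be reproduced in a few lines. The sketch assumes that the limit is 128 kB with 1 kB = 1000 bytes (which matches the quoted 128K parameters at 8 bits and 32K at 32 bits) and that every parameter uses the same bit width; the function names are illustrative.

```python
MEMORY_LIMIT_BYTES = 128_000  # assumption: 128 kB limit, decimal kilobytes

def footprint_bytes(n_params, bits):
    """Memory needed to store n_params parameters at the given bit width."""
    return n_params * bits // 8

def max_params(limit_bytes, bits):
    """Largest parameter count that fits in limit_bytes at the given bit width."""
    return limit_bytes * 8 // bits

baseline_bytes = footprint_bytes(61_148, bits=16)
baseline_kb = baseline_bytes / 1000
budget = {bits: max_params(MEMORY_LIMIT_BYTES, bits) for bits in (8, 16, 32)}
```

The baseline's 61,148 fp16 parameters come to about 122.3 kB, just inside the budget of 64,000 parameters at 16 bits.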
Table 1 presents the device-wise and overall accuracies of the
baseline system on the development-test split. After Stage 1, the
general model achieves an overall accuracy of 50.72%. Following
Stage 2, where device-specific models are trained, the overall accuracy
improves to 51.89%. Device-specific fine-tuning increases the accuracy
for all known devices except for S3, with performance gains varying
notably across devices. The accuracy on unknown devices remains
unchanged between the two rows of the table, as the general model
is used for inference on unknown devices. The source code and a
detailed description of the baseline system are available online².
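As a quick consistency check, the macro-averaged accuracy can be recomputed from the per-device values in Table 1. The sketch assumes the macro average is the unweighted mean over the nine devices; small differences from the reported figures are expected because the table entries are themselves rounded.

```python
general = {"A": 62.80, "B": 52.87, "C": 54.23, "S1": 48.52, "S2": 47.29,
           "S3": 52.86, "S4": 48.14, "S5": 47.23, "S6": 42.60}
device_specific = {"A": 63.98, "B": 55.85, "C": 59.09, "S1": 48.68, "S2": 48.74,
                   "S3": 52.72, "S4": 48.14, "S5": 47.23, "S6": 42.60}

def macro_average(acc):
    """Unweighted mean of per-device accuracies."""
    return sum(acc.values()) / len(acc)

# Per-device change from Stage 1 to Stage 2, in percentage points.
gains = {d: device_specific[d] - general[d] for d in general}
unchanged = [d for d, g in gains.items() if g == 0]
decreased = [d for d, g in gains.items() if g < 0]
```

The recomputed means agree with the reported 50.72 and 51.89 to within rounding, S3 is the only device whose accuracy decreases, and S4 to S6 are unchanged.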
5. CHALLENGE RESULTS
The task received 31 submissions from 12 teams, with 11 out of
12 teams outperforming the baseline system. For both the baseline
and most submitted systems, performance on the development-test
split aligned well with that on the evaluation set. Table 2 presents
the best-performing system from each team that outperformed the
baseline and summarizes their architectural choices, strategies for
handling complexity, use of external data, and device adaptation
methods. The following subsections discuss each of these aspects in
detail. Additional results and detailed system descriptions are available
on the official challenge website³.
² Source code: https://github.com/CPJKU/dcase2025_task1_baseline
5.1. Architectures
Due to the low-complexity constraints, efficient neural network design
remained a central focus. In line with last year's trends [5], most teams
adopted factorized convolutional architectures. Five of the twelve
teams, including the top-ranked submission, built their systems on
the CP-Mobile architecture [25]. However, several top-performing
teams proposed novel architectural variants.
Team Tan_SNTLNTU [26] introduced CNN-GRU, which combines
pointwise and 1D depthwise convolutions over the frequency and
time dimensions, integrates Squeeze-and-Excitation layers [27], and
applies a GRU along the frequency axis. Team Luo_CQUPT [28]
presented DynaCP, a CP-Mobile modification that processes pooling
and strided convolutions in parallel and dynamically combines their
outputs. Teams Chang_HYU [29] and Ramezanee_SUT [30] built
upon reparameterizable convolution blocks [31], which use multiple
branches during training that can be merged into a single, efficient
equivalent at inference time. Additionally, Chang_HYU [29] employed
Channel-Time-Frequency Attention (CTFA) [32], a lightweight
attention mechanism that allows the model to focus on informative
input regions, while Ramezanee_SUT [30] proposed learnable pooling
layers. As input to the models, all teams used log-mel energies, with
the exception of two teams that used the spectrogram instead.
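The idea behind reparameterizable blocks, merging parallel training-time branches into one inference-time kernel, rests on the linearity of convolution. The 1D sketch below is a deliberately simplified illustration with a 3-tap and a 1-tap branch; it omits the batch-normalization folding that real blocks typically include, and it is not the design used by any of the submissions.

```python
import numpy as np

def two_branch(x, k3, k1):
    """Training-time form: a 3-tap branch and a 1-tap branch, summed."""
    return np.convolve(x, k3, mode="same") + np.convolve(x, k1, mode="same")

def merge_branches(k3, k1):
    """Fold the 1-tap kernel into the centre tap of the 3-tap kernel."""
    merged = np.array(k3, dtype=float)
    merged[1] += float(k1[0])
    return merged

def single_branch(x, k):
    """Inference-time form: one convolution with the merged kernel."""
    return np.convolve(x, k, mode="same")

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
k3 = rng.standard_normal(3)
k1 = rng.standard_normal(1)
merged = merge_branches(k3, k1)
```

Both forms produce the same output, so the extra branch adds capacity during training without adding multiply-accumulate operations at inference time.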
5.2. System Complexity
As in previous editions [4], [5], Knowledge Distillation (KD) [33]
remained the most widely used model compression technique,
employed by 10 out of 12 teams. Compared to previous editions, several
interesting variations of the KD process have been explored, such as
a feature-level distillation loss [34], device-aware feature alignment
loss to train a device-expert teacher [35], and self-distillation [36].
Compared to 2024 [5], where pruning was used only by the
top-ranked team [15], this year pruning gained traction, with 3
of the top 6 teams adopting it. Notably, the second-ranked team,
Tan_SNTLNTU [26], applied pruning exclusively, without using KD.
All top-5 teams used 16-bit precision, while none opted for 8-bit
quantization, likely due to the ease of reducing to 16-bit with minimal
or no accuracy loss, whereas maintaining performance with 8-bit
quantization remains more challenging.
5.3. External Data Usage
External data was primarily used in two ways. First, most teams
employed teacher models for KD that were pre-trained on AudioSet [21].
PaSST [37] remained a popular choice, though two teams, including
the top-ranked one, used BEATs [38], while the third-ranked team,
Luo_CQUPT [28], used AudioSet-pretrained MobileNets [39] and
Dynamic MobileNets [40].
Second, several teams applied Device Impulse Response (DIR)
augmentation [9] using impulse responses from MicIRP⁴, increasing
the diversity of recording conditions in the training data.
³ Results: https://dcase.community/challenge2025/task-low-complexity-acoustic-scene-classification-with-device-information-results
⁴ https://micirp.blogspot.com/
Table 2: Best-performing system per team (only including systems that outperform the baseline) and the official DCASE2025 baseline. Score indicates the
accuracy on the evaluation set, Size refers to the memory required to store model parameters, and MAC denotes the number of multiply-accumulate operations.
External indicates whether external data was used, and Device Adaptation describes the method used to adapt the model to specific devices based on provided
device IDs. KD, IR, and FT stand for Knowledge Distillation, Impulse Response augmentation, and Fine-Tuning, respectively.
Team               Score  Size   MAC  Architecture  Complexity     External             Device Adaptation
Karasin_JKU        61.5   122kB  29M  CP-Mobile     fp16,KD        IR,CochlScene,BEATs  Full device-spec. FT
Tan_SNTLNTU        59.9   116kB  10M  CNN-GRU       fp16,prune     IR                   Full FT
Luo_CQUPT          59.6   123kB  28M  DynaCP        fp16,KD        EfficientAT          Full FT
Zhang_AITHU-SJTU   59.3   126kB  29M  SSCP-Mobile   fp16,KD,prune  PaSST                –
Chang_HYU          59.0   125kB  29M  Rep-CTFA      fp16,KD        IR,PaSST             Head-only FT
Li_NTU             58.9   122kB  17M  CP-Mobile     KD,prune       IR,PaSST             –
Ramezanee_SUT      57.9   125kB  28M  DSFlexiNet    KD             IR                   Full FT
Jeong_SEOULTECH    57.9   122kB  26M  CP-Mobile     fp16,KD        PaSST                Full FT
Chen_GXU           56.6   122kB  29M  CP-Mobile     fp16,KD        PaSST                –
Krishna_SRIB       56.1   122kB  27M  CP-Mobile     fp16           –                    Full FT
Zhou_XJTLU         55.5   126kB  29M  TF-SepNet     int8,KD        IR,AudioSet,BEATs    Full FT
DCASE25 baseline   53.2   122kB  29M  CP-Mobile     fp16           –                    Full FT
Among all participating teams, only the top-ranked team,
Karasin_JKU [41], took advantage of the new rule allowing external
ASC datasets by leveraging CochlScene [42]. CochlScene contains
76,115 ten-second audio clips recorded across 13 distinct acoustic
scenes. The dataset was collected via crowdsourcing, primarily from
contributors in Korea. Several scene classes overlap partially with
those in the TAU Urban Acoustic Scenes 2022 Mobile dataset [2],
[6] (e.g., Bus and Park), though others are unique to CochlScene
(e.g., Restroom and Elevator) or TAU (e.g., Airport and Travelling
by Tram).
Team Karasin_JKU [41] explored pre-training both the teacher and
student models on CochlScene. Notably, this strategy led to substantial
performance improvements for convolutional architectures such as
CP-Mobile [25] and CP-ResNet [43], with gains of +3.36 and +6.05
percentage points on the TAU development-test split, respectively. In
contrast, transformer-based models like PaSST [37] and BEATs [38]
saw only marginal or no improvements.
5.4. Device Adaptation
To exploit the given device information, most teams opted for the
baseline strategy of fine-tuning the general model on device-specific
data to obtain specialized models. More advanced methods were
explored by only a few participants.
Team Han_CSU [44] addressed device variability by incorporating
device embeddings into the model's internal representations, effectively
conditioning the network on the identity of the recording device.
Team Chang_HYU [29] adopted a modular approach by training
lightweight, device-specific classification heads while keeping the
shared backbone frozen. This design preserves a common,
general-purpose acoustic feature extractor across all devices, while allowing
for device-tailored classification at the output stage. Importantly, this
method keeps the overall system compact, as the additional
device-specific components introduce only minimal overhead.
The top-ranked team, Karasin_JKU [41], further exploited device
information by customizing training configurations, such as Knowledge
Distillation hyperparameters, for each device-specific fine-tuning run.
In particular, they observed that the optimal loss weighting factor in
Knowledge Distillation, which balances the supervised loss and the
distillation loss, varies across devices and benefits from device-specific
tuning.
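The role of the loss weighting factor can be made concrete with a generic Knowledge Distillation loss. The sketch below uses the common formulation (cross-entropy on labels plus a temperature-scaled KL term against the teacher); the exact loss, the temperature, and the per-device weights in `kd_weight_per_device` are hypothetical and are not taken from the Karasin_JKU report.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, weight, temperature=2.0):
    """weight * supervised cross-entropy + (1 - weight) * distillation KL term."""
    p_student = softmax(student_logits)
    ce = -np.mean(np.log(p_student[np.arange(len(labels)), labels]))
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student_t = np.log(softmax(student_logits, temperature))
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher) - log_p_student_t), axis=-1))
    return weight * ce + (1.0 - weight) * temperature ** 2 * kl

# Hypothetical per-device weights; values are illustrative only.
kd_weight_per_device = {"A": 0.1, "B": 0.3, "S1": 0.5}

student = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
teacher = np.array([[3.0, 0.0, -2.0], [0.0, 2.0, 0.5]])
labels = np.array([0, 1])
loss_a = kd_loss(student, teacher, labels, weight=kd_weight_per_device["A"])
```

Setting `weight` to 1 recovers purely supervised training, while values near 0 let the teacher dominate, which is the trade-off that is tuned per device.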
6. CONCLUSION
This paper introduced the setup and baseline system for Task 1 of
the DCASE 2025 Challenge, which continues to address three core
challenges of acoustic scene classification: low-complexity constraints,
device mismatch, and limited training data. A key novelty this year
is the availability of device information at inference time, enabling
device-specific adaptation and yielding consistent improvements in
the baseline system.
While the three research questions outlined in Section 3 remain only
partially explored, the top-ranked submission provided valuable initial
answers. They showed that fine-tuning routines tailored to specific
devices improve performance, and that leveraging external acoustic
scene classification datasets such as CochlScene can substantially
boost accuracy on the TAU dataset. These strategies delivered an
accuracy gain of more than 1.5 percentage points over all other
submissions, highlighting promising directions for future work.
Beyond transfer learning and device-aware modeling, participants
also advanced research on efficient architectures, Knowledge
Distillation, and pruning. Several teams experimented with different teacher
models for Knowledge Distillation, while others introduced
architectural components for low-complexity models such as lightweight
attention mechanisms, reparameterizable convolutions, and learnable
pooling layers.
Overall, the 2025 edition of Task 1 advanced established research on
low-complexity modeling while providing initial insights into
device-aware adaptation and the use of external acoustic scene datasets,
laying the groundwork for further exploration in these directions.
7. ACKNOWLEDGMENT
The LIT AI Lab is supported by the Federal State of Upper Austria.
Gerhard Widmer's work is supported by the European Research
Council (ERC) under the European Union's Horizon 2020 research
and innovation programme, grant agreement No 101019375 (Whither
Music?). Annamaria Mesaros's work is supported by Academy of
Finland grant 332063 “Teaching machines to listen”.
REFERENCES
[1] E. Benetos, D. Stowell, and M. D. Plumbley, “Approaches to complex sound scene analysis,” in Cham: Springer International Publishing, 2018.
[2] T. Heittola, A. Mesaros, and T. Virtanen, “Acoustic scene classification in DCASE 2020 challenge: Generalization across devices and low complexity solutions,” in DCASE Workshop, 2020.
[3] I. Martín-Morató, T. Heittola, A. Mesaros, and T. Virtanen, “Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems,” in DCASE Workshop, 2021.
[4] I. Martín-Morató, F. Paissan, A. Ancilotto, T. Heittola, A. Mesaros, E. Farella, A. Brutti, and T. Virtanen, “Low-complexity acoustic scene classification in DCASE 2022 challenge,” in DCASE Workshop, 2022.
[5] F. Schmid, P. Primus, T. Heittola, A. Mesaros, I. Martín-Morató, K. Koutini, and G. Widmer, “Data-efficient low-complexity acoustic scene classification in the DCASE 2024 challenge,” in DCASE Workshop, 2024.
[6] A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in DCASE Workshop, 2018.
[7] B. Kim, S. Yang, J. Kim, H. Park, J. Lee, and S. Chang, “Domain generalization with relaxed instance frequency-wise normalization for multi-device acoustic scene classification,” in Interspeech, 2022.
[8] F. Schmid, S. Masoudian, K. Koutini, and G. Widmer, “CP-JKU submission to DCASE22: Distilling knowledge for low-complexity convolutional neural networks from a patchout audio transformer,” DCASE Challenge, Tech. Rep., 2022.
[9] T. Morocutti, F. Schmid, K. Koutini, and G. Widmer, “Device-robust acoustic scene classification via impulse response augmentation,” in EUSIPCO, 2023.
[10] H. Truchan, T. H. Ngo, and Z. Ahmadi, “Ascdomain: Domain invariant device-adversarial isotropic knowledge distillation convolutional neural architecture,” in ICASSP, 2025.
[11] K. Koutini, F. Henkel, H. Eghbal-zadeh, and G. Widmer, “CP-JKU submissions to DCASE’20: Low-complexity cross-device acoustic scene classification with RF-regularized CNNs,” DCASE Challenge, Tech. Rep., 2020.
[12] B. Kim, S. Yang, J. Kim, and S. Chang, “QTI submission to DCASE 2021: Residual normalization for device-imbalanced acoustic scene classification with efficient design,” DCASE Challenge, Tech. Rep., 2021.
[13] J.-H. Lee, J.-H. Choi, P. M. Byun, and J.-H. Chang, “Hyu submission for the DCASE 2022: Efficient fine-tuning method using device-aware data-random-drop for device-imbalanced acoustic scene classification,” DCASE Challenge, Tech. Rep., 2022.
[14] K. Koutini, J. Schlüter, and G. Widmer, “CPJKU submission to DCASE21: Cross-device audio scene classification with wide sparse frequency-damped CNNs,” DCASE Challenge, Tech. Rep., 2021.
[15] H. Bing, H. Wen, C. Zhengyang, J. Anbai, C. Xie, F. Pingyi, L. Cheng, L. Zhiqiang, L. Jia, Z. Wei-Qiang, and Q. Yanmin, “Data-efficient acoustic scene classification via ensemble teachers distillation and pruning,” DCASE Challenge, Tech. Rep., 2024.
[16] C.-H. H. Yang, H. Hu, S. M. Siniscalchi, Q. Wang, W. Yuyang, X. Xia, Y. Zhao, Y. Wu, Y. Wang, J. Du, and C.-H. Lee, “A lottery ticket hypothesis framework for low-complexity device-robust neural acoustic scene classification,” DCASE Challenge, Tech. Rep., 2021.
[17] J. Tan and Y. Li, “Low-complexity acoustic scene classification using blueprint separable convolution and knowledge distillation,” DCASE Challenge, Tech. Rep., 2023.
[18] Y. Cai, M. Lin, C. Zhu, S. Li, and X. Shao, “DCASE2023 task1 submission: Device simulation and time-frequency separable convolution for acoustic scene classification,” DCASE Challenge, Tech. Rep., 2023.
[19] F. Schmid, T. Morocutti, S. Masoudian, K. Koutini, and G. Widmer, “CP-JKU submission to DCASE23: Efficient acoustic scene classification with cp-mobile,” DCASE Challenge, Tech. Rep., 2023.
[20] Y.-F. Shao, P. Jiang, and W. Li, “Low-complexity acoustic scene classification with limited training data,” DCASE Challenge, Tech. Rep., 2024.
[21] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017.
[22] Y. Cai, M. Lin, S. Li, and X. Shao, “DCASE2024 task1 submission: Data-efficient acoustic scene classification with self-supervised teachers,” DCASE Challenge, Tech. Rep., 2024.
[23] N. David, R. Aida, and S. Patrick, “Data-efficient acoustic scene classification with pre-trained CP-Mobile,” DCASE Challenge, Tech. Rep., 2024.
[24] A. Werning and R. Haeb-Umbach, “Upb-Nt submission to DCASE24: Dataset pruning for targeted knowledge distillation,” DCASE Challenge, Tech. Rep., 2024.
[25] F. Schmid, T. Morocutti, S. Masoudian, K. Koutini, and G. Widmer, “Distilling the knowledge of transformers and CNNs with CP-mobile,” in DCASE Workshop, 2023.
[26] E.-L. Tan, J. W. Yeow, S. Peksi, H. Li, Z. Yang, and W.-S. Gan, “Sntl-ntu dcase25 submission: Acoustic scene classification using CNN-GRU model without knowledge distillation,” DCASE2025 Challenge, Tech. Rep., May 2025.
[27] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in CVPR, 2018.
[28] Y. Luo, H. Liu, L. Shi, and L. Gan, “Dynacp: Dynamic parallel selective convolution in cp-mobile under multi-teacher distillation for acoustic scene classification,” DCASE2025 Challenge, Tech. Rep., 2025.
[29] S.-G. Han, P. M. Byun, and J.-H. Chang, “Hyu submission for DCASE 2025 task 1: Low-complexity acoustic scene classification using reparameterizable CNN with channel-time-frequency attention,” DCASE2025
Challenge, Tech. Rep., 2025.
[30]
M. M. Ramezanee, H. Sha i y, A. M. Meh ani Kia, and B. Raou i,
“Acous ic scene classi ica ion wi h knowledge dis illa ion and de ice-
speci ic ine- uning o DCASE 2025,” DCASE2025 Challenge, Tech.
Rep., 2025.
[31]
B. Han, W. Huang, Z. Chen, A. Jiang, P. Fan, C. Lu, Z. L , J. Liu,
W.-Q. Zhang, and Y. Qian, “Da a-e icien low-complexi y acous ic scene
classi ica ion ia dis illing and p og essi e p uning,” in ICASSP, 2025.
[32]
X. Zeng and M. Wang, “Channel- ime- equency a en ion module o
imp o ed mul i-channel speech enhancemen ,” IEEE Access, 2025.
[33]
G. E. Hin on, O. Vinyals, and J. Dean, “Dis illing he knowledge in a
neu al ne wo k,” in NIPS Deep Lea ning Wo kshop, 2014.
[34]
H. Li, Z. Yang, M. Wang, E.-L. Tan, J. Yeow, S. Peksi, and W.-S. Gan,
“Join ea u e and ou pu dis illa ion o low-complexi y acous ic scene
classi ica ion,” DCASE2025 Challenge, Tech. Rep., 2025.
[35]
S. Jeong and S. Kim, “Adap i e knowledge dis illa ion using a
de ice-awa e eache o low-complexi y acous ic scene classi ica ion,”
DCASE2025 Challenge, Tech. Rep., 2025.
[36]
X. Chen and W. Xie, “McCi submission o DCASE 2025: T aining
low-complexi y acous ic scene classi ica ion sys em wi h knowledge
dis illa ion and cu iculum,” DCASE2025 Challenge, Tech. Rep., 2025.
[37]
K. Kou ini, J. Schl
¨
u e , H. Eghbal-zadeh, and G. Widme , “E icien
aining o audio ans o me s wi h pa chou ,” in In e speech, 2022.
[38]
S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu,
and F. Wei, “BEATs: Audio p e- aining wi h acous ic okenize s,” in
ICML, 2023.
[39]
F. Schmid, K. Kou ini, and G. Widme , “E icien la ge-scale audio
agging ia ans o me - o-cnn knowledge dis illa ion,” in ICASSP, 2023.
[40]
——, “Dynamic con olu ional neu al ne wo ks as e icien p e- ained
audio models,” IEEE ACM T ans. Audio Speech Lang. P ocess., 2024.
[41]
D. Ka asin, I.-C. Ola iu, M. Sch
¨
op , and A. Szyma
´
nska, “Domain-speci ic
ex e nal da a p e- aining and de ice-awa e dis illa ion o da a-e icien
acous ic scene classi ica ion,” DCASE2025 Challenge, Tech. Rep., May
2025.
[42]
I.-Y. Jeong and J. Pa k, “Cochlscene: Acquisi ion o acous ic scene da a
using c owdsou cing,” in APSIPA ASC, 2022.
[43]
K. Kou ini, H. Eghbal-zadeh, and G. Widme , “Recep i e ield eg-
ula iza ion echniques o audio classi ica ion and agging wi h deep
con olu ional neu al ne wo ks,” IEEE ACM T ans. Audio Speech Lang.
P ocess., 2021.
[44]
S. Han, D. H. Lee, M. S. Jo, E. S. Ha, M. J. Chae, and G. W. Lee,
“Con idence-awa e ensemble knowledge dis illa ion o low-complexi y
acous ic scene classi ica ion,” DCASE2025 Challenge, Tech. Rep., 2025.
10
Detection and Classification of Acoustic Scenes and Events 2025 30–31 October 2025, Barcelona, Spain
Correlation-Based Filtering for Unsupervised Anomalous Sound Detection
Andrin Bürli, Sami Hamdan, Iason Kastanis
Centre Suisse d'Électronique et Microtechnique
Predictive Analytics Group, Switzerland
Abstract—Unsupervised anomalous sound detection (ASD) under domain shift remains a key challenge for real-world deployment. We introduce a two-stage “first-shot” pipeline for DCASE 2025 Task 2 that leverages optional clean-only or noise-only supplemental recordings to improve robustness to unseen background noises. First, a correlation-based filter is trained separately on clean or noise data, separating each test mixture $x = C + N + A$ into a cleaner signal $x' = C + A$. Second, a mel-spectrogram autoencoder, augmented with SMOTE and mixup on $x'$, detects anomalies. On the development set, our method achieves a high SI-SDR for the separation task and improves the detection metrics for three out of seven components compared to the baseline. These results validate that assuming statistical independence between machine sound, background noise, and anomalies can enhance first-shot ASD. Future work will explore automated correlation estimation and integration with more advanced anomaly detection methods for the second stage.
Index Terms—anomalous sound detection, signal correlation, DCASE, source separation, audio
1. INTRODUCTION
Anomalous sound detection (ASD) has emerged as a critical technology for non-intrusive monitoring of industrial machinery, enabling early warning of mechanical faults through audio analysis [1], [2]. Unsupervised ASD, which relies solely on normal-condition recordings, was first standardized in the DCASE 2020 Challenge Task 2 to address the scarcity and diversity of anomalous examples in real factories [3]. Subsequent editions have progressively incorporated domain-shift and “first-shot” scenarios, in which systems must generalize to unseen operating conditions or entirely new machine types without task-specific tuning [3], [4].
Building on the first-shot unsupervised ASD framework of DCASE 2023 and 2024, the 2025 Task 2 challenge retains the requirement to train exclusively on normal data and to detect anomalies under unknown domain shifts, while introducing optional use of clean-only or noise-only supplementary recordings [4]. Participants must also handle completely novel machine types at evaluation, with no access to anomalous test data for hyperparameter tuning. This “first-shot” setting reflects real-world constraints where rapid deployment precludes exhaustive data collection or manual calibration.
We propose a two-stage pipeline for first-shot ASD: (1) a correlation-based separator that, given clean-only or noise-only supplemental data, filters each test mixture $x = C + N + A$ into $x' = C + A$ as depicted in Figure 1; (2) a mel-spectrogram autoencoder, augmented with SMOTE and mixup and trained on $x'$, to detect anomalies. By leveraging correlation-based filtering, our method enhances robustness to unseen background noises in the DCASE 2025 Task 2 setting.
2. METHOD
We denote clean machine sound by $C$, background noise by $N$, and anomalous sound by $A$. Artificial noise augmentations, $N_A$, are sampled from diverse sources. We use $\rho(S_1, S_2)$ as the correlation between two signals $S_1$ and $S_2$, where the threshold $\varepsilon$ denotes a significant correlation between them. In our two-stage methodology, a first step trains a filtering model which can separate the mixture $x = C + N + A$ into $x' = C + A$. The second step then involves training an anomaly detection model based on the filtered $x'$.
(a) Bearing   (b) Slider
Fig. 1: Correlation-based filtering for two components. In (a) we have supplemental $= C$, thus we perform machine sound extraction. In (b) we have supplemental $= N$, thus we perform noise extraction.
2.1. Correlation-based Filtering
In the DCASE challenge 2025, we are provided with additional supplemental data which consists of either $C$ or $N$. For both cases we developed separate filtering strategies. For the evaluation of the filtering quality, we report the scale-invariant signal-to-distortion ratio (SI-SDR), which is commonly used in source separation tasks [5].
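For illustration, SI-SDR as defined in [5] can be computed with a few lines of NumPy. This is a minimal sketch: the function name `si_sdr`, the zero-mean normalization, and the small stabilizing constant `eps` are assumptions made here, not details taken from the paper.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the optimally scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return float(10.0 * np.log10(ratio))

# Toy usage: a reference signal corrupted by noise at roughly 20 dB SNR.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
estimate = reference + 0.1 * rng.standard_normal(16000)
score = si_sdr(estimate, reference)
```

Because the target is rescaled before the ratio is taken, multiplying the estimate by a constant gain leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.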
2.1.1. Machine Sound Extraction: If we are provided with supplemental data containing the clean machine sound $C$, we train a source-separation network
$$\theta(C + N_A) \approx C \tag{1}$$
to recover the clean sound $C$ from noise-augmented inputs $C + N_A$. At inference time on mixture $x$ this will allow us to filter:
$$\hat{y} = \theta(x) \approx C + A = x' \tag{2}$$
For this filtering to work we introduce the following assumptions:
1.1 $\rho(C, N_A) < \varepsilon$ (artificial noise uncorrelated with $C$)
1.2 $\rho(C, N) < \varepsilon$ (background noise uncorrelated with $C$)
1.3 $\rho(C, A) > \varepsilon$ (anomalies strongly correlated with $C$)
1.4 $\rho(N, A) < \varepsilon$ (anomalies uncorrelated with $N$)
1.5 $\rho(C_{\text{source}}, C_{\text{target}}) > \varepsilon$ (machine sound of source is strongly correlated with target)
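The text does not state how $\rho$ is computed or which value of $\varepsilon$ is used. The following sketch therefore assumes the absolute Pearson correlation of the waveforms and a hypothetical threshold of 0.1, and checks assumptions 1.1 to 1.4 on synthetic signals. The names `rho`, `check_assumptions`, and `EPSILON` are illustrative only.

```python
import numpy as np

EPSILON = 0.1  # hypothetical threshold; no value is given in the text

def rho(s1: np.ndarray, s2: np.ndarray) -> float:
    """Absolute Pearson correlation of two equal-length signals (an assumed choice of rho)."""
    s1 = s1 - s1.mean()
    s2 = s2 - s2.mean()
    denom = np.linalg.norm(s1) * np.linalg.norm(s2)
    return float(abs(np.dot(s1, s2)) / denom) if denom > 0 else 0.0

def check_assumptions(C, N, A, N_A, eps: float = EPSILON) -> dict:
    """Evaluate assumptions 1.1 to 1.4 for one set of component signals."""
    return {
        "1.1": rho(C, N_A) < eps,  # artificial noise uncorrelated with C
        "1.2": rho(C, N) < eps,    # background noise uncorrelated with C
        "1.3": rho(C, A) > eps,    # anomalies correlated with C
        "1.4": rho(N, A) < eps,    # anomalies uncorrelated with N
    }

# Synthetic example: a tonal machine sound, an anomaly that follows it,
# and two independent noise signals.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
C = np.sin(2 * np.pi * 100.0 * t)
A = 0.5 * C + 0.1 * rng.standard_normal(t.size)
N = rng.standard_normal(t.size)
N_A = rng.standard_normal(t.size)
checks = check_assumptions(C, N, A, N_A)
```

In practice the individual components of a real mixture are not observable, so a check like this is only possible on simulated data or on the supplemental recordings.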
2.1.2. Noise Extraction: If we are provided with supplemental data containing the background sound $N$, we train a source-separation network
$$\theta(N + N_A) \approx N \tag{3}$$
to extract background noise $N$ from $N + N_A$. At inference time on mixture $x$ this will allow us to filter:
$$\hat{y} = \theta(x) \approx N \;\rightarrow\; x - \hat{y} = x' \tag{4}$$
For this filtering to work we introduce the following assumptions:
2.1 $\rho(N, N_A) < \varepsilon$ (artificial noise uncorrelated with $N$)
2.2 $\rho(A, N_A) < \varepsilon$ (artificial noise uncorrelated with $A$)
2.3 $\rho(C, N) < \varepsilon$ (background noise uncorrelated with $C$)
2.4 $\rho(N, A) < \varepsilon$ (anomalies uncorrelated with $N$)
2.5 $N_{\text{source}} = N_{\text{target}}$ (background sound of source is equal to target)
Assumptions 2.1, 2.2, and 2.5 are new; 2.3 and 2.4 overlap with 1.2 and 1.4, respectively.
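The two inference rules in Eqs. (2) and (4) differ only in whether the network output is used directly or subtracted from the mixture. A minimal sketch of that dispatch follows, with the trained network replaced by an arbitrary callable. The function `filter_mixture`, the mode strings, and the oracle "models" in the usage example are hypothetical stand-ins, not the U-Net described in the paper.

```python
from typing import Callable

import numpy as np

def filter_mixture(
    x: np.ndarray,
    model: Callable[[np.ndarray], np.ndarray],
    supplemental: str,
) -> np.ndarray:
    """Return x' from mixture x, given a model trained on clean or noise data."""
    y_hat = model(x)
    if supplemental == "clean":
        # Eq. (2): the network output already approximates C + A.
        return y_hat
    if supplemental == "noise":
        # Eq. (4): the network output approximates N, so subtract it.
        return x - y_hat
    raise ValueError(f"unknown supplemental type: {supplemental!r}")

# Toy usage with oracle models that know the true noise.
rng = np.random.default_rng(0)
C, N, A = (rng.standard_normal(1000) for _ in range(3))
x = C + N + A
x_clean_mode = filter_mixture(x, lambda m: m - N, "clean")
x_noise_mode = filter_mixture(x, lambda m: N, "noise")
```

With perfect separation both modes return the same signal, $C + A$; with a real network the two modes fail differently, since errors in the noise estimate are subtracted from the mixture rather than appearing in the output directly.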
2.2. Anomaly Detection
Once we have $x'$ we can theoretically use any of the methods presented in the DCASE challenges 2020-2024, ranging from outlier exposure to inlier modeling and a large diversity of combinations of the two [6]–[9]. We choose to use a similar approach as the baseline of the 2025 challenge, consisting of an autoencoder based on Mel-spectrograms. Additionally, we employ SMOTE [10] for oversampling the target domain and mixup [11] to augment our data. For anomaly detection, we evaluate using the area under the ROC curve (AUC) and partial AUC (pAUC), following the official DCASE challenge metrics [4].
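How mixup is applied to the autoencoder inputs is not detailed in the text. As a hedged illustration, the sketch below implements standard input-only mixup [11] on a batch of feature vectors: each vector is replaced by a convex combination of itself and a randomly chosen partner. The function name `mixup` and the Beta parameter `alpha = 0.2` are assumptions.

```python
import numpy as np

def mixup(features: np.ndarray, alpha: float = 0.2, rng=None) -> np.ndarray:
    """Mix each row of `features` with a randomly permuted partner row."""
    rng = np.random.default_rng() if rng is None else rng
    n = features.shape[0]
    # One mixing coefficient per sample, drawn from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha, size=(n, 1))
    partner = rng.permutation(n)
    return lam * features + (1.0 - lam) * features[partner]

# Toy usage on a batch of eight 640-dimensional feature vectors.
rng = np.random.default_rng(0)
batch = rng.uniform(0.0, 1.0, size=(8, 640))
mixed = mixup(batch, rng=rng)
```

Since an autoencoder is trained to reconstruct its input, no label mixing is needed; the mixed vectors simply serve as additional training samples.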
3. EXPERIMENTAL SETUP
We adhere to the DCASE 2025 Task 2 protocol [4]. The development dataset provides training and test splits for seven machines: Valve, Bearing, ToyCar, ToyTrain, Slider, Gearbox and Fan, where we have supplemental $C$ in the first three and $N$ in the others. For each component, we first train a correlation-based filter model with the strategy depending on the provided supplemental data. The model is a standard U-Net with input and output being the complex spectrograms of the respective signals using a 64-ms window and 32-ms hop size [12]. U-Net has proven effective for source separation, especially when the thresholds $\varepsilon$ in assumptions 1.1 and 2.1 are small. We use a batch size of 32 and a learning rate of 0.0005 to optimize over a multi-resolution STFT loss [13] for 300 epochs with early stopping. To counteract potential violations of assumptions 1.1 and 2.1, we sweep over various SNR ranges and $N_A$ sources and choose the run resulting in the highest adjusted SI-SDR ($= \text{SI-SDR} - \mathbb{E}[\text{SNR}]$) on a 10% holdout validation set, to ensure we cover realistic SNR ratios that are not known in advance.
• $N_A$ sources: {AudioSet full, AudioSet no tools, AudioSet only tools, DCASE Clean (supplemental clean-only), DCASE Noise (supplemental noise-only)},
• SNR windows: [-30, 30], [-10, 30], [-10, 10], [-5, 5] dB.
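The augmentation step behind this sweep can be sketched as follows: an SNR is drawn uniformly from one of the listed windows and the artificial noise is rescaled so that the mixture has that SNR. This is an assumption-laden illustration; the helper names `mix_at_snr` and `sample_mixture` and the uniform sampling are not taken from the paper.

```python
import numpy as np

# SNR windows in dB, as listed in the experimental setup.
SNR_WINDOWS_DB = [(-30, 30), (-10, 30), (-10, 10), (-5, 5)]

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `signal`, rescaled so the signal-to-noise ratio equals `snr_db`."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise

def sample_mixture(signal, noise, window, rng):
    """Draw an SNR uniformly from `window` (assumed distribution) and mix."""
    snr_db = rng.uniform(*window)
    return mix_at_snr(signal, noise, snr_db), snr_db

# Toy usage: white noise standing in for supplemental data and for N_A.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
artificial_noise = rng.standard_normal(16000)
mixture, drawn_snr = sample_mixture(clean, artificial_noise, SNR_WINDOWS_DB[3], rng)
```

Narrow windows such as [-5, 5] dB concentrate training on comparable signal and noise levels, whereas [-30, 30] dB also covers mixtures in which one component is almost inaudible.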
The anomaly detection autoencoder is trained with very similar parameters as the DCASE 2025 baseline. We use a 64-ms window with a 128-bin mel spectrogram over five consecutive windows as a feature vector. The encoder-decoder architecture is a symmetric MLP. We train the model over 100 epochs with a learning rate of 0.001 and a batch size of 64.
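With 128 mel bins and five consecutive windows, each feature vector has 128 × 5 = 640 dimensions. The sketch below shows one way to build such vectors from a precomputed log-mel spectrogram by sliding a five-frame window one frame at a time. The function name `stack_frames`, the one-frame step, and the frame-major layout are assumptions.

```python
import numpy as np

def stack_frames(log_mel: np.ndarray, n_frames: int = 5) -> np.ndarray:
    """Turn a (n_mels, T) spectrogram into (T - n_frames + 1, n_mels * n_frames) vectors."""
    n_mels, total = log_mel.shape
    n_vectors = total - n_frames + 1
    if n_vectors < 1:
        return np.empty((0, n_mels * n_frames))
    vectors = np.zeros((n_vectors, n_mels * n_frames))
    for k in range(n_frames):
        # Frame k of every window occupies columns [k * n_mels, (k + 1) * n_mels).
        vectors[:, n_mels * k : n_mels * (k + 1)] = log_mel[:, k : k + n_vectors].T
    return vectors

# Toy usage: a fake spectrogram with 128 mel bins and 10 time frames.
spec = np.arange(128 * 10, dtype=float).reshape(128, 10)
feats = stack_frames(spec)
```

The anomaly score of a clip is then typically derived from the reconstruction error of these vectors, averaged over all windows of the clip.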
4. RESULTS
We evaluate our two-stage pipeline on the DCASE 2025 Task 2 development set in three parts: correlation-based filtering on the development data, filtering performance on the additional evaluation data, and anomaly detection on the development set.
Component    SNR [dB]     N_A source       SI-SDR [dB]
Valve        [-10, 10]    AudioSet full    11.4
ToyTrain     [-10, 10]    DCASE Clean      6.3
ToyCar       [-5, 5]      AudioSet full    5.8
Slider       [-5, 5]      DCASE Clean      6.9
Gearbox      [-5, 5]      DCASE Clean      5.7
Fan          [-5, 5]      DCASE Clean      7.4
Bearing      [-5, 5]      AudioSet full    9.1
Table 1: Best adjusted SI-SDR results of correlation-based filtering per component in the development dataset
First, Table 1 reports the optimal filtering settings and adjusted SI-SDR for each of the seven development components. Valve achieves the highest SI-SDR of 11.4 dB using full AudioSet noise at ±10 dB, while Bearing achieves 9.1 dB under a narrower ±5 dB range with the same noise source. ToyTrain (6.3 dB) and ToyCar (5.8 dB) use SNR windows of ±10 dB and ±5 dB with DCASE Clean and AudioSet full augmentations, respectively, reflecting their varied spectral content. The remaining components Slider (6.9 dB), Gearbox (5.7 dB), and Fan (7.4 dB) attain the best separation under narrow (±5 dB) clean-only noise, indicating that limited noise variability suffices for these cases.
Component       SNR [dB]     N_A source           SI-SDR [dB]
AutoTrash       [-5, 5]      AudioSet full        15.29
BandSealer      [-10, 10]    DCASE Clean          6.79
CoffeeGrinder   [-5, 5]      DCASE Clean          11.38
HomeCamera      [-5, 5]      DCASE Clean          12.07
Polisher        [-10, 10]    AudioSet full        5.82
ScrewFeeder     [-5, 5]      AudioSet no tools    8.84
ToyPet          [-5, 5]      DCASE Clean          9.16
ToyRCCar        [-5, 5]      DCASE Clean          8.67
Table 2: Adjusted SI-SDR of correlation-based filtering per component in the additional training dataset
Next, Table 2 presents SI-SDR results on eight novel components in the additional evaluation set. Here, SI-SDR ranges from 5.82 dB (Polisher) up to 15.29 dB (AutoTrash), with most components favoring ±5 dB clean or full-AudioSet noise. This consistency confirms that our correlation-based filter generalizes effectively to unseen machine types in a first-shot scenario.
Component    Baseline    Best 2024    Unfiltered    Filtered
Valve        0.611       0.771        0.669         0.848*
ToyTrain     0.557       0.651*       0.590         0.564
ToyCar       0.567       0.594*       0.588         0.405
Slider       0.561       0.593        0.542         0.600*
Gearbox      0.553       0.704*       0.547         0.566
Fan          0.499       0.639*       0.541         0.545
Bearing      0.598       0.691        0.582         0.734*
hmean        0.5617      0.6582*      0.5771        0.5775
Table 3: Development dataset detection results. Scores correspond to the harmonic mean of AUC and pAUC. Best 2024 corresponds to [14]; the best result per row is marked with an asterisk.
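The hmean row of Table 3 aggregates the per-component scores with a harmonic mean, which penalizes weak components more strongly than an arithmetic mean would. As a small worked check, the snippet below recomputes the Baseline and Unfiltered aggregates from the per-component values in the table using Python's standard library. The variable names are illustrative.

```python
from statistics import harmonic_mean

# Per-component scores from Table 3, in the order
# Valve, ToyTrain, ToyCar, Slider, Gearbox, Fan, Bearing.
baseline = [0.611, 0.557, 0.567, 0.561, 0.553, 0.499, 0.598]
unfiltered = [0.669, 0.590, 0.588, 0.542, 0.547, 0.541, 0.582]

hmean_baseline = round(harmonic_mean(baseline), 4)      # 0.5617
hmean_unfiltered = round(harmonic_mean(unfiltered), 4)  # 0.5771
```

Both values agree with the hmean row of the table.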
Finally, Table 3 compares anomaly detection metrics before (“Unfiltered”) and after filtering (“Filtered”), alongside the baseline and the Best 2024 system. We would expect “Filtered” to be similar or better than “Unfiltered” if the filtering indeed transforms our mixture $x = C + N + A$ to $\theta(x) = C + A = x'$. After filtering, Valve improves from 0.669 to 0.848 (+0.179), Bearing from 0.582 to 0.734 (+0.152), Slider from 0.542 to 0.600 (+0.058) and Gearbox from 0.547 to 0.566 (+0.019), demonstrating that isolating $C + A$ enhances anomaly detection as expected.
Conversely, ToyTrain (0.590 → 0.564, –0.026) and ToyCar (0.588 → 0.396, –0.183) degrade, indicating that their anomalies may not align with our independence assumptions. Overall, the harmonic mean across components increases only marginally, from 0.5771 to 0.5775 (+0.0004). The average benefit of filtering across all machines is therefore modest, mostly because of the poor results on the ToyCar and ToyTrain components.
5. DISCUSSION
The experimental results demonstrate that our correlation-based filtering effectively enhances anomaly detection when the underlying independence assumptions hold. For components such as Valve, Bearing, and Slider, the filter succeeded in isolating the machine signal plus anomaly, leading to clear gains in anomaly detection (Table 3, Figure 1). This indicates that, for these machines, (1) background noise and artificial augmentations remain uncorrelated with the clean sound ($\rho(C, N_A) < \varepsilon$ and $\rho(N, N_A) < \varepsilon$), and (2) anomalous events retain sufficient correlation with the machine signature ($\rho(C, A) > \varepsilon$) to survive filtering.
(a) ToyCar   (b) ToyTrain
Fig. 2: Both components in (a) and (b) produce non-stationary sounds that are correctly filtered. However, they may violate assumptions 1.3, 1.4, and 2.4 because the anomalous sound may be weakly correlated with the machine sound, for example, by only occurring during the ramp-up or ramp-down phase.
In contrast, ToyTrain and ToyCar exhibit performance degradations after filtering, with detection scores falling by 0.026 and 0.183 respectively. Their non-stationary operating cycles with ramp-up, steady, and ramp-down phases appear to violate the assumption that anomalies strongly co-vary with the baseline machine sound. As a result, the filter may remove or attenuate anomalous components along with noise, harming detection (Figure 2). Introducing additional transformations such as windowing could help with this issue but requires further work. Fan and Gearbox exhibit only modest detection gains after filtering, indicating partial alignment with our independence assumptions. We attribute this to an under-representation of background noise in the supplemental data (see Figure 3): for Fan, the supplemental recordings contain only stationary noise which might be easier to detect in STFT, whereas the development and evaluation sets also include non-stationary events such as hammering and grinding. Consequently, the filtering model cannot learn to suppress these dynamic noise components, leaving residual interference in $x'$ and limiting the achievable improvement. We made very similar observations for Gearbox.
(a) Sample of fan supplemental background noise   (b) Sample of fan training data
Fig. 3: In (a) we see a representative sample from supplemental data of Fan, which is stationary. In (b) we can see that the actual training data contains obvious non-stationary background events, such as grinding. The correlation-based filter model does not remove these events because it has never encountered them before.
On the additional evaluation set, the filter generalizes effectively to eight novel components, yielding SI-SDR scores between 5.82 dB and 15.29 dB (Table 2). Although absolute separation quality varies with machine-noise spectral overlap, the consistent performance across unseen machines confirms the robustness of our first-shot filtering approach.
6. CONCLUSION
We have presented a two-stage “first-shot” pipeline for unsupervised anomalous sound detection, combining correlation-based filtering with a mel-spectrogram autoencoder. By grid-searching SNR windows and noise-augmentation sources, our method adaptively separates each mixture into machine-plus-anomaly signals before a simple reconstruction-based error detection. On the DCASE 2025 Task 2 development set, filtering improved detection metrics for the majority of components. On the eight unseen machines we find a similar separation performance range as for the development dataset using the same hyperparameter grid. Future work will explore automated estimation of signal correlations to select augmentations per machine and integration with more sophisticated anomaly detectors that can tolerate partial assumption violations.
REFERENCES
[1] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, “Non-linear prediction with lstm neural networks for acoustic novelty detection,” Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1–8, 2015.
[2] T. Pereira and N. Nunes, “Anomaly detection in industrial shop-floor machines using audio and vibration signals,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 854–858.
[3] Y. Nishida, V. Saponaro, M. Dorfer, and N. Ono, “First-shot unsupervised anomalous sound detection challenge: Overview and baseline system,” in Proceedings of the DCASE 2024 Workshop, 2024, pp. 150–155.
[4] DCASE 2025 Challenge Task 2 organizers, “First-shot unsupervised anomalous sound detection for machine condition monitoring,” Online, 2025.
[5] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “Sdr–half-baked or well done?” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
[6] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit, K. Suefusa, T. Endo, M. Yasuda, and N. Harada, “Description and discussion on dcase2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring,” 2020. [Online]. Available: https://arxiv.org/abs/2006.05822
[7] Y. Kawaguchi, K. Imoto, Y. Koizumi, N. Harada, D. Niizumi, K. Dohi, R. Tanabe, H. Purohit, and T. Endo, “Description and discussion on dcase 2021 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions,” 2021. [Online]. Available: https://arxiv.org/abs/2106.04492
[8] K. Dohi, K. Imoto, N. Harada, D. Niizumi, Y. Koizumi, T. Nishida, H. Purohit, T. Endo, M. Yamamoto, and Y. Kawaguchi, “Description and discussion on dcase 2022 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring applying domain generalization techniques,” 2022. [Online]. Available: https://arxiv.org/abs/2206.05876
[9] K. Dohi, K. Imoto, N. Harada, D. Niizumi, Y. Koizumi, T. Nishida, H. Purohit, R. Tanabe, T. Endo, and Y. Kawaguchi, “Description and discussion on dcase 2023 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” 2023. [Online]. Available: https://arxiv.org/abs/2305.07828
[10] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
[11] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
[12] A. Jansson, R. M. Bittner, N. Montecchio, and T. Weyde, “Learned complex masks for multi-instrument source separation,” 2021. [Online]. Available: https://arxiv.org/abs/2103.12864
[13] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.
[14] A. Jiang, Q. Hou, J. Liu, P. Fan, J. Ma, C. Lu, Y. Zhai, Y. Deng, and W.-Q. Zhang, “Thuee system for first-shot unsupervised anomalous sound detection for machine condition monitoring,” Proceedings of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, Tampere, Finland, pp. 20–22, 2023.
Towards Audio-based Zero-Shot Action Recognition in Kitchen Environments
Alexander Gebhard¹,², Andreas Triantafyllopoulos¹,², Iosif Tsangko¹,², Björn W. Schuller¹,²,³,⁴
¹ CHI – Chair of Health Informatics, TUM University Hospital, Germany
² MCML, Munich Center for Machine Learning, Germany
³ MDSI, Munich Data Science Institute, Germany
⁴ GLAM – Group on Language, Audio, & Music, Imperial College London, UK
Abstract—Human actions often generate sounds that can be recognized to infer their cause. In action recognition, actions can usually be broken down to a combination of verbs and nouns, of which there exist a very large number of enumerations. Contemporary datasets, like EPIC-KITCHENS, cover a wide gamut of the potential action space, but not its entirety. Arguably, the holistic characterization of human actions through the sounds they generate requires the use of zero-shot learning (ZSL). In this contribution, we explore the feasibility of ZSL for recognizing a) nouns, b) verbs, or c) actions on EPIC-KITCHENS. To achieve this, we use linguistic intermediation, by generating descriptions of each word corresponding to our classes using a pre-trained large language model (LLAMA-2). Our results show that human action recognition from sounds is possible in zero-shot fashion, as we consistently obtain results over chance.
Index Terms—zero-shot classification, computer audition, machine learning, action recognition
1. INTRODUCTION
Despite significant strides in the field of computer audition in recent years [1], achieving a level of auditory perception comparable to that of humans remains a considerable challenge [2]. In the case of humans interacting with their environment, achieving a comprehensive understanding of audio involves the intricate task of perceiving all actions embedded within a particular soundscape. This challenge is amplified by the fact that, in psychology, the identification of sound events by humans is deeply intertwined with the recognition of associated actions [3]. However, the multifaceted nature of human interactions with their environment gives rise to a seemingly inexhaustible array of potential action categories. This inherent complexity poses a significant challenge, making it impractical to train a model capable of recognizing every conceivable category or combination. In this regard, zero-shot learning (ZSL) proves ideal, as it generalizes findings from known classes and their combinations to new ones and thus excels in identifying categories that were previously unknown or unseen [2], [4].
As for ZSL, the majority of progress in zero-shot action recognition (ZSAR) has predominantly occurred within the field of computer vision, leveraging semantic information such as video captions and other text data [5]–[7]. Accordingly, mostly visual or audio-visual data have been investigated, not only for ZSAR but action recognition (AR) in general [8]–[11], while audio was often neglected. Some popular datasets employed for AR, such as HMDB51 [12] or UCF101 [13], do not even contain audio or only have partial audio information. As mentioned by Elizalde et al. [3], audio is an extremely important factor for people to be able to recognize actions. In this regard, verbs are often closely tied to characteristic sounds that reflect actions, interactions between objects, and occasionally the material composing the objects [3]. We therefore want to investigate an audio-based ZSAR approach in this study.
This work was partially funded from the DFG's Reinhart Koselleck project No. 442218748 (AUDI0NOMOUS).
Similar to computer vision, it is common for audio-based ZSL approaches to leverage textual descriptions of the target classes, their features, or related information as meta information [14]–[17]. These auxiliary data can be textual descriptions of the sound classes or even the labels themselves [14], descriptions of what the target classes sound like [15], or descriptions of the musical concepts which shall be modeled in music [17]. Among the latest breakthroughs when it comes to audio-based ZSL are modifications of the CLIP approach in computer vision, exemplified by Wav2CLIP [18], AudioCLIP [19], or CLAP [16].
While ZSL has gained popularity in the audio domain, we have not come across any prior studies exploring audio-based ZSAR. This paper takes a step in that direction by carrying out initial investigations. For this purpose, we employ the EPIC-KITCHENS dataset [20], [21], which contains egocentric videos of people interacting with objects in their home kitchen environments. We chose this dataset due to the non-scripted daily activities, the availability of audio for all annotated actions, as well as the fact that the videos were narrated by the participants themselves afterwards. Furthermore, each action comprises a verb and a noun (e.g., “cut tomato”), enabling their separate investigation.
We note that the recently published EPIC-SOUNDS dataset [22] also holds the potential to capture audible actions. Nevertheless, the dataset's action classes predominantly center around collisions and materials, giving rise to classifications like “metal-only” or “cut / chop”. Our objective, however, is to delve into the identification of the actual object(s) engaged in an interaction, alongside discerning the corresponding verb that describes or characterizes the action.
We primarily want to investigate if ZSAR based solely on audio is possible. To do this, we adopt artificially generated textual descriptions of how the actions sound as meta information. Thus, for each annotated action, we use LLAMA-2 [23], a large language model (LLM), to generate a corresponding textual description. This description is then adopted as auxiliary information for the ZSL process. We draw inspiration for this approach from the usage of artificially generated video captions in [6] as well as the textual sound descriptions of various bird species in [15] for ZSL. We furthermore distinguish the three scenarios of classifying 1) the verb, 2) the noun of an action, as well as 3) the action itself. In our modeling approach, we leverage recent research in audio-based ZSL and primarily utilize a standard ZSL method [14], [15]. Our objective is to conduct initial experiments rather than pursue state-of-the-art results. In doing so, we also explore the textual embeddings of the verbs and nouns and how they influence the model performance. The code repository for this work is publicly available on GitHub¹.
¹ https://github.com/CHI-TUM/epic-kitchens-zsl
Table 1: Statistics about the utilized data. The minimum (Min), maximum (Max), and average (Avg) are reported w.r.t. the amount (left) and the total duration (right) of the audio segments per class.
             Σ                             Total Duration (in s)
Class        Min     Max       Avg        Min     Max       Avg
VERB         1       14 648    680        1.1     37 802    2 118
NOUN         2       3 576     330        1.7     9 032     716
ACTION       1       1 768     19         0.3     3 688     59
2. DATA
We utilize the publicly available data from both EPIC-KITCHENS-100 [21] and EPIC-KITCHENS-55 [20] for our experiments. In particular, we adopt the training data from both datasets and split them into our own subsets in Section 3.2. As the test data does not come with annotated actions we neglect it. Based on the timestamps of the annotated actions we extract the action segments from the provided videos and convert them to .mp3 format, as we only exploit the audio stream. In total, we have 57 hours of audio, averaging 3.12 seconds per file.
The annotated actions are based on the narrations by the participants themselves. The core components of a narration are one verb and at least one noun. If multiple nouns are present in a narration, only the first one is considered for the action, as done in the original paper describing the dataset [20]. An action $a_i$ of an audio segment $i$ is defined as $a_i = (v_i, n_i)$ with $v_i$ and $n_i$ being the corresponding verb and noun class. For instance, the narration “add banana to jug” describes the action tuple (add, banana), thus comprising the verb class add and the noun class banana. Note that we omitted prepositional objects (“to jug”), same as the original authors of the data [20]. In total, our extracted subset from the EPIC-KITCHENS datasets comprises 97 verb, 287 noun, and 3 507 action categories. However, there is a considerable data imbalance regarding the amount of audio segments for each verb / noun / action class which is illustrated by Table 1.
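The action definition $a_i = (v_i, n_i)$ maps naturally onto a small tuple type. The sketch below assumes that the verb and the ordered list of nouns of a narration are already available from the annotations; the names `Action` and `to_action` and the toy segment list are hypothetical.

```python
from collections import Counter
from typing import NamedTuple, Sequence

class Action(NamedTuple):
    verb: str
    noun: str

def to_action(verb: str, nouns: Sequence[str]) -> Action:
    """Build the (verb, noun) tuple, keeping only the first noun of the narration."""
    if not nouns:
        raise ValueError("a narration needs at least one noun")
    return Action(verb, nouns[0])

# "add banana to jug": verb "add", nouns listed in narration order.
action = to_action("add", ["banana", "jug"])

# Counting segments per class exposes the kind of imbalance summarized in Table 1.
segments = [action, to_action("cut", ["tomato"]), to_action("add", ["banana"])]
verb_counts = Counter(a.verb for a in segments)
action_counts = Counter(segments)
```

Because `Action` is hashable, the same tuples can serve directly as class labels for the action-level task, while `verb` and `noun` give the labels for the two single-word tasks.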
Audio descriptions: We extract artificially generated textual descriptions as meta-information that describe the expected sound of the corresponding action. This approach is inspired by the incorporation of artificially generated video captions in [6] as well as the leveraging of textual sound descriptions in [15]. For this purpose, we employ LLAMA-2² [23], specifically, the instruction fine-tuned 7-billion parameter model, to generate adequate textual descriptions, representing the auditory characteristics associated with the specific actions. To achieve this, we utilize an appropriately designed prompting template, tailored to sound description tasks. Integrating the specific action as a variable in the prompting strategy, we guide the LLM in producing the final description. The utilized prompt is presented in Table 2. The following quote, belonging to the narration “add banana to jug” from above, gives an impression of these generated descriptions:
“Adding a banana to a jug produces a distinctive 'slosh' sound, followed by a slight 'gurgling' or 'splashing' noise as the fruit settles at the bottom of the container.”
3. METHODOLOGY
This section describes the employed features and how they are
extracted, the utilized zero-shot classification method, and the
experimental setup of this study.
2 https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Table 2: Prompt employed with LLAMA-2 to generate textual sound
descriptions, where {ACTION} is replaced by each narration from
the EPIC-KITCHENS dataset.
Prompt template
<s>[INST] <<SYS>> You are a highly skilled audio engineer with expertise
in accurately describing sounds from various actions in everyday life. I will
provide you with a “kitchen action” and your task is to give me a short,
precise and accurate description of the sound produced by the specific action
in the phrase. The question is, how does this action sound like in terms of
auditioning <</SYS>> “{ACTION}”. [/INST] This action sounds like:
###Response:
Audio features: We utilize audio spectrogram transformer (AST)
embeddings as audio features, obtained through a state-of-the-art AST
model [24] (footnote 3). Prior to extracting the embeddings, the audio files are
resampled to 16 kHz. The process involves converting each audio
file into a 2D array representation, followed by temporal averaging
to derive a 1D vector with a dimensionality of 768.
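The temporal averaging step can be illustrated with a minimal sketch. It assumes the AST output for one clip is available as a list of per-frame vectors; the function name `temporal_mean` and the toy values are hypothetical, and the real embeddings have 768 dimensions rather than 4.

```python
from typing import List


def temporal_mean(frames: List[List[float]]) -> List[float]:
    """Average a (time, dim) sequence of frame embeddings over the time axis."""
    if not frames:
        raise ValueError("need at least one frame")
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


# Toy input: 3 time steps with 4 dimensions each (hypothetical values;
# the embeddings described in the text have 768 dimensions).
toy_frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [5.0, 3.0, 2.0, 2.0],
]
clip_embedding = temporal_mean(toy_frames)
print(clip_embedding)
```

The result is a single fixed-size vector per audio file, independent of the clip duration.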
Text embeddings: We also apply a pre-trained Transformer-based
language model to obtain representative embeddings of the LLAMA-2
text descriptions. For this purpose, we deploy SENTENCE-BERT
(SBERT), an adaptation of the BERT model [25], which was proposed
by Reimers and Gurevych [26] and is intended to reflect semantic
similarity in generated sentence and paragraph embeddings. As BERT
and SBERT showed no discernible difference in [15], we choose the
latter variant. This decision is grounded in the notion that extracting
the semantic meaning from our text descriptions is expected to be
more feasible than handling the onomatopoeia of bird sounds in the
case of [15]. We select the paraphrase-multilingual-mpnet-base-v2
model (footnote 4) from the available set of pre-trained SBERT models
(footnote 5) and call the provided pooling method to yield the embedding
vector of size 768.
Since we extract the SBERT embeddings of every LLAMA-2
description from Section 2, every audio file has a corresponding text
embedding vector. For our ZSL approach, which will be described
in Section 3.1, we require one text embedding vector for each class.
Considering that a class is usually annotated for multiple audio files, we
take into account all of these files and their corresponding LLAMA-2
descriptions. Since we now possess the SBERT embeddings of each
of these descriptions, we can average them to obtain a singular text
embedding vector for each class. That is, for each of our three use
cases of classifying either the 1) verb, 2) noun, or 3) action, we create
a .csv file which contains the classes together with their corresponding
text embedding vector.
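The per-class averaging described above can be sketched as follows. This is an illustrative example only: the function name `class_embeddings`, the two-dimensional toy vectors, and the labels are assumptions, and writing the result to a .csv file is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def class_embeddings(
    items: Sequence[Tuple[str, List[float]]]
) -> Dict[str, List[float]]:
    """Average all description embeddings that share the same class label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for label, vec in items:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Hypothetical (label, text embedding) pairs, one per audio file.
toy_items = [
    ("add", [1.0, 0.0]),
    ("add", [3.0, 2.0]),
    ("cut", [0.0, 4.0]),
]
verb_embeddings = class_embeddings(toy_items)
print(verb_embeddings)
```

The same routine would be run three times, once each with verb, noun, and action labels, to obtain the three class-embedding tables.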
3.1. Model Training
The ZSL methodology implemented in this study is the one from
Gebhard et al. [15], building upon the foundation laid by previous
studies from Xie et al. [14] and Akata et al. [27]. They employ a
compatibility function on an acoustic-semantic projection to classify
the sound classes. This compatibility function is leveraged by a ranking
hinge loss in their training process, with the sound class exhibiting the
highest compatibility deemed correct. The aim is for the top-ranked
class embeddings to provide the most accurate description of the
audio sample.
Following the approaches of [15] and [14], we employ a single
linear layer equipped with as many neurons as the size of the
3 https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer
4 https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
5 https://www.sbert.net
Detection and Classification of Acoustic Scenes and Events 2025 30–31 October 2025, Barcelona, Spain
[Figure 1 (diagram). Block labels: Audio; Acoustic embeddings; Acoustic-semantic
projection; Compatibility function; Class embeddings; Ranked hinge loss;
Parameter update; Classifier; Prediction; Narration (“add banana to jug”);
Llama2 description ("Adding a banana to a jug produces a distinctive 'slosh'
sound, followed by a slight 'gurgling' or 'splashing' noise as the fruit
settles at the bottom of the container.").]
Fig. 1: Overview of the zero-shot learning pipeline adapted from
[14], [15]. Audio samples are converted into acoustic embeddings
using an AST model and projected into a shared semantic space.
LLAMA-2-generated textual descriptions are encoded via SBERT to
obtain class embeddings. The classification is based on the highest
compatibility score between acoustic and class embeddings.
corresponding class embeddings to project the acoustic embeddings
onto the class embeddings. Furthermore, we apply the dot product
as our compatibility function. The schematic representation of our
pipeline is depicted in Fig. 1.
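To make the roles of the linear projection, the dot-product compatibility, and a ranking-style hinge loss concrete, here is a minimal sketch. It is not the exact loss of [14], [15]: the plain sum-of-margins hinge, the margin of 1.0, the absence of a bias term, and all names and toy values are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def project(weights: Matrix, x: Vector) -> List[float]:
    """Linear layer without bias: maps an acoustic embedding into the class space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def compatibility(weights: Matrix, x: Vector, class_emb: Vector) -> float:
    """Dot product between the projected acoustic embedding and a class embedding."""
    return sum(p * c for p, c in zip(project(weights, x), class_emb))


def predict(weights: Matrix, x: Vector, class_embs: Sequence[Vector]) -> int:
    """Return the index of the class with the highest compatibility."""
    scores = [compatibility(weights, x, c) for c in class_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def ranking_hinge_loss(weights: Matrix, x: Vector, class_embs: Sequence[Vector],
                       target: int, margin: float = 1.0) -> float:
    """Simplified hinge: penalize every wrong class scoring within `margin` of the target."""
    scores = [compatibility(weights, x, c) for c in class_embs]
    return sum(max(0.0, margin + s - scores[target])
               for i, s in enumerate(scores) if i != target)


# Toy setup (hypothetical): identity projection, two orthogonal class embeddings.
W = [[1.0, 0.0], [0.0, 1.0]]
classes = [[1.0, 0.0], [0.0, 1.0]]
audio = [1.0, 0.0]
predicted = predict(W, audio, classes)
loss_if_correct = ranking_hinge_loss(W, audio, classes, target=0)
loss_if_wrong = ranking_hinge_loss(W, audio, classes, target=1)
```

In training, the weights of the projection would be updated to reduce this loss; at test time only `predict` is needed, and the class embeddings may belong to classes never seen during training.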
3.2. Experimental Setup
For our experiments, we employed a non-exhaustive cross-validation
approach, aiming to ensure an ample amount of data for training.
Consequently, we opted for an 80–10–10 split for each of the
five splits. We took explicit care to ensure that these three sets of a
split were mutually exclusive, meaning that no classes could appear
in the other two sets. In generating the five splits, we also ensured
that the development and test sets, in relation to the other four splits,
consistently contained different classes, thereby avoiding any overlap.
Our study assesses the efficacy of the ZSL approach outlined in
Section 3.1 when paired with the artificially generated meta-information
expounded in Section 2. The experiments are executed across the
five splits, and the average performance on the development/test sets
is reported. We furthermore opt for three different random seeds to
initialize our model and also take the mean of those.
We conduct training for a total of 30 epochs, utilizing an Adam
optimizer with a learning rate of 0.0001 and a batch size of 16. As
a compatibility function for our ranking loss, we employ the dot
product, as mentioned in Section 3.1. Subsequently, the model state
demonstrating optimal performance on the development set is selected
for evaluation on the test set. Before analyzing the results, we conduct
an exploratory data analysis on the text embeddings of the verb
and noun classes to gain insights into the underlying relationships.
Section 4 offers the results and the corresponding discussion.
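One way to obtain class-disjoint splits with non-overlapping development and test classes across the five splits is sketched below. The construction is an assumption, not the authors' procedure: it partitions the class list into ten folds and, for split k, uses fold 2k for development, fold 2k+1 for testing, and the remaining eight folds for training. The function name and the class names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Set


def class_disjoint_splits(classes: Sequence[str], n_splits: int = 5,
                          seed: int = 0) -> List[Dict[str, Set[str]]]:
    """Build 80-10-10 class-level splits whose dev/test classes never repeat."""
    shuffled = sorted(classes)
    random.Random(seed).shuffle(shuffled)
    n_folds = 2 * n_splits
    # Fold i takes every n_folds-th class, so folds are disjoint and near-equal in size.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_splits):
        dev = set(folds[2 * k])
        test = set(folds[2 * k + 1])
        train = set(shuffled) - dev - test
        splits.append({"train": train, "dev": dev, "test": test})
    return splits


# Hypothetical class inventory of 100 classes.
all_classes = [f"class_{i}" for i in range(100)]
splits = class_disjoint_splits(all_classes)
for index, split in enumerate(splits):
    print(index, len(split["train"]), len(split["dev"]), len(split["test"]))
```

Because the partition is made over classes rather than audio files, the share of audio segments in each set can deviate from 80–10–10 when classes are imbalanced.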
4. RESULTS
The metrics used for the evaluation are a superset of the metrics
utilized in the EPIC-KITCHENS papers [20], [21]. We use unweighted
accuracy (UA), weighted accuracy (WA), and unweighted precision
(UP). UA is calculated by computing the recall for each class and then
averaging the results; the same is done for UP, just with the precision.
WA is instead computed by taking the percentage of correctly classified
examples and is thus agnostic to class imbalance. Therefore, WA
can be considered as a broad overview of model performance, while
UA and UP take class imbalances more into account.
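The three metrics can be computed directly from label lists, as in the following sketch. Two conventions here are assumptions rather than statements from the text: the class set is taken from the ground-truth labels, and a class that is never predicted contributes a precision of 0.

```python
from typing import Dict, Sequence


def evaluation_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Return unweighted accuracy (UA), weighted accuracy (WA), unweighted precision (UP)."""
    classes = sorted(set(y_true))
    recalls, precisions = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_true = sum(1 for t in y_true if t == c)
        n_pred = sum(1 for p in y_pred if p == c)
        recalls.append(tp / n_true)
        # Assumption: a class that is never predicted gets precision 0.
        precisions.append(tp / n_pred if n_pred else 0.0)
    ua = sum(recalls) / len(classes)      # mean per-class recall
    up = sum(precisions) / len(classes)   # mean per-class precision
    wa = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {"UA": ua, "WA": wa, "UP": up}


# Toy example with an imbalanced label distribution (hypothetical labels).
truth = ["a", "a", "a", "b"]
preds = ["a", "a", "b", "b"]
metrics = evaluation_metrics(truth, preds)
print(metrics)
```

In this toy case class "a" has recall 2/3 and precision 1, class "b" has recall 1 and precision 1/2, and three of four examples are correct.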
[Figure 2 (heatmaps of SBERT cosine similarities, colour scale 0.3 to 1.0):
(a) cosine similarities among the verb embeddings (axes: verbs × verbs);
(b) cosine similarities among the noun embeddings (axes: nouns × nouns).]
Fig. 2: The pairwise cosine similarity matrices of the SBERT
embeddings of the (a) verb and (b) noun classes depicted as heatmaps.
Each cell represents the similarity between two class descriptions. The
noun embeddings exhibit clearer differentiation, suggesting stronger
separability in the text embedding space.
4.1. Exploratory data analysis
For our analysis we resort to cosine similarity visualized as heatmaps
and t-SNE pairwise distance visualized as scatter plot.
Cosine similarities: First, we conduct an analysis of pairwise
cosine similarities within the text embeddings of the verb and noun
classes. The action classes, being composed of verb and noun
classes, are not considered in this analysis, allowing us to focus
on the smaller components. Our aim is to gauge the strength of
the textual embeddings for both the verb and noun classes, to
determine which embeddings impart more distinct characteristics.
Specifically, we calculate the pairwise cosine similarity between the
(S)BERT embeddings of each category and every other category. This
computation results in a matrix of cosine similarities w.r.t. those
embeddings. Visualized in Fig. 2 as heatmaps, these matrices reveal
that noun embeddings manifest more distinct representations across
various classes. As a result, we anticipate the noun classification use
case to relatively outperform verb classification in terms of model
performance. Section 4.2 presents the results and discussions.
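The pairwise similarity matrix behind such a heatmap can be computed as in this small sketch. The function name and the two-dimensional toy vectors are assumptions; the class embeddings in the study have 768 dimensions.

```python
import math
from typing import List, Sequence


def cosine_similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the symmetric matrix of cosine similarities between all embedding pairs."""
    norms = [math.sqrt(sum(v * v for v in emb)) for emb in embeddings]
    return [
        [
            sum(a * b for a, b in zip(emb_i, emb_j)) / (norm_i * norm_j)
            for emb_j, norm_j in zip(embeddings, norms)
        ]
        for emb_i, norm_i in zip(embeddings, norms)
    ]


# Hypothetical class embeddings.
toy_class_embeddings = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
similarities = cosine_similarity_matrix(toy_class_embeddings)
for row in similarities:
    print([round(value, 3) for value in row])
```

Lower off-diagonal values indicate class descriptions that are easier to tell apart in the text embedding space, which is the property compared between verbs and nouns in Fig. 2.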
t-SNE distances: To further understand the relationship between
the text embeddings of different classes, we apply t-SNE (t-Distributed
Stochastic Neighbor Embedding) [28] to the textual embeddings of
both the verbs as well as the nouns. This way, we can visualize the
high-dimensional data in a lower-dimensional, 2D space (footnote 6). Before
creating the plots, we annotate the verb and noun categories to the
groups specified in [21], to allow a better analysis. There are 13 verb
and 21 noun groups. t-SNE has a tendency to group similar data
points in the reduced-dimensional space. If embedding vectors from
different classes form distinct clusters, it suggests that the original
high-dimensional vectors carry information that allows for effective
class separation. Knowing this and looking at the plots depicted in
Fig. 3 we can, therefore, make several assumptions. For a clearer and
more detailed view, we recommend inspecting the interactive plots
provided in the supplementary material.
First, when looking at the t-SNE plot of the verb classes it is
hard to identify some groups. The only noticeable groups are the
verbs belonging to the “monitor” or the “split” group. Unfortunately,
even these groups exhibit verb classes that are quite distantly related.
However, there are also two small clusters in which the classes are
6 Interactive t-SNE plots are attached as HTML-files in the supplementary
material for easier visualization.
[12]
A. Mesa os, R. Se izel, T. Hei ola, T. Vi anen, and M. D. Plumbley,
“A decade o DCASE: achie emen s, p ac ices, e alua ions and u u e
challenges,” in P oc. ICASSP, 2025.
[13]
R. Tanabe, H. Pu ohi , K. Dohi, T. Endo, Y. Nikaido, T. Nakamu a, and
Y. Kawaguchi, “MIMII DUE: Sound da ase o mal unc ioning indus ial
machine in es iga ion and inspec ion wi h domain shi s due o changes
in ope a ional and en i onmen al condi ions,” in P oc. WASPAA, 2021.
[14]
N. Ha ada, D. Niizumi, D. Takeuchi, Y. Ohishi, M. Yasuda, and S. Sai o,
“ToyADMOS2: Ano he da ase o minia u e-machine ope a ing sounds
o anomalous sound de ec ion unde domain shi condi ions,” in P oc.
DCASE, 2021.
[15]
H. Chen, Y. Song, L. Dai, I. McLoughlin, and L. Liu, “Sel -supe ised
ep esen a ion lea ning o unsupe ised anomalous sound de ec ion unde
domain shi ,” in P oc. ICASSP, 2022.
[16]
I. Ku oyanagi, T. Hayashi, Y. Adachi, T. Yoshimu a, K. Takeda, and
T. Toda, “An ensemble app oach o anomalous sound de ec ion based
on con o me -based au oencode and bina y classi ie inco po a ed wi h
me ic lea ning,” in P oc. DCASE, 2021.
[17]
M. Yamaguchi, Y. Koizumi, and N. Ha ada, “AdaFlow: Domain-adap i e
densi y es ima o wi h applica ion o anomaly de ec ion and unpai ed
c oss-domain ansla ion,” in P oc. ICASSP, 2019.
[18]
J. A. Lopez, G. S emme , P. Lopez-Meye , P. Singh, J. A. del Hoyo On-
i e os, and H. A. Co dou ie , “Ensemble o complemen a y anomaly
de ec o s unde domain shi ed condi ions,” in P oc. DCASE, 2021.
[19]
B. Chen, L. Bondi, and S. Das, “Lea ning o adap o domain shi s wi h
ew-sho samples in anomalous sound de ec ion,” in P oc. ICPR, 2022.
[20]
K. Wilkingho , “Combining mul iple dis ibu ions based on sub-clus e
AdaCos o anomalous sound de ec ion unde domain shi ed condi ions,”
in P oc. DCASE, 2021.
[21]
I. Ku oyanagi, T. Hayashi, K. Takeda, and T. Toda, “Two-s age anomalous
sound de ec ion sys ems using domain gene aliza ion and specializa ion
echniques,” in P oc. DCASE, 2022.
[22]
J. Guan, Y. Liu, Q. Zhu, T. Zheng, J. Han, and W. Wang, “Time-
weigh ed equency domain audio ep esen a ion wi h GMM es ima o
o anomalous sound de ec ion,” in P oc. ICASSP, 2023.
[23]
W. Junjie, W. Jiajun, C. Shengbing, S. Yong, and L. Mengyuan, “Anomaly
sound de ec ion sys em based on mul i-dimensional a en ion module,”
DCASE2023 Challenge, Tech. Rep., 2023.
[24]
S. Chen, Y. Sun, J. Wang, M. Wan, M. Liu, and X. Li, “A mul i-scale
dual-decode au oencode model o domain-shi machine sound anomaly
de ec ion,” Digi . Signal P ocess., ol. 156, 2025.
[25]
Y. Deng, A. Jiang, Y. Duan, J. Ma, X. Chen, J. Liu, P. Fan, C. Lu, and
W. Zhang, “Ensemble o mul iple anomalous sound de ec o s,” in P oc.
DCASE, 2022.
[26]
I. Nejja , J. Meunie -Pion, G. F usque, and O. Fink, “DG-Mix: Domain
gene aliza ion o anomalous sound de ec ion based on sel -supe ised
lea ning,” in P oc. DCASE, 2022.
[27]
J. Yan, Y. Cheng, Q. Wang, L. Liu, W. Zhang, and B. Jin, “T ans o me
and g aph con olu ion-based unsupe ised de ec ion o machine anoma-
lous sound unde domain shi s,” IEEE T ans. Eme g. Top. Compu .
In ell., ol. 8, no. 4, 2024.
[28]
S. Venka esh, G. Wiche n, A. S. Sub amanian, and J. Le Roux,
“Imp o ed domain gene aliza ion ia disen angled mul i- ask lea ning in
unsupe ised anomalous sound de ec ion,” in P oc. DCASE, 2022.
[29]
K. Dohi, T. Endo, and Y. Kawaguchi, “Disen angling physical pa ame e s
o anomalous sound de ec ion unde domain shi s,” in P oc. EUSIPCO,
2022.
[30]
H. Lan, Q. Zhu, J. Guan, Y. Wei, and W. Wang, “Hie a chical me ada a
in o ma ion cons ained sel -supe ised lea ning o anomalous sound
de ec ion unde domain shi ,” in P oc. ICASSP, 2024.
[31]
J. Guan, J. Tian, Q. Zhu, F. Xiao, H. Zhang, and X. Liu, “Disen angling
hie a chical ea u es o anomalous sound de ec ion unde domain shi ,”
in P oc. ICASSP, 2025.
[32]
N. Ha ada, D. Niizumi, Y. Ohishi, D. Takeuchi, and M. Yasuda, “Fi s -
sho anomaly sound de ec ion o machine condi ion moni o ing: A
domain gene aliza ion baseline,” in P oc. EUSIPCO, 2023.
[33]
K. Wilkingho , “Design choices o lea ning embeddings om auxilia y
asks o domain gene aliza ion in anomalous sound de ec ion,” in P oc.
ICASSP, 2023.
[34]
K. Wilkingho , H. Yang, J. Ebbe s, F. G. Ge main, G. Wiche n, and J. Le
Roux, “Local densi y-based anomaly sco e no maliza ion o domain
gene aliza ion,” a Xi p ep in a Xi :2509.10951, 2025.
[35]
P. Saeng hong and T. Shinozaki, “Deep gene ic ep esen a ions o
domain-gene alized anomalous sound de ec ion,” in P oc. ICASSP, 2025.
[36]
S. Io e and C. Szegedy, “Ba ch no maliza ion: accele a ing deep ne wo k
aining by educing in e nal co a ia e shi ,” in P oc. ICML, 2015.
[37]
F. M. Ca lucci, L. Po zi, B. Capu o, E. Ricci, and S. R. Bul
`
o, “Au oDIAL:
Au oma ic domain alignmen laye s,” in P oc. ICCV, 2017.
[38]
A. Nichol, J. Achiam, and J. Schulman, “On i s -o de me a-lea ning
algo i hms,” a Xi p ep in a Xi :1803.02999, 2018.
[39]
J. Snell, K. Swe sky, and R. S. Zemel, “P o o ypical ne wo ks o ew-sho
lea ning,” in P oc. Neu IPS, 2017.
[40]
J. Wang, C. Lan, C. Liu, Y. Ouyang, T. Qin, W. Lu, Y. Chen, W. Zeng,
and P. S. Yu, “Gene alizing o unseen domains: A su ey on domain
gene aliza ion,” IEEE T ans. Knowl. Da a Eng., ol. 35, no. 8, 2023.
[41]
K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. C. Loy, “Domain gene aliza-
ion: A su ey,” IEEE T ans. Pa e n Anal. Mach. In ell., ol. 45, no. 4,
2023.
[42]
K. Dohi, T. Nishida, H. Pu ohi , R. Tanabe, T. Endo, M. Yamamo o,
Y. Nikaido, and Y. Kawaguchi, “MIMII DG: Sound da ase o mal-
unc ioning indus ial machine in es iga ion and inspec ion o domain
gene aliza ion ask,” in P oc. DCASE, 2022.
[43]
N. Ha ada, D. Niizumi, D. Takeuchi, Y. Ohishi, and M. Yasuda,
“ToyADMOS2+: New oyadmos da a and benchma k esul s o he i s -
sho anomalous sound e en de ec ion baseline,” in P oc. DCASE, 2023.
[44]
D. Niizumi, N. Ha ada, Y. Ohishi, D. Takeuchi, and M. Yasuda,
“ToyADMOS2#: Ye ano he da ase o he DCASE2024 challenge
ask 2 i s -sho anomalous sound de ec ion,” in P oc. DCASE, 2024.
[45]
D. Albe ini, F. Augus i, K. Esme , A. Be na dini, and R. Sannino,
“IMAD-DS: A da ase o indus ial mul i-senso anomaly de ec ion
unde domain shi condi ions,” in P oc. DCASE, 2024.
[46]
N. V. Chawla, K. W. Bowye , L. O. Hall, and W. P. Kegelmeye , “SMOTE:
Syn he ic mino i y o e -sampling echnique,” J. A i . In ell. Res., ol. 16,
2002.
[47]
S. Ben-Da id, J. Bli ze , K. C amme , and F. Pe ei a, “Analysis o
ep esen a ions o domain adap a ion,” in P oc. Neu IPS, 2006.
[48]
A. Ba des, J. Ponce, and Y. LeCun, “VICReg: Va iance-in a iance-
co a iance egula iza ion o sel -supe ised lea ning,” in P oc. ICLR,
2022.
[49]
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: Beyond
empi ical isk minimiza ion,” in P oc. ICLR, 2018.
[50]
B. Sun and K. Saenko, “Deep CORAL: co ela ion alignmen o deep
domain adap a ion,” in P oc. ECCV, 2016.
[51]
H. Nam, S. Kim, and Y. Pa k, “Fil e augmen : An acous ic en i onmen al
da a augmen a ion me hod,” in P oc. ICASSP, 2022.
[52]
H. Zhang, Y. Zhang, W. Liu, A. Welle , B. Sch
¨
olkop , and E. P. Xing,
“Towa ds p incipled disen anglemen o domain gene aliza ion,” in P oc.
CVPR, 2022.
[53]
Y. Ganin, E. Us ino a, H. Ajakan, P. Ge main, H. La ochelle, F. La iole e,
M. Ma chand, and V. S. Lempi sky, “Domain-ad e sa ial aining o
neu al ne wo ks,” JMLR, ol. 17, 2016.
[54]
T. Lin, P. Goyal, R. B. Gi shick, K. He, and P. Doll
´
a , “Focal loss o
dense objec de ec ion,” in P oc. ICCV, 2017.
[55]
K. Wilkingho , H. Yang, J. Ebbe s, F. G. Ge main, G. Wiche n, and
J. Le Roux, “Keeping he balance: Anomaly sco e calcula ion o domain
gene aliza ion,” in P oc. ICASSP, 2025.
[56]
H. Wang, H. He, and D. Ka abi, “Con inuously indexed domain
adap a ion,” in P oc. ICML, 2020.
[57]
Q. Wang, O. Fink, L. V. Gool, and D. Dai, “Con inual es - ime domain
adap a ion,” in P oc. CVPR, 2022.
[58]
K. T. Mai, T. Da ies, L. D. G i in, and E. Bene os, “Explaining he
decision o anomalous sound de ec o s,” in P oc. DCASE, 2022.
[59]
S. Tsubaki, Y. Kawaguchi, T. Nishida, K. Imo o, Y. Okamo o, K. Dohi,
and T. Endo, “Audio-change cap ioning o explain machine-sound
anomalies,” in P oc. DCASE, 2023.
[60]
J. Gao, X. Ma, and C. Xu, “Lea ning ans e able concep ual p o o ypes
o in e p e able unsupe ised domain adap a ion,” IEEE T ans. Image
P ocess., ol. 33, 2024.
[61]
S. Bobek, S. Nowaczyk, S. Pashami, Z. Taghiya enani, and G. J. Nalepa,
“Towa ds explainable deep domain adap a ion,” in P oc. ECAI Wo kshops,
2023.
[62]
A. Zunino, S. A. Ba gal, R. Volpi, M. Sameki, J. Zhang, S. Scla o ,
V. Mu ino, and K. Saenko, “Explainable deep classi ica ion models o
domain gene aliza ion,” in P oc. CVPR Wo kshops, 2021.
[63] C. Agga wal, Ou lie Analysis, 2nd ed. Sp inge , 2017.
[64]
T. Fujimu a, K. Wilkingho , K. Imo o, and T. Toda, “ASDKi : A oolki
o comp ehensi e e alua ion o anomalous sound de ec ion me hods,”
a Xi p ep in a Xi :2507.10264, 2025.
Adjusting Bias in Anomaly Scores via Variance Minimization for
Domain-Generalized Discriminative Anomalous Sound Detection
Masaaki Matsumoto1, Takuya Fujimura1, Wen-Chin Huang1, Tomoki Toda2
1Graduate School of Informatics, Nagoya University, Nagoya, Japan
2Information Technology Center, Nagoya University, Nagoya, Japan
Abstract: We propose an anomaly score rescaling method based on
variance minimization for domain-generalized anomalous sound detection
(ASD). Current state-of-the-art ASD methods face significant challenges
due to pronounced domain shifts, which lead to inconsistent anomaly
score distributions across domains. One promising existing approach
to address this issue is to rescale anomaly scores based on local data
density in the embedding space. To enable more flexible and adaptive
rescaling, our proposed method introduces weighting parameters into
the rescaling process and analytically optimizes them based on the score
variance minimization. Experimental evaluations on the DCASE 2021–2024
ASD datasets demonstrate that our proposed method achieves significant
improvements on the DCASE 2022–2024 datasets. We also confirm that
the proposed method obtains weighting parameters that lead to high ASD
performance.
Index Terms: anomalous sound detection, domain generalization,
anomaly score rescaling
1. INTRODUCTION
Anomalous sound detection (ASD) is the task of identifying abnormal
sounds from audio data. Since it is difficult to collect anomalous sound
data, we need to develop ASD systems by using only normal sound
data [1]–[5]. One of the main challenges in the ASD task is domain
shift [6]. Domain shifts are variations (in acoustic environments,
recording equipment, or operational conditions) that do not affect
whether a sample is normal or anomalous. Domain shifts can occur
during system deployment, so ASD systems must perform robustly not
only in domains with abundant data but also in domains where only a
few samples are collected during development. Here, it is common
to refer to domains with abundant normal data as the source domain,
and those with only a few normal samples as the target domain.
Current state-of-the-art ASD methods predominantly employ
discriminative approaches [7]–[11]. These methods leverage labels
associated with sounds, such as machine types or operational parameters,
and train a feature extractor through the classification task. Anomaly
scores are then calculated based on the distance between test samples
and the training samples within the discriminative embedding space.
The underlying principle is that anomalous sounds are not included
in the training data; they are not correctly classified, causing them
to deviate from the normal sound distribution in the discriminative
embedding space and resulting in high anomaly scores. Although
this approach achieves high performance, the limited training data
in the target domain still often causes inconsistent anomaly score
distributions across domains, as shown in Fig. 1. In such cases, the
optimal threshold for distinguishing normal and anomalous samples
in the source domain does not generalize well to the target domain.
One promising approach to handle this challenge is the anomaly
score rescaling approach [12]. This method rescales anomaly scores
based on the local density, where low-density target domains tend
to exhibit higher anomaly scores. While this rescaling framework
has proven effective, its performance is limited by the assignment of
suboptimal fixed hyperparameters across different embedding spaces.
Fig. 1: Discrepancies of anomaly scores between domains.
In this paper, we aim to further enhance domain generalization
ability by proposing a new anomaly score rescaling method that
automatically adjusts the degree of rescaling according to the
embedding space. Our proposed method introduces a weighting
parameter into the rescaling process and analytically optimizes it
based on the minimization of the variance of anomaly scores over
normal samples in both the source and target domains. In experimental
evaluations, we demonstrate that our proposed method outperforms
existing anomaly score calculation methods on the DCASE 2022–
2024 datasets. Furthermore, the experimental analysis shows that
our method obtains a weighting parameter that leads to high ASD
performance through analytical optimization.
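2. ANOMALY SCORE CALCULATION METHODS
2.1. Baseline method
The anomaly score $A(x, \mathcal{X}_{\mathrm{ref}})$ of a test embedding $x$ is typically
calculated as the distance to its nearest neighbor in a reference set
$\mathcal{X}_{\mathrm{ref}}$ consisting of normal training samples as follows:
$$A(x, \mathcal{X}_{\mathrm{ref}}) := \min_{y \in \mathcal{X}_{\mathrm{ref}}} D(x, y), \tag{1}$$
$$D(x, y) := \tfrac{1}{2}\left(1 - \langle x, y \rangle\right), \tag{2}$$
where $x$ and $y$ are the normalized embeddings with $\|x\| = \|y\| = 1$,
$D(\cdot, \cdot)$ is the cosine distance, and $\langle \cdot, \cdot \rangle$ is the inner product
operation. We set this approach as our baseline.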
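Eqs. 1 and 2 translate into a few lines of code. The sketch below is illustrative only: the helper names and the two-dimensional toy embeddings are assumptions, and a practical system would use vectorized nearest-neighbor search instead of Python loops.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Eq. 2: D(x, y) = (1 - <x, y>) / 2 for unit-norm x and y, so D lies in [0, 1]."""
    return 0.5 * (1.0 - sum(a * b for a, b in zip(x, y)))


def baseline_score(x: Sequence[float], reference: Sequence[Sequence[float]]) -> float:
    """Eq. 1: distance from x to its nearest neighbor in the reference set."""
    return min(cosine_distance(x, y) for y in reference)


# Hypothetical reference set of normal training embeddings.
reference = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0])]
score_near = baseline_score(normalize([2.0, 0.0]), reference)
score_far = baseline_score(normalize([-1.0, 0.0]), reference)
print(score_near, score_far)
```

A test embedding that coincides in direction with a reference sample scores 0, while one pointing away from all reference samples scores higher.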
2.2. Over- and under-sampling techniques for data imbalance
To address the data imbalance between the source and target domains,
techniques such as K-means clustering and SMOTE (Synthetic
Minority Over-sampling Technique) are widely employed in the
anomaly score calculation process [7], [13]–[15]. K-means clustering is
applied to the source domain samples [7], and the obtained centroids
and the original target domain samples are used as the reference
samples $\mathcal{X}_{\mathrm{ref}}$ in Eq. 1. SMOTE generates synthetic samples in the target
domain by linearly interpolating original samples [16]. The augmented
target domain samples and the original source domain samples are used
as the reference samples $\mathcal{X}_{\mathrm{ref}}$ [13]. These methods aim to mitigate the
discrepancies in anomaly score distributions between the source and
target domains by balancing the number of samples across domains.
2.3. Anomaly score rescaling
Recently, to address discrepancies in anomaly score distributions,
a new approach has been proposed [12]. This method rescales
anomaly scores based on the local density of reference samples
in the embedding space. It is motivated by the observation that the
target domain tends to exhibit higher anomaly scores due to a lack
of sufficient reference samples, as shown in Fig. 1.
This method calculates the anomaly score $A_{\mathrm{scaled}}(x, \mathcal{X}_{\mathrm{ref}})$ as
follows:
$$A_{\mathrm{scaled}}(x, \mathcal{X}_{\mathrm{ref}}) := \min_{y \in \mathcal{X}_{\mathrm{ref}}} \frac{D(x, y)}{b(y, \mathcal{X}_{\mathrm{ref}}, K)}, \tag{3}$$
$$b(y, \mathcal{X}_{\mathrm{ref}}, K) = \frac{1}{K} \sum_{k=1}^{K} D(y, y_k), \tag{4}$$
where $y_k$ denotes the $k$-th closest sample to $y$ in $\mathcal{X}_{\mathrm{ref}}$. This method
rescales the distance between a test sample $x$ and a reference sample
$y$ by dividing it by the local density term $b(y, \mathcal{X}_{\mathrm{ref}}, K)$, which is
calculated as the average distance from the reference sample $y$ to
its $K$ nearest neighbors within $\mathcal{X}_{\mathrm{ref}}$. This local density
term prevents high anomaly scores that are due only to the scarcity
of reference samples, thus reducing the discrepancies in anomaly
score distributions. Furthermore, unlike techniques such as K-means
clustering and SMOTE, this method does not require domain labels
and can handle minor domain shifts within the source domain that
are not reflected in the domain labels. Despite these advantages, its
performance still depends on the manually selected hyperparameter
$K$. Although $K = 16$ was found to perform well across several
datasets in their experiments, this fixed value is not optimal for every
embedding space.
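A minimal sketch of the density-based rescaling in Eqs. 3 and 4 follows. It makes two assumptions not stated in the text: a reference sample is excluded from its own neighbor list, and the reference set contains no duplicate embeddings (otherwise the density term could be zero). Names and toy values are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    return 0.5 * (1.0 - sum(a * b for a, b in zip(x, y)))


def local_density_bias(index: int, reference: Sequence[Sequence[float]], k: int) -> float:
    """Eq. 4: mean distance from reference[index] to its k nearest other reference samples."""
    if k > len(reference) - 1:
        raise ValueError("k must be smaller than the reference set size")
    dists = sorted(cosine_distance(reference[index], other)
                   for j, other in enumerate(reference) if j != index)
    return sum(dists[:k]) / k


def rescaled_score(x: Sequence[float], reference: Sequence[Sequence[float]], k: int) -> float:
    """Eq. 3: nearest-neighbor distance divided by the local density term."""
    biases = [local_density_bias(i, reference, k) for i in range(len(reference))]
    return min(cosine_distance(x, y) / b for y, b in zip(reference, biases))


# Hypothetical reference embeddings on the unit circle.
reference = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0])]
test_embedding = normalize([0.0, -1.0])
score_k1 = rescaled_score(test_embedding, reference, k=1)
score_k2 = rescaled_score(test_embedding, reference, k=2)
print(score_k1, score_k2)
```

The two printed scores differ, which illustrates the dependence on the manually chosen K that the text points out.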
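3. PROPOSED METHOD
To address the performance limitations caused by the suboptimal
fixed hyperparameter in the previous anomaly score rescaling method,
we propose a new method that adaptively rescales anomaly scores
according to each embedding space. Our proposed method calculates
the anomaly score $A_{\mathrm{prop}}(x, \mathcal{X}_{\mathrm{ref}}, \alpha)$ as follows:
$$A_{\mathrm{prop}}(x, \mathcal{X}_{\mathrm{ref}}, \alpha) := \min_{y \in \mathcal{X}_{\mathrm{ref}}} \left( D(x, y) - \alpha \cdot b(y, \mathcal{X}_{\mathrm{ref}}, K) \right), \tag{5}$$
where the anomaly score is rescaled by subtracting a bias term,
$b(y, \mathcal{X}_{\mathrm{ref}}, K)$, weighted by a newly introduced parameter $\alpha$. The
optimal weighting parameter $\alpha^\star$ is obtained as follows:
$$y^\star(z, \mathcal{X}_{\mathrm{ref}}) = \arg\min_{y \in \mathcal{X}_{\mathrm{ref}}} D(z, y), \tag{6}$$
$$\alpha^\star = \arg\min_{\alpha} \mathrm{Var}\left( D(z, y^\star(z, \mathcal{X}_{\mathrm{ref}})) - \alpha \cdot b(y^\star(z, \mathcal{X}_{\mathrm{ref}}), \mathcal{X}_{\mathrm{ref}}, K) \mid z \in \mathcal{X}_{\mathrm{val}} \right) \tag{7}$$
$$\phantom{\alpha^\star} = \frac{\mathrm{Cov}\left( D(z, y^\star(z, \mathcal{X}_{\mathrm{ref}})),\, b(y^\star(z, \mathcal{X}_{\mathrm{ref}}), \mathcal{X}_{\mathrm{ref}}, K) \mid z \in \mathcal{X}_{\mathrm{val}} \right)}{\mathrm{Var}\left( b(y^\star(z, \mathcal{X}_{\mathrm{ref}}), \mathcal{X}_{\mathrm{ref}}, K) \mid z \in \mathcal{X}_{\mathrm{val}} \right)}, \tag{8}$$
where $\mathrm{Var}(\cdot)$ and $\mathrm{Cov}(\cdot)$ are variance and covariance, respectively.
$D(z, y^\star(z, \mathcal{X}_{\mathrm{ref}}))$ is identical to $A(z, \mathcal{X}_{\mathrm{ref}})$, and Eq. 7 yields the
value of $\alpha$ that minimizes the variance of the anomaly scores in
the validation set $\mathcal{X}_{\mathrm{val}}$. This variance minimization encourages the
alignment of anomaly scores across domains and reduces the need
for careful tuning of the hyperparameter $K$.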
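The closed form in Eq. 8 follows from expanding the variance in Eq. 7 as Var(D) − 2α·Cov(D, b) + α²·Var(b) and setting its derivative with respect to α to zero. The sketch below checks this numerically on made-up values; the lists `nn_dist` and `bias` stand in for the per-validation-sample nearest-neighbor distances and bias terms and are purely hypothetical.

```python
from typing import Sequence


def covariance(a: Sequence[float], b: Sequence[float]) -> float:
    """Population covariance; covariance(a, a) is the population variance."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def optimal_alpha(nn_dist: Sequence[float], bias: Sequence[float]) -> float:
    """Eq. 8: alpha* = Cov(D, b) / Var(b) over the validation samples."""
    var_b = covariance(bias, bias)
    if var_b == 0.0:
        raise ValueError("bias terms have zero variance; alpha is undefined")
    return covariance(nn_dist, bias) / var_b


def score_variance(nn_dist: Sequence[float], bias: Sequence[float], alpha: float) -> float:
    """Variance of the bias-adjusted scores D - alpha * b (objective of Eq. 7)."""
    adjusted = [d - alpha * b for d, b in zip(nn_dist, bias)]
    return covariance(adjusted, adjusted)


# Hypothetical validation statistics: the first two samples mimic a dense
# (source-like) region, the last two a sparse (target-like) region.
nn_dist = [0.10, 0.14, 0.30, 0.38]
bias = [0.05, 0.06, 0.15, 0.17]
alpha_star = optimal_alpha(nn_dist, bias)
print(alpha_star, score_variance(nn_dist, bias, alpha_star))
```

Because the objective is a convex quadratic in α whenever Var(b) > 0, any other value of α gives a larger score variance on the same validation data.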
For the validation data $\mathcal{X}_{\mathrm{val}}$, we propose four different approaches.
1) TrainAll: We use all available normal training data from both
source and target domains as $\mathcal{X}_{\mathrm{val}}$.
2) TrainRandom: To reduce imbalances between domains in $\mathcal{X}_{\mathrm{val}}$,
we construct a balanced validation set consisting of all target
training samples and randomly sampled source training samples,
such that the number of source samples matches that of the
target domain.
3) TrainCluster: Another approach to reduce imbalances between
domains in $\mathcal{X}_{\mathrm{val}}$ is to use clustering. We first perform K-means
clustering on the source domain, and then select the original
source samples closest to the centroids as $\mathcal{X}_{\mathrm{val}}$, instead of using
the centroids directly. This approach eliminates randomness,
unlike the TrainRandom approach.
4) TestAll: We also consider the case where we can utilize test data
as $\mathcal{X}_{\mathrm{val}}$. Since test samples observed during the operational phase
lack both domain and normal/anomalous labels, we simply use
all test samples as $\mathcal{X}_{\mathrm{val}}$. Assuming that ASD systems are deployed in
both the source and target domains, this approach enables the
use of balanced validation data including sufficient samples;
however, it may include anomalous samples.
Note that TrainRandom and TrainCluster require the domain labels
while TrainAll and TestAll do not. Since $\mathcal{X}_{\mathrm{ref}}$ includes all training
samples and overlaps with $\mathcal{X}_{\mathrm{val}}$, we ensure that $y^\star(z, \mathcal{X}_{\mathrm{ref}}) \neq z$ by
excluding $z$ from $\mathcal{X}_{\mathrm{ref}}$.
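Two of the steps above lend themselves to a short sketch: building a TrainRandom-style balanced validation set, and finding the nearest reference sample while excluding the query itself. Function names, index layout, and the random seed are assumptions for illustration, and the sampling requires at least as many source samples as target samples.

```python
import random
from typing import List, Sequence


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    # Assumes unit-norm embeddings, as in Eq. 2.
    return 0.5 * (1.0 - sum(a * b for a, b in zip(x, y)))


def train_random_validation(source_ids: Sequence[int], target_ids: Sequence[int],
                            seed: int = 0) -> List[int]:
    """All target samples plus an equally sized random subset of source samples."""
    rng = random.Random(seed)
    sampled_source = rng.sample(list(source_ids), k=len(target_ids))
    return sorted(sampled_source) + list(target_ids)


def nearest_other(z_index: int, embeddings: Sequence[Sequence[float]]) -> int:
    """Index of the nearest training sample to embeddings[z_index], excluding itself."""
    z = embeddings[z_index]
    candidates = [j for j in range(len(embeddings)) if j != z_index]
    return min(candidates, key=lambda j: cosine_distance(z, embeddings[j]))


# Hypothetical index layout: 20 source samples followed by 3 target samples.
source_ids = list(range(20))
target_ids = [20, 21, 22]
validation_ids = train_random_validation(source_ids, target_ids)

# Hypothetical unit-norm training embeddings; samples 0 and 1 are identical.
train_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbor_of_first = nearest_other(0, train_embeddings)
print(validation_ids, neighbor_of_first)
```

Without the exclusion, every validation sample drawn from the training data would match itself at distance zero, and the variance in Eq. 7 would carry no information about the domains.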
4. EXPERIMENTAL EVALUATIONS
4.1. Experimental setups
We conducted experimental evaluations using the DCASE 2021–2024
Task 2 Challenge datasets [2]–[5]. These datasets provide labels
for machine types, sections, domains, and attributes. The section
identifies each individual instance of the same machine type, while
the attribute labels reflect the operational state of the machine. The
DCASE 2021 and 2022 datasets consist of seven machine types,
each with six sections. The DCASE 2023 and 2024 datasets consist
of 14 and 16 machine types, respectively, with each machine type
having only one section. One section contains approximately 1,000
normal training samples and 400 or 200 test samples. In the training
dataset, three samples in DCASE 2021 and ten in each of the DCASE
2022–2024 datasets are from the target domain, while the remaining
samples are from the source domain. In the test dataset, the source
and target domains, as well as normal and anomalous samples, are
approximately balanced (i.e., approximately 100 or 50 samples for
each combination of domain and normal/anomalous classes). Each
recording is approximately ten seconds long, consisting of a single-
channel signal sampled at 16 kHz. Each of these datasets is divided
into dev and eval subsets based on section or machine type, and the
evaluation results are aggregated accordingly.
The discriminative feature extractor was similar to that used in [8].
This extractor received an amplitude spectrum and an amplitude
spectrogram as input features and processed them in parallel using
two separate neural networks, a spectrum network and a spectrogram
network. The spectrum network consisted of 1D convolutional layers,
while the spectrogram network consisted of 2D convolutional
layer-based ResNet [17] blocks, Squeeze-and-Excitation [18] blocks, and
multilayer perceptrons. The final output was obtained by concatenating
the outputs from the spectrum and spectrogram networks. For the
spectrogram, we used a DFT size of 1024 and a hop length of 512.
The frequency range was restricted to 200–8000 Hz. We trained the
feature extractor for 16 epochs jointly using machine type, section,
domain, and attribute labels. The optimizer was AdamW [19] with a
fixed learning rate of 0.001, and the batch size was set to 64. For the loss
function, we used Sub-cluster AdaCos (SCAC) [20] with the number
of sub-clusters set to 16 and a fixed scale parameter. While previous
work [7] showed that fixed class centers lead to better performance
Table 1: Mean (± Standard Deviation) of official score [%] for DCASE 2021–2024 datasets using fixed SCAC loss. † requires the domain labels.
Method                  2021 dev      2021 eval     2022 dev      2022 eval     2023 dev      2023 eval     2024 dev      2024 eval
Baseline                67.31 ± 0.51  66.28 ± 0.65  68.97 ± 0.75  63.79 ± 0.92  62.11 ± 0.98  57.00 ± 1.51  58.63 ± 0.71  51.59 ± 0.73
w/ K-means clustering†  65.27 ± 0.59  64.02 ± 0.59  70.49 ± 0.81  61.82 ± 1.04  63.63 ± 0.70  59.17 ± 1.44  59.24 ± 0.96  52.15 ± 0.63
w/ SMOTE†               67.48 ± 0.48  66.43 ± 0.62  69.46 ± 0.74  64.63 ± 0.95  63.24 ± 0.75  59.27 ± 1.52  59.36 ± 0.78  52.16 ± 0.73
Rescaling (K = 8)
Previous [12]           63.38 ± 1.39  61.43 ± 1.65  64.08 ± 1.81  62.68 ± 0.77  58.94 ± 1.34  65.32 ± 1.37  60.08 ± 1.63  51.47 ± 1.38
Prop (TrainAll)         52.46 ± 2.11  49.83 ± 2.53  69.10 ± 1.55  66.34 ± 0.83  61.20 ± 1.64  67.41 ± 1.58  58.56 ± 1.74  54.10 ± 1.25
Prop (TrainRandom)†     60.69 ± 3.61  59.15 ± 2.93  70.60 ± 0.96  66.84 ± 1.05  62.62 ± 2.16  66.18 ± 1.52  58.20 ± 2.59  54.06 ± 1.35
Prop (TrainCluster)†    60.44 ± 3.60  58.55 ± 2.87  70.63 ± 1.00  66.84 ± 1.02  62.23 ± 1.93  66.25 ± 1.80  57.92 ± 2.84  54.40 ± 1.06
Prop (TestAll)          64.02 ± 1.04  62.75 ± 1.24  70.47 ± 1.19  65.28 ± 0.67  64.97 ± 0.95  67.91 ± 0.96  59.43 ± 1.85  56.58 ± 0.68
Rescaling (K = 16)
Previous [12]           63.62 ± 1.38  61.58 ± 1.68  63.33 ± 1.68  61.90 ± 0.85  58.71 ± 1.58  63.92 ± 1.47  59.67 ± 1.71  48.18 ± 0.98
Prop (TrainAll)         51.07 ± 3.17  50.46 ± 1.99  66.71 ± 1.92  65.85 ± 0.87  60.40 ± 1.99  66.09 ± 1.38  58.18 ± 1.81  54.16 ± 1.27
Prop (TrainRandom)†     60.65 ± 3.32  58.72 ± 3.17  70.32 ± 1.11  67.59 ± 0.92  62.81 ± 2.07  65.63 ± 1.75  58.47 ± 2.02  53.52 ± 1.11
Prop (TrainCluster)†    60.21 ± 3.45  58.04 ± 3.07  70.20 ± 1.14  67.55 ± 0.87  62.41 ± 1.90  65.46 ± 1.81  57.96 ± 2.37  53.63 ± 1.11
Prop (TestAll)          63.93 ± 1.10  62.47 ± 1.31  69.43 ± 2.77  65.24 ± 0.92  65.43 ± 0.99  67.39 ± 0.86  59.35 ± 1.68  56.21 ± 0.68
Table 2: Mean (± standard deviation) of official score [%] for DCASE 2021–2024 datasets using trainable SCAC loss. † requires the domain labels.

| Method | 2021 dev | 2021 eval | 2022 dev | 2022 eval | 2023 dev | 2023 eval | 2024 dev | 2024 eval |
|---|---|---|---|---|---|---|---|---|
| Baseline | 68.82 ± 0.60 | 65.22 ± 0.64 | 70.82 ± 0.63 | 67.32 ± 0.56 | 63.96 ± 1.10 | 63.68 ± 3.49 | 60.42 ± 1.03 | 56.16 ± 0.71 |
| w/ K-means clustering† | 67.01 ± 0.98 | 63.96 ± 0.62 | 70.23 ± 1.11 | 64.34 ± 0.70 | 65.53 ± 0.99 | 65.47 ± 2.83 | 61.54 ± 1.16 | 55.71 ± 0.64 |
| w/ SMOTE† | 69.05 ± 0.57 | 65.30 ± 0.55 | 71.12 ± 0.67 | 67.86 ± 0.60 | 65.31 ± 0.93 | 65.60 ± 2.55 | 61.82 ± 0.99 | 56.20 ± 0.78 |
| Rescaling (K = 8) | | | | | | | | |
| Previous [12] | 69.39 ± 0.85 | 64.47 ± 0.78 | 66.53 ± 1.39 | 66.60 ± 0.80 | 61.87 ± 1.66 | 67.23 ± 0.89 | 62.74 ± 1.14 | 52.59 ± 0.93 |
| Prop (TrainAll) | 66.99 ± 0.94 | 62.26 ± 0.75 | 67.79 ± 1.26 | 67.61 ± 0.70 | 64.94 ± 1.37 | 69.75 ± 1.22 | 62.63 ± 1.15 | 55.02 ± 0.70 |
| Prop (TrainRandom)† | 68.52 ± 1.26 | 64.50 ± 0.80 | 70.98 ± 0.88 | 68.80 ± 0.62 | 66.51 ± 0.84 | 69.04 ± 1.38 | 62.74 ± 0.70 | 57.05 ± 0.77 |
| Prop (TrainCluster)† | 67.72 ± 0.87 | 63.60 ± 0.36 | 70.87 ± 0.78 | 68.89 ± 0.61 | 66.44 ± 0.94 | 69.33 ± 1.53 | 62.44 ± 1.02 | 56.22 ± 0.89 |
| Prop (TestAll) | 68.75 ± 0.79 | 64.62 ± 0.50 | 71.66 ± 0.63 | 67.36 ± 0.57 | 66.19 ± 0.69 | 69.91 ± 1.25 | 62.40 ± 1.48 | 58.63 ± 0.79 |
| Rescaling (K = 16) | | | | | | | | |
| Previous [12] | 69.47 ± 0.83 | 64.48 ± 0.83 | 65.41 ± 1.40 | 65.77 ± 0.64 | 60.95 ± 1.33 | 66.28 ± 0.54 | 62.39 ± 1.12 | 49.82 ± 0.84 |
| Prop (TrainAll) | 66.33 ± 0.99 | 61.83 ± 0.70 | 67.93 ± 1.11 | 67.69 ± 0.83 | 64.22 ± 1.48 | 69.19 ± 1.32 | 62.59 ± 1.10 | 55.64 ± 1.22 |
| Prop (TrainRandom)† | 68.51 ± 1.15 | 64.59 ± 0.80 | 71.04 ± 0.81 | 69.03 ± 0.62 | 66.46 ± 0.87 | 68.61 ± 1.50 | 62.72 ± 0.74 | 56.76 ± 1.03 |
| Prop (TrainCluster)† | 67.76 ± 0.87 | 63.73 ± 0.41 | 70.85 ± 0.74 | 68.97 ± 0.55 | 66.24 ± 0.85 | 69.07 ± 1.42 | 62.28 ± 0.89 | 56.20 ± 1.21 |
| Prop (TestAll) | 68.69 ± 0.76 | 64.61 ± 0.56 | 71.79 ± 0.68 | 67.80 ± 0.57 | 66.23 ± 0.71 | 69.07 ± 1.01 | 62.36 ± 1.39 | 57.88 ± 0.70 |
than trainable class centers, we found that trainable class centers
performed better. Accordingly, we conducted experiments using both
fixed and trainable centers in the SCAC loss. Additionally, we applied
Mixup [21] to the input waveforms with a probability of 0.5.
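As a minimal sketch of waveform-level Mixup applied with probability 0.5, the snippet below mixes two waveforms and their one-hot labels. The function name `mixup_waveforms`, the Beta parameter `alpha=1.0`, and the choice to mix the labels with the same coefficient are assumptions for illustration; the text above specifies only the probability.

```python
import numpy as np


def mixup_waveforms(x1, y1, x2, y2, prob=0.5, alpha=1.0, rng=None):
    """Mix two waveforms and their one-hot labels with probability `prob`.

    `alpha` is the Beta-distribution parameter (assumed value, not from the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    # With probability (1 - prob), leave the first sample untouched.
    if rng.random() >= prob:
        return x1, y1
    # Draw the mixing coefficient and form the convex combination.
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix
```

Passing a seeded `rng` makes the augmentation reproducible across trials.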
We compared our proposed method against four existing
backend methods: the baseline method (Eq. 1), the baseline method
with K-means clustering or SMOTE described in Sec. 2.2, and the
anomaly score rescaling method [12] described in Sec. 2.3. For the
baseline method with K-means clustering, we set the number of
clusters to 16. For SMOTE, we set the oversampling ratio to 5%
and the number of neighbors to 5. For the anomaly score rescaling
approach, we set $K$ to 8 and 16 following the previous work [12]. For
TrainCluster in our proposed method, we used K-means clustering
with the number of clusters set to match the number of samples in
the target domain (i.e., three for DCASE 2021 and ten for DCASE
2022–2024).
As evaluation metrics, we used the official DCASE metrics for each
dataset: the harmonic mean of the area under the receiver operating
characteristic (ROC) curve (AUC) and the partial AUC (pAUC) with
$p = 0.1$ over all machine types and domains. We calculated the
arithmetic mean and standard deviation of the official scores across
ten independent trials.
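The aggregation described above can be sketched as follows. This is an illustrative outline only: the helper names `official_score` and `summarize_trials` are hypothetical, the AUC and pAUC values are assumed to be precomputed fractions in (0, 1], and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import harmonic_mean, mean, stdev


def official_score(aucs, paucs):
    """Harmonic mean over all AUC and pAUC values of one trial.

    `aucs` and `paucs` hold one value per machine type / domain
    combination (assumed layout), each a fraction in (0, 1].
    """
    return harmonic_mean(list(aucs) + list(paucs))


def summarize_trials(scores):
    """Arithmetic mean and sample standard deviation across trials."""
    return mean(scores), stdev(scores)
```

Because the harmonic mean is dominated by the smallest values, a single weak machine type or domain lowers the score more than it would under an arithmetic mean.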
4.2. Experimental results
Tables 1 and 2 show the evaluation results when using fixed and
trainable class centers for the SCAC loss, respectively. For the DCASE
2022–2024 datasets, our proposed method consistently achieves high
performance, significantly improving performance in several subsets.
For example, in 2023 eval of Table 1, the baseline, K-means clustering,
SMOTE, and the previous rescaling with $K = 8$ achieved 57.00%,
59.17%, 59.27%, and 65.32%, respectively, while our proposed
TrainAll achieved 67.41% with $K = 8$. Additionally, TrainAll of our
proposed method achieves high performance without requiring domain
labels, whereas the K-means clustering and SMOTE techniques rely
on them. Comparing Tables 1 and 2, we can see that using the trainable
SCAC loss improves overall performance. Although this reduces the
relative performance gain of our method, it still consistently contributes
to performance improvement.
We can also see that the previous rescaling method [12] can cause
performance degradation, whereas our proposed method improves
or keeps the baseline performance in most cases for the DCASE
2022–2024 datasets, regardless of whether the hyperparameter $K$ is
set to 8 or 16. For example, in 2023 dev of Table 2, the previous
method with $K = 8$ degrades performance from 63.96% for the baseline to
61.87%, whereas the proposed method TrainAll with $K = 8$ achieves
64.94%. In contrast, in 2023 eval of Table 2, the previous method
Fig. 2: Graphs of the official score improvement for DCASE 2024 evaluation
data (one trial with a fixed SCAC loss, $K = 16$). The horizontal axis shows
$\alpha$, and the vertical axis shows the official score [%]. Plotted points indicate the
$\alpha$ selected by each validation data selection method (Orange: TrainAll, Red:
TrainRandom, Green: TrainCluster, Gray: TestAll) for each machine.
with $K = 8$ improves the performance from 63.68% for the baseline
to 67.23%, and the proposed method TrainAll with $K = 8$ also
improves the performance, to 69.75%.
Regarding validation data selection, the comparison among TrainAll,
TrainCluster, and TrainRandom shows that the best choice varies
depending on the dataset, and there is no consistent trend. On the
other hand, the TestAll approach achieves the highest performance in most
cases. For example, in 2023 dev of Table 1, the proposed method
TestAll with $K = 16$ achieves 65.43%; in 2024 eval of Table 1, the
proposed method TestAll with $K = 8$ achieves 56.58%; and in 2024 eval
of Table 2, the proposed method TestAll with $K = 8$ achieves 58.63%.
This suggests the importance of using validation datasets in which
the source and target domains are balanced with sufficient samples.
We confirm that $\alpha^\star$ indeed leads to high performance. Fig. 2
illustrates the relationship between $\alpha$ and the official score, along
with the $\alpha^\star$ obtained using each validation data selection method.
This figure is generated for each machine type in the DCASE 2024
evaluation subset, when using the fixed SCAC loss and $K = 16$ under
a specific random seed. Here, the proposed method with $\alpha = 0$ is
equivalent to the baseline method. The figure clearly shows that the
optimal value of $\alpha$ varies across machine types, and that the proposed
method adaptively selects values of $\alpha$ that yield high performance
for each type.
Despite the overall performance improvements on the DCASE 2022–
2024 datasets, we observe that the proposed methods and the previous
rescaling method [12] exhibit degraded performance on the DCASE
Fig. 3: The plots show the embedding of the test samples of section 3
and training samples of all sections of the pump machine from the DCASE
2021 dataset (one trial with a fixed SCAC loss). The AUC of these test
samples for the baseline method, the previous rescaling using $K = 8$, and
the proposed method using $K = 8$ and TrainAll were 86.69%, 67.01%, and
43.98%, respectively.
2021 dataset compared to other methods. To investigate the reason
for this degradation, we examine the embedding space. Figure 3
visualizes a partial excerpt of the embedding space of the pump machine
using UMAP [22]. In the highlighted frame within the figure, we can
observe that a single training sample is embedded apart from the other
training samples, while many anomalous samples are located nearby.
Due to the low density of training samples in that area, the rescaling
method inadvertently reduces the anomaly scores of anomalous sounds,
leading to performance degradation.
5. CONCLUSION
This paper introduced an anomaly score rescaling method based on
variance minimization for domain-generalized ASD. Our proposed
method introduced weighting parameters into the local data density
based rescaling process and analytically optimized them based on
score variance minimization. Experimental results demonstrated
that (1) the proposed method significantly improves performance
over existing anomaly score calculation methods; (2) it is more stable with
respect to the random seed than the previous rescaling method [12]; and (3)
the weighting parameters derived through our variance minimization
scheme adaptively rescale anomaly scores for each machine type,
leading to high performance.
6. ACKNOWLEDGMENT
This work is partly supported by JST AIP Acceleration Research
JPMJCR25U5.
REFERENCES
[1] Y. Koizumi, Y. Kawaguchi, K. Imoto, et al., “Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring,” in Proc. DCASE, 2020, pp. 81–85.
[2] Y. Kawaguchi, K. Imoto, Y. Koizumi, et al., “Description and discussion on DCASE 2021 challenge task 2: Unsupervised anomalous detection for machine condition monitoring under domain shifted conditions,” in Proc. DCASE, 2021, pp. 186–190.
[3] K. Dohi, K. Imoto, N. Harada, et al., “Description and discussion on DCASE 2022 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring applying domain generalization techniques,” in Proc. DCASE, 2022, pp. 1–5.
[4] K. Dohi, K. Imoto, N. Harada, et al., “Description and discussion on DCASE 2023 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” in Proc. DCASE, 2023, pp. 31–35.
[5] T. Nishida, N. Harada, D. Niizumi, et al., “Description and discussion on DCASE 2024 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” in Proc. DCASE, 2024, pp. 111–115.
[6] K. Wilkinghoff, T. Fujimura, K. Imoto, J. Le Roux, Z.-H. Tan, and T. Toda, “Handling domain shifts for anomalous sound detection: A review of DCASE-related work,” arXiv preprint arXiv:2503.10435, 2025.
[7] K. Wilkinghoff, “Design choices for learning embeddings from auxiliary tasks for domain generalization in anomalous sound detection,” in Proc. ICASSP, 2023, pp. 1–5.
[8] K. Wilkinghoff, “Self-supervised learning for anomalous sound detection,” in Proc. ICASSP, 2024, pp. 276–280.
[9] T. Fujimura, I. Kuroyanagi, and T. Toda, “Improvements of discriminative feature space training for anomalous sound detection in unlabeled conditions,” in Proc. ICASSP, 2025, pp. 1–5.
[10] X. Zheng, A. Jiang, B. Han, et al., “Improving anomalous sound detection via low-rank adaptation fine-tuning of pre-trained audio models,” in Proc. SLT, 2024, pp. 969–974.
[11] A. Jiang, B. Han, Z. Lv, et al., “Anopatch: Towards better consistency in machine anomalous sound detection,” in Proc. Interspeech, 2024, pp. 107–111.
[12] K. Wilkinghoff, H. Yang, J. Ebbers, F. G. Germain, G. Wichern, and J. Le Roux, “Keeping the balance: Anomaly score calculation for domain generalization,” in Proc. ICASSP, 2025, pp. 1–5.
[13] A. Jiang, X. Zheng, B. Han, et al., “Adaptive prototype learning for anomalous sound detection with partially known attributes,” in Proc. ICASSP, 2025, pp. 1–5.
[14] Z. Lv, A. Jiang, B. Han, et al., “AITHU system for first-shot unsupervised anomalous sound detection,” DCASE2024 Challenge, Tech. Rep., 2024.
[15] A. Jiang, X. Zheng, Y. Qiu, et al., “Thuee system for first-shot unsupervised anomalous sound detection,” DCASE2024 Challenge, Tech. Rep., 2024.
[16] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
[18] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[19] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2019.
[20] K. Wilkinghoff, “Sub-cluster AdaCos: Learning representations for anomalous sound detection,” in 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8.
[21] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: Beyond empirical risk minimization,” in Proc. ICLR, 2018.
[22] L. McInnes, J. Healy, N. Saul, and L. Grossberger, “UMAP: Uniform Manifold Approximation and Projection,” The Journal of Open Source Software, vol. 3, no. 29, 2018, 63 pages.
Region-Specific Audio Tagging for Spatial Sound
Jinzheng Zhao1, Yong Xu2, Haohe Liu1, Davide Berghi1, Xinyuan Qian3,
Qiuqiang Kong4, Junqi Zhao1, Mark D. Plumbley1, Wenwu Wang1
1Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
2Tencent AI Lab, Bellevue, WA, USA
3Department of Computer Science and Technology, University of Science and Technology Beijing, China
4The Chinese University of Hong Kong (CUHK)
Abstract: Audio tagging aims to label sound events appearing in an
audio recording. In this paper, we propose region-specific audio tagging,
a new task which labels sound events in a given region for spatial audio
recorded by a microphone array. The region can be specified as an
angular space or a distance from the microphone. We first study the
performance of different combinations of spectral, spatial, and position
features. Then we extend state-of-the-art audio tagging systems such
as pre-trained audio neural networks (PANNs) and audio spectrogram
transformer (AST) to the proposed region-specific audio tagging task.
Experimental results on both the simulated and the real datasets show
the feasibility of the proposed task and the effectiveness of the proposed
method. Further experiments show that incorporating the directional
features is beneficial for omnidirectional tagging.
1. INTRODUCTION
The task of audio tagging aims to identify the sound events present in
a sound clip. This topic has been a popular area in audio processing,
playing an important role in applications such as audio classification
[1] and information retrieval [2]. In the current task setting,
all sound events are generally labeled regardless of their location.
In surveillance applications, however, sound events located in
a specific spatial region may be more important than others. Therefore,
it is of practical interest to study the problem of region-specific audio
tagging, i.e., labelling audio events that are present in a specific spatial
region. Tagging sound events according to their location can help
separation and help people focus on sound from a specific region.
In addition, region-specific audio tagging allows people to attach
importance to particular sounds, e.g., a warning from behind, thereby improving
safety.
In the traditional audio tagging task, deep learning based methods
are popular choices. Among convolutional neural network (CNN) based
systems, pretrained audio neural networks (PANNs) [3] are models
pretrained on AudioSet [4] that show promising performance on
audio pattern recognition tasks such as audio tagging and acoustic
scene classification. The system of pretraining, sampling, labeling,
and aggregation (PSLA) [5] employs EfficientNet as the backbone and
improves the model performance by ImageNet pretraining, balanced
sampling and model aggregation. Among transformer-based [6] systems,
the audio spectrogram transformer (AST) [7] follows the vision
transformer [8] and takes mel-spectrogram patches as input. The
AST model can outperform PANNs and PSLA when trained with a
large amount of data. The above methods only use spectral features. In
[9], both spectral and spatial audio features are used with a gated CNN
architecture. The incorporation of spatial audio features further improves the
This research was supported by Tencent AI Lab Rhino-Bird Gift Fund
and University of Surrey. This work was also supported by the Engineering
and Physical Sciences Research Council [grant numbers EP/T019751/1,
EP/Y028805/1]. For the purpose of open access, the authors have applied
a creative commons attribution (CC BY) licence to any author
accepted manuscript version arising. The codes and dataset are available
at https://github.com/KawhiZhao/AudioTagging.
[Figure 1: two panels, each showing a 4-mic array surrounded by the sound events Walk, Clapping, Telephone, Door Open, Music, and Water.]
Fig. 1: The task scenario of region-specific audio tagging. Left: Query by
horizontal angular regions. Right: Query by distance.
model performance. In addition to the CNN and transformer-based
methods, there are also methods based on graph neural networks (GNNs),
such as the work proposed in [10], which leverages several GNNs to
model the relationships between different feature patches and different
labels for tagging. In our work, we extend these state-of-the-art models
to the region-specific audio tagging problem.
Our proposed task is inspired by region-specific speech processing
[11]–[13], including automatic speech recognition (ASR) and speech
separation for a given direction or distance. Beamforming aims to
locate the signal from a given direction by attenuating unwanted sound
sources. Traditional methods like the minimum variance distortionless
response (MVDR) [14] beamformer can minimize the total signal
power and maintain a distortionless response for a given direction.
Recently, deep learning based methods have also been used for region-specific
automatic speech recognition and separation. For region-specific ASR,
in [11], a long short-term memory (LSTM)-based architecture is used for
multi-channel overlapped ASR, with the concatenation of
spectral, spatial and angle features as input. Directional features are proposed
as the angle features, encoding the information of regions of interest.
For region-specific speech separation, in [12], directional features are
used as a condition to indicate the speaker for multi-modal target
speech separation. In [13], a multi-channel band-split recurrent neural
network (RNN) model is proposed for angular-query, spherical-query
and conical-query based speech separation. In addition, the field of
view (FOV) feature is proposed in [15] for audio zooming. The FOV
feature can represent the property of an angular space, while directional
features can represent the property of an azimuth.
2. PROPOSED METHOD
Inspired by the previous work in region-specific speech processing,
we study a new task, i.e., region-specific audio tagging, and explore the
use of directional features and FOV features for this task. Following
previous work [13], we explore this task in two paradigms, query
by horizontal angular regions and query by distance, as shown in
Fig. 1. We examine three types of features, i.e., spectral, spatial, and
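[Figure 2: block diagram. Spectral, spatial, and positional features are combined (Concat) and fed to PANNs, whose Conv2D blocks run from (k, 64) to (1024, 2048); Max and Mean pooling outputs are combined (Addition) and passed to an FC layer that predicts the sound events.]
Fig. 2: The proposed multi-channel PANNs for region-specific audio tagging.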
positional features. The spectral features are used to identify sound
sources. Spatial features give information about all potential sound
events, and positional features are used to describe a potential
region-specific sound event.
Our contributions can be summarized as follows. Firstly, we propose,
for the first time, the region-specific audio tagging task and build a
benchmark. Secondly, we extend the state-of-the-art audio tagging
models to this task and study the performance achieved by different
combinations of spectral, spatial and positional features. Finally, we
extend the proposed method to omnidirectional audio tagging, which
aims to tag the events in the whole space instead of a region.
Experimental results show that the proposed fixed-region and
location-aware systems are superior to the omnidirectional tagging systems.
2.1. Problem Definition
Given a four-channel audio signal in tetrahedral microphone array (MIC)
format $x \in \mathbb{R}^{4 \times L}$ (where $L$ denotes the audio length) and a given area
$p$, in the form of an angular range $[\theta_{\mathrm{begin}}, ..., \theta_{\mathrm{end}}]$ or a distance $d$ from the
microphone array to the sound events, the objective of region-specific
audio tagging is to identify the sound sources located in $p$. It is worth
noting that the proposed new task can be achieved by a beamformer
followed by an audio tagging model. However, this cascaded format is not
end-to-end, and the audio tagging model can only learn the semantic
representation.
2.2. Overall System Architecture
To this end, we present a system, as shown in Fig. 2. The system
is composed of two modules, feature extraction and audio
tagging. The features indicate the region of interest while the tagging
model predicts the sound events. Compared to previous tagging
systems, the proposed system predicts the sound events in the given
region specified by a user, instead of all events present in the microphone
recordings. Compared to the cascaded model combining a beamformer
and an audio tagging model, the proposed model can learn both
semantic and spatial information simultaneously.
2.3. Extraction of Features
Following previous work on region-specific speech separation [12],
[13], we also use the combination of spectral, spatial and positional
features.
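2.3.1. Spectral Features: Spectral features can help classify the
audio events. Here we use the logarithm power spectrum (LPS) $\mathbf{L}$ of the
first-channel audio $x$. $\mathbf{L}$ is calculated based on the short-time Fourier
transform (STFT) $\mathbf{X}$ of $x$, as follows,
$$ \mathbf{L} = \log(|\mathbf{X}|^2) \tag{1} $$
where $\mathbf{X} \in \mathbb{C}^{T \times F}$ and $\mathbf{L} \in \mathbb{R}^{T \times F}$, and $T$ and $F$ are the number of
time frames and frequency bins, respectively.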
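A minimal NumPy sketch of Eq. (1) is given below. The window length of 512 and hop size of 256 follow the implementation details reported later in the paper; the Hann window, the absence of padding, and the small constant `eps` added before the logarithm are assumptions made here for numerical convenience, and the function names are illustrative.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT of a 1-D signal; returns a complex (T, F) array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack(
        [x[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    # One-sided FFT along the sample axis gives F = n_fft // 2 + 1 bins.
    return np.fft.rfft(frames, axis=1)


def log_power_spectrum(X, eps=1e-10):
    """Eq. (1): L = log(|X|^2); eps (an assumption) avoids log(0)."""
    return np.log(np.abs(X) ** 2 + eps)
```

For a 2-second clip sampled at 24 kHz (48,000 samples), this framing yields 186 frames of 257 frequency bins.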
2.3.2. Spatial Features: Spatial features are useful for localizing
sound events [16]. Here, we explore two features, the generalized
cross-correlation phase transform (GCCPHAT) $\mathbf{G}$ and the inter-channel phase
difference (IPD) $\mathbf{I}$.
GCCPHAT is widely used in sound source localization [17] and
speaker tracking [18]. It estimates the time difference of arrival
between a pair of microphones, as follows,
$$ G_n(\tau) = \int_{-\infty}^{+\infty} \frac{X_{n_1}(f)\, X_{n_2}^{*}(f)}{\left| X_{n_1}(f)\, X_{n_2}^{*}(f) \right|}\, e^{i 2 \pi f \tau}\, df \tag{2} $$
where $n = (n_1, n_2)$ indexes the microphone pair, $*$ denotes the complex
conjugate, $f$ denotes the frequency, and $\tau$ denotes the time delay. It is
normalized to mitigate the impact of changing amplitudes. We stack
GCCPHAT from selected pairs as the spatial features.
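A discrete-time counterpart of Eq. (2) can be sketched with FFTs as follows. This is an illustrative implementation, not the authors' code: the function name `gcc_phat`, the zero-padding length, the `max_lag` argument, and the stabilising constant `eps` are assumptions.

```python
import numpy as np


def gcc_phat(x1, x2, max_lag, eps=1e-10):
    """GCC-PHAT between two 1-D signals for lags -max_lag..+max_lag (samples).

    A positive peak lag means x1 is delayed relative to x2.
    """
    n = len(x1) + len(x2)  # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    # PHAT weighting: keep only the phase of the cross-spectrum.
    cross /= np.abs(cross) + eps
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that the output runs from -max_lag to +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[: max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc
```

The lag at which `cc` peaks is an estimate of the time difference of arrival in samples; dividing by the sampling rate converts it to seconds.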
IPD is calculated as the phase difference between two audio
channels,
$$ \mathbf{I}_n = \mathrm{angle}(\mathbf{X}_{n_1}) - \mathrm{angle}(\mathbf{X}_{n_2}) \tag{3} $$
where $\mathrm{angle}(\cdot)$ denotes the phase of a signal. IPD has been shown to be
effective in multi-channel speech separation [12]. IPDs from selected
microphone pairs are stacked.
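2.3.3. Positional Features: In this section, we discuss the extraction
of positional features from the angular region and from a distance,
respectively.
Angular Region: For the horizontal angular areas from $-180^\circ$ to
$180^\circ$, we detect the sound events within the region $\Theta = [\theta_{\mathrm{begin}}, ..., \theta_{\mathrm{end}}]$
by attenuating events from the unselected areas. We choose the FOV
feature proposed in [15] for this task.
To calculate the FOV feature, we first calculate directional features
(DF) $\mathbf{D} \in \mathbb{R}^{T \times F}$ for each angle in $\Theta$ following [19], as follows,
$$ \mathbf{D}(\theta) = \sum_{n} \cos(\mathbf{I}_n - \mathbf{P}_n(\theta)) \tag{4} $$
$$ \mathbf{P}_n(\theta) = 2 \pi f \phi_n \cos(\theta) f_s / c \tag{5} $$
where $\mathbf{P}_n$ is the target-dependent phase difference calculated at the
$n$-th microphone pair with frequency $f$, $\phi_n$ is the distance between
the $n$-th microphone pair, $f_s$ is the sampling rate, and $c$ is the sound
velocity. Then, the FOV feature in the view of $\Theta$ is calculated as:
$$ \mathbf{F}_{\mathrm{in}} = \max(\mathbf{D}(\theta)), \quad \theta \in \Theta \tag{6} $$
The feature out of the view $\Theta$ is calculated as:
$$ \mathbf{F}_{\mathrm{out}} = \max(\mathbf{D}(\theta)), \quad \theta \notin \Theta \tag{7} $$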
[Figure 3: left, an embedding layer; right, three fully connected (FC) layers, each followed by tanh.]
Fig. 3: The learning-based method for obtaining angle features (left) and
distance features (right), where $h$ is the hidden dimension and FC denotes the
fully connected layer.
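Finally, the FOV feature $\mathbf{F} \in \mathbb{R}^{T \times F}$ is defined as:
$$ \mathbf{F} = \begin{cases} \mathbf{F}_{\mathrm{in}}, & \text{if } \mathbf{F}_{\mathrm{in}} > \mathbf{F}_{\mathrm{out}}, \\ -1, & \text{if } \mathbf{F}_{\mathrm{in}} \le \mathbf{F}_{\mathrm{out}}. \end{cases} \tag{8} $$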
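The DF and FOV computations of Eqs. (4) to (8) can be sketched in NumPy as below. This is a hedged illustration rather than the authors' implementation: the function names, the array layout `(pairs, T, F)`, the sound speed of 343 m/s, the 5° angular grid over the full circle, and passing the frequency axis directly in Hz (instead of the bin and sampling-rate terms of Eq. (5)) are all assumptions. The region is assumed to be a proper subset of the circle so that both the inside and outside sets are non-empty.

```python
import numpy as np


def directional_feature(ipd, theta_deg, mic_dist, freqs_hz, c=343.0):
    """Eq. (4): sum over pairs of cos(IPD - target phase difference).

    ipd: (N, T, F) IPDs for N microphone pairs.
    mic_dist: (N,) pair distances in metres; freqs_hz: (F,) bin frequencies.
    """
    theta = np.deg2rad(theta_deg)
    # Target-dependent phase difference per pair and frequency, shape (N, 1, F).
    tpd = (2.0 * np.pi * freqs_hz[None, None, :] * mic_dist[:, None, None]
           * np.cos(theta) / c)
    return np.cos(ipd - tpd).sum(axis=0)


def fov_feature(ipd, region_deg, mic_dist, freqs_hz, step_deg=5.0):
    """Eqs. (6)-(8): keep F_in where it exceeds F_out, otherwise -1."""
    lo, hi = region_deg
    angles = np.arange(-180.0, 180.0, step_deg)
    dfs = np.stack(
        [directional_feature(ipd, a, mic_dist, freqs_hz) for a in angles]
    )
    inside = (angles >= lo) & (angles <= hi)
    f_in = dfs[inside].max(axis=0)    # best match inside the region
    f_out = dfs[~inside].max(axis=0)  # best match outside the region
    return np.where(f_in > f_out, f_in, -1.0)
```

The element-wise comparison means each time-frequency bin is kept only when some direction inside the queried region explains the observed IPDs better than every direction outside it.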
Distance: To obtain a distance feature given a distance $d$ from
the microphone array to the sound events, we use learning-based
methods following [13], as shown in the right part of Fig. 3. The
distance feature is repeated to match the time dimension of the spectral
or spatial features.
2.4. Model
We extended PANNs to region-specific multi-channel audio tagging,
as shown in Fig. 2. The spectral, spatial and positional features are
concatenated on the channel dimension, with $k$ being the number of
concatenated channels. PANNs takes the concatenated features and
employs a sequence of convolutional layers to downsize the
time-frequency dimension while increasing the channel dimension from $k$
to 2048. Then the time-frequency dimension is pooled and the channel
dimension is retained. Finally, the system outputs the probabilities
of the sound event classes. The binary cross-entropy loss is used to
optimize the model.
We also extended two other models, PSLA [5] and AST [7],
to the proposed task. For PSLA, the EfficientNet backbone is trained
from scratch with a hidden dimension of 2560 in the multi-head attention
module. For AST, we use the small224¹ version.
3. EXPERIMENT SETUP
3.1. Dataset
We use the STARSS23 dataset [20] for the proposed task; this
dataset is also used in DCASE 2024 Task 3. However, the scenario
presented in this dataset is relatively simple for region-specific audio
tagging, since the average number of overlapping sound events is
1.32. Thus, we use SpatialScaper [21] to simulate a new dataset,
named Spatial Region-Specific Audio Tagging (SRSAT). The room is
randomly chosen from the given room set². The mean and standard
deviation of the number of sound events are set to 25 and 3. Each
audio clip is in the format of a tetrahedral microphone array, 60
seconds long, sampled at 24 kHz. We generate 1,000 audio clips for
training, 60 audio clips for validation, and 60 audio clips for testing.
There are 13 sound classes (female speech, male speech,
clapping, telephone, laughter, domestic sounds, walk, door, music,
musical instrument, water tap, bell and knock) from FSD50K [22]
in both the STARSS23 and SRSAT datasets. In each time step, the
annotation contains the classes and positions (azimuth, elevation, and
distance) of the sound events.
3.2. Implementation Details and Evaluation Metrics
A 2-second audio clip is randomly extracted from the 60-second
audio signals. We ensure that the extracted audio clip has at least one
sound event. During training, we perform audio channel swapping
data augmentation using the method proposed in [23] to enlarge the
¹ https://github.com/YuanGongND/ast/blob/master/src/models/ast_models.py
² https://github.com/iranroman/SpatialScaper
Table 1: Experimental results of different features. Bold numbers indicate
the best performance.

| Spectral | Spatial | Angle | mAP (↑) | EER (↓) |
|---|---|---|---|---|
| LPS | IPD | DF | 0.473 | 0.253 |
| LPS | IPD | FOV | **0.485** | 0.260 |
| LPS | IPD | learned | 0.416 | 0.260 |
| LPS | GCCPHAT | DF | 0.479 | **0.246** |
| SALSA | SALSA | DF | 0.455 | 0.260 |
Table 2: Experimental results of different models.

| Model | # Params | mAP (↑) | EER (↓) |
|---|---|---|---|
| PSLA [5] | 64.1M | 0.384 | 0.283 |
| AST [7] | 22.8M | 0.360 | 0.308 |
| PANNs [3] | 79.7M | 0.473 | 0.253 |
dataset by a factor of eight. The STFT is calculated with a window of
512 samples and a hop size of 256 samples. Both IPD and directional
features are calculated with the microphone pairs $(0,0)$, $(0,1)$, $(0,2)$
and $(0,3)$ [11]. We use the CNN-14 variant of PANNs as the model
backbone. Each model is trained using the Adam optimizer with a
learning rate of $10^{-5}$. The model is trained for 50 epochs with early
stopping at a patience of 10 epochs. Apart from the features
mentioned above, we explore the learning-based angular features
shown in the left part of Fig. 3. A learnable embedding layer, jointly
trained with PANNs, is used to extract features $E$ for different input angles.
In addition, for spectral and spatial features, we explore the
spatial cue-augmented log-spectrogram features (SALSA) proposed
in [24]. We use an angular region of $60^\circ$ as the region of interest.
For DF and learning-based features, we use the middle angle of the
region to extract the features. For the FOV feature, we calculate each
DF at a resolution of $5^\circ$. Following the previous work [7], [9], we
use both mean average precision (mAP) and equal error rate (EER)
as the metrics to evaluate the model performance.
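For readers unfamiliar with the two metrics, the following NumPy sketch shows one common way to compute them for multi-label predictions. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, average precision is computed without interpolation, and the EER is approximated at the observed score threshold where the false positive and false negative rates are closest.

```python
import numpy as np


def average_precision(y_true, y_score):
    """Non-interpolated average precision for one class."""
    order = np.argsort(-np.asarray(y_score), kind="stable")
    y = np.asarray(y_true)[order]
    hits = np.cumsum(y)
    precision = hits / np.arange(1, len(y) + 1)
    # Average the precision values at the ranks of the positive items.
    return float((precision * y).sum() / max(y.sum(), 1))


def mean_average_precision(Y_true, Y_score):
    """mAP: mean of per-class AP over the columns of (samples, classes) arrays."""
    Y_true, Y_score = np.asarray(Y_true), np.asarray(Y_score)
    return float(np.mean([average_precision(Y_true[:, c], Y_score[:, c])
                          for c in range(Y_true.shape[1])]))


def equal_error_rate(y_true, y_score):
    """Approximate EER: point where false positive and false negative rates meet."""
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(y_score)
    thresholds = np.unique(s)
    fpr = np.array([np.mean(s[~y] >= t) for t in thresholds])
    fnr = np.array([np.mean(s[y] < t) for t in thresholds])
    i = np.argmin(np.abs(fpr - fnr))
    return float((fpr[i] + fnr[i]) / 2.0)
```

Higher mAP and lower EER both indicate better tagging, which matches the arrows used in the tables.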
4. EXPERIMENTAL RESULTS AND DISCUSSIONS
4.1. Impact of Features
We show the experimental results on SRSAT using different feature
combinations for 60-degree angular region-specific audio tagging in
Table 1. The results show that using LPS and IPD with
FOV features achieves the best mAP. One possible
reason is that FOV features can focus more on the desired area than
other angle features, as DF and learning-based features can only attend to a
single azimuth. However, the FOV feature is more computationally
expensive than DF. With a resolution of $5^\circ$, the FOV feature requires
72 times the computational cost of DF. Thus, we used DF for the
remaining experiments, given that its performance is competitive with the
FOV feature (mAP of 0.473 versus 0.485). Among the spatial features, GCCPHAT
performs competitively with IPD.
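The factor of 72 is consistent with a simple count, assuming that the FOV feature evaluates one DF per grid angle over the full horizontal circle and that each evaluation costs about the same as the single DF computed at the middle angle of the region:

```latex
\[
  \frac{\text{cost}(\mathrm{FOV})}{\text{cost}(\mathrm{DF})}
  \approx \frac{360^\circ}{5^\circ} = 72 .
\]
```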
4.2. Impact of Models
The experimental results of different models are shown in
Table 2, which shows that PANNs achieves the best performance. AST
and PSLA do not perform well, giving lower mAP and higher EER
than PANNs, which shows that PANNs can be adapted from traditional
audio tagging to region-specific audio tagging effectively. A potential
reason for AST not performing well is that transformer-based
methods generally require a large amount of data for good results [8].
4.3. Comparison with Omnidirectional Tagging Systems
We compare the region-specific tagging systems with the systems for
omnidirectional audio tagging, i.e., tagging sound events in the whole
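[Figure 4: Telephone, Music, and Water events placed along an angle axis from -180 to 180. The omnidirectional system covers a single 360 region, the fixed-region system six 60 regions, and the location-aware system 60 regions placed at the events.]
Fig. 4: Illustration of the different tagging systems. The numbers are in degrees.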
Table 3: Experimental results of omnidirectional (OD), fixed-region (FR) and
location-aware (LA) audio tagging systems.

| Spectral | Spatial | Angle | System | mAP (↑) | EER (↓) |
|---|---|---|---|---|---|
| LPS | IPD | - | OD | 0.371 | 0.267 |
| LPS | IPD | FOV | OD | 0.370 | 0.274 |
| LPS | GCCPHAT | - | OD | 0.376 | 0.263 |
| SALSA | SALSA | - | OD | 0.373 | 0.265 |
| LPS | IPD | DF | FR | 0.653 | 0.252 |
| LPS | IPD | DF | LA | 0.667 | 0.244 |
space instead of a specific region. We explore three systems, as shown
in Fig. 4. The omnidirectional audio tagging system does not use the
angle information. Both the location-aware system and the fixed-region
system are based on a trained region-specific system. The
location-aware system has prior knowledge of the locations of all sound
sources. It is operated on regions of $60^\circ$ centered at the
azimuth of each sound event. We filter out overlapped regions, which
are defined as two regions overlapping by $30^\circ$, to reduce unnecessary
computational cost. The fixed-region tagging system is operated on
six fixed regions from $-180^\circ$ to $180^\circ$ at an interval of $60^\circ$. For the
location-aware and fixed-region systems, the system outputs and the
ground truth from different regions are aggregated by taking the
maximum before calculating the metrics.
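The fixed-region partition and the maximum aggregation described above can be sketched as follows. The helper names `fixed_regions` and `aggregate_by_max`, and the `(regions, classes)` array layout, are assumptions for illustration only.

```python
import numpy as np


def fixed_regions(width_deg=60):
    """Split [-180, 180) into consecutive regions of `width_deg` degrees."""
    return [(start, start + width_deg) for start in range(-180, 180, width_deg)]


def aggregate_by_max(per_region):
    """Combine per-region class scores (or labels) by an element-wise maximum.

    per_region: array-like of shape (regions, classes); returns (classes,).
    """
    return np.max(np.asarray(per_region), axis=0)
```

Applying the same maximum to both the predictions and the ground truth turns the per-region outputs into one clip-level vector, so the region-specific systems can be scored on the same footing as the omnidirectional one.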
The experimental results are shown in Table 3. Similar to the
previous experiments, we compare the model performance achieved
with different feature combinations for omnidirectional audio tagging.
All omnidirectional systems have an mAP between 0.37 and 0.38. It can
be observed that adding the angle feature does not improve the model
performance, as indicated by the omnidirectional system that uses the FOV
feature (mAP of 0.370 versus 0.371 without it). The
fixed-region system outperforms the omnidirectional tagging one,
which shows that the fixed-region system benefits from the region
division. If the positions of the sound events are known, the model
performance can be further improved, with mAP rising from 0.653 to 0.667
and EER falling from 0.252 to 0.244.
4.4. Impact of Angular Range
We explore the influence of different angular ranges and show the
results in Table 4. We can see that the proposed task becomes harder
as the angular range becomes larger, since more sound events
fall into the selected area. When
the angular range is $300^\circ$, mAP is 0.370, which is close to the model
Table 4: Impact of angular range.

| Spectral | Spatial | Angle | Range | mAP (↑) | EER (↓) |
|---|---|---|---|---|---|
| LPS | IPD | DF | 60° | 0.473 | 0.253 |
| LPS | IPD | DF | 180° | 0.406 | 0.266 |
| LPS | IPD | DF | 300° | 0.370 | 0.275 |
Table 5: Dataset statistics and performance comparisons between the STARSS23
and SRSAT datasets.

| | STARSS23 | SRSAT |
|---|---|---|
| Avg. No. Events | 1.320 | 2.233 |
| Max. No. Events | 6 | 10 |
| Prop. One Event | 0.623 | 0.287 |
| Prop. Two Events | 0.278 | 0.294 |
| mAP of PANNs (↑) | 0.938 | 0.473 |
Table 6: Model performance for a given distance.

| Model | # Params | mAP (↑) | EER (↓) |
|---|---|---|---|
| PANNs [3] | 79.8M | 0.417 | 0.266 |
| AST [7] | 23.0M | 0.324 | 0.320 |
| PSLA [5] | 64.2M | 0.399 | 0.274 |
performance in the omnidirectional tagging scenario shown in Table 3.
4.5. Model Performance on STARSS23 Dataset
We further evaluate PANNs on the STARSS23 dataset and compare
the results with the model performance on SRSAT. The results are
shown in Table 5. STARSS23 was originally used for sound source
localization and detection. We can see that STARSS23 is much simpler
than SRSAT for the proposed task, as indicated by the lower average
number of events (Avg. No. Events) and maximum number of events
(Max. No. Events) per frame. We also show the proportions of frames
containing one event (Prop. One Event) or two events (Prop. Two
Events). In STARSS23, about 90% of the time frames contain only one
or two events (0.623 + 0.278), which motivates the
creation of the SRSAT dataset.
4.6. Model Performance for a Given Distance
In this part we explore the model performance for distance-guided
audio tagging. We randomly select a sound event and use the ground
truth distance as the condition. The performance of the model is
reported in Table 6. For each model, we use LPS as the spectral feature
and IPD as the spatial feature following the previous settings. The
distance feature extraction network is jointly trained with the model.
The PANNs model achieves the best performance while AST and
PSLA do not perform well. Compared with the model performance
on azimuth-queried tagging reported in Table 2, distance-queried
tagging scenarios are more challenging. One possible reason is that
the statistic-based DF reflects the confidence of the sound source
coming from a given azimuth, which can capture the characteristics
of the selected regions better than the learning-based distance feature.
5. CONCLUSION
We have presented a novel task named region-specific audio tagging
for spatial sound. We extended current advanced tagging models
to this task using angular region-queried and distance-queried modes.
We further explored the impact of different combinations of spectral,
spatial and positional features. We find that using LPS and IPD
combined with FOV features achieves the best mAP. When extending
the region-specific tagging system to omnidirectional tagging, we find
that the proposed fixed-regional and location-aware tagging system
outperforms the omnidirectional tagging system. In the current setting,
we only include 13 sound classes following the setting of DCASE Task
3. In the future, we plan to include more sound classes from AudioSet
to enhance the diversity of the dataset.
Detection and Classification of Acoustic Scenes and Events 2025 30–31 October 2025, Barcelona, Spain
ASDKit: A Toolkit for Comprehensive Evaluation of
Anomalous Sound Detection Methods
Takuya Fujimura1, Kevin Wilkinghoff2,3, Keisuke Imoto4, Tomoki Toda1
1Nagoya University, Japan, 2Aalborg University, Denmark, 3Pioneer Centre for AI, Denmark, 4Kyoto University, Japan
Abstract—In this paper, we introduce ASDKit, a toolkit for the anomalous
sound detection (ASD) task. Our aim is to facilitate ASD research
by providing an open-source framework that collects and carefully
evaluates various ASD methods. First, ASDKit provides training and
evaluation scripts for a wide range of ASD methods, all handled within
a unified framework. For instance, it includes the autoencoder-based
official DCASE baseline, representative discriminative methods, and
self-supervised learning-based methods. Second, it supports comprehensive
evaluation on the DCASE 2020–2024 datasets, enabling careful assessment
of ASD performance, which is highly sensitive to factors such as datasets
and random seeds. In our experiments, we re-evaluate various ASD
methods using ASDKit and identify consistently effective techniques across
multiple datasets and trials. We also demonstrate that ASDKit reproduces
the state-of-the-art-level performance on the considered datasets.
Index Terms—anomalous sound detection, open-source toolkit
1. INTRODUCTION
Anomalous sound detection (ASD) is a technique for machine
condition monitoring, where ASD systems aim to detect mechanical
failures from their operational machine sounds [1]–[5]. In this task,
since it is infeasible to exhaustively collect rare and diverse anomalous
sounds, the system is developed with only normal machine sounds.
Therefore, ASD systems calculate anomaly scores based on the
deviation from the normal sound distribution and assign anomalies to
sounds with high anomaly scores.
The ASD field has advanced through the development of various
methods. One major approach is based on generative modeling [1],
[6], [7], which directly models the distribution of audio features
belonging to normal machine sounds and computes anomaly scores
based on the negative likelihood of a given observation. For example,
an autoencoder (AE)-based method [1] trains an AE network to
reconstruct audio features and computes anomaly scores based on the
reconstruction error of a given observation. Other variants of AEs only
reconstruct the center frame of consecutive spectrogram frames [6] or
mask the input to obtain an autoregressive model [8]. Another recent
approach first projects audio signals into a lower-dimensional feature
space and then computes anomaly scores based on the distance of an
observation from the normal training data samples in that space [9]–
[12]. For example, state-of-the-art (SOTA) discriminative methods [9],
[12]–[16] train a discriminative feature extractor to classify
meta-information labels associated with normal training data, and then
compute anomaly scores in the discriminative feature space. Possible
choices for meta information include the machine type, machine IDs
and specific operating conditions of machines. Alternatively,
self-supervised learning (SSL) feature spaces of models trained on (large)
external datasets are directly utilized for ASD [11], [17].
Although the proposal of various ASD methods has undoubtedly
advanced the field, we argue that further development is hindered by
the lack of comprehensive and careful evaluation. A representative
effort for evaluating ASD methods is the DCASE challenge [1]–[5],
which plays an important role in enabling fair evaluations using
unrevealed test data. However, challenge evaluations are conducted on
a single dataset and a single trial, which is insufficient for thorough
assessment, as we have observed that ASD performance is highly
sensitive to datasets and random seeds in our preliminary experiments.
Furthermore, it is difficult to precisely identify the effectiveness of
individual components, as participants develop methods on their own
frameworks and often employ ensembles to boost performance.

[Figure 1: block diagram with four panels: 1. Train, 2. Extract, 3. Score, 4. Evaluate]
Fig. 1: Unified framework of ASDKit. (1) Training the frontend with the
training dataset, (2) extracting and storing outputs from both the training
and test datasets using the frontend, (3) computing anomaly scores for the
test data using the backend trained on the training data, and (4) computing
evaluation scores based on the anomaly scores and the ground-truth normal
and anomalous labels.
To address these limitations, we introduce ASDKit¹, an open-source
toolkit for the ASD task. The main features of ASDKit are
summarized as follows: (1) ASDKit collects various ASD methods into
a single open-source repository, enabling easy access and comparison.
(2) ASDKit provides recipes that complete the entire training and
evaluation process shown in Fig. 1. (3) ASDKit supports comprehensive
evaluation on the DCASE 2020–2024 datasets [1]–[5] with
multiple trials, enabling careful assessment. To demonstrate ASDKit’s
capabilities, we re-evaluate various ASD methods across multiple
datasets and trials. By analyzing the results, we identify consistently
effective techniques and provide new insights. Furthermore, we
demonstrate that ASDKit can reproduce the SOTA-level performance
on the considered datasets.
2. ASDKIT
2.1. Supported experimental conditions
ASDKit supports the DCASE 2020–2024 conditions [1]–[5], which
are summarized in Table 1. Across these conditions, the machine
types, the provided meta-information, and the evaluation scores differ.
ASDKit provides download scripts and evaluation tools for these
datasets. The download scripts automatically download the datasets
and add ground-truth normal and anomalous labels to the test sets,
for which the labels are concealed during the challenge and released
by the organizers afterward. The evaluation tool computes the official
evaluation scores according to the respective DCASE year. For all
conditions, the datasets are divided into official dev and eval subsets,
and evaluation scores are aggregated separately for each subset. The
evaluation score is based on a combination of several types of area
under the receiver operating characteristic (ROC) curve (AUC).
¹https://github.com/TakuyaFujimura/dcase-asd-toolkit
Table 1: Supported experimental conditions in ASDKit. MT and Sec denote machine type and section, respectively. The “#ID/Sec” column shows the number
of machine IDs or sections within each machine type. The “Dev/Eval” column shows the information used to define the dev and eval subsets. The “Aggregation”
column shows how the evaluation results for each machine type and ID/Sec are aggregated to the machine-type level or the dev- and eval-subset level.
Amean() and Hmean() are arithmetic mean and harmonic mean operations, respectively. ∗: the attribute information is not available for some machine types.
Year      #MT  #ID/Sec  Domain          Meta Information             Dev/Eval  Evaluation score for each MT and ID/Sec  Aggregation
2020 [1]  6    6 or 7   Source          MT, ID                       ID        Amean(s_auc, s_pauc)                     Amean()
2021 [2]  7    6        Source, Target  MT, Sec, Attribute, Domain   Sec       Hmean(s_auc, s_pauc, t_auc, t_pauc)      Hmean()
2022 [3]  7    6        Source, Target  MT, Sec, Attribute, Domain   Sec       Hmean(smix_auc, tmix_auc, mix_pauc)      Hmean()
2023 [4]  14   1        Source, Target  MT, Sec, Attribute, Domain   MT        Hmean(smix_auc, tmix_auc, mix_pauc)      Hmean()
2024 [5]  16   1        Source, Target  MT, Sec, Attribute∗, Domain  MT        Hmean(smix_auc, tmix_auc, mix_pauc)      Hmean()
Table 2: ASD methods supported by ASDKit.
Approach        Recipe name
AE              ae
Discriminative  dis_spec_adacos_fixed, dis_beats_scac_trainable, etc.
Raw feature     raw_spec, raw_beats, raw_eat
In DCASE 2020, the machine ID, which identifies each individual
machine of the same machine type, is provided as meta information.
The evaluation score is calculated as the arithmetic mean of the
AUC and the partial AUC (pAUC) with p = 0.1. This score is computed
for each machine ID within each machine type and then aggregated
into dev- and eval-subset-level scores using the arithmetic mean. The
dev and eval subsets are defined based on the machine ID [1].
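To make the DCASE 2020 score concrete, the following is a minimal Python sketch, not ASDKit's evaluation code. It assumes the pairwise-ranking form of the AUC and takes the pAUC as the same quantity restricted to the floor(p·N) highest-scoring normal clips; the official evaluator may compute the pAUC differently (for example as a standardized partial AUC). The function names, the `max(1, ...)` guard, and the example scores are illustrative assumptions.

```python
import math


def auc(normal_scores, anomalous_scores):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous_scores) * len(normal_scores))


def pauc(normal_scores, anomalous_scores, p=0.1):
    """AUC restricted to the floor(p * N) highest-scoring normal clips.

    These clips are the first to raise false alarms, so the value reflects
    detection quality at low false-positive rates.
    """
    # Assumption: keep at least one normal clip so the score is defined.
    n_keep = max(1, math.floor(p * len(normal_scores)))
    hardest_normals = sorted(normal_scores, reverse=True)[:n_keep]
    return auc(hardest_normals, anomalous_scores)


def score_2020(normal_scores, anomalous_scores, p=0.1):
    """Arithmetic mean of AUC and pAUC for one machine ID."""
    return (auc(normal_scores, anomalous_scores)
            + pauc(normal_scores, anomalous_scores, p)) / 2.0


# Hypothetical anomaly scores for one machine ID.
normal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
anomalous = [0.95, 1.5]
print(auc(normal, anomalous), pauc(normal, anomalous), score_2020(normal, anomalous))
```

In this toy case a single high-scoring normal clip barely changes the AUC but halves the pAUC, which is why the two are averaged.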
DCASE 2021 newly introduces a domain shift problem, where
recording environments or machine operations differ between the
source and target domains [18]. While abundant training data is
available in the source domain, only a few samples are available in
the target domain. As meta information, attributes, which denote
the recording and machine operation conditions, and the domain
information are provided. The evaluation score is computed as the
harmonic mean of s_auc, s_pauc, t_auc, and t_pauc. s_auc and s_pauc
are the AUC and pAUC of the source domain, respectively,
and t_auc and t_pauc are the AUC and pAUC of the target domain,
respectively. Therefore, ASD systems are expected to perform well in
both domains. Also, in DCASE 2021, the machine ID is replaced with
the section. The section defines subsets within one machine type, and
it serves a similar role to the machine ID; however, in some machine
types, different machine IDs can appear in the same section, and the
same machine ID can appear in multiple sections [2]. The evaluation
scores for each section within each machine type are aggregated to
dev- and eval-subset-level scores using the harmonic mean.
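The effect of switching to the harmonic mean can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official evaluator: the function names and the nested-dictionary layout are hypothetical, and the sketch pools all sections directly into one harmonic mean. When every machine type has the same number of sections, pooling gives the same value as averaging per machine type first.

```python
from statistics import harmonic_mean


def score_2021(s_auc, s_pauc, t_auc, t_pauc):
    """Per-section score: harmonic mean over both domains' AUC and pAUC."""
    return harmonic_mean([s_auc, s_pauc, t_auc, t_pauc])


def aggregate(section_scores):
    """Pool per-section scores into one subset-level score.

    section_scores: {machine_type: {section: score}} (hypothetical layout).
    """
    pooled = [score
              for per_type in section_scores.values()
              for score in per_type.values()]
    return harmonic_mean(pooled)


# Hypothetical values: strong source domain, weak target domain.
section = score_2021(s_auc=0.9, s_pauc=0.8, t_auc=0.6, t_pauc=0.55)
print(section)  # lower than the arithmetic mean of the same four values
print(aggregate({"fan": {"00": 0.5, "01": 1.0}, "valve": {"00": 0.5, "01": 1.0}}))
```

Because the harmonic mean is pulled toward its smallest argument, a system cannot compensate for a weak target domain with a strong source domain.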
DCASE 2022 inherits the domain shift problem from the DCASE
2021 condition, but employs a different evaluation score [3]. The
score is calculated as the harmonic mean of smix_auc, tmix_auc,
and mix_pauc. smix_auc is the AUC calculated using normal and
anomalous samples from the source domain together with anomalous
samples from the target domain. Similarly, tmix_auc is the AUC
calculated using normal and anomalous samples from the target
domain together with anomalous samples from the source domain.
mix_pauc is the pAUC calculated using normal and anomalous
samples from both domains. These evaluation scores are calculated
by jointly using samples from both domains to assess whether the
anomaly score distributions are well aligned between domains.
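The three mixed-domain terms can be sketched as follows. This is a minimal Python illustration rather than the official evaluator: the `(score, is_anomalous, domain)` tuple layout and all function names are assumptions, and the pAUC is taken as the AUC restricted to the floor(p·N) highest-scoring normal clips, which may differ from the official definition.

```python
import math
from statistics import harmonic_mean


def auc(normal_scores, anomalous_scores):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count as 0.5."""
    wins = sum(1.0 if a > n else 0.5 if a == n else 0.0
               for a in anomalous_scores for n in normal_scores)
    return wins / (len(anomalous_scores) * len(normal_scores))


def pauc(normal_scores, anomalous_scores, p=0.1):
    """AUC against only the floor(p * N) highest-scoring normal clips (assumed form)."""
    n_keep = max(1, math.floor(p * len(normal_scores)))
    return auc(sorted(normal_scores, reverse=True)[:n_keep], anomalous_scores)


def score_2022(samples, p=0.1):
    """samples: list of (anomaly_score, is_anomalous, domain) tuples."""
    anomalies = [s for s, is_anom, _ in samples if is_anom]  # both domains
    normal_src = [s for s, is_anom, d in samples if not is_anom and d == "source"]
    normal_tgt = [s for s, is_anom, d in samples if not is_anom and d == "target"]
    smix_auc = auc(normal_src, anomalies)
    tmix_auc = auc(normal_tgt, anomalies)
    mix_pauc = pauc(normal_src + normal_tgt, anomalies, p)
    return smix_auc, tmix_auc, mix_pauc, harmonic_mean([smix_auc, tmix_auc, mix_pauc])


# Hypothetical case: target-domain normals score higher than a source anomaly.
samples = [
    (0.1, False, "source"), (0.2, False, "source"), (0.3, True, "source"),
    (0.5, False, "target"), (0.6, False, "target"), (0.9, True, "target"),
]
print(score_2022(samples))
```

In the example the source anomaly is perfectly separated from source normals, yet it scores below every target normal, so tmix_auc drops. A per-domain AUC would not reveal this misalignment.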
In DCASE 2023 and 2024, only one section is provided, but a
larger number of machine types are included. The dev and eval subsets
are defined based on the machine types [4], [5]. In DCASE 2024,
attribute information is not available for some machine types [5]. The
evaluation score is the same as in DCASE 2022.
2.2. Supported methods
ASDKit handles various ASD methods in a unified framework shown
in Fig. 1. The supported methods are summarized in Table 2. All
ASD methods consist of frontend and backend modules, where the
frontend extracts some features from audio signals, and the backend
computes anomaly scores after processing these features. Training and
evaluation are performed in four steps: (1) training the frontend with
the training dataset, (2) extracting features from both the training and
test datasets, (3) training the backend with the features extracted from
the training dataset and computing anomaly scores for the test data,
and (4) computing evaluation scores based on the anomaly scores
and ground-truth normal and anomalous labels. In the following, we
explain the ASD methods supported by ASDKit and their processing
pipelines within this four-step framework.
2.2.1. AE: This is the AE-based official DCASE baseline
method [1]. The frontend consists of a ten-layer multilayer perceptron
(MLP). The input features are five adjacent frames of the log Mel
power spectrogram, and the frontend is trained to minimize their
reconstruction loss. AE directly computes anomaly scores based on
the reconstruction error for encountered test samples in the second step,
and in the third step, the backend simply copies these anomaly scores.
The AE is independently trained for each machine type, repeating
steps 1–4 for each machine type.
2.2.2. Discriminative methods: ASDKit provides various options
for discriminative methods. For the frontend architecture, it supports
multi-branch CNNs [9], [19] and SSL models [12], [20], [21]. The
multi-branch CNN receives multiple input features, such as the
spectrum and spectrograms. The frontend independently processes
these features and obtains a final output by concatenating the outputs
from each branch. Each branch consists of 1D or 2D convolutional
layers followed by an MLP. For SSL models, ASDKit supports
BEATs [20] and EAT [21], which are widely used in ASD tasks [11],
[12], [22]–[24]. Based on previous works [12], we also introduce
additional low-rank adaptation (LoRA) [25] parameters to fine-tune
these models. These frontends are trained using classification of
meta-information labels.
For the discriminative loss function, ASDKit supports ArcFace [26],
AdaCos [27], Sub-cluster AdaCos (SCAC) [28], and AdaProj [29],
where these angular margin loss functions are known to be effective for
ASD tasks [30]. ASDKit also supports the option to choose between
fixed and trainable class centers for the angular margin loss functions,
as this choice affects performance; fixed class centers have been
shown to achieve better results in previous work [9].
For data augmentation, ASDKit supports Mixup [31] and
SpecAugment [32], which are widely used in ASD tasks [9]. Additionally,
ASDKit supports techniques to boost ASD performance, such as
FeatEx [10] and its parameter-efficient variant, subspace loss [19].
FeatEx and subspace loss introduce additional losses using subspace
features in the multi-branch CNN, where subspace features refer to
the features extracted from each branch before concatenation.
Table 3: Setups of the discriminative methods. MB indicates multi-branch
CNN. Parentheses in the “Training strategy” column denote whether the class
centers in the loss function are fixed or trainable.
Recipe Name                     Frontend       Training strategy  DataAug
dis_spec_adacos_fixed_wo_mixup  MB             AdaCos (fixed)     No
dis_spec_adacos_fixed           MB             AdaCos (fixed)     Mixup
dis_spec_scac_fixed             MB             SCAC (fixed)       Mixup
dis_spec_scac_trainable         MB             SCAC (trainable)   Mixup
dis_spec_subspaceloss           MB             Subspace loss      Mixup
dis_spec_featex                 MB             FeatEx             Mixup
dis_multispec_scac_trainable    MB             SCAC (trainable)   Mixup
dis_beats_scac_trainable        BEATs w/ LoRA  SCAC (trainable)   Mixup
dis_eat_scac_trainable          EAT w/ LoRA    SCAC (trainable)   Mixup

The backend consists of k-nearest neighbor (kNN), where anomaly
scores are calculated as the average distance to the k nearest neighbors
in the training (reference) dataset from a given observation in the
discriminative feature space. ASDKit also supports variants that
incorporate kmeans clustering [9], SMOTE oversampling [23], [33],
and anomaly score rescaling [34]. Kmeans clustering and SMOTE
are used to balance the number of training samples between the
source and target domains. Kmeans clustering reduces the number of
samples in the source domain, and the resulting centroids are used
as reference samples in the subsequent kNN-based anomaly score
calculation. SMOTE is applied to oversample the training samples
in the target domain. The anomaly score rescaling technique [34]
rescales anomaly scores based on the local density of training samples
in the feature space, addressing the tendency of low-density target
domains to exhibit higher anomaly scores.
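The plain kNN backend described above can be sketched in a few lines. This is a minimal Python illustration, not ASDKit's implementation: the function name is hypothetical, and Euclidean distance is an assumption, since the text does not state which distance is used. The kmeans, SMOTE, and rescaling variants are not shown.

```python
import math


def knn_anomaly_score(query, reference, k=1):
    """Mean distance from `query` to its k nearest reference embeddings.

    query:     one test embedding (sequence of floats)
    reference: embeddings of normal training clips
    Euclidean distance is an assumption made for this sketch.
    """
    distances = sorted(math.dist(query, ref) for ref in reference)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Hypothetical 2-D embeddings of normal training clips.
reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(knn_anomaly_score([0.1, 0.1], reference, k=1))  # close to the normal data
print(knn_anomaly_score([3.0, 4.0], reference, k=2))  # far from the normal data
```

Replacing `reference` with kmeans centroids for the source domain, or extending it with synthetic target-domain samples, changes only the reference set and leaves the scoring rule as it is.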
The discriminative methods train a common frontend using data
from all machine types, and then independently train the backend for
each machine type by repeating steps 2–4 for each type.
2.2.3. Raw feature-based methods: ASDKit supports raw feature-based
ASD methods [11], [35], where the frontend extracts raw
features without training with meta-information label classification.
For example, as raw features, [35] uses the time-averaged spectrogram,
and [11] uses BEATs features without fine-tuning. For the backend, the
same kNN-based anomaly score calculation as in the discriminative
methods is employed. This method repeats steps 2–4 for each machine
type, skipping the frontend training in the first step.
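The three method families differ mainly in which of the four steps are repeated per machine type. The following Python sketch lists that schedule; the function name, the family labels, and the tuple layout are hypothetical and do not correspond to ASDKit's recipe interface.

```python
def plan_steps(method, machine_types):
    """Return the ordered (step, machine_type) pairs for one method family.

    Steps: 1 = train frontend, 2 = extract, 3 = score, 4 = evaluate.
    machine_type is None when a step is shared across all machine types.
    """
    if method == "ae":
        # Frontend trained separately per machine type: steps 1-4 each time.
        return [(step, mt) for mt in machine_types for step in (1, 2, 3, 4)]
    if method == "discriminative":
        # One frontend trained on all machine types, then steps 2-4 per type.
        return [(1, None)] + [(step, mt) for mt in machine_types for step in (2, 3, 4)]
    if method == "raw":
        # No frontend training at all: steps 2-4 per type.
        return [(step, mt) for mt in machine_types for step in (2, 3, 4)]
    raise ValueError(f"unknown method family: {method}")


for family in ("ae", "discriminative", "raw"):
    print(family, plan_steps(family, ["fan", "valve"]))
```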
3. EXPERIMENTAL EVALUATION
We re-evaluated various ASD methods using ASDKit to demonstrate
its capabilities.
3.1. Setups
3.1.1. Datasets and metrics: We used the DCASE 2020–2024
datasets [1]–[5] and the official evaluation scores described in Sec. 2.1.
We computed the arithmetic mean and standard deviation across four
independent trials, where each trial’s official score was calculated as
the harmonic or arithmetic mean, as defined in Table 1.
3.1.2. Evaluated methods and their setups: In the following, we
list the ASD methods evaluated in this paper and describe their setups.
Each method is referred to by its recipe name provided in ASDKit.
ae is the AE-based method [1] described in Sec. 2.2.1. We trained
the AE network for 50 epochs using the Adam optimizer [36] with
the mean squared error as the loss function, a fixed learning rate
of 0.001, and a batch size of 256. For the input features, we used five
adjacent frames of the log Mel power spectrogram, with 128 Mel
bins, a DFT size of 1024 samples, and a hop size of 512 samples.
The discriminative methods we evaluated are summarized in
Table 3. Mixup [31] was always applied with a probability of 50%.
dis_multispec_scac_trainable extracted 512-dimensional features from
an amplitude spectrum and three spectrograms with different DFT
sizes (256, 512, and 1024 samples), while the other multi-branch
CNN methods extracted 256-dimensional features from an amplitude
spectrum and an amplitude spectrogram with a DFT size of 1024
samples. The hop size was set to half the DFT size, and frequency
bins in the range of 200 Hz to 8000 Hz were used. We trained the
multi-branch CNNs for 16 epochs using the AdamW optimizer [37]
with a fixed learning rate of 0.001 and a batch size of 64. For
dis_beats_scac_trainable, we used the BEATs_iter3.pt checkpoint
and introduced LoRA parameters to the query and key projection
layers within the Transformer. The 768-dimensional feature sequence
from BEATs was aggregated using a statistics pooling layer [38]
and then projected to a 256-dimensional feature using a linear layer.
For dis_eat_scac_trainable, we used the EAT-base_epoch10_pt.pt
checkpoint and introduced LoRA parameters to the query, key, and
value projection layers. A 768-dimensional CLS feature from EAT was
projected to a 256-dimensional feature using a linear layer. We
fine-tuned the SSL-based models for 25 epochs using the AdamW optimizer
with a batch size of 8 and a LoRA rank of 64. The learning rate was
linearly increased from 0 to 0.0001 over the first 5,000 steps.
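The warmup schedule used for fine-tuning can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, steps are counted from zero, and the rate is assumed to stay at its peak after warmup, which the text does not specify.

```python
def warmup_lr(step, peak_lr=1e-4, warmup_steps=5000):
    """Learning rate at a 0-based optimizer step under linear warmup."""
    if step >= warmup_steps:
        # Assumption: the rate is held constant once warmup ends.
        return peak_lr
    return peak_lr * step / warmup_steps


for step in (0, 2500, 5000, 20000):
    print(step, warmup_lr(step))
```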
For the raw feature-based methods described in Sec. 2.2.3, we
evaluated raw_spec, raw_beats, and raw_eat. raw_spec used
time-averaged Mel amplitude spectrograms with 128 Mel bins, a DFT size
of 1024 samples, and a hop size of 512 samples. raw_beats averaged
the 768-dimensional feature sequence from BEATs, while raw_eat
directly used the 768-dimensional CLS feature from EAT.
For the discriminative methods and raw feature-based methods, we
used a kNN-based backend with k = 1 described in Sec. 2.2.2. We
also used kmeans clustering, SMOTE oversampling, and anomaly
score rescaling techniques. For kmeans clustering, we set the number
of clusters to 16. For SMOTE oversampling, we set the oversampling
ratio to 20% and the number of neighbors to 2. For anomaly score
rescaling [34], we used a kNN-based approach with the number of
neighbors set to 4.
3.2. Results
Figure 2 shows the official evaluation scores of the considered ASD
methods, where both the discriminative methods and raw feature-based
methods employ kNN with SMOTE as the backend. The figure also
includes the evaluation scores of the top-performing system and the
official baselines reported by the DCASE organizers.² First, it is
evident that evaluation on a single dataset and a single trial is insufficient.
For example, the performance order of dis_spec_adacos_fixed and
dis_beats is reversed between the DCASE 2023 dev and eval subsets.
Also, dis_spec_adacos_fixed exhibits a standard deviation of 2.04 on
the DCASE 2023 eval subset, with performance ranging from 53.07
to 57.98 across four trials, as observed in our experimental results.
We reaffirm that training techniques for discriminative methods
significantly impact performance. For instance, the effectiveness of mixup
is demonstrated by comparing dis_spec_adacos_fixed_wo_mixup
and dis_spec_adacos_fixed; the effectiveness of SCAC, by
comparing dis_spec_adacos_fixed and dis_spec_scac_fixed; and the
effectiveness of multi-resolution spectrograms, by comparing
dis_spec_scac_trainable and dis_multispec_scac_trainable.
Additionally, fine-tuned SSL models (dis_beats_scac_trainable and
dis_eat_scac_trainable) consistently achieve high performance.

²The evaluation scores on the development set are not officially reported.
In DCASE 2020, the official scores aggregated across all machine types were
also not provided. Although several official baseline systems are provided,
only the best-performing baseline systems are included in the figure. The
best official baseline systems in DCASE 2021 [1] and 2024 [5] use the same
method as our ae recipe, while the 2023 [4] baseline uses its variant [39]. The
2022 baseline employs a different discriminative method [3].

Recipe Name                     20dev         20eval        21dev         21eval        22dev         22eval        23dev         23eval        24dev         24eval        Hmean  Amean
ae                              65.33 (0.41)  67.18 (0.62)  56.46 (0.24)  54.89 (0.29)  53.84 (0.35)  53.46 (0.77)  54.82 (1.10)  57.80 (1.04)  55.02 (0.21)  56.71 (0.85)  57.22  57.55
raw_spec                        64.81         62.22         55.92         54.92         58.15         55.32         54.75         59.99         54.16         54.04         57.22  57.43
raw_beats                       70.24         71.15         62.85         58.12         61.86         57.08         60.11         62.40         56.11         57.62         61.38  61.76
raw_eat                         66.04         68.36         57.80         57.90         58.35         57.09         56.74         59.81         54.87         56.30         59.06  59.33
dis_spec_adacos_fixed_wo_mixup  89.24 (0.79)  89.48 (0.60)  66.80 (0.40)  65.04 (0.61)  67.11 (1.25)  63.26 (0.88)  61.12 (0.92)  51.14 (2.88)  57.20 (1.20)  52.32 (0.55)  64.12  66.27
dis_spec_adacos_fixed           89.44 (0.18)  90.66 (0.27)  67.89 (0.62)  66.01 (0.39)  68.30 (0.94)  64.12 (0.80)  61.67 (2.07)  55.83 (2.04)  59.14 (1.34)  52.31 (0.61)  65.58  67.54
dis_spec_scac_fixed             89.53 (0.78)  90.77 (0.70)  67.37 (0.99)  66.75 (1.10)  69.26 (1.39)  65.03 (0.64)  63.01 (0.42)  58.95 (1.72)  59.82 (0.34)  52.46 (1.55)  66.47  68.30
dis_spec_scac_trainable         90.57 (0.13)  91.96 (0.61)  69.14 (0.53)  65.36 (0.21)  71.62 (0.47)  67.74 (0.89)  64.40 (1.28)  66.14 (1.37)  61.58 (0.15)  56.84 (0.97)  69.02  70.53
dis_spec_subspaceloss           90.06 (0.47)  91.59 (0.54)  68.89 (0.39)  67.40 (0.50)  71.66 (0.35)  67.53 (0.83)  63.54 (0.80)  62.79 (1.32)  61.96 (0.71)  55.14 (1.09)  68.46  70.06
dis_spec_featex                 89.92 (0.21)  91.74 (0.45)  69.18 (0.44)  67.41 (0.74)  71.58 (0.86)  66.96 (1.23)  64.32 (0.57)  65.43 (1.26)  62.05 (0.93)  55.84 (0.60)  68.93  70.44
dis_multispec_scac_trainable    91.65 (0.41)  92.84 (0.75)  69.80 (0.74)  67.13 (0.52)  71.74 (0.50)  68.99 (0.39)  66.08 (2.06)  68.60 (0.75)  63.05 (0.99)  56.73 (1.84)  70.16  71.66
dis_beats_scac_trainable        90.63 (0.19)  93.52 (0.26)  73.06 (0.61)  67.62 (0.14)  73.24 (0.34)  68.87 (0.28)  62.66 (0.70)  71.46 (0.49)  59.53 (0.85)  61.72 (0.73)  70.76  72.23
dis_eat_scac_trainable          90.41 (0.18)  93.46 (0.43)  72.91 (0.38)  67.57 (0.53)  73.88 (0.80)  69.85 (0.43)  63.98 (0.70)  72.04 (0.77)  59.93 (0.37)  61.47 (0.73)  71.13  72.55
Challenge_Baseline              -             -             -             56.38         -             54.02         -             61.05         -             56.51         -      -
Challenge_Top                   -             -             -             66.80         -             70.97         -             66.97         -             66.24         -      -
Fig. 2: Official evaluation scores. Values are presented as “arithmetic mean (standard deviation)” across four independent trials. Hmean and Amean denote the
harmonic mean and arithmetic mean, respectively, computed over all dev and eval subsets across the datasets. Backend of the discriminative methods and raw
feature-based methods is kNN with SMOTE. Raw feature-based methods do not involve any random processes; therefore, the standard deviation is not reported.
Fig. 3: Official evaluation scores of raw_beats with different backends.
Fig. 4: Official evaluation scores of dis_spec_scac_trainable with different backends. Values are presented as “arithmetic mean (standard deviation)” across
four independent trials.
As a new insight, we find that trainable class centers improve
performance in most cases, as demonstrated by the comparison
between dis_spec_scac_fixed and dis_spec_scac_trainable, while fixed
class centers achieved better results in previous work [9]. Moreover, the
FeatEx (dis_spec_featex) and subspace loss (dis_spec_subspaceloss)
techniques achieve performance improvements similar to those
obtained by using trainable class centers. Since both FeatEx and
subspace loss employ trainable centers for additional loss terms, we
conclude that their performance gains are mainly attributable to the
use of trainable class centers.
Furthermore, Fig. 2 demonstrates that ASDKit achieves SOTA-level
performance. Note that, except for the 2023 top system [43], the
top-performing systems in DCASE 2021 [40], 2022 [41], and
2024 [42] employed ensembles, with the 2022 system additionally
using machine-specific hyperparameter tuning.
Figures 3 and 4 show the official evaluation scores of raw_beats
and dis_spec_scac_trainable with different backends, respectively.
The results show that kNN with SMOTE achieves high performance
in most cases. The anomaly score rescaling technique significantly
improves the performance of raw_beats, although our re-evaluation
across multiple datasets reveals that its effect is unstable for the
considered discriminative embedding model.
4. CONCLUSION
In this paper, we introduced ASDKit, an open-source toolkit for
ASD research. ASDKit provides recipes for various ASD methods,
including AE-based approaches, state-of-the-art (SOTA) discriminative
methods, and raw feature-based methods. In addition, it supports
comprehensive evaluation on the DCASE 2020–2024 datasets with
multiple trials. Using ASDKit, we re-evaluated various ASD methods
and identified the effectiveness of mixup, SCAC with trainable class
centers, multi-resolution spectrograms, and fine-tuned SSL models.
We will continuously update ASDKit with new ASD methods and
welcome contributions from the community to facilitate further
development of ASD research.
5. ACKNOWLEDGMENT
This work was partly supported by JSPS KAKENHI Grant Number
JP25KJ1439.
REFERENCES
[1] Y. Koizumi, Y. Kawaguchi, K. Imoto, et al., “Description and discussion
on DCASE2020 challenge task2: Unsupervised anomalous sound
detection for machine condition monitoring,” in Proc. DCASE, 2020,
pp. 81–85.
[2] Y. Kawaguchi, K. Imoto, Y. Koizumi, et al., “Description and discussion
on DCASE 2021 challenge task 2: Unsupervised anomalous detection
for machine condition monitoring under domain shifted conditions,” in
Proc. DCASE, 2021, pp. 186–190.
[3] K. Dohi, K. Imoto, N. Harada, et al., “Description and discussion on
DCASE 2022 challenge task 2: Unsupervised anomalous sound detection
for machine condition monitoring applying domain generalization
techniques,” in Proc. DCASE, 2022, pp. 1–5.
[4] K. Dohi, K. Imoto, N. Harada, et al., “Description and discussion
on DCASE 2023 challenge task 2: First-shot unsupervised anomalous
sound detection for machine condition monitoring,” in Proc. DCASE,
2023, pp. 31–35.
[5] T. Nishida, N. Harada, D. Niizumi, et al., “Description and discussion
on DCASE 2024 challenge task 2: First-shot unsupervised anomalous
sound detection for machine condition monitoring,” in Proc. DCASE,
2024, pp. 111–115.
[6] K. Suefusa, T. Nishida, H. Purohit, R. Tanabe, T. Endo, and Y.
Kawaguchi, “Anomalous sound detection based on interpolation deep
neural network,” in Proc. ICASSP, 2020, pp. 271–275.
[7] K. Dohi, T. Endo, H. Purohit, R. Tanabe, and Y. Kawaguchi, “Flow-based
self-supervised density estimation for anomalous sound detection,”
in Proc. ICASSP, 2021, pp. 336–340.
[8] R. Giri, F. Cheng, K. Helwani, S. V. Tenneti, U. Isik, and A.
Krishnaswamy, “Group masked autoencoder based density estimator
for audio anomaly detection,” in Proc. DCASE, 2020, pp. 51–55.
[9] K. Wilkinghoff, “Design choices for learning embeddings from auxiliary
tasks for domain generalization in anomalous sound detection,” in Proc.
ICASSP, 2023, pp. 1–5.
[10] K. Wilkinghoff, “Self-supervised learning for anomalous sound
detection,” in Proc. ICASSP, 2024, pp. 276–280.
[11] P. Saengthong and T. Shinozaki, “Deep generic representations for
domain-generalized anomalous sound detection,” in Proc. ICASSP,
2025, pp. 1–5.
[12] X. Zheng, A. Jiang, B. Han, et al., “Improving anomalous sound
detection via low-rank adaptation fine-tuning of pre-trained audio
models,” in Proc. SLT, 2024, pp. 969–974.
[13] J. A. Lopez, H. Lu, P. Lopez-Meyer, L. Nachman, G. Stemmer, and
J. Huang, “A speaker recognition approach to anomaly detection,” in
Proc. DCASE, 2020, pp. 96–99.
[14] R. Giri, S. V. Tenneti, F. Cheng, K. Helwani, U. Isik, and A.
Krishnaswamy, “Self-supervised classification for detecting anomalous
sounds,” in Proc. DCASE, 2020, pp. 46–50.
[15] P. Primus, V. Haunschmid, P. Praher, and G. Widmer, “Anomalous sound
detection as a simple binary classification problem with careful selection
of proxy outlier examples,” in Proc. DCASE, 2020, pp. 170–174.
[16] I. Kuroyanagi, T. Hayashi, K. Takeda, and T. Toda, “Serial-OE:
Anomalous sound detection based on serial method with outlier exposure
capable of using small amounts of anomalous data for training,” APSIPA
Transactions on Signal and Information Processing, vol. 14, no. 1, 2025.
[17] K. Wilkinghoff and F. Fritz, “On using pre-trained embeddings for
detecting anomalous sounds with limited training data,” in Proc.
EUSIPCO, 2023, pp. 186–190.
[18] K. Wilkinghoff, T. Fujimura, K. Imoto, J. Le Roux, Z.-H. Tan, and
T. Toda, “Handling domain shifts for anomalous sound detection: A
review of DCASE-related work,” arXiv preprint arXiv:2503.10435,
2025.
[19] T. Fujimura, I. Kuroyanagi, and T. Toda, “Improvements of discriminative
feature space training for anomalous sound detection in unlabeled
conditions,” in Proc. ICASSP, 2025, pp. 1–5.
[20] S. Chen, Y. Wu, C. Wang, et al., “BEATs: Audio pre-training with
acoustic tokenizers,” in Proc. ICML, 2023, pp. 5178–5193.
[21] W. Chen, Y. Liang, Z. Ma, Z. Zheng, and X. Chen, “EAT: Self-supervised
pre-training with efficient audio transformer,” in Proc. IJCAI,
Main Track, 2024, pp. 3807–3815.
[22] A. Jiang, B. Han, Z. Lv, et al., “Anopatch: Towards better consistency
in machine anomalous sound detection,” in Proc. Interspeech, 2024,
pp. 107–111.
[23] A. Jiang, X. Zheng, B. Han, et al., “Adaptive prototype learning for
anomalous sound detection with partially known attributes,” in Proc.
ICASSP, 2025, pp. 1–5.
[24] J. Yin, Y. Gao, W. Zhang, T. Wang, and M. Zhang, “Diffusion
augmentation sub-center modeling for unsupervised anomalous sound
detection with partially attribute-unavailable conditions,” in Proc.
ICASSP, 2025, pp. 1–5.
[25] E. J. Hu, Y. Shen, P. Wallis, et al., “LoRA: Low-rank adaptation of
large language models,” ICLR, vol. 1, no. 2, p. 3, 2022.
[26]
J. Deng, J. Guo, N. Xue, and S. Za ei iou, “A cFace: Addi i e angula
ma gin loss o deep ace ecogni ion,” in P oc. CVPR, 2019, pp. 4690–
4699.
[27]
X. Zhang, R. Zhao, Y. Qiao, X. Wang, and H. Li, “AdaCos: Adap i ely
scaling cosine logi s o e ec i ely lea ning deep ace ep esen a ions,”
in P oc. CVPR, 2019, pp. 10 823–10 832.
[28]
K. Wilkingho , “Sub-clus e AdaCos: Lea ning ep esen a ions o
anomalous sound de ec ion,” in P oc. IJCNN, 2021, pp. 1–8.
[29]
K. Wilkingho , “AdaP oj: Adap i ely scaled angula ma gin subspace
p ojec ions o anomalous sound de ec ion wi h auxilia y classi ica ion
asks,” in P oc. DCASE, 2024, pp. 186–190.
[30]
K. Wilkingho and F. Ku h, “Why do angula ma gin losses wo k well
o semi-supe ised anomalous sound de ec ion?” IEEE/ACM TASLP,
2023.
[31]
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: Beyond
empi ical isk minimiza ion,” in P oc. ICLR, 2018.
[32]
D. S. Pa k, W. Chan, Y. Zhang, e al., “SpecAugmen : A simple
da a augmen a ion me hod o au oma ic speech ecogni ion,” in P oc.
In e speech, 2019, pp. 2613–2617.
[33]
N. V. Chawla, K. W. Bowye , L. O. Hall, and W. P. Kegelmeye ,
“SMOTE: Syn he ic mino i y o e -sampling echnique,” JAIR, ol. 16,
pp. 321–357, 2002.
[34]
K. Wilkingho , H. Yang, J. Ebbe s, F. G. Ge main, G. Wiche n, and
J. Le Roux, “Keeping he balance: Anomaly sco e calcula ion o domain
gene aliza ion,” in P oc. ICASSP, 2025, pp. 1–5.
[35]
J. Guan, Y. Liu, Q. Zhu, T. Zheng, J. Han, and W. Wang, “Time-
weigh ed equency domain audio ep esen a ion wi h GMM es ima o
o anomalous sound de ec ion,” in P oc. ICASSP, 2023, pp. 1–5.
[36]
D. P. Kingma and J. Ba, “Adam: A me hod o s ochas ic op imiza ion,”
in P oc. ICLR, 2015, pp. 1–15.
[37]
I. Loshchilo and F. Hu e , “Decoupled weigh decay egula iza ion,”
in P oc. ICLR, 2019.
[38]
B. Desplanques, J. Thienpond , and K. Demuynck, “ECAPA-TDNN:
emphasized channel a en ion, p opaga ion and agg ega ion in TDNN
based speake e i ica ion,” in P oc. In e speech, 2020, pp. 3830–3834.
[39]
N. Ha ada, D. Niizumi, Y. Ohishi, D. Takeuchi, and M. Yasuda, “Fi s -
sho anomaly sound de ec ion o machine condi ion moni o ing: A
domain gene aliza ion baseline,” in P oc. EUSIPCO, 2023, pp. 191–195.
[40]
J. Lopez, G. S emme , and P. Lopez-Meye , “Ensemble o complemen-
a y anomaly de ec o s unde domain shi ed condi ions,” DCASE2021
Challenge, Tech. Rep., 2021.
[41]
Y. Zeng, H. Liu, L. Xu, Y. Zhou, and L. Gan, “Robus anomaly sound
de ec ion amewo k o machine condi ion moni o ing,” DCASE2022
Challenge, Tech. Rep., 2022.
[42]
Z. L , A. Jiang, B. Han, e al., “AITHU sys em o i s -sho
unsupe ised anomalous sound de ec ion,” DCASE2024 Challenge,
Tech. Rep., 2024.
[43]
J. Jie, “Anomalous sound de ec ion based on sel -supe ised lea ning,”
DCASE2023 Challenge, Tech. Rep., 2023.
Detection and Classification of Acoustic Scenes and Events 2025 30-31 October, Barcelona, Spain
LiB-TRAD: A Lithium Battery Thermal Runaway Acoustic Dataset for Anomaly Detection
Wang Xiaoliang1, Ming Ao1, Chen Meixin1, Jin Hao1
1Zhejiang University, Hangzhou, China
Abstract— Efficient detection of lithium battery thermal runaway is a critical factor in promoting the large-scale application of lithium batteries in energy storage and electric transportation. Traditional methods rely heavily on contact-based techniques such as temperature, current, voltage, impedance, or structural deformation monitoring, which have limitations in terms of cost, real-time performance, and scalability. In contrast, acoustic detection, with its non-contact nature, low cost, and suitability for large-scale monitoring, is emerging as a promising alternative. While previous studies have demonstrated the effectiveness of machine learning-based acoustic methods for thermal runaway detection, there is still a lack of an open acoustic dataset covering the entire process of lithium battery thermal runaway. To address this, this paper introduces the first lithium battery acoustic dataset containing both normal and thermal runaway events, annotated with abnormal events. We further evaluate several baseline models and state-of-the-art acoustic event detection models using this dataset. Experimental results show that this dataset holds strong potential for thermal runaway anomaly detection and provides a valuable data foundation and benchmark for future research.
Index Terms—Thermal Runaway, Acoustic Anomaly Detection, Lithium Battery Safety, Audio Dataset
1. INTRODUCTION
With the development of renewable energy sources such as solar and wind power [1–2], large-scale energy storage stations and grid facilities have advanced significantly in recent years, leading to the construction of large stations equipped with a vast number of energy storage batteries [3]. Due to their long cycle life, high output voltage, high energy density, and low self-discharge rate [4–8], lithium-ion batteries have gradually become the most promising option among all energy storage battery technologies. Unfortunately, in large-scale energy storage stations, tens of thousands of closely packed battery cells are often deployed, and each individual cell has the potential to undergo thermal runaway due to heat accumulation and material characteristics, which may lead to cascading thermal events and even fire hazards [9]. In recent years, numerous cases of thermal runaway have occurred in both the energy storage and electric vehicle sectors, posing significant challenges to the further development of lithium-ion batteries. Thermal runaway has become the most critical safety issue in the energy storage domain [10–11], and how to detect and intervene in the early stages has become a shared focus of both academia and industry.
Overcharging, overdischarging, and mechanical damage can all trigger thermal runaway in lithium-ion batteries [12–13]. Fig. 1(a)(b)(c) illustrates the heat generated by mild overcharge and overdischarge conditions [14], while Fig. 1(d)(e)(f) depicts several stages of thermal runaway induced by excessive overcharge. In stage (d), the initial phase, a large amount of heat is generated inside the cell, and the electrolyte begins to produce bubbles, leading to internal pressure buildup. In stage (e), the onset phase, the pressure reaches the threshold of the safety valve, causing it to open and release a significant amount of gas, accompanied by a distinct venting sound. Finally, in stage (f), thermal runaway escalates and propagates violently, during which the electrolyte is expelled and flames may occur and spread [12].
Fig. 1: Mechanism of thermal runaway initiation and propagation [14]. (a) and (c): Mild overcharge and overdischarge conditions in individual cells; (b): Module inconsistency; (d)–(f): Representative stages of thermal runaway progression.
Various feasible methods have been proposed for monitoring thermal runaway in lithium batteries, which can be broadly categorized into approaches based on changes in temperature, pressure, and electrical characteristics [15–18]. Temperature-based monitoring typically involves placing thermocouple sensors on the battery surface to track temperature variations, thereby enabling prediction and detection of thermal runaway events [19]. While effective, this approach requires extensive wiring and suffers from significant latency. Pressure-based methods rely on installing pressure sensors between battery cells to detect swelling, thus providing early warnings of abnormal behavior [20]; however, this method also demands complex wiring and suffers from limited detection accuracy. Monitoring based on electrical characteristics involves using battery management systems (BMS) to measure parameters such as impedance, current, and voltage to assess the battery’s health status [15]. Nonetheless, such methods may lack direct correlation with thermal runaway events and are often costly.
As previously mentioned, the opening of the safety valve during a thermal runaway event generates a distinct venting sound. Consequently, acoustic-based monitoring methods have attracted growing attention. In the study by Su et al. [21], a venting sound detection method for individual lithium cells was proposed, combining XGBoost and wavelet transform to validate the existence of characteristic acoustic signals during thermal runaway and the feasibility of monitoring such events via sound. Similarly, in the work by Lyu et al. [12], four microphones were deployed inside a battery storage cabin, and a combination of wavelet transform and cross-correlation algorithms was used to detect and locate the safety valve venting sound. These studies demonstrate that acoustic sensing offers an effective approach to detecting thermal runaway in lithium batteries [12].
This work is supported by the Science and Technology Project of the State Grid Corporation of China (Evolution mechanism of performance degradation and status sensing methods of lithium-ion battery energy storage system based on advanced acoustic sensing technology, Grant No. 520627230016).
Although acoustic methods have proven effective for identifying thermal runaway, growing attention has been paid to predicting battery states using machine learning based on acoustic signals. While several datasets exist for machine learning tasks related to battery impedance and capacity [22], there is still a lack of publicly available acoustic datasets specifically focused on abnormal sounds during thermal runaway. Such a dataset is essential for better understanding the acoustic evolution during thermal runaway and for enabling prediction and diagnosis using sound-based methods.
In this work, we address this gap by collecting thermal runaway sound data from lithium-ion batteries using a microphone array and annotating abnormal events throughout the process. We construct a lithium battery thermal runaway sound dataset and validate its effectiveness through a binary classification task using a CNN-based model. Furthermore, to better reflect real-world scenarios where abnormal data is scarce, we also explore an unsupervised anomaly detection approach based on acoustic signals.
Specifically, our contributions are as follows:
1. We present the first acoustic dataset for lithium-ion batteries that includes both normal and thermal runaway states.
2. We provide detailed annotations of abnormal events within the dataset.
3. We evaluate various baseline models and mainstream acoustic event detection methods, offering a valuable benchmark and data foundation for future research.
2. DATASET CONSTRUCTION
2.1. Data Acquisition System Design
We employed a four-microphone array to collect acoustic data. Specifically, our data acquisition system consists of the microphone array and a central processing unit. The microphones used are ICS-43434 digital MEMS microphones, and a DSP chip serves as the processing core for signal acquisition and transmission. The configuration of the microphone array and the data acquisition system is shown in Fig. 2.
Due to varying scales of battery modules in our thermal runaway experiments, the microphone array needed to maintain a certain distance from the center of the runaway cell. Therefore, the shape of the array was not fixed but adjusted according to the size of the battery module. For single-cell thermal runaway experiments, the microphones were placed at the four corners of a rectangular battery module. For experiments involving a full row of cells, where reactions are more intense, the array was mounted on a rectangular metal frame approximately one meter away from the battery pack.
Fig. 2: Microphone array configuration and data acquisition system. (a) microphone array and data acquisition setup; (b) close-up view of the data processing circuit board.
2.2. Thermal Runaway Data Collection
Here, we used overcharging to trigger thermal runaway in the battery cells. The cell model used was 314-0.5C (Narada, China), and the charging method applied was 157 A, 0.5 C direct current. Data acquisition (including sound, temperature, and other physical parameters) began at the start of charging and ended when water cooling was activated. In total, we collected three sets of thermal runaway acoustic signals, with corresponding cell configurations and microphone array layouts summarized in Table 1. The actual experimental site setups are shown in Fig. 3.
Table 1: Battery configurations and microphone array layouts in thermal runaway experiments.
Thermal Runaway ID | Battery Configuration at Test Site | Triggered Cell Configuration | Microphone Array Layout
1 | Full Row | Single Cell | Placed at four corners of the full cell row
2 | Full Row | Full Cell Row | Placed on rectangular iron frame surrounding the cells
3 | Entire Battery Pack | Single Cell | Placed at four corners of the entire battery pack
Fig. 3: On-site thermal runaway cell and microphone array configuration. (a) single-cell thermal runaway triggering experiment on a single-row cell array; (b) whole-row cell triggering experiment on a single-row cell array; (c) single-cell triggering experiment on the entire battery module.
We collected three complete sets of thermal runaway sound data using a four-channel microphone array, with a sampling rate of 16,000 Hz. Each recording segment is 10 seconds long. Specifically, we obtained 100×10 s, 185×10 s, and 186×10 s of thermal runaway audio samples, and recorded corresponding timestamps for synchronization with video footage and subsequent annotation. The recordings from each channel of the microphone array were treated as separate audio files for data processing.
Additionally, we collected two sets of normal charging sound data, comprising 1000×10 s and 41×10 s segments, to verify that batteries produce almost no audible noise during normal charging. Due to variations in field conditions, normal charging sounds were not included in the training dataset for the subsequent experiments.
2.3. Dataset Construction
Since the acoustic signals during thermal runaway are primarily caused by the activation of the safety valve and the subsequent venting process, we define the safety valve activation and the following sound as anomalous events associated with thermal runaway. We name the resulting dataset LiB-TRAD, which is, to the best of our knowledge, the first sound dataset specifically dedicated to thermal runaway in lithium-ion batteries. The annotated format of our thermal runaway dataset is shown in Fig. 4.
Fig. 4: Composition of the thermal runaway dataset.
We extracted mel-spectrograms from both normal and abnormal signals for comparison, as shown in Fig. 5. For illustrative purposes, the spectrograms shown here are derived from a single representative channel of the microphone array. It can be observed that the normal signals contain a large amount of low-frequency components, which are attributed to the noise generated by fans and operating machinery. In contrast, the abnormal signals exhibit higher-frequency and more energy-concentrated components. This is because the sound source of the abnormal venting signal from the safety valve is close to the microphone array, and both the opening of the valve and the subsequent venting process produce significant high-frequency components.
Fig. 5: Comparison of mel-spectrograms between normal and abnormal signals. (a) Mel-spectrogram of a normal signal. (b) Mel-spectrogram of an abnormal signal.
3. DATASET VALIDATION
To demonstrate the effectiveness of the collected data and our abnormal sound labeling method, we first formulate the abnormal sound detection task as a binary classification problem, i.e., a supervised classification task. By analyzing the results on the validation set, we can evaluate the validity of our labeling approach.
We designed a CNN-based network to perform this task, with the network architecture shown in Fig. 6. Specifically, each audio segment is read at a sampling rate of 16 kHz, and a 64-dimensional Mel-spectrogram is extracted (with a window size of 1024 and hop length of 512). The Mel-spectrogram is then converted into a log-mel power spectrogram, which serves as the input to the model.
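As a rough illustration of the front-end settings above, the following sketch works out how many time frames a 10 s clip yields with a hop length of 512 and how a mel power value maps to decibels. The helper names, the centered-framing convention (one frame per hop plus one), and the 1e-10 floor are assumptions made for illustration; the paper does not state them.

```python
import math

SAMPLE_RATE = 16_000   # Hz, from the text
WIN_LENGTH = 1024      # analysis window size, from the text
HOP_LENGTH = 512       # hop length, from the text
N_MELS = 64            # number of mel bands, from the text


def num_frames(num_samples: int, hop: int = HOP_LENGTH) -> int:
    """Frame count under an assumed centered (padded) STFT framing."""
    return 1 + num_samples // hop


def power_to_db(power: float, floor: float = 1e-10) -> float:
    """Convert a power value to decibels; the floor (assumed) avoids log(0)."""
    return 10.0 * math.log10(max(power, floor))


clip_samples = 10 * SAMPLE_RATE       # one 10 s segment
frames = num_frames(clip_samples)     # time frames per clip
feature_shape = (frames, N_MELS)      # shape of the log-mel matrix
```

Under these assumptions each 10 s segment becomes a matrix with 64 mel bins per frame; a different padding convention would change the frame count slightly.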
The CNN model consists of three convolutional blocks, each containing a convolutional layer, Batch Normalization, and a ReLU activation function. Downsampling is performed using MaxPooling layers. Finally, an Adaptive Average Pooling layer followed by a fully connected layer outputs the prediction probability, representing the confidence that the current input is an abnormal sound.
During training, normal samples are labeled as 0 and abnormal samples as 1. The model is optimized using a binary cross-entropy loss function with the Adam optimizer and a learning rate of 1e-3. The dataset is randomly split into 80% training and 20% validation sets. At the end of each epoch, we compute the Area Under the Curve (AUC) metric on the validation set. A higher AUC indicates better performance in distinguishing between abnormal and normal sounds, thereby validating the effectiveness of our data collection and labeling methodology. The training loss curve is shown in Fig. 7.
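To make the AUC metric concrete, here is a minimal sketch that computes it as the fraction of (abnormal, normal) pairs in which the abnormal clip receives the higher score, with ties counted as half. The function name and the toy labels and scores are hypothetical; this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of abnormal (1) and normal (0) scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # abnormal clip ranked above normal clip
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: two normal clips (label 0) and two abnormal clips (label 1).
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auc = roc_auc(example_labels, example_scores)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the values reported below should be read.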
Fig. 6: Schematic diagram of the designed CNN-based binary classification network.
Fig. 7: Loss curves. (a) Training loss curve of Thermal Runaway Experiment 1. (b) Training loss curve of Thermal Runaway Experiment 2. (c) Training loss curve of Thermal Runaway Experiment 3.
The AUC values of our model on the test sets from different thermal runaway experiments are reported in Table 2.
Table 2: Binary classification test results of the CNN model on different thermal runaway experiments in the LiB-TRAD dataset.
Thermal Runaway ID | Test set AUC
1 | 0.9971
2 | 0.9876
3 | 0.9893
The results on the test set indicate that our abnormal sound labeling is highly accurate. However, in real-world production environments, it is often difficult to obtain abnormal sound data. In such cases, abnormal sound detection must be conducted in an unsupervised manner [23], which is similar to Task 2 of DCASE 2025 [24]. Therefore, in the following section, we adopt an unsupervised training approach, where the model is trained using only normal sound data, and abnormal sounds are introduced only during testing.
4. UNSUPERVISED ABNORMAL SOUND DETECTION
The task of unsupervised abnormal sound detection for machine condition monitoring has attracted considerable attention [25]. Abnormal sound detection refers to identifying whether the sound emitted by a target machine is normal or abnormal. In this work, we evaluate several baseline and pre-trained models that have demonstrated strong performance in previous DCASE tasks, including the autoencoder (AE) model [26], the BEATs model [27], the EATs model [28], and the Dasheng model [29]. Additionally, we propose an improved AE model as the baseline specifically tailored to the dataset presented in this paper.
In this study, we further designed and implemented an improved AE model, as illustrated in the flowchart in Fig. 8, for unsupervised detection of abnormal sound data. The model evaluates anomaly scores based on the Mahalanobis distance. The proposed processing pipeline consists of three key stages: data preprocessing, model training, and anomaly evaluation.
Fig. 8: Flowchart of the autoencoder model with integrated self-attention mechanism.
4.1. Data Preprocessing
Our training set consists of 80% of the normal audio data, while the test set includes the remaining 20% of the normal audio along with 100% of the abnormal audio, enabling an unsupervised detection task.
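The split described above can be sketched as follows. This is only an illustration under assumptions: the function name, the fixed random seed, and the file names are hypothetical, and the paper does not say how the 80/20 partition of normal audio was drawn.

```python
import random


def split_unsupervised(normal_files, abnormal_files, train_frac=0.8, seed=0):
    """Train on normal clips only; test on held-out normal plus all abnormal."""
    rng = random.Random(seed)             # fixed seed is an assumption
    shuffled = list(normal_files)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    train = shuffled[:cut]                # normal audio only, no labels needed
    test = [(f, 0) for f in shuffled[cut:]] + [(f, 1) for f in abnormal_files]
    return train, test


# Hypothetical file lists mirroring the normal/ and abnormal/ folder layout.
normal = [f"normal/{i:04d}.wav" for i in range(100)]
abnormal = [f"abnormal/{i:04d}.wav" for i in range(30)]
train_files, test_items = split_unsupervised(normal, abnormal)
```

Because no abnormal clip ever reaches the training list, the labels in the test list are used only for scoring, not for fitting.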
We use audio data in .wav format sampled at 16 kHz (SR=16000), organized into two folders: normal/ for recordings during regular operation and abnormal/ for recordings during thermal runaway events. To extract time-frequency features from the audio, we use Librosa to compute 64-dimensional Log-Mel spectrograms (with n_mels=64). The extracted features form a 2D matrix of shape [T, 64], where T denotes the number of time frames. For unified modeling, each [1, 64] frame is treated as an individual input sample during training.
4.2. Model Training
We employ a symmetrical two-layer fully connected architecture and introduce an attention mechanism to enhance the model’s ability to focus on key features. The overall structure is as follows:
Encoder:
Linear (64 → 128), Activation: ReLU.
Linear (128 → 64), Output: encoded latent vector z.
Self-Attention Mechanism:
We introduce a self-attention module on the latent vector z, formulated as follows:
\[ Q = W_Q \cdot z, \quad K = W_K \cdot z, \quad V = W_V \cdot z \tag{1} \]
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d}} \right) \cdot V \tag{2} \]
This module enhances the model's ability to focus on time-frequency components with significant abnormal characteristics, thereby improving the robustness and sensitivity of anomaly detection.
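For readers who want to see Eqs. (1) and (2) in executable form, the sketch below implements standard scaled dot-product self-attention with plain Python lists. It makes several assumptions the paper does not confirm: a small group of latent vectors is treated as the sequence being attended over, the scaling term is the square root of the projection dimension, and a row-vector convention (z W rather than W z) is used. All function and variable names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(z, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the rows of z."""
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d = len(q[0])                                  # projection dimension
    k_t = [list(col) for col in zip(*k)]           # transpose of K
    scores = matmul(q, k_t)                        # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy example: two 2-dimensional latent vectors, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
latents = [[1.0, 0.0], [0.0, 1.0]]
attended, attn_weights = self_attention(latents, identity, identity, identity)
```

Each row of the weight matrix sums to one, so every output vector is a convex combination of the value vectors, weighted towards the entries most similar to the query.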
Decoder:
Linear (64 → 128), Activation: ReLU
Linear (128 → 64), Output: reconstructed features
The model uses MSELoss as the reconstruction loss function and is optimized with the Adam optimizer. The learning rate is set to 1e-3, and the model is trained for a total of 200 epochs, with the data shuffled randomly in each epoch. The training batch size is 128, meaning that 128 frames are used for each model update.
4.3. Anomaly Evaluation
After training, during the inference phase, the model takes a test frame as input, extracts the latent representation z through the encoder and attention module, and reconstructs the output \hat{x} using the decoder. During training, the reconstruction error (mean squared error) is used as the loss function to optimize the model's reconstruction ability, expressed as:
\[ \mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert^{2} \tag{3} \]
To better characterize the difference between abnormal and normal samples in the latent space during inference, the model constructs a multivariate Gaussian distribution based on the mean (μ) and covariance matrix (Σ) of all z values from the training set. The Mahalanobis distance is then used as the final anomaly scoring metric:
\[ \mathrm{Score} = (z - \mu)^{T} \Sigma^{-1} (z - \mu) \tag{4} \]
This distance measures how far the latent representation of the current sample deviates from the distribution of normal training data. A higher score indicates a higher likelihood of being an anomaly.
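The scoring step in Eq. (4) can be sketched in a few lines. The version below is an illustration under assumptions rather than the authors' code: it uses the unbiased sample covariance, evaluates the quadratic form by solving a linear system instead of inverting Σ explicitly, and uses toy 2-dimensional latents; all names are hypothetical.

```python
def mean_and_cov(rows):
    """Sample mean and unbiased covariance of a list of latent vectors."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis_score(z, mu, cov):
    """Quadratic form (z - mu)^T cov^{-1} (z - mu), as in Eq. (4)."""
    diff = [zi - mi for zi, mi in zip(z, mu)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff)))


# Toy latents standing in for encoder outputs of normal training frames.
train_latents = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu, cov = mean_and_cov(train_latents)
near = mahalanobis_score([1.0, 1.0], mu, cov)   # at the training mean
far = mahalanobis_score([5.0, 5.0], mu, cov)    # far from the training data
```

With 64-dimensional latents the covariance can be ill-conditioned; adding a small multiple of the identity before solving is a common safeguard, though the paper does not say whether one was used.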
5. EXPERIMENTAL RESULTS
We evaluated the anomaly detection performance of each model on the test set using the AUC (Area Under ROC Curve) metric to measure recognition accuracy. The results are shown in the table below:
Table 3: Performance comparison of different models.
Model | AUC, Thermal Runaway ID 1 | AUC, Thermal Runaway ID 2 | AUC, Thermal Runaway ID 3 | hmean | Average Inference Time/ms
AE | 0.4953 | 0.5236 | 0.4864 | 0.5012 | 0.12
BEATs | 0.5861 | 0.4253 | 0.5742 | 0.5173 | 3.52
EATs | 0.6324 | 0.5236 | 0.5637 | 0.5697 | 1.84
Dasheng | 0.7125 | 0.5324 | 0.6152 | 0.6112 | 0.95
Proposed | 0.6892 | 0.6125 | 0.5936 | 0.6291 | 0.17
The experimental results show that various models trained on this dataset exhibit different levels of performance in the thermal runaway sound anomaly detection task, which validates the effectiveness of the constructed dataset in evaluating model recognition capabilities in complex battery safety scenarios. Among them, the proposed autoencoder achieved the most consistent results across the three experiments (AUC between 0.5936 and 0.6892), with a harmonic mean AUC (hmean) of 0.6291, the highest among all models, although Dasheng scored higher on Experiments 1 and 3 individually. This reflects comparatively strong overall detection ability and good robustness. Additionally, with an average inference time of 0.17 ms, it strikes a balance between performance and efficiency, demonstrating high practical deployment value.
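The hmean column can be checked directly from the per-experiment AUC values in Table 3. The short sketch below recomputes it with the standard library; the dictionary layout and variable names are just one convenient (hypothetical) way to hold the table.

```python
from statistics import harmonic_mean

# Per-experiment AUC values copied from Table 3 (experiments 1, 2, 3).
auc_per_experiment = {
    "AE": [0.4953, 0.5236, 0.4864],
    "BEATs": [0.5861, 0.4253, 0.5742],
    "EATs": [0.6324, 0.5236, 0.5637],
    "Dasheng": [0.7125, 0.5324, 0.6152],
    "Proposed": [0.6892, 0.6125, 0.5936],
}

# The harmonic mean penalises a single weak experiment more than the
# arithmetic mean would, so it rewards consistency across experiments.
hmean = {name: harmonic_mean(values)
         for name, values in auc_per_experiment.items()}
best_model = max(hmean, key=hmean.get)
```

Recomputed this way, the values agree with the reported hmean column to within rounding, and the proposed model comes out highest.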
6. CONCLUSION AND FUTURE WORK
This study presents the first publicly available multi-channel acoustic dataset covering the entire thermal runaway process of lithium-ion batteries, comprehensively recording crucial acoustic changes from normal operation to thermal runaway onset. The LiB-TRAD dataset was collected in real experimental scenarios with precise anomaly labeling and phase segmentation, providing a standardized benchmark for training and evaluating anomaly detection algorithms.
We systematically evaluated various typical unsupervised anomaly detection models, including AE, BEATs, EATs, and Dasheng, and proposed an improved autoencoder model as the baseline for our dataset. Evaluation results demonstrate that our dataset effectively distinguishes the stability and generalization capabilities of different models across various experimental scenarios. The proposed improved autoencoder achieved the most balanced performance across three thermal runaway experiments (hmean reaching 0.6291), not only validating the dataset's applicability for acoustic anomaly detection tasks but also establishing an effective benchmark for future research.
Future work will focus on finer-grained sound event recognition, particularly the acoustic signature modeling of safety valves during early release processes to identify potential warning signals. We also plan to continuously expand the dataset scale, enrich experimental conditions and sensor configurations to enhance model robustness and practical deployment capabilities, ultimately promoting the real-world application of acoustic sensing technology in lithium battery safety monitoring.
REFERENCES
[1] E. Kabalci, “Design and analysis of a hybrid renewable energy plant with solar and wind power,” Energy Convers. Manage., vol. 72, pp. 51–59, 2013.
[2] J. B. V. Subrahmanyam, P. Alluvada, K. Bhanupriya, et al., “Renewable energy systems: Development and perspectives of a hybrid solar-wind system,” Eng. Technol. Appl. Sci. Res., vol. 2, no. 1, pp. 177–181, 2012.
[3] Y. Yang, C. Lian, C. Ma, et al., “Research on energy storage optimization for large-scale PV power stations under given long-distance delivery mode,” Energies, vol. 13, no. 1, p. 27, 2019.
[4] G. Assat and J.-M. Tarascon, “Fundamental understanding and practical challenges of anionic redox activity in Li-ion batteries,” Nat. Energy, vol. 3, pp. 373–386, 2018.
[5] L. Chen, M. Fiore, J. E. Wang, et al., “Readiness level of sodium-ion battery technology: a materials review,” Adv. Sustain. Syst., vol. 2, 2018, Art. no. 1700153.
[6] J. W. Choi and D. Aurbach, “Promise and reality of post-lithium-ion batteries with high energy densities,” Nat. Rev. Mater., vol. 1, pp. 1–16, 2016.
[7] R. Hagiwara, K. Matsumoto, J. Hwang, et al., “Sodium Ion Batteries using Ionic Liquids as Electrolytes,” Chem. Rec., vol. 19, pp. 758–770, 2019.
[8] X. Lin, M. Salari, L. M. Arava, et al., “High temperature electrical energy storage: advances, challenges, and frontiers,” Chem. Soc. Rev., vol. 45, pp. 5848–5887, 2016.
[9] Q. Wang, B. Mao, S. I. Stoliarov, et al., “A review of lithium ion battery failure mechanisms and fire prevention strategies,” Prog. Energy Combust. Sci., vol. 73, pp. 95–131, 2019.
[10] G. Wang, D. Kong, P. Ping, et al., “Revealing particle venting of lithium-ion batteries during thermal runaway: A multi-scale model toward multiphase process,” eTransportation, vol. 16, p. 100237, 2023.
[11] J. Q. Li, D. N. Sun, X. Jin, et al., “Lithium-ion battery overcharging thermal characteristics analysis and an impedance-based electro-thermal coupled model simulation,” Appl. Energy, vol. 254, p. 113574, 2019.
[12] N. Lyu, Y. Jin, S. Miao, et al., “Fault warning and location in battery energy storage systems via venting acoustic signal,” IEEE J. Emerg. Sel. Topics Power Electron., vol. 11, no. 1, pp. 100–108, 2021.
[13] D. Kong, G. Wang, P. Ping, et al., “A coupled conjugate heat transfer and CFD model for the thermal runaway evolution and jet fire of 18650 lithium-ion battery under thermal abuse,” eTransportation, vol. 12, p. 100157, 2022.
[14] N. Lyu, Y. Jin, R. Xiong, et al., “Real-time overcharge warning and early thermal runaway prediction of Li-ion battery by online impedance measurement,” IEEE Trans. Ind. Electron., vol. 69, no. 2, pp. 1929–1936, 2021.
[15] H. Rahimi-Eichi, U. Ojha, F. Baronti, et al., “Battery management system: An overview of its application in the smart grid and electric vehicles,” IEEE Ind. Electron. Mag., vol. 7, no. 2, pp. 4–16, 2013.
[16] M. Duff and J. Towey, “Two ways to measure temperature using thermocouples feature simplicity, accuracy, and flexibility,” Analog Dialogue, vol. 44, no. 10, pp. 1–6, 2010.
[17] J. Christensen, D. Cook, and P. Albertus, “An efficient parallelizable 3D thermoelectrochemical model of a Li-ion cell,” J. Electrochem. Soc., vol. 160, no. 11, pp. A2258–A2264, 2013.
[18] H. Chikh-Bled, K. Chah, Á. González-Vila, et al., “Behavior of femtosecond laser-induced eccentric fiber Bragg gratings at very high temperatures,” Opt. Lett., vol. 41, no. 17, pp. 4048–4051, 2016.
[19] L. H. J. Raijmakers, D. L. Danilov, R. A. Eichel, et al., “A review on various temperature-indication methods for Li-ion batteries,” Appl. Energy, vol. 240, pp. 918–945, 2019.
[20] R. Li, D. Ren, D. Guo, et al., “Volume deformation of large-format lithium ion batteries under different degradation paths,” J. Electrochem. Soc., vol. 166, no. 16, p. A4106, 2019.
[21] T. Su, N. Lyu, Z. Zhao, et al., “Safety warning of lithium-ion battery energy storage station via venting acoustic signal detection for grid application,” J. Energy Storage, vol. 38, p. 102498, 2021.
[22] M. F. Ng, J. Zhao, Q. Yan, et al., “Predicting the state of charge and health of batteries using data-driven machine learning,” Nat. Mach. Intell., vol. 2, no. 3, pp. 161–170, 2020.
[23] K. Dohi, K. Imoto, N. Harada, et al., “Description and discussion on DCASE 2022 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring applying domain generalization techniques,” arXiv preprint arXiv:2206.05876, 2022.
[24] T. Nishida, N. Harada, D. Niizumi, et al., “Description and discussion on DCASE 2025 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” arXiv preprint arXiv:2506.10097, 2025.
[25] Y. Wang, Y. Zheng, Y. Zhang, et al., “Unsupervised anomalous sound detection for machine condition monitoring using classification-based methods,” Appl. Sci., vol. 11, no. 23, p. 11128, 2021.
[26] C. Zhang, Y. Liu, and H. Fu, “AE2-Nets: Autoencoder in autoencoder networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2577–2585.
[27] S. Chen, Y. Wu, C. Wang, et al., “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. 40th Int. Conf. Mach. Learn., 2023, pp. 5178–5193.
[28] W. Chen, Y. Liang, Z. Ma, et al., “EAT: Self-supervised pre-training with efficient audio transformer,” arXiv preprint arXiv:2401.03497, 2024.
[29] H. Dinkel, Z. Yan, Y. Wang, et al., “Scaling up masked audio encoder learning for general audio classification,” arXiv preprint arXiv:2406.06992, 2024.
alter the distribution of the observed sound data. These variations can result from differences in operating speed, machine load, heating temperature, microphone arrangement, environmental noise, and other factors. Two domains are defined: the source domain, representing the original condition with sufficient training data, and the target domain, representing another condition where only limited samples are available. This year’s task follows the 2022 to 2024 Task 2 [10–12] setting, where the domain information is assumed to be unknown in the test phase and anomalies from both domains have to be detected with a single threshold. In this case, domain generalization is required to achieve good performance.
To further pursue the rapid development of ASD systems in real-world scenarios, it is highly important to solve ASD (a) against completely novel machine types, (b) with only one section of training data, and (c) without handcrafted tunings that depend on test data. This is because in real-world scenarios, customers may only possess a single novel machine, and collecting test data, especially anomalous samples, for handcrafted tuning may be infeasible. This problem setting was named the “first-shot problem”, and the 2023 and 2024 Task 2 [11,12] was organized based on this problem setting. The first-shot problem was implemented by introducing two key features to the dataset: (i) the development and evaluation datasets consist of entirely different sets of machine types, and (ii) each machine type in the dataset contains only a single section. Note that until 2022 Task 2, the provided dataset included multiple sections for each machine type, and the development and evaluation datasets shared the same machine types.
The DCASE2025 Challenge Task 2 retains the previous task setting as a first-shot problem under domain generalization conditions, while introducing several modifications. First, we have provided additional supplementary data, including clean machine recordings and noise samples. These resources may reflect practical scenarios, such as collecting clean machine data when a factory is idle or gathering noise recordings when the machine is not running. Participants are free to incorporate these additional sources to enhance the accuracy of their models. Second, although large-scale models (such as pretrained networks and ensembles) have become increasingly popular in this task, lightweight models capable of running on edge devices also remain an important area of research. To acknowledge this, participants were optionally asked to report the computational complexity of their solutions in terms of Multiply-Accumulate Operations (MACs). Although this metric does not affect the official rankings, it provides valuable insight into the balance between model complexity and performance.
3. TASK SETUP
3.1. Dataset
The dataset for this task is divided into three categories: the development dataset, the
additional training dataset, and the evaluation dataset. The development dataset contains seven
machine types, while the additional training and evaluation datasets include eight machine
types, with each machine type consisting of a single section. A machine type refers to a
category of machines, such as fans or gearboxes, and a section represents a subset of the
entirety of the data associated with each machine type.
All recordings are single-channel, last 6 to 10 seconds, and have a sampling rate of 16 kHz.
The machine sounds recorded at laboratories were mixed with environmental noise recorded at
factories and in the suburbs to create each sample in the dataset. For further details of the
recording process, please refer to the papers on ToyADMOS2 [13] and MIMII DG [14].
The development dataset provides seven machine types (fan, gearbox, bearing, slide rail, valve,
ToyCar, ToyTrain), and each machine type has one section that contains a complete set of the
training and test data. Each section contains (i) 990 normal clips from a source domain for
training, (ii) 10 normal clips from a target domain for training, (iii) 100 clips of
supplementary sound data containing either clean normal machine sounds in the source domain or
noise-only sounds, and (iv) 100 normal clips and 100 anomalous clips from both domains for the
test. To assist participants, domain information (source/target) was included in the test data.
For four machine types (fan, gearbox, valve, and ToyCar), details regarding operational or
environmental conditions were provided in the file names and attribute CSV files. However, for
the remaining three machine types, these attributes were not disclosed.
The additional training dataset provides eight novel machine types (AutoTrash, HomeCamera,
ToyPet, ToyRCCar, BandSealer, Polisher, ScrewFeeder, CoffeeGrinder). Each section consists of
(i) 990 normal clips in a source domain for training, (ii) 10 normal clips in a target domain
for training, and (iii) 100 clips of supplementary sound data containing either clean normal
machine sounds in the source domain or noise-only sounds. For four machine types (HomeCamera,
ToyRCCar, BandSealer, and CoffeeGrinder), attributes were provided in this dataset. For the
other four machine types, attributes were concealed. The evaluation dataset provides the test
clips that correspond to the additional training dataset, i.e., data of the same machine types
as the additional training dataset. Each section consists of 200 test clips, none of which have
a condition label (i.e., normal or anomaly), domain information, or attribute information.
Participants are required to train a model for each new machine type using only a single
section per machine type.
3.2. Evaluation metrics
To assess overall detection performance, we employed the area under the receiver operating
characteristic curve (AUC). Additionally, we used the partial AUC (pAUC) to evaluate performance
in a low false-positive rate range [0, p], where we set p = 0.1. To evaluate each system under
the domain generalization setting, we compute the AUC for each domain and pAUC for each section
as
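\[
\mathrm{AUC}_{m,n,d} = \frac{1}{N_d^- N_n^+} \sum_{i=1}^{N_d^-} \sum_{j=1}^{N_n^+}
H\left(A_\theta(x_j^+) - A_\theta(x_i^-)\right), \tag{2}
\]
\[
\mathrm{pAUC}_{m,n} = \frac{1}{\lfloor p N_n^- \rfloor N_n^+} \sum_{i=1}^{\lfloor p N_n^- \rfloor}
\sum_{j=1}^{N_n^+} H\left(A_\theta(x_j^+) - A_\theta(x_i^-)\right), \tag{3}
\]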
where $m$ and $n$ represent the index of a machine type and a section, respectively,
$d \in \{\text{source}, \text{target}\}$ represents a domain, $\lfloor \cdot \rfloor$ is the
flooring function, and $H(y)$ returns 1 when $y > 0$ and 0 otherwise. Here,
$\{x_i^-\}_{i=1}^{N_d^-}$ are the normal test clips in domain $d$ in section $n$ of machine type
$m$, and $\{x_j^+\}_{j=1}^{N_n^+}$ are all the anomalous test clips in section $n$ of machine
type $m$. $N_d^-$, $N_n^-$, and $N_n^+$ represent the number of normal test clips in domain $d$,
normal test clips in section $n$, and anomalous test clips in section $n$, respectively.
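
The pair-counting form of Eqs. (2) and (3) can be sketched directly in code. The following is a minimal illustration, not the official evaluator; the function names and the example scores are hypothetical, and it assumes that the pAUC truncation keeps the normal clips with the highest anomaly scores (sorted in descending order), which the text above does not state explicitly.

```python
from math import floor


def heaviside(y):
    """H(y): 1 when y > 0, otherwise 0."""
    return 1 if y > 0 else 0


def auc(normal_scores, anomalous_scores):
    """Eq. (2): share of (normal, anomalous) pairs where the anomalous clip scores higher."""
    hits = sum(heaviside(a - n) for n in normal_scores for a in anomalous_scores)
    return hits / (len(normal_scores) * len(anomalous_scores))


def pauc(normal_scores, anomalous_scores, p=0.1):
    """Eq. (3): the same pair count over the floor(p * N-) highest-scoring normal clips.

    Assumption: normal clips are sorted by descending anomaly score before truncation.
    """
    k = floor(p * len(normal_scores))
    if k == 0:
        raise ValueError("need at least 1/p normal clips")
    hardest = sorted(normal_scores, reverse=True)[:k]
    hits = sum(heaviside(a - n) for n in hardest for a in anomalous_scores)
    return hits / (k * len(anomalous_scores))


# Hypothetical anomaly scores for one section
normal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
anomalous = [9.5, 5.5]
print(auc(normal, anomalous))           # 14 of the 20 pairs are ordered correctly
print(pauc(normal, anomalous, p=0.2))   # 1 of the 4 pairs against the two hardest normals
```

For the AUC of a single domain, `normal_scores` would hold only the normal clips of that domain, while `anomalous_scores` holds all anomalous clips of the section.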
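The official score $\Omega$ is given by the harmonic mean of the AUC and pAUC scores over all
machine types and sections:
\[
\Omega = h\left\{ \mathrm{AUC}_{m,n,d},\ \mathrm{pAUC}_{m,n} \mid m \in \mathcal{M},\
n \in \mathcal{S}(m),\ d \in \{\text{source}, \text{target}\} \right\}, \tag{4}
\]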
where $h\{\cdot\}$ represents the harmonic mean, $\mathcal{M}$ is the set of given machine
types, and $\mathcal{S}(m)$ represents the set of sections for machine type $m$. Specifically,
$\mathcal{S}(m) = \{00\}$ for the dataset in 2024-2025.
Additionally, although not included in the official rankings, participants were optionally asked
to provide information on the computational complexity of their models in terms of MAC
operations. It was recommended that this be calculated using the open-source implementation
available in [15].
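
As a rough illustration of Eq. (4), the sketch below pools per-domain AUC values and per-section pAUC values and takes their harmonic mean. The dictionary layout and the function name are assumptions made for this example; the numbers are the ToyCar MSE-mode baseline values from Table 1, so the result describes a single machine type rather than an official score.

```python
from statistics import harmonic_mean


def official_score(auc_values, pauc_values):
    """Eq. (4): harmonic mean over all AUC and pAUC values.

    auc_values:  {(machine, section, domain): AUC}   (hypothetical layout)
    pauc_values: {(machine, section): pAUC}          (hypothetical layout)
    """
    pooled = list(auc_values.values()) + list(pauc_values.values())
    return harmonic_mean(pooled)


# ToyCar, MSE mode, section 00 (values in percent, taken from Table 1)
auc_values = {
    ("ToyCar", "00", "source"): 71.05,
    ("ToyCar", "00", "target"): 53.32,
}
pauc_values = {("ToyCar", "00"): 49.79}

print(round(official_score(auc_values, pauc_values), 2))
```

Because the harmonic mean is dominated by its smallest terms, a system with a weak target-domain AUC cannot compensate with a very high source-domain AUC.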
3.3. Baseline systems and results
The task organizers offer a baseline system using Autoencoders (AEs) with two operating modes,
identical to the 2023 Task 2 baseline. While both modes use Autoencoders for training, they
differ in anomaly score computation. This paper presents the system and its detection
performance; details can be found in [16].
3.3.1. Autoencoder training
The AE is trained for both operating modes using log-mel-spectrograms of training sound clips
$X = [X_1, \ldots, X_T]$, where $X_t \in \mathbb{R}^F$ for $t = 1, \ldots, T$ represents the
frame-wise feature vector at frame $t$, and $F = 128$ and $T$ are the number of mel-filters and
time-frames, respectively. For input, $P = 5$ consecutive frames are concatenated as
$\psi_t = [X_t^\top, \ldots, X_{t+P-1}^\top]^\top \in \mathbb{R}^D$ for each $t$, with
$D = P \times F = 640$. Model parameters are trained by minimizing the mean squared error (MSE)
between the input $\psi_t$ and the reconstructed output $r_\theta(\psi_t)$ for all inputs from
the training data.
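
The frame concatenation described above can be sketched as a sliding window over the spectrogram frames. This is a minimal pure-Python illustration with a hypothetical function name and dummy data; the baseline implementation in [16] may organize the computation differently.

```python
def stack_frames(frames, n_context=5):
    """Concatenate n_context consecutive frames into one input vector per position.

    frames: list of T frame vectors, each of length F.
    Returns K = T - n_context + 1 vectors, each of length D = n_context * F.
    """
    n_frames = len(frames)
    if n_frames < n_context:
        raise ValueError("clip is shorter than the context window")
    return [
        [value for frame in frames[t:t + n_context] for value in frame]
        for t in range(n_frames - n_context + 1)
    ]


# Dummy log-mel spectrogram with T = 10 frames and F = 128 mel bins
spectrogram = [[float(t)] * 128 for t in range(10)]
inputs = stack_frames(spectrogram, n_context=5)
print(len(inputs), len(inputs[0]))  # K = 6 input vectors with D = 640 entries each
```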
3.3.2. Simple Autoencoder mode
This mode uses the mean MSE of all features derived from the given sound clip as its anomaly
score, i.e.,
\[
A_\theta(X) = \frac{1}{DK} \sum_{k=1}^{K} \left\| \psi_k - r_\theta(\psi_k) \right\|_2^2, \tag{5}
\]
where $K = T - P + 1$, and $\|\cdot\|_2$ represents the $\ell_2$ norm.
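
Eq. (5) amounts to averaging the squared reconstruction error over all K input vectors and all D dimensions. A minimal sketch, assuming the inputs and their reconstructions are already available as lists of equal-length vectors (the function name and the toy values are hypothetical):

```python
def simple_ae_score(inputs, reconstructions):
    """Eq. (5): squared reconstruction error averaged over K vectors and D dimensions."""
    n_vectors = len(inputs)      # K
    n_dims = len(inputs[0])      # D
    total = sum(
        (x - r) ** 2
        for psi, rec in zip(inputs, reconstructions)
        for x, r in zip(psi, rec)
    )
    return total / (n_dims * n_vectors)


# Toy example with K = 2 vectors of dimension D = 2
psi = [[1.0, 2.0], [3.0, 4.0]]
rec = [[1.0, 0.0], [0.0, 4.0]]
print(simple_ae_score(psi, rec))  # (0 + 4 + 9 + 0) / 4
```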
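3.3.3. Selective Mahalanobis mode
In this mode, the Mahalanobis distance between the system input and the reconstructed feature is
used to compute the anomaly score. The anomaly score is defined as
\[
A_\theta(X) = \frac{1}{DK} \sum_{k=1}^{K} \min\left\{ D_s(\psi_k, r_\theta(\psi_k)),\
D_t(\psi_k, r_\theta(\psi_k)) \right\}, \tag{6}
\]
\[
D_s(\cdot) = \mathrm{Mahalanobis}(\psi_k, r_\theta(\psi_k), \Sigma_s^{-1}), \tag{7}
\]
\[
D_t(\cdot) = \mathrm{Mahalanobis}(\psi_k, r_\theta(\psi_k), \Sigma_t^{-1}), \tag{8}
\]
where $\Sigma_s^{-1}$ and $\Sigma_t^{-1}$ are the inverse covariance matrices of
$r_\theta(\psi_k) - \psi_k$ for the source and target domain data of each machine type,
respectively.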
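
The selection step in Eq. (6) can be sketched as follows. This illustration takes the two inverse covariance matrices as given nested lists and uses the standard (non-squared) Mahalanobis distance; whether the baseline in [16] uses the squared or non-squared form is an assumption here, and all names and values are hypothetical.

```python
from math import sqrt


def mahalanobis(x, y, precision):
    """Standard Mahalanobis distance between x and y for a given inverse covariance matrix."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    quad = sum(diff[i] * precision[i][j] * diff[j] for i in range(n) for j in range(n))
    return sqrt(quad)


def selective_mahalanobis_score(inputs, reconstructions, precision_source, precision_target):
    """Eq. (6): per vector, keep the smaller of the source and target distances, then average."""
    n_vectors = len(inputs)      # K
    n_dims = len(inputs[0])      # D
    total = sum(
        min(mahalanobis(psi, rec, precision_source), mahalanobis(psi, rec, precision_target))
        for psi, rec in zip(inputs, reconstructions)
    )
    return total / (n_dims * n_vectors)


# Toy example: identity precision for the source domain, 4 * identity for the target domain
precision_s = [[1.0, 0.0], [0.0, 1.0]]
precision_t = [[4.0, 0.0], [0.0, 4.0]]
score = selective_mahalanobis_score([[3.0, 4.0]], [[0.0, 0.0]], precision_s, precision_t)
print(score)  # min(5, 10) / (2 * 1)
```

Taking the minimum means a clip is scored against whichever domain statistics explain its reconstruction error better, so no domain label is needed at test time.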
3.3.4. Results
Table 1 presents the AUC and pAUC results for the two baseline systems on the development
dataset, with the averages and standard deviations computed from five independent trials.
Table 1: Baseline results for the development dataset.
Machine type  Mode     AUC [%] (source)  AUC [%] (target)  pAUC [%]
ToyCar        MSE      71.05 ± 0.50      53.32 ± 0.56      49.79 ± 0.49
ToyCar        MAHALA   73.17 ± 0.39      50.91 ± 0.85      49.05 ± 0.05
ToyTrain      MSE      61.76 ± 0.74      56.46 ± 0.47      50.19 ± 0.25
ToyTrain      MAHALA   50.87 ± 2.88      46.15 ± 1.77      48.32 ± 0.05
bearing       MSE      66.53 ± 2.63      53.15 ± 1.99      61.12 ± 0.59
bearing       MAHALA   63.63 ± 1.15      59.03 ± 1.79      61.86 ± 0.36
fan           MSE      70.96 ± 0.94      38.75 ± 0.74      49.46 ± 0.53
fan           MAHALA   77.99 ± 0.23      38.56 ± 0.58      50.82 ± 0.06
gearbox       MSE      64.80 ± 1.48      50.49 ± 1.22      52.49 ± 0.37
gearbox       MAHALA   73.26 ± 0.78      51.61 ± 0.52      55.07 ± 0.47
slider        MSE      70.10 ± 1.01      48.77 ± 1.07      52.32 ± 0.36
slider        MAHALA   73.79 ± 1.95      50.27 ± 1.15      53.61 ± 0.26
valve         MSE      63.53 ± 2.90      67.18 ± 1.75      57.35 ± 1.96
valve         MAHALA   56.22 ± 2.22      61.00 ± 2.98      52.53 ± 1.32
4. CHALLENGE RESULTS
4.1. Overall results
We received 119 submissions from 35 teams. Twenty teams outperformed both baselines, a slight
increase compared to last year's task (11 out of 27 teams). Looking at the results for each
domain separately, six teams surpassed the baselines on the source-domain AUC, while 25 did so
on the target-domain AUC. Four teams achieved higher AUCs than the baselines in both domains.
This shows the difficulty of improving the performance on both the source and target domains at
the same time. Figure 1 shows the AUC values of the top 10 teams. In the source domain, whether
each team could beat the baseline was highly machine-dependent, and many teams struggled to
outperform the baseline on average. Specifically, systems tended to perform poorly on machines
for which attribute information was available, although it is unclear whether this factor is
actually relevant. In contrast, all of the top 10 teams outperformed the baselines in the target
domain in terms of the harmonic mean.
Figure 2 compares the AUC values of the top 20 teams between the development and evaluation
datasets. As can be seen, achieving high AUC values on the development dataset does not indicate
high AUC on the evaluation dataset. This is a typical trend in the first-shot problem setting,
which started in 2023, and shows the difficulty of finding an approach that is robust to machine
types in the absence of test data. Thus, building an ASD system that works well for unknown
machine types remains both a difficult and important challenge. In Figure 3, we compare the
harmonic mean AUC values in the source and target domains of the evaluation dataset among the
top 20 teams. The figure shows a negative correlation between the AUC in the source and target
domains. While the top 3 teams achieved very close official scores (within 1.0% difference),
their balance between the source and target AUC varied. Achieving well-balanced performance
between the source and target domains may be important for achieving high ranks.
4.2. Trends in model size
We analyzed the computational-complexity trends of the submitted systems. MACs were reported by
21 teams. Figure 4 plots each submission's MAC count against its official score; if a team
submitted multiple systems with identical MAC values, only the highest-ranked system was
retained.
The figure shows a broad range of MAC counts across submissions. It also highlights that larger
computational budgets did not necessarily yield higher scores. Notably, two submissions from
different teams [17,18] achieved scores exceeding the baselines while using fewer MACs. This
shows the feasibility of computationally efficient solutions for first-shot UASD, and could be
one of the future directions of research.
Figure 1: Evaluation results of the top 10 teams in the ranking. Average source (top) and
target-domain AUC (bottom) for each machine type ("*" indicates that attributes are hidden).
Labels "A" and "M" denote the simple Autoencoder mode and the selective Mahalanobis mode,
respectively.
Figure 2: Comparison of the harmonic mean of AUC for the development and evaluation datasets
across teams.
Figure 3: Comparison of the harmonic mean of AUC for the source and target domains in the
evaluation dataset across teams.
4.3. New approaches seen in the top-ranked teams
a. Use of pretrained models
This year, following the trend from previous years, many participants adopted pretrained models
in their anomaly-detection pipelines. Many of those teams fine-tuned them with an attribute or
domain classification-based auxiliary task, such as the 1st-, 4th-, and 5th-ranked teams
[19–21]. On the other hand, interestingly, several of the high-ranking teams achieved strong
performance using frozen pretrained networks, exploiting intermediate-layer features along with
anomaly-score normalization [22]. The 2nd [23] and 8th [24] place teams solely used frozen
models within their submissions, while the 4th-place team [20] ensembled frozen networks with
fine-tuned ones. Nevertheless, teams that trained lightweight models from scratch with
classification-based tasks also achieved high ranks, including the 3rd-ranked team [17, 18],
showing that pretrained models are not an absolute prerequisite for competitive performance.
Overall, diverse approaches were all competitive this year, and each approach may still have
room for further research.
Figure 4: Comparison of the official scores of submissions against their MAC values.
b. Use of supplemental data
Participants tried utilizing the newly released supplemental clean-machine and noise recordings,
applying them in two distinct ways. In the first way, they were used for data augmentation. The
1st [19] and several other top-10 teams [24–26] injected the supplemental clips as an extra
class in auxiliary classifiers, blended the noise signals with training samples, or leveraged
them in contrastive learning [18] to enrich feature space diversity. Conversely, the 3rd- and
4th-place teams [17, 20] used the clean/noise data to build enhancement modules that extracted
or denoised target machine sounds from the noisy training data, and supplied these extracted
signals to their main anomaly-detection networks.
5. CONCLUSION
We presented an overview of the DCASE 2025 Challenge Task 2. The task's aim was to develop ASD
systems that work for a novel machine type with a single section for each machine type, where
supplemental data such as clean machine sounds or noise-only sounds were also provided. We
discussed several new approaches seen in the challenge, such as how pretrained models were (or
were not) used and the use of the newly provided supplemental data. While we were not able to
discuss all new approaches, we hope that all the technical reports will contribute to the
advancements in the field of anomalous sound detection.
6. REFERENCES
[1] Y. Koizumi, S. Saito, H. Uematsu, and N. Harada, "Optimizing acoustic feature extractor for
anomalous sound detection based on Neyman-Pearson lemma," in Proc. EUSIPCO, 2017, pp. 698–702.
[2] Y. Kawaguchi and T. Endo, "How can we detect anomalies from subsampled audio signals?" in
Proc. IEEE MLSP, 2017.
[3] Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, "Unsupervised detection of
anomalous sound based on deep learning and the Neyman-Pearson lemma," IEEE/ACM TASLP, vol. 27,
no. 1, pp. 212–224, Jan. 2019.
[4] Y. Kawaguchi, R. Tanabe, T. Endo, K. Ichige, and K. Hamada, "Anomaly detection based on an
ensemble of dereverberation and anomalous sound extraction," in Proc. IEEE ICASSP, 2019,
pp. 865–869.
[5] Y. Koizumi, S. Saito, M. Yamaguchi, S. Murata, and N. Harada, "Batch uniformization for
minimizing maximum anomaly score of DNN-based anomaly detection in sounds," in Proc. IEEE
WASPAA, 2019, pp. 6–10.
[6] K. Suefusa, T. Nishida, H. Purohit, R. Tanabe, T. Endo, and Y. Kawaguchi, "Anomalous sound
detection based on interpolation deep neural network," in Proc. IEEE ICASSP, 2020, pp. 271–275.
[7] H. Purohit, R. Tanabe, T. Endo, K. Suefusa, Y. Nikaido, and Y. Kawaguchi, "Deep autoencoding
GMM-based unsupervised anomaly detection in acoustic signals and its hyper-parameter
optimization," in Proc. DCASE Workshop, 2020, pp. 175–179.
[8] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit,
K. Suefusa, T. Endo, M. Yasuda, and N. Harada, "Description and discussion on DCASE2020
challenge task2: Unsupervised anomalous sound detection for machine condition monitoring," in
Proc. DCASE Workshop, 2020, pp. 81–85.
[9] Y. Kawaguchi, K. Imoto, Y. Koizumi, N. Harada, D. Niizumi, K. Dohi, R. Tanabe, H. Purohit,
and T. Endo, "Description and discussion on DCASE 2021 challenge task 2: Unsupervised anomalous
detection for machine condition monitoring under domain shifted conditions," in Proc. DCASE
Workshop, 2021, pp. 186–190.
[10] K. Dohi, K. Imoto, N. Harada, D. Niizumi, Y. Koizumi, T. Nishida, H. Purohit, R. Tanabe,
T. Endo, M. Yamamoto, and Y. Kawaguchi, "Description and discussion on DCASE 2022 challenge
task 2: Unsupervised anomalous sound detection for machine condition monitoring applying domain
generalization techniques," in Proc. DCASE Workshop, 2022, pp. 26–30.
[11] K. Dohi, K. Imoto, N. Harada, D. Niizumi, Y. Koizumi, T. Nishida, H. Purohit, R. Tanabe,
T. Endo, and Y. Kawaguchi, "Description and discussion on DCASE 2023 challenge task 2:
First-shot unsupervised anomalous sound detection for machine condition monitoring," in Proc.
DCASE Workshop, 2023, pp. 31–35.
[12] T. Nishida, N. Harada, D. Niizumi, D. Albertini, R. Sannino, S. Pradolini, F. Augusti,
K. Imoto, K. Dohi, H. Purohit, R. Tanabe, T. Endo, and Y. Kawaguchi, "Description and discussion
on DCASE 2024 challenge task 2: First-shot unsupervised anomalous sound detection for machine
condition monitoring," in Proc. DCASE Workshop, 2024, pp. 111–115.
[13] N. Harada, D. Niizumi, D. Takeuchi, Y. Ohishi, M. Yasuda, and S. Saito, "ToyADMOS2: Another
dataset of miniature-machine operating sounds for anomalous sound detection under domain shift
conditions," in Proc. DCASE Workshop, 2021, pp. 1–5.
[14] K. Dohi, T. Nishida, H. Purohit, R. Tanabe, T. Endo, M. Yamamoto, Y. Nikaido, and
Y. Kawaguchi, "MIMII DG: Sound dataset for malfunctioning industrial machine investigation and
inspection for domain generalization task," in Proc. DCASE Workshop, 2022.
[15] L. Zhu, "Thop: Pytorch-opcounter," https://github.com/Lyken17/pytorch-OpCounter, 2019.
[16] N. Harada, N. Daisuke, T. Daiki, O. Yasunori, and Y. Masahiro, "First-shot anomaly
detection for machine condition monitoring: A domain generalization baseline," in Proc.
EUSIPCO, 2023, pp. 191–195.
[17] J. Yang, "A two stage fusion anomaly detection approach for Task2," DCASE2025 Challenge,
Tech. Rep., June 2025.
[18] Q. Zhou and S. Wu, "Machine anomalous sound detection combining convolutional auto-encoder
and contrastive learning," DCASE2025 Challenge, Tech. Rep., June 2025.
[19] L. Wang, "Pre-trained model enhanced anomalous sound detection system for DCASE2025 Task2,"
DCASE2025 Challenge, Tech. Rep., June 2025.
[20] F. Takuya, I. Kuroyanagi, and T. Toda, "The NU systems for DCASE 2025 Challenge Task 2,"
DCASE2025 Challenge, Tech. Rep., June 2025.
[21] A. Jiang, W. Liang, S. Feng, Y. Qiu, Y. Zhao, J. Li, P. Fan, W.-Q. Zhang, C. Lu, X. Chen,
Y. Qian, and J. Liu, "THUEE system for DCASE 2025 anomalous sound detection challenge,"
DCASE2025 Challenge, Tech. Rep., June 2025.
[22] K. Wilkinghoff, H. Yang, J. Ebbers, F. G. Germain, G. Wichern, and J. Le Roux, "Keeping the
balance: Anomaly score calculation for domain generalization," in Proc. IEEE ICASSP. IEEE,
2025, pp. 1–5.
[23] P. Saengthong and T. Shinozaki, "GenRep for first-shot unsupervised anomalous sound
detection of DCASE2025 Challenge," DCASE2025 Challenge, Tech. Rep., June 2025.
[24] T. Shiraga, K. Ozeki, T. Masuzaki, N. Tanaka, and T. Kuriyama, "Anomalous sound detection
method using contrastive learning," DCASE2025 Challenge, Tech. Rep., June 2025.
[25] X. Zheng, A. Jiang, B. Han, S. Zhang, W.-Q. Zhang, X. Chen, C. Lu, P. Fan, J. Liu, and
Y. Qian, "SJTU-AITHU system for DCASE 2025 anomalous sound detection challenge," DCASE2025
Challenge, Tech. Rep., June 2025.
[26] S. Zhang, F. Xiao, S. Fan, Q. Zhu, W. Wang, and J. Guan, "Anomalous sound detection using
pre-trained model with statistical feature difference representation," DCASE2025 Challenge,
Tech. Rep., June 2025.
Detection and Classification of Acoustic Scenes and Events 2025    30–31 October 2025, Barcelona, Spain
Audio-Based Pedestrian Detection in the Presence of Vehicular Noise
Yonghyun Kim1, Chaeyeon Han2, Akash Sarode3, Noah Posner2, Subhrajit Guhathakurta2, Alexander Lerch1
1Music Informatics Group, Georgia Institute of Technology, USA
2Center for Urban Resilience and Analytics, Georgia Institute of Technology, USA
3College of Computing, Georgia Institute of Technology, USA
Abstract: Audio-based pedestrian detection is a challenging task and has, thus far, only been
explored in noise-limited environments. We present a new dataset, results, and a detailed
analysis of the state-of-the-art in audio-based pedestrian detection in the presence of
vehicular noise. In our study, we conduct three analyses: (i) cross-dataset evaluation between
noisy and noise-limited environments, (ii) an assessment of the impact of noisy data on model
performance, highlighting the influence of acoustic context, and (iii) an evaluation of the
model's predictive robustness on out-of-domain sounds. The new dataset is a comprehensive
1321-hour roadside dataset. It incorporates traffic-rich soundscapes. Each recording includes
16 kHz audio synchronized with frame-level pedestrian annotations and 1 fps video thumbnails.
Index Terms: Audio databases, Sound event detection, Urban sound analysis, Pedestrian detection,
Vehicular noise
1. INTRODUCTION
Pedestrian volume data offer valuable insights into urban activity patterns, which support
planning efforts such as evaluating sidewalk improvements, assessing land use changes, and
identifying areas needing investments in safety and walkability [1]. These data also support
optimizing street connectivity and accessibility [1].
The widespread adoption of smartphones has brought new opportunities for automated human
mobility sensing, particularly through mobile GPS data. However, growing privacy concerns,
particularly under frameworks like the General Data Protection Regulation (GDPR) in the European
Union, have placed restrictions on using mobile location data to track individuals [2]. In
parallel, smart city initiatives have adopted the deployment of IoT-based sensors to monitor
activity in urban environments. These efforts have largely focused on vision-based systems, such
as computer vision and infrared cameras [3], although other sensing technologies have also been
tested.
Urban sound offers a promising alternative. Microphones are affordable, energy-efficient, and
effective in visually occluded environments. They can complement or replace cameras in contexts
where installation is impractical, such as shaded areas, narrow corridors, or locations or
scenarios where the costs of cameras are prohibitive. The general feasibility of using
microphone recordings for the detection of pedestrians has been shown recently for a
vehicle-free courtyard on a university campus [4].
This study addresses two key gaps in existing work. First, the generalizability of audio-based
models remains unclear. Given the variability in urban soundscapes, shaped by traffic, land use,
and average pedestrian activity levels, it is necessary to evaluate model performance across
data collected from different settings, particularly in the presence of typical urban noise.
Second, existing studies lack information on interpretability; it is unclear which sound
characteristics existing models rely on for detecting pedestrians.
Thus, the main contributions of this study are
(i) a new publicly available¹ dataset for audio-based pedestrian detection in the presence of
vehicular noise,
(ii) an investigation into how vehicular noise affects pedestrian detection performance, and
(iii) insights into the acoustic features that enable pedestrian detection.
¹ https://huggingface.co/datasets/urbanaudiosensing/ASPEDvb
2. RELATED WORK
2.1. Automated Pedestrian Detection Techniques
Urban pedestrian sensing technologies have evolved over decades, with video cameras and infrared
sensors being the most widely deployed to date [3], [5], [6]. Video-based systems, now commonly
augmented with computer vision and deep learning techniques, offer high spatial precision but
can suffer from limitations in occluded or low-light environments. Furthermore, such systems
often raise privacy concerns [7], [8]. Infrared counters, including active, passive, and
target-reflective types, are less intrusive but tend to undercount pedestrians, particularly in
high pedestrian volume scenarios [5], [6]. More sophisticated but cost-prohibitive options, such
as radar, piezoelectric strips, and inductive loops, are limited in spatial scalability [9]. In
contrast, audio-based pedestrian sensing remains underexplored, albeit with promising low-cost
deployment, resilience to visual obstructions, and potential privacy advantages. As demonstrated
by Seshadri et al. [4], audio-based systems can detect the presence of pedestrians by using
advances in acoustic scene analysis and deep learning, although challenges persist in signal
separation, data imbalance, and generalizability across urban soundscapes.
The generalizability of pedestrian detection models has been explored only recently. Rasouli et
al. assessed seven state-of-the-art detection algorithms under varying real-world conditions
using the JAAD dataset and found that model performance deteriorates in changed contexts, such
as different weather conditions, pedestrian behaviors, or occlusion [10]. They emphasized the
importance of incorporating diverse training data, showing that general-purpose object detection
models trained on broader datasets tend to generalize better than those trained narrowly on
pedestrian-focused inputs. More recently, Hasan et al. conducted a cross-dataset evaluation of
pedestrian detectors and similarly found that traditional models generalized poorly because
their training source usually does not contain dense pedestrian volume [11]. Interestingly,
general-purpose object detectors, not trained for pedestrian detection, showed better
cross-dataset performance, suggesting that varied training sources can improve model
transferability. Although these studies do not focus on audio-based models, they emphasize that
testing generalizability across datasets is crucial. In the context of audio-based sensing, this
implication is particularly relevant, as urban soundscapes can vary considerably depending on
the surrounding environment.
2.2. Audio-based Urban Sensing
Urban sound has emerged as a rich source of information for understanding city life,
complementing traditional visual or spatial data. Early urban noise studies primarily emphasized
environmental health and policy, focusing on the quantification of noise pollution from road
traffic, railways, and industrial sources [12]–[14]. These works led to the development of
standardized noise maps and public health guidelines (e.g., [15]). However, beyond its role as a
nuisance, urban sound is increasingly recognized as a medium that carries information about
human activity, mobility patterns, and the social vibrancy of public spaces [16], [17].
Fig. 1: Data collection sites on the Georgia Tech campus in Atlanta. Panels: ASPED v.b "Fifth
Street"; ASPED v.a "Cadell Courtyard"; ASPED v.a "Tech Walkway".
Recent advances in sensing technologies and machine learning enable granular, automated analysis
of urban soundscapes. Projects such as SONYC (Sounds of New York City) [18] have established a
baseline for classifying general urban sounds, using annotations of broad event categories that
include speech-oriented human sounds. Extending this scope of urban audio analysis further, Han
et al. and Seshadri et al. introduced audio-based methods for detecting pedestrian presence [4],
[19]. Their approach utilized a new large-scale dataset with pedestrian-focused annotations.
This dataset is composed of continuous recordings from real-world walking environments, enabling
models to learn from the full range of implicit acoustic cues (both speech and non-speech) that
signal pedestrian presence. Their results highlight the potential of microphone-based sensing as
a low-cost, privacy-preserving, and scalable complement to camera-based systems.
Despite recent progress, the generalizability of models across diverse urban environments and
the interpretability of these models remain underexplored. Given the variety of urban
soundscapes, understanding the level of generalizability and the audio cues that trigger models
to predict pedestrian presence is crucial for building robust and interpretable systems.
3. DATASET
This study builds on the previously published ASPED dataset [4], which includes annotated audio
and video data collected in a vehicle-free courtyard environment and will be referred to in the
following as ASPED v.a. This dataset provides the foundation for our pedestrian detection
framework and is described in detail by Seshadri et al. [4]. The recorder setup and
preprocessing steps are identical to those used in ASPED v.a.
In this study, we introduce an additional dataset, ASPED v.b, collected close to a road with
vehicular traffic. Figure 1 highlights the recording location on the Georgia Tech campus in red.
The vehicular noise primarily consists of engine sounds and intermittent shuttle buses operating
at slow speeds. The proportion of frames containing at least one detected vehicle is 9.16%,
29.00%, 36.43%, and 42.91% for radii of 1 m, 3 m, 6 m, and 9 m, respectively.
The ASPED v.b dataset contains 1,321 hours of audio from 4 different sessions. Each session
takes place over a time frame of approximately 40 hours and has audio data collected by 4 to 8
recorders spread along a street. The recording areas are monitored by 6 GoPro cameras, which
captured 1 fps video recordings totaling 2,946,513 frames across all cameras.
Fig. 2: Percentage of frames containing pedestrians by hour of day.
Fig. 3: Time-series distribution of pedestrian counts.
Figure 2 illustrates general pedestrian patterns derived from the labels of the ASPED v.b
dataset. The figure shows the ground truth number of pedestrians detected from video recordings
at a specific timestamp, visualized for the recording zone with a 6 m radius. Pedestrian
activity peaks between 3 PM and 5 PM and declines considerably at night. This class imbalance
reflects the ecological validity of the dataset, capturing realistic periods of low activity.
Fig. 3 shows the average number of pedestrians walking the street by day of the week and time,
computed as the rolling average of the number of pedestrians detected across all cameras. The
peaks align with the times that classes end on campus, demonstrating how pedestrian traffic on
campus is closely tied to the class schedule.
Lastly, 2.9% of total frames were obstructed by buses, preventing the video-based pedestrian
annotation from producing reliable labels. Therefore, these frames were flagged and discarded in
modeling.
4. EXPERIMENTAL SETUP
The goal of this research is to provide new insights into audio-based pedestrian detection that
might facilitate new approaches with enhanced performance. We conduct three key experiments to
explore these aspects: (i) a cross-dataset evaluation to assess the generalization capabilities
of models trained on noisy and noise-free sections of the datasets (v.a and v.b), (ii) an
evaluation of the effect of vehicle presence in training data on the performance with
vehicle-controlled test sets, and (iii) an analysis of the acoustic cues that the model
associates with pedestrian and non-pedestrian instances.
For the experiments, we reproduced the model proposed by Seshadri et al. [4]. This model
processes 10-second 16 kHz mono audio inputs by first computing power spectrograms using STFT
(window: 25 ms, hop: 10 ms). These are then converted to 64-bin mel spectrograms (125–7500 Hz)
and normalized via standard scaling. The mean and standard deviation values for normalization
were obtained from the implementation provided in the GitHub repository of the ASPED v.a model.²
The resulting log-mel spectrograms are fed into the VGGish backbone, pre-trained on AudioSet
[20], to extract a sequence of 10 acoustic embeddings, each corresponding to a snippet of 1 s of
the input. A Transformer encoder (1 layer, 4 attention heads, 128 hidden dimension), with added
positional encoding, processes these embeddings to capture temporal dependencies. Finally, a
linear projection layer with ReLU activation, followed by another linear layer and a sigmoid
activation function, outputs a binary classification probability for each 1 s snippet, resulting
in 10 predictions for one 10 s input, using a batch size of 256.
Table 1: Cross-dataset evaluation balanced accuracy (%).
Train Dataset   Test: ASPED v.a ↑   Test: ASPED v.b ↑
ASPED v.a       71.74               66.48
ASPED v.b       64.77               69.15
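
The front-end settings above fix the tensor shapes that reach the backbone, which can be checked with simple arithmetic. The sketch below assumes an STFT without padding and VGGish-style non-overlapping patches of 96 frames (0.96 s); both are assumptions for illustration and are not stated in the text.

```python
SAMPLE_RATE = 16_000          # Hz, from the text
CLIP_SECONDS = 10             # input length, from the text
WINDOW_MS, HOP_MS = 25, 10    # STFT window and hop, from the text
N_MELS = 64                   # mel bins, from the text
PATCH_FRAMES = 96             # assumption: VGGish-style 0.96 s patches


def count_frames(n_samples, window, hop):
    """Number of STFT frames when no padding is applied (assumption)."""
    return 1 + (n_samples - window) // hop


window = SAMPLE_RATE * WINDOW_MS // 1000   # samples per analysis window
hop = SAMPLE_RATE * HOP_MS // 1000         # samples between windows
n_samples = SAMPLE_RATE * CLIP_SECONDS

n_frames = count_frames(n_samples, window, hop)
n_patches = n_frames // PATCH_FRAMES

print(window, hop)            # 400 160
print(n_frames, N_MELS)       # frames x mel bins of the log-mel spectrogram
print(n_patches)              # embeddings per clip, one per roughly 1 s snippet
```

Under these assumptions a 10 s clip yields 10 patches, matching the 10 embeddings and 10 per-second predictions described above.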
4.1. Exp. 1: Cross-dataset evaluation
Following previously established methodology [4], the two datasets, ASPED v.a and v.b, were
randomly partitioned into train, test, and validation subsets with an 80/10/10 split,
respectively.
To address the inherent class imbalance, we employed weighted batch sampling and a variable
weighted loss during training. Model inference results are reported using the checkpoint that
yielded the lowest validation loss after 20 epochs.
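
The partitioning and the weighted sampling can be sketched as follows. The exact weighting scheme is not given in the text, so this example assumes inverse class-frequency weights; the function names, the seed, and the toy labels are hypothetical.

```python
import random
from collections import Counter


def split_indices(n_items, seed=0):
    """Shuffle indices and cut them 80/10/10 into train, test, and validation subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(0.8 * n_items)
    n_test = round(0.1 * n_items)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_test],
        indices[n_train + n_test:],
    )


def inverse_frequency_weights(labels):
    """Per-sample weights so that each class carries the same total weight (assumption)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy labels: 90 'no pedestrian' snippets and 10 'pedestrian' snippets
labels = [0] * 90 + [1] * 10
train, test, val = split_indices(len(labels))
weights = inverse_frequency_weights([labels[i] for i in train])

# Draw one class-balanced batch of training indices
batch = random.Random(1).choices(train, weights=weights, k=8)
print(len(train), len(test), len(val))
print(batch)
```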
4.2. Exp. 2: Impact of vehicle presence for training
A key difference between the previously existing dataset ASPED v.a and the new data lies in the
presence of vehicle sounds in the audio recordings. In this experiment, we investigate whether
this factor in the training data influences model performance on test environments with (VP:
Vehicle-Present) and without vehicles (VA: Vehicle-Absent). To this end, we create two distinct
test splits of ASPED v.b, controlled for vehicle presence, and analyze the results for the
models trained on v.a and v.b (cf. Sect. 4.1), respectively.
Furthermore, to assess the models' propensity for false positives, we sampled vehicle-related
categories from the non-human sounds section of the FSD50K [21] dataset. FSD50K is an open
dataset of human-labeled sound events containing 51,197 Freesound³ clips unequally distributed
in 200 classes drawn from the AudioSet Ontology. For this and Section 4.3, we downsampled the
audio to 16 kHz and categorized it into human or non-human sounds by following the given
ontology⁴. All classes in FSD50K are represented in AudioSet, except Crash cymbal (non-human),
Human group actions (human), Human voice (human), Respiratory sounds (human), and Domestic
sounds, home sounds (non-human). Only single-tagged audio samples were included in this
analysis, and we filtered the dataset to include only categories containing at least 10 distinct
files. The resulting refined dataset comprised 21 human sound categories (989 files) and 133
non-human sound categories (8,097 files). The probability of class 1, which the model was
trained to associate with 'pedestrian' presence, was used to determine the model's response.
² https://github.com/urbanaudiosensing/Models/blob/main/data_utils/transforms.py, last access date: September 19, 2025
³ https://freesound.org/, last access date: September 19, 2025
⁴ https://research.google.com/audioset/ontology/human_sounds_1.html, last access date: September 19, 2025
Table 2: Impact of vehicle presence in training data: balanced accuracy (%)
on ASPED v.b subsets. (VP: Vehicle-Present, VA: Vehicle-Absent)
Train Dataset    Test Dataset (ASPED v.b)
                 VP ↑     VA ↑
ASPED v.a        65.16    67.87
ASPED v.b        67.49    71.01
4.3. Exp. 3: Model sensitivity to different sound categories
To gain insights into “what the models are listening to,” we analyze the
sensitivity of the ASPED-trained models to various audio categories
by classifying inputs from the FSD50K human and non-human sound
ontologies. More specifically, we investigate which human-generated
sound categories were most frequently detected as ‘pedestrian.’
Furthermore, we conduct a post-hoc analysis to determine if any non-human
sound categories are consistently misclassified as ‘pedestrian.’
A crucial consideration for this analysis is the difference in both
audio characteristics/recording setup and labeling paradigms. The
ASPED dataset labels are based on the presence of individuals within
a certain radius of the recording device (in this study, 6 m).
In contrast, FSD50K annotations do not consider spatial proximity;
for this evaluation, we operated under the assumption that all human
sounds represent the ‘pedestrian’ class and all non-human sounds
represent the ‘non-pedestrian’ class.
We further investigate the specific categories of human-related
sounds that our model reliably detects or struggles to recognize.
Additionally, we examine non-human sounds that are erroneously
classified as pedestrian-related, leading to false positive errors.
5. RESULTS
5.1. Exp. 1: Cross-dataset evaluation
Table 1 presents the balanced accuracy, calculated as the average
of sensitivity and specificity, achieved when models trained on one
dataset version were tested on the other.
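As a minimal illustrative sketch (not the authors' evaluation code), the balanced accuracy defined above can be computed from binary labels and predictions as follows. The function name and the toy labels are assumptions made purely for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


# Hypothetical, imbalanced toy data: 2 'pedestrian' frames, 8 'non-pedestrian' frames.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
score = balanced_accuracy(y_true, y_pred)  # sensitivity 0.5, specificity 0.75
```

Unlike plain accuracy, this measure weights both classes equally, which matters for the class imbalance noted in Sec. 4.1.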
The results show a performance drop when models are tested
on a dataset different from their training set, which indicates limited
generalization across the two recording setups.
These cross-dataset results highlight the complex interplay between
the presence of specific types of background noise, such as vehicular
traffic, and model generalization. Further investigation into domain
adaptation techniques may be beneficial to improve the robustness
of pedestrian detection systems in real-world scenarios with varying
acoustic environments.
5.2. Exp. 2: Impact of vehicle presence for training
To investigate the specific impact of vehicle presence in the training
data, we evaluate the v.a-trained and v.b-trained models on subsets of
v.b that were controlled for the presence or absence of vehicle sounds
(VP: Vehicle-Present, VA: Vehicle-Absent), as shown in Table 2.
The results show that, as expected, the presence or absence
of vehicle sounds in the test set impacts performance. Even though
the v.b-trained model was exposed to vehicle sounds during training,
predicting pedestrian presence in the absence of these potentially
confounding sounds is simpler.
To assess whether the v.b-trained model exhibits a reduced tendency
to misclassify common vehicle sounds as pedestrians compared to the
v.a-trained model, we compared the average predicted probability of
the ‘pedestrian’ class for a curated set of vehicle-related non-human
sound categories from FSD50K (Table 3).
The v.a-trained model generally exhibited considerably higher
average predicted probabilities for classifying vehicle sounds as
‘pedestrian’ compared to the v.b-trained model. This suggests that the
absence of traffic noise during training in v.a might lead the model to
erroneously associate vehicle sounds with human presence, increasing
false positives. Conversely, the v.b-trained model trained with traffic
noise was more effective at distinguishing pedestrian presence from
vehicle sounds, as indicated by fewer false alarms.
Detection and Classification of Acoustic Scenes and Events 2025 30–31 October 2025, Barcelona, Spain
Table 3: Avg. prob. of pedestrian class for vehicle-related FSD50K categories.
Category                           v.a-trained ↓   v.b-trained ↓
Race car, auto racing              0.86            0.69
Car                                0.69            0.62
Vehicle                            0.75            0.56
Vehicle horn, car horn, honking    0.74            0.62
Car passing by                     0.71            0.59
Motor vehicle (road)               0.72            0.60
Table 4: Average probability of FSD50K human sound categories for models
trained on ASPED v.a and v.b.
Category                           v.a-trained ↑   v.b-trained ↑
Female singing                     0.95            0.74
Speech                             0.94            0.65
Crying, sobbing                    0.92            0.63
Laughter                           0.91            0.65
Singing                            0.90            0.71
Human voice                        0.89            0.64
Yell                               0.89            0.66
Cheering                           0.88            0.66
Chatter                            0.87            0.66
Child speech, kid speaking         0.86            0.65
Human group actions                0.83            0.63
Speech synthesizer                 0.82            0.59
Conversation                       0.79            0.63
Burping, eructation                0.78            0.56
Male speech, man speaking          0.77            0.56
Whispering                         0.76            0.52
Applause                           0.76            0.48
Chewing, mastication               0.76            0.57
Hands                              0.74            0.56
Run                                0.74            0.56
Walk, footsteps                    0.70            0.58
5.3. Exp. 3: Model sensitivity to different sound categories
To understand the models’ sensitivity to different acoustic cues, we
investigated the impact of signal energy and of different (human and
non-human) sounds on the pedestrian detection accuracy.
5.3.1. Comparison with RMS energy: The Pearson correlation
between the audio’s RMS energy and the model’s output logit is low
for models trained on ASPED v.a and v.b (≈0.14 and ≈0.29,
respectively), confirming that the learned representations are more
effective than a simple energy measurement.
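The comparison above can be illustrated with a short sketch that computes the RMS energy per clip and its Pearson correlation with a model's output logits. This is a hedged illustration rather than the authors' analysis code: the helper names, the toy waveforms, and the logit values are all hypothetical.

```python
import math


def rms_energy(samples):
    """Root-mean-square energy of one audio clip given as a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


# Hypothetical clips and hypothetical model logits, for illustration only.
clips = [
    [0.1, -0.1, 0.1, -0.1],
    [0.5, -0.5, 0.5, -0.5],
    [0.2, 0.2, -0.2, -0.2],
    [0.9, -0.9, 0.9, -0.9],
]
logits = [0.3, -1.2, 2.0, 0.1]
energies = [rms_energy(clip) for clip in clips]
r = pearson(energies, logits)
```

A coefficient close to zero would indicate that the model output is not simply tracking loudness.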
5.3.2. Evaluation on FSD50K Human Sounds: Table 4 presents
the human categories as a subset of the FSD50K dataset. On the
right, we list the corresponding average predicted probability of
the ‘pedestrian’ class for both models. The model trained on the
ASPED v.a dataset demonstrates greater confidence when classifying
human sounds as ‘pedestrian’ compared to its counterpart trained
on the traffic-noise-rich ASPED v.b dataset. While speech-related
sounds generally exhibited higher probabilities across both models,
subtle performance variations in the ranking of specific categories
might indicate that background noise during training influences the
model’s sensitivity to different types of human sounds. The overall
lower average probabilities for the v.b-trained model likely reflect the
masking effect of traffic noise on the acoustic features crucial for
human sound identification. Notably, categories intuitively associated
with pedestrian movement, such as Walk, footsteps and Run, were
ranked relatively low within the broader set of human sound categories
for both models. These findings underscore the impact of the training
environment’s acoustic characteristics on the learned representations
and the subsequent generalization to out-of-domain human sounds. It
should be noted, however, that the majority of signals in this dataset
are very different from the typical urban sound recording; thus, these
results should be interpreted carefully.
Table 5: Top and bottom 3 non-human sound categories by avg. prob.
Category                     v.a-trained ↓   v.b-trained ↓
Top 3
Harp                         0.94 ± 0.07     0.71 ± 0.13
Trumpet                      0.94 ± 0.14     0.80 ± 0.11
Plucked string instrument    0.93 ± 0.07     0.66 ± 0.15
Bottom 3
Cricket                      0.42 ± 0.33     0.48 ± 0.16
Chirp, tweet                 0.51 ± 0.31     0.45 ± 0.15
Bicycle bell                 0.52 ± 0.36     0.48 ± 0.14
5.3.3. Evaluation on FSD50K Non-Human Sounds: To understand
the models’ sensitivity to other sounds, we evaluated their predictions
on a subset of 133 (categories with at least 10 samples) non-human
sound categories from AudioSet. Table 5 displays the top 3 and
bottom 3 categories, determined based on the v.a-trained model’s
average predicted probability for the ‘pedestrian’ class. The evaluation
on non-human sounds reveals that the model trained on v.a data
has a higher tendency to misclassify certain musical instruments as
‘pedestrian’ compared to the v.b-trained model. This may be due to
such sound categories being particularly infrequent or entirely absent
in the ASPED datasets. Interestingly, the bottom-ranked categories
reveal greater prediction variability in the v.a-trained model compared
to the v.b-trained model. This higher standard deviation suggests
that the v.a model is less certain when classifying sounds that are
dissimilar to human presence.
6. CONCLUSION
This research investigated the impact of the acoustic environment on
pedestrian detection using a novel pedestrian detection dataset with
vehicular noise. Our cross-dataset evaluation revealed a performance
drop when models were tested in an environment different from the one
they were trained on, indicating limited domain generalization
capability. Furthermore, the presence
of vehicle sounds in the test set considerably influenced performance,
with models showing varying sensitivities based on their training data’s
acoustic characteristics. Evaluation on out-of-domain FSD50K data
highlighted that models trained on v.a exhibited higher confidence in
identifying human sounds but were also more prone to false positives
for non-pedestrian sounds. Conversely, models trained with traffic
noise demonstrated more cautious predictions. However, the notable
issue of false positives across various non-human sound categories
warrants further attention. These findings underscore the critical role
of the acoustic environment in training robust pedestrian detection
systems. The limited generalization observed suggests that future
work should focus on domain adaptation techniques to bridge the gap
between different acoustic domains. Specifically, exploring methods
to enhance the model’s ability to filter out irrelevant background noise,
such as vehicular traffic, while retaining sensitivity to subtle
pedestrian-related cues is crucial. Additionally, we plan to investigate the
integration of multi-modal information (e.g., visual cues) to increase
robustness in challenging scenarios. Finally, a more comprehensive
analysis of the model’s failure cases, particularly the misclassification
of specific non-human sounds, could inform the design of more
discriminative acoustic features or robust model architectures.
REFERENCES
[1] American Planning Association, “The pedestrian count,” https://www.planning.org/pas/reports/report199.htm, 1965.
[2] GDPR.eu, “General data protection regulation (gdpr) compliance guidelines,” 2019, last access date: May 7, 2025. [Online]. Available: https://gdpr.eu/
[3] H. Li, Z. Wu, and J. Zhang, “Pedestrian detection based on deep learning model,” in International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE, 2016, pp. 796–800.
[4] P. Seshadri, C. Han, B.-W. Koo, N. Posner, S. Guhathakurta, and A. Lerch, “Asped: An audio dataset for detecting pedestrians,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 406–410.
[5] H. Yang, K. Ozbay, and B. Bartin, “Investigating the performance of automatic counting sensors for pedestrian traffic data collection,” in World Conference on Transport Research (WCTR), vol. 1115, 2010, pp. 1–11.
[6] H. Yang, K. Ozbay, and B. Bartin, “Enhancing the quality of infrared-based automatic pedestrian sensor data by nonparametric statistical method,” Transportation Research Record: Journal of the Transportation Research Board, vol. 2264, no. 1, pp. 11–17, 2011.
[7] A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, “Computer vision and deep learning techniques for pedestrian detection and tracking: A survey,” Neurocomputing, vol. 300, pp. 17–33, 2018.
[8] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 34, no. 4, pp. 743–761, 2011.
[9] E. Ozan, S. Searcy, B. C. Geiger, C. Vaughan, C. Carnes, C. Baird, and A. Hipp, “State-of-the-art approaches to bicycle and pedestrian counters,” North Carolina Department of Transportation, Tech. Rep., 2021.
[10] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “It’s not all about size: On the role of data properties in pedestrian detection,” in European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 210–225.
[11] I. Hasan, S. Liao, J. Li, S. U. Akram, and L. Shao, “Generalizable pedestrian detection: The elephant in the room,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11 323–11 332.
[12] J. Rulff, F. Miranda, M. Hosseini, M. Lage, M. Cartwright, G. Dove, J. Bello, and C. T. Silva, “Urban rhapsody: Large-scale exploration of urban soundscapes,” Computer Graphics Forum, vol. 41, no. 3, pp. 209–221, 2022.
[13] M. S. Hammer, T. K. Swinburn, and R. L. Neitzel, “Environmental noise pollution in the united states: developing an effective public health response,” Environmental Health Perspectives (EHP), vol. 122, no. 2, pp. 115–119, 2014.
[14] H. J. Jariwala, H. S. Syed, M. J. Pandya, and Y. M. Gajera, Noise Pollution & Human Health: A review, 2021, last access date: May 7, 2025. [Online]. Available: https://www.researchgate.net/profile/Hiral-Jariwala/publication/319329633_Noise_Pollution_Human_Health_A_Review/links/59a54434a6fdcc773a3b1c49/Noise-Pollution-Human-Health-A-Review.pdf
[15] World Health Organization, “Environmental noise,” in Compendium of WHO and other UN guidance on health and environment, 2022, ch. 11, last access date: May 7, 2025. [Online]. Available: https://cdn.who.int/media/docs/default-source/who-compendium-on-health-and-environment/who_compendium_noise_01042022.pdf?sfvrsn=bc371498_3#:~:text=For%20average%20noise%20exposure%2C%20the,dB%20LAeq%2C%2024h%20%E2%80%A2%20weekly
[16] A. Radicchi, P. Cevikayak Yelmi, A. Chung, P. Jordan, S. Stewart, A. Tsaligopoulos, L. McCunn, and M. Grant, “Sound and the healthy city,” Cities & Health, vol. 5, no. 1-2, pp. 1–13, 2021.
[17] L. M. Aiello, R. Schifanella, D. Quercia, and F. Aletta, “Chatty maps: constructing sound maps of urban areas from social media data,” Royal Society Open Science, vol. 3, no. 3, p. 150690, 2016.
[18] J. P. Bello, C. Silva, O. Nov, R. L. Dubois, A. Arora, J. Salamon, C. Mydlarz, and H. Doraiswamy, “Sonyc: A system for monitoring, analyzing, and mitigating urban noise pollution,” Communications of the ACM, vol. 62, no. 2, pp. 68–77, 2019.
[19] C. Han, P. Seshadri, Y. Ding, N. Posner, B. W. Koo, A. Agrawal, A. Lerch, and S. Guhathakurta, “Understanding pedestrian movement using urban sensing technologies: the promise of audio-based sensors,” Urban Informatics, vol. 3, no. 1, p. 22, 2024.
[20] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 776–780.
[21] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “Fsd50k: an open dataset of human-labeled sound events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 30, pp. 829–852, 2022.
TOWARDS SPATIAL AUDIO UNDERSTANDING VIA QUESTION ANSWERING
Parthasaarathy Sudarsanam, Archontis Politis
Audio Research Group, Tampere University, Tampere, Finland
ABSTRACT
In this paper, we introduce a novel framework for spatial audio
understanding of first-order ambisonic (FOA) signals through a question
answering (QA) paradigm, aiming to extend the scope of sound
event localization and detection (SELD) towards spatial scene
understanding and reasoning. First, we curate and release fine-grained
spatio-temporal textual descriptions for the STARSS23 dataset using
a rule-based approach, and further enhance linguistic diversity
using large language model (LLM)-based rephrasing. We also
introduce a QA dataset aligned with the STARSS23 scenes, covering
various aspects such as event presence, localization, spatial,
and temporal relationships. To increase language variety, we again
leverage LLMs to generate multiple rephrasings per question.
Finally, we develop a baseline spatial audio QA model that takes
FOA signals and natural language questions as input and provides
answers regarding various occurrences, temporal, and spatial
relationships of sound events in the scene, formulated as a classification
task. Despite being trained solely with scene-level question answering
supervision, our model achieves performance that is comparable
to a fully supervised sound event localization and detection
model trained with frame-level spatiotemporal annotations. The
results highlight the potential of language-guided approaches for
spatial audio understanding and open new directions for integrating
linguistic supervision into spatial scene analysis.
Index Terms: Spatial audio understanding, acoustic scene
analysis, question answering
1. INTRODUCTION
Scene understanding by machines using audio signals is a
well-established problem in audio processing, audio-based machine
learning, and the broader field of machine learning for natural
scenes. Early and widely studied tasks include the classification
of scene types from audio recordings and the detection of sound
events over time corresponding to specific target classes [1]. While
these approaches provide valuable semantic insights into the content
of an auditory scene, they lack information about the spatial
characteristics of the sound environment. To address this limitation,
the task of sound event localization and detection (SELD) was
introduced [2, 3]. SELD extends traditional methods by capturing
both the temporal activity of sound events and their spatial locations
relative to the recording device. This spatial information is essential
for downstream applications that depend on the positioning and
spatial relationships of sound sources within a scene. Introduced as
part of the DCASE Challenge in 2019, SELD has built an active and
growing research community dedicated to advancing SELD methods
[4, 5, 6, 7, 8].
Most SELD proposals that are capable of handling dynamic
complex scenes follow a strongly supervised training paradigm
where event activity and location labels are provided at a fine
temporal resolution, with a few exceptions trying to leverage
self-supervision [9, 10]. Similarly, during inference SELD models are
expected to provide predictions at a similar resolution. Such
annotations are very difficult to obtain in real scenes, with only a
handful of such datasets currently existing [11, 12], including the
STARSS22-23 dataset collected by the authors and collaborators
[12, 13]. Otherwise, supervised training of SELD methods has
relied on simulations of spatial sound scenes [14, 15].
A recent trend in scene understanding across domains involves
grounding perception in natural language [16, 17]. It has been
explored in a few works for spatial audio scene understanding
[18, 19, 20, 21]. Focusing on general sound events, the BAT
system [18] evaluates a model’s ability to answer questions on
classification, detection, direction, and distance using simulated binaural
recordings with up to two static events in reverberant scenes.
Questions and answers were generated based on a rule-based approach,
while an LLM was used for the QA task. The ELSA system [19]
trains a spatial audio model using contrastive learning on a large
simulated dataset of FOA recordings paired with captions describing
spatial properties. Due to the lack of captioned spatial audio
data, the authors use standard audio captioning datasets, simulate
FOA by placing sounds in rooms, and revise the original captions
with a language model to include source position and room size.
In this work, we train a model that answers natural language
questions about sound event localization and detection information,
including sound event presence, classification, and temporal and
spatial order, based on the sound scene. Unlike previous studies,
our model is trained and evaluated on FOA audio from real scene
recordings. We use the STARSS23 dataset, which contains approximately
eight hours of audio with fine-grained spatiotemporal annotations
at 100 ms intervals. Firstly, we generate a set of detailed
spatial captions by converting the metadata into textual descriptions
that capture the evolving spatiotemporal structure of each scene,
and enhance them using GPT-4 for linguistic diversity. These captions
can support both spatial audio question answering and general
scene understanding tasks. Secondly, we construct a QA dataset
for STARSS23 and similarly apply GPT-4 to rephrase rule-based
questions. Finally, we develop a baseline spatial audio question
answering model that takes FOA audio features and natural language
questions as input, and frames the task as classification.
2. DATASET CREATION
2.1. STARSS23 Captions dataset
To generate detailed textual descriptions of the recorded scenes,
we utilize the annotations provided in the STARSS23 dataset.
These annotations are available at a temporal resolution of 100
ms and consist of the following fields: frame time, sound
event label, parent source ID, azimuth angle,
elevation angle, and distance. Annotations reflect
Table 1: Performance metrics on the downstream tasks. The best performing model for each task is shown in bold.
              Bird Detection                                              Time of Day      Precipitation Rate   Average Wind Speed
Model         Precision           Recall              F1                  MAE     STD      MAE     STD          MAE     STD
bimamba-con   70.0 (69.5, 70.4)   87.3 (86.9, 87.6)   77.7 (77.3, 78.0)   0.322   0.225    0.193   0.282        0.381   0.318
bimamba-mae   78.3 (77.9, 78.7)   90.8 (90.5, 91.1)   84.1 (83.8, 84.4)   0.381   0.259    0.113   0.250        0.334   0.268
bimamba-mse   76.5 (76.1, 76.9)   91.0 (90.7, 91.3)   83.1 (82.8, 83.4)   0.284   0.219    0.121   0.250        0.350   0.273
ssast-con     78.9 (78.5, 79.3)   91.0 (90.7, 91.3)   84.5 (84.2, 84.8)   0.190   0.176    0.098   0.232        0.289   0.238
ssast-mae     77.8 (77.4, 78.2)   92.6 (92.3, 92.9)   84.6 (84.3, 84.9)   0.257   0.204    0.109   0.249        0.338   0.263
ssast-mse     76.9 (76.5, 77.3)   92.2 (91.9, 92.5)   83.8 (83.6, 84.1)   0.278   0.217    0.116   0.254        0.350   0.274
features related to bird vocalisations and distinguish them from
other sounds in the environment. We assign presence and absence
labels by looking for agreement between MIT-AST’s bird-related
tags and BirdNET’s prediction confidence score as follows. Clips
where both models agree that no bird is present (MIT-AST detects
no bird, BirdNET confidence < 0.5) or where MIT-AST finds
a bird but BirdNET is very uncertain (confidence < 0.2) are
labelled as “absence”, while clips where BirdNET’s confidence
exceeds the pre-defined thresholds (0.2 when MIT-AST detects
a bird and 0.5 otherwise) are labelled as “presence”.
• Temporal metadata prediction: a regression task where the
model is trained to predict the time of recording within the day
from a given audio clip, which is treated as a cyclical variable
and encoded by sine and cosine pairs. The goal of this task is to
assess whether the embeddings extracted from each model can
represent complex patterns in the daily activity of humans and
animals.
• Weather metadata prediction: a regression task in which the
model is trained to predict weather metadata associated with each
audio clip, including precipitation rate and average wind speed.
This task aims to evaluate each model’s potential for weather-related
feature extraction, which can be useful for monitoring
climate change and its impact on the environment.
During fine-tuning, the pre-trained encoder is frozen and no masking
is applied to the input, while an MLP head is trained specifically for
each of the three tasks.
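The presence/absence labelling rule described for the bird detection task can be sketched as a small function. This is an illustrative reading of the rule, not the authors' code: the function and argument names are hypothetical, and the handling of a confidence exactly equal to the threshold (labelled “absence” here) is an assumption, since the text only specifies the strict cases.

```python
def bird_label(ast_detects_bird: bool, birdnet_confidence: float) -> str:
    """Combine an MIT-AST bird tag with a BirdNET confidence score into one label."""
    # A lower BirdNET threshold applies when MIT-AST already reports a bird.
    threshold = 0.2 if ast_detects_bird else 0.5
    # Assumption: a confidence exactly at the threshold counts as "absence".
    return "presence" if birdnet_confidence > threshold else "absence"


# Hypothetical clips: (MIT-AST detects a bird?, BirdNET confidence)
examples = [(False, 0.4), (False, 0.6), (True, 0.1), (True, 0.3)]
labels = [bird_label(ast, conf) for ast, conf in examples]
```

Under this reading, BirdNET acts as the deciding signal, while the MIT-AST tag only changes how much BirdNET confidence is required.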
Resources and training time. Training was conducted on an
internal computing unit provided by Maastricht University, which
is equipped with NVIDIA GeForce RTX 2080 Ti and Quadro RTX
6000 GPUs, as well as on the Snellius computing cluster provided
by SURF [25], which includes NVIDIA A100 and NVIDIA H100
GPUs. Training generally took between 2 and 4 days per experiment
for the pre-training phase and around 1 day per experiment for the
fine-tuning phase.
5. RESULTS AND DISCUSSION
Embedding Analysis. We estimate the alignment and uniformity
metrics [26] to assess the quality of the learned embedding spaces.
Alignment measures how close similar samples are, while uniformity
measures how well the embeddings are distributed on the unit
hypersphere. To produce similar pairs for the alignment metric, we
apply random transformations both on the waveform level (Gaussian
noise, volume gain, pitch shift) as well as the spectrogram level
(Gaussian noise, random roll, random masking). Figure 2 shows
the alignment and negative (absolute) uniformity metrics for all
pre-trained models. The Mamba models achieve lower alignment and
lower absolute uniformity, meaning that the embeddings tend to cover
a tightly clustered area of the unit sphere, while the SSAST models
achieve higher alignment and higher absolute uniformity, indicating
spread-out embeddings regardless of similarity.
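To make the two metrics concrete, the sketch below uses one common formulation: alignment as the mean squared distance between L2-normalised embeddings of positive pairs, and uniformity as the logarithm of the mean Gaussian potential over all embedding pairs. Whether this exact variant (including the temperature t = 2) matches the one in [26] or the one used here is an assumption, and all names are illustrative.

```python
import math
from itertools import combinations


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(views_a, views_b):
    """Mean squared distance between paired views; lower means closer pairs."""
    a = [normalize(v) for v in views_a]
    b = [normalize(v) for v in views_b]
    return sum(sq_dist(x, y) for x, y in zip(a, b)) / len(a)


def uniformity(embeddings, t=2.0):
    """Log of the mean pairwise Gaussian potential; 0 if collapsed, more negative if spread."""
    z = [normalize(v) for v in embeddings]
    potentials = [math.exp(-t * sq_dist(x, y)) for x, y in combinations(z, 2)]
    return math.log(sum(potentials) / len(potentials))


# Hypothetical 2-D embeddings: one collapsed set and one spread-out set.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
abs_uniformity_collapsed = abs(uniformity(collapsed))
abs_uniformity_spread = abs(uniformity(spread))
```

In this formulation, a tightly clustered embedding set yields both a small alignment value and a small absolute uniformity, which matches the interpretation given for the Mamba models above.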
Fig. 2: Alignment and absolute uniformity of the embeddings of all
pre-trained models.
Table 1 shows the performance of the pre-trained and fine-tuned
models on the three downstream tasks.
Bird detection. We evaluate the performance of the fine-tuned
models on the bird detection task using the standard accuracy,
precision, recall and F1 score metrics. Confidence intervals are
computed at a level of 95% using 1,000 bootstrap samples. The
results suggest ssast-mae as the best performing model in terms
of recall, while ssast-con achieves the highest precision, with
both models achieving similar F1 scores. We observe that the Mamba
models generally perform worse than their corresponding SSAST
models for each pre-training technique, while the MAE pre-training
task outperforms the MSE task across both models.
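A 95% confidence interval from 1,000 bootstrap samples can be obtained, for instance, with a percentile bootstrap over the test clips. The sketch below is one possible implementation and not necessarily the procedure used here: the percentile method, the function names, the fixed seed, and the toy labels are all assumptions.

```python
import random


def f1_score(y_true, y_pred):
    """F1 score for the positive class (1) of a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample clips with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[round((1 - level) / 2 * (n_boot - 1))]
    upper = scores[round((1 + level) / 2 * (n_boot - 1))]
    return lower, upper


# Hypothetical labels and predictions for 20 clips.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
ci_low, ci_high = bootstrap_ci(y_true, y_pred, f1_score)
```

The same helper can be reused for precision or recall by passing a different metric function.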
Time of day prediction. To assess performance on the time-of-day
prediction task, we compute the minimum angle difference between the
predicted and ground-truth angles. This difference is then normalized
to the range [0,1), corresponding to fractions of a 12-hour period. We
calculate the mean absolute error (MAE) and its standard deviation to
quantify the prediction error. The ssast-con model achieves the
lowest mean absolute error, along with the lowest standard deviation
(Table 1).
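The cyclical encoding and the normalised angular error described above can be sketched as follows. This is a minimal illustration under stated assumptions: the mapping of 24 hours onto a full circle and the helper names are not taken from the paper, and in this sketch an error of exactly 12 hours maps to 1.0.

```python
import math

TWO_PI = 2 * math.pi


def encode_time_of_day(hour):
    """Map an hour in [0, 24) to a (sine, cosine) pair on the unit circle."""
    angle = TWO_PI * (hour % 24) / 24
    return math.sin(angle), math.cos(angle)


def decode_angle(sin_val, cos_val):
    """Recover the angle in [0, 2*pi) from a (sine, cosine) pair."""
    return math.atan2(sin_val, cos_val) % TWO_PI


def normalized_angular_error(pred_angle, true_angle):
    """Smallest angle between two directions, as a fraction of a 12-hour period."""
    diff = abs(pred_angle - true_angle) % TWO_PI
    min_diff = min(diff, TWO_PI - diff)
    return min_diff / math.pi


# Hypothetical example: predicting 23:00 for a clip recorded at 01:00 is only
# 2 hours off across midnight, not 22 hours.
pred = decode_angle(*encode_time_of_day(23))
true = decode_angle(*encode_time_of_day(1))
error = normalized_angular_error(pred, true)
```

Averaging this error over all test clips gives an MAE comparable in kind to the Time of Day column of Table 1.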
We further examine the ability of the models to make predictions
that indicate insights about the acoustic patterns that may appear
throughout a 24-hour period. To this end, we bin the true and
predicted values into 6-hour intervals (Night, Morning, Afternoon,
Evening). We then compute the confusion matrix of the true and
predicted values, which is shown for the two best performing models
of each type in Figure 3. The results show that the models tend to
classify the night hours correctly, which is most likely attributed to the
decreased acoustic activity during the night-time. All models except
for ssast-con tend to predict “Afternoon” for most samples
outside the night hours. We hypothesise that all models apart from
ssast-con essentially collapse the task to a binary classification
problem, with one class corresponding to “inactive” hours, which are
predicted during night-time, and the other class to “active” hours,
which are predicted near the dataset mean in the opposite direction,
i.e., “Afternoon”. Conversely, the ssast-con model is able to learn
more complex patterns, with most of its misclassifications occurring
between consecutive time intervals of similar acoustic activity. For
example, morning and afternoon hours are expected to present with
similar acoustic patterns in terms of animal and human activity.
Fig. 3: Confusion matrices of the true and predicted time-of-day
values for all pre-trained models. Panels: (a) ssast-con, (b) ssast-mae,
(c) bimamba-mse, (d) bimamba-con.
Weather metadata prediction. We calculate the mean and standard
deviation of the absolute error between the predicted and true values
for the precipitation rate and average wind speed tasks. We find
that the ssast-con model achieves the lowest mean absolute
error and standard deviation for both tasks, while the SSAST-based
models outperform the Bi-directional Mamba-based models for each
pre-training technique.
To assess model performance across varying weather conditions,
we divide the precipitation rate and average wind speed values into
five equally sized bins and compute the mean absolute error for
each bin, along with its standard deviation. As shown in Figure 4,
the ssast-con model achieves the most consistent performance,
with a binned MAE standard deviation of 0.65 for precipitation rate
and 0.09 for wind speed. Regarding precipitation, MAE increases
monotonically with rain intensity for all models. We attribute this
pattern both to the scarcity of heavy rain samples (over 90% of
observations record negligible precipitation) and the fact that intense
rainfall produces overwhelming broadband, high-energy acoustic noise
that obscures fine spectral features. In contrast, wind speed MAE
exhibits a sharp decline after the first bin. Under calm conditions,
estimation error may be elevated because biotic and anthropogenic
sounds dominate the spectrogram, some of which may share similar
acoustic patterns with stronger winds (e.g. distant exhaust sounds),
or because other weather cues may falsely hint at strong winds (e.g.
heavy rain). On the other hand, stronger wind is typically easier to
recognise as it presents with more acoustic cues such as rustling of
leaves or a distinctive broadband turbulent airflow (“whoosh”) sound.
Fig. 4: Mean absolute error and standard deviation for the precipitation
rate and average wind speed tasks across 5 equally sized bins.
Panels: (a) Precipitation rate performance per model across intervals,
(b) Average wind speed performance per model across intervals.
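The binned analysis above can be sketched as follows. The text does not state whether “equally sized” refers to bin width or to the number of samples per bin; this sketch assumes equal-width bins over the range of the true values, and the function name and toy values are hypothetical.

```python
import statistics


def binned_mae(y_true, y_pred, n_bins=5):
    """Mean absolute error per equal-width bin of the true values (None for empty bins)."""
    low, high = min(y_true), max(y_true)
    width = (high - low) / n_bins
    if width == 0:
        raise ValueError("true values must not all be identical")
    errors = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_pred):
        # The maximum value falls on the upper edge and is assigned to the last bin.
        idx = min(int((t - low) / width), n_bins - 1)
        errors[idx].append(abs(t - p))
    return [sum(e) / len(e) if e else None for e in errors]


# Hypothetical targets whose prediction error grows with the target value.
y_true = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0]
y_pred = [0.0, 1.0, 2.5, 3.5, 5.0, 6.0, 7.5, 8.5, 10.0, 12.0]
per_bin = binned_mae(y_true, y_pred)
# Spread of the per-bin MAE values, used as a simple consistency measure.
consistency = statistics.pstdev([m for m in per_bin if m is not None])
```

A smaller spread of the per-bin MAE values corresponds to more consistent performance across conditions.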
6. CONCLUSION
Our results highlight distinct strengths and weaknesses of each
pre-training strategy and model architecture across downstream tasks.
On the one hand, masked reconstruction generally led to better
performance in bird detection, likely due to its focus on reconstructing
structured spectral content, which encourages the model to capture
high-level auditory features such as harmonics and time-frequency
patterns. Across models of the same architecture, the MAE loss
generally leads to better performance than the MSE loss, with the
exception of bimamba-mae versus bimamba-mse on the temporal
prediction task. We attribute this to the fact that MAE is more robust
to outliers as it is not dominated by high-energy pixels, pushing the
models to learn better reconstructions throughout the entire spectra,
which translates to finer feature extraction.
On the other hand, contrastive learning yielded comparable results
in the bird detection task, while producing superior results in the
temporal and weather prediction tasks. These findings suggest that
stereo contrastive learning encourages the extraction of low-level
acoustic cues that can carry information about acoustic patterns
throughout the day and correlate with weather changes. This effect was
most prominent in the Transformer-based models, where contrastive
pre-training produced diverse and well-structured embeddings, leading
to the best overall performance in regression tasks. In comparison,
SSM-based models tended to produce more collapsed representations
and were less stable and performant during fine-tuning, particularly
under contrastive learning. Despite this, SSMs may still offer benefits
for modelling long-range dependencies in future work involving longer
input sequences.
REFERENCES
[1]
P. Kou sogeo gos, “Towa ds a Founda ion Model o he Analysis
o En i onmen al Audio Da a Using he T ans o me and Mamba
A chi ec u es,” Mas e ’s hesis, Depa men o Ad anced Compu ing
Sciences, Facul y o Science and Enginee ing, Maas ich Uni e si y,
2025.
[2]
A. Vaswani, N. Shazee , N. Pa ma , J. Uszko ei , L. Jones, A. N. Gomez,
Ł
. Kaise , and I. Polosukhin, “A en ion is all you need,” in P oc. Neu IPS,
ol. 30, 2017. [Online]. A ailable: h ps://p oceedings.neu ips.cc/pape
iles/pape /2017/ ile/3 5ee243547dee91 bd053c1c4a845aa-Pape .pd
[3]
J. De lin, M. Chang, K. Lee, and K. Tou ano a, “BERT: P e- aining
o deep bidi ec ional ans o me s o language unde s anding,” CoRR,
ol. abs/1810.04805, 2018. [Online]. A ailable: h p://a xi .o g/abs/1810.
04805
[4]
A. Doso i skiy, L. Beye , A. Kolesniko , D. Weissenbo n, X. Zhai,
T. Un e hine , M. Dehghani, M. Minde e , G. Heigold, S. Gelly,
J. Uszko ei , and N. Houlsby, “An image is wo h 16x16 wo ds:
T ans o me s o image ecogni ion a scale,” P oc. ICLR, 2021.
[5]
Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio spec og am ans-
o me ,” in P oc. In e speech, 2021, pp. 571–575.
[6]
Y. Gong, C.-I. Lai, Y.-A. Chung, and J. Glass, “SSAST: Sel -supe ised
audio spec og am ans o me ,” in P oceedings o he AAAI Con e ence
on A i icial In elligence, ol. 36, no. 10, 2022, pp. 10 699–10 709.
[7]
A. Gu, K. Goel, and C. R
´
e, “E icien ly modeling long sequences wi h
s uc u ed s a e spaces,” in P oc. ICLR, 2022.
[8]
J. T. Smi h, A. Wa ing on, and S. Linde man, “Simpli ied s a e space
laye s o sequence modeling,” in P oc. ICLR, 2023. [Online]. A ailable:
h ps://open e iew.ne / o um?id=Ai8Hw3AXqks
[9]
A. Gu and T. Dao, “Mamba: Linea - ime sequence modeling wi h selec i e
s a e spaces,” a Xi p ep in a Xi :2312.00752, 2023.
[10]
T. Dao and A. Gu, “T ans o me s a e SSMs: Gene alized models and
e icien algo i hms h ough s uc u ed s a e space duali y,” in P oc. ICML,
2024.
[11]
S. Yada and Z.-H. Tan, “Audio Mamba: Selec i e s a e spaces o sel -
supe ised audio ep esen a ions,” in P oc. In e speech, 09 2024, pp.
552–556.
[12]
J. Lin and H. Hu, “Audio Mamba: P e ained audio s a e space model
o audio agging,” 2024. [Online]. A ailable: h ps://a xi .o g/abs/2405.
13636
[13]
M. H. E ol, A. Senocak, J. Feng, and J. S. Chung, “Audio Mamba:
Bidi ec ional s a e space model o audio ep esen a ion lea ning,” IEEE
Signal P ocess. Le ., ol. 31, pp. 2975–2979, 2024.
[14]
K. Li, G. Chen, R. Yang, and X. Hu, “SPMamba: S a e-space model
is all you need in speech sepa a ion,” 2024. [Online]. A ailable:
h ps://a xi .o g/abs/2404.02063
[15]
S. Shams, S. S. Dinda , X. Jiang, and N. Mesga ani, “SSAMBA: Sel -
supe ised audio ep esen a ion lea ning wi h Mamba s a e space model,”
a Xi p ep in a Xi :2405.11831, 2024.
[16]
S. Kahl, C. M. Wood, M. Eibl, and H. Klinck, “Bi dNET: A
deep lea ning solu ion o a ian di e si y moni o ing,” Ecological
In o ma ics, ol. 61, p. 101236, 2021. [Online]. A ailable: h ps:
//www.sciencedi ec .com/science/a icle/pii/S1574954121000273
[17]
B. Ghani, T. Den on, S. Kahl, and H. Klinck, “Global bi dsong
embeddings enable supe io ans e lea ning o bioacous ic
classi ica ion,” Scien i ic Repo s, ol. 13, no. 1, Dec. 2023.
[Online]. A ailable: h p://dx.doi.o g/10.1038/s41598-023-49989-z
[18]
G. Veng o ski, M. R. Hulsey-Vincen , M. A. Bem ose, and T. J.
Ga dne , “Twee yBERT: Au oma ed pa sing o bi dsong h ough
sel -supe ised machine lea ning,” bioRxi , 2025. [Online]. A ailable:
h ps://www.bio xi .o g/con en /ea ly/2025/04/10/2025.04.09.648029
[19]
R. A anza o and F. Be i elli, “An inno a i e acous ic ain gauge based
on con olu ional neu al ne wo ks,” In o ma ion, ol. 11, no. 4, 2020.
[Online]. A ailable: h ps://www.mdpi.com/2078-2489/11/4/183
[20]
M. Wang, M. Chen, Z. Wang, Y. Guo, Y. Wu, W. Zhao, and
X. Liu, “Es ima ing ain all in ensi y based on su eillance audio and
deep-lea ning,” En i onmen al Science and Eco echnology, ol. 22,
p. 100450, 2024. [Online]. A ailable: h ps://www.sciencedi ec .com/
science/a icle/pii/S2666498424000644
[21]
A. H
¨
a m
¨
a and E. Naza enko, “Bi d ac i i ies in a esiden ial back
ga den,” in 32nd Eu opean Signal P ocessing Con e ence, EUSIPCO
2024 - P oceedings, se . Eu opean Signal P ocessing Con e ence.
Uni ed S a es: IEEE, 2024, pp. 1262–1266. [Online]. A ailable:
h ps://eusipcolyon.sciencescon .o g/
[22]
A. H
¨
a m
¨
a, “Ga denFiles23,” 2024. [Online]. A ailable: h ps://doi.o g/10.
34894/HPLUCH
[23]
Y. Gong, Y.-A. Chung, and J. Glass, “PSLA: Imp o ing audio agging
wi h p e aining, sampling, labeling, and agg ega ion,” IEEE/ACM T ans.
Audio, Speech, Lang. P ocess., 2021.
[24]
A. an den Oo d, Y. Li, and O. Vinyals, “Rep esen a ion lea ning
wi h con as i e p edic i e coding,” CoRR, ol. abs/1807.03748, 2018.
[Online]. A ailable: h p://a xi .o g/abs/1807.03748
[25]
SURF, “Snellius: The na ional supe compu e ,” 2025, accessed: 2025-09-
19. [Online]. A ailable: h ps://www.su .nl/en/se ices/compu e/snellius-
he-na ional-supe compu e
[26]
T. Wang and P. Isola, “Unde s anding con as i e ep esen a ion lea ning
h ough alignmen and uni o mi y on he hype sphe e,” CoRR, ol.
abs/2005.10242, 2020. [Online]. A ailable: h ps://a xi .o g/abs/2005.
10242
Deployment of AI-based Sound Analysis Algorithms in Real-time
Acoustic Sensors: Challenges and a Use Case
Amaia Sagasti¹, Pere Artís², Xavier Serra¹, Frederic Font³
¹Music Technology Group, Universitat Pompeu Fabra, Barcelona
²Keacoustics, Barcelona
³Phonos Fundació Privada, Barcelona
Abstract—Real-time acoustic sensing involves significant challenges in
capturing, processing, and transmitting audio. Integrating AI models on
resource-constrained devices further complicates development. This paper
presents an end-to-end solution addressing these challenges: SENS, the
Smart Environmental Noise System, is a low-cost sensor designed for
real-time acoustic monitoring. Built on a Raspberry Pi platform, SENS captures
sound continuously and processes it locally using custom-developed
software based on small and efficient artificial intelligence algorithms.
With a current focus on urban environments, SENS calculates acoustic
parameters, including sound pressure level (SPL), and makes predictions
of the perceptual sound attributes of pleasantness and eventfulness (ISO
12913), along with detecting the presence of specific sound sources
such as vehicles, birds, and human activity. To safeguard privacy, all
processing occurs directly on the device in real time, ensuring that no
audio recordings are permanently stored or transferred. Additionally, the
system transmits the analysis results through the wireless network to a
remote server. Demonstrating its practical applicability, a network of five
SENS devices has been deployed in an urban area for over three months,
validating SENS as a powerful tool for analyzing and understanding
soundscapes, recognizing patterns, and detecting acoustic events. The
proposed flexible and reproducible technology allows reconfiguration for
different applications and represents an innovative step in real-time and
AI-based noise monitoring.
Index Terms—Environmental noise monitoring, machine learning,
Internet of Things (IoT), urban soundscapes, smart city
1. INTRODUCTION
Sound monitoring has traditionally relied on high-precision instruments
such as sonometers, but these are limited by their high
cost, lack of remote communication capabilities (requiring manual
deployment and retrieval), and low spatial or temporal resolution.
The growing adoption of Internet of Things (IoT) technologies has
dramatically transformed this paradigm. Recent developments in
low-power embedded systems and wireless communication have enabled
distributed sensing via low-cost acoustic sensor networks. Examples
like AudioMoth [1] have made long-term acoustic data collection
feasible, but they often still rely on offline analysis.
The parallel emergence of artificial intelligence (AI) across numerous
fields opens new opportunities for real-time and autonomous
sound analysis. Approaches like Bonet-Solà et al. [2] integrate noise
level data from public wireless sensor networks with short audio
clips recorded on smartphones to estimate subjective acoustic comfort.
However, their system relies on centralized processing and offline AI
models. Similarly, the CENSE network [3], [4] uses MEMS-based
sensors to transmit low-resolution spectral data, ensuring privacy while
relying on centralized servers for analysis. Another example of this
approach is the LIFEWARD project [5] for neonatal ICUs, which
computes third-octave spectrograms on the edge to avoid storing
intelligible audio, with cloud-based AI completing the analysis. In
contrast, other solutions involve AI-powered sensors that can perform
on-device sound event detection, noise classification, or perceptual
indexing. An example of this is a system deployed around the
VELTINS Arena in Germany [6], which runs lightweight convolutional
neural networks (CNNs) on Raspberry Pi devices to classify sound
events locally, reducing both data transmission and privacy risks. This
shift enables real-time, autonomous, and privacy-preserving insights
but introduces challenges related to computational efficiency.
In this work, we present a methodology for deploying AI-based
sound analysis algorithms in real-time acoustic sensors through
a use case: SENS (Smart Environmental Noise System), a low-cost
acoustic sensor system designed to run lightweight AI models
locally for continuous sound analysis. Built on Raspberry Pi, and
with a current focus on urban environments, it estimates both
physical (e.g., sound pressure level) and perceptual attributes (e.g.,
pleasantness, eventfulness), as well as the sound sources present
in the soundscape, while preserving privacy by avoiding permanent
storage or transmission of audio. Results are transmitted through the
network to a remote server, and the system is modular to support
flexible model updates towards other applications. SENS technology
is validated through a real-world urban deployment, demonstrating
its potential for scalable, real-time acoustic monitoring.
2. SYSTEM OVERVIEW
The fully integrated proposed technology involves both hardware
and software components. However, the software does not require
dedicated hardware and, depending on the specific application, can run
on a standard laptop or any device with an audio input. Additionally,
if transmitting data to a remote server is unnecessary, the modular
design of the software allows it to operate entirely offline.
Each SENS sensor constitutes a low-cost solution built around a
single Raspberry Pi. The device captures sound through a connected
microphone and transmits results via a mobile network HAT with a
SIM card. Sensors can be accessed remotely through a virtual private
network (VPN), allowing over-the-air software updates and problem
resolution. In addition, a hardware watchdog allows for autonomous
system reboots if certain conditions are not met (e.g., if there are
connectivity issues). For the use case presented in this paper (see
Section 6), the sensors were connected to a continuous power supply
allowing uninterrupted operation, though solutions that make use of
batteries were also developed. The GitHub repository of the project¹
contains guidelines for building custom SENS hardware devices.
The software is developed end-to-end with a modular design that
allows flexibility: three independent but related processes run in
parallel (sound capture, processing, and transmission of results to a
remote server). The code is implemented in Python and is available in
the sens-sensor GitHub repository¹. The modular software structure
in each SENS device is further explained in the following sections.
3. AUDIO ACQUISITION
The sound capture process is straightforward. Audio is continuously
recorded and, when the audio buffer reaches a defined duration (3
seconds in the SENS implementation), it is saved as a pickle file in a
specified folder, with a filename that includes the date and time for
reference. The equivalent continuous sound levels, $L_{eq}$ and $L_{Aeq}$, are
computed for each segment and saved in a file with the same naming
convention in the same directory. While incoming audio frames are
being saved to disk, a dedicated thread continuously monitors the
saved audio files and deletes those older than a defined retention
period (30 seconds in SENS). This approach improves data privacy by
ensuring that the sensor does not permanently store audio data, while
also improving the device's storage efficiency. Microphone calibration
is essential, as it directly affects the calculation of $L_{eq}$ and $L_{Aeq}$, and,
as shown in an earlier study [7], it can influence model accuracy.

¹ https://github.com/MTG/sens-sensor
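The two operations described above, computing an equivalent level per segment and purging recordings past the retention period, can be sketched as follows. This is a minimal illustration and not the sens-sensor code: the function names, the `calibration_gain` parameter (a hypothetical per-microphone constant that converts raw samples to pascals), and the use of file modification times are all assumptions. The A-weighted level $L_{Aeq}$ would apply the same formula after an A-weighting filter, which is omitted here.

```python
import math
import os
import time

P_REF = 20e-6  # reference sound pressure in pascals (20 µPa)


def leq_db(samples, calibration_gain=1.0):
    """Equivalent continuous sound level of one segment, in dB re 20 µPa.

    `calibration_gain` is an assumed constant that maps raw amplitudes
    to pascals; it would come from microphone calibration.
    """
    if not samples:
        raise ValueError("empty segment")
    mean_square = sum((calibration_gain * s) ** 2 for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square / P_REF ** 2)


def purge_old_files(folder, retention_s=30.0, now=None):
    """Delete files in `folder` last modified more than `retention_s` seconds ago.

    Returns the sorted names of the deleted files.
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > retention_s:
            os.remove(path)
            removed.append(name)
    return sorted(removed)
```

In a deployment, `purge_old_files` would be called periodically from the monitoring thread. The dependence of `leq_db` on `calibration_gain` is why the text stresses calibration: an error in the gain shifts every reported level by a constant number of decibels.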
4. AUDIO PROCESSING
The processing module is a crucial part of the acoustic monitoring
solution, as its development requires addressing several key challenges.
To safeguard data privacy, all audio processing is performed locally
and no audio is permanently stored or transmitted. Running machine
learning models in real time on a low-cost device with limited
computational resources demands the development of lightweight
models. Additionally, to achieve adaptability to different monitoring
use cases, independent and separate models are trained for each task,
resulting in a flexible modular architecture. The following subsections
detail the research and training of the lightweight models, followed
by their implementation within the software architecture.
4.1. Model training
The use case for which SENS has been developed is focused on the
monitoring of urban spaces. Therefore, we developed sound analysis
algorithms that predict the perceptual soundscape attributes pleasantness
and eventfulness, and estimate the saliency of common sound sources
present in the acoustic environment: birds, construction works, dogs,
human activity, sirens, music and vehicles. For this purpose, we used
existing datasets with open licenses. The ARAUS dataset [8] is used for
training the pleasantness and eventfulness models. It consists of 30-second
augmented, but realistic, soundscape audios, each labeled with values
of pleasantness in [-1, 1] and eventfulness in [-1, 1], obtained following the
soundscape study methodology suggested in ISO 12913 [9]–[11]. For
the estimation of sound sources, we used several publicly available
datasets. The Urban Sound Monitoring (USM) dataset [12] was used
for birds, construction works, dogs, human activity, sirens, and music.
This dataset consists of 5-second polyphonic stereo soundscapes
composed of sounds from the FSD50K dataset [13]. Before training,
we adapted the dataset by removing irrelevant sound sources, such
as gunshot, and mapping more specific classes to general ones (for
example, cheering, scream, and speech were all mapped to human).
For each sound source, we built an independent model using a
one-vs-all classification approach, allowing the sensor to be customized
for different applications.
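The dataset adaptation and one-vs-all labelling described above can be sketched as follows. Only the removal of gunshot and the mapping of cheering, scream, and speech to human come from the text; the other map entry, the function names, and the clip representation (a list of label strings per clip) are hypothetical and serve only as illustration.

```python
# Specific-to-general class mapping. The three "human" entries come from
# the text; "dog_bark" is a made-up raw label used for illustration.
CLASS_MAP = {
    "cheering": "human",
    "scream": "human",
    "speech": "human",
    "dog_bark": "dogs",
}

# Irrelevant sources removed before training (example named in the text).
DROPPED = {"gunshot"}


def map_labels(raw_labels):
    """Map the raw labels of one clip to general sources, dropping irrelevant ones."""
    return {CLASS_MAP.get(label, label) for label in raw_labels if label not in DROPPED}


def one_vs_all_targets(clips, source):
    """Binary target per clip: 1 if `source` is present after mapping, else 0."""
    return [1 if source in map_labels(labels) else 0 for labels in clips]
```

Calling `one_vs_all_targets(clips, "human")` and `one_vs_all_targets(clips, "dogs")` on the same clips yields two independent target vectors, one per model, which is what lets each detector be trained, swapped, or omitted separately.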
The approach to build the vehicles sound source model involved the
careful combination of two datasets. A set of audios from the IDMT
Traffic dataset [14] was selected, each consisting of 2-second long
stereo audio recordings of vehicle sounds (bus, car, motorcycle, and
truck). The balanced selection of IDMT audios was combined as an
additional sound class within the UrbanSound8k (US8k) dataset [15].
This dataset originally includes sound excerpts (<= 4 s) of urban
sounds from 10 classes, including noisy sounds like air conditioner,
drilling and engine idling, that were cut to 2 seconds long. Again,
the vehicles model is trained following a one-vs-all approach.
Despite the various datasets used for each parameter, the remaining
training process was consistent across all. The algorithms take as input
sound representations generated using LAION-AI's CLAP (Contrastive
Language-Audio Pretraining) 630k-fusion-best model [16]. Previous
research has demonstrated that this representation performs well in
similar classification tasks [7], [17]. The CLAP model produces a
512-dimensional embedding vector, denoted as:

$$E = \{E_0, E_1, \ldots, E_{511}\}, \quad E \in \mathbb{R}^{512} \qquad (1)$$

Fig. 1: When SENS boots up, all machine learning models (CLAP and
individual, both perceptual and source detection, models) load before
starting normal operation. The graph shows the device's memory usage
for different configurations with original CLAP or with simplified
CLAP model (which only includes the audio encoder), and the use,
or not, of PCA for reducing embeddings' dimensionality.
Due to the SENS device's limited computational capability, using
the original CLAP model along with its raw embeddings caused the
memory load to reach its limit, frequently resulting in system freezes.
CLAP models function by learning a joint embedding space for both
audio and textual descriptions. The original LAION-AI CLAP model
consists of two branches: one for converting text to an embedding
space and another for converting audio into the same embedding
space. To optimize our use case, since bi-directional matching was not
required, we removed the text encoder. By doing so, we significantly
reduced the model's memory consumption, making it more feasible
for real-time processing on the Raspberry Pi without compromising
accuracy.
To further optimize memory usage, we used Principal Component
Analysis (PCA) to reduce the dimensionality of the embedding space.
A PCA transformation was derived from an analysis of over 25,000
audio samples from the ARAUS dataset. The results indicated that 50
principal components were sufficient to explain 95.46% of the data
variability, with a negligible effect on the prediction accuracy. The
reduced embedding vector is given by:

$$E' = \{E'_0, E'_1, \ldots, E'_{49}\}, \quad E' \in \mathbb{R}^{50} \qquad (2)$$
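For reference, the projection behind Eq. (2) can be written out explicitly. This is the textbook PCA formulation under the assumption of mean-centering, not a statement of the exact procedure used for SENS. The symbols $\mu$ (mean embedding over the analysed samples), $W$ (matrix whose columns are the 50 leading principal directions), and $\lambda_i$ (eigenvalues of the embedding covariance, sorted in decreasing order) are introduced here for illustration only.

```latex
E' = W^{\top} (E - \mu), \qquad
W \in \mathbb{R}^{512 \times 50}, \quad \mu \in \mathbb{R}^{512}

\frac{\sum_{i=1}^{50} \lambda_i}{\sum_{i=1}^{512} \lambda_i} \approx 0.9546
```

The second expression is the cumulative explained-variance ratio, which is the quantity the text reports as 95.46% for 50 components. On the device, only $\mu$ and $W$ need to be stored, and each embedding costs one 512-by-50 matrix-vector product to reduce.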
Figure 1 illustrates the sensor's boot-up under four different
configurations, reflecting the impact that cleaning the CLAP model
and applying PCA have on the boot-up time and memory load.
Using our reduced embedding space, a variety of simple classifiers
were evaluated for the different properties that SENS estimates,
with final selections including Random Forest Regressor, Linear
Support Vector Classification, and Logistic Regression with LBFGS
optimization. Pleasantness and eventfulness were treated as regression
problems, with outputs ranging over [-1, 1], where −1 corresponds to
unpleasant and uneventful, and 1 represents pleasant and eventful,
respectively. The sound source models, on the other hand, followed a
one-vs-all classification approach. Their output lies within [0, 1] and
represents the model's estimated likelihood that the input belongs to
the positive class. The resulting models achieve Mean Absolute Errors
(MAE) of 0.22 and 0.20 on the validation sets for pleasantness and
eventfulness, respectively, and the sound source classification models
achieve precision scores (i.e., the proportion of samples predicted as
positive that are truly positive) of at least 80% across all categories on
the validation set (Table 1). The code used for training the models is
available in the project's GitHub repository.

Table 1: Summary of datasets and regressors/classifiers used for
training models to predict various parameters, with corresponding
validation metrics. Algorithm acronyms: Random Forest Regressor
(RFR), Support Vector Classification (SVC), Logistic Regression (LR).

Parameter      Dataset              Algorithm     Metrics (val)
Pleasantness   ARAUS                RFR           0.22 MAE
Eventfulness   ARAUS                RFR           0.20 MAE
Birds          USM                  Linear SVC    97% precision
Construction   USM                  Linear SVC    81% precision
Dogs           USM                  Linear SVC    92% precision
Human          USM                  Linear SVC    83% precision
Sirens         USM                  Linear SVC    88% precision
Music          USM                  LR - LBFGS    80% precision
Vehicles       IDMT-Traffic, US8k   Linear SVC    100% precision
4.2. Model deployment
The implementation of the trained AI models is relatively
straightforward. A dedicated thread continuously monitors the folder where
audio files are saved by the capturing module, waiting for new
incoming data. As soon as a new file appears, the audio data is
read to generate a reduced embedding vector $E'$. This is passed to
the set of AI models, each of which outputs a prediction value. Because
pleasantness and eventfulness constitute integrated
perceptions of the soundscape rather than instantaneous measurements,
the processing module also aggregates the 10 most recent audio frames
(corresponding to 30 seconds of audio data in the SENS use case).
Thus, another reduced embedding vector is generated and passed
to the pleasantness and eventfulness models, obtaining predictions
of these parameters over a longer period. Finally, integrating the
previously saved sound levels, $L_{eq}$ and $L_{Aeq}$, the module generates an
output dictionary with all the compiled results. This is stored in a
JSON file within a designated folder.
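The 30-second aggregation and the final result record can be sketched as follows. This is an illustration under stated assumptions, not the sens-sensor implementation: the text does not say how the 10 frames are aggregated, so the sketch assumes their samples are concatenated before the embedding is computed, and the class name, function name, and dictionary keys are all hypothetical.

```python
import json
from collections import deque


class RollingAudio:
    """Keeps the most recent `n_frames` audio frames (10 x 3 s = 30 s in the text)."""

    def __init__(self, n_frames=10):
        # A bounded deque discards the oldest frame automatically.
        self.frames = deque(maxlen=n_frames)

    def push(self, frame):
        self.frames.append(list(frame))

    def concatenated(self):
        """Samples of all buffered frames, oldest first (assumed aggregation)."""
        return [sample for frame in self.frames for sample in frame]


def build_result(timestamp, levels, frame_preds, integrated_preds):
    """Assemble one JSON-serialisable result record (keys are hypothetical)."""
    return {
        "datetime": timestamp,
        "levels": levels,                # e.g. {"Leq": ..., "LAeq": ...}
        "frame": frame_preds,            # per-3-second predictions
        "integrated": integrated_preds,  # 30-second pleasantness/eventfulness
    }
```

A caller would push each new 3-second frame, embed both the frame and `concatenated()`, run the models, and write `json.dumps(build_result(...))` to the results folder.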
5. DATA TRANSMISSION
For data transmission, each SENS device in the use case network
operates remotely and sends the analysis results to a remote server
for storage and display on a web platform. This requires appropriate
hardware and poses challenges, particularly in terms of internet data
consumption. While HTTPS is widely used because it ensures secure
data transfer, it can add significant overhead due to large headers
(5–10 KB). For example, each JSON file generated by the processing
module for a 3-second audio chunk is about 650 bytes. Sending
each result individually at a rate of 20 messages per minute would
result in over 8 GB of monthly data usage per sensor. To mitigate this,
multiple JSON files are batched together into a single HTTPS request,
significantly reducing data consumption to 2–3 GB per month.
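The effect of batching can be checked with a back-of-envelope model that counts only per-request header overhead and JSON payload. The function and its defaults are assumptions made for this estimate: a 30-day month, decimal units, and a header overhead of 10 KB per request (the upper end of the range quoted above).

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumed 30-day month


def monthly_bytes(payload_b=650, header_b=10_000, chunk_s=3, batch=1):
    """Bytes sent per month, counting headers and payload only."""
    results = SECONDS_PER_MONTH // chunk_s  # one result per 3 s chunk (20 per minute)
    requests = results // batch             # one HTTPS request per batch
    return requests * (header_b + batch * payload_b)


unbatched_gb = monthly_bytes(batch=1) / 1e9   # about 9.2 GB
batched_gb = monthly_bytes(batch=10) / 1e9    # about 1.4 GB
```

Under these assumptions the unbatched figure is about 9.2 GB per month, in line with the "over 8 GB" quoted above. The batched estimate of about 1.4 GB is below the reported 2–3 GB; the gap presumably reflects traffic this simple model does not count, so the reported figures should be taken as the reference.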
The data transmission module in the proposed solution consists of
a script which continuously monitors the folder where analysis results
are stored and sends them to the remote server once a number of
files have accumulated (10 in our implementation). Their contents are
combined into a single payload and transmitted to the remote server via
the mobile network. If the network connection is unavailable, audio
acquisition and processing continue locally, and transmission resumes
automatically once the connection is restored. Upon successful receipt,
the transmitted JSON files are deleted locally. The custom server stores
all incoming data in a database and offers a visualization tool that
enables users to monitor active sensors, check real-time status metrics
like memory and CPU usage, and visualize processed data with
interactive graphs. A detailed description of this server framework is
beyond the scope of this paper.
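The batching behaviour described above (wait for 10 result files, send them as one payload, delete them only after successful receipt) can be sketched as follows. This is a minimal illustration rather than the project's script: `send_batch` is a hypothetical name, the `send` callable stands in for the actual HTTPS request, and the sketch assumes result filenames sort chronologically, as timestamped names would.

```python
import json
import os


def send_batch(folder, send, batch_size=10):
    """Send the oldest `batch_size` JSON results as one payload.

    `send` is a caller-supplied function (hypothetical) that takes a JSON
    string and returns True once the server has acknowledged receipt.
    Returns the number of files sent: 0 if too few files have accumulated
    or if sending failed, in which case the files are kept for a retry.
    """
    names = sorted(n for n in os.listdir(folder) if n.endswith(".json"))
    if len(names) < batch_size:
        return 0
    batch = names[:batch_size]
    records = []
    for name in batch:
        with open(os.path.join(folder, name), encoding="utf-8") as fh:
            records.append(json.load(fh))
    try:
        ok = send(json.dumps(records))
    except OSError:  # e.g. network unreachable
        ok = False
    if not ok:
        return 0
    # Delete local copies only after the server confirmed receipt.
    for name in batch:
        os.remove(os.path.join(folder, name))
    return len(batch)
```

Because files are removed only after a confirmed send, a connectivity outage simply lets results accumulate on disk, and repeated calls drain the backlog in batches once the connection returns.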
6. REAL-WORLD DEPLOYMENT
A network of five SENS devices has been deployed in the real world
as part of Smart Iruña Lab [18], a smart cities program carried out in
the city of Pamplona/Iruña. With over three months of deployment,
SENS technology has proven to be a reliable tool for noise monitoring.

Fig. 2: SENS device deployed in the city of Pamplona/Iruña as part
of the Smart Iruña Lab program.
In order to extract meaningful information from the raw monitored
data, it is necessary to make it interpretable and practically useful by
applying methodological choices (such as setting thresholds based on
empirical testing to define when a sound source is considered active)
and to aggregate data by statistical mean, the percentage of time above
a threshold, or the number of detected events. Figure 3
presents an example of SENS monitoring results: the data aggregated
by hour for the week of May 12–18 at one of the monitored sites
in Pamplona/Iruña. In these plots, the 360 degrees represent the 24
hours of the day, while each concentric circle corresponds to a day of
the week, with Monday at the center and Sunday on the outermost
ring. The selected location is a residential neighborhood very close
to the city center, known for its vibrant street life and frequent visits
from young people due to nearby nightclubs. It is generally perceived
as quite noisy because of constant traffic throughout the week.
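The hourly aggregation behind plots of this kind can be sketched as follows. The function name and the input format (timestamped scores in [0, 1]) are assumptions made for illustration; with fixed-length frames, the share of frames above the threshold equals the percentage of time above it.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_by_hour(records, threshold):
    """Aggregate (datetime, score) pairs per weekday and hour.

    Returns {(weekday, hour): (mean_score, percent_above_threshold)},
    where weekday 0 is Monday.
    """
    buckets = defaultdict(list)
    for ts, score in records:
        buckets[(ts.weekday(), ts.hour)].append(score)
    out = {}
    for key, scores in buckets.items():
        mean = sum(scores) / len(scores)
        pct = 100.0 * sum(s > threshold for s in scores) / len(scores)
        out[key] = (mean, pct)
    return out
```

Each key maps directly onto the circular plots described above: the hour gives the angle and the weekday gives the ring. The mean would feed the perceptual attributes and the percentage above threshold would feed the source-presence panels.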
Subplot (a) shows the $L_{Aeq}$. This confirms the high noise levels
in the area, with $L_{den}$ values ranging from 68 to 70 dB every day.
To better understand the soundscape, it is useful to examine the
perceptual attributes alongside the detected sound sources. First,
however, it should be noted that, following the criteria of the city
council, any activity at night (from 23:00 to 07:00) is interpreted as
reducing pleasantness: the values of pleasantness during these hours are
adjusted to reflect greater unpleasantness when activity is high, using
the inverse of the eventfulness values. Looking at graph (b), we see
that the day and evening periods tend to be neither distinctly pleasant
nor unpleasant, likely because of the constant presence of traffic, as
illustrated in graph (f). Nights in the second half of the week appear
more unpleasant, which corresponds with higher activity levels (graph
(c)) during the early morning hours from Thursday to Sunday. This
pattern is explained by graph (e), which shows an increase in human
presence during the same periods, indicating nightlife activity associated
with the surrounding clubs and bars. Another notable aspect is the
weekend activity peaks around 12:00–15:00 and 17:00–20:00. These
high levels are also linked to human presence, suggesting that the
area is a popular gathering spot during these times. The low detection
rates of birds (perceived 20–30% of the time) indicate that the area
is highly urbanised, offering limited refuge for wildlife.

(a) LAeq - A-weighted equivalent noise pressure level in decibels.
(b) Integrated Pleasantness - Mean values adjusted to [0,1].
(c) Integrated Eventfulness - Mean values adjusted to [0,1].
(d) Birds presence - Time(%) over threshold (th = 0.15 within the range [0,1]).
(e) Human presence - Time(%) over threshold (th = 0.2 within the range [0,1]).
(f) Vehicles presence - Time(%) over threshold (th = 0.1 within the range [0,1]).
Fig. 3: Circular graphs of hourly data from May 12–18 of one site monitored in Pamplona/Iruña.
Altogether, this data illustrates how SENS enables us to understand
the acoustic environment without needing to be physically present. By
combining continuous measurement and intelligent data aggregation,
we gain valuable insights into daily and weekly sound patterns. If
noise issues arise, these insights provide robust evidence to identify
the most effective measures to improve the urban soundscape.
7. CONCLUSION AND FUTURE WORK
The emergence of AI opens new opportunities for real-time sound
analysis. The generally high cost of noise monitoring devices on the
market raises the need to develop solutions based on small low-cost
devices. Nevertheless, deploying AI-based sound analysis algorithms
on real-time acoustic sensors presents numerous challenges, including
the computational constraints linked with the need to protect data
privacy. This paper has introduced SENS as a practical use case that
addresses these difficulties through a low-cost, flexible, and
privacy-preserving solution. By performing all audio processing locally on the
device, SENS reduces the risk of compromising personal privacy. To
achieve this, significant effort was made to reduce the computational
load of the trained models through careful pruning and the use
of Principal Component Analysis (PCA) for input dimensionality
reduction. Further, the general software architecture is modular,
with separate components for audio capture, signal processing, and
transmission of analysis results to a remote server when required.
Moreover, the methodology proposes the development of independent
models for each acoustic parameter of interest, enhancing the system's
adaptability and scalability. Internet data consumption is further
optimized through batching and packaging multiple analysis results
before transmission. To demonstrate its real-world viability, a network
of SENS devices has been deployed in an urban environment. This
long-term deployment has shown that the system is robust, reliable,
and capable of generating meaningful insights into the acoustic
environment.
Future work will focus on further optimizing and expanding the
methodology, either by exploring the minimum viable sampling frequency
needed for accurate predictions, to help reduce processing load even
further, or by using more lightweight protocols like MQTT for data
transmission. Additionally, the modular design of SENS makes it
well-suited for other applications beyond urban noise monitoring.
Future applications could include traffic monitoring (e.g., counting
vehicles or distinguishing between light and heavy traffic), public
safety (e.g., detecting distress calls on the streets), or any other scenario
where sound can serve as a valuable real-time input of information.
8. ACKNOWLEDGMENT
This work was supported by IA y Música: Cátedra en Inteligencia
Artificial y Música (TSI-100929-2023-1) by Secretaría de Estado
de Digitalización e Inteligencia Artificial and the European
Union-Next Generation under Cátedras ENIA 2022, and the IMPA project
(PID2023-152250OB-I00) by the Ministry of Science, Innovation
and Universities of the Spanish Government, the Agencia Estatal de
Investigación (AEI) and co-financed by the EU. We would also like
to acknowledge Keacoustics for the hardware support and Smart Iruña
Lab for the opportunity to deploy the system.
REFERENCES
[1] AudioMoth, https://www.openacousticdevices.info/audiomoth.
[2] D. Bonet-Solà, E. Vidaña-Vila, and R. M. Alsina-Pagès, “Acoustic comfort
prediction: Integrating sound event detection and noise levels from a
wireless acoustic sensor network,” Sensors, vol. 24, no. 13, p. 4400, Jul.
2024.
[3] CENSE project, https://cense.ifsttar.fr/en/.
[4] J. Ardouin, J.-C. Baron, L. Charpentier, D. Ecotière, N. Fortin, F. Gontier,
G. Guillaume, M. Lagrange, G. Libouban, J. Picaut, and C. Ribeiro, “A
high density network of low cost acoustic sensors based on wired and
airborne transmission of spectral data,” Euronoise, 2021.
[5] M. Tailleur, V. Lostanlen, J.-P. Rivière, and P. Aumond, “Machine
listening in a neonatal intensive care unit,” DCASE, 2024.
[6] P. Ngamthipwatthana, M. Götze, A. Kátai, and J. Abeßer, “Towards
measuring and forecasting noise exposure at the Veltins-Arena in Gelsenkirchen,
Germany,” DCASE, 2024.
[7] A. Sagasti, M. Rocamora, and F. Font, “Prediction of pleasantness and
eventfulness perceptual sound qualities in urban soundscapes,” DCASE
Workshop, 2024.
[8] ARAUS dataset, https://researchdata.ntu.edu.sg/dataset.xhtml?persistentId=doi:10.21979/N9/9OTEVX.
[9] ISO 12913-1. Acoustics-Soundscape-Part 1: Definition and conceptual
framework, www.iso.org.
[10] ISO 12913-2. Acoustics-Soundscape-Part 2: Data collection and reporting
requirements, www.iso.org.
[11] ISO 12913-3. Acoustics-Soundscape-Part 3: Data analysis, www.iso.org.
[12] Urban Sound Monitoring (USM) dataset, https://github.com/jakobabesser/USM.
[13] FSD50K Dataset, https://zenodo.org/records/4060432.
[14] IDMT-TRAFFIC Dataset, https://www.idmt.fraunhofer.de/en/publications/datasets/traffic.html.
[15] Urban Sound 8k dataset, https://urbansounddataset.weebly.com/urbansound8k.html.
[16] LAION-AI/CLAP GitHub Repository, https://github.com/LAION-AI/CLAP.
[17] R. O. Araz, D. Bogdanov, P. Alonso-Jiménez, and F. Font, “Evaluation of
deep audio representations for semantic sound similarity,” International
Conference on Content-based Multimedia Indexing (CBMI), 2024.
[18] Smart Iruña Lab, https://www.pamplona.es/smart-iruna-lab.
Bioacoustics on Tiny Hardware at the BioDCASE 2025 Challenge
Giovanni Carmantini,1a Yasmine Benhamadi,2b Matthieu Carreau,2c Minkyung Kwak,3b Ilaria Morandi,4b
Friedrich Förstner,1,5a Pierre-Emmanuel Hladik,2c Mathieu Lagrange,2bc Pavel Linhart,6b Tereza Petrusková,3b
Vincent Lostanlen,2bc Stefan Kahl5,7
1 fold ecosystemics, 2 Bis Route d'Hadigny, 88330, Badménil-aux-Bois, France
2 Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
3 Department of Ecology, Faculty of Science, Charles University, Viničná 7, Prague 2 12844, Czechia
4 Department of Zoology, Faculty of Science, University of South Bohemia,
Branišovská 1716/31c, České Budějovice 37005, Czechia
5 Chemnitz University of Technology, Straße der Nationen 62, 09111 Chemnitz, Germany
6 Center for Sustainable Landscapes under Global Change (SustainScapes), Department of Biology,
Aarhus University, Ny Munkegade 114-116, DK-8000 Aarhus C, Denmark
7 K. Lisa Yang Center for Conservation Bioacoustics, Cornell Lab of Ornithology, Cornell University,
159 Sapsucker Woods Road, Ithaca, NY 14850, USA
Abstract—The BioDCASE initiative aims to encourage the invention
of methods for detection and classification of acoustic scenes and events
(DCASE) within the domain of bioacoustics. We have contributed to
the first edition of the BioDCASE challenge by means of a task named
“bioacoustics on tiny hardware”. The motivation for this task resides in
the growing need for operating bioacoustic event detection algorithms on
low-power autonomous recording units (ARUs). Participants were tasked
with developing a detector of bird vocalizations from the yellowhammer
(Emberiza citrinella), given two hours of audio as a training set. The
detector had to run within the resource constraints of an ESP32-S3
microcontroller unit. By evaluating the submitted models on a withheld
dataset, we conducted an independent benchmark that assessed both
classification performance and resource efficiency through multiple metrics:
average precision, inference time, and memory usage. Our reported
results confirm that recent advances in “tiny machine learning” (TinyML)
have transformative potential for computational bioacoustics. For more
information, please visit: https://biodcase.github.io/challenge2025/task3
Index Terms—Acoustic event detection, autonomous recording units,
bioacoustics, edge computing, passive acoustic monitoring.
1. INTRODUCTION
Bioacoustics, understood as the science of sonic interaction in and
between animals, requires sensors for data collection [1]. For animal
behavior research [2] or biodiversity surveys [3], these sensors are
typically deployed in remote locations, off the electrical grid. As
of today, most commercially available sensors for birds and land
mammals are battery-powered, with a battery life of 200–500 hours and
a cost of $100–$700 [4]. They record digital audio, either continuously
or according to an intermittent schedule, and store it on an SD card.
Although they are often referred to as “autonomous recording units”
(ARUs), they are not fully autonomous: indeed, frequent round trips
from the lab to the deployment site are necessary in order to replace
batteries, transfer the recorded data, and reset the SD card.
P e ious publica ions ha e ale ed on he lack o au onomy o
cu en -gene a ion ARU’s and i s nega i e consequences [5]–[7]:
a The development of the evaluation system was funded by the German
Federal Ministry of Education and Research as part of the “BirdNET+” project
(FKZ 01-S22072). The competition received additional support from the
German Federal Ministry for the Environment, Nature Conservation, Nuclear
Safety and Consumer Protection through the “DeepBirdDetect” project (FKZ
67KI31040E).
b Supported by EU MSCA Doctoral Network Bioacoustic AI (101071532).
c This research was funded, in whole or in part, by the French National
Research Agency (ANR) under the project OWL “ANR-23-IAS3-0003-01”.
1) An insufficient uptime or ill-suited schedule may cause the
ARU to misportray these dynamics of vocal activity, eventually
reducing its usefulness for statistical hypothesis testing [8].
2) The cost of battery replacement and the weight of hardware is
a serious challenge for practitioners [9], [10].
3) Although they do not require direct contact or manipulation
of the animals, ARU’s may still be considered invasive if its
maintenance raises the level of anthropogenic pressure; that is,
the stress induced by the presence of humans [11].
4) The same data which are collected for research may be used for
surveillance; an effect known as surveillance creep [12]. Some
terrestrial bioacoustic datasets contain voices from people who
are unaware of being recorded [13]. In the age of automated
audio content analysis, the surveillance creep of acoustic sensors
is not only a risk but a documented reality [14].
5) As technological advancements accelerate and costs decrease,
the ecological impact of production and the growing issue of
electronic waste (e-waste) have emerged as critical sustainability
challenges [15]. For lack of a better end-of-life management
[16], it would be prudent to strive for a lower dependency on
batteries, or even switch to batteryless hardware [17].
Facing the drawbacks of current-generation ARU’s, we propose to
explore an alternative design: on-device sound event detection (SED).
The key idea is to develop an algorithm which is able to analyze the
audio stream in real time and record sound events selectively. In the
context of wireless acoustic sensor networks, this is known as edge
computing as opposed to cloud computing. The main argument in favor
of edge computing on ARU’s is that the bioacoustic events of interest
are typically few, while storing audio permanently is costly in terms
of energy. If SED can be made efficient enough, the energy savings
of selective storage can cover and outweigh the energy expenditure of
edge computing [18]. Beyond the net gain in uptime, this simple idea
has the potential to make the next generation of ARU’s cheaper to
maintain, less invasive, more privacy-preserving, and more durable.
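The break-even argument above can be written as a simple inequality. This is only a sketch under assumed notation that does not appear in the original: $E_{\mathrm{store}}$ is the energy needed to store all audio continuously over a deployment, $\rho \in [0, 1]$ is the fraction of audio retained by selective recording, and $E_{\mathrm{SED}}$ is the energy spent running the detector over the same period. It also assumes that storage energy scales linearly with the amount of audio stored.

```latex
E_{\mathrm{SED}} + \rho \, E_{\mathrm{store}} < E_{\mathrm{store}}
\quad \Longleftrightarrow \quad
E_{\mathrm{SED}} < (1 - \rho) \, E_{\mathrm{store}}
```

When the events of interest are rare, $\rho$ is small, so under these assumptions almost the entire storage budget is available to pay for on-device detection.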
Machine learning algorithms make it possible to automate various bioacoustics
tasks, including call detection or species classification. Previous
works demonstrate the feasibility of running these algorithms on
“tiny hardware” such as microcontroller units (MCU’s). For example,
[19] have trained an MFCC-based classifier for species-specific call
detection on a low-power MCU, with a memory usage of the order of
one kilobyte. Another example is [20], who have shown the feasibility
of SED on the AudioMoth, a widespread and open-source ARU [21]:
the proposed algorithms rely on MFCC and Goertzel filtering and are
applied to cicadas and gunshots.
Detection and Classification of Acoustic Scenes and Events 2025 30–31 October 2025, Barcelona, Spain
Fig. 1: Spectrogram representations of yellowhammer (Emberiza citrinella) song. Brighter colors denote relatively large time–frequency magnitudes after
normalization for visual contrast. Rows illustrate inter-individual variation while columns illustrate the effect of distance. See Section 2 for details.
Recent work at the intersection of “tiny machine learning” (TinyML)
and bioacoustics has proposed to embed deep neural networks,
particularly convolutional networks (convnets), onto low-cost hardware.
For example, [22] have trained a convnet to recognize 50 classes of
environmental sounds on a Sony Spresense MCU.
2. DATASET
The development set for task 3 of BioDCASE 2025 “Bioacoustics
for Tiny Hardware” consists of 2 hours and 37 mins worth of audio
recordings. See Table 1 for a breakdown.
2.1. Data collection
Yellowhammers are widespread across Eurasia and their songs have
attracted the attention of naturalists and scientists for over one and
a half centuries (e.g., [23]). They have long been a popular model
species, especially for studying dialects in birdsong [24]. They are also
an indicator of healthy farmland showing rapid population declines
around Europe (e.g., [25]), and their individually specific songs might
offer detailed noninvasive insights into fast population changes. The
yellowhammer songs used for BioDCASE 2025 were initially recorded
as a part of sound transmission experiments that investigated how
far it is still possible to identify Yellowhammer individuals by their
songs. To collect the songs, AudioMoth recorders were placed near
the bird’s favorite singing posts less than 5 m from the singing birds.
The best quality songs were selected and put into one track. The
track contained in total 209 songs from 10 individuals, including in
total 21 different song types (2–3 song types per male), with ca. 10
repetitions of each song type.
The track was then played back and re-recorded at 7 different
distances ranging from 6.5 to 200 m (see Figure 1 for representative
examples of songs at different distances). Taking into account the
original recording distance (< 5 m), the songs in the closest distance
category simulate birds being anywhere between 6.5 m and 11 m away.
Note that the signal-to-noise ratio (SNR) is quite low in the last two
distance categories, and songs that were not visible on the spectrogram
were not included in the dataset, which means that these distance
categories may have fewer than 209 samples. The playback was carried
out in two different environments (forest and grassland). Altogether,
each song is repeated up to 15 times in the dataset (original distance +
7x forest + 7x grassland). These re-recorded songs were annotated and
split into separate files and used in the challenge as positive files. In
addition, the original recordings of live Yellowhammers were screened,
annotated and split to obtain negative files of similar duration. These
included songs from other known species or background noise only.
2.2. Data curation
All audio recordings were clipped to 2 seconds and resampled at
16 kHz. The dataset was divided into training and validation sets.
Yellowhammer recordings were split by individual: 6 for training, 2
for validation, and 2 held out for evaluation. Negative samples were
randomly split.
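The curation steps above can be sketched as follows. This is a minimal illustration, not the organizers' pipeline: the function names, the zero-padding of short clips, the random assignment of individuals, and the seed are all assumptions; only the 2-second length, the 16 kHz rate, and the 6/2/2 split come from the text. Resampling itself is omitted.

```python
import random

SAMPLE_RATE_HZ = 16_000                        # stated in the text
CLIP_SECONDS = 2                               # stated in the text
CLIP_SAMPLES = SAMPLE_RATE_HZ * CLIP_SECONDS   # 32,000 samples per clip


def clip_or_pad(samples, length=CLIP_SAMPLES):
    """Cut a waveform to `length` samples; zero-pad if shorter (assumption)."""
    clipped = list(samples[:length])
    return clipped + [0.0] * (length - len(clipped))


def split_by_individual(individual_ids, n_train=6, n_val=2, n_eval=2, seed=0):
    """Assign whole individuals to subsets so that no bird spans two sets."""
    ids = sorted(set(individual_ids))
    if len(ids) != n_train + n_val + n_eval:
        raise ValueError("number of individuals does not match split sizes")
    random.Random(seed).shuffle(ids)  # hypothetical: random assignment
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "eval": set(ids[n_train + n_val:]),
    }


def subset_of(individual_id, split):
    """Return the name of the subset that holds `individual_id`."""
    for name, members in split.items():
        if individual_id in members:
            return name
    raise KeyError(individual_id)


split = split_by_individual(range(10))
print({name: sorted(members) for name, members in split.items()})
```

Splitting by individual rather than by file keeps every song of a given bird in a single subset, so validation and evaluation measure generalization to unseen individuals.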
3. CHALLENGE
3.1. Rules
The challenge tasked participants with developing a Yellowhammer
bird vocalization detection system for the ESP32-S3-Korvo-2
microcontroller, using the training and validation sets from the
Yellowhammer dataset (Section 2), and our baseline framework (Section 3.3).
Submissions had to be deployable on the target hardware and capable
of real-time audio processing. The submitted models would then
be evaluated by organizers on the (hidden) evaluation set using
the benchmarking facilities of the baseline framework, on a range
of metrics to do with model classification performance as well
as its resource efficiency, described in the next section. Rankings
Table 5: Neural network-based caption classification into audio/image by
modality. Each cell gives the number of captions, with the row percentage in parentheses.
Caption type        Audio             Image
audio captions      4386 (71.78%)     1724 (28.22%)
visual captions      787 (10.36%)     6812 (89.64%)
AV captions         1001 (12.92%)     6745 (87.08%)
GPT AV captions     1895 (30.65%)     4288 (69.35%)
Table 5 shows the average classification results by caption type.
Our classification model achieved high accuracy, correctly identifying
71.78% of audio captions and 89.64% of visual captions.
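As a quick consistency check, the percentages in Table 5 can be recomputed from the counts in each row. The snippet below is only an illustrative sketch; the dictionary layout and the function name are assumptions made for this example, while the counts are copied from the table.

```python
# Counts copied from Table 5: (classified as audio, classified as image).
TABLE_5_COUNTS = {
    "audio captions": (4386, 1724),
    "visual captions": (787, 6812),
    "AV captions": (1001, 6745),
    "GPT AV captions": (1895, 4288),
}


def row_percentages(n_audio, n_image):
    """Return the share of captions assigned to each class, in percent."""
    total = n_audio + n_image
    return 100.0 * n_audio / total, 100.0 * n_image / total


for caption_type, (n_audio, n_image) in TABLE_5_COUNTS.items():
    pct_audio, pct_image = row_percentages(n_audio, n_image)
    print(f"{caption_type:16s} audio {pct_audio:5.2f}%  image {pct_image:5.2f}%")
```

Each row sums to 100%, so the 71.78% and 89.64% quoted above are per-type rates for audio and visual captions respectively, not an overall accuracy.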
When comparing the results with the structure-based classification,
we observe that the structure-based method assigns slightly more captions to the audio category
than the neural network-based method. In contrast, the neural network-based
classifier consistently labels a higher percentage of captions as
visual across all caption types. This suggests that the neural network-based
model is more confident, or biased, towards the visual modality.
One possible explanation for this behavior is overfitting in the CNN
model. Although we applied a resampling strategy to balance the
training data, the overall dataset still contains significantly more
visual captions than audio ones. This imbalance may have influenced
the model to favor the visual class during prediction, especially in
ambiguous cases.
6. MODEL INTERPRETABILITY
In the structure-based method, it is possible to interpret the output
because classification is based on SVO triplets. In contrast, the neural
network-based classification behaves more like a black box, making
it harder to understand why a sentence is classified as an audio or
image caption. To address this, we use integrated gradients, introduced
in [28], a method to understand which input features contribute most
to the model’s prediction. It works by comparing the model’s output
on the actual input to a baseline input (e.g. all zeros), which represents
the absence of features. The method computes gradients of the model’s
output with respect to the input and accumulates them along the
straight-line path from the baseline to the actual input. Integrated gradients satisfy two key
axioms: sensitivity, which ensures that features affecting the output
receive non-zero attribution, and implementation invariance, which
guarantees consistent attributions for functionally equivalent models.
This makes it a theoretically sound and interpretable method for
explaining deep learning predictions.
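To make the mechanism concrete, the following is a minimal sketch of integrated gradients on a toy logistic model. It is not the CNN classifier studied here and it does not use Captum; the weights, the input, the number of steps, and the midpoint Riemann approximation of the path integral are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy logistic model standing in for a classifier output."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def gradient(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann approximation of the integrated gradients integral."""
    accumulated = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the actual input.
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w, b)):
            accumulated[i] += g
    return [(xi - bi) * acc / steps
            for xi, bi, acc in zip(x, baseline, accumulated)]


w, b = [1.5, -2.0, 0.0], 0.1        # hypothetical weights and bias
x, baseline = [1.0, 0.5, 3.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)
print(attributions)
print(sum(attributions), model(x, w, b) - model(baseline, w, b))
```

In this toy setting the attributions sum, up to numerical error, to the difference between the model output at the input and at the baseline, and the third feature, which has zero weight and therefore cannot affect the output, receives zero attribution.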
For our experiments we use Captum², an open source library
for model interpretability built on PyTorch. To obtain word-level
attribution scores, we apply the Integrated Gradients method and sum
the attribution scores across all embedding dimensions for each word.
This gives us a single attribution score per word, indicating how
much that word contributed to the model’s final decision. The overall
attribution score for a sentence is then the sum of its word-level
attributions. In our case, illustrated in Figure 2, the model is trained to
classify sentences into two modalities: class 0 corresponds to image
captions; class 1 corresponds to audio captions.
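The aggregation described above can be sketched in a few lines. The example assumes that per-dimension attributions are already available as one list of numbers per word; the words and values below are made up for illustration and are not results from the paper.

```python
def word_attributions(embedding_attributions):
    """Sum attributions over embedding dimensions: one score per word."""
    return [sum(dimensions) for dimensions in embedding_attributions]


def sentence_attribution(embedding_attributions):
    """Sum the word-level scores to obtain the sentence-level score."""
    return sum(word_attributions(embedding_attributions))


# Hypothetical sentence with a 2-dimensional embedding per word.
words = ["dog", "growling", "outside"]
per_dimension = [
    [0.25, -0.125],   # "dog"
    [0.5, 0.25],      # "growling"
    [-0.25, 0.0],     # "outside"
]

for word, score in zip(words, word_attributions(per_dimension)):
    direction = "audio (class 1)" if score > 0 else "image (class 0)"
    print(f"{word:10s} {score:+.3f}  pushes toward {direction}")
print("sentence score:", sentence_attribution(per_dimension))
```

With the sign convention used in the text, a positive word score is a push toward the audio class and a negative score is a push toward the image class.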
The attribution scores help us understand how much each word
pushed the model toward or away from predicting a specific class. A
positive attribution score means the word contributed toward predicting
class 1 (audio), while a negative score indicates a push toward
class 0 (image). Table 6 shows the top ten words with highest and
lowest attribution scores, with words like growling, sounded, howling,
chirp, and chirping strongly associated with the audio modality. In
contrast, words with strong negative attribution scores, including grouped,
cupcakes, administers, powerful, and participates, are more aligned
with the image modality. If we look at the top 10 most frequent
words in the AVCaps dataset, words like “speaking” (0.64),
“talking” (0.58), and “singing” (0.42) have high positive attribution
scores, indicating that they strongly support the model’s prediction of
the audio class. These words are semantically related to sound actions.
In contrast, words such as “sitting” (-0.54), “man” (-0.24), and “child”
(-0.19) have negative attribution scores, suggesting they are more
indicative of the image class. These terms typically describe visual
scenes or entities, which are more likely to appear in image captions.
Interestingly, some high-frequency words like “playing” (0.09) and
“baby” (-0.11) have attribution scores close to zero, indicating a more
neutral or ambiguous role in the classification task. This could be due
to their presence in both audio and image contexts, making them less
discriminative.
² https://captum.ai/docs/extension/integrated_gradients
Fig. 2: Visual representation of the word-level attribution in the model output
prediction. The green and red colors represent audio (class 1) and image
(class 0) respectively, while the intensity indicates attribution strength. The
first column is the true label, the second column is the predicted label and the
third column is the attribution score.
Table 6: Top 10 words with the highest and lowest average attribution scores
in the AVCaps dataset. Positive scores indicate stronger association with
audio captions, while negative scores suggest stronger association with image
captions.
Word         Average attribution    Word            Average attribution
growling      0.88                  badge           -0.76
sounded       0.83                  attracts        -0.76
howling       0.82                  yogurt          -0.76
chirp         0.81                  celebrity       -0.76
mumbling      0.77                  purchased       -0.79
requests      0.75                  participates    -0.80
murmurs       0.72                  powerful        -0.81
conversed     0.71                  administers     -0.83
meows         0.69                  cupcakes        -0.85
hums          0.69                  grouped         -0.89
7. CONCLUSION
In this work we have studied the linguistic and structural differences
between audio and image captions, highlighting how these differences
and similarities affect the model outputs. The linguistic analysis
revealed clear modality-specific patterns, with audio captions favoring
sound-related vocabulary and image captions focusing on visual
elements such as people and scenes. Through the Subject-Verb-Object
structures, we created modality-specific references, which were then
compared to a CNN classifier. While the CNN model performed
well, it showed a consistent bias toward the visual modality, likely
influenced by dataset imbalance. To further interpret model behavior,
we applied integrated gradients, which confirmed that words strongly
associated with sound (e.g., growling, chirping) positively contributed
to audio predictions, while visually descriptive terms (e.g., grouped,
cupcakes) supported image predictions. These findings emphasize the
importance of understanding linguistic patterns in multimodal datasets,
as they directly shape model outputs and can inform the design of
more balanced and interpretable captioning systems.