Joint Object Detection and Sound Source Separation

Author: Sunyoo Kim; Yunjeong Choi; Doyeon Lee; Seoyoung Lee; Eunyi Lyou; Seungju Kim; Junhyug Noh; Joonseok Lee

Publisher: Zenodo

DOI: 10.5281/zenodo.17706601

Source: https://zenodo.org/records/17706601/files/000095.pdf

JOINT OBJECT DETECTION AND SOUND SOURCE SEPARATION
Sunyoo Kim¹, Yunjeong Choi¹, Doyeon Lee¹, Seoyoung Lee²,
Eunyi Lyou¹, Seungju Kim³, Junhyug Noh⁴∗, Joonseok Lee¹∗
¹Seoul National University, Seoul, Korea; ²University of Texas at Austin, Texas, USA;
³Sookmyung Women’s University, Seoul, Korea; ⁴Ewha Womans University, Seoul, Korea
[email protected], [email protected], [email protected]
ABSTRACT
We propose See2Hear (S2H), a framework that jointly learns audio-visual representations for object detection and sound source separation from videos. Existing methods do not fully exploit the synergy between the detection and separation tasks, often relying on disjointly pre-trained visual encoders. Our S2H integrates both tasks in an end-to-end trainable unified structure using transformer-based architectures. A naive combination of these approaches, however, results in suboptimal performance. We propose a dynamic filtering mechanism that selects relevant object queries from the object detector to resolve this issue. We conduct extensive experiments to verify that our approach achieves state-of-the-art performance in audio source separation on MUSIC and MUSIC-21, while maintaining competitive object detection performance. Ablation studies confirm that the joint training of detection and separation is mutually beneficial for both tasks.
1. INTRODUCTION
Human perception is inherently multimodal, taking input signals from five senses and comprehensively understanding the given situation from their fusion [1–3]. Often, integrating multiple cues helps us coherently understand our surroundings. Music is no exception; for instance, seeing and recognizing a particular instrument allows us to associate it with the sound it produces. More broadly, visual cues are often useful for recognizing co-occurring sounds, and at the same time, auditory signals can help us visually perceive an object.
In the literature, researchers have explored a wide range of audio-visual learning, including self-supervised cross-modal alignment [4, 5], audio-visual representation learning [6, 7], and sound source localization [8, 9]. These efforts collectively demonstrate that visual and auditory information, when processed jointly, can provide more robust perception than considering each modality alone.

∗Corresponding authors
© S. Kim, Y. Choi, D. Lee, S. Lee, E. Lyou, S. Kim, J. Noh, and J. Lee. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S. Kim, Y. Choi, D. Lee, S. Lee, E. Lyou, S. Kim, J. Noh, and J. Lee, “Joint Object Detection and Sound Source Separation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Among this research on audio-visual correspondence, our focus is on audio-visual sound source separation [10–15], which aims to isolate individual sound sources from a complex mixture by exploiting the visual signal as an anchor. For instance, if multiple instruments are played together, seeing a violin or a trumpet often helps a model discern which frequencies belong to each instrument.
Despite this clear connection between visual identification of a source and its auditory presence in the mixture, most prior approaches treat relevant visual tasks (e.g., object detection) and sound separation independently, or sequentially. Typically, one first uses a pre-trained detector to localize instruments, then feeds the bounding boxes or region features into a separate network for separation [11, 14, 15].
However, such two-step or disjoint pipelines do not take advantage of potentially useful cues from the other modality, e.g., the visual signals for sound separation and vice versa. Accurate object localization would provide guidance for better sound isolation, and improved sound separation can in turn reinforce better visual representations by focusing on the most relevant object regions. Tackling these two problems disjointly is therefore suboptimal.
In this paper, we propose See2Hear (S2H), which jointly learns to detect objects and separate their corresponding audio signals, trained end-to-end. To achieve this, we adopt a Transformer-based architecture, well-suited to seamlessly handle multimodal inputs with minimal modality-specific encoding overhead. In particular, a multimodal Transformer enables direct attention across the audio and visual tokens, representing parts of the spectrogram and the image, respectively. We train both detection and separation in a single model, allowing gradients from both tasks to update the shared representation space.
However, a naive assembly of visual and audio Transformers with cross-attention is not scalable. More specifically, we discover that it is crucial to control the number of recognized objects. Without a proper measure, far more bounding boxes are detected during training than actually exist, making the weight updates less accurate and the computation infeasible. To resolve this issue, we incorporate a dynamic filtering mechanism that discards low-confidence or overlapping detections, thus avoiding confusion from spurious regions. As a result, our model effectively “sees” objects and “hears” their corresponding sounds in a fully integrated manner.
Our experiments verify that this unified architecture effectively exploits synergy between the two tasks, surpassing the performance of methods that rely on external or pre-extracted detections [11, 14, 15].
Our main contributions are summarized as follows:¹
• We propose a novel unified framework that jointly learns object detection and audio-visual sound source separation end-to-end, allowing cross-task synergy.
• We introduce dynamic filtering of object queries, ensuring that only relevant objects guide the separation.
• Our method achieves state-of-the-art sound separation performance on MUSIC [10] and MUSIC-21 [16], while maintaining reasonable detection performance. Through comprehensive analysis, we highlight the benefit of jointly training the two tasks.
2. RELATED WORK
2.1 Audio Source Separation
Audio source separation aims to isolate distinct sound sources from mixed signals. Classical approaches include Independent Component Analysis (ICA) [17–19]. ICA-based methods laid the foundation for blind source separation (BSS) under the assumption of statistical independence, while Non-negative Matrix Factorization (NMF) [20, 21] introduced parts-based representations particularly suited to music.
Deep learning revolutionized the field with approaches like Deep Clustering [22] and Deep Attractor Networks [23], which learn discriminative embeddings for source separation. U-Net architectures [24, 25] became standard for music separation through skip connections that refine signals in the time-frequency domain.
Recent transformer-based methods have pushed state-of-the-art performance, with the Audio Spectrogram Transformer (AST) [26] demonstrating that attention mechanisms can effectively model both short- and long-range dependencies. We adopt AST as our audio encoder backbone, extending it to audio-visual separation.
2.2 Audio-Visual Sound Separation
Audio-visual approaches leverage visual information to guide sound separation, significantly outperforming audio-only methods. The mix-and-separate paradigm introduced by Sound-of-Pixels [10] creates synthetic mixtures for self-supervised learning. This approach was adapted by subsequent methods [5, 10–15, 27], including Co-Separation [11], which discovers audio-visual associations, and recursive separation methods [12].
Recent advances include Cyclic Co-Learning (CCoL) [13], which iteratively refines separation and localization, and AME [14] and TriBERT [28], which incorporate additional cues like motion and human pose. iQuery [15] uses visually-named audio queries in a cross-attention-based transformer to separate sources. Rahman and Sigal [29] proposed a weakly-supervised approach that learns audio-visual co-segmentation from videos labeled only with object labels.

¹Code available at https://github.com/snuviplab/S2H.
However, existing methods rely on pre-extracted visual features or pre-trained object detectors, creating a disconnect between visual analysis and audio separation. This two-stage approach introduces error propagation and prevents joint optimization of both tasks. Even recent diffusion-based methods like DAVIS [30] operate on pre-processed visual inputs rather than learning representations jointly with audio separation.
2.3 Object Detection for Audio-Visual Tasks
Object detection has evolved from region-based CNNs like R-CNN [31] and Faster R-CNN [32] to transformer-based approaches. DETR [33] revolutionized detection by treating it as a direct set prediction problem, eliminating hand-crafted components like anchor generation and non-maximum suppression. This end-to-end differentiability makes DETR particularly suitable for integration with other modalities. In audio-visual research, object detection primarily serves as preprocessing. Methods like Co-Separation [11] and iQuery [15] use pre-trained detectors to identify visual regions before separation. While recent work explores tighter integration between detection and audio processing in specific domains [34–36], most approaches still treat detection as a separate module. The potential of jointly training detection with audio-visual tasks remains largely unexplored, particularly for learning cross-modal associations directly rather than relying on pre-trained visual features.
2.4 Joint Learning in Audio-Visual Tasks
Traditional sequential pipelines suffer from error propagation and miss cross-modal interactions that could enhance both modalities. Recent speech domain advances demonstrate clear benefits: TDFNet [37] achieves 10% improvement through joint speaker feature learning, while IIANet [38] and DGFNet [39] show that integrating modalities throughout networks substantially outperforms late fusion. However, the music domain lags behind. While speech systems embrace joint learning, music source separation remains dominated by two-stage approaches: Music Gesture [40] and recent work [15] still use pre-extracted visual features. This gap is significant given music’s visual richness: instrument movements and spatial arrangements offer valuable cues that are better exploited through joint learning. The absence of end-to-end joint training in music separation represents a major opportunity that our S2H framework addresses through unified optimization of object detection and sound separation.
3. PROBLEM FORMULATION
We consider the object detection and sound source separation tasks simultaneously. Given a video $V$ containing $K$ objects that produce sound, the expected output of this task is a set of triples $\{(b_i, s_i, c_i) : i = 1, \ldots, K\}$, where $b_i \in [0,1]^4$ is the bounding box, $s_i$ is the separated sound, and $c_i \in \mathcal{C}$ is the class label of each object $i$ within the video. The sound $s$ can be represented in multiple ways, including a (mel-)spectrogram or the raw waveform, and $\mathcal{C}$ is a set of pre-defined classes, e.g., musical instruments.

We emphasize that our goal is to train a single model that performs both object detection and sound source separation end-to-end, assuming that solving these two tasks requires common cues and that each task provides useful information to the other.

Figure 1: Overview of our See2Hear (S2H) framework. Taking a video and its audio as input, unimodal branches (Sec. 4.1) first encode visual and audio features, respectively. Then, the audio-visual fusion module (Sec. 4.2) produces multimodal-aware representations of the video, directly utilized to predict the spectrogram corresponding to each detected object in the video. The model is trained with two task-specific losses, one for object detection ($\mathcal{L}_{\text{od}}$) and another for sound source separation ($\mathcal{L}_{\text{ss}}$).

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
4. PROPOSED APPROACH
In this section, we detail our proposed framework, See2Hear (S2H), which integrates object detection and audio-visual sound separation in a unified end-to-end transformer-based architecture. Fig. 1 gives an overview of the framework for joint training of object detection and sound separation, composed of unimodal encoders (the visual object detection branch and the audio branch; Sec. 4.1) and multimodal decoders (the audio-visual feature fusion module and the final spectrogram decoder; Sec. 4.2).
4.1 Modality-specific Encoders
Visual Branch. Given an input video $V$, $F$ frames are uniformly sampled. Then, an object detector encodes the visual signals. Adopting a transformer-based architecture (e.g., DETR [33]), the visual encoder and decoder in Fig. 1 produce intermediate visual representations called query embeddings for $Q$ detected objects. Through a feed-forward network (FFN), their bounding boxes ($b_q$) and class labels ($c_q$) are predicted for $q \in \{1, \ldots, Q\}$.

However, among these $Q$ query embeddings, we observe that many represent the background or duplicate the same object multiple times. To avoid confusing the audio-visual decoder with irrelevant objects, we filter out irrelevant query embeddings in several ways; e.g., we exclude bounding boxes that overlap by more than $\theta$ in intersection over union (IoU), or apply Non-Maximum Suppression (NMS) to keep only the one with the highest confidence (confidence thresholding). We also keep only the bounding box with the highest confidence for each object class in a frame, if multiple boxes share the same predicted label. This dynamic filtering helps the model focus on key objects, avoiding spurious bounding boxes that could degrade separation. We denote by $N_O$ the number of remaining object query embeddings across all frames. They are concatenated to form $\mathbf{V} \in \mathbb{R}^{N_O \times D}$, and passed to the audio-visual decoder, described in Sec. 4.2.
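The filtering steps described above can be sketched in plain Python for a single frame. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Detection` record, the `filter_queries` function, the corner-format boxes, and the `min_score` value are hypothetical, and only the default overlap threshold of 0.7 is taken from the implementation details in Sec. 5.1.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical record for one object query's prediction in a frame."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]
    label: int                              # predicted class index
    score: float                            # confidence of the predicted class
    query_id: int                           # index of the originating object query


def iou(a, b) -> float:
    """Intersection over union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_queries(dets: List[Detection], background_label: int,
                   min_score: float = 0.5, theta: float = 0.7) -> List[Detection]:
    # Step 1: drop background and low-confidence queries (min_score is an assumed value).
    kept = [d for d in dets if d.label != background_label and d.score >= min_score]
    # Step 2: greedy NMS, visiting boxes from highest to lowest confidence and
    # suppressing any box that overlaps an already kept box by more than theta.
    kept.sort(key=lambda d: d.score, reverse=True)
    survivors: List[Detection] = []
    for d in kept:
        if all(iou(d.box, s.box) <= theta for s in survivors):
            survivors.append(d)
    # Step 3: keep only the highest-confidence box per predicted class.
    best_per_class = {}
    for d in survivors:  # still sorted by descending score
        best_per_class.setdefault(d.label, d)
    # Return in query order so the embeddings can be gathered and concatenated.
    return sorted(best_per_class.values(), key=lambda d: d.query_id)


if __name__ == "__main__":
    frame = [
        Detection((0.10, 0.10, 0.50, 0.50), 0, 0.90, 0),
        Detection((0.12, 0.10, 0.50, 0.50), 0, 0.80, 1),  # near-duplicate of query 0
        Detection((0.60, 0.60, 0.90, 0.90), 1, 0.95, 2),
        Detection((0.00, 0.00, 1.00, 1.00), 2, 0.99, 3),  # background class
    ]
    print([d.query_id for d in filter_queries(frame, background_label=2)])
```

The indices of the surviving queries would then select the rows of the query-embedding matrix that are passed on to the audio-visual decoder.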
Audio Branch. We encode the input audio signal, composed of sounds from multiple sources, using a transformer-based architecture (e.g., AST [26]) applied to its log-mel spectrogram. Similarly to the visual branch, we obtain a sequence of audio token embeddings $\mathbf{A} \in \mathbb{R}^{N_A \times D}$, where $N_A$ denotes the number of token embeddings across the entire spectrogram, and it is passed to the audio-visual decoder in Sec. 4.2.
4.2 Audio-Visual Fusion and Decoders
We now elaborate on the core module of our S2H framework, the audio-visual decoder, which uses the extracted visual object and audio features, followed by the spectrogram decoding process.
Audio-Visual Fusion. After detecting objects in the visual branch and extracting audio tokens from the audio branch, we fuse them via a transformer audio-visual decoder. As shown in Fig. 1, the audio-visual decoder $D_{av}$ takes the final filtered set of visual query embeddings, $\mathbf{V} \in \mathbb{R}^{N_O \times D}$, as the decoder queries, and the audio tokens, $\mathbf{A} \in \mathbb{R}^{N_A \times D}$, as the keys and values. It is composed of multiple ($L$) transformer decoder layers, each of which performs self-attention over $\mathbf{V}$ to capture interactions among the object queries, followed by cross-attention with the audio tokens $\mathbf{A}$. It produces the updated object features $\mathbf{O} \in \mathbb{R}^{N_O \times D}$, where $N_O$ is the total number of object query embeddings across all frames. These features are aware of the audio signals as well as the visual ones.
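To make the roles of queries, keys, and values concrete, the cross-attention step can be illustrated with a stripped-down, single-head scaled dot-product attention in plain Python. This sketch is an assumption-laden simplification rather than the decoder used in S2H: it omits the learned projections, multiple heads, self-attention, residual connections, and layer normalization of a full transformer decoder layer, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Each query (object embedding) attends over all keys (audio tokens).

    queries: N_O x D, keys: N_A x D, values: N_A x D  ->  output: N_O x D
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this object query and every audio token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the audio value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


if __name__ == "__main__":
    objects = [[1.0, 0.0], [0.0, 1.0]]             # two object queries (N_O = 2)
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three audio tokens (N_A = 3)
    updated = cross_attention(objects, audio, audio)
    print(len(updated), len(updated[0]))           # one updated feature per object
```

The output has one row per object query, which mirrors how the updated object features keep the shape of the filtered query set while mixing in audio information.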
Spectrogram Decoder. Finally, we produce a full-resolution sound mask corresponding to each detected object using a light-weight upsampling audio decoder $D_a$ illustrated in Fig. 1. It takes as input the encoded audio features $\mathbf{A} \in \mathbb{R}^{N_A \times D}$, reshaped back to a 2D feature map, and yields a spectrogram embedding $\mathbf{S}_{\text{out}} \in \mathbb{R}^{H \times W \times D}$, matching the size of the input spectrogram. We then use the object embedding $\mathbf{O}_o \in \mathbb{R}^{D}$ for a particular object $o$ with $\mathbf{S}_{\text{out}}$ to obtain its mask:

$$\hat{M}_o = \sigma\left(\mathbf{O}_o \otimes \mathbf{S}_{\text{out}}\right), \tag{1}$$

where $\otimes$ denotes a dot product at each spatial location, and $\sigma$ is the sigmoid function.

At inference, this predicted mask $\hat{M}_o$ is multiplied element-wise with the input spectrogram (with multiple sound sources) to separate out the sound corresponding to the particular object $o$.
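Eq. (1) and the inference step can be traced on a toy example. The following plain-Python sketch assumes nested lists for the H x W x D spectrogram embedding and a magnitude spectrogram for the mixture; the helper names `predict_mask` and `apply_mask` are hypothetical and are not taken from the paper or its code.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_mask(obj_embedding: List[float],
                 spec_embedding: List[List[List[float]]]) -> List[List[float]]:
    """Eq. (1): sigmoid of the dot product between the object embedding (D,)
    and the spectrogram embedding at every time-frequency location (H x W x D)."""
    return [[sigmoid(sum(o * s for o, s in zip(obj_embedding, cell))) for cell in row]
            for row in spec_embedding]


def apply_mask(mask: List[List[float]],
               mixture: List[List[float]]) -> List[List[float]]:
    """Inference: element-wise product of the mask and the mixture spectrogram."""
    return [[m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture)]


if __name__ == "__main__":
    obj = [1.0, -1.0]                              # D = 2
    s_out = [[[2.0, 0.0], [0.0, 2.0]]]             # H = 1, W = 2, D = 2
    mask = predict_mask(obj, s_out)                # high where the embeddings agree
    print(apply_mask(mask, [[4.0, 4.0]]))
```

Because the mask lies in (0, 1), each object's separated spectrogram is an attenuated copy of the mixture, which is why phase from the mixture can be reused for reconstruction.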
4.3 Model Training
We optimize two main losses end-to-end: object detection loss and sound separation loss. This end-to-end training encourages synergy: bounding box refinement benefits from audio constraints, and audio separation benefits from robust visual grounding.
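Object Detection Loss. We apply the set prediction loss [33], which involves bipartite matching between the $Q$ predicted queries and ground-truth bounding boxes. Denoting by $\hat{p}_q$ and $\hat{b}_q$ the predicted class probability and bounding box of query $q$, and by $c_q^*$ and $b_q^*$ the matched ground-truth label and box, the loss is defined as

$$\mathcal{L}_{\text{od}} = \sum_{q=1}^{Q} \Big[ -\log \hat{p}_q(c_q^*) + \mathbb{1}_{\{c_q^* \neq \varnothing\}} \, \lambda_{L1} \lVert \hat{b}_q - b_q^* \rVert_1 + \mathbb{1}_{\{c_q^* \neq \varnothing\}} \, \lambda_{\text{giou}} \big(1 - \mathrm{GIoU}(\hat{b}_q, b_q^*)\big) \Big], \tag{2}$$

where the indicator $\mathbb{1}_{\{c_q^* \neq \varnothing\}}$ restricts the box terms to queries matched to an actual object, and $\lambda_{L1}$ and $\lambda_{\text{giou}}$ control the relative weighting of the L1 and GIoU losses for the bounding box regression.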
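The box-regression part of Eq. (2) for a single matched query can be written out as a short sketch. It assumes corner-format boxes (x1, y1, x2, y2) with positive area, and the weights used in the usage line are illustrative placeholders because the values of the two box-loss weights are not stated in this paper. The function names are hypothetical.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), assumed to have positive area


def box_area(b: Box) -> float:
    """Area of a corner-format box; zero if the box is empty or inverted."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def giou(a: Box, b: Box) -> float:
    """Generalized IoU: IoU minus the fraction of the enclosing box not covered by the union."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    hull = box_area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    if union <= 0.0 or hull <= 0.0:
        raise ValueError("boxes must have positive area")
    return inter / union - (hull - union) / hull


def box_term(pred: Box, target: Box, lambda_l1: float, lambda_giou: float) -> float:
    """Box part of Eq. (2) for one query matched to a real object."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return lambda_l1 * l1 + lambda_giou * (1.0 - giou(pred, target))


if __name__ == "__main__":
    # Illustrative weights only; not values reported in the paper.
    print(box_term((0.1, 0.1, 0.5, 0.5), (0.1, 0.1, 0.6, 0.5), lambda_l1=5.0, lambda_giou=2.0))
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes (it becomes negative as they move apart), so the second term still provides a gradient when a prediction misses its target entirely.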
Sound Separation Loss. For each video, we minimize the L1 distance between the predicted and ground-truth spectrogram masks corresponding to each object in it:

$$\mathcal{L}_{\text{ss}} = \sum_{o} \lVert \hat{M}_o - M_o \rVert_1, \tag{3}$$

where $M_o$ is the ground truth for the sound produced by object $o$ within the video, and $\hat{M}_o$ is its prediction. As the training set does not provide the true per-source sound spectrogram, pseudo-ground truth can be adopted for this loss. (See Sec. 5.1 for our experimental settings.)
Overall Objective. The overall objective is given by

$$\mathcal{L} = \mathcal{L}_{\text{ss}} + \lambda \mathcal{L}_{\text{od}}, \tag{4}$$

where $\lambda$ is a hyperparameter. The sound source separation loss $\mathcal{L}_{\text{ss}}$ backpropagates all the way through the shared visual and audio transformers, enabling synergy between detection and separation. Although $\mathcal{L}_{\text{od}}$ only flows within the visual branch, the bounding box predictions it refines lead to more precise object queries for cross-attention, thereby indirectly improving the audio representation learned via $\mathcal{L}_{\text{ss}}$. This synergy fosters better separation and detection.
5. EXPERIMENTS
5.1 Experimental Settings
Datasets. We evaluate our model on MUSIC [10] and MUSIC-21 [16]. MUSIC contains 685 solo and duet videos with 11 musical instrument categories, of which 637 are currently available. MUSIC-21 extends it to 1,365 solo videos with 21 instruments, and 1,040 videos are currently available.

We follow the standard protocol [10] of setting aside the first video of each class for validation and the second one for testing, leaving the rest to form the training set.
Evaluation Protocol. Since no publicly available dataset provides ground-truth labels for sound source separation, we follow the widely-used mix-and-separate paradigm [5, 10–15, 27] for training. Specifically, we randomly sample $M$ videos $\{(V^{(m)}, s^{(m)})\}_{m=1}^{M}$ from the training data and mix their audio by $s_{\text{mix}} = \sum_{m=1}^{M} s^{(m)}$. We denote its spectrogram by $S_{\text{mix}}$. We take the normalized mask of each video $m$ as its ground-truth mask:

$$M^{(m)} = \frac{S^{(m)}}{\sum_{m'} S^{(m')}}, \tag{5}$$

where $S^{(m)}$ is the spectrogram of $s^{(m)}$ and the division is performed element-wise.
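As a small worked example of Eq. (5), the ratio masks for a mixture of per-video magnitude spectrograms can be computed as below. The sketch uses nested lists and a hypothetical `ratio_masks` helper; the small `eps` added to the denominator is an assumption introduced here to avoid division by zero in silent bins and is not part of Eq. (5).

```python
from typing import List

Spectrogram = List[List[float]]  # H x W magnitudes


def ratio_masks(spectrograms: List[Spectrogram], eps: float = 1e-8) -> List[Spectrogram]:
    """Eq. (5): each source's magnitude divided element-wise by the sum over all sources."""
    height, width = len(spectrograms[0]), len(spectrograms[0][0])
    total = [[sum(s[i][j] for s in spectrograms) for j in range(width)]
             for i in range(height)]
    return [[[s[i][j] / (total[i][j] + eps) for j in range(width)]
             for i in range(height)]
            for s in spectrograms]


if __name__ == "__main__":
    s1 = [[1.0, 3.0]]   # e.g., source 1 dominates the second bin
    s2 = [[1.0, 1.0]]
    for mask in ratio_masks([s1, s2]):
        print([[round(v, 3) for v in row] for row in mask])
```

At every time-frequency bin with non-zero energy the masks of all mixed sources sum to approximately one, so each mask states what share of the summed magnitude belongs to its source.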
Since the MUSIC and MUSIC-21 datasets do not provide bounding-box annotations for object detection, we obtain pseudo-ground-truth boxes in each frame using a well-established open-vocabulary detector, following prior work [15]. Specifically, we adopt Detic [41], which is capable of detecting arbitrary categories when provided with a text prompt. To generate the training samples, we perform inference on each frame with a relevant instrument prompt (e.g., “guitar”, “violin”, “flute”, “saxophone”, and so on) and take the predicted bounding boxes with confidence score ≥ 0.7. If more than one bounding box overlaps with IoU ≥ 0.7, we keep only the maximal one. These box predictions then serve as pseudo-labels to train our visual branch.
Data Preprocessing. We sample F = 3 frames per video with a stride of 1 and a video sampling rate of 1 fps, and resize each frame to a maximum side length of 256 pixels while preserving the aspect ratio. For the audio input, we sample 6 seconds of the input audio at 11 kHz, centered at the middle of each sampled frame. We apply the Short-Time Fourier Transform (STFT) with a window size of 1,022 and hop length of 256, resulting in a 512 × 256 complex spectrogram, which is then resampled on a log-frequency scale to 256 × 256. The phase information is preserved for reconstruction.
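The 512 × 256 spectrogram size follows from the STFT parameters by simple arithmetic, which the sketch below reproduces. Two assumptions are made here and are not stated in the text: the STFT uses centered (padded) frames, so that the frame count is 1 + floor(samples / hop), and the clip length of 65,535 samples (just under 6 s at 11,025 Hz) is a hypothetical value chosen because it yields exactly 256 frames. The function name is also hypothetical.

```python
from typing import Tuple


def stft_shape(num_samples: int, n_fft: int, hop_length: int,
               center: bool = True) -> Tuple[int, int]:
    """Return (frequency bins, time frames) of a one-sided STFT."""
    freq_bins = n_fft // 2 + 1                    # one-sided spectrum
    if center:
        frames = 1 + num_samples // hop_length    # signal padded by n_fft // 2 on both sides
    else:
        frames = 1 + (num_samples - n_fft) // hop_length
    return freq_bins, frames


if __name__ == "__main__":
    # Window size 1,022 gives 1022 // 2 + 1 = 512 frequency bins.
    print(stft_shape(65535, n_fft=1022, hop_length=256))   # assumed clip length
    # Exactly 6 s at a nominal 11,000 Hz gives a slightly different frame count.
    print(stft_shape(66000, n_fft=1022, hop_length=256))
```

The window size of 1,022 (rather than 1,024) is what makes the frequency axis exactly 512 bins; the time axis depends on the exact clip length and padding convention, which is why the frame count is only approximately 256 for a nominal 6-second clip.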
Backbone Models. For the visual branch, we use DETR [33] as a set-based object detector to predict bounding boxes and class probabilities. It has a ResNet backbone [31] that extracts feature maps, which are then flattened and passed to a transformer encoder. The hyperparameter Q is usually set to the maximum expected number of objects in a frame, and we use Q = 12 for MUSIC and Q = 22 for MUSIC-21, reserving one query for the background class.

For the audio branch, we adopt AST [26], a transformer-based audio classifier, given a log-mel spectrogram of size F × T. While AST was originally designed for classification, we use it to obtain a sequence of patch embeddings $\mathbf{A} \in \mathbb{R}^{N_A \times D}$, dropping the classification token.
Baselines. We compare our S2H with state-of-the-art sound separation models: iQuery [15], Sound-of-Pixels (SoP) [10], and CoSep [11]. Since the number of available videos varies depending on the access time, we re-train all baselines on the same set of currently available videos to ensure a fair comparison.
Evaluation Metrics. The sound separation task is evaluated using Source-to-Distortion Ratio (SDR), Source-to-Interference Ratio (SIR), and Source-to-Artifacts Ratio (SAR). For object detection, we measure the mean Average Precision (mAP) and mean Intersection over Union (mIoU).
Implementation Details. We use 6 encoder layers and 6 decoder layers for the visual branch. For the audio encoder, we use the first 6 layers of a pre-trained AST model. We grid-search λ by cross-validation and set it to 0.1. We set the bounding box overlap threshold θ = 0.7.

We train for 100 epochs using the AdamW optimizer with a batch size of 48, decaying the learning rate at epoch 80 by a factor of 0.1. The learning rates are set to 1×10⁻⁵ for the visual backbone, 2×10⁻⁵ for the DETR encoder and decoder, 5×10⁻⁵ for the AST audio encoder, and 1×10⁻⁴ for the audio-visual decoder and the audio decoder. We mix 2 sound tracks per mini-batch, leading to up to 4 distinct instruments in the mixture.
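The per-module learning rates and the single step decay can be summarized in a few lines. This is only a bookkeeping sketch: the group names are hypothetical labels rather than identifiers from the released code, and treating epochs as 0-indexed with the decay taking effect from epoch 80 onward is an assumption.

```python
from typing import Dict

# Base learning rates as listed in the implementation details (group names are hypothetical).
BASE_LRS: Dict[str, float] = {
    "visual_backbone": 1e-5,
    "detr_encoder_decoder": 2e-5,
    "ast_audio_encoder": 5e-5,
    "av_decoder_and_audio_decoder": 1e-4,
}


def lr_at_epoch(base_lr: float, epoch: int,
                decay_epoch: int = 80, gamma: float = 0.1) -> float:
    """Step schedule: multiply the base rate by gamma once decay_epoch is reached."""
    return base_lr * gamma if epoch >= decay_epoch else base_lr


def learning_rates(epoch: int) -> Dict[str, float]:
    """Learning rate of every parameter group at a given (0-indexed) epoch."""
    return {name: lr_at_epoch(lr, epoch) for name, lr in BASE_LRS.items()}


if __name__ == "__main__":
    for epoch in (0, 79, 80, 99):
        print(epoch, learning_rates(epoch))
```

Giving the pre-trained backbones smaller rates than the newly initialized audio-visual and audio decoders is a common way to fine-tune pre-trained weights gently while letting new modules adapt faster.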
5.2 Comparison with Competing Methods
Tab. 1 shows the results on MUSIC (left) and MUSIC-21 (right). Our S2H outperforms existing baselines in sound separation, achieving higher SDR, SIR, and SAR. The improvement in SDR is particularly substantial (per Tab. 1, gains of about +1.0 dB over iQuery [15] on MUSIC and about +1.7 dB on MUSIC-21). We attribute this to the explicit synergy between object detection and sound separation in our architecture.

                    MUSIC                     MUSIC-21
Method              SDR     SIR     SAR      SDR     SIR     SAR
Sound-of-Pixels     5.63    6.85    9.80     5.77    9.95    10.33
CoSep               5.72    8.00    8.13     6.17    8.73    10.18
iQuery              8.04    11.63   11.92    7.51    11.16   11.64
S2H (Ours)          9.03    12.85   13.99    9.20    12.54   14.79

Table 1: Sound separation performance of competing models. On MUSIC [10] and MUSIC-21 [16], S2H achieves state-of-the-art separation metrics.
Fig. 2 further illustrates this improvement with qualitative examples. In a duet video with flute and violin, S2H accurately localizes both instruments and reconstructs their individual spectrograms with minimal interference.
5.3 Ablation Studies
We perform ablations on MUSIC to isolate key design choices in S2H. Tab. 2 summarizes the results.
(1) Effect of Bounding Box Filtering. We compare the performance of our full model to the one without the dynamic bounding box filtering described in Sec. 4.1. The performance drops significantly, e.g., from 9.03 to 8.19 for sound source separation (in SDR) and from 0.67 to 0.54 for object detection (in mAP). Without filtering, many low-confidence or heavily duplicated bounding boxes are passed to the audio-visual decoder, confusing the model during the fusion step, as they contain object queries that do not correspond to any true sounding object. Interestingly, the mIoU remains fairly high (0.85) even without the bounding box filtering, suggesting that some boxes are spatially correct, yet the sheer number of boxes for the same object degrades both detection precision and separation quality.
(2) Effect of Cross-modal Fusion. Our proposed S2H introduces cross-modal fusion, aiming to create synergy between the object detection and sound separation tasks. In order to see the effect of this design, we compare with a simple U-Net-based decoder structure, which has been widely used in audio source separation, instead of our transformer-based cross-attention. With this design, the visual branch simply yields a latent feature (instead of bounding-box queries), which is then concatenated with audio features. We observe a drastic performance drop in sound source separation metrics (e.g., 9.03 → 2.34 in SDR). This highlights the benefit of our transformer-based cross-attention, which allows explicit object-aware fusion between the bounding-box queries and the spectrogram embeddings.
(3) Effect of Cross-attention. Instead of completely replacing the audio-visual fusion with a U-Net, we further experiment with our audio-visual fusion using only self-attention to see the effect of cross-modal attention. As each modality is independently processed, the performance drops sharply (9.03 → 5.22 in SDR), reaffirming the benefits of cross-modal interaction. Similarly, the detection mAP also degrades (0.67 → 0.32 in Tab. 2), while the mIoU stays roughly unchanged (0.80 → 0.82), since the visual branch no longer receives indirect supervision signals from the separation loss.

(4) Effect of Joint Training. We remove the detection loss by setting λ = 0 in Eq. (4), effectively discarding the detection supervision. Although the visual branch still encodes the visual features, it has no further incentive to localize objects or refine bounding-box predictions tailored to sound source separation. In this scenario, the detection performance drops to 0, showing that the bounding boxes degenerate immediately. The sound separation quality is also reduced (SDR 9.03 → 7.67), indicating that accurate object localization significantly helps the sound separation task. This underscores the synergy between detection and separation.

Figure 2: Qualitative examples of object detection and sound source separation, shown for two videos (Video #1 and Video #2). (a) Detection results: predicted bounding boxes in a sampled video frame. (b) Ground-truth spectrogram of the source audio. (c)–(e) The separated spectrograms by ours, SoP, and iQuery, respectively.
                            Sound Separation           Detection
Configuration               SDR     SIR     SAR       mAP     mIoU
Full S2H                    9.03    12.85   13.99     0.67    0.80
(1) No b-box filtering      8.19    11.17   14.80     0.54    0.85
(2) U-Net-like decoder      2.34    8.33    5.94      0.25    0.84
(3) No cross-attention      5.22    8.67    12.54     0.32    0.82
(4) No L_od loss            7.67    11.05   13.49     0.00    0.00

Table 2: Ablation studies on MUSIC. Each component significantly contributes to the performance of S2H, for both sound source separation and object detection.

6. CONCLUSION
We present See2Hear (S2H), a novel framework that jointly learns object detection and sound source separation. By leveraging transformer-based modules for both the visual and audio branches, and by integrating them through a shared audio-visual decoder, S2H captures richer cross-modal dependencies than previous disjoint or sequential methods. Our dynamic filtering mechanism ensures that only relevant detections guide separation. Extensive experiments on MUSIC and MUSIC-21 demonstrate that S2H achieves state-of-the-art sound separation performance, while also maintaining competitive detection accuracy. Our ablation studies confirm that object detection and sound separation indeed mutually benefit each other when trained end-to-end.
In spite of the promising results, S2H still has some limitations. Mainly due to the lack of labeled data for sound source separation, S2H has been verified mostly on single-instrument scenes. With larger-scale data, it could be extended to multi-instrument scenes with more complex polyphony. Also, as its application is not confined to musical audio, it could be applied to other multimodal tasks (e.g., speech-driven detection or action recognition).
Acknowledgments. This work was supported by Youlchon Foundation, NRF (RS-2021-NR05515, RS-2024-00336576, RS-2023-0022663) and IITP grants (RS-2022-II220264, RS-2024-00353131, RS-2022-00155966) funded by the government of Korea.
7. REFERENCES
[1] C. Opoku-Baah, A. M. Schoenhaut, S. G. Vassall, D. A. Tovar, R. Ramachandran, and M. T. Wallace, “Visual influences on auditory behavioral, neural, and perceptual processes: a review,” Journal of the Association for Research in Otolaryngology, vol. 22, no. 4, pp. 365–386, 2021.
[2] A. Tonelli, L. F. Cuturi, and M. Gori, “The influence of auditory information on visual size adaptation,” Frontiers in Neuroscience, vol. 11, p. 594, 2017.
[3] Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and D. P. Ellis, “MuLan: A joint embedding of music audio and natural language,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022.
[4] R. Arandjelović and A. Zisserman, “Look, listen and learn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[5] A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[6] Y. Aytar, C. Vondrick, and A. Torralba, “SoundNet: Learning sound representations from unlabeled video,” in Advances in Neural Information Processing Systems (NeurIPS), 2016.
[7] T. Afouras, A. Owens, J. S. Chung, and A. Zisserman, “Self-supervised learning of audio-visual objects from video,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[8] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon, “Learning to localize sound source in visual scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[9] K. Qian, Y. Zhang, S. Chang, D. Cox, and M. Hasegawa-Johnson, “Unsupervised speech decomposition via triple information bottleneck,” in Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
[10] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, “The sound of pixels,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[11] R. Gao and K. Grauman, “Co-separating sounds of visual objects,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[12] X. Xu, B. Dai, and D. Lin, “Recursive visual sound separation using minus-plus net,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[13] Y. Tian, D. Hu, and C. Xu, “Cyclic co-learning of sounding object visual grounding and sound separation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[14] L. Zhu and E. Rahtu, “Visually guided sound source separation and localization using self-supervised motion representations,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2022.
[15] J. Chen, R. Li, Z. Hou, G. Zhang, L. Peng, and C.-W. Ngo, “iQuery: Instruments as queries for audio-visual sound separation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[16] H. Zhao, C. Gan, W.-C. Ma, and A. Torralba, “The sound of motions,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[17] J.-F. Cardoso, “Blind signal separation: Statistical principles,” Proceedings of the IEEE, vol. 86, no. 10, pp. 2009–2025, 1998.
[18] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, vol. 7, no. 6, pp. 1129–1159, 1995.
[19] A. Hyvärinen and E. Oja, “Independent component analysis: Algorithms and applications,” Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[20] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[21] P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
[22] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[23] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[24] A. Jansson, E. J. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep u-net convolutional networks,” in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017.
[25] D. Stoller, S. Ewert, and S. Dixon, “Wave-u-net: A multi-scale neural network for end-to-end audio source separation,” in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), 2018.
[26] Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio spectrogram transformer,” in Proceedings of the Interspeech Conference, 2021.
[27] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 112:1–112:11, 2018.
[28] T. Rahman, M. Yang, and L. Sigal, “TriBERT: Human-centric audio-visual representation learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2021.
[29] T. Rahman and L. Sigal, “Weakly-supervised audio-visual sound source detection and separation,” in IEEE International Conference on Multimedia and Expo (ICME), 2021.
[30] C. Huang, S. Liang, Y. Tian, A. Kumar, and C. Xu, “DAVIS: High-quality audio-visual separation with generative diffusion models,” arXiv preprint arXiv:2308.00122, 2023.
[31] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[32] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015.
[33] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[34] F. R. Valverde, J. V. Hurtado, and A. Valada, “There is more than meets the eye: Self-supervised multi-object detection and tracking with sound by distilling multimodal knowledge,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[35] S. Mo and Y. Tian, “Audio-visual grouping network for sound localization from mixtures,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[36] H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Localizing visual sounds the hard way,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[37] S. Pegg, K. Li, and X. Hu, “TDFnet: An efficient audio-visual speech separation model with top-down fusion,” in International Conference on Information Science and Technology, 2023.
[38] K. Li, X. Hu, S. Pegg, R. Zhang, F. Zhou, X. Wu, and X. Liu, “IIAnet: An intra- and inter-modality attention network for audio-visual speech separation,” in Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
[39] Y. Yu and S. Sun, “DGFnet: End-to-end audio-visual source separation based on dynamic gating fusion,” in Proceedings of the International Conference on Multimedia Retrieval (ICMR), 2025.
[40] C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba, “Music gesture for visual sound separation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[41] X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting twenty-thousand classes using image-level supervision,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022.
