JOINT OBJECT DETECTION AND SOUND SOURCE SEPARATION
Sunyoo Kim¹, Yunjeong Choi¹, Doyeon Lee¹, Seoyoung Lee²,
Eunyi Lyou¹, Seungju Kim³, Junhyug Noh⁴*, Joonseok Lee¹*
¹Seoul National University, Seoul, Korea ²University of Texas at Austin, Texas, USA
³Sookmyung Women's University, Seoul, Korea ⁴Ewha Womans University, Seoul, Korea
[email protected], [email protected], [email protected]
ABSTRACT
We propose See2Hear (S2H), a framework that jointly learns audio-visual representations for object detection and sound source separation from videos. Existing methods do not fully exploit the synergy between the detection and separation tasks, often relying on disjointly pre-trained visual encoders. Our S2H integrates both tasks in an end-to-end trainable unified structure using transformer-based architectures. A naive combination of these approaches, however, results in suboptimal performance. We propose a dynamic filtering mechanism that selects relevant object queries from the object detector to resolve this issue. We conduct extensive experiments to verify that our approach achieves the state-of-the-art performance in audio source separation on MUSIC and MUSIC-21, while maintaining competitive object detection performance. Ablation studies confirm that the joint training of detection and separation is mutually beneficial for both tasks.
1. INTRODUCTION
Human perception is inherently multimodal, taking input signals from five senses and comprehensively understanding the given situation from their fusion [1–3]. Often, integrating multiple cues helps us to coherently understand our surroundings. Music is no exception; for instance, seeing and recognizing a particular instrument simultaneously allows us to associate it with the sound it produces. More broadly, visual cues can often be useful to recognize co-occurring sound, and at the same time, auditory signals can also help to visually perceive an object.
In the literature, researchers have explored a wide range of audio-visual learning, including self-supervised cross-modal alignment [4, 5], audio-visual representation learning [6, 7], and sound source localization [8, 9]. These efforts collectively demonstrate that visual and auditory information, when processed jointly, can provide more robust perception than considering each modality alone.

*Corresponding authors
© S. Kim, Y. Choi, D. Lee, S. Lee, E. Lyou, S. Kim, J. Noh, and J. Lee. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: S. Kim, Y. Choi, D. Lee, S. Lee, E. Lyou, S. Kim, J. Noh, and J. Lee, “Joint Object Detection and Sound Source Separation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Among these research directions on audio-visual correspondence, our focus is on audio-visual sound source separation [10–15], which aims to isolate individual sound sources from a complex mixture by exploiting the visual signal as an anchor. For instance, if multiple instruments are played together, seeing a violin or a trumpet often helps a model discern which frequencies belong to each instrument.
Despite this clear connection between visual identification of a source and its auditory presence in the mixture, most prior approaches treat relevant visual tasks (e.g., object detection) and sound separation independently, or sequentially. Typically, one first uses a pre-trained detector to localize instruments, then feeds the bounding boxes or region features into a separate network for separation [11, 14, 15].
However, such two-step or disjoint pipelines do not take advantage of potentially useful cues from the other modality; e.g., the visual signals for sound separation and vice versa. Accurate object localization would provide guidance for better sound isolation, and improved sound separation can in turn reinforce better visual representations by focusing on the most relevant object regions. Tackling these two problems disjointly would therefore be suboptimal.
In this paper, we propose See2Hear (S2H), which jointly learns to detect objects and separate their corresponding audio signals, trained end-to-end. To achieve this, we adopt a Transformer-based architecture, well-suited to seamlessly handle multimodal inputs with minimal modality-specific encoding overhead. Particularly, a multimodal Transformer enables direct attention across the audio and visual tokens, representing parts of the spectrogram and the image, respectively. We train both detection and separation in a single model, allowing gradients from both tasks to update the shared representation space.
However, a naive assembly of visual and audio Transformers with cross-attention is not scalable. More specifically, we discover that it is crucial to control the number of recognized objects. Without a proper measure, far more bounding boxes are detected during training than there are actual objects, making the weight updates less accurate and computationally infeasible. To resolve this issue, we incorporate a dynamic filtering mechanism that discards low-confidence or overlapping detections, thus avoiding confusion from spurious regions. As a result, our model effectively “sees” objects and “hears” their corresponding sounds in a fully integrated manner.
Our experiments verify that this unified architecture effectively exploits synergy between the two tasks, surpassing the performance of methods that rely on external or pre-extracted detections [11, 14, 15].
Our main contributions are summarized as follows:¹
• We propose a novel unified framework that jointly learns object detection and audio-visual sound source separation end-to-end, allowing cross-task synergy.
• We introduce a dynamic filtering of object queries, ensuring that only relevant objects guide the separation.
• Our method achieves the state-of-the-art sound separation performance on MUSIC [10] and MUSIC-21 [16], while maintaining reasonable detection performance. Through comprehensive analysis, we highlight the benefit of jointly training the two tasks.
2. RELATED WORK
2.1 Audio Source Separation
Audio source separation aims to isolate distinct sound sources from mixed signals. Classical approaches include Independent Component Analysis (ICA) [17–19]. ICA-based methods laid the foundation for blind source separation (BSS) under the assumption of statistical independence, while Non-negative Matrix Factorization (NMF) [20, 21] introduced parts-based representations particularly suited for music.
Deep learning revolutionized the field with approaches like Deep Clustering [22] and Deep Attractor Networks [23], which learn discriminative embeddings for source separation. U-Net architectures [24, 25] became standard for music separation through skip connections that refine signals in the time-frequency domain.
Recent transformer-based methods have pushed state-of-the-art performance, with the Audio Spectrogram Transformer (AST) [26] demonstrating that attention mechanisms can effectively model both short- and long-range dependencies. We adopt AST as our audio encoder backbone, extending it to audio-visual separation.
2.2 Audio-Visual Sound Separation
Audio-visual approaches leverage visual information to guide sound separation, significantly outperforming audio-only methods. The mix-and-separate paradigm introduced by Sound-of-Pixels [10] creates synthetic mixtures for self-supervised learning. This approach was adapted by subsequent methods [5, 10–15, 27], including Co-Separation [11] that discovers audio-visual associations, and recursive separation methods [12].
Recent advances include Cyclic co-learning (CCoL) [13] that iteratively refines separation and localization, and AME [14] and TriBERT [28], which incorporate additional cues like motion and human pose. iQuery [15] uses visually-named audio queries in a cross-attention-based transformer to separate sources. Rahman and Sigal [29] proposed a weakly-supervised approach that learns audio-visual co-segmentation from videos labeled only with object labels.

¹Code available at https://github.com/snuviplab/S2H.
However, existing methods rely on pre-extracted visual features or pre-trained object detectors, creating a disconnect between visual analysis and audio separation. This two-stage approach introduces error propagation and prevents joint optimization of both tasks. Even recent diffusion-based methods like DAVIS [30] operate on pre-processed visual inputs rather than learning representations jointly with audio separation.
2.3 Object Detection for Audio-Visual Tasks
Object detection has evolved from region-based CNNs like R-CNN [31] and Faster R-CNN [32] to transformer-based approaches. DETR [33] revolutionized detection by treating it as a direct set prediction problem, eliminating hand-crafted components like anchor generation and non-maximum suppression. This end-to-end differentiability makes DETR particularly suitable for integration with other modalities. In audio-visual research, object detection primarily serves as preprocessing. Methods like Co-Separation [11] and iQuery [15] use pre-trained detectors to identify visual regions before separation. While recent work explores tighter integration between detection and audio processing in specific domains [34–36], most approaches still treat detection as a separate module. The potential of jointly training detection with audio-visual tasks remains largely unexplored, particularly for learning cross-modal associations directly rather than relying on pre-trained visual features.
2.4 Joint Learning in Audio-Visual Tasks
Traditional sequential pipelines suffer from error propagation and miss cross-modal interactions that could enhance both modalities. Recent speech domain advances demonstrate clear benefits: TDFNet [37] achieves 10% improvement through joint speaker feature learning, while IIANet [38] and DGFNet [39] show that integrating modalities throughout networks substantially outperforms late fusion. However, the music domain lags behind. While speech systems embrace joint learning, music source separation remains dominated by two-stage approaches; Music Gesture [40] and recent work [15] still use pre-extracted visual features. This gap is significant given music's visual richness: instrument movements and spatial arrangements offer valuable cues better exploited through joint learning. The absence of end-to-end joint training in music separation represents a major opportunity that our S2H framework addresses through unified optimization of object detection and sound separation.
3. PROBLEM FORMULATION
We consider the object detection and sound source separation tasks simultaneously. Given a video $V$ containing $K$ objects that produce sound, the expected output of this task is a set of triples $\{(b_i, s_i, c_i) : i = 1, ..., K\}$, where $b_i \in [0,1]^4$ is the bounding box, $s_i$ is the separated sound, and $c_i \in \mathcal{C}$ is the class label of each object $i$ within the video. The sound $s$ can be represented in multiple ways, including a (mel-)spectrogram or the raw waveform, and $\mathcal{C}$ is a set of pre-defined classes, e.g., musical instruments.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 1: Overview of our See2Hear (S2H) framework. Taking a video and its audio as input, unimodal branches (Sec. 4.1) first encode visual and audio features, respectively. Then, the audio-visual fusion module (Sec. 4.2) produces multimodal-aware representations of the video, directly utilized to predict the spectrogram corresponding to each detected object in the video. The model is trained with two task-specific losses, one for the object detection ($\mathcal{L}_{od}$) and another for the sound source separation ($\mathcal{L}_{ss}$).
We emphasize that our goal is to train a single model that performs both object detection and sound source separation tasks end-to-end, assuming that solving these two tasks would require common cues and provide useful information to each other.
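To make the expected output concrete, the triples $(b_i, s_i, c_i)$ could be held in a small container like the one below. This is only an illustrative sketch: the names `SoundingObject` and `validate_output`, the corner-style box convention, and the nested-list spectrogram are assumptions made for the example, not part of S2H.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class SoundingObject:
    """One output triple (b_i, s_i, c_i); the field names are hypothetical."""
    box: Tuple[float, float, float, float]  # b_i in [0, 1]^4 (corner format assumed)
    sound: List[List[float]]                # s_i, here a spectrogram as rows of magnitudes
    label: str                              # c_i, one of the pre-defined classes C


def validate_output(objects: List[SoundingObject], classes: Set[str]) -> int:
    """Check each triple against the problem formulation and return K."""
    for obj in objects:
        if not all(0.0 <= v <= 1.0 for v in obj.box):
            raise ValueError(f"box {obj.box} is not normalized to [0, 1]")
        if obj.label not in classes:
            raise ValueError(f"label {obj.label!r} is not in the class set")
    return len(objects)
```

A duet video would then yield two such objects, one per instrument, each carrying its own box, separated sound, and class label.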
4. PROPOSED APPROACH
In this section, we detail our proposed framework, namely See2Hear (S2H), which integrates object detection and audio-visual sound separation in a unified end-to-end transformer-based architecture. Fig. 1 overviews our proposed framework for joint training of object detection and sound separation, composed of unimodal encoders (the visual object detection branch and the audio branch; Sec. 4.1) and multimodal decoders (audio-visual feature fusion module and the final spectrogram decoder; Sec. 4.2).
4.1 Modality-specific Encoders
Visual Branch. Given an input video $V$, $F$ frames are uniformly sampled. Then, an object detector encodes the visual signals. Adopting a transformer-based architecture (e.g., DETR [33]), the visual encoder and decoder in Fig. 1 produce intermediate visual representations called query embeddings for $Q$ detected objects. Through a feed-forward network (FFN), their bounding boxes ($b_q$) and class labels ($c_q$) are predicted for $q \in \{1, ..., Q\}$.
However, among these $Q$ query embeddings, we observe that many represent the background or duplicate objects multiple times. To avoid confusing the audio-visual decoder with irrelevant objects, we filter out irrelevant query embeddings in several ways; e.g., we exclude bounding boxes that overlap by more than $\theta$ in the intersection over union (IoU), or apply Non-Maximum Suppression (NMS) to keep only the one with the highest confidence (confidence thresholding). We also keep only one bounding box with the highest confidence for each object class in a frame, if multiple boxes share the same predicted label. This dynamic filtering helps the model focus on key objects, avoiding spurious bounding boxes that could degrade separation. We denote by $N_O$ the number of remaining object query embeddings across all frames. They are concatenated to form $V \in \mathbb{R}^{N_O \times D}$, and passed to the audio-visual decoder, described in Sec. 4.2.
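The filtering steps above can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout of a detection, the corner-format boxes, and the confidence floor `min_conf=0.5` are hypothetical; only the overlap threshold $\theta = 0.7$ is taken from the experimental settings.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_queries(dets, theta=0.7, min_conf=0.5, background="background"):
    """Keep the object queries of one frame that survive the three filters.

    dets: list of dicts with keys "query", "box", "label", "conf" (assumed layout).
    """
    # 1) Drop background queries and low-confidence predictions.
    cand = [d for d in dets if d["label"] != background and d["conf"] >= min_conf]
    # 2) Greedy NMS: visit boxes by descending confidence and drop any box
    #    that overlaps an already kept box by more than theta.
    cand.sort(key=lambda d: d["conf"], reverse=True)
    kept = []
    for d in cand:
        if all(iou(d["box"], k["box"]) <= theta for k in kept):
            kept.append(d)
    # 3) Keep only the most confident box per predicted class
    #    (kept is still sorted, so the first hit per label is the best one).
    best = {}
    for d in kept:
        best.setdefault(d["label"], d)
    return list(best.values())
```

The surviving entries, collected over all sampled frames, would correspond to the $N_O$ query embeddings passed to the audio-visual decoder.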
Audio Branch. We encode the input audio signal, composed of sounds from multiple sources, using a transformer-based architecture (e.g., AST [26]) from its log-mel spectrogram. Similarly to the visual branch, we obtain a sequence of audio token embeddings $A \in \mathbb{R}^{N_A \times D}$, where $N_A$ denotes the number of token embeddings across the entire spectrogram, and it is passed to the audio-visual decoder in Sec. 4.2.
4.2 Audio-Visual Fusion and Decoders
We now elaborate on the core module of our S2H framework, the audio-visual decoder, which uses the extracted visual object and audio features, followed by the spectrogram decoding process.
Audio-Visual Fusion. After detecting objects in the visual branch and extracting audio tokens from the audio branch, we fuse them via a transformer audio-visual decoder. As shown in Fig. 1, the audio-visual decoder $D_{av}$ takes the final filtered set of visual query embeddings, $V \in \mathbb{R}^{N_O \times D}$, as the decoder queries, and the audio tokens, $A \in \mathbb{R}^{N_A \times D}$, as the keys and values. The decoder is composed of multiple ($L$) transformer decoder layers, each of which performs self-attention over $V$ to capture interactions among the object queries, followed by cross-attention with the audio tokens $A$. It produces the updated object features $O \in \mathbb{R}^{N_O \times D}$, where $N_O$ is the total number of object query embeddings across all frames. These features are aware of the audio signals as well as the visual ones.
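The attention pattern of one such decoder layer can be illustrated with plain NumPy. This sketch is deliberately simplified and is an assumption-laden illustration, not the S2H code: it uses a single head and omits the learned projections, layer normalization, and feed-forward sublayer that a real transformer decoder layer contains; the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights


def fusion_layer(V, A):
    """One simplified decoder layer: self-attention over object queries,
    then cross-attention from object queries to audio tokens."""
    V = V + attention(V, V, V)[0]          # object queries interact with each other
    update, weights = attention(V, A, A)   # queries = objects, keys/values = audio
    return V + update, weights


rng = np.random.default_rng(0)
V = rng.normal(size=(3, 8))    # N_O = 3 filtered object queries, D = 8
A = rng.normal(size=(16, 8))   # N_A = 16 audio tokens
O, attn = fusion_layer(V, A)   # O has shape (N_O, D); attn has shape (N_O, N_A)
```

Each row of `attn` is a distribution over audio tokens, which is what lets an object query pick out the spectrogram patches relevant to it.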
Spectrogram Decoder. Finally, we produce a full-resolution sound mask corresponding to each detected object using a light-weight upsampling audio decoder $D_a$ illustrated in Fig. 1. It takes as input the encoded audio features $A \in \mathbb{R}^{N_A \times D}$, reshaped back to a 2D feature map, and yields a spectrogram embedding $S_{out} \in \mathbb{R}^{H \times W \times D}$, matching the size of the input spectrogram. We then use the object embedding $O_o \in \mathbb{R}^D$ of a particular object $o$ with $S_{out}$ to obtain its mask:
$$\hat{M}_o = \sigma(O_o \otimes S_{out}), \tag{1}$$
where $\otimes$ denotes a dot product at each spatial location, and $\sigma$ is the sigmoid function.
At inference, this predicted mask $\hat{M}_o$ is multiplied with the input spectrogram (with multiple sound sources) to separate out the sound corresponding to the particular object $o$.
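Eq. (1) and the masking step at inference can be written out as a short NumPy sketch. The function names and array shapes are assumptions chosen for illustration; the only operations used are the per-location dot product, the sigmoid, and the element-wise product described above.

```python
import numpy as np


def predict_mask(obj_emb, spec_emb):
    """Eq. (1): sigmoid of the dot product between an object embedding (D,)
    and the spectrogram embedding (H, W, D) at every spatial location."""
    logits = np.einsum("d,hwd->hw", obj_emb, spec_emb)
    return 1.0 / (1.0 + np.exp(-logits))


def separate(mask, mix_spec):
    """Apply the predicted mask to the mixture spectrogram element-wise."""
    return mask * mix_spec


rng = np.random.default_rng(0)
S_out = rng.normal(size=(4, 5, 8))   # H = 4, W = 5, D = 8 (toy sizes)
O_o = rng.normal(size=8)             # embedding of one detected object
S_mix = rng.random((4, 5))           # magnitude spectrogram of the mixture
M_hat = predict_mask(O_o, S_out)     # values in (0, 1), shape (H, W)
S_o = separate(M_hat, S_mix)         # estimated spectrogram of object o
```

Because the mask lies in (0, 1), the separated magnitude at each time-frequency bin never exceeds that of the mixture.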
4.3 Model Training
We optimize two main losses end-to-end: object detection loss and sound separation loss. This end-to-end training encourages synergy: bounding box refinement benefits from audio constraints, and audio separation benefits from robust visual grounding.
Object Detection Loss. We apply the set prediction loss [33], which involves bipartite matching between the $Q$ predicted queries and ground-truth bounding boxes. Denoting by $\hat{p}_q$ and $\hat{b}_q$ the predicted class probability and bounding box of the query $q$, and by $c^*_q$ and $b^*_q$ the matched ground truth label and box, the loss is defined as
$$\mathcal{L}_{od} = \sum_{q=1}^{Q} \Big[ -\log \hat{p}_q(c^*_q) + \mathbb{1}\{c^*_q \neq \varnothing\}\, \lambda_{L1} \|\hat{b}_q - b^*_q\|_1 + \mathbb{1}\{c^*_q \neq \varnothing\}\, \lambda_{giou} \big(1 - \mathrm{GIoU}(\hat{b}_q, b^*_q)\big) \Big], \tag{2}$$
where $\lambda_{L1}$ and $\lambda_{giou}$ control the relative weighting of L1 and GIoU losses for the bounding box regression.
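As a worked illustration of Eq. (2), the sketch below evaluates the loss for queries that have already been matched to targets; the bipartite matching itself is not shown. Several details are assumptions rather than statements about S2H: boxes are in corner format with positive area, the "no object" class is keyed as `"no_object"`, and the default weights $\lambda_{L1} = 5$ and $\lambda_{giou} = 2$ are the DETR defaults, since the values used for S2H are not stated here.

```python
import math


def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2) with positive area."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest box enclosing both a and b.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def detection_loss(preds, targets, lam_l1=5.0, lam_giou=2.0):
    """Eq. (2) for already matched (prediction, target) pairs.

    preds:   list of (class_probs: dict, box) per query.
    targets: list of (label or None, box or None); None means "no object".
    """
    total = 0.0
    for (probs, box), (label, gt_box) in zip(preds, targets):
        key = label if label is not None else "no_object"
        total += -math.log(probs[key])                # classification term
        if label is not None:                         # indicator 1{c* != empty}
            total += lam_l1 * sum(abs(p - g) for p, g in zip(box, gt_box))
            total += lam_giou * (1.0 - giou(box, gt_box))
    return total
```

GIoU equals 1 for identical boxes and becomes negative for well-separated ones, so the last term stays informative even when a predicted box does not overlap its target at all.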
Sound Separation Loss. For each video, we minimize the L1 distance between the predicted and ground truth spectrogram corresponding to each object in it:
$$\mathcal{L}_{ss} = \sum_{o} \|\hat{M}_o - M_o\|_1, \tag{3}$$
where $M_o$ is the true audio spectrogram produced by the object $o$ within the video, and $\hat{M}_o$ is its prediction. As the training set does not provide the true per-source sound spectrogram, pseudo-ground truth can be adopted for this loss. (See Sec. 5.1 for our experimental settings.)
Overall Objective. The overall objective is given by
$$\mathcal{L} = \mathcal{L}_{ss} + \lambda \mathcal{L}_{od}, \tag{4}$$
where $\lambda$ is a hyperparameter. The sound source separation loss $\mathcal{L}_{ss}$ backpropagates all the way through the shared visual and audio transformers, enabling synergy between detection and separation. Although $\mathcal{L}_{od}$ only flows within the visual branch, the bounding box predictions it refines lead to more precise object queries for cross-attention, thereby indirectly improving the audio representation learned via $\mathcal{L}_{ss}$. This synergy fosters better separation and detection.
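The separation loss of Eq. (3) and the weighted sum of Eq. (4) amount to a few lines of arithmetic, sketched here with NumPy. The function names are hypothetical, and the default $\lambda = 0.1$ is the value reported in the implementation details of Sec. 5.1.

```python
import numpy as np


def separation_loss(pred_masks, true_masks):
    """Eq. (3): sum over objects of the L1 distance between prediction and target."""
    return float(sum(np.abs(p - t).sum() for p, t in zip(pred_masks, true_masks)))


def total_loss(l_ss, l_od, lam=0.1):
    """Eq. (4): separation loss plus lambda times the detection loss."""
    return l_ss + lam * l_od


pred = [np.full((2, 2), 0.5)]     # one object, constant predicted mask
true = [np.ones((2, 2))]          # its (pseudo-)ground-truth target
l_ss = separation_loss(pred, true)
loss = total_loss(l_ss, l_od=10.0)
```

Setting `lam=0` removes the detection term entirely, which is the configuration examined in ablation (4).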
5. EXPERIMENTS
5.1 Experimental Settings
Datasets. We evaluate our model on MUSIC [10] and MUSIC-21 [16]. MUSIC contains 685 solo and duet videos with 11 musical instrument categories, of which 637 are currently available. MUSIC-21 extends it to 1,365 solo videos with 21 instruments, and 1,040 videos are currently available.
We follow the standard protocol [10] of setting aside the first video of each class for validation and the second one for test, leaving the rest to form the training set.
Evaluation Protocol. Since no publicly available dataset provides ground truth labels for sound source separation, we follow the widely-used mix-and-separate paradigm [5, 10–15, 27] for training. Specifically, we randomly sample $M$ videos $\{(V^{(m)}, s^{(m)})\}_{m=1}^{M}$ from the training data and mix their audio by $s_{mix} = \sum_{m=1}^{M} s^{(m)}$. We denote its spectrogram by $S_{mix}$. We take the normalized mask of each video $m$ as its ground-truth mask:
$$M^{(m)} = \frac{S^{(m)}}{\sum_{m'} S^{(m')}}, \tag{5}$$
where $S^{(m)}$ is the spectrogram of $s^{(m)}$ and the division is performed element-wise.
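The ratio masks of Eq. (5) can be computed directly from the per-video magnitude spectrograms, as in the sketch below. The function name and the small constant `eps` guarding against division by zero in silent bins are additions made for this illustration and do not appear in Eq. (5).

```python
import numpy as np


def ratio_masks(specs, eps=1e-8):
    """Eq. (5): each source spectrogram divided element-wise by their sum.

    specs: sequence of M non-negative arrays of identical shape (F, T).
    eps:   stabilizer for empty bins (an assumption, not part of Eq. (5)).
    """
    stacked = np.stack(specs)                      # (M, F, T)
    total = stacked.sum(axis=0, keepdims=True)     # denominator of Eq. (5)
    return stacked / (total + eps)


rng = np.random.default_rng(0)
S1 = rng.random((4, 5)) + 0.1   # toy magnitude spectrogram of video 1
S2 = rng.random((4, 5)) + 0.1   # toy magnitude spectrogram of video 2
masks = ratio_masks([S1, S2])   # shape (2, 4, 5)
```

By construction the masks of all mixed videos sum to one in every time-frequency bin, so each bin's energy is split among the sources in proportion to their magnitudes.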
Since the MUSIC and MUSIC-21 datasets do not provide bounding-box annotations for object detection, we obtain pseudo-ground truth boxes in each frame using a well-established open-vocabulary detector, following prior work [15]. Specifically, we adopt Detic [41], which is capable of detecting arbitrary categories when provided with a text prompt. To generate the training samples, we perform inference on each frame with a relevant instrument prompt (e.g., “guitar”, “violin”, “flute”, “saxophone”, and so on) and take the predicted bounding boxes with confidence score ≥ 0.7. If there is more than one bounding box with IoU ≥ 0.7, we take the maximal one only. These box predictions then serve as pseudo-labels to train our visual branch.
Data Preprocessing. We sample $F = 3$ frames per video with a stride of 1 and a video sampling rate of 1 fps, and resize each frame to a maximum side length of 256 pixels while preserving aspect ratio. For the audio input, we sample 6 seconds of the input audio at 11 kHz, centered at the middle of each sampled frame. We apply Short-Time Fourier Transform (STFT) with a window size of 1,022 and hop length of 256, resulting in a 512 × 256 complex spectrogram, which is then resampled on a log-frequency scale to 256 × 256. The phase information is preserved for reconstruction.
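The 512 × 256 spectrogram size follows from the STFT parameters, which the short calculation below makes explicit. Two points are assumptions: the framing convention (a centered STFT that pads the signal, giving `1 + num_samples // hop` frames) and the clip length of 65,535 samples, which is roughly 6 seconds at about 11 kHz and is chosen here because it reproduces exactly 256 frames; the exact sample count is not stated in the text.

```python
def stft_shape(num_samples, n_fft=1022, hop=256, center=True):
    """Return (frequency bins, time frames) of a one-sided STFT.

    center=True assumes the signal is padded so that frames are centered,
    which is a common default but an assumption about the setup here.
    """
    freq_bins = n_fft // 2 + 1                         # 1022 // 2 + 1 = 512
    if center:
        frames = 1 + num_samples // hop
    else:
        frames = 1 + (num_samples - n_fft) // hop
    return freq_bins, frames


bins, frames = stft_shape(65535)   # assumed clip length of about 6 s
```

A window of 1,022 samples is what yields exactly 512 one-sided frequency bins, which the log-frequency resampling then reduces to 256.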
Backbone Models. For the visual branch, we use DETR [33] as a set-based object detector to predict bounding boxes and class probabilities. It has a ResNet backbone [31] that extracts feature maps, which are then flattened and passed to a transformer encoder. The hyperparameter $Q$ is usually set as the maximally expected number of objects in a frame, and we use $Q = 12$ for MUSIC and $Q = 22$ for MUSIC-21, reserving one query as the background class.
For the audio branch, we adopt AST [26], a transformer-based audio classifier, given a log-mel spectrogram of size $F \times T$. While AST was originally designed for classification, we use it to obtain a sequence of patch embeddings $A \in \mathbb{R}^{N_A \times D}$, dropping the classification token.
Baselines. We compare our S2H with the state-of-the-art sound separation models, iQuery [15], Sound-of-Pixels (SoP) [10], and CoSep [11]. Since the number of available videos varies depending on the access time, we re-train all baselines on the same set of currently available videos to ensure a fair comparison.
Evaluation Metrics. The sound separation task is evaluated using Source-to-Distortion Ratio (SDR), Source-to-Interference Ratio (SIR), and Source-to-Artifacts Ratio (SAR). For object detection, we measure the mean Average Precision (mAP) and mean Intersection over Union (mIoU).
Implementation Details. We use 6 encoder and 6 decoder layers for the visual branch. For the audio encoder, we use the first 6 layers of a pre-trained AST model. We grid-search $\lambda$ by cross-validation and set it to 0.1. We set the bounding box overlap threshold $\theta = 0.7$.
We train for 100 epochs using the AdamW optimizer with a batch size of 48, decaying the learning rate at epoch 80 by a factor of 0.1. The learning rates are set to $1 \times 10^{-5}$ for the visual backbone, $2 \times 10^{-5}$ for the DETR encoder and decoder, $5 \times 10^{-5}$ for the AST audio encoder, and $1 \times 10^{-4}$ for the audio-visual decoder and the audio decoder. We mix 2 sound tracks per mini-batch, leading to up to 4 distinct instruments in the mixture.
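The per-module learning rates and the single decay step can be summarized as a small lookup, sketched below. The group names are hypothetical labels for the four modules listed above, and the sketch assumes epochs are counted from 0 with the decay taking effect from epoch index 80 onward; the text does not specify the indexing.

```python
# Base learning rates per module, as listed in the implementation details.
# The dictionary keys are hypothetical names chosen for this sketch.
BASE_LR = {
    "visual_backbone": 1e-5,
    "detr_transformer": 2e-5,
    "ast_encoder": 5e-5,
    "av_and_audio_decoder": 1e-4,
}


def learning_rate(group, epoch, decay_epoch=80, gamma=0.1):
    """Step schedule: multiply the base rate by gamma from decay_epoch onward."""
    scale = gamma if epoch >= decay_epoch else 1.0
    return BASE_LR[group] * scale


schedule = {g: (learning_rate(g, 0), learning_rate(g, 99)) for g in BASE_LR}
```

In an optimizer that supports parameter groups, such as AdamW in common deep learning libraries, each of these rates would be attached to the parameters of the corresponding module.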
5.2 Comparison with Competing Methods
Tab. 1 shows the results on MUSIC (left) and MUSIC-21 (right). Our S2H outperforms existing baselines in sound separation, achieving higher SDR, SIR, and SAR. The improvement in SDR is particularly substantial (gains of about +1.0 dB over iQuery [15] on MUSIC and about +1.7 dB on MUSIC-21). We attribute this to the explicit synergy between object detection and sound separation in our architecture.

Method | MUSIC SDR | MUSIC SIR | MUSIC SAR | MUSIC-21 SDR | MUSIC-21 SIR | MUSIC-21 SAR
Sound-of-Pixels | 5.63 | 6.85 | 9.80 | 5.77 | 9.95 | 10.33
CoSep | 5.72 | 8.00 | 8.13 | 6.17 | 8.73 | 10.18
iQuery | 8.04 | 11.63 | 11.92 | 7.51 | 11.16 | 11.64
S2H (Ours) | 9.03 | 12.85 | 13.99 | 9.20 | 12.54 | 14.79
Table 1: Sound separation performance of competing models. On MUSIC [10] and MUSIC-21 [16], S2H achieves state-of-the-art separation metrics.
Fig. 2 further illustrates this improvement with qualitative examples. In a duet video with flute and violin, S2H accurately localizes both instruments and reconstructs their individual spectrograms with minimal interference.
5.3 Ablation Studies
We perform ablations on MUSIC to isolate key design choices in S2H. Tab. 2 summarizes the results.
(1) Effect of Bounding Box Filtering. We compare the performance of our full model to the one without the dynamic bounding box filtering described in Sec. 4.1. The performance significantly drops, e.g., from 9.03 to 8.19 for sound source separation (in SDR) and from 0.67 to 0.54 for object detection (in mAP). Without filtering, many low-confidence or heavily duplicated bounding boxes are passed to the audio-visual decoder, confusing the model during the fusion step, as they contain object queries that do not correspond to any true sounding object. Interestingly, the mIoU remains fairly high (0.85) even without the bounding box filtering, suggesting that some boxes are spatially correct, yet the sheer number of boxes for the same object degrades both detection precision and separation quality.
(2) Effect of Cross-modal Fusion. Our proposed S2H introduces cross-modal fusion, aiming to create synergy between the object detection and sound separation tasks. In order to see the effect of this design, we compare with a simple U-Net-based decoder structure, which has been widely used in audio source separation, instead of our transformer-based cross-attention. With this design, the visual branch simply yields a latent feature (instead of bounding-box queries), which is then concatenated with audio features. We observe a drastic performance drop in sound source separation metrics (e.g., 9.03 → 2.34 in SDR). This highlights the benefit of our transformer-based cross-attention, which allows explicit object-aware fusion between the bounding-box queries and the spectrogram embeddings.
(3) Effect of Cross-attention. Instead of completely replacing the audio-visual fusion with a U-Net, we further experiment with our audio-visual fusion using only self-attention to see the effect of cross-modal attention. As each modality is independently processed, the performance drops sharply (9.03 → 5.22 in SDR), reaffirming the benefits of cross-modal interaction. Similarly, the detection mAP also degrades (0.67 → 0.32), while the mIoU stays roughly unchanged (0.80 → 0.82), since the visual branch no longer receives indirect supervision signals from the separation loss.

Figure 2: Qualitative examples of object detection and sound source separation. (a) Predicted bounding boxes in a sampled video frame. (b) Ground-truth spectrogram of the source audio. (c)–(e) The separated spectrogram by our and baseline methods.
[Figure 2 panels: (a) Detection Results, (b) Ground Truth, (c) Ours, (d) SoP, (e) iQuery; rows: Video #1, Video #2]
(4) Effect of Joint Training. We remove the detection loss by setting $\lambda = 0$ in Eq. (4), effectively discarding the detection supervision. Although the visual branch still encodes the visual features, it has no further incentive to localize objects or refine bounding-box predictions tailored to sound source separation. In this scenario, the detection performance drops to 0, showing that the bounding boxes degenerate immediately. The sound separation quality is also reduced (SDR 9.03 → 7.67), indicating that accurate object localization significantly helps the sound separation task. This underscores the synergy between detection and separation.
Configuration | SDR | SIR | SAR | mAP | mIoU
Full S2H | 9.03 | 12.85 | 13.99 | 0.67 | 0.80
(1) No b-box filtering | 8.19 | 11.17 | 14.80 | 0.54 | 0.85
(2) U-Net-like decoder | 2.34 | 8.33 | 5.94 | 0.25 | 0.84
(3) No cross-attention | 5.22 | 8.67 | 12.54 | 0.32 | 0.82
(4) No L_od loss | 7.67 | 11.05 | 13.49 | 0.00 | 0.00
Table 2: Ablation studies on MUSIC (SDR, SIR, SAR: sound separation; mAP, mIoU: detection). Each component significantly contributes to the performance of S2H, for both sound source separation and object detection.

6. CONCLUSION
We present See2Hear (S2H), a novel framework that jointly learns object detection and sound source separation. By leveraging transformer-based modules both for the visual and audio branches, and by integrating them through a shared audio-visual decoder, S2H captures richer cross-modal dependencies than previous disjoint or sequential methods. Our dynamic filtering mechanism ensures that only relevant detections guide separation. Extensive experiments on MUSIC and MUSIC-21 demonstrate that S2H achieves the state-of-the-art sound separation performance, while also maintaining competitive detection accuracy. Our ablation studies confirm that object detection and sound separation indeed mutually benefit each other when trained end-to-end.
In spite of the promising results, S2H still has some limitations. Mainly due to the lack of labeled data for sound source separation, S2H has been verified mostly on single-instrument scenes. With larger-scale data, it could be extended to multi-instrument scenes with more complex polyphony. Also, as its application is not confined to musical audio, it could be applied to other multimodal tasks (e.g., speech-driven detection or action recognition).
Acknowledgments. This work was supported by Youlchon Foundation, NRF (RS-2021-NR05515, RS-2024-00336576, RS-2023-0022663) and IITP grants (RS-2022-II220264, RS-2024-00353131, RS-2022-00155966) funded by the government of Korea.
7. REFERENCES
[1] C. Opoku-Baah, A. M. Schoenhaut, S. G. Vassall, D. A. Tovar, R. Ramachandran, and M. T. Wallace, “Visual influences on auditory behavioral, neural, and perceptual processes: a review,” Journal of the Association for Research in Otolaryngology, vol. 22, no. 4, pp. 365–386, 2021.
[2] A. Tonelli, L. F. Cuturi, and M. Gori, “The influence of auditory information on visual size adaptation,” Frontiers in Neuroscience, vol. 11, p. 594, 2017.
[3] Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and D. P. Ellis, “MuLan: A joint embedding of music audio and natural language,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022.
[4] R. Arandjelović and A. Zisserman, “Look, listen and learn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[5] A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[6] Y. Aytar, C. Vondrick, and A. Torralba, “SoundNet: Learning sound representations from unlabeled video,” in Advances in Neural Information Processing Systems (NeurIPS), 2016.
[7] T. Afouras, A. Owens, J. S. Chung, and A. Zisserman, “Self-supervised learning of audio-visual objects from video,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[8] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon, “Learning to localize sound source in visual scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[9] K. Qian, Y. Zhang, S. Chang, D. Cox, and M. Hasegawa-Johnson, “Unsupervised speech decomposition via triple information bottleneck,” in Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
[10] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, “The sound of pixels,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[11] R. Gao and K. Grauman, “Co-separating sounds of visual objects,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[12] X. Xu, B. Dai, and D. Lin, “Recursive visual sound separation using minus-plus net,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[13] Y. Tian, D. Hu, and C. Xu, “Cyclic co-learning of sounding object visual grounding and sound separation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[14] L. Zhu and E. Rahtu, “Visually guided sound source separation and localization using self-supervised motion representations,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2022.
[15] J. Chen, R. Li, Z. Hou, G. Zhang, L. Peng, and C.-W. Ngo, “iQuery: Instruments as queries for audio-visual sound separation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[16] H. Zhao, C. Gan, W.-C. Ma, and A. Torralba, “The sound of motions,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[17] J.-F. Cardoso, “Blind signal separation: Statistical principles,” Proceedings of the IEEE, vol. 86, no. 10, pp. 2009–2025, 1998.
[18] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, vol. 7, no. 6, pp. 1129–1159, 1995.
[19] A. Hyvärinen and E. Oja, “Independent component analysis: Algorithms and applications,” Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[20] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[21] P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
[22] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[23] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[24] A. Jansson, E. J. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep u-net convolutional networks,” in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017.
[25] D. Stoller, S. Ewert, and S. Dixon, “Wave-u-net: A multi-scale neural network for end-to-end audio source separation,” in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), 2018.
[26] Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio spectrogram transformer,” in Proceedings of the Interspeech Conference, 2021.
[27] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 112:1–112:11, 2018.
[28] T. Rahman, M. Yang, and L. Sigal, “TriBERT: Human-centric audio-visual representation learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2021.
[29] T. Rahman and L. Sigal, “Weakly-supervised audio-visual sound source detection and separation,” in IEEE International Conference on Multimedia and Expo (ICME), 2021.
[30] C. Huang, S. Liang, Y. Tian, A. Kumar, and C. Xu, “DAVIS: High-quality audio-visual separation with generative diffusion models,” arXiv preprint arXiv:2308.00122, 2023.
[31] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[32] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015.
[33] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[34] F. R. Valverde, J. V. Hurtado, and A. Valada, “There is more than meets the eye: Self-supervised multi-object detection and tracking with sound by distilling multimodal knowledge,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[35] S. Mo and Y. Tian, “Audio-visual grouping network for sound localization from mixtures,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[36] H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Localizing visual sounds the hard way,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[37] S. Pegg, K. Li, and X. Hu, “TDFNet: An efficient audio-visual speech separation model with top-down fusion,” in International Conference on Information Science and Technology, 2023.
[38] K. Li, X. Hu, S. Pegg, R. Zhang, F. Zhou, X. Wu, and X. Liu, “IIANet: An intra- and inter-modality attention network for audio-visual speech separation,” in Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
[39] Y. Yu and S. Sun, “DGFNet: End-to-end audio-visual source separation based on dynamic gating fusion,” in Proceedings of the International Conference on Multimedia Retrieval (ICMR), 2025.
[40] C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba, “Music gesture for visual sound separation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[41] X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting twenty-thousand classes using image-level supervision,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022.