
AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding

Author: Wang, Yidan; Zhuang, Chenyi; Liu, Wutao; Gao, Pan; Sebe, Niculae
Publisher: Zenodo
DOI: 10.1145/3746027.3755751
Source: https://zenodo.org/records/17689270/files/3746027.3755751.pdf
Yidan Wang∗
Nanjing University of Aeronautics and Astronautics
Nanjing, China
[email protected]
Chenyi Zhuang∗
University of Trento
Trento, Italy
[email protected]
Wutao Liu
Nanjing University of Aeronautics and Astronautics
Nanjing, China
[email protected]
Pan Gao†
Nanjing University of Aeronautics and Astronautics
Nanjing, China
[email protected]
Nicu Sebe
University of Trento
Trento, Italy
[email protected]
Abstract
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
CCS Concepts
• Computing methodologies → Image segmentation; Scene understanding.
Keywords
Weakly Supervised Visual Grounding, Multimodality
∗Equal contribution.
†Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM ’25, Dublin, Ireland, October 27–31, 2025
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-2035-2/2025/10
https://doi.org/10.1145/3746027.3755751
ACM Reference Format:
Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, and Nicu Sebe. 2025. AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3746027.3755751
1 Introduction
Visual Grounding (VG) aims to identify objects in an image corresponding to a given text description and has gained attention for its potential in open-ended detection for various computer vision applications [25, 31, 39]. While fully supervised methods [5, 22, 38, 42, 43] achieve high accuracy, they rely on instance-level annotations, which are both labor-intensive and time-consuming to obtain. To alleviate this burden, recent studies have explored weakly supervised learning in two grounding tasks, namely Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). These works have reframed weakly supervised VG as a region-based [7, 33, 41], anchor-based [9, 23], or query-based [3] matching problem. However, understanding complex textual descriptions and associating referents in multi-object images remains challenging.
In Figure 1, we identify two types of text annotation in existing VG benchmarks: (1) category-based annotations that distinguish the referred object in its fundamental class from objects in other categories. For example, in the sentence “girl with spoon”, two objects “girl” and “spoon” belong to different classes. The model should identify the logical object “girl” rather than the contextual object “spoon”, and align this linguistic category information with visual features. (2) attribute-based annotations that describe specific characteristics of the referred object, such as colors and spatial relations. For example, the sentence “guy on knees” has no conflict in the person category, but its descriptive information “on knees” poses a challenge to the model in identifying the referred object as the image contains multiple persons. This requires an understanding of nuanced textual and visual semantics. However, the state-of-the-art query-based method [3] fails to produce reliable grounding results on both annotation types. While this method utilizes contrastive learning to amplify the alignment of target texts and positive queries, it is not conducive to discriminating nuanced semantics in object categories and attributes. For the category-based annotation, the contextual object “spoon” is used to enrich the information of the target object “girl”, yet it has created activation noise and hindered the accurate visual-linguistic alignment, showing category inconsistency. For the attribute-based annotation, it mismatches the visual features of the incorrect guy to the target action “on knees”, showing attribute inconsistency.
[Figure 1 panels: (a) Framework and visualization of QueryMatch; (b) Framework and visualization of our AlignCAT. Example expressions: “girl with spoon” (category-based annotation) and “guy on knees” (attribute-based annotation).]
Figure 1: Comparison of QueryMatch and the proposed AlignCAT. (a) QueryMatch fails to deal with category-based and attribute-based ambiguity in annotations. (b) AlignCAT progressively leverages linguistic cues from coarse (right-top) to fine (right-bottom) to filter visual queries, achieving category and attribute consistency.
To address the aforementioned visual-linguistic inconsistencies, we propose AlignCAT (Align Category then ATtribute), a novel query-based VG framework. To ensure category consistency, we first design a coarse-grained alignment module that leverages category information and the global context from the input textual expression. This coarse alignment mitigates interference from irrelevant objects, effectively narrowing down the search space for ideal visual queries. To achieve attribute consistency, we further propose a fine-grained alignment module that employs adaptive phrase attention to capture word-level descriptive linguistic features. Such a fine alignment enhances cross-modal correspondences and resolves intra-class ambiguities when multiple visual objects belong to the same category. By aligning visual queries first at a coarse level and then at a fine level, AlignCAT highlights the key role of linguistic cues in understanding cross-modal representations. This category-then-attribute progressive alignment within a contrastive learning framework significantly enhances VG performance. In summary, the main contributions of this work are three-fold:
• We identify category inconsistency and attribute inconsistency in existing weakly supervised VG methods. To address these challenges, we propose a novel query-based category-then-attribute matching framework, modeling linguistic representations from general to detailed levels.
• To achieve visual-linguistic alignment, we design a coarse-grained module that leverages category information and global context to filter out category-inconsistent visual queries, and a fine-grained module that employs adaptive phrase attention to ensure attribute consistency.
• Evaluated on three benchmarks for REC and RES, our proposed method achieves state-of-the-art performance, demonstrating the potential of linguistic cues and the efficacy of the category-then-attribute matching strategy in enhancing visual-linguistic alignment.
2 Related Works
Referring Expression Segmentation (RES). Weakly supervised RES aims to overcome the efficiency limitation of fully supervised learning schemes. It does not require intensive pixel-level annotation, which makes it less expensive and more efficient for training. Several works [10, 32] achieve region-text matching through multi-instance learning, but are far inferior to fully supervised methods. Instead of aggregating visual entities, TRIS [16] extracts rough object locations as pseudo-labels based on the input text to perform object localization. Lee et al. [11] rely on the linguistic relationship, predicting significant maps for each word. However, the masks generated by these methods are highly noisy, resulting in less accurate segmentation.
Referring Expression Comprehension (REC). Compared to fully supervised REC, weakly supervised REC is more challenging due to the lack of bounding box annotations. To obtain additional supervision signals, existing REC methods [17, 36] incorporate external knowledge and align the region-based information with the corresponding phrases. Some works [2, 19] further utilize prior knowledge to filter out irrelevant region proposals. Recent advances also include leveraging language models to build negative samples [40], or pre-trained models to generate pseudo-labels [8, 21]. Yet, these two-stage methods lack generalization to real-world scenarios and large-scale tasks. To improve efficiency, anchor-based methods [9, 23] remove the region proposal stage towards a one-stage process. QueryMatch [3] further introduces a query-text matching scheme to improve the learning of object representations. Despite their advances in efficiency, we identify the challenges of category and attribute inconsistency in existing one-stage methods. To address this problem, our method leverages linguistic cues from general to specific, integrating category information and global context for coarse-grained alignment, and then exploiting word-level descriptions for fine-grained alignment. The category-then-attribute matching framework significantly improves VG results for both RES and REC tasks, especially in multi-object scenarios with complex text expressions.
[Figure 2 diagram: vision branch (image encoder, learnable queries, Transformer decoder, confidence-based selection, candidate set); text branch (text encoder, text classifier, predicted category); coarse-grained alignment (category matching, global similarity, refined set); fine-grained alignment (phrase attention, word-level similarity, selected query); grounding head producing the REC and RES results; contrastive learning and textual classification losses. Example text input: “orange tabby cat standing in a sink”.]
Figure 2: AlignCAT framework overview. AlignCAT filters visual queries by hierarchically leveraging linguistic cues. The coarse-grained alignment module utilizes category and global information to discard category-inconsistent candidates. The fine-grained alignment module employs adaptive phrase attention to select the attribute-consistent visual query.
3 Preliminary
Following [3], we reformulate VG as a query-text matching problem by adopting a query-based detector Mask2Former [4]. It establishes one-to-one associations with objects in the image by $N$ learnable vectors, namely queries, denoted as $\mathcal{Q} = \{q_1, ..., q_N\}$. QueryMatch filters out noisy and low-quality query features based on their confidence scores, resulting in a candidate set $\mathcal{Q}_O$, where $O$ is a pre-defined hyperparameter. This method defines two metrics, difficulty and uniqueness, to quantitatively estimate the quality of negative samples. Specifically, difficulty measures vision-language alignment, while uniqueness requires that high-quality negative queries significantly differ from other queries in the embedding space. Given a set of candidate queries, QueryMatch iteratively estimates the quality of the $i$-th query as:
$$S_{d_i} = \mathrm{Norm}(\mathrm{sim}(f_{q_i}, f_t)), \tag{1}$$
$$S_{u_i} = \mathrm{Norm}\Big(-\max_{j=1}^{M} \cos(f_{q_i}, f_{q_j})\Big), \tag{2}$$
where $f_{q_i}$ is the feature of the current negative query, $f_{q_j}$ is the feature of a previously selected negative query, and $f_t$ is the text feature. $S_{d_i}$ is the difficulty of the $i$-th query, measured by the dot product similarity between visual and linguistic features, denoted $\mathrm{sim}(f_{q_i}, f_t)$. $S_{u_i}$ is the uniqueness of the $i$-th query, measured by the cosine similarity between two visual queries, denoted $\cos(f_{q_i}, f_{q_j})$. $\mathrm{Norm}(\cdot)$ is the min-max normalization. The overall quality score of the negative query is defined as:
$$S_{q_i} = S_{d_i} \cdot S_{u_i}. \tag{3}$$
Ranked in descending order, an appropriate number of negative samples is selected to perform contrastive learning.
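The scoring in Eqs. (1)-(3) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the QueryMatch implementation: the function names (`quality_scores`, `min_max`) are hypothetical, features are plain lists of floats, at least one negative query is assumed to be already selected, and a constant input to the normalization is mapped to zeros.

```python
import math

def dot(a, b):
    """Dot product similarity, sim(., .) in Eq. (1)."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, cos(., .) in Eq. (2)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def min_max(values):
    """Min-max normalization, Norm(.). Assumption: constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def quality_scores(queries, text, selected):
    """Quality of each candidate negative query given already selected negatives."""
    difficulty = min_max([dot(q, text) for q in queries])             # Eq. (1)
    uniqueness = min_max(
        [-max(cosine(q, s) for s in selected) for q in queries]      # Eq. (2)
    )
    return [d * u for d, u in zip(difficulty, uniqueness)]            # Eq. (3)

# Toy example with 2-D features (values are made up for illustration).
queries = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
text = [1.0, 1.0]
selected = [[1.0, 0.0]]
scores = quality_scores(queries, text, selected)
best = scores.index(max(scores))
```

In this toy case the second query wins: it is the best aligned with the text and is not a duplicate of the already selected negative, which is the trade-off the product in Eq. (3) encodes.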
The effectiveness of QueryMatch relies on precise query-text matching. This method introduces an effective negative query selection scheme, but simply selects the positive query by computing the similarity to the global text feature $f_t$. However, textual descriptions are expressive and require strong reasoning abilities to understand them. As presented in Figure 1, QueryMatch fails at identifying visual queries from the candidate set $\mathcal{Q}_O$ to achieve visual-linguistic consistency at the category and attribute levels. In this study, we leverage linguistic cues to their fullest extent, emphasizing both coarse and fine-grained information. Based on the category-then-attribute alignment strategy, our proposed AlignCAT can select high-quality positive visual queries and enhance query-text matching accuracy.
4 Methodology
4.1 Overview of AlignCAT
Given the input image $I$ and the input expression $T$, we aim to locate the referred object through a bounding box (for REC) or a mask (for RES). To address the challenges of category and attribute inconsistencies, we introduce AlignCAT, a novel query-based weakly supervised VG framework. As illustrated in Figure 2, the main goal of AlignCAT is to select high-quality positive queries through a category-then-attribute matching mechanism for efficient contrastive learning. We follow [3] and adopt a query-based detector [4] to process the input image $I$. The encoded image features are then fed into the Transformer decoder to interact with $N$ learnable queries, outputting the query features $\{f_{q_1}, ..., f_{q_N}\} \in \mathbb{R}^{d_v}$ of visual queries in $\mathcal{Q}$, where $d_v$ is the visual dimension, and one-to-one classifications $\{c^V_1, ..., c^V_N\}$ predicted by the visual classifier.
Unlike existing query-based VG frameworks [3], we leverage linguistic cues in the input expression $T$ to achieve visual-linguistic alignment. The text encoder transforms $T$ into the global feature $f_t \in \mathbb{R}^{d_t}$ and the word-level features $\mathbf{F}_w \in \mathbb{R}^{l \times d_t}$, where $l$ is the length of the input text and $d_t$ is the dimension of the linguistic feature. Our method progressively filters out visual queries through three sequential selection modules: (1) a confidence-based filtering stage reduces the number of visual queries from $N$ to $O$, forming a candidate subset $\mathcal{Q}_O$; (2) a coarse-grained alignment module evaluates category consistency and global query-text similarity to further refine the candidates into a refined query set $\tilde{Q}$; (3) a fine-grained alignment captures attribute details by recalibrating the word-level features $\mathbf{F}_w$, ultimately selecting the most relevant query $q^*$. Finally, we can decode the selected visual query to obtain the bounding box or the mask of the referred object through a grounding head:
$$r^* = \mathrm{Head}(q^*). \tag{4}$$
4.2 Coarse-grained Alignment
To address category inconsistency, we design a coarse-grained alignment module to first filter out irrelevant visual queries. We discern that the category information is readily available in the input text. For example, it is apparent from the input text “orange tabby cat standing in a sink” that the category of the referred object is “cat”. We are motivated to predict the specific category and inject this information to ensure that our selected queries belong to the target category. This category-based query-text matching, along with a global query-text matching, effectively mitigates interference from irrelevant objects in the candidate set $\mathcal{Q}_O$. More specifically, at the category matching stage, we inject a Ground Truth (GT) category $c^*$, which is the class label annotation obtained from the dataset. For each query $q_i \in \mathcal{Q}_O$, the Transformer decoder predicts its corresponding category $c^V_i \in \{1, 2, ..., C\}$ through a classification head, where $C$ is a pre-defined number of total categories (e.g., $C = 80$ for MSCOCO [14]). The category score measures whether the predicted query category $c^V_i$ is consistent with the GT category $c^*$. If they are the same, the category score $S_{\mathrm{class},i}$ is set to 1; otherwise, it is set to 0. The above process can be formulated as:
$$S_{\mathrm{class},i} = \begin{cases} 1, & \text{if } c^V_i = c^*, \\ 0, & \text{otherwise}. \end{cases} \tag{5}$$
The category-based query-text matching effectively filters out visual queries that belong to irrelevant categories. To fully exploit context in textual representation, we project the query feature $f_{q_i}$ and global text feature $f_t$ into a coarse-grained shared semantic space to learn global visual-linguistic alignment:
$$\bar{f}_{q_i} = f_{q_i} \cdot W_q + b_q, \tag{6}$$
$$\bar{f}_t = f_t \cdot W_t + b_t, \tag{7}$$
where $W_q, W_t$ are projection matrices and $b_q, b_t$ are biases that transform the image and text features, respectively. After projection, we calculate the global query-text matching score:
$$S_{\mathrm{global},i} = \mathrm{sim}(\bar{f}_{q_i}, \bar{f}_t), \tag{8}$$
where $\mathrm{sim}(\cdot)$ is the dot product similarity to measure the alignment between each visual query and the global linguistic feature.
Overall, we define the coarse-grained alignment score as the weighted sum of the category score and the global score:
$$S_{\mathrm{coarse},i} = \alpha S_{\mathrm{class},i} + S_{\mathrm{global},i}, \tag{9}$$
where $\alpha$ is a hyperparameter that balances the two scores.
Designed to ensure category consistency, this coarse-grained alignment module filters out category-inconsistent visual queries. In other words, only the queries with $S_{\mathrm{class}} = 1$ are selected to construct the refined set $\tilde{Q}$. We also define a threshold $K$ to curtail the query number based on the coarse-grained score $S_{\mathrm{coarse}}$. More details are in the supplemental material.
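The coarse-grained filtering of Eqs. (5)-(9) can be sketched as follows. This is a minimal illustration, not the authors' code: `coarse_filter` and `project` are hypothetical names, features are plain Python lists, the projection parameters are passed in explicitly, and $K$ is interpreted as a cap on the number of kept queries (the exact rule is deferred to the supplemental material in the text). Returning an empty list when no query matches the category is also an assumption.

```python
def dot(a, b):
    """Dot product similarity, sim(., .) in Eq. (8)."""
    return sum(x * y for x, y in zip(a, b))

def project(x, W, b):
    """Linear projection x . W + b, with W given as a d_in x d_out nested list."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def coarse_filter(query_feats, query_classes, text_feat, gt_class,
                  Wq, bq, Wt, bt, alpha=100.0, K=10):
    """Return indices of category-consistent queries, best coarse score first."""
    t_bar = project(text_feat, Wt, bt)                       # Eq. (7)
    kept = []
    for i, (f, c) in enumerate(zip(query_feats, query_classes)):
        s_class = 1.0 if c == gt_class else 0.0              # Eq. (5)
        s_global = dot(project(f, Wq, bq), t_bar)            # Eqs. (6), (8)
        s_coarse = alpha * s_class + s_global                # Eq. (9)
        if s_class == 1.0:                                   # keep S_class = 1 only
            kept.append((s_coarse, i))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in kept[:K]]                          # assumed top-K cap

# Toy example: identity projections, three candidates, target class 3.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
classes = [3, 5, 3]
refined = coarse_filter(feats, classes, [1.0, 0.5], 3,
                        identity, zero, identity, zero, alpha=100.0, K=2)
top1 = coarse_filter(feats, classes, [1.0, 0.5], 3,
                     identity, zero, identity, zero, alpha=100.0, K=1)
```

With a large $\alpha$ (the experiments use 100), the category term dominates the ranking, so the global similarity mainly orders queries within the matching category.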
4.3 Fine-grained Alignment
The above coarse-grained alignment module utilizes general linguistic cues to ensure category consistency and filter out category-inconsistent candidates. However, it is insufficient to discriminate nuanced semantics and achieve attribute consistency. We further introduce a fine-grained alignment that emphasizes descriptive information in word-level textual features, thereby capturing attribute-aware cross-modal correspondences.
Specifically, we adopt an adaptive phrase attention mechanism [35] to emphasize linguistic semantics within the word-level features $\mathbf{F}_w$. Instead of focusing on global context or category-level information, this module highlights fine-grained descriptive details by assigning higher attention weights to attribute words and lower weights to category words. For instance, the phrase “standing in a sink” provides more discriminative linguistic cues than other words when distinguishing between two cats, and is therefore given greater attention. More precisely, the word-level features $\mathbf{F}_w$ are first processed by a Bidirectional GRU (Bi-GRU) to recalibrate the importance of each word, which can be formulated as:
$$\tilde{\mathbf{F}}_w = [\overrightarrow{\mathbf{F}}_w, \overleftarrow{\mathbf{F}}_w] = E(\mathbf{F}_w, \theta), \tag{10}$$
where $E$ and $\theta$ represent the Bi-GRU module and its parameters, respectively. $\tilde{\mathbf{F}}_w$ denotes the modulated word-level features that concatenate bidirectional outputs from the Bi-GRU network. To achieve a more adaptive aggregation, we dynamically balance the weights of the predicted features as:
$$\tilde{\mathbf{F}}_w := \tilde{\mathbf{F}}_w \cdot \mathrm{softmax}(\mathrm{FC}(\tilde{\mathbf{F}}_w)), \tag{11}$$
where $\mathrm{FC}(\cdot)$ is a fully connected layer to predict the weight assigned to each word.
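The reweighting step of Eq. (11) can be illustrated with a small sketch. It assumes one plausible reading of the equation: the fully connected layer maps each word feature to a single logit, the softmax is taken over the words of the expression, and each word feature is scaled by its weight. The Bi-GRU of Eq. (10) is left out, so `word_feats` stands in for its output; `phrase_attention`, `fc_weight` and `fc_bias` are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def phrase_attention(word_feats, fc_weight, fc_bias):
    """Scale each word feature by a softmax weight predicted from that feature."""
    # One scalar logit per word from a single-output linear layer (assumed FC shape).
    logits = [sum(x * w for x, w in zip(f, fc_weight)) + fc_bias
              for f in word_feats]
    weights = softmax(logits)                                 # one weight per word
    reweighted = [[w * x for x in f] for w, f in zip(weights, word_feats)]
    return weights, reweighted

# Toy example: three "words" with 2-D features; the last one gets the largest logit.
word_feats = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
weights, reweighted = phrase_attention(word_feats, fc_weight=[1.0, 0.0], fc_bias=0.0)
```

Because the weights sum to one across the expression, raising the weight of an attribute word necessarily lowers the weight of the others, which matches the stated goal of emphasizing descriptive words over category words.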
To learn local visual-linguistic alignment, we aggregate these word-level features and project them into a fine-grained semantically shared space, where the query features are also projected. We formulate the above process as follows:
$$\tilde{f}_w = f_w \cdot W'_t + b'_t, \quad \text{where } f_w = \sum_{l} \tilde{\mathbf{F}}_w, \tag{12}$$
$$\tilde{f}_{q_i} = f_{q_i} \cdot W'_q + b'_q, \tag{13}$$
where $f_{q_i}$ is the $i$-th query in the refined query set $\tilde{Q}$. The projection matrices are $W'_t$ and $W'_q$, and the bias terms are $b'_t$ and $b'_q$. These parameters are used to transform the text and image features, respectively.
To select the visual query that best matches the text expression at the attribute level, we define the fine-grained alignment score as the dot product similarity between each visual query and the fine-grained adapted text feature, which can be expressed as:
$$S_{\mathrm{fine},i} = \mathrm{sim}(\tilde{f}_{q_i}, \tilde{f}_w). \tag{14}$$
Table 1: Comparisons with state-of-the-art methods on three RES benchmark datasets. Best in red and second in blue.
Method | Venue | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g
AMR [29] | AAAI’22 | 14.12 | 11.69 | 17.47 | 14.13 | 11.47 | 18.13 | 15.83
GroupViT [37] | CVPR’22 | 18.03 | 18.13 | 19.33 | 18.15 | 17.65 | 19.53 | 19.97
CLIP-ES [15] | CVPR’23 | 13.79 | 15.23 | 12.87 | 14.57 | 16.01 | 13.53 | 14.16
GbS [1] | ICCV’21 | 14.59 | 14.60 | 14.97 | 14.49 | 14.49 | 15.77 | 14.21
WWbL [30] | NeurIPS’22 | 18.26 | 17.37 | 19.90 | 19.85 | 18.70 | 21.64 | 21.84
TSEG [32] | arXiv’20 | 30.12 | - | - | 25.95 | - | - | 22.62
ALBEF [13] | NeurIPS’21 | 23.11 | 22.79 | 23.42 | 22.44 | 22.07 | 22.51 | 24.18
I-Chunk [12] | ICCV’23 | 31.06 | 32.30 | 30.11 | 31.28 | 32.11 | 30.13 | 32.88
TRIS [16] | ICCV’23 | 31.17 | 32.43 | 29.56 | 30.90 | 30.42 | 30.80 | 36.00
APL [23] | ECCV’24 | 55.92 | 54.84 | 55.64 | 34.92 | 34.87 | 35.61 | 40.13
QueryMatch [3] | MM’24 | 59.10 | 59.08 | 58.82 | 39.87 | 41.44 | 37.22 | 43.06
Ours | - | 61.83 | 62.75 | 60.02 | 42.05 | 46.39 | 37.53 | 49.06
Table 2: Comparisons with state-of-the-art methods on three REC benchmark datasets.
Method | Venue | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g
VC [28] | TPAMI’19 | - | 32.68 | 27.22 | - | 34.68 | 28.10 | 29.65
ARN [18] | ICCV’19 | 32.17 | 35.25 | 30.28 | 32.78 | 34.35 | 32.13 | 33.09
KPRN [20] | MM’19 | 36.34 | 35.28 | 37.72 | 37.16 | 36.06 | 39.29 | 38.37
IGN [41] | NeurIPS’20 | 34.78 | 37.64 | 32.59 | 34.29 | 36.91 | 33.56 | 34.92
DTWREG [33] | TPAMI’21 | 38.35 | 39.51 | 37.01 | 38.91 | 39.91 | 37.09 | 42.54
Cycle-Free [34] | TMM’21 | 39.58 | 41.46 | 37.96 | 39.19 | 39.63 | 37.53 | -
EARN [17] | TPAMI’23 | 38.08 | 38.25 | 38.59 | 37.54 | 37.58 | 37.92 | 45.33
TGKD [26] | ICRA’23 | 39.70 | 39.92 | 39.63 | 40.20 | 39.94 | 40.27 | 47.99
RefCLIP [9] | CVPR’23 | 60.36 | 58.58 | 57.13 | 40.39 | 40.45 | 38.86 | 47.87
APL [23] | ECCV’24 | 64.51 | 61.91 | 63.57 | 42.70 | 42.84 | 39.80 | 50.22
QueryMatch [3] | MM’24 | 66.02 | 66.00 | 65.48 | 44.76 | 46.72 | 41.50 | 48.47
Ours | - | 69.03 | 70.27 | 66.59 | 47.16 | 52.22 | 41.91 | 54.72
Since the adapted word feature $\tilde{f}_w$ encodes discriminative linguistic semantics, this fine-grained alignment module enables the model to differentiate candidate visual queries based on their local representations, even when they belong to the same category. Finally, we select the query with the highest fine-grained alignment score $S_{\mathrm{fine},i}$ as the optimal query:
$$q^* = \arg\max_i S_{\mathrm{fine},i}. \tag{15}$$
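The final selection in Eqs. (12)-(15) reduces to a sum over words, two linear projections, a dot product and an argmax. The sketch below illustrates this under assumptions: `select_query` and `project` are hypothetical names, features are plain lists, the word features are assumed to be already reweighted by the phrase attention, and the projection parameters are supplied by the caller.

```python
def dot(a, b):
    """Dot product similarity, sim(., .) in Eq. (14)."""
    return sum(x * y for x, y in zip(a, b))

def project(x, W, b):
    """Linear projection x . W + b, with W given as a d_in x d_out nested list."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def select_query(refined_feats, reweighted_words, Wq, bq, Wt, bt):
    """Return (index of the best query, list of fine-grained scores)."""
    dim = len(reweighted_words[0])
    # Aggregate the reweighted word features by summing over the words (Eq. 12).
    f_w = [sum(word[d] for word in reweighted_words) for d in range(dim)]
    t_tilde = project(f_w, Wt, bt)                                        # Eq. (12)
    scores = [dot(project(f, Wq, bq), t_tilde) for f in refined_feats]    # Eqs. (13), (14)
    best = max(range(len(scores)), key=lambda i: scores[i])               # Eq. (15)
    return best, scores

# Toy example with identity projections and two same-category candidates.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
refined_feats = [[1.0, 0.0], [0.0, 1.0]]
reweighted_words = [[0.1, 0.2], [0.0, 0.7]]
best, scores = select_query(refined_feats, reweighted_words,
                            identity, zero, identity, zero)
```

Here both candidates are assumed to have passed the category filter, so only the attribute-weighted text feature decides between them.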
4.4 Training and Inference
We adopt a query-text contrastive learning strategy [3] to achieve weakly supervised learning. A common choice of cross-modal contrastive learning objective is InfoNCE:
$$\mathcal{L}_{cl}(h_t, h^+_q, h^-_q) = -\log \frac{\mathcal{T}(h_t, h^+_q)}{\mathcal{T}(h_t, h^+_q) + \sum_{h^-_q} \mathcal{T}(h_t, h^-_q)}, \tag{16}$$
where $\mathcal{T}(a, b) = \exp(\mathrm{sim}(a, b)/\tau)$, $\mathrm{sim}(\cdot)$ is the dot product similarity, and $\tau$ is the temperature. The text feature $h_t$ should match the visual feature of its designated query $h^+_q$ over a set of negative samples $h^-_q$ from other images.
In this study, we introduce two shared semantic spaces for visual-linguistic alignment. Therefore, the final contrastive learning objective of our AlignCAT is the sum of the objectives from the two spaces:
$$\mathcal{L}_{CL} = \mathcal{L}_{cl}(\bar{f}_t, \bar{f}^+_q, \bar{f}^-_q) + \mathcal{L}_{cl}(\tilde{f}_w, \tilde{f}^+_q, \tilde{f}^-_q). \tag{17}$$
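As a concrete reading of Eqs. (16) and (17), the following sketch computes InfoNCE for one text feature with a log-sum-exp for numerical stability, then adds the losses of the two spaces. The function names and the default temperature are assumptions made for illustration; the text does not state the value of $\tau$.

```python
import math

def dot(a, b):
    """Dot product similarity."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(text, positive, negatives, tau=0.1):
    """InfoNCE of Eq. (16); tau=0.1 is a placeholder, not a value from the paper."""
    logits = [dot(text, positive) / tau] + [dot(text, n) / tau for n in negatives]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log( exp(pos) / sum(exp(all)) ) = log_sum_exp - pos
    return log_sum_exp - logits[0]

def contrastive_objective(coarse_triplet, fine_triplet, tau=0.1):
    """Sum of the coarse-space and fine-space InfoNCE terms, Eq. (17)."""
    return info_nce(*coarse_triplet, tau=tau) + info_nce(*fine_triplet, tau=tau)

# Toy example: the positive is aligned with the text, the negatives are not.
text = [1.0, 0.0]
loss_good = info_nce(text, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]], tau=1.0)
loss_bad = info_nce(text, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], tau=1.0)
loss_alone = info_nce(text, [1.0, 0.0], [], tau=1.0)
total = contrastive_objective(
    (text, [1.0, 0.0], [[0.0, 1.0]]),
    (text, [1.0, 0.0], [[0.0, 1.0]]),
    tau=1.0,
)
```

The loss shrinks as the positive query becomes more similar to the text than the negatives, which is why the quality of both the selected positive and the sampled negatives matters for training.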
During training, we directly inject the GT category to calculate the category score. However, this information is not available at the inference stage. We therefore train an auxiliary classifier to predict the category from the text side. Specifically, we add a text classifier to project the global linguistic feature $f_t$ and produce the predicted category, denoted $c^T$. The standard cross-entropy loss is used to train this text classifier:
$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i), \tag{18}$$
where $y_i$ is the $i$-th entry of the one-hot encoding of the GT category $c^*$, and $\hat{y}_i$ is the predicted probability of the $i$-th class. We note that this text category $c^T$ differs from the query category $c^V_i$ as they are predicted from the linguistic feature $f_t$ and the visual feature $f_{q_i}$, respectively.
Overall, the weakly supervised learning objective of AlignCAT can be written as follows:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{CL} + \lambda_2 \mathcal{L}_{CE}, \tag{19}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters dynamically adjusted to control the strengths, detailed in the supplemental material.
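Eqs. (18) and (19) and the switch from the GT category to the predicted one at inference can be sketched as below. The helper names and the fixed example weights are hypothetical; in the paper $\lambda_1$ and $\lambda_2$ are adjusted dynamically, and that schedule is not reproduced here.

```python
import math

def cross_entropy(one_hot, probs):
    """Eq. (18). With a one-hot target this reduces to -log of the GT class probability."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

def total_loss(l_cl, l_ce, lam1, lam2):
    """Eq. (19): weighted sum of the contrastive and classification losses."""
    return lam1 * l_cl + lam2 * l_ce

def predicted_category(probs):
    """Text-side category c^T used at inference: the most probable class index."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Toy example with C = 3 classes and the GT category at index 1.
probs = [0.25, 0.5, 0.25]
one_hot = [0, 1, 0]
l_ce = cross_entropy(one_hot, probs)
loss = total_loss(l_cl=0.4, l_ce=l_ce, lam1=1.0, lam2=0.5)  # fixed weights: an assumption
c_text = predicted_category(probs)
```

During training the category score would use the GT index directly, while at inference `predicted_category` supplies the category from the text classifier output.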
5 Experiments
5.1 Datasets and Metric
Table 3: Ablation of the formula of query quality estimation.
Formula | val | testA | testB
$S_{\mathrm{global}}$ | 65.89 | 65.94 | 65.47
$S_{\mathrm{global}} + S_{\mathrm{class}}$ | 67.55 (↑1.66) | 69.63 (↑3.69) | 64.66 (↓0.81)
$S_{\mathrm{global}} + S_{\mathrm{fine}}$ | 67.21 (↑1.32) | 67.93 (↑1.99) | 66.21 (↑0.74)
$S_{\mathrm{fine}} + S_{\mathrm{global}} + S_{\mathrm{class}}$ | 67.36 (↑1.47) | 68.66 (↑2.72) | 66.48 (↑1.01)
$S_{\mathrm{global}} + S_{\mathrm{class}} + S_{\mathrm{fine}}$ | 69.03 (↑3.14) | 70.27 (↑4.33) | 66.59 (↑1.12)
Table 4: Ablation study of the injected category information. “Train” and “Infer” refer to the training and inference stages. $c^*$: GT category. $c^T$: text classifier’s predicted category.
$c^*$ (Train) | $c^T$ (Train) | $c^T$ (Infer) | val | testA | testB
- | - | - | 67.21 | 67.93 | 66.21
✓ | - | - | 68.74 (↑1.53) | 69.84 (↑1.91) | 66.23 (↑0.02)
- | - | ✓ | 67.61 (↑0.40) | 68.18 (↑0.25) | 66.40 (↑0.19)
- | ✓ | ✓ | 64.64 (↓2.57) | 64.90 (↓3.03) | 63.53 (↓2.68)
✓ | - | ✓ | 69.03 (↑1.82) | 70.27 (↑2.34) | 66.59 (↑0.38)
We evaluate the proposed method on three benchmarks: RefCOCO [27], RefCOCO+ [27], and RefCOCOg [24]. All of them are based on MSCOCO [14], with (image, expression) counts of (19,994, 142,210), (19,992, 141,564), and (26,711, 104,560), respectively. In these three datasets, each expression is associated with one class label, which is used as the GT category in the coarse-grained alignment. Regarding the text expression, RefCOCO describes objects with absolute spatial information, while the other two datasets are more challenging. RefCOCO+ focuses more on relative spatial information and appearance (such as color and texture), and RefCOCOg provides longer expressions that are more complex and carry richer semantics. For the REC task, we follow [3, 9] and use accuracy at an IoU threshold of 0.5 as the metric: we count a prediction as correct if the IoU between the predicted and GT bounding boxes exceeds 0.5. For the RES task, we adopt mIoU [15, 30] as the metric, which calculates the average IoU across all test samples. More details are in the supplementary material.
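The two metrics described in Section 5.1 can be written down directly. The sketch below is illustrative only: boxes are assumed to be `(x1, y1, x2, y2)` tuples, masks are assumed to be sets of `(row, col)` pixel coordinates, and the function names are hypothetical rather than taken from the released code.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """REC metric: fraction of predictions whose IoU with the GT box exceeds 0.5."""
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes) if box_iou(p, g) > threshold)
    return hits / len(gt_boxes)

def mask_iou(pred, gt):
    """IoU of two masks represented as sets of pixel coordinates."""
    union = len(pred | gt)
    return len(pred & gt) / union if union > 0 else 0.0

def mean_iou(pred_masks, gt_masks):
    """RES metric: average mask IoU across all test samples."""
    ious = [mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]
    return sum(ious) / len(ious)

# Toy example: one exact match and one partial overlap per task.
acc = rec_accuracy([(0, 0, 2, 2), (0, 0, 2, 2)], [(0, 0, 2, 2), (1, 0, 3, 2)])
miou = mean_iou([{(0, 0), (0, 1)}, {(0, 0)}], [{(0, 1), (1, 1)}, {(0, 0)}])
```

Note the difference between the two: the REC metric thresholds each sample into correct or incorrect, whereas mIoU averages the raw overlap, so partial segmentation quality still contributes.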
5.2 Implementation Details
Following [3], we employ the pretrained Mask2Former detector [4] and freeze its parameters when training our AlignCAT. The image resolution is set to 416 × 416. The text lengths of RefCOCO, RefCOCO+, and RefCOCOg are 15, 15, and 20, respectively. All experiments are conducted on two 24G Nvidia RTX 4090 GPUs. The batch size per GPU is 14. The query feature dimension is 256, and the dimensions of word-level features, text features, and the shared semantic space are all 512. During query selection, we set $O = 20$ for confidence-based filtering and $K = 10$ for the maximum selected queries after coarse-grained alignment. We set $\alpha = 100$ to emphasize the category information for calculating coarse-grained scores. We use the Adam optimizer [6] with a learning rate of $1e{-4}$ and set training epochs to 25.
5.3 Quantitative Analysis
In this section, we first validate AlignCAT by comparing it with comprehensive weakly supervised VG methods, and then ablate key components of our approach.
Comparison to the state-of-the-arts. In Tables 1 and 2, we compare AlignCAT with a set of weakly supervised VG methods. The first observation is that AlignCAT significantly outperforms existing methods on all three benchmarks. Our method improves the average accuracy by +2.53% and +2.80% over QueryMatch on RefCOCO for RES and REC, respectively. The improvement on RefCOCOg is particularly notable, with AlignCAT increasing the accuracy of QueryMatch by 6.00% and 6.25% for RES and REC, respectively. We also notice that AlignCAT excels on testA of all datasets, where most categories of referred objects are “person”. With the help of category matching, AlignCAT effectively filters out visual queries not belonging to humans before the more fine-grained alignment. This validates the effectiveness of our innovative category-then-attribute mechanism in enhancing cross-modal alignment, with the capacity to tackle multi-object images and complex text expressions.
Table 5: Ablation of the module to inject GT category.
Confidence-based Selection | Coarse-grained Alignment | val | testA | testB
- | - | 67.61 | 68.18 | 66.40
✓ | - | 67.24 | 71.50 | 62.06
- | ✓ | 69.03 | 70.27 | 66.59
Figure 3: Visualization of adaptive phrase attention.
Ablation of AlignCAT. To validate the designs of AlignCAT, we have conducted various ablation studies on the RefCOCO dataset for weakly supervised REC. We first compare different settings of query selection. When ablating the design of global similarity, the corresponding contrastive learning objective is also removed. The same applies to the fine-grained alignment with word-level similarity calculation. These results are reported in Table 3. The baseline selects one positive visual query with the highest $S_{\mathrm{global}}$. With category matching, the combination $S_{\mathrm{class}} + S_{\mathrm{global}}$ improves VG performance on two subsets, albeit with a slight decrease on testB. This suggests that category information benefits human-target localization, but struggles with non-human objects. Solely using the fine-grained alignment, $S_{\mathrm{global}} + S_{\mathrm{fine}}$ achieves 67.93% on RefCOCO testA, yet is worse than the former setting with 69.63%. This comparison highlights the importance of category-based filtering. We also experimented with the attribute-then-category order. The result of $S_{\mathrm{fine}} + S_{\mathrm{global}} + S_{\mathrm{class}}$ shows a performance decline compared to $S_{\mathrm{global}} + S_{\mathrm{class}}$ on val and testA (67.36% vs. 67.55% and 68.66% vs. 69.63%). We suspect that without category-based filtering, the text features of contextual objects create noise and affect the cross-modal alignment. Conversely, $S_{\mathrm{class}} + S_{\mathrm{global}} + S_{\mathrm{fine}}$ with a category-then-attribute order achieves the best performance, demonstrating the effectiveness of the coarse-to-fine visual-linguistic matching scheme, as well as the complementary effect of the three selection modules.
[Figure 4 examples: Exp-1: Guy in back with arm on chair; Exp-2: Tall bottle with yellow label; Exp-3: Person holding umbrella; Exp-4: Girl on left with white jacket.]
Figure 4: Visualization comparison of different selection designs of AlignCAT in weakly supervised REC. The red and green boxes are GT and predicted grounding results, respectively.
[Figure 5 examples: Exp-1: Man holding little dog; Exp-2: Center case on floor with squares; Exp-3: Woman helping with bull on right; Exp-4: Couch right side. Columns: GT, TRIS, QueryMatch, Ours.]
Figure 5: Visualization comparison to TRIS and QueryMatch for the weakly supervised RES task. The GT and predicted segmentation results are marked in red.
Next, we examine different strategies of using category information during training and inference. As shown in Table 4, the first row is the baseline without category information. The second row is the model trained with the GT category 𝑐∗ while removing the category matching score during inference. This setting improves the model performance, which highlights the importance of category information in enhancing cross-modal correspondences. The third row is the result of training the text classifier but only injecting the predicted category 𝑐𝑇 during inference, which yields a slight improvement. Notably, in the second-to-last row, directly injecting the predicted category 𝑐𝑇 during training leads to a significant performance decline. We suspect that the text classifier’s predicted categories are largely inaccurate at the beginning of training, resulting in unreliable visual-linguistic alignment. The last row is our full model, which uses the GT category during training and injects the predicted category during inference. This design enhances robustness and achieves the best performance across all subsets.
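The train/inference asymmetry of the full model can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `category_for_matching` and `category_match_scores`, the `phase` flag, the string category labels, and the binary form of the matching score are all hypothetical.

```python
from typing import List, Optional


def category_for_matching(phase: str,
                          gt_category: Optional[str],
                          predicted_category: str) -> str:
    """Choose the category label that feeds the category matching score.

    Assumed policy (mirrors the last row of the ablation): GT category
    during training, text-classifier prediction during inference.
    """
    if phase == "train":
        if gt_category is None:
            raise ValueError("training requires a GT category")
        return gt_category
    if phase == "inference":
        return predicted_category
    raise ValueError(f"unknown phase: {phase!r}")


def category_match_scores(query_categories: List[str],
                          target_category: str) -> List[float]:
    """Hypothetical binary score: 1.0 if a query's category matches, else 0.0."""
    return [1.0 if c == target_category else 0.0 for c in query_categories]


# Early in training the classifier may still be wrong ("dog"), but the GT
# label ("person") is the one used for matching, so the signal stays reliable.
queries = ["person", "dog", "person", "chair"]
train_cat = category_for_matching("train", "person", "dog")
train_scores = category_match_scores(queries, train_cat)

# At inference no GT label exists, so the predicted category is injected.
infer_cat = category_for_matching("inference", None, "person")
infer_scores = category_match_scores(queries, infer_cat)
```

Keeping the label choice in one small function makes it easy to reproduce the other ablation rows, for example by returning the predicted category in both phases.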
We further investigate alternative strategies of injecting GT category information. As shown in Table 5, we compare the confidence-based selection and the global feature alignment as places to incorporate the category score. In the former, Q𝑂 is filtered by confidence and category matching, while Q̃ relies on the global similarity 𝑆global. Although this improves testA performance, it leads to a significant performance degradation on testB. This issue arises from suboptimal selection of negative queries, which are sampled from Q𝑂. Since a large proportion of referred objects in the training set belong to the “person” category, integrating category matching during confidence-based filtering results in most negative queries sharing the same category as the positive query. This reduces the diversity of negative samples and affects generalization. To address this, we inject the category information at the coarse-grained alignment module. This setting enhances negative sample quality and improves the model’s robustness across comprehensive scenarios.
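The loss of negative diversity can be seen in a toy sketch. It assumes, hypothetically, that each visual query carries a confidence and a category label, and that negatives are every filtered query other than the positive one. The `Query` type, the threshold value, and the example numbers are illustrative only and are not taken from the paper.

```python
from typing import List, NamedTuple, Optional, Set


class Query(NamedTuple):
    name: str
    confidence: float
    category: str


def filter_queries(queries: List[Query], conf_threshold: float,
                   target_category: Optional[str] = None) -> List[Query]:
    """Keep confident queries; optionally also require a category match."""
    kept = [q for q in queries if q.confidence >= conf_threshold]
    if target_category is not None:
        kept = [q for q in kept if q.category == target_category]
    return kept


def negative_categories(pool: List[Query], positive_name: str) -> Set[str]:
    """Categories covered by the negatives (everything but the positive)."""
    return {q.category for q in pool if q.name != positive_name}


queries = [
    Query("q0", 0.95, "person"),  # treated as the positive query
    Query("q1", 0.90, "person"),
    Query("q2", 0.85, "chair"),
    Query("q3", 0.80, "dog"),
    Query("q4", 0.10, "person"),  # dropped by the confidence threshold
]

# Category matching inside the confidence filter: negatives are all "person".
with_category = filter_queries(queries, 0.5, target_category="person")
narrow = negative_categories(with_category, "q0")

# Confidence-only filter: negatives span several categories.
confidence_only = filter_queries(queries, 0.5)
diverse = negative_categories(confidence_only, "q0")
```

In this toy pool, moving the category check out of the filter leaves the negative set covering three categories instead of one, which is the diversity effect described above.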
5.4 Qualitative Analysis
In Figure 3, we visualize the weights of the modulated word-level text features after adaptive phrase attention. These values illustrate how the model dynamically adjusts the importance of each word. For example, given the text “girl pink”, the model highlights the color attribute “pink” more than the category word “girl”. Interestingly, the contextual object “leaves” is allocated a higher value than the referred object “vegetable”. This observation explains the inferior performance reported in Table 3 for the setting with the exchanged order. Overall, AlignCAT leverages descriptive information to mitigate intra-class ambiguity, thereby distinguishing objects belonging to the same category.
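As a rough illustration of per-word weighting, the sketch below normalizes word relevance scores with a softmax. This is an assumption made for illustration: the text here does not state how adaptive phrase attention computes its weights, and the function name `word_weights` and the input scores are hypothetical toy values, not numbers from Figure 3.

```python
import math
from typing import List, Tuple


def word_weights(words: List[str],
                 scores: List[float]) -> List[Tuple[str, float]]:
    """Softmax-normalize per-word scores into weights that sum to 1."""
    if len(words) != len(scores) or not words:
        raise ValueError("words and scores must be non-empty and aligned")
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [(w, e / total) for w, e in zip(words, exps)]


# Toy scores in which the attribute word scores higher than the category word.
weights = word_weights(["girl", "pink"], [0.4, 1.2])
lookup = dict(weights)
```

With these toy scores the attribute word “pink” receives the larger share of the weight, mirroring the qualitative behaviour described above.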
Figure 6: Visualization comparison in weakly supervised REC. Green: ground truth. Blue: QueryMatch. Red: Ours. Expressions shown: Exp-1: “M plaid”; Exp-2: “Seated man”; Exp-3: “Kid down”; Exp-4: “Wet hair”; Exp-5: “Pear on right”; Exp-6: “Food in focus”; Exp-7: “160”; Exp-8: “2”; Exp-9: “Woman in top picture on the left”; Exp-10: “Left in white person”; Exp-11: “Pastry next to right hand coffee mug”; Exp-12: “Glass under that gerbil thingy”; Exp-13: “Woman standing closest to giraffe”; Exp-14: “Person in background”; Exp-15: “Girl brushing teeth”; Exp-16: “Banana touching top right corner of the sign”.
Figure 7: Failure cases of AlignCAT in weakly supervised REC. Green: ground truth. Red: Ours. Expressions shown: Exp-1: “Upper keyboard”; Exp-2: “White”; Exp-3: “Broc at 5 o'clock”; Exp-4: “1019”; Exp-5: “Bottle just one to the right of the middle one”; Exp-6: “TV in center stacked on another TV”; Exp-7: “Man very far left kicking”; Exp-8: “The arm not holding the sandwich”.
To gain in-depth insights into the category-then-attribute alignment mechanism, we ablate AlignCAT with four configurations and visualize the results in Figure 4. Utilizing solely the global feature similarity 𝑆global struggles to achieve category and attribute consistencies, especially when images involve multiple objects. With the category matching score, the results exhibit category consistency. For instance, 𝑆global + 𝑆class excludes the contextual object “label” in Exp-2; however, it fails to resolve intra-class ambiguities and selects another bottle. On the other hand, without 𝑆class, the results of 𝑆global + 𝑆fine still suffer from category inconsistency in Exp-3, which highlights the importance of category matching. This finding is consistent with the quantitative comparison in Table 3. In contrast, the full configuration 𝑆global + 𝑆class + 𝑆fine achieves category and attribute consistencies, whether the target is human or non-human, and whether the attribute is a color (e.g., “white”) or a spatial relation (e.g., “in back”). Note that AlignCAT may still fail to accurately locate the target object due to occlusion (e.g., the arm behind the head in Exp-1).
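The four ablation configurations can be mimicked with a toy coarse-to-fine selector. The sketch assumes, hypothetically, that the coarse stage ranks queries by the sum of the global and class scores and keeps the top `top_k`, and that the fine stage then picks the survivor with the highest fine score. The `Query` type, `select_query`, `top_k`, and all score values are illustrative and are not taken from the paper.

```python
from typing import List, NamedTuple


class Query(NamedTuple):
    name: str
    s_global: float
    s_class: float
    s_fine: float


def select_query(queries: List[Query], use_class: bool, use_fine: bool,
                 top_k: int = 2) -> str:
    """Coarse stage keeps top_k queries; optional fine stage re-ranks them."""
    def coarse(q: Query) -> float:
        return q.s_global + (q.s_class if use_class else 0.0)

    survivors = sorted(queries, key=coarse, reverse=True)[:top_k]
    if use_fine:
        return max(survivors, key=lambda q: q.s_fine).name
    return survivors[0].name


# Toy scene loosely modelled on "tall bottle with yellow label".
queries = [
    Query("label", s_global=0.9, s_class=0.0, s_fine=0.7),
    Query("short_bottle", s_global=0.7, s_class=1.0, s_fine=0.2),
    Query("tall_bottle", s_global=0.6, s_class=1.0, s_fine=0.9),
]

global_only = select_query(queries, use_class=False, use_fine=False)
global_class = select_query(queries, use_class=True, use_fine=False)
global_fine = select_query(queries, use_class=False, use_fine=True)
full = select_query(queries, use_class=True, use_fine=True)
```

In this toy setup, the global score alone picks the contextual object, adding the class score picks the wrong bottle, skipping the class score lets the contextual object survive into the fine stage, and only the full configuration picks the intended bottle.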
In Figure 5, we compare AlignCAT with two state-of-the-art weakly supervised RES models, TRIS [16] and QueryMatch [3]. Given texts with complex relationships and crowded images with multiple objects, they incorrectly locate contextual objects such as “dog” and “bull”. Conversely, the proposed method achieves higher segmentation accuracy. With stronger reasoning ability and better visual understanding, AlignCAT is more robust and reliable across a wide range of grounding scenarios.
Figure 6 compares the performance of AlignCAT and QueryMatch in the weakly supervised REC task. The analysis shows that AlignCAT has a clear advantage in maintaining category and attribute consistency. For example, in Exp-7, AlignCAT successfully aligns the abstract and vague query “160” with the “person” category, accurately localizing the target. In contrast, QueryMatch fails to understand this abstract query, leading to a localization failure. In complex multi-object scenarios, AlignCAT continues to effectively select the correct queries. For instance, in Exp-16, AlignCAT not only identifies the logical subject “banana”, but also captures its fine-grained spatial relationships, achieving precise localization. In summary, AlignCAT, with its robust coarse-to-fine semantic alignment, outperforms QueryMatch in complex scenarios.
To provide a more comprehensive analysis, we present typical failure cases of AlignCAT for the REC task in Figure 7. Some of these failures are due to dataset quality issues, including wrong ground truth annotations (Exp-1) and insufficient textual descriptions (Exp-2). Meanwhile, our method still lacks the semantic understanding needed to comprehend out-of-distribution textual expressions (Exp-3) or to discern nuanced visual features of similar objects (Exp-4).
6 Conclusion
In this study, we identify that existing weakly supervised VG methods suffer from contextual ambiguities, showing category and attribute inconsistencies. To address these challenges, we propose a novel query-based VG framework, AlignCAT, with a category-then-attribute visual-linguistic alignment strategy to progressively filter out query candidates. To ensure category consistency, we design a coarse-grained alignment module that leverages category information and global context. For attribute consistency, we further propose a fine-grained alignment module to capture word-level linguistic features and emphasize attribute-based query-text alignment, effectively resolving intra-class ambiguities. Extensive experiments demonstrate that AlignCAT achieves state-of-the-art performance on three benchmarks for both REC and RES tasks. The proposed category-then-attribute alignment enhances category and attribute consistencies in comprehensive scenes. This work provides novel insights into leveraging linguistic cues for advancing weakly supervised visual grounding.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (No. 62272227). It was also partly supported by the MUR PNRR project FAIR (PE00000013) funded by the NextGenerationEU, the EU Horizon projects ELIAS (No. 101120237) and ELLIOT (No. 101214398).
References
[1] Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, et al. 2021. Detector-free weakly supervised grounding by separation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1801–1812.
[2] Kan Chen, Jiyang Gao, and Ram Nevatia. 2018. Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4042–4050.
[3] Shengxin Chen, Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, and Rongrong Ji. 2024. QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding. In Proceedings of the 32nd ACM International Conference on Multimedia. 4177–4186.
[4] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1290–1299.
[5] Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, and Wankou Yang. 2024. SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion. arXiv preprint arXiv:2409.17531 (2024).
[6] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[7] Francisco Eiras, Kemal Oksuz, Adel Bibi, Philip HS Torr, and Puneet K Dokania. 2024. Segment, select, correct: A framework for weakly-supervised referring segmentation. In European Conference on Computer Vision. Springer, 326–342.
[8] Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, and Gao Huang. 2022. Pseudo-q: Generating pseudo language queries for visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15513–15523.
[9] Lei Jin, Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Annan Shu, and Rongrong Ji. 2023. Refclip: A universal teacher for weakly supervised referring expression comprehension. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2681–2690.
[10] Dongwon Kim, Namyup Kim, Cuiling Lan, and Suha Kwak. 2023. Shatter and gather: Learning referring image segmentation with text supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15547–15557.
[11] Jungbeom Lee, Sungjin Lee, Jinseok Nam, Seunghak Yu, Jaeyoung Do, and Tara Taghavi. 2023. Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 21870–21881.
[12] Jungbeom Lee, Sungjin Lee, Jinseok Nam, Seunghak Yu, Jaeyoung Do, and Tara Taghavi. 2023. Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 21870–21881.
[13] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34 (2021), 9694–9705.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
[15] Yuqi Lin, Minghao Chen, Wenxiao Wang, Boxi Wu, Ke Li, Binbin Lin, Haifeng Liu, and Xiaofei He. 2023. Clip is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15305–15314.
[16] Fang Liu, Yuhao Liu, Yuqiu Kong, Ke Xu, Lihe Zhang, Baocai Yin, Gerhard Hancke, and Rynson Lau. 2023. Referring image segmentation using text supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22124–22134.
[17] Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Zechao Li, Qi Tian, and Qingming Huang. 2022. Entity-enhanced adaptive reconstruction network for weakly supervised referring expression grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3003–3018.
[18] Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Dechao Meng, and Qingming Huang. 2019. Adaptive reconstruction network for weakly supervised referring expression grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2611–2620.
[19] Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2019. Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding. In Proceedings of the 27th ACM International Conference on Multimedia. 539–547.
[20] Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2019. Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding. In Proceedings of the 27th ACM International Conference on Multimedia. 539–547.
[21] Yang Liu, Jiahua Zhang, Qingchao Chen, and Yuxin Peng. 2023. Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2828–2838.
[22] Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. 2020. Multi-task collaborative network for joint referring expression comprehension and segmentation. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. 10034–10043.
[23] Yaxin Luo, Jiayi Ji, Xiaofu Chen, Yuxin Zhang, Tianhe Ren, and Gen Luo. 2025. APL: Anchor-Based Prompt Learning for One-Stage Weakly Supervised Referring Expression Comprehension. In European Conference on Computer Vision. Springer, 198–215.
[24] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 11–20.
[25] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition. 3195–3204.
[26] Jinpeng Mi, Song Tang, Zhiyuan Ma, Dan Liu, Qingdu Li, and Jianwei Zhang. 2023. Weakly supervised referring expression grounding via target-guided knowledge distillation. In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 8299–8305.
[27] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. 2016. Modeling context between objects for referring expression understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 792–807.
[28] Yulei Niu, Hanwang Zhang, Zhiwu Lu, and Shih-Fu Chang. 2019. Variational context: Exploiting visual and textual context for grounding referring expressions. IEEE transactions on pattern analysis and machine intelligence 43, 1 (2019), 347–359.
[29] Jie Qin, Jie Wu, Xuefeng Xiao, Lujun Li, and Xingang Wang. 2022. Activation modulation and recalibration scheme for weakly supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 2117–2125.
[30] Tal Shaharabany, Yoad Tewel, and Lior Wolf. 2022. What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs. Advances in Neural Information Processing Systems 35 (2022), 28222–28237.
[31] Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. 2022. From show to tell: A survey on deep learning-based image captioning. IEEE transactions on pattern analysis and machine intelligence 45, 1 (2022), 539–559.
[32] Robin Strudel, Ivan Laptev, and Cordelia Schmid. 2022. Weakly-supervised segmentation of referring expressions. arXiv preprint arXiv:2205.04725 (2022).
[33] Mingjie Sun, Jimin Xiao, Eng Gee Lim, Si Liu, and John Y Goulermas. 2021. Discriminative triad matching and reconstruction for weakly referring expression grounding. IEEE transactions on pattern analysis and machine intelligence 43, 11 (2021), 4189–4195.
[34] Mingjie Sun, Jimin Xiao, Eng Gee Lim, and Yao Zhao. 2021. Cycle-free weakly referring expression grounding with self-paced learning. IEEE Transactions on Multimedia 25 (2021), 1611–1621.
[35] Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, and Zechao Li. 2023. Context disentangling and prototype inheriting for robust visual grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[36] Josiah Wang and Lucia Specia. 2019. Phrase localization without paired training examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4663–4672.
[37] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. 2022. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18134–18144.
[38] Ziyan Yang, Kushal Kafle, Franck Dernoncourt, and Vicente Ordonez. 2023. Improving visual grounding by encouraging consistent gradient-based explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19165–19174.
[39] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6281–6290.