AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding

Author: Wang, Yidan; Zhuang, Chenyi; Liu, Wutao; Gao, Pan; Sebe, Niculae

Publisher: Zenodo

DOI: 10.1145/3746027.3755751

Source: https://zenodo.org/records/17689270/files/3746027.3755751.pdf

AlignCAT: Visual-Linguistic Alignment of Category and Attribute
for Weakly Supervised Visual Grounding
Yidan Wang∗
Nanjing University of Aeronautics and Astronautics
Nanjing, China
[email protected]
Chenyi Zhuang∗
University of Trento
Trento, Italy
[email protected]
Wutao Liu
Nanjing University of Aeronautics and Astronautics
Nanjing, China
[email protected]
Pan Gao†
Nanjing University of Aeronautics and Astronautics
Nanjing, China
[email protected]
Nicu Sebe
University of Trento
Trento, Italy
[email protected]
Abstract
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
CCS Concepts
• Computing methodologies → Image segmentation; Scene understanding.
Keywords
Weakly Supervised Visual Grounding, Multimodality
∗Equal contribution.
†Corresponding author
MM ’25, Dublin, Ireland, October 27–31, 2025
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-2035-2/2025/10
https://doi.org/10.1145/3746027.3755751
ACM Reference Format:
Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, and Nicu Sebe. 2025. AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3746027.3755751
1 Introduction
Visual Grounding (VG) aims to identify objects in an image corresponding to a given text description and has gained attention for its potential in open-ended detection for various computer vision applications [25, 31, 39]. While fully supervised methods [5, 22, 38, 42, 43] achieve high accuracy, they rely on instance-level annotations, which are both labor-intensive and time-consuming to obtain. To alleviate this burden, recent studies have explored weakly supervised learning in two grounding tasks, namely Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES).
These works have reframed weakly supervised VG as a region-based [7, 33, 41], anchor-based [9, 23], or query-based [3] matching problem. However, understanding complex textual descriptions and associating referents in multi-object images remains challenging. In Figure 1, we identify two types of text annotation in existing VG benchmarks: (1) category-based annotations that distinguish the referred object in its fundamental class from objects in other categories. For example, in the sentence “girl with spoon”, two objects “girl” and “spoon” belong to different classes. The model should identify the logical object “girl” rather than the contextual object “spoon”, and align this linguistic category information with visual features. (2) attribute-based annotations that describe specific characteristics of the referred object, such as colors and spatial relations. For example, the sentence “guy on knees” has no conflict in the person category, but its descriptive information “on knees” poses a challenge to the model in identifying the referred object as the image contains multiple persons. This requires an understanding of nuanced textual and visual semantics. However, the state-of-the-art query-based method [3] fails to produce reliable grounding results on both annotation types. While this method utilizes contrastive learning to amplify the alignment of target texts and positive queries, it is not conducive to discriminating nuanced semantics in object categories and attributes. For the category-based annotation, the contextual object “spoon” is used to enrich the information of the target object “girl”, yet it has created activation noise and hindered the accurate visual-linguistic alignment, showing category inconsistency. For the attribute-based annotation, it mismatches the visual features of the incorrect guy to the target action “on knee”, showing attribute inconsistency.
[Figure 1 image. Panels: (a) Framework and visualization of QueryMatch; (b) Framework and visualization of our AlignCAT. Both are shown on the category-based annotation “girl with spoon” and the attribute-based annotation “guy on knees”.]
Figure 1: Comparison of QueryMatch and the proposed AlignCAT. (a) QueryMatch fails to deal with category-based and attribute-based ambiguity in annotations. (b) AlignCAT progressively leverages linguistic cues from coarse (right-top) to fine (right-bottom) to filter visual queries, achieving category and attribute consistency.
To address the aforementioned visual-linguistic inconsistencies, we propose AlignCAT (Align Category then ATtribute), a novel query-based VG framework. To ensure category consistency, we first design a coarse-grained alignment module that leverages category information and the global context from the input textual expression. This coarse alignment mitigates interference from irrelevant objects, effectively narrowing down the search space for ideal visual queries. To achieve attribute consistency, we further propose a fine-grained alignment module that employs adaptive phrase attention to capture word-level descriptive linguistic features. Such a fine alignment enhances cross-modal correspondences and resolves intra-class ambiguities when multiple visual objects belong to the same category. By aligning visual queries first at a coarse level and then at a fine level, AlignCAT highlights the key role of linguistic cues in understanding cross-modal representations. This category-then-attribute progressive alignment within a contrastive learning framework significantly enhances VG performance. In summary, the main contributions of this work are three-fold:
• We identify category inconsistency and attribute inconsistency in existing weakly supervised VG methods. To address these challenges, we propose a novel query-based category-then-attribute matching framework, modeling linguistic representations from general to detailed levels.
• To achieve visual-linguistic alignment, we design a coarse-grained module that leverages category information and global context to filter out category-inconsistent visual queries, and a fine-grained module that employs adaptive phrase attention to ensure attribute consistency.
• Evaluated on three benchmarks for REC and RES, our proposed method achieves state-of-the-art performance, demonstrating the potential of linguistic cues and the efficacy of the category-then-attribute matching strategy in enhancing visual-linguistic alignment.
2 Related Works
Referring Expression Segmentation (RES). Weakly supervised RES aims to overcome the efficiency limitation of fully supervised learning schemes. It does not require intensive pixel-level annotation, which makes training less expensive and more efficient. Several works [10, 32] achieve region-text matching through multi-instance learning, but are far inferior to fully supervised methods. Instead of aggregating visual entities, TRIS [16] extracts rough object locations as pseudo-labels based on the input text to perform object localization. Lee et al. [11] rely on the linguistic relationship, predicting significant maps for each word. However, the masks generated by these methods are highly noisy, resulting in less accurate segmentation.
Referring Expression Comprehension (REC). Compared to fully supervised REC, weakly supervised REC is more challenging due to the lack of bounding box annotations. To obtain additional supervision signals, existing REC methods [17, 36] incorporate external knowledge and align the region-based information with the corresponding phrases. Some works [2, 19] further utilize prior knowledge to filter out irrelevant region proposals. Recent advances also include leveraging language models to build negative samples [40], or pre-trained models to generate pseudo-labels [8, 21]. Yet, these two-stage methods lack generalization to real-world scenarios and large-scale tasks. To improve efficiency, anchor-based methods [9, 23] remove the region proposal stage towards a one-stage process. QueryMatch [3] further introduces a query-text matching scheme to improve the learning of object representations. Despite their advances in efficiency, we identify the challenges of category and attribute inconsistency in existing one-stage methods. To address this problem, our method leverages linguistic cues from general to specific, integrating category information and global context for coarse-grained alignment, and then exploiting word-level descriptions for fine-grained alignment. The category-then-attribute matching framework significantly improves VG results for both RES and REC tasks, especially in multi-object scenarios with complex text expressions.
[Figure 2 image. Vision branch: image input, image encoder, learnable queries, Transformer decoder, confidence-based selection, candidate set. Text branch: text input “orange tabby cat standing in a sink”, text encoder, text classifier with predicted category “cat”. Coarse-grained alignment: category matching and global similarity, producing the refined set. Fine-grained alignment: phrase attention and word-level similarity, producing the selected query. The grounding head outputs the REC and RES results; contrastive learning (positive/negative) and textual classification losses are used in training.]
Figure 2: AlignCAT framework overview. AlignCAT filters visual queries by hierarchically leveraging linguistic cues. The coarse-grained alignment module utilizes category and global information to discard category-inconsistent candidates. The fine-grained alignment module employs adaptive phrase attention to select the attribute-consistent visual query.
3 Preliminary
Following [3], we reformulate VG as a query-text matching problem by adopting a query-based detector Mask2Former [4]. It establishes one-to-one associations with objects in the image by $N$ learnable vectors, namely queries, denoted as $\mathcal{Q} = \{q_1, ..., q_N\}$. QueryMatch filters out noisy and low-quality query features based on their confidence scores, resulting in a candidate set $\mathcal{Q}_O$, where $O$ is a pre-defined hyperparameter. This method defines two metrics as difficulty and uniqueness to quantitatively estimate the quality of negative samples. Specifically, difficulty measures vision-language alignment, while uniqueness requires that high-quality negative queries significantly differ from other queries in the embedding space. Given a set of candidate queries, QueryMatch iteratively estimates the quality of the $i$-th query as:
$$S_{d_i} = \mathrm{Norm}(\mathrm{sim}(f_{q_i}, f_t)), \tag{1}$$
$$S_{u_i} = \mathrm{Norm}\Big(-\max_{j=1}^{M} \cos(f_{q_i}, f_{q_j})\Big), \tag{2}$$
where $f_{q_i}$ is the feature of the current negative query, $f_{q_j}$ is the feature of a previously selected negative query, and $f_t$ is the text feature. $S_{d_i}$ is the difficulty of the $i$-th query, measured by the dot product similarity between visual and linguistic features, denoted $\mathrm{sim}(f_{q_i}, f_t)$. $S_{u_i}$ is the uniqueness of the $i$-th query, measured by the cosine similarity between two visual queries, denoted $\cos(f_{q_i}, f_{q_j})$. $\mathrm{Norm}(\cdot)$ is the min-max normalization. The overall quality score of the negative query is defined as:
$$S_{q_i} = S_{d_i} \cdot S_{u_i}. \tag{3}$$
Ranked in descending order, an appropriate number of negative samples is selected to perform contrastive learning.
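The negative-query scoring of Eqs. (1)-(3) can be sketched as a greedy loop. This is a minimal illustration and not the QueryMatch implementation: the function names, the handling of the first step (no previously selected negative, so uniqueness is set to 1), and the treatment of a degenerate min-max range are all assumptions made here for the example.

```python
import math


def dot(a, b):
    """Dot product similarity, sim(., .) in Eq. (1)."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity, cos(., .) in Eq. (2)."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def min_max(values):
    """Min-max normalization, Norm(.). Assumption: a constant list maps to all ones."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_negatives(query_feats, text_feat, num_neg):
    """Greedily pick negatives by S_q = S_d * S_u (hypothetical helper)."""
    remaining = list(range(len(query_feats)))
    selected = []
    while remaining and len(selected) < num_neg:
        # Difficulty: normalized similarity to the text feature.
        s_d = min_max([dot(query_feats[i], text_feat) for i in remaining])
        if selected:
            # Uniqueness: negated max cosine to already selected negatives.
            s_u = min_max([
                -max(cosine(query_feats[i], query_feats[j]) for j in selected)
                for i in remaining
            ])
        else:
            s_u = [1.0] * len(remaining)  # assumption for the first pick
        scores = [d * u for d, u in zip(s_d, s_u)]
        best = max(range(len(remaining)), key=lambda k: scores[k])
        selected.append(remaining.pop(best))
    return selected
```

With four toy queries, the second pick skips the near-duplicate of the first negative because its uniqueness score is lowest, which is the behavior the two metrics are meant to produce.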
The effectiveness of QueryMatch relies on precise query-text matching. This method introduces an effective negative query selection scheme, but it selects the positive query simply by computing the similarity to the global text feature $f_t$. However, textual descriptions are expressive, and understanding them requires strong reasoning abilities. As presented in Figure 1, QueryMatch fails to identify visual queries from the candidate set $\mathcal{Q}_O$ that achieve visual-linguistic consistency at the category and attribute levels. In this study, we leverage linguistic cues to their fullest extent, emphasizing both coarse and fine-grained information. Based on the category-then-attribute alignment strategy, our proposed AlignCAT can select high-quality positive visual queries and enhance query-text matching accuracy.
4 Methodology
4.1 Overview of AlignCAT
Given the input image $I$ and the input expression $T$, we aim to locate the referred object through a bounding box (for REC) or a mask (for RES). To address the challenges of category and attribute inconsistencies, we introduce AlignCAT, a novel query-based weakly supervised VG framework. As illustrated in Figure 2, the main goal of AlignCAT is to select high-quality positive queries through a category-then-attribute matching mechanism for efficient contrastive learning. We follow [3] and adopt a query-based detector [4] to process the input image $I$. The encoded image features are then fed into the Transformer decoder to interact with $N$ learnable queries, outputting the query features $\{f_{q_1}, ..., f_{q_N}\} \in \mathbb{R}^{d_v}$ for visual queries in $\mathcal{Q}$, where $d_v$ is the visual dimension, and one-to-one classifications $\{c^V_1, ..., c^V_N\}$ predicted by the visual classifier.
Unlike existing query-based VG frameworks [3], we leverage linguistic cues in the input expression $T$ to achieve visual-linguistic alignment. The text encoder transforms $T$ into the global feature $f_t \in \mathbb{R}^{d_t}$ and the word-level features $F_w \in \mathbb{R}^{l \times d_t}$, where $l$ is the length of the input text and $d_t$ is the dimension of the linguistic feature. Our method progressively filters out visual queries through three sequential selection modules: (1) a confidence-based filtering stage reduces the number of visual queries from $N$ to $O$, forming a candidate subset $\mathcal{Q}_O$; (2) a coarse-grained alignment module evaluates category consistency and global query-text similarity to further refine the candidates into a refined query set $\widetilde{Q}$; (3) a fine-grained alignment captures attribute details by recalibrating the word-level features $F_w$, ultimately selecting the most relevant query $q^*$. Finally, we can decode the selected visual query to obtain the bounding box or the mask of the referred object through a grounding head:
$$r^* = \mathrm{Head}(q^*). \tag{4}$$
4.2 Coarse-grained Alignment
To address category inconsistency, we design a coarse-grained alignment module to first filter out irrelevant visual queries. We discern that the category information is readily available in the input text. For example, it is apparent from the input text “orange tabby cat standing in a sink” that the category of the referred object is “cat”. We are motivated to predict the specific category and inject this information to ensure that our selected queries belong to the target category. This category-based query-text matching, along with a global query-text matching, effectively mitigates interference from irrelevant objects in the candidate set $\mathcal{Q}_O$. More specifically, at the category matching stage, we inject a Ground Truth (GT) category $c^*$, which is the class label annotation obtained from the dataset. For each query $q_i \in \mathcal{Q}_O$, the Transformer decoder predicts its corresponding category $c^V_i \in \{1, 2, ..., C\}$ through a classification head, where $C$ is a pre-defined number of total categories (e.g., $C = 80$ for MSCOCO [14]). The category score measures whether the predicted query category $c^V_i$ is consistent with the GT category $c^*$. If they are the same, the category score $S_{\mathrm{class},i}$ is set to 1; otherwise, it is set to 0. The above process can be formulated as:
$$S_{\mathrm{class},i} = \begin{cases} 1, & \text{if } c^V_i = c^*, \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$
The category-based query-text matching effectively filters out visual queries that belong to irrelevant categories. To fully exploit context in textual representation, we project the query feature $f_{q_i}$ and global text feature $f_t$ into a coarse-grained shared semantic space to learn global visual-linguistic alignment:
$$\bar{f}_{q_i} = f_{q_i} \cdot W_q + b_q, \tag{6}$$
$$\bar{f}_t = f_t \cdot W_t + b_t, \tag{7}$$
where $W_q, W_t$ are projection matrices, $b_q, b_t$ are biases to transform image and text features, respectively. After projection, we calculate the global query-text matching score:
$$S_{\mathrm{global},i} = \mathrm{sim}(\bar{f}_{q_i}, \bar{f}_t), \tag{8}$$
where $\mathrm{sim}(\cdot)$ is the dot product similarity to measure the alignment between each visual query and the global linguistic feature.
Overall, we define the coarse-grained alignment score as the weighted sum of the category score and the global score:
$$S_{\mathrm{coarse},i} = \alpha S_{\mathrm{class},i} + S_{\mathrm{global},i}, \tag{9}$$
where $\alpha$ is a hyperparameter to balance the value.
Designed to ensure category consistency, this coarse-grained alignment module filters out category-inconsistent visual queries. In other words, only the queries with $S_{\mathrm{class}} = 1$ are selected to construct the refined set $\widetilde{Q}$. We also define a threshold $K$ to curtail the query number based on the coarse-grained score $S_{\mathrm{coarse}}$. More details are in the supplemental material.
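The coarse-grained filtering of Eqs. (5), (8), and (9) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `coarse_filter` is hypothetical, the features are assumed to be already projected into the shared space (Eqs. 6-7), and $K$ is interpreted as the maximum number of queries kept.

```python
def coarse_filter(query_feats, query_classes, text_feat, target_class,
                  alpha=100.0, top_k=10):
    """Return indices of the refined query set, best coarse score first.

    query_feats   : projected query features (list of vectors)
    query_classes : category predicted for each query (c^V_i)
    text_feat     : projected global text feature
    target_class  : injected category (c^*)
    """
    scored = []
    for i, (feat, cls) in enumerate(zip(query_feats, query_classes)):
        s_class = 1.0 if cls == target_class else 0.0            # Eq. (5)
        s_global = sum(x * y for x, y in zip(feat, text_feat))   # Eq. (8)
        if s_class == 1.0:  # only category-consistent queries are kept
            scored.append((alpha * s_class + s_global, i))       # Eq. (9)
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_k]]  # assumption: K caps the set size
```

Because every surviving query has $S_{\mathrm{class}} = 1$, the term $\alpha S_{\mathrm{class}}$ shifts all kept scores equally, so the ranking inside the refined set is decided by the global similarity.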
4.3 Fine-grained Alignment
The above coarse-grained alignment module utilizes general linguistic cues to ensure category consistency and filter out category-inconsistent candidates. However, it is insufficient to discriminate nuanced semantics and achieve attribute consistency. We further introduce a fine-grained alignment that emphasizes descriptive information in word-level textual features, thereby capturing attribute-aware cross-modal correspondences.
Specifically, we adopt an adaptive phrase attention mechanism [35] to emphasize linguistic semantics within the word-level features $F_w$. Instead of focusing on global context or category-level information, this module highlights fine-grained descriptive details by assigning higher attention weights to attribute words and lower weights to category words. For instance, the phrase “standing in a sink” provides more discriminative linguistic cues than other words when distinguishing between two cats, and is therefore given greater attention. More precisely, the word-level features $F_w$ are first processed by a Bidirectional GRU (Bi-GRU) to recalibrate the importance of each word, which can be formulated as:
$$\widetilde{F}_w = [\overrightarrow{F_w}, \overleftarrow{F_w}] = E(F_w, \theta), \tag{10}$$
where $E$ and $\theta$ represent the Bi-GRU module and its parameters, respectively. $\widetilde{F}_w$ denotes the modulated word-level features that concatenate bidirectional outputs from the Bi-GRU network. To achieve a more adaptive aggregation, we dynamically balance the weights of the predicted features as:
$$\widetilde{F}_w := \widetilde{F}_w \cdot \mathrm{softmax}(\mathrm{FC}(\widetilde{F}_w)), \tag{11}$$
where $\mathrm{FC}(\cdot)$ is a fully connected layer to predict the weight assigned to each word.
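The reweighting step of Eq. (11) can be sketched in plain Python. The sketch assumes the Bi-GRU outputs are already available as a list of word vectors, models $\mathrm{FC}(\cdot)$ as a single linear layer producing one scalar per word, and applies the softmax across words; these choices and the name `reweight_words` are illustrative assumptions rather than details confirmed by the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_words(word_feats, fc_weight, fc_bias=0.0):
    """Scale each word vector by its attention weight, as in Eq. (11).

    word_feats : modulated word-level features, one vector per word
    fc_weight  : weights of the (assumed) scalar-output FC layer
    """
    # One scalar score per word from the linear layer.
    scores = [sum(x * w for x, w in zip(feat, fc_weight)) + fc_bias
              for feat in word_feats]
    weights = softmax(scores)  # normalized across the words of the expression
    reweighted = [[w * x for x in feat]
                  for feat, w in zip(word_feats, weights)]
    return reweighted, weights
```

Returning the weights alongside the features makes it easy to inspect which words receive the most attention for a given expression.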
To learn local visual-linguistic alignment, we aggregate these word-level features and project them into a fine-grained semantically shared space, where the query features are also projected. We formulate the above process as follows:
$$\tilde{f}_w = f_w \cdot W'_t + b'_t, \quad \text{where } f_w = \sum_{l} \widetilde{F}_w, \tag{12}$$
$$\tilde{f}_{q_i} = f_{q_i} \cdot W'_q + b'_q, \tag{13}$$
where $f_{q_i}$ is the $i$-th query in the refined query set $\widetilde{Q}$. The projection matrices are $W'_t$ and $W'_q$, and the bias terms are $b'_t$ and $b'_q$. These parameters are used to transform the text and image features, respectively.
To select the visual query that best matches the text expression at the attribute level, we define the fine-grained alignment score as the dot product similarity between each visual query and the fine-grained adapted text feature, which can be expressed as:
$$S_{\mathrm{fine},i} = \mathrm{sim}(\tilde{f}_{q_i}, \tilde{f}_w). \tag{14}$$
Since the adapted word feature $\tilde{f}_w$ encodes discriminative linguistic semantics, this fine-grained alignment module enables the model to differentiate candidate visual queries based on their local representations, even when they belong to the same category. Finally, we select the query with the highest fine-grained alignment score $S_{\mathrm{fine},i}$ as the optimal query:
$$q^* = \arg\max_{i} S_{\mathrm{fine},i}. \tag{15}$$
Table 1: Comparisons with state-of-the-art methods on three RES benchmark datasets. Best in red and second in blue.
Method           Venue        RefCOCO                 RefCOCO+                RefCOCOg
                              val    testA  testB     val    testA  testB     val-g
AMR [29]         AAAI’22      14.12  11.69  17.47     14.13  11.47  18.13     15.83
GroupViT [37]    CVPR’22      18.03  18.13  19.33     18.15  17.65  19.53     19.97
CLIP-ES [15]     CVPR’23      13.79  15.23  12.87     14.57  16.01  13.53     14.16
GbS [1]          ICCV’21      14.59  14.60  14.97     14.49  14.49  15.77     14.21
WWbL [30]        NeurIPS’22   18.26  17.37  19.90     19.85  18.70  21.64     21.84
TSEG [32]        arXiv’20     30.12  -      -         25.95  -      -         22.62
ALBEF [13]       NeurIPS’21   23.11  22.79  23.42     22.44  22.07  22.51     24.18
I-Chunk [12]     ICCV’23      31.06  32.30  30.11     31.28  32.11  30.13     32.88
TRIS [16]        ICCV’23      31.17  32.43  29.56     30.90  30.42  30.80     36.00
APL [23]         ECCV’24      55.92  54.84  55.64     34.92  34.87  35.61     40.13
QueryMatch [3]   MM’24        59.10  59.08  58.82     39.87  41.44  37.22     43.06
Ours             -            61.83  62.75  60.02     42.05  46.39  37.53     49.06
Table 2: Comparisons with state-of-the-art methods on three REC benchmark datasets.
Method           Venue        RefCOCO                 RefCOCO+                RefCOCOg
                              val    testA  testB     val    testA  testB     val-g
VC [28]          TPAMI’19     -      32.68  27.22     -      34.68  28.10     29.65
ARN [18]         ICCV’19      32.17  35.25  30.28     32.78  34.35  32.13     33.09
KPRN [20]        MM’19        36.34  35.28  37.72     37.16  36.06  39.29     38.37
IGN [41]         NeurIPS’20   34.78  37.64  32.59     34.29  36.91  33.56     34.92
DTWREG [33]      TPAMI’21     38.35  39.51  37.01     38.91  39.91  37.09     42.54
Cycle-Free [34]  TMM’21       39.58  41.46  37.96     39.19  39.63  37.53     -
EARN [17]        TPAMI’23     38.08  38.25  38.59     37.54  37.58  37.92     45.33
TGKD [26]        ICRA’23      39.70  39.92  39.63     40.20  39.94  40.27     47.99
RefCLIP [9]      CVPR’23      60.36  58.58  57.13     40.39  40.45  38.86     47.87
APL [23]         ECCV’24      64.51  61.91  63.57     42.70  42.84  39.80     50.22
QueryMatch [3]   MM’24        66.02  66.00  65.48     44.76  46.72  41.50     48.47
Ours             -            69.03  70.27  66.59     47.16  52.22  41.91     54.72
4.4 Training and Inference
We adopt a query-text contrastive learning strategy [3] to achieve weakly supervised learning. A common choice of cross-modal contrastive learning objective is InfoNCE:
$$\mathcal{L}_{cl}(h_t, h_q^+, h_q^-) = -\log \frac{\mathcal{T}(h_t, h_q^+)}{\mathcal{T}(h_t, h_q^+) + \sum_{h_q^-} \mathcal{T}(h_t, h_q^-)}, \tag{16}$$
where $\mathcal{T}(a, b) = \exp(\mathrm{sim}(a, b)/\tau)$, $\mathrm{sim}(\cdot)$ is the dot product similarity, and $\tau$ is the temperature. The text feature $h_t$ should match the visual feature of its designated query $h_q^+$ over a set of negative samples $h_q^-$ from other images.
In this study, we introduce two shared semantic spaces for visual-linguistic alignment. Therefore, the final contrastive learning objective of our AlignCAT is the sum of the objectives from the two spaces:
$$\mathcal{L}_{CL} = \mathcal{L}_{cl}(\bar{f}_t, \bar{f}_q^+, \bar{f}_q^-) + \mathcal{L}_{cl}(\tilde{f}_w, \tilde{f}_q^+, \tilde{f}_q^-). \tag{17}$$
During training, we directly inject the GT category to calculate the category score. However, this information is not available at the inference stage. We therefore train an auxiliary classifier to predict the category from the text side. Specifically, we add a text classifier to project the global linguistic feature $f_t$ and produce the predicted category, denoted $c^T$. The standard cross-entropy loss is used to train this text classifier:
$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i), \tag{18}$$
where $y_i$ is the $i$-th entry of the one-hot encoding of the GT category $c^*$, and $\hat{y}_i$ is the predicted probability of the $i$-th class. We note that this text category $c^T$ differs from the query category $c^V_i$ as they are predicted from the linguistic feature $f_t$ and the visual feature $f_{q_i}$, respectively.
Overall, the weakly supervised learning objective of AlignCAT can be written as follows:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{CL} + \lambda_2 \mathcal{L}_{CE}, \tag{19}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters dynamically adjusted to control the strengths, detailed in the supplemental material.
5 Experiments
Table 3: Ablation of the formula of query quality estimation.
Formula                                                            val           testA         testB
$S_{\mathrm{global}}$                                              65.89         65.94         65.47
$S_{\mathrm{global}} + S_{\mathrm{class}}$                         67.55 ↑1.66   69.63 ↑3.69   64.66 ↓0.81
$S_{\mathrm{global}} + S_{\mathrm{fine}}$                          67.21 ↑1.32   67.93 ↑1.99   66.21 ↑0.74
$S_{\mathrm{fine}} + S_{\mathrm{global}} + S_{\mathrm{class}}$     67.36 ↑1.47   68.66 ↑2.72   66.48 ↑1.01
$S_{\mathrm{global}} + S_{\mathrm{class}} + S_{\mathrm{fine}}$     69.03 ↑3.14   70.27 ↑4.33   66.59 ↑1.12
Table 4: Ablation study of the injected category information. “Train” and “Infer” refer to the training and inference stages. $c^*$: GT category. $c^T$: text classifier’s predicted category.
$c^*$ (Train)   $c^T$ (Train)   $c^T$ (Infer)   val           testA         testB
-               -               -               67.21         67.93         66.21
✓               -               -               68.74 ↑1.53   69.84 ↑1.91   66.23 ↑0.02
-               -               ✓               67.61 ↑0.40   68.18 ↑0.25   66.40 ↑0.19
-               ✓               ✓               64.64 ↓2.57   64.90 ↓3.03   63.53 ↓2.68
✓               -               ✓               69.03 ↑1.82   70.27 ↑2.34   66.59 ↑0.38
5.1 Datasets and Metric
We evaluate the proposed method on three benchmarks: RefCOCO [27], RefCOCO+ [27], and RefCOCOg [24]. All of them are based on MSCOCO [14]; they contain 19,994, 19,992, and 26,711 images with 142,210, 141,564, and 104,560 expressions, respectively. In these three datasets, each expression is associated with one class label, which is used as the GT category in the coarse-grained alignment. Regarding the text expression, RefCOCO describes objects with absolute spatial information, while the other two datasets are more challenging. RefCOCO+ focuses more on relative spatial information and appearance (such as color and texture), and RefCOCOg provides longer expressions that are more complex and carry richer semantics. For the REC task, we follow [3, 9] and use accuracy at an IoU threshold of 0.5 as the metric: we count a prediction as correct if the IoU between the predicted and GT bounding boxes exceeds 0.5. For the RES task, we adopt mIoU [15, 30] as the metric, which calculates the average IoU across all test samples. More details are in the supplementary material.
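The two metrics described above can be sketched as follows. The box format `(x1, y1, x2, y2)`, the representation of masks as sets of pixel coordinates, and all function names are assumptions made for this illustration; evaluation code in practice typically operates on arrays.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def rec_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the GT box exceeds the threshold."""
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes)
               if box_iou(p, g) > threshold)
    return hits / len(gt_boxes)


def mask_iou(pred_mask, gt_mask):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(pred_mask | gt_mask)
    return len(pred_mask & gt_mask) / union if union else 0.0


def mean_iou(pred_masks, gt_masks):
    """Average mask IoU across all test samples (mIoU)."""
    total = sum(mask_iou(p, g) for p, g in zip(pred_masks, gt_masks))
    return total / len(gt_masks)
```

Note that the REC metric is binary per sample (hit or miss at the 0.5 threshold), whereas mIoU rewards partial overlap, so the two numbers are not directly comparable.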
5.2 Implementation Details
Following [3], we employ the pretrained Mask2Former detector [4] and freeze its parameters when training our AlignCAT. The image resolution is set to 416 × 416. The text lengths of RefCOCO, RefCOCO+, and RefCOCOg are 15, 15, and 20, respectively. All experiments are conducted on two 24G Nvidia RTX 4090 GPUs. The batch size per GPU is 14. The query feature dimension is 256, and the dimensions of word-level features, text features, and the shared semantic space are all 512. During query selection, we set $O = 20$ for confidence-based filtering, $K = 10$ for the maximum selected queries after coarse-grained alignment. We set $\alpha = 100$ to emphasize the category information for calculating coarse-grained scores. We use the Adam optimizer [6] with a learning rate of $1e{-4}$ and set training epochs to 25.
5.3 Quantitative Analysis
In this section, we first validate AlignCAT by comparing it with a comprehensive set of weakly supervised VG methods, and then ablate key components of our approach.
Comparison to the state-of-the-arts. In Tables 1 and 2, we compare AlignCAT with a set of weakly supervised VG methods. The first observation is that AlignCAT significantly outperforms existing methods on all three benchmarks. Our method improves the average accuracy by +2.53% and +2.80% over QueryMatch on RefCOCO for RES and REC, respectively. The improvement on RefCOCOg is particularly notable, with AlignCAT exceeding QueryMatch by 6.00 points on RES and 6.25 points on REC. We also notice that AlignCAT excels on TestA of all datasets, where most categories of referred objects are “person”. With the help of category matching, AlignCAT effectively filters out visual queries not belonging to humans, before more fine-grained alignment. This validates the effectiveness of our innovative category-then-attribute mechanism in enhancing cross-modal alignment, with the capacity to tackle multi-object images and complex text expressions.
Table 5: Ablation of the module to inject GT category.
Confidence-based Selection   Coarse-grained Alignment   val     testA   testB
-                            -                          67.61   68.18   66.40
✓                            -                          67.24   71.50   62.06
-                            ✓                          69.03   70.27   66.59
Figure 3: Visualization of adaptive phrase attention.
Ablation of AlignCAT. To validate the designs of AlignCAT, we have conducted various ablation studies on the RefCOCO dataset for weakly supervised REC. We first compare different settings of query selection. When ablating the design of global similarity, the corresponding contrastive learning objective is also removed. The same applies to the fine-grained alignment with word-level similarity calculation. These results are reported in Table 3. The baseline selects one positive visual query with the highest $S_{\mathrm{global}}$. With category matching, the combination $S_{\mathrm{class}} + S_{\mathrm{global}}$ improves VG performance on two subsets, albeit with a slight decrease on testB. This suggests that category information benefits human-target localization, but struggles with non-human objects. Solely using the fine-grained alignment, $S_{\mathrm{global}} + S_{\mathrm{fine}}$ achieves 67.93% on RefCOCO testA, yet is worse than the former setting with 69.63%. This comparison highlights the importance of category-based filtering. We also experimented with the attribute-then-category order. The result of $S_{\mathrm{fine}} + S_{\mathrm{global}} + S_{\mathrm{class}}$ shows a performance decline on val and testA compared to $S_{\mathrm{global}} + S_{\mathrm{class}}$ (67.36 vs. 67.55 and 68.66 vs. 69.63). We suspect that without category-based filtering, the text features of contextual objects create noise and affect the cross-modal alignment. Conversely, $S_{\mathrm{class}} + S_{\mathrm{global}} + S_{\mathrm{fine}}$ with a category-then-attribute order achieves the best performance, demonstrating the effectiveness of the coarse-to-fine visual-linguistic matching scheme, as well as the complementary effect of the three selection modules.
[Figure 4 image. Example expressions: Exp-1: Guy in back with arm on chair; Exp-2: Tall bottle with yellow label; Exp-3: Person holding umbrella; Exp-4: Girl on left with white jacket.]
Figure 4: Visualization comparison of different selection designs of AlignCAT in weakly supervised REC. The red and green boxes are GT and predicted grounding results, respectively.
[Figure 5 image. Columns: GT, TRIS, QueryMatch, Ours. Example expressions: Exp-1: Man holding little dog; Exp-2: Center case on floor with squares; Exp-3: Woman helping with bull on right; Exp-4: Couch right side.]
Figure 5: Visualization comparison to TRIS and QueryMatch for the weakly supervised RES task. The GT and predicted segmentation results are marked in red.
Nex , we examine di e en s a egies o using ca ego y in o -
ma ion du ing aining and in e ence. As shown in Table 4, he i s
ow is he baseline wi hou ca ego y in o ma ion. The second ow is
he model ained wi h he GT ca ego y
𝑐∗
while emo ing he ca -
ego y ma ching sco e du ing in e ence. This se ing imp o es he
model pe o mance, which highligh s he impo ance o ca ego y
in o ma ion in enhancing c oss-modal co espondences. The hi d
ow is he esul o aining he ex classi ie bu only injec ing
he p edic ed ca ego y
𝑐𝑇
du ing in e ence, which p esen s a sligh
imp o emen . No ably, in he las second ow, di ec ly injec ing
he p edic ed ca ego y
𝑐𝑇
du ing aining indica es a signi ican
pe o mance decline. We suspec ha he ex classi ie ’s p edic ed
ca ego ies a e la gely inaccu a e a he beginning o aining, e-
sul ing in un eliable isual-linguis ic alignmen . The las ow is
ou ull model ha uses he GT ca ego y du ing aining and in-
jec s he p edic ed ca ego y du ing in e ence. This design enhances
obus ness and achie es he bes pe o mance ac oss all subse s.
We further investigate alternative strategies for injecting GT category information. As shown in Table 5, we compare the confidence-based selection and the global feature alignment as places to incorporate the category score. In the former, $\mathcal{Q}_{O}$ is filtered by confidence and category matching, while $\tilde{Q}$ relies on the global similarity $S_{\text{global}}$. Although this improves testA performance, it leads to a significant performance degradation on testB. This issue arises from suboptimal selection of negative queries, which are sampled from $\mathcal{Q}_{O}$. Since a large proportion of referred objects in the training set belong to the “person” category, integrating category matching into confidence-based filtering causes most negative queries to share the category of the positive query. This reduces the diversity of negative samples and harms generalization. To address this, we inject the category information at the coarse-grained alignment module. This setting enhances negative sample quality and improves the model’s robustness across comprehensive scenarios.
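The loss of negative diversity can be seen in a toy example. The sketch below assumes each candidate query is represented only by a category label; the helper `negative_pool` and the example labels are hypothetical and only illustrate the effect of applying a category filter before negatives are drawn.

```python
def negative_pool(categories, positive_idx, category_filter=None):
    """Return the categories of candidate negatives.

    Hypothetical helper: every candidate except the positive is a potential
    negative; an optional category filter mimics category matching applied
    before negative sampling.
    """
    pool = [c for i, c in enumerate(categories) if i != positive_idx]
    if category_filter is not None:
        pool = [c for c in pool if c == category_filter]
    return pool


# Toy candidate set dominated by "person", as in the training data.
candidates = ["person", "person", "chair", "person", "umbrella", "dog"]

unfiltered = negative_pool(candidates, positive_idx=0)
filtered = negative_pool(candidates, positive_idx=0, category_filter="person")

print(len(set(unfiltered)), len(set(filtered)))  # prints "4 1"
```

With the filter applied, every remaining negative shares the positive query's category, so the contrastive objective never sees cross-category negatives.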
5.4 Qualitative Analysis
In Figure 3, we visualize the weights of the modulated word-level text features after adaptive phrase attention. These values illustrate how the model dynamically adjusts the importance of each word. For example, given the text “girl pink”, the model weights the color attribute “pink” more heavily than the category word “girl”. Interestingly, the contextual object “leaves” is assigned a higher value than the referred object “vegetable”. This observation explains the inferior performance in Table 3 for the setting with the exchanged order. Overall, AlignCAT leverages descriptive information to mitigate intra-class ambiguity, thereby distinguishing objects belonging to the same category.
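To make the idea of per-word weights concrete, here is a minimal sketch that assumes the weights are obtained by a softmax over scalar word scores and then used to pool word features. This is an illustrative stand-in, not the paper's adaptive phrase attention; the scores, feature vectors, and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_text_feature(word_feats, scores):
    """Pool word features with softmax weights (illustrative assumption)."""
    weights = softmax(scores)
    dim = len(word_feats[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, word_feats))
        for d in range(dim)
    ]
    return weights, pooled


# Toy expression "girl pink": the attribute word gets the larger score.
words = ["girl", "pink"]
word_feats = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-d word features
weights, pooled = weighted_text_feature(word_feats, scores=[0.2, 1.4])
```

Because the weights sum to one, raising the score of “pink” necessarily lowers the share of “girl” in the pooled text feature.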
Figure 6: Visualization comparison in weakly supervised REC. Green: ground truth. Blue: QueryMatch. Red: Ours.
Panels: Exp-1: “Mr plaid”; Exp-2: “Seated man”; Exp-3: “Kid down”; Exp-4: “Wet hair”; Exp-5: “Pear on right”; Exp-6: “Food in focus”; Exp-7: “160”; Exp-8: “2”; Exp-9: “Woman in top picture on the left”; Exp-10: “Left in white person”; Exp-11: “Pastry next to right hand coffee mug”; Exp-12: “Glass under that gerbil thingy”; Exp-13: “Woman standing closest to giraffe”; Exp-14: “Person in background”; Exp-15: “Girl brushing teeth”; Exp-16: “Banana touching top right corner of the sign”.
Figure 7: Failure cases of AlignCAT in weakly supervised REC. Green: ground truth. Red: Ours.
Panels: Exp-1: “Upper keyboard”; Exp-2: “White”; Exp-3: “Broc at 5 o'clock”; Exp-4: “1019”; Exp-5: “Bottle just one to the right of the middle one”; Exp-6: “TV in center stacked on another TV kicking”; Exp-7: “Man very far left”; Exp-8: “The arm not holding the sandwich”.
To gain in-depth insights into the category-then-attribute alignment mechanism, we ablate AlignCAT with four configurations and visualize the results in Figure 4. Using only the global feature similarity $S_{\text{global}}$ struggles to achieve category and attribute consistency, especially when images involve multiple objects. With the category matching score, the results become category-consistent. For instance, $S_{\text{global}} + S_{\text{class}}$ excludes the contextual object “label” in Exp-2; however, it fails to resolve intra-class ambiguities and selects another bottle. On the other hand, without $S_{\text{class}}$, the results of $S_{\text{global}} + S_{\text{fine}}$ still suffer from category inconsistency in Exp-3, which highlights the importance of category matching. This finding is consistent with the quantitative comparison in Table 3. In contrast, the full configuration $S_{\text{global}} + S_{\text{class}} + S_{\text{fine}}$ achieves category and attribute consistency, whether the target is human or non-human, and whether the attribute is a color (e.g., “white”) or a spatial relation (e.g., “in back”). Note that AlignCAT may still fail to accurately locate the target object under occlusion (e.g., the arm behind the head in Exp-1).
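The four configurations can be mimicked with a toy scoring example. The sketch below assumes that the enabled scores are simply summed and the highest-scoring candidate is selected, which may differ from the progressive filtering actually used in AlignCAT; all candidate names and score values are made up for illustration.

```python
def select_query(candidates, score_names):
    """Return the name of the candidate with the highest summed score.

    Illustrative assumption: enabled scores are added; the paper's modules
    may combine or filter queries differently.
    """
    best = max(candidates, key=lambda c: sum(c[s] for s in score_names))
    return best["name"]


# Toy scene loosely modeled on "tall bottle with yellow label".
candidates = [
    {"name": "target bottle", "global": 0.50, "class": 1.0, "fine": 0.80},
    {"name": "other bottle", "global": 0.55, "class": 1.0, "fine": 0.30},
    {"name": "label", "global": 0.60, "class": 0.0, "fine": 0.75},
]

configs = {
    "global": ["global"],
    "global+class": ["global", "class"],
    "global+fine": ["global", "fine"],
    "global+class+fine": ["global", "class", "fine"],
}

results = {name: select_query(candidates, keys) for name, keys in configs.items()}
for name, picked in results.items():
    print(f"{name}: {picked}")
```

In this toy setup, the global score alone picks the contextual “label”, adding the class score removes it but picks the wrong bottle, adding only the fine score reintroduces the category error, and only the full combination selects the target.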
In Figure 5, we compare AlignCAT with two state-of-the-art weakly supervised RES models, TRIS [16] and QueryMatch [3]. Given texts with complex relationships and dense images with multiple objects, both baselines incorrectly locate contextual objects such as “dog” and “bull”. Conversely, the proposed method achieves higher segmentation accuracy. With stronger reasoning ability and better visual understanding, AlignCAT is more robust and reliable across a wide range of grounding scenarios.
Figure 6 compares the performance of AlignCAT and QueryMatch in the weakly supervised REC task. The analysis shows that AlignCAT has a clear advantage in maintaining category and attribute consistency. For example, in Exp-7, AlignCAT successfully aligns the abstract and vague query “160” with the “person” category, accurately localizing the target. In contrast, QueryMatch fails to understand this abstract query, leading to a localization failure. In complex multi-object scenarios, AlignCAT continues to effectively select the correct queries. For instance, in Exp-16, AlignCAT not only identifies the logical subject “banana”, but also captures its fine-grained spatial relationships, achieving precise localization. In summary, AlignCAT, with its robust coarse-to-fine semantic alignment, outperforms QueryMatch in complex scenarios.
To provide a more comprehensive analysis, we present typical failure cases of AlignCAT for the REC task in Figure 7. Some failures stem from dataset quality issues, including wrong ground-truth annotations (Exp-1) and insufficient textual descriptions (Exp-2). Meanwhile, our method still lacks the semantic understanding needed to comprehend out-of-distribution textual expressions (Exp-3) or to discern nuanced visual features of similar objects (Exp-4).
6 Conclusion
In this study, we identify that existing weakly supervised VG methods suffer from contextual ambiguities, showing category and attribute inconsistencies. To address these challenges, we propose a novel query-based VG framework, AlignCAT, with a category-then-attribute visual-linguistic alignment strategy to progressively filter out query candidates. To ensure category consistency, we design a coarse-grained alignment module that leverages category information and global context. For attribute consistency, we further propose a fine-grained alignment module to capture word-level linguistic features and emphasize attribute-based query-text alignment, effectively resolving intra-class ambiguities. Extensive experiments demonstrate that AlignCAT achieves state-of-the-art performance on three benchmarks for both REC and RES tasks. The proposed category-then-attribute alignment enhances category and attribute consistencies in comprehensive scenes. This work provides novel insights into leveraging linguistic cues for advancing weakly supervised visual grounding.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (No. 62272227). It was also partly supported by the MUR PNRR project FAIR (PE00000013) funded by the NextGenerationEU, the EU Horizon projects ELIAS (No. 101120237) and ELLIOT (No. 101214398).
References
[1] Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, et al. 2021. Detector-free weakly supervised grounding by separation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1801–1812.
[2] Kan Chen, Jiyang Gao, and Ram Nevatia. 2018. Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4042–4050.
[3] Shengxin Chen, Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, and Rongrong Ji. 2024. QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding. In Proceedings of the 32nd ACM International Conference on Multimedia. 4177–4186.
[4] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1290–1299.
[5] Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, and Wankou Yang. 2024. SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion. arXiv preprint arXiv:2409.17531 (2024).
[6] Diederik P Kingma. 2014. Adam: A method for stochastic optimization. (2014).
[7] Francisco Eiras, Kemal Oksuz, Adel Bibi, Philip HS Torr, and Puneet K Dokania. 2024. Segment, select, correct: A framework for weakly-supervised referring segmentation. In European Conference on Computer Vision. Springer, 326–342.
[8] Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, and Gao Huang. 2022. Pseudo-q: Generating pseudo language queries for visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15513–15523.
[9] Lei Jin, Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Annan Shu, and Rongrong Ji. 2023. Refclip: A universal teacher for weakly supervised referring expression comprehension. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2681–2690.
[10] Dongwon Kim, Namyup Kim, Cuiling Lan, and Suha Kwak. 2023. Shatter and gather: Learning referring image segmentation with text supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15547–15557.
[11] Jungbeom Lee, Sungjin Lee, Jinseok Nam, Seunghak Yu, Jaeyoung Do, and Tara Taghavi. 2023. Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 21870–21881.
[12] Jungbeom Lee, Sungjin Lee, Jinseok Nam, Seunghak Yu, Jaeyoung Do, and Tara Taghavi. 2023. Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 21870–21881.
[13] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34 (2021), 9694–9705.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
[15] Yuqi Lin, Minghao Chen, Wenxiao Wang, Boxi Wu, Ke Li, Binbin Lin, Haifeng Liu, and Xiaofei He. 2023. Clip is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15305–15314.
[16] Fang Liu, Yuhao Liu, Yuqiu Kong, Ke Xu, Lihe Zhang, Baocai Yin, Gerhard Hancke, and Rynson Lau. 2023. Referring image segmentation using text supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22124–22134.
[17] Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Zechao Li, Qi Tian, and Qingming Huang. 2022. Entity-enhanced adaptive reconstruction network for weakly supervised referring expression grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3003–3018.
[18] Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Dechao Meng, and Qingming Huang. 2019. Adaptive reconstruction network for weakly supervised referring expression grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2611–2620.
[19] Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2019. Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding. In Proceedings of the 27th ACM International Conference on Multimedia. 539–547.
[20] Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2019. Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding. In Proceedings of the 27th ACM International Conference on Multimedia. 539–547.
[21] Yang Liu, Jiahua Zhang, Qingchao Chen, and Yuxin Peng. 2023. Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2828–2838.
[22] Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. 2020. Multi-task collaborative network for joint referring expression comprehension and segmentation. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. 10034–10043.
[23] Yaxin Luo, Jiayi Ji, Xiaofu Chen, Yuxin Zhang, Tianhe Ren, and Gen Luo. 2025. APL: Anchor-Based Prompt Learning for One-Stage Weakly Supervised Referring Expression Comprehension. In European Conference on Computer Vision. Springer, 198–215.
[24] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 11–20.
[25] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition. 3195–3204.
[26] Jinpeng Mi, Song Tang, Zhiyuan Ma, Dan Liu, Qingdu Li, and Jianwei Zhang. 2023. Weakly supervised referring expression grounding via target-guided knowledge distillation. In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 8299–8305.
[27] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. 2016. Modeling context between objects for referring expression understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 792–807.
[28] Yulei Niu, Hanwang Zhang, Zhiwu Lu, and Shih-Fu Chang. 2019. Variational context: Exploiting visual and textual context for grounding referring expressions. IEEE transactions on pattern analysis and machine intelligence 43, 1 (2019), 347–359.
[29] Jie Qin, Jie Wu, Xuefeng Xiao, Lujun Li, and Xingang Wang. 2022. Activation modulation and recalibration scheme for weakly supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 2117–2125.
[30] Tal Shaharabany, Yoad Tewel, and Lior Wolf. 2022. What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs. Advances in Neural Information Processing Systems 35 (2022), 28222–28237.
[31] Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. 2022. From show to tell: A survey on deep learning-based image captioning. IEEE transactions on pattern analysis and machine intelligence 45, 1 (2022), 539–559.
[32] Robin Strudel, Ivan Laptev, and Cordelia Schmid. 2022. Weakly-supervised segmentation of referring expressions. arXiv preprint arXiv:2205.04725 (2022).
[33] Mingjie Sun, Jimin Xiao, Eng Gee Lim, Si Liu, and John Y Goulermas. 2021. Discriminative triad matching and reconstruction for weakly referring expression grounding. IEEE transactions on pattern analysis and machine intelligence 43, 11 (2021), 4189–4195.
[34] Mingjie Sun, Jimin Xiao, Eng Gee Lim, and Yao Zhao. 2021. Cycle-free weakly referring expression grounding with self-paced learning. IEEE Transactions on Multimedia 25 (2021), 1611–1621.
[35] Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, and Zechao Li. 2023. Context disentangling and prototype inheriting for robust visual grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[36] Josiah Wang and Lucia Specia. 2019. Phrase localization without paired training examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4663–4672.
[37] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. 2022. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18134–18144.
[38] Ziyan Yang, Kushal Kafle, Franck Dernoncourt, and Vicente Ordonez. 2023. Improving visual grounding by encouraging consistent gradient-based explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19165–19174.
[39] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6281–6290.