Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

Author: Li, Jinlong; Saltori, Cristiano; Poiesi, Fabio; Sebe, Niculae

Publisher: Zenodo

DOI: 10.1109/CVPR52734.2025.01806

Source: https://zenodo.org/records/17688789/files/Li_Cross-Modal_and_Uncertainty-Aware_Agglomeration_for_Open-Vocabulary_3D_Scene_Understanding_CVPR_2025_paper.pdf

Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Jinlong Li1,† Cristiano Saltori1 Fabio Poiesi2 Nicu Sebe1
1University of Trento 2Fondazione Bruno Kessler
Abstract
The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities. Project webpage: CUA-O3D.
1. Introduction
3D scene understanding serves as a crucial perception component for a wide array of real-world applications to help models better understand the physical world, including robot navigation, autonomous vehicles, and virtual reality [4, 21, 46, 79]. Typical approaches necessitate a dataset of semantically annotated point clouds, which is both time-consuming (e.g., 22.3 minutes for annotating a single scene with 20 classes [13]) and unable to encompass all possible existing categories, thereby limiting their practical utility. To address this limitation, recent works have explored the open-vocabulary 3D scene understanding setting [15, 33, 52, 53, 62], aiming to localize and recognize arbitrary object classes. This objective is achieved by leveraging Vision-Language Models (VLMs) [34, 64] that, having been pre-trained on billions of image-text pairs [74], can gauge the alignment between textual and visual inputs, facilitating a range of 2D open-vocabulary tasks [5, 20, 23, 39, 97]. To adapt these models to 3D downstream tasks (e.g., semantic segmentation), various strategies, predominantly based on distilling multi-view 2D visual features into a 3D-specific model [15, 25, 33, 62, 84], have been employed.

[Figure 1 panels: Input, GT, Lseg, DINOv2, Stable Diffusion]
Figure 1. Top: feature distribution analysis of different 2D projected feature embeddings from various foundation models (Lseg, DINOv2 and Stable Diffusion), enumerated over the whole ScanNetV2 train set by counting the frequency of all point features within each bin interval. Bottom: a sample in which K-Means clusters the projected 3D features into a specified number of clusters for segmentation comparison. Different foundation models produce heterogeneous yet complementary results.

†Corresponding author: [email protected].
However, these methods mainly focus on distilling knowledge from one single VLM, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities of models that have been trained on large-scale image or image-text pair corpora, as shown in Table 1. Given the existence of various foundation models, such as CLIP [64], DINOv2 [59] and Stable Diffusion [66], it remains under-explored how to make better use of these 2D foundation models to develop a 3D foundation model, given that no large-scale 3D point-cloud or 3D-text pair corpus is available. Very recently, some works [18, 51] have started probing the potential of these 2D foundation models on 3D tasks, since different VLMs or visual foundation models showcase unique characteristics. Probe3D [18] posits the 3D awareness of some visual foundation models; for example, DINOv2 performs well for depth and surface normals. Lexicon3D [51] finds that diffusion models benefit geometric tasks. Nonetheless, there is still a lack of studies on how to aggregate these heterogeneous foundation models into a 3D model, which naturally excels at geometric knowledge extraction and locating 3D spatial objects.

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.
To explore the properties of various foundation models when lifting multi-view posed image features to 3D space and enabling model distillation, we first conduct a pilot study to analyze and compare the heterogeneous results shown in Fig. 1. We can observe that the distribution of the feature embeddings from each 2D foundation model mainly follows a Gaussian-like feature distribution. At the same time, when clustering these features into specific clusters, the results illustrate heterogeneous and complementary effects across different foundation models. This is also due to the inconsistency across different posed images when the 2D model encounters complex image contexts. This motivates us to develop a new method to harmonize this heterogeneous knowledge into a 3D model and handle such noisy inconsistency from the used 2D feature embeddings.
In this paper, we present Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first method to integrate multiple 2D foundation models into one 3D model for scene understanding. Deterministic uncertainty estimation is further introduced to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. We show that there are potential inconsistencies, yet also complementarity, when attempting to distill different foundation models. Based on our pilot study, we first propose to leverage a distillation loss to supervise 3D model training given several available 2D feature embeddings. Our 3D model consists of independent projection layers, each mapped to one VLM or visual foundation model under feature supervision, which helps with reconciling the entanglements from heterogeneous distributions. To resolve the potential noise from 2D models, which mainly comes from insufficient context, a novel deterministic uncertainty estimation is tailored to adaptively weigh the knowledge distillation, which can be modeled as a Gaussian likelihood that follows the distributions shown at the top of Fig. 1. Specifically, for each projection layer mapped to a specific 2D model, we independently predict an observation noise scalar to capture how much noise is contained in the feature supervision during training, which is termed uncertainty-aware learning. Moving one step forward, it can be observed that the distribution of Stable Diffusion is slightly shifted away from the center scale, and usually carries broader value ranges and heavy-tailed ("spike") values in the projected feature embeddings. A de-mean operation is then adopted to re-center the feature scales, reducing the impact of anomalous points while still allowing points with small scale to guide the 3D model training.
As we show experimentally on ScanNetV2 [13] and Matterport3D [6], our approach allows the 3D model to agglomerate heterogeneous knowledge and to reconcile potential noise from various 2D feature supervisions. Extensive experiments demonstrate that our method not only advances open-vocabulary segmentation but also achieves competitive cross-domain alignment and spatial perception capabilities. Additionally, we validate that our method can achieve significant downstream performance after distillation. In summary, the contributions of this work are:
• To the best of our knowledge, we are the first to investigate the agglomeration of cross-modal knowledge distillation from 2D models to a 3D model, given the various strong foundation models available.
• We analyze the heterogeneous yet complementary feature embeddings from multiple 2D models and incorporate both semantic- and geometric-aware knowledge into one single 3D model.
• We further propose a deterministic uncertainty estimation that enables the 3D model to predict independent observation scalars to capture the noise and resolve the heterogeneity from various feature supervisions.
• We evaluate our method in a wide set of experiments on 3D open-vocabulary segmentation and present competitive cross-domain validation of our method, while also demonstrating strong downstream performance after distillation.
2. Related Works
Open-Vocabulary (OV) 3D scene understanding advances over the previous large corpus of closed-set approaches [2, 8, 48, 49, 68, 84, 93], allowing robust zero-shot reasoning and alleviating the need for annotations. Recent advances in Visual-Language Models (VLMs) [34, 64] have driven OV models towards remarkable levels of robustness, with numerous emerging approaches tackling OV in image semantic segmentation [5, 20, 47, 83], object detection [3, 98], and recently universal segmentation [89]. In contrast, OV for 3D scene understanding (OV3D) is limited by the data available for training a purely fundamental 3D VLM. Alternatively, the community achieves OV3D by distilling zero-shot knowledge from recent VLMs [5, 64] and by mapping point cloud features to a queryable CLIP space. In 3D semantic segmentation, ConceptFusion [33] fuses VLM representations from multiple views into 3D points. Some methods [76] extend OV to 3D instance semantic segmentation based on CLIP [64] or SAM [38] to align the 3D space with the language space while enforcing an instance-mask constraint. However, recent OV3D methods rely heavily on the zero-shot knowledge of the 2D VLMs without investigating the reliability of the projected 2D feature representation. In this work, we move forward and explore how to aggregate knowledge from various 2D foundation models.

Table 1. Comparison of Vision Foundation Models. Although all utilize the same Vision Transformer (ViT) backbone, they greatly differ in their training paradigms, including data, image resolutions, and training objectives, which lead to diverse representation biases.
Model | Training Dataset | Dataset Size | Architecture | Objective
ViT [17] | ImageNet-1k/21k | 1.2M/14.2M | ViT-B/L/G | Supervised classification
DINOv2 [59] | LVD-142M | 142M | ViT-L/14 | Discriminative self-supervised learning
CLIP [64] | WebImageText | 400M | ViT-L/14 | Image-text contrastive learning
Stable Diffusion [66] | LAION | 5B | UNet | Image-Text/Image Generation
Knowledge distillation (KD) [27, 60] aims at training compact student models with supervision from more powerful and larger teacher models [43]. As introduced by the first work [27], the student model is trained to mimic the prediction behavior of the teacher model. This idea has been extensively explored in subsequent works [1, 45, 67, 88, 92, 94] and applied successfully in a wide range of tasks, from supervised training [54, 78, 82] to network compression [7, 63, 73] and domain adaptation [26, 28, 95]. Following the explosion of Visual-Language Models like CLIP [64] and ALIGN [34], KD has been introduced to efficiently transfer knowledge between different modalities [23, 43, 50], and more recently to bridge the gap between the text and 3D point cloud modalities [9, 15, 32, 62, 85, 87, 91]. AM-RADIO [65] describes a general methodology for distilling multiple distinct foundation models into one, but still focuses only on the 2D domain. We primarily focus on studying and tackling the ambiguity of the distilled representations between image and point cloud modalities.
Uncertainty estimation has been widely investigated in various tasks [24, 29, 35, 36, 40, 42, 55, 58] and addresses the problem of quantifying the uncertainty of a model's predictions. Uncertainty estimation can be broadly classified into: (i) aleatoric estimation [35, 57, 80], which is usually due to the underlying uncertainty in the measurement and uses an extra network trained from scratch to approximate a heteroscedastic distribution by maximizing the likelihood of the system, and (ii) epistemic estimation [19, 22, 40], which attributes the uncertainty to the model parameters in low-data regimes, where parameter estimation becomes noisy. In 3D point clouds, uncertainty estimation finds applications in incremental learning [86], semantic segmentation [12], and domain adaptation [71, 72, 81]. Unlike previous works, we seek a deterministic uncertainty estimation for supervising 3D model training that is aware of the ambiguity between the 2D image and 3D point modalities.

[Figure 2 panels: a) input, b) prediction, c) ground truth, d) 2D view-images prediction]
Figure 2. Preliminary study on image embedding ambiguity. VLM embeddings show inconsistent segmentations across multi-view images (e.g. cabinet). Guidance with ambiguous embeddings may be detrimental when supervising the training of a 3D model.
3. Methodology
In this section, we first describe the preliminary open-vocabulary 3D scene understanding task in Sec. 3.1. Then we elucidate inconsistent results across multi-view posed images to demonstrate the necessity of uncertainty-aware training to alleviate such issues and embrace various 2D foundation models in Sec. 3.2. Our Cross-Modal and Uncertainty-Aware Agglomeration (CUA-O3D) method is described in Sec. 3.3, including distillation agglomeration and deterministic uncertainty estimation.
3.1. Open-Vocabulary 3D Scene Understanding
In standard 3D semantic segmentation, the training set $\mathcal{T}_{train}$ includes point clouds and dense point-level annotations. Each point cloud $X = \{p_i \in \mathbb{R}^3, i \in [0, N-1]\}$ consists of $N$ points $p_i$ with corresponding point-level annotations $Y$. These annotations are assigned from a predefined set of class indices $\mathcal{K} = [1, ..., K]$, where each index corresponds to a specific class name in the vocabulary $V = [v_1, ..., v_K]$. Given $\mathcal{T}_{train}$ and $\mathcal{K}$, the objective is to learn a deep neural network correctly assigning a label from $\mathcal{K}$ to each point $p_i \in X$. In contrast, open-vocabulary 3D semantic segmentation (OV3D) aims to segment $X$ using an arbitrary vocabulary $V$. Recent approaches achieve this with a pre-trained VLM [62, 76]. The VLM provides a common embedding space $Z$ through two distinct encoders, namely, a vision encoder $\phi^{2D}: I \to Z$ and a text encoder $\phi^{txt}: V \to Z$. The common practice is to train a 3D vision encoder $\theta^{3D}$ to align to $\phi^{2D}$ embeddings, bridging the modality gap and defining $\theta^{3D}: X \to Z$. After training, $\phi^{txt}$ encodes $V$, and class predictions are computed via similarity matching between 3D point cloud and textual vocabulary embeddings.
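The similarity-matching step at the end of Sec. 3.1 can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function name `classify_points`, the array shapes, and the choice of cosine similarity followed by an arg-max are assumptions made for the example.

```python
import numpy as np

def classify_points(point_feats, text_feats):
    """Assign each 3D point the class whose text embedding is most similar.

    point_feats: (N, D) per-point embeddings (assumed output of the 3D encoder).
    text_feats:  (K, D) embeddings, one per class name in the vocabulary.
    Returns an (N,) array of class indices in [0, K-1].
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    similarity = p @ t.T  # (N, K) cosine similarities
    return similarity.argmax(axis=1)
```

Because the vocabulary only enters through `text_feats`, swapping in a different list of class names requires re-encoding the text, not retraining the 3D encoder.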
3.2. Preliminary Observation
We first conduct a simple qualitative study to analyze how embedding ambiguity affects $\phi^{2D}$ predictions before and after projection to the point cloud space, based on the commonly used Lseg [5]. We analyze the consistency of the predictions of the same object appearing in multi-view images. Given a pre-trained vision-language model $\phi^{2D}$ and multi-view images $I$, we query $\phi^{2D}$ with known vocabularies $V$ and report the qualitative results over multiple ScanNetV2 [13] view-images in Fig. 2. We notice that predictions are consistent for the classes wall, floor, and door while showing inconsistency over the class cabinet. After projection to the 3D point cloud space (left), the projected prediction inherits this ambiguity, which in turn harms the 3D model distillation. Regarding the incorporation of the various 2D foundation models shown in Fig. 1, how to harmonize their heterogeneous characteristics also matters. This simple study highlights the need for a reliable uncertainty measure that captures embedding ambiguity, which we hope will shed new light on future work on this task.
3.3. Cross-Modal Agglomeration
We provide an overview of our CUA-O3D in Fig. 3. Our approach leverages a 3D encoder backbone $\theta^{3D}$ to transform the input point cloud $X$ into 3D sparse point cloud features $F^{3D}$. Concurrently, we use several pre-trained vision encoders $\phi^{2D}_i$, such as CLIP, DINOv2, and Stable Diffusion, to map multi-view images $I$ to dense image features separately, which are then projected to yield sparse image features $F^{2D}_i$. Then, we construct three projection layers, each a simple MLP, to map $F^{3D}_i$ to each 2D model through the corresponding distillation loss. Meanwhile, the 3D model is designed to output independent deterministic uncertainty-aware observation scalars $\sigma_i$ to adaptively weigh the feature supervisions. After training, we use $\theta^{3D}$ for the main task of open-vocabulary 3D semantic segmentation via matching with text embeddings $F^{txt}$ [76] and drop the uncertainty module.
Distillation agglomeration. The distillation phase aims to align the point embeddings $F^{3D}$ to the image embeddings $F^{2D}$ obtained from each frozen pre-trained visual encoder $\phi^{2D}_i$. This is a common practice in recent OV3D approaches [76], which leads to a shared embedding space between image, text, and point cloud modalities. Given a point cloud $X$ paired with multi-view posed images $I$, we learn 3D sparse features $F^{3D}$ by employing the well-established MinkowskiNet [11] as a 3D sparse convolutional encoder $\theta^{3D}$. The encoder $\theta^{3D}$ outputs a sparse set of point-wise feature vectors $F^{3D} = \theta^{3D}(X)$, where each feature is associated with an input point $p_i \in X$. Similarly, we extract the multi-view image features from $I$ using each frozen vision encoder $\phi^{2D}_i$ separately, where $\phi^{2D}_i \in \{\phi^{2D}_{Lseg}, \phi^{2D}_{DINOv2}, \phi^{2D}_{SD}\}$. The output of $\phi^{2D}_i$ is a set of feature vectors $F^{2D}_i = \phi^{2D}_i(I)$, where each feature vector in $F^{2D}_i$ is associated with an input pixel $u$. After projection in the homogeneous coordinates space, we redefine $F^{2D}_i$ as a set of point-wise image features, corresponding to each $\phi^{2D}_i$. We define the matches between each point $p$ and pixel $u$ by using the corresponding homogeneous coordinates $\tilde{p}$ and $\tilde{u}$, respectively. Once matches are established, we enforce the alignment between the predicted $F^{3D}_i$ and $F^{2D}_i$. Following [62] and training $\theta^{3D}$ to minimize the distillation loss $\mathcal{L}_{distill}$, our independent distillation losses are defined as

Lseg: $\mathcal{L}_{cos\_lseg} = 1 - \frac{F^{3D}_1 \cdot F^{2D}_1}{\|F^{3D}_1\|_2 \cdot \|F^{2D}_1\|_2}$ (1)

DINOv2: $\mathcal{L}_{l1} = \frac{1}{n} \sum_{i=1}^{n} |F^{3D}_2 - F^{2D}_2|$ (2)

Stable Diffusion: $\mathcal{L}_{cos\_sd} = 1 - \frac{F^{3D}_3 \cdot F^{2D}_3}{\|F^{3D}_3\|_2 \cdot \|F^{2D}_3\|_2}$ (3)

which corresponds to minimizing the distance between each $F^{3D}_i$ and $F^{2D}_i$, leading to the final distillation loss

$\mathcal{L}_{distill} = \mathcal{L}_{cos\_lseg} + \mathcal{L}_{l1} + \mathcal{L}_{cos\_sd}$, (4)

where we study the choice of distillation loss in our supplementary materials.
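The three per-model losses and their unweighted sum (Eqs. 1 to 4) can be written out as a minimal NumPy sketch. The function names, the dictionary keys, the averaging of the cosine loss over points, and the averaging of the L1 loss over all elements are assumptions for illustration; the paper does not spell out these reductions.

```python
import numpy as np

def cosine_loss(f3d, f2d, eps=1e-8):
    """1 - cosine similarity, averaged over points (Eqs. 1 and 3).

    f3d, f2d: (N, D) matched point-wise features. Averaging over the N
    points is an assumption of this sketch.
    """
    num = np.sum(f3d * f2d, axis=1)
    den = np.linalg.norm(f3d, axis=1) * np.linalg.norm(f2d, axis=1) + eps
    return float(np.mean(1.0 - num / den))

def l1_loss(f3d, f2d):
    """Mean absolute difference (Eq. 2); the mean over all elements is assumed."""
    return float(np.mean(np.abs(f3d - f2d)))

def distill_loss(heads_3d, targets_2d):
    """Unweighted sum of the three losses (Eq. 4).

    heads_3d / targets_2d: dicts with hypothetical keys "lseg", "dinov2", "sd",
    each holding an (N, D_i) array for the corresponding projection head.
    """
    return (cosine_loss(heads_3d["lseg"], targets_2d["lseg"])
            + l1_loss(heads_3d["dinov2"], targets_2d["dinov2"])
            + cosine_loss(heads_3d["sd"], targets_2d["sd"]))
```

Keeping one projection head per 2D model lets each head have its own feature dimension, so the three targets never need to live in a common space.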
To further alleviate the impact of the Stable Diffusion features $F^{2D}_3$, which contain sharp values in the projected feature embeddings, we adopt a de-mean operation to re-center the feature scales, reducing the impact of anomalous points while still allowing points with small scale to guide the 3D model training:

$F^{2D}_3 = F^{2D}_3 - \mu_{F^{2D}_3}$ (5)

where $\mu_{F^{2D}_3}$ is the mean of $F^{2D}_3$ along the channel dimension.
Deterministic uncertainty estimation. As analyzed above, different 2D foundation models encapsulate unique characteristics, and even a single 2D model induces inherent inconsistency across multi-view posed images, which calls for appropriate countermeasures. We therefore propose a simple yet effective deterministic uncertainty-aware observation scalar prediction to quantify embedding ambiguity within each cross-modal distillation, which learns adaptive weights for the various 2D feature supervisions during cross-modal training.
Specifically, we equip the 3D model $\theta^{3D}$ with three independent noise scalar predictions $\sigma_i$, one for each 2D model $\phi^{2D}_i$. By analogy with a regression task, the output of a probabilistic model $f^W$ with weights $W$ can be modeled with a Gaussian likelihood:

$p(y \mid f^W(x)) = \mathcal{N}(f^W(x), \sigma^2)$ (6)
Figure 3. Overview of CUA-O3D. We first utilize the Lseg, DINOv2 and Stable Diffusion models to extract multi-view posed image embeddings and then use multi-view 3D projection to obtain the projected 3D features $F^{2D}_i$ to supervise the 3D model training. Three MLP layers are established to map to each 2D model's supervision independently, while a specific noise scalar prediction $\sigma_i$ is learned through a deterministic uncertainty estimation and used to adaptively weigh the corresponding distillation loss $\mathcal{L}$.
Here, we assume the 2D foundation models follow similar modeling, as demonstrated in Fig. 1, and our dense alignment training can be regarded as a continuous model output. Combining all three 2D models we use, we then define the multiple model outputs as:

$p(y_1, ..., y_K \mid f^W(x)) = \prod_{i=1}^{K} p(y_i \mid f^W(x))$, (7)

where index $i$ corresponds to the mapping of each 2D model, $\{\phi^{2D}_{Lseg}, \phi^{2D}_{DINOv2}, \phi^{2D}_{SD}\}$. Based on maximum likelihood modeling and taking the regression task as an example, the log-likelihood of the model can then be optimized,

$\log p(y \mid f^W(x)) \propto -\frac{1}{2\sigma^2} \|y - f^W(x)\|^2 - \log \sigma$, (8)

where $\sigma$ denotes the model's observation noise scalar, responsible for capturing the inherent uncertainty within the 2D feature supervisions, which is mainly caused by heterogeneous and noisy feature embeddings from various 2D foundation models. Assuming that we have three outputs corresponding to Lseg, DINOv2, and Stable Diffusion, each following a Gaussian-like distribution, we then have:

$p(y_1, y_2, y_3 \mid f^W(x)) = \prod_{i=1}^{3} p(y_i \mid f^W(x)) = \prod_{i=1}^{3} \mathcal{N}(y_i; f^W(x), \sigma_i^2)$, (9)

and we then formulate our training objective $\mathcal{L}_{distill}(W, \sigma_1, \sigma_2, \sigma_3)$ over multiple mappings:

$\mathcal{L}_{distill} = -\log p(y_1, y_2, y_3 \mid f^W(x)) \propto \frac{1}{2\sigma_1^2} \mathcal{L}_{cos\_lseg} + \frac{1}{2\sigma_2^2} \mathcal{L}_{l1} + \frac{1}{2\sigma_3^2} \mathcal{L}_{cos\_sd} + \log \sigma_1 \sigma_2 \sigma_3$, (10)

which leads to our overall training objective: $\mathcal{L} = \mathcal{L}_{distill}$.
To the best of our knowledge, this is the first work to explore deterministic uncertainty-aware modeling for agglomerating multiple 2D foundation models into a single unified 3D model, aiming toward the development of a potential foundational 3D model. In the last term of Eq. 10, each supervision signal from a 2D model contributes to learning adaptive weighting during training. Specifically, the uncertainty parameter $\sigma_i$, which characterizes the noise level associated with the $i$-th 2D model $\phi^{2D}_i$, dynamically adjusts the contribution of the corresponding loss term $\mathcal{L}_i$. As $\sigma_i$ increases, indicating higher uncertainty, the effective weight of $\mathcal{L}_i$ decreases, thereby down-weighting less reliable supervision. Meanwhile, each $\sigma_i$ is implicitly regularized to prevent excessive growth, ensuring that all supervision sources contribute meaningfully to the optimization. We visualize the evolution of $\sigma_i$ throughout training in the supplementary material. To further ensure stable optimization, we add a small constant (set to 1.0) to each $\sigma_i$ to prevent the loss from becoming negative during training:

$\log \sigma_i \to \log(1.0 + \sigma_i)$. (11)
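To make the weighting in Eqs. 10 and 11 concrete, the following NumPy sketch evaluates the objective for given loss values and noise scalars. It is a hedged illustration: the function name and the plain-array interface are hypothetical, the $\sigma_i$ are assumed to be positive, and in actual training they would be learnable quantities whose parameterisation the text does not specify.

```python
import numpy as np

def uncertainty_weighted_loss(losses, sigmas):
    """Combine per-model distillation losses as in Eq. 10 with the Eq. 11 fix.

    losses: three loss values (Lseg cosine, DINOv2 L1, Stable Diffusion cosine).
    sigmas: three positive observation-noise scalars, one per 2D model.
    """
    losses = np.asarray(losses, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    weighted = losses / (2.0 * sigmas ** 2)  # larger sigma -> smaller weight
    regulariser = np.log(1.0 + sigmas)       # log(1 + sigma) stays non-negative
    return float(np.sum(weighted + regulariser))
```

For instance, with unit losses, raising one $\sigma_i$ from 1 to 2 shrinks that model's data term from 0.5 to 0.125 while the regulariser grows from $\log 2$ to $\log 3$, which is the trade-off that keeps $\sigma_i$ from growing without bound.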
4. Experiments
We run extensive experiments over a wide set of tasks, ranging from open-vocabulary segmentation and cross-domain generalization to evaluation with fine-grained class vocabularies. This section is organized as follows. Sec. 4.1-4.2 provide the dataset and implementation details of CUA-O3D used in our experiments. Sec. 4.3 and Sec. 4.4 illustrate the potential of embracing various 2D foundation models and evaluate open-vocabulary 3D semantic segmentation in comparison with related methods. Sec. 4.5 reports cross-domain generalization under common and fine-grained class evaluation, while Sec. 4.6 presents the downstream performance after distillation. The final Sec. 4.7 ablates and analyzes the improvements of each proposed component.
4.1. Datasets
ScanNetV2 [13, 69] is a large-scale annotated indoor dataset. It includes over 2.5 million camera views within more than 1.5k RGB-D scans, collected across 707 diverse indoor environments such as offices and living rooms. The dataset is enriched with annotations, including 3D camera poses and point-level semantic segmentation. It is officially split into 1201 training and 312 validation scans sampled from 706 different scenes, and a test set of 100 scans with hidden ground truth. ScanNetV2 annotations cover 40 semantic classes, while the official 3D semantic segmentation benchmark focuses on a subset of 20 classes.
Matterport3D [6] is another large-scale RGB-D dataset collected for 3D scene understanding in indoor settings, containing 90 buildings with multiple rooms on different floors captured using a Matterport Pro Camera. It provides 10.8k panoramic views within 90 real, building-scale scenes processed from 194.4k RGB-D images. Each scene represents a residential building with multiple rooms and is annotated with camera poses and point-level semantic segmentation. The official 3D semantic segmentation benchmark evaluates performance across 21 semantic classes.
4.2. Implementation Details
We implement our method in the PyTorch framework [61] and run experiments on a single NVIDIA A100 GPU for ScanNetV2 and Matterport3D, respectively. We follow [62] and use MinkUNet18A [11] as our 3D backbone starting from randomly initialized weights, and Lseg [5] as our pre-trained VLM. During training, we adopt the Adam optimizer [37] with an initial learning rate of 1e−4 and an exponential decay to train our pipeline for 50 epochs. For the teacher model parameter update, we set the momentum coefficient β to 0.99 and γ to 1. Besides, we employ a voxel size of 2 cm and a batch size of 2 for both the ScanNetV2 and Matterport3D experiments. Due to the GPU memory limitation, we uniformly sample 20k point features to train the model and input only the 3D point position without RGB information to the MinkowskiNet. We utilize random horizontal flip and elastic distortion as data augmentations over point clouds while following BP-Net [30] to apply color, jitter, and hue feature transformations over 2D feature embeddings.
4.3. Qualitative Comparisons
To demonstrate the motivation for agglomerating various 2D foundation models, we investigate the results of using K-Means to cluster the projected feature embeddings from different 2D models and our distilled features, as shown in Fig. 4. It can be observed that different 2D models produce heterogeneous yet complementary results. For example, the table in the top-left sample of Fig. 4 is clustered differently by each model, while our agglomerated model is able to output more accurate and consistent clusters. Meanwhile, we also utilize UMAP [70] to better illustrate the intrinsic characteristics when adopting a specific 2D model for 3D model distillation, since the output feature embeddings of DINOv2 and Stable Diffusion, which have various dimensions, are not designed for direct matching with text embeddings. As shown on the right side of Fig. 4, DINOv2 yields smoother and more consistent results, even though Lseg has been trained with dense supervision to align with the text encoder. Stable Diffusion presents intriguing geometric characteristics. All of these models can contribute to agglomerating potential foundation 3D models through distillation.

Table 2. Open-vocabulary 3D semantic segmentation results. We compare our CUA-O3D with recent fully supervised (Fully-sup.) and zero-shot (Zero-shot) baselines. Our method demonstrates competitive performance on both ScanNetV2 and Matterport3D. † denotes results from the original paper based on Lseg.
Type | Method | ScanNetV2 mIoU | ScanNetV2 mAcc | Matterport3D mIoU | Matterport3D mAcc
Fully-sup. | TangentConv [77] | 40.9 | - | - | 46.8
Fully-sup. | TextureNet [31] | 54.8 | - | - | 63.0
Fully-sup. | ScanComplete [14] | 56.6 | - | - | 44.9
Fully-sup. | DCM-Net [75] | 65.8 | - | - | 66.2
Fully-sup. | Mix3D [56] | 73.6 | - | - | -
Fully-sup. | SupCon [96] | 69.2 | 77.7 | 53.1 | 63.4
Fully-sup. | LGround [69] | 73.2 | - | - | 67.2
Fully-sup. | MinkowskiNet [11] | 69.2 | 77.7 | 53.1 | 63.4
Upper-bound | MinkowskiNet (reimple) [11] | 68.96 | 77.41 | 54.12 | 65.57
Zero-shot | MSeg Voting [41] | 45.6 | 54.4 | 33.4 | -
Zero-shot | PLA [16] | 17.7 | 33.5 | - | -
Zero-shot | CLIP2Scene [9] | 25.1 | - | - | -
Zero-shot | CNS [10] | 26.8 | - | - | -
Zero-shot | CLIP-FO3D [90] | 30.2 | 49.1 | - | -
Zero-shot | RegionPLC [85] | 43.8 | 65.6 | - | -
Zero-shot | DMA-text only [44] | 50.5 | 63.7 | 39.8 | 49.5
Zero-shot | OpenScene-3D† [62] | 52.9 | 63.2 | 41.9 | 51.2
Zero-shot | OpenScene-2D3D† [62] | 54.2 | 66.6 | 43.4 | 53.5
Zero-shot | OpenScene (reimple)-3D [62] | 51.6 | 63.1 | 40.5 | 48.8
Zero-shot | OpenScene (reimple)-2D3D [62] | 52.2 | 65.4 | 41.5 | 50.6
Zero-shot | (Ours) CUA-O3D (3D) | 54.1 | 64.1 | 41.3 | 49.5
Zero-shot | (Ours) CUA-O3D (2D3D) | 55.3 | 65.6 | 42.2 | 50.9
4.4. Open-Vocabulary 3D Semantic Segmentation
To showcase that our method CUA-O3D can boost the performance of the open-vocabulary 3D semantic segmentation model, we compare our method with existing works within two types of settings, fully supervised (Fully-sup.) and zero-shot. From Table 2, our method surpasses the recent work OpenScene (reimple) [62] by +2.5% mIoU under 3D-distill and +3.1% mIoU under 2D3D-ensemble on the ScanNetV2 val set, and by +0.8% mIoU under 3D-distill and +0.7% mIoU under 2D3D-ensemble on the Matterport3D val set, respectively. This further shows that our method not only distills heterogeneous yet complementary knowledge into the 3D model but also reconciles noisy 2D supervisions. This is achieved by the proposed deterministic uncertainty estimation, which adaptively captures inherent noise and then weighs the corresponding distillation. Supervised by various 2D foundation models, like Lseg, DINOv2, and Stable Diffusion, the 3D model learns to align with the open-vocabulary features together with spatial and geometric awareness. Some open-vocabulary 3D semantic segmentation visualizations are shown in Fig. 5. Additional experiments with AM-RADIO [65] can be found in our supplementary material.

Figure 4. Left side: K-Means is applied to cluster the projected 3D feature embeddings based on Lseg, DINOv2, Stable Diffusion and our final distilled features predicted by the 3D model. Right side: UMAP [70] is applied to project high-dimensional features into low-dimensional ones to visualize the structural characteristics. The white rectangle highlights the apparent heterogeneous yet complementary results.

Table 3. Cross-dataset evaluation. We evaluate the cross-dataset generalization capability of CUA-O3D. We perform this experiment when training on ScanNetV2 and evaluating on Matterport3D (ScanNetV2 → Matterport3D), and vice versa.
ScanNetV2 (train) → Matterport3D (eval)
Method | mIoU | mAcc
OpenScene [62] | 36.0 | 48.0
(Ours) CUA-O3D | 37.4 (+1.4) | 49.2 (+1.2)
Matterport3D (train) → ScanNetV2 (eval)
Method | mIoU | mAcc
OpenScene [62] | 36.5 | 44.0
(Ours) CUA-O3D | 38.6 (+2.1) | 46.6 (+2.6)
4.5. C oss-Da ase Gene aliza ion
We s udy he gene aliza ion capabili y o ou CUA-O3D o
unseen da ase s wi hou u he ﬁne- uning (i.e., ze o-sho ).
This is achie ed by e alua ing he c oss-da ase pe o mance
when aining and e alua ing di e en da ase s. We use Scan-
Ne V2 and Ma e po 3D as aining and e alua ion da ase s
espec i ely, and analyze he c oss-da ase esul s when ain-
ing on ScanNe V2 and e alua ing on Ma e po 3D, and ice
e sa. Table 3and Table 4 epo he c oss-da ase esul s
and wi h di e en g anula i ies, espec i ely.
Inpu GT OpenScene Ou s
ScanNe Ma e po
Figu e 5. Open- ocabula y 3D seman ic segmen a ion compa isons
in e ms o ScanNe V2 and Ma e po 3D. Ou app oach displays
supe io pe o mance o e he OpenScene, which is ega ded as
ou baseline. Bes iew zoom in and ou .
Table 4. Comparison on cross-dataset generalization. Both CUA-O3D and OpenScene are trained on ScanNet and zero-shot tested on the Matterport3D dataset. ‡ denotes the pure 3D results obtained from the official released model. K = 21 is derived from the original Matterport3D benchmark, while K = 40, 80, 160 are the K most common categories from the NYU label set provided in the benchmark.

Method                   Matterport21    Matterport40    Matterport80    Matterport160
                         mIoU   mAcc     mIoU   mAcc     mIoU   mAcc     mIoU   mAcc
OpenScene‡ [62]          36.0   48.0     21.1   27.5     10.8   13.9      6.0    8.1
(Ours) CUA-O3D (2D3D)    37.4   49.2     23.3   30.2     12.2   16.3      6.1    8.4

Table 5. Experimental results on the ScanNetV2 and Matterport3D val sets under linear probing evaluation. "Upper bound-fully sup." denotes the fully-supervised upper-bound results, while "Baseline init." means the model is initialized from our baseline model before linear probing.

Type                     Method                  ScanNetV2       Matterport3D
                                                 mIoU   mAcc     mIoU   mAcc
Upper bound-fully sup.   MinkowskiNet [11]       68.9   77.4     54.1   65.5
Baseline init.           MinkowskiNet [11]       54.4   64.7     36.1   43.0
Concat                   3-heads concat          62.1   72.7     45.8   55.3
Separate                 3-heads average         61.7   72.0     45.4   55.0
Single-head              Lseg-head               59.9   71.5     -      -
                         DINOv2-head             61.7   72.2     -      -
                         StableDiffusion-head    61.4   72.1     -      -

Our method improves the cross-dataset performance in both directions. As reported in Table 3, CUA-O3D consistently outperforms OpenScene, with +1.4% mIoU on ScanNetV2 → Matterport3D and +2.1% mIoU on Matterport3D → ScanNetV2. Interestingly, our approach also remains consistently superior across the different granularities in Table 4, with K = 21, 40, 80, and 160 common categories, under zero-shot evaluation on ScanNetV2 → Matterport3D.
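To make the zero-shot protocol concrete, the following is a minimal sketch of how per-point predictions and the reported metrics could be computed. It assumes that per-point 3D features already live in the same language space as the text embeddings of the K category names, and that mIoU and mAcc are averaged over the classes present in the ground truth; the function names and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def zero_shot_predict(point_feats, text_embeds):
    """Assign each point the class whose text embedding has the highest cosine similarity."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (p @ t.T).argmax(axis=1)

def miou_macc(pred, gt, num_classes):
    """Mean IoU and mean per-class accuracy from a confusion matrix (rows: gt, cols: pred)."""
    conf = np.bincount(gt * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    gt_count = conf.sum(axis=1)
    union = gt_count + conf.sum(axis=0) - tp
    present = gt_count > 0  # assumption: ignore classes absent from the ground truth
    iou = tp[present] / union[present]
    acc = tp[present] / gt_count[present]
    return float(iou.mean()), float(acc.mean())

# Toy example with K = 3 hypothetical categories.
text_embeds = np.eye(3)
point_feats = np.array([[2.0, 0.1, 0.0], [0.0, 0.2, 3.0], [0.1, 1.5, 0.0]])
pred = zero_shot_predict(point_feats, text_embeds)
gt = np.array([0, 2, 2])
miou, macc = miou_macc(pred, gt, num_classes=3)
```

Changing K (21, 40, 80, 160) in this setup only changes the rows of `text_embeds` and the label mapping; the 3D model itself is not retrained.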
4.6. Linear Probing
In this section, we explore how the trained 3D model performs after agglomerating various 2D models, using linear probing on the distilled 3D model. Specifically, we build a simple MLP layer on top of the frozen 3D backbone and train only this linear layer in a fully-supervised manner. As shown in Table 5, concatenating the three mapping layers corresponding to Lseg, DINOv2, and Stable Diffusion, and then mapping the concatenated features to the closed-set label space, achieves the best performance: 62.1% mIoU on ScanNetV2 val and 45.8% mIoU on Matterport3D val after tuning on the respective train split. On ScanNetV2, this is a 7.7% mIoU improvement over the model initialized from our baseline model (54.4% mIoU). We also observe that mapping only the DINOv2 layer of the 3D model attains very competitive segmentation performance, while mapping only the Lseg layer, which was trained to align with the text encoder, performs worse. This suggests that more suitable 2D models should be selected to help develop potential foundational 3D models, and that DINOv2 offers strong generalizability and flexibility, which is consistent with the observations in [18,51].
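The "Concat" variant of linear probing can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the frozen backbone is replaced by fixed toy features, the three head outputs and their dimensions are hypothetical, and the probe is a single softmax-regression layer trained with plain gradient descent.

```python
import numpy as np

def concat_heads(head_feats):
    """Concatenate per-point features from the distilled heads along the channel axis."""
    return np.concatenate(head_feats, axis=1)

def train_linear_probe(feats, labels, num_classes, lr=0.02, steps=300):
    """Fit one linear layer on frozen features with softmax cross-entropy."""
    n, d = feats.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # gradient of the loss w.r.t. logits
        W -= lr * (feats.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy frozen features standing in for the three heads (dimensions are made up).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
heads = []
for d in (8, 6, 4):
    centers = np.zeros((3, d))
    for k in range(3):
        centers[k, k::3] = 3.0                        # class-dependent pattern per head
    heads.append(centers[labels] + 0.5 * rng.normal(size=(300, d)))

X = concat_heads(heads)                               # shape (300, 18)
W, b = train_linear_probe(X, labels, num_classes=3)
train_acc = float(((X @ W + b).argmax(axis=1) == labels).mean())
```

Only `W` and `b` are updated here, which mirrors the idea that the backbone stays frozen; a single-head probe would pass one element of `heads` instead of the concatenation.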
Table 6. Ablation: contribution of each component, gradually added into the final training, based on zero-shot segmentation.
Baseline Lseg   +DINOv2   +SD   +Unc   +AutoW   +DeMean   mIoU ↑   mAcc ↑
51.4   62.3
✓   51.7   63.3
✓   51.4   62.4
✓   52.7   62.6
✓   53.5   64.2
✓ ✓   NAN   NAN
✓ ✓   54.1   64.1
4.7. Ablative Studies
In this section, we study the improvement from each proposed component. As shown in Table 6, we gradually add each component to the 3D model training. Combining only Lseg and DINOv2 yields a marginal gain in open-vocabulary 3D semantic segmentation; we conjecture that this is because DINOv2 has not been aligned with the language space, although it excels at spatial perception. The performance is then boosted by 1.3% mIoU when Stable Diffusion supervision is introduced, and further improved by 2.1% mIoU and 1.9% mAcc by our proposed deterministic uncertainty estimation, which helps the model adaptively harmonize the heterogeneous knowledge from the various 2D models. Interestingly, if we apply auto-weighting and let the model learn the weights by itself, training collapses; we surmise that this is due to the minimal optimization objective, with the model quickly falling into a trivial solution. Overall, CUA-O3D improves the 3D model from 51.4% mIoU and 62.3% mAcc to 54.1% mIoU and 64.1% mAcc, which further demonstrates the effectiveness of our method.
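The contrast between uncertainty-based weighting and unconstrained auto-weighting can be illustrated with a small sketch. The exact loss is not given in this section, so the code assumes a common homoscedastic-uncertainty form, sum over teachers of exp(-s_m) * L_m + s_m with s_m = log(sigma_m^2) and constant factors dropped, and it interprets auto-weighting as freely learnable non-negative weights without a regularizer. Both are assumptions for illustration, and the per-teacher loss values are made up.

```python
import numpy as np

def uncertainty_weighted_loss(losses, log_var):
    """Assumed form: sum_m exp(-s_m) * L_m + s_m, where s_m = log(sigma_m^2)."""
    losses = np.asarray(losses, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(np.sum(np.exp(-log_var) * losses + log_var))

def free_weighted_loss(losses, weights):
    """Unregularized weighting: sum_m w_m * L_m with w_m >= 0 (assumed auto-weighting)."""
    return float(np.sum(np.asarray(weights, dtype=float) * np.asarray(losses, dtype=float)))

# Hypothetical per-teacher distillation losses (e.g. Lseg, DINOv2, Stable Diffusion).
losses = np.array([0.8, 0.2, 0.5])

# With the log-variance penalty, the optimum is finite: s_m = log(L_m),
# so a teacher with larger loss receives a smaller weight exp(-s_m) = 1 / L_m.
s_star = np.log(losses)
best = uncertainty_weighted_loss(losses, s_star)
perturbed = uncertainty_weighted_loss(losses, s_star + 0.3)

# Without a penalty, the objective is minimized by w = 0 regardless of the losses,
# which is a trivial solution that carries no training signal.
collapsed = free_weighted_loss(losses, np.zeros(3))
```

Under these assumptions, the penalty term `s_m` is what prevents the weights from shrinking to zero, which is one plausible reading of why the unregularized variant diverges to NAN in Table 6.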
5. Conclusions
In this paper, we are the first to investigate cross-modal agglomeration of various 2D foundation models into one 3D model, in pursuit of a potential foundational 3D model. To resolve the heterogeneous bias and inherent noise in the 2D feature supervision, we propose a deterministic uncertainty estimation that captures 2D model-specific uncertainties across diverse semantic and geometric sensitivities, and we use it to weigh the corresponding distillation losses adaptively. In this way, the trained 3D model performs competitive open-vocabulary segmentation while achieving robust cross-domain alignment and strong spatial perception ability, which we hope will shed new light for the community.
Acknowledgments This work was supported by the MUR PNRR project FAIR (PE00000013) funded by the NextGenerationEU and the EU Horizon project ELIAS (No. 101120237). We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support.
References
[1] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In CVPR, pages 9163–9171, 2019.
[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 39(12):2481–2495, 2017.
[3] H. Bangalath, M. Maaz, M. Khattak, S. Khan, and F. Shahbaz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. NeurIPS, 2022.
[4] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In ICCV, 2019.
[5] L. Boyi, W. Kilian, B. Serge, K. Vladlen, and R. Rene. Language-driven semantic segmentation. In ICLR, 2022.
[6] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 3DV, 2017.
[7] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. NeurIPS, 30, 2017.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.
[9] Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. Clip2scene: Towards label-efficient 3d scene understanding by clip. In CVPR, pages 7020–7030, 2023.
[10] Runnan Chen, Youquan Liu, Lingdong Kong, Nenglun Chen, Xinge Zhu, Yuexin Ma, Tongliang Liu, and Wenping Wang. Towards label-free scene understanding by vision foundation models. In NeurIPS, 2024.
[11] C. Choy, J. Gwak, and S. Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In ICCV, 2019.
[12] T. Cortinhal, G. Tzelepis, and E. Erdal Aksoy. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds. In ISVC, 2020.
[13] A. Dai, A. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
[14] A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner. Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans. In CVPR, 2018.
[15] Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In CVPR, 2023.
[16] R. Ding, J. Yang, C. Xue, W. Zhang, S. Bai, and X. Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In CVPR, 2023.
[17] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[18] Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. In CVPR, pages 21795–21806, 2024.
[19] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICLR, 2016.
[20] G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin. Scaling open-vocabulary image segmentation with image-level labels. In ECCV, 2022.
[21] Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, pages 9224–9232, 2018.
[22] A. Graves. Practical variational inference for neural networks. NeurIPS, 2011.
[23] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
[24] H. Guo, H. Wang, and Q. Ji. Uncertainty-guided probabilistic transformer for complex action recognition. In CVPR, 2022.
[25] Huy Ha and Shuran Song. Semantic abstraction: Open-world 3D scene understanding from 2D vision-language models. In CORL, 2022.
[26] Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. Knowledge adaptation for efficient semantic segmentation. In CVPR, pages 578–587, 2019.
[27] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv, 2015.
[28] Yunzhong Hou and Liang Zheng. Visualizing adapted knowledge in domain transfer. In CVPR, pages 13824–13833, 2021.
[29] P. Hu, S. Sclaroff, and K. Saenko. Uncertainty-aware learning for zero-shot semantic segmentation. NeurIPS, 2020.
[30] Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia, and Tien-Tsin Wong. Bidirectional projection network for cross dimension scene understanding. In CVPR, 2021.
[31] J. Huang, H. Zhang, L. Yi, T. Funkhouser, M. Nießner, and L. Guibas. Texturenet: Consistent local parametrizations for learning from high-resolution signals on meshes. In CVPR, 2019.
[32] Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. Clip2point: Transfer clip to point cloud classification with image-depth pre-training. In ICCV, pages 22157–22167, 2023.