Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D
Scene Understanding
Jinlong Li1,†, Cristiano Saltori1, Fabio Poiesi2, Nicu Sebe1
1 University of Trento, 2 Fondazione Bruno Kessler
Abstract
The lack of a large-scale 3D-text corpus has led recent works
to distill open-vocabulary knowledge from vision-language
models (VLMs). However, these methods typically rely on a
single VLM to align the feature spaces of 3D models within
a common language space, which limits the potential of 3D
models to leverage the diverse spatial and semantic capabilities
encapsulated in various foundation models. In this paper,
we propose Cross-modal and Uncertainty-aware Agglomeration
for Open-vocabulary 3D Scene Understanding, dubbed
CUA-O3D, the first model to integrate multiple foundation
models (such as CLIP, DINOv2, and Stable Diffusion) into
3D scene understanding. We further introduce a deterministic
uncertainty estimation to adaptively distill and harmonize
the heterogeneous 2D feature embeddings from these models.
Our method addresses two key challenges: (1) incorporating
semantic priors from VLMs alongside the geometric knowledge
of spatially-aware vision foundation models, and (2)
using a novel deterministic uncertainty estimation to capture
model-specific uncertainties across diverse semantic and
geometric sensitivities, helping to reconcile heterogeneous
representations during training. Extensive experiments on
ScanNetV2 and Matterport3D demonstrate that our method
not only advances open-vocabulary segmentation but also
achieves robust cross-domain alignment and competitive spatial
perception capabilities. Project webpage: CUA-O3D.
1. Introduction
3D scene understanding serves as a crucial perception component
for a wide array of real-world applications that help
models better understand the physical world, including
robot navigation, autonomous vehicles, and virtual reality
[4, 21, 46, 79]. Typical approaches necessitate a dataset
of semantically annotated point clouds, which is both
time-consuming (e.g., 22.3 minutes for annotating a single scene
†Corresponding author: [email protected].
[Figure 1 panel labels: Lseg, DINOv2, Stable Diffusion; Input, GT.]
Figure 1. Top: feature distribution analysis of different 2D projected
feature embeddings from various foundation models (Lseg,
DINOv2 and Stable Diffusion), enumerating over the overall
ScanNetV2 train set and counting the frequency of all point features
within each bin interval. Bottom: a sample utilizing K-Means
to cluster projected 3D features into specified clusters to make
segmentation comparisons. Different foundation models illustrate
heterogeneous yet complementary results.
with 20 classes [13]) and unable to encompass all possible
existing categories, thereby limiting their practical
utility. To address this limitation, recent works have explored
the open-vocabulary 3D scene understanding setting
[15, 33, 52, 53, 62], aiming to localize and recognize
arbitrary object classes. This objective is achieved by leveraging
Vision-Language Models (VLMs) [34, 64] that, having
been pre-trained on billions of image-text pairs [74], can
gauge the alignment between textual and visual inputs, facilitating
a range of 2D open-vocabulary tasks [5, 20, 23, 39, 97].
To adapt these models to 3D downstream tasks (e.g., semantic
segmentation), various strategies, predominantly based
on distilling multi-view 2D visual features into a 3D-specific
model [15, 25, 33, 62, 84], have been employed.
However, these methods mainly focus on distilling knowledge
from one single VLM, which limits the potential of
3D models to leverage the diverse spatial and semantic capabilities
of models that have been trained on large-scale image or
image-text pair corpora, as shown in Table 1. Given the
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
Except for this watermark, it is identical to the accepted version;
the final published version of the proceedings is available on IEEE Xplore.
existence of various foundation models, such as CLIP [64],
DINOv2 [59] and Stable Diffusion [66], it remains
under-explored how to make better use of these 2D
foundation models to develop a 3D foundation model,
given that no large-scale 3D point-cloud or 3D-text pair corpus
is available. Very recently, some works [18, 51] have started
probing the potential of these 2D foundation models on 3D
tasks, since different VLMs or visual foundation models
showcase unique characteristics. Probe3D [18] posits that the 3D
awareness of some visual foundation models, like DINOv2,
presents well for depth and surface normals. Lexicon3D [51]
finds that diffusion models benefit geometric tasks. Nonetheless,
there is still a lack of studies on how to aggregate these
heterogeneous foundation models into a 3D model, which
naturally excels at geometric knowledge extraction and at
locating 3D spatial objects.
To explore the properties of various foundation models when
projecting multi-view posed image features to 3D space and
enabling model distillation, we first conduct a pilot study
to analyze and compare the heterogeneous results shown in
Fig. 1. We observe that the distribution of the
feature embeddings from each 2D foundation model mainly
follows a Gaussian-like distribution. At the same
time, when clustering these features into specific clusters,
the results illustrate heterogeneous and complementary effects
across different foundation models. This is also due to
the inconsistency across different posed images when the 2D
model encounters complex image contexts. This motivates
us to develop a new method to harmonize this heterogeneous
knowledge into a 3D model and to handle such noisy
inconsistency in the 2D feature embeddings used.
In this paper, we present Cross-modal and Uncertainty-aware
Agglomeration for Open-vocabulary 3D Scene Understanding,
dubbed CUA-O3D, the first method to integrate
multiple 2D foundation models into one 3D model for
scene understanding. Deterministic uncertainty estimation
is further introduced to adaptively distill and harmonize the
heterogeneous 2D feature embeddings from these models.
We show that there are potential inconsistencies, yet also
complementarity, when attempting to distill different foundation
models. Based on our pilot study, we first propose to leverage
a distillation loss to supervise 3D model training given
several available 2D feature embeddings. Our 3D model consists
of independent projection layers, each mapped to one
VLM or visual foundation model under feature supervision,
which helps reconcile the entanglements from heterogeneous
distributions. To resolve the potential noise from
2D models, which mainly comes from insufficient context, a
novel deterministic uncertainty estimation is tailored to adaptively
weight the knowledge distillation, which can be modeled
as a Gaussian likelihood that follows the distributions
shown at the top of Fig. 1. Specifically, for each
projection layer mapped to a specific 2D model, we
predict an independent observation noise scalar
to capture how much noise is contained in the feature supervision
during training, which is termed uncertainty-aware
learning. Moving one step forward, we observe that the
distribution of Stable Diffusion features is slightly shifted
away from the center scale, and usually carries broader value
ranges and heavy-tailed (“spike”) values in the projected
feature embeddings. A de-mean operation is then adopted to
re-center the feature scales, reducing the impact
of anomalous points while still allowing points with small
scale to guide the 3D model training.
As we show experimentally on ScanNetV2 [13] and Matterport3D
[6], our approach allows the 3D model to agglomerate
heterogeneous knowledge and to reconcile potential
noise from various 2D feature supervisions. Extensive
experiments demonstrate that our method not only advances
open-vocabulary segmentation but also achieves competitive
cross-domain alignment and spatial perception capabilities.
Additionally, we validate that our method can achieve
significant downstream performance after distillation. In
summary, the contributions of this work are:
• To the best of our knowledge, we are the first to investigate
  the agglomeration of cross-modal knowledge
  distillation from 2D models to a 3D model, given the various
  strong foundation models available.
• We analyze the heterogeneous yet complementary feature
  embeddings from multiple 2D models and incorporate
  both semantic- and geometric-aware knowledge into one
  single 3D model.
• We further propose a deterministic uncertainty estimation
  that enables the 3D model to predict independent observation
  scalars to capture the noise and resolve the heterogeneity
  of various feature supervisions.
• We evaluate our method in a wide set of experiments on
  3D open-vocabulary segmentation and present competitive
  cross-domain validation of our method, while also
  demonstrating strong downstream performance after distillation.
2. Related Works
Open-Vocabulary (OV) 3D scene understanding advances
over the previous large corpus of close-set approaches
[2, 8, 48, 49, 68, 84, 93], allowing robust zero-shot reasoning
and alleviating the need for annotations. Recent advances in
Visual-Language Models (VLMs) [34, 64] have driven OV
models towards remarkable levels of robustness, with numerous
emerging approaches tackling OV in image semantic
segmentation [5, 20, 47, 83], object detection [3, 98], and
recently universal segmentation [89]. In contrast, OV for 3D
scene understanding (OV3D) is limited by the data available
for training a purely fundamental 3D VLM. Alternatively,
the community achieves OV3D by distilling zero-shot knowledge
from recent VLMs [5, 64] and by mapping point cloud
features to a queryable CLIP space. In 3D semantic segmentation,
ConceptFusion [33] fuses VLM representations
from multiple views into 3D points. Some methods [76]
extend OV to 3D instance semantic segmentation based on
CLIP [64] or SAM [38] to align the 3D space with the language
space while enforcing an instance-mask constraint. However,
recent OV3D methods heavily rely on the zero-shot knowledge
of the 2D VLMs without investigating the reliability
of the projected 2D feature representation. In this work, we
move forward and explore how to aggregate knowledge from
various 2D foundation models.

Table 1. Comparison of Vision Foundation Models. Although all utilize the same Vision Transformer (ViT) backbone, they greatly differ
in their training paradigms, including data, image resolutions, and training objectives, which lead to diverse representation biases.
Model                  Training Dataset   Dataset Size   Architecture   Objective
ViT [17]               ImageNet-1k/21k    1.2M/14.2M     ViT-B/L/G      Supervised classification
DINOv2 [59]            LVD-142M           142M           ViT-L/14       Discriminative self-supervised learning
CLIP [64]              WebImageText       400M           ViT-L/14       Image-text contrastive learning
Stable Diffusion [66]  LAION              5B             UNet           Image-Text/Image Generation
Knowledge distillation (KD) [27, 60] aims at training compact
student models with supervision from more powerful
and larger teacher models [43]. Introduced by the
first work [27], in which the student model is trained to mimic the
prediction behavior of the teacher model, KD has been extensively
explored in subsequent works [1, 45, 67, 88, 92, 94]
and applied successfully in a wide range of tasks,
going from supervised training [54, 78, 82] to network
compression [7, 63, 73] and to domain adaptation [26, 28, 95].
Following the explosion of Visual-Language Models
like CLIP [64] and ALIGN [34], KD has been introduced
to efficiently transfer knowledge between different modalities
[23, 43, 50], and recently to bridge the gap between the
text and 3D point cloud modalities [9, 15, 32, 62, 85, 87, 91].
Recently, AM-RADIO [65] describes a general methodology
for distilling multiple distinct foundation models into
one, but still focuses only on the 2D domain. We primarily
focus on studying and tackling the ambiguity of the distilled
representations between image and point cloud modalities.
Uncertainty estimation has been widely investigated in various
tasks [24, 29, 35, 36, 40, 42, 55, 58] and is capable
of addressing the problem of quantifying the uncertainty
of a model's predictions. Uncertainty estimation can be
broadly classified into: (i) aleatoric estimation [35, 57, 80],
where the uncertainty is usually due to the underlying noise in the
measurement and an extra network is trained from
scratch to approximate a heteroscedastic distribution by
maximizing the likelihood of the system, and (ii) epistemic
estimation [19, 22, 40], where the uncertainty is induced by the
model parameters in low-data regimes as parameter estimation
becomes noisy. In 3D point clouds,
uncertainty estimation finds applications in incremental learning
[86], semantic segmentation [12], and domain adaptation
[71, 72, 81]. Unlike previous works, we seek a deterministic
uncertainty estimation for supervising 3D model
[Figure 2 panel labels: a) input, b) prediction, c) ground truth, d) 2D view-images prediction.]
Figure 2. Preliminary study on image embedding ambiguity. VLM
embeddings show inconsistent segmentations across multi-view
images (e.g. cabinet). The guidance with ambiguous embeddings
may be detrimental to supervising a 3D model training.
training under uncertainty awareness of the ambiguity between
2D image and 3D point modalities.
3. Methodology
In this section, we first describe the preliminary open-vocabulary
3D scene understanding task in Sec. 3.1. Then
we elucidate inconsistent results across multi-view posed
images to demonstrate the necessity of uncertainty-aware
training to alleviate such issues and embrace various 2D
foundation models in Sec. 3.2. Our Cross-Modal and Uncertainty-Aware
Agglomeration (CUA-O3D) method is described
in Sec. 3.3, including distillation agglomeration and deterministic
uncertainty estimation.
3.1. Open-Vocabulary 3D Scene Understanding
In standard 3D semantic segmentation, the training set $T_{train}$
includes point clouds and dense point-level annotations.
Each point cloud $X = \{p_i \in \mathbb{R}^3, i \in [0, N-1]\}$ consists of
$N$ points $p_i$ with corresponding point-level annotations $Y$.
These annotations are assigned from a predefined set of class
indices $\mathcal{K} = [1, \dots, K]$, where each index corresponds to
a specific class name in the vocabulary $V = [v_1, \dots, v_K]$.
Given $T_{train}$ and $\mathcal{K}$, the objective is to learn a deep neural
network correctly assigning a label from $\mathcal{K}$ to each point
$p_i \in X$. In contrast, open-vocabulary 3D semantic segmentation
(OV3D) aims to segment $X$ using an arbitrary vocabulary
$V$. Recent approaches achieve this with a pre-trained
VLM [62, 76]. The VLM provides a common embedding
space through two distinct encoders, namely, a vision encoder
$f^{2D}: I \to Z$ and a text encoder $f^{txt}: V \to Z$. The
common practice is to train a 3D vision encoder $\theta^{3D}$ to align
to $f^{2D}$ embeddings, bridging the modality gap and defining
$\theta^{3D}: X \to Z$. After training, $f^{txt}$ encodes $V$, and class
predictions are computed via similarity matching between
3D point cloud and textual vocabulary embeddings.
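
The similarity-matching step above can be sketched as follows. This is a minimal illustration in plain Python, assuming cosine similarity as the matching score and using toy 3-dimensional embeddings; the function names `cosine_similarity` and `assign_labels` are hypothetical and not part of the described method.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon avoids division by zero

def assign_labels(point_embeddings, text_embeddings):
    """For every point, return the index of the most similar class embedding."""
    labels = []
    for point in point_embeddings:
        scores = [cosine_similarity(point, text) for text in text_embeddings]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels

# Toy data (assumed): 3 point embeddings and 2 class-name embeddings
# living in the same 3-dimensional space Z.
point_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.3, 0.0]]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # e.g. "wall", "floor"
labels = assign_labels(point_embeddings, text_embeddings)
```

Because the vocabulary only enters through the text embeddings, changing the class list at test time requires re-encoding the class names but no retraining of the 3D encoder.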
3.2. Preliminary Observation
We first conduct a simple qualitative study to analyze how
embedding ambiguity affects $f^{2D}$ predictions before and
after projection to the point cloud space, based on the commonly
used Lseg [5]. We analyze the consistency of the predictions
for the same object appearing in multi-view images. Given
a pre-trained vision-language model $f^{2D}$ and multi-view
images $I$, we query $f^{2D}$ with known vocabularies $V$ and
report the qualitative results over multiple ScanNetV2 [13]
view-images in Fig. 2. We notice that predictions are consistent
for the classes wall, floor, and door while showing
inconsistency for the class cabinet. After projection to
the 3D point cloud space (left), the projected prediction inherits
this ambiguity, which is further detrimental to 3D
model distillation. Regarding the incorporation of the various
2D foundation models shown in Fig. 1, how to harmonize
their heterogeneous characteristics also matters. This simple
study highlights the need for a reliable uncertainty measure
capturing embedding ambiguity, which we hope will shed new
light on future work on this task.
3.3. Cross-Modal Agglomeration
We provide an overview of our CUA-O3D in Fig. 3. Our
approach leverages a 3D encoder backbone $\theta^{3D}$ to transform
the input point cloud $X$ into 3D sparse point cloud features
$F^{3D}$. Concurrently, we use several pre-trained vision encoders
$f^{2D}_i$, such as CLIP, DINOv2, and Stable Diffusion, to
map multi-view images $I$ to dense image features separately,
which are then projected to yield sparse image features $F^{2D}_i$.
Then, we construct three projection layers with a simple
MLP to map $F^{3D}_i$ to each 2D model through the corresponding
distillation loss. Meanwhile, the 3D model is
designed to output an independent deterministic uncertainty-aware
observation scalar $\sigma_i$ for each 2D model to adaptively weight the feature
supervisions. After training, we use $\theta^{3D}$ for the main task
of open-vocabulary 3D semantic segmentation via matching
with text embeddings $F^{txt}$ [76] and drop the uncertainty
module.
Distillation agglomeration. The distillation phase aims to
align the point embeddings $F^{3D}$ to the image embeddings $F^{2D}$
obtained from each frozen pre-trained visual encoder $f^{2D}_i$.
This is a common practice in recent OV3D approaches [76],
which leads to a shared embedding space between image,
text, and point cloud modalities. Given a point cloud $X$
paired with multi-view posed images $I$, we learn 3D sparse
features $F^{3D}$ by employing the well-established MinkowskiNet
[11] as a 3D sparse convolutional encoder $\theta^{3D}$. The
encoder $\theta^{3D}$ outputs a sparse set of point-wise feature vectors
$F^{3D} = \theta^{3D}(X)$, where each feature is associated with
an input point $p_i \in X$. Similarly, we distill the multi-view
image features from $I$ using each frozen vision encoder $f^{2D}_i$
separately, where $f^{2D}_i \in \{f^{2D}_{Lseg}, f^{2D}_{DINOv2}, f^{2D}_{SD}\}$. The
output of $f^{2D}_i$ is a set of feature vectors $F^{2D}_i = f^{2D}_i(I)$,
where each feature vector in $F^{2D}_i$ is associated with an input
pixel $u$. After projection in the homogeneous coordinate
space, we redefine $F^{2D}_i$ as a set of point-wise image features,
corresponding to each $f^{2D}_i$. We define the matches
between each point $p$ and pixel $u$ by using the corresponding
homogeneous coordinates $\tilde{p}$ and $\tilde{u}$, respectively. Once
matches are established, we enforce the alignment between
the predicted $F^{3D}_i$ and $F^{2D}_i$. Following [62], we train
$\theta^{3D}$ to minimize the distillation loss $\mathcal{L}_{distill}$; our independent
distillation losses are defined as,
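$$\text{Lseg:}\quad \mathcal{L}_{cos\_lseg} = 1 - \frac{F^{3D}_1 \cdot F^{2D}_1}{\|F^{3D}_1\|_2 \cdot \|F^{2D}_1\|_2} \tag{1}$$
$$\text{DINOv2:}\quad \mathcal{L}_{l1} = \frac{1}{n} \sum_{i=1}^{n} |F^{3D}_2 - F^{2D}_2| \tag{2}$$
$$\text{Stable Diffusion:}\quad \mathcal{L}_{cos\_sd} = 1 - \frac{F^{3D}_3 \cdot F^{2D}_3}{\|F^{3D}_3\|_2 \cdot \|F^{2D}_3\|_2} \tag{3}$$
which corresponds to minimizing the distance between each
$F^{3D}_i$ and $F^{2D}_i$, leading to the final distillation loss,
$$\mathcal{L}_{distill} = \mathcal{L}_{cos\_lseg} + \mathcal{L}_{l1} + \mathcal{L}_{cos\_sd}, \tag{4}$$
where the choice of distillation loss is studied in our supplementary
materials.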
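
The three losses and their sum can be sketched in plain Python as below. This is only an illustration under stated assumptions: features are lists of per-point vectors, the cosine loss is averaged over points, and the L1 loss is averaged over all points and channels (the text does not spell out the reduction). The helper names `cosine_loss` and `l1_loss` are hypothetical.

```python
import math

def cosine_loss(f3d, f2d):
    """Mean over points of 1 - cos(f3d_j, f2d_j), in the spirit of Eqs. (1) and (3)."""
    total = 0.0
    for a, b in zip(f3d, f2d):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        total += 1.0 - dot / (norm_a * norm_b + 1e-12)
    return total / len(f3d)

def l1_loss(f3d, f2d):
    """Mean absolute difference over all points and channels, in the spirit of Eq. (2)."""
    diffs = [abs(x - y) for a, b in zip(f3d, f2d) for x, y in zip(a, b)]
    return sum(diffs) / len(diffs)

# Toy per-head predictions (3D model) and targets (projected 2D features).
loss_lseg = cosine_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]])
loss_dino = l1_loss([[1.0, 2.0]], [[0.0, 4.0]])
loss_sd = cosine_loss([[2.0, 0.0]], [[4.0, 0.0]])

# Eq. (4): unweighted sum of the three distillation losses.
loss_distill = loss_lseg + loss_dino + loss_sd
```

The cosine loss ignores feature magnitude (the Stable Diffusion toy pair differs only in scale and gives a loss near zero), whereas the L1 loss penalizes magnitude differences directly.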
To further alleviate the impact of the Stable Diffusion features $F^{2D}_3$,
which contain sharp values in the projected feature embeddings,
we then adopt a de-mean operation to re-center the
feature scales, reducing the impact of anomalous points
while still allowing points with small scale to guide the 3D
model training:
$$F^{2D}_3 = F^{2D}_3 - \mu_{F^{2D}_3} \tag{5}$$
where $\mu_{F^{2D}_3}$ is the mean of $F^{2D}_3$ along the channel dimension.
Deterministic uncertainty estimation. As analyzed above,
different 2D foundation models encapsulate unique
characteristics, and even a single 2D model induces inherent
inconsistency across multi-view posed images, which calls
for appropriate measures. We therefore propose a
simple yet effective deterministic uncertainty-aware observation
scalar prediction to quantify the embedding ambiguity
within each cross-modal distillation, which learns adaptive
weights for the various 2D feature supervisions during
cross-modal training.
Specifically, we devise the 3D model $\theta^{3D}$ with three
independent noise scalar predictions $\sigma_i$, one w.r.t. each 2D model
$f^{2D}_i$. The output of the probabilistic model with weights
$W$, treated analogously to a regression task, can be modeled as a
Gaussian likelihood:
$$p(y \mid f^{W}(x)) = \mathcal{N}(f^{W}(x), \sigma^2), \tag{6}$$
Figure 3. Overview of CUA-O3D. We first utilize the Lseg, DINOv2 and Stable Diffusion models to extract multi-view posed image embeddings
and then use multi-view 3D projection to obtain the projected 3D features $F^{2D}_i$ to supervise the 3D model training. Three MLP layers are
established to map to each 2D model supervision independently, while a specific noise scalar prediction $\sigma_i$ through a deterministic
uncertainty estimation is learned and adopted to adaptively weight the corresponding distillation loss $\mathcal{L}$.
Here we assume the 2D foundation models follow similar
modeling, as demonstrated in Fig. 1, and that our dense alignment
training can be regarded as a continuous model output.
Combining all three 2D models we used, we then define the
multiple model outputs as:
$$p(y_1, \dots, y_K \mid f^{W}(x)) = \prod_{i=1}^{K} p(y_i \mid f^{W}(x)), \tag{7}$$
where index $i$ corresponds to the mapping of each 2D
model, $\{f^{2D}_{Lseg}, f^{2D}_{DINOv2}, f^{2D}_{SD}\}$. Based on maximum
likelihood modeling and taking the regression task as an
example, the log-likelihood of the model can then be optimized,
$$\log p(y \mid f^{W}(x)) \propto -\frac{1}{2\sigma^2} \|y - f^{W}(x)\|^2 - \log \sigma, \tag{8}$$
where $\sigma$ denotes the model's observation noise scalar, being
responsible for capturing the inherent uncertainty within
2D feature supervisions, which is mainly caused by heterogeneous
and noisy feature embeddings from various 2D
foundation models. Assuming that we have three outputs
corresponding to Lseg, DINOv2, and Stable Diffusion, each
following a Gaussian-like distribution, we then have:
$$p(y_1, y_2, y_3 \mid f^{W}(x)) = \prod_{i=1}^{3} p(y_i \mid f^{W}(x)) = \prod_{i=1}^{3} \mathcal{N}(y_i; f^{W}(x), \sigma_i^2), \tag{9}$$
and then, we formulate our training objective
$\mathcal{L}_{distill}(W, \sigma_1, \sigma_2, \sigma_3)$ for multiple mappings:
$$\mathcal{L}_{distill} = -\log p(y_1, y_2, y_3 \mid f^{W}(x)) \propto \frac{1}{2\sigma_1^2} \mathcal{L}_{cos\_lseg} + \frac{1}{2\sigma_2^2} \mathcal{L}_{l1} + \frac{1}{2\sigma_3^2} \mathcal{L}_{cos\_sd} + \log \sigma_1 \sigma_2 \sigma_3, \tag{10}$$
which leads to our overall training objective: $\mathcal{L} = \mathcal{L}_{distill}$.
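
As a reasoning aid, the step from Eq. 9 to Eq. 10 can be sketched as follows. The first line is the standard negative log-likelihood of a scalar Gaussian (as in Eq. 8), and the second sums it over the three independent outputs. Reading Eq. 10 as this expression with each squared error replaced by the corresponding distillation loss is an assumption made here to connect the two equations, not a derivation stated in the text.

```latex
\begin{align*}
-\log \mathcal{N}\!\left(y_i;\, f^{W}(x),\, \sigma_i^2\right)
  &= \frac{1}{2\sigma_i^2}\,\bigl\|y_i - f^{W}(x)\bigr\|^2 + \log \sigma_i + \text{const}, \\
-\log \prod_{i=1}^{3} \mathcal{N}\!\left(y_i;\, f^{W}(x),\, \sigma_i^2\right)
  &= \sum_{i=1}^{3} \frac{1}{2\sigma_i^2}\,\bigl\|y_i - f^{W}(x)\bigr\|^2
     + \log \sigma_1 \sigma_2 \sigma_3 + \text{const}.
\end{align*}
```

The constant collects the $\tfrac{1}{2}\log 2\pi$ terms, which do not depend on $W$ or $\sigma_i$ and can therefore be dropped from the objective.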
To the best of our knowledge, this is the first work to
explore deterministic uncertainty-aware modeling for agglomerating
multiple 2D foundation models into a single
unified 3D model, aiming toward the development of a potential
foundational 3D model. In Eq. 10,
each supervision signal from a 2D model contributes to
learning adaptive weighting during training. Specifically,
the uncertainty parameter $\sigma_i$, which characterizes the noise
level associated with the $i$-th 2D model $f^{2D}_i$, dynamically
adjusts the contribution of the corresponding loss term $\mathcal{L}_i$. As
$\sigma_i$ increases, indicating higher uncertainty, the effective
weight of $\mathcal{L}_i$ decreases, thereby down-weighting less reliable
supervision. Meanwhile, each $\sigma_i$ is implicitly regularized
by the logarithmic term of Eq. 10 to prevent excessive growth,
ensuring that all supervision
sources contribute meaningfully to the optimization. We
visualize the evolution of $\sigma_i$ throughout training in the
supplementary material. To further ensure stable optimization,
we add a small constant (set to 1.0) to each $\sigma_i$ to prevent the
loss from becoming negative during training:
$$\log \sigma_i \to \log(1.0 + \sigma_i). \tag{11}$$
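
The weighting in Eq. 10 combined with the substitution in Eq. 11 can be sketched as below. This is a plain-Python illustration with fixed numbers; in training the $\sigma_i$ would be learnable quantities predicted by the 3D model. The function name `uncertainty_weighted_loss`, the `offset` parameter name, and the toy values are assumptions for illustration.

```python
import math

def uncertainty_weighted_loss(losses, sigmas, offset=1.0):
    """Combine per-model losses as in Eq. (10), using the log term of Eq. (11).

    losses: distillation loss values, one per 2D model (Lseg, DINOv2, SD).
    sigmas: strictly positive observation noise scalars, one per 2D model.
    offset: constant added inside the logarithm (1.0 in the text).
    """
    total = 0.0
    for loss, sigma in zip(losses, sigmas):
        total += loss / (2.0 * sigma ** 2)  # larger sigma -> smaller weight
        total += math.log(offset + sigma)   # regularizer, >= 0 for sigma >= 0
    return total

# Toy loss values (assumed) for the three supervision signals.
losses = [0.5, 1.5, 0.0]
equal = uncertainty_weighted_loss(losses, [1.0, 1.0, 1.0])
# Raising sigma for the second model shrinks its loss weight from 1/2 to 1/8,
# at the price of a larger logarithmic penalty.
noisy_second = uncertainty_weighted_loss(losses, [1.0, 2.0, 1.0])
```

With non-negative losses and $\sigma_i \geq 0$, every term is non-negative once the offset of 1.0 is applied, which matches the stated purpose of Eq. 11.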
4. Experiments
We run extensive experiments over a wide set of tasks, going
from open-vocabulary segmentation and cross-domain generalization
to evaluation with fine-grained class vocabularies.
This section is organized as follows. Sec. 4.1-4.2
provide the dataset and implementation details of CUA-O3D
used in our experiments. Sec. 4.3 and Sec. 4.4 illustrate
the potential of embracing various 2D foundation
models and evaluate open-vocabulary 3D semantic segmentation,
compared with related methods. Sec. 4.5 reports
cross-domain generalization under common and fine-grained
class evaluation, while Sec. 4.6 presents the downstream
performance after distillation. The final Sec. 4.7 ablates and
analyzes the improvements of each proposed component.
4.1. Datasets
ScanNetV2 [13, 69] is a large-scale annotated indoor dataset.
It includes over 2.5 million camera views within more than
1.5k RGB-D scans, collected across 707 diverse indoor environments
such as offices and living rooms. The dataset is
enriched with annotations, including 3D camera poses and
point-level semantic segmentation. It is officially split into
1201 training and 312 validation scans sampled from 706 different
scenes, and a test set of 100 scans with hidden ground truth.
ScanNetV2 annotations cover 40 semantic classes, while the
official 3D semantic segmentation benchmark focuses on a
subset of 20 classes.
Matterport3D [6] is another large-scale RGB-D dataset
collected for 3D scene understanding in indoor settings, containing
90 buildings with multiple rooms on different floors
captured using a Matterport Pro Camera. It provides 10.8k
panoramic views within 90 real, building-scale scenes processed
from 194.4k RGB-D images. Each scene represents
a residential building with multiple rooms and is annotated
with camera poses and point-level semantic segmentation.
The official 3D semantic segmentation benchmark evaluates
performance across 21 semantic classes.
4.2. Implementation Details
We implement our method in the PyTorch framework [61]
and run experiments on a single NVIDIA A100 GPU
for ScanNetV2 and Matterport3D, respectively. We follow
[62] and use MinkUNet18A [11] as our 3D backbone, starting
from randomly initialized weights, and Lseg [5] as our
pre-trained VLM. During training, we adopt the Adam optimizer
[37] with an initial learning rate of 1e−4 and an
exponential decay to train our pipeline for 50 epochs. For
the teacher model parameter update, we set the momentum
coefficient β to 0.99 and γ to 1. Besides, we employ a
voxel size of 2 cm and a batch size of 2 for both the ScanNetV2
and Matterport3D experiments. Due to the GPU memory
limitation, we uniformly sample 20k point features to train
the model and input only the 3D point positions, without RGB
information, to the MinkowskiNet. We utilize random horizontal
flip and elastic distortion as data augmentations over
point clouds while following BP-Net [30] to apply color,
jitter, and hue feature transformations over 2D feature embeddings.
4.3. Qualitative Comparisons
To demonstrate the motivation for agglomerating various 2D foundation
models, we investigate the results of using K-Means
to cluster the projected feature embeddings from different
2D models and our distilled features, as shown in Fig. 4. As
Table 2. Open-vocabulary 3D semantic segmentation results. We
compare our CUA-O3D with recent fully supervised (Fully-sup.)
and zero-shot (Zero-shot) baselines. Our method demonstrates
competitive performance on both ScanNetV2 and Matterport3D.
† denotes results from the original paper based on Lseg.
Type         Method                         ScanNetV2        Matterport3D
                                            mIoU    mAcc     mIoU    mAcc
Fully-sup.   TangentConv [77]               40.9    -        -       46.8
             TextureNet [31]                54.8    -        -       63.0
             ScanComplete [14]              56.6    -        -       44.9
             DCM-Net [75]                   65.8    -        -       66.2
             Mix3D [56]                     73.6    -        -       -
             SupCon [96]                    69.2    77.7     53.1    63.4
             LGround [69]                   73.2    -        -       67.2
             MinkowskiNet [11]              69.2    77.7     53.1    63.4
Upper-bound  MinkowskiNet reimple [11]      68.96   77.41    54.12   65.57
Zero-shot    MSeg Voting [41]               45.6    54.4     33.4    -
             PLA [16]                       17.7    33.5     -       -
             CLIP2Scene [9]                 25.1    -        -       -
             CNS [10]                       26.8    -        -       -
             CLIP-FO3D [90]                 30.2    49.1     -       -
             RegionPLC [85]                 43.8    65.6     -       -
             DMA-text only [44]             50.5    63.7     39.8    49.5
             OpenScene-3D† [62]             52.9    63.2     41.9    51.2
             OpenScene-2D3D† [62]           54.2    66.6     43.4    53.5
             OpenScene reimple-3D [62]      51.6    63.1     40.5    48.8
             OpenScene reimple-2D3D [62]    52.2    65.4     41.5    50.6
             (Ours) CUA-O3D (3D)            54.1    64.1     41.3    49.5
             (Ours) CUA-O3D (2D3D)          55.3    65.6     42.2    50.9
can be observed, different 2D models produce heterogeneous
yet complementary results. Likewise, the table in
the top-left sample in Fig. 4 displays various clustering results,
and our agglomerated model is able to output
more accurate and consistent clusters. Meanwhile, we also
utilize UMAP [70] to better illustrate the intrinsic characteristics
obtained when adopting a specific 2D model for
3D model distillation, since the output feature embeddings
of DINOv2 and Stable Diffusion, which have different dimensions,
are not designed for direct matching with text embeddings.
As shown on the right side of Fig. 4, DINOv2 yields
smoother and more consistent results, even though Lseg has been
trained beforehand to align with the text encoder under dense
supervision. Stable Diffusion presents intriguing geometric
characteristics. All of these can contribute to agglomerating
potential foundation 3D models through distillation.
4.4. Open-Vocabulary 3D Semantic Segmentation
To showcase that our method CUA-O3D can boost the performance
of the open-vocabulary 3D semantic segmentation
model, we compare our method with existing works within
two types of settings, fully supervised (Fully-sup.)
and zero-shot. From Table 2, our method surpasses
the recent work OpenScene reimple [62] by +2.5% mIoU
under 3D-distill and +3.1% mIoU under 2D3D-ensemble
on the ScanNetV2 val set, and by +0.8% mIoU under 3D-distill
and +0.7% mIoU under 2D3D-ensemble on the Matterport3D
val set, respectively. This further shows that our method not
only distills heterogeneous yet complementary knowledge
Figure 4. Left side: KMeans is applied to cluster the projected 3D feature embeddings based on Lseg, DINOv2, Stable Diffusion and our
final distilled feature predicted by the 3D model. Right side: UMAP [70] is applied to project high-dimensional features into low-dimensional
ones to visualize the structural characteristics. White rectangles highlight the apparent heterogeneous yet complementary results.
Table 3. Cross-dataset evaluation. We evaluate the cross-dataset
generalization capability of CUA-O3D. We perform this experiment
when training on ScanNetV2 and evaluating on Matterport3D
(ScanNetV2 → Matterport3D), and vice versa.
Method              mIoU          mAcc
ScanNetV2 (train) → Matterport3D (eval)
OpenScene [62]      36.0          48.0
(Ours) CUA-O3D      37.4 (+1.4)   49.2 (+1.2)
Matterport3D (train) → ScanNetV2 (eval)
OpenScene [62]      36.5          44.0
(Ours) CUA-O3D      38.6 (+2.1)   46.6 (+2.6)
into the 3D model but also reconciles noisy 2D supervision.
This is achieved by the proposed deterministic
uncertainty estimation, which adaptively captures inherent
noise and then weights the corresponding distillation. Supervised
by various 2D foundation models, like Lseg, DINOv2,
and Stable Diffusion, the 3D model learns to align with
the open-vocabulary features together with spatial and geometric
awareness. Some open-vocabulary 3D semantic
segmentation visualizations are shown in Fig. 5. Additional
experiments with AM-RADIO [65] can be found in our
supplementary material.
4.5. Cross-Dataset Generalization
We study the generalization capability of our CUA-O3D to
unseen datasets without further fine-tuning (i.e., zero-shot).
This is achieved by evaluating the cross-dataset performance
when training and evaluating on different datasets. We use
ScanNetV2 and Matterport3D as training and evaluation datasets
respectively, and analyze the cross-dataset results when training
on ScanNetV2 and evaluating on Matterport3D, and vice
versa. Table 3 and Table 4 report the cross-dataset results
overall and at different granularities, respectively.
[Figure 5 panel labels: Input, GT, OpenScene, Ours; ScanNet, Matterport.]
Figure 5. Open-vocabulary 3D semantic segmentation comparisons
on ScanNetV2 and Matterport3D. Our approach displays
superior performance over OpenScene, which is regarded as
our baseline. Best viewed zoomed in and out.
We notice that our method improves the cross-dataset
performance in both directions. As reported in Table 3,
CUA-O3D consistently outperforms OpenScene in both directions
with +1.4% mIoU improvements on ScanNetV2
Table 4. Comparison on cross-dataset generalization. Both CUA-O3D and OpenScene are trained on ScanNet, and zero-shot tested on
the Matterport3D dataset. ‡ denotes the pure 3D results obtained from the official released model. K = 21 is derived from the original
Matterport3D benchmark, while K = 40, 80, 160 are the K most common categories from the NYU label set provided in the benchmark.
Method                   Matterport21    Matterport40    Matterport80    Matterport160
                         mIoU   mAcc     mIoU   mAcc     mIoU   mAcc     mIoU   mAcc
OpenScene‡ [62]          36.0   48.0     21.1   27.5     10.8   13.9     6.0    8.1
(Ours) CUA-O3D (2D3D)    37.4   49.2     23.3   30.2     12.2   16.3     6.1    8.4
Table 5. Experimental results on the ScanNetV2 and Matterport3D
val sets under linear probing evaluation. Upper bound-fully sup.
denotes the fully-supervised upper-bound results, while Baseline
init. means initializing the model from our baseline model and then
performing linear probing evaluation.
Type                     Method                  ScanNetV2       Matterport3D
                                                 mIoU   mAcc     mIoU   mAcc
Upper bound-fully sup.   MinkowskiNet [11]       68.9   77.4     54.1   65.5
Baseline init.           MinkowskiNet [11]       54.4   64.7     36.1   43.0
Concat                   3-heads concat          62.1   72.7     45.8   55.3
Separate                 3-heads average         61.7   72.0     45.4   55.0
Single-head              Lseg-head               59.9   71.5     -      -
                         DINOv2-head             61.7   72.2     -      -
                         StableDiffusion-head    61.4   72.1     -      -
→ Matterport3D and +2.1% mIoU improvements on Matterport3D
→ ScanNetV2. Interestingly, we notice that our
approach also provides consistent superiority across different
granularities in Table 4, ranging among K = 21, 40, 80,
and 160 common categories in terms of zero-shot evaluation
on ScanNetV2 → Matterport3D.
4.6. Linear Probing
In this section, we explore how the trained 3D model
performs after agglomerating various 2D models. To this end,
we conduct linear probing experiments on the 3D model
after it has been distilled from the 2D models.
Specifically, we construct a simple MLP layer on top of the
frozen 3D model backbone and train only this linear layer in
a fully-supervised manner. As shown in Table 5,
concatenating the three mapping layers corresponding
to Lseg, DINOv2, and Stable Diffusion, and then mapping the
concatenated features to the closed-set label space,
achieves the best performance: 62.1% mIoU on ScanNetV2
and 45.8% mIoU on Matterport3D val after tuning
on train. On ScanNetV2, this is a 7.7% mIoU improvement
over the model initialized from our baseline model (54.4%).
We also observe that mapping the DINOv2 layer alone
attains very competitive segmentation performance,
while mapping only the Lseg layer, which was
previously trained to align with the text encoder, yields
inferior results. These results suggest that
more suitable 2D models should be selected to help develop
potential foundational 3D models, and that DINOv2 exhibits
strong generalizability and flexibility, which is consistent
with the observations in [18, 51].
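The linear probing protocol described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the frozen backbone and its three mapping heads are replaced by precomputed synthetic feature arrays, the probe is a single softmax linear layer trained by full-batch gradient descent, and the names `concat_heads` and `linear_probe` are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def concat_heads(head_feats):
    """'Concat' variant: join per-head features along the channel axis."""
    return np.concatenate(head_feats, axis=1)

def linear_probe(feats, labels, num_classes, lr=0.1, epochs=300, seed=0):
    """Fit a softmax linear classifier on frozen features (backbone is never updated)."""
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n        # gradient of mean cross-entropy w.r.t. logits
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic stand-ins for the outputs of three frozen mapping heads.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
prototypes = 2.0 * np.eye(3, 4)            # one 4-D prototype per class
heads = [prototypes[labels] + 0.1 * rng.normal(size=(300, 4)) for _ in range(3)]

feats = concat_heads(heads)                # shape (300, 12)
W, b = linear_probe(feats, labels, num_classes=3)
acc = float(((feats @ W + b).argmax(axis=1) == labels).mean())
```

The single-head rows of Table 5 correspond to passing one element of `heads` instead of the concatenation. How the "3-heads average" variant combines the heads is not specified in this section, so it is not sketched here.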
Table 6. Ablation: Contribution of each component when gradually
added into the final training, based on zero-shot segmentation.
Baseline Lseg +DINOv2 +SD +Unc +AutoW +DeMean mIoU ↑ mAcc ↑
51.4 62.3
51.7 63.3
51.4 62.4
52.7 62.6
53.5 64.2
NAN NAN
54.1 64.1
4.7. Ablative Studies
In this section, we study the improvement from each
proposed component. As shown in Table 6, we gradually
add each component to the 3D model training. We
find that combining only Lseg and DINOv2 leads to a
marginal gain in open-vocabulary 3D semantic segmentation.
We conjecture that this is because DINOv2 has not been
aligned with the language space, although it excels at
spatial perception. The performance is then boosted by 1.3% mIoU
when introducing Stable Diffusion supervision, and is further
improved by 2.1% mIoU and 1.9% mAcc via our proposed
deterministic uncertainty estimation, which helps the model
adaptively harmonize the heterogeneous knowledge from the various 2D models.
Interestingly, if we apply auto-weighting to let the model
learn the weights by itself, the training collapses (NAN in
Table 6). We surmise that this is due to the minimal
optimization objective, which lets the model quickly fall into
a trivial solution. Overall, CUA-O3D improves the 3D model
from 51.4% mIoU and 62.3% mAcc to 54.1% mIoU and 64.1% mAcc,
which further demonstrates the effectiveness of our method.
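To make the idea of uncertainty-weighted distillation concrete, the sketch below uses a generic log-variance weighting of the form exp(-s) * loss + s. This section does not give the exact formulation, so the weighting form, the cosine-distance distillation loss, and all function names are assumptions for illustration only. The last function drops the regularizer to show one plausible reading of how freely learned weights can reach a trivial solution. It is not a reproduction of the auto-weighting variant in Table 6.

```python
import numpy as np

def cosine_distance(student, teacher, eps=1e-8):
    """Per-point distillation loss: 1 - cosine similarity between feature rows."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return 1.0 - (s * t).sum(axis=1)

def uncertainty_weighted_loss(per_point_loss, log_var):
    """Assumed form: exp(-s) * loss + s, where s is a predicted log-variance."""
    return float(np.mean(np.exp(-log_var) * per_point_loss + log_var))

def unregularized_weighted_loss(per_point_loss, log_var):
    """Same weighting without the + s term; it can be driven to zero by growing s."""
    return float(np.mean(np.exp(-log_var) * per_point_loss))

# Identical student and teacher features give (near) zero distillation loss.
feats = np.array([[1.0, 0.0], [0.0, 2.0]])
zero_loss = cosine_distance(feats, feats)

# For a fixed per-point loss, the regularized objective is minimized at s = log(loss).
loss = np.array([0.5, 0.5])
s_opt = np.log(0.5)
at_opt = uncertainty_weighted_loss(loss, s_opt)
below = uncertainty_weighted_loss(loss, s_opt - 0.5)
above = uncertainty_weighted_loss(loss, s_opt + 0.5)

# Without the regularizer, a large s makes the objective vanish regardless of the loss.
trivial = unregularized_weighted_loss(loss, 10.0)
```

With the regularizer, points or teachers with a larger distillation loss receive a larger s and hence a smaller weight, but s cannot grow without cost. Without it, the objective can be minimized without fitting the teacher features at all.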
5. Conclusions
In this paper, we first investigate the cross-modal
agglomeration of various 2D foundation models into one 3D
model, in pursuit of a potential foundational 3D model. To
resolve the heterogeneous biases and inherent noise in the 2D
feature supervision, we then propose a deterministic
uncertainty estimation that captures 2D model-specific uncertainties
across diverse semantic and geometric sensitivities and
is leveraged to adaptively weight the corresponding distillation
losses. In this way, the trained 3D model achieves
competitive open-vocabulary segmentation together with
robust cross-domain alignment and strong spatial perception
ability, which we hope will shed new light for the community.
Acknowledgments This work was supported by the
MUR PNRR project FAIR (PE00000013) funded by the
NextGenerationEU and the EU Horizon project ELIAS (No.
101120237). We acknowledge the CINECA award under
the ISCRA initiative for the availability of high-performance
computing resources and support.
References
[1] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In CVPR, pages 9163–9171, 2019. 3
[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 39(12):2481–2495, 2017. 2
[3] H. Bangalath, M. Maaz, M. Khattak, S. Khan, and F. Shahbaz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. NeurIPS, 2022. 2
[4] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In ICCV, 2019. 1
[5] L. Boyi, W. Kilian, B. Serge, K. Vladlen, and R. Rene. Language-driven semantic segmentation. In ICLR, 2022. 1, 2, 4, 6
[6] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 3DV, 2017. 2, 6
[7] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. NeurIPS, 30, 2017. 3
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017. 2
[9] Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. Clip2scene: Towards label-efficient 3d scene understanding by clip. In CVPR, pages 7020–7030, 2023. 3, 6
[10] Runnan Chen, Youquan Liu, Lingdong Kong, Nenglun Chen, Xinge Zhu, Yuexin Ma, Tongliang Liu, and Wenping Wang. Towards label-free scene understanding by vision foundation models. In NeurIPS, 2024. 6
[11] C. Choy, J. Gwak, and S. Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In ICCV, 2019. 4, 6, 8
[12] T. Cortinhal, G. Tzelepis, and E. Erdal Aksoy. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds. In ISVC, 2020. 3
[13] A. Dai, A. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 1, 2, 4, 6
[14] A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner. Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans. In CVPR, 2018. 6
[15] Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In CVPR, 2023. 1, 3
[16] R. Ding, J. Yang, C. Xue, W. Zhang, S. Bai, and X. Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In CVPR, 2023. 6
[17] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3
[18] Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. In CVPR, pages 21795–21806, 2024. 2, 8
[19] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICLR, 2016. 3
[20] G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin. Scaling open-vocabulary image segmentation with image-level labels. In ECCV, 2022. 1, 2
[21] Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, pages 9224–9232, 2018. 1
[22] A. Graves. Practical variational inference for neural networks. NeurIPS, 2011. 3
[23] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021. 1, 3
[24] H. Guo, H. Wang, and Q. Ji. Uncertainty-guided probabilistic transformer for complex action recognition. In CVPR, 2022. 3
[25] Huy Ha and Shuran Song. Semantic abstraction: Open-world 3D scene understanding from 2D vision-language models. In CORL, 2022. 1
[26] Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. Knowledge adaptation for efficient semantic segmentation. In CVPR, pages 578–587, 2019. 3
[27] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv, 2015. 3
[28] Yunzhong Hou and Liang Zheng. Visualizing adapted knowledge in domain transfer. In CVPR, pages 13824–13833, 2021. 3
[29] P. Hu, S. Sclaroff, and K. Saenko. Uncertainty-aware learning for zero-shot semantic segmentation. NeurIPS, 2020. 3
[30] Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia, and Tien-Tsin Wong. Bidirectional projection network for cross dimension scene understanding. In CVPR, 2021. 6
[31] J. Huang, H. Zhang, L. Yi, T. Funkhouser, M. Nießner, and L. Guibas. Texturenet: Consistent local parametrizations for learning from high-resolution signals on meshes. In CVPR, 2019. 6
[32] Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. Clip2point: Transfer clip to point cloud classification with image-depth pre-training. In ICCV, pages 22157–22167, 2023. 3