Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching
Even Where Either Stereo or Mono Fail
Luca Bartolomei∗,† Fabio Tosi† Matteo Poggi∗,† Stefano Mattoccia∗,†
∗Advanced Research Center on Electronic System (ARCES)
†Department of Computer Science and Engineering (DISI)
University of Bologna, Italy
https://stereoanywhere.github.io/
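[Figure 1 image grid. Columns: RGB, Depth Anything 2 [121], RAFT-Stereo [55], Stereo Anywhere (Ours).
Outcome per row, in the order Depth Anything 2 / RAFT-Stereo / Stereo Anywhere:
Middlebury: ✓ ✓ ✓; Booster: ✓ ✗ ✓; MonoTrap: ✗ ✓ ✓.]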
Figure 1. Stereo Anywhere: Combining Monocular and Stereo Strengths for Robust Depth Estimation. Our model achieves accurate
results on standard conditions (on Middlebury [86]), while effectively handling non-Lambertian surfaces where stereo networks fail (on
Booster [127]) and perspective illusions that deceive monocular depth foundation models (on MonoTrap, our novel dataset).
Abstract
We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric
constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By
elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly
integrate stereo matching with learned contextual cues. Following this design, our framework
introduces novel cost volume fusion mechanisms that effectively handle critical challenges such
as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical
illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate
that our synthetic-only trained model achieves state-of-the-art results in zero-shot
generalization, significantly outperforming existing solutions while showing remarkable
robustness to challenging cases such as mirrors and transparencies.
1. Introduction
Stereo is a fundamental task that computes depth from a synchronized, rectified image pair by
finding pixel correspondences to measure their horizontal offset (disparity). Due to its
effectiveness and minimal hardware requirements, stereo has become prevalent in numerous
applications, from autonomous navigation to augmented reality. Although in principle
single-image depth estimation [3] requires an even simpler acquisition setup, its ill-posed
nature leads to scale ambiguity and perspective illusion issues that stereo methods inherently
overcome through well-established geometric multi-view constraints.
However, despite significant advances through deep learning [47,72], stereo models still face
two main challenges: (i) limited generalization across different scenarios, and (ii) critical
conditions that hinder matching or proper depth triangulation. Regarding (i), despite the
initial success of synthetic datasets in enabling deep learning for stereo, their limited
variety and simplified nature poorly reflect real-world complexity, and the scarcity of real
training data further hinders the ability to handle heterogeneous scenarios. As for (ii), large
textureless regions common in indoor environments make pixel matching highly ambiguous, while
occlusions and non-Lambertian surfaces [76,115,127] violate the fundamental assumptions linking
pixel correspondences to 3D geometry.
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
Except for this watermark, it is identical to the accepted version;
the final published version of the proceedings is available on IEEE Xplore.
We argue that both challenges are rooted in the underlying limitations of stereo training data.
Indeed, while data has scaled up to millions, or even billions, for several computer vision
tasks, stereo datasets are still constrained in quantity and variety. This is particularly
evident for non-Lambertian surfaces, which are severely underrepresented in existing datasets
as their material properties prevent reliable depth measurements from active sensors (e.g.
LiDAR).
In contrast, single-image depth estimation has recently witnessed a significant scale-up in
data availability, reaching the order of millions of samples and enabling the emergence of
Vision Foundation Models (VFMs) [22,43,120,121]. Such data abundance has influenced these
models in different ways, either through direct training on large-scale depth datasets
[120,121] or indirectly by leveraging networks pre-trained on billions of images for diverse
tasks [22,43]. Since these models rely on contextual cues for depth estimation, they show
better capability in handling textureless regions and non-Lambertian materials [75,81,128,129]
while being inherently immune to occlusions. Modern graphics engines have further accelerated
this progress, enabling rapid generation of high-quality synthetic data with dense depth
annotations. However, although synthetic datasets featuring non-Lambertian surfaces like
HyperSim [81] have proven effective for monocular depth estimation [75,128,129], this data
abundance has not translated to stereo. Despite efforts in generating stereo pairs via novel
view synthesis [24,54,104], available data remains insufficient for robust stereo matching.
In this paper, rather than focusing on costly real-world data collection or generating
additional synthetic datasets, we propose to bridge this gap by leveraging existing VFMs for
single-view depth estimation. To this end, we develop a novel dual-branch deep architecture
that combines stereo matching principles with monocular depth cues. Specifically, while one
branch of the proposed network constructs a cost volume from learned stereo image features, the
other branch processes depth predictions from the VFM on both left and right images to build a
second cost volume that incorporates depth priors to guide the disparity estimation process.
These complementary signals are then iteratively combined [55], along with novel augmentation
strategies applied to both cost volumes, to predict the final disparity map. Through this
design, our network achieves robust performance on challenging cases like textureless regions,
occlusions, and non-Lambertian surfaces, while requiring minimal synthetic stereo data.
Importantly, while leveraging monocular cues, our approach preserves stereo matching geometric
guarantees, effectively handling scenarios where monocular depth estimation typically fails,
such as in the presence of perspective illusions. We validate this through our novel dataset of
optical illusions, comprising 26 scenes with ground-truth depth maps.
We dub our framework Stereo Anywhere, highlighting its ability to overcome the individual
limitations of stereo and monocular approaches, as depicted in Fig. 1. To summarize, our main
contributions are:
• A novel deep stereo architecture leveraging monocular depth VFMs to achieve strong
generalization capabilities and robustness to challenging conditions.
• Novel data augmentation strategies designed to enhance the robustness of our model to
textureless regions and non-Lambertian surfaces.
• A dataset of optical illusions, which is particularly challenging for monocular depth VFMs.
• Extensive experiments showing Stereo Anywhere's superior generalization and robustness to
conditions critical for either stereo or monocular approaches.
2. Related Works
We briefly review the literature relevant to our work.
Deep Stereo Matching. In the last decade, stereo matching has transitioned from classical
hand-crafted algorithms [85] to deep learning solutions, leading to unprecedented accuracy in
depth estimation. Early deep learning efforts focused on replacing individual components of the
conventional pipeline [88,96,105,130,131]. Since DispNetC [61], end-to-end architectures have
evolved into 2D [53,92,125] and 3D [4,8,9,32,44,90,91,119,132,134] approaches, processing cost
volumes through correlation layers or 3D convolutions respectively. More recent advances,
thoroughly reviewed in [47,72,107], include recurrent architectures for stereo matching
[13,27,40,50,55,110,116,140] inspired by RAFT [99], Transformer-based solutions
[31,52,59,97,113,117,138] for capturing long-range dependencies, and fully data-driven MRF
models [28]. Among them, some methods specifically address temporal consistency in stereo
videos [41,42,133,137]. Domain generalization remains a major challenge, with various
approaches proposed including domain-invariant feature learning [17,56,80,93,135], hand-crafted
matching costs [7,15], integration of additional geometric cues [2,66,105], and exploitation of
sparse depth measurements from active sensors [5,49,69]. In parallel, self-supervised
approaches [25,57] have emerged as effective alternatives to supervised learning, even using
pseudo-labels from traditional algorithms [1,100] or deploying neural radiance fields [104].
Despite the numerous attempts to
[Figure 2 diagram. Block labels: Stereo Pair; Feature extraction Backbone; Correlation Volume;
Truncate Function; Truncated Correlation Volume; Monocular Depth Estimations (MDEs); Estimated
Normal Maps; Correlation Volume from normals; 3D Hourglass; Aggregated Correlation Volume from
normals; Differentiable Scaler; Scaled MDE; Context Backbone; Correlation Pyramids; Correlation
Pyramids from normals; Final Disparity. Stage labels: Features Extraction; Correlation Pyramids
Building; Iterative Disparity Estimation.]
Figure 2. Stereo Anywhere Architecture. Given a stereo pair, (1) a pre-trained backbone is used to extract features and then build a
correlation volume. Such a volume is then truncated (2) to reject matching costs computed for disparity hypotheses lying behind
non-Lambertian surfaces – glasses and mirrors. On a parallel branch, the two images are processed by a monocular VFM to obtain two depth
maps (3): these are used to build a second correlation volume from retrieved normals (4). This volume is then aggregated through a 3D
CNN to predict a new disparity map, used to align the original monocular depth to metric scale through a differentiable scaling module (5).
In parallel, the monocular depth map from the left image is processed by another backbone (6) to extract context features. Finally, the
two volumes and the context features from monocular depth guide the iterative disparity prediction (7).
improve specific aspects through the aforementioned techniques, recent architectures achieve
remarkable generalization by combining their architectural advances with the increasing
availability of diverse training data, while online adaptation techniques enable further
improvements during deployment through self-supervised learning [45,67,71,101]. However,
despite progress on challenges like over-smoothing [103,118] and visually imbalanced stereo
[2,11,58,105], handling non-Lambertian surfaces remains particularly challenging due to limited
annotated data and complex appearance, with rare works like Depth4ToM [18] specifically
addressing this through semantic guidance. Among all the aforementioned approaches, there have
been limited attempts to integrate stereo with monocular cues [1,12,112], mostly in
self-supervised settings or through loose coupling between modalities.
Monocular Depth Estimation. Parallel to developments in stereo matching, single-image depth
estimation has evolved from hand-crafted features [82] to deep learning methods
[10,21,48,73,108], with self-supervised approaches [25,26,60,68,111,139,141] reframing the task
as an image reconstruction problem. This led to multi-task approaches incorporating flow
[79,102,124,142] and semantics [29,126], alongside advances in uncertainty estimation [34,70]
and dynamic object handling [46,63,98]. Affine-invariant models [20,77,78,109,122] marked a
breakthrough in cross-domain generalization, pioneered by MiDaS [78] and followed by works like
DPT [77] and, more recently, the Depth Anything series [120]. These approaches used different
data sources, from internet photos [51,94,95,122] to car sensors [23,62] and RGB-D devices
[16,64], representing the first generation of VFMs for monocular depth estimation. Recent works
have focused on metric depth estimation through camera parameter integration [30,35,123],
diffusion models [19,22,33,38,43,83,84], and temporal consistency [36,89]. Moreover,
material-aware methods [18], diffusion models [106], and large-scale synthetic datasets have
enabled robust monocular depth estimation for non-Lambertian surfaces [121]. Stereo methods,
however, still struggle with these surfaces due to limited real-world and synthetic annotated
data, affecting generalization. We address this by integrating robust monocular VFMs into a
stereo architecture.
Concurrent Works. Finally, we mention some solutions for stereo [14,39,114] and for multi-view
stereo [37], developed in parallel with ours and sharing a similar rationale.
3. Method Overview
Given a rectified stereo pair $I_L, I_R \in \mathbb{R}^{3 \times H \times W}$, we first obtain
monocular depth estimates (MDEs) $M_L, M_R \in \mathbb{R}^{1 \times H \times W}$ using a generic
VFM $\phi_M$ for monocular depth estimation. We aim to estimate a disparity map
$D = \phi_S(I_L, I_R, M_L, M_R)$, incorporating VFM priors to provide accurate results even
under challenging conditions, such as texture-less areas, occlusions, and non-Lambertian
surfaces. At the same time, our stereo network $\phi_S$ is designed to avoid depth estimation
errors that could arise from relying solely on contextual cues, which can be ambiguous, as in
the presence of visual illusions.
Following recent advances in iterative models [55], Stereo Anywhere comprises three main
stages, as shown in Fig. 2: I) Feature Extraction, II) Correlation Pyramids Building, and
III) Iterative Disparity Estimation.
3.1. Feature Extraction
Two distinct types of features are extracted [55]: image features and context features – (1)
and (6) in Fig. 2. The image features are obtained through a feature encoder processing the
stereo pair, yielding feature maps
$F_L, F_R \in \mathbb{R}^{D \times \frac{H}{4} \times \frac{W}{4}}$, which are used to build a
stereo correlation volume at $\frac{1}{4}$ of the original input resolution. These encoders are
initialized with pre-trained weights [55] and the image encoder is kept frozen during training.
For context features, we employ a context encoder with identical architecture to the feature
encoder, but processing the monocular depth estimate aligned with the reference image $M_L$ –
(3) in Fig. 2 – instead of $I_L$, to capture strong geometry priors. Accordingly, during
training the context encoder is optimized to extract meaningful features from these depth maps.
3.2. Correlation Pyramids Building
As a standard practice in stereo matching, the cost volume is the data structure encoding the
similarity between pixels across two images. Accordingly, our model utilizes cost volumes,
specifically Correlation Pyramids [55], but in a novel manner. Indeed, Stereo Anywhere
constructs two correlation pyramids: a stereo correlation volume derived from $I_L, I_R$ to
encode image similarities, and a monocular correlation volume from $M_L, M_R$ to encode
geometric similarities – (2) and (4) in Fig. 2. Unlike the former, the latter remains
unaffected by non-Lambertian surfaces, assuming a robust $\phi_M$.
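Stereo Correlation Volume. Given $F_L, F_R$, we construct a 3D correlation volume $V_S$ using
the dot product between feature maps:
$$(V_S)_{ijk} = \sum_{h} (F_L)_{hij} \cdot (F_R)_{hik}, \qquad V_S \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times \frac{W}{4}} \qquad (1)$$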
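As an illustration of Eq. (1), the following minimal sketch builds the correlation volume with plain nested lists. The function name `correlation_volume` and the toy features are assumptions made for this example; an actual implementation would use a tensor library and batched operations.

```python
def correlation_volume(feat_left, feat_right):
    """Eq. (1): V[i][j][k] = sum_h F_L[h][i][j] * F_R[h][i][k].

    Both inputs are nested lists of shape (channels, rows, cols).
    """
    channels = len(feat_left)
    rows = len(feat_left[0])
    cols = len(feat_left[0][0])
    return [
        [
            [
                sum(feat_left[h][i][j] * feat_right[h][i][k] for h in range(channels))
                for k in range(cols)  # candidate column in the right image
            ]
            for j in range(cols)  # column in the left image
        ]
        for i in range(rows)
    ]


# Toy features (hypothetical): 2 channels, 1 row, 3 columns.
# Left pixel 1 and right pixel 0 share the feature (0, 1), i.e. a disparity of 1.
F_L = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]]]
F_R = [[[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]]]
V_S = correlation_volume(F_L, F_R)
```

Each curve `V_S[i][j]` scores every right-image column `k` of the same row against left pixel `(i, j)`; here the curve of left pixel 1 peaks at `k = 0`.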
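Monocular Correlation Volume. Given $M_L, M_R$, we downsample them to 1/4, compute their normals
$\nabla_L, \nabla_R$, and construct a 3D correlation volume $V_M$ using the dot product between
normal maps:
$$(V_M)_{ijk} = \sum_{h} (\nabla_L)_{hij} \cdot (\nabla_R)_{hik}, \qquad V_M \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times \frac{W}{4}} \qquad (2)$$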
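A possible reading of Eq. (2) is sketched below. The text does not specify how normals are derived from the depth maps, so the forward-difference scheme in `normals_from_depth` is an assumption made purely for illustration, as are the function names.

```python
import math


def normals_from_depth(depth):
    """Unit normals from forward differences of a depth map (assumed scheme)."""
    rows, cols = len(depth), len(depth[0])
    normals = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Forward differences, set to zero on the last row/column.
            dzdx = depth[i][j + 1] - depth[i][j] if j + 1 < cols else 0.0
            dzdy = depth[i + 1][j] - depth[i][j] if i + 1 < rows else 0.0
            norm = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
            normals[i][j] = (-dzdx / norm, -dzdy / norm, 1.0 / norm)
    return normals


def normals_correlation(normals_left, normals_right):
    """Eq. (2): V_M[i][j][k] = <n_L[i][j], n_R[i][k]>."""
    return [
        [[sum(a * b for a, b in zip(nl, nr)) for nr in row_r] for nl in row_l]
        for row_l, row_r in zip(normals_left, normals_right)
    ]


# A fronto-parallel plane: every normal is (0, 0, 1), so every score equals 1.
flat = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
V_flat = normals_correlation(normals_from_depth(flat), normals_from_depth(flat))

# A ramp: normals change where the slope changes, so the scores differ.
ramp = [[0.0, 1.0, 2.0]]
V_ramp = normals_correlation(normals_from_depth(ramp), normals_from_depth(ramp))
```

The flat example shows why such a volume can be uninformative on its own: on a planar region every candidate match scores the same.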
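Given the absence of texture in $\nabla_L$ and $\nabla_R$, the resulting monocular volume $V_M$
will be less informative. To alleviate this problem we segment $V_M$ using the relative depth
priors from $M_L$ and $M_R$: to do so, we generate left and right segmentation masks
$\mathcal{M}_L^n \in \{0,1\}^{\frac{H}{4} \times \frac{W}{4} \times 1}$,
$\mathcal{M}_R^n \in \{0,1\}^{\frac{H}{4} \times 1 \times \frac{W}{4}}$. We refer the reader to
the supplementary material for a detailed description. Given the segmentation masks, we can
generate masked volumes as:
$$(V_M^n)_{ijk} = (\mathcal{M}_L^n)_{ij} \cdot (\mathcal{M}_R^n)_{ik} \cdot (V_M)_{ijk} \qquad (3)$$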
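The masking in Eq. (3) can be sketched for a single image row as follows. How the segmentation masks are actually produced is described only in the supplementary material, so `depth_bin_masks`, which splits relative depth in [0, 1] into equal-width bins, is a hypothetical stand-in used only to obtain some binary masks.

```python
def depth_bin_masks(depth_row, num_bins):
    """Hypothetical segmentation: one binary mask per equal-width depth bin."""
    masks = []
    for n in range(num_bins):
        lo, hi = n / num_bins, (n + 1) / num_bins
        last = n == num_bins - 1
        # The last bin is closed on the right so that depth == 1.0 is covered.
        masks.append([1 if (lo <= d < hi or (last and d == 1.0)) else 0 for d in depth_row])
    return masks


def masked_volume(volume_row, mask_left, mask_right):
    """Eq. (3) for one row: V^n[j][k] = M_L^n[j] * M_R^n[k] * V[j][k]."""
    return [
        [mask_left[j] * mask_right[k] * volume_row[j][k] for k in range(len(mask_right))]
        for j in range(len(mask_left))
    ]


# One image row with three pixels per view and two depth bins (toy values).
masks_left = depth_bin_masks([0.1, 0.1, 0.9], num_bins=2)
masks_right = depth_bin_masks([0.1, 0.9, 0.9], num_bins=2)
volume_row = [[1.0] * 3 for _ in range(3)]
near = masked_volume(volume_row, masks_left[0], masks_right[0])
far = masked_volume(volume_row, masks_left[1], masks_right[1])
```

Each masked volume keeps only matches between pixels that fall in the same depth segment in both views, which removes many ambiguous candidates from the texture-free normal correlation.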
Next, we insert a 3D Convolutional Regularization module $\phi_A$ to aggregate $V_M^n$,
resulting in $V'_M = \phi_A(V_M^1, \ldots, V_M^N, M_L, M_R)$, with $N = 8$. The architecture of
$\phi_A$ follows the one in [116], with a simple permutation to match the structure of the
correlation volumes. We propose an adapted version of CoEx [4] correlation volume excitation
that exploits both views. The resulting feature volumes
$V'_M \in \mathbb{R}^{F \times \frac{H}{4} \times \frac{W}{4} \times \frac{W}{4}}$ are fed to
two different shallow 3D conv layers $\phi_D$ and $\phi_C$ to obtain two aggregated volumes
$V_M^D = \phi_D(V'_M)$ and $V_M^C = \phi_C(V'_M)$, with
$V_M^D, V_M^C \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times \frac{W}{4}}$.
Differentiable Monocular Scaling. Volume $V_M^D$ will be used not only as a monocular guide for
the iterative refinement unit but also to estimate the coarse disparity maps $\hat{D}_L$,
$\hat{D}_R$, while $V_M^C$ is used to estimate confidence maps $\hat{C}_L$, $\hat{C}_R$. These
maps are then used to scale both $M_L$ and $M_R$ – (5) in Fig. 2. To estimate left disparity
from a correlation volume, we first perform a softargmax on the last $W$ dimension of $V_M^D$
to extract the correlated pixel x-coordinate. Then, given the relationship between left
disparity and correlation $d_L = j_L - j_R$, we obtain a coarse disparity map $\hat{D}_L$:
$$(\hat{D}_L)_{ij} = j - \mathrm{softargmax}_L(V_M^D)_{ij} \qquad (4)$$
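Eq. (4) can be illustrated on a single image row with the sketch below. The helper names and the toy volume are assumptions; the softargmax is written as the expected index under a softmax, which is the usual differentiable replacement for argmax.

```python
import math


def softargmax(curve):
    """Expected index under softmax(curve): a differentiable stand-in for argmax."""
    peak = max(curve)
    weights = [math.exp(c - peak) for c in curve]  # subtract the max for stability
    total = sum(weights)
    return sum(k * w for k, w in enumerate(weights)) / total


def coarse_left_disparity(volume_row):
    """Eq. (4) for one image row: D_L[j] = j - softargmax_k(V[j][k])."""
    return [j - softargmax(curve) for j, curve in enumerate(volume_row)]


# Toy row: the curve of left column j peaks sharply at right column j - 1,
# which encodes a disparity of one pixel for columns 1..3.
row = [[20.0 if k == max(j - 1, 0) else 0.0 for k in range(4)] for j in range(4)]
disparity = coarse_left_disparity(row)
```

With sharp, single-peaked curves the result is close to the hard argmax; flatter curves pull the estimate toward the centre of the row, which is one reason a separate confidence map is useful.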
Similarly, we estimate $\hat{D}_R$ from $V_M^D$. We refer the reader to the supplementary for
details. We also estimate a pair of confidence maps
$\hat{C}_L, \hat{C}_R \in [0,1]^{H \times W}$ to classify outliers and perform robust scaling.
Inspired by information entropy, we measure the chaos within correlation curves: clear
monomodal-like cost curves (those with low entropy) are reliable, while chaotic curves with
high entropy indicate uncertainty. To estimate the left confidence map, we perform a softmax
operation on the last $W$ dimension of $V_M^C$, then $\hat{C}_L$ is obtained as follows:
$$(\hat{C}_L)_{ij} = 1 + \frac{\sum_{d}^{\frac{W}{4}} \frac{e^{(V_M^C)_{ijd}}}{\sum_{f}^{\frac{W}{4}} e^{(V_M^C)_{ijf}}} \cdot \log_2\!\left(\frac{e^{(V_M^C)_{ijd}}}{\sum_{f}^{\frac{W}{4}} e^{(V_M^C)_{ijf}}}\right)}{\log_2\!\left(\frac{W}{4}\right)} \qquad (5)$$
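The entropy-based confidence of Eq. (5) reduces to a few lines for a single correlation curve. The sketch below uses an assumed function name and toy curves; it returns 0 for a uniform curve (maximum entropy) and approaches 1 for a sharply peaked one.

```python
import math


def entropy_confidence(curve):
    """Eq. (5): 1 + sum_d p_d * log2(p_d) / log2(len(curve)), with p = softmax(curve)."""
    peak = max(curve)
    weights = [math.exp(c - peak) for c in curve]  # subtract the max for stability
    total = sum(weights)
    probs = [w / total for w in weights]
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    neg_entropy = sum(p * math.log2(p) for p in probs if p > 0.0)
    return 1.0 + neg_entropy / math.log2(len(curve))


uniform_conf = entropy_confidence([0.0, 0.0, 0.0, 0.0])  # chaotic curve
peaked_conf = entropy_confidence([50.0, 0.0, 0.0, 0.0])  # monomodal curve
```

Dividing by the logarithm of the curve length normalises the entropy, so the confidence stays in [0, 1] regardless of image width.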
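In the same way, we estimate $\hat{C}_R$. To further reduce outliers, we mask out occluded
pixels from $\hat{C}_L$ and $\hat{C}_R$ using a SoftLRC operator – see the supplementary
material for details. Finally, we estimate the scale $\hat{s}$ and shift $\hat{t}$ using a
differentiable weighted least-square approach:
$$\min_{\hat{s}, \hat{t}} \sum_{L,R} \left\| \sqrt{\hat{C}} \odot \left[ \left( \hat{s} M + \hat{t} \right) - \hat{D} \right] \right\|_F \qquad (6)$$
where $\|\cdot\|_F$ denotes the Frobenius norm. Using the scaling coefficients, we obtain two
disparity maps $\hat{M}_L$, $\hat{M}_R$:
$$\hat{M}_L = \hat{s} M_L + \hat{t}, \qquad \hat{M}_R = \hat{s} M_R + \hat{t} \qquad (7)$$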
[Figure 3 image grid. Panels, for each of two scenes: Image, Ground-Truth, Depth Anything 2 [121].]
Figure 3. Samples from MonoTrap Dataset. We report two scenes featured in our dataset, showing the left image, the ground-truth depth,
and the predictions by Depth Anything 2 [121], highlighting how it fails in the presence of visual illusions.
It is crucial to optimize both left and right scaling jointly to
obtain consistency between $\hat{M}_L$ and $\hat{M}_R$.
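A closed-form solution of a confidence-weighted least-squares fit of this kind can be sketched as follows. This example minimises the squared weighted residuals, which share their minimiser with a weighted norm objective; the function name, the flattening of left and right samples into one list, and the toy numbers are assumptions for illustration.

```python
def weighted_scale_shift(mono, disp, conf):
    """Closed-form minimiser of sum_p conf_p * (s * mono_p + t - disp_p)^2 over (s, t)."""
    sw = sum(conf)
    sm = sum(c * m for c, m in zip(conf, mono))
    sd = sum(c * d for c, d in zip(conf, disp))
    smm = sum(c * m * m for c, m in zip(conf, mono))
    smd = sum(c * m * d for c, m, d in zip(conf, mono, disp))
    # Normal equations: [[smm, sm], [sm, sw]] @ [s, t] = [smd, sd].
    det = smm * sw - sm * sm
    if abs(det) < 1e-12:
        raise ValueError("degenerate system: need two distinct confident samples")
    scale = (smd * sw - sm * sd) / det
    shift = (smm * sd - sm * smd) / det
    return scale, shift


# Joint fit: left and right samples are concatenated so both views share (s, t).
mono_left, mono_right = [0.0, 0.5, 1.0], [0.25, 0.75]
disp_left, disp_right = [2.0, 7.0, 12.0], [4.5, 99.0]  # last value is an outlier
conf_left, conf_right = [1.0, 1.0, 1.0], [1.0, 0.0]  # the outlier gets zero confidence
scale, shift = weighted_scale_shift(
    mono_left + mono_right, disp_left + disp_right, conf_left + conf_right
)
```

Because the solution involves only sums, products and one division, it is differentiable with respect to the inputs whenever the determinant is non-zero, and a zero confidence removes a sample from the fit entirely.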
Volume Augmentations. Unfortunately, Stereo Anywhere cannot properly learn when to choose
stereo or mono information from SceneFlow [61] alone. Hence, we propose three volume
augmentations and a monocular augmentation to overcome this issue: 1) Volume Rolling: we
randomly apply a rolling operation to the last $W$ dimension of $V_M^D$ or $V_S$; 2) Volume
Noising: we apply random noise sampled from the interval [0,1) using a uniform distribution;
3) Volume Zeroing: we apply a Gaussian-like curve with the peak where disparity equals zero.
Furthermore, we randomly substitute the monocular depth with ground truth normalized between
[0,1] as an additional augmentation. We apply only one volume augmentation to $V_M^D$ or $V_S$
and only to a section of the volume, randomly selecting an $\mathcal{M}_L^n$ mask.
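The three volume augmentations can be sketched on a single correlation curve as below. The text leaves several details open, so the following are assumptions of this example: noise is added to the curve rather than replacing it, the zeroing step replaces the curve with a Gaussian of hypothetical width `sigma`, and all function names are illustrative.

```python
import math
import random


def roll_curve(curve, shift):
    """Volume Rolling: circular shift of a correlation curve along the W axis."""
    shift %= len(curve)
    return curve[-shift:] + curve[:-shift] if shift else list(curve)


def noise_curve(curve, rng):
    """Volume Noising: add samples from U[0, 1) (additive noise is an assumption)."""
    return [c + rng.random() for c in curve]


def zero_disparity_curve(curve, column, sigma=1.0):
    """Volume Zeroing: Gaussian-like curve peaking at k == column (zero disparity)."""
    return [math.exp(-((k - column) ** 2) / (2.0 * sigma ** 2)) for k in range(len(curve))]


def augment_curve(curve, column, rng):
    """Apply exactly one randomly chosen augmentation to a curve."""
    choice = rng.choice(["roll", "noise", "zero"])
    if choice == "roll":
        return roll_curve(curve, rng.randrange(1, len(curve)))
    if choice == "noise":
        return noise_curve(curve, rng)
    return zero_disparity_curve(curve, column)


rng = random.Random(0)
curve = [1.0, 2.0, 3.0, 4.0]
rolled = roll_curve(curve, 1)
noisy = noise_curve(curve, rng)
zeroed = zero_disparity_curve(curve, column=2)
augmented = augment_curve(curve, column=2, rng=rng)
```

Each operation deliberately corrupts one of the two volumes, which plausibly forces the network to rely on the other branch for the affected region.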
Volume Truncation. To further help Stereo Anywhere handle mirror surfaces, we introduce a
hand-crafted volume truncation operation on $V_S$. Firstly, we extract the left confidence
$C_M = \mathrm{softLRC}_L(\hat{M}_L, \hat{M}_R)$ to classify reliable monocular predictions.
Then, we create a truncate mask $T \in [0,1]^{\frac{H}{4} \times \frac{W}{4}}$ using the
following logic condition:
$$(T)_{ij} = \left[ \left( (\hat{M}_L)_{ij} > (\hat{D}_L)_{ij} \right) \wedge (C_M)_{ij} \right] \vee \left[ (C_M)_{ij} \wedge \neg (\hat{C}_L)_{ij} \right]$$
We implement this logic using fuzzy operators (more details in the supplementary material). The
rationale is that stereo predicts farther depths on mirror surfaces: the mirror is perceived as
a window into a new environment, specular to the real one. Finally, for values of
$T > T_m = 0.98$, we truncate $V_S$ using a sigmoid curve centered at the correlation value
predicted by $\hat{M}_L$ (i.e., the real disparity of mirror surfaces), preserving only the
stereo correlation curve not "piercing" mirrors.
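One way to evaluate the truncation condition with fuzzy operators is sketched below. The actual operators are given only in the supplementary material; the product t-norm, probabilistic sum, logistic comparison, `sharpness` value and all names here are assumptions chosen because they are common fuzzy-logic conventions.

```python
import math


def fuzzy_and(a, b):
    return a * b  # product t-norm (assumed)


def fuzzy_or(a, b):
    return a + b - a * b  # probabilistic sum (assumed)


def fuzzy_not(a):
    return 1.0 - a


def soft_greater(a, b, sharpness=10.0):
    """Soft version of a > b in [0, 1] (hypothetical: logistic of the difference)."""
    return 1.0 / (1.0 + math.exp(-sharpness * (a - b)))


def truncate_mask(mono_disp, coarse_disp, mono_conf, coarse_conf):
    """T = [(M > D) AND C_M] OR [C_M AND NOT C_L], with fuzzy operators."""
    first = fuzzy_and(soft_greater(mono_disp, coarse_disp), mono_conf)
    second = fuzzy_and(mono_conf, fuzzy_not(coarse_conf))
    return fuzzy_or(first, second)


T_M = 0.98  # threshold stated in the text
larger_mono = truncate_mask(30.0, 10.0, mono_conf=1.0, coarse_conf=1.0)
smaller_mono = truncate_mask(10.0, 30.0, mono_conf=1.0, coarse_conf=1.0)
unreliable_mono = truncate_mask(30.0, 10.0, mono_conf=0.0, coarse_conf=1.0)
```

Only pixels whose mask value exceeds `T_M` would be truncated; when the monocular confidence is zero the mask is zero, so an unreliable monocular prediction never triggers truncation.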
3.3. Iterative Disparity Estimation
We aim to estimate a series of refined disparity maps
$\{D^1 = \hat{M}_L, D^2, \ldots, D^l, \ldots\}$ exploiting the guidance from both stereo and
mono branches. Starting from the Multi-GRU update operator by [55], we introduce a second
lookup operator that extracts correlation features $G_M$ from the additional volume $V_M^D$ –
(7) in Fig. 2. The two sets of correlation features from $G_S$ and $G_M$ are processed by the
same two-layer encoder and concatenated with features derived from the current disparity
estimation $D^l$. This concatenation is further processed by a 2D conv layer, and then by the
ConvGRU operator. We inherit the context upsampling module [55] to upsample the final disparity
to full resolution.
3.4. Training Supervision
We supervise the iterative module using the well-known L1 loss with exponentially increasing
weights [55], then $\hat{D}_L$, $\hat{D}_R$, $\hat{M}_L$ and $\hat{M}_R$ using the L1 loss, and
finally $\hat{C}_L$ and $\hat{C}_R$ using the Binary Cross Entropy loss. We invite the reader
to read the supplementary material for additional details.
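An L1 loss with exponentially increasing weights over the refinement iterations can be sketched as follows. The decay factor `gamma = 0.9` and the exact weighting `gamma ** (N - 1 - l)` are assumptions modelled on common practice for iterative estimators, not values stated in this section.

```python
def sequence_l1_loss(predictions, target, gamma=0.9):
    """Sum of per-iteration mean L1 errors, weighted by gamma ** (N - 1 - l)."""
    n = len(predictions)
    loss = 0.0
    for l, pred in enumerate(predictions):
        weight = gamma ** (n - 1 - l)  # the last iteration gets weight 1
        l1 = sum(abs(p - t) for p, t in zip(pred, target)) / len(target)
        loss += weight * l1
    return loss


# Two iterations on a two-pixel map: the first is off by one pixel, the second exact.
loss = sequence_l1_loss([[0.0, 0.0], [1.0, 1.0]], target=[1.0, 1.0], gamma=0.5)
```

Later iterations receive larger weights, so the final estimate dominates the objective while earlier ones still receive a training signal.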
4. The MonoTrap Dataset
Monocular depth estimation is known to be prone to failure in the presence of perspective
illusions. The reader may wonder how Stereo Anywhere would behave in such cases: would it
blindly trust the monocular VFM or rely on the stereo geometric principles to maintain
robustness?
To answer these questions, we introduce MonoTrap, a novel stereo dataset specifically designed
to challenge monocular depth estimation. Our dataset comprises 26 scenes featuring perspective
illusions, captured with a calibrated stereo setup and annotated with ground-truth depth from
an Intel Realsense L515 LiDAR. The scenes contain carefully designed planar patterns that
create visual illusions, such as apparent holes in walls or floors and simulated transparent
surfaces that reveal content behind them. Figure 3 shows examples from our dataset that
illustrate how these visual illusions easily fool monocular methods.
5. Experiments
We describe our implementation details, datasets, and evaluation protocols, followed by
experiments. We also refer the reader to the supplementary material for more results.
5.1. Implementation and Experimental Settings
We implement Stereo Anywhere using PyTorch, starting from the RAFT-Stereo codebase [55]. We use
Depth Anything 2 [121] as the VFM fueling our model, using the Large weights provided by the
authors, trained on ground-truth labels from the HyperSim synthetic dataset [81] only.
Starting from the SceneFlow RAFT-Stereo checkpoint, we train Stereo Anywhere on a single A100
GPU for 3 epochs, with learning rate 1e-4 and AdamW optimizer, on
Experiment | Booster (Q) bad >2 (%) | bad >4 (%) | bad >6 (%) | bad >8 (%) | Avg. (px) | Middlebury 2014 (H) bad >2 All (%) | Noc (%) | Occ (%) | Avg. (px)
(A) Baseline [55] | 17.84 | 13.06 | 10.76 | 9.24 | 3.59 | 11.15 | 8.06 | 29.06 | 1.55
(B) (A) + Monocular Context w/o re-train | 15.85 | 10.98 | 8.89 | 7.69 | 3.05 | 14.96 | 11.70 | 34.38 | 2.82
(C) (A) + Monocular Context w/ re-train | 14.94 | 10.40 | 8.61 | 7.63 | 3.03 | 9.62 | 6.98 | 25.39 | 1.13
(D) (C) + Normals Correlation Volume / Scaled Depth | 11.33 | 6.88 | 5.32 | 4.59 | 1.87 | 7.67 | 5.24 | 21.51 | 0.96
(E) (D) + Volume augmentation / truncation | 9.01 | 5.40 | 4.12 | 3.34 | 1.21 | 6.96 | 4.75 | 20.34 | 0.94
Table 1. Ablation Studies. We measure the impact of different design strategies. Networks trained on SceneFlow [61].
Each dataset cell lists: bad >τ All (%), Noc (%), Occ (%), Avg. (px).
Model | Middlebury 2014 (H), τ = 2 | Middlebury 2021, τ = 2 | ETH3D, τ = 1 | KITTI 2012, τ = 3 | KITTI 2015, τ = 3
RAFT-Stereo [55] | 11.15 8.06 29.06 1.55 | 12.05 9.38 37.89 1.81 | 2.59 2.24 8.78 0.25 | 4.80 4.23 29.21 0.89 | 5.44 5.21 14.09 1.16
PSMNet [8] | 18.79 13.80 53.22 4.63 | 23.67 20.61 53.75 5.70 | 19.75 18.62 42.05 0.94 | 6.73 5.81 46.24 1.22 | 6.78 6.40 24.85 1.38
GMStereo [117] | 15.63 10.98 46.04 1.87 | 25.43 22.43 54.70 2.86 | 6.22 5.58 19.97 0.42 | 5.68 4.87 38.84 1.10 | 5.72 5.44 17.33 1.21
ELFNet [59] | 24.48 16.94 77.06 8.61 | 27.08 21.77 85.56 11.01 | 25.61 24.50 46.06 5.65 | 10.52 8.67 88.21 2.30 | 9.61 8.22 85.64 2.16
PCVNet [132] | 16.79 13.54 35.66 2.96 | 12.92 10.19 40.23 2.18 | 4.24 3.61 14.01 0.41 | 4.44 3.92 27.70 0.89 | 5.08 4.88 13.72 1.24
DLNR [140] | 9.46 6.20 28.75 1.45 | 8.44 5.88 32.71 1.24 | 23.12 22.94 26.93 9.89 | 9.45 8.83 36.75 1.59 | 15.74 15.41 34.32 2.83
Selective-RAFT [110] | 12.05 9.46 27.42 2.35 | 15.69 13.86 36.32 5.92 | 4.36 3.81 10.23 0.34 | 5.71 5.16 30.54 1.08 | 6.50 6.22 18.44 1.27
Selective-IGEV [110] | 9.98 7.09 27.62 1.60 | 8.89 6.34 32.88 1.60 | 6.42 5.71 18.71 1.73 | 6.22 5.54 34.78 1.09 | 5.87 5.66 14.99 1.42
IGEV-Stereo [116] | 9.91 7.08 26.26 1.84 | 9.15 6.43 34.88 1.53 | 4.30 3.86 12.65 0.38 | 5.65 4.43 33.38 1.03 | 5.87 5.13 14.31 1.34
NMRF [28] | 14.08 10.87 34.62 2.91 | 23.36 21.69 42.51 8.57 | 4.34 3.66 17.15 0.42 | 4.62 4.05 30.65 0.92 | 5.24 5.07 12.28 1.16
Stereo Anywhere (ours) | 6.96 4.75 20.34 0.94 | 7.97 5.71 29.52 1.08 | 1.66 1.43 5.29 0.24 | 3.90 3.52 21.65 0.83 | 3.93 3.79 11.01 0.97
Table 2. Zero-shot Generalization. Comparison with state-of-the-art deep stereo models. Networks trained on SceneFlow [61].
batches of 2 images. We extract random crops of size 320×640 from images and apply standard
color and spatial augmentations [55]. The VFM is used only to source monocular depth maps,
remaining frozen during training. The number of iterations for GRUs is fixed to 12 during
training and increased to 32 at inference time.
5.2. Evaluation Datasets & Protocol
Datasets. We utilize SceneFlow [61] as our sole training dataset, comprising about 39k
synthetic stereo pairs with dense ground-truth disparities. For evaluation, we employ several
benchmarks: Middlebury 2014 [86] and its 2021 extension [65] provide high-resolution indoor
scenes with semi-dense labels (15 and 24 stereo pairs), KITTI 2012 [23] and 2015 [62] feature
outdoor driving scenarios (∼200 pairs each at 1280×384 with sparse LiDAR ground truth), and
ETH3D [87] contributes 27 low-resolution indoor/outdoor scenes. For non-Lambertian surfaces, we
primarily use Booster [127], containing 228 high-resolution (12 Mpx) indoor pairs with its
191-pair online benchmark, and LayeredFlow [115], featuring 400 pairs with transparent objects
and sparse ground truth (∼50 points per pair). Additionally, we include our newly proposed
MonoTrap dataset focusing on optical illusions. For zero-shot evaluation, we test on KITTI
2015, Middlebury v3 at half (H) resolution, Middlebury 2021, and ETH3D, while non-Lambertian
zero-shot testing relies on Booster at quarter (Q) resolution and LayeredFlow at eighth (E)
resolution.
Evaluation Metrics. We evaluate our method using two standard metrics: the average pixel error
(Avg.), which computes the absolute difference between predicted and ground truth disparities
averaged over all pixels, and the bad >τ error, which measures the percentage of pixels with a
disparity error greater than τ pixels. For the latter, we compute it considering all pixels, or
only non-occluded or occluded pixels, referred to as All, Noc or Occ respectively.
We evaluate on MonoTrap through standard monocular depth metrics [25]: Absolute relative error
(AbsRel), RMSE, and δ < 1.05 score.
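The metrics above can be written compactly for flat lists of valid pixels, as in the sketch below. Function names and toy values are illustrative; masking of invalid, occluded or non-occluded pixels is assumed to happen before these functions are called.

```python
import math


def avg_error(pred, gt):
    """Average absolute disparity error in pixels (Avg.)."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def bad_tau(pred, gt, tau):
    """Percentage of pixels whose absolute error exceeds tau pixels."""
    return 100.0 * sum(abs(p - g) > tau for p, g in zip(pred, gt)) / len(gt)


def abs_rel(pred, gt):
    """Absolute relative error, in percent."""
    return 100.0 * sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def delta_accuracy(pred, gt, threshold=1.05):
    """Percentage of pixels with max(pred / gt, gt / pred) below the threshold."""
    return 100.0 * sum(max(p / g, g / p) < threshold for p, g in zip(pred, gt)) / len(gt)


pred = [10.0, 12.0, 20.0, 33.0]
gt = [10.0, 10.0, 20.0, 30.0]
metrics = {
    "avg": avg_error(pred, gt),
    "bad2": bad_tau(pred, gt, 2.0),
    "bad1": bad_tau(pred, gt, 1.0),
    "abs_rel": abs_rel(pred, gt),
    "rmse": rmse(pred, gt),
    "delta": delta_accuracy(pred, gt),
}
```

Note that the bad >τ comparison is strict: the pixel with an error of exactly 2 px does not count towards bad >2 in this example.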
5.3. Ablation Study
We start our analysis by evaluating how individual components of our model contribute to the
overall accuracy. All model variants are trained solely on the synthetic SceneFlow dataset and
tested on Booster and Middlebury 2014, allowing us to examine their effectiveness on
non-Lambertian surfaces and general scenes.
Table 1 summarizes our findings. In (A), we report the performance of our baseline model, upon
which we build Stereo Anywhere, i.e., RAFT-Stereo [55]. On the one hand, by adding monocular
context from an off-the-shelf monocular depth network to the pre-trained context backbone (B),
we observe improved performance on non-Lambertian surfaces, though at the expense of a general
drop in accuracy on Middlebury. On the other hand, by re-training the context backbone to
process depth maps obtained from the monocular network on SceneFlow (C), we can appreciate a
consistent improvement in both datasets. Introducing the normals correlation volume with
subsequent differentiable depth scaling (D) significantly enhances the accuracy on
non-Lambertian surfaces, also showing improvements on indoor scenes. Finally, cost volume
augmentations and truncation (E) demonstrate positive effects on transparent surfaces and
mirrors present in the Booster dataset by further reducing the bad-2 metric by approximately
2.3 points (11.33 to 9.01) and Avg. by about 0.7 pixels, with minimal influence on Middlebury.
According to these results, from now on, we will adopt (E) as the default setting for Stereo
Anywhere.
[Figure 4 image grid. Columns: RGB, RAFT-Stereo [55], DLNR [140], NMRF [28], Selective-IGEV [110], Stereo Anywhere.
Rows: KITTI 15, Middlebury, ETH3D.]
Figure 4. Qualitative Results – Zero-Shot Generalization. Predictions by state-of-the-art models and Stereo Anywhere.
Model | Booster (Q) Error Rate >2 (%) | >4 (%) | >6 (%) | >8 (%) | Avg. (px) | LayeredFlow (E) Error Rate >1 (%) | >3 (%) | >5 (%) | Avg. (px)
RAFT-Stereo [55] | 17.84 | 13.06 | 10.76 | 9.24 | 3.59 | 89.21 | 79.02 | 71.61 | 19.27
PSMNet [8] | 34.47 | 24.83 | 20.46 | 17.77 | 7.26 | 91.85 | 79.84 | 70.04 | 21.18
GMStereo [117] | 32.44 | 22.52 | 17.96 | 15.02 | 5.29 | 92.95 | 83.68 | 74.76 | 20.91
ELFNet [59] | 45.52 | 35.79 | 30.72 | 27.33 | 14.04 | 93.08 | 82.24 | 70.41 | 20.19
PCVNet [132] | 22.63 | 16.51 | 13.81 | 12.08 | 4.70 | 88.27 | 76.65 | 66.79 | 18.19
DLNR [140] | 18.56 | 14.55 | 12.61 | 11.22 | 3.97 | 89.90 | 79.46 | 72.72 | 18.97
Selective-RAFT [110] | 20.01 | 15.08 | 12.52 | 10.88 | 4.12 | 92.69 | 86.32 | 78.82 | 20.18
Selective-IGEV [110] | 18.52 | 14.24 | 12.14 | 10.77 | 4.38 | 91.31 | 81.72 | 74.74 | 19.65
IGEV-Stereo [116] | 16.90 | 13.23 | 11.40 | 10.20 | 3.94 | 87.28 | 80.07 | 72.91 | 19.07
NMRF [28] | 27.08 | 19.06 | 15.43 | 13.21 | 5.02 | 89.08 | 79.13 | 70.51 | 20.17
Stereo Anywhere (ours) | 9.01 | 5.40 | 4.12 | 3.34 | 1.21 | 81.83 | 57.66 | 45.12 | 11.20
Fine-tuned models: Booster (Q) Online Benchmark | LayeredFlow (E)
DKT-RAFT [136] (*) | 10.32 | 7.13 | 5.65 | 4.36 | 1.70 | 66.05 | 46.95 | 37.77 | 8.72
Stereo Anywhere (ours) (*) | 6.52 | 2.82 | 1.77 | 1.27 | 0.73 | 51.24 | 25.63 | 15.65 | 4.84
Table 3. Zero-shot Non-Lambertian Generalization. Comparison with state-of-the-art models. Networks trained on SceneFlow [61].
(*) means fine-tuned on the Booster training set.
5.4. Zero-Shot Generalization
We now compare our Stereo Anywhere model against state-of-the-art deep stereo networks,
assessing zero-shot generalization capability when transferred from synthetic to real images.
Purposely, we follow a well-established benchmark in the literature [55,104], evaluating on
real datasets models pre-trained exclusively on SceneFlow [61].
Table 2 compares Stereo Anywhere with off-the-shelf stereo networks using the authors' provided
weights. Considering All, Noc, and Avg. metrics, we can notice how Stereo Anywhere achieves
consistently better results across most datasets, achieving about 2.5 points lower bad-2 All on
Middlebury 2014 (6.96 vs. 9.46) than the second-best method DLNR [140], and breaking the 4%
barrier on KITTI's bad-3 All metric.
The Occ metric further demonstrates how Stereo Anywhere consistently outperforms other stereo
models on any dataset, with substantial margins over the second-best, i.e., approximately 6%
on Middlebury 2014 and KITTI 2012, and 3% on ETH3D. This confirms that leveraging priors from
VFMs for monocular depth estimation effectively improves stereo matching accuracy in
challenging conditions where stereo matching is ill-posed, such as at occluded regions.
Figure 4 shows predictions on KITTI 2015, Middlebury 2014, and ETH3D samples. In particular,
the first row shows an extremely challenging case for SceneFlow-trained models, where Stereo
Anywhere achieves accurate disparity maps thanks to VFM priors.
5.5. Zero-Shot Non-Lambertian Generalization
We now assess the generalization capabilities of Stereo Anywhere and existing stereo models
when dealing with non-Lambertian materials, such as transparent surfaces or mirrors. To this
end, we conduct a zero-shot generalization evaluation experiment on the Booster [74] and
LayeredFlow [115] datasets, once again using models pre-trained on SceneFlow [61], with weights
provided by the authors.
Table 3 shows the outcome of this evaluation. This time, we can perceive even more clearly how
Stereo Anywhere is the absolute winner, demonstrating unprecedented robustness in the presence
of non-Lambertian surfaces despite being trained only on synthetic stereo data, not even
featuring such objects. These results further validate how leveraging strong priors from
existing VFMs for monocular depth estimation can play a game-changing role in stereo matching
as well, especially when lacking training data explicitly targeting critical conditions such as
non-Lambertian surfaces. At the bottom, we report results achieved by fine-tuning Stereo
Anywhere on the Booster training set and evaluating on the online benchmark. Our model ranks
first when evaluated at quarter resolution.
[Figure 5 image grid. Columns: RGB, RAFT-Stereo [55], DLNR [140], NMRF [28], Selective-IGEV [110], Stereo Anywhere.
Rows: Booster, LayeredFlow.]
Figure 5. Qualitative results – Zero-Shot non-Lambertian Generalization. Predictions by state-of-the-art models and Stereo Anywhere.
MonoTrap
Model | AbsRel (%) ↓ | RMSE (m) ↓ | δ < 1.05 (%) ↑
Depth Anything 2 [121] | 53.46 | 0.36 | 15.21
Depth Anything 2 [121] † | 27.92 | 0.27 | 19.43
DepthPro [6] | 47.77 | 0.32 | 21.90
DepthPro [6] † | 20.82 | 0.22 | 22.88
RAFT-Stereo [55] | 5.01 | 0.09 | 77.05
Stereo Anywhere | 3.50 | 0.06 | 80.27
Table 4. MonoTrap Benchmark. Comparison with state-of-the-art monocular depth estimation models and RAFT-Stereo. Both
RAFT-Stereo and Stereo Anywhere are trained on SceneFlow [61]. † refers to robust scaling through RANSAC.
Figure 5 shows examples from Booster and LayeredFlow,
where Stereo Anywhere is the only stereo model correctly
perceiving the mirror and transparent railing.
5.6. MonoTrap Benchmark
We conclude our evaluation by running experiments on our
newly collected MonoTrap dataset to prove the robustness
of Stereo Anywhere in the presence of critical conditions
harming the accuracy of monocular depth predictors.
Table 4 collects the results achieved by state-of-the-art
monocular depth estimation models, the baseline stereo
model over which we built our framework (RAFT-Stereo),
and Stereo Anywhere. Regarding the former models, as
they predict affine-invariant depth maps, following the
literature [78] we use least-squares fitting to align them to the
ground truth. As these models are fooled by the visual
illusions, this scaling procedure is likely to yield sub-optimal
scale and shift parameters. Therefore, we alternatively align
to ground-truth depth through a more robust RANSAC fitting,
denoted with † in the table.
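
The two alignment strategies described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the inlier threshold, the iteration count, and the choice to operate on flat lists of valid pixels are all assumptions made for illustration, and the text does not state whether alignment is performed in depth or inverse-depth space. The sketch fits a scale s and shift t so that s * pred + t approximates the ground truth, first by plain least squares and then by a simple RANSAC loop that refits on the largest consensus set.

```python
import random


def fit_scale_shift(pred, gt):
    """Closed-form least squares for gt ~ s * pred + t over paired samples."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        # Constant prediction: the scale is not identifiable, keep only a shift.
        return 0.0, mean_g
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var_p
    return s, mean_g - s * mean_p


def ransac_scale_shift(pred, gt, iters=200, thresh=0.05, seed=0):
    """RANSAC variant (hypothetical parameters): sample two pixels, build a
    candidate (s, t), count inliers, then refit on the best inlier set."""
    rng = random.Random(seed)
    n = len(pred)
    best_inliers = []
    for _ in range(iters):
        i, j = rng.sample(range(n), 2)
        if pred[i] == pred[j]:
            continue  # degenerate pair, cannot define a scale
        s = (gt[i] - gt[j]) / (pred[i] - pred[j])
        t = gt[i] - s * pred[i]
        inliers = [k for k in range(n) if abs(s * pred[k] + t - gt[k]) < thresh]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        return fit_scale_shift(pred, gt)  # fall back to plain least squares
    return fit_scale_shift([pred[k] for k in best_inliers],
                           [gt[k] for k in best_inliers])
```

When a portion of the pixels follows a different relation to the ground truth, as can happen on an optical illusion, the plain least-squares fit is pulled toward those pixels, whereas the RANSAC fit recovers the scale and shift supported by the majority.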
On the one hand, by comparing monocular and stereo
methods, we notice how the failures of the former negatively
impact their evaluation metrics. Once again, we remark
that a direct comparison across the two families of
methods is not the main goal of this experiment. On the
other hand, we focus on the comparison between RAFT-Stereo
and Stereo Anywhere, with our model performing
slightly better than its baseline. This fact proves that despite
its strong reliance on the priors retrieved from VFMs
for monocular depth estimation, Stereo Anywhere can properly
ignore such priors when unreliable.

[Figure 6 image. Columns: RGB, Depth Anything 2 [121], Stereo Anywhere.]
Figure 6. Qualitative results – MonoTrap. Stereo Anywhere is
not fooled by erroneous predictions by its monocular engine [121].
Figure 6 shows three samples where Depth Anything 2
fails while Stereo Anywhere does not.
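
For reference, the three metrics reported in Table 4 can be computed as in the following sketch. It uses the definitions commonly adopted in the depth estimation literature, which are assumed here since this section does not spell them out: AbsRel is the mean of |pred - gt| / gt, RMSE is the root mean squared error, and σ < 1.05 is the share of pixels whose ratio max(pred/gt, gt/pred) falls below 1.05. The function name and the handling of invalid pixels are hypothetical.

```python
import math


def depth_metrics(pred, gt, delta=1.05):
    """Return (AbsRel in %, RMSE in the units of gt, sigma < delta in %).

    Only pixels with a valid (positive) ground-truth depth are evaluated.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    abs_rel = 100.0 * sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # A non-positive prediction can never satisfy the ratio test.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < delta)
    return abs_rel, rmse, 100.0 * hits / n
```

Lower is better for AbsRel and RMSE, and higher is better for the σ threshold accuracy, matching the arrows in Table 4.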
6. Conclusion
In this paper, we introduced Stereo Anywhere, a novel
stereo matching framework that leverages monocular depth
VFMs to overcome traditional stereo matching limitations.
Combining stereo geometric constraints with monocular
priors, our approach demonstrates superior zero-shot
generalization and robustness to challenging conditions like
textureless regions, occlusions, and non-Lambertian surfaces.
Furthermore, through our novel MonoTrap dataset,
we showed that Stereo Anywhere effectively combines the
best of both worlds: maintaining stereo matching's geometric
accuracy where monocular methods fail, while leveraging
monocular priors to handle challenging stereo scenarios.
Extensive comparisons against state-of-the-art networks in
zero-shot settings validate these findings.
Acknowledgement. This study was carried out within the
MOST – Sustainable Mobility National Research Center and
received funding from the European Union Next-GenerationEU –
PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) –
MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.4 – D.D.
1033 17/06/2022, CN00000023. This manuscript reflects only the
authors' views and opinions; neither the European Union nor the
European Commission can be considered responsible for them.
This study was funded by the European Union – Next Generation EU
within the framework of the National Recovery and Resilience
Plan NRRP – Mission 4 “Education and Research” –
Component 2 - Investment 1.1 “National Research Program and
Projects of Significant National Interest Fund (PRIN)” (Call D.D.
MUR n. 104/2022) – PRIN2022 – Project reference: “RiverWatch:
a citizen-science approach to river pollution monitoring”
(ID: 2022MMBA8X, CUP: J53D23002260006).
We also acknowledge the CINECA award under the ISCRA
initiative, for the availability of high-performance computing
resources and support.
References
[1] Filippo Aleotti, Fabio Tosi, Li Zhang, Matteo Poggi, and
Stefano Mattoccia. Reversing the cycle: self-supervised
deep stereo through enhanced monocular distillation. In
Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part XI
16, pages 614–632. Springer, 2020. 2, 3
[2] Filippo Aleotti, Fabio Tosi, Pierluigi Zama Ramirez,
Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi
Di Stefano. Neural disparity refinement for arbitrary
resolution stereo. In 2021 International Conference on 3D Vision
(3DV), pages 207–217. IEEE, 2021. 2, 3
[3] Vasileios Arampatzakis, George Pavlidis, Nikolaos
Mitianoudis, and Nikos Papamarkos. Monocular depth
estimation: A thorough review. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2023. 1
[4] Antyanta Bangunharcana, Jae Won Cho, Seokju Lee, In So
Kweon, Kyung-Soo Kim, and Soohyun Kim. Correlate-and-excite:
Real-time stereo matching via guided cost volume
excitation. In IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), 2021. 2, 4
[5] Luca Bartolomei, Matteo Poggi, Fabio Tosi, Andrea Conti,
and Stefano Mattoccia. Active stereo without pattern
projector. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 18470–18482, 2023. 2
[6] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain,
Marcel Santos, Yichao Zhou, Stephan R. Richter, and
Vladlen Koltun. Depth pro: Sharp monocular metric depth
in less than a second. arXiv, 2024. 8
[7] Changjiang Cai, Matteo Poggi, Stefano Mattoccia, and
Philippos Mordohai. Matching-space stereo networks for
cross-domain generalization. In 2020 International
Conference on 3D Vision (3DV), pages 364–373, 2020. 2
[8] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo
matching network. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
5410–5418, 2018. 2, 6, 7
[9] Liyan Chen, Weihan Wang, and Philippos Mordohai.
Learning the distribution of errors in stereo matching for
joint disparity and uncertainty estimation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 17235–17244, 2023. 2
[10] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng.
Single-image depth perception in the wild. In Proceedings
of the 30th International Conference on Neural Information
Processing Systems, pages 730–738, Red Hook, NY, USA,
2016. Curran Associates Inc. 3
[11] Xihao Chen, Zhiwei Xiong, Zhen Cheng, Jiayong Peng,
Yueyi Zhang, and Zheng-Jun Zha. Degradation-agnostic
correspondence from resolution-asymmetric stereo. In
Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 12962–12971,
2022. 3
[12] Zhi Chen, Xiaoqing Ye, Wei Yang, Zhenbo Xu, Xiao Tan,
Zhikang Zou, Errui Ding, Xinming Zhang, and Liusheng
Huang. Revealing the reciprocal relations between
self-supervised stereo and monocular depth estimation. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision (ICCV), pages 15529–15538, 2021. 3
[13] Ziyang Chen, Wei Long, He Yao, Yongjun Zhang,
Bingshu Wang, Yongbin Qin, and Jia Wu. Mocha-stereo: Motif
channel attention network for stereo matching. In
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2024. 2
[14] Junda Cheng, Longliang Liu, Gangwei Xu, Xianqi Wang,
Zhaoxing Zhang, Yong Deng, Jinliang Zang, Yurui Chen,
Zhipeng Cai, and Xin Yang. Monster: Marry monodepth to
stereo unleashes power. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
2025. 3
[15] Kelvin Cheng, Tianfu Wu, and Christopher Healey.
Revisiting non-parametric matching cost volumes for robust and
generalizable stereo matching. Advances in Neural
Information Processing Systems, 35:16305–16318, 2022. 2
[16] Jaehoon Cho, Dongbo Min, Youngjung Kim, and
Kwanghoon Sohn. Diml/cvl rgb-d dataset: 2m rgb-d
images of natural indoor and outdoor scenes. arXiv preprint
arXiv:2110.11590, 2021. 3
[17] WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad,
Alireza Bab-Hadiashar, and David Suter. Itsa: An
information-theoretic approach to automatic shortcut
avoidance and domain generalization in stereo matching
networks. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages
13022–13032, 2022. 2
[18] Alex Costanzino, Pierluigi Zama Ramirez, Matteo Poggi,
Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano.
Learning depth estimation for transparent and mirror
surfaces. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 9244–9255, 2023. 3
[19] Yiqun Duan, Xianda Guo, and Zheng Zhu. DiffusionDepth:
Diffusion denoising approach for monocular depth
estimation. arXiv preprint arXiv:2303.05021, 2023. 3
[20] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir
Zamir. Omnidata: A scalable pipeline for making multi-