Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail

Author: Bartolomei, Luca; Tosi, Fabio; Poggi, Matteo; Mattoccia, Stefano

Publisher: Zenodo

DOI: 10.5281/zenodo.17672518

Source: https://zenodo.org/records/17672518/files/Bartolomei_Stereo_Anywhere_Robust_Zero-Shot_Deep_Stereo_Matching_Even_Where_Either_CVPR_2025_paper.pdf

Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching
Even Where Either Stereo or Mono Fail
Luca Bartolomei∗,† Fabio Tosi† Matteo Poggi∗,† Stefano Mattoccia∗,†
∗Advanced Research Center on Electronic System (ARCES)
†Department of Computer Science and Engineering (DISI)
University of Bologna, Italy
https://stereoanywhere.github.io/
[Figure 1 panels. Columns: RGB, Depth Anything 2 [121], RAFT-Stereo [55], Stereo Anywhere (Ours). Rows: Middlebury (✓ ✓ ✓), Booster (✓ ✗ ✓), MonoTrap (✗ ✓ ✓).]
Figure 1. Stereo Anywhere: Combining Monocular and Stereo Strengths for Robust Depth Estimation. Our model achieves accurate results on standard conditions (on Middlebury [86]), while effectively handling non-Lambertian surfaces where stereo networks fail (on Booster [127]) and perspective illusions that deceive monocular depth foundation models (on MonoTrap, our novel dataset).
Abstract
We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.
1. Introduction
Stereo is a fundamental task that computes depth from a synchronized, rectified image pair by finding pixel correspondences to measure their horizontal offset (disparity). Due to its effectiveness and minimal hardware requirements, stereo has become prevalent in numerous applications, from autonomous navigation to augmented reality. Although in principle single-image depth estimation [3] requires an even simpler acquisition setup, its ill-posed nature leads to scale ambiguity and perspective illusion issues that stereo methods inherently overcome through well-established geometric multi-view constraints.
However, despite significant advances through deep learning [47,72], stereo models still face two main challenges: (i) limited generalization across different scenarios, and (ii) critical conditions that hinder matching or proper depth triangulation. Regarding (i), despite the initial success of synthetic datasets in enabling deep learning for stereo, their limited variety and simplified nature poorly reflect real-world complexity, and the scarcity of real training data further hinders the ability to handle heterogeneous scenarios. As for (ii), large textureless regions common in indoor environments make pixel matching highly ambiguous, while occlusions and non-Lambertian surfaces [76,115,127] violate the fundamental assumptions linking pixel correspondences to 3D geometry.
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.
We argue that both challenges are rooted in the underlying limitations of stereo training data. Indeed, while data has scaled up to millions, or even billions, for several computer vision tasks, stereo datasets are still constrained in quantity and variety. This is particularly evident for non-Lambertian surfaces, which are severely underrepresented in existing datasets as their material properties prevent reliable depth measurements from active sensors (e.g. LiDAR).
In contrast, single-image depth estimation has recently witnessed a significant scale-up in data availability, reaching the order of millions of samples and enabling the emergence of Vision Foundation Models (VFMs) [22,43,120,121]. Such data abundance has influenced these models in different ways, either through direct training on large-scale depth datasets [120,121] or indirectly by leveraging networks pre-trained on billions of images for diverse tasks [22,43]. Since these models rely on contextual cues for depth estimation, they show better capability in handling textureless regions and non-Lambertian materials [75,81,128,129] while being inherently immune to occlusions. Modern graphics engines have further accelerated this progress, enabling rapid generation of high-quality synthetic data with dense depth annotations. However, although synthetic datasets featuring non-Lambertian surfaces like HyperSim [81] have proven effective for monocular depth estimation [75,128,129], this data abundance has not translated to stereo. Despite efforts in generating stereo pairs via novel view synthesis [24,54,104], available data remains insufficient for robust stereo matching.
In this paper, rather than focusing on costly real-world data collection or generating additional synthetic datasets, we propose to bridge this gap by leveraging existing VFMs for single-view depth estimation. To this end, we develop a novel dual-branch deep architecture that combines stereo matching principles with monocular depth cues. Specifically, while one branch of the proposed network constructs a cost volume from learned stereo image features, the other branch processes depth predictions from the VFM on both left and right images to build a second cost volume that incorporates depth priors to guide the disparity estimation process. These complementary signals are then iteratively combined [55], along with novel augmentation strategies applied to both cost volumes, to predict the final disparity map. Through this design, our network achieves robust performance on challenging cases like textureless regions, occlusions, and non-Lambertian surfaces, while requiring minimal synthetic stereo data. Importantly, while leveraging monocular cues, our approach preserves stereo matching geometric guarantees, effectively handling scenarios where monocular depth estimation typically fails, such as in the presence of perspective illusions. We validate this through our novel dataset of optical illusions, comprising 26 scenes with ground-truth depth maps.
We dub our framework Stereo Anywhere, highlighting its ability to overcome the individual limitations of stereo and monocular approaches, as depicted in Fig. 1. To summarize, our main contributions are:
• A novel deep stereo architecture leveraging monocular depth VFMs to achieve strong generalization capabilities and robustness to challenging conditions.
• Novel data augmentation strategies designed to enhance the robustness of our model to textureless regions and non-Lambertian surfaces.
• A dataset of optical illusions that is particularly challenging for monocular depth estimation with VFMs.
• Extensive experiments showing Stereo Anywhere's superior generalization and robustness to conditions critical for either stereo or monocular approaches.
2. Related Works
We briefly review the literature relevant to our work.
Deep Stereo Matching. In the last decade, stereo matching has transitioned from classical hand-crafted algorithms [85] to deep learning solutions, leading to unprecedented accuracy in depth estimation. Early deep learning efforts focused on replacing individual components of the conventional pipeline [88,96,105,130,131]. Since DispNetC [61], end-to-end architectures have evolved into 2D [53,92,125,125] and 3D [4,8,9,32,44,90,91,119,132,134] approaches, processing cost volumes through correlation layers or 3D convolutions respectively. More recent advances, thoroughly reviewed in [47,72,107], include recurrent architectures for stereo matching [13,27,40,50,55,110,116,140] inspired by RAFT [99], Transformer-based solutions [31,52,59,97,113,117,138] for capturing long-range dependencies, and fully data-driven MRF models [28]. Among them, some methods specifically address temporal consistency in stereo videos [41,42,133,137]. Domain generalization remains a major challenge, with various approaches proposed including domain-invariant feature learning [17,56,80,93,135], hand-crafted matching costs [7,15], integration of additional geometric cues [2,66,105], and exploitation of sparse depth measurements from active sensors [5,49,69]. In parallel, self-supervised approaches [25,57] have emerged as effective alternatives to supervised learning, even using pseudo-labels from traditional algorithms [1,100] or deploying neural radiance fields [104]. Despite the numerous attempts to improve specific aspects through the aforementioned techniques, recent architectures achieve remarkable generalization by combining their architectural advances with the increasing availability of diverse training data, while online adaptation techniques enable further improvements during deployment through self-supervised learning [45,67,71,101]. However, despite progress on challenges like over-smoothing [103,118] and visually imbalanced stereo [2,11,58,105], handling non-Lambertian surfaces remains particularly challenging due to limited annotated data and complex appearance, with rare works like Depth4ToM [18] specifically addressing this through semantic guidance. Among all the aforementioned approaches, there have been limited attempts to integrate stereo with monocular cues [1,12,112], mostly in self-supervised settings or through loose coupling between modalities.
[Figure 2 diagram. Stage labels: Features Extraction; Correlation Pyramids Building; Iterative Disparity Estimation. Block labels: Stereo Pair; Feature extraction Backbone; Correlation Volume; Truncate Function; Truncated Correlation Volume; Monocular Depth Estimations (MDEs); Estimated Normal Maps; Correlation Volume from normals; 3D Hourglass; Aggregated Correlation Volume from normals; Differentiable Scaler; Scaled MDE; Context Backbone; Correlation Pyramids; Correlation Pyramids from normals; Final Disparity.]
Figure 2. Stereo Anywhere Architecture. Given a stereo pair, (1) a pre-trained backbone is used to extract features and then build a correlation volume. Such a volume is then truncated (2) to reject matching costs computed for disparity hypotheses lying behind non-Lambertian surfaces (glasses and mirrors). On a parallel branch, the two images are processed by a monocular VFM to obtain two depth maps (3): these are used to build a second correlation volume from retrieved normals (4). This volume is then aggregated through a 3D CNN to predict a new disparity map, used to align the original monocular depth to metric scale through a differentiable scaling module (5). In parallel, the monocular depth map from left images is processed by another backbone (6) to extract context features. Finally, the two volumes and the context features from monocular depth guide the iterative disparity prediction (7).
Monocular Depth Estimation. Parallel to developments in stereo matching, single-image depth estimation has evolved from hand-crafted features [82] to deep learning methods [10,21,48,73,108], with self-supervised approaches [25,26,60,68,111,139,141] reframing the task as an image reconstruction problem. This led to multi-task approaches incorporating flow [79,102,124,142] and semantics [29,126], alongside advances in uncertainty estimation [34,70] and dynamic object handling [46,63,98]. Affine-invariant models [20,77,78,109,122] marked a breakthrough in cross-domain generalization, pioneered by MiDaS [78] and followed by works like DPT [77] and, more recently, the Depth Anything series [120]. These approaches used different data sources, from internet photos [51,94,95,122] to car sensors [23,62] and RGB-D devices [16,64], representing the first generation of VFMs for monocular depth estimation. Recent works have focused on metric depth estimation through camera parameter integration [30,35,123], diffusion models [19,22,33,38,43,83,84], and temporal consistency [36,89]. Moreover, material-aware methods [18], diffusion models [106], and large-scale synthetic datasets have enabled robust monocular depth estimation on non-Lambertian surfaces [121]. Stereo methods, however, still struggle with these surfaces due to limited real-world and synthetic annotated data, affecting generalization. We address this by integrating robust monocular VFMs into a stereo architecture.
Concurrent Works. Finally, we mention some solutions for stereo [14,39,114] and for multi-view stereo [37], developed in parallel with ours and sharing similar rationale.
3. Method Overview
Given a rectified stereo pair $I_L, I_R \in \mathbb{R}^{3 \times H \times W}$, we first obtain monocular depth estimates (MDEs) $M_L, M_R \in \mathbb{R}^{1 \times H \times W}$ using a generic VFM $\phi_M$ for monocular depth estimation. We aim to estimate a disparity map $D = \phi_S(I_L, I_R, M_L, M_R)$, incorporating VFM priors to provide accurate results even under challenging conditions, such as texture-less areas, occlusions, and non-Lambertian surfaces. At the same time, our stereo network $\phi_S$ is designed to avoid depth estimation errors that could arise from relying solely on contextual cues, which can be ambiguous, like in the presence of visual illusions.
Following recent advances in iterative models [55], Stereo Anywhere comprises three main stages, as shown in Fig. 2: I) Feature Extraction, II) Correlation Pyramids Building, and III) Iterative Disparity Estimation.
3.1. Feature Extraction
Two distinct types of features are extracted [55]: image features and context features, (1) and (6) in Fig. 2. The image features are obtained through a feature encoder processing the stereo pair, yielding feature maps $F_L, F_R \in \mathbb{R}^{D \times \frac{H}{4} \times \frac{W}{4}}$, which are used to build a stereo correlation volume at $\frac{1}{4}$ of the original input resolution. These encoders are initialized with pre-trained weights [55] and the image encoder is kept frozen during training. For context features, we employ a context encoder with identical architecture to the feature encoder, but processing the monocular depth estimate aligned with the reference image $M_L$, (3) in Fig. 2, instead of $I_L$ to capture strong geometry priors. Accordingly, during training the context encoder is optimized to extract meaningful features from these depth maps.
3.2. Correlation Pyramids Building
As a standard practice in stereo matching, the cost volume is the data structure encoding the similarity between pixels across two images. Accordingly, our model utilizes cost volumes, specifically Correlation Pyramids [55], but in a novel manner. Indeed, Stereo Anywhere constructs two correlation pyramids: a stereo correlation volume derived from $I_L, I_R$ to encode image similarities, and a monocular correlation volume from $M_L, M_R$ to encode geometric similarities, (2) and (4) in Fig. 2. Unlike the former, the latter remains unaffected by non-Lambertian surfaces, assuming a robust $\phi_M$.
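Stereo Correlation Volume. Given $F_L, F_R$, we construct a 3D correlation volume $V_S$ using dot product between feature maps:
$$(V_S)_{ijk} = \sum_{h} (F_L)_{hij} \cdot (F_R)_{hik}, \quad V_S \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times \frac{W}{4}} \tag{1}$$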
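As an illustration of Eq. (1), the all-pairs dot product along each image row can be sketched with a single `einsum`. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `correlation_volume` and the toy tensor sizes are hypothetical.

```python
import numpy as np


def correlation_volume(feat_left, feat_right):
    """All-pairs correlation along each image row, as in Eq. (1).

    feat_left, feat_right: arrays of shape (D, H, W), channels first.
    Returns an (H, W, W) volume whose entry [i, j, k] is the dot product
    between the left feature at (i, j) and the right feature at (i, k).
    """
    return np.einsum("hij,hik->ijk", feat_left, feat_right)


rng = np.random.default_rng(0)
F_L = rng.standard_normal((16, 4, 6))  # toy sizes: D=16, H/4=4, W/4=6
F_R = rng.standard_normal((16, 4, 6))
V_S = correlation_volume(F_L, F_R)
print(V_S.shape)  # (4, 6, 6)
```

Each slice `V_S[i, j, :]` is the correlation curve of left pixel `(i, j)` against every pixel on the same row of the right image.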
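Monocular Correlation Volume. Given $M_L, M_R$, we downsample them to 1/4, compute their normals $\nabla_L, \nabla_R$, and construct a 3D correlation volume $V_M$ using dot product between normal maps:
$$(V_M)_{ijk} = \sum_{h} (\nabla_L)_{hij} \cdot (\nabla_R)_{hik}, \quad V_M \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times \frac{W}{4}} \tag{2}$$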
Given the absence of texture in $\nabla_L$ and $\nabla_R$, the resulting monocular volume $V_M$ will be less informative. To alleviate this problem we segment $V_M$ using the relative depth priors from $M_L$ and $M_R$: to do so, we generate left and right segmentation masks $\mathcal{M}^n_L \in \{0,1\}^{\frac{H}{4} \times \frac{W}{4} \times 1}$, $\mathcal{M}^n_R \in \{0,1\}^{\frac{H}{4} \times 1 \times \frac{W}{4}}$. We refer the reader to the supplementary material for a detailed description. Given the segmentation masks, we can generate masked volumes as:
$$(V^n_M)_{ijk} = (\mathcal{M}^n_L)_{ij} \cdot (\mathcal{M}^n_R)_{ik} \cdot (V_M)_{ijk} \tag{3}$$
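Eqs. (2) and (3) can be sketched as follows. This is only a sketch: the normal maps and the segmentation masks are taken as given inputs (random unit normals and two hand-made masks here), since how they are derived from the monocular depth is described in the supplementary material. All function and variable names are hypothetical.

```python
import numpy as np


def normals_correlation(normals_left, normals_right):
    """Eq. (2): normal maps of shape (3, H, W) give an (H, W, W) volume."""
    return np.einsum("hij,hik->ijk", normals_left, normals_right)


def masked_volumes(volume, masks_left, masks_right):
    """Eq. (3): binary masks of shape (N, H, W) give N masked volumes.

    Entry [n, i, j, k] keeps volume[i, j, k] only when left pixel (i, j)
    and right pixel (i, k) both belong to segment n.
    """
    return masks_left[:, :, :, None] * masks_right[:, :, None, :] * volume[None]


rng = np.random.default_rng(0)
H, W, N = 4, 6, 2


def random_unit_normals():
    n = rng.standard_normal((3, H, W))
    return n / np.linalg.norm(n, axis=0, keepdims=True)


V_M = normals_correlation(random_unit_normals(), random_unit_normals())

# Toy masks (assumption): segment 0 is the left half of each view,
# segment 1 is the right half.
M_L = np.zeros((N, H, W))
M_L[0, :, : W // 2] = 1
M_L[1, :, W // 2:] = 1
M_R = M_L.copy()

V_masked = masked_volumes(V_M, M_L, M_R)
print(V_masked.shape)  # (2, 4, 6, 6)
```

Because unit normals are correlated, every entry of `V_M` lies in [-1, 1], and the masks confine each correlation curve to candidates in the same segment.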
Next, we insert a 3D Convolutional Regularization module $\phi_A$ to aggregate $V^n_M$, resulting in $V'_M = \phi_A(V^1_M, \dots, V^N_M, M_L, M_R)$, with $N = 8$. The architecture of $\phi_A$ follows the one in [116], with a simple permutation to match the structure of the correlation volumes. We propose an adapted version of CoEx [4] correlation volume excitation that exploits both views. The resulting feature volumes $V'_M \in \mathbb{R}^{F \times \frac{H}{4} \times \frac{W}{4} \times \frac{W}{4}}$ are fed to two different shallow 3D conv layers $\phi_D$ and $\phi_C$ to obtain two aggregated volumes $V^D_M = \phi_D(V'_M)$ and $V^C_M = \phi_C(V'_M)$ with $V^D_M, V^C_M \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times \frac{W}{4}}$.
Differentiable Monocular Scaling. Volume $V^D_M$ will be used not only as a monocular guide for the iterative refinement unit but also to estimate the coarse disparity maps $\hat{D}_L$, $\hat{D}_R$, while $V^C_M$ is used to estimate confidence maps $\hat{C}_L$, $\hat{C}_R$. These maps are then used to scale both $M_L$ and $M_R$, (5) in Fig. 2. To estimate left disparity from a correlation volume, we first perform a softargmax on the last $W$ dimension of $V^D_M$ to extract the correlated pixel x-coordinate. Then, given the relationship between left disparity and correlation $d_L = j_L - j_R$, we obtain a coarse disparity map $\hat{D}_L$:
$$(\hat{D}_L)_{ij} = j - \operatorname{softargmax}_L(V^D_M)_{ij} \tag{4}$$
Similarly, we estimate $\hat{D}_R$ from $V^D_M$. We refer the reader to the supplementary for details. We also estimate a pair of confidence maps $\hat{C}_L, \hat{C}_R \in [0,1]^{H \times W}$ to classify outliers and perform robust scaling. Inspired by information entropy, we measure the chaos within correlation curves: clear monomodal-like cost curves, those with low entropy, are reliable, while chaotic curves with high entropy indicate uncertainty. To estimate the left confidence map, we perform a softmax operation on the last $W$ dimension of $V^C_M$, then $\hat{C}_L$ is obtained as follows:
$$(\hat{C}_L)_{ij} = 1 + \frac{\sum_{d}^{\frac{W}{4}} \frac{e^{(V^C_M)_{ijd}}}{\sum_{f}^{\frac{W}{4}} e^{(V^C_M)_{ijf}}} \cdot \log_2\left(\frac{e^{(V^C_M)_{ijd}}}{\sum_{f}^{\frac{W}{4}} e^{(V^C_M)_{ijf}}}\right)}{\log_2\left(\frac{W}{4}\right)} \tag{5}$$
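Eq. (5) is one minus the normalized Shannon entropy of the softmax distribution along each correlation curve. The sketch below shows the two extreme cases. The function name `entropy_confidence` is hypothetical, and the last axis of the input plays the role of the $\frac{W}{4}$ candidates.

```python
import numpy as np


def entropy_confidence(volume):
    """Confidence in [0, 1] from an (H, W, K) volume, following Eq. (5)."""
    z = volume - volume.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # p * log2(p), with the convention 0 * log2(0) = 0.
    safe_p = np.where(p > 0, p, 1.0)
    plogp = np.where(p > 0, p * np.log2(safe_p), 0.0)
    return 1.0 + plogp.sum(axis=-1) / np.log2(volume.shape[-1])


K = 8
flat = np.zeros((1, 1, K))      # uniform curve: maximum entropy
peaked = np.zeros((1, 1, K))
peaked[0, 0, 3] = 50.0          # single sharp peak: near-zero entropy

c_flat = entropy_confidence(flat)[0, 0]
c_peaked = entropy_confidence(peaked)[0, 0]
```

A uniform curve has entropy $\log_2 K$ and therefore confidence 0, while a single sharp peak has entropy close to 0 and confidence close to 1.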
In the same way, we estimate $\hat{C}_R$. To further reduce outliers, we mask out occluded pixels from $\hat{C}_L$ and $\hat{C}_R$ using a SoftLRC operator (see the supplementary material for details). Finally, we estimate the scale $\hat{s}$ and shift $\hat{t}$ using a differentiable weighted least-square approach:
$$\min_{\hat{s},\hat{t}} \sum_{L,R} \left\| \sqrt{\hat{C}} \odot \left[ \hat{s} M + \hat{t} - \hat{D} \right] \right\|_F \tag{6}$$
where $\|\cdot\|_F$ denotes the Frobenius norm. Using the scaling coefficients, we obtain two disparity maps $\hat{M}_L$, $\hat{M}_R$:
$$\hat{M}_L = \hat{s} M_L + \hat{t}, \quad \hat{M}_R = \hat{s} M_R + \hat{t} \tag{7}$$
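One way to realize Eqs. (6) and (7) is the closed-form solution of the weighted normal equations, solved jointly over both views. The sketch below minimizes the squared form of the weighted residual, which is an assumption. The function name and toy data are hypothetical, and the closed form is differentiable when expressed in an autodiff framework.

```python
import numpy as np


def weighted_scale_shift(mono, target, confidence):
    """Minimize sum(confidence * (s * mono + t - target) ** 2) over s, t."""
    m, d, w = (np.asarray(a, dtype=float).ravel()
               for a in (mono, target, confidence))
    A = np.stack([m, np.ones_like(m)], axis=1)   # design matrix [M, 1]
    AtW = (A * w[:, None]).T                      # A^T W
    s, t = np.linalg.solve(AtW @ A, AtW @ d)      # weighted normal equations
    return s, t


rng = np.random.default_rng(0)
M_L, M_R = rng.uniform(0.1, 1.0, (2, 4, 6))       # toy relative depths
D_L, D_R = 20.0 * M_L + 3.0, 20.0 * M_R + 3.0     # toy coarse disparities
D_L[0, 0] = 500.0                                 # an outlier ...
C_L, C_R = np.ones_like(M_L), np.ones_like(M_R)
C_L[0, 0] = 0.0                                   # ... with zero confidence

# Stacking left and right maps yields one shared (s, t) for both views.
s, t = weighted_scale_shift(np.stack([M_L, M_R]),
                            np.stack([D_L, D_R]),
                            np.stack([C_L, C_R]))
M_hat_L, M_hat_R = s * M_L + t, s * M_R + t       # Eq. (7)
```

Because the outlier carries zero confidence, it does not affect the fit and the toy scale 20 and shift 3 are recovered.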
[Figure 3 panels, for each of two scenes: Image, Ground-Truth, Depth Anything 2 [121].]
Figure 3. Samples from MonoTrap Dataset. We report two scenes featured in our dataset, showing the left image, the ground-truth depth, and the predictions by Depth Anything 2 [121], highlighting how it fails in the presence of visual illusions.
It is crucial to optimize both left and right scaling jointly to obtain consistency between $\hat{M}_L$ and $\hat{M}_R$.
Volume Augmentations. Unfortunately, Stereo Anywhere cannot properly learn when to choose stereo or mono information from SceneFlow [61] alone. Hence, we propose three volume augmentations and a monocular augmentation to overcome this issue: 1) Volume Rolling: we randomly apply a rolling operation to the last $W$ dimension of $V^D_M$ or $V_S$; 2) Volume Noising: we apply random noise sampled from the interval [0,1) using a uniform distribution; 3) Volume Zeroing: we apply a Gaussian-like curve with the peak where disparity equals zero. Furthermore, we randomly substitute the monocular depth with ground truth normalized between [0,1] as an additional augmentation. We apply only one volume augmentation to $V^D_M$ or $V_S$ and only to a section of the volume, randomly selecting an $\mathcal{M}^n_L$ mask.
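The three volume augmentations can be sketched as below. Several details are assumptions rather than statements from the text: the augmented curves replace the original ones inside the selected mask, the Gaussian width `sigma` is arbitrary, and all names are hypothetical.

```python
import numpy as np


def roll_rows(volume, shift):
    """Volume Rolling: circularly shift every correlation curve (last axis)."""
    return np.roll(volume, shift, axis=-1)


def noise_like(volume, rng):
    """Volume Noising: uniform noise in [0, 1) with the volume's shape."""
    return rng.uniform(0.0, 1.0, size=volume.shape)


def zero_disparity_curve(height, width, sigma=1.0):
    """Volume Zeroing: Gaussian-like curve peaked where disparity j - k is 0."""
    j = np.arange(width)[:, None]
    k = np.arange(width)[None, :]
    curve = np.exp(-((j - k) ** 2) / (2.0 * sigma ** 2))
    return np.broadcast_to(curve, (height, width, width)).copy()


def apply_in_mask(volume, augmented, mask):
    """Use the augmented curves only at left pixels selected by mask (H, W)."""
    return np.where(mask[:, :, None], augmented, volume)


rng = np.random.default_rng(0)
V = rng.standard_normal((4, 6, 6))
mask = np.zeros((4, 6), dtype=bool)
mask[:, :3] = True                      # toy segment standing in for a mask

choice = rng.integers(3)                # exactly one augmentation is applied
if choice == 0:
    augmented = roll_rows(V, shift=int(rng.integers(1, 6)))
elif choice == 1:
    augmented = noise_like(V, rng)
else:
    augmented = zero_disparity_curve(4, 6)
V_aug = apply_in_mask(V, augmented, mask)
```

Each augmentation corrupts one branch's evidence inside a region, which is how the network can be pushed to rely on the other branch there.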
Volume Truncation. To further help Stereo Anywhere to handle mirror surfaces, we introduce a hand-crafted volume truncation operation on $V_S$. Firstly, we extract left confidence $C_M = \operatorname{softLRC}_L(\hat{M}_L, \hat{M}_R)$ to classify reliable monocular predictions. Then, we create a truncate mask $T \in [0,1]^{\frac{H}{4} \times \frac{W}{4}}$ using the following logic condition: $(T)_{ij} = \left[ \left( (\hat{M}_L)_{ij} > (\hat{D}_L)_{ij} \right) \wedge (C_M)_{ij} \right] \vee \left[ (C_M)_{ij} \wedge \neg (\hat{C}_L)_{ij} \right]$. We implement this logic using fuzzy operators (more details in the supplementary material). The rationale is that stereo predicts farther depths on mirror surfaces: the mirror is perceived as a window into a new environment, specular to the real one. Finally, for values of $T > T_m = 0.98$, we truncate $V_S$ using a sigmoid curve centered at the correlation value predicted by $\hat{M}_L$ (i.e., the real disparity of mirror surfaces), preserving only the stereo correlation curve not "piercing" mirrors.
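The truncate mask can be illustrated with one common choice of fuzzy operators. The operators actually used are given in the supplementary material, so everything below is an assumption: product for AND, probabilistic sum for OR, complement for NOT, and a steep sigmoid for the comparison. The function name and the `sharpness` parameter are hypothetical. The sketch covers only the mask $T$, not the truncation of $V_S$.

```python
import numpy as np


def truncate_mask(scaled_mono, coarse_disp, conf_mono, conf_coarse,
                  sharpness=10.0):
    """Fuzzy version of T = (M > D and C_M) or (C_M and not C_L)."""
    x = np.clip(sharpness * (scaled_mono - coarse_disp), -60.0, 60.0)
    greater = 1.0 / (1.0 + np.exp(-x))       # soft "M > D"
    a = greater * conf_mono                  # fuzzy AND (product)
    b = conf_mono * (1.0 - conf_coarse)      # fuzzy AND with fuzzy NOT
    return a + b - a * b                     # fuzzy OR (probabilistic sum)


scaled_mono = np.array([[10.0, 10.0, 10.0, 10.0]])
coarse_disp = np.array([[5.0, 5.0, 15.0, 15.0]])
conf_mono = np.array([[1.0, 0.0, 1.0, 1.0]])
conf_coarse = np.array([[1.0, 1.0, 1.0, 0.0]])

T = truncate_mask(scaled_mono, coarse_disp, conf_mono, conf_coarse)
truncate_here = T > 0.98                     # threshold T_m from the text
```

In this toy row, the first pixel fires the first clause, the fourth fires the second clause, and the other two leave the volume untouched.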
3.3. Iterative Disparity Estimation
We aim to estimate a series of refined disparity maps $\{D^1 = \hat{M}_L, D^2, \dots, D^l, \dots\}$ exploiting the guidance from both stereo and mono branches. Starting from the Multi-GRU update operator by [55], we introduce a second lookup operator that extracts correlation features $G_M$ from the additional volume $V^D_M$, (7) in Fig. 2. The two sets of correlation features from $G_S$ and $G_M$ are processed by the same two-layer encoder and concatenated with features derived from the current disparity estimation $D^l$. This concatenation is further processed by a 2D conv layer, and then by the ConvGRU operator. We inherit the context upsampling module [55] to upsample final disparity to full resolution.
3.4. Training Supervision
We supervise the iterative module using the well-known L1 loss with exponentially increasing weights [55], then $\hat{D}_L$, $\hat{D}_R$, $\hat{M}_L$ and $\hat{M}_R$ using the L1 loss, finally $\hat{C}_L$ and $\hat{C}_R$ using the Binary Cross Entropy loss. We invite the reader to read the supplementary material for additional details.
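The L1 loss with exponentially increasing weights over the iterative predictions can be sketched as follows. The decay value `gamma = 0.9` and the function name are assumptions, not values stated in the text. The auxiliary L1 and Binary Cross Entropy terms are omitted.

```python
import numpy as np


def sequence_l1_loss(predictions, target, gamma=0.9):
    """Weighted sum of L1 errors; later iterations receive larger weights."""
    n = len(predictions)
    loss = 0.0
    for l, pred in enumerate(predictions):
        weight = gamma ** (n - 1 - l)        # weight 1 for the final iteration
        loss += weight * np.abs(pred - target).mean()
    return loss


rng = np.random.default_rng(0)
gt = rng.uniform(0.0, 64.0, (4, 6))
preds = [gt + 1.0, gt]                       # two iterations, second is exact
loss = sequence_l1_loss(preds, gt)
```

With two iterations, the first prediction (off by one pixel everywhere) is weighted by `gamma` and the exact final prediction contributes nothing, so the loss equals `gamma`.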
4. The MonoTrap Dataset
Monocular depth estimation is known to potentially fail in the presence of perspective illusions. The reader may wonder how Stereo Anywhere would behave in such cases: would it blindly trust the monocular VFM or rely on the stereo geometric principles to maintain robustness?
To answer these questions, we introduce MonoTrap, a novel stereo dataset specifically designed to challenge monocular depth estimation. Our dataset comprises 26 scenes featuring perspective illusions, captured with a calibrated stereo setup and annotated with ground-truth depth from an Intel Realsense L515 LiDAR. The scenes contain carefully designed planar patterns that create visual illusions, such as apparent holes in walls or floors and simulated transparent surfaces that reveal content behind them. Figure 3 shows examples from our dataset that illustrate how these visual illusions easily fool monocular methods.
5. Experiments
We describe our implementation details, datasets, and evaluation protocols, followed by experiments. We also refer the reader to the supplementary material for more results.
5.1. Implementation and Experimental Settings
We implement Stereo Anywhere using PyTorch, starting from the RAFT-Stereo codebase [55]. We use Depth Anything 2 [121] as the VFM fueling our model, using the Large weights provided by the authors, trained on ground-truth labels from the HyperSim synthetic dataset [81] only. Starting from the SceneFlow RAFT-Stereo checkpoint, we train Stereo Anywhere on a single A100 GPU for 3 epochs, with learning rate 1e-4 and AdamW optimizer, on batches of 2 images. We extract random crops of size 320×640 from images and apply standard color and spatial augmentations [55]. The VFM is used only to source monocular depth maps, remaining frozen during training. The number of iterations for GRUs is fixed to 12 during training and increased to 32 at inference time.

| Experiment | Booster (Q) bad>2 | bad>4 | bad>6 | bad>8 | Avg. (px) | Middlebury 2014 (H) bad>2 All | Noc | Occ | Avg. (px) |
|---|---|---|---|---|---|---|---|---|---|
| (A) Baseline [55] | 17.84 | 13.06 | 10.76 | 9.24 | 3.59 | 11.15 | 8.06 | 29.06 | 1.55 |
| (B) (A) + Monocular Context w/o re-train | 15.85 | 10.98 | 8.89 | 7.69 | 3.05 | 14.96 | 11.70 | 34.38 | 2.82 |
| (C) (A) + Monocular Context w/ re-train | 14.94 | 10.40 | 8.61 | 7.63 | 3.03 | 9.62 | 6.98 | 25.39 | 1.13 |
| (D) (C) + Normals Correlation Volume / Scaled Depth | 11.33 | 6.88 | 5.32 | 4.59 | 1.87 | 7.67 | 5.24 | 21.51 | 0.96 |
| (E) (D) + Volume augmentation / truncation | 9.01 | 5.40 | 4.12 | 3.34 | 1.21 | 6.96 | 4.75 | 20.34 | 0.94 |

Table 1. Ablation Studies. We measure the impact of different design strategies. Networks trained on SceneFlow [61].

| Model | Middlebury 2014 (H) bad>2 All | Noc | Occ | Avg. (px) | Middlebury 2021 bad>2 All | Noc | Occ | Avg. (px) | ETH3D bad>1 All | Noc | Occ | Avg. (px) | KITTI 2012 bad>3 All | Noc | Occ | Avg. (px) | KITTI 2015 bad>3 All | Noc | Occ | Avg. (px) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RAFT-Stereo [55] | 11.15 | 8.06 | 29.06 | 1.55 | 12.05 | 9.38 | 37.89 | 1.81 | 2.59 | 2.24 | 8.78 | 0.25 | 4.80 | 4.23 | 29.21 | 0.89 | 5.44 | 5.21 | 14.09 | 1.16 |
| PSMNet [8] | 18.79 | 13.80 | 53.22 | 4.63 | 23.67 | 20.61 | 53.75 | 5.70 | 19.75 | 18.62 | 42.05 | 0.94 | 6.73 | 5.81 | 46.24 | 1.22 | 6.78 | 6.40 | 24.85 | 1.38 |
| GMStereo [117] | 15.63 | 10.98 | 46.04 | 1.87 | 25.43 | 22.43 | 54.70 | 2.86 | 6.22 | 5.58 | 19.97 | 0.42 | 5.68 | 4.87 | 38.84 | 1.10 | 5.72 | 5.44 | 17.33 | 1.21 |
| ELFNet [59] | 24.48 | 16.94 | 77.06 | 8.61 | 27.08 | 21.77 | 85.56 | 11.01 | 25.61 | 24.50 | 46.06 | 5.65 | 10.52 | 8.67 | 88.21 | 2.30 | 9.61 | 8.22 | 85.64 | 2.16 |
| PCVNet [132] | 16.79 | 13.54 | 35.66 | 2.96 | 12.92 | 10.19 | 40.23 | 2.18 | 4.24 | 3.61 | 14.01 | 0.41 | 4.44 | 3.92 | 27.70 | 0.89 | 5.08 | 4.88 | 13.72 | 1.24 |
| DLNR [140] | 9.46 | 6.20 | 28.75 | 1.45 | 8.44 | 5.88 | 32.71 | 1.24 | 23.12 | 22.94 | 26.93 | 9.89 | 9.45 | 8.83 | 36.75 | 1.59 | 15.74 | 15.41 | 34.32 | 2.83 |
| Selective-RAFT [110] | 12.05 | 9.46 | 27.42 | 2.35 | 15.69 | 13.86 | 36.32 | 5.92 | 4.36 | 3.81 | 10.23 | 0.34 | 5.71 | 5.16 | 30.54 | 1.08 | 6.50 | 6.22 | 18.44 | 1.27 |
| Selective-IGEV [110] | 9.98 | 7.09 | 27.62 | 1.60 | 8.89 | 6.34 | 32.88 | 1.60 | 6.42 | 5.71 | 18.71 | 1.73 | 6.22 | 5.54 | 34.78 | 1.09 | 5.87 | 5.66 | 14.99 | 1.42 |
| IGEV-Stereo [116] | 9.91 | 7.08 | 26.26 | 1.84 | 9.15 | 6.43 | 34.88 | 1.53 | 4.30 | 3.86 | 12.65 | 0.38 | 5.65 | 4.43 | 33.38 | 1.03 | 5.87 | 5.13 | 14.31 | 1.34 |
| NMRF [28] | 14.08 | 10.87 | 34.62 | 2.91 | 23.36 | 21.69 | 42.51 | 8.57 | 4.34 | 3.66 | 17.15 | 0.42 | 4.62 | 4.05 | 30.65 | 0.92 | 5.24 | 5.07 | 12.28 | 1.16 |
| Stereo Anywhere (ours) | 6.96 | 4.75 | 20.34 | 0.94 | 7.97 | 5.71 | 29.52 | 1.08 | 1.66 | 1.43 | 5.29 | 0.24 | 3.90 | 3.52 | 21.65 | 0.83 | 3.93 | 3.79 | 11.01 | 0.97 |

Table 2. Zero-shot Generalization. Comparison with state-of-the-art deep stereo models. Networks trained on SceneFlow [61].
5.2. Evaluation Datasets & Protocol
Datasets. We utilize SceneFlow [61] as our sole training dataset, comprising about 39k synthetic stereo pairs with dense ground-truth disparities. For evaluation, we employ several benchmarks: Middlebury 2014 [86] and its 2021 extension [65] provide high-resolution indoor scenes with semi-dense labels (15 and 24 stereo pairs), KITTI 2012 [23] and 2015 [62] feature outdoor driving scenarios (∼200 pairs each at 1280×384 with sparse LiDAR ground truth), and ETH3D [87] contributes 27 low-resolution indoor/outdoor scenes. For non-Lambertian surfaces, we primarily use Booster [127], containing 228 high-resolution (12 Mpx) indoor pairs with its 191-pair online benchmark, and LayeredFlow [115], featuring 400 pairs with transparent objects and sparse ground truth (∼50 points per pair). Additionally, we include our newly proposed MonoTrap dataset focusing on optical illusions. For zero-shot evaluation, we test on KITTI 2015, Middlebury v3 at half (H) resolution, Middlebury 2021, and ETH3D, while non-Lambertian zero-shot testing relies on Booster at quarter (Q) resolution and LayeredFlow at eighth (E) resolution.
Evaluation Metrics. We evaluate our method using two standard metrics: the average pixel error (Avg.), which computes the absolute difference between predicted and ground truth disparities averaged over all pixels, and the bad > τ error, which measures the percentage of pixels with a disparity error greater than τ pixels. For the latter, we compute it considering all pixels, or only non-occluded or occluded pixels, referred to as All, Noc or Occ respectively. We evaluate on MonoTrap through standard monocular depth metrics [25]: Absolute relative error (AbsRel), RMSE, and δ < 1.05 score.
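The metrics above can be written in a few lines. The sketch assumes a boolean `valid` mask that selects the pixels to score (all labeled pixels, or the Noc/Occ subsets), and it follows the usual definition of the δ score as the share of pixels whose ratio max(pred/gt, gt/pred) is below the threshold. Function names and toy values are hypothetical.

```python
import numpy as np


def avg_error(pred, gt, valid):
    """Average absolute disparity error in pixels (Avg.)."""
    return np.abs(pred - gt)[valid].mean()


def bad_tau(pred, gt, valid, tau):
    """Percentage of valid pixels with error greater than tau pixels."""
    return 100.0 * (np.abs(pred - gt)[valid] > tau).mean()


def abs_rel(pred, gt, valid):
    """Absolute relative depth error, in percent."""
    return 100.0 * (np.abs(pred - gt)[valid] / gt[valid]).mean()


def rmse(pred, gt, valid):
    """Root mean squared depth error."""
    return np.sqrt(((pred - gt)[valid] ** 2).mean())


def delta_score(pred, gt, valid, threshold=1.05):
    """Percentage of valid pixels with max(pred/gt, gt/pred) < threshold."""
    ratio = np.maximum(pred / gt, gt / pred)[valid]
    return 100.0 * (ratio < threshold).mean()


gt = np.array([10.0, 20.0, 30.0, 40.0])
pred = np.array([10.2, 23.0, 30.0, 45.0])
all_px = np.ones(4, dtype=bool)
occ_px = np.array([False, True, False, False])    # toy occlusion mask

avg_all = avg_error(pred, gt, all_px)             # (0.2 + 3 + 0 + 5) / 4
bad2_all = bad_tau(pred, gt, all_px, tau=2.0)     # 2 of 4 pixels exceed 2 px
bad2_occ = bad_tau(pred, gt, occ_px, tau=2.0)     # the single occluded pixel
delta_all = delta_score(pred, gt, all_px)         # ratios 1.02 and 1.0 pass
```

Changing only the `valid` mask switches between the All, Noc and Occ variants of the same metric.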
5.3. Ablation Study
We start our analysis by evaluating how individual components of our model contribute to the overall accuracy. All model variants are trained solely on the synthetic SceneFlow dataset and tested on Booster and Middlebury 2014, allowing us to examine their effectiveness on non-Lambertian surfaces and general scenes.
Table 1 summarizes our findings. In (A), we report the performance of our baseline model, upon which we build Stereo Anywhere, i.e., RAFT-Stereo [55]. On the one hand, by adding monocular context from an off-the-shelf monocular depth network to the pre-trained context backbone (B), we observe improved performance on non-Lambertian surfaces, though at the expense of a general drop in accuracy on Middlebury. On the other hand, by re-training the context backbone to process depth maps obtained from the monocular network on SceneFlow (C), we can appreciate a consistent improvement in both datasets. Introducing the normals correlation volume with subsequent differentiable depth scaling (D) significantly enhances the accuracy on non-Lambertian surfaces, also showing improvements on indoor scenes. Finally, cost volume augmentations and truncation (E) demonstrate positive effects on transparent surfaces and mirrors present in the Booster dataset by further reducing the bad-2 metric by approximately 1.5% and Avg. by 0.7 pixels, with minimal influence on Middlebury. According to these results, from now on, we will adopt (E) as the default setting for Stereo Anywhere.
[Figure 4 panels. Columns: RGB, RAFT-Stereo [55], DLNR [140], NMRF [28], Selective-IGEV [110], Stereo Anywhere. Rows: KITTI 15, Middlebury, ETH3D.]
Figure 4. Qualitative Results: Zero-Shot Generalization. Predictions by state-of-the-art models and Stereo Anywhere.
| Model | Booster (Q) Error Rate (%) >2 | >4 | >6 | >8 | Avg. (px) | LayeredFlow (E) Error Rate (%) >1 | >3 | >5 | Avg. (px) |
|---|---|---|---|---|---|---|---|---|---|
| RAFT-Stereo [55] | 17.84 | 13.06 | 10.76 | 9.24 | 3.59 | 89.21 | 79.02 | 71.61 | 19.27 |
| PSMNet [8] | 34.47 | 24.83 | 20.46 | 17.77 | 7.26 | 91.85 | 79.84 | 70.04 | 21.18 |
| GMStereo [117] | 32.44 | 22.52 | 17.96 | 15.02 | 5.29 | 92.95 | 83.68 | 74.76 | 20.91 |
| ELFNet [59] | 45.52 | 35.79 | 30.72 | 27.33 | 14.04 | 93.08 | 82.24 | 70.41 | 20.19 |
| PCVNet [132] | 22.63 | 16.51 | 13.81 | 12.08 | 4.70 | 88.27 | 76.65 | 66.79 | 18.19 |
| DLNR [140] | 18.56 | 14.55 | 12.61 | 11.22 | 3.97 | 89.90 | 79.46 | 72.72 | 18.97 |
| Selective-RAFT [110] | 20.01 | 15.08 | 12.52 | 10.88 | 4.12 | 92.69 | 86.32 | 78.82 | 20.18 |
| Selective-IGEV [110] | 18.52 | 14.24 | 12.14 | 10.77 | 4.38 | 91.31 | 81.72 | 74.74 | 19.65 |
| IGEV-Stereo [116] | 16.90 | 13.23 | 11.40 | 10.20 | 3.94 | 87.28 | 80.07 | 72.91 | 19.07 |
| NMRF [28] | 27.08 | 19.06 | 15.43 | 13.21 | 5.02 | 89.08 | 79.13 | 70.51 | 20.17 |
| Stereo Anywhere (ours) | 9.01 | 5.40 | 4.12 | 3.34 | 1.21 | 81.83 | 57.66 | 45.12 | 11.20 |
| Fine-tuned models (*): Booster (Q) Online Benchmark and LayeredFlow (E) | | | | | | | | | |
| DKT-RAFT [136] (*) | 10.32 | 7.13 | 5.65 | 4.36 | 1.70 | 66.05 | 46.95 | 37.77 | 8.72 |
| Stereo Anywhere (ours) (*) | 6.52 | 2.82 | 1.77 | 1.27 | 0.73 | 51.24 | 25.63 | 15.65 | 4.84 |

Table 3. Zero-shot Non-Lambertian Generalization. Comparison with state-of-the-art models. Networks trained on SceneFlow [61]. (*) means fine-tuned on Booster training set.
5.4. Zero-Shot Generalization
We now compare our Stereo Anywhere model against state-of-the-art deep stereo networks, assessing zero-shot generalization capability when transferred from synthetic to real images. Purposely, we follow a well-established benchmark in the literature [55,104], evaluating on real datasets models pre-trained exclusively on SceneFlow [61].
Table 2 compares Stereo Anywhere with off-the-shelf stereo networks using authors' provided weights. Considering All, Noc, and Avg. metrics, we can notice how Stereo Anywhere achieves consistently better results across most datasets, achieving almost 3% lower bad-2 All on Middlebury 2014 versus the second-best method DLNR [140], and breaking the 4% barrier on KITTI's bad-3 All metric.
The Occ metric further demonstrates how Stereo Anywhere consistently outperforms other stereo models on every dataset, with substantial margins over the second-best, i.e., approximately 6% on Middlebury 2014 and KITTI 2012, and 3% on ETH3D. This confirms that leveraging priors from VFMs for monocular depth estimation effectively improves the stereo matching estimation accuracy in challenging conditions where stereo matching is ill-posed, such as at occluded regions.
Figure 4 shows predictions on KITTI 2015, Middlebury 2014, and ETH3D samples. In particular, the first row shows an extremely challenging case for SceneFlow-trained models, where Stereo Anywhere achieves accurate disparity maps thanks to VFM priors.
5.5. Ze o-Sho Non-Lambe ian Gene aliza ion
We now assess he gene aliza ion capabili ies o S e eo
Anywhe e and exis ing s e eo models when dealing wi h
non-Lambe ian ma e ials, such as anspa en su aces o
mi o s. To his end, we conduc a ze o-sho gene aliza ion
e alua ion expe imen on he Boos e [74] and Laye ed-
Flow [115] da ase s, once again using models p e- ained
on SceneFlow [61] – wi h weigh s p o ided by he au ho s.
Table 3 shows the outcome of this evaluation. This time,
we can perceive even more clearly how Stereo Anywhere is
the absolute winner, demonstrating unprecedented robustness
in the presence of non-Lambertian surfaces despite being
trained only on synthetic stereo data, not even featuring
such objects. These results further validate how leveraging
strong priors from existing VFMs for monocular depth estimation
can play a game-changing role in stereo matching as
well, especially when lacking training data explicitly targeting
critical conditions such as non-Lambertian surfaces. At
the bottom, we report results achieved by fine-tuning Stereo
Anywhere on the Booster training set and evaluating on the
online benchmark. Our model ranks first when evaluated at
quarter resolution.

[Figure 5 image. Columns: RGB, RAFT-Stereo [55], DLNR [140], NMRF [28], Selective-IGEV [110], Stereo Anywhere. Rows: Booster, LayeredFlow.]
Figure 5. Qualitative results: Zero-Shot non-Lambertian Generalization. Predictions by state-of-the-art models and Stereo Anywhere.

MonoTrap
Model                      AbsRel (%) ↓   RMSE (m) ↓   σ < 1.05 (%) ↑
Depth Anything 2 [121]        53.46          0.36          15.21
Depth Anything 2 [121]†       27.92          0.27          19.43
DepthPro [6]                  47.77          0.32          21.90
DepthPro [6]†                 20.82          0.22          22.88
RAFT-Stereo [55]               5.01          0.09          77.05
Stereo Anywhere                3.50          0.06          80.27
Table 4. MonoTrap Benchmark. Comparison with state-of-the-art
monocular depth estimation models and RAFT-Stereo. Both
RAFT-Stereo and Stereo Anywhere are trained on SceneFlow
[61]. † refers to robust scaling through RANSAC.
Figure 5 shows examples from Booster and LayeredFlow,
where Stereo Anywhere is the only stereo model correctly
perceiving the mirror and transparent railing.
5.6. MonoTrap Benchmark
We conclude our evaluation by running experiments on our
newly collected MonoTrap dataset to prove the robustness
of Stereo Anywhere in the presence of critical conditions
harming the accuracy of monocular depth predictors.
Table 4 collects the results achieved by state-of-the-art
monocular depth estimation models, the baseline stereo
model over which we built our framework (RAFT-Stereo)
and Stereo Anywhere. Regarding the former models, as
they predict affine-invariant depth maps, following the
literature [78] we use least square errors to align them to the
ground-truth. As these models are fooled by the visual
illusions, this scaling procedure is likely to yield sub-optimal
scale and shift parameters. Therefore, we alternatively align
to ground-truth depth through a more robust RANSAC fitting,
denoted with † in the table.
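The two alignment strategies described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `fit_scale_shift` and `fit_scale_shift_ransac`, the iteration count, the inlier threshold, and the synthetic data are all hypothetical, and the space in which the authors align (depth, inverse depth, or disparity) is not specified in this section.

```python
import numpy as np

def fit_scale_shift(pred, gt):
    """Closed-form least squares for (s, t) minimizing ||s * pred + t - gt||^2."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return float(s), float(t)

def fit_scale_shift_ransac(pred, gt, iters=200, thresh=0.05, seed=0):
    """RANSAC: fit on random 2-point samples, keep the largest inlier set, refit on it."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        i, j = rng.choice(pred.size, size=2, replace=False)
        if pred[i] == pred[j]:
            continue  # degenerate sample, scale is undefined
        s = (gt[i] - gt[j]) / (pred[i] - pred[j])
        t = gt[i] - s * pred[i]
        inliers = np.abs(s * pred + t - gt) < thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best is None or best.sum() < 2:
        return fit_scale_shift(pred, gt)  # fall back to plain least squares
    return fit_scale_shift(pred[best], gt[best])

# Synthetic example: true relation gt = 2 * pred + 0.5, with 30% gross outliers
# standing in for pixels where a monocular model is fooled by an illusion.
pred = np.linspace(0.0, 1.0, 100)
gt = 2.0 * pred + 0.5
gt[:30] += 3.0

s_ls, t_ls = fit_scale_shift(pred, gt)         # biased by the outliers
s_rs, t_rs = fit_scale_shift_ransac(pred, gt)  # recovers roughly (2.0, 0.5)
```

The plain least-squares fit is pulled toward the corrupted pixels, while the RANSAC variant estimates scale and shift from the consistent majority only, which is the motivation given in the text for the † rows.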
On the one hand, by comparing monocular and stereo
methods, we notice how the failures of the former negatively
impact their evaluation metrics. Once again, we remark
that a direct comparison across the two families of
methods is not the main goal of this experiment. On the
other hand, we focus on the comparison between RAFT-Stereo
and Stereo Anywhere, with our model performing
slightly better than its baseline. This fact proves that despite
its strong reliance on the priors retrieved from VFMs
for monocular depth estimation, Stereo Anywhere can properly
ignore such priors when unreliable.

[Figure 6 image. Columns: RGB, Depth Anything 2 [121], Stereo Anywhere.]
Figure 6. Qualitative results: MonoTrap. Stereo Anywhere is
not fooled by erroneous predictions by its monocular engine [121].
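For reference, the three metrics reported in Table 4 can be sketched as below. The definitions are the ones commonly used for depth evaluation and are an assumption here (in particular, σ < 1.05 is taken to be the share of pixels with max(pred/gt, gt/pred) below 1.05); the function name `depth_metrics` and the toy values are illustrative only.

```python
import numpy as np

def depth_metrics(pred, gt, thresh=1.05):
    """Return (AbsRel in %, RMSE in depth units, threshold accuracy in %)."""
    valid = gt > 0                      # assumed convention: 0 marks missing ground truth
    p, g = pred[valid], gt[valid]
    abs_rel = 100.0 * float(np.mean(np.abs(p - g) / g))
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    ratio = np.maximum(p / g, g / p)    # assumes strictly positive predictions
    sigma = 100.0 * float(np.mean(ratio < thresh))
    return abs_rel, rmse, sigma

# Toy example in meters; the last pixel has no ground truth and is skipped.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.0, 2.2, 4.1, 3.0])
abs_rel, rmse, sigma = depth_metrics(pred, gt)
```

Lower is better for AbsRel and RMSE and higher is better for the threshold accuracy, matching the arrows in the table header.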
Figure 6 shows three samples where Depth Anything 2
fails while Stereo Anywhere does not.
6. Conclusion
In this paper, we introduced Stereo Anywhere, a novel
stereo matching framework that leverages monocular depth
VFMs to overcome traditional stereo matching limitations.
Combining stereo geometric constraints with monocular
priors, our approach demonstrates superior zero-shot
generalization and robustness to challenging conditions like
textureless regions, occlusions, and non-Lambertian surfaces.
Furthermore, through our novel MonoTrap dataset,
we showed that Stereo Anywhere effectively combines the
best of both worlds, maintaining stereo matching's geometric
accuracy where monocular methods fail, while leveraging
monocular priors to handle challenging stereo scenarios.
Extensive comparisons against state-of-the-art networks in
zero-shot settings validate these findings.
Acknowledgement. This study was carried out within the
MOST – Sustainable Mobility National Research Center and
received funding from the European Union Next-GenerationEU –
PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) –
MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.4 – D.D.
1033 17/06/2022, CN00000023. This manuscript reflects only the
authors' views and opinions, neither the European Union nor the
European Commission can be considered responsible for them.
This study was funded by the European Union – Next Generation
EU within the framework of the National Recovery and
Resilience Plan NRRP – Mission 4 “Education and Research” –
Component 2 - Investment 1.1 “National Research Program and
Projects of Significant National Interest Fund (PRIN)” (Call D.D.
MUR n. 104/2022) – PRIN2022 – Project reference: “RiverWatch:
a citizen-science approach to river pollution monitoring”
(ID: 2022MMBA8X, CUP: J53D23002260006).
We also acknowledge the CINECA award under the ISCRA
initiative, for the availability of high-performance computing
resources and support.
References
[1] Filippo Aleotti, Fabio Tosi, Li Zhang, Matteo Poggi, and
Stefano Mattoccia. Reversing the cycle: self-supervised
deep stereo through enhanced monocular distillation. In
Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part XI
16, pages 614–632. Springer, 2020. 2, 3
[2] Filippo Aleotti, Fabio Tosi, Pierluigi Zama Ramirez, Matteo
Poggi, Samuele Salti, Stefano Mattoccia, and Luigi
Di Stefano. Neural disparity refinement for arbitrary resolution
stereo. In 2021 International Conference on 3D Vision
(3DV), pages 207–217. IEEE, 2021. 2, 3
[3] Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis,
and Nikos Papamarkos. Monocular depth estimation:
A thorough review. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2023. 1
[4] Antyanta Bangunharcana, Jae Won Cho, Seokju Lee, In So
Kweon, Kyung-Soo Kim, and Soohyun Kim. Correlate-and-excite:
Real-time stereo matching via guided cost volume
excitation. In IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), 2021. 2, 4
[5] Luca Bartolomei, Matteo Poggi, Fabio Tosi, Andrea Conti,
and Stefano Mattoccia. Active stereo without pattern projector.
In Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 18470–18482, 2023. 2
[6] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain,
Marcel Santos, Yichao Zhou, Stephan R. Richter, and
Vladlen Koltun. Depth pro: Sharp monocular metric depth
in less than a second. arXiv, 2024. 8
[7] Changjiang Cai, Matteo Poggi, Stefano Mattoccia, and
Philippos Mordohai. Matching-space stereo networks for
cross-domain generalization. In 2020 International Conference
on 3D Vision (3DV), pages 364–373, 2020. 2
[8] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo
matching network. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 5410–5418,
2018. 2, 6, 7
[9] Liyan Chen, Weihan Wang, and Philippos Mordohai.
Learning the distribution of errors in stereo matching for
joint disparity and uncertainty estimation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 17235–17244, 2023. 2
[10] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng.
Single-image depth perception in the wild. In Proceedings
of the 30th International Conference on Neural Information
Processing Systems, pages 730–738, Red Hook, NY, USA,
2016. Curran Associates Inc. 3
[11] Xihao Chen, Zhiwei Xiong, Zhen Cheng, Jiayong Peng,
Yueyi Zhang, and Zheng-Jun Zha. Degradation-agnostic
correspondence from resolution-asymmetric stereo. In Proceedings
of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 12962–12971,
2022. 3
[12] Zhi Chen, Xiaoqing Ye, Wei Yang, Zhenbo Xu, Xiao Tan,
Zhikang Zou, Errui Ding, Xinming Zhang, and Liusheng
Huang. Revealing the reciprocal relations between self-supervised
stereo and monocular depth estimation. In Proceedings
of the IEEE/CVF International Conference on
Computer Vision (ICCV), pages 15529–15538, 2021. 3
[13] Ziyang Chen, Wei Long, He Yao, Yongjun Zhang, Bingshu
Wang, Yongbin Qin, and Jia Wu. Mocha-stereo: Motif
channel attention network for stereo matching. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2024. 2
[14] Junda Cheng, Longliang Liu, Gangwei Xu, Xianqi Wang,
Zhaoxing Zhang, Yong Deng, Jinliang Zang, Yurui Chen,
Zhipeng Cai, and Xin Yang. Monster: Marry monodepth to
stereo unleashes power. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
2025. 3
[15] Kelvin Cheng, Tianfu Wu, and Christopher Healey. Revisiting
non-parametric matching cost volumes for robust and
generalizable stereo matching. Advances in Neural Information
Processing Systems, 35:16305–16318, 2022. 2
[16] Jaehoon Cho, Dongbo Min, Youngjung Kim, and
Kwanghoon Sohn. Diml/cvl rgb-d dataset: 2m rgb-d images
of natural indoor and outdoor scenes. arXiv preprint
arXiv:2110.11590, 2021. 3
[17] WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad,
Alireza Bab-Hadiashar, and David Suter. Itsa: An
information-theoretic approach to automatic shortcut
avoidance and domain generalization in stereo matching
networks. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages
13022–13032, 2022. 2
[18] Alex Costanzino, Pierluigi Zama Ramirez, Matteo Poggi,
Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano.
Learning depth estimation for transparent and mirror surfaces.
In Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 9244–9255, 2023. 3
[19] Yiqun Duan, Xianda Guo, and Zheng Zhu. DiffusionDepth:
Diffusion denoising approach for monocular depth estimation.
arXiv preprint arXiv:2303.05021, 2023. 3
[20] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir
Zamir. Omnidata: A scalable pipeline for making multi-