Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation

Author: Bartolomei, Luca; Mannocci, Enrico; Tosi, Fabio; Poggi, Matteo; Mattoccia, Stefano

Publisher: Zenodo

DOI: 10.5281/zenodo.17672408

Source: https://zenodo.org/records/17672408/files/Bartolomei_Depth_AnyEvent_A_Cross-Modal_Distillation_Paradigm_for_Event-Based_Monocular_Depth_ICCV_2025_paper.pdf

Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
Luca Bartolomei∗,† Enrico Mannocci† Fabio Tosi† Matteo Poggi∗,† Stefano Mattoccia∗,†
∗Advanced Research Center on Electronic System (ARCES)
†Department of Computer Science and Engineering (DISI)
University of Bologna, Italy
https://bartn8.github.io/depthanyevent
[Figure 1 panels: Frame (only for distillation), Events, E2Depth, DepthAnyEvent-R, DepthAnyEvent-R (Distillation). RMSE values printed on the error maps: 9.198, 13.160, and 11.535.]
Figure 1. DepthAnyEvent-R in action. The first column shows the input frame (used only for distillation) and the corresponding event visualization. The other three columns present depth estimation results from different approaches: E2Depth [15], our DepthAnyEvent-R, and our DepthAnyEvent-R trained with our distillation approach. The top row shows the estimated depth maps while the bottom row depicts their corresponding RMSE visualizations.
Abstract
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup even available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs, either a vanilla one like Depth Anything v2 (DAv2), or deriving from it a novel recurrent architecture to infer depth from monocular event cameras. We evaluate our approach with synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
This ICCV paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.
1. Introduction
Depth perception from cameras is paramount for many application fields, such as those concerning the autonomous navigation of agents in complex scenarios or robotic tasks. In these fields, learning-based methods using conventional cameras have obtained compelling results in the last decade. Moreover, this paradigm enabled inferring depth from a single camera, which brings significant advantages compared to multicamera setups in terms of cost, calibration complexity, and physical constraints. Nonetheless, conventional camera systems struggle to provide a prompt and reliable perception of the sensed environment when dealing with highly dynamic scenes resulting from the fast movement of vehicles, drones, robots, or in the presence of challenging illumination conditions such as high contrast scenarios, low light, or rapid lighting changes. These limitations are intrinsic to the conventional camera acquisition technology occurring at discrete periodic intervals and with a limited dynamic range, causing motion blur, over/underexposure, and potentially missing critical information between frames. In contrast, the intrinsic ability to capture scene changes as soon as they appear – with microsecond temporal resolution – and the much higher dynamic range made event cameras [7] ideal for coping with the challenging application fields mentioned above. Event cameras only register brightness changes at each pixel independently, offering exceptional temporal resolution and robustness to lighting variations. However, these features come at the cost of meager information content compared to conventional cameras. Event cameras provide meaningful cues only for a small subset of the framed image with sufficient texture to trigger events, making depth perception from these devices extremely challenging. Moreover, the lack of large datasets with dense ground truth annotations further exacerbates this inherent difficulty, as collecting precise depth ground truth for event data remains costly and technically demanding.
To tackle these issues in a monocular event camera setup, we propose to leverage the effectiveness of image-based Vision Foundation Models (VFMs) for monocular depth estimation. They have demonstrated remarkable capabilities through extensive pretraining on vast image collections, enabling robust depth prediction even in challenging scenarios. As the first contribution, given sequences of aligned images and events, we propose a cross-modal distillation strategy that allows us to obtain dense proxy labels from a VFM to train event-based networks. This approach effectively transfers knowledge from the data-rich image domain to the data-sparse event domain. For our purposes, an off-the-shelf device like a DAVIS Camera [29,32] that incorporates a conventional global shutter camera and an event-based sensor in the same pixel array would suffice to gather spatially aligned event streams and RGB frames.
Additionally, as the second contribution, we propose to adapt VFMs to event-based monocular depth estimation, either using a vanilla model like Depth Anything v2 (DAv2) or a novel recurrent architecture derived from it. To prove the effectiveness of our proposals, we assess the performance with synthetic and real-world datasets, showing that our cross-modal distillation paradigm allows for achieving competitive performance compared to fully supervised approaches, disregarding the need for expensive depth annotation. Moreover, adapting VFMs to monocular depth estimation according to our two proposals is state-of-the-art, setting new benchmarks for event-based depth estimation.
Figure 1 shows the compelling performance of our proposals, and our contributions can be summarized as follows:
• A novel cross-modal distillation paradigm that leverages the robust proxy labels obtained from image-based VFMs for monocular depth estimation.
• An adapting strategy to cast existing image-based VFMs into the event domain effortlessly.
• A novel recurrent architecture based on an adapted image-based VFM.
• Adapting VFMs to the event domain yields state-of-the-art performance, and our distillation paradigm is competitive against the supervision from depth sensors.
2. Related Work
Image-Based Monocular Depth Estimation. Monocular depth estimation has evolved from traditional approaches [27] to deep learning methods [6,18]. Self-supervised techniques [12,13,38] have emerged to address the challenge of limited ground truth data by recasting depth estimation as an image reconstruction task using stereo images or videos. These approaches have been particularly valuable where dense depth annotations are expensive to obtain. A significant step came with affine-invariant models [25,26] that estimate depth up to an unknown scale and shift, allowing impressive cross-domain generalization capabilities. MiDaS [26] pioneered this direction by training on diverse large-scale datasets, followed by DPT [25] and, more recently, the Depth Anything series [33,34]. These latter models represent the first generation of Visual Foundation Models for monocular depth estimation, leveraging large-scale pretraining and diverse data sources to achieve unprecedented robustness. The effectiveness of these models lies in their ability to combine knowledge from various domains, including internet photo collections [20,35], LiDAR from autonomous driving scenarios [10], and RGB-D sensors [23]. Recent advances in VFMs have focused on improving metric accuracy through camera parameter integration [14,36], leveraging generative approaches like diffusion models [5,17,28], and addressing temporal consistency [30]. Furthermore, attention-based architectures and transformer models [37] have shown significant improvements in capturing long-range dependencies crucial for accurate depth. Despite recent advances, applying these methods to event-based cameras is still limited by the lack of large-scale annotated datasets. We tackle this by distilling knowledge from frame-based VFMs, enabling accurate depth estimation without costly event data annotations.
Event-based Monocular Depth Estimation. Event-based depth estimation began with supervised approaches using recurrent architectures [8,15,21] designed to process the temporal information contained in event streams. Advanced models like [8] further expanded this concept by using event and RGB data to exploit their complementary characteristics. Multimodal fusion techniques have also been explored, combining events with LiDAR to generate dense depth maps [3]. To address the scarcity of labeled event data, self-supervised methods have emerged as promising alternatives. Zhu et al. [40] developed a framework that jointly estimates depth, optical flow, and camera poses using stereo consistency and motion blur minimization as training signals. Subsequent work [41] eliminated the need for stereo setups by leveraging pose information from consecutive RGB frames aligned with the event camera, enabling dense depth estimation. Despite these advances, event-based depth estimation still falls short compared to frame-based methods.
3. Preliminaries: Event Depth Estimation
Event cameras measure the logarithmic change in brightness over time, and when it changes over a threshold $\pm C$, the associated pixel at position $(x_k, y_k)$ emits at time $t_k$ an asynchronous signal $e_k = (x_k, y_k, p_k, t_k)$ called event. Depending on the sign of this change, the event will have polarity $p_k \in \{-1, 1\}$. Each pixel of the $W \times H$ sensor grid of the event camera can independently emit events at any time, producing an asynchronous stream of events $\mathcal{E} = \{e_k\}_{k=1}^{N}$, where $N$ is the total number of fired events.
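The event generation model described above can be illustrated with a minimal sketch for a single pixel. It assumes an idealised sensor: the pixel stores the log-brightness level of its last event and fires whenever the current value departs from it by at least $C$. The helper name `pixel_events` and the sampled brightness values are hypothetical and only serve the illustration.

```python
from typing import List, Tuple

Event = Tuple[int, int, int, float]  # (x, y, polarity, timestamp)


def pixel_events(x: int, y: int, log_brightness: List[float],
                 timestamps: List[float], C: float) -> List[Event]:
    """Idealised event generation for one pixel (illustrative assumption).

    An event fires each time the log-brightness moves by at least C away
    from the level memorised at the previous event.
    """
    events: List[Event] = []
    reference = log_brightness[0]  # level memorised at the last event
    for value, t in zip(log_brightness[1:], timestamps[1:]):
        # A large change between two samples may cross the threshold several times.
        while abs(value - reference) >= C:
            polarity = 1 if value > reference else -1
            reference += polarity * C
            events.append((x, y, polarity, t))
    return events


# Brightness rises twice, then drops: two positive events, then one negative.
stream = pixel_events(3, 7, [0.0, 0.25, 0.5, 0.1], [0.0, 1.0, 2.0, 3.0], C=0.2)
print(stream)
```

Running this per pixel and merging the outputs by timestamp would give an asynchronous stream of tuples $(x_k, y_k, p_k, t_k)$ in the sense used in this section.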
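Given the event history $\mathcal{E}$, previous event-based dense monocular depth estimation models [8,15,21] convert the flow of events into a structured representation $E \in \mathbb{R}^{W \times H \times C}$ – such as Voxel Grids [40] – since the sparse structure of $\mathcal{E}$ is not suitable for standard CNNs. To this end, to estimate a depth map $D \in \mathbb{R}^{W \times H}$ at a given timestamp $t_d$, events are retrospectively sampled from the stream $\mathcal{E}$, either within a fixed time window (SBT) – i.e., $\mathcal{E}^{\Delta T}_{t_d} = \{e_k \in \mathcal{E} \mid t_d - \Delta T \le t_k \le t_d\}$ – or up to a predefined number $K$ of events (SBN) – i.e., $\mathcal{E}^{K}_{t_d} = \{e_k \in \mathcal{E} \mid d - K \le k \le d\}$, with $d$ the index of the last event fired up to $t_d$ – and subsequently stacked using different strategies, including: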
Voxel Grid [40]: The time interval used for sampling events is divided into $B$ uniform bins, where event polarities are accumulated using linear interpolation within each bin of a $E \in \mathbb{R}^{W \times H \times B}$ stack.
Image-like [21]: A color-based representation where the R and B channels encode positive and negative polarities, respectively, resulting in an RGB image, i.e. a $E \in \mathbb{R}^{W \times H \times 3}$ stack. Unlike the Voxel Grid representation, it does not retain temporal information.
Tencode [16]: A color image representation in which R and B channels encode positive and negative polarities, with G encoding the timestamp relative to the total time-lapse. It produces an RGB image, i.e. a $E \in \mathbb{R}^{W \times H \times 3}$ stack.
For the sake of space, we report only the event representations relevant to our work, but additional details regarding event representations can be found in [1,11].
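As an illustration of the slicing and stacking steps above, the following sketch implements SBT and SBN slicing plus one common Voxel Grid formulation, in which each event spreads its polarity over the two nearest temporal bins with linear weights. The function names and the choice to normalise time by the first and last event of the slice are assumptions made for this example, not details taken from the cited implementations.

```python
from typing import List, Tuple

Event = Tuple[int, int, int, float]  # (x, y, polarity, timestamp)


def slice_sbt(events: List[Event], t_d: float, delta_t: float) -> List[Event]:
    """SBT: keep the events with t_d - delta_t <= t_k <= t_d."""
    return [e for e in events if t_d - delta_t <= e[3] <= t_d]


def slice_sbn(events: List[Event], t_d: float, K: int) -> List[Event]:
    """SBN: keep the last K events fired up to t_d (events sorted by time)."""
    past = [e for e in events if e[3] <= t_d]
    return past[-K:] if K > 0 else []


def voxel_grid(events: List[Event], width: int, height: int,
               bins: int) -> List[List[List[float]]]:
    """Accumulate polarities into a bins x height x width grid.

    Each timestamp is mapped to a fractional bin index t_star in
    [0, bins - 1]; the polarity is shared between neighbouring bins
    with weight max(0, 1 - |b - t_star|).
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(bins)]
    if not events:
        return grid
    t_first, t_last = events[0][3], events[-1][3]
    span = (t_last - t_first) or 1.0  # avoid dividing by zero for a single instant
    for x, y, p, t in events:
        t_star = (bins - 1) * (t - t_first) / span
        for b in range(bins):
            weight = max(0.0, 1.0 - abs(b - t_star))
            grid[b][y][x] += p * weight
    return grid


demo = [(0, 0, 1, 0.0), (1, 0, -1, 0.25), (0, 0, 1, 1.0)]
grid = voxel_grid(demo, width=2, height=1, bins=3)
```

In `demo`, the negative event at t = 0.25 falls halfway between bins 0 and 1, so it contributes -0.5 to each, which is the temporal interpolation the Voxel Grid description refers to.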
4. Proposed Method
Our first goal is to leverage the knowledge of frame-based monocular depth models like DAv2 by extracting pseudo labels to train any event-based student depth model – e.g., E2Depth – given aligned intensity frames and event stacks. Figure 2 outlines our cross-modal distillation paradigm. Moreover, we propose to cast a frame-based model – DAv2 in our experiments – to the event domain, either in its original version or enriching it to exploit temporal cues, taking advantage of the massive pre-training performed in the image domain.
[Figure 2 diagram: at training time, a frame-based teacher VFM maps the frame to proxy depth labels used for distillation, under the alignment assumption between frame and events stack; at training and test time, the event-based student model maps the events stack to the depth prediction.]
Figure 2. Proposed Cross-Modal Distillation Strategy. During training, a VFM teacher processes RGB input frames $I$ to generate proxy depth labels $D^*$, which supervise an event-based student model. The student takes aligned event stacks $E$ as input and predicts the final depth map $D$.
4.1. VFMs for Cross-Modal Distillation
Visual Foundation Models have achieved astonishing results mainly due to their peculiar large-scale training procedures. For instance, DAv2 relies on a DINOv2 backbone that was pre-trained with hundreds of millions of images in an unsupervised manner. Furthermore, DAv2 uses tens of millions of pseudo-labeled and millions of labeled images for training. Unfortunately, event data lacks equivalent large-scale datasets [2,9,39], substantially precluding comparable training in the event domain. To bridge this gap, we propose leveraging a pre-trained VFM – DAv2 ViT-Large in our experiments – to provide dense supervision to any event-based depth estimation networks, as outlined in Figure 2. During training, a teacher VFM processes a frame, producing the proxy label $D^*$ (Fig. 3 shows an example) and the student model predicts a depth map $D$ from the spatially and temporally aligned events. The student model is supervised using a loss $L = L_{si} + \lambda L_{reg}$ composed of a scale-invariant loss $L_{si}$ and a gradient regularization term $L_{reg}$ [19]:
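$$L_{si}(\hat{D}, \hat{D}^*) = \frac{1}{2|M|} \sum_{(x,y) \in M} \left(\hat{D} - \hat{D}^*\right)^2 \qquad (1)$$
where $M$ is the set of valid pixels, $\hat{D} = sD + t$ and $\hat{D}^* = D^*$ are respectively the scaled and shifted versions of the student prediction $D$ and the proxy label $D^*$, and $(s, t)$ are the scaling factors obtained using the least-square approach:
$$(s, t) = \arg\min_{s, t} \sum_{(x,y) \in M} \left(sD + t - D^*\right)^2 \qquad (2)$$
The regularization term $L_{reg}$ is defined as follows:
$$L_{reg}(\hat{D}, \hat{D}^*) = \sum_{k=1}^{K} \frac{1}{|M_k|} \sum_{(x,y) \in M_k} \left(|\nabla_x R_k| + |\nabla_y R_k|\right) \qquad (3)$$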
[Figure 3 panels: Events, RGB, Proxy Labels, Ground Truth.]
Figure 3. Labels Distillation from Frame-Based Vision Foundation Model. Given the availability of aligned color and event modalities, e.g., collected by a DAVIS346B sensor, we can exploit a VFM to extract proxy labels from the color images, resulting in much denser supervision compared to the one provided by semi-dense LiDAR annotations.
[Figure 4 diagram: Image To Patches and Positional Encoding feed four Transformer stages; each stage is followed by a Reassemble block, a ConvLSTM module, and a Fusion block; the fused features go to the Head, which outputs the Final Prediction.]
Figure 4. Proposed Recurrent VFM. Our DepthAnyEvent-R model processes image patches with positional encoding through multiple transformer stages that produce multi-scale feature maps $F_s$. These features are combined with hidden states $H^i_s$ in ConvLSTM modules $R_s$ to incorporate temporal information from previous event stacks, generating enhanced feature maps $\hat{F}_s$ and updated hidden states $H^{i+1}_s$. A hierarchical fusion process integrates features from different scales into the final feature map $\hat{F}^*$, from which the final depth is predicted.
where $R_k = \hat{D}_k - \hat{D}^*_k$ is the difference of maps at scale $k$ and $M_k$ is the set of valid pixels at scale $k$.
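The loss of Eqs. (1)-(3) can be sketched in a few lines of plain Python. The minimisation in Eq. (2) is an ordinary linear regression of $D^*$ on $D$, so $(s, t)$ has a closed form. This is an illustrative sketch rather than the training code: the number of scales, the way scale $k$ is built (keeping every $2^k$-th pixel), and the forward-difference gradients are assumptions, and all function names are hypothetical.

```python
from typing import List, Tuple

Map = List[List[float]]
Mask = List[List[bool]]


def align_scale_shift(pred: Map, proxy: Map, mask: Mask) -> Tuple[float, float]:
    """Closed-form least-squares solution of Eq. (2) over the valid pixels."""
    pairs = [(d, g) for dr, gr, mr in zip(pred, proxy, mask)
             for d, g, m in zip(dr, gr, mr) if m]
    n = len(pairs)
    sum_d = sum(d for d, _ in pairs)
    sum_g = sum(g for _, g in pairs)
    sum_dd = sum(d * d for d, _ in pairs)
    sum_dg = sum(d * g for d, g in pairs)
    det = n * sum_dd - sum_d * sum_d
    if n == 0 or abs(det) < 1e-12:
        return 1.0, 0.0  # degenerate case (e.g. constant prediction): leave D as is
    s = (n * sum_dg - sum_d * sum_g) / det
    t = (sum_g - s * sum_d) / n
    return s, t


def loss_si(aligned: Map, proxy: Map, mask: Mask) -> float:
    """Eq. (1): half the mean squared residual over the valid pixels."""
    sq = [(a - g) ** 2 for ar, gr, mr in zip(aligned, proxy, mask)
          for a, g, m in zip(ar, gr, mr) if m]
    return sum(sq) / (2 * len(sq)) if sq else 0.0


def loss_reg(aligned: Map, proxy: Map, mask: Mask, scales: int = 4) -> float:
    """Eq. (3); scale k keeps every 2**k-th pixel (an assumption of this sketch)."""
    total = 0.0
    for k in range(scales):
        step = 2 ** k
        r = [[a - g for a, g in zip(ar[::step], gr[::step])]
             for ar, gr in zip(aligned[::step], proxy[::step])]
        m = [mr[::step] for mr in mask[::step]]
        n_valid = sum(sum(1 for v in row if v) for row in m)
        if n_valid == 0:
            continue
        acc = 0.0
        for y, row in enumerate(r):
            for x, value in enumerate(row):
                if not m[y][x]:
                    continue
                if x + 1 < len(row) and m[y][x + 1]:
                    acc += abs(row[x + 1] - value)   # |grad_x R_k|
                if y + 1 < len(r) and m[y + 1][x]:
                    acc += abs(r[y + 1][x] - value)  # |grad_y R_k|
        total += acc / n_valid
    return total


def distillation_loss(pred: Map, proxy: Map, mask: Mask, lam: float = 0.25) -> float:
    """L = L_si + lambda * L_reg, computed after aligning the prediction."""
    s, t = align_scale_shift(pred, proxy, mask)
    aligned = [[s * d + t for d in row] for row in pred]
    return loss_si(aligned, proxy, mask) + lam * loss_reg(aligned, proxy, mask)
```

Because the prediction is aligned before the residual is taken, a student output that differs from the proxy label only by a global scale and shift incurs zero loss, which is the property that lets affine-invariant teacher labels supervise the student.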
To ensure alignment, frame and event cameras must be calibrated – intrinsically done in the DAVIS camera – and events are sliced from the frame's acquisition timestamp.
4.2. Casting VFMs to the Event Domain
Frame-based monocular depth models cannot be used directly on events, given the diverse nature of the latter. Hence, to adapt their capabilities to the event domain, we choose an appropriate event representation that can reduce the gap between frames and events encoding. Furthermore, we exploit the sequential nature of temporal events, proposing a novel recurrent architecture for DAv2.
Choosing the Right Event Representation. The events stream contains spatial and temporal information; hence, a good event representation should capture both to ensure limited loss of information. Since monocular models naturally process RGB frames – i.e., they produce a depth map given an image $I \in \mathbb{R}^{W \times H \times 3}$ as input – we have to choose an event representation that encodes both spatial and temporal requirements within an RGB frame to pursue minimal modifications to the pre-trained VFM.
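Purposely, the Tencode [16] representation fits with our aim. Consequently, starting from a sliced event history $\mathcal{E}_{t_d}$, either using SBT or SBN [22], Tencode encodes $\mathcal{E}_{t_d}$ into a stack $E$ as follows:
$$E(x_k, y_k) = \begin{cases} \left(1, \frac{t_d - t_k}{\Delta T}, 0\right) & \text{if } p_k = 1 \\ \left(0, \frac{t_d - t_k}{\Delta T}, 1\right) & \text{if } p_k = -1 \end{cases} \qquad (4)$$
where $e_k = (x_k, y_k, p_k, t_k) \in \mathcal{E}_{t_d}$ is the $k$-th event of $\mathcal{E}_{t_d}$ and $\Delta T$ is the time interval of event slice $\mathcal{E}_{t_d}$.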
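Eq. (4) translates almost directly into code. The sketch below builds a Tencode image from a sliced event list. Two choices are assumptions of this example rather than statements from the source: pixels that received no event are left black, and when several events hit the same pixel the most recent one overwrites the earlier ones. The function name `tencode` is hypothetical.

```python
from typing import List, Tuple

Event = Tuple[int, int, int, float]  # (x, y, polarity, timestamp)
Pixel = Tuple[float, float, float]   # (R, G, B)


def tencode(events: List[Event], width: int, height: int,
            t_d: float, delta_t: float) -> List[List[Pixel]]:
    """Encode a slice of events as an RGB image following Eq. (4)."""
    image = [[(0.0, 0.0, 0.0) for _ in range(width)] for _ in range(height)]
    # Process events chronologically so the latest event at a pixel wins.
    for x, y, p, t in sorted(events, key=lambda e: e[3]):
        g = (t_d - t) / delta_t  # 0 for an event at t_d, 1 for one at t_d - delta_t
        image[y][x] = (1.0, g, 0.0) if p == 1 else (0.0, g, 1.0)
    return image


# Toy slice with t_d = 1.0 and delta_t = 0.5 (time units are arbitrary here).
img = tencode([(0, 0, 1, 0.75), (1, 0, -1, 0.5)], width=2, height=2,
              t_d=1.0, delta_t=0.5)
```

The red and blue channels carry polarity while green carries the normalised age of the event, so the output has the same shape as an RGB frame and can be passed to a frame-based network without architectural changes.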
VFM for Events. Although the Tencode representation significantly differs from a conventional RGB image of the same scene, we propose to adapt a pre-trained VFM to deal with the event domain through fine-tuning with event data using the Tencode representation. For this purpose, we use as the VFM a vanilla DAv2 ViT-S for our experiments. We dub this model DepthAnyEvent.
Recurrent VFM for Events. Additionally, given the sequence nature of the event stream, Recurrent Neural Networks (RNNs) could encode previous features extracted from past event stacks into a hidden state [15,21]. At each iteration, the recurrent module can update the hidden state with the features extracted from the current stack, generating a new hidden state for the next iteration.
However, monocular depth models typically lack a recurrent module since they are designed to work with single-frame instances. Hence, for our purposes, this could hinder the quality of predictions, especially during static scenes where events are not triggered. To effectively adapt them to the event domain, we introduce a recurrent extension of DAv2 ViT-Small, dubbed DepthAnyEvent-R, that integrates cues from previous event stacks, as outlined in Figure 4. The DAv2 architecture is composed of two main modules: a DINOv2 [24] Encoder $\mathcal{G}$ based on Visual Transformer (ViT), and a Dense Depth Decoder $\mathcal{D}$. Given an image $I$ encoded with the Tencode representation, the encoder $\mathcal{G}$ first splits the image into patches and adds positional encoding to them. Next, patches are passed through multiple transformer stages and then reassembled from different stages into multi-scale feature maps $F_s \in \mathbb{R}^{W_s \times H_s \times C_s}$. For each scale $s$, we feed the feature maps $F_s$ and the hidden state $H^i_s \in \mathbb{R}^{W_s \times H_s \times C_s}$, with $H^0_s = 0$, to a ConvLSTM [31] module $R_s$, obtaining a new hidden state $H^{i+1}_s$ and temporally enhanced feature maps $\hat{F}_s$. Starting from the lowest scale, a series of fusion modules sequentially upsample and fuse the feature maps to obtain the final feature map $\hat{F}^*$, fed to the decoder $\mathcal{D}$ to obtain the final predicted depth map.
Model           | Dataset | Abs Rel↓ | Sq Rel↓ | RMSE↓  | RMSE log↓ | SI log↓ | δ<1.25↑ | δ<1.25²↑ | δ<1.25³↑
E2Depth [15]    | MVSEC   | 0.527    | 1.122   | 7.894  | 0.512     | 0.244   | 0.363   | 0.637    | 0.811
EReFormer [21]  | MVSEC   | 0.518    | 1.012   | 8.423  | 0.559     | 0.316   | 0.361   | 0.630    | 0.800
DepthAnyEvent   | MVSEC   | 0.466    | 0.976   | 7.824  | 0.480     | 0.229   | 0.408   | 0.689    | 0.847
DepthAnyEvent-R | MVSEC   | 0.469    | 0.946   | 8.064  | 0.508     | 0.272   | 0.428   | 0.690    | 0.832
E2Depth [15]    | DSEC    | 0.395    | 0.334   | 13.258 | 0.412     | 0.167   | 0.409   | 0.719    | 0.891
EReFormer [21]  | DSEC    | 0.297    | 0.195   | 11.608 | 0.334     | 0.113   | 0.524   | 0.824    | 0.945
DepthAnyEvent   | DSEC    | 0.297    | 0.186   | 11.072 | 0.330     | 0.108   | 0.519   | 0.827    | 0.948
DepthAnyEvent-R | DSEC    | 0.276    | 0.165   | 10.942 | 0.314     | 0.101   | 0.555   | 0.843    | 0.954
Table 1. Quantitative Results – Zero-Shot Generalization on MVSEC and DSEC. All networks are trained on the EventScape synthetic dataset only, and tested without any fine-tuning.
Figure 5. Qualitative Results on DSEC dataset – Zero-Shot Generalization. From left to right: event image, predictions by E2Depth, EReFormer, DepthAnyEvent and DepthAnyEvent-R, trained on EventScape only.
5. Experiments
We describe our implementation details, datasets, and evaluation protocols, followed by experiments.
5.1. Implementation and Experimental Settings
Hyperparameters Settings. We set the slicing window $\Delta T$, the number of Voxel Grid bins $B$, and the loss factor $\lambda$ respectively to 50ms, 5, and 0.25. We implement event-based student networks E2Depth [15] and EReFormer [21] starting from their codebase. For DepthAnyEvent and DepthAnyEvent-R, we start from the DAv2 ViT-Small codebase [34]. We use PyTorch, and a single A100 GPU with 64GB of RAM. Following the original papers, we fix the learning rate to $10^{-4}$ and $3.2 \cdot 10^{-5}$ respectively for E2Depth and EReFormer, while we set a learning rate of $5 \cdot 10^{-6}$ for all DepthAnyEvent variants. We adjust the training steps to 75k, using the AdamW optimizer with the OneCycle scheduler, and apply data augmentations including horizontal flips and random crops at 224×224. We set the batch size to 10, except for EReFormer: given the higher memory requirements, we change it to 2. We unroll all recurrent networks – i.e., E2Depth, EReFormer, and DepthAnyEvent-R – for 20 steps. We choose as the event representation Tencode [16] for DepthAnyEvent and DepthAnyEvent-R, while we maintain the original representation for E2Depth and EReFormer – i.e., respectively, Voxel Grid [40] and Image-like [21]. Finally, we use the scale-invariant loss $L$ for all networks. The settings reported are used for all experiments unless otherwise specified.
Proxy Labels Factory. We generate proxy labels from frames using the DAv2 ViT-Large trained for metric depth estimation: starting from the Large vanilla weights provided by the authors, we perform a fine-tuning on EventScape [8] for 10k steps with a learning rate of $10^{-6}$.
Synthetic Training Setup. We obtain the synthetic checkpoints of all networks training on the synthetic EventScape [8] dataset. While E2Depth was trained from scratch, we followed EReFormer's original paper and set Swin-T pre-trained on ImageNet as the backbone. For DepthAnyEvent and DepthAnyEvent-R, we started from the Small weights provided by the authors.
Fine-tuning Setup. We follow [15], fine-tuning the models on the target domain using both real and synthetic data – i.e., MVSEC [39] + EventScape [8], and DSEC [9] + EventScape [8] – starting from the synthetic checkpoints obtained in the previous point.
Model           | Dataset | Abs Rel↓ | Sq Rel↓ | RMSE↓  | RMSE log↓ | SI log↓ | δ<1.25↑ | δ<1.25²↑ | δ<1.25³↑
E2Depth [15]    | MVSEC   | 0.420    | 0.806   | 7.268  | 0.455     | 0.213   | 0.432   | 0.717    | 0.868
EReFormer [21]  | MVSEC   | 0.511    | 1.057   | 8.373  | 0.523     | 0.274   | 0.391   | 0.652    | 0.810
DepthAnyEvent   | MVSEC   | 0.373    | 0.715   | 6.627  | 0.449     | 0.222   | 0.471   | 0.747    | 0.884
DepthAnyEvent-R | MVSEC   | 0.365    | 0.691   | 6.465  | 0.483     | 0.258   | 0.489   | 0.751    | 0.878
E2Depth [15]    | DSEC    | 0.253    | 0.130   | 10.119 | 0.315     | 0.107   | 0.574   | 0.861    | 0.956
EReFormer [21]  | DSEC    | 0.286    | 0.208   | 11.369 | 0.325     | 0.109   | 0.569   | 0.839    | 0.944
DepthAnyEvent   | DSEC    | 0.201    | 0.079   | 8.880  | 0.266     | 0.077   | 0.664   | 0.917    | 0.975
DepthAnyEvent-R | DSEC    | 0.191    | 0.070   | 8.618  | 0.244     | 0.064   | 0.691   | 0.930    | 0.981
Table 2. Quantitative Results – In-Domain Evaluation on MVSEC and DSEC. All networks are trained on the EventScape synthetic dataset and then further fine-tuned on MVSEC and DSEC datasets separately.
Figure 6. Qualitative Results on MVSEC – Fine-tuned Models. From left to right: event image, predictions by E2Depth, EReFormer, DepthAnyEvent and DepthAnyEvent-R, trained on EventScape and fine-tuned on MVSEC.
Distillation Training Setup. We use the proxy labels previously generated with DAv2 ViT-L instead of the original sparse ground-truth. Differently from the previous point, we trained the models on the dense proxy labels only instead of a synthetic+proxy mixture.
5.2. Evaluation Datasets & Protocol
Datasets. We utilize EventScape [8] as the synthetic training set, comprising about 120k ground truth depth maps at a resolution of 512×256, captured from the CARLA [4] simulator. For evaluation and domain fine-tuning we used two main benchmarks: MVSEC [39] and DSEC [9]. MVSEC provides events at a resolution of 346×260 pixels from a stereo event camera consisting of two DAVIS346B sensors, which also capture spatially aligned images. Ground-truth is obtained by processing data from a 16-line LiDAR using Lidar Odometry and Mapping (LOAM), yielding a total of 10k training samples and 20k testing samples. The test set is divided into a 5k-sample daytime subset and three nighttime subsets, each containing 5k samples. DSEC [9] employs two 640×480 Prophesee Gen3.1 event cameras in a stereo configuration. Ground-truth disparity is obtained using a 32-line LiDAR, processed with a Lidar Inertial Odometry algorithm, and further filtered to remove outliers. We convert the disparity ground-truth to depth based on the stereo setup parameters. Unlike MVSEC, RGB frames are captured using a pair of FLIR Blackfly S cameras. To align frames and events, we warp the RGB frames using the calibration parameters. We also apply a 640×320 center crop to mitigate misalignment artifacts in nearby objects. The dataset counts 26k training samples, divided as in [1] into 19k for training and 7k for testing.
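The disparity-to-depth conversion mentioned above follows the standard rectified-stereo relation $Z = f \cdot b / d$, with focal length $f$ in pixels, baseline $b$ in meters, and disparity $d$ in pixels. A minimal sketch is given below. The function name and the numeric values in the usage line are placeholders, not the DSEC calibration.

```python
from typing import Optional


def disparity_to_depth(disparity_px: float, focal_px: float,
                       baseline_m: float) -> Optional[float]:
    """Rectified-stereo relation Z = f * b / d; None marks an invalid pixel."""
    if disparity_px <= 0:
        return None  # no valid ground-truth disparity at this pixel
    return focal_px * baseline_m / disparity_px


# Placeholder calibration values, chosen only to make the arithmetic easy to follow.
depth_m = disparity_to_depth(10.0, focal_px=500.0, baseline_m=0.6)
```

With these placeholder values a 10-pixel disparity corresponds to a depth of 30 m. Pixels without a valid disparity stay unlabeled, which is why the resulting ground truth is only semi-dense.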
Evaluation Metrics. We evaluate the networks using different metrics: absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean squared error (RMSE), logarithmic RMSE (RMSE log), logarithmic scale invariant error (SI log), and accuracy with different thresholds ($\delta < 1.25$, $\delta < 1.25^2$, and $\delta < 1.25^3$). We apply scale and shift to align predictions with the ground-truth before computing the metrics. We highlight using bold and underline the best and second best scores.
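For reference, the metrics listed above can be computed as in the sketch below, which follows the definitions commonly used in the monocular depth literature. The exact variants used for the tables, in particular for SI log, are not spelled out in the text, so the formulas here are an assumption. The inputs are flat lists of strictly positive depths at valid pixels, with predictions already aligned to the ground truth.

```python
import math
from typing import Dict, List


def depth_metrics(pred: List[float], gt: List[float]) -> Dict[str, float]:
    """Common depth-estimation metrics over valid, strictly positive pixels."""
    n = len(gt)
    diff = [p - g for p, g in zip(pred, gt)]
    log_diff = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    ratio = [max(p / g, g / p) for p, g in zip(pred, gt)]
    mean_log = sum(log_diff) / n
    return {
        "abs_rel": sum(abs(d) / g for d, g in zip(diff, gt)) / n,
        "sq_rel": sum(d * d / g for d, g in zip(diff, gt)) / n,
        "rmse": math.sqrt(sum(d * d for d in diff) / n),
        "rmse_log": math.sqrt(sum(d * d for d in log_diff) / n),
        # Scale-invariant log error: variance of the log residuals.
        "si_log": sum(d * d for d in log_diff) / n - mean_log ** 2,
        "delta1": sum(1 for r in ratio if r < 1.25) / n,
        "delta2": sum(1 for r in ratio if r < 1.25 ** 2) / n,
        "delta3": sum(1 for r in ratio if r < 1.25 ** 3) / n,
    }


scores = depth_metrics(pred=[2.0, 4.0], gt=[2.0, 2.0])
```

In this toy call one pixel is exact and the other is off by a factor of two, so Abs Rel is 0.5 and every δ accuracy is 0.5, since a ratio of 2 exceeds even $1.25^3 \approx 1.95$.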
5.3. Synthetic-to-Real Generalization
We start by evaluating the capability of the different depth estimation models to generalize from synthetic data to real event streams. Purposely, we train E2Depth, EReFormer, DepthAnyEvent, and DepthAnyEvent-R on EventScape and measure their accuracy on both MVSEC and DSEC datasets. Table 1 collects the outcome of this experiment.
Model           | Training   | Dataset | Abs Rel↓ | Sq Rel↓ | RMSE↓  | RMSE log↓ | SI log↓ | δ<1.25↑ | δ<1.25²↑ | δ<1.25³↑
E2Depth         | Synth      | MVSEC   | 0.527    | 1.122   | 7.894  | 0.512     | 0.244   | 0.363   | 0.637    | 0.811
E2Depth         | Distilled  | MVSEC   | 0.400    | 0.817   | 6.786  | 0.538     | 0.304   | 0.479   | 0.740    | 0.865
E2Depth         | Supervised | MVSEC   | 0.420    | 0.806   | 7.268  | 0.455     | 0.213   | 0.432   | 0.717    | 0.868
EReFormer       | Synth      | MVSEC   | 0.518    | 1.012   | 8.423  | 0.559     | 0.316   | 0.361   | 0.630    | 0.800
EReFormer       | Distilled  | MVSEC   | 0.448    | 0.817   | 7.867  | 0.498     | 0.253   | 0.434   | 0.700    | 0.842
EReFormer       | Supervised | MVSEC   | 0.511    | 1.057   | 8.373  | 0.523     | 0.274   | 0.391   | 0.652    | 0.810
DepthAnyEvent   | Synth      | MVSEC   | 0.466    | 0.976   | 7.824  | 0.480     | 0.229   | 0.408   | 0.689    | 0.847
DepthAnyEvent   | Distilled  | MVSEC   | 0.397    | 0.771   | 6.910  | 0.495     | 0.260   | 0.461   | 0.735    | 0.870
DepthAnyEvent   | Supervised | MVSEC   | 0.373    | 0.715   | 6.627  | 0.449     | 0.222   | 0.471   | 0.747    | 0.884
DepthAnyEvent-R | Synth      | MVSEC   | 0.469    | 0.946   | 8.064  | 0.508     | 0.272   | 0.428   | 0.690    | 0.832
DepthAnyEvent-R | Distilled  | MVSEC   | 0.399    | 0.781   | 6.830  | 0.509     | 0.281   | 0.462   | 0.735    | 0.866
DepthAnyEvent-R | Supervised | MVSEC   | 0.365    | 0.691   | 6.465  | 0.483     | 0.258   | 0.489   | 0.751    | 0.878
E2Depth         | Synth      | DSEC    | 0.395    | 0.334   | 13.258 | 0.412     | 0.167   | 0.409   | 0.719    | 0.891
E2Depth         | Distilled  | DSEC    | 0.272    | 0.153   | 10.579 | 0.309     | 0.096   | 0.551   | 0.851    | 0.959
E2Depth         | Supervised | DSEC    | 0.253    | 0.130   | 10.119 | 0.315     | 0.107   | 0.574   | 0.861    | 0.956
EReFormer       | Synth      | DSEC    | 0.297    | 0.195   | 11.608 | 0.334     | 0.113   | 0.524   | 0.824    | 0.945
EReFormer       | Distilled  | DSEC    | 0.285    | 0.198   | 11.407 | 0.327     | 0.111   | 0.563   | 0.839    | 0.944
EReFormer       | Supervised | DSEC    | 0.286    | 0.208   | 11.369 | 0.325     | 0.109   | 0.569   | 0.839    | 0.944
DepthAnyEvent   | Synth      | DSEC    | 0.297    | 0.186   | 11.072 | 0.330     | 0.108   | 0.519   | 0.827    | 0.948
DepthAnyEvent   | Distilled  | DSEC    | 0.213    | 0.095   | 8.930  | 0.253     | 0.065   | 0.662   | 0.915    | 0.980
DepthAnyEvent   | Supervised | DSEC    | 0.201    | 0.079   | 8.880  | 0.266     | 0.077   | 0.664   | 0.917    | 0.975
DepthAnyEvent-R | Synth      | DSEC    | 0.276    | 0.165   | 10.942 | 0.314     | 0.101   | 0.555   | 0.843    | 0.954
DepthAnyEvent-R | Distilled  | DSEC    | 0.226    | 0.111   | 9.310  | 0.266     | 0.072   | 0.638   | 0.906    | 0.977
DepthAnyEvent-R | Supervised | DSEC    | 0.191    | 0.070   | 8.618  | 0.244     | 0.064   | 0.691   | 0.930    | 0.981
Table 3. Quantitative Results – Supervised vs Distilled Models on MVSEC and DSEC. All networks are trained on the EventScape synthetic dataset and then fine-tuned on MVSEC and DSEC datasets separately, either through distillation or on ground-truth depth labels.
Figure 7. Qualitative Results on DSEC – Supervised vs Distilled Models. From left to right: event image, predictions by DepthAnyEvent and its distilled counterpart, and by DepthAnyEvent-R and its distilled counterpart.
DepthAnyEvent and DepthAnyEvent-R achieve the best results on almost any metric, hinting how the web-scale training infused in the weights we used to initialize these models represents a solid prior for depth estimation, although coming from images, i.e., a completely different modality with respect to event streams. The two models achieve mixed results one against the other on MVSEC, while DepthAnyEvent-R consistently achieves the best generalization results over DSEC, giving a first intuition about the effectiveness of our design choice to deal with streamed event data. Figure 5 presents a qualitative comparison of predictions from different models, showcasing the superior zero-shot capabilities of our DepthAnyEvent and DepthAnyEvent-R models.
5.4. Supervised Fine-tuning
We now evaluate the accuracy of each model when trained on real event data annotated with semi-dense ground-truth depth. To this aim, we take the weights obtained after training on EventScape and perform further fine-tuning on MVSEC and DSEC separately, then evaluating on the corresponding validation sets. Table 2 reports the results of this evaluation. We can notice, once again, the notable gap in performance between DepthAnyEvent and DepthAnyEvent-R against existing methods EReFormer and E2Depth, confirming again the strong advantage that our models can exploit from the cross-modal training being conducted for image-based depth estimation. Specifically, this time we can notice how DepthAnyEvent-R outperforms the vanilla DepthAnyEvent model on every metric on DSEC and on five of the eight metrics on MVSEC, validating our proposed design tailored for event-based depth estimation.
Figure 6 shows a qualitative comparison between the predictions by the different models, highlighting the superior accuracy achieved by DepthAnyEvent and, even higher, by DepthAnyEvent-R.
5.5. Cross-Modal Distillation
We now assess the effectiveness of our cross-modal distillation strategy compared to conventional, supervised training requiring the availability of costly depth annotations from active sensors. Table 3 collects the results achieved by each model under the training configuration considered so far, as well as after being trained according to our distillation approach. In most cases, we can notice how the models trained through distillation are comparable, and sometimes even better than their supervised counterparts.
Figure 7 shows some qualitative examples from the DSEC dataset, comparing the predictions by DepthAnyEvent and DepthAnyEvent-R when trained with ground-truth or through distillation. In both cases, the distilled models in these examples are even more accurate than those supervised with ground-truth.
Model           | Abs Rel↓ | Sq Rel↓ | RMSE↓  | RMSE log↓ | SI log↓ | δ<1.25↑ | δ<1.25²↑ | δ<1.25³↑
E2Depth [15]    | 0.344    | 0.253   | 13.467 | 0.376     | 0.098   | 0.447   | 0.755    | 0.915
EReFormer [21]  | 0.387    | 0.401   | 13.954 | 0.395     | 0.124   | 0.486   | 0.776    | 0.892
DepthAnyEvent   | 0.277    | 0.170   | 11.117 | 0.292     | 0.051   | 0.585   | 0.860    | 0.955
DepthAnyEvent-R | 0.252    | 0.128   | 9.824  | 0.268     | 0.045   | 0.592   | 0.900    | 0.971
Table 4. Metric Depth Evaluation. Training and evaluation on DSEC dataset.
Row | Model           | Supervision                 | Experiment                 | Abs Rel↓ | Sq Rel↓ | RMSE↓ | RMSE log↓ | SI log↓ | δ<1.25↑ | δ<1.25²↑ | δ<1.25³↑
(A) | DepthAnyEvent-R | Distillation                | Tencode+DAv2               | 0.399    | 0.781   | 6.830 | 0.509     | 0.281   | 0.462   | 0.735    | 0.866
(B) | DepthAnyEvent-R | Distillation                | Tencode+DepthPro           | 0.429    | 0.942   | 7.472 | 0.452     | 0.208   | 0.444   | 0.726    | 0.869
(C) | DepthAnyEvent-R | Ground-truth                | Tencode+DAv2               | 0.365    | 0.691   | 6.465 | 0.483     | 0.258   | 0.489   | 0.751    | 0.878
(D) | DepthAnyEvent-R | Ground-truth                | VoxelGrid+DAv2             | 0.382    | 0.719   | 6.932 | 0.444     | 0.215   | 0.473   | 0.742    | 0.877
(E) | DepthAnyEvent-R | Ground-truth                | Tencode+DAv2 (no pretrain) | 0.446    | 0.799   | 7.492 | 0.506     | 0.260   | 0.390   | 0.678    | 0.845
(F) | DepthAnyEvent-R | Ground-truth + Distillation | Tencode+DAv2               | 0.362    | 0.697   | 6.511 | 0.438     | 0.211   | 0.494   | 0.760    | 0.890
Table 5. Ablation Studies. Training and evaluation on MVSEC dataset.
Model           | Inference (ms) | Memory (MB)
E2Depth [15]    | 1.50           | 242
EReFormer [21]  | 35.75          | 534
DepthAnyEvent   | 1.26           | 71
DepthAnyEvent-R | 9.20           | 202
Table 6. Computational Analysis. Inference time on A100 GPU.
5.6. Metric Depth Evaluation
Finally, we assess the accuracy of our models when trained to predict metric rather than affine-invariant depth. Table 4 collects the results achieved by existing networks and ours when trained on the DSEC dataset for metric depth prediction, evaluated on the validation set of the very same dataset. We can appreciate how our two architectures achieve the best results, with DepthAnyEvent-R consistently yielding the best results on any evaluation metrics.
5.7. Abla ion S udies
We conclude wi h a s udy abou he impac o di e en
modules in ou amewo k. In he o me case, we ain
di e en ins ances o Dep hAnyE en -R on he MVSEC
da ase and e alua e on i s alida ion se . Resul s a e col-
lec ed in Table 5, wi h ow (A) ep esen ing he con igu a-
ion used in he p e ious expe imen s.
Different VFMs for distillation. Row (B) shows that replacing Depth Anything v2 with a different VFM for distillation, namely Depth Pro, yields close results, although slightly worse on most metrics.
Input representation. In rows (C) and (D), we report the results achieved by training our model with ground-truth labels when the raw events are encoded with either Tencode or a voxel-grid representation. The former yields almost consistently better results.
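For reference, the sketch below shows one common way to build a voxel-grid representation: each event's polarity is shared between its two nearest temporal bins with linear weights. This is a generic formulation from the event-vision literature, used here as an assumption; the exact variant, the number of bins, and the function name `events_to_voxel_grid` are not taken from the paper, and Tencode is not sketched.

```python
def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events (x, y, t, polarity) into a [num_bins][height][width] grid.

    Timestamps are normalised to [0, num_bins - 1]; each event's polarity
    (+1 / -1) is split between the two neighbouring temporal bins.
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t_first = min(e[2] for e in events)
    t_last = max(e[2] for e in events)
    span = (t_last - t_first) or 1.0  # guard against a single shared timestamp
    for x, y, t, polarity in events:
        t_norm = (t - t_first) / span * (num_bins - 1)
        lower = int(t_norm)
        upper = min(lower + 1, num_bins - 1)
        w_upper = t_norm - lower
        grid[lower][y][x] += polarity * (1.0 - w_upper)
        if upper != lower:
            grid[upper][y][x] += polarity * w_upper
    return grid

# Three toy events on a 2x2 sensor; the second one falls halfway
# between bins 0 and 1, so its polarity is split evenly.
events = [(0, 0, 0.0, 1), (1, 0, 0.25, -1), (1, 1, 1.0, 1)]
grid = events_to_voxel_grid(events, num_bins=3, height=2, width=2)
```

The resulting grid can be treated as a multi-channel image, which is what allows a frame-based backbone to consume it.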
Pre-training. By training our model starting from DAv2 pretrained weights, we can greatly improve its performance. Indeed, when training DepthAnyEvent-R from scratch (E), the accuracy consistently drops.
Combining distillation with ground-truth labels. Finally, we show how deploying both our cross-modal distillation paradigm and ground-truth annotations (when available) further improves the final model on most metrics.
5.8. Runtime and Memory Requirements
Table 6 reports a computational analysis of every model involved in our evaluation. DepthAnyEvent achieves the fastest predictions, using as few as 80MB for a single inference. E2Depth has a very similar inference time, although requiring nearly 4× the memory, while EReFormer runs consistently slower and increases the memory usage to up to 0.5GB. Compared to DepthAnyEvent, the DepthAnyEvent-R variant runs slower, yet still in real-time, and yields more accurate predictions.
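The latencies in Table 6 can be converted into throughput, and the memory figures into ratios, with a few lines of arithmetic. The sketch below only restates the numbers from the table; the dictionary layout and variable names are illustrative.

```python
# (inference time in ms, memory in MB), copied from Table 6.
table6 = {
    "E2Depth": (1.50, 242),
    "EReFormer": (35.75, 534),
    "DepthAnyEvent": (1.26, 71),
    "DepthAnyEvent-R": (9.20, 202),
}

base_ms, base_mb = table6["DepthAnyEvent"]
summary = {}
for name, (ms, mb) in table6.items():
    fps = 1000.0 / ms          # frames per second implied by the latency
    mem_ratio = mb / base_mb   # memory relative to DepthAnyEvent
    summary[name] = (fps, mem_ratio)
    print(f"{name:16s} {fps:7.1f} FPS  {mem_ratio:4.2f}x memory")
```

With a latency of 9.20 ms, DepthAnyEvent-R still processes more than 100 inputs per second on the A100, which is the sense in which it remains real-time.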
6. Conclusions
In this paper, we presented a novel approach to event-based monocular depth estimation that leverages the power of pre-trained Vision Foundation Models (VFMs). Our cross-modal distillation strategy effectively transfers knowledge from frame-based models to the event domain, addressing the crucial challenge of limited ground-truth data for event cameras. Experimental results with synthetic and real-world datasets validate our method, showing competitive performance compared to fully supervised methods without requiring expensive depth annotations. Moreover, we have demonstrated two effective methods for adapting VFMs to event data: a vanilla adaptation and a recurrent architecture that better captures the nature of event streams, yielding state-of-the-art performance.
Acknowledgment. This study was carried out within the MOST – Sustainable Mobility National Research Center and received funding from the European Union Next-GenerationEU – PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.4 – D.D. 1033 17/06/2022, CN00000023. This manuscript reflects only the authors' views and opinions; neither the European Union nor the European Commission can be considered responsible for them. We also acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support.
References
[1] Luca Bartolomei, Matteo Poggi, Andrea Conti, and Stefano Mattoccia. Lidar-event stereo fusion with hallucinations. In European Conference on Computer Vision, pages 125–145. Springer, 2024. 3, 6
[2] Kenneth Chaney, Fernando Cladera, Ziyun Wang, Anthony Bisulco, M Ani Hsieh, Christopher Korpela, Vijay Kumar, Camillo J Taylor, and Kostas Daniilidis. M3ed: Multi-robot, multi-sensor, multi-environment event dataset. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 4016–4023. IEEE, 2023. 3
[3] Mingyue Cui, Yuzhang Zhu, Yechang Liu, Yunchao Liu, Gang Chen, and Kai Huang. Dense depth-map estimation based on fusion of event camera and sparse lidar. IEEE Transactions on Instrumentation and Measurement, 71:1–11, 2022. 2
[4] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017. 6
[5] Yiqun Duan, Xianda Guo, and Zheng Zhu. DiffusionDepth: Diffusion denoising approach for monocular depth estimation. arXiv preprint arXiv:2303.05021, 2023. 2
[6] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2014. 2
[7] Guillermo Gallego, Tobi Delbruck, Garrick Michael Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew Davison, Jorg Conradt, Kostas Daniilidis, and Davide Scaramuzza. Event-based vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 154–180, 2022. 2
[8] Daniel Gehrig, Michelle Rüegg, Mathias Gehrig, Javier Hidalgo-Carrió, and Davide Scaramuzza. Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction. IEEE Robotics and Automation Letters, 6(2):2822–2829, 2021. 2, 3, 5, 6
[9] Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driving scenarios. IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021. 3, 6
[10] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 2
[11] Suman Ghosh and Guillermo Gallego. Event-based stereo depth estimation: A survey. arXiv preprint arXiv:2409.17680, 2024. 3
[12] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. CoRR, abs/1609.03677, 2016. 2
[13] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. CoRR, abs/1806.01260, 2018. 2
[14] Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rareș Ambruș, and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. In ICCV, 2023. 2
[15] Javier Hidalgo-Carrió, Daniel Gehrig, and Davide Scaramuzza. Learning monocular dense depth from events. CoRR, abs/2010.08350, 2020. 1, 2, 3, 4, 5, 6
[16] Ze Huang, Li Sun, Cheng Zhao, Song Li, and Songzhi Su. Eventpoint: Self-supervised interest point detection and description for event-based camera. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5396–5405, 2023. 3, 4, 5
[17] Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, and Ping Luo. DDP: Diffusion model for dense visual prediction. In ICCV, 2023. 2
[18] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth international conference on 3D vision (3DV), pages 239–248. IEEE, 2016. 2
[19] Katrin Lasinger, René Ranftl, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. CoRR, abs/1907.01341, 2019. 3
[20] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018. 2
[21] Xu Liu, Jianing Li, Jinqiao Shi, Xiaopeng Fan, Yonghong Tian, and Debin Zhao. Event-based monocular depth estimation with recurrent transformers. IEEE Transactions on Circuits and Systems for Video Technology, 34(8):7417–7429, 2024. 2, 3, 4, 5, 6
[22] Yeongwoo Nam, Mohammad Mostafavi, Kuk-Jin Yoon, and Jonghyun Choi. Stereo depth from events cameras: Concentrate and focus on the future. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6114–6123, 2022. 4
[23] Pushmeet Kohli Nathan Silberman, Derek Hoiem and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012. 2
[24] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael