A Modular Deep Learning Framework for Scene Understanding in Augmented Reality Applications

Author: Li, Vladislav; Villarini, Barbara; Nebel, Jean-Christophe; Argyriou, Vasileios

Publisher: Zenodo

DOI: 10.1109/IAICT59002.2023.10205667

Source: https://zenodo.org/records/17542688/files/A_Modular_Deep_Learning_Framework_for_Scene_Understanding_in_Augmented_Reality_Applications.pdf

A Modular Deep Learning Framework for Scene
Understanding in Augmented Reality Applications
1st Vladislav Li
Dept. of Networks and Digital Media
Kingston University
London, UK
[email protected]
2nd Barbara Villarini
School of Computer Science and Engineering
University of Westminster
London, UK
b.villarini@westminster.ac.uk
3rd Jean-Christophe Nebel
Dept. of Computer Science
Kingston University
London, UK
[email protected]
4th Argyriou Vasileios
Dept. of Networks and Digital Media
Kingston University
London, UK
vasileios.ar[email protected]
Abstract: Taking as input natural images and videos, augmented reality (AR)
applications aim to enhance the real world with superimposed digital contents,
enabling interaction between the user and the environment. One important step in
this process is automatic scene analysis and understanding, which should be
performed both in real time and with a good level of object recognition accuracy.
In this work, an end-to-end framework based on the combination of a Super
Resolution network with a detection and recognition deep network has been
proposed to increase performance and lower processing time. This novel approach
has been evaluated on two different datasets: the popular COCO dataset, whose
real images are used for benchmarking many different computer vision tasks, and a
generated dataset with synthetic images recreating a variety of environmental,
lighting, and acquisition conditions. The evaluation analysis is focused on small
objects, which are more challenging to correctly detect and recognise. The
results show that the Average Precision is higher for small and low-resolution
objects for the proposed end-to-end approach in most of the selected conditions.
Index Terms: Augmented Reality, Object Detection, Scene Analysis, Scene
Understanding, Object Recognition, Deep Learning, Super-Resolution, Feature
Extraction
I. INTRODUCTION
Augmented Reality (AR) applications enable users to interact with their
surrounding environment by overlaying digital visuals on top of reality through
the camera view. The aim is to enhance the real world through the combination of
virtual information, such as text, images, video, or 3D models, with scenes
captured by a camera in real time [36]. Furthermore, recent advances in computer
systems' capabilities, high-speed communication, and computer vision technologies
have boosted the demand for human-digital interaction through Mixed Reality (XR)
headsets and new three-dimensional interactive displays. The rapid development of
AR technologies has fostered their application to different fields such as
restoration, education, archaeology, art, tourism, commerce, and healthcare [39].
These immersive technologies rely on the analysis of the surrounding environment
to extract content information. For instance, in the field of autonomous
vehicles, scene analysis and understanding (e.g., vehicle detection, traffic
signs and light recognition, and pedestrian detection) is a key component for
decision-making tasks and end-to-end control [29] so that the augmented
environment can be seamlessly visualised on the car display.
In the last decades, advances in computer vision have fostered the design and
implementation of object recognition methods, increasing computational
performance and lowering process time [43]. As a result, current AR technologies
based on object recognition use complex computer vision techniques to detect and
track objects in the real world. Examples of such technologies include the You
Only Look Once (YOLO) model [1], homomorphic filtering and Haar markers [15] and
the Single Shot Detector [8]. The use of Convolutional Neural Networks (CNNs) and
Deep Learning (DL) led to faster and more accurate detection processes [41].
However, they still deliver poor performance when camera resolution is low or
when the objects to recognise are very small or far away. Thus, this can have an
impact on scene understanding and the overall AR experience.
The aim of this study is to provide a novel integrated end-to-end solution that
improves performance in such conditions by introducing Super-Resolution (SR)
mechanisms. Not only have Generative Adversarial Networks (GANs) been used for
new data generation and to study adversarial samples and attacks, but in the
recent past they have also been investigated to perform SR tasks [6] [18].
Inspired by this, the proposed approach is based on a cascade of two connected
networks. The first network is a super-resolution network that takes as input
transformed images. More specifically, a 3D representation is used where the
z-axis represents the colour channel of the image. The second network is based on
the YOLO series' architecture, which was designed to improve performance at a low
computational cost. The key contributions of this work are: a) the end-to-end
design and training of the two connected networks, allowing automatic
minimisation of the SR reconstruction error and maximisation of the detection and
classification accuracy with a single novel optimisation function; b) a complete
comparative study under a variety of environmental conditions that are known to
affect the overall performance of AR devices; and c) a new dataset composed of
synthetic objects created under different conditions, which allows unbiased
performance evaluation under different sensor and environmental parameters. The
aforementioned solutions could be integrated into the AR applications as a remote
cloud service for better scene understanding or, perhaps, as an offline solution.
The paper is organised as follows: Section 1 introduces the problem and relevant
technologies; Section 2 provides an overview of related work; Section 3 describes
the proposed end-to-end architecture; Section 4 presents results obtained using
both a real image dataset (COCO) and a novel synthetic image dataset; and
Section 5 draws the final conclusions.
II. OVERVIEW OF PREVIOUS WORK
Augmented reality applications rely on machine learning and computer vision
techniques to recognise the presence of physical objects in the real world so
that virtual objects can be added and rendered in real time. In recent years, the
use of Deep CNNs has significantly improved the performance and accuracy of
computer vision for many tasks, such as object detection and recognition. In
2014, Girshick et al. proposed the Regions with CNN features (RCNN) for object
detection [14]. First, initial object candidate boxes are extracted by a
selective search. Then, each box is rescaled to a fixed-size image that is fed to
a CNN model based on AlexNet [37] for feature extraction. Finally, object
detection is performed using a linear SVM classifier. Although this approach led
to a significant improvement in the mean Average Precision when compared with
previous approaches, it suffers from slow detection speed. To overcome this
issue, He et al. proposed the Spatial Pyramid Pooling Network (SPPNet) [19]. Its
main novelty is a Spatial Pyramid Pooling (SPP) layer, which generates a
fixed-length representation regardless of image size and scale, allowing images
of varying sizes to be fed during the training process, which improves scale
invariance and reduces overfitting. In the case of object detection, the feature
maps are computed from the entire image only once, and then the features are
aggregated in sub-regions to generate fixed-length vectors for training the
detectors. Evaluation of this method showed that it could detect objects 24 to
102 times faster than RCNN. In 2015, Girshick proposed an improved version of the
previous RCNN architecture called the Fast RCNN detector [13]. Although this
network allows a detector and a bounding box regressor to be trained
simultaneously with the same network configuration, slow speed remained an issue.
The same year, Ren et al. proposed the Faster RCNN detector [34], which is
considered the first almost real-time deep learning detector using end-to-end
training. This architecture introduces a Region Proposal Network (RPN) to speed
up the detection process. Numerous variants of this approach have been suggested
in the following years to decrease any computational redundancy [7] [23] [24].
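The fixed-length property of the SPP layer described above can be illustrated with a minimal sketch. It is not the SPPNet implementation: it assumes a single-channel feature map stored as a list of rows, max pooling in every bin, and hypothetical pyramid levels of 1 x 1, 2 x 2 and 4 x 4 bins.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into a fixed-length vector.

    feature_map: list of rows (lists of floats) of any height and width.
    levels: hypothetical pyramid levels; level n splits the map into n x n bins.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin boundaries are chosen so that the n bins cover every row.
            r0, r1 = math.floor(i * height / n), math.ceil((i + 1) * height / n)
            for j in range(n):
                c0, c1 = math.floor(j * width / n), math.ceil((j + 1) * width / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


small = [[float(r * 5 + c) for c in range(5)] for r in range(4)]    # 4 x 5 map
large = [[float(r * 13 + c) for c in range(13)] for r in range(9)]  # 9 x 13 map
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))
```

Both inputs yield a vector of 1 + 4 + 16 = 21 values, which is why a layer of this kind lets fixed-size fully connected layers follow feature maps of varying size.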
In particular, Cao et al. (2020) proposed a method called D2Det [4], which is
based on the Faster R-CNN framework. Here the Region of Interest (ROI) features
are processed through two different stages: a high-density local regression,
which replaces the Faster RCNN offset regression, and a discriminant ROI pooling.
In contrast to all the methods mentioned above, which are considered two-stage
detectors as they perform a coarse-to-fine process, in 2016, Redmon et al.
proposed a one-stage detector called You Only Look Once (YOLO) [31]. The image is
divided into regions, and the network predicts bounding boxes for each region at
the same time. With such an approach, the whole process is completed in one step
by applying a single network to the entire image, increasing processing speed
significantly. Although YOLO's second and third versions have improved its
prediction accuracy [32] [33], they still underperform in terms of localisation
accuracy when compared with the two-stage methods. Liu et al. tried to improve
this aspect by proposing a Single Shot MultiBox Detector (SSD) [27], introducing
a multi-reference and multi-resolution detection method that detects objects at
different scales on different network layers. As a result, the SSD network gains
a small improvement, outperforming YOLO in the PASCAL VOC detection task [9].
Suggesting that extreme foreground-background class imbalance is the cause of the
lower accuracy of the one-stage detectors, in 2018, Lin et al. introduced
RetinaNet [25], where a new loss function called "focal loss" was added to
improve their approach. Indeed, by modifying the standard cross-entropy loss, the
detector is more attentive to misclassified examples during training. A recent
trend in object detection methods is anchor-free techniques, where the methods
infer the bounding box corners instead of fixed bounding boxes. A notable example
is CenterNet, proposed by Zhou et al. [42]. The CenterNet method is a
state-of-the-art Lidar-based 3D detection and tracking framework. It could be
viewed as an improvement over CornerNet, which is an anchor-free approach for
detecting bounding boxes as a pair of keypoints. The keypoints are the top-left
and bottom-right corners retrieved via the corner pooling technique introduced by
the same authors [21]. The CenterNet method has introduced the notion of a centre
keypoint to help associate the corner keypoints with an object in the image. The
CenterNet method has outperformed common anchor-based solutions such as Faster
RCNN and YOLO by a significant margin. In 2020, Perez-Rua et al. [30] introduced
OpeN-ended Centre nEt (ONCE), which offered functionality that can detect objects
from classes with a small number of examples inside its training dataset. The
more recent approaches start to investigate the possibilities of transformers
covered in the DEtection TRansformer (DETR) method [5], with the advantage of
being simple yet on par with the rest of the detection techniques used in the
field. Later, Zhu et al. proposed Deformable DETR as an improvement to address
the performance problem in detecting small objects and achieve state-of-the-art
performance.
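The way focal loss modifies the standard cross-entropy loss, as mentioned above for RetinaNet, can be sketched for a single binary prediction. The values gamma = 2 and alpha = 0.25 are the defaults commonly quoted for focal loss; treating them as the settings of any model discussed here is an assumption.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by alpha_t * (1 - p_t) ** gamma.

    The modulating factor (1 - p_t) ** gamma shrinks the loss of
    well-classified examples, so hard examples dominate training.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy, hard = 0.9, 0.1  # predicted probabilities for a positive example
print(focal_loss(easy, 1) / cross_entropy(easy, 1))  # small scaling factor
print(focal_loss(hard, 1) / cross_entropy(hard, 1))  # much larger scaling factor
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy.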
Fig. 1: Overview of the proposed novel framework trained end-to-end. For the SR and detector models any state-of-the-art
solutions can be used without affecting the overall pipeline and the proposed modular architecture.
Fig. 2: Training setup of the DAT SR deep network.
Fig. 3: YOLOX architecture relying on a decoupled head
In parallel, approaches have been developed to enhance the detection of small
objects, which is particularly challenging as they have fewer visible details.
Super-resolution solutions relying on Generative Adversarial Networks (GAN) [16]
have proved particularly successful [2] [22]. Indeed, their competitive process
involving two neural networks, i.e., a generator network and a discriminant
network, ensures that the generated images are as realistic as possible.
Herein, we address the detection and recognition of visually small objects by
proposing a novel approach based on a super-resolution network and a second
network with a modified YOLO architecture trained end-to-end. This solution aims
to provide accurate detection of objects that are very small or very far from the
camera sensor of the AR glasses while delivering a fast processing time, enabling
its usage in real-time applications.
Fig. 4: An example of predictions and Confusion Matrix (the white colour of the
image was levelled up to see the numbers)
III. PROPOSED FRAMEWORK
In this paper, we propose an end-to-end framework for scene understanding that
combines super-resolution, object detection, and classification architectures.
Figure 1 shows an overview of the proposed methodology, where the two main
components take as input an image (or a video) and are trained in an end-to-end
manner. Details of these two processing blocks are described in the following
sections:
A. Super-Resolution Method
As reviewed in Section 2, the usage of super-resolution in a pre-processing stage
has been included in many computer vision pipelines. Typically, the
super-resolution model is trained in an unsupervised manner using several
independent datasets, while the target classification or detection model is
trained only on a single task-related dataset. By doing so, some extra
information that is not available in the target labelled dataset is injected into
the SR images [20].
Fig. 5: Examples of the generated synthetic data. The bottom row represents
examples from the Ground category. From the left, the first column shows the
Camera sub-category, the second column shows the Light sub-category, the third
column displays the Weather sub-category. Panels: (a) Ground - Camera Subcategory
- Golf (97%); (b) Ground - Light Subcategory - City Bus (96%); (c) Ground -
Weather Subcategory - City Bus (95%).
SR models are trained under the assumption that the low-scale images, passed as
input, are the results of some low-pass filter, such as a Gaussian blur or a
point spread function. The training takes place by first down-sampling
high-resolution images with such a kernel and then optimising the model to
reconstruct these high-resolution images. Theoretically, the kernel function
should match the actual blurring process caused by the camera used in the
targeted application. However, since this is usually unknown, 'standard' kernels
have been used. Unfortunately, these standard kernels fail to model the specific
optics and sensors of the actual cameras that captured the images of interest,
leading to degraded performance in real-world scenarios. To address this, methods
have been proposed to learn the blur kernel, known as Blind Super Resolution
[28]. The most accurate approaches, such as the state-of-the-art network Deep
Alternating Network (DAT) [20], have relied on deep learning architectures. DAT
was selected for our pipeline because its learning is unsupervised, and it
delivers fast computation, making it suitable for mobile devices and
low-specification desktop computers. In fact, the authors demonstrated that the
average speed is 0.75 seconds per image, which is more than 500 times faster than
its competitors KernelGAN [3] and ZSSR [35], and 5 times faster than IKC [17].
These average speeds are considered fast in the domain of SR. Figure 2 shows
DAT's training setup, which consists of two main networks called Restorer and
Estimator: the Restorer produces the SR image, while the Estimator provides an
estimate of a blur kernel given the restored image. The two networks are used in
alternation, improving the quality of the SR image and the accuracy of the
estimated kernel at each Restorer-Estimator step. The sequence of
Restorer-Estimator is optimised end-to-end using a stochastic back-propagation
algorithm.
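The training assumption described above, namely that a low-resolution image is a blurred and down-sampled version of a high-resolution one, can be sketched as follows. This is only an illustration of the degradation model with a 'standard' Gaussian kernel; the kernel size, sigma, scale factor and border handling are assumptions, and the learned Restorer and Estimator networks are not reproduced.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalised size x size Gaussian blur kernel (size must be odd)."""
    half = size // 2
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            for dx in range(-half, half + 1)]
           for dy in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def degrade(image, kernel, scale=2):
    """Blur a greyscale image (list of rows) with `kernel`, then keep every
    `scale`-th pixel. Borders are handled by clamping coordinates."""
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    low_res = []
    for r in range(0, height, scale):
        row = []
        for c in range(0, width, scale):
            acc = 0.0
            for ky, kernel_row in enumerate(kernel):
                for kx, weight in enumerate(kernel_row):
                    rr = min(max(r + ky - half, 0), height - 1)
                    cc = min(max(c + kx - half, 0), width - 1)
                    acc += weight * image[rr][cc]
            row.append(acc)
        low_res.append(row)
    return low_res


# An 8 x 8 checkerboard stands in for a high-resolution training image.
high_res = [[float((r + c) % 2) for c in range(8)] for r in range(8)]
low_res = degrade(high_res, gaussian_kernel(5, 1.0), scale=2)
print(len(low_res), len(low_res[0]))
```

An SR model trained on such (low_res, high_res) pairs learns to invert this particular kernel, which is why a mismatch with the real camera blur motivates blind approaches that estimate the kernel.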
B. Object Detection Method
For augmented reality applications, object detection models must deliver high
accuracy in real-time. For the proposed framework, the anchor-free model, YOLOX
[11], offers the best compromise. Its simple, powerful, and computationally
efficient architecture is built upon one of the most widely used detectors in the
industry, YOLOv3 [33], which not only has a limited computational cost but also
has received excellent software support. However, an important improvement of
YOLOX is that, unlike the previous architectures of the YOLO series, it uses a
decoupled head, which improves convergence speed. Figure 3 provides an overview
of the architecture of YOLOX. Following a 1 x 1 convolutional layer used to
decrease the number of channels, there are two parallel branches with 3 x 3
convolutional layers. Moreover, compared to the baseline YOLOv3, an Intersection
over Union (IoU) aware branch is added in the regression branch.
Another enhancement of YOLOX is, unlike the past versions of YOLO detectors
(except for YOLOv1), the usage of an anchor-free model. Anchors are candidate
bounding boxes with predefined dimensions that the detector selects during the
detection process and for which it predicts the delta values of their centres and
dimensions. Obviously, these additional predictions require extra processing
during both the training and inference stages, which impacts the overall
computational time. On the other hand, when using an anchor-free approach,
bounding boxes are predicted directly, which reduces the number of design
parameters. As such an approach requires advanced data augmentation to match the
performance of anchor-based models, state-of-the-art data augmentation
approaches, i.e., Mosaic and MixUp, were exploited [11]. Indeed, they are known
to bring stability and reduce overfitting during the training process. Finally,
it is important to specify that YOLOX leverages a high-performance CNN front-end,
CSPNet [38], which is followed by a feature pyramids network (FPN) [33].
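As an illustration of the MixUp augmentation mentioned above, the sketch below blends two equally sized greyscale images and keeps the bounding boxes of both. It is a simplified variant and not YOLOX's implementation: the function name, the box format and the Beta(1.5, 1.5) sampling of the mixing weight are assumptions, and Mosaic is not shown.

```python
import random


def mixup(image_a, boxes_a, image_b, boxes_b, lam=None, rng=random):
    """Blend two equally sized greyscale images and merge their box labels.

    lam is the weight of image_a; when omitted it is drawn from a Beta
    distribution (a hypothetical choice made for this sketch).
    """
    if lam is None:
        lam = rng.betavariate(1.5, 1.5)
    mixed = [[lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
             for row_a, row_b in zip(image_a, image_b)]
    # For detection, the objects of both source images remain visible,
    # so both sets of boxes are kept.
    return mixed, boxes_a + boxes_b


img_a = [[1.0, 1.0], [1.0, 1.0]]
img_b = [[0.0, 0.0], [0.0, 0.0]]
mixed, boxes = mixup(img_a, [("car", 0, 0, 1, 1)], img_b, [("bus", 1, 1, 2, 2)], lam=0.5)
print(mixed, boxes)
```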
C. End-to-end Framework
The methods described in sub-sections A and B were integrated into an end-to-end
framework. Thus, the framework comprises two main components, i.e.
Super-Resolution and Detector. Equation (1) illustrates the proposed end-to-end
architecture where x is the input low-resolution image, y is the image generated
by the super-resolution function S(·), and z is the output of the detection
function D(·).
\[
\begin{cases} y = S(x) \\ z = D(y) \end{cases}
\;\rightarrow\; z = D(S(x)) \tag{1}
\]
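Equation (1) is a plain function composition, which is also what makes the architecture modular: any S and D with compatible inputs and outputs can be plugged in. A minimal sketch follows, in which `toy_sr` and `toy_detector` are hypothetical placeholders rather than the DAT and YOLOX networks.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Detection = Tuple[str, float]  # (label, score), a hypothetical output format


def compose(S: Callable[[Image], Image],
            D: Callable[[Image], List[Detection]]) -> Callable[[Image], List[Detection]]:
    """Return the end-to-end mapping z = D(S(x))."""
    def pipeline(x: Image) -> List[Detection]:
        y = S(x)      # super-resolved image
        return D(y)   # detections computed on the enhanced image
    return pipeline


def toy_sr(x: Image) -> Image:
    """Placeholder S: 2x nearest-neighbour upscaling (not a learned model)."""
    return [[v for v in row for _ in range(2)] for row in x for _ in range(2)]


def toy_detector(y: Image) -> List[Detection]:
    """Placeholder D: reports the image size instead of real detections."""
    return [("image_%dx%d" % (len(y), len(y[0])), 1.0)]


framework = compose(toy_sr, toy_detector)
print(framework([[0.0, 1.0], [1.0, 0.0]]))
```

Swapping either placeholder for another model leaves `compose` untouched.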
In this framework, an input image is first handled by the SR component, which
produces a super-resolved output image. Then, this image is passed to the
detector component, which recognises and locates objects. Through this process,
the detector learns from images enhanced by the SR component. The input images
are super-resolved using kernels. There are many types of kernels, such as the
common bicubic kernel or linear kernel. These kernels are well-studied and don't
require an AI network to calculate them.
However, in the case of the SR task, real-world images don't contain information
about the kernel, making it challenging to successfully restore them.
Consequently, an estimator is used to infer the kernel during the training
process. This estimated kernel is then passed to the restorer to generate images.
As a result, the restored images contain features that are the product of the
kernel. These features can be picked up by the detector during the training
process, creating a symbiotic relationship between the SR and detector
components, leading to improved performance.
To monitor and evaluate the training of the framework, several state-of-the-art
loss functions were selected. The detector is trained using Varifocal Loss [40]
as the classification loss function and SIoU [12] (Scylla Intersection over
Union) as the box regression loss function. Moreover, the training process was
facilitated with SimOTA, a simplification of OTA [10] (Optimal Transport
Assignment), for dynamic label assignment [11].
The Varifocal Loss is particularly efficient because it considers both
classification and localisation scores when ranking candidates using IoU.
Similarly, the SIoU loss function addresses direction mismatch between expected
and predicted bounding boxes by exploiting angle, distance, shape, and IoU costs.
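Both losses above build on the Intersection over Union of two boxes. The sketch below computes plain IoU only, assuming boxes in (x_min, y_min, x_max, y_max) format; the angle, distance and shape costs that SIoU adds, and the Varifocal weighting, are not reproduced.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (empty if the boxes are disjoint).
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (1.0, 1.0, 3.0, 3.0)
expected = (0.0, 0.0, 2.0, 2.0)
print(iou(predicted, expected))  # overlap of 1 over a union of 7
```

A basic IoU-based regression loss is 1 - IoU; it gives no gradient signal about direction once the boxes stop overlapping, which is the kind of limitation the additional SIoU costs are meant to address.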
Finally, the value of SimOTA is to view the task of bounding box assignment as an
optimal transport problem, where the unit transportation cost between an
anchor-point and ground truth is expressed as a weighted sum of their
classification and regression losses to find the best assignment solution.
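The weighted-sum cost described above can be sketched as a small cost matrix from which each ground-truth object picks its cheapest anchor points. This is a strong simplification: the weight of 3.0 and the fixed k are assumptions made for illustration, and the dynamic choice of k and the conflict resolution used by SimOTA are omitted.

```python
def assignment_costs(cls_losses, reg_losses, reg_weight=3.0):
    """cost[g][a] = cls_loss + reg_weight * reg_loss for ground truth g
    and anchor point a (reg_weight is a hypothetical value)."""
    return [[c + reg_weight * r for c, r in zip(cls_row, reg_row)]
            for cls_row, reg_row in zip(cls_losses, reg_losses)]


def select_top_k(costs, k=1):
    """For each ground-truth object, return the indices of the k cheapest anchor points."""
    return [sorted(range(len(row)), key=row.__getitem__)[:k] for row in costs]


# Two ground-truth objects, three candidate anchor points (illustrative losses).
cls_losses = [[0.2, 0.9, 0.5],
              [0.8, 0.1, 0.3]]
reg_losses = [[0.10, 0.05, 0.40],
              [0.50, 0.30, 0.05]]
costs = assignment_costs(cls_losses, reg_losses)
print(select_top_k(costs, k=1))
```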
D. Parameters
The end-to-end framework was fine-tuned by running 10 epochs with a batch size of
3 on both real and synthetic data under three different categories. While the
learning rate was set to 0.0001 for the SR component, it was set to 0.0032 with
SGD (Stochastic Gradient Descent) optimisation for the detector. Additionally, as
mentioned earlier, training was enhanced using Mosaic and MixUp as data
augmentation strategies.
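For reference, the stated fine-tuning parameters can be collected in a small configuration sketch. The key names are hypothetical and do not correspond to any particular training script; settings that the text does not give, such as the optimiser of the SR component, are deliberately left out.

```python
# Fine-tuning parameters as stated in the text; key names are illustrative only.
TRAINING_CONFIG = {
    "epochs": 10,
    "batch_size": 3,
    "sr": {"learning_rate": 0.0001},
    "detector": {"learning_rate": 0.0032, "optimizer": "SGD"},
    "augmentations": ["Mosaic", "MixUp"],
}

for component in ("sr", "detector"):
    print(component, TRAINING_CONFIG[component]["learning_rate"])
```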
IV. EVALUATION
The proposed method has been applied to object recognition and scene
understanding. Its evaluation was performed using the Common Objects in Context
(COCO) dataset [26] and a synthetic dataset where different environmental
conditions were applied to affect image quality. COCO is widely used to benchmark
computer vision models. It consists of 330K images, with more than 200K labelled
images, 1.5 million object instances, 80 object categories, 91 stuff categories,
and 5 captions per image.
Comparisons with state-of-the-art methods rely on the mean Average Precision
(mAP), a standard metric introduced in 2014 to quantify object detection
performance based on a user-defined set of criteria [26]. It is defined as the
mean value of the average precision of the individual classes:
\[
mAP = \frac{1}{n} \sum_{k=1}^{n} AP_k \tag{2}
\]
where AP_k is the Average Precision of class k, and n is the number of classes.
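Equation (2) amounts to a simple average over classes, as the following sketch shows. The per-class values are illustrative only and are not results from this paper; computing each AP_k from precision-recall curves is not covered.

```python
def mean_average_precision(ap_per_class):
    """Return the mean of the per-class Average Precision values (Equation 2)."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


example_ap = {"car": 0.80, "van": 0.50, "bus": 0.95}  # illustrative values only
print(round(100 * mean_average_precision(example_ap), 2))  # mAP in percent
```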
TABLE I: Model Performance on the COCO Dataset in terms of mAP (%)
Model      RetinaNet   YOLOv3   Faster R-CNN   Proposed
mAP (%)    52.61       44.76    40.50          67.09
TABLE II: Model Performance on the Synthetic Dataset according to the Three
Image Categories
Category   mAP (%)
Camera     60.52
Light      81.25
Weather    66.98
In this evaluation process, using the COCO dataset, Table I shows the performance
in terms of mAP of the proposed framework compared to other approaches presented
in the literature review. Our framework outperforms all its competitors.
Moreover, the added value of the super-resolution component is clearly
established as it exceeds YOLO's mAP by over 20 percentage points (67.09% against
44.76% for YOLOv3). The confusion matrix in Figure 4 further demonstrates the
performance of the model. In particular, it predicts objects belonging to the
prevailing "car" category with high accuracy. However, it should be highlighted
that the "van" category is often mistaken for the "car" category, which is due to
the visual similarity between images of these two classes.
Further evaluation has been conducted using a synthetic dataset that we created
using a 3D Rendering Engine. This dataset consists of approximately 3000
low-resolution images per category of ground vehicles in different environments
and weather conditions. An example of the dataset with predictions can be seen in
Figure 5. The images in this dataset belong to three different categories, each
allowing us to assess our model on specific properties: a) Camera, b) Light, c)
Weather.
The "Camera" category contains images of objects captured from various camera
angles and distances. In the "Light" category, images are generated under a
variety of lighting parameters, mimicking different parts of the day such as
morning, afternoon, evening, and night. The "Weather" category simulates images
captured under diverse weather circumstances, including varying rain and wind
conditions.
The performance of the proposed framework for these three categories in terms of
mAP is shown in Table II. Additionally, the confusion matrices can be observed in
Figure 6. The results demonstrate high confidence in predicting the bus and car
categories. The sports car and car categories are quite often confused, which is
reasonable because the two categories are very similar. The van category, on the
other hand, has sometimes been mistaken for the sports car category, which
indicates close similarity of the samples in the dataset. The bus category
samples, by contrast, seem distinct enough to avoid confusion with the other
categories.
V. CONCLUSION
The work presented in this paper offers an end-to-end solution for object
detection and recognition on AR devices. The modular architecture allows for the
integration of different SR and detection models within the same pipeline. The
paper provides an overview of existing solutions and approaches in both super
resolution and scene analysis methods, specifically focusing on their
applications in immersive environments.
The proposed architecture was tested on both real and synthetic datasets in a
comparative study, alongside other state-of-the-art approaches. The results
obtained demonstrate a significant improvement, particularly for low-resolution
or distant objects. Furthermore, the proposed framework was evaluated and
analysed under various environmental conditions and with a range of camera
sensors.
In addition to the evaluation on real datasets, a new balanced synthetic dataset
was generated. This dataset includes annotated data covering multiple objects and
environments, allowing for further assessment and experimentation.
VI. ACKNOWLEDGMENT
This work was funded by UK Research and Innovation (UKRI) under the UK
government's Horizon Europe funding guarantee [grant number 10047653] and funded
by the European Union [under EC Horizon Europe grant agreement number 101070181
(TALON)].
REFERENCES
[1] Ryan Anderson, Juan Toledo, and Hala ElAarag. Feasibility study
on the utilization of microsoft hololens to increase driving conditions
awareness. In 2019 SoutheastCon, pages 1–8. IEEE, 2019.
[2] Yancheng Bai, Yongqiang Zhang, Mingli Ding, and Bernard Ghanem.
Sod-mtgan: Small object detection via multi-task generative adversarial
network. In Proceedings of the European Conference on Computer
Vision (ECCV), pages 206–221, 2018.
[3] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution
kernel estimation using an internal-gan. Advances in Neural
Information Processing Systems, 32, 2019.
[4] Jiale Cao, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz
Khan, Yanwei Pang, and Ling Shao. D2det: Towards high quality object
detection and instance segmentation. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 11485–11494,
2020.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier,
Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection
with transformers. In Computer Vision–ECCV 2020: 16th European
Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16,
pages 213–229. Springer, 2020.
[6] Mengyu Chu, You Xie, Laura Leal-Taixé, and Nils Thuerey. Temporally
coherent gans for video super-resolution (tecogan). arXiv preprint
arXiv:1811.09393, 1(2):3, 2018.
[7] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection
via region-based fully convolutional networks. Advances in neural
information processing systems, 29, 2016.
[8] Nikos Dimitropoulos, Theodoros Togias, George Michalos, and Sotiris
Makris. Operator support in human–robot collaborative environments
using ai enhanced wearable devices. Procedia Cirp, 97:464–469, 2021.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
and A. Zisserman. The PASCAL Visual Object Classes
Challenge 2012 (VOC2012) Results.
http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[10] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota:
Optimal transport assignment for object detection. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 303–312, 2021.
[11] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox:
Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
[12] Zhora Gevorgyan. Siou loss: More powerful learning for bounding box
regression. arXiv preprint arXiv:2205.12740, 2022.
[13] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international
conference on computer vision, pages 1440–1448, 2015.
[14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich
feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 580–587, 2014.
[15] Daniel Lima Gomes Jr, Anselmo Cardoso de Paiva, Aristófanes Corrêa
Silva, Geraldo Braz Jr, João Dallyson Sousa de Almeida, Antônio Sérgio
de Araújo, and Marcelo Gattass. Augmented visualization using
homomorphic filtering and haar-based natural markers for power systems
substations. Computers in Industry, 97:67–75, 2018.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
Generative adversarial networks. Communications of the ACM, 63(11):139–144,
2020.
[17] Jinjin Gu, Hannan Lu, Wangmeng Zuo, and Chao Dong. Blind
super-resolution with iterative kernel correction. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 1604–1613, 2019.
[18] Rohit Gupta, Anurag Sharma, and Anupam Kumar. Super-resolution
using gans for medical imaging. Procedia Computer Science, 173:28–35,
2020.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid
pooling in deep convolutional networks for visual recognition. IEEE
transactions on pattern analysis and machine intelligence, 37(9):1904–1916,
2015.
[20] Yan Huang, Shang Li, Liang Wang, Tieniu Tan, et al. Unfolding the
alternating optimization for blind super resolution. Advances in Neural
Information Processing Systems, 33:5632–5643, 2020.
[21] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints.
In Proceedings of the European conference on computer vision (ECCV),
pages 734–750, 2018.
[22] Vladislav Li, George Amponis, Jean-Christophe Nebel, Vasileios
Argyriou, Thomas Lagkas, Savvas Ouzounidis, and Panagiotis Sarigiannidis.
Super resolution for augmented reality applications. In IEEE
INFOCOM 2022-IEEE Conference on Computer Communications Workshops
(INFOCOM WKSHPS), pages 1–6. IEEE, 2022.
[23] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng,
and Jian Sun. Light-head r-cnn: In defense of two-stage object detector.
arXiv preprint arXiv:1711.07264, 2017.
[24] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath
Hariharan, and Serge Belongie. Feature pyramid networks for object detection.
In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 2117–2125, 2017.
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár.
Focal loss for dense object detection. In Proceedings of the IEEE
international conference on computer vision, pages 2980–2988, 2017.
[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro
Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft
coco: Common objects in context. In Computer Vision–ECCV 2014:
13th European Conference, Zurich, Switzerland, September 6-12, 2014,
Proceedings, Part V 13, pages 740–755. Springer, 2014.
[27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott
Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox
detector. In Computer Vision–ECCV 2016: 14th European Conference,
Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part
I 14, pages 21–37. Springer, 2016.
[28] Tomer Michaeli and Michal Irani. Nonparametric blind super-resolution.
In Proceedings of the IEEE International Conference on Computer
Vision, pages 945–952, 2013.
[29] Monirul Islam Pavel, Siok Yee Tan, and Azizi Abdullah. Vision-based
autonomous vehicle systems based on deep learning: A systematic
literature review. Applied Sciences, 12(14):6831, 2022.
[30] Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy M. Hospedales, and Tao
Xiang. Incremental few-shot object detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), June 2020.
[31] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You
only look once: Unified, real-time object detection. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages
779–788, 2016.
[32] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In
Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 7263–7271, 2017.
[33] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement.
arXiv preprint arXiv:1804.02767, 2018.
Fig. 6: Confusion Matrices of the Ground category of the generated synthetic data. From the left, the first column shows the
Camera sub-category, the second column shows the Light sub-category, the third column displays the Weather sub-category.
Panels: (a) Camera Subcategory - Confusion Matrix; (b) Light Subcategory - Confusion Matrix; (c) Weather Subcategory -
Confusion Matrix; (d) Camera Subcategory - Super Confusion Matrix; (e) Light Subcategory - Super Confusion Matrix;
(f) Weather Subcategory - Super Confusion Matrix.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn:
Towards real-time object detection with region proposal networks.
Advances in neural information processing systems, 28, 2015.
[35] Assaf Shocher, Nadav Cohen, and Michal Irani. "zero-shot" super-resolution
using deep internal learning. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 3118–3126,
2018.
[36] Hu Tianyu, Zhang Quanfu, and Dong Huiyuan ShenYongjie. Overview
of augmented reality technology. Computer Knowledge and Technology,
34:194–196, 2017.
[37] Paul Viola and Michael Jones. Rapid object detection using a boosted
cascade of simple features. In Proceedings of the 2001 IEEE computer
society conference on computer vision and pattern recognition. CVPR
2001, volume 1, pages I–I. IEEE, 2001.
[38] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang
Chen, Jun-Wei Hsieh, and I-Hau Yeh. Cspnet: A new backbone that can
enhance learning capability of cnn. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition workshops, pages
390–391, 2020.
[39] Jianghao Xiong, En-Lin Hsiang, Ziqian He, Tao Zhan, and Shin-Tson
Wu. Augmented reality and virtual reality displays: emerging
technologies and future perspectives. Light: Science & Applications,
10(1):216, 2021.
[40] Haoyang Zhang, Ying Wang, Feras Dayoub, and Niko Sunderhauf.
Varifocalnet: An iou-aware dense object detector. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 8514–8523, 2021.
[41] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object
detection with deep learning: A review. IEEE transactions on neural
networks and learning systems, 30(11):3212–3232, 2019.
[42] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points.
arXiv preprint arXiv:1904.07850, 2019.
[43] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye.
Object detection in 20 years: A survey. Proceedings of the IEEE, 2023.
