A Modular Deep Learning Framework for Scene
Understanding in Augmented Reality Applications
1st Vladislav Li
Dept. of Networks and Digital Media
Kingston University
London, UK
[email protected]
2nd Barbara Villarini
School of Computer Science and Engineering
University of Westminster
London, UK
b.villarini@westminster.ac.uk
3rd Jean-Christophe Nebel
Dept. of Computer Science
Kingston University
London, UK
[email protected]
4th Argyriou Vasileios
Dept. of Networks and Digital Media
Kingston University
London, UK
vasileios.a [email protected]
Abstract—Taking as input natural images and videos, augmented reality (AR) applications aim to enhance the real world with superimposed digital contents, enabling interaction between the user and the environment. One important step in this process is automatic scene analysis and understanding, which should be performed both in real time and with a good level of object recognition accuracy. In this work, an end-to-end framework based on the combination of a Super Resolution network with a detection and recognition deep network has been proposed to increase performance and lower processing time. This novel approach has been evaluated on two different datasets: the popular COCO dataset, whose real images are used for benchmarking many different computer vision tasks, and a generated dataset with synthetic images recreating a variety of environmental, lighting, and acquisition conditions. The evaluation analysis is focused on small objects, which are more challenging to correctly detect and recognise. The results show that the Average Precision is higher for small and low-resolution objects for the proposed end-to-end approach in most of the selected conditions.
Index Terms—Augmented Reality, Object Detection, Scene Analysis, Scene Understanding, Object Recognition, Deep Learning, Super-Resolution, Feature Extraction
I. INTRODUCTION
Augmented Reality (AR) applications enable users to interact with their surrounding environment by overlaying digital visuals on top of reality through the camera view. The aim is to enhance the real world through the combination of virtual information, such as text, images, video, or 3D models, with scenes captured by a camera in real time [36]. Furthermore, recent advances in computer systems' capabilities, high-speed communication, and computer vision technologies have boosted the demand for human-digital interaction through Extended Reality (XR) headsets and new three-dimensional interactive displays. The rapid development of AR technologies has fostered their application to different fields such as restoration, education, archaeology, art, tourism, commerce, and healthcare [39].
These immersive technologies rely on the analysis of the surrounding environment to extract content information. For instance, in the field of autonomous vehicles, scene analysis and understanding (e.g., vehicle detection, traffic signs and light recognition, and pedestrian detection) is a key component of decision-making tasks and end-to-end control [29] so that the augmented environment can be seamlessly visualised on the car display.
In the last decades, advances in computer vision have fostered the design and implementation of object recognition methods, increasing computational performance and lowering process time [43]. As a result, current AR technologies based on object recognition use complex computer vision techniques to detect and track objects in the real world. Examples of such technologies include the You Only Look Once (YOLO) model [1], homomorphic filtering and Haar markers [15], and the Single Shot Detector [8]. The use of Convolutional Neural Networks (CNNs) and Deep Learning (DL) led to faster and more accurate detection processes [41]. However, they still deliver poor performance when camera resolution is low or when the objects to recognise are very small or far away. This can have an impact on scene understanding and the overall AR experience.
The aim of this study is to provide a novel integrated end-to-end solution that improves performance in such conditions by introducing Super-Resolution (SR) mechanisms. Not only have Generative Adversarial Networks (GANs) been used for new data generation and to study adversarial samples and attacks, but in the recent past they have also been investigated to perform SR tasks [6] [18]. Inspired by this, the proposed approach is based on a cascade of two connected networks. The first network is a super-resolution network that takes as input transformed images. More specifically, a 3D representation is used where the z-axis represents the colour channel of the image. The second network is based on the YOLO series' architecture, which was designed to improve performance at a low computational cost. The key contributions of this work are: a) the end-to-end design and training of the two connected networks, allowing automatic minimisation of the SR reconstruction error and maximisation of the detection and classification accuracy with a single novel optimisation function; b) a complete comparative study under a variety of environmental conditions that are known to affect the overall performance of AR devices; and c) a new dataset composed of synthetic objects created under different conditions, which allows unbiased performance evaluation under different sensor and environmental parameters. The aforementioned solutions could be integrated into AR applications as a remote cloud service for better scene understanding or, perhaps, as an offline solution.
The paper is organised as follows: Section 1 introduces the problem and relevant technologies; Section 2 provides an overview of related work; Section 3 describes the proposed end-to-end architecture; Section 4 presents results obtained using both a real image dataset (COCO) and a novel synthetic image dataset; and Section 5 draws the final conclusions.
II. OVERVIEW OF PREVIOUS WORK
Augmented reality applications rely on machine learning and computer vision techniques to recognise the presence of physical objects in the real world so that virtual objects can be added and rendered in real time. In recent years, the use of Deep CNNs has significantly improved the performance and accuracy of computer vision for many tasks, such as object detection and recognition. In 2014, Girshick et al. proposed the Regions with CNN features (RCNN) for object detection [14]. First, initial object candidate boxes are extracted by a selective search. Then, each box is rescaled to a fixed-size image that is fed to a CNN model trained on an AlexNet [37] for feature extraction. Finally, object detection is performed using a linear SVM classifier. Although this approach led to a significant improvement in the mean Average Precision when compared with previous approaches, it suffers from slow detection speed. To overcome this issue, He et al. proposed the Spatial Pyramid Pooling Network (SPPNet) [19]. Its main novelty is a Spatial Pyramid Pooling (SPP) layer, which generates a fixed-length representation regardless of image size and scale, allowing images of varying sizes to be fed during the training process, which improves scale invariance and reduces overfitting. In the case of object detection, the feature maps are computed from the entire image only once, and then the features are aggregated in sub-regions to generate fixed-length vectors for training the detectors. Evaluation of this method showed that it could detect objects 24 to 102 times faster than RCNN. In 2015, Girshick proposed an improved version of the previous RCNN architecture called the Fast RCNN detector [13]. Although this network allows a detector and a bounding box regressor to be trained simultaneously with the same network configuration, slow speed remained an issue. The same year, Ren et al. proposed the Faster RCNN detector [34], which is considered the first almost real-time deep learning detector using end-to-end training. This architecture introduces a Region Proposal Network (RPN) to speed up the detection process. Numerous variants of this approach have been suggested in the following years to decrease any computational redundancy [7] [23] [24].
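The fixed-length property of the SPP layer described above can be illustrated with a minimal sketch. It is not the implementation of [19]: the function name `spatial_pyramid_pool`, the pyramid levels (1, 2, 4), and the use of a single-channel feature map stored as nested lists are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def spatial_pyramid_pool(feature_map: Matrix, levels=(1, 2, 4)) -> List[float]:
    """Max-pool one 2D feature map over an n x n grid for each pyramid level.

    The output length is sum(n * n for n in levels), whatever the input size.
    Assumes the map is at least max(levels) pixels in each dimension.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceiling for the end keeps every bin non-empty.
            r0, r1 = (i * height) // n, ((i + 1) * height + n - 1) // n
            for j in range(n):
                c0, c1 = (j * width) // n, ((j + 1) * width + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two hypothetical feature maps of different sizes.
small = [[float(r * 5 + c) for c in range(5)] for r in range(4)]     # 4 x 5
large = [[float(r * 13 + c) for c in range(13)] for r in range(9)]   # 9 x 13
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))
```

Both calls return vectors of the same length (1 + 4 + 16 values), which is what lets a fully connected layer follow feature maps of arbitrary size.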
In particular, Cao et al. (2020) proposed a method called D2Det [4], which is based on the Faster R-CNN framework. Here the Region of Interest (ROI) features are processed through two different stages: a high-density local regression, which replaces the Faster RCNN offset regression, and a discriminant ROI pooling. In contrast to all the methods mentioned above, which are considered two-stage detectors as they perform a coarse-to-fine process, in 2016, Redmon et al. proposed a one-stage detector called You Only Look Once (YOLO) [31]. The image is divided into regions, and the network predicts bounding boxes for each region at the same time. With such an approach, the whole process is completed in one step by applying a single network to the entire image, increasing processing speed significantly. Although YOLO's second and third versions have improved its prediction accuracy [32] [33], they still underperform in terms of localisation accuracy when compared with the two-stage methods. Liu et al. tried to improve this aspect by proposing a Single Shot MultiBox Detector (SSD) [27], introducing a multi-reference and multi-resolution detection method that detects objects at different scales on different network layers. As a result, the SSD network gains a small improvement, outperforming YOLO in the PASCAL VOC detection task [9]. Suggesting that extreme foreground-background class imbalance is the cause of the lower accuracy of the one-stage detectors, in 2018, Lin et al. introduced the RetinaNet [25], where a new loss function called "focal loss" was added to improve their approach. Indeed, by modifying the standard cross-entropy loss, the detector is more attentive to misclassified examples during training. A recent trend in object detection methods is anchor-free techniques, where the methods infer the bounding box corners instead of fixed bounding boxes. A notable example is CenterNet, proposed by Zhou et al. [42]. The CenterNet method is a state-of-the-art Lidar-based 3D detection and tracking framework. It could be viewed as an improvement over CornerNet, which is an anchor-free approach for detecting bounding boxes as a pair of keypoints. The keypoints are the top-left and bottom-right corners retrieved via the corner pooling technique introduced by the same authors [21]. The CenterNet method has introduced the notion of a centre keypoint to help associate the corner keypoints with an object in the image. The CenterNet method has outperformed common anchor-based solutions such as Faster RCNN and YOLO by a significant margin. In 2020, Perez-Rua et al. [30] introduced OpeN-ended Centre nEt (ONCE), which can detect objects from classes with a small number of examples inside its training dataset. More recent approaches have started to investigate the possibilities of transformers, as covered in the DEtection TRansformer (DETR) method [5], with the advantage of being simple yet on par with the rest of the detection techniques used in the field. Later, Zhu et al. proposed Deformable DETR as an improvement to address the performance problem in detecting small objects and achieve state-of-the-art performance.
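To make the effect of the focal loss mentioned above more concrete, the sketch below compares it with the standard cross-entropy on an easy and a hard example. The function names are illustrative, and the defaults `gamma = 2.0` and `alpha = 0.25` are taken here as assumptions about typical settings rather than values stated in this paper.

```python
import math


def cross_entropy(p: float, y: int) -> float:
    """Binary cross-entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Cross-entropy scaled by alpha_t * (1 - p_t) ** gamma.

    The modulating factor shrinks the loss of well-classified examples,
    so misclassified ones dominate the gradient.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)   # confident, correct prediction
hard = focal_loss(0.10, 1)   # misclassified foreground example
# The hard/easy ratio is far larger under focal loss than under cross-entropy.
print(hard / easy > cross_entropy(0.10, 1) / cross_entropy(0.95, 1))
```

Setting `gamma = 0` and `alpha = 1` recovers the plain cross-entropy for positive examples, which shows that the focal loss is a re-weighting of the standard loss rather than a different objective.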
Fig. 1: Overview of the proposed novel framework trained end-to-end. For the SR and detector models any state-of-the-art solutions can be used without affecting the overall pipeline and the proposed modular architecture.
Fig. 2: Training setup of the DAT SR deep network.
Fig. 3: YOLOX architecture relying on a decoupled head.
In parallel, approaches have been developed to enhance the detection of small objects, which is particularly challenging as they have fewer visible details. Super-resolution solutions relying on Generative Adversarial Networks (GAN) [16] have proved particularly successful [2] [22]. Indeed, their competitive process involving two neural networks, i.e., a generator network and a discriminator network, ensures that the generated images are as realistic as possible.
Herein, we address the detection and recognition of visually small objects by proposing a novel approach based on a super-resolution network and a second network with a modified YOLO architecture trained end-to-end. This solution aims to provide accurate detection of objects that are very small or very far from the camera sensor of the AR glasses while delivering a fast processing time, enabling its usage in real-time applications.
Fig. 4: An example of predictions and Confusion Matrix (the white colour of the image was levelled up to see the numbers).
III. PROPOSED FRAMEWORK
In this paper, we propose an end-to-end framework for scene understanding that combines super-resolution, object detection, and classification architectures. Figure 1 shows an overview of the proposed methodology, where the two main components take as input an image (or a video) and are trained in an end-to-end manner. Details of these two processing blocks are described in the following sections:
A. Super-Resolution Method
As reviewed in Section 2, the usage of super-resolution in a pre-processing stage has been included in many computer vision pipelines. Typically, the super-resolution model is trained in an unsupervised manner using several independent datasets, while the target classification or detection model is trained only on a single task-related dataset. By doing so, some extra information that is not available in the target labelled dataset is injected into the SR images [20].
Fig. 5: Examples of the generated synthetic data. The bottom row represents examples from the Ground category. From the left, the first column shows the Camera sub-category, the second column shows the Light sub-category, and the third column displays the Weather sub-category. Panels: (a) Ground - Camera Subcategory - Golf (97%); (b) Ground - Light Subcategory - City Bus (96%); (c) Ground - Weather Subcategory - City Bus (95%).
SR models are trained under the assumption that the low-scale images, passed as input, are the results of some low-pass filter, such as a Gaussian blur or a point spread function. The training takes place by first down-sampling high-resolution images with such a kernel and then optimising the model to reconstruct these high-resolution images. Theoretically, the kernel function should match the actual blurring process caused by the camera used in the targeted application. However, since this is usually unknown, 'standard' kernels have been used. Unfortunately, these standard kernels fail to model the specific optics and sensors of the actual cameras that captured the images of interest, leading to degraded performance in real-world scenarios. To address this, methods have been proposed to learn the blur kernel, known as Blind Super Resolution [28]. The most accurate approaches, such as the state-of-the-art network Deep Alternating Network (DAT) [20], have relied on deep learning architectures. DAT was selected for our pipeline because its learning is unsupervised, and it delivers fast computation, making it suitable for mobile devices and low-specification desktop computers. In fact, the authors demonstrated that the average speed is 0.75 seconds per image, which is more than 500 times faster than its competitors KernelGAN [3] and ZSSR [35], and 5 times faster than IKC [17]. These average speeds are considered fast in the domain of SR. Figure 2 shows DAT's training setup, which consists of two main networks called Restorer and Estimator: the Restorer produces the SR image, while the Estimator provides an estimate of a blur kernel given the restored image. The two networks are used in alternation, improving the quality of the SR image and the accuracy of the estimated kernel at each Restorer-Estimator step. The sequence of Restorer-Estimator steps is optimised end-to-end using a stochastic back-propagation algorithm.
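The degradation assumption described above (blur with a kernel, then down-sample) can be sketched as follows. This is an illustrative toy under stated assumptions, not the data pipeline used for DAT: the helper names `gaussian_kernel` and `degrade`, the border replication, and the 5 x 5 kernel with sigma 1.0 are all choices made for the example.

```python
import math
from typing import List

Image = List[List[float]]


def gaussian_kernel(size: int, sigma: float) -> Image:
    """Normalised size x size Gaussian blur kernel (size should be odd)."""
    half = size // 2
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            for dx in range(-half, half + 1)]
           for dy in range(-half, half + 1)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def degrade(hr: Image, kernel: Image, scale: int) -> Image:
    """Blur the high-resolution image with the kernel, then keep every scale-th pixel."""
    h, w, half = len(hr), len(hr[0]), len(kernel) // 2

    def blurred(r: int, c: int) -> float:
        acc = 0.0
        for i, row in enumerate(kernel):
            for j, k in enumerate(row):
                # Clamp coordinates so that border pixels are replicated.
                rr = min(max(r + i - half, 0), h - 1)
                cc = min(max(c + j - half, 0), w - 1)
                acc += k * hr[rr][cc]
        return acc

    return [[blurred(r, c) for c in range(0, w, scale)] for r in range(0, h, scale)]


# Hypothetical 8 x 8 image containing a bright 4 x 4 square.
hr_image = [[1.0 if 2 <= r < 6 and 2 <= c < 6 else 0.0 for c in range(8)]
            for r in range(8)]
lr_image = degrade(hr_image, gaussian_kernel(5, 1.0), scale=2)
print(len(lr_image), len(lr_image[0]))
```

A blind SR method has to recover the high-resolution image without being told which kernel was passed to `degrade`, which is why the Estimator network is needed.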
B. Object Detection Method
For augmented reality applications, object detection models must deliver high accuracy in real time. For the proposed framework, the anchor-free model, YOLOX [11], offers the best compromise. Its simple, powerful, and computationally efficient architecture is built upon one of the most widely used detectors in the industry, YOLOv3 [33], which not only has a limited computational cost but also has received excellent software support. However, an important improvement of YOLOX is that, unlike the previous architectures of the YOLO series, it uses a decoupled head, which improves convergence speed. Figure 3 provides an overview of the architecture of YOLOX. Following a 1 x 1 convolutional layer used to decrease the number of channels, there are two parallel branches with 3 x 3 convolutional layers. Moreover, compared to the baseline YOLOv3, an Intersection over Union (IoU) aware branch is added in the regression branch.
Another enhancement of YOLOX, unlike the past versions of YOLO detectors (except for YOLOv1), is the usage of an anchor-free model. Anchors are candidate bounding boxes with predefined dimensions that the detector selects during the detection process and for which it predicts the delta values of their centres and dimensions. These additional predictions require extra processing during both the training and inference stages, which impacts the overall computational time. On the other hand, when using an anchor-free approach, bounding boxes are predicted directly, which reduces the number of design parameters. As such an approach requires advanced data augmentation to match the performance of anchor-based models, state-of-the-art data augmentation approaches, i.e., Mosaic and MixUp, were exploited [11]. Indeed, they are known to bring stability and reduce overfitting during the training process. Finally, it is important to specify that YOLOX leverages a high-performance CNN front-end, CSPNet [38], which is followed by a feature pyramid network (FPN) [33].
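As an illustration of the MixUp idea referred to above, the following sketch blends two images and keeps the annotations of both. It is a simplified stand-in: the augmentation used in [11] may differ in its details, and the `mixup` signature, the `Box` layout, and the fixed blending weight are assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[float]]
Box = Tuple[float, float, float, float, int]  # x1, y1, x2, y2, class id


def mixup(img_a: Image, boxes_a: List[Box],
          img_b: Image, boxes_b: List[Box],
          lam: float = 0.5) -> Tuple[Image, List[Box]]:
    """Blend two equally sized images pixel-wise and keep the boxes of both."""
    mixed = [[lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
             for row_a, row_b in zip(img_a, img_b)]
    return mixed, boxes_a + boxes_b


# Hypothetical 2 x 2 images, each with a single annotated object.
mixed_img, mixed_boxes = mixup([[1.0, 1.0], [1.0, 1.0]], [(0, 0, 1, 1, 0)],
                               [[0.0, 0.0], [0.0, 0.0]], [(1, 1, 2, 2, 1)],
                               lam=0.25)
print(mixed_img[0][0], len(mixed_boxes))
```

The detector then sees two partially transparent scenes at once, which is one reason this kind of augmentation tends to regularise training.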
C. End-to-end Framework
The methods described in sub-sections A and B were integrated into an end-to-end framework. Thus, the framework comprises two main components, i.e., Super-Resolution and Detector. Equation (1) illustrates the proposed end-to-end architecture, where x is the input low-resolution image, y is the image generated by the super-resolution function S(·), and z is the output of the detection function D(·).
\[
\begin{cases} y = S(x) \\ z = D(y) \end{cases} \;\rightarrow\; z = D(S(x)) \tag{1}
\]
In this framework, an input image is first handled by the SR component, which produces a super-resolved output image. Then, this image is passed to the detector component, which recognises and locates objects. Through this process, the detector learns from images enhanced by the SR component. The input images are super-resolved using kernels. There are many types of kernels, such as the common bicubic kernel or linear kernel. These kernels are well-studied and do not require an AI network to calculate them.
However, in the case of the SR task, real-world images do not contain information about the kernel, making it challenging to successfully restore them. Consequently, an estimator is used to infer the kernel during the training process. This estimated kernel is then passed to the restorer to generate images. As a result, the restored images contain features that are the product of the kernel. These features can be picked up by the detector during the training process, creating a symbiotic relationship between the SR and detector components, leading to improved performance.
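The modular composition z = D(S(x)) described in this sub-section can be sketched with interchangeable callables. The two components below are deliberately trivial stand-ins (nearest-neighbour upscaling and a brightness threshold), not DAT or YOLOX, and all names (`toy_sr`, `toy_detector`, `compose`) are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # x1, y1, x2, y2 (inclusive pixel indices)


def toy_sr(image: Image, scale: int = 2) -> Image:
    """Stand-in for S: nearest-neighbour upscaling (a real SR network would go here)."""
    return [[px for px in row for _ in range(scale)]
            for row in image for _ in range(scale)]


def toy_detector(image: Image, threshold: float = 0.5) -> Optional[Box]:
    """Stand-in for D: bounding box around all pixels brighter than the threshold."""
    hits = [(r, c) for r, row in enumerate(image)
            for c, px in enumerate(row) if px > threshold]
    if not hits:
        return None
    rows, cols = [r for r, _ in hits], [c for _, c in hits]
    return (min(cols), min(rows), max(cols), max(rows))


def compose(sr: Callable[[Image], Image],
            detector: Callable[[Image], Optional[Box]]
            ) -> Callable[[Image], Optional[Box]]:
    """Build z = D(S(x)) from any SR function S and any detector D."""
    return lambda x: detector(sr(x))


low_res = [[0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0]]
pipeline = compose(toy_sr, toy_detector)
print(pipeline(low_res))
```

Because `compose` only relies on the input and output types, either component can be swapped for another model without touching the rest of the pipeline, which mirrors the modularity claimed for the framework.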
To monitor and evaluate the training of the framework, several state-of-the-art loss functions were selected. The detector is trained using Varifocal Loss [40] as the classification loss function and SIoU [12] (Scylla Intersection over Union) as the box regression loss function. Moreover, the training process was facilitated with SimOTA, a simplification of OTA [10] (Optimal Transport Assignment), for dynamic label assignment [11].
The Varifocal Loss is particularly efficient because it considers both classification and localisation scores when ranking candidates using IoU. Similarly, the SIoU loss function addresses direction mismatch between expected and predicted bounding boxes by exploiting angle, distance, shape, and IoU costs.
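Both losses above build on the plain Intersection over Union between two boxes. A minimal sketch of that base quantity is given below; it covers only the IoU term, not the angle, distance, or shape costs of SIoU, and the corner-based box format is an assumption of the example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2 with x1 < x2 and y1 < y2


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clipped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

For the two 2 x 2 boxes in the example, the overlap is a unit square and the union covers 7 units, so the IoU is 1/7.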
Finally, the value of SimOTA is to view the task of bounding box assignment as an optimal transport problem, where the unit transportation cost between an anchor-point and ground truth is expressed as a weighted sum of their classification and regression losses to find the best assignment solution.
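The weighted-sum cost described above can be sketched as follows. This is a strong simplification rather than SimOTA itself: the regression weight of 3.0 and the fixed `k` are assumptions for illustration, and no conflict resolution between ground-truth objects is performed.

```python
from typing import List


def assignment_cost(cls_loss: List[List[float]],
                    reg_loss: List[List[float]],
                    reg_weight: float = 3.0) -> List[List[float]]:
    """cost[g][a] = classification loss + reg_weight * regression loss.

    Rows index ground-truth objects, columns index anchor points.
    """
    return [[c + reg_weight * r for c, r in zip(c_row, r_row)]
            for c_row, r_row in zip(cls_loss, reg_loss)]


def assign_top_k(cost: List[List[float]], k: int = 1) -> List[List[int]]:
    """For each ground-truth row, return the indices of the k cheapest anchor points."""
    return [sorted(range(len(row)), key=row.__getitem__)[:k] for row in cost]


# Hypothetical losses for 2 ground-truth objects and 3 anchor points.
cls_loss = [[0.9, 0.2, 0.6],
            [0.3, 0.8, 0.1]]
reg_loss = [[0.5, 0.1, 0.4],
            [0.2, 0.6, 0.3]]
cost = assignment_cost(cls_loss, reg_loss)
print(assign_top_k(cost, k=1), assign_top_k(cost, k=2))
```

In the example, the second object is assigned to the first anchor point even though the third has the lowest classification loss, because the regression term changes the ranking.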
D. Parameters
The end-to-end framework was fine-tuned by running 10 epochs with a batch size of 3 on both real and synthetic data under three different categories. While the learning rate was set to 0.0001 for the SR component, it was set to 0.0032 with SGD (Stochastic Gradient Descent) optimisation for the detector. Additionally, as mentioned earlier, training was enhanced using Mosaic and MixUp as data augmentation strategies.
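For reproducibility, the settings listed above could be gathered in a single configuration object. The sketch below only restates the values given in the text; the class and field names are hypothetical, and the optimiser used for the SR component is left out because it is not specified.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning settings reported in the text (field names are illustrative)."""
    epochs: int = 10
    batch_size: int = 3
    sr_learning_rate: float = 1e-4          # SR component
    detector_learning_rate: float = 3.2e-3  # detector, trained with SGD
    detector_optimiser: str = "SGD"
    augmentations: Tuple[str, ...] = ("Mosaic", "MixUp")


config = FineTuneConfig()
print(config)
```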
IV. EVALUATION
The proposed method has been applied to object recognition and scene understanding. Its evaluation was performed using the Common Objects in Context (COCO) dataset [26] and a synthetic dataset where different environmental conditions were applied to affect image quality. COCO is widely used to benchmark computer vision models. It consists of 330K images, with more than 200K labelled images, 1.5 million object instances, 80 object categories, 91 stuff categories, and 5 captions per image.
Comparisons with state-of-the-art methods rely on the mean Average Precision (mAP), a standard metric introduced in 2014 to quantify object detection performance based on a user-defined set of criteria [26]. It is defined as the mean value of the average precision of the individual classes:
\[
\mathrm{mAP} = \frac{1}{n} \sum_{k=1}^{n} AP_k \tag{2}
\]
where AP_k is the Average Precision of class k, and n is the number of classes.
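Equation (2) translates directly into code. In the sketch below, the per-class AP values are invented for illustration and are not results from this paper; the function name is likewise an assumption.

```python
from typing import Dict


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """Mean of the per-class Average Precision values, as in Equation (2)."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical per-class AP values (not taken from the experiments).
example_ap = {"car": 0.80, "bus": 0.90, "van": 0.70}
print(round(mean_average_precision(example_ap), 4))
```

Because every class contributes equally, a rare class with a low AP lowers the mAP as much as a frequent one would.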
TABLE I: Model Performance on the COCO Dataset in terms of mAP (%)
RetinaNet    YOLOv3    Faster R-CNN    Proposed
52.61        44.76     40.50           67.09
TABLE II: Model Performance on the Synthetic Dataset according to the Three Image Categories
Category    mAP (%)
Camera      60.52
Light       81.25
Weather     66.98
In this evaluation process, using the COCO dataset, Table I shows the performance in terms of mAP of the proposed framework compared to other approaches presented in the literature review. Our framework outperforms all its competitors. Moreover, the added value of the super-resolution component is clearly established, as it exceeds YOLOv3's mAP by over 22 percentage points (67.09% versus 44.76%). The confusion matrix in Figure 4 further demonstrates the performance of the model. In particular, it predicts objects belonging to the prevailing "car" category with high accuracy. However, it should be highlighted that the "van" category is often mistaken for the "car" category, which is due to the visual similarity between images of these two classes.
Further evaluation has been conducted using a synthetic dataset that we created using a 3D Rendering Engine. This dataset consists of approximately 3000 low-resolution images per category of ground vehicles in different environments and weather conditions. An example of the dataset with predictions can be seen in Figure 5. The images in this dataset belong to three different categories, each allowing us to assess our model on specific properties: a) Camera, b) Light, c) Weather.
The "Camera" category contains images of objects captured from various camera angles and distances. In the "Light" category, images are generated under a variety of lighting parameters, mimicking different parts of the day such as morning, afternoon, evening, and night. The "Weather" category simulates images captured under diverse weather circumstances, including varying rain and wind conditions.
The performance of the proposed framework for these three categories in terms of mAP is shown in Table II. Additionally, the confusion matrices can be observed in Figure 6. The results demonstrate high confidence in predicting the bus and car categories. The sport car and car categories are quite often confused, which is reasonable because both categories are very similar. The van category, on the other hand, has sometimes been mistaken for the sport car category, which indicates a close similarity between the samples in the dataset. In contrast, the bus category samples seem to be distinct enough to avoid confusion with the other categories.
V. CONCLUSION
The work presented in this paper offers an end-to-end solution for object detection and recognition on AR devices. The modular architecture allows for the integration of different SR and detection models within the same pipeline. The paper provides an overview of existing solutions and approaches in both super-resolution and scene analysis methods, specifically focusing on their applications in immersive environments.
The proposed architecture was tested on both real and synthetic datasets in a comparative study, alongside other state-of-the-art approaches. The results obtained demonstrate a significant improvement, particularly for low-resolution or distant objects. Furthermore, the proposed framework was evaluated and analysed under various environmental conditions and with a range of camera sensors.
In addition to the evaluation on real datasets, a new balanced synthetic dataset was generated. This dataset includes annotated data covering multiple objects and environments, allowing for further assessment and experimentation.
VI. ACKNOWLEDGMENT
This work was funded by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10047653] and funded by the European Union [under EC Horizon Europe grant agreement number 101070181 (TALON)].
REFERENCES
[1] Ryan Anderson, Juan Toledo, and Hala ElAarag. Feasibility study on the utilization of microsoft hololens to increase driving conditions awareness. In 2019 SoutheastCon, pages 1–8. IEEE, 2019.
[2] Yancheng Bai, Yongqiang Zhang, Mingli Ding, and Bernard Ghanem. Sod-mtgan: Small object detection via multi-task generative adversarial network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 206–221, 2018.
[3] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution kernel estimation using an internal-gan. Advances in Neural Information Processing Systems, 32, 2019.
[4] Jiale Cao, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. D2det: Towards high quality object detection and instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11485–11494, 2020.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
[6] Mengyu Chu, You Xie, Laura Leal-Taixé, and Nils Thuerey. Temporally coherent gans for video super-resolution (tecogan). arXiv preprint arXiv:1811.09393, 1(2):3, 2018.
[7] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. Advances in neural information processing systems, 29, 2016.
[8] Nikos Dimitropoulos, Theodoros Togias, George Michalos, and Sotiris Makris. Operator support in human–robot collaborative environments using ai enhanced wearable devices. Procedia Cirp, 97:464–469, 2021.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[10] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 303–312, 2021.
[11] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
[12] Zhora Gevorgyan. Siou loss: More powerful learning for bounding box regression. arXiv preprint arXiv:2205.12740, 2022.
[13] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
[15] Daniel Lima Gomes Jr, Anselmo Cardoso de Paiva, Aristófanes Corrêa Silva, Geraldo Braz Jr, João Dallyson Sousa de Almeida, Antônio Sérgio de Araújo, and Marcelo Gattass. Augmented visualization using homomorphic filtering and haar-based natural markers for power systems substations. Computers in Industry, 97:67–75, 2018.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
[17] Jinjin Gu, Hannan Lu, Wangmeng Zuo, and Chao Dong. Blind super-resolution with iterative kernel correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1604–1613, 2019.
[18] Rohit Gupta, Anurag Sharma, and Anupam Kumar. Super-resolution using gans for medical imaging. Procedia Computer Science, 173:28–35, 2020.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015.
[20] Yan Huang, Shang Li, Liang Wang, Tieniu Tan, et al. Unfolding the alternating optimization for blind super resolution. Advances in Neural Information Processing Systems, 33:5632–5643, 2020.
[21] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV), pages 734–750, 2018.
[22] Vladislav Li, George Amponis, Jean-Christophe Nebel, Vasileios Argyriou, Thomas Lagkas, Savvas Ouzounidis, and Panagiotis Sarigiannidis. Super resolution for augmented reality applications. In IEEE INFOCOM 2022-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 1–6. IEEE, 2022.
[23] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-head r-cnn: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.
[24] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 21–37. Springer, 2016.
[28] Tomer Michaeli and Michal Irani. Nonparametric blind super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–952, 2013.
[29] Monirul Islam Pavel, Siok Yee Tan, and Azizi Abdullah. Vision-based autonomous vehicle systems based on deep learning: A systematic literature review. Applied Sciences, 12(14):6831, 2022.
[30] Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy M. Hospedales, and Tao Xiang. Incremental few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[31] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[32] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[33] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
Fig. 6: Confusion Matrices of the Ground category of the generated synthetic data. From the left, the first column shows the Camera sub-category, the second column shows the Light sub-category, and the third column displays the Weather sub-category. Panels: (a) Ground Category - Camera Subcategory - Confusion Matrix; (b) Ground Category - Light Subcategory - Confusion Matrix; (c) Ground Category - Weather Subcategory - Confusion Matrix; (d) Ground Category - Camera Subcategory - Super Confusion Matrix; (e) Ground Category - Light Subcategory - Super Confusion Matrix; (f) Ground Category - Weather Subcategory - Super Confusion Matrix.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
[35] Assaf Shocher, Nadav Cohen, and Michal Irani. "zero-shot" super-resolution using deep internal learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3118–3126, 2018.
[36] Hu Tianyu, Zhang Quanfu, and Dong Huiyuan ShenYongjie. Overview of augmented reality technology. Computer Knowledge and Technology, 34:194–196, 2017.
[37] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, volume 1, pages I–I. IEEE, 2001.
[38] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. Cspnet: A new backbone that can enhance learning capability of cnn. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 390–391, 2020.
[39] Jianghao Xiong, En-Lin Hsiang, Ziqian He, Tao Zhan, and Shin-Tson Wu. Augmented reality and virtual reality displays: emerging technologies and future perspectives. Light: Science & Applications, 10(1):216, 2021.
[40] Haoyang Zhang, Ying Wang, Feras Dayoub, and Niko Sunderhauf. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8514–8523, 2021.
[41] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE transactions on neural networks and learning systems, 30(11):3212–3232, 2019.
[42] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[43] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 2023.