
Enhancing 3D Object Detection in Autonomous Vehicles Based on Synthetic Virtual Environment Analysis

Author: Li, Vladislav; Siniosoglou, Ilias; Karamitsou, Thomai; Lytos, Anastasios; Moscholios, Ioannis; Goudos, Sotirios; Banerjee, Prof. (Dr.) Jyoti Sekhar; Sarigiannidis, Panagiotis; Argyriou, Vasileios
Publisher: Zenodo
DOI: 10.1016/j.imavis.2024.105385
Source: https://zenodo.org/records/17550318/files/Enhancing_3D_Object_Detection_in_Autonomous_Vehicles_Based_on_Synthetic_Virtual_Environment_Analysis.pdf
Vladislav Li (a), Ilias Siniosoglou (b, c), Thomai Karamitsou (d), Anastasios Lytos (d), Ioannis D. Moscholios (e), Sotirios K. Goudos (f), Jyoti S. Banerjee (g), Panagiotis Sarigiannidis (b, c), Vasileios Argyriou (a)
(a) Kingston University, Department of Networks and Digital Media, Kingston upon Thames, United Kingdom. Email: [email protected], vasileios.a[email protected]
(b) University of Western Macedonia, Department of Electrical and Computer Engineering, Kozani, Greece. Email: [email protected], [email protected]
(c) MetaMind Innovations P.C., R&D Department, Kozani, Greece. Email: [email protected], [email protected]
(d) Sidroco Holdings Ltd., Nicosia, Cyprus. Email: tkaramitsou@sidroco.com, alytos@sidroco.com
(e) University of Peloponnese, Department of Informatics and Telecommunications, Tripoli, Greece. Email: [email protected]
(f) Aristotle University of Thessaloniki, Physics Department, Thessaloniki, Greece. Email: [email protected]
(g) Bengal Institute of Technology, Kolkata, India. Email: [email protected]
Abstract
Autonomous Vehicles (AVs) rely on real-time processing of natural images and videos for scene understanding and safety assurance through proactive object detection. Traditional methods have primarily focused on 2D object detection, limiting their spatial understanding. This study introduces a novel approach by leveraging 3D object detection in conjunction with augmented reality (AR) ecosystems for enhanced real-time scene analysis. Our approach pioneers the integration of a synthetic dataset, designed to simulate various environmental, lighting, and spatiotemporal conditions, to train and evaluate an AI model capable of deducing 3D bounding boxes. This dataset, with its diverse weather conditions and varying camera settings, allows us to explore detection performance in highly challenging scenarios. The proposed method also significantly improves processing times while maintaining accuracy, offering competitive results in conditions previously considered difficult for object recognition. The combination of 3D detection within the AR framework and the use of synthetic data to tackle environmental complexity marks a notable contribution to the field of AV scene analysis.
Keywords: Augmented Reality, Object Detection, Scene Analysis, Scene Understanding, Object Recognition, Deep Learning, Feature Extraction.
1. Introduction
In the domain of autonomous driving, scene analysis and comprehension are fundamental to enabling vehicles to perceive and interact with their environment effectively [1] [2]. Autonomous vehicles (AVs) rely on advanced computer vision and machine learning (ML) algorithms to process data from multiple sensors such as cameras, LiDAR, and radar, allowing them to recognize objects, navigate safely, and make real-time decisions. However, while 2D object detection methods have been widely adopted for these tasks, they are inherently limited in their ability to capture the full three-dimensional nature of the environment, which is crucial for accurately understanding object positions and interactions in real-world scenarios.
Preprint submitted to Image and Vision Computing, November 6, 2025
A major challenge faced by current AV systems is the transition from 2D to 3D object detection, as noted in [awole2024recent] and [mao20233d]. Projecting 3D bounding boxes into a three-dimensional environment is a more complex and computationally expensive task, especially when the system must handle diverse environmental conditions such as changes in lighting, weather, and sensor perspectives. Traditional 2D methods fall short when detecting objects in such varied conditions, leading to reduced accuracy and safety risks in AV applications. Furthermore, there is a need to efficiently integrate augmented reality (AR) into this process, which could further improve the system's ability to predict and overlay digital elements onto real-world environments for enhanced situational awareness.
This study aims to address these limitations by developing a novel 3D object detection solution that not only predicts accurate 3D bounding boxes but also improves processing times and performance across challenging conditions. Our approach introduces a multimodal architecture that extrapolates 3D information from 2D images, leveraging a synthetic dataset designed to mimic various real-world conditions such as lighting, weather, and camera viewpoints. The results of this work can be applied to improve AV systems' performance in dynamic environments, providing more robust and reliable object detection and localization.
For autonomous vehicles, scene analysis and comprehension play an important role. This includes a wide range of applications such as detecting other vehicles sharing the road, recognizing traffic signs, and detecting pedestrians and potential hazards. This deeper understanding is instrumental in making autonomous decisions while integrating the augmented environment onto the vehicle's display systems, such as heads-up displays (HUDs). It extends to visual scene analysis, which is the cornerstone of vehicle environment perception and interaction and uses advanced computer vision and machine learning (ML) algorithms to handle the large amounts of data collected from sensors such as cameras, LiDAR, and radar. The recognition and interpretation of the ever-changing surroundings allow the vehicle to make informed choices about navigation, safety, and interactions with other road users.
In order to drive safely and effectively, for example, the car must be able to recognise and identify other vehicles, as well as their position and relative speed. Equally important is the recognition of traffic signs and signals, ensuring compliance with traffic regulations and the seamless flow of traffic. Moreover, the accurate detection of pedestrians, cyclists, and potential obstacles is indispensable for avoiding accidents and ensuring the safety of all road users. In the last decades, advances in computer vision have fostered the design and implementation of object recognition methods, increasing computational performance and lowering process time [3]. These technologies enable the vehicle's onboard computer systems to continuously learn and adapt, improving their ability to recognise and respond to an ever-evolving array of environmental conditions. An important milestone is that in the optimisation phase of such applications, the evaluation of AV image cognition systems can be performed in the virtual and augmented reality domains, utilising the same environment that is also used in virtual applications, like game development engines. As a result, current scene analysis technologies based on object recognition use complex computer vision techniques to detect and track objects in the real world. Examples of such technologies include the You Only Look Once (YOLO) model [4], homomorphic filtering and Haar markers [5] and the Single Shot Detector [6]. The use of Convolutional Neural Networks (CNNs) and Deep Learning (DL) led to faster and more accurate detection processes [7]. However, the AR experience could be improved by projecting 3D objects into the augmented reality space surrounding the user, inferred from the real environment.
The aim of this study is to analyse a novel 3D solution that evaluates the performance of the 3D bounding box prediction in various conditions. This work proposes a novel architecture to efficiently produce 3D bounding boxes, superimposed onto the multivariate spatiotemporal view that technologies like advanced AR and AV cognition systems employ to perceive the three-dimensional environment. The produced system is further evaluated on the new synthetic dataset produced to encapsulate a variety of possible environmental conditions, such as camera view, lighting, weather, and sensor readings. Finally, this study evaluates the proposed architecture with other benchmark methods, providing a comparative dimension. The main contributions of this work are as follows:
• A multimodal architecture for efficient object detection and localisation for real-time scene analysis
• A methodology for predicting 3D bounding boxes on the three-dimensional environment, extrapolated from 2D images
• A Novel Synthetic Image dataset for object detection in AV applications with VR scene augmentation
• A comparative study of the efficacy and efficiency of the developed methodology against state-of-the-art techniques
The rest of this paper is organised as follows: Section 2 provides an overview of related work. Section 3 describes the proposed architecture. Section 4 presents results obtained using a novel synthetic image dataset. Finally, Section 5 concludes this work.
2. Overview of Previous Work
2.1. Region-based Feature Extraction Algorithms
An AR app identifies objects in the real world using ML and computer vision techniques with the goal of overlaying virtual objects in real-time. In recent years, the use of deep CNNs [8] has greatly enhanced the performance and accuracy of object detection and recognition in computer vision. In 2014, Girshick et al. introduced the Regions with CNN features (RCNN) method for object detection [9]. This approach involved first identifying potential object boxes through selective search and then rescaling each box to a fixed-size image for input into a CNN model trained on AlexNet [10] for feature extraction. The object was then detected using a linear SVM classifier. This resulted in a significant improvement in mean Average Precision compared to previous methods, but the method had the significant drawback of slow detection speed.
In 2014, Girshick et al. introduced the "Regions with CNN features" (RCNN) method for the purpose of object detection, as documented in their seminal work [9]. This pioneering approach marked a significant breakthrough in the realm of computer vision, particularly concerning the enhancement of object detection accuracy. The RCNN methodology employed a dual-stage process. Firstly, it commenced with the utilisation of "selective search" to identify prospective object boxes within an image. Selective search effectively partitioned the image into multiple regions or proposals that were posited as likely candidates harboring objects. These regions were thereby considered as candidate boxes for potential object localisation.
Subsequently, the next steps in the RCNN procedure entailed the resizing of the aforementioned candidate bounding boxes to fit into fixed-size images, rendering them ready for analysis. These standardised images were then subjected to CNN-based processing, specifically with a network pre-trained on the AlexNet model [10]. The principal role of this CNN was to perform feature extraction in order to discern and capture highly distinctive features of the object in question. Upon feature extraction, the final step of the RCNN methodology involved the employment of a linear Support Vector Machine (SVM) classifier. The SVM classifier was instrumental in effecting classification of the extracted features, thereby ascertaining the presence or absence of a given object within the candidate box. This classification process was the basis of object identification and localisation.
The outcomes of the RCNN approach bore substantial significance. It led to a marked augmentation in the mean Average Precision metric, a pivotal gauge of the efficacy and precision of object detection algorithms. Effectively, it surpassed antecedent methods in its competence to identify objects within images, marking a substantial progression in the arena of computer vision.
Nevertheless, it is worth acknowledging that the RCNN method suffered from a comparatively lengthy detection time, which can majorly impact the overall performance. Its sequential operations (selective search, CNN-based feature extraction, and SVM classification) made it very computationally intensive and slow to process, which limited its usefulness in situations where real-time object detection was needed.
So, while the RCNN approach made it easier to find objects, it required a lot of computing power and time, which meant that more research had to be done to make it work faster. Regardless, its creation was a major turning point in the history of object detection algorithms. It paved the way for later innovations and sped up progress in areas like robotics, autonomous vehicle systems, and many types of computer vision applications.
In an effort to tackle the persistent challenge of slow detection speed in object recognition and localisation, He et al. presented the Spatial Pyramid Pooling Network (SPPNet) as an innovative solution in their seminal work [11]. This architectural paradigm marked a notable milestone in the evolution of computer vision, offering a profound remedy to a long-standing predicament in the field.
The basis of the SPPNet's success lay in its strategic incorporation of a Spatial Pyramid Pooling (SPP) layer, a pivotal component that revolutionised the object detection process. The distinctive feature of this SPP layer was its ability to generate a fixed-length representation that remained invariant to alterations in image size and scale. This attribute had far-reaching implications, particularly in terms of mitigating overfitting issues that had previously plagued object recognition systems. After this initial feature extraction step, the SPPNet employed a sub-region pooling mechanism. This operation entailed dividing the image into spatial bins, enabling the aggregation of features from each bin to create fixed-length representations that were conducive to detector training.
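The fixed-length property can be illustrated with a minimal sketch. It assumes a single-channel feature map stored as a list of lists and max pooling over 1×1, 2×2 and 4×4 grids; the function name `spatial_pyramid_pool` and the pyramid levels are illustrative choices and are not taken from [11].

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map over an n x n grid for each pyramid level.

    The output length is sum(n * n for n in levels), whatever the input size.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the bin start and ceil for the bin end, so the bins
            # cover the whole map and no bin is ever empty.
            r0, r1 = (i * height) // n, math.ceil((i + 1) * height / n)
            for j in range(n):
                c0, c1 = (j * width) // n, math.ceil((j + 1) * width / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


small = [[r * 5 + c for c in range(5)] for r in range(4)]    # 4 x 5 map
large = [[r * 13 + c for c in range(13)] for r in range(9)]  # 9 x 13 map
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))
```

Both inputs yield a vector of the same length (1 + 4 + 16 entries), which is what allows a fixed-size classifier to follow feature maps of arbitrary size.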
One of the most notable outcomes of this innovative approach was a remarkable acceleration in processing speed, especially during testing. The SPPNet method proved to be a significant leap forward, with testing times ranging from 24 to 102 times faster than the previously established RCNN approach. This acceleration in speed held profound implications for real-time and time-sensitive applications, particularly in contexts like autonomous vehicles, robotics, and augmented reality.
In 2015, Girshick improved the previous two architectures with Fast RCNN [12]. This network trains both a detector and a bounding box regression simultaneously with the same configuration. However, the speed limitation persisted. The same year, Ren et al. introduced the Faster RCNN detector [13], which was the first deep learning detector to almost achieve real-time detection through end-to-end training. This architecture employed the Region Proposal Network (RPN) to speed up the detection process, and several variants have been proposed since then to reduce computational redundancy [14], [15], [16]. In particular, Cao et al. (2020) [17] introduced the D2Det method based on the Faster R-CNN framework, which processes Region of Interest (ROI) features through two stages: high-density local regression and discriminant ROI pooling. The method replaces the Faster RCNN offset regression with a local dense regression block. Girshick furthermore introduced an enhancement to the existing architectural paradigms in the form of the Fast RCNN [12]. This novel network configuration entailed the simultaneous training of both an object detector and a bounding box regression component, all within the same unified architecture. However, it is noteworthy that the issue of computational speed constraints persisted despite this development.
Figure 1: FRRCNN architecture.
The Fast RCNN model builds upon the existing state-of-the-art, enhancing efficiency. It exhibits the capability to concurrently train two fundamental components within the same system, i) an object detector and ii) a bounding box regression module, incorporated under the same framework. This integrated approach was a significant stride towards a more streamlined and coherent training process. Nevertheless, the overarching challenge of computational speed constraints persisted as an obstinate issue in the field.
At the same time, Ren et al. introduced the Faster RCNN detector [13], a groundbreaking endeavor that charted a course toward the realization of real-time object detection through the prism of end-to-end training. The Faster RCNN architecture marked a seminal turning point in the pursuit of swifter detection capabilities. At its core, it introduced the Region Proposal Network (RPN), a component specifically designed to expedite the object detection process. The RPN's mandate involved the generation of region proposals, a facet that greatly enhanced the network's adeptness in efficiently discerning objects within complex scenes.
The introduction of the Faster RCNN model had an indelible impact on the landscape of computer vision. It not only ushered in the possibility of near real-time object detection but also spurred a wave of innovative architectural variants. These variations, with an overarching focus on curtailing computational redundancy [14], [15], [16], explored diverse avenues to further amplify the velocity and efficiency of object detection while preserving precision.
Among these progressive adaptations, the D2Det method, introduced by Cao et al. in 2020 [17], stands out as an exemplar of innovation based on the Faster R-CNN framework. The D2Det method harnesses a sophisticated two-stage process for handling Region of Interest (ROI) features. In the initial phase, high-density local regression is employed to fine-tune the localization of objects, infusing a heightened degree of precision into the detection process. Subsequently, in the second stage, a discriminant ROI pooling mechanism extracts distinctive features from the ROIs. Notably, D2Det departs from the Faster RCNN's offset regression by adopting a local dense regression block, thus augmenting the precision and robustness of the object detection process.
Figure 2: YOLO architecture.
The progress made by researchers and the path from Fast RCNN to Faster RCNN and beyond show that the goal of real-time object recognition is still being worked on and will continue to get better. There is a chance that these advances will change many areas, such as autonomous systems, surveillance, robots, and augmented reality. At the cutting edge of progress in computer vision and deep learning is the never-ending search for faster, more accurate, and more efficient ways to find objects.
The methodologies discussed above fall under the classification of two-stage detectors due to their characteristic two-step process: initially generating regions of interest (ROIs) and subsequently executing detection and recognition. In 2016, Redmon et al. introduced a noteworthy departure from this convention, presenting a one-stage detector known as You Only Look Once (YOLO) [18]. YOLO epitomized a pioneering paradigm shift in the realm of object detection, manifesting as a single network architecture capable of processing the entirety of an image within a solitary step, which resulted in substantially expedited processing times.
2.2. Segmentation-based Feature Extraction
The YOLO methodology operates by segmenting the image into distinct regions and concurrently predicting bounding boxes for each of these regions. This one-step processing paradigm marked a significant departure from the multi-step procedures of its two-stage counterparts.
YOLO was a game-changer in the field, representing a revolutionary approach to object detection. Unlike its two-stage counterparts, YOLO employed a single neural network architecture, capable of processing an entire image in a solitary pass. This unique design offered a significant advantage in terms of processing speed, effectively reducing detection times. The core principle underlying YOLO's functionality involved the division of the image into discrete regions, with the network making concurrent predictions for bounding boxes associated with each region. This approach eliminated the need for sequential processing, offering a substantial boost in efficiency. In the subsequent years, YOLO underwent iterations with the introduction of YOLOv2 and v3, aimed at enhancing prediction accuracy [19], [20], while subsequent versions, v5 through v8, focus on prediction efficiency, accuracy, speed and deployment optimisation.
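The division of the image into regions can be sketched as follows. The sketch assumes the scheme of the original YOLO, in which the image is divided into an S × S grid and the cell containing an object's centre is responsible for predicting its box; the function name `responsible_cell` and the default grid size of 7 are illustrative.

```python
def responsible_cell(box, image_width, image_height, grid_size=7):
    """Return (row, col) of the grid cell containing the centre of `box`.

    `box` is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Clamp so a centre lying exactly on the right or bottom edge stays in range.
    col = min(int(cx / image_width * grid_size), grid_size - 1)
    row = min(int(cy / image_height * grid_size), grid_size - 1)
    return row, col


print(responsible_cell((100, 200, 300, 400), image_width=448, image_height=448))
```

Because every cell makes its predictions in the same forward pass, no separate region-proposal stage is needed.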
To enhance the accuracy of bounding box localization, DIoU loss is employed due to its demonstrated improvement in performance when used with the YOLO algorithm [21]. DIoU represents an advancement of the IoU metric (1), specifically targeting the optimization of bounding box predictions.
$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \mathrm{IoU}(B_p, B_t) = \frac{|B_p \cap B_t|}{|B_p \cup B_t|} \tag{1}$$
and the distance
$$L_{\mathrm{IoU}} = 1 - \mathrm{IoU} \tag{2}$$
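In this context, $B_p$ and $B_t$ represent the predicted and actual bounding boxes, respectively. The term DIoU improves upon IoU by factoring in the square of the diagonal $d_B$ of the smallest bounding box $B$ that encompasses both $B_p$ and $B_t$. Therefore, the equation is as follows:
$$\mathrm{DIoU} = \mathrm{IoU} - \frac{\rho^2(B_p, B_t)}{d_B^2} \tag{3}$$
and the resulting loss function
$$L_{\mathrm{DIoU}} = 1 - \mathrm{DIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(B_p, B_t)}{d_B^2} \tag{4}$$
Here $\rho(B_p, B_t)$ denotes the Euclidean distance between the centres of $B_p$ and $B_t$, as in the standard DIoU formulation. This function, by addressing the issue of non-intersecting bounding boxes in terms of IoU, aids in accelerating the model's convergence.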
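A minimal sketch of the IoU and DIoU losses for axis-aligned boxes is given below. It assumes the common DIoU formulation, in which the penalty term is the squared distance between the box centres divided by the squared diagonal of the smallest enclosing box; the function names and the (x_min, y_min, x_max, y_max) box format are illustrative.

```python
def iou(box_p, box_t):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(box_p[0], box_t[0]), max(box_p[1], box_t[1])
    ix_max, iy_max = min(box_p[2], box_t[2]), min(box_p[3], box_t[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    union = area_p + area_t - inter
    return inter / union if union > 0 else 0.0


def diou_loss(box_p, box_t):
    """1 - IoU + (squared centre distance) / (squared enclosing-box diagonal)."""
    # Squared distance between the two box centres.
    cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    ctx, cty = (box_t[0] + box_t[2]) / 2, (box_t[1] + box_t[3]) / 2
    centre_dist_sq = (cpx - ctx) ** 2 + (cpy - cty) ** 2
    # Squared diagonal of the smallest box enclosing both inputs.
    ex_min, ey_min = min(box_p[0], box_t[0]), min(box_p[1], box_t[1])
    ex_max, ey_max = max(box_p[2], box_t[2]), max(box_p[3], box_t[3])
    diag_sq = (ex_max - ex_min) ** 2 + (ey_max - ey_min) ** 2
    penalty = centre_dist_sq / diag_sq if diag_sq > 0 else 0.0
    return 1.0 - iou(box_p, box_t) + penalty


target = (0, 0, 1, 1)
near, far = (2, 0, 3, 1), (4, 0, 5, 1)  # neither box overlaps the target
print(1 - iou(near, target), 1 - iou(far, target))
print(diou_loss(near, target), diou_loss(far, target))
```

For the two non-overlapping boxes the plain IoU loss is identical, so it cannot tell the optimiser which prediction is closer, whereas the DIoU loss is smaller for the nearer box.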
While YOLO excelled in terms of speed, it encountered challenges related to localization accuracy. This trade-off spurred further research efforts to fine-tune the model. To redress this trade-off and enhance the localization accuracy, Liu et al. introduced the Single Shot MultiBox Detector (SSD) in 2016 [22]. The SSD method differed from earlier one-stage detectors in that it used both multi-reference and multi-resolution detection strategies. This made it possible to find objects at different sizes across different network layers. This architecture can accommodate objects of diverse sizes and magnitudes within the image, mitigating the aforementioned accuracy compromise.
In 2018, Lin et al. presented RetinaNet [23], marking a significant advancement in one-stage object detection. The key innovation within RetinaNet was the introduction of a novel loss function termed "focal loss." This loss function, which differs from the cross-entropy loss, was created to give more attention to instances that kept getting incorrectly labeled during the training process. This heightened attention to challenging examples during training resulted in an enhanced level of prediction accuracy, outstripping the performance of its one-stage counterparts.
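How focal loss shifts attention toward hard examples can be seen in a small sketch. It assumes the usual binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), with γ = 2 and α = 0.25 as commonly used defaults; the function names and the example probabilities are illustrative.

```python
import math


def cross_entropy(prob, label):
    """Binary cross-entropy for one prediction (`prob` = P(positive class))."""
    p_t = prob if label == 1 else 1.0 - prob
    return -math.log(min(max(p_t, 1e-12), 1.0))


def focal_loss(prob, label, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy, hard = 0.95, 0.10  # confident-correct vs. badly-wrong positive example
print(focal_loss(easy, 1) / cross_entropy(easy, 1))  # strongly down-weighted
print(focal_loss(hard, 1) / cross_entropy(hard, 1))  # much less down-weighted
```

The modulating factor (1 - p_t)^γ shrinks the contribution of well-classified examples far more than that of misclassified ones, so the latter dominate the gradient.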
2.3. Anchor-free Inference
In contemporary developments within the domain of object detection, there is a noteworthy shift towards anchor-free methodologies [24]. These novel approaches, in contrast to conventional techniques, emphasize the inference of bounding box corners, rather than reliance on pre-defined bounding boxes. A prominent exemplar of this trend is the CenterNet, an innovative framework introduced by Zhou et al. [25]. Notably, CenterNet has distinguished itself as a state-of-the-art solution for 3D Lidar-based detection and tracking, showcasing its versatility in diverse applications.
CenterNet can be perceived as an evolution of the CornerNet, another anchor-free approach to bounding box detection that represents objects as pairs of keypoints, specifically the top-left and bottom-right corners. These corner keypoints are extracted through a technique known as corner pooling, which was introduced by the same authors [26]. A critical stride in the advancement from CornerNet to CenterNet was the introduction of a central keypoint, a concept that facilitated the association of corner keypoints with objects depicted in images. This novel approach has demonstrated superior performance compared to conventional anchor-based solutions, such as Faster RCNN and YOLO, marking a significant advancement in object detection.
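One way a central keypoint can help associate corner pairs with objects is sketched below. It assumes that a candidate box formed by a top-left and a bottom-right corner is kept only if some detected centre keypoint falls inside the central region of that box; the function names, the point format and the choice n = 3 for the central region are assumptions made for illustration.

```python
def central_region(box, n=3):
    """Central 1/n-by-1/n region of an (x_min, y_min, x_max, y_max) box (n odd)."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    k = (n - 1) / 2  # number of 1/n-wide strips on each side of the centre strip
    return (x_min + k * w / n, y_min + k * h / n,
            x_max - k * w / n, y_max - k * h / n)


def keep_box(top_left, bottom_right, centre_keypoints, n=3):
    """Keep a corner pair only if a centre keypoint lies in its central region."""
    box = (top_left[0], top_left[1], bottom_right[0], bottom_right[1])
    cx_min, cy_min, cx_max, cy_max = central_region(box, n)
    return any(cx_min <= x <= cx_max and cy_min <= y <= cy_max
               for x, y in centre_keypoints)


print(keep_box((0, 0), (90, 60), [(45, 30)]))  # centre keypoint inside region
print(keep_box((0, 0), (90, 60), [(5, 5)]))    # keypoint near a corner only
```

Corner pairs that happen to match geometrically but do not enclose an object centre are discarded, which reduces false positive boxes.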
Continuing the trajectory of innovation, in 2020, Perez-Rua and colleagues introduced the OpeN-ended Centre nEt (ONCE) [27]. ONCE improved CenterNet's abilities by letting it find objects from classes where there were not many examples in its training dataset. This is an impressive achievement that could be useful in situations involving many types of objects.
Additionally, object detection techniques have begun to explore the capabilities of transformers, a path paved by the DEtection TRansformer (DETR) method introduced by Carion et al. [28]. This exploration leverages the advantages of transformer architectures, which have gained prominence in natural language processing, and integrates them into the object detection domain. What sets DETR apart is its simplicity, coupled with performance that rivals other sophisticated detection techniques employed in the field. Subsequently, Zhu et al. [29] proposed the Deformable DETR system, which builds on DETR with the specific objective of detecting small objects. This enhancement aimed to achieve state-of-the-art performance, underscoring the commitment of the scientific community to continuously refine and advance object detection methodologies to meet the evolving demands of real-world applications.
2.4. 3D Object Detection
The field of 3D object detection in autonomous vehicles has witnessed significant advancements, driven by the need for accurate environmental perception to ensure safe navigation. Traditional single-modal detection methods, which typically utilise either LiDAR or camera data, as stated in [30], have limitations. Camera-based systems often lack sufficient depth information, leading to challenges in accurately detecting objects in three-dimensional space, particularly in complex environments. Conversely, LiDAR-based methods, while providing precise spatial data, are hindered by issues such as point cloud sparsity and low resolution, especially in occluded or distant scenarios. Authors in [30] have increasingly focused on multi-modal 3D object detection, which combines the strengths of various sensors to enhance detection performance. By using depth information from LiDAR with the rich texture and color data from cameras, multi-modal approaches can significantly improve the accuracy and reliability of object detection systems. However, the integration of heterogeneous data presents unique challenges, including the need for effective data representation, alignment, and fusion techniques. Researchers have proposed various methodologies to address these challenges, moving beyond traditional fusion strategies to more sophisticated frameworks that leverage the complementary characteristics of different modalities.
On the other hand, single-modality 3D object detectors offer several advantages over multi-modal approaches, particularly in terms of simplicity, cost, and computational efficiency. According to [31], single-modality detectors eliminate the need for complex fusion algorithms that are necessary when combining data from multiple sensors like LiDAR and cameras. This leads to reduced computational overhead and easier implementation, making single-modality systems more efficient and faster, especially for real-time applications like autonomous driving and robotics. Additionally, these detectors are cost-effective, as they only require one type of sensor, significantly reducing hardware and maintenance costs. Single-modality detectors can also focus on leveraging the strengths of a specific sensor, such as the precision of LiDAR in depth estimation or the rich visual detail provided by cameras. While multi-modal systems can offer increased robustness, the added complexity often does not justify the performance gains in environments where a single modality is sufficient for high accuracy.
The development of comprehensive datasets, such as KITTI [32], nuScenes [33], and Waymo [34], has been crucial in facilitating research in the area of 3D object detection, providing benchmarks for evaluating the performance of single- and multi-modal detection algorithms under diverse driving conditions. In this context, the analysis of synthetic virtual environments emerges as a promising avenue for further enhancing 3D object detection. By simulating various driving scenarios and conditions, synthetic environments can generate diverse training data that improves model robustness and adaptability. This approach addresses the limitations of existing detection methods, essentially contributing to the development of safer and more efficient autonomous driving systems. The integration of synthetic virtual environment analysis into the 3D object detection pipeline represents a significant step forward, offering new insights and methodologies that can enhance the overall performance of autonomous vehicles in real-world applications.
The advancement of autonomous vehicles is significantly dependent on the efficacy of their perception systems, particularly in the realm of 3D object detection. Authors in [35] review recent findings on 3D object detection methodologies, with a specific focus on the integration of synthetic virtual environments to enhance detection capabilities. 3D object detection is a critical component that enables vehicles to accurately perceive their surroundings, facilitating informed decision-making and trajectory planning. Recent studies categorize detection methods into three primary types: image-based, point cloud-based, and multi-modal approaches. Multi-modal techniques, which integrate data from various sensors such as LiDAR and cameras, have demonstrated superior robustness against environmental variations, as highlighted by [35]. The use of synthetic virtual environments has emerged as a promising strategy to augment training datasets for these detection algorithms. By simulating diverse scenarios that are often challenging to capture in real-world settings (such as varying weather conditions, lighting scenarios, and complex object interactions), researchers can create comprehensive datasets that significantly improve the robustness of detection systems. For instance, [36] emphasize the development of efficient real-time 3D object detection frameworks that leverage synthetic data for training, leading to enhanced accuracy and reduced costs associated with real-world data collection. However, challenges persist, particularly concerning the domain gap between synthetic and real-world data, which can result in performance degradation when models are deployed outside their training conditions. Future research should prioritize bridging this gap through techniques such as domain adaptation and transfer learning, ensuring that models trained in virtual environments can generalize effectively to real-world scenarios. Overall, the integration of synthetic virtual environments in the development of 3D object detection systems presents a valuable opportunity to enhance the capabilities of autonomous vehicles, paving the way for more robust detection algorithms that are better equipped to handle the complexities of real-world driving conditions. Continued exploration in this area will be essential for advancing the state of autonomous driving technology.
3. Methodology
This section delves into the methodology, presented in this work, for detecting 3D objects utilizing ML algorithms and the CenterNet architecture. The methodology employs offsets to
Table 1: Parameter Tuning
Experiment Phase  Epochs  Batch Size  Data Type     Learning Rate  Optimizer  Synth. Data  Notes
Initial           10      3           4 categories  0.0001         Adam       ×            -
Additional        10      3-6         4 categories  0.0001         Adam       ✓            GPU memory limitations
Extended          100     3-6         8 categories  0.0001         Adam       ✓            Fine-tuning, performance assessment
Table 2: Hardware Specifications
Name              CPU                          RAM   GPU
Desktop Computer  Xeon Gold [email protected]  64GB  NVIDIA Titan Xp 12GB
Laptop Computer   [email protected]            32GB  NVIDIA 1650 Max-Q 4GB
but more common hardware configuration. The specifications of this portable setup are also documented in Table 2, enabling a comprehensive comparison between the two platforms.
4. Evaluation
The experiments aim to evaluate the performance of the proposed object detection methodology under various environmental conditions, leveraging both synthetic and real-world datasets. The main target of these experiments is to assess how well the model adapts to different conditions, including variations in camera angles, lighting, weather, and sensor types. In addition, a comparative study is conducted to benchmark the proposed method against state-of-the-art models like YOLOv3, Faster R-CNN, and RetinaNet. These experiments are essential to understand the robustness and generalization capabilities of the proposed detection model, which is evaluated using synthetic data generated through a 3D rendering engine and real-world data from the KITTI dataset. The KITTI dataset is a well-established benchmark in autonomous driving research, containing high-resolution images and LiDAR data from urban environments, making it highly suitable for evaluating the performance of object detection models. Its rich diversity in classes like cars, pedestrians, and cyclists, across various environments, provides a comprehensive platform for testing the robustness of models in both controlled (synthetic) and real-world settings.
The experimental procedure followed a clear and systematic process. First, the synthetic
dataset was created using a 3D rendering engine, with each category (Camera, Light, Weather,
and Sensor) comprising specific parameters such as camera angles, lighting conditions, weather
settings, and sensor types. The dataset was split into training, validation, and testing subsets,
ensuring that the models were evaluated on both known and unseen data. After training, the models
were evaluated on both synthetic and real-world data, including the KITTI dataset, which
provided a real-world benchmark for performance evaluation. Additionally, experiments were
repeated on a more performance-limited device detailed in Table 2. This step-by-step process
ensured that the model’s performance was rigorously tested under controlled and real-world
conditions, allowing for a fair comparison across different models.
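The split into training, validation, and testing subsets can be sketched as follows. This is only an illustration: the 70/15/15 proportions, the fixed seed, the file names, and the helper name `split_dataset` are assumptions, since the text does not state the ratios that were actually used.

```python
import random


def split_dataset(items, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists.

    The fractions are assumed values, not the ones used in the experiments.
    """
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must lie strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for one category of about 3000 images.
image_ids = [f"camera_{i:04d}.png" for i in range(3000)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three subsets identical across runs, so that every model compared is evaluated on the same unseen images.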
4.1. Metrics
The evaluation of the proposed method relied on the mean Average Precision (mAP), a standard
metric to quantify object detection performance based on a user-defined set of criteria [40].
It is defined as the mean value of the average precision of the individual classes:
$$\mathrm{mAP} = \frac{1}{n} \sum_{k=1}^{n} AP_k \tag{8}$$
where $AP_k$ is the Average Precision of class $k$, and $n$ is the number of classes.
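As a small worked instance of Eq. (8), the sketch below averages per-class AP values. The function name `mean_average_precision` is illustrative; the three AP values are the PM entries reported later in Table 5.

```python
def mean_average_precision(ap_per_class):
    """Return mAP as the arithmetic mean of the per-class AP values (Eq. 8)."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


# Per-class AP values (%) taken from the PM column of Table 5.
ap_pm = {"Car": 87.85, "Pedestrian": 60.85, "Cyclist": 48.69}
map_pm = mean_average_precision(ap_pm)
print(f"mAP = {map_pm:.2f}%")  # prints "mAP = 65.80%"
```

With n = 3 classes this gives (87.85 + 60.85 + 48.69) / 3 ≈ 65.80, which matches the total mAP listed for PM.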
Additionally, confusion matrices were used to analyze the performance of the model across
different object classes, offering a detailed view of detection accuracy. This combination of
metrics provides a comprehensive evaluation of model performance across varied scenarios,
highlighting both successes and challenges.
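A confusion matrix of the kind mentioned here can be assembled as in the following minimal sketch. It assumes that each detection has already been matched to a ground-truth object (for example by an overlap criterion), which the text does not detail; the label lists are toy data, not experimental results.

```python
from collections import Counter


def confusion_matrix(true_labels, pred_labels, classes):
    """Return a matrix whose entry [i][j] counts objects of class i predicted as class j."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(true_labels, pred_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


classes = ["Car", "Pedestrian", "Cyclist"]
# Toy matched pairs (ground truth, prediction); purely illustrative.
y_true = ["Car", "Car", "Pedestrian", "Cyclist", "Cyclist", "Car"]
y_pred = ["Car", "Car", "Cyclist", "Cyclist", "Pedestrian", "Car"]
cm = confusion_matrix(y_true, y_pred, classes)
for name, row in zip(classes, cm):
    print(f"{name:<10} {row}")
```

Off-diagonal entries show which classes are mistaken for one another, which a single mAP figure cannot reveal.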
4.2. Data Specification
The synthetic dataset used in this study was designed to emulate real-world conditions and
contains images divided into four main categories: Camera, Light, Weather, and Sensor. Each
category is further split into two subcategories, Air and Ground, where Air images contain aerial
vehicles, and Ground images contain terrestrial vehicles. Approximately 3000 images per category
were generated to comprehensively assess model performance under different environmental
factors. The Camera category consists of images captured at various distances, elevation
angles, and azimuth angles, while the Light category introduces different lighting conditions,
such as variations in intensity and direction. The Weather category incorporates rainy and
non-rainy conditions, including wind variations, to assess the model’s performance under adverse
conditions. Finally, the Sensor category includes images that simulate night and thermal vision
to further test the model’s robustness. In contrast, the KITTI dataset contains real-world images
from urban environments, which also include LiDAR data and other sensor information. It
provides a more realistic benchmark for evaluating the model’s performance in object detection
tasks, especially since it includes diverse objects such as cars, pedestrians, and cyclists in
real-world driving scenarios. While the synthetic dataset allows for controlled experimentation, the
KITTI dataset serves as an important benchmark for generalization to real-world data.
As mentioned above, there are four categories of sub-datasets: a) Camera, b) Light,
c) Weather, and d) Sensor. The Camera category represents images generated with different
camera angles (point of view) and distances from an object in the City and the Desert scenes.
Specifically, for the Air sub-category, images were generated at 4 equally spaced distances between
70 and 350 meters. For the Ground sub-category, images were generated at 4 equally spaced distances
between 15 and 75 meters. In both sub-categories the images were generated at 4 equally spaced
elevation angles between 5° and 85°, and at 3 equally spaced azimuth angles between 0° and 240°.
The other parameters, such as light, image type, fog and rain, were selected in such a way as to
prevent bias in the evaluation of the camera parameters. An overview of the synthetic
data generation specification is presented in Table 3.
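The Camera sampling scheme can be enumerated as in the sketch below. Two readings are assumed here rather than stated in the text: "equal" steps are taken to mean evenly spaced values with both endpoints included, and the three parameters are combined as a full Cartesian product. The helper names are illustrative.

```python
from itertools import product


def linspace(start, stop, num):
    """Return num evenly spaced values from start to stop, endpoints included."""
    if num < 2:
        raise ValueError("num must be at least 2")
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def camera_grid(distance_range_m):
    """Enumerate (distance, elevation, azimuth) camera poses for one sub-category."""
    distances = linspace(distance_range_m[0], distance_range_m[1], 4)
    elevations = linspace(5.0, 85.0, 4)    # degrees
    azimuths = linspace(0.0, 240.0, 3)     # degrees
    return list(product(distances, elevations, azimuths))


air_poses = camera_grid((70.0, 350.0))     # Air: 70-350 m
ground_poses = camera_grid((15.0, 75.0))   # Ground: 15-75 m
print(len(air_poses), len(ground_poses))
print(sorted({d for d, _, _ in ground_poses}))
```

Under these assumptions each sub-category has 4 x 4 x 3 = 48 distinct camera poses, with Ground distances of 15, 35, 55 and 75 m and azimuths of 0°, 120° and 240°.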
The Light category contains images generated using variable balanced lighting parameters
covering the City and Desert scenes. In more detail, the Air and Ground sub-categories were
generated with the light intensity set between 10% and 100% power at 3 equal steps. The light
elevation angles were set between 5° and 90° at 3 equal steps, and the light azimuth angles
were set between 0° and 180° at 3 equal steps. The other parameters related to camera,
weather, and sensors were selected randomly and uniformly in such a way as to avoid bias in the
evaluation of the model under the set of light parameters.
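One way to read "selected randomly and uniformly" is sketched below: every light setting is paired with independently drawn values for the remaining parameters, so that no light setting is systematically tied to a particular camera pose or weather state. The dictionary keys, the value sets, the two samples per setting, and the reuse of the Ground camera ranges are all assumptions made for illustration.

```python
import random
from itertools import product

# Three equal steps per light parameter, endpoints included (assumed reading).
LIGHT_INTENSITY_PCT = (10.0, 55.0, 100.0)
LIGHT_ELEVATION_DEG = (5.0, 47.5, 90.0)
LIGHT_AZIMUTH_DEG = (0.0, 90.0, 180.0)


def sample_other_parameters(rng):
    """Draw the non-light parameters uniformly at random (hypothetical keys)."""
    return {
        "camera_distance_m": rng.uniform(15.0, 75.0),
        "camera_elevation_deg": rng.uniform(5.0, 85.0),
        "camera_azimuth_deg": rng.uniform(0.0, 240.0),
        "rain": rng.choice((False, True)),
        "image_type": rng.choice(("visible", "night", "thermal")),
    }


def light_configurations(samples_per_setting=2, seed=0):
    """Pair every light setting with independently sampled other parameters."""
    rng = random.Random(seed)
    configs = []
    for intensity, elevation, azimuth in product(
        LIGHT_INTENSITY_PCT, LIGHT_ELEVATION_DEG, LIGHT_AZIMUTH_DEG
    ):
        for _ in range(samples_per_setting):
            config = {
                "light_intensity_pct": intensity,
                "light_elevation_deg": elevation,
                "light_azimuth_deg": azimuth,
            }
            config.update(sample_other_parameters(rng))
            configs.append(config)
    return configs


configs = light_configurations()
print(len(configs))  # 27 light settings x 2 samples = 54
```

Because the nuisance parameters are drawn from the same distribution for every light setting, differences in detection accuracy between settings can be attributed to the light parameters themselves.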
The Weather category contains images generated using different balanced weather parameters
covering the City and Desert scenes. The Air and Ground sub-categories were generated both
with and without enabling rain. Furthermore, the rainy images included variations due to the
wind parameter, which was selected to be 0 or 10 units of power. The other parameters were
selected in such a way as to avoid bias in the evaluation of the models in the weather category.

Table 3: Summary of Data Specifications
Data Category | Scene | Sub-Categories | Parameters | Details
Camera | City & Desert | Air, Ground | Distance, Elevation Angle, Azimuth Angle | Distances: 70-350m (Air), 15-75m (Ground); Elevation Angles: 5°-85°; Azimuth Angles: 0°-240°
Light | City & Desert | Air, Ground | Light Intensity, Elevation Angle, Azimuth Angle | Intensity: 10-100% (3 steps); Elevation Angles: 5°-90°; Azimuth Angles: 0°-180°
Weather | City & Desert | Air, Ground | Rain, Wind | Rain: Enabled/Disabled; Wind: 0 or 10 units
Sensor | City & Desert | Air, Ground | Night Vision, Thermal Vision | Night and Thermal Vision emulations
Real (KITTI) | Varied | Air, Ground | High-resolution Images, Lidar, Calibration | Object Detection, Tracking, 3D Scene Understanding
The night and thermal vision are the main attributes of the Sensor category. The night vision
visualises an approximation of the effect of night vision goggles and the same approach was
considered for the thermal vision. The Sensor category contains images generated using different
balanced sensor image types covering the City and Desert scenes. Also, the Air and Ground
sub-categories contain images emulating both night and thermal vision sensors. The other parameters
again were selected uniformly, preventing bias in the evaluation of the models for the sensor set
of parameters.
Synthetic datasets can effectively mimic the generalisation capabilities of real-world data by
leveraging digital twins and synthetic data generation tools built on game engines. Digital twins
create virtual replicas of real-world environments, capturing detailed spatial, temporal, and
functional characteristics that enable realistic simulation of real-world scenarios. When combined
with the advanced rendering, physics, and animation capabilities of game engines, these tools
can generate highly realistic and diverse datasets that mirror the complexity of actual
environments. Such synthetic data can replicate intricate interactions, simulate varying conditions, and
introduce controlled variations. This approach is particularly useful for creating scalable and
cost-effective datasets while addressing limitations like bias or scarcity in real-world data
collection and ensuring that the synthetic datasets encapsulate the variability and complexity of
real-world data. Consequently, models trained on such datasets can generalise effectively, as
they are exposed to a wide range of representative patterns and scenarios that mirror real-world
applicability.

Figure 7: Real Dataset 3D Bounding Box Sample
4.3. Real Dataset Experiments
The KITTI dataset [32], which is a widely used benchmark dataset for research in computer
vision and autonomous driving [41], was chosen as the Real dataset. Its name stands for ”Karlsruhe
Institute of Technology and Toyota Technological Institute”, and it was created by researchers from
these institutions. This dataset is commonly referenced in academic publications related to tasks
such as object detection, tracking, 3D scene understanding, and more. The specification of the
dataset can be seen in Table 4.
The primary objective behind its inception is to foster the advancement of algorithms and
technologies relevant to autonomous vehicles. The dataset is characterized by a comprehensive
collection of diverse data modalities, encompassing high-resolution camera images, LiDAR point
clouds, and calibration parameters. This dataset is used for a wide variety of tasks, including
object detection, motion tracking, 3D scene analysis, and other such applications. An added
feature of the KITTI dataset is the provision of image annotations for various object types, such as
cars, pedestrians, and cyclists, thereby rendering it an important resource for the validation of AI
models. Furthermore, the dataset encompasses a wide spectrum of real-world driving scenarios,
variable weather conditions, and different times of day, thereby facilitating a comprehensive
assessment of algorithm performance under diverse environmental conditions. It is worth noting
that while the KITTI dataset is widely used in the research community, it does exhibit certain
limitations, notably its relatively modest scale and the absence of data pertaining to certain object
classes, e.g., motorcycles.

Table 4: Features and Specifications of the 3D Object Detection KITTI Dataset
Feature/Specification | Description
Data Type | Images, Lidar data
Tasks | Stereo, Optical Flow, Visual Odometry, 3D Object Detection, Tracking
Number of Images | ~15,000 images for object detection
Image Resolution | 1242 x 375 pixels
Sensors | Inertial Navigation System (GPS/IMU): OXTS RT 3003; Laser scanner: Velodyne HDL-64E; Grayscale cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3M-C); Color cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3C-C); Varifocal lenses, 4-8 mm: Edmund Optics NT59-917
Annotation Types | Bounding boxes, 3D boxes, object type, truncation, occlusion levels
Environments | Urban, residential, road
Classes | Cars, vans, trucks, pedestrians, cyclists
Ground Truth Availability | Yes

Table 5: Comparison of Performance Results on the Real Datasets (%, mAP)
Class | FRRCNN | RETINA | YOLOv3 | PM
Car | 64.67 | 77.09 | 69.01 | 87.85
Pedestrian | 28.42 | 51.78 | 39.17 | 60.85
Cyclist | 32.33 | 51.32 | 43.84 | 48.69
Total mAP | 41.81 | 60.06 | 51.34 | 65.80
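For readers unfamiliar with how the annotation types listed in Table 4 are stored, the sketch below parses one line of a KITTI object label file, which holds fifteen whitespace-separated fields per object. The class name `KittiObject`, the function name, and the sample line are illustrative; this is not the loading code used in the experiments.

```python
from dataclasses import dataclass


@dataclass
class KittiObject:
    type: str            # e.g. "Car", "Pedestrian", "Cyclist"
    truncated: float     # 0.0 (fully visible) to 1.0 (fully truncated)
    occluded: int        # occlusion level, 0 to 3
    alpha: float         # observation angle in radians
    bbox: tuple          # 2D box (left, top, right, bottom) in pixels
    dimensions: tuple    # 3D size (height, width, length) in meters
    location: tuple      # 3D position (x, y, z) in camera coordinates, meters
    rotation_y: float    # rotation around the camera Y axis in radians


def parse_kitti_label_line(line):
    """Parse one object from a KITTI object-detection label line."""
    parts = line.split()
    if len(parts) < 15:
        raise ValueError("expected at least 15 fields per label line")
    values = [float(v) for v in parts[1:15]]
    return KittiObject(
        type=parts[0],
        truncated=values[0],
        occluded=int(values[1]),
        alpha=values[2],
        bbox=tuple(values[3:7]),
        dimensions=tuple(values[7:10]),
        location=tuple(values[10:13]),
        rotation_y=values[13],
    )


# Illustrative label line in the KITTI layout.
sample = ("Pedestrian 0.00 0 -0.20 712.40 143.00 810.73 307.92 "
          "1.89 0.48 1.20 1.84 1.47 8.41 0.01")
obj = parse_kitti_label_line(sample)
print(obj.type, obj.bbox, obj.dimensions, obj.location)
```

The 2D box, the metric dimensions, the camera-frame location, and the yaw angle together describe the 3D bounding box that a detector is expected to recover.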
The performance of the proposed framework on the Real dataset can be observed in Table
5. The table compares four models (FRRCNN, RETINA, YOLOv3, and the proposed PM model)
across three object detection categories: Car, Pedestrian, and Cyclist. The
proposed PM model achieves the highest accuracy in detecting Cars (87.85%) and Pedestrians
(60.85%); for Cyclists it reaches 48.69%, second to RETINA (51.32%).
This suggests that the PM model is well-suited for detecting objects in real-world
conditions, outperforming the other models in two of the three classes. The strong performance in the
Car category is especially notable, where PM significantly outperforms the other models, with a
margin of over 10 percentage points compared to the second-best RETINA model (77.09%). This
indicates that PM is highly capable of recognizing cars, likely due to better feature extraction or
training strategies suited for this object class.
Figure 8: Example of 3D bounding box predictions. (Synthetic Data)

Table 6: Performance results on the Synthetic dataset (mAP)
Category | Sub-category | PM | FRRCNN | YOLOv3 | RETINA
Air | Camera | 61.04% | 5.24% | 44.82% | 44.79%
Air | Light | 39.95% | 20.66% | 63.58% | 61.25%
Air | Weather | 88.71% | 5.35% | 39.00% | 45.57%
Air | Sensor | 51.90% | 4.97% | 4.27% | 7.95%
Ground | Camera | 74.66% | 17.95% | 76.02% | 88.81%
Ground | Light | 33.82% | 32.54% | 38.52% | 87.12%
Ground | Weather | 58.75% | 17.07% | 66.32% | 86.09%
Ground | Sensor | 55.72% | 4.16% | 7.64% | 15.59%
In the Pedestrian class, PM again outperforms the other models, although the performance
gap is narrower compared to the Car category. RETINA, which shows a strong second-place
performance (51.78%), trails PM by approximately 9 percentage points. Pedestrian detection
typically involves more variability in size and occlusion, making it a challenging category. PM’s
performance indicates its robustness in handling this complexity, although there may still be room
for improvement to reach higher accuracy.
For Cyclists, the PM model reaches 48.69%, which places it second in this class: RETINA
scores slightly higher (51.32%), while YOLOv3 (43.84%) and FRRCNN (32.33%) show weaker
performance. This suggests that while PM is the most effective model overall, detecting cyclists
remains more challenging due to the variability in appearance and size, indicating that further
fine-tuning or data augmentation might be required to improve accuracy in this category.
When looking at the total mean Average Precision (mAP) across all categories, the PM
model achieves the highest overall score (65.80%), outperforming RETINA (60.06%), YOLOv3
(51.34%), and FRRCNN (41.81%). The mAP metric highlights the superior generalization and
robustness of the PM model across all object detection tasks. The improvement in mAP of nearly
6 percentage points over RETINA further underscores the advantage of PM in handling real-world datasets.
In summary, the PM model stands out in terms of performance, particularly in detecting
cars, where it exhibits substantial accuracy gains. Its lead in the Car and Pedestrian classes,
coupled with the highest total mAP, suggests that the PM model
is more adaptable and effective in diverse object detection tasks. Nevertheless, some categories,
such as cyclists, where RETINA remains slightly ahead, are more challenging, and additional efforts
to further enhance detection accuracy, particularly through data diversity or model fine-tuning,
could help close performance gaps. These results suggest the PM model is highly promising for
real-world applications but may benefit from further optimization for more challenging object classes.
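The per-class margins discussed in this subsection can be recomputed directly from the Table 5 values, as in the short sketch below. The helper name `pm_margin` is illustrative; margins are expressed in percentage points relative to the strongest competing model in each class.

```python
# Values (%) copied from Table 5.
TABLE_5 = {
    "Car":        {"FRRCNN": 64.67, "RETINA": 77.09, "YOLOv3": 69.01, "PM": 87.85},
    "Pedestrian": {"FRRCNN": 28.42, "RETINA": 51.78, "YOLOv3": 39.17, "PM": 60.85},
    "Cyclist":    {"FRRCNN": 32.33, "RETINA": 51.32, "YOLOv3": 43.84, "PM": 48.69},
}


def pm_margin(scores):
    """Return (strongest competitor, PM score minus that competitor's score)."""
    others = {model: s for model, s in scores.items() if model != "PM"}
    rival = max(others, key=others.get)
    return rival, round(scores["PM"] - others[rival], 2)


margins = {cls: pm_margin(scores) for cls, scores in TABLE_5.items()}
for cls, (rival, margin) in margins.items():
    print(f"{cls}: PM {margin:+.2f} points vs {rival}")
```

With the tabulated values this gives +10.76 points for Car, +9.07 for Pedestrian, and -2.63 for Cyclist, in each case measured against RETINA.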
4.4. Synthetic Dataset Experiments
The proposed CenterNet model demonstrated strong and competitive performance across
various categories, as evidenced by the results on the Synthetic dataset (Table 6). The model’s
confusion matrices can be observed in Figures 3 and 6, illustrating the distinctions between
performance on Air and Ground datasets. Several notable phenomena were observed during the
experiments. The model’s detection accuracy for ground vehicles was significantly higher than
for aerial vehicles. This discrepancy is likely due to the more consistent size, shape, and
proximity of ground vehicles, whereas aerial vehicles exhibit greater variability. In the Camera and
Light subcategories, the model performed particularly well, benefiting from the predictable
nature of lighting and camera angles. In contrast, the Weather and Sensor subcategories posed
considerable challenges. Rain, wind, and other environmental factors in the Weather subcategory
reduced detection accuracy, while the Sensor subcategory, which included night vision and
thermal vision images, proved to be the most difficult for the model. The complexity of these
specialized data types likely requires more focused training and fine-tuning.
The objective of this experiment was to evaluate the performance of the proposed CenterNet
model for object detection in synthetic environments. The goal was to analyze the model’s
ability to detect objects across different categories, particularly “Air” and “Ground,” as well
as in subcategories like “Camera,” “Light,” “Weather,” and “Sensor.” The evaluation aimed to
assess the effects of diverse environmental conditions on the model’s performance and determine
where improvements could be made, especially in challenging scenarios like night vision and
thermal imaging. The model excelled in several subcategories, particularly in the Camera and
Light subcategories, where the predictability of features such as lighting conditions led to strong
detection results. As seen in Table 6, the Light subcategory was among the easiest to detect,
given its uniformity in visual parameters like object angles and lighting. On the other hand, more
complex conditions, such as those found in the Weather and Sensor subcategories, introduced
challenges that affected performance. The Weather subcategory achieved mid-level results due to
factors like rain and wind, which added complexity to image detection. The Sensor subcategory,
comprising specialized data types like night and thermal vision, proved the most difficult for
the model, reflecting the need for further specialized training and fine-tuning to handle these
complex image types effectively.
The underlying causes of these results are multifaceted. Ground vehicles tend to have more
stable and consistent features, making them easier to detect, while aerial vehicles vary in size and
shape, which complicates detection. In the Light subcategory, uniform lighting conditions made
object features more predictable, contributing to the model’s strong performance. The drop in
performance in the Weather subcategory can be attributed to the increased complexity introduced
by environmental conditions like rain and wind. Similarly, the Sensor subcategory, with its night
and thermal vision data, differs significantly from normal visual data, making it more challenging
for the model to adapt without further specialized training.
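To relate these observations to the figures in Table 6, the sketch below tabulates the best-scoring model for each row and the mean mAP of each model over the eight rows. The helper names are illustrative, and the unweighted mean is a convenience summary rather than a metric reported in the paper.

```python
MODELS = ("PM", "FRRCNN", "YOLOv3", "RETINA")

# mAP values (%) copied from Table 6, in the column order of MODELS.
TABLE_6 = {
    ("Air", "Camera"):     (61.04, 5.24, 44.82, 44.79),
    ("Air", "Light"):      (39.95, 20.66, 63.58, 61.25),
    ("Air", "Weather"):    (88.71, 5.35, 39.00, 45.57),
    ("Air", "Sensor"):     (51.90, 4.97, 4.27, 7.95),
    ("Ground", "Camera"):  (74.66, 17.95, 76.02, 88.81),
    ("Ground", "Light"):   (33.82, 32.54, 38.52, 87.12),
    ("Ground", "Weather"): (58.75, 17.07, 66.32, 86.09),
    ("Ground", "Sensor"):  (55.72, 4.16, 7.64, 15.59),
}


def best_model(row):
    """Return the name of the model with the highest mAP in one table row."""
    return MODELS[max(range(len(MODELS)), key=lambda i: row[i])]


def mean_per_model(table):
    """Return the unweighted mean mAP of each model over all rows."""
    return {
        model: sum(row[i] for row in table.values()) / len(table)
        for i, model in enumerate(MODELS)
    }


winners = {key: best_model(row) for key, row in TABLE_6.items()}
means = mean_per_model(TABLE_6)
for (category, sub), model in winners.items():
    print(f"{category:<6} {sub:<8} best: {model}")
for model, value in means.items():
    print(f"{model:<7} mean mAP: {value:.2f}%")
```

On the tabulated values, PM has the highest mAP in the Air Camera, Air Weather, Air Sensor and Ground Sensor rows, YOLOv3 in Air Light, and RETINA in the remaining three Ground rows; averaged over all eight rows, PM is highest at about 58.07%, followed by RETINA at about 54.65%.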
To further improve performance, several recommendations can be made. Fine-tuning the
model on sensor data, particularly night and thermal vision, could enhance its ability to handle
complex image types. Additionally, augmenting the training dataset with real-world samples,
such as those from the KITTI dataset, could reduce the gap between synthetic and real-world
performance. Domain adaptation techniques could also improve the model’s generalisation from
synthetic to real-world conditions, making it more robust for practical applications. Overall, the
PM demonstrated strong performance, particularly in the Camera and Light subcategories, where
it outperformed other models. However, the results in the Weather and Sensor subcategories
suggest that further work is needed to improve the model’s robustness in handling complex
environmental conditions like rain, wind, night vision, and thermal vision. With additional fine-tuning,
expanded training data, and domain adaptation strategies, the model’s generalization capabilities
can be further enhanced, making it more suitable for real-world applications in object
detection. This experiment provides a solid foundation for future research aimed at improving object
detection in challenging conditions.
Additionally, experiments were conducted on both synthetic and real datasets using a more
hardware-constrained device to evaluate the robustness and adaptability of the proposed method
under limited computational resources. The results obtained were identical to those from
experiments conducted on higher-performance hardware, demonstrating the method’s consistency
across different platforms. However, the restricted GPU memory of the constrained device
necessitated adjustments to the number of training epochs, ensuring that the experiments could
be executed without exceeding memory limitations. Despite these adjustments, the inference
performance remained unaffected, indicating that the method is not resource-dependent in this
phase. Nevertheless, the reduced computational capacity led to an increase in the time required
to complete each experiment, highlighting the impact of hardware limitations on the training
process. However, inference speed stayed the same. This underscores the importance of considering
hardware constraints when deploying the method in real-world scenarios.
A more in-depth analysis of the differences between synthetic and real-world datasets reveals
several factors that influence their effectiveness and applicability in various domains. While
real-world datasets are rich in complexity, capturing the inherent noise, variability, and
unpredictability of actual environments, they often suffer from biases, incomplete data, and challenges related
to data collection [42], [43], [44]. In contrast, synthetic datasets offer a controlled environment
where these biases can be mitigated, and data can be generated to specifically target areas that
are underrepresented or difficult to capture in the real world, such as rare events, edge cases,
sensory data, weather conditions, and camera angles. However, synthetic datasets may lack
the full diversity and nuance found in real-world data, especially when the data generation
process cannot perfectly replicate the complex interactions and real-world uncertainties. Despite
this, synthetic datasets can be valuable for training models in scenarios where real-world data is
scarce, expensive, or ethically challenging to obtain.
5. Conclusion
This paper provides a comprehensive overview of existing methodologies and approaches
within the realm of scene analysis, leveraged by autonomous vehicles, with a specific emphasis
on their applicability in immersive environments. The research presented delves into an in-depth
analysis of a 3D object detection model from the vantage point of the augmented reality domain.
The architectural framework comprises a diverse set of components, each meticulously designed
to tackle various intricacies related to the estimation of keypoints, the conversion of keypoints to
2D bounding boxes, and the inference of crucial spatial information. This information
encompasses depth, 3D dimensions measured in meters, as well as orientation, encompassing azimuth,
elevation, and roll angles. The collective contributions of these components culminate in a model
that exhibits proficiency in the projection of 3D bounding boxes onto a 2D image.
To empirically evaluate the efficacy of the proposed architecture, a comprehensive testing
regimen was conducted, utilizing a synthetic dataset in a comparative study. The outcomes of
this evaluation reveal that the proposed model delivers competitive performance while
demonstrating stability, particularly when tasked with the detection of distant objects. The evaluation
and analysis of the proposed model were undertaken under diverse environmental conditions and
with varying camera settings, establishing its versatility and robustness. Furthermore, to augment
the comprehensiveness of the study, a novel and well-balanced synthetic dataset was created and
curated, utilising a virtual environment. This dataset encompasses annotated data spanning a
multitude of objects and environmental scenarios, providing a rich resource for subsequent
validation, experimentation, and refinement.
Acknowledgments
This work was funded by UK Research and Innovation (UKRI) under the UK government’s
Horizon Europe funding guarantee [grant number 10047653] and funded by the European Union
[under EC Horizon Europe grant agreement number 101070181 (TALON)].
References
[1] M. I. Pavel, S. Y. Tan, A. Abdullah, Vision-based autonomous vehicle systems based on deep learning: A systematic literature review, Applied Sciences 12 (14) (2022) 6831.
[2] J. Xiong, E.-L. Hsiang, Z. He, T. Zhan, S.-T. Wu, Augmented reality and virtual reality displays: emerging technologies and future perspectives, Light: Science & Applications 10 (1) (2021) 216.
[3] Z. Zou, K. Chen, Z. Shi, Y. Guo, J. Ye, Object detection in 20 years: A survey, Proceedings of the IEEE (2023).
[4] R. Anderson, J. Toledo, H. ElAarag, Feasibility study on the utilization of microsoft hololens to increase driving conditions awareness, in: 2019 SoutheastCon, IEEE, 2019, pp. 1–8.
[5] D. L. Gomes Jr, A. C. de Paiva, A. C. Silva, G. Braz Jr, J. D. S. de Almeida, A. S. de Araújo, M. Gattass, Augmented visualization using homomorphic filtering and haar-based natural markers for power systems substations, Computers in Industry 97 (2018) 67–75.
[6] N. Dimitropoulos, T. Togias, G. Michalos, S. Makris, Operator support in human–robot collaborative environments using ai enhanced wearable devices, Procedia Cirp 97 (2021) 464–469.
[7] Z.-Q. Zhao, P. Zheng, S.-t. Xu, X. Wu, Object detection with deep learning: A review, IEEE transactions on neural networks and learning systems 30 (11) (2019) 3212–3232.
[8] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[9] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[10] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, Vol. 1, IEEE, 2001, pp. I–I.
[11] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE transactions on pattern analysis and machine intelligence 37 (9) (2015) 1904–1916.
[12] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[13] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015).
[14] J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully convolutional networks, Advances in neural information processing systems 29 (2016).
[15] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, J. Sun, Light-head r-cnn: In defense of two-stage object detector, arXiv preprint arXiv:1711.07264 (2017).
[16] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[17] J. Cao, H. Cholakkal, R. M. Anwer, F. S. Khan, Y. Pang, L. Shao, D2det: Towards high quality object detection and instance segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11485–11494.
[18] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[19] J. Redmon, A. Farhadi, Yolo9000: better, faster, stronger, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
[20] J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[21] Z. Wang, L. Wu, T. Li, P. Shi, A smoke detection model based on improved yolov5, Mathematics 10 (7) (2022). doi:10.3390/math10071190. URL https://www.mdpi.com/2227-7390/10/7/1190
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, Springer, 2016, pp. 21–37.