Enhancing 3D Object Detection in Autonomous Vehicles Based on Synthetic Virtual Environment Analysis

Author: Li, Vladislav; Siniosoglou, Ilias; Karamitsou, Thomai; Lytos, Anastasios; Moscholios, Ioannis; Goudos, Sotirios; Banerjee, Prof. (Dr.) Jyoti Sekhar; Sarigiannidis, Panagiotis; Argyriou, Vasileios

Publisher: Zenodo

DOI: 10.1016/j.imavis.2024.105385

Source: https://zenodo.org/records/17550318/files/Enhancing_3D_Object_Detection_in_Autonomous_Vehicles_Based_on_Synthetic_Virtual_Environment_Analysis.pdf

Enhancing 3D Object Detection in Autonomous Vehicles Based on Synthetic Virtual Environment Analysis
Vladislav Li (a), Ilias Siniosoglou (b,c), Thomai Karamitsou (d), Anastasios Lytos (d), Ioannis D. Moscholios (e), Sotirios K. Goudos (f), Jyoti S. Banerjee (g), Panagiotis Sarigiannidis (b,c), Vasileios Argyriou (a)
(a) Kingston University, Department of Networks and Digital Media, Kingston upon Thames, United Kingdom. Email: [email protected], vasileios.a [email protected]
(b) University of Western Macedonia, Department of Electrical and Computer Engineering, Kozani, Greece. Email: [email protected], [email protected]
(c) MetaMind Innovations P.C., R&D Department, Kozani, Greece. Email: [email protected], [email protected]
(d) Sidroco Holdings Ltd., Nicosia, Cyprus. Email: tkaramitsou@sidroco.com, alytos@sidroco.com
(e) University of Peloponnese, Department of Informatics and Telecommunications, Tripoli, Greece. Email: [email protected]
(f) Aristotle University of Thessaloniki, Physics Department, Thessaloniki, Greece. Email: [email protected]
(g) Bengal Institute of Technology, Kolkata, India. Email: [email protected]
Abstract
Autonomous Vehicles (AVs) rely on real-time processing of natural images and videos for scene understanding and safety assurance through proactive object detection. Traditional methods have primarily focused on 2D object detection, limiting their spatial understanding. This study introduces a novel approach by leveraging 3D object detection in conjunction with augmented reality (AR) ecosystems for enhanced real-time scene analysis. Our approach pioneers the integration of a synthetic dataset, designed to simulate various environmental, lighting, and spatiotemporal conditions, to train and evaluate an AI model capable of deducing 3D bounding boxes. This dataset, with its diverse weather conditions and varying camera settings, allows us to explore detection performance in highly challenging scenarios. The proposed method also significantly improves processing times while maintaining accuracy, offering competitive results in conditions previously considered difficult for object recognition. The combination of 3D detection within the AR framework and the use of synthetic data to tackle environmental complexity marks a notable contribution to the field of AV scene analysis.
Keywords: Augmented Reality, Object Detection, Scene Analysis, Scene Understanding, Object Recognition, Deep Learning, Feature Extraction.
Preprint submitted to Image and Vision Computing, November 6, 2025

1. Introduction
In the domain of autonomous driving, scene analysis and comprehension are fundamental to enabling vehicles to perceive and interact with their environment effectively [1] [2]. Autonomous vehicles (AVs) rely on advanced computer vision and machine learning (ML) algorithms to process data from multiple sensors such as cameras, LiDAR, and radar, allowing them to recognize objects, navigate safely, and make real-time decisions. However, while 2D object detection methods have been widely adopted for these tasks, they are inherently limited in their ability to capture the full three-dimensional nature of the environment, which is crucial for accurately understanding object positions and interactions in real-world scenarios.
A major challenge faced by current AV systems is the transition from 2D to 3D object detection, as mentioned in [fawole2024recent] and [mao20233d]. Projecting 3D bounding boxes into a three-dimensional environment is a more complex and computationally expensive task, especially when the system must handle diverse environmental conditions such as changes in lighting, weather, and sensor perspectives. Traditional 2D methods fall short when detecting objects in such varied conditions, leading to reduced accuracy and safety risks in AV applications. Furthermore, there is a need to efficiently integrate augmented reality (AR) into this process, which could further improve the system’s ability to predict and overlay digital elements onto real-world environments for enhanced situational awareness.
This study aims to address these limitations by developing a novel 3D object detection solution that not only predicts accurate 3D bounding boxes but also improves processing times and performance across challenging conditions. Our approach introduces a multimodal architecture that extrapolates 3D information from 2D images, leveraging a synthetic dataset designed to mimic various real-world conditions such as lighting, weather, and camera viewpoints. The results of this work can be applied to improve AV systems’ performance in dynamic environments, providing more robust and reliable object detection and localization.
For autonomous vehicles, scene analysis and comprehension play an important role. This includes a wide range of applications such as detecting other vehicles sharing the road, recognizing traffic signs, and detecting pedestrians and potential hazards. This deeper understanding is instrumental in making autonomous decisions while integrating the augmented environment onto the vehicle’s display systems, such as heads-up displays (HUDs). This extends to visual scene analysis, which is the cornerstone of vehicle environment perception and interaction, using advanced computer vision and machine learning (ML) algorithms for controlling large amounts of data collected from sensors such as cameras, LiDAR, and radar. The recognition and interpretation of the ever-changing surroundings allow the vehicle to make informed choices about navigation, safety, and interactions with other road users.
In order to drive safely and effectively, for example, the car must be able to recognise and identify other vehicles, as well as their position and relative speed. Equally important is the recognition of traffic signs and signals, ensuring compliance with traffic regulations and the seamless flow of traffic. Moreover, the accurate detection of pedestrians, cyclists, and potential obstacles is indispensable for avoiding accidents and ensuring the safety of all road users. In the last decades, advances in computer vision have fostered the design and implementation of object recognition methods, increasing computational performance and lowering process time [3]. These technologies enable the vehicle’s onboard computer systems to continuously learn and adapt, improving their ability to recognise and respond to an ever-evolving array of environmental conditions. An important milestone is that in the optimisation phase of such applications, the evaluation of AV image cognition systems can be performed in the virtual and augmented reality domains, utilising the same environment that is also used in virtual applications, like game development engines. As a result, current scene analysis technologies based on object recognition use complex computer vision techniques to detect and track objects in the real world. Examples of such technologies include the You Only Look Once (YOLO) model [4], homomorphic filtering and Haar markers [5] and the Single Shot Detector [6]. The use of Convolutional Neural Networks (CNNs) and Deep Learning (DL) led to faster and more accurate detection processes [7]. However, the AR experience could be improved by projecting 3D objects into the augmented reality space surrounding the user, inferred from the real environment.
The aim of this study is to analyse a novel 3D solution that evaluates the performance of the 3D bounding box prediction in various conditions. This work proposes a novel architecture to efficiently produce 3D bounding boxes, superimposed onto the multivariate spatiotemporal view that technologies like advanced AR and AV cognition systems employ to perceive the three-dimensional environment. The produced system is further evaluated on the new synthetic dataset produced to encapsulate a variety of possible environmental conditions, such as camera view, lighting, weather, and sensor readings. Finally, this study evaluates the proposed architecture against other benchmark methods, providing a comparative dimension. The main contributions of this work are as follows:
• A multimodal architecture for efficient object detection and localisation for real-time scene analysis
• A methodology for predicting 3D bounding boxes in the three-dimensional environment, extrapolated from 2D images
• A novel synthetic image dataset for object detection in AV applications with VR scene augmentation
• A comparative study of the efficacy and efficiency of the developed methodology against state-of-the-art techniques
The rest of this paper is organised as follows: Section 2 provides an overview of related work. Section 3 describes the proposed architecture. Section 4 presents results obtained using a novel synthetic image dataset. Finally, Section 5 concludes this work.
2. Overview of Previous Work
2.1. Region-based Feature Extraction Algorithms
An AR app identifies objects in the real world using ML and computer vision techniques with the goal of overlaying virtual objects in real-time. In recent years, the use of deep CNNs [8] has greatly enhanced the performance and accuracy of object detection and recognition in computer vision. In 2014, Girshick et al. introduced the Regions with CNN features (RCNN) method for object detection [9]. This approach involved first identifying potential object boxes through selective search and then rescaling each box to a fixed-size image for input into a CNN model based on AlexNet [10] for feature extraction. The object was then detected using a linear SVM classifier, resulting in a significant improvement in mean Average Precision compared to previous methods. The method, however, had a significant drawback: slow detection speed.
In more detail, the “Regions with CNN features” (RCNN) method of Girshick et al. [9] marked a significant breakthrough in computer vision, particularly in terms of object detection accuracy. The RCNN methodology employed a staged process. First, it used “selective search” to identify prospective object boxes within an image. Selective search partitioned the image into multiple region proposals that were considered likely to contain objects. These regions were then treated as candidate boxes for potential object localisation.
Subsequently, the next steps in the RCNN procedure entailed the resizing of the aforementioned candidate bounding boxes to fit into fixed-size images, rendering them ready for analysis. These standardised images were then subjected to CNN-based processing, specifically pre-trained on the AlexNet model [10]. The principal role of this CNN was to perform feature extraction in order to discern and capture highly distinctive features of the object in question. Upon feature extraction, the final step of the RCNN methodology involved the employment of a linear Support Vector Machine (SVM) classifier. The SVM classifier was instrumental in effecting classification of the extracted features, thereby ascertaining the presence or absence of a given object within the candidate box. This classification process was the basis of object identification and localisation.
The outcomes of the RCNN approach bore substantial significance. It led to a marked augmentation in the mean Average Precision metric, a pivotal gauge of the efficacy and precision of object detection algorithms. Effectively, it surpassed antecedent methods in its competence to identify objects within images, marking a substantial progression in the arena of computer vision.
Nevertheless, it is worth acknowledging that the RCNN method suffered from a comparatively lengthy detection time frame, which can majorly impact the overall performance. Its sequential operations, such as selective search, CNN-based feature extraction, and SVM classification, made it very computationally intensive and took a long time to process, which limited its usefulness in situations where real-time object detection was needed.
So, while the RCNN approach made it easier to find objects, it required a lot of computing power and time, which meant that more research had to be done to make it work faster. Regardless, its creation was a major turning point in the history of object detection algorithms. It paved the way for later innovations and sped up progress in areas like robotics, autonomous vehicle systems, and many types of computer vision applications.
In an effort to tackle the persistent challenge of slow detection speed in object recognition and localisation, He et al. presented the Spatial Pyramid Pooling Network (SPPNet) as an innovative solution in their seminal work [11]. This architectural paradigm marked a notable milestone in the evolution of computer vision, offering a profound remedy to a long-standing predicament in the field.
The basis of the SPPNet’s success lay in its strategic incorporation of a Spatial Pyramid Pooling (SPP) layer, a pivotal component that revolutionised the object detection process. The distinctive feature of this SPP layer was its ability to generate a fixed-length representation that remained invariant to alterations in image size and scale. This attribute had far-reaching implications, particularly in terms of mitigating overfitting issues that had previously plagued object recognition systems. After this initial feature extraction step, the SPPNet employed a sub-region pooling mechanism. This operation entailed dividing the image into spatial bins, enabling the aggregation of features from each bin to create fixed-length representations that were conducive to detector training.
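The fixed-length property described above can be illustrated with a minimal sketch. It assumes a single-channel feature map stored as a list of rows, max pooling inside each bin, and pyramid levels of 1, 2 and 4 bins per side; the function names and the level choice are illustrative assumptions, not details taken from [11].

```python
def _ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def max_pool_bins(feature_map, bins):
    """Max-pool a 2D feature map (list of equal-length rows) into bins x bins values."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(bins):
        # Floor/ceil boundaries keep every bin non-empty and cover the whole map.
        row_start, row_end = (i * height) // bins, _ceil_div((i + 1) * height, bins)
        for j in range(bins):
            col_start, col_end = (j * width) // bins, _ceil_div((j + 1) * width, bins)
            pooled.append(max(
                feature_map[r][c]
                for r in range(row_start, row_end)
                for c in range(col_start, col_end)
            ))
    return pooled


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Concatenate the pooled bins of every pyramid level into one vector."""
    vector = []
    for bins in levels:
        vector.extend(max_pool_bins(feature_map, bins))
    return vector


# Two hypothetical feature maps of different sizes.
small = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4
large = [[r * 9 + c for c in range(9)] for r in range(6)]  # 6 x 9
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))
```

Both inputs yield a vector of 1 + 4 + 16 = 21 values, which is why a classifier placed after such a layer does not need the input to be resized to a fixed size first.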
One of the most notable outcomes of this innovative approach was a remarkable acceleration in processing speed, especially during testing. The SPPNet method proved to be a significant leap forward, with testing times ranging from 24 to 102 times faster than the previously established RCNN approach. This acceleration in speed held profound implications for real-time and time-sensitive applications, particularly in contexts like autonomous vehicles, robotics, and augmented reality.
In 2015, Girshick improved the previous two architectures with Fast RCNN [12]. This network trains both a detector and a bounding box regression simultaneously with the same configuration. However, the speed limitation persisted. The same year, Ren et al. introduced the Faster RCNN detector [13], which was the first deep learning detector to almost achieve real-time detection through end-to-end training. This architecture employed the Region Proposal Network (RPN) to speed up the detection process, and several variants have been proposed since then to reduce computational redundancy [14], [15], [16]. In particular, Cao et al. (2020) [17] introduced the D2Det method based on the Faster R-CNN framework, which processes Region of Interest (ROI) features through two stages: high-density local regression and discriminant ROI pooling. The method replaces the Faster RCNN offset regression with a local dense regression block. Girshick furthermore introduced an enhancement to the existing architectural paradigms in the form of the Fast RCNN [12]. This novel network configuration entailed the simultaneous training of both an object detector and a bounding box regression component, all within the same unified architecture. However, it is noteworthy that the issue of computational speed constraints persisted despite this development.

Figure 1: FRRCNN architecture.
The Fast RCNN model builds upon the existing state-of-the-art, enhancing efficiency. It exhibits the capability to concurrently train two fundamental components within the same system, i) an object detector and ii) a bounding box regression module, incorporated under the same framework. This integrated approach was a significant stride towards a more streamlined and coherent training process. Nevertheless, the overarching challenge of computational speed constraints persisted as an obstinate issue in the field.
At the same time, Ren et al. introduced the Faster RCNN detector [13], a groundbreaking endeavor that charted a course toward the realization of real-time object detection through the prism of end-to-end training. The Faster RCNN architecture marked a seminal turning point in the pursuit of swifter detection capabilities. At its core, it introduced the Region Proposal Network (RPN), a component specifically designed to expedite the object detection process. The RPN’s mandate involved the generation of region proposals, a facet that greatly enhanced the network’s adeptness in efficiently discerning objects within complex scenes.
The introduction of the Faster RCNN model had an indelible impact on the landscape of computer vision. It not only ushered in the possibility of near real-time object detection but also spurred a wave of innovative architectural variants. These variations, with an overarching focus on curtailing computational redundancy [14], [15], [16], explored diverse avenues to further amplify the velocity and efficiency of object detection while preserving precision.
Among these progressive adaptations, the D2Det method, introduced by Cao et al. in 2020 [17], stands out as an exemplar of innovation based on the Faster R-CNN framework. The D2Det method harnesses a sophisticated two-stage process for handling Region of Interest (ROI) features. In the initial phase, high-density local regression is employed to fine-tune the localization of objects, infusing a heightened degree of precision into the detection process. Subsequently, in the second stage, a discriminant ROI pooling mechanism extracts distinctive features from the ROIs. Notably, D2Det departs from the Faster RCNN’s offset regression by adopting a local dense regression block, thus augmenting the precision and robustness of the object detection process.

Figure 2: YOLO architecture.
The progress made by researchers and the path from Fast RCNN to Faster RCNN and beyond show that the goal of real-time object recognition is still being worked on and will continue to get better. There is a chance that these advances will change many areas, such as autonomous systems, surveillance, robots, and augmented reality. At the cutting edge of progress in computer vision and deep learning is the never-ending search for faster, more accurate, and more efficient ways to find objects.
The methodologies discussed above fall under the classification of two-stage detectors due to their characteristic two-step process: initially generating regions of interest (ROIs) and subsequently executing detection and recognition. In 2016, Redmon et al. introduced a noteworthy departure from this convention, presenting a one-stage detector known as You Only Look Once (YOLO) [18]. YOLO epitomized a pioneering paradigm shift in the realm of object detection, manifesting as a single network architecture capable of processing the entirety of an image within a solitary step, which resulted in substantially expedited processing times.
2.2. Segmentation-based Feature Extraction
The YOLO methodology operates by segmenting the image into distinct regions and concurrently predicting bounding boxes for each of these regions. This one-step processing paradigm marked a significant departure from the multi-step procedures of its two-stage counterparts.
YOLO was a game-changer in the field, representing a revolutionary approach to object detection. Unlike its two-stage counterparts, YOLO employed a single neural network architecture, capable of processing an entire image in a solitary pass. This unique design offered a significant advantage in terms of processing speed, effectively reducing detection times. The core principle underlying YOLO’s functionality involved the division of the image into discrete regions, with the network making concurrent predictions of bounding boxes associated with each region. This approach eliminated the need for sequential processing, offering a substantial boost in efficiency. In the subsequent years, YOLO underwent iterations with the introduction of YOLOv2 and v3, aimed at enhancing prediction accuracy [19], [20], while subsequent versions, 5 through 8, focus on prediction efficiency, accuracy, speed and deployment optimisation.
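The division of the image into regions can be pictured with a small sketch. It assumes a 7 x 7 grid (the grid size used in the original YOLO paper, not a value stated in this text), boxes given as (x_min, y_min, x_max, y_max) in pixels, and a hypothetical helper name; it only shows which grid cell would be associated with a box, not the network itself.

```python
def responsible_cell(box, image_width, image_height, grid_size=7):
    """Return (row, col, x_offset, y_offset) of the grid cell containing the box centre.

    The offsets are the centre position inside that cell, in units of one cell.
    """
    x_min, y_min, x_max, y_max = box
    centre_x = (x_min + x_max) / 2.0
    centre_y = (y_min + y_max) / 2.0

    # Express the centre in grid units, e.g. 3.5 means "middle of the cell with index 3".
    grid_x = centre_x * grid_size / image_width
    grid_y = centre_y * grid_size / image_height

    # Clamp so that a centre lying exactly on the right/bottom border stays in the grid.
    col = min(int(grid_x), grid_size - 1)
    row = min(int(grid_y), grid_size - 1)
    return row, col, grid_x - col, grid_y - row


# Hypothetical 448 x 448 image with one box near the right edge.
print(responsible_cell((384, 192, 448, 256), 448, 448))
```

Because every cell makes its predictions in the same forward pass, no separate region-proposal step is needed, which is the source of the speed advantage described above.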
To enhance the accuracy of bounding box localization, DIoU loss is employed due to its demonstrated improvement in performance when used with the YOLO algorithm [21]. DIoU represents an advancement of the IoU metric (1), specifically targeting the optimization of bounding box predictions.
\[ \mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \mathrm{IoU}(B_p, B_t) = \frac{B_p \cap B_t}{B_p \cup B_t} \tag{1} \]
and the distance
\[ \mathrm{Loss}_{\mathrm{IoU}} = 1 - \mathrm{IoU} \tag{2} \]
In this context, $B_p$ and $B_t$ represent the predicted and actual bounding boxes, respectively. The term DIoU improves upon IoU by factoring in the square of the diagonal $d_B$ of the smallest bounding box $B_o$ that encompasses both $B_p$ and $B_t$. Therefore, the equation is as follows:
\[ \mathrm{DIoU} = \mathrm{IoU} - \frac{\left(\sqrt{(B_p^2) - (B_t^2)}\right)^2}{d_B^2} \tag{3} \]
and the resulting loss function:
\[ L_{\mathrm{DIoU}} = 1 - \mathrm{DIoU} = 1 - \mathrm{IoU} + \frac{\left(\sqrt{(B_p^2) - (B_t^2)}\right)^2}{d_B^2} \tag{4} \]
This function, by addressing the issue of non-intersecting bounding boxes in terms of IoU, aids in accelerating the model’s convergence.
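A minimal sketch of these quantities follows. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and takes the penalty term to be the squared distance between the two box centres divided by the squared diagonal of the smallest enclosing box, which is the usual DIoU formulation and an assumption about what Eq. (3) intends; the function names are illustrative.

```python
def iou(box_p, box_t):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_p[2], box_t[2]) - max(box_p[0], box_t[0]))
    inter_h = max(0.0, min(box_p[3], box_t[3]) - max(box_p[1], box_t[1]))
    intersection = inter_w * inter_h
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    union = area_p + area_t - intersection
    return intersection / union if union > 0 else 0.0


def diou(box_p, box_t):
    """IoU minus the normalised squared distance between the box centres."""
    # Squared distance between the two centres.
    centre_dx = (box_p[0] + box_p[2]) / 2.0 - (box_t[0] + box_t[2]) / 2.0
    centre_dy = (box_p[1] + box_p[3]) / 2.0 - (box_t[1] + box_t[3]) / 2.0
    centre_dist_sq = centre_dx ** 2 + centre_dy ** 2

    # Squared diagonal of the smallest box enclosing both inputs.
    enclose_w = max(box_p[2], box_t[2]) - min(box_p[0], box_t[0])
    enclose_h = max(box_p[3], box_t[3]) - min(box_p[1], box_t[1])
    diagonal_sq = enclose_w ** 2 + enclose_h ** 2

    penalty = centre_dist_sq / diagonal_sq if diagonal_sq > 0 else 0.0
    return iou(box_p, box_t) - penalty


def diou_loss(box_p, box_t):
    """Loss of Eq. (4): 1 - DIoU."""
    return 1.0 - diou(box_p, box_t)


# Two pairs of non-overlapping boxes: the IoU loss is 1 in both cases,
# but the DIoU loss still grows with the distance between the boxes.
near = diou_loss((0, 0, 1, 1), (2, 0, 3, 1))
far = diou_loss((0, 0, 1, 1), (5, 0, 6, 1))
print(near, far)
```

The last lines show the point made in the text: for non-intersecting boxes the plain IoU loss is flat, whereas the distance term still distinguishes a near miss from a far one and so gives the optimiser a direction to move in.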
While YOLO excelled in terms of speed, it encountered challenges related to localization accuracy. This trade-off spurred further research efforts to fine-tune the model. To redress this trade-off and enhance the localization accuracy, Liu et al. introduced the Single Shot MultiBox Detector (SSD) in 2016 [22]. The SSD method differed from earlier one-stage detectors because it used both multi-reference and multi-resolution detection strategies. This made it possible to find objects at different sizes across different network layers. This architecture can accommodate objects of diverse sizes and magnitudes within the image, mitigating the aforementioned accuracy compromise.
In 2018, Lin et al. presented RetinaNet [23], marking a significant advancement in one-stage object detection. The key innovation within RetinaNet was the introduction of a novel loss function termed “focal loss.” This loss function, which differs from the cross-entropy loss, was created to give more attention to instances that kept getting incorrectly labeled during the training process. This heightened attention to challenging examples during training resulted in an enhanced level of prediction accuracy, outstripping the performance of its one-stage counterparts.
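How the focal loss shifts attention toward hard examples can be seen in a short sketch for the binary case. It assumes the common form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) with gamma = 2 and alpha = 0.25, which are the defaults usually quoted for focal loss rather than values given in this text; the function names are illustrative.

```python
import math


def cross_entropy(probability, label):
    """Binary cross-entropy for one prediction; probability is P(label == 1), in (0, 1)."""
    p_t = probability if label == 1 else 1.0 - probability
    return -math.log(p_t)


def focal_loss(probability, label, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = probability if label == 1 else 1.0 - probability
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # The modulating factor is close to 0 for confident correct predictions (easy examples)
    # and close to 1 for confident wrong predictions (hard examples).
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A positive example predicted confidently right (easy) vs. confidently wrong (hard).
for name, p in (("easy", 0.9), ("hard", 0.1)):
    print(name, cross_entropy(p, 1), focal_loss(p, 1))
```

With these settings the easy example keeps only 0.25 * 0.1^2 of its cross-entropy value, while the hard one keeps 0.25 * 0.9^2, so misclassified instances dominate the total loss.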
2.3. Anchor-free Inference
In contemporary developments within the domain of object detection, there is a noteworthy shift towards anchor-free methodologies [24]. These novel approaches, in contrast to conventional techniques, emphasize the inference of bounding box corners, rather than reliance on pre-defined bounding boxes. A prominent exemplar of this trend is the CenterNet, an innovative framework introduced by Zhou et al. [25]. Notably, CenterNet has distinguished itself as a state-of-the-art solution for 3D Lidar-based detection and tracking, showcasing its versatility in diverse applications.
CenterNet can be perceived as an evolution of the CornerNet, another anchor-free approach to bounding box detection that represents objects as pairs of keypoints, specifically the top-left and bottom-right corners. These corner keypoints are extracted through a technique known as corner pooling, which was introduced by the same authors [26]. A critical stride in the advancement from CornerNet to CenterNet was the introduction of a central keypoint, a concept that facilitated the association of corner keypoints with objects depicted in images. This novel approach has demonstrated superior performance compared to conventional anchor-based solutions, such as Faster RCNN and YOLO, marking a significant advancement in object detection.
Continuing the trajectory of innovation, in 2020, Perez-Rua and colleagues introduced the OpeN-ended Centre nEt (ONCE) [27]. ONCE improved CenterNet’s abilities by letting it find objects from classes where there were not many examples in its training dataset. This is an impressive achievement that could be useful in situations involving many types of objects.
Additionally, object detection techniques have begun to explore the capabilities of transformers, as paved by the DEtection TRansformer (DETR) method introduced by Carion et al. [28]. This exploration leverages the advantages of transformer architectures, which have gained prominence in natural language processing, and integrates them into the object detection domain. What sets DETR apart is its simplicity, coupled with performance that rivals other sophisticated detection techniques employed in the field. Subsequently, Zhu et al. [29] proposed the Deformable DETR system. This system builds on DETR with the specific objective of detecting small objects. This enhancement aimed to achieve state-of-the-art performance, underscoring the commitment of the scientific community to continuously refine and advance object detection methodologies to meet the evolving demands of real-world applications.
2.4. 3D Object Detection
The field of 3D object detection in autonomous vehicles has witnessed significant advancements, driven by the need for accurate environmental perception to ensure safe navigation. Traditional single-modal detection methods, which typically utilise either LiDAR or camera data, as stated in [30], have limitations. Camera-based systems often lack sufficient depth information, leading to challenges in accurately detecting objects in three-dimensional space, particularly in complex environments. Conversely, LiDAR-based methods, while providing precise spatial data, are hindered by issues such as point cloud sparsity and low resolution, especially in occluded or distant scenarios. Authors in [30] have increasingly focused on multi-modal 3D object detection, which combines the strengths of various sensors to enhance detection performance. By using depth information from LiDAR with the rich texture and color data from cameras, multi-modal approaches can significantly improve the accuracy and reliability of object detection systems. However, the integration of heterogeneous data presents unique challenges, including the need for effective data representation, alignment, and fusion techniques. Researchers have proposed various methodologies to address these challenges, moving beyond traditional fusion strategies to more sophisticated frameworks that leverage the complementary characteristics of different modalities.
On the other hand, single-modality 3D object detectors offer several advantages over multi-modal approaches, particularly in terms of simplicity, cost, and computational efficiency. According to [31], single-modality detectors eliminate the need for complex fusion algorithms that are necessary when combining data from multiple sensors like LiDAR and cameras. This leads to reduced computational overhead and easier implementation, making single-modality systems more efficient and faster, especially for real-time applications like autonomous driving and robotics. Additionally, these detectors are cost-effective, as they only require one type of sensor, significantly reducing hardware and maintenance costs. Single-modality detectors can also focus on leveraging the strengths of a specific sensor, such as the precision of LiDAR in depth estimation or the rich visual detail provided by cameras. While multi-modal systems can offer increased robustness, the added complexity often does not justify the performance gains in environments where a single modality is sufficient for high accuracy.
The development of comprehensive datasets, such as KITTI [32], nuScenes [33], and Waymo [34], has been crucial in facilitating research in the area of 3D object detection, providing benchmarks for evaluating the performance of single- and multi-modal detection algorithms under diverse driving conditions. In this context, the analysis of synthetic virtual environments emerges as a promising avenue for further enhancing 3D object detection. By simulating various driving scenarios and conditions, synthetic environments can generate diverse training data that improves model robustness and adaptability. This approach addresses the limitations of existing detection methods, essentially contributing to the development of safer and more efficient autonomous driving systems. The integration of synthetic virtual environment analysis into the 3D object detection pipeline represents a significant step forward, offering new insights and methodologies that can enhance the overall performance of autonomous vehicles in real-world applications.
The advancement of autonomous vehicles is significantly dependent on the efficacy of their perception systems, particularly in the realm of 3D object detection. Authors in [35] review recent findings on 3D object detection methodologies, with a specific focus on the integration of synthetic virtual environments to enhance detection capabilities. 3D object detection is a critical component that enables vehicles to accurately perceive their surroundings, facilitating informed decision-making and trajectory planning. Recent studies categorize detection methods into three primary types: image-based, point cloud-based, and multi-modal approaches. Multi-modal techniques, which integrate data from various sensors such as LiDAR and cameras, have demonstrated superior robustness against environmental variations, as highlighted by [35]. The use of synthetic virtual environments has emerged as a promising strategy to augment training datasets for these detection algorithms. By simulating diverse scenarios that are often challenging to capture in real-world settings (such as varying weather conditions, lighting scenarios, and complex object interactions), researchers can create comprehensive datasets that significantly improve the robustness of detection systems. For instance, [36] emphasize the development of efficient real-time 3D object detection frameworks that leverage synthetic data for training, leading to enhanced accuracy and reduced costs associated with real-world data collection. However, challenges persist, particularly concerning the domain gap between synthetic and real-world data, which can result in performance degradation when models are deployed outside their training conditions. Future research should prioritize bridging this gap through techniques such as domain adaptation and transfer learning, ensuring that models trained in virtual environments can generalize effectively to real-world scenarios. Overall, the integration of synthetic virtual environments in the development of 3D object detection systems presents a valuable opportunity to enhance the capabilities of autonomous vehicles, paving the way for more robust detection algorithms that are better equipped to handle the complexities of real-world driving conditions. Continued exploration in this area will be essential for advancing the state of autonomous driving technology.
3. Methodology
This section delves into the methodology, presented in this work, for detecting 3D objects utilizing ML algorithms and the CenterNet architecture. The methodology employs offsets to
Table 1: Parameter Tuning
| Experiment Phase | Epochs | Batch Size | Data Type | Learning Rate | Optimizer | Synth. Data | Notes |
|---|---|---|---|---|---|---|---|
| Initial | 10 | 3 | 4 categories | 0.0001 | Adam | × | - |
| Additional | 10 | 3-6 | 4 categories | 0.0001 | Adam | ✓ | GPU memory limitations |
| Extended | 100 | 3-6 | 8 categories | 0.0001 | Adam | ✓ | Fine-tuning, performance assessment |
Table 2: Hardware Specifications
| Name | CPU | RAM | GPU |
|---|---|---|---|
| Desktop Computer | Xeon Gold [email protected] | 64GB | NVIDIA Titan Xp 12GB |
| Laptop Computer | [email protected] | 32GB | NVIDIA 1650 Max-Q 4GB |
but more common hardware configuration. The specifications of this portable setup are also documented in Table 2, enabling a comprehensive comparison between the two platforms.
4. Evaluation
The experiments aim to evaluate the performance of the proposed object detection methodology under various environmental conditions, leveraging both synthetic and real-world datasets. The main target of these experiments is to assess how well the model adapts to different conditions, including variations in camera angles, lighting, weather, and sensor types. In addition, a comparative study is conducted to benchmark the proposed method against state-of-the-art models like YOLOv3, Faster R-CNN, and RetinaNet. These experiments are essential to understand the robustness and generalization capabilities of the proposed detection model, which is evaluated using synthetic data generated through a 3D rendering engine and real-world data from the KITTI dataset. The KITTI dataset is a well-established benchmark in autonomous driving research, containing high-resolution images and LiDAR data from urban environments, making it highly suitable for evaluating the performance of object detection models. Its rich diversity in classes like cars, pedestrians, and cyclists, across various environments, provides a comprehensive platform for testing the robustness of models in both controlled (synthetic) and real-world settings.
The experimental procedure followed a clear and systematic process. First, the synthetic dataset was created using a 3D rendering engine, with each category (Camera, Light, Weather, and Sensor) comprising specific parameters such as camera angles, lighting conditions, weather settings, and sensor types. The dataset was split into training, validation, and testing subsets, ensuring that the models were evaluated on both known and unseen data. After training, the models were evaluated on both synthetic and real-world data, including the KITTI dataset, which provided a real-world benchmark for performance evaluation. Additionally, experiments were repeated on a more performance-limited device detailed in Table 2. This step-by-step process ensured that the model’s performance was rigorously tested under controlled and real-world conditions, allowing for a fair comparison across different models.
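The split into training, validation, and testing subsets can be illustrated with a minimal sketch. It is not the authors' pipeline: the 70/15/15 ratio, the fixed seed, and the file-name pattern are assumptions made only for illustration, since the text does not state them.

```python
import random


def split_dataset(items, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists.

    The fractions are illustrative assumptions; whatever remains after the
    train and validation cuts becomes the test subset.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test subset")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names for one category of roughly 3000 images.
images = [f"camera_{i:04d}.png" for i in range(3000)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))
```

Shuffling with a seeded generator keeps the three subsets disjoint and reproducible, so repeated runs on different hardware see the same held-out images.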
4.1. Metrics
The evaluation of the proposed method relied on the mean Average Precision (mAP), a standard metric to quantify object detection performance based on a user-defined set of criteria [40]. It is defined as the mean value of the average precision of the individual classes:
\[
\mathrm{mAP} = \frac{1}{n} \sum_{k=1}^{n} \mathrm{AP}_k \tag{8}
\]
where AP_k is the Average Precision of class k, and n is the number of classes.
Additionally, confusion matrices were used to analyze the performance of the model across different object classes, offering a detailed view of detection accuracy. This combination of metrics provides a comprehensive evaluation of model performance across varied scenarios, highlighting both successes and challenges.
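As a worked instance of Eq. (8), the short sketch below averages per-class AP values. The function name is hypothetical and the per-class values are the PM column of Table 5; how each AP_k is itself obtained (matching thresholds, interpolation) is not covered here.

```python
def mean_average_precision(ap_per_class):
    """Return the mean of the per-class Average Precision values (Eq. 8)."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


# Per-class AP (%) of the proposed model (PM) on the real dataset, from Table 5.
ap_pm = {"Car": 87.85, "Pedestrian": 60.85, "Cyclist": 48.69}
print(f"mAP = {mean_average_precision(ap_pm):.2f}%")  # mAP = 65.80%
```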
4.2. Data Specification
The synthetic dataset used in this study was designed to emulate real-world conditions and contains images divided into four main categories: Camera, Light, Weather, and Sensor. Each category is further split into two subcategories, Air and Ground, where Air images contain aerial vehicles, and Ground images contain terrestrial vehicles. Approximately 3000 images per category were generated to comprehensively assess model performance under different environmental factors. The Camera category consists of images captured at various distances, elevation angles, and azimuth angles, while the Light category introduces different lighting conditions, such as variations in intensity and direction. The Weather category incorporates rainy and non-rainy conditions, including wind variations, to assess the model’s performance under adverse conditions. Finally, the Sensor category includes images that simulate night and thermal vision to further test the model’s robustness. In contrast, the KITTI dataset contains real-world images from urban environments, which also include LiDAR data and other sensor information. It provides a more realistic benchmark for evaluating the model’s performance in object detection tasks, especially since it includes diverse objects such as cars, pedestrians, and cyclists in real-world driving scenarios. While the synthetic dataset allows for controlled experimentation, the KITTI dataset serves as an important benchmark for generalization to real-world data.
As mentioned above, there are four categories of sub-datasets: a) Camera, b) Light, c) Weather, and d) Sensor. The Camera category represents images generated with different camera angles (point of view) and distances from an object in the City and the Desert scenes. Specifically, for the Air sub-category, there are images generated at 4 equal distances between 70 and 350 metres. For the Ground sub-category, there are images generated at 4 equal distances between 15 and 75 metres. In both sub-categories the images were generated at 4 equal elevation angles between 5° and 85°, and at 3 equal azimuth angles between 0° and 240°. The other parameters such as light, image type, fog and rain were selected in such a way as to prevent bias in the evaluation of the camera parameters. An overview of the synthetic data generation specification is presented in Table 3.
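The camera sweep described above can be sketched as a small parameter grid. Two assumptions go beyond the text: "equal" is read as evenly spaced values with both endpoints included, and every combination of distance, elevation, and azimuth is rendered. The helper names are hypothetical.

```python
from itertools import product


def linspace(start, stop, count):
    """Return `count` evenly spaced values from start to stop, both included."""
    if count < 2:
        raise ValueError("count must be at least 2")
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def camera_grid(distance_range, n_dist=4, n_elev=4, n_azim=3):
    """Enumerate (distance_m, elevation_deg, azimuth_deg) camera poses."""
    distances = linspace(distance_range[0], distance_range[1], n_dist)
    elevations = linspace(5.0, 85.0, n_elev)
    azimuths = linspace(0.0, 240.0, n_azim)
    return list(product(distances, elevations, azimuths))


air_poses = camera_grid((70.0, 350.0))    # Air sub-category
ground_poses = camera_grid((15.0, 75.0))  # Ground sub-category
print(len(air_poses), len(ground_poses))  # 4 * 4 * 3 poses each
print(sorted({d for d, _, _ in ground_poses}))
```

Under these assumptions each sub-category has 48 distinct camera poses, so the roughly 3000 images per category would come from repeating the poses across scenes and the remaining randomized parameters.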
The Light category contains images generated using variable balanced lighting parameters covering the City and Desert scenes. In more detail, the Air and Ground sub-categories were generated with the light intensity set between 10% and 100% power at 3 equal steps. The light elevation angles were set between 5° and 90° at 3 equal steps, and the light azimuth angles were set between 0° and 180° at 3 equal steps. The other parameters related to camera, weather, and sensors were selected randomly and uniformly in such a way as to avoid bias in the evaluation of the model under the set of light parameters.
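The idea of sweeping the light parameters while drawing the remaining parameters uniformly at random can be sketched as follows. The three values per light parameter assume evenly spaced steps with both endpoints included; the dictionary keys, the sensor labels, and the seed are hypothetical and not taken from the authors' generator.

```python
import random
from itertools import product

# Light parameters swept in 3 equal steps each (ranges from the text).
INTENSITY = [10.0, 55.0, 100.0]      # percent of full power
LIGHT_ELEVATION = [5.0, 47.5, 90.0]  # degrees
LIGHT_AZIMUTH = [0.0, 90.0, 180.0]   # degrees


def sample_nuisance(rng):
    """Draw the non-light parameters uniformly so none is over-represented."""
    return {
        "camera_elevation_deg": rng.uniform(5.0, 85.0),
        "camera_azimuth_deg": rng.uniform(0.0, 240.0),
        "rain": rng.choice([False, True]),
        "sensor": rng.choice(["rgb", "night", "thermal"]),  # hypothetical labels
    }


def light_configs(seed=0):
    """Pair every light setting with one uniformly sampled set of other parameters."""
    rng = random.Random(seed)
    configs = []
    for intensity, elev, azim in product(INTENSITY, LIGHT_ELEVATION, LIGHT_AZIMUTH):
        cfg = {
            "intensity_pct": intensity,
            "light_elevation_deg": elev,
            "light_azimuth_deg": azim,
        }
        cfg.update(sample_nuisance(rng))
        configs.append(cfg)
    return configs


configs = light_configs()
print(len(configs))  # 3 * 3 * 3 light settings
```

Sampling the camera, weather, and sensor values independently of the light setting is what keeps those factors from correlating with the parameter under study.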
The Weather category contains images generated using different balanced weather parameters covering the City and Desert scenes. The Air and Ground sub-categories were generated both with and without enabling rain. Furthermore, the rainy images included variations due to the wind parameter that was selected to be 0 or 10 units of power. The other parameters were selected in such a way to avoid bias on the evaluation of the models in the weather category.

Table 3: Summary of Data Specifications
Data Category | Scene | Sub-Categories | Parameters | Details
Camera | City & Desert | Air, Ground | Distance, Elevation Angle, Azimuth Angle | Distances: 70-350m (Air), 15-75m (Ground); Elevation Angles: 5°-85°; Azimuth Angles: 0°-240°
Light | City & Desert | Air, Ground | Light Intensity, Elevation Angle, Azimuth Angle | Intensity: 10-100% (3 steps); Elevation Angles: 5°-90°; Azimuth Angles: 0°-180°
Weather | City & Desert | Air, Ground | Rain, Wind | Rain: Enabled/Disabled; Wind: 0 or 10 units
Sensor | City & Desert | Air, Ground | Night Vision, Thermal Vision | Night and Thermal Vision emulations
Real (KITTI) | Varied | Air, Ground | High-resolution Images, Lidar, Calibration | Object Detection, Tracking, 3D Scene Understanding
The night and thermal vision are the main attributes of the Sensor category. The night vision visualises an approximation of the effect of night vision goggles and the same approach was considered for the thermal vision. The Sensor category contains images generated using different balanced sensor image types covering the City and Desert scenes. Also, the Air and Ground sub-categories contain images emulating both night and thermal vision sensors. The other parameters again were selected uniformly, preventing bias on the evaluation of the models for the sensor set of parameters.
Synthetic datasets can effectively mimic the generalisation capabilities of real-world data by leveraging digital twins and synthetic data generation tools built on game engines. Digital twins create virtual replicas of real-world environments, capturing detailed spatial, temporal, and functional characteristics that enable realistic simulation of real-world scenarios. When combined with the advanced rendering, physics, and animation capabilities of game engines, these tools can generate highly realistic and diverse datasets that mirror the complexity of actual environments. Such synthetic data can replicate intricate interactions, simulate varying conditions, and introduce controlled variations. This approach is particularly useful for creating scalable and cost-effective datasets while addressing limitations like bias or scarcity in real-world data collection and ensuring that the synthetic datasets encapsulate the variability and complexity of real-world data. Consequently, models trained on such datasets can generalise effectively, as they are exposed to a wide range of representative patterns and scenarios that mirror real-world applicability.

Figure 7: Real Dataset 3D Bounding Box Sample
4.3. Real Dataset Experiments
The KITTI dataset [32], which is a widely used benchmark dataset for research in computer vision and autonomous driving [41], was chosen as the Real dataset. It stands for “Karlsruhe Institute of Technology and Toyota Technological Institute” and was created by researchers from these institutions. This dataset is commonly referenced in academic publications related to tasks such as object detection, tracking, 3D scene understanding, and more. The specification of the dataset can be seen in Table 4.
The primary objective behind its inception is to foster the advancement of algorithms and technologies relevant to autonomous vehicles. The dataset is characterized by a comprehensive collection of diverse data modalities, encompassing high-resolution camera images, LiDAR point clouds, and calibration parameters. This dataset is used for a wide variety of tasks, including object detection, motion tracking, 3D scene analysis, and other such applications. An added feature of the KITTI dataset is the provision of image annotations for various object types, such as cars, pedestrians, and cyclists, thereby rendering it an important resource for the validation of AI models. Furthermore, the dataset encompasses a wide spectrum of real-world driving scenarios, variable weather conditions, and different times of day, thereby facilitating a comprehensive assessment of algorithm performance under diverse environmental conditions. It is worth noting that while the KITTI dataset is widely used in the research community, it does exhibit certain limitations, notably its relatively modest scale and the absence of data pertaining to certain object classes, e.g., motorcycles.

Table 4: Features and Specifications of the 3D Object Detection KITTI Dataset
Feature/Specification | Description
Data Type | Images, Lidar data
Tasks | Stereo, Optical Flow, Visual Odometry, 3D Object Detection, Tracking
Number of Images | ~15,000 images for object detection
Image Resolution | 1242 x 375 pixels
Sensors | Inertial Navigation System (GPS/IMU): OXTS RT 3003; Laser scanner: Velodyne HDL-64E; Grayscale cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3M-C); Color cameras, 1.4 Megapixels: Point Grey Flea 2 (FL2-14S3C-C); Varifocal lenses, 4-8 mm: Edmund Optics NT59-917
Annotation Types | Bounding boxes, 3D boxes, object type, truncation, occlusion levels
Environments | Urban, residential, road
Classes | Cars, vans, trucks, pedestrians, cyclists
Ground Truth Availability | Yes

Table 5: Comparison of Performance Results on the Real Datasets (%, mAP)
Class | FRRCNN | RETINA | YOLOv3 | PM
Car | 64.67 | 77.09 | 69.01 | 87.85
Pedestrian | 28.42 | 51.78 | 39.17 | 60.85
Cyclist | 32.33 | 51.32 | 43.84 | 48.69
Total mAP | 41.81 | 60.06 | 51.34 | 65.80
The performance of the proposed framework on the Real dataset is reported in Table 5. The table compares four models (FRRCNN, RETINA, YOLOv3, and the proposed PM model) across three object detection categories: Car, Pedestrian, and Cyclist. The proposed PM model achieves the highest accuracy in detecting Cars (87.85%) and Pedestrians (60.85%); for Cyclists it reaches 48.69%, behind RETINA (51.32%). This suggests that the PM model is well-suited for detecting objects in real-world conditions, outperforming the other models in two of the three classes. The strong performance in the Car category is especially notable, where PM significantly outperforms the other models, with a performance margin of over 10 percentage points compared to the second-best RETINA model (77.09%). This indicates that PM is highly capable of recognizing cars, likely due to better feature extraction or training strategies suited to this object class.
Figure 8: Example of 3D bounding box predictions. (Synthetic Data)

Table 6: Performance results on the Synthetic dataset (mAP)
Category | Sub-category | PM | FRRCNN | YOLOv3 | RETINA
Air | Camera | 61.04% | 5.24% | 44.82% | 44.79%
Air | Light | 39.95% | 20.66% | 63.58% | 61.25%
Air | Weather | 88.71% | 5.35% | 39.00% | 45.57%
Air | Sensor | 51.90% | 4.97% | 4.27% | 7.95%
Ground | Camera | 74.66% | 17.95% | 76.02% | 88.81%
Ground | Light | 33.82% | 32.54% | 38.52% | 87.12%
Ground | Weather | 58.75% | 17.07% | 66.32% | 86.09%
Ground | Sensor | 55.72% | 4.16% | 7.64% | 15.59%
In the Pedestrian class, PM again outperforms the other models, although the performance gap is narrower compared to the Car category. RETINA, which shows a strong second-place performance (51.78%), trails PM by approximately 9%. Pedestrian detection typically involves more variability in size and occlusion, making it a challenging category. PM’s performance indicates its robustness in handling this complexity, although there may still be room for improvement to reach higher accuracy.
For Cyclists, the PM model reaches 48.69%, which places it second in Table 5: RETINA scores higher at 51.32%, while YOLOv3 (43.84%) and FRRCNN (32.33%) fall below it. This is the only class where PM does not lead, which suggests that detecting cyclists remains more challenging due to the variability in appearance and size, and that further fine-tuning or data augmentation might be required to improve accuracy in this category.
When looking at the total mean Average Precision (mAP) across all categories, the PM model achieves the highest overall score (65.80%), outperforming RETINA (60.06%), YOLOv3 (51.34%), and FRRCNN (41.81%). The mAP metric highlights the superior generalization and robustness of the PM model across all object detection tasks. The improvement in mAP by nearly 6% over RETINA further underscores the advantage of PM in handling real-world datasets.
In summary, the PM model stands out in terms of performance, particularly in detecting cars, where it exhibits substantial accuracy gains. Its lead in the Car and Pedestrian classes, coupled with the highest total mAP, suggests that the PM model is more adaptable and effective in diverse object detection tasks. Nevertheless, some categories, such as cyclists, where RETINA scores higher, remain more challenging, and additional efforts to further enhance detection accuracy, particularly through data diversity or model fine-tuning, could help close performance gaps. These results suggest the PM model is highly promising for real-world applications but may benefit from further optimization for more challenging object classes.
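The per-class comparison can be recomputed directly from Table 5. The sketch below is only an arithmetic check on the published numbers; the helper name is hypothetical and margins are expressed in percentage points.

```python
# mAP (%) per class and model, copied from Table 5.
TABLE_5 = {
    "Car":        {"FRRCNN": 64.67, "RETINA": 77.09, "YOLOv3": 69.01, "PM": 87.85},
    "Pedestrian": {"FRRCNN": 28.42, "RETINA": 51.78, "YOLOv3": 39.17, "PM": 60.85},
    "Cyclist":    {"FRRCNN": 32.33, "RETINA": 51.32, "YOLOv3": 43.84, "PM": 48.69},
}


def best_and_margin(scores):
    """Return (best model, lead over the runner-up in percentage points)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    return best, round(top - second, 2)


for cls, scores in TABLE_5.items():
    best, margin = best_and_margin(scores)
    print(f"{cls}: best={best}, lead={margin} points")
```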
4.4. Synthetic Dataset Experiments
The proposed CenterNet model demonstrated strong and competitive performance across various categories, as evidenced by the results on the Synthetic dataset (Table 6). The model’s confusion matrices can be observed in Figures 3 and 6, illustrating the distinctions between performance on Air and Ground datasets. Several notable phenomena were observed during the experiments. The model’s detection accuracy for ground vehicles was significantly higher than for aerial vehicles. This discrepancy is likely due to the more consistent size, shape, and proximity of ground vehicles, whereas aerial vehicles exhibit greater variability. In the Camera and Light subcategories, the model performed particularly well, benefiting from the predictable nature of lighting and camera angles. In contrast, the Weather and Sensor subcategories posed considerable challenges. Rain, wind, and other environmental factors in the Weather subcategory reduced detection accuracy, while the Sensor subcategory, which included night vision and thermal vision images, proved to be the most difficult for the model. The complexity of these specialized data types likely requires more focused training and fine-tuning.
The objective of this experiment was to evaluate the performance of the proposed CenterNet model for object detection in synthetic environments. The goal was to analyze the model’s ability to detect objects across different categories, particularly “Air” and “Ground,” as well as in subcategories like “Camera,” “Light,” “Weather,” and “Sensor.” The evaluation aimed to assess the effects of diverse environmental conditions on the model’s performance and determine where improvements could be made, especially in challenging scenarios like night vision and thermal imaging. The model excelled in several subcategories, particularly in the Camera and Light subcategories, where the predictability of features such as lighting conditions led to strong detection results. As seen in Table 6, the Light subcategory was among the easiest to detect, given its uniformity in visual parameters like object angles and lighting. On the other hand, more complex conditions, such as those found in the Weather and Sensor subcategories, introduced challenges that affected performance. The Weather subcategory achieved mid-level results due to factors like rain and wind, which added complexity to image detection. The Sensor subcategory, comprising specialized data types like night and thermal vision, proved the most difficult for the model, reflecting the need for further specialized training and fine-tuning to handle these complex image types effectively.
The underlying causes of these results are multifaceted. Ground vehicles tend to have more stable and consistent features, making them easier to detect, while aerial vehicles vary in size and shape, which complicates detection. In the Light subcategory, uniform lighting conditions made object features more predictable, contributing to the model’s strong performance. The drop in performance in the Weather subcategory can be attributed to the increased complexity introduced by environmental conditions like rain and wind. Similarly, the Sensor subcategory, with its night and thermal vision data, differs significantly from normal visual data, making it more challenging for the model to adapt without further specialized training.
To further improve performance, several recommendations can be made. Fine-tuning the model on sensor data, particularly night and thermal vision, could enhance its ability to handle complex image types. Additionally, augmenting the training dataset with real-world samples, such as those from the KITTI dataset, could reduce the gap between synthetic and real-world performance. Domain adaptation techniques could also improve the model’s generalisation from synthetic to real-world conditions, making it more robust for practical applications. Overall, the PM demonstrated strong performance, particularly in the Camera and Light subcategories, where it outperformed other models. However, the results in the Weather and Sensor subcategories suggest that further work is needed to improve the model’s robustness in handling complex environmental conditions like rain, wind, night vision, and thermal vision. With additional fine-tuning, expanded training data, and domain adaptation strategies, the model’s generalization capabilities can be further enhanced, making it more suitable for real-world applications in object detection. This experiment provides a solid foundation for future research aimed at improving object detection in challenging conditions.
Additionally, experiments were conducted on both synthetic and real datasets using a more hardware-constrained device to evaluate the robustness and adaptability of the proposed method under limited computational resources. The results obtained were identical to those from experiments conducted on higher-performance hardware, demonstrating the method’s consistency across different platforms. However, the restricted GPU memory of the constrained device necessitated adjustments to the number of training epochs, ensuring that the experiments could be executed without exceeding memory limitations. Despite these adjustments, the inference performance remained unaffected, indicating that the method is not resource-dependent in this phase. Nevertheless, the reduced computational capacity led to an increase in the time required to complete each experiment, highlighting the impact of hardware limitations on the training process. However, inference speed stayed the same. This underscores the importance of considering hardware constraints when deploying the method in real-world scenarios.
A more in-depth analysis of the differences between synthetic and real-world datasets reveals several factors that influence their effectiveness and applicability in various domains. While real-world datasets are rich in complexity, capturing the inherent noise, variability, and unpredictability of actual environments, they often suffer from biases, incomplete data, and challenges related to data collection [42], [43], [44]. In contrast, synthetic datasets offer a controlled environment where these biases can be mitigated, and data can be generated to specifically target areas that are underrepresented or difficult to capture in the real world, such as rare events, edge cases, sensory data, weather conditions, and camera angles, etc. However, synthetic datasets may lack the full diversity and nuance found in real-world data, especially when the data generation process cannot perfectly replicate the complex interactions and real-world uncertainties. Despite this, synthetic datasets can be valuable for training models in scenarios where real-world data is scarce, expensive, or ethically challenging to obtain.
5. Conclusion
This paper provides a comprehensive overview of existing methodologies and approaches within the realm of scene analysis, leveraged by autonomous vehicles, with a specific emphasis on their applicability in immersive environments. The research presented delves into an in-depth analysis of a 3D object detection model from the vantage point of the augmented reality domain. The architectural framework comprises a diverse set of components, each meticulously designed to tackle various intricacies related to the estimation of keypoints, the conversion of keypoints to 2D bounding boxes, and the inference of crucial spatial information. This information encompasses depth, 3D dimensions measured in meters, as well as orientation, encompassing azimuth, elevation, and roll angles. The collective contributions of these components culminate in a model that exhibits proficiency in the projection of 3D bounding boxes onto a 2D image.
To empirically evaluate the efficacy of the proposed architecture, a comprehensive testing regimen was conducted, utilizing a synthetic dataset in a comparative study. The outcomes of this evaluation reveal that the proposed model delivers competitive performance while demonstrating stability, particularly when tasked with the detection of distant objects. The evaluation and analysis of the proposed model were undertaken under diverse environmental conditions and with varying camera settings, establishing its versatility and robustness. Furthermore, to augment the comprehensiveness of the study, a novel and well-balanced synthetic dataset was created and curated, utilising a virtual environment. This dataset encompasses annotated data spanning a multitude of objects and environmental scenarios, providing a rich resource for subsequent validation, experimentation, and refinement.
Acknowledgments
This work was funded by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10047653] and funded by the European Union [under EC Horizon Europe grant agreement number 101070181 (TALON)].
References
[1] M. I. Pavel, S. Y. Tan, A. Abdullah, Vision-based autonomous vehicle systems based on deep learning: A systematic literature review, Applied Sciences 12 (14) (2022) 6831.
[2] J. Xiong, E.-L. Hsiang, Z. He, T. Zhan, S.-T. Wu, Augmented reality and virtual reality displays: emerging technologies and future perspectives, Light: Science & Applications 10 (1) (2021) 216.
[3] Z. Zou, K. Chen, Z. Shi, Y. Guo, J. Ye, Object detection in 20 years: A survey, Proceedings of the IEEE (2023).
[4] R. Anderson, J. Toledo, H. ElAarag, Feasibility study on the utilization of microsoft hololens to increase driving conditions awareness, in: 2019 SoutheastCon, IEEE, 2019, pp. 1–8.
[5] D. L. Gomes Jr, A. C. de Paiva, A. C. Silva, G. Braz Jr, J. D. S. de Almeida, A. S. de Araújo, M. Gattass, Augmented visualization using homomorphic filtering and haar-based natural markers for power systems substations, Computers in Industry 97 (2018) 67–75.
[6] N. Dimitropoulos, T. Togias, G. Michalos, S. Makris, Operator support in human–robot collaborative environments using ai enhanced wearable devices, Procedia Cirp 97 (2021) 464–469.
[7] Z.-Q. Zhao, P. Zheng, S.-t. Xu, X. Wu, Object detection with deep learning: A review, IEEE transactions on neural networks and learning systems 30 (11) (2019) 3212–3232.
[8] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[9] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[10] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, Vol. 1, Ieee, 2001, pp. I–I.
[11] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE transactions on pattern analysis and machine intelligence 37 (9) (2015) 1904–1916.
[12] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[13] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015).
[14] J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully convolutional networks, Advances in neural information processing systems 29 (2016).
[15] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, J. Sun, Light-head r-cnn: In defense of two-stage object detector, arXiv preprint arXiv:1711.07264 (2017).
[16] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[17] J. Cao, H. Cholakkal, R. M. Anwer, F. S. Khan, Y. Pang, L. Shao, D2det: Towards high quality object detection and instance segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11485–11494.
[18] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[19] J. Redmon, A. Farhadi, Yolo9000: better, faster, stronger, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
[20] J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[21] Z. Wang, L. Wu, T. Li, P. Shi, A smoke detection model based on improved yolov5, Mathematics 10 (7) (2022). doi:10.3390/math10071190. URL https://www.mdpi.com/2227-7390/10/7/1190
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, Springer, 2016, pp. 21–37.