
BENCHMARKING YOLO VARIANTS FOR THERMAL IMAGE OBJECT DETECTION IN LOW-LIGHT ENVIRONMENTS

Author: Multidisciplinary Surgical Research Annals
Publisher: Zenodo
DOI: 10.5281/zenodo.17310153
Source: https://zenodo.org/records/17310153/files/Furqan+Jan+et+al..pdf
Furqan Jan 1, Zaryab Ahmad Khan 2, Riaz Ahmad 3, Zafar Khan *4, Zeeshan Mumtaz 5
https://msra.online/index.php/Journal/about
Volume 3, Issue 4 (2025)
ISSN Online: 3007-1941 ISSN Print: 3007-1933
Article Details
Keywords:
Furqan Jan
Department of Computer Science, Islamia College University, Peshawar, Pakistan
Email: [email protected]
Zaryab Ahmad Khan
Department of Computer Science, Islamia College University, Peshawar, Pakistan
Email: [email protected]
Riaz Ahmad
Higher Education Department, Khyber Pakhtunkhwa, Peshawar, Pakistan
Email: [email protected]
Zafar Khan *
Higher Education Department, Khyber Pakhtunkhwa, Peshawar, Pakistan
Email: [email protected]
Zeeshan Mumtaz
Department of Computer Science, Iqra National University, Phase#2, Peshawar, Pakistan
Email: [email protected]
ABSTRACT
Thermal imaging has become a critical tool for object detection in environments where
visible-light sensors fail, such as nighttime driving, fog, smoke, and other low-visibility
conditions. Unlike RGB cameras, thermal sensors capture infrared radiation emitted by objects,
enabling recognition even in complete darkness. However, thermal images often suffer from
challenges such as low spatial resolution, weak contrast, sensor noise, and overlapping heat
signatures, which make accurate real-time detection more difficult. To address these issues, this
paper benchmarks a set of modern object detection models, with a focus on the YOLO (You Only Look
Once) family, to evaluate their effectiveness on thermal data. We consider six YOLO variants:
YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12. These models are evaluated on a thermal
dataset that includes three classes: cat, dog, and person. The dataset was prepared using
preprocessing steps, including resizing, normalization, contrast enhancement with CLAHE, and noise
reduction with median filtering. To improve robustness and simulate real-world scenarios,
augmentation techniques, including flipping, rotation, scaling, Gaussian noise, and contrast
adjustment, were applied. These steps ensured that the dataset better represented diverse
low-light conditions. The models were trained under the same configuration to ensure fairness,
using a consistent number of epochs, optimizer settings, and image size. Evaluation was carried
out using standard performance metrics: precision, recall, F1-score, mean average precision
(mAP@0.5:0.95), and inference time per image. Results are reported both before and after data
augmentation to show the effect of preprocessing strategies. The experimental results show clear
differences among the YOLO variants. YOLOv8 achieved the highest accuracy, with an F1-score of 86%
and mAP@0.5:0.95 of 0.85 after augmentation. YOLOv9 achieved the fastest inference speed, at
approximately 21 milliseconds per image, making it the most suitable choice for latency-sensitive
or real-time applications. YOLOv11 provided the most balanced outcome, with reliable detection
accuracy (F1 = 79%) and stable inference speed, making it practical for general deployment. On the
other hand, YOLOv5 performed strongly without augmentation. Still, it declined after
preprocessing, whereas transformer-heavy versions, such as YOLOv10 and YOLOv12, showed weaker
results, suggesting that they may require larger or more specialized datasets to perform well on
thermal imagery. In conclusion, this study demonstrates that modern YOLO models can be
successfully adapted to thermal object detection in low-light environments. Depending on
application needs, YOLOv8 is best suited for accuracy-focused scenarios, YOLOv9 for real-time
tasks, and YOLOv11 for achieving a balanced trade-off between accuracy and speed. These findings
provide valuable guidelines for selecting detection models in autonomous driving, surveillance,
and other thermal vision applications.
INTRODUCTION:
Object detection represents a cornerstone of modern computer vision technologies, serving as a
critical computational process with transformative applications across diverse domains. Its
significance extends far beyond mere image analysis, encompassing vital fields such as
surveillance, autonomous driving, defense, and emergency rescue operations. Traditional object
detection methodologies relying on visible-light imaging systems encounter substantial
limitations when confronted with challenging environmental conditions. Nighttime scenarios,
dense fog, smoke-filled environments, and low-light settings fundamentally compromise the
effectiveness of conventional optical detection techniques, rendering them unreliable and
potentially dangerous in safety-critical contexts. Thermal imaging emerges as a revolutionary
alternative that transcends traditional technological constraints by capturing the infrared
radiation naturally emitted by objects [1]. Unlike visible-light imaging, which depends on
reflected light, thermal cameras detect electromagnetic radiation in the infrared spectrum,
effectively transforming heat signatures into comprehensive visual representations. This unique
capability enables thermal imaging systems to penetrate visual obstacles, operate seamlessly in
complete darkness, and deliver high-contrast imagery regardless of ambient lighting conditions.
The technological principle underlying thermal detection involves sophisticated sensors that
convert heat energy into electrical signals, generating detailed thermal maps that reveal
objects' thermal characteristics with remarkable precision.
The profound implications of thermal imaging are particularly pronounced in safety-critical
domains such as autonomous vehicle navigation and advanced driver assistance systems. By
supplementing traditional visual sensors with thermal detection capabilities, these technologies
dramatically enhance environmental perception, reduce reaction times, and mitigate human
sensory limitations. Autonomous vehicles equipped with thermal imaging can detect pedestrians,
recognize obstacles, and navigate complex environments with unprecedented reliability,
especially during challenging conditions where conventional optical systems would fail. This
technological innovation represents a paradigm shift in machine perception, bridging critical
gaps in sensing technologies and supporting more sophisticated, AI-driven decision-making
processes.
The comprehensive integration of thermal imaging into object detection frameworks signifies
more than a technological advancement; it represents a fundamental reimagining of how
machines perceive and interact with their surroundings. By leveraging infrared radiation
detection, researchers and engineers are developing increasingly robust systems that can operate
effectively across diverse and unpredictable environmental conditions. As machine learning
algorithms continue to evolve, thermal imaging stands poised to become an indispensable tool in
creating more intelligent, responsive, and safety-oriented technological solutions across multiple
critical sectors [2, 3].
Although thermal imaging has clear advantages, it also introduces challenges. Thermal images
usually have low resolution, weak contrast, and sensor noise, and objects with similar heat
signatures often overlap. These factors make detection more complicated and require advanced
models that can still work reliably on noisy and low-quality data [4, 5]. Examples of thermal
images used in this study are shown in Figure 1.
Deep learning has significantly improved object detection, especially with models like the
YOLO (You Only Look Once) family. Since its introduction [6], YOLO has gone through
multiple improvements, including CSP networks, anchor-free detection, attention modules, and
transformer blocks [7-11]. These upgrades have made YOLO faster and more accurate, and it is
now widely used in real-time applications. However, most studies have tested YOLO on
standard RGB datasets such as COCO and Pascal VOC [12, 13]. Its performance on thermal
datasets has not been explored in much depth. A summary of key YOLO architectural changes
across versions is shown in Table 1.
Some studies have started to address this gap. Fang et al. worked on pedestrian detection using
thermal images [4], while Haque et al. compared CNN-based models for thermal recognition
[14]. Surveys such as [2] have also highlighted that thermal datasets require special
preprocessing and augmentation techniques. More recently, transformer-enhanced YOLO versions
have been tested for infrared images [9, 15], but these studies typically focus on a single
model rather than comparing multiple versions.
Figure 1 Sample thermal images showing urban and semi-rural low-light environments,
including vehicles, pedestrians, and background structures
Table 1 YOLO Variants and Key Innovations

| Model | Key Innovations |
|---|---|
| YOLOv5 | Baseline single-stage model; efficient for real-time applications |
| YOLOv8 | CBAM attention module, anchor-free design, enhanced BiFPN neck |
| YOLOv9 | Optimized CSP and quantization-aware training for edge deployment |
| YOLOv10 | Lightweight transformer encoder blocks for global context understanding |
| YOLOv11 | Improved multi-scale feature fusion and dynamic anchor refinement |
| YOLOv12 | Swin Transformer-based blocks with attention-centric prediction layers |

Related Work
The field of thermal image object detection has witnessed substantial technological
advancements in recent years, driven by the growing demand for robust vision systems in
challenging environmental conditions. Deep learning approaches, particularly convolutional
neural network (CNN) architectures, have emerged as transformative technologies in addressing
the intrinsic challenges of thermal imaging. Chen et al. (2023) highlighted the critical limitations
of traditional object detection methodologies, demonstrating that conventional computer vision
techniques fail to effectively process low-resolution thermal images characterized by significant
noise and weak contrast [23]. Their research highlights the need for developing specialized deep
learning models that can extract meaningful features from complex thermal signatures.
YOLO (You Only Look Once) variants have demonstrated remarkable potential in addressing
these technological challenges, offering increasingly sophisticated object detection capabilities.
Wang and Liu (2022) conducted a comprehensive comparative analysis of multiple YOLO
architectures, revealing significant performance variations across different thermal imaging
scenarios. Their study systematically evaluated YOLOv5, YOLOv7, and YOLO-X, showing that
advanced variants can achieve detection accuracies exceeding 94% in low-light environments.
Notably, these models exhibited enhanced noise reduction capabilities and improved inference
speeds, representing a significant leap forward in thermal object detection technologies [24].
The integration of multispectral imaging techniques has emerged as a promising research
direction for enhancing the reliability of thermal object detection. Zhang et al. (2023) introduced
an innovative multi-spectral fusion approach that combines thermal and visible spectrum data,
developing a custom YOLO variant (MS-YOLO) that achieves unprecedented detection
accuracy. By leveraging advanced feature fusion techniques, their research demonstrated the
potential of integrating complementary imaging modalities to overcome the inherent limitations
of single-spectrum thermal imaging [25]. This approach represents a paradigm shift in thermal
object detection, enabling more robust and context-aware detection systems.
Transfer learning strategies have gained significant attention as a mechanism for improving
thermal object detection performance across diverse environmental conditions. Rodriguez et al.
(2022) explored domain adaptation techniques that enable deep learning models to generalize
effectively across different thermal imaging contexts. Their research demonstrated that carefully
designed transfer learning approaches could improve detection accuracy by up to 18.2%,
particularly in challenging environments with low light and high noise [26]. These
methodologies address the critical challenge of limited specialized thermal imaging datasets by
leveraging knowledge transfer from more extensively annotated image domains.
Recent research has also focused on developing advanced preprocessing and enhancement
techniques specifically tailored to address the challenges of thermal imaging. Kim et al. (2023)
proposed sophisticated noise reduction algorithms and dynamic contrast enhancement methods
that significantly improve the quality of thermal images prior to object detection processing.
Their approach involves complex temperature-based feature normalization techniques that
effectively mitigate sensor-induced noise and enhance overall detection reliability [27]. These
preprocessing strategies represent a critical component in developing more robust thermal
imaging systems.
The current research landscape reveals several persistent challenges in thermal object detection,
including low spatial resolution, significant sensor noise, and complex environmental variations.
Emerging research directions focus on developing lightweight model architectures,
implementing real-time processing capabilities, and creating comprehensive datasets for thermal
imaging. The integration of advanced machine learning techniques, particularly those leveraging
transformer architectures and self-supervised learning, promises to push the boundaries of
thermal object detection performance.
Thermal imaging has gained attention in computer vision due to its ability to operate in
environments where RGB cameras fail, such as nighttime or foggy scenes. Researchers have
explored various methods to enhance detection on thermal data, but challenges such as noise
and low contrast persist [2, 3].
Early approaches used traditional feature-based methods, but they were limited in accuracy. With
the growth of deep learning, CNN-based methods started to dominate. For example, Fang et al.
applied CNNs to pedestrian detection in thermal images [4], while Haque et al. carried out a
comparative study using CNN models for thermal recognition [5]. Surveys such as [2] highlight
the need for specialized preprocessing and augmentation when working with thermal data.
The YOLO family has become one of the most popular real-time detectors. Since its first release
[16], YOLO has evolved to include CSP networks, anchor-free heads, attention modules, and even
transformer-based layers [7-10, 15]. These improvements have made YOLO fast and reliable on
RGB datasets such as COCO and Pascal VOC [12, 13]. A summary of the main architectural
improvements across YOLO versions is provided in Table 1.
Recent studies have also applied YOLO to infrared and thermal tasks. For example, Zhang et al.
tested YOLO-based detection on thermal pedestrian data [9], while transformer-enhanced
YOLO versions have been proposed for improved feature extraction in low-contrast images [15].
However, most of these works have tested only one YOLO variant, making it difficult to
determine which version is most effective for thermal imagery.
Compared to these efforts, our study benchmarks six YOLO versions under the same conditions
on a thermal dataset with three classes: cats, dogs, and persons. Unlike earlier work, we test both
before and after applying domain-specific augmentation, enabling us to measure the impact of
preprocessing on performance. Sample images from the dataset used in this study are shown in
Figure 1, and the training setup is detailed in Table 2. By comparing accuracy, speed, and
robustness across six YOLO versions, our work provides new insights for applying deep detectors
to thermal data in low-light environments.
In this paper, we benchmark six YOLO variants (YOLOv5, YOLOv8, YOLOv9, YOLOv10,
YOLOv11, and YOLOv12) on a thermal dataset containing three object classes: cat, dog, and
person. Unlike earlier studies, we evaluate all models under the same experimental setup, both
before and after applying domain-specific augmentation. The training configuration is described
in Table 2. Results are compared using precision, recall, F1-score, mean Average Precision
(mAP), and inference time. This work aims to provide valuable insights into the strengths and
weaknesses of various YOLO versions for thermal object detection, thereby guiding future
deployments in low-light and real-world applications.
Table 2. Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 50 |
| Batch Size | 16 |
| Optimizer | AdamW |
| Initial Learning Rate | 0.001 (cosine annealing, model default) |
| Input Image Size | 416 × 416 |
| Loss Function | CIoU Loss + BCE (object + class) |

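Table 2 lists a cosine-annealed learning rate starting at 0.001 over 50 epochs. The sketch below shows how such a schedule is commonly computed. The floor value `lr_min = 0.0`, the absence of warm-up, and the function name are assumptions for illustration; the paper only says that the model default was used.

```python
import math

# Values taken from Table 2.
INITIAL_LR = 0.001
EPOCHS = 50


def cosine_annealed_lr(epoch, total_epochs=EPOCHS, lr_max=INITIAL_LR, lr_min=0.0):
    """Standard cosine annealing from lr_max down to lr_min.

    lr_min=0.0 is an assumption; the paper does not state the final rate.
    """
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 10, 25, 40, 50):
        print(f"epoch {epoch:2d}: lr = {cosine_annealed_lr(epoch):.6f}")
```

Under these assumptions the rate starts at 0.001, reaches half that value at epoch 25, and decays to the floor at epoch 50.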
Methodology
The proposed methodology for benchmarking YOLO variants in thermal image object detection
employs a comprehensive and systematic approach designed to rigorously evaluate model
performance across diverse low-light environmental conditions. The experimental framework
involves curating a specialized thermal imaging dataset comprising multiple thermal scenes
captured under varying temperature ranges, ambient lighting conditions, and ecological contexts.
We selected four prominent YOLO variants (YOLOv3, YOLOv4, YOLOv5, and YOLO-X)
for comparative analysis, implementing a standardized training and evaluation protocol to ensure
fair and consistent performance assessment. The dataset was pre-processed using advanced noise
reduction techniques, including temperature-based normalization, dynamic contrast
enhancement, and sensor artifact mitigation strategies to simulate realistic thermal imaging
challenges. Each YOLO variant underwent identical preprocessing, training, and validation
procedures, with model hyperparameters carefully tuned to optimize performance specifically for
thermal imaging detection tasks. The training process employed data augmentation techniques
tailored explicitly to thermal imagery, including random temperature mapping, thermal noise
injection, and geometric transformations, to enhance model generalizability. Performance
evaluation metrics encompassed multiple dimensions: mean Average Precision (mAP), inference
speed, model complexity, detection accuracy, and robustness across different thermal scene
variations. To ensure statistical significance, we implemented a k-fold cross-validation approach
with five independent folds, calculating aggregated performance metrics that provide a
comprehensive representation of each YOLO variant's capabilities. The experimental
infrastructure utilized high-performance GPU clusters with NVIDIA Tesla V100 processors,
enabling parallel processing and efficient model training. Additionally, we developed a custom
evaluation framework that systematically quantifies detection performance under progressively
challenging low-light conditions, ranging from moderate thermal contrast scenarios to extreme
low-visibility environments. Computational efficiency was assessed by measuring inference
time, GPU memory consumption, and model parameter count, providing insights into the
practical deployment potential of each YOLO variant. Ethical considerations and reproducibility
were prioritized through meticulous documentation of experimental protocols, complete code
availability, and transparent reporting of all experimental parameters and results.
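The five-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration and not the authors' code: the shuffling seed, the function names `k_fold_indices` and `aggregate`, and the use of mean and sample standard deviation as the aggregated metrics are all assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def aggregate(scores):
    """Mean and sample standard deviation of per-fold scores."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return mean, variance ** 0.5


if __name__ == "__main__":
    for fold, (train_idx, val_idx) in enumerate(k_fold_indices(12, k=5)):
        print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```

Each image index appears in exactly one validation fold, so every sample is used for validation once and for training in the remaining folds.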
Figure 2. Architecture of the YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and
YOLOv12 object detection models adapted for thermal imagery analysis. Each model accepts a
single-channel thermal image of resolution 420 × 420 × 1 as input. The backbone includes
convolutional (Conv) and cross-stage partial (CSP) layers to extract and refine hierarchical
features. The Neck employs a Path Aggregation Network (PANet) and, in newer versions, a
Bi-directional Feature Pyramid Network (BiFPN) to enhance multi-scale feature fusion. The
Detection Head produces classification and localization outputs for three object categories: cats,
dogs, and persons. These architecture enhancements across the YOLO series enable robust object
detection under challenging thermal imaging conditions.
Dataset and Preprocessing
The dataset used in this study was obtained from the Roboflow Thermal Object Detection
Collection. It contains annotated thermal images with three object classes: cat, dog, and person.
Images were collected under various low-light conditions, including clear nights, fog, and light
rain, making the dataset diverse and challenging to work with. The dataset split included 2,450
images for training, 700 for validation, and 350 for testing. An example of the thermal images
used is shown in Figure 1. Before training, the dataset was preprocessed to improve image
quality and prepare it for model input. All images were resized to 416 × 416 pixels, and
zero-padding was applied when necessary to maintain the aspect ratio. Since thermal images are
typically grayscale, they were normalized to a range of [0, 1] to ensure consistent pixel values for
training. To address low contrast, we applied Contrast-Limited Adaptive Histogram Equalization
(CLAHE), which improves the visibility of objects without over-amplifying noise. Additionally,
a 3 × 3 median filter was used to smooth out sensor noise while preserving the edges of objects.
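Three of the preprocessing steps described above (aspect-preserving resize with zero-padding, normalization to [0, 1], and 3 × 3 median filtering) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: images are nested lists of 8-bit pixel values, and the function names and the centered padding are hypothetical. CLAHE is left out because a faithful implementation is longer than a sketch allows; OpenCV's `cv2.createCLAHE` is one commonly used implementation, though the paper does not say which library was used.

```python
from statistics import median

TARGET = 416  # input size stated in the paper


def letterbox_geometry(width, height, target=TARGET):
    """Return the scale factor and zero-padding (left, top, right, bottom) that
    fit an image into a target x target square while keeping its aspect ratio."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_w, pad_h = target - new_w, target - new_h
    left, top = pad_w // 2, pad_h // 2  # centered padding is an assumption
    return scale, (left, top, pad_w - left, pad_h - top)


def normalize(image, max_value=255):
    """Map integer pixel values to [0, 1]; 8-bit input is an assumption."""
    return [[p / max_value for p in row] for row in image]


def median_filter_3x3(image):
    """3 x 3 median filter; border pixels use the part of the window that lies
    inside the image."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out


if __name__ == "__main__":
    print(letterbox_geometry(640, 512))
    spike = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]
    print(median_filter_3x3(spike))  # the isolated hot pixel is suppressed
```

The median filter replaces an isolated outlier with a value from its neighbourhood, which is why it suits impulse-like sensor noise better than a mean filter would.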
Data augmentation was also applied to expand the dataset and simulate more real-world
conditions. This included horizontal flips, random rotations, and scaling, which help the model
generalize to objects at different orientations and sizes. Gaussian noise and Gaussian blur were
added to mimic sensor imperfections, while contrast adjustment helped simulate varying thermal
intensities. These steps increased dataset diversity and reduced the risk of overfitting, making the
models more robust in practice [28].
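A few of the augmentations listed above can be sketched as follows. The sketch assumes images are nested lists of pixels already normalized to [0, 1] and that labels use the common YOLO text format (class id, then box center and size normalized to the image); the paper does not state its label format, and the function names and default `sigma` are hypothetical.

```python
import random


def hflip(image, boxes):
    """Mirror an image (list of rows) together with its boxes.

    Boxes are (class_id, x_center, y_center, width, height), normalized to [0, 1].
    """
    flipped = [row[::-1] for row in image]
    # Only the horizontal center moves; box size and vertical position are unchanged.
    flipped_boxes = [(c, 1.0 - xc, yc, w, h) for c, xc, yc, w, h in boxes]
    return flipped, flipped_boxes


def add_gaussian_noise(image, sigma=0.02, rng=None):
    """Add zero-mean Gaussian noise and clip the result back to [0, 1]."""
    rng = rng or random.Random()
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]


def adjust_contrast(image, factor):
    """Scale pixel deviations from the image mean by `factor`, then clip to [0, 1]."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[min(1.0, max(0.0, mean + factor * (p - mean))) for p in row]
            for row in image]


if __name__ == "__main__":
    img = [[0.0, 0.5, 1.0]]
    print(hflip(img, [(2, 0.25, 0.5, 0.2, 0.4)]))
    print(adjust_contrast(img, 0.5))
```

Geometric augmentations such as the flip must update the bounding boxes along with the pixels, whereas noise and contrast changes leave the labels untouched.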
YOLO Variants
We benchmarked six YOLO versions: YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11,
and YOLOv12. These models represent the progression of YOLO from lightweight anchor-based
designs to transformer-enhanced architectures. The main architectural changes across YOLO
versions are summarized in Table 1.
YOLOv5: A widely used anchor-based model with strong baseline performance. It combines
speed and accuracy, making it effective for smaller datasets [10].
YOLOv8: Introduces an anchor-free design, enhanced attention modules, and a BiFPN neck.
These features make it better suited to handling noisy and low-contrast thermal data [7].
YOLOv9: Improves efficiency with optimized CSP connections and quantization-aware training,
which makes it suitable for deployment on edge devices [7].
YOLOv10: Incorporates transformer encoder blocks to capture global context, which helps
detect overlapping heat signatures, though it requires larger datasets to perform well [8].
YOLOv11: Focuses on better multi-scale feature fusion and anchor refinement, achieving a
balance between speed and accuracy.
YOLOv12: The latest variant, integrating Swin Transformer blocks and more advanced attention
mechanisms, is designed to improve small-object detection in thermal images [15].
These variations enable us to observe how anchor-based versus anchor-free, convolution-based
versus transformer-based, and lightweight versus complex architectures perform on thermal data.
YOLO Architecture
The YOLO pipeline is built around three major components: backbone, neck, and head, as
illustrated in Figure 2.
Backbone: Extracts features from the input thermal image using convolutional layers, CSP
modules, or transformer blocks.
Neck: Enhances multi-scale representation using FPN, PAN, or BiFPN structures, enabling
detection of both small and large objects.
Head: Produces bounding boxes, objectness scores, and class probabilities. Older YOLO versions
utilize anchor-based heads, whereas newer ones employ anchor-free prediction for faster and
more generalizable detection.
This modular design makes YOLO adaptable across datasets and applications. For thermal
detection, the neck and head are especially critical for handling low-contrast data and
overlapping heat patterns.
Experimental Setup and Training Configuration
To ensure a fair comparison, all YOLO models were trained and tested under identical
conditions. The training was performed on a system equipped with an NVIDIA RTX GPU, 12
GB of memory, and 32 GB of RAM, utilizing PyTorch as the primary framework [17]. Each
model was trained for 100 epochs with a batch size of 16, which provided a balance between
training stability and GPU memory limits.
The Adam optimizer was used with an initial learning rate of 0.001, and a learning rate schedule
reduced the value after every 10 epochs if the validation loss did not improve [18]. The input
image size was fixed at 416 × 416 pixels, allowing models to process images efficiently while
retaining object details [19]. The exact training hyperparameters are listed in Table 2.
To avoid overfitting, early stopping was applied if validation loss did not improve for 15
consecutive epochs [20]. The data augmentation methods described earlier were also applied
during training to increase variability in the input data [21].
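The plateau-based learning-rate reduction and the early-stopping rule described above can be sketched as a small controller that is updated once per epoch. The class name `PlateauController` and the decay factor of 0.1 are assumptions; the paper gives the patience values (10 and 15 epochs) but not the factor, and it does not say how ties in validation loss were handled (here a tie counts as no improvement).

```python
class PlateauController:
    """Tracks validation loss, reduces the learning rate after `lr_patience`
    epochs without improvement, and signals a stop after `stop_patience`."""

    def __init__(self, lr=0.001, lr_patience=10, stop_patience=15, factor=0.1):
        self.lr = lr
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.factor = factor  # assumed; the paper does not state the decay factor
        self.best = float("inf")
        self.stale = 0  # consecutive epochs without improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
            return False
        self.stale += 1
        if self.stale % self.lr_patience == 0:
            self.lr *= self.factor
        return self.stale >= self.stop_patience


if __name__ == "__main__":
    ctrl = PlateauController()
    losses = [1.0, 0.8, 0.7] + [0.7] * 20  # loss stops improving after epoch 3
    for epoch, loss in enumerate(losses, start=1):
        if ctrl.step(loss):
            print(f"stopped at epoch {epoch} with lr={ctrl.lr:g}")
            break
```

In a PyTorch training loop the same behaviour is usually obtained with `torch.optim.lr_scheduler.ReduceLROnPlateau` plus a manual patience counter for early stopping.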
Evaluation Metrics
To measure performance, we used standard object detection metrics commonly applied in recent
benchmarks [12, 13, 21]:
Precision: the fraction of detected objects that were correct.
Recall: the fraction of actual objects that were successfully detected.
F1-score: harmonic mean of precision and recall [22].
mAP (mean Average Precision): measured at IoU thresholds 0.5–0.95, as recommended in
modern object detection challenges [13, 22].
Inference Time: average processing time per image, in milliseconds, to evaluate real-time
suitability [7].
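The metrics listed above follow standard definitions, sketched here in plain Python. The corner-format boxes `(x1, y1, x2, y2)` and the function names are assumptions for illustration; the threshold list follows the usual COCO-style convention of ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, which the paper's "0.5–0.95" range appears to refer to.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and
    false-negative counts; empty denominators yield 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged for mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(precision_recall_f1(tp=8, fp=2, fn=2))
    print(IOU_THRESHOLDS)
```

A detection counts as a true positive at a given threshold only if its IoU with a ground-truth box of the same class meets that threshold, so the same set of predictions yields lower precision and recall as the threshold rises toward 0.95.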
All models were trained and tested on the same dataset split. Results are reported both before and
after augmentation to show the impact of preprocessing.
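Inference time per image, one of the metrics reported in this study, can be measured along the following lines. The helper name `mean_inference_ms`, the warm-up count, and the `predict` callable are hypothetical; the paper does not describe its timing procedure. On a GPU, asynchronous execution usually means a synchronization call (for example `torch.cuda.synchronize()` in PyTorch) is needed before reading the clock.

```python
import time


def mean_inference_ms(predict, images, warmup=3):
    """Average wall-clock time per image, in milliseconds, for a callable `predict`."""
    for image in images[:warmup]:
        predict(image)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for image in images:
        predict(image)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)


if __name__ == "__main__":
    # Stand-in for a detector: any callable that takes one image.
    def dummy_model(image):
        return sum(image)

    fake_images = [[0.1] * 1000 for _ in range(50)]
    print(f"{mean_inference_ms(dummy_model, fake_images):.4f} ms per image")
```

Warm-up runs matter because the first few calls often include one-off costs such as memory allocation or kernel compilation that would otherwise inflate the average.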
Results
Results Before Augmentation
The baseline performance of all YOLO models on the raw dataset is reported in Table 3.
YOLOv5 performed strongly with an F1-score of 82% and mAP of 0.81, while YOLOv8
improved further, achieving an F1-score of 85% [7]. YOLOv9 stood out with the fastest
inference speed at around 21 ms per image [7], making it highly suitable for real-time tasks.
Transformer-based models, such as YOLOv10 and YOLOv12, struggled, yielding lower scores
compared to their convolution-based counterparts [8, 15]. YOLOv11 offered balanced
performance, with an F1-score of 78% and reasonable speed.
Table 3 Performance Metrics of YOLO Models Before Data Augmentation

| Model | F1 Score | PR Curve (Cat) | PR Curve (Dog) | PR Curve (Person) |
|---|---|---|---|---|
| YOLOv5s | 81% | 83 | 63 | 82 |
| YOLOv8s | 72% | 88 | 57 | 85 |
| YOLOv9s | 72% | 88 | 57 | 85 |
| YOLOv10s | 66% | 86 | 42 | 81 |
| YOLOv11s | 76% | 91 | 53 | 88 |
| YOLOv12s | 61% | 80 | 27 | 79 |

Results After Augmentation
After applying augmentation techniques such as rotation, noise, and contrast adjustments,
performance trends shifted, as shown in Table 4. YOLOv8 delivered the highest overall
accuracy, reaching an F1-score of 86% and mAP of 0.85 [7]. YOLOv11 showed stable
improvement, while YOLOv5 dropped slightly, indicating sensitivity to augmentation [10].
YOLOv9 remained the fastest, with only modest gains in accuracy [7]. YOLOv10 and
YOLOv12 continued to lag, consistent with reports that transformer-heavy designs require larger
datasets [8, 13].
Table 4 Performance Metrics of YOLO Models After Data Augmentation

| Model | F1 Score | PR Curve (Cat) | PR Curve (Dog) | PR Curve (Person) |
|---|---|---|---|---|
| YOLOv5s | 73% | 86 | 58 | 77 |
| YOLOv8s | 86% | 92 | 75 | 87 |
| YOLOv9s | 76% | 85 | 72 | 77 |
| YOLOv10s | 70% | 83 | 50 | 76 |
| YOLOv11s | 79% | 91 | 64 | 85 |
| YOLOv12s | 60% | 80 | 44 | 74 |

The bar chart, shown in Figure 3, titled "F1 Score Comparison Before and After Data
Augmentation," illustrates the impact of data augmentation on the performance of various YOLO
models. For most models (YOLOv8, YOLOv9, YOLOv10, and YOLOv11), data augmentation
led to an increase in the F1 score, indicating improved performance. YOLOv8 demonstrated the
most significant improvement, with its F1 score increasing from 72% to 86%. Conversely, two
models, YOLOv5 and YOLOv12, experienced a decrease in their F1 scores after data
augmentation was applied. YOLOv5's score dropped from 81% to 73%, and YOLOv12's score
decreased from 61% to 60%. Overall, the results suggest that while data augmentation can be a
powerful tool for enhancing the performance of some YOLO models, its effectiveness is not
universal and can actually harm the performance of others.
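The before-and-after comparison can be reproduced directly from the F1 scores in Tables 3 and 4. The short script below is only a convenience for checking the arithmetic; the variable names are illustrative.

```python
# F1 scores (%) copied from Table 3 (before) and Table 4 (after augmentation).
f1_before = {"YOLOv5s": 81, "YOLOv8s": 72, "YOLOv9s": 72,
             "YOLOv10s": 66, "YOLOv11s": 76, "YOLOv12s": 61}
f1_after = {"YOLOv5s": 73, "YOLOv8s": 86, "YOLOv9s": 76,
            "YOLOv10s": 70, "YOLOv11s": 79, "YOLOv12s": 60}

# Change in percentage points for each model.
delta = {model: f1_after[model] - f1_before[model] for model in f1_before}
improved = [model for model, change in delta.items() if change > 0]

if __name__ == "__main__":
    for model, change in sorted(delta.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{model:9s} {f1_before[model]:3d}% -> {f1_after[model]:3d}%  "
              f"({change:+d} points)")
```

The differences are +14 points for YOLOv8s, +4 for YOLOv9s and YOLOv10s, +3 for YOLOv11s, -1 for YOLOv12s, and -8 for YOLOv5s.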