BENCHMARKING YOLO VARIANTS FOR THERMAL IMAGE OBJECT DETECTION IN LOW-LIGHT ENVIRONMENTS

Author: Multidisciplinary Surgical Research Annals

Publisher: Zenodo

DOI: 10.5281/zenodo.17310153

Source: https://zenodo.org/records/17310153/files/Furqan+Jan+et+al..pdf

Furqan Jan 1, Zaryab Ahmad Khan 2, Riaz Ahmad 3, Zafar Khan *4, Zeeshan Mumtaz 5
https://msra.online/index.php/Journal/about
Volume 3, Issue 4 (2025)
ISSN Online: 3007-1941 ISSN Print: 3007-1933
Article Details

Furqan Jan
Department of Computer Science, Islamia College University, Peshawar, Pakistan
Email: [email protected]

Zaryab Ahmad Khan
Department of Computer Science, Islamia College University, Peshawar, Pakistan
Email: [email protected]

Riaz Ahmad
Higher Education Department, Khyber Pakhtunkhwa, Peshawar, Pakistan
Email: [email protected]

Zafar Khan*
Higher Education Department, Khyber Pakhtunkhwa, Peshawar, Pakistan
Email: [email protected]

Zeeshan Mumtaz
Department of Computer Science, Iqra National University, Phase#2, Peshawar, Pakistan
Email: [email protected]

ABSTRACT
Thermal imaging has become a critical tool for object detection in environments where visible-light sensors fail, such as nighttime driving, fog, smoke, and other low-visibility conditions. Unlike RGB cameras, thermal sensors capture the infrared radiation emitted by objects, enabling recognition even in complete darkness. However, thermal images often suffer from low spatial resolution, weak contrast, sensor noise, and overlapping heat signatures, which make accurate real-time detection more difficult. To address these issues, this paper benchmarks a set of modern object detection models, with a focus on the YOLO (You Only Look Once) family, to evaluate their effectiveness on thermal data. We consider six YOLO variants: YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12. These models are evaluated on a thermal dataset that includes three essential classes: car, dog, and person. The dataset was prepared using preprocessing steps including resizing, normalization, contrast enhancement with CLAHE, and noise reduction with median filtering. To improve robustness and simulate real-world scenarios, augmentation techniques including flipping, rotation, scaling, Gaussian noise, and contrast adjustment were applied, so that the dataset better represents diverse low-light conditions. All models were trained under the same configuration to ensure fairness, using a consistent number of epochs, optimizer settings, and image size. Evaluation was carried out using standard performance metrics: precision, recall, F1-score, mean average precision (mAP@0.5:0.95), and inference time per image. Results are reported both before and after data augmentation to show the effect of the preprocessing strategies. The experimental results show clear differences among the YOLO variants. YOLOv8 achieved the highest accuracy, with an F1-score of 86% and an mAP@0.5:0.95 of 0.85 after augmentation. YOLOv9 achieved the fastest inference, at approximately 21 milliseconds per image, making it the most suitable choice for latency-sensitive or real-time applications. YOLOv11 provided the most balanced outcome, with reliable detection accuracy (F1 = 79%) and stable inference speed, making it practical for general deployment. YOLOv5 performed strongly without augmentation but declined after augmentation, whereas transformer-heavy versions, such as YOLOv10 and YOLOv12, showed weaker results, suggesting that they may require larger or more specialized datasets to perform well on thermal imagery. In conclusion, this study demonstrates that modern YOLO models can be successfully adapted for thermal object detection in low-light environments. Depending on application needs, YOLOv8 is best suited for accuracy-focused scenarios, YOLOv9 for real-time tasks, and YOLOv11 for a balanced trade-off between accuracy and speed. These findings provide practical guidelines for selecting detection models in autonomous driving, surveillance, and other thermal vision applications.
INTRODUCTION:
Object detection is a cornerstone of modern computer vision, with applications across diverse domains. Its significance extends well beyond image analysis to fields such as surveillance, autonomous driving, defense, and emergency rescue operations. Traditional object detection methods that rely on visible-light imaging face substantial limitations under challenging environmental conditions. Nighttime scenes, dense fog, smoke-filled environments, and low-light settings fundamentally compromise conventional optical detection, making it unreliable and potentially dangerous in safety-critical contexts. Thermal imaging offers an alternative that avoids these constraints by capturing the infrared radiation naturally emitted by objects [1]. Unlike visible-light imaging, which depends on reflected light, thermal cameras detect electromagnetic radiation in the infrared spectrum, transforming heat signatures into visual representations. This capability allows thermal imaging systems to see through obscurants such as smoke, operate in complete darkness, and produce usable imagery regardless of ambient lighting conditions. The underlying principle is that thermal sensors convert incoming heat energy into electrical signals, generating thermal maps that reveal the temperature distribution of objects in the scene.
The implications of thermal imaging are particularly pronounced in safety-critical domains such as autonomous vehicle navigation and advanced driver assistance systems. By supplementing conventional visual sensors with thermal detection, these systems can improve environmental perception, reduce reaction times, and compensate for the limits of human vision. Autonomous vehicles equipped with thermal imaging can detect pedestrians, recognize obstacles, and navigate complex environments more reliably, especially in conditions where conventional optical systems fail. Thermal sensing thus fills critical gaps in machine perception and supports more robust AI-driven decision-making.
Integrating thermal imaging into object detection frameworks is therefore more than an incremental technical advance; it changes how machines perceive and interact with their surroundings. By leveraging infrared radiation, researchers and engineers are developing increasingly robust systems that can operate effectively across diverse and unpredictable environmental conditions. As machine learning algorithms continue to evolve, thermal imaging is poised to become an indispensable tool for building more intelligent, responsive, and safety-oriented solutions across multiple critical sectors [2, 3].
Although thermal imaging has clear advantages, it also introduces challenges. Thermal images usually have low resolution, weak contrast, and sensor noise, and objects with similar heat signatures often overlap. These factors complicate detection and call for models that remain reliable on noisy, low-quality data [4, 5]. Examples of thermal images used in this study are shown in Figure 1.
Deep learning has significantly improved object detection, especially with models like the YOLO (You Only Look Once) family. Since its introduction [6], YOLO has gone through multiple improvements, including CSP networks, anchor-free detection, attention modules, and transformer blocks [7-11]. These upgrades have made YOLO faster and more accurate, and it is now widely used in real-time applications. However, most studies have tested YOLO on standard RGB datasets such as COCO and Pascal VOC [12, 13]. Its performance on thermal datasets has not been explored in much depth. A summary of key YOLO architectural changes across versions is shown in Table 1.
Some studies have started to address this gap. Fang et al. worked on pedestrian detection using thermal images [4], while Haque et al. compared CNN-based models for thermal recognition [14]. Surveys, such as those by Bertoni et al. [2], have also highlighted that thermal datasets require special preprocessing and augmentation techniques. More recently, transformer-enhanced YOLO versions have been tested on infrared images [9, 15], but these studies typically focus on a single model rather than comparing multiple versions.
Figure 1. Sample thermal images showing urban and semi-rural low-light environments, including vehicles, pedestrians, and background structures.
Table 1. YOLO Variants and Key Innovations
Model      Key Innovations
YOLOv5     Baseline single-stage model; efficient for real-time applications
YOLOv8     CBAM attention module, anchor-free design, enhanced BiFPN neck
YOLOv9     Optimized CSP and quantization-aware training for edge deployment
YOLOv10    Lightweight transformer encoder blocks for global context understanding
YOLOv11    Improved multi-scale feature fusion and dynamic anchor refinement
YOLOv12    Swin Transformer-based blocks with attention-centric prediction layers
Related Work
The field of thermal image object detection has seen substantial advances in recent years, driven by the growing demand for robust vision systems in challenging environmental conditions. Deep learning approaches, particularly convolutional neural network (CNN) architectures, have become central to addressing the intrinsic challenges of thermal imaging. Chen et al. (2023) highlighted the critical limitations of traditional object detection methods, demonstrating that conventional computer vision techniques fail to process low-resolution thermal images characterized by significant noise and weak contrast effectively [23]. Their research underlines the need for specialized deep learning models that can extract meaningful features from complex thermal signatures.
YOLO (You Only Look Once) variants have shown considerable potential for addressing these challenges, offering increasingly capable object detection. Wang and Liu (2022) conducted a comparative analysis of multiple YOLO architectures, revealing significant performance variations across different thermal imaging scenarios. Their study systematically evaluated YOLOv5, YOLOv7, and YOLO-X, showing that advanced variants can achieve detection accuracies exceeding 94% in low-light environments. Notably, these models exhibited greater robustness to noise and improved inference speeds, a significant step forward for thermal object detection [24].
The integration of multispectral imaging techniques has emerged as a promising research direction for improving the reliability of thermal object detection. Zhang et al. (2023) introduced a multi-spectral fusion approach that combines thermal and visible-spectrum data, developing a custom YOLO variant (MS-YOLO) that achieves high detection accuracy. By leveraging feature fusion, their research demonstrated the potential of integrating complementary imaging modalities to overcome the inherent limitations of single-spectrum thermal imaging [25]. This approach enables more robust and context-aware detection systems.
Transfer learning strategies have gained significant attention as a way to improve thermal object detection across diverse environmental conditions. Rodriguez et al. (2022) explored domain adaptation techniques that enable deep learning models to generalize effectively across different thermal imaging contexts. Their research demonstrated that carefully designed transfer learning approaches could improve detection accuracy by up to 18.2%, particularly in challenging environments with low light and high noise [26]. These methods address the critical shortage of specialized thermal imaging datasets by transferring knowledge from more extensively annotated image domains.
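One common way to reuse RGB-pretrained weights for the single-channel thermal input described in Figure 2 is to collapse the first convolution's three input channels into one by summing their kernels. This is a generic technique shown for illustration only, not the method of Rodriguez et al. [26] or of this study; the helper names and the nested-list weight layout [out][in][kh][kw] are hypothetical. Because convolution is linear in its input, the summed kernel gives exactly the response the original layer would produce if the thermal frame were copied into all three channels.

```python
def rgb_kernel_to_gray(weights):
    """Collapse conv weights [out][3][kh][kw] into [out][1][kh][kw].

    For a grayscale frame g replicated into three channels, the original
    layer computes sum_c (W_c * g) = (sum_c W_c) * g, so summing over the
    input-channel axis preserves the output on single-channel input.
    """
    collapsed = []
    for out_kernels in weights:  # one entry per output channel
        kh, kw = len(out_kernels[0]), len(out_kernels[0][0])
        summed = [[sum(ch[i][j] for ch in out_kernels) for j in range(kw)]
                  for i in range(kh)]
        collapsed.append([summed])  # now a single input channel
    return collapsed


def conv_at(kernels, channels, y, x):
    """Response of one output channel at (y, x); valid padding, stride 1."""
    total = 0.0
    for kernel, image in zip(kernels, channels):
        for i, row in enumerate(kernel):
            for j, w in enumerate(row):
                total += w * image[y + i][x + j]
    return total


# Hypothetical 2x2 kernels for one output channel and three RGB input channels
rgb = [[[[0.2, -0.1], [0.0, 0.3]],
        [[0.1, 0.1], [0.2, 0.0]],
        [[-0.3, 0.4], [0.1, 0.1]]]]
gray = rgb_kernel_to_gray(rgb)
thermal = [[3.0, 1.0, 2.0], [0.0, 4.0, 1.0], [2.0, 2.0, 5.0]]
print(conv_at(rgb[0], [thermal] * 3, 0, 0), conv_at(gray[0], [thermal], 0, 0))
```

The two printed responses agree up to floating-point rounding, which is the property that makes this initialization a reasonable starting point before fine-tuning on thermal data.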
Recent research has also focused on preprocessing and enhancement techniques tailored to the challenges of thermal imaging. Kim et al. (2023) proposed noise reduction algorithms and dynamic contrast enhancement methods that significantly improve the quality of thermal images before object detection. Their approach uses temperature-based feature normalization to mitigate sensor-induced noise and improve overall detection reliability [27]. Such preprocessing strategies are a critical component of more robust thermal imaging systems.
The current research landscape reveals several persistent challenges in thermal object detection, including low spatial resolution, significant sensor noise, and complex environmental variations. Emerging research directions focus on lightweight model architectures, real-time processing, and comprehensive datasets for thermal imaging. Advanced machine learning techniques, particularly transformer architectures and self-supervised learning, promise to push thermal object detection performance further.
Thermal imaging has gained attention in computer vision due to its ability to operate in environments where RGB cameras fail, such as nighttime or foggy scenes. Researchers have explored various methods to improve detection on thermal data, but challenges such as noise and low contrast persist [2, 3].
Early approaches used traditional feature-based methods, but they were limited in accuracy. With the growth of deep learning, CNN-based methods started to dominate. For example, Fang et al. applied CNNs to pedestrian detection in thermal images [4], while Haque et al. carried out a comparative study of CNN models for thermal recognition [5]. Surveys, such as those by Bertoni et al. [2], highlight the need for specialized preprocessing and augmentation when working with thermal data.
The YOLO family has become one of the most popular real-time detectors. Since its first release [16], YOLO has evolved to include CSP networks, anchor-free heads, attention modules, and even transformer-based layers [7-10, 15]. These improvements have made YOLO fast and reliable on RGB datasets such as COCO and Pascal VOC [12, 13]. A summary of the main architectural improvements across YOLO versions is provided in Table 1.
Recent studies have also applied YOLO to infrared and thermal tasks. For example, Zhang et al. tested YOLO-based detection on thermal pedestrian data [9], while transformer-enhanced YOLO versions have been proposed for improved feature extraction in low-contrast images [15]. However, most of these works have tested only one YOLO variant, making it difficult to determine which version is most effective for thermal imagery.
Compared to these efforts, our study benchmarks six YOLO versions under the same conditions on a thermal dataset with three classes: cars, dogs, and persons. Unlike earlier work, we test the models both before and after applying domain-specific augmentation, which lets us measure the impact of preprocessing on performance. Example images from the dataset are shown in Figure 1, and the training setup is detailed in Table 2. By comparing accuracy, speed, and robustness across six YOLO versions, our work provides new insights for applying deep detectors to thermal data in low-light environments.
In this paper, we benchmark six YOLO variants (YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12) on a thermal dataset containing three object classes: car, dog, and person. Unlike earlier studies, we evaluate all models under the same experimental setup, both before and after applying domain-specific augmentation. The training configuration is described in Table 2. Results are compared using precision, recall, F1-score, mean Average Precision (mAP), and inference time. This work aims to clarify the strengths and weaknesses of the various YOLO versions for thermal object detection, thereby guiding future deployments in low-light, real-world applications.
Table 2. Training Configuration
Parameter              Value
Epochs                 50
Batch Size             16
Optimizer              AdamW
Initial Learning Rate  0.001 (cosine annealing, model default)
Input Image Size       416 × 416
Loss Function          CIoU Loss + BCE (objectness + class)
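Two entries in Table 2 are formulas rather than plain settings: the cosine-annealed learning rate and the CIoU box-regression loss. The sketch below writes both out in plain Python under stated assumptions: the learning rate decays to a floor of 0 (the paper gives only the initial value), and boxes are (x1, y1, x2, y2) pixel coordinates. The function names are illustrative; the training code behind these YOLO versions ships its own implementations, which may add warm-up phases or a nonzero floor. CIoU extends 1 - IoU with a centre-distance penalty rho^2 / c^2 and an aspect-ratio term alpha * v.

```python
import math


def cosine_lr(epoch, total_epochs, lr0=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate; lr_min = 0.0 is an assumption."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * progress))


def ciou_loss(pred, target, eps=1e-9):
    """Complete-IoU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Plain intersection over union
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter + eps)

    # Squared centre distance over squared diagonal of the enclosing box
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


print(cosine_lr(0, 50), cosine_lr(25, 50), cosine_lr(50, 50))
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes: loss above 1
```

Unlike a plain IoU loss, which is constant (1) for any pair of non-overlapping boxes, the centre-distance term still provides a gradient that pulls a disjoint prediction toward its target.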
Methodology
The proposed methodology for benchmarking YOLO variants in thermal image object detection follows a systematic approach designed to evaluate model performance rigorously across diverse low-light environmental conditions. The experimental framework involves curating a specialized thermal imaging dataset comprising multiple thermal scenes captured under varying temperature ranges, ambient lighting conditions, and ecological contexts. We selected six YOLO variants (YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12) for comparative analysis, implementing a standardized training and evaluation protocol to ensure fair and consistent performance assessment. The dataset was preprocessed using noise reduction techniques, including temperature-based normalization, dynamic contrast enhancement, and sensor artifact mitigation, to reflect realistic thermal imaging challenges. Each YOLO variant underwent identical preprocessing, training, and validation procedures, with model hyperparameters tuned for thermal detection tasks. The training process employed data augmentation techniques tailored to thermal imagery, including random temperature mapping, thermal noise injection, and geometric transformations, to improve model generalization. Performance evaluation covered multiple dimensions: mean Average Precision (mAP), inference speed, model complexity, detection accuracy, and robustness across different thermal scene variations. To assess the statistical reliability of the results, we implemented a k-fold cross-validation approach with five independent folds, aggregating the performance metrics across folds for each YOLO variant. The experimental infrastructure used high-performance GPU clusters with NVIDIA Tesla V100 processors, enabling parallel processing and efficient model training. Additionally, we developed a custom evaluation framework that quantifies detection performance under progressively challenging low-light conditions, ranging from moderate thermal contrast scenarios to extreme low-visibility environments. Computational efficiency was assessed by measuring inference time, GPU memory consumption, and model parameter count, providing insight into the practical deployment potential of each YOLO variant. Reproducibility was prioritized through detailed documentation of experimental protocols, complete code availability, and transparent reporting of all experimental parameters and results.
Figure 2. Architecture of the YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12 object detection models adapted for thermal imagery analysis. Each model accepts a single-channel thermal image of resolution 416 × 416 × 1 as input. The backbone includes convolutional (Conv) and cross-stage partial (CSP) layers to extract and refine hierarchical features. The neck employs a Path Aggregation Network (PANet) and, in newer versions, a Bi-directional Feature Pyramid Network (BiFPN) to enhance multi-scale feature fusion. The detection head produces classification and localization outputs for three object categories: cars, dogs, and persons. These architectural enhancements across the YOLO series enable robust object detection under challenging thermal imaging conditions.
Dataset and Preprocessing
The dataset used in this study was obtained from the Roboflow Thermal Object Detection Collection. It contains annotated thermal images with three object classes: car, dog, and person. Images were collected under various low-light conditions, including clear nights, fog, and light rain, making the dataset diverse and challenging. The dataset was split into 2,450 images for training, 700 for validation, and 350 for testing. An example of the thermal images used is shown in Figure 1. Before training, the dataset was preprocessed to improve image quality and prepare it for model input. All images were resized to 416 × 416 pixels, with zero-padding applied where necessary to maintain the aspect ratio. Since thermal images are typically grayscale, they were normalized to the range [0, 1] to ensure consistent pixel values for training. To address low contrast, we applied Contrast-Limited Adaptive Histogram Equalization (CLAHE), which improves the visibility of objects without over-amplifying noise. Additionally, a 3 × 3 median filter was used to smooth out sensor noise while preserving object edges.
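The preprocessing steps above can be sketched in dependency-free Python on images stored as nested lists of 8-bit values. This is a minimal illustration, not the authors' pipeline: the helper names are hypothetical, the clip limit is an absolute per-bin count chosen for the example, and the equalization runs over a single region, whereas CLAHE proper applies it per tile and interpolates between neighbouring tiles (OpenCV's `cv2.createCLAHE` provides the full method). Normalization is applied last in this sketch because the histogram and median steps operate on integer intensities.

```python
def letterbox_params(width, height, target=416):
    """Scale and zero-padding that fit a width x height image into a
    target x target canvas without changing its aspect ratio."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (target - new_w) // 2, (target - new_h) // 2  # left/top padding
    return scale, new_w, new_h, pad_x, pad_y


def median3x3(img):
    """3 x 3 median filter on a 2-D list of ints; border pixels are replicated."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = sorted(window)[4]  # middle of the 9 values
    return out


def clipped_equalize(img, clip_limit=4, levels=256):
    """Contrast-limited histogram equalization over one region (one CLAHE tile).

    clip_limit is an absolute per-bin count here (an illustrative assumption).
    """
    hist = [0] * levels
    for row in img:
        for v in row:
            hist[v] += 1
    # Clip tall bins so flat regions are not over-stretched, then spread the excess evenly
    excess = sum(max(0, c - clip_limit) for c in hist)
    hist = [min(c, clip_limit) + excess / levels for c in hist]
    # Map each intensity through the cumulative distribution
    total, running, lut = sum(hist), 0.0, []
    for c in hist:
        running += c
        lut.append(round((levels - 1) * running / total))
    return [[lut[v] for v in row] for row in img]


def normalize(img, max_value=255):
    """Scale 8-bit intensities into [0, 1]."""
    return [[v / max_value for v in row] for row in img]


print(letterbox_params(640, 512))  # a 640 x 512 frame into a 416 x 416 canvas
```

The median filter is preferred over a mean filter for isolated hot or dead pixels because a single outlier in the 3 × 3 window cannot move the median, so edges survive while impulse noise is removed.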
Data augmentation was also applied to expand the dataset and simulate more realistic conditions. This included horizontal flips, random rotations, and scaling, which help the model generalize to objects at different orientations and sizes. Gaussian noise and Gaussian blur were added to mimic sensor imperfections, while contrast adjustment simulated varying thermal intensities. These steps increased dataset diversity and reduced the risk of overfitting, making the models more robust in practice [28].
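Three of the augmentations above (horizontal flip, Gaussian noise, and contrast adjustment) are sketched below. The noise level `sigma`, the contrast `factor`, and the `pivot` intensity are hypothetical values, since the paper does not report them, and labels are assumed to be in the usual YOLO text format (class id, then box centre, width, and height, all normalized to [0, 1]). Rotation and scaling are left out because they require resampling the image and re-fitting the boxes.

```python
import random


def hflip_image(img):
    """Mirror a 2-D image (list of rows) left to right."""
    return [list(reversed(row)) for row in img]


def hflip_labels(labels):
    """Mirror YOLO-format labels (class_id, x_center, y_center, width, height)
    with coordinates normalized to [0, 1]; only x_center changes."""
    return [(c, 1.0 - x, y, w, h) for c, x, y, w, h in labels]


def add_gaussian_noise(img, sigma=8.0, rng=None):
    """Add zero-mean Gaussian noise to 8-bit pixels and clip to [0, 255]."""
    if rng is None:
        rng = random.Random(0)  # fixed seed for reproducibility
    return [[min(255, max(0, round(v + rng.gauss(0.0, sigma)))) for v in row]
            for row in img]


def adjust_contrast(img, factor=1.2, pivot=128):
    """Stretch (factor > 1) or compress (factor < 1) intensities around a pivot."""
    return [[min(255, max(0, round(pivot + factor * (v - pivot)))) for v in row]
            for row in img]


frame = [[40, 90, 200], [60, 120, 220]]
boxes = [(0, 0.25, 0.5, 0.1, 0.2)]  # hypothetical car box left of centre
print(hflip_image(frame), hflip_labels(boxes))
print(adjust_contrast(add_gaussian_noise(frame)))
```

Geometric augmentations must transform the image and its labels together, which is why the flip is split into an image function and a matching label function; photometric ones such as noise and contrast leave the boxes unchanged.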
YOLO Variants
We benchmarked six YOLO versions: YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12. These models represent the progression of YOLO from lightweight anchor-based designs to transformer-enhanced architectures. The main architectural changes across YOLO versions are summarized in Table 1.
YOLOv5: A widely used anchor-based model with strong baseline performance. It combines speed and accuracy, making it effective for smaller datasets [10].
YOLOv8: Introduces an anchor-free design, enhanced attention modules, and a BiFPN neck. These features make it better suited to handling noisy and low-contrast thermal data [7].
YOLOv9: Improves efficiency with optimized CSP connections and quantization-aware training, which makes it suitable for deployment on edge devices [7].
YOLOv10: Incorporates transformer encoder blocks to capture global context, which helps detect overlapping heat signatures, though it requires larger datasets to perform well [8].
YOLOv11: Focuses on better multi-scale feature fusion and anchor refinement, achieving a balance between speed and accuracy.
YOLOv12: The latest variant; it integrates Swin Transformer blocks and more advanced attention mechanisms and is designed to improve small-object detection in thermal images [15].
These variations let us observe how anchor-based versus anchor-free, convolution-based versus transformer-based, and lightweight versus complex architectures perform on thermal data.
YOLO Architecture
The YOLO pipeline is built around three major components (backbone, neck, and head), as illustrated in Figure 2.
Backbone: Extracts features from the input thermal image using convolutional layers, CSP modules, or transformer blocks.
Neck: Enhances multi-scale representation using FPN, PAN, or BiFPN structures, enabling detection of both small and large objects.
Head: Produces bounding boxes, objectness scores, and class probabilities. Older YOLO versions use anchor-based heads, whereas newer ones employ anchor-free prediction for faster and more generalizable detection.
This modular design makes YOLO adaptable across datasets and applications. For thermal detection, the neck and head are especially critical for handling low-contrast data and overlapping heat patterns.
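To make the contrast between anchor-based and anchor-free heads concrete, the sketch below decodes one raw prediction both ways for a single grid cell. It follows the decoding used in YOLOv5 and, in simplified form, YOLOv8, which derives each side distance from a learned distribution over bins (that step is omitted here, and the distances are taken as already-reduced scalars). The function names are illustrative, and objectness and class scores, multi-scale outputs, and non-maximum suppression are left out.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_anchor_based(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    """YOLOv5-style decoding. The centre is an offset inside grid cell
    (cell_x, cell_y); width and height rescale a predefined anchor in pixels.
    Returns (cx, cy, w, h) in input-image pixels."""
    cx = (2 * sigmoid(tx) - 0.5 + cell_x) * stride
    cy = (2 * sigmoid(ty) - 0.5 + cell_y) * stride
    w = (2 * sigmoid(tw)) ** 2 * anchor_w
    h = (2 * sigmoid(th)) ** 2 * anchor_h
    return cx, cy, w, h


def decode_anchor_free(left, top, right, bottom, cell_x, cell_y, stride):
    """Anchor-free decoding in the style of YOLOv8: the head predicts distances
    (in stride units) from the cell centre to each side of the box.
    Returns (x1, y1, x2, y2) in input-image pixels."""
    px, py = cell_x + 0.5, cell_y + 0.5
    return ((px - left) * stride, (py - top) * stride,
            (px + right) * stride, (py + bottom) * stride)


# Cell (3, 4) on the stride-8 feature map of a 416 x 416 input
print(decode_anchor_based(0.0, 0.0, 0.0, 0.0, 3, 4, anchor_w=30, anchor_h=60, stride=8))
print(decode_anchor_free(1.0, 1.0, 1.0, 1.0, 3, 4, stride=8))
```

The anchor-based form can only rescale a predefined anchor (here by a factor between 0 and 4), so its box shapes depend on anchors tuned to the dataset; the anchor-free form needs no such priors, which is one reason it transfers more easily to new domains such as thermal imagery.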
Experimental Setup and Training Configuration
To ensure a fair comparison, all YOLO models were trained and tested under identical conditions. Training was performed on a system equipped with an NVIDIA RTX GPU with 12 GB of memory and 32 GB of RAM, using PyTorch as the primary framework [17]. Each model was trained for 100 epochs with a batch size of 16, which balanced training stability against GPU memory limits.
The Adam optimizer was used with an initial learning rate of 0.001, and a learning rate schedule reduced the rate after every 10 epochs without improvement in validation loss [18]. The input image size was fixed at 416 × 416 pixels, allowing the models to process images efficiently while retaining object details [19]. The exact training hyperparameters are listed in Table 2.
To avoid overfitting, early stopping was applied if the validation loss did not improve for 15 consecutive epochs [20]. The data augmentation methods described earlier were also applied during training to increase variability in the input data [21].
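The plateau-based learning-rate reduction and the early-stopping rule described above can both be expressed as counters of non-improving epochs. In this sketch the reduction factor of 0.1 is an assumption (the paper does not state it), the helper names are hypothetical, and the loop replays a synthetic loss sequence instead of training a model. In PyTorch, `torch.optim.lr_scheduler.ReduceLROnPlateau` provides the learning-rate part; its exact counting convention may differ slightly from this sketch.

```python
class PlateauMonitor:
    """Counts consecutive epochs in which the validation loss fails to improve."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True once `patience` non-improving epochs have passed."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def simulate_schedule(val_losses, lr=1e-3, lr_factor=0.1, lr_patience=10, stop_patience=15):
    """Replay a sequence of validation losses; return (stop_epoch, final_lr)."""
    lr_monitor = PlateauMonitor(lr_patience)
    stop_monitor = PlateauMonitor(stop_patience)
    for epoch, loss in enumerate(val_losses):
        if lr_monitor.step(loss):
            lr *= lr_factor            # reduce the learning rate on a plateau
            lr_monitor.bad_epochs = 0  # start counting again after each reduction
        if stop_monitor.step(loss):
            return epoch, lr           # early stopping
    return len(val_losses) - 1, lr


losses = [1.0, 0.9, 0.8] + [0.8] * 20  # synthetic: improvement stops after epoch 2
print(simulate_schedule(losses))
```

With this sequence the learning rate is cut once, after ten flat epochs, and training stops five epochs later when the fifteen-epoch early-stopping patience runs out.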
Evaluation Metrics
To measure performance, we used standard object detection metrics commonly applied in recent benchmarks [12, 13, 21]:
Precision: the fraction of detected objects that were correct, TP / (TP + FP).
Recall: the fraction of actual objects that were successfully detected, TP / (TP + FN).
F1-score: the harmonic mean of precision and recall [22].
mAP (mean Average Precision): averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, as recommended in modern object detection challenges [13, 22].
Inference Time: average processing time per image, in milliseconds, used to evaluate real-time suitability [7].
All models were trained and tested on the same dataset split. Results are reported both before and after augmentation to show the impact of preprocessing.
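The metric definitions above can be made concrete with a few small helpers. This is a simplified sketch, not the authors' evaluation code: it assumes detections have already been matched to ground-truth boxes at a given IoU threshold (the matching step is omitted), it uses all-point interpolation for AP whereas the COCO evaluator samples 101 recall points, and the timing helper measures wall-clock time only; with a CUDA model one would call `torch.cuda.synchronize()` before reading the clock.

```python
import time


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Counts of true positives, false positives and false negatives -> (P, R, F1)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1


def average_precision(recalls, precisions):
    """All-point interpolated AP from PR points sorted by increasing recall."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):  # precision envelope: non-increasing in recall
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# mAP@0.5:0.95 averages AP over these ten IoU thresholds (and over classes)
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def mean_latency_ms(predict, images, warmup=3):
    """Average wall-clock time per image in milliseconds, after a short warm-up."""
    for img in images[:warmup]:
        predict(img)
    start = time.perf_counter()
    for img in images:
        predict(img)
    return (time.perf_counter() - start) * 1000.0 / len(images)


print(precision_recall_f1(tp=8, fp=2, fn=2))
print(IOU_THRESHOLDS)
```

Averaging over ten thresholds rewards tight localization: a detector whose boxes only just pass IoU 0.5 scores well on mAP@0.5 but loses most of its credit at the stricter thresholds.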
Results
Results Before Augmentation
The baseline performance of all YOLO models on the raw dataset is reported in Table 3. YOLOv5 performed strongly, with the highest F1-score before augmentation (81%) and an mAP of 0.81, while YOLOv8 reached an F1-score of 72% [7]. YOLOv9 stood out with the fastest inference speed, at around 21 ms per image [7], making it highly suitable for real-time tasks. Transformer-based models, such as YOLOv10 and YOLOv12, struggled, with lower F1-scores (66% and 61%) than their convolution-based counterparts [8, 15]. YOLOv11 offered balanced performance, with an F1-score of 76% and reasonable speed.
Table 3. Performance Metrics of YOLO Models Before Data Augmentation
Model      F1 Score   PR Curve (Car)   PR Curve (Dog)   PR Curve (Person)
YOLOv5s    81%        83               63               82
YOLOv8s    72%        88               57               85
YOLOv9s    72%        88               57               85
YOLOv10s   66%        86               42               81
YOLOv11s   76%        91               53               88
YOLOv12s   61%        80               27               79
Results After Augmentation
After applying augmentation techniques such as rotation, noise, and contrast adjustment, performance trends shifted, as shown in Table 4. YOLOv8 delivered the highest overall accuracy, reaching an F1-score of 86% and an mAP of 0.85 [7]. YOLOv11 improved steadily, while YOLOv5 dropped from 81% to 73%, indicating sensitivity to augmentation [10]. YOLOv9 remained the fastest, with only modest gains in accuracy [7]. YOLOv10 and YOLOv12 continued to lag, consistent with reports that transformer-heavy designs require large datasets [8, 13].
Table 4. Performance Metrics of YOLO Models After Data Augmentation
Model      F1 Score   PR Curve (Car)   PR Curve (Dog)   PR Curve (Person)
YOLOv5s    73%        86               58               77
YOLOv8s    86%        92               75               87
YOLOv9s    76%        85               72               77
YOLOv10s   70%        83               50               76
YOLOv11s   79%        91               64               85
YOLOv12s   60%        80               44               74
The bar chart in Figure 3, titled "F1 Score Comparison Before and After Data Augmentation," illustrates the impact of data augmentation on the performance of the YOLO models. For most models (YOLOv8, YOLOv9, YOLOv10, and YOLOv11), data augmentation increased the F1 score, indicating improved performance. YOLOv8 showed the largest improvement, with its F1 score rising from 72% to 86%. Conversely, two models scored lower after data augmentation was applied: YOLOv5 dropped markedly, from 81% to 73%, while YOLOv12 decreased only slightly, from 61% to 60%. Overall, the results suggest that while data augmentation can be a powerful tool for enhancing some YOLO models, its effectiveness is not universal, and it can reduce the performance of others.
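The before/after comparison in Figure 3 reduces to a per-model difference between Tables 3 and 4. The short script below recomputes those differences from the tabulated F1 scores (no new data is introduced), which makes the spread explicit: YOLOv8 gains the most and YOLOv5 loses the most.

```python
# F1 scores (%) transcribed from Tables 3 and 4
f1_before = {"YOLOv5s": 81, "YOLOv8s": 72, "YOLOv9s": 72,
             "YOLOv10s": 66, "YOLOv11s": 76, "YOLOv12s": 61}
f1_after = {"YOLOv5s": 73, "YOLOv8s": 86, "YOLOv9s": 76,
            "YOLOv10s": 70, "YOLOv11s": 79, "YOLOv12s": 60}

# Change in percentage points caused by augmentation
deltas = {m: f1_after[m] - f1_before[m] for m in f1_before}

# Print models from largest gain to largest loss
for model, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<9} {f1_before[model]:>3}% -> {f1_after[model]:>3}%  ({d:+d} points)")
```

Reporting the change in percentage points alongside the raw scores separates the effect of augmentation from each model's starting level, which is what distinguishes YOLOv5's 8-point drop from YOLOv12's 1-point dip.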
