
Features extraction for image identification using computer vision

Author: Niyonkuru, Venant; Sekou, Sylla; Sinzinkayo, Jimmy Jackson
Publisher: Zenodo
DOI: 10.5281/zenodo.17548009
Source: https://zenodo.org/records/17548009/files/WJARR-2025-2647.pdf
Corresponding author: Venant Niyonkuru
Copyright © 2025 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.
Features extraction for image identification using computer vision
Venant Niyonkuru 1, *, Sylla Sekou 2 and Jimmy Jackson Sinzinkayo 3
1 Department of Computing and Information System, Kenyatta University, Kenya.
2 Department of Mathematics, Institute of Basic Science, Technology and Innovation, Pan-African University, Kenya.
3 Department of Software Engineering, College of Software, Nankai University, China.
World Journal of Advanced Research and Reviews, 2025, 27(01), 1341-1351
Publication history: Received on 01 June 2025; revised on 12 July 2025; accepted on 14 July 2025
Article DOI: https://doi.org/10.30574/wjarr.2025.27.1.2647
Abstract
This study examines various feature extraction techniques in computer vision, with a primary focus on Vision
Transformers (ViTs) alongside other approaches such as Generative Adversarial Networks (GANs), deep feature models,
traditional approaches (SIFT, SURF, ORB), and contrastive and non-contrastive feature models. Emphasizing ViTs, the
report summarizes their architecture, including patch embedding, positional encoding, and multi-head self-attention
mechanisms, with which they can outperform conventional convolutional neural networks (CNNs). Experimental results
identify the merits and limitations of these methods and their practical applications in advancing computer vision.
Keywords: Feature Extraction; Positional Embeddings; Self-Attention; Vision Transformers (ViTs)
1. Introduction
Feature extraction is a critical stage in computer vision: it is the backbone of transforming large volumes of raw image
data into compact, descriptive representations that enable object detection, image categorization,
segmentation, and scene interpretation. Traditionally, feature extraction methods have developed over time from
the need to create descriptors that are invariant to scaling, rotation, lighting, and perspective, yet computationally
efficient (Dosovitskiy et al., 2020; Jiang, 2009; Lowe, 2004; Grill et al., 2020).
Traditional feature extraction techniques like Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features
(SURF), and Oriented FAST and Rotated BRIEF (ORB) have been instrumental to early computer vision systems
(Lowe, 2004; Rublee et al., 2011; Morrow, 2000). These algorithms engineer features from local image properties,
finding keypoints and constructing descriptors to facilitate matching among different images. While robust in the
majority of scenarios, these hand-crafted features often struggle with complexity and scalability, and
sometimes lack semantic context.
Deep learning transformed feature extraction through the power to learn hierarchical representations directly from data
without needing hand-designed features. Convolutional Neural Networks (CNNs) emerged as the standard by
leveraging local spatial correlation and shared weights, but at the expense of local receptive fields, which limit their
capacity to learn long-range dependencies in images (Dosovitskiy et al., 2020; Ali et al., 2023; Krizhevsky et al.,
2012; Morrow, 2000). Here, Vision Transformers (ViTs) have emerged as a highly promising substitute that brings the
self-attention mechanism of NLP into computer vision (Dosovitskiy et al., 2020). ViTs work by dividing images into fixed-
size patches, flattening them, and linearly embedding them. Positional embeddings help to maintain the spatial
information, and the patch embedding is fed into multi-head self-attention to capture global context (Dosovitskiy et al.,
2020; Patwardhan et al., 2023; Montezol, 2024). This paradigm change helps ViTs capture the relationships of the entire
image, surpassing limitations intrinsic to CNNs and delivering superior performance across a wide array of vision
benchmarks.
Also, newer architectures such as Generative Adversarial Network (GAN)-based models and contrastive learning
techniques have added to the list of tools used to learn semantic features from images (Ali et al., 2024; Cao et al., 2018;
Kovács et al., 2023).
These are aimed at learning discriminative and generative representations that are useful over a broad range of tasks
ranging from image generation to self-supervised learning (Grill et al., 2020; Ansar et al., 2024). This study
comprehensively examines these varied feature extraction approaches, demystifying the principle behind Vision
Transformers and their position within the wider computer vision context.
Experimental results clarify their individual strengths, compromises, and practical usability, sketching the outline of
the best feature extraction approaches to use in real-world applications (Purchase, 2012).
2. Related work
Traditional approaches are SIFT (Lowe, 2004), SURF (Bay et al., 2008), and ORB (Rublee et al., 2011), which have served
as standard baseline approaches to image matching and recognition. These approaches rely on handcrafted descriptors
to obtain local features. They operate well in structured or low-variation visual scenes. However, they can struggle under
strong scale variation, illumination variation, and occlusion. A major breakthrough came with the emergence
of deep learning models, particularly Convolutional Neural Networks (CNNs), which learn discriminative
hierarchical features end-to-end from raw images (Krizhevsky et al., 2012). CNNs were more
generalizable on a wide variety of vision tasks and thus remained the standard for a number of years.
In recent times, Vision Transformers (ViTs) have become strong competitors that are based on self-attention mechanisms
for capturing long-range relations in images (Dosovitskiy et al., 2020). ViTs outperformed CNNs on large-scale
benchmarks, particularly when they were trained on very large datasets. Simultaneously, Generative Adversarial
Networks (GANs) have been utilized not only for image synthesis but also for feature extraction, relying on
discriminators to obtain detailed, high-level features. Furthermore, self-supervised techniques such as the contrastive
SimCLR and the non-contrastive BYOL (Grill et al., 2020) have enhanced feature learning by maximizing the agreement between
multiple copies of an image that are transformed differently.
Recent large-scale surveys (Ali et al., 2023; Patwardhan et al., 2023) cover developments in these architectures,
presenting trends and open questions. However, there are fewer papers providing an explicit comparison of these
different approaches under the same experimental setting. This paper fills this gap by comparing classical descriptors,
CNNs, ViTs, and GAN-based models on an identical setup of popular benchmarks and measures.
3. Methodology
3.1. Vision Transformers (ViTs)
3.1.1. Definition and functionality
Vision Transformers (ViTs) are deep learning models that leverage self-attention mechanisms to process image data,
offering improved performance over traditional convolutional neural networks (CNNs) (Dosovitskiy et al., 2020).
3.1.2. Architecture
ViTs divide an image into fixed-size patches, linearly embed them, and feed them into a transformer encoder. The key
components include:
• Patch Embedding Layer: Converts image patches into token embeddings.
• Positional Encoding: Adds spatial information to tokens.
• Multi-Head Self-Attention: Captures long-range dependencies in an image.
• Feed-Forward Network (FFN): Processes token representations for classification tasks.
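To make the multi-head self-attention component listed above more concrete, the following is a minimal NumPy sketch rather than the implementation used in this study. The function name, the random weight matrices, and the toy sizes (5 tokens, 8 dimensions, 2 heads) are assumptions chosen only for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over a (n_tokens, dim) array."""
    n, d = tokens.shape
    head_dim = d // num_heads

    def split(x):
        # (n, d) -> (heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    # Every token is compared with every other token: (heads, n, n).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    context = weights @ v                              # (heads, n, head_dim)
    merged = context.transpose(1, 0, 2).reshape(n, d)  # concatenate heads
    return merged @ w_o, weights

rng = np.random.default_rng(0)
n_tokens, dim, heads = 5, 8, 2  # toy sizes (assumption)
tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v, w_o = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(4))

out, attn = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, heads)
print(out.shape, attn.shape)  # (5, 8) (2, 5, 5)
```

Each row of `attn` sums to one, so every output token is a weighted mixture of all input tokens, which is how a single layer can relate distant parts of an image.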
Figure 1 Transformer Encoder
How ViTs Work
• The image is broken into non-overlapping patches.
• Each patch is flattened and subsequently passed through a linear projection.
• The transformer encoder converts the patch embeddings through self-attention.
• A classification head produces predictions from the last encoded representation.
3.1.3. Image Patches
The process starts with dividing an image into small, fixed-size patches, which is a simple transformation step. This
process has a direct analogy in natural language processing (NLP), where a sentence is segmented into individual units
such as words or subword tokens. Just as every token within a sentence carries contextual meaning, every patch
within an image captures localized visual context. In this analogy, the entire image is taken as a sentence, and its patches
are akin to tokens, which enables transformer-based models originally designed for text to be used on visual data.
Figure 2 Image to Image Patches
Both vision transformers (ViT) and natural language processing (NLP) models partition large inputs (sentences in text or
entire images) into smaller ones (tokens in text or image patches). For instance, applying attention directly to every pixel
of a 224×224 image would entail an impractically large number of calculations, approximately 2.5 billion pairwise
comparisons (50,176² pixel pairs). By dividing the very same image into 256 patches, each 14×14 pixels, pixel-level
comparisons confined to each patch drop to approximately 9.8 million (256 × 196²), and attention over the 256 patch
tokens themselves needs only 256² = 65,536 pairwise comparisons per attention layer.
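The patch division and the comparison counts quoted above can be checked with a short NumPy sketch. The helper name `split_into_patches` and the all-zero placeholder image are assumptions for illustration; only the 224×224 image size and 14×14 patch size come from the text.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows*cols, p, p, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size, patch_size, c)

image = np.zeros((224, 224, 3), dtype=np.float32)  # placeholder image
patches = split_into_patches(image, 14)

num_pixels = 224 * 224            # 50,176 pixels
num_patches = patches.shape[0]    # 256 patches
pixels_per_patch = 14 * 14        # 196 pixels per patch

pixel_pairs = num_pixels ** 2                             # 2,517,630,976
within_patch_pairs = num_patches * pixels_per_patch ** 2  # 9,834,496
token_pairs = num_patches ** 2                            # 65,536

print(num_patches, pixel_pairs, within_patch_pairs, token_pairs)
```

The 9.8 million figure corresponds to `within_patch_pairs`; once each patch is treated as a single token, one attention layer compares only `token_pairs` pairs.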
Figure 3 Vision transformers
3.1.4. Linear Projection
Following patch division of the image, each patch is flattened from a 2D array into a 1D vector and passed through a linear
projection, effectively projecting raw pixel information into a set of patch embeddings.
Figure 4 Linear Projection
The role of the linear projection layer is to transform each image patch into a fixed-size vector representation, the aim
being to maintain meaningful relations so that visually similar patches produce similar embeddings. This transformation
brings the data into a form compatible with the input format needed by the transformer model. Two further processing
steps remain before these embeddings can be used.
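The flatten-then-project step described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: the embedding width of 64, the random patch values, and the randomly initialised `weight` and `bias` are hypothetical; in a trained ViT these parameters are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 256 patches of 14x14 RGB pixels, embedding width 64 (assumption).
num_patches, patch_size, channels, embed_dim = 256, 14, 3, 64
patches = rng.normal(size=(num_patches, patch_size, patch_size, channels))

# Flatten each patch into a 1D vector of 14 * 14 * 3 = 588 values.
flat = patches.reshape(num_patches, -1)

# Linear projection: one shared weight matrix and bias applied to every patch.
weight = rng.normal(size=(flat.shape[1], embed_dim)) * 0.02
bias = np.zeros(embed_dim)
patch_embeddings = flat @ weight + bias

print(flat.shape, patch_embeddings.shape)  # (256, 588) (256, 64)
```

Because the same matrix is applied to every patch, two patches with identical pixels always receive identical embeddings at this stage.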
3.1.5. Learnable Embeddings
One of the important features added in widely used transformer models such as BERT is the inclusion of a special
classification token, also known as [CLS]. This token is placed at the beginning of every input sequence and is meant to
capture the sentence-level representation for classification tasks.
Figure 5 BERT Tokenizer
There is a unique token, [CLS], in BERT that is added to the beginning of all input sequences. This token is embedded
like any other and passed through the encoder layers of the model. The [CLS] token is special in that it doesn't represent
any specific word of the input; it begins as a neutral or uninitialized vector. In addition, during pretraining, the final
output at the [CLS] position is fed as input to a classification layer. This encourages the model to encode information
from the entire sentence into this single vector, learning an effective representation of the input. Vision Transformers
(ViT) do exactly the same thing with a learnable embedding that serves the same purpose as the [CLS] token in BERT,
providing a summary representation for image-level classification tasks.
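A minimal sketch of prepending such a learnable class embedding to the patch sequence is given below. The sizes, the zero initialisation, and the helper `summary_vector` are assumptions for illustration; a real model learns the class embedding during training and reads it out after the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim = 256, 64  # illustrative sizes (assumption)
patch_embeddings = rng.normal(size=(num_patches, embed_dim))

# Learnable class embedding; zero initialisation is an assumption of this sketch.
cls_token = np.zeros((1, embed_dim))

# Prepend the class embedding so it occupies position 0 of the sequence.
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)
print(tokens.shape)  # (257, 64)

def summary_vector(encoded_tokens):
    """Return the encoder output at the class-token position (index 0)."""
    return encoded_tokens[0]
```

After the encoder has run, `summary_vector` would be applied to its output and the result passed to the classification head.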
Figure 6 Transformer Encoder of Linear Projection
3.1.6. Positioning Embedding
Transformers do not have an inherent perception of the sequence or spatial arrangement of input tokens or patches.
However, preserving order is important in language, where word reordering can dramatically alter meaning. The same
is true of visual information: when the components of a picture are mixed up, as in a jigsaw puzzle, identification of the
whole picture becomes extremely challenging. This is also true of transformer models, which require an additional
mechanism to understand the relative position of these parts.
To address this, positional embeddings are added. In Vision Transformers (ViT), these are learned and of the same
dimension as the patch embeddings. Following the division of the image into patches and the addition of the special
classification token, each element is added to its respective positional embedding. These position vectors are also
trained along with the model and can further be fine-tuned later. They gradually come to denote spatial relationships,
typically becoming most similar for nearby locations in the grid, particularly those in the same column or row, such as:

Figure 7 Embeddings Position
Once positional embeddings are added, the patch embeddings are complete. These enhanced embeddings are then
passed into the Vision Transformer (ViT), and they are processed in the same way as regular tokens in a standard
transformer model.
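The addition of positional embeddings can be sketched in a few lines. The sizes and the random values below are assumptions; the point illustrated is only that two patches with identical content become distinguishable once each has its own position vector added.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim = 257, 64  # 256 patches + 1 class token (illustrative)

tokens = rng.normal(size=(seq_len, embed_dim))
tokens[2] = tokens[1]  # two patches with identical visual content

# One learnable vector per position, with the same width as the token embeddings.
position_embeddings = rng.normal(size=(seq_len, embed_dim)) * 0.02

# Element-wise addition produces the final encoder input.
encoder_input = tokens + position_embeddings

same_before = np.allclose(tokens[1], tokens[2])
same_after = np.allclose(encoder_input[1], encoder_input[2])
print(same_before, same_after)  # True False
```

In a trained model the position vectors are learned rather than random, which is what allows them to encode the row and column structure of the patch grid.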
Implementation
The training dataset consists of 60,000 images across 11 unique classes. In order to obtain the equivalent human-
readable labels of these classes, the following steps may be used:
ClassLabel has 11 classes: ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', ...].
Each entry in the dataset contains two features: `img` and `label`. The `img` feature contains a 32x32 pixel image which
is of type PIL and has three color channels in RGB (red, green, blue).
3.1.7. Feature extraction
Before sending images to the Vision Transformer (ViT) model, a feature extractor is used to handle preprocessing. This
involves resizing and normalizing images, converting them into tensors referred to as "pixel_values."
The feature extractor may be initialized with the Transformers library of Hugging Face, as shown below:
The feature extractor configuration shows that normalization and resizing are set to true. Normalization is performed
across the three color channels using the mean and standard deviation values stored in "image_mean" and "image_std"
respectively.
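The resizing and per-channel normalization described here can be approximated with a small NumPy sketch. This is not the Hugging Face feature extractor itself: the function `preprocess`, the nearest-neighbour resizing, the 224-pixel target size, and the mean and standard deviation of 0.5 per channel are all assumptions made for illustration.

```python
import numpy as np

def preprocess(image_uint8, target_size, image_mean, image_std):
    """Resize an (H, W, 3) uint8 image and normalize each color channel."""
    h, w, _ = image_uint8.shape
    # Nearest-neighbour resize: pick the source row/column for each target pixel.
    rows = np.arange(target_size) * h // target_size
    cols = np.arange(target_size) * w // target_size
    resized = image_uint8[rows][:, cols]

    # Rescale to [0, 1], then normalize per channel: (x - mean) / std.
    scaled = resized.astype(np.float32) / 255.0
    mean = np.asarray(image_mean, dtype=np.float32)
    std = np.asarray(image_std, dtype=np.float32)
    normalized = (scaled - mean) / std

    # Channels-first layout, as commonly used for "pixel_values" tensors.
    return normalized.transpose(2, 0, 1)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in 32x32 RGB image

# Hypothetical configuration values, not taken from the paper.
pixel_values = preprocess(image, 224, (0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
print(pixel_values.shape)  # (3, 224, 224)
```

With a mean and standard deviation of 0.5, pixel intensities in [0, 255] are mapped to the range [-1, 1].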
Therefore, it is optimal to use an image that is slightly larger than needed, since reducing by a small amount usually
preserves visual quality and avoids introducing visible degradation in image quality.
Evaluation and Prediction
The Trainer evaluates during training, but we can also quickly do a more qualitative verification (or estimation) by
passing a single image through the model and feature_extractor.
We will pass the following image:
The picture is of poor visual quality and does not have distinguishing features, so visual categorization based on the
picture is difficult.
However, the label given classifies the subject as a cat. We will now go ahead and test the model's prediction for this
picture.