Journal of Embedded & Digital System Design
2025, VOL. 01, NO. 1, 1–11
https://doi.org/10.5281/zenodo.17678700
Low Cost FPGA Implementation of Convolutional Neural Network Based Image Classifier
Shirshendu Roy1, Sainath Reddy K2, Harshith GV3, Rakshith R4, and Prajwal M5
1-5Electronics and Communication Department, Dayananda Sagar University, Bengaluru, India
Abstract
Artificial intelligence and machine learning (AI-ML) algorithms have given a new direction to the problem of image classification, and most practical applications now apply AI-ML algorithms to it. The convolutional neural network (CNN) and its varieties rapidly became researchers' first choice for computer vision related applications. Recently, many implementations of hardware accelerators for CNN have been reported in the literature. The current work exploits the opportunities for improvement and proposes a novel very large scale integrated circuit (VLSI) architecture of the classic CNN model for classification of gray-scale images. The proposed architecture is validated using a field programmable gate array (FPGA) platform for classification of handwritten digits and hand gestures. The architecture is implemented on both Artix7 and Zynq FPGA boards. This work achieves 96% classification accuracy for digit detection and 97% accuracy for gesture images using the same CNN model with pre-defined filters in the convolution stage. The proposed architecture consumes fewer hardware resources than state-of-the-art works by using a single vector multiplication unit (VMU) for both the convolution-pooling stage and the fully connected network. The architecture supports parallel convolution and pooling operations and achieves a processing speed of ≈16 µs per image frame of size 28×28. Also, the architecture is scalable and supports deep learning where more convolution-pooling stages may be used.
Keywords: Convolutional Neural Network, Hand-Written Digit Recognition, Gesture Detection, Machine Learning, Field Programmable Gate Array (FPGA)
1 Introduction
Image classification is one of the prominent research problems in the domain of machine learning (ML). A few applications are categorization of vehicles [1] for traffic control, handwritten digit recognition [2], object detection [3], classification of RADAR images [4] etc. Various algorithms and techniques are reported in the literature for efficient classification of images. Out of these ML algorithms, the convolution neural network (CNN) is a crucial neural network (NN) for image classification. Over the past few years, many researchers have tried to implement the CNN technique on hardware platforms. Initially, CNN was implemented on central processing units (CPUs) and then realized using graphics processing units (GPUs). But it has been noticed that reconfigurable hardware like field programmable gate arrays (FPGAs) has more power to accelerate the parallel processing involved in CNN. Gradually, many research works were published focusing on FPGA implementation of CNN.
A few implementations are either ARM controller based or realized using high-level synthesis (HLS). A high-level language based accelerator for the convolution layer is reported in [5]. Another Vivado HLS based implementation is reported in [6]. The SkyNet model of CNN is implemented on an ARM processor based FPGA in [7]. A FIFO based accelerator for CNN is presented in a PYNQ board based implementation [8]. An ARM controller based accelerator for CNN is reported in [9]. Another work on an FPGA based CNN accelerator is reported in [10]. Authors in [11] tried to optimize the FPGA implementation of a CNN accelerator by exploiting sparsity in the weight matrix and also by using hierarchical memory organization. CNN is implemented on FPGA through Vivado HLS software for traffic light image classification in [12]. A handwritten digit classifier is implemented on FPGA using the SIMULINK platform in [13]. A ZCU102 development board based specialized accelerator is reported in [14] that supports parallel execution of convolution layers. A ResNet-like structure of CNN is implemented on a ZCU102 device for RADAR signal processing in [4] using HLS.
Authors in [2] implemented CNN on an Intel FPGA for detecting handwritten digits using the MNIST dataset and achieved 90% accuracy. Another FPGA implementation of CNN is reported in [15] for pattern recognition. An FPGA implementation of the processing element unit in a CNN accelerator using a modified Booth multiplier and Wallace tree adder on Uniwig architecture is reported in [16]. Researchers have presented an FPGA based CNN module for the recognition of traffic signs in advanced driver assistance systems (ADAS) in [1]. CNN is implemented on FPGA for face recognition in [17]. Another FPGA implementation of CNN is reported in [18]. A multi-stage dataflow implementation of a CNN accelerator for 3-D images for object recognition is presented in [19]. The FPGA implemented CNN processor reported in [3] is used for unmanned aerial vehicle (UAV) object detection. A CNN accelerator is used for environmental sound classification and implemented on FPGA in [20]. A version of the MobileNet structure is implemented on FPGA for image classification in [21]. CNN can also be used for pattern detection from images [22].
Manuscript received 03 September, 2025; accepted 03 November, 2025; Date of publication 21 November, 2025; (Corresponding author: Shirshendu Roy, e-mail: shi shendu o[email protected]).

Authors in [23] tried to reduce the computational complexity using Winograd's 2-D minimal filtering algorithm.
In [24], authors have demonstrated a stochastic based deep neural network system that has nearly the same accuracy as conventional binary implementations. A CNN accelerator is proposed in [25] where multiplication operations are replaced with shift operations to reduce complexity. A roofline-model-based method to accelerate the performance of an FPGA implemented CNN processor is reported in [26] where authors used floating point to represent the data.
An efficient FPGA implementation of the AlexNet CNN structure is reported in [27] for real time object detection. Here, authors have shown accelerators for both convolution and fully connected layers. Systolic multipliers were used in [28] to improve the performance of matrix multiplications in CNN. The ZynqNet CNN architecture is implemented on FPGA to increase the processing speed in [29]. The VGG16-SVD model of CNN is implemented on an embedded FPGA platform like ZYNQ in [30]. An application specific integrated circuit (ASIC) implementation of a reconfigurable processor for deep neural networks (DNN) is reported in [31] which implements the VGG-16 and AlexNet architectures of CNN.
Even though plenty of works are reported in the literature on hardware implementation of CNN based image classification, there is scope for improvement. The major contributions of this research are
• An innovative hardware accelerator for CNN is presented which is very fast and consumes minimal hardware resources.
• The proposed hardware accelerator supports any number of convolution-pooling layers and thus supports DNN based image classification.
• A method of sharing hardware for vector inner products in the convolution stage as well as in the fully connected layers is proposed.
• Maximum resource sharing and the proposed serial-parallel accelerator for the convolution-pooling step make the proposed architecture efficient in terms of resource utilization and power consumption compared to other works.
The manuscript is organized in five sections. The literature review of state-of-the-art works on hardware implementation of CNN is presented in Section 1. The theoretical background behind the CNN technique is discussed in Section 2. Section 3 presents the proposed work in detail and all the architectures are illustrated there. Section 4 discusses the experimental setup and the proposed design is also analysed in this section. Lastly, conclusive remarks are made in the last section (Section 5).
2 Theoretical Background
Many models of CNN for image classification have been proposed over the years like LeNet, AlexNet, VGGNet, MobileNets etc. A basic LeNet kind of model is implemented in this work which is good for detecting sparse images like handwritten digits and gestures. CNN based image classification is divided into two major blocks which are the convolution-pooling network (CPN) and the fully connected network (FCN). The overall CNN model structure is shown in Fig. 1 and many basic applications use this kind of CNN model. A square image is considered here to simplify the illustration but images can be of any size in practice.
In the CPN block, there may be many two dimensional convolution stages. In a convolution stage, a 2-D image ($I \in \mathbb{R}^{n \times n}$) is convolved with a filter ($f$) of size $\lambda \times \lambda$, resulting in another image $I_c$. The function for 2-D convolution is shown in Algorithm 1. The size of the filter can vary from application to application. Common choices are $3 \times 3$, $5 \times 5$, $7 \times 7$ etc.
Each convolution stage is associated with a pooling stage or sub-sampling stage. The convolved image ($I_c$) is converted to a sub-sampled image ($I_p$) by finding the maximum or minimum, or by performing an averaging operation, on a particular window. The max pooling technique is adopted in this work and a general function is shown in Algorithm 2. The size of this window is famously known as the stride ($s$). A stride of 2 is selected here, which means the window size is $2 \times 2$. An $n \times n$ image is converted to $\frac{n}{2} \times \frac{n}{2}$ after the pooling operation.
The overall CNN algorithm is shown in Algorithm 3. The CPN block provides a flattened vector ($y \in \mathbb{R}^{n_5 \times 1}$) to the FCN block. The FCN block consists of two stages, an input layer with $n_6$ nodes and an output layer with $n_7$ nodes. The input layer takes the vector $y$ and produces another vector ($z_1$) of size $n_6 \times 1$. The flattened vector $y$ is multiplied by a weight matrix $W_1 \in \mathbb{R}^{n_5 \times n_6}$ and added to a bias vector $b_1 \in \mathbb{R}^{n_6 \times 1}$.
Activation functions are an integral part of CNN models. Different features are extracted from the images based on the variety of activation functions. Activation functions are applied to the output of every layer in the FCN block. The CPN block also sometimes uses activation functions for better results. Commonly used activation functions are the rectified linear unit (ReLU), the sigmoid function, the softmax function etc. The proposed work uses the ReLU function because of its lower complexity in hardware implementation. Vector $z_1$ from the input layer is passed through a ReLU function to produce another vector $v_1$.
Similar computation is carried out in the output layer also. The output layer receives the vector $v_1$ and produces another vector $z_2$ of size $n_7 \times 1$. The vector $v_1$ is multiplied by the weight matrix $W_2 \in \mathbb{R}^{n_6 \times n_7}$ and added to the bias vector ($b_2$) of size $n_7 \times 1$ to produce the output vector $z_2$, which is again passed to the ReLU function to generate the vector $v_2$. Finally, the vector $v_2$ is passed to a sort function which finds the index of the maximum value present in $v_2$. This index represents the detected class of the image database.

Figure 1: Overall model of convolutional neural network for handwritten digit recognition. [Block diagram: image ($n \times n$), Convolution1 ($4 \times n_1 \times n_1$), Pooling1 ($4 \times n_2 \times n_2$), Convolution2 ($16 \times n_3 \times n_3$), Pooling2 ($16 \times n_4 \times n_4$), Flattening ($n_5 \times 1$), I/P Layer ($n_6$ nodes), O/P Layer ($n_7$ nodes); grouped into the Convolution and Pooling Network (CPN) and the Fully Connected Network (FCN).]

Algorithm 1 Function for 2-D Convolution ($I_c = \mathrm{conv}(I, f)$)
Input: Input image $I \in \mathbb{R}^{n \times n}$ and filter kernel $f \in \mathbb{R}^{\lambda \times \lambda}$.
Output: Convolved image $I_c \in \mathbb{R}^{(n-\lambda+1) \times (n-\lambda+1)}$.
for $i \leftarrow 1$ to $(n-\lambda+1)$ do
    for $j \leftarrow 1$ to $(n-\lambda+1)$ do
        $tmp \leftarrow 0$
        for $k \leftarrow 1$ to $\lambda$ do
            for $l \leftarrow 1$ to $\lambda$ do
                $tmp = tmp + I(i+k-1,\ j+l-1) \cdot f(k, l)$
            end for
        end for
        $I_c(i, j) = tmp$
    end for
end for
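Algorithm 1 can be sketched in software as follows. This is a minimal illustration in plain Python, not the hardware design; the function name `conv2d`, the 0-based indexing, the row-major reading of the filter $f_1$ from Eq. (1), and the 5×5 ramp test image are assumptions made for the example.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution without kernel flip, following Algorithm 1."""
    n = len(image)
    lam = len(kernel)
    out_size = n - lam + 1  # (n - lambda + 1) output rows and columns
    out = [[0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            acc = 0
            for k in range(lam):
                for l in range(lam):
                    acc += image[i + k][j + l] * kernel[k][l]
            out[i][j] = acc
    return out


# f1 from Eq. (1), assumed to be listed row by row
f1 = [[1, 0, -1],
      [1, 0, -1],
      [1, 0, -1]]

# Hypothetical test image: every row is the ramp 0, 1, 2, 3, 4
image = [[c for c in range(5)] for _ in range(5)]
result = conv2d(image, f1)
```

On the ramp image every 3×3 window gives the same response, and a 28×28 input yields a 26×26 output, which agrees with $n_1 = 26$ used later in the experiments.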
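Algorithm 2 Function for max pooling ($I_p = \mathrm{max\_pool}(I_c, s)$)
Input: Input image $I_c \in \mathbb{R}^{n \times n}$ and stride number ($s$). $s$ is even for even $n$ and odd for an odd value of $n$.
Output: Image after max pooling $I_p \in \mathbb{R}^{(n/s) \times (n/s)}$.
for $i \leftarrow 1$ to $(n/s)$ do
    for $j \leftarrow 1$ to $(n/s)$ do
        $tmp \leftarrow 0$
        Set $rw = ((i-1) \cdot s + 1) : (i \cdot s)$
        Set $cl = ((j-1) \cdot s + 1) : (j \cdot s)$
        $I_p(i, j) = \max(I_c(rw, cl))$
    end for
end for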
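A software counterpart of Algorithm 2 is sketched below. It is an illustrative model only; the name `max_pool`, the 0-based indexing, and the 4×4 test image are assumptions, and rows or columns that do not fill a complete window are simply dropped.

```python
def max_pool(image, s=2):
    """Non-overlapping s x s max pooling of a square image (Algorithm 2)."""
    n = len(image)
    m = n // s  # number of output rows and columns
    return [
        [
            max(
                image[r][c]
                for r in range(i * s, (i + 1) * s)
                for c in range(j * s, (j + 1) * s)
            )
            for j in range(m)
        ]
        for i in range(m)
    ]


# Hypothetical 4x4 convolved image
ic = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
ip = max_pool(ic, 2)
```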
3 Proposed Work
The proposed architecture is shown in Fig. 2 and it has three major blocks viz. the vector multiplication unit (VMU), the CPN block, and the FCN block. The VMU block is shared by both the CPN and FCN blocks. The input image can be directly fed to the CPN block or can be passed through a pre-processing block which is not shown here. Here, vector $p$ denotes the pixels belonging to the $\lambda \times \lambda$ window selected during convolution. All three blocks are described in the three sections below.
Figure 2: Overall proposed architecture of image classification using CNN. [Block diagram with three blocks: Convolution and Pooling Network, Vector Multiplication Unit, and Fully Connected Network; labelled signals: in_image, p, ip1 to ip4, mb_out2, idx.]

Algorithm 3 Pseudo code for CNN based image classification.
Input: Input image $I \in \mathbb{R}^{n \times n}$, stride value ($s$), and filters ($f_1, f_2, \cdots, f_8$).
Output: Classified image index ($idx$).
for $j \leftarrow 1$ to 2 do
    if $j = 1$ then
        for $i \leftarrow 1$ to 4 do
            $I^{j}_{c_i} = \mathrm{conv}(I, f_i)$
            $I^{j}_{p_i} = \mathrm{max\_pool}(I^{j}_{c_i}, 2)$
        end for
    else
        for $k \leftarrow 1$ to 4 do
            for $l \leftarrow 1$ to 4 do
                $I^{j}_{c_{kl}} = \mathrm{conv}(I^{(j-1)}_{p_k}, f_{l+4})$
                $I^{j}_{p_{kl}} = \mathrm{max\_pool}(I^{j}_{c_{kl}}, 2)$
            end for
        end for
    end if
end for
$z_1 = W_1 \cdot y + b_1$
$v_1 = \mathrm{ReLU}(z_1)$
$z_2 = W_2 \cdot v_1 + b_2$
$v_2 = \mathrm{ReLU}(z_2)$
$idx = \mathrm{sort}(v_2)$

3.1 Vector Multiplication Unit
A VMU block is proposed in this work which performs the vector-multiplication operations involved in the FCN block and also performs the multiplication between the image pixels in a particular window and the filter co-efficients. Thus the same VMU block, shown in Fig. 3, is shared by the CPN and FCN blocks. This way maximum resource sharing is obtained.
The whole VMU block is capable of multiplying two vectors of length 64 and is divided into four IP_16 blocks. Each IP_16 block is capable of multiplying two vectors of length 16. The IP_16 blocks are used during the convolution process and the whole VMU block is used during computation of the fully connected layers. The VMU has a total of 5 outputs, of which four inner product outputs come from the IP_16 blocks. The latency of the IP_16 blocks is $l_{ip} = 6$ clock cycles and the total latency ($l_{vmu}$) of the VMU block is 8 clock cycles.

Figure 3: Proposed vector multiplication unit. [Block diagram: four IP_16 blocks with input pairs (a1:16, b1:16), (a17:32, b17:32), (a33:48, b33:48), (a49:64, b49:64); labelled outputs ip1 to ip4 and p.]
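The sharing of the VMU between the two modes can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the RTL: the names `ip16`, `vmu`, `pad16` and `conv_mode` are hypothetical, the 9 window pixels and 9 filter co-efficients are assumed to be zero-padded to length 16, pipeline latency is ignored, and the filters $f_1$ to $f_4$ are taken from Eqs. (1) to (4).

```python
def ip16(a, b):
    """Inner product of two length-16 vectors (one IP_16 block)."""
    assert len(a) == 16 and len(b) == 16
    return sum(x * y for x, y in zip(a, b))


def vmu(a, b):
    """Length-64 inner product built from four IP_16 blocks.

    Returns the four partial inner products and their sum, i.e. the
    five outputs mentioned in the text.
    """
    assert len(a) == 64 and len(b) == 64
    partial = [ip16(a[k:k + 16], b[k:k + 16]) for k in range(0, 64, 16)]
    return partial, sum(partial)


def pad16(values):
    """Zero-pad a short vector to length 16 (assumed padding scheme)."""
    return list(values) + [0] * (16 - len(values))


def conv_mode(window, filters):
    """Convolution mode: one 3x3 window against four filters in parallel."""
    a, b = [], []
    for f in filters:
        a += pad16(window)
        b += pad16(f)
    partial, _ = vmu(a, b)
    return partial


f1 = [1, 0, -1, 1, 0, -1, 1, 0, -1]
f2 = [1, 1, 1, 0, 0, 0, -1, -1, -1]
f3 = [1, 2, 1, 0, 0, 0, -1, 2, -1]
f4 = [-1, 2, -1, 0, 0, 0, 1, 2, 1]

# Hypothetical window holding the pixel values 1..9
conv_out = conv_mode([1, 2, 3, 4, 5, 6, 7, 8, 9], [f1, f2, f3, f4])

# FCN mode: full length-64 inner product
partial, total = vmu(list(range(64)), [1] * 64)
```

In convolution mode only the four partial outputs are of interest, whereas in FCN mode the summed output represents one phase of a longer inner product.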
3.2 Convolution-Pooling Network
The proposed architecture of the CPN block is shown in Fig. 4. Initially the input image is written into the img_ram block from the image sensors. Image pixels are serially sent to the window_op block for window formation. The pixels from the $k \times k$ window are sent in parallel to the IP_16 blocks for computing convolution values. The convolved image pixels are simultaneously sent to the mx_pool blocks. Four mx_pool blocks are placed corresponding to the four IP_16 blocks. Convolution and pooling operations run in parallel with a gap of a few clock cycles.
The output data samples from the first convolution-pooling stage are written to membank_a, which has four memory elements to support the four pooling blocks. The ram blocks of membank_a are read serially. Data samples from ram1 are read and fed to the img_ram block again to start the next phase of convolution. The output of the next phase of convolution again goes through the IP_16 and mx_pool blocks. The final output data samples of the next convolution stage are written to membank_b, which also has four memory elements. Once the convolution operation is completed for the data samples stored in ram1, the 2nd phase convolution operation is started on the samples stored in ram2. This way all the elements of membank_a are read. All the outputs of the 2nd convolution-pooling stage are written to membank_b.
The pooling block is placed just after the IP_16 blocks to reduce the storage requirement. The size of membank_a is larger than the size of membank_b. The major operations that this block performs are 2-D convolution, pooling and temporary data storage. All these operations are explained below in detail.
3.2.1 Convolution The proposed architecture for the 2-D convolution operation is depicted in Fig. 4. The 2-D convolution operation can be divided into two steps, the windowing operation and the multiplication of window pixels with filters. The multiplication of the pixels belonging to a particular window and the filter co-efficients is performed by the IP_16 blocks. The architecture supports simultaneous 2-D convolution with four $\lambda \times \lambda$ filters, where $\lambda = 3$ is chosen for this work. The advantage of simultaneous convolution is that the same image is not required to be read every time. If the convolutions were done separately, the input image would have to be read four times. This way the window operation time on the image is saved but the number of IP_16 blocks is increased.
The convolution architecture is mainly based on the architecture of the window_op block which is shown in Fig. 5. The low-cost architecture reported in [32] is adopted in this work. The register based window operation architecture reported in [17] is not flexible for variable image sizes. This is why a memory based architecture is preferred, which supports any image size by changing the size of the address counters. This block forms the window of $\lambda \times \lambda$ pixels and these pixels are sent to the IP_16 blocks. For convolution, the outputs from the input image are computed in parallel through the four IP_16 blocks.
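The memory based windowing idea can be modelled in software as a stream of pixels passing through row buffers and a small register array. The sketch below is only one plausible reading of such a scheme and is not taken from [32]; the function name `stream_windows`, the use of $\lambda - 1$ row buffers, and the shift-register update order are all assumptions.

```python
def stream_windows(image, lam=3):
    """Form lam x lam windows from a raster-order pixel stream.

    Assumed structure: (lam - 1) row buffers hold the previous rows and a
    lam x lam register array is shifted by one column per incoming pixel.
    """
    n = len(image)
    row_bufs = [[0] * n for _ in range(lam - 1)]  # oldest row first
    regs = [[0] * lam for _ in range(lam)]        # window registers
    windows = []
    for r in range(n):
        for c in range(n):
            pixel = image[r][c]
            # New window column: buffered pixels above, plus the new pixel
            column = [row_bufs[k][c] for k in range(lam - 1)] + [pixel]
            # Shift every register row left and insert the new column
            for k in range(lam):
                regs[k] = regs[k][1:] + [column[k]]
            # Age the row buffers at this column address
            for k in range(lam - 2):
                row_bufs[k][c] = row_bufs[k + 1][c]
            row_bufs[lam - 2][c] = pixel
            # A window is valid once lam rows and lam columns have arrived
            if r >= lam - 1 and c >= lam - 1:
                windows.append([row[:] for row in regs])
    return windows


# Hypothetical 4x4 image with pixel value = 4 * row + column
img = [[4 * r + c for c in range(4)] for r in range(4)]
wins = stream_windows(img)
```

Because the buffers are addressed by the column counter, a different image width only changes the buffer depth and the counter range, which matches the flexibility argument made in the text.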
This work uses pre-defined filters [33] where four filters are used in the first stage and four filters are used in the second stage. The first stage filter co-efficients are
$f_1 = \{1, 0, -1, 1, 0, -1, 1, 0, -1\}$ (1)
$f_2 = \{1, 1, 1, 0, 0, 0, -1, -1, -1\}$ (2)
$f_3 = \{1, 2, 1, 0, 0, 0, -1, 2, -1\}$ (3)
$f_4 = \{-1, 2, -1, 0, 0, 0, 1, 2, 1\}$ (4)
The second stage filter co-efficients are
$f_5 = \{-1, -1, -1, 2, 2, 2, -1, -1, -1\}$ (5)
$f_6 = \{-1, 2, -1, -1, 2, -1, -1, 2, -1\}$ (6)
$f_7 = \{2, -1, -1, -1, 2, -1, -1, -1, 2\}$ (7)
$f_8 = \{-1, -1, 2, -1, 2, -1, 2, -1, -1\}$ (8)
The multiplication operation between a filter and an image window is performed by the IP_16 blocks. Selection of filters in different stages is performed by multiplexer banks placed before the VMU block. For an $n \times n$ image, $(n-1)$ rows will be stored in the temporary buffer memory. In this work, a $3 \times 3$ window is used for convolution and the IP_16 block has 4 pipeline stages. The convolution operation has a total latency of $l_{ip} + 3$ clock cycles where $l_{ip}$ is the latency of the IP_16 block. Thus the total convolution process takes $t_{cnv} = n^2 + l_{ip} + 3$ clock cycles for completion.
Figure 4: Proposed architecture of convolution and pooling in CNN. [Block diagram: img_ram, window_op, four IP_16 blocks, four mx_pool blocks, membank_a (ram1 to ram4) and membank_b (ram5 to ram8); labelled signals: in_image, p1 to p9, mb_out1, mb_out2.]

Figure 5: Windowing architecture for 2-D convolution [32]. [Block diagram: two row_tmp memories, nine registers, and multiplexers producing the window pixels p1 to p9.]

3.2.2 Pooling A simple architecture for max pooling is proposed in this work and shown in Fig. 6. It uses one temporary memory which is capable of storing one row of the image. Only alternate rows of the convolved image are written into the memory. For example, the 1st row is first written to the memory. The second row then goes directly to the registers and simultaneously the first row is read. In the second cycle the 3rd row is stored in the memory and the 4th row is fed directly to the registers. The pooling block computes the maximum of all pixels present in a $2 \times 2$ window. A basic network (BN) block architecture, shown in Fig. 10, is designed to calculate the maximum of two elements. The first BN block computes the maximum of the first two elements of the $2 \times 2$ window and stores it in a register. In the next clock cycle, the first BN block computes the maximum of the remaining two elements. The second BN block compares the previously stored maximum value with the newly computed value. The pooling operation is done by serially accessing the rows of the convolved images. For an $n \times n$ image, $n/2$ columns are required to be stored in temporary memory while $n/2$ columns are read directly. Thus a total of $t_{pl} = n^2 + l_{pl}$ clock cycles are required to complete the pooling process with a latency of $l_{pl} = (2n + 2)$ clock cycles.

Figure 6: Proposed structure for max pooling. [Block diagram: row_tmp1 memory, registers, two BN blocks; input data_in, output max.]
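The row-buffered pooling scheme can be mimicked in software as follows. This is a behavioural sketch, not the circuit of Fig. 6: the names `bn_max` and `stream_max_pool` are hypothetical, and the pairing of elements inside the $2 \times 2$ window (stored-row pixel with direct-row pixel, column by column) is an assumption, since the text does not fix the order.

```python
def bn_max(a, b):
    """Basic network (BN): comparator plus multiplexer, returns the larger input."""
    return a if a > b else b


def stream_max_pool(rows):
    """2x2 max pooling on rows that arrive one after another.

    Rows 1, 3, 5, ... are written to a one-row memory; rows 2, 4, 6, ...
    are consumed directly together with the stored row.
    """
    row_mem = None
    pooled = []
    for idx, row in enumerate(rows):
        if idx % 2 == 0:
            row_mem = list(row)  # store the alternate row
            continue
        out_row = []
        for c in range(0, len(row) - 1, 2):
            first = bn_max(row_mem[c], row[c])            # first BN, held in a register
            second = bn_max(row_mem[c + 1], row[c + 1])   # first BN, next clock
            out_row.append(bn_max(first, second))         # second BN
        pooled.append(out_row)
    return pooled


# Hypothetical 4x4 convolved image delivered row by row
ic = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
pooled = stream_max_pool(ic)
```

Only one row needs to be buffered at any time, which is the storage saving that the text attributes to the proposed structure.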
3.3 Fully Connected Network
The proposed architecture of the FCN block is shown in Fig. 7. The serial-parallel architecture supports any number of fully connected layers. The same VMU block is used in this architecture to do the vector-matrix multiplications. The multiplication between the input vector and the weight matrix is performed in many phases to reduce hardware consumption. Four 2:1 multiplexer banks are placed just before the VMU block. The control signal cnv_mode selects the image pixels and the filter co-efficients. The control signal nxt_cnv selects the 2nd set of filter co-efficients. Big arrows denote vectors in the multiplexer banks.
The FCN block receives flattened vector samples from the membank_b block of the convolution and pooling block. Memory elements in the membank_b block are read serially. Serial data samples are converted to a parallel data vector through two register banks. Data samples are written to reg_bank1 first and then to reg_bank2. Again the remaining samples are written to reg_bank1 and then to reg_bank2. The control signal phase selects between the two register banks. A gap between two writing cycles is maintained to support the consecutive reading of weight matrix blocks from the weight memory block. The other blocks in the FCN block are described below.
3.3.1 Weight Memory The weight matrices are stored in the weight_mem memory block. The weight matrices are constant and thus only read only memory (ROM) elements are used in this memory block. The multiplication of vectors of bigger sizes is folded by the VMU unit to a lesser size. Thus the weight memory has to be organized properly. The organization of the weight_mem memory block is shown in Fig. 8 for the handwritten digit recognition problem. Both the $W_1$ and $W_2$ weight matrices are stored in a single memory block. A total of 64 ROM elements are used here and each ROM can store 1100 words. The shaded area denotes the memory locations filled with zeros.
Figure 7: Proposed architecture of the FCN block. [Block diagram: reg_bank1 and reg_bank2 fed by mb_out2 from membank_b, weight_mem, the multiplexer bank (mux_bnk) with filter inputs $\{f_{11}, \cdots, f_{14}\}$ and $\{f_{21}, \cdots, f_{24}\}$, the VMU with inputs a and b and output p, the accumulator, the activation function, and the sort block producing idx.]

Figure 8: Proposed arrangement of the weight memory. [Memory map: 64 ROM columns holding $W_1$ and $W_2$; address markers 1, 120, 840, 960, 1080, 1090, 1100; segments m1 to m11.]

3.3.2 Accumulator Block Multiplication between a vector and the weight matrix is performed in many phases with the help of the accumulator block shown in Fig. 9. The partial inner products are stored in a temporary memory block named mem_temp in all phases except the last phase. The temporary products are read from the mem_temp block and added to the current inner products. Once the final vector-matrix product is obtained, bias values are read from the mem_bias block and added. The accumulator needs one clr signal to clear the register at the initial stage and one multiplexer is placed to send zeros in the last phase of accumulation. Here a single mem_bias memory holds all the bias vectors.
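The phased multiplication and accumulation can be sketched as below. This is a software illustration only; the names `num_phases` and `fcn_layer` are hypothetical, the weights are assumed to be stored with one row per output node, and a short final chunk stands in for the zero-filled memory locations of Fig. 8.

```python
VMU_WIDTH = 64  # vector length handled by the VMU in one phase


def num_phases(n_in, width=VMU_WIDTH):
    """Number of VMU passes needed for an input vector of length n_in."""
    return (n_in + width - 1) // width


def fcn_layer(weights, y, bias, width=VMU_WIDTH):
    """Folded matrix-vector product with accumulation and bias addition.

    weights[o] is assumed to hold the weights feeding output node o.
    mem_temp models the temporary memory for partial inner products.
    """
    n_out = len(weights)
    mem_temp = [0] * n_out
    for p in range(num_phases(len(y), width)):
        lo = p * width
        y_chunk = y[lo:lo + width]
        for o in range(n_out):
            w_chunk = weights[o][lo:lo + width]
            # zip stops at the shorter chunk, equivalent to zero padding
            mem_temp[o] += sum(a * b for a, b in zip(y_chunk, w_chunk))
    # Bias is added once the last phase is complete
    return [acc + b for acc, b in zip(mem_temp, bias)]


# Hypothetical data: 576 inputs (as in the digit example) and 4 output nodes
y = [i % 7 for i in range(576)]
weights = [[(o + i) % 5 - 2 for i in range(576)] for o in range(4)]
bias = [1, -1, 2, 0]

folded = fcn_layer(weights, y, bias)
direct = [sum(w * v for w, v in zip(row, y)) + b for row, b in zip(weights, bias)]
```

With a 64-wide VMU, a 576-element input needs 9 phases and a 120-element input needs 2 phases, which agrees with the phase counts quoted for the input and output layers.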
3.3.3 Activation Function The ReLU activation function is used in all the layers of the FCN block. This is because of the simplicity of the ReLU function in hardware implementation, while the accuracy also stays within an acceptable range. The ReLU function is given by
$f(x) = \max(0, x)$ (9)
The ReLU function returns $x$ if $x$ is greater than zero and returns zero otherwise. A single multiplexer is enough to implement the ReLU function.
Figure 9: Proposed architecture of the accumulator block. [Block diagram: two adders, registers, mem_temp, mem_bias, and a multiplexer with a zero input.]

3.3.4 Sort Block A sort block, shown in Fig. 10, is placed at the final stage of the FCN block. Once the computation is completed for all the layers in the FCN block, the serial data samples from the activation function block are fed to the sort block. The indx_in input passes digit indices from 0 to 10 and must be in sync with the data stream (data_in). This block gives the index of the maximum data present in the serial stream. The index (idx) denotes the identified class of the images. The major components of the BN block are a comparator and a multiplexer. Indices are registered in the dc block using the less than (lt) control signal.
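The behaviour of the ReLU stage followed by the sort block can be sketched as a running maximum over a serial stream. The code is a behavioural model under assumptions: the names `relu` and `sort_block` are hypothetical, the `lt` flag is modelled as a strict comparison so that the first of several equal maxima is kept, and the sample values are invented for the example.

```python
def relu(x):
    """ReLU as a 2:1 selection between x and zero."""
    return x if x > 0 else 0


def sort_block(data_in, indx_in):
    """Return the index that accompanies the largest sample of a serial stream."""
    max_reg = None  # register holding the running maximum
    idx_reg = None  # register holding the matching index
    for sample, index in zip(data_in, indx_in):
        lt = max_reg is None or max_reg < sample  # stored value is less than new sample
        if lt:
            max_reg, idx_reg = sample, index
    return idx_reg


# Hypothetical output-layer values for the ten digit classes
z2 = [-3, 5, 0, 12, 7, -1, 12, 4, 2, 9]
v2 = [relu(z) for z in z2]
idx = sort_block(v2, range(10))
```

No full sorting is required, since a single comparator and two registers suffice to track the maximum and its index.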
Figure 10: Proposed architecture of the sort block. [Block diagram: a BN block built from a comparator (COMP) and a multiplexer, registers, and the dc block; labelled signals: data_in, indx_in, clr, lt.]

3.4 Support for Deep Neural Network
The proposed architecture can be easily adapted for DNN based image classification using CNN. In the classification of more complex images, CNN may use a larger number of convolution and pooling layers. Also, one or more hidden layers may exist in the FCN. The proposed architecture uses two kinds of memory banks, membank_a and membank_b. The present work uses two convolution-pooling layers. The output of the first convolution-pooling layer is stored in membank_a whereas the output of the second convolution-pooling layer is stored in the membank_b memory bank. In case of an additional convolution-pooling layer, convolution and pooling will be applied to the data stored in membank_b and the result should be stored in membank_a. This means that for odd-numbered convolution-pooling layers the data will be stored in membank_a and for even-numbered layers the data will be stored in membank_b. A 2:1 multiplexer can switch the data transfer to the img_ram between mb_out1 and mb_out2.
The FCN block also supports a larger number of hidden layers. The size of the weight matrix will vary based on the number of layers in the FCN. For a given size of VMU, the number of loops in the calculation of the output will increase. The size of the VMU can be increased to reduce the number of iterations. To support an increased size of VMU, the size of the reg_bank and the size of the mux_bank are required to be increased. Also, the data width of the accumulator block may need to be increased.
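The alternating use of the two memory banks and the effect of the VMU width on the number of FCN iterations can be summarized in a few lines. This is an illustrative sketch; the helper names `output_bank`, `input_source` and `fcn_phases`, and the 128-wide VMU used for comparison, are assumptions.

```python
import math


def output_bank(layer):
    """Memory bank written by convolution-pooling layer `layer` (1-based)."""
    if layer < 1:
        raise ValueError("layer numbering starts at 1")
    return "membank_a" if layer % 2 == 1 else "membank_b"


def input_source(layer):
    """Where layer `layer` reads its input from."""
    return "in_image" if layer == 1 else output_bank(layer - 1)


def fcn_phases(n_inputs, vmu_width=64):
    """Number of VMU iterations for one fully connected layer."""
    return math.ceil(n_inputs / vmu_width)


# Schedule for a hypothetical three-layer convolution-pooling network
schedule = [(k, input_source(k), output_bank(k)) for k in range(1, 4)]
```

Doubling the assumed VMU width from 64 to 128 reduces the iterations for a 576-element input from 9 to 5, at the cost of wider register and multiplexer banks.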
4 Experimental Setup and Performance Analysis
The proposed architecture for the CNN based handwritten digit classifier is validated using the popular MNIST database of handwritten digits ranging from 0 to 9. The dimension parameters used for the CPN block of the CNN model are $n = 28$, $n_1 = 26$, $n_2 = 14$, $n_3 = 12$, and $n_4 = 6$. In the FCN block, the size of the flattened vector is $576 \times 1$, 120 nodes were used in the input layer, and the output layer has only 10 nodes corresponding to the 10 different digits. The proposed CNN model achieves 96% accuracy in case of digit recognition.
A free database on hand gesture recognition from Kaggle is also used to verify the proposed architecture. The original size of the images is $640 \times 240$ but the images were converted to the size of $28 \times 28$ for verification. An accuracy of 97.9% is achieved for the case of gesture recognition. Python based software analysis was carried out in this work and Verilog HDL is used to design the architecture in Vivado.
Figure 11: Experimental setup for real-time detection of handwritten digits using proposed architecture implemented on Zynq FPGA.

Figure 12: Scheduling of all convolution and pooling operations in different stages of CNN. [Timing diagram: Convo 1 to Convo 5 and Pooling 1 to Pooling 5, annotated with $t_{cnv1}$, $t_{cnv2}$, $t_{pl1}$, $t_{pl2}$, $t_{gp}$, and $t_{gp1}$.]

An experimental setup for handwritten digit recognition is shown in Fig. 11. The proposed architecture uses an 18-bit data width for representing the data samples, of which 10 bits are used for precision. The processing subsystem (PS) part is used for interfacing the USB camera and the programmable logic (PL) part is used for implementing the CNN part. Detailed analysis of the proposed design is carried out in the following sub-sections.
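The 18-bit word with 10 bits of precision can be illustrated with a small fixed-point model. The sketch rests on assumptions that the text does not state: the format is taken to be signed two's complement with 10 fractional bits, conversion rounds to the nearest step, and out-of-range values saturate. All names (`to_fixed`, `to_float`, `fixed_mul`) are hypothetical.

```python
WORD_BITS = 18   # total data width from the text
FRAC_BITS = 10   # bits used for precision (assumed to be fractional bits)
SCALE = 1 << FRAC_BITS
MAX_RAW = (1 << (WORD_BITS - 1)) - 1   # largest signed 18-bit integer
MIN_RAW = -(1 << (WORD_BITS - 1))      # smallest signed 18-bit integer


def saturate(raw):
    """Clamp an integer to the signed 18-bit range (assumed behaviour)."""
    return max(MIN_RAW, min(MAX_RAW, raw))


def to_fixed(x):
    """Quantize a real value to the assumed fixed-point format."""
    return saturate(int(round(x * SCALE)))


def to_float(raw):
    """Convert a fixed-point integer back to a real value."""
    return raw / SCALE


def fixed_mul(a, b):
    """Multiply two fixed-point numbers and rescale the product."""
    return saturate((a * b) >> FRAC_BITS)


one = to_fixed(1.0)
quarter = to_float(to_fixed(-0.25))
clipped = to_fixed(1000.0)
product = to_float(fixed_mul(to_fixed(1.5), to_fixed(2.0)))
```

Under these assumptions the resolution is $2^{-10} \approx 0.001$ and the representable range is roughly $-128$ to $+128$.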
4.0.1 Processing Time The estimation of processing time is one of the major metrics that should be analysed. The processing time of all the blocks is analysed in this section. The operations like convolution and pooling are scheduled in such a way that no block stays idle during any time span. The writing of image pixels into the img_ram block and its convolution operation can be started simultaneously with a gap of one clock cycle.
The scheduling of the different convolution and pooling operations is shown in Fig. 12. The first convolution operation is completed in $t_{cnv1}$ clock cycles. The first pooling operation can be started just after a gap of $t_{gp1} = 2n + l_{ip} + 3$ cycles. Then after $t_{pl1}$ clock cycles the first pooling operation is completed. The second phase convolution operation cannot be started immediately as the size of the convolved image after the first convolution is less than that of the input image. The second phase convolution operation can be started after a gap of $t_{gp}$ clock cycles, whose value is nearly equal to $n$ clock cycles. The other convolution operations can be started immediately after the previous convolution operation finishes. The same holds for the second phase pooling operations.
Table 1: Estimation of hardware complexity

| Blocks | Mult | Comp | Add/Sub | Reg | Mux/DeMux |
|---|---|---|---|---|---|
| IP_16 | 16 | 0 | 15 | 31 | 0 |
| VMU | 64 | 0 | 63 | 127 | 0 |
| mux_bnk | 0 | 0 | 0 | 128 | 256 |
| accumulator | 0 | 0 | 2 | 3 | 1 |
| sort | 0 | 1 | 0 | 3 | 1 |
| window_op | 0 | 0 | 0 | 9 | 3 |
| mx_pool | 0 | 2 | 0 | 4 | 2 |
| Others | 0 | 0 | 0 | 1 | 8 |

Table 2: Estimation of memory elements.

| Memory Elements | Words per cycle (Write) | Words per cycle (Read) | BRAM | Reg |
|---|---|---|---|---|
| membank_a | 1 | 1 | 4 × 14 × 14 | 0 |
| membank_b | 1 | 1 | 4 × 12 × 12 | 0 |
| weight_mem | - | 64 | 64 × 1100 | 0 |
| mem_temp | 1 | 1 | 64 | 0 |
| mem_bias | - | 1 | 586 | 0 |
| row_tmp | 1 | 1 | 28 | 0 |
| row_tmp1 | 1 | 1 | 28 | 0 |
| reg_bank1 | 1 | 64 | 0 | 64 |
| reg_bank2 | 1 | 64 | 0 | 64 |

The FCN block can only operate when the last convolution operation, i.e. Convo 5 in Fig. 12, is completed. Matrix-vector multiplications in the FCN block are performed in different phases. For the case of handwritten digit recognition, the multiplication between the $W_1$ matrix and the flattened vector ($y$) is performed in 9 phases. Each phase corresponds to the multiplication of a $120 \times 64$ matrix and a $64 \times 1$ vector in the input layer. The output layer computation is achieved in two phases. The total processing time for the input layer is $t_{fcn1} = (l_{vmu} + 1) + (8 \cdot 120) + 1 + 64$. Here, 64 clock cycles are required to store the output values back into the reg_bank block. The total processing time for the output layer is $t_{fcn2} = (l_{vmu} + 1) + (2 \cdot 10) + 2$, including the time spent in the sort block.
The overall processing time can be estimated by the following equation
$t_{tot} = t_{cnv1} + t_{gp} + 4 \times t_{cnv2} + t_{fcn1} + t_{fcn2}$ (10)
The total processing time for the handwritten digit recognition problem for detecting a $28 \times 28$ image is 2706 clock cycles. Taking the maximum frequency of 173.91 MHz, the processing time is calculated as $t_{tot} = 2706 \cdot 5.75\,\mathrm{ns} \approx 15{,}560\,\mathrm{ns}$. It can be said that the CNN processor proposed in this work takes 15.56 µs to process one $28 \times 28$ image of handwritten digits. The proposed processor can process approximately 64k handwritten digits of size $28 \times 28$ in just 1 second.
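The cycle budget of Eq. (10) can be reproduced numerically. The sketch below uses the latencies quoted earlier ($l_{ip} = 6$, $l_{vmu} = 8$) and makes two assumptions that are not spelled out in the text: the second-stage convolution time is obtained from the same formula $n^2 + l_{ip} + 3$ with $n_2 = 14$, and the gap $t_{gp}$ is taken as exactly $n$ cycles. The function names are hypothetical.

```python
L_IP = 6    # latency of an IP_16 block in clock cycles
L_VMU = 8   # latency of the whole VMU in clock cycles


def conv_cycles(n, l_ip=L_IP):
    """Cycles for one convolution pass over an n x n image."""
    return n * n + l_ip + 3


def fcn_input_cycles(nodes=120, store=64, l_vmu=L_VMU):
    """Cycles for the FCN input layer, as given in the text."""
    return (l_vmu + 1) + 8 * nodes + 1 + store


def fcn_output_cycles(nodes=10, l_vmu=L_VMU):
    """Cycles for the FCN output layer, including the sort block."""
    return (l_vmu + 1) + 2 * nodes + 2


n, n2 = 28, 14
t_cnv1 = conv_cycles(n)        # first-stage convolution
t_gp = n                       # gap, assumed equal to n cycles
t_cnv2 = conv_cycles(n2)       # one second-stage convolution (assumed formula)
t_fcn1 = fcn_input_cycles()
t_fcn2 = fcn_output_cycles()

t_tot = t_cnv1 + t_gp + 4 * t_cnv2 + t_fcn1 + t_fcn2   # Eq. (10)

clock_ns = 5.75                            # period at about 173.91 MHz
time_us = t_tot * clock_ns / 1000.0        # processing time per image
images_per_second = 1e6 / time_us
```

Under these assumptions the individual terms are 793, 28, 4 × 205, 1034 and 31 cycles, which add up to the 2706 cycles reported above.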
4.0.2 Resource Consumption An estimation of the resource consumption of the proposed architecture in terms of basic blocks is shown in Table 1. This manual estimation of the resources is required to have a clear picture of the overall architecture. It can be seen that a maximum of 64 multipliers are used in the VMU block, which is the major block of the architecture. The block with the most combinational logic load is the mux_bnk. Pipeline registers are inserted in suitable places to keep the maximum combinational delay less than or equal to the delay of a multiplier.
Memory consumption overhead is another parameter that decides the performance of an architecture. An estimation of the memory consumption of the proposed architecture is shown in Table 2 for the handwritten digit example. The memory blocks weight_mem and mem_bias are realized using ROMs. Other memory blocks such as membank_a, membank_b etc. are realized as dual port memory (DPM) blocks. The reg_bank block is realized entirely using registers to convert the serial data stream to a parallel data vector. The sizes of the memory elements in Table 2 are shown with respect to the example of handwritten digit recognition for an image size of $28 \times 28$.
The FPGA implementation performance of the proposed architecture is shown in Table 3. Here, two types of FPGAs are used to demonstrate the performance comparison. The Artix7 FPGA board (xc7a100tftg256-2) has more resources compared to the Zynq system on chip (SoC) board (XC7Z010). Resource consumption on both boards is almost the same but a higher maximum frequency is achieved in case of the Artix7 FPGA board. This is because more area is available for routing in Artix7.

Table 3: Design performance of the proposed architecture implemented on two different boards.

| Resource | Artix Board: Values | Artix Board: Util (%) | Zynq Board: Values | Zynq Board: Util (%) |
|---|---|---|---|---|
| Slice LUT | 3789 | 5.98 | 3870 | 21.99 |
| Slice Reg | 3301 | 2.6 | 35200 | 9.38 |
| Occupied Slice | 1445 | 9.12 | 1502 | 34.14 |
| RAMB18 | 115 | 42.59 | 115 | 95.83 |
| DSP48 | 64 | 26.67 | 64 | 80 |
| Dynamic Power | 0.399 W | - | 0.413 | 0 |

Figure 13: Estimation of power of the proposed architecture based on Vivado tool.
Table 4: Comparison with existing works on handwritten
digit recognition using CNN.
Blocks           [6]        [34]      [35]        [2]         This Work
FPGA Board       xc7vx485t  xc7a100t  ZYNQ ZC702  Cyclone 10  xc7a100t
Frequency (MHz)  150        300       166         150         173.3
Time (ms)        0.0254     0.041     0.151       0.0176      0.0157
DSP              638        0         95          274         64
Reg              66346      106400    27664       48765       3301
LUT              51125      15769     388361      12588       3789
Accuracy (%)     96.8       90        99          97.57       96
4.0.3 Power Consumption The CNN architecture is
designed to consume little area as well as low power.
The power consumption of the proposed design on both
kinds of FPGAs is shown in Table III. The low power
consumption is due to two main design choices. Firstly,
all the memory elements are equipped with a controlled
enable mechanism: each memory element is active only
when it is required and inactive at all other times.
Secondly, sharing of resources avoids extra hardware
that would otherwise consume power: the VMU block is
shared by the FCN block as well as by the CPN block.
A detailed analysis of the power consumption is shown
in Fig. 13.
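The resource sharing described in Section 4.0.3 can be illustrated in software: both the convolution stage and the fully connected stage reduce to vector inner products, so one routine can serve both. The following is a minimal sketch under assumptions, not the proposed hardware. The function names and the tiny 3 × 3 example are hypothetical, and pooling and activation are omitted for brevity.

```python
def vmu(a, b):
    """Vector multiplication unit: inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def conv2d_valid(image, kernel):
    """Convolution stage: every output pixel is one call to the shared vmu."""
    k = len(kernel)
    flat_kernel = [w for row in kernel for w in row]
    out = []
    for r in range(len(image) - k + 1):
        out_row = []
        for c in range(len(image[0]) - k + 1):
            # Flatten the k x k window in the same order as the kernel.
            window = [image[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(vmu(window, flat_kernel))
        out.append(out_row)
    return out


def fully_connected(inputs, weights, biases):
    """Fully connected stage: every neuron is one call to the same vmu."""
    return [vmu(inputs, w) + b for w, b in zip(weights, biases)]


# Hypothetical usage with a 3 x 3 image and a 2 x 2 pre-defined filter.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
feature_map = conv2d_valid(image, kernel)
flat = [v for row in feature_map for v in row]
scores = fully_connected(flat, [[1, 0, 0, 0], [1, 1, 1, 1]], [0, 1])
print(feature_map, scores)
```

Because `vmu` is the only place where multiplications occur, the multiplier count does not grow with the number of stages. In hardware the shared unit must instead be time-multiplexed between the stages.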
4.0.4 Comparison with Existing Works A substantial
collection of works on hardware implementation of CNN
has been reported in the literature. A few works focused
only on the convolution part of the CNN, while a few
implemented the overall image classification problem using
CNN. The current research work also focuses on
implementation of the whole CNN based classifier. The
proposed architecture is compared with a few recent
works in Table IV.
The proposed work is more hardware efficient and has
a lower processing time than the work reported in
[6]. A similar statement can be made regarding the work
reported in [34], where the same FPGA target is used. The
ZYNQ based architecture [35] consumes more resources
and, at 0.151 ms per image in Table IV, is also slower
than the work proposed here. The Intel FPGA based
implementation [2] has a similar processing time but
consumes more resources. Overall, it can be concluded that
the proposed design is more hardware efficient than the
other works while also achieving a lower processing time.
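The processing times in Table IV are reported at different clock frequencies, so it can help to convert them into clock cycles per image. The sketch below does this under the assumption that each reported time was measured at the reported frequency. The dictionary and function names are illustrative; the numbers are copied from Table IV.

```python
# (frequency in MHz, time per image in ms), copied from Table IV.
WORKS = {
    "[6]": (150.0, 0.0254),
    "[34]": (300.0, 0.041),
    "[35]": (166.0, 0.151),
    "[2]": (150.0, 0.0176),
    "This work": (173.3, 0.0157),
}


def cycles_per_image(freq_mhz, time_ms):
    """MHz times ms gives thousands of cycles, so scale by 1e3."""
    return freq_mhz * time_ms * 1e3


cycles = {name: round(cycles_per_image(f, t)) for name, (f, t) in WORKS.items()}
for name, n in cycles.items():
    print(f"{name}: about {n} cycles per image")
```

Under this assumption the proposed design needs about 2721 cycles per 28 × 28 image. The estimate for [2] is slightly lower in cycles, which suggests that part of the time advantage of the proposed design comes from its higher clock frequency.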
5 Conclusion
A low cost yet faster architecture for CNN is proposed in
this work. The proposed architecture is implemented for
two convolution-pooling layers and two layers in the fully
connected network. However, the architecture is scalable and
supports any number of convolution-pooling layers and hidden
layers in the fully connected network. Thus, image
classification problems using DNN can also be solved using
this architecture. Grayscale images such as handwritten
digits and gesture images were considered in this work, but
the approach can be extended to color images as well.
Pre-defined filters are used in this work, but the
architecture is designed for the general case. The
architecture consumes fewer hardware resources as it uses a
single VMU block for all kinds of vector inner products.
Processing speed is not compromised, and dynamic power
consumption is also low.
References
[1] Y. Yao, Z. Zhang, Z. Yang, J. Wang, and J. Lai,
“FPGA-based convolution neural network for traffic
sign recognition,” in 2017 IEEE 12th International
Conference on ASIC (ASICON), 2017, pp. 891–894.
[2] R. Xiao, J. Shi, and C. Zhang, “FPGA implementation
of CNN for handwritten digit recognition,” in 2020
IEEE 4th Information Technology, Networking, Electronic
and Automation Control Conference (ITNEC),
vol. 1, 2020, pp. 1128–1133.
[3] E. Wang and D. Qiu, “Acceleration and implementation
of convolutional neural network based on FPGA,”
in 2019 IEEE 7th International Conference on Computer
Science and Network Technology (ICCSNT),
2019, pp. 321–325.
[4] J. Zhang, Y. Huang, H. Yang, M. Martinez, G. Hickman,
J. Krolik, and H. Li, “Efficient FPGA implementation
of a convolutional neural network for radar signal
processing,” in 2021 IEEE 3rd International Conference
on Artificial Intelligence Circuits and Systems
(AICAS). IEEE, 2021, pp. 1–4.
[5] S. Guzel Aydin and H. S. Bilge, “FPGA-based implementation
of convolutional layer accelerator part of
CNN,” in 2021 Innovations in Intelligent Systems and
Applications Conference (ASYU), 2021, pp. 1–6.
[6] Y. Zhou and J. Jiang, “An FPGA-based accelerator
implementation for deep convolutional neural networks,”
in 2015 4th International Conference on
Computer Science and Network Technology (ICCSNT),
vol. 01, 2015, pp. 829–832.
[7] H. Wang, Y. Zhao, and F. Gao, “A convolutional
neural network accelerator based on FPGA for buffer
optimization,” in 2021 IEEE 5th Advanced Information
Technology, Electronic and Automation Control
Conference (IAEAC), vol. 5. IEEE, 2021, pp. 2362–2367.
[8] V. Panchbhaiyye and T. Ogunfunmi, “A FIFO based
accelerator for convolutional neural networks,” in
ICASSP 2020 - 2020 IEEE International Conference
on Acoustics, Speech and Signal Processing
(ICASSP), 2020, pp. 1758–1762.
[9] S. Moini, B. Alizadeh, M. Emad, and R. Ebrahimpour,
“A resource-limited hardware accelerator for
convolutional neural networks in embedded vision
applications,” IEEE Transactions on Circuits and Systems
II: Express Briefs, vol. 64, no. 10, pp. 1217–1221, 2017.
[10] Y.-K. Lai and L.-C. Huang, “An efficient convolutional
neural network accelerator on FPGA platform,”
in 2024 IEEE International Conference on Consumer
Electronics (ICCE), 2024, pp. 1–2.
[11] H. Wang, X. Zhang, D. Kong, G. Lu, D. Zhen, F. Zhu,
and K. Xu, “Convolutional neural network accelerator
on FPGA,” in 2019 IEEE International Conference
on Integrated Circuits, Technologies and Applications
(ICTA). IEEE, 2019, pp. 61–62.
[12] J. Zhang, F. Zhang, M. Xie, X. Liu, and T. Feng,
“Design and implementation of CNN traffic lights
classification based on FPGA,” in 2021 IEEE 4th
International Conference on Electronic Information
and Communication Technology (ICEICT), 2021, pp.
445–449.