International Journal on Natural Language Computing (IJNLC) Vol.13, No.5/6, December 2024
DOI:10.5121/ijnlc.2024.13602
DATA-DRIVEN PART-OF-SPEECH TAGGING FOR
THE GIKUYU LANGUAGE: DEVELOPMENT,
CHALLENGES AND PROSPECTS
Gabriel Kamau
Department of Computer Science, Dedan Kimathi University of Technology, Nyeri,
Kenya
ABSTRACT
This paper presents the development of a data-driven Part-of-Speech (POS) tagger for Gikuyu, a Bantu
language spoken in Kenya. Gikuyu, like many indigenous African languages, is under-resourced, with
limited computational tools for linguistic processing. By employing a corpus sourced primarily from the
Gikuyu Bible and leveraging a Memory-Based Tagging (MBT) approach, this study demonstrates the
feasibility of creating a robust POS tagging system. The tagger achieved a precision of 90.44%, a recall
of 88.34%, and an F-score of 91.35%. These results underscore its potential for applications in machine
translation, speech recognition, and language preservation. The study highlights the challenges of
working with under-resourced languages, including data collection and annotation, and provides
recommendations for future work, including integration with broader NLP tasks.
KEYWORDS
Natural Language Processing, Part-of-Speech Tagging, Gikuyu Language, Data-Driven Approach,
Low-Resource Languages
1. INTRODUCTION
Natural Language Processing (NLP) bridges the gap between human languages and
computational systems, enabling applications such as machine translation, text mining, and
sentiment analysis. A foundational task in NLP is Part-of-Speech (POS) tagging, which assigns
grammatical labels, such as noun, verb, and adjective, to words in a text. This task
supports downstream applications like named entity recognition, syntax parsing, and speech
synthesis (Kumar et al., 2023).
The Gikuyu form the largest indigenous tribe in Kenya. According to the Kenya
Population and Housing Census (2019), the Gikuyu language is spoken by approximately 8.1
million people in Kenya alone. However, like other indigenous languages in Kenya, the language
faces challenges in digitization and language tool development. While significant advances
have been made for European and Asiatic languages, African languages, including Gikuyu,
remain underrepresented in NLP research. Kenya's official languages, English and Kiswahili,
dominate written and spoken communication, relegating Gikuyu and other indigenous
languages primarily to oral use. Consequently, the language risks extinction without
research-based preservation efforts.
Table 1 shows the population distribution of the Kenyan indigenous languages based on the 2019
Kenya Population and Housing Census (KPHC).
Table 1: Kenya indigenous languages population distribution
SN   LANGUAGE          SPEAKERS
1    Gikuyu            8,148,668
2    Luhya             6,823,842
3    Kalenjin          6,358,113
4    Luo               5,066,966
5    Kamba             4,663,910
6    Kenyan Somalis    2,780,502
This work addresses the gap in computational resources for the Gikuyu language by developing a
data-driven POS tagging system. The objectives include collecting and annotating a corpus,
building a tagging model, and evaluating its performance. The study aligns with Sustainable
Development Goals (SDGs) 4.6 and 4.7, promoting linguistic diversity and global citizenship
education.
2. LITERATURE REVIEW
Computational linguistics, a field at the intersection of computer science and linguistics, has
seen tremendous advancements in recent years (Hellwig & Nehrdich, 2021). Among its critical
areas is Natural Language Processing (NLP), which enables machines to understand, interpret,
and generate human language. A fundamental task within NLP is Part-of-Speech (PoS)
tagging, the process of assigning grammatical categories, such as nouns, verbs, and
adjectives, to words within a text (Joshi et al., 2020). While substantial progress has been made in
developing PoS taggers for European and Asiatic languages, African languages continue to be
underrepresented in computational linguistics research. This gap underscores the necessity of
targeted efforts in creating tools and resources for these languages.
2.1. Swahili POS Tagger
Swahili, a widely spoken language in Eastern and Central Africa, has benefited from
data-driven PoS tagging efforts. De Pauw et al. (2006) used a data-driven approach for Kiswahili,
utilizing the Helsinki Corpus of 3.65 million words. They compared multiple taggers, including
Maximum Entropy Modelling (MXPOST) and Support Vector Machines (SVM). MXPOST
outperformed the others, achieving high accuracy. To ensure unbiased training, the corpus was
randomized and divided into training (80%), validation (10%), and blind test (10%) sets. The
taggers achieved remarkable accuracy on the blind test set, with MXPOST standing out for its
use of maximum entropy modelling, incorporating lexical, contextual, and morphological
features.
This project highlights the value of combining existing annotated corpora and advanced
algorithms but demonstrates the limitations of corpus availability for Gikuyu.
2.2. Kamba POS Tagger
The Kamba language, predominantly spoken in the Machakos, Makueni, and Kitui regions of
Kenya, represents a significant example of under-resourced languages. Kituku et al. (2015)
developed a POS tagger for Kikamba, using a Memory-Based Tagger (MBT). The project
processed a corpus of approximately 30,000 words collected from online sources and
documents in Kamba. After cleaning and formatting the corpus, manual annotation of ten PoS
categories (e.g., adjectives, nouns, verbs, and punctuation) was completed using Microsoft
Excel. The formatted dataset was then processed on a Linux-based MBT system.
The results demonstrated a precision of 83%, a recall of 72%, and an F-score of 75%, with
overall accuracy reaching 90.68%. However, the tagger's effectiveness was influenced by the
quality of annotations and the limited size of the corpus.
The system's language dependence, however, limits its applicability to Gikuyu due to linguistic
variations between the two languages.
2.3. Kipsigis POS Tagger
Kipsigis, a member of the Kalenjin linguistic family, represents another under-resourced
language. The Kipsigis tagger, developed using MBT, processed a smaller corpus of 14,000
words. Following methods similar to the Kamba project, the corpus was manually cleaned,
annotated, and formatted. The tagger achieved a precision of 88.375%, a recall of 88.25%, and
an F-score of 88.625%, with overall accuracy reaching 94.46%.
The results were promising but highlighted the limitations of relying solely on traditional MBT
methods. The study underscores the challenges of limited training data and the necessity of
customized linguistic tools for under-resourced languages (Kituku et al., 2015).
2.4. Setswana POS Tagger
Setswana, a Bantu language spoken in Botswana, South Africa, Zimbabwe, and Namibia,
presents unique challenges due to its disjunctive orthography. Malema et al. (2017) employed a
rule-based approach supported by a dictionary and a morphological analyser for Setswana POS
tagging, focusing on relative and possessive structures. Despite achieving an identification rate
of 82%, the approach struggled with longer sentences due to limited morphological analysis
tools.
Additionally, while this effort marked a step forward for Setswana NLP, the approach was
limited by its reliance on rule-based systems, which are generally less adaptable than
data-driven models.
2.5. Research Gaps
While various approaches have been applied to Bantu languages, existing tools are
language-specific and lack generalizability to Gikuyu. This study fills the gap by developing the first POS
tagger tailored to Gikuyu, leveraging both linguistic insights and data-driven techniques.
3. METHODOLOGY
The methodology for developing the Gikuyu Part-of-Speech (POS) tagger involved a structured
approach that included data collection, annotation, training, and evaluation. A data-driven
approach using a Memory-Based Tagger (MBT) was selected due to its adaptability and
performance for under-resourced languages (Liang & Huang, 2023). Figure 1 summarizes the
POS tagger architecture.
Figure 1: POS tagger architecture
3.1. Data Collection
The primary corpus was extracted from the Gikuyu Bible, a comprehensive and structured text
that provided a consistent linguistic framework. Other supplementary sources included online
Gikuyu texts, cultural stories, and educational materials. Web scraping tools such as Google
Scholar and Monster Crawler were employed to collect additional data, ensuring diversity and
richness in the corpus.
3.2. Corpus Characteristics
The corpus consisted of approximately 10,000 words, divided into sentences for easy
tokenization. The text was chosen to represent a variety of linguistic forms and grammatical
structures in Gikuyu, including nouns, verbs, adjectives, and adverbs.
3.3. Manual Annotation
Manual annotation was conducted using Microsoft Excel, where each word was assigned a POS
tag in a parallel column. This process involved:
i. Tokenizing sentences into individual words.
ii. Assigning tags based on predefined categories, such as nouns (NOUN), verbs (VERB),
and conjunctions (CONJUNCTION).
iii. Ensuring consistency in tagging rules to improve the model's reliability.
Table 2 shows the complete tag format adopted.
Table 2: Tags Format
CATEGORY            TAG
Noun                NOUN
Pronoun             PRONOUN
Verb                VERB
Adverb              ADVERB
Adjective           ADJECTIVE
Preposition         PREPOSITION
Conjunction         CONJUNCTION
Comma               COMMA
Colon               COLON
Semicolon           SEMCLN
Full stop           F-STOP
Quotation marks     QUOTES
Question mark       QUESM
Apostrophe          APOSTR
Exclamation mark    EXCLMK
Numerals            NUME
3.4. Tagset Design
The tags used were carefully chosen to align with common linguistic categories while
addressing the unique needs of the Gikuyu language. The tagset included:
i. NOUN: Denoting objects, names, or places.
ii. VERB: Representing actions or states.
iii. ADJECTIVE: Qualifying nouns.
iv. PREPOSITION, CONJUNCTION, PRONOUN, etc.: Functional categories.
v. Punctuation markers: COMMA, F-STOP, APOSTR, etc.
This tagset was inspired by established frameworks from related Bantu language projects, such
as Kiswahili and Kamba (Kituku et al., 2015; De Pauw et al., 2006).
3.5. Training Process
A Memory-Based Tagger (MBT) was selected for its effectiveness in handling small datasets,
characteristic of under-resourced languages. MBT uses a machine learning algorithm that relies
on lexical, contextual, and orthographic features of the input words. The annotated corpus was
converted into a text file format compatible with MBT. The conversion included:
i. Replacing spaces with tab delimiters.
ii. Adding sentence delimiters (<utt>).
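The two conversion steps can be illustrated with a minimal Python sketch. It assumes the annotated corpus is already available as lists of (word, tag) pairs, one list per sentence. The function name to_mbt_format and the placeholder tokens are hypothetical and are not taken from the study, which performed the conversion from an Excel sheet.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (word, tag) pairs for one sentence


def to_mbt_format(sentences: Iterable[Sentence], delimiter: str = "<utt>") -> str:
    """Write one token per line as word<TAB>tag, closing each sentence with a delimiter line."""
    lines = []
    for sentence in sentences:
        for word, tag in sentence:
            lines.append(f"{word}\t{tag}")
        lines.append(delimiter)
    return "\n".join(lines) + "\n"


# Placeholder tokens (not real Gikuyu words) paired with tags from Table 2.
example = [
    [("word1", "NOUN"), ("word2", "VERB"), (".", "F-STOP")],
    [("word3", "PRONOUN"), ("word4", "VERB"), (".", "F-STOP")],
]
mbt_text = to_mbt_format(example)
print(mbt_text)
```

Keeping the delimiter as a parameter makes it easy to adapt the sketch if a different sentence marker is configured.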
Figure 2 shows a sample training run while Figure 3 shows an example from the training
corpus:
Figure 2: A sample training run
Figure 3: Sample annotated corpus
3.6. Training Configuration
The training process used default configurations in MBT to optimize the balance between
complexity and computational efficiency:
i. ddfa: Focused on two disambiguated tags to the left and one ambiguous tag to the right
for known words.
ii. dFapsss: Applied additional orthographic features for unknown words, including the
first and last three letters.
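To make the two feature patterns concrete, the sketch below builds the feature lists they describe. It is an illustrative reading of the pattern strings, not MBT's internal code. The function names, the "_" padding symbol, and the choice to treat F only as a position marker are assumptions, and "first and last three letters" is read here as the first letter plus the last three letters.

```python
from typing import List

PAD = "_"  # assumed filler for positions outside the sentence or word


def known_word_features(i: int, left_tags: List[str], ambitags: List[str]) -> List[str]:
    """Pattern ddfa: two decided tags to the left (d, d), the focus word's
    ambiguous tag (f), and the ambiguous tag of the word to the right (a)."""
    d2 = left_tags[i - 2] if i >= 2 else PAD
    d1 = left_tags[i - 1] if i >= 1 else PAD
    focus = ambitags[i]
    right = ambitags[i + 1] if i + 1 < len(ambitags) else PAD
    return [d2, d1, focus, right]


def unknown_word_features(i: int, words: List[str], left_tags: List[str],
                          ambitags: List[str]) -> List[str]:
    """Pattern dFapsss as read here: one decided tag to the left (d), the focus
    position (F, only an anchor in this sketch), the ambiguous tag to the right (a),
    the first letter (p), and the last three letters (s, s, s)."""
    word = words[i]
    d1 = left_tags[i - 1] if i >= 1 else PAD
    right = ambitags[i + 1] if i + 1 < len(ambitags) else PAD
    first = word[:1] or PAD
    last_three = list(word[-3:].rjust(3, PAD))
    return [d1, right, first, *last_three]


# Hypothetical inputs: an ambiguous tag lists every tag a word took in training.
known = known_word_features(2, ["NOUN", "VERB"], ["NOUN", "VERB", "NOUN;VERB", "ADJECTIVE"])
unknown = unknown_word_features(0, ["abcde", "x"], [], ["UNKNOWN", "NOUN"])
print(known)
print(unknown)
```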
The command-line script executed the training process as shown in Figure 4.
Figure 4: Command-line script
During training, the model generated configuration files for known and unknown word patterns.
These were used to predict POS tags for new input data.
4. EVALUATION
4.1. Cross-Validation
A 10-fold cross-validation technique was applied to ensure robust performance assessment. This
method involves splitting the dataset into 10 equal subsets, training the model on 9 subsets, and
testing it on the remaining subset. The process repeats until every subset has been used for
testing.
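The splitting procedure can be sketched as follows. The sketch assumes that folds are formed from whole sentences and that the data are shuffled with a fixed seed. Neither detail is stated in the study, and k_fold_splits is a hypothetical helper name.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def k_fold_splits(items: Sequence, k: int = 10, seed: int = 0) -> Iterator[Tuple[List, List]]:
    """Yield (train, test) pairs so that every item is used for testing exactly once."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    folds = [indices[start::k] for start in range(k)]  # k near-equal subsets
    for held_out in range(k):
        test = [items[i] for i in folds[held_out]]
        train = [items[i] for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


# Stand-in data: integers play the role of annotated sentences.
sentences = list(range(100))
splits = list(k_fold_splits(sentences, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```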
4.2. Evaluation Metrics
The performance of the Gikuyu Part-of-Speech (POS) tagger was evaluated using four standard
metrics: precision, recall, F-score, and accuracy (Gupta & Sign, 2024). These metrics are widely
adopted in NLP for assessing classification tasks and sequence labeling systems like POS
tagging.
4.2.1. Precision
Precision measures the proportion of correct POS tag predictions out of all tags predicted by the
model. It is expressed as shown in Formula 1:
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$
where TP is the number of words correctly assigned a given tag and FP is the number of words
incorrectly assigned that tag.
In the case of the Gikuyu POS tagger, precision evaluates how accurately the model assigns a
specific POS tag to a word without including incorrect classifications. For this study, the tagger
achieved an average precision of 90.44%, demonstrating its reliability in correctly predicting
tags for the given input.
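4.2.2. Recall
Recall measures the proportion of correctly predicted POS tags out of all actual instances of that
tag in the dataset. It is calculated as shown in Formula 2:
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$
where TP is the number of words correctly assigned a given tag and FN is the number of words
that carry the tag in the annotated data but were assigned a different tag.
The tagger's average recall score was 88.34%, reflecting its effectiveness in identifying true
instances of POS tags in the Gikuyu dataset. High recall is particularly important for minimizing
the omission of valid POS tags.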
4.2.3. F-Score
The F-Score is the harmonic mean of precision and recall, providing a balanced measure that
considers both false positives and false negatives. It is given by Formula 3:
$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)$$
The Gikuyu tagger achieved an F-Score of 91.35%, which highlights its balanced performance
across precision and recall. This metric is particularly relevant for under-resourced languages
like Gikuyu, where misclassifications can have a significant impact on usability.
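As an illustration of how precision, recall, and F-score can be computed for a tagger, the sketch below scores each tag separately and then averages over tags (macro-averaging). The averaging scheme, the function names, and the toy tag sequences are assumptions made for the example. The study does not state how its per-fold scores were aggregated over tags.

```python
from collections import Counter
from typing import Dict, List, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, F-score)


def per_tag_scores(gold: List[str], predicted: List[str]) -> Dict[str, Scores]:
    """Compute precision, recall and F-score for every tag from aligned tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_tag, predicted_tag in zip(gold, predicted):
        if gold_tag == predicted_tag:
            tp[gold_tag] += 1
        else:
            fp[predicted_tag] += 1  # the predicted tag was wrongly assigned
            fn[gold_tag] += 1       # the gold tag was missed
    scores = {}
    for tag in set(gold) | set(predicted):
        precision = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        recall = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f_score)
    return scores


def macro_average(scores: Dict[str, Scores]) -> Scores:
    """Average each metric over tags, giving every tag equal weight."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy example with tags from Table 2 (not data from the study).
gold = ["NOUN", "VERB", "NOUN", "ADJECTIVE"]
predicted = ["NOUN", "VERB", "VERB", "ADJECTIVE"]
scores = per_tag_scores(gold, predicted)
macro = macro_average(scores)
print(scores)
print(macro)
```

Under macro-averaging, the reported F-score is the mean of the per-tag F-scores, so it need not equal the harmonic mean of the averaged precision and recall.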
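4.2.4. Accuracy
Accuracy measures the proportion of all correctly classified tags (both known and unknown) out
of the total predictions made. It is calculated as shown in Formula 4:
$$\text{Accuracy} = \frac{\text{number of correctly tagged words}}{\text{total number of tagged words}} \qquad (4)$$
Table 3 shows the evaluation results.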
Table 3: Results
Fold       Precision    Recall     F-score
Fold1      0.8722       0.85384    0.8862
Fold2      0.9357       0.9328     0.9340
Fold3      0.8739       0.9381     0.9327
Fold4      0.9202       0.8523     0.9156
Fold5      0.8797       0.8489     0.8858
Fold6      0.9313       0.9308     0.9307
Fold7      0.9313       0.8793     0.9307
Fold8      0.8699       0.8563     0.8809
Fold9      0.9203       0.8671     0.9205
Fold10     0.9099       0.8746     0.9177
AVERAGE    90.44 %      88.34 %    91.35 %
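The averages in Table 3 are the arithmetic means of the ten per-fold values. The short sketch below reproduces them from the figures listed in the table. The helper name as_percent is hypothetical.

```python
from statistics import mean

# Per-fold values copied from Table 3 (Fold1 to Fold10).
precision = [0.8722, 0.9357, 0.8739, 0.9202, 0.8797, 0.9313, 0.9313, 0.8699, 0.9203, 0.9099]
recall = [0.85384, 0.9328, 0.9381, 0.8523, 0.8489, 0.9308, 0.8793, 0.8563, 0.8671, 0.8746]
f_score = [0.8862, 0.9340, 0.9327, 0.9156, 0.8858, 0.9307, 0.9307, 0.8809, 0.9205, 0.9177]


def as_percent(values):
    """Mean of per-fold scores, expressed as a percentage with two decimals."""
    return round(100 * mean(values), 2)


print(as_percent(precision), as_percent(recall), as_percent(f_score))
```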
For the Gikuyu tagger, accuracy was computed separately for known and unknown words:
i. Known Word Accuracy: 93.35%
ii. Unknown Word Accuracy: 88.35%
iii. Overall Accuracy: 90.69%
Table 4 shows the accuracy confusion matrix.
Table 4: Accuracy confusion matrix
Fold       Accuracy (Known)    Accuracy (Unknown)    Accuracy (Overall)
Fold1      0.9250              0.8952                0.9256
Fold2      0.9329              0.9152                0.8989
Fold3      0.9403              0.8634                0.9369
Fold4      0.9434              0.8752                0.9032
Fold5      0.9269              0.8523                0.8869
Fold6      0.9389              0.8838                0.8724
Fold7      0.9323              0.8852                0.9245
Fold8      0.9137              0.8653                0.9035
Fold9      0.9420              0.8762                0.9074
Fold10     0.9397              0.8865                0.9100
AVERAGE    93.35 %             88.35 %               90.69 %
The disparity between known and unknown word accuracy reflects the challenges inherent in
handling novel inputs, a common issue in low-resource language processing. Improving the
system's ability to generalize to unknown words remains a key area for future work.
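A minimal sketch of how known-word, unknown-word, and overall accuracy can be separated is given below. It assumes a word counts as known when it occurs in the training vocabulary. The function name and the toy data are hypothetical and are not drawn from the study.

```python
from typing import Dict, List, Optional, Set


def accuracy_by_vocabulary(tokens: List[str], gold: List[str], predicted: List[str],
                           training_vocabulary: Set[str]) -> Dict[str, Optional[float]]:
    """Report accuracy for words seen in training, unseen words, and all words."""
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total] per group
    for word, gold_tag, predicted_tag in zip(tokens, gold, predicted):
        group = "known" if word in training_vocabulary else "unknown"
        counts[group][0] += int(gold_tag == predicted_tag)
        counts[group][1] += 1
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    result = {name: (c / t if t else None) for name, (c, t) in counts.items()}
    result["overall"] = correct / total if total else None
    return result


# Toy example: three words seen in training and one unseen word.
tokens = ["a", "b", "c", "d"]
gold = ["NOUN", "VERB", "NOUN", "ADJECTIVE"]
predicted = ["NOUN", "VERB", "VERB", "ADJECTIVE"]
report = accuracy_by_vocabulary(tokens, gold, predicted, training_vocabulary={"a", "b", "c"})
print(report)
```

In this formulation the overall accuracy is a weighted average of the two group accuracies, with weights given by the share of known and unknown tokens in the test fold.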
4.2.5. Cross-Validation with K-Fold Testing
To ensure robust evaluation, a 10-fold cross-validation technique was employed. This approach
involves dividing the dataset into 10 equal parts, using nine parts for training and one for testing
iteratively, until all folds have been tested. This method reduces the dependence of the results on
a single train/test split, helps detect overfitting, and provides a reliable estimate of model
performance (Musabeyezu et al., 2023). Results across the 10 folds
were averaged to produce the final evaluation metrics.
4.3. Comparison to Related Work
The evaluation metrics of the Gikuyu tagger were benchmarked against POS tagging systems
for other African languages:
i. Kamba: Kituku et al. (2015) reported an F-Score of 75% and overall accuracy of
90.68%.
ii. Kiswahili: De Pauw et al. (2006) achieved state-of-the-art accuracy using MXPOST
with a large annotated corpus.
iii. Setswana: Malema et al. (2017) achieved 82% identification accuracy using a
rule-based approach.
Compared to these systems, the Gikuyu tagger demonstrates competitive performance,
particularly given the constraints of limited corpus size and manual annotation.
5. RESULTS AND DISCUSSION
The Gikuyu Part-of-Speech (POS) tagger demonstrated strong performance across evaluation
metrics, indicating its reliability and utility for processing the Gikuyu language. The key results
are summarized below:
1. Precision, Recall, and F-Score:
i. Precision: 90.44%
ii. Recall: 88.34%
iii. F-Score: 91.35%
These metrics indicate that the tagger accurately identified POS tags while minimizing false
positives and negatives. The F-Score, which balances precision and recall, confirms the
robustness of the tagger in both recognizing and predicting POS categories.
2. Accuracy:
i. Known Word Accuracy: 93%
ii. Unknown Word Accuracy: 88%
iii. Overall Accuracy: 91%
The higher accuracy for known words compared to unknown ones highlights a common
challenge in NLP, particularly for under-resourced languages. The system's ability to handle