
Income Classification Benchmark: From R (Academic Study) to Python (ML Pipeline)

Author: Erkan, Mehmet Ali
Publisher: Zenodo
DOI: 10.5281/zenodo.17662766
Source: https://zenodo.org/records/17662766/files/report.pdf
Benchmarking Supervised Learning Algorithms for Income Classification:
An End-to-End ML Pipeline
Mehmet Ali Erkan
Middle East Technical University
Ankara, Turkey
maerkan@metu.edu.tr
Abstract: The aim of this paper is to statistically analyze salary
classification by observing people's work class, education,
race, gender, and other characteristics and to predict the salary
category based on these characteristics. To get the results,
data visualization, statistical analysis and machine learning
techniques were applied. Regression, support vector machine,
artificial neural network, random forest, and XGBoost machine
learning algorithms are used to classify salary. Research
questions are developed and analyzed prior to prediction in
order to better understand relationships between variables in
the data. Data cleaning techniques are used to create clean,
appropriate data for the study. After the dataset has been
cleaned, models and statistical tests are run. Sensitivity,
accuracy and F1 score were used to assess the models due to the
large number of categorical variables. The analysis is conducted
using RStudio.
Keywords: Salary Classification, Artificial Neural
Network, Random Forest, XGBoost, Data Analytics
I. INTRODUCTION
Every employee has a salary. The amount of this salary
depends on several parameters, which can range from the
person's education and age to working hours and marital
status. In this study, people are classified according to
these variables as earning less than $50,000 annually or
more. First, the general state of the data is examined and
the necessary arrangements are made; then models are
established and compared by looking at sensitivity and
F1 score.
II. LITERATURE REVIEW
Statisticians and researchers have done an enormous
amount of research and analysis on predicting salary
classification. Firstly, a regression model is suggested to
predict salary classification [1]. Secondly, there is another
piece of research on applying different machine learning
techniques to the prediction of salary classification,
essentially a comparison of unsupervised and supervised
learning [2]. According to its results, the most accurate
method is random forest [2]. The last piece of research is
about salary prediction using machine learning [3]. The
researchers concluded that the best outcome is produced by
the decision tree, but that KNN performs better if the
number of featured attributes is small [3].
III. METHODOLOGY
A. Dataset
Barry Becker extracted the data from the 1994 Census
database. The version used here is taken from Kaggle.
Initially, the data has 32561 observations and 15
variables, 9 categorical and 6 continuous. In this study,
salary is used as the dependent variable. Since no
information was available about it, the fnlwgt variable was
removed. Moreover, the workclass, native country, and
occupation variables have NA values in the dataset, with
occupation having the most. Imputation was used to fill
these NA values. Since 32561 observations would make
operations and iterations slow, 1000 samples were randomly
drawn from the data. The variables utilized in the analysis
are listed below.
•“age”: age of the workers - Continuous variable
•“workclass”: sector of the workers - Categorical variable
•“fnlwgt”: no information - Continuous variable
•“education”: education level - Categorical variable
•“education-num”: numeric coding of education level - Continuous variable
•“marital-status”: marital status of worker - Categorical variable
•“occupation”: occupation of worker - Categorical variable
•“relationship”: relationship status, wife or husband etc. - Categorical variable
•“race”: race of workers - Categorical variable
•“sex”: gender of workers - Categorical variable
•“capital-gain”: the profit one earns on the sale of an asset - Continuous variable
•“capital-loss”: the loss on selling an asset for less than its adjusted basis - Continuous variable
•“hours-per-week”: weekly working hours of workers - Continuous variable
•“native-country”: native country of workers - Categorical variable
•“salary”: earning less or more than 50k - Categorical variable
B. Descriptive Statistics
Table 1 shows descriptive statistics for 2
continuous variables of the data.
                age    Hours per week
Minimum         17     1.00
1st Quartile    28     40.00
Median          37     40.00
Mean            38     40.44
3rd Quartile    48     45.00
Maximum         90     99.00
Table 1 Descriptive Statistical Summary of Some Variables
According to Table 1, the average age is 38. The minimum
age is 17 and the maximum age is 90. Half of the people are
younger than 37 and half are older. 25% of the people are
below 28 and 25% are above 48. On the other hand, the
average working time per week is 40.44 hours. The minimum
is 1 hour and the maximum is 99. A quarter of the weekly
working hours are at or below 40 and a quarter are at or
above 45. Since there is a considerable difference between
the 3rd quartile and the maximum value, there might be
outliers. This also suggests that the distribution of
working hours may be right-skewed. Lastly, the response
variable consists of 24720 people who earned less than
$50,000 and 7841 people who earned more than $50,000. This
shows us that the data is imbalanced.
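To make the degree of imbalance concrete, the class shares can be computed directly from the two counts reported above. The following is a minimal Python sketch; the label strings "<=50K" and ">50K" are assumed names for illustration and are not taken from the study.

```python
# Class counts reported in the text for the full data set.
# The label strings are assumed names used only for this illustration.
counts = {"<=50K": 24720, ">50K": 7841}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
# Ratio of majority to minority class size.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

Roughly three quarters of the observations fall in the lower salary class, which is why accuracy alone can be misleading for this response.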
C. Exploratory Data Analysis
Six research questions were established and
addressed in this section of the data analysis. The answers
to these research questions have improved our
understanding of the data.
C.1 How does level of education relate to salary?
Figure 1 Plot of salary by education level
As can be seen from the bar plot in Figure 1, salary
differs across education levels. The majority of high school
graduates earn less than $50,000, whereas people with
master's degrees have the highest rate of earnings over
$50,000 among all education levels (Pearson's Chi-squared
test, p-value < 2.2e-16).
C.2 How is workers' age distributed over workclass type?
Figure 2 Violin plot of age by workclass
             Df     Sum Sq    Mean Sq    F value    P value
workclass    3      6286      2095.3     11.44      2.24e-07
Res.         996    182462    183.2
Total        999
Table 2 The Results of ANOVA
Since the workclass variable has more than two levels,
ANOVA can be conducted. In this way, it can be learned
whether at least one level of the workclass variable has a
different effect on people's age. Some of the workclasses
are significantly different. After that, normality was
checked and the residuals are not normal. A Box-Cox
transformation was applied, but the transformed data are
still not normal. Since the normality assumption is not
satisfied, the Kruskal-Wallis test, a non-parametric version
of the one-way ANOVA, was applied (p-value < 2.2e-16).
Since the p-value is less than 0.05, it is clear that the
workclass groups differ significantly. Following that,
pairwise comparisons were used; a pair of groups is
significantly different when its p-value is less than 0.05.
Between self-employed individuals and private workers, as
well as between government and private employees, there
is a significant difference in age.
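The entries of Table 2 can be cross-checked by hand, since each mean square is the sum of squares divided by its degrees of freedom and the F value is the ratio of the two mean squares. A short Python sketch of this arithmetic, using only the numbers printed in the table:

```python
# Values copied from Table 2.
ss_between, df_between = 6286, 3      # workclass row
ss_within, df_within = 182462, 996    # residual row

ms_between = ss_between / df_between  # mean square for workclass
ms_within = ss_within / df_within     # residual mean square
f_value = ms_between / ms_within

print(round(ms_between, 1), round(ms_within, 1), round(f_value, 2))
# prints: 2095.3 183.2 11.44
```

The recomputed values agree with the reported Mean Sq and F value columns, and the degrees of freedom sum to the reported total of 999.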
D. Missingness
Since there are so many missing observations in real-life
data, it is always crucial to identify the missingness process.
Three variables in this data have a significant percentage of
missing values.
Figure 3 Aggregation Plot of Missing
The ratio of missingness and the pattern of missingness
have been used to understand the NA structure. The
workclass, native country, and occupation variables in the
dataset have NA values, and occupation has the most.
Margin plots show that the two plots behave similarly.
Thus, the missingness was treated as missing at random (MAR).
Figure 4 Density Plots of Missing
Imputation was successful since the imputed densities are close to
the observed density line. Additionally, the dataset no longer
contains any NA values.
E. One Hot Encoding - Feature Selection with Boruta
Before modelling, one hot encoding was applied to
categorical variables with three or more levels. After that,
the Boruta method was used to create the final data set.
Figure 5 Feature Selection with Boruta
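As an illustration of the one hot encoding step only (not of Boruta), the Python sketch below turns one categorical column into 0/1 indicator columns. The function name `one_hot_encode`, the toy rows, and the level names are hypothetical; the study itself was carried out in R and its exact encoding call is not given.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row dict with one 0/1 indicator per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

# Hypothetical toy rows; the level names are illustrative only.
rows = [
    {"age": 39, "workclass": "Private"},
    {"age": 50, "workclass": "Self-emp"},
    {"age": 38, "workclass": "Gov"},
]
encoded = one_hot_encode(rows, "workclass")
for row in encoded:
    print(row)
```

Each row ends up with exactly one indicator set to 1, so a variable with three levels becomes three binary columns.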
F. Imbalanced Data Problem
When the classes in a classification problem have
unequal numbers of instances, this is referred to as
unbalanced data. In our case, it can be seen in the dependent
variable. This situation can be overcome by trying sampling
methods such as smote, up, down and rose, choosing the
most appropriate one, and applying it. When the accuracy is
compared across the methods in Figure 6, it is seen that
the smote method is ahead.
Figure 6 Accuracy Comparison of Imbalanced Data
Figure 7 Methods Comparison of Imbalanced Data
In addition, when all methods in Figure 7 are compared in
terms of values such as specificity, accuracy, and F score, it
was decided to use smote sampling because its F score
was the closest to that of the original data and also because
it had high sensitivity.
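The idea behind smote can be sketched in a few lines of Python: a synthetic minority point is placed at a random position on the segment between a minority observation and one of its nearest minority neighbours. This is a simplified illustration under stated assumptions, not the implementation used in the study; the function name `smote`, the choice `k=2`, and the toy points are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points with two numeric features.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority points, the new observations stay inside the region already occupied by the minority class.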
G. Modelling
New data is produced following missing-value imputation,
feature selection with Boruta and the choice of the right
sampling method for imbalanced data. Salary can be
classified using this new data. The data is split into train
data and test data before the models are built (Cross Validation).
Among the many cross validation techniques, repeated k-fold
cross validation with 5 repeats was chosen based on the
compatibility of the train and test sets, considering the
sensitivity and also the F1 score.
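Repeated k-fold cross validation reshuffles the data and splits it into k folds several times, so that every observation is used for testing once per repeat. The Python sketch below generates such index splits. The number of folds is not stated in the text, so `k=10` is an assumption, and `n_samples=1000` simply mirrors the subsample size mentioned earlier rather than the actual training-set size.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` reshuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [indices[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx

# k=10 is an assumed value; 5 repeats follows the text.
splits = list(repeated_kfold(n_samples=1000, k=10, repeats=5))
print(len(splits))  # number of train/test pairs: k * repeats
```

With these settings each model would be fitted 50 times, and the reported metric would be the average over those fits.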
1. Multiple Linear Regression
Under certain assumptions, multiple linear regression is
used to forecast the outcome of the response variable
using a number of independent variables. The multiple
linear regression model is described below.
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
In this study, backward elimination is used to build the
final model. Insignificant variables were eliminated from
the model. Since the data is unbalanced, a threshold value
of 0.38 was used with the help of the InformationValue package.
Moreover, interaction terms were added to the model, but
no meaningful results were found, so the interaction terms were
removed from the model.
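The effect of moving the classification threshold from the usual 0.5 to 0.38 can be shown with a small Python sketch. The predicted probabilities are hypothetical, the function name `classify` is illustrative, and treating a probability equal to the threshold as class 1 (>50k) is an assumption.

```python
def classify(probabilities, threshold=0.38):
    """Label an observation 1 (>50k) when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities for five people.
probs = [0.10, 0.37, 0.38, 0.45, 0.90]
default_labels = classify(probs, threshold=0.5)
tuned_labels = classify(probs)

print(default_labels)  # [0, 0, 0, 0, 1]
print(tuned_labels)    # [0, 0, 1, 1, 1]
```

Lowering the threshold assigns more observations to the higher salary class, which is the usual motivation for tuning it when that class is the minority.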
Additionally, the train data is employed in the multiple
linear model design. The output is not displayed in this
section due to the final model's large number of variables.
But there are certain significant interpretations that should
be mentioned.
First of all, the entire model is significant because the F
statistic's p-value is less than 0.05. Also, considering the
VIF values, it is clear that there is no multicollinearity in
the model. Secondly, the AUC score is 0.895, so the model has
a rather good capacity to accurately separate observations
from the two groups. It is clear from the model's summary
output that every component that has not been removed is
statistically significant. Lastly, interpreting the
coefficients: if weekly working hours increase by 1 hour,
the chance of falling in the other salary category (>50k)
changes by 4%, and if a person's occupation is executive,
the chance of passing into the higher salary class (above
50k) increases by 142.66%.
2. Artificial Neural Networks
Another machine learning algorithm that can be applied
to classification problems is the artificial neural network
(ANN). In this work, the ANN is built in RStudio using the
"keras" and "tensorflow" packages. The ANN model makes
use of every variable in the train data. Prior to modeling,
categorical data are converted to dummy variables and the
data are scaled using the max-min scaling method.
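Max-min scaling maps each variable linearly onto the interval from 0 to 1, which keeps inputs on a comparable scale for the network. A minimal Python sketch follows; the function name `min_max_scale` is illustrative, and the input values are the age summary statistics from Table 1, used here only as sample numbers.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Age summary values from Table 1: minimum, Q1, median, Q3, maximum.
ages = [17, 28, 37, 48, 90]
scaled = min_max_scale(ages)
print([round(s, 3) for s in scaled])
```

The minimum age maps to 0 and the maximum to 1, with the median of 37 landing at (37 - 17) / (90 - 17).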
A linear activation function is used in the construction
of the ANN model. The smote sampling technique was used. Ten
hidden units are present in the model. The tuned
learning rate is 0.01 and the tuned dropout rate is 0.4.
Figure 8 Plot of Artificial Neural Network
The network is a two-layer NN model in which the first layer
has 21 neurons and the second one has 5 neurons. The
network uses 116 weights to produce the final output.
ANN      Sensitivity    F-Score
train    0.8029         0.8647
test     0.8516         0.9010
Table 3 The Comparison of ANN
Overfitting does not appear in the model. However, since
the test values are higher than the train values, the model
might be underfitting.
3. Support Vector Machine
The Support Vector Machine (SVM) is a machine learning
approach that can be applied to classification and
regression problems. It is employed in this study to address
the classification issue. The smote technique was utilized since
the data sets were unbalanced. Modeling is done using a
kernel function. The chosen SVM type is "svmRadial". The
tuning parameter 'sigma' was held constant at a value of
0.04.
The ability to create a feature importance plot with
SVM in caret is a useful feature. In SVM, the term "variable
importance" refers to a metric that expresses how much
each input variable (or feature) contributed to the model's
ability to make decisions. It can be seen in the next plot.
Figure 9 Plot Comparison of Age and hours per week
SVM      Sensitivity    F-Score
train    0.9031         0.9098
test     0.8387         0.8844
Table 4 The Comparison of SVM
There is neither an overfitting nor an underfitting issue
when the test and train outcomes are compared.
4. Decision Tree
Regression and classification problems can both be
solved using decision trees. The algorithm can be shown
as a graphical tree-like structure that predicts the outcomes
using a variety of customized parameters. The smote technique
was utilized since the data sets were unbalanced.
Decision Tree    Sensitivity    F-Score
train            0.9917898      0.9894
test             0.9032258      0.8833
Table 5 The Comparison of Decision Tree
There may be overfitting, since the sensitivity and F score
are almost 1 in the train dataset and there is a clear gap
between the train and test results.
5. Random Forest
Classification problems can be solved using the machine
learning method random forest. The algorithm is based on
trees. The smote technique was utilized since the data sets were
unbalanced. The final model for this problem is built after
"ntree" and "mtry" have been tuned.
Figure 9 Plot of importance of RF
The importance of the random forest's variables is
depicted in Figure 9. Important variables in the random
forest model include being a husband, single marital status,
age, education, sex, working hours per week and sector. The
multiple linear regression model also finds the
previously listed variables significant.
Random Forest    Sensitivity    F-Score
train            0.9031         0.9098
test             0.9032         0.9055
Table 6 The Comparison of RF
There is neither an overfitting nor an underfitting issue
when the test and train outcomes are compared.
6. XGBoost
The last machine learning algorithm utilized in this work
to classify salary is XGBoost. Categorical variables
were turned into dummy variables before running the
model. The smote technique was utilized since the data sets
were unbalanced.
First, the parameters need to be tuned. The model
is built using the train data once the parameters have been
adjusted and the ideal number of trees has been selected.
The training process of XGBoost can be seen in Figure 10.
XGBoost    Sensitivity    F-Score
train      0.8834         0.8997
test       0.9032         0.9121
Table 7 The Comparison of XGBoost
There is neither an overfitting nor an underfitting issue
when the test and train outcomes are compared.
H. Performance Comparison on Train and Test Dataset
It is crucial to look at train performance to
understand how well the model fits the data, and at
test results to understand how the model performs
on new data. In this study, sensitivity, F-score,
accuracy and specificity are utilized to compare the
performances. Because finding the minority class is more
important than overall accuracy in such
classification problems, sensitivity and F score were
given particular weight in the performance measure of the final
model. Sensitivity and F score calculations are shown
below.
Sensitivity = (TP) / (TP + FN)
F score = 2 * ((precision * recall) / (precision + recall))
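The two formulas can be evaluated from confusion-matrix counts as in the Python sketch below. The counts are hypothetical: they are not reported in the study, and were chosen here only because they reproduce the XGBoost test-row metrics listed in the results for a test set of 200 observations.

```python
def sensitivity(tp, fn):
    """Share of actual positives that the model identifies."""
    return tp / (tp + fn)

def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)

# Hypothetical confusion-matrix counts (not taken from the study).
tp, fp, fn, tn = 140, 12, 15, 33

accuracy = (tp + tn) / (tp + fp + fn + tn)
specificity = tn / (tn + fp)
print(round(accuracy, 4), round(sensitivity(tp, fn), 4),
      round(specificity, 4), round(f_score(tp, fp, fn), 4))
# prints: 0.865 0.9032 0.7333 0.9121
```

Note that the F score ignores the true negatives entirely, which is why it can stay high even when specificity is modest.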
IV. RESULTS
In this part, the results are shown for the following models:
i. Multiple Linear Regression
ii. Artificial Neural Networks
iii. Support Vector Machine
iv. Decision Tree
v. Random Forest
vi. XGBoost
                        Accuracy    Sensitivity    Specificity    F Score
Multiple Linear R.      0.8000      0.8571         0.6178         0.8671
Artificial Neural N.    0.8088      0.8030         0.8272         0.8647
Support Vector M.       0.8637      0.8998         0.7382         0.9073
Decision Tree           0.9837      0.9917         0.95811        0.9894
Random Forest           0.8637      0.8998         0.7329         0.9073
XGBoost                 0.8500      0.8834         0.74345        0.8997
Table 8 Performance Comparison of Train Data
                        Accuracy    Sensitivity    Specificity    F Score
Multiple Linear R.      0.8350      0.8838         0.6666         0.8925
Artificial Neural N.    0.8550      0.8516         0.8666         0.9010
Support Vector M.       0.8300      0.8387         0.8000         0.8844
Decision Tree           0.8150      0.9032         0.5111         0.8833
Random Forest           0.8600      0.9032         0.7111         0.9091
XGBoost                 0.8650      0.9032         0.7333         0.9121
Table 9 Performance Comparison of Test Data
Figure 10 Plot of training process of XGBoost

Table 8 shows the accuracy, sensitivity,
specificity and F score of each method on the train data,
with which all the models are constructed. The test
results are shown in Table 9. Since this is a classification
problem in which finding the minority class is more crucial,
sensitivity is considered first and then the F score. Lastly,
salary below 50k (0) is the reference category for evaluation.
Random forest and XGBoost gave similar results on both the
train and test data, and neither model shows overfitting
or underfitting problems. The best prediction on the test
data is provided by XGBoost, but Random Forest gave the same
sensitivity, and its specificity and F score values on the
test data are much closer to those on the train data.
Lastly, although the Decision Tree model has the highest
values on the train dataset, the test values do not confirm
this, so there may be an overfitting problem.
V. CONCLUSION
First, exploratory data analysis is carried out in this
work. Research questions are formulated, and various
graphical techniques are used to answer them. Missing
data processing and data cleaning procedures are applied
next. One hot encoding and feature selection with
Boruta follow, so that fewer features are retained with the
help of Boruta and also stepwise regression. The new data
set is built from these new features, the normalized response,
and the significant categorical factors. Then, different
cross validation methods are tried and "repeatedcv"
with 5 repeats is chosen. Since the response, the salary
variable, is imbalanced, several sampling methods are tried and
the "smote" method is chosen. Salary is then classified
with a variety of techniques using this prepared
data. These techniques show that Random Forest yields
better sensitivity and F score, consistent with the train set.
Additionally, the XGBoost model produces almost equally
encouraging results.
VI. REFERENCES
[1] Gopal, Krishna, et al. “Salary Prediction Using
Machine Learning.” International Journal of Innovative
Research in Technology, IJIRT (www.ijirt.org), 4 June
2021, ijirt.org/Article?manuscript=151548.
[2] Srivastava, Suyash, et al. “Comparing Various
Machine Learning Techniques for Predicting the Salary
Status.” EasyChair Home Page, EasyChair, 10 Feb. 2020,
easychair.org/publications/preprint/sRSZ.
[3] Matbouli, Yasser T., and Suliman M. Alghamdi.
“Statistical Machine Learning Regression Models for
Salary Prediction Featuring Economy Wide Activities and
Occupations.” MDPI, Multidisciplinary Digital Publishing
Institute, 12 Oct. 2022, www.mdpi.com/2078-2489/13/10/495.