An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers
Unai Garciarena∗, Roberto Santana∗
Faculty of Informatics, University of the Basque Country. P Manuel Lardizabal, 1 - 20018 Donostia-San Sebastián, Gipuzkoa, Spain.
Abstract
When data-mining real-world data, we often find ourselves facing observations that have no value recorded for some attributes. This can be caused by several phenomena, such as a machine's incapability to record certain characteristics or a person refusing to answer a question in a poll. Depending on that motivation, values gone missing may follow one kind of pattern or another, or describe no regularity at all. One approach to palliate the effect of missing data on machine learning tasks is to replace the missing observations. Imputation algorithms attempt to calculate a value for a missing gap, using information around it, i.e., the attribute and/or the value of the observation. While several imputation methods have been proposed in the literature, few works have addressed the question of the relationship between the type of missing data, the choice of the imputation method, and the effectiveness of classification algorithms that used the imputed data. In this paper we address the relationship among these three factors. By constructing a benchmark of hundreds of databases containing different types of missing data, and applying several imputation methods and classification algorithms, we empirically show that an interaction between imputation methods and supervised classification can be deduced. Besides, differences in terms of classification performance for the same imputation method in different missing data patterns have been found. This points to the convenience of considering the combined choice of the imputation method and the classifier algorithm according to the missing data type.
∗Corresponding author
Email addresses: [email protected] (Unai Garciarena), [email protected] (Roberto Santana)
This is the accepted manuscript of the article that appeared in final form in Expert Systems with Applications 89 : 52-65 (2017), which has been published in final form at https://doi.org/10.1016/j.eswa.2017.07.026. © 2017 Elsevier under CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Keywords: missing data, imputation methods, supervised classifiers, machine learning
1. Introduction
Missing values are ubiquitous in almost every type of real-world datasets. They can be particularly detrimental for certain applications of the datasets, especially when the distribution of the missing data (MD) is not uniform and a possible mechanism that could explain the lost values is unknown. Perhaps the most used among the non-trivial alternatives to deal with MD are imputation methods (IMs). These methods replace the missing values by estimates that can be taken from the database (DB), derived from statistics of known values (e.g., the mean of a given variable), or obtained using more sophisticated algorithms.
There is consensus on the importance of the application of IMs, especially when DBs with MD are used as a basis for learning supervised classifiers. However, the choice of the IM, and its impact on the classifier performance can be very dependent on the MD type. For example, an improper choice of the IM can bias the learned classifier, producing a low classification quality on test data.
When the problem of supervised classification is considered, these three elements are strongly intertwined. In this paper we analyze this relationship by investigating problems with different types of MD, addressed using a set of IMs with the final goal of supervised classification by means of different types of classifiers. Our aim is to determine to what extent there is a relationship between the choice of the IM and the precision of the classifiers when considering DBs that exhibit different types of MD.
Previous work (Batista & Monard, 2003; Luengo et al., 2012) has analyzed the relationship between the IMs used for treating MD and classifiers. Batista & Monard (2003) evaluated four IMs for two different classifiers, concluding that the choice of the IM influences the performance of the classifiers. A more in-depth study on the relationship between IMs and classifiers was presented by Luengo et al. (2012). The authors conducted an extensive evaluation of classifiers and IMs on real-world DBs and concluded that the choice of the IM should indeed be conditioned on the type of classification method used.
In this paper we go beyond the analysis of the relationship between IMs and classification algorithms, and consider as another factor the particular characteristics of the MD. We hypothesize that the three previously mentioned factors can influence the classification results and should be considered in their interaction. We investigate this hypothesis by devising procedures that generate DBs with different types of MD, and using them as a benchmark, we evaluate the effect of the MD type and the IMs on the performance provided by the classifier. Another contribution of our work is the simultaneous use of real-world DBs, which are used as a basis to construct the benchmark, with an artificially generated MD type which is introduced in the original DB. Following this strategy, we can control the characteristics of the MD and evaluate the effect on the other factors analyzed. In our investigation we also evaluate an extensive number of classifiers, including many of those investigated in previous work and some of the more recent classification approaches.
The paper is organized as follows. In the next section, some essential background on the main concepts covered in the paper is given. Related work, emphasizing the connection with our proposal, is discussed in Section 3. Section 4 gives a formal presentation of the methods used to generate the different types of MD. This section also describes the databases selected to evaluate the relationships between the methods and algorithms. Sections 5 and 6 respectively present the imputation and classification methods investigated. In Section 7 we describe the experimental framework, the results of the experiments, and discuss some of our findings. Section 8 concludes the paper and presents some lines of future research.
2. Background
2.1. Missing data types
Many different reasons can cause missing data (MD) in real-world databases. Identifying any pattern in the MD values is a key aspect at the time of conceiving methods to deal with the missing observations. In particular, the type of MD can directly impact the quality of the predictions of the classification methods applied to the data. Therefore, several works have been devoted to characterizing the types of MD, proposing methods to detect these types and suggesting algorithms for imputation. In this section we review the most commonly accepted classes of MD and their expected effect on the behavior of supervised classification techniques (Batista & Monard, 2003; Gelman & Hill, 2006; Hernández-Pereira et al., 2015; Blomberg & Ruiz, 2013; Luengo et al., 2012):
• Missing Completely At Random (MCAR): When the database's measurement failures occur randomly, there is no specific pattern to be identified. The impact of MCAR on a classification algorithm will depend on the MD distribution over the data. The more uniform the distribution of the MD is, the less bias is expected to be introduced in the database.
• Missing At Random (MAR): MD is cataloged as MAR when a pattern can be identified, i.e., we can find a common factor in all the observations with missing values. For example, we find that when a certain variable (with no MD) takes extreme values for an observation, two other variables tend to be missing for that same observation.
• Missing Not At Random (MNAR): This MD type is similar to MAR. However, in this case the values causing others to be missing are not known. This can have two origins:
  – Missingness depending on unobserved Variables (MuOV): One of the reasons these values are not known can be that simply they were not observed.
  – Missingness depending on its Value Itself (MIV): An element can be missing depending on its value itself. This could happen when a variable takes a value out of its representation range.
In general, it is not possible to identify the MCAR type of MD since in real databases there is no way to track the cause of this MD. MCAR can be caused by a huge variety of reasons, from data loss during an information transference, to a person's refusal to provide personal data in a poll, etc. As stated above, assuming that the missing values are uniformly distributed, the dataset will not experience a considerable loss of information. As long as the amount of missing values is not significant, even discarding observations containing MD will not necessarily have an impact on posterior classification. However, even if the amount of information lost to the missing observations is small, the quality of the IM can be as important as it is for the other MD types.
The MAR kind of MD is not as common as MCAR, but it is easier to infer its origin by studying other variables of the dataset. For example, in a situation in which people are asked about their habits and health, some information about sedentary lifestyle might be available. However, while some subjects may be open to share information about their weight, other subjects (more likely those with an overweight condition) might be more reluctant to disclose this type of information. This example illustrates situations in which a cause of MAR can be inferred from an analysis of the characteristics of the database.
The MAR type of MD can be a potential source of problems for the performance offered by the classification algorithms. Since in this case there is an underlying reason for the MD, it is likely that observations containing MD will be similar to each other and will be tagged in the same class. This could lead to an unbalanced database that will potentially affect the classification. In this case, discarding data is not an advisable option, and the use of IMs is a requirement.
Finally, MNAR presents a considerably more difficult situation. Following the previous example, the task would become much more tedious if we had not asked about other medical and lifestyle parameters (MuOV). Another scenario needs to be addressed when an individual is ashamed and refuses to disclose information about the amount of money he or she spends on drugs. This variable most likely depends only on itself. In this second case we would have MIV. These two situations are the most problematic ones, since MNAR can be as harmful as MAR for the data, but it can easily be misidentified as MCAR, as a result of the impossibility to identify a pattern in the unobserved variables.
2.2. Imputation methods
There are two main ways of handling MD. Ignoring observations with missing data is generally a good choice when the number of missing values is relatively small and the MD is homogeneously distributed. Nevertheless, when these measurement failures are concentrated in a single variable, or when ignoring them would entail a large loss of information, we consider techniques to fill the gaps. This is essentially what imputation does. Several strategies have been proposed for this purpose and they can exhibit important differences in terms of complexity and output quality (Brownstone & Valletta, 2001; Liu & Brown, 2013; Batista & Monard, 2003; Lakshminarayan et al., 1999). Section 5 will present a number of imputation methods relevant for our work.
2.3. Classification problems and supervised classification algorithms
Classification is the task of learning a target function that maps an attribute collection to a predefined class. Algorithms that solve this problem could be defined as processes that generalize patterns from a set of given observations. These operations, which are able to both predict and describe data, can be divided into two major branches: Supervised and Unsupervised Classification (SC and UC). In a supervised classification problem (Tan et al., 2006; Hilbert & López, 2011), a set of (training) data is given and, for each case in the training set, the associated class (or label) is known. The classification problem consists of using the information the training data contains to infer or predict the class of new cases that are not labeled.
We focus on the analysis of SC algorithms and how their behavior is impacted by MD and imputation methods. This supervised classification paradigm can be very sensitive to the MD problem. We expect that if a considerable amount of the data is missing for a certain variable in a certain class, the classifier may struggle to offer an acceptable behavior when tagging an observation with that mentioned class. However, the way in which different types of MD influence the behavior of the classifiers has not been previously studied in detail.
We notice that, considering the classification scenario, MD can be present in the training data, for which we know the class, and also in the test data, for which information about the label is unknown. In this paper, we consider these two situations as the same scenario, i.e., we do not consider methods that use class information at the time of implementing the imputation algorithm.
In Section 6, we present the classifiers that will be used in the experimental part of this paper.
3. Related work
Not many recent studies have researched the interaction between missing data configurations, MD handling methods, and supervised classification algorithms from this approach. In the following paragraphs, some papers that have investigated this question are reviewed.
3.1. General data
The most exhaustive investigation on the joint behavior of IMs and classification algorithms is the paper by Luengo et al. (2012). They performed an analysis with 23 classification algorithms (grouped in three classes) combined with 14 ways of dealing with MD. They selected 21 DBs, all of them containing natural missing values, ranging from 0.06 to 21.82 percent of the total values in the tables. Since they had no knowledge about the MD distribution, they assumed it followed a MAR pattern. This is a fundamental difference with the approach we follow in this paper, since our focus is on the influence of the MD type, and therefore the methods we propose to create the testing benchmark are completely different from, and more sophisticated than, the one proposed in Luengo et al. (2012). However, the experiments conducted by the authors proved the positive effect of imputing data, and found evidence of the relation between the classifier type and the IM. The same authors developed the work presented in (Luengo et al., 2010), which focuses on the Radial Basis Function Network (RBFN) classifier, pointing out the improvement of the classifier results when combined with an Event Covering imputer, proposed by Wong & Chiu (1987).
Batista & Monard (2003) studied the effects of imputation over the C4.5 and CN2 classifiers. They chose four databases from which 3 had no MD, and inserted varying percentages of missing values (10, 20, 30, 40, 50 and 60%) and then proceeded to classify them. The observations containing MD were removed, as they represented only about 2% of the total of that DB. The DBs with missing values were classified with no imputation (note that C4.5 and CN2 have their own way of working with MD), mean/mode imputation, and K-nearest neighbors imputation (KNNI) (with k = 10). 10-NNI showed better performance than the other simple treatments, but the experiments found limitations in the way MD was inserted, since only some of all the attributes chosen were affected. Also, only two basic IMs were considered in Batista & Monard (2003).
Acuna & Rodriguez (2004) evaluated the effect of three methods of dealing with MD on the misclassification error rate. They used the mean, median and KNNI IMs, and also investigated the results of case deletion. Then they tested the results with two classifiers, Linear Discriminant Analysis (LDA) and K-nearest neighbors (KNN). This work adopted 12 datasets, 4 of which contained natural MD. These four DBs were pretreated in order to make their starting points similar by applying the IM. The results achieved by Acuña and Rodríguez showed that the IM effect on accuracy has a higher dependence on the problem rather than on the classification algorithm. These results point in the direction of the question we investigate in this paper, since one of the features characterizing a given DB is the MD type. However, the work presented by Acuna & Rodriguez (2004) did not address this question explicitly and the number of IMs and classification methods was limited in comparison to the study we present here.
Farhangfar et al. (2008) examined the impact of performing MD imputation as a preprocessing step for posterior classification. They combined different versions of four imputation methods (Hot Deck, Naïve Bayes, Polynomial multiple Regression and Mean) with six classifiers (Ripper 2, C4.5, RBFN, Support Vector Machine (SVM), KNN and naïve Bayes). They selected 15 DBs and considered 5, 10, 20, 30, 40 and 50% of MCAR-type MD in their experiments. They concluded that the improvement in the accuracies obtained from the dataset with MD and the imputed one had no relation with the percentage of MD. Also, differences in the improvement were shown for different IM-classification algorithm combinations, which led them to conclude that no universally best IM exists. The paper finally resolves that imputation is beneficial for machine learning tasks, overall.
Hruschka Jr et al. (2007) introduced two IMs based on Bayesian Networks and contrasted them with four other classic methods, namely, Expectation-Maximization, Data Augmentation, Decision Trees, and Mean/Mode. Four datasets were used as a benchmark, which combined natural and artificial missing values introduced in the experiments. Four classifiers were considered in the experiments. This work concluded that IMs obtaining values close to the original data (comparing imputations in artificially generated MD to the values originally in those cells) did not necessarily produce better results when classifying.
Song et al. (2008) performed an analysis similar to the one presented by Batista & Monard (2003), in which the effect of KNN-imputation over C4.5 classification was investigated. However, in their paper they considered three MD types (MCAR, MAR and NMAR). Their conclusions regarding IM-C4.5 coincide with those found by Batista & Monard (2003), and also state that the MD mechanism influences the classifying task. Nevertheless, an in-depth
4.1.3. MIV
This algorithm is very similar to the one used to create the MAR pattern of MD, but instead of generating variables that will lose their values depending on the values of a causative variable, it is the causative variable itself which loses values. The pseudocode is shown in Algorithm 3.
Input:
    data: Database
    mdp: MD percentage
    nV: number of variables losing their values
Output: Database with mdp% generated MD
begin
    x = numObservations(data)
    y = numVariables(data)
    causatives = random([0, y], nV)
    for i ∈ [0, length(causatives)] do
        aux = data[:, causatives[i]]
        for j ∈ [0, (x · y · mdp/100) // nV] do
            observations[j] = minIndex(aux)
            aux[observations[j]] = maxInt
        end
        for j ∈ [0, length(observations)] do
            data[observations[j], causatives[i]] = "NaN"
        end
    end
    return (data)
end
Algorithm 3: MIV generating algorithm.
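As a hedged illustration of Algorithm 3, the following Python sketch removes the lowest values of each randomly chosen causative variable. The function name `generate_miv`, the list-of-rows data layout, the `rng` argument, and the cap of the per-variable count at the number of observations are assumptions made for this example, not part of the original method.

```python
import math
import random


def generate_miv(data, mdp, n_v, rng=None):
    """Return a copy of `data` (a list of numeric rows) with about mdp% of
    all cells set to NaN, following the MIV idea: each chosen variable
    loses its own lowest values."""
    rng = rng or random.Random()
    x, y = len(data), len(data[0])
    out = [list(row) for row in data]
    # Randomly pick the n_v causative variables (columns).
    causatives = rng.sample(range(y), n_v)
    # Number of values each causative variable loses (capped at x: assumption).
    per_var = min(x, int(x * y * mdp / 100) // n_v)
    for col in causatives:
        # Equivalent to repeatedly taking the index of the minimum value.
        order = sorted(range(x), key=lambda r: data[r][col])
        for r in order[:per_var]:
            out[r][col] = math.nan
    return out
```

Sorting the row indices once gives the same selection as the repeated minimum search of the pseudocode, since ties are resolved by first occurrence in both cases.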
4.1.4. MuOV
This algorithm is also quite similar to the one proposed for generating the MAR MD type, but in this case the causative variable is unobserved. Therefore, the observations that will have missing values will be chosen randomly. The pseudocode is shown in Algorithm 4.
Input:
    data: Database
    mdp: MD percentage
    nV: number of variables losing their values
Output: Database with mdp% generated MD
begin
    x = numObservations(data)
    y = numVariables(data)
    MDVariables = random([0, y], nV)
    for i ∈ [0, (x · y · mdp/100) // nV] do
        observations[i] = random([0, x])
    end
    for i ∈ [0, length(observations)] do
        for j ∈ [0, length(MDVariables)] do
            data[observations[i], MDVariables[j]] = "NaN"
        end
    end
    return (data)
end
Algorithm 4: MuOV generating algorithm.
As previously indicated, this algorithm follows the same structure, but since in this case the causative variables are unknown, the observations to lose their values will be chosen randomly.
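A minimal Python sketch of the MuOV procedure could look as follows. The name `generate_muov`, the list-of-rows layout, and the `rng` argument are hypothetical. Rows are drawn without replacement here so that the number of removed cells is exact, which is an assumption: the pseudocode draws each row index independently.

```python
import math
import random


def generate_muov(data, mdp, n_v, rng=None):
    """Return a copy of `data` (a list of numeric rows) in which randomly
    chosen observations lose their values in n_v randomly chosen variables."""
    rng = rng or random.Random()
    x, y = len(data), len(data[0])
    out = [list(row) for row in data]
    # Variables that will contain MD.
    md_variables = rng.sample(range(y), n_v)
    # Number of affected observations (capped at x: assumption).
    n_obs = min(x, int(x * y * mdp / 100) // n_v)
    # The causative variable is unobserved, so rows are picked at random.
    observations = rng.sample(range(x), n_obs)
    for r in observations:
        for c in md_variables:
            out[r][c] = math.nan
    return out
```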
Some of our methods to introduce the missing data seek to simulate the different types of MD which depend on visible or hidden dependencies between the variables in the DB. Therefore, some of the algorithms select the observations where MD will be introduced based on the values of the causative variable. We notice that other criteria to define the MD types have been proposed in the literature. In particular, the MD patterns could be defined in terms of the percentage of MD within a single observation (Deb & Liew, 2016).
4.2. Description of the databases investigated
We have selected a set of 10 datasets from the UCI Machine Learning Repository (Lichman, 2013) built in completely different contexts in order to achieve results which are as generic as possible, avoiding biasing the IMs and classifiers with a specific behavior from the data. We have databases observing diverse natural aspects (Forest, Biodeg, Climate, Leaf), medical matter measurements (Diabetic, BUPA, Thoracic), credit denial/approval (German), vehicle dimension analysis (Vehicle) and image pixel interpretation (Segmentation).
Table 1 contains a structural description of the DBs. The first three columns (Categorical, Integer and Real) provide information about the type of the elements contained in the DB. The following two columns (N. Feats., N. Cases) refer to the number of measurements of each observation (including the class tag) and the number of observations the DB contains. The next column shows the index each DB has been assigned in order to simplify future references to distinct datasets. The last two columns refer exclusively to the class of the data: n is the number of classes, while Hn represents the normalized entropy, computed as:
$$H_n(p) = -\sum_i \frac{p_i \cdot \log_b(p_i)}{\log_b(n)}$$
where $p_i$ is the probability of class $i$, $b$ is the logarithm base (in our case we used base $e$, the natural logarithm), and $n$ is the number of classes in the data. This procedure produces a number between 0 and 1. The nearer it is to 1, the more uncertainty we find in the class, and thus the more evenly distributed the classes are; the nearer it is to 0, the more unbalanced towards one certain class the data is. The coefficient is normalized through the term $\log_b(n)$, which depends on the number of classes. We used this measure in order to equilibrate measurements between binary and multi-class databases.
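As a small worked illustration of the normalized entropy, the sketch below computes it from a list of class labels using the natural logarithm. The function name `normalized_entropy` is hypothetical, the class probabilities are estimated as relative frequencies, and returning 0 for a single-class input is an assumption made to avoid dividing by log(1) = 0.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Entropy of the class distribution divided by log(n), with n the
    number of distinct classes found in `labels`."""
    counts = Counter(labels)
    n = len(counts)
    if n < 2:
        return 0.0  # a single class carries no uncertainty (assumption)
    total = len(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n)
```

A perfectly balanced labeling yields 1, while a heavily unbalanced binary labeling (for instance, nine cases of one class and one of the other) yields a value below 0.5.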
Database      Categ.  Int.  Real  N.Feats.  N.Cases  DB Index  n   Hn
Biodeg        ✗       ✓     ✓     41        1055     5         2   0.92
BUPA          ✓       ✓     ✓     7         345      4         2   0.98
Climate       ✗       ✗     ✓     18        540      8         2   0.42
Diabetic      ✗       ✓     ✓     20        1151     7         2   1
Forest        ✓       ✓     ✓     27        326      6         4   0.91
German        ✓       ✓     ✗     20        1000     3         2   0.88
Leaf          ✗       ✗     ✓     16        340      1         30  1
Segmentation  ✗       ✗     ✓     19        2310     9         7   1
Thoracic      ✓       ✓     ✗     17        470      10        2   0.61
Vehicle       ✗       ✓     ✗     18        946      2         4   1
Table 1: Description of the original datasets used to generate the benchmark.
5. Imputation methods
IMs differ in the class of DBs they can be applied to, their computational complexity, and the sophistication of the methods used to replace the MD. There are straightforward techniques to replace missing observations (Batista & Monard, 2003; Lakshminarayan et al., 1999; Gelman & Hill, 2006) that simply copy values from other observations (using similarity as a criterion, for example). Other more elaborate imputation algorithms use Parameter Estimation (PE) (Yuan, 2010; Honaker & King, 2010) to replace the missing values from the available data. The first class of methods requires less computational time and their results could be sufficiently good for some DBs. When these simple strategies do not produce results that fit the expectations, PE can presumably provide more accurate results.
In this paper we have selected 8 imputation methods, including some of the most frequently used and more sophisticated approaches:
The initial three methods considered are Mean-Mode, Median and Most Frequent Value imputation. These methods have self-explanatory names, as they calculate the statistics and simply ascribe them to gaps.
Last Value Carried Forward imputation is also self-explanatory. This function searches the variable for the last available value, and uses it as its guess for the missing value.
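The simple strategies described so far can be sketched in a few lines of Python for a single numeric variable. The helper names `impute_constant` and `impute_lvcf` are hypothetical, missing values are assumed to be encoded as NaN, and leaving leading gaps untouched in the carried-forward variant is an assumption, since no previous value exists for them.

```python
import math
import statistics


def _observed(column):
    """Values of the column that are not missing."""
    return [v for v in column if not math.isnan(v)]


def impute_constant(column, statistic):
    """Fill every gap with one statistic of the observed values, e.g.
    statistics.mean, statistics.median or statistics.mode."""
    fill = statistic(_observed(column))
    return [fill if math.isnan(v) else v for v in column]


def impute_lvcf(column):
    """Last Value Carried Forward: each gap takes the last observed value."""
    out, last = [], math.nan
    for v in column:
        if not math.isnan(v):
            last = v
        out.append(last)
    return out
```

For example, `impute_constant(column, statistics.median)` performs median imputation, while `impute_lvcf(column)` depends on the order of the observations, which is why it suits data with a temporal component.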
Interpolation's working system can also be easily deduced from its name, but this method considers a parameter that may change its complexity, thus affecting its computational cost. The interpolation method calculates a function that fits both the previous and later values of a missing value stretch and fills it using the function.
Hot Deck imputation can also have a high computational cost, since it depends on the number of observations and variables the DB has. This method computes the distance (e.g., Euclidean distance) between an observation with one (or more) missing value(s) and the remaining observations, and replicates the value that the nearest observation has for that variable.
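A minimal sketch of Hot Deck imputation, under stated assumptions, is given below. The function name `hot_deck` is hypothetical; the distance is a Euclidean distance computed only over the variables observed in both rows and averaged over their count, and only rows that have the required variable observed are accepted as donors. These choices are illustrative, not necessarily those of the implementation used in the experiments.

```python
import math


def hot_deck(data):
    """Fill each NaN cell with the value of the same variable taken from
    the nearest observation that has that variable observed."""
    out = [list(row) for row in data]
    for i, row in enumerate(data):
        for c, value in enumerate(row):
            if not math.isnan(value):
                continue
            best_value, best_dist = None, math.inf
            for k, donor in enumerate(data):
                if k == i or math.isnan(donor[c]):
                    continue
                # Compare only variables observed in both observations.
                shared = [(a, b) for a, b in zip(row, donor)
                          if not (math.isnan(a) or math.isnan(b))]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                if dist < best_dist:
                    best_value, best_dist = donor[c], dist
            if best_value is not None:
                out[i][c] = best_value
    return out
```

The nested loops make the cost grow with both the number of observations and the number of variables, which matches the computational cost noted above.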
Iterative Imputation basically imputes the same missing values multiple times. The common operation of these methods is to make initial simple guesses for the missing values (e.g., mean imputation) and then reimpute the same missing values using other (usually more complex) methods. Once all the variables have been reimputed, the cycle is repeated. An iteration limit or a tolerance threshold could be set as a halting condition.
Multiple Imputation also belongs to the computationally complex method spectrum. This strategy duplicates the DB m times and imputes the replicas separately. The submethods used to perform imputation on these replicas may or may not be different strategies. Next, one of two strategies can be followed. The first one applies the analysis algorithm (Supervised Classification in our case) to each replica and then combines the results. The second option first performs the combination step, forming one single DB again, which makes it possible to know the variance of the imputed values. Then the analysis is applied. The recommended value of m has increased (and is still increasing) as computational capacity has risen: the classical advice was 3 ≤ m ≤ 5, while some authors, e.g., Van Buuren (2012), recommend 20 ≤ m ≤ 100.
This work has used two imputation methods that perform Multiple Imputation: MICE (Multivariate Imputation by Chained Equations) (Buuren & Groothuis-Oudshoorn, 2011) and Amelia (Honaker et al., 2011).
We notice that some of the IMs considered here are more appropriate for a particular type of data (e.g., the interpolation method is conceived for data with a temporal component). However, at the time of applying the IMs we do not take into account the characteristics of the data. We expect that if a given IM is not appropriate for one DB, this fact will be translated in the result obtained by the classifier for that particular DB.
6. Supervised classification methods
In this section we describe the classification methods evaluated in the paper. Details about the implementation are described in Section 7.1.2. When no details about the parameters used by the classifiers are provided, it is assumed that they were applied with their default parameters in the used implementations. We use the following classifiers:
1. Regularized logistic regression with norm l1 (Ll1) (Yu et al., 2011)
2. Regularized logistic regression with norm l2 (Ll2) (Yu et al., 2011)
3. Linear discriminant analysis (LDA) (Least Squares) (Fisher, 1936)
4. Quadratic discriminant analysis (QDA) (reg param = 0.01) (Friedman, 1989)
5. Deep Neural Network (hidden layers = 2, optimizer = Adagrad, activation = ReLu, max steps = 20,000) (Bishop, 1995; Géron, 2017) (More details available in Section 6.1)
6. Support Vector Machine (kernel = linear, max iter = 5000000) (Joachims, 1998)
7. Support Vector Machine (kernel = Polynomic, degree = 3, gamma = 0.01, max iter = 5000000) (Joachims, 1998)
8. Radial Basis Function (gamma = 0.1, max iter = 5000000) (Park & Sandberg, 1991)
9. Gaussian Naïve Bayes classifier (GNB) (Manning et al., 2008)
10. Gradient boosting (GB) (max depth = 11) (Friedman, 2001) with number of trees n = 100 and maximum depth of trees maxd = 11
11. Random forests (RF) (Breiman, 2001) with n = 100 and maxd = 11
12. Decision tree (DT) (Olshen et al., 1984) maxd = n
13. k-nearest neighbor classifier (1NN) algorithm (Aha et al., 1991) with k = 1 and using the Euclidean distance
14. k-nearest neighbor classifier (3NN) algorithm (Aha et al., 1991) with k = 3 and using the Euclidean distance
Some of these classifiers consider interactions between the features, some others incorporate regularization techniques, or take into account similarity metrics between the data. According to Luengo et al. (2012), classifiers 13 and 14 are lazy classifiers, while 11 and 12 are tagged in the decision tree class. The rest, 1-10, are model constructing classifiers.
6.1. Deep Neural Network Classifier
The fifth classifier used in this experimentation is based on an artificial neural network architecture, a multilayer perceptron (MLP), to be more specific. These classifiers are composed of neurons, which are nothing more than simple functions that receive multiple inputs and produce one output. These neurons are organized in layers, which are connected with each other in a sequential manner. Two neurons in one layer cannot be connected, while all the nodes in two adjacent layers are fully interconnected via varying weights.
Finally, the input layer must have as many neurons as the database has variables, while the output layer should have as many neurons as there are classes in the data, following a softmax architecture. The values produced by these neurons are to be treated as the probabilities of an observation introduced in the input layer belonging to a certain class. The training process consists of recomputing the previously mentioned weights based on the knowledge of a desired output given a certain input (Pal & Mitra, 1992; Bishop, 1995; Géron, 2017).
In our case, the structure was composed of two hidden layers with 2n/3 and n/3 nodes respectively, n being the number of variables in a database.
6.2. F-1 Score
To measure the results of the classification methods, we use the F-1 score. This metric is computed as follows:
$$F_1 = 2 \cdot \frac{1}{\frac{1}{r} + \frac{1}{p}}$$
where r stands for recall, which is the result of dividing the labels correctly predicted as true by all the actual true labels, and p represents precision, computed by dividing the correctly predicted true labels by all the labels predicted as true (including the wrongly predicted ones):
$$r = \frac{\text{True Positive}}{\text{Original true labels}}, \qquad p = \frac{\text{True Positive}}{\text{All labels predicted as true}}$$
As can be seen, this metric is only intended for binary classification problems, while our benchmark contains multi-class DBs. For non-binary classification problems, we have computed the metric for each class separately and then taken the weighted average. More details about the implementation of this method are provided at the end of Section 7.1.2.
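The computation described in this section can be sketched as follows. The function names `f1_binary` and `f1_weighted` are hypothetical, and weighting each class by its number of true cases is an assumption about what "weighted average" means here; returning 0 when a class has no true positives is a convention adopted to avoid division by zero.

```python
from collections import Counter


def f1_binary(y_true, y_pred, positive):
    """F-1 score treating `positive` as the true class and the rest as false."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # convention: no true positives gives a score of 0
    recall = tp / sum(t == positive for t in y_true)
    precision = tp / sum(p == positive for p in y_pred)
    return 2 * 1 / (1 / recall + 1 / precision)


def f1_weighted(y_true, y_pred):
    """Per-class F-1 scores averaged with weights proportional to the
    number of true cases of each class (assumed weighting)."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_binary(y_true, y_pred, c) * n / total
               for c, n in support.items())
```

For a binary problem scored on its positive class, `f1_binary` reproduces the formula above directly; `f1_weighted` extends it to the multi-class DBs of the benchmark.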
7. Experiments
In this section we conduct an in-depth empirical investigation of the relationship among the MD types, IMs and classification methods. Among the questions we address in this section are the following:
1. Do different MD types produce different effects on the prediction quality of the classifiers?
2. What is the overall behavior of the IMs when all DBs, MD types, and classification methods are considered?
3. Is it possible to identify any joint effect of IMs and classifiers on the score achieved for different DBs? Does this potential effect depend on the type of MD?
7.1. Experimental Framework
To study the MD-IM/PE-SC relation, ten non-temporal databases from the UCI Machine Learning Repository (Lichman, 2013) without missing values were chosen. Using these DBs, we applied the methods proposed for inserting different types of MD as presented in Section 4. The next step consisted of imputing the missing values using eight different IMs. Finally, the benchmark of new DBs with the inserted MD was used to apply the different classifiers, which were validated using stratified 5-fold cross validation.
To guarantee a representative sample of the possible DBs with different MD types, we created, for each of the 10 DBs, five different variants, using a 5-fold stratified methodology. We thus separated the training and testing parts at the beginning of the algorithms. We did this in order to limit the information used by the imputer to the training and testing sets separately and to totally control the percentage of missing data present in each partition. For each one of these 5 variants produced by the 5-fold stratified strategy and each of the 4 MD types, 30 different DBs containing MD were created. This was done using Algorithms 1-4. Therefore, we obtained 10 DBs × 5 folds × 4 MD types × 30 instances = 6,000 DBs with missing values. From now on, when we refer to the "benchmark" we refer to this set of 6,000 DBs, 1,500 for each of the four MD types. Due to the stochastic nature of the algorithms used to generate the MD, all DBs were different. Even if they come from only 10 original DBs, the methods used to insert missing values make each DB unique. Again, note that MD was separately introduced on both training and testing partitions, in order to guarantee that the percentage of missingness is exactly the same.
In the next phase we focus on studying the behavior of the IMs and
classifiers on the benchmark. The eight different IMs described in Section 5
were applied to the benchmark, to obtain a total of 6,000 DBs × 8 IMs = 48,000
different, complete DBs. Finally, all these databases were classified by
applying the 14 classification algorithms described in Section 6, obtaining a
final output of 48,000 DBs / 5 folds × 14 classifiers = 134,400 distinct
scores. Figure 1 captures this complete process for one single DB.
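The sizes quoted for the benchmark can be re-derived by enumerating the experimental grid. The following sketch only checks the arithmetic; the variable names are illustrative and do not come from the original experiments.

```python
from itertools import product

n_dbs, n_folds, n_md_types, n_instances = 10, 5, 4, 30
n_ims, n_classifiers = 8, 14

# One benchmark DB per (original DB, fold variant, MD type, random instance).
benchmark = list(product(range(n_dbs), range(n_folds),
                         range(n_md_types), range(n_instances)))

n_benchmark = len(benchmark)                 # DBs with missing values
per_md_type = n_benchmark // n_md_types      # DBs for each MD type
n_imputed = n_benchmark * n_ims              # complete DBs after imputation
# The 5 fold variants of a DB are combined into one cross-validated score.
n_scores = n_imputed // n_folds * n_classifiers

print(n_benchmark, per_md_type, n_imputed, n_scores)
```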
Figure 1: General scheme describing the experimental framework used to evaluate the influence of MD types, IMs, and classification algorithms on
the classifier performance.
IMs        Mean  Median  M. Freq.  LVCF  Interp.   HD  MICE   EM  Total
Mean          0       2         2    16       17    9     9    9     64
Median       20       0         1    19       20   12    11   14     97
M. Freq.     68      53         0    47       49   29    31   32    309
LVCF         36      30        13     0        9   11    12   10    121
Interp.      38      32        16    12        0   12    10   11    131
HD           68      62        50    61       59    0    15   16    331
MICE         68      63        48    65       60   29     0    6    339
EM           65      63        50    66       67   40    28    0    379
Total       363     305       180   286      281  142   116   98   1771
Table 5: Table 4 results filtered by MIV MD type.
IMs        Mean  Median  M. Freq.  LVCF  Interp.   HD  MICE   EM  Total
Mean          0       2        18    20       29   14    14   11    108
Median       13       0        15    21       31   14    15   13    122
M. Freq.     14       6         0    11       22   11    13   11     88
LVCF         16      15        22     0       19   11    11   10    104
Interp.      16      16        16    10        0   11    11   11     91
HD           61      61        59    65       70    0     8   11    335
MICE         61      61        58    65       70   27     0   13    355
EM           61      61        62    68       71   15    15    0    353
Total       242     222       250   260      312  103    87   80   1556
Table 6: Table 4 results filtered by MAR MD type.
IMs        Mean  Median  M. Freq.  LVCF  Interp.   HD  MICE   EM  Total
Mean          0       4        31    41       54   20    20   14    184
Median       15       0        38    44       53   18    18   15    201
M. Freq.      8       5         0    10       28   13    13   13     90
LVCF         11      11        18     0       27   14    13   13    107
Interp.      14       9        13     9        0   13    12   13     83
HD           56      56        59    59       64    0    12   32    338
MICE         59      58        61    62       64   17     0   29    350
EM           58      55        65    65       67   15    12    0    337
Total       221     198       285   290      357  110   100  129   1690
Table 7: Table 4 results filtered by MuOV MD type.
IMs        Mean  Median  M. Freq.  LVCF  Interp.   HD  MICE   EM  Total
Mean          0       7        45    56       59   19    16   12    214
Median       19       0        51    58       61   24    17   12    242
M. Freq.     12       8         0    44       45   12    12    9    142
LVCF         17      13        26     0       36   19    12   11    134
Interp.      11      11        22     9        0   17    12   10     92
HD           49      47        59    58       64    0    11   21    309
MICE         63      60        67    63       68   40     0   38    399
EM           59      52        70    68       69   39    15    0    372
Total       230     198       340   356      402  170    95  113   1904
Table 8: Table 4 results filtered by MCAR MD type.
The analysis of Table 4 reveals some facts about the behavior of the IMs:
• Three clear groups of IMs can be distinguished when observing overall
ratings.
• The complex methods have a largely positive score when subtracting (o−)
from (o+), with values near 800-1,000.
• The second group, composed of the simple methods, produced negative
values when performing the same subtraction, by a large margin. Regarding
solely this group, we see that median imputation stands out from the
rest, with a negative assessment of -261, while the other two
obtained around -500.
• Finally, the temporal methods performed the worst, by a large margin.
Analysis of Tables 5-8 shows how the performance of the IMs is related to
the MD types.
• Starting from the most biased MD type, MIV (Table 5), we see how its
results vary from what could be observed in the overall Table 4. To start,
the temporal imputers seemed to perform better than the simple methods,
as opposed to what Table 4 showed. Additionally, Most Frequent
imputation excelled in this category, as it obtained almost as good results
as the complex methods, which, as expected, topped the ranking.
• Regarding MAR (Table 6), results more similar to the global scores were
obtained. In this case, LVCF was the outlier, as it could be grouped with
the simple methods.
• As for MuOV (Table 7), the results also follow a pattern similar to the
general rankings. In this case, a fourth cluster could be distinguished,
as Most Frequent and LVCF stand considerably close to each other, and
relatively far from their original groups.
• Finally, the absolutely random MCAR (Table 8) offered considerably
different results. The result distribution changed substantially, as MICE and
EM remained at the top, but the other complex method, HD, stood closer
to Median imputation than to the top group it is supposed to belong
to. Mean appears to stand in no-man's land, while Most Frequent and
LVCF could be grouped together, leaving Interpolation isolated.
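As a worked example of the subtraction used above, the sketch below computes a net score per IM from Table 5. It relies on an assumption that this section does not state explicitly: that (o+) corresponds to the row total of an IM and (o−) to its column total. Under that reading, the MIV-specific observations (temporal imputers above Mean and Median, Most Frequent close to the complex methods) are reproduced.

```python
import numpy as np

ims = ["Mean", "Median", "M. Freq.", "LVCF", "Interp.", "HD", "MICE", "EM"]

# Body of Table 5 (MIV), without the Total row and column.
table5 = np.array([
    [0,  2,  2,  16, 17, 9,  9,  9],
    [20, 0,  1,  19, 20, 12, 11, 14],
    [68, 53, 0,  47, 49, 29, 31, 32],
    [36, 30, 13, 0,  9,  11, 12, 10],
    [38, 32, 16, 12, 0,  12, 10, 11],
    [68, 62, 50, 61, 59, 0,  15, 16],
    [68, 63, 48, 65, 60, 29, 0,  6],
    [65, 63, 50, 66, 67, 40, 28, 0],
])

row_totals = table5.sum(axis=1)  # assumed to play the role of (o+)
col_totals = table5.sum(axis=0)  # assumed to play the role of (o-)
net = dict(zip(ims, (row_totals - col_totals).tolist()))

# Print the IMs from best to worst net score.
for im, score in sorted(net.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{im:<9} {score:+d}")
```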
The pair-wise comparison between IMs reveals only a partial picture of the
overall behavior of the IMs. To further understand the relation between IM and
MD type, we designed another experiment to evaluate the rank of the IMs. First,
all the scores were divided into 40 groups. Each group contains all
classification results for one of the combinations of the 10 original DBs and
the 4 MD types. Each group comprises 8 × 14 × 30 = 3,360 classification scores.
In each group, the scores were sorted from the highest to the lowest, and split
into 3 equally sized groups: High: the experiments with the highest score; Low:
the experiments with the lowest score; Medium: the rest of the experiments. By
inspecting each group we can determine, for each original DB and MD type, which
combinations of IMs and classifiers were the most frequent in each of the
groups, allowing us to detect high performing IMs, classifiers, and
combinations of the two.
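The split of one group into High, Medium, and Low can be sketched as follows. The scores are synthetic stand-ins, and `split_into_terciles` is a hypothetical helper name rather than part of the original experimental code.

```python
import numpy as np

def split_into_terciles(scores):
    """Sort scores from highest to lowest and return three index arrays
    (high, medium, low) of equal size when len(scores) is divisible by 3."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    high, medium, low = np.array_split(order, 3)
    return high, medium, low

rng = np.random.default_rng(0)
# One group: 8 IMs x 14 classifiers x 30 instances = 3,360 scores.
group_scores = rng.random(8 * 14 * 30)

high, medium, low = split_into_terciles(group_scores)
print(len(high), len(medium), len(low))
```

The returned indices can then be mapped back to the (IM, classifier) pair that produced each score.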
In the following analysis, the 10 High groups (one for each DB) that have a
common MD type are joined in a single “supergroup”. Similarly, a single Low
supergroup is created for each MD type. We then compute the proportion of
each IM included in the High and Low supergroups. This way, we obtain, for each
MD type, the IMs that had the best performance independently of the DB (since
the groups were not merged until the final step).
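Computing the proportion of each IM in a supergroup amounts to a frequency count. The sketch below uses a tiny made-up supergroup of (IM, classifier) pairs purely for illustration; the entries are not experimental results.

```python
from collections import Counter

def im_proportions(supergroup):
    """Return the share of each IM among the (IM, classifier) entries."""
    counts = Counter(im for im, _classifier in supergroup)
    total = sum(counts.values())
    return {im: n / total for im, n in counts.items()}

# Made-up example entries, standing in for a High supergroup of one MD type.
high_supergroup = [
    ("MICE", "DNN"),
    ("EM", "DNN"),
    ("MICE", "Rnd.F."),
    ("Mean", "Rnd.F."),
]

props = im_proportions(high_supergroup)
print(props)
```

Applying the same function to the Low supergroup gives the second set of proportions needed to compare both ends of the ranking.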
Figure 2 shows the polar charts describing the frequency of the IMs in the
supergroups High and Low according to the MD type. Each section of the
chart is proportional to the frequency of a given IM in the cases where the
classifiers achieved high (respectively low) scores. A regular circular shape in
these charts indicates that all IMs were present with a similar frequency in
high and low quality cases, thus indicating no significant impact of the IM on
the score. Conversely, more irregular shapes indicate that, for the given MD
type, there is an influence of the IMs on the scores. Furthermore, a big gap
between the size of the high and low sections of a given IM clearly indicates
whether or not the application of the IM contributes to improving the F1 scores.
An analysis of Figure 2 shows some clear regularities between different MD
types, as all three complex IMs had a much larger appearance rate in the top
group than in the low one. Moreover, all MD configurations also converge
on the null effectiveness of the temporal methods for non-temporal data.
However, when it comes to the simple methods, MIV (Figure 2a) and the
other MD types (MAR, Figure 2b, MuOV, Figure 2c and MCAR, Figure 2d)
show substantial differences. MIV grants Most Frequent imputation an
almost top performing degree, producing results similar to the complex methods.
Additionally, the higher appearance frequency of this method meant that
its peers in the simple IM cluster suffered a decrease in their appearance ratio
in the high performance group, while their appearance in the low group increased
substantially.
7.3. Interactions between MD types, imputation methods, and classifiers
In this section we address the following questions: Is it possible to identify
any joint effect of IMs and classifiers on the score achieved for different DBs?
Does this potential effect depend on the type of MD?
To investigate these questions we use the same supergroups of High and Low
configurations, but in this case we compute the frequencies of all 8 × 14 = 112
pairs of IMs and classifiers. This information is shown in Figure 3, in which
values are ranked by classification performance (x axis).
[Figure 2, panels (a) MIV, (b) MAR, (c) MuOV, (d) MCAR: polar charts with a High and a Low section for each of mean, median, m. freq., lvcf, interp., HD, mice, and EM.]
Figure 2: Frequency of the IMs in the configuration with highest (High) and lowest (Low)
classification score.
[Figure 3, panels (a) MIV, (b) MAR, (c) MuOV, (d) MCAR: heatmaps with axes Classifier and Imputer (Mean, Median, M. Freq., LVCF, Interp., HD, MICE, EM) and a color scale from 0 to 250. Classifier axis ticks per panel:
(a) SVM-3, RBFN, 3-NN, 1-NN, LDA, QDA, C-SVM, NaiveB, CART, Reg.L2, Reg.L1, Grad.B., Rnd.F., DNN
(b) RBFN, SVM-3, 3-NN, 1-NN, QDA, LDA, NaiveB, CART, C-SVM, Reg.L2, Grad.B., Rnd.F., Reg.L1, DNN
(c) 3-NN, RBFN, SVM-3, 1-NN, NaiveB, CART, LDA, QDA, C-SVM, Reg.L2, Grad.B., Rnd.F., Reg.L1, DNN
(d) 3-NN, SVM-3, RBFN, 1-NN, CART, QDA, NaiveB, LDA, Reg.L2, C-SVM, Rnd.F., Reg.L1, Grad.B., DNN]
Figure 3: Amount of IM-Classifier pairs present in the High score section.
From Figure 3a we can see how the partial conclusions about the positive
relationship between the MIV MD type and Most Frequent imputation are
corroborated. For an extensive set of classifiers, this imputer has provided
results in a range similar to those obtained by the complex IMs. Besides, it is
noticeable that the results are more balanced between classifiers than in the
rest of the figures.
Figure 3b represents the frequency of the pairs considering only the MAR MD
type. The first difference observable between this and the previous Figure 3a is
the scale on which it is drawn. In this case, the top classifier appeared more
frequently, which denotes a worse relation between the DNN and MIV compared to
MAR. Additionally, in this case Most Frequent imputation did not stand out from
the methods of its same category. This allows a better visualization of the
classifiers that require a complex imputation to produce an optimal performance.
Good examples of this collection are L2-penalized Regression and C-SVM. On the
contrary, we also have some other classifiers that are unaffected by the IM
applied. Random Forest and Gradient Boosting classifiers are excellent exemplars
of this set.
Results filtered by MuOV MD type are presented in Figure 3c. Again,
we can distinguish the classifiers that produce similar results regardless of
the imputer that precedes them. However, this time more contrast can be found in
this aspect. The difference between the simple and complex IMs for the LDA and
QDA classifiers supports this distinction between SCs.
Finally, Figure 3d, representing MCAR, shows an almost non-existent set of
midfield classifiers. The differences between contiguous classifiers in the
previous figures were quasi-constant. This instance, however, presents a
considerable gap in terms of high group appearance rate between the Naïve Bayes
and LDA classifiers, especially for the complex methods.
Considering all four figures, we can conclude that a strong relation between
IM and MD type exists (MIV and Most Frequent imputation, for example).
Additionally, we distinguish between classifiers that need quality imputation
to produce a top performance (Deep Neural Network and Logistic Regression
classifiers) and others that do not (Gradient Boosting and Random Forest
classifiers). There seems to be a certain regularity in the classifier ranking,
as there is no significant variation between any two MD types in the classifiers
axis (x axis) in Figure 3.
7.4. Secondary experimentation
For this secondary experimentation, we address the question: How does the
amount of MD affect the performance of both classifiers and imputers?
To answer this question, we have conducted a set of experiments similar to
that of Section 7.1. As explained in Section 7.1.1, we
selected two databases, and inserted MCAR MD with varying percentages
(7-42%) for further imputation. These databases were subjected to 5-fold cross
validation using two classifiers and F1 score as a metric. To visually interpret
these results, Figure 4 has been created. These heatmaps show the mean of all
F1 scores computed in all 30 runs. Each subfigure presents the data obtained
from a certain database and a certain classifier. Accordingly, Figure 4a shows
the mean F1 score for each pair of imputer and MD percentage obtained in the
Leaf database, with the Gradient Boosting classifier. In the figures, a darker
color represents a better F1 score, and, therefore, a better performance.
From Figure 4 we can list the following findings:
• As expected, the more MD, the worse the performance offered by the
classifiers.
• Again following expectations, the classifier more sensitive to imputation
quality (Regression) showed an increased need for quality imputation as
the amount of MD rose.
• The Gradient Boosting classifier was selected due to its relatively reduced
need for a good imputation in order to offer a good performance. However, as
the MD increases in a DB, this changes, as complex imputers (e.g., MICE)
produce more stable results than the rest.
• Comparing the two pairs generated from the same database, we see how
the patterns are very similar, which is indicative of a strong dependency
of the results on the problem.
• Overall, complex imputers (especially MICE, but also Hot Deck and EM,
to a lesser extent) keep reporting acceptable results as MD increases in a
DB, while the accuracy of other methods (e.g., Median imputation) seems
unable to keep up the pace.
[Figure 4, heatmaps with x axis Percentages (7, 14, 21, 28, 35, 42), y axis Imputation Methods (Mean, Median, M. Freq., LVCF, Interp., HD, MICE, EM), and color bar Mean F1 scores. Panels: (a) Leaf DB, Gradient Boosting classifier. (b) Leaf DB, Logistic Regression classifier. (c) Vehicle DB, Gradient Boosting classifier. (d) Vehicle DB, Logistic Regression classifier.]
Figure 4: Mean F1 scores of the 30 runs of the experimentation.
8. Conclusions
Although it is generally accepted that different patterns of MD can produce
a different effect on the data, and some researchers have suggested that these
patterns should be taken into account at the time of selecting the IM, an
in-depth analysis of this question had not been previously approached. This may
be due to the complexity of defining an experimental framework that facilitates
investigating all the factors involved. In this paper we have proposed a way to
study the interactions between algorithms dealing with missing data, supervised
classification methods and missing data itself, taking into account its different
configurations (MCAR, MIV, MAR and MuOV).
A key characteristic of our approach has been the combination of original
DBs without MD with specially devised methods for injecting MD consistent
with the desired MD types: MCAR, MIV, MAR and MuOV. Ten datasets with
no missing values were chosen, and were cloned 30 times. Then the four
different MD patterns were applied to them, separately. The resulting benchmark
comprising 1,200 datasets was then used to apply the 8 different IMs. Fourteen
classification algorithms were applied to the imputed DBs.
Statistically significant differences were detected in the performance of the
classifiers when using different IMs. The patterns of differences were dependent
on the MD type. Results showed a significant relation between IM complexity and
classification algorithms, since MICE (which implements Multiple Imputation)
and Hot Deck (also Expectation-Maximization, to a lesser extent) were more
Pal, S. K., & Mitra, S. (1992). Multilayer perceptron, fuzzy sets, and
classification. IEEE Transactions on Neural Networks, 3, 683–697.
Park, J., & Sandberg, I. W. (1991). Universal approximation using
radial-basis-function networks. Neural Computation, 3, 246–257.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., & Dubourg, V. (2011). Scikit-learn:
Machine learning in Python. The Journal of Machine Learning Research, 12,
2825–2830.
Silva, P. F., Marcal, A. R., & da Silva, R. M. A. (2013). Evaluation of
features for leaf discrimination. In International Conference Image Analysis and
Recognition (pp. 197–204). Springer.
Song, Q., Shepperd, M., Chen, X., & Liu, J. (2008). Can k-NN imputation
improve the performance of C4.5 with small software project data sets? A
comparative evaluation. Journal of Systems and Software, 81, 2361–2370.
Tan, P.-N., Steinbach, M., Kumar, V., & others (2006). Introduction to data
mining volume 1. Pearson Addison Wesley Boston.
Twala, B. (2009). An empirical comparison of techniques for handling
incomplete data using decision trees. Applied Artificial Intelligence, 23, 373–405.
Van Buuren, S. (2012). Flexible imputation of missing data. CRC Press.
Wong, A. K., & Chiu, D. K. (1987). Synthesizing statistical knowledge from
incomplete mixed-mode data. IEEE Transactions on Pattern Analysis and
Machine Intelligence.
Yu, H.-F., Huang, F.-L., & Lin, C.-J. (2011). Dual coordinate descent methods
for logistic regression and maximum entropy models. Machine Learning, 85,
41–75.
Yuan, Y. C. (2010). Multiple imputation for missing data: Concepts and new
development (Version 9.0). SAS Institute Inc, Rockville, MD, 49.
Zięba, M., Tomczak, J. M., Lubicz, M., & Świątek, J. (2014). Boosted SVM for
extracting rules from imbalanced data in application to prediction of the
post-operative life expectancy in the lung cancer patients. Applied Soft Computing,
14, 99–108.