
Conditional Inference Trees for the Knowledge Extraction from Motor Health Condition Data

Author: Sardá-Espinosa, Alexis; Subbiah, Subanatarajan; Bartz-Beielstein, Thomas
Year: 2017
Source: https://cos.bibl.th-koeln.de/files/470/sard16a.pdf
CIplus
Band 1/2017
Alexis Sardá-Espinosa (a), Subanatarajan Subbiah (a), Thomas Bartz-Beielstein (b)
(a) ABB AG, German Research Center, Wallstadter Straße 59, 68526 Ladenburg
(b) Technische Hochschule Köln, Faculty of Computer Science and Engineering Science, Steinmüllerallee 1, 51643 Gummersbach
Abstract
As the amount of data gathered by monitoring systems increases, using computational tools to analyze it becomes a necessity. Machine learning algorithms can be used in both regression and classification problems, providing useful insights while avoiding the bias and proneness to errors of humans. In this paper, a specific kind of decision tree algorithm, called conditional inference tree, is used to extract relevant knowledge from data that pertains to electrical motors. The model is chosen due to its flexibility, strong statistical foundation, as well as great capabilities to generalize and cope with problems in the data. The obtained knowledge is organized in a structured way and then analyzed in the context of health condition monitoring. The final results illustrate how the approach can be used to gain insight into the system and present the results in an understandable, user-friendly manner.
Keywords: Decision tree, Conditional inference tree, Health condition monitoring, Machine learning, Knowledge extraction
1. Introduction
As technologies evolve, the amount of data generated and cataloged by monitoring systems increases. In many cases there is, hopefully, information contained in the data that can provide useful insight into the processes that generated it, so that a structured analysis can lead to the extraction and exploitation of specific knowledge. Learning from data is a diverse process, but when the quantity of available data is very large, it is necessary to use computational tools to be able to efficiently process it. This is where machine learning comes into play.
The question of what exactly constitutes a machine learning approach cannot be answered exactly. Learning, by itself, usually entails the acquisition of new knowledge or the refinement of an existing task, organizing and representing the results in an effective way. Machine learning, then, attempts to implant these capabilities into computers (Michalski et al., 2013).
Machine learning can be seen from several points of view. At the core, Michalski et al. (2013) differentiate between three research foci depending on the objective: task-oriented studies, cognitive simulation, and theoretical analysis. These are not mutually exclusive, and in most cases they complement each other. Subsequently, machine learning systems could be further classified on three bases: the underlying learning strategy, the representation of knowledge discovered, and the application domain. In the end, though, the goal is almost always the same: it is desired to develop a concise approach that will perform a certain task with consistent performance and that will be able to match or exceed human cognition. This usually has the additional advantage of being explicit and, to some extent, objective, which can then lead to an automated or semi-automated procedure that can be used in general situations.
Note: This document is the preprint version of the article accepted by the Journal of Engineering Applications of Artificial Intelligence: http://dx.doi.org/10.1016/j.engappai.2017.03.008
The applications of machine learning are many. It is common to perform classification or regression in order to build models based on certain data and then apply the obtained models to new information to calculate predictions. Depending on the goal and scope of the task, classification or regression can be an intermediary step in the overall process of knowledge extraction.
Mazloumi et al. (2011) used neural networks to learn from travel time data in order to build prediction models, but then computed uncertainty in the model taking into account different sources of error, so that a prediction interval was obtained instead of a single point estimate.
Varga et al. (2009) applied decision trees to reactor runaway data to organize and summarize the most important information from the process and then display it in a user-friendly format in an operator support system. These are good examples of how a relatively general machine learning algorithm can be fine-tuned to a specific application, so that relevant data can be extracted in their respective contexts.
Preprint submitted to Elsevier, March 31, 2017
In Gerdes (2014) the author proposed an approach based on decision trees to monitor aircraft systems and to forecast time series data of the condition monitoring system to reduce the number of unscheduled downtimes of the aircraft. The author also proposed embedding an optimization approach based on a genetic algorithm to improve the performance of the decision trees in an automated manner.
Similar to the application case considered in this paper, in Marton et al. (2013) the authors considered the application case of generators using a data-driven approach. In contrast to decision trees, the authors used the data collected from a SCADA system to build a prediction model based on neural networks, principal component analysis, and partial least squares. The prediction performance of the model was measured using mean absolute percentage error, and the predicted data were reviewed to detect any abnormal behavior and possible degradation of the equipment in order to generate alarms.
Another contribution to motor diagnostics based on decision trees was proposed in Yang et al. (2000). The authors proposed a decision tree model based on the vibration analysis of a motor to diagnose its status. The abnormal frequency components were classified and the causes of vibration were identified.
This work presents an approach to extract and exploit knowledge in an automated way by means of a decision tree algorithm. Section 2 describes the problem that was to be solved and its context. The foundations of the machine learning algorithm that was used, as well as a brief description of the algorithm itself, are presented in Section 3. Section 4 explains how the data was preprocessed in order to feed it to the algorithm, and how the cross-validation procedure was carried out. The analysis and interpretation of the obtained results are given in Section 5, and the final conclusions are summarized in Section 6.
2. Problem Definition
The data that will be considered pertains to a condition monitoring system which evaluates motors and generators. Based on voltage, current, and vibration measurements, a series of indicators that represent the health condition of the motors is computed automatically and summarized in a report that is then given to the customers. In order to maintain confidentiality, the data will be anonymized throughout this report.
As with most condition monitoring systems, the main goal is to detect and quantify variation in the asset's health compared to its normal operating condition, which could lead to reduced performance or maintenance actions. In order to standardize the results, specific labels that reflect health status based on the measured parameters are assigned.
The workflow that concludes in the final report is partly interactive. The raw measurement files and the motor information must be loaded manually and an analyst must oversee the whole process to make sure that the output makes sense. Once all parameters have been obtained, the analyst must perform validation of the report and put forward, if necessary, a set of suggestions to the customer. This can be a time-consuming task, especially if the user is not acquainted with the different algorithms that are already being used.
The structure of the reports and the organization of the data therein suggest the possibility of using machine learning algorithms to extract meaningful information. Since the health labels take on a specific set of values, the task can be considered as a classification problem, although the main interest is to use the underlying models generated to assess the report generation workflow with respect to the results, and to try to identify areas of opportunity that can be improved in the future.
Eventually, it is desirable to analyze new data not only by itself, but also with respect to the knowledge that has already been acquired over the years from the whole fleet of devices currently in operation. This could help identify possible inconsistencies, such as input errors and illogical or non-standard results, among others. Additionally, it might be possible to simplify the validation task of the analyst so that time requirements are reduced and the explicitness of the reports increases. This last point is particularly important due to the fact that users are more prone to errors and biases, which could make it difficult to reproduce past results.
The overall approach is described in the diagram of Fig. 1. The data will consist of several input variables (sometimes also called attributes or covariates in the literature) as well as the output labels. Some important aspects of preprocessing will be considered due to possible correlation and missing values. The models will be obtained by using decision tree models, whose parameters will be tuned by using bootstrap cross-validation and 0.632+ error estimates. Finally, the evaluation of the models in the context of the data will be performed. The specifics of each step of the process will be explained in more detail in the following sections.
3. Decision Trees
3.1. Decision Trees
More than a specific algorithm, decision trees are a framework to create an explicit hierarchy of tests that result in a partitioning of the decision space. The general procedure allows for several strategies to be used at each step, which in turn results in a wide variety of tree models being available. They have very good comprehensibility and serve as the basis for most of the algorithms that will be used, so they will be briefly described following Kotsiantis (2007).
[Figure 1 (flowchart). Boxes: Data Input; Preprocessing; Building Decision Tree Model; Cross-Validation (500 Bootstrap Replications, Parameter Tuning, 0.632+ Estimates); Model Selection; Evaluation of Extracted Knowledge; Application to New Data.]
Figure 1: Diagram showing the different steps of the utilized approach.

Decision trees form a structure made of nodes and branches, starting at a single root node and ending in the terminal nodes, also called the leaves of the tree. At each node, usually a single variable is considered, and one or more thresholds are chosen by using a given measure of split quality or node impurity. These thresholds, depicted by the branches of the tree, divide the decision space of the considered variable. An instance can thus be classified by starting at the root node, analyzing the specified variable, following the appropriate branch and recursively repeating until a leaf is reached.
The general procedure to build decision trees is based on a recursive partitioning procedure, which can be expressed as follows: considering a specific dataset with $m$ variables, the one that best splits the decision space with respect to a given measure is denoted with $m^*$. The root node is created by considering $m^*$ and, assuming $c$ is the threshold that achieves the best split, two branches are created: one where $m^* \le c$ and another one for $m^* > c$. This must be recursively repeated on the sub-lists at each node until a stopping criterion is satisfied.
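The recursive partitioning procedure described above can be sketched as follows. This is only an illustrative, CART-like version and not the conditional inference algorithm used in this work: the Gini impurity as split measure, the `max_depth` stopping criterion, and all function names are assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (variable index m*, threshold c, weighted impurity) of the best binary split."""
    best = None
    for m in range(len(rows[0])):
        # Every observed value except the largest is a candidate threshold,
        # so both branches are guaranteed to be non-empty.
        for c in sorted({r[m] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[m] <= c]
            right = [y for r, y in zip(rows, labels) if r[m] > c]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (m, c, score)
    return best

def grow(rows, labels, depth=0, max_depth=3):
    """Recursively partition until a node is pure, unsplittable, or max_depth is hit."""
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return {"leaf": majority}
    m, c, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[m] <= c]
    right = [(r, y) for r, y in zip(rows, labels) if r[m] > c]
    return {
        "var": m,
        "threshold": c,
        "le": grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "gt": grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def classify(tree, row):
    """Follow the branches from the root until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["le"] if row[tree["var"]] <= tree["threshold"] else tree["gt"]
    return tree["leaf"]

# Toy data (hypothetical): the first variable separates the classes at 2.0.
toy_rows = [(1.0, 5.0), (2.0, 5.0), (3.0, 5.0), (4.0, 5.0)]
toy_labels = ["KR", "KR", "SI", "SI"]
toy_tree = grow(toy_rows, toy_labels)
```

An exhaustive search of this kind tends to favor variables with many possible splits, which is the selection bias discussed in the following paragraphs.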
Note that the recursive partitioning procedure should be unbiased, i.e., under the assumption of independence of the response $Y$ and the input variables $X_i$, $i = 1, \ldots, m$, the probability of selecting variable $X_j$ is $1/m$ for all $j = 1, \ldots, m$ regardless of the measurement scales or number of missing values (Hothorn et al., 2006).
Depending on the methodology used at each step of the algorithm, and the selected criteria, a very different tree model can be obtained. As such, many variations of the general procedure were proposed, including versions with multiple splitting and regression trees. However, these procedures suffer from overfitting and from selection bias towards variables with many possible splits or with many missing values.
To avoid overfitting, several algorithms implement a pruning strategy after the tree is fully grown. Conditional inference trees go one step further by implementing a unified framework for handling both selection bias and overfitting.
3.2. Conditional Inference Trees
A conditional inference tree is one possible decision tree algorithm for recursive binary splitting, which tries to embed the framework in a well-defined statistical environment based on permutation tests, attempting to distinguish between significant and insignificant improvements.
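To give an intuition for what a permutation test does, the following minimal sketch estimates a p-value for the association between one numeric variable and a two-class label by shuffling the labels. It is a deliberately simplified Monte Carlo illustration, not the test used by conditional inference trees (see Hothorn et al., 2006, for the actual framework); the test statistic (absolute difference of class means), the function name, and the toy data are assumptions.

```python
import random

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value for association between x and a two-class label y."""
    classes = sorted(set(y))
    assert len(classes) == 2, "this sketch handles two classes only"

    def statistic(labels):
        # Absolute difference between the class means of x (an assumed, simple statistic).
        a = [v for v, lab in zip(x, labels) if lab == classes[0]]
        b = [v for v, lab in zip(x, labels) if lab == classes[1]]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = statistic(y)
    rng = random.Random(seed)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between x and the labels
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: x_related separates the classes, x_unrelated does not.
labels_demo = ["KR"] * 5 + ["SI"] * 5
x_related = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
x_unrelated = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
p_related = permutation_p_value(x_related, labels_demo)
p_unrelated = permutation_p_value(x_unrelated, labels_demo)
```

A small p-value suggests that the variable carries information about the response, while a large one suggests that splitting on it would not be a significant improvement.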
As an improvement to the recursive partitioning procedure described in Section 3.1, conditional inference trees separate the variable selection from the splitting procedure. This results in basically three steps in the conditional inference tree procedure. The first one concerns variable selection, the second one chooses the splitting methodology, and the last one is the recursive application of the first two steps. The reader is referred to Hothorn et al. (2006) for a detailed description of these steps. A vignette with more information and several practical examples is also available for the corresponding software package (see footnote 1).
In addition to their basic capabilities of avoiding bias and overfitting, conditional inference trees possess other useful characteristics.
• Their maximum depth or the minimum amount of observations allowed at each node of the tree can be limited in order to prevent pathological splits.
• Conditional inference trees can deal with missing values on a split-by-split basis by setting weights to zero if a given variable from a considered observation is missing.
• They can be used for a broad variety of variables, e.g., nominal, ordinal, and multivariate response variables.
• Alternative modeling approaches such as neural networks or support vector machines have excellent prediction capabilities, but do not provide any insight into the underlying problem. Conditional inference trees can be used as tools for prediction and understanding.
• Conditional inference trees are better suited for diagnostic purposes than the standard (exhaustive) recursive partition procedures implemented in Classification and Regression Trees (CART).
• They use well-known, established statistical concepts for variable selection and stopping. The resulting tree models are easier to communicate to practitioners.
Because of these properties, conditional inference trees were chosen as the modeling tool for the analysis of health condition from motors and generators.
Footnote 1: https://cran.r-project.org/web/packages/party/vignettes/party.pdf

4. Methodology
4.1. Preprocessing
The condition monitoring reports are a way of structuring and summarizing the set of values that are computed for each motor after the raw electrical signals have been analyzed. Each report is organized into four main sections that are of interest to the analyst. They will be referred to as sections A, B, C and D, respectively. Each of these sections has a status label associated with it, which can take on three possible values: KR (Keep Running) < WW (Wait & Watch) < SI (Stop & Inspect); the order shown is increasing with respect to degradation level.
In many situations it is advantageous to transform the data in order to accentuate or attenuate certain characteristics (Cox and Jones, 1981). For example, some transformations like the logarithmic transform can help with regression when the variability of a variable is not constant between different sub-populations. In other cases, data must be transformed so that the machine learning algorithms are capable of actually processing it. This may be necessary if there are missing values, or if dimensionality is so high that the problem becomes intractable. The scale of each feature is also relevant, since there are some algorithms that would naturally give more or less weight to features whose values are numerically larger or smaller.
There are several procedures that can be applied in order to preprocess data. What to use depends greatly on the algorithm to be used later for analysis, and also on the nature of the data. The algorithm under consideration is capable of dealing with variables measured with arbitrary scales, so normalization is not necessary in this case. Additionally, the algorithm is capable of dealing with missing values, so they do not need to be removed or imputed with a different procedure. Nevertheless, some of the variables present so many missing values that it is probably detrimental to keep them in the dataset. As such, if more than 75% of the observations have a variable missing, that variable will be removed.
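The 75% rule can be sketched as below. This is a minimal illustration assuming the data is held as a dictionary of columns with `None` marking a missing value; the function name, the column names, and the toy values are hypothetical.

```python
def drop_sparse_variables(columns, max_missing=0.75):
    """Keep only variables whose share of missing values (None) is at most max_missing."""
    kept = {}
    for name, values in columns.items():
        missing_share = sum(v is None for v in values) / len(values)
        # A variable is removed only if MORE than max_missing of its values are missing.
        if missing_share <= max_missing:
            kept[name] = values
    return kept

# Hypothetical columns: "b" is missing everywhere and is therefore dropped.
demo_columns = {
    "a": [1.0, None, None, None],   # exactly 75% missing: kept
    "b": [None, None, None, None],  # 100% missing: removed
    "c": [1.0, 2.0, 3.0, 4.0],      # complete: kept
}
cleaned = drop_sparse_variables(demo_columns)
```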
There are advanced methods for dimensionality reduction, such as principal component analysis, that can reduce the number of variables by performing an orthogonal transformation on the data and keeping only those dimensions with the most information. Unfortunately, doing such a transformation modifies the decision space, and in our application it is critical to maintain as much interpretability as possible. Therefore, a simple methodology employing the linear correlation between the variables of each report section will be used. Variables which are correlated to any other by 0.85 or more (absolute value), using Pearson's correlation coefficient and ignoring missing values on a pairwise basis, will be discarded.
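A possible reading of this correlation filter is sketched below. The text does not state which member of a correlated pair is discarded, so the greedy rule used here (keep the variable encountered first) is an assumption, as are the function names and the toy columns.

```python
import math

def pairwise_pearson(x, y):
    """Pearson correlation using only rows where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable has no defined correlation; treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.85):
    """Greedy filter: keep a variable only if |r| < threshold against every variable kept so far."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pairwise_pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

# Hypothetical columns: v2 is a multiple of v1, v3 is only weakly related to v1.
demo = {
    "v1": [1, 2, 3, 4, None],
    "v2": [2, 4, 6, 8, 10],
    "v3": [1, -1, 1, -1, 1],
}
filtered = drop_correlated(demo)
```

Because the filter only removes whole variables, the remaining ones keep their original meaning, which is the interpretability argument made above.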
4.2. Cross-Validation
There is no single method that can be optimally applied to every scenario. Therefore, it is necessary to measure performance in such a way that allows the evaluation of different algorithms as well as their sensitivity to their tuning parameters. Common measures include accuracy, speed, comprehensibility or interpretability, and time required to learn. It follows that the decision on which combination of algorithm and performance measure offers the best results heavily depends on the task at hand.
It is possible to utilize naive approaches as baselines, in order to ensure that the algorithms are indeed resulting in a measurable improvement. The simplest baseline is the no-data rule, which always assigns a given class regardless of the input values, and might actually be used if the cost of acquiring data is too high. Another approach is to always predict the most common class, which takes into consideration the prior probabilities given the observed data. In addition, costs or weights can be assigned to either classification or misclassification, in order to induce a higher priority on certain classes.
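The most-common-class baseline can be computed directly from the label counts. The sketch below assumes the labels are available as a plain list; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def naive_rule_accuracy(labels):
    """Return the most common class and the accuracy of always predicting it."""
    most_common_class, count = Counter(labels).most_common(1)[0]
    return most_common_class, count / len(labels)

# Hypothetical label distribution: 5 KR, 3 WW, 2 SI.
baseline_class, baseline_accuracy = naive_rule_accuracy(
    ["KR"] * 5 + ["WW"] * 3 + ["SI"] * 2
)
```

Any trained model should exceed this accuracy to be considered an improvement over using no predictors at all.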
There are many problems to be taken into account when training machine learning algorithms. Once a performance metric has been chosen, knowing which algorithm, along with which combination of tuning parameters, yields the optimum results with respect to the metric is one of the main interests. Obtaining an estimate of the algorithm's general performance is not simple. The datasets used for training are always finite, so a way must be found to effectively use them in order to calculate estimates that have, ideally, low bias and low variance, and that are not a result of overfitting.
It is well known that testing an algorithm with the same dataset with which it was trained leads to overly optimistic estimates (Arlot et al., 2010), which can be more or less biased depending on the learning procedure itself. Closely related to that is the fact that some algorithms can learn in such a way that they perfectly fit the training data but are not able to generalize to new observations, which is the problem of overfitting. Moreover, it is not uncommon for there to be one or more tuning parameters which can change the outcome depending on the algorithm's sensitivity. The set of procedures that are used to overcome these problems are called cross-validation, and they serve multiple purposes. First of all, they attempt to use the available data as efficiently as possible in order to yield valid estimates of performance. Furthermore, they also help evaluate one specific algorithm with different tuning parameters, as well as the variability of the results when parameters are fixed but the data change. Even though they cannot alleviate overfitting directly, they provide tools to identify it, so that measures can be taken in order to reduce or negate the effects. The approach presented here will focus on using bootstrap cross-validation.
The generic bootstrap methodology explained in Efron and Tibshirani (1993) is a computer approach that has many applications. In general, if there is a dataset with $n$ observations, several new datasets are constructed by sampling with replacement from said dataset until $n$ observations have been selected; this is repeated $B$ times and these new datasets constitute the bootstrap samples.
In the nonparametric bootstrap, sampling is performed based on a uniform distribution that places a probability of $1/n$ on each observation. This, along with the fact that sampling is done with replacement, means that the probability that a bootstrap sample does not contain a given observation is $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. Thus, on average, the number of distinct observations in each bootstrap sample is $0.632n$.
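The approximation $(1 - 1/n)^n \approx e^{-1}$ can be checked numerically. The short sketch below (function name assumed) prints the exact probability for a few sample sizes next to the limiting value.

```python
import math

def prob_left_out(n):
    """Probability that a given observation is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n

# The probability approaches exp(-1) as n grows.
for n in (10, 100, 1000):
    print(n, round(prob_left_out(n), 4))
print("limit:", round(math.exp(-1), 4))
```

The complement, roughly 0.632, is the share of distinct observations expected in each bootstrap sample and gives the estimators discussed next their name.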
In the simplest case, using bootstrap for cross-validation consists of training the algorithms with the bootstrap samples, then evaluating them with those observations that were not part of the sample and finally averaging all the bootstrap estimates.
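This simplest form of bootstrap cross-validation might look as follows. The sketch is generic: `fit` and `predict` are placeholder callables supplied by the caller, and the majority-class learner is only a stand-in for illustration, not the model used in this work.

```python
import random
from collections import Counter

def bootstrap_oob_error(rows, labels, fit, predict, B=500, seed=0):
    """Average out-of-bag error rate over B bootstrap samples."""
    rng = random.Random(seed)
    n = len(rows)
    errors = []
    for _ in range(B):
        in_bag = [rng.randrange(n) for _ in range(n)]     # n draws with replacement
        out_of_bag = sorted(set(range(n)) - set(in_bag))  # observations never drawn
        if not out_of_bag:
            continue                                      # nothing left to evaluate on
        model = fit([rows[i] for i in in_bag], [labels[i] for i in in_bag])
        wrong = sum(predict(model, rows[i]) != labels[i] for i in out_of_bag)
        errors.append(wrong / len(out_of_bag))
    return sum(errors) / len(errors) if errors else float("nan")

# Stand-in learner (an assumption for illustration): always predict the majority class.
def fit_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, row):
    return model

demo_rows = list(range(20))
error_pure = bootstrap_oob_error(demo_rows, ["KR"] * 20, fit_majority, predict_majority, B=50)
error_mixed = bootstrap_oob_error(
    demo_rows, ["KR"] * 15 + ["WW"] * 5, fit_majority, predict_majority, B=50
)
```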
A variation that is somewhat specific to classification tasks that use accuracy as metric is the so-called 0.632 estimator proposed in Efron (1983). Given a total of $B$ bootstrap samples where $\varepsilon_i$ is the error estimate for sample $i$ and $\varepsilon_0$ is the error on the full training set (also called the apparent error), the 0.632 error estimate can be calculated with Eq. (1). Note that the estimate is defined in terms of the error rate, which for classification tasks is simply 1 minus the accuracy.

$$\varepsilon_{\mathrm{boot632}} = \frac{1}{B} \sum_{i=1}^{B} \left( 0.632 \cdot \varepsilon_i + 0.368 \cdot \varepsilon_0 \right) \qquad (1)$$
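Eq. (1) translates almost literally into code. In this sketch the function and argument names are assumptions, and the numbers in the example are made up.

```python
def boot632(sample_errors, apparent_error):
    """0.632 bootstrap error estimate, following Eq. (1)."""
    B = len(sample_errors)
    return sum(0.632 * e + 0.368 * apparent_error for e in sample_errors) / B

# Made-up example: two bootstrap error estimates and an apparent error of 0.1.
estimate_632 = boot632([0.3, 0.2], 0.1)
```

With these made-up inputs the estimate is 0.632 · 0.25 + 0.368 · 0.1 = 0.1948, which lies between the optimistic apparent error and the average bootstrap error.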
The 0.632 estimator has some shortcomings. It can fail if the classifier is a perfect memorizer or the dataset is completely random, where there is no relationship between outcome and predictors (Kohavi, 1995). In order to overcome these problems, the 0.632+ estimator was later introduced in Efron and Tibshirani (1997). It was intended to be a less biased compromise that depends on the amount of overfitting.
To compute it, first a no-information rate $\xi$ must be estimated by permuting responses and predictors. Let $\delta_{i,j}$ denote the discrepancy between observation $i$ and prediction $j$, then the estimate is given by Eq. (2a). For multicategory classification, let $\hat{p}_l$ be the proportion of observed responses equal to level $l$ and $\hat{q}_l$ be the corresponding proportion of predictions equal to $l$. Then, the no-information rate can be estimated with Eq. (2b).

$$\hat{\xi} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_{i,j} \qquad (2a)$$

$$\hat{\xi} = \sum_{l} \hat{p}_l \left( 1 - \hat{q}_l \right) \qquad (2b)$$
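Both forms of the no-information rate can be sketched as below, assuming that the discrepancy $\delta_{i,j}$ is the 0/1 misclassification indicator (1 if observation $i$ and prediction $j$ differ, 0 otherwise). The function names and toy labels are hypothetical.

```python
from collections import Counter

def no_info_rate_pairs(observed, predicted):
    """Eq. (2a): mean 0/1 discrepancy over all n^2 (observation, prediction) pairs."""
    n = len(observed)
    return sum(o != p for o in observed for p in predicted) / (n * n)

def no_info_rate_props(observed, predicted):
    """Eq. (2b): sum over levels l of p_l * (1 - q_l)."""
    n = len(observed)
    p = Counter(observed)   # counts of observed levels
    q = Counter(predicted)  # counts of predicted levels (0 for levels never predicted)
    return sum((p[level] / n) * (1 - q[level] / n) for level in p)

# Toy labels (made up) to show that both forms agree.
obs_demo = ["KR", "KR", "WW", "SI"]
pred_demo = ["KR", "WW", "WW", "WW"]
xi_pairs = no_info_rate_pairs(obs_demo, pred_demo)
xi_props = no_info_rate_props(obs_demo, pred_demo)
```

Under the 0/1 discrepancy the two expressions are algebraically identical; the second one simply avoids the double loop.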
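Afterwards, a relative overfitting rate can be estimated with Eq. (3) and the final 0.632+ estimate is given by Eq. (4), where $\varepsilon$ is the bootstrap estimate averaged over the $B$ samples and $\varepsilon'$ is defined as $\min(\varepsilon, \hat{\xi})$.

$$\hat{R}' = \begin{cases} \dfrac{\varepsilon - \varepsilon_0}{\hat{\xi} - \varepsilon_0}, & \text{if } \varepsilon, \hat{\xi} > \varepsilon_0 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

$$\varepsilon_{\mathrm{boot632+}} = \varepsilon_{\mathrm{boot632}} + (\varepsilon' - \varepsilon_0) \, \frac{0.368 \cdot 0.632 \cdot \hat{R}'}{1 - 0.368 \cdot \hat{R}'} \qquad (4)$$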
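The 0.632+ computation can be sketched as follows. The correction term is written with $\min(\varepsilon, \hat{\xi}) - \varepsilon_0$ as its first factor, which is the form given by Efron and Tibshirani (1997); the function and argument names and the example numbers are assumptions.

```python
def boot632plus(boot_error, apparent_error, no_info_rate):
    """0.632+ error estimate built from the 0.632 estimate plus an overfitting correction."""
    eps, eps0, xi = boot_error, apparent_error, no_info_rate
    eps_prime = min(eps, xi)
    # Relative overfitting rate, Eq. (3): zero unless both eps and xi exceed eps0.
    if eps > eps0 and xi > eps0:
        r = (eps - eps0) / (xi - eps0)
    else:
        r = 0.0
    # Eq. (1), using the average bootstrap error eps in place of the individual eps_i.
    eps632 = 0.632 * eps + 0.368 * eps0
    # Correction term of Eq. (4).
    correction = (eps_prime - eps0) * (0.368 * 0.632 * r) / (1 - 0.368 * r)
    return eps632 + correction

# Made-up inputs: bootstrap error 0.25, apparent error 0.1, no-information rate 0.5.
estimate_632plus = boot632plus(0.25, 0.1, 0.5)
# Without overfitting (bootstrap error equal to apparent error) the correction vanishes.
estimate_no_overfit = boot632plus(0.1, 0.1, 0.5)
```

For the made-up inputs the relative overfitting rate is 0.375, so the result is slightly larger than the plain 0.632 estimate of 0.1948; the more a model overfits, the more weight shifts towards the bootstrap error.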
4.3. Workflow
The workflow to analyze each report section will be essentially the same. First, each section will be evaluated independently of the remaining ones, and a stratified partitioning to create train and test sets, allocating 85% of the data to the training set, will be utilized. The training set will be further divided into new train and validation sets in concordance with the bootstrap cross-validation strategy. Afterwards, the models will be re-trained by utilizing all available variables, in order to see if the variables from other sections could provide valuable information to the model.
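A stratified 85/15 partition can be sketched as below: the indices of each class are shuffled separately and split at the same proportion, so that the class distribution is roughly preserved in both sets. This is an illustrative implementation with assumed names, not the routine used for the experiments.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_share=0.85, seed=0):
    """Return (train_idx, test_idx) with roughly train_share of each class in the training set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train_idx, test_idx = [], []
    for lab in sorted(by_class):
        idx = by_class[lab]
        rng.shuffle(idx)                        # random order within the class
        cut = round(train_share * len(idx))     # number of training observations for this class
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 20 observations per status level.
split_labels = ["KR"] * 20 + ["WW"] * 20 + ["SI"] * 20
train_idx, test_idx = stratified_split(split_labels)
```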
A baseline for each section will be established by employing the naive rule. This will provide a basic point of comparison to know how much of an improvement, if any, the algorithm is providing. Then, the accuracy estimates given by the 0.632+ bootstrap will be used to get an idea of approximate performance. 500 bootstrap samples will be used for each cross-validation run.
It is true that accuracy is not the best metric to assess classification performance, especially if the dataset is unbalanced. However, the current focus is doing data exploration by means of a machine learning algorithm to extract knowledge that can be useful in future analyses, so the interpretation of the results will be of greater importance. For this purpose, accuracy should yield satisfactory results.
All experiments and analyses were performed using the R programming language (R Core Team, 2016; RStudio Team, 2015) by leveraging the caret package for model training and validation (Kuhn, 2008). These are all open source software packages that support most operating systems and are freely available.
5. Experimental Results
5.1. Sample Training - Section A of the Reports
For section A of the reports, using the naive rule would result in an estimate of accuracy equal to 0.514. In the following, the improvements provided by the machine learning algorithm will be assessed.
Technically speaking, conditional inference trees can be tuned by modifying the minimum criterion ($\alpha$), although there are other parameters that can also be controlled, such as the tree depth or the amount of observations allowed at each terminal node, also called the bucket size. The first step taken was to assess the influence of $\alpha$ by testing the common values of 0.9, 0.95 and 0.99, enforcing no restriction on the tree's depth. The obtained average accuracy and its standard deviation (SD) are reported in Table 1.
Judging by the cross-validation results, it can be seen that $\alpha$ had virtually no influence on the overall accuracy of the algorithm for this dataset. Having established this, the effects of the tree's depth on accuracy can also be evaluated. The previous tree had a depth of 3, so we can simply test values from 1 to 3 while keeping $\alpha$ constant at 0.99. The results of this run are shown in Table 2, where it is seen that limiting the tree depth to 2 marginally improves accuracy. Note that a smaller depth implies that fewer splits are made throughout the tree, which means that, potentially, fewer variables would be needed in the final model.

Table 1: Train results for conditional inference trees in section A using minimum criterion as tuning parameter. A total of 500 bootstrap samples were used for cross-validation.

Min. Criterion (α)   Accuracy   Accuracy SD
0.900                0.758      0.030
0.950                0.759      0.030
0.990                0.760      0.029

Table 2: Train results for conditional inference trees in section A using maximum depth as tuning parameter.

Max. Depth   Accuracy   Accuracy SD
1            0.749      0.024
2            0.762      0.026
3            0.761      0.026

An added advantage of tree models is their ease of visualization and interpretation. For example, the tree model obtained in the second cross-validation run is shown in Fig. 2. Each node of the tree is depicted by a circle with the variable used for splitting and its associated p-value (see Section 3.2). The terminal nodes, also called the leaves, show a bar plot of the output label distribution considering only the observations at each respective leaf, and denote with $n$ the number of observations that were assigned to that leaf.
The tree in Fig. 2 provides a lot of information. First of all, it implies which predictors are the most relevant with respect to the output variable. The training set contains 20 predictors, so the fact that only two of them can provide an accuracy of 0.762 is noteworthy.
On the other hand, the bar plots at the leaves depict the consistency of the results. The output given by the tree itself is a class label, but by analyzing the observations at the leaves, the posterior probabilities of each class label can be evaluated, conditioned on the splits given by the tree. For instance, the third bar plot in Fig. 2 implies that, if an observation is assigned there, it would be extremely rare for it to have a class different than WW, whereas the last bar plot implies that any observation assigned there would never have a KR class (given the data).
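Because the tree in Fig. 2 is small, routing an observation to its leaf amounts to three comparisons. The sketch below uses the thresholds and node numbers shown in the figure; the variable names `var_63` and `var_55` are reconstructed from the figure labels and should be treated as assumptions, and the class proportions at the leaves are omitted because the bar heights are not given numerically.

```python
# Number of training observations per terminal node, as reported in Fig. 2.
LEAF_SIZES = {3: 14, 4: 353, 6: 98, 7: 98}

def leaf_for(var_63, var_55):
    """Return the terminal node number of Fig. 2 that an observation falls into."""
    if var_63 <= 19.97:                       # node 1, left branch
        return 3 if var_55 <= 40.126 else 4   # node 2 splits on var_55
    return 6 if var_63 <= 30.057 else 7       # node 5 splits on var_63 again

# Hypothetical observation with var_63 = 25.0: it ends in node 6 regardless of var_55.
example_leaf = leaf_for(var_63=25.0, var_55=10.0)
```

Once the leaf is known, the class proportions observed at that leaf during training can be read as the estimated posterior probabilities for the new observation.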
When doing classification, a popular method for displaying the results is by means of a confusion matrix. This matrix has a number of rows and columns equal to the number of levels in the output class, and each cell shows the correspondence between observed and predicted labels. During cross-validation, predictions are obtained at every step in order to evaluate model performance, but only considering the observations in the validation sets. Nonetheless, a confusion matrix with the average correspondence of each cell across the 500 replications can be constructed. This matrix is shown in Table 3, where the cell averages are expressed as percentage values of the total cell counts.

Table 3: Confusion matrix for section A using conditional inference trees. Each cell shows the average correspondence between predicted and observed values across the 500 replications performed during CV, but expressed as a percentage of the total counts.

              Reference
Prediction    KR        WW      SI
KR            49.5      12      1.11
WW            1.86      18      3.09
SI            0.0386    6.75    7.61

It is evident that the model is good at predicting the KR cases, which is expected for two reasons. On the one hand, the KR level was the most common one in the dataset, and on the other hand, from a practical point of view, it makes sense that discerning SI motors from the WW ones gets more difficult as the level of degradation increases.
So far the analysis has looked at section A independently from the other sections of the report. Additionally, some of the variables that were in the raw dataset were removed during cleaning on account of their correlation to other variables. By re-training the models with the ignored features, it can be checked whether the cleaning step was justified and if some of the variables from other sections can help with classification. The value of $\alpha$ was kept at 0.99, but maximum depth values from 0 to 8, where 0 signifies no restrictions, were tested. Also note that this is only a limitation of maximum depth, meaning that the algorithm can still decide to stop at a smaller depth if the conditions are satisfied (see Section 3.2).
The best model using all variables resulted in an accuracy estimate of 0.758, which is essentially the same value obtained without the extra features. Interestingly, by inspecting the model obtained after using all variables, it was discovered that it was the exact same model obtained when using only the cleaned dataset from section A. This provides reassurance that the variables from other sections have no significant influence on the status labels assigned here.
As the final step in the learning process, the model from Fig. 2 can be used to classify the data that was left in the test set, which has been ignored so far. This will provide one final estimate of future performance of the specific algorithm that was selected.
The confusion matrix for the results with the final model and the test set is shown in Table 4. The results translate to an accuracy estimate of 0.753, which is very close to what was expected.
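The accuracy quoted above follows directly from the counts reported in Table 4. The sketch below reproduces it and additionally computes, for each reference class, the share of observations predicted correctly; the function names are assumptions, and the per-class figures are derived here rather than reported in the text.

```python
LEVELS = ["KR", "WW", "SI"]
# Rows are predictions, columns are reference (observed) labels, as in Table 4.
TABLE_4 = [
    [47, 10, 2],
    [2, 17, 0],
    [1, 9, 9],
]

def accuracy(matrix):
    """Share of observations on the main diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def recall_per_class(matrix, levels):
    """For each reference class, the share of its observations predicted correctly."""
    out = {}
    for j, level in enumerate(levels):
        column_total = sum(row[j] for row in matrix)
        out[level] = matrix[j][j] / column_total
    return out

test_accuracy = accuracy(TABLE_4)                   # 73 correct out of 97 observations
test_recall = recall_per_class(TABLE_4, LEVELS)
```

The per-class view shows that most of the errors come from WW observations, of which only 17 out of 36 are predicted correctly.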
[Figure 2 (tree diagram). Node 1 splits on var_63 (p < 0.001) at 19.97. For var_63 ≤ 19.97, node 2 splits on var_55 (p < 0.001) at 40.126 into leaf node 3 (≤ 40.126, n = 14) and leaf node 4 (> 40.126, n = 353). For var_63 > 19.97, node 5 splits on var_63 (p < 0.001) at 30.057 into leaf node 6 (≤ 30.057, n = 98) and leaf node 7 (> 30.057, n = 98). Each leaf shows a bar plot of the proportions of KR, WW and SI on a scale from 0 to 1.]
Figure 2: Visualization of the conditional inference tree for section A. Each oval contains a specific variable. Following the branches leads to specific binary partitions of the variables based on the shown threshold. The value of n at the leaves represents the total number of observations that fall in that terminal node.

Table 4: Confusion matrix for test data in section A. The final conditional inference tree was used. Each cell shows the raw correspondence counts between observed and predicted values.

              Reference
Prediction    KR    WW    SI
KR            47    10    2
WW            2     17    0
SI            1     9     9

5.2. Overall Results
The experimental methodology followed for the remaining sections was the same. The naive estimate was computed for each section of the report, since the proportion of the class labels changed. Different conditional inference trees were built by tuning the algorithm parameters and using only the respective section data. Then the models were trained again with all available variables to assess variations in the outcome. Once the final model was established, it was applied to the test set of the corresponding section to get one final estimate of performance. The summary of the results is shown in Table 5, including the previous results of section A for completeness. The graphical depiction of the trees of each section can be seen in Appendix A.
It is interesting to see that in most cases, including additional variables into the training procedure was actually detrimental to average accuracy, albeit slightly. This also means that the machine learning algorithm is good at dealing with irrelevant or redundant information contained in the data.

Table 5: Results after applying the machine learning workflow to all sections of the reports.

Section   Naive rule   Model accuracy   Accuracy with all variables   Test set accuracy
A         0.514        0.762            0.758                         0.753
B         0.736        0.865            0.851                         0.859
C         0.525        0.840            0.830                         0.847
D         0.492        0.669            0.669                         0.656

5.3. Evaluation
Now that there is an idea of the performance that could be expected from the conditional inference trees, the specific details that relate to the quality and the interpretation of the underlying mechanisms at play can be outlined. These interpretations will be focused on the considered dataset, and will reflect some insight that can only be obtained after careful analysis of all variables and their meaning.
In an ideal scenario, the variables in the data would contain all the necessary information regarding the motor's health, and a machine learning algorithm would be able to extract it and attain perfect accuracy on both known and future data; in reality, this is rarely the case. Some plausible reasons for this could be data input errors, some form of noise, inconsistent processing algorithms, human bias, etc. The accuracy estimates that were obtained in the previous sections provide clues about the data quality, or lack thereof: if accuracy is low, there is clearly some information missing, or the models were not able to uncover it. It is desired to provide possi-