scieee Science in your language
[en] (orig)

EASD - Experimental Algorithmics for Streaming Data

Author: Bartz-Beielstein, Thomas
Year: 2016
Source: https://cos.bibl.th-koeln.de/files/338/bart16fCos.pdf
!
!
!
!
CIplus
Band 2/2016
EASD - Expe imen al Algo i hmics o
S eaming Da a
Thomas Ba z-Beiels ein
!
!
!
!
!
!
!
! !
EASD – Expe imen al Algo i hmics o S eaming Da a
Thomas Ba z-Beiels ein
SPOTSe en Lab
Facul y o Compu e Science and Enginee ing Science
TH K¨oln
S einm¨ulle allee 1
51643 Gumme sbach, Ge many
spo se en.de
Ma ch 13, 2016
Abs ac
This pape p oposes an expe imen al me hodology o on-line machine lea ning algo i hms, i.e.,
o algo i hms ha wo k on da a ha a e a ailable in a sequen ial o de . I is demons a ed how
es ablished ools om expe imen al algo i hmics (EA) can be applied in he on-line o s eaming
da a se ing. The massi e on-line analysis (MOA) amewo k is used o pe o m he expe imen s.
Bene i s o a well-de ined epo s uc u e a e discussed. The applica ion o me hods om he EA
communi y o on-line o s eaming da a is e e ed o as expe imen al algo i hmics o s eaming
da a (EADS).
1 In oduc ion: Expe imen al Algo i hmics
This a icle is de o ed o he ques ion “Why is an expe imen al me hodology necessa y o he
analysis o on-line algo i hms?” We will men ion wo easons o mo i a e he app oach p esen ed in
his pape .
1. Fi s , wi hou a sound me hodology, he e is he dange o gene a ing a bi a y esul s, i.e.,
esul s ha happened by chance; esul s ha a e no ep oducible; esul s ha depend on he
seed o a andom numbe gene a o ; esul s ha a e s a is ically ques ionable; esul s ha a e
s a is ically signi ican , bu scien i ically meaningless; esul s ha a e no gene alizable; e c.
2. Second, expe imen s a e he co ne s one o he scien i ic me hod. E en he disco e y o scien i ic
highly ele an esul s is o no use, i hey emain unpublished o i hey a e published in an
incomp ehensible manne . Discussion is he key ing edien o mode n science.
Expe imen al algo i hmics (EA) uses empi ical me hods o analyze and unde s and he beha io
o algo i hms. Me hods om expe imen al algo i hmics complemen heo e ical me hods o he
analysis o algo i hms. S a is ics play a p ominen ole in EA and p o ide ools o gain a deep
unde s anding o algo i hm beha io and e iciency. McGeoch (2012) desc ibes wo b anches o
expe imen al algo i hmics:
(EA-1) Empi ical analysis
deals wi h he analysis and cha ac e iza ion o he beha io o algo i hms, and uses mos ly
echniques and ools om s a is ics.
(EA-2) Algo i hm design o algo i hm enginee ing
is ocused on empi ical me hods o imp o ing he pe o mance o algo i hms, is based on
app oaches om s a is ics, machine lea ning and op imiza ion. The s a is ical analysis o al-
go i hm pa ame e s can esul in signi ican pe o mance imp o emen s. This p ocedu e is
e e ed o as algo i hm uning.
Expe imen al algo i hmics e ol ed o e he las h ee decades and p o ides ools o sound expe -
imen al analysis o algo i hms. Main con ibu ions, which in luenced he ield o EA a e McGeoch’s
hesis “Expe imen al Analysis o Algo i hms” (McGeoch,1986), he expe imen al e alua ion o sim-
ula ed annealing by (Johnson e al.,1989,1991), and he a icle abou designing and epo ing
compu a ional expe imen s wi h heu is ic me hods om (Ba e al.,1995). Cohen de ails ypical
e o s in expe imen al esea ch, e.g., loo - and ceiling e ec s (Cohen,1995). And, Hooke ’s pape s
1
wi h he s iking i les “Needed: An empi ical science o algo i hms” and “Tes ing Heu is ics: We
Ha e I All W ong” (Hooke ,1994,1996), which eally s uck a ne e.
A he u n o he millennium, hese ideas eached mo e and mo e disciplines. Theo e icians
ecognized ha hei me hods can bene i om expe imen al analysis and he discipline o algo i hm
enginee ing was es ablished (Ca aneo and I aliano,1999). New s a is ical me hods, which had hei
o igin in geos a is ics, become popula ools o he expe imen al analysis o algo i hms. San ne
e al. (2003) made he e m design and analysis o compu e expe imen s (DACE) popula .
Pa ame e uning me hods gained mo e and mo e a en ion in he machine lea ning (ML) and
compu a ional in elligence (CI) communi ies. Eiben and Jelasi y’s “C i ical No e on Expe imen-
al Resea ch Me hodology in EC” (Eiben and Jelasi y,2002) en o ced he discussion. The au ho s
desc ibed hei discon en wi h he common p ac ice. They eques ed conc e e esea ch ques ions
(and answe s), adequa e benchma ks, clea ly de ined pe o mance measu es, and ep oducibili y o
he expe imen al indings. Ba z-Beiels ein (2003) combined classical design o expe imen me hods,
ee-based me hods, and me hods om DACE o he analysis and he uning o algo i hms. This
inc eased awa eness esul ed in se e al u o ials, wo kshops, and special sessions de o ed o expe -
imen al esea ch in e olu iona y compu a ion. Resul s om hese e o s a e summa ized in he
collec ion “Expe imen al Me hods o he Analysis o Op imiza ion Algo i hms” (Ba z-Beiels ein
e al.,2010a). New so wa e ools o he expe imen al analysis and algo i hm uning we e de el-
oped. The sequen ial pa ame e op imiza ion (SPO) was p oposed as (i) a me hodological ame-
wo k o acili a e expe imen a ion wi h adequa e s a is ical me hods and (ii) a model-based uning
echnique o de e mine imp o ed algo i hm ins ances and o in es iga e pa ame e e ec s and in e -
ac ions (Ba z-Beiels ein e al.,2005). So SPO co e s bo h b anches, i.e., (EA-1) and (EA-2), o
expe imen al algo i hmics as de ined by McGeoch.
The idea o a sound expe imen al me hodology sp ead ou o e many sub-disciplines. Expe -
imen al me hods we e e ined and specialized o many new ields, e.g., pa allel algo i hms (Ba
and Hickman,1993), big da a applica ions, e.g., map educe by Ben a i (2012), mul imodal op-
imiza ion (P euss,2015), o mul iobjec i e op imiza ion (Zae e e e al.,2013). Mul iobjec i e
op imiza ion allows he combina ion o se e al goals, e.g., speed and accu acy.
Exis ing benchma ks and s anda d es - unc ion se s we e econside ed (Whi ley e al.,1996).
Two compe i ions we e es ablished in he con ex o CI: he CEC special session on pa ame e
op imiza ion (Sugan han e al.,2005) and he black box op imiza ion benchma k (BBOB) compe i-
ion (Hansen e al.,2009).
This o e iew is by a no comple e, and se e al impo an publica ions a e missing. Howe e ,
i illus a es he de elopmen o an eme ging ield and i s impo ance. The s anda d app oach
desc ibed so a ocuses on ela i ely small, s a ic da a se s ha can be analyzed o -line. We
p opose an ex ension o EA o he ield o s eam da a, which will be e e ed o as expe imen al
algo i hmics o s eaming da a (EASD).
This ex ension is mo i a ed by he eno mous g ow h o da a in he las decades. The a ailabili y
o da a aises he ques ion: how o ex ac in o ma ion om da a? Machine lea ning, i.e., au oma -
ically ex ac in o ma ion om da a, was conside ed he solu ion o he immense inc ease o da a.
His o ically, ML s a ed wi h small da a se s wi h limi ed aining se size: he en i e da a se can
be s o ed in memo y.
The ield o da a mining e ol ed o handle da a ha does no i in o wo king memo y: Da a
mining became popula , because i p o ides ools o e y la ge, bu s a ic da a se s. Models canno
be upda ed when new da a a i es.
Nowadays, da a a e collec ed in nea ly e e y de ice. Senso s a e e e ywhe e—massi e, da a
s eams a e ubiqui ous. Especially, indus ial p oduc ion p ocesses gene a e huge and dynamic
da a. This leads o he de elopmen o he da a s eam pa adigm. Bi e e al. (2010) desc ibe
co e assump ions o da a s eam p ocessing as ollows:
(S-1) Vola ili y
The aining examples can be b ie ly inspec ed a single ime only.
(S-2) Speed
The da a a i e in a high speed s eam.
(S-3) Memo y
Because he memo y is limi ed, da a mus be disca ded o p ocess he nex examples.
(S-4) Bulk da a
The o de o da a a i al is unknown.
(S-5) On-line
The model is upda ed inc emen ally, i.e., di ec ly when a new da a a i es.
(S-6) Any ime p ope y
The model can be applied a any poin be ween aining examples.
2
Inpu
Requi emen R-1
Lea ning,
T aining
Requi emen s
R-2 and R-3
Model
Requi emen R-4
T aining
da a
Tes
da a P edic ions
Lea ning,
T aining
Model
T aining
da a
Tes
da a
P edic ions
Inpu
Figu e 1: Le : The ba ch classi ica ion cycle. Righ : The s eam classi ica ion cycle. Do ed lines
ep esen nonpe manen da a. Bo h classi ica ion cycles pa i ion he da a in o es and aining se . To
keep he illus a ion simple, his spli is no shown. The igu e is based on he da a s eam classi ica ion
cycle in Bi e e al. (2011).
(S-7) E idence
Las bu no leas : heo y is nice, bu empi ical e idence o algo i hm pe o mance is necessa y.
Simila design c i e ia we e p oposed by Domingos and Hul en (2012). They add he equi emen
ha a model, which is nea ly iden ical o he one ha would be gene a ed by a co esponding
o dina y da a base mining algo i hm, should be p oduced. Al oge he , hese co e assump ions esul
in a e y demanding e alua ion and pe o mance assessmen .
We will de elop an expe imen al me hodology o on-line machine lea ning, i.e., o si ua ions in
which da a becomes a ailable in a sequen ial o de . The da a is used o upda e he p edic o o
u u e da a a each s ep. On-line lea ning di e s om adi ional ba ch lea ning echniques, which
gene a e he p edic o by lea ning on he en i e aining da a se a once. The e ms “on-line” and
“da a s eam” will be used synonymously in he ollowing.
This pape is s uc u ed as ollows. Sec ion 2compa es he adi ional ba ch se ing wi h he
s eam da a se ing. The gene a ion o da a s eam algo i hms is desc ibed in Sec. 3. Change
and d i a e b ie ly in oduced in Sec. 4. How o assess model pe o mance is desc ibed in Sec. 5.
En i onmen s o pe o ming he expe imen al analysis a e p esen ed in Sec. 6. A simple expe imen ,
which exempli ies he EASD app oach, is p esen ed in Sec. 7. This a icle concludes wi h a summa y
in Sec. 8.
2 Ba ch Ve sus S eam Classi ica ion
By compa ing he adi ional ba ch and he s eam classi ica ion, he ollowing obse a ions can be
made: Bo h classi ica ion p ocedu es pa i ion he da a in o es and aining se . In con as o
ba ch classi ica ion, s eam classi ica ion is a cyclic p ocess, which uses nonpe manen da a. The
elemen a y s eps used in bo h se ings a e illus a ed in Fig. 1.
2.1 Ba ch Classi ica ion
The ba ch classi ica ion cycle p ocesses da a as ollows:
1. Inpu
The algo i hm ecei es he da a.
3
2. Lea n
The algo i hm p ocesses he da a and gene a es i s own da a s uc u es (builds a model).
3. P edic
The algo i hm p edic s he class o unseen da a using he es se .
2.2 S eam Classi ica ion
Da a a ailabili y di e s in he s eam classi ica ion cycle. Addi ional es ic ions ha e o be consid-
e ed (Bi e e al.,2011). F eely adap ed om Bi e e al. (2011), he da a s eam p ocessing can be
desc ibed as ollows:
1. Inpu
The algo i hm ecei es he nex da a om he s eam. A his s age o he p ocess, he ollowing
equi emen has o be conside ed:
(R-1) P ocess only once
Da a s eam da a is accep ed as hey a i e. A e inspec ion, he da a is no a ailable any
mo e. Howe e , he algo i hm i sel is allowed o se up an a chi e (memo y). Fo example,
he algo i hm can s o e a ba ch o da a ha is used by adi ional, s a ic algo i hms. Due
o he limi ed amoun o memo y, he algo i hm has o disca d he da a a e a ce ain
ime pe iod.
2. Lea n
The algo i hm p ocesses he da a and upda es i s own da a s uc u es (upda es he model). A
his s age o he p ocess, he ollowing equi emen s ha e o be conside ed:
(R-2) Limi ed Memo y Budge
Da a s eam algo i hms allow p ocessing da a ha a e se e al imes bigge han he wo k-
ing memo y. Memo y can be classi ied as memo y ha is used o s o e unning s a is ics
and memo y ha is used o s o e he cu en model.
(R-3) Limi ed Time Budge
Run ime complexi y o he algo i hms should be linea in he amoun o da a, i.e., he
numbe o examples. Real- ime p ocessing equi es ha he algo i hm p ocess he da a
quickly (o e en as e ) han hey a i e.
3. P edic
The algo i hm is able o ecei e he nex da a. I eques ed, i is also able o p edic he class
o unseen da a. A his s age o he p ocess, he ollowing equi emen has o be conside ed:
(R-4) P edic a any Poin
The bes model should be gene a ed as e icien ly as possible. The model is gene a ed
di ec ly in memo y by he algo i hm as i p ocesses incoming da a.
3 Gene a ing Da a S eam Algo i hms
The p ocess o enabling ML algo i hms o p ocess la ge amoun o da a is called scaling (Die e ich,
1997). In da a mining, he w appe and he adap a ion app oach a e known o his ask.
3.1 W appe
The maximum euse o exis ing schemes is e e ed o as he w appe app oach. The w appe
app oach collec s and p e-p ocesses da a so ha adi ional ba ch lea ne s can be used o model
building.
Bi e e al. (2011) lis ypical implemen a ions o he w appe app oach: Chu and Zaniolo (2004)
p opose a boos ing ensemble me hod, which is based on a dynamic sample-weigh assignmen scheme.
They claim ha his me hod is as , makes ligh demands on memo y esou ces, and is able o adap
o concep d i . Wang e al. (2010) in oduce a amewo k o mining concep -d i ing da a s eams.
An ensemble o classi ica ion models, such as C4.5 (Quinlan,1993) o nai e Bayes, is ained om
sequen ial chunks o he da a s eam. The classi ie s in he ensemble a e weigh ed based on hei
es ima ed classi ica ion accu acy on he es da a unde he ime-e ol ing en i onmen . Thus, he
ensemble app oach imp o es bo h he e iciency in lea ning he model and he accu acy in pe o ming
classi ica ion.
W appe app oaches su e om he ollowing di icul ies: (i) The selec ion o an app op ia e
aining se sizes is p oblema ic. (ii) The aining ime canno be con olled. Only he ba ch sizes
can be modi ied, which p o ides a mean o an indi ec con ol o he aining ime.
4

3.2 Adap a ion
The adap a ion app oach de elops new me hods, which a e specially designed o he s eam da a
analysis ask. Adap a ion comp ehends he ca e ul design o algo i hms especially designed o ul ill
da a mining and ML asks on dynamic da a. Adap a ion algo i hms can implemen di ec s a egies
o memo y managemen and a e usually mo e lexible han w appe app oaches. Bi e e al. (2011)
lis examples o adap a ion algo i hms om he ollowing algo i hm classes: decision ees, associa-
ion ules, nea es neighbo me hods, suppo ec o machines, neu al ne wo ks, and ensemble-based
me hods.
4 Change and D i De ec ion
Changes in he da a dis ibu ion play an impo an ole in da a s eam analysis. This mo i a ed
he de elopmen o specialized echniques, e.g., change de ec ion s a egies (Lugho e and Sayed-
Mouchaweh,2015). The a ge alue, which will be p edic ed by he model, is e e ed o as concep .
Se e al changes o e ime may occu : A g adual concep change is e e ed o as a concep d i . A
apid concep change o e ime is e e ed o as a concep shi , and a change in he da a dis ibu ion
is e e ed o as a dis ibu ion o sample change.
Es ablished me hods om s a is ical p ocess con ol, e.g., he cumula i e sum (CUSUM) algo-
i hm, can be used o change de ec ion (Mon gome y,2008). Fu he me hods comp ehend he
geome ic-mo ing-a e age (GMA) es o s a is ical hypo hesis es s. Hypo heses es s a e popula ,
because hey a e well-s udied and can be implemen ed easily. They compa e da a om wo s eams.
The null hypo hesis H0 eads: bo h s eams come om he same dis ibu ion. Then a hypo hesis
es , which is based on he di e ence o means µ1and µ2is pe o med, e.g., he Kolmogo o -Smi no
(KS) es .
Gama e al. (2004) de eloped a d i -de ec ion me hod, which is based on he on-line e o - a e.
The aining examples a e p esen ed in sequence. New aining example a e classi ied using he ac ual
model. As long as he dis ibu ion is s a iona y, he classi ica ion e o will dec ease. I a change
in he dis ibu ion occu s, he e o will inc ease. The ace o he on-line e o o he algo i hm is
obse ed.
Fu he app oaches, e.g., based on he exponen ial-weigh ed-mo ing-a e age (EWMA) we e de-
eloped (Ross e al.,2012). The eade is e e ed o Gus a sson (2001), who p esen s an o e iew
o change de ec ion me hods.
5 Assessing Model Pe o mance
This sec ion desc ibes an expe imen al se ing o he assessmen o model pe o mance. In Sec. 5.1
pe o mance c i e ia a e in oduced. The expe imen al e alua ion o he model pe o mance equi es
da a. S a egies o ob aining es da a a e desc ibed in Sec. 5.2.
5.1 Pe o mance C i e ia
Elemen a y pe o mance c i e ia o da a s eam algo i hms a e based on
P-1 ime (speed)
We conside he amoun o ime needed (i) o lea n and (ii) o p edic . I he ime limi is
eached, con inuing he da a p ocessing will ake longe o esul s will loose p ecision. This
consequence is no so ha d as he space limi , because o e iding he space limi will o ce he
algo i hm o s op.
P-2 space (memo y)
A simple s a egy o he handling space budge is o s op once he limi is eached. To con inue
p ocessing i he he space limi is eached is o o ce he algo i hm o disca d pa s o i s da a.
P-3 e o a es (s a is ical measu es)
The p edic ion e o is conside ed. Se e al e o measu es a e a ailable (Hyndman and Koehle ,
2006;Willmo and Ma suu a,2005).
Rega ding classi ica ion, he ollowing e o measu es a e es ablished. We will conside a bina y
classi ica ion es , e.g., a es ha sc eens i ems o a ce ain ea u e. In his se ing, he e m
“posi i e” co esponds wi h “iden i ied” ( he i em has he ea u e) and “nega i e” co esponds
wi h “ ejec ed” ( he i em does no ha e he ea u e), i.e.,
•TP: ue posi i e = co ec ly iden i ied
•FP: alse posi i e = inco ec ly iden i ied = αe o o he i s ype
•TN: ue nega i e = co ec ly ejec ed
5
Table 1: Con usion ma ix. The e o o he i s kind, he so-called αe o ( alse posi i e) as well as
he e o o he second kind, he so-called β-e o ( alse nega i e), a e deno ed as αand β, espec i ely.
T ue alue
Posi i e Nega i e
Tes esul (p edic ion) Posi i e TP FP (α) PPV (p ecision)
Nega i e FN (β) TN NPV
TPR TNR
•FN: alse nega i e = inco ec ly ejec ed = βe o o he second ype
These ou alues can be a anged in a con ingency able o con usion ma ix as shown in
Table 1.
Based on he alues om he con usion ma ix, he ollowing s a is ical measu es can be de e -
mined:
•Accu acy is measu ed as he pe cen age o co ec classi ica ions, i.e., i is de ined as he
sum o he numbe o ue posi i es and he numbe o ue nega i es di ided by he o al
numbe o examples ( o al popula ion). Since accu acy can be misleading (conside he
so-called accu acy pa adox), u he measu es a e commonly used (Zhu,2007).
•The p ecision is de ined as he numbe o ue posi i es di ided by he numbe o ue
posi i es and alse posi i es. P ecision is also e e ed o as he posi i e p edic i e alue
(PPV).
•The nega i e p edic i e alue (NPV) is de ined as he numbe o ue nega i es di ided by
he numbe o ue nega i es and alse nega i es.
•The speci i y (o ue nega i e a e, TNR) is de ined as he numbe o ue nega i es
di ided by he numbe o ue nega i es and alse posi i es.
•The sensi i i y (o ue posi i e a e (TPR) o ecall) is de ined as he numbe o ue
posi i es di ided by he numbe o ue posi i es and he numbe o alse nega i es.
•The adi ional F1sco e combines he sensi i i y (TPR) and he p ecision (PPV) measu es.
I is de ined as he ha monic mean o sensi i i y and p ecison:
F1= 2 ×((PPV ×TPR)/(PPV + TPR)).
T aining da a can be used o de e mine hese s a is ical measu es. This can lead o o e i ing and
esul in poo ly gene alizable models. The e o e, es ing da a, i.e., using unseen da a, should be
used (Has ie,2009).
5.2 How o Gene a e Tes Da a
Gene a ing es da a appea s o be i ial a he i s sigh . Howe e , simply spli ing he da a in o
wo se s migh cause unwan ed e ec s, e.g., in oduce bias, o esul in an ine icien usage o he
a ailable in o ma ion. The es da a gene a ion p ocess needs ca e ul conside a ions in o de o a oid
hese allacies.
Fi s , we will conside adi ional me hods, e.g., s a ic (s a iona y) da a se s wi h a ela i ely
small amoun o da a. A se ing will be e e ed o as a ba ch se ing, i all he da a is a ailable
ahead o ime. O he wise, he se ing will be e e ed o as an on-line se ing. I he da a dis ibu ion
changes o e ime, he se ing is e e ed o as a non-s a iona y se ing.
5.2.1 Tes Da a in P e ious E alua ion Me hods
P e ious ML me hods ocused on s a ic da a wi h a limi ed amoun o samples in he ba ch se ing.
Bi e e al. (2011) lis he ollowing s a egies, which a e conside ed s anda d in ML.
Holdou :
Di ide he da a in wo subse s, he so-called aining se and he holdou o es ing se . The
concep ual p oblem wi h he holdou me hod is ha i does no use he da a e icien ly, because
many da a a e no used o aining.
Random subsampling:
To o e come he concep ual p oblem o he holdou app oach and o imp o e i s e iciency,
andom subsampling and a e aging o he e o s om each subse can be used. Random
subsampling also allows measu ing he accu acy es ima e’s a iance.
6
Random subsampling su e s om he ollowing disad an age: I he aining se con ains many
da a om one class, han he es se na u ally con ains only a ew da a om his class. This
bias esul s in andom subsampling iola ing he assump ion ha aining and es se s a e
independen .
C oss alida ion (CV):
C oss alida ion (A lo and Celisse,2009) op imizes he use o da a o aining and es ing.
E e y a ailable da a is used o aining and es ing. To a oid bias, s a i ied c oss- alida ion
can be used.
Lea e-one-ou CV:
is a special case o c oss- alida ion, which is ela i ely expensi e o pe o m i he model i is
complex. Lea e-one-ou CV is, in con as o CV, comple ely de e minis ic. Howe e , lea e-
one-ou CV can ail, e.g., conside a classi ica ion si ua ion wi h wo classes, when da a is
comple ely andom and 50 pe cen o he da a belong o each class. The lea e-one-ou CV
ails, because he majo i y o e is always w ong on he hold-ou da a and lea e-one-ou CV
will p edic an accu acy o ze o, al hough he expec ed accu acy is 50 pe cen .
The boo s ap:
The boo s ap me hod (E on and Tibshi ani,1993) gene a es a boo s ap sample by sampling
wi h eplacemen a aining da a se , which has he same size as he o iginal da a. The da a
ha a e no in he aining da a se a e used used o es ing. The boo s ap can gene a e
w ong esul s (in his case, hey a e oo op imis ic) in se ings desc ibed o he lea e-one-ou
CV.
Recommenda ions o choosing a sui able e alua ion me hod a e gi en in Koha i (1995). Kleijnen
(2008) discusses hese me hods in he con ex o simula ion expe imen s.
5.2.2 E alua ion P ocedu es o Dynamic Da a S eam Me hods
In he dynamic da a s eam se ing, plen y o da a is a ailable. The simple holdou s a egy can be
used wi hou causing he p oblems men ioned in he ba ch se ing. In con as o he ba ch se ings,
la ge da a se s o exac accu acy es ima ions can be used o es ing wi hou p oblems.
Holdou :
The simples app oach is jus holding ou one (la ge) single e e ence da a se du ing he whole
lea ning ( aining) phase. Using his holdou da a se , he model can be e alua ed pe iodically.
This holdou da a se can be gene a ed using (i) old o (ii) new da a, i.e., look-ahead da a se s.
The look-ahead me hod (Agga wal,2013) can gi e he model addi ional in o ma ion (addi ional
aining da a) a e he i s es ing phase is comple ed. This me hod is use ul in si ua ion when
concep d i occu s.
In si ua ions wi hou any concep d i , a s a ic holdou da a se is adequa e and be e sui ed,
because i does no in oduce an addi ional sou ce o andomness in he lea ning p ocess.
In e lea ed Tes -Then-T ain
In addi ion o he holdou me hod, he in e lea ed es - hen- ain me hod was p oposed o
e alua ing da a s eam algo i hms (Bi e e al.,2010). Each new da a can be used o es he
model be o e i is used o aining. Ad an ages o his me hod a e: (i) The model is es ed on
unknown da a, (ii) each da um is used o es ing and aining, and (iii) he accu acy o e ime
is smoo h, because each indi idual new da a will become less signi ican o he o e all a e age
( he sample size is inc easing).
In e lea ed es - hen- ain me hods obscu e he di e ences be ween aining and es ing pe i-
ods. The ue accu acy migh be biased, because algo i hms will be punished o ea ly mis akes.
This e ec cancels ou o e ime.
5.3 Analysis
A g aphical plo (accu acy e sus numbe o aining samples) is he mos common way p esen ing
esul s. Ve y o en, he compa ison o wo algo i hms is based on g aphical compa isons by isualizing
ends (e.g., accu acy) o e ime. Bi e e al. (2011) desc ibe he si ua ion as ollows:
”Mos o en his is de e mined om a single holdou un, and wi h an independen es se
con aining 300 housand examples o less. I is a e o see a se ious a emp a quan i ying
he signi icance o esul s wi h con idence in e als o simila checks.”
This si ua ion appea s o be simila o he si ua ion o expe imen al algo i hmics a decade ago,
which was desc ibed in Sec. 1. To ob ain eliable esul s, s a is ical measu es such as he s anda d
e o o esul s, a e ecommended. Use ul app oaches we e desc ibed by Demˇsa (2006). Fo example,
7
McNema ’s es can be used o measu e he ag eemen be ween compe ing algo i hms on each es
da a (Die e ich,1998). The s a is ical analysis should be accompanied by a comp ehensi e epo ing
scheme, which includes he ele an de ails o unde s anding and possible eplica ion o he indings.
The expe imen al analysis o da a s eam classi ica ion is subjec o cu en esea ch. Only a ew
publica ions ha pe o m an ex ensi e s a is ical analyis a e a ailable. Bi e e al. (2011) claim ha
pape Domingos and Hul en (2000) is he bes analysis so a .
Fo una ely, he open sou ce amewo k o da a s eam mining MOA, which includes a collec ion
o machine lea ning algo i hms (classi ica ion, eg ession, clus e ing, ou lie de ec ion, concep d i
de ec ion and ecommende sys ems) and ools o e alua ion, is a ailable and p o ides ools o an
ex ensi e expe imen al analysis (Holmes e al.,2010).
6 En i onmen s and Da a Sou ces
6.1 Real-wo ld Da a S eam Compu ing En i onmen s
Compu ing en i onmen s can be classi ied as ollows: (i) Senso ne wo ks. In his se ing, only a ew
hund ed kiloby es o memo y a e a ailable. (ii) Handheld compu e , which p o ide se e al megaby es
o memo y, (iii) Single-boa d compu e s, i.e., compu e s ha a e buil on a single ci cui boa d. The
componen s such as mic op ocesso (s), memo y, o inpu /ou pu a e loca ed on his boa d. Single-
boa d compu e s can be used as embedded compu e con olle s, and (i ) se e s, which p o ide
se e al gigaby es o RAM, e.g., 64 GB.
6.2 Simula o s o Da a S eam Analysis
Random da a s eam simula o s a e a aluable ool o he expe imen al analysis o da a s eam
algo i hms. Fo ou expe imen s in Sec. 7a s a ic da a se , which con ained p e-assigned class
labels, was used o simula e a eal-wo ld da a s eam en i onmen . Conside a da a se o size n,
which is pa i ioned in aining and es da a se s wi h n ain : numbe o examples used o aining
be o e an e alua ion is pe o med on he es se . 2. n es : size o he es se . Using nai e Bayes o
Hoe ding ees, a model is buil using he i s n ain ins ances o aining. P edic ion is pe o med
on each o he emaining n es ins ances. The es da a is used o inc emen ally upda e he model
buil . This on-line s eam se ing allows he es ima ion o he model accu acy on an ongoing basis,
because he esul s (class labels) a e known in ad ance.
The open sou ce amewo k o da a s eam mining MOA (Holmes e al.,2010) is able o gene a e
a ew housand examples up o se e al hund ed housand examples pe second. Addi ional noise can
slow down he speed, because i equi es he gene a ion o andom numbe s.
7 A Simple Expe imen in he EASD F amewo k
7.1 The Scien i ic Ques ion
Be o e expe imen al uns a e pe o med, he scien i ic ques ion, which mo i a es he expe imen s,
should be clea ly s a ed. To exempli y he EASD app oach, he ollowing ask is conside ed:
Machine lea ning me hods, which combine mul iple models o imp o e p edic ion accu acy, a e
called ensemble da a mining algo i hms. Di e si y o he di e en models is necessa y o each his
pe o mance gain compa ed o indi idual models. I ensemble membe s always ag ee, he ensemble
does no p o ide an ex a bene i . E o s o one ensemble membe can be co ec ed by o he membe s
o he ensemble.
Each indi idual ML algo i hm equi es he speci ica ion o some pa ame e s, e.g., he spli ing
c i e ion o eg ession ees o he o de o linea models. Building ensembles equi es he speci-
ica ion o addi ional algo i hm pa ame e s, e.g., he numbe o ensemble membe s. The scien i ic
ques ion can be o mula ed as ollows:
Scien i ic ques ion:
How does he numbe o models o boos a ec he pe o mance o on-line lea ning algo i hms?
To se up expe imen s, a speci ic algo i hm (o a se o algo i hms) has o be selec ed. Oza e
al. p esen ed a simple on-line bagging and boos ing algo i hm, OzaBoos (Oza and Russell,2001b,a;
Oza). OzaBoos combines mul iple lea ned base models wi h he aim o imp o ing gene aliza ion
pe o mance. OzaBoos implemen s an on-line e sion o hese ensemble lea ne s, which ha e been
used p ima ily in ba ch mode. The e ec o he numbe o models o boos on he algo i hm
pe o mance is an impo an esea ch ques ion, which will be analyzed in he ollowing expe imen s.
The e o e, he expe imen al se up as well as he implemen a ion de ails ha e o be speci ied u he .
8
G. Holmes, B. P ah inge , P. K anen, T. Jansen, T. Seidl, and A. Bi e . MOA: Massi e Online Analy-
sis, a F amewo k o S eam Classi ica ion and Clus e ing. In In e na ional Wo kshop on Handling
Concep D i in Adap i e In o ma ion Sys ems in conjunc ion wi h Eu opean Con e ence on
Machine Lea ning and P inciples and P ac ice o Knowledge Disco e y in Da abases ECML
PKDD, pages 3–16, 2010.
J. N. Hooke . Needed: An empi ical science o algo i hms. Ope a ions esea ch, 42(2):201–212, 1994.
J. N. Hooke . Tes ing Heu is ics: We Ha e I All W ong. Jou nal o heu is ics, 1(1):33–42, 1996.
R. J. Hyndman and A. B. Koehle . Ano he look a measu es o o ecas accu acy. In e na ional
Jou nal o Fo ecas ing, 22(4):679–688, Oc . 2006.
D. S. Johnson, C. R. A agon, L. A. McGeoch, and C. Sche on. Op imiza ion by Simula ed Annealing:
an Expe imen al E alua ion. Pa I, G aph Pa i ioning. Ope a ions esea ch, 37(6):865–892, 1989.
D. S. Johnson, C. R. A agon, L. A. McGeoch, and C. Sche on. Op imiza ion by Simula ed Anneal-
ing: an Expe imen al E alua ion. Pa II, G aph Colo ing and Numbe Pa i ioning. Ope a ions
esea ch, 39(3):378–406, 1991.
J. P. C. Kleijnen. Design and analysis o simula ion expe imen s. Sp inge , New Yo k NY, 2008.
R. Koha i. A S udy o C oss-Valida ion and Boo s ap o Accu acy Es ima ion and Model Selec ion.
pages 1–8, 1995.
E. Lugho e and M. Sayed-Mouchaweh. Adap i e and on-line lea ning in non-s a iona y en i on-
men s. E ol ing Sys ems, 6(2):75–77, Feb. 2015.
C. C. McGeoch. Expe imen al Analysis o Algo i hms. PhD hesis, Ca negie Mellon Uni e si y,
Pi sbu gh PA, 1986.
C. C. McGeoch. A Guide o Expe imen al Algo i hmics. Camb idge Uni e si y P ess, New Yo k,
NY, USA, 1s edi ion, 2012.
D. C. Mon gome y. S a is ical Quali y Con ol. Wiley, 2008.
N. C. Oza. Online Bagging and Boos ing. In 2005 IEEE In e na ional Con e ence on Sys ems, Man
and Cybe ne ics.
N. C. Oza and S. Russell. Online bagging and boos ing. In T. Jaakola and T. Richa dson, edi o s,
8 h Ins e na ional Wo kshop on A i icial In elligence and S a is ics, pages 105–112, 2001a.
N. C. Oza and S. Russell. Expe imen al compa isons o online and ba ch e sions o bagging and
boos ing. In he se en h ACM SIGKDD in e na ional con e ence, pages 359–364, New Yo k, New
Yo k, USA, 2001b. ACM P ess.
B. P ah inge , G. Holmes, and R. Ki kby. New Op ions o Hoe ding T ees. In AI 2007: Ad ances
in A i icial In elligence, pages 90–99. Sp inge Be lin Heidelbe g, Be lin, Heidelbe g, Dec. 2007.
M. P euss. Mul imodal Op imiza ion by Means o E olu iona y Algo i hms. Na u al Compu ing
Se ies. Sp inge In e na ional Publishing, Cham, 2015.
J. R. Quinlan. C4.5: P og ams o Machine Lea ning. Mo gan Kau mann Publishe s Inc., San F an-
cisco, CA, USA, 1993.
R Co e Team. R: A Language and En i onmen o S a is ical Compu ing. R Founda ion o S a-
is ical Compu ing, Vienna, Aus ia, 2015.
G. J. Ross, N. M. Adams, D. K. Tasoulis, and D. J. Hand. Exponen ially weigh ed mo ing a e age
cha s o de ec ing concep d i . 33(2):191–198, Jan. 2012.
T. J. San ne , B. J. Williams, and W. I. No z. The Design and Analysis o Compu e Expe imen s.
Sp inge , Be lin, Heidelbe g, New Yo k, 2003.
A. Shake and E. H¨ulle meie . IBLS eams: a sys em o ins ance-based classi ica ion and eg ession
on da a s eams. E ol ing Sys ems, 3(4):235–249, 2012.
P. N. Sugan han, N. Hansen, J. J. Liang, K. Deb, Y. P. Chen, A. Auge , and S. Tiwa i. P oblem De i-
ni ions and E alua ion C i e ia o he CEC 2005 Special Session on Real-Pa ame e Op imiza ion.
Technical epo , Singapo e, 2005.
15

J. W. Tukey. Explo a i e da a analysis. Addison-Wesley, 1977.
H. Wang, P. S. Yu, and J. Han. Mining Concep -D i ing Da a S eams. In Da a Mining and
Knowledge Disco e y Handbook, pages 789–802. Sp inge US, Bos on, MA, July 2010.
D. Whi ley, K. Ma hias, S. Rana, and J. Dzube a. E alua ing e olu iona y algo i hms. A i icial
In elligence, 85(1–2):245–276, 1996.
C. J. Willmo and K. Ma suu a. Ad an ages o he mean absolu e e o (MAE) o e he oo mean
squa e e o (RMSE) in assessing a e age model pe o mance. Clim Res, pages 1–4, Dec. 2005.
M. Zae e e , T. Ba z-Beiels ein, B. Naujoks, T. Wagne , and M. Emme ich. A Case S udy on Mul i-
C i e ia Op imiza ion o an E en De ec ion So wa e unde Limi ed Budge s. In R. C. Pu shouse
e al., edi o s, E olu iona y Mul i-C i e ion Op imiza ion 7 h In e na ional Con e ence, EMO,
pages 756–770, Heidelbe g, 2013. Sp inge .
X. Zhu. Knowledge Disco e y and Da a Mining: Challenges and Reali ies: Challenges and Reali ies.
Gale i ual e e ence lib a y. In o ma ion Science Re e ence, 2007.
16
!
!
CIplus
Band 2/2016
EASD - Expe imen al Algo i hmics o
S eaming Da a
P o . D . Thomas Ba z-Beiels ein
Ins i u ü In o ma ik
Fakul ä ü In o ma ik und Ingenieu wissenscha en
Technische Hochschule Köln
Mä z 2016
Die Ve an wo ung ü den Inhal diese
Ve ö en lichung lieg beim Au o .