Schriftenreihe CIplus, Volume 5/2013
Editors: T. Bartz-Beielstein, W. Konen, H. Stenzel, B. Naujoks
UniFIeD
Univariate Frequency-based
Imputation for Time Series Data
Martina Friese, Jörg Stork, Ricardo Ramos Guerra,
Thomas Bartz-Beielstein, Soham Thaker, Oliver Flasch,
Martin Zaefferer
Faculty of Computer and Engineering Sciences
Cologne University of Applied Sciences, 51643 Gummersbach, Germany
firstname.lastname@fh-koeln.de
Abstract. This paper introduces UniFIeD, a new data preprocessing
method for time series. UniFIeD can cope with large intervals of missing
data. In addition, a scalable test function generator is presented, which
allows the simulation of time series with different gap sizes. An
experimental study demonstrates that (i) UniFIeD performs significantly
better than simple imputation methods and (ii) UniFIeD is able to
handle situations where advanced imputation methods fail. The results
are independent of the underlying error measurements.
1 Introduction
Missing data is a well-known problem in nearly every real-world time series.
Sensors may fail, data might get lost during transfer, or measurements are simply
missing. Although this problem is well known, many standard time-series prediction
and analysis methods rely on complete data. During the last decades, several
ways have been developed to tackle missing data. Several of these methods are
applicable to small gap sizes only.
A common suggestion, which is available in several software packages, is the
imputation of mean values. This approach can destroy inherent data structures
and may worsen the statistical modeling, resulting in large prediction errors [4].
Imputing missing elements from regression or analysis of variance (ANOVA) is
usually better. More advanced methods for data imputation, also from computational
intelligence, were developed in the context of univariate (linear, spline, and
nearest neighbor interpolation), multivariate (regression-based imputation,
nearest neighbor, self-organizing map, multi-layer perceptron), and hybrid methods
combining the previous ones, using simulated missing data patterns. A small study,
which discussed the applicability of these methods to air quality data sets, was
performed by Junninen et al. [7]. Single imputation methods, i.e., filling in precisely
one value for each missing one, can be distinguished from multiple imputation
methods. The latter generate multiple simulated values for each missing value.
Our study was initiated by a real-world task and focuses on the applicability
of univariate methods: we received time-series data with large gaps from one of
our industrial partners. These data were to be used for time-series predictions,
where the methods of choice require complete data. Since simple imputation
methods failed completely in our setting and the advanced methods did not show
the expected performance, we decided to develop a new imputation method.
The new method, entitled univariate frequency-based imputation for time
series data (UniFIeD), outperformed the existing methods. Its success in field
settings right from the start motivated a first analysis and gave reason for
performing an experimental study. Focusing on methods for large intervals of
missing data and on seasonal data, especially time series data, we consider the
following scientific goals:
(G-1) Which method generates the smallest imputation error?
(G-2) What is the influence of data pre-processing methods on the performance
of forecast methods?
Based on these goals, we are interested in developing an automated and robust
procedure for data pre-processing that can be implemented easily.
To generate scientifically significant results, we proceed as follows. First,
we generate test instances based on the real-world data. Then we run the imputation
methods. Next, the imputation errors are determined using different error measures.
Finally, the prediction errors are reported and compared across the different error
measures. As a future step to increase the plausibility of our findings, we
are planning to perform predictions with different state-of-the-art methods.
This paper is structured as follows: First, the real-world data is described in
Sec. 2. Pre-processing methods and the univariate frequency-based imputation
for time series data (UniFIeD) are introduced in Sec. 3. The prediction models,
which were used in the final comparison, are described in Sec. 4. Error measures,
which play a crucial role in our study, are presented in Sec. 5, and the two different
experiments are introduced in Sec. 6. Results are presented in Sec. 7, and our findings
are discussed in Sec. 8. The paper concludes with a summary and an outlook in
Sec. 9. An R version of the program code used in this study is freely available
for download and will be compiled as an R package [8].
2 Data
2.1 Missing Data
We consider three types of data: $y^*$ denotes the underlying (latent and complete)
data, $y$ is the observed data, and $\hat{y}$ is the imputed data.
To evaluate the performance of an imputation method, criteria have to be
defined. The imputation performance depends (at least) on two characteristics: (a)
the structure of the missing data pattern and (b) the amount of missing data. If the
probability of missing data does not depend upon the observed or the unobserved
data, then these data are called missing completely at random (MCAR) [9]. There
is no predictive power in the observed values $y$ if the missing value process is
MCAR. In general, the structure of missing data in our projects is MCAR. The
random simulation of missing data patterns is described in Sec. 2.3.
2.2 The Datasets
Our study is based on real-world data. The experiments are based on energy
consumption time-series data supplied by GreenPocket GmbH. The data was
recorded by two independent smart metering devices installed at a local commercial
customer. Some data points are missing due to measurement or transmission
issues, which is a common situation in real-world settings. The data provided by
GreenPocket GmbH is a series of timestamp and meter reading pairs taken
quarter-hourly. Timestamps are given in an ISO 8601 derived date/time format, meter
readings are given in kilowatt hours (kWh). The energy consumption time series
data was recorded over the same time interval by the two independent smart metering
devices, resulting in the two data sets series_meter1 and series_meter2. Both
energy consumption time series datasets contain a total of 8548 entries, starting
at 2010-12-06 23:15:00 and ending at 2011-03-06 00:00:00, which makes a
total time interval of more than 12 weeks. The complete time series data set
series_meter1 is shown in Figure 1, whereas Figure 2 shows only the last two
weeks of the same data set.
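The stated figures can be cross-checked with a few lines of date arithmetic. The following Python sketch is ours and not part of the R code used in the study; it assumes a perfectly regular quarter-hourly grid with both endpoints included, and the helper name `count_quarter_hours` is hypothetical.

```python
from datetime import datetime, timedelta

def count_quarter_hours(start: datetime, end: datetime) -> int:
    """Count 15-minute timestamps from start to end, both endpoints included."""
    step = timedelta(minutes=15)
    return (end - start) // step + 1

start = datetime(2010, 12, 6, 23, 15)
end = datetime(2011, 3, 6, 0, 0)

n_entries = count_quarter_hours(start, end)
n_weeks = (end - start) / timedelta(weeks=1)

print(n_entries)          # 8548, matching the number of entries stated above
print(round(n_weeks, 2))  # 12.72, i.e. more than 12 weeks
```

Under the regular-grid assumption, the count agrees with the 8548 entries reported for both data sets.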
[Figure: line plot; x-axis: date (Dec 15 to Mar 01); y-axis: energy consumption (kWh), 0 to 8]
Fig. 1. Plot of all observations given in series_meter1
Visual inspection of the data reveals daily periods, while weekly periods are
detectable but not as clearly defined. Time intervals with missing data can
also be seen clearly. A closer look at the missing data values contained in both
time series reveals that there are altogether twelve gaps. Most of them are small
gaps of length one, but there are also larger gaps of up to 385 missing observations.
2.3 Test Instance Generation
[Figure: line plot; x-axis: date (Feb 21 to Mar 05); y-axis: energy consumption (kWh), 0 to 6]
Fig. 2. Excerpt of the observations from series_meter1 showing only the last two weeks
of the data set
Since missing data already occurs in the real-world test data, we are able to
determine a realistic distribution of the gap sizes and frequencies. This includes
the determination of two parameters: (a) the distribution of the gap sizes, gapsize,
and (b) the total amount of missing data, i.e., gappercentage. The gapsize
distribution can be estimated from real-world data as follows. First, histogram
plots were used for visual inspection of the gap sizes. Visual inspection suggests
an exponential distribution of gapsize, with smaller gaps appearing more
frequently than larger gaps. The parameter $\lambda$ of the density of the exponential
distribution, $f(x) = \lambda \exp(-\lambda x)$, is estimated from the real-world data. Now we
are able to generate random gaps with reasonable sizes that correspond to the
real-world data. In a second step, we determine the percentage of
missing data, gappercentage. Here, we consider values between 5 and 30 percent. The
generation of a single test instance then works as outlined in Algorithm 1.
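To make the generation procedure more concrete, the following Python sketch follows the steps of Algorithm 1 under several simplifying assumptions of ours: the input series is complete (no NA values), $\lambda$ is estimated by maximum likelihood as the reciprocal of the mean gap size, drawn gap sizes are rounded to integers of at least one, and the last gap is truncated so that exactly the requested number of points is removed. All function names and the example gap sizes are hypothetical; this is not the R implementation used in the study.

```python
import random

def estimate_rate(gap_sizes):
    """Maximum-likelihood estimate of lambda: reciprocal of the mean gap size."""
    return len(gap_sizes) / sum(gap_sizes)

def draw_gap_sizes(n_drop, lam, rng):
    """Draw integer gap sizes (each >= 1) until exactly n_drop points are covered."""
    gaps, remaining = [], n_drop
    while remaining > 0:
        size = max(1, round(rng.expovariate(lam)))
        size = min(size, remaining)  # assumption: the last gap is truncated to fit
        gaps.append(size)
        remaining -= size
    return gaps

def random_partition(total, parts, rng):
    """Split `total` into `parts` positive integers chosen at random."""
    cuts = sorted(rng.sample(range(1, total), parts - 1))
    bounds = [0] + cuts + [total]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]

def make_test_instance(ts, gap_percentage, lam, seed=0):
    """Return a copy of the complete series ts with random gaps set to None."""
    rng = random.Random(seed)
    n_drop = int(len(ts) * gap_percentage / 100)
    gaps = draw_gap_sizes(n_drop, lam, rng)
    # One observed segment before each gap plus one after the last gap, each of
    # length >= 1, so the series neither starts nor ends with a gap and two gaps
    # are always separated by at least one data point.
    segments = random_partition(len(ts) - n_drop, len(gaps) + 1, rng)
    out, pos = list(ts), 0
    for segment, gap in zip(segments, gaps):
        pos += segment
        out[pos:pos + gap] = [None] * gap
        pos += gap
    return out

observed_gaps = [1, 1, 2, 1, 5, 1, 12, 3]     # made-up gap sizes, not the paper's data
lam = estimate_rate(observed_gaps)
series = [float(i % 96) for i in range(960)]  # synthetic complete series
instance = make_test_instance(series, gap_percentage=10, lam=lam, seed=1)
print(sum(v is None for v in instance))       # 96 points removed (10 percent of 960)
```

Because the complete series is kept alongside the generated instance, imputation errors can later be computed against the known true values.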
3 Pre-processing Methods
3.1 Existing Methods
Existing imputation methods can be partitioned into two groups: the first group
includes basic methods [7] that do not use complex computations to determine
the imputed values. A second group uses sophisticated techniques for imputation.
We will present the basic methods first.
Basic Imputation Methods. Mean imputation is an often used method because
of its simplicity. The missing data is replaced by the mean of the
non-missing observed data,
$$\hat{y} = \bar{y}, \tag{1}$$
where $\hat{y}$ is the imputed value, $y$ are the observed values, and $\bar{y}$ denotes the
sample mean.

input : Time Series ts
input : gappercentage
countData = number of observations != NA in ts;
countDrop = countData × gappercentage / 100;
repeat
    draw random gapsize from exponential distribution;
until countDrop reached;
countGaps = number of gaps drawn;
remainingData = countData - countDrop;
generate random partition of countGaps+1 parts summing up to the size of
    remainingData; (assuring that the generated time series neither starts nor
    ends with a gap and has at least one data point between two gaps.)
ts* = ts;
for i = 1 → countGaps do
    position = sum(partitions[1:i]) + sum(gapsize[1:i-1]);
    remove data from ts* from position to position + gapsize[i];
end
output: ts*
Algorithm 1: Generation of a test instance

Linear interpolation uses the start and end point of a gap to construct a straight line.
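$$\hat{y} = y_1 + k \times (x - x_1) \quad \text{with} \quad k = \frac{y_2 - y_1}{x_2 - x_1} \tag{2}$$
$y_1$ and $y_2$ are the start and end values of the gap, while $x_1$ and $x_2$ are the start
and end time values. $x$ is the current time value. Nearest neighbor imputation uses the
start and end points of a gap as estimates for the imputation.
$$\hat{y} = \begin{cases} y_1 & \text{if } x < x_1 + \frac{x_2 - x_1}{2} \\ y_2 & \text{if } x > x_1 + \frac{x_2 - x_1}{2} \end{cases} \tag{3}$$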
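The three basic methods of Eqs. (1) to (3) can be sketched in a few lines of Python. The sketch assumes an equally spaced series, so that list indices serve as time values, and gaps that lie strictly inside the series. Eq. (3) leaves the exact midpoint undefined; assigning it to $y_2$ is our choice. The function names are hypothetical.

```python
def gap_runs(y):
    """Yield (start, end) index pairs of maximal runs of None in y (end exclusive)."""
    i, n = 0, len(y)
    while i < n:
        if y[i] is None:
            j = i
            while j < n and y[j] is None:
                j += 1
            yield i, j
            i = j
        else:
            i += 1

def impute_mean(y):
    """Eq. (1): replace every missing value by the mean of the observed values."""
    observed = [v for v in y if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in y]

def impute_linear(y):
    """Eq. (2): straight line between the observed neighbours of each gap."""
    out = list(y)
    for a, b in gap_runs(y):
        x1, x2 = a - 1, b                    # indices of the observed neighbours
        k = (y[x2] - y[x1]) / (x2 - x1)
        for x in range(a, b):
            out[x] = y[x1] + k * (x - x1)
    return out

def impute_nearest(y):
    """Eq. (3): copy the nearer observed neighbour (midpoint goes to y2)."""
    out = list(y)
    for a, b in gap_runs(y):
        x1, x2 = a - 1, b
        midpoint = x1 + (x2 - x1) / 2
        for x in range(a, b):
            out[x] = y[x1] if x < midpoint else y[x2]
    return out

y = [1.0, None, None, None, 5.0, 3.0]
print(impute_mean(y))     # [1.0, 3.0, 3.0, 3.0, 5.0, 3.0]
print(impute_linear(y))   # [1.0, 2.0, 3.0, 4.0, 5.0, 3.0]
print(impute_nearest(y))  # [1.0, 1.0, 5.0, 5.0, 5.0, 3.0]
```

The small example illustrates why these methods struggle with large gaps: none of them uses any information about the seasonal pattern of the series.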
Advanced Imputation Methods. The second group of imputation methods
uses advanced techniques. The mice (Multivariate Imputation by Chained
Equations) package specializes in multiple imputation methods [1]. Its methods
work best on multivariate data, and none of the methods applicable to the smart
metering data was found to deliver good results. The zoo package provides the method
na.StructTS, a generic function for filling NA values using a seasonal Kalman
filter [10]. Finally, the Amelia II package can be mentioned here [3]. It was not
able to find suitable values for the time interval from the smart metering data
set. These packages can obtain very good results if multivariate time-series data
are available.
3.2 Univariate Frequency-based Imputation
The UniFIeD method proposed in this work relies on an automated estimation
of time-series frequencies.
Estimating time-series frequencies automatically. For the estimation of
the frequencies contained in the data, the autocorrelation function (acf) is used.
The algorithm works as described in Algorithm 2.
input : Time Series ts
determine acf values via autocorrelation function on ts;
remember autocorrelation values from acf;
remember related lags from acf;
repeat
    reduce autocorrelation values to their peaks;
    remember related lags;
    determine frequency from lags, via the frequency of the distances from one
        peak-lag to another;
until no new frequency found;
output: all frequencies found
Algorithm 2: Estimation of an underlying frequency using the autocorrelation
function.
For better illustration, Figure 3 shows the autocorrelation function of
series_meter1. Both daily and weekly periods are clearly recognizable. The set
of peaks that are considered as indicators of the data's underlying frequency
is marked with small circles. In the second iteration of the algorithm, this set
of peaks is reduced to a smaller set, which is marked with filled circles. After
two iterations, the algorithm stops since there is no lower frequency in the data.
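One possible reading of Algorithm 2 is sketched below in Python. The details are our assumptions rather than the authors' implementation: a peak is a strict local maximum among its neighbors, the frequency of an iteration is the most common distance between successive peak lags, and the loop stops when fewer than two peaks remain or a frequency repeats. Following the paper's usage, "frequency" here means the period length in samples. The test signal is synthetic.

```python
import math
from collections import Counter

def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[i] * dev[i + lag] for i in range(n - lag)) / c0
            for lag in range(max_lag + 1)]

def peak_lags(values, lags):
    """Keep the lags whose value is a strict local maximum among its neighbours."""
    return [lags[i] for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]

def estimate_frequencies(x, max_lag):
    """Repeatedly reduce the ACF to its peaks and read off the peak spacing."""
    r = acf(x, max_lag)
    lags, values, found = list(range(max_lag + 1)), r, []
    while True:
        lags = peak_lags(values, lags)
        values = [r[lag] for lag in lags]
        if len(lags) < 2:
            break                          # not enough peaks to measure a distance
        distances = Counter(b - a for a, b in zip(lags, lags[1:]))
        frequency = distances.most_common(1)[0][0]
        if frequency in found:
            break                          # no new frequency found
        found.append(frequency)
    return found

# Synthetic hourly signal with a "daily" (24) and a weaker "weekly" (168) period.
n = 24 * 7 * 8
x = [math.sin(2 * math.pi * t / 24) + 0.5 * math.sin(2 * math.pi * t / 168)
     for t in range(n)]
periods = estimate_frequencies(x, max_lag=600)
print(periods)  # [24, 168]
```

On this signal the first iteration finds the short period from all ACF peaks, and the second iteration finds the longer period from the peaks of those peaks, mirroring the two iterations described for Figure 3.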
How UniFIeD works. UniFIeD is valuable for univariate time series data that
presents a pattern or a seasonal effect. The algorithm takes advantage of this
seasonal effect to find patterns correlated with the missing window and uses them
to develop the imputation. Using the frequency estimated with Algorithm 2, we
proceed to impute the missing values of the time series data set.
UniFIeD was developed to consider not only single missing points $\hat{y}$ but also full
missing time windows of any size. The basic idea is to iteratively look for the
next missing point $\hat{y}$ of the time series at the time moment $t_m$ and, using the
frequency $f$ and the number of similar windows $k$ to search in, gather the
$2k$ non-missing points $y$ that correlate with the time moment $t_m$ of the found
missing point and form a vector $s$. Once this vector $s$ is formed, and depending
on the user's request, the appropriate method to calculate the value to impute
into the time series is selected. The available options to determine the
new value are the mean, median, maximum, or minimum of the values of vector $s$.
For the purposes of illustration, Figure 4 shows an example of a missing value $\hat{y}$
from a randomly selected missing window, the selected correlated values
in the vector $s$, and the different methods used to impute the new value.
The UniFIeD algorithm works as described in Algorithm 3.
[Figure: plot; x-axis: Lag (0 to 6000); y-axis: ACF (−0.4 to 1.0)]
Fig. 3. Plot of the autocorrelation function output on meter_series1. It shows the
autocorrelation value for each lag. Values considered as peaks indicating the frequency
are marked with small circles.
Note that the final step of Algorithm 3 is to check whether any NA values were
imputed into the time series $\widehat{ts}$, i.e., whether some points are still missing.
If this is the case, the algorithm looks for the left neighbor and imputes its value
into $\hat{y}$.
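The imputation step can be illustrated with the following Python sketch of Algorithm 3. It is a simplified reading under our own assumptions: the series is equally spaced, the frequency $f$ is given as a period length in samples, correlated values are taken from the original (not yet imputed) series, and windows that fall outside the series or are themselves missing are skipped. The names `unified_impute` and `METHODS` are hypothetical and do not refer to the authors' R code.

```python
from statistics import mean, median

METHODS = {"mean": mean, "median": median, "max": max, "min": min}

def unified_impute(ts, freq, k, method="mean"):
    """Fill None entries of ts from values k periods before and after them."""
    aggregate = METHODS[method]
    windows = [w for w in range(-k, k + 1) if w != 0]   # -k..-1, 1..k
    out = list(ts)
    for m, value in enumerate(ts):
        if value is not None:
            continue
        # Vector s: observed values at t_m - k*f, ..., t_m - f, t_m + f, ..., t_m + k*f
        s = [ts[m + w * freq] for w in windows
             if 0 <= m + w * freq < len(ts) and ts[m + w * freq] is not None]
        if s:
            out[m] = aggregate(s)
    # Final check: points that are still missing take the left neighbour's value.
    for m in range(1, len(out)):
        if out[m] is None:
            out[m] = out[m - 1]
    return out

# Synthetic series with period 4 and a slow upward drift of one unit per cycle.
base = [10, 20, 30, 40]
truth = [base[i % 4] + i // 4 for i in range(20)]
gappy = list(truth)
gappy[8:12] = [None] * 4                    # remove one full cycle

filled_mean = unified_impute(gappy, freq=4, k=1, method="mean")
filled_max = unified_impute(gappy, freq=4, k=1, method="max")
print(filled_mean[8:12])  # [12, 22, 32, 42], equal to the removed values
print(filled_max[8:12])   # [13, 23, 33, 43]
```

In this example the whole missing cycle is recovered from the cycles before and after it, which is exactly the situation in which linear or nearest neighbor interpolation would flatten the seasonal pattern.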
4 Prediction Models
4.1 Holt-Winters
The Holt-Winters algorithm [2] is used to forecast time series with trends and
seasonal effects. In R, the Holt-Winters algorithm is implemented in the stats package.
input : Univariate Time Series ts
input : Frequency f of ts
input : Constant k of similar windows to look for correlated values
input : Method M to calculate the value to impute: mean, median, max or min
output: Time Series t̂s
initialization; define a vector of window numbers to look for, using k
win ← {−k, −k+1, ..., −1, 1, ..., k−1, k}
for i ← initial point of data to end of data do
    if found y point is NA then
        look for correlated values of y and form vector s by using the win vector
        and freq. as follows:
        s ← {y(t_{m−k·f}), y(t_{m−(k−1)·f}), ..., y(t_{m−f}),
             y(t_{m+f}), ..., y(t_{m+(k−1)·f}), y(t_{m+k·f})}
        switch method M chosen, calculate do
            mean: ŷ ← mean(s)
            median: ŷ ← median(s)
            max: ŷ ← max(s)
            min: ŷ ← min(s)
        endsw
    end
end
check if some NA values were imputed into time series t̂s, and if there are, fix by
using the left neighbor value.
Algorithm 3: Imputation Algorithm.
The method works with three exponential smoothing equations:
$$\text{Level:} \quad \ell_t = \alpha (y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \tag{4}$$
$$\text{Trend:} \quad b_t = \beta^* (\ell_t - \ell_{t-1}) + (1 - \beta^*) b_{t-1} \tag{5}$$
$$\text{Seasonal:} \quad s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma) s_{t-m} \tag{6}$$
The forecasting equation is:
$$\text{Forecast:} \quad \hat{y}_{t+h|t} = \ell_t + h b_t + s_{t-m+h_m^+} \tag{7}$$
with $h_m^+ = \lfloor (h-1) \bmod m \rfloor + 1$, where $m$ denotes the length of the seasonal
period. $\alpha$, $\beta^*$, and $\gamma$ are so-called smoothing parameters.
The level equation is a weighted average of the seasonally adjusted observation
$(y_t - s_{t-m})$ and the non-seasonal forecast $(\ell_{t-1} + b_{t-1})$ for time $t$. The trend
equation shows that $b_t$ is a weighted average of the estimated trend at time $t$,
based on $\ell_t - \ell_{t-1}$, and $b_{t-1}$, the previous estimate of the trend. The seasonal
equation is a weighted average of the current seasonal index, $(y_t - \ell_{t-1} - b_{t-1})$, and
the seasonal index of the same season last term, i.e., $m$ time periods ago. Initial
values for the level, trend, and seasonal indices are calculated using a simple
decomposition and regression on the first two seasons of the given time series.
The smoothing parameters are, if not given manually, fitted by the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, using the sum of squared errors
of prediction (SSE) as objective.
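The recursions in Eqs. (4) to (7) can be traced with a short Python sketch. It is an illustration under our own assumptions, not the implementation in the R stats package: the initial level, trend, and seasonal indices are passed in directly instead of being derived by decomposition and regression, the smoothing parameters are fixed by hand rather than fitted with BFGS, and all names are hypothetical.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, level0, trend0, season0):
    """Run Eqs. (4)-(6) over y; season0 holds the m initial seasonal indices."""
    level, trend = level0, trend0
    season = list(season0)              # season[t % m] plays the role of s_{t-m}
    for t, obs in enumerate(y):
        s_old = season[t % m]
        new_level = alpha * (obs - s_old) + (1 - alpha) * (level + trend)    # Eq. (4)
        new_trend = beta * (new_level - level) + (1 - beta) * trend          # Eq. (5)
        season[t % m] = gamma * (obs - level - trend) + (1 - gamma) * s_old  # Eq. (6)
        level, trend = new_level, new_trend
    return level, trend, season

def forecast(level, trend, season, n_obs, m, h):
    """Eq. (7): h-step-ahead forecast after n_obs observations."""
    return level + h * trend + season[(n_obs + h - 1) % m]

# Synthetic data that follow the model exactly: level 10, slope 2, period 4.
m = 4
season0 = [1.0, -1.0, 2.0, -2.0]
y = [10.0 + 2.0 * (t + 1) + season0[t % m] for t in range(12)]

level, trend, season = holt_winters_additive(
    y, m, alpha=0.3, beta=0.1, gamma=0.2,
    level0=10.0, trend0=2.0, season0=season0)
y_hat = forecast(level, trend, season, n_obs=len(y), m=m, h=3)
print(round(level, 6), round(trend, 6), round(y_hat, 6))  # 34.0 2.0 42.0
```

Because the synthetic data match the initial state exactly, the level and trend stay on the true line for any choice of smoothing parameters, which makes the example easy to verify by hand.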
9 Summary and Outlook
In this work, we introduced a scalable test problem generator for simulating
missing data. We generated problem instances with small and large gaps.
Furthermore, we introduced a new method called UniFIeD for handling missing data,
which uses an automated frequency estimator based on autocorrelation. We
demonstrated that this method outperforms simple imputation methods. The results
are independent of the underlying error measurements.
Future plans involve the following steps:
– extended experiments with additional data
– discovering the limits of the method
– determining the maximum gap sizes
– statistically validating the method
– investigating whether it can be used as a hybrid method
– experiments on multivariate time series
We are also planning to provide UniFIeD as an R package on CRAN.
10 Acknowledgments
This work has been kindly supported by the German Federal Ministry of
Education and Research (BMBF) under the grants MCIOP (FKZ 17N0311) and
CIMO (FKZ 17002X11).
References
1. S. Buuren and K. Groothuis-Oudshoorn. mice: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45(3), 2011.
2. C. Holt. Forecasting seasonals and trends by exponentially weighted moving
averages. International Journal of Forecasting, 20(1):5–10, 2004.
3. J. Honaker, G. King, and M. Blackwell. Amelia II: A program for missing data.
Journal of Statistical Software, 45(7):1–47, 12 2011.
4. N. Horton and K. Kleinman. Much ado about nothing. The American Statistician,
61(1):79–90, 2007.
5. R. Hyndman and Y. Khandakar. Automatic time series for forecasting: The forecast
package for R. Technical report, Monash University, Department of Econometrics
and Business Statistics, 2007.
6. R. J. Hyndman and Y. Khandakar. Automatic Time Series Forecasting: The
forecast Package for R. Journal of Statistical Software, 27(3):1–22, 7 2008.
7. H. Junninen, H. Niska, K. Tuppurainen, J. Ruuskanen, and M. Kolehmainen.
Methods for imputation of missing values in air quality data sets. Atmospheric
Environment, 38(18):2895–2907, 2004.
8. R Development Core Team. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN
3-900051-07-0.
9. D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
10. A. Zeileis and G. Grothendieck. zoo: S3 infrastructure for regular and irregular
time series. arXiv preprint math/0505527, 2005.
Contact / Imprint
These publications appear as part of the "CIplus" publication series (Schriftenreihe
CIplus). All publications in this series are available at
www.ciplus-research.de
or at
http://opus.bsz-bw.de/fhk/index.php?la=de
Cologne, January 2012
Editorship
Prof. Dr. Thomas Bartz-Beielstein,
Prof. Dr. Wolfgang Konen,
Prof. Dr. Horst Stenzel,
Dr. Boris Naujoks
Institute of Computer Science,
Faculty of Computer Science and Engineering Science,
Cologne University of Applied Sciences,
Steinmüllerallee 1,
51643 Gummersbach
url: www.ciplus-research.de
Managing editor and contact / Contact editor's office
Prof. Dr. Thomas Bartz-Beielstein,
Institute of Computer Science,
Faculty of Computer Science and Engineering Science,
Cologne University of Applied Sciences,
Steinmüllerallee 1, 51643 Gummersbach
phone: +49 2261 8196 6391
url: http://www.gm.fh-koeln.de/~bartz/
eMail: thomas.bartz-beielstein@fh-koeln.de
ISSN (online) 2194-2870