scieee Science in your language
[en] (orig)

Online Adaptable Learning Rates for the Game Connect-4

Author: Bagheri, Samineh,Thill, Markus,Koch, Patrick,Konen, Wolfgang
Year: 2014
Source: https://cos.bibl.th-koeln.de/files/53/TCL_C4_paper_COS_version.pdf
Sch i en eihe CIplus, Band 3/2014
Thomas Ba z-Beiels ein, Wol gang Konen, Ho s S enzel, Bo is Naujoks
Online Adap able Lea ning Ra es
o he Game Connec -4
Samineh Baghe i, Ma kus Thill, Pa ick Koch and Wol gang Konen
1
Online Adap able Lea ning Ra es o he Game
Connec -4
Samineh Baghe i, Ma kus Thill, Pa ick Koch and Wol gang Konen
Abs ac
Lea ning boa d games by sel -play has a long adi ion in compu a ional in elligence o games. Based
on Tesau o’s seminal success wi h TD-Gammon in 1994, many success ul agen s use empo al di e ence
lea ning oday. Bu in o de o be success ul wi h empo al di e ence lea ning on game asks, o en a ca e ul
selec ion o ea u es and a la ge numbe o aining games is necessa y. E en o boa d games o mode a e
complexi y like Connec -4, we ound in p e ious wo k ha a e y ich ini ial ea u e se and se e al millions
o game plays a e equi ed. In his wo k we in es iga e di e en app oaches o online-adap able lea ning
a es like Inc emen al Del a Ba Del a (IDBD) o Tempo al Cohe ence Lea ning (TCL) whe he hey ha e he
po en ial o speed up lea ning o such a complex ask. We p opose a new a ian o TCL wi h geome ic
s ep size changes. We compa e hose algo i hms wi h se e al o he s a e-o - he-a lea ning a e adap a ion
algo i hms and pe o m a case s udy on he sensi i i y wi h espec o hei me a pa ame e s. We show ha
in his se o lea ning algo i hms hose wi h geome ic s ep size changes ou pe o m hose o he algo i hms
wi h cons an s ep size changes. Algo i hms wi h nonlinea ou pu unc ions a e sligh ly be e han linea
ones. Algo i hms wi h geome ic s ep size changes lea n as e by a ac o o 4 as compa ed o p e iously
published esul s on he ask Connec -4.
Index Te ms
Machine lea ning, boa d games, sel -play, ein o cemen lea ning, empo al di e ence lea ning (TDL),
empo al cohe ence, lea ning a es, sel -adap a ion, online adap a ion, n- uple sys ems.
F
c
IEEE. This is he p ep in e sion o an a icle appea ing in IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE
AND AI IN GAMES, 2015. Pe sonal use o his ma e ial is pe mi ed. Fo o he use, see he IEEE Copy igh Policy.
1 INTRODUCTION
HUMANS a e e y e icien and as in lea ning complica ed asks in new domains. As
Su on [1] has poin ed ou , in o ma ion heo e ic a gumen s sugges ha he lea ning
p og ess is o en oo apid o be jus i ied by he da a o he new domain alone. The key o
success is, as i is commonly belie ed, ha humans b ing a se o co ec biases om o he
domains in o he new domain. These biases allow o lea n as e , because biases di ec hem
o p e e ce ain hypo heses o e o he s o ce ain ea u es o e o he s. I machine lea ning
algo i hms shall achie e a pe o mance simila o humans, hey p obably need o acqui e biases
as well. Whe e do hese biases come om?
Al eady in 1992 Su on [1] sugges ed he Inc emen al Del a Ba Del a (IDBD) algo i hm whe e
he biases a e unde s ood as lea ning a es which can be di e en o di e en ainable pa am-
e e s o any unde lying algo i hm. The key idea o IDBD is ha hese lea ning a es a e no
p ede ined by he algo i hm designe bu hey a e adap ed as hype pa ame e s o he lea ning
p ocess hemsel es. Su on [1] expec ed such adap able lea ning a es o be especially use ul o
•All au ho s a e wi h he Facul y o Enginee ing and Compu e Science, Cologne Uni e si y o Applied Sciences, Cologne, Ge many.
E-mail: wol gang.konen@ h-koeln.de
Manusc ip ecei ed Ma ch 2nd, 2014; e ised e sion accep ed Oc 29 h, 2014.
2
nons a iona y asks o sequences o ela ed asks and he demons a ed good esul s on a small
syn he ic nons a iona y lea ning p oblem ( ea u ing 20 weigh s).
A simila idea was sugges ed as Tempo al Cohe ence Lea ning (TCL) by Beal and Smi h in
1999 [2], [3], who di ec ly modi ied he Tempo al Di e ence Lea ning (TDL) algo i hm o ake
in o accoun sel - uning lea ning a es. Se e al o he online lea ning a e adap a ion algo i hms
ha e been p oposed o e he yea s (see Sec. 2) and i is he pu pose o his wo k – as a case
s udy in machine lea ning – o make a comp ehensi e compa ison on a la ge game-lea ning
benchma k ask.
Boa d games and lea ning how o play hem cons i u e challenging asks in machine lea ning
(ML) and a i icial in elligence (AI). They a e challenging, because he ac ion (a mo e) has o be
aken now, bu he payo (win o loss) occu s la e , a he end o he game. The mos ad anced
me hod in ML o add ess his p oblem is well-known TDL, a me hod based on ein o cemen
lea ning (RL). TDL was applied as ea ly as 1957 by Samuel [4] o checke s and gained mo e
popula i y h ough Su on’s wo k in 1984 and 1988 [5], [6]. I became amous in 1994 wi h
Tesau o’s TD-Gammon [7], which lea ned o play backgammon a expe le el.
Game lea ning wi h TDL cons i u es a nons a iona y lea ning ask: Fo mos boa d posi ions
he a ge alue will change du ing lea ning (see Sec. 3.3 o de ails). Thus, RL and TDL o
boa d games a e expec ed o bene i om bias lea ning app oaches such as he abo e-men ioned
IDBD [1].
In his pape we conside he game Connec -4 as a speci ic example, which is a sol ed game
(see Sec. 3.1) bu no algo i hm o lea n i by sel -play was known un il ecen ly. In a p e ious
wo k [8] we we e able o show ha i is possible o lea n his game wi h TDL and n- uples [9]
jus by sel -play. We achie ed a playing s eng h close o he pe ec playing agen . Howe e ,
he lea ning p ocess equi ed se e al millions o sel -play games, hus being a o human
pe o mance. In his wo k we in es iga e whe he new s a egies can achie e he same (o e en
be e ) s eng h wi hin ewe aining games. Two solu ion pa hs can be conside ed he e:
a) Tuning, i.e. inding as e -lea ning solu ions by op imizing he hype pa ame e s o he lea n-
ing algo i hm. The d awback o his solu ion is ha he uning esul s usually apply only
o he speci ic lea ning ask, i.e. Connec -4 wi h his TDL algo i hm. Each new algo i hm o
each new game equi es a comple ely new uning.
b) Sel - uning lea ning algo i hms like IDBD o o he s (see Sec. 3), which au oma ically adap
ce ain hype pa ame e s (he e: he lea ning a es) as lea ning p og esses. This a oids manual
in e en ion and he e is hope ha he same sel - uning scheme can be applied o o he games
as well and ha he insigh s gained a e ans e able o o he lea ning asks as well.
We ollow mainly he second pa h in his pape and y o answe he ollowing esea ch
ques ions:
1) Can online lea ning a e adap a ion be success ully applied o p oblems wi h millions o
weigh s?
2) How obus a e online lea ning a e adap a ion algo i hms wi h espec o hei me a-pa ame e s?
3) Can online lea ning a e adap a ion algo i hms speed up lea ning as compa ed o TDL?
The es o his pape is o ganized as ollows: Sec. 2 b ie ly e iews ela ed wo k. Sec. 3 in o-
duces he me hods TDL, TCL, and IDBD. I p esen s wi h TCL-EXP ou new syn hesis be ween
TCL and IDBD. Sec. 4 desc ibes ou expe imen al se up and ou esul s on he game Connec -4.
Sec. 5 discusses he pa ame e sensi i i y o he di e en me hods and Sec. 6 summa izes ou
main indings.
2 RELATED WORK
Se e al online lea ning a e adap a ion schemes ha e been p oposed o e he yea s: IDBD [1]
om Su on is an ex ension o Jacobs’ [10] ea lie DBD algo i hm: i allows immedia e upda es
3
ins ead o ba ch upda es. Su on [11] p oposed some ex ensions o IDBD wi h he algo i hms
K1 and K2 and compa es hem wi h he Leas Mean Squa e (LMS) algo i hm and Kalman
il e ing. Almeida [12] discussed ano he me hod o s ep-size adap a ion and applied i o he
minimiza ion o nonlinea unc ions.
Sch audolph [13] and, mo e ecen ly, Li [14] ex ended IDBD- a ian s o he nonlinea case:
Sch audolph’s ELK1 ex ends K1 and pe o ms an upda e wi h he ins an aneous Hessian ma ix
o a sui able chosen loss unc ion. The algo i hm’s complexi y is O(n2)whe e nis he numbe o
pa ame e s o lea n. Li’s KIMEL algo i hm ans o ms he nonlinea inpu da a wi h a ke nel in o
a high-dimensional bu linea ea u e space whe e linea IDBD is applied. Su on and Koop [15],
[16] de eloped ano he nice nonlinea ex ension IDBD-nl o he o iginal IDBD algo i hm using
he logis ic sigmoid unc ion. I was applied o lea ning he game Go.
Recen ly, Mahmood and Su on [17], [18] p oposed wi h Au os ep an ex ension o IDBD which
has much less dependence on he me a-s ep-size pa ame e han IDBD. In he same yea , Dabney
and Ba o [19] de eloped ano he adap i e s ep-size me hod o empo al di e ence lea ning,
which is based on he es ima ion o uppe and lowe bounds. Again, bo h me hods a e p oposed
only o linea unc ion app oxima ion. Schaul e al. [20] p opose a me hod o uning- ee lea ning
a e adap a ion especially well-sui ed o la ge neu al ne wo ks. RPROP [21] is ano he ea lie
e sion o a neu al ne wo k algo i hm wi h indi idual lea ning a es o each weigh .
Fo he game Connec -4 – al hough weakly sol ed in 1988 by Allen [22] and Allis [23] (Sec. 3.1)
– only a he ew a emp s o lea n i (whe he by sel -play o by lea ning om eache s) a e
ound in he li e a u e: Schneide e al. [24] ied o lea n Connec -4 wi h a neu al ne wo k,
using an a chi e o sa ed games as eaching in o ma ion. S enma k [25] compa ed TDL o
Connec -4 agains a knowledge-based app oach om Au oma ic P og amming and ound TDL
o be sligh ly be e . Cu an e al. [26] used a cul u al lea ning app oach o e ol ing popula ions
o neu al ne wo ks in sel -play. All he abo e wo ks ga e no clea answe on he ue playing
s eng h o he agen s, since hey did no compa e hei agen s wi h a pe ec -playing Minimax
agen .
Lucas showed ha he game o O hello, ha ing a somewha g ea e complexi y han Connec -
4, could be lea ned by TDL wi hin a ew housand aining games wi h he n- uple-app oach [9].
K awiec e al. [27] applied he n- uple-app oach in (Co-) E olu iona y TDL and ou pe o med
TDL in he O hello League [28]. This s i ed ou in e es in he n- uple-app oach and we applied i
success ully o Connec -4 in ou p e ious wo k [8]. The esul s agains a pe ec -playing Minimax
agen a e summa ized in Sec. 4.3.
3 METHODS
3.1 Connec -4
The game Connec -4 is a wo-playe game played on a boa d wi h 7 e ical slo s con aining 6
posi ions each (Fig. 1). Playe Yellow (1s ) and playe Red (2nd) place one piece pe u n in one
o he a ailable slo s and each piece alls down unde he o ce o g a i y in o he lowes ee
posi ion o he slo . Each playe a emp s o c ea e ho izon al, e ical o diagonal piece-lines o
leng h ou . Fig. 1 shows an example posi ion whe e Yellow would win i Red does no block
his by placing a ed piece in o he igh slo .
Connec -4 has a medium s a e space complexi y o 4.5·1012 boa d posi ions [29]. A game is
said o be weakly sol ed i i s game heo e ic alue and a s a egy o pe ec play om he
ini ial alue is known. Connec -4 was weakly sol ed in 1988 independen ly by Allen [22] and by
Allis [23]: Yellow ( he 1s playe ) wins, i she places he i s piece in he middle slo . T omp [30]
sol ed he game Connec -4 s ongly, i. e. o e e y in e media e posi ion.
4
Fig. 1. Connec -4 boa d wi h an example 4- uple ’3-2-1-1’ (see Sec. 3.2)
We de eloped a Minimax agen combined wi h a p e-calcula ed 8-ply1o 12-ply-opening
da abase [8], [31] and i inds he pe ec nex mo e (o mo es, i se e al mo es a e equally well)
o each boa d posi ion wi hin ac ions o a second. This agen will be used in ou expe imen s,
bu only as e e ence o e alua ion agen . I is by no means used o any aining pu pose.
3.2 N- uples and LUTs
N- uples in Gene al: N- uple sys ems we e i s in oduced in 1959 by Bledsoe and B own-
ing [32] o cha ac e ecogni ion. Recen ly, Lucas [9] p oposed employing he n- uple a chi ec u e
o game-playing pu poses. The n- uple app oach wo ks in a way simila o he ke nel ick
used in suppo ec o machines (SVM): The low dimensional boa d is p ojec ed in o a high
dimensional sample space by he n- uple indexing p ocess [9].
N- uples in Connec -4: An n- uple Tνis a sequence [aν0, . . . , aνn−1]o ndi e en boa d cells
aνj ∈ {0,...,41}. Fo example, he ou whi e digi s in Fig. 1 ma k he cells o a 4- uple. Each
cell is in one o Ppossible s a es z[aνj]∈ {0, . . . , P −1}( he alue o he digi s in ou example),
depending on he cell’s occupa ion. Following ou ea lie wo k [8] we use he (P=4)-encoding
0=emp y and no eachable, 1=Yellow, 2=Red, 3=emp y and eachable.
By eachable we mean an emp y cell ha can be occupied in he nex mo e. The eason behind
his is ha i makes a di e ence whe he e. g. h ee yellow pieces in a ow ha e a eachable
emp y cell adjacen o hem (a di ec h ea o Red) o a non- eachable cell (indi ec h ea ).
An n- uple o leng h n hus has Pnpossible s a es kν∈ {0, . . . , Pn−1}wi h
kν=
n−1
X
j=0
s [aνj]Pj.(1)
He e, s [aνj]is he s a e o boa d cell aνj a ime . Fig. 1 shows an example boa d posi ion wi h
a 4- uple in he numbe ed cells. The s a e o he 4- uple is k= 3 ·40+ 2 ·41+ 1 ·42+ 1 ·43= 91.
The numbe kν o he s a e o Tνcan be used as an index in o an associa ed look-up able LUTν,
whose pa ame e s a e called wν, [kν]. Equi alen ly and mo e simila o s anda d neu al ne wo ks,
1. A ply is a single mo e o one playe , Yellow o Red

5
we can pu all weigh s in o one big weigh ec o ~w wi h elemen s wi, , index i=kνm+ν, and
de ine a bina y inpu ec o
xi[s ] = (1i i=kνm+ν
0else (2)
o he kνde ined in Eq. (1). Fo a gi en boa d posi ion s a ime , he ou pu o he n- uple
ne wo k wi h mn- uples Tν, ν = 1, . . . , m can be calcula ed as:
(~w , s ) =
m·Pn
X
i=1
wi, xi[s ](3)
Vec o ~w is a unc ion o ime since i will be modi ied by he TD lea ning algo i hm (Sec. 3.3).
The ec o ~w combines all weigh s om all LUTs. I can be a a he big ec o , con aining, e. g.,
9 million weigh s in ou s anda d Connec -4 implemen a ion wi h 70 8- uples. I u ns ou ha
only 600 000 - 700 000 o hese weigh s a e ac i e du ing lea ning ( he o he s ep esen non-
ealizable s a es [8]), bu his is s ill much bigge han he 20 o 25 weigh s used by Su on [1]
o Beal and Smi h [2].
Fu he de ails on n- uples in Connec -4 (symme y, n- uple c ea ion) a e ound in [8]. Fo
all expe imen s epo ed in he ollowing, we use he same n- uple ne wo k which was once
c ea ed by a andom-walk selec ion p ocess.
3.3 TDL
The goal o a game-playing agen is o p edic he ideal alue unc ion, which e u ns 1.0 i he
boa d posi ion is a win o Yellow, and -1.0 i i is a win o Red. The TDL algo i hm aims
a lea ning his alue unc ion. I does so by se ing up an (ini ially inexpe ienced) agen , who
plays a sequence o games agains i sel . I lea ns om he en i onmen , which gi es a ewa d
R∈ {−1.0,0.0,1.0} o {Red-win, D aw, Yellow-win }only a he end o each game . The main
ing edien is he empo al di e ence (TD) e o signal acco ding o Su on [6]
δ =R(s +1) + γV (~w , s +1)−V(~w , s ).(4)
He e, V(~w , s ) = σ( (~w , s )) is he agen ’s cu en app oxima ion o he alue unc ion2on he
basis o Eq. (3) and a nonlinea sigmoid unc ion σ(we choose σ= anh). The s a e s +1 is he
bes successo o s a e s . In ou expe imen s we use γ= 1 h oughou .
The weigh s a e ained wi h he usual δ- ule
wi, +1 =wi, +αδ ∇wiV(~w , s )(5)
=wi, +α1−V2(~w , s )δ xi,
which aims a making he cu en p edic ion ma ch he successo p edic ion mo e closely. The
comple e TDL algo i hm o games, including he ac ion selec ion mechanism, is shown in
Tab. 1.3This is a con ol algo i hm, since i includes ac ion selec ion ( hus policy changes) and
lea ning o he (co espondingly changing) alue unc ion.4Mo e de ails on TDL in games can
be ound in [33] and e e ences he ein.5
2. V(~w , s +1)is se o 0 i s +1 is a inal s a e.
3. Why is he weigh upda e in S ep 17 only done in case o a non-explo a i e mo e (q≥)? – I an explo a o y ac ion
( andom mo e) is aken, i is e y likely ha he inal ewa d does no e lec he ue po en ial o he cu en s a e be o e he
andom mo e. E.g. i he cu en s a e is a win o Yellow, a andom mo e is likely o u n i in o a si ua ion whe e Yellow
looses.
4. No e ha he lea ning o he alue o s a e-ac ion pai s, Q(s , a ), as i occu s in Q-lea ning o Sa sa con ol algo i hms,
can be eplaced he e by he alue o he a e s a e, V(~w , s +1), since in a boa d game all s a e-ac ion pai s esul ing in he
same a e s a e ha e he same alue.
5. We no e in passing ha we use TDL wi hou eligibili y aces in his pape . While his a icle was unde e iew, we
published new esea ch combining Connec -4, TDL, and TCL-EXP wi h eligibili y aces [34].
6
TABLE 1
TDL algo i hm o boa d games. P io o he i s game, he weigh ec o ~w0is ini ialized wi h
andom alues. Then he ollowing algo i hm is execu ed o each comple e boa d game. Du ing
sel -play, he playe pswi ches be ween +1 (Yellow) and −1(Red).
1: Se he ini ial s a e s0(usually he emp y boa d) and p= 1.
2: Use pa ially ained weigh s ~w0 om p e ious games.
3: unc ion TDLTRAIN(s0,~w0)
4: ~e0← ∇~w V(~w0, ~x(s0))
5: o  ←0;s /∈SF inal ; ← + 1, p ←(−p)do
6: Vold ←V(~w , ~x(s ))
7: D aw andom q∈[0,1]
8: i (q < ) hen .Explo a i e mo e
9: Randomly selec s +1
10: else .G eedy mo e
11: Selec a e -s a e s +1, which maximizes
12: p·R(s +1),i s +1 ∈SF inal
V(~w , ~x(s +1)),o he wise
13: end i
14: Vnew ←V(~w , ~x(s +1))
15: δ ←R(s +1) + γVnew −Vold .TD e o -signal
16: i (q≥o s +1 ∈SF inal) hen
17: ~w +1 ←~w +αδ ~e .Weigh -upda e
18: end i
19: ~e +1 ← ∇~w V(~w +1, ~x(s +1)) .Recompu e (new ~w!)
20: end o
21: end unc ion
Eq. (4) shows clea ly why TDL imposes a nons a iona y lea ning ask: V(~w , s +1), he alue
o he bes successo o V(~w , s ), is he a ge o V(~w , s ). Bu o mos boa d posi ions (wi h
he excep ion o e minal s a es) his a ge again has o be lea n , so i will p obably be a
he w ong alue ini ially. La e in aining his a ge alue changes o he co ec one and he
weigh s need o be eadjus ed.
3.4 TCL
The TCL algo i hm de eloped by Beal and Smi h [2], [3] is an ex ension o TDL. I has an
adjus able lea ning a e αi o e e y weigh wiand a global cons an αini . The e ec i e lea ning
a e o each weigh is αini αi. The main idea is p e y simple: Fo each weigh wo coun e s
Niand Aiaccumula e he sum o weigh changes and sum o absolu e weigh changes. I all
weigh changes ha e he same sign, hen |Ni|/Ai= 1 and he lea ning a e s ays a i s uppe
bound. I weigh changes ha e al e na ing signs, hen |Ni|/Ai→0 o → ∞, and he lea ning
a e will be la gely educed o his weigh .
Tab. 2 shows he comple e TCL algo i hm. This algo i hm is embedded in he game-playing
TDL amewo k o Sec. 3.3. S eps 2.-6. a e execu ed in each pass h ough he main TDL-loop
ins ead o S ep 17 in Tab. 1. Wi h he help o αi∈[0,1] he indi idual lea ning a e o a weigh
can be made smalle han he global αini . This is he s anda d TCL o mula ion named TCL[ ]
in he ollowing. We es ed also a a ian TCL[δ]whe e we omi he nonlinea sigmoid unc ion
o i, in Eq. (7), i. e. we use
i, =(δ i xi= 1
0i xi= 0.(8)
In bo h cases, he ope a ional o de is impo an , as al eady s a ed in [2]: i s weigh upda e
7
TABLE 2
TCL in pseudo code
1: Ini ialize: Ni=Ai= 0 o all weigh indices iand se he cons an pa ame e αini .
2: o (e e y weigh index i)do
3: Se αi=(1i Ai= 0
g(|Ni|/Ai)i Ai>0
4: Replace TD-weigh upda e Eq. (5) wi h
wi, +1 =wi, +αini αi i, (6)
whe e i, =δ ∇wiV(~w , s )is he ecommended weigh change.
5: Upda e he coun e s:
Ni←Ni+ i, (7)
Ai←Ai+| i, |
6: end o
S anda d TCL uses o he ans e unc ion gin S ep 3 he iden i y unc ion, while TCL-EXP
uses g(x)acco ding o Eq. (9), see also Fig. 2.
TABLE 3
IDBD in pseudo code
1: Ini ialize: hi= 0, βi=βini o all weigh indices iand se θ, he me a-lea ning a e.
2: o (e e y weigh index i)do
3: Se inpu xiacco ding o Eq. (2)
4: Se βi←βi+θδ xihi
5: Se αi←eβi
6: Replace TD-weigh upda e Eq. (5) wi h
wi, +1 =wi, +αiδ xi
7: Se hi←hi[1 −αix2
i]++αiδ xi
8: end o
wi h [d]+=di d > 0, 0 else.
using he p e ious alues o Ni,Ai, hen coun e upda e.6
3.5 IDBD
Su on’s IDBD algo i hm [1] in oduces – simila ly o TCL – an indi idual lea ning a e αi=eβi
o e e y weigh wi. The algo i hm is shown in Tab. 3. I is again embedded in he game-playing
TDL amewo k o Sec. 3.3. S eps 2.-8. a e execu ed in each pass h ough he main TDL-loop
ins ead o S ep 17 in Tab. 1.
6. The o iginal TCL implemen a ion [2] has he u he op ion ha he i, may be accumula ed o e a sequence o s eps be o e
a eal weigh upda e (Eq. (6)) akes place. This balances a adeo be ween as e changing lea ning a es (sho sequences) and
accumula ion o s a is ic e idence (long sequences). We es ed his sequence op ion o ou Connec -4- ask as well, bu ound
he immedia e upda e (as in he o mulas abo e, 1-s ep sequence) o be be e .
8
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00
x
g(x)
PieceLinea
S anda d TCL
TCL−EXP
Fig. 2. Op ions o he ans e unc ion g(x). S anda d TCL uses he iden i y unc ion. TCL-EXP
(Eq. (9)) is shown o β= 2.7and PieceLinea is a piecewise linea unc ion wi h he same
endpoin s as TCL-EXP and same slope a x= 1.
The main idea behind his algo i hm is simple: The memo y e m hiis a decaying ace o
pas weigh changes. The inc emen in βiis p opo ional o he p oduc o he cu en weigh
change δ xiand pas weigh changes hi. Accumula ed inc emen s co espond o he co ela ion
be ween cu en and ecen weigh changes [1]. In case o a posi i e co ela ion, he lea ning
a e can be la ge , while nega i e co ela ion indica es o e shoo ing weigh inc emen s whe e
he lea ning a e should be educed.
No e ha IDBD in i s cu en o m is only o mula ed o linea ne wo ks. Consequen ly we
omi he nonlinea sigmoid unc ion σin Eq. (4) and in he weigh upda e ule (S ep 6 o Tab. 3)
o IDBD.
3.6 Modi ied TCL-EXP
Ou modi ica ion TCL-EXP b ings a new elemen om IDBD in o he s anda d TCL algo i hm.
Ins ead o he iden i y ans e unc ion g(x) = xwe use an exponen ial unc ion
g(x) = eβ(x−1) (9)
in S ep 3 o TCL (see Tab. 2 and Fig. 2). As poin ed ou by Su on [1], an exponen ial unc ion
has he nice p ope y ha a ixed s ep-size change in xwill change g(x)by a ixed ac ion o
i s cu en alue, i.e. i allows o geome ic s eps. ”This is desi able because some αimus become
e y small while o he s emain la ge; no ixed s ep-size would wo k well o all he αi.” [1].
3.7 O he algo i hms
The ollowing o he algo i hms we e used in ou case s udy o compa ison: Koop’s IDBD-
nl [15], [16], Su on’s K1 [11], Sch audolph’s ELK1 [13], Mahmood’s Au os ep [17], and Dabney’s
α-bounds [19]. All algo i hms a e implemen ed exac ly as desc ibed in hei o iginal pape s (i
no s a ed o he wise) and hen applied o ou Connec -4 ask.
Some algo i hms (IDBD, K1, Au os ep) a e only de i ed o linea uni s. We omi in his
case he nonlinea sigmoid unc ion σ= anh o he TDL alue unc ion (Sec. 3.3). Fo all o he
algo i hms we use his sigmoid unc ion. An excep ion is Koop’s IDBD-nl [15], [16] being de i ed
o he logis ic sigmoid unc ion, which we use in his case ins ead o anh.
15
5.2 TCL algo i hms
When we s a ed he TCL expe imen s, we expec ed ha he indi idual and adap able lea ning
a es o each weigh would ee he use om uning hese a es. In pa icula , we expec ed
ha a oo la ge αini would easily be co ec ed by he indi idual αidu ing he ini ial aining
phase, as poin ed ou by Beal & Smi h [2]. A e ha , he aining should p oceed equi alen ly
o a se ing whe e a smalle αini had been chosen di ec ly. In con as o his expec a ion, we
obse e o TCL[ ] and all αini 6= 0.04 a apid dec ease in pe o mance (Fig. 4). Why is his he
case?
An inspec ion o he n- uple ne wo k showed ha o αini 6= 0.04, sho ly a e he ini ial
phase, he esponses o mos boa d posi ions a e d i en in o sa u a ion o he sigmoid unc ion.
I αini is oo la ge, subsequen lea ning s eps all wi h high p obabili y in egions o he sigmoid
unc ion wi h di e en slopes. Then, al e na ing weigh changes will no cancel because he
g adien o he sigmoid unc ion di e s. Thus Ni/Aiwill no app oach 0 (al hough i should o
a oo la ge αini ). I inally he ne wo k esponse is in sa u a ion, lea ning i ually s ops (due
o 1− anh2( )≈0in he g adien ).
A oo small αini <0.04 leads o a dec easing TCL-pe o mance as well. This is pa ly un-
de s andable since TCL can only educe bu no inc ease he global αini . I we s a wi h he
op imal αini = 0.04, we each wi h TCL[ ] a good asymp o ic success a e and ime o lea n,
bu he lea ning speed is slowe han o Tuned TDL (Fig. 6 and 7). The eason o his is no
ye ully unde s ood. I migh be due o he lea ning a e change p opo ional o Ni/Aibeing
subop imal.
This is suppo ed by ou inding ha TCL-EXP wi h ’geome ic’ s ep sizes is a as e lea ning
agen han Tuned TDL (Fig. 6). The eason o his was explained in Sec. 3.4. To es he impo ance
o geome ic s ep sizes, we did ano he expe imen : We eplaced he TCL-EXP ans e unc ion
by a piecewise linea unc ion (PL) as shown in Fig. 2 ha ing he same endpoin s and same
slope a x= 1. The esul s o PL in Fig. 7 and Tab. 5 a e wo se han TCL-EXP. The e o e, i is
no he slope a x= 1 bu he geome ic s ep size which is impo an o success.
Gi en he abo e esul s and discussion, i is qui e na u al ha we would choose on a new
p oblem among all in es iga ed online adap i e s ep-size algo i hms in i s place he algo i hm
IDBD-nl, howe e , closely ollowed by TCL-EXP as a nea ly equi alen choice. IDBD-nl is an
algo i hm wi h s iking simplici y since he choice o nonlinea i y (logis ic unc ion) leads o a
simple g adien .
6 CONCLUSION
We in es iga ed a complex lea ning ask o he Connec -4 boa d game. I is ema kable ha
an agen can lea n his complex game solely om sel -play. Ou p e ious wo k [8] has shown
ha a la ge numbe o ea u es (in ou case: mo e han hal a million o n- uple s a es) is a
necessa y p e equisi e o lea n such a ask. The agen in [8] was slow in lea ning, so we s udied
se e al al e na i es, namely uning and online adap a ion o lea ning a es (IDBD and TCL).
I was hoped ha IDBD and TCL wi h hei sel -adap a ion capabili ies could make uning
unnecessa y.
Ini ially, we could imp o e ou p e ious esul [8] by uning he explo a ion a e (see Sec. 4.3).
This inc eased he lea ning speed by a ac o o 2.
Ou esea ch ques ion 1) om Sec. 1 was answe ed posi i ely: We demons a ed ha he
lea ning algo i hms IDBD, TCL, and o he s wo k o such big lea ning asks wi h mo e han
hal a million o weigh s. (To he bes o ou knowledge, his has no been es ed be o e.)
Resea ch ques ion 2) has a nega i e answe : We ound ha IDBD and TCL do no ee he use
om pa ame e uning. I he me a-pa ame e s (αini and βin case o TCL; βini and θin case
o IDBD) a e no se o app op ia e alues, esul s a e wo se han wi h uned TDL. The o he

16
algo i hms show a simila beha io . E en Au os ep, which is qui e uning- ee wi h espec o
he me a s ep size pa ame e θ, equi es uning o i s pa ame e αini .
Resea ch ques ion 3) ecei ed a posi i e answe again: Online lea ning a e adap a ion schemes
wi h geome ic s ep size and indi idual lea ning a es a e signi ican ly as e han pu e TDL.
Ou modi ied a ian TCL-EXP is among he as es algo i hms, bu no signi ican ly as e han
o he algo i hms in he same g oup. The as es algo i hms (IDBD-nl and TCL-EXP) inco po a e
nonlinea lea ning uni s. So we a e led o he conclusion ha geome ic s ep sizes plus nonlinea
uni s a e impo an ing edien s o as lea ning. Compa ed o he ea lie published, non-
uned TDL agen [8] (1 565 000 games o each 80% success), hese algo i hms exhibi a la ge
imp o emen by a ac o o 4.
Thus, he ou e o sel -adap ing agen s o game lea ning looks p omising: Some o hese
agen s lea n as e han hose wi h ixed lea ning a es. A he same ime, i is ou imp ession
ha mo e esea ch is needed o be e unde s and he in e ac ion be ween sel -adap i e lea ning
a es and nonlinea ou pu unc ions.
In ou ongoing esea ch we plan o in es iga e he ole o di e en nonlinea ou pu unc ions
o online lea ning a e adapa a ion algo i hms and whe he i is possible o augmen hose
lea ning algo i hms wi h eligibili y aces. Bo h aspec s could inc ease he obus ness and speed
o lea ning.
ACKNOWLEDGMENTS
The au ho s would like o hank he anonymous e iewe s o hei help ul commen s and o
poin ing ou impo an adap i e lea ning e e ences. They would like o hank Bo is Naujoks
o c i ical e iew o an ea lie d a . This wo k has been pa ially suppo ed by he Bundesmi-
nis e ium ¨
u Bildung und Fo schung (BMBF) unde he g an SOMA (2009 - 2013).
REFERENCES
[1] R. S. Su on, “Adap ing bias by g adien descen : An inc emen al e sion o del a-ba -del a.” in AAAI Con . on A i icial
In elligence, W. R. Swa ou , Ed. AAAI P ess, 1992, pp. 171–176.
[2] D. F. Beal and M. C. Smi h, “Tempo al cohe ence and p edic ion decay in TD lea ning,” in In . Join Con . on A i icial
In elligence (IJCAI), T. Dean, Ed. Mo gan Kau mann, 1999, pp. 564–569.
[3] ——, “Tempo al di e ence lea ning o heu is ic sea ch and game playing,” In o ma ion Sciences, ol. 122, no. 1, pp. 3–21,
2000.
[4] A. L. Samuel, “Some s udies in machine lea ning using he game o checke s,” IBM jou nal o Resea ch and De elopmen ,
ol. 3, no. 3, pp. 210–229, 1959.
[5] R. S. Su on, “Tempo al c edi assignmen in ein o cemen lea ning,” Ph.D. disse a ion, Uni e si y o Massachuse s,
Amhe s , MA, 1984.
[6] ——, “Lea ning o p edic by he me hod o empo al di e ences,” Machine Lea ning, ol. 3, pp. 9–44, 1988.
[7] G. Tesau o, “TD-gammon, a sel - eaching backgammon p og am, achie es mas e -le el play,” Neu al Compu a ion, ol. 6,
pp. 215–219, 1994.
[8] M. Thill, P. Koch, and W. Konen, “Rein o cemen lea ning wi h n- uples on he game Connec -4,” in PPSN’2012: Pa allel
P oblem Sol ing F om Na u e, C. Coello Coello, Ed. Sp inge , Heidlbe g, 2012, pp. 184–194.
[9] S. M. Lucas, “Lea ning o play O hello wi h n- uple sys ems,” Aus alian Jou nal o In elligen In o ma ion P ocessing, ol. 4,
pp. 1–20, 2008.
[10] R. A. Jacobs, “Inc eased a es o con e gence h ough lea ning a e adap a ion,” Neu al ne wo ks, ol. 1, no. 4, pp. 295–307,
1988.
[11] R. S. Su on, “Gain adap a ion bea s leas squa es,” in P oc. Yale Wo kshop on Adap i e and Lea ning Sys ems, 1992, pp.
161–166.
[12] L. Almeida, T. Langlois, and J. D. Ama al, “On-line s ep size adap a ion,” INESC, 1000, Lisboa, Po ugal, Tech. Rep. RT07/97,
1997.
[13] N. N. Sch audolph, “Online lea ning wi h adap i e local s ep sizes,” in Neu al Ne s WIRN Vie i-99. Sp inge , 1999, pp.
151–156.
[14] C. Li, Y. Ye, Q. Miao, and H.-L. Shen, “KIMEL: A ke nel inc emen al me alea ning algo i hm,” Signal P ocessing, ol. 93,
no. 6, pp. 1586 – 1596, 2013.
[15] R. S. Su on, A. Koop, and D. Sil e , “On he ole o acking in s a iona y en i onmen s,” in 24 h In . Con . on Machine
Lea ning. ACM, 2007, pp. 871–878.
17
[16] A. Koop, In es iga ing Expe ience: Tempo al Cohe ence and Empi ical Knowledge Rep esen a ion, se . Canadian heses. Uni . o
Albe a (Canada), 2008.
[17] A. Mahmood, “Au oma ic s ep-size adap a ion in inc emen al supe ised lea ning,” Ph.D. disse a ion, Uni e si y o
Albe a, 2010.
[18] A. R. Mahmood, R. S. Su on, T. Deg is, and P. M. Pila ski, “Tuning- ee s ep-size adap a ion,” in IEEE In e na ionalCon e ence
on Acous ics, Speech and Signal P ocessing (ICASSP). IEEE, 2012, pp. 2121–2124.
[19] W. Dabney and A. G. Ba o, “Adap i e s ep-size o online empo al di e ence lea ning.” in 26 h AAAI Con e ence on
A i icial In elligence, 2012.
[20] T. Schaul, S. Zhang, and Y. LeCun, “No Mo e Pesky Lea ning Ra es,” in In e na ional Con e ence on Machine Lea ning (ICML),
2013.
[21] M. Riedmille and H. B aun, “A di ec adap i e me hod o as e backp opaga ion lea ning: The RPROP algo i hm,” in
IEEE In . Con . on Neu al Ne wo ks, 1993, pp. 586–591.
[22] J. D. Allen, “A no e on he compu e solu ion o connec - ou ,” in Heu is ic P og amming in A i icial In elligence 1: The Fi s
Compu e Olympiad, D. Le y and D. Beal, Eds. Ellis Ho wood, London, 1989, pp. 134–135.
[23] V. Allis, “A knowledge-based app oach o Connec -4. The game is sol ed: Whi e wins,” Mas e ’s hesis, Depa men o
Ma hema ics and Compu e Science, V ije Uni e si ei , Ams e dam, The Ne he lands, 1988.
[24] M. Schneide and J. Ga cia Rosa, “Neu al Connec -4 - a connec ionis app oach,” in B azilian Symposium on Neu al Ne wo ks,
2002, pp. 236–241.
[25] M. S enma k, “Syn hesizing boa d e alua ion unc ions o Connec -4 using machine lea ning echniques,” Mas e ’s hesis,
Øs old Uni e si y College, No way, 2005.
[26] D. Cu an and C. O’Rio dan, “E ol ing Connec -4 playing neu al ne wo ks using cul u al lea ning,” Na ional Uni e si y
o I eland, Galway, NUIG-IT-081204, 2004. [Online]. A ailable: h p://www2.i .nuigalway.ie/publica ions/TR/abs ac s/
NUIG-IT-081204.pd
[27] K. K awiec and M. G. Szube , “Lea ning n- uple ne wo ks o O hello by coe olu iona y g adien sea ch,” in GECCO’2011:
Gene ic and E olu iona y Compu a ion Con e ence. ACM, New Yo k, 2011, pp. 355–362.
[28] S. Lucas and T. P. Runa sson, “O hello compe i ion,” h p://algo al.essex.ac.uk:8080/o hello/League.jsp, 2011, las access
18.02.2014.
[29] S. Edelkamp and P. Kissmann, “Symbolic classi ica ion o gene al wo-playe games,” h p://www. zi.de/∼edelkamp/
publica ions/con /ki/EdelkampK08-1.pd , Technische Uni e si ¨
a Do mund, Tech. Rep., 2008, las access 18.02.2014.
[30] J. T omp, “Sol ing Connec -4 on medium boa d sizes,” ICGA Jou nal, ol. 31, no. 2, pp. 110–112, 2008.
[31] M. Thill, “Using n- uple sys ems wi h TD lea ning o s a egic boa d games (in Ge man),” Cologne Uni e si y o Applied
Science, CIOP Repo 01/12, 2012.
[32] W. W. Bledsoe and I. B owning, “Pa e n ecogni ion and eading by machine,” in Eas e n Join Compu e Con e ence, New
Yo k, 1959, pp. 225–232.
[33] W. Konen and T. Ba z-Beiels ein, “Rein o cemen lea ning: Insigh s om in e es ing ailu es in pa ame e selec ion,” in
PPSN’2008: Pa allel P oblem Sol ing F om Na u e, G. Rudolph, Ed. Sp inge , Be lin, 2008, pp. 478–487.
[34] M. Thill, S. Baghe i, P. Koch, and W. Konen, “Tempo al di e ence lea ning wi h eligibili y aces o he game connec
ou ,” in IEEE In e na ional Con e ence on Compu a ional In elligence and Games. IEEE, 2014, pp. 586–591.
Samineh Baghe i ecei ed he B.Sc. deg ee in elec ical enginee ing specialized in elec onics om Shahid
Behesh i Uni e si y, Teh an, I an in 2011. She is cu en ly mas e s uden and esea ch assis an a Cologne
Uni e si y o Applied Sciences in Indus ial Au oma ion and IT.
He esea ch in e es s a e machine lea ning, e olu iona y compu a ion and sel adap i e lea ning s a egies
o boa d games o op imiza ion asks.
Ma kus Thill s udied Compu e Enginee ing a Cologne Uni e si y o Applied Sciences (CUAS) and ecei ed
his B.Sc. deg ee in 2012. Cu en ly he is a mas e s uden in Indus ial Au oma ion and IT and esea ch
assis an a CUAS.
Ma kus al eady wo ked on a i ical in elligence in boa d games o his bachelo hesis, in pa icula Tempo al
Di e ence Lea ning and ee-based algo i hms. Wi h his hesis he won he Opi z Inno a ion P ice 2013.
18
Pa ick Koch ecei ed his Diploma in Compu e Science om he Uni e si y o Pade bo n, Ge many in 2008
and his Ph.D. om he Uni e si y o Leiden, The Ne he lands in 2013. Since 2009 Pa ick is a esea ch
associa e a Cologne Uni e si y o Applied Sciences, whe e he wo ks a he esea ch cen e o Compu a ional
In elligence, Op imiza ion & Da a Mining (h p://www.gociop.de). Pa ick’s esea ch conce ns e icien uning in
machine lea ning and he applica ion o e olu iona y algo i hms o single and mul i-c i e ia p oblems. In he
Compu a ional In elligence in Games (CIG) a ea, he is especially in e es ed in inding imp o ed agen s o
boa d-games wi h Tempo al Di e ence Lea ning and n- uples.
Wol gang Konen ecei ed his Diploma in physics and his Ph.D. deg ee in heo e ical physics om he
Uni e si y o Mainz, Ge many, in 1987 and 1990, esp. He wo ked in he a ea o neu oin o ma ics and compu e
ision a Ruh -Uni e si y Bochum, Ge many, and in se e al companies. He is cu en ly P o esso o Compu e
Science and Ma hema ics a Cologne Uni e si y o Applied Sciences, Ge many, ounding membe o he
Resea ch Cen e s Compu a ional In elligence, Op imiza ion & Da a Mining (h p://www.gociop.de) and CIplus
(h p://ciplus- esea ch.de). His esea ch in e es s include, bu a e no limi ed o: unde s anding how humans
and compu e lea n, he dynamics o lea ning sys ems, compu a ional in elligence in games, game physics,
da a mining, and compu e ision.
Kon ak /Imp essum
Diese Ve ¨offen lichungen e scheinen im Rahmen de Sch i en eihe ”CIplus”.
Alle Ve ¨offen lichungen diese Reihe k¨onnen un e
www.ciplus- esea ch.de
ode un e
h p://opus.bsz-bw.de/ hk/index.php?la=de
abge u en we den.
K¨oln, Dezembe 2014
He ausgebe / Edi o ship
P o . D . Thomas Ba z-Beiels ein,
P o . D . Wol gang Konen,
P o . D . Ho s S enzel,
P o . D . Bo is Naujoks
Ins i u e o Compu e Science,
Facul y o Compu e Science and Enginee ing Science,
Cologne Uni e si y o Applied Sciences,
S einm¨ulle allee 1,
51643 Gumme sbach
u l: www.ciplus- esea ch.de
Sch i lei ung und Ansp echpa ne / Con ac edi o ’s office
P o . D . Thomas Ba z-Beiels ein,
Ins i u e o Compu e Science,
Facul y o Compu e Science and Enginee ing Science,
Cologne Uni e si y o Applied Sciences,
S einm¨ulle allee 1, 51643 Gumme sbach
phone: +49 2261 8196 6391
u l: h p://www.gm. h-koeln.de/˜ba z/
eMail: homas.ba z-beiels ein@ h-koeln.de
ISSN (online) 2194-2870