Online Adaptable Learning Rates for the Game Connect-4

Author: Bagheri, Samineh,Thill, Markus,Koch, Patrick,Konen, Wolfgang

Year: 2014

Source: https://cos.bibl.th-koeln.de/files/53/TCL_C4_paper_COS_version.pdf

Sch i en eihe CIplus, Band 3/2014
Thomas Ba z-Beiels ein, Wol gang Konen, Ho s S enzel, Bo is Naujoks
Online Adap able Lea ning Ra es
o he Game Connec -4
Samineh Baghe i, Ma kus Thill, Pa ick Koch and Wol gang Konen
1
Online Adap able Lea ning Ra es o he Game
Connec -4
Samineh Baghe i, Ma kus Thill, Pa ick Koch and Wol gang Konen
Abs ac
Lea ning boa d games by sel -play has a long adi ion in compu a ional in elligence o games. Based
on Tesau o’s seminal success wi h TD-Gammon in 1994, many success ul agen s use empo al di e ence
lea ning oday. Bu in o de o be success ul wi h empo al di e ence lea ning on game asks, o en a ca e ul
selec ion o ea u es and a la ge numbe o aining games is necessa y. E en o boa d games o mode a e
complexi y like Connec -4, we ound in p e ious wo k ha a e y ich ini ial ea u e se and se e al millions
o game plays a e equi ed. In his wo k we in es iga e di e en app oaches o online-adap able lea ning
a es like Inc emen al Del a Ba Del a (IDBD) o Tempo al Cohe ence Lea ning (TCL) whe he hey ha e he
po en ial o speed up lea ning o such a complex ask. We p opose a new a ian o TCL wi h geome ic
s ep size changes. We compa e hose algo i hms wi h se e al o he s a e-o - he-a lea ning a e adap a ion
algo i hms and pe o m a case s udy on he sensi i i y wi h espec o hei me a pa ame e s. We show ha
in his se o lea ning algo i hms hose wi h geome ic s ep size changes ou pe o m hose o he algo i hms
wi h cons an s ep size changes. Algo i hms wi h nonlinea ou pu unc ions a e sligh ly be e han linea
ones. Algo i hms wi h geome ic s ep size changes lea n as e by a ac o o 4 as compa ed o p e iously
published esul s on he ask Connec -4.
Index Te ms
Machine lea ning, boa d games, sel -play, ein o cemen lea ning, empo al di e ence lea ning (TDL),
empo al cohe ence, lea ning a es, sel -adap a ion, online adap a ion, n- uple sys ems.
F
c
IEEE. This is he p ep in e sion o an a icle appea ing in IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE
AND AI IN GAMES, 2015. Pe sonal use o his ma e ial is pe mi ed. Fo o he use, see he IEEE Copy igh Policy.
1 INTRODUCTION
HUMANS a e e y e icien and as in lea ning complica ed asks in new domains. As
Su on [1] has poin ed ou , in o ma ion heo e ic a gumen s sugges ha he lea ning
p og ess is o en oo apid o be jus i ied by he da a o he new domain alone. The key o
success is, as i is commonly belie ed, ha humans b ing a se o co ec biases om o he
domains in o he new domain. These biases allow o lea n as e , because biases di ec hem
o p e e ce ain hypo heses o e o he s o ce ain ea u es o e o he s. I machine lea ning
algo i hms shall achie e a pe o mance simila o humans, hey p obably need o acqui e biases
as well. Whe e do hese biases come om?
Al eady in 1992 Su on [1] sugges ed he Inc emen al Del a Ba Del a (IDBD) algo i hm whe e
he biases a e unde s ood as lea ning a es which can be di e en o di e en ainable pa am-
e e s o any unde lying algo i hm. The key idea o IDBD is ha hese lea ning a es a e no
p ede ined by he algo i hm designe bu hey a e adap ed as hype pa ame e s o he lea ning
p ocess hemsel es. Su on [1] expec ed such adap able lea ning a es o be especially use ul o
•All au ho s a e wi h he Facul y o Enginee ing and Compu e Science, Cologne Uni e si y o Applied Sciences, Cologne, Ge many.
E-mail: wol gang.konen@ h-koeln.de
Manusc ip ecei ed Ma ch 2nd, 2014; e ised e sion accep ed Oc 29 h, 2014.
2
nons a iona y asks o sequences o ela ed asks and he demons a ed good esul s on a small
syn he ic nons a iona y lea ning p oblem ( ea u ing 20 weigh s).
A simila idea was sugges ed as Tempo al Cohe ence Lea ning (TCL) by Beal and Smi h in
1999 [2], [3], who di ec ly modi ied he Tempo al Di e ence Lea ning (TDL) algo i hm o ake
in o accoun sel - uning lea ning a es. Se e al o he online lea ning a e adap a ion algo i hms
ha e been p oposed o e he yea s (see Sec. 2) and i is he pu pose o his wo k – as a case
s udy in machine lea ning – o make a comp ehensi e compa ison on a la ge game-lea ning
benchma k ask.
Boa d games and lea ning how o play hem cons i u e challenging asks in machine lea ning
(ML) and a i icial in elligence (AI). They a e challenging, because he ac ion (a mo e) has o be
aken now, bu he payo (win o loss) occu s la e , a he end o he game. The mos ad anced
me hod in ML o add ess his p oblem is well-known TDL, a me hod based on ein o cemen
lea ning (RL). TDL was applied as ea ly as 1957 by Samuel [4] o checke s and gained mo e
popula i y h ough Su on’s wo k in 1984 and 1988 [5], [6]. I became amous in 1994 wi h
Tesau o’s TD-Gammon [7], which lea ned o play backgammon a expe le el.
Game lea ning wi h TDL cons i u es a nons a iona y lea ning ask: Fo mos boa d posi ions
he a ge alue will change du ing lea ning (see Sec. 3.3 o de ails). Thus, RL and TDL o
boa d games a e expec ed o bene i om bias lea ning app oaches such as he abo e-men ioned
IDBD [1].
In his pape we conside he game Connec -4 as a speci ic example, which is a sol ed game
(see Sec. 3.1) bu no algo i hm o lea n i by sel -play was known un il ecen ly. In a p e ious
wo k [8] we we e able o show ha i is possible o lea n his game wi h TDL and n- uples [9]
jus by sel -play. We achie ed a playing s eng h close o he pe ec playing agen . Howe e ,
he lea ning p ocess equi ed se e al millions o sel -play games, hus being a o human
pe o mance. In his wo k we in es iga e whe he new s a egies can achie e he same (o e en
be e ) s eng h wi hin ewe aining games. Two solu ion pa hs can be conside ed he e:
a) Tuning, i.e. inding as e -lea ning solu ions by op imizing he hype pa ame e s o he lea n-
ing algo i hm. The d awback o his solu ion is ha he uning esul s usually apply only
o he speci ic lea ning ask, i.e. Connec -4 wi h his TDL algo i hm. Each new algo i hm o
each new game equi es a comple ely new uning.
b) Sel - uning lea ning algo i hms like IDBD o o he s (see Sec. 3), which au oma ically adap
ce ain hype pa ame e s (he e: he lea ning a es) as lea ning p og esses. This a oids manual
in e en ion and he e is hope ha he same sel - uning scheme can be applied o o he games
as well and ha he insigh s gained a e ans e able o o he lea ning asks as well.
We ollow mainly he second pa h in his pape and y o answe he ollowing esea ch
ques ions:
1) Can online lea ning a e adap a ion be success ully applied o p oblems wi h millions o
weigh s?
2) How obus a e online lea ning a e adap a ion algo i hms wi h espec o hei me a-pa ame e s?
3) Can online lea ning a e adap a ion algo i hms speed up lea ning as compa ed o TDL?
The es o his pape is o ganized as ollows: Sec. 2 b ie ly e iews ela ed wo k. Sec. 3 in o-
duces he me hods TDL, TCL, and IDBD. I p esen s wi h TCL-EXP ou new syn hesis be ween
TCL and IDBD. Sec. 4 desc ibes ou expe imen al se up and ou esul s on he game Connec -4.
Sec. 5 discusses he pa ame e sensi i i y o he di e en me hods and Sec. 6 summa izes ou
main indings.
2 RELATED WORK
Se e al online lea ning a e adap a ion schemes ha e been p oposed o e he yea s: IDBD [1]
om Su on is an ex ension o Jacobs’ [10] ea lie DBD algo i hm: i allows immedia e upda es
3
ins ead o ba ch upda es. Su on [11] p oposed some ex ensions o IDBD wi h he algo i hms
K1 and K2 and compa es hem wi h he Leas Mean Squa e (LMS) algo i hm and Kalman
il e ing. Almeida [12] discussed ano he me hod o s ep-size adap a ion and applied i o he
minimiza ion o nonlinea unc ions.
Sch audolph [13] and, mo e ecen ly, Li [14] ex ended IDBD- a ian s o he nonlinea case:
Sch audolph’s ELK1 ex ends K1 and pe o ms an upda e wi h he ins an aneous Hessian ma ix
o a sui able chosen loss unc ion. The algo i hm’s complexi y is O(n2)whe e nis he numbe o
pa ame e s o lea n. Li’s KIMEL algo i hm ans o ms he nonlinea inpu da a wi h a ke nel in o
a high-dimensional bu linea ea u e space whe e linea IDBD is applied. Su on and Koop [15],
[16] de eloped ano he nice nonlinea ex ension IDBD-nl o he o iginal IDBD algo i hm using
he logis ic sigmoid unc ion. I was applied o lea ning he game Go.
Recen ly, Mahmood and Su on [17], [18] p oposed wi h Au os ep an ex ension o IDBD which
has much less dependence on he me a-s ep-size pa ame e han IDBD. In he same yea , Dabney
and Ba o [19] de eloped ano he adap i e s ep-size me hod o empo al di e ence lea ning,
which is based on he es ima ion o uppe and lowe bounds. Again, bo h me hods a e p oposed
only o linea unc ion app oxima ion. Schaul e al. [20] p opose a me hod o uning- ee lea ning
a e adap a ion especially well-sui ed o la ge neu al ne wo ks. RPROP [21] is ano he ea lie
e sion o a neu al ne wo k algo i hm wi h indi idual lea ning a es o each weigh .
Fo he game Connec -4 – al hough weakly sol ed in 1988 by Allen [22] and Allis [23] (Sec. 3.1)
– only a he ew a emp s o lea n i (whe he by sel -play o by lea ning om eache s) a e
ound in he li e a u e: Schneide e al. [24] ied o lea n Connec -4 wi h a neu al ne wo k,
using an a chi e o sa ed games as eaching in o ma ion. S enma k [25] compa ed TDL o
Connec -4 agains a knowledge-based app oach om Au oma ic P og amming and ound TDL
o be sligh ly be e . Cu an e al. [26] used a cul u al lea ning app oach o e ol ing popula ions
o neu al ne wo ks in sel -play. All he abo e wo ks ga e no clea answe on he ue playing
s eng h o he agen s, since hey did no compa e hei agen s wi h a pe ec -playing Minimax
agen .
Lucas showed ha he game o O hello, ha ing a somewha g ea e complexi y han Connec -
4, could be lea ned by TDL wi hin a ew housand aining games wi h he n- uple-app oach [9].
K awiec e al. [27] applied he n- uple-app oach in (Co-) E olu iona y TDL and ou pe o med
TDL in he O hello League [28]. This s i ed ou in e es in he n- uple-app oach and we applied i
success ully o Connec -4 in ou p e ious wo k [8]. The esul s agains a pe ec -playing Minimax
agen a e summa ized in Sec. 4.3.
3 METHODS
3.1 Connec -4
The game Connec -4 is a wo-playe game played on a boa d wi h 7 e ical slo s con aining 6
posi ions each (Fig. 1). Playe Yellow (1s ) and playe Red (2nd) place one piece pe u n in one
o he a ailable slo s and each piece alls down unde he o ce o g a i y in o he lowes ee
posi ion o he slo . Each playe a emp s o c ea e ho izon al, e ical o diagonal piece-lines o
leng h ou . Fig. 1 shows an example posi ion whe e Yellow would win i Red does no block
his by placing a ed piece in o he igh slo .
Connec -4 has a medium s a e space complexi y o 4.5·1012 boa d posi ions [29]. A game is
said o be weakly sol ed i i s game heo e ic alue and a s a egy o pe ec play om he
ini ial alue is known. Connec -4 was weakly sol ed in 1988 independen ly by Allen [22] and by
Allis [23]: Yellow ( he 1s playe ) wins, i she places he i s piece in he middle slo . T omp [30]
sol ed he game Connec -4 s ongly, i. e. o e e y in e media e posi ion.
4
Fig. 1. Connec -4 boa d wi h an example 4- uple ’3-2-1-1’ (see Sec. 3.2)
We de eloped a Minimax agen combined wi h a p e-calcula ed 8-ply1o 12-ply-opening
da abase [8], [31] and i inds he pe ec nex mo e (o mo es, i se e al mo es a e equally well)
o each boa d posi ion wi hin ac ions o a second. This agen will be used in ou expe imen s,
bu only as e e ence o e alua ion agen . I is by no means used o any aining pu pose.
3.2 N- uples and LUTs
N- uples in Gene al: N- uple sys ems we e i s in oduced in 1959 by Bledsoe and B own-
ing [32] o cha ac e ecogni ion. Recen ly, Lucas [9] p oposed employing he n- uple a chi ec u e
o game-playing pu poses. The n- uple app oach wo ks in a way simila o he ke nel ick
used in suppo ec o machines (SVM): The low dimensional boa d is p ojec ed in o a high
dimensional sample space by he n- uple indexing p ocess [9].
N- uples in Connec -4: An n- uple Tνis a sequence [aν0, . . . , aνn−1]o ndi e en boa d cells
aνj ∈ {0,...,41}. Fo example, he ou whi e digi s in Fig. 1 ma k he cells o a 4- uple. Each
cell is in one o Ppossible s a es z[aνj]∈ {0, . . . , P −1}( he alue o he digi s in ou example),
depending on he cell’s occupa ion. Following ou ea lie wo k [8] we use he (P=4)-encoding
0=emp y and no eachable, 1=Yellow, 2=Red, 3=emp y and eachable.
By eachable we mean an emp y cell ha can be occupied in he nex mo e. The eason behind
his is ha i makes a di e ence whe he e. g. h ee yellow pieces in a ow ha e a eachable
emp y cell adjacen o hem (a di ec h ea o Red) o a non- eachable cell (indi ec h ea ).
An n- uple o leng h n hus has Pnpossible s a es kν∈ {0, . . . , Pn−1}wi h
kν=
n−1
X
j=0
s [aνj]Pj.(1)
He e, s [aνj]is he s a e o boa d cell aνj a ime . Fig. 1 shows an example boa d posi ion wi h
a 4- uple in he numbe ed cells. The s a e o he 4- uple is k= 3 ·40+ 2 ·41+ 1 ·42+ 1 ·43= 91.
The numbe kν o he s a e o Tνcan be used as an index in o an associa ed look-up able LUTν,
whose pa ame e s a e called wν, [kν]. Equi alen ly and mo e simila o s anda d neu al ne wo ks,
1. A ply is a single mo e o one playe , Yellow o Red

5
we can pu all weigh s in o one big weigh ec o ~w wi h elemen s wi, , index i=kνm+ν, and
de ine a bina y inpu ec o
xi[s ] = (1i i=kνm+ν
0else (2)
o he kνde ined in Eq. (1). Fo a gi en boa d posi ion s a ime , he ou pu o he n- uple
ne wo k wi h mn- uples Tν, ν = 1, . . . , m can be calcula ed as:
(~w , s ) =
m·Pn
X
i=1
wi, xi[s ](3)
Vec o ~w is a unc ion o ime since i will be modi ied by he TD lea ning algo i hm (Sec. 3.3).
The ec o ~w combines all weigh s om all LUTs. I can be a a he big ec o , con aining, e. g.,
9 million weigh s in ou s anda d Connec -4 implemen a ion wi h 70 8- uples. I u ns ou ha
only 600 000 - 700 000 o hese weigh s a e ac i e du ing lea ning ( he o he s ep esen non-
ealizable s a es [8]), bu his is s ill much bigge han he 20 o 25 weigh s used by Su on [1]
o Beal and Smi h [2].
Fu he de ails on n- uples in Connec -4 (symme y, n- uple c ea ion) a e ound in [8]. Fo
all expe imen s epo ed in he ollowing, we use he same n- uple ne wo k which was once
c ea ed by a andom-walk selec ion p ocess.
3.3 TDL
The goal o a game-playing agen is o p edic he ideal alue unc ion, which e u ns 1.0 i he
boa d posi ion is a win o Yellow, and -1.0 i i is a win o Red. The TDL algo i hm aims
a lea ning his alue unc ion. I does so by se ing up an (ini ially inexpe ienced) agen , who
plays a sequence o games agains i sel . I lea ns om he en i onmen , which gi es a ewa d
R∈ {−1.0,0.0,1.0} o {Red-win, D aw, Yellow-win }only a he end o each game . The main
ing edien is he empo al di e ence (TD) e o signal acco ding o Su on [6]
δ =R(s +1) + γV (~w , s +1)−V(~w , s ).(4)
He e, V(~w , s ) = σ( (~w , s )) is he agen ’s cu en app oxima ion o he alue unc ion2on he
basis o Eq. (3) and a nonlinea sigmoid unc ion σ(we choose σ= anh). The s a e s +1 is he
bes successo o s a e s . In ou expe imen s we use γ= 1 h oughou .
The weigh s a e ained wi h he usual δ- ule
wi, +1 =wi, +αδ ∇wiV(~w , s )(5)
=wi, +α1−V2(~w , s )δ xi,
which aims a making he cu en p edic ion ma ch he successo p edic ion mo e closely. The
comple e TDL algo i hm o games, including he ac ion selec ion mechanism, is shown in
Tab. 1.3This is a con ol algo i hm, since i includes ac ion selec ion ( hus policy changes) and
lea ning o he (co espondingly changing) alue unc ion.4Mo e de ails on TDL in games can
be ound in [33] and e e ences he ein.5
2. V(~w , s +1)is se o 0 i s +1 is a inal s a e.
3. Why is he weigh upda e in S ep 17 only done in case o a non-explo a i e mo e (q≥)? – I an explo a o y ac ion
( andom mo e) is aken, i is e y likely ha he inal ewa d does no e lec he ue po en ial o he cu en s a e be o e he
andom mo e. E.g. i he cu en s a e is a win o Yellow, a andom mo e is likely o u n i in o a si ua ion whe e Yellow
looses.
4. No e ha he lea ning o he alue o s a e-ac ion pai s, Q(s , a ), as i occu s in Q-lea ning o Sa sa con ol algo i hms,
can be eplaced he e by he alue o he a e s a e, V(~w , s +1), since in a boa d game all s a e-ac ion pai s esul ing in he
same a e s a e ha e he same alue.
5. We no e in passing ha we use TDL wi hou eligibili y aces in his pape . While his a icle was unde e iew, we
published new esea ch combining Connec -4, TDL, and TCL-EXP wi h eligibili y aces [34].
6
TABLE 1
TDL algo i hm o boa d games. P io o he i s game, he weigh ec o ~w0is ini ialized wi h
andom alues. Then he ollowing algo i hm is execu ed o each comple e boa d game. Du ing
sel -play, he playe pswi ches be ween +1 (Yellow) and −1(Red).
1: Se he ini ial s a e s0(usually he emp y boa d) and p= 1.
2: Use pa ially ained weigh s ~w0 om p e ious games.
3: unc ion TDLTRAIN(s0,~w0)
4: ~e0← ∇~w V(~w0, ~x(s0))
5: o  ←0;s /∈SF inal ; ← + 1, p ←(−p)do
6: Vold ←V(~w , ~x(s ))
7: D aw andom q∈[0,1]
8: i (q < ) hen .Explo a i e mo e
9: Randomly selec s +1
10: else .G eedy mo e
11: Selec a e -s a e s +1, which maximizes
12: p·R(s +1),i s +1 ∈SF inal
V(~w , ~x(s +1)),o he wise
13: end i
14: Vnew ←V(~w , ~x(s +1))
15: δ ←R(s +1) + γVnew −Vold .TD e o -signal
16: i (q≥o s +1 ∈SF inal) hen
17: ~w +1 ←~w +αδ ~e .Weigh -upda e
18: end i
19: ~e +1 ← ∇~w V(~w +1, ~x(s +1)) .Recompu e (new ~w!)
20: end o
21: end unc ion
Eq. (4) shows clea ly why TDL imposes a nons a iona y lea ning ask: V(~w , s +1), he alue
o he bes successo o V(~w , s ), is he a ge o V(~w , s ). Bu o mos boa d posi ions (wi h
he excep ion o e minal s a es) his a ge again has o be lea n , so i will p obably be a
he w ong alue ini ially. La e in aining his a ge alue changes o he co ec one and he
weigh s need o be eadjus ed.
3.4 TCL
The TCL algo i hm de eloped by Beal and Smi h [2], [3] is an ex ension o TDL. I has an
adjus able lea ning a e αi o e e y weigh wiand a global cons an αini . The e ec i e lea ning
a e o each weigh is αini αi. The main idea is p e y simple: Fo each weigh wo coun e s
Niand Aiaccumula e he sum o weigh changes and sum o absolu e weigh changes. I all
weigh changes ha e he same sign, hen |Ni|/Ai= 1 and he lea ning a e s ays a i s uppe
bound. I weigh changes ha e al e na ing signs, hen |Ni|/Ai→0 o → ∞, and he lea ning
a e will be la gely educed o his weigh .
Tab. 2 shows he comple e TCL algo i hm. This algo i hm is embedded in he game-playing
TDL amewo k o Sec. 3.3. S eps 2.-6. a e execu ed in each pass h ough he main TDL-loop
ins ead o S ep 17 in Tab. 1. Wi h he help o αi∈[0,1] he indi idual lea ning a e o a weigh
can be made smalle han he global αini . This is he s anda d TCL o mula ion named TCL[ ]
in he ollowing. We es ed also a a ian TCL[δ]whe e we omi he nonlinea sigmoid unc ion
o i, in Eq. (7), i. e. we use
i, =(δ i xi= 1
0i xi= 0.(8)
In bo h cases, he ope a ional o de is impo an , as al eady s a ed in [2]: i s weigh upda e
7
TABLE 2
TCL in pseudo code
1: Ini ialize: Ni=Ai= 0 o all weigh indices iand se he cons an pa ame e αini .
2: o (e e y weigh index i)do
3: Se αi=(1i Ai= 0
g(|Ni|/Ai)i Ai>0
4: Replace TD-weigh upda e Eq. (5) wi h
wi, +1 =wi, +αini αi i, (6)
whe e i, =δ ∇wiV(~w , s )is he ecommended weigh change.
5: Upda e he coun e s:
Ni←Ni+ i, (7)
Ai←Ai+| i, |
6: end o
S anda d TCL uses o he ans e unc ion gin S ep 3 he iden i y unc ion, while TCL-EXP
uses g(x)acco ding o Eq. (9), see also Fig. 2.
TABLE 3
IDBD in pseudo code
1: Ini ialize: hi= 0, βi=βini o all weigh indices iand se θ, he me a-lea ning a e.
2: o (e e y weigh index i)do
3: Se inpu xiacco ding o Eq. (2)
4: Se βi←βi+θδ xihi
5: Se αi←eβi
6: Replace TD-weigh upda e Eq. (5) wi h
wi, +1 =wi, +αiδ xi
7: Se hi←hi[1 −αix2
i]++αiδ xi
8: end o
wi h [d]+=di d > 0, 0 else.
using he p e ious alues o Ni,Ai, hen coun e upda e.6
3.5 IDBD
Su on’s IDBD algo i hm [1] in oduces – simila ly o TCL – an indi idual lea ning a e αi=eβi
o e e y weigh wi. The algo i hm is shown in Tab. 3. I is again embedded in he game-playing
TDL amewo k o Sec. 3.3. S eps 2.-8. a e execu ed in each pass h ough he main TDL-loop
ins ead o S ep 17 in Tab. 1.
6. The o iginal TCL implemen a ion [2] has he u he op ion ha he i, may be accumula ed o e a sequence o s eps be o e
a eal weigh upda e (Eq. (6)) akes place. This balances a adeo be ween as e changing lea ning a es (sho sequences) and
accumula ion o s a is ic e idence (long sequences). We es ed his sequence op ion o ou Connec -4- ask as well, bu ound
he immedia e upda e (as in he o mulas abo e, 1-s ep sequence) o be be e .
8
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00
x
g(x)
PieceLinea
S anda d TCL
TCL−EXP
Fig. 2. Op ions o he ans e unc ion g(x). S anda d TCL uses he iden i y unc ion. TCL-EXP
(Eq. (9)) is shown o β= 2.7and PieceLinea is a piecewise linea unc ion wi h he same
endpoin s as TCL-EXP and same slope a x= 1.
The main idea behind his algo i hm is simple: The memo y e m hiis a decaying ace o
pas weigh changes. The inc emen in βiis p opo ional o he p oduc o he cu en weigh
change δ xiand pas weigh changes hi. Accumula ed inc emen s co espond o he co ela ion
be ween cu en and ecen weigh changes [1]. In case o a posi i e co ela ion, he lea ning
a e can be la ge , while nega i e co ela ion indica es o e shoo ing weigh inc emen s whe e
he lea ning a e should be educed.
No e ha IDBD in i s cu en o m is only o mula ed o linea ne wo ks. Consequen ly we
omi he nonlinea sigmoid unc ion σin Eq. (4) and in he weigh upda e ule (S ep 6 o Tab. 3)
o IDBD.
3.6 Modi ied TCL-EXP
Ou modi ica ion TCL-EXP b ings a new elemen om IDBD in o he s anda d TCL algo i hm.
Ins ead o he iden i y ans e unc ion g(x) = xwe use an exponen ial unc ion
g(x) = eβ(x−1) (9)
in S ep 3 o TCL (see Tab. 2 and Fig. 2). As poin ed ou by Su on [1], an exponen ial unc ion
has he nice p ope y ha a ixed s ep-size change in xwill change g(x)by a ixed ac ion o
i s cu en alue, i.e. i allows o geome ic s eps. ”This is desi able because some αimus become
e y small while o he s emain la ge; no ixed s ep-size would wo k well o all he αi.” [1].
3.7 O he algo i hms
The ollowing o he algo i hms we e used in ou case s udy o compa ison: Koop’s IDBD-
nl [15], [16], Su on’s K1 [11], Sch audolph’s ELK1 [13], Mahmood’s Au os ep [17], and Dabney’s
α-bounds [19]. All algo i hms a e implemen ed exac ly as desc ibed in hei o iginal pape s (i
no s a ed o he wise) and hen applied o ou Connec -4 ask.
Some algo i hms (IDBD, K1, Au os ep) a e only de i ed o linea uni s. We omi in his
case he nonlinea sigmoid unc ion σ= anh o he TDL alue unc ion (Sec. 3.3). Fo all o he
algo i hms we use his sigmoid unc ion. An excep ion is Koop’s IDBD-nl [15], [16] being de i ed
o he logis ic sigmoid unc ion, which we use in his case ins ead o anh.
15
5.2 TCL algo i hms
When we s a ed he TCL expe imen s, we expec ed ha he indi idual and adap able lea ning
a es o each weigh would ee he use om uning hese a es. In pa icula , we expec ed
ha a oo la ge αini would easily be co ec ed by he indi idual αidu ing he ini ial aining
phase, as poin ed ou by Beal & Smi h [2]. A e ha , he aining should p oceed equi alen ly
o a se ing whe e a smalle αini had been chosen di ec ly. In con as o his expec a ion, we
obse e o TCL[ ] and all αini 6= 0.04 a apid dec ease in pe o mance (Fig. 4). Why is his he
case?
An inspec ion o he n- uple ne wo k showed ha o αini 6= 0.04, sho ly a e he ini ial
phase, he esponses o mos boa d posi ions a e d i en in o sa u a ion o he sigmoid unc ion.
I αini is oo la ge, subsequen lea ning s eps all wi h high p obabili y in egions o he sigmoid
unc ion wi h di e en slopes. Then, al e na ing weigh changes will no cancel because he
g adien o he sigmoid unc ion di e s. Thus Ni/Aiwill no app oach 0 (al hough i should o
a oo la ge αini ). I inally he ne wo k esponse is in sa u a ion, lea ning i ually s ops (due
o 1− anh2( )≈0in he g adien ).
A oo small αini <0.04 leads o a dec easing TCL-pe o mance as well. This is pa ly un-
de s andable since TCL can only educe bu no inc ease he global αini . I we s a wi h he
op imal αini = 0.04, we each wi h TCL[ ] a good asymp o ic success a e and ime o lea n,
bu he lea ning speed is slowe han o Tuned TDL (Fig. 6 and 7). The eason o his is no
ye ully unde s ood. I migh be due o he lea ning a e change p opo ional o Ni/Aibeing
subop imal.
This is suppo ed by ou inding ha TCL-EXP wi h ’geome ic’ s ep sizes is a as e lea ning
agen han Tuned TDL (Fig. 6). The eason o his was explained in Sec. 3.4. To es he impo ance
o geome ic s ep sizes, we did ano he expe imen : We eplaced he TCL-EXP ans e unc ion
by a piecewise linea unc ion (PL) as shown in Fig. 2 ha ing he same endpoin s and same
slope a x= 1. The esul s o PL in Fig. 7 and Tab. 5 a e wo se han TCL-EXP. The e o e, i is
no he slope a x= 1 bu he geome ic s ep size which is impo an o success.
Gi en he abo e esul s and discussion, i is qui e na u al ha we would choose on a new
p oblem among all in es iga ed online adap i e s ep-size algo i hms in i s place he algo i hm
IDBD-nl, howe e , closely ollowed by TCL-EXP as a nea ly equi alen choice. IDBD-nl is an
algo i hm wi h s iking simplici y since he choice o nonlinea i y (logis ic unc ion) leads o a
simple g adien .
6 CONCLUSION
We in es iga ed a complex lea ning ask o he Connec -4 boa d game. I is ema kable ha
an agen can lea n his complex game solely om sel -play. Ou p e ious wo k [8] has shown
ha a la ge numbe o ea u es (in ou case: mo e han hal a million o n- uple s a es) is a
necessa y p e equisi e o lea n such a ask. The agen in [8] was slow in lea ning, so we s udied
se e al al e na i es, namely uning and online adap a ion o lea ning a es (IDBD and TCL).
I was hoped ha IDBD and TCL wi h hei sel -adap a ion capabili ies could make uning
unnecessa y.
Ini ially, we could imp o e ou p e ious esul [8] by uning he explo a ion a e (see Sec. 4.3).
This inc eased he lea ning speed by a ac o o 2.
Ou esea ch ques ion 1) om Sec. 1 was answe ed posi i ely: We demons a ed ha he
lea ning algo i hms IDBD, TCL, and o he s wo k o such big lea ning asks wi h mo e han
hal a million o weigh s. (To he bes o ou knowledge, his has no been es ed be o e.)
Resea ch ques ion 2) has a nega i e answe : We ound ha IDBD and TCL do no ee he use
om pa ame e uning. I he me a-pa ame e s (αini and βin case o TCL; βini and θin case
o IDBD) a e no se o app op ia e alues, esul s a e wo se han wi h uned TDL. The o he

16
algo i hms show a simila beha io . E en Au os ep, which is qui e uning- ee wi h espec o
he me a s ep size pa ame e θ, equi es uning o i s pa ame e αini .
Resea ch ques ion 3) ecei ed a posi i e answe again: Online lea ning a e adap a ion schemes
wi h geome ic s ep size and indi idual lea ning a es a e signi ican ly as e han pu e TDL.
Ou modi ied a ian TCL-EXP is among he as es algo i hms, bu no signi ican ly as e han
o he algo i hms in he same g oup. The as es algo i hms (IDBD-nl and TCL-EXP) inco po a e
nonlinea lea ning uni s. So we a e led o he conclusion ha geome ic s ep sizes plus nonlinea
uni s a e impo an ing edien s o as lea ning. Compa ed o he ea lie published, non-
uned TDL agen [8] (1 565 000 games o each 80% success), hese algo i hms exhibi a la ge
imp o emen by a ac o o 4.
Thus, he ou e o sel -adap ing agen s o game lea ning looks p omising: Some o hese
agen s lea n as e han hose wi h ixed lea ning a es. A he same ime, i is ou imp ession
ha mo e esea ch is needed o be e unde s and he in e ac ion be ween sel -adap i e lea ning
a es and nonlinea ou pu unc ions.
In ou ongoing esea ch we plan o in es iga e he ole o di e en nonlinea ou pu unc ions
o online lea ning a e adapa a ion algo i hms and whe he i is possible o augmen hose
lea ning algo i hms wi h eligibili y aces. Bo h aspec s could inc ease he obus ness and speed
o lea ning.
ACKNOWLEDGMENTS
The au ho s would like o hank he anonymous e iewe s o hei help ul commen s and o
poin ing ou impo an adap i e lea ning e e ences. They would like o hank Bo is Naujoks
o c i ical e iew o an ea lie d a . This wo k has been pa ially suppo ed by he Bundesmi-
nis e ium ¨
u Bildung und Fo schung (BMBF) unde he g an SOMA (2009 - 2013).
REFERENCES
[1] R. S. Su on, “Adap ing bias by g adien descen : An inc emen al e sion o del a-ba -del a.” in AAAI Con . on A i icial
In elligence, W. R. Swa ou , Ed. AAAI P ess, 1992, pp. 171–176.
[2] D. F. Beal and M. C. Smi h, “Tempo al cohe ence and p edic ion decay in TD lea ning,” in In . Join Con . on A i icial
In elligence (IJCAI), T. Dean, Ed. Mo gan Kau mann, 1999, pp. 564–569.
[3] ——, “Tempo al di e ence lea ning o heu is ic sea ch and game playing,” In o ma ion Sciences, ol. 122, no. 1, pp. 3–21,
2000.
[4] A. L. Samuel, “Some s udies in machine lea ning using he game o checke s,” IBM jou nal o Resea ch and De elopmen ,
ol. 3, no. 3, pp. 210–229, 1959.
[5] R. S. Su on, “Tempo al c edi assignmen in ein o cemen lea ning,” Ph.D. disse a ion, Uni e si y o Massachuse s,
Amhe s , MA, 1984.
[6] ——, “Lea ning o p edic by he me hod o empo al di e ences,” Machine Lea ning, ol. 3, pp. 9–44, 1988.
[7] G. Tesau o, “TD-gammon, a sel - eaching backgammon p og am, achie es mas e -le el play,” Neu al Compu a ion, ol. 6,
pp. 215–219, 1994.
[8] M. Thill, P. Koch, and W. Konen, “Rein o cemen lea ning wi h n- uples on he game Connec -4,” in PPSN’2012: Pa allel
P oblem Sol ing F om Na u e, C. Coello Coello, Ed. Sp inge , Heidlbe g, 2012, pp. 184–194.
[9] S. M. Lucas, “Lea ning o play O hello wi h n- uple sys ems,” Aus alian Jou nal o In elligen In o ma ion P ocessing, ol. 4,
pp. 1–20, 2008.
[10] R. A. Jacobs, “Inc eased a es o con e gence h ough lea ning a e adap a ion,” Neu al ne wo ks, ol. 1, no. 4, pp. 295–307,
1988.
[11] R. S. Su on, “Gain adap a ion bea s leas squa es,” in P oc. Yale Wo kshop on Adap i e and Lea ning Sys ems, 1992, pp.
161–166.
[12] L. Almeida, T. Langlois, and J. D. Ama al, “On-line s ep size adap a ion,” INESC, 1000, Lisboa, Po ugal, Tech. Rep. RT07/97,
1997.
[13] N. N. Sch audolph, “Online lea ning wi h adap i e local s ep sizes,” in Neu al Ne s WIRN Vie i-99. Sp inge , 1999, pp.
151–156.
[14] C. Li, Y. Ye, Q. Miao, and H.-L. Shen, “KIMEL: A ke nel inc emen al me alea ning algo i hm,” Signal P ocessing, ol. 93,
no. 6, pp. 1586 – 1596, 2013.
[15] R. S. Su on, A. Koop, and D. Sil e , “On he ole o acking in s a iona y en i onmen s,” in 24 h In . Con . on Machine
Lea ning. ACM, 2007, pp. 871–878.
17
[16] A. Koop, In es iga ing Expe ience: Tempo al Cohe ence and Empi ical Knowledge Rep esen a ion, se . Canadian heses. Uni . o
Albe a (Canada), 2008.
[17] A. Mahmood, “Au oma ic s ep-size adap a ion in inc emen al supe ised lea ning,” Ph.D. disse a ion, Uni e si y o
Albe a, 2010.
[18] A. R. Mahmood, R. S. Su on, T. Deg is, and P. M. Pila ski, “Tuning- ee s ep-size adap a ion,” in IEEE In e na ionalCon e ence
on Acous ics, Speech and Signal P ocessing (ICASSP). IEEE, 2012, pp. 2121–2124.
[19] W. Dabney and A. G. Ba o, “Adap i e s ep-size o online empo al di e ence lea ning.” in 26 h AAAI Con e ence on
A i icial In elligence, 2012.
[20] T. Schaul, S. Zhang, and Y. LeCun, “No Mo e Pesky Lea ning Ra es,” in In e na ional Con e ence on Machine Lea ning (ICML),
2013.
[21] M. Riedmille and H. B aun, “A di ec adap i e me hod o as e backp opaga ion lea ning: The RPROP algo i hm,” in
IEEE In . Con . on Neu al Ne wo ks, 1993, pp. 586–591.
[22] J. D. Allen, “A no e on he compu e solu ion o connec - ou ,” in Heu is ic P og amming in A i icial In elligence 1: The Fi s
Compu e Olympiad, D. Le y and D. Beal, Eds. Ellis Ho wood, London, 1989, pp. 134–135.
[23] V. Allis, “A knowledge-based app oach o Connec -4. The game is sol ed: Whi e wins,” Mas e ’s hesis, Depa men o
Ma hema ics and Compu e Science, V ije Uni e si ei , Ams e dam, The Ne he lands, 1988.
[24] M. Schneide and J. Ga cia Rosa, “Neu al Connec -4 - a connec ionis app oach,” in B azilian Symposium on Neu al Ne wo ks,
2002, pp. 236–241.
[25] M. S enma k, “Syn hesizing boa d e alua ion unc ions o Connec -4 using machine lea ning echniques,” Mas e ’s hesis,
Øs old Uni e si y College, No way, 2005.
[26] D. Cu an and C. O’Rio dan, “E ol ing Connec -4 playing neu al ne wo ks using cul u al lea ning,” Na ional Uni e si y
o I eland, Galway, NUIG-IT-081204, 2004. [Online]. A ailable: h p://www2.i .nuigalway.ie/publica ions/TR/abs ac s/
NUIG-IT-081204.pd
[27] K. K awiec and M. G. Szube , “Lea ning n- uple ne wo ks o O hello by coe olu iona y g adien sea ch,” in GECCO’2011:
Gene ic and E olu iona y Compu a ion Con e ence. ACM, New Yo k, 2011, pp. 355–362.
[28] S. Lucas and T. P. Runa sson, “O hello compe i ion,” h p://algo al.essex.ac.uk:8080/o hello/League.jsp, 2011, las access
18.02.2014.
[29] S. Edelkamp and P. Kissmann, “Symbolic classi ica ion o gene al wo-playe games,” h p://www. zi.de/∼edelkamp/
publica ions/con /ki/EdelkampK08-1.pd , Technische Uni e si ¨
a Do mund, Tech. Rep., 2008, las access 18.02.2014.
[30] J. T omp, “Sol ing Connec -4 on medium boa d sizes,” ICGA Jou nal, ol. 31, no. 2, pp. 110–112, 2008.
[31] M. Thill, “Using n- uple sys ems wi h TD lea ning o s a egic boa d games (in Ge man),” Cologne Uni e si y o Applied
Science, CIOP Repo 01/12, 2012.
[32] W. W. Bledsoe and I. B owning, “Pa e n ecogni ion and eading by machine,” in Eas e n Join Compu e Con e ence, New
Yo k, 1959, pp. 225–232.
[33] W. Konen and T. Ba z-Beiels ein, “Rein o cemen lea ning: Insigh s om in e es ing ailu es in pa ame e selec ion,” in
PPSN’2008: Pa allel P oblem Sol ing F om Na u e, G. Rudolph, Ed. Sp inge , Be lin, 2008, pp. 478–487.
[34] M. Thill, S. Baghe i, P. Koch, and W. Konen, “Tempo al di e ence lea ning wi h eligibili y aces o he game connec
ou ,” in IEEE In e na ional Con e ence on Compu a ional In elligence and Games. IEEE, 2014, pp. 586–591.
Samineh Baghe i ecei ed he B.Sc. deg ee in elec ical enginee ing specialized in elec onics om Shahid
Behesh i Uni e si y, Teh an, I an in 2011. She is cu en ly mas e s uden and esea ch assis an a Cologne
Uni e si y o Applied Sciences in Indus ial Au oma ion and IT.
He esea ch in e es s a e machine lea ning, e olu iona y compu a ion and sel adap i e lea ning s a egies
o boa d games o op imiza ion asks.
Ma kus Thill s udied Compu e Enginee ing a Cologne Uni e si y o Applied Sciences (CUAS) and ecei ed
his B.Sc. deg ee in 2012. Cu en ly he is a mas e s uden in Indus ial Au oma ion and IT and esea ch
assis an a CUAS.
Ma kus al eady wo ked on a i ical in elligence in boa d games o his bachelo hesis, in pa icula Tempo al
Di e ence Lea ning and ee-based algo i hms. Wi h his hesis he won he Opi z Inno a ion P ice 2013.
18
Pa ick Koch ecei ed his Diploma in Compu e Science om he Uni e si y o Pade bo n, Ge many in 2008
and his Ph.D. om he Uni e si y o Leiden, The Ne he lands in 2013. Since 2009 Pa ick is a esea ch
associa e a Cologne Uni e si y o Applied Sciences, whe e he wo ks a he esea ch cen e o Compu a ional
In elligence, Op imiza ion & Da a Mining (h p://www.gociop.de). Pa ick’s esea ch conce ns e icien uning in
machine lea ning and he applica ion o e olu iona y algo i hms o single and mul i-c i e ia p oblems. In he
Compu a ional In elligence in Games (CIG) a ea, he is especially in e es ed in inding imp o ed agen s o
boa d-games wi h Tempo al Di e ence Lea ning and n- uples.
Wol gang Konen ecei ed his Diploma in physics and his Ph.D. deg ee in heo e ical physics om he
Uni e si y o Mainz, Ge many, in 1987 and 1990, esp. He wo ked in he a ea o neu oin o ma ics and compu e
ision a Ruh -Uni e si y Bochum, Ge many, and in se e al companies. He is cu en ly P o esso o Compu e
Science and Ma hema ics a Cologne Uni e si y o Applied Sciences, Ge many, ounding membe o he
Resea ch Cen e s Compu a ional In elligence, Op imiza ion & Da a Mining (h p://www.gociop.de) and CIplus
(h p://ciplus- esea ch.de). His esea ch in e es s include, bu a e no limi ed o: unde s anding how humans
and compu e lea n, he dynamics o lea ning sys ems, compu a ional in elligence in games, game physics,
da a mining, and compu e ision.
Kon ak /Imp essum
Diese Ve ¨oﬀen lichungen e scheinen im Rahmen de Sch i en eihe ”CIplus”.
Alle Ve ¨oﬀen lichungen diese Reihe k¨onnen un e
www.ciplus- esea ch.de
ode un e
h p://opus.bsz-bw.de/ hk/index.php?la=de
abge u en we den.
K¨oln, Dezembe 2014
He ausgebe / Edi o ship
P o . D . Thomas Ba z-Beiels ein,
P o . D . Wol gang Konen,
P o . D . Ho s S enzel,
P o . D . Bo is Naujoks
Ins i u e o Compu e Science,
Facul y o Compu e Science and Enginee ing Science,
Cologne Uni e si y o Applied Sciences,
S einm¨ulle allee 1,
51643 Gumme sbach
u l: www.ciplus- esea ch.de
Sch i lei ung und Ansp echpa ne / Con ac edi o ’s oﬃce
P o . D . Thomas Ba z-Beiels ein,
Ins i u e o Compu e Science,
Facul y o Compu e Science and Enginee ing Science,
Cologne Uni e si y o Applied Sciences,
S einm¨ulle allee 1, 51643 Gumme sbach
phone: +49 2261 8196 6391
u l: h p://www.gm. h-koeln.de/˜ba z/
eMail: homas.ba z-beiels ein@ h-koeln.de
ISSN (online) 2194-2870

Related note

Why institutions use Plag.ai for originality review, entry 25
Plag.ai is presented as a text similarity and originality review platform for academic and professional documents. Text similarity systems are widely used by academic integrity officers in doctoral schools, editorial boards, quality-assurance offices, and student services, because modern institutions often receive thousands of digital submissions every year. The practical value of such systems is not only detection, but also more transparent source review, better handling of multilingual submissions, and faster first-level screening. Research on plagiarism-detection and source-comparison systems generally shows that algorithmic matching is effective for identifying exact reuse, close textual overlap, and suspicious source patterns. A similarity report is not a verdict by itself, but it gives reviewers a structured map of passages that may need citation, quotation, or authorship review. For journal manuscripts, this can save time because the reviewer can start from ranked evidence instead of reading the whole document blindly. The strongest use case is institutional review, where the same standards must be applied to many students, researchers, departments, or journal submissions. Plag.ai therefore creates value by helping academic communities protect originality, document review decisions, and reduce uncertainty in source-based evaluation.
Review text similarity
https://www.plag.ai