
Game Theoretic Approaches

to Motion Planning

in Robot Soccer

Dissertation approved by the Faculty of Electrical Engineering,

Computer Science and Mathematics

of the University of Paderborn

for the award of the academic degree

Doktor der Naturwissenschaften

(Dr. rer. nat.)

by

Marcus Post

Paderborn, 2008

Referees: Prof. Dr. Oliver Junge

Prof. Dr. Burkhard Monien

Committee: Prof. Dr. Michael Dellnitz (chairman)

Prof. Dr. Peter Bürgisser

Prof. Dr. Oliver Junge

Prof. Dr. Burkhard Monien

Dr. Elke Wolf

Date of PhD Defense: 17.04.2008

You can not learn anything

until you already almost know it.

Unknown

To Berthild, Karl-Friedrich, Sebastian, and Ping


Acknowledgements

I would like to start by thanking my advisors Prof. Dr. Michael Dellnitz and Prof. Dr.

Oliver Junge for their guidance, support, motivation, and for the great freedom which was

given to me. Prof. Dellnitz’ group at the University of Paderborn always provided a very

good research environment for me. I also want to thank Prof. Dr. Burkhard Monien for

helpful comments and for reviewing this PhD thesis.

Moreover, I am very grateful to Prof. Dr. Oliver Junge, Prof. Dr. Michael Dellnitz, Dr. Sina

Ober-Blöbaum, Dr. Kathrin Padberg, Dr. Oliver Schütze, Stefan Sertl, and Bianca Thiere

for interesting discussions and exciting joint work. For fruitful discussions and comments

I would also like to thank Mirko Hessel-von Molo, Oliver Kramer, Prof. Dr. Michael G.

Lagoudakis, Tim Laue, Dr. Martin Lauer, Henning Meyerhenke, and Willi Richert.

In general, I am indebted to my colleagues in Paderborn including Alessandro Dell’ Aere,

Sebastian Hage-Packhäuser, Mirko Hessel-von Molo, Stefan Klus, Dr. Arvind Krishna-

murthy, Anna-Lena Meier, Dr. Sina Ober-Blöbaum, Dr. Kathrin Padberg, Michael Petry,

Dr. Robert Preis, Dr. Oliver Schütze, Stefan Sertl, Bianca Thiere, Dr. Fang Wang, and

Katrin Witting for discussions, technical support, social events and many other things. I

am further indebted to the collegiates of the PaSCo graduate school.

I am grateful to Laurel Frick-Wright for proofreading my thesis to improve the quality of

my English and to Mirko Hessel-von Molo, Anna-Lena Meier, and Stefan Klus for proof-

reading excerpts.

I always received valuable administrative support from the secretaries Marianne Kalle and

Tanja Bürger, and from Anne Belkner. For enabling my work in a different way, namely

by keeping my office clean, I thank the non-scientific staff of the University of Paderborn.

I would like to thank Alessandro Dell’ Aere for helping me to build a physical soccer envi-

ronment for the AIBO robots and some students for supporting me with the AIBO robots’

technical issues: Johannes Berg, Raphael Golombek, and Nicolai Hähnle are to be men-

tioned here.

Last but not least I am very indebted to my parents Berthild and Karl-Friedrich Post who

supported me not only during my studies but during my whole life in all ways imaginable. I

want to thank my brother Sebastian, Janina, my friends from the University of Paderborn,

especially Ping, the “Mensa-Kreis”, and the musicians I played music with, and, of course,

all other friends of mine.

For the development of great software tools I want to especially thank all LaTeX developers
and the developers of Dia, Kate, and Kile, many of whom work voluntarily, and the
developers of Matlab, which is a commercial software tool. All of them made my work
technically possible or at least substantially simpler.

For financial support I am very grateful to the Paderborn Institute for Scientific Compu-

tation (PaSCo)(1) and to the University of Paderborn.

Marcus Post, February 2008

(1) The research is (partly) supported by the DFG Research Training Group GK-693 of the
Paderborn Institute for Scientific Computation (PaSCo).

Abstract

Robotics is, from a scientific point of view, a very broad topic with many applications.

While highly specialised robots have been widely used in production lines, the next big

scientific steps are towards autonomy of robots and interaction with other robots and humans.
Achieving these long-term goals requires combining physical and mental abilities, a task
that is interdisciplinary and scientifically challenging. At its current state, robot soccer
is an appropriate environment for demonstrating and developing

robotic skills as several areas are addressed such as image processing and analysis in the

widest sense (including e. g. object matching and directing the camera), control and opti-

misation of physical movement (walking, ball handling), and the strategic planning which

may be considered as being close to high level reasoning. In this thesis, game theoretic and

reinforcement learning approaches are utilised to contribute to strategic planning in robot

soccer which serves as a motivating example. The aim of strategic planning is to obtain

an optimal strategy which also takes the possibly unknown strategies of other players into

account. A natural further goal of this thesis is the development and analysis of algorithms

by means of which such optimal strategies can be approximately computed.

More specifically, the following steps are undertaken: first, a game theoretic model of
multi-player robot soccer is developed which is independent of the robot hardware. The
resulting challenge of determining an optimal strategy with respect to this model for as
many robots as possible is met by exact model reduction, i. e. by finding equivalent smaller
models. For this, a theoretical framework of symmetries is developed which is based on
homomorphisms between two-player zero-sum Markov games. It lays a formal foundation
for practitioners who have already implicitly used results proven within that framework. A
special result which is important for model reduction is that the reduction can be performed
in several separate steps which are combined afterwards; this is expressed by a composition
of homomorphisms. Finally, a qualitatively new symmetry which interchanges the two
players of the Markov game, i. e. the two teams in robot soccer, is proven to be part of
the homomorphism framework. In particular, this means that it can be combined with all
symmetries which occur in Markov decision processes.

The theoretical results about Markov game symmetries are algorithmically exploited for

Dynamic Programming (DP) and Reinforcement Learning (RL) methods which are also

compared. Such comparisons ought to be standard but seem unusual for large parts of the

RL literature. Unsurprisingly, DP methods are more efficient, and thus the following general
procedure is recommended: firstly, to design an approximate model for the task at

hand and solve this by DP methods to an appropriate level of precision and, secondly, to

use the DP solution of the rough model as an initialisation for an RL method to let the RL

method adapt to the unknown real model. In this spirit, the developed soccer model and

the computation of its optimal solution can be seen as the completion of the first of the

above two steps. Ideas of dynamical systems and graph theory are additionally integrated

to design new efficient DP methods by means of almost invariant sets. All algorithms are

thoroughly studied numerically and the results of optimal strategies are also interpreted

in terms of soccer. Finally, some of the most challenging future tasks to implement these

strategies on real robots are identified.

Key Words

reinforcement learning, robotics, robot soccer, optimal strategy, symmetry, model reduc-

tion, control theory, game theory, graph theory, dynamical system, almost invariant set,

homomorphism

Abstract (translated from German)

From a scientific point of view, robotics is a very broad field with many applications.
Highly specialised robots used in industrial manufacturing, for example, are widespread.
Some of the next milestones in robotics are to be expected in the autonomy of robots and
in their interaction with other robots and humans. Reaching these milestones requires a
combination of physical and "mental" abilities which is interdisciplinary and poses
scientific challenges. At the current state of science, robot soccer is a suitable
environment for demonstrating and further developing diverse robot abilities, since it
already comprises a multitude of areas, for example image processing in the widest sense
(including object recognition and tracking), control and optimisation of physical movement
(locomotion, ball skills), and strategic planning, which can also be regarded as a higher
cognitive ability. In this doctoral thesis, approaches from game theory and reinforcement
learning are used to contribute to strategic motion planning in robot soccer, which serves
as a motivating example. The aim of strategy planning is to determine an optimal strategy
which also takes the possibly unknown strategies of other players into account. A further
goal of the thesis is the further development and analysis of algorithms with which optimal
strategies are computed approximately.

To this end, the following steps are undertaken: first, a game theoretic model of
multi-player robot soccer is developed which is as hardware independent as possible. A major
challenge arising here, namely determining optimal strategies for this model with as large
a number of robots as possible, is met by exact model reduction, i. e. an attempt is made to
find a Markov game that is as small as possible and equivalent to the original model. For
this purpose, a theoretical concept of symmetries is introduced which is based on
homomorphisms between two-player zero-sum games. The notion of symmetry thereby provides a
formal basis for symmetry reductions that have previously been applied implicitly in the
practical solution of Markov games. A useful result for model reduction is that it can be
carried out step by step and combined afterwards, which is formally expressed by the
composition of homomorphisms. Finally, a qualitatively new symmetry which interchanges the
players of a Markov game is integrated into the formalism. In particular, it is shown that
this symmetry, which does not occur in Markov decision processes, can be combined with all
symmetries encountered there.

The theoretical results on symmetries in Markov games are implemented algorithmically in
methods of dynamic programming (DP) and reinforcement learning (RL), which are furthermore
compared with each other. Such comparisons should be standard but appear to be the exception
in large parts of the RL literature. As expected, the DP methods are more efficient, which
is why the following general procedure is proposed: first, to construct an approximate model
and solve it by means of DP methods, and then to provide an RL method with this solution as
its initial value. This permits the use of both the more efficient DP methods, which solve
the approximate model with appropriate precision, and the RL methods, whose adaptivity to
the unknown real model is exploited. In this sense, the developed robot soccer model and the
computation of optimal strategies can be regarded as a solution of the first part of the
general procedure. In addition, ideas from the fields of dynamical systems and graph theory
concerning almost invariant sets are applied, among others, in the development of new
efficient algorithms. Finally, important practical challenges are identified which have to
be solved before the computed optimal strategies can be transferred to real robots.

Key Words

reinforcement learning, robotics, robot soccer, optimal strategy, symmetry, model reduction,
control theory, game theory, graph theory, dynamical system, almost invariant set,
homomorphism


Contents

1 Introduction

2 Reinforcement Learning (RL) and Game Theory
2.1 Dynamical Systems and Markov Processes
2.1.1 Basics and Problem Definitions
2.1.2 Numerical Methods
2.1.3 Complexity, Algorithmic Issues and Software
2.2 Markov Decision Processes (MDPs)
2.2.1 Basics and Problem Definitions
2.2.2 Numerical Methods
2.2.3 Complexity, Algorithmic Issues and Software
2.3 Matrix Games
2.3.1 Basics and Problem Definitions
2.3.2 Numerical Methods
2.3.3 Complexity, Algorithmic Issues and Software
2.4 Two Player Zero Sum Markov Games (2P-ZS-MGs)
2.4.1 Basics and Problem Definitions
2.4.2 Numerical Methods
2.4.3 Complexity, Algorithmic Issues and Software
2.5 General Markov Games, Differential Games, and Advanced Concepts of RL

3 Model Reduction and Symmetry
3.1 Homomorphisms and Symmetry in MDPs
3.1.1 Equivalence of MDP Homomorphisms and MDP Symmetries
3.1.2 Symmetries by Group Actions on MDPs
3.2 Homomorphisms and Symmetry in 2P-ZS-MGs
3.2.1 2P-ZS-MG Homomorphisms and Symmetry
3.2.2 Automorphisms for the Exchange of Agents

4 Supervised Learning (SL), Function Approximation, Generalisation
4.1 Introduction
4.1.1 General Approximation Results
4.1.2 Generalisation
4.1.3 Function Approximation with Automated Basis Determination
4.2 Value Iteration with SL: Convergence Result
4.3 Combination of RL and SL: Practical Results from Literature
4.3.1 MDPs
4.3.2 2P-ZS-MGs

5 Robot Soccer and Other Applications
5.1 Modeling Robot Soccer
5.1.1 General Issues of Modeling Robot Soccer
5.1.2 A Simple Multi-Player Robot Soccer Model
5.1.3 Symmetry
5.2 Numerical Results of Grid Soccer
5.2.1 Preliminaries for the Following Subsections
5.2.2 Reasoning for 2P-ZS-MG Modelling: Comparison of MDP and 2P-ZS-MG strategies
5.2.3 Relating Policies to Humanoid Soccer Characteristics
5.2.4 Comparison of DP and RL Techniques
5.2.5 Comparison of Different DP Techniques with Various Parameters
5.2.6 Comparison of Standard Methods and SL Techniques
5.2.7 Towards Multi-Player Robot Soccer: 2v2 Grid Soccer
5.3 A New Algorithm: MaG-Clus-VI
5.4 From Grid Soccer to Robot Soccer: Practical Issues
5.4.1 Lower Level behaviours
5.4.2 Image Processing and Localisation
5.5 Other Applications

6 Conclusion and Outlook

A Basics of Group Homomorphisms and Group Actions

B Bellman Equations and Iterative Linear Solvers

C The Software Package DRPOST
C.1 Introduction
C.2 Technical Aspects of Symmetry Reduction in 2P-ZS-MGs

D Detailed Tables of Numerical Results
D.1 Initial Value Functions $V_0$ and Discount Factors $\gamma$
D.2 Additional Figures and Tables for the Comparative Studies of DP and RL methods

List of Figures

List of Tables

Glossary

Bibliography

Index

Chapter 1

Introduction

After the boom of computer technology and the Internet, one of the most influential
technical revolutions of the near future can be expected in robotics. While highly specialised
robots have been widely used in production lines, the interaction of robots with other robots
or humans remains a vision. The combination of physical and mental abilities of robots is

scientifically highly challenging. The physical abilities are often referred to as lower level

abilities such as reflexes and basic motion abilities. However, it is not completely clear how

much the physical abilities are intertwined with cognitive ones. Imagine, for example, that

a human hits an obstacle by a physical movement. A negative reward signal is received in

the form of pain which causes the cognitive section to think how to avoid that pain. This

high level cognitive activity can at first be concentrated on the special situation but then be

generalised to a set of similar situations and at the end possibly result in an improvement

of the basic ability of moving, e. g. if it is generally sensible to move more slowly.

In the following, key aspects of the present work are described informally and related to
the introductory example above. In this thesis, optimal strategies of several robots, which
govern their interaction, are determined for the example of robot soccer. The highest cognitive level of strategic

planning is addressed. The above example of a human hitting an obstacle already points

to important ingredients of a mathematical description and solution methods: firstly, the

physical movement indicates that a dynamic model of an environment and some acting

agents is needed. Secondly, a kind of optimality criterion has to be employed based on

a local reward signal (pain) which could also be positive, and thirdly, after receiving any

reward an estimate of how to optimally act to maximise a positive reward while avoiding

a negative one has to be improved. A basic principle for algorithms seems to be that

some estimates of how useful a particular situation is have to be improved over time. These
estimates are later represented by a value function, and the iterative procedures are referred
to as dynamic programming (DP) and reinforcement learning (RL) methods. Finally, the

ability to abstract former knowledge of how to deal with similar situations which have not

been explicitly experienced appears to be crucial. This question is theoretically addressed

by developing a framework of exact model reduction, i. e. identifying exactly equal situ-

ations, as well as practically by incorporating approximate generalisation methods which

are introduced as supervised learning methods.

The aspects mentioned above belong to a huge variety of concepts and algorithms which
are collectively named machine learning. In the remainder of this introduction an

overview of different topics is given: goals of this thesis are presented in more detail and

several disciplines of machine learning relevant to this thesis are introduced. The areas of

supervised learning (function approximation) and reinforcement learning (optimal control)

are touched on. In addition, relations to game theory and to the concept of almost invariant

sets established in dynamical systems theory are briefly pointed out. Finally, challenges

of implementing optimal strategies on real soccer playing robots are discussed before the

introduction ends with the contributions of the author.

Goals of the Present Thesis. Since in this thesis optimal strategies for motion planning

in robot soccer are to be obtained the first goal is to design an appropriate soccer model

which is as simple as possible and as special as necessary. The simplicity aims at wide

applicability whereas a certain level of detail is necessary to provide meaningful results.

A more general goal is to provide and analyse efficient algorithms for computing optimal

strategies. Such algorithms are applicable independently from the model as long as it

belongs to a certain but broad class. A natural small subtask is the calculation of optimal

strategies especially for the robot soccer model, i. e. to give for each soccer situation (state
$s \in S$) a probability distribution specifying which action $a$ of the action space $A(s)$ to perform. This

subtask gives reason to define the development of a software package which implements all

key ideas as a further substantial goal.

Beyond designing efficient algorithms, the intent to widen applicability of the algorithms

shall also be pursued by reducing models to equivalent smaller ones. In this area, the focus

is on symmetry reduction which has to be appropriately defined for two-player zero-sum

Markov games. Furthermore, the proof of desired properties e. g. which consequences the

symmetry of a model has for the symmetry of optimal strategies is an important goal of

this thesis.
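The idea of reducing a model to an equivalent smaller one by symmetry can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the model of this thesis: a hypothetical 4 x 3 grid, two players, and a single reflection across the horizontal centre line. States that are mirror images of each other are mapped to one canonical representative, so that a value function would only have to be stored for the representatives, provided that the dynamics and rewards are invariant under the reflection.

```python
from itertools import permutations

WIDTH, HEIGHT = 4, 3  # hypothetical grid size, not taken from the thesis


def mirror(state):
    """Reflect every player position across the horizontal centre line."""
    return tuple((x, HEIGHT - 1 - y) for (x, y) in state)


def canonical(state):
    """Pick one fixed representative of the orbit {state, mirror(state)}."""
    return min(state, mirror(state))


cells = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]
# A state is an ordered pair of distinct cells: (position of A, position of B).
states = list(permutations(cells, 2))
# Mirror-image states collapse to the same canonical representative.
reduced_states = {canonical(s) for s in states}

print(len(states), len(reduced_states))  # prints "132 72"
```

The reduction is slightly less than a factor of two because states in which both players stand on the centre line are their own mirror image.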

Machine Learning, Game Theory, and Dynamical Systems

Although a different impression is often given, machine learning [139] and stochastic
control are intertwined; for example, [10, 145] discuss the relevance of control theory topics to
Artificial Intelligence (AI). In stochastic control, adaptive control is the field concerned with

algorithms which improve a sequence of decisions from experience. It is a mature discipline

for systems with smooth dynamics, see e. g. [26, 195].(1) In contrast, learning methods are

often applied to time-discrete systems and utilise special function approximation (super-

vised learning) or data preprocessing (unsupervised learning) methods. However, the most

decisive difference is the knowledge of the underlying model: in learning methods the model

can be completely unknown whereas in control theory the model typically is assumed to be

known.(2) Throughout this thesis, the author has chosen the machine learning terminology.

This should not be misunderstood as an evaluation of both approaches.

Machine Learning. There are three major types of learning (Figure 1.1) with different

degrees of freedom for the learner: completely supervised learning (SL, Chapter 4), com-

pletely unsupervised learning, and lying between these two extremes reinforcement learning

(RL, Chapter 2). In several applications these different types of learning are intertwined,

for example [193] gives a short overview of combinations of supervised, unsupervised and

reinforcement learning.

(1) In [195] a good overview of numerical methods in optimal control can also be found.
(2) By addition of system identification methods this distinction can also become blurred.

[Figure 1.1: Scheme of the main subfields of learning: unsupervised learning, supervised learning, and reinforcement learning.]

Supervised learning methods are often interpreted as generalisation methods and deal with
function approximation for a given set of samples. For example, let $f: \mathbb{R}^n \to \mathbb{R}$ be a
function and $S_{\text{input}} = \{s_1, \dots, s_m\} \subseteq \mathbb{R}^n$ a finite subset of the domain of $f$. The set of
input-output pairs $S = \{(s_k, f(s_k))\}_{k=1,\dots,m}$ is called the sample set. Then, the task of
SL methods is to find $\tilde{f} \approx f$, i. e. to minimise the difference $\|\tilde{f} - f\|$ with respect to a given
norm, using only information from the sample set. A popular choice of $\tilde{f}$ is a linear
approximation architecture $\tilde{f} = \sum_j a_j f_j$ with specified functions $f_j$ and coefficients $a_j$ to be
determined. In contrast to pure function approximation methods, some SL methods deal with noise
in the process of function evaluation, i. e. $f(s_k)$ deviates from the value assumed in the
input-output pair. Furthermore, an SL-specific topic is feature detection, which is related to a
partly automatic design of the $f_j$. The use of handcrafted features is computationally less
demanding; it is the subject of numerical studies in Chapter 5 and yielded reasonable results
even when only simple features are used.
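A linear approximation architecture can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the method used later in the thesis: the input is scalar ($n = 1$), the basis functions are the hypothetical handcrafted features $1$, $s$, $s^2$, the norm is the Euclidean norm on the sample set, and the coefficients $a_j$ are obtained from the normal equations.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix so that row operations also act on the right-hand side.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper triangular system.
    solution = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_coefficients(samples, basis):
    """Least-squares coefficients a_j of sum_j a_j f_j on the sample set."""
    gram = [[sum(fi(s) * fj(s) for s, _ in samples) for fj in basis]
            for fi in basis]
    moments = [sum(fi(s) * y for s, y in samples) for fi in basis]
    return solve_linear_system(gram, moments)


# Hypothetical example: handcrafted polynomial features and a target function
# that happens to lie in their span, so the fit recovers it up to rounding.
basis = [lambda s: 1.0, lambda s: s, lambda s: s * s]
target = lambda s: 1.0 + 2.0 * s - 0.5 * s * s
samples = [(s, target(s)) for s in (-2.0, -1.0, 0.0, 1.0, 2.0)]
coefficients = fit_coefficients(samples, basis)
print(coefficients)
```

Only the sample set enters the fit; the function `target` is used solely to generate the samples.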

Unsupervised learning methods are, like SL methods, also seen as generalisation methods, but
they differ in making use only of the distribution of the input data $S_{\text{input}}$. Thus, they can be
interpreted as input data preprocessing methods, to which e. g. clustering approaches belong.

Reinforcement learning methods do not fit directly into either of the above two categories.
They are biologically inspired by action-reward mechanisms of animal training. The
mathematical foundations are based on a combination of optimal control and stochastic
approximation methods. The basic ingredients are a state space and an action space, a
probabilistic dynamical model called the transition function which is defined with respect to a
specified time set (discrete or continuous), and a reward function which evaluates every action
in each situation and is typically non-zero only in a few situations. The goal is to determine a
behaviour that ensures a high long-term reward, which includes for example the tedious work of
figuring out which situations are the most positive ones and how these can be reached most
effectively. Effectiveness is typically measured by rewards and time (time discounting of
reward). One of the most frequently applied equations is the Bellman equation (Equations 2.48
and 2.49, [66]) which gives a rule for propagating the estimates of the usefulness of each
situation one step further in time. Many iterative methods for computing optimal strategies,
such as value iteration and Q-learning, are based on the Bellman equation.
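How the Bellman equation propagates estimates one step in time can be sketched with value iteration on a toy problem. The sketch assumes the common discounted form $V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V(s')]$, which may differ in notation from Equations 2.48 and 2.49. The four-state chain, the slip probability of 0.2, and the discount factor of 0.9 are hypothetical choices for illustration only.

```python
STATES = [0, 1, 2, 3]          # state 3 is an absorbing goal state
ACTIONS = ["left", "right"]
GAMMA = 0.9                    # hypothetical discount factor


def transition(state, action):
    """Return a dict {next_state: probability} for the toy chain."""
    if state == 3:
        return {3: 1.0}
    target = state + 1 if action == "right" else max(state - 1, 0)
    if target == state:
        return {state: 1.0}
    # The intended move succeeds with probability 0.8, otherwise the agent stays.
    return {target: 0.8, state: 0.2}


def reward(state, action, next_state):
    """Reward 1 for entering the goal, 0 everywhere else."""
    return 1.0 if (state != 3 and next_state == 3) else 0.0


def value_iteration(states, actions, gamma, tol=1e-10):
    """Apply the Bellman update to all states until the change is below tol."""
    values = {s: 0.0 for s in states}
    while True:
        new_values = {
            s: max(
                sum(p * (reward(s, a, s2) + gamma * values[s2])
                    for s2, p in transition(s, a).items())
                for a in actions
            )
            for s in states
        }
        if max(abs(new_values[s] - values[s]) for s in states) < tol:
            return new_values
        values = new_values


values = value_iteration(STATES, ACTIONS, GAMMA)
print(values)
```

The single non-zero reward at the goal is propagated backwards through the chain, so that states closer to the goal receive higher values.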

A general discussion about the use of learning paired with some criticism about current

research and an excellent overview of multi-agent learning can be found in [188].

Game Theory. In reinforcement learning approaches it is the goal to determine an

optimal behaviour according to some optimisation criterion. This behaviour is usually

interpreted as the active decision of an agent and, thus, it is natural to consider models

with several agents. This path leads directly to game theory. While game theory often has
an economic flavour, its basic question of how to act optimally
in an environment with several players, or agents, is quite general. The equally
general answer is to find a Nash equilibrium. However, special foci of classical game

theory are rivalry or conflicting aims of several players, coordination and coalitions, threat

mechanisms, partial information, games of social welfare, and so on. From a computational

point of view finding Nash equilibria can become arbitrarily hard for games with more

than two agents [41].(3) Fortunately, in robot soccer the different agents of each team

can be represented by only one game theoretic agent because the team is fully cooperative.

Furthermore, the game is of type “zero-sum” which can be interpreted as games between two

rivals: one agent wants to avoid what the other tries to achieve and vice versa. Sometimes

such a situation is called (completely) competitive because there is no potential for a

compromise. Typically, board games (backgammon, chess) or sport games (soccer, tennis)

are of this type because one agent or team of agents wins if and only if the other one loses.
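For the smallest non-trivial case, a zero-sum game with two actions per player, the equilibrium can be written down in closed form. The following sketch is an illustration under stated assumptions and not the solver used in the thesis: the payoff matrix contains the rewards of the row player (the maximiser), the column player receives the negative, and the function name is hypothetical. Larger matrix games are usually solved by linear programming instead.

```python
def solve_2x2_zero_sum(payoff):
    """Return (p, q, value) for a 2x2 zero-sum matrix game.

    p is the probability that the row player (maximiser) plays row 0,
    q is the probability that the column player (minimiser) plays column 0.
    """
    (a, b), (c, d) = payoff
    row_minima = [min(a, b), min(c, d)]
    col_maxima = [max(a, c), max(b, d)]
    maximin, minimax = max(row_minima), min(col_maxima)
    if maximin == minimax:
        # Saddle point: both players have an optimal pure strategy.
        p = 1.0 if row_minima[0] == maximin else 0.0
        q = 1.0 if col_maxima[0] == minimax else 0.0
        return p, q, float(maximin)
    # No saddle point: each player mixes so that the opponent is indifferent.
    denom = a - b - c + d
    p = (d - c) / denom
    q = (d - b) / denom
    value = (a * d - b * c) / denom
    return p, q, value


# Matching pennies: one player wins exactly what the other loses.
matching_pennies = [[1, -1], [-1, 1]]
print(solve_2x2_zero_sum(matching_pennies))  # prints "(0.5, 0.5, 0.0)"
```

In matching pennies neither player can gain by deviating from the uniform mixed strategy, which is the defining property of a Nash equilibrium.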

Connections to Dynamical Systems and Graph Theory: Almost Invariant Sets.

The update steps of synchronous dynamic programming (DP) algorithms are global, while

the updates of reinforcement learning methods are local with respect to specific stochastic

trajectories. To combine these two concepts, almost invariant sets are employed. These

characterise regions which are left by a stochastic trajectory with low probability, and thus

give information about where a typical reinforcement learning trajectory will stay for a long

time. In Section 5.3, it is shown how to exploit this knowledge to design an asynchronous

DP algorithm for which flexible trade-offs between global and local updates are possible.

The idea is to utilise an estimate of the Nash equilibrium strategies to analyse the dy-

namics of the Markov game. If all players’ strategies are fixed such that the dynamics

do not explicitly depend on the strategies, the concept is identical to that of analysing

the dynamics of dynamical systems by means of discretised transfer operators [46]. The
invariance information of the dynamics makes it possible to locate regions for repeated local
updates, namely the almost invariant sets. Finally, it seems algorithmically sensible to
alternate between global and local update steps. For computing a partition into almost
invariant sets, graph partitioning techniques are utilised.

The invariance information could have a two-fold impact: it may directly yield valuable

information to speed up numerical algorithms (e. g. value iteration) and may give – on

a meta-level – useful information about which situations in robot soccer are dynamically

linked if both teams are playing optimally.(4) An application could be to exploit subop-

timal behaviour of the other team to perform a controlled jump from a nearly invariant

component to another more advantageous one.
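The notion of an almost invariant set can be made concrete for a finite Markov chain, which is what remains of a Markov game once all strategies are fixed. The sketch below rests on assumptions that are not taken from this section: a hypothetical four-state transition matrix with two weakly coupled blocks, and the usual invariance ratio, i. e. the probability of staying in a set $A$ for one step when starting from the stationary distribution restricted to $A$. The graph partitioning step that would search for good sets is not shown.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def invariance_ratio(P, pi, subset):
    """Probability of staying in `subset` for one step, weighted by pi."""
    mass = sum(pi[i] for i in subset)
    stay = sum(pi[i] * P[i][j] for i in subset for j in subset)
    return stay / mass


# Hypothetical chain: states {0, 1} and {2, 3} form two weakly coupled blocks.
P = [
    [0.50, 0.49, 0.01, 0.00],
    [0.49, 0.50, 0.00, 0.01],
    [0.01, 0.00, 0.50, 0.49],
    [0.00, 0.01, 0.49, 0.50],
]
pi = stationary_distribution(P)
ratio_block = invariance_ratio(P, pi, [0, 1])   # close to 1: almost invariant
ratio_mixed = invariance_ratio(P, pi, [0, 2])   # much smaller: left quickly
print(ratio_block, ratio_mixed)
```

A trajectory started in the block {0, 1} leaves it with a probability of only about 0.01 per step, so repeated local updates inside this block would be a natural choice, whereas the set {0, 2} is left roughly every second step.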

(3) Communicated by Prof. Dr. Bernd Sturmfels in a mathematical colloquium of the University of Paderborn.
(4) These could be called a metastable set of situations.

Robot Soccer with Real Robots

Multi-player robot soccer is the main application as well as the standard example to
illustrate theoretical concepts throughout this thesis. The model used for numerical
computations is described in detail in Section 5.1. Here, only the most salient aspects are
mentioned. Since the model addresses the highest level of planning and does not resolve
lower levels of behaviour such as direct motor control of the robots, it is possible to design
a model that is widely applicable. In particular, the proposed model is relatively hardware
independent. Basic abilities such as walking towards a predefined location, handling the
ball by dribbling and kicking, and skills for the analysis of visual information (self and
opponent localisation) are highly non-trivial but assumed to be already available. The
only assumption about the abilities of the robots is qualitative, namely that both teams,
including every single robot, are totally equal. This is true e. g. for the AIBO league(5) but
not for all leagues of robot soccer. The consequences of the equality of all robots are that
the robots are indistinguishable and that it is at least possible for each robot to apply the
same basic abilities. The question of whether simulation results can be transferred to real
robots is answered positively by [69]. This shows that it can be valuable to compute an
optimal strategy for a non-exact model (simulation).

Reading Information and Contributions of this Thesis

In this thesis a game theoretic model of robot soccer and new algorithms for obtaining

optimal strategies within this model are developed. This results in the following structure

of the thesis, the description of which emphasises the contributions of the author:

A large portion of Chapter 2 is a special collection of common knowledge about Markov

decision processes (MDPs), reinforcement learning, and different types of games. The in-

terpretation of matrix games (Definition 2.18) in the more general context of two-player

zero-sum Markov games (2P-ZS-MGs) and the numerical error analysis of the value itera-

tion for 2P-ZS-MGs (Lemma 2.36) are contributed by the author. The latter seems to be

necessary because the solution of matrix games in the iteration step introduces numerical

errors which lead to a modified stopping criterion of the iteration scheme. The analysis

was originally developed in the context of Chapter 4 for the mathematically identical case

of function approximation by supervised learning methods, but the assumptions and

consequences are different: for standard value iteration the total error is typically affected

only slightly, whereas for function approximation the same error analysis reveals that no

guarantee on the quality of the solution can be given.

Chapter 3 deals with model reduction. Section 3.1 begins with an introduction of the

concepts of MDP homomorphisms and MDP symmetries. A new and algebraically more

precise presentation clarifies the relation between the concepts of different authors.

A key result of the author is to show that the formalisms of MDP homomorphisms

[171] and MDP symmetries [217] are equivalent (Lemma 3.3). This equivalence reveals that

the model minimisation framework of [171] with MDP options and the symmetry context

of [217] with its generalisation to multi-agent MDPs do not exclude each other but are

based on the same foundation. Additionally, the author relates the concept of the symmetry

group of MDPs by [171] to its natural algebraic framework of group actions.

A second key result is the development of a corresponding framework for the case of 2P-

ZS-MGs including all basic statements equivalent to those for MDPs in Section 3.2. This

more general framework was necessary to capture the symmetries of the robot soccer model

(5)The current development of the four legged league was influenced by the fact that the production of

the AIBO ERS-7 was stopped recently. The new name of the league is Standard Platform League (SPL)

meaning that all robots have to be equal (as before) but additionally a new two legged humanoid robot

called NAO is introduced as a substitute for the AIBO “dog”.

developed in Chapter 5. It can be expected that 2P-ZS-MGs are one of the most general

types of models which include MDPs and for which such statements are true.

Besides the symmetries of MDPs, a qualitatively new symmetry that results from exchang-

ing the two players can be exploited. Practically, such symmetries have been used within

different board games by the argument that exchanging the players in a zero-sum game

has to result in a multiplication of the value by −1. One example is given by the pioneer

Samuel for checkers [179]. The present work lays a formal foundation for this argument

and, more importantly, shows that this exchange of players is also compatible with the

other standard symmetries similar to that of MDPs (Proposition 3.24) and that symmetry

reduction can be performed stepwise.

The key result of Chapter 4 which is devoted to the combination of reinforcement learning

and supervised learning is Lemma 4.2 and its implications. It clarifies that unless a function

approximation architecture is very accurate care is needed if the quality of the solutions is

to be provable. Nevertheless, a collection of successfully applied results of other authors is

provided. Corresponding results of the author can be found in Section 5.2.6.

Finally, Chapter 5 provides detailed information about the model and the numerical

results. The multi-player grid soccer model described in Section 5.1 is based on [112] but

goes far beyond it: the generalisation to several agents per team makes it necessary to change

from a one-agent “blocking dynamic” to a more permeable one and makes it possible to introduce

passes between different agents of the same team. Furthermore, the author believes that

the model is very well suited to the study of multi-agent systems: since both the grid size

and the number of agents can be varied, the effects of large state spaces caused by a high

grid resolution can be compared with those caused by multiple agents on a low-resolution

grid.

Section 5.2 contains the numerical results and possesses a rich substructure. Some of the

studies provide strong arguments for the choice of 2P-ZS-MG instead of MDP models

in robot soccer (Section 5.2.2), others compare dynamic programming and reinforcement

learning techniques (Section 5.2.4) with the result that exact and model exploiting dy-

namic programming techniques should be preferred whenever possible. After these eval-

uative numerical results which appear to be natural but are surprisingly not standard in

literature the studies of dynamic programming methods are intensified in Section 5.2.5

and the dependency of convergence speed on different types of methods and parameters

is numerically analysed. This includes comparative studies of symmetry reduced models

with their unreduced counterparts which is the practical application of the theoretical re-

sults in Chapter 3. Particularly interesting from a practical point of view is the “max-min

convergence boosting phenomenon” which only seems to be present in 2P-ZS-MGs but

not in MDPs. As mentioned above, results with supervised learning (Section 5.2.6) and

general technical issues of applying strategies to real robots (Section 5.4)(6), especially of

type AIBO ERS-7, follow.

Chapter 6 concludes and points to future work and the appendix contains a short

description of group actions (Appendix A), comments on the relations of the iterative

solution of the Bellman equation to iterative linear solvers (Appendix B), an introduction

to the software package DRPOST developed by the author (Appendix C), and additional

material (Appendix D) omitted in the main part.

(6)These technical issues were not discovered by the author but independently confirmed.

Chapter 2

Reinforcement Learning (RL) and

Game Theory

Contents

2.1 Dynamical Systems and Markov Processes

2.1.1 Basics and Problem Definitions

2.1.2 Numerical Methods

2.1.3 Complexity, Algorithmic Issues and Software

2.2 Markov Decision Processes (MDPs)

2.2.1 Basics and Problem Definitions

2.2.2 Numerical Methods

2.2.3 Complexity, Algorithmic Issues and Software

2.3 Matrix Games

2.3.1 Basics and Problem Definitions

2.3.2 Numerical Methods

2.3.3 Complexity, Algorithmic Issues and Software

2.4 Two Player Zero Sum Markov Games (2P-ZS-MGs)

2.4.1 Basics and Problem Definitions

2.4.2 Numerical Methods

2.4.3 Complexity, Algorithmic Issues and Software

2.5 General Markov Games, Differential Games, and Advanced Concepts of RL

The goal of machine learning is to design a software

that has more abilities than its programmer in the end.

inspired by Arthur L. Samuel (1901-1990)

A very general framework to which this thesis contributes is that of multi-agent systems

(MASs) [64] and more specifically to multi-agent learning [188]. To cut a long story short,

the basic ingredients of MASs are a number of agents(1) which interact with an environ-

(1)Agents are not restricted to physical agents like humans or robots. Conversely, if a robot is not capable

of making decisions then it is part of the environment and does not count as an agent.

ment and perceive some information from the environment. The role and (hierarchical)

structures of agents are also discussed in the MAS context.

[Figure 2.1 (diagram not reproduced): an environment and three agents, each with a decision component; the connections are labelled perception, interaction, and comm.]

Figure 2.1: Scheme of the main components of multi-agent systems. Multiple agents

interact with an environment and partially perceive information from the environment. The

agents are allowed to communicate and make some decisions for their future interactions.

Bowling comments on the usefulness of the MAS context as a general framework as follows

[21]:

Frameworks are models of reality. As such, they are an important foundation

for the generation and evaluation of new ideas. They establish the rules of the

game, crystallise the core issues, provide a common basis of study, make intrinsic

assumptions visible, provide a general perspective on large classes of problems, help

to categorise the variety of solutions, and allow comparison with other models of

reality. It is with all of these reasons in mind that we begin this work by introducing

a framework for multiagent learning.

The class of stochastic games is a more specific framework than the general MASs.(2)

These define the guidelines of this work and are rarely used for reinforcement learning

tasks – exceptions are [21, 98, 112] the first of which also inspired the structure of this

section. Stochastic games are a generalisation of Markov decision processes (MDPs) and

matrix games. The first is the research focus of reinforcement learning (RL) and dynamic

programming (DP), the second is a major topic of game theory with applications in eco-

nomic science. In the remainder of this section, basic ingredients are defined and some

well-known results about the solution of basic problems are presented.

More precisely, this section is organised by increasing model complexity as follows: First,

Section 2.1 starts with a brief overview of dynamical systems which provide the framework

for systems without any decision maker. Nevertheless, fixing the policies of all agents of

an MAS results in a (stochastic) dynamical system and hence some basic insights from

this mathematical discipline may serve as inspiration for Markov games. Even more of

interest, a deterministic continuous dynamical system can also be approximated by a

stochastic discrete dynamical system [46] – or in other words an MDP without a decision

maker(3). We proceed by introducing MDPs in Section 2.2 (one decision maker), matrix

games (two adversary decision makers) in Section 2.3, and two-player zero-sum Markov

games in Section 2.4 (also two adversary decision makers). In the last section, an overview

of more advanced concepts which are only partly relevant to this work is given. This

(2)An alternative (exhaustive) framework for continuous motion planning problems can be found in [104].

However, practical application seems limited at the moment to special cases.

(3)A stochastic discrete dynamical system is often called Markov process or Markov chain.

includes semi-Markov decision processes (S-MDPs) and hierarchical approaches. In each

subsection aspects of history, notation, and the basic problems are provided. Summarising,

this chapter is the foundation of this thesis.

2.1 Dynamical Systems and Markov Processes

Dynamical systems are a very general concept and include concepts for discrete and con-

tinuous systems as well as for deterministic and stochastic ones which are also called ran-

dom dynamical systems. Many concepts were developed to analyse emerging properties

of deterministic dynamical systems such as equilibrium and periodic solutions, symbolic

dynamics, chaos, and relations between numerics and dynamical systems. A special goal

of dynamical systems theory is to provide statements about the asymptotic long-term be-

haviour of a given system. Additionally, as will be pointed out in the following, extracting

statistical information of (deterministic) dynamical systems is interesting for many sys-

tems and leads to relations of deterministic continuous dynamical systems and stochastic

discrete ones (see also footnote (3)) by so-called transfer operators.

As mentioned above, dynamical systems are considered in the beginning of Chapter 2 be-

cause they describe situations where no agent, decision maker, or controller is involved.

These three terms are synonymously used by different communities. However, a (stochas-

tic) dynamical system arises if one considers the policies of all agents of an MAS being

fixed. The ideas of the following short overview are mainly borrowed from [46]. In ad-

dition to this article, a more detailed description and further references of approximating

deterministic dynamical systems by Markov processes can be found in [48, 49, 50]. A less

specific introduction to the broad area of dynamical systems may be obtained by reading

the textbooks [25, 73, 153] the last of which contains many examples.

Historical Remarks. Following [153], early work of dynamical systems was motivated by

the aim of predicting the non-linear dynamics of the n-body problem in celestial mechanics.

A pioneer in this field was Poincaré who used perturbation methods and employed a geo-

metrical qualitative point of view. Further major contributions were achieved by Arnold,

Birkhoff, Bowen, Duffing, Kolmogorov, Krylov, Lorenz, Lyapunov, Moser, Pontryagin,

Rayleigh, Ruelle, Smale, and many others. Cartwright and Littlewood [28] observed chaos

which was also discovered by Lorenz studying weather dynamics [117] and separately by

Smale who introduced the horseshoe map [192].

2.1.1 Basics and Problem Definitions

After a short introduction to deterministic dynamical systems and invariant sets, Markov

processes and some concepts to approximate the statistics of dynamical systems by means

of transfer operators will be introduced.

Deterministic Dynamical Systems

The first definition of dynamical systems seems very abstract but later more specific aspects

will be discussed.

2.1 Definition (Dynamical System [25])

Let $X$ be a non-empty set and $\{\varphi^t : X \to X \mid t \in K\}$ a one-parametric family of maps with

parameter $t$ and $K = \mathbb{R}$ or $K = \mathbb{Z}$. If this family is a one-parametric group, that means

$\varphi^{t+s} = \varphi^t \circ \varphi^s$ and $\varphi^0 = \mathrm{id}_X$ is the identity map, then $(X, \varphi)$ is called a time-continuous

(autonomous) dynamical system if $K = \mathbb{R}$ and a time-discrete one if $K = \mathbb{Z}$. $\varphi$ is the flow

of the dynamical system.

In an irreversible system the time may be restricted to $t \ge 0$, resulting in a semigroup

instead of a group structure. For a dynamical system $(X, \varphi)$ and $x \in X$ the set

$O^+_\varphi(x) = \bigcup_{t \ge 0} \varphi^t(x)$ is called the positive semi-orbit, $O^-_\varphi(x) = \bigcup_{t \le 0} \varphi^t(x)$ the negative semi-orbit,

and $O_\varphi(x) = O^+_\varphi(x) \cup O^-_\varphi(x) = \bigcup_t \varphi^t(x)$ the orbit or trajectory of $x$ under the flow $\varphi$. If the context

is clear then $\varphi$ may be omitted.

2.2 Definition (Invariant Set)

The set $A \subseteq X$ is called $\varphi$-invariant if $\forall t: \varphi^t(A) \subseteq A$. If a dynamical system is not

time-invertible then $\varphi^{-t}(A) = (\varphi^t)^{-1}(A) = \{x \in X : \varphi^t(x) \in A\}$ is the set of preimages

of $A$ under $\varphi^t$. A forward invariant set possesses the invariance property for all $t \ge 0$, a

backward invariant set for all $t \le 0$.

Dynamical systems and differential equations. According to [25], flows of dynamical

systems occur naturally in autonomous differential equations of the first order, that is,

equations of the form

$$\dot{x} = f(x), \tag{2.1}$$

where $x \in X = \mathbb{R}^n$, $f : \mathbb{R}^n \to \mathbb{R}^n$ is a continuously differentiable function ($f \in C^1$), and

$\dot{x} = \frac{dx}{dt}$. For each $x \in X$ there exists a unique solution $\varphi^t(x)$ with $\varphi^0(x) = x$ which is defined

for all $t \in U_\varepsilon(0)$ in a neighbourhood of $0$. If the solution exists for all $t$ (which is assumed

in the following and is true e. g. for bounded $f$) then for fixed $t$ a $C^1$-diffeomorphism of

$\mathbb{R}^n$ is given by the map $x \mapsto \varphi^t(x)$. Because the differential equation is autonomous,

$\varphi^{t+s}(x) = \varphi^t(\varphi^s(x))$ holds, meaning that $\varphi^t$ is a flow. Conversely, for a given differentiable flow

$\varphi^t : \mathbb{R}^n \to \mathbb{R}^n$ the corresponding differential equation reads

$$\dot{x} = \left. \frac{d}{dt} \varphi^t(x) \right|_{t=0} \tag{2.2}$$

because if Equation 2.1 is integrable a solution $x(t)$ fulfils $\varphi^t(x(0)) = x(t) = x(0) + \int_0^t f(x(s)) \, ds$.
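As a small illustration of the relation between a differential equation and its flow, consider the linear equation $\dot{x} = -x$, whose flow is known in closed form, $\varphi^t(x) = x e^{-t}$. The following Python sketch checks the group property and compares the closed-form flow with a simple explicit Euler integration. The example system and all function names are assumptions made here for illustration and are not taken from the thesis.

```python
import math

def f(x):
    """Right-hand side of the example ODE x' = -x (an illustrative choice)."""
    return -x

def flow(t, x):
    """Closed-form flow of x' = -x: phi^t(x) = x * exp(-t)."""
    return x * math.exp(-t)

def euler_flow(t, x, steps=10_000):
    """Approximate phi^t(x) by explicit Euler steps of size t / steps."""
    h = t / steps
    for _ in range(steps):
        x = x + h * f(x)
    return x

x0, t, s = 2.0, 0.7, -0.3
# Group property: phi^(t+s) = phi^t o phi^s and phi^0 = id.
assert math.isclose(flow(t + s, x0), flow(t, flow(s, x0)))
assert flow(0.0, x0) == x0
# The numerical integration approaches the closed-form flow.
assert abs(euler_flow(t, x0) - flow(t, x0)) < 1e-4
```

For systems without a closed-form solution only the numerical variant is available, and the group property then holds only up to the integration error.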

Transfer Operators, Almost Invariant Sets, and Markov Processes

This paragraph is based on [46] and considers time-discrete dynamical systems of the

form

$$x_{k+1} = f(x_k) \tag{2.3}$$

with $k \in \mathbb{N}_0$ and $f : X \to X$ for a compact $X$. An important class of

examples are the time-$T$ maps (for fixed $T \in \mathbb{R}$) of a time-continuous dynamical system

(Equation 2.1). If the global dynamical behaviour of a given system is of interest then a

useful method is to employ the transfer operator or Perron-Frobenius operator associated

with $f$. It describes the evolution of signed measures on $X$.

The transfer operator or Perron-Frobenius operator of a time-discrete dynamical system $f$

is the linear operator $P : \mathcal{M} \to \mathcal{M}$ defined for all measurable sets $S$ by

$$(P\nu)(S) = \nu(f^{-1}(S)) \tag{2.4}$$

where $\mathcal{M}$ is the space of signed measures on the Borel $\sigma$-algebra over $X$. For example,

if $\nu$ is a uniformly distributed probability measure on a set $S_1 \subseteq X$ (meaning uniformly

distributed on $S_1$ and $0$ elsewhere) then $(P\nu)(S_2)$ is exactly the transition probability

$T(S_1, S_2)$ of Equation 2.5 because $\nu(S_1) = 1$. In contrast to the transition function $T$

of Section 2.2, which maps only states and actions to probabilities, the transfer operator

$P$ directly maps a probability distribution of inputs (states at time $t$) to a probability

distribution of outputs (states at time $t + 1$). In the context of dynamical systems the

case of a measure $\mu$ which is a fixed point of the transfer operator is considered, i. e.

$\mu$ is invariant with respect to $f$. Then for all measurable sets $S$ the following holds:

$\mu(S) = \mu(f^{-1}(S)) = (P\mu)(S)$. Furthermore, an additional standard assumption in this

context is that the measure is robust under small random perturbations leading to a unique

SRB measure (Sinai, Ruelle, Bowen).
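On a finite state space the transfer operator can be written down directly, which may help to make Equation 2.4 concrete. The following Python sketch uses a hypothetical map on six points (chosen here for illustration, not taken from the thesis), pushes the uniform measure forward, and checks that the image measure is a fixed point of the operator.

```python
from fractions import Fraction

X = range(6)

def f(x):
    """Illustrative map on X = {0, ..., 5}: x -> 2x mod 6 (not taken from the thesis)."""
    return (2 * x) % 6

def transfer(nu):
    """Apply the transfer operator: (P nu)({y}) = nu(f^{-1}({y})) for every point y."""
    image = {y: Fraction(0) for y in X}
    for x in X:
        image[f(x)] += nu[x]
    return image

def measure(nu, S):
    """Evaluate the measure nu on a subset S of X."""
    return sum((nu[x] for x in S), Fraction(0))

uniform = {x: Fraction(1, 6) for x in X}
pushed = transfer(uniform)

S = {0, 1}
preimage = {x for x in X if f(x) in S}
# Equation 2.4: (P nu)(S) equals nu(f^{-1}(S)).
assert measure(pushed, S) == measure(uniform, preimage)
# The uniform measure is not invariant here, but its image is a fixed point of P.
assert pushed != uniform
assert transfer(pushed) == pushed
```

Exact rational arithmetic is used so that the fixed point property can be checked with equality instead of a tolerance.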

The next step to Markov decision processes (MDPs) and to almost invariant sets is to

define the transition probability $T$ from a measurable set $S_1$ with $\mu(S_1) \neq 0$ to a second

measurable set $S_2$ by

$$T(S_1, S_2) = \frac{\mu(S_1 \cap f^{-1}(S_2))}{\mu(S_1)}. \tag{2.5}$$

The transition probability from a set $S_1$ with non-zero measure into itself is called the

invariance ratio $T_{\mathrm{inv}}(S_1) = T(S_1, S_1)$. If $T_{\mathrm{inv}}(S_1) \ge 1 - \varepsilon$ for $\varepsilon \in [0, 1]$ the set $S_1$ is called

$(1-\varepsilon)$-invariant which for $\varepsilon = 0$ turns into pure invariance as in Definition 2.2 – neglecting

sets of measure zero. If the exact value of $\varepsilon$ is not in the focus of interest the imprecise

term almost invariant set will be used. Particularly interesting – although elementarily

obtainable – is the fact that for $f$-invariant $\mu$ the complement of a $(1-\varepsilon)$-invariant

set $S$ is almost invariant with invariance ratio

$$T_{\mathrm{inv}}(X \setminus S) \ge 1 - \varepsilon \cdot \frac{\mu(S)}{\mu(X \setminus S)}. \tag{2.6}$$

If $\mu(S) \le \frac{1}{2}\mu(X)$ then $\mu(S) \le \mu(X \setminus S)$ and hence $T_{\mathrm{inv}}(X \setminus S) \ge 1 - \varepsilon$, i. e. the

complementary set is also at least $(1-\varepsilon)$-invariant. This motivates a successive algorithmic

way of hierarchically splitting up almost invariant sets into almost invariant subsets, and

hence iteratively constructing a partition. A partition $\mathcal{P}(X)$ of the state space $X$ of a

dynamical system (or any arbitrary set $X$) is a finite collection of pairwise disjoint subsets

$P_i \subseteq X$, $i \le N \in \mathbb{N}$, with $\mu(P_i) > 0$ and the covering property $X = \bigcup_{i=1}^{N} P_i$. A partition

$\mathcal{P}(X) = \{P_1, \dots, P_N\}$ is called $(1-\varepsilon)$-invariant if

$$\min_{i=1,\dots,N} T_{\mathrm{inv}}(P_i) \ge 1 - \varepsilon. \tag{2.7}$$

In contrast to other types of decomposition (e. g. ergodic components) the decomposition of

a state space Xof a dynamical system into a partition of almost invariant sets is not unique.

The reason is that slightly changing a partition of almost invariant sets will typically result

in only a minor change of the invariance ratios of the corresponding sets.

2.3 Problem (Partitioning into Almost Invariant Sets)

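For a dynamical system $(X, \varphi)$ and fixed $N \in \mathbb{N}$ find a partition $\mathcal{P}(X) = \{P_1, \dots, P_N\}$ that

maximises the average invariance

$$T_{\mathrm{inv}}(\mathcal{P}(X)) = \frac{1}{N} \sum_{i=1}^{N} T_{\mathrm{inv}}(P_i). \tag{2.8}$$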

Relation to Markov chains, MDPs, and RL

A Markov chain with a finite number of states can be obtained by choosing a finite partition(4)

$\mathcal{P}(X) = \{P_1, \dots, P_N\}$ and defining transition probabilities according to $T(P_i, P_j)$.

This Markov chain can also be considered as an MDP with a finite number of states where

a Markovian strategy of the decision maker is fixed.

In Section 5.3 almost invariant sets build a bridge from the classical dynamic programming

algorithms to the reinforcement learning algorithms by combining the global nature of the

first methods with the locality of the second ones. The idea is to perform only a few updates

on the whole state space and then to restrict the updates of dynamic programming methods

to sets with a high invariance ratio(5) (under some fixed strategy of the decision makers),

and to again perform a few global updates and so on.

2.1.2 Numerical Methods

A set oriented numerical approach to compute partitions $\mathcal{P}(X)$ which preferably consist

of almost invariant sets is described below. The presentation includes the numerical approximation

of sets in general, an approximation of the transfer operator, some different

versions of partitioning problems, and their transformation to a graph based context.

Discretisation of Transfer Operators

To deal with transfer operators in a numerical way a finite dimensional discretisation of the

operator must be introduced. As mentioned above, more detailed information is available

e. g. in [46, 48, 49, 50]. The basic idea is to first approximate the state space $X$ by a

hierarchically constructed sequence of collections of generalised rectangles ($d$-dimensional boxes) and

then define the discretised transfer operator by the transition probabilities between these boxes.

One numerical scheme for approximating a space $X$ by a finite collection of boxes is the

so-called subdivision algorithm [47], which works as follows:

Starting with an initial box $B_0 \supseteq X$, a sequence of collections of boxes $(\mathcal{B}_i)_{i \in \mathbb{N}_0}$ with

$\mathcal{B}_0 = \{B_0\}$ is constructed by repeating the following two steps iteratively until some

stopping criterion is fulfilled. The $(i+1)$-th iteration reads:

1.) Refinement: Given $\mathcal{B}_i$, construct a finer collection of boxes $\tilde{\mathcal{B}}_{i+1}$ by subdividing each

box $B \in \mathcal{B}_i$ along one coordinate axis into two halves.

2.) Selection: Given $\tilde{\mathcal{B}}_{i+1}$, construct $\mathcal{B}_{i+1}$ by keeping all boxes which intersect with the

state space $X$, i. e. $\mathcal{B}_{i+1} = \{B \in \tilde{\mathcal{B}}_{i+1} : B \cap X \neq \emptyset\}$.
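The two steps of the subdivision algorithm can be sketched in a few lines of Python. The sketch assumes that $X$ is the closed unit disk in the plane, for which the intersection test can be done exactly, and that the initial box covers $X$. The choice of $X$, the cyclic choice of the bisection axis, and all function names are assumptions made for this illustration and are not claimed to be the implementation of [47].

```python
def subdivide(box, axis):
    """Split a box (list of (lo, hi) intervals) into two halves along one axis."""
    lo, hi = box[axis]
    mid = 0.5 * (lo + hi)
    left = box[:axis] + [(lo, mid)] + box[axis + 1:]
    right = box[:axis] + [(mid, hi)] + box[axis + 1:]
    return [left, right]

def intersects_unit_disk(box):
    """Exact intersection test for the closed unit disk: clamp the origin to the box."""
    nearest = [min(max(0.0, lo), hi) for lo, hi in box]
    return sum(c * c for c in nearest) <= 1.0

def subdivision_algorithm(initial_box, intersects, iterations):
    """Run a fixed number of refinement/selection iterations (the stopping criterion)."""
    collection = [initial_box]
    dim = len(initial_box)
    for i in range(iterations):
        # 1.) Refinement: bisect every box along a cyclically chosen axis.
        refined = [half for box in collection for half in subdivide(box, i % dim)]
        # 2.) Selection: keep only the boxes that intersect X.
        collection = [box for box in refined if intersects(box)]
    return collection

boxes = subdivision_algorithm([(-2.0, 2.0), (-2.0, 2.0)], intersects_unit_disk, iterations=8)
# Total area of the covering; it can never be smaller than the area of the disk.
area = sum((hi0 - lo0) * (hi1 - lo1) for (lo0, hi0), (lo1, hi1) in boxes)
```

After eight iterations every remaining box has side length 0.25, and the covering shrinks towards the disk as the number of iterations grows.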

Each of the two steps entails a consideration of how to perform it numerically. The first

step can be done directly or be modified by selecting only some of the boxes for subdivision.

Whether the second step is complicated or not depends on the set $X$ and its mathematical

description: in general, i. e. for all $X$, it cannot be expected that the condition

$B \cap X \neq \emptyset$ can be checked exactly and efficiently.(6) Finally, after $i_{\mathrm{end}}$ steps a stopping

criterion is fulfilled, and the output of the above algorithm is a finite collection of boxes $\mathcal{B}_{i_{\mathrm{end}}} =$

(4)This partition may be much finer than the almost invariant sets.

(5)Neglecting the evolution of the strategy, an RL trajectory would typically stay in an almost invariant

set for a long time.

(6)For example, for a choice of representative test points $p_k$ on the box boundary, a test whether $p_k \in X$ can be

performed.

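$\{B_{1, i_{\mathrm{end}}}, B_{2, i_{\mathrm{end}}}, \dots, B_{N, i_{\mathrm{end}}}\}$. These fulfil the following criteria: $X \subseteq \bigcup_{k=1}^{N} B_{k, i_{\mathrm{end}}}$ and

$\mu_{\mathrm{Leb}}(B_{k, i_{\mathrm{end}}} \cap B_{l, i_{\mathrm{end}}}) = 0$ holds for all $k \neq l$, where $\mu_{\mathrm{Leb}}$ is the Lebesgue measure.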

For the approximation of the transfer operator we simplify notation by writing $B_k$ instead

of $B_{k, i_{\mathrm{end}}}$. Hence, the starting point is now a collection of boxes $\mathcal{B} = \{B_1, \dots, B_N\}$ which

approximates the set $X$ e. g. in the sense that some stopping criterion of the subdivision

algorithm is fulfilled. The discretisation of the transfer operator is carried out by

replacing the space of signed measures $\mathcal{M}$ by $\mathcal{M}_{\mathcal{B}}$, which is the finite-dimensional

space of signed measures on the Borel $\sigma$-algebra generated by the collection of boxes $\mathcal{B}$. If

the collection $\mathcal{B}$ forms a partition (i. e. pairwise intersections being empty instead of being

of zero measure) then the generated $\sigma$-algebra contains only arbitrary unions of boxes

$B_k \in \mathcal{B}$.(7) A standard basis for the vector space $\mathcal{M}_{\mathcal{B}}$ is the set of $N$ measures which

uniformly assign the measure $1$ to one fixed box $B_i \in \mathcal{B}$ and $0$ to all other boxes. The

transfer operator with respect to this basis, $P_{\mathcal{B}} : \mathcal{M}_{\mathcal{B}} \to \mathcal{M}_{\mathcal{B}}$, is given by the following

matrix of transition probabilities:(8)

$$P_{\mathcal{B}} = (p_{ij})_{ij} = \left( \frac{\mu_{\mathrm{Leb}}(B_j \cap f^{-1}(B_i))}{\mu_{\mathrm{Leb}}(B_j)} \right)_{1 \le i \le N,\ 1 \le j \le N} \in \mathbb{R}^{N, N}. \tag{2.9}$$

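The measure of a box is straightforward to compute; the measure of the intersection, however,

has to be approximated by numerical methods. [45] describes some possibilities; a

widespread method is the so-called Monte Carlo approach, which is described in [85] and

means

$$\mu_{\mathrm{Leb}}(B_j \cap f^{-1}(B_i)) \approx \frac{1}{K} \sum_{k=1}^{K} \chi_{B_i}(f(x_k)) \tag{2.10}$$

with $x_k$ being uniformly selected at random from $B_j$. This simply means that one has

to select $K$ random points in $B_j$ and check whether the image $f(x_k)$ is in $B_i$. Strictly

speaking, the right-hand side estimates the fraction of $B_j$ that is mapped into $B_i$, i. e. the

entry $p_{ij}$ of Equation 2.9; multiplying by $\mu_{\mathrm{Leb}}(B_j)$ gives the measure itself. There are

efficient numerical techniques for performing this task [48, 47].(9)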
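The Monte Carlo estimate of the transition matrix can be illustrated on a one-dimensional example. The Python sketch below assumes the doubling map $f(x) = 2x \bmod 1$ on $[0, 1)$ with four boxes of equal length. The map, the number of boxes, the sample size, and all function names are illustrative assumptions and not the setup of the thesis. The invariant measure is approximated by a simple power iteration for the eigenvector belonging to the eigenvalue 1.

```python
import random

def doubling_map(x):
    """Illustrative map f(x) = 2x mod 1 on X = [0, 1) (not taken from the thesis)."""
    return (2.0 * x) % 1.0

def box_index(x, n_boxes):
    """Index of the box [i/n, (i+1)/n) containing x."""
    return min(int(x * n_boxes), n_boxes - 1)

def monte_carlo_matrix(f, n_boxes, samples, rng):
    """Estimate p[i][j], the probability of moving from box j to box i."""
    p = [[0.0] * n_boxes for _ in range(n_boxes)]
    width = 1.0 / n_boxes
    for j in range(n_boxes):
        for _ in range(samples):
            x = (j + rng.random()) * width      # uniform test point in box j
            p[box_index(f(x), n_boxes)][j] += 1.0 / samples
    return p

def invariant_vector(p, iterations=200):
    """Power iteration for the eigenvector of p belonging to the eigenvalue 1."""
    n = len(p)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(p[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        v = [value / total for value in v]
    return v

P = monte_carlo_matrix(doubling_map, n_boxes=4, samples=20_000, rng=random.Random(0))
mu = invariant_vector(P)
```

For the doubling map each box is mapped onto exactly two boxes, so the exact entries of each column are 0.5 or 0, and the invariant measure is uniform. The estimated values should be close to these, up to sampling error.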

2.4 Problem (Discretised Partitioning into Almost Invariant Sets (I))

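For a dynamical system $(X, \varphi)$, a set of boxes $\mathcal{B} = \{B_1, \dots, B_L\}$, and fixed $N \in \mathbb{N}$ find a

partition of boxes $\mathcal{P}(X) = \{P_1, \dots, P_N\}$ (i. e. for each $i$ there exists an index set $K_i \subseteq \mathbb{N}$

such that $\{K_1, \dots, K_N\}$ forms a partition of $\{1, \dots, L\}$, and $P_i = \bigcup_{k \in K_i} B_k$ with each

$B_k \in \mathcal{B}$) that maximises the average invariance

$$T_{\mathrm{inv}}(\mathcal{P}(X)) = \frac{1}{N} \sum_{k=1}^{N} T_{\mathrm{inv}}(P_k) = \frac{1}{N} \sum_{k=1}^{N} \frac{\sum_{B_i, B_j \subseteq P_k} \mu_{\mathrm{Leb}}(B_j) \cdot p_{ij}}{\sum_{B_j \subseteq P_k} \mu_{\mathrm{Leb}}(B_j)}. \tag{2.11}$$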

Interpretation in Terms of Graphs

As pointed out in [46] Problem 2.4 can be interpreted as finding an optimal cut of a graph.

Given a dynamical system $(X, \varphi)$ and a partition of the state space by boxes $\mathcal{B} = \mathcal{P}(X)$(10),

(7)To handle intersections of measure zero one may define equivalence classes and equality by identifying

all sets of zero measure with the empty set.

(8)$p_{ij}$ is the transition probability from box $B_j$ to box $B_i$.

(9)The fixed point of $P$ describing the $f$-invariant measure can be obtained in a discretised version as

the eigenvector of $P_{\mathcal{B}}$ for the eigenvalue $1$.

(10)If $X$ cannot be written as a finite union of boxes, the construction of a box approximation of $X$ as

described above may be necessary.

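one defines the transition graph $G = (V, E)$ as a weighted directed graph with vertex set

$V = \mathcal{B}$ and edge set

$$E = \{(B_1, B_2) \in \mathcal{B} \times \mathcal{B} : B_1 \cap f^{-1}(B_2) \neq \emptyset\}. \tag{2.12}$$

The condition $f(B_1) \cap B_2 \neq \emptyset$ is equivalent to $B_1 \cap f^{-1}(B_2) \neq \emptyset$ even if $f$ is not invertible

(bijective). The function of vertex weights is defined by

$$v : V \to \mathbb{R}, \quad v(B_i) = \mu_{\mathrm{Leb}}(B_i), \tag{2.13}$$

and the function of edge weights by

$$e : E \to \mathbb{R}, \quad e((B_i, B_j)) = \mu_{\mathrm{Leb}}(B_i) \cdot p_{ji}. \tag{2.14}$$

There also exists an undirected version of the transition graph, namely $\tilde{G} = (V, \tilde{E})$ with

the modified edge set

$$\tilde{E} = \{(B_1, B_2) \in \mathcal{B} \times \mathcal{B} : (B_1 \cap f^{-1}(B_2)) \cup (B_2 \cap f^{-1}(B_1)) \neq \emptyset\}. \tag{2.15}$$

Accordingly, the function of edge weights has to be symmetrised with respect to $i$ and $j$,

yielding

$$\tilde{e} : \tilde{E} \to \mathbb{R}, \quad \tilde{e}((B_i, B_j)) = \mu_{\mathrm{Leb}}(B_i) \cdot p_{ji} + \mu_{\mathrm{Leb}}(B_j) \cdot p_{ij}. \tag{2.16}$$

Note that by construction the total edge weight of the directed and the undirected transition

graph is equal.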

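To measure the degree of invariance of a subset of vertices $S \subseteq V$, external and internal

costs are defined. For two subsets of vertices $A, B \subseteq V$ the following notation is introduced:

$$E_{A,B} = \frac{1}{2} \sum_{B_i \in A,\, B_j \in B} \left( \mu_{\mathrm{Leb}}(B_i) \cdot p_{ji} + \mu_{\mathrm{Leb}}(B_j) \cdot p_{ij} \right), \tag{2.17}$$

which simplifies for $A = B$ to $E_{A,A} = \sum_{B_i, B_j \in A} \mu_{\mathrm{Leb}}(B_i) \cdot p_{ji}$. Now, for $S \subseteq V$ the internal

costs are defined as

$$C_{\mathrm{int}}(S) = \frac{E_{S,S}}{\mu_{\mathrm{Leb}}(S)}, \tag{2.18}$$

and the external costs as

$$C_{\mathrm{ext}}(S) = \frac{E_{S,\bar{S}}}{\mu_{\mathrm{Leb}}(S) \cdot \mu_{\mathrm{Leb}}(\bar{S})} \tag{2.19}$$

where $\bar{S} = V \setminus S$ is the complement of $S$. Based on the previous definitions, for a partition

of the vertices $\mathcal{P}(V) = \{P_1, \dots, P_N\}$ the analogous quantities are the internal costs

$$C_{\mathrm{int}}(\mathcal{P}(V)) = \frac{1}{N} \sum_{i=1}^{N} C_{\mathrm{int}}(P_i) \tag{2.20}$$

and the external costs

$$C_{\mathrm{ext}}(\mathcal{P}(V)) = \frac{\sum_{1 \le i < j \le N} E_{P_i, P_j}}{\prod_{i=1}^{N} \mu_{\mathrm{Leb}}(P_i)}. \tag{2.21}$$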
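The cost functions can be evaluated by hand on a small example. The Python sketch below assumes four boxes of equal measure and a made-up transition matrix with two nearly decoupled blocks. The data and function names are illustrative assumptions, and the code follows Equations 2.17 to 2.19 as reconstructed above.

```python
from fractions import Fraction

# Illustrative data (not from the thesis): four boxes of equal measure and a
# column-stochastic matrix p[i][j] = probability of moving from box j to box i.
mu = [Fraction(1, 4)] * 4
p = [[Fraction(n, 10) for n in row] for row in
     [[5, 4, 1, 0],
      [4, 5, 0, 1],
      [1, 0, 5, 4],
      [0, 1, 4, 5]]]

def E(A, B):
    """Equation 2.17: symmetrised flow between the vertex sets A and B."""
    return sum(mu[i] * p[j][i] + mu[j] * p[i][j] for i in A for j in B) / 2

def measure(S):
    """Total vertex weight of the set S."""
    return sum(mu[i] for i in S)

def c_int(S):
    """Equation 2.18: internal costs of S."""
    return E(S, S) / measure(S)

def c_ext(S, all_vertices):
    """Equation 2.19: external costs of S, with the complement taken in all_vertices."""
    complement = set(all_vertices) - set(S)
    return E(S, complement) / (measure(S) * measure(complement))

V = range(4)
good, bad = {0, 1}, {0, 2}
# The block {0, 1} is almost invariant: high internal and low external costs.
assert c_int(good) == Fraction(9, 10) and c_ext(good, V) == Fraction(1, 5)
# Cutting through the blocks gives lower internal and higher external costs.
assert c_int(bad) == Fraction(3, 5) and c_ext(bad, V) == Fraction(4, 5)
```

The internal costs of the block $\{0, 1\}$ coincide with its invariance ratio, in line with the identification of internal costs and average invariance discussed in the text.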

In [46] the internal and external costs are seen to be intuitively high and low, respectively,

for almost invariant sets. However, it is stated that maximising the internal costs

is not equivalent to minimising the external costs. Optimisation with respect to the first

criterion may lead to relatively small sets in the partition while optimisation with respect

to the second criterion leads to more balanced partitions. Identifying a partition

$\mathcal{P}(X) = \{P_1, \dots, P_N\}$ with the corresponding one of the graph $\mathcal{P}(V) = \{P_1, \dots, P_N\}$

yields $T_{\mathrm{inv}}(\mathcal{P}(X)) = C_{\mathrm{int}}(\mathcal{P}(V))$, and thus maximising the internal costs is equivalent to

Problem 2.4. Minimising the external costs yields the third partitioning problem:

2.5 Problem (Discretised Partitioning into Almost Invariant Sets (II))

For a graph $G = (V, E)$ and fixed $N \in \mathbb{N}$ find a partition of vertices $\mathcal{P}(V) = \{P_1, \dots, P_N\}$

with $\sum_{p \in P_i} v(p) > 0$ that minimises $C_{\mathrm{ext}}$.

2.1.3 Complexity, Algorithmic Issues and Software

Most variants of the graph partitioning problems including the above mentioned are NP-

complete [46]. Therefore, heuristics are employed to solve Problem 2.4 and Problem 2.5.

An incomplete list of software tools for (balanced) graph partitioning may contain e. g.

JOSTLE [211], METIS [91], and PARTY [166]. Throughout this thesis, graph partitioning

will be performed using PARTY.

A common idea influencing the design of software packages is the multilevel paradigm which

has proven to be powerful (e. g. [141]). This paradigm consists of two steps: graph

coarsening and local improvement. For the graph coarsening a heuristic approach called

graph matching is employed in PARTY. A graph matching is a subset of edges such that

each vertex is part of at most one edge. The coarsening procedure then merges every two

vertices linked by a matching edge into one supervertex. The hierarchical multi-step coarsening

yields a clustering(11) on the coarsest level. It is followed by a stepwise projection onto the

next finer level with a local optimisation of the partition of each level by standard methods

like Kernighan-Lin [92] or the Helpful-Set Method [52].
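One coarsening step based on a matching can be sketched as follows. The Python code uses a simple greedy heavy-edge matching, which is a common heuristic but is only an assumption here. It is not claimed to be the matching strategy implemented in PARTY, and the function names and the example graph are hypothetical.

```python
def greedy_matching(edge_weights):
    """Greedy matching: scan edges by decreasing weight, keep an edge if both ends are free."""
    matched = set()
    matching = []
    for (u, v), _ in sorted(edge_weights.items(), key=lambda item: -item[1]):
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

def coarsen(vertices, edge_weights, matching):
    """Merge matched vertex pairs into supervertices and accumulate edge weights."""
    super_of = {v: (v,) for v in vertices}      # unmatched vertices stay on their own
    for u, v in matching:
        super_of[u] = super_of[v] = (u, v)
    coarse = {}
    for (u, v), w in edge_weights.items():
        a, b = super_of[u], super_of[v]
        if a != b:                              # edges inside a supervertex vanish
            key = (min(a, b), max(a, b))
            coarse[key] = coarse.get(key, 0) + w
    return sorted(set(super_of.values())), coarse

# Path graph 0 - 1 - 2 - 3 with edge weights 5, 1, 4 (illustrative data).
edges = {(0, 1): 5, (1, 2): 1, (2, 3): 4}
matching = greedy_matching(edges)
supervertices, coarse_edges = coarsen(range(4), edges, matching)
```

In this example the two heavy edges are matched, so the coarse graph consists of the two supervertices (0, 1) and (2, 3) joined by a single edge of weight 1. Repeating the step yields the hierarchy of coarser graphs used by the multilevel paradigm.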

The topic of partitioning with a variable number of partitions is even more challenging

than the aforementioned partitioning problems. Some heuristics such as congestion [46]

can be introduced but are not discussed further here.

2.2 Markov Decision Processes (MDPs)

Markov decision processes (MDPs) are the next step towards Markov games. They provide

a model for situations where a stochastic discrete dynamical system evolves under the

influence of one agent. In robot soccer this agent may also represent a whole team if

it is fully cooperative. Typically, solving an MDP means to compute a global feedback

control law to achieve some predefined goal in accord with the underlying dynamics. The

textbooks [14, 168] provide a good introduction to the material, as does [202], from which the

notation of this section partly stems.

Historical Remarks. MDPs were popularised by the books of Bellman and Howard [14, 84]

but according to Puterman [168] the historical roots lie much earlier. Some of the

basic concepts date back to problems of the calculus of variations in the 17th century, but

an explicit reference only points to the end of the 19th century: a paper of Cayley [30].

The beginning of the modern study can be dated to the 1940s, when Wald [210] already

presented the essence of the theory. A little later, important work was done on games

(11)The coarsening can be continued until the correct number of partitions is obtained, or until the coarsest

level can be partitioned by standard partitioning methods.

[15, 186], stochastic inventory models [61], pursuit problems [86] and sequential statistical

problems [5].

2.2.1 Basics and Problem Definitions

This subsection begins by defining the basic ingredients of an MDP:

2.6 Definition (Markov Decision Process)

A (discrete time, finite) Markov decision process $M = (\mathcal{D}, \mathcal{S}, \mathcal{SA}, T, R)$ is given by

1.) decision epochs $\mathcal{D} = \mathbb{N}_0$,

2.) a (finite) state space $\mathcal{S}$,

3.) a (finite) state action space $\mathcal{SA} = \{(s, a) : s \in \mathcal{S},\ a \in \mathcal{A}(s)\}$ where $\mathcal{A}(s)$ is a (finite) set of available actions in state $s$,

4.) a transition function $T : \mathcal{SA} \times \mathcal{S} \to [0, 1]$ with $T(s, a, s')$ being the probability of reaching state $s'$ if choosing action $a$ in state $s$,(12)

5.) a (deterministic) reward function $R : \mathcal{SA} \to \mathbb{R}$.

A standard assumption is that for every pair $(s, a) \in \mathcal{SA}$ it holds that $\sum_{s' \in \mathcal{S}} T(s, a, s') = 1$,

which means that no action can lead to a state outside of S. The addition of (absorbing)

extra states may help to ensure the above condition. For hierarchical problems as men-

tioned in Section 2.5 it can be necessary to differentiate between states within the same

level of hierarchy (with a probability less than 1) and states outside of the given hierarchy

level.
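The ingredients of Definition 2.6 and the normalisation assumption can be sketched as a small data structure. This is a minimal illustration and not part of the original text: the class `MDP`, its field names, and the two-state toy numbers are assumptions made here.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable


@dataclass
class MDP:
    """Finite MDP in the sense of Definition 2.6 (field names are illustrative)."""
    states: List[State]
    actions: Dict[State, List[Action]]                  # A(s)
    T: Dict[Tuple[State, Action], Dict[State, float]]   # T(s, a, s')
    R: Dict[Tuple[State, Action], float]                # R(s, a)

    def validate(self, tol: float = 1e-9) -> None:
        """Check that T(s, a, .) is a probability distribution on S for every (s, a)."""
        for s in self.states:
            for a in self.actions[s]:
                row = self.T[(s, a)]
                if any(s2 not in self.states or p < 0.0 for s2, p in row.items()):
                    raise ValueError(f"invalid transition entry for {(s, a)}")
                if abs(sum(row.values()) - 1.0) > tol:
                    raise ValueError(f"probabilities for {(s, a)} do not sum to 1")


# Hypothetical toy problem: "goal" plays the role of an absorbing extra state.
toy = MDP(
    states=["midfield", "goal"],
    actions={"midfield": ["dribble", "shoot"], "goal": ["wait"]},
    T={
        ("midfield", "dribble"): {"midfield": 1.0},
        ("midfield", "shoot"): {"midfield": 0.7, "goal": 0.3},
        ("goal", "wait"): {"goal": 1.0},
    },
    R={("midfield", "dribble"): 0.0, ("midfield", "shoot"): 0.3, ("goal", "wait"): 0.0},
)
toy.validate()  # raises ValueError if the normalisation assumption is violated
```

Storing only the non-zero entries of $T(s, a, \cdot)$ is a design choice that matches the sparse case; a dense array would serve equally well for small state spaces.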

2.7 Remark (General MDP [168])

For a more general version of MDPs the following aspects may additionally be taken into

account:

1.) The decision epochs can be either discrete or continuous. In the discrete case the set of time points may be finite or infinite; in the continuous case decisions may be made continuously, at random time points, or at time points which are themselves chosen by the decision maker.

2.) The state space can be continuous, discrete, or a mixture of both, and again finite or infinite in the discrete case.(13)

3.) The action space can have the same specifications as the state space and may be different for every state. Additionally, instead of pure actions, so-called mixed actions, i. e. probability distributions $PD(B)$ on a (Borel) subset $B$ of actions, can be performed during the Markov decision process. In this context, pure actions are degenerate mixed ones.

4.) The reward function may also depend on the state reached. For the reward model it is not important how the reward is accrued (e. g. continuously through a period, or depending on the system state of the subsequent decision epoch), but it or its expected value must be known before the next choice of action. If the reward depends on the subsequent state then it may be computed by

$$R(s, a) = \sum_{s' \in \mathcal{S}} T(s, a, s')\, R(s, a, s'). \tag{2.22}$$

(12) An alternative for MDPs with deterministic state transitions is to define the modified transition function $\tilde{T} : \mathcal{SA} \to \mathcal{S}$ with $\tilde{T}(s, a) = s'$ being the next state which is reached with probability 1.

(13) The exact condition is that the state space is a non-empty Borel subset of a complete, separable metric space. The same condition is needed for the action spaces. "Separable" means that there exists a countable dense subset, and a Borel set is an element of the Borel σ-algebra of the metric space.

[Diagram: for the time steps $t$ and $t+1$, the sequence state, observation, decision, mixed action, action, together with the reward received.]

Figure 2.2: Scheme of a Markov decision process. The agent, being in a state at time $t$, bases its decision on a stochastic observation of the state to choose a mixed action. At random, this leads to a pure action and, again at random, to the next state. The agent receives (local) rewards depending on states and actions.

2.8 Definition (Markov Property)

A decision process is said to be Markovian if for a sequence of state-actions $(s_t, a_t)_t$ with $s_t \in \mathcal{S}$ and actions $a_t \in \mathcal{A}(s_t)$, and a sequence of rewards $(r_t)_t$ with $r_t = R(s_t, a_t)$, the following holds:

$$\mathrm{Prob}\{s_t = s,\, r_t = r \mid s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \dots, s_0, a_0\} = \mathrm{Prob}\{s_t = s,\, r_t = r \mid s_{t-1}, a_{t-1}\}. \tag{2.23}$$

An MDP possesses the Markov property because its transition function $T$ and reward function $R$ are designed in exactly this way. The Markov property means that the transitions

and rewards are not dependent on history. A common artifice to include (a part of) the

state action history in an MDP framework is to attach it directly to the state. It is obvious

that this may augment the state space enormously depending on the maximal length of

history information.

2.9 Example (Robot Soccer, 1)

Robot soccer will be the standard example throughout this thesis (for a more detailed

description see Section 5.1) and serves to explain most of the theoretical concepts. If the

strategy of the opponent team is fixed then only one team being represented by an abstract

agent has to explicitly decide on its actions. This is exactly the situation of an MDP, and

the ingredients of Definition 2.6 are as follows: the decision epochs $\mathcal{D} = \mathbb{N}_0$ are equidistant and infinite, the state space $\mathcal{S} \subseteq \mathbb{R}^n$ is the set which describes the possible coordinates (position and maybe velocity) of all robots and the ball, the action set $\mathcal{A}(s)$ consists of movements and kicks, the (probabilistic) transition function $T$ describes the evolution of the game for each state and each action, and the reward $R$ is simply positive ($= +1$) for scoring a goal and negative ($= -1$) for letting the opponent team score a goal.

Return Models

With the dynamics and the reward of an MDP defined, its goal remains to be stated: to maximise some kind of long-term reward called the return $\mathcal{R}$.(14) Given a stochastic(15) sequence of rewards $(r_t)_t$ there are, however, different criteria of optimality.

In [88, 202], the following variants of measuring the optimality by returns are considered:

Widespread due to its convergence properties is the discounted infinite horizon return

$$\mathcal{R}_{\mathrm{disc}} = E\Big\{\sum_{t=0}^{\infty} \gamma^t r_t\Big\} \tag{2.24}$$

with discount rate $\gamma \in (0, 1)$, while the numerical approximation of the infinite horizon return is often performed by the corresponding finite horizon return

$$\mathcal{R}^{N}_{\mathrm{disc}} = E\Big\{\sum_{t=0}^{N} \gamma^t r_t\Big\} \tag{2.25}$$

where additionally $\gamma = 1$ is allowed.(16) Here, $E\{\cdot\}$ denotes the expectation value. Completely different and more difficult to analyse are the average infinite horizon return model

$$\mathcal{R}_{\mathrm{aver}} = \lim_{N \to \infty} E\Big\{\frac{1}{N+1} \sum_{t=0}^{N} r_t\Big\}, \tag{2.26}$$

or bias-optimal models, neither of which needs a discount factor. An empirical comparison of a discounted to an average reward based return model can be found in [125].

At this point it should be remarked that all of the following concepts which deal with

optimality depend strongly on the optimisation criterion defined by the return model.

Therefore, if not stated differently, the standard is to assume the discounted infinite horizon

return, i. e. R=Rdisc.

2.10 Example (Robot Soccer, 2)

Using the robot soccer example again for illustration, a discounted reward means that

scoring a goal faster is ranked higher than scoring it later. The finite horizon version seems

to be adequate if the duration of the soccer match and of the performance of actions can

be exactly foreseen – which is typically not true. Finally, in the average reward case the

ranking of scoring is completely independent of the exact time points. Only the average

amount of goals per time interval is important.
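The ranking described in this example can be checked on two realised reward sequences. The following is a minimal sketch, assuming a single realisation (so the expectation is dropped) and a finite number of steps; the function names and the two reward sequences are illustrative and not taken from the thesis.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over one realised reward sequence (cf. Equation 2.25)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def average_return(rewards: Sequence[float]) -> float:
    """Mean reward per step over one realised reward sequence (cf. Equation 2.26)."""
    return sum(rewards) / len(rewards)


# Two hypothetical matches of four steps: the goal (+1) is scored early or late.
early = [0.0, 1.0, 0.0, 0.0]
late = [0.0, 0.0, 0.0, 1.0]

# Discounting prefers the early goal; the average return does not distinguish them.
print(discounted_return(early, 0.9), discounted_return(late, 0.9))
print(average_return(early), average_return(late))
```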

It is also possible to design arbitrary return profiles with weights $w_t \neq \gamma^t$ as long as convergence of the series can be guaranteed. This may be useful to enforce some strategic behaviour for which a desired time scale is specified (soft constraint). However, for a non-constant weight the Markov property of the value function is typically not fulfilled. Although not considered before, employing such a method may give additional tactical options for the soccer game.

(14) Sometimes, the MDP as defined above is called Markov decision process and the MDP plus return model is called Markov decision problem. In this thesis, the term Markov decision process also includes the return model.

(15) The rewards are stochastic either in themselves, or through the stochasticity of the transition function, or both.

(16) In the discounted infinite horizon return model one can also consider $\lim_{\gamma \to 1}$, if it exists.

Policies, Value Functions, and Optimality

A definition of the concepts of optimality remains: policies, value functions, and Bellman’s

principle of optimality.

2.11 Definition (Policy, Decision Rule)

A (stationary Markovian) policy $\pi : \mathcal{SA} \to [0, 1]$ is a decision rule which for each $s \in \mathcal{S}$ specifies a probability distribution in $PD(\mathcal{A}(s))$, i. e. $\forall s \in \mathcal{S} : \sum_{a \in \mathcal{A}(s)} \pi(s, a) = 1$.

According to the definition a policy $\pi$ specifies the probability $\pi(s, a)$ of performing action $a$ in state $s$. If $\pi(\mathcal{SA}) = \{0, 1\}$, then the policy is called deterministic. Given an initial state $s_0 \in \mathcal{S}$ and a policy $\pi$, one can compute an associated random trajectory $O(s_0) = (s_t, a_t)_{t \in \mathbb{N}_0}$ with rewards $r_t = R(s_t, a_t)$ and the corresponding return. Then, the state value function $V^\pi : \mathcal{S} \to \mathbb{R}$ is the expectation of the return under policy $\pi$ when starting in state $s$. Similarly, the state-action value function $Q^\pi(s, a)$ is the expectation of the return under policy $\pi$ when starting in state-action $(s, a)$. A formal definition follows:

2.12 Definition (Value Functions of an MDP)

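Given an MDP $M = (\mathcal{D}, \mathcal{S}, \mathcal{SA}, T, R)$ with $\mathcal{D} = \mathbb{N}_0$ and a policy $\pi$, the state value function $V^\pi : \mathcal{S} \to \mathbb{R}$ under this policy is defined by

$$V^\pi(s) = E_\pi\{\mathcal{R} \mid s_0 = s\} \tag{2.27}$$

where the notation $E_\pi\{X\} = E\{X \mid \forall t : \mathrm{Prob}\{s_{t+1} = s' \mid s_t = s, a_t = a\} = T(s, a, s') \text{ and } \mathrm{Prob}\{a_t = a \mid s_t = s\} = \pi(s, a)\}$ is employed. The corresponding state action value function $Q^\pi : \mathcal{SA} \to \mathbb{R}$ under this policy is defined by

$$Q^\pi(s, a) = E_\pi\{\mathcal{R} \mid s_0 = s, a_0 = a\}. \tag{2.28}$$

A policy $\pi^*$ is called optimal if $V^{\pi^*} \geq V^\pi$ holds for all policies $\pi$.(17)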

The above definition is not well-suited to reveal a method for algorithmically computing

value functions. Most of the solution algorithms utilise a recursivity property known as

Bellman’s principle of optimality.

2.13 Theorem (Optimality Principle [168])

A (state-) value function $V \in \mathbb{R}^{|\mathcal{S}|}$ of an MDP is optimal iff it is the unique solution to the Bellman equation:

$$V(s) = \max_{\pi_s \in PD(\mathcal{A}(s))} \sum_{a \in \mathcal{A}(s)} \pi_s(a) \cdot Q(s, a) \tag{2.29}$$

where

$$Q(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s, a, s') \cdot V(s'). \tag{2.30}$$

Here, $PD(X)$ is the set of probability distributions on the set $X$ and $\pi_s$ is the restriction of $\pi$ to state $s$, meaning $\pi_s(a) = \pi(s, a)$.

(17) The optimality condition is equivalent to the following one: $Q^{\pi^*} \geq Q^\pi$ for all policies $\pi$, as can be seen by plugging $Q^*$, $Q^\pi$ into Equation 2.29 and, conversely, $V^*$, $V^\pi$ into Equation 2.30.

The result of plugging Equation 2.30 into Equation 2.29 can be abbreviated by the Bellman operator $B_{\mathrm{MDP}}$ which shortens the notation of the Bellman equation to $V = B_{\mathrm{MDP}} V$. The Bellman equation is a nonlinear fixed point equation for the optimal value function $V^* = V^{\pi^*}$ and the operator is a contraction in $\|\cdot\|_\infty$ with rate $\gamma$ [19]. Note that every MDP has a deterministic optimal policy [168] and it would thus suffice to take the maximum in Equation 2.29 only over pure actions $a \in \mathcal{A}(s)$. However, in the case of 2P-ZS-MGs in Section 2.4 a need for the more general formulation will arise. In order to stress the formal similarities the notation above deviates from the standard one.
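The contraction property of the Bellman operator can be observed numerically on a small example. The sketch below is an illustration under assumptions: the two-state MDP, its numbers, and all names are hypothetical, and the maximum is taken over pure actions only, which suffices for MDPs as noted above.

```python
from typing import Dict, List, Tuple

# Hypothetical toy MDP with states 0 and 1; T[s][a] lists (probability, next state).
ACTIONS: Dict[int, List[int]] = {0: [0, 1], 1: [0]}
T: Dict[int, Dict[int, List[Tuple[float, int]]]] = {
    0: {0: [(0.8, 0), (0.2, 1)], 1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(1.0, 1)]},
}
R: Dict[int, Dict[int, float]] = {0: {0: 0.0, 1: -0.1}, 1: {0: 1.0}}
GAMMA = 0.9


def bellman(V: List[float]) -> List[float]:
    """Apply the Bellman operator once (Equation 2.30 inserted into Equation 2.29)."""
    return [
        max(R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in T[s][a]) for a in ACTIONS[s])
        for s in sorted(ACTIONS)
    ]


def sup_norm(u: List[float], v: List[float]) -> float:
    """Distance of two value functions in the maximum norm."""
    return max(abs(x - y) for x, y in zip(u, v))


V, W = [0.0, 0.0], [5.0, -3.0]
# Contraction: one application shrinks the distance at least by the factor gamma.
print(sup_norm(bellman(V), bellman(W)), "<=", GAMMA * sup_norm(V, W))
```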

2.14 Remark (History Dependent Strategies, Non-Stationarity [168])

As stated, policies can be deterministic or stochastic. One concept which is not covered by the policy of Definition 2.11 is that of history dependent strategies. A $t$-time history $h_t$ is recursively defined by $h_0 = s_0$ and $h_i = (h_{i-1}, a_{i-1}, s_i)$. A second concept which is also not covered is the possible non-stationarity of policies, meaning that a policy also explicitly depends on time. Both concepts are irrelevant for finding the optimal solution of an MDP but can be of practical interest if the Markov condition is not (exactly) fulfilled.

2.2.2 Numerical Methods

Two important classes of approaches are dynamic programming (DP) and reinforcement

learning (RL) methods. While for the first class of solution methods it is assumed that the

stochastic model is known in advance, in the second class this assumption is dropped and

the consequences of the world model are directly (model-based approaches) or indirectly

(model free approaches) approximated.

Dynamic Programming Methods

2.15 Definition (Value Iteration (MDP) [168])

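The following algorithm is called value iteration: select $\varepsilon > 0$, choose an arbitrary initial guess $V_0 \in \mathbb{R}^{|\mathcal{S}|}$ for the (state-) value function, and determine iteratively $V_k = B_{\mathrm{MDP}} V_{k-1}$ for $k = 1, 2, \dots$ until $\|V_{k+1} - V_k\|_\infty \leq \frac{\varepsilon}{2} \cdot \frac{1 - \gamma}{\gamma}$.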

The stopping criterion can also be based on a span semi-norm [168] which may improve the contraction rate if the transition matrix is non-sparse. Value iteration converges to $V^*$, and provides an $\frac{\varepsilon}{2}$-approximation for the value function estimate, i. e. $\|V_N - V^*\|_\infty \leq \frac{\varepsilon}{2}$, and an $\varepsilon$-optimal stationary policy by

$$\pi^*_\varepsilon(s) = \Pi^*(V_N) = \arg\max_{a \in \mathcal{A}(s)} \Big\{ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s, a, s') \cdot V_{N-1}(s') \Big\} \tag{2.31}$$

where $N$ is the number of iterates and $\arg\max_{a \in \mathcal{A}(s)} \in PD(\mathcal{A}(s))$ is considered to be a probability distribution of a pure action [168]. In the case of several equally good actions the "arg max" has to return a mixed action with equal probabilities to keep uniqueness. Note that in terms of the state-action value function Equation 2.31 can also be written $\pi^*_\varepsilon(s) = \Pi^*(Q_{N-1}) = \arg\max_{a \in \mathcal{A}(s)} Q_{N-1}(s, a)$.(18)

Furthermore, the convergence is linear at rate $\gamma$, i. e. $\|V_{k+1} - V^*\| \leq \gamma \cdot \|V_k - V^*\|$, and the convergence of Gauss-Seidel value iteration, which uses $V_{k+1}(s)$ instead of $V_k(s)$ as soon as available, is linear at least with rate $\gamma$ [168]. The analysis of the algorithms is based on the fact that for each iteration $k$ there exists a policy $\pi_k$ which corresponds to the maximum selection rule. With respect to these policies $\pi_k$ the analysis is done in a way similar to iterative solvers for linear equations because the evaluation of a fixed policy is solving a linear equation with the matrix depending on the policy (Appendix B).

(18) Analogously to later proofs, the iterative procedure of applying the two parts of the Bellman equation (Equations 2.29 and 2.30) is labelled in the order: $V_0, Q_0, V_1, Q_1, \dots$
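Value iteration with the stopping criterion of Definition 2.15 can be sketched as follows. This is an illustration under assumptions rather than the implementation used in the thesis: the toy MDP and all names are hypothetical, the greedy policy is extracted from the last iterate, and ties are broken by the first maximiser instead of a uniform mixed action.

```python
def q_values(V, s, actions, T, R, gamma):
    """Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s') for all a in A(s)."""
    return {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a]) for a in actions[s]}


def value_iteration(actions, T, R, gamma, eps):
    """Iterate the Bellman operator until the criterion of Definition 2.15 is met."""
    states = sorted(actions)
    V = {s: 0.0 for s in states}                       # initial guess V_0
    threshold = eps / 2.0 * (1.0 - gamma) / gamma      # eps/2 * (1 - gamma) / gamma
    while True:
        V_new = {s: max(q_values(V, s, actions, T, R, gamma).values()) for s in states}
        diff = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if diff <= threshold:
            break
    policy = {}
    for s in states:
        q = q_values(V, s, actions, T, R, gamma)
        policy[s] = max(q, key=q.get)                  # first maximiser on ties
    return V, policy


# Hypothetical toy MDP: state 1 pays reward 1 per step and is absorbing;
# in state 0, action 0 stays in state 0 and action 1 moves to state 1.
ACTIONS = {0: [0, 1], 1: [0]}
T = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]}, 1: {0: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0}}

V_star, policy = value_iteration(ACTIONS, T, R, gamma=0.9, eps=1e-6)
```

For this toy problem the exact values are $V^*(1) = 1/(1-\gamma) = 10$ and $V^*(0) = \gamma \cdot 10 = 9$, so the result can be compared against the closed form.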

For the sake of completeness and because of its superlinear, sometimes quadratic conver-

gence the policy iteration algorithm is presented. For MDPs it is further guaranteed to find

the optimal policy in finitely many steps but this relies on the existence of a deterministic

optimal policy and is not true for 2P-ZS-MGs.

2.16 Definition (Policy Iteration [168])

Select an arbitrary policy $\pi_0$, and repeat the following until $\pi_{k+1} = \pi_k$: compute the value function $V_k = V^{\pi_k}$ of policy $\pi_k$ (policy evaluation) and choose the policy $\pi_{k+1} = \Pi^*(V_k)$ as in Equation 2.31 (policy improvement).
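The two steps of policy iteration can be sketched for deterministic policies as follows. The sketch rests on assumptions: policy evaluation solves the linear system $(I - \gamma P_\pi) V = R_\pi$ with a small hand-written Gauss-Jordan routine, the current action is kept on ties so that the loop terminates, and the toy MDP and all names are hypothetical.

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system A x = b."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def policy_iteration(n, actions, T, R, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    pi = [actions[s][0] for s in range(n)]             # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        for s in range(n):
            for p, s2 in T[s][pi[s]]:
                A[s][s2] -= gamma * p
        V = solve_linear(A, [R[s][pi[s]] for s in range(n)])
        # Policy improvement: greedy with respect to V, keeping the old action on ties.
        new_pi = []
        for s in range(n):
            q = {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                 for a in actions[s]}
            best = max(q.values())
            new_pi.append(pi[s] if q[pi[s]] >= best - 1e-12 else max(q, key=q.get))
        if new_pi == pi:
            return pi, V
        pi = new_pi


# Hypothetical toy MDP: state 1 pays reward 1 per step and is absorbing;
# in state 0, action 0 stays in state 0 and action 1 moves to state 1.
ACTIONS = {0: [0, 1], 1: [0]}
T = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]}, 1: {0: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0}}

pi_star, V_pi = policy_iteration(2, ACTIONS, T, R, gamma=0.9)
```

On this example the initial policy stays in state 0 forever; a single improvement step switches to the move towards state 1, after which the policy no longer changes.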

Reinforcement Learning Methods

Reinforcement learning methods [88] belong to the class of stochastic approximation algorithms and are employed for numerical comparisons. From a practical point of view they may help to adapt optimal policies of a DP model to a real world problem. A key difference

of RL is that the agent directly executes actions according to a policy and, based on the

outcomes, estimates the value function and in some cases a model of the underlying MDP.

One of the most common algorithms is Q-learning first proposed by [212]:

2.17 Definition (Q-Learning [202])

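Start with state-action value function $Q_0$, and let the agent be in an initial state $s_0$. Repeat iteratively: Being at step $k$ in state $s_k$, choose an action $a_k$ according to an arbitrary policy $\pi$, observe the next state $s_{k+1}$ and set

$$Q_{k+1}(s, a) = \partial_{(s,a),(s_k,a_k)} \cdot \Big[ (1 - \alpha_k)\, Q_k(s_k, a_k) + \alpha_k \Big( R(s_k, a_k) + \gamma \cdot \max_{a_{k+1} \in \mathcal{A}(s_{k+1})} Q_k(s_{k+1}, a_{k+1}) \Big) \Big] \tag{2.32}$$

where $\partial_{i,j}$ is the Kronecker symbol being 1 if $i = j$ and 0 else, and $\alpha_k = \alpha_k(s, a)$. The Kronecker symbol expresses that only the entry of the visited pair $(s_k, a_k)$ is updated; all other entries are understood to keep their previous values $Q_k(s, a)$.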

The Q-learning rule is guaranteed to converge with probability 1 (for any initial state $s_0$ and for any $Q_0$) if the sequence of learning rates satisfies for every $(s, a) \in \mathcal{SA}$ [19]:

$$\sum_{k=0}^{\infty} \alpha_k(s, a) = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2(s, a) < \infty. \tag{2.33}$$

The first condition includes the necessity for infinitely updating every state-action pair

while the second one establishes convergence. Because the convergence to the optimal Q-

value function occurs for every policy πof the agent as long as Equation 2.33 is fulfilled, the

method is called an off-policy method. In practice, exploration strategies are included to guarantee Equation 2.33, while exploitation, i. e. performing an optimal action, ensures that the most promising parts of the state (-action) space are updated faster and more often. A very common policy is the $\varepsilon$-greedy policy, which chooses a random action with probability $\varepsilon > 0$ and otherwise, with probability $1 - \varepsilon$, is greedy, i. e. chooses an optimal action $a_k = \Pi^*(Q_k)$. Other possibilities are to introduce an exploration bonus, curiosity driven exploration [183], Boltzmann exploration or interval based techniques [38].
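A tabular version of the Q-learning rule with an $\varepsilon$-greedy policy can be sketched as follows. It is an illustration under assumptions: the learning rate $\alpha_k(s, a) = 1/n(s, a)$, with $n(s, a)$ the visit count, is one choice compatible with Equation 2.33, only the visited entry is updated, and the two-state environment `toy_step` and all names are hypothetical.

```python
import random


def q_learning(actions, step, gamma, eps, n_steps, s0, seed=0):
    """Tabular Q-learning along one trajectory with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in actions for a in actions[s]}
    visits = dict.fromkeys(Q, 0)
    s = s0
    for _ in range(n_steps):
        if rng.random() < eps:                          # explore
            a = rng.choice(actions[s])
        else:                                           # exploit the current estimate
            a = max(actions[s], key=lambda b: Q[(s, b)])
        r, s_next = step(s, a, rng)
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]                    # sum diverges, sum of squares converges
        target = r + gamma * max(Q[(s_next, b)] for b in actions[s_next])
        Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target   # other entries unchanged
        s = s_next
    return Q


def toy_step(s, a, rng):
    """Hypothetical deterministic environment: approach the goal, score, restart."""
    if s == "midfield":
        return (0.0, "midfield") if a == "stay" else (0.0, "front_of_goal")
    return (1.0, "midfield")                            # shooting scores and restarts play


ACTIONS = {"midfield": ["stay", "go"], "front_of_goal": ["shoot"]}
Q = q_learning(ACTIONS, toy_step, gamma=0.9, eps=0.2, n_steps=5000, s0="midfield")
greedy_at_midfield = max(ACTIONS["midfield"], key=lambda b: Q[("midfield", b)])
```

The transition model is never inspected by `q_learning`; it only sees sampled rewards and successor states, which is what makes the method model free.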

For the sake of completeness a common on-policy method is to be introduced. It is called

SARSA, which stems from the update formula (state, action, reward, state, action), and

is similar to Q-learning but with the slightly modified update rule:

$$Q_{k+1}(s, a) = \partial_{(s,a),(s_k,a_k)} \cdot \Big[ (1 - \alpha_k)\, Q_k(s_k, a_k) + \alpha_k \big( R(s_k, a_k) + \gamma \cdot Q_k(s_{k+1}, a_{k+1}) \big) \Big]. \tag{2.34}$$

The difference is now that if the agent executed a fixed policy $\pi$, the SARSA algorithm would converge to $Q^\pi$ instead of $Q^*$. However, if the policy is not kept constant but instead the agent follows an $\varepsilon$-greedy policy with respect to the Q-values and $\varepsilon$ is decayed over time, then it can be hoped [202] that SARSA converges to $Q^*$. Compared to DP methods, SARSA can perhaps be interpreted as an asynchronous version of a policy iteration algorithm.
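The difference to Q-learning lies only in the target of the update, which a single step makes visible. The following sketch is illustrative: the function name, the table layout, and the numbers are assumptions.

```python
from typing import Dict, Hashable, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def sarsa_update(Q: QTable, s, a, r: float, s_next, a_next,
                 alpha: float, gamma: float) -> None:
    """In-place SARSA step: the target uses the action a_next that is actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target


# Hypothetical numbers: the successor state has a good and a bad action.
Q = {("s", "a"): 0.0, ("s2", "good"): 2.0, ("s2", "bad"): -1.0}
sarsa_update(Q, "s", "a", r=1.0, s_next="s2", a_next="bad", alpha=0.5, gamma=0.5)

# Q-learning would instead use the maximum over the successor actions as target.
q_learning_value = (1.0 - 0.5) * 0.0 + 0.5 * (1.0 + 0.5 * max(2.0, -1.0))
```

Because the exploratory action "bad" enters the SARSA target, the updated entry is lower than the one Q-learning would produce from the same transition.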

[116, 203] give general conditions under which asynchronous RL algorithms converge. They

reduce the effort to prove the convergence of synchronous update rules. One special case

is Q-learning for MDPs.

2.2.3 Complexity, Algorithmic Issues and Software

The numerical effort of solving MDPs is different for different algorithms [88]: Value iteration needs in the worst case per iteration $O(|\mathcal{SA}| \cdot |\mathcal{S}|)$ multiplications and evaluations of $T$ or, if the transition function is sparse ($T(s, a, s') \neq 0$ only for constantly many $s'$), only $O(|\mathcal{SA}|)$ multiplications and evaluations of $T$. However, the number of iterations to achieve some prescribed $\varepsilon$-optimality can grow dramatically if the discount factor $\gamma$ approaches 1 (see stopping criterion of Definition 2.15). In practice, policy iteration can converge faster although the per iteration complexity is $O(|\mathcal{S}|^3 + |\mathcal{SA}| \cdot |\mathcal{S}|)$ (policy evaluation + improvement) and there are no general theoretical worst-case results [115]. For special cases like deterministic MDPs Madani gives proofs that policy iteration is in P (polynomial) [123] and so is value iteration.

Linear Programming [184] is also a method to solve MDPs with the advantage that very

efficient commercial packages can be used. Theoretically, this is the only known method

with polynomial time although this need not indicate that it is the most efficient in practice.

Other standard ideas to speed up numerical techniques can be used e. g. multi-grid methods

as in [175] or state aggregation which conglomerates several states to a single meta-state

[18].

The Q-learning method presented above belongs to the model free approaches as well as

adaptive heuristic critic [12]. Under some conditions model free methods are applicable

in the average return case [185]. Model-based approaches which were not introduced here

include certainty equivalence [95], Dyna-Q [199, 200, 213], Queue Dyna [163], and Priori-

tised Sweeping [144, 44]. Special model-based methods for finding short paths to a goal

state are real-time dynamic programming (RTDP) [10] and plexus planning [43].

2.3 Matrix Games

In general, game theory is a scientific field which deals with the question: How should two

or more agents, being in an intertwined situation, decide to their own best?(19) In a less

selfish formulation the question may also be stated as “how to model and solve conflicts”.

Similar to MDPs, there are different possibilities concerning the rules and settings of the

game situations. Again, one of the most distinguishing criteria is the question of whether

the game is a differential game, which is time-continuous, or a (time-) discrete game. A

second criterion is whether the game is deterministic or stochastic. For a more detailed

overview one may consult the textbooks [13, 66, 86, 107, 158, 159] and the article [135].

Matrix games are a very special and restricted class of games but nevertheless helpful to

begin with. They belong to the class of two-player zero-sum games in which only one time step is under consideration (single-stage game), or in which the situation does not change with time (repeated matrix games). From a theoretical point of view repeated matrix

games are different from normal ones if the allowed strategies are deterministic but may be

changed over time. In such a case non-Markovian strategies like “tit-for-tat”, which depend

on the action history, are sometimes reasonable and offer different options than Markovian

stochastic strategies (mixed actions) that are independent from time. In the sequel, matrix

games are a basic element for the solution of the more general class of Markov games. This

is the reason why they are introduced and why a numerical method for solving this class

of games is presented.

Generally, two-player zero-sum games can be interpreted as games between two rivals: one

agent wants to avoid what the other tries to achieve and vice versa. Sometimes, such a

situation is called (completely) competitive because there is no potential for a compromise.

Typically, board games (backgammon, chess) or sport games (soccer, tennis) are of this

type because one agent or team of agents wins if and only if the other one loses.

Historical Remarks. [13] dates the theory of zero-sum games back to the 1920s when

Borel did some work which was translated into English much later [20]. Borel developed

some concepts of strategies but conjectured that the minimax theorem (Theorem 2.23) was

false. Von Neumann proved the opposite [154] – the first proof of the minimax theorem

was not as elementary as e. g. in [13] – and built the foundations of game theory which

are summarised in his famous book with Morgenstern [155]. Other early references include

[120, 136] and the work of Nash.

2.3.1 Basics and Problem Definitions

Matrix games are finite two-player zero-sum games in normal form; the extensive form, in contrast, is represented by a tree structure [13]. This tree structure allows a more detailed

picture e. g. about the order of play and the information of the agents at each decision

epoch. However, for the Markov games considered later the standard assumption will be

that at every time step the actions of all agents take place simultaneously or, equivalently,

that one agent does not know the action of the other agent before the end of a decision

epoch. Thus, the representation of each time step by a matrix game is adequate. To

distinguish between zero-sum and general-sum games with two agents the latter ones are called bimatrix games [66] to indicate that for each agent $P_i$, $i = 1, 2$, a reward matrix or payoff matrix $R_i$ has to be specified. Below, the definitions of a bimatrix game and a (zero-sum) matrix game are given and the interchangeability theorem is stated. Basic concepts like the value of a game, the minimax theorem (Theorem 2.23), and the dominance of actions are introduced.

(19) In game theory, agents are typically called players.

Bimatrix and Matrix Games

The definitions start with bimatrix games which are two-player general-sum games in nor-

mal form and their zero-sum variant called matrix games. It is intended to keep analogies

to MDPs (Definition 2.6) and, hence, slightly deviate from standard notation. Many of

the following statements and definitions stem from or are inspired by [66].

2.18 Definition (Bimatrix Game)

A bimatrix game $\Gamma$ is defined by

1.) a trivial decision epoch $\mathcal{D} = \{t_0\}$,

2.) a trivial state space $\mathcal{S} = \{s_0\}$,

3.) a finite state action space $\mathcal{SAO} = \{(s, a, o) : s \in \mathcal{S},\ a \in \mathcal{A}(s),\ o \in \mathcal{O}(s)\}$ where $\mathcal{A}(s), \mathcal{O}(s)$ are the (finite) sets of available actions of the first and the second agent, $P_1$ and $P_2$, respectively,

4.) a trivial transition function $T : \mathcal{SAO} \times \mathcal{S} \to [0, 1]$ with $T(s_0, a, o, s_0) = 1$,

5.) a (deterministic) reward function for each agent $P_i$ denoted by $R_i : \mathcal{SAO} \to \mathbb{R}$.

A repeated bimatrix game only differs from a bimatrix game in the decision epochs $\mathcal{D} = \mathbb{N}_0$. The name bimatrix game stems from the fact that the only non-trivial elements are the finitely many actions of both agents and the two reward functions $R_i$ that can be represented by two matrices. The convention will be to write the matrices in such a way that actions $\mathcal{A} = \{a_1, \dots, a_m\}$ of agent $P_1$ correspond to rows and actions $\mathcal{O} = \{o_1, \dots, o_n\}$ of agent $P_2$ correspond to columns of the matrices $R_i$ as the following scheme indicates:

$$\begin{array}{c|cccc} & o_1 & o_2 & \cdots & o_n \\ \hline a_1 & r_{11} & r_{12} & \cdots & r_{1n} \\ a_2 & r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ a_m & r_{m1} & r_{m2} & \cdots & r_{mn} \end{array}$$

2.19 Definition (Matrix Game)

A matrix game is defined to be a bimatrix game $\Gamma$ with the additional condition that the sum of the reward matrices of the two agents is zero:

$$R_1 + R_2 = 0. \tag{2.35}$$

A matrix game belongs to the class of two-player zero-sum games in normal form. Analogously to bimatrix games, a repeated matrix game only differs in the decision epochs $\mathcal{D} = \mathbb{N}_0$. As for MDPs, each agent $P_1, P_2$ has to find an optimal policy $\pi_1^*, \pi_2^*$ which maximises its return $\mathcal{R}_1, \mathcal{R}_2$, respectively. In the case of non-repeated (bi)matrix games this is simply the selection of a mixed action(20), and the return $\mathcal{R}_i$ of agent $P_i$ reduces to the expectation over the single-stage return (only the first summand in the sum of e. g. Equation 2.24).

If the policies of the two agents are represented by column vectors ($\pi_1 \in \mathbb{R}^{m,1}$, $\pi_2 \in \mathbb{R}^{n,1}$) and the reward functions $R_i$ by matrices as depicted above, then the expectation of the single-stage return $\mathcal{R}_i$ for agent $P_i$ can be expressed simply by matrix vector multiplications:

$$E_{\pi_1, \pi_2}\{\mathcal{R}_i\} = \pi_1^T R_i \pi_2 = \sum_{k, l} (\pi_1)_k (R_i)_{kl} (\pi_2)_l \tag{2.36}$$

with $A^T$ denoting the transpose of a matrix $A$.

(20) The set of mixed actions for agent $P_1$ is the $(m-1)$-dimensional simplex of probability distributions $PD(\mathcal{A}(s_0)) = \{x \in \mathbb{R}^m : x = (x_j)_j \text{ with } x \geq 0,\ \sum_{j=1}^m x_j = 1\}$, and for agent $P_2$ it is the analogous $(n-1)$-dimensional simplex.
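Equation 2.36 translates directly into a double sum over the matrix entries. The sketch below is illustrative: the function name is an assumption, and the payoff matrix is the well-known matching pennies game, used here only as sample data.

```python
from typing import List, Sequence


def expected_return(pi1: Sequence[float], R: List[List[float]],
                    pi2: Sequence[float]) -> float:
    """Expected single-stage return pi1^T R pi2 of Equation 2.36."""
    return sum(pi1[k] * R[k][l] * pi2[l]
               for k in range(len(pi1)) for l in range(len(pi2)))


R1 = [[-1.0, 1.0], [1.0, -1.0]]                 # reward matrix of agent P1
R2 = [[-x for x in row] for row in R1]          # zero-sum: R1 + R2 = 0

uniform = [0.5, 0.5]                            # mixed action
first = [1.0, 0.0]                              # pure action as degenerate mixed action

value_uniform = expected_return(uniform, R1, uniform)
value_pure = expected_return(first, R1, first)
```

With both agents mixing uniformly the expected return of $P_1$ is 0, whereas the pair of first pure actions yields $-1$ for $P_1$ and, by the zero-sum property, $+1$ for $P_2$.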

Before the interchangeability theorem is stated the general definition of a Nash equilibrium has to be introduced. More details on the conditions on the policy spaces under which Nash equilibria exist can be found in [22]. To define a Nash equilibrium briefly, the concept of best response is employed, which is also applicable to the general case of $k$ agents. A policy $\pi_i^*$ of agent $P_i$ is an element of the set of best responses or best replies $BR_i(\pi_{-i})$ to a joint policy $\pi_{-i}$ of all other agents if for all policies $\pi_i$ it holds that

$$E_{(\pi_i^*, \pi_{-i})}\{\mathcal{R}_i\} \geq E_{(\pi_i, \pi_{-i})}\{\mathcal{R}_i\} \tag{2.37}$$

with $(\pi_i^*, \pi_{-i})$ being appropriately ordered. Then, a Nash equilibrium is a tuple of policies $\pi^* = (\pi_1^*, \dots, \pi_k^*)$ of the $k$ agents with

$$\forall i : \pi_i^* \in BR_i(\pi_{-i}^*) \tag{2.38}$$

which means that no agent has an incentive to deviate unilaterally from its policy. A difficulty in determining Nash equilibria, which is in some cases NP-hard [34, 32], arises because the set of best responses depends on the joint policy $\pi_{-i}$ of all other agents. An exception is the computation of Nash equilibria in (two-player zero-sum) matrix games which is polynomial due to the solvability by linear programming [34]. A more precise statement of the complexity of finding Nash equilibria in $n$-player games with $n \geq 4$ is given in [40]. More details on the complexity of linear programming can be found below in the subsection concerning numerical methods.

An important fact of arbitrary two-person zero-sum games which is not restricted to matrix

games but is not true for general-sum games(21) is about the interchangeability of equi-

librium pairs of strategies where equilibrium pair means Nash equilibrium. The theorem

below shows that all Nash equilibria are equally good and, hence, that a Nash equilibrium

is already a pair of globally optimal policies for both players. A pair of policies may also

be called a total policy π= (π1, π2).

2.20 Theorem (Interchangeability and Equal Payoff of Equilibrium Strategies [159])

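In a two-person zero-sum game $\Gamma$ let $(\pi_1, \pi_2)$ and $(\tilde\pi_1, \tilde\pi_2)$ be two pairs of equilibrium strategies. Then $(\pi_1, \tilde\pi_2)$ and $(\tilde\pi_1, \pi_2)$ are also equilibrium pairs, and for $i = 1, 2$ holds:

$$E_{\pi_1,\pi_2}\{\mathcal{R}_i\} = E_{\tilde\pi_1,\tilde\pi_2}\{\mathcal{R}_i\} = E_{\pi_1,\tilde\pi_2}\{\mathcal{R}_i\} = E_{\tilde\pi_1,\pi_2}\{\mathcal{R}_i\}. \tag{2.39}$$

The proof only uses the zero-sum property and the general equilibrium inequalities, which are valid for any strategy $\hat\pi_1$, $\hat\pi_2$, respectively, $E_{\pi_1,\pi_2}\{\mathcal{R}_1\} \geq E_{\hat\pi_1,\pi_2}\{\mathcal{R}_1\}$ and $E_{\pi_1,\pi_2}\{\mathcal{R}_1\} \leq E_{\pi_1,\hat\pi_2}\{\mathcal{R}_1\}$, to show the equality and that the new strategies are equilibria.

(21) Consider e. g. the bimatrix game with $R_1 = R_2 = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$ and $a > b > 0$, which is sometimes called coordination game and has two Nash equilibria with expected returns of $a$ and $b$, each being equal for both agents.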
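The failure of interchangeability in general-sum games can be checked by enumerating the Nash equilibria in pure actions. The sketch below is an illustration under assumptions: it considers pure actions only, the function name is hypothetical, and the coordination game uses the sample values $a = 2$, $b = 1$.

```python
from typing import List, Tuple


def pure_nash_equilibria(R1: List[List[float]],
                         R2: List[List[float]]) -> List[Tuple[int, int]]:
    """All pairs (i, j) of pure actions that are mutual best responses."""
    m, n = len(R1), len(R1[0])
    result = []
    for i in range(m):
        for j in range(n):
            row_is_best = all(R1[i][j] >= R1[k][j] for k in range(m))   # P1 cannot gain
            col_is_best = all(R2[i][j] >= R2[i][l] for l in range(n))   # P2 cannot gain
            if row_is_best and col_is_best:
                result.append((i, j))
    return result


# Coordination game with a = 2, b = 1: both agents receive the same reward.
coordination = [[2.0, 0.0], [0.0, 1.0]]
equilibria = pure_nash_equilibria(coordination, coordination)

# Zero-sum counterexample without any equilibrium in pure actions.
pennies = [[-1.0, 1.0], [1.0, -1.0]]
negated = [[-x for x in row] for row in pennies]
no_pure_equilibrium = pure_nash_equilibria(pennies, negated)
```

The two equilibria of the coordination game yield different returns, and combining the row of one with the column of the other gives a pair that is not an equilibrium, which is exactly what Theorem 2.20 rules out in the zero-sum case.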

2.21 Definition (Value of a Matrix Game [66])

A two-person zero-sum game $\Gamma$ is said to have a value $V^*$ if and only if

$$\sup_{\pi_1} \inf_{\pi_2} E_{\pi_1,\pi_2}\{\mathcal{R}_1\} = \inf_{\pi_2} \sup_{\pi_1} E_{\pi_1,\pi_2}\{\mathcal{R}_1\} \quad (=: V^*). \tag{2.40}$$

This means that agent $P_1$ can assure itself a return $\mathcal{R}_1 = V^*$ when acting optimally, independently of what the other agent does. Vice versa, agent $P_2$ can assure itself a return $\mathcal{R}_2 = -\mathcal{R}_1 = -V^*$.(22) The left-hand side of Equation 2.40 is called the lower value of the game and the right-hand side the upper value. The upper and lower value represent security levels of the two agents.

For a matrix game $\Gamma = [M]$ with value $V^*(M)$, the policies $\pi_{1,\varepsilon}, \pi_{2,\varepsilon}$ are called $\varepsilon$-optimal ($\varepsilon \geq 0$) for agents $P_1$ and $P_2$ if

$$\inf_{\pi_2} E_{\pi_{1,\varepsilon},\pi_2}\{\mathcal{R}_1\} \geq V^* - \varepsilon \quad \text{and} \quad \sup_{\pi_1} E_{\pi_1,\pi_{2,\varepsilon}}\{\mathcal{R}_1\} \leq V^* + \varepsilon, \tag{2.41}$$

respectively. 0-optimal strategies are called optimal. The set of optimal strategies for player $P_k$ of a matrix game $\Gamma = [M]$ with matrix $M$ is denoted by $O^k(M)$ and the $\varepsilon$-optimal set by $O^k_\varepsilon(M)$.

The following proposition can be obtained elementarily:

2.22 Proposition (Optimality of Saddle Points [66])

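Let $\Gamma = [M]$ be a matrix game. The following holds:

1.) $\sup_{\pi_1} \inf_{\pi_2} E_{\pi_1,\pi_2}\{\mathcal{R}_1\} \leq \inf_{\pi_2} \sup_{\pi_1} E_{\pi_1,\pi_2}\{\mathcal{R}_1\}$.

2.) (Optimality of Saddle Points) When there exist policies $\pi_1^*, \pi_2^*$ such that for all $\pi_1, \pi_2$: $E_{\pi_1,\pi_2^*}\{\mathcal{R}_1\} \leq E_{\pi_1^*,\pi_2^*}\{\mathcal{R}_1\} \leq E_{\pi_1^*,\pi_2}\{\mathcal{R}_1\}$, then the value of the game exists and $\pi_1^*, \pi_2^*$ are optimal.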
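Restricted to pure actions, the two sides of the first inequality become a row-wise maximin and a column-wise minimax, which bound the value from below and above. The sketch below is illustrative; the function name and the sample matrices are assumptions.

```python
from typing import List, Tuple


def pure_security_levels(M: List[List[float]]) -> Tuple[float, float]:
    """Maximin over rows and minimax over columns, both restricted to pure actions."""
    lower = max(min(row) for row in M)                      # P1 secures at least this
    upper = min(max(M[i][j] for i in range(len(M)))         # P2 concedes at most this
                for j in range(len(M[0])))
    return lower, upper


with_saddle = [[3.0, 1.0], [4.0, 2.0]]
pennies = [[-1.0, 1.0], [1.0, -1.0]]

saddle_levels = pure_security_levels(with_saddle)
pennies_levels = pure_security_levels(pennies)
```

For the first matrix both levels coincide, so a saddle point in pure actions exists (second row, second column). For matching pennies the levels differ, and the gap can only be closed by mixed actions.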

An essential theorem for matrix games is the following one which was shown, among others, by J. von Neumann ([154], comment in [209]):

2.23 Theorem (Minimax Theorem for Matrix Games [66])

For every matrix $M \in \mathbb{R}^{m,n}$ the corresponding matrix game $\Gamma = [M]$ has a value $V^* = V^*(M)$ and both agents $P_1, P_2$ possess optimal strategies $\pi_1^* \in \mathbb{R}^m$, $\pi_2^* \in \mathbb{R}^n$.

A relatively simple direct proof of the minimax theorem can be found in [13]. More

elegantly, the theorem is obtained by duality in linear programming [194].(23)

A generalisation of convergence properties from MDPs to 2P-ZS-MGs [203] needs a non-expansion property of determining the value of a game in terms of the matrix entries. In view of this, the following matrix distance, which does not coincide with the $\|\cdot\|_\infty$-norm for matrices, is introduced.

2.24 Definition (Distance of Game Matrices)

For two matrices $M_1, M_2 \in \mathbb{R}^{m,n}$ the game matrix distance $d_\Gamma$ is defined by

$$d_\Gamma(M_1, M_2) = \max_{i, j} |(M_1 - M_2)_{ij}|. \tag{2.42}$$

(22)The notation deviates from standard as far as throughout this thesis every agent maximises its own

return, which is in the two-player zero-sum case equivalent to minimise the return of the other agent.

(23)The duality theorem is stated as Theorem 2.30.

2.25 Proposition (Properties of Matrix Games [66])

Let $M, M_1, M_2 \in \mathbb{R}^{m,n}$ be matrices and let $J \in \mathbb{R}^{m,n}$ denote the matrix with $J_{ij} = 1$ for all $i, j$.

1.) (Addition of Constants) For any $c \in \mathbb{R}$: $V^*(M + cJ) = V^*(M) + c$, and the optimal strategy sets $\mathcal{O}_1, \mathcal{O}_2$ for both players are the same for the matrix games $[M]$ and $[M + cJ]$. Thus, the assumptions $M_{ij} > 0$ and $V^*(M) > 0$ are not restrictions.

2.) (Monotonicity) If $(M_1)_{ij} \le (M_2)_{ij}$ for all $i, j$ then $V^*(M_1) \le V^*(M_2)$.

3.) (Non-Expansion, Continuity of Value) It holds that $|V^*(M_1) - V^*(M_2)| \le d_\Gamma(M_1, M_2)$.
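One way to see that part 3 follows from parts 1 and 2 is the short argument below. It is a sketch for illustration and not necessarily the argument used in [66]; the abbreviation $d$ is introduced here only for brevity.

```latex
% Abbreviate d := d_\Gamma(M_1, M_2), so that entrywise M_1 \le M_2 + dJ and M_2 \le M_1 + dJ.
\begin{align*}
V^*(M_1) &\le V^*(M_2 + dJ) = V^*(M_2) + d  && \text{(monotonicity, then addition of constants)} \\
V^*(M_2) &\le V^*(M_1 + dJ) = V^*(M_1) + d  && \text{(same argument with the roles exchanged)} \\
\Longrightarrow \quad |V^*(M_1) - V^*(M_2)| &\le d = d_\Gamma(M_1, M_2).
\end{align*}
```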

A reduction of matrix games can sometimes be achieved by eliminating dominated actions.

While this is theoretically interesting and may lead to non-trivial reductions if applied

recursively, the numerical effort of comparing each row to every other and likewise for the

columns seems to only be appropriate if a direct numerical solution of a matrix game fails.

2.26 Definition (Dominance of Actions [13])

Let $\Gamma = [M]$ be a matrix game. The $i$-th action of player $P_1$ (row of matrix $M$) is said to dominate the $j$-th action (row) if $e_i^T M \ge e_j^T M$ (i. e. for all $k$: $m_{ik} \ge m_{jk}$) and in one component the inequality is strict. Similarly, the $i$-th action of player $P_2$ (column of matrix $M$) is said to dominate the $j$-th action (column) if $M e_i \le M e_j$ and in one component the inequality is strict. The actions are called strictly dominating if the inequality is strict for all components.

2.27 Theorem (Elimination of Dominant Actions [159])

Let $\Gamma = [M]$ be a matrix game, and assume that rows $i_1, \dots, i_k$ of $M$ are dominated. Then agent $P_1$ has an optimal policy $\pi_1$ with $(\pi_1)_{i_1} = \dots = (\pi_1)_{i_k} = 0$. Moreover, any optimal policy for the game obtained by removing the dominated rows will also be an optimal policy for the original game.

The analogous statement for agent $P_2$ results from applying the theorem to $-M^T$. [13] gives an example showing that deleting non-strictly dominated actions can reduce the number of optimal strategies, but as long as the aim is to compute the value of a game and only one optimal strategy this raises no problem.
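The recursive elimination described above can be sketched as follows. This is a minimal illustration of Definition 2.26 and Theorem 2.27 under the thesis convention that $P_1$ (rows) maximises and $P_2$ (columns) minimises; the function names and the example matrix are hypothetical and not taken from the thesis.

```python
def row_dominates(M, i, j):
    """True if row i dominates row j for the maximising row player P1."""
    pairs = list(zip(M[i], M[j]))
    return all(a >= b for a, b in pairs) and any(a > b for a, b in pairs)


def col_dominates(M, i, j):
    """True if column i dominates column j for the minimising column player P2."""
    pairs = [(row[i], row[j]) for row in M]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def eliminate_dominated(M):
    """Repeatedly delete one dominated row or column until none is left.

    Returns the indices (rows, cols) of the original matrix that survive.
    """
    rows = list(range(len(M)))
    cols = list(range(len(M[0])))
    changed = True
    while changed:
        changed = False
        sub = [[M[r][c] for c in cols] for r in rows]  # current reduced game
        for j in range(len(rows)):
            if any(row_dominates(sub, i, j) for i in range(len(rows)) if i != j):
                del rows[j]
                changed = True
                break
        if changed:
            continue  # rebuild the reduced game before checking further
        for j in range(len(cols)):
            if any(col_dominates(sub, i, j) for i in range(len(cols)) if i != j):
                del cols[j]
                changed = True
                break
    return rows, cols


# Hypothetical 3x3 example: row 1 is dominated by row 0; afterwards
# column 2 is dominated by column 0 in the reduced game.
M = [[3, 1, 4],
     [2, 0, 1],
     [1, 2, 5]]
rows, cols = eliminate_dominated(M)
reduced = [[M[r][c] for c in cols] for r in rows]
```

Each pass compares every row with every other row and likewise for the columns, which reflects the numerical effort mentioned before Definition 2.26.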

2.28 Example (Robot Soccer, 3)

A well-known matrix game is matching pennies which is given by the matrix

\[ M = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}. \tag{2.43} \]

In a robot soccer environment the same type of matrix game may arise if one simplifies a

situation where a robot tries to dribble the ball around one opponent robot. The decision

of the dribbling agent is to go left (a1) or right (a2), while the opponent agent has to decide

to block left (o1) or right (o2). If both choose the same direction then the dribbler loses

the ball, otherwise the move is successful.

It is also possible to consider the complete tactics of a soccer game as an action of a matrix

game. More precisely, the actions are to choose the tactics before the game starts and the

reward matrix is related to the expected scoring at the end of the game. Obviously, this

concept is not suitable to initially construct different tactics and the tactics of particular

game situations are not alterable. Furthermore, obtaining the expectations of all strategy

combinations requires a lot of games to be played.
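For the matching pennies game of Equation 2.43 the value and the optimal mixed strategies can be written down in closed form. The following sketch uses the well-known formula for 2x2 matrix games (row player maximising); the function name `solve_2x2` is hypothetical, and exact rational arithmetic is used only to keep the result free of rounding.

```python
from fractions import Fraction


def solve_2x2(M):
    """Value and optimal strategies of a 2x2 matrix game (row player maximises)."""
    (a, b), (c, d) = [[Fraction(x) for x in row] for row in M]
    lower = max(min(a, b), min(c, d))  # maximin over pure row actions
    upper = min(max(a, c), max(b, d))  # minimax over pure column actions
    if lower == upper:
        # Pure saddle point: pick a row attaining the maximin
        # and a column attaining the minimax.
        i = 0 if min(a, b) == lower else 1
        j = 0 if max(a, c) == upper else 1
        return lower, (Fraction(1 - i), Fraction(i)), (Fraction(1 - j), Fraction(j))
    # No pure saddle point: both players mix so that the opponent is indifferent.
    den = a - b - c + d
    p = (d - c) / den  # probability of row 0
    q = (d - b) / den  # probability of column 0
    value = (a * d - b * c) / den
    return value, (p, 1 - p), (q, 1 - q)


# Matching pennies (Equation 2.43): dribbler chooses left/right, opponent blocks left/right.
value, pi1, pi2 = solve_2x2([[-1, 1], [1, -1]])
```

Under these assumptions both agents choose left and right with probability 1/2 each and the value of the game is 0, so neither the dribbler nor the blocker has an advantage.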

2.3.2 Numerical Methods

Bimatrix games can be solved by the Lemke-Howson algorithm, see e. g. the excellent

survey of (bi)linear methods for solving bimatrix games in [194] or the original work [106],

or by the Mangasarian-Stone algorithm (Equation 2.46).(24) Accordingly, it is possible to

numerically solve matrix games by linear programming. An overview of linear programming is given in [208] as well as in the textbook [39] by Dantzig, who pioneered the simplex method.

Linear Programming (LP) and Matrix Games

For determining the optimal value function V∗of a Markov game it is necessary to calculate

the value of matrix games. Numerically, the solution of matrix games is addressed by LP,

which motivates the following description. It includes a definition of linear programs, the

duality theorem, the solution of matrix games(25), and concludes with the Mangasarian-

Stone algorithm for the solution of bimatrix games.

2.29 Definition (Primal and Dual Linear Program)

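A linear program is determined by a triplet $(M, b, c)$ where $M \in \mathbb{R}^{m,n}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$. This triplet defines two related optimisation problems:

\[
\begin{array}{lll}
\textbf{Primal} & \qquad & \textbf{Dual} \\
\text{maximise } c^T x & & \text{minimise } b^T y \\
\text{subject to } Mx \le b & & \text{subject to } M^T y \ge c \\
\phantom{\text{subject to }} x \ge 0 & & \phantom{\text{subject to }} y \ge 0
\end{array}
\tag{2.44}
\]

The sets $S_p = \{x \in \mathbb{R}^n \mid Mx \le b,\ x \ge 0\}$ and $S_d = \{y \in \mathbb{R}^m \mid M^T y \ge c,\ y \ge 0\}$ are called feasible sets for the primal and dual program, respectively. Elements of these sets are feasible solutions, and the primal or dual linear program is called feasible if the corresponding feasible set is non-empty.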

2.30 Proposition (LP Duality Theorem)

1.) If either of the primal or dual linear programs has a finite optimal solution, so does

the other, and the corresponding values of the objective functions are equal.

2.) If either of the primal or dual linear programs has an unbounded objective, the other

problem has no feasible solution.

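For a game matrix $M \in \mathbb{R}^{m,n}$ define the matrix $\widetilde{M} \in \mathbb{R}^{m+1,n+1}$ and the vectors $\tilde b \in \mathbb{R}^{m+1}$, $\tilde c \in \mathbb{R}^{n+1}$ by

\[
\widetilde{M} = \begin{pmatrix} & & & -1 \\ & M & & \vdots \\ & & & -1 \\ -1 & \cdots & -1 & 0 \end{pmatrix}, \quad
\tilde b = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ -1 \end{pmatrix}, \quad
\tilde c = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ -1 \end{pmatrix}.
\tag{2.45}
\]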

The solution of a matrix game Γ = [M]by LP then is treated by the following proposition:

2.31 Proposition (LP for Solving Matrix Games [66])

Let $\Gamma = [M]$ be a matrix game with $V^*(M) > 0$ and let $\widetilde{M}, \tilde b, \tilde c$ be as in Equation 2.45, then

(24)Other options also exist such as fictitious play.

(25)Reversely, it is also possible to solve linear programs by special matrix games [208].

1.) the primal and dual LP $(\widetilde{M}, \tilde b, \tilde c)$ are feasible and thus have bounded solutions, and

2.) the optimal values of both programs equal $-V^*(M)$ and

$\pi_2 \in \mathcal{O}_2(M) \iff (\pi_2, V^*(M)) \in \mathbb{R}^{n+1}$ is optimal in the primal LP,

$\pi_1 \in \mathcal{O}_1(M) \iff (\pi_1, V^*(M)) \in \mathbb{R}^{m+1}$ is optimal in the dual LP.
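The statement of the proposition can be checked on a small instance without an LP solver. The sketch below assumes the augmented data has the block form $\widetilde{M} = \begin{pmatrix} M & -\mathbf{1} \\ -\mathbf{1}^T & 0 \end{pmatrix}$, $\tilde b = (0, \dots, 0, -1)^T$, $\tilde c = (0, \dots, 0, -1)^T$, and it uses matching pennies shifted by $+2$ (so that $V^* = 2 > 0$ by the addition of constants). All function names are hypothetical. It verifies that $(\pi_2, V^*)$ is primal feasible, $(\pi_1, V^*)$ is dual feasible, and both objective values coincide, which certifies optimality by LP duality.

```python
from fractions import Fraction


def augmented_lp(M):
    """Build (M~, b~, c~) for an m x n game matrix M (assumed block form, see lead-in)."""
    m, n = len(M), len(M[0])
    Mt = [[Fraction(x) for x in row] + [Fraction(-1)] for row in M]
    Mt.append([Fraction(-1)] * n + [Fraction(0)])
    b = [Fraction(0)] * m + [Fraction(-1)]
    c = [Fraction(0)] * n + [Fraction(-1)]
    return Mt, b, c


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def primal_feasible(Mt, b, x):
    """Check x >= 0 and M~ x <= b~."""
    return all(xi >= 0 for xi in x) and all(dot(row, x) <= bi for row, bi in zip(Mt, b))


def dual_feasible(Mt, c, y):
    """Check y >= 0 and M~^T y >= c~."""
    columns = list(zip(*Mt))
    return all(yi >= 0 for yi in y) and all(dot(col, y) >= cj for col, cj in zip(columns, c))


# Matching pennies shifted by +2: value 2, optimal strategies (1/2, 1/2) for both agents.
M = [[1, 3], [3, 1]]
Mt, b, c = augmented_lp(M)
half = Fraction(1, 2)
x = [half, half, Fraction(2)]  # candidate (pi_2, V*) for the primal LP
y = [half, half, Fraction(2)]  # candidate (pi_1, V*) for the dual LP

primal_ok = primal_feasible(Mt, b, x)
dual_ok = dual_feasible(Mt, c, y)
primal_value = dot(c, x)  # objective c~^T x
dual_value = dot(b, y)    # objective b~^T y
```

Since a primal feasible and a dual feasible point with equal objective values are both optimal, the common value $-2 = -V^*(M)$ agrees with part 2 of the proposition for this instance.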

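The last row of $\widetilde{M}$ ensures in the primal LP that $\sum_i (\pi_2)_i \ge 1$ while the last column ensures in the dual LP that $\sum_i (\pi_1)_i \le 1$. The condition $V^*(M) > 0$ has the effect that these sums equal $1$ for an optimal solution, so that the constraints can be replaced by $\pi_1 \in PD(\mathcal{A}(s_0))$, $\pi_2 \in PD(\mathcal{O}(s_0))$ [66]. The application of LP can also be seen as a special case ($M_1 = M$, $M_2 = -M$, separating into two independent LPs) of the Mangasarian-Stone algorithm for bimatrix games [129], which is the following bilinear quadratic program:

\[
\begin{array}{ll}
\text{maximise} & \pi_1^T M_1 \pi_2 + \pi_1^T M_2 \pi_2 - (c_1 + c_2) \\
\text{subject to} & \pi_1 \in PD(\mathcal{A}(s_0)),\ \pi_2 \in PD(\mathcal{O}(s_0)) \\
& e_{a_i}^T M_1 \pi_2 \le c_1 \quad (\text{for all } a_i \in \mathcal{A}(s_0)) \\
& \pi_1^T M_2 e_{o_i} \le c_2 \quad (\text{for all } o_i \in \mathcal{O}(s_0)) \\
& c_1, c_2 \in \mathbb{R},
\end{array}
\tag{2.46}
\]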

where $e_{a_i} \in PD(\mathcal{A}(s_0))$, $e_{o_i} \in PD(\mathcal{O}(s_0))$ are policies corresponding to a pure action, i. e. $(e_{a_i}) = e_i$ being the $i$-th unit vector. The Mangasarian-Stone algorithm can be suitably enhanced to determine all Nash equilibria in a general (nonzero-sum multi-stage $n$-agent) Markov game [65]. The optimisation criterion as well as the constraints look very similar to the Bellman equation of MDPs (Equations 2.29 and 2.30), which is reasonable because fixing all but one agent's strategies results in an MDP and the Nash equilibrium is defined by the non-motivation of a unilateral deviation of one agent.

2.3.3 Complexity, Algorithmic Issues and Software

Solving linear programs is in P (polynomial time) for interior point methods, whereas the standard simplex method has exponential worst-case complexity [208](26). Numerical studies can also be found therein, as well as a bound on the approximation quality of matrix game solutions which was found by von Neumann [155]: To approximate the value $V^*(M)$ of a matrix game $\Gamma = [M]$ with matrix $M \in \mathbb{R}^{m,n}$ by a factor of $\varepsilon$, a special linear programming algorithm, different from the simplex method, needs at most

\[ \frac{(m+n)\,(\max_{ij} M_{ij} - \min_{ij} M_{ij})^2}{\varepsilon^2} \tag{2.47} \]

steps, each of which requires about $4mn$ flops.

Throughout this thesis the solution of matrix games is performed by linear programming

methods which are embedded in the MATLAB software package, namely a primal-dual

method and a simplex method.

2.4 Two Player Zero Sum Markov Games (2P-ZS-MGs)

Finally, two-player zero-sum Markov games provide a suitable concept for modelling robot

soccer. As mentioned before they are a generalisation of both MDPs and matrix games,

(26)There exist subexponential variants of the simplex method but they are nevertheless superpolynomial.

for they describe a situation in which two agents try to achieve opposite goals. Despite

the generalisation of MDPs, many concepts, statements, and algorithms are transferable

from MDPs to 2P-ZS-MGs. For example, the Bellman equation and, therefore, the basic algorithms value iteration and Q-learning only have to be changed by determining the solution of a matrix game (max-min) instead of the maximisation problem (max).

One major dissimilarity is the necessity for mixed actions in 2P-ZS-MGs(27) while for

MDPs it is guaranteed that an optimal policy in pure actions exists [168]. This is induced

by the fact that the decisions of both agents at each decision epoch have to be made at

the same time without knowledge of the other agent's decision. The implications are far reaching for more advanced concepts of learning, such as the inapplicability of team learning algorithms with coordination mechanisms which rely on a deterministic optimal policy. [103, 90] give some ideas of coordination mechanisms to overcome difficulties with

previous approaches [89, 102]. However, applicability of 2P-ZS-MGs includes not only two-

person and two-team board games and sports games with competitive character but also

the modelling of a worst-case optimal policy in an MDP with uncertainties by considering

the “uncertainty generator” as a competitive player (a “game against nature” [147]).

2.4.1 Basics and Problem Definitions

2.32 Definition (Two Player Zero-Sum Markov Game)

A (discrete time, finite) two-player zero-sum Markov game $M = (D, \mathcal{S}, \mathcal{SAO}, T, R)$ is given by

1.) decision epochs $D = \mathbb{N}_0$,

2.) a (finite) state space $\mathcal{S}$,

3.) a (finite) state action space $\mathcal{SAO} = \{(s, a, o) : s \in \mathcal{S}, a \in \mathcal{A}(s), o \in \mathcal{O}(s)\}$ where $\mathcal{A}(s), \mathcal{O}(s)$ are the (finite) sets of available actions for the agents $P_1$ and $P_2$, respectively, in state $s$,

4.) a transition function $T : \mathcal{SAO} \times \mathcal{S} \to [0,1]$ with $T(s, a, o, s')$ being the probability of reaching state $s'$ if choosing the action pair $(a, o)$ in state $s$,(28)

5.) a (deterministic) reward function $R : \mathcal{SAO} \to \mathbb{R}$.

Policies $\pi_i$ are defined for each player as for MDPs. The goal of determining an optimal policy for agent $P_1$ and the definitions of value functions $V$ and returns $R$ are (nearly) the same as for MDPs.(29) The reason why the value function and return are independent of the policy $\pi_2$ of the second agent is that it is always assumed to be a worst-case answer against the first agent ($\pi_2 \in BR(\pi_1)$). Hence, the second player's policy is not an independent variable but depends on the policy of the first one: $\pi_2 = \pi_2(\pi_1)$. Theorem 2.20 yields the well-definedness of the optimal value function: the zero-sum property guarantees that for all $\pi_2^* \in BR(\pi_1^*)$ the value functions and returns are equal. [159] contains a proof of existence of a pair of (stationary) optimal policies for 2P-ZS-MGs.

As alluded to in the introduction the max-operator simply needs to be replaced by a (max-

min)-operator in the Bellman equation of an MDP (Equation 2.29) in order to obtain the

corresponding optimality principle for a 2P-ZS-MG:

(27)[112] refers to rock-paper-scissors as an example for the need of mixed actions in 2P-ZS-MGs.

(28) An alternative for 2P-ZS-MGs with deterministic state transitions is to define the modified transition function $\widetilde{T} : \mathcal{SAO} \to \mathcal{S}$ with $\widetilde{T}(s, a, o) = s'$ being the next state which is reached with probability 1.

(29)Especially, the discounted return Rdisc is considered to be standard.

2.33 Theorem (Optimality Principle (Shapley’s Theorem [66]))

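A (state-) value function $V$ of a 2P-ZS-MG is optimal iff it is the unique solution to the Bellman equation:

\[ V(s) = \max_{\pi_s \in PD(\mathcal{A}(s))} \ \min_{o \in \mathcal{O}(s)} \ \sum_{a \in \mathcal{A}(s)} \pi_s(a) \cdot Q(s, a, o) \tag{2.48} \]

where

\[ Q(s, a, o) = R(s, a, o) + \gamma \sum_{s' \in \mathcal{S}} T(s, a, o, s') \cdot V(s'). \tag{2.49} \]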

As for MDPs, the result of plugging Equation 2.49 into Equation 2.48 can be abbreviated by the Bellman operator $B_{MG}$, which shortens the notation of the Bellman equation to $V = B_{MG} V$, and the operator is a contraction with rate $\gamma$ with respect to $\|\cdot\|_\infty$ [97]. Again, the Bellman equation is a non-linear fixed point equation for the optimal value function $V^* = V^{\pi^*}$.

One additional result which is not mentioned for MDPs but is also valid because these

are special cases of 2P-ZS-MGs concerns the continuous dependency of the optimal value

function on the Markov game:

2.34 Theorem (Continuity of Optimal Value Functions [66])

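Given a state space $\mathcal{S}$ and a state-action space $\mathcal{SAO}$, the optimal value function $V^* = V^*(R, T, \gamma)$ is continuous with respect to the metric

\[ d_{R,T,\gamma}((R_1, T_1, \gamma_1), (R_2, T_2, \gamma_2)) = \max\{\, \|R_1 - R_2\|_R,\ \|T_1 - T_2\|_T,\ |\gamma_1 - \gamma_2| \,\} \tag{2.50} \]

with

\[ \|R_1 - R_2\|_R = \max_{(s,a,o)} |R_1(s, a, o) - R_2(s, a, o)| \tag{2.51} \]

and

\[ \|T_1 - T_2\|_T = \max_{(s,a,o)} \sum_{s' \in \mathcal{S}} |T_1(s, a, o, s') - T_2(s, a, o, s')|. \tag{2.52} \]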

2.4.2 Numerical Methods

As for MDPs, the two important classes of approaches are dynamic programming (DP) and

reinforcement learning (RL) methods. Most of the basic methods can also be applied to 2P-

ZS-MGs including value iteration [159], policy iteration (without finite time convergence),

and Q-learning [203]. The reason is that not only the max-operator is a non-expansion (i. e. $|\max_a f(s,a) - \max_a g(s,a)| \le \max_a |f(s,a) - g(s,a)|$ for all functions $f, g$ and all states $s$, [203]) but also the max-min: $|\max_{\pi_a} \min_o f(s,a,o) - \max_{\pi_a} \min_o g(s,a,o)| \le \max_{(a,o)} |f(s,a,o) - g(s,a,o)|$. The statement for the max-operator follows from the fact that, if without restriction $\max_a f(s,a) \ge \max_a g(s,a)$ and $a^* = \arg\max_a f(s,a)$, then $|\max_a f(s,a) - \max_a g(s,a)| = f(s,a^*) - \max_a g(s,a) \le f(s,a^*) - g(s,a^*) \le \max_a |f(s,a) - g(s,a)|$. Regarding the max-min, the non-expansion property is already stated in Proposition 2.25, because the max-min of $f(s,a,o)$ is the value of a matrix game with action pairs $(a,o)$ (for each $f$ and each $s$).

The definition of value iteration is the only one that is repeated in this section to stress the similarities: the difference is in fact only that $B_{MDP}$ is replaced by $B_{MG}$. According to [21] the value iteration algorithm for 2P-ZS-MGs was already designed by Shapley [186] before the corresponding algorithm for MDPs.

2.35 Definition (Value Iteration (2P-ZS-MG))

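The following algorithm is called value iteration: select $\varepsilon > 0$, choose an arbitrary initial guess $V_0 \in \mathbb{R}^{|\mathcal{S}|}$ for the (state-) value function, and determine iteratively $V_k = B_{MG} V_{k-1}$ for $k = 1, 2, \dots$ until $\|V_{k+1} - V_k\|_\infty \le \frac{\varepsilon}{2} \cdot \frac{1-\gamma}{\gamma}$.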
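The algorithm can be illustrated on a toy game. The sketch below is an assumption-laden example and not the implementation used in this thesis: both agents have exactly two actions in every state, so that the matrix game in each Bellman backup can be solved by the closed-form 2x2 formula instead of an LP solver. The game itself is hypothetical: in state 0 the dribbler earns reward 1 and stays in state 0 if the opponent blocks the wrong side, and otherwise the game moves to the absorbing state 1 with reward 0.

```python
def value_2x2(Q):
    """Value of a 2x2 matrix game for the maximising row player (closed form)."""
    (a, b), (c, d) = Q
    lower = max(min(a, b), min(c, d))  # maximin in pure actions
    upper = min(max(a, c), max(b, d))  # minimax in pure actions
    if lower == upper:                 # pure saddle point
        return lower
    return (a * d - b * c) / (a - b - c + d)  # mixed equilibrium value


def value_iteration(states, R, T, gamma, eps):
    """Value iteration with the stopping rule (eps/2) * (1 - gamma) / gamma.

    R[s][a][o] is the reward, T[s][a][o] a dict {next_state: probability}.
    """
    V = {s: 0.0 for s in states}               # initial guess V_0
    threshold = eps / 2 * (1 - gamma) / gamma
    while True:
        V_new = {}
        for s in states:
            # Q(s, a, o) = R(s, a, o) + gamma * sum_{s'} T(s, a, o, s') V(s')
            Q = [[R[s][a][o] + gamma * sum(p * V[t] for t, p in T[s][a][o].items())
                  for o in (0, 1)] for a in (0, 1)]
            V_new[s] = value_2x2(Q)            # max-min instead of max
        diff = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if diff <= threshold:
            return V


# Hypothetical two-state dribbling game (see lead-in).
states = [0, 1]
R = {0: [[0.0, 1.0], [1.0, 0.0]],
     1: [[0.0, 0.0], [0.0, 0.0]]}
T = {0: [[{1: 1.0}, {0: 1.0}], [{0: 1.0}, {1: 1.0}]],
     1: [[{1: 1.0}, {1: 1.0}], [{1: 1.0}, {1: 1.0}]]}
V = value_iteration(states, R, T, gamma=0.5, eps=1e-6)
```

For this toy game the fixed point can be checked by hand: the stage game in state 0 has value $(1 + \gamma V(0))/2$, hence $V^*(0) = 1/(2 - \gamma) = 2/3$ for $\gamma = 0.5$, and $V^*(1) = 0$.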

This algorithm would converge to $V^*$, and provide an $\frac{\varepsilon}{2}$-approximation for the value function estimate and an $\varepsilon$-optimal stationary policy (as for MDPs), if one assumes that $B_{MG}$ can be calculated exactly. However, as proposed in Section 2.3, the solution of the matrix game by LP methods can only be determined numerically. Although for numerically applying $B_{MDP}$ the maximum of finitely many values can be determined exactly, the representation of real numbers by machine numbers may lead to small discrepancies.

In the following, the result of Lemma 4.2 is anticipated since it fits well into this con-

text and its interpretation is new in the sense that the error of solving a matrix game is

mathematically equivalent to the error of using supervised learning techniques. Also, the

comments and discussion in Section 4.2 are correspondingly valid.

2.36 Lemma (Error of Numerical Value Iteration (2P-ZS-MG))

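Let $B_{MG}$ be the Bellman operator and $\widetilde{B}_{MG}$ a numerical realisation with $\|\widetilde{B}_{MG} V - B_{MG} V\|_\infty \le \varepsilon_1 \|V\|_\infty$.(30) Let further $\widetilde{V}_0 = V_0$, $V_k = (B_{MG})^k V_0$, and $\widetilde{V}_k = (\widetilde{B}_{MG})^k \widetilde{V}_0$ be the corresponding $k$-th value iterates. Then

\[ \|\widetilde{V}_k - V_k\|_\infty \le \underbrace{\varepsilon_1 \cdot \sum_{i=0}^{k-1} \gamma^{k-i-1} \|\widetilde{V}_i\|_\infty}_{= E_V(k)} \le \frac{\varepsilon_1}{1-\gamma} \max_{i=0,\dots,k-1} \|\widetilde{V}_i\|_\infty. \tag{2.53} \]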
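The bound can be made plausible by a short recursion. The following is a sketch of one possible argument, given here for illustration only; the abbreviation $e_k$ is introduced just for this sketch, and the detailed proof is the subject of Lemma 4.2.

```latex
% Abbreviate e_k := \|\widetilde V_k - V_k\|_\infty, so that e_0 = 0 because \widetilde V_0 = V_0.
\begin{align*}
e_k &= \|\widetilde B_{MG}\widetilde V_{k-1} - B_{MG} V_{k-1}\|_\infty \\
    &\le \|\widetilde B_{MG}\widetilde V_{k-1} - B_{MG}\widetilde V_{k-1}\|_\infty
       + \|B_{MG}\widetilde V_{k-1} - B_{MG} V_{k-1}\|_\infty \\
    &\le \varepsilon_1 \|\widetilde V_{k-1}\|_\infty + \gamma\, e_{k-1}
       && \text{(numerical error, then contraction of } B_{MG}\text{)} \\
    &\le \varepsilon_1 \sum_{i=0}^{k-1} \gamma^{k-i-1} \|\widetilde V_i\|_\infty
       && \text{(unrolling the recursion down to } e_0 = 0\text{)} \\
    &\le \varepsilon_1 \max_{i=0,\dots,k-1} \|\widetilde V_i\|_\infty \sum_{j=0}^{k-1} \gamma^j
     \;\le\; \frac{\varepsilon_1}{1-\gamma} \max_{i=0,\dots,k-1} \|\widetilde V_i\|_\infty .
\end{align*}
```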

2.37 Corollary (New Stopping Criterion for Numerical Value Iteration)

If the stopping criterion is changed to $\|\widetilde{V}_{k+1} - \widetilde{V}_k\|_\infty \le c(\widetilde{V}_0, \dots, \widetilde{V}_k)$ with

\[ c(\widetilde{V}_0, \dots, \widetilde{V}_k) = \left( \frac{\varepsilon}{2} - E_V(k+1) \right) \cdot \frac{1-\gamma}{\gamma} - \bigl( E_V(k+1) + E_V(k) \bigr) \tag{2.54} \]

where the errors $E_V(k)$ depend on the first $k-1$ numerical value iterates (Equation 4.3), and secondly, if the numerical approximation $\widetilde{B}_{MG}$ is a contraction of rate $\gamma$ (like $B_{MG}$), then also the numerical approximation of value iteration yields results comparable to the original value iteration.
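The form of the bound $c$ can be motivated by a triangle-inequality argument. The sketch below assumes that $E_V(k)$ denotes the bound on $\|\widetilde V_k - V_k\|_\infty$ from Lemma 2.36 and that the exact iterates satisfy the standard contraction estimate $\|V_{k+1} - V^*\|_\infty \le \frac{\gamma}{1-\gamma} \|V_{k+1} - V_k\|_\infty$; it is meant as an illustration and is not the argument of Chapter 4.

```latex
\begin{align*}
\|\widetilde V_{k+1} - V^*\|_\infty
  &\le \|\widetilde V_{k+1} - V_{k+1}\|_\infty + \|V_{k+1} - V^*\|_\infty \\
  &\le E_V(k+1) + \frac{\gamma}{1-\gamma}\, \|V_{k+1} - V_k\|_\infty \\
  &\le E_V(k+1) + \frac{\gamma}{1-\gamma}
       \Bigl( \|\widetilde V_{k+1} - \widetilde V_k\|_\infty + E_V(k+1) + E_V(k) \Bigr).
\end{align*}
% The right-hand side is at most \varepsilon/2 exactly when
% \|\widetilde V_{k+1} - \widetilde V_k\|_\infty
%   \le (\varepsilon/2 - E_V(k+1)) (1-\gamma)/\gamma - (E_V(k+1) + E_V(k)).
```

Under these assumptions the modified criterion yields the same $\frac{\varepsilon}{2}$-accuracy for the numerical iterate that the original criterion yields for the exact one.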

2.4.3 Complexity, Algorithmic Issues and Software

In principle, the statements for MDPs of Section 2.2.3 are valid, e. g. for the effort per iteration, with the difference that the costs of solving $|\mathcal{S}|$ matrix games of possibly different sizes $|\mathcal{A}(s)| \cdot |\mathcal{O}(s)|$ have to be added. The complexity of solving matrix games, e. g. by LP, is discussed in Section 2.3.3. Furthermore, Condon shows for the case of simple stochastic games, which are a restricted class of two-player games, that deciding whether the probability of winning for one player is $> 0.5$ is in NP $\cap$ co-NP [33].

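(30) $\|\widetilde{B}_{MG} V - B_{MG} V\|_\infty \le \varepsilon_1 \|V\|_\infty$ for all $V$ means that the operator $B_1 := \frac{1}{\varepsilon_1}(\widetilde{B}_{MG} - B_{MG})$ has operator norm at most 1: $\|B_1\| = \sup_{V \ne 0} \frac{\|B_1 V\|_\infty}{\|V\|_\infty} \le 1$. Furthermore, the above definition implies $\widetilde{B}_{MG} = B_{MG} + \varepsilon_1 B_1$.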

There are freely available software frameworks for reinforcement learning but typically they are only designed to solve MDPs. Often, it is time consuming to provide a transition model and a suitable representation of the problem. Thus, the Matlab based software package DRPOST was implemented by the author for the specific purpose of determining optimal strategies in multi-player grid soccer.

2.5 General Markov Games, Differential Games, and Advanced Concepts of RL

In this section pointers to interesting concepts that are related to the previous part of the

section but are not a main focus of this thesis shall be given. Some of the concepts show

alternatives to those in Section 5.1 of modeling a robot soccer game. The range of concepts

reaches from general n-player games via differential games to more advanced concepts of

learning.

Discrete Games with n Agents

It can not be expected that the results of this thesis are extendable to general-sum multi-

player Markov games as [218] shows that suitably adapted value iteration methods (best

response instead of max-min strategies) do not even need to converge to a stationary

policy. This is even true in games with two players, alternating turns, and deterministic

transitions.

Differential Games

In contrast to the other games discussed before, differential games model time as being continuous (RL with continuous time is treated by [56]). Two standard textbooks on differential games are [86] and, especially on pursuit-evasion games, [107], from which some of the notation is adapted.

The evolution of a differential game is modeled by a differential equation

\[ \dot x = f(x, u, v), \tag{2.55} \]

where $\dot x = \frac{dx}{dt}$ and $x(t), u(t), v(t) \in \mathbb{R}^n$. This can also be seen as a dynamical system with two controls $u, v$. Other aspects which are necessary to describe a game are a terminal function which determines whether a terminal state of the game is reached (often one considers merely a fixed terminal set), and an outcome functional of the game for every player $P_i$ that can consist of an outcome dependent on a trajectory and a final outcome dependent on the final state. The aim of each player is, as for MDPs, to maximise its own outcome. Some statements about uniqueness and continuity of the solution to differential game trajectories including different information patterns can be found in [13].

In a newer context, viscosity solutions to the so-called Hamilton-Jacobi-Bellman-Isaacs equation (a non-linear PDE) are introduced, see [9, 36] and references therein. This is a weak solution concept which overcomes two difficulties: because the value function is in general not differentiable, it is no classical solution to the Hamilton-Jacobi-Bellman-Isaacs equation, and other generalisations may yield non-unique solutions (even for MDPs [152]). A good overview of theoretical results as well as numerical approximation schemes and examples is contained in [63].

Advanced Concepts of RL

The concept of generalisation is discussed at more length in Chapter 4; model reduction

is outlined in Chapter 3. At this place, semi-Markov decision processes (S-MDPs), hierar-

chical Markov decision processes (H-MDPs), partial observability (PO-MDPs), and some

trade-offs and special artifices to speed up convergence are to be addressed.

Hierarchical learning and S-MDPs. S-MDPs are a generalisation of MDPs with

respect to the duration of performing actions. The assumption of MDPs that every time

step is equidistant (every action takes the same amount of time) is dropped for S-MDPs.

Continuous-time discrete event versions exist as well as the easier discrete time version

which allows only durations being integer multiples of a unit duration. S-MDPs are the

natural framework for H-MDPs [11] which typically consist of a hierarchy of learning

problems with at least two different levels. Dietterich [53] provides with MAX-Q a multi-

level approach in which the hierarchy structure must be given in advance. [11] gives a

broad overview. Earlier examples often follow a two-level approach and include [42, 55,

109, 124, 127, 190] while newer examples include [81, 161, 197, 215]. [191] gives additional

useful references.

Partially observable MDPs (PO-MDPs). One problem of real world tasks is often

that the observed information is noisy or incomplete. Unfortunately, complete information

is necessary to solve MDPs because the agent needs to know the state for which the

updates of the value function have to be done. A formal model which incorporates the

effect of a lack of information is called partially observable Markov decision process. There

are several strategies to deal with PO-MDPs [88]: State-free deterministic or stochastic

policies i. e. ignoring the partial observability can lead to non-Markovity. Determining a

deterministic optimal policy (mapping from observation to actions) is NP-hard [113]. By

stochastic policies locally optimal results can be obtained [87]. True improvements can only

be observed in most environments by memorising previous actions and observations. Some

approaches are recurrent Q-learning (using a recurrent neural network to learn features of

history [110, 137, 182]), classifier systems [71, 54, 27, 100] which were originally similar to

Q-learning, interval-based estimation of the transition probabilities [38], and finite-history

window approaches with fixed [110] or variable length (utile suffix memory) [134], possibly

in combination with neural network approaches [172]. Finally, PO-MDP approaches use a

hidden Markov model (HMM) to learn a model of the environment and construct a perfect

memory controller [29, 119, 140, 173, 174, 189]. The discrete PO-MDP can be transformed into an equivalent continuous-space belief MDP, with the probability distributions over the discrete states being the new states. Thus, roughly speaking, a PO-MDP with $n$ states can be converted to an MDP with state space $\subseteq \mathbb{R}^n$, which clearly shows the limits of this

approach. Principal component analysis seems to be an appropriate method to reduce the

huge belief space of PO-MDPs [174]. A good overview of PO-MDPs and a comparison of

methods for very small problems (less than 20 states, 5 actions, and 10 observation states)

can be found in [114].

Special Artifices to Speed Up Convergence of DP and RL methods. Convergence

properties are in general asymptotic i. e. to some degree useless for practical needs where

the convergence rate at the beginning of learning is more interesting. Therefore, speed

of convergence is an ill-defined measure while speed of convergence to near-optimality is

more fruitful [88]. One performance measure in this sense is the regret [17] which describes

the loss of return while learning a policy in comparison to executing the (a priori known)

optimal policy. Of course, it is very hard to estimate the regret. Other artifices are updating states with higher Bellman error ($\|V_{k+1} - V_k\|_\infty$) more often, using combinations of value iteration and policy iteration like methods (general policy iteration [202]), modifying the model (mostly the reward function (shaping) or the start position (Q-learning)), or using heuristic knowledge to avoid starting from scratch (modify $Q_0, V_0$ such that they correspond to a heuristic (human) policy).

Learning versus Dynamic Programming

The advantage of RL methods is that they do not need a model while the disadvantage is

that they need a lot of training data or episodes to create reliable estimates. For the later example of soccer a middle course shall be followed: an offline computation with a coarse

model yields a good estimate of the value function. In the real game, this estimate will be

the initial value function from which the learning process starts instead of starting from

scratch. One goal of this thesis is to construct such an initial guess from a coarse model

for a competitive multi-agent soccer game.

Two further issues, which are especially under consideration in RL but also are related to

DP, are the temporal credit assignment problem which concerns the question how much

a single action contributes to the complete sequence of actions, and the structural credit

assignment problem of cooperative multiple agents, i. e. how much a single agent contributes

to solving the overall task [1]. A survey over cooperative agents with exhaustively many

references is [160].

Chapter 3

Model Reduction and Symmetry

Contents

3.1 Homomorphisms and Symmetry in MDPs

3.1.1 Equivalence of MDP Homomorphisms and MDP Symmetries

3.1.2 Symmetries by Group Actions on MDPs

3.2 Homomorphisms and Symmetry in 2P-ZS-MGs

3.2.1 2P-ZS-MG Homomorphisms and Symmetry

3.2.2 Automorphisms for the Exchange of Agents

Order and simplification are the first steps

towards the mastery of a subject.

P. Thomas Mann (1875–1955)

(http://www.nonstopenglish.com)

Symmetries which are an essential source for model reduction address two important issues:

first, it can generally be considered sensible that any two models describing the same

situation, e. g. the discretisation of a continuous model, should have the same symmetries,

and second, the abstraction over equivalent state(-action)s implies faster algorithms e. g.

faster learning comparable to the spirit of function approximation.(1)

Temporal and structural abstraction are main challenges in solving real world MDPs and

2P-ZS-MGs [67]. The latter concerns the question of how to handle the state space of the

underlying model for very large problems. In this context, model reduction, i. e. finding an

equivalent smaller or the smallest model, can be seen to be an important part of the task

to make computations more effective or simply feasible. While some reductions, especially

symmetries, are often easy to detect for a human observer, others are harder to observe.

The design of a software to accomplish this task is a challenging area of research for any

kind of model reduction.(2)

In this chapter proofs are given that some kinds of reduction yield models equivalent to

the original one and, hence, the effort to determine a reduced model can be valuable. A

(1)Even if a symmetry is not noted in advance, approximating functions which respect this symmetry

should be better suited than others.

(2)A complexity result of [70] is that it is NP-hard to determine whether a given finite model is already

minimal under reduction by symmetry groups (see Section 3.1.2).

second important point, namely, how to compute possible reductions will not be addressed

in detail. One possible method is adaptive state aggregation (clustering of states) by

locality or by the similarity of the value function [216](3); a different one is to detect steep

gradients of the value function as a natural barrier between state clusters [58](4). Optimal

value functions for multiple goals can also be used to uncover structure in the state space

[67].(5) However, all these methods typically yield only approximate reductions whereas

this chapter concerns exact reductions.

The chapter is structured as follows: after some remarks about existing work for MDPs on

MDP homomorphisms and MDP symmetries the equivalence of these concepts is proven by

the author in Section 3.1. Furthermore, the previous concepts are related to group actions

in Section 3.1.2. In Section 3.2, the focus will be on extensions to 2P-ZS-MGs which is also

one of the main theoretical contributions of this thesis. A central aspect in 2P-ZS-MGs,

which is not an issue in MDPs, is the matrix game reduction property (Definition 3.12)

which is proven to be valid if the equivalence classes on the state-action space fulfill some

natural projection properties. The most general framework of model reduction leads to 2P-

ZS-MG µ-homomorphisms (Theorem 3.20). The fact that the composition of two 2P-ZS-

MG µ-homomorphisms stays a 2P-ZS-MG µ-homomorphism enables the stepwise reduction

of models. Additionally, the combination of agent exchanging symmetries with agent

preserving symmetries is part of the concept. An application to automorphisms of finite

2P-ZS-MGs (Lemma 3.25) gives some further insight.

3.1 Homomorphisms and Symmetry in MDPs

The concept of symmetries is strongly related to questions of model reduction and model

minimisation. Historically, the question of model minimisation emerged first in finite state

automata and was extended to Markov chains and MDPs [170, 171]. The motivation of

this work is the insight that a smaller model results in a smaller amount of computation

time since the computational complexity typically depends on the model size (compare

Section 2.2.3 and Remark 3.7). The fundamental algebraic elements are homomorphisms

between two different models which represent the idea of equivalence of these two models.

For an equivalence of two MDPs not only does the structure of the state(-action) space

have to be preserved by a homomorphism but also the structure of the transition and

reward functions, implying the same structure for the optimal value functions and optimal

policies.

This section provides background material for Section 3.2 but also contains theoretical

contributions of the author. A main contribution is to show the equivalence of MDP

homomorphisms [171] to MDP symmetries [217] in Lemma 3.3.(6) This equivalence reveals

that the model minimisation framework of [171] with MDP options and the symmetry

context of [217] with its generalisation to multi-agent MDPs do not exclude each other but

(3) The presented approach has the burden that the value function of non-aggregated states also needs to be stored and that there is an exponential number of cluster-internal (deterministic) policies over which a maximum is searched for.

(4)This approach seems to be most appropriate for detection of walls and implicit barriers.

(5)The problem with this approach is that one needs to calculate some value functions for different goals

before one can benefit from the structural abstraction.

(6)In principle, many results of this section can be found in the technical report [170] but there are e. g. no

pointers to the relation to group actions nor any note – if noticed – that the concept of reward respecting

SSP partitions in fact is the MDP symmetry of Zinkevich. The main reason for this may be that some

conditions or implications are left implicit in the technical report which are made explicit by the author.

instead are based on the same foundations. Furthermore, because of the equivalence it is

sufficient to prove only once e. g. that the optimal value function is constant on equivalence

classes induced by MDP homomorphisms which also implies that this is true for equivalence

classes induced by MDP symmetries.

3.1.1 Equivalence of MDP Homomorphisms and MDP Symmetries

MDP homomorphisms share with group homomorphisms the concept of preserving some structure. For MDPs this includes maps for states, state-actions, transition functions, and reward functions. Because MDP homomorphisms shall be utilised for model reduction, the original MDP $M_1 = (D_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ has a larger number of states or state-actions than the reduced MDP $M_2 = (D_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$. In this context, an MDP homomorphism aggregates some states and state-actions, i. e. merges them into non-trivial equivalence classes.

Before MDP homomorphisms can be defined, some aspects of projecting partitions (or equivalence classes [60]) of the state-action space onto partitions of the state space have to be noted. A first assumption in [171] is that a partition of the (finite) state-action space $P(\mathcal{SA}_1) = \{[(s, a)] : (s, a) \in \mathcal{SA}_1\}$ into equivalence classes is given. It is further assumed that this partition induces a partition of equivalence classes on the state space $P(\mathcal{S}_1) = \{[s] : s \in \mathcal{S}_1\}$ by the direct projection $\Pi_s : \mathcal{SA}_1 \to \mathcal{S}_1$ with $\Pi_s(s_1, a_1) = s_1$ by means of $[s] = \Pi_s([(s, a)])$. The well-definedness of this projection onto state equivalence classes is equivalent to the condition

$$\forall (s, a'') \in \mathcal{SA}_1 \;\forall (s, a) \in \mathcal{SA}_1 \;\forall s' \in \Pi_s([(s, a'')]) \;\exists a' \in \mathcal{A}_1(s') : (s', a') \in [(s, a)] . \tag{3.1}$$

If the above equation is fulfilled (which is assumed by [171]) then it can be rewritten in terms of the induced equivalence classes because $[s] = \Pi_s([(s, a)]) = \Pi_s([(s, a'')])$:

$$\forall (s, a) \in \mathcal{SA}_1 \;\forall s' \in [s] \;\exists a' \in \mathcal{A}_1(s') : (s', a') \in [(s, a)] . \tag{3.2}$$

The well-definedness of the projection of equivalence classes by $\Pi_s$ further implies (not explicitly noted in [171]) the following condition on the equivalence classes:

$$\forall (s, a) \in \mathcal{SA}_1 \;\forall (s', a') \in \mathcal{SA}_1 : (s', a') \in [(s, a)] \Rightarrow s' \in [s] . \tag{3.3}$$

These two conditions (Equations 3.2 and 3.3) are also needed in the symmetry formalism of Zinkevich and Balch: they are simply Definition 7 of [217] with $[(s, a)]$ being the equivalence classes of the relation $E_{SA}$ and $[s]$ being the equivalence classes of $E_S$ in their notation.

Considering the reverse direction, i. e. if equivalence relations are given on $\mathcal{S}_1$ and $\mathcal{SA}_1$ which satisfy Equations 3.2 and 3.3, then the projection $\Pi_s$ maps the equivalence classes of $\mathcal{SA}_1$ exactly onto the ones of $\mathcal{S}_1$. The reason is that Equation 3.2 guarantees that the equivalence classes on $\mathcal{S}_1$ are small enough to be induced by the projection $\Pi_s$, while Equation 3.3 assures that the equivalence classes on $\mathcal{S}_1$ are also large enough.
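For finite partitions, Equations 3.2 and 3.3 together say that the projection of every state-action class is exactly one state class, which can be checked mechanically. The following Python sketch illustrates this; the function name, the representation of partitions as lists of sets, and the two-state toy example are illustrative assumptions and are not taken from [171] or [217].

```python
def projection_is_well_defined(sa_classes, s_classes):
    """Return True iff Equations 3.2 and 3.3 hold for the given partitions.

    sa_classes: partition of the state-action space, blocks are sets of (s, a).
    s_classes:  partition of the state space, blocks are sets of states.
    """
    state_block = {s: frozenset(block) for block in s_classes for s in block}
    for block in sa_classes:
        projected = frozenset(s for (s, _) in block)  # Pi_s([(s, a)])
        for (s, _) in block:
            # Eq. 3.3: every state occurring in [(s, a)] lies in [s].
            if not projected.issubset(state_block[s]):
                return False
            # Eq. 3.2: every s' in [s] has an action a' with (s', a') in [(s, a)].
            if not state_block[s].issubset(projected):
                return False
    return True


# Hypothetical toy example: two mirror-image states "l" and "r".
s_classes = [{"l", "r"}]
good = [{("l", "in"), ("r", "in")}, {("l", "out"), ("r", "out")}]
bad = [{("l", "in")}, {("r", "in")}, {("l", "out"), ("r", "out")}]

print(projection_is_well_defined(good, s_classes))  # True
print(projection_is_well_defined(bad, s_classes))   # False
```

In the second partition the class of ("l", "in") projects onto the single state "l", which is smaller than the state class containing both "l" and "r", so Equation 3.2 fails.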

Next, MDP homomorphisms can be defined:

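3.1 Definition (MDP Homomorphism [171])

Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ be two MDPs as defined in Section 2.2 with $D_1 = D_2 = \mathbb{N}_0$. A map $h : \mathcal{SA}_1 \to \mathcal{SA}_2$ is called an MDP homomorphism if $h$ is a surjection, defined by a tuple of surjective maps $(f, (g_s)_{s \in \mathcal{S}_1})$ with $h(s, a) = (f(s), g_s(a))$, $f : \mathcal{S}_1 \to \mathcal{S}_2$, and $g_s : \mathcal{A}_1(s) \to \mathcal{A}_2(f(s))$ such that

$$\forall (s, a) \in \mathcal{SA}_1 \;\forall s' \in \mathcal{S}_1 : \widetilde{T}_1(s, a, [s']) = T_2(f(s), g_s(a), f(s')) \tag{3.4}$$

and

$$\forall (s, a) \in \mathcal{SA}_1 : R_1(s, a) = R_2(f(s), g_s(a)) \tag{3.5}$$

where the block transition function $\widetilde{T}_1$ of the MDP $\mathcal{M}_1$ is defined by(7)

$$\widetilde{T}_1 : \mathcal{SA}_1 \times \{[s] : s \in \mathcal{S}_1\} \to \mathbb{R}, \qquad \widetilde{T}_1(s, a, [s']) = \sum_{s'' \in [s']} T_1(s, a, s'') \tag{3.6}$$

and the equivalence classes $[(s, a)]$ on $\mathcal{SA}_1$ are defined by $(s', a') \in [(s, a)]$ iff $h(s', a') = h(s, a)$ and the equivalence classes $[s]$ on $\mathcal{S}_1$ are defined by the projections $\Pi_s([(s, a)])$, which makes Equation 3.1 a necessary condition.(8)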
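For small finite MDPs the conditions of Definition 3.1 can be verified directly. The sketch below is a minimal illustration under stated assumptions: the dictionary representation `T[(s, a)] = {s': probability}`, the function names, and the three-state toy MDP are hypothetical and chosen only for this example.

```python
def block_transition(T, s, a, block):
    """Block transition function (Eq. 3.6): sum of T(s, a, s'') over s'' in block."""
    return sum(T[(s, a)].get(t, 0.0) for t in block)


def is_mdp_homomorphism(T1, R1, T2, R2, f, g, tol=1e-12):
    """Check surjectivity of h(s, a) = (f[s], g[(s, a)]) and Equations 3.4 and 3.5."""
    if {(f[s], g[(s, a)]) for (s, a) in T1} != set(T2):
        return False  # h is not a surjection onto the reduced state-action space
    for (s, a) in T1:
        image = (f[s], g[(s, a)])
        if abs(R1[(s, a)] - R2[image]) > tol:  # Equation 3.5
            return False
        for t in f:  # Equation 3.4 for every target state s'
            block = [u for u in f if f[u] == f[t]]  # [s'] = f^{-1}(f(s'))
            lhs = block_transition(T1, s, a, block)
            if abs(lhs - T2[image].get(f[t], 0.0)) > tol:
                return False
    return True


# Hypothetical toy MDP: mirror-image states "l" and "r" plus an absorbing "goal".
T1 = {
    ("l", "in"): {"goal": 1.0},
    ("r", "in"): {"goal": 1.0},
    ("l", "swap"): {"l": 0.5, "r": 0.5},
    ("r", "swap"): {"l": 0.5, "r": 0.5},
    ("goal", "stay"): {"goal": 1.0},
}
R1 = {("l", "in"): 1.0, ("r", "in"): 1.0, ("l", "swap"): 0.0,
      ("r", "swap"): 0.0, ("goal", "stay"): 0.0}
# Reduced MDP in which "l" and "r" are merged into "side".
T2 = {("side", "in"): {"goal": 1.0}, ("side", "swap"): {"side": 1.0},
      ("goal", "stay"): {"goal": 1.0}}
R2 = {("side", "in"): 1.0, ("side", "swap"): 0.0, ("goal", "stay"): 0.0}
f = {"l": "side", "r": "side", "goal": "goal"}
g = {(s, a): a for (s, a) in T1}  # actions keep their names

ok = is_mdp_homomorphism(T1, R1, T2, R2, f, g)
R1_asymmetric = {**R1, ("r", "in"): 2.0}  # breaks Equation 3.5
broken = is_mdp_homomorphism(T1, R1_asymmetric, T2, R2, f, g)
print(ok, broken)  # True False
```

Note that the individual transitions of ("l", "swap") are split between "l" and "r", while only their sum over the block enters Equation 3.4.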

To provide the material necessary for a comparison between the MDP homomorphisms of [171] and the equivalence relation notation of [217], the latter remains to be introduced. To this end, recall that an equivalence relation $E$ on a set $M$ is a subset $E \subseteq M \times M$ with the properties reflexivity ($(x, x) \in E$), symmetry ($(x, y) \in E \Rightarrow (y, x) \in E$), and transitivity ($(x, y) \in E, (y, z) \in E \Rightarrow (x, z) \in E$). An equivalence relation gives rise to the quotient set $M/E = \{[x] : x \in M\}$ of equivalence classes $[x] = \{y \in M \mid (x, y) \in E\}$ of $M$ by $E$. As above, the notation $x \in [y]$ means that $x$ is in the equivalence class of $y$.

In the following definition the notation of [217] is slightly changed to stress the similarities

to MDP homomorphisms:

3.2 Definition (MDP Symmetry [217])

Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ be an MDP. An MDP symmetry is a tuple $E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$ of equivalence relations on $\mathcal{S}_1$ and $\mathcal{SA}_1$, respectively, such that for the corresponding equivalence classes $[s]$ and $[(s, a)]$ Equations 3.2 and 3.3 are valid, and additionally

1.) the block transition function $\widetilde{T}_1$ defined by Equation 3.6 is constant on equivalence classes:

$$\forall (s', a') \in [(s, a)] \;\forall s'' \in \mathcal{S}_1 : \widetilde{T}_1(s', a', [s'']) = \widetilde{T}_1(s, a, [s'']) , \tag{3.7}$$

2.) and the reward function is constant on equivalence classes:

$$\forall (s', a') \in [(s, a)] : R_1(s', a') = R_1(s, a) . \tag{3.8}$$

3.3 Lemma (Equivalence of MDP Homomorphisms and MDP Symmetries)

MDP homomorphisms (Definition 3.1) and MDP symmetries (Definition 3.2) are equivalent, i. e. for each MDP homomorphism there exists an MDP symmetry and for each MDP symmetry there exists an MDP homomorphism such that the equivalence classes induced by the MDP homomorphism and the MDP symmetry are equal.(9)

Proof: It has to be shown, firstly, that given any MDP symmetry there exists an equivalent MDP homomorphism and, secondly, that the opposite direction of implication is also true. For

(7)The block transition function is in fact a transition function because $\forall (s, a) \in \mathcal{SA}_1 : \sum_{[s']} \widetilde{T}_1(s, a, [s']) = \sum_{[s']} \sum_{s'' \in [s']} T_1(s, a, s'') = \sum_{s' \in \mathcal{S}_1} T_1(s, a, s') = 1$ because $T_1$ is a transition function.

(8)An implication is that not only the equivalence classes on the state-action space can be written as $[(s, a)] = h^{-1}(h(s, a))$ but also those on the state space as $[s] = f^{-1}(f(s))$.

(9)If an equivalence relation on all MDP homomorphisms is introduced such that two of them are equiv-

alent if the induced equivalence relation on the state and state-action spaces are equal, and if similarly an

equivalence relation of MDP symmetries is introduced, then the Lemma means that a bijection between

equivalence classes of MDP homomorphisms and equivalence classes of MDP symmetries exists.

clarity of the presentation, equivalence classes induced by an MDP homomorphism $h$ are denoted by $[s]_h$ and $[(s, a)]_h$ while the ones induced by an MDP symmetry $E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$ are denoted by $[s]_{E_{\mathcal{S}_1}}$ and $[(s, a)]_{E_{\mathcal{SA}_1}}$.

1.) Let an MDP $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and an MDP symmetry $E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$ be given. Above Definition 3.1 it is discussed that Equations 3.2 and 3.3 imply the well-definedness of the projection $\Pi_s([(s, a)]_{E_{\mathcal{SA}_1}}) = [s]_{E_{\mathcal{S}_1}}$ from equivalence classes of $\mathcal{SA}_1$ onto those of $\mathcal{S}_1$. Because of this well-definedness it is possible to choose a unique (finite) set of representatives $N_{s,a} = \{(s_i, a_i)\}$, one for each equivalence class on $\mathcal{SA}_1$, such that $N_s = \Pi_s(N_{s,a}) = \{s_i\}$ is a unique (finite) set of representatives for the equivalence classes on $\mathcal{S}_1$. Let $\Phi_{s,a} : \mathcal{SA}_1 \to N_{s,a}$ be the mapping which maps a state-action $(s, a)$ to its representative $(s_k, a_k) \in [(s, a)]_{E_{\mathcal{SA}_1}} \cap N_{s,a}$ and $\Phi_s : \mathcal{S}_1 \to N_s$ the corresponding mapping for the state space(10).

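Let a second MDP $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ be defined by $D_2 = D_1$, $\mathcal{S}_2 = N_s$, $\mathcal{SA}_2 = N_{s,a}$, $T_2(s, a, s') = \widetilde{T}_1(\Phi_{s,a}(s, a), [\Phi_s(s')]_{E_{\mathcal{S}_1}})$, and $R_2(s, a) = R_1 \circ \Phi_{s,a}(s, a)$. Then, an MDP homomorphism $h(s, a) = (f(s), g_s(a))$ is defined by $h(s, a) = \Phi_{s,a}(s, a)$, implying that $f(s) = \Phi_s(s)$ and $g_s(a) = \Pi_a(\Phi_{s,a}(s, a))$ where $\Pi_a$ is the direct projection on the actions and $\mathcal{A}_2(f(s)) = \{\Pi_a(\Phi_{s,a}(s, a)) : a \in \mathcal{A}_1(s)\}$, which is independent of the representative $f(s) = \Phi_s(s)$ of $[s]_{E_{\mathcal{S}_1}}$ by means of Equation 3.2.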

For the proof that $h$ is an MDP homomorphism the first step is to note that the equivalence classes induced by $h$ on $\mathcal{SA}_1$, i. e. $(s', a') \in [(s, a)]_h$ iff $h(s', a') = h(s, a)$, are by construction the same as those of $E_{\mathcal{SA}_1}$, i. e. $[(s, a)]_h = [(s, a)]_{E_{\mathcal{SA}_1}}$. Because $\Pi_s$ is well-defined for the equivalence classes $[(s, a)]_{E_{\mathcal{SA}_1}}$ it is also well-defined for the classes $[(s, a)]_h$, hence $[s]_h = \Pi_s([(s, a)]_h) = \Pi_s([(s, a)]_{E_{\mathcal{SA}_1}}) = [s]_{E_{\mathcal{S}_1}}$ and in particular Equation 3.1 is valid for $[s]_h$ and $[(s, a)]_h$.

The second step is to show that Equations 3.4 and 3.5 hold. Equation 3.5 follows directly because the reward function is constant on equivalence classes of $E_{\mathcal{SA}_1}$ (Equation 3.8) and therefore also on equivalence classes of $h$, which are characterised by $(s', a') \in [(s, a)]_h$ iff $(f(s), g_s(a)) = h(s, a) = h(s', a') = (f(s'), g_{s'}(a'))$. Equation 3.4 follows analogously because the function $(s, a) \mapsto \widetilde{T}_1(\Phi_{s,a}(s, a), [\Phi_s(s')]_{E_{\mathcal{S}_1}})$ is constant on equivalence classes of $E_{\mathcal{SA}_1}$ (Equation 3.7) and therefore also on those of $h$.

2.) Now, let an MDP homomorphism $h : \mathcal{SA}_1 \to \mathcal{SA}_2$ be given for the two MDPs $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ with $D_1 = D_2 = \mathbb{N}_0$. The discussion above Definition 3.1 yields that Equations 3.2 and 3.3 are valid for the equivalence classes $[(s, a)]_h$ and $[s]_h$ induced by $h$. Define equivalence relations $E_{\mathcal{S}_1}$ by $[s]_{E_{\mathcal{S}_1}} = [s]_h$ and $E_{\mathcal{SA}_1}$ by $[(s, a)]_{E_{\mathcal{SA}_1}} = [(s, a)]_h$.

By definition the equivalence classes induced by $h$ and the ones induced by $E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$ are equal and the latter also fulfil Equations 3.2 and 3.3. Therefore, in the opposite direction Equation 3.8 follows directly from Equation 3.5 and Equation 3.7 from Equation 3.4, which means that $E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$ is an MDP symmetry.
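The construction in part 1.) of the proof, which builds a reduced MDP from representatives of the equivalence classes, can be sketched in Python as follows. The dictionary representation of the MDP, the function name `quotient_mdp`, the choice of the smallest element as representative, and the toy MDP are assumptions made for illustration only; the sketch presumes that the given partitions already form an MDP symmetry.

```python
def quotient_mdp(T1, R1, s_classes, sa_classes):
    """Build the reduced MDP of part 1.) of the proof from an MDP symmetry.

    Returns (phi_s, phi_sa, T2, R2) where phi_s and phi_sa map states and
    state-actions to their representatives (Phi_s and Phi_{s,a} in the text).
    """
    block_of = {s: frozenset(block) for block in s_classes for s in block}
    phi_s = {s: min(block) for block in s_classes for s in block}
    phi_sa = {}
    for block in sa_classes:
        # Equation 3.2 guarantees a pair whose state is the state representative.
        rep = min(sa for sa in block if phi_s[sa[0]] == sa[0])
        for sa in block:
            phi_sa[sa] = rep
    T2, R2 = {}, {}
    for rep in set(phi_sa.values()):
        R2[rep] = R1[rep]  # R2 = R1 composed with Phi_{s,a}
        T2[rep] = {}
        for t in set(phi_s.values()):
            # T2(s, a, s') is the block transition into the class of s'.
            p = sum(T1[rep].get(u, 0.0) for u in block_of[t])
            if p > 0.0:
                T2[rep][t] = p
    return phi_s, phi_sa, T2, R2


# Hypothetical toy MDP with mirror-image states "l" and "r".
T1 = {
    ("l", "in"): {"goal": 1.0},
    ("r", "in"): {"goal": 1.0},
    ("l", "swap"): {"l": 0.5, "r": 0.5},
    ("r", "swap"): {"l": 0.5, "r": 0.5},
    ("goal", "stay"): {"goal": 1.0},
}
R1 = {("l", "in"): 1.0, ("r", "in"): 1.0, ("l", "swap"): 0.0,
      ("r", "swap"): 0.0, ("goal", "stay"): 0.0}
s_classes = [{"l", "r"}, {"goal"}]
sa_classes = [{("l", "in"), ("r", "in")}, {("l", "swap"), ("r", "swap")},
              {("goal", "stay")}]

phi_s, phi_sa, T2, R2 = quotient_mdp(T1, R1, s_classes, sa_classes)
print(phi_sa[("r", "in")])   # ('l', 'in')
print(T2[("l", "swap")])     # {'l': 1.0}
```

The map `phi_sa` plays the role of the homomorphism $h = \Phi_{s,a}$, and the reduced state space consists of the representatives "l" and "goal".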

3.4 Remark (Implications of Lemma 3.3)

A direct implication of Lemma 3.3 is that the work of [217] and [171] has a common basis.

The concepts particularly in the first reference for multi-agent systems and in the second

reference for options, i. e. macro or multi-step actions, may be jointly used. As a further

(10)The function $\Phi_s : \mathcal{S}_1 \to N_s$ can also be defined by $\Phi_s(s) = \Pi_s(\Phi_{s,a}(s, a(s)))$ where for each $s$ any $a = a(s) \in \mathcal{A}_1(s)$ can be chosen.

consequence, the proofs about value functions and strategies being equal on equivalence

classes need only be given in one framework.

Having established the equivalence of both frameworks, their usefulness is to be captured in a theorem. Before the theorem can be stated, it has to be explained how policies are transferred from $h(\mathcal{SA}_1)$ to $\mathcal{SA}_1$ if $h$ is an MDP homomorphism:

3.5 Definition (Policy Lifting [171])

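Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ be two MDPs and let $h : \mathcal{SA}_1 \to \mathcal{SA}_2$, $h(s, a) = (f(s), g_s(a))$ be an MDP homomorphism. For any $s \in \mathcal{S}_1$ and $\hat{a} \in \mathcal{A}_2(f(s))$ the action space $g_s^{-1}(\hat{a}) \subseteq \mathcal{A}_1(s)$ is the preimage of the action $\hat{a}$ under $g_s$. Let $\hat{\pi}$ be a policy of the MDP $\mathcal{M}_2$. Then the corresponding lifted policy $\pi$ of the MDP $\mathcal{M}_1$ is defined by

$$\pi(s, a) = \frac{\hat{\pi}(f(s), g_s(a))}{|g_s^{-1}(g_s(a))|} . \tag{3.9}$$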
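Equation 3.9 can be sketched in a few lines of Python. The dictionary representation of policies and maps, the function name `lift_policy`, and the single-state example are hypothetical choices for illustration.

```python
from collections import Counter


def lift_policy(pi_hat, actions1, f, g):
    """Lift a policy of the reduced MDP to the original MDP (Equation 3.9).

    pi_hat:   dict (reduced state, reduced action) -> probability
    actions1: dict original state -> list of admissible original actions
    f, g:     state map and action maps, with g[(s, a)] playing the role of g_s(a)
    """
    pi = {}
    for s, actions in actions1.items():
        # |g_s^{-1}(g_s(a))|: how many original actions share each image.
        preimage_size = Counter(g[(s, a)] for a in actions)
        for a in actions:
            image = g[(s, a)]
            pi[(s, a)] = pi_hat[(f[s], image)] / preimage_size[image]
    return pi


# Hypothetical example: "north" and "up" are equivalent actions in state "s".
actions1 = {"s": ["north", "up", "south"]}
f = {"s": "x"}
g = {("s", "north"): "n", ("s", "up"): "n", ("s", "south"): "d"}
pi_hat = {("x", "n"): 0.6, ("x", "d"): 0.4}

pi = lift_policy(pi_hat, actions1, f, g)
print(pi)
```

The probability 0.6 of the reduced action "n" is shared equally between "north" and "up", so the lifted values again sum to one in every state.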

The policy lifting simply means to assign the same fraction of probability to all actions in

the same state-action equivalence class. The main theorem about MDP homomorphisms

(and MDP symmetries) follows:

3.6 Theorem (Main Implications of MDP Homomorphisms [170])

Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ be an MDP and let $h : \mathcal{SA}_1 \to \mathcal{SA}_2$, $h(s, a) = (f(s), g_s(a))$, be an MDP homomorphism (Definition 3.1) for some MDP $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$. Then, $V^*(s) = V^*(f(s))$ and $Q^*(s, a) = Q^*(h(s, a))$. Furthermore, there exists an optimal policy of $\mathcal{M}_1$ which is a lifted optimal policy of $\mathcal{M}_2$.

The proof is given in Section 3.2 for 2P-ZS-MGs; it is methodically similar and includes MDPs.
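The statement $V^*(s) = V^*(f(s))$ can be observed numerically on a small example. The following sketch runs plain synchronous value iteration on a hypothetical toy MDP and on its reduced model; the discount factor 0.9, the number of sweeps, the dictionary representation, and the toy models are assumptions for illustration and are not taken from [170].

```python
def value_iteration(T, R, gamma=0.9, sweeps=200):
    """Synchronous value iteration; T[(s, a)] = {s': prob}, R[(s, a)] = reward."""
    states = {s for (s, _) in T}
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # The comprehension reads the old V and builds the next iterate.
        V = {
            s: max(
                R[(u, a)] + gamma * sum(p * V[t] for t, p in T[(u, a)].items())
                for (u, a) in T if u == s
            )
            for s in states
        }
    return V


# Original toy MDP with mirror-image states "l" and "r".
T1 = {
    ("l", "in"): {"goal": 1.0},
    ("r", "in"): {"goal": 1.0},
    ("l", "swap"): {"l": 0.5, "r": 0.5},
    ("r", "swap"): {"l": 0.5, "r": 0.5},
    ("goal", "stay"): {"goal": 1.0},
}
R1 = {("l", "in"): 1.0, ("r", "in"): 1.0, ("l", "swap"): 0.0,
      ("r", "swap"): 0.0, ("goal", "stay"): 0.5}
# Reduced model: "l" and "r" merged into "side".
T2 = {("side", "in"): {"goal": 1.0}, ("side", "swap"): {"side": 1.0},
      ("goal", "stay"): {"goal": 1.0}}
R2 = {("side", "in"): 1.0, ("side", "swap"): 0.0, ("goal", "stay"): 0.5}
f = {"l": "side", "r": "side", "goal": "goal"}

V1 = value_iteration(T1, R1)
V2 = value_iteration(T2, R2)
for s in sorted(V1):
    print(s, V1[s], V2[f[s]])
```

Both runs use the same number of sweeps, and the iterates agree on equivalent states in every sweep; only the number of stored values differs (three versus two).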

3.7 Remark (Implications of Theorem 3.6 on Complexity)

Theorem 3.6 means that the optimal value function is constant on $[s]_h$, the optimal Q-value function is constant on $[(s, a)]_h$, and an optimal policy exists which is also constant on $[(s, a)]_h$. The theorem additionally holds for any lifted policy, which means that non-optimal lifted policies also induce value functions which respect the equivalence relation. This means that it is sufficient to operate on the reduced model to obtain good or optimal policies of the original model without any loss of information through the state and state-action space compression. Because the complexity depends heavily on the size of the state and state-action spaces (see Section 2.2.3) the reduced models can be solved faster.

The minimum savings are a reduction in the number of stored values (less computer memory needed) for any policy and value function proportional to $|\mathcal{S}_2| / |\mathcal{S}_1|$ or $|\mathcal{SA}_2| / |\mathcal{SA}_1|$, respectively. A more important reduction applies to the computational time: because the complexity formulae are worse than linear for a non-sparse transition function the savings are, in this case, proportional to the corresponding powers and products of $|\mathcal{S}|$ and $|\mathcal{SA}|$. Nevertheless, e. g. the number of value iterations needed to achieve a given numerical precision $\varepsilon$ does not change; this could only happen for asynchronous updates such as Gauss-Seidel or RL techniques. This reflects the fact that a reducible model inherits some of the basic difficulty of the reduced model although the theoretical complexity increases for larger models.

3.1.2 Symmetries by Group Actions on MDPs

In this section the framework of MDP homomorphisms is to be related to the classical

notion of symmetry by means of group actions which turns out to be similar but not

exactly the same.(11) [170] gives an example for which the two concepts of symmetry

are different. However, the concept of MDP homomorphisms includes the symmetry group

concept which is defined by group actions of the group of all MDP automorphisms.(12) [70]

also uses this approach (called bisimulation there) and relates it to finite state machines

(FSM). One complexity result of [70] is that it is NP-hard to determine whether a given

finite model is already minimal under reduction by symmetry groups.

Firstly, some standard definitions of group homomorphisms and group actions are recalled (for more details see Appendix A). A (left) group action $\Theta$ is a map $\Theta : G \times X \to X$ such that for all $g, h \in G$ and for all $x \in X$ it holds that $\Theta(g, \Theta(h, x)) = \Theta(gh, x)$ and $\Theta(1, x) = x$ where $1 \in G$ is the identity. A simplified notation for group actions is $gh \cdot x = ghx = \Theta(g, \Theta(h, x))$. Furthermore, group actions act like permutations on $X$. This can be made precise by introducing the kernel of a group action. The main results related to equivalence classes are Proposition A.4 and Proposition A.5 in which it is shown that equivalence relations and group actions are equivalent concepts.
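The direction from group actions to equivalence relations can be made concrete: the orbits of a group action partition the set $X$. The Python sketch below computes these orbits for a finite set from a list of generating bijections; the function name `orbits`, the mirror map, and the toy state-action space are illustrative assumptions.

```python
def orbits(generators, points):
    """Partition `points` into orbits of the group generated by `generators`.

    Each generator is a bijection on `points` given as a Python function.
    For a finite set, closing a point under the generators yields its orbit.
    """
    remaining = set(points)
    result = []
    while remaining:
        start = remaining.pop()
        orbit = {start}
        frontier = [start]
        while frontier:
            x = frontier.pop()
            for g in generators:
                y = g(x)
                if y not in orbit:
                    orbit.add(y)
                    frontier.append(y)
        remaining -= orbit
        result.append(frozenset(orbit))
    return result


# Hypothetical state-action space with a mirror symmetry exchanging "l" and "r".
points = [("l", "in"), ("r", "in"), ("l", "swap"), ("r", "swap"), ("goal", "stay")]
flip = {"l": "r", "r": "l"}


def mirror(state_action):
    s, a = state_action
    return (flip.get(s, s), a)


classes = orbits([mirror], points)
print(sorted(sorted(c) for c in classes))
```

The three orbits obtained here are exactly the state-action equivalence classes one would expect for a mirror-symmetric model.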

Proposition A.5 is now to be applied to MDPs. Therefore, a definition of symmetries based

on MDP automorphisms is given:

3.8 Definition (MDP Iso- and Automorphism, Symmetry Group [170])

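Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ be two MDPs and let $h : \mathcal{SA}_1 \to \mathcal{SA}_2$ be an MDP homomorphism. $h$ is said to be an MDP isomorphism if it is bijective(13) and it is called an MDP automorphism if it is bijective and if $\mathcal{SA}_1 = \mathcal{SA}_2$. The set of all automorphisms $\mathrm{Aut}(\mathcal{M}_1)$ of an MDP $\mathcal{M}_1$ is called the symmetry group of the MDP.(14)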

3.9 Corollary (Symmetry Group of an MDP and MDP homomorphisms)

Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ be an MDP and let $G = \mathrm{Aut}(\mathcal{M}_1)$ be its symmetry group. Then, $\Theta : G \times \mathcal{SA}_1 \to \mathcal{SA}_1$, $\Theta(g, (s, a)) = g(s, a)$, is a group action and there exists an MDP homomorphism $h$ which induces the same equivalence relation as the group action.

Proof: For the proof, previous results have to be collected. $\Theta$ is a group action because $\mathrm{id}_{\mathcal{SA}_1} \in G$ is the neutral element of $(G, \circ)$, and the neutral element and the composition of automorphisms fulfil the group action axioms. Then, an equivalence relation on $\mathcal{SA}_1$ with the group orbits as equivalence classes is induced. To be able to define an MDP symmetry it has to be shown that Equations 3.2, 3.3, 3.7, and 3.8 hold for the equivalence relation defined by the $G$-orbits.(15)

(11) A corresponding citation from the Wikipedia website on equivalence relations motivates to com-

plete our description (http://en.wikipedia.org/wiki/Equivalence_relation, 06.09.2007): “It is very

well known that lattice theory captures the mathematical structure of order relations. It is much less

known that transformation groups (some authors prefer permutation groups) and their orbits capture the

mathematical structure of equivalence relations.” This web page includes also important suggestions for

literature presented below.

(12)In principle, the main results of this section about MDPs can be found in [170] but there are no

pointers to the relation to group actions.

(13)$h(s, a) = (f(s), g_s(a))$ is bijective iff $f$ and all $g_s$ are bijective.

(14)In fact, the group properties hold for $(\mathrm{Aut}(\mathcal{M}_1), \circ)$.

(15)This does not directly follow due to the fact that Gconsists of MDP automorphisms. For the corre-

sponding MDP homomorphism some conditions become nearly trivial since the induced equivalence classes

of MDP automorphisms are singletons.

Equations 3.2 and 3.3 follow from Equation 3.1: Let $(s, a''), (s, a) \in \mathcal{SA}_1$ and let $s' \in \Pi_s([(s, a'')])$, i. e. there exists an MDP automorphism $h \in G$ with $h(s, a'') = (f(s), g_s(a'')) = (s', g_s(a''))$, which in particular means $f(s) = s'$. Then $h(s, a) = (f(s), g_s(a)) = (s', g_s(a))$ and it follows with $a' = g_s(a)$ that $\exists a' \in \mathcal{A}_1(s') : (s', a') \in [(s, a)]$ because $(s, a)$ and $(s', a')$ are in the same $G$-orbit.

Equation 3.8 follows because for any automorphism $h \in G$ and any $(s', a') = h(s, a)$ the reward satisfies $R_1(s, a) = R_1(h(s, a))$. To prove Equation 3.7, note that for an automorphism $h(s, a) = (f(s), g_s(a))$ Equation 3.4 turns into $T_1(s, a, s') = T_1(h(s, a), f(s'))$. Furthermore, $G_s = \Pi_s G = \{f : h \in G \text{ with } h(s, a) = (f(s), g_s(a))\}$ is a group because $G$ is a group, $G_s$ induces a group action on $\mathcal{S}_1$ by function evaluation and composition (as $G$ on $\mathcal{SA}_1$), and the orbits of $G_s$ are by definition the projections of orbits of $G$, i. e. the equivalence classes on $\mathcal{S}_1$. Then, for any $h \in G$ and any $(s', a') = h(s, a) = (f(s), g_s(a))$ it holds that $(f \circ G_s) \cdot s = G_s \cdot s$ (since $f \in G_s$), and

$$\begin{aligned}
\widetilde{T}_1(s, a, [s'']) &= \sum_{s''' \in G_s \cdot s''} T_1(s, a, s''') \\
&= \sum_{s''' \in G_s \cdot s''} T_1(h(s, a), f(s''')) \\
&= \sum_{s''' \in G_s \cdot s''} T_1(h(s, a), s''') \\
&= \widetilde{T}_1(s', a', [s'']) .
\end{aligned}$$

Summarised, the equivalence relation on $\mathcal{SA}_1$ induces one on $\mathcal{S}_1$ and a corresponding MDP symmetry. Finally, Lemma 3.3 shows the existence of an MDP homomorphism which induces the same equivalence relation. □

Although MDP homomorphisms can capture equivalence relations induced by (subgroups of) the symmetry group, the reverse implication is not true because for MDP isomorphisms Equation 3.4 reduces to

$$\forall s \in \mathcal{S}_1 \;\forall s' \in \mathcal{S}_1 \;\forall a \in \mathcal{A}_1(s) : T_1(s, a, s') = T_2(f(s), g_s(a), f(s')) .$$

[170] contains an example of model reductions due to MDP homomorphisms(16) which cannot be obtained by symmetry groups of MDPs.

3.2 Homomorphisms and Symmetry in 2P-ZS-MGs

In Section 3.1 the concepts of MDP homomorphisms and MDP symmetries have been shown to be equivalent. This reduces the burden of generalising both concepts to 2P-ZS-MGs to the task of choosing one of them. In the following, the concept of MDP homomorphisms is generalised to 2P-ZS-MG homomorphisms because that formalism explicitly includes the mappings from a model to the reduced model in its definition. Nevertheless, the formalism of 2P-ZS-MG homomorphisms could be transformed into one of 2P-ZS-MG symmetries.

The main result of Section 3.2.1 is Theorem 3.16 which includes Theorem 3.6 (MDPs) and states that a symmetric (or reducible) 2P-ZS-MG induces a structure on the optimal value functions that respects this symmetry (or reduction) and that a corresponding policy exists. The proof of the main theorem was independently given by the author and goes

(16)Characterised by reward respecting SSP partitions, in fact equalling MDP symmetries.

beyond the proof for MDPs, e. g. because the matrix game reduction (MGR) property (Definition 3.12) needs to be valid, which is always true (Proposition 3.13). It is to be noted that (asynchronous) RL algorithms operating on the original model can behave differently from those operating on the reduced model because for the latter the experience is spread over all equivalent states even if never visited by the learning agent.

Besides the symmetries of MDPs, a qualitatively new symmetry that results from exchanging the two agents can be exploited. For that purpose, the concept of µ-homomorphisms is introduced in Section 3.2.2, which can also be utilised to identify models which are identical up to a scaling of the rewards. This cannot be achieved by the standard 2P-ZS-MG homomorphism framework. Practically, agent-exchanging symmetries have been used for different board games by the argument that exchanging the agents in a zero-sum game has to result in a multiplication of the value by −1. One example is given by the pioneer Samuel for checkers [179].(17) The present work lays formal foundations for this practice and, more importantly, shows that the exchange of agents is also compatible with other standard symmetries similar to those of 2P-ZS-MGs (Proposition 3.24).

3.2.1 2P-ZS-MG Homomorphisms and Symmetry

In this section the concepts of Section 3.1 are carried over to 2P-ZS-MGs. The main differences are that the state-action spaces $\mathcal{SA}_i$ have to be replaced by the corresponding state-action spaces $\mathcal{SAO}_i$ and that additional projection properties hold. This implies that some changes occur for the projection $\Pi_s$ and for the reward and transition functions.

First and foremost, some aspects of projecting partitions of the state-action space onto partitions of the state space again have to be noted. As in Section 3.1, a first assumption is that a partition of the (finite) state-action space $P(\mathcal{SAO}_1) = \{[(s, a, o)] : (s, a, o) \in \mathcal{SAO}_1\}$ into equivalence classes is given. It is further assumed that this partition induces a partition of equivalence classes on the state space $P(\mathcal{S}_1) = \{[s] : s \in \mathcal{S}_1\}$ by the direct projection $\Pi_s : \mathcal{SAO}_1 \to \mathcal{S}_1$ with $\Pi_s(s_1, a_1, o_1) = s_1$ by means of $[s] = \Pi_s([(s, a, o)])$. The well-definedness of this projection onto state equivalence classes is equivalent to the condition

$$\begin{aligned}
&\forall (s, a'', o'') \in \mathcal{SAO}_1 \;\forall (s, a, o) \in \mathcal{SAO}_1 \;\forall s' \in \Pi_s([(s, a'', o'')]) \\
&\qquad \exists (a', o') \in \mathcal{A}_1(s') \times \mathcal{O}_1(s') : (s', a', o') \in [(s, a, o)] .
\end{aligned} \tag{3.10}$$

If the above equation is fulfilled then it can be rewritten in terms of the induced equivalence classes because $[s] = \Pi_s([(s, a, o)]) = \Pi_s([(s, a'', o'')])$:

$$\forall (s, a, o) \in \mathcal{SAO}_1 \;\forall s' \in [s] \;\exists (a', o') \in \mathcal{A}_1(s') \times \mathcal{O}_1(s') : (s', a', o') \in [(s, a, o)] . \tag{3.11}$$

The well-definedness of the projection of equivalence classes by $\Pi_s$ further implies the following condition on the equivalence classes:

$$\forall (s, a, o) \in \mathcal{SAO}_1 \;\forall (s', a', o') \in \mathcal{SAO}_1 : (s', a', o') \in [(s, a, o)] \Rightarrow s' \in [s] . \tag{3.12}$$

Considering the reverse direction, i. e. if equivalence classes are given on $\mathcal{S}_1$ and $\mathcal{SAO}_1$ which satisfy Equations 3.11 and 3.12, then the projection $\Pi_s$ maps the equivalence classes of $\mathcal{SAO}_1$ exactly onto the ones of $\mathcal{S}_1$.

(17)Samuel states that all stored board positions are transformed as if Black has to move.

The definitions above are a straightforward adaptation from the MDP case. However, it will become obvious later in the proof of the main theorem (Theorem 3.16) that it is necessary to use the matrix game reduction property (Definition 3.12) which in turn uses the policy lifting (Definition 3.11) in 2P-ZS-MGs. For a lifted policy to depend only on the actions of one agent it is necessary that the state-action equivalence classes for joint actions induce equivalence classes on $\mathcal{SA}_1$ and $\mathcal{SO}_1$ by the projections $\Pi_{s,a} : \mathcal{SAO}_1 \to \mathcal{SA}_1$ and $\Pi_{s,o} : \mathcal{SAO}_1 \to \mathcal{SO}_1$, respectively. This means that the following two variants of Equation 3.10 also have to be valid:

$$\begin{aligned}
&\forall (s, a, o'') \in \mathcal{SAO}_1 \;\forall (s, a, o) \in \mathcal{SAO}_1 \;\forall (s', a') \in \Pi_{s,a}([(s, a, o'')]) \\
&\qquad \exists o' \in \mathcal{O}_1(s') : (s', a', o') \in [(s, a, o)]
\end{aligned} \tag{3.13}$$

and

$$\begin{aligned}
&\forall (s, a'', o) \in \mathcal{SAO}_1 \;\forall (s, a, o) \in \mathcal{SAO}_1 \;\forall (s', o') \in \Pi_{s,o}([(s, a'', o)]) \\
&\qquad \exists a' \in \mathcal{A}_1(s') : (s', a', o') \in [(s, a, o)] .
\end{aligned} \tag{3.14}$$

If they are fulfilled, these two equations can also be abbreviated by the following:

$$\forall (s, a, o) \in \mathcal{SAO}_1 \;\forall (s', a') \in [(s, a)] \;\exists o' \in \mathcal{O}_1(s') : (s', a', o') \in [(s, a, o)] \tag{3.15}$$

and

$$\forall (s, a, o) \in \mathcal{SAO}_1 \;\forall (s', o') \in [(s, o)] \;\exists a' \in \mathcal{A}_1(s') : (s', a', o') \in [(s, a, o)] . \tag{3.16}$$

Next, 2P-ZS-MG homomorphisms can be defined which do not capture symmetries exchanging the two agents. The straightforward generalisation from MDP homomorphisms would be $h(s, a, o) = (f(s), g_s(a, o))$ but instead the homomorphism is defined as $h(s, a, o) = (f(s), g_s(a), i_s(o))$ because this explicitly takes the projective properties (Equations 3.11, 3.15 and 3.16) into account, where $[s] = f^{-1}(f(s))$, $[(s, a)] = \{(s', a') : g_{s'}(a') = g_s(a)\}$, and $[(s, o)] = \{(s', o') : i_{s'}(o') = i_s(o)\}$.

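3.10 Definition (2P-ZS-MG Homomorphism)

Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs as defined in Section 2.4 with $D_1 = D_2 = \mathbb{N}_0$. A map $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$ is called a 2P-ZS-MG homomorphism if $h$ is a surjection, defined by a tuple of surjective maps $(f, (g_s)_{s \in \mathcal{S}_1}, (i_s)_{s \in \mathcal{S}_1})$ with $h(s, a, o) = (f(s), g_s(a), i_s(o))$, $f : \mathcal{S}_1 \to \mathcal{S}_2$, $g_s : \mathcal{A}_1(s) \to \mathcal{A}_2(f(s))$, and $i_s : \mathcal{O}_1(s) \to \mathcal{O}_2(f(s))$ such that

$$\forall (s, a, o) \in \mathcal{SAO}_1 \;\forall s' \in \mathcal{S}_1 : \widetilde{T}_1(s, a, o, [s']) = T_2(f(s), g_s(a), i_s(o), f(s')) \tag{3.17}$$

and

$$\forall (s, a, o) \in \mathcal{SAO}_1 : R_1(s, a, o) = R_2(f(s), g_s(a), i_s(o)) \tag{3.18}$$

where the block transition function $\widetilde{T}_1$ of the 2P-ZS-MG $\mathcal{M}_1$ is defined by

$$\widetilde{T}_1 : \mathcal{SAO}_1 \times \{[s] : s \in \mathcal{S}_1\} \to \mathbb{R}, \qquad \widetilde{T}_1(s, a, o, [s']) = \sum_{s'' \in [s']} T_1(s, a, o, s'') \tag{3.19}$$

and the equivalence classes $[(s, a, o)]$ on $\mathcal{SAO}_1$ are defined by $(s', a', o') \in [(s, a, o)]$ iff $h(s', a', o') = h(s, a, o)$ and the equivalence classes $[s]$ on $\mathcal{S}_1$, $[(s, a)]$ on $\mathcal{SA}_1$, and $[(s, o)]$ on $\mathcal{SO}_1$ are defined by the projections $\Pi_s([(s, a, o)])$, $\Pi_{s,a}([(s, a, o)])$, and $\Pi_{s,o}([(s, a, o)])$, respectively, which makes Equations 3.10, 3.13 and 3.14 necessary conditions.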

As for MDPs, it is necessary to be able to lift policies. This is only possible because the projections $\Pi_{s,a}$ and $\Pi_{s,o}$ project onto equivalence classes:

3.11 Definition (Policy Lifting for 2P-ZS-MGs)

Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs and let $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$, $h(s, a, o) = (f(s), g_s(a), i_s(o))$ be a 2P-ZS-MG homomorphism. Let $\hat{\pi}$ be a policy of the 2P-ZS-MG $\mathcal{M}_2$ for the first or second agent, as appropriate. Then the corresponding lifted policy $\pi$ of the 2P-ZS-MG $\mathcal{M}_1$ is defined for the first agent by

$$\pi(s, a) = \frac{\hat{\pi}(f(s), g_s(a))}{|g_s^{-1}(g_s(a))|} , \tag{3.20}$$

and for the second agent by

$$\pi(s, o) = \frac{\hat{\pi}(f(s), i_s(o))}{|i_s^{-1}(i_s(o))|} . \tag{3.21}$$

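Next, the matrix game reduction property which will be essential for the reduction by 2P-ZS-MG homomorphisms is introduced:

3.12 Definition (Matrix Game Reduction (MGR) Property)

Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs and let $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$, $h(s, a, o) = (f(s), g_s(a), i_s(o))$ be a 2P-ZS-MG homomorphism. Then $h$ is said to have the matrix game reduction (MGR) property iff for all states $s \in \mathcal{S}_1$ and for all matrices $M_{1,s} \in \mathbb{R}^{|\mathcal{A}_1(s)| \times |\mathcal{O}_1(s)|}$ and $M_{2,f(s)} \in \mathbb{R}^{|\mathcal{A}_2(f(s))| \times |\mathcal{O}_2(f(s))|}$ that respect the structure of $h$, i. e.

$$\forall (a', o') \in \mathcal{A}_2(f(s)) \times \mathcal{O}_2(f(s)) \;\forall (a, o) \in g_s^{-1}(a') \times i_s^{-1}(o') : M_{1,s}(a, o) = M_{2,f(s)}(a', o') ,$$

it holds that

$$V^*(M_{1,s}) = V^*(M_{2,f(s)}) \tag{3.22}$$

and additionally that the optimal policy of $M_{1,s}$ is a lifted optimal policy of $M_{2,f(s)}$ (interpreting the matrix game as a single-state 2P-ZS-MG as in Definition 2.19).

By definition Equation 3.22 is equivalent to(18)

$$\begin{aligned}
&\max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \; \min_{o \in \mathcal{O}_1(s)} \; \sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot M_{1,s}(a, o) \\
&\qquad = \max_{\pi_{f(s)} \in \mathrm{PD}(\mathcal{A}_2(f(s)))} \; \min_{o \in \mathcal{O}_2(f(s))} \; \sum_{a \in \mathcal{A}_2(f(s))} \pi_{f(s)}(a) \cdot M_{2,f(s)}(a, o) ,
\end{aligned} \tag{3.23}$$

where the abbreviation $M(a, o) = M_{i,j}$ if $a_i = a$ and $o_j = o$ is used.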
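The inner expression of Equation 3.23, i. e. the payoff a fixed mixed policy guarantees against the worst pure opponent action, can be evaluated directly. The sketch below does this for a reduced matrix and for an original matrix with a duplicated row, using a lifted policy as in Equation 3.20. It does not compute the outer maximum, which would require solving a linear program; the function names, the list-based representation, and the matching-pennies example are assumptions for illustration.

```python
def security_level(pi, M):
    """Worst-case expected payoff of the mixed row policy pi in the matrix game M."""
    n_rows, n_cols = len(M), len(M[0])
    return min(sum(pi[a] * M[a][o] for a in range(n_rows)) for o in range(n_cols))


def lift(pi_hat, row_class):
    """Lift a reduced row policy within one state (Equation 3.20).

    row_class[a] is the index of the reduced action that row a is mapped to.
    """
    return [pi_hat[c] / row_class.count(c) for c in row_class]


# Reduced game: matching pennies. Original game: its first row duplicated.
M2 = [[1.0, -1.0], [-1.0, 1.0]]
M1 = [[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
row_class = [0, 0, 1]

pi_hat = [0.5, 0.5]            # optimal for matching pennies
pi = lift(pi_hat, row_class)   # [0.25, 0.25, 0.5]

print(security_level(pi_hat, M2))  # 0.0
print(security_level(pi, M1))      # 0.0
```

The lifted policy guarantees the same payoff in the original game as the reduced policy does in the reduced game, which is the behaviour the MGR property requires of optimal policies.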

3.13 Proposition (Validity of the MGR Property)

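Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs and let $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$, $h(s, a, o) = (f(s), g_s(a), i_s(o))$ be a 2P-ZS-MG homomorphism. Then the MGR property holds.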

(18)Because of the minimax theorem (Theorem 2.23), the definition of the value of a matrix game (Def-

inition 2.21) and the fact that the minimum over probability distributions for the later deciding agent

(applied first in the equation) can be replaced by the pure actions.

Proof: Let $\mathcal{M}_1, \mathcal{M}_2, h, M_{1,s}, M_{2,f(s)}$ be as in Definition 3.12. It will be shown that $M_{1,s}$ can be transformed into $M_{2,f(s)}$ by removing equal rows and equal columns, and hence the associated matrix games have the same value. Since the structure of equal rows and columns corresponds to the structure of the state-wise intersection of equivalence classes $[(s, a)] \cap (\{s\} \times \mathcal{A}_1(s)) = \{s\} \times g_s^{-1}(g_s(a))$ and $[(s, o)] \cap (\{s\} \times \mathcal{O}_1(s)) = \{s\} \times i_s^{-1}(i_s(o))$, respectively, an optimal policy of $M_{2,f(s)}$ can be lifted to a policy of $M_{1,s}$.

For a fixed $s \in \mathcal{S}_1$ let $\{a'_1, \ldots, a'_{m_2}\} = \mathcal{A}_2(f(s))$ be an enumeration of the actions with $m_2 = |\mathcal{A}_2(f(s))|$, and let $\{o'_1, \ldots, o'_{n_2}\} = \mathcal{O}_2(f(s))$ be an enumeration of the actions with $n_2 = |\mathcal{O}_2(f(s))|$ such that the matrix is ordered as $M_{2,f(s)}(i, j) = M_{2,f(s)}(a'_i, o'_j)$. Since $g_s : \mathcal{A}_1(s) \to \mathcal{A}_2(f(s))$ is surjective, $g_s^{-1}(a'_i) \neq \emptyset$ for every $i$ and there exists an enumeration $\{a_1, \ldots, a_{m_1}\} = \mathcal{A}_1(s)$ with $a_k \in g_s^{-1}(a'_k)$ for $1 \le k \le m_2$. Since $\mathcal{A}_1(s) = \bigcup_{a' \in \mathcal{A}_2(f(s))} g_s^{-1}(a')$, for all $k > m_2$ it holds that $a_k \in g_s^{-1}(g_s(a_{i_k}))$ with $i_k \le m_2$. Then for all $o \in \mathcal{O}_1(s)$ it holds that $h(s, a_k, o) = (f(s), g_s(a_k), i_s(o)) = (f(s), g_s(a_{i_k}), i_s(o)) = h(s, a_{i_k}, o)$. This means that in $M_{1,s}$ for each $k > m_2$ the $k$-th row is equal to the $i_k$-th row where $i_k \le m_2$, and therefore all rows with index $k > m_2$ can be removed. Further, all actions in $g_s^{-1}(g_s(a_{i_k})) \subseteq \mathcal{A}_1(s)$ yield equal rows, which shows that the policy for agent P1 can be lifted.

For columns the analogous statement holds that there exists an enumeration $\{o_1, \ldots, o_{n_1}\} = \mathcal{O}_1(s)$ with $o_k \in i_s^{-1}(o'_k)$ for $1 \le k \le n_2$ and that all columns with index $k > n_2$ can be removed. Further, all actions in $i_s^{-1}(i_s(o_{i_k})) \subseteq \mathcal{O}_1(s)$ yield equal columns, which shows that the policy for agent P2 can also be lifted. Since equal columns stay equal when equal rows are removed, $M_{1,s}$ can be transformed into a reduced matrix $\widetilde{M}_{1,s}$ which by construction equals $M_{2,f(s)}$.
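The reduction step used in the proof, removing rows and columns that belong to the same equivalence class, can be sketched as follows. The function name `collapse`, the encoding of $g_s$ and $i_s$ as lists of class indices, and the numeric example are hypothetical; the sketch additionally checks that the input matrix respects the structure of $h$.

```python
def collapse(M1, row_class, col_class):
    """Collapse equivalent rows and columns of M1 into the reduced matrix.

    row_class[r] and col_class[c] give the reduced action that row r and
    column c are mapped to (the roles of g_s and i_s in Definition 3.12).
    Raises ValueError if equivalent entries of M1 differ.
    """
    reduced = {}
    for r, rc in enumerate(row_class):
        for c, cc in enumerate(col_class):
            value = M1[r][c]
            # The first entry seen for a class pair fixes the reduced value.
            if reduced.setdefault((rc, cc), value) != value:
                raise ValueError("M1 does not respect the structure of h")
    rows, cols = sorted(set(row_class)), sorted(set(col_class))
    return [[reduced[(rc, cc)] for cc in cols] for rc in rows]


# Rows 0 and 1 are equivalent, columns 0 and 1 are equivalent.
M1 = [[1, 1, 2],
      [1, 1, 2],
      [3, 3, 4]]
M2 = collapse(M1, row_class=[0, 0, 1], col_class=[0, 0, 1])
print(M2)  # [[1, 2], [3, 4]]
```

If two entries that should be equivalent differ, the matrix does not fulfil the premise of the MGR property and the sketch raises an error instead of returning a reduced matrix.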

3.14 Remark (MGR Property for MDPs)

For MDPs, the MGR property can be interpreted with $M_{1,s}$ being a vector (the minimiser has only one action to “choose”), which makes the proof of Proposition 3.13 as simple as $\max_{a \in \mathcal{A}_1(s)} M_{1,s}(a) = \max_{a \in \mathcal{A}_2(f(s))} M_{2,f(s)}(a)$ by noting that $g_s : \mathcal{A}_1(s) \to \mathcal{A}_2(f(s))$ is surjective.(19)

3.15 Example (Structure of Matrices with MGR Property)

The definition of the MGR property is in principle independent of a state $s$ although it must hold for all states. Therefore, only two matrices $M_1$ and $M_2$ are presented which can be thought of as $M_{1,s}(a, o)$ and $M_{2,f(s)}(a', o')$:

$$M_1 = \left(\begin{array}{cc|c} c_1 & c_1 & c_2 \\ \hline c_3 & c_3 & c_4 \end{array}\right), \qquad M_2 = \begin{pmatrix} c_1 & c_2 \\ c_3 & c_4 \end{pmatrix} .$$

The lines indicate the borders of different equivalent action pairs. To fulfil the MGR property there has to exist an enumeration of the actions such that the lines generate a grid on the matrix. The type of the grid is defined by the equivalence classes of a 2P-ZS-MG while the numbers $c_i \in \mathbb{R}$ are arbitrary.

3.16 Theorem (Main Implications of 2P-ZS-MG Homomorphisms)

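Let $\mathcal{M}_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ be a 2P-ZS-MG and let $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$ be a 2P-ZS-MG homomorphism (Definition 3.10) for some 2P-ZS-MG $\mathcal{M}_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$.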

(19)The simplicity of this argument could be the reason why it is not mentioned in [170]. However, for matrix games the situation is a little less simple.

Then, $V^*(s) = V^*(f(s))$ and $Q^*(s, a, o) = Q^*(h(s, a, o))$. Furthermore, there exists an optimal policy of $\mathcal{M}_1$ which is a lifted optimal policy of $\mathcal{M}_2$.

Proof: Parts of the proof are similar to those of Theorem 3.6 in [170](20) and [217](21) but were developed independently, and the theorem here is valid for a larger class of problems. Let $V_k$ be the $k$-th value iterate for $V_0 = 0$ (it is only necessary that $V_0$ respects the symmetry, but this choice respects all possible symmetries), i. e. $V_k = V_k(Q_{k-1})$ according to Equation 2.48, and $Q_k = Q_k(V_k)$ according to Equation 2.49. It will be shown by induction that each of the (Q-)value iterates keeps the same symmetry, which implies that the limits $V^*$ and $Q^*$, respectively, also possess this symmetry.

The induction starts by noticing that $V_0 = 0$ respects any symmetry and that then $Q_0 = R_1$ also respects any symmetry which can be induced by $h$, by its definition. The induction hypothesis (I. H.) is the assumption that this is true for all iterates up to $V_{k-1}, Q_{k-1}$, i. e. for all $j \le k - 1$ and for $h(s, a, o) = (f(s), g_s(a), i_s(o))$ it holds that $V_j(s) = V_j(f(s))$ and $Q_j(s, a, o) = Q_j(h(s, a, o))$. Here, the value functions are not indexed like the state spaces because the argument uniquely shows which value function is meant. Then, for $h(s, a, o) = (f(s), g_s(a), i_s(o))$ it holds that:

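$$\begin{aligned}
V_k(s) &= \max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \; \min_{o \in \mathcal{O}_1(s)} \; \sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot Q_{k-1}(s, a, o) \\
&\overset{\text{I. H.}}{=} \max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \; \min_{o \in \mathcal{O}_1(s)} \; \sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot Q_{k-1}(f(s), g_s(a), i_s(o)) \\
&\overset{(1)}{=} \max_{\pi_{f(s)} \in \mathrm{PD}(\mathcal{A}_2(f(s)))} \; \min_{o \in \mathcal{O}_2(f(s))} \; \sum_{a \in \mathcal{A}_2(f(s))} \pi_{f(s)}(a) \cdot Q_{k-1}(f(s), a, o) \\
&= V_k(f(s)) ,
\end{aligned}$$

where (1) is the MGR property (Definition 3.12 and below) which is always fulfilled (Proposition 3.13) and also guarantees the existence of a lifted policy for every state $s$.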

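Furthermore,

$$\begin{aligned}
Q_k(s, a, o) &= R_1(s, a, o) + \gamma \sum_{s' \in \mathcal{S}_1} T_1(s, a, o, s') \cdot V_k(s') \\
&\overset{(1)}{=} R_2(h(s, a, o)) + \gamma \sum_{s' \in \mathcal{S}_1} T_1(s, a, o, s') \cdot V_k(f(s')) \\
&\overset{(2)}{=} R_2(h(s, a, o)) + \gamma \sum_{f(s') \in \mathcal{S}_2} V_k(f(s')) \sum_{s'' \in f^{-1}(f(s'))} T_1(s, a, o, s'') \\
&\overset{(3)}{=} R_2(h(s, a, o)) + \gamma \sum_{f(s') \in \mathcal{S}_2} V_k(f(s')) \cdot \widetilde{T}_1(s, a, o, [s']) \\
&\overset{(4)}{=} R_2(h(s, a, o)) + \gamma \sum_{f(s') \in \mathcal{S}_2} T_2(h(s, a, o), f(s')) \cdot V_k(f(s')) \\
&\overset{(5)}{=} R_2(h(s, a, o)) + \gamma \sum_{s' \in \mathcal{S}_2} T_2(h(s, a, o), s') \cdot V_k(s') \\
&= Q_k(h(s, a, o)) .
\end{aligned}$$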

(20)The author thinks that for the second equality sign of the proof in [170] some implicit assumptions on

the state-value function Vmare made which are more obvious in the presentation here.

(21)That proof on the other hand is more formal but possibly more complex than necessary.

(1) is valid because of the definition of $h$ and because the symmetry of $V_k$ was shown directly before; for (2) note that $f: \mathcal{S}_1 \to \mathcal{S}_2$; (3) holds by definition of the block transition function $\tilde{T}_1$ and because $f^{-1}(f(s')) = [s']$; (4) holds again by definition; and (5) holds because $f$ is surjective. □
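The statement of Theorem 3.16 can be checked numerically on a toy example. The following Python sketch is only an illustration under assumptions that are not part of the thesis: a hypothetical game $M_1$ with two states "L" and "R" whose rewards are mirror images of each other (both agents' actions swapped), arbitrarily chosen transition probabilities, and a one-state image game $M_2$. The helper `value_2x2` uses the standard closed form for 2x2 matrix games instead of a general linear programme.

```python
GAMMA = 0.5
M = [[3.0, -1.0], [-2.0, 4.0]]  # hypothetical reward matrix of the one-state game M2
STATES = ("L", "R")


def value_2x2(m):
    """Value of a 2x2 zero-sum matrix game for the maximising row player."""
    (a, b), (c, d) = m
    maximin = max(min(a, b), min(c, d))  # best pure-strategy guarantee of the row player
    minimax = min(max(a, c), max(b, d))  # best pure-strategy guarantee of the column player
    if maximin == minimax:               # saddle point in pure strategies
        return maximin
    return (a * d - b * c) / (a + d - b - c)  # otherwise the equilibrium is mixed


def reward(s, a, o):
    # State "R" mirrors state "L": g_R and i_R swap the two actions.
    return M[a][o] if s == "L" else M[1 - a][1 - o]


def prob_next_left(s, a, o):
    # Arbitrary transition probabilities; since M2 has a single state,
    # the block transition condition holds for any choice.
    return (1 + a + 2 * o) / 5 if s == "L" else (4 - a - 2 * o) / 5


def value_iteration(n_iter=80):
    v = {s: 0.0 for s in STATES}  # V_0 = 0 respects every symmetry
    for _ in range(n_iter):
        q = {
            s: [[reward(s, a, o)
                 + GAMMA * (prob_next_left(s, a, o) * v["L"]
                            + (1 - prob_next_left(s, a, o)) * v["R"])
                 for o in (0, 1)] for a in (0, 1)]
            for s in STATES
        }
        v = {s: value_2x2(q[s]) for s in STATES}
    return v


V1 = value_iteration()
V2 = value_2x2(M) / (1 - GAMMA)  # fixed point of value iteration in the one-state game M2
```

Up to rounding, both states of the hypothetical $M_1$ should receive the value of the single state of $M_2$, which is what the theorem predicts for $f(\mathrm{L}) = f(\mathrm{R})$.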

3.17 Remark (Implications of Theorem 3.16 on Complexity)

Remark 3.7 is analogously valid for 2P-ZS-MGs. In particular, the optimal value function is constant on $[s]_h$, the optimal Q-value function is constant on $[(s, a, o)]_h$, and there exists an optimal policy for each agent which is constant on $[(s, a)]_h$ and $[(s, o)]_h$, respectively.

3.2.2 Automorphisms for the Exchange of Agents

After the analogues of MDP symmetries in 2P-ZS-MGs have been exploited, a qualitatively new symmetry that results from exchanging the two agents is introduced.(22) For that purpose, the concept of µ-homomorphisms is employed. In practice, agent exchanging symmetries have been used for different board games – as already mentioned in the introduction of this chapter – but the present work lays formal foundations for this practice and, more importantly, shows that the exchange of agents is also compatible with the other standard symmetries obtained by 2P-ZS-MG homomorphisms (Proposition 3.24).

Briefly summarised, a 2P-ZS-MG µ-homomorphism is a 2P-ZS-MG homomorphism for which the reward condition is changed by a factor of $\mu$. If $\mu > 0$, then the proof of Theorem 3.16 holds with the simple changes that $R_1(s, a, o) = \mu \cdot R_2(h(s, a, o))$, $V_k(s) = \mu \cdot V_k(f(s))$, and $Q_k(s, a, o) = \mu \cdot Q_k(h(s, a, o))$, because a positive constant can be factored out of an expression to maximise or to minimise: $\max_x(\mu \cdot f(x)) = \mu \cdot \max_x(f(x))$.

However, for a negative constant $\mu < 0$ a maximum turns into a minimum and vice versa: $\max_x(\mu \cdot f(x)) = \mu \cdot \min_x(f(x))$. This makes the two cases essentially different and justifies the separate definition. In fact, the case $\mu > 0$ only introduces an additional scaling of the reward to the framework of Section 3.2.1, but $\mu < 0$ points to the new aspect of agent exchanging symmetries in 2P-ZS-MGs.
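The two scaling rules for the maximum can be illustrated with a few numbers. The following minimal Python sketch uses hypothetical sample values and scaling factors chosen only for this illustration.

```python
# Hypothetical function values f(x) on a finite set of arguments x.
VALUES = [-2.0, 0.5, 3.0]
MU_POS, MU_NEG = 2.0, -2.0

# A positive factor can be pulled out of the maximum ...
max_scaled_pos = max(MU_POS * v for v in VALUES)
# ... whereas a negative factor turns the maximum into a minimum.
max_scaled_neg = max(MU_NEG * v for v in VALUES)

pos_rule_holds = max_scaled_pos == MU_POS * max(VALUES)
neg_rule_holds = max_scaled_neg == MU_NEG * min(VALUES)
```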

Now, 2P-ZS-MG µ-homomorphisms are to be defined:

3.18 Definition (2P-ZS-MG µ-Homomorphism)

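Let $M_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $M_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs as defined in Section 2.4 with $D_1 = D_2 = \mathbb{N}_0$. A map $h: \mathcal{SAO}_1 \to \mathcal{SAO}_2$ is called a 2P-ZS-MG µ-homomorphism if $\mu \neq 0$, $h$ is a surjection, and additionally the following holds:

1.) if $\mu > 0$: $h$ is defined by a tuple of surjective maps $(f, (g_s)_{s \in \mathcal{S}_1}, (i_s)_{s \in \mathcal{S}_1})$ with $h(s, a, o) = (f(s), g_s(a), i_s(o))$, $f: \mathcal{S}_1 \to \mathcal{S}_2$, $g_s: \mathcal{A}_1(s) \to \mathcal{A}_2(f(s))$, and $i_s: \mathcal{O}_1(s) \to \mathcal{O}_2(f(s))$,

2.) if $\mu < 0$: $h$ is defined by a tuple of surjective maps $(f, (g_s)_{s \in \mathcal{S}_1}, (i_s)_{s \in \mathcal{S}_1})$ with $h(s, a, o) = (f(s), i_s(o), g_s(a))$, $f: \mathcal{S}_1 \to \mathcal{S}_2$, $g_s: \mathcal{A}_1(s) \to \mathcal{O}_2(f(s))$, and $i_s: \mathcal{O}_1(s) \to \mathcal{A}_2(f(s))$,

such that

\[
\forall (s, a, o) \in \mathcal{SAO}_1 \ \forall s' \in \mathcal{S}_1: \quad \tilde{T}_1(s, a, o, [s']) = T_2(h(s, a, o), f(s')) \tag{3.24}
\]

and

\[
\forall (s, a, o) \in \mathcal{SAO}_1: \quad R_1(s, a, o) = \mu \cdot R_2(h(s, a, o)) \tag{3.25}
\]

where the block transition function $\tilde{T}_1$ of the 2P-ZS-MG $M_1$ is defined as in Equation 3.19 and the equivalence classes fulfill the projective conditions of 2P-ZS-MG homomorphisms, which makes Equations 3.10, 3.13 and 3.14 necessary conditions.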

(22)This exchange of agents is not to be confused with a permutation of agents in a multi-player MDP.

The MGR property remains unchanged for $\mu > 0$ but changes significantly for $\mu < 0$:

3.19 Remark (Adaptation of MGR Property to µ-homomorphisms)

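For $\mu < 0$, $h$ is said to have the matrix game reduction (MGR) property iff for all states $s \in \mathcal{S}_1$ and for all matrices $M_{1,s} \in \mathbb{R}^{|\mathcal{A}_1(s)|, |\mathcal{O}_1(s)|}$ and $M_{2,f(s)} \in \mathbb{R}^{|\mathcal{A}_2(f(s))|, |\mathcal{O}_2(f(s))|}$ that respect the structure of $h$, i. e.

\[
\forall (a', o') \in \mathcal{A}_2(f(s)) \times \mathcal{O}_2(f(s)) \ \forall (a, o) \in g_s^{-1}(o') \times i_s^{-1}(a'): \quad (M_{1,s})(a, o) = -\left((M_{2,f(s)})(a', o')\right)^T,
\]

it holds that

\[
V^*(M_{1,s}) = -V^*(M_{2,f(s)}) \tag{3.26}
\]

and that the optimal policies of $M_{1,s}$ are lifted optimal policies of $M_{2,f(s)}$.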

The proof that this property holds for every such 2P-ZS-MG µ-homomorphism is analogous to that of Proposition 3.13, with the difference that it has to be shown that $M_{1,s}$ can be transformed into a reduced matrix equal to $-M_{2,f(s)}^T$ by removing equal rows and equal columns. The claim then follows from the Minimax Theorem (Theorem 2.23), by which for any matrix $N$ holds:

\[
\max_{\pi_1} \min_{\pi_2} \pi_1^T N \pi_2 = \min_{\pi_2} \max_{\pi_1} \pi_1^T N \pi_2 = -\max_{\pi_2} \min_{\pi_1} -\pi_1^T N \pi_2 = -\max_{\pi_2} \min_{\pi_1} \pi_2^T (-N^T) \pi_1 .
\]

This means that $V^*(N) = -V^*(-N^T)$ and that a policy pair $(\pi_1^*, \pi_2^*)$ in $N$ is optimal iff $(\pi_2^*, \pi_1^*)$ is optimal in $-N^T$.
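The identity $V^*(N) = -V^*(-N^T)$ can be checked on small examples. The following Python sketch is an illustration only; it is restricted to 2x2 matrices, for which the game value has a well-known closed form, and the two matrices are hypothetical.

```python
def value_2x2(m):
    """Value of a 2x2 zero-sum matrix game for the maximising row player."""
    (a, b), (c, d) = m
    maximin = max(min(a, b), min(c, d))  # best pure-strategy guarantee of the row player
    minimax = min(max(a, c), max(b, d))  # best pure-strategy guarantee of the column player
    if maximin == minimax:               # saddle point in pure strategies
        return maximin
    return (a * d - b * c) / (a + d - b - c)  # otherwise the equilibrium is mixed


def neg_transpose(m):
    """Return -m^T, i.e. the same game seen from the exchanged agent."""
    return [[-m[j][i] for j in range(2)] for i in range(2)]


N_MIXED = [[3.0, -1.0], [-2.0, 4.0]]   # hypothetical game without a pure saddle point
N_SADDLE = [[1.0, 2.0], [3.0, 4.0]]    # hypothetical game with a pure saddle point

# For each matrix N, V*(N) + V*(-N^T) should vanish.
residuals = [value_2x2(n) + value_2x2(neg_transpose(n)) for n in (N_MIXED, N_SADDLE)]
```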

To see that $M_{1,s}$ can be transformed into a reduced matrix equal to $-M_{2,f(s)}^T$, the following adaptations have to be made: the statement remains that the structure of equal rows and columns corresponds to the structure of the state-wise intersection of equivalence classes $[(s, a)] \cap (\{s\} \times \mathcal{A}_1(s)) = \{s\} \times g_s^{-1}(g_s(a))$ and $[(s, o)] \cap (\{s\} \times \mathcal{O}_1(s)) = \{s\} \times i_s^{-1}(i_s(o))$, respectively, and that an optimal policy from $M_{2,f(s)}$ can be lifted to a policy of $M_{1,s}$.

Then some minor changes are necessary because now $g_s: \mathcal{A}_1(s) \to \mathcal{O}_2(f(s))$ and $i_s: \mathcal{O}_1(s) \to \mathcal{A}_2(f(s))$: For a fixed $s \in \mathcal{S}_1$ let $\{o'_1, \ldots, o'_{m_2}\} = \mathcal{O}_2(f(s))$ be an enumeration of the actions with $m_2 = |\mathcal{O}_2(f(s))|$, and let $\{a'_1, \ldots, a'_{n_2}\} = \mathcal{A}_2(f(s))$ be an enumeration of the actions with $n_2 = |\mathcal{A}_2(f(s))|$ such that the transposed matrix $(M_{2,f(s)}(i, j))^T$ is ordered as $(M_{2,f(s)}(i, j))^T = (M_{2,f(s)}(a'_i, o'_j))^T$. Since $g_s: \mathcal{A}_1(s) \to \mathcal{O}_2(f(s))$ is surjective, $g_s^{-1}(o'_i) \neq \emptyset$ for every $i$ and there exists an enumeration $\{a_1, \ldots, a_{m_1}\} = \mathcal{A}_1(s)$ with $a_k \in g_s^{-1}(o'_k)$ for $1 \le k \le m_2$. Since $\mathcal{A}_1(s) = \bigcup_{o' \in \mathcal{O}_2(f(s))} g_s^{-1}(o')$, for all $k > m_2$ holds that $a_k \in g_s^{-1}(g_s(a_{i_k}))$ with $i_k \le m_2$. Then for all $o \in \mathcal{O}_1(s)$ holds: $h(s, a_k, o) = (f(s), i_s(o), g_s(a_k)) = (f(s), i_s(o), g_s(a_{i_k})) = h(s, a_{i_k}, o)$. This means that in $M_{1,s}$ for each $k > m_2$ the $k$-th row is equal to the $i_k$-th row where $i_k \le m_2$, and therefore all rows with index $k > m_2$ can be removed. Further, all actions in $g_s^{-1}(g_s(a_{i_k})) \subseteq \mathcal{A}_1(s)$ are equivalent, which shows that the policy for P1 can be lifted.

For columns the analogous statement holds that there exists an enumeration $\{o_1, \ldots, o_{n_1}\} = \mathcal{O}_1(s)$ with $o_k \in i_s^{-1}(a'_k)$ for $1 \le k \le n_2$ and that all columns with index $k > n_2$ can be removed. Further, all actions in $i_s^{-1}(i_s(o_{i_k})) \subseteq \mathcal{O}_1(s)$ are equivalent, which shows that the policy for P2 can also be lifted. Since equal columns stay equal when equal rows are removed, $M_{1,s}$ can be transformed into a reduced matrix which equals by construction $-M_{2,f(s)}^T$.

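By definition, Equation 3.26 is equivalent to

\[
\max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \min_{o \in \mathcal{O}_1(s)} \sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot M_{1,s}(a, o)
= -\max_{\pi_{f(s)} \in \mathrm{PD}(\mathcal{A}_2(f(s)))} \min_{o \in \mathcal{O}_2(f(s))} \sum_{a \in \mathcal{A}_2(f(s))} \pi_{f(s)}(a) \cdot M_{2,f(s)}(a, o) \tag{3.27}
\]

which means

\[
\max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \min_{o \in \mathcal{O}_1(s)} -\sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot \left(M_{2,f(s)}(g_s(a), i_s(o))\right)^T
= -\max_{\pi_{f(s)} \in \mathrm{PD}(\mathcal{A}_2(f(s)))} \min_{o \in \mathcal{O}_2(f(s))} \sum_{a \in \mathcal{A}_2(f(s))} \pi_{f(s)}(a) \cdot M_{2,f(s)}(a, o). \tag{3.28}
\]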

In the following theorem all other corresponding main theorems of this chapter are included

because 2P-ZS-MG homomorphisms are 2P-ZS-MG µ-homomorphisms with µ= 1 >0.

3.20 Theorem (Main Implications of 2P-ZS-MG µ-Homomorphisms)

Let $M_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ be a 2P-ZS-MG and let $h: \mathcal{SAO}_1 \to \mathcal{SAO}_2$ be a 2P-ZS-MG µ-homomorphism (Definition 3.18) for some 2P-ZS-MG $M_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$. Then, $V^*(s) = \mu \cdot V^*(f(s))$ and $Q^*(s, a, o) = \mu \cdot Q^*(h(s, a, o))$. Furthermore, there exists an optimal policy of $M_1$ which is a lifted optimal policy of $M_2$ where, in the case of $\mu < 0$, the two policies $\pi, \hat{\pi}$ in each of the two formulae in Definition 3.11 are from different agents, as the domains and co-domains of $g_s$ and $i_s$ indicate.

Proof: As mentioned at the beginning of this subsection, the case $\mu > 0$ just introduces a factor $\mu$ in the proof of Theorem 3.16 such that $R_1(s, a, o) = \mu \cdot R_2(h(s, a, o))$, $V_k(s) = \mu \cdot V_k(f(s))$, and $Q_k(s, a, o) = \mu \cdot Q_k(h(s, a, o))$.

However, for $\mu < 0$ something essentially different can be observed. In this case, the induction again starts by noticing that $V_0 = 0$ respects any symmetry and that then $Q_0 = R_1$ also respects any symmetry which can be induced by $h$. As induction hypothesis (I. H.), it is assumed that this is true for all iterates up to $V_{k-1}, Q_{k-1}$, i. e. for all $j \le k-1$ and for $h(s, a, o) = (f(s), i_s(o), g_s(a))$ it holds that $V_j(s) = \mu \cdot V_j(f(s))$ and $Q_j(s, a, o) = \mu \cdot Q_j(h(s, a, o))$. Again, the value functions are not indexed like the state spaces because the argument uniquely shows which value function is meant. Thus, for $h(s, a, o) = (f(s), i_s(o), g_s(a))$ holds:

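\[
\begin{aligned}
V_k(s) &= \max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \min_{o \in \mathcal{O}_1(s)} \sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot Q_{k-1}(s, a, o) \\
&\overset{\text{I. H.}}{=} \max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \min_{o \in \mathcal{O}_1(s)} \sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot \mu \cdot Q_{k-1}(f(s), i_s(o), g_s(a)) \\
&\overset{(1)}{=} -\mu \cdot \max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \min_{o \in \mathcal{O}_1(s)} -\sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot Q_{k-1}(f(s), i_s(o), g_s(a)) \\
&\overset{(2)}{=} \mu \cdot \max_{\pi_{f(s)} \in \mathrm{PD}(\mathcal{A}_2(f(s)))} \min_{o \in \mathcal{O}_2(f(s))} \sum_{a \in \mathcal{A}_2(f(s))} \pi_{f(s)}(a) \cdot Q_{k-1}(f(s), a, o) \\
&= \mu \cdot V_k(f(s)),
\end{aligned}
\]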

whereas (1) holds because $-\mu > 0$ and (2) is the MGR property (Remark 3.19 and Equation 3.28), the proof of which needed the minimax theorem for matrix games (Theorem 2.23). Then, also $Q_k(s, a, o) = \mu \cdot Q_k(h(s, a, o))$ analogously to the proof for 2P-ZS-MG homomorphisms. □

3.21 Remark (New Aspects of Theorem 3.20)

One of the most interesting aspects of Theorem 3.20 besides its existence is that the proof for the case $\mu < 0$ needs the minimax theorem (Theorem 2.23) for matrix games. This aspect should be highlighted because even an informal argument like “one exchanges the two agents in equal situations”, which is often given by practitioners and points out the essence of the theorem, implicitly makes use of the minimax theorem. The reason is that the minimax theorem states that for an optimal pair of policies it does not matter whether agent one or agent two has to decide its strategy first (and tell it to the other) or whether both decide without knowledge of the other agent’s policy. It is obvious that such a property is important in order to speak about an “equal situation” of both agents.

3.22 Example (Robot Soccer, 4)

In Figure 3.1, some standard symmetries in robot soccer are depicted which should be

respected by any model describing a robot soccer game. In this example, an abstract

agent is a team consisting of two robots. A permutation of identical agents within a

team is additionally possible but only depicted indirectly as the robots have no individual

numbers.

[Figure 3.1, panels: (a) State $s$. (b) State $g_y(s)$. (d) State $(g_x \circ g_y)(s) = (g_y \circ g_x)(s)$.]

Figure 3.1: Symmetries in robot soccer: a standard situation (state) $s$ and its symmetric states $g_y(s)$ with exchange of the two teams, $g_x(s)$ without the exchange of teams, and the combination of both: $(g_x \circ g_y)(s) = (g_y \circ g_x)(s)$. The labels 1 and 2 indicate the team and the defended goal region, and the small black circle depicts the ball.

3.23 Remark (Fulfillment of the Projection Properties)

Any 2P-ZS-MG µ-homomorphism $h: \mathcal{SAO}_1 \to \mathcal{SAO}_2$ fulfills the projective properties (Equations 3.10, 3.13 and 3.14) because without using these properties it can be shown for $\mu > 0$ and $\mu < 0$ that

\[
\begin{aligned}
\Pi_{s,a}(h^{-1}(h(s, a, o))) &= \{(s', a') : f(s) = f(s') \text{ and } g_{s'}(a') = g_s(a)\}, \\
\Pi_{s,o}(h^{-1}(h(s, a, o))) &= \{(s', o') : f(s) = f(s') \text{ and } i_{s'}(o') = i_s(o)\}, \text{ and} \\
\Pi_{s}(h^{-1}(h(s, a, o))) &= \{s' : f(s') = f(s)\}.
\end{aligned}
\]

Proof of e. g. $\Pi_{s,a}(h^{-1}(h(s, a, o))) = \{(s', a') : f(s) = f(s') \text{ and } g_{s'}(a') = g_s(a)\}$ for $\mu > 0$:

“⊆”: Let $(s', a') \in \Pi_{s,a}(h^{-1}(h(s, a, o)))$, i. e. there exists $o' \in \mathcal{O}_1(s')$ with $h(s', a', o') = h(s, a, o)$. Then $f(s) = f(s')$ and $g_{s'}(a') = g_s(a)$.

“⊇”: Let $(s', a') \in \{(s'', a'') : f(s) = f(s'') \text{ and } g_{s''}(a'') = g_s(a)\}$. Then $f(s) = f(s')$ and $g_{s'}(a') = g_s(a)$. Choose an arbitrary $o \in \mathcal{O}_1(s)$ and $o' \in i_{s'}^{-1}(i_s(o)) \subseteq \mathcal{O}_1(s')$, which is non-empty since $i_{s'}: \mathcal{O}_1(s') \to \mathcal{O}_2(f(s')) = \mathcal{O}_2(f(s))$ is surjective. Then $h(s', a', o') = h(s, a, o)$, which means $(s', a') \in \Pi_{s,a}(h^{-1}(h(s, a, o)))$.

The other projections for $\mu > 0$ are similar. For $\mu < 0$, only the different image spaces, e. g. $i_{s'}: \mathcal{O}_1(s') \to \mathcal{A}_2(f(s')) = \mathcal{A}_2(f(s))$, have to be taken into account additionally.

3.24 Proposition (Composition of 2P-ZS-MG µ-Homomorphisms)

Let $h_1: \mathcal{SAO}_1 \to \mathcal{SAO}_2$ be a $\mu_1$-homomorphism and $h_2: \mathcal{SAO}_2 \to \mathcal{SAO}_3$ be a $\mu_2$-homomorphism for three associated 2P-ZS-MGs $M_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$, $M_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$, and $M_3 = (D_3, \mathcal{S}_3, \mathcal{SAO}_3, T_3, R_3)$. Then, $h_3 = h_2 \circ h_1: \mathcal{SAO}_1 \to \mathcal{SAO}_3$ is a $\mu_3$-homomorphism with $\mu_3 = \mu_1 \cdot \mu_2$.

Proof: To check that $h_3$ is a 2P-ZS-MG $\mu_3$-homomorphism, one first must notice that the domains and co-domains of the component functions $f_k, g_{k,s}, i_{k,s}$ ($k \in \{1, 2, 3\}$) of all three homomorphisms fit together for all four cases ($\mu_1, \mu_2$ positive or negative) because a change of sign is equivalent to a change of the agents.

According to Remark 3.23 the equivalence classes $[(s, a, o)]_{h_3} \subseteq \mathcal{SAO}_1$ induced by $h_3$ fulfill the projection properties, Equations 3.10, 3.13 and 3.14.

It must then be verified that the transition and reward functions also fulfill the 2P-ZS-MG $\mu_3$-homomorphism conditions. For the transition function that means to prove that $\tilde{T}_1(s, a, o, [s']_{h_3}) = T_3(h_3(s, a, o), f_3(s'))$ under the premise that $\tilde{T}_k(s, a, o, [s']_{h_k}) = T_{k+1}(h_k(s, a, o), f_k(s'))$ for $k \in \{1, 2\}$. The following equation is helpful in the proof

\[
\begin{aligned}
[(s, a, o)]_{h_3} &= h_3^{-1}(h_3(s, a, o)) \\
&= (h_2 \circ h_1)^{-1}((h_2 \circ h_1)(s, a, o)) \\
&= h_1^{-1}(h_2^{-1}(h_2(h_1(s, a, o)))) \\
&= h_1^{-1}([h_1(s, a, o)]_{h_2}) \\
&= \bigcup_{(s', a', o') \in [h_1(s, a, o)]_{h_2}} h_1^{-1}(s', a', o')
\end{aligned}
\]

because by projection holds $[s]_{h_3} = \bigcup_{s' \in [f_1(s)]_{h_2}} f_1^{-1}(s')$, which is needed below for (∗) together with the surjectivity of $f_1$:

\[
\begin{aligned}
\tilde{T}_1(s, a, o, [s']_{h_3}) &= \sum_{s'' \in [s']_{h_3}} T_1(s, a, o, s'') \\
&\overset{(*)}{=} \sum_{f_1(s'') \in [f_1(s')]_{h_2}} \ \sum_{s''' \in [s'']_{h_1}} T_1(s, a, o, s''') \\
&= \sum_{f_1(s'') \in [f_1(s')]_{h_2}} T_2(h_1(s, a, o), f_1(s'')) \\
&= \tilde{T}_2(h_1(s, a, o), [f_1(s')]_{h_2}) \\
&= T_3(h_2(h_1(s, a, o)), f_2(f_1(s'))) \\
&= T_3(h_3(s, a, o), f_3(s')) .
\end{aligned}
\]

Finally, the reward condition reads $R_1 = \mu_1 \cdot (R_2 \circ h_1) = (\mu_1 \mu_2) \cdot (R_3 \circ (h_2 \circ h_1))$, which shows that $h_3$ is a 2P-ZS-MG $\mu_3$-homomorphism with $\mu_3 = \mu_1 \cdot \mu_2$. □

The intuition that the composition of 2P-ZS-MG µ-homomorphisms leads to coarser (if not equal) equivalence classes can formally be obtained from

\[
[(s, a, o)]_{h_3} = \bigcup_{(s', a', o') \in [h_1(s, a, o)]_{h_2}} h_1^{-1}(s', a', o')
\]

which is used in the proof of Proposition 3.24. Because $h_1$ is surjective, for every $(s', a', o') \in [h_1(s, a, o)]_{h_2}$ there exists an $(s'', a'', o'') \in \mathcal{SAO}_1$ with $(s', a', o') = h_1(s'', a'', o'')$, i. e. $[(s, a, o)]_{h_3} = \bigcup_i [(s_i, a_i, o_i)]_{h_1}$ is a union of equivalence classes with respect to $h_1$.
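The coarsening of equivalence classes under composition can be made concrete with a small set-theoretic example. The following Python sketch is an illustration only: the eight abstract elements stand in for state-action triples, the maps `h1` and `h2` are hypothetical surjections, and the transition and reward conditions of a µ-homomorphism are deliberately ignored.

```python
from collections import defaultdict


def equivalence_classes(domain, h):
    """Partition the domain into the preimages h^{-1}(y) of a map h."""
    blocks = defaultdict(set)
    for x in domain:
        blocks[h(x)].add(x)
    return [frozenset(block) for block in blocks.values()]


DOMAIN = range(8)  # abstract stand-ins for the triples (s, a, o)


def h1(x):
    return x // 2  # hypothetical surjection onto {0, 1, 2, 3}


def h2(y):
    return y // 2  # hypothetical surjection onto {0, 1}


def h3(x):
    return h2(h1(x))  # the composition h2 after h1


def is_union_of(block, finer_classes):
    """Check that a block is exactly a union of classes of the finer partition."""
    covered = set()
    for c in finer_classes:
        if c & block:
            if not c <= block:
                return False
            covered |= c
    return covered == set(block)


classes_h1 = equivalence_classes(DOMAIN, h1)
classes_h3 = equivalence_classes(DOMAIN, h3)
coarsening_holds = all(is_union_of(block, classes_h1) for block in classes_h3)
```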

3.25 Lemma (2P-ZS-MG µ-Automorphisms)

Let $M_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ be a 2P-ZS-MG with finite $\mathcal{SAO}_1$ and let $h: \mathcal{SAO}_1 \to \mathcal{SAO}_1$ be a 2P-ZS-MG µ-automorphism. Then $\mu \in \{1, -1\}$.

Proof: Let $r_{\max} = \max_{(s, a, o) \in \mathcal{SAO}_1} |R_1(s, a, o)| \neq 0$ be the maximal modulus of the reward of the 2P-ZS-MG. Since $h$ is a 2P-ZS-MG µ-automorphism, $h \circ h = h^2$ is a 2P-ZS-MG $\mu^2$-automorphism by means of Proposition 3.24 and $R_1 = \mu^2 (R_1 \circ h^2)$. It must be verified that $\mu^2 = 1$. Two cases have to be excluded: $1 > \mu^2 > 0$ and $\mu^2 > 1$. If $\mu^2 > 1$, then all state-actions $(s, a, o)$ with $R_1(s, a, o) = r_{\max}$ are mapped to a state-action with reward modulus $r_{\max} \cdot \mu^2 > r_{\max}$, in contradiction to the fact that $h$ is an automorphism and $r_{\max}$ is maximal. If $\mu^2 < 1$, then each state-action $(s, a, o)$ is mapped to a state-action with a modulus of reward $|R_1(s, a, o)| \cdot \mu^2 \le r_{\max} \cdot \mu^2 < r_{\max}$, in contradiction to the fact that $h$ is a bijection. □

3.26 Remark (Finite and Infinite 2P-ZS-MG µ-Automorphisms)

Lemma 3.25 shows that automorphisms of finite 2P-ZS-MGs can only have two forms: the first is the 2P-ZS-MG homomorphism ($\mu = 1$) and the second one can be considered as a 2P-ZS-MG homomorphism with the two agents exchanged ($\mu = -1$). For MDPs, the second possibility is meaningless, which justifies the consideration of MDP homomorphisms without general $\mu > 0$ in [171].

However, for (countably and uncountably) infinite 2P-ZS-MGs the lemma is not valid. Consider e. g. the following 2P-ZS-MG $M = (D, \mathcal{S}, \mathcal{SAO}, T, R)$ with $\mathcal{S} = \mathbb{Q}$ or $\mathcal{S} = \mathbb{R}$, trivial action sets $\mathcal{A}_s = \{a\}$, $\mathcal{O}_s = \{o\}$ for all $s \in \mathcal{S}$, a trivial transition function $T(s, a, o, s') = \delta_{s,s'}$ for all $s, s' \in \mathcal{S}$, and $R(s, a, o) = s$. Then, for every $\mu \in \mathcal{S} \setminus \{0\}$ a µ-automorphism is defined by $h(s, a, o) = (s/\mu, a, o)$, because the complete structure (the trivial action spaces and transition function) is equal in every state and the reward is appropriately scaled: $R(s, a, o) = s = \mu \cdot R(h(s, a, o))$ as required by Equation 3.25.

Chapter 4

Supervised Learning (SL), Function

Approximation, Generalisation

Contents

4.1 Introduction
4.1.1 General Approximation Results
4.1.2 Generalisation
4.1.3 Function Approximation with Automated Basis Determination
4.2 Value Iteration with SL: Convergence Result
4.3 Combination of RL and SL: Practical Results from Literature
4.3.1 MDPs
4.3.2 2P-ZS-MGs

Wisdom is learning what to overlook.

William James (1842–1910)

(http://www.nonstopenglish.com)

Supervised learning (SL) is almost a synonym for function approximation. Sometimes the

first one is considered to deal especially with noisy function evaluations and the second

one with exact evaluations. There are two basic reasons for applying supervised learning

(SL) techniques in RL: first, to simply compress an unmanageably large state space to

a reasonable size assuming that the (unknown) underlying essential state space is much

smaller than the standard one, and second to gain insight about a problem by discovering

this essential state space. For both reasons the construction of features, i. e. a special kind

of basis functions in the function approximation sense(1) which is often easy to interpret

and inspired by human intuition, is one of the most common techniques besides neural

network approaches. However, theoretically every set of basis functions such as radial

basis functions of any kind, polynomials, or splines can serve as function approximation

architecture.

(1) No orthogonality conditions are fulfilled and even linear independence is often not checked in the RL literature but fulfilled by trying to achieve the smallest set of features.

This chapter is structured as follows: Section 4.1 gives a short introduction to SL and highlights the importance of automatically detecting good basis functions. In Section 4.2 the key result on the provable convergence quality of value iteration in combination with SL methods is presented. Although the result improves a previous one in [19], it means that in practice it is very hard to guarantee convergence to an $\varepsilon$-approximation of the optimal value function since the approximation architecture has to provide very small errors in $\|\cdot\|_\infty$. Nevertheless, Section 4.3 provides an overview of practical results showing encouraging performance without any guarantee.

Numerical results of the author can be found in Section 5.2.6 in the context of other

numerical studies.

4.1 Introduction

Supervised learning deals with finding a function $\tilde{f}: X \to Y$ in some predefined function space $\mathcal{F}$ which approximates a given function $f: X \to Y$ in an optimal way with respect to some norm:

\[
\tilde{f} = \arg\min_{g \in \mathcal{F}} \|g - f\| . \tag{4.1}
\]

Common norms are $\|\cdot\|_2$ or $\|\cdot\|_\infty$, which measure the mean and the maximum deviation, respectively. The more compact formulation

\[
\tilde{f} = \tilde{A} f \tag{4.2}
\]

may be advantageous whenever $\mathcal{F}$ is unambiguously defined. Generally, two types of errors occur: the approximation error due to the approximation architecture because $f \notin \mathcal{F}$, and the estimation error or sample error due to the finite number of sample data for which the function is evaluated [37, 157]. Sometimes, SL is understood to be function approximation with noisy data, i. e. the true function values $f(x)$ are modified by some additional noise, but throughout this thesis it is assumed to approximate noise-free function data.
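The influence of the norm in Equation 4.1 can be seen in the simplest possible setting. The following Python sketch is an illustration under assumptions not made in the thesis: the function space $\mathcal{F}$ consists of constant functions only and $f$ is given by four hypothetical sample values. In this setting the squared-error minimiser is the mean and the maximum-norm minimiser is the midrange of the samples.

```python
def sup_error(values, c):
    """Maximum deviation of the constant c from the sample values."""
    return max(abs(v - c) for v in values)


def squared_error(values, c):
    """Sum of squared deviations of the constant c from the sample values."""
    return sum((v - c) ** 2 for v in values)


SAMPLES = [0.0, 0.0, 0.0, 4.0]  # hypothetical noise-free function values f(x_i)

best_l2 = sum(SAMPLES) / len(SAMPLES)          # mean minimises the squared error
best_sup = (min(SAMPLES) + max(SAMPLES)) / 2   # midrange minimises the maximum deviation
```

The two optimal constants differ, so an approximation that is optimal in one norm is in general not optimal in the other.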

Since the contraction property of the Bellman operators (contraction rate $\gamma < 1$) is valid for $\|\cdot\|_\infty$, this norm should also be used for function approximation ([75, 78], approximation scheme based on [196]).(2) However, many SL methods find approximations in $L_p$-norms; hence, it is valuable to give approximation bounds for approximate value iteration with these norms that are better than the $\sqrt[p]{|\mathcal{S}|}$ factor of the standard estimate for norm equivalence [151].

A huge number of function approximation methods exists which include but are not limited

to (see [88]) neural network methods [19, 62, 122, 156, 176, 4, 82] (3), fuzzy logic [16, 105],

cerebellar model articulation controller (CMAC) [2, 3, 83, 198, 201], classifier systems

[57], and local memory-based methods [51, 145, 146]. By racing, different approximation architectures can be tested and their parameters optimised [59, 131, 132]. The main idea

is that most of the possible choices of parameters are discarded by a few tests and only

the remaining ones are tested more intensively. A comparison of different approximation

architectures in the sense of good fitting and a simple model (no overfitting) is studied in

[121].

(2) An alternative would be to use weighted $\|\cdot\|_2$ norms [149] but the weights have to be determined first.

(3) Pesch states that some standard neural networks can lead to ill-conditioned optimisation problems [165] and gives further references.

A subtle remaining question is which function(s) to approximate: one of the value functions ($V$ or $Q$), or directly the policy $\pi$, see also Section 4.1.2. The state value function has a smaller domain than the state-action value function and may therefore be approximated with fewer parameters, but this comes at the cost of having to calculate the Q-values in each step. In 2P-ZS-MGs (not in MDPs) both approaches suffer from the fact that the value of a potentially large matrix game has to be determined for every evaluation of the policy. Approximating the policy is the only method which avoids matrix game calculations for an acting agent but is not suited for the process of determining the optimal policy because the evaluation of the value of a state(-action) would be very expensive (policy evaluation involves the complete state space). Even policy search methods [178] need to estimate the quality of several policies and therefore have to estimate value functions or rely on heuristics. Finally, it is also possible to approximate several or all three functions, which can also be useful for mutual error tests.(4)

4.1.1 General Approximation Results

A qualitative result often referred to as the curse of dimensionality and blessing of smoothness is that, with a degree of smoothness $s$ and input dimension $\dim(x) = d$, the function $f(x)$ can only be expected to be approximated with an error of $O(n^{-s/d})$, where $n$ is the number of parameters of the function approximation scheme [157]. [82] states that the curse of dimensionality can not be broken by neural networks with multi-layer perceptrons, radial basis functions or similar nonlinear techniques. This means that neural network approaches can only be successful if a problem is simpler at the core.
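To get a feeling for the rate $O(n^{-s/d})$, one can solve it for the number of parameters. The following Python sketch is a rough illustration only; it assumes that the constant hidden in the $O(\cdot)$ notation equals one, which the cited result does not claim, and the function name is hypothetical.

```python
def parameters_needed(error, smoothness, dimension):
    """Solve error = n**(-s/d) for n, taking the hidden constant to be 1."""
    return error ** (-dimension / smoothness)


# Target error 0.1 for a function of smoothness s = 2:
n_low_dim = parameters_needed(0.1, smoothness=2, dimension=2)
n_high_dim = parameters_needed(0.1, smoothness=2, dimension=8)
```

Under this simplification, raising the dimension from 2 to 8 at fixed smoothness increases the required number of parameters by three orders of magnitude.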

4.1.2 Generalisation

Generalisation is one reason why supervised learning is employed. It means that the experience of a learner is transferred from the current situation to comparable ones. Generalisation is motivated by two facts: firstly, for large finite (or continuous) state spaces it is not possible to enumerate all states and, secondly, similar states often have similar values and optimal actions. Thus, generalisation is related to the two basic reasons for applying SL: state space reduction and a gain in insight.

Sometimes, generalisation and function approximation are separated artificially; however, generalisation can always be interpreted in terms of function approximation. Some generalisation methods are used to determine a function $\tilde{X}: X \to \tilde{X}$ such that the domain of $\tilde{f}$ is no longer $X$ but instead $\tilde{X}(X) \subseteq \tilde{X}$, and are called feature extraction methods [19]. These can also be local [111] or action dependent [197] for Q-value functions. The hope is that if $g$ is complicated enough, $\mathcal{F}$ can be chosen in a simpler way because the approximation is performed with $\tilde{f} \in \mathcal{F} \circ \tilde{X}$.

Other types of generalisation methods do not make use of $f(x)$ but only of the distribution of inputs $x_i$ of some input pairs $(x_i, f(x_i))$. These are collectively called unsupervised learning methods and can be interpreted as an input data preprocessing. Clustering approaches belong to this class. However, clustering approaches can also be applied to the $f(x_i)$ to extract all states with a similar value as a feature and hence become SL methods.

(4) [78], p. 413, shows how to perform value function updates computationally more efficiently if the transition function $T$ is known by separately storing expectation values for each function of a linear approximation architecture.

Generalisation in RL. As alluded to in the introduction, generalisation in RL can be divided into different approaches depending on the domain [88]. Generalisation over states such as feature extraction reflects the question of structural credit assignment: Which characteristics influence the optimal value function most? These methods basically reduce the description of the state space $\mathcal{S}$ by forming equivalence classes of similar states. Forming strict equivalence classes by exact model reduction is more restrictive but yields exact results (Chapter 3). Another strategy is to directly approximate the value function during value iteration [23, 206] or in Q-learning [108, 212]. In general, function approximation can affect the convergence of the iterative algorithms [23, 207] if one can not show that the approximation scheme is less expansive (in max norm) than the Bellman operator is contractive [72, 97]. Residual algorithms are one approach to address this problem [7] while some standard approaches like neural nets and linear regression are not theoretically justified even if working well in practice [72]. Instead, non-expansive approximation can be performed by averagers but these lack adaptivity [72]. Furthermore, adaptive resolution models, e. g. variable resolution dynamic programming (kd-tree [142]), the PartiGame algorithm [143], and decision trees [31, 134] can be seen as generalisation approaches.

As an alternative to states, state-action pairs could be subject to generalisation. For approximating Q-values, two approaches exist: the first is to use one approximation architecture for each action (if there are not too many different actions), the second is to employ only one architecture for all state-action pairs. The last method is the only possible approach for continuous action spaces. Training of continuous actions can be done by local gradient ascent methods [8], by modifying variance and mean of normal distributions of the actions to execute [80], or by application of Kohonen maps (self-organising maps [93]) which adaptively choose a discrete set of actions during learning [193](5).

4.1.3 Function Approximation with Automated Basis Determination

Choosing the approximating function space $\mathcal{F}$, i. e. basis functions in the case of linear function approximation, is an important issue.(6) There are three qualitatively different methods to choose the functions: first, one can directly choose finitely many functions $f_1, \ldots, f_n$; second, one can choose parametrised families of functions $f_{1,\alpha_1}, \ldots, f_{n,\alpha_n}$ and choose finitely many functions (perhaps with some optimisation over the parameters) which are most suited to approximate the given function; and third, one can try to construct the functions in a completely unparametrised way.

Obviously, the challenges increase from the first to the third method. Some work in the spirit of the last method is by Mahadevan [126, 128]. In his approach, so-called proto value functions serve as basis functions and are constructed only from state space topology information.(7) The second of the three methods is similar to feature detection and contains essential degrees of choice (parameters) for the approximation algorithm. Hence, this is also considered to be automated basis determination. Besides the predefined functions and free parameters, the algorithms of the second type need a selection criterion which is in the best case an optimisation criterion. The two aims of this selection are the general ones of function approximation: to fit the function well and to avoid overfitting. Cross validation errors, e. g. leave-one-out tests, or more generally the distinction of a training set and an evaluation set of points, are an important method to reduce overfitting [146]. The approach of [59, 131, 132] is algorithmically interesting because it adapts the evaluation effort to the possible success of single parameter values: the evaluation criterion is computationally more intensive for promising candidates and less for weaker candidates.

(5) The only problem is that the construction of appropriate Kohonen maps seems to be mathematically ill-posed.

(6) Linear function approximation means in this context the approximation by a linear sum of specified non-linear functions and not the approximation by piecewise linear functions.

(7) The question arises whether a simple grid soccer topology possesses enough structure to take advantage of it.

4.2 Value Iteration with SL: Convergence Result

[23] provides a phenomenological classification of types of convergence which sounds amusing from a theoretical point of view but reflects the possibilities of “solutions” practitioners have obtained: good convergence (all iterates $V_k$ are represented well, convergence to the exact optimal value function), lucky convergence (not all iterates $V_k$ are represented well, convergence to the exact optimal value function), bad convergence (convergence to a non-optimal value function), and divergence. From a mathematical point of view only the “good convergence” is acceptable and hence a criterion is developed in the following under which this good convergence will occur.

For the sake of completeness the definition of classical value iteration (Definition 2.35) is

repeated:

4.1 Definition (Value Iteration (2P-ZS-MG))

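The following algorithm is called value iteration: select $\varepsilon > 0$, choose an arbitrary initial guess $V_0 \in \mathbb{R}^{|\mathcal{S}|}$ for the (state-)value function, and determine iteratively $V_k = \mathcal{B}_{MG} V_{k-1}$ for $k = 1, 2, \ldots$ until $\|V_{k+1} - V_k\|_\infty \le \frac{\varepsilon}{2} \cdot \frac{1-\gamma}{\gamma}$.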

This algorithm would converge to $V^*$, and provide an $\frac{\varepsilon}{2}$-approximation for the value function estimate and an $\varepsilon$-optimal stationary policy (as for MDPs), if one assumes that $\mathcal{B}_{MG}$ can be exactly calculated. In Section 2.4.2 the following results are anticipated and applied to the scheme of numerical value iteration, i. e. roughly speaking a numerical error of solving DPs is interpreted as an SL technique. This is possible because for the analysis of the algorithm it does not matter whether the approximation error is introduced by an SL method or by a different numerical technique.
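The stopping threshold of Definition 4.1 can be motivated by a standard contraction argument, sketched here under the assumption that $\mathcal{B}_{MG}$ is evaluated exactly and is a contraction with rate $\gamma$ in $\|\cdot\|_\infty$ with fixed point $V^*$. Since $V_{k+2} = \mathcal{B}_{MG} V_{k+1}$ and $V^* = \mathcal{B}_{MG} V^*$,

\[
\|V_{k+1} - V^*\|_\infty \le \|V_{k+1} - V_{k+2}\|_\infty + \|V_{k+2} - V^*\|_\infty \le \gamma \|V_k - V_{k+1}\|_\infty + \gamma \|V_{k+1} - V^*\|_\infty ,
\]

so that, once the stopping criterion is met,

\[
\|V_{k+1} - V^*\|_\infty \le \frac{\gamma}{1-\gamma} \|V_{k+1} - V_k\|_\infty \le \frac{\gamma}{1-\gamma} \cdot \frac{\varepsilon}{2} \cdot \frac{1-\gamma}{\gamma} = \frac{\varepsilon}{2} .
\]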

For an error analysis similar to [19] (compare Remark 4.4), which does not seem to be widely available for 2P-ZS-MGs in the literature(8), the combination of the SL method with the operator $\mathcal{B}_{MG}$ is interpreted as a numerical version of the operator and denoted by $\tilde{\mathcal{B}}_{MG}$. For an operator $\tilde{A}$ describing the application of an approximation architecture, $\tilde{\mathcal{B}}_{MG} = \tilde{A} \mathcal{B}_{MG}$ because a value function – here stemming from the previously applied SL technique – is plugged into the Bellman operator and the result is again projected by the approximation architecture. In practice, the operator $\tilde{A}$ can not get the complete value function $\mathcal{B}_{MG} V_k$ as input, meaning that the approximation should be even worse than Lemma 4.2 suggests. However, for the readability of the theoretical result these technical subtleties are omitted.

The assumption of a (maximal) error of $\varepsilon_1$, i. e. $\|\tilde{B}_{MG} V - B_{MG} V\|_\infty \le \varepsilon_1 \|V\|_\infty$ for all $V \in \mathbb{R}^{|\mathcal{S}|}$, leads to the following lemma:

4.2 Lemma (Error of Numerical Value Iteration (2P-ZS-MG))

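Let $B_{MG}$ be the Bellman operator and $\tilde{B}_{MG}$ a numerical realisation (e. g. by combination with an SL technique) with $\|\tilde{B}_{MG} V - B_{MG} V\|_\infty \le \varepsilon_1 \|V\|_\infty$.(9) Let further be $\tilde{V}_0 = V_0$, $V_k = (B_{MG})^k V_0$, and $\tilde{V}_k = (\tilde{B}_{MG})^k \tilde{V}_0$ the corresponding $k$-th value iterates. Then

\[ \|\tilde{V}_k - V_k\|_\infty \le \underbrace{\varepsilon_1 \cdot \sum_{i=0}^{k-1} \gamma^{k-i-1} \|\tilde{V}_i\|_\infty}_{= E_V(k)} \le \frac{\varepsilon_1}{1-\gamma} \max_{i=0,\ldots,k-1} \|\tilde{V}_i\|_\infty. \tag{4.3} \]

(8) An exception which generalises Bertsekas' results to 2P-ZS-MGs is an extended draft version of [181].

(9) $\|\tilde{B}_{MG} V - B_{MG} V\|_\infty \le \varepsilon_1 \|V\|_\infty$ for all $V$ means that the operator $B_1 := \frac{1}{\varepsilon_1}(\tilde{B}_{MG} - B_{MG})$ has operator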

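Proof: The case $k = 1$ simply reads $\|\tilde{V}_1 - V_1\|_\infty = \|\tilde{B}_{MG} \tilde{V}_0 - B_{MG} V_0\|_\infty \le \varepsilon_1 \|\tilde{V}_0\|_\infty$ (recall $\tilde{V}_0 = V_0$). It is now assumed that the lemma is true for the value iterates of index $k$. Then

\begin{align*}
\|\tilde{V}_{k+1} - V_{k+1}\|_\infty &= \|\tilde{B}_{MG} \tilde{V}_k - B_{MG} V_k\|_\infty \\
&\le \|\tilde{B}_{MG} \tilde{V}_k - B_{MG} \tilde{V}_k\|_\infty + \|B_{MG} \tilde{V}_k - B_{MG} V_k\|_\infty \\
&\overset{(*)}{\le} \varepsilon_1 \cdot \|\tilde{V}_k\|_\infty + \gamma \cdot \|\tilde{V}_k - V_k\|_\infty \\
&\le \varepsilon_1 \cdot \|\tilde{V}_k\|_\infty + \gamma \cdot \varepsilon_1 \sum_{i=0}^{k-1} \gamma^{k-i-1} \|\tilde{V}_i\|_\infty \\
&= \varepsilon_1 \sum_{i=0}^{(k+1)-1} \gamma^{(k+1)-i-1} \|\tilde{V}_i\|_\infty.
\end{align*}

The second line is the triangle inequality, $(*)$ is valid because $B_{MG}$ is a contraction with rate $\gamma$, and the line below uses the induction hypothesis. The second estimate of the lemma is by means of the infinite geometrical series. □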

Lemma 4.2 provides a bound independent from the value iterates, e. g. if the sequence $(\tilde{V}_k)_k$ is monotonically decreasing and $\tilde{V}_k \ge 0$. If $\tilde{B}_{MG} = B_{MG}$, monotonicity is guaranteed as soon as $V_0 \ge B_{MG} V_0$ (analogously to the MDP case, which e. g. is true if $V_0(s) = \frac{1}{1-\gamma} \max_{(s,a,o) \in \mathcal{SAO}} R(s,a,o)$ for all $s \in \mathcal{S}$ because then the Q-value iterate $Q_1(s,a,o) = R(s,a,o) + \frac{\gamma}{1-\gamma} \cdot \max_{(s,a,o)} R(s,a,o) \le \frac{1}{1-\gamma} \cdot \max_{(s,a,o)} R(s,a,o)$ and Proposition 2.25, 2.) completes the argument).

A reference for the monotonicity result in 2P-ZS-MGs seems to be lacking and therefore the proof is sketched here: The first part is to show that $B_{MG}$ preserves $\ge$ (in the vector sense). To see this, assume a vector representing a value function estimate that is greater than or equal to a second one in all components. This implies that the corresponding Q-value functions (Equation 2.49) have the same greater-or-equal property, and the result follows from the monotonicity property of matrix games (Proposition 2.25). The second part is completely analogous to MDPs [168]: Noticing that the monotonicity holds with $B_{MG}$ also for $B_{MG}^i$ for all $i \in \mathbb{N}$ leads directly to $V_{k+1} = B_{MG}^k (B_{MG} V_0) \le B_{MG}^k (V_0) = V_k$.

However, for use in algorithms the tighter bound of the above lemma should be preferred because it can be computed iteratively (as suggested by the inductive proof: starting with $\varepsilon_1 \|\tilde{V}_0\|_\infty$ and assuming that the $k$-th step is already performed, the sum has to be multiplied by $\gamma$ and then $\varepsilon_1 \|\tilde{V}_k\|_\infty$ has to be added for obtaining the result in iteration $k+1$).

norm equal to or less than 1: $\|B_1\| = \sup_{V \ne 0} \frac{\|B_1 V\|_\infty}{\|V\|_\infty} \le 1$. Furthermore, the above definition implies $\tilde{B}_{MG} = B_{MG} + \varepsilon_1 B_1$.

4.3 Corollary (New Stopping Criterion for Numerical Value Iteration)

If the stopping criterion is changed to $\|\tilde{V}_{k+1} - \tilde{V}_k\|_\infty \le c(\tilde{V}_0, \ldots, \tilde{V}_k)$ with

\[ c(\tilde{V}_0, \ldots, \tilde{V}_k) = \left(\frac{\varepsilon}{2} - E_V(k+1)\right) \cdot \frac{1-\gamma}{\gamma} - \bigl(E_V(k+1) + E_V(k)\bigr) \tag{4.4} \]

where the errors $E_V(k)$ depend on the numerical value iterates up to index $k-1$ (Equation 4.3), and if the numerical approximation $\tilde{B}_{MG}$ is a contraction of rate $\gamma$ (like $B_{MG}$)(10), then the numerical approximation of value iteration also yields results comparable to the original value iteration.

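Proof: Utilising the notation of Lemma 4.2, the error of numerical and theoretical value iterates can be related by

\begin{align*}
\|V_{k+1} - V_k\|_\infty &\le \|V_{k+1} - \tilde{V}_{k+1}\|_\infty + \|\tilde{V}_{k+1} - \tilde{V}_k\|_\infty + \|\tilde{V}_k - V_k\|_\infty \\
&\le \|\tilde{V}_{k+1} - \tilde{V}_k\|_\infty + E_V(k+1) + E_V(k) \\
&\le \left(\frac{\varepsilon}{2} - E_V(k+1)\right) \cdot \frac{1-\gamma}{\gamma}.
\end{align*}

According to classical value iteration $\|V_{k+1} - V^*\|_\infty \le \frac{\varepsilon}{2} - E_V(k+1)$, hence

\[ \|\tilde{V}_{k+1} - V^*\|_\infty \le \|\tilde{V}_{k+1} - V_{k+1}\|_\infty + \|V_{k+1} - V^*\|_\infty \le \frac{\varepsilon}{2}. \]

In proving the quality of the value function approximation it was not necessary to use that $\tilde{B}_{MG}$ is a contraction of rate $\gamma$. However, for adopting the proof in [168] that the policy $\pi_{k+1}$ induced by $\tilde{V}_{k+1}$ is $\varepsilon$-optimal it is additionally needed that $\tilde{B}_{MG}$ as well as the (linear) Bellman operator with fixed policy $\pi_{k+1}$ are contractions with rate $\gamma$. □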
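How the threshold of Equation 4.4 could be evaluated inside a numerical value iteration loop is sketched below in Python. The function and argument names are assumptions made for illustration; `e_k` and `e_k1` stand for $E_V(k)$ and $E_V(k+1)$ from Equation 4.3.

```python
def stopping_threshold(eps, gamma, e_k, e_k1):
    """Threshold c of Equation 4.4 from eps, gamma, E_V(k) and E_V(k+1)."""
    return (0.5 * eps - e_k1) * (1.0 - gamma) / gamma - (e_k1 + e_k)


def should_stop(diff_norm, eps, gamma, e_k, e_k1):
    """True if ||V~_{k+1} - V~_k||_inf <= c(V~_0, ..., V~_k)."""
    return diff_norm <= stopping_threshold(eps, gamma, e_k, e_k1)
```

For vanishing numerical error the threshold reduces to the classical one of Definition 4.1. Since a norm is never negative, a negative threshold can never be met, which is the situation in which the accumulated numerical error is too large compared with $\varepsilon$.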

Interpretation. The considerations above indicate that care should be taken (nothing can be guaranteed by Corollary 4.3) whenever the above defined numerical error $\varepsilon_1$ is of the same magnitude as the error $\varepsilon$ of value iterates. If $\varepsilon_1 \ll \varepsilon$, $\gamma$ is not too close to 1, and $B_1 := \frac{1}{\varepsilon_1}(\tilde{B}_{MG} - B_{MG})$ fulfills a Lipschitz condition, then the solution of 2P-ZS-MGs can be performed nearly as well numerically as theoretically.

4.4 Remark (Comparison to Results of [19])

For the results in [19] about approximation quality of function approximation it is assumed that $\|\tilde{B}_{MG} V - B_{MG} V\|_\infty \le \varepsilon_1$, being independent from $\|V\|_\infty$. Analogously to Lemma 4.2, this leads to a (simpler) error bound of

\[ \|\tilde{V}_k - V_k\|_\infty \le \varepsilon_1 \cdot \sum_{i=0}^{k-1} \gamma^i \le \frac{\varepsilon_1}{1-\gamma}. \tag{4.5} \]

This error estimation could be utilised to define a different $E_V(k)$ and to obtain a corresponding stopping criterion (Corollary 4.3).

4.3 Combination of RL and SL: Practical Results from Literature

Although Section 4.2 makes it clear that convergence of RL in combination with SL meth-

ods is hard to guarantee, some encouraging examples are to be presented in the following.

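(10) If there exists some $n \in \mathbb{N}$ such that $\|B_1 V - B_1 W\|_\infty \le n\gamma \|V - W\|_\infty$ holds for all $V, W$, then $\tilde{B}_{MG}$ is a contraction with rate $\gamma_1 \le \gamma(1 + n\varepsilon_1)$, if $\gamma_1 < 1$, and therefore the weaker statement with discount factor $\gamma_1$ holds.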

4.3.1 MDPs

In Section 4.1 the following incomplete list of function approximation methods is given and

repeated here: neural network methods [19, 62, 122, 156, 176, 4, 82], fuzzy logic [16, 105],

cerebellar model articulation controller (CMAC) [2, 3, 83, 198, 201], classifier systems [57],

and local memory-based methods [51, 145, 146]. In addition to these references to special

function approximation architectures, most of which already pointed to combinations with

RL methods, the following reference is to be added: [177] combines neural networks with some advanced Q-learning algorithms (e. g. Q(λ) [164]).

Space continuous and Time Continuous Systems. Although it is in principle

possible to apply function approximation methods to systems with continuous time (evo-

lution by a differential equation) this approach suffers from the lack of a unique solution

to the Hamilton-Jacobi-Bellman equation [152] if the concept of viscosity solution is not

considered (analogous to differential games, Section 2.5). [35, 56] attack instances of space

continuous and continuous time MDPs by means of the Hamilton-Jacobi-Bellman equation

and function approximation. An exceptional example in which the concept of viscosity so-

lutions is applied to RL algorithms for solving MDPs is [148]. A different approach in

the time-continuous setting is a policy search, i. e. a search by gradients over a priori

parametrised policies [150].

For a multi-agent system with many agents, in addition to function approximation, a decomposition of the value function into a sum of functions that each depend on fewer agents (factored representation) can be helpful ([94, 76, 74, 162], or in combination with a suitable communication scheme: [79]). However, care has to be taken because even a factored reward and transition function does not imply a factored value function. The reason is that in each value iteration step the dependencies on other variables grow until at some point typically every variable is important for every state.

4.3.2 2P-ZS-MGs

Applications of function approximation in differential games include neural networks for

driver assistance [96] and memory-based and kd-tree based methods for two-player pursuit

evasion games [187]. Much of the work about the combination of discrete 2P-ZS-MGs and

function approximation is done by Lagoudakis and concentrated on least squares policy

iteration (LSPI): General results with an application to Littman’s 1v1 grid soccer [97] and

the utilisation of factored approaches for multiple agents [98] which are related to model

reduction only with respect to actions(11), and an overview with a diversity of examples

[99] are to be mentioned. [98] is quite close to the need of multi-agent robot soccer but

due to the exponential dependence of the state space size on the number of agents the

approach is not directly applicable because it only reduces the exponential dependence of

the joint action space.

(11)The meaning of some formulae can be better understood by additionally considering [77].

Chapter 5

Robot Soccer and Other Applications

Contents

5.1 Modeling Robot Soccer
    5.1.1 General Issues of Modeling Robot Soccer
    5.1.2 A Simple Multi-Player Robot Soccer Model
    5.1.3 Symmetry
5.2 Numerical Results of Grid Soccer
    5.2.1 Preliminaries for the Following Subsections
    5.2.2 Reasoning for 2P-ZS-MG Modelling: Comparison of MDP and 2P-ZS-MG Strategies
    5.2.3 Relating Policies to Humanoid Soccer Characteristics
    5.2.4 Comparison of DP and RL Techniques
    5.2.5 Comparison of Different DP Techniques with Various Parameters
    5.2.6 Comparison of Standard Methods and SL Techniques
    5.2.7 Towards Multi-Player Robot Soccer: 2v2 Grid Soccer
5.3 A New Algorithm: MaG-Clus-VI
5.4 From Grid Soccer to Robot Soccer: Practical Issues
    5.4.1 Lower Level Behaviours
    5.4.2 Image Processing and Localisation
5.5 Other Applications

The complexity arises

when the principles are reduced to practice.

Robert F. Stengel

(Preface to [195])

The model of multi-player grid soccer is introduced and all important numerical results

with respect to this model are presented in this chapter. The numerical results are broadly

interpreted: different models and solution approaches such as MDP and 2P-ZS-MG models

with or without symmetry reduction as well as DP and RL techniques and the impact

of their parameters are numerically analysed. Furthermore, the resulting strategies are

interpreted in a more soccer oriented fashion by goal rates per time and per team. Before

giving a more detailed overview of the contents with references to single sections some

general comments on literature are to be made: A huge body of literature exists on the

special application of robot soccer, e. g. [69, 178, 197, 198] and references therein. For practical reasons, most of the work uses an MDP framework instead of a 2P-ZS-MG one, which is at least questioned by the results presented in Section 5.2. The need for

hierarchical structures for soccer with 11 versus 11 agents and the employment of heuristic

ideas can often be observed [197, 215] as well as a simplification of robot soccer to keep

away soccer [198, 215].

This chapter has the following structure: Section 5.1 summarises general thoughts of mod-

eling the game of robot soccer and provides details of the special model “multi-player grid

soccer” which serves as a basis for all following computations. The model is based on [112]

but goes far beyond it because the dynamics for multiple agents have to be significantly

changed. The numerical results are shown in Section 5.2 which possesses a rich substruc-

ture. Section 5.2.2 provides strong arguments for the choice of 2P-ZS-MG instead of MDP

models in robot soccer, in Section 5.2.4 DP and RL techniques are compared with the re-

sult that exact and model exploiting DP techniques should be preferred whenever possible.

These first evaluative numerical results appear natural but are surprisingly not standard

in literature. DP methods are more intensively studied in Section 5.2.5: the dependency of

convergence speed on different types of methods and parameters is numerically analysed.

This includes comparisons of symmetry reduced models with their unreduced counterparts

which is the practical application of the theoretical results in Chapter 3. A highlight is

the “max-min convergence boosting phenomenon” which only seems to be present in 2P-

ZS-MGs but not in MDPs and reveals unexpected “spontaneous” large error reductions

during value iteration. Results with supervised learning (Section 5.2.6), a new DP algo-

rithm which exploits almost invariant sets of the transition dynamics (Section 5.3), and

general technical issues of applying strategies to real robots (Section 5.4), e. g. to the AIBO

ERS-7, follow.

5.1 Modeling Robot Soccer

Robot soccer can be modeled for different purposes: the spectrum of possibilities ranges

from a “true” continuous physical model including kinematic movements of every joint

and object to a very coarse discrete model in which an elementary action already includes

movement and ball-handling skills. A second distinguishing criterion is the mathematical

range of models which is intertwined with the choice above: the model may be discrete or

continuous (even with different physical degrees of modeling both options stay possible).

Furthermore, the model may or may not include the policies of the opponent, and in the

first case again with different degrees of freedom, e. g. the set of assumed opponent policies

can be arbitrarily restricted.

5.1.1 General Issues of Modeling Robot Soccer

In this subsection some general thoughts which have an impact on every modeling of robot soccer shall be collected. It seems advantageous not to apply the RL methods directly to the robots with an initial value function $V_0 = 0$ and simply let the robots learn during play;(1) instead, some simplifications can be made even if the resulting model differs from reality.

(1)Even if the robot could detect its true state which is often difficult, the number of training steps can

be quite large which leads to a vast amount of training time.

The obtained optimal policy of the approximate model can be understood as a very good

initial guess of $V_0$ for an RL method applied to the real world problem.

This accounts for a general trade-off between model complexity and applicability of a

model. On the one hand, the more precise the model is the better is the approximation for

the “first guess” of the value function which can be improved stepwise by RL methods (a

safe RL method in 2P-ZS-MGs is the WoLF (win or learn fast) method of [21]) to adapt

to a special situation or opponent policy. On the other hand, the more unspecific a model

is the more widely applicable it is, e. g. for different hardware realisation of the robots.

Continuous and Discrete Models, Stochasticity. Apart from the technical difficulties of continuous game theoretic models (see Section 2.5 for an overview), these are typically considered to be deterministic. Clearly, introducing stochasticity to the continuous models

would not simplify numerical computations. Thus, one can consider the choice to be only

between a discrete stochastic and a continuous deterministic model.

There are several reasons to prefer a discrete stochastic model for robot soccer. Firstly,

(robot) soccer is a game with stochastic events not only because of unknown parameters

such as the exact physical properties of the subsurface (grass or carpet) and unknown

parameters in the robots’ actuators and sensors but also because the movement of the

ball kicked or hit by a robot is highly unpredictable. Secondly, the stochasticity of the

model is also preferred because the decisions of the robots have to be stochastic. If the

same situation always leads to the same deterministic reaction then the opponent team can

exploit the observed behaviour in the next similar situation: e. g. if the position of a robot

is slightly left to its opponent with respect to an appropriate axis this could lead to always

going left around the opponent. Thirdly, the discretisation of the state and action space

automatically leads to some abstraction away from a hardware specific model whereas a

hardware specific model would be more appropriate for a differential game (the kinematics

may be dependent on the robot type).

Symmetries. As mentioned in Chapter 3, respecting the inherent symmetry of a prob-

lem should be an issue. For example, in robot soccer the symmetries depicted in Figure 3.1

should be respected by any robot soccer model which assumes equal robots. One of the

symmetries simply means to change left and right (from the point of view of the robots

heading to a goal). This symmetry is only maintained if the robots possess the same sym-

metry and the abilities of the robots are symmetric. For human soccer, such a symmetry

is clearly not assumable because it is standard to consider human players in the categories

of preferring left, right or both. Only the last category of players fulfills the needs of that

symmetry.

A second symmetry is due to the exchangeability of the two teams which is somewhat

questionable. As remarked above the abilities of two teams can differ even if the robots do

not. However, if one assumes that the two teams are about equally strong the assumption

of exchangeability seems to be sensible if the model is not too detailed. This criterion is

met e. g. by the model described in Section 5.1.2.

A third symmetry which is not depicted in Figure 3.1 is the permutation of robots within

the same team. It relies on the equality of the robots and their abilities which is easily

achievable for equal robots by using equal software. In terms of human soccer this means

that every human player e. g. needs to have the strategical knowledge for every position

(defense, center, attack) which is typically not valid.

Fair Kick-off Positions. A fair kick-off position after every goal is a distinguishing fea-

ture from standard human and standard robot soccer. Thus, it should be briefly discussed

whether or not such a feature is to be employed in a model. First, in human and robot

soccer the rules determine that the team against which a goal was scored gets the ball in

the center of the soccer field (kick-off position). Only at the very beginning does one team

get the ball at random and the other team gets it at the beginning of the second half. From a game theoretic point of view even tossing coins and playing only one halftime is already fair (the probability distribution of start states $\xi \in PD(\mathcal{S})$ has a value of $\sum_{s \in \mathcal{S}} \xi(s) V(s) = 0$). However, after tossing the coins the kick-off position typically is considered to be advantageous for the ball-possessing team, which can also be verified for our later model (see Figure 5.3): e. g. for a 1v1 grid soccer the ball-possessing player state has a value of roughly 0.07 for a max-min optimal value function, i. e. even for a worst-case opponent there is a higher probability of winning for an optimally acting ball-possessing player.
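The fairness condition can be checked numerically in a few lines. The following Python sketch uses two hypothetical kick-off states; the value of +0.07 is the rough figure quoted above, while the value of -0.07 for the mirrored state is an assumption that holds only if the game is zero-sum and symmetric under exchanging the two teams.

```python
def expected_start_value(xi, value):
    """Sum over states of xi(s) * V(s) for a start distribution xi."""
    return sum(prob * value[state] for state, prob in xi.items())


# Hypothetical labels for the two kick-off states.
value = {"team_1_has_ball": 0.07, "team_2_has_ball": -0.07}
coin_toss = {"team_1_has_ball": 0.5, "team_2_has_ball": 0.5}   # before the toss
after_toss = {"team_1_has_ball": 1.0, "team_2_has_ball": 0.0}  # toss won by Team 1
```

Before the coin toss the expected value vanishes, whereas after the toss the ball-possessing team keeps its advantage.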

Nevertheless, if a kick-off position is determined after every scored goal by chance this

reduces the effect of a single random decision independently from the strength of each

team. The standard rules improve the chance to win for a weaker team (against which

probably more goals are scored), a fair kick-off position after every scored goal would

increase the chance for the better team to win because they could randomly get the ball

again even if they scored a goal directly before.

Additionally, the effect of a fair kick-off position after every goal can be interpreted as

reducing the game to the expectation of scoring only the first goal which is sometimes

used in soccer for avoiding a tie (golden goal rule). Time delaying strategies, e. g. keeping

the ball and scoring a goal in the last minute to win and to avoid that the opponent

scores a goal thereafter are neglected in such a model description. Instead, by means of

the discounting factor early scoring is encouraged. For robot soccer, the repeated random

criterion seems to be a sensible approach(2) even more because robots can – with the

current battery configurations – play a full-time powerful game without any recreation

phases which could be necessary for humans.

Concurrent Actions. Concurrent actions of the two teams seem very natural and essential in robot soccer. This is noted, however, because large parts of the game playing literature are concerned with alternating-turn games such as chess, checkers [179], or backgammon [205]. Some solution methods designed for the important class of deterministic alternating-turn games, e. g. game tree search methods [118], are not feasible for stochastic games.

Abilities of the Robots. An important point is that in the following the hardware

of all robots is considered to be equal. Of course, robots can have different abilities by

different software but at least it is assumed that every robot could have the same skills.

Consequences of these assumptions are that each robot is able to walk as fast as any

other robot, has the same skills of ball handling, and the same communicational and

computational performance features. For example, this condition is true for the AIBO

league(3) in which AIBO ERS-7 robots without modifications have to be used, or for

simulation leagues. However, in some leagues, especially if self-constructed robots are

used, these ideal conditions are not given but instead a range of allowed performance

features has to be assumed. Additionally, even for physically equal robots it may depend

on the software of a team to which level the utilised skills deviate from best possible ones.

(2)Even for human soccer it could be interesting to encourage early goals to make games increasingly

fascinating.

(3)The AIBO league is now called standard platform league because a new robot (NAO) has been intro-

duced to the league.

Ball Possession. In general, the ball need not be close to any robot during a game.

However, it can be considered sensible and typical that if no robot can control the ball

directly, some of the nearest robots should try to get the ball. Hence, it is possible

to model the ball so that it is always in the possession of one robot to reduce the number

of states in the soccer game. Furthermore, in a sensible soccer model there have to be

options for dribbling around an opponent and passing to a teammate.

Intra-Team Communication. It is theoretically important whether the team has a

communication structure or not. The reason is that a team with perfect communication

structure can be treated as a single (complex) agent making decisions while any imperfect

communication structure must be treated as if each robot acts as a single agent with

different information on the game. Fortunately, in robot soccer communication among a

team is allowed such that the case of perfect communication can be assumed.

5.1.2 A Simple Multi-Player Robot Soccer Model

In this subsection the general thoughts of Section 5.1.1 are combined with practical issues

to formulate a model of robot soccer which is abstract enough to be independent from

hardware but specific enough to yield a meaningful policy for robot soccer. A main source

of inspiration was [112]. Later on, this model is used for all numerical computations. Its key

aspects are: the model is discrete and stochastic and respects some standard symmetries,

all robots are equal, the ball is always in possession of one robot, and the kick-off positions

are fair after scoring a goal – not only for the total halftime.

Furthermore, the standard order of different parts of actions during a time step is first

dribbling, then moving, and finally passing. In real robot soccer, many basic abilities such

as walking towards a predefined location, handling the ball (dribbling and kicking), and

skills for the analysis of visual information (self and opponent localisation) are needed and

highly non-trivial but assumed to be already available.

Now, the detailed model follows: The model is a 2P-ZS-MG $M = (D, \mathcal{S}, \mathcal{SAO}, T, R)$ where the decision epoch $D = \mathbb{N}_0$ is scaled appropriately such that the time of performing one of the later described actions is approximately 1. In practice, actions can have different durations but the model is to be kept simple.

State Space. The state space $\mathcal{S} \subseteq \mathbb{N}^{2(n_a+n_o+1)}$ is discrete and consists of two dimensions for every robot in Team 1 and in the opponent team, the numbers of which are $n_a$ and $n_o$, respectively, and an extra pair of dimensions for the ball.(4) Several robots can share the same position (i. e. grid box, grid cell) on the field because otherwise blocking an opponent would be too easy for a multiple robot game. The ball must share the position with some robot. More specifically, the grid resolution in the two dimensions will often be of the type $(6 \times 4)$, i. e. the position of each robot $(x_i, y_i) \in \{1, \ldots, 6\} \times \{1, \ldots, 4\}$, see Figure 5.1. An abbreviation will be “3v2-game” which means that $n_a = 3$ and $n_o = 2$. In terms of 2P-ZS-MGs each team is considered to be one of the two players of the Markov game which implicitly means that perfect communication is assumed.
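To get a feeling for the size of such a state space, the following Python sketch counts grid soccer states under the two ball encodings mentioned in footnote (4). It is a rough illustration only: the function name is hypothetical, and the count ignores any goal or terminal states and any symmetry reduction, so it need not agree with the state numbers used in the later computations.

```python
from itertools import product


def count_states(width, height, n_a, n_o):
    """Count (robot positions, ball) combinations on a width x height grid.

    Returns a pair:
      - coordinate version: the ball occupies a cell shared with some robot,
      - possessor version: the ball is identified with one of the robots.
    """
    cells = [(x, y) for x in range(1, width + 1) for y in range(1, height + 1)]
    n_robots = n_a + n_o
    coordinate_version = 0
    for positions in product(cells, repeat=n_robots):
        # The ball may sit in any cell occupied by at least one robot.
        coordinate_version += len(set(positions))
    possessor_version = len(cells) ** n_robots * n_robots
    return coordinate_version, possessor_version
```

The number of position tuples grows like $24^{n_a+n_o}$ on a $(6 \times 4)$-grid, which illustrates the exponential dependence of the state space size on the number of robots.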

Action Spaces. The action spaces $\mathcal{A}(s)$ and $\mathcal{O}(s)$ are constructed by the same principles to enable the “player exchanging symmetries”. From the construction principles it becomes clear that all symmetries of Figure 3.1 are respected by $\mathcal{SAO}$. There are two different main

(4)The software package DRPOST maintains two different options for the ball: the coordinate version

as shown here or the number of the ball-possessing robot which is more compact if the ball coordinates

are limited to robot positions.


Figure 5.1: Discretisation of a soccer field: grid soccer. Dark grey indicates the defended

goal region of the first team, light grey the defended goal region of the opponent team.

types of motion which can be alternatively performed in each time step: a move and a

pass.

1.) Moves. In principle, for every robot of a team a move on the grid with squared

Euclidean distance less than or equal to 1 is allowed, i. e. it is possible to go one box

N(orth), E(ast), W(est), S(outh) or to stand (0). Actions to leave the soccer field and

diagonal moves are not allowed.(5) In contrast to real (robot) soccer, no robot can

leave the soccer field and no fouls occur. Fouls could simply be integrated e. g. by

assigning a foul probability which increases depending on how crowded a grid cell is.

2.) Passes. Passes have only to be added to the action set if the team size is larger than 1

and if the ball-possessing robot is the only robot in its grid box. For a larger number

of robots in the same box, even of the same team, it cannot be guaranteed that the robots do not (unintentionally) obstruct each other. Furthermore, a pass is to be distinguished from a kick in the sense that it addresses a team member while a kick could

also be an attempt to score a goal. However, kicks towards a goal are not modeled

because it is assumed that they will automatically occur if a robot enters its opponent

goal region.

Concerning a pass, there exist three parameters in the present model. The first pa-

rameter is called max_kick_distance which limits the maximum range of a kick (Fig-

ure 5.2). This parameter is the only one which influences the size of the action spaces

and is standardly set to max_kick_distance = 5 (squared Euclidean distance) for a

(6 ×4)-grid. The other two parameters only affect the transition function and are de-

scribed there. It is important to note that the position of the pass target is determined

after its intended move.

It can be discussed whether a large kick range makes sense. An observation in robot

soccer yields that for the AIBO league the robots are able to kick across the whole

soccer field. The range is restricted in the model because in reality the reliability of a

pass decreases with its distance.
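The construction of the move part of the action sets and the range check for pass targets can be sketched in Python as follows. The sketch is an illustration under assumptions: the function names are hypothetical, the orientation of the compass directions on the grid is chosen arbitrarily, and the additional pass conditions (team size larger than 1, ball possessor alone in its grid box) are not modeled.

```python
from itertools import product

# Assumed orientation: N increases y, E increases x. The labels follow the text.
MOVES = {"N": (0, 1), "E": (1, 0), "W": (-1, 0), "S": (0, -1), "0": (0, 0)}


def allowed_moves(pos, width=6, height=4):
    """Moves that keep a robot at `pos` inside the grid {1..width} x {1..height}."""
    x, y = pos
    return [name for name, (dx, dy) in MOVES.items()
            if 1 <= x + dx <= width and 1 <= y + dy <= height]


def joint_move_actions(team_positions, width=6, height=4):
    """All joint move actions of a team: one allowed move per robot."""
    return list(product(*(allowed_moves(p, width, height) for p in team_positions)))


def squared_distance(p, q):
    """Squared Euclidean distance between two grid cells."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def within_kick_range(passer_pos, target_pos_after_move, max_kick_distance=5):
    """Range check for a pass target, measured in squared Euclidean distance."""
    return squared_distance(passer_pos, target_pos_after_move) <= max_kick_distance
```

Because a joint action contains one move per robot, the number of joint move actions is a product over the robots of a team, i. e. up to $5^{n_a}$ for robots away from the border.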

Transition Function. The transition function has to be specified for every state-action

pair. Since invalid actions such as leaving the soccer field or passing to team members

being too far away are already filtered by the definition of the action spaces they need

not be considered here. Furthermore, the description of the rules again reveals that they

(5) The reason is not that the robots could not move diagonally but that a diagonal move takes more time and cannot be completed within one time unit. If desired, diagonal moves could be allowed, which would dramatically increase the action spaces with the number of robots.

[Figure 5.2, recoverable panel titles: (a) max_kick_distance = 0; (b) max_kick_distance = 1; (d) max_kick_distance = 5.]

Figure 5.2: Grid soccer: the parameter max_kick_distance rules the maximum distance

of the ball possessor to a team member being a pass target. Several values according

to squared Euclidean distance are illustrated. max_kick_distance = 5 is the standard

configuration of the present robot soccer model. As in Figure 3.1, the labels 1 and 2

indicate the team and the defended goal region, and the small black circle depicts the ball.

respect the symmetries of Figure 3.1 because there are no conditions which are different

for the two teams or for a single robot. The standard order of different parts of actions is

dribbling, moving, and passing. Dribbling is only reflected in the ball possession and is

not part of the action spaces. Dribbling and passing can not occur at the same time step:

a passing robot is not allowed to move at the same time step, and the pass target (or a

mistarget) gets the ball after moving.

1.) Moves. All movements of the robots are performed as intended but for the ball possession there are the following rules: If $n$ is the number of all robots (of both teams) occupying the cell with the ball then the probability is $\frac{1}{n}$ for each robot to be on the ball after its movement. In soccer terminology this means that the dribbling phase is

preceding the moving phase and that the dribbling phase is decided in every time step.

It is imaginable that the previous action as well as the current action of each robot

influences the probability of getting the ball but this would violate the Markov prop-

erty. More importantly, the ball possession is not modeled actively: the robot which

is on the ball in the last step has the same chance to keep it as every other robot in

the same grid cell has to steal it. The model could easily be changed to give the ball

possessor a different probability than a stealer but this would need a sound knowledge

of the abilities of the robots.

2.) Passes. The soccer dynamics for the passes are more complicated than the movement

dynamics because it seems to be essential that a pass can be intercepted by a robot

not aimed at by the passing one. The reason is that a pass can be considered as a

risk option: keeping the risk low and not passing moves the ball only a small amount

towards the goal while a pass can move the ball a larger amount with a larger risk of

a miskick (depending on the number of robots staying close to the pass target).

Concretely, three parameters exist in the model to influence the pass characteristics.

The parameter max_kick_distance has been already discussed for the action sets and

the remaining two, namely close_distance and kick_good_prob_mult, control the

risk of a pass. close_distance determines how close an intercepting robot (team

member or opponent) has to be such that it is a possible mistarget of the pass (the

passing robot can also be a mistarget if close enough to the target, i. e. that it gets the

ball again). It is assumed that the robot is not allowed to leave its grid cell towards

the ball but it can get the ball.(6) kick_good_prob_mult determines how many times

the probability is larger for performing the pass to the correct grid box. This means

that each robot in the targeted box has the same (higher) probability – not only the

targeted robot. The standard values for all computations (in 1v1 up to 3v3 soccer)

for a (6 × 4)-grid are close_distance = 2 in squared Euclidean distance, which is

illustrated in Figure 5.4, and kick_good_prob_mult = 3. Again, note that the target

position and closeness relations of the robots are determined by the positions after their

intended move of that time step.

From a soccer perspective it seems to be sensible that dribbling and moving, only

passing, or moving and receiving a pass are the three basic components of action. In

robot soccer it may also occur that teammates unintentionally steal the ball

because a crowded grid box can easily lead to a very uncontrolled outcome of dribbling

actions. If controllability skills for the ball increase it will be possible to change the

model to one in which the probability of keeping the ball is higher or in which one

team randomly gets the ball and the robots can decide which teammate gets it.

(6) In practice, it is assumed that the robot starts some intercepting behaviour if the ball comes close

enough to its location. However, the interception distance is much smaller than the distance of one grid

cell because the ball moves too fast for a long interception procedure.

3.) Kick-off . At the very beginning as well as after every scored goal a fair kick-off position

(see Section 5.1.1) is initialised. In particular, the two states in Figure 5.3 are each assigned

a probability of $\frac{1}{2}$. Since they are agent exchange symmetric (agent = team) they

are fair according to Theorem 3.20. For value iteration it is only important that the

stochastically weighted sum of values for these positions is zero. However, for RL

methods it can make a big difference in the speed of learning if an equal weighting of

the two positions of Figure 5.3 or an equal distribution over all states in $S$ is chosen.

For example, the latter can encourage exploring different regions of the state space

without any exploration strategy in the RL method.

[Figure 5.3: two panels, each showing the grid field with columns 1 to 6. (a) Kick-off state $s_1$. (b) Kick-off state $s_2$.]

Figure 5.3: The two kick-off states of multi-player grid soccer, here 3 versus 3 players (3v3

grid soccer). Dark grey indicates the robots and defended goal region of the first team,

light grey the robots and defended goal region of the opponent team. In contrast to other

figures before, the numbers do not indicate the team but the numbers of robots of that

team being in the same grid cell. The small grey circle represents the ball which is not

attached to a special robot in that cell.

In general, due to the nature of the stochastic transition function it is easy to model a

mis-action of any kind by giving a low probability to an action not intended by the player.

In this way, (small) sensor errors can be modeled. Since it is assumed that all robots

have the same degree of errors no team has an advantage over the other. Therefore, it is

assumed that neglecting (small) errors has little impact on the optimal policies.

5.1 Example (Robot Soccer, 5)

A special situation (Figure 5.4) should illustrate the pass mechanism described above. One

of the robots of Team 2 stops and plans a kick to a teammate indicated by a large arrow.

The kick is possible because the robot is not disturbed by a second robot in the same cell

and the distance criterion marked by the dark grey area (partly overlapped by the light grey

area) is satisfied. All robots are depicted after their planned movement of the time period

which is shown by the small arrows. The possible mistargets are all robots in the light

grey area, in this case only two robots of Team 1. Assuming that kick_good_prob_mult = 3,
this results in a probability of $\frac{3}{5}$ for the intended pass and a probability of $\frac{1}{5}$ for each
of the two mistargets. If the upper of the two robots of Team 1 moved to the target field
then that robot and the intended target would each have a probability of $\frac{3}{7}$ while the second
mistarget would have a probability of $\frac{1}{7}$ of receiving the pass.
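The arithmetic of this example can be reproduced with a short sketch. The function name pass_probabilities and its arguments are illustrative assumptions and not part of the model description; the sketch only encodes the weighting rule stated above, namely that every robot in the targeted grid box carries the weight kick_good_prob_mult and every other possible mistarget carries the weight 1.

```python
from fractions import Fraction

def pass_probabilities(n_in_target_box, n_other_close, kick_good_prob_mult=3):
    """Return (p per robot in the target box, p per other mistarget).

    Assumed weighting: kick_good_prob_mult for each robot in the targeted
    grid box, 1 for each other robot within close_distance.
    """
    total_weight = kick_good_prob_mult * n_in_target_box + n_other_close
    p_box = Fraction(kick_good_prob_mult, total_weight)
    p_other = Fraction(1, total_weight) if n_other_close > 0 else Fraction(0)
    return p_box, p_other

# Situation of Figure 5.4: one robot in the target box, two mistargets.
print(pass_probabilities(1, 2))  # (Fraction(3, 5), Fraction(1, 5))
# One robot of Team 1 moves into the target box: two robots there, one mistarget.
print(pass_probabilities(2, 1))  # (Fraction(3, 7), Fraction(1, 7))
```

Exact fractions are used so that the probabilities visibly sum to one in both situations.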

Reward Function: Neutral, Defensive, Aggressive Policy. Scoring a goal. The
reward for Team 1 for entering the opponent's goal region is 1 if the ball is also in that field
(no matter which robot took the ball to that field) and if additionally at least as many
offenders as defenders of the opponent team are in the same one of the two cells of the goal
region (Figure 5.1). It is assumed that otherwise the defenders successfully defend their
goal (region) and a performed kick towards the goal is blocked. The corresponding agent
exchange symmetric situation, in which Team 2 enters the goal region of Team 1 with the ball
and meets at most as many defenders of Team 1, is rewarded by −1 (scored goal against
Team 1).

Figure 5.4: Grid soccer: the parameter max_kick_distance = 5 governs the maximum
distance of the ball possessor to a team member being a pass target (dark grey boxes,
partly hidden by light grey boxes), and with close_distance = 2 the possible mistargets
are all other robots in the light grey boxes. The targeted grid box for the pass is indicated
by the big arrow. Small arrows indicate the performed moves (no arrow means the action
“stand”) during the time period. As in Figure 3.1, the labels 1 and 2 indicate the team and
the defended goal region, and the small black circle depicts the ball (see also Example 5.1).

All other cases. The reward function simply needs to be 0 if no goal is scored because no

team should gain an advantage by doing nothing. As discussed in Section 5.1.1 at “fair

kick-off positions” the present score of a game is neglected in the model, which considerably

reduces the number of states. This shortcoming could be mitigated in real robot soccer

by introducing three types of policies: neutral, defensive or aggressive which are employed

if the score difference is zero, positive, or negative, respectively. The three different

policies can be calculated (by sacrificing the team exchanging symmetry in the non-equal

cases) with three different reward functions: scoring a goal is artificially ranked equal

(as is described above), lower (smaller reward) or higher (larger reward) than letting the

opponent score a goal. If not stated otherwise, equal ranking is assumed.
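The reward rules can be summarised in a few lines of code. The following is only a sketch: the function signature is hypothetical, and the numbers for the defensive and aggressive variants are placeholders that merely respect "smaller" and "larger" reward for scoring, since the text fixes only the neutral pair (1, −1).

```python
# Hypothetical reward pairs (goal scored, goal conceded) from the view of Team 1.
# Only the neutral pair is given in the text; the other values are placeholders.
REWARDS = {
    "neutral": (1.0, -1.0),
    "defensive": (0.5, -1.0),
    "aggressive": (2.0, -1.0),
}

def reward(defending_team, attackers, defenders, policy_type="neutral"):
    """Reward for Team 1 in one time step.

    defending_team: team (1 or 2) whose goal cell contains the ball,
                    or None if the ball is in no goal cell
    attackers:      robots of the attacking team in that goal cell
    defenders:      robots of the defending team in that goal cell
    """
    scored, conceded = REWARDS[policy_type]
    goal = (defending_team is not None
            and attackers >= 1
            and attackers >= defenders)
    if not goal:
        return 0.0  # no goal: neither team gains anything
    return scored if defending_team == 2 else conceded
```

With the neutral pair the reward is agent exchange symmetric; the other two pairs deliberately give up this symmetry, as described above.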

5.1.3 Symmetry

The multi-player grid soccer model of Section 5.1.2 is constructed in concordance with the

general three symmetries mentioned in Example 3.22 which should be respected by any

model of robot soccer:

1.) permutations of the players in the same team,

2.) reflection at the goal-to-goal line ($g_x$ in Figure 3.1), and

3.) reflection at the mid-line and exchange of teams ($g_y$ in Figure 3.1).

This is best illustrated by “a discretised version” of Figure 3.1 which is depicted in Fig-

ure 5.5.

[Figure 5.5: panels, each showing the grid field with columns 1 to 6. (a) State $s$. (b) State $g_y(s)$. (d) State $(g_x \circ g_y)(s) = (g_y \circ g_x)(s)$.]

Figure 5.5: Symmetries in grid soccer (discretisation of states in Figure 3.1): a standard

situation (state) $s$ and its symmetric states $g_y(s)$ with exchange of the two teams, $g_x(s)$

without the exchange of teams, and the combination of both: $(g_x \circ g_y)(s) = (g_y \circ g_x)(s)$.

The small grey circle depicts the ball.

Utilising these symmetries for model reduction can have an enormous effect on the size of
the state (and hence state action) space (Table 5.1). In multi-agent grid soccer, some
symmetries help to reduce the state space(7) by a constant factor (reflections), while others
(robot permutations) are of increasing usefulness with a growing number of robots. Note
that the third symmetry is particular to 2P-ZS-MGs and can only be applied for a soccer
game with equally many robots in each team (an “xvx” game). This explains a qualitative
difference of the reduction factors a/b in Table 5.1 which are roughly $2 \cdot 2 \cdot n_a! \cdot n_o!$ in
these special cases and $2 \cdot n_a! \cdot n_o!$ otherwise. Applying these formulae for a grid size of (6 × 4),
theoretically in a 6v6 grid soccer game the savings would be of the order of $10^6$ and in the
11v11 version of about $10^{15}$ (yet, of course, even the reduced state space is still way too
large).

Game        Standard (a)   Reduced (b)     a/b   Sym. 1   Sym. 2   Sym. 3
1v1                 1152           282     4.1      1.0        2        2
1v2, 2v1           41472         10244     4.0      2.0        2        1
2v2              1327104         82944    16.0      4.0        2        2
3v2, 2v3        39813120       1742400    22.8     11.4        2        1
3v3           1146617856       8820000   130.0     32.5        2        2

Table 5.1: Number of states for different multi-player soccer games on a (6 ×4) grid in a

standard and a symmetry reduced form. $n_a$v$n_o$ means a game with team size $n_a$ for Team

1 and size $n_o$ for Team 2. While a/b denotes the total reduction factor by all symmetries,

the last three columns clarify the effect of each single symmetry, the number of which

corresponds to the enumeration list at the beginning of Section 5.1.3.
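The unreduced state counts of Table 5.1 and the rough reduction factors quoted in the text can be checked with a few lines. This is a sketch under the assumptions of footnote (7), i.e. the ball is uniquely assigned to one of the $n = n_a + n_o$ robots; the function names are illustrative only.

```python
from math import factorial

def state_space_size(n_a, n_o, width=6, height=4):
    """|S| = n * (width * height)^n with n = n_a + n_o robots."""
    n = n_a + n_o
    return n * (width * height) ** n

def reduction_factor_estimate(n_a, n_o):
    """Rough factor 2 * (2 if team sizes are equal) * n_a! * n_o!."""
    team_exchange = 2 if n_a == n_o else 1  # third symmetry needs an xvx game
    return 2 * team_exchange * factorial(n_a) * factorial(n_o)

for n_a, n_o in [(1, 1), (1, 2), (2, 2), (3, 2), (3, 3), (6, 6), (11, 11)]:
    print(f"{n_a}v{n_o}: |S| = {state_space_size(n_a, n_o)}, "
          f"estimated reduction factor = {reduction_factor_estimate(n_a, n_o)}")
```

The estimate is an upper bound on the achievable factor because states that are mapped to themselves by a symmetry are not reduced, which is why the measured factors a/b in Table 5.1 (for example 130.0 for 3v3) stay below the estimate (144 for 3v3).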

5.2 Numerical Results of Grid Soccer

The numerical results form an important part of the present work. They are organised as

comparative studies including but also going far beyond the application of the theoretical

results of Section 3.2. In the small amount of literature on 2P-ZS-MGs only comparisons of

different max-min RL methods seem to be available but the key model design questions of

whether to consider max-min or max methods and whether DP or RL methods give higher

performance are typically unanswered.

As a first answer it is clear that DP methods should be more effective than RL meth-

ods because the knowledge of the model is only available to the first class of algorithms.

Therefore, the argument for RL methods is often not effectiveness but adaptiveness to an

unknown model (partly unknown or slowly changing environments). However, it can be ob-

served by the studies in Sections 5.2.2 and 5.2.4 – and is also known to the RL community

– that applying the theoretical convergence results of RL methods to practical problems

leads to uncomfortably large numbers of learning steps which are often unmanageable

in practice.

Many methods have been designed to speed up learning: some of them target the
update rules of the algorithms (such as multi-step updating or multi-state updates (function
approximation)), others aim at external knowledge directly (such as imitating humans
or experts made by humans) or indirectly (such as hierarchical models with a hierarchy
specified by humans). However, the ultimate application of learning theory, namely the design of a
method which is suitable for most models and learns completely automatically, has not
yet been reached to the knowledge of the author.

(7) The size of the state space of an $n_a$v$n_o$ game with a total of $n = n_a + n_o$ robots is $|S| = n (6 \cdot 4)^n$ if the
ball is uniquely assigned to one of the $n$ robots and the grid size is (6 × 4).

The author tries to make a small contribution by incorporating DP methods which are

neglected by many RL researchers or only used for computation of the exact values of

policies to compare final results of small models. One mission of this work is to claim

that using DP methods with a non-perfect but simple model to construct a first guess

approximation for the value function and then starting RL methods with this initial guess

is an appropriate methodology. It is also imaginable that the fact that RL methods do not

need to cover the whole state space $S$ can be incorporated by simulating an RL trajectory

and then constructing an initial guess by DP methods only on the states of this trajectory

and on suitable additional states.

5.2.1 Preliminaries for the Following Subsections

The numerical results presented in Section 5.2 are computed by a software package called

DRPOST (Discrete Robust Probabilistic Optimal Strategy Tool) which is implemented

by the author as a bundle of Matlab routines. These routines contain a model generating

collection (for state and action spaces, transition and reward function) for a general n-player

grid soccer model as described in Section 5.1, all considered DP and RL methods for use

with MDPs and 2P-ZS-MGs, and a large number of other functions which are needed,

e. g. for function approximation, symmetry reduction, and for solving matrix games (see

Appendix C).

Most of the following comparative studies are performed with a 1v1 soccer model on a 6×4

grid field if they are not explicitly intended to compare a single agent with a multi-agent

team grid soccer model or to compare different sizes of the field. The reason is that the

computation of the optimal value functions and strategies for a 2v2 grid soccer model on

a small grid or a 1v1 grid soccer model on a larger grid needs too much time to compare

large varieties of parameters. Additionally, many basic effects can be observed for the small

model.

Standard parameters for the algorithms are a discount factor $\gamma = 0.9$ and an accuracy of
$0.5 \cdot 10^{-3}$ for value functions in value iteration implying an accuracy of $\varepsilon = 1 \cdot 10^{-3}$ for
the policy (Corollary 4.3). For RL methods a decayed learning rate $\alpha_n(s, a) = \frac{1}{n}$ (for each
$s \in S$) which meets the standard convergence criteria is used and the random exploration
rate of the employed $\varepsilon_1$-greedy policy is $\varepsilon_1 = 0.2$. Since Q-Learning is an off-policy method

the learned strategy is greedy independently of the policy followed during the learning

phase. Furthermore, the standard number of learning steps is 100 (times |S|) which is

higher than any number of steps needed for convergence in DP methods. This number was

chosen on the basis of the results of Figure 5.6.
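The standard RL parameters can be illustrated by a single learning step of the max method. This is a minimal sketch and not the DRPOST implementation: it is written in Python instead of Matlab, the table Q is initialised with zeros, and the counter $n$ is assumed to count the visits of each state action pair. For the max-min variant the maximum over the next actions would be replaced by the value of a matrix game.

```python
import random
from collections import defaultdict

GAMMA = 0.9      # discount factor from the text
EPSILON_1 = 0.2  # random exploration rate from the text

Q = defaultdict(float)     # Q[(state, action)], initialised to 0 (assumption)
visits = defaultdict(int)  # visit counter n per (state, action) pair (assumption)

def choose_action(state, actions, rng=random):
    """epsilon_1-greedy selection: explore with probability EPSILON_1."""
    if rng.random() < EPSILON_1:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning step with decayed learning rate alpha_n = 1/n."""
    visits[(state, action)] += 1
    alpha = 1.0 / visits[(state, action)]
    # Off-policy target: greedy value of the next state, whatever is played next.
    target = reward + GAMMA * max(Q[(next_state, a)] for a in next_actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]
```

With $\alpha_n = \frac{1}{n}$ the first visit overwrites the initial value completely and later visits average over all targets observed so far.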

Nomenclature. In the following, “team 1”, “player P1”, or simply “the player (team)”

is considered to be the robot soccer team which is controllable, e. g. by optimal policies

computed by DP or RL methods. “team 2”, “player P2”, or “the opponent (team)” is the

collection of agents which are trying to score against team 1. In some of the comparisons,

one player is considered to be omniscient, i. e. the strongest possible or worst-case opponent

that already knows the policy of team 1; in other cases it is assumed that each team does

not know the policy of the other team. To make the DP operators part of the description

the DP and RL methods for classical MDPs are called max methods and for 2P-ZS-MGs

max-min methods according to the Bellman operator.(8)

There is also a nomenclature introduced to describe the policies efficiently without the need

for giving action probability distributions for every state which would be hard to interpret.

Table 5.2 shows relevant abbreviations (where the max method yields an element of the

game theoretic best response to the other team’s strategy) and the description of the table

makes the reader familiar with the notation of e. g. MMVI(R).

Abbreviation   Meaning
M              strategy determined by max method
MM             strategy determined by max-min method
QL             strategy determined by Q-learning method
R              random strategy
VI             strategy determined by value iteration method

Table 5.2: Different abbreviations for special policies. Example: MMVI(R) means a policy

that is determined by a max-min value iteration method against a random opponent. This

is the only example in which the policy of the opponent does not influence the algorithm

for determining the policy. The opponent’s policy is crucial especially in all max methods

and also in all RL methods.

Evaluation of Policies. In general, the quality of a policy $\pi_1$ for Team 1 is quantified in
a simple and precise way by its value $V(\xi, \pi_1, \pi_2)$ which depends on an initial probability
distribution $\xi$ of start states(9), and the two policies $\pi_i$ of the teams $i = 1, 2$. In the
following $V(\xi, \pi_1, \pi_2)$ is often abbreviated by $V(\pi_1, \pi_2)$ and it is then assumed that $\xi$ is the
standard distribution of start states. If $\pi_2$ is also omitted it is considered to be a best
response to $\pi_1$, i. e. $\pi_2 \in \mathrm{BR}_2(\pi_1)$. Typically, the value of policies is determined to an
accuracy of $\varepsilon = 1 \cdot 10^{-3}$.

Since the value of a policy may be hard to interpret in robot soccer, additional charac-

teristics of the policies extracted by long-term simulations (many kick-off positions) are

given: the fraction of total scored goals of both teams divided by the number of time steps

as well as the percentage of goals of team 1. The first characteristic is intended to show

how offensive or defensive the combination of policies is and the second one shows how

successful each team is in comparison to the other. For example, if both characteristics

are high then Team 1 probably has a successful offensive strategy while Team 2 has an

unsuccessful defensive or offensive strategy. If the number of total goals is low but the

success rate for Team 1 is high then Team 1 has a successful defensive strategy and Team

2 an unsuccessful defensive or offensive strategy. One problem remains: the characteristics

may yield no results about an unsuccessful team strategy or if both teams are equally

good. Therefore, in Tables 5.9 and D.12 each policy is evaluated in a simulation against a

random opponent and against itself to obtain a measure of how offensive or defensive each

policy is (the total number of scored goals may indicate this). In general, however, some

care has to be taken about the precision of the simulation: although $1 \cdot 10^6$ simulation

steps are performed for every policy, the more total goals are scored during the simulation

the more accurate are the goal statistics. As a conservative rule of thumb, absolute differences

of up to 0.05 in the goal rate of player $P_1$ should not be interpreted.
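The two simulation characteristics can be computed from raw goal counts as follows. The function name and the goal counts in the example call are made up for illustration. The standard error is an additional rough indicator that treats every goal as an independent draw, which is an assumption of this sketch and not part of the evaluation described above.

```python
from math import sqrt

def goal_statistics(goals_team1, goals_team2, n_steps):
    """Return (g_t, g_1, rough standard error of g_1).

    g_t: total goals of both teams per simulated time step
    g_1: share of the goals scored by team 1
    """
    total = goals_team1 + goals_team2
    g_t = total / n_steps
    if total == 0:
        return g_t, None, None  # no goals: the goal share is undefined
    g_1 = goals_team1 / total
    std_err = sqrt(g_1 * (1.0 - g_1) / total)  # binomial approximation
    return g_t, g_1, std_err

# Made-up counts: 600 and 400 goals in 10000 simulated time steps.
g_t, g_1, err = goal_statistics(600, 400, 10_000)
print(g_t, g_1, err)
```

The error estimate shrinks with the square root of the total number of goals, which is one way to read the remark that the goal statistics become more accurate the more goals are scored.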

(8) The software DRPOST also provides the option to choose the type of Team 1 by similar declarations.

(9) $\xi$ is fixed in robot soccer to the distribution which assigns 50% probability to each of the two kick-off

states of Figure 5.3.

5.2.2 Reasoning for 2P-ZS-MG Modelling: Comparison of MDP and

2P-ZS-MG strategies

The first subsections of the numerical results serve to justify the author's choice of model

and methods. The two basic statements are: for modelling it is preferable to use 2P-ZS-

MGs instead of MDPs and for computing optimal strategies DP methods are important to

initialise RL methods whenever this is possible. The most obvious theoretical argument

for the first statement is that for an MDP a deterministic optimal policy always exists.

This is typically determined by DP and RL algorithms and it is a best response only to a

fixed opponent policy. In contrast, the 2P-ZS-MG model a priori assumes a worst-case,

i. e. tactically strongest, opponent which already knows the policy of the player. This has

two advantages: first, because the computed policy of the player is safe against any strategy of

the opponent it is sufficient to compute only one optimal policy and not one against

each of the infinitely many possible opponent policies and, second, it is not necessary to

know the opponent policy.(10) In the opinion of the author the second advantage is more

important because, to qualify the first one, it can be argued that a strong policy could

be strong against a whole set of opponent policies.

A First Comparison

Tables 5.3 and 5.4 provide the result that a best response policy ($\pi_1$ by an RL and $\pi_2$
by a DP method) against a randomly acting opponent is easily outperformed by its own
best response answer ($\pi_3$, $\pi_4$, and $\pi_5$).(11) This is not surprising because of the mainly
deterministic nature of the max-optimal policy: only exact ties of the Q-values lead
to randomised actions. Another observation is that with a higher number of learning
steps RL methods in principle do as well as DP methods ($\pi_3$ is equally strong against
$\pi_1$ as $\pi_4$ because the value is equal) indicating that the inherent problem lies in the
non-stochasticity of the policy. This can be seen at first glance in Table 5.3 by noticing that
also the theoretically best policy $\pi_2$ against a random opponent is easily outperformed by
the worst-case opponent strategy $\pi_5$ which is reflected in the high value $V(\pi_5, \pi_2)$. The
fact that the value $V(\pi_5, \pi_2)$ is equal to that of $V(\pi_4, \pi_1)$ shows that “bad learning”(12) of
the RL method is not the reason for bad performance against a worst-case opponent.

To conclude the first comparison the analogous tables for max-min methods are presented

which show that the max-min optimal policy cannot be exploited – as expected – even by

a worst-case opponent knowing this policy in advance.
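Why a deterministic policy is exploitable while a max-min policy is not can be seen in the smallest possible zero-sum setting. The following sketch uses the matrix game "matching pennies" as a stand-in for a single state; the payoff matrix and the function name are assumptions for illustration and are not taken from the grid soccer model.

```python
# Payoff matrix for the row player of a zero-sum game (matching pennies).
PAYOFF = [[1.0, -1.0],
          [-1.0, 1.0]]

def security_level(row_strategy, payoff=PAYOFF):
    """Expected payoff of a mixed row strategy against a column player
    who knows this strategy and answers with a best response."""
    n_cols = len(payoff[0])
    column_values = [
        sum(p * payoff[i][j] for i, p in enumerate(row_strategy))
        for j in range(n_cols)
    ]
    return min(column_values)  # the opponent picks the worst column for us

print(security_level([1.0, 0.0]))  # deterministic policy: -1.0
print(security_level([0.5, 0.5]))  # max-min (mixed) policy: 0.0
```

The deterministic policy loses everything against an opponent who knows it, whereas the mixed max-min policy guarantees the game value 0, which mirrors the difference between Tables 5.3 and 5.5.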

By comparison of the small size 6 × 4 soccer field with the medium size 12 × 8 one it
becomes obvious that RL methods can degrade with the number of states if the number
of training steps is kept linearly related to the state space size $|S|$. Max-min learning
methods are particularly affected: with the 100 (times size of the state space) learning
steps the max RL methods approximate the optimal policy quite well (all values are equal
in Table 5.4); however, for the max-min strategy the performance against a worst-case
opponent degrades significantly. The difference between $V(\pi_4, \pi_1)$ and $V(\pi_5, \pi_2)$ of 0.074
in value seems not too dramatic at first glance but the fact that the goal rate of the
opponent increases from the best of 50% to over 90% should give the correct impression
(Table 5.6).

(10) An RL method can adapt to a stationary (temporarily fixed) policy of other agents but problems occur
if all agents are learning, i. e. changing their policies over time.

(11) Best responses are max optimal policies and in the case of RL methods (MQL(·)) only a more or less
good approximation to a true best response MVI(·).

(12) The RL policy is typically initialised as R(andom) such that for states never reached by the finite
simulation the random action selection is kept.

                          $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MVI($\pi_2$)
$\pi_1$ = MQL(R)          $V$: 0.415               $V$: 0.415
                          $g_t$: 0.080             $g_t$: 0.080
                          $g_1$: 0.735             $g_1$: 0.731
$\pi_2$ = MVI(R)                                                            $V$: 0.415
                                                                            $g_t$: 0.080
                                                                            $g_1$: 0.730

Table 5.3: Robustness of max-policies against worst-case opponents (6 × 4 soccer field).
The column policies are those of player $P_1$ and the row strategies those of player $P_2$. The value
$V$ and the relative amount of goals $g_1$ are from the view of $P_1$ (as always), whereas the
total goal rate per time step $g_t$ relates the sum of goals of both players to the number of
simulated time steps.

                          $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MVI($\pi_2$)
$\pi_1$ = MQL(R)          $V$: 0.145               $V$: 0.145
                          $g_t$: 0.030             $g_t$: 0.030
                          $g_1$: 0.719             $g_1$: 0.723
$\pi_2$ = MVI(R)                                                            $V$: 0.145
                                                                            $g_t$: 0.029
                                                                            $g_1$: 0.725

Table 5.4: Robustness of max-policies against worst-case opponents (12 × 8 soccer field).
The explanation of how to read the table is as in Table 5.3.

                          $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MVI($\pi_2$)
$\pi_1$ = MMQL(R)         $V$: 0.010               $V$: 0.010
                          $g_t$: 0.070             $g_t$: 0.071
                          $g_1$: 0.509             $g_1$: 0.505
$\pi_2$ = MMVI(R)                                                           $V$: 0.000
                                                                            $g_t$: 0.070
                                                                            $g_1$: 0.502

Table 5.5: Robustness of max-min-policies against worst-case opponents (6 × 4 soccer
field). The explanation of how to read the table is as in Table 5.3.

                          $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MVI($\pi_2$)
$\pi_1$ = MMQL(R)         $V$: 0.074               $V$: 0.076
                          $g_t$: 0.007             $g_t$: 0.008
                          $g_1$: 0.979             $g_1$: 0.921
$\pi_2$ = MMVI(R)                                                           $V$: 0.000
                                                                            $g_t$: 0.023
                                                                            $g_1$: 0.499

Table 5.6: Robustness of max-min-policies against worst-case opponents (12 × 8 soccer
field). The explanation of how to read the table is as in Table 5.3.

Training Against Better Opponents

To provide additional weight for our hypothesis above that the non-stochasticity of optimal

policies for MDPs is the reason for the failure of these policies against a learning opponent,

better training partners are chosen and the effect is studied. In the first comparison the

policies $\pi_1$, $\pi_2$ of Table 5.3 are trained against a randomly acting opponent which can

be considered to be very weak – only an opponent helping the player could be weaker.

Therefore, in the following comparison in Tables 5.7 and D.10 the policies $\pi_1$, $\pi_2$ are trained

against much better initial opponent policies which in fact are $\pi_1$ and $\pi_2$ of Table 5.3. The

result is – by comparing Tables 5.3 and 5.7 – that the better initial training policy increases

the exploitability by a best response strategy. Although this can be expected because of

the lack of stochasticity of the better training partner it calls into question efforts to present a

strong policy to an MDP learner while ignoring the 2P-ZS-MG nature of robot soccer.

Exploitability of Non-Optimal Opponents

The most reasonable counter argument against using max-min methods is that they do
not fully exploit weaknesses of the opponent’s policy. This is correct for non-optimal and
particularly for very weak opponents as can be seen by the case of exploiting a randomly
acting opponent on a 6 × 4 soccer field in 1v1 soccer: the value of the start states for the max
policy is much higher than that of the max-min policy
($V(\xi_{\mathrm{start}}, \mathrm{MVI(R)}, \mathrm{R}) \approx 0.688 > 0.483 \approx V(\xi_{\mathrm{start}}, \mathrm{MMVI(R)}, \mathrm{R})$, Table 5.9).
However, max-min policies fully exploit suboptimal policies of the opponent with the constraint
of staying safe in the sense that max-min
methods assume that after the observed non-optimal action the opponent again will act
optimally. This prevents the player from being tricked, i. e. that the opponent intentionally
behaves in a way to create a misleading conclusion about its policy.(13)

                          $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MQL($\pi_2$)   $\pi_6$ = MVI($\pi_2$)
$\pi_1$ = MQL(MQL(R))     $V$: 0.539               $V$: 0.539
                          $g_t$: 0.065             $g_t$: 0.064
                          $g_1$: 0.876             $g_1$: 0.877
$\pi_2$ = MQL(MVI(R))                                                       $V$: 0.506               $V$: 0.506
                                                                            $g_t$: 0.058             $g_t$: 0.058
                                                                            $g_1$: 0.891             $g_1$: 0.893

Table 5.7: Robustness of max-policies against worst-case opponents (6 × 4 soccer field)
with better initial training partners. The explanation of how to read the table is as in
Table 5.3.

Exploitability of Optimal Max-min Opponents

In addition to the question of exploiting a suboptimal opponent it is interesting to ask

whether a max-min policy can be better exploited by a max policy than by a max-min

policy. The related theoretical question is whether the max-min policy is already a best

response to a max-min policy. In contrast to the case of non-optimal strategies, the answer

is positive [186]. In Tables 5.8 and D.11 this – or, more appropriately, the software DRPOST

– is verified by the fact that both values for $\pi_2$ = MVI($\pi_1$) and $\pi_3$ = MMVI(R) = MMVI($\pi_1$)

are equal (necessarily equal to 0).

                          $\pi_2$ = MVI($\pi_1$)   $\pi_3$ = MMVI(R)
$\pi_1$ = MMVI(R)         $V$: 0.000               $V$: 0.000
                          $g_t$: 0.070             $g_t$: 0.071
                          $g_1$: 0.502             $g_1$: 0.503

Table 5.8: Exploitability of optimal max-min opponents (6 × 4 soccer field). The
explanation of how to read the table is as in Table 5.3.

5.2.3 Relating Policies to Humanoid Soccer Characteristics

It is not trivial to get qualitative heuristic results beyond the abstract numerical value of a
policy or a pair of policies because there can be successful policies of different character.
A standard distinction in human soccer is e. g. between defensive and offensive strategies.
A practical way to determine the offensiveness and defensiveness of a grid soccer policy
is the following two step method: first, evaluate the policy against a random opponent to
test its offensiveness (for a very weak opponent a defensive behaviour does not contribute
to its success) and, second, evaluate the policy against itself. One can then try to estimate
the defensive quality by the reduction of scored goals in comparison to the weak opponent.

(13) Although it is often natural and plausible to assume that the opponent will repeat non-optimal
behaviour this cannot be relied upon.

As an example, some results of Tables 5.9 and D.12 are evaluated. A look at the first column
could lead to some confusion because to stay consistent and keep the page layout the values
$V$ are from the point of view of the column policy. To change the viewpoint the negative
signs have to be made positive. Then it becomes directly clear that the random policy
is by far the worst and that $\pi_3$ is strongest against the random policy. Its offensiveness
can be seen by the high goal rate per time step $g_t$ and its very high share of the goals
(1 − 0.016 = 98.4%, Table 5.9). The RL analogue $\pi_2$ is similarly strong which indicates
that the learning phase was sufficiently long.

The max-min optimal strategy $\pi_5$ seems to be weaker because of its safety aspects but
outperforms the random opponent considerably. This means that it clearly exploits the
weaknesses of its opponent although in a safe way which disproves the argument that
max-min strategies do not exploit weaker opponents. The defensiveness of the max-min strategy
can also be seen in comparison to the max strategy in the second column by noticing that
the goals per time step are fewer. Remarkably, the max-min RL method $\pi_4$ is stronger
against the random opponent than the DP method. This could have two reasons: the first
is that the learning is not completed and the safety is not optimal (this is not the case
here) and the second reason is that non-unique Nash equilibria exist which is verified by
the author at least for single states $s \in S$.(14) The Nash equilibrium (optimal) policies are
all equally strong against a worst-case opponent (Theorem 2.20); however, they may be
and obviously are differently strong against the non-optimal random opponent.

A last highlighted result – again providing an argument against using MDP models for

robot soccer – is the evaluation of $\pi_7$. Although the max strategy $\pi_7$ is trained against $\pi_3$

which is the strongest max opponent for the random policy it is relatively weak against the

random opponent. This again indicates that the max learning only adapts to that single

opponent to which it is an approximate best response. In contrast, the quality of max-min

solutions is independent of the training partner which can only influence the update

order in learning and not the final policy.

5.2.4 Comparison of DP and RL Techniques

In this subsection a rough impression is to be given of how much more effective DP methods

are in comparison to RL methods. The comparison is only depicted for 2P-ZS-MGs (max-

min methods) but the result that DP methods are more effective is also expected in classical

MDPs (max methods). A novelty is the inclusion of the game theoretic safety measure:

the computed policy is not only evaluated against its standard opponent (standard value

evaluation) but also against a worst-case opponent (security level). Figure 5.6 comprises all

details for a 1v1 grid soccer model with MDP typical symmetry reduction (all symmetries

which do not exchange the two players are reduced). The DP method converges to a

reasonable policy much faster (after 6 steps) than the RL method (after about 24 steps).

Concerning the security level, the DP method also needs far fewer steps (17 in comparison

to 70 for the RL method) to achieve a value close to 0.

(14) The non-uniqueness of matrix game policies of single states $s \in S$ implies the non-uniqueness of the

total policy.

                          $\pi_1$ = R              equal
$\pi_1$ = R               $V$: 0.000               $V$: 0.000
                          $g_t$: 0.005             $g_t$: 0.005
                          $g_1$: 0.525             $g_1$: 0.495
$\pi_2$ = MQL(R)          $V$: −0.682              $V$: 0.000
                          $g_t$: 0.063             $g_t$: 0.092
                          $g_1$: 0.015             $g_1$: 0.499
$\pi_3$ = MVI(R)          $V$: −0.688              $V$: 0.000
                          $g_t$: 0.064             $g_t$: 0.092
                          $g_1$: 0.016             $g_1$: 0.501
$\pi_4$ = MMQL(R)         $V$: −0.566              $V$: 0.000
                          $g_t$: 0.053             $g_t$: 0.070
                          $g_1$: 0.016             $g_1$: 0.499
$\pi_5$ = MMVI(R)         $V$: −0.483              $V$: 0.000
                          $g_t$: 0.046             $g_t$: 0.070
                          $g_1$: 0.026             $g_1$: 0.499
$\pi_6$ = MQL($\pi_2$)    $V$: −0.086              $V$: 0.000
                          $g_t$: 0.011             $g_t$: 0.020
                          $g_1$: 0.147             $g_1$: 0.498
$\pi_7$ = MQL($\pi_3$)    $V$: −0.098              $V$: 0.000
                          $g_t$: 0.012             $g_t$: 0.018
                          $g_1$: 0.132             $g_1$: 0.496

Table 5.9: Analysis of offensiveness and defensiveness of different policies (6 × 4 soccer
field). The explanation of how to read the table is as in Table 5.3.

[Figure 5.6: two plots of value against step number. (a) RL method, step numbers 0 to 200, with the curves opt. value max, opt. value max−min, max−min (QL): value, and max−min (QL): safety. (b) DP method, step numbers 0 to 50, with the curves opt. value max, opt. value max−min, max−min (VI): value, and max−min (VI): safety.]

Figure 5.6: Convergence speed of RL and DP techniques measured by value and security

evaluation: standard Q-learning versus a Gauss-Seidel DP method (γ= 0.9, random

exploration rate 0.20 for the RL method). The scales for the step numbers (x-axis) are

different.

5.2.5 Comparison of Different DP Techniques with Various Parameters

In the following, basic studies of different DP techniques with a variety of parameters are provided, based on the model of 1v1 multi-player grid soccer. In general, such comparisons should also improve the understanding of RL methods, because RL methods involve some additional subtleties (such as choosing a starting state or an exploration strategy) which can be avoided in DP methods. Furthermore, in some sense DP methods provide an upper bound for the effectiveness of RL methods (if strict bounds on the quality of the solutions are to be guaranteed) because in DP methods the model is completely known. However, the strength of RL methods is typically to yield a good solution without guaranteeing its quality, i. e. in a rarely occurring state the policy can be arbitrarily bad.

This subsection is divided into three parts. First, three DP methods (normal update without symmetry reduction, Gauss-Seidel update without symmetry reduction, and Gauss-Seidel update with symmetry reduction) are compared for different choices of the initial value iterate V0 and the discount factor γ, and for the three cases max-min (2P-ZS-MG), max (MDP with the opponent being fixed to perform a uniformly random policy), and fixed (Markov process with player and opponent performing a fixed random policy, simply: policy evaluation). Note that the theoretical fourth DP method of normal DP updates with symmetry reduction is the same as without reduction, except that the state space is smaller and therefore each value iteration step is considerably faster. Second, different state sorting strategies for Gauss-Seidel methods are evaluated, and an example is worked out which shows that reordering the updates can lead to losing the monotonicity of the Bellman error in MDPs and 2P-ZS-MGs. Third, a special convergence phenomenon is analysed in detail to find out why certain large error improvements occur.

Initial Value Functions V0 and Discount Factors γ

There are two qualitative standard behaviours of DP and RL methods which are verified in this thesis by means of the grid soccer model. The first behaviour is that the discount factor γ dramatically influences the convergence speed of DP and RL methods: the closer γ is to 1, the more update steps are needed to achieve a prescribed maximal error. The second behaviour is that the closer the initial value function V0 is to the optimal value function V∗ with respect to $\|\cdot\|_\infty$, the fewer update steps are needed. Both algorithmic behaviours are direct consequences of the value iteration theorem (Definition 2.35 and below): the first one is due to the fact that the stopping criterion depends on $\frac{1-\gamma}{\gamma}$ and that the contraction rate of the Bellman operator is γ; the second one is due to the fact that

$$\|V_{k+1}-V_k\|_\infty \le \|V_{k+1}-V^*\|_\infty + \|V_k-V^*\|_\infty = \|B_{MG}V_k-B_{MG}V^*\|_\infty + \|V_k-V^*\|_\infty \le (1+\gamma)\cdot\|V_k-V^*\|_\infty .$$
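Both behaviours can be observed on a toy problem. The sketch below is an illustration under stated assumptions, not the thesis code: it uses a hypothetical two-state chain with a single action (s1 moves to s2 with reward 0, s2 stays in s2 with reward 1) and the standard contraction-based stopping rule, i. e. it stops as soon as the change between two iterates falls below ε(1−γ)/γ. Whether this rule coincides exactly with the criterion of the thesis is an assumption. The function name `value_iteration_steps` is hypothetical.

```python
def value_iteration_steps(gamma, v0, eps=1e-3):
    """Jacobi value iteration on a toy two-state chain.

    Returns the number of steps and the final iterate once
    ||V_{k+1} - V_k||_inf < eps * (1 - gamma) / gamma.
    """
    v = [v0, v0]                      # constant initial value function
    threshold = eps * (1.0 - gamma) / gamma
    steps = 0
    while True:
        # s1 -> s2 with reward 0, s2 -> s2 with reward 1 (single action each)
        new = [gamma * v[1], 1.0 + gamma * v[1]]
        diff = max(abs(a - b) for a, b in zip(new, v))
        v = new
        steps += 1
        if diff < threshold:
            return steps, v


# Effect of the discount factor (V0 = 0) and of the initial value (gamma = 0.9).
for gamma in (0.5, 0.75, 0.9):
    print("gamma", gamma, "steps", value_iteration_steps(gamma, 0.0)[0])
for v0 in (0.0, 5.0, 10.0):
    print("V0", v0, "steps", value_iteration_steps(0.9, v0)[0])
```

For this chain the optimal values are γ/(1−γ) and 1/(1−γ), so for γ = 0.9 an initial value of 10 is already close to the optimum.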

Notes on the Choice of the Discount Factor. It should be remarked that the conclusion from the first behaviour of DP and RL methods is not to use a very small discount factor γ but to use the smallest reasonable one. The problem with a too small γ is that, e. g. with γ = 0.1, the 2P-ZS-MG reward after about 10 time steps is discounted to a magnitude below that of numerical errors. One possibility is to determine or to estimate the minimum number of time steps t0 between obtaining two essential rewards and to set γ in such a way that the total discounting $\gamma^{t_0}$ is of the order of 0.5.(15)

(15) For γ = 0.9 this would yield a reasonable time scale of 6 to 7 time steps.

A mathematically more beautiful point of view involves Euler's number e, whose reciprocal can be expressed by the limit

$$\frac{1}{e} = \lim_{n\to\infty}\left(1-\frac{1}{n}\right)^{n}. \tag{5.1}$$

For the special discount factors $\gamma = 1-\frac{1}{n}$ and n large enough this implies that the n-step discounting $\gamma^{n}$ is close to $\frac{1}{e}\approx 0.368$. If a discrete 2P-ZS-MG is constructed by discretisation of an underlying continuous model, then doubling the discretisation accuracy will typically lead to a doubling of the number of discrete time steps needed to describe the same process. Hence, to maintain the expressiveness of the numerical results, a sensible γ would then also be twice as close to 1, i. e. 1 − γ would be halved.
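The two rules of thumb for choosing γ can be checked numerically. The following lines are a small sketch; the helper names `gamma_for_horizon` and `n_step_discount` are hypothetical and only serve the illustration.

```python
import math


def gamma_for_horizon(t0, target=0.5):
    """Discount factor for which t0 steps of discounting give `target`."""
    return target ** (1.0 / t0)


def n_step_discount(n):
    """n-step discounting for gamma = 1 - 1/n; tends to 1/e for large n."""
    return (1.0 - 1.0 / n) ** n


# gamma = 0.9 discounts to about 0.5 after 6 to 7 steps.
print(0.9 ** 6, 0.9 ** 7)
# Conversely, a time scale of t0 = 7 steps suggests a gamma close to 0.9.
print(gamma_for_horizon(7))
# Doubling n halves 1 - gamma while the n-step discounting stays near 1/e.
for n in (10, 20, 40):
    print(n, 1.0 - 1.0 / n, n_step_discount(n), 1.0 / math.e)
```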

Notes on the Magnitudes of the Value Function. The values for V0 are chosen by the following criteria: 3 is larger than the maximal optimal value, −3 is lower than the minimal one. ±1 is of the magnitude of high or low values, while ±0.1 is a small positive or negative estimate. 0 is the mean value of the value function for the max-min case because a 2P-ZS-MG µ-isomorphism with µ = −1 exists. For the max case the mean value against a random opponent is 0.688 for γ = 0.9, 0.136 for γ = 0.75, or numerically zero (≈ 1·10⁻⁷) for γ = 0.1, which shows that the latter discount factor is far too small for the grid soccer problem. The relatively high positive values of start states for the first two discount factors show that it is (of course) much better to use an optimised strategy than to act randomly.

Notes on the Iteration Steps. Figures 5.7 and 5.8 are intended to give a quantitative impression of how many iteration steps different DP methods need for convergence to a certain error (ε = 1·10⁻³) for max-min DP methods and max DP methods (with a randomly acting opponent). The effects of different values for γ and V0, of using standard and Gauss-Seidel type updates, and of using symmetry reduction are illustrated.(16) The analogous results for iterative policy evaluation, i. e. the case in which both players of the 2P-ZS-MG act according to a fixed (here: random) policy, are omitted at this place (see Figure D.1) because they give a qualitatively similar picture to that of the max methods. The Gauss-Seidel methods work more efficiently than the standard updates in the max case, but there is no essential difference between Gauss-Seidel methods with and without symmetry reduction except that each iteration consists of a lower number of updates. Astonishingly, for max-min methods the general view changes qualitatively: for V0 ≥ 0 the symmetry reduction decreases the number of iterations considerably. The effect is larger the more iterations are needed by the other two methods, which leads to a particular insensitivity of this method to the values of V0 ≥ 0: for a Gauss-Seidel method without symmetry reduction the number of steps k for V0 ∈ [0, 3] ranges from 43 to 72, whereas the range for a Gauss-Seidel method with symmetry reduction is only from 26 to 30. However, for negatively initialised V0 this insensitivity is not observable, but at least the symmetry reduction leads to the smallest number of iterations in every single case. Two things are essential: first, the occurrence of the insensitivity effect and, second, the unexpected dependency on the sign of V0. In RL methods a well-known phenomenon exists that an initial guess V0 which is too positive encourages exploration of unexplored states, simply because more realistic estimates of explored states reveal the unattractiveness of these states [202]. However, for DP methods an analogue does not exist. Perhaps the effect can be explained by the solution of the matrix games: if V0 > 0 and only a few entries of a matrix have changed to more realistic lower ones, the worst-case opponent forces the corresponding actions and the value decreases appropriately after a few iterations. If, however, V0 < 0, then nearly all entries of a matrix, i. e. all values of successor states, have to be estimated in a more realistic way (more positive) before the value of the matrix game is significantly influenced, because by the same argument as above a few estimates which are too negative dominate.

(16) Additional tables and figures on the numerical results can be found in Appendix D.

[Figure 5.7, panels (a) Max-min and (b) Max: step number over the initial value V0 (−3 to 3); curves: no GS, no symm; GS, no symm; GS, symm.]

Figure 5.7: Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over the values of the initial value function V0 for a 1v1 multi-player grid soccer model and a (a) max-min method, (b) max method without sorting strategy (γ = 0.9, stopping criterion precision ε = 1·10⁻³ (Corollary 4.3)).

In contrast to the exciting results of varying V0, the results for varying the discount factor γ presented in Figure 5.8 are completely unspectacular. The result here is valid for max-min, max, and fixed strategy methods and is the same as above for the max method, namely that Gauss-Seidel type updates work better than standard ones and that the symmetry reduction does not have a major effect on the number of iterations.

[Figure 5.8, panels (a) Max-min and (b) Max: step number over gamma (0.1 to 0.9); curves: no GS, no symm; GS, no symm; GS, symm.]

Figure 5.8: Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over the discount factor γ for a 1v1 multi-player grid soccer model and a (a) max-min method, (b) max method without sorting strategy (V0 = 0, stopping criterion precision ε = 1·10⁻³ (Corollary 4.3)).

Different Sorting Strategies for Gauss-Seidel Methods

Figure 5.9 shows the convergence speed of the following sorting strategies: standard updates (no sorting), updating by maximal Bellman error after each iteration, updating by a randomly rearranged state order in each iteration (random), and keeping the randomly rearranged state order of the first iteration (random fixed). Some of the parameters explored above are fixed: V0 = 0 and γ = 0.9. There are three surprising observations in this figure. First, there is the “Max-min Convergence Boosting Phenomenon”: occasionally a very large error reduction of nearly one order of magnitude occurs within mostly one but at most two iteration steps. This phenomenon has only been observed for max-min methods and appears so important that it is treated in a separate part of Section 5.2.5. Second, the update order by highest Bellman error is the worst of all methods for nearly all iteration steps, but for this update order the large, decisive second “convergence boost” occurs first for the max-min method. It is not clear whether this is a coincidence, because the first, smaller “convergence boost” appears later than for the two other methods. The bad performance of the Bellman error update order is interesting because the argument for first updating the states with the highest error is directly plausible. Third, the DP methods with Gauss-Seidel type updates whose state order differs between iteration steps (e. g. Bellman error and random sorting) sometimes show a small error increase. This is not due to numerical errors, which are smaller than 1·10⁻⁵ and thus undetectable in the figure, but to the fact that these update methods do not make use of all estimated values of all value iterates, see Example 5.2. Numerical studies indicate that the true error with respect to the unknown optimal value function is reduced but that the Bellman error does not reflect this fact. Therefore, it is stressed that Bellman errors do not reflect true convergence but only the convergence of a guarantee for convergence. This also helps to explain the bad performance of sorting by the Bellman error.

The following example is very simple but much can be learnt from it:

5.2 Example (Monotonicity of Bellman Error for some “Gauss-Seidel” Methods)

Let M = (D, S, SA, T, R) be an MDP with decision epochs D = N0, state space S = {s1, s2} (only two states), A(s1) = A(s2) = {a1} (only one action in every state), T(s1, a1, s2) = 1 and T(s2, a1, s2) = 1 (probability 1 to go to state s2 and stay there), and R(s1, a1) = 0 and R(s2, a1) = 1.

In fact, this example is a discrete-space discrete-time dynamical system since only one policy exists (probability 1 for the only action in each state), which is at the same time the optimal policy. Furthermore, it represents the simplest example with one recurrent and one transient state (in terms of dynamical systems). Nevertheless, it gives useful intuition for clustering methods, interpreting one state as a cluster of other states, and it already provides an example that in MDPs and 2P-ZS-MGs the Bellman error of Gauss-Seidel methods with a changing update order is not monotone. This is in contrast to the Jacobi update type and to Gauss-Seidel type updates with fixed update order, for which the Bellman error decreases at least by the factor γ in each step ([168] gives a proof for MDPs). The above statement can be seen from the following table, in which the Jacobi method and a Gauss-Seidel method with the noted update order are performed for a discount factor γ = 0.9. The value iterates Vk are represented by the vector (Vk(s1), Vk(s2)).

k    Vk (Jacobi)     $\|V_k-V_{k-1}\|_\infty$    Vk (Gauss-Seidel)    update order    $\|V_k-V_{k-1}\|_\infty$
0    (0, 0)                                      (0, 0)               s1, s2
1    (0, 1)          1                           (0, 1)               s1, s2          1
2    (0.9, 1.9)      0.9                         (0.9, 1.9)           s1, s2          0.9
3    (1.71, 2.71)    0.81 = 0.9²                 (2.439, 2.71)        s2, s1          1.539 > 0.9

The reason why the Bellman error does not decrease monotonically by the factor γ is that the updates no longer correspond to the Gauss-Seidel method in the strict sense: for computing V3(s1) the value V3(s2) is used, whereas for V2(s1) the value V1(s2) is used. Thus, V2(s2) does not enter either of the two computations, which is different for the Jacobi method or for any method with fixed update order.
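The table of Example 5.2 can be reproduced with a few lines of code. The following sketch implements the two-state MDP of the example directly; the function names are hypothetical, and the states s1 and s2 are stored at the list indices 0 and 1.

```python
GAMMA = 0.9


def backup(state, v):
    """Bellman backup for the only action: both states move to s2 (index 1);
    the reward is 0 in s1 and 1 in s2."""
    reward = 0.0 if state == 0 else 1.0
    return reward + GAMMA * v[1]


def jacobi_step(v):
    """Update all states from the old iterate only."""
    return [backup(0, v), backup(1, v)]


def gauss_seidel_step(v, order):
    """Update the states in place in the given order."""
    v = list(v)
    for state in order:
        v[state] = backup(state, v)
    return v


def sup_norm_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Update orders of the table: (s1, s2), (s1, s2), then the swapped order (s2, s1).
orders = [(0, 1), (0, 1), (1, 0)]
jacobi, gs = [[0.0, 0.0]], [[0.0, 0.0]]
for order in orders:
    jacobi.append(jacobi_step(jacobi[-1]))
    gs.append(gauss_seidel_step(gs[-1], order))

jacobi_errors = [sup_norm_diff(a, b) for a, b in zip(jacobi[1:], jacobi)]
gs_errors = [sup_norm_diff(a, b) for a, b in zip(gs[1:], gs)]
for k in range(1, 4):
    print(k, jacobi[k], jacobi_errors[k - 1], gs[k], gs_errors[k - 1])
```

In the third step the swapped order lets s1 use the freshly updated value of s2, so the difference between consecutive iterates grows although both methods move towards the same fixed point.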

The Max-min Convergence Boosting Phenomenon

At first glance, it would not be surprising if a max-min value iteration took more steps than a max or fixed strategy value iteration, for the reason that 2P-ZS-MGs are more complex than MDPs or Markov chains. However, the fixed policy method needs more steps than the max-min method, whereas the max method clearly needs the largest number of iteration steps.

A heuristic argument is that of information diffusion: the shorter the longest shortest path from any state si to sj on an appropriate graph, e. g. the one induced by the transition matrix, the faster information about the value iterate Vk(si) is propagated to the state sj in a future iterate. Because the success of value iteration is measured by $\|\cdot\|_\infty$, the performance can be bad if only a few states are informed late about significant changes. If only the number of connections given by the transition matrix is considered, without a detailed shortest path analysis, this gives a reason why the fixed strategy method needs fewer iterations than the max method: the former includes two completely random policies and the latter includes one random and one (nearly) deterministic policy. Applying the same argument, the max-min method should lie in between the two other methods because only sensible actions are performed with non-zero probability by each of the two players, but amongst these sensible actions a reasonable diversification has to be performed to minimise the risk of being exploited. However, because of extraordinary convergence steps the max-min method seems to be the fastest.

[Figure 5.9, panels (a) Max-min and (b) Max: maximal error (logarithmic scale, 10⁻⁴ to 10²) over step number (0 to 70); curves: no sorting, Bellman, random, random fixed.]

Figure 5.9: Maximal DP error (logarithmic scale) over the number of iteration steps for a Gauss-Seidel type update with symmetry reduction, a 1v1 multi-player grid soccer model, and a (a) max-min method, (b) max method by means of standard updates (no sorting), Bellman error estimation (Bellman), randomly rearranged state order in each iteration (random), and keeping the randomly rearranged state order of the first iteration (random fixed) (V0 = 0, γ = 0.9, stopping criterion precision ε = 1·10⁻³ (Corollary 4.3)).

A less heuristic argument is based on iterative linear solvers. The Bellman equation for fixed policies is equivalent to a system of linear equations, but the non-linear versions can also be reformulated a posteriori in a linear way once the optimal policies which fulfil the max or max-min equation are known. Thus, the convergence speed can be analysed by the methodology commonly employed for iterative linear solvers (see Appendix B). It should be remarked that there is a small difference between the Jacobi and Gauss-Seidel updates for the Bellman equation and the classical versions for iterative linear solvers. All in all, the analysis does not reveal anything new: the norm $\|\cdot\|_\infty$ of the update matrix nearly always equals exactly the discount factor γ, and even the spectral radius is always close to it. This implies that even with a different norm (which cannot be chosen freely for the convergence guarantee) no significantly better convergence rates can be explained. The conclusion of this paragraph is that the worst-case convergence rate is not a good measure of the real convergence rate in the practical example of multi-player grid soccer.
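The observation about the norm of the update matrix can be illustrated for a fixed policy. The sketch below is an assumption-laden toy, not the analysis of Appendix B: the 3×3 row-stochastic matrix `P` is hypothetical, and the Gauss-Seidel update matrix is built by applying one in-place sweep (with zero rewards) to every unit vector, where the diagonal entry uses the old value as in the Bellman version of the update.

```python
GAMMA = 0.9
# Hypothetical transition matrix of a fixed policy (each row sums to 1).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]


def jacobi_matrix(P, gamma):
    """Error propagation matrix of the Jacobi update: gamma * P."""
    return [[gamma * p for p in row] for row in P]


def gauss_seidel_matrix(P, gamma):
    """Error propagation matrix of an in-place sweep in the order 0, 1, 2, ..."""
    n = len(P)
    columns = []
    for j in range(n):
        e = [1.0 if i == j else 0.0 for i in range(n)]  # unit error vector
        for i in range(n):
            # entries below index i are already updated, the rest are old
            e[i] = gamma * sum(P[i][m] * e[m] for m in range(n))
        columns.append(e)
    # columns[j][i] is the matrix entry in row i and column j
    return [[columns[j][i] for j in range(n)] for i in range(n)]


def inf_norm(M):
    """Maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in M)


print(inf_norm(jacobi_matrix(P, GAMMA)))
print(inf_norm(gauss_seidel_matrix(P, GAMMA)))
```

The first state of a sweep is always updated from old values only, so its row of the Gauss-Seidel matrix coincides with the Jacobi row and the maximum row sum cannot drop below γ in this toy case.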

A further idea why the “Max-min Convergence Boosting Phenomenon” might occur is the existence of 2P-ZS-MG µ-isomorphisms with µ = −1 (player exchanging symmetries). It was mentioned in Section 3.2 that this qualitatively new kind of symmetry is special to 2P-ZS-MGs and cannot occur in MDPs. Hence, Figure 5.10 shows the same algorithmic convergence as Figure 5.9 (a), with the exception that the max-min method without reduction of the player exchanging symmetry is considered (all other symmetries are reduced). The result is that the convergence boosting takes place in a similar way. Also, the possibility that the Gauss-Seidel update could be responsible is excluded by observing the same phenomenon with non-Gauss-Seidel type updates.

Nevertheless, Figure 5.10 suggests a new idea: without the reduction of the player exchanging symmetry the steps with convergence boosts seem to have a clearer periodic structure. This is valid at least for the updates with fixed update order (“no sorting” and “random fixed” in the figure), and the other update types should be neglected because of Example 5.2. The periodicity is more clearly illustrated in Figure 5.11 (b), in which the stepwise convergence rate is depicted. The rough periodicity of 6 to 7 steps for the peaks is in accordance with the expected duration of scoring a goal after a restart of the game. It is speculated that this inherent “periodicity” of the model is responsible for the periodicity of the convergence error peaks. This would also fit the heuristic argument of information diffusion above. Furthermore, the differences between the Bellman error and the true error indicate that the Bellman error often decreases later than the true error.(17)

(17) The true error is computed as the difference to the optimal value function. The total reduction of the Bellman error can be larger than the total reduction of the true error since the initial error of the Bellman estimate is far above the true error.

[Figure 5.10: maximal error (logarithmic scale, 10⁻⁴ to 10²) over step number (0 to 70); curves: no sorting, Bellman, random, random fixed.]

Figure 5.10: Maximal DP error (logarithmic scale) over the number of iteration steps, all details are as in Figure 5.9, only the player exchanging symmetry is not reduced.

5.2.6 Comparison of Standard Methods and SL Techniques

It is pointed out in Lemma 4.2 that convergence guarantees cannot be given if the approximation error introduced by the SL techniques is close to the magnitude of the desired error of the value function. Nevertheless, SL techniques have been successfully combined with RL methods (Section 4.3), and the following results are to be understood in this practical spirit.

For single-player robot soccer in particular, [97] introduces a number of features. These seem to be highly specialised to the model, which is quite similar to the 1v1 multi-player grid soccer of Section 5.1 because both models are motivated by [112]. The features are case-based and extract the information in a way such as “if the attacker is not closer to the defender's goal than the defender and if the defender is close to the attacker, then store some position information of the defender in relation to its goal and all possible relative positions between the attacker and defender”, and so on for all four possible if-combinations. In contrast to the probably very laborious work of designing such features, the approach used in the present work is to approximate all occurring intermediate state value functions V(s) during an RL procedure by the following very simple features, which are already designed for application in multi-player models:

1.) the number of robots of each team in the cell of the ball,

2.) the minimal distance of each team (minimum over all team members) to the ball,

3.) the minimal distance of the ball to each goal region, and

4.) the number of robots of each team in the half of the first team.
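The four feature groups listed above can be sketched as a small function. This is only an illustration under stated assumptions: the state representation (cells as `(x, y)` tuples, teams as lists of cells, goal regions as lists of cells), the rule that the half of the first team consists of the columns up to `own_half_max_x`, the squared Euclidean distance, and the cap of 3 are choices made for this sketch, and the function name `features` is hypothetical. How the thesis maps such features to the parameters of the value function approximation is not shown here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two grid cells."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def features(team1, team2, ball, goal1, goal2, own_half_max_x, cap=3):
    """Return the eight feature values (two per feature group), each capped."""
    teams = (team1, team2)
    goals = (goal1, goal2)
    values = []
    # 1.) number of robots of each team in the cell of the ball
    values += [sum(1 for robot in team if robot == ball) for team in teams]
    # 2.) minimal distance of each team to the ball
    values += [min(sq_dist(robot, ball) for robot in team) for team in teams]
    # 3.) minimal distance of the ball to each goal region
    values += [min(sq_dist(ball, cell) for cell in goal) for goal in goals]
    # 4.) number of robots of each team in the half of the first team
    values += [sum(1 for robot in team if robot[0] <= own_half_max_x)
               for team in teams]
    return [min(cap, v) for v in values]


# Hypothetical 6x4 field: team 1 defends the left goal, columns 1 to 3 are its half.
example = features(team1=[(2, 2)], team2=[(4, 3)], ball=(2, 2),
                   goal1=[(1, 2), (1, 3)], goal2=[(6, 2), (6, 3)],
                   own_half_max_x=3)
print(example)
```

Because the function iterates over all team members, the same code applies unchanged to 2v2 states, which reflects the remark that the features are designed for multi-player models.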

Distances are measured by the squared Euclidean distance and all features are capped at a maximum value, e. g. 3 for a small grid. If the distances to the goal regions are not capped, one of the two would suffice.

[Figure 5.11, panels (a) Symmetry reduction 1 (step number 5 to 25) and (b) Symmetry reduction 2 (step number 5 to 40): convergence rate over step number; curves: conv. rate (Bellman), conv. rate (true), discount factor.]

Figure 5.11: Convergence rate of a max-min DP method by maximal DP error (Bellman) and by the true error over the number of iteration steps for a Gauss-Seidel type update and a 1v1 multi-player grid soccer model. In (a) all symmetries are reduced, in (b) all except the player exchanging symmetry are reduced, i. e. (a) corresponds to the “no sorting” line of Figure 5.9 (a), and (b) corresponds to that line of Figure 5.10. The convergence rate can be a little larger than γ for later iterates because the Bellman error includes the results of Lemma 2.36.

Table 5.10 shows results of a policy based on state value functions represented by a feature approximation architecture (92 degrees of freedom instead of the 576 needed for the lookup table representation). The results of the max-min solution of the 1v1 grid soccer model with respect to the exploitation of the random opponent (in reference to the discussion about non-unique Nash equilibria) are reasonably good but not as good as those of Table 5.9. The security level against a worst-case opponent trained with the same restrictive architecture is perfectly optimal, i. e. equal to 0, because both players are restricted to the same policy space. Nevertheless, something is lost by approximating the value function only by features, which could also be interpreted as building “wrong equivalence classes” by means of heuristics. This becomes obvious when looking at how strongly the policy is outperformed by a worst-case opponent which is not restricted to the feature representation of the value function: the separated right column reveals that the policy π1 performs quite badly against this opponent.

The conclusion of the above is that features can yield reasonable results even if they are very simple, and also that the security level against an opponent using the same feature architecture is reasonable. However, the performance can degrade significantly against more powerful approximation architectures. This makes knowledge about which feature architecture is used by the opponent nearly as useful as knowing its policy directly.

π1 = MMQL(R)    π2 = R     π3 = π1    π4 = MQL(π1)    π5 = MVI(π1)
V               −0.325     0.000      0.000           0.423
gt              0.033      0.068      0.068           0.042
g1              0.054      0.502      0.500           0.955

Table 5.10: Evaluation of max-min policies for a 1v1 grid soccer on a 6×4 field which are computed with a feature based value function. Only the separated right-most column strategy π5 is computed without features.

5.2.7 Towards Multi-Player Robot Soccer: 2v2 Grid Soccer

In this section the effects of large state spaces, obtained by a fine grid discretisation of 1v1 grid soccer, and of large state spaces induced by multi-player, especially 2v2, grid soccer on a coarse grid are discussed. Tables 5.11 and 5.12 show the sizes of the state and state-action spaces as well as the number of value iterations needed for convergence. From a heuristic point of view, the refinement to a larger grid can be thought to be the easier of the two; however, this is not reflected in the number of value iterations. On the other hand, the computational time per iteration is much higher in the 2v2 case because the size of the matrix games grows quadratically in the exponentially growing action space of the player (since the action space of the opponent grows exactly as fast). A further expectation of the author is that only a small amount of extra information should be obtained by solving a finer resolution of the grid.(18) Nevertheless, it should be expected that initialising a finer grid with an adapted value function of a coarser grid leads to a remarkable reduction of the needed iteration steps because the fine and coarse grid models possess strong similarities.

(18) Note that the different grid size models are different models and cannot be interpreted as two different discretisations of a common underlying continuous model.

Size              |S|       |SAO|       DP iterations    Achieved precision
small (6×4)       282       4893        26               0.00037
medium (12×8)     4584      96288       33               0.00039
large (24×16)     73632     1690578     47               0.00041

Table 5.11: Comparison of symmetry reduced 1v1 grid soccer with γ = 0.9 (different soccer field sizes) by problem size and necessary DP iterations for achieving a stopping criterion precision of ε = 1·10⁻³ (Corollary 4.3). The achieved precision is also noted.

Size              |S|       |SAO|       DP iterations    Achieved precision
small (6×4)       82944     27724780    19               0.00075

Table 5.12: Analogue of Table 5.11 for 2v2 grid soccer on a 6×4 soccer field.

In contrast, for a 2v2 grid soccer model it should be much more difficult to transfer knowledge from the 1v1 grid soccer model. The main aspect is that the structure of the model changes drastically: the new options of performing passes and of coordinating behaviour with a second team member can lead to a completely different kind of optimal behaviour. Since it is a little tedious and will not give many insights to analyse the whole policy or value function, the author tried to identify a single situation which is somehow comparable but in which differences between a 1v1 and a 2v2 grid soccer optimal policy become obvious.

The state in Figure 5.12 (a) for 1v1 grid soccer is extended to 2v2 soccer by adding to the field a tactically not so well positioned second opponent and a tactically well placed second player who is ready for a pass. In the 1v1 case the optimal policy is to go left with probability 1, because the opponent player can only block the ball if he can be in the same cell as the ball-possessing player after one move, i. e. the opponent player has to stand still in his cell. Moving up or down does not make sense for the player either, because then the opponent player would definitely meet him in the next turn, or else a time-consuming phase of several sideways moves would start and decrease the time-discounted long-term reward. In the 2v2 case with the added team members the situation looks different: the optimal policy for Team 1 is to let the member without the ball move to the left and the other one pass to it with a probability of roughly 2/3, although there is a real chance that a miskick will occur,(19) and to perform with the remaining probability of 1/3 the same movement for the team member without the ball but a move downwards without a kick for the ball-possessing one. The tactical advantage of this position is that there is a new chance for a kick without the opponent team being able to reach the ball-possessing team member. The opponent team has a clearer strategy: the opponent in the middle at coordinates (3, 3) has to stand still, and the second opponent has to stand still with a probability of 93% and to go upwards with a probability of 7%. The key aspect here is for both opponents to stay within range so that a misdirected pass may end up with them.

(19) Without a miskick this action would have probability 1 because then the goal scoring would be certain after the pass.

[Figure 5.12, panels (a) 1v1 game and (b) 2v2 game: soccer field with columns 1 to 6.]

Figure 5.12: Case study for a situation of 1v1 and a similar situation in 2v2 soccer. Dark grey indicates the defended goal region and the players of the first team, light grey those of the second team, and the ball is sketched by the small grey circle.

5.3 A New Algorithm: MaG-Clus-VI

In the following, a new algorithm is introduced which combines local and global update steps by means of clustering methods. The idea of structuring the state space with the use of clustering algorithms is shared with the graph clustering by topology approach of [130]. Nevertheless, there are two key differences: first, the following clustering algorithm takes the dynamics into account by clustering according to the transition probabilities and not simply according to values of the value function or some state space topology, and second, no macro actions on clusters are considered. This makes the theoretical proof of convergence very simple, and a loss of optimality with respect to the original model can be avoided. An optimal solution will always be achieved if every state is updated infinitely many times ([19] for MDPs), which can be guaranteed by performing infinitely many global steps or by arranging the local updates in such a way that after all local updates, thought of as one “big local step”, every state has been updated at least once. This criterion is met by the algorithm below, called Markov Game Value Iteration with Clustering (MaG-Clus-VI), because the union of all clusters of the partition forms the state space.

5.3 Algorithm (Markov Game Value Iteration with Clustering (MaG-Clus-VI))

Given a 2P-ZS-MG M = (D, S, SAO, T, R) with D = N0, the MaG-Clus-VI algorithm is defined by:

1.) Global steps:
Perform n_glo ∈ N0 steps of standard value iteration. The result is an estimate V of the value function as well as a Nash equilibrium policy π with respect to V.

2.) Local steps:
a) Determine the transition probability matrix with entries for all state pairs (si, sj) ∈ S × S from the 2P-ZS-MG transition probabilities and the total policy π of both players.
b) Use any clustering method, especially one for obtaining balanced partitions, to determine the k almost invariant clusters Ci ⊆ S, i = 1, . . . , k.
c) For i = 1, . . . , k: perform standard value iteration updates restricted to Ci until convergence to the accuracy $\varepsilon = \gamma^{m_i}\cdot\varepsilon_{prev}$, $m_i \in \mathbb{N}$, is achieved, where ε_prev is the Bellman error of the last global step of 1.) if n_glo > 0, or the estimate of the extra convergence check in 3.) if n_glo = 0.

3.) Go to 1.) until some convergence criterion is achieved. If n_glo = 0, an extra step of global value iteration can be performed without influencing the next estimate of V; otherwise the approximation error can be estimated during the global steps without extra computational effort.
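The interplay of global and local steps can be sketched in code. The following is a strongly simplified illustration and not the implementation used for the numerical results: it treats a single-agent MDP (max instead of max-min, so no matrix games are solved), it takes a fixed, user-supplied partition `clusters` instead of recomputing almost invariant sets in steps 2 a) and 2 b), it requires n_glo ≥ 1, and the local updates are performed in place. All names (`backup`, `global_step`, `local_steps`, `clustered_value_iteration`) and the toy MDP are hypothetical.

```python
def backup(mdp, v, s, gamma):
    """Bellman backup for state s: maximum over actions of reward plus
    discounted expected successor value."""
    return max(r + gamma * sum(p * v[t] for p, t in successors)
               for r, successors in mdp[s])


def global_step(mdp, v, gamma):
    """One standard (Jacobi) value iteration step over all states."""
    new = [backup(mdp, v, s, gamma) for s in range(len(mdp))]
    return new, max(abs(a - b) for a, b in zip(new, v))


def local_steps(mdp, v, cluster, gamma, target):
    """In-place updates restricted to one cluster until the largest change
    within a sweep is at most `target`."""
    v = list(v)
    while True:
        change = 0.0
        for s in cluster:
            new = backup(mdp, v, s, gamma)
            change = max(change, abs(new - v[s]))
            v[s] = new
        if change <= target:
            return v


def clustered_value_iteration(mdp, clusters, gamma, eps, n_glo=1, m=3):
    """Alternate n_glo global steps with local steps on every cluster."""
    if n_glo < 1:
        raise ValueError("this sketch requires at least one global step")
    v = [0.0] * len(mdp)
    threshold = eps * (1.0 - gamma) / gamma  # standard stopping rule (assumed)
    while True:
        for _ in range(n_glo):
            v, err = global_step(mdp, v, gamma)
        if err < threshold:
            return v
        for cluster in clusters:
            v = local_steps(mdp, v, cluster, gamma, gamma ** m * err)


# Toy MDP: mdp[s] is a list of actions, each (reward, [(probability, successor)]).
mdp = [
    [(0.0, [(1.0, 1)]), (0.0, [(1.0, 0)])],  # state 0: move on or stay
    [(0.0, [(0.9, 2), (0.1, 0)])],           # state 1: mostly leaves its cluster
    [(0.0, [(1.0, 3)])],                     # state 2
    [(1.0, [(1.0, 3)])],                     # state 3: absorbing, reward 1
]
values = clustered_value_iteration(mdp, clusters=[[0, 1], [2, 3]],
                                   gamma=0.9, eps=1e-3)
print(values)
```

Since every state belongs to some cluster and a global step is performed in every round, each state is updated again and again, which is the condition for convergence named in the text.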

One strength of the above algorithm is that in the local steps the propagation of larger updates is restricted to only a part of the state space and is fully exploited on that part. Another key point is that the algorithm can perform a different number of local update steps on Ci and Cj, which adapts the update steps to the local properties of the model. Finally, it should be advantageous to use almost invariant sets because this minimises the influence of value updates on adjacent partition elements. Furthermore, the almost invariant sets are those which trap a typical Q-learning trajectory following an optimal policy for a relatively long time. All in all, the MaG-Clus-VI algorithm combines the idea of global updates with the idea of local exploration of Q-learning. On the one hand it is global, and on the other hand it is hierarchically structured without losing the guarantee of obtaining an optimal policy of the non-hierarchical model.

Some numerical results are obtained with the 2P-ZS-MG of 1v1 grid soccer: with the number of partitions equal to 4 in each step, $n_{glo} = 7$, and $m_i = 3$ for all steps and all $i$, the algorithm needs about 39.2 steps to obtain a prescribed accuracy of $\varepsilon = 1 \cdot 10^{-3}$. If the partition created by the optimal policy is used from the first step (which assumes that the optimal policy problem has already been solved), the method takes about 39.8 steps, which shows that adaptivity may be more advantageous than knowing the clustering induced by the optimal policy. In comparison to the 42 steps of the standard Gauss-Seidel method this is about 5% less; for random clustering, however, 49.5 iteration steps are needed. Thus, the comparison with “knowledge free” clustering yields a reduction of about 20% through the use of almost invariant sets.

5.4 From Grid Soccer to Robot Soccer: Practical Issues

A considerable variety of practical issues arises when transferring the numerical results obtained in Matlab by DRPOST to real robots, e. g. to AIBO ERS-7 type robot dogs. Some of the key aspects are the use of lower level behaviours, which are considered to be elementary actions in the 2P-ZS-MG model, and the extraction of visual information, especially the self and opponent localisation task. Localisation of the robot and all other robots is essential for determining the state $s \in \mathcal{S}$ and therefore for being able to apply a state-dependent policy.

Since it would go far beyond the scope of a single PhD thesis to design from scratch a software architecture which handles reasonably well the control of single joints of the robots, basic behaviours such as walking and kicking, higher level behaviours, and finally a policy, communication between the robots via WLAN, analysis of (noisy) visual information of a moving camera, and so on, the author decided to resort to the publicly available software of the German Team.(20) A new version of this software is typically published some time after the most recent world competition of robot soccer called RoboCup. A complete team report of 2004 exists, available only online at the German Team web site, which describes all methods and features of the software in some detail.

(20)Web site (30.11.2007): http://www.germanteam.org/tiki-index.php.

5.4.1 Lower Level behaviours

The German Team software provides a sufficient number of lower level behaviours. A large number of different walking types already exists, with or without the ball being in possession of the robot, as well as an even larger number of different kicks, applicable from diverse relative positions and targeting different locations. A variety of head moving behaviours also exists, which is essential for the following localisation task.

The task at hand is to construct action schemes for the robots which correspond as closely as possible to the basic actions $a_i \in A(s)$ of the multi-player grid soccer model. Some practical experiments show that this task is manageable by manually experimenting in some standard situations of robot soccer.

5.4.2 Image Processing and Localisation

The issues concerning image processing and localisation are trickier than using the lower level behaviours to create the basic actions $a_i \in A(s)$ of the 2P-ZS-MG. Localisation of the robot and all other robots is essential for determining the state $s \in \mathcal{S}$.

Image processing deals with the problem of extracting information from a series of camera pictures. The basic problem in robot soccer is to match shapes and colors of predefined objects to pixels of a picture. The objects as well as the camera can be positioned (not completely but considerably) freely in three dimensional space such that some objects can be partly hidden or out of the visible range. In addition to the basic problem, noise distorts the picture and the analysis must be performed in real time on the robot itself.

[101] gives an overview of recent and possible future developments in image analysis with a focus on robot soccer, together with further references. The main idea is briefly presented: a set of hypotheses of positions is generated for each object, and Kalman filters are applied to predict the motion of each single hypothesis. Then, unlikely hypotheses are removed whereas new ones can be added if corresponding objects are detected in the current camera image.
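To make the predict and update cycle of a single hypothesis concrete, the following is a minimal scalar Kalman filter in Python. It is a generic textbook sketch and not the method of [101]: the one-dimensional state, the constant-velocity prediction, and all noise values (q, r) are assumptions chosen for illustration only.

```python
def kalman_predict(x, p, velocity, dt, q):
    """Propagate a 1-D position estimate x with variance p over a time step dt."""
    return x + velocity * dt, p + q


def kalman_update(x, p, z, r):
    """Fuse the predicted estimate with a measurement z of variance r."""
    k = p / (p + r)  # Kalman gain: weight of the new measurement
    return x + k * (z - x), (1.0 - k) * p


# One hypothesis that starts at position 0 with variance 1 and repeatedly
# receives the (hypothetical) measurement 1.0 of a stationary object.
x, p = 0.0, 1.0
for z in [1.0, 1.0, 1.0]:
    x, p = kalman_predict(x, p, velocity=0.0, dt=0.1, q=0.01)
    x, p = kalman_update(x, p, z, r=0.5)
```

After each update the estimate moves towards the measurement and its variance shrinks; a hypothesis whose variance keeps growing because no matching measurement arrives would be a candidate for removal.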

Self Localisation. Some aspects of self localisation in robot soccer with the AIBO ERS-7 robots are briefly discussed to provide insights for the opponent localisation. The self localisation typically works as follows: in a picture some fixed objects of known size and position (goals, landmarks, lines of the soccer field) are detected, and the distance is estimated from their size in the picture. This is necessary since only one camera is available to the robot and no stereo three dimensional vision is possible as it is e. g. for humans. To calculate the position in a global coordinate system of the soccer field, the position and direction of the camera (in principle all joints of the robot) have to be known.
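The distance estimate from apparent size can be illustrated with the ideal pinhole camera model. This is a sketch under that assumption only; the function name and the numerical values (object size, focal length in pixels) are hypothetical and do not describe the actual AIBO camera or the German Team code.

```python
def distance_from_apparent_size(real_size_m, size_px, focal_length_px):
    """Pinhole model: an object of known size appears smaller the farther away it is.

    distance = focal_length * real_size / apparent_size
    """
    if size_px <= 0:
        raise ValueError("the object must cover at least part of a pixel")
    return focal_length_px * real_size_m / size_px


# Hypothetical numbers: a landmark 0.2 m wide that covers 50 pixels, seen by
# a camera with a focal length of 200 pixels.
distance = distance_from_apparent_size(0.2, 50.0, 200.0)
```

The formula also shows why pixel errors matter: if the measured size is off by a few pixels, the relative error of the distance grows as the object gets smaller in the image.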

Summarising, two possible sources of errors are present: first, the errors of reading the

sensor information of the joints (error of camera position) and, second, the error of color

noise (pixel errors) and misclassified objects (errors of object recognition). It is no issue

to locate other team members because it is allowed to communicate via WLAN within a

team and to spread the information of the robot’s own position.

Opponent Localisation. Since opponent localisation is as essential as self localisation for the determination of the state $s \in \mathcal{S}$, the additional barriers to obtaining a good estimate are briefly outlined. All error sources mentioned for self localisation are present; in addition, the opponents are moving targets and not all of them can be seen by each robot at the same time. This causes problems because the distance cannot be estimated from the size of the robots’ jerseys, which have a complicated non-convex shape.(21) If it is assumed that only the direction but not the distance can be determined, a natural approach is to look at the same opponent robot from different team robots and to intersect the corresponding lines. This could work for one opponent robot, but as soon as two opponent robots are present at the same time, intersections can occur at locations where no robot stands.
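The ambiguity of bearing-only localisation can be reproduced with a small Python sketch. Two observing team robots measure only the direction to each of two opponents; intersecting every pair of viewing rays yields four candidate points, two of which are locations where no robot stands. The positions and function names are hypothetical, and the measurements are assumed to be noise-free.

```python
import math


def bearing(observer, target):
    """Direction (angle in radians) from observer to target."""
    return math.atan2(target[1] - observer[1], target[0] - observer[0])


def intersect_rays(p, angle_p, q, angle_q, eps=1e-12):
    """Intersection of two rays p + t*d and q + u*e with t, u > 0, or None."""
    dx, dy = math.cos(angle_p), math.sin(angle_p)
    ex, ey = math.cos(angle_q), math.sin(angle_q)
    cross = dx * ey - dy * ex
    if abs(cross) < eps:  # parallel rays never meet in a single point
        return None
    wx, wy = q[0] - p[0], q[1] - p[1]
    t = (wx * ey - wy * ex) / cross
    u = (wx * dy - wy * dx) / cross
    if t <= 0 or u <= 0:  # intersection lies behind one of the observers
        return None
    return (p[0] + t * dx, p[1] + t * dy)


observers = [(0.0, 0.0), (4.0, 0.0)]
opponents = [(1.0, 2.0), (3.0, 2.0)]

# Without distance information the observers cannot tell which bearings
# belong to the same opponent, so every pairing has to be considered.
candidates = []
for target_a in opponents:
    for target_b in opponents:
        point = intersect_rays(observers[0], bearing(observers[0], target_a),
                               observers[1], bearing(observers[1], target_b))
        if point is not None:
            candidates.append(point)

ghosts = [pt for pt in candidates
          if all(math.dist(pt, op) > 1e-6 for op in opponents)]
```

In this configuration the two wrongly paired rays meet in front of both observers, so they cannot be discarded by geometry alone; additional information such as distance estimates or tracking over time would be needed.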

5.5 Other Applications

Many examples other than robot soccer exist in which RL methods are successfully applied; an overview can be found in [88, 202]. However, two of the main areas are intertwined by robot soccer: game playing and robotics. For example, the lower level of motion control

can be considered a typical robotic application and the higher level of strategic planning

is strongly related to game playing. This makes robot soccer a challenging and especially

interesting subject of research.

Games

The major part of RL literature deals with MDPs which do not include game playing or

model only a fixed policy of the second agent. Sometimes, even robotic control problems

can be appropriately modeled by 2P-ZS-MGs, e. g. if the aim is to obtain policies being

robust to errors of sensors or motors (MDP with “uncertainty generator” as a competitive

player [147]). Nevertheless, there are examples that RL methods (Section 2.4) are utilised

to solve 2P-ZS-MGs [112, 97].

2P-ZS-MGs include a broad class of games, e. g. nearly all two-player board games and two-team sport games. Two famous examples of board games are checkers and backgammon, in both of which the machine learning can be performed by self-play, i. e. a computer plays against itself. Self-play can be seen as a special asynchronous method in which the policy of the first agent is a best response to a time-dependent policy of the second agent and vice versa. Hence, the standard assumption for convergence of RL methods need not be fulfilled: instead of a max-min policy, a max policy is determined against the current policy of the second agent, which is equal to the agent’s own policy.

A very early successful application was the checkers(22) playing system of Samuel [179, 180], although Samuel did not use the standard RL approach with rewards. Instead, he backed up a kind of value function to estimate the usefulness of board positions. In [202] it is discussed how to relate Samuel’s checkers to current RL methods.

A second example which led to remarkable success was the backgammon policy developed by Tesauro’s software TD-Gammon [205, 206, 202]. A backpropagation neural network approach with three layers is used to approximate the probability of winning the game. The most successful version of backgammon programs combines the state information with features designed by humans. In this way, human insights can be integrated without having a negative influence on the performance. The trade-off between exploitation and exploration is neglected, but due to the stochasticity of rolling the dice even a greedy policy execution seems to lead to enough exploration.

(21)[101] suggests the alternative of estimating the contact point of the robot with the ground. However, no results on the accuracy obtained when detecting multiple moving robots are stated.

(22)Checkers is called “Dame” in Germany.

Robotics

Robotic tasks often have an inherently continuous nature (states, actions, time) and are demanding because the decision making is disturbed by noise and error, information often has to be extracted from sensor information (actuator status, image processing, speech analysis), and limited resources (computational power, time constraints, restriction of usable material) create additional difficulties in solving a given problem. Besides robot soccer, some other successful examples are: robot juggling with a so-called devil-stick [6], box pushing with a special clustering technique [127], collecting and transporting small disks by a team of four robots in a decentralised way [133], and the huge area of robots being part of a production line.

Chapter 6

Conclusion and Outlook

In the present work two-player zero-sum Markov games (2P-ZS-MGs) are shown to be

an adequate framework for modeling robot soccer in contrast to the widely used Markov

decision process (MDP) framework. Furthermore, a grid soccer model for an arbitrary

resolution grid as well as for an arbitrary number of agents is provided. It seems to be

well-suited for comparison of large scale 2P-ZS-MG effects caused by fine discretisation

and multi-agent scenarios.

Amongst the theoretic aspects, the development of a notion of symmetry for 2P-ZS-MGs, particularly 2P-ZS-MG µ-homomorphisms, deserves emphasis. The concept is shown to be a non-trivial extension of recently developed MDP (1-) homomorphisms and its relation to the special case of classical group actions is studied. A qualitatively new class of symmetries which does not occur in MDPs and exchanges the two players of a 2P-ZS-MG is proven to fulfill some natural algebraic properties, especially that it can be composed with recently developed MDP symmetries. Practitioners have already applied the results of this symmetry concept without a precise theoretical foundation. In the present thesis their work is justified mathematically a posteriori.

To also highlight some practical aspects, the creation of the software package DRPOST, by which all numerical results are computed, is to be mentioned. Notably, it includes a new asynchronous dynamic programming (DP) algorithm called MaG-Clus-VI which intertwines global and local aspects of updating by means of almost invariant sets established in dynamical systems theory. The software package is intended to be useful and usable so that other researchers and practitioners can solve interesting problems and gain deeper insights into 2P-ZS-MGs. Additionally, several comparative studies are performed by means of DRPOST: most notable are the aspects of incorporating dynamic programming (DP) methods (which is typically not done by the reinforcement learning (RL) community) and of discovering and investigating interesting phenomena such as the “max-min convergence boosting phenomenon” which is observed in 2P-ZS-MGs. Concerning realisability on physical robots, e. g. on AIBO ERS-7, the author identifies the opponent localisation as the most urgent topic on which to focus future research.

However, solving problems raises others, and thus many ideas remain on how to continue research beyond this PhD thesis. In theory, many open questions exist if the models are differential games: How does discretisation affect the reliability of the solution? Do arbitrarily fine discrete models converge to the continuous model in the limit? Most of the results in differential game theory are limited to the important but not general case of pursuit evasion games. Questions about how to design a stochastic differential game theory are even more challenging since stochastic dynamical systems without any control are also an interesting topic of current research.

Not only continuous dynamical games offer interesting open questions, but also the generalisation of results from 2P-ZS-MGs to non-zero-sum games and to multi-player games with more than two players. A natural idea is to replace the solution of the matrix game by a more general Nash equilibrium solution of a non-zero-sum or multi-player game. However, this introduces quite a burden since all the issues about solution concepts for games with only a single state and single time step (e. g. selection of one Nash equilibrium if multiple exist) known to the game theory community become relevant for every state and every time step in the iterative procedure of finding a solution of a multiple time step multiple state game.

Practically challenging is the application of RL and DP methods to big problem instances.

Some practical ideas which are partly biologically inspired and could help to manage such

instances are according to Kaelbling [88]: shaping, i. e. first presenting simple problems

and then raising the difficulty level little by little (e. g. reward shaping [169]), imitation,

i. e. learning by watching other agents or providing parts of a policy with the aid of humans

([167, 204] utilise initial policies called experts to accelerate learning), and reflexes, i. e.

providing some standard reactions to standard situations.(1)

What all these ideas have in common is to give up tabula rasa learning which is also

the motivation of the present work. The difference is that in the three approaches above

knowledge of a hand-coded (part of a) policy is provided: the experts are directly policies,

reflexes can be interpreted as policies defined only on special states, and shaping can be

performed by restricting the model and the policies to a subset of the original domain. To

make progress in the spirit of giving up tabula rasa learning, however, the author suggests

using DP methods with simpler models to provide an initial guess for a policy.(2) In

future research, all these methods could be compared and intertwined, e. g. by utilising the author’s approach with different models of the same real world problem and letting (nearly) optimal policies of each model be different experts to imitate.

Another suggestion of Kaelbling is to make reinforcement signals local. This addresses the model design phase rather than the learning algorithms but may be advantageous especially in navigation tasks [133]. Parallels can be found to so-called potential field or vector field approaches, in which goals and special objects or locations produce a virtual vector field to influence the motion of a robot. Caution is advisable for vector field approaches as well as for local reward methods because unintentionally introduced subgoals may lead to suboptimal solutions and may prevent the agent from reaching the original goal. It is also interesting future work to design a suitable local reward function for the multi-player robot soccer model and to study the influence of such models on 2P-ZS-MGs.

Finally, a lot of work has been done for MDPs, and although open questions remain in this area, most of the comparable work for 2P-ZS-MGs has not yet been completed. From a practical point of view, a great deal of function approximation methods successfully applied to MDPs should also be evaluated in 2P-ZS-MGs. It is the author’s hope that a multilaterally developed public software package could originate as a byproduct of future work. This should provide a full variety of MDP and 2P-ZS-MG standard example models, include an essential selection of function approximation and data mining methods, and integrate symmetry reduction and hierarchical techniques in a modular way. By means of such a software tool, transparency and comparability amongst the huge variety of proposed learning methods and test models could be improved.

(1)Reflexes could speed up the beginning phase of learning in which nothing happens until a random walk detects some goal (or non-zero reward) and could help to avoid dangerous situations [138]. In robot soccer there are no dangerous situations but a walk close to a cliff or controlling a nuclear reactor provides a meaningful example.

(2)[69] shows that the transfer from one model to another (there: simulation to real robots) can work.

Appendix A

Basics of Group Homomorphisms

and Group Actions

In this appendix some standard definitions of group homomorphisms and group actions are

repeated. By Proposition A.4 and Proposition A.5 it is shown that equivalence relations

and group actions are equivalent concepts. The proof of Proposition A.5 is given for the

convenience of the reader. Some of the statements in this appendix also hold for infinite

groups but only the application to finite groups is intended.

Before continuing with group actions a short reminder of group homomorphisms shall be

given:

A.1 Definition (Group Homomorphism, Isomorphism [60])

Let (G1,∗)and (G2,)be two groups with group operation ∗and , respectively. A map

h: (G1,∗)→(G2,)(or short h:G1→G2) is called a group homomorphism if

∀g1,eg1∈G1:h(g1∗eg1) = h(g1)h(eg1).(A.1)

A (group) homomorphism is called isomorphism if it is a bijection.
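For finite groups the homomorphism property (A.1) can be checked by brute force over all pairs of elements. The following Python sketch does this for a hypothetical example that is not taken from the thesis: reduction modulo 2 as a map from the cyclic group of order 4 to the cyclic group of order 2.

```python
from itertools import product


def is_homomorphism(h, elements, op1, op2):
    """Check h(a * b) == h(a) <> h(b) for all pairs of a finite group."""
    return all(h(op1(a, b)) == op2(h(a), h(b))
               for a, b in product(elements, repeat=2))


def add_mod4(a, b):
    return (a + b) % 4


def add_mod2(a, b):
    return (a + b) % 2


def reduce_mod2(g):
    return g % 2


def shifted(g):
    """A map that is not a homomorphism: it does not send 0 to 0."""
    return (g + 1) % 2


Z4 = range(4)
ok = is_homomorphism(reduce_mod2, Z4, add_mod4, add_mod2)
not_ok = is_homomorphism(shifted, Z4, add_mod4, add_mod2)
```

The map reduce_mod2 is surjective but not injective, so it is a homomorphism without being an isomorphism.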

A.2 Definition (Group Action, Transformation Group [24, 60])

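Let $G$ be a group and let $X$ be a set. A (left) group action $\Theta$ is a map $\Theta : G \times X \to X$ with the properties:

\[ \forall g, h \in G \ \forall x \in X : \quad \Theta(g, \Theta(h, x)) = \Theta(gh, x) \tag{A.2} \]

and for the identity $1 \in G$:

\[ \forall x \in X : \quad \Theta(1, x) = x. \tag{A.3} \]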

A simplified notation for group actions will be $gh \cdot x = ghx = \Theta(g, \Theta(h, x))$ when no misunderstanding is possible. The composition of group elements is given by the standard group operation. [60] states that for each $g \in G$ the map $\sigma_g : X \to X$, $a \mapsto \sigma_g(a) = g \cdot a$, is a permutation or transformation of $X$ (i. e. a bijection from $X$ to $X$) and that the map from $G$ to $S_X$ defined by $g \mapsto \sigma_g$ is a homomorphism. This means that every element $g \in G$ acts on $X$ as a permutation in a manner that is consistent with the group operation of $G$. To make this more precise, define the kernel of a group action by $K = \{g \in G : (\forall x \in X : g \cdot x = x)\} \subseteq G$ and call the action faithful if $K = \{1\}$. Then the kernel $K$ is a normal subgroup of $G$ and hence induces equivalence classes corresponding to elements of the quotient group $G/K$.

Furthermore, not only does each group action induce a homomorphism by $g \mapsto \sigma_g$, but conversely any homomorphism $\varphi : G \to S_X$ defines a group action of $G$ on $X$ by $g \cdot x = \varphi(g)(x)$ for all $g \in G$ and $x \in X$. The kernel of this group action is the kernel of $\varphi$ and the permutation representation $g \mapsto \sigma_g$ equals $\varphi$. Thus, the following proposition holds:

A.3 Proposition (Characterising Group Actions [60])

For any group $G$ and any non-empty set $X$ there exists a bijection between the actions of $G$ on $X$ and the homomorphisms from $G$ into $S_X$.

The following result finally relates group actions to equivalence classes:

A.4 Proposition (Equivalence Relations Induced by Group Actions [60])

For any group $G$ acting on a non-empty set $X$ the group action induces an equivalence relation on $X$ by

\[ x' \in [x] \iff \exists g \in G : x' = g \cdot x. \tag{A.4} \]

For each $x \in X$ the size of the equivalence class is $|[x]| = |G : G_x|$, the index of the stabiliser $G_x = \{g \in G : g \cdot x = x\}$.

A shorter notation for the equivalence classes induced by a group action is $[x] = G \cdot x = \{g \cdot x : g \in G\}$. $G \cdot x$ is also called the group orbit of $x$ under $G$.
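Orbits and stabilisers are easy to enumerate for a small finite example. The Python sketch below lets the cyclic group of order 4 act on the cells of a 3×3 grid by quarter-turn rotations and confirms that the orbit size equals the index of the stabiliser. The grid, the rotation convention, and all names are assumptions made for illustration; they are not the grid soccer symmetries of the thesis.

```python
N = 3            # side length of the hypothetical square grid
GROUP = range(4)  # g stands for a rotation by g quarter turns; 0 is the identity


def act(g, cell):
    """Group action: rotate a cell (row, column) by g quarter turns."""
    r, c = cell
    for _ in range(g % 4):
        r, c = c, N - 1 - r
    return (r, c)


def orbit(cell):
    """Equivalence class [x] = G.x of a cell."""
    return {act(g, cell) for g in GROUP}


def stabiliser(cell):
    """All group elements that leave the cell fixed."""
    return [g for g in GROUP if act(g, cell) == cell]


cells = [(r, c) for r in range(N) for c in range(N)]

# Axiom (A.2): acting with h and then g equals acting with the product g + h.
axiom_holds = all(act(g, act(h, x)) == act((g + h) % 4, x)
                  for g in GROUP for h in GROUP for x in cells)

# Orbit-stabiliser relation |[x]| = |G| / |G_x| for every cell.
index_formula_holds = all(len(orbit(x)) * len(stabiliser(x)) == len(GROUP)
                          for x in cells)

distinct_orbits = {frozenset(orbit(x)) for x in cells}
```

The nine cells fall into three orbits (the centre, the four corners, and the four edge midpoints), so only three representatives would have to be stored for a rotation-invariant function on this grid.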

Proposition A.4 shows that each group action induces an equivalence relation on $X$. Moreover, equivalence relations on $X$ also induce group actions, as stated by Proposition A.5. If equivalence classes of all group actions which induce the same equivalence classes on $X$ are formed, then group actions and equivalence relations on $X$ are equivalent concepts.

A.5 Proposition (Group Actions Induced by Equivalence Relations [68])

Let $X$ be a non-empty set with an equivalence relation, i. e. a partition into equivalence classes $P(X) = \{[x] : x \in X\}$, and let $G$ be the set of all bijective functions $g : X \to X$ that preserve the structure of the equivalence classes, i. e.

\[ \forall g \in G \ \forall x \in X : \quad g(x) \in [x]. \tag{A.5} \]

Then $(G, \circ)$ is a group with $\circ$ being the standard composition of functions, and the group action $\Theta : G \times X \to X$, $\Theta(g, x) = g(x)$, induces equivalence classes equal to those of $P(X)$.

Proof: The proof of [68] is adapted to the notion of group actions. $G$ is a group because it contains the identity, and the inverses and compositions of $g_i \in G$ are also members of $G$ because they all operate with respect to the partition $P(X)$. Furthermore, $\Theta$ is a group action.

It remains to show: $\forall x \in X : G \cdot x = [x]$. The inclusion $G \cdot x \subseteq [x]$ follows directly from Equation A.5. To prove $G \cdot x \supseteq [x]$, recall that $x \in G \cdot x$ and observe that for any $y \in [x]$ the swapping function $g_{x,y} : X \to X$, defined by $g_{x,y}(x) = y$, $g_{x,y}(y) = x$, and otherwise $g_{x,y}(z) = z$, is an element of $G$. Thus, $y \in G \cdot x$, which completes the proof. □

Appendix B

Bellman Equations and Iterative

Linear Solvers

In this appendix, standard iterative linear solvers [214] such as the Jacobi or Gauss-Seidel method are related to the linear version of the Bellman equation which emerges if the policies are fixed. In principle, this does not simplify the problem of determining the optimal value function but makes an a posteriori analysis of the iterative scheme possible. The reason is that after a non-linear Bellman update the optimal policies are known and the non-linear update can be reformulated as a linear one with the calculated optimal policies. The idea is the same as for MDPs [168].

Iterative Solution of Linear Equations. The problem of solving a linear equation of the type

\[ Ax = b, \qquad x, b \in \mathbb{R}^n, \ A \in \mathbb{R}^{n,n} \tag{B.1} \]

can be solved iteratively, which is especially useful for large systems of equations. The standard idea is to rewrite the above equation as the fixed point formulation

\[ x = C^{-1} b - C^{-1}(A - C)x \tag{B.2} \]

for some invertible matrix $C \in \mathbb{R}^{n,n}$. The corresponding iterative method is simply

\[ x_{m+1} = C^{-1} b - C^{-1}(A - C)x_m \tag{B.3} \]

with some initial $x_0$, and convergence of the method depends on the spectral properties of the update matrix $C^{-1}(A - C)$ (spectral radius less than 1). For the classical Jacobi method, $C = D$, where $A$ is written as the sum of its diagonal, strictly lower triangular, and strictly upper triangular parts:

\[ A = D + L + U. \tag{B.4} \]

The standard Gauss-Seidel method utilises $C = D + L$.
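Both splittings can be sketched in pure Python. The code below uses the equivalent form $x_{m+1} = x_m + C^{-1}(b - A x_m)$ of the iteration: for the Jacobi method ($C = D$) every component is updated from the old iterate, whereas for the Gauss-Seidel method ($C = D + L$) the components already updated in the current sweep are reused, which amounts to the forward substitution with $D + L$. The function name, the stopping rule, and the 2×2 test system are assumptions for illustration.

```python
def splitting_solve(A, b, method="jacobi", tol=1e-10, max_iter=10_000):
    """Solve Ax = b by the Jacobi or Gauss-Seidel splitting iteration.

    Returns the approximate solution and the number of sweeps used.
    """
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iter + 1):
        old = list(x)
        # Jacobi reads only the old iterate; Gauss-Seidel reads x in place,
        # so entries with index j < i are already the new values.
        read = x if method == "gauss-seidel" else old
        for i in range(n):
            residual = b[i] - sum(A[i][j] * read[j] for j in range(n))
            x[i] = read[i] + residual / A[i][i]
        if max(abs(u - v) for u, v in zip(x, old)) < tol:
            return x, iteration
    raise RuntimeError("no convergence within max_iter sweeps")


# Hypothetical diagonally dominant system with exact solution (0.1, 0.6).
A = [[4.0, 1.0],
     [2.0, 3.0]]
b = [1.0, 2.0]
x_jacobi, sweeps_jacobi = splitting_solve(A, b, method="jacobi")
x_gs, sweeps_gs = splitting_solve(A, b, method="gauss-seidel")
```

For this system the Gauss-Seidel variant needs fewer sweeps than the Jacobi variant, in line with the smaller spectral radius of its update matrix.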

Bellman Equation as Linear Equation. As mentioned above, the idea is the same as for MDPs with the difference that in 2P-ZS-MGs the policies of both players have to be fixed to make the Bellman equation linear. Equations 2.48 and 2.49, with the minimum equivalently also taken over probability distributions, become for fixed policies $\pi_1, \pi_2$ which fulfill the max-min property:

\[ V_{k+1}(s) = \sum_{a \in A(s)} \sum_{o \in O(s)} \pi_{1,s}(a) \cdot \pi_{2,s}(o) \cdot Q_k(s, a, o) \tag{B.5} \]

where

\[ Q_k(s, a, o) = R(s, a, o) + \gamma \sum_{s' \in \mathcal{S}} T(s, a, o, s') \cdot V_k(s'). \tag{B.6} \]

All in all, this is equivalent to performing a single iterative step of solving a system of linear equations $Ax = b$ for each value iteration step. At iterate $k+1$ the variable reads $x_s = V_{k+1}(s)$ with the initial guess being $V_k$, the right-hand side is defined by the modified reward

\[ b_s = \sum_{a \in A(s)} \sum_{o \in O(s)} \sum_{s' \in \mathcal{S}} \pi_{1,s}(a) \cdot \pi_{2,s}(o) \cdot T(s, a, o, s') \cdot R(s, a, o, s') \tag{B.7} \]

for rewards possibly depending on the future states $s'$, and the matrix $A = I - \gamma\,(T_{s,s'})_{s,s' \in \mathcal{S}}$ depends on the transition matrix

\[ T_{s,s'} = \sum_{a \in A(s)} \sum_{o \in O(s)} \pi_{1,s}(a) \cdot \pi_{2,s}(o) \cdot T(s, a, o, s'). \tag{B.8} \]

With these ingredients the normal DP update is specified in terms of iterative solvers for linear systems by $C = I$ and the Gauss-Seidel type update by $C = I + L$. Thus, the methods are typically different from the versions for iterative linear solvers unless $D = I$, i. e. the transition matrix $T_{s,s'}$ has zero probability for transitions from each state to itself.
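The correspondence for the normal DP update can be checked by substituting $C = I$ into the iteration (B.3). In the short derivation below, $\bar{T}$ is introduced only here as shorthand for the matrix with entries $T_{s,s'}$ from (B.8), and it is assumed that $R(s, a, o)$ in (B.6) denotes the expected reward $\sum_{s'} T(s, a, o, s') \cdot R(s, a, o, s')$, so that $b_s$ from (B.7) is the policy-averaged reward.

```latex
\begin{align*}
x_{m+1} &= C^{-1} b - C^{-1}(A - C)\,x_m
  && \text{iteration (B.3)} \\
        &= b - \bigl((I - \gamma \bar{T}) - I\bigr)\,x_m
  && \text{with } C = I,\ A = I - \gamma \bar{T} \\
        &= b + \gamma \bar{T}\,x_m .
\end{align*}
% Componentwise, with x_m = V_k and x_{m+1} = V_{k+1}:
\[
V_{k+1}(s) = b_s + \gamma \sum_{s' \in \mathcal{S}} T_{s,s'} \, V_k(s') ,
\]
% which is (B.5) after inserting (B.6) and collecting the reward terms in b_s.
```

Under the stated assumption, one step of the iteration with $C = I$ therefore reproduces exactly one value iteration step for the fixed policies.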

Finally, note again that the policies $\pi_1, \pi_2$ are not known a priori but determined during value iteration such that an a posteriori analysis is possible. However, the convergence speed of DP methods is independent of the complexity of determining $\pi_1, \pi_2$ and depends instead on the convergence properties of the matrix $A$. Furthermore, for every iteration a different matrix and right-hand side have to be calculated such that the convergence results (contractivity of the update matrix) can differ from step to step.

Appendix C

The Software Package DRPOST

C.1 Introduction

The software package DRPOST (Discrete Robust Probabilistic Optimal Strategy Tool)

developed during the PhD thesis is also used for computing the results of Section 5.2. In

this appendix, the basic structure of the package is described. First, the files with ending

.m are Matlab files (Matlab 7.3.0.298 was used but the basic files should also work on

previous Matlab versions). When starting Matlab for the first time all subdirectories

can be added to the Matlab path while the first call of renew_model.m will clear archive

subdirectories and unused model directories from the path. The directory phd_scripts

contains executable scripts which generate material contained in this thesis. They can

be considered a good starting point to learn how the software package works and which

parameters are important.

For the sake of completeness, the other subdirectories are briefly described:

data_matlab contains Matlab save files (ending .mat) for the most important or lengthy

computations by scripts in the directory phd_scripts.

func_approx contains the structure for inserting arbitrary function approximation schemes.

The approximation procedures can also be externally provided e. g. by the Matlab neural

net toolbox.

func_approx_EXTERN is empty but the intended directory for external software packages

(not written by the author) which are specialised to function approximation.

models contains one subdirectory for each model. The name of the currently used model can be specified by the Matlab struct field DP_RL_param.model_dir. This model will be copied to model_current by the function renew_model.m and the Matlab path is adapted such that only this model belongs to it.

phd_scripts contains all scripts for computations in this thesis as mentioned above.

pictures is used for storage of pictures and figures.

tools_divers is a collection of different functions not fitting to one of the other categories.

It contains a subdirectory policy with policy generating and modifying functions and a

subdirectory plot for general plot routines.

The use in Matlab is quite straightforward because every function and script is documented and complete help on how to call a function is displayed in Matlab as usual by help <functionname>. Furthermore, only the structs DP_RL_param, DP_RL_prev_step, param_model, param_model_large, simulator_param, and strat_all are needed to keep all information, and all these structs as well as value functions and strategies (function approximation structs) have a subfield .info which gives all necessary information about the structs.

C.2 Technical Aspects of Symmetry Reduction in 2P-ZS-MGs

Efficient Data Structures for State Spaces. In general, all available states $s \in \mathcal{S}$ have to be made accessible by computer software. The most obvious and easiest way in terms of programming effort and clarity is to store a list of all characteristics for every state, i. e. for grid soccer all elements of $\mathbb{N}^{2(n_a+n_o+1)}$ which describe a discrete soccer state. However, if the state space $\mathcal{S}$ is very large, as e. g. in multi-player grid soccer with many robots, it may be important to represent the list of states in a compact way. In the following, hash functions are not considered to be a sensible solution because they are not injective.

A very compact way for finite 2P-ZS-MGs is to assign a list of states (state-actions), which is just a mapping $i_{no} : \mathcal{S} \to \mathbb{N}$ ($i_{no} : \mathcal{SAO} \to \mathbb{N}$), preferably such that the smallest possible integer numbers, i. e. all numbers from 1 to $|\mathcal{S}|$ (1 to $|\mathcal{SAO}|$), are assigned. This number representation introduces the cost of computing the function $i_{no}$ very often; hence, $i_{no}$ should be very simple. The reason for the heavy use is that the assignment of value functions and policies, the evaluation of the transition function, which includes the determination of all possible successor states, and the evaluation of the reward function all need the evaluation of $i_{no}$ for every state or state-action.(1) Sometimes, the inverse $i_{no}^{-1}$ also has to be computed, e. g. for interpreting a value function or policy stored in the number representation.

Efficient Data Structures for Multi-Player Grid Soccer. The structure of possible states in robot soccer, which is simply the product of discrete intervals(2), makes it possible to find a reasonable number representation of $\mathcal{S}$ by simple for-loops and multiplications, while the calculation of the inverse $i_{no}^{-1}$ needs divisions with remainder. Based on the enumeration of states, $\mathcal{SAO}$ can be enumerated by storing (a vector of) the number of possible actions in every state.(3)
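A number representation of this kind is a mixed-radix encoding. The following Python sketch enumerates a product of discrete intervals with a loop and multiplications and inverts the enumeration with divisions with remainder. The function names and the example interval sizes are hypothetical, each interval is assumed to start at 0, and the sketch is not the DRPOST routine.

```python
from itertools import product


def state_to_index(state, sizes):
    """Map a tuple with 0 <= state[i] < sizes[i] to a number in 1..prod(sizes)."""
    index = 0
    for value, size in zip(state, sizes):
        if not 0 <= value < size:
            raise ValueError("state component outside its discrete interval")
        index = index * size + value
    return index + 1  # numbering starts at 1, as in the text


def index_to_state(index, sizes):
    """Inverse mapping, computed by repeated division with remainder."""
    index -= 1
    state = []
    for size in reversed(sizes):
        index, value = divmod(index, size)
        state.append(value)
    return tuple(reversed(state))


# Hypothetical product of three discrete intervals with 3, 4, and 2 elements.
sizes = (3, 4, 2)
all_states = list(product(*(range(size) for size in sizes)))
indices = [state_to_index(state, sizes) for state in all_states]
```

Every number from 1 to 24 is used exactly once in this example, so no storage is wasted and no lookup table is needed for either direction.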

Now, if symmetries are introduced, the aim is to store all symmetric states by only one entry to reduce the amount of stored data. The main problem is that the function $i_{no}$ typically becomes quite complicated and no simple computation comparable to the case without symmetries is obvious. Furthermore, the standard way of storing the mapping directly in a lookup list is impractical because of its size. Thus, a new way of storing value functions is suggested by the author: employing sparse matrices. Sparse matrices are a well-known structure that is typically utilised to store large matrices with many zero elements. Sparsity is often defined by the fact that the number of non-zero elements grows linearly with the matrix size instead of quadratically, which would be natural. Sparsity itself does not matter to the present work; it should only be mentioned that in principle only non-zero entries are stored in a sparse matrix, together with their row and column numbers. To read a value, the index lists are searched for matching entries and, if no match is found, a zero is returned.

(1)For a small model it could be more practical to store the transition function and reward function once for the number representation and then work on that data. However, if the state space is large (as always assumed) then the transition function will typically be much larger.

(2)A discrete interval is an intersection of a real interval $I$ with $\mathbb{N}_0$.

(3)In grid soccer, actions for robots at the margin of the soccer field and the number of kicks can vary. If the action space did not vary, $\mathcal{SAO}$ could be handled in exactly the same way as $\mathcal{S}$.

In the context of symmetries this means that the enumeration of the complete state(-action) space can be initialised for storing a value function or a policy, then each state(-action) is mapped to a unique representative of its equivalence class (see Section 3.2), and finally only the value function or policy entries of the representatives are accessed. The advantage of this approach is twofold. Firstly, there is no need to compute i_no directly; it is simply stored by the matrix entry list. Secondly, the shape of the matrix (more columns or more rows) allows a trade-off between access speed and memory consumption of the sparse matrix. The reason is that the shape of the matrix influences the length of the vectors holding the row and column information, which is typically stored as a vector of vectors. If the storage is organised first by row and then by column and the matrix is a row or a column vector, then either the maximum number of pointers to columns has to be stored (storage bad, access fast) or only one (storage good, access slow). The best compromise seems to be a square matrix. A generalisation to a multi-level tree structure with a variable number of leaves at each level, which could even make i_no superfluous, is not considered because for the multi-player grid soccer example a sparse vector is already sufficiently effective.
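The idea of storing values only for class representatives in a sparse row/column structure can be sketched as follows. This is a hedged illustration, not the data structure of DRPOST: the class SparseValueStore uses nested dictionaries as a stand-in for searched index lists, and the functions representative and number as well as the one-dimensional mirror symmetry are assumptions made for this example.

```python
class SparseValueStore:
    """Stores values only where they were set; all other entries read as zero."""

    def __init__(self, n_cols):
        self.n_cols = n_cols  # matrix width: fixes the row/column split of a number
        self.rows = {}        # row number -> {column number: value}

    def _position(self, index):
        return divmod(index, self.n_cols)  # (row, column) of a state number

    def get(self, index):
        row, col = self._position(index)
        return self.rows.get(row, {}).get(col, 0.0)

    def set(self, index, value):
        row, col = self._position(index)
        self.rows.setdefault(row, {})[col] = value

    def stored_entries(self):
        return sum(len(cols) for cols in self.rows.values())


def representative(state, width):
    """Hypothetical symmetry: mirror the field and keep the smaller tuple."""
    x1, x2 = state
    mirrored = (width - 1 - x1, width - 1 - x2)
    return min(state, mirrored)


def number(state, width):
    """Plain enumeration of the full (unreduced) state space."""
    x1, x2 = state
    return x1 * width + x2


width = 6
store = SparseValueStore(n_cols=width)
for x1 in range(width):
    for x2 in range(width):
        rep = representative((x1, x2), width)
        store.set(number(rep, width), 1.0)  # placeholder value for the class
```

Only 18 of the 36 state numbers are occupied in this sketch, and the reduced numbering never has to be computed explicitly: a value is read by mapping a state to its representative and looking up the representative's ordinary number.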


Appendix D

Detailed Tables of Numerical Results

The numerical results in this appendix were obtained with the software package DRPOST using the scripts gathered in the subdirectory phd_scripts. The material is omitted from the main part because not all results are spectacular. However, the author regards it as his duty to provide this material and hopes that it may be helpful.

D.1 Initial Value Functions V0 and Discount Factors γ

The following tables are related to Section 5.2.5(1) and give detailed information about all combinations of initial value function V0, discount factor γ, Gauss-Seidel and symmetry reduction types, and max-min, max, or fixed policy DP methods.

For comparability and clarity, the tables are arranged in groups of three with the following ordering principle: the first group contains three tables for max-min DP methods, the second group three tables for max DP methods, and the third group three tables for fixed strategy DP methods (policy evaluation). Within each group, the first table contains data for a non-Gauss-Seidel method without symmetry reduction, the second for a Gauss-Seidel method without symmetry reduction, and the third for a Gauss-Seidel method with symmetry reduction. The fourth case, a non-Gauss-Seidel method with symmetry reduction, would yield the same results as the same method on the model without symmetry reduction, which simply updates each state of an equivalence class separately.

(1) A more detailed description of the setting and the interpretation of the basic facts can be found there.
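The difference between standard and Gauss-Seidel updates that the tables compare can be illustrated on a toy problem. The following sketch is not taken from DRPOST: the function evaluate, the deterministic five-state chain, and its rewards are assumptions chosen so that the effect is easy to follow by hand. It evaluates a fixed policy and counts the sweeps until the largest update falls below the stopping precision.

```python
def evaluate(P, r, gamma, eps, gauss_seidel):
    """Iterate V <- r + gamma * P V until the largest update is below eps."""
    n = len(r)
    V = [0.0] * n
    steps = 0
    while True:
        steps += 1
        # Gauss-Seidel reads values already updated in this sweep;
        # the standard update reads a frozen copy of the previous sweep.
        source = V if gauss_seidel else list(V)
        delta = 0.0
        for s in range(n):
            new = r[s] + gamma * sum(p * source[t] for t, p in P[s].items())
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < eps:
            return V, steps


# Toy chain: state 0 is terminal, state s > 0 moves to s - 1,
# and a reward of 1 is received when leaving state 1.
P = [{}] + [{s - 1: 1.0} for s in range(1, 5)]
r = [0.0, 1.0, 0.0, 0.0, 0.0]

V_std, steps_std = evaluate(P, r, 0.9, 1e-3, gauss_seidel=False)
V_gs, steps_gs = evaluate(P, r, 0.9, 1e-3, gauss_seidel=True)
```

On this chain the standard update needs 5 sweeps and the Gauss-Seidel update 2, with identical resulting values, because the sweep order follows the direction in which information propagates. The step counts in the tables depend on the grid soccer model and the state enumeration and cannot be inferred from this toy example.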


             V0 ≡ 3   V0 ≡ 1   V0 ≡ 0.1   V0 ≡ 0   V0 ≡ −0.1   V0 ≡ −1   V0 ≡ −3
γ = 0.10   4   0.00083  0.00043  0.00025  0.00023  0.00024  0.00043  0.00083
γ = 0.25   7   0.00053  0.00029  0.00070  0.00066  0.00070  0.00029  0.00053
γ = 0.50   13  0.00075  0.00050  0.00038  0.00028  0.00038  0.00050  0.00075
γ = 0.75   31  0.00087  0.00091  0.00015  0.00090  0.00091  0.00086
γ = 0.90   88  0.00096  0.00099  0.00094  0.00038  0.00100  0.00099  0.00096

Table D.1: Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max-min value iteration method with standard updates (not Gauss-Seidel). Each cell of the table gives the number of iteration steps needed and the precision ε achieved, for a stopping criterion precision of ε = 1·10^{-3} (ε as in Corollary 4.3) and a matrix game solution precision of 1·10^{-6}.

             V0 ≡ 3   V0 ≡ 1   V0 ≡ 0.1   V0 ≡ 0   V0 ≡ −0.1   V0 ≡ −1   V0 ≡ −3
γ = 0.10   4   0.00022  0.00023  0.00025  0.00043  0.00085
γ = 0.25   6   0.00054  0.00062  0.00065  0.00066  0.00071  0.00025  0.00057
γ = 0.50   11  0.00050  0.00063  0.00051  0.00050  0.00039  0.00055  0.00049
γ = 0.75   24  0.00083  0.00089  0.00072  0.00057  0.00042  0.00090  0.00098
γ = 0.90   72  0.00095  0.00096  0.00095  0.00098  0.00096  0.00099

Table D.2: Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max-min value iteration method with Gauss-Seidel updates (standard enumeration of states). Description of entries as in Table D.1.

             V0 ≡ 3   V0 ≡ 1   V0 ≡ 0.1   V0 ≡ 0   V0 ≡ −0.1   V0 ≡ −1   V0 ≡ −3
γ = 0.10   4   0.00011  0.00020  0.00023  0.00021  0.00022  0.00067
γ = 0.25   5   0.00058  0.00034  0.00059  0.00066  0.00060  0.00025  0.00090
γ = 0.50   9   0.00055  0.00042  0.00027  0.00001  0.00016  0.00055  0.00058
γ = 0.75   14  0.00094  0.00089  0.00059  0.00006  0.00093  0.00072
γ = 0.90   30  0.00087  0.00086  0.00087  0.00037  0.00100  0.00097  0.00099

Table D.3: Comparison of different initial value functions V0 and discount factors γ for a grid soccer model with symmetry reduction and a max-min value iteration method with Gauss-Seidel updates (standard enumeration of states). Description of entries as in Table D.1.


             V0 ≡ 3   V0 ≡ 1   V0 ≡ 0.1   V0 ≡ 0   V0 ≡ −0.1   V0 ≡ −1   V0 ≡ −3
γ = 0.10   4   0.00062  0.00022  0.00020  0.00022  0.00024  0.00042  0.00082
γ = 0.25   7   0.00037  0.00052  0.00061  0.00066  0.00070  0.00029  0.00053
γ = 0.50   13  0.00074  0.00099  0.00100  0.00055  0.00060  0.00053  0.00051
γ = 0.75   31  0.00080  0.00099  0.00090  0.00077  0.00086  0.00083  0.00079
γ = 0.90   89  0.00096  0.00099  0.00098  0.00097

Table D.4: Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max value iteration method with standard updates (not Gauss-Seidel). Each cell of the table gives the number of iteration steps needed and the precision ε achieved, for a stopping criterion precision of ε = 1·10^{-3} (ε as in Corollary 4.3) and a matrix game solution precision of 1·10^{-6}.

             V0 ≡ 3   V0 ≡ 1   V0 ≡ 0.1   V0 ≡ 0   V0 ≡ −0.1   V0 ≡ −1   V0 ≡ −3
γ = 0.10   4   0.00053  0.00018  0.00005  0.00006  0.00013  0.00025
γ = 0.25   6   0.00088  0.00030  0.00021  0.00025  0.00029  0.00064  0.00027
γ = 0.50   11  0.00041  0.00100  0.00089  0.00079  0.00037  0.00044  0.00039
γ = 0.75   21  0.00066  0.00070  0.00078  0.00081  0.00069  0.00085  0.00078
γ = 0.90   53  0.00099  0.00095  0.00093  0.00097  0.00093  0.00099

Table D.5: Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max value iteration method with Gauss-Seidel updates (standard enumeration of states). Description of entries as in Table D.4.

             V0 ≡ 3   V0 ≡ 1   V0 ≡ 0.1   V0 ≡ 0   V0 ≡ −0.1   V0 ≡ −1   V0 ≡ −3
γ = 0.10   4   0.00053  0.00018  0.00005  0.00006  0.00012  0.00037
γ = 0.25   6   0.00086  0.00023  0.00020  0.00025  0.00029  0.00081  0.00032
γ = 0.50   10  0.00097  0.00079  0.00083  0.00059  0.00088  0.00039  0.00035
γ = 0.75   20  0.00090  0.00063  0.00060  0.00066  0.00092  0.00068  0.00098
γ = 0.90   52  0.00094  0.00095  0.00093  0.00097  0.00099  0.00096

Table D.6: Comparison of different initial value functions V0 and discount factors γ for a grid soccer model with symmetry reduction and a max value iteration method with Gauss-Seidel updates (standard enumeration of states). Description of entries as in Table D.4.


             V0 ≡ 3   V0 ≡ 1   V0 ≡ 0.1   V0 ≡ 0   V0 ≡ −0.1   V0 ≡ −1   V0 ≡ −3
γ = 0.10   4   0.00062  0.00022  0.00050  0.00030  0.00050  0.00022  0.00062
γ = 0.25   7   0.00038  0.00053  0.00038  0.00085  0.00038  0.00053  0.00038
γ = 0.50   13  0.00075  0.00052  0.00063  0.00055  0.00063  0.00052  0.00075
γ = 0.75   31  0.00087  0.00092  0.00090  0.00071  0.00090  0.00092  0.00087
γ = 0.90   88  0.00095  0.00099  0.00097  0.00100  0.00097  0.00099  0.00095

Table D.7: Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a fixed value iteration method with standard updates (not Gauss-Seidel). Each cell of the table gives the number of iteration steps needed and the precision ε achieved, for a stopping criterion precision of ε = 1·10^{-3} (ε as in Corollary 4.3) and a matrix game solution precision of 1·10^{-6}.

             V0 ≡ 3   V0 ≡ 1   V0 ≡ 0.1   V0 ≡ 0   V0 ≡ −0.1   V0 ≡ −1   V0 ≡ −3
γ = 0.10   4   0.00033  0.00011  0.00041  0.00031  0.00032  0.00011  0.00033
γ = 0.25   6   0.00044  0.00085  0.00016  0.00089  0.00092  0.00045
γ = 0.50   10  0.00090  0.00083  0.00097  0.00062  0.00091  0.00086  0.00091
γ = 0.75   21  0.00093  0.00085  0.00082  0.00090  0.00091  0.00080  0.00091
γ = 0.90   55  0.00093  0.00096  0.00091  0.00094  0.00092

Table D.8: Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a fixed value iteration method with Gauss-Seidel updates (standard enumeration of states). Description of entries as in Table D.7.

             V0 ≡ 3   V0 ≡ 1   V0 ≡ 0.1   V0 ≡ 0   V0 ≡ −0.1   V0 ≡ −1   V0 ≡ −3
γ = 0.10   4   0.00038  0.00012  0.00041  0.00031  0.00032  0.00013  0.00039
γ = 0.25   6   0.00046  0.00095  0.00016  0.00089  0.00090  0.00016  0.00047
γ = 0.50   10  0.00079  0.00076  0.00090  0.00060  0.00089  0.00079
γ = 0.75   21  0.00081  0.00075  0.00074  0.00083  0.00085  0.00071  0.00079
γ = 0.90   54  0.00099  0.00091  0.00093  0.00091  0.00092  0.00098

Table D.9: Comparison of different initial value functions V0 and discount factors γ for a grid soccer model with symmetry reduction and a fixed value iteration method with Gauss-Seidel updates (standard enumeration of states). Description of entries as in Table D.7.


D.2 Additional Figures and Tables for the Comparative Studies of DP and RL Methods

Initial Value Functions V0 and Discount Factors γ

Figure D.1 shows the omitted results for fixed strategy DP methods related to Figure 5.7. The figure was omitted since it yields only expected results analogous to those of the max methods.

[Plot: step number over the initial value (−3 to 3); legend: no GS, no symm; GS, no symm; GS, symm]

Figure D.1: Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over initial values of the initial value function V0 for a 1v1 multi-player grid soccer model and a fixed strategy method without sorting strategy (γ = 0.9, stopping criterion precision ε = 1·10^{-3} (Corollary 4.3)).

Figure D.2 shows the omitted results for fixed strategy DP methods related to Figure 5.8. The figure was omitted since it yields only expected results analogous to those of the max-min and max methods.


[Plot: step number over gamma (0.1 to 0.9); legend: no GS, no symm; GS, no symm; GS, symm]

Figure D.2: Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over the discount factor γ for a 1v1 multi-player grid soccer model and a fixed strategy method without sorting strategy (V0 = 0, stopping criterion precision ε = 1·10^{-3} (Corollary 4.3)).

Figure D.3 shows the omitted results for fixed strategy DP methods related to Figure 5.9. The figure was omitted since it yields only expected results analogous to those of the max methods.

[Plot: maximal error (logarithmic scale, 10^{-4} to 10^{2}) over step number (0 to 70); legend: no sorting; Bellman; random; random fixed]

Figure D.3: Fixed policy method, Gauss-Seidel type, with symmetry reduction: Maximal DP error (logarithmic scale) over the number of iteration steps for a 1v1 multi-player grid soccer model (V0 = 0, γ = 0.9, stopping criterion precision ε = 1·10^{-3} (Corollary 4.3)) by means of standard updates (no sorting), Bellman error estimation (Bellman), randomly rearranged state order in each iteration (random), and keeping the randomly rearranged state order of the first iteration (random fixed).


Strategy Evaluation Tables for 12 × 8 Grid Soccer

Additional tables evaluating different soccer strategies, analogous to those of Section 5.2, are presented for 1v1 grid soccer on a 12 × 8 field. These tables reveal only expected results and, unlike some other tables, are not intended to stress differences between a coarse and a medium discretisation of the soccer field. However, consulting this appendix saves the time it would take to compute these results again with the software package DRPOST.

Columns: π3 = MQL(π1), π4 = MVI(π1), π5 = MQL(π2), π6 = MVI(π2)

π1 = MQL(MQL(R))    V: 0.225, gt: 0.021, g1: 0.991    V: 0.234, gt: 0.022, g1: 0.985
π2 = MQL(MVI(R))    V: 0.233, gt: 0.022, g1: 0.980    V: 0.236, gt: 0.022, g1: 0.984

Table D.10: Robustness of max-policies against worst-case opponents (12 × 8 soccer field) with better initial training partners. The explanation of how to read the table is as in Table 5.3.

                 π2 = MVI(π1)                      π3 = MMVI(R)
π1 = MMVI(R)     V: 0.000, gt: 0.023, g1: 0.505    V: 0.000, gt: 0.023, g1: 0.503

Table D.11: Exploitability of optimal max-min opponents (12 × 8 soccer field). The explanation of how to read the table is as in Table 5.3.


                 π1 = R                             equal
π1 = R           V: 0.000, gt: 0.000, g1: 0.577     V: 0.000, gt: 0.000, g1: 0.516
π2 = MQL(R)      V: −0.252, gt: 0.023, g1: 0.001    V: 0.000, gt: 0.034, g1: 0.500
π3 = MVI(R)      V: −0.252, gt: 0.023, g1: 0.000    V: 0.000, gt: 0.021, g1: 0.501
π4 = MMQL(R)     V: −0.179, gt: 0.016, g1: 0.001    V: −0.000, gt: 0.017, g1: 0.505
π5 = MMVI(R)     V: −0.173, gt: 0.016, g1: 0.001    V: 0.000, gt: 0.023, g1: 0.507
π6 = MQL(π2)     V: −0.002, gt: 0.000, g1: 0.110    V: 0.000, gt: 0.001, g1: 0.468
π7 = MQL(π3)     V: −0.002, gt: 0.000, g1: 0.132    V: 0.000, gt: 0.000, g1: 0.485

Table D.12: Analysis of offensiveness and defensiveness of different policies (12 × 8 soccer field). The explanation of how to read the table is as in Table 5.3.


List of Figures

1.1 Scheme of the main subfields of learning.
2.1 Scheme of the main components of multi-agent systems.
2.2 Scheme of a Markov decision process.
3.1 Symmetries in robot soccer.
5.1 Discretisation of a soccer field: grid soccer.
5.2 Grid soccer: the parameter max_kick_distance.
5.3 Kick-off states in grid soccer.
5.4 Grid soccer: the parameters max_kick_distance and close_distance.
5.5 Symmetries in grid soccer (discretisation of states in Figure 3.1).
5.6 Convergence speed of RL and DP techniques measured by value and security evaluation: standard Q-learning versus a Gauss-Seidel DP method.
5.7 Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over initial values of the initial value function V0 for a 1v1 multi-player grid soccer model and a (a) max-min method, (b) max method.
5.8 Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over the discount factor γ for a 1v1 multi-player grid soccer model and a (a) max-min method, (b) max method without sorting strategy.
5.9 Maximal DP error (logarithmic scale) over the number of iteration steps for a Gauss-Seidel type update with symmetry reduction, a 1v1 multi-player grid soccer model, and a (a) max-min method, (b) max method by means of standard updates (no sorting), Bellman error estimation (Bellman), randomly rearranged state order in each iteration (random), and keeping the randomly rearranged state order of the first iteration.
5.10 Maximal DP error (logarithmic scale) over the number of iteration steps, all details are as in Figure 5.9, only the player exchanging symmetry is not reduced.
5.11 Convergence rate of a max-min DP method by maximal DP error (Bellman) and by the true error over the number of iteration steps for a Gauss-Seidel type update and a 1v1 multi-player grid soccer model.
5.12 Discretisation of a soccer field: grid soccer.
D.1 Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over initial values of the initial value function V0 for a 1v1 multi-player grid soccer model and a fixed strategy method without sorting strategy.
D.2 Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over the discount factor γ for a 1v1 multi-player grid soccer model and a fixed strategy method without sorting strategy.
D.3 Fixed policy method, Gauss-Seidel type, with symmetry reduction: Maximal DP error over the number of iteration steps for a 1v1 multi-player grid soccer model for different update orders.


List of Tables

5.1 Comparison of state space sizes for different types of multi-player grid soccer with a (6 × 4) grid.
5.2 Different abbreviations for special policies.
5.3 Robustness of max-policies against worst-case opponents (6 × 4 soccer field).
5.4 Robustness of max-policies against worst-case opponents (12 × 8 soccer field).
5.5 Robustness of max-min policies against worst-case opponents (6 × 4 soccer field).
5.6 Robustness of max-min-policies against worst-case opponents (12 × 8 soccer field).
5.7 Robustness of max-policies against worst-case opponents (6 × 4 soccer field) with better initial training partners.
5.8 Exploiting Non-Optimality of the Opponent (6 × 4 soccer field).
5.9 Analysis of offensiveness and defensiveness of different policies (6 × 4 soccer field).
5.10 Evaluation of policies for a 1v1 grid soccer on a 6 × 4 field which are computed with a feature based value function.
5.11 Comparison of symmetry reduced 1v1 grid soccer with γ = 0.9 (different soccer field sizes) by problem size and necessary DP iterations for achieving a stopping criterion precision of ε = 1·10^{-3} (Corollary 4.3).
5.12 Analogon to Table 5.11 for 2v2 grid soccer for a 6 × 4 soccer field.
D.1 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max-min value iteration method with standard updates (not Gauss-Seidel).
D.2 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max-min value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.3 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model with symmetry reduction and a max-min value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.4 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max value iteration method with standard updates (not Gauss-Seidel).
D.5 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.6 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model with symmetry reduction and a max value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.7 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a fixed value iteration method with standard updates (not Gauss-Seidel).
D.8 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a fixed value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.9 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model with symmetry reduction and a fixed value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.10 Robustness of max-policies against worst-case opponents (12 × 8 soccer field) with better initial training partners.
D.11 Exploitability of optimal max-min opponents (12 × 8 soccer field).
D.12 Analysis of offensiveness and defensiveness of different policies (12 × 8 soccer field).


Glossary

∅    empty set
A    complement of a set A
A ∩ B    intersection of two sets A and B
A ∪ B    union of two sets A and B
A \ B    difference of two sets: set A minus set B
A × B    product of two sets A and B
A^T    transpose of a matrix or vector A
2P-ZS-MG    Two Player Zero Sum Markov Game M
A(s)    set of actions in a state s for a Markov decision process or for the first agent P_1 of a two-player zero-sum Markov game
AI    Artificial Intelligence
α    possibly time- and state-dependent learning rate of a reinforcement learning algorithm
Aut(M)    group of automorphisms (also: symmetry group) of a set or especially of a Markov decision process or a two-player zero-sum Markov game
B    box (generalised rectangle) used for a special class of set oriented numerical methods
B_MDP    Bellman operator for a Markov decision process
B_MG    Bellman operator for a two-player zero-sum Markov game
B_MG    numerical approximation of a Bellman operator for a two-player zero-sum Markov game
BR_i(π_{−i})    best response or best reply policy of agent P_i to the joint policy π_{−i} of all other agents
C_ext    external costs of a subset of vertices of a graph
χ_A    characteristic function on a set A with χ_A(x) being 1 if x ∈ A and 0 else
C_int    internal costs of a subset of vertices of a graph
C^n    n-times continuously differentiable functions
D    decision epoch for a Markov decision process or a two-player zero-sum Markov game
∂_ij    Kronecker symbol (1 if i equals j, else 0)
∂S    boundary of a set S
d_Γ    game matrix distance
DP    Dynamic Programming


dt    total derivative of x with respect to t
E    edge set of a graph G
e    Euler's number
E{X}    expectation of random variable X
E_π{X}    expectation of a random variable X in a Markov decision process or a two-player zero-sum Markov game if the (joint) policy π is executed
e_i    i-th unit vector of R^n
E_V(i + 1)    error bound for numerical approximation of value iteration in 2P-ZS-MGs
f    approximation to a function f
G    a graph with vertex set V and edge set E
Γ    a game
γ    discount factor for long-term rewards in a Markov decision process or a two-player zero-sum Markov game
id_X    identity map on a set X
i_no    mapping of state or state-action space to a number representation
LP    Linear Programming
LSPI    Least Squares Policy Iteration
[M]    matrix game with game matrix M
M    space of (signed) measures
MAS    Multi Agent System
M_B    space of (signed) measures discretised by a collection of boxes B
MDP    Markov Decision Process M or sometimes also Markov Decision Problem
MGR    Matrix Game Reduction
µ    a measure (sometimes especially the Lebesgue one)
µ_Leb    Lebesgue measure (sometimes abbreviated by µ)
N    set of natural numbers
N_0    set of natural numbers including zero
n_a    number of robots in the so-called first soccer team
n_o    number of robots in the so-called second or opponent soccer team
NP    non-deterministic polynomial: measures of hardness of an algorithmic problem are e. g. NP-hard or NP-complete
O(s)    set of actions in a state s for the second agent P_2 of a two-player zero-sum Markov game
O_ϕ(x)    orbit or trajectory of a dynamical system with respect to a flow ϕ
P    transfer operator of a dynamical system
P_B    transfer operator of a dynamical system discretised with respect to a partition (of boxes) B


P(A)    partition of a set A into disjoint subsets (measure zero of pairwise intersections)
PD(B)    set of all probability distributions on a (Borel) set B
ϕ    flow of a dynamical system
P_i    player of a matrix game (also called agent)
Π_x    projection of a tuple onto the x component
Π^*(V)    optimal policy of a Markov decision process or a two-player zero-sum Markov game with respect to state value function V
π    policy of a Markov decision process or a two-player zero-sum Markov game (a subindex i indicates the agent P_i)
π_s    policy of a Markov decision process or a two-player zero-sum Markov game restricted to a state s
π^*    optimal policy of a Markov decision process or a two-player zero-sum Markov game (a subindex i indicates the agent P_i)
π_{−i}    joint policy of all agents with exception of agent P_i
PO-MDP    Partially Observable Markov Decision Process (see also MDP)
Prob{X}    probability that event X happens
Prob{X|Y}    conditional probability that event X happens under the condition Y
Q    set of rational numbers
Q^π    state action (or Q-) value function of policy π for a Markov decision process or a two-player zero-sum Markov game
Q^* (= Q^{π^*})    optimal value function for a Markov decision process or a two-player zero-sum Markov game
R    set of real numbers
R    reward function for a Markov decision process or a two-player zero-sum Markov game
R    reward or payoff matrix for a matrix game (two matrices R_1 and R_2 in a bimatrix game)
R    return of a Markov decision process or a two-player zero-sum Markov game (a subindex i indicates the agent P_i)
R_aver    return model of average reward for a Markov decision process or a two-player zero-sum Markov game
R_disc    return model of discounted reward for a Markov decision process or a two-player zero-sum Markov game
RL    Reinforcement Learning
S    set of states for a Markov decision process or a two-player zero-sum Markov game
SA    set of state action pairs for a Markov decision process
SAO    set of state action pairs for a two-player zero-sum Markov game
S_d    feasible set of a dual linear program
SL    Supervised Learning
S-MDP    Semi Markov Decision Process (see also MDP)
S_p    feasible set of a primal linear program
S_X    permutation or transformation group of a set X containing all automorphisms of X


T    transition probability between two sets of a dynamical system
T_inv(S)    invariance ratio of a set S of a dynamical system being the transition probability of a set into itself
T_inv(P(S))    average invariance ratio of a partition of a set S
T    transition function for a Markov decision process or a two-player zero-sum Markov game
T    alternative form of a transition function for a deterministic Markov decision process or a two-player zero-sum Markov game
U(x)    neighbourhood of x ⊆ X
V    vertex set of a graph G
V^π    state value function of policy π for a Markov decision process or a two-player zero-sum Markov game
V^* (= V^{π^*})    optimal value function for a Markov decision process or a two-player zero-sum Markov game (especially value of a matrix game)
V_k    numerical approximation (e. g. by SL techniques) of the k-th iterate of value iteration
Z    set of integer numbers


Bibliography

[1] Adrian K. Agogino and Kagan Tumer. Unifying temporal and structural credit

assignment problems. In AAMAS, pages 980–987. IEEE Computer Society, 2004.

[2] James S. Albus. A new approach to manipulator control: The cerebellar model

articulation controller (CMAC). ASM Journal of Dynamic Systems, Measurement,

and Control, 97:220–227, 1975.

[3] James S. Albus. Brains, Behavior, and Robotics. Byte Books, Peterborough, New

Hampshire, 1981.

[4] Michael A. Arbib, editor. The Handbook of Brain Theory and Neural Networks. MIT

Press, Cambridge, MA, 1995.

[5] Kenneth J. Arrow, David Blackwell, and M. A. Girshick. Bayes and minimax solu-

tions of sequential decision problems. Econometrica, 17:213–244, 1949.

[6] Christopher G. Atkeson and Stefan Schaal. Memory-based neural networks for robot

learning. Neurocomputing, 9(3):243–269, 1995.

[7] Leemon C. Baird. Residual algorithms: Reinforcement learning with function approx-

imation. In Proceedings of the 12th International Conference on Machine Learning

(ICML), pages 30–37, San Francisco, CA, 1995. Morgan Kaufmann.

[8] Leemon C. Baird and A. Harry Klopf. Reinforcement learning with high-dimensional

continuous actions. Technical Report WL-TR-93-1147, Wright Laboratory, Wright-

Patterson Air Force Base, OH 45433-7301, 1993.

[9] Martino Bardi, Maurizio Falcone, and Pierpaolo Soravia. Fully discrete schemes for

the value function of pursuit-evasion games. In Advances in dynamic games and

applications, volume 1 of Annals of the International Society of Dynamic Games,

pages 89–105. Birkhäuser, Boston, MA, 1994.

[10] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using

real-time dynamic programming. Artificial Intelligence, 72(1):81–138, 1995.

[11] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforce-

ment learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

[12] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuron-like adaptive

elements that can solve difficult learning control problems. IEEE Transactions on

Systems, Man, and Cybernetics, 13(5):834–846, 1983.

[13] Tamer Başar and Geert J. Olsder. Dynamic Noncooperative Game Theory. Academic

Press Ltd., London, 2nd edition, 1995.

[14] Richard E. Bellman. Dynamic Programming. Princeton University Press, Princeton,

NJ, 1957.

126

[15] Richard E. Bellman and Joseph P. LaSalle. On non-zero sum games and stochastic

processes. RM-212, Rand Corp., Santa Monica, 1949.

[16] Hamid R. Berenji. Artificial neural networks and approximate reasoning for intelli-

gent control in space. In American Control Conference, pages 1075–1080, 1991.

[17] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Ex-

periments. Chapman and Hall, London, UK, 1985.

[18] Dimitri P. Bertsekas and David A. Castañon. Adaptive aggregation methods for

infinite horizon dynamic programming. IEEE Transactions on Automatic Control,

34(6):589–598, 1989.

[19] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena

Scientific, Belmont, MA, 1996.

[20] Emile Borel. The theory of play and integral equations with skew symmetric kernels.

Econometrica. Journal of the Econometric Society, 21:97–100, 1953.

[21] Michael Bowling. Multiagent learning in the presence of agents with limitations.

PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh,

PA 15213, 2003. Also published as Technical Report CMU-CS-03-118.

[22] Michael Bowling and Manuela M. Veloso. Existence of multiagent equilibria with

limited agents. Technical Report CMU-CS-02-104, Carnegie Mellon University, Pitts-

burgh, PA, 2002.

[23] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning:

Safely approximating the value function. In G. Tesauro, D. S. Touretzky, and T. K.

Leen, editors, Advances in Neural Information Processing Systems (NIPS) 7, pages

369–376, Cambridge, MA, 1995. The MIT Press.

[24] Glen E. Bredon. Introduction to Compact Transformation Groups. Academic Press,

New York and London, 1972.

[25] Michael Brin and Garrett Stuck. Introduction to Dynamical Systems. Cambridge

University Press, 2002.

[26] David N. Burghes and Alexander Graham. Introduction to Control Theory, Including

Optimal Control. Ellis Horwood Ltd., Chichester, 1980. Ellis Horwood Series in

Mathematics and its Applications.

[27] Martin V. Butz, David E. Goldberg, and Wolfgang Stolzmann. The anticipatory

classifier system and genetic generalization. Natural Computing, 1(4):427–467,

2002.

[28] Mary L. Cartwright and John E. Littlewood. On non-linear differential equations

of the second order. Journal of the London Mathematical Society. Second Series,

20:180–189, 1945.

[29] Anthony R. Cassandra, Leslie P. Kaelbling, and Michael L. Littman. Acting opti-

mally in partially observable stochastic domains. In Proceedings of the 12th National

Conference on Artificial Intelligence, volume 2, pages 1023–1028, Seattle, WA, 1994.

AAAI Press.

[30] Arthur Cayley. Adding temporary memory to ZCS. The Educational Times, 23(18),

1875.

[31] David Chapman and Leslie P. Kaelbling. Input generalization in delayed reinforcement

learning: An algorithm and performance comparisons. In J. Mylopoulos and

R. Reiter, editors, Proceedings of the 12th International Joint Conference on Arti-

ficial Intelligence (IJCAI), Sydney, Australia, pages 726–731, San Francisco, CA,

1991. Morgan Kaufmann.

[32] Bruno Codenotti and Daniel Stefankovic. On the computational complexity of Nash

equilibria for (0,1) bimatrix games. Information Processing Letters (IPL), 94(3):145–

150, 2005.

[33] Anne Condon. The complexity of stochastic games. Information and Computation,

96(2):203–224, 1992.

[34] Vincent Conitzer and Tuomas Sandholm. Complexity results about Nash equilibria.

In Georg Gottlob and Toby Walsh, editors, Proceedings of the 18th International

Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, pages 765–

771. Morgan Kaufmann, 2003.

[35] Rémi Coulom. Reinforcement Learning Using Neural Networks, with Applications to

Motor Control. PhD thesis, Institut National Polytechnique de Grenoble, 2002.

[36] Michael G. Crandall. Viscosity solutions: A primer. In Viscosity solutions and

applications, volume 1660 of Lecture Notes in Mathematics, pages 1–43. Springer,

Berlin, 1997.

[37] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bul-

letin of the American Mathematical Society (BAMS), 39, 2002.

[38] Shu-Lin Cui, Ji-Gui Sun, Ming-Hao Yin, and Shuai Lu. Solving uncertain Markov

decision problems: An interval-based method. In L. Jiao, L. Wang, X. Gao, J. Liu,

and F. Wu, editors, Proceedings of the 2nd International Conference on Advances in

Natural Computation (ICNC), Xi’an, China (Part II), volume 4222 of Lecture Notes

in Computer Science, pages 948–957. Springer, 2006.

[39] George B. Dantzig and Mukund N. Thapa. Linear Programming 2. Springer Series

in Operations Research. Springer-Verlag, New York, 2003.

[40] Konstantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The

complexity of computing a Nash equilibrium. Electronic Colloquium on Computa-

tional Complexity (ECCC), (115), 2005.

[41] Ruchira S. Datta. Universality of Nash equilibria. Mathematics of Operations Re-

search (MOR), 28(3):424–432, 2003.

[42] Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In S. J. Hanson,

J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing

Systems (NIPS) 5, pages 271–278, San Mateo, CA, 1993. Morgan Kaufmann.

[43] Thomas Dean, Leslie P. Kaelbling, Jak Kirman, and Ann Nicholson. Planning with

deadlines in stochastic domains. In Proceedings of the 11th National Conference on

Artificial Intelligence (AAAI), pages 574–579, Washington, DC, 1993. AAAI Press.

[44] Richard Dearden. Structured prioritized sweeping. In Proceedings of the 18th Inter-

national Conference on Machine Learning (ICML), pages 82–89, San Francisco, CA,

2001. Morgan Kaufmann.

[45] Michael Dellnitz, Gary Froyland, and Oliver Junge. The algorithms behind Gaio –

set oriented numerical methods for dynamical systems. In Ergodic Theory, Analysis,

and Efficient Simulation of Dynamical Systems, pages 145–174. Springer, 2001.

[46] Michael Dellnitz, Mirko Hessel-von Molo, Philipp Metzner, Robert Preis, and

Christof Schütte. Graph algorithms for dynamical systems. In A. Mielke, editor,

Analysis, modeling and simulation of multiscale problems, pages 619–645. Springer,

Berlin, 2006.

[47] Michael Dellnitz and Andreas Hohmann. A subdivision algorithm for the computa-

tion of unstable manifolds and global attractors. Numerische Mathematik, 75(3):293–

317, 1997.

[48] Michael Dellnitz, Andreas Hohmann, Oliver Junge, and Martin Rumpf. Exploring

invariant sets and invariant measures. Chaos. An Interdisciplinary Journal of Non-

linear Science, 7(2):221–228, 1997.

[49] Michael Dellnitz and Oliver Junge. On the approximation of complicated dynamical

behavior. SIAM Journal on Numerical Analysis, 36(2):491–515 (electronic), 1999.

[50] Michael Dellnitz and Oliver Junge. Set oriented numerical methods for dynamical

systems. In Handbook of dynamical systems, Vol. 2, pages 221–264. North-Holland,

Amsterdam, 2002.

[51] Kan Deng and Andrew W. Moore. Multiresolution instance-based learning. In

Chris S. Mellish, editor, Proceedings of the 14th International Joint Conference on

Artificial Intelligence (IJCAI), pages 1233–1242, San Mateo, 1995. Morgan Kauf-

mann.

[52] Ralf Diekmann, Burkhard Monien, and Robert Preis. Using helpful sets to improve

graph bisections. In D. Hsu, A. Rosenberg, and D. Sotteau, editors, Interconnec-

tion Networks and Mapping and Scheduling Parallel Computations, volume 21 of

DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages

57–73. American Mathematical Society, 1995.

[53] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value

function decomposition. Journal of Artificial Intelligence Research (JAIR), 13:227–

303, 2000.

[54] Marco Dorigo and Hugues Bersini. A comparison of Q-learning and classifier systems.

In D. Cliff, P. Husbands, J.-A. Meyer, and S. W. Wilson, editors, From Animals to

Animats 3. Proceedings of the 3rd International Conference on Simulation of Adaptive

Behavior (SAB), pages 248–255, Cambridge, MA, 1994. MIT Press.

[55] Marco Dorigo and Marco Colombetti. Robot shaping: Developing autonomous agents

through learning. Artificial Intelligence, 71(2):321–370, 1994.

[56] Kenji Doya. Reinforcement learning in continuous time and space. Neural Compu-

tation, 12(1):219–245, 2000.

[57] Jan Drugowitsch and Alwyn M. Barry. A formal framework and extensions for

function approximation in learning classifier systems. Technical Report CSBU-2006-

02, University of Bath, 2006.

[58] Chris Drummond. Accelerating reinforcement learning by composing solutions of

automatically identified subtasks. Journal of Artificial Intelligence Research (JAIR),

16:59–104, 2002.

[59] Artur Dubrawski and Jeff G. Schneider. Memory based stochastic optimization for

validation and tuning of function approximators. In Proceedings of the 6th Interna-

tional Workshop on AI and Statistics, Florida, USA, 1997.

[60] David S. Dummit and Richard M. Foote. Abstract algebra. Prentice Hall Inc., En-

glewood Cliffs, NJ, 1991.

[61] Aryeh Dvoretzky, J. Kiefer, and Jacob Wolfowitz. The inventory problem. I. Case of

known distributions of demand. Econometrica, 20:187–222, 1952.

[62] Scott E. Fahlman. An empirical study of learning speed in back-propagation net-

works. Technical Report CMU-CS-88-162, Carnegie-Mellon University, Pittsburgh,

PA, 1988.

[63] Maurizio Falcone. Numerical methods for differential games based on partial dif-

ferential equations. Unpublished. Based on lectures given at the summer school on

Differential Games and Applications, 2005.

[64] Jacques Ferber. Multi-Agent Systems – An Introduction to Distributed Artificial

Intelligence. Addison Wesley, 1999.

[65] Jerzy A. Filar, T. A. Schultz, Frank Thuijsman, and Koos (O. J.) Vrieze. Nonlinear

programming and stationary equilibria in stochastic games. Mathematical Program-

ming, 50(2):227–237, 1991.

[66] Jerzy A. Filar and Koos (O. J.) Vrieze. Competitive Markov Decision Processes.

Springer-Verlag, New York, 1997.

[67] David Foster and Peter Dayan. Structure in the space of value functions. Machine

Learning, 49(2–3):325–346, 2002.

[68] Bas van Fraassen. Laws and Symmetry. Oxford University Press, Oxford, 1989.

[69] Thomas Gabel, Roland Hafner, Sascha Lange, Martin Lauer, and Martin Riedmiller.

Bridging the gap: Learning in the RoboCup simulation and midsize league. In

Proceedings of the 7th Portuguese Conference on Automatic Control (Controlo), Lisbon,

Portugal, 2006.

[70] Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model

minimization in Markov decision processes. Artificial Intelligence, 147(1–2):163–223,

2003.

[71] David E. Goldberg. Genetic Algorithms in Search, Optimization & Machine

Learning. Addison-Wesley, Reading, MA, 1989.

[72] Geoffrey J. Gordon. Stable function approximation in dynamic programming. In

A. Prieditis and S. Russell, editors, Proceedings of the 12th International Conference

on Machine Learning (ICML), pages 261–268, San Francisco, CA, 1995. Morgan

Kaufmann.

[73] John Guckenheimer and Philip Holmes. Nonlinear Oscillations, Dynamical Systems,

and Bifurcations of Vector Fields. Springer-Verlag, New York, 1990.

[74] Carlos Guestrin, Milos Hauskrecht, and Branislav Kveton. Solving factored MDPs

with continuous and discrete variables. In D. M. Chickering and J. Y. Halpern,

editors, Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence

(UAI), Banff, Canada, pages 235–242, 2004.

[75] Carlos Guestrin, Daphne Koller, and Ronald Parr. Max-norm projections for factored

MDPs. In B. Nebel, editor, Proceedings of the 17th International Joint Conference on

Artificial Intelligence (IJCAI-01), Seattle, pages 673–682, San Francisco, CA, 2001.

Morgan Kaufmann.

[76] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored

MDPs. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in

Neural Information Processing Systems (NIPS) 14, Vancouver, Canada, pages 1523–

1530, Cambridge, MA, 2001. MIT Press.

[77] Carlos Guestrin, Daphne Koller, and Ronald Parr. Solving factored POMDPs with

linear value functions. In Workshop on Planning under Uncertainty and Incomplete

Information (IJCAI), Seattle, Washington, pages 67–75, 2001.

[78] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient

solution algorithms for factored MDPs. Journal of Artificial Intelligence Research

(JAIR), 19:399–468, 2003.

[79] Carlos Guestrin, Michail G. Lagoudakis, and Ronald Parr. Coordinated reinforce-

ment learning. In C. Sammut and A. G. Hoffmann, editors, Proceedings of the 19th

International Conference on Machine Learning (ICML), Sydney, Australia, pages

227–234, San Francisco, CA, 2002. Morgan Kaufmann.

[80] Vijaykumar Gullapalli. Reinforcement Learning and its Application to Control. PhD

thesis, University of Massachusetts at Amherst, 1992.

[81] Milos Hauskrecht, Nicolas Meuleau, Leslie P. Kaelbling, Thomas Dean, and Craig

Boutilier. Hierarchical solution of Markov decision processes using macro-actions. In

G. F. Cooper and S. Moral, editors, Proceedings of the 14th Conference on Uncer-

tainty in Artificial Intelligence (UAI), pages 220–229, San Francisco, 1998. Morgan

Kaufmann.

[82] Simon S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,

New Jersey, USA, 1999.

[83] Chao He, Li-Xin Xu, and Yu-He Zhang. Learning convergence of CMAC algorithm.

Neural Processing Letters, 14(1):61–74, 2001.

[84] Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press,

Cambridge, MA, 1960.

[85] Fern Y. Hunt. A Monte Carlo approach to the approximation of invariant measures.

Random & Computational Dynamics, 2(1):111–133, 1994.

[86] Rufus P. Isaacs. Differential Games: A Mathematical Theory with Applications to

Warfare and Pursuit, Control and Optimization. John Wiley, Toronto, 1965.

[87] Tommi Jaakkola, Satinder P. Singh, and Michael I. Jordan. Reinforcement learn-

ing algorithm for partially observable Markov decision problems. In G. Tesauro,

D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Sys-

tems (NIPS 7), pages 345–352, Cambridge, MA, 1995. The MIT Press.

[88] Leslie P. Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement

learning: A survey. Journal of Artificial Intelligence Research (JAIR), 4:237–285,

1996.

[89] Spiros Kapetanakis and Daniel Kudenko. Reinforcement learning of coordination in

cooperative multi-agent systems. In AAAI/IAAI 2002, pages 326–331, 2002.

[90] Spiros Kapetanakis, Daniel Kudenko, and Malcolm J. A. Strens. Learning of coordi-

nation in cooperative multi-agent systems using commitment sequences. In Artificial

Intelligence and the Simulation of Behavior 1(5), 2004.

[91] George Karypis and Vipin Kumar. METIS Manual, Version 4.0. University of

Minnesota, 1998.

[92] Brian W. Kernighan and Shen Lin. An efficient heuristic procedure for partitioning

graphs. Bell System Technical Journal, 49(2):291–307, 1970.

[93] Teuvo Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 1995.

[94] Daphne Koller and Ronald Parr. Computing factored value functions for policies in

structured MDPs. In T. Dean, editor, Proceedings of the 16th International Joint

Conference on Artificial Intelligence (IJCAI), pages 1332–1339, San Francisco, CA,

1999. Morgan Kaufmann.

[95] Panganamala R. Kumar and Pravin P. Varaiya. Stochastic Systems: Estimation,

Identification, and Adaptive Control. Prentice Hall, Englewood Cliffs, NJ, 1986.

[96] Rainer Lachner, Michael H. Breitner, and Hans J. Pesch. Real-time collision avoid-

ance against wrong drivers: Differential game approach, numerical solution and syn-

thesis of strategies with neural networks. In Proceedings of the 7th International

Symposium on Dynamic Games and Applications, Kanagawa, Japan, 1996.

[97] Michail G. Lagoudakis and Ronald Parr. Value function approximation in zero-sum

Markov games. In A. Darwiche and N. Friedman, editors, Proceedings of the 18th

Conference in Uncertainty in Artificial Intelligence (UAI), University of Alberta, Ed-

monton, Alberta, Canada, pages 283–292, San Francisco, CA, 2002. Morgan Kauf-

mann.

[98] Michail G. Lagoudakis and Ronald Parr. Learning in zero-sum team Markov games

using factored value functions. In S. Becker, S. B. Thrun, and K. Obermayer, editors,

Advances in Neural Information Processing Systems (NIPS) 15, pages 1627–1634.

MIT Press, Cambridge, MA, 2003.

[99] Michail G. Lagoudakis, Ronald Parr, and Michael L. Littman. Least-squares methods

in reinforcement learning for control. In I. P. Vlahavas and C. D. Spyropoulos,

editors, Proceedings of the 2nd Hellenic Conference on AI (SETN). Thessaloniki,

Greece, volume 2308 of Lecture Notes in Computer Science, pages 249–260. Springer,

2002.

[100] Pier L. Lanzi. Learning classifier systems from a reinforcement learning perspective.

Soft Computing, 6(3–4):162–170, 2002.

[101] Tim Laue and Thomas Röfer. Integrating simple unreliable perceptions for accurate

robot modeling in the four-legged league. In G. Lakemeyer, E. Sklar, D. G. Sorrenti,

and T. Takahashi, editors, RoboCup, volume 4434 of Lecture Notes in Computer

Science, pages 474–482. Springer, 2006.

[102] Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement

learning in cooperative multi-agent systems. In Proceedings of the 17th International

Conference on Machine Learning (ICML), pages 535–542, San Francisco, CA, 2000.

Morgan Kaufmann.

[103] Martin Lauer and Martin Riedmiller. Reinforcement learning for stochastic cooper-

ative multi-agent systems. In AAMAS, pages 1516–1517. IEEE Computer Society,

2004.

[104] Steven M. LaValle. Robot motion planning: A game-theoretic foundation. Algorith-

mica, 26(3-4):430–465, 2000.

[105] Chuen-Chien Lee. A self learning rule-based controller employing approximate rea-

soning and neural net concepts. International Journal of Intelligent Systems, 6(1):71–

93, 1991.

[106] Carlton E. Lemke and Joseph T. Howson, Jr. Equilibrium points of bimatrix games.

Journal of the Society for Industrial and Applied Mathematics (SIAM), 12(2):413–

423, 1964.

[107] Joseph Lewin. Differential Games. Springer-Verlag London Ltd., London, 1994.

[108] Long-Ji Lin. Programming robots using reinforcement learning and teaching. In

T. L. Dean and K. McKeown, editors, Proceedings of the 9th National Conference on

Artificial Intelligence, pages 781–786. MIT Press, 1991.

[109] Long-Ji Lin. Hierarchical learning of robot skills by reinforcement. In Proceedings of

the International Conference on Neural Networks (ICNN), volume 1, pages 181–186,

San Francisco, CA, 1993. IEEE/INNS.

[110] Long-Ji Lin and Tom M. Mitchell. Memory approaches to reinforcement learning

in non-Markovian domains. Technical Report CMU-CS-92-138, Carnegie Mellon

University, Pittsburgh, PA, 1992.

[111] Ya-Ping Lin and Xue-Yong Li. Reinforcement learning based on local state feature

learning and policy adjustment. Information Sciences (ISCI), 154(1–2):59–70, 2003.

[112] Michael L. Littman. Markov games as a framework for multi-agent reinforcement

learning. In Proceedings of the 11th International Conference on Machine Learning

(ICML), pages 157–163, San Francisco, CA, 1994. Morgan Kaufmann.

[113] Michael L. Littman. Memoryless policies: Theoretical limitations and practical re-

sults. In D. Cliff, P. Husbands, J.-A. Meyer, and S. W. Wilson, editors, From An-

imals to Animats 3: Proceedings of the 3rd International Conference on Simulation

of Adaptive Behavior (SAB), Cambridge, MA, 1994. MIT Press.

[114] Michael L. Littman, Anthony R. Cassandra, and Leslie P. Kaelbling. Learning poli-

cies for partially observable environments: Scaling up. In A. Prieditis and S. Rus-

sell, editors, Proceedings of the 12th International Conference on Machine Learning

(ICML), pages 362–370, San Francisco, CA, 1995. Morgan Kaufmann Publishers.

[115] Michael L. Littman, Thomas L. Dean, and Leslie P. Kaelbling. On the complexity of

solving Markov decision problems. In P. Besnard and S. Hanks, editors, Proceedings

of the 11th Conference on Uncertainty in Artificial Intelligence (UAI), pages 394–402,

San Francisco, CA, 1995. Morgan Kaufmann.

[116] Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning

model: Convergence and applications. In L. Saitta, editor, Proceedings of the 13th

International Conference on Machine Learning (ICML), Bari, Italy, pages 310–318.

Morgan Kaufmann, 1996.

[117] Edward N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences,

20:130–141, 1963.

[118] Ulf Lorenz and Burkhard Monien. Error analysis in minimax trees. Theoretical

Computer Science (TCS), 313(3):485–498, 2004. Algorithmic combinatorial game

theory.

[119] William S. Lovejoy. A survey of algorithmic methods for partially observable Markov

decision processes. Annals of Operations Research, 28(1):47–65, 1991.

[120] R. Duncan Luce and Howard Raiffa. Games and Decisions: Introduction and Critical

Survey. John Wiley, New York, 1957. A study of the Behavioral Models Project,

Bureau of Applied Social Research, Columbia University.

[121] David J. C. MacKay. Bayesian model comparison and backprop nets. In J. E. Moody,

S. J. Hanson, and R. Lippmann, editors, Advances in Neural Information Processing

Systems (NIPS) 4, pages 839–846. Morgan Kaufmann, 1992.

[122] David J. C. MacKay. Bayesian non-linear modelling for the prediction competition.

In ASHRAE Transactions, volume 100, pages 1053–1062, Atlanta, Georgia, 1994.

ASHRAE.

[123] Omid Madani. On policy iteration as a Newton’s method and polynomial policy

iteration algorithms. In Proceedings of the 18th National Conference on Artificial

Intelligence and Fourteenth Conference on Innovative Applications of Artificial

Intelligence (AAAI/IAAI), pages 273–278, Menlo Park, CA, 2002. AAAI Press.

[124] Pattie Maes and Rodney A. Brooks. Learning to coordinate behaviors. In T. G.

Dietterich and W. Swartout, editors, Proceedings of the 8th National Conference on

Artificial Intelligence (AAAI), pages 796–802, Boston, MA, 1990. MIT Press.

[125] Sridhar Mahadevan. To discount or not to discount in reinforcement learning: A case

study comparing R learning and Q learning. In Proceedings of the 11th International

Conference on Machine Learning (ICML), pages 164–172, San Francisco, CA, 1994.

Morgan Kaufmann.

[126] Sridhar Mahadevan. Proto-value functions: Developmental reinforcement learning.

In Luc De Raedt and Stefan Wrobel, editors, Proceedings of the 22nd International

Conference on Machine Learning (ICML), Bonn, Germany, pages 553–560. ACM,

2005.

[127] Sridhar Mahadevan and Jonathan Connell. Scaling reinforcement learning to robotics

by exploiting the subsumption architecture. In Proceedings of the 8th International

Workshop on Machine Learning (ICML), pages 328–332, 1991.

[128] Sridhar Mahadevan, Mauro Maggioni, Kimberly Ferguson, and Sarah Osentoski.

Learning representation and control in continuous Markov decision processes. In

AAAI. Boston, 2006.

[129] Olvi L. Mangasarian and H. Stone. Two-person nonzero-sum games and quadratic

programming. Journal of Mathematical Analysis and Applications, 9:348–355, 1964.

[130] Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in

reinforcement learning via clustering. In C. E. Brodley, editor, Proceedings of the

21st International Conference on Machine Learning (ICML), Banff, Canada. ACM,

2004.

[131] Oded Maron and Andrew W. Moore. Hoeffding races: Accelerating model selection

search for classification and function approximation. In Jack D. Cowan, Gerald

Tesauro, and Joshua Alspector, editors, Advances in Neural Information Processing

Systems (NIPS) 6, pages 59–66, San Francisco, CA, 1994. Morgan Kaufmann.

[132] Oded Maron and Andrew W. Moore. The racing algorithm: Model selection for lazy

learners. Artificial Intelligence Review, 11(1–5):193–225, 1997.

[133] Maja J. Mataric. Reward functions for accelerated learning. In Proceedings of the

11th International Conference on Machine Learning (ICML), pages 181–189, 1994.

[134] R. Andrew K. McCallum. Instance-based utile distinctions for reinforcement learning

with hidden state. In Proceedings of the 12th International Conference on Machine

Learning (ICML), pages 387–395, San Francisco, CA, 1995. Morgan Kaufmann.

[135] Richard D. McKelvey and Andrew McLennan. Computation of equilibria in finite

games. In Handbook of Computational Economics, Vol. I, volume 13 of Handbooks

in Economics, pages 87–142. North-Holland, Amsterdam, 1996.

[136] John C. C. McKinsey. Introduction to the Theory of Games. McGraw-Hill Book

Company, Inc., New York-Toronto-London, 1952.

[137] Lisa Meeden, Gary McGraw, and Douglas Blank. Emergent control and planning

in an autonomous vehicle. In D. S. Touretzky, editor, Proceedings of the 15th Annual

Meeting of the Cognitive Science Society, pages 735–740. Lawrence Erlbaum,

Hillsdale, NJ, 1993.

[138] José del R. Millán. Rapid, safe, and incremental learning of navigation strategies.

In IEEE Transactions on Systems, Man and Cybernetics (Part B), volume 26, pages

408–420, 1996.

[139] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[140] George E. Monahan. A survey of partially observable Markov decision processes:

Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.

[141] Burkhard Monien, Robert Preis, and Ralf Diekmann. Quality matching and local

improvement for multilevel graph-partitioning. Parallel Computing, 26(12):1609–

1634, 2000.

[142] Andrew W. Moore. Variable resolution dynamic programming: Efficiently learn-

ing action maps in multivariate real-valued state-spaces. In L. Birnbaum and

G. Collins, editors, Proceedings of the 8th International Conference on Machine

Learning (ICML), pages 333–337, San Francisco, CA, 1991. Morgan Kaufmann.

[143] Andrew W. Moore. The parti-game algorithm for variable resolution reinforcement

learning in multidimensional state-spaces. In J. D. Cowan, G. Tesauro, and J. Al-

spector, editors, Advances in Neural Information Processing Systems (NIPS) 6, pages

711–718, San Mateo, CA, 1994. Morgan Kaufmann.

[144] Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement

learning with less data and less time. Machine Learning, 13:103–130, 1993.

[145] Andrew W. Moore, Christopher G. Atkeson, and Stefan Schaal. Memory-based learn-

ing for control. Technical Report CMU-RI-TR-95-18, Carnegie Mellon University,

Pittsburgh, PA, 1995.

[146] Andrew W. Moore, Daniel J. Hill, and Michael P. Johnson. An empirical investi-

gation of brute force to choose features, smoothers and function approximators. In

S. Hanson, S. Judd, and T. Petsche, editors, Computational Learning Theory and

Natural Learning, volume III: Selecting Good Models, pages 361–379. MIT Press,

1995.

[147] Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation,

17(2):335–359, 2005.

[148] Rémi Munos. A study of reinforcement learning in the continuous case by the means

of viscosity solutions. Machine Learning, 40(3):265–299, 2000.

[149] Rémi Munos. Error bounds for approximate policy iteration. In T. Fawcett and

N. Mishra, editors, Proceedings of the 20th International Conference on Machine

Learning (ICML), Washington, DC, USA, pages 560–567. AAAI Press, 2003.

[150] Rémi Munos. Policy gradient in continuous time. Journal of Machine Learning

Research (JMLR), 7:771–791, 2006.

[151] Rémi Munos. Performance bounds in Lp-norm for approximate value iteration. SIAM

Journal on Control and Optimization, 2007.

[152] Rémi Munos, Leemon C. Baird, and Andrew W. Moore. Gradient descent approaches

to neural-net-based solutions of the Hamilton-Jacobi-Bellman equation. In Interna-

tional Joint Conference on Neural Networks (IJCNN), volume 3, pages 2152–2157,

1999.

[153] Ali H. Nayfeh and Balakumar Balachandran. Applied Nonlinear Dynamics. Wiley

Series in Nonlinear Science. John Wiley, New York, 1995.

[154] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen,

100:295–320, 1928.

[155] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Be-

havior. Princeton University Press, Princeton, NJ, 2nd edition, 1947.

[156] Ralph Neuneier and Hans G. Zimmermann. How to train neural networks. In G. B.

Orr and K. R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524

of Lecture Notes in Computer Science, pages 373–423. Springer, 1996.

[157] Partha Niyogi and Federico Girosi. Generalization bounds for function approximation

from scattered noisy data. Advances in Computational Mathematics, 10:51–80, 1999.

[158] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. MIT Press,

Cambridge, MA, 1994.

[159] Guillermo Owen. Game Theory. Academic Press Inc., San Diego, CA, 3rd edition,

1995.

[160] Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art.

Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005.

[161] Ronald E. Parr. Hierarchical Control and Learning for Markov Decision Processes.

PhD thesis, University of California, Berkeley, 1998.

[162] Relu Patrascu, Pascal Poupart, Dale Schuurmans, Craig Boutilier, and Carlos

Guestrin. Greedy linear value-approximation for factored Markov decision processes.

In Proceedings of the 18th National Conference on Artificial Intelligence and 14th

Conference on Innovative Applications of Artificial Intelligence (AAAI/IAAI), pages

285–291, Menlo Park, CA, 2002. AAAI Press.

[163] Jing Peng and Ronald J. Williams. Efficient learning and planning within the Dyna

framework. Adaptive Behavior, 1(4):437–454, 1993.

[164] Jing Peng and Ronald J. Williams. Incremental multi-step Q-learning. In Proceedings

of the 11th International Conference on Machine Learning (ICML), pages 226–232,

San Francisco, CA, 1994. Morgan Kaufmann.

[165] Hans J. Pesch, I. Gabler, Stefan Miesbach, and Michael H. Breitner. Synthesis of

optimal strategies for differential games by neural networks. In G. J. Olsder, editor,

New Trends in Dynamic Games and Applications, Annals of the International Society

of Dynamic Games 3, pages 111–142, Boston, 1996. Birkhäuser.

[166] Robert Preis. The PARTY Graphpartitioning-Library, User Manual – Version 1.99.

University of Paderborn, 1998.

[167] Bob Price and Craig Boutilier. Accelerating reinforcement learning through implicit

imitation. Journal of Artificial Intelligence Research (JAIR), 19:569–629, 2003.

[168] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Pro-

gramming. John Wiley, New York, 1994.

[169] Jette Randløv. Solving Complex Problems with Reinforcement Learning. PhD thesis,

University of Copenhagen, 2001.

[170] Balaraman Ravindran and Andrew G. Barto. Symmetries and model minimization

in Markov decision processes. Technical Report CMPSCI 01-43, University of Mas-

sachusetts, Amherst, MA, 2001.

[171] Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical

reinforcement learning. In S. Koenig and R. C. Holte, editors, 5th International

Symposium on Abstraction, Reformulation and Approximation (SARA), Kananaskis,

Canada, volume 2371 of Lecture Notes in Computer Science, pages 196–211. Springer,

2002.

[172] Mark B. Ring. Continual Learning in Reinforcement Environments. PhD thesis,

University of Texas at Austin, 1994.

[173] Nicholas Roy. Finding Approximate POMDP Solutions through Belief Compression.

PhD thesis, Carnegie Mellon University, 2003. Also published as Technical Report

CMU-RI-TR-03-25.

[174] Nicholas Roy, Geoffrey J. Gordon, and Sebastian B. Thrun. Finding approximate

POMDP solutions through belief compression. Journal of Artificial Intelligence Re-

search (JAIR), 23:1–40, 2005.

[175] Ulrich Rüde. Mathematical and Computational Techniques for Multilevel Adaptive

Methods. SIAM, Philadelphia, PA, 1993.

[176] David E. Rumelhart and James L. McClelland. Parallel Distributed Processing:

Explorations in the Microstructure of Cognition, volume 1–2. MIT Press, Cambridge,

MA, 1986.

[177] Gavin A. Rummery. Problem Solving with Reinforcement Learning. PhD thesis,

University of Cambridge, 1995.

[178] Rafal P. Salustowicz, Marco A. Wiering, and Jürgen Schmidhuber. Learning team

strategies: Soccer case studies. Machine Learning, 33:263–282, 1998.

[179] Arthur L. Samuel. Some studies in machine learning using the game of checkers.

IBM Journal of Research and Development, 3:211–229, 1959. Reprinted in E. A.

Feigenbaum and J. Feldman, editors, Computers and Thought, McGraw-Hill, NY

1963.

[180] Arthur L. Samuel. Some studies in machine learning using the game of checkers II –

Recent progress. IBM Journal of Research and Development, 11(6):601–617, 1967.

[181] Uday Savagaonkar, Edwin K. P. Chong, and Robert L. Givan. Sampling techniques

for zero-sum, discounted Markov games. In Leslie P. Kaelbling, editor, Allerton

Conference on Control and Communications, 2002.

[182] Jürgen Schmidhuber. Reinforcement learning in Markovian and non-Markovian en-

vironments. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances

in Neural Information Processing Systems (NIPS) 3, pages 500–506. Morgan Kauf-

mann, 1991.

[183] Jürgen Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity,

music, and the fine arts. In Connection Science, volume 18, pages 173–187, 2006.

[184] Alexander Schrijver. Theory of Linear and Integer Programming. Wiley-Interscience

Series in Discrete Mathematics. John Wiley, 1986.

[185] Anton Schwartz. A reinforcement learning method for maximizing undiscounted

rewards. In Proceedings of the 10th International Conference on Machine Learning

(ICML), San Mateo, CA, 1993. Morgan Kaufmann.

[186] Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences

of the U. S. A., 39:1095–1100, 1953.

[187] John W. Sheppard. Co-learning in differential games. Machine Learning,

33(2-3):201–233, 1998.

[188] Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the

answer, what is the question? In R. Vohra and M. Wellman, editors, Artificial

Intelligence (special issue on foundations of multi-agent learning), volume 171, pages

365–377, 2007.

[189] Sajid M. Siddiqi and Andrew W. Moore. Fast inference and learning in

large-state-space HMMs. In L. de Raedt and S. Wrobel, editors, Proceedings of the 22nd

International Conference on Machine Learning (ICML), Bonn, Germany, pages 800–807.

ACM, 2005.

[190] Satinder P. Singh. Reinforcement learning with a hierarchy of abstract models. In

W. R. Swartout, editor, Proceedings of the 10th National Conference on Artificial

Intelligence (AAAI), San Jose, CA, pages 202–207. MIT Press, 1992.

[191] Margaret M. Skelly. Hierarchical Reinforcement Learning with Function

Approximation for Adaptive Control. PhD thesis, Case Western Reserve University, OhioLINK,

2004.

[192] Stephen Smale. Differentiable dynamical systems. Bulletin of the American

Mathematical Society (BAMS), 73:747–817, 1967.

[193] Andrew J. Smith. Applications of the self-organising map to reinforcement learning.

Neural Networks, 15(8-9):1107–1124, 2002.

[194] Bernhard von Stengel. Computing equilibria for two-person games. In R. J.

Aumann and S. Hart, editors, Handbook of Game Theory with Economic Applications,

volume 3, chapter 45. North-Holland, Amsterdam, 2002.

[195] Robert F. Stengel. Stochastic Optimal Control. A Wiley-Interscience Publication.

John Wiley, New York, 1986.

[196] Eduard L. Stiefel. Note on Jordan elimination, linear programming and Chebyshev

approximation. Numerische Mathematik, 2:1–17, 1960.

[197] Peter Stone. Layered Learning in Multi-Agent Systems. PhD thesis, Carnegie Mellon

University, 1998.

[198] Peter Stone and Richard S. Sutton. Scaling reinforcement learning toward RoboCup

soccer. In C. E. Brodley and A. P. Danyluk, editors, Proceedings of the 18th

International Conference on Machine Learning (ICML), Williams College, Williamstown,

MA, pages 537–544, San Francisco, CA, 2001. Morgan Kaufmann.

[199] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based

on approximating dynamic programming. In Proceedings of the 7th International

Conference on Machine Learning (ICML), pages 216–224, Austin, TX, 1990. Morgan

Kaufmann.

[200] Richard S. Sutton. Planning by incremental dynamic programming. In Proceedings

of the 8th International Workshop on Machine Learning (ICML), pages 353–357.

Morgan Kaufmann, 1991.

[201] Richard S. Sutton. Generalization in reinforcement learning: Successful examples

using sparse coarse coding. In D. S. Touretzky, M. Mozer, and M. E. Hasselmo,

editors, Advances in Neural Information Processing Systems (NIPS) 8, pages

1038–1044, Cambridge, MA, 1996. MIT Press.

[202] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.

MIT Press, Cambridge, MA, 1998.

[203] Csaba Szepesvári and Michael L. Littman. A unified analysis of value-function-based

reinforcement learning algorithms. Neural Computation, 11(8):2017–2060, 1999.

[204] Erik Talvitie and Satinder P. Singh. An experts algorithm for transfer learning.

In M. M. Veloso, editor, Proceedings of the 20th International Joint Conference on

Artificial Intelligence (IJCAI), Hyderabad, India, pages 1065–1070, 2007.

[205] Gerald Tesauro. TD-Gammon, a self-teaching backgammon program, achieves

master-level play. Neural Computation, 6(2):215–219, 1994.

[206] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications

of the ACM, 38(3):58–68, 1995.

[207] Sebastian B. Thrun and Anton Schwartz. Issues in using function approximation for

reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and

A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School,

Hillsdale, NJ, 1993. Lawrence Erlbaum.

[208] Michael J. Todd. The many facets of linear programming. Mathematical

Programming, 91(3):417–436, 2002.

[209] Abraham Wald. Generalization of a theorem by v. Neumann concerning zero sum

two person games. Annals of Mathematics. Second Series, 46:281–286, 1945.

[210] Abraham Wald. Sequential Analysis. John Wiley, New York, 1947.

[211] Chris Walshaw. The JOSTLE user manual: Version 2.2. University of Greenwich,

2000.

[212] Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis,

University of Cambridge, 1989.

[213] Gerhard Weiß. A multiagent framework for planning, reacting, and learning.

Technical Report FKI-233-99, TU München, Germany, 1999.

[214] Jochen Werner. Numerische Mathematik. Vieweg Verlag, Braunschweig, Germany,

1992.

[215] Shimon Whiteson and Peter Stone. Concurrent layered learning. In Second

International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS),

pages 193–200. ACM, 2003.

[216] Chang Zhang and John S. Baras. A new adaptive aggregation algorithm for infinite

horizon dynamic programming. In Proceedings of the 11th Mediterranean Conference

on Control and Automation (MED), Rhodes, Greece, 2003.

[217] Martin Zinkevich and Tucker Balch. Symmetry in Markov decision processes and its

implications for single agent and multiagent learning. In Proceedings of the 18th

International Conference on Machine Learning (ICML), Williams College, Williamstown,

MA, pages 632–639, San Francisco, CA, 2001. Morgan Kaufmann.

[218] Martin Zinkevich, Amy Greenwald, and Michael G. Littman. Cyclic equilibria in

Markov games. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural

Information Processing Systems (NIPS) 18, pages 1641–1648. MIT Press, Cambridge,

MA, 2006.

Index

µ-homomorphism

2P-ZS-MG, 49

action

faithful, 103

mixed, 23

agent, 7

approximation error, 56

averagers, 58

basis functions, 58

Bellman equation, 19, 31

Bellman error, 35

Bellman operator, 20, 31

best reply, 25

best response, 25

bimatrix game, 23, 24

decision epoch, 24

repeated, 24

state action space, 24

state space, 24

transition function, 24

bisimulation, 42

box, 12

collection of -es, 12

chaos, 9

credit assignment problem

structural, 35

temporal, 35

cross validation, 58

differential game, 33

dynamic programming, 20, 31

dynamical system, 9

flow, 10

time-continuous, 9

time-discrete, 9

transition probability, 11

environment, 8

estimation error, 56

Euler’s number, 84

expectation value, 18

feature extraction, 57

features, 55

fictitious play, 28

finite state machines, 42

function approximation, 55

approximation error, 56

architecture, 55

blessing of smoothness, 57

curse of dimension, 57

estimation error, 56

sample error, 56

game, 23

bimatrix, 24

competitive, 4, 23

coordination, 25

differential, 23

discrete, 23

matrix, 24

optimal strategy, 26

single-stage, 23

game matrix

distance, 26

game theory

player, 23

generalisation, 57

over states, 57

graph, 14

edge set, 14

external costs, 14

internal costs, 14

vertex set, 14

graph matching, 15

graph partitioning

congestion, 15

Helpful-Set method, 15

multilevel paradigm, 15

grid soccer

action spaces, 67

kick-off, 71

move, 68, 70

pass, 68, 70

reward function, 71

state space, 67

transition function, 68

group

index of subgroup, 104

transformation, 103

group action, 103

kernel of, 103

group homomorphism, 103

group isomorphism, 103

group orbit, 104

Hamilton-Jacobi-Bellman equation, 62

Hamilton-Jacobi-Bellman-Isaacs equation,

hash function, 108

homomorphism

2P-ZS-MG, 45

group, 103

MDP, 38

imitation, 101

interaction, 7

interval

discrete, 108

invariance ratio, 11

isomorphism

group, 103

kick-off position, 66

Kohonen map, 58

Lemke-Howson algorithm, 28

linear program, 28

feasible set, 28

feasible solution, 28

list of states, 108

Mangasarian-Stone algorithm, 29

Markov chain, 9

Markov decision process, 16

action, 16

automorphism, 42

belief, 34

Bellman equation, 19

decision epoch, 16

decision rule, 19

discount rate, 18

homomorphism, 38

isomorphism, 42

multi-grid methods, 22

optimality principle, 19

partially observable, 34

policy, 19

lifted, 41

pure action, 20

return, 18

average, 18

discounted, 18

finite horizon, 18

state action space, 16

state aggregation, 22

state space, 16

symmetry group of, 42

transition function, 16

value functions, 19

Markov game

µ-homomorphism, 49

action, 30

Bellman equation, 31

decision epoch, 30

general-sum multi-player, 33

homomorphism, 45

optimality principle, 31

policy, 30

lifted, 46

return, 30

state action space, 30

state space, 30

transition function, 30

value function, 30

Markov model

hidden, 34

Markov process, 9

Markov property, 17

matrix

sparse, 108

matrix game, 23, 24

dominating action, 27

lower value, 26

matching pennies, 27

reduction property, 46, 50

repeated, 23, 24

strictly dominating action, 27

upper value, 26

multi-agent systems, 7

Nash equilibrium, 25

non-expansion, 31

optimality principle, 19, 31

orbit, 10

partition, 11

covering property, 11

payoff matrix, 23

perception, 8

perfect memory controller, 34

permutation, 103

Perron-Frobenius operator, 10

discrete, 12

policy

ε-greedy, 21

deterministic, 19

history dependent, 20

Markovian, 19

non-stationary, 20

total, 25

policy search, 57

probability distribution, 16

projection, 38, 44

proto value functions, 58

Q-learning

recurrent, 34

quotient set, 39

racing, 56

reflexes, 101

regret, 34

reinforcement learning, 20, 31

reinforcement signals

local, 101

reward

modified, 106

reward matrix, 23

risk option, 70

exploitation, 21

exploration, 21

off-policy method, 21

on-policy method, 22

sample error, 56

SARSA, 22

security level, 26, 81

self-organising maps, 58

self-play, 98

semi orbit

negative, 10

positive, 10

set

almost invariant, 11

backward invariant, 10

forward invariant, 10

invariant, 10

set of optimal strategies, 26

shaping, 101

soft constraint, 19

squared Euclidean distance, 68

stabiliser, 104

state aggregation, 38

state space

partition of, 11

stochastic approximation

Monte Carlo approach, 13

stochastic games, 8

subdivision algorithm, 12

supervertex, 15

supervised learning, 55

symmetry

MDP, 39

test points, 12

trajectory, 10

transfer operator, 10

discretisation of, 12, 13

transformation, 103

transition function

block, 39, 45, 49

transition graph, 14

edge weights, 14

undirected, 14

vertex weights, 14

transition matrix, 106

two-player zero-sum game

normal form, 23

unsupervised learning, 57

update matrix, 105

utile suffix memory, 34

value function

factored representation, 62

first guess, 75

value iteration, 20, 32, 59

viscosity solutions, 33
