Game Theoretic Approaches
to Motion Planning
in Robot Soccer
Dissertation approved by the Faculty of Electrical Engineering,
Computer Science and Mathematics
of the University of Paderborn
for the academic degree of
Doktor der Naturwissenschaften
(Dr. rer. nat.)
by
Marcus Post
Paderborn, 2008
Referees: Prof. Dr. Oliver Junge
Prof. Dr. Burkhard Monien
Committee: Prof. Dr. Michael Dellnitz (chairman)
Prof. Dr. Peter Bürgisser
Prof. Dr. Oliver Junge
Prof. Dr. Burkhard Monien
Dr. Elke Wolf
Date of PhD Defense: 17.04.2008
You can not learn anything
until you already almost know it.
Unknown
To Berthild, Karl-Friedrich, Sebastian, and Ping
Acknowledgements
I would like to start by thanking my advisors Prof. Dr. Michael Dellnitz and Prof. Dr.
Oliver Junge for their guidance, support, motivation, and for the great freedom which was
given to me. Prof. Dellnitz’ group at the University of Paderborn always provided a very
good research environment for me. I also want to thank Prof. Dr. Burkhard Monien for
helpful comments and for reviewing this PhD thesis.
Moreover, I am very grateful to Prof. Dr. Oliver Junge, Prof. Dr. Michael Dellnitz, Dr. Sina
Ober-Blöbaum, Dr. Kathrin Padberg, Dr. Oliver Schütze, Stefan Sertl, and Bianca Thiere
for interesting discussions and exciting joint work. For fruitful discussions and comments
I would also like to thank Mirko Hessel-von Molo, Oliver Kramer, Prof. Dr. Michael G.
Lagoudakis, Tim Laue, Dr. Martin Lauer, Henning Meyerhenke, and Willi Richert.
In general, I am indebted to my colleagues in Paderborn including Alessandro Dell’ Aere,
Sebastian Hage-Packhäuser, Mirko Hessel-von Molo, Stefan Klus, Dr. Arvind Krishna-
murthy, Anna-Lena Meier, Dr. Sina Ober-Blöbaum, Dr. Kathrin Padberg, Michael Petry,
Dr. Robert Preis, Dr. Oliver Schütze, Stefan Sertl, Bianca Thiere, Dr. Fang Wang, and
Katrin Witting for discussions, technical support, social events and many other things. I
am further indebted to the collegiates of the PaSCo graduate school.
I am grateful to Laurel Frick-Wright for proofreading my thesis to improve the quality of
my English and to Mirko Hessel-von Molo, Anna-Lena Meier, and Stefan Klus for proof-
reading excerpts.
I always received valuable administrative support from the secretaries Marianne Kalle and
Tanja rger, and from Anne Belkner. For enabling my work in a different way, namely
by keeping my office clean, I thank the non-scientific staff of the University of Paderborn.
I would like to thank Alessandro Dell’ Aere for helping me to build a physical soccer envi-
ronment for the AIBO robots and some students for supporting me with the AIBO robots’
technical issues: Johannes Berg, Raphael Golombek, and Nicolai Hähnle are to be men-
tioned here.
Last but not least I am very indebted to my parents Berthild and Karl-Friedrich Post who
supported me not only during my studies but during my whole life in all ways imaginable. I
want to thank my brother Sebastian, Janina, my friends from the University of Paderborn,
especially Ping, the “Mensa-Kreis”, and the musicians I played music with, and, of course,
all other friends of mine.
For the development of great software tools I want to especially thank all LaTeX developers
and the developers of Dia, Kate, and Kile, many of whom work voluntarily, and the
developers of Matlab which is a commercial software tool. All of them made my work
technically possible or at least substantially simpler.
For financial support I am very grateful to the Paderborn Institute for Scientific Compu-
tation (PaSCo)(1) and to the University of Paderborn.
Marcus Post, February 2008
(1)The research is (partly) supported by the DFG Research Training Group GK-693 of the Paderborn
Institute for Scientific Computation (PaSCo).
Abstract
Robotics is, from a scientific point of view, a very broad topic with many applications.
While highly specialised robots have been widely used in production lines, the next big
scientific steps are towards autonomy of robots and interaction with other robots and humans.
Achieving these long-term goals requires combining physical and mental abilities, a task which
is interdisciplinary and scientifically challenging. At its current state, robot soccer is an
appropriate environment for demonstrating and developing
robotic skills as several areas are addressed such as image processing and analysis in the
widest sense (including e. g. object matching and directing the camera), control and opti-
misation of physical movement (walking, ball handling), and the strategic planning which
may be considered as being close to high level reasoning. In this thesis, game theoretic and
reinforcement learning approaches are utilised to contribute to strategic planning in robot
soccer which serves as a motivating example. The aim of strategic planning is to obtain
an optimal strategy which also takes the possibly unknown strategies of other players into
account. A natural further goal of this thesis is the development and analysis of algorithms
by means of which such optimal strategies can be approximately computed.
More specifically, the following steps are undertaken: first, a game theoretic model of
multi-player robot soccer is developed which is independent from the robot hardware. The
resulting challenge of determining an optimal strategy with respect to this model for as
many robots as possible is met by exact model reduction, i. e. by finding equivalent smaller
models. For this, a theoretical framework of symmetries is developed which is based on
homomorphisms between two-player zero-sum Markov games. It lays a formal foundation
for practitioners who already implicitly used results proven within that framework. A
special result which is important for model reduction is that the reduction can be performed
in several separate steps and be combined afterwards which is expressed by a composition
of homomorphisms. Finally, a qualitatively new symmetry which interchanges the two
players of the Markov game, i. e. the two teams in robot soccer, is proven to be part of
the homomorphism framework. Particularly, this means that it can be combined with all
symmetries which occur in Markov decision processes.
The theoretical results about Markov game symmetries are algorithmically exploited for
Dynamic Programming (DP) and Reinforcement Learning (RL) methods which are also
compared. Such comparisons ought to be standard but seem unusual for large parts of the
RL literature. Unsurprisingly, DP methods are more efficient and thus the following general
procedure seems recommendable: firstly, to design an approximate model for the task at
hand and solve this by DP methods to an appropriate level of precision and, secondly, to
use the DP solution of the rough model as an initialisation for an RL method to let the RL
method adapt to the unknown real model. In this spirit, the developed soccer model and
the computation of its optimal solution can be seen as the completion of the first of the
above two steps. Ideas of dynamical systems and graph theory are additionally integrated
to design new efficient DP methods by means of almost invariant sets. All algorithms are
thoroughly studied numerically and the results of optimal strategies are also interpreted
in terms of soccer. Finally, some of the most challenging future tasks to implement these
strategies on real robots are identified.
Key Words
reinforcement learning, robotics, robot soccer, optimal strategy, symmetry, model reduc-
tion, control theory, game theory, graph theory, dynamical system, almost invariant set,
homomorphism
Abstract (translated from the German)
From a scientific point of view, robotics is a very broad field with many applications.
Highly specialised robots used in automated manufacturing, for example, are widespread.
Some of the next milestones in robotics are to be expected in the autonomy of robots and
in the interaction with robots and humans. Reaching these milestones requires a combination
of physical and "mental" abilities which is interdisciplinary and poses scientific
challenges. At the current state of science, robot soccer is a suitable environment for
demonstrating and further developing diverse robot abilities, because it already comprises
a multitude of areas, for example image processing in the widest sense (including object
recognition and tracking), control and optimisation of physical movement (locomotion, ball
skills), and strategic planning, which can also be regarded as a higher cognitive ability.
In this doctoral thesis, approaches from game theory and reinforcement learning are used to
contribute to strategic motion planning in robot soccer, which serves as a motivating
example. The aim of strategy planning is to determine an optimal strategy which also takes
the possibly unknown strategies of other players into account. A further goal of the thesis
is the advancement and analysis of algorithms by means of which optimal strategies are
computed approximately.
To this end, the following steps are undertaken: first, a game theoretic model of
multi-player robot soccer is developed which is as hardware independent as possible. A major
challenge arising here, namely to determine optimal strategies for this model with as large
a number of robots as possible, is met by exact model reduction, i. e. an attempt is made to
find a Markov game which is as small as possible and equivalent to the original model. For
this purpose, a theoretical concept of symmetries is introduced which is based on
homomorphisms between two-player zero-sum games. The notion of symmetry thereby provides a
formal basis for symmetry reductions that had previously been applied implicitly in the
practical solution of Markov games. A useful result for model reduction is that it can be
carried out step by step and combined afterwards, which is formally expressed by the
composition of homomorphisms. Finally, a qualitatively new symmetry which interchanges the
players of a Markov game is integrated into the formalism. In particular, it is shown that
this symmetry, which does not occur in Markov decision processes, can be combined with all
symmetries found there.
The theoretical results on symmetries in Markov games are implemented algorithmically in
methods of Dynamic Programming (DP) and Reinforcement Learning (RL), which are furthermore
compared with each other. Such comparisons should be regarded as standard but seem to be
rather the exception in large parts of the RL literature. As expected, the DP methods are
more efficient, which is why the following general procedure is proposed: first, to
construct an approximate model and solve it by means of DP methods, and then to provide an
RL method with this solution as its initial value. This allows the use both of the more
efficient DP methods, which solve the approximate model with appropriate precision, and of
the RL methods, whose adaptivity to the unknown real model is exploited. In this sense, the
developed robot soccer model and the computation of optimal strategies can be regarded as
the solution of the first part of the general procedure. In addition, ideas from the fields
of dynamical systems and graph theory concerning almost invariant sets are applied, among
others, in the development of new efficient algorithms. Finally, important practical
challenges are identified which have to be solved before the computed optimal strategies
can be transferred to real robots.
Key Words (translated from the German)
reinforcement learning, robotics, robot soccer, optimal strategy, symmetry, model
reduction, control theory, game theory, graph theory, dynamical system, almost invariant
set, homomorphism
Contents
1 Introduction
2 Reinforcement Learning (RL) and Game Theory
2.1 Dynamical Systems and Markov Processes
2.1.1 Basics and Problem Definitions
2.1.2 Numerical Methods
2.1.3 Complexity, Algorithmic Issues and Software
2.2 Markov Decision Processes (MDPs)
2.2.1 Basics and Problem Definitions
2.2.2 Numerical Methods
2.2.3 Complexity, Algorithmic Issues and Software
2.3 Matrix Games
2.3.1 Basics and Problem Definitions
2.3.2 Numerical Methods
2.3.3 Complexity, Algorithmic Issues and Software
2.4 Two Player Zero Sum Markov Games (2P-ZS-MGs)
2.4.1 Basics and Problem Definitions
2.4.2 Numerical Methods
2.4.3 Complexity, Algorithmic Issues and Software
2.5 General Markov Games, Differential Games, and Advanced Concepts of RL
3 Model Reduction and Symmetry
3.1 Homomorphisms and Symmetry in MDPs
3.1.1 Equivalence of MDP Homomorphisms and MDP Symmetries
3.1.2 Symmetries by Group Actions on MDPs
3.2 Homomorphisms and Symmetry in 2P-ZS-MGs
3.2.1 2P-ZS-MG Homomorphisms and Symmetry
3.2.2 Automorphisms for the Exchange of Agents
4 Supervised Learning (SL), Function Approximation, Generalisation
4.1 Introduction
4.1.1 General Approximation Results
4.1.2 Generalisation
4.1.3 Function Approximation with Automated Basis Determination
4.2 Value Iteration with SL: Convergence Result
4.3 Combination of RL and SL: Practical Results from Literature
4.3.1 MDPs
4.3.2 2P-ZS-MGs
5 Robot Soccer and Other Applications
5.1 Modeling Robot Soccer
5.1.1 General Issues of Modeling Robot Soccer
5.1.2 A Simple Multi-Player Robot Soccer Model
5.1.3 Symmetry
5.2 Numerical Results of Grid Soccer
5.2.1 Preliminaries for the Following Subsections
5.2.2 Reasoning for 2P-ZS-MG Modelling: Comparison of MDP and 2P-ZS-MG strategies
5.2.3 Relating Policies to Humanoid Soccer Characteristics
5.2.4 Comparison of DP and RL Techniques
5.2.5 Comparison of Different DP Techniques with Various Parameters
5.2.6 Comparison of Standard Methods and SL Techniques
5.2.7 Towards Multi-Player Robot Soccer: 2v2 Grid Soccer
5.3 A New Algorithm: MaG-Clus-VI
5.4 From Grid Soccer to Robot Soccer: Practical Issues
5.4.1 Lower Level behaviours
5.4.2 Image Processing and Localisation
5.5 Other Applications
6 Conclusion and Outlook
A Basics of Group Homomorphisms and Group Actions
B Bellman Equations and Iterative Linear Solvers
C The Software Package DRPOST
C.1 Introduction
C.2 Technical Aspects of Symmetry Reduction in 2P-ZS-MGs
D Detailed Tables of Numerical Results
D.1 Initial Value Functions $V_0$ and Discount Factors $\gamma$
D.2 Additional Figures and Tables for the Comparative Studies of DP and RL methods
List of Figures
List of Tables
Glossary
Bibliography
Index
Chapter 1
Introduction
One of the most influential technical revolutions of the near future, after the boom of
computer technology and the Internet, can be expected in robotics. While highly specialised
robots have been widely used in production lines, the interaction of robots with other robots
or humans remains a vision. The catenation of physical and mental abilities of robots is
scientifically highly challenging. The physical abilities are often referred to as lower level
abilities such as reflexes and basic motion abilities. However, it is not completely clear how
much the physical abilities are intertwined with cognitive ones. Imagine, for example, that
a human hits an obstacle by a physical movement. A negative reward signal is received in
the form of pain which causes the cognitive section to think how to avoid that pain. This
high level cognitive activity can at first be concentrated on the special situation but then be
generalised to a set of similar situations and at the end possibly result in an improvement
of the basic ability of moving, e. g. if it is generally sensible to move more slowly.
Key aspects of the present work are now informally described and related to the introductory
example above. In this thesis, optimal strategies of several robots, which govern their
interaction, are determined for the example of robot soccer. The highest cognitive level of strategic
planning is addressed. The above example of a human hitting an obstacle already points
to important ingredients of a mathematical description and solution methods: firstly, the
physical movement indicates that a dynamic model of an environment and some acting
agents is needed. Secondly, a kind of optimality criterion has to be employed based on
a local reward signal (pain) which could also be positive, and thirdly, after receiving any
reward an estimate of how to optimally act to maximise a positive reward while avoiding
a negative one has to be improved. A basic principle for algorithms seems to be that
some estimates of how useful a special situation is have to be improved over time. These
estimates are later represented by a value function, and the iterative procedures are referred
to as dynamic programming (DP) and reinforcement learning (RL) methods. Finally, the
ability to abstract former knowledge of how to deal with similar situations which have not
been explicitly experienced appears to be crucial. This question is theoretically addressed
by developing a framework of exact model reduction, i. e. identifying exactly equal situ-
ations, as well as practically by incorporating approximate generalisation methods which
are introduced as supervised learning methods.
The above-mentioned aspects belong to a huge variety of concepts and algorithms which
are collectively called machine learning. In the remainder of this introduction an
overview of different topics is given: goals of this thesis are presented in more detail and
several disciplines of machine learning relevant to this thesis are introduced. The areas of
supervised learning (function approximation) and reinforcement learning (optimal control)
are touched on. In addition, relations to game theory and to the concept of almost invariant
sets established in dynamical systems theory are briefly pointed out. Finally, challenges
of implementing optimal strategies on real soccer playing robots are discussed before the
introduction ends with the contributions of the author.
Goals of the Present Thesis. Since in this thesis optimal strategies for motion planning
in robot soccer are to be obtained the first goal is to design an appropriate soccer model
which is as simple as possible and as special as necessary. The simplicity aims at wide
applicability whereas a certain level of detail is necessary to provide meaningful results.
A more general goal is to provide and analyse efficient algorithms for computing optimal
strategies. Such algorithms are applicable independently from the model as long as it
belongs to a certain but broad class. A natural small subtask is the calculation of optimal
strategies specifically for the robot soccer model, i. e. to give for each soccer situation (state
$s \in S$) a probability distribution over the actions $a$ of the action space $A(s)$. This
subtask motivates a further substantial goal: the development of a software package which
implements all key ideas.
Beyond designing efficient algorithms, the intent to widen applicability of the algorithms
shall also be pursued by reducing models to equivalent smaller ones. In this area, the focus
is on symmetry reduction which has to be appropriately defined for two-player zero-sum
Markov games. Furthermore, the proof of desired properties, e. g. which consequences the
symmetry of a model has for the symmetry of optimal strategies, is an important goal of
this thesis.
Machine Learning, Game Theory, and Dynamical Systems
Although a different impression is often given, machine learning [139] and stochastic
control are intertwined; e. g. [10, 145] discuss the relevance of control theory topics to
Artificial Intelligence (AI). In stochastic control, adaptive control is the field concerned with
algorithms which improve a sequence of decisions from experience. It is a mature discipline
for systems with smooth dynamics, see e. g. [26, 195].(1) In contrast, learning methods are
often applied to time-discrete systems and utilise special function approximation (super-
vised learning) or data preprocessing (unsupervised learning) methods. However, the most
decisive difference is the knowledge of the underlying model: in learning methods the model
can be completely unknown whereas in control theory the model typically is assumed to be
known.(2) Throughout this thesis, the author has chosen the machine learning terminology.
This should not be misunderstood as a judgement on the relative merits of the two approaches.
Machine Learning. There are three major types of learning (Figure 1.1) with different
degrees of freedom for the learner: completely supervised learning (SL, Chapter 4), com-
pletely unsupervised learning, and lying between these two extremes reinforcement learning
(RL, Chapter 2). In several applications these different types of learning are intertwined;
for example, [193] gives a short overview of combinations of supervised, unsupervised and
reinforcement learning.
(1) In [195] a good overview of numerical methods in optimal control can also be found.
(2) By addition of system identification methods this distinction can also become blurred.
Figure 1.1: Scheme of the main subfields of learning (unsupervised learning, supervised learning, reinforcement learning).
Supervised learning methods are often interpreted as generalisation methods and deal with
function approximation for a given set of samples. For example, let $f: \mathbb{R}^n \to \mathbb{R}$ be a
function and $S_{\text{input}} = \{s_1, \dots, s_m\} \subset \mathbb{R}^n$ a finite subset of the domain of $f$. The set of
input-output pairs $S = \{(s_k, f(s_k))\}_{k=1,\dots,m}$ is called the sample set. Then, the task of
SL methods is to find $\tilde{f} \approx f$, i. e. minimising the norm difference with respect to a given
norm, by only using information of the sample set. A popular choice of $\tilde{f}$ is a linear
approximation architecture $\tilde{f} = \sum_j a_j f_j$ with specified $f_j$ and coefficients $a_j$ to be determined. In contrast
to function approximation methods, some SL methods deal with noise in the process of
function evaluation, i. e. $f(s_i)$ deviates from the assumed value in the input-output pair.
Furthermore, an SL specific topic is feature detection, which is related to a partly automatic
design of the $f_j$. The use of handcrafted features is computationally less demanding; it
is the subject of numerical studies in Chapter 5 and yielded reasonable results even when only
simple features were used.
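The linear approximation architecture described above can be illustrated by a minimal sketch. It assumes a one-dimensional input, two handcrafted features $f_0(s) = 1$ and $f_1(s) = s$, and the Euclidean norm on the sample set; the target function and all names are hypothetical and not taken from the thesis or its software.

```python
# Least-squares fit of f~(s) = a_0 * f_0(s) + a_1 * f_1(s) with the two
# handcrafted features f_0(s) = 1 and f_1(s) = s (an illustrative choice).

def fit_linear_architecture(samples):
    """Return (a_0, a_1) minimising sum_k (a_0 + a_1 * s_k - y_k)^2."""
    m = len(samples)
    sum_s = sum(s for s, _ in samples)
    sum_y = sum(y for _, y in samples)
    sum_ss = sum(s * s for s, _ in samples)
    sum_sy = sum(s * y for s, y in samples)
    # Solve the 2x2 normal equations in closed form.
    det = m * sum_ss - sum_s * sum_s
    a1 = (m * sum_sy - sum_s * sum_y) / det
    a0 = (sum_y - a1 * sum_s) / m
    return a0, a1


def f(s):
    # Hypothetical target function; an SL method only sees its samples.
    return 2.0 * s + 1.0


# Sample set S = {(s_k, f(s_k))} built from the inputs S_input.
s_input = [0.0, 1.0, 2.0, 3.0]
sample_set = [(s, f(s)) for s in s_input]

a0, a1 = fit_linear_architecture(sample_set)


def f_tilde(s):
    """Fitted approximation of f, usable outside the sample set."""
    return a0 + a1 * s
```

Because the hypothetical target lies in the span of the two features, the fit recovers it exactly here; with noisy samples or a richer target the same code returns the best approximation in the least-squares sense only.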
Unsupervised learning methods are also seen as generalisation methods like SL, but they
differ by only making use of the distribution of the input data $S_{\text{input}}$. Thus, they can be
interpreted as input data preprocessing methods, to which e. g. clustering approaches belong.
Reinforcement learning methods do not directly fit into the concept of the above two methods.
They are biologically inspired by action-reward animal training mechanisms. The
mathematical foundations are based on a combination of optimal control and stochastic
approximation methods. The basic ingredients are a state and an action space, a dynami-
cal probabilistic model called transition function according to a specified time set (discrete
or continuous), and a reward function which evaluates every action in each situation and
is typically non-zero only in a few situations. The goal is to determine a behaviour that
ensures a high long-term reward which includes for example the tedious work to figure out
which situations are the most positive ones and how these can be achieved most effectively.
Effectiveness is typically measured by rewards and time (time discounting of reward). One of
the most applied equations is the Bellman equation (Equations 2.48 and 2.49, [66]) which
gives a rule to propagate the estimations of usefulness of each situation one step further
in time. Many iterative methods for computing optimal strategies such as value iteration
and Q-learning are based on the Bellman equation.
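To make the propagation of value estimates concrete, the following minimal sketch runs value iteration on a tiny hypothetical MDP. It assumes the standard discounted form of the Bellman backup, which need not coincide in notation with Equations 2.48 and 2.49 of the thesis; the three-state chain, the rewards, and all names are illustrative assumptions.

```python
GAMMA = 0.9  # discount factor (assumed value)

# TRANSITIONS[state][action] = list of (probability, next_state, reward).
# Hypothetical chain: the only reward is earned by stepping from 1 into 2.
TRANSITIONS = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {"stay": [(1.0, 2, 0.0)]},  # absorbing state without reward
}


def bellman_backup(values, state):
    """Propagate the value estimates one step further in time."""
    return max(
        sum(p * (r + GAMMA * values[nxt]) for p, nxt, r in outcomes)
        for outcomes in TRANSITIONS[state].values()
    )


def value_iteration(tolerance=1e-10):
    """Apply synchronous Bellman backups until the estimates stop changing."""
    values = {s: 0.0 for s in TRANSITIONS}
    while True:
        new_values = {s: bellman_backup(values, s) for s in TRANSITIONS}
        change = max(abs(new_values[s] - values[s]) for s in TRANSITIONS)
        values = new_values
        if change < tolerance:
            return values


values = value_iteration()
```

In this example the estimate of state 0 ends up as the discounted value of state 1, which shows how a reward that is non-zero in only one situation spreads backwards through the state space.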
A general discussion about the use of learning paired with some criticism about current
research and an excellent overview of multi-agent learning can be found in [188].
Game Theory. In reinforcement learning approaches it is the goal to determine an
optimal behaviour according to some optimisation criterion. This behaviour is usually
interpreted as the active decision of an agent and, thus, it is natural to consider models
with several agents. This path directly leads to game theory. While game theory is often
influenced by economic flavours the basic question of game theory of how to act optimally
in an environment with several players, respectively agents, is quite general. The equally
general answer is to find a Nash equilibrium. However, special foci of classical game
theory are rivalry or conflicting aims of several players, coordination and coalitions, threat
mechanisms, partial information, games of social welfare, and so on. From a computational
point of view finding Nash equilibria can become arbitrarily hard for games with more
than two agents [41].(3) Fortunately, in robot soccer the different agents of each team
can be represented by only one game theoretic agent because the team is fully cooperative.
Furthermore, the game is of type “zero-sum” which can be interpreted as games between two
rivals: one agent wants to avoid what the other tries to achieve and vice versa. Sometimes
such a situation is called (completely) competitive because there is no potential for a
compromise. Typically, board games (backgammon, chess) or sport games (soccer, tennis)
are of this type because one agent or team of agents wins if and only if the other one loses.
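As a small illustration of a zero-sum game, the sketch below solves a 2x2 matrix game in which the row agent maximises the payoff and the column agent minimises it. It uses the textbook closed form for 2x2 games, which is an assumption made here for brevity and not the numerical method of the thesis; the function name and the example matrix (matching pennies) are hypothetical.

```python
def solve_2x2_zero_sum(payoff):
    """Return (p, q, value) for a 2x2 zero-sum matrix game.

    payoff[i][j] is what the row agent receives (and the column agent
    loses) when row i meets column j. p is the probability of row 0 and
    q the probability of column 0 in an equilibrium strategy pair.
    """
    (a, b), (c, d) = payoff
    lower = max(min(a, b), min(c, d))  # best guaranteed payoff of the row agent
    upper = min(max(a, c), max(b, d))  # best guaranteed loss of the column agent
    if lower == upper:
        # Saddle point: both agents have an optimal pure strategy.
        row = 0 if min(a, b) >= min(c, d) else 1
        col = 0 if max(a, c) <= max(b, d) else 1
        return (1.0 if row == 0 else 0.0), (1.0 if col == 0 else 0.0), float(lower)
    # No saddle point: each agent mixes so that the opponent is indifferent.
    denom = a - b - c + d
    p = (d - c) / denom
    q = (d - b) / denom
    value = (a * d - b * c) / denom
    return p, q, value


# Matching pennies: the row agent wins if both choices agree.
p, q, value = solve_2x2_zero_sum([[1.0, -1.0], [-1.0, 1.0]])
```

For larger action sets the equilibrium of a zero-sum matrix game is usually obtained by linear programming rather than by a closed form.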
Connections to Dynamical Systems and Graph Theory: Almost Invariant Sets.
The update steps of synchronous dynamic programming (DP) algorithms are global, while
the updates of reinforcement learning methods are local with respect to specific stochastic
trajectories. To combine these two concepts, almost invariant sets are employed. These
characterise regions which are left by a stochastic trajectory with low probability, and thus
give information about where a typical reinforcement learning trajectory will stay for a long
time. In Section 5.3, it is shown how to exploit this knowledge to design an asynchronous
DP algorithm for which flexible trade-offs between global and local updates are possible.
The idea is to utilise an estimate of the Nash equilibrium strategies to analyse the dy-
namics of the Markov game. If all players’ strategies are fixed such that the dynamics
do not explicitly depend on the strategies, the concept is identical to that of analysing
the dynamics of dynamical systems by means of discretised transfer operators [46]. Using
the invariance information of the dynamics, regions for repeated local updates, namely the
almost invariant sets, can be located. Finally, it seems algorithmically sensible to alternate
global and local update steps. For computing a partition into almost invariant sets, graph
partitioning techniques are utilised.
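The notion of an almost invariant set can be sketched for a finite Markov chain, i. e. for the situation in which all strategies are fixed. The sketch assumes one common way of quantifying almost invariance, namely the one-step probability of staying in a set when the chain is distributed according to its stationary distribution; this measure, the four-state transition matrix, and all names are assumptions for illustration and are not taken from the thesis.

```python
def stationary_distribution(P, iterations=5000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def invariance_ratio(P, pi, subset):
    """Probability of remaining in `subset` for one step, given that the
    chain is currently in `subset` and distributed according to `pi`."""
    mass = sum(pi[i] for i in subset)
    stay = sum(pi[i] * P[i][j] for i in subset for j in subset)
    return stay / mass


# Hypothetical chain with two weakly coupled blocks {0, 1} and {2, 3}.
P = [
    [0.50, 0.49, 0.01, 0.00],
    [0.49, 0.50, 0.00, 0.01],
    [0.01, 0.00, 0.50, 0.49],
    [0.00, 0.01, 0.49, 0.50],
]

pi = stationary_distribution(P)
ratio_block = invariance_ratio(P, pi, [0, 1])  # a weakly coupled block
ratio_mixed = invariance_ratio(P, pi, [0, 2])  # a set cutting both blocks
```

A trajectory started in the block {0, 1} leaves it only rarely, so repeated local updates inside such a block are a plausible complement to global sweeps; the set {0, 2} has a much lower ratio and would be a poor choice. Graph partitioning, as mentioned above, is one way to search for sets with a high ratio.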
The invariance information could have a two-fold impact: it may directly yield valuable
information to speed up numerical algorithms (e. g. value iteration) and may give on
a meta-level useful information about which situations in robot soccer are dynamically
linked if both teams are playing optimally.(4) An application could be to exploit subop-
timal behaviour of the other team to perform a controlled jump from a nearly invariant
component to another more advantageous one.
Robot Soccer with Real Robots
Multi-player robot soccer is the main application as well as the standard example to il-
lustrate theoretical concepts throughout this thesis. The model used for numerical com-
putations is described in detail in Section 5.1. Here, only the most salient aspects are
mentioned. Since the model addresses the highest level planning and does not resolve
lower levels of behaviour such as direct motor controls of the robots, it is possible to design
a model that is widely applicable. Particularly, the proposed model is relatively hardware
independent. Basic abilities such as walking towards a predefined location, handling the
ball by dribbling and kicking, and skills for the analysis of visual information (self and
opponent localisation) are highly non-trivial but assumed to be already available. The
only assumption about the abilities of the robots is qualitative, namely that both teams
including every single robot are totally equal. This is true e. g. for the AIBO league(5) but
not for all leagues of robot soccer. The consequences of equality of all robots are that the
robots are indistinguishable and that it is at least possible for each robot to apply the
same basic abilities. The question of whether simulation results can be transferred to real
robots is answered positively by [69]. This shows that it can be valuable to compute an
optimal strategy for a non-exact model (simulation).
(3) Communicated by Prof. Dr. Bernd Sturmfels in a mathematical colloquium of the University of Paderborn.
(4) These could be called a metastable set of situations.
Reading Information and Contributions of this Thesis
In this thesis a game theoretic model of robot soccer and new algorithms for obtaining
optimal strategies within this model are developed. This results in the following structure
of the thesis, the description of which emphasises the contributions of the author:
A large portion of Chapter 2 is a special collection of common knowledge about Markov
decision processes (MDPs), reinforcement learning, and different types of games. The in-
terpretation of matrix games (Definition 2.18) in the more general context of two-player
zero-sum Markov games (2P-ZS-MGs) and the numerical error analysis of the value itera-
tion for 2P-ZS-MGs (Lemma 2.36) are contributed by the author. The latter seems to be
necessary because the solution of matrix games in the iteration step introduces numerical
errors which lead to a modified stopping criterion of the iteration scheme. The analysis
was originally developed in the context of Chapter 4 for the mathematically identical case
of function approximation by supervised learning methods, but the assumptions and
consequences are different: for standard (numerical) value iteration the total error is
typically only corrected a little, whereas for function approximation the same error analysis
reveals that no guarantee on the quality of the solution can be given.
Chapter 3 deals with model reduction. Section 3.1 begins with an introduction of the
concepts of MDP homomorphisms and MDP symmetries. By a new and algebraically more
precise way of presentation the relation between concepts of different authors can be
clarified. A key result of the author is to show that the formalisms of MDP homomorphisms
[171] and MDP symmetries [217] are equivalent (Lemma 3.3). This equivalence reveals that
the model minimisation framework of [171] with MDP options and the symmetry context
of [217] with its generalisation to multi-agent MDPs do not exclude each other but are
based on the same foundation. Additionally, the concept of the symmetry group of MDPs
by [171] is related to its natural algebraic framework of group actions by the author.
A second key result is the development of a corresponding framework for the case of 2P-
ZS-MGs including all basic statements equivalent to those for MDPs in Section 3.2. This
more general framework was necessary to capture the symmetries of the robot soccer model
(5)The current development of the four legged league was influenced by the fact that the production of
the AIBO ERS-7 was stopped recently. The new name of the league is Standard Platform League (SPL)
meaning that all robots have to be equal (as before) but additionally a new two legged humanoid robot
called NAO is introduced as a substitute for the AIBO “dog”.
developed in Chapter 5. It can be expected that 2P-ZS-MGs are one of the most general
types of models which include MDPs and for which such statements are true.
Besides the symmetries of MDPs, a qualitatively new symmetry that results from exchang-
ing the two players can be exploited. Practically, such symmetries have been used within
different board games by the argument that exchanging the players in a zero-sum game
has to result in a multiplication of the value by $-1$. One example is given by the pioneer
Samuel for checkers [179]. The present work lays a formal foundation for this argument
and, more importantly, shows that this exchange of players is also compatible with the
other standard symmetries similar to that of MDPs (Proposition 3.24) and that symmetry
reduction can be performed stepwise.
The key result of Chapter 4, which is devoted to the combination of reinforcement learning
and supervised learning, is Lemma 4.2 and its implications. It clarifies that, unless a function
approximation architecture is very accurate, care is needed if the quality of the solutions is
to be provable. Nevertheless, a collection of successfully applied results of other authors is
provided. Corresponding results of the author can be found in Section 5.2.6.
Finally, Chapter 5 provides detailed information about the model and the numerical
results. The multi-player grid soccer model described in Section 5.1 is based on [112] but
goes far beyond it: the generalisation to several agents per team makes it necessary to change
from a one-agent “blocking dynamic” to a more permeable one and makes it possible to
introduce passes between different agents of the same team. Furthermore, the author believes
that the model is very well-suited to the study of multi-agent systems: since both the grid size
and the number of agents can be varied, the effects of large state spaces, introduced either by
a high grid resolution or by multiple agents on a low-resolution grid, can be studied
comparatively.
Section 5.2 contains the numerical results and possesses a rich substructure. Some of the
studies provide strong arguments for the choice of 2P-ZS-MG instead of MDP models
in robot soccer (Section 5.2.2), others compare dynamic programming and reinforcement
learning techniques (Section 5.2.4) with the result that exact, model-exploiting dynamic
programming techniques should be preferred whenever possible. After these evaluative
numerical results, which appear to be natural but are surprisingly not standard in the
literature, the studies of dynamic programming methods are intensified in Section 5.2.5
and the dependency of convergence speed on different types of methods and parameters
is numerically analysed. This includes comparative studies of symmetry reduced models
with their unreduced counterparts which is the practical application of the theoretical re-
sults in Chapter 3. Particularly interesting from a practical point of view is the “max-min
convergence boosting phenomenon” which only seems to be present in 2P-ZS-MGs but
not in MDPs. As mentioned above, results with supervised learning (Section 5.2.6) and
general technical issues of applying strategies to real robots (Section 5.4)(6), especially of
type AIBO ERS-7, follow.
Chapter 6 concludes and points to future work and the appendix contains a short
description of group actions (Appendix A), comments on the relations of the iterative
solution of the Bellman equation to iterative linear solvers (Appendix B), an introduction
to the software package DRPOST developed by the author (Appendix C), and additional
material (Appendix D) omitted in the main part.
(6)These technical issues were not discovered by the author but independently confirmed.
Chapter 2
Reinforcement Learning (RL) and
Game Theory
Contents
2.1 Dynamical Systems and Markov Processes
  2.1.1 Basics and Problem Definitions
  2.1.2 Numerical Methods
  2.1.3 Complexity, Algorithmic Issues and Software
2.2 Markov Decision Processes (MDPs)
  2.2.1 Basics and Problem Definitions
  2.2.2 Numerical Methods
  2.2.3 Complexity, Algorithmic Issues and Software
2.3 Matrix Games
  2.3.1 Basics and Problem Definitions
  2.3.2 Numerical Methods
  2.3.3 Complexity, Algorithmic Issues and Software
2.4 Two Player Zero Sum Markov Games (2P-ZS-MGs)
  2.4.1 Basics and Problem Definitions
  2.4.2 Numerical Methods
  2.4.3 Complexity, Algorithmic Issues and Software
2.5 General Markov Games, Differential Games, and Advanced Concepts of RL
The goal of machine learning is to design a software
that has more abilities than its programmer in the end.
inspired by Arthur L. Samuel (1901-1990)
A very general framework to which this thesis contributes is that of multi-agent systems
(MASs) [64] and more specifically to multi-agent learning [188]. To cut a long story short,
the basic ingredients of MASs are a number of agents(1) which interact with an environment
and perceive some information from the environment. The role and (hierarchical)
structures of agents are also discussed in the MAS context.
(1)Agents are not restricted to physical agents like humans or robots. Conversely, if a robot is not capable
of making decisions then it is part of the environment and does not count as an agent.
[Figure 2.1: diagram of an environment and three agents; each agent has decision, perception, and interaction components, and the agents are linked by communication (comm.).]
Figure 2.1: Scheme of the main components of multi-agent systems. Multiple agents
interact with an environment and partially perceive information from the environment. The
agents are allowed to communicate and make some decisions for their future interactions.
Bowling comments on the usefulness of the MAS context as a general framework as follows
[21]:
Frameworks are models of reality. As such, they are an important foundation
for the generation and evaluation of new ideas. They establish the rules of the
game, crystallise the core issues, provide a common basis of study, make intrinsic
assumptions visible, provide a general perspective on large classes of problems, help
to categorise the variety of solutions, and allow comparison with other models of
reality. It is with all of these reasons in mind that we begin this work by introducing
a framework for multiagent learning.
The class of stochastic games is a more specific framework than the general MASs.(2)
Stochastic games define the guidelines of this work but are rarely used for reinforcement
learning tasks; exceptions are [21, 98, 112], the first of which also inspired the structure of this
section. Stochastic games are a generalisation of Markov decision processes (MDPs) and
matrix games. The first is the research focus of reinforcement learning (RL) and dynamic
programming (DP), the second is a major topic of game theory with applications in eco-
nomic science. In the remainder of this section, basic ingredients are defined and some
well-known results about the solution of basic problems are presented.
More precisely, this section is organised by increasing model complexity as follows: First,
Section 2.1 starts with a brief overview of dynamical systems which provide the framework
for systems without any decision maker. Nevertheless, fixing the policies of all agents of
an MAS results in a (stochastic) dynamical system and hence some basic insights from
this mathematical discipline may serve as inspiration for Markov games. Even more of
interest, a deterministic continuous dynamical system can also be approximated by a
stochastic discrete dynamical system [46] or in other words an MDP without a decision
maker(3). We proceed by introducing MDPs in Section 2.2 (one decision maker), matrix
games (two adversarial decision makers) in Section 2.3, and two-player zero-sum Markov
games in Section 2.4 (also two adversarial decision makers). In the last section, an overview
of more advanced concepts which are only partly relevant to this work is given. This
(2)An alternative (exhaustive) framework for continuous motion planning problems can be found in [104].
However, practical application seems limited at the moment to special cases.
(3)A stochastic discrete dynamical system is often called Markov process or Markov chain.
includes semi-Markov decision processes (S-MDPs) and hierarchical approaches. In each
subsection aspects of history, notation, and the basic problems are provided. Summarising,
this chapter is the foundation of this thesis.
2.1 Dynamical Systems and Markov Processes
Dynamical systems are a very general concept and include concepts for discrete and con-
tinuous systems as well as for deterministic and stochastic ones which are also called ran-
dom dynamical systems. Many concepts were developed to analyse emerging properties
of deterministic dynamical systems such as equilibrium and periodic solutions, symbolic
dynamics, chaos, and relations between numerics and dynamical systems. A special goal
of dynamical systems theory is to provide statements about the asymptotic long-term be-
haviour of a given system. Additionally, as will be pointed out in the following, extracting
statistical information of (deterministic) dynamical systems is interesting for many sys-
tems and leads to relations of deterministic continuous dynamical systems and stochastic
discrete ones (see also footnote (3)) by so-called transfer operators.
As mentioned above, dynamical systems are considered in the beginning of Chapter 2 be-
cause they describe situations where no agent, decision maker, or controller is involved.
These three terms are synonymously used by different communities. However, a (stochas-
tic) dynamical system arises if one considers the policies of all agents of an MAS being
fixed. The ideas of the following short overview are mainly borrowed from [46]. In ad-
dition to this article, a more detailed description and further references of approximating
deterministic dynamical systems by Markov processes can be found in [48, 49, 50]. A less
specific introduction to the broad area of dynamical systems may be obtained by reading
the textbooks [25, 73, 153] the last of which contains many examples.
Historical Remarks. Following [153], early work on dynamical systems was motivated by
the aim of predicting the non-linear dynamics of the n-body problem in celestial mechanics.
A pioneer in this field was Poincaré who used perturbation methods and employed a geo-
metrical qualitative point of view. Further major contributions were achieved by Arnold,
Birkhoff, Bowen, Duffing, Kolmogorov, Krylov, Lorenz, Lyapunov, Moser, Pontryagin,
Rayleigh, Ruelle, Smale, and many others. Cartwright and Littlewood [28] observed chaos
which was also discovered by Lorenz studying weather dynamics [117] and separately by
Smale who introduced the horseshoe map [192].
2.1.1 Basics and Problem Definitions
After a short introduction to deterministic dynamical systems and invariant sets, Markov
processes and some concepts to approximate the statistics of dynamical systems by means
of transfer operators are introduced.
Deterministic Dynamical Systems
The first definition of dynamical systems seems very abstract but later more specific aspects
will be discussed.
2.1 Definition (Dynamical System [25])
Let $X$ be a non-empty set and $\{\varphi^t : X \to X \mid t \in K\}$ a one-parametric family of maps with
parameter $t$ and $K = \mathbb{R}$ or $K = \mathbb{Z}$. If this family is a one-parametric group, that means
$\varphi^{t+s} = \varphi^t \circ \varphi^s$ and $\varphi^0 = \mathrm{id}_X$ is the identity map, then $(X, \varphi)$ is called a time-continuous
(autonomous) dynamical system if $K = \mathbb{R}$ and a time-discrete one if $K = \mathbb{Z}$. $\varphi$ is the flow
of the dynamical system.
In an irreversible system the time may be restricted to $t \ge 0$, resulting in a semigroup
instead of a group structure. For a dynamical system $(X, \varphi)$ and $x \in X$ the set
$O^+_\varphi(x) = \bigcup_{t \ge 0} \varphi^t(x)$ is called the positive semi orbit,
$O^-_\varphi(x) = \bigcup_{t \le 0} \varphi^t(x)$ the negative semi orbit,
and $O_\varphi(x) = O^+_\varphi(x) \cup O^-_\varphi(x) = \bigcup_{t} \varphi^t(x)$ the orbit or trajectory of $x$ under the flow $\varphi$. If the context
is clear then $\varphi$ may be omitted.
2.2 Definition (Invariant Set)
The set $A \subset X$ is called $\varphi$-invariant if $\forall t : \varphi^t(A) \subset A$. If a dynamical system is not
time-invertible then $\varphi^{-t}(A) = (\varphi^t)^{-1}(A) = \{x \in X : \varphi^t(x) \in A\}$ is the set of preimages
of $A$ under $\varphi^t$. A forward invariant set possesses the invariance property for all $t \ge 0$, a
backward invariant set for all $t \le 0$.
Dynamical systems and differential equations. According to [25], flows of dynamical
systems occur naturally in autonomous differential equations of the first order. That means
equations of the form
$\dot{x} = f(x)$, (2.1)
where $x \in X = \mathbb{R}^n$, $f : \mathbb{R}^n \to \mathbb{R}^n$ is a continuously differentiable function ($f \in C^1$), and
$\dot{x} = \frac{dx}{dt}$. For each $x \in X$ there exists a unique solution $\varphi^t(x)$ with $\varphi^0(x) = x$ which is defined
for all $t \in U_\varepsilon(0)$ in a neighbourhood of 0. If the solution exists for all $t$ (which is assumed
in the following and is true e. g. for bounded $f$) then for fixed $t$ a $C^1$-diffeomorphism of
$\mathbb{R}^n$ is given by the map $x \mapsto \varphi^t(x)$. Because the differential equation is autonomous it
holds that $\varphi^{t+s}(x) = \varphi^t(\varphi^s(x))$, meaning that $\varphi^t$ is a flow. Conversely, for a given differentiable flow
$\varphi^t : \mathbb{R}^n \to \mathbb{R}^n$ the corresponding differential equation reads
$\dot{x} = \frac{d}{dt} \varphi^t(x) \big|_{t=0}$ (2.2)
because if equation 2.1 is integrable a solution $x(t)$ fulfills
$\varphi^t(x(0)) = x(t) = x(0) + \int_0^t f(x(s)) \, ds$.
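The relation between a flow and its differential equation can be checked numerically on a simple case. The following is a minimal sketch, assuming the illustrative scalar equation $\dot{x} = -x$ with exact flow $\varphi^t(x) = x e^{-t}$; this example system and the names `flow` and `generator` are not taken from the thesis. It tests the group property of Definition 2.1 and recovers the right-hand side via Equation 2.2 by a finite difference.

```python
import math


def f(x):
    """Right-hand side of the illustrative ODE x' = -x (an assumed example)."""
    return -x


def flow(t, x):
    """Exact flow of x' = -x, i.e. phi^t(x) = x * exp(-t)."""
    return x * math.exp(-t)


def generator(x, h=1e-6):
    """Central finite-difference approximation of d/dt phi^t(x) at t = 0 (Equation 2.2)."""
    return (flow(h, x) - flow(-h, x)) / (2.0 * h)


x0, t, s = 2.0, 0.7, -0.3

# Group property: phi^(t+s)(x) = phi^t(phi^s(x)), and phi^0 = identity.
group_defect = abs(flow(t + s, x0) - flow(t, flow(s, x0)))
identity_defect = abs(flow(0.0, x0) - x0)

# The flow reproduces the vector field f at t = 0.
generator_defect = abs(generator(x0) - f(x0))
```

All three defects should vanish up to rounding and discretisation error, which illustrates that the flow and the vector field determine each other.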
Transfer Operators, Almost Invariant Sets, and Markov Processes
This paragraph is based on [46] and deals with time-discrete dynamical systems of the
form
$x_{k+1} = f(x_k)$ (2.3)
with $k \in \mathbb{N}_0$ and $f : X \to X$ for a compact $X$. An important class of
examples are the time-$T$ maps (for fixed $T \in \mathbb{R}$) of a time-continuous dynamical system
(Equation 2.1). If the global dynamical behaviour of a given system is of interest then a
useful method is to employ the transfer operator or Perron-Frobenius operator associated
with $f$. It describes the evolution of signed measures on $X$.
The transfer operator or Perron-Frobenius operator of a time-discrete dynamical system $f$
is the linear operator $P : \mathcal{M} \to \mathcal{M}$ defined for all measurable sets $S$ by
$(P\nu)(S) = \nu(f^{-1}(S))$ (2.4)
where $\mathcal{M}$ is the space of signed measures on the Borel $\sigma$-algebra over $X$. For example,
if $\nu$ is a uniformly distributed probability measure on a set $S_1 \subset X$ (meaning uniformly
distributed on $S_1$ and 0 elsewhere) then $(P\nu)(S_2)$ is exactly the transition probability
$T(S_1, S_2)$ of Equation 2.5 because $\nu(S_1) = 1$. In contrast to the transition function $T$
of Section 2.2, which maps only states and actions to probabilities, the transfer operator
$P$ directly maps a probability distribution of inputs (states at time $t$) to a probability
distribution of outputs (states at time $t + 1$). In the context of dynamical systems the
case of a measure $\mu$ which is a fixed point of the transfer operator is considered, i. e.
$\mu$ is invariant with respect to $f$. Then for all measurable sets $S$ the following holds:
$\mu(S) = \mu(f^{-1}(S)) = (P\mu)(S)$. Furthermore, an additional standard assumption in this
context is that the measure is robust under small random perturbations, leading to a unique
SRB measure (Sinai, Ruelle, Bowen).
The next step to Markov decision processes (MDPs) and to almost invariant sets is to
define the transition probability $T$ from a measurable set $S_1$ with $\mu(S_1) \neq 0$ to a second
measurable set $S_2$ by
$T(S_1, S_2) = \frac{\mu(S_1 \cap f^{-1}(S_2))}{\mu(S_1)}$. (2.5)
The transition probability from a set $S_1$ with non-zero measure into itself is called the
invariance ratio $T_{\mathrm{inv}}(S_1) = T(S_1, S_1)$. If $T_{\mathrm{inv}}(S_1) \ge 1 - \varepsilon$ for $\varepsilon \in [0, 1]$ the set $S_1$ is called
$(1 - \varepsilon)$-invariant, which for $\varepsilon = 0$ turns into pure invariance as in Definition 2.2 neglecting
sets of measure zero. If the exact value of $\varepsilon$ is not in the focus of interest the imprecise
term almost invariant set will be used. Particularly interesting, although elementarily
obtainable, is the fact that for $f$-invariant $\mu$ the complement of a $(1 - \varepsilon)$-invariant
set $S$ is almost invariant with invariance ratio
$T_{\mathrm{inv}}(X \setminus S) \ge 1 - \varepsilon \cdot \frac{\mu(S)}{\mu(X \setminus S)}$. (2.6)
If $\mu(S) \le \frac{1}{2} \mu(X)$ then the invariance ratio of the complement satisfies $T_{\mathrm{inv}}(X \setminus S) \ge 1 - \varepsilon$, i. e. the
complementary set is also at least $(1 - \varepsilon)$-invariant. This motivates a successive algorithmic
way of hierarchically splitting up almost invariant sets into almost invariant subsets, and
hence iteratively constructing a partition. A partition $P(X)$ of the state space $X$ of a
dynamical system (or any arbitrary set $X$) is a finite collection of pairwise disjoint subsets
$P_i \subset X$, $i = 1, \dots, N$ with $N \in \mathbb{N}$, with $\mu(P_i) > 0$ and the covering property $X = \bigcup_{i=1}^{N} P_i$. A partition
$P(X) = \{P_1, \dots, P_N\}$ is called $(1 - \varepsilon)$-invariant if
$\min_{i=1,\dots,N} T_{\mathrm{inv}}(P_i) \ge 1 - \varepsilon$. (2.7)
In contrast to other types of decomposition (e. g. ergodic components) the decomposition of
a state space Xof a dynamical system into a partition of almost invariant sets is not unique.
The reason is that slightly changing a partition of almost invariant sets will typically result
in only a minor change of the invariance ratios of the corresponding sets.
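The bound of Equation 2.6 can be verified in a few lines. The derivation below is a sketch added for illustration; it uses only the $f$-invariance of $\mu$ and Equation 2.5, and the abbreviation $\bar{S} = X \setminus S$ is notation introduced here for brevity.

```latex
\begin{align*}
\mu(S) &= \mu(S \cap f^{-1}(S)) + \mu(S \cap f^{-1}(\bar{S}))
  && \text{($f^{-1}(S)$ and $f^{-1}(\bar{S})$ partition $X$)} \\
\mu(S) = \mu(f^{-1}(S)) &= \mu(S \cap f^{-1}(S)) + \mu(\bar{S} \cap f^{-1}(S))
  && \text{($\mu$ is $f$-invariant)} \\
\Rightarrow \quad \mu(\bar{S} \cap f^{-1}(S)) &= \mu(S \cap f^{-1}(\bar{S}))
  = \bigl(1 - T_{\mathrm{inv}}(S)\bigr)\, \mu(S) \\
T_{\mathrm{inv}}(\bar{S}) &= 1 - \frac{\mu(\bar{S} \cap f^{-1}(S))}{\mu(\bar{S})}
  = 1 - \bigl(1 - T_{\mathrm{inv}}(S)\bigr) \frac{\mu(S)}{\mu(\bar{S})}
  \;\ge\; 1 - \varepsilon \cdot \frac{\mu(S)}{\mu(\bar{S})}
\end{align*}
```

The last step uses $T_{\mathrm{inv}}(S) \ge 1 - \varepsilon$. In words: under an invariant measure the mass leaving $S$ equals the mass entering $S$, so a set that loses little mass also receives little mass from its complement.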
2.3 Problem (Partitioning into Almost Invariant Sets)
For a dynamical system $(X, \varphi)$ and fixed $N \in \mathbb{N}$ find a partition $P(X) = \{P_1, \dots, P_N\}$ that
maximises the average invariance
$T_{\mathrm{inv}}(P(X)) = \frac{1}{N} \sum_{i=1}^{N} T_{\mathrm{inv}}(P_i)$. (2.8)
Relation to Markov chains, MDPs, and RL
A Markov chain with a finite number of states can be obtained by choosing a finite
partition(4) $P(X) = \{P_1, \dots, P_N\}$ and defining transition probabilities according to $T(P_i, P_j)$.
This Markov chain can also be considered as an MDP with a finite number of states where
a Markovian strategy of the decision maker is fixed.
In Section 5.3 almost invariant sets build a bridge from the classical dynamic programming
algorithms to the reinforcement learning algorithms by combining the global nature of the
first methods with the locality of the second ones. The idea is to perform only a few updates
on the whole state space and then to restrict the updates of dynamic programming methods
to sets with a high invariance ratio(5) (under some fixed strategy of the decision makers),
and to again perform a few global updates and so on.
2.1.2 Numerical Methods
A set oriented numerical approach to compute partitions $P(X)$ which preferably consist
of almost invariant sets is described below. The presentation includes the numerical ap-
proximation of sets in general, an approximation of the transfer operator, some different
versions of partitioning problems, and their transformation to a graph based context.
Discretisation of Transfer Operators
To deal with transfer operators in a numerical way a finite dimensional discretisation of the
operator must be introduced. As mentioned above, more detailed information is available
e. g. in [46, 48, 49, 50]. The basic idea is to first approximate the state space Xby a
hierarchically constructed sequence of generalised rectangles (a d-dimensional box) and
then define the transfer operator by the transition probabilities between these boxes.
One numerical scheme for approximating a space Xby a finite collection of boxes is the
so-called subdivision algorithm [47] and works as follows:
Starting with an initial box $B_0 \supset X$, a sequence of collections of boxes $(\mathcal{B}_i)_{i \in \mathbb{N}_0}$ with
$\mathcal{B}_0 = \{B_0\}$ is constructed by repeating the following two steps iteratively until some
stopping criterion is fulfilled. The $(i + 1)$-th iteration reads:
1.) Refinement: Given $\mathcal{B}_i$, construct a finer collection of boxes $\tilde{\mathcal{B}}_{i+1}$ by subdividing each
box $B \in \mathcal{B}_i$ along one coordinate axis into two halves.
2.) Selection: Given $\tilde{\mathcal{B}}_{i+1}$, construct $\mathcal{B}_{i+1}$ by keeping all boxes which intersect with the
state space $X$, i. e. $\mathcal{B}_{i+1} = \{B \in \tilde{\mathcal{B}}_{i+1} : B \cap X \neq \emptyset\}$.
Each of the two steps entails a consideration of how to perform it numerically. The first
step can be done directly or be modified by selecting only parts of all boxes for subdivision.
Whether the second step is complicated or not depends on the set $X$ and its mathematical
description: in general, i. e. for all $X$, it cannot be expected that the condition
$B \cap X \neq \emptyset$ can be checked exactly and efficiently.(6) Finally, after $i_{\mathrm{end}}$ steps a stopping
criterion is fulfilled, and the above algorithm's output is a finite collection of boxes
$\mathcal{B}_{i_{\mathrm{end}}} = \{B_{1, i_{\mathrm{end}}}, B_{2, i_{\mathrm{end}}}, \dots, B_{N, i_{\mathrm{end}}}\}$. These fulfill the following criteria:
$X \subset \bigcup_{k=1}^{N} B_{k, i_{\mathrm{end}}}$ and
for all $k \neq l$ it holds that $\mu_{\mathrm{Leb}}(B_{k, i_{\mathrm{end}}} \cap B_{l, i_{\mathrm{end}}}) = 0$, where $\mu_{\mathrm{Leb}}$ is the Lebesgue measure.
(4)This partition may be much finer than the almost invariant sets.
(5)Neglecting the evolution of the strategy, an RL trajectory would typically stay in an almost invariant
set for a long time.
(6)For example, for a choice of representative test points $p_k$ on the box boundary, a test if $p_k \in X$ can be
performed.
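The two steps of the subdivision algorithm can be sketched as follows. This is an illustrative sketch and not the implementation of [47]: it assumes that $X$ is the closed unit disk in $\mathbb{R}^2$ (chosen here because the test $B \cap X \neq \emptyset$ is then exact), that the coordinate axes are cycled through in turn, and that the stopping criterion is a fixed number of steps. The function names are hypothetical.

```python
def bisect(box, axis):
    """Refinement: split a box (tuple of (lo, hi) intervals) into two halves along one axis."""
    lo, hi = box[axis]
    mid = 0.5 * (lo + hi)
    left = box[:axis] + ((lo, mid),) + box[axis + 1:]
    right = box[:axis] + ((mid, hi),) + box[axis + 1:]
    return [left, right]


def intersects_unit_disk(box):
    """Exact test of B ∩ X != ∅ for X = closed unit disk (assumed example set).

    The box intersects the disk iff the box point closest to the origin lies in the disk.
    """
    dist2 = 0.0
    for lo, hi in box:
        nearest = min(max(0.0, lo), hi)  # clamp the origin coordinate into [lo, hi]
        dist2 += nearest * nearest
    return dist2 <= 1.0


def subdivision(initial_box, intersects, steps):
    """Alternate refinement and selection for a fixed number of steps."""
    boxes = [initial_box]
    dim = len(initial_box)
    for i in range(steps):
        axis = i % dim  # cycle through the coordinate axes
        refined = [half for box in boxes for half in bisect(box, axis)]
        boxes = [box for box in refined if intersects(box)]
    return boxes


def volume(box):
    """Lebesgue measure of a box."""
    v = 1.0
    for lo, hi in box:
        v *= hi - lo
    return v


boxes = subdivision(((-1.0, 1.0), (-1.0, 1.0)), intersects_unit_disk, steps=8)
covered_area = sum(volume(b) for b in boxes)
```

Since every kept box intersects the disk and the kept boxes cover it, `covered_area` lies between the area of the disk and the area of the initial box, and it approaches the former as the number of steps grows.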
For the approximation of the transfer operator we simplify notation by writing $B_k$ instead
of $B_{k, i_{\mathrm{end}}}$. Hence, the starting point is now a collection of boxes $\mathcal{B} = \{B_1, \dots, B_N\}$ which
approximates the set $X$, e. g. in the sense that some stopping criterion of the subdivision
algorithm is fulfilled. The discretisation of the transfer operator is carried out by
replacing the space of signed measures $\mathcal{M}$ by $\mathcal{M}_{\mathcal{B}}$, which is the finite-dimensional
space of signed measures on the $\sigma$-algebra generated by the collection of boxes $\mathcal{B}$. If
the collection $\mathcal{B}$ forms a partition (i. e. pairwise intersections being empty instead of being
of zero measure) then the generated $\sigma$-algebra contains only arbitrary unions of boxes
$B_k \in \mathcal{B}$.(7) A standard basis for the vector space $\mathcal{M}_{\mathcal{B}}$ is the set of $N$ measures which
uniformly assign the measure 1 to one fixed box $B_i \in \mathcal{B}$ and 0 to all other boxes. The
transfer operator with respect to this basis, $P_{\mathcal{B}} : \mathcal{M}_{\mathcal{B}} \to \mathcal{M}_{\mathcal{B}}$, is given by the following
matrix of transition probabilities:(8)
$P_{\mathcal{B}} = (p_{ij})_{ij} = \left( \frac{\mu_{\mathrm{Leb}}(B_j \cap f^{-1}(B_i))}{\mu_{\mathrm{Leb}}(B_j)} \right)_{1 \le i \le N,\, 1 \le j \le N} \in \mathbb{R}^{N,N}$. (2.9)
The measure of a box is straightforward to compute; the measure of the intersection, however,
has to be approximated by numerical methods. [45] describes some possibilities; a
widespread method is the so-called Monte Carlo approach, which is described in [85] and
reads
$\mu_{\mathrm{Leb}}(B_j \cap f^{-1}(B_i)) \approx \frac{\mu_{\mathrm{Leb}}(B_j)}{K} \sum_{k=1}^{K} \chi_{B_i}(f(x_k))$ (2.10)
with $x_k$ being uniformly selected at random from $B_j$. This simply means that one has
to select $K$ random points in $B_j$ and check whether the image $f(x_k)$ is in $B_i$; the fraction
of points whose image lies in $B_i$ approximates $p_{ij}$. There are
efficient numerical techniques for performing this task [48, 47].(9)
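The Monte Carlo approximation of the transition matrix can be sketched as follows. The example is a minimal illustration under stated assumptions: the map is the logistic map $f(x) = 4x(1 - x)$ on $X = [0, 1]$ and the boxes are $N$ intervals of equal length; neither choice is taken from the thesis, and the function names are hypothetical. Entry `P[i][j]` estimates $p_{ij}$, the transition probability from box $B_j$ to box $B_i$.

```python
import random


def logistic(x):
    """Illustrative map f(x) = 4x(1 - x) on [0, 1] (an assumed example system)."""
    return 4.0 * x * (1.0 - x)


def box_index(x, n_boxes):
    """Index of the box [k/n, (k+1)/n) containing x in [0, 1]; x = 1 goes to the last box."""
    return min(int(x * n_boxes), n_boxes - 1)


def transition_matrix(f, n_boxes, samples_per_box, rng):
    """Estimate p[i][j] as the fraction of sample points of box j whose image lies in box i."""
    p = [[0.0] * n_boxes for _ in range(n_boxes)]
    width = 1.0 / n_boxes
    for j in range(n_boxes):
        for _ in range(samples_per_box):
            x = (j + rng.random()) * width  # uniform sample from box j
            i = box_index(f(x), n_boxes)
            p[i][j] += 1.0 / samples_per_box
    return p


rng = random.Random(0)  # fixed seed for reproducibility
P = transition_matrix(logistic, n_boxes=4, samples_per_box=2000, rng=rng)
column_sums = [sum(P[i][j] for i in range(4)) for j in range(4)]
```

Each column of `P` sums to one because every sample point is mapped into exactly one box. For this particular map the box $[0.25, 0.5)$ is mapped entirely into $[0.75, 1]$, so the corresponding column has essentially all its mass in the last row.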
2.4 Problem (Discretised Partitioning into Almost Invariant Sets (I))
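For a dynamical system $(X, \varphi)$, a set of boxes $\mathcal{B} = \{B_1, \dots, B_L\}$, and fixed $N \in \mathbb{N}$ find a
partition of boxes $P(X) = \{P_1, \dots, P_N\}$ (i. e. for each $i$ there exists an index set $K_i \subset \mathbb{N}$
such that $\{K_1, \dots, K_N\}$ forms a partition of $\{1, \dots, L\}$, and $P_i = \bigcup_{k \in K_i} B_k$ with each
$B_k \in \mathcal{B}$) that maximises the average invariance
$T_{\mathrm{inv}}(P(X)) = \frac{1}{N} \sum_{k=1}^{N} T_{\mathrm{inv}}(P_k) = \frac{1}{N} \sum_{k=1}^{N} \frac{\sum_{B_i, B_j \subset P_k} \mu_{\mathrm{Leb}}(B_j) \cdot p_{ij}}{\sum_{B_j \subset P_k} \mu_{\mathrm{Leb}}(B_j)}$. (2.11)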
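Evaluating the objective of Problem 2.4 for a given partition is straightforward once the matrix of transition probabilities is available. The following sketch uses a hypothetical column-stochastic matrix for four boxes of equal measure (the numbers are made up for illustration) and compares two candidate partitions; the function names are not from the thesis.

```python
def invariance_ratio(p, measure, block):
    """T_inv of the union of the boxes in `block`; p[i][j] is the probability from box j to box i."""
    inside = sum(measure[j] * p[i][j] for i in block for j in block)
    total = sum(measure[j] for j in block)
    return inside / total


def average_invariance(p, measure, partition):
    """Average invariance (Equation 2.11) of a partition given as a list of index lists."""
    return sum(invariance_ratio(p, measure, block) for block in partition) / len(partition)


# Hypothetical 4-box example: boxes {0, 1} and {2, 3} exchange only little mass.
P = [
    [0.50, 0.45, 0.00, 0.00],
    [0.50, 0.45, 0.10, 0.00],
    [0.00, 0.10, 0.45, 0.50],
    [0.00, 0.00, 0.45, 0.50],
]
measure = [0.25, 0.25, 0.25, 0.25]

good = average_invariance(P, measure, [[0, 1], [2, 3]])
bad = average_invariance(P, measure, [[0, 2], [1, 3]])
```

In this made-up example the partition that groups the strongly coupled boxes reaches an average invariance of 0.95, while the partition that separates them reaches only 0.475.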
Interpretation in Terms of Graphs
As pointed out in [46], Problem 2.4 can be interpreted as finding an optimal cut of a graph.
Given a dynamical system $(X, \varphi)$ and a partition of the state space by boxes $\mathcal{B} = P(X)$(10),
(7)To handle intersections of measure zero one may define equivalence classes and equality by identifying
all sets of zero measure with the empty set.
(8)$p_{ij}$ is the transition probability from box $B_j$ to box $B_i$.
(9)The fixed point of $P$ describing the $f$-invariant measure can be obtained in a discretised version by
the eigenvector of $P_{\mathcal{B}}$ to the eigenvalue 1.
(10)If $X$ cannot be written as a finite union of boxes, the construction of a box approximation of $X$ as
described above may be necessary.
one defines the transition graph $G = (V, E)$ as a weighted directed graph with vertex set
$V = \mathcal{B}$ and edge set
$E = \{(B_1, B_2) \in \mathcal{B} \times \mathcal{B} : B_1 \cap f^{-1}(B_2) \neq \emptyset\}$. (2.12)
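The condition $f(B_1) \cap B_2 \neq \emptyset$ is equivalent to $B_1 \cap f^{-1}(B_2) \neq \emptyset$ even if $f$ is not invertible
(bijective). The function of vertex weights is defined by
$v : V \to \mathbb{R}, \quad v(B_i) = \mu_{\mathrm{Leb}}(B_i)$, (2.13)
and the function of edge weights by
$e : E \to \mathbb{R}, \quad e((B_i, B_j)) = \mu_{\mathrm{Leb}}(B_i) \cdot p_{ji}$. (2.14)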
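There also exists an undirected version of the transition graph, namely $\tilde{G} = (V, \tilde{E})$ with
the modified edge set
$\tilde{E} = \{(B_1, B_2) \in \mathcal{B} \times \mathcal{B} : (B_1 \cap f^{-1}(B_2)) \cup (B_2 \cap f^{-1}(B_1)) \neq \emptyset\}$. (2.15)
Accordingly, the function of edge weights has to be symmetrised with respect to $i$ and $j$,
yielding
$\tilde{e} : \tilde{E} \to \mathbb{R}, \quad \tilde{e}((B_i, B_j)) = \mu_{\mathrm{Leb}}(B_i) \cdot p_{ji} + \mu_{\mathrm{Leb}}(B_j) \cdot p_{ij}$. (2.16)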
Note that by construction the total edge weight of the directed and the undirected transi-
tion graph is equal.
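To measure the degree of invariance of a subset of vertices $S \subset V$, external and internal
costs are defined. For two subsets of vertices $A, B \subset V$ the following notation is introduced:
$E_{A,B} = \frac{1}{2} \sum_{B_i \in A,\, B_j \in B} \left( \mu_{\mathrm{Leb}}(B_i) \cdot p_{ji} + \mu_{\mathrm{Leb}}(B_j) \cdot p_{ij} \right)$ (2.17)
which simplifies for $A = B$ to $E_{A,A} = \sum_{B_i, B_j \in A} \mu_{\mathrm{Leb}}(B_i) \cdot p_{ji}$. Now, for $S \subset V$ the internal
costs are defined as
$C_{\mathrm{int}}(S) = \frac{E_{S,S}}{\mu_{\mathrm{Leb}}(S)}$, (2.18)
and the external costs as
$C_{\mathrm{ext}}(S) = \frac{E_{S,\bar{S}}}{\mu_{\mathrm{Leb}}(S) \cdot \mu_{\mathrm{Leb}}(\bar{S})}$ (2.19)
where $\bar{S} = V \setminus S$ is the complement of $S$. Based on the previous definitions, for a partition
of the vertices $P(V) = \{P_1, \dots, P_N\}$ the analogous quantities are the internal costs
$C_{\mathrm{int}}(P(V)) = \frac{1}{N} \sum_{i=1}^{N} C_{\mathrm{int}}(P_i)$ (2.20)
and the external costs
$C_{\mathrm{ext}}(P(V)) = \frac{\sum_{1 \le i < j \le N} E_{P_i, P_j}}{\prod_{i=1}^{N} \mu_{\mathrm{Leb}}(P_i)}$. (2.21)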
In [46] the internal and external costs are seen to be intuitively high and low,
respectively, for almost invariant sets. However, it is stated that maximising the internal costs
is not equivalent to minimising the external costs. Optimisation with respect to the first
criterion may lead to relatively small sets in the partition while optimisation with
respect to the second criterion leads to more balanced partitions. Identifying a partition
$P(X) = \{P_1, \dots, P_N\}$ with the corresponding one of the graph $P(V) = \{P_1, \dots, P_N\}$
yields $T_{\mathrm{inv}}(P(X)) = C_{\mathrm{int}}(P(V))$, and thus maximising the internal costs is equivalent to
Problem 2.4. Minimising the external costs yields the third partitioning problem:
2.5 Problem (Discretised Partitioning into Almost Invariant Sets (II))
For a graph $G = (V, E)$ and fixed $N \in \mathbb{N}$ find a partition of vertices $P(V) = \{P_1, \dots, P_N\}$
with $\sum_{p \in P_i} v(p) > 0$ that minimises $C_{\mathrm{ext}}$.
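The internal and external costs of Equations 2.17 to 2.19 can be evaluated directly from the matrix of transition probabilities and the box measures. The sketch below is illustrative only: the four-box matrix is hypothetical, vertex subsets are given as lists of box indices, and the function names are not taken from [46] or from the thesis.

```python
def coupling(p, measure, a, b):
    """E_{A,B} of Equation 2.17; p[i][j] is the transition probability from box j to box i."""
    return 0.5 * sum(
        measure[i] * p[j][i] + measure[j] * p[i][j] for i in a for j in b
    )


def internal_cost(p, measure, s):
    """C_int(S) = E_{S,S} / mu(S), Equation 2.18."""
    return coupling(p, measure, s, s) / sum(measure[i] for i in s)


def external_cost(p, measure, s):
    """C_ext(S) = E_{S, complement} / (mu(S) * mu(complement)), Equation 2.19."""
    complement = [i for i in range(len(measure)) if i not in s]
    mu_s = sum(measure[i] for i in s)
    mu_c = sum(measure[i] for i in complement)
    return coupling(p, measure, s, complement) / (mu_s * mu_c)


# Hypothetical column-stochastic matrix for four boxes of equal measure.
P = [
    [0.50, 0.45, 0.00, 0.00],
    [0.50, 0.45, 0.10, 0.00],
    [0.00, 0.10, 0.45, 0.50],
    [0.00, 0.00, 0.45, 0.50],
]
measure = [0.25, 0.25, 0.25, 0.25]

c_int_good = internal_cost(P, measure, [0, 1])
c_ext_good = external_cost(P, measure, [0, 1])
c_int_bad = internal_cost(P, measure, [0, 2])
c_ext_bad = external_cost(P, measure, [0, 2])
```

For the almost invariant set consisting of boxes 0 and 1 the internal cost is high (0.95) and the external cost is low (0.1), whereas the set consisting of boxes 0 and 2 shows the opposite pattern (0.475 and 1.05).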
2.1.3 Complexity, Algorithmic Issues and Software
Most variants of the graph partitioning problems including the above mentioned are NP-
complete [46]. Therefore, heuristics are employed to solve Problem 2.4 and Problem 2.5.
An incomplete list of software tools for (balanced) graph partitioning may contain e. g.
JOSTLE [211], METIS [91], and PARTY [166]. Throughout this thesis, graph partitioning
will be performed using PARTY.
A common idea influencing the design of software packages is the multilevel paradigm which
has proven to be powerful (e. g. [141]). This paradigm consists of two steps: graph
coarsening and local improvement. For the graph coarsening a heuristic approach called
graph matching is employed in PARTY. A graph matching is a subset of edges such that
each vertex is part of at most one edge. The coarsening procedure then reduces every two
linked vertices of the matching to one supervertex. The hierarchical multi-step coarsening
yields a clustering(11) on the coarsest level. It is followed by a stepwise projection onto the
next finer level with a local optimisation of the partition of each level by standard methods
like Kernighan-Lin [92] or the Helpful-Set Method [52].
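One coarsening step can be sketched as follows. The matching heuristic used here, a greedy selection of the heaviest available edges, is a generic choice made for illustration; it is not claimed to be the matching strategy implemented in PARTY, and the function names and the small example graph are hypothetical.

```python
def greedy_matching(edges):
    """Pick edges by decreasing weight such that each vertex is part of at most one edge.

    `edges` maps vertex pairs (u, v) to weights.
    """
    matched, used = [], set()
    for (u, v), _weight in sorted(edges.items(), key=lambda item: -item[1]):
        if u != v and u not in used and v not in used:
            matched.append((u, v))
            used.update((u, v))
    return matched


def coarsen(vertices, edges, matching):
    """Merge each matched pair into one supervertex and add up parallel edge weights."""
    super_of = {v: (v,) for v in vertices}  # unmatched vertices stay single
    for u, v in matching:
        super_of[u] = super_of[v] = (u, v)
    coarse_edges = {}
    for (u, v), weight in edges.items():
        a, b = super_of[u], super_of[v]
        if a != b:  # edges inside a supervertex disappear
            key = (a, b) if a <= b else (b, a)
            coarse_edges[key] = coarse_edges.get(key, 0.0) + weight
    return sorted(set(super_of.values())), coarse_edges


vertices = [0, 1, 2, 3]
edges = {(0, 1): 0.9, (1, 2): 0.1, (2, 3): 0.8, (0, 3): 0.05}

matching = greedy_matching(edges)
coarse_vertices, coarse_edges = coarsen(vertices, edges, matching)
```

In this example the two heavy edges are matched, which leaves two supervertices joined by a single light edge whose weight is the sum of the two light edges of the original graph. Repeating the step yields the hierarchy of coarser graphs on which the partition is then computed and refined.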
The topic of partitioning with a variable number of partitions is even more challenging
than the aforementioned partitioning problems. Some heuristics such as congestion [46]
can be introduced but are not discussed further here.
2.2 Markov Decision Processes (MDPs)
Markov decision processes (MDPs) are the next step towards Markov games. They provide
a model for situations where a stochastic discrete dynamical system evolves under the
influence of one agent. In robot soccer this agent may also represent a whole team if
it is fully cooperative. Typically, solving an MDP means to compute a global feedback
control law to achieve some predefined goal in accord with the underlying dynamics. The
textbooks [14, 168] provide a good introduction to the material, as does [202], from which the
notation of this section partly stems.
Historical Remarks. MDPs were popularised by the books of Bellman and Howard [14, 84]
but according to Puterman [168] the historical roots are located much earlier. Some of the
basic concepts date back to problems of the calculus of variations in the 17th century but
an explicit reference only points to the end of the 19th century: a paper of Cayley [30].
The beginning of the modern study can be dated to the 1940s when Wald [210] already
presented the essence of the theory. A little later, important work was done on games
(11)The coarsening can be continued until the correct number of partitions is obtained, or until the coarsest
level can be partitioned by standard partitioning methods.
[15, 186], stochastic inventory models [61], pursuit problems [86] and sequential statistical
problems [5].
2.2.1 Basics and Problem Definitions
This subsection begins by defining the basic ingredients of an MDP:
2.6 Definition (Markov Decision Process)
A (discrete time, finite) Markov decision process $M = (D, S, SA, T, R)$ is given by
1.) decision epochs $D = \mathbb{N}_0$,
2.) a (finite) state space $S$,
3.) a (finite) state action space $SA = \{(s, a) : s \in S, a \in A(s)\}$ where $A(s)$ is a (finite)
set of available actions in state $s$,
4.) a transition function $T : SA \times S \to [0, 1]$ with $T(s, a, s')$ being the probability of reaching
state $s'$ if choosing action $a$ in state $s$,(12)
5.) a (deterministic) reward function $R : SA \to \mathbb{R}$.
A standard assumption is that for every pair (s, a) SA holds that Ps0∈S T(s, a, s0)=1
which means that no action can lead to a state outside of S. The addition of (absorbing)
extra states may help to ensure the above condition. For hierarchical problems as men-
tioned in Section 2.5 it can be necessary to differentiate between states within the same
level of hierarchy (with a probability less than 1) and states outside of the given hierarchy
level.
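To make the ingredients of Definition 2.6 concrete, the following Python sketch stores a finite MDP in plain dictionaries and checks the standard assumption that every $T(s, a, \cdot)$ sums to 1. It is only an illustration: the two-state example and the names STATES, ACTIONS, T, R and is_stochastic are assumptions made here and do not stem from the thesis.

```python
# Hypothetical two-state MDP; all names and numbers are illustrative assumptions.
STATES = ["s0", "s1"]
ACTIONS = {"s0": ["stay", "go"], "s1": ["stay"]}  # A(s) for every state s

# T[(s, a)] maps successor states s' to the probabilities T(s, a, s').
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},  # s1 is an absorbing state
}

# R[(s, a)] is the deterministic reward R(s, a).
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 0.0}


def is_stochastic(T, states, tol=1e-12):
    """Check that every T(s, a, .) is a probability distribution on the state space."""
    for row in T.values():
        # No action may lead outside of S, and probabilities must be non-negative.
        if any(s2 not in states or p < 0.0 for s2, p in row.items()):
            return False
        if abs(sum(row.values()) - 1.0) > tol:
            return False
    return True


print(is_stochastic(T, STATES))
```

If the check fails because probability mass leaves the state space, adding an absorbing extra state as described above is one way to restore the condition.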
2.7 Remark (General MDP [168])
For a more general version of MDPs the following aspects may additionally be taken into
account:
1.) The decision epochs can be either discrete or continuous and, in the first case, there is
the possibility of a finite or an infinite set of time points; in the continuous case there
are further possibilities for when the decisions are to be made: continuously, at random
time points, or at time points which themselves can be chosen by the decision maker.
2.) The state space can be continuous, discrete, or a mixture of both, and again finite or
infinite in the discrete case.(13)
3.) The action space can have the same specifications as the state space and may be
different for every state. Additionally, instead of pure actions, so-called mixed actions,
i. e. a probability distribution from $PD(B)$ on a (Borel) subset $B$ of actions, can be performed
during the Markov decision process. In this context, pure actions are degenerate mixed
ones.
4.) The reward function may also be dependent upon the reached state. In the model of
reward it is not important how the reward is accrued (continuously through a period,
system state of subsequent decision epoch) but it or its expected value must be known
before the next choice of action. If the reward depends on the subsequent state then
it may be computed by
$$R(s, a) = \sum_{s' \in S} T(s, a, s') \, R(s, a, s'). \qquad (2.22)$$
(12)An alternative for MDPs with deterministic state transitions is to define the modified transition function
$\tilde{T} : SA \to S$ with $\tilde{T}(s, a) = s'$ being the next state which is reached with probability 1.
(13)The exact condition is that the state space is a non-empty Borel subset of a complete, separable metric
space. The same condition is needed for the action spaces. “Separable” means that there exists a countable
dense subset, and a Borel set is an element of the Borel σ-algebra of the metric space.
[Figure 2.2 (diagram): for the time steps t and t+1, the labels state, observation, decision, mixed action, action, and reward]
Figure 2.2: Scheme of a Markov decision process. The agent, being in a state at time t,
bases its decision on a stochastic observation of the state to choose a mixed action. At
random, this leads to a pure action and, again at random, to the next state. The agent
receives (local) rewards depending on states and actions.
2.8 Definition (Markov Property)
A decision process is said to be Markovian if for a sequence of state-actions $(s_t, a_t)_t$ with
$s_t \in S$ and actions $a_t \in A(s_t)$, and a sequence of rewards $(r_t)_t$ with $r_t = R(s_t, a_t)$, the
following holds:
$$\mathrm{Prob}\{s_t = s, r_t = r \mid s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \ldots, s_0, a_0\} = \mathrm{Prob}\{s_t = s, r_t = r \mid s_{t-1}, a_{t-1}\}. \qquad (2.23)$$
An MDP possesses the Markov property because its transition function $T$ and reward
function $R$ are designed in just this way. The Markov property means that the transitions
and rewards are not dependent on history. A common artifice to include (a part of) the
state action history in an MDP framework is to attach it directly to the state. It is obvious
that this may augment the state space enormously depending on the maximal length of
history information.
2.9 Example (Robot Soccer, 1)
Robot soccer will be the standard example throughout this thesis (for a more detailed
description see Section 5.1) and serves to explain most of the theoretical concepts. If the
strategy of the opponent team is fixed then only one team being represented by an abstract
agent has to explicitly decide on its actions. This is exactly the situation of an MDP, and
the ingredients of Definition 2.6 are as follows: the decision epoch $D = \mathbb{N}_0$ is equidistant
and infinite, the state space $S \subset \mathbb{R}^n$ is the set which describes the possible coordinates
(position and maybe velocity) of all robots and the ball, the action set $A(s)$ consists of
movements and kicks, the (probabilistic) transition function $T$ describes the evolution of
the game for each state and each action, and the reward $R$ is simply positive ($= +1$) for
scoring a goal and negative ($= -1$) for letting the opponent team score a goal.
Return Models
After the definition of the dynamics and reward of an MDP, the goal of an MDP remains to
be stated: to maximise some kind of long-term reward called the return $R$.(14) Given a
stochastic(15) sequence of rewards $(r_t)_t$ there are, however, different criteria of optimality.
In [88, 202], the following variants of measuring the optimality by returns are considered:
Widespread due to its convergence properties is the discounted infinite horizon return
$$R_{\mathrm{disc}} = E\Big\{\sum_{t=0}^{\infty} \gamma^t r_t\Big\} \qquad (2.24)$$
with discount rate $\gamma \in (0, 1)$, while the numerical approximation of the infinite horizon
return is often performed by the corresponding finite horizon return
$$R^N_{\mathrm{disc}} = E\Big\{\sum_{t=0}^{N} \gamma^t r_t\Big\} \qquad (2.25)$$
where additionally $\gamma = 1$ is allowed.(16) $E\{\cdot\}$ here means the expectation value. Completely
different and more difficult to analyse are the average infinite horizon return model
$$R_{\mathrm{aver}} = \lim_{N \to \infty} E\Big\{\frac{1}{N+1} \sum_{t=0}^{N} r_t\Big\}, \qquad (2.26)$$
or bias-optimal models, neither of which needs a discount factor. An empirical comparison
of a discounted to an average reward based return model can be found in [125].
At this point it should be remarked that all of the following concepts which deal with
optimality depend strongly on the optimisation criterion defined by the return model.
Therefore, if not stated differently, the standard is to assume the discounted infinite horizon
return, i. e. $R = R_{\mathrm{disc}}$.
2.10 Example (Robot Soccer, 2)
Using the robot soccer example again for illustration, a discounted reward means that
scoring a goal faster is ranked higher than scoring it later. The finite horizon version seems
to be adequate if the duration of the soccer match and of the performance of actions can
be exactly foreseen which is typically not true. Finally, in the average reward case the
ranking of scoring is completely independent of the exact time points. Only the average
amount of goals per time interval is important.
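The effect described in this example can be reproduced with a few lines of Python. The sketch below evaluates the sums inside Equations 2.25 and 2.26 for a single, already realised reward sequence (so no expectation is taken); the function names and the two reward sequences are assumptions chosen for illustration only.

```python
def discounted_return(rewards, gamma):
    """Sum_{t=0}^{N} gamma^t * r_t for one realised reward sequence (cf. Equation 2.25)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def average_return(rewards):
    """(1 / (N + 1)) * sum_{t=0}^{N} r_t for one realised reward sequence (cf. Equation 2.26)."""
    return sum(rewards) / len(rewards)


# Hypothetical matches: a goal scored immediately versus a goal scored three steps later.
early_goal = [1, 0, 0, 0]
late_goal = [0, 0, 0, 1]

print(discounted_return(early_goal, 0.5), discounted_return(late_goal, 0.5))
print(average_return(early_goal), average_return(late_goal))
```

With the discount rate 0.5 the early goal is ranked higher than the late one, whereas both sequences have the same average return.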
It is also possible to design arbitrary return profiles with weights $w_t \neq \gamma^t$ as long as
convergence of the series can be guaranteed. This may be useful to enforce some strategic
behaviour for which a desired time scale is specified (soft constraint). However, for a
non-constant weight the Markov property of the value function is typically not fulfilled.
Although not considered before, employing such a method may give additional tactical
options for the soccer game.
(14)Sometimes, the MDP as defined above is called Markov decision process and the MDP plus return
model is called Markov decision problem. In this thesis, the term Markov decision process includes also
the return model.
(15)The rewards are stochastic either in themselves, or through the stochasticity of the transition function, or both.
(16)In the discounted infinite horizon return model one can also consider $\lim_{\gamma \to 1}$ if existent.
Policies, Value Functions, and Optimality
A definition of the concepts of optimality remains: policies, value functions, and Bellman’s
principle of optimality.
2.11 Definition (Policy, Decision Rule)
A (stationary Markovian) policy $\pi : SA \to [0, 1]$ is a decision rule which for each $s \in S$
specifies a probability distribution in $PD(A(s))$, i. e. $\forall s \in S : \sum_{a \in A(s)} \pi(s, a) = 1$.
According to the definition a policy $\pi$ specifies the probability $\pi(s, a)$ of performing action
$a$ in state $s$. If $\pi(SA) = \{0, 1\}$, then the policy is called deterministic. Given an initial
state $s_0 \in S$ and a policy $\pi$, one can compute an associated random trajectory $O(s_0) =
(s_t, a_t)_{t \in \mathbb{N}_0}$ with rewards $r_t = R(s_t, a_t)$ and the corresponding return. Then, the state value
function $V^\pi : S \to \mathbb{R}$ is the expectation of the return under policy $\pi$ when starting in state
$s$. Similarly, the state-action value function $Q^\pi(s, a)$ is the expectation of the return under
policy $\pi$ when starting in state-action $(s, a)$. A formal definition follows:
2.12 Definition (Value Functions of an MDP)
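Given an MDP $M = (D, S, SA, T, R)$ with $D = \mathbb{N}_0$ and a policy $\pi$, the state value function
$V^\pi : S \to \mathbb{R}$ under this policy is defined by
$$V^\pi(s) = E_\pi\{R \mid s_0 = s\} \qquad (2.27)$$
where the notation $E_\pi\{X\} = E\{X \mid \forall t : \mathrm{Prob}\{s_{t+1} = s' \mid s_t = s, a_t = a\} = T(s, a, s')$ and
$\mathrm{Prob}\{a_t = a \mid s_t = s\} = \pi(s, a)\}$ is employed. The corresponding state action value function
$Q^\pi : SA \to \mathbb{R}$ under this policy is defined by
$$Q^\pi(s, a) = E_\pi\{R \mid s_0 = s, a_0 = a\}. \qquad (2.28)$$
A policy $\pi^*$ is called optimal if for all policies $\pi$ it holds that $V^{\pi^*} \ge V^\pi$.(17)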
The above definition is not well-suited to reveal a method for algorithmically computing
value functions. Most of the solution algorithms utilise a recursivity property known as
Bellman’s principle of optimality.
2.13 Theorem (Optimality Principle [168])
A (state-) value function $V^* \in \mathbb{R}^{|S|}$ of an MDP is optimal iff it is the unique solution to the
Bellman equation:
$$V^*(s) = \max_{\pi_s \in PD(A(s))} \sum_{a \in A(s)} \pi_s(a) \cdot Q^*(s, a) \qquad (2.29)$$
where
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \cdot V^*(s'). \qquad (2.30)$$
Here, $PD(X)$ is the set of probability distributions on the set $X$ and $\pi_s$ is the restriction of
$\pi$ to state $s$, meaning $\pi_s(a) = \pi(s, a)$.
(17)The optimality condition is equivalent to the following one: $Q^{\pi^*} \ge Q^\pi$ for all policies $\pi$, as can be seen
by plugging $Q^*, Q^\pi$ into Equation 2.29 and, reversely, $V^*, V^\pi$ into Equation 2.30.
The result of plugging Equation 2.30 into Equation 2.29 can be abbreviated by the Bellman
operator $B_{\mathrm{MDP}}$ which shortens the notation of the Bellman equation to $V^* = B_{\mathrm{MDP}} V^*$.
The Bellman equation is a nonlinear fixed point equation for the optimal value function
$V^* = V^{\pi^*}$ and the operator is a contraction in $\|\cdot\|_\infty$ with rate $\gamma$ [19]. Note that every MDP
has a deterministic optimal policy [168] and it would thus suffice to take the maximum
in Equation 2.29 only over pure actions $a \in A(s)$. However, in the case of 2P-ZS-MGs in
Section 2.4 a need for the more general formulation will arise. In order to stress the formal
similarities the notation above deviates from standard.
2.14 Remark (History Dependent Strategies, Non-Stationarity [168])
As stated, policies can be deterministic or stochastic. One concept which is not covered
by the policy of Definition 2.11 is that of history dependent strategies. A $t$-time history $h_t$ is
recursively defined by $h_0 = s_0$ and $h_i = (h_{i-1}, a_{i-1}, s_i)$. A second concept which is also not
covered is the possible non-stationarity of policies, meaning that a policy also explicitly depends
on time. Both concepts are irrelevant for finding the optimal solution of an MDP but can be
of practical interest if the Markov condition is not (exactly) fulfilled.
2.2.2 Numerical Methods
Two important classes of approaches are dynamic programming (DP) and reinforcement
learning (RL) methods. While for the first class of solution methods it is assumed that the
stochastic model is known in advance, in the second class this assumption is dropped and
the consequences of the world model are directly (model-based approaches) or indirectly
(model free approaches) approximated.
Dynamic Programming Methods
2.15 Definition (Value Iteration (MDP) [168])
The following algorithm is called value iteration: select $\varepsilon > 0$, choose an arbitrary initial
guess $V_0 \in \mathbb{R}^{|S|}$ for the (state-) value function, and determine iteratively $V_k = B_{\mathrm{MDP}} V_{k-1}$
for $k = 1, 2, \ldots$ until $\|V_{k+1} - V_k\| \le \frac{\varepsilon}{2} \cdot \frac{1 - \gamma}{\gamma}$.
The stopping criterion can also be based on a span semi-norm [168] which may improve
the contraction rate if the transition matrix is non-sparse. Value iteration converges to $V^*$,
and provides an $\frac{\varepsilon}{2}$-approximation for the value function estimate, i. e. $\|V_N - V^*\| \le \frac{\varepsilon}{2}$,
and an $\varepsilon$-optimal stationary policy by
$$\pi^*_\varepsilon(s) = \Pi(V_N) = \arg\max_{a \in A(s)} \Big\{ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \cdot V_{N-1}(s') \Big\} \qquad (2.31)$$
where $N$ is the number of iterates and $\arg\max_{a \in A(s)} \in PD(A(s))$ is considered to be a
probability distribution of a pure action [168]. In the case of several equally good actions
the arg max has to return a mixed action with equal probabilities to keep uniqueness.
Note that in terms of the state-action value function Equation 2.31 can also be written
$\pi^*_\varepsilon(s) = \Pi(Q_{N-1}) = \arg\max_{a \in A(s)} Q_{N-1}(s, a)$.(18)
(18)Analogous to later proofs the iterative procedure of applying the two parts of the Bellman equation
(Equations 2.29 and 2.30) is labelled in the order: $V_0, Q_0, V_1, Q_1, \ldots$
Furthermore, the convergence is linear at rate $\gamma$, i. e. $\|V_{k+1} - V^*\| \le \gamma \cdot \|V_k - V^*\|$, and
the convergence of Gauss-Seidel value iteration, which uses $V_{k+1}(s)$ instead of $V_k(s)$ as
soon as available, is linear at least with rate $\gamma$ [168]. The analysis of the algorithms is
based on the fact that for each iteration $k$ there exists a policy $\pi_k$ which corresponds to
the maximum selection rule. With respect to these policies $\pi_k$ the analysis is done in a
way similar to iterative solvers for linear equations because the evaluation of a fixed policy
is solving a linear equation with the matrix depending on the policy (Appendix B).
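As an illustration of Definition 2.15, the following Python sketch applies the Bellman operator repeatedly to a small MDP and stops with the criterion stated there. It is a minimal sketch under assumptions: the two-state MDP, all identifiers, and the use of the maximum norm are choices made here, and ties in the arg max are broken by taking the first maximising action instead of a mixed action.

```python
# Hypothetical two-state MDP (illustrative assumption, not taken from the thesis).
STATES = ["s0", "s1"]
ACTIONS = {"s0": ["stay", "go"], "s1": ["stay"]}
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 0.0}


def bellman_q(V, T, R, gamma):
    """Equation 2.30: Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')."""
    return {
        (s, a): R[(s, a)] + gamma * sum(p * V[s2] for s2, p in row.items())
        for (s, a), row in T.items()
    }


def value_iteration(states, actions, T, R, gamma, eps):
    """Iterate V_k = B V_{k-1} until the change is at most eps/2 * (1 - gamma)/gamma."""
    V = {s: 0.0 for s in states}  # arbitrary initial guess V_0
    threshold = 0.5 * eps * (1.0 - gamma) / gamma
    while True:
        Q = bellman_q(V, T, R, gamma)
        # Maximising over pure actions suffices for MDPs (cf. Equation 2.29).
        V_new = {s: max(Q[(s, a)] for a in actions[s]) for s in states}
        diff = max(abs(V_new[s] - V[s]) for s in states)  # maximum norm
        V = V_new
        if diff <= threshold:
            break
    # Greedy pure policy with respect to the last Q (cf. Equation 2.31).
    policy = {s: max(actions[s], key=lambda a: Q[(s, a)]) for s in states}
    return V, policy


V, POLICY = value_iteration(STATES, ACTIONS, T, R, gamma=0.9, eps=1e-6)
print(V, POLICY)
```

For this example the fixed point can be checked by hand: choosing "go" in s0 gives $V(s_0) = 1 + 0.9 \cdot 0.2 \cdot V(s_0)$, i. e. $V(s_0) = 1/0.82$, which exceeds the value $0.9 \cdot V(s_0)$ of "stay".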
For the sake of completeness and because of its superlinear, sometimes quadratic conver-
gence the policy iteration algorithm is presented. For MDPs it is further guaranteed to find
the optimal policy in finitely many steps but this relies on the existence of a deterministic
optimal policy and is not true for 2P-ZS-MGs.
2.16 Definition (Policy Iteration [168])
Select an arbitrary policy $\pi_0$, and repeat the following until $\pi_{k+1} = \pi_k$: compute the value
function $V_k = V^{\pi_k}$ of policy $\pi_k$ (policy evaluation) and choose the policy $\pi_{k+1} = \Pi(V_k)$ as
in Equation 2.31 (policy improvement).
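The two alternating steps of Definition 2.16 can be sketched in Python as follows. Note the assumptions: the example MDP and all names are hypothetical, and the policy evaluation step is carried out here by simple fixed point iteration up to a tolerance instead of solving the linear system directly, which is a substitution made only to keep the sketch free of external libraries.

```python
# Hypothetical two-state MDP (illustrative assumption, not taken from the thesis).
STATES = ["s0", "s1"]
ACTIONS = {"s0": ["stay", "go"], "s1": ["stay"]}
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 0.0}


def q_value(s, a, V, T, R, gamma):
    """R(s, a) + gamma * sum_s' T(s, a, s') * V(s')."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())


def evaluate(policy, states, T, R, gamma, tol=1e-12):
    """Policy evaluation by fixed point iteration (assumption: replaces a linear solve)."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: q_value(s, policy[s], V, T, R, gamma) for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) <= tol:
            return V_new
        V = V_new


def policy_iteration(states, actions, T, R, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: actions[s][0] for s in states}  # arbitrary deterministic start policy
    while True:
        V = evaluate(policy, states, T, R, gamma)
        improved = {}
        for s in states:
            best = policy[s]  # keep the current action unless another is strictly better
            for a in actions[s]:
                if q_value(s, a, V, T, R, gamma) > q_value(s, best, V, T, R, gamma) + 1e-9:
                    best = a
            improved[s] = best
        if improved == policy:
            return policy, V
        policy = improved


POLICY, V = policy_iteration(STATES, ACTIONS, T, R, gamma=0.9)
print(POLICY, V)
```

Keeping the current action on ties is a design choice of this sketch; it avoids oscillating between equally good policies.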
Reinforcement Learning Methods
Reinforcement learning methods [88] belong to the class of stochastic approximation algorithms
and are employed for numerical comparisons. From a practical point of view they
may help to adapt optimal policies of a DP model to a real world problem. A key difference
of RL is that the agent directly executes actions according to a policy and, based on the
outcomes, estimates the value function and in some cases a model of the underlying MDP.
One of the most common algorithms is Q-learning first proposed by [212]:
2.17 Definition (Q-Learning [202])
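Start with state-action value function $Q_0$, and let the agent be in an initial state $s_0$. Repeat
iteratively: Being at step $k$ in state $s_k$, choose an action $a_k$ according to an arbitrary policy
$\pi$, observe the next state $s_{k+1}$ and set
$$Q_{k+1}(s, a) = \delta_{(s,a),(s_k,a_k)} \cdot \Big[ (1 - \alpha_k) Q_k(s_k, a_k) + \alpha_k \Big( R(s_k, a_k) + \gamma \cdot \max_{a_{k+1} \in A(s_{k+1})} Q_k(s_{k+1}, a_{k+1}) \Big) \Big] \qquad (2.32)$$
where $\delta_{i,j}$ is the Kronecker symbol being 1 if $i = j$ and 0 else, and $\alpha_k = \alpha_k(s, a)$. The update
is meant to change only the entry of the visited pair $(s_k, a_k)$; all other entries keep their value $Q_k(s, a)$.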
The Q-learning rule is guaranteed to converge with probability 1 (for any initial state $s_0$
and for any $Q_0$) if the sequence of learning rates satisfies for every $(s, a) \in SA$ [19]:
$$\sum_{k=0}^{\infty} \alpha_k(s, a) = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2(s, a) < \infty. \qquad (2.33)$$
The first condition includes the necessity for infinitely updating every state-action pair
while the second one establishes convergence. Because the convergence to the optimal
Q-value function occurs for every policy $\pi$ of the agent as long as Equation 2.33 is fulfilled, the
method is called an off-policy method. In practice, exploration strategies are included to
guarantee Equation 2.33 while exploitation, i. e. performing an optimal action, ensures that
the most promising parts of the state (-action) space are updated faster and more often. A
very common policy is the $\varepsilon$-greedy one, which chooses with probability $\varepsilon > 0$ a random action
while greediness means to choose an optimal action $a_k = \Pi(Q_k)$ with probability $1 - \varepsilon$.
Other possibilities are to introduce an exploration bonus, curiosity driven exploration [183],
Boltzmann exploration or interval based techniques [38].
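A minimal Python sketch of tabular Q-learning with an $\varepsilon$-greedy policy is given below for illustration. Everything specific in it is an assumption made here: the two-state MDP, the identifiers, the fixed random seed, and the learning rate $\alpha_k(s, a) = 1/n$, where $n$ counts the visits of the pair $(s, a)$, which is one simple choice satisfying Equation 2.33. Only the entry of the visited pair is changed in each step.

```python
import random

# Hypothetical two-state MDP without absorbing states (illustrative assumption).
ACTIONS = {"s0": ["stay", "go"], "s1": ["back"]}
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "back"): {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "back"): 0.0}


def sample_next(T, s, a, rng):
    """Draw the successor state s' according to the probabilities T(s, a, .)."""
    u, acc = rng.random(), 0.0
    for s2, p in T[(s, a)].items():
        acc += p
        if u < acc:
            return s2
    return s2  # guard against rounding in the cumulative sum


def q_learning(actions, T, R, gamma, steps, epsilon, start, seed=0):
    """Tabular Q-learning (cf. Equation 2.32) with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = {sa: 0.0 for sa in T}       # initial state-action value function Q_0
    visits = {sa: 0 for sa in T}    # visit counters used for the learning rates
    s = start
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.choice(actions[s])                     # exploration
        else:
            a = max(actions[s], key=lambda b: Q[(s, b)])   # exploitation (greedy)
        s2 = sample_next(T, s, a, rng)
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]  # sum alpha = inf, sum alpha^2 < inf
        target = R[(s, a)] + gamma * max(Q[(s2, b)] for b in actions[s2])
        Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
        s = s2
    return Q


Q = q_learning(ACTIONS, T, R, gamma=0.9, steps=20000, epsilon=0.2, start="s0")
print(Q)
```

With learning rates of this kind the estimates approach the optimal values only slowly, so the sketch should be read as a demonstration of the update rule rather than as an efficient solver.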
For the sake of completeness a common on-policy method is to be introduced. It is called
SARSA, which stems from the update formula (state, action, reward, state, action), and
is similar to Q-learning but with the slightly modified update rule:
$$Q_{k+1}(s, a) = \delta_{(s,a),(s_k,a_k)} \cdot \Big[ (1 - \alpha_k) Q_k(s_k, a_k) + \alpha_k \big( R(s_k, a_k) + \gamma \cdot Q_k(s_{k+1}, a_{k+1}) \big) \Big]. \qquad (2.34)$$
The difference is that if the agent executes a fixed policy $\pi$, the SARSA algorithm
converges to $Q^\pi$ instead of $Q^*$. However, if the policy is not kept constant but instead the
agent follows an $\varepsilon$-greedy policy with respect to the Q-values and $\varepsilon$ is decayed over time
then it can be hoped [202] that SARSA converges to $Q^*$. Compared to DP methods SARSA
perhaps can be interpreted as an asynchronous version of a policy iteration algorithm.
[116, 203] give general conditions under which asynchronous RL algorithms converge. They
reduce the effort to prove the convergence of synchronous update rules. One special case
is Q-learning for MDPs.
2.2.3 Complexity, Algorithmic Issues and Software
The numerical effort of solving MDPs is different for different algorithms [88]: Value iteration
needs in the worst case per iteration $O(|SA| \cdot |S|)$ multiplications and evaluations
of $T$ or, if the transition function is sparse ($T(s, a, s') \neq 0$ only for constantly many $s'$),
only $O(|SA|)$ multiplications and evaluations of $T$. However, the
number of iterations to achieve some prescribed $\varepsilon$-optimality can grow dramatically if the
discount factor $\gamma$ approaches 1 (see stopping criterion of Definition 2.15). In practice, policy
iteration can converge faster although the per iteration complexity is $O(|S|^3 + |SA| \cdot |S|)$
(policy evaluation + improvement) and there are no general theoretical worst-case results
[115]. For special cases like deterministic MDPs Madani gives proofs that policy iteration
is in P (polynomial) [123] and so is value iteration.
Linear Programming [184] is also a method to solve MDPs with the advantage that very
efficient commercial packages can be used. Theoretically, this is the only known method
with polynomial time although this need not indicate that it is the most efficient in practice.
Other standard ideas to speed up numerical techniques can be used e. g. multi-grid methods
as in [175] or state aggregation which conglomerates several states to a single meta-state
[18].
The Q-learning method presented above belongs to the model free approaches as well as
adaptive heuristic critic [12]. Under some conditions model free methods are applicable
in the average return case [185]. Model-based approaches which were not introduced here
include certainty equivalence [95], Dyna-Q [199, 200, 213], Queue Dyna [163], and Priori-
tised Sweeping [144, 44]. Special model-based methods for finding short paths to a goal
state are real-time dynamic programming (RTDP) [10] and plexus planning [43].
2.3 Matrix Games
In general, game theory is a scientific field which deals with the question: How should two
or more agents, being in an intertwined situation, decide to their own best?(19) In a less
selfish formulation the question may also be stated as “how to model and solve conflicts”.
Similar to MDPs, there are different possibilities concerning the rules and settings of the
game situations. Again, one of the most distinguishing criteria is the question of whether
the game is a differential game, which is time-continuous, or a (time-) discrete game. A
second criterion is whether the game is deterministic or stochastic. For a more detailed
overview one may consult the textbooks [13, 66, 86, 107, 158, 159] and the article [135].
Matrix games are a very special and restricted class of games but nevertheless helpful to
begin with. They belong to the class of two-player zero-sum games whereas only one time
step is under consideration (single-stage game), or whereas the situation does not change
with time (repeated matrix games). From a theoretical point of view repeated matrix
games are different from normal ones if the allowed strategies are deterministic but may be
changed over time. In such a case non-Markovian strategies like “tit-for-tat”, which depend
on the action history, are sometimes reasonable and offer different options than Markovian
stochastic strategies (mixed actions) that are independent from time. In the sequel, matrix
games are a basic element for the solution of the more general class of Markov games. This
is the reason why they are introduced and why a numerical method for solving this class
of games is presented.
Generally, two-player zero-sum games can be interpreted as games between two rivals: one
agent wants to avoid what the other tries to achieve and vice versa. Sometimes, such a
situation is called (completely) competitive because there is no potential for a compromise.
Typically, board games (backgammon, chess) or sport games (soccer, tennis) are of this
type because one agent or team of agents wins if and only if the other one loses.
Historical Remarks. [13] dates the theory of zero-sum games back to the 1920s when
Borel did some work which was translated into English much later [20]. Borel developed
some concepts of strategies but conjectured that the minimax theorem (Theorem 2.23) was
false. Von Neumann proved the opposite [154] (the first proof of the minimax theorem
was not as elementary as e. g. in [13]) and built the foundations of game theory which
are summarised in his famous book with Morgenstern [155]. Other early references include
[120, 136] and the work of Nash.
2.3.1 Basics and Problem Definitions
Matrix games are finite two-player zero-sum games in normal form whereas the extensive
form is represented by a tree structure [13]. This tree structure allows a more detailed
picture e. g. about the order of play and the information of the agents at each decision
epoch. However, for the Markov games considered later the standard assumption will be
that at every time step the actions of all agents take place simultaneously or, equivalently,
that one agent does not know the action of the other agent before the end of a decision
epoch. Thus, the representation of each time step by a matrix game is adequate. To
distinguish between zero-sum and general-sum games with two agents the latter ones are
called bimatrix games [66] to indicate that for each agent $P_i$, $i = 1, 2$, a reward matrix or
payoff matrix $R_i$ has to be specified. Below, the definitions of a bimatrix game and a (zero-sum)
matrix game are given and the interchangeability theorem is stated. Basic concepts
like the value of a game, the minimax theorem (Theorem 2.23), and the dominance of
actions are introduced.
(19)In game theory, agents are typically called players.
Bimatrix and Matrix Games
The definitions start with bimatrix games which are two-player general-sum games in nor-
mal form and their zero-sum variant called matrix games. It is intended to keep analogies
to MDPs (Definition 2.6) and, hence, slightly deviate from standard notation. Many of
the following statements and definitions stem from or are inspired by [66].
2.18 Definition (Bimatrix Game)
A bimatrix game $\Gamma$ is defined by
1.) a trivial decision epoch $D = \{t_0\}$,
2.) a trivial state space $S = \{s_0\}$,
3.) a finite state action space $SAO = \{(s, a, o) : s \in S, a \in A(s), o \in O(s)\}$ where
$A(s), O(s)$ are the (finite) sets of available actions of the first and the second agent,
$P_1$ and $P_2$, respectively,
4.) a trivial transition function $T : SAO \times S \to [0, 1]$ with $T(s_0, a, o, s_0) = 1$,
5.) a (deterministic) reward function for each agent $P_i$ denoted by $R_i : SAO \to \mathbb{R}$.
A repeated bimatrix game only differs in the decision epoch $D = \mathbb{N}_0$ from a bimatrix
game. The name bimatrix game stems from the fact that the only non-trivial elements
are the finitely many actions of both agents and the two reward functions $R_i$ that can be
represented by two matrices. The convention will be to write the matrices in such a way
that actions $A = \{a_1, \ldots, a_m\}$ of agent $P_1$ correspond to rows and actions $O = \{o_1, \ldots, o_n\}$
of agent $P_2$ correspond to columns of the matrices $R_i$ as the following scheme indicates:
$$\begin{array}{c|cccc}
 & o_1 & o_2 & \cdots & o_n \\ \hline
a_1 & r_{11} & r_{12} & \cdots & r_{1n} \\
a_2 & r_{21} & r_{22} & \cdots & r_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_m & r_{m1} & r_{m2} & \cdots & r_{mn}
\end{array}$$
2.19 Definition (Matrix Game)
A matrix game is defined to be a bimatrix game $\Gamma$ with the additional condition that the
sum of the reward matrices of the two agents is zero:
$$R_1 + R_2 = 0. \qquad (2.35)$$
A matrix game belongs to the class of two-player zero-sum games in normal form. Analogously
to bimatrix games, a repeated matrix game only differs in the decision epoch $D = \mathbb{N}_0$.
As for MDPs, each agent $P_1, P_2$ has to find an optimal policy $\pi^*_1, \pi^*_2$ which maximises its
return $R_1, R_2$, respectively. In the case of non-repeated (bi)matrix games this is simply
the selection of a mixed action(20), whereas the return $R_i$ of agent $P_i$ reduces to the expectation
over the single-stage return (only first summand in the sum of e. g. Equation 2.24).
(20)The set of mixed actions for agent $P_1$ is the $(m-1)$-dimensional simplex of probability distributions
$PD(A(s_0)) = \{x \in \mathbb{R}^m : x = (x_j)_j$ with $x \ge 0$, $\sum_{j=1}^m x_j = 1\}$, and for agent $P_2$ it is the analogous
$(n-1)$-dimensional simplex.
If the policies of the two agents are represented by column vectors ($\pi_1 \in \mathbb{R}^{m,1}$, $\pi_2 \in \mathbb{R}^{n,1}$)
and the reward functions $R_i$ by matrices as depicted above, then the expectation of the
single-stage return $R_i$ for agent $P_i$ can be expressed simply by matrix vector multiplications:
$$E_{\pi_1,\pi_2}\{R_i\} = \pi_1^T R_i \pi_2 = \sum_{k,l} (\pi_1)_k (R_i)_{kl} (\pi_2)_l \qquad (2.36)$$
with $A^T$ denoting the transpose of a matrix $A$.
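Equation 2.36 translates directly into a double sum. The short Python sketch below evaluates it for the matching pennies matrix; the function name expected_payoff and the chosen policies are assumptions for illustration.

```python
def expected_payoff(pi1, M, pi2):
    """pi1^T M pi2 = sum_{k,l} (pi1)_k * M_kl * (pi2)_l (cf. Equation 2.36)."""
    return sum(
        pi1[k] * M[k][l] * pi2[l]
        for k in range(len(pi1))
        for l in range(len(pi2))
    )


# Reward matrix of agent P1 in matching pennies; agent P2 receives the negative.
M = [[1, -1], [-1, 1]]

print(expected_payoff([0.5, 0.5], M, [0.5, 0.5]))  # both agents mix uniformly
print(expected_payoff([1, 0], M, [0, 1]))          # pure actions a1 and o2
```

For pure actions the expression simply picks the corresponding matrix entry, whereas for the uniform mixed actions the positive and negative entries cancel.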
Before the interchangeability theorem is stated the general definition of a Nash equilibrium
has to be introduced. More details on the conditions for the policy spaces under which Nash
equilibria exist can be found in [22]. For defining a Nash equilibrium very briefly, the
concept of best response is employed which is also applicable to the general case of $k$
agents. A policy $\pi^*_i$ of agent $P_i$ is an element of the set of best responses or best replies
$BR_i(\pi_{-i})$ to a joint policy $\pi_{-i}$ of all other agents if for all policies $\pi_i$ it holds that
$$E_{(\pi^*_i, \pi_{-i})}\{R_i\} \ge E_{(\pi_i, \pi_{-i})}\{R_i\} \qquad (2.37)$$
with $(\pi^*_i, \pi_{-i})$ being appropriately ordered. Then, a Nash equilibrium is a tuple of policies
$\pi^* = (\pi^*_1, \ldots, \pi^*_k)$ of the $k$ agents with
$$\forall i : \pi^*_i \in BR_i(\pi^*_{-i}) \qquad (2.38)$$
which means that no agent has an incentive to deviate unilaterally from its policy. A
difficulty in determining Nash equilibria, which is in some cases NP-hard [34, 32], arises
because the set of best responses depends on the joint policy $\pi_{-i}$ of all other agents. An
exception is the computation of Nash equilibria in (two-player zero-sum) matrix games
which is polynomial due to the solvability by linear programming [34]. A more precise
statement of the complexity of finding Nash equilibria in $n$-player games with $n \ge 4$ is given in
[40]. More details on the complexity of linear programming can be found below in the
subsection concerning numerical methods.
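For bimatrix games the equilibrium condition of Equation 2.38 can be tested numerically. Because the expected return is linear in an agent's own mixed action, it suffices to compare against all pure deviations. The following Python sketch does this; the function names, the tolerance, and the example matrices (a coordination game with $a = 2$, $b = 1$) are assumptions for illustration.

```python
def payoff(pi1, Rmat, pi2):
    """Expected single-stage return pi1^T R pi2."""
    return sum(
        pi1[k] * Rmat[k][l] * pi2[l]
        for k in range(len(pi1))
        for l in range(len(pi2))
    )


def unit(i, size):
    """Pure action i written as a degenerate mixed action."""
    return [1.0 if j == i else 0.0 for j in range(size)]


def is_nash(pi1, pi2, R1, R2, tol=1e-12):
    """Check that neither agent gains by a unilateral deviation to any pure action."""
    m, n = len(R1), len(R1[0])
    v1, v2 = payoff(pi1, R1, pi2), payoff(pi1, R2, pi2)
    no_gain_p1 = all(payoff(unit(i, m), R1, pi2) <= v1 + tol for i in range(m))
    no_gain_p2 = all(payoff(pi1, R2, unit(j, n)) <= v2 + tol for j in range(n))
    return no_gain_p1 and no_gain_p2


# Hypothetical coordination game: both agents receive the same reward matrix.
R1 = [[2, 0], [0, 1]]
R2 = [[2, 0], [0, 1]]

print(is_nash([1, 0], [1, 0], R1, R2))  # both choose their first action
print(is_nash([1, 0], [0, 1], R1, R2))  # miscoordination
```

The check only verifies a given candidate pair; it does not search for equilibria, which is the hard part discussed above.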
An important fact of arbitrary two-person zero-sum games which is not restricted to matrix
games but is not true for general-sum games(21) is about the interchangeability of equilibrium
pairs of strategies where equilibrium pair means Nash equilibrium. The theorem
below shows that all Nash equilibria are equally good and, hence, that a Nash equilibrium
is already a pair of globally optimal policies for both players. A pair of policies may also
be called a total policy $\pi = (\pi_1, \pi_2)$.
2.20 Theorem (Interchangeability and Equal Payoff of Equilibrium Strategies [159])
In a two-person zero-sum game $\Gamma$ let $(\pi_1, \pi_2)$ and $(\tilde\pi_1, \tilde\pi_2)$ be two pairs of equilibrium
strategies. Then $(\pi_1, \tilde\pi_2)$ and $(\tilde\pi_1, \pi_2)$ are also equilibrium pairs, and for $i = 1, 2$ holds:
$$E_{\pi_1,\pi_2}\{R_i\} = E_{\tilde\pi_1,\tilde\pi_2}\{R_i\} = E_{\pi_1,\tilde\pi_2}\{R_i\} = E_{\tilde\pi_1,\pi_2}\{R_i\}. \qquad (2.39)$$
The proof only uses the zero-sum property and the general equilibrium inequalities which
are valid for any strategy $\hat\pi_1, \hat\pi_2$, respectively: $E_{\pi_1,\pi_2}\{R_1\} \ge E_{\hat\pi_1,\pi_2}\{R_1\}$ and $E_{\pi_1,\pi_2}\{R_1\} \le
E_{\pi_1,\hat\pi_2}\{R_1\}$ to show the equality and that the new strategies are equilibria.
(21)Consider e. g. the bimatrix game with $R_1 = R_2 = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$ and $a > b > 0$ which is sometimes called
coordination game and has two pure Nash equilibria with expected returns of $a$ and $b$, respectively, which are equal for both agents.
2.21 Definition (Value of a Matrix Game [66])
A two-person zero-sum game $\Gamma$ is said to have a value $V$ if and only if
$$\sup_{\pi_1} \inf_{\pi_2} E_{\pi_1,\pi_2}\{R_1\} = \inf_{\pi_2} \sup_{\pi_1} E_{\pi_1,\pi_2}\{R_1\} \quad (=: V). \qquad (2.40)$$
This means that agent $P_1$ can assure itself a return $R_1 = V$ when acting optimally, independent
of what the other agent does. Vice versa, agent $P_2$ can assure itself a return $R_2 = -R_1 =
-V$.(22) The left-hand side of Equation 2.40 is called the lower value of the game and
the right-hand side the upper value. The upper and lower value represent security levels
of the two agents.
For a matrix game $\Gamma = [M]$ with value $V(M)$, the policies $\pi^\varepsilon_1, \pi^\varepsilon_2$ are called
$\varepsilon$-optimal ($\varepsilon \ge 0$) for agents $P_1$ and $P_2$ if
$$\inf_{\pi_2} E_{\pi^\varepsilon_1,\pi_2}\{R_1\} \ge V - \varepsilon \quad \text{and} \quad \sup_{\pi_1} E_{\pi_1,\pi^\varepsilon_2}\{R_1\} \le V + \varepsilon, \qquad (2.41)$$
respectively. 0-optimal strategies are called optimal. The set of optimal strategies for player
$P_k$ of a matrix game $\Gamma = [M]$ with matrix $M$ is denoted by $O^k(M)$ and the $\varepsilon$-optimal set
by $O^k_\varepsilon(M)$.
The following proposition can be obtained elementarily:
2.22 Proposition (Optimality of Saddle Points [66])
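Let $\Gamma = [M]$ be a matrix game. The following holds:
1.) $\sup_{\pi_1} \inf_{\pi_2} E_{\pi_1,\pi_2}\{R_1\} \le \inf_{\pi_2} \sup_{\pi_1} E_{\pi_1,\pi_2}\{R_1\}$.
2.) (Optimality of Saddle Points) When there exist policies $\pi^*_1, \pi^*_2$ such that for all $\pi_1, \pi_2$:
$E_{\pi_1,\pi^*_2}\{R_1\} \le E_{\pi^*_1,\pi^*_2}\{R_1\} \le E_{\pi^*_1,\pi_2}\{R_1\}$, then the value of the game exists and $\pi^*_1, \pi^*_2$
are optimal.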
An essential theorem for matrix games is the following one which was shown, among others,
by J. von Neumann ([154], comment in [209]):
2.23 Theorem (Minimax Theorem for Matrix Games [66])
For every matrix $M \in \mathbb{R}^{m,n}$ the corresponding matrix game $\Gamma = [M]$ has a value $V =
V(M)$ and both agents $P_1, P_2$ possess optimal strategies $\pi^*_1 \in \mathbb{R}^m$, $\pi^*_2 \in \mathbb{R}^n$.
A relatively simple direct proof of the minimax theorem can be found in [13]. More
elegantly, the theorem is obtained by duality in linear programming [194].(23)
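The security levels of Definition 2.21 are easy to evaluate for a given mixed action because the inner infimum (or supremum) over the opponent's mixed actions is attained at a pure action. The Python sketch below uses this fact; the function names and the matching pennies matrix are assumptions for illustration, and the sketch does not compute optimal strategies itself.

```python
def security_level_p1(pi1, M):
    """inf over pi2 of pi1^T M pi2, attained at a pure action (column) of P2."""
    rows, cols = len(M), len(M[0])
    return min(sum(pi1[i] * M[i][j] for i in range(rows)) for j in range(cols))


def security_level_p2(pi2, M):
    """sup over pi1 of pi1^T M pi2, attained at a pure action (row) of P1."""
    rows, cols = len(M), len(M[0])
    return max(sum(M[i][j] * pi2[j] for j in range(cols)) for i in range(rows))


M = [[1, -1], [-1, 1]]  # matching pennies, rewards of agent P1

# Restricted to pure actions, lower and upper value differ.
pure_lower = max(min(row) for row in M)
pure_upper = min(max(M[i][j] for i in range(2)) for j in range(2))
print(pure_lower, pure_upper)

# With uniform mixed actions both security levels coincide.
print(security_level_p1([0.5, 0.5], M), security_level_p2([0.5, 0.5], M))
```

Since both security levels coincide for the uniform mixed actions, these form a saddle point in the sense of Proposition 2.22 and the value of matching pennies is 0.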
A generalisation of convergence properties from MDPs to 2P-ZS-MGs [203] needs a non-expansion
property of determining the value of a game in terms of the matrix entries. In
view of this, the following matrix distance, which does not coincide with the $\|\cdot\|_\infty$-norm
for matrices, is introduced.
2.24 Definition (Distance of Game Matrices)
For two matrices $M_1, M_2 \in \mathbb{R}^{m,n}$ the game matrix distance $d_\Gamma$ is defined by
$$d_\Gamma(M_1, M_2) = \max_{i,j} |(M_1 - M_2)_{ij}|. \qquad (2.42)$$
(22)The notation deviates from standard as far as throughout this thesis every agent maximises its own
return, which is in the two-player zero-sum case equivalent to minimising the return of the other agent.
(23)The duality theorem is stated as Theorem 2.30.
2.25 Proposition (Properties of Matrix Games [66])
Let $M, M_1, M_2 \in \mathbb{R}^{m,n}$ be matrices and $J \in \mathbb{R}^{m,n}$ denote the matrix with $J_{ij} = 1$ for all
$i, j$.
1.) (Addition of Constants) For any $c \in \mathbb{R}$: $V(M + cJ) = V(M) + c$, and the optimal
strategy sets $O^1, O^2$ for both players are the same for the matrix games $[M]$ and
$[M + cJ]$. Thus, the assumptions $M_{ij} > 0$ and $V(M) > 0$ are not restrictions.
2.) (Monotonicity) If $(M_1)_{ij} \le (M_2)_{ij}$ for all $i, j$ then $V(M_1) \le V(M_2)$.
3.) (Non-Expansion, Continuity of Value) It holds $|V(M_1) - V(M_2)| \le d_\Gamma(M_1, M_2)$.
A reduction of matrix games can sometimes be achieved by eliminating dominated actions.
While this is theoretically interesting and may lead to non-trivial reductions if applied
recursively, the numerical effort of comparing each row to every other and likewise for the
columns seems to only be appropriate if a direct numerical solution of a matrix game fails.
2.26 Definition (Dominance of Actions [13])
Let Γ = [M]be a matrix game. The i-th action of player P1(row of matrix M) is said
to dominate the j-th action (row) if eT
iMeT
jM(i. e. for all k:mik mjk) and in
one component the inequality is strict. Similarly, the i-th action of player P2(column
of matrix M) is said to dominate the j-th action (column) if MeiMejand in one
component the inequality is strict. The actions are called strictly dominating if for all
components inequality is strict.
2.27 Theorem (Elimination of Dominant Actions [159])
Let $\Gamma = [M]$ be a matrix game, and assume that rows $i_1, \dots, i_k$ of $M$ are dominated. Then
agent $P_1$ has an optimal policy $\pi_1$ with $(\pi_1)_{i_1} = \dots = (\pi_1)_{i_k} = 0$. Moreover, any optimal
policy for the game obtained by removing the dominated rows will also be an optimal policy
for the original game.
The analogous statement for agent $P_2$ results from applying the theorem to $M^T$. [13]
gives an example in which deleting actions that are dominated, but not strictly dominated,
reduces the number of optimal strategies. As long as the aim is to compute the value of a
game and only one optimal strategy, this raises no problem.
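The recursive elimination described above can be sketched as follows. This is only an illustrative sketch, not code from the thesis: the function name `eliminate_dominated` is hypothetical, and it assumes that the row player $P_1$ maximises and the column player $P_2$ minimises the entries of $M$.

```python
def eliminate_dominated(M):
    """Iteratively remove dominated rows and columns of a payoff matrix.

    Assumption (not fixed by the text): rows belong to the maximising
    player P1, columns to the minimising player P2.
    Returns the indices of the surviving rows and columns.
    """
    rows = list(range(len(M)))
    cols = list(range(len(M[0])))
    changed = True
    while changed:
        changed = False
        # Row j is dominated by row i if i is at least as good in every
        # remaining column and strictly better in at least one.
        for j in rows:
            for i in rows:
                if i != j and all(M[i][k] >= M[j][k] for k in cols) \
                        and any(M[i][k] > M[j][k] for k in cols):
                    rows.remove(j)
                    changed = True
                    break
            if changed:
                break
        if changed:
            continue  # restart with the reduced matrix
        # Column j is dominated by column i if i is at least as small in
        # every remaining row and strictly smaller in at least one.
        for j in cols:
            for i in cols:
                if i != j and all(M[k][i] <= M[k][j] for k in rows) \
                        and any(M[k][i] < M[k][j] for k in rows):
                    cols.remove(j)
                    changed = True
                    break
            if changed:
                break
    return rows, cols


# Row 1 is dominated by row 0; afterwards column 0 is dominated by column 1.
example = [[2, 1, 4], [1, 0, 3], [3, 2, 1]]
surviving_rows, surviving_cols = eliminate_dominated(example)
```

Each pass compares every remaining row with every other row and likewise for the columns, which is the numerical effort referred to in the text.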
2.28 Example (Robot Soccer, 3)
A well-known matrix game is matching pennies, which is given by the matrix
$$ M = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}. \tag{2.43} $$
In a robot soccer environment the same type of matrix game may arise if one simplifies a
situation where a robot tries to dribble the ball around one opponent robot. The decision
of the dribbling agent is to go left (a1) or right (a2), while the opponent agent has to decide
to block left (o1) or right (o2). If both choose the same direction then the dribbler loses
the ball, otherwise the move is successful.
It is also possible to consider the complete tactics of a soccer game as an action of a matrix
game. More precisely, the actions are to choose the tactics before the game starts and the
reward matrix is related to the expected scoring at the end of the game. Obviously, this
concept is not suitable to initially construct different tactics and the tactics of particular
game situations are not alterable. Furthermore, obtaining the expectations of all strategy
combinations requires a lot of games to be played.
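As a worked illustration of the dribbling situation, the value and the optimal mixed policy of matching pennies can be derived by hand. The following is a sketch under the assumption that the payoff matrix is $\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$ and that the row player chooses its first action with probability $p$ (the symbol $p$ is introduced here only for illustration):

```latex
\begin{align*}
\pi_1 = (p,\, 1-p): \quad
  \pi_1^T M e_1 &= p - (1-p) = 2p - 1, \\
  \pi_1^T M e_2 &= -p + (1-p) = 1 - 2p, \\
V(M) = \max_{p \in [0,1]} \min\{\, 2p - 1,\; 1 - 2p \,\} &= 0
  \quad \text{attained at } p = \tfrac{1}{2}.
\end{align*}
```

No pure action attains this value, so the dribbler has to randomise between going left and going right.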
2.3.2 Numerical Methods
Bimatrix games can be solved by the Lemke-Howson algorithm, see e. g. the excellent
survey of (bi)linear methods for solving bimatrix games in [194] or the original work [106],
or by the Mangasarian-Stone algorithm (Equation 2.46).(24) Accordingly, it is possible to
numerically solve matrix games by linear programming. An overview of linear programming
is given in [208] as well as in the textbook [39] by Dantzig, who was the pioneer of the simplex
method.
Linear Programming (LP) and Matrix Games
For determining the optimal value function $V^*$ of a Markov game it is necessary to calculate
the value of matrix games. Numerically, the solution of matrix games is addressed by LP,
which motivates the following description. It includes a definition of linear programs, the
duality theorem, the solution of matrix games(25), and concludes with the Mangasarian-
Stone algorithm for the solution of bimatrix games.
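2.29 Definition (Primal and Dual Linear Program)
A linear program is determined by a triplet $(M, b, c)$ where $M \in \mathbb{R}^{m,n}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$.
This triplet defines two related optimisation problems:
$$
\begin{array}{llcll}
\textbf{Primal} & & \qquad & \textbf{Dual} & \\
\text{maximise} & c^T x & & \text{minimise} & b^T y \\
\text{subject to} & M x \le b & & \text{subject to} & M^T y \ge c \\
 & x \ge 0 & & & y \ge 0
\end{array}
\tag{2.44}
$$
The sets $S_p = \{x \in \mathbb{R}^n \mid Mx \le b,\ x \ge 0\}$ and $S_d = \{y \in \mathbb{R}^m \mid M^T y \ge c,\ y \ge 0\}$ are
called feasible sets for the primal and dual program, respectively. Elements of these sets
are feasible solutions, and the primal or dual linear program is called feasible if the
corresponding feasible set is non-empty.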
2.30 Proposition (LP Duality Theorem)
1.) If either of the primal or dual linear programs has a finite optimal solution, so does
the other, and the corresponding values of the objective functions are equal.
2.) If either of the primal or dual linear programs has an unbounded objective, the other
problem has no feasible solution.
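For a game matrix $M \in \mathbb{R}^{m,n}$ define the matrix $\tilde M \in \mathbb{R}^{m+1,n+1}$ and the vectors
$\tilde b \in \mathbb{R}^{m+1}$, $\tilde c \in \mathbb{R}^{n+1}$ by
$$
\tilde M = \begin{pmatrix} & & & 1 \\ & -M & & \vdots \\ & & & 1 \\ 1 & \cdots & 1 & 0 \end{pmatrix}, \quad
\tilde b = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}, \quad
\tilde c = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}.
\tag{2.45}
$$
The solution of a matrix game $\Gamma = [M]$ by LP is then treated by the following proposition: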
2.31 Proposition (LP for Solving Matrix Games [66])
Let $\Gamma = [M]$ be a matrix game with $V(M) > 0$ and let $\tilde M, \tilde b, \tilde c$ be as in Equation 2.45. Then
1.) the primal and dual LP $(\tilde M, \tilde b, \tilde c)$ are feasible and thus have bounded solutions, and
2.) the optimal values of both programs equal $V(M)$ and
$\pi_2 \in O_2(M) \Leftrightarrow (\pi_2, V(M)) \in \mathbb{R}^{n+1}$ is optimal in the primal LP,
$\pi_1 \in O_1(M) \Leftrightarrow (\pi_1, V(M)) \in \mathbb{R}^{m+1}$ is optimal in the dual LP.
(24) Other options also exist such as fictitious play.
(25) Conversely, it is also possible to solve linear programs by special matrix games [208].
The last row of $\tilde M$ ensures in the primal LP that $\sum_i (\pi_2)_i \le 1$ while the last column ensures
in the dual LP that $\sum_i (\pi_1)_i \ge 1$. The condition $V(M) > 0$ has the effect that these sums
equal 1 for an optimal solution, so that the constraints can be replaced by
$\pi_1 \in PD(\mathcal{A}(s_0))$, $\pi_2 \in PD(\mathcal{O}(s_0))$ [66]. The application of LP can also be seen as a
special case ($M_1 = M$, $M_2 = -M$, separating into two independent LPs) of the
Mangasarian-Stone algorithm for bimatrix games [129], which is the following bilinear quadratic program:
$$
\begin{array}{ll}
\text{maximise} & \pi_1^T M_1 \pi_2 + \pi_1^T M_2 \pi_2 - (c_1 + c_2) \\
\text{subject to} & \pi_1 \in PD(\mathcal{A}(s_0)),\ \pi_2 \in PD(\mathcal{O}(s_0)) \\
 & e_{a_i}^T M_1 \pi_2 \le c_1 \quad (\text{for all } a_i \in \mathcal{A}(s_0)) \\
 & \pi_1^T M_2 e_{o_i} \le c_2 \quad (\text{for all } o_i \in \mathcal{O}(s_0)) \\
 & c_1, c_2 \in \mathbb{R},
\end{array}
\tag{2.46}
$$
where $e_{a_i} \in PD(\mathcal{A}(s_0))$, $e_{o_i} \in PD(\mathcal{O}(s_0))$ are policies corresponding to a pure action, i. e.
$e_{a_i} = e_i$ being the $i$-th unit vector. The Mangasarian-Stone algorithm can be suitably
enhanced to determine all Nash equilibria in a general (nonzero-sum multi-stage $n$-agent)
Markov game [65]. The optimisation criterion as well as the constraints look very similar
to the Bellman equation of MDPs (Equations 2.29 and 2.30), which is reasonable because
fixing all but one agent’s strategies results in an MDP and the Nash equilibrium is defined
by the absence of any incentive for a unilateral deviation of one agent.
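Fictitious play, mentioned in footnote (24) as an alternative to LP, is simple enough to sketch without any solver library. The following is a minimal illustration only (it is not the method used in this thesis, and the function name `fictitious_play` is hypothetical). It assumes that rows belong to the maximising player and columns to the minimising player, and it returns a lower and an upper bound which always enclose the value of the game.

```python
def fictitious_play(M, iterations=10000):
    """Bound the value of the matrix game [M] by fictitious play.

    Assumption: the row player maximises, the column player minimises.
    Each player repeatedly plays a best response to the empirical
    frequencies of the opponent's past actions.
    """
    m, n = len(M), len(M[0])
    row_payoff = [0.0] * m  # cumulative payoff of each row vs. past columns
    col_payoff = [0.0] * n  # cumulative payoff of each column vs. past rows
    for _ in range(iterations):
        i = max(range(m), key=lambda r: row_payoff[r])  # best row response
        j = min(range(n), key=lambda c: col_payoff[c])  # best column response
        for r in range(m):
            row_payoff[r] += M[r][j]
        for c in range(n):
            col_payoff[c] += M[i][c]
    # The empirical row mixture guarantees at least `lower`,
    # the empirical column mixture concedes at most `upper`.
    lower = min(col_payoff) / iterations
    upper = max(row_payoff) / iterations
    return lower, upper


lower, upper = fictitious_play([[1, -1], [-1, 1]])
```

For matching pennies the two bounds enclose the value 0 and approach it slowly, which illustrates why LP is the preferred numerical method here.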
2.3.3 Complexity, Algorithmic Issues and Software
Linear programs can be solved in polynomial time by interior point methods, whereas the
standard simplex method has exponential worst-case complexity [208](26). Numerical studies can
also be found therein, as well as a bound on the approximation quality of matrix game solutions
which was found by von Neumann [155]: To approximate the value $V(M)$ of a matrix game
$\Gamma = [M]$ with matrix $M \in \mathbb{R}^{m,n}$ to within an accuracy of $\varepsilon$, a special linear programming
algorithm different from the simplex method needs at most
$$ \frac{(m+n)\,(\max_{ij} M_{ij} - \min_{ij} M_{ij})^2}{\varepsilon^2} \tag{2.47} $$
steps, each of which requires about $4mn$ flops.
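To get a feeling for the size of this bound, it can be evaluated for a concrete matrix. The helper below is a small sketch with a hypothetical name (`von_neumann_step_bound`); it only evaluates Equation 2.47 and the flop estimate from the text and does not implement von Neumann's algorithm itself.

```python
import math


def von_neumann_step_bound(M, eps):
    """Evaluate the step bound of Equation 2.47 and the resulting flop count.

    M is a list of rows, eps the desired accuracy of the game value.
    """
    m, n = len(M), len(M[0])
    entries = [x for row in M for x in row]
    spread = max(entries) - min(entries)
    steps = math.ceil((m + n) * spread ** 2 / eps ** 2)
    flops = steps * 4 * m * n  # about 4mn flops per step
    return steps, flops


# Matching pennies with accuracy 0.5: (2 + 2) * 2^2 / 0.25 = 64 steps.
steps, flops = von_neumann_step_bound([[1, -1], [-1, 1]], 0.5)
```

The quadratic dependence on $1/\varepsilon$ means that halving the tolerance quadruples the bound.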
Throughout this thesis the solution of matrix games is performed by linear programming
methods which are embedded in the MATLAB software package, namely a primal-dual
method and a simplex method.
2.4 Two Player Zero Sum Markov Games (2P-ZS-MGs)
Finally, two-player zero-sum Markov games provide a suitable concept for modelling robot
soccer. As mentioned before they are a generalisation of both MDPs and matrix games,
(26)There exist subexponential variants of the simplex method but they are nevertheless superpolynomial.
for they describe a situation in which two agents try to achieve opposite goals. Despite
the generalisation of MDPs, many concepts, statements, and algorithms are transferable
from MDPs to 2P-ZS-MGs. For example, the Bellman equation and, therefore, the basic
algorithms value iteration and Q-learning only have to be changed by determining the solution
of a matrix game (max-min) instead of the maximisation problem (max).
One major dissimilarity is the necessity for mixed actions in 2P-ZS-MGs(27) while for
MDPs it is guaranteed that an optimal policy in pure actions exists [168]. This is induced
by the fact that the decisions of both agents at each decision epoch have to be made at
the same time without the knowledge of the other agent’s decision. The implications are
far-reaching for more advanced concepts of learning: for example, team learning algorithms
with coordination mechanisms which rely on a deterministic optimal policy become
inapplicable. [103, 90] give some ideas of coordination mechanisms to overcome difficulties with
previous approaches [89, 102]. However, applicability of 2P-ZS-MGs includes not only two-
person and two-team board games and sports games with competitive character but also
the modelling of a worst-case optimal policy in an MDP with uncertainties by considering
the “uncertainty generator” as a competitive player (a “game against nature” [147]).
2.4.1 Basics and Problem Definitions
2.32 Definition (Two Player Zero-Sum Markov Game)
A (discrete time, finite) two-player zero-sum Markov game $M = (\mathcal{D}, \mathcal{S}, \mathcal{SAO}, T, R)$ is given
by
1.) decision epochs $\mathcal{D} = \mathbb{N}_0$,
2.) a (finite) state space $\mathcal{S}$,
3.) a (finite) state action space $\mathcal{SAO} = \{(s, a, o) : s \in \mathcal{S},\ a \in \mathcal{A}(s),\ o \in \mathcal{O}(s)\}$ where
$\mathcal{A}(s), \mathcal{O}(s)$ are the (finite) sets of available actions for the agents $P_1$ and $P_2$,
respectively, in state $s$,
4.) a transition function $T : \mathcal{SAO} \times \mathcal{S} \to [0, 1]$ with $T(s, a, o, s')$ being the probability of
reaching state $s'$ if choosing the action pair $(a, o)$ in state $s$,(28)
5.) a (deterministic) reward function $R : \mathcal{SAO} \to \mathbb{R}$.
Policies $\pi_i$ are defined for each player as for MDPs. The goal of determining an optimal
policy for agent $P_1$ and the definitions of value functions $V$ and returns $R$ are (nearly)
the same as for MDPs.(29) The reason why the value function and return are independent
of the policy $\pi_2$ of the second agent is that it is always assumed to be a worst-case
answer against the first agent ($\pi_2 \in BR(\pi_1)$). Hence, the second player’s policy is not an
independent variable but depends on the policy of the first one: $\pi_2 = \pi_2(\pi_1)$. Theorem 2.20
yields the well-definedness of the optimal value function: the zero-sum property guarantees
that for all $\pi_2^* \in BR(\pi_1^*)$ the value functions and returns are equal. [159] contains a proof
of existence of a pair of (stationary) optimal policies for 2P-ZS-MGs.
As alluded to in the introduction the max-operator simply needs to be replaced by a (max-
min)-operator in the Bellman equation of an MDP (Equation 2.29) in order to obtain the
corresponding optimality principle for a 2P-ZS-MG:
(27)[112] refers to rock-paper-scissors as an example for the need of mixed actions in 2P-ZS-MGs.
(28) An alternative for 2P-ZS-MGs with deterministic state transitions is to define the modified transition
function $\tilde T : \mathcal{SAO} \to \mathcal{S}$ with $\tilde T(s, a, o) = s'$ being the next state which is reached with probability 1.
(29) Especially, the discounted return $R_{disc}$ is considered to be standard.
2.33 Theorem (Optimality Principle (Shapley’s Theorem [66]))
A (state-) value function $V$ of a 2P-ZS-MG is optimal iff it is the unique solution to the
Bellman equation:
$$ V(s) = \max_{\pi_s \in PD(\mathcal{A}(s))} \ \min_{o \in \mathcal{O}(s)} \ \sum_{a \in \mathcal{A}(s)} \pi_s(a) \cdot Q(s, a, o) \tag{2.48} $$
where
$$ Q(s, a, o) = R(s, a, o) + \gamma \sum_{s' \in \mathcal{S}} T(s, a, o, s') \cdot V(s'). \tag{2.49} $$
As for MDPs, the result of plugging Equation 2.49 into Equation 2.48 can be abbreviated
by the Bellman operator $B_{MG}$, which shortens the notation of the Bellman equation to
$V = B_{MG} V$, and the operator is a contraction with rate $\gamma$ with respect to $\|\cdot\|_\infty$ [97]. Again,
the Bellman equation is a non-linear fixed point equation for the optimal value function
$V^* = V^{\pi^*}$.
One additional result which is not mentioned for MDPs but is also valid because these
are special cases of 2P-ZS-MGs concerns the continuous dependency of the optimal value
function on the Markov game:
2.34 Theorem (Continuity of Optimal Value Functions [66])
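Given a state space $\mathcal{S}$ and a state-action space $\mathcal{SAO}$, the optimal value function
$V^* = V^*(R, T, \gamma)$ is continuous with respect to the metric
$$ d_{R,T}((R_1, T_1, \gamma_1), (R_2, T_2, \gamma_2)) = \max\{\, \|R_1 - R_2\|_R,\ \|T_1 - T_2\|_T,\ |\gamma_1 - \gamma_2| \,\} \tag{2.50} $$
with
$$ \|R_1 - R_2\|_R = \max_{(s,a,o)} |R_1(s, a, o) - R_2(s, a, o)| \tag{2.51} $$
and
$$ \|T_1 - T_2\|_T = \max_{(s,a,o)} \sum_{s' \in \mathcal{S}} |T_1(s, a, o, s') - T_2(s, a, o, s')|. \tag{2.52} $$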
2.4.2 Numerical Methods
As for MDPs, the two important classes of approaches are dynamic programming (DP) and
reinforcement learning (RL) methods. Most of the basic methods can also be applied to 2P-
ZS-MGs including value iteration [159], policy iteration (without finite time convergence),
and Q-learning [203]. The reason is that not only the max-operator is a non-expansion
(i. e. $|\max_a f(s, a) - \max_a g(s, a)| \le \max_a |f(s, a) - g(s, a)|$ for all functions $f, g$ and all
states $s$, [203]) but also the max-min:
$|\max_{\pi_a} \min_o f(s, a, o) - \max_{\pi_a} \min_o g(s, a, o)| \le \max_{(a,o)} |f(s, a, o) - g(s, a, o)|$.
The statement for the max-operator follows from the fact
that, if without restriction $\max_a f(s, a) \ge \max_a g(s, a)$ and $a^* = \arg\max_a f(s, a)$, then
$|\max_a f(s, a) - \max_a g(s, a)| = f(s, a^*) - \max_a g(s, a) \le f(s, a^*) - g(s, a^*) \le \max_a |f(s, a) - g(s, a)|$.
Regarding the max-min, the non-expansion property is stated already in Proposition 2.25,
because the max-min of $f(s, a, o)$ is the value of a matrix game with action pairs
$(a, o)$ (for each $f$ and each $s$).
The definition of value iteration is the only one which is repeated in this section to stress the
similarities: the difference is in fact only to replace $B_{MDP}$ by $B_{MG}$. According to [21] the
value iteration algorithm for 2P-ZS-MGs was already designed by Shapley [186] before the
corresponding algorithm for MDPs.
2.35 Definition (Value Iteration (2P-ZS-MG))
The following algorithm is called value iteration: select $\varepsilon > 0$, choose an arbitrary initial
guess $V_0 \in \mathbb{R}^{|\mathcal{S}|}$ for the (state-) value function, and determine iteratively $V_k = B_{MG} V_{k-1}$
for $k = 1, 2, \dots$ until $\|V_{k+1} - V_k\| \le \frac{\varepsilon}{2} \cdot \frac{1-\gamma}{\gamma}$.
This algorithm would converge to $V^*$, and provide an $\frac{\varepsilon}{2}$-approximation for the value
function estimate and an $\varepsilon$-optimal stationary policy (as for MDPs), if one assumes that $B_{MG}$
can be calculated exactly. However, as proposed in Section 2.3 the solution of the matrix
game by LP methods can only be determined numerically. Although for numerically
applying $B_{MDP}$ the maximum of finitely many values can be determined exactly, the
representation of real numbers by machine numbers may lead to small discrepancies.
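Value iteration for a 2P-ZS-MG can be sketched in a few lines if the matrix games are small. The sketch below is an illustration only and not the implementation used in this thesis: it assumes exactly two actions per agent in every state, so that each matrix game can be solved by the closed-form expression for $2 \times 2$ games instead of LP, and all names (`value_2x2`, `value_iteration`, the nested data layout of `R` and `T`) are hypothetical.

```python
def value_2x2(Q):
    """Value of a 2x2 matrix game (row player maximises, column minimises)."""
    (a, b), (c, d) = Q
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:  # saddle point in pure actions
        return maximin
    # No saddle point: both players mix, closed-form value.
    return (a * d - b * c) / (a + d - b - c)


def value_iteration(states, R, T, gamma, eps):
    """Iterate V <- B_MG V until the stopping criterion of the definition holds.

    R[s][a][o] is the reward, T[s][a][o] a dict {next_state: probability}.
    """
    V = {s: 0.0 for s in states}
    threshold = eps / 2 * (1 - gamma) / gamma
    while True:
        V_new = {}
        for s in states:
            Q = [[R[s][a][o]
                  + gamma * sum(p * V[t] for t, p in T[s][a][o].items())
                  for o in range(2)] for a in range(2)]
            V_new[s] = value_2x2(Q)
        diff = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if diff <= threshold:
            return V


# One state in which a shifted matching pennies game (value 1) is repeated.
stay = {"s": 1.0}
R = {"s": [[2, 0], [0, 2]]}
T = {"s": [[stay, stay], [stay, stay]]}
V = value_iteration(["s"], R, T, gamma=0.5, eps=0.1)
```

In this example the fixed point is $V^*(s) = 1 / (1 - \gamma) = 2$, and the returned estimate lies within $\varepsilon / 2$ of it.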
In the following, the result of Lemma 4.2 is anticipated since it fits well into this con-
text and its interpretation is new in the sense that the error of solving a matrix game is
mathematically equivalent to the error of using supervised learning techniques. Also, the
comments and discussion in Section 4.2 are correspondingly valid.
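2.36 Lemma (Error of Numerical Value Iteration (2P-ZS-MG))
Let $B_{MG}$ be the Bellman operator and $\tilde B_{MG}$ a numerical realisation with
$\|\tilde B_{MG} V - B_{MG} V\| \le \varepsilon_1 \|V\|$.(30) Let further $\tilde V_0 = V_0$, $V_k = (B_{MG})^k V_0$, and
$\tilde V_k = (\tilde B_{MG})^k \tilde V_0$ be the corresponding $k$-th value iterates. Then
$$
\|\tilde V_k - V_k\| \le \underbrace{\varepsilon_1 \cdot \sum_{i=0}^{k-1} \gamma^{k-i-1} \|\tilde V_i\|}_{= E_V(k)}
\le \frac{\varepsilon_1}{1-\gamma} \max_{i=0,\dots,k-1} \|\tilde V_i\|.
\tag{2.53}
$$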
2.37 Corollary (New Stopping Criterion for Numerical Value Iteration)
If the stopping criterion is changed to $\|\tilde V_{k+1} - \tilde V_k\| \le c(\tilde V_0, \dots, \tilde V_k)$ with
$$
c(\tilde V_0, \dots, \tilde V_k) = \left( \frac{\varepsilon}{2} - E_V(k+1) \right) \cdot \frac{1-\gamma}{\gamma} - \bigl( E_V(k+1) + E_V(k) \bigr)
\tag{2.54}
$$
where the errors $E_V(k)$ depend on the first $k-1$ numerical value iterates (Equation 4.3),
and secondly, if the numerical approximation $\tilde B_{MG}$ is a contraction of rate $\gamma$ (like $B_{MG}$),
then also the numerical approximation of value iteration yields results comparable to the
original value iteration.
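The error term and the modified stopping threshold are cheap to evaluate alongside the iteration. The following sketch uses hypothetical function names and assumes that $E_V(k)$ denotes the whole bound $\varepsilon_1 \sum_{i=0}^{k-1} \gamma^{k-i-1} \|\tilde V_i\|$ and that Equation 2.54 reads $c = (\varepsilon/2 - E_V(k+1)) \cdot (1-\gamma)/\gamma - (E_V(k+1) + E_V(k))$; both readings are assumptions of this illustration.

```python
def accumulated_error(eps1, gamma, norms):
    """E_V(k) for k = len(norms), where norms[i] is the norm of iterate i."""
    k = len(norms)
    return eps1 * sum(gamma ** (k - i - 1) * norms[i] for i in range(k))


def stopping_threshold(eps, eps1, gamma, norms):
    """Threshold c, given the norms of the iterates with indices 0, ..., k."""
    e_k = accumulated_error(eps1, gamma, norms[:-1])   # E_V(k)
    e_k1 = accumulated_error(eps1, gamma, norms)       # E_V(k+1)
    return (eps / 2 - e_k1) * (1 - gamma) / gamma - (e_k1 + e_k)


# Three iterates of norm 1, relative operator error 1 %, discount 0.5.
norms = [1.0, 1.0, 1.0]
e3 = accumulated_error(0.01, 0.5, norms)          # 0.01 * (0.25 + 0.5 + 1)
c = stopping_threshold(0.2, 0.01, 0.5, norms)
```

A negative threshold signals that the accuracy $\varepsilon$ is not attainable with the given operator error $\varepsilon_1$.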
2.4.3 Complexity, Algorithmic Issues and Software
In principle, the statements for MDPs of Section 2.2.3 are valid, e. g. for the effort per
iteration, with the difference that the costs of solving $|\mathcal{S}|$ matrix games of possibly
different sizes $|\mathcal{A}(s)| \cdot |\mathcal{O}(s)|$ have to be added. The complexity of solving matrix games e. g.
by LP is discussed in Section 2.3.3. Furthermore, Condon shows for the case of simple
stochastic games, which are a restricted class of two-player games, that deciding whether
the probability of winning for one player is $> 0.5$ is in NP $\cap$ co-NP [33].
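(30) $\|\tilde B_{MG} V - B_{MG} V\| \le \varepsilon_1 \|V\|$ for all $V$ means that the operator
$B_1 := \frac{1}{\varepsilon_1} (\tilde B_{MG} - B_{MG})$ has operator norm at most 1:
$\|B_1\| = \sup_{V \ne 0} \frac{\|B_1 V\|}{\|V\|} \le 1$. Furthermore, the above definition implies
$\tilde B_{MG} = B_{MG} + \varepsilon_1 B_1$.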
There are freely available software frameworks for reinforcement learning, but typically they
are only designed to solve MDPs. Often, it is time-consuming to provide a transition
model and a suitable representation of the problem. Thus, the MATLAB-based software
package DRPOST was implemented by the author for the special purpose of determining optimal
strategies in multi-player grid soccer.
2.5 General Markov Games, Differential Games, and Advanced Concepts of RL
This section gives pointers to interesting concepts that are related to the preceding
sections but are not a main focus of this thesis. Some of the concepts offer
alternatives to those in Section 5.1 for modelling a robot soccer game. The concepts
range from general $n$-player games via differential games to more advanced concepts of
learning.
Discrete Games with n Agents
It can not be expected that the results of this thesis are extendable to general-sum multi-
player Markov games as [218] shows that suitably adapted value iteration methods (best
response instead of max-min strategies) do not even need to converge to a stationary
policy. This is even true in games with two players, alternating turns, and deterministic
transitions.
Differential Games
In contrast to other games discussed before differential games model time as being contin-
uous (RL with continuous time is treated by [56]). Two standard textbooks on differential
games are [86] and, especially on pursuit-evasion games, [107] where some of the notation
is adapted from.
The evolution of a differential game is modeled by a differential equation
$$ \dot x = f(x, u, v), \tag{2.55} $$
where $\dot x = \frac{dx}{dt}$ and $x(t), u(t), v(t) \in \mathbb{R}^n$. This can also be seen as a dynamical system
with two controls $u, v$. Other aspects which are necessary to describe a game are a terminal
function which determines whether a terminal state of the game is reached (often one
considers merely a fixed terminal set), and an outcome functional of the game for every
player $P_i$ that can consist of an outcome dependent on a trajectory and a final outcome
dependent on the final state. The aim of each player is, as for MDPs, to maximise its own
outcome. Some statements about uniqueness and continuity of the solution to differential
game trajectories including different information patterns can be found in [13].
In a newer context, viscosity solutions to the so-called Hamilton-Jacobi-Bellman-Isaacs
equation (a non-linear PDE) are introduced, see [9, 36] and references therein. This concept
may be considered a weak solution concept. It overcomes the difficulty that the value
function, being non-differentiable, is no classical solution to the
Hamilton-Jacobi-Bellman-Isaacs equation, while other generalisations may yield non-unique
solutions (even for MDPs [152]). A good overview of theoretical results as well as numerical
approximation schemes and examples is contained in [63].
Advanced Concepts of RL
The concept of generalisation is discussed at more length in Chapter 4; model reduction
is outlined in Chapter 3. At this place, semi-Markov decision processes (S-MDPs), hierar-
chical Markov decision processes (H-MDPs), partial observability (PO-MDPs), and some
trade-offs and special artifices to speed up convergence are to be addressed.
Hierarchical learning and S-MDPs. S-MDPs are a generalisation of MDPs with
respect to the duration of performing actions. The assumption of MDPs that every time
step is equidistant (every action takes the same amount of time) is dropped for S-MDPs.
Continuous-time discrete event versions exist as well as the easier discrete time version
which allows only durations being integer multiples of a unit duration. S-MDPs are the
natural framework for H-MDPs [11] which typically consist of a hierarchy of learning
problems with at least two different levels. Dietterich [53] provides with MAX-Q a multi-
level approach in which the hierarchy structure must be given in advance. [11] gives a
broad overview. Earlier examples often follow a two-level approach and include [42, 55,
109, 124, 127, 190] while newer examples include [81, 161, 197, 215]. [191] gives additional
useful references.
Partially observable MDPs (PO-MDPs). One problem of real world tasks is often
that the observed information is noisy or incomplete. Unfortunately, complete information
is necessary to solve MDPs because the agent needs to know the state for which the
updates of the value function have to be done. A formal model which incorporates the
effect of a lack of information is called partially observable Markov decision process. There
are several strategies to deal with PO-MDPs [88]: State-free deterministic or stochastic
policies i. e. ignoring the partial observability can lead to non-Markovity. Determining a
deterministic optimal policy (mapping from observation to actions) is NP-hard [113]. By
stochastic policies locally optimal results can be obtained [87]. In most environments, true
improvements can only be achieved by memorising previous actions and observations. Some
approaches are recurrent Q-learning (using a recurrent neural network to learn features of
history [110, 137, 182]), classifier systems [71, 54, 27, 100] which were originally similar to
Q-learning, interval-based estimation of the transition probabilities [38], and finite-history
window approaches with fixed [110] or variable length (utile suffix memory) [134], possibly
in combination with neural network approaches [172]. Finally, PO-MDP approaches use a
hidden Markov model (HMM) to learn a model of the environment and construct a perfect
memory controller [29, 119, 140, 173, 174, 189]. The discrete PO-MDP can be transformed
into an equivalent continuous space belief MDP, with the probability distributions over
the discrete states being the new states. Thus, roughly speaking, a PO-MDP with $n$ states
can be converted to an MDP with state space $\mathbb{R}^n$, which clearly shows the limits of this
approach. Principal component analysis seems to be an appropriate method to reduce the
huge belief space of PO-MDPs [174]. A good overview of PO-MDPs and a comparison of
methods for very small problems (less than 20 states, 5 actions, and 10 observation states)
can be found in [114].
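The transformation into a belief MDP rests on a Bayesian update of the belief after every action and observation. The following sketch shows this standard update for a finite PO-MDP; the function name `belief_update` and the index layout of the transition model `T` and the observation model `Z` are assumptions of this illustration, not notation from the cited works.

```python
def belief_update(belief, action, observation, T, Z):
    """Bayesian belief update for a finite PO-MDP.

    belief[s]          : current probability of being in state s
    T[s][action][s2]   : probability of moving from s to s2
    Z[action][s2][obs] : probability of observing obs in the new state s2
    Returns the new belief, i.e. the next state of the belief MDP.
    """
    n = len(belief)
    unnormalised = [
        Z[action][s2][observation]
        * sum(T[s][action][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has probability zero under this belief")
    return [x / total for x in unnormalised]


# Two states, one action that leaves the state unchanged, and a sensor
# that reports the true state with probability 0.75.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
Z = [[[0.75, 0.25], [0.25, 0.75]]]
new_belief = belief_update([0.5, 0.5], 0, 0, T, Z)
```

Every belief is a point in the probability simplex over the $n$ discrete states, which is why the resulting MDP has a continuous state space.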
Special Artifices to Speed Up Convergence of DP and RL methods. Convergence
properties are in general asymptotic i. e. to some degree useless for practical needs where
the convergence rate at the beginning of learning is more interesting. Therefore, speed
of convergence is an ill-defined measure while speed of convergence to near-optimality is
more fruitful [88]. One performance measure in this sense is the regret [17] which describes
the loss of return while learning a policy in comparison to executing the (a priori known)
optimal policy. Of course, it is very hard to estimate the regret. Other artifices are updating
states with higher Bellman error ($\|V_{k+1} - V_k\|$) more often, using combinations of value
iteration and policy iteration like methods (generalised policy iteration [202]), modifying the
model (mostly the reward function (shaping) or the start position (Q-learning)), or using
heuristic knowledge to avoid starting from scratch (modify $Q_0, V_0$ such that they correspond
to a heuristic (human) policy).
Learning versus Dynamic Programming
The advantage of RL methods is that they do not need a model while the disadvantage is
that they need a lot of training data or episodes to create reliable estimates. For the soccer
example considered later a middle course shall be followed: an offline computation with a coarse
model yields a good estimate of the value function. In the real game, this estimate will be
the initial value function from which the learning process starts instead of starting from
scratch. One goal of this thesis is to construct such an initial guess from a coarse model
for a competitive multi-agent soccer game.
Two further issues, which are especially under consideration in RL but also are related to
DP, are the temporal credit assignment problem which concerns the question how much
a single action contributes to the complete sequence of actions, and the structural credit
assignment problem of cooperative multiple agents, i. e. how much a single agent contributes
to solving the overall task [1]. A survey of cooperative agents with an extensive list of
references is [160].
Chapter 3
Model Reduction and Symmetry
Contents
3.1 Homomorphisms and Symmetry in MDPs
3.1.1 Equivalence of MDP Homomorphisms and MDP Symmetries
3.1.2 Symmetries by Group Actions on MDPs
3.2 Homomorphisms and Symmetry in 2P-ZS-MGs
3.2.1 2P-ZS-MG Homomorphisms and Symmetry
3.2.2 Automorphisms for the Exchange of Agents
Order and simplification are the first steps
towards the mastery of a subject.
P. Thomas Mann (1875–1955)
(http://www.nonstopenglish.com)
Symmetries which are an essential source for model reduction address two important issues:
first, it can generally be considered sensible that any two models describing the same
situation, e. g. the discretisation of a continuous model, should have the same symmetries,
and second, the abstraction over equivalent state(-action)s implies faster algorithms e. g.
faster learning comparable to the spirit of function approximation.(1)
Temporal and structural abstraction are main challenges in solving real world MDPs and
2P-ZS-MGs [67]. The latter concerns the question of how to handle the state space of the
underlying model for very large problems. In this context, model reduction, i. e. finding an
equivalent smaller or the smallest model, can be seen to be an important part of the task
to make computations more effective or simply feasible. While some reductions, especially
symmetries, are often easy to detect for a human observer, others are harder to observe.
Designing software to accomplish this task is a challenging area of research for any
kind of model reduction.(2)
In this chapter proofs are given that some kinds of reduction yield models equivalent to
the original one and, hence, the effort to determine a reduced model can be valuable. A
(1)Even if a symmetry is not noted in advance, approximating functions which respect this symmetry
should be better suited than others.
(2)A complexity result of [70] is that it is NP-hard to determine whether a given finite model is already
minimal under reduction by symmetry groups (see Section 3.1.2).
second important point, namely, how to compute possible reductions will not be addressed
in detail. One possible method is adaptive state aggregation (clustering of states) by
locality or by the similarity of the value function [216](3); a different one is to detect steep
gradients of the value function as a natural barrier between state clusters [58](4). Optimal
value functions for multiple goals can also be used to uncover structure in the state space
[67].(5) However, all these methods typically yield only approximate reductions whereas
this chapter concerns exact reductions.
The chapter is structured as follows: after some remarks about existing work for MDPs on
MDP homomorphisms and MDP symmetries the equivalence of these concepts is proven by
the author in Section 3.1. Furthermore, the previous concepts are related to group actions
in Section 3.1.2. In Section 3.2, the focus will be on extensions to 2P-ZS-MGs which is also
one of the main theoretical contributions of this thesis. A central aspect in 2P-ZS-MGs,
which is not an issue in MDPs, is the matrix game reduction property (Definition 3.12)
which is proven to be valid if the equivalence classes on the state-action space fulfill some
natural projection properties. The most general framework of model reduction leads to 2P-
ZS-MG µ-homomorphisms (Theorem 3.20). The fact that the composition of two 2P-ZS-
MG µ-homomorphisms stays a 2P-ZS-MG µ-homomorphism enables the stepwise reduction
of models. Additionally, the combination of agent exchanging symmetries with agent
preserving symmetries is part of the concept. An application to automorphisms of finite
2P-ZS-MGs (Lemma 3.25) gives some further insight.
3.1 Homomorphisms and Symmetry in MDPs
The concept of symmetries is strongly related to questions of model reduction and model
minimisation. Historically, the question of model minimisation emerged first in finite state
automata and was extended to Markov chains and MDPs [170, 171]. The motivation of
this work is the insight that a smaller model results in a smaller amount of computation
time since the computational complexity typically depends on the model size (compare
Section 2.2.3 and Remark 3.7). The fundamental algebraic elements are homomorphisms
between two different models which represent the idea of equivalence of these two models.
For an equivalence of two MDPs not only does the structure of the state(-action) space
have to be preserved by a homomorphism but also the structure of the transition and
reward functions, implying the same structure for the optimal value functions and optimal
policies.
This section provides background material for Section 3.2 but also contains theoretical
contributions of the author. A main contribution is to show the equivalence of MDP
homomorphisms [171] to MDP symmetries [217] in Lemma 3.3.(6) This equivalence reveals
that the model minimisation framework of [171] with MDP options and the symmetry
context of [217] with its generalisation to multi-agent MDPs do not exclude each other but
(3) The presented approach has the burden that the value function of non-aggregated states also needs to
be stored and that there is an exponential number of cluster-internal (deterministic) policies over which a
maximum is searched for.
(4)This approach seems to be most appropriate for detection of walls and implicit barriers.
(5)The problem with this approach is that one needs to calculate some value functions for different goals
before one can benefit from the structural abstraction.
(6) In principle, many results of this section can be found in the technical report [170], but there are e. g. no
pointers to the relation to group actions, nor any indication that the concept of reward respecting
SSP partitions is in fact the MDP symmetry of Zinkevich. The main reason for this may be that some
conditions or implications are left implicit in the technical report which are made explicit by the author.
instead are based on the same foundations. Furthermore, because of the equivalence it is
sufficient to prove only once e. g. that the optimal value function is constant on equivalence
classes induced by MDP homomorphisms which also implies that this is true for equivalence
classes induced by MDP symmetries.
3.1.1 Equivalence of MDP Homomorphisms and MDP Symmetries
MDP homomorphisms have the concept of preserving some structure in common with
group homomorphisms. For MDPs this includes maps for states, state-actions, transition
functions, and reward functions. Because MDP homomorphisms shall be utilised for model
reduction, the original MDP $M_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ has a larger number of states or
state-actions than the reduced MDP $M_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$. In this context, an MDP
homomorphism aggregates some states and state-actions, i. e. merges them into non-trivial
equivalence classes.
Before MDP homomorphisms can be defined some aspects of projecting partitions (or
equivalence classes [60]) of the state-action space onto partitions of the state space have
to be noted. A first assumption in [171] is that a partition of the (finite) state-action
space $P(\mathcal{SA}_1) = \{[(s, a)] : (s, a) \in \mathcal{SA}_1\}$ into equivalence classes is given. It is further
assumed that this partition induces a partition of equivalence classes on the state space
$P(\mathcal{S}_1) = \{[s] : s \in \mathcal{S}_1\}$ by the direct projection $\Pi_s : \mathcal{SA}_1 \to \mathcal{S}_1$ with $\Pi_s(s_1, a_1) = s_1$ by
means of $[s] = \Pi_s([(s, a)])$. The well-definedness of this projection onto state equivalence
classes is equivalent to the condition
$$ \forall (s, a'') \in \mathcal{SA}_1 \ \forall (s, a) \in \mathcal{SA}_1 \ \forall s' \in \Pi_s([(s, a'')]) \ \exists a' \in \mathcal{A}_1(s') : (s', a') \in [(s, a)]. \tag{3.1} $$
If the above equation is fulfilled (which is assumed by [171]) then it can be rewritten in
terms of the induced equivalence classes because $[s] = \Pi_s([(s, a)]) = \Pi_s([(s, a'')])$:
$$ \forall (s, a) \in \mathcal{SA}_1 \ \forall s' \in [s] \ \exists a' \in \mathcal{A}_1(s') : (s', a') \in [(s, a)]. \tag{3.2} $$
The well-definedness of the projection of equivalence classes by $\Pi_s$ further implies (not
especially noted in [171]) the following condition on the equivalence classes:
$$ \forall (s, a) \in \mathcal{SA}_1 \ \forall (s', a') \in \mathcal{SA}_1 : (s', a') \in [(s, a)] \Rightarrow s' \in [s]. \tag{3.3} $$
These two conditions (Equations 3.2 and 3.3) are also needed in the symmetry formalism of
Zinkevich and Balch: they simply are Definition 7 of [217] with $[(s, a)]$ being the equivalence
classes of the relation $E_{SA}$ and $[s]$ being the equivalence classes of $E_S$ in their notation.
Considering the reverse direction, i. e. if equivalence relations are given on $\mathcal{S}_1$ and $\mathcal{SA}_1$ which
satisfy Equations 3.2 and 3.3, then the projection $\Pi_s$ maps the equivalence classes of
$\mathcal{SA}_1$ exactly onto those of $\mathcal{S}_1$. The reason is that Equation 3.2 guarantees that the equivalence
classes on $\mathcal{S}_1$ are small enough to be induced by the projection $\Pi_s$, while Equation 3.3
assures that the equivalence classes on $\mathcal{S}_1$ are also large enough.
Next, MDP homomorphisms can be defined:
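3.1 Definition (MDP Homomorphism [171])
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ be two MDPs as defined
in Section 2.2 with $\mathcal{D}_1 = \mathcal{D}_2 = \mathbb{N}_0$. A map $h : \mathcal{SA}_1 \to \mathcal{SA}_2$ is called an MDP
homomorphism if $h$ is a surjection, defined by a tuple of surjective maps $(f, (g_s)_{s \in \mathcal{S}_1})$ with
$h(s,a) = (f(s), g_s(a))$, $f : \mathcal{S}_1 \to \mathcal{S}_2$, and $g_s : \mathcal{A}_1(s) \to \mathcal{A}_2(f(s))$ such that
$$\forall (s,a) \in \mathcal{SA}_1 \;\; \forall s' \in \mathcal{S}_1 : \tilde{T}_1(s,a,[s']) = T_2(f(s), g_s(a), f(s')) \tag{3.4}$$
and
$$\forall (s,a) \in \mathcal{SA}_1 : R_1(s,a) = R_2(f(s), g_s(a)) \tag{3.5}$$
where the block transition function(7) $\tilde{T}_1$ of the MDP $\mathcal{M}_1$ is defined by
$$\tilde{T}_1 : \mathcal{SA}_1 \times \{[s] : s \in \mathcal{S}_1\} \to \mathbb{R}, \qquad \tilde{T}_1(s,a,[s']) = \sum_{s'' \in [s']} T_1(s,a,s'') \tag{3.6}$$
and the equivalence classes $[(s,a)]$ on $\mathcal{SA}_1$ are defined by $(s',a') \in [(s,a)]$ iff
$h(s',a') = h(s,a)$ and the equivalence classes $[s]$ on $\mathcal{S}_1$ are defined by the projections
$\Pi_s([(s,a)])$, which makes Equation 3.1 a necessary condition.(8)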
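The conditions of Definition 3.1 can be verified directly for small finite models. The sketch below is a hedged illustration, not the procedure of [171]: the dictionary-based MDP encoding, the function name `is_mdp_homomorphism`, and the mirror-symmetric toy model are all assumptions made for this example.

```python
def is_mdp_homomorphism(T1, R1, T2, R2, f, g, tol=1e-12):
    """Check surjectivity and Equations 3.4 and 3.5 for finite MDPs.

    T[(s, a)] is a dict {successor: probability}, R[(s, a)] a reward,
    f maps states of M1 to states of M2, g maps (s, a) to the action g_s(a).
    """
    if set(f.values()) != {s2 for (s2, _a2) in T2}:
        return False  # f is not surjective
    for s in f:
        mapped = {g[(s1, a)] for (s1, a) in T1 if s1 == s}
        available = {a2 for (s2, a2) in T2 if s2 == f[s]}
        if mapped != available:
            return False  # g_s is not a surjection onto A_2(f(s))
    for (s, a), successors in T1.items():
        h_sa = (f[s], g[(s, a)])
        if abs(R1[(s, a)] - R2[h_sa]) > tol:  # Equation 3.5
            return False
        for s_next in f:
            # Block transition probability into the class [s_next] (Equation 3.6).
            block = sum(p for s2, p in successors.items() if f[s2] == f[s_next])
            if abs(block - T2[h_sa].get(f[s_next], 0.0)) > tol:  # Equation 3.4
                return False
    return True


# Hypothetical mirror-symmetric model: L and R are equivalent, G is a goal state.
T1 = {("L", "right"): {"G": 1.0}, ("L", "stay"): {"L": 1.0},
      ("R", "left"): {"G": 1.0}, ("R", "stay"): {"R": 1.0},
      ("G", "back"): {"L": 0.5, "R": 0.5}, ("G", "stay"): {"G": 1.0}}
R1 = {("L", "right"): 1.0, ("L", "stay"): 0.0, ("R", "left"): 1.0,
      ("R", "stay"): 0.0, ("G", "back"): 0.0, ("G", "stay"): 0.0}
T2 = {("X", "go"): {"G": 1.0}, ("X", "stay"): {"X": 1.0},
      ("G", "back"): {"X": 1.0}, ("G", "stay"): {"G": 1.0}}
R2 = {("X", "go"): 1.0, ("X", "stay"): 0.0, ("G", "back"): 0.0, ("G", "stay"): 0.0}
f = {"L": "X", "R": "X", "G": "G"}
g = {("L", "right"): "go", ("L", "stay"): "stay", ("R", "left"): "go",
     ("R", "stay"): "stay", ("G", "back"): "back", ("G", "stay"): "stay"}

print(is_mdp_homomorphism(T1, R1, T2, R2, f, g))  # True
```

The action "back" shows why the block transition function is needed: its probability mass is split between L and R in the original model but lands entirely on the single class X in the reduced one.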
To provide the material necessary for a comparison between the MDP homomorphisms of
[171] and the equivalence relation notation of [217], the latter remains to be introduced.
To this end, recall that an equivalence relation $E$ on a set $M$ is a subset
$E \subseteq M \times M$ with the properties reflexivity ($(x,x) \in E$), symmetry ($(x,y) \in E \Rightarrow (y,x) \in E$),
and transitivity ($(x,y) \in E, (y,z) \in E \Rightarrow (x,z) \in E$). An equivalence
relation gives rise to the quotient set $M/E = \{B \subseteq M \mid (x,y) \in E \text{ for all } x, y \in B\}$ of
equivalence classes of $M$ by $E$. As above, the notation $x \in [y]$ means that $x$ is in the
equivalence class of $y$.
In the following definition the notation of [217] is slightly changed to stress the similarities
to MDP homomorphisms:
3.2 Definition (MDP Symmetry [217])
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ be an MDP. An MDP symmetry is a tuple $E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$
of equivalence relations on $\mathcal{S}_1$ and $\mathcal{SA}_1$, respectively, such that for the corresponding
equivalence classes $[s]$ and $[(s,a)]$ Equations 3.2 and 3.3 are valid, and additionally
1.) the block transition function $\tilde{T}_1$ defined by Equation 3.6 is constant on equivalence
classes:
$$\forall (s',a') \in [(s,a)] \;\; \forall s'' \in \mathcal{S}_1 : \tilde{T}_1(s',a',[s'']) = \tilde{T}_1(s,a,[s'']) , \tag{3.7}$$
2.) and the reward function is constant on equivalence classes:
$$\forall (s',a') \in [(s,a)] : R_1(s',a') = R_1(s,a) . \tag{3.8}$$
3.3 Lemma (Equivalence of MDP Homomorphisms and MDP Symmetries)
MDP homomorphisms (Definition 3.1) and MDP symmetries (Definition 3.2) are equivalent,
i. e. for each MDP homomorphism there exists an MDP symmetry and for each MDP
symmetry there exists an MDP homomorphism such that the equivalence classes induced
by the MDP homomorphism and the MDP symmetry are equal.(9)
(7) The block transition function is in fact a transition function because
$\forall (s,a) \in \mathcal{SA}_1 : \sum_{[s']} \tilde{T}_1(s,a,[s']) = \sum_{[s']} \sum_{s'' \in [s']} T_1(s,a,s'') = \sum_{s' \in \mathcal{S}_1} T_1(s,a,s') = 1$
because $T_1$ is a transition function.
(8) An implication is that not only the equivalence classes on the state-action space can be
written as $[(s,a)] = h^{-1}(h(s,a))$ but also those on the state space as $[s] = f^{-1}(f(s))$.
(9) If an equivalence relation on all MDP homomorphisms is introduced such that two of them are
equivalent if the induced equivalence relations on the state and state-action spaces are equal, and if
an equivalence relation on MDP symmetries is introduced similarly, then the Lemma means that a
bijection between equivalence classes of MDP homomorphisms and equivalence classes of MDP
symmetries exists.
Proof: It is to be shown firstly that given any MDP symmetry there exists an equivalent MDP
homomorphism and, secondly, that the converse is also true. For clarity of presentation, the
equivalence classes induced by an MDP homomorphism $h$ are denoted by $[s]_h$ and $[(s,a)]_h$,
while the ones induced by an MDP symmetry $E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$ are denoted by
$[s]_{E_{\mathcal{S}_1}}$ and $[(s,a)]_{E_{\mathcal{SA}_1}}$.
1.) Let an MDP $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and an MDP symmetry $E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$
be given. Above Definition 3.1 it is discussed that Equations 3.2 and 3.3 imply the
well-definedness of the projection $\Pi_s([(s,a)]_{E_{\mathcal{SA}_1}}) = [s]_{E_{\mathcal{S}_1}}$ from equivalence classes
of $\mathcal{SA}_1$ onto those of $\mathcal{S}_1$. Because of this well-definedness it is possible to choose a
unique (finite) set of representatives $N_{s,a} = \{(s_i, a_i)\}$, one for each equivalence class
on $\mathcal{SA}_1$, such that $N_s = \Pi_s(N_{s,a}) = \{s_i\}$ is a unique (finite) set of representatives
for the equivalence classes on $\mathcal{S}_1$. Let $\Phi_{s,a} : \mathcal{SA}_1 \to N_{s,a}$ be the mapping which maps a
state-action $(s,a)$ to its representative $(s_k, a_k) \in [(s,a)]_{E_{\mathcal{SA}_1}} \cap N_{s,a}$, and let
$\Phi_s : \mathcal{S}_1 \to N_s$ be the corresponding mapping for the state space.(10)
Let a second MDP $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ be defined by $\mathcal{D}_2 = \mathcal{D}_1$, $\mathcal{S}_2 = N_s$,
$\mathcal{SA}_2 = N_{s,a}$, $T_2(s,a,s') = \tilde{T}_1(\Phi_{s,a}(s,a), [\Phi_s(s')]_{E_{\mathcal{S}_1}})$, and $R_2(s,a) = R_1(\Phi_{s,a}(s,a))$. Then, an
MDP homomorphism $h(s,a) = (f(s), g_s(a))$ is defined by $h(s,a) = \Phi_{s,a}(s,a)$, implying
that $f(s) = \Phi_s(s)$ and $g_s(a) = \Pi_a(\Phi_{s,a}(s,a))$, where $\Pi_a$ is the direct projection onto
the actions and $\mathcal{A}_2(f(s)) = \{\Pi_a(\Phi_{s,a}(s,a)) : a \in \mathcal{A}_1(s)\}$, which is independent of
the representative $f(s) = \Phi_s(s)$ of $[s]_{E_{\mathcal{S}_1}}$ by means of Equation 3.2.
For the proof that $h$ is an MDP homomorphism, the first step is to note that the
equivalence classes induced by $h$ on $\mathcal{SA}_1$, i. e. $(s',a') \in [(s,a)]_h$ iff $h(s',a') = h(s,a)$,
are by construction the same as those of $E_{\mathcal{SA}_1}$, i. e. $[(s,a)]_h = [(s,a)]_{E_{\mathcal{SA}_1}}$. Because
$\Pi_s$ is well-defined for the equivalence classes $[(s,a)]_{E_{\mathcal{SA}_1}}$, it is also well-defined for the
classes $[(s,a)]_h$, hence $[s]_h = \Pi_s([(s,a)]_h) = \Pi_s([(s,a)]_{E_{\mathcal{SA}_1}}) = [s]_{E_{\mathcal{S}_1}}$, and in particular
Equation 3.1 is valid for $[s]_h$ and $[(s,a)]_h$.
The second step is to show that Equations 3.4 and 3.5 hold. Equation 3.5 follows
directly because the reward function is constant on equivalence classes of $E_{\mathcal{SA}_1}$
(Equation 3.8) and therefore also on equivalence classes of $h$, which are characterised by
$(s',a') \in [(s,a)]_h$ iff $(f(s), g_s(a)) = h(s,a) = h(s',a') = (f(s'), g_{s'}(a'))$. Equation 3.4
follows analogously because the function $(s,a) \mapsto \tilde{T}_1(\Phi_{s,a}(s,a), [\Phi_s(s')])$ is constant on
equivalence classes of $E_{\mathcal{SA}_1}$ (Equation 3.7) and therefore also on those of $h$.
2.) Now, let an MDP homomorphism $h : \mathcal{SA}_1 \to \mathcal{SA}_2$ be given for the two MDPs
$\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ with $\mathcal{D}_1 = \mathcal{D}_2 = \mathbb{N}_0$.
The discussion above Definition 3.1 yields that Equations 3.2 and 3.3 are valid for the
equivalence classes $[(s,a)]_h$ and $[s]_h$ induced by $h$. Define equivalence relations $E_{\mathcal{S}_1}$ by
$[s]_{E_{\mathcal{S}_1}} = [s]_h$ and $E_{\mathcal{SA}_1}$ by $[(s,a)]_{E_{\mathcal{SA}_1}} = [(s,a)]_h$.
By definition the equivalence classes induced by $h$ and the ones induced by
$E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$ are equal, and the latter also fulfill Equations 3.2 and 3.3. Therefore,
in the opposite direction Equation 3.8 follows directly from Equation 3.5 and Equation 3.7
from Equation 3.4, which means that $E = (E_{\mathcal{S}_1}, E_{\mathcal{SA}_1})$ is an MDP symmetry. $\square$
(10) The function $\Phi_s : \mathcal{S}_1 \to N_s$ can also be defined by $\Phi_s(s) = \Pi_s(\Phi_{s,a}(s, a(s)))$ where for
each $s$ any $a = a(s) \in \mathcal{A}_1(s)$ can be chosen.
3.4 Remark (Implications of Lemma 3.3)
A direct implication of Lemma 3.3 is that the work of [217] and [171] has a common basis.
The concepts developed in the first reference for multi-agent systems and in the second
reference for options, i. e. macro or multi-step actions, may therefore be used jointly. As a
further consequence, the proofs about value functions and strategies being equal on
equivalence classes need only be given in one framework.
After having established the equivalence of both frameworks, their usefulness is to be
reflected in a theorem. Before the theorem can be stated, it has to be explained how to
transform policies from $h(\mathcal{SA}_1)$ to $\mathcal{SA}_1$ if $h$ is an MDP homomorphism:
3.5 Definition (Policy Lifting [171])
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ be two MDPs and let
$h : \mathcal{SA}_1 \to \mathcal{SA}_2$, $h(s,a) = (f(s), g_s(a))$, be an MDP homomorphism. For any $s \in \mathcal{S}_1$ and
$\hat{a} \in \mathcal{A}_2(f(s))$ the action set $g_s^{-1}(\hat{a}) \subseteq \mathcal{A}_1(s)$ is the preimage of the action $\hat{a}$ under $g_s$.
Let $\hat{\pi}$ be a policy of the MDP $\mathcal{M}_2$. Then the corresponding lifted policy $\pi$ of the MDP $\mathcal{M}_1$
is defined by
$$\pi(s,a) = \frac{\hat{\pi}(f(s), g_s(a))}{|g_s^{-1}(g_s(a))|} . \tag{3.9}$$
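Equation 3.9 can be illustrated with a few lines of Python. This is a minimal sketch under assumed data structures (policies and the maps $f$ and $g_s$ stored as dictionaries); the function name `lift_policy` and the single-state example are hypothetical and not taken from [171].

```python
from fractions import Fraction


def lift_policy(reduced_policy, f, g):
    """Lift a policy of the reduced model according to Equation 3.9.

    reduced_policy[(s2, a2)] is the probability of a2 in state s2,
    f maps states, g maps (s, a) to g_s(a).
    """
    # |g_s^{-1}(a2)|: number of original actions in state s that map to a2.
    preimage_size = {}
    for (s, _a), a2 in g.items():
        preimage_size[(s, a2)] = preimage_size.get((s, a2), 0) + 1
    return {(s, a): reduced_policy[(f[s], a2)] / preimage_size[(s, a2)]
            for (s, a), a2 in g.items()}


# Hypothetical example: actions a1 and a2 are equivalent, a3 is not.
f = {"s": "t"}
g = {("s", "a1"): "x", ("s", "a2"): "x", ("s", "a3"): "y"}
reduced = {("t", "x"): Fraction(1, 2), ("t", "y"): Fraction(1, 2)}
lifted = lift_policy(reduced, f, g)
print(lifted[("s", "a1")], lifted[("s", "a2")], lifted[("s", "a3")])  # 1/4 1/4 1/2
```

The lifted probabilities in each state still sum to one because every reduced action distributes its probability evenly over its preimage.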
The policy lifting simply means to assign the same fraction of probability to all actions in
the same state-action equivalence class. The main theorem about MDP homomorphisms
(and MDP symmetries) follows:
3.6 Theorem (Main Implications of MDP Homomorphisms [170])
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ be an MDP and let $h : \mathcal{SA}_1 \to \mathcal{SA}_2$, $h(s,a) = (f(s), g_s(a))$,
be an MDP homomorphism (Definition 3.1) for some MDP $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$.
Then, $V^*(s) = V^*(f(s))$ and $Q^*(s,a) = Q^*(h(s,a))$. Furthermore, there exists an optimal
policy of $\mathcal{M}_1$ which is a lifted optimal policy of $\mathcal{M}_2$.
The proof will be given in Section 3.2 for 2P-ZS-MGs; it is methodically similar and
includes MDPs as a special case.
3.7 Remark (Implications of Theorem 3.6 on Complexity)
Theorem 3.6 means that the optimal value function is constant on $[s]_h$, the optimal
Q-value function is constant on $[(s,a)]_h$, and an optimal policy exists which is also constant
on $[(s,a)]_h$. The theorem additionally holds for any lifted policy, which means that
non-optimal lifted policies also induce value functions which respect the equivalence relation.
Hence it is sufficient to operate on the reduced model to obtain good or optimal
policies of the original model, without any loss of information through the compression of
the state and state-action spaces. Because the complexity depends heavily on the size of the
state and state-action spaces (see Section 2.2.3), the reduced models can be solved faster.
The minimum savings are a reduction in the number of stored values (less computer memory
needed) for any policy and value function, proportional to $|\mathcal{S}_2| / |\mathcal{S}_1|$ or
$|\mathcal{SA}_2| / |\mathcal{SA}_1|$, respectively.
A more important reduction applies to the computational time: because the complexity
formulae are worse than linear for a non-sparse transition function, the savings are, in this
case, proportional to the corresponding powers and products of $|\mathcal{S}|$ and $|\mathcal{SA}|$. Nevertheless,
e. g. the number of value iterations needed to achieve a given numerical precision $\varepsilon$ does
not change; this could happen only for asynchronous updates such as Gauss-Seidel or RL
techniques. This reflects the fact that a reducible model inherits some of the basic
“difficulty” of the reduced model, although the theoretical complexity increases for larger models.
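The statement that the value function of the original model can be read off the reduced model can be checked numerically on a small example. The sketch below uses plain synchronous value iteration on a hypothetical mirror-symmetric MDP and its reduction; the encoding, the discount factor 0.9, and the number of sweeps are assumptions made for illustration only.

```python
def value_iteration(T, R, gamma=0.9, sweeps=200):
    """Synchronous value iteration; T[(s, a)] = {successor: prob}, R[(s, a)] = reward."""
    states = {s for (s, _a) in T}
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # The right-hand side is evaluated with the previous iterate V.
        V = {s: max(R[(s1, a)] + gamma * sum(p * V[s2] for s2, p in T[(s1, a)].items())
                    for (s1, a) in T if s1 == s)
             for s in states}
    return V


# Hypothetical original model: L and R are mirror images, G is a goal state.
T1 = {("L", "right"): {"G": 1.0}, ("L", "stay"): {"L": 1.0},
      ("R", "left"): {"G": 1.0}, ("R", "stay"): {"R": 1.0},
      ("G", "back"): {"L": 0.5, "R": 0.5}, ("G", "stay"): {"G": 1.0}}
R1 = {("L", "right"): 1.0, ("L", "stay"): 0.0, ("R", "left"): 1.0,
      ("R", "stay"): 0.0, ("G", "back"): 0.0, ("G", "stay"): 0.0}
# Reduced model in which L and R are merged into the single state X.
T2 = {("X", "go"): {"G": 1.0}, ("X", "stay"): {"X": 1.0},
      ("G", "back"): {"X": 1.0}, ("G", "stay"): {"G": 1.0}}
R2 = {("X", "go"): 1.0, ("X", "stay"): 0.0, ("G", "back"): 0.0, ("G", "stay"): 0.0}
f = {"L": "X", "R": "X", "G": "G"}

V1 = value_iteration(T1, R1)
V2 = value_iteration(T2, R2)
print(all(abs(V1[s] - V2[f[s]]) < 1e-9 for s in f))  # True
print(len(T2), "of", len(T1), "state-action values need to be stored")
```

Both runs use the same number of sweeps, in line with the remark that the number of synchronous iterations is not reduced; only the work per sweep shrinks.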
3.1.2 Symmetries by Group Actions on MDPs
In this section the framework of MDP homomorphisms is related to the classical
notion of symmetry by means of group actions, which turns out to be similar but not
exactly the same.(11) [170] gives an example for which the two concepts of symmetry
differ. However, the concept of MDP homomorphisms includes the symmetry group
concept which is defined by group actions of the group of all MDP automorphisms.(12) [70]
also uses this approach (called bisimulation there) and relates it to finite state machines
(FSM). One complexity result of [70] is that it is NP-hard to determine whether a given
finite model is already minimal under reduction by symmetry groups.
Firstly, some standard definitions of group homomorphisms and group actions are recalled
(for more details see Appendix A). A (left) group action $\Theta$ is a map $\Theta : G \times X \to X$
such that for all $g, h \in G$ and for all $x \in X$ holds: $\Theta(g, \Theta(h, x)) = \Theta(gh, x)$ and
$\Theta(1, x) = x$, where $1 \in G$ is the identity. A simplified notation for group actions is
$gh \cdot x = ghx = \Theta(g, \Theta(h, x))$. Furthermore, group actions act like permutations on $X$.
This can be made precise by introducing the kernel of a group action. The main results
related to equivalence classes are Proposition A.4 and Proposition A.5, in which it is shown
that equivalence relations and group actions are equivalent concepts.
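The direction from group actions to equivalence relations can be made concrete by computing orbits. The following Python sketch is an illustration under stated assumptions: permutations are stored as dictionaries, the function name `orbits` is hypothetical, and the mirror permutation on a small set of state-action pairs is an invented example.

```python
def orbits(generators, points):
    """Partition `points` into the orbits of the group generated by `generators`.

    Each generator is a permutation of `points`, given as a dict x -> g(x).
    For a finite set, closing under the generators alone suffices because the
    inverse of a permutation is one of its powers.
    """
    remaining, result = set(points), []
    while remaining:
        start = next(iter(remaining))
        orbit, frontier = {start}, [start]
        while frontier:
            x = frontier.pop()
            for perm in generators:
                y = perm[x]
                if y not in orbit:
                    orbit.add(y)
                    frontier.append(y)
        remaining -= orbit
        result.append(frozenset(orbit))
    return result


# Hypothetical mirror symmetry on the state-action pairs of a small corridor.
mirror = {("L", "right"): ("R", "left"), ("R", "left"): ("L", "right"),
          ("L", "stay"): ("R", "stay"), ("R", "stay"): ("L", "stay"),
          ("G", "stay"): ("G", "stay")}
classes = orbits([mirror], mirror.keys())
print(len(classes))  # 3
```

The three orbits are exactly the equivalence classes that a corresponding partition of the state-action space would contain.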
Proposition A.5 is now to be applied to MDPs. To this end, a definition of symmetries based
on MDP automorphisms is given:
3.8 Definition (MDP Iso- and Automorphism, Symmetry Group [170])
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ and $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SA}_2, T_2, R_2)$ be two MDPs and let
$h : \mathcal{SA}_1 \to \mathcal{SA}_2$ be an MDP homomorphism. $h$ is said to be an MDP isomorphism if it
is bijective(13), and it is called an MDP automorphism if it is bijective and if $\mathcal{SA}_1 = \mathcal{SA}_2$.
The set of all automorphisms $\mathrm{Aut}(\mathcal{M}_1)$ of an MDP $\mathcal{M}_1$ is called the symmetry group of
the MDP.(14)
3.9 Corollary (Symmetry Group of an MDP and MDP Homomorphisms)
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SA}_1, T_1, R_1)$ be an MDP and let $G = \mathrm{Aut}(\mathcal{M}_1)$ be its symmetry group.
Then, $\Theta : G \times \mathcal{SA}_1 \to \mathcal{SA}_1$, $\Theta(g, (s,a)) = g(s,a)$, is a group action and there exists an
MDP homomorphism $h$ which induces the same equivalence relation as the group action.
Proof: For the proof previous results have to be collected. $\Theta$ is a group action because
$\mathrm{id}_{\mathcal{SA}_1} \in G$ is the neutral element of $(G, \circ)$, and the neutral element and the composition of
automorphisms fulfill the group action axioms. Then, an equivalence relation on $\mathcal{SA}_1$ with
the group orbits as equivalence classes is induced. To be able to define an MDP symmetry
it remains to be shown that Equations 3.2, 3.3, 3.7, and 3.8 hold for the equivalence relation
defined by the $G$-orbits.(15)
(11) A corresponding citation from the Wikipedia page on equivalence relations motivates
completing our description (http://en.wikipedia.org/wiki/Equivalence_relation, 06.09.2007): “It is very
well known that lattice theory captures the mathematical structure of order relations. It is much less
known that transformation groups (some authors prefer permutation groups) and their orbits capture the
mathematical structure of equivalence relations.” This web page also includes important suggestions for
the literature presented below.
(12) In principle, the main results of this section about MDPs can be found in [170] but there are no
pointers to the relation to group actions.
(13) $h(s,a) = (f(s), g_s(a))$ is bijective iff $f$ and all $g_s$ are bijective.
(14) In fact, the group properties hold for $(\mathrm{Aut}(\mathcal{M}_1), \circ)$.
(15) This does not follow directly from the fact that $G$ consists of MDP automorphisms. For the
corresponding MDP homomorphism some conditions become nearly trivial since the induced equivalence
classes of MDP automorphisms are singletons.
Equations 3.2 and 3.3 follow from 3.1: Let $(s,a''), (s,a) \in \mathcal{SA}_1$ and let $s' \in \Pi_s([(s,a'')])$,
i. e. there exists an MDP automorphism $h \in G$ with $h(s,a'') = (f(s), g_s(a'')) = (s', g_s(a''))$,
which in particular means $f(s) = s'$. Then $h(s,a) = (f(s), g_s(a)) = (s', g_s(a))$ and it follows
with $a' = g_s(a)$ that $\exists a' \in \mathcal{A}_1(s') : (s',a') \in [(s,a)]$ because $(s,a)$ and $(s',a')$ are in the
same $G$-orbit.
Equation 3.8 follows because for any automorphism $h \in G$ and any $(s',a') = h(s,a)$ the
reward satisfies $R_1(s,a) = R_1(h(s,a))$. To prove Equation 3.7, note that for an automorphism
$h(s,a) = (f(s), g_s(a))$ Equation 3.4 turns into $T_1(s,a,s') = T_1(h(s,a), f(s'))$. Furthermore,
$G_s = \Pi_s G = \{f : h \in G \text{ with } h(s,a) = (f(s), g_s(a))\}$ is a group because $G$ is a group, $G_s$
induces a group action on $\mathcal{S}_1$ by function evaluation and composition (as $G$ on $\mathcal{SA}_1$), and
the orbits of $G_s$ are by definition the projections of orbits of $G$, i. e. the equivalence classes
on $\mathcal{S}_1$. Then, for any $h \in G$ and any $(s',a') = h(s,a) = (f(s), g_s(a))$ holds $(f G_s) \cdot s = G_s \cdot s$
(since $f \in G_s$), and
$$\begin{aligned}
\tilde{T}_1(s,a,[s'']) &= \sum_{s''' \in G_s \cdot s''} T_1(s,a,s''') \\
&= \sum_{s''' \in G_s \cdot s''} T_1(h(s,a), f(s''')) \\
&= \sum_{s''' \in G_s \cdot s''} T_1(h(s,a), s''') \\
&= \tilde{T}_1(s',a',[s'']) .
\end{aligned}$$
In summary, the equivalence relation on $\mathcal{SA}_1$ induces one on $\mathcal{S}_1$ and a corresponding MDP
symmetry. Finally, Lemma 3.3 shows the existence of an MDP homomorphism which
induces the same equivalence relation. $\square$
Although MDP homomorphisms can capture equivalence relations induced by (subgroups
of) the symmetry group, the reverse implication is not true because for MDP isomorphisms
Equation 3.4 reduces to
$$\forall s \in \mathcal{S}_1 \;\; \forall s' \in \mathcal{S}_1 \;\; \forall a \in \mathcal{A}_1(s) : T_1(s,a,s') = T_2(f(s), g_s(a), f(s')) .$$
[170] contains an example of model reductions due to MDP homomorphisms(16) which
cannot be obtained by symmetry groups of MDPs.
3.2 Homomorphisms and Symmetry in 2P-ZS-MGs
In Section 3.1 the concepts of MDP homomorphisms and MDP symmetries have been
shown to be equivalent. This reduces the burden of generalising both concepts to 2P-ZS-MGs
to the task of choosing one of them. In the following, the concept of MDP homomorphisms is
generalised to 2P-ZS-MG homomorphisms because that formalism explicitly includes the
mappings from a model to the reduced model in its definition. Nevertheless, the formalism
of 2P-ZS-MG homomorphisms could be transformed into one of 2P-ZS-MG symmetries.
The main result of Section 3.2.1 is Theorem 3.16, which includes Theorem 3.6 (MDPs)
and states that a symmetric (or reducible) 2P-ZS-MG induces a structure on the optimal
value functions that respects this symmetry (or reduction) and that a corresponding policy
exists. The proof of the main theorem was developed independently by the author and goes
beyond the proof for MDPs, e. g. because the matrix game reduction (MGR) property
(Definition 3.12) needs to be valid, which is always the case (Proposition 3.13). It is to be
noted that (asynchronous) RL algorithms operating on the original model can behave
differently from those operating on the reduced model because for the latter the experience
is spread over all equivalent states even if they are never visited by the learning agent.
(16) Characterised by reward respecting SSP partitions, in fact equalling MDP symmetries.
Besides the symmetries of MDPs, a qualitatively new symmetry that results from exchanging
the two agents can be exploited. For that purpose, the concept of $\mu$-homomorphisms
is introduced in Section 3.2.2, which can also be utilised to identify models which
differ only by a scaling of the rewards. This cannot be achieved by the standard 2P-ZS-MG
homomorphism framework. In practice, agent exchanging symmetries have been used
for different board games with the argument that exchanging the agents in a zero-sum game
has to result in a multiplication of the value by $-1$. One example is given by the pioneer
Samuel for checkers [179].(17) The present work lays formal foundations for this practice
and, more importantly, shows that the exchange of agents is also compatible with other
standard symmetries similar to those of 2P-ZS-MGs (Proposition 3.24).
3.2.1 2P-ZS-MG Homomorphisms and Symmetry
In this section the analogues of the concepts of Section 3.1 are introduced for 2P-ZS-MGs.
The main differences are that the state-action spaces $\mathcal{SA}_i$ have to be replaced by the
corresponding state-action spaces $\mathcal{SAO}_i$ and that additional projection properties hold.
This implies that some changes occur for the projection $\Pi_s$ and for the reward and
transition functions.
First and foremost, again some aspects of projecting partitions of the state-action space
onto partitions of the state space have to be noted. As in Section 3.1, a first
assumption is that a partition of the (finite) state-action space
$P(\mathcal{SAO}_1) = \{[(s,a,o)] : (s,a,o) \in \mathcal{SAO}_1\}$ into equivalence classes is given. It is further
assumed that this partition induces a partition of equivalence classes on the state space
$P(\mathcal{S}_1) = \{[s] : s \in \mathcal{S}_1\}$ by the direct projection $\Pi_s : \mathcal{SAO}_1 \to \mathcal{S}_1$ with
$\Pi_s(s_1, a_1, o_1) = s_1$ by means of $[s] = \Pi_s([(s,a,o)])$.
The well-definedness of this projection onto state equivalence classes is equivalent to the
condition
$$\forall (s,a'',o'') \in \mathcal{SAO}_1 \;\; \forall (s,a,o) \in \mathcal{SAO}_1 \;\; \forall s' \in \Pi_s([(s,a'',o'')]) \;\; \exists (a',o') \in \mathcal{A}_1(s') \times \mathcal{O}_1(s') : (s',a',o') \in [(s,a,o)] . \tag{3.10}$$
If the above condition is fulfilled then it can be rewritten in terms of the induced equivalence
classes because $[s] = \Pi_s([(s,a,o)]) = \Pi_s([(s,a'',o'')])$:
$$\forall (s,a,o) \in \mathcal{SAO}_1 \;\; \forall s' \in [s] \;\; \exists (a',o') \in \mathcal{A}_1(s') \times \mathcal{O}_1(s') : (s',a',o') \in [(s,a,o)] . \tag{3.11}$$
The well-definedness of the projection of equivalence classes by $\Pi_s$ further implies the
following condition on the equivalence classes:
$$\forall (s,a,o) \in \mathcal{SAO}_1 \;\; \forall (s',a',o') \in \mathcal{SAO}_1 : (s',a',o') \in [(s,a,o)] \Rightarrow s' \in [s] . \tag{3.12}$$
Conversely, if equivalence classes are given on $\mathcal{S}_1$ and $\mathcal{SAO}_1$ which satisfy
Equations 3.11 and 3.12, then the projection $\Pi_s$ maps the equivalence classes of
$\mathcal{SAO}_1$ exactly onto those of $\mathcal{S}_1$.
(17)Samuel states that all stored board positions are transformed as if Black has to move.
The definitions above are a straightforward adaptation of the MDP case. However, it will
become apparent later in the proof of the main theorem (Theorem 3.16) that it is necessary to
use the matrix game reduction property (Definition 3.12), which in turn uses the policy
lifting (Definition 3.11) in 2P-ZS-MGs. For a lifted policy to depend only on the actions of
one agent it is necessary that the state-action equivalence classes for joint actions induce
equivalence classes on $\mathcal{SA}_1$ and $\mathcal{SO}_1$ by the projections $\Pi_{s,a} : \mathcal{SAO}_1 \to \mathcal{SA}_1$ and
$\Pi_{s,o} : \mathcal{SAO}_1 \to \mathcal{SO}_1$, respectively. This means that the following two variants of
Equation 3.10 also have to be valid:
$$\forall (s,a,o'') \in \mathcal{SAO}_1 \;\; \forall (s,a,o) \in \mathcal{SAO}_1 \;\; \forall (s',a') \in \Pi_{s,a}([(s,a,o'')]) \;\; \exists o' \in \mathcal{O}_1(s') : (s',a',o') \in [(s,a,o)] \tag{3.13}$$
and
$$\forall (s,a'',o) \in \mathcal{SAO}_1 \;\; \forall (s,a,o) \in \mathcal{SAO}_1 \;\; \forall (s',o') \in \Pi_{s,o}([(s,a'',o)]) \;\; \exists a' \in \mathcal{A}_1(s') : (s',a',o') \in [(s,a,o)] . \tag{3.14}$$
If they are fulfilled, these two conditions can also be abbreviated as follows:
$$\forall (s,a,o) \in \mathcal{SAO}_1 \;\; \forall (s',a') \in [(s,a)] \;\; \exists o' \in \mathcal{O}_1(s') : (s',a',o') \in [(s,a,o)] \tag{3.15}$$
and
$$\forall (s,a,o) \in \mathcal{SAO}_1 \;\; \forall (s',o') \in [(s,o)] \;\; \exists a' \in \mathcal{A}_1(s') : (s',a',o') \in [(s,a,o)] . \tag{3.16}$$
Next, 2P-ZS-MG homomorphisms can be defined; they do not capture symmetries
exchanging the two agents. The straightforward generalisation from MDP homomorphisms
would be $h(s,a,o) = (f(s), g_s(a,o))$, but instead the homomorphism is defined as
$h(s,a,o) = (f(s), g_s(a), i_s(o))$ because this explicitly takes the projective properties
(Equations 3.11, 3.15 and 3.16) into account, where $[s] = f^{-1}(f(s))$,
$[(s,a)] = \{(s',a') : g_{s'}(a') = g_s(a)\}$, and $[(s,o)] = \{(s',o') : i_{s'}(o') = i_s(o)\}$.
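3.10 Definition (2P-ZS-MG Homomorphism)
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs
as defined in Section 2.4 with $\mathcal{D}_1 = \mathcal{D}_2 = \mathbb{N}_0$. A map $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$ is called
a 2P-ZS-MG homomorphism if $h$ is a surjection, defined by a tuple of surjective maps
$(f, (g_s)_{s \in \mathcal{S}_1}, (i_s)_{s \in \mathcal{S}_1})$ with $h(s,a,o) = (f(s), g_s(a), i_s(o))$, $f : \mathcal{S}_1 \to \mathcal{S}_2$,
$g_s : \mathcal{A}_1(s) \to \mathcal{A}_2(f(s))$, and $i_s : \mathcal{O}_1(s) \to \mathcal{O}_2(f(s))$ such that
$$\forall (s,a,o) \in \mathcal{SAO}_1 \;\; \forall s' \in \mathcal{S}_1 : \tilde{T}_1(s,a,o,[s']) = T_2(f(s), g_s(a), i_s(o), f(s')) \tag{3.17}$$
and
$$\forall (s,a,o) \in \mathcal{SAO}_1 : R_1(s,a,o) = R_2(f(s), g_s(a), i_s(o)) \tag{3.18}$$
where the block transition function $\tilde{T}_1$ of the 2P-ZS-MG $\mathcal{M}_1$ is defined by
$$\tilde{T}_1 : \mathcal{SAO}_1 \times \{[s] : s \in \mathcal{S}_1\} \to \mathbb{R}, \qquad \tilde{T}_1(s,a,o,[s']) = \sum_{s'' \in [s']} T_1(s,a,o,s'') \tag{3.19}$$
and the equivalence classes $[(s,a,o)]$ on $\mathcal{SAO}_1$ are defined by $(s',a',o') \in [(s,a,o)]$ iff
$h(s',a',o') = h(s,a,o)$ and the equivalence classes $[s]$ on $\mathcal{S}_1$, $[(s,a)]$ on $\mathcal{SA}_1$, and $[(s,o)]$
on $\mathcal{SO}_1$ are defined by the projections $\Pi_s([(s,a,o)])$, $\Pi_{s,a}([(s,a,o)])$, and $\Pi_{s,o}([(s,a,o)])$,
respectively, which makes Equations 3.10, 3.13 and 3.14 necessary conditions.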
As for MDPs, it is necessary to be able to lift policies. This is only
possible because the projections $\Pi_{s,a}$ and $\Pi_{s,o}$ project onto equivalence classes:
3.11 Definition (Policy Lifting for 2P-ZS-MGs)
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs and
let $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$, $h(s,a,o) = (f(s), g_s(a), i_s(o))$, be a 2P-ZS-MG homomorphism.
Let $\hat{\pi}$ be a policy of the 2P-ZS-MG $\mathcal{M}_2$ for the first or second agent, as appropriate. Then
the corresponding lifted policy $\pi$ of the 2P-ZS-MG $\mathcal{M}_1$ is defined for the first agent by
$$\pi(s,a) = \frac{\hat{\pi}(f(s), g_s(a))}{|g_s^{-1}(g_s(a))|} , \tag{3.20}$$
and for the second agent by
$$\pi(s,o) = \frac{\hat{\pi}(f(s), i_s(o))}{|i_s^{-1}(i_s(o))|} . \tag{3.21}$$
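Next, the matrix game reduction property, which will be essential for the reduction by
2P-ZS-MG homomorphisms, is introduced:
3.12 Definition (Matrix Game Reduction (MGR) Property)
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs
and let $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$, $h(s,a,o) = (f(s), g_s(a), i_s(o))$, be a 2P-ZS-MG homomorphism.
Then $h$ is said to have the matrix game reduction (MGR) property iff for all states
$s \in \mathcal{S}_1$ and for all matrices $M_{1,s} \in \mathbb{R}^{|\mathcal{A}_1(s)|, |\mathcal{O}_1(s)|}$ and $M_{2,f(s)} \in \mathbb{R}^{|\mathcal{A}_2(f(s))|, |\mathcal{O}_2(f(s))|}$ that
respect the structure of $h$, i. e.
$\forall (a',o') \in \mathcal{A}_2(f(s)) \times \mathcal{O}_2(f(s)) \;\; \forall (a,o) \in g_s^{-1}(a') \times i_s^{-1}(o') : (M_{1,s})(a,o) = (M_{2,f(s)})(a',o')$,
it holds that
$$V(M_{1,s}) = V(M_{2,f(s)}) \tag{3.22}$$
and additionally that the optimal policy of $M_{1,s}$ is a lifted optimal policy of $M_{2,f(s)}$
(interpreting the matrix game as a single state 2P-ZS-MG as in Definition 2.19).
By definition Equation 3.22 is equivalent to(18)
$$\max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \; \min_{o \in \mathcal{O}_1(s)} \sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot M_{1,s}(a,o)
= \max_{\pi_{f(s)} \in \mathrm{PD}(\mathcal{A}_2(f(s)))} \; \min_{o \in \mathcal{O}_2(f(s))} \sum_{a \in \mathcal{A}_2(f(s))} \pi_{f(s)}(a) \cdot M_{2,f(s)}(a,o) , \tag{3.23}$$
where the abbreviation $M(a,o) = M_{i,j}$ if $a_i = a$ and $o_j = o$ is used.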
3.13 Proposition (Validity of the MGR Property)
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs
and let $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$, $h(s,a,o) = (f(s), g_s(a), i_s(o))$, be a 2P-ZS-MG homomorphism.
Then the MGR property holds.
(18) Because of the minimax theorem (Theorem 2.23), the definition of the value of a matrix game
(Definition 2.21) and the fact that the minimum over probability distributions for the later deciding
agent (applied first in the equation) can be replaced by the minimum over the pure actions.
Proof: Let $\mathcal{M}_1, \mathcal{M}_2, h, M_{1,s}, M_{2,f(s)}$ be as in Definition 3.12. It will be shown that $M_{1,s}$
can be transformed into $M_{2,f(s)}$ by removing equal rows and equal columns, and hence
the associated matrix games have the same value. Since the structure of equal rows and
columns corresponds to the structure of the state-wise intersection of equivalence classes
$[(s,a)] \cap (\{s\} \times \mathcal{A}_1(s)) = \{s\} \times g_s^{-1}(g_s(a))$ and
$[(s,o)] \cap (\{s\} \times \mathcal{O}_1(s)) = \{s\} \times i_s^{-1}(i_s(o))$, respectively,
an optimal policy from $M_{2,f(s)}$ can be lifted to a policy of $M_{1,s}$.
For a fixed $s \in \mathcal{S}_1$ let $\{a'_1, \dots, a'_{m_2}\} = \mathcal{A}_2(f(s))$ be an enumeration of the actions with
$m_2 = |\mathcal{A}_2(f(s))|$, and let $\{o'_1, \dots, o'_{n_2}\} = \mathcal{O}_2(f(s))$ be an enumeration of the opponent's
actions with $n_2 = |\mathcal{O}_2(f(s))|$ such that the matrix is ordered as $M_{2,f(s)}(i,j) = M_{2,f(s)}(a'_i, o'_j)$.
Since $g_s : \mathcal{A}_1(s) \to \mathcal{A}_2(f(s))$ is surjective, $g_s^{-1}(a'_i) \neq \emptyset$ for every $i$ and there exists an
enumeration $\{a_1, \dots, a_{m_1}\} = \mathcal{A}_1(s)$ with $a_k \in g_s^{-1}(a'_k)$ for $1 \le k \le m_2$. Since
$\mathcal{A}_1(s) = \bigcup_{a' \in \mathcal{A}_2(f(s))} g_s^{-1}(a')$, for all $k > m_2$ holds that $a_k \in g_s^{-1}(g_s(a_{i_k}))$ with
$i_k \le m_2$. Then for all $o \in \mathcal{O}_1(s)$ holds:
$h(s, a_k, o) = (f(s), g_s(a_k), i_s(o)) = (f(s), g_s(a_{i_k}), i_s(o)) = h(s, a_{i_k}, o)$.
This means that in $M_{1,s}$ for each $k > m_2$ the $k$-th row is equal to the $i_k$-th row where
$i_k \le m_2$, and therefore all rows with index $k > m_2$ can be removed. Further, all actions
in $g_s^{-1}(g_s(a_{i_k})) \subseteq \mathcal{A}_1(s)$ are equivalent, which shows that the policy for agent P1 can be lifted.
For columns the analogous statement holds that there exists an enumeration
$\{o_1, \dots, o_{n_1}\} = \mathcal{O}_1(s)$ with $o_k \in i_s^{-1}(o'_k)$ for $1 \le k \le n_2$ and that all columns with
index $k > n_2$ can be removed. Further, all actions in $i_s^{-1}(i_s(o_{i_k})) \subseteq \mathcal{O}_1(s)$ are equivalent,
which shows that the policy for agent P2 can also be lifted. Since equal columns stay equal
when equal rows are removed, $M_{1,s}$ can be transformed into a reduced matrix $\bar{M}_{1,s}$ which by
construction equals $M_{2,f(s)}$. $\square$
3.14 Remark (MGR Property for MDPs)
For MDPs, the MGR property can be interpreted with $M_{1,s}$ being a vector (the minimiser
has only one action to “choose”), which makes the proof of Proposition 3.13 as simple as
$\max_{a \in \mathcal{A}_1(s)} M_{1,s}(a) = \max_{a \in \mathcal{A}_2(f(s))} M_{2,f(s)}(a)$ by noting that
$g_s : \mathcal{A}_1(s) \to \mathcal{A}_2(f(s))$ is surjective.(19)
3.15 Example (Structure of Matrices with MGR Property)
The definition of the MGR property is in principle independent of a state $s$, although it
must hold for all states. Therefore, only two matrices $M_1$ and $M_2$ are presented, which can
be thought of as $M_{1,s}(a,o)$ and $M_{2,f(s)}(a',o')$:
$$M_1 = \left(\begin{array}{cc|c}
c_1 & c_1 & c_2 \\
c_1 & c_1 & c_2 \\
c_1 & c_1 & c_2 \\ \hline
c_3 & c_3 & c_4
\end{array}\right), \qquad
M_2 = \begin{pmatrix} c_1 & c_2 \\ c_3 & c_4 \end{pmatrix} .$$
The lines indicate the borders between different classes of equivalent action pairs. There has
to exist an enumeration of the actions such that the lines generate a grid on the matrix to
fulfill the MGR property. The type of the grid is defined by the equivalence classes of a
2P-ZS-MG while the numbers $c_i \in \mathbb{R}$ are arbitrary.
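The reduction used in the proof of Proposition 3.13, removing equal rows and then equal columns, can be tried out on the matrices of Example 3.15. The Python sketch below is illustrative only; the function names are hypothetical and the concrete numbers chosen for $c_1, \dots, c_4$ are arbitrary. Note that it removes every duplicate row and column, which coincides with the grid of the example as long as the $c_i$ are pairwise different.

```python
def remove_duplicate_rows(matrix):
    """Keep the first occurrence of every distinct row, preserving order."""
    seen, kept = set(), []
    for row in matrix:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(list(row))
    return kept


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def reduce_matrix_game(matrix):
    """Remove equal rows, then equal columns (via the transpose)."""
    rows_reduced = remove_duplicate_rows(matrix)
    return transpose(remove_duplicate_rows(transpose(rows_reduced)))


c1, c2, c3, c4 = 1, 2, 3, 4  # arbitrary illustrative payoffs
M1 = [[c1, c1, c2],
      [c1, c1, c2],
      [c1, c1, c2],
      [c3, c3, c4]]
M2 = [[c1, c2],
      [c3, c4]]
print(reduce_matrix_game(M1) == M2)  # True
```

Duplicated rows and columns never change the value of a matrix game, since a duplicated action offers neither player a new option.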
(19) The simplicity of this argument could be the reason why it is not mentioned in [170]. However, for
matrix games the situation is a little less simple.
3.16 Theorem (Main Implications of 2P-ZS-MG Homomorphisms)
Let $\mathcal{M}_1 = (\mathcal{D}_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ be a 2P-ZS-MG and let $h : \mathcal{SAO}_1 \to \mathcal{SAO}_2$ be a 2P-ZS-MG
homomorphism (Definition 3.10) for some 2P-ZS-MG $\mathcal{M}_2 = (\mathcal{D}_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$.
Then, $V^*(s) = V^*(f(s))$ and $Q^*(s,a,o) = Q^*(h(s,a,o))$. Furthermore, there exists an
optimal policy of $\mathcal{M}_1$ which is a lifted optimal policy of $\mathcal{M}_2$.
Proof: Parts of the proof are similar to those of Theorem 3.6 in [170](20) and [217](21) but
were developed independently, and the theorem here is valid for a larger class of problems.
Let $V_k$ be the $k$-th value iterate for $V_0 = 0$ (it is only necessary that $V_0$ respects the
symmetry, but this choice respects all possible symmetries), i. e. $V_k = V_k(Q_{k-1})$ according to
Equation 2.48, and $Q_k = Q_k(V_k)$ according to Equation 2.49. It will be shown by induction
that each of the (Q-)value iterates keeps the same symmetry, which implies that the limits
$V^*$ and $Q^*$, respectively, also possess this symmetry.
The induction starts by noticing that $V_0 = 0$ respects any symmetry and that then $Q_0 = R_1$
also respects any symmetry which can be induced by $h$, by its definition. The induction
hypothesis (I. H.) is the assumption that this is true for all iterates up to $V_{k-1}, Q_{k-1}$,
i. e. for all $j \le k-1$ and for $h(s,a,o) = (f(s), g_s(a), i_s(o))$ holds that $V_j(s) = V_j(f(s))$
and $Q_j(s,a,o) = Q_j(h(s,a,o))$. Here, the value functions are not indexed like the state
spaces because the argument uniquely shows which value function is meant. Then, for
$h(s,a,o) = (f(s), g_s(a), i_s(o))$ holds:
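$$\begin{aligned}
V_k(s) &= \max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \; \min_{o \in \mathcal{O}_1(s)} \sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot Q_{k-1}(s,a,o) \\
&\overset{\text{I. H.}}{=} \max_{\pi_s \in \mathrm{PD}(\mathcal{A}_1(s))} \; \min_{o \in \mathcal{O}_1(s)} \sum_{a \in \mathcal{A}_1(s)} \pi_s(a) \cdot Q_{k-1}(f(s), g_s(a), i_s(o)) \\
&\overset{(1)}{=} \max_{\pi_{f(s)} \in \mathrm{PD}(\mathcal{A}_2(f(s)))} \; \min_{o \in \mathcal{O}_2(f(s))} \sum_{a \in \mathcal{A}_2(f(s))} \pi_{f(s)}(a) \cdot Q_{k-1}(f(s), a, o) \\
&= V_k(f(s)) ,
\end{aligned}$$
where (1) is the MGR property (Definition 3.12 and below), which is always fulfilled
(Proposition 3.13) and also guarantees the existence of a lifted policy for every state $s$.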
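Furthermore,
$$\begin{aligned}
Q_k(s,a,o) &= R_1(s,a,o) + \gamma \sum_{s' \in \mathcal{S}_1} T_1(s,a,o,s') \cdot V_k(s') \\
&\overset{(1)}{=} R_2(h(s,a,o)) + \gamma \sum_{s' \in \mathcal{S}_1} T_1(s,a,o,s') \cdot V_k(f(s')) \\
&\overset{(2)}{=} R_2(h(s,a,o)) + \gamma \sum_{f(s') \in \mathcal{S}_2} V_k(f(s')) \sum_{s'' \in f^{-1}(f(s'))} T_1(s,a,o,s'') \\
&\overset{(3)}{=} R_2(h(s,a,o)) + \gamma \sum_{f(s') \in \mathcal{S}_2} V_k(f(s')) \cdot \tilde{T}_1(s,a,o,[s']) \\
&\overset{(4)}{=} R_2(h(s,a,o)) + \gamma \sum_{f(s') \in \mathcal{S}_2} T_2(h(s,a,o), f(s')) \cdot V_k(f(s')) \\
&\overset{(5)}{=} R_2(h(s,a,o)) + \gamma \sum_{s' \in \mathcal{S}_2} T_2(h(s,a,o), s') \cdot V_k(s') \\
&= Q_k(h(s,a,o)) .
\end{aligned}$$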
(1) is valid because of the definition of $h$ and because the symmetry of $V_k$ was shown
directly before; for (2) note that $f : \mathcal{S}_1 \to \mathcal{S}_2$; (3) holds by definition of the block
transition function $\tilde{T}_1$ and because $f^{-1}(f(s')) = [s']$; (4) holds again by definition of $h$;
and (5) holds because $f$ is surjective. $\square$
(20) The author thinks that for the second equality sign of the proof in [170] some implicit assumptions on
the state-value function $V_m$ are made which are more obvious in the presentation here.
(21) That proof on the other hand is more formal but possibly more complex than necessary.
3.17 Remark (Implications of Theorem 3.16 on Complexity)
Remark 3.7 is analogously valid for 2P-ZS-MGs. In particular, the optimal value function is
constant on $[s]_h$, the optimal Q-value function is constant on $[(s,a,o)]_h$, and there exists
an optimal policy for each agent which is constant on $[(s,a)]_h$ and $[(s,o)]_h$, respectively.
3.2.2 Automorphisms for the Exchange of Agents
After the analogues of the MDP symmetries in 2P-ZS-MGs have been exploited, a qualitatively
new symmetry that results from exchanging the two agents is introduced.(22) For
that purpose, the concept of $\mu$-homomorphisms is employed. In practice, agent exchanging
symmetries have been used for different board games, as already mentioned in the
introduction of this chapter, but the present work lays formal foundations for this practice
and, more importantly, shows that the exchange of agents is also compatible with the other
standard symmetries obtained by 2P-ZS-MG homomorphisms (Proposition 3.24).
Briefly summarised, a 2P-ZS-MG $\mu$-homomorphism is a 2P-ZS-MG homomorphism for
which the reward condition is changed by a factor of $\mu$. If $\mu > 0$ then the proof of
Theorem 3.16 holds with the simple changes that $R_1(s,a,o) = \mu \cdot R_2(h(s,a,o))$,
$V_k(s) = \mu \cdot V_k(f(s))$, and $Q_k(s,a,o) = \mu \cdot Q_k(h(s,a,o))$ because a positive constant can be
factored out of an expression to be maximised or minimised: $\max_x(\mu \cdot f(x)) = \mu \cdot \max_x(f(x))$.
However, for a negative constant $\mu < 0$ a maximum turns into a minimum and vice versa:
$\max_x(\mu \cdot f(x)) = \mu \cdot \min_x(f(x))$. This makes the two cases essentially different and justifies
the separate definition. In fact, the case $\mu > 0$ only introduces an additional scaling of
the reward to the framework of Section 3.2.1, but $\mu < 0$ points to the new aspect of agent
exchanging symmetries in 2P-ZS-MGs.
Now, 2P-ZS-MG µ-homomorphisms are to be defined:
3.18 Definition (2P-ZS-MG µ-Homomorphism)
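Let $M_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ and $M_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$ be two 2P-ZS-MGs as defined in Section 2.4 with $D_1 = D_2 = \mathbb{N}_0$. A map $h: \mathcal{SAO}_1 \to \mathcal{SAO}_2$ is called a 2P-ZS-MG µ-homomorphism if $\mu \neq 0$, $h$ is a surjection, and additionally the following holds:
1.) if $\mu > 0$: $h$ is defined by a tuple of surjective maps $(f, (g_s)_{s \in \mathcal{S}_1}, (i_s)_{s \in \mathcal{S}_1})$ with $h(s, a, o) = (f(s), g_s(a), i_s(o))$, $f: \mathcal{S}_1 \to \mathcal{S}_2$, $g_s: A_1(s) \to A_2(f(s))$, and $i_s: O_1(s) \to O_2(f(s))$,
2.) if $\mu < 0$: $h$ is defined by a tuple of surjective maps $(f, (g_s)_{s \in \mathcal{S}_1}, (i_s)_{s \in \mathcal{S}_1})$ with $h(s, a, o) = (f(s), i_s(o), g_s(a))$, $f: \mathcal{S}_1 \to \mathcal{S}_2$, $g_s: A_1(s) \to O_2(f(s))$, and $i_s: O_1(s) \to A_2(f(s))$,
such that
\[ \forall (s, a, o) \in \mathcal{SAO}_1 \ \forall s' \in \mathcal{S}_1: \quad \tilde{T}_1(s, a, o, [s']) = T_2(h(s, a, o), f(s')) \tag{3.24} \]
and
\[ \forall (s, a, o) \in \mathcal{SAO}_1: \quad R_1(s, a, o) = \mu \cdot R_2(h(s, a, o)) \tag{3.25} \]
where the block transition function $\tilde{T}_1$ of the 2P-ZS-MG $M_1$ is defined as in Equation 3.19 and the equivalence classes fulfill the projective conditions of 2P-ZS-MG homomorphisms, which makes Equations 3.10, 3.13 and 3.14 necessary conditions.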
(22)This exchange of agents is not to be confused with a permutation of agents in a multi-player MDP.
The MGR property remains for $\mu > 0$ but significantly changes for $\mu < 0$:
3.19 Remark (Adaptation of MGR Property to µ-homomorphisms)
For $\mu < 0$, $h$ is said to have the matrix game reduction (MGR) property iff for all states $s \in \mathcal{S}_1$ and for all matrices $M_{1,s} \in \mathbb{R}^{|A_1(s)|, |O_1(s)|}$ and $M_{2,f(s)} \in \mathbb{R}^{|A_2(f(s))|, |O_2(f(s))|}$ that respect the structure of $h$, i. e.
\[ \forall (a', o') \in A_2(f(s)) \times O_2(f(s)) \ \forall (a, o) \in g_s^{-1}(o') \times i_s^{-1}(a'): \quad (M_{1,s})(a, o) = -\left((M_{2,f(s)})(a', o')\right)^T , \]
holds that
\[ V(M_{1,s}) = -V(M_{2,f(s)}) \tag{3.26} \]
and that the optimal policies of $M_{1,s}$ are lifted optimal policies of $M_{2,f(s)}$.
The proof that this property holds for every 2P-ZS-MG µ-homomorphism with $\mu < 0$ is analogous to that of Proposition 3.13, with the difference that it has to be shown that $M_{1,s}$ can be transformed into a reduced matrix $\overline{M}_{1,s} = -M_{2,f(s)}^T$ by removing equal rows and equal columns. The claim then follows from the minimax theorem (Theorem 2.23), by which for any matrix $N$ holds:
\[ \max_{\pi_1} \min_{\pi_2} \pi_1^T N \pi_2 = \min_{\pi_2} \max_{\pi_1} \pi_1^T N \pi_2 = -\max_{\pi_2} \min_{\pi_1} \left(-\pi_1^T N \pi_2\right) = -\max_{\pi_2} \min_{\pi_1} \pi_2^T (-N^T) \pi_1 . \]
This means that $V(N) = -V(-N^T)$ and that a policy pair $(\pi_1^*, \pi_2^*)$ is optimal in $N$ iff $(\pi_2^*, \pi_1^*)$ is optimal in $-N^T$.
To see that $M_{1,s}$ can be transformed into the reduced matrix $\overline{M}_{1,s} = -M_{2,f(s)}^T$, the following adaptations have to be made: the statement remains that the structure of equal rows and columns corresponds to the structure of the state-wise intersection of equivalence classes $[(s, a)] \cap (\{s\} \times A_1(s)) = \{s\} \times g_s^{-1}(g_s(a))$ and $[(s, o)] \cap (\{s\} \times O_1(s)) = \{s\} \times i_s^{-1}(i_s(o))$, respectively, and that an optimal policy from $M_{2,f(s)}$ can be lifted to a policy of $M_{1,s}$.
Then some minor changes are necessary because now $g_s: A_1(s) \to O_2(f(s))$ and $i_s: O_1(s) \to A_2(f(s))$: For a fixed $s \in \mathcal{S}_1$ let $\{o'_1, \ldots, o'_{m_2}\} = O_2(f(s))$ be an enumeration of the actions with $m_2 = |O_2(f(s))|$, and let $\{a'_1, \ldots, a'_{n_2}\} = A_2(f(s))$ be an enumeration of the actions with $n_2 = |A_2(f(s))|$ such that the transposed matrix $(M_{2,f(s)}(i, j))^T$ is ordered as $(M_{2,f(s)}(i, j))^T = (M_{2,f(s)}(a'_i, o'_j))^T$. Since $g_s: A_1(s) \to O_2(f(s))$ is surjective, $g_s^{-1}(o'_i) \neq \emptyset$ for every $i$ and there exists an enumeration $\{a_1, \ldots, a_{m_1}\} = A_1(s)$ with $a_k \in g_s^{-1}(o'_k)$ for $1 \le k \le m_2$. Since $A_1(s) = \bigcup_{o' \in O_2(f(s))} g_s^{-1}(o')$, for all $k > m_2$ holds that $a_k \in g_s^{-1}(g_s(a_{i_k}))$ with $i_k \le m_2$. Then for all $o \in O_1(s)$ holds: $h(s, a_k, o) = (f(s), i_s(o), g_s(a_k)) = (f(s), i_s(o), g_s(a_{i_k})) = h(s, a_{i_k}, o)$. This means that in $M_{1,s}$ for each $k > m_2$ the $k$-th row is equal to the $i_k$-th row where $i_k \le m_2$, and therefore all rows with index $k > m_2$ can be removed. Further, all actions in $g_s^{-1}(g_s(a_{i_k})) \subseteq A_1(s)$ are equivalent, which shows that the policy for P1 can be lifted.
For columns the analogous statement holds that there exists an enumeration $\{o_1, \ldots, o_{n_1}\} = O_1(s)$ with $o_k \in i_s^{-1}(a'_k)$ for $1 \le k \le n_2$ and that all columns with index $k > n_2$ can be removed. Further, all actions in $i_s^{-1}(i_s(o_{i_k})) \subseteq O_1(s)$ are equivalent, which shows that the policy for P2 can also be lifted. Since equal columns stay equal when equal rows are removed, $M_{1,s}$ can be transformed into the reduced matrix $\overline{M}_{1,s}$, which by construction equals $-M_{2,f(s)}^T$.
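The reduction step described above (remove repeated rows, then repeated columns) can be sketched in a few lines. This is a minimal illustration on a hand-made matrix, not the construction via the enumerations $a_k$, $o_k$ from the proof; the helper names are hypothetical.

```python
def remove_duplicate_rows(matrix):
    """Keep the first occurrence of every row, preserving the original order."""
    seen, kept = set(), []
    for row in matrix:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(list(row))
    return kept


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def reduce_matrix(matrix):
    """Remove equal rows first, then equal columns (columns stay equal after row removal)."""
    rows_reduced = remove_duplicate_rows(matrix)
    columns_reduced = remove_duplicate_rows(transpose(rows_reduced))
    return transpose(columns_reduced)


# Rows 0 and 1 coincide, and so do columns 0 and 1 (equivalent actions of each agent).
m1 = [[1, 1, -2],
      [1, 1, -2],
      [0, 0, 3]]
reduced = reduce_matrix(m1)
assert reduced == [[1, -2], [0, 3]]
```

Equal rows or columns correspond to actions that a player cannot distinguish in terms of payoff, which is why the value of the game is unchanged and a policy of the reduced game can be lifted by distributing probability arbitrarily among the merged actions.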
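By definition Equation 3.26 is equivalent to
\[ \max_{\pi_s \in PD(A_1(s))} \min_{o \in O_1(s)} \sum_{a \in A_1(s)} \pi_s(a) \cdot M_{1,s}(a, o) = -\max_{\pi_{f(s)} \in PD(A_2(f(s)))} \min_{o \in O_2(f(s))} \sum_{a \in A_2(f(s))} \pi_{f(s)}(a) \cdot M_{2,f(s)}(a, o) \tag{3.27} \]
which means
\[ \max_{\pi_s \in PD(A_1(s))} \min_{o \in O_1(s)} \sum_{a \in A_1(s)} \pi_s(a) \cdot \left(-\left(M_{2,f(s)}(g_s(a), i_s(o))\right)^T\right) = -\max_{\pi_{f(s)} \in PD(A_2(f(s)))} \min_{o \in O_2(f(s))} \sum_{a \in A_2(f(s))} \pi_{f(s)}(a) \cdot M_{2,f(s)}(a, o). \tag{3.28} \]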
In the following theorem all other corresponding main theorems of this chapter are included because 2P-ZS-MG homomorphisms are 2P-ZS-MG µ-homomorphisms with $\mu = 1 > 0$.
3.20 Theorem (Main Implications of 2P-ZS-MG µ-Homomorphisms)
Let $M_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ be a 2P-ZS-MG and let $h: \mathcal{SAO}_1 \to \mathcal{SAO}_2$ be a 2P-ZS-MG µ-homomorphism (Definition 3.18) for some 2P-ZS-MG $M_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$. Then, $V^*(s) = \mu \cdot V^*(f(s))$ and $Q^*(s, a, o) = \mu \cdot Q^*(h(s, a, o))$. Furthermore, there exists an optimal policy of $M_1$ which is a lifted optimal policy of $M_2$, where in the case of $\mu < 0$ the two policies $\pi$, $\hat{\pi}$ in each of the two formulae in Definition 3.11 are from different agents, as the domains and co-domains of $g_s$ and $i_s$ indicate.
Proof: As mentioned at the beginning of this subsection, the case $\mu > 0$ just introduces a factor $\mu$ in the proof of Theorem 3.16 such that $R_1(s, a, o) = \mu \cdot R_2(h(s, a, o))$, $V_k(s) = \mu \cdot V_k(f(s))$, and $Q_k(s, a, o) = \mu \cdot Q_k(h(s, a, o))$.
However, for $\mu < 0$ something essentially different can be observed. In this case, the induction again starts by noticing that $V_0 = 0$ respects any symmetry and that then $Q_0 = R_1$ also respects any symmetry which can be induced by $h$. As induction hypothesis (I. H.), it is assumed that this is true for all iterates up to $V_{k-1}$, $Q_{k-1}$, i. e. for all $j \le k-1$ holds that $V_j(s) = \mu \cdot V_j(f(s))$ and $Q_j(s, a, o) = \mu \cdot Q_j(h(s, a, o))$. Again, the value functions are not indexed as the state spaces because the argument uniquely shows which value function is meant. Thus, for $h(s, a, o) = (f(s), i_s(o), g_s(a))$ holds:
\[
\begin{aligned}
V_k(s) &= \max_{\pi_s \in PD(A_1(s))} \min_{o \in O_1(s)} \sum_{a \in A_1(s)} \pi_s(a) \cdot Q_{k-1}(s, a, o) \\
&\overset{\text{I. H.}}{=} \max_{\pi_s \in PD(A_1(s))} \min_{o \in O_1(s)} \sum_{a \in A_1(s)} \pi_s(a) \cdot \mu \cdot Q_{k-1}(f(s), i_s(o), g_s(a)) \\
&\overset{(1)}{=} -\mu \cdot \max_{\pi_s \in PD(A_1(s))} \min_{o \in O_1(s)} \sum_{a \in A_1(s)} \pi_s(a) \cdot \left(-Q_{k-1}(f(s), i_s(o), g_s(a))\right) \\
&\overset{(2)}{=} \mu \cdot \max_{\pi_{f(s)} \in PD(A_2(f(s)))} \min_{o \in O_2(f(s))} \sum_{a \in A_2(f(s))} \pi_{f(s)}(a) \cdot Q_{k-1}(f(s), a, o) \\
&= \mu \cdot V_k(f(s)) ,
\end{aligned}
\]
where (1) holds because $-\mu > 0$ and (2) is the MGR property (Remark 3.19 and Equation 3.28), the proof of which needed the minimax theorem for matrix games (Theorem 2.23). Then, also $Q_k(s, a, o) = \mu \cdot Q_k(h(s, a, o))$ follows analogously to the proof for 2P-ZS-MG homomorphisms. $\Box$
3.21 Remark (New Aspects of Theorem 3.20)
One of the most interesting aspects of Theorem 3.20, besides the fact that it holds at all, is that the proof for the case $\mu < 0$ needs the minimax theorem (Theorem 2.23) for matrix games. This aspect should be highlighted because even an informal argument like “one exchanges the two agents in equal situations”, which is often given by practitioners and points out the essence of the theorem, implicitly makes use of the minimax theorem. The reason is that the minimax theorem states that for an optimal pair of policies it does not matter whether agent one or agent two has to decide on its strategy first (and reveal it to the other) or whether both decide without knowledge of the other agent’s policy. Such a property is evidently needed in order to speak about an “equal situation” of both agents.
3.22 Example (Robot Soccer, 4)
In Figure 3.1, some standard symmetries in robot soccer are depicted which should be
respected by any model describing a robot soccer game. In this example, an abstract
agent is a team consisting of two robots. A permutation of identical agents within a
team is additionally possible but only depicted indirectly as the robots have no individual
numbers.
[Figure 3.1, four panels: (a) State $s$. (b) State $g_y(s)$. (c) State $g_x(s)$. (d) State $(g_x \circ g_y)(s) = (g_y \circ g_x)(s)$.]
Figure 3.1: Symmetries in robot soccer: a standard situation (state) $s$ and its symmetric states $g_y(s)$ with exchange of the two teams, $g_x(s)$ without the exchange of teams, and the combination of both: $(g_x \circ g_y)(s) = (g_y \circ g_x)(s)$. The labels 1 and 2 indicate the team and the defended goal region, and the small black circle depicts the ball.
3.23 Remark (Fulfillment of the Projection Properties)
Any 2P-ZS-MG µ-homomorphism $h: \mathcal{SAO}_1 \to \mathcal{SAO}_2$ fulfills the projective properties (Equations 3.10, 3.13 and 3.14) because without using these properties it can be shown for $\mu > 0$ and $\mu < 0$ that
$\Pi_{s,a}(h^{-1}(h(s, a, o))) = \{(s', a') : f(s) = f(s') \text{ and } g_{s'}(a') = g_s(a)\}$,
$\Pi_{s,o}(h^{-1}(h(s, a, o))) = \{(s', o') : f(s) = f(s') \text{ and } i_{s'}(o') = i_s(o)\}$, and
$\Pi_s(h^{-1}(h(s, a, o))) = \{s' : f(s') = f(s)\}$.
Proof of e. g. $\Pi_{s,a}(h^{-1}(h(s, a, o))) = \{(s', a') : f(s) = f(s') \text{ and } g_{s'}(a') = g_s(a)\}$ for $\mu > 0$:
“$\subseteq$”: Let $(s', a') \in \Pi_{s,a}(h^{-1}(h(s, a, o)))$, i. e. there exists $o' \in O_1(s')$ with $h(s', a', o') = h(s, a, o)$. Then $f(s) = f(s')$ and $g_{s'}(a') = g_s(a)$.
“$\supseteq$”: Let $(s', a') \in \{(s'', a'') : f(s) = f(s'') \text{ and } g_{s''}(a'') = g_s(a)\}$. Then $f(s) = f(s')$ and $g_{s'}(a') = g_s(a)$. For the given $o \in O_1(s)$ choose $o' \in i_{s'}^{-1}(i_s(o)) \subseteq O_1(s')$; this set is non-empty since $i_{s'}: O_1(s') \to O_2(f(s')) = O_2(f(s))$ is surjective. Then $h(s', a', o') = h(s, a, o)$, which means $(s', a') \in \Pi_{s,a}(h^{-1}(h(s, a, o)))$.
The other projections for $\mu > 0$ are similar. For $\mu < 0$, only the different image spaces, e. g. $i_{s'}: O_1(s') \to A_2(f(s')) = A_2(f(s))$, have to be taken into account additionally.
3.24 Proposition (Composition of 2P-ZS-MG µ-Homomorphisms)
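Let $h_1: \mathcal{SAO}_1 \to \mathcal{SAO}_2$ be a $\mu_1$-homomorphism and $h_2: \mathcal{SAO}_2 \to \mathcal{SAO}_3$ be a $\mu_2$-homomorphism for three associated 2P-ZS-MGs $M_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$, $M_2 = (D_2, \mathcal{S}_2, \mathcal{SAO}_2, T_2, R_2)$, and $M_3 = (D_3, \mathcal{S}_3, \mathcal{SAO}_3, T_3, R_3)$. Then, $h_3 = h_2 \circ h_1: \mathcal{SAO}_1 \to \mathcal{SAO}_3$ is a $\mu_3$-homomorphism with $\mu_3 = \mu_1 \cdot \mu_2$.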
Proof: To check that $h_3$ is a 2P-ZS-MG $\mu_3$-homomorphism, one first must notice that the domains and co-domains of the component functions $f_k$, $g_{k,s}$, $i_{k,s}$ ($k \in \{1, 2, 3\}$) of all three homomorphisms fit together for all four cases ($\mu_1$, $\mu_2$ positive or negative) because a change of sign is equivalent to a change of the agents.
According to Remark 3.23 the equivalence classes $[(s, a, o)]_{h_3} \subseteq \mathcal{SAO}_1$ induced by $h_3$ fulfill the projection properties Equations 3.10, 3.13 and 3.14.
It must then be verified that the transition and reward functions also fulfill the 2P-ZS-MG $\mu_3$-homomorphism conditions. For the transition function this means proving that $\tilde{T}_1(s, a, o, [s']_{h_3}) = T_3(h_3(s, a, o), f_3(s'))$ under the premise that $\tilde{T}_k(s, a, o, [s']_{h_k}) = T_{k+1}(h_k(s, a, o), f_k(s'))$ for $k \in \{1, 2\}$. The following equation is helpful in the proof
\[
\begin{aligned}
[(s, a, o)]_{h_3} &= h_3^{-1}(h_3(s, a, o)) \\
&= (h_2 \circ h_1)^{-1}((h_2 \circ h_1)(s, a, o)) \\
&= h_1^{-1}(h_2^{-1}(h_2(h_1(s, a, o)))) \\
&= h_1^{-1}([h_1(s, a, o)]_{h_2}) \\
&= \bigcup_{(s', a', o') \in [h_1(s, a, o)]_{h_2}} h_1^{-1}(s', a', o')
\end{aligned}
\]
because by projection holds $[s]_{h_3} = \bigcup_{s' \in [f_1(s)]_{h_2}} f_1^{-1}(s')$, which is needed below for $(*)$ together with the surjectivity of $f_1$:
\[
\begin{aligned}
\tilde{T}_1(s, a, o, [s']_{h_3}) &= \sum_{s'' \in [s']_{h_3}} T_1(s, a, o, s'') \\
&\overset{(*)}{=} \sum_{f_1(s'') \in [f_1(s')]_{h_2}} \ \sum_{s''' \in [s'']_{h_1}} T_1(s, a, o, s''') \\
&= \sum_{f_1(s'') \in [f_1(s')]_{h_2}} T_2(h_1(s, a, o), f_1(s'')) \\
&= \tilde{T}_2(h_1(s, a, o), [f_1(s')]_{h_2}) \\
&= T_3(h_2(h_1(s, a, o)), f_2(f_1(s'))) \\
&= T_3(h_3(s, a, o), f_3(s')) .
\end{aligned}
\]
Finally, the reward condition reads $R_1 = \mu_1 \cdot (R_2 \circ h_1) = (\mu_1 \mu_2) \cdot (R_3 \circ (h_2 \circ h_1))$, which shows that $h_3$ is a 2P-ZS-MG $\mu_3$-homomorphism with $\mu_3 = \mu_1 \cdot \mu_2$. $\Box$
The intuition that the composition of 2P-ZS-MG µ-homomorphisms leads to coarser (if not equal) equivalence classes can formally be obtained by
\[ [(s, a, o)]_{h_3} = \bigcup_{(s', a', o') \in [h_1(s, a, o)]_{h_2}} h_1^{-1}(s', a', o') \]
which is used in the proof of Proposition 3.24. Because $h_1$ is surjective, for every $(s', a', o') \in [h_1(s, a, o)]_{h_2}$ there exists an $(s'', a'', o'') \in \mathcal{SAO}_1$ with $(s', a', o') = h_1(s'', a'', o'')$, i. e. $[(s, a, o)]_{h_3} = \bigcup_i [(s_i, a_i, o_i)]_{h_1}$ is a union of equivalence classes with respect to $h_1$.
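Two statements of this part, the multiplication of the factors under composition and the coarsening of the equivalence classes, can be illustrated on a toy example. The sketch below is an assumption-laden simplification: state-action triples are abstracted to plain labels (`x*`, `y*`, `z*`), transitions are ignored, and only the reward condition of Equation 3.25 is checked; all names and numbers are made up for illustration.

```python
def equivalence_classes(h):
    """Group the domain of the map h (given as a dict) by image; returns sorted blocks."""
    blocks = {}
    for x in sorted(h):
        blocks.setdefault(h[x], []).append(x)
    return sorted(blocks.values())


# Hypothetical maps between three abstract games (labels stand for (s, a, o) triples).
h1 = {"x1": "y1", "x2": "y2", "x3": "y3", "x4": "y2"}
h2 = {"y1": "z1", "y2": "z2", "y3": "z1"}
h3 = {x: h2[h1[x]] for x in h1}          # h3 = h2 o h1

mu1, mu2 = 3, -1                         # h2 exchanges the agents (negative factor)
r3 = {"z1": 2, "z2": -1}
r2 = {y: mu2 * r3[h2[y]] for y in h2}    # reward condition for h2
r1 = {x: mu1 * r2[h1[x]] for x in h1}    # reward condition for h1

# The composed map satisfies the reward condition with mu3 = mu1 * mu2.
mu3 = mu1 * mu2
assert all(r1[x] == mu3 * r3[h3[x]] for x in h1)

# Every class of h3 is a union of classes of h1.
classes_h1 = equivalence_classes(h1)
classes_h3 = equivalence_classes(h3)
assert all(any(set(c1) <= set(c3) for c3 in classes_h3) for c1 in classes_h1)
```

Here the classes of `h1` are `{x1}`, `{x2, x4}`, `{x3}`, and composing with `h2` merges `{x1}` and `{x3}` into one class of `h3`.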
3.25 Lemma (2P-ZS-MG µ-Automorphisms)
Let $M_1 = (D_1, \mathcal{S}_1, \mathcal{SAO}_1, T_1, R_1)$ be a 2P-ZS-MG with finite $\mathcal{SAO}_1$ and let $h: \mathcal{SAO}_1 \to \mathcal{SAO}_1$ be a 2P-ZS-MG µ-automorphism. Then $\mu \in \{-1, 1\}$.
Proof: Let $r_{max} = \max_{(s,a,o) \in \mathcal{SAO}_1} |R_1(s, a, o)| \neq 0$ be the maximal modulus of the reward of the 2P-ZS-MG. Since $h$ is a 2P-ZS-MG µ-automorphism, $h \circ h = h^2$ is a 2P-ZS-MG $\mu^2$-automorphism by means of Proposition 3.24 and $R_1 = \mu^2 (R_1 \circ h^2)$. It must be verified that $\mu^2 = 1$. Two cases have to be excluded: $1 > \mu^2 > 0$ and $\mu^2 > 1$. If $\mu^2 > 1$, then every state-action triple $(s, a, o)$ whose image $h^2(s, a, o)$ has reward modulus $r_{max}$ (such a triple exists because $h^2$ is surjective) itself has reward modulus $r_{max} \cdot \mu^2 > r_{max}$, in contradiction to the fact that $r_{max}$ is maximal. If $\mu^2 < 1$, then each state-action triple $(s, a, o)$ has reward modulus $|R_1(s, a, o)| = \mu^2 \cdot |R_1(h^2(s, a, o))| \le r_{max} \cdot \mu^2 < r_{max}$, in contradiction to the fact that $r_{max}$ is attained on the finite set $\mathcal{SAO}_1$. $\Box$
3.26 Remark (Finite and Infinite 2P-ZS-MG µ-Automorphisms)
Lemma 3.25 shows that automorphisms of finite 2P-ZS-MGs can only have two forms: the first is the 2P-ZS-MG homomorphism ($\mu = 1$) and the second one can be considered as a 2P-ZS-MG homomorphism with the two agents exchanged ($\mu = -1$). For MDPs, the second possibility is meaningless, which justifies the consideration of MDP homomorphisms without general $\mu > 0$ in [171].
However, for (countably and uncountably) infinite 2P-ZS-MGs the lemma is not valid. Consider e. g. the following 2P-ZS-MG $M = (D, \mathcal{S}, \mathcal{SAO}, T, R)$ with $\mathcal{S} = \mathbb{Q}$ or $\mathcal{S} = \mathbb{R}$, trivial action sets $A_s = \{a\}$, $O_s = \{o\}$ for all $s \in \mathcal{S}$, a trivial transition function $T(s, a, o, s') = \delta_{s,s'}$ for all $s, s' \in \mathcal{S}$, and $R(s, a, o) = s$. Then, for every $\mu \in \mathcal{S} \setminus \{0\}$ a µ-automorphism is defined by $h(s, a, o) = (\mu \cdot s, a, o)$ because the complete structure (the trivial action spaces and transition function) is equal in every state and the reward is appropriately scaled.
Chapter 4
Supervised Learning (SL), Function
Approximation, Generalisation
Contents
4.1 Introduction
4.1.1 General Approximation Results
4.1.2 Generalisation
4.1.3 Function Approximation with Automated Basis Determination
4.2 Value Iteration with SL: Convergence Result
4.3 Combination of RL and SL: Practical Results from Literature
4.3.1 MDPs
4.3.2 2P-ZS-MGs
Wisdom is learning what to overlook.
William James (1842–1910)
(http://www.nonstopenglish.com)
Supervised learning (SL) is almost a synonym for function approximation. Sometimes the
first one is considered to deal especially with noisy function evaluations and the second
one with exact evaluations. There are two basic reasons for applying supervised learning
(SL) techniques in RL: first, to simply compress an unmanageably large state space to
a reasonable size assuming that the (unknown) underlying essential state space is much
smaller than the standard one, and second to gain insight about a problem by discovering
this essential state space. For both reasons the construction of features, i. e. a special kind
of basis functions in the function approximation sense(1) which is often easy to interpret
and inspired by human intuition, is one of the most common techniques besides neural
network approaches. However, theoretically every set of basis functions such as radial
basis functions of any kind, polynomials, or splines can serve as function approximation
architecture.
This chapter is structured as follows: Section 4.1 gives a short introduction to SL and highlights the importance of automatically detecting good basis functions. In Section 4.2 the key result on the provable convergence quality of value iteration in combination with SL methods is presented. Although the result improves a previous one in [19], it means that in practice it is very hard to guarantee convergence to an $\varepsilon$-approximation of the optimal value function since the approximation architecture has to provide very small errors in $\|\cdot\|_\infty$. Nevertheless, Section 4.3 provides an overview of practical results showing encouraging performance without any guarantee.
(1) No orthogonality conditions are fulfilled and even the linear independence is often not checked in RL literature but fulfilled by trying to achieve the smallest set of features.
Numerical results of the author can be found in Section 5.2.6 in the context of other
numerical studies.
4.1 Introduction
Supervised learning deals with finding a function $\tilde{f}: X \to Y$ in some predefined function space $\mathcal{F}$ which approximates a given function $f: X \to Y$ in an optimal way with respect to some norm:
\[ \tilde{f} = \arg\min_{g \in \mathcal{F}} \|g - f\| . \tag{4.1} \]
Common norms are $\|\cdot\|_2$ or $\|\cdot\|_\infty$, which measure the mean and the maximum deviation, respectively. The more compact formulation
\[ \tilde{f} = \tilde{A} f \tag{4.2} \]
may be advantageous whenever $\mathcal{F}$ is unambiguously defined. Generally, two types of errors occur: the approximation error due to the approximation architecture because $f \notin \mathcal{F}$, and the estimation error or sample error due to the finite number of sample data for which the function is evaluated [37, 157]. Sometimes, SL is understood to be function approximation with noisy data, i. e. the true function values $f(x)$ are modified by some additional noise, but throughout this thesis it is assumed that noise-free function data are approximated.
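The dependence of the minimiser in Equation 4.1 on the chosen norm can be made concrete with a deliberately tiny function space. The sketch below assumes, purely for illustration, that $\mathcal{F}$ consists of constant functions and that $f$ is known on four sample points: the best constant in $\|\cdot\|_2$ is then the mean of the samples, while the best constant in $\|\cdot\|_\infty$ is the midpoint between the smallest and the largest sample. The sample values are made up.

```python
import math


def l2_error(samples, constant):
    """Euclidean norm of the deviation between the samples and a constant approximation."""
    return math.sqrt(sum((y - constant) ** 2 for y in samples))


def max_error(samples, constant):
    """Maximum norm of the deviation between the samples and a constant approximation."""
    return max(abs(y - constant) for y in samples)


samples = [0.0, 1.0, 2.0, 9.0]                     # hypothetical values f(x_1), ..., f(x_4)
best_l2 = sum(samples) / len(samples)              # mean minimises the 2-norm error
best_max = (min(samples) + max(samples)) / 2       # midrange minimises the max-norm error

# Each constant is better than the other one in "its own" norm.
assert l2_error(samples, best_l2) < l2_error(samples, best_max)
assert max_error(samples, best_max) < max_error(samples, best_l2)
```

The outlier value 9.0 moves the max-norm minimiser much further than the 2-norm minimiser, which hints at why guarantees in $\|\cdot\|_\infty$ are demanding for an approximation architecture.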
Since the contraction property of the Bellman operators (contraction rate $\gamma < 1$) is valid for $\|\cdot\|_\infty$, this norm should also be used for function approximation ([75, 78]; the approximation scheme is based on [196]).(2) However, many SL methods find approximations in $L_p$-norms; hence, it is valuable to give approximation bounds for approximate value iteration with these norms that are better than the $\sqrt[p]{|\mathcal{S}|}$ factor of the standard estimate for norm equivalence [151].
A huge number of function approximation methods exists which include but are not limited to (see [88]) neural network methods [19, 62, 122, 156, 176, 4, 82](3), fuzzy logic [16, 105], cerebellar model articulation controller (CMAC) [2, 3, 83, 198, 201], classifier systems [57], and local memory-based methods [51, 145, 146]. By racing, different approximation architectures can be tested and their parameters optimised [59, 131, 132]. The main idea is that most of the possible choices of parameters are discarded after a few tests and only the remaining ones are tested more intensively. A comparison of different approximation architectures with respect to good fitting and a simple model (no overfitting) is given in [121].
A subtle remaining question is which function(s) to approximate: one of the value functions ($V$ or $Q$), or directly the policy $\pi$, see also Section 4.1.2. Approximating the state value function yields a smaller domain than the state-action value function, therefore it may be better to approximate with fewer parameters, but this comes at the cost of having to calculate the Q-values in each step. In 2P-ZS-MGs (not in MDPs) both approaches suffer from the fact that the value of a potentially large matrix game has to be determined for every evaluation of the policy. Approximating the policy is the only method which avoids matrix game calculations for an acting agent but is not suited for the process of determining the optimal policy because the evaluation of the value of a state(-action) would be very expensive (policy evaluation involves the complete state space). Even policy search methods [178] need to estimate the quality of several policies and therefore have to estimate value functions or rely on heuristics. Finally, it is also possible to approximate several or all three functions, which can also be useful for mutual error tests.(4)
(2) An alternative would be to use weighted $\|\cdot\|_2$ norms [149] but the weights have to be determined first.
(3) Pesch states that some standard neural networks can lead to ill-conditioned optimisation problems [165] and gives further references.
4.1.1 General Approximation Results
A qualitative result often referred to as the curse of dimensionality and the blessing of smoothness is that with a degree of smoothness $s$ and dimension of the input parameter $\dim(x) = d$ the function $f(x)$ can only be expected to be approximated with an error of $O(n^{-s/d})$, where $n$ is the number of parameters of the function approximation scheme [157]. [82] states that the curse of dimensionality cannot be broken by neural networks with multi-layer perceptrons, radial basis functions or similar nonlinear techniques. This means that neural network approaches can only be successful if a problem is simpler at the core.
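A quick calculation shows how strongly the rate $n^{-s/d}$ degrades with the input dimension. The sketch below simply evaluates this expression; the constant hidden in the $O(\cdot)$ notation is set to one, which is an assumption made only for illustration, and the function name is hypothetical.

```python
import math


def expected_error(num_parameters, smoothness, dimension):
    """Evaluate n^(-s/d), ignoring the constant hidden in the O-notation."""
    return num_parameters ** (-smoothness / dimension)


n, s = 10_000, 2
errors = {d: expected_error(n, s, d) for d in (2, 4, 8, 16)}

# With the same number of parameters the achievable error grows rapidly with d ...
assert errors[2] < errors[4] < errors[8] < errors[16]
# ... e.g. from about 1e-4 for d = 2 to about 1e-1 for d = 8.
assert math.isclose(errors[2], 1e-4)
assert math.isclose(errors[8], 1e-1)
```

Read the other way round, keeping the error fixed requires a number of parameters that grows exponentially in $d/s$, which is the quantitative content of the statement above.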
4.1.2 Generalisation
Generalisation is one reason why supervised learning is employed. It means that the experience of a learner is transferred from the current situation to comparable ones. Generalisation is motivated by two facts: firstly, for large finite (or continuous) state spaces it is not possible to enumerate all states and, secondly, similar states often have similar values and optimal actions. Thus, generalisation is related to the two basic reasons for applying SL: state space reduction and a gain in insight.
Sometimes, generalisation and function approximation are separated artificially; however, generalisation can always be interpreted in terms of function approximation. Some generalisation methods are used to determine a function $e_X: X \to \tilde{X}$ such that the domain of $\tilde{f}$ is no longer $X$ but instead $e_X(X) \subseteq \tilde{X}$; these are called feature extraction methods [19]. They can also be local [111] or action dependent [197] for Q-value functions. The hope is that if $g$ is complicated enough, $\mathcal{F}$ can be chosen in a simpler way because the approximation is performed with $\tilde{f} \in \mathcal{F} \circ e_X$.
Other types of generalisation methods do not make use of $f(x)$ but only of the distribution of inputs $x_i$ of some input pairs $(x_i, f(x_i))$. These are collectively called unsupervised learning methods and can be interpreted as an input data preprocessing. Clustering approaches belong to this class. However, clustering approaches can also be applied to the $f(x_i)$ to extract all states with a similar value as a feature and hence become SL methods.
Generalisation in RL. As alluded to in the introduction, generalisation in RL can be divided into different approaches depending on the domain [88]. Generalisation over states such as feature extraction reflects the question of structural credit assignment: Which characteristics influence the optimal value function most? These methods basically reduce the description of the state space $\mathcal{S}$ by forming equivalence classes of similar states. Forming strict equivalence classes by exact model reduction is more restrictive but yields exact results (Chapter 3). Another strategy is to directly approximate the value function during value iteration [23, 206] or in Q-learning [108, 212]. In general, function approximation can affect the convergence of the iterative algorithms [23, 207] if one cannot show that the approximation scheme is less expansive (in max norm) than the Bellman operator is contractive [72, 97]. Residual algorithms are one approach to address this problem [7], while some standard approaches like neural nets and linear regression are not theoretically justified even if they work well in practice [72]. Instead, non-expansive approximation can be performed by averagers but these lack adaptivity [72]. Furthermore, adaptive resolution models, e. g. variable resolution dynamic programming (kd-tree [142]), the PartiGame algorithm [143], and decision trees [31, 134] can be seen as generalisation approaches.
(4) [78], p. 413, shows how to perform value function updates computationally more efficiently if the transition function $T$ is known by separately storing expectation values for each function of a linear approximation architecture.
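The non-expansiveness of averagers in the max norm can be checked directly. The following sketch assumes that an averager maps a vector of stored values to convex combinations of these values, i. e. it is given by a matrix with non-negative rows summing to one; the weights and value vectors are made up, and the function names are hypothetical.

```python
def apply_averager(weights, values):
    """Each output is a convex combination of the input values (rows sum to one)."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def max_norm(vector):
    """Maximum norm of a vector given as a list of floats."""
    return max(abs(x) for x in vector)


# Hypothetical averager on three states: every row is non-negative and sums to 1.
weights = [[0.5, 0.5, 0.0],
           [0.0, 1.0, 0.0],
           [0.25, 0.25, 0.5]]
v = [1.0, -2.0, 4.0]
w = [0.0, 3.0, 1.0]

distance_before = max_norm([a - b for a, b in zip(v, w)])
distance_after = max_norm([a - b for a, b in
                           zip(apply_averager(weights, v), apply_averager(weights, w))])

# Averaging never increases the max-norm distance between two value functions.
assert distance_after <= distance_before + 1e-12
```

Composed with a Bellman operator of contraction rate $\gamma$, such a map therefore still yields a contraction of rate at most $\gamma$, which is the property that the cited convergence arguments rely on.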
Alternatively to states, state-action pairs could be subject to generalisation. For approxi-
mating Q-values, two approaches exist: the first is to use one approximation architecture
for each action (if there are not too many different actions), the second is to employ only
one architecture for all state-action pairs. The last method is the only possible approach
for continuous action spaces. Training of continuous actions can be done by local gradient
ascent methods [8], by modifying variance and mean of normal distributions of the ac-
tions to execute [80], or by application of Kohonen maps (self-organising maps [93]) which
adaptively choose a discrete set of actions during learning [193](5).
4.1.3 Function Approximation with Automated Basis Determination
Choosing the approximating function space $\mathcal{F}$, i. e. basis functions in the case of linear function approximation, is an important issue.(6) There are three qualitatively different methods to choose the functions: first, one can directly choose finitely many functions $f_1, \ldots, f_n$, second, one can choose parametrised families of functions $f_{1,\alpha_1}, \ldots, f_{n,\alpha_n}$ and choose finitely many functions (perhaps with some optimisation over the parameters) which are most suited to approximate the given function, and third, one can try to construct the functions in a completely unparametrised way.
Obviously, the challenges increase from the first to the third method. Some work in the spirit of the last method is by Mahadevan [126, 128]. In his approach, so-called proto value functions serve as basis functions and are constructed only from state space topology information.(7) The second of the three methods is similar to feature detection and contains essential degrees of choice (parameters) for the approximation algorithm. Hence, this is also considered to be automated basis determination. Besides the predefined functions and free parameters, the algorithms of the second type need a selection criterion which is in the best case an optimisation criterion. The two aims of this selection are the general ones of function approximation: to fit the function well and to avoid overfitting. Cross validation errors, e. g. leave-one-out tests, or more generally the distinction between a training set and an evaluation set of points, are an important means to reduce overfitting [146]. The approach of [59, 131, 132] is algorithmically interesting because it adapts the evaluation effort to the possible success of single parameter values: the evaluation criterion is computationally more intensive for promising candidates and less intensive for weaker candidates.
(5) The only problem is that the construction of appropriate Kohonen maps seems to be mathematically ill-posed.
(6) Linear function approximation means in this context the approximation by a linear sum of specified non-linear functions and not the approximation by piecewise linear functions.
(7) The question arises whether a simple grid soccer topology possesses enough structure to take advantage of it.
4.2 Value Iteration with SL: Convergence Result
[23] provides a phenomenological classification of types of convergence which sounds amusing from a theoretical point of view but reflects the possibilities of “solutions” practitioners have obtained: good convergence (all iterates $V_k$ are represented well, convergence to the exact optimal value function), lucky convergence (not all iterates $V_k$ are represented well, convergence to the exact optimal value function), bad convergence (convergence to a non-optimal value function), and divergence. From a mathematical point of view only the “good convergence” is acceptable and hence a criterion is developed in the following under which this good convergence will occur.
For the sake of completeness the definition of classical value iteration (Definition 2.35) is
repeated:
4.1 Definition (Value Iteration (2P-ZS-MG))
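The following algorithm is called value iteration: select $\varepsilon > 0$, choose an arbitrary initial guess $V_0 \in \mathbb{R}^{|\mathcal{S}|}$ for the (state-) value function, and determine iteratively $V_k = B_{MG} V_{k-1}$ for $k = 1, 2, \ldots$ until $\|V_{k+1} - V_k\|_\infty \le \frac{\varepsilon}{2} \cdot \frac{1-\gamma}{\gamma}$.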
This algorithm would converge to $V^*$ and provide an $\frac{\varepsilon}{2}$-approximation for the value function estimate and an $\varepsilon$-optimal stationary policy (as for MDPs), if one assumes that $B_{MG}$ can be calculated exactly. In Section 2.4.2 the following results are anticipated and applied to the scheme of numerical value iteration, i. e. roughly speaking a numerical error of solving DPs is interpreted as an SL technique. This is possible because for the analysis of the algorithm it does not matter whether the approximation error is introduced by an SL method or by a different numerical technique.
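Value iteration with the stopping criterion of Definition 4.1 can be sketched for a very small 2P-ZS-MG. The example below is an illustration under strong simplifying assumptions: two states, two actions per agent, deterministic transitions, and the closed-form value of 2x2 matrix games in place of a general linear programming solver. The game data and all function names are hypothetical.

```python
def value_2x2(m):
    """Value of a 2x2 zero-sum matrix game (row player maximises)."""
    (a, b), (c, d) = m
    lower = max(min(a, b), min(c, d))  # pure-strategy guarantee of the row player
    upper = min(max(a, c), max(b, d))  # pure-strategy guarantee of the column player
    if lower == upper:                 # saddle point in pure strategies
        return lower
    return (a * d - b * c) / (a + d - b - c)  # mixed solution of a 2x2 game


def value_iteration(reward, successor, gamma, eps):
    """Iterate V_k = B_MG V_{k-1} until the sup-norm change is at most eps/2*(1-gamma)/gamma."""
    v = [0.0] * len(reward)
    threshold = eps / 2 * (1 - gamma) / gamma
    while True:
        v_new = []
        for s in range(len(reward)):
            # Q-values of the stage game in state s (deterministic successor state).
            q = [[reward[s][a][o] + gamma * v[successor[s][a][o]] for o in range(2)]
                 for a in range(2)]
            v_new.append(value_2x2(q))
        change = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if change <= threshold:
            return v


# State 0: matching pennies, always followed by state 1. State 1: a game of value 1 that repeats.
reward = [[[1.0, -1.0], [-1.0, 1.0]], [[3.0, -1.0], [-2.0, 4.0]]]
successor = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
v = value_iteration(reward, successor, gamma=0.9, eps=1e-6)
```

For this game the exact values are $V^*(1) = 1/(1-\gamma) = 10$ and $V^*(0) = \gamma \cdot V^*(1) = 9$, so the returned vector should agree with $(9, 10)$ up to the requested accuracy.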
For an error analysis similar to [19] (compare Remark 4.4), which does not seem to be widely available for 2P-ZS-MGs in the literature(8), the combination of the SL method with the operator $B_{MG}$ is interpreted as a numerical version of the operator and denoted by $\tilde{B}_{MG}$. For an operator $\tilde{A}$ describing the application of an approximation architecture, $\tilde{B}_{MG} = \tilde{A} \circ B_{MG}$, because a value function (here stemming from the previously applied SL technique) is plugged into the Bellman operator and the result is again projected by the approximation architecture. In practice, the operator $\tilde{A}$ cannot get the complete value function $B_{MG} V_k$ as input, meaning that the approximation should be even worse than Lemma 4.2 suggests. However, for the readability of the theoretical result these technical subtleties are omitted.
The assumption of a (maximal) error of $\varepsilon_1$, i. e. $\|\tilde{B}_{MG} V - B_{MG} V\|_\infty \le \varepsilon_1 \|V\|_\infty$ for all $V \in \mathbb{R}^{|\mathcal{S}|}$, leads to the following lemma:
4.2 Lemma (Error of Numerical Value Iteration (2P-ZS-MG))
Let $B_{MG}$ be the Bellman operator and $\tilde{B}_{MG}$ a numerical realisation (e. g. by combination with an SL technique) with $\|\tilde{B}_{MG} V - B_{MG} V\|_\infty \le \varepsilon_1 \|V\|_\infty$.(9) Let further be $\tilde{V}_0 = V_0$, $V_k = (B_{MG})^k V_0$, and $\tilde{V}_k = (\tilde{B}_{MG})^k \tilde{V}_0$ the corresponding $k$-th value iterates. Then
\[ \|\tilde{V}_k - V_k\|_\infty \le \underbrace{\varepsilon_1 \cdot \sum_{i=0}^{k-1} \gamma^{k-i-1} \|\tilde{V}_i\|_\infty}_{= E_V(k)} \le \frac{\varepsilon_1}{1-\gamma} \max_{i=0,\ldots,k-1} \|\tilde{V}_i\|_\infty . \tag{4.3} \]
(8) An exception which generalises Bertsekas’ results to 2P-ZS-MGs is an extended draft version of [181].
(9) $\|\tilde{B}_{MG} V - B_{MG} V\|_\infty \le \varepsilon_1 \|V\|_\infty$ for all $V$ means that the operator $B_1 := \frac{1}{\varepsilon_1}(\tilde{B}_{MG} - B_{MG})$ has operator norm equal to or less than 1: $\|B_1\| = \sup_{V \neq 0} \frac{\|B_1 V\|_\infty}{\|V\|_\infty} \le 1$. Furthermore, the above definition implies $\tilde{B}_{MG} = B_{MG} + \varepsilon_1 B_1$.
Proof: The case $k = 1$ simply reads $\|\tilde{V}_1 - V_1\|_\infty = \|\tilde{B}_{MG} \tilde{V}_0 - B_{MG} V_0\|_\infty \le \varepsilon_1 \|\tilde{V}_0\|_\infty$. It is now assumed that the lemma is true for the value iterates of index $k$. Then
\[
\begin{aligned}
\|\tilde{V}_{k+1} - V_{k+1}\|_\infty &= \|\tilde{B}_{MG} \tilde{V}_k - B_{MG} V_k\|_\infty \\
&\le \|\tilde{B}_{MG} \tilde{V}_k - B_{MG} \tilde{V}_k\|_\infty + \|B_{MG} \tilde{V}_k - B_{MG} V_k\|_\infty \\
&\overset{(*)}{\le} \varepsilon_1 \cdot \|\tilde{V}_k\|_\infty + \gamma \cdot \|\tilde{V}_k - V_k\|_\infty \\
&\le \varepsilon_1 \cdot \|\tilde{V}_k\|_\infty + \gamma \cdot \varepsilon_1 \sum_{i=0}^{k-1} \gamma^{k-i-1} \|\tilde{V}_i\|_\infty \\
&= \varepsilon_1 \sum_{i=0}^{(k+1)-1} \gamma^{(k+1)-i-1} \|\tilde{V}_i\|_\infty .
\end{aligned}
\]
$(*)$ is valid because $B_{MG}$ is a contraction with rate $\gamma$, and the line below uses the induction hypothesis. The second estimate of the lemma follows from the infinite geometric series. $\Box$
Lemma 4.2 provides a bound independent of the value iterates, e. g. if the sequence $(\tilde{V}_k)_k$ is monotonically decreasing and $\tilde{V}_k \ge 0$. If $\tilde{B}_{MG} = B_{MG}$, monotonicity is guaranteed as soon as $V_0 \ge B_{MG} V_0$ (analogously to the MDP case), which e. g. is true if $V_0(s) = \frac{1}{1-\gamma} \max_{(s,a,o) \in \mathcal{SAO}} R(s, a, o)$ for all $s \in \mathcal{S}$, because then the Q-value iterate $Q_1(s, a, o) = R(s, a, o) + \frac{\gamma}{1-\gamma} \cdot \max_{(s,a,o)} R(s, a, o) \le \frac{1}{1-\gamma} \cdot \max_{(s,a,o)} R(s, a, o)$ and Proposition 2.25, 2.) completes the argument.
A reference for the monotonicity result in 2P-ZS-MGs seems to be lacking and therefore the proof is sketched here: The first part is to show that $B_{MG}$ preserves $\ge$ (in the vector sense). This is seen by assuming a vector representing a value function estimate that is greater than or equal to a second one in all components. This implies that the corresponding Q-value functions (Equation 2.49) have the same greater-or-equal property, and the result follows from the monotonicity property of matrix games (Proposition 2.25). The second part is completely analogous to MDPs [168]: Noticing that the monotonicity which holds for $B_{MG}$ also holds for $B_{MG}^i$ for all $i \in \mathbb{N}$ leads directly to $V_{k+1} = B_{MG}^k(B_{MG} V_0) \le B_{MG}^k(V_0) = V_k$.
However, for use in algorithms the tighter bound of the above lemma should be preferred because it can be computed iteratively (as suggested by the inductive proof: starting with $\varepsilon_1 \|\tilde{V}_0\|_\infty$ and assuming that the $k$-th step is already performed, the sum has to be multiplied by $\gamma$ and then $\varepsilon_1 \|\tilde{V}_k\|_\infty$ has to be added for obtaining the result in iteration $k + 1$).
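The iterative computation of the tighter bound $E_V(k)$ can be written as a simple recursion, $E_V(k+1) = \gamma \cdot E_V(k) + \varepsilon_1 \|\tilde{V}_k\|$ with $E_V(0) = 0$. The sketch below compares this recursion with the explicit sum of Equation 4.3 and with the coarser bound; the norms of the iterates, $\varepsilon_1$ and $\gamma$ are made-up illustrative numbers, and the function names are hypothetical.

```python
import math


def error_bounds(iterate_norms, eps1, gamma):
    """Return [E_V(1), ..., E_V(k)] from the norms of the numerical iterates V~_0, ..., V~_{k-1}."""
    bounds, current = [], 0.0
    for norm in iterate_norms:
        current = gamma * current + eps1 * norm  # multiply the old sum by gamma, add the new term
        bounds.append(current)
    return bounds


def error_bound_direct(iterate_norms, eps1, gamma):
    """Evaluate the explicit sum eps1 * sum_i gamma^(k-i-1) * ||V~_i|| for k = len(iterate_norms)."""
    k = len(iterate_norms)
    return eps1 * sum(gamma ** (k - i - 1) * norm for i, norm in enumerate(iterate_norms))


norms = [4.0, 3.0, 2.5]          # hypothetical values of ||V~_0||, ||V~_1||, ||V~_2||
eps1, gamma = 0.01, 0.9
bounds = error_bounds(norms, eps1, gamma)

# The recursion reproduces the explicit sum and stays below the coarse bound of Lemma 4.2.
assert math.isclose(bounds[-1], error_bound_direct(norms, eps1, gamma))
assert bounds[-1] <= eps1 / (1 - gamma) * max(norms)
```

The recursion needs only the previous bound and the norm of the newest iterate, so the bound can be tracked alongside value iteration at negligible cost.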
4.3 Corollary (New Stopping Criterion for Numerical Value Iteration)
If the stopping criterion is changed to $\|\tilde{V}_{k+1} - \tilde{V}_k\|_\infty \le c(\tilde{V}_0, \ldots, \tilde{V}_k)$ with
\[ c(\tilde{V}_0, \ldots, \tilde{V}_k) = \left(\frac{\varepsilon}{2} - E_V(k+1)\right) \cdot \frac{1-\gamma}{\gamma} - \left(E_V(k+1) + E_V(k)\right) \tag{4.4} \]
where the errors $E_V(k)$ depend on the first $k-1$ numerical value iterates (Equation 4.3), and if the numerical approximation $\tilde{B}_{MG}$ is a contraction of rate $\gamma$ (like $B_{MG}$)(10), then the numerical approximation of value iteration also yields results comparable to the original value iteration.
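Proof: Utilising notation of Lemma 4.2 the error of numerical and theoretical value iterates
can be related by
$$\|V_{k+1} - V_k\| \le \|V_{k+1} - \tilde V_{k+1}\| + \|\tilde V_{k+1} - \tilde V_k\| + \|\tilde V_k - V_k\|$$
$$\le \|\tilde V_{k+1} - \tilde V_k\| + E_V(k+1) + E_V(k)$$
$$\le \left(\frac{\varepsilon}{2} - E_V(k+1)\right)\cdot\frac{1-\gamma}{\gamma}.$$
According to classical value iteration $\|V_{k+1} - V^*\| \le \frac{\varepsilon}{2} - E_V(k+1)$, hence
$$\|\tilde V_{k+1} - V^*\| \le \|\tilde V_{k+1} - V_{k+1}\| + \|V_{k+1} - V^*\| \le \frac{\varepsilon}{2}.$$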
In proving the quality of value function approximation it was not necessary to use that
$\tilde B_{MG}$ is a contraction of rate $\gamma$. However, for adopting the proof in [168] that the policy
$\pi_{k+1}$ induced by $\tilde V_{k+1}$ is $\varepsilon$-optimal it is additionally needed that $\tilde B_{MG}$ as well as the (linear)
Bellman operator with fixed policy $\pi_{k+1}$ are contractions with rate $\gamma$. $\Box$
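The threshold of Equation 4.4 can be evaluated alongside the iteration. The sketch below is an illustration, not the thesis implementation; it assumes $E_V(0) = 0$ and that $E_V$ follows the recursion described for Lemma 4.2 (multiply by $\gamma$, then add $\varepsilon_1\|\tilde V_k\|$). All function names and numbers are hypothetical.

```python
def next_error_bound(ev_k, gamma, eps1, norm_vk):
    """Assumed recursion: E_V(k+1) = gamma * E_V(k) + eps1 * ||V~_k||."""
    return gamma * ev_k + eps1 * norm_vk


def stopping_threshold(eps, gamma, ev_next, ev_curr):
    """Right-hand side c of Equation 4.4.

    ev_next is E_V(k+1), ev_curr is E_V(k).
    """
    return (eps / 2 - ev_next) * (1 - gamma) / gamma - (ev_next + ev_curr)


def should_stop(diff_norm, eps, gamma, ev_next, ev_curr):
    """True if ||V~_{k+1} - V~_k|| <= c; a negative c can never be met."""
    return diff_norm <= stopping_threshold(eps, gamma, ev_next, ev_curr)


# Illustrative numbers only (not results from the thesis).
ev1 = next_error_bound(0.0, gamma=0.5, eps1=0.001, norm_vk=10.0)
c = stopping_threshold(eps=0.1, gamma=0.5, ev_next=ev1, ev_curr=0.0)
```

If the threshold becomes negative, the accumulated numerical error already exceeds what the requested accuracy allows, and the criterion cannot be satisfied.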
Interpretation. The considerations above indicate that care should be taken (nothing can
be guaranteed by Corollary 4.3) whenever the above defined numerical error $\varepsilon_1$ is of the
same order of magnitude as the error $\varepsilon$ of value iterates. If $\varepsilon_1 \ll \varepsilon$, $\gamma$ is not too close to 1, and
$B_1 := \frac{1}{\varepsilon_1}(\tilde B_{MG} - B_{MG})$ fulfills a Lipschitz condition, then the solution of 2P-ZS-MGs can
be performed nearly as well numerically as theoretically.
4.4 Remark (Comparison to Results of [19])
For the results in [19] about approximation quality of function approximation it is assumed
that $\|\tilde B_{MG}V - B_{MG}V\| \le \varepsilon_1$, being independent from $\|V\|$. Analogously to Lemma 4.2,
this leads to a (simpler) error bound of
$$\|\tilde V_k - V_k\| \le \varepsilon_1 \cdot \sum_{i=0}^{k-1} \gamma^i \le \frac{\varepsilon_1}{1-\gamma}. \qquad (4.5)$$
This error estimation could be utilised to define a different $E_V(k)$ and to obtain a
corresponding stopping criterion (Corollary 4.3).
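For comparison, the bound of Equation 4.5 does not depend on the iterates themselves. A minimal sketch with illustrative numbers (the function names are assumptions):

```python
def simple_error_bound(eps1, gamma, k):
    """eps1 * sum_{i=0}^{k-1} gamma**i, the k-step bound of Equation 4.5."""
    return eps1 * sum(gamma ** i for i in range(k))


def limit_error_bound(eps1, gamma):
    """Upper bound eps1 / (1 - gamma), valid for all k when 0 <= gamma < 1."""
    return eps1 / (1 - gamma)


bound_3 = simple_error_bound(0.01, 0.5, 3)   # 0.01 * (1 + 0.5 + 0.25)
bound_inf = limit_error_bound(0.01, 0.5)
```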
4.3 Combination of RL and SL: Practical Results from Literature
Although Section 4.2 makes it clear that convergence of RL in combination with SL methods
is hard to guarantee, some encouraging examples are to be presented in the following.
(10)If there exists some nNsuch that for all V, W holds kB1V B1WkkVWkthen e
BMG
is a contraction with rate γ1γ(1 + 1), if γ1<1, and therefore the weaker statement with discount
factor γ1holds.
4.3.1 MDPs
In Section 4.1 the following incomplete list of function approximation methods is given and
repeated here: neural network methods [19, 62, 122, 156, 176, 4, 82], fuzzy logic [16, 105],
cerebellar model articulation controller (CMAC) [2, 3, 83, 198, 201], classifier systems [57],
and local memory-based methods [51, 145, 146]. In addition to these references to special
function approximation architectures, most of which already pointed to combinations with
RL methods, the following reference is to be added: [177] combines neural networks
with some advanced Q-learning algorithms (e. g. Q(λ) [164]).
Space continuous and Time Continuous Systems. Although it is in principle
possible to apply function approximation methods to systems with continuous time (evo-
lution by a differential equation) this approach suffers from the lack of a unique solution
to the Hamilton-Jacobi-Bellman equation [152] if the concept of viscosity solution is not
considered (analogous to differential games, Section 2.5). [35, 56] attack instances of space
continuous and continuous time MDPs by means of the Hamilton-Jacobi-Bellman equation
and function approximation. An exceptional example in which the concept of viscosity so-
lutions is applied to RL algorithms for solving MDPs is [148]. A different approach in
the time-continuous setting is a policy search, i. e. a search by gradients over a priori
parametrised policies [150].
For a multi-agent system with many agents, a decomposition of the value function into a
sum of functions dependent on fewer agents (factored representation) can be helpful in
addition to function approximation ([94, 76, 74, 162], or in combination with a suitable
communication scheme: [79]). However, care has to be taken that even a factored reward and
transition function does not imply a factored value function. The reason is that in each
value iteration step the dependencies on other variables grow until at some point typically
every variable is important for every state.
4.3.2 2P-ZS-MGs
Applications of function approximation in differential games include neural networks for
driver assistance [96] and memory-based and kd-tree based methods for two-player pursuit
evasion games [187]. Much of the work about the combination of discrete 2P-ZS-MGs and
function approximation is done by Lagoudakis and concentrated on least squares policy
iteration (LSPI): General results with an application to Littman’s 1v1 grid soccer [97] and
the utilisation of factored approaches for multiple agents [98] which are related to model
reduction only with respect to actions(11), and an overview with a diversity of examples
[99] are to be mentioned. [98] is quite close to the need of multi-agent robot soccer but
due to the exponential dependence of the state space size on the number of agents the
approach is not directly applicable because it only reduces the exponential dependence of
the joint action space.
(11)The meaning of some formulae can be better understood by additionally considering [77].
Chapter 5
Robot Soccer and Other Applications
Contents
5.1 Modeling Robot Soccer
5.1.1 General Issues of Modeling Robot Soccer
5.1.2 A Simple Multi-Player Robot Soccer Model
5.1.3 Symmetry
5.2 Numerical Results of Grid Soccer
5.2.1 Preliminaries for the Following Subsections
5.2.2 Reasoning for 2P-ZS-MG Modelling: Comparison of MDP and 2P-ZS-MG strategies
5.2.3 Relating Policies to Humanoid Soccer Characteristics
5.2.4 Comparison of DP and RL Techniques
5.2.5 Comparison of Different DP Techniques with Various Parameters
5.2.6 Comparison of Standard Methods and SL Techniques
5.2.7 Towards Multi-Player Robot Soccer: 2v2 Grid Soccer
5.3 A New Algorithm: MaG-Clus-VI
5.4 From Grid Soccer to Robot Soccer: Practical Issues
5.4.1 Lower Level behaviours
5.4.2 Image Processing and Localisation
5.5 Other Applications
The complexity arises
when the principles are reduced to practice.
Robert F. Stengel
(Preface to [195])
The model of multi-player grid soccer is introduced and all important numerical results
with respect to this model are presented in this chapter. The numerical results are broadly
interpreted: different models and solution approaches such as MDP and 2P-ZS-MG models
with or without symmetry reduction as well as DP and RL techniques and the impact
of their parameters are numerically analysed. Furthermore, the resulting strategies are
interpreted in a more soccer oriented fashion by goal rates per time and per team. Before
giving a more detailed overview of the contents with references to single sections some
general comments on literature are to be made: A huge body of literature exists on the
special application of robot soccer e. g. [69, 178, 197, 198] and references therein. For
practical reasons, most of the work uses an MDP framework instead of a 2P-ZS-MG
one, a choice which is at least called into question by the results presented in Section 5.2.
The need for hierarchical structures for soccer with 11 versus 11 agents and the employment
of heuristic ideas can often be observed [197, 215] as well as a simplification of robot soccer
to keepaway soccer [198, 215].
This chapter has the following structure: Section 5.1 summarises general thoughts of mod-
eling the game of robot soccer and provides details of the special model “multi-player grid
soccer” which serves as a basis for all following computations. The model is based on [112]
but goes far beyond it because the dynamics for multiple agents have to be significantly
changed. The numerical results are shown in Section 5.2 which possesses a rich substruc-
ture. Section 5.2.2 provides strong arguments for the choice of 2P-ZS-MG instead of MDP
models in robot soccer, in Section 5.2.4 DP and RL techniques are compared with the re-
sult that exact and model exploiting DP techniques should be preferred whenever possible.
These first evaluative numerical results appear natural but are surprisingly not standard
in literature. DP methods are more intensively studied in Section 5.2.5: the dependency of
convergence speed on different types of methods and parameters is numerically analysed.
This includes comparisons of symmetry reduced models with their unreduced counterparts
which is the practical application of the theoretical results in Chapter 3. A highlight is
the “max-min convergence boosting phenomenon” which only seems to be present in 2P-
ZS-MGs but not in MDPs and reveals unexpected “spontaneous” large error reductions
during value iteration. Results with supervised learning (Section 5.2.6), a new DP algo-
rithm which exploits almost invariant sets of the transition dynamics (Section 5.3), and
general technical issues of applying strategies to real robots (Section 5.4), e. g. to the AIBO
ERS-7, follow.
5.1 Modeling Robot Soccer
Robot soccer can be modeled for different purposes: the spectrum of possibilities ranges
from a “true” continuous physical model including kinematic movements of every joint
and object to a very coarse discrete model in which an elementary action already includes
movement and ball-handling skills. A second distinguishing criterion is the mathematical
range of models which is intertwined with the choice above: the model may be discrete or
continuous (even with different physical degrees of modeling both options stay possible).
Furthermore, the model may or may not include the policies of the opponent, and in the
first case again with different degrees of freedom, e. g. the set of assumed opponent policies
can be arbitrarily restricted.
5.1.1 General Issues of Modeling Robot Soccer
In this subsection some general thoughts which have an impact on every modeling of robot
soccer shall be collected. It seems advantageous not to apply the RL methods directly
with an initial value function $V_0 = 0$ to the robots and simply let the robots learn during
play:(1) some simplifications can be made even if the resulting model differs from reality.
(1)Even if the robot could detect its true state which is often difficult, the number of training steps can
be quite large which leads to a vast amount of training time.
The value function of the optimal policy obtained for the approximate model can be
understood as a very good initial guess $V_0$ for an RL method applied to the real world problem.
This accounts for a general trade-off between model complexity and applicability of a
model. On the one hand, the more precise the model is the better is the approximation for
the “first guess” of the value function which can be improved stepwise by RL methods (a
safe RL method in 2P-ZS-MGs is the WoLF (win or learn fast) method of [21]) to adapt
to a special situation or opponent policy. On the other hand, the more unspecific a model
is the more widely applicable it is, e. g. for different hardware realisation of the robots.
Continuous and Discrete Models, Stochasticity. Apart from the technical difficulties
of continuous game theoretic models (see Section 2.5 for an overview), these are typically
considered to be deterministic. Clearly, introducing stochasticity to the continuous models
would not simplify numerical computations. Thus, one can consider the choice to be only
between a discrete stochastic and a continuous deterministic model.
There are several reasons to prefer a discrete stochastic model for robot soccer. Firstly,
(robot) soccer is a game with stochastic events not only because of unknown parameters
such as the exact physical properties of the subsurface (grass or carpet) and unknown
parameters in the robots’ actuators and sensors but also because the movement of the
ball kicked or hit by a robot is highly unpredictable. Secondly, the stochasticity of the
model is also preferred because the decisions of the robots have to be stochastic. If the
same situation always leads to the same deterministic reaction then the opponent team can
exploit the observed behaviour in the next similar situation: e. g. if the position of a robot
is slightly to the left of its opponent with respect to an appropriate axis this could lead to always
going left around the opponent. Thirdly, the discretisation of the state and action space
automatically leads to some abstraction away from a hardware specific model whereas a
hardware specific model would be more appropriate for a differential game (the kinematics
may be dependent on the robot type).
Symmetries. As mentioned in Chapter 3, respecting the inherent symmetry of a prob-
lem should be an issue. For example, in robot soccer the symmetries depicted in Figure 3.1
should be respected by any robot soccer model which assumes equal robots. One of the
symmetries simply means to change left and right (from the point of view of the robots
heading to a goal). This symmetry is only maintained if the robots possess the same sym-
metry and the abilities of the robots are symmetric. For human soccer, such a symmetry
is clearly not assumable because it is standard to consider human players in the categories
of preferring left, right or both. Only the last category of players fulfills the needs of that
symmetry.
A second symmetry is due to the exchangeability of the two teams which is somewhat
questionable. As remarked above the abilities of two teams can differ even if the robots do
not. However, if one assumes that the two teams are about equally strong the assumption
of exchangeability seems to be sensible if the model is not too detailed. This criterion is
met e. g. by the model described in Section 5.1.2.
A third symmetry which is not depicted in Figure 3.1 is the permutation of robots within
the same team. It relies on the equality of the robots and their abilities which is easily
achievable for equal robots by using equal software. In terms of human soccer this means
that every human player e. g. needs to have the strategical knowledge for every position
(defense, center, attack) which is typically not valid.
Fair Kick-off Positions. A fair kick-off position after every goal is a distinguishing fea-
ture from standard human and standard robot soccer. Thus, it should be briefly discussed
whether or not such a feature is to be employed in a model. First, in human and robot
soccer the rules determine that the team against which a goal was scored gets the ball in
the center of the soccer field (kick-off position). Only at the very beginning does one team
get the ball at random and the other team gets it at the beginning of the second half. From
a game theoretic point of view even tossing coins and playing only one halftime is already fair
(the probability distribution of start states $\xi \in PD(S)$ has a value of $\sum_{s\in S} \xi(s)V(s) = 0$).
However, after tossing the coins the kick-off position typically is considered to be
advantageous for the ball-possessing team which can also be verified for our later model (see
Figure 5.3): e. g. for a 1v1 grid soccer the ball-possessing player state has a value of roughly
0.07 for a max-min optimal value function, i. e. even for a worst-case opponent there is a
higher probability of winning for an optimally acting ball-possessing player.
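The fairness condition can be made concrete with two mirrored kick-off states. The sketch assumes, by the team-exchange symmetry, that the state in which the opponent holds the ball has the negated value of roughly 0.07; the state labels are hypothetical.

```python
def expected_start_value(xi, value):
    """Sum over s of xi(s) * V(s) for a start distribution xi."""
    assert abs(sum(xi.values()) - 1.0) < 1e-12, "xi must be a distribution"
    return sum(p * value[s] for s, p in xi.items())


# Hypothetical labels; 0.07 is the rough 1v1 value quoted in the text,
# the negative value for the mirrored state is an assumption by symmetry.
value = {"team1_has_ball": 0.07, "team2_has_ball": -0.07}

fair = expected_start_value({"team1_has_ball": 0.5, "team2_has_ball": 0.5}, value)
biased = expected_start_value({"team1_has_ball": 1.0, "team2_has_ball": 0.0}, value)
```

With equal weights the expected start value vanishes, whereas a start distribution that always gives the ball to Team 1 keeps the full advantage of about 0.07.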
Nevertheless, if a kick-off position is determined after every scored goal by chance this
reduces the effect of a single random decision independently from the strength of each
team. The standard rules improve the chance to win for a weaker team (against which
probably more goals are scored), a fair kick-off position after every scored goal would
increase the chance for the better team to win because they could randomly get the ball
again even if they scored a goal directly before.
Additionally, the effect of a fair kick-off position after every goal can be interpreted as
reducing the game to the expectation of scoring only the first goal which is sometimes
used in soccer for avoiding a tie (golden goal rule). Time delaying strategies, e. g. keeping
the ball and scoring a goal in the last minute to win and to avoid that the opponent
scores a goal thereafter are neglected in such a model description. Instead, by means of
the discounting factor early scoring is encouraged. For robot soccer, the repeated random
criterion seems to be a sensible approach,(2) all the more because robots can, with the
current battery configurations, play a full game at full power without any recovery
phases, which could be necessary for humans.
Concurrent Actions. Concurrent actions of the two teams seem very natural and
essential in robot soccer. This is noted, however, because large parts of the game playing literature
are concerned with alternating turn games such as chess, checkers [179], or backgammon
[205]. Some solution methods designed for the important class of deterministic alternating
turn games, e. g. game tree search methods [118], are not feasible for stochastic games.
Abilities of the Robots. An important point is that in the following the hardware
of all robots is considered to be equal. Of course, robots can have different abilities by
different software but at least it is assumed that every robot could have the same skills.
Consequences of these assumptions are that each robot is able to walk as fast as any
other robot, has the same skills of ball handling, and the same communicational and
computational performance features. For example, this condition is true for the AIBO
league(3) in which AIBO ERS-7 robots without modifications have to be used, or for
simulation leagues. However, in some leagues, especially if self-constructed robots are
used, these ideal conditions are not given but instead a range of allowed performance
features has to be assumed. Additionally, even for physically equal robots it may depend
on the software of a team to which level the utilised skills deviate from best possible ones.
(2)Even for human soccer it could be interesting to encourage early goals to make games increasingly
fascinating.
(3)The AIBO league is now called standard platform league because a new robot (NAO) has been intro-
duced to the league.
Ball Possession. In general, the ball need not be close to any robot during a game.
However, it can be considered sensible and typical that if no robot can control the ball
directly, some of the nearest robots should try to get the ball. Hence, it is possible
to model the ball so that it is always in the possession of one robot to reduce the number
of states in the soccer game. Furthermore, in a sensible soccer model there have to be
options for dribbling around an opponent and passing to a teammate.
Intra-Team Communication. It is theoretically important whether the team has a
communication structure or not. The reason is that a team with perfect communication
structure can be treated as a single (complex) agent making decisions while any imperfect
communication structure must be treated as if each robot acts as a single agent with
different information on the game. Fortunately, in robot soccer communication among a
team is allowed such that the case of perfect communication can be assumed.
5.1.2 A Simple Multi-Player Robot Soccer Model
In this subsection the general thoughts of Section 5.1.1 are combined with practical issues
to formulate a model of robot soccer which is abstract enough to be independent from
hardware but specific enough to yield a meaningful policy for robot soccer. A main source
of inspiration was [112]. Later on, this model is used for all numerical computations. Its key
aspects are: the model is discrete and stochastic and respects some standard symmetries,
all robots are equal, the ball is always in possession of one robot, and the kick-off positions
are fair after every scored goal, not only over the whole halftime.
Furthermore, the standard order of different parts of actions during a time step is first
dribbling, then moving, and finally passing. In real robot soccer, many basic abilities such
as walking towards a predefined location, handling the ball (dribbling and kicking), and
skills for the analysis of visual information (self and opponent localisation) are needed and
highly non-trivial but assumed to be already available.
Now, the detailed model follows: The model is a 2P-ZS-MG $M = (D, S, SAO, T, R)$ where
the decision epoch $D = \mathbb{N}_0$ is scaled appropriately such that the time of performing one
of the later described actions is approximately 1. In practice, actions can have
different durations but the model is to be kept simple.
State Space. The state space $S \subset \mathbb{N}^{2(n_a+n_o+1)}$ is discrete and consists of two dimensions
for every robot in Team 1 and the opponent team, the numbers of which are $n_a$ and $n_o$,
respectively, and an extra pair of dimensions for the ball.(4) Several robots can share the
same position (i. e. grid box, grid cell) on the field because otherwise blocking an opponent
would be too easy for a multiple robot game. The ball must share the position with some
robot. More specifically, the grid resolution in the two dimensions will often be of the type
(6 × 4), i. e. the position of each robot is $(x_i, y_i) \in \{1, \dots, 6\} \times \{1, \dots, 4\}$, see Figure 5.1.
An abbreviation will be “3v2-game” which means that $n_a = 3$ and $n_o = 2$. In terms of
2P-ZS-MGs each team is considered to be one of the two players of the Markov game which
implicitly means that perfect communication is assumed.
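To get a feeling for the size of this state space, the raw number of position configurations can be counted. The sketch below is an assumption-laden illustration: it counts every assignment of robot cells together with a ball cell that is occupied by at least one robot, and it ignores any special handling of goal or kick-off states. The function name is hypothetical.

```python
from itertools import product


def count_raw_states(n_a, n_o, width=6, height=4):
    """Count (robot positions, ball cell) pairs with the ball on an occupied cell."""
    cells = [(x, y) for x in range(1, width + 1) for y in range(1, height + 1)]
    total = 0
    for positions in product(cells, repeat=n_a + n_o):
        # The ball may sit on any distinct cell that holds at least one robot.
        total += len(set(positions))
    return total


states_1v1 = count_raw_states(1, 1)
```

For 1v1 on the (6 × 4) grid this gives 1128 configurations. Since the enumeration grows like $24^{n_a+n_o}$, the loop is only practical for small teams, which illustrates the exponential dependence of the state space on the number of robots.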
Action Spaces. The action spaces $A(s)$ and $O(s)$ are constructed by the same principles
to enable the “player exchanging symmetries”. From the construction principles it becomes
clear that all symmetries of Figure 3.1 are respected by $SAO$. There are two different main
(4)The software package DRPOST maintains two different options for the ball: the coordinate version
as shown here or the number of the ball-possessing robot which is more compact if the ball coordinates
are limited to robot positions.
Figure 5.1: Discretisation of a soccer field: grid soccer. Dark grey indicates the defended
goal region of the first team, light grey the defended goal region of the opponent team.
types of motion which can be alternatively performed in each time step: a move and a
pass.
1.) Moves. In principle, for every robot of a team a move on the grid with squared
Euclidean distance less than or equal to 1 is allowed, i. e. it is possible to go one box
N(orth), E(ast), W(est), S(outh) or to stand (0). Actions to leave the soccer field and
diagonal moves are not allowed.(5) In contrast to real (robot) soccer, no robot can
leave the soccer field and no fouls occur. Fouls could simply be integrated e. g. by
assigning a foul probability which increases depending on how crowded a grid cell is.
2.) Passes. Passes have only to be added to the action set if the team size is larger than 1
and if the ball-possessing robot is the only robot in its grid box. For a larger number
of robots in the same box, even of the same team, it cannot be guaranteed that the
robots do not (unintentionally) obstruct each other. Furthermore, a pass is to be
distinguished from a kick in the sense that it addresses a team member while a kick could
also be an attempt to score a goal. However, kicks towards a goal are not modeled
because it is assumed that they will automatically occur if a robot enters its opponent
goal region.
Concerning a pass, there exist three parameters in the present model. The first pa-
rameter is called max_kick_distance which limits the maximum range of a kick (Fig-
ure 5.2). This parameter is the only one which influences the size of the action spaces
and is standardly set to max_kick_distance = 5 (squared Euclidean distance) for a
(6 ×4)-grid. The other two parameters only affect the transition function and are de-
scribed there. It is important to note that the position of the pass target is determined
after its intended move.
It can be discussed whether a large kick range makes sense. An observation in robot
soccer yields that for the AIBO league the robots are able to kick across the whole
soccer field. The range is restricted in the model because in reality the reliability of a
pass decreases with its distance.
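The construction of the action sets can be sketched as follows. This is an illustration, not code from DRPOST: the coordinate convention (north decreases y), the function names and the data layout are assumptions, and the pass rule is simplified to the distance test and the "alone in its cell" condition described above.

```python
MOVES = {"0": (0, 0), "N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}


def legal_moves(pos, width=6, height=4):
    """Moves that keep a robot on the grid (squared distance <= 1, no diagonals)."""
    x, y = pos
    return [m for m, (dx, dy) in MOVES.items()
            if 1 <= x + dx <= width and 1 <= y + dy <= height]


def sq_dist(p, q):
    """Squared Euclidean distance between two grid cells."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def pass_targets(passer, mates_after_move, others_in_cell, max_kick_distance=5):
    """Indices of teammates within kick range of a standing passer.

    mates_after_move: teammate positions after their intended moves.
    others_in_cell: number of other robots sharing the passer's cell;
    a pass is only offered if the passer is alone.
    """
    if others_in_cell > 0:
        return []
    return [i for i, p in enumerate(mates_after_move)
            if sq_dist(passer, p) <= max_kick_distance]


corner = legal_moves((1, 1))        # stand, south, east under this convention
centre = legal_moves((3, 2))        # all five moves
targets = pass_targets((2, 2), [(4, 3), (6, 4)], others_in_cell=0)
```

The teammate at (4, 3) has squared distance 5 from the passer and is a valid target, while the one at (6, 4) is out of range. The joint action set of a team is then the product of the per-robot choices, which is why the action space grows exponentially with the team size.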
Transition Function. The transition function has to be specified for every state-action
pair. Since invalid actions such as leaving the soccer field or passing to team members
being too far away are already filtered by the definition of the action spaces they need
not be considered here. Furthermore, the description of the rules again reveals that they
(5)The reason is not that the robots could not move diagonally but that a diagonal move takes more
time and is not completed within one time unit. If desired, diagonal moves could be allowed, which
would dramatically increase the action spaces with the number of robots.
(a) max_kick_distance = 0. (b) max_kick_distance = 1. (c) max_kick_distance = 2. (d) max_kick_distance = 5.
Figure 5.2: Grid soccer: the parameter max_kick_distance rules the maximum distance
of the ball possessor to a team member being a pass target. Several values according
to squared Euclidean distance are illustrated. max_kick_distance = 5 is the standard
configuration of the present robot soccer model. As in Figure 3.1, the labels 1 and 2
indicate the team and the defended goal region, and the small black circle depicts the ball.
respect the symmetries of Figure 3.1 because there are no conditions which are different
for the two teams or for a single robot. The standard order of the different parts of actions is
dribbling, moving, and passing. Dribbling is only reflected in the ball possession and is
not part of the action spaces. Dribbling and passing cannot occur at the same time step:
a passing robot is not allowed to move at the same time step, and the pass target (or a
mistarget) gets the ball after moving.
1.) Moves. All movements of the robots are performed as intended but for the ball
possession there are the following rules: If $n$ is the number of all robots (of both teams)
occupying the cell with the ball then the probability is $\frac{1}{n}$ for each robot to be on the
ball after its movement. In soccer terminology this means that the dribbling phase is
preceding the moving phase and that the dribbling phase is decided in every time step.
It is imaginable that the previous action as well as the current action of each robot
influences the probability of getting the ball but this would violate the Markov prop-
erty. More importantly, the ball possession is not modeled actively: the robot which
is on the ball in the last step has the same chance to keep it as every other robot in
the same grid cell has to steal it. The model could easily be changed to give the ball
possessor a different probability than a stealer but this would need a sound knowledge
of the abilities of the robots.
2.) Passes. The soccer dynamics for the passes are more complicated than the movement
dynamics because it seems to be essential that a pass can be intercepted by a robot
not aimed at by the passing one. The reason is that a pass can be considered as a
risk option: keeping the risk low and not passing moves the ball only a small amount
towards the goal while a pass can move the ball a larger amount with a larger risk of
a miskick (depending on the number of robots staying close to the pass target).
Concretely, three parameters exist in the model to influence the pass characteristics.
The parameter max_kick_distance has been already discussed for the action sets and
the remaining two, namely close_distance and kick_good_prob_mult, control the
risk of a pass. close_distance determines how close an intercepting robot (team
member or opponent) has to be such that it is a possible mistarget of the pass (the
passing robot can also be a mistarget if close enough to the target, i. e. that it gets the
ball again). It is assumed that the robot is not allowed to leave its grid cell towards
the ball but it can get the ball.(6) kick_good_prob_mult determines how many times
the probability is larger for performing the pass to the correct grid box. This means
that each robot in the targeted box has the same (higher) probability not only the
targeted robot. The standard values for all computations (in 1v1 up to 3v3 soccer)
for a (6 ×4)-grid are close_distance = 2 in squared Euclidean distance which is
illustrated in Figure 5.4 and kick_good_prob_mult = 3. Again, note that the target
position and closeness relations of the robots are determined by the positions after their
intended move of that time step.
From a soccer perspective it seems to be sensible that dribbling and moving, only
passing, or moving and receiving a pass are the three basic components of action. In
robot soccer it may occur that teammates also unintentionally steal the ball
because a crowded grid box can easily lead to a very uncontrolled outcome of dribbling
actions. If controllability skills for the ball increase it will be possible to change the
model to one in which the probability of keeping the ball is higher or in which one
team randomly gets the ball and the robots can decide which teammate gets it.
(6)In practice, it is assumed that the robot starts some intercepting behaviour if the ball comes close
enough to its location. However, the interception distance is much smaller than the distance of one grid
cell because the ball moves too fast for a long interception procedure.
3.) Kick-off. At the very beginning as well as after every scored goal a fair kick-off position
(see Section 5.1.1) is initialised. Particularly, the two states in Figure 5.3 are each assigned
a probability of $\frac{1}{2}$. Since they are agent exchanging symmetric (agent = team) they
are fair according to Theorem 3.20. For value iteration it is only important that the
stochastically weighted sum of values for these positions is zero. However, for RL
methods it can make a big difference in the speed of learning if an equal weighting of
the two positions of Figure 5.3 or an equal distribution of all states in $S$ is chosen.
For example, the latter can encourage exploring different regions of the state space
without any exploration strategy in the RL method.
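The ball possession rule of item 1.) amounts to a uniform distribution over the robots in the ball cell. A minimal sketch, with hypothetical robot labels:

```python
from fractions import Fraction


def possession_distribution(robots_in_ball_cell):
    """Each of the n robots sharing the ball cell gets the ball with probability 1/n."""
    n = len(robots_in_ball_cell)
    if n == 0:
        raise ValueError("the ball must share its cell with at least one robot")
    return {r: Fraction(1, n) for r in robots_in_ball_cell}


# Previous possessor "a1" and two opponents trying to steal the ball.
dist = possession_distribution(["a1", "o1", "o2"])
```

The previous possessor `a1` is treated like any other robot in the cell, which reflects that possession is not modeled actively.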
(a) Kick-off state s1. (b) Kick-off state s2.
Figure 5.3: The two kick-off states of multi-player grid soccer, here 3 versus 3 players (3v3
grid soccer). Dark grey indicates the robots and defended goal region of the first team,
light grey the robots and defended goal region of the opponent team. In contrast to other
figures before, the numbers do not indicate the team but the numbers of robots of that
team being in the same grid cell. The small grey circle represents the ball which is not
attached to a special robot in that cell.
In general, due to the nature of the stochastic transition function it is easy to model a
mis-action of any kind by giving a low probability to an action not intended by the player.
In this way, (small) sensor errors can be modeled. Since it is assumed that all robots
have the same degree of errors no team has an advantage over the other. Therefore, it is
assumed that neglecting (small) errors has little impact on the optimal policies.
5.1 Example (Robot Soccer, 5)
A special situation (Figure 5.4) should illustrate the pass mechanism described above. One
of the robots of Team 2 stops and plans a kick to a teammate indicated by a large arrow.
The kick is possible because the robot is not disturbed by a second robot in the same cell
and the distance criterion marked by the dark grey area (partly overlapped by the light grey
area) is satisfied. All robots are depicted after their planned movement of the time period
which is shown by the small arrows. The possible mistargets are all robots in the light
grey area, in this case only two robots of Team 1. Assuming that kick_good_prob_mult = 3,
this results in a probability of $\frac{3}{5}$ for the intended pass and a probability of
$\frac{1}{5}$ for each of the two mistargets. If the upper of the two robots of Team 1 moved
to the target field, then that robot and the intended target would each have a probability
of $\frac{3}{7}$, while the second mistarget would have a probability of $\frac{1}{7}$ of
receiving the pass.
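The arithmetic of this example can be reproduced with a short sketch. The weighting scheme
(weight kick_good_prob_mult for every robot in the targeted cell, weight 1 for every other
possible mistarget, followed by normalisation) is inferred from the numbers of the example;
the function name and its arguments are hypothetical and not taken from DRPOST.

```python
from fractions import Fraction


def pass_probabilities(robots_in_target_cell, other_mistargets, kick_good_prob_mult=3):
    """Return (probability per robot in the targeted cell, probability per other mistarget).

    Assumption: every robot in the targeted cell is weighted by kick_good_prob_mult,
    every other mistarget by 1, and the weights are normalised to sum to one.
    """
    total = kick_good_prob_mult * robots_in_target_cell + other_mistargets
    return Fraction(kick_good_prob_mult, total), Fraction(1, total)


# Intended target alone in its cell, two mistargets of Team 1 nearby.
print(pass_probabilities(1, 2))
# One of the Team 1 robots has moved into the targeted cell.
print(pass_probabilities(2, 1))
```

Exact fractions are used so that the results can be compared directly with the values
quoted in the example.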
Reward Function: Neutral, Defensive, Aggressive Policy. Scoring a goal. The
reward for Team 1 for entering the opponent's goal region is $1$ if the ball is also in that field
Figure 5.4: Grid soccer: the parameter max_kick_distance = 5 limits the distance between
the ball possessor and a team member that can be a pass target (dark grey boxes, partly
hidden by light grey boxes), and with close_distance = 2 the possible mistargets are all
other robots in the light grey boxes. The targeted grid box for the pass is indicated
by the big arrow. Small arrows indicate the performed moves (no arrow means the action
“stand”) during the time period. As in Figure 3.1, the labels 1 and 2 indicate the team and
the defended goal region, and the small black circle depicts the ball (see also Example 5.1).
(no matter which robot took the ball to that field) and if, additionally, at least as many
offenders as defenders of the opponent team are in the same one of the two cells of the goal
region (Figure 5.1). Otherwise it is assumed that the defenders successfully defend their
goal (region) and that a kick towards the goal is blocked. The corresponding situation under
exchange of the teams, in which Team 2 enters the goal region of Team 1 with the ball and
faces a less or equal number of defenders of Team 1, is rewarded by $-1$ (scored goal
against Team 1).
All other cases. The reward function simply needs to be $0$ if no goal is scored because no
team should gain an advantage by doing nothing. As discussed in Section 5.1.1 under “fair
kick-off positions”, the present score of a game is neglected in the model, which
considerably reduces the number of states. This shortcoming could be mitigated in real robot
soccer by introducing three types of policies (neutral, defensive, and aggressive), which are
employed if the score difference is zero, positive, or negative, respectively. The three
policies can be calculated (by sacrificing the team exchange symmetry in the two non-neutral
cases) with three different reward functions: scoring a goal is artificially ranked equal to
(as described above), lower than (smaller reward), or higher than (larger reward) letting
the opponent score a goal. If not stated otherwise, equal ranking is assumed.
5.1.3 Symmetry
The multi-player grid soccer model of Section 5.1.2 is constructed in concordance with the
general three symmetries mentioned in Example 3.22 which should be respected by any
model of robot soccer:
1.) permutations of the players in the same team,
2.) reflection at the goal-to-goal line ($g_x$ in Figure 3.1), and
3.) reflection at the mid-line and exchange of teams ($g_y$ in Figure 3.1).
This is best illustrated by “a discretised version” of Figure 3.1, which is depicted in
Figure 5.5.
[Figure 5.5, panels: (a) State $s$. (b) State $g_y(s)$. (c) State $g_x(s)$.
(d) State $(g_x g_y)(s) = (g_y g_x)(s)$.]
Figure 5.5: Symmetries in grid soccer (discretisation of the states in Figure 3.1): a
standard situation (state) $s$ and its symmetric states $g_y(s)$ with exchange of the two
teams, $g_x(s)$ without the exchange of teams, and the combination of both:
$(g_x g_y)(s) = (g_y g_x)(s)$. The small grey circle depicts the ball.
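The three symmetries can be sketched on a simple state encoding. Everything about the
encoding is an assumption made for illustration: a state is a triple (cells of Team 1, cells
of Team 2, ball), the ball is stored as (owning team 0 or 1, cell), cells are (column, row)
pairs on the $6 \times 4$ grid, and the goals are assumed to lie at the left and right ends,
so that $g_x$ flips rows and $g_y$ mirrors columns. The function names are hypothetical and
not taken from DRPOST.

```python
COLS, ROWS = 6, 4  # grid size used in the text


def normalise(team1, team2, ball):
    """Sort the robots within each team (symmetry 1: permutations within a team)."""
    return (tuple(sorted(team1)), tuple(sorted(team2)), ball)


def flip_row(cell):
    col, row = cell
    return (col, ROWS + 1 - row)


def mirror_col(cell):
    col, row = cell
    return (COLS + 1 - col, row)


def g_x(state):
    """Symmetry 2: reflection at the goal-to-goal line, teams keep their roles."""
    team1, team2, (owner, cell) = state
    return normalise(map(flip_row, team1), map(flip_row, team2), (owner, flip_row(cell)))


def g_y(state):
    """Symmetry 3: reflection at the mid-line combined with an exchange of the teams."""
    team1, team2, (owner, cell) = state
    return normalise(map(mirror_col, team2), map(mirror_col, team1),
                     (1 - owner, mirror_col(cell)))


def canonical(state):
    """Smallest element of the orbit; usable as the key of a reduced state table."""
    s = normalise(*state)
    return min(s, g_x(s), g_y(s), g_x(g_y(s)))


# 1v1 example: Team 1 robot with the ball at (2, 1), Team 2 robot at (5, 3).
s = (((2, 1),), ((5, 3),), (0, (2, 1)))
print(g_x(s), g_y(s), canonical(s))
```

Mapping every state to such a canonical representative is one possible way to build the
reduced state space; for games with unequal team sizes only $g_x$ and the permutations
would apply.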
When these symmetries are utilised for model reduction, the effect on the size of the state
(and hence state-action) space can be enormous (Table 5.1). In multi-agent grid soccer, some
symmetries help to reduce the state space(7) by a constant factor (reflections), while others
(robot permutations) are of increasing usefulness with a growing number of robots. Note
that the third symmetry is particular to 2P-ZS-MGs and can only be applied to a soccer
game with equally many robots in each team (an $x$v$x$ game). This explains a qualitative
difference between the reduction factors a/b in Table 5.1, which are roughly
$2 \cdot 2 \cdot n_a! \cdot n_o!$ in these special cases and $2 \cdot n_a! \cdot n_o!$
otherwise. Applying these formulae to a grid size of $6 \times 4$, the savings in a 6v6 grid
soccer game would theoretically be of the order of $10^6$ and in the 11v11 version of about
$10^{15}$ (yet, of course, even the reduced state space is still far too large).
Game        Standard (a)   Reduced (b)     a/b   Sym. 1   Sym. 2   Sym. 3
1v1                 1152           282     4.1      1.0        2        2
1v2, 2v1           41472         10244     4.0      2.0        2        1
2v2              1327104         82944    16.0      4.0        2        2
3v2, 2v3        39813120       1742400    22.8     11.4        2        1
3v3           1146617856       8820000   130.0     32.5        2        2
Table 5.1: Number of states for different multi-player soccer games on a $6 \times 4$ grid
in a standard and a symmetry-reduced form. $n_a$v$n_o$ denotes a game with team size $n_a$
for Team 1 and team size $n_o$ for Team 2. While a/b denotes the total reduction factor by
all symmetries, the last three columns clarify the effect of each single symmetry, numbered
as in the enumeration at the beginning of Section 5.1.3.
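The standard state counts of Table 5.1 and the rough reduction factors quoted in the text
can be checked with a few lines. This is only a sketch of the two formulae
$|S| = n \cdot (6 \cdot 4)^n$ and $2 \cdot (2) \cdot n_a! \cdot n_o!$; the function names are
hypothetical. The rough factor overestimates the reduction actually achieved (for 3v3 it
gives 144 against 130.0 in the table), which is plausible because states with several
robots of one team in the same cell are left unchanged by some permutations.

```python
from math import factorial


def num_states(n_a, n_o, cols=6, rows=4):
    """Standard state count: ball assigned to one of n robots, each robot in one cell."""
    n = n_a + n_o
    return n * (cols * rows) ** n


def rough_reduction_factor(n_a, n_o):
    """Rough factor 2 * (2) * n_a! * n_o!; the team exchange only applies if n_a == n_o."""
    team_exchange = 2 if n_a == n_o else 1
    return 2 * team_exchange * factorial(n_a) * factorial(n_o)


for n_a, n_o in [(1, 1), (1, 2), (2, 2), (3, 2), (3, 3), (6, 6), (11, 11)]:
    print(f"{n_a}v{n_o}: |S| = {num_states(n_a, n_o)}, "
          f"rough factor = {rough_reduction_factor(n_a, n_o)}")
```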
5.2 Numerical Results of Grid Soccer
The numerical results form an important part of the present work. They are organised as
comparative studies including but also going far beyond the application of the theoretical
results of Section 3.2. In the small amount of literature on 2P-ZS-MGs, only comparisons of
different max-min RL methods seem to be available, whereas the key model design questions of
whether to use max-min or max methods and whether DP or RL methods give higher
performance are typically left unanswered.
As a first answer, it is clear that DP methods should be more effective than RL methods
because knowledge of the model is only available to the first class of algorithms.
Therefore, the argument for RL methods is often not effectiveness but adaptiveness to an
unknown model (partly unknown or slowly changing environments). However, the studies in
Sections 5.2.2 and 5.2.4 show, and it is also known to the RL community, that applying the
theoretical convergence results of RL methods to practical problems leads to uncomfortably
large numbers of learning steps, which are often unmanageable in practice.
Many methods have been designed to speed up learning: some of them target the update rules
of the algorithms (such as multi-step updating or multi-state updates (function
approximation)), others use external knowledge directly (such as imitating humans or
experts made by humans) or indirectly (such as hierarchical models with a hierarchy
specified by humans). However, to the knowledge of the author, the ultimate goal of applying
learning theory to design a method which is suitable for most models and learns completely
automatically has not yet been reached.
(7) The size of the state space of an $n_a$v$n_o$ game with a total of $n = n_a + n_o$ robots
is $|S| = n \cdot (6 \cdot 4)^n$ if the ball is uniquely assigned to one of the $n$ robots
and the grid size is $6 \times 4$.
The author tries to make a small contribution by incorporating DP methods, which are
neglected by many RL researchers or only used to compute the exact values of policies in
order to compare final results on small models. One aim of this work is to argue that using
DP methods with an imperfect but simple model to construct a first-guess approximation of
the value function and then starting RL methods with this initial guess is an appropriate
methodology. It is also conceivable to exploit the fact that RL methods do not need to cover
the whole state space $S$: an RL trajectory could be simulated and an initial guess then
constructed by DP methods only on the states of this trajectory and on suitable additional
states.
5.2.1 Preliminaries for the Following Subsections
The numerical results presented in Section 5.2 are computed by a software package called
DRPOST (Discrete Robust Probabilistic Optimal Strategy Tool), which was implemented
by the author as a bundle of Matlab routines. These routines contain a model-generating
collection (for state and action spaces, transition and reward functions) for a general
$n$-player grid soccer model as described in Section 5.1, all considered DP and RL methods
for use with MDPs and 2P-ZS-MGs, and a large number of other functions which are needed,
e. g. for function approximation, symmetry reduction, and for solving matrix games (see
Appendix C).
Most of the following comparative studies are performed with a 1v1 soccer model on a
$6 \times 4$ grid field unless they are explicitly intended to compare a single-agent with a
multi-agent team grid soccer model or to compare different sizes of the field. The reason is
that the computation of the optimal value functions and strategies for a 2v2 grid soccer
model on a small grid or a 1v1 grid soccer model on a larger grid needs too much time to
compare large varieties of parameters. Additionally, many basic effects can already be
observed in the small model.
Standard parameters for the algorithms are a discount factor $\gamma = 0.9$ and an accuracy
of $0.5 \cdot 10^{-3}$ for value functions in value iteration, implying an accuracy of
$\varepsilon = 1 \cdot 10^{-3}$ for the policy (Corollary 4.3). For RL methods a decayed
learning rate $\alpha_n(s, a) = \frac{1}{n}$ (for each $s \in S$), which meets the standard
convergence criteria, is used, and the random exploration rate of the employed
$\varepsilon_1$-greedy policy is $\varepsilon_1 = 0.2$. Since Q-learning is an off-policy
method, the learned strategy is greedy independently of the policy followed during the
learning phase. Furthermore, the standard number of learning steps is 100 (times $|S|$),
which is higher than any number of steps needed for convergence in DP methods. This number
was chosen on the basis of the results of Figure 5.6.
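How these standard parameters enter a tabular Q-learning step can be sketched as follows.
The sketch covers only the single-agent (max) update and is not the DRPOST implementation;
it assumes that $n$ in $\alpha_n(s, a) = \frac{1}{n}$ counts the updates of the pair
$(s, a)$, and all function and variable names are hypothetical.

```python
import random
from collections import defaultdict

GAMMA = 0.9      # discount factor from the text
EPSILON_1 = 0.2  # random exploration rate from the text


def epsilon_greedy(q, state, actions, rng, epsilon=EPSILON_1):
    """Behaviour policy: random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q[(state, a)] for a in actions)
    return rng.choice([a for a in actions if q[(state, a)] == best])


def q_update(q, visits, state, action, reward, next_state, actions, gamma=GAMMA):
    """One Q-learning update with the decayed learning rate 1 / (number of updates)."""
    visits[(state, action)] += 1
    alpha = 1.0 / visits[(state, action)]  # assumption: n counts updates of (s, a)
    target = reward + gamma * max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (target - q[(state, action)])
    return alpha


q = defaultdict(float)     # Q-values, initialised to zero
visits = defaultdict(int)  # update counters per state-action pair
q_update(q, visits, "s", "a", 1.0, "t", ["a", "b"])
print(q[("s", "a")], epsilon_greedy(q, "s", ["a", "b"], random.Random(0)))
```

Because the update target uses the maximum over the actions in the next state, the learned
values do not depend on the exploring behaviour policy, which is the off-policy property
mentioned above.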
Nomenclature. In the following, “team 1”, “player P1”, or simply “the player (team)”
is considered to be the robot soccer team which is controllable, e. g. by optimal policies
computed by DP or RL methods. “team 2”, “player P2”, or “the opponent (team)” is the
collection of agents which are trying to score against team 1. In some of the comparisons,
one player is considered to be omniscient, i. e. the strongest possible or worst-case
opponent that already knows the policy of team 1; in other cases it is assumed that neither
team knows the policy of the other team. To make the DP operators part of the description,
the DP and RL methods for classical MDPs are called max methods and those for 2P-ZS-MGs
max-min methods, according to the Bellman operator.(8)
There is also a nomenclature introduced to describe the policies efficiently without the need
for giving action probability distributions for every state which would be hard to interpret.
Table 5.2 shows relevant abbreviations (where the max method yields an element of the
game theoretic best response to the other team’s strategy) and the description of the table
makes the reader familiar with the notation of e. g. MMVI(R).
Abbreviation Meaning
M strategy determined by max method
MM strategy determined by max-min method
QL strategy determined by Q-learning method
R random strategy
VI strategy determined by value iteration method
Table 5.2: Different abbreviations for special policies. Example: MMVI(R) means a policy
that is determined by a max-min value iteration method against a random opponent. This
is the only example in which the policy of the opponent does not influence the algorithm
for determining the policy. The opponent’s policy is crucial especially in all max methods
and also in all RL methods.
Evaluation of Policies. In general, the quality of a policy $\pi_1$ for Team 1 is quantified
in a simple and precise way by its value $V(\xi, \pi_1, \pi_2)$, which depends on an initial
probability distribution $\xi$ of start states(9) and on the two policies $\pi_i$ of the
teams $i = 1, 2$. In the following, $V(\xi, \pi_1, \pi_2)$ is often abbreviated by
$V(\pi_1, \pi_2)$, and it is then assumed that $\xi$ is the standard distribution of start
states. If $\pi_2$ is also omitted, it is considered to be a best response to $\pi_1$, i. e.
$\pi_2 \in BR_2(\pi_1)$. Typically, the value of policies is determined to an accuracy of
$\varepsilon = 1 \cdot 10^{-3}$.
Since the value of a policy may be hard to interpret in robot soccer, additional
characteristics of the policies extracted from long-term simulations (many kick-off
positions) are given: the total number of goals scored by both teams divided by the number
of time steps, as well as the share of goals scored by team 1. The first characteristic is
intended to show how offensive or defensive the combination of policies is, and the second
one shows how successful each team is in comparison to the other. For example, if both
characteristics are high, then Team 1 probably has a successful offensive strategy while
Team 2 has an unsuccessful defensive or offensive strategy. If the number of total goals is
low but the success rate for Team 1 is high, then Team 1 has a successful defensive strategy
and Team 2 an unsuccessful defensive or offensive strategy. One problem remains: the
characteristics may say little about the strategy of an unsuccessful team or about the
strategies of two equally good teams. Therefore, in Tables 5.9 and D.12 each policy is
evaluated in a simulation against a random opponent and against itself to obtain a measure
of how offensive or defensive each policy is (the total number of scored goals may indicate
this). In general, however, some care has to be taken about the precision of the simulation:
although $1 \cdot 10^6$ simulation steps are performed for every policy, the goal statistics
are the more accurate the more total goals are scored during the simulation. As a
conservative rule of thumb, absolute differences of up to $\pm 0.05$ in the goal rate of
player $P_1$ should not be interpreted.
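The two simulation characteristics, and the reason why their precision depends on the
number of scored goals, can be sketched as follows. The standard error is a binomial
approximation that treats goals as independent events; this is an assumption of the sketch
and not a statement of the thesis. The function name and the goal counts in the example are
hypothetical.

```python
from math import sqrt


def goal_statistics(goals_team1, goals_team2, time_steps):
    """Return (g_t, g_1, rough standard error of g_1) for one long simulation."""
    total = goals_team1 + goals_team2
    g_t = total / time_steps               # total goal rate per time step
    if total == 0:
        return g_t, None, None             # share of goals undefined without goals
    g_1 = goals_team1 / total              # share of goals scored by team 1
    std_err = sqrt(g_1 * (1.0 - g_1) / total)  # binomial approximation (assumption)
    return g_t, g_1, std_err


# Hypothetical counts for 10^6 simulated time steps with few goals overall.
print(goal_statistics(2625, 2375, 1_000_000))
```

With only a few thousand goals in a million steps the rough standard error of the goal
share is already close to 0.01, so a cautious margin such as the $\pm 0.05$ mentioned above
appears reasonable for low-scoring policy pairs.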
(8) The software DRPOST also provides the option to choose the type of Team 1 by similar
declarations.
(9) $\xi$ is fixed in robot soccer to the distribution which assigns 50% probability to each
of the two kick-off states of Figure 5.3.
5.2.2 Reasoning for 2P-ZS-MG Modelling: Comparison of MDP and
2P-ZS-MG strategies
The first subsections of the numerical results serve to justify the author's choice of model
and methods. The two basic statements are: for modelling it is preferable to use 2P-ZS-MGs
instead of MDPs, and for computing optimal strategies DP methods are important to
initialise RL methods whenever this is possible. The most obvious theoretical argument
for the first statement is that for an MDP a deterministic optimal policy always exists.
This is typically the policy determined by DP and RL algorithms, and it is a best answer
only to a fixed opponent policy. In contrast, the 2P-ZS-MG model a priori assumes a
worst-case, i. e. tactically strongest, opponent which already knows the policy of the
player. This has two advantages: first, because the computed policy of the player is safe
against any strategy of the opponent, it is sufficient to compute only one optimal policy
and not one against each of the infinitely many possible opponent policies, and, second, it
is not necessary to know the opponent policy.(10) In the opinion of the author the second
advantage is more important because the first one can be put into perspective by arguing
that a strong policy could be strong against a whole set of opponent policies.
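The weakness of a deterministic policy against an opponent who knows it can be illustrated
on a single matrix game. The sketch uses matching pennies as a stand-in, which is not part
of the grid soccer model; the function name is hypothetical. The security level of a policy
is its expected payoff against a best-responding opponent, and it suffices to minimise over
the opponent's pure actions because the expected payoff is linear in the opponent's mixed
strategy.

```python
def security_level(payoff, row_policy):
    """Expected payoff of the row player against a best-responding column player."""
    n_cols = len(payoff[0])
    column_values = [
        sum(p * payoff[i][j] for i, p in enumerate(row_policy))
        for j in range(n_cols)
    ]
    return min(column_values)


# Payoffs of the (maximising) row player in matching pennies, used as a toy example.
MATCHING_PENNIES = [[1.0, -1.0], [-1.0, 1.0]]

print(security_level(MATCHING_PENNIES, [1.0, 0.0]))  # deterministic policy
print(security_level(MATCHING_PENNIES, [0.5, 0.5]))  # uniformly mixed policy
```

The deterministic policy is fully exploited, whereas the mixed policy guarantees the value
of the game; a max-min method in a 2P-ZS-MG solves such a matrix game in every state.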
A First Comparison
Tables 5.3 and 5.4 show that a best response policy ($\pi_1$ by an RL and $\pi_2$ by a DP
method) against a randomly acting opponent is easily outperformed by its own best response
($\pi_3$, $\pi_4$, and $\pi_5$).(11) This is not surprising because of the mainly
deterministic nature of the max-optimal policy: only exact ties of the Q-values lead to
randomised actions. Another observation is that with a higher number of learning steps RL
methods in principle do as well as DP methods ($\pi_3$ is as strong against $\pi_1$ as
$\pi_4$ because the values are equal), indicating that the inherent problem lies in the
non-stochasticity of the policy. This can be seen at first glance in Table 5.3 by noticing
that the theoretically best policy $\pi_2$ against a random opponent is also easily
outperformed by the worst-case opponent strategy $\pi_5$, which is reflected in the high
value $V(\pi_5, \pi_2)$. The fact that the value $V(\pi_5, \pi_2)$ is equal to
$V(\pi_4, \pi_1)$ shows that “bad learning”(12) of the RL method is not the reason for bad
performance against a worst-case opponent.
To conclude the first comparison, the analogous tables for max-min methods are presented,
which show that the max-min optimal policy cannot be exploited, as expected, even by
a worst-case opponent knowing this policy in advance.
A comparison of the small $6 \times 4$ soccer field with the medium-size $12 \times 8$ one
makes it obvious that RL methods can degrade with the number of states if the number
of training steps is kept linearly related to the state space size $|S|$. Max-min learning
methods are particularly affected: with the 100 (times the size of the state space) learning
steps the max RL methods approximate the optimal policy quite well (all values are equal
in Table 5.4); for the max-min strategy, however, the performance against a worst-case
opponent degrades significantly. The difference between $V(\pi_4, \pi_1)$ and
$V(\pi_5, \pi_2)$ of $0.074$ in value does not seem too dramatic at first glance, but the
fact that the goal rate of the
(10) An RL method can adapt to a stationary (temporarily fixed) policy of other agents, but
problems occur if all agents are learning, i. e. changing their policies over time.
(11) Best responses are max-optimal policies and, in the case of RL methods (MQL(·)), only a
more or less good approximation of a true best response MVI(·).
(12) The RL policy is typically initialised as R(andom) such that for states never reached
by the finite simulation the random action selection is kept.
                           $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MVI($\pi_2$)
$\pi_1$ = MQL(R)   $V$     0.415                    0.415
                   $g_t$   0.080                    0.080
                   $g_1$   0.735                    0.731
$\pi_2$ = MVI(R)   $V$                                                       0.415
                   $g_t$                                                     0.080
                   $g_1$                                                     0.730
Table 5.3: Robustness of max-policies against worst-case opponents ($6 \times 4$ soccer
field). The column policies are those of player $P_1$ and the row policies those of player
$P_2$. The value $V$ and the relative amount of goals $g_1$ are from the view of $P_1$ (as
always), whereas the total goal rate per time step $g_t$ relates the sum of goals of both
players to the number of simulated time steps.
                           $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MVI($\pi_2$)
$\pi_1$ = MQL(R)   $V$     0.145                    0.145
                   $g_t$   0.030                    0.030
                   $g_1$   0.719                    0.723
$\pi_2$ = MVI(R)   $V$                                                       0.145
                   $g_t$                                                     0.029
                   $g_1$                                                     0.725
Table 5.4: Robustness of max-policies against worst-case opponents ($12 \times 8$ soccer
field). The explanation of how to read the table is as in Table 5.3.
opponent increases from the best possible 50% to over 90% should give the correct impression
(Table 5.6).
                            $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MVI($\pi_2$)
$\pi_1$ = MMQL(R)   $V$     0.010                    0.010
                    $g_t$   0.070                    0.071
                    $g_1$   0.509                    0.505
$\pi_2$ = MMVI(R)   $V$                                                       0.000
                    $g_t$                                                     0.070
                    $g_1$                                                     0.502
Table 5.5: Robustness of max-min-policies against worst-case opponents ($6 \times 4$ soccer
field). The explanation of how to read the table is as in Table 5.3.
                            $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MVI($\pi_2$)
$\pi_1$ = MMQL(R)   $V$     0.074                    0.076
                    $g_t$   0.007                    0.008
                    $g_1$   0.979                    0.921
$\pi_2$ = MMVI(R)   $V$                                                       0.000
                    $g_t$                                                     0.023
                    $g_1$                                                     0.499
Table 5.6: Robustness of max-min-policies against worst-case opponents ($12 \times 8$ soccer
field). The explanation of how to read the table is as in Table 5.3.
Training Against Better Opponents
To provide additional support for the above hypothesis that the non-stochasticity of optimal
policies for MDPs is the reason for the failure of these policies against a learning
opponent, better training partners are chosen and the effect is studied. In the first
comparison the policies $\pi_1, \pi_2$ of Table 5.3 are trained against a randomly acting
opponent, which can be considered to be very weak: only an opponent helping the player could
be weaker. Therefore, in the following comparison in Tables 5.7 and D.10 the policies
$\pi_1, \pi_2$ are trained against much better initial opponent policies, which in fact are
$\pi_1$ and $\pi_2$ of Table 5.3. Comparing Tables 5.3 and 5.7 shows that the better initial
training policy increases the exploitability by a best response strategy. Although this can
be expected given the lack of stochasticity of the better training partner, it calls into
question efforts to present a strong policy to an MDP learner while ignoring the 2P-ZS-MG
nature of robot soccer.
Exploitability of Non-Optimal Opponents
The most reasonable counterargument against using max-min methods is that they do
not fully exploit weaknesses of the opponent's policy. This is correct for non-optimal and
particularly for very weak opponents, as can be seen in the case of exploiting a randomly
acting opponent on a $6 \times 4$ soccer field in 1v1 soccer: the value of the start states
for the max policy is much higher than that for the max-min policy
($V(\xi_{\mathrm{start}}, \mathrm{MVI}(R), R) \approx 0.688 >$
                                $\pi_3$ = MQL($\pi_1$)   $\pi_4$ = MVI($\pi_1$)   $\pi_5$ = MQL($\pi_2$)   $\pi_6$ = MVI($\pi_2$)
$\pi_1$ = MQL(MQL(R))   $V$     0.539                    0.539
                        $g_t$   0.065                    0.064
                        $g_1$   0.876                    0.877
$\pi_2$ = MQL(MVI(R))   $V$                                                       0.506                    0.506
                        $g_t$                                                     0.058                    0.058
                        $g_1$                                                     0.891                    0.893
Table 5.7: Robustness of max-policies against worst-case opponents ($6 \times 4$ soccer
field) with better initial training partners. The explanation of how to read the table is as
in Table 5.3.
$0.483 \approx V(\xi_{\mathrm{start}}, \mathrm{MMVI}(R), R)$, Table 5.9). However, max-min
policies fully exploit suboptimal policies of the opponent under the constraint of staying
safe, in the sense that max-min methods assume that after the observed non-optimal action
the opponent will again act optimally. This prevents the player from being tricked, i. e.
from the opponent intentionally behaving in a way that creates a misleading conclusion about
its policy.(13)
Exploitability of Optimal Max-min Opponents
In addition to the question of exploiting a suboptimal opponent, it is interesting to ask
whether a max-min policy can be better exploited by a max policy than by a max-min
policy. The related theoretical question is whether the max-min policy is already a best
response to a max-min policy. In contrast to the case of non-optimal strategies, the answer
is positive [186]. In Tables 5.8 and D.11 this (or, more appropriately, the software DRPOST)
is verified by the fact that both values for $\pi_2$ = MVI($\pi_1$) and
$\pi_3$ = MMVI(R) = MMVI($\pi_1$) are equal (necessarily equal to $0$).
                            $\pi_2$ = MVI($\pi_1$)   $\pi_3$ = MMVI(R)
$\pi_1$ = MMVI(R)   $V$     0.000                    0.000
                    $g_t$   0.070                    0.071
                    $g_1$   0.502                    0.503
Table 5.8: Exploitability of optimal max-min opponents ($6 \times 4$ soccer field). The
explanation of how to read the table is as in Table 5.3.
5.2.3 Relating Policies to Humanoid Soccer Characteristics
It is not trivial to obtain qualitative heuristic results beyond the abstract numerical value
of a policy or a pair of policies because successful policies can have different characters.
A standard distinction in human soccer is e. g. between defensive and offensive strategies.
A practical way to determine the offensiveness and defensiveness of a grid soccer policy
is the following two step method: first, evaluate the policy against a random opponent to
test its offensiveness (for a very weak opponent a defensive behaviour does not contribute
(13) Although it is often natural and plausible to assume that the opponent will repeat
non-optimal behaviour, this cannot be relied upon.
to its success) and, second, evaluate the policy against itself. One can then try to estimate
the defensive quality from the reduction of scored goals in comparison to the weak opponent.
As an example, some results of Tables 5.9 and D.12 are evaluated. A look at the first column
could lead to some confusion because, to stay consistent and keep the page layout, the
values $V$ are given from the point of view of the column policy. To change the viewpoint,
the negative signs have to be made positive. Then it becomes directly clear that the random
policy is by far the worst and that $\pi_3$ is strongest against the random policy. Its
offensiveness can be seen from the high goal rate per time step $g_t$ and its very high
share of own goals ($1 - 0.016 = 98.4\%$, Table 5.9). The RL analogue $\pi_2$ is similarly
strong, which indicates that the learning phase was sufficiently long.
The max-min optimal strategy $\pi_5$ seems to be weaker because of its safety aspects but
outperforms the random opponent considerably. This means that it clearly exploits the
weaknesses of its opponent, although in a safe way, which disproves the argument that
max-min strategies do not exploit weaker opponents. The defensiveness of the max-min
strategy can also be seen in comparison to the max strategy in the second column by noticing
that the goals per time step are fewer. Remarkably, the max-min RL policy $\pi_4$ is
stronger against the random opponent than the policy of the DP method. This could have two
reasons: the first is that the learning is not completed and the safety is not optimal (this
is not the case here), and the second is that non-unique Nash equilibria exist, which was
verified by the author at least for single states $s \in S$.(14) The Nash equilibrium
(optimal) policies are all equally strong against a worst-case opponent (Theorem 2.20);
however, they may be, and obviously are, of different strength against the non-optimal
random opponent.
A last highlighted result, which again provides an argument against using MDP models for
robot soccer, is the evaluation of $\pi_7$. Although the max strategy $\pi_7$ is trained
against $\pi_3$, which is the strongest max opponent for the random policy, it is relatively
weak against the random opponent. This again indicates that max learning only adapts to the
single opponent to which it is an approximate best response. In contrast, the quality of
max-min solutions is independent of the training partner, which can only influence the
update order in learning and not the final policy.
5.2.4 Comparison of DP and RL Techniques
This subsection gives a rough impression of how much more effective DP methods are in
comparison to RL methods. The comparison is only depicted for 2P-ZS-MGs (max-min methods),
but the result that DP methods are more effective is also expected for classical MDPs (max
methods). A novelty is the inclusion of the game-theoretic safety measure: the computed
policy is not only evaluated against its standard opponent (standard value evaluation) but
also against a worst-case opponent (security level). Figure 5.6 comprises all details for a
1v1 grid soccer model with MDP-typical symmetry reduction (all symmetries which do not
exchange the two players are reduced). The DP method converges to a reasonable policy much
faster (after 6 steps) than the RL method (after about 24 steps). Concerning the security
level, the DP method also needs far fewer steps (17 in comparison to 70 for the RL method)
to achieve a value close to $0$.
(14) The non-uniqueness of matrix game policies of single states $s \in S$ implies the
non-uniqueness of the total policy.
                                $\pi_1$ = R   equal
$\pi_1$ = R             $V$      0.000         0.000
                        $g_t$    0.005         0.005
                        $g_1$    0.525         0.495
$\pi_2$ = MQL(R)        $V$     -0.682         0.000
                        $g_t$    0.063         0.092
                        $g_1$    0.015         0.499
$\pi_3$ = MVI(R)        $V$     -0.688         0.000
                        $g_t$    0.064         0.092
                        $g_1$    0.016         0.501
$\pi_4$ = MMQL(R)       $V$     -0.566         0.000
                        $g_t$    0.053         0.070
                        $g_1$    0.016         0.499
$\pi_5$ = MMVI(R)       $V$     -0.483         0.000
                        $g_t$    0.046         0.070
                        $g_1$    0.026         0.499
$\pi_6$ = MQL($\pi_2$)  $V$     -0.086         0.000
                        $g_t$    0.011         0.020
                        $g_1$    0.147         0.498
$\pi_7$ = MQL($\pi_3$)  $V$     -0.098         0.000
                        $g_t$    0.012         0.018
                        $g_1$    0.132         0.496
Table 5.9: Analysis of offensiveness and defensiveness of different policies ($6 \times 4$
soccer field). The explanation of how to read the table is as in Table 5.3.
[Figure 5.6, panels: (a) RL method. (b) DP method. Both panels plot the value (y-axis)
against the step number (x-axis; 0 to 200 in (a), 0 to 50 in (b)) with the curves “opt.
value max”, “opt. value max-min”, and the value and safety of the max-min policy (QL in (a),
VI in (b)).]
Figure 5.6: Convergence speed of RL and DP techniques measured by value and security
evaluation: standard Q-learning versus a Gauss-Seidel DP method ($\gamma = 0.9$, random
exploration rate $0.20$ for the RL method). The scales for the step numbers (x-axis) are
different.
5.2.5 Comparison of Different DP Techniques with Various Parameters
In the following, basic studies of different DP techniques with a variety of parameters,
based on the model of 1v1 multi-player grid soccer, are provided. In general, such
comparisons should also improve the understanding of RL methods because RL methods involve
some additional subtleties (such as choosing a starting state or an exploration strategy)
which can be avoided in DP methods. Furthermore, in some sense DP methods provide an upper
bound for the effectiveness of RL methods (if strict bounds on the quality of the solutions
are to be guaranteed) because in DP methods the model is completely known. However,
the strength of RL methods is typically to yield a good solution without guaranteeing its
quality, i. e. in a rarely occurring state the policy can be arbitrarily bad.
This subsection is divided into three parts. First, the three DP methods (normal update
without symmetry reduction, Gauss-Seidel update without symmetry reduction, and
Gauss-Seidel update with symmetry reduction) are compared for different choices of
initial value iterates $V_0$ and discount factors $\gamma$ and for the three cases max-min
(2P-ZS-MG), max (MDP with the opponent being fixed to perform a uniformly random policy),
and fixed (Markov process with player and opponent performing a fixed random policy, simply:
policy evaluation). Note that the theoretical fourth DP method of normal DP updates
with symmetry reduction is the same as without reduction except that the state space is
smaller and therefore each value iteration step is considerably faster. Second, different
state sorting strategies for Gauss-Seidel methods are evaluated and an example is worked
out which shows that a resorting of the update order can lead to losing the monotonicity
of the Bellman error in MDPs and 2P-ZS-MGs. Third, a special convergence phenomenon
is analysed in detail to find out why certain large error improvements occur.
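The difference between the normal update and the Gauss-Seidel update, and the influence of
the state sorting, can be sketched on a toy problem. The model below is a deterministic
single-agent chain with five states and a goal at the right end (the max case); it is an
illustrative assumption and not the grid soccer model, and all names are hypothetical.

```python
GAMMA = 0.9
N_STATES = 5          # states 0..4; state 4 is an absorbing goal
GOAL = N_STATES - 1


def step(state, action):
    """Deterministic toy transition; action is -1 (left) or +1 (right)."""
    if state == GOAL:
        return state, 0.0
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def backup(values, state):
    """Bellman backup of the max type for one state."""
    return max(r + GAMMA * values[nxt] for nxt, r in (step(state, a) for a in (-1, 1)))


def value_iteration(order, in_place, tol=1e-9):
    """Sweep until the largest change is below tol; return (values, number of sweeps).

    in_place=False: normal update, every backup reads the values of the previous sweep.
    in_place=True:  Gauss-Seidel update, backups read values already updated in the sweep.
    """
    values = [0.0] * N_STATES
    sweeps = 0
    while True:
        sweeps += 1
        source = values if in_place else list(values)
        delta = 0.0
        for s in order:
            new = backup(source, s)
            delta = max(delta, abs(new - values[s]))
            values[s] = new
        if delta < tol:
            return values, sweeps


print(value_iteration(range(N_STATES), in_place=False))  # normal update
print(value_iteration([3, 2, 1, 0, 4], in_place=True))   # Gauss-Seidel, sorted from goal
print(value_iteration(range(N_STATES), in_place=True))   # Gauss-Seidel, unfavourable order
```

In this toy chain the normal update needs 5 sweeps, the Gauss-Seidel update sorted from the
goal needs 2, and the Gauss-Seidel update in the unfavourable order again needs 5, although
all three variants arrive at the same values.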
Initial Value Functions $V_0$ and Discount Factors $\gamma$
There are two qualitative standard behaviours of DP and RL methods which are verified in
this thesis by means of the grid soccer model. The first behaviour is that the discount
factor $\gamma$ dramatically influences the convergence speed of DP and RL methods: the
closer $\gamma$ is to $1$, the more update steps are needed to achieve a prescribed maximal
error. The second behaviour is that the closer the initial value function $V_0$ is to the
optimal value function $V^*$ with respect to $\|\cdot\|$, the fewer update steps are needed.
Both algorithmic behaviours are direct consequences of the value iteration theorem
(Definition 2.35 and below): the first one is due to the fact that the stopping criterion
depends on $\frac{1-\gamma}{\gamma}$ and that the contraction rate of the Bellman operator
is $\gamma$; the second one is due to the fact that
$$\|V_{k+1} - V_k\| \le \|V_{k+1} - V^*\| + \|V_k - V^*\| = \|B_{MG} V_k - B_{MG} V^*\| + \|V_k - V^*\| \le (1 + \gamma) \cdot \|V_k - V^*\|.$$
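Both behaviours can be made tangible with the standard contraction estimate
$\|V_k - V^*\| \le \gamma^k \|V_0 - V^*\|$, which gives an upper bound on the number of
normal value iteration sweeps needed for a prescribed error. The following sketch only
evaluates this bound; it does not reproduce the iteration counts measured in the thesis,
and the function name is hypothetical.

```python
from math import ceil, log


def sweeps_for_error(gamma, initial_error, target_error=1e-3):
    """Smallest k with gamma**k * initial_error <= target_error (contraction bound)."""
    if initial_error <= target_error:
        return 0
    return ceil(log(target_error / initial_error) / log(gamma))


# Influence of the discount factor for an initial error of 1.
for gamma in (0.75, 0.9, 0.99):
    print(gamma, sweeps_for_error(gamma, 1.0))

# Influence of the initial guess for gamma = 0.9.
for initial_error in (3.0, 1.0, 0.1):
    print(initial_error, sweeps_for_error(0.9, initial_error))
```

The bound grows as $\gamma$ approaches $1$ and shrinks as the initial guess moves closer to
$V^*$; Gauss-Seidel updates and symmetry reduction can need fewer sweeps than this bound
suggests.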
Notes on the Choice of the Discount Factor. It is to be remarked that the conclusion from
the first behaviour of DP and RL methods is not to use a very small discount factor $\gamma$
but to use the smallest reasonable one. The problem with too small a $\gamma$ is that, e. g.
with $\gamma = 0.1$, the 2P-ZS-MG reward after about 10 time steps is discounted to a
magnitude below that of numerical errors. One possibility is to determine or estimate the
minimum number of time steps $t_0$ between obtaining two essential rewards and to set
$\gamma$ in such a way that the total discounting $\gamma^{t_0}$ is of the order of $0.5$.(15)
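The rule $\gamma^{t_0} \approx 0.5$ can be evaluated in both directions with elementary
logarithms. The function names in this small sketch are hypothetical.

```python
from math import log


def discount_for_horizon(t0, total_discount=0.5):
    """Discount factor gamma with gamma**t0 equal to total_discount."""
    return total_discount ** (1.0 / t0)


def horizon_for_discount(gamma, total_discount=0.5):
    """Number of time steps t0 after which gamma**t0 equals total_discount."""
    return log(total_discount) / log(gamma)


print(horizon_for_discount(0.9))  # time scale implied by gamma = 0.9
print(discount_for_horizon(20))   # gamma for 20 steps between essential rewards
print(0.1 ** 10)                  # discounting after 10 steps with gamma = 0.1
```

For $\gamma = 0.9$ the implied time scale lies between 6 and 7 steps, whereas $\gamma = 0.1$
discounts a reward to about $10^{-10}$ within 10 steps.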
(15) For $\gamma = 0.9$ this would yield a reasonable time scale of $6$ to $7$ time steps.
A mathematically more beautiful point of view includes Euler's number $e$, which can be
expressed by different limits, e. g.
$$\frac{1}{e} = \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n \qquad (5.1)$$
For the special discount factors $\gamma = 1 - \frac{1}{n}$ and $n$ large enough this implies
that the $n$-step discounting is close to $\frac{1}{e} \approx 0.368$. If a discrete
2P-ZS-MG is constructed by discretisation of an underlying continuous model, then doubling
the discretisation accuracy will typically double the number of discrete time steps needed
to describe the same process. Hence, to maintain the expressiveness of the numerical
results, a sensible $\gamma$ would then also be twice as close to $1$.
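A quick numerical check of (5.1) and of the doubling argument: with $\gamma = 1 - \frac{1}{n}$
the $n$-step discounting stays near $\frac{1}{e}$ when $n$ is doubled, so moving $\gamma$
twice as close to $1$ keeps the discounting of the same physical process roughly unchanged.
The function name is hypothetical.

```python
from math import exp


def n_step_discount(n):
    """Total discounting over n steps for the discount factor gamma = 1 - 1/n."""
    gamma = 1.0 - 1.0 / n
    return gamma ** n


for n in (10, 20, 40, 1000):
    print(n, 1.0 - 1.0 / n, n_step_discount(n))
print("1/e =", exp(-1.0))
```

For example, refining a model with $\gamma = 0.9$ ($n = 10$) to twice the resolution
suggests $\gamma = 0.95$ ($n = 20$).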
Notes on the Magnitudes of the Value Function. The values for $V_0$ are chosen by
the following criteria: $3$ is larger than the maximum optimal value, $-3$ is lower than the
minimal one. $\pm 1$ is of the magnitude of high or low values, while $\pm 0.1$ is a small
positive or negative estimate. $0$ is the mean value of the value function for the max-min
case because a 2P-ZS-MG $\mu$-isomorphism with $\mu = -1$ exists. For the max case the mean
value against a random opponent is $0.688$ for $\gamma = 0.9$, $0.136$ for $\gamma = 0.75$,
or numerically zero ($1 \cdot 10^{-7}$) for $\gamma = 0.1$, which shows that this discount
factor is far too small for the grid soccer problem. The relatively high positive values of
start states for the first two discount factors show that it is (of course) much better to
use an optimised strategy than to act randomly.
Notes on the Iteration Steps. Figures 5.7 and 5.8 are intended to give a quantitative
feeling for how many iteration steps different DP methods need for convergence to a
certain error ($\varepsilon = 1 \cdot 10^{-3}$) for max-min DP methods and max DP methods (with a randomly
acting opponent). The effects of different values for $\gamma$ and $V_0$, of using standard and
Gauss-Seidel type updates, and of using symmetry reduction are illustrated.(16) The
analogous results for iterative policy evaluation, i.e. both players of the 2P-ZS-MG
acting according to a fixed (here: random) policy, are omitted at this place (see
Figure D.1) because they give a qualitatively similar picture as the max methods do. The
Gauss-Seidel methods work more efficiently than the standard updates in the max case, but
there is no essential difference between Gauss-Seidel methods with and without symmetry
reduction except that each iteration consists of a lower number of updates. Astonishingly,
for max-min methods the general view changes qualitatively: for $V_0 \ge 0$ the symmetry
reduction decreases the number of iterations considerably. The effect is larger the more
iterations are needed by the other two methods, which leads to a particular insensitivity
of this method to the values of $V_0 \ge 0$: for a Gauss-Seidel method without symmetry
reduction the number of steps $k$ for $V_0 \in [0,3]$ ranges from 43 to 72, whereas the range of a
Gauss-Seidel method with symmetry reduction is only from 26 to 30. However, for negatively
initialised $V_0$ this insensitivity is not observable, but at least the symmetry reduction leads
to the smallest number of iterations in every single case. Two things are essential: first,
the occurrence of the insensitivity effect and, second, the unexpected dependency on the
positivity of $V_0$. In RL methods a well known phenomenon exists: an initial guess of
$V_0$ which is too positive encourages exploration of unexplored states simply because more
realistic estimates of explored states reveal the unattractiveness of these states [202]. However,
for DP methods no analogue exists. Perhaps the effect can be explained by the
solution of the matrix games: if $V_0 > 0$ and only a few entries of a matrix have changed
to more realistic lower ones, the worst-case opponent forces the corresponding actions and
the value decreases appropriately after a few iterations. If, however, $V_0 < 0$, then nearly
all entries of a matrix, i.e. all values of successor states, have to be estimated in a more
realistic way (more positive) before the value of the matrix game is significantly influenced
because, by the same argument as above, a few estimates which are too negative dominate.
(16) Additional tables and figures on the numerical results can be found in Appendix D.
[Figure 5.7, panels (a) Max-min and (b) Max: step number (0 to 120) over $V_0$ (−3 to 3) for the
methods "no GS, no symm", "GS, no symm", and "GS, symm".]
Figure 5.7: Comparison of different Gauss-Seidel types with or without symmetry reduction:
number of iteration steps over initial values of the initial value function $V_0$ for a
1v1 multi-player grid soccer model and a (a) max-min method, (b) max method without
sorting strategy ($\gamma = 0.9$, stopping criterion precision $\varepsilon = 1 \cdot 10^{-3}$ (Corollary 4.3)).
In contrast to the exciting results of varying $V_0$, the results for varying the discount factor
$\gamma$ presented in Figure 5.8 are completely unspectacular. The result here is valid for max-min,
max, and fixed strategy methods and is as above for the max method, namely that
Gauss-Seidel type updates work better than standard ones and the symmetry reduction
does not have a major effect on the number of iterations.
[Figure 5.8, panels (a) Max-min and (b) Max: step number (0 to 120) over the discount factor
gamma (0.1 to 0.9) for the methods "no GS, no symm", "GS, no symm", and "GS, symm".]
Figure 5.8: Comparison of different Gauss-Seidel types with or without symmetry reduction:
number of iteration steps over the discount factor $\gamma$ for a 1v1 multi-player grid soccer
model and a (a) max-min method, (b) max method without sorting strategy ($V_0 = 0$,
stopping criterion precision $\varepsilon = 1 \cdot 10^{-3}$ (Corollary 4.3)).
Different Sorting Strategies for Gauss-Seidel Methods
Figure 5.9 shows the convergence speed of the following different sorting strategies: standard
updates (no sorting), updating by maximal Bellman error after each iteration, updating
by randomly rearranged state order in each iteration (random), and keeping the
randomly rearranged state order of the first iteration (random fixed). Some of the parameters
explored above are fixed: $V_0 = 0$ and $\gamma = 0.9$. There are three surprising observations
in this figure. First, the “Max-min Convergence Boosting Phenomenon”, which means that
occasionally a very large error reduction of nearly one order of magnitude occurs within
mostly one but at most two iteration steps. This phenomenon has only been observed
for max-min methods and seems so important that it is treated in an extra subsection
of Section 5.2.5. Second, the update order by highest Bellman error is the worst of all
methods for nearly all iteration steps, but for this update order the large decisive second
“convergence boost” occurs first for the max-min method. It is not clear whether this is
a coincidence or not because the first smaller “convergence boost” appears later than for
the two other methods. The bad performance of the Bellman error update order is interesting
because the argument to update states with the highest error first is directly plausible.
Third, the DP methods with Gauss-Seidel type updates whose state order is not the same in
each iteration step (e.g. Bellman error and random sorting) sometimes show a small error
increase. This is not due to numerical errors, which are smaller than $1 \cdot 10^{-5}$ and thus
undetectable in the figure, but to the fact that these update methods do not make use of
all estimated values of all value iterates, see Example 5.2. Numerical studies indicate that
the true error to the unknown optimal value function is reduced but that the Bellman error
does not reflect that fact. Therefore, it is stressed that Bellman errors do not reflect true
convergence but only convergence of a guarantee for convergence. This also gives reasons
for the bad performance of sorting by the Bellman error.
The following example is very simple but much can be learnt from it:
5.2 Example (Monotonicity of Bellman Error for some “Gauss-Seidel” Methods)
Let $M = (D, S, SA, T, R)$ be an MDP with decision epochs $D = \mathbb{N}_0$, state space $S = \{s_1, s_2\}$
(only two states), $A(s_1) = A(s_2) = \{a_1\}$ (only one action in every state), $T(s_1, a_1, s_2) = 1$
and $T(s_2, a_1, s_2) = 1$ (probability $1$ to go to state $s_2$ and stay there), and $R(s_1, a_1) = 0$
and $R(s_2, a_1) = 1$.
In fact, this example is a discrete-space discrete-time dynamical system since only one
policy exists (probability $1$ for the only action in each state), which is at the same time the
optimal policy. Furthermore, it represents the simplest example with one recurrent and
one transient state (in terms of dynamical systems). Nevertheless, it gives useful intuition
for clustering methods, interpreting one state as a cluster of other states, and it already
shows that in MDPs and 2P-ZS-MGs the Bellman error of Gauss-Seidel methods with a
changing update order is not monotone. This is in contrast to the Jacobi update
type and to Gauss-Seidel type updates with fixed update order, for which the Bellman
error decreases at least by the discount factor $\gamma$ in each step ([168] gives a proof for MDPs). The
above statement can be seen from the following table, in which the Jacobi method and a Gauss-Seidel
method with the noted update order are performed for a discount factor $\gamma = 0.9$. The value
iterates $V_k$ are represented by the vector $(V_k(s_1), V_k(s_2))$.

k   V_k (Jacobi)    ||V_k - V_{k-1}||   V_k (Gauss-Seidel)   update order   ||V_k - V_{k-1}||
0   (0, 0)                              (0, 0)               s1, s2
1   (0, 1)          1                   (0, 1)               s1, s2         1
2   (0.9, 1.9)      0.9                 (0.9, 1.9)           s1, s2         0.9
3   (1.71, 2.71)    0.81 = 0.9^2        (2.439, 2.71)        s2, s1         1.539 > 0.9

The reason why the Bellman error does not decrease monotonically by $\gamma$ is that the updates
no longer correspond to the Gauss-Seidel method in the strict sense: for
computing $V_3(s_1)$ the value $V_3(s_2)$ is used, whereas for $V_2(s_1)$ the value $V_1(s_2)$ is used.
Thus, $V_2(s_2)$ enters neither of the two computations, which is different for the
Jacobi method or for any method with fixed update order.
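The table of Example 5.2 can be reproduced with a few lines of Python. The sketch below hard-codes the two-state MDP of the example; the dictionary layout and the function names are assumptions made for illustration.

```python
GAMMA = 0.9
REWARD = {"s1": 0.0, "s2": 1.0}
NEXT = {"s1": "s2", "s2": "s2"}  # deterministic transitions of Example 5.2


def jacobi_step(v):
    """Update all states from the previous iterate only."""
    return {s: REWARD[s] + GAMMA * v[NEXT[s]] for s in v}


def gauss_seidel_step(v, order):
    """Update states one after another in `order`, reusing fresh values."""
    v = dict(v)
    for s in order:
        v[s] = REWARD[s] + GAMMA * v[NEXT[s]]
    return v


def max_diff(a, b):
    return max(abs(a[s] - b[s]) for s in a)


def run(orders):
    """Return (Jacobi error, Gauss-Seidel error, Gauss-Seidel iterate) per step."""
    v_j = {"s1": 0.0, "s2": 0.0}
    v_gs = {"s1": 0.0, "s2": 0.0}
    rows = []
    for order in orders:
        new_j, new_gs = jacobi_step(v_j), gauss_seidel_step(v_gs, order)
        rows.append((max_diff(new_j, v_j), max_diff(new_gs, v_gs), new_gs))
        v_j, v_gs = new_j, new_gs
    return rows


# Update orders of the table: the order is switched in the third step.
rows = run([("s1", "s2"), ("s1", "s2"), ("s2", "s1")])
for k, (err_j, err_gs, v_gs) in enumerate(rows, start=1):
    print(k, round(err_j, 3), round(err_gs, 3), v_gs)
```

Keeping the order ("s1", "s2") in the third step as well would give the same error as the Jacobi column, which is the fixed-order behaviour described in the text.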
The Max-min Convergence Boosting Phenomenon
At first glance, it would not be surprising if a max-min value iteration took
more steps than a max or fixed strategy value iteration for the reason that 2P-ZS-MGs
are more complex than MDPs or Markov chains. However, the fixed policy method needs
more steps than the max-min method, whereas the max method clearly needs the largest
number of iteration steps.
[Figure 5.9, panels (a) Max-min and (b) Max: maximal error (logarithmic scale, $10^{-4}$ to $10^{2}$)
over step number (0 to 70) for the sorting strategies "no sorting", "Bellman", "random", and
"random fixed".]
Figure 5.9: Maximal DP error (logarithmic scale) over the number of iteration steps for a
Gauss-Seidel type update with symmetry reduction, a 1v1 multi-player grid soccer model,
and a (a) max-min method, (b) max method by means of standard updates (no sorting),
Bellman error estimation (Bellman), randomly rearranged state order in each iteration
(random), and keeping the randomly rearranged state order of the first iteration (random
fixed) ($V_0 = 0$, $\gamma = 0.9$, stopping criterion precision $\varepsilon = 1 \cdot 10^{-3}$ (Corollary 4.3)).
A heuristic argument is that of information diffusion: the shorter the longest shortest path
from any state $s_i$ to $s_j$ is on an appropriate graph, e.g. the one induced by the transition matrix,
the faster information about the value iterate $V_k(s_i)$ is propagated to the state $s_j$ in a
future iterate. Because the success of value iteration is measured by $\|\cdot\|$, it can give bad
performance if only a few states are informed late about significant changes. If only the
number of connections given by the transition matrix is considered, without a detailed shortest
path analysis, this explains why the fixed strategy method needs fewer iterations than
the max method: the first includes two completely random policies and the second
includes one random and one (nearly) deterministic policy. Applying the same argument,
the max-min method should lie in between the two other methods because only sensible
actions are performed with non-zero probability by each of the two players, but amongst
these sensible actions a reasonable diversification has to be performed to minimise the
risk of being exploited. However, because of extraordinary convergence steps the max-min
method seems to be the fastest.
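The “longest shortest path” of the information diffusion argument can be computed by a breadth-first search from every state. The following Python sketch assumes that the graph is given as a successor list containing every transition of positive probability under the current policies; the two four-state graphs are hypothetical toy examples and not taken from the grid soccer model.

```python
from collections import deque


def longest_shortest_path(successors):
    """Largest BFS distance over all ordered state pairs (inf if a state is unreachable)."""
    worst = 0
    for start in successors:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            s = queue.popleft()
            for t in successors[s]:
                if t not in dist:
                    dist[t] = dist[s] + 1
                    queue.append(t)
        if len(dist) < len(successors):
            return float("inf")
        worst = max(worst, max(dist.values()))
    return worst


# Hypothetical examples: a (nearly) deterministic policy yields a sparse graph,
# a random policy a dense one with much shorter paths.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
dense = {s: [0, 1, 2, 3] for s in range(4)}
print(longest_shortest_path(ring), longest_shortest_path(dense))
```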
A less heuristic argument is based on iterative linear solvers. The Bellman equation for
fixed policies is equivalent to a linear equation, but the non-linear versions can also be
reformulated a posteriori in a linear way once the optimal policies which fulfil
the max or max-min equation are known. Thus, the convergence speed can be analysed by the same
methodology that is standard for iterative linear solvers (see Appendix B). It should
be remarked that there is a small difference between the Jacobi and Gauss-Seidel
updates of the Bellman equation and the classical versions for iterative linear solvers. All in all, the analysis
does not reveal novelties: the norm $\|\cdot\|$ of the update matrix nearly always equals exactly the
discount factor $\gamma$ and even the spectral radius is always close to it. This implies that even
with a different norm, which cannot be chosen freely for the convergence guarantee, no
significantly better convergence rates can be explained. The conclusion of this paragraph
is that the worst-case convergence rate is not a good measure for the real convergence rate
in the practical example of multi-player grid soccer.
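For the simplest case, a fixed policy pair with a standard (Jacobi-type) update, the statement about the norm and the spectral radius can be verified directly. The notation $R_\pi$, $P_\pi$ and the use of the maximum row sum norm are assumptions made here for illustration; the analysis of Appendix B, in particular for Gauss-Seidel updates, may differ in detail.

```latex
% Assumed notation (not from the thesis): R_\pi is the reward vector and P_\pi the
% row-stochastic transition matrix induced by a fixed policy pair \pi.
\begin{align*}
  V_{k+1} &= R_\pi + \gamma P_\pi V_k
    && \text{(Jacobi-type update with update matrix } \gamma P_\pi\text{)} \\
  \|\gamma P_\pi\|_\infty &= \gamma \max_i \sum_j (P_\pi)_{ij} = \gamma
    && \text{(every row of } P_\pi \text{ sums to } 1\text{)} \\
  \rho(\gamma P_\pi) &= \gamma \, \rho(P_\pi) = \gamma
    && \text{(} P_\pi \mathbf{1} = \mathbf{1} \text{ and } \rho(P_\pi) \le \|P_\pi\|_\infty = 1\text{)}
\end{align*}
```

In this setting neither the norm nor the spectral radius can explain a convergence rate better than $\gamma$, which matches the conclusion drawn in the text.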
A further idea why the “Max-min Convergence Boosting Phenomenon” might occur is the
existence of 2P-ZS-MG $\mu$-isomorphisms with $\mu = -1$ (player exchanging symmetries). It
was mentioned in Section 3.2 that this qualitatively new kind of symmetry is special to
2P-ZS-MGs and cannot occur in MDPs. Hence, Figure 5.10 shows the same algorithmic
convergence as Figure 5.9 (a), with the exception that the max-min method
without reduction of the player exchanging symmetry is considered (all other symmetries
are reduced). The result is that the convergence boosting takes place similarly. Also,
the possibility that the Gauss-Seidel update could be responsible is excluded by observing
the same phenomenon with non Gauss-Seidel type updates.
Nevertheless, Figure 5.10 reveals a new idea: without the player exchanging symmetry the
steps with convergence boosts seem to have a clearer periodic structure. This is valid
at least for the updates with fixed update order (“no sorting” and “random fixed” in the
figure); the other update types should be neglected because of Example 5.2. The
periodicity is illustrated more clearly in Figure 5.11 (b), in which the stepwise convergence
rate is depicted. The rough periodicity of 6 to 7 steps for the peaks is in concordance with
the expected duration of scoring a goal after a restart of the game. It is speculated that
this inherent “periodicity” of the model is responsible for the periodicity of the convergence
error peaks. This would also fit the heuristic argument of information diffusion above.
Furthermore, the differences between the Bellman error and the true error indicate that the
Bellman error often decreases later than the true error.(17)
(17) The true error is computed as the difference to the optimal value function. The total reduction of the
Bellman error can be larger than the total reduction of the true error since the initial error for the Bellman
estimate is largely above the true error.
[Figure 5.10: maximal error (logarithmic scale, $10^{-4}$ to $10^{2}$) over step number (0 to 70)
for the sorting strategies "no sorting", "Bellman", "random", and "random fixed".]
Figure 5.10: Maximal DP error (logarithmic scale) over the number of iteration steps, all
details are as in Figure 5.9, only the player exchanging symmetry is not reduced.
5.2.6 Comparison of Standard Methods and SL Techniques
It is pointed out in Lemma 4.2 that convergence guarantees cannot be given if the
approximation error introduced by the SL techniques is close to the magnitude of the desired
error of the value function. Nevertheless, SL techniques have been successfully combined
with RL methods (Section 4.3), and the following results are to be understood in this
spirit as a practical effort.
Especially for single player robot soccer, [97] introduces a number of features. These seem
to be highly specialised to the model, which is quite similar to the 1v1 multi-player grid
soccer of Section 5.1 because both models are motivated by [112]. The features are case
based and extract the information in a way such as “if the attacker is not closer to the
defender's goal than the defender and if the defender is close to the attacker then store
some position information of the defender in relation to its goal and all possible relative
positions between the attacker and defender” and so on for all four possible if-combinations.
In contrast to the probably very laborious work of designing such features, the approach used
in the present work is to approximate all occurring intermediate state value functions $V(s)$
during an RL procedure by the following very simple features, which are already designed
for their application in multi-player models:
1.) the number of robots of each team in the cell of the ball,
2.) the minimal distance of each team (minimum over all team members) to the ball,
3.) the minimal distance of the ball to each goal region, and
4.) the number of robots of each team in the half of the first team.
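As an illustration, the four feature groups listed above could be computed as in the following Python sketch. The state layout (lists of (x, y) cells per team, explicit goal cell lists, 1-based coordinates), the definition of the first team's half, and all function names are assumptions made here; distances are squared Euclidean and every value is capped, in line with the description in the thesis.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two grid cells."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def features(team1, team2, ball, goal1, goal2, width, cap=3):
    """Return the four feature groups for one grid state, each value capped at `cap`."""
    teams = (team1, team2)
    # 1.) number of robots of each team in the cell of the ball
    in_ball_cell = [min(cap, sum(r == ball for r in t)) for t in teams]
    # 2.) minimal distance of each team to the ball
    dist_to_ball = [min(cap, min(sq_dist(r, ball) for r in t)) for t in teams]
    # 3.) minimal distance of the ball to each goal region
    ball_to_goal = [min(cap, min(sq_dist(ball, g) for g in goal)) for goal in (goal1, goal2)]
    # 4.) number of robots of each team in the half of the first team
    #     (assumption: this half consists of the columns x <= width // 2)
    in_first_half = [min(cap, sum(r[0] <= width // 2 for r in t)) for t in teams]
    return in_ball_cell + dist_to_ball + ball_to_goal + in_first_half


# Hypothetical 1v1 state on a 6x4 field.
example = features(
    team1=[(2, 2)], team2=[(5, 3)], ball=(2, 2),
    goal1=[(1, 2), (1, 3)], goal2=[(6, 2), (6, 3)], width=6,
)
print(example)
```

The same function works unchanged for more robots per team, which reflects the remark that the features are designed for multi-player models.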
Distances are measured by squared Euclidean distance and all features are restricted by
a maximum value, e.g. $3$ for a small grid. If the distances to the goal regions were not
restricted, one of the two would suffice. Table 5.10 shows results of a policy based on state
value functions represented by a feature approximation architecture (92 degrees of freedom
instead of the 576 needed for the lookup table representation). The results of the max-min solution of
the 1v1 grid soccer model with respect to the exploitation of the random opponent (in
reference to the discussion about non-unique Nash equilibria) are reasonably good but not
as good as those of Table 5.9. The security level against a worst-case opponent trained
with the same restrictive architecture is perfectly optimal, i.e. equal to $0$, because both
players are restricted to the same policy space. Nevertheless, something is lost by
approximating the value function only by features, which could also be interpreted as building
“wrong equivalence classes” by means of heuristics. This becomes obvious when looking at
how strongly a worst-case opponent which is not restricted to the feature
representation of the value function outperforms the policy: the separated right column
reveals that the policy $\pi_1$ is quite bad against this opponent.
[Figure 5.11, panels (a) Symmetry reduction 1 and (b) Symmetry reduction 2: "conv. rate
(Bellman)", "conv. rate (true)", and "discount factor" over step number.]
Figure 5.11: Convergence rate of a max-min DP method by maximal DP error (Bellman)
and by the true error over the number of iteration steps for a Gauss-Seidel type update and
a 1v1 multi-player grid soccer model. In (a) all symmetries are reduced, in (b) all except
the player exchanging symmetry are reduced, i.e. (a) corresponds to the “no sorting” line
of Figure 5.9 (a), and (b) corresponds to that line of Figure 5.10. The convergence rate can
be a little larger than $\gamma$ for later iterates because the Bellman error includes the results of
Lemma 2.36.
The conclusion of the above is that features can yield reasonable results even if they are very
simple and also that the security level against an opponent with the same feature architecture
is reasonable. However, the performance can degrade significantly against more powerful
approximation architectures. This makes the knowledge of which feature architecture
the opponent uses nearly as useful as knowing its policy directly.
                 π2 = R       π3 = π1      π4 = MQL(π1)   π5 = MVI(π1)
π1 = MMQL(R)     V: 0.325     V: 0.000     V: 0.000       V: 0.423
                 gt: 0.033    gt: 0.068    gt: 0.068      gt: 0.042
                 g1: 0.054    g1: 0.502    g1: 0.500      g1: 0.955

Table 5.10: Evaluation of max-min policies for 1v1 grid soccer on a 6×4 field which
are computed with a feature based value function. Only the strategy π5 in the separated
right-most column is computed without features.
5.2.7 Towards Multi-Player Robot Soccer: 2v2 Grid Soccer
In this section the effects of large state spaces obtained by a fine grid discretisation of 1v1 grid
soccer and of large state spaces induced by multi-player, especially 2v2, grid soccer on a coarse
grid are discussed. Tables 5.11 and 5.12 show the sizes of the state and state-action
spaces as well as the number of value iterations needed for convergence. From a heuristic
point of view the discretisation to a higher grid size can be thought to be easier; however,
this is not reflected in the number of value iterations. On the other hand, the computational time for
the 2v2 case is much higher per iteration because the size of the matrix games grows
quadratically in the exponentially growing action space of the player (since the action space
of the opponent grows exactly as fast). A further expectation of the author is that
only a small amount of extra information should be obtained by the solution of a
finer resolution of the grid.(18) Nevertheless, it should be expected that an initialisation
of a finer grid with an adapted value function of a coarser grid leads to a remarkable
reduction of the needed iteration steps because the fine and coarse grid models possess strong
similarities.
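The growth of the matrix games can be read off the problem sizes reported in Tables 5.11 and 5.12. The Python sketch below interprets $|SAO|$ as the number of admissible (state, action, opponent action) triples, which is an assumption about the notation, and uses a hypothetical number of five actions per robot to illustrate the quadratic-in-exponential growth; the function names are likewise illustrative.

```python
# (|S|, |SAO|) for the 6x4 field, taken from Tables 5.11 and 5.12.
SIZES = {"1v1": (282, 4893), "2v2": (82944, 27724780)}


def avg_matrix_entries(num_states, num_triples):
    """Average number of matrix game entries per state."""
    return num_triples / num_states


def matrix_game_entries(actions_per_robot, robots_per_team):
    """Entries of one matrix game if every robot chooses independently (assumption)."""
    team_actions = actions_per_robot ** robots_per_team  # exponential in team size
    return team_actions ** 2                             # quadratic: player times opponent


for name, (n_states, n_triples) in SIZES.items():
    print(name, round(avg_matrix_entries(n_states, n_triples), 1))

# Hypothetical robot with 5 actions: 25 entries for 1v1, 625 entries for 2v2.
print(matrix_game_entries(5, 1), matrix_game_entries(5, 2))
```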
(18) Note that the different grid size models are different models and cannot be interpreted as two different
discretisations of a common underlying continuous model.

Size             |S|      |SAO|      DP iterations   Achieved precision
small (6x4)      282      4893       26              0.00037
medium (12x8)    4584     96288      33              0.00039
large (24x16)    73632    1690578    47              0.00041

Table 5.11: Comparison of symmetry reduced 1v1 grid soccer with $\gamma = 0.9$ (different
soccer field sizes) by problem size and necessary DP iterations for achieving a stopping
criterion precision of $\varepsilon = 1 \cdot 10^{-3}$ (Corollary 4.3). The achieved precision is also noted.

Size             |S|      |SAO|       DP iterations   Achieved precision
small (6x4)      82944    27724780    19              0.00075

Table 5.12: Analogue of Table 5.11 for 2v2 grid soccer on a 6×4 soccer field.

In contrast, for a 2v2 grid soccer model it should be much more difficult to transfer
knowledge from the 1v1 grid soccer model. The main aspect is that the structure of the model
drastically changes: the new options of performing passes and of coordinating behaviour
with a second team member can lead to a completely different kind of optimal behaviour.
Since it is a little tedious and will not give many insights to analyse the whole policy or
value function, the author tried to identify a single situation which is somehow comparable
but in which differences between a 1v1 and a 2v2 grid soccer optimal policy become obvious.
The state in Figure 5.12 (a) for the 1v1 grid soccer is extended to 2v2 soccer by adding to
the field a tactically not so well positioned second opponent and a tactically well placed
second player ready for a pass. In the 1v1 case the optimal policy is to go left with probability
$1$ because the opponent player can only block the ball if he can be in the same cell as the
ball-possessing player after one move, i.e. the opponent player has to stand in his cell. Also
moving up or down does not make sense for the player because the opponent player will
then definitely meet him in the next turn, or else a time consuming phase of several side
turns would start and decrease the time discounted long-term reward. In the 2v2 case with
the added team members the situation looks different: the optimal policy
for Team 1 is to let the member without ball move to the left and the other pass to it with
a probability of roughly $\frac{2}{3}$ (although there is a real chance that a miskick will occur(19))
and to perform with the remaining probability of $\frac{1}{3}$ the same movement for the team
member without ball but a move downwards without a kick for the ball-possessing one.
The tactical advantage of this position is that there is a new chance for a kick without
the opponent team being able to reach the ball-possessing team member. The opponent
team has a clearer strategy: the opponent in the middle at coordinates $(3,3)$ has to stand
and the second opponent has to stand with a probability of 93% and to go upwards with
a probability of 7%. The key aspect here is for both opponents to stay in a range in which
they can become the unintended target of a possible pass.
(19) Without a miskick this action would have probability $1$ because then the goal scoring would be secure
after the pass.
[Figure 5.12, panels (a) 1v1 game and (b) 2v2 game: 6×4 grid soccer fields with columns 1 to 6
and rows 1 to 4.]
Figure 5.12: Case study for a situation of 1v1 and a similar situation in 2v2 soccer. Dark
grey indicates the defended goal region and the players of the first team, light grey those
of the second team, and the ball is sketched by the small grey circle.
5.3 A New Algorithm: MaG-Clus-VI
In the following a new algorithm which combines local and global update steps by means
of clustering methods is introduced. The idea of structuring the state space with the use of
clustering algorithms is shared with the graph clustering by topology approach of [130].
Nevertheless, there are two key differences: firstly, the following clustering algorithm takes
the dynamics into account by clustering by the transition probabilities and not simply
by values of the value function or by some state space topology, and secondly, no macro
actions on clusters are considered. This makes the theoretical proof of convergence very
simple, and a loss of optimality with respect to the original model can be avoided. An
optimal solution will always be achieved if every state is updated infinitely many times
([19] for MDPs), which can be guaranteed by performing infinitely many global steps or by
arranging the local updates in such a way that after all local updates, thought of as a “big local
step”, every state has been updated at least once. This criterion is met by the algorithm below,
called Markov Game Value Iteration with Clustering (MaG-Clus-VI), because the union of
all partitions forms the state space.
5.3 Algorithm (Markov Game Value Iteration with Clustering (MaG-Clus-VI))
Given a 2P-ZS-MG $M = (D, S, SAO, T, R)$ with $D = \mathbb{N}_0$, the MaG-Clus-VI algorithm is
defined by:
1.) Global steps:
Perform $n_{glo} \in \mathbb{N}_0$ steps of standard value iteration. The result is an estimate $V$ of
the value function as well as a Nash equilibrium policy $\pi$ with respect to $V$.
2.) Local steps:
a) Determine the transition probability matrix with entries for all state pairs $(s_i, s_j) \in
S \times S$ from the 2P-ZS-MG transition probabilities and the total policy $\pi$ of both
players.
b) Use any clustering method, especially one for obtaining balanced partitions, to determine
the $k$ almost invariant clusters $C_i \subset S$, $i = 1, \dots, k$.
c) For $i = 1, \dots, k$: perform for $C_i$ standard value iteration updates restricted to $C_i$
until convergence to accuracy $\varepsilon = \gamma^{m_i} \cdot \varepsilon_{prev}$, $m_i \in \mathbb{N}$, is achieved, where $\varepsilon_{prev}$ is
the Bellman error of the last global step of 1.) if $n_{glo} > 0$ or the estimate of the
extra convergence check in 3.) if $n_{glo} = 0$.
3.) Go to 1.) until some convergence criterion is achieved. If $n_{glo} = 0$, an extra step of
global value iteration can be performed without influencing the next estimate of $V$;
otherwise the approximation error can be estimated during the global steps without
extra computational effort.
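The control flow of Algorithm 5.3 can be sketched in Python under strong simplifications, all of which are assumptions made here and not part of the thesis: the matrix game solve in each state is replaced by a plain max backup (i.e. an MDP stands in for the 2P-ZS-MG), the clusters are passed in by the caller instead of being computed as almost invariant sets, a single $m$ is used for all clusters, all sweeps update in place, the case $n_{glo} = 0$ is omitted, and convergence is checked by the largest change of the last global sweep rather than by the stopping criterion of Corollary 4.3.

```python
def backup(s, V, actions, T, R, gamma):
    """Max Bellman backup for state s (stand-in for solving the matrix game)."""
    return max(
        R[s][a] + gamma * sum(p * V[t] for t, p in T[s][a].items())
        for a in actions[s]
    )


def sweep(subset, V, actions, T, R, gamma):
    """Update the states in `subset` in place; return the largest change."""
    err = 0.0
    for s in subset:
        new = backup(s, V, actions, T, R, gamma)
        err = max(err, abs(new - V[s]))
        V[s] = new
    return err


def clus_vi(states, actions, T, R, gamma, clusters,
            n_glo=1, m=3, eps=1e-6, max_rounds=10_000):
    """Alternate global sweeps with local sweeps restricted to each cluster."""
    if n_glo < 1:
        raise ValueError("this sketch needs at least one global step per round")
    V = {s: 0.0 for s in states}
    for _ in range(max_rounds):
        for _ in range(n_glo):                      # 1.) global steps
            err = sweep(states, V, actions, T, R, gamma)
        if err <= eps:                              # 3.) convergence check
            break
        local_eps = gamma ** m * err                # accuracy for step 2.c)
        for cluster in clusters:                    # 2.) local steps
            while sweep(cluster, V, actions, T, R, gamma) > local_eps:
                pass
    return V


# Hypothetical chain MDP: 0 -> 1 -> 2 -> 3, state 3 is absorbing with reward 1.
states = [0, 1, 2, 3]
actions = {s: ["go"] for s in states}
T = {s: {"go": {min(s + 1, 3): 1.0}} for s in states}
R = {s: {"go": 1.0 if s == 3 else 0.0} for s in states}
V = clus_vi(states, actions, T, R, gamma=0.9, clusters=[[0, 1], [2, 3]])
print(V)
```

Because the clusters cover the whole state space and every round contains a global sweep, every state keeps being updated, which is the condition for convergence quoted in the text.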
One strength of the above algorithm is that in the local steps the propagation of larger
updates is restricted to only a part of the state space and is fully exploited on that part.
Another key point is that the algorithm can perform a different number of local update steps
on $C_i$ and $C_j$, which adapts the update steps to the local properties of the model. Finally,
it should be advantageous to use almost invariant sets because this minimises the influence
of value updates on adjacent partitions. Furthermore, the almost invariant sets are those
which trap a typical Q-learning trajectory following an optimal policy for a relatively
long time. All in all, the MaG-Clus-VI algorithm combines the idea of global updates with
the idea of local exploration of Q-learning. On the one hand it is global and on the other
hand hierarchically structured without losing the guarantee to obtain an optimal policy
of the non-hierarchical model.
Some numerical results are obtained with the 2P-ZS-MG of 1v1 grid soccer: with the
number of partitions being equal to $4$ in each step, $n_{glo} = 7$, and $m_i = 3$ for all steps and
all $i$, the algorithm needs about $39.2$ steps for obtaining a prescribed accuracy of $\varepsilon = 1 \cdot 10^{-3}$.
If the partition created by the optimal policy is used from the first step on, which assumes
that the optimal policy problem is already solved beforehand, this method takes about $39.8$
steps, which shows that the adaptivity may be more advantageous than the knowledge of
the clustering by the optimal policy. In comparison to the 42 steps of the standard Gauss-Seidel
method this is about 5% less; however, for random clustering $49.5$ iteration steps
are needed. Thus, the comparison with “knowledge free” clustering yields a reduction of
about 20% by use of almost invariant sets.
5.4 From Grid Soccer to Robot Soccer: Practical Issues
There is a considerable variety of practical issues in transferring the numerical results obtained
in Matlab by DRPOST to real robots, e.g. to AIBO ERS-7 type robot dogs. Some of
the key aspects are the use of lower level behaviours, which are considered to be elementary
actions in the 2P-ZS-MG model, and the extraction of visual information, especially the self
and opponent localisation task. Localisation of the robot and all other robots is essential
for determining the state $s \in S$ and therefore for being able to apply a state-dependent
policy.
Since it would go far beyond the scope of a single PhD thesis to design a software architecture
from scratch which handles the control of single joints of the robots reasonably well,
basic behaviours such as walking and kicking, higher level behaviours, and finally a policy,
communication between the robots via WLAN, analysis of (noisy) visual information of a
moving camera, and so on, the author decided to resort to the publicly available software
of the German Team.(20) A new version of this software is typically published some time
after the most recent world competition of robot soccer called RoboCup. A complete team
report of 2004 exists, available only online at the German Team web site, which describes
all methods and features of the software in some detail.
(20) Web site (30.11.2007): http://www.germanteam.org/tiki-index.php.
5.4.1 Lower Level Behaviours
Lower level behaviours are provided by the German Team software in a sufficient number.
A large number of different walking types already exists, with or without the ball being in
possession of the robot, as well as an even larger number of different kicks applicable from diverse
relative positions and targeting different locations, and a variety of head moving behaviours,
which is essential for the following localisation task.
The task at hand is to construct action schemes for the robots which correspond as closely
as possible to the basic actions $a_i \in A(s)$ of the multi-player grid soccer model. Some
practical experiments show that this task is manageable by manually experimenting in
some standard situations of robot soccer.
5.4.2 Image Processing and Localisation
The issues concerning image processing and localisation are trickier than using the lower
level behaviours to create the basic actions $a_i \in A(s)$ of the 2P-ZS-MG. Localisation of the
robot and all other robots is essential for determining the state $s \in S$.
Image processing deals with the problem of extracting information from a series of camera
pictures. The basic problem in robot soccer is to match shapes and colours of predefined
objects to pixels of a picture. The objects as well as the camera can be positioned (not completely
but considerably) freely in a three dimensional space such that some objects can be
partly hidden or be out of the visible range. In addition to the basic problem, noise distorts
the picture and the analysis must be performed in real time on the robot itself.
[101] gives an overview of, and further references to, recent and possible developments of image
analysis with a focus on robot soccer. The main idea is briefly presented: a set of
hypotheses of positions for each object is generated and Kalman filters are applied to
predict the movement of each single hypothesis. Then, unlikely hypotheses are removed,
whereas new ones can be added if corresponding objects are detected in the current camera
image.
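The predict, update and prune cycle described above can be illustrated by a deliberately small Python sketch. It is not the method of [101]: the scalar random-walk motion model, the noise variances `q` and `r`, the gate value, and all names are assumptions chosen to keep the example short, and the addition of new hypotheses is left out.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    x: float  # estimated position (one coordinate only)
    p: float  # variance of the estimate


def predict(h, q):
    """Random-walk motion model: position unchanged, uncertainty grows by q."""
    return Hypothesis(h.x, h.p + q)


def update(h, z, r):
    """Fuse measurement z (variance r); also return the normalised squared innovation."""
    s = h.p + r                      # innovation variance
    k = h.p / s                      # Kalman gain
    fused = Hypothesis(h.x + k * (z - h.x), (1.0 - k) * h.p)
    return fused, (z - h.x) ** 2 / s


def track(hypotheses, z, q=0.1, r=0.5, gate=9.0):
    """Predict every hypothesis and keep only those compatible with measurement z."""
    kept = []
    for h in hypotheses:
        fused, d2 = update(predict(h, q), z, r)
        if d2 <= gate:               # unlikely hypotheses are removed
            kept.append(fused)
    return kept


kept = track([Hypothesis(0.0, 1.0), Hypothesis(10.0, 1.0)], z=0.5)
print(kept)
```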
Self Localisation. Some aspects of self localisation in robot soccer with the AIBO
ERS-7 robots are briefly discussed to provide insights for the opponent localisation. The
self localisation typically works as follows: in a picture some fixed objects of known size
and position (goals, landmarks, lines of the soccer field) are detected, and the distance is
estimated from their size in the picture. This is necessary since only one camera is available
to the robot and no stereo three dimensional vision is possible as it is e. g. for humans. To
calculate the position in a global coordinate system of the soccer field, the position and
direction of the camera (in principle all joints of the robot) have to be known.
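The distance estimate from apparent size can be illustrated with the pinhole camera model. The sketch below is an assumption-laden example: the focal length, landmark height and pixel height are made-up numbers, not AIBO ERS-7 camera data, and the function names are hypothetical.

```python
import math


def distance_from_size(focal_length_px, real_height_m, pixel_height):
    # Pinhole model: pixel_height / focal_length = real_height / distance.
    return focal_length_px * real_height_m / pixel_height


def robot_position(landmark_xy, distance_m, bearing_world_rad):
    # bearing_world_rad: direction from robot to landmark in field coordinates,
    # which requires the camera direction (joint angles) and the robot heading.
    lx, ly = landmark_xy
    return (lx - distance_m * math.cos(bearing_world_rad),
            ly - distance_m * math.sin(bearing_world_rad))


d = distance_from_size(focal_length_px=200.0, real_height_m=0.4, pixel_height=40.0)
print(d, robot_position((3.0, 0.0), d, 0.0))
```

Because the distance is inversely proportional to the measured pixel height, small pixel errors translate directly into distance errors, which is one reason why the error sources listed next matter.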
Summarising, two possible sources of errors are present: first, the errors of reading the
sensor information of the joints (error of camera position) and, second, the errors of color
noise (pixel errors) and misclassified objects (errors of object recognition). Locating other
team members is not an issue because robots within a team are allowed to communicate
via WLAN and to share the information about their own positions.
Opponent Localisation. Since opponent localisation is as essential as self localisation
for the determination of the state $s \in S$, the additional barriers to obtaining a good
estimate are briefly outlined. All error sources mentioned for self localisation are present,
and additionally it matters that the opponents are moving targets and that not all of
them can be seen by each robot at the same time. The reason why this causes problems is
that the distance cannot be estimated from the size of the robots’ jerseys because they have
a complicated non-convex shape.(21) If it is assumed that only the direction but not the
distance can be determined, a natural approach is to observe the same opponent robot
from different team robots and to intersect the corresponding lines. This could work for a
single opponent robot, but as soon as two opponent robots are present at the same time,
intersections can occur at locations where no robot stands.
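The ambiguity of intersecting sight lines can be reproduced with a small geometric example. The positions below are hypothetical and the sketch assumes perfect bearing measurements; it only shows that two observers and two opponents already yield two spurious intersections.

```python
import math
from itertools import product


def bearing(observer, target):
    return math.atan2(target[1] - observer[1], target[0] - observer[0])


def intersect(p, theta_p, q, theta_q, eps=1e-9):
    """Intersection of the rays p + t*dp and q + s*dq (t, s > 0), or None."""
    dpx, dpy = math.cos(theta_p), math.sin(theta_p)
    dqx, dqy = math.cos(theta_q), math.sin(theta_q)
    det = dqx * dpy - dpx * dqy
    if abs(det) < eps:
        return None  # parallel sight lines
    rx, ry = q[0] - p[0], q[1] - p[1]
    t = (dqx * ry - rx * dqy) / det
    s = (dpx * ry - dpy * rx) / det
    if t <= 0 or s <= 0:
        return None  # crossing lies behind one of the observers
    return (p[0] + t * dpx, p[1] + t * dpy)


observers = [(0.0, 0.0), (4.0, 0.0)]  # two team robots (hypothetical)
opponents = [(1.0, 2.0), (3.0, 2.0)]  # true opponent positions (hypothetical)

# Each observer only knows the directions in which it sees opponents.
rays = [[bearing(obs, opp) for opp in opponents] for obs in observers]
candidates = [intersect(observers[0], a, observers[1], b)
              for a, b in product(rays[0], rays[1])]
candidates = [c for c in candidates if c is not None]
ghosts = [c for c in candidates
          if all(math.dist(c, opp) > 1e-6 for opp in opponents)]
print(len(candidates), "candidates,", len(ghosts), "ghosts:", ghosts)
```

Without further information (e.g. a third observer or tracking over time) the two ghost intersections cannot be distinguished from the two true positions.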
5.5 Other Applications
Besides robot soccer, many other examples exist in which RL methods are successfully
applied; an overview can be found in [88, 202]. However, two of the main areas are
intertwined by robot soccer: game playing and robotics. For example, the lower level of
motion control can be considered a typical robotic application and the higher level of
strategic planning is strongly related to game playing. This makes robot soccer a
challenging and especially interesting subject of research.
Games
The major part of RL literature deals with MDPs which do not include game playing or
model only a fixed policy of the second agent. Sometimes, even robotic control problems
can be appropriately modeled by 2P-ZS-MGs, e. g. if the aim is to obtain policies that are
robust to errors of sensors or motors (MDP with “uncertainty generator” as a competitive
player [147]). Nevertheless, there are examples in which RL methods (Section 2.4) are
utilised to solve 2P-ZS-MGs [112, 97].
2P-ZS-MGs include a broad class of games, e. g. nearly all two-player board games and
two-team sport games. Two famous examples of board games are checkers and backgammon,
in both of which the machine learning can be performed by self-play, i. e. a computer plays
against itself. Self-play can be seen as a special asynchronous method in which the policy
of the first agent is a best response to a time-dependent policy of the second agent and
vice versa. Hence, the standard assumption for convergence of RL methods need not be
fulfilled: instead of a max-min policy, a max policy is determined against the current policy
of the second agent, which is equal to the agent’s own policy.
A very early successful application was the checkers(22) playing system of Samuel [179, 180],
although Samuel did not use the standard RL approach with rewards. Instead, he backed
up a kind of value function to estimate the usefulness of board positions. In [202] it is
discussed how to relate Samuel’s checkers player to current RL methods.
A second example which led to remarkable success was the backgammon policy developed
by Tesauro’s software TD-Gammon [205, 206, 202]. It uses a three-layer neural network
trained by backpropagation to approximate the probability of winning the game.
The most successful version of backgammon programs combines the state information
(21) [101] suggests as an alternative to estimate the contact point of the robot with the ground. However,
no results on the accuracy obtained when detecting multiple moving robots are stated.
(22) Checkers is called “Dame” in Germany.
with features designed by humans. In this way, human insights can be integrated without
having a negative influence on the performance. The trade-off between exploitation and
exploration is neglected, but due to the stochasticity of rolling the dice even a greedy
policy execution seems to lead to enough exploration.
Robotics
Robotic tasks often have an inherently continuous nature (states, actions, time) and are
demanding because the decision making is disturbed by noise and error, information often
has to be extracted from sensor data (actuator status, image processing, speech analysis),
and limited resources (computational power, time constraints, restriction of usable
material) create additional difficulties in solving a given problem. Besides robot soccer,
some other successful examples are: robot juggling with a so-called devil-stick [6], box
pushing with a special clustering technique [127], collecting and transporting small disks
by a team of four robots in a decentralised way [133], and the huge area of robots being
part of a production line.
Chapter 6
Conclusion and Outlook
In the present work two-player zero-sum Markov games (2P-ZS-MGs) are shown to be
an adequate framework for modeling robot soccer in contrast to the widely used Markov
decision process (MDP) framework. Furthermore, a grid soccer model for an arbitrary
resolution grid as well as for an arbitrary number of agents is provided. It seems to be
well-suited for comparison of large scale 2P-ZS-MG effects caused by fine discretisation
and multi-agent scenarios.
Amongst the theoretic aspects, the development of a notion of symmetry for 2P-ZS-MGs,
particularly 2P-ZS-MG µ-homomorphisms, deserves emphasis. The concept is shown
to be a non-trivial extension of recently developed MDP (1-) homomorphisms and its
relation to the special case of classical group actions is studied. A qualitatively new class
of symmetries which does not occur in MDPs and exchanges the two players of a 2P-ZS-MG
is proven to fulfill some natural algebraic properties, especially that it can be composed
with recently developed MDP symmetries. Practitioners have already applied the results
of this symmetry concept without a precise theoretical foundation. In the present thesis
their work is justified mathematically a posteriori.
To also highlight some practical aspects, the development of the software package DRPOST,
by which all numerical results are computed, is to be mentioned. Notably, it includes a
new asynchronous dynamic programming (DP) algorithm called MaG-Clus-VI which
intertwines global and local aspects of updating by means of almost invariant sets established
in dynamical systems theory. The software package is intended to help other researchers
and practitioners to solve interesting problems and to gain deeper insights into 2P-ZS-MGs.
Additionally, several comparative studies are performed by means of DRPOST: most
notable are the aspects of incorporating dynamic programming (DP) methods, as typically
not done by the reinforcement learning (RL) community, and discovering and investigating
interesting phenomena such as the “max-min convergence boosting phenomenon” which
is observed in 2P-ZS-MGs. Concerning realisability on physical robots, e. g. on AIBO
ERS-7, the author identifies the opponent localisation as the most urgent topic on which
to focus future research.
However, solving problems raises new ones, and thus many ideas remain on how to continue
research beyond this PhD thesis. In theory, many open questions exist if the models are
differential games: How does discretisation affect the reliability of the solution? Do
arbitrarily fine discrete models converge to the continuous model in the limit? Most of
the results in differential game theory are limited to the important but not general case
of pursuit evasion games. Questions about how to design a stochastic differential game
theory are even more challenging since stochastic dynamical systems without any control
are also an interesting topic of current research.
Not only continuous dynamical games but also the generalisation of results from 2P-ZS-MGs
to non-zero-sum and multi-player games with more than two players offer interesting open
questions. A natural idea is to replace the solution of the matrix game by a more general
Nash equilibrium solution of a non-zero-sum or multi-player game. However, this
introduces quite a burden since all the issues about solution concepts for games with only a
single state and single time step (e. g. selection of one Nash equilibrium if multiple exist)
known to the game theory community become relevant for every state and every time step
in the iterative procedure of finding a solution of a multiple time step multiple state game.
Practically challenging is the application of RL and DP methods to big problem instances.
Some practical ideas which are partly biologically inspired and could help to manage such
instances are according to Kaelbling [88]: shaping, i. e. first presenting simple problems
and then raising the difficulty level little by little (e. g. reward shaping [169]), imitation,
i. e. learning by watching other agents or providing parts of a policy with the aid of humans
([167, 204] utilise initial policies called experts to accelerate learning), and reflexes, i. e.
providing some standard reactions to standard situations.(1)
What all these ideas have in common is that they give up tabula rasa learning, which is
also the motivation of the present work. The difference is that in the three approaches above
knowledge of a hand-coded (part of a) policy is provided: the experts are directly policies,
reflexes can be interpreted as policies defined only on special states, and shaping can be
performed by restricting the model and the policies to a subset of the original domain. To
make progress in the spirit of giving up tabula rasa learning, however, the author suggests
using DP methods with simpler models to provide an initial guess for a policy.(2) In
future research, all these methods could be compared and intertwined, e. g. by utilising the
author’s approach with different models of the same real world problem and letting (nearly)
optimal policies of each model be different experts to imitate.
Another suggestion of Kaelbling is to make reinforcement signals local. This addresses
the model designing phase rather than the learning algorithms but may be advantageous
especially in navigation tasks [133]. Parallels can be found to so-called potential field or
vector field approaches in which goals and special objects or locations produce a
virtual vector field to influence the motion of a robot. Caution is advisable for vector field
approaches as well as for local reward methods because unintentionally introduced subgoals
may lead to suboptimal solutions and may prevent the agent from reaching the original
goal. It is also interesting future work to design a suitable local reward function for the
multi-player robot soccer model and to study the influence of such models on 2P-ZS-MGs.
Finally, a lot of work has been done for MDPs, and although open questions remain in this
area, most of the comparable work for 2P-ZS-MGs has not yet been carried out. From
a practical point of view, the many function approximation methods successfully
applied to MDPs should also be evaluated in 2P-ZS-MGs. It is the author’s hope that a
multilaterally developed public software package could originate as a byproduct of future
work. This should provide a full variety of MDP and 2P-ZS-MG standard example models,
include an essential selection of function approximation and data mining methods, and
(1) Reflexes could speed up the beginning phase of learning, in which nothing happens until a random
walk detects some goal (or non-zero reward), and could help to avoid dangerous situations [138]. In robot
soccer there are no dangerous situations, but a walk close to a cliff or the control of a nuclear reactor
provides a meaningful example.
(2) [69] shows that the transfer from one model to another (there: simulation to real robots) can work.
integrate symmetry reduction and hierarchical techniques in a modular way. By means of
such a software tool, transparency and comparability amongst the huge variety of proposed
learning methods and test models could be improved.
Appendix A
Basics of Group Homomorphisms
and Group Actions
In this appendix some standard definitions of group homomorphisms and group actions are
recalled. Propositions A.4 and A.5 show that equivalence relations and group actions are
equivalent concepts. The proof of Proposition A.5 is given for the convenience of the
reader. Some of the statements in this appendix also hold for infinite groups, but only
the application to finite groups is intended.
Before continuing with group actions, a short reminder of group homomorphisms is
given:
A.1 Definition (Group Homomorphism, Isomorphism [60])
Let $(G_1, \circ)$ and $(G_2, \ast)$ be two groups with group operations $\circ$ and $\ast$,
respectively. A map $h: (G_1, \circ) \to (G_2, \ast)$ (or short $h: G_1 \to G_2$) is called a
group homomorphism if
\[ \forall g_1, \tilde{g}_1 \in G_1: \quad h(g_1 \circ \tilde{g}_1) = h(g_1) \ast h(\tilde{g}_1). \tag{A.1} \]
A (group) homomorphism is called an isomorphism if it is a bijection.
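A.2 Definition (Group Action, Transformation Group [24, 60])
Let $G$ be a group and let $X$ be a set. A (left) group action $\Theta$ is a map
$\Theta: G \times X \to X$ with the properties
\[ \forall g, h \in G \ \forall x \in X: \quad \Theta(g, \Theta(h, x)) = \Theta(gh, x) \tag{A.2} \]
and, for the identity $1 \in G$,
\[ \forall x \in X: \quad \Theta(1, x) = x. \tag{A.3} \]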
A simplified notation for group actions will be $gh \cdot x = ghx = \Theta(g, \Theta(h, x))$ when
no misunderstanding is possible. The composition of group elements is the standard
group operation. [60] states that for each $g \in G$ the map $\sigma_g: X \to X$,
$a \mapsto \sigma_g(a) = g \cdot a$ is a permutation or transformation of $X$ (i. e. a bijection
from $X$ to $X$) and that the map from $G$ to $S_X$ defined by $g \mapsto \sigma_g$ is a
homomorphism. This means that every element $g \in G$ acts on $X$ as a permutation in a
manner that is consistent with the group operation of $G$.
To make this more precise, define the kernel of a group action by
$K = \{g \in G : (\forall x \in X: g \cdot x = x)\} \subseteq G$ and the action to be faithful if
$K = \{1\}$. Then, the kernel $K$ is a normal subgroup of $G$ and hence induces equivalence
classes corresponding to elements of the quotient group $G/K$.
Furthermore, not only does each group action induce a homomorphism by $g \mapsto \sigma_g$,
but conversely any homomorphism $\varphi: G \to S_X$ defines a group action of $G$ on $X$ by
$g \cdot x = \varphi(g)(x)$ for all $g \in G$ and $x \in X$. The kernel of this group action is the
kernel of $\varphi$ and the permutation representation $g \mapsto \sigma_g$ equals $\varphi$. Thus,
the following proposition holds:
A.3 Proposition (Characterising Group Actions [60])
For any group $G$ and any non-empty set $X$ there exists a bijection between the actions of
$G$ on $X$ and the homomorphisms from $G$ into $S_X$.
The following result finally relates group actions to equivalence classes:
A.4 Proposition (Equivalence Relations Induced by Group Actions [60])
For any group $G$ acting on a non-empty set $X$ the group action induces an equivalence
relation on $X$ by
\[ x' \in [x] \quad \text{iff} \quad \exists g \in G: \ x' = g \cdot x. \tag{A.4} \]
For each $x \in X$ the size of the equivalence class is $|[x]| = |G : G_x|$, which is the index
of the stabiliser $G_x = \{g \in G : g \cdot x = x\}$.
A shorter notation of the equivalence classes induced by a group action is
$[x] = G \cdot x = \{g \cdot x : g \in G\}$. $G \cdot x$ is also called the group orbit of $x$ under $G$.
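Proposition A.4 can be checked numerically on a small example. The sketch below assumes a hypothetical 5 × 3 grid of cells and the four-element group generated by its two mirror symmetries; neither the grid size nor the names are taken from the thesis.

```python
W, H = 5, 3  # grid width and height (hypothetical field size)

# Klein four-group generated by the two mirror symmetries of the grid.
G = {
    "id":     lambda c: c,
    "flip_x": lambda c: (W - 1 - c[0], c[1]),
    "flip_y": lambda c: (c[0], H - 1 - c[1]),
    "rot180": lambda c: (W - 1 - c[0], H - 1 - c[1]),
}
X = [(x, y) for x in range(W) for y in range(H)]


def orbit(x):
    # [x] = G.x = {g.x : g in G}
    return frozenset(g(x) for g in G.values())


def stabiliser(x):
    # G_x = {g in G : g.x = x}
    return {name for name, g in G.items() if g(x) == x}


orbits = {orbit(x) for x in X}
for x in X:
    # |[x]| = |G : G_x| = |G| / |G_x| for a finite group
    assert len(orbit(x)) * len(stabiliser(x)) == len(G)

print(len(X), "cells fall into", len(orbits), "orbits")
```

Cells on a mirror axis have a larger stabiliser and therefore a smaller orbit, which is why the number of orbits is larger than the number of cells divided by the group order.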
Proposition A.4 shows that each group action induces an equivalence relation on $X$.
Moreover, equivalence relations on $X$ also induce group actions, as stated by Proposition
A.5. If equivalence classes of all group actions which induce the same equivalence classes
on $X$ are formed, then group actions and equivalence relations on $X$ are equivalent concepts.
A.5 Proposition (Group Actions Induced by Equivalence Relations [68])
Let $X$ be a non-empty set with an equivalence relation, i. e. a partition into equivalence
classes $P(X) = \{[x] : x \in X\}$, and let $G$ be the set of all bijective functions
$g: X \to X$ that preserve the structure of the equivalence classes, i. e.
\[ \forall g \in G \ \forall x \in X: \quad g(x) \in [x]. \tag{A.5} \]
Then $(G, \circ)$ is a group with $\circ$ being the standard composition of functions, and the
group action $\Theta: G \times X \to X$, $\Theta(g, x) = g(x)$ induces equivalence classes equal to
those of $P(X)$.
Proof: The proof of [68] is adapted to the notion of group actions. $G$ is a group because
it contains the identity, and the inverses and compositions of $g_i \in G$ are also members of
$G$ because they all operate with respect to the partition $P(X)$. Furthermore, $\Theta$ is a
group action.
It remains to show: $\forall x \in X: G \cdot x = [x]$. The inclusion $G \cdot x \subseteq [x]$ directly
follows from Equation A.5. To prove $G \cdot x \supseteq [x]$, recall that $x \in G \cdot x$ and observe
that for any $y \in [x]$ the swapping function $g_{x,y}: X \to X$, defined by $g_{x,y}(x) = y$,
$g_{x,y}(y) = x$, and otherwise $g_{x,y}(z) = z$, is an element of $G$. Thus, $y \in G \cdot x$, which
completes the proof. $\square$
Appendix B
Bellman Equations and Iterative
Linear Solvers
In this appendix standard iterative linear solvers [214] such as the Jacobi or the
Gauss-Seidel method are related to the linear version of the Bellman equation which
emerges if the policies are fixed. In principle, this does not simplify the problem of
determining the optimal value function but gives a possibility for an a posteriori analysis of
the iterative scheme. The reason is that after a non-linear Bellman update the optimal
policies are known and the non-linear update can be reformulated as a linear one with the
calculated optimal policies. The idea is the same as for MDPs [168].
Iterative Solution of Linear Equations. The problem of solving a linear equation of
the type
\[ Ax = b, \quad x, b \in \mathbb{R}^n, \ A \in \mathbb{R}^{n,n} \tag{B.1} \]
can be solved iteratively, which is especially useful for large systems of equations. The
standard idea is to rewrite the above equation as the fixed point formulation
\[ x = C^{-1} b - C^{-1} (A - C) x \tag{B.2} \]
for some invertible matrix $C \in \mathbb{R}^{n,n}$. The corresponding iterative method is simply
\[ x_{m+1} = C^{-1} b - C^{-1} (A - C) x_m \tag{B.3} \]
with some initial $x_0$, and convergence of the method depends on the spectral properties
of the update matrix $C^{-1}(A - C)$ (spectral radius less than 1). The classical Jacobi
method uses $C = D$, where $A$ is written as the sum of its diagonal, strictly lower
triangular, and strictly upper triangular parts:
\[ A = D + L + U. \tag{B.4} \]
The standard Gauss-Seidel method utilises $C = D + L$.
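The splitting iteration of Equation B.3 can be tried out on a small system. The following sketch is an illustration with a made-up 3 × 3 matrix; the function names are hypothetical, and the solver exploits that both choices of $C$ (diagonal or lower triangular) allow forward substitution.

```python
def lower_solve(C, r):
    """Solve C y = r for a lower triangular matrix C by forward substitution."""
    y = [0.0] * len(r)
    for i in range(len(r)):
        y[i] = (r[i] - sum(C[i][j] * y[j] for j in range(i))) / C[i][i]
    return y


def splitting_iteration(A, b, C, x0, tol=1e-10, max_iter=1000):
    """Iterate C x_{m+1} = b - (A - C) x_m, which is Equation B.3."""
    n, x = len(b), list(x0)
    for m in range(1, max_iter + 1):
        rhs = [b[i] - sum((A[i][j] - C[i][j]) * x[j] for j in range(n))
               for i in range(n)]
        x_new = lower_solve(C, rhs)
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new, m
        x = x_new
    return x, max_iter


A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]  # the exact solution is (1, 1, 1)
n = len(b)
# Jacobi: C = D (diagonal part of A)
C_jacobi = [[A[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]
# Gauss-Seidel: C = D + L (diagonal and strictly lower part of A)
C_gs = [[A[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]

x_j, steps_j = splitting_iteration(A, b, C_jacobi, [0.0] * n)
x_gs, steps_gs = splitting_iteration(A, b, C_gs, [0.0] * n)
print("Jacobi:      ", steps_j, x_j)
print("Gauss-Seidel:", steps_gs, x_gs)
```

For this diagonally dominant matrix both variants converge, and the Gauss-Seidel variant needs fewer steps because its update matrix has the smaller spectral radius.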
Bellman Equation as Linear Equation. As mentioned above, the idea is the same as
for MDPs with the difference that in 2P-ZS-MGs the policies of both players have to be
fixed to make the Bellman equation linear. Equations 2.48 and 2.49, with the minimum
equivalently also taken over probability distributions, become for fixed policies
$\pi_1, \pi_2$ which fulfill the max-min property:
\[ V_{k+1}(s) = \sum_{a \in A(s)} \sum_{o \in O(s)} \pi_{1,s}(a) \cdot \pi_{2,s}(o) \cdot Q_k(s, a, o) \tag{B.5} \]
where
\[ Q_k(s, a, o) = R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s') \cdot V_k(s'). \tag{B.6} \]
All in all, this is equivalent to performing a single iterative step of solving a system of
linear equations $Ax = b$ for each value iteration step. At iterate $k + 1$ the variable reads
$x_s = V_{k+1}(s)$ with the initial guess being $V_k$, the right-hand side is defined by
the modified reward
\[ b_s = \sum_{a \in A(s)} \sum_{o \in O(s)} \sum_{s' \in S} \pi_{1,s}(a) \cdot \pi_{2,s}(o) \cdot T(s, a, o, s') \cdot R(s, a, o, s') \tag{B.7} \]
for rewards possibly depending on the future states $s'$, and the matrix
$A = I - \gamma \, (T_{s,s'})$ depends on the transition matrix
\[ T_{s,s'} = \sum_{a \in A(s)} \sum_{o \in O(s)} \pi_{1,s}(a) \cdot \pi_{2,s}(o) \cdot T(s, a, o, s'). \tag{B.8} \]
With these ingredients the normal DP update is specified in terms of iterative solvers for
linear systems by $C = I$ and the Gauss-Seidel type update by $C = I + L$. Thus, the
methods are typically different from the versions for iterative linear solvers unless $D = I$,
i. e. unless the transition matrix $T_{s,s'}$ has zero probability for transitions from each
state to itself. Finally, note again that the policies $\pi_1, \pi_2$ are not known a priori but
determined during value iteration such that only an a posteriori analysis is possible.
However, the convergence speed of DP methods is independent of the complexity of
determining $\pi_1, \pi_2$ but rather dependent on the convergence properties of the matrix $A$.
Furthermore, for every iteration a different matrix and right-hand side is to be calculated
such that the convergence results (contractivity of the update matrix) can differ from
step to step.
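The correspondence between the DP updates and the splitting iteration can be written out explicitly. The derivation below is a sketch under two assumptions that are not stated in the text: the reward in (B.6) is read as the expected reward $R(s,a,o) = \sum_{s'} T(s,a,o,s') R(s,a,o,s')$, and the states are updated in the order of their enumeration.

```latex
% Substituting (B.6) into (B.5), with b_s from (B.7) and T_{s,s'} from (B.8):
V_{k+1}(s) = b_s + \gamma \sum_{s' \in S} T_{s,s'} \, V_k(s')
\quad\Longleftrightarrow\quad
V_{k+1} = b + \gamma T V_k .

% Equation (B.3) with A = I - \gamma T and C = I gives the same update:
x_{m+1} = C^{-1} b - C^{-1} (A - C) \, x_m
        = b - (I - \gamma T - I) \, x_m
        = b + \gamma T x_m .

% Gauss-Seidel type update, C = I + L, where L is the strictly lower
% triangular part of A, i.e. L_{s,s'} = -\gamma T_{s,s'} for s' < s:
V_{k+1}(s) = b_s + \gamma \sum_{s' < s} T_{s,s'} \, V_{k+1}(s')
                 + \gamma \sum_{s' \ge s} T_{s,s'} \, V_k(s') .
```

The last line differs from the Gauss-Seidel method for linear systems ($C = D + L$) exactly in the diagonal term $\gamma T_{s,s} V_k(s)$, which vanishes if no state has a transition to itself.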
Appendix C
The Software Package DRPOST
C.1 Introduction
The software package DRPOST (Discrete Robust Probabilistic Optimal Strategy Tool)
developed during the PhD thesis is also used for computing the results of Section 5.2. In
this appendix, the basic structure of the package is described. First, the files with ending
.m are Matlab files (Matlab 7.3.0.298 was used but the basic files should also work on
previous Matlab versions). When starting Matlab for the first time all subdirectories
can be added to the Matlab path while the first call of renew_model.m will clear archive
subdirectories and unused model directories from the path. The directory phd_scripts
contains executable scripts which generate material contained in this thesis. They can
be considered a good starting point to learn how the software package works and which
parameters are important.
For the sake of completeness, the other subdirectories are briefly described:
data_matlab contains Matlab save files (ending .mat) for the most important or lengthy
computations by scripts in the directory phd_scripts.
func_approx contains the structure for inserting arbitrary function approximation schemes.
The approximation procedures can also be externally provided e. g. by the Matlab neural
net toolbox.
func_approx_EXTERN is empty but the intended directory for external software packages
(not written by the author) which are specialised to function approximation.
models contains one subdirectory for each model. The name of the currently used model
can be specified by the Matlab struct field DP_RL_param.model_dir. This model will
be copied to model_current by the function renew_model.m and the Matlab path is
adapted such that only this model belongs to it.
phd_scripts contains all scripts for computations in this thesis as mentioned above.
pictures is used for storage of pictures and figures.
tools_divers is a collection of different functions not fitting to one of the other categories.
It contains a subdirectory policy with policy generating and modifying functions and a
subdirectory plot for general plot routines.
The use in Matlab is quite straightforward because every function and script is
documented and complete help on how to call a function is displayed in Matlab as usual
by help <functionname>. Furthermore, only the structs DP_RL_param, DP_RL_prev_step,
param_model, param_model_large, simulator_param, and strat_all are needed to keep
all information, and all these structs as well as value functions and strategies (function
approximation structs) have a subfield .info which gives all necessary information about
the structs.
C.2 Technical Aspects of Symmetry Reduction in 2P-ZS-MGs
Efficient Data Structures for State Spaces. In general, all available states $s \in S$ have
to be made accessible by computer software. The most obvious and easiest way in terms
of programming effort and clarity is to store a list of all characteristics for every state, i. e.
for grid soccer all elements of $\mathbb{N}^{2(n_a+n_o+1)}$ which describe a discrete soccer state.
However, if the state space $S$ is very large, as e. g. in a multi-player grid soccer with many
robots, it may be important to represent the list of states in a compact way. In the following,
hash functions are not considered to be a sensible solution because they are not injective.
A very compact way for finite 2P-ZS-MGs is to enumerate the states (state-actions), i. e.
to define a mapping $i_{no}: S \to \mathbb{N}$ ($i_{no}: SAO \to \mathbb{N}$), preferably such that the
smallest possible integer numbers, i. e. all numbers from $1$ to $|S|$ ($1$ to $|SAO|$), are
assigned. This number representation introduces the cost of computing the function $i_{no}$
very often; hence, the function should be very simple. The reason for the frequent use is
that the assignment of value functions and policies, the evaluation of the transition
function, which includes the determination of all possible following states, and the
evaluation of the reward function need the evaluation of $i_{no}$ for every state or
state-action.(1) Sometimes, the inverse $i_{no}^{-1}$ also has to be computed, e. g. for
interpreting a value function or policy stored in the number representation.
Efficient Data Structures for Multi-Player Grid Soccer. The structure of possible
states in robot soccer, which is simply the product of discrete intervals(2), makes it possible
to find a reasonable number representation of $S$ by simple for-loops and multiplications,
while the calculation of the inverse $i_{no}^{-1}$ needs divisions with remainder. Based on the
enumeration of states, $SAO$ can be enumerated by storing (a vector of) the number of
possible actions in every state.(3)
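The number representation for a product of discrete intervals can be sketched as a mixed-radix encoding. The code below is an illustration, not the DRPOST implementation: the state layout (coordinates of one agent, one opponent and the ball on a 4 × 3 grid) and all names are hypothetical.

```python
def make_enumeration(sizes):
    """Return (encode, decode, |S|) for the product of {0..sizes[k]-1}."""
    total = 1
    for size in sizes:
        total *= size

    def encode(state):
        index = 0
        for value, size in zip(state, sizes):
            if not 0 <= value < size:
                raise ValueError("component out of range")
            index = index * size + value  # multiplications and additions only
        return index + 1                  # numbers 1..|S| as in the text

    def decode(index):
        index -= 1
        state = []
        for size in reversed(sizes):
            index, value = divmod(index, size)  # division with remainder
            state.append(value)
        return tuple(reversed(state))

    return encode, decode, total


# Hypothetical layout: (x, y) of one agent, one opponent and the ball.
sizes = (4, 3, 4, 3, 4, 3)
encode, decode, n_states = make_enumeration(sizes)
state = (1, 2, 3, 0, 1, 2)
print(n_states, encode(state), decode(encode(state)))
```

The encoding is a bijection onto the numbers 1 to |S|, so value functions and policies can be stored in plain arrays indexed by the state number.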
Now, if symmetries are introduced the aim is to store all symmetric states by only one
entry to reduce the amount of stored data. The main problem is that the function $i_{no}$
typically becomes quite complicated and no simple computation comparable to the case
without symmetries is obvious. Furthermore, the standard way to store the mapping
directly in a lookup list is impractical by reason of size. Thus, a new way of storing value
functions is suggested by the author: employing sparse matrices. Sparse matrices are a
well-known structure that is typically utilised to store large matrices with many zero
elements. Sparsity is often defined by the fact that the number of non-zero elements grows
linearly with matrix size instead of quadratically, which would be natural. Sparsity itself
does not matter to the present work; it should only be mentioned that in principle in a
sparse matrix only non-zero entries are stored together with their row and column number. To
(1) For a small model it could be more practical to store the transition function and reward function once
for the number representation and then work on that data. However, if the state space is large (as always
assumed) then the transition function will typically be much larger.
(2) A discrete interval is the intersection of a real interval $I$ with $\mathbb{N}_0$.
(3) In grid soccer, actions for robots at the margin of the soccer field and the number of kicks can vary.
If the action space did not vary, $SAO$ could be handled in exactly the same way as $S$.
read a value the index lists are searched for matching entries and, if no match is found, a
zero is returned.
In the context of symmetries this means that the enumeration of the complete state(-action)
space can be initialised for storing a value function or a policy, then each state(-action)
is mapped to a unique representative of its equivalence class (see Section 3.2), and finally
only the value function or policy entries of the representatives are accessed. The advantage
of this approach is twofold: Firstly, there is no need to compute $i_{no}$ directly, it is simply
stored by the matrix entry list. Secondly, by the shape of the matrix (more columns or
more rows) a trade-off between access speed and used memory of the sparse matrix can be
chosen. The reason is that the shape of the matrix influences the length of the vectors of
row and column information, which typically are stored as a vector of vectors. If the storage
is first by row and then by column and the matrix is already a row or a column vector, then
either the maximum number of pointers to columns has to be stored (storage bad, access
fast) or only one (storage good, access slow). The best compromise seems to be a square
matrix. A generalisation to a multi-level tree structure with a variable number of leaves at
each level, which could even make $i_{no}$ superfluous, is not considered because for the
multi-player grid soccer example a sparse vector is already sufficiently effective.
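The storage scheme for representatives can be sketched in a few lines. This is a simplified illustration and not the DRPOST data structure: a Python dict stands in for the sparse vector, the symmetry set is a hypothetical single mirror reflection, and symmetries that exchange the players (and may change the sign of the value) are not handled.

```python
class SymmetricValueTable:
    """Stores one value per equivalence class; unset classes read as 0."""

    def __init__(self, encode, symmetries):
        self._encode = encode          # state -> integer (the map i_no)
        self._symmetries = symmetries  # assumed to form a group incl. identity
        self._data = {}                # dict as stand-in for a sparse vector

    def representative(self, state):
        # Unique representative: the orbit member with the smallest number.
        return min(self._encode(g(state)) for g in self._symmetries)

    def __getitem__(self, state):
        return self._data.get(self.representative(state), 0.0)

    def __setitem__(self, state, value):
        self._data[self.representative(state)] = value

    def stored_entries(self):
        return len(self._data)


W, H = 4, 3  # hypothetical grid


def encode(cell):
    return cell[0] * H + cell[1] + 1


def identity(cell):
    return cell


def mirror(cell):
    # Reflection about the long axis of the field.
    return (cell[0], H - 1 - cell[1])


V = SymmetricValueTable(encode, [identity, mirror])
V[(1, 0)] = 0.7
print(V[(1, 2)], V[(2, 1)], V.stored_entries())
```

Writing to one state makes the value available for all symmetric states while only a single entry is stored, and states that were never written read as zero, mirroring the sparse matrix behaviour described above.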
Appendix D
Detailed Tables of Numerical Results
The numerical results in this section are obtained by means of the software package
DRPOST by scripts gathered in the subdirectory phd_scripts. The material is omit-
ted in the main part because not all results are spectacular. However, the author regards
it as his duty to provide this material and hopes that it may be helpful.
D.1 Initial Value Functions $V_0$ and Discount Factors $\gamma$
The following tables are related to Section 5.2.5(1) and give detailed information about all
combinations of initial value function V0, discount factor γ, Gauss-Seidel and symmetry
reduction types, and max-min, max, or fixed policy DP methods.
For reasons of comparability and clear view the tables are placed in combinations of three
tables per page with the following ordering principle: the following page contains three
tables of max-min DP methods, the subsequent page three tables of max DP methods and
the next page three tables of fixed strategy DP methods (policy evaluation), whereas on
each page the first table contains data about a non Gauss-Seidel method without symmetry
reduction, the second table about a Gauss-Seidel method without symmetry reduction, and
the third table about a Gauss-Seidel method with symmetry reduction. The fourth case
of a non Gauss-Seidel method with symmetry reduction would yield the same results as
the method on a non symmetry reduced model but simply updating each state in an
equivalence class separately.
(1)A more detailed description of the setting and the interpretation of the basic facts can be found there.
          V0 ≡ -3       V0 ≡ -1       V0 ≡ -0.1     V0 ≡ 0        V0 ≡ 0.1      V0 ≡ 1        V0 ≡ 3
γ = 0.10   4 / 0.00083   4 / 0.00043   4 / 0.00025   4 / 0.00023   4 / 0.00024   4 / 0.00043   4 / 0.00083
γ = 0.25   7 / 0.00053   7 / 0.00029   6 / 0.00070   6 / 0.00066   6 / 0.00070   7 / 0.00029   7 / 0.00053
γ = 0.50  13 / 0.00075  12 / 0.00050  11 / 0.00038  11 / 0.00028  11 / 0.00038  12 / 0.00050  13 / 0.00075
γ = 0.75  31 / 0.00087  27 / 0.00091  19 / 0.00091  13 / 0.00015  19 / 0.00090  27 / 0.00091  31 / 0.00086
γ = 0.90  88 / 0.00096  77 / 0.00099  56 / 0.00094  30 / 0.00038  55 / 0.00100  77 / 0.00099  88 / 0.00096
Table D.1: Comparison of different initial value functions $V_0$ and discount factors $\gamma$ for a grid
soccer model without symmetry reduction and a max-min value iteration method with standard
updates (not Gauss-Seidel). Each cell of the table gives the number of needed iteration steps and the
yielded precision $\varepsilon$ (steps / $\varepsilon$) with a stopping criterion precision of
$\varepsilon = 1 \cdot 10^{-3}$ ($\varepsilon$ as in Corollary 4.3) and a matrix game solution precision of
$1 \cdot 10^{-6}$.
          V0 ≡ -3       V0 ≡ -1       V0 ≡ -0.1     V0 ≡ 0        V0 ≡ 0.1      V0 ≡ 1        V0 ≡ 3
γ = 0.10   4 / 0.00022   4 / 0.00022   4 / 0.00022   4 / 0.00023   4 / 0.00025   4 / 0.00043   4 / 0.00085
γ = 0.25   6 / 0.00054   6 / 0.00062   6 / 0.00065   6 / 0.00066   6 / 0.00071   7 / 0.00025   7 / 0.00057
γ = 0.50  11 / 0.00050  10 / 0.00063  10 / 0.00051  10 / 0.00050  10 / 0.00039  12 / 0.00055  13 / 0.00049
γ = 0.75  24 / 0.00083  22 / 0.00089  20 / 0.00072  20 / 0.00057  21 / 0.00042  24 / 0.00090  27 / 0.00098
γ = 0.90  72 / 0.00095  66 / 0.00096  49 / 0.00095  43 / 0.00095  51 / 0.00098  70 / 0.00096  77 / 0.00099
Table D.2: Comparison of different initial value functions $V_0$ and discount factors $\gamma$ for a grid
soccer model without symmetry reduction and a max-min value iteration method with Gauss-Seidel
updates (standard enumeration of states). Description of entries as in Table D.1.
          V0 ≡ -3       V0 ≡ -1       V0 ≡ -0.1     V0 ≡ 0        V0 ≡ 0.1      V0 ≡ 1        V0 ≡ 3
γ = 0.10   4 / 0.00011   4 / 0.00011   4 / 0.00020   4 / 0.00023   4 / 0.00021   4 / 0.00022   4 / 0.00067
γ = 0.25   5 / 0.00058   5 / 0.00034   5 / 0.00059   5 / 0.00066   5 / 0.00060   6 / 0.00025   6 / 0.00090
γ = 0.50   9 / 0.00055   9 / 0.00042   8 / 0.00027   8 / 0.00001   8 / 0.00016   9 / 0.00055  10 / 0.00058
γ = 0.75  14 / 0.00094  14 / 0.00089  13 / 0.00059  12 / 0.00006  15 / 0.00093  21 / 0.00093  24 / 0.00072
γ = 0.90  30 / 0.00087  29 / 0.00086  28 / 0.00087  26 / 0.00037  49 / 0.00100  66 / 0.00097  71 / 0.00099
Table D.3: Comparison of different initial value functions $V_0$ and discount factors $\gamma$ for a grid
soccer model with symmetry reduction and a max-min value iteration method with Gauss-Seidel
updates (standard enumeration of states). Description of entries as in Table D.1.
            V0 = -3    V0 = -1    V0 = -0.1  V0 = 0     V0 = 0.1   V0 = 1     V0 = 3
γ = 0.10    4          4          4          4          4          4          4
            0.00062    0.00022    0.00020    0.00022    0.00024    0.00042    0.00082
γ = 0.25    7          6          6          6          6          7          7
            0.00037    0.00052    0.00061    0.00066    0.00070    0.00029    0.00053
γ = 0.50    13         11         11         12         12         13         14
            0.00074    0.00099    0.00100    0.00055    0.00060    0.00053    0.00051
γ = 0.75    31         26         26         27         27         29         32
            0.00080    0.00099    0.00090    0.00077    0.00086    0.00083    0.00079
γ = 0.90    89         60         80         81         82         88         95
            0.00096    0.00096    0.00099    0.00099    0.00099    0.00098    0.00097
Table D.4: Comparison of different initial value functions V0 and discount factors γ for a grid
soccer model without symmetry reduction and a max value iteration method with standard updates
(not Gauss-Seidel). Each cell of the table gives the number of iteration steps needed and the
precision ε attained, for a stopping criterion precision of ε = 1 · 10^-3 (ε as in Corollary 4.3)
and a matrix game solution precision of 1 · 10^-6.
            V0 = -3    V0 = -1    V0 = -0.1  V0 = 0     V0 = 0.1   V0 = 1     V0 = 3
γ = 0.10    4          4          4          4          4          4          4
            0.00053    0.00018    0.00005    0.00006    0.00006    0.00013    0.00025
γ = 0.25    6          6          5          5          5          5          6
            0.00088    0.00030    0.00021    0.00025    0.00029    0.00064    0.00027
γ = 0.50    11         9          7          7          8          9          10
            0.00041    0.00100    0.00089    0.00079    0.00037    0.00044    0.00039
γ = 0.75    21         18         14         15         16         18         20
            0.00066    0.00070    0.00078    0.00081    0.00069    0.00085    0.00078
γ = 0.90    53         40         47         48         48         52         55
            0.00099    0.00099    0.00095    0.00093    0.00097    0.00093    0.00099
Table D.5: Comparison of different initial value functions V0 and discount factors γ for a grid
soccer model without symmetry reduction and a max value iteration method with Gauss-Seidel
updates (standard enumeration of states). Description of entries as in Table D.4.
            V0 = -3    V0 = -1    V0 = -0.1  V0 = 0     V0 = 0.1   V0 = 1     V0 = 3
γ = 0.10    4          4          4          4          4          4          4
            0.00053    0.00018    0.00005    0.00005    0.00006    0.00012    0.00037
γ = 0.25    6          6          5          5          5          5          6
            0.00086    0.00023    0.00020    0.00025    0.00029    0.00081    0.00032
γ = 0.50    10         9          7          7          7          9          10
            0.00097    0.00079    0.00083    0.00059    0.00088    0.00039    0.00035
γ = 0.75    20         18         14         15         15         18         19
            0.00090    0.00063    0.00060    0.00066    0.00092    0.00068    0.00098
γ = 0.90    52         40         45         46         46         49         53
            0.00094    0.00094    0.00095    0.00093    0.00097    0.00099    0.00096
Table D.6: Comparison of different initial value functions V0 and discount factors γ for a grid
soccer model with symmetry reduction and a max value iteration method with Gauss-Seidel updates
(standard enumeration of states). Description of entries as in Table D.4.
            V0 = -3    V0 = -1    V0 = -0.1  V0 = 0     V0 = 0.1   V0 = 1     V0 = 3
γ = 0.10    4          4          3          3          3          4          4
            0.00062    0.00022    0.00050    0.00030    0.00050    0.00022    0.00062
γ = 0.25    7          6          5          4          5          6          7
            0.00038    0.00053    0.00038    0.00085    0.00038    0.00053    0.00038
γ = 0.50    13         12         9          8          9          12         13
            0.00075    0.00052    0.00063    0.00055    0.00063    0.00052    0.00075
γ = 0.75    31         27         20         17         20         27         31
            0.00087    0.00092    0.00090    0.00071    0.00090    0.00092    0.00087
γ = 0.90    88         77         57         44         57         77         88
            0.00095    0.00099    0.00097    0.00100    0.00097    0.00099    0.00095
Table D.7: Comparison of different initial value functions V0 and discount factors γ for a grid
soccer model without symmetry reduction and a fixed value iteration method with standard updates
(not Gauss-Seidel). Each cell of the table gives the number of iteration steps needed and the
precision ε attained, for a stopping criterion precision of ε = 1 · 10^-3 (ε as in Corollary 4.3)
and a matrix game solution precision of 1 · 10^-6.
            V0 = -3    V0 = -1    V0 = -0.1  V0 = 0     V0 = 0.1   V0 = 1     V0 = 3
γ = 0.10    4          4          3          3          3          4          4
            0.00033    0.00011    0.00041    0.00031    0.00032    0.00011    0.00033
γ = 0.25    6          5          5          4          4          5          6
            0.00044    0.00085    0.00016    0.00089    0.00089    0.00092    0.00045
γ = 0.50    10         9          7          7          7          9          10
            0.00090    0.00083    0.00097    0.00062    0.00091    0.00086    0.00091
γ = 0.75    21         19         15         13         14         19         21
            0.00093    0.00085    0.00082    0.00090    0.00091    0.00080    0.00091
γ = 0.90    55         49         38         33         35         49         55
            0.00093    0.00096    0.00096    0.00091    0.00094    0.00092    0.00092
Table D.8: Comparison of different initial value functions V0 and discount factors γ for a grid
soccer model without symmetry reduction and a fixed value iteration method with Gauss-Seidel
updates (standard enumeration of states). Description of entries as in Table D.7.
            V0 = -3    V0 = -1    V0 = -0.1  V0 = 0     V0 = 0.1   V0 = 1     V0 = 3
γ = 0.10    4          4          3          3          3          4          4
            0.00038    0.00012    0.00041    0.00031    0.00032    0.00013    0.00039
γ = 0.25    6          5          5          4          4          6          6
            0.00046    0.00095    0.00016    0.00089    0.00090    0.00016    0.00047
γ = 0.50    10         9          7          7          7          9          10
            0.00079    0.00076    0.00090    0.00060    0.00089    0.00079    0.00079
γ = 0.75    21         19         15         13         14         19         21
            0.00081    0.00075    0.00074    0.00083    0.00085    0.00071    0.00079
γ = 0.90    54         49         38         33         35         48         54
            0.00099    0.00091    0.00093    0.00091    0.00092    0.00098    0.00098
Table D.9: Comparison of different initial value functions V0 and discount factors γ for a grid
soccer model with symmetry reduction and a fixed value iteration method with Gauss-Seidel updates
(standard enumeration of states). Description of entries as in Table D.7.
D.2 Additional Figures and Tables for the Comparative Studies of DP and RL Methods
Initial Value Functions V0 and Discount Factors γ
Figure D.1 shows the results for fixed strategy DP methods that were omitted from Figure 5.7.
They were left out there because they show only the expected behaviour, analogous to that of
the max methods.
[Plot: x-axis V0 (from -3 to 3); y-axis step number (from 0 to 120); curves: "no GS, no symm", "GS, no symm", "GS, symm".]
Figure D.1: Comparison of different Gauss-Seidel types with or without symmetry reduction:
number of iteration steps over initial values of the initial value function V0 for a
1v1 multi-player grid soccer model and a fixed strategy method without sorting strategy
(γ = 0.9, stopping criterion precision ε = 1 · 10^-3 (Corollary 4.3)).
Figure D.2 shows the results for fixed strategy DP methods that were omitted from Figure 5.8.
They were left out there because they show only the expected behaviour, analogous to that of
the max-min and max methods.
[Plot: x-axis gamma (from 0.1 to 0.9); y-axis step number (from 0 to 120); curves: "no GS, no symm", "GS, no symm", "GS, symm".]
Figure D.2: Comparison of different Gauss-Seidel types with or without symmetry reduction:
number of iteration steps over the discount factor γ for a 1v1 multi-player grid soccer
model and a fixed strategy method without sorting strategy (V0 = 0, stopping criterion
precision ε = 1 · 10^-3 (Corollary 4.3)).
Figure D.3 shows the results for fixed strategy DP methods that were omitted from Figure 5.9.
They were left out there because they show only the expected behaviour, analogous to that of
the max methods.
[Plot: x-axis step number (from 0 to 70); y-axis maximal error (logarithmic, from 10^-4 to 10^2); curves: "no sorting", "Bellman", "random", "random fixed".]
Figure D.3: Fixed policy method, Gauss-Seidel type, with symmetry reduction: Maximal
DP error (logarithmic scale) over the number of iteration steps for a 1v1 multi-player grid
soccer model (V0 = 0, γ = 0.9, stopping criterion precision ε = 1 · 10^-3 (Corollary 4.3))
by means of standard updates (no sorting), Bellman error estimation (Bellman), randomly
rearranged state order in each iteration (random), and keeping the randomly rearranged
state order of the first iteration (random fixed).
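The four update orders compared in the figure can be sketched as follows. This is one possible reading of the legend, not the DRPOST implementation. In particular, "Bellman" is interpreted here as sorting the states by decreasing estimated Bellman error before each sweep. The toy chain problem, the helper names `backup`, `state_order` and `run`, and the stopping rule γ/(1 − γ) · ‖V_{k+1} − V_k‖∞ ≤ ε are assumptions for illustration.

```python
import random


def backup(v, s, gamma):
    """Bellman backup on a chain with absorbing goal state 0 (toy model)."""
    if s == 0:
        return 0.0
    reward = 1.0 if s == 1 else 0.0      # reward for stepping into the goal
    return max(reward + gamma * v[s - 1], gamma * v[s])


def state_order(strategy, n, errors, rng, cache):
    """Order in which the states are updated during one Gauss-Seidel sweep."""
    states = list(range(n))
    if strategy == "none":               # standard enumeration, no sorting
        return states
    if strategy == "bellman":            # largest estimated Bellman error first
        return sorted(states, key=lambda s: -errors[s])
    if strategy == "random":             # fresh permutation in every sweep
        rng.shuffle(states)
        return states
    if strategy == "random_fixed":       # permutation of the first sweep is kept
        if "order" not in cache:
            rng.shuffle(states)
            cache["order"] = states
        return list(cache["order"])
    raise ValueError(strategy)


def run(strategy, n=6, gamma=0.9, eps=1e-3, seed=0):
    """Gauss-Seidel value iteration with the chosen update order."""
    rng, cache = random.Random(seed), {}
    v = [0.0] * n
    steps = 0
    while True:
        errors = [abs(backup(v, s, gamma) - v[s]) for s in range(n)]
        diff = 0.0
        for s in state_order(strategy, n, errors, rng, cache):
            new = backup(v, s, gamma)
            diff = max(diff, abs(new - v[s]))
            v[s] = new                   # in-place update (Gauss-Seidel)
        steps += 1
        if gamma / (1.0 - gamma) * diff <= eps:
            return v, steps
```

All four orders reach the same fixed point; they differ only in the number of sweeps and in the extra work per sweep (the Bellman variant needs an additional error estimate for every state).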
Strategy Evaluation Tables for 12 × 8 Grid Soccer
Additional tables evaluating different soccer strategies, analogous to those of Section 5.2,
are presented here for 1v1 grid soccer on a 12 × 8 field. These tables show only expected
results and, unlike some other tables, are not intended to highlight differences between a
coarse and a medium discretisation of the soccer field. They are included so that the results
do not have to be recomputed with the software package DRPOST.
                      π3 = MQL(π1)    π4 = MVI(π1)    π5 = MQL(π2)    π6 = MVI(π2)
π1 = MQL(MQL(R))      V:  0.225       V:  0.234
                      gt: 0.021       gt: 0.022
                      g1: 0.991       g1: 0.985
π2 = MQL(MVI(R))                                      V:  0.233       V:  0.236
                                                      gt: 0.022       gt: 0.022
                                                      g1: 0.980       g1: 0.984
Table D.10: Robustness of max-policies against worst-case opponents (12 × 8 soccer field)
with better initial training partners. The explanation of how to read the table is as in
Table 5.3.
                      π2 = MVI(π1)    π3 = MMVI(R)
π1 = MMVI(R)          V:  0.000       V:  0.000
                      gt: 0.023       gt: 0.023
                      g1: 0.505       g1: 0.503
Table D.11: Exploitability of optimal max-min opponents (12 × 8 soccer field). The
explanation of how to read the table is as in Table 5.3.
                      π1 = R          equal
π1 = R                V:  0.000       V:  0.000
                      gt: 0.000       gt: 0.000
                      g1: 0.577       g1: 0.516
π2 = MQL(R)           V: -0.252       V:  0.000
                      gt: 0.023       gt: 0.034
                      g1: 0.001       g1: 0.500
π3 = MVI(R)           V: -0.252       V:  0.000
                      gt: 0.023       gt: 0.021
                      g1: 0.000       g1: 0.501
π4 = MMQL(R)          V: -0.179       V: -0.000
                      gt: 0.016       gt: 0.017
                      g1: 0.001       g1: 0.505
π5 = MMVI(R)          V: -0.173       V:  0.000
                      gt: 0.016       gt: 0.023
                      g1: 0.001       g1: 0.507
π6 = MQL(π2)          V: -0.002       V:  0.000
                      gt: 0.000       gt: 0.001
                      g1: 0.110       g1: 0.468
π7 = MQL(π3)          V: -0.002       V:  0.000
                      gt: 0.000       gt: 0.000
                      g1: 0.132       g1: 0.485
Table D.12: Analysis of offensiveness and defensiveness of different policies (12 × 8 soccer
field). The explanation of how to read the table is as in Table 5.3.
List of Figures
1.1 Scheme of the main subfields of learning.
2.1 Scheme of the main components of multi-agent systems.
2.2 Scheme of a Markov decision process.
3.1 Symmetries in robot soccer.
5.1 Discretisation of a soccer field: grid soccer.
5.2 Grid soccer: the parameter max_kick_distance.
5.3 Kick-off states in grid soccer.
5.4 Grid soccer: the parameters max_kick_distance and close_distance.
5.5 Symmetries in grid soccer (discretisation of states in Figure 3.1).
5.6 Convergence speed of RL and DP techniques measured by value and security evaluation: standard Q-learning versus a Gauss-Seidel DP method.
5.7 Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over initial values of the initial value function V0 for a 1v1 multi-player grid soccer model and a (a) max-min method, (b) max method.
5.8 Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over the discount factor γ for a 1v1 multi-player grid soccer model and a (a) max-min method, (b) max method without sorting strategy.
5.9 Maximal DP error (logarithmic scale) over the number of iteration steps for a Gauss-Seidel type update with symmetry reduction, a 1v1 multi-player grid soccer model, and a (a) max-min method, (b) max method by means of standard updates (no sorting), Bellman error estimation (Bellman), randomly rearranged state order in each iteration (random), and keeping the randomly rearranged state order of the first iteration.
5.10 Maximal DP error (logarithmic scale) over the number of iteration steps; all details are as in Figure 5.9, only the player exchanging symmetry is not reduced.
5.11 Convergence rate of a max-min DP method by maximal DP error (Bellman) and by the true error over the number of iteration steps for a Gauss-Seidel type update and a 1v1 multi-player grid soccer model.
5.12 Discretisation of a soccer field: grid soccer.
D.1 Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over initial values of the initial value function V0 for a 1v1 multi-player grid soccer model and a fixed strategy method without sorting strategy.
D.2 Comparison of different Gauss-Seidel types with or without symmetry reduction: number of iteration steps over the discount factor γ for a 1v1 multi-player grid soccer model and a fixed strategy method without sorting strategy.
D.3 Fixed policy method, Gauss-Seidel type, with symmetry reduction: Maximal DP error over the number of iteration steps for a 1v1 multi-player grid soccer model for different update orders.
List of Tables
5.1 Comparison of state space sizes for different types of multi-player grid soccer with a (6 × 4) grid.
5.2 Different abbreviations for special policies.
5.3 Robustness of max-policies against worst-case opponents (6 × 4 soccer field).
5.4 Robustness of max-policies against worst-case opponents (12 × 8 soccer field).
5.5 Robustness of max-min policies against worst-case opponents (6 × 4 soccer field).
5.6 Robustness of max-min policies against worst-case opponents (12 × 8 soccer field).
5.7 Robustness of max-policies against worst-case opponents (6 × 4 soccer field) with better initial training partners.
5.8 Exploiting Non-Optimality of the Opponent (6 × 4 soccer field).
5.9 Analysis of offensiveness and defensiveness of different policies (6 × 4 soccer field).
5.10 Evaluation of policies for 1v1 grid soccer on a 6 × 4 field which are computed with a feature based value function.
5.11 Comparison of symmetry reduced 1v1 grid soccer with γ = 0.9 (different soccer field sizes) by problem size and necessary DP iterations for achieving a stopping criterion precision of ε = 1 · 10^-3 (Corollary 4.3).
5.12 Analogue of Table 5.11 for 2v2 grid soccer on a 6 × 4 soccer field.
D.1 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max-min value iteration method with standard updates (not Gauss-Seidel).
D.2 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max-min value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.3 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model with symmetry reduction and a max-min value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.4 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max value iteration method with standard updates (not Gauss-Seidel).
D.5 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a max value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.6 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model with symmetry reduction and a max value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.7 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a fixed value iteration method with standard updates (not Gauss-Seidel).
D.8 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model without symmetry reduction and a fixed value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.9 Comparison of different initial value functions V0 and discount factors γ for a grid soccer model with symmetry reduction and a fixed value iteration method with Gauss-Seidel updates (standard enumeration of states).
D.10 Robustness of max-policies against worst-case opponents (12 × 8 soccer field) with better initial training partners.
D.11 Exploitability of optimal max-min opponents (12 × 8 soccer field).
D.12 Analysis of offensiveness and defensiveness of different policies (12 × 8 soccer field).
Glossary
∅    empty set
Ā    complement of a set A
A ∩ B    intersection of two sets A and B
A ∪ B    union of two sets A and B
A \ B    difference of two sets: set A minus set B
A × B    product of two sets A and B
A^T    transpose of a matrix or vector A
2P-ZS-MG    Two Player Zero Sum Markov Game M
A(s)    set of actions in a state s for a Markov decision process or for the first agent P_1 of a two-player zero-sum Markov game
AI    Artificial Intelligence
α    possibly time- and state-dependent learning rate of a reinforcement learning algorithm
Aut(M)    group of automorphisms (also: symmetry group) of a set or especially of a Markov decision process or a two-player zero-sum Markov game
B    box (generalised rectangle) used for a special class of set oriented numerical methods
B_MDP    Bellman operator for a Markov decision process
B_MG    Bellman operator for a two-player zero-sum Markov game
B̃_MG    numerical approximation of a Bellman operator for a two-player zero-sum Markov game
BR_i(π_-i)    best response or best reply policy of agent P_i to the joint policy π_-i of all other agents
C_ext    external costs of a subset of vertices of a graph
χ_A    characteristic function on a set A with χ_A(x) being 1 if x ∈ A and 0 else
C_int    internal costs of a subset of vertices of a graph
C^n    n-times continuously differentiable functions
D    decision epoch for a Markov decision process or a two-player zero-sum Markov game
δ_ij    Kronecker symbol (1 if i equals j, else 0)
∂S    boundary of a set S
d_Γ    game matrix distance
DP    Dynamic Programming
dx/dt    total derivative of x with respect to t
E    edge set of a graph G
e    Euler's number
E{X}    expectation of random variable X
E_π{X}    expectation of a random variable X in a Markov decision process or a two-player zero-sum Markov game if the (joint) policy π is executed
e_i    i-th unit vector of ℝ^n
E_V(i + 1)    error bound for numerical approximation of value iteration in 2P-ZS-MGs
f̃    approximation to a function f
G    a graph with vertex set V and edge set E
Γ    a game
γ    discount factor for long-term rewards in a Markov decision process or a two-player zero-sum Markov game
id_X    identity map on a set X
ino    mapping of state or state-action space to a number representation
LP    Linear Programming
LSPI    Least Squares Policy Iteration
[M]    matrix game with game matrix M
M    space of (signed) measures
MAS    Multi Agent System
M_B    space of (signed) measures discretised by a collection of boxes B
MDP    Markov Decision Process M or sometimes also Markov Decision Problem
MGR    Matrix Game Reduction
µ    a measure (sometimes especially the Lebesgue one)
µ_Leb    Lebesgue measure (sometimes abbreviated by µ)
ℕ    set of natural numbers
ℕ_0    set of natural numbers including zero
n_a    number of robots in the so-called first soccer team
n_o    number of robots in the so-called second or opponent soccer team
NP    non-deterministic polynomial; measures of the hardness of an algorithmic problem are e.g. NP-hard or NP-complete
O(s)    set of actions in a state s for the second agent P_2 of a two-player zero-sum Markov game
O_ϕ(x)    orbit or trajectory of a dynamical system with respect to a flow ϕ
P    transfer operator of a dynamical system
P_B    transfer operator of a dynamical system discretised with respect to a partition (of boxes) B
P(A)    partition of a set A into disjoint subsets (measure zero of pairwise intersections)
PD(B)    set of all probability distributions on a (Borel) set B
ϕ    flow of a dynamical system
P_i    player of a matrix game (also called agent)
Π_x    projection of a tuple onto the x component
Π(V)    optimal policy of a Markov decision process or a two-player zero-sum Markov game with respect to state value function V
π    policy of a Markov decision process or a two-player zero-sum Markov game (a subindex i indicates the agent P_i)
π_s    policy of a Markov decision process or a two-player zero-sum Markov game restricted to a state s
π*    optimal policy of a Markov decision process or a two-player zero-sum Markov game (a subindex i indicates the agent P_i)
π_-i    joint policy of all agents with exception of agent P_i
PO-MDP    Partially Observable Markov Decision Process (see also MDP)
Prob{X}    probability that event X happens
Prob{X | Y}    conditional probability that event X happens under the condition Y
ℚ    set of rational numbers
Q^π    state action (or Q-) value function of policy π for a Markov decision process or a two-player zero-sum Markov game
Q* (= Q^π*)    optimal value function for a Markov decision process or a two-player zero-sum Markov game
ℝ    set of real numbers
R    reward function for a Markov decision process or a two-player zero-sum Markov game
R    reward or payoff matrix for a matrix game (two matrices R_1 and R_2 in a bimatrix game)
R    return of a Markov decision process or a two-player zero-sum Markov game (a subindex i indicates the agent P_i)
R_aver    return model of average reward for a Markov decision process or a two-player zero-sum Markov game
R_disc    return model of discounted reward for a Markov decision process or a two-player zero-sum Markov game
RL    Reinforcement Learning
S    set of states for a Markov decision process or a two-player zero-sum Markov game
SA    set of state action pairs for a Markov decision process
SAO    set of state action pairs for a two-player zero-sum Markov game
S_d    feasible set of a dual linear program
SL    Supervised Learning
S-MDP    Semi Markov Decision Process (see also MDP)
S_p    feasible set of a primal linear program
S_X    permutation or transformation group of a set X containing all automorphisms of X
T    transition probability between two sets of a dynamical system
T_inv(S)    invariance ratio of a set S of a dynamical system being the transition probability of a set into itself
T_inv(P(S))    average invariance ratio of a partition of a set S
T    transition function for a Markov decision process or a two-player zero-sum Markov game
T̃    alternative form of a transition function for a deterministic Markov decision process or a two-player zero-sum Markov game
U_ε(x)    ε-neighbourhood of x ∈ X
V    vertex set of a graph G
V^π    state value function of policy π for a Markov decision process or a two-player zero-sum Markov game
V* (= V^π*)    optimal value function for a Markov decision process or a two-player zero-sum Markov game (especially value of a matrix game)
Ṽ_k    numerical approximation (e.g. by SL techniques) of the k-th iterate of value iteration
ℤ    set of integer numbers
Bibliography
[1] Adrian K. Agogino and Kagan Tumer. Unifying temporal and structural credit
assignment problems. In AAMAS, pages 980–987. IEEE Computer Society, 2004.
[2] James S. Albus. A new approach to manipulator control: The cerebellar model
articulation controller (CMAC). ASM Journal of Dynamic Systems, Measurement,
and Control, 97:220–227, 1975.
[3] James S. Albus. Brains, Behavior, and Robotics. Byte Books, Peterborough, New
Hampshire, 1981.
[4] Michael A. Arbib, editor. The Handbook of Brain Theory and Neural Networks. MIT
Press, Cambridge, MA, 1995.
[5] Kenneth J. Arrow, David Blackwell, and M. A. Girshick. Bayes and minimax solu-
tions of sequential decision problems. Econometrica, 17:213–244, 1949.
[6] Christopher G. Atkeson and Stefan Schaal. Memory-based neural networks for robot
learning. Neurocomputing, 9(3):243–269, 1995.
[7] Leemon C. Baird. Residual algorithms: Reinforcement learning with function approx-
imation. In Proceedings of the 12th International Conference on Machine Learning
(ICML), pages 30–37, San Francisco, CA, 1995. Morgan Kaufmann.
[8] Leemon C. Baird and A. Harry Klopf. Reinforcement learning with high-dimensional
continuous actions. Technical Report WL-TR-93-1147, Wright Laboratory, Wright-
Patterson Air Force Base, OH 45433-7301, 1993.
[9] Martino Bardi, Maurizio Falcone, and Pierpaolo Soravia. Fully discrete schemes for
the value function of pursuit-evasion games. In Advances in dynamic games and
applications, volume 1 of Annals of the International Society of Dynamic Games,
pages 89–105. Birkhäuser, Boston, MA, 1994.
[10] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using
real-time dynamic programming. Artificial Intelligence, 72(1):81–138, 1995.
[11] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforce-
ment learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
[12] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuron-like adaptive
elements that can solve difficult learning control problems. IEEE Transactions on
Systems, Man, and Cybernetics, 13(5):834–846, 1983.
[13] Tamer Başar and Geert J. Olsder. Dynamic Noncooperative Game Theory. Academic
Press Ltd., London, 2nd edition, 1995.
[14] Richard E. Bellman. Dynamic Programming. Princeton University Press, Princeton,
NJ, 1957.
126
[15] Richard E. Bellman and Joseph P. LaSalle. On non-zero sum games and stochastic
processes. RM-212, Rand Corp., Santa Monica, 1949.
[16] Hamid R. Berenji. Artificial neural networks and approximate reasoning for intelli-
gent control in space. In American Control Conference, pages 1075–1080, 1991.
[17] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Ex-
periments. Chapman and Hall, London, UK, 1985.
[18] Dimitri P. Bertsekas and David A. Castañon. Adaptive aggregation methods for
infinite horizon dynamic programming. IEEE Transactions on Automatic Control,
34(6):589–598, 1989.
[19] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena
Scientific, Belmont, MA, 1996.
[20] Emile Borel. The theory of play and integral equations with skew symmetric kernels.
Econometrica. Journal of the Econometric Society, 21:97–100, 1953.
[21] Michael Bowling. Multiagent learning in the presence of agents with limitations.
PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh,
PA 15213, 2003. Also published as Technical Report CMU-CS-03-118.
[22] Michael Bowling and Manuela M. Veloso. Existence of multiagent equilibria with
limited agents. Technical Report CMU-CS-02-104, Carnegie Mellon University, Pitts-
burgh, PA, 2002.
[23] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning:
Safely approximating the value function. In G. Tesauro, D. S. Touretzky, and T. K.
Leen, editors, Advances in Neural Information Processing Systems (NIPS) 7, pages
369–376, Cambridge, MA, 1995. The MIT Press.
[24] Glen E. Bredon. Introduction to Compact Transformation Groups. Academic Press,
New York and London, 1972.
[25] Michael Brin and Garrett Stuck. Introduction to Dynamical Systems. Cambridge
University Press, 2002.
[26] David N. Burghes and Alexander Graham. Introduction to Control Theory, Including
Optimal Control. Ellis Horwood Ltd., Chichester, 1980. Ellis Horwood Series in
Mathematics and its Applications.
[27] A. Martin V. Butz, David E. Goldberg, and C. Wolfgang Stolzmann. The anticipa-
tory classifier system and genetic generalization. Natural Computing, 1(4):427–467,
2002.
[28] Mary L. Cartwright and John E. Littlewood. On non-linear differential equations
of the second order. Journal of the London Mathematical Society. Second Series,
20:180–189, 1945.
[29] Anthony R. Cassandra, Leslie P. Kaelbling, and Michael L. Littman. Acting opti-
mally in partially observable stochastic domains. In Proceedings of the 12th National
Conference on Artificial Intelligence, volume 2, pages 1023–1028, Seattle, WA, 1994.
AAAI Press.
[30] Arthur Cayley. Adding temporary memory to ZCS. The Educational Times, 23(18),
1875.
[31] David Chapman and Leslie P. Kaelbling. Input generalization in delayed reinforce-
ment learning: An algorithm and performance comparisons. In J. Myopoulos and
127
R. Reiter, editors, Proceedings of the 12th International Joint Conference on Arti-
ficial Intelligence (IJCAI), Sydney, Australia, pages 726–731, San Francisco, CA,
1991. Morgan Kaufmann.
[32] Bruno Codenotti and Daniel Stefankovic. On the computational complexity of Nash
equilibria for (0,1) bimatrix games. Information Processing Letters (IPL), 94(3):145–
150, 2005.
[33] Anne Condon. The complexity of stochastic games. Information and Computation,
96(2):203–224, 1992.
[34] Vincent Conitzer and Tuomas Sandholm. Complexity results about Nash equilibria.
In Georg Gottlob and Toby Walsh, editors, Proceedings of the 18th International
Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, pages 765–
771. Morgan Kaufmann, 2003.
[35] Rémi Coulom. Reinforcement Learning Using Neural Networks, with Applications to
Motor Control. PhD thesis, Institut National Polytechnique de Grenoble, 2002.
[36] Michael G. Crandall. Viscosity solutions: A primer. In Viscosity solutions and
applications, volume 1660 of Lecture Notes in Mathematics, pages 1–43. Springer,
Berlin, 1997.
[37] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bul-
letin of the American Mathematical Society (BAMS), 39, 2002.
[38] Shu-Lin Cui, Ji-Gui Sun, Ming-Hao Yin, and Shuai Lu. Solving uncertain Markov
decision problems: An interval-based method. In L. Jiao, L. Wang, X. Gao, J. Liu,
and F. Wu, editors, Proceedings of the 2nd International Conference on Advances in
Natural Computation (ICNC), Xi’an, China (Part II), volume 4222 of Lecture Notes
in Computer Science, pages 948–957. Springer, 2006.
[39] George B. Dantzig and Mukund N. Thapa. Linear Programming 2. Springer Series
in Operations Research. Springer-Verlag, New York, 2003.
[40] Konstantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The
complexity of computing a Nash equilibrium. Electronic Colloquium on Computa-
tional Complexity (ECCC), (115), 2005.
[41] Ruchira S. Datta. Universality of Nash equilibria. Mathematics of Operations Re-
search (MOR), 28(3):424–432, 2003.
[42] Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In S. J. Hanson,
J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing
Systems (NIPS) 5, pages 271–278, San Mateo, CA, 1993. Morgan Kaufmann.
[43] Thomas Dean, Leslie P. Kaelbling, Jak Kirman, and Ann Nicholson. Planning with
deadlines in stochastic domains. In Proceedings of the 11th National Conference on
Artificial Intelligence (AAAI), pages 574–579, Washington, DC, 1993. AAAI Press.
[44] Richard Dearden. Structured prioritized sweeping. In Proceedings of the 18th Inter-
national Conference on Machine Learning (ICML), pages 82–89, San Francisco, CA,
2001. Morgan Kaufmann.
[45] Michael Dellnitz, Gary Froyland, and Oliver Junge. The algorithms behind Gaio
set oriented numerical methods for dynamical systems. In Ergodic Theory, Analysis,
and Efficient Simulation of Dynamical Systems, pages 145–174. Springer, 2001.
[46] Michael Dellnitz, Mirko Hessel-von Molo, Philipp Metzner, Robert Preis, and
128
Christof Schütte. Graph algorithms for dynamical systems. In A. Mielke, editor,
Analysis, modeling and simulation of multiscale problems, pages 619–645. Springer,
Berlin, 2006.
[47] Michael Dellnitz and Andreas Hohmann. A subdivision algorithm for the computa-
tion of unstable manifolds and global attractors. Numerische Mathematik, 75(3):293–
317, 1997.
[48] Michael Dellnitz, Andreas Hohmann, Oliver Junge, and Martin Rumpf. Exploring
invariant sets and invariant measures. Chaos. An Interdisciplinary Journal of Non-
linear Science, 7(2):221–228, 1997.
[49] Michael Dellnitz and Oliver Junge. On the approximation of complicated dynamical
behavior. SIAM Journal on Numerical Analysis, 36(2):491–515 (electronic), 1999.
[50] Michael Dellnitz and Oliver Junge. Set oriented numerical methods for dynamical
systems. In Handbook of dynamical systems, Vol. 2, pages 221–264. North-Holland,
Amsterdam, 2002.
[51] Kan Deng and Andrew W. Moore. Multiresolution instance-based learning. In
Chris S. Mellish, editor, Proceedings of the 14th International Joint Conference on
Artificial Intelligence (IJCAI), pages 1233–1242, San Mateo, 1995. Morgan Kauf-
mann.
[52] Ralf Diekmann, Burkhard Monien, and Robert Preis. Using helpful sets to improve
graph bisections. In D. Hsu, A. Rosenberg, and D. Sotteau, editors, Interconnec-
tion Networks and Mapping and Scheduling Parallel Computations, volume 21 of
DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages
57–73. American Mathematical Society, 1995.
[53] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value
function decomposition. Journal of Artificial Intelligence Research (JAIR), 13:227–
303, 2000.
[54] Marco Dorigo and Hugues Bersini. A comparison of Q-learning and classifier systems.
In D. Cliff, P. Husbands, J.-A. Meyer, and S. W. Wilson, editors, From Animals to
Animats 3. Proceedings of the 3rd International Conference on Simulation of Adaptive
Behavior (SAB), pages 248–255, Cambridge, MA, 1994. MIT Press.
[55] Marco Dorigo and Marco Colombetti. Robot shaping: Developing autonomous agents
through learning. Artificial Intelligence, 71(2):321–370, 1994.
[56] Kenji Doya. Reinforcement learning in continuous time and space. Neural Compu-
tation, 12(1):219–245, 2000.
[57] Jan Drugowitsch and Alwyn M. Barry. A formal framework and extensions for
function approximation in learning classifier systems. Technical Report CSBU-2006-
02, University of Bath, 2006.
[58] Chris Drummond. Accelerating reinforcement learning by composing solutions of
automatically identified subtasks. Journal of Artificial Intelligence Research (JAIR),
16:59–104, 2002.
[59] Artur Dubrawski and Jeff G. Schneider. Memory based stochastic optimization for
validation and tuning of function approximators. In Proceedings of the 6th Interna-
tional Workshop on AI and Statistics, Florida, USA, 1997.
[60] David S. Dummit and Richard M. Foote. Abstract algebra. Prentice Hall Inc., En-
glewood Cliffs, NJ, 1991.
[61] Aryeh Dvoretzky, J. Kiefer, and Jacob Wolfowitz. The inventory problem. I. Case of
known distributions of demand. Econometrica, 20:187–222, 1952.
[62] Scott E. Fahlman. An empirical study of learning speed in back-propagation net-
works. Technical Report CMU-CS-88-162, Carnegie-Mellon University, Pittsburgh,
PA, 1988.
[63] Maurizio Falcone. Numerical methods for differential games based on partial dif-
ferential equations. Unpublished. Based on lectures given at the summer school on
Differential Games and Applications, 2005.
[64] Jacques Ferber. Multi-Agent Systems: An Introduction to Distributed Artificial
Intelligence. Addison Wesley, 1999.
[65] Jerzy A. Filar, T. A. Schultz, Frank Thuijsman, and Koos (O. J.) Vrieze. Nonlinear
programming and stationary equilibria in stochastic games. Mathematical Program-
ming, 50(2):227–237, 1991.
[66] Jerzy A. Filar and Koos (O. J.) Vrieze. Competitive Markov Decision Processes.
Springer-Verlag, New York, 1997.
[67] David Foster and Peter Dayan. Structure in the space of value functions. Machine
Learning, 49(2–3):325–346, 2002.
[68] Bas van Fraassen. Laws and Symmetry. Oxford University Press, Oxford, 1989.
[69] Thomas Gabel, Roland Hafner, Sascha Lange, Martin Lauer, and Martin Riedmiller.
Bridging the gap: Learning in the RoboCup simulation and midsize league. In Pro-
ceedings of the 7th Portuguese Conference on Automatic Control (Controlo), Lisbon,
Portugal, 2006.
[70] Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model
minimization in Markov decision processes. Artificial Intelligence, 147(1–2):163–223,
2003.
[71] David E. Goldberg. Genetic Algorithms in Search, Optimization & Machine Learn-
ing. Addison-Wesley, Reading, MA, 1989.
[72] Geoffrey J. Gordon. Stable function approximation in dynamic programming. In
A. Prieditis and S. Russell, editors, Proceedings of the 12th International Conference
on Machine Learning (ICML), pages 261–268, San Francisco, CA, 1995. Morgan
Kaufmann.
[73] John Guckenheimer and Philip Holmes. Nonlinear Oscillations, Dynamical Systems,
and Bifurcations of Vector Fields. Springer-Verlag, New York, 1990.
[74] Carlos Guestrin, Milos Hauskrecht, and Branislav Kveton. Solving factored MDPs
with continuous and discrete variables. In D. M. Chickering and J. Y. Halpern,
editors, Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence
(UAI), Banff, Canada, pages 235–242, 2004.
[75] Carlos Guestrin, Daphne Koller, and Ronald Parr. Max-norm projections for factored
MDPs. In B. Nebel, editor, Proceedings of the 17th International Joint Conference on
Artificial Intelligence (IJCAI-01), Seattle, pages 673–682, San Francisco, CA, 2001.
Morgan Kaufmann.
[76] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored
MDPs. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in
Neural Information Processing Systems (NIPS) 14, Vancouver, Canada, pages 1523–
1530, Cambridge, MA, 2001. MIT Press.
[77] Carlos Guestrin, Daphne Koller, and Ronald Parr. Solving factored POMDPs with
linear value functions. In Workshop on Planning under Uncertainty and Incomplete
Information (IJCAI), Seattle, Washington, pages 67–75, 2001.
[78] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient
solution algorithms for factored MDPs. Journal of Artificial Intelligence Research
(JAIR), 19:399–468, 2003.
[79] Carlos Guestrin, Michail G. Lagoudakis, and Ronald Parr. Coordinated reinforce-
ment learning. In C. Sammut and A. G. Hoffmann, editors, Proceedings of the 19th
International Conference on Machine Learning (ICML), Sydney, Australia, pages
227–234, San Francisco, CA, 2002. Morgan Kaufmann.
[80] Vijaykumar Gullapalli. Reinforcement Learning and its Application to Control. PhD
thesis, University of Massachusetts at Amherst, 1992.
[81] Milos Hauskrecht, Nicolas Meuleau, Leslie P. Kaelbling, Thomas Dean, and Craig
Boutilier. Hierarchical solution of Markov decision processes using macro-actions. In
G. F. Cooper and S. Moral, editors, Proceedings of the 14th Conference on Uncer-
tainty in Artificial Intelligence (UAI), pages 220–229, San Francisco, 1998. Morgan
Kaufmann.
[82] Simon S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
New Jersey, USA, 1999.
[83] Chao He, Li-Xin Xu, and Yu-He Zhang. Learning convergence of CMAC algorithm.
Neural Processing Letters, 14(1):61–74, 2001.
[84] Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press,
Cambridge, MA, 1960.
[85] Fern Y. Hunt. A Monte Carlo approach to the approximation of invariant measures.
Random & Computational Dynamics, 2(1):111–133, 1994.
[86] Rufus P. Isaacs. Differential Games: A Mathematical Theory with Applications to
Warfare and Pursuit, Control and Optimization. John Wiley, Toronto, 1965.
[87] Tommi Jaakkola, Satinder P. Singh, and Michael I. Jordan. Reinforcement learn-
ing algorithm for partially observable Markov decision problems. In G. Tesauro,
D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Sys-
tems (NIPS 7), pages 345–352, Cambridge, MA, 1995. The MIT Press.
[88] Leslie P. Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement
learning: A survey. Journal of Artificial Intelligence Research (JAIR), 4:237–285,
1996.
[89] Spiros Kapetanakis and Daniel Kudenko. Reinforcement learning of coordination in
cooperative multi-agent systems. In AAAI/IAAI 2002, pages 326–331, 2002.
[90] Spiros Kapetanakis, Daniel Kudenko, and Malcolm J. A. Strens. Learning of coordi-
nation in cooperative multi-agent systems using commitment sequences. In Artificial
Intelligence and the Simulation of Behavior 1(5), 2004.
[91] George Karypis and Vipin Kumar. METIS Manual, Version 4.0. University of
Minnesota, 1998.
[92] Brian W. Kernighan and Shen Lin. An efficient heuristic procedure for partitioning
graphs. Bell System Technical Journal, 49(2):291–307, 1970.
[93] Teuvo Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 1995.
[94] Daphne Koller and Ronald Parr. Computing factored value functions for policies in
structured MDPs. In T. Dean, editor, Proceedings of the 16th International Joint
Conference on Artificial Intelligence (IJCAI), pages 1332–1339, San Francisco, CA,
1999. Morgan Kaufmann.
[95] Panganamala R. Kumar and Pravin P. Varaiya. Stochastic Systems: Estimation,
Identification, and Adaptive Control. Prentice Hall, Englewood Cliffs, NJ, 1986.
[96] Rainer Lachner, Michael H. Breitner, and Hans J. Pesch. Real-time collision avoid-
ance against wrong drivers: Differential game approach, numerical solution and syn-
thesis of strategies with neural networks. In Proceedings of the 7th International
Symposium on Dynamic Games and Applications, Kanagawa, Japan, 1996.
[97] Michail G. Lagoudakis and Ronald Parr. Value function approximation in zero-sum
Markov games. In A. Darwiche and N. Friedman, editors, Proceedings of the 18th
Conference in Uncertainty in Artificial Intelligence (UAI), University of Alberta, Ed-
monton, Alberta, Canada, pages 283–292, San Francisco, CA, 2002. Morgan Kauf-
mann.
[98] Michail G. Lagoudakis and Ronald Parr. Learning in zero-sum team Markov games
using factored value functions. In S. Becker, S. B. Thrun, and K. Obermayer, editors,
Advances in Neural Information Processing Systems (NIPS) 15, pages 1627–1634.
MIT Press, Cambridge, MA, 2003.
[99] Michail G. Lagoudakis, Ronald Parr, and Michael L. Littman. Least-squares methods
in reinforcement learning for control. In I. P. Vlahavas and C. D. Spyropoulos,
editors, Proceedings of the 2nd Hellenic Conference on AI (SETN). Thessaloniki,
Greece, volume 2308 of Lecture Notes in Computer Science, pages 249–260. Springer,
2002.
[100] Pier L. Lanzi. Learning classifier systems from a reinforcement learning perspective.
Soft Computing, 6(3–4):162–170, 2002.
[101] Tim Laue and Thomas Röfer. Integrating simple unreliable perceptions for accurate
robot modeling in the four-legged league. In G. Lakemeyer, E. Sklar, D. G. Sorrenti,
and T. Takahashi, editors, RoboCup, volume 4434 of Lecture Notes in Computer
Science, pages 474–482. Springer, 2006.
[102] Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement
learning in cooperative multi-agent systems. In Proc. of the 17th International Con-
ference on Machine Learning (ICML), pages 535–542, San Francisco, CA, 2000. Mor-
gan Kaufmann.
[103] Martin Lauer and Martin Riedmiller. Reinforcement learning for stochastic cooper-
ative multi-agent systems. In AAMAS, pages 1516–1517. IEEE Computer Society,
2004.
[104] Steven M. LaValle. Robot motion planning: A game-theoretic foundation. Algorith-
mica, 26(3-4):430–465, 2000.
[105] Chuen-Chien Lee. A self learning rule-based controller employing approximate rea-
soning and neural net concepts. International Journal of Intelligent Systems, 6(1):71–
93, 1991.
[106] Carlton E. Lemke and Joseph T. Howson, Jr. Equilibrium points of bimatrix games.
Journal of the Society for Industrial and Applied Mathematics (SIAM), 12(2):413–
423, 1964.
[107] Joseph Lewin. Differential Games. Springer-Verlag London Ltd., London, 1994.
[108] Long-Ji Lin. Programming robots using reinforcement learning and teaching. In
T. L. Dean and K. McKeown, editors, Proceedings of the 9th National Conference on
Artificial Intelligence, pages 781–786. MIT Press, 1991.
[109] Long-Ji Lin. Hierarchical learning of robot skills by reinforcement. In Proceedings of
the International Conference on Neural Networks (ICNN), volume 1, pages 181–186,
San Francisco, CA, 1993. IEEE/INNS.
[110] Long-Ji Lin and Tom M. Mitchell. Memory approaches to reinforcement learning
in non-Markovian domains. Technical Report CMU-CS-92-138, Carnegie Mellon
University, Pittsburgh, PA, 1992.
[111] Ya-Ping Lin and Xue-Yong Li. Reinforcement learning based on local state feature
learning and policy adjustment. Information Sciences ISCI, 154(1–2):59–70, 2003.
[112] Michael L. Littman. Markov games as a framework for multi-agent reinforcement
learning. In Proceedings of the 11th International Conference on Machine Learning
(ICML), pages 157–163, San Francisco, CA, 1994. Morgan Kaufmann.
[113] Michael L. Littman. Memoryless policies: Theoretical limitations and practical re-
sults. In D. Cliff, P. Husbands, J.-A. Meyer, and S. W. Wilson, editors, From An-
imals to Animats 3: Proceedings of the 3rd International Conference on Simulation
of Adaptive Behavior (SAB), Cambridge, MA, 1994. MIT Press.
[114] Michael L. Littman, Anthony R. Cassandra, and Leslie P. Kaelbling. Learning poli-
cies for partially observable environments: Scaling up. In A. Prieditis and S. Rus-
sell, editors, Proceedings of the 12th International Conference on Machine Learning
(ICML), pages 362–370, San Francisco, CA, 1995. Morgan Kaufmann Publishers.
[115] Michael L. Littman, Thomas L. Dean, and Leslie P. Kaelbling. On the complexity of
solving Markov decision problems. In P. Besnard and S. Hanks, editors, Proceedings
of the 11th Conference on Uncertainty in Artificial Intelligence (UAI), pages 394–402,
San Francisco, CA, 1995. Morgan Kaufmann.
[116] Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning
model: Convergence and applications. In L. Saitta, editor, Proceedings of the 13th
International Conference on Machine Learning (ICML), Bari, Italy, pages 310–318.
Morgan Kaufmann, 1996.
[117] Edward N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences,
20:130–141, 1963.
[118] Ulf Lorenz and Burkhard Monien. Error analysis in minimax trees. Theoretical
Computer Science (TCS), 313(3):485–498, 2004. Algorithmic combinatorial game
theory.
[119] William S. Lovejoy. A survey of algorithmic methods for partially observable Markov
decision processes. Annals of Operations Research, 28(1):47–65, 1991.
[120] R. Duncan Luce and Howard Raiffa. Games and Decisions: Introduction and Critical
Survey. John Wiley, New York, 1957. A study of the Behavioral Models Project,
Bureau of Applied Social Research, Columbia University.
[121] David J. C. MacKay. Bayesian model comparison and backprop nets. In J. E. Moody,
S. J. Hanson, and R. Lippmann, editors, Advances in Neural Information Processing
Systems (NIPS) 4, pages 839–846. Morgan Kaufmann, 1992.
[122] David J. C. MacKay. Bayesian non-linear modelling for the prediction competition.
In ASHRAE Transactions, volume 100, pages 1053–1062, Atlanta, Georgia, 1994.
ASHRAE.
[123] Omid Madani. On policy iteration as a Newton’s method and polynomial policy
iteration algorithms. In Proceedings of the 18th National Conference on Artificial
Intelligence and 14th Conference on Innovative Applications of Artificial In-
telligence (AAAI/IAAI), pages 273–278, Menlo Park, CA, 2002. AAAI Press.
[124] Pattie Maes and Rodney A. Brooks. Learning to coordinate behaviors. In T. G.
Dietterich and W. Swartout, editors, Proceedings of the 8th National Conference on
Artificial Intelligence (AAAI), pages 796–802, Boston, MA, 1990. MIT Press.
[125] Sridhar Mahadevan. To discount or not to discount in reinforcement learning: A case
study comparing R learning and Q learning. In Proceedings of the 11th International
Conference on Machine Learning (ICML), pages 164–172, San Francisco, CA, 1994.
Morgan Kaufmann.
[126] Sridhar Mahadevan. Proto-value functions: Developmental reinforcement learning.
In Luc De Raedt and Stefan Wrobel, editors, Proceedings of the 22nd International
Conference on Machine Learning (ICML), Bonn, Germany, pages 553–560. ACM,
2005.
[127] Sridhar Mahadevan and Jonathan Connell. Scaling reinforcement learning to robotics
by exploiting the subsumption architecture. In Proceedings of the 8th International
Workshop on Machine Learning (ICML), pages 328–332, 1991.
[128] Sridhar Mahadevan, Mauro Maggioni, Kimberly Ferguson, and Sarah Osentoski.
Learning representation and control in continuous Markov decision processes. In
AAAI. Boston, 2006.
[129] Olvi L. Mangasarian and H. Stone. Two-person nonzero-sum games and quadratic
programming. Journal of Mathematical Analysis and Applications, 9:348–355, 1964.
[130] Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in
reinforcement learning via clustering. In C. E. Brodley, editor, Proceedings of the
21st International Conference on Machine Learning (ICML), Banff, Canada. ACM,
2004.
[131] Oded Maron and Andrew W. Moore. Hoeffding races: Accelerating model selection
search for classification and function approximation. In Jack D. Cowan, Gerald
Tesauro, and Joshua Alspector, editors, Advances in Neural Information Processing
Systems (NIPS) 6, pages 59–66, San Francisco, CA, 1994. Morgan Kaufmann.
[132] Oded Maron and Andrew W. Moore. The racing algorithm: Model selection for lazy
learners. Artificial Intelligence Review, 11(1-5):193–225, 1997.
[133] Maja J. Mataric. Reward functions for accelerated learning. In Proceedings of the
11th International Conference on Machine Learning (ICML), pages 181–189, 1994.
[134] R. Andrew K. McCallum. Instance-based utile distinctions for reinforcement learning
with hidden state. In Proceedings of the 12th International Conference on Machine
Learning (ICML), pages 387–395, San Francisco, CA, 1995. Morgan Kaufmann.
[135] Richard D. McKelvey and Andrew McLennan. Computation of equilibria in finite
games. In Handbook of Computational Economics, Vol. I, volume 13 of Handbooks
in Economics, pages 87–142. North-Holland, Amsterdam, 1996.
[136] John C. C. McKinsey. Introduction to the Theory of Games. McGraw-Hill Book
Company, Inc., New York-Toronto-London, 1952.
[137] Lisa Meeden, Gary McGraw, and Douglas Blank. Emergent control and planning
in an autonomous vehicle. In D. S. Touretsky, editor, Proceedings of the 15th An-
nual Meeting of the Cognitive Science Society, pages 735–740. Lawrence Erlbaum,
Hillsdale, NJ, 1993.
[138] José del R. Millán. Rapid, safe, and incremental learning of navigation strategies.
In IEEE Transactions on Systems, Man and Cybernetics (Part B), volume 26, pages
408–420, 1996.
[139] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[140] George E. Monahan. A survey of partially observable Markov decision processes:
Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.
[141] Burkhard Monien, Robert Preis, and Ralf Diekmann. Quality matching and local
improvement for multilevel graph-partitioning. Parallel Computing, 26(12):1609–
1634, 2000.
[142] Andrew W. Moore. Variable resolution dynamic programming: Efficiently learn-
ing action maps in multivariate real-valued state-spaces. In L. Birnbaum and
G. Collins, editors, Proceedings of the 8th International Conference on Machine
Learning (ICML), pages 333–337, San Francisco, CA, 1991. Morgan Kaufmann.
[143] Andrew W. Moore. The parti-game algorithm for variable resolution reinforcement
learning in multidimensional state-spaces. In J. D. Cowan, G. Tesauro, and J. Al-
spector, editors, Advances in Neural Information Processing Systems (NIPS) 6, pages
711–718, San Mateo, CA, 1994. Morgan Kaufmann.
[144] Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement
learning with less data and less time. Machine Learning, 13:103–130, 1993.
[145] Andrew W. Moore, Christopher G. Atkeson, and Stefan Schaal. Memory-based learn-
ing for control. Technical Report CMU-RI-TR-95-18, Carnegie Mellon University,
Pittsburgh, PA, 1995.
[146] Andrew W. Moore, Daniel J. Hill, and Michael P. Johnson. An empirical investi-
gation of brute force to choose features, smoothers and function approximators. In
S. Hanson, S. Judd, and T. Petsche, editors, Computational Learning Theory and
Natural Learning, volume III: Selecting Good Models, pages 361–379. MIT Press,
1995.
[147] Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation,
17(2):335–359, 2005.
[148] Rémi Munos. A study of reinforcement learning in the continuous case by the means
of viscosity solutions. Machine Learning, 40(3):265–299, 2000.
[149] Rémi Munos. Error bounds for approximate policy iteration. In T. Fawcett and
N. Mishra, editors, Proceedings of the 20th International Conference on Machine
Learning (ICML), Washington, DC, USA, pages 560–567. AAAI Press, 2003.
[150] Rémi Munos. Policy gradient in continuous time. Journal of Machine Learning
Research (JMLR), 7:771–791, 2006.
[151] Rémi Munos. Performance bounds in Lp-norm for approximate value iteration. SIAM
Journal on Control and Optimization, 2007.
[152] Rémi Munos, Leemon C. Baird, and Andrew W. Moore. Gradient descent approaches
to neural-net-based solutions of the Hamilton-Jacobi-Bellman equation. In Interna-
tional Joint Conference on Neural Networks (IJCNN), volume 3, pages 2152–2157,
1999.
[153] Ali H. Nayfeh and Balakumar Balachandran. Applied Nonlinear Dynamics. Wiley
Series in Nonlinear Science. John Wiley, New York, 1995.
[154] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen,
100:295–320, 1928.
[155] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Be-
havior. Princeton University Press, Princeton, NJ, 2nd edition, 1947.
[156] Ralph Neuneier and Hans G. Zimmermann. How to train neural networks. In G. B.
Orr and K. R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524
of Lecture Notes in Computer Science, pages 373–423. Springer, 1996.
[157] Partha Niyogi and Federico Girosi. Generalization bounds for function approximation
from scattered noisy data. Advances in Computational Mathematics, 10:51–80, 1999.
[158] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. MIT Press,
Cambridge, MA, 1994.
[159] Guillermo Owen. Game Theory. Academic Press Inc., San Diego, CA, 3rd edition,
1995.
[160] Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art.
Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005.
[161] Ronald E. Parr. Hierarchical Control and Learning for Markov Decision Processes.
PhD thesis, University of California, Berkeley, 1998.
[162] Relu Patrascu, Pascal Poupart, Dale Schuurmans, Craig Boutilier, and Carlos
Guestrin. Greedy linear value-approximation for factored Markov decision processes.
In Proceedings of the 18th National Conference on Artificial Intelligence and 14th
Conference on Innovative Applications of Artificial Intelligence (AAAI/IAAI), pages
285–291, Menlo Park, CA, 2002. AAAI Press.
[163] Jing Peng and Ronald J. Williams. Efficient learning and planning within the Dyna
framework. Adaptive Behavior, 1(4):437–454, 1993.
[164] Jing Peng and Ronald J. Williams. Incremental multi-step Q-learning. In Proceedings
of the 11th International Conference on Machine Learning (ICML), pages 226–232,
San Francisco, CA, 1994. Morgan Kaufmann.
[165] Hans J. Pesch, I. Gabler, Stefan Miesbach, and Michael H. Breitner. Synthesis of
optimal strategies for differential games by neural networks. In G. J. Olsder, editor,
New Trends in Dynamic Games and Applications, Annals of the International Society
of Dynamic Games 3, pages 111–142, Boston, 1996. Birkhäuser.
[166] Robert Preis. The PARTY Graphpartitioning-Library, User Manual Version 1.99.
University of Paderborn, 1998.
[167] Bob Price and Craig Boutilier. Accelerating reinforcement learning through implicit
imitation. Journal of Artificial Intelligence Research (JAIR), 19:569–629, 2003.
[168] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Pro-
gramming. John Wiley, New York, 1994.
[169] Jette Randløv. Solving Complex Problems with Reinforcement Learning. PhD thesis,
University of Copenhagen, 2001.
[170] Balaraman Ravindran and Andrew G. Barto. Symmetries and model minimization
in Markov decision processes. Technical Report CMPSCI 01-43, University of Mas-
sachusetts, Amherst, MA, 2001.
[171] Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical
reinforcement learning. In S. Koenig and R. C. Holte, editors, 5th International
Symposium on Abstraction, Reformulation and Approximation (SARA), Kananaskis,
Canada, volume 2371 of Lecture Notes in Computer Science, pages 196–211. Springer,
2002.
[172] Mark B. Ring. Continual Learning in Reinforcement Environments. PhD thesis,
University of Texas at Austin, 1994.
[173] Nicholas Roy. Finding Approximate POMDP Solutions through Belief Compression.
PhD thesis, Carnegie Mellon University, 2003. Also published as Technical Report
CMU-RI-TR-03-25.
[174] Nicholas Roy, Geoffrey J. Gordon, and Sebastian B. Thrun. Finding approximate
POMDP solutions through belief compression. Journal of Artificial Intelligence Re-
search (JAIR), 23:1–40, 2005.
[175] Ulrich Rüde. Mathematical and Computational Techniques for Multilevel Adaptive
Methods. SIAM, Philadelphia, PA, 1993.
[176] David E. Rumelhart and James L. McClelland. Parallel Distributed Processing: Ex-
plorations in the Microstructure of Cognition., volume 1–2. MIT Press, Cambridge,
MA, 1986.
[177] Gavin A. Rummery. Problem Solving with Reinforcement Learning. PhD thesis,
University of Cambridge, 1995.
[178] Rafal P. Salustowicz, Marco A. Wiering, and Jürgen Schmidhuber. Learning team
strategies: Soccer case studies. Machine Learning, 33:263–282, 1998.
[179] Arthur L. Samuel. Some studies in machine learning using the game of checkers.
IBM Journal of Research and Development, 3:211–229, 1959. Reprinted in E. A.
Feigenbaum and J. Feldman, editors, Computers and Thought, McGraw-Hill, NY
1963.
[180] Arthur L. Samuel. Some studies in machine learning using the game of checkers. II:
Recent progress. IBM Journal of Research and Development, 11(6):601–617, 1967.
[181] Uday Savagaonkar, Edwin K. P. Chong, and Robert L. Givan. Sampling techniques
for zero-sum, discounted Markov games. In Leslie P. Kaelbling, editor, Allerton
Conference on Control and Communications, 2002.
[182] Jürgen Schmidhuber. Reinforcement learning in Markovian and non-Markovian en-
vironments. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances
in Neural Information Processing Systems (NIPS) 3, pages 500–506. Morgan Kauf-
mann, 1991.
[183] Jürgen Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity,
music, and the fine arts. In Connection Science, volume 18, pages 173–187, 2006.
[184] Alexander Schrijver. Theory of Linear and Integer Programming. Wiley-Interscience
Series in Discrete Mathematics. John Wiley, 1986.
[185] Anton Schwartz. A reinforcement learning method for maximizing undiscounted
rewards. In Proceedings of the 10th International Conference on Machine Learning
(ICML), San Mateo, CA, 1993. Morgan Kaufmann.
[186] Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences
of the U. S. A., 39:1095–1100, 1953.
[187] John W. Sheppard. Co-learning in differential games. Machine Learning, 33(2-
3):201–233, 1998.
[188] Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the
answer, what is the question? In R. Vohra and M. Wellman, editors, Artificial
Intelligence (special issue on foundations of multi-agent learning), volume 171, pages
365–377, 2007.
[189] Sajid M. Siddiqi and Andrew W. Moore. Fast inference and learning in large-state-
space HMMs. In L. de Raedt and S. Wrobel, editors, Proceedings of the 22nd Inter-
national Conference on Machine Learning (ICML), Bonn, Germany, pages 800–807.
ACM, 2005.
[190] Satinder P. Singh. Reinforcement learning with a hierarchy of abstract models. In
W. R. Swartout, editor, Proceedings of the 10th National Conference on Artificial
Intelligence (AAAI), San Jose, CA, pages 202–207. MIT Press, 1992.
[191] Margaret M. Skelly. Hierarchical Reinforcement Learning with Function Approxima-
tion for Adaptive Control. PhD thesis, Case Western Reserve University, OhioLINK,
2004.
[192] Stephen Smale. Differentiable dynamical systems. Bulletin of the American Mathe-
matical Society (BAMS), 73:747–817, 1967.
[193] Andrew J. Smith. Applications of the self-organising map to reinforcement learning.
Neural Networks, 15(8-9):1107–1124, 2002.
[194] Bernhard von Stengel. Computing equilibria for two-person games. In R. J. Au-
mann and S. Hart, editors, Handbook of Game Theory with Economic Applications,
volume 3, chapter 45. North-Holland, Amsterdam, 2002.
[195] Robert F. Stengel. Stochastic Optimal Control. A Wiley-Interscience Publication.
John Wiley, New York, 1986.
[196] Eduard L. Stiefel. Note on Jordan elimination, linear programming and Chebyshev
approximation. Numerische Mathematik, 2:1–17, 1960.
[197] Peter Stone. Layered Learning in Multi-Agent Systems. PhD thesis, Carnegie Mellon
University, 1998.
[198] Peter Stone and Richard S. Sutton. Scaling reinforcement learning toward RoboCup
soccer. In C. E. Brodley and A. P. Danyluk, editors, Proceedings of the 18th Interna-
tional Conference on Machine Learning (ICML), Williams College, Williamstown,
MA, pages 537–544, San Francisco, CA, 2001. Morgan Kaufmann.
[199] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based
on approximating dynamic programming. In Proceedings of the 7th International
Conference on Machine Learning (ICML), pages 216–224, Austin, TX, 1990. Morgan
Kaufmann.
[200] Richard S. Sutton. Planning by incremental dynamic programming. In Proceedings
of the 8th International Workshop on Machine Learning (ICML), pages 353–357.
Morgan Kaufmann, 1991.
[201] Richard S. Sutton. Generalization in reinforcement learning: Successful examples
using sparse coarse coding. In D. S. Touretzky, M. Mozer, and M. E. Hasselmo,
editors, Advances in Neural Information Processing Systems (NIPS) 8, pages 1038–
1044, Cambridge, MA, 1996. MIT Press.
[202] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
MIT Press, Cambridge, MA, 1998.
[203] Csaba Szepesvári and Michael L. Littman. A unified analysis of value-function-based
reinforcement learning algorithms. Neural Computation, 11(8):2017–2060, 1999.
[204] Erik Talvitie and Satinder P. Singh. An experts algorithm for transfer learning.
In M. M. Veloso, editor, Proceedings of the 20th International Joint Conference on
Artificial Intelligence (IJCAI), Hyderabad, India, pages 1065–1070, 2007.
[205] Gerald Tesauro. TD-Gammon, A self-teaching backgammon program, achieves
master-level play. Neural Computation, 6(2):215–219, 1994.
[206] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications
of the ACM, 38(3):58–68, 1995.
[207] Sebastian B. Thrun and Anton Schwartz. Issues in using function approximation for
reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and
A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School,
Hillsdale, NJ, 1993. Lawrence Erlbaum.
[208] Michael J. Todd. The many facets of linear programming. Mathematical Program-
ming, 91(3):417–436, 2002.
[209] Abraham Wald. Generalization of a theorem by v. Neumann concerning zero sum
two person games. Annals of Mathematics. Second Series, 46:281–286, 1945.
[210] Abraham Wald. Sequential Analysis. John Wiley, New York, 1947.
[211] Chris Walshaw. The JOSTLE user manual: Version 2.2. University of Greenwich,
2000.
[212] Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Univer-
sity of Cambridge, 1989.
[213] Gerhard Weiß. A multiagent framework for planning, reacting, and learning. Tech-
nical Report FKI-233-99, TU München, Germany, 1999.
[214] Jochen Werner. Numerische Mathematik. Vieweg Verlag, Braunschweig, Germany,
1992.
[215] Shimon Whiteson and Peter Stone. Concurrent layered learning. In Second Interna-
tional Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS),
pages 193–200. ACM, 2003.
[216] Chang Zhang and John S. Baras. A new adaptive aggregation algorithm for infinite
horizon dynamic programming. In Proceedings of the 11th Mediterranean Conference
on Control and Automation (MED), Rhodes, Greece, 2003.
[217] Martin Zinkevich and Tucker Balch. Symmetry in Markov decision processes and its
implications for single agent and multiagent learning. In Proceedings of the 18th Inter-
national Conference on Machine Learning (ICML), Williams College, Williamstown,
MA, pages 632–639, San Francisco, CA, 2001. Morgan Kaufmann.
[218] Martin Zinkevich, Amy Greenwald, and Michael L. Littman. Cyclic equilibria in
Markov games. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural
Information Processing Systems (NIPS) 18, pages 1641–1648. MIT Press, Cambridge,
MA, 2006.
Index
µ-homomorphism
2P-ZS-MG, 49
action
faithful, 103
mixed, 23
agent, 7
approximation error, 56
averagers, 58
basis functions, 58
Bellman equation, 19, 31
Bellman error, 35
Bellman operator, 20, 31
best reply, 25
best response, 25
bimatrix game, 23, 24
decision epoch, 24
repeated, 24
state action space, 24
state space, 24
transition function, 24
bisimulation, 42
box, 12
collection of boxes, 12
chaos, 9
credit assignment problem
structural, 35
temporal, 35
cross validation, 58
differential game, 33
dynamic programming, 20, 31
dynamical system, 9
flow, 10
time-continuous, 9
time-discrete, 9
transition probability, 11
environment, 8
estimation error, 56
Euler’s number, 84
expectation value, 18
feature extraction, 57
features, 55
fictitious play, 28
finite state machines, 42
function approximation, 55
approximation error, 56
architecture, 55
blessing of smoothness, 57
curse of dimension, 57
estimation error, 56
sample error, 56
game, 23
bimatrix, 24
competitive, 4, 23
coordination, 25
differential, 23
discrete, 23
matrix, 24
optimal strategy, 26
single-stage, 23
game matrix
distance, 26
game theory
player, 23
generalisation, 57
over states, 57
graph, 14
edge set, 14
external costs, 14
internal costs, 14
vertex set, 14
graph matching, 15
graph partitioning
congestion, 15
Helpful-Set method, 15
multilevel paradigm, 15
grid soccer
action spaces, 67
kick-off, 71
move, 68, 70
pass, 68, 70
reward function, 71
state space, 67
transition function, 68
group
index of subgroup, 104
transformation, 103
group action, 103
kernel of, 103
group homomorphism, 103
group isomorphism, 103
group orbit, 104
Hamilton-Jacobi-Bellman equation, 62
Hamilton-Jacobi-Bellman-Isaacs equation, 33
hash function, 108
homomorphism
2P-ZS-MG, 45
group, 103
MDP, 38
imitation, 101
interaction, 7
interval
discrete, 108
invariance ratio, 11
isomorphism
group, 103
kick-off position, 66
Kohonen map, 58
Lemke-Howson algorithm, 28
linear program, 28
feasible set, 28
feasible solution, 28
list of states, 108
Mangasarian-Stone algorithm, 29
Markov chain, 9
Markov decision process, 16
action, 16
automorphism, 42
belief, 34
Bellman equation, 19
decision epoch, 16
decision rule, 19
discount rate, 18
homomorphism, 38
isomorphism, 42
multi-grid methods, 22
optimality principle, 19
partially observable, 34
policy, 19
lifted, 41
pure action, 20
return, 18
average, 18
discounted, 18
finite horizon, 18
state action space, 16
state aggregation, 22
state space, 16
symmetry group of, 42
transition function, 16
value functions, 19
Markov game
µ-homomorphism, 49
action, 30
Bellman equation, 31
decision epoch, 30
general-sum multi-player, 33
homomorphism, 45
optimality principle, 31
policy, 30
lifted, 46
return, 30
state action space, 30
state space, 30
transition function, 30
value function, 30
Markov model
hidden, 34
Markov process, 9
Markov property, 17
matrix
sparse, 108
matrix game, 23, 24
dominating action, 27
lower value, 26
matching pennies, 27
reduction property, 46, 50
repeated, 23, 24
strictly dominating action, 27
upper value, 26
multi-agent systems, 7
Nash equilibrium, 25
non-expansion, 31
optimality principle, 19, 31
orbit, 10
partition, 11
covering property, 11
payoff matrix, 23
perception, 8
perfect memory controller, 34
permutation, 103
Perron-Frobenius operator, 10
discrete, 12
policy
ε-greedy, 21
deterministic, 19
history dependent, 20
Markovian, 19
non-stationary, 20
total, 25
policy search, 57
probability distribution, 16
projection, 38, 44
proto value functions, 58
Q-learning
recurrent, 34
quotient set, 39
racing, 56
reflexes, 101
regret, 34
reinforcement learning, 20, 31
reinforcement signals
local, 101
reward
modified, 106
reward matrix, 23
risk option, 70
RL
exploitation, 21
exploration, 21
off-policy method, 21
on-policy method, 22
sample error, 56
SARSA, 22
security level, 26, 81
self-organising maps, 58
self-play, 98
semi orbit
negative, 10
positive, 10
set
almost invariant, 11
backward invariant, 10
forward invariant, 10
invariant, 10
set of optimal strategies, 26
shaping, 101
soft constraint, 19
squared Euclidean distance, 68
stabiliser, 104
state aggregation, 38
state space
partition of, 11
stochastic approximation
Monte Carlo approach, 13
stochastic games, 8
subdivision algorithm, 12
supervertex, 15
supervised learning, 55
symmetry
MDP, 39
test points, 12
trajectory, 10
transfer operator, 10
discretisation of, 12, 13
transformation, 103
transition function
block, 39, 45, 49
transition graph, 14
edge weights, 14
undirected, 14
vertex weights, 14
transition matrix, 106
two-player zero-sum game
normal form, 23
unsupervised learning, 57
update matrix, 105
utile suffix memory, 34
value function
factored representation, 62
first guess, 75
value iteration, 20, 32, 59
viscosity solutions, 33