University of the Basque Country
UPV/EHU
Doctoral Thesis
Intrinsic Motivation Mechanisms for a Better Sample Efficiency in Deep Reinforcement Learning applied to Scenarios with Sparse Rewards
Author:
Alain Andres Fernandez
Supervisors:
Dr. Esther Villar-Rodriguez
Prof. Dr. Javier Del Ser
A Thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Communications Engineering
June 28, 2023
(cc)2023 ALAIN ANDRES FERNANDEZ (cc by-sa 4.0)
“We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.”
Roy Amara
“Most people overestimate what they can achieve in a year and underestimate what they can achieve in ten years”
Bill Gates
“The complex line that delimits the short-sighted and long-term decisions of happiness. The γ parameter that governs and rules our lives. The motivations behind each decision. The uncertainty of the environment that surrounds us. There is no “optimal” path to follow; the answer to a worth living life is unique and subjective to each human being.”
Alain Andres, myself.
UNIVERSITY OF THE BASQUE COUNTRY UPV/EHU
Abstract
Engineering School of Bilbao
Department of Communications Engineering
Doctoral Degree
Intrinsic Motivation Mechanisms for a Better Sample Efficiency in Deep Reinforcement Learning applied to Scenarios with Sparse Rewards
by Alain Andres Fernandez
Driven by the quest to create intelligent systems that can autonomously learn to make optimal decisions, Reinforcement Learning has emerged as a powerful branch of Machine Learning. Reinforcement Learning agents interact with their environment, learning from trial and error, guided by feedback signals shaped in the form of rewards. However, the application of Reinforcement Learning is often hampered by the complexity associated with the design of such rewards. Creating a dense reward function, where the agent receives immediate and frequent feedback from its actions, is often a challenging task. This challenge arises from the difficulty of specifying the correct behavior for every possible state-action pair. This issue parallels the challenges faced in human learning where educators often grapple with identifying the best way to teach a certain skill or subject, given that learning styles can vary dramatically among individuals. As a consequence, it is common to formulate the problems with sparse rewards, where the agent is only rewarded when it accomplishes a significant task or achieves the final goal, thus aligning more directly with the objective of the problem. The sparse reward formulation does not require the anticipation of every possible scenario or state, making it more tractable for complex environments and real-world scenarios, where feedback is often delayed and not immediately available.
However, sparse reward settings also introduce their own challenges, most notably, the issue of exploration. In the absence of frequent rewards, an agent can struggle to identify beneficial actions, making learning slow and inefficient. This is where mechanisms such as Intrinsic Motivation come into play, encouraging more effective exploration and improving sample efficiency, despite the sparsity of extrinsic rewards.
In this context, the overall contribution of this Thesis is to delve into how Intrinsic Motivation can boost the performance of Deep Reinforcement Learning approaches in environments with sparse rewards, aiming to enhance their sample efficiency. To this end, we first focus on its application with concurrent heterogeneous agents, aiming to establish a collaborative framework that makes them explore more efficiently and accelerates their learning process. Furthermore, an entire chapter is devoted to analyzing and discussing the impact of certain design choices and parameter settings on the generation of the Intrinsic Motivation bonuses. Last but not least, the Thesis proposes to combine these explorative techniques with Self-Imitation Learning, demonstrating that they can be used jointly towards achieving faster convergence and optimal policies.
All the analyzed scenarios suggest that Intrinsic Motivation can significantly speed up learning, reducing the number of interactions an agent needs to perform, and ultimately, leading to more rapid and efficient problem-solving in complex environments characterized by sparse rewards.
Acknowledgements
It seems like yesterday when I was doing my Master’s and began my internship at the Aula Tecnalia in San Mames. Although my research at the time was oriented towards cybersecurity due to its relation to my studies, I had always been curious about the potential of Artificial Intelligence and its possibilities to create solutions that lead us, humans, to a better environment. Unbeknownst to me, I was working alongside a group of high-quality researchers in AI (JRL group)... and one day, I approached them and expressed my interest in their work, not knowing that it would be the first step that propelled me into the world of research.
This journey would not have been possible without Javier Del Ser, a.k.a. el señor mayor o deidad del ser... my professor, director, and supervisor throughout this long and challenging journey. I remember the first time we met during a class, but it wasn’t until some time later that I realized how research-aholic you were (are) when I discovered you held, not one, but two PhDs! I will always be grateful for the time you took to answer my inquiries and explain what doing research is, introduce me to the entire research group, and encourage me to pursue my PhD despite my fears and not being familiar with the field. You provided guidance when I felt lost and demotivated, offering invaluable tips that have shaped my research career up to this point. Without your support, I definitely would not have embarked on this path.
I am also indebted to Tecnalia, which was initially the research partner of the Bikaintek funding program that granted my PhD. Despite the fact that another employer was involved (with more weight and more interest with respect to my research), when the latter decided to withdraw from the project, Tecnalia took a step forward, assumed the proportional financial aspects of my grant, and continued to support the project and myself, recognizing its value. I want to thank my superiors at that time: Isidoro Cirion, Iñigo Arizaga, Elena Urrutia and Joseba Laka. However, I must emphasize the critical role played by both my directors during this period when we had no results, papers or indicators guaranteeing the viability of such an investment. We had to shift our focus away from the problem we were addressing and start from scratch again (due to the other partner leaving), which posed a real challenge to us. Even in that circumstance, both of you convinced everyone, solely with your words, to continue trusting in me and put yourselves in a complex position. I am at a loss for words to express my gratitude.
I cannot forget the most significant pillar during these years, my other director, Esther. Even at this point, I struggle to find the right words to express myself adequately. I could highlight different (numerous, various, multiple...) technical aspects that commend such a brilliant brain, which have been crucial for the successful development of this thesis. However, without intending to diminish these professional attributes, I want to use my words to emphasize your humanity. We have discussed, argued and conversed about various topics for hours, much like a child does with their mother (I think that is one of the reasons behind some co-workers saying that you were my figurative mom). You have always listened to me, not only during our work hours but also outside of work, offering your perspective and advice whatever the problem was. I cannot enumerate how many calls we have had, and how thankful I felt to have your support, especially in those situations where I was unmotivated due to several reasons that are not relevant at this moment. I cannot forget when you said something like: "It is about the person and its values, not just the work or the results. You should be proud of what you are; any team would undoubtedly be lucky to have you". I have repeated those words to myself and used them as a compass during this journey. This thesis and the person I have become, both professionally and personally, owe a great deal to you. Thank you.
Last but not least, I have to thank my friends, but more importantly, my family – both my parents, Txomin and Marijo, and my sister Goretti – who have always supported me unconditionally, not only during these past 4 years, but throughout my entire life. I would not be who I am without you, without your patience, without your efforts, without the values you have instilled in me, and without all the trust you placed in me even when I lost myself. I hope I can return everything I got, and be, at some point in time, for other people, what you have been for me.
Contents
Abstract
Acknowledgements
1 Introduction
1.1 Motivation
1.2 Outline and Contributions of the Thesis
1.3 Reading this Thesis
2 Background
2.1 Fundamentals of Reinforcement Learning
2.1.1 Markov Decision Process
2.1.2 Sequence Boundaries: Episode & Rollout
2.1.3 Rewards and Returns
2.1.4 Policy and Value Function
2.1.5 On-policy VS Off-policy
2.1.6 Value-based VS Policy-based
2.1.6.1 Policy Gradient methods
2.1.7 Deep Reinforcement Learning
2.2 Environments
2.2.1 Procedurally-Generated Environments
2.3 Exploration Strategies
2.3.1 Intrinsic Motivation
2.3.2 Imitation Learning
3 Collaborative Training of Heterogeneous Agents
3.1 Related Work
3.1.1 Contribution Beyond the State of the Art
3.2 Problem Statement
3.3 Proposed Collaborative Framework
3.3.1 Centralized Learning with Decentralized Execution
3.3.1.1 Decentralized Actors
3.3.1.2 Centralized Critic Module
3.3.2 Centralized Intrinsic Curiosity Module
3.3.2.1 Action-based Curiosity Module
3.3.2.2 Tree Filtering
3.3.3 Summary of the Proposed Modules
3.4 Experimental Setup
3.4.1 Case Study 1
3.4.2 Case Study 2
List of Tables
2.1 Popular ψ estimator choices.
3.1 Details of both the actor and critic neural network architectures.
3.2 Summary of the configuration ablations within the collaborative framework.
3.3 Sample-efficiency and quality of resulting policies for different evaluated configurations in Setup 3.
4.1 Various IM methods based on different design choices.
4.2 Results of different IM strategies over MiniGrid scenarios, addressing RQ1.
4.3 Results of different IM strategies over MiniGrid scenarios, addressing RQ2.
4.4 Comparison of number of parameters and required forward and backward passes across different IM modules.
4.5 Results of different IM strategies over MiniGrid scenarios, addressing RQ3.
5.1 On-policy versus off-policy ratios (ξ) in each environment, with the off-policy update executed upon episode completion.
List of Abbreviations
General
SOTA State Of The Art
ANN Artificial Neural Network
DL Deep Learning
SL Supervised Learning
UL Unsupervised Learning
RL Reinforcement Learning
DRL Deep Reinforcement Learning
MDP Markov Decision Process
POMDP Partially Observable Markov Decision Process
MARL Multi-Agent RL
CLDE Centralized Learning with Decentralized Execution
IM Intrinsic Motivation
IL Imitation Learning
self-IL Self Imitation Learning (generic)
LfD Learning from Demonstrations
IRL Inverse Reinforcement Learning
PCG Procedurally Content Generator
KL Kullback-Leibler
SR Success Rate
LSTM Long Short-Term Memory
Reinforcement Learning
S State space
A Action space
R Reward space
P Transition probability function
G Return
O Observation function
Ω Observation space
γ Discount factor
π Policy
V Value function
Q Action-Value function
TD Temporal Difference
Algorithmic approaches
EA Evolutionary Algorithms
UCB Upper Confidence Bound
SARSA State-Action-Reward-State-Action
DQN Deep Q-Network
PPO Proximal Policy Optimization
TRPO Trust Region Policy Optimization
GAE Generalized Advantage Estimator
A3C Asynchronous Advantage Actor-Critic
IMPALA Importance Weighted Actor-Learner Architecture
DPG Deterministic Policy Gradient
DDPG Deep Deterministic Policy Gradient
TD3 Twin Delayed DDPG
SAC Soft Actor-Critic
NGU Never Give Up
PER Prioritized Experience Replay
ICM Intrinsic Curiosity Module
RND Random Network Distillation
RIDE Rewarding Impact Driven Exploration
RAPID Rank the Episodes
BeBold Beyond the Boundary of Explored Regions
MADE Exploration via Maximizing Deviation from Explored Regions
NovelD Novelty Difference
FaSo Fast and Slow intrinsic curiosity
AGAC Adversarially Guided Actor-Critic
DoWhaM Don’t Do What Doesn’t Matter
D&E Divide-and-Explore
SIL Self-Imitation Learning
DTSIL Diverse Trajectory-conditioned Self-Imitation Learning
UVFA Universal Value Function Approximator
BC Behavior Cloning
DAGGER Dataset Aggregation
Chapter 1
Introduction
Artificial Intelligence (AI) is one of those topics on everyone’s lips these days. Although multiple definitions can be found in the literature, laid out by how a system should think and act taking into account both the rational and human aspects, a wide and more generalist definition was set in (Russell & Norvig, 2022), which characterized AI as:
“The study of agents that receive percepts from the environment and perform actions”.
AI’s popularity has risen with the emergence of Industry 4.0 (and the upcoming and more sustainable Industry 5.0), where it has been considered one of the main Key Enabling Technologies, being, in the European Commission’s own words, a game-changer due to its potential to increase efficiency and productivity across multiple sectors¹. More concretely, Machine Learning (ML) has drawn attention due to its potential to make a computer system learn from examples (data) without explicit supervision of a human being, getting the necessary information by analyzing patterns. By resorting to ML to automate tasks, people can spend time carrying out other duties (productivity) and also rely on the solutions provided by systems with better performance that overcome natural human limitations (efficiency/optimality), ultimately improving people’s overall welfare. Regarding ML, three subgroups can be distinguished:
• Supervised learning (SL): learns from labeled data in order to generalize the knowledge to upcoming new inputs.
• Unsupervised learning (UL): learns from unlabeled data so that the information can be compressed and accordingly segmented into classes.
• Reinforcement learning (RL): learns through the interaction (trial and error) with an environment where the aim is to solve a defined task.
This thesis gravitates around RL and, although its fundamentals are going to be explained more deeply in Chapter 2, it is important to notice the differences with respect to the other two categories, especially between RL and SL, which are similar and often confused with each other. On the one hand, SL assumes the data to be independent and identically distributed (i.i.d.) and requires a priori knowledge about the ground truth (also referred to as true label or annotation) of the training data. Contrarily, in RL previous decisions influence future inputs (i.e., data are not independent, it is a sequential paradigm) whereas the ground truth answer is not known (correct actions/labels are not provided). Instead, the reward is used as an estimator to guide the learning.

¹ https://research-and-innovation.ec.europa.eu/knowledge-publications-tools-and-data/publications/all-publications/ai-research-and-innovation-europe-paving-its-own-way_en.
Although the RL field has been under study since the 20th century, it did not come to the fore until the last decade due to advances in Deep Learning (DL) and computational capabilities that ease its application. DL involves using non-linear function approximators – typically Artificial Neural Networks (ANN) – so that ML algorithms can ingest unstructured data and automate the feature extraction process. Regarding computational capabilities, the processing units have experienced significant advances in efficiency, enabling the deployment of larger and more complex models while substantially decreasing the time devoted to training them. By virtue of this progress, RL can leverage ANNs to handle more complicated and diverse problems unapproachable in the past, which gives name to the field where this dissertation is contextualized, Deep Reinforcement Learning (DRL), Figure 1.1.
[Figure 1.1. Diagram labels: Artificial Intelligence, Machine Learning, Deep Learning, UL, SL, RL. Legend: Pure Deep RL; Deep RL + SL.]
Figure 1.1: Artificial Intelligence taxonomy: Supervised Learning (SL), Unsupervised Learning (UL) and Reinforcement Learning (RL). This dissertation is focused on the areas highlighted in orange, Pure DRL, and pink, DRL+SL.
1.1 Motivation
Despite the premises stated above, state-of-the-art (SOTA) ML methods are not mature enough to solve the vast majority of the problems without human presence. Behind the very basic idea of learning from a reward, RL has to deal with multiple challenges derived from its demanding setup requirements (Dulac-Arnold et al., 2021) (e.g., lack of a good available simulator, delayed feedback signals, learning from poorly specified reward functions) as well as other difficulties inherent to these techniques (Osband et al., 2020) (e.g., exploration-exploitation dilemma, credit assignment problem, generalization to unseen experiences). However, this has not been an obstacle to begin applying RL to real-world problems when possible (Li, 2019) and see outstanding results in fields like:
• Industry/robotics (supply chain, manufacturing) (Ibarz et al., 2021; Nian et al., 2020)
• Healthcare (treatment recommendation) (Gottesman et al., 2019)
• Energy (power consumption) (Fu et al., 2022)
• Finance (portfolio management) (Filos, 2019)
• Communications and Networking Systems (network access and security, adaptive rate control) (Luong et al., 2019)
Motivated by the exciting journey of RL in those fields, research interest has been oriented towards narrowing the gap between real-world problem requirements and experimental RL setups, so that more problems become tractable. With all this in mind, multiple high-level challenges can be identified (Dulac-Arnold et al., 2021):
• Sparse rewards: in RL a feedback signal (reward) is needed to guide the learning so that the agent can distinguish whether the decisions made were actually good/bad. Informative rewards are not necessary right after every single interaction as long as the credit of each action can be deduced. Nevertheless, determining if a decision is better/worse than another, without considering a whole sequence of events, is complex – even when having access to the whole state information and the objective to attain – as there is a large number of possible sequential combinations that grows exponentially with the size of the action space and the required number of steps up to the goal, which can lead to very different outcomes. Thus, sparse rewards can be used to evaluate a sequence of decisions. In fact, sparse feedback signals are one of the main challenges present in real-world setups, which involve system delays and difficulties in modeling reward functions in complex problems. However, the sparser the rewards, the more arduous it becomes to determine which actions are useful. Furthermore, the exploration becomes more troublesome. Therefore, sparsity remains one of the main concerns to be solved in real-world RL problems.
• Partial observability: the RL framework is commonly formalised as a Markov Decision Process (MDP), where a state must contain all the necessary information to make a decision. In practice, this rarely holds true due to the lack of critical information needed in each time step. Hence, it is common that the agent gets an observation rather than a state, which obviously limits its comprehension of the environment that surrounds it. That context is formally referred to as a Partially Observable Markov Decision Process (POMDP) and exposes difficulties regarding generalization, credit assignment and long-term consequences², being a challenge present in a large number of real-world scenarios.
• High dimensional continuous state spaces: among the different possibilities to model a problem, one of the big issues is how to represent the state (or observation) in such a way that the agent can learn. This implies selecting the type of data and the dimensions to be used as input, where an inappropriate criterion can dramatically downgrade the expected results. As a result, the agent may be unable to model the correlation between the input features, the selected action and their utility. Thanks to advances in DL and assuming an agent can understand/infer the world similarly to how humans do, it has become popular to model problems taking, for example, images as input. Therefore, high dimensional inputs are related to generalization issues which are also present in real-world problems.
• Evolution-Adaptation to action space modifications: the modification and the consequent adaptation of the agent to either state and/or action spaces can bring new behaviors. Instead of re-training from scratch, the previous knowledge can be reused with techniques like Transfer Learning or by virtue of using Expert Demonstrations. In such a context, how heterogeneous agents should be trained is not clear, as they are supposed to learn different policies. The challenge resides in how to exploit the knowledge gained by other agents.
• Real-time inference: in order to deploy any ML-based solution into a production system, the algorithm has to be designed according to the system’s capabilities and constraints. While large and complex artificial neural network (ANN) architectures have achieved remarkably good results in various applications, their high computational costs often hinder their adoption in real-world systems. Therefore, striking a balance between performance and costs becomes a practical criterion. Sometimes, achieving high performance can be accomplished by reducing the complexity of the network while introducing complementary, yet lighter, procedures from algorithmic development into an extended ML pipeline.
² As the agent only manages to understand the impact of the decisions that modify parts of the state that are measurable in its observation, the credit of each action is usually hard to determine (credit assignment). This problem can be mitigated if such effects can be correlated within a narrow sequence of interactions (long-term consequences), which could ultimately affect the capacity to act in new or similar observations (generalization capacity).
The Thesis aims to develop novel strategies to cope proficiently with all these aspects, which are the facets that most faithfully reproduce realistic scenarios.
1.2 Outline and Contributions of the Thesis
In light of the aforementioned objectives, the core problem to be addressed can be framed as sample-efficiency in POMDPs with sparse rewards, covering the exploration-exploitation dilemma in multiple scenarios while attempting to use the minimum number of samples to get an optimal policy. Therefore, the Thesis is structured in chapters with different use-cases. A brief summary of each chapter is introduced below.
Chapter 2
This chapter – Background – aims to introduce and condense all the information needed to understand the technical contributions. Besides the fundamentals of RL and the benchmarks/environments that can be found in the literature, the reasons why sparse reward problems have become popular are highlighted. At the same time, the challenges that arise from adopting such a sparse paradigm are explained together with the most popular techniques adopted to face the major drawbacks. Throughout this chapter, a wide review of related research works is presented in order to provide the reader with the fundamental concepts, which are indeed transversal to the following chapters.
Chapter 3
In this chapter – Collaborative training between heterogeneously skilled agents in environments with sparse rewards – we focus on how to carry out a collaborative learning framework between heterogeneous agents with different action spaces yielding different optimal policies. Unlike multi-agent systems, in which agents operate in the same scenario and are typically evaluated based on a team-reward function, we analyze how to learn more efficiently when agents’ rewards are independent and each of them interacts with a distinct instance of the environment. This is also known as the concurrent learning paradigm, which lies somewhere between single- and multi-agent problems. Besides the heterogeneity, this chapter also delves into the challenges of POMDPs, sparse rewards and high-dimensional state spaces by learning how to navigate directly from pixels.
Chapter 4
Motivated by the great success and advances of Intrinsic Motivation (IM) techniques, Chapter 4 – An Evaluation Study of Intrinsic Motivation Techniques applied to Reinforcement Learning over Hard Exploration Environments – presents an empirical study to assess and
2.1.2 Sequence Boundaries: Episode & Rollout
The sequence of interactions between the agent and the environment can be broken into subsequences which can be referred to as trajectory, rollout and/or episode, their meaning being slightly different depending on the boundaries. In this Thesis, we adopt the following taxonomy, which is widely used in the literature:
• Trajectory is the least restrictive concept and can be used to refer to any of the next two terms.
• An episode ends when a maximum number of steps is taken or when the agent achieves the goal (the number of steps required to finish an episode in any of those cases is parameterized by $T$). As a result, the environment is reset and the agent is brought back to an initial state² in order to solve the environment again. Although the large majority of problems are of this nature, commonly referred to as episodic tasks, others are categorized as continuing tasks when the goal is never achieved ($T = \infty$) because the task is endless.
• On the contrary, a rollout ($\tau$) is not subject to the termination of the episode and is composed of a predetermined number of steps. Consequently, a rollout could contain fewer experiences than an episode, or even span several episodes, the number of such experiences ($T$) being a parameter defined by the user (independently of the environment).
For the sake of clarity, we provide an example in Figure 2.2 where the interactions of two different complete episodes can be distinguished. If we considered a rollout of size 15 ($T = 15$), then the rollout would span information from the two different episodes; on the contrary, if it was set to 5 ($T = 5$), then the rollout would cover less information (e.g., half of an episode in the first example).
Note that an episode’s length (number of experiences) depends not only on the environment, but also on the quality of the policy that selects the actions, since an expert agent will be able to accomplish the task with the optimal, i.e. smallest, number of steps³. Thus, the defined rollout size ($T$) ends up containing a variable number of episodes during the training process, which is important in order to balance the bias and variance of the updates generated upon those experiences.
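The relation between episodes and rollouts can be sketched in a few lines of Python. The helper below is a hypothetical illustration (the name `split_into_rollouts` and its arguments are assumptions, not part of any library or of the Thesis code): it cuts a stream of interactions into fixed-size rollouts and reports which episodes each rollout touches, using the episode lengths of 10 and 11 steps from Figure 2.2.

```python
from typing import List


def split_into_rollouts(episode_lengths: List[int], rollout_size: int) -> List[List[int]]:
    """For each fixed-size rollout, return the indices of the episodes it contains steps from."""
    # One entry per environment step, holding the index of the episode it belongs to.
    step_to_episode = [ep for ep, n in enumerate(episode_lengths) for _ in range(n)]
    rollouts = []
    for start in range(0, len(step_to_episode), rollout_size):
        chunk = step_to_episode[start:start + rollout_size]
        rollouts.append(sorted(set(chunk)))
    return rollouts


# Two complete episodes of 10 and 11 steps (illustrative values taken from Figure 2.2).
print(split_into_rollouts([10, 11], rollout_size=15))  # [[0, 1], [1]]
print(split_into_rollouts([10, 11], rollout_size=5))   # [[0], [0], [1], [1], [1]]
```

With a rollout size of 15 the first rollout mixes experiences from both episodes, whereas with a size of 5 every rollout covers only a fraction of a single episode.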
[Figure 2.2. Panels: Episode 1 (10 steps); Episode 2 (11 steps).]
Figure 2.2: Example of two different episodes’ interactions. The agent is the red arrow and the environment is the maze and all the objects that surround it. The state is the visual perception of the environment, the actions are the set of permitted navigation movements and the set of object manipulation operations, and the reward is always zero except when arriving at the green square (the goal). The top two rows represent a single episode, while the remaining rows represent a different episode.

² The agent can be reset either in a fixed starting state ($s_0$) or within a distribution of possible states ($\rho_0$). Thus, $s_0 \sim \rho_0$ with a variable number of initial states.
³ In the interactions of the two episodes presented in Figure 2.2 the optimal trajectories are considered.

2.1.3 Rewards and Returns
In RL, a reward is a scalar value that an agent receives from the environment after taking an action to guide its learning. The reward indicates how well the agent performed relative to the objective. More importantly, the agent’s primary goal is to make decisions that maximize the cumulative reward obtained from the environment, which is referred to as the return. Therefore, designing a reward function that provides adequate feedback signals is of utmost importance. In the following, extended definitions of rewards and returns are provided.
Rewards
The reward has to reinforce good decisions and discourage useless or wrong actions in order to make the agent achieve what we desire from it. This means that the agent’s success pivots on how well the feedback signals are coherent with the goal of the task. Some conceptualizations of reward functions, and subsequently, the rewards in each interaction, can be exemplified as follows:
Example 1, Robot. Goal: make a robot run as fast as possible without falling. The reward could be inversely proportional to the number of steps required to arrive at a given destination without falling.
Example 2, Chess. Goal: make an agent learn how to play chess. Intuitively, the rewards could be +1 for winning, -1 for losing and 0 for drawing.
In such examples, the agent is guided to complete the task with sparse signals that evaluate the whole sequence of actions that leads to a given outcome. Nevertheless, a reward function’s success is also subject to how the progress in reaching the objective is evaluated. For instance, sparsity can be circumvented by means of establishing easier subgoals or providing intermediate rewards (i.e., dense) that ease the credit assignment problem:
Example 1, Robot. The reward function can be designed to promote the forward motion at each step.
Example 2, Chess. Intermediate rewards can be considered when taking opponent’s pieces out.
Nonetheless, this strategy could mislead the agent into a greedy search for subgoal achievement instead of focusing on the main goal.
Example 2, Chess. The agent could struggle to beat the opponent by becoming greedy about taking the other’s pieces out rather than developing a winning strategy.
It is important to remark that, even when designing a good reward function, the success and quality of the results might not be as expected due to other important aspects (e.g., model weights initialization, algorithmic limitations, bias-variance trade-off)⁴. Thus, opting for a naive and easy reward function (over a more complex one) is sometimes suggested.
For these reasons, the design of the reward function is not trivial and sparse formulations are preferable at the expense of exploration challenges. Later in this Chapter (Section 2.3) we refer to methods that address the exploration-exploitation dilemma more efficiently, although this is tangential to the main subject of this dissertation.
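The contrast between sparse and dense formulations can be sketched for the robot example. The snippet below is only an illustration under stated assumptions: it models a hypothetical one-dimensional corridor with integer positions, and the function names and the 0.1 shaping weight are arbitrary choices made here, not values taken from the Thesis.

```python
from typing import List


def sparse_reward(position: int, goal: int) -> float:
    """Sparse formulation: feedback only when the goal is reached."""
    return 1.0 if position == goal else 0.0


def dense_reward(prev_position: int, position: int, goal: int) -> float:
    """Dense formulation: goal reward plus a small bonus for forward progress (assumed weight 0.1)."""
    progress = abs(goal - prev_position) - abs(goal - position)
    return sparse_reward(position, goal) + 0.1 * progress


def rewards_along(path: List[int], goal: int, dense: bool) -> List[float]:
    """Reward received after each transition of a path of visited positions."""
    rewards = []
    for prev, cur in zip(path, path[1:]):
        rewards.append(dense_reward(prev, cur, goal) if dense else sparse_reward(cur, goal))
    return rewards


path = [0, 1, 2, 3]  # the robot walks straight to the goal at position 3
print(rewards_along(path, goal=3, dense=False))  # [0.0, 0.0, 1.0]
print(rewards_along(path, goal=3, dense=True))
```

In the sparse case the agent receives no information until the last step, while the dense variant rewards every step of progress, at the risk of encouraging behavior that collects shaping bonuses without solving the task.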
⁴ This can be seen clearly in humans: for the same stimuli, environment, and objective, people require different amounts of time to converge to a solution. Moreover, multiple behaviors could lead to what is considered an optimal policy (even for the same reward function).

Return
Note that the main goal of the agent is to maximize the sum of rewards, which can be formalized with the return, $G_t$:
\[
G_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T \tag{2.2}
\]
where $t$ and $T$ stand for the current and final time steps in an episode, respectively. This calculation gives the same importance to all the decisions regardless of their temporal component. What is more, this formulation complicates the calculation of the return in continuing tasks, where there are no episodic boundaries and the return becomes the sum of an infinite series. In light of this limitation, the discount concept was introduced through $\gamma \in [0, 1]$, turning such an operation into a finite calculation⁵. This discount factor also allows modulating the importance of immediate and distant rewards. This new return formulation is commonly referred to as discounted return:
\[
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{2.3}
\]
This implies that a reward received $k$ steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were obtained immediately. Accordingly,
• $\gamma < 1$ is used to adjust the weights of future rewards.
• $\gamma = 0$ is known as "myopic-view" and only maximizes immediate rewards, $G_t = r_{t+1} + 0 \cdot r_{t+2} + 0 \cdot r_{t+3} + \dots = r_{t+1}$.
• $\gamma = 1$ corresponds to the formal definition of return without discount, homogenizing the value of future and immediate rewards, $G_t = r_{t+1} + 1 \cdot r_{t+2} + 1 \cdot r_{t+3} + \dots = r_{t+1} + r_{t+2} + r_{t+3} + \dots$
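In summary, the value of $\gamma$ regulates the effect of maximizing short-term or long-term behaviors; values in $0.9 < \gamma < 1$ are mostly selected to give credit to future actions and to avoid the importance of rewards vanishing. As a consequence, a fifth (and seventh) element must be attached to the previously introduced MDP (POMDP) tuple: $\{\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma\}$ ($\{\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \mathcal{O}, \Omega\}$).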
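The effect of $\gamma$ on Equations (2.2) and (2.3) can be checked numerically. The following is a minimal sketch, not part of the original text; the function name `discounted_return` and the four-step reward list are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Discounted return of Equation (2.3) for a finite list of rewards.

    rewards[0] plays the role of r_{t+1}, rewards[1] of r_{t+2}, and so on.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical sparse-reward episode: only the last of four steps is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]
myopic = discounted_return(rewards, 0.0)        # only r_{t+1} counts
undiscounted = discounted_return(rewards, 1.0)  # plain sum, Equation (2.2)
discounted = discounted_return(rewards, 0.9)    # 0.9**3 * 1.0
```

With $\gamma = 0$ the delayed reward is invisible to the agent, whereas with $\gamma = 0.9$ it still contributes $0.9^3$ of its value three steps ahead.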
2.1.4 Policy and Value Function

Previously, it has been explained how the agent interacts with the environment through actions. A policy, $\pi: \mathcal{S} \rightarrow \mathcal{A}$, is a function that maps the current state of an agent to an action to be taken, $a \sim \pi(s)$, and it can be either deterministic or stochastic. A deterministic policy maps each state to a single action, whereas a stochastic policy maps each state to a probability distribution over the possible actions that the agent can take.
The value function is a function that estimates the long-term reward that an agent can expect to receive in a given state or state-action pair, under a specific policy $\pi$. The state value function, $V_\pi(s)$, is responsible for estimating the expected return starting from a state $s$ and following the policy $\pi$ thereafter, i.e.,

$$V_\pi(s_t) = \mathbb{E}_\pi[G_t \mid s_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right] \tag{2.4}$$
where $\mathbb{E}[\cdot]$ denotes the expected value. Similarly, the action value function, $Q_\pi(s, a)$, estimates the expected return starting not only from a state $s$, but also executing an action $a$, and following the policy $\pi$ thereafter, i.e.,

$$Q_\pi(s_t, a_t) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\right]. \tag{2.5}$$

⁵ After a large number of steps, the effect of any future reward can be considered insignificant. Furthermore, this only holds true as long as $\gamma \in [0,1)$, because when $\gamma = 1$ all the rewards are considered equally important.
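Interestingly, one property that applies to value functions is the recursive relationship involved in the calculation of returns:

$$\begin{aligned} G_t &= r_{t+1} + \gamma (r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \dots) \\ &= r_{t+1} + \gamma G_{t+1} \end{aligned} \tag{2.6}$$

with the consequent reformulation of Equation (2.4):

$$\begin{aligned} V_\pi(s_t) &= \mathbb{E}_\pi[G_t \mid s_t = s] \\ &= \mathbb{E}_\pi[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s_t = s] \\ &= \mathbb{E}_\pi[r_{t+1} + \gamma G_{t+1} \mid s_t = s] \end{aligned} \tag{2.7}$$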
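where the rewards are those obtained by following the actions of $\pi$ in each of the encountered states from $s$ onwards. Note that both $V_\pi$ and $Q_\pi$ are connected through the following equations:

$$V_\pi(s_t) = \mathbb{E}_\pi[Q_\pi(s_t, a_t) \mid s_t = s, a_t = a \sim \pi(s)] \tag{2.8}$$

$$Q_\pi(s_t, a_t) = \mathbb{E}_\pi[r_{t+1} + \gamma V_\pi(s_{t+1}) \mid s_t = s, a_t = a] \tag{2.9}$$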
where the key difference lies in the fact that $Q_\pi$ calculates the expected return assuming that the immediate action will be $a_t$, which determines the next state $s_{t+1} \sim \mathcal{P}(s_t, a_t)$ and the associated reward $r_{t+1} = \mathcal{R}(s_t, a_t, s_{t+1})$; whereas $V_\pi$ does not presume any action in its return estimation, since this selection depends on the current behavior of the policy $\pi$.
In addition to these two value estimators, a new function can be considered: the advantage function, $A_\pi(s, a)$. This function quantifies how good or bad a certain action $a$ taken in state $s$ is in relation to the expected value $V_\pi(s)$ in that state, i.e.,

$$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t \mid s_t = s, a_t = a) - V_\pi(s_t \mid s_t = s) \tag{2.10}$$
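For a single state with a handful of discrete actions, Equations (2.8) and (2.10) reduce to a weighted average and a subtraction. The sketch below illustrates this under the assumption of a tabular setting; the function names and the numbers in `q_s` and `pi_s` are made up for the example.

```python
def state_value(q_values, policy_probs):
    """V_pi(s) as the expectation of Q_pi(s, a) over a ~ pi(.|s), Equation (2.8)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))


def advantages(q_values, policy_probs):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s) for every action, Equation (2.10)."""
    v = state_value(q_values, policy_probs)
    return [q - v for q in q_values]


# Hypothetical state with three actions.
q_s = [1.0, 2.0, 4.0]      # Q_pi(s, a) for a = 0, 1, 2
pi_s = [0.5, 0.25, 0.25]   # pi(a|s)
v_s = state_value(q_s, pi_s)
adv_s = advantages(q_s, pi_s)
mean_adv = sum(p * a for p, a in zip(pi_s, adv_s))
```

A positive advantage marks an action that is better than what the policy achieves on average in that state, and the advantages averaged under $\pi$ cancel out.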
Last but not least, a policy $\pi$ is considered to be better than another policy $\pi'$ if its expected return is greater, that is, $\pi \geq \pi'$ if and only if $V_\pi(s) \geq V_{\pi'}(s)$ for every state $s$. In this regard, there is always a policy that is equal to or better than the rest of the policies, named the optimal policy $\pi^*$. Analogously, there are optimal value functions representing the best returns that can be expected from each state $s$ when following the optimal policy $\pi^*$ thereafter, i.e.,

$$\begin{aligned} V^*(s_t) &= \max_\pi V_\pi(s_t \mid s_t = s) \\ Q^*(s_t, a_t) &= \max_\pi Q_\pi(s_t, a_t \mid s_t = s, a_t = a) \end{aligned} \tag{2.11}$$
2.1.5 On-policy VS Off-policy

In RL, a wide range of algorithms can be found. One of the criteria to opt for one scheme is the strategy for using the data during training, commonly categorised as on-policy or off-policy.

On-policy techniques attempt to improve the policy that is being used to interact with the environment. Because of that, they can only use data that are representative of the current policy, $\pi_t$, which precludes the use of any data gathered with a different policy, including any previous version of the policy $\pi_{t-1}, \pi_{t-2}, \dots$ Hence, they tend to be less sample efficient yet more stable in the learning process. Within this group we can find SARSA (Rummery & Niranjan, 1994), REINFORCE and Trust Region Policy Optimization (TRPO) (Schulman, Levine, et al., 2017), among others.
On the other hand, off-policy methods learn a target policy with data generated by a different policy, known as the behavior policy. In that case, the learning is said to be carried out from experiences "off" the target policy. Consequently, these algorithms exhibit better sample efficiency, but are prone to overestimation and instabilities during training. The most common off-policy algorithms are Q-learning (Watkins & Dayan, 1992) and its extended DL approach, DQN (Mnih et al., 2015), together with other approaches that were built on top of DQN like Double DQN (van Hasselt et al., 2015), Dueling DQN (Z. Wang et al., 2016) and C51 (Bellemare et al., 2017). Nevertheless, other popular and effective algorithms unrelated to DQN have also been proposed, such as Deterministic Policy Gradients (DPG) (Silver et al., 2014), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), Twin Delayed DDPG (TD3) (Fujimoto et al., 2018) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018).
2.1.6 Value-based VS Policy-based

Regarding the procedure to obtain the policy, RL algorithms can be divided into value-based or policy-based methods.

The first group, i.e., value-based methods, aims to learn a value function that evaluates the utility of each state (i.e., $V_\pi(s)$) and/or state-action pair (i.e., $Q_\pi(s, a)$). For this purpose, the objective is to minimize the difference between the predicted return of each state ($V_\pi(s_t)$ or $Q_\pi(s_t, a_t)$) and the actual target return ($G_t$). Note that the actual return calculation is subject to the experiences gathered by the agent (e.g., $\tau = \{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, \dots\}$), which might well not represent the optimal return and will result in value functions learned according to these suboptimal target values. More importantly, the trajectories collected for this purpose will be very diverse because they depend on the evolution of $\pi$ during training. Thus, the target return calculation will exhibit large variance and induce instabilities in the learning of the respective estimator function. To mitigate the possible variance (and bias) related issues, any of the following estimators can be adopted:
• Monte Carlo. All the rewards from the current state to the terminal state are included, $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$ It has no bias but exhibits variance problems.
• Temporal Difference error (TD-error). Only the current reward is considered and the rest is bootstrapped by using the value of the next state as an estimate of all the rewards to go, $G_t = r_{t+1} + \gamma V(s_{t+1})$. It copes well with the variance problem, but introduces a higher bias.
• n-step. It is the generalization of the TD-error ($n = 1$) to greater values of $n$. This means using the real rewards of the next $n$ time steps and bootstrapping from the state reached at step $t+n$ for the rest of the episode: $G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$. The larger the $n$, the less bias and the more variance; the lower the value of $n$, the higher the bias but the lower the variance.
• TD($\lambda$) can be explained as a way to average over the above-mentioned n-step updates. Therefore, it requires the calculation of all the $n$-step returns in order to assign them more or less weight afterwards: $G^\lambda_t = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$. The TD-error is also known as TD(0), as it equals the case $\lambda = 0$ with just the 1-step return.
For the sake of clarity, Figure 2.3 summarizes the strategies of $n$-step and TD($\lambda$).
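The estimators listed above can be compared on a short episode. The sketch below is illustrative only: it assumes a finite episode in which every $n$-step return reaching the terminal state equals the Monte Carlo return, so the tail of the infinite sum in $G^\lambda_t$ collapses into a single weight $\lambda^{T-t-1}$. The function names and the value estimates in `values` are assumptions made for the example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}: n real rewards, then bootstrap with V(s_{t+n}).

    rewards[i] is r_{i+1}; values[i] is V(s_i) and values[T] = 0 for the
    terminal state. If t + n reaches T this is the Monte Carlo return.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
    return g + gamma ** (end - t) * values[end]


def lambda_return(rewards, values, t, gamma, lam):
    """Finite-episode lambda-return: weighted average of all n-step returns."""
    horizon = len(rewards) - t  # steps left until the terminal state
    g = 0.0
    for n in range(1, horizon):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # All returns with n >= horizon equal the Monte Carlo return.
    g += lam ** (horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)
    return g


# Hypothetical 3-step episode with a single final reward and rough value estimates.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]
one_step = n_step_return(rewards, values, 0, 1, 0.9)
monte_carlo = n_step_return(rewards, values, 0, 3, 0.9)
td_lambda = lambda_return(rewards, values, 0, 0.9, 0.5)
```

Setting `lam=0` recovers the 1-step TD target and `lam=1` recovers the Monte Carlo return, which matches the two extremes described in the text.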
Once the value function has been obtained, value-based methods distill their knowledge with some defined rules to build a policy. One approach is to learn an action-value function $Q(s, a)$ that closely approximates, if not exactly matches, the optimal action-value function $Q^*(s, a)$. Then, the agent can greedily choose the action that maximizes the return in each state:

$$a_t = \arg\max_a Q^*(s_t, a) \tag{2.12}$$
This strategy is known as greedy and is used to exploit and evaluate the knowledge. However, using it during training (prior to obtaining $Q^*(s, a)$) could lead to policies with suboptimal behaviors due to insufficient exploration. This is the reason why other mechanisms that influence the action selection are adopted (e.g., $\epsilon$-greedy⁶). In this group we can find algorithms like Q-learning (Watkins & Dayan, 1992), SARSA (Rummery & Niranjan, 1994) and the DQN family, among others (Bellemare et al., 2017; Mnih et al., 2015; van Hasselt et al., 2015; Z. Wang et al., 2016).
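Greedy selection (Equation (2.12)) and its $\epsilon$-greedy perturbation can be sketched for a tabular list of action values as follows. The function names, the `rng` argument and the values in `q` are assumptions made for illustration.

```python
import random


def greedy_action(q_values):
    """Index of the action with the highest estimated value, Equation (2.12)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


# Hypothetical action values for one state.
q = [0.1, 0.7, 0.2]
rng = random.Random(0)
exploit_only = [epsilon_greedy_action(q, 0.0, rng) for _ in range(20)]
explore_only = [epsilon_greedy_action(q, 1.0, rng) for _ in range(200)]
```

With `epsilon=0.0` the agent always picks action 1, whereas with `epsilon=1.0` the choice is uniformly random and ignores the value estimates.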
⁶ Refers to a strategy where the agent selects a random action with probability $\epsilon \in [0,1]$ and the greedy action with probability $1-\epsilon$, balancing exploration and exploitation through the parameter $\epsilon$.

Figure 2.3: (Left) Spectrum of possible TD estimators from 1-step up to Monte Carlo (until termination of the episode); in between, the n-step calculations are placed. The return estimator is calculated with the $n$ real rewards and then the estimated value of the $n$-th next state. (Right) TD($\lambda$) diagram used to weight the n-step returns (when adopted). $\lambda = 0$ corresponds to using only the 1-step TD, whereas $\lambda = 1$ considers only the Monte Carlo update.

On the other hand, policy-based methods parameterize and optimize the policy directly without the need for a value function. Policies can be learned either by derivative-free methods such as genetic algorithms (Mirjalili, 2019) (recently compared with RL solutions (Martinez et al., 2021)) or by policy gradient schemes. In all these methods, the objective is to maximize the performance via a fitness score (used for evaluation) or by directly maximizing the return, $J(\theta) = \mathbb{E}_\pi[G_t]$⁷. Additionally, policy gradient algorithms can handle both discrete and continuous action spaces. Continuous actions can be more difficult to work with because it is not feasible to explicitly represent the value of every possible action, as there are infinitely many of them. As a consequence, they are parameterized either by discretizing the range of possible action values into a finite number of values, or by using statistical distributions (e.g., Gaussian) from which the agent can sample specific values.
⁷ $\theta$ refers to the parameters that compose the policy $\pi$.

Overall, any value-based or policy-based method can result in deterministic or stochastic policies. Indeed, in value-based methods the agent learns the value of each action. Then, it usually selects the action with the highest value, leading to a deterministic policy. However, this can be bypassed by means of methods that perturb the action selection process (e.g., the $\epsilon$-greedy strategy) or by parameterizing the output values with a soft-max function to generate a distribution, resulting in a stochastic policy⁸:

$$\pi(a|s) = \frac{\exp(Q(s, a))}{\sum_k \exp(Q(s, k))} \tag{2.13}$$
where $k$ runs over the possible actions in $\mathcal{A}$ and the probabilities of selecting each action sum to 1, $\sum_k \pi(a_k|s) = 1$. On the other hand, in policy-based methods the agent learns a probability distribution over the actions composing a discrete action space (or a distribution per action in continuous action spaces), and then samples from that distribution to select an action.
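A soft-max over a vector of per-action outputs, followed by sampling, can be sketched as below. This is an illustrative example rather than the implementation used in the Thesis; subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the resulting probabilities unchanged, and the names `softmax` and `sample_action` are assumptions.

```python
import math
import random


def softmax(preferences):
    """Turn per-action outputs into probabilities that sum to 1 (Equation (2.13))."""
    m = max(preferences)  # shift for numerical stability
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = softmax([1.0, 2.0, 3.0])   # hypothetical outputs for three actions
uniform = softmax([5.0, 5.0])      # equal outputs give equal probabilities
action = sample_action(probs, random.Random(0))
```

Sampling from `probs` gives a stochastic policy; taking the index of the largest probability instead would make the same outputs deterministic.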
2.1.6.1 Policy Gradient methods

Policy gradient methods maximize the expected total reward by estimating the gradient, which can be obtained by differentiating the following objective:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\psi_t \log \pi_\theta(a_t|s_t)\right] \tag{2.14}$$

which results in the popular formalization of the gradient as:

$$\hat{g} = \hat{\mathbb{E}}_t\left[\sum_{t=0}^{\infty} \psi_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \tag{2.15}$$

where $\psi$ can be estimated in various ways (Schulman et al., 2015), see Table 2.1, similar to the estimators previously mentioned for value-based methods.
Table 2.1: Different $\psi$ estimators (Schulman et al., 2015) that can be used to compute the gradient in policy gradient methods as presented in Equation (2.15).

| $\psi$ | Description |
|---|---|
| $\sum_{t=0}^{T} \gamma^t r_{t+1}$ | Total reward of the trajectory from the initial state ($s_t \mid t=0$), Equation (2.3) |
| $\sum_{t=t_i}^{T} \gamma^t r_{t+1}$ | The total reward from a time step ($t_i$) onward, "reward-to-go", Equation (2.3) |
| $\sum_{t=t_i}^{T} \gamma^t r_{t+1} - b(s_{t_i})$ | A baseline (i.e., an average return over trajectories or a parallel $V_\pi$) |
| $Q_\pi(s_t, a_t)$ | State-action value function, Equation (2.5) |
| $A_\pi(s_t, a_t)$ | Advantage function, Equation (2.10) |
| $r_{t+1} + V_\pi(s_{t+1}) - V_\pi(s_t)$ | TD-residual |
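The first three rows of Table 2.1 only need the rewards of one trajectory (plus a baseline). The sketch below computes them with the indexing used in the table, where the discount exponent is counted from the start of the episode; the function names, the reward list and the baseline values are assumptions made for the example.

```python
def total_discounted_reward(rewards, gamma):
    """First row of Table 2.1: sum over the whole trajectory, rewards[t] = r_{t+1}."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def rewards_to_go(rewards, gamma):
    """Second row: for every start step t_i, the sum of gamma**t * r_{t+1} for t >= t_i."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += gamma ** t * rewards[t]
        out[t] = running
    return out


def subtract_baseline(psi, baselines):
    """Third row: reward-to-go minus a state-dependent baseline b(s_{t_i})."""
    return [p - b for p, b in zip(psi, baselines)]


rewards = [1.0, 0.0, 2.0]            # hypothetical trajectory rewards
total = total_discounted_reward(rewards, 0.5)
to_go = rewards_to_go(rewards, 0.5)
centered = subtract_baseline(to_go, [1.0, 0.5, 0.5])  # made-up baseline values
```

The reward-to-go drops the rewards collected before $t_i$, which the action taken at $t_i$ cannot have influenced, and the baseline recentres what remains.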
⁸ Note that by virtue of generating a distribution, an agent will sample different values even for the same state due to the randomness in the sampling distribution. Nonetheless, the outcome can be set to be deterministic by selecting the action with the highest selection probability (Sutton & Barto, 2018).

At this point it is important to highlight that $\hat{g}$ is calculated based on experiences belonging to a trajectory, whose probability depends not only on the initial state ($s_0$) and the transition probability function ($\mathcal{P}$), but also on the current policy ($\pi_t$) and the subsequent action probabilities:

$$\begin{aligned} p(\tau|\pi_t) = {} & p(s_0) \cdot \pi_t(a_0|s_0) \\ & \cdot \mathcal{P}(s_1|s_0, a_0) \cdot \pi_t(a_1|s_1) \\ & \cdot \mathcal{P}(s_2|s_1, a_1) \cdot \pi_t(a_2|s_2) \\ & \dots \\ & \cdot \mathcal{P}(s_T|s_{T-1}, a_{T-1}) \cdot \pi_t(a_T|s_T) \end{aligned} \tag{2.16}$$
Therefore, once the policy is updated ($\pi_t \neq \pi_{t+1}$) the probability of sampling the same $\tau$ also changes, which leads to very different experiences and, consequently, to highly variable returns. In fact, some approaches (Espeholt et al., 2018; Horgan et al., 2018; Mnih et al., 2016; Stooke & Abbeel, 2019) use multiple parallel agents to calculate expectations on more diverse batches of experiences, which ends up stabilizing the variance of the gradient updates:

$$\hat{g} = \hat{\mathbb{E}}_t\left[\sum_{\tau \in \mathcal{D}_w} \sum_{t=0}^{\infty} \psi_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \tag{2.17}$$
where $w$ is the number of parallel agents and $\mathcal{D}_w$ the set of all the trajectories collected by these agents.
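To make the dependence of Equation (2.16) on the policy concrete, the sketch below evaluates the probability of one fixed trajectory under two different tabular policies. It works in log-space to avoid underflow on long trajectories; the two-state MDP, the dictionaries `init_probs`, `transition`, `policy_old` and `policy_new`, and the function name are all hypothetical.

```python
import math


def trajectory_log_prob(trajectory, init_probs, transition, policy):
    """log p(tau | pi) for tau = [(s_0, a_0), (s_1, a_1), ...], Equation (2.16)."""
    s0, a0 = trajectory[0]
    logp = math.log(init_probs[s0]) + math.log(policy[s0][a0])
    prev_s, prev_a = s0, a0
    for s, a in trajectory[1:]:
        logp += math.log(transition[(prev_s, prev_a)][s])  # P(s | prev_s, prev_a)
        logp += math.log(policy[s][a])                     # pi(a | s)
        prev_s, prev_a = s, a
    return logp


# Hypothetical MDP with two states and two actions.
init_probs = {0: 1.0}
transition = {(0, 1): {0: 0.2, 1: 0.8}}
policy_old = {0: [0.5, 0.5], 1: [0.5, 0.5]}
policy_new = {0: [0.1, 0.9], 1: [0.9, 0.1]}
tau = [(0, 1), (1, 0)]

p_old = math.exp(trajectory_log_prob(tau, init_probs, transition, policy_old))
p_new = math.exp(trajectory_log_prob(tau, init_probs, transition, policy_new))
```

The environment terms are identical in both cases, so the change from `p_old` to `p_new` comes exclusively from the policy update.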
The most basic approach is called REINFORCE (Williams, 1992) and resorts to $\psi = \sum_{t=0}^{T} \gamma^t r_{t+1}$ for the policy update. Later works, coined REINFORCE with baseline or Vanilla Policy Gradient (VPG), introduced $\psi = \sum_{t=t_i}^{T} \gamma^t r_{t+1} - b(s_{t_i})$, where a baseline $b_t(s_t) \approx V_\pi(s_t)$ was used in order to mitigate high-variance gradient updates. Nevertheless, the most adopted $\psi$ since its publication has been the Generalized Advantage Estimation (GAE) (Schulman et al., 2015), which is also the one employed in this Thesis.
Generalized Advantage Estimation

Analogously to TD($\lambda$), GAE is defined as an exponentially-weighted estimator of the advantage function (instead of the value function in TD($\lambda$)). In that context, the TD-residual of the value function is defined as $\delta^V_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$, which can be considered an estimate of the advantage when executing an action $a_t$ that provides a reward $r_{t+1}$ and a new state $s_{t+1}$.
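The TD-residual defined above can be computed for every step of a stored episode as sketched here. Only $\delta^V_t$ is covered; the function name, the two-step episode and the value estimates are assumptions made for the example.

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t) for t = 0, ..., T-1.

    rewards[t] is r_{t+1}; values holds T + 1 entries, the last one being the
    value of the terminal state (0 by convention).
    """
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]


rewards = [0.0, 1.0]  # hypothetical two-step episode
consistent = td_residuals(rewards, [0.5, 1.0, 0.0], 0.5)  # values match the returns
uninformed = td_residuals(rewards, [0.0, 0.0, 0.0], 0.5)  # all-zero value estimates
```

When the value estimates already agree with the observed returns the residuals vanish; with uninformed estimates the residual is non-zero at the step where the reward arrives.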
Similarly to the n-step target estimator, now we can calculate multiple advantage estimators by taking into account k-steps of the returns minus
been released such as Sonic (Nichol et al., 2018), MiniGrid (Chevalier-Boisvert et al., 2018), Obstacle Tower Challenge (Juliani et al., 2019), NetHack (Küttler et al., 2020), Procgen (Cobbe, Hesse, et al., 2020) and XLand (Team et al., 2021), among others. Besides generalization, in the same way as singleton benchmarks, each PCG environment poses its own particular challenges too, such as sparse rewards to analyze the sample efficiency mentioned in the previous chapter.
Throughout this Thesis some hard-exploration mazes from MiniGrid (Chevalier-Boisvert et al., 2018) are employed, where the agent has a partial egocentric view (POMDP) of the environment and its objective is to reach a given destination. The configuration of each level is different even though the task is kept fixed. See some examples in Figure 2.7. The tasks employed in this Thesis are deemed sparse reward problems because the agent only gets a non-zero reward when accomplishing the goal, i.e.,

$$\mathcal{R}(s_t, a_t, s_{t+1}) = \begin{cases} 1 - 0.9 \cdot \dfrac{t}{t_{max}}, & \text{if } t < t_{max} \text{ and } s_{t+1} \text{ is terminal} \\ 0, & \text{otherwise} \end{cases} \tag{2.23}$$
where $t_{max}$ is the maximum number of steps per episode in each problem/task. Note that the probability of achieving the goal by chance is too small to learn a valid policy with any state-of-the-art (SOTA) RL algorithm. Further details can be found later in this manuscript when those environments are employed as benchmarks.
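Equation (2.23) translates directly into a small function. The sketch below is a standalone illustration and does not call the MiniGrid library; the function name and the boolean flag `reached_goal`, which stands in for "$s_{t+1}$ is terminal", are assumptions.

```python
def sparse_goal_reward(t, t_max, reached_goal):
    """Reward of Equation (2.23): non-zero only when the goal is reached in time."""
    if reached_goal and t < t_max:
        return 1.0 - 0.9 * t / t_max
    return 0.0


fast = sparse_goal_reward(0, 100, True)       # goal reached immediately
slow = sparse_goal_reward(50, 100, True)      # goal reached halfway through
no_goal = sparse_goal_reward(50, 100, False)  # any step that does not reach the goal
timeout = sparse_goal_reward(100, 100, True)  # step budget exhausted
```

The reward shrinks linearly with the number of steps taken, so faster solutions are preferred, while every non-terminal step returns exactly zero.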
2.3 Exploration Strategies

When should agents explore? This is a relevant question that is still unsolved and apparently highly problem dependent (Pîslar et al., 2022). The exploration-exploitation dilemma becomes fundamental in sparse reward formulations where the probability of getting valuable feedback from the environment is close to zero in almost all cases, $p(G_t = r_{t+1} + \gamma r_{t+2} + \dots \neq 0) \approx 0$, which leads to a huge amount of uninformative interactions. In this context, acting greedily (exploiting the information that the agent already knows) is synonymous with failure or very poor performance. Hence, exploration becomes essential. In the literature, two main exploration strategies can be listed (Thrun, 1992): undirected exploration and directed exploration.
The undirected exploration strategies focus on injecting randomness into the action selection to promote the discovery of new states without taking into account the information of the environment. Typically, they tend to be simple and give good results in small state spaces and dense reward formulations, although they struggle and become inefficient in the opposite situations. This category includes random walks (Anderson, 1986; Nguyen & Widrow, 1989), $\epsilon$-greedy (Sutton, 1995; Watkins & Dayan, 1992; Whitehead & Ballard, 1991) and Boltzmann distribution strategies (Cesa-Bianchi et al., 2017; L.-J. Lin, 1992; Sutton, 1990)¹¹.

Figure 2.7: Rendering of PCG MiniGrid's MultiRoom-N7-S8 ≡ MN7S8 (top row), KeyCorridor-S3-R3 ≡ KS3R3 (middle row) and ObstructedMaze-2Dl ≡ O2Dl (bottom row) environments across three different levels. Each episode is generated with a different seed so that the configuration of objects and the initial spawn position (and orientation) of the agent are different. As a consequence, a huge number of diverse levels of the same tasks can be generated.
¹¹ These methods always use some kind of parameter, $\epsilon$ (in $\epsilon$-greedy) or $\tau$ (Boltzmann), to define the probability/frequency of selecting the greedy action or a random one. Just to clarify, the Boltzmann (or Gibbs) distribution can be seen as a soft-max distribution (Equation (2.13)) over the possible $Q(s_t, \cdot)$-values/probabilities given by $\pi(\cdot|s_t)$, where the distribution is subject to an energy/temperature factor, $\tau$: $\exp\left(\frac{Q(s,a)}{\tau}\right) / \sum_k \exp\left(\frac{Q(s,k)}{\tau}\right)$.

In contrast, directed exploration techniques memorize exploration-specific knowledge to guide the future behavior of the agent. The Upper Confidence Bound (UCB) (Auer et al., 2002) was one of the first approaches to implement this by estimating the expected return along with a measure of the uncertainty of each action:

$$a_t = \arg\max_a \left[ Q(s_t, a) + c \sqrt{\frac{\ln(t)}{N_t(a)}} \right] \tag{2.24}$$

where the first term, $Q(s_t, a)$, stands for the expected return, whereas the second term, $\sqrt{\ln(t) / N_t(a)}$, specifies the uncertainty of selecting an action ($a$) considering the number of times ($N_t$) that action was taken until that time step ($t$). That is, the first component aims to select the action that leads to the highest return (exploitation), whereas the second promotes the selection of actions that have been selected fewer times (exploration). Such exploration-exploitation trade-off is ultimately controlled by the hyperparameter $c \geq 0$. This idea fostered the proposal of Intrinsic Motivation (IM) methods, recently centered on generating intrinsic rewards to explore and discover new behaviors more efficiently, which is of utmost importance in sparse reward settings to learn the optimal policy with the minimum amount of agent-environment interactions.
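The UCB rule of Equation (2.24) can be sketched for a single state (a bandit-like setting) as follows. Selecting untried actions first, to avoid a division by zero when $N_t(a) = 0$, is a common convention assumed here rather than something stated in the text; the function name and the numbers are likewise illustrative.

```python
import math


def ucb_action(q_values, counts, t, c):
    """Action maximizing Q(a) + c * sqrt(ln(t) / N_t(a)), Equation (2.24)."""
    for a, n in enumerate(counts):
        if n == 0:  # assumed convention: try every action once first
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])


q = [1.0, 0.9]     # hypothetical value estimates
counts = [100, 1]  # action 1 has barely been tried
exploit = ucb_action(q, counts, t=101, c=0.0)
explore = ucb_action(q, counts, t=101, c=2.0)
untried = ucb_action(q, [0, 5], t=5, c=2.0)
```

With `c=0.0` the rule is purely greedy and picks action 0; with `c=2.0` the uncertainty bonus of the rarely tried action 1 outweighs its slightly lower estimate.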
Below, some of the most popular IM approaches, which are discussed in the following Chapters, are detailed. Thereafter, Imitation Learning (IL) is also explained, and further discussed in Chapter 5, as an alternative approach when expert demonstrations are available.
2.3.1 Intrinsic Motivation

By letting the agent explore the environment for its inherent satisfaction rather than for other exogenous stimuli, new behaviors emerge. In fact, this is related to psychology and to how babies can learn different skills in the early stages of their life without additional feedback from the world (Grigorescu, 2020; Oudeyer et al., 2016; Ryan & Deci, 2000). IM methods, also referred to as curiosity or novelty, endow the agent with the ability to learn behaviors that are separate from its main task (Aubret et al., 2019) (task-agnostic exploration/behavior). This property becomes particularly interesting in the absence of explicit feedback from the primary task, as the agent is encouraged to learn a secondary goal (intrinsic goal) that will eventually drive it to achieve the main objective (extrinsic goal). This idea is formalized in an intrinsic reward ($r^i_t$) that is combined with the extrinsic reward provided by the environment ($r^e_t$) at each time step $t$ through a weighting factor $\beta$:

$$r_t = r^e_t + \beta r^i_t. \tag{2.25}$$

In this context, several approaches can be found in the literature to generate the exploration bonuses.
Count-based methods

One mechanism to generate the aforementioned intrinsic rewards is to adopt a visitation count strategy, also known as count-based methods. Similar to UCB's exploration component (Equation (2.24)), the rationale is that the agent should be less curious in those states with less novelty. That is, the exploration bonus decreases with the number of times ($N(s_t)$) a given state ($s$) has been visited. The most common approach is to define $r^{counts}_t = 1/N(s_t)^{1/2} = 1/\sqrt{N(s_t)}$ (Strehl & Littman, 2008), although other alternatives without the square root (Kolter & Ng, 2009) or with other exponents to get the desired bonus decay (i.e., how smoothly the magnitude decreases, see Figure 2.8) can also be utilized.
Figure 2.8: Visitation count bonus decay for different exponent values, $\beta / N(s_t)^{exp\_value}$, over 1000 consecutive visits. The magnitude is proportional to the selected numerator value, usually weighted with a parameter $\beta$. The particular case of $\beta = 1$ is illustrated.
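A count-based bonus and its combination with the extrinsic reward (Equation (2.25)) can be sketched for hashable, discrete states as follows. The class name `VisitCounter`, its `exponent` argument and the example state are assumptions made for illustration.

```python
from collections import defaultdict


class VisitCounter:
    """Tabular visitation counts with an intrinsic reward of 1 / N(s)**exponent."""

    def __init__(self, exponent=0.5):
        self.exponent = exponent          # 0.5 gives the 1 / sqrt(N(s)) bonus
        self.counts = defaultdict(int)

    def intrinsic_reward(self, state):
        """Register one visit to `state` and return its current bonus."""
        self.counts[state] += 1
        return 1.0 / self.counts[state] ** self.exponent


def combined_reward(r_ext, r_int, beta):
    """r_t = r^e_t + beta * r^i_t, Equation (2.25)."""
    return r_ext + beta * r_int


counter = VisitCounter(exponent=0.5)
bonuses = [counter.intrinsic_reward((2, 3)) for _ in range(4)]  # same cell, 4 visits
r_t = combined_reward(0.0, bonuses[-1], beta=0.1)
```

The bonus starts at 1 on the first visit and has halved by the fourth one, so revisiting the same state becomes progressively less rewarding.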
This is a simple, yet effective, solution to quantify the degree to which a state is unknown to the agent. However, this is only possible when dealing with discrete state spaces. In contrast, more complex domains with continuous state spaces require other solutions. One option is to discretize the space by creating tiles/bins that embed multiple values at once. Other alternatives have been fruitfully applied: density models to measure the uncertainty and henceforth compute the bonus (Bellemare et al., 2016; Ostrovski et al., 2017), hashes to encode the states in a discrete manner (Tang et al., 2017) or successor representations to leverage similarities for the exploration bonus generation (Machado et al., 2019).
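The tile/bin option mentioned above can be sketched with a uniform grid: nearby continuous states fall into the same bin and share one visitation count. This is plain binning, not the hashing or density models of the cited works; `discretize`, `bin_width` and the example states are hypothetical.

```python
import math
from collections import Counter


def discretize(state, bin_width):
    """Map a continuous state (sequence of floats) to a tuple of bin indices."""
    return tuple(math.floor(x / bin_width) for x in state)


counts = Counter()
for state in [(0.12, -0.49), (0.20, -0.30), (0.90, 0.10)]:
    counts[discretize(state, 0.25)] += 1

shared_bin = discretize((0.12, -0.49), 0.25)
bonus_shared = 1.0 / math.sqrt(counts[shared_bin])  # two visits fell in this bin
```

The bin width controls the trade-off: wide bins generalize counts across many states, while narrow bins behave like an almost empty table in which every state looks novel.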
Prediction-error methods

On the other hand, the intrinsic reward can be computed as the prediction error when predicting the consequence of an agent's action in the environment; that is, measuring the predictability of the changes in the environment. The intuition in these methods is clear: the better the prediction, the more often that situation is likely to have been encountered and the lower the novelty bonus should be.
The Intrinsic Curiosity Module (ICM) (Pathak et al., 2017) was a game changer and distinguished itself from other previous prediction approaches (Houthooft et al., 2017; Stadie et al., 2015) because it focuses on a smaller feature space to compute the expected changes that affect the prediction. Such a feature space is built to model the transitions between consecutive steps that were controlled by the agent or that directly affect it, while ignoring the rest. This was accomplished by using an inverse dynamics model in a self-supervised manner to predict the agent's action ($\hat{a}_t$) given the current ($\phi(s_t)$) and next state ($\phi(s_{t+1})$) embeddings, so that only things affecting the agent were modeled to obtain the desired feature space. At the same time, that embedding ($\phi(s_t)$) together with the actual action ($a_t$) is used to train a forward dynamics model (Stadie et al., 2015) that predicts the feature representation of the next state ($\hat{\phi}(s_{t+1})$), which is ultimately compared against the latent representation of the next state in the previously modeled feature space ($\phi(s_{t+1})$) to compute the intrinsic reward ($r^i_t$), see Figure 2.9.
Figure 2.9: Intrinsic Curiosity Module (ICM) (Pathak et al., 2017), where the generation of the intrinsic reward $r^i_t$ is illustrated. The intrinsic reward is computed as the prediction error in the feature space of the next state, that is, the difference between $\hat{\phi}(s_{t+1})$ and $\phi(s_{t+1})$ given $s_t$, $s_{t+1}$ and $a_t$.
Later on, Burda, Edwards, Pathak, et al. (2018) conducted a large-scale study based on these prediction errors over 54 environments without any extrinsic reward (purely guided by intrinsic behaviors), in which they analyzed the efficacy of using various feature learning methods. In other words, they investigated the effect of using different feature spaces $\phi(\cdot)$, such as relying on pixels, random features, variational autoencoders (Kingma & Welling, 2014) and the previously introduced inverse model (Pathak et al., 2017). One important remark is that they brought up the noisy-TV problem of this kind of algorithms: the agents tend to be attracted by stochastic dynamics of the environment, which was clearly exemplified by introducing into the environment a TV that changed channels randomly and independently of the agent's actions. In order to solve this issue, Pathak et al. (2019) proposed the use of an ensemble of forward dynamics models so that the reward was computed taking into account the variance of their next-state predictions; hence, they are not sensitive to the agent's impact on the environment changes but to the parts of the environment that have been extensively or scarcely explored (the more a state has been visited, the less the disagreement between the outcomes of all the forward models and the lower the variance, even in a stochastic situation). Another idea is to use an episodic memory so that the distance/proximity (referred to as reachability in the paper) between past instances and the current state can be measured (Savinov et al., 2019); in other words, how many steps away the agent is from experiencing those situations again. The episodic novelty module idea was extended and combined with a life-long novelty module so that curiosity was modulated both across the episode and across the whole training, yielding new SOTA results in some benchmarks (Badia, Sprechmann, et al., 2020).
Random Network Distillation (RND) (Burda, Edwards, Storkey, et al., 2018) deserves special mention, as it became popular due to its simplicity and good performance. This is the reason why it was picked over other prediction-error methods for this Thesis. In this strategy, two neural networks are required: a target $\phi(\cdot)$ and a predictor $\hat{\phi}(\cdot)$. Both of them are initialized randomly and the target's parameters are frozen thereafter. The predictor's goal is to mimic the target network's output, so that the outcomes are as close as possible. Therefore, the intrinsic reward measures the closeness through $\|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\|$. As the predictor keeps learning to imitate the target, the intrinsic reward is supposed to become smaller and smaller as a reflection of the cumulative number of state visits, so that the curiosity concept of exploring novel states is satisfied. The authors identify three main factors as relevant sources of prediction errors:
• Factor 1. Prediction error is high when the predictor fails to generalize from previously seen data.
• Factor 2. Prediction error is high because the target is stochastic.
• Factor 3. Prediction error is high because necessary information for the predictor is not given (or the model capacity is too limited to accurately predict the target).
The last two factors can induce the aforementioned noisy-TV problem. Hence, RND was designed to overcome those undesired properties by fixing the prediction problem with a deterministic target and having two replicas of the same ANN architecture, so that the prediction error is not limited by the model capacity or architecture.
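The mechanics of RND can be illustrated with a deliberately tiny stand-in: two random linear maps instead of deep networks, trained with plain stochastic gradient descent. This is a sketch under those assumptions, not the architecture of Burda, Edwards, Storkey, et al. (2018) nor the one used in this Thesis; `LinearRND`, its hyperparameters and the two example states are hypothetical.

```python
import random


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def forward(weights, x):
    """Linear map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class LinearRND:
    def __init__(self, state_dim, feature_dim, lr=0.1, seed=0):
        rng = random.Random(seed)
        self.target = random_matrix(feature_dim, state_dim, rng)     # frozen
        self.predictor = random_matrix(feature_dim, state_dim, rng)  # trained
        self.lr = lr

    def _error(self, state):
        pred = forward(self.predictor, state)
        tgt = forward(self.target, state)
        return [p - t for p, t in zip(pred, tgt)]

    def intrinsic_reward(self, state):
        """Euclidean norm of predictor(s) - target(s)."""
        return sum(d * d for d in self._error(state)) ** 0.5

    def update(self, state):
        """One SGD step on 0.5 * ||predictor(s) - target(s)||**2; target untouched."""
        diff = self._error(state)
        for row, d in zip(self.predictor, diff):
            for j, xj in enumerate(state):
                row[j] -= self.lr * d * xj


rnd = LinearRND(state_dim=3, feature_dim=4)
seen, novel = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
before_seen = rnd.intrinsic_reward(seen)
before_novel = rnd.intrinsic_reward(novel)
for _ in range(50):  # the agent keeps visiting the same state
    rnd.update(seen)
after_seen = rnd.intrinsic_reward(seen)
after_novel = rnd.intrinsic_reward(novel)
```

After 50 visits the bonus for `seen` has almost vanished, while the bonus for the unvisited `novel` state is unchanged because, in this toy example, the two states are orthogonal.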
Last but not least, it is important to emphasize that when using intrinsic rewards the problem becomes bi-objective and the agent is accordingly going to optimize both goals¹². Nevertheless, unexpected behaviors can arise in these settings due to an excessive exploration that hinders the exploitation of the main task (Badia, Sprechmann, et al., 2020; Rosser & Abed, 2021; Taïga et al., 2020). Most of the approaches neither control nor balance the importance of the extrinsic and intrinsic components during training. This is based on the following assumptions:
• The scale of both rewards is very different: very low intrinsic values in comparison to the extrinsic ones. As a result, possible goal deviation occurs mainly in the absence of extrinsic rewards.
• Intrinsic rewards are non-stationary in nature. Their magnitude, regardless of $\beta$, decreases on average throughout the training as the state space is explored, resulting in an even larger difference between the two types of rewards/goals.
However, these assumptions are sometimes not enough and other solutions are required. Examples include meta-learning approaches where the functions that parameterize the intrinsic rewards are influenced by the direction of the extrinsic gradient (Dai et al., 2022; Du et al., 2019; Z. Zheng et al., 2018) (ensuring that the main extrinsic objective is aligned with the exploration component too), while other frameworks propose to directly decouple the two goals into different agents (E. Z. Liu et al., 2021; Schäfer et al., 2022).
2.3.2 Imitation Learning

Another solution to overcome exploration problems is the use of expert demonstrations, which is also known as Imitation Learning (IL) and/or Learning from Demonstrations (LfD) (Hester et al., 2017; Vecerik et al., 2018). Within this framework, good (optimal or suboptimal) trajectories are assumed to be provided, $\tau^* = \{(s_0, a_0, r_0, s_1), (s_1, a_1, r_1, s_2), \dots\}$, so that the agent can use those tuples to pre-train or even master a policy in an online fashion, which prevents the agent from getting stuck in the early phases of the training (where no expertise has been developed yet). Nevertheless, key aspects such as differences in embodiment and observability between the expert and the learner make its successful application challenging (Osa et al., 2018). Depending on how the demonstrations are used to distill the knowledge, two ways of learning can be found: Behaviour Cloning (BC) and Inverse Reinforcement Learning (IRL).
¹² Recall that the agent maximizes the return (Equation (2.3)), in which the considered reward now has a new explorative component (Equation (2.25)).
On the one hand, BC (Bain & Sammut, 2001; Pomerleau, 1988; Torabi et al., 2018) seeks to learn a policy through a mapping strategy where a given input is associated with an action; that is, it just requires state-action tuples, $\tau^* = \{(s_0, a_0), (s_1, a_1), \dots\}$. Standard supervised learning methods such as the log loss function (which can be embedded within a Cross Entropy loss (Gneiting & Raftery, 2007)) are used to map the probability of selecting an action to the specified input, which increases its future probability of being preferred:

$$L_{BC} = -\frac{1}{|D|} \sum_{(s,a) \in D} \ln(\pi(a|s)) \tag{2.26}$$
where $D$ refers to a pool of data where the demonstrations are contained and from which the tuples are sampled. Nevertheless, these approaches suffer from compounding errors (Ross & Bagnell, 2010), derived from the fact that the policy being updated and the assumed expert policy that provides the samples exhibit different probabilities of collecting experiences. That is, a distribution shift exists in the sampling probability of trajectories (recall Equation (2.16)) between the policy that gathered the demonstrations and the policy that is being learned. Consequently, the future test data are influenced by the policy that is being learned, breaking the main assumption of most SL methods, which assume the data to be independent and identically distributed (recall Chapter 1, where we explained the differences between RL and SL). Therefore, one of the most popular BC algorithms to date, Dataset Aggregation (DAGGER) (Ross et al., 2011), proposed to aggregate additional online data to the dataset used for training ($D$), with the particularity that the visited states are subject to the learned policy distribution ($\pi(a_t|s_t) \rightarrow s_{t+1}$) but the stored action in each state is the expert's ($a^*_{t+1} \sim \pi^*(s_{t+1})$), so that $D \leftarrow D \cup \{s_{t+1}, a^*_{t+1}\}$.
Alternatively, IRL (Finn et al., 2016) aims to learn the hidden reward function from the provided experiences under the assumption that the demonstrations are optimal (or very close to optimal). It then uses that function to obtain rewards from which the agent's policy is learned, $\tau_0, \tau_1, \dots \rightarrow \mathcal{R}_h \approx \mathcal{R}$; $\hat{r}_t \sim \mathcal{R}_h(s, a)$¹³. These methods are highly sensitive to how well the reward function represents the desired (optimal) behavior. Within this taxonomy, adversarial IL methods can be taken into account too (Ho & Ermon, 2016; Ho et al., 2016), where the policy parameterizes a generative model that "creates" new experiences and the cost function (i.e., reward function) serves as an adversary.
In summary, the selection of one approach or the other will depend on whether BC's learned policy represents a valid mapping from states to actions or whether IRL's distilled reward function is valid to learn a suitable policy for the desired behavior. Furthermore, the criterion is also subject to the availability of a model that makes possible the use of dynamics information of the environment (Osa et al., 2018).
[13] For simplicity, the calculated reward function is shown to be dependent on the state and action, although it can also be subject to the next state.
Chapter 3
Collaborative Training between Heterogeneously skilled Agents in Environments with Sparse Rewards
Designing a reward function is one of the most challenging steps when formulating a problem that is meant to be solved with RL. As we have previously highlighted in Section 1.1, one way to overcome this cumbersome design is by using a single (sparse) reward signal that determines whether an RL task has been solved. In this context, the problem becomes more complex due to the lack of dense feedback signals that guide the learning process, ultimately hindering the correlation between successfully solving the task and the successive actions that lead to that outcome. To address this issue, a solution is to generate an exploration bonus (intrinsic reward) that promotes novelty (motivates the agent) within the environment. This approach encourages diverse behaviors and enables the discovery of valid solutions through exploration, thereby fostering goal achievement. The family of algorithms that can generate these bonuses is known as Intrinsic Motivation (IM) techniques, which have been introduced previously in Section 2.3.1 of Chapter 2. Their utility can be better understood from the intuition gained from the following real-world example:
A bike rider wants to descend a given mountain along the shortest path and as fast as possible. However, the rider does not know the mountain, and the only feedback signal will be received at the end of the route. Thus, the rider does not know whether the decision at a bifurcation is right, whether they get stuck close to the finish line, or even whether they spend too much time when compared to other bikers. Due to so much uncertainty without feedback signals, the agent (bike rider) should drive their decisions based on their own motivation and curiosity.
A first question arises when examining this real-world example: what
44 Chapter 3. Collaborative Training of Heterogeneous Agents
search space. This implies a two-sided competition where the non-skilled agent drives the skilled one into a longer path solution, whereas the skilled agent pushes the former to take the shortcut that is not reproducible for it. Consequently, negative transfer problems may well arise. This situation can be observed from their value estimates which, as shown in Figure 3.1.b, differ remarkably from each other at critical points (near the corridor). These issues can be understood even more clearly if the problem is represented as an MDP tree (Figure 3.2), in which the agents will have a shared view of the environment as long as they can reproduce the same trajectories. Nonetheless, some states will only be visited by one agent due to special capacities of its action space, generating an independent view of the problem for that particular agent.
[Figure 3.2 (diagram): MDP tree with states S0 to S7 as nodes and actions a0, a1, a2 as edges. Legend: independent view (skilled agent); shared view (both agents); S4 ≡ S7 (terminal state); $\{a_0, a_1, a_2\} \in A_{skilled}$; $\{a_0, a_1\} \in A_{non-skilled}$.]
Figure 3.2: Example of an MDP as a tree where states are represented with nodes and the edges denote actions. Some states (e.g., $S_3$) can be reached by being in a specific state and executing a certain action (e.g., $S_1 \xrightarrow{a_2} S_3$). This results in parts accessible and shared between agents (shared-view) and others that are restricted to the capacities of the agents (independent-view).
These problems are not limited to the example shown in the above plot, but extend to any scenario with heterogeneous agents. The contribution of this chapter is to expose this problem, and to sketch effective collaborative learning strategies under such circumstances.
3.3 Proposed Collaborative Framework
The design of the framework proposed in this chapter is rooted in the fact that there can be observations where the policy distributions of heterogeneous agents are very similar to each other. In some cases, both agents can push each other towards the same direction, i.e., $\pi_{skilled} \equiv \pi_{non-skilled}$. However, in other cases those distributions can differ from each other because each agent pushes in a different direction based on the optimal solution it has learned at that time. In this situation, we aim to strengthen the shared knowledge between both of them, yet at the same time, to avoid negative transfer in places where the optimal solutions of the agents are in conflict. Consequently, the goal of the framework is to learn a shared-knowledge view while respecting those subspaces in the environment where the interests of the agents are not the same.
As already explained in previous sections, in problems characterized by sparse rewards the main issue to deal with is an efficient exploration of the environment. The application of IM and on-policy techniques does not permit interfering in the action-sampling process directly, as the training experiences have to be representative of the current policy, i.e., $a \sim \pi(s)$. Hence, the use of past experiences or even samples collected by other policies is not tractable [3]. In this case, the policy is optimized as per Expression 2.14 where, aside from the inherent mechanism of the algorithm itself, the advantage estimator $\hat{A}_t$ is the main factor that eases and pushes the learning process [4]. This advantage estimator can be computed in different ways, but almost all of them are correlated to the reward $r_{t+1}$ and the value function $V(s_t)$ through the TD-error:
$$\delta = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \qquad (3.1)$$
whose value changes iteratively as soon as $V(s_t)$ gets updated. This process can be said to converge when $V(s_t) = V^*(s_t)$.
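As a worked illustration of Expression (3.1), the TD-error for a single transition could be computed as in the sketch below. The function name `td_error` and the `done` flag (which drops the bootstrap term at terminal states) are assumptions introduced here for illustration.

```python
def td_error(r_next, v_s, v_next, gamma=0.99, done=False):
    """TD-error: delta = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)."""
    # Assumption: no bootstrapping from a terminal next state.
    bootstrap = 0.0 if done else gamma * v_next
    return r_next + bootstrap - v_s


delta = td_error(r_next=1.0, v_s=0.5, v_next=2.0, gamma=0.9)
```

With these numbers, the error is 1.0 + 0.9 · 2.0 − 0.5 = 2.3, which shrinks as $V(s_t)$ is updated towards its target.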
The framework described in what follows aims at accelerating the learning process by focusing on the exploration part, more concretely on how to generate better advantages. For that purpose, we propose a framework driven by two different design objectives (DO):
• DO1: How to generate more accurate and faster state value estimates $V(s)$.
• DO2: How to modify the intrinsic reward generation process so that exploration is tackled more efficiently when dealing with heterogeneous action spaces.
Next, multiple methods are proposed to address these objectives within a collaborative framework (see Figure 3.3), so that the ablation studies in Section 3.5 can inform about the best options among the postulated methods. For simplicity, hereinafter we consider only 2 heterogeneous agents, skilled and non-skilled, although the approaches could be extended to work with more agents.
3.3.1 Centralized Learning with Decentralized Execution
Our framework adopts an actor-critic policy gradient architecture with two separated networks:
• An actor whose policy (one for each agent) is fed just with its local observations.
• A critic with two output heads related to the extrinsic ($V^e$) and intrinsic ($V^i$) signals, which is trained with the observations gathered by all the agents.
[3] Not tractable, at least theoretically, without any type of correction, such as importance sampling (Christianos et al., 2020; Schäfer et al., 2022).
[4] We assume $\psi = A_t$.
[Figure 3.3 (diagram): blocks labeled ENVIRONMENT, INTRINSIC MOTIVATION MODULE, ACTOR, CRITIC (heads $V^e$ and $V^i$), GAE, PPO loss and MSE loss. Other labels: observation $O(s_t, a_t) = o_{t+1}$; rewards $r^e_t$, $r^i_t$; returns $G^e$, $G^i$; $ACTOR_{ID}$; $\hat{A}_{total} = \hat{A}^e(s_t, a_t) + \beta \hat{A}^i(s_t, a_t)$.]
Figure 3.3: Flowchart of the collaborative framework, where we highlight in blue those modules that are usually performed independently for each agent, and that can be shared in our framework.
The core idea is to have a unique and centralized critic, so that its capabilities can be augmented with additional information corresponding to the different agents solely during the training phase. This strategy is also known in the literature as the centralized learning with decentralized execution (CLDE) paradigm (Foerster et al., 2017; Lowe et al., 2017). With this design, we aim to expedite the critic's learning process so as to generate more accurate and faster value estimates, contributing to DO1. Moreover, it gives rise to a scalable architecture which can easily take into account more agents with little additional complexity.
3.3.1.1 Decentralized Actors
In spite of using a centralized learning strategy, the behavior of each agent can be very similar yet not equal. As a consequence, each agent is parameterized by an independent actor [5].
[5] For practical purposes, their learning works in the same way as when being done independently. That is, the actor is trained only with data captured by itself, as it would be in a single-agent scheme.
As explained above, the benefit of CLDE relies on learning a faster and more accurate $V(s)$, which subsequently has a positive effect on $A(s, a)$, ultimately leading to an improved overall learning. However, the speed at which this is achieved depends on multiple factors. All this, coupled with the fact that intrinsic rewards ($r^i_t$) are transient and extrinsic feedback ($r^e_t$) is sparse, has increased the importance of introducing Monte Carlo updates to latch on to these signals rapidly (Bellemare et al., 2016; Ostrovski et al., 2017). In our framework, this is instead circumvented by using GAE (Schulman et al., 2015) and calculating two independent advantages for the extrinsic and intrinsic streams, $A^e(s, a)$ and $A^i(s, a)$, which are then blended as follows:
$$A(s, a) = A^e(s, a) + \beta A^i(s, a) \qquad (3.2)$$
This implies having extrinsic ($V^e$) and intrinsic ($V^i$) streams with their respective independent returns, which allows for a higher flexibility to combine episodic and non-episodic returns. It also enables the use of different discount factors (i.e., $\gamma_e$ and $\gamma_i$). Moreover, it is intuitively more suitable to separate both streams, as they are respectively stationary ($V^e$) and non-stationary ($V^i$) in nature. The extrinsic reward in a singleton environment has an associated $V^{e*}$ because the extrinsic reward function does not change throughout the learning process [6]. On the contrary, $V^{i*}$ will vary as the training evolves because the generated intrinsic rewards depend on a novelty measure that changes right after every interaction. Note that combining the extrinsic and intrinsic streams in this way is just another strategy (Burda, Edwards, Storkey, et al., 2018) that substitutes the naive idea of mixing both objectives in a weighted reward as in Equation (2.25).
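The two-stream advantage of Expression (3.2) could be sketched as follows: GAE is applied separately to each stream, possibly with different discount factors, and the results are blended with $\beta$. This is an illustrative sketch only; the helper names `gae` and `blended_advantage` and the default hyperparameter values are assumptions, and the episodic versus non-episodic treatment of each stream is not modeled.

```python
def gae(rewards, values, gamma, lam, last_value=0.0):
    """Generalized Advantage Estimation over one rollout (plain lists)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else last_value
        delta = rewards[t] + gamma * v_next - values[t]  # TD-error, as in (3.1)
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def blended_advantage(r_e, v_e, r_i, v_i, beta,
                      gamma_e=0.99, gamma_i=0.99, lam=0.95):
    """A(s, a) = A^e(s, a) + beta * A^i(s, a), with one GAE pass per stream."""
    a_e = gae(r_e, v_e, gamma_e, lam)
    a_i = gae(r_i, v_i, gamma_i, lam)
    return [e + beta * i for e, i in zip(a_e, a_i)]


# Sparse extrinsic reward at the end, dense intrinsic bonus at every step.
adv = blended_advantage(r_e=[0.0, 0.0, 1.0], v_e=[0.0, 0.0, 0.0],
                        r_i=[1.0, 1.0, 1.0], v_i=[0.0, 0.0, 0.0],
                        beta=0.5, gamma_e=1.0, gamma_i=1.0, lam=1.0)
```

Keeping the streams separate until this final sum is what allows each one to use its own discount factor and value head.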
3.3.1.2 Centralized Critic Module
Within collaborative learning, a problem that requires attention is that the value function estimates, $V(s)$, can be different among agents for the same state, although they might be equal or very similar at many other states of the same scenario (recall Figure 3.1). Based on this intuition, the value of a state should depend not only on the state itself, but also on the possible actions of the agents. Hence, we propose to use a centralized action-value function, $Q(s, a)$ which, as shown in Figure 3.4, is fed with the observations of all agents, producing the value estimate of selecting an action $a_t$ when being at state $s_t$. That is, instead of producing an estimation of the state value $V(s)$, the centralized module elicits all possible $Q(s, a)$ values for $a \in A_{skilled} \cup A_{non-skilled}$, regardless of the agent collecting the observation.
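[Figure 3.4 (diagram): the Centralized Critic Module receives the observations $o_{skilled}$ and $o_{non-skilled}$ from the environment and outputs $Q(o_t, a_n)$ for every action in $\{a_0, a_1, \ldots, a_N\} = A_{skilled} \cup A_{non-skilled}$. For each agent, $V_t = \sum_{a \in A} \pi(a|o_t) \cdot Q(o_t, a)$, with $\pi$ and $A$ being $\pi_{skilled}$ and $A_{skilled}$, or $\pi_{non-skilled}$ and $A_{non-skilled}$.]
Figure 3.4: Centralized critic module based on $Q(s, a)$ (instead of $V(s)$) for 2 agents with different action spaces ($A_{skilled}$, $A_{non-skilled}$). The image shows how $V_t(s)$ is calculated for each case.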
[6] We are not considering environments with stochastic transitions.
This architectural change of the critic module implies several considerations. To begin with, $A(s, a)$, which is one of the key components for the calculation of the actor's loss, commonly requires a value estimate, $V(s)$ (not $Q(s, a)$), to reduce its variance (Schulman et al., 2015). Therefore, we calculate different state values $V_x(s)$ for each agent by taking into account their action spaces, as follows:
$$V_x(s) = \sum_{a \in A_x} \pi_x(a|s) \cdot Q(s, a) \qquad (3.3)$$
where $x \in \{skilled, non\text{-}skilled\}$ and $\pi_x(a|s)$ denotes the probability of agent $x$ performing action $a \in A_x$ in state $s$. Thus, an agent not capable of executing a given action will have a zero probability for that given option. This can also be regarded as a way of masking possible outcomes.
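Expression (3.3) can be illustrated with a small numerical sketch, assuming the critic's outputs and each agent's policy are available as plain dictionaries. The function name `state_value` and the numbers below are hypothetical.

```python
def state_value(q_values, policy_probs, action_space):
    """V_x(s) = sum over a in A_x of pi_x(a|s) * Q(s, a)."""
    # Actions outside the agent's policy get probability 0 (masking).
    return sum(policy_probs.get(a, 0.0) * q_values[a] for a in action_space)


# One shared Q(s, .) over the union of both action spaces.
q = {"a0": 1.0, "a1": 2.0, "a2": 10.0}

v_skilled = state_value(q, {"a0": 0.2, "a1": 0.3, "a2": 0.5}, ["a0", "a1", "a2"])
v_non_skilled = state_value(q, {"a0": 0.5, "a1": 0.5}, ["a0", "a1"])
```

Although both agents read the same $Q(s, \cdot)$, the skilled agent obtains a higher state value here because only it can select the high-value action `a2`.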
Additionally, the critic loss is slightly modified to accommodate the multiple action-wise outputs, as opposed to the unique output neuron usually set when the critic directly estimates the value of the state itself. Namely:
$$L_{critic} = \frac{1}{T} \sum_{t=0}^{T} \left( Q(s_t, a_t) - \hat{Q}_t \right)^2, \qquad (3.4)$$
where $a_t$ is the action taken by the agent at time step $t$, and $\hat{Q}_t$ is a discounted return estimate of the $T$-length rollout over which the optimization step is performed.
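A minimal sketch of the loss in Expression (3.4) follows: only the output corresponding to the action actually taken contributes to the error at each step. The function name `critic_loss` is hypothetical, and the sketch averages over the number of samples in the rollout, which is an assumption about the normalization.

```python
def critic_loss(q_outputs, actions, q_targets):
    """Mean squared error between Q(s_t, a_t) and the return estimate."""
    errors = [
        (q_t[a_t] - target) ** 2  # select the output of the taken action
        for q_t, a_t, target in zip(q_outputs, actions, q_targets)
    ]
    return sum(errors) / len(errors)


# Two time steps, two action-wise outputs per step.
loss = critic_loss(q_outputs=[[0.0, 1.0], [2.0, 0.5]],
                   actions=[1, 0],
                   q_targets=[0.0, 2.0])
```

In an actual network, the outputs of the actions not taken receive no gradient from this term.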
Last but not least, the critic is updated with the tuples gathered by each agent individually, and executes an optimization step per collected batch of experiences:
$$B_{skilled} = \{(s_t, a_t, r_t), (s_{t+1}, a_{t+1}, r_{t+1}), \ldots, (s_{T-1}, a_{T-1}, r_{T-1})\} \sim \pi_{skilled}$$
$$B_{non-skilled} = \{(s_t, a_t, r_t), (s_{t+1}, a_{t+1}, r_{t+1}), \ldots, (s_{T-1}, a_{T-1}, r_{T-1})\} \sim \pi_{non-skilled}$$
As a consequence, the critic will take as many optimization steps in every training step as the number of agents at hand (in the considered case, 2 updates with $B_{skilled}$ and $B_{non-skilled}$).
Universal Value Function Approximator
An alternative to the previously proposed centralized critic is to adopt a so-called Universal Value Function Approximator (UVFA) design (Schaul et al., 2015), where the ANN is conditioned on additional parameters (e.g., on a given goal, $V(s, g)$). In the proposed framework, the value estimation is subject to the agent's capabilities:
$$V(s) \rightarrow V(s, actor_{id}) \qquad (3.5)$$
Indeed, with the previously mentioned action-value architecture modification, it will be $Q(s, a, actor_{id})$ as shown in Figure 3.5. Analogously to the procedure followed for the other critic architecture, advantages will be calculated with value estimates obtained as in Expression 3.3.
[Figure 3.5 (diagram): inputs $o_t$, $\{a_t, actor_{id}\}$ and $h_{t-1}$; blocks CNN, FLATTEN, FC, Recurrence module and FC; outputs $Q^e(o_t, a_t)$, $Q^i(o_t, a_t)$ and $h_t$.]
Figure 3.5: UVFA-based centralized critic, where the convolutional (and the following FC) layers extract common features for both types of agents. The rest of the network is parameterized subject to the skills of each agent.
The design is inspired by the idea that the feature extraction of an observation can be linked to an agent but not to the additional information that can be inferred from a sequence. In this latter case, it could be inconsistent due to the agents' different capabilities to generate their own divergent trajectories that might well not be reproducible by other agents. In order to address this inconsistency during the training stage, and to aid the network in gaining insights about what knowledge must be shared and what must be preserved for individual use, information about the skills is provided to the network as an input ($actor_{id}$) [7]. In addition, the action in every time step $a_t$ is also fed as an input, which can be useful to learn better temporal representations within the recurrent module. Other parameters such as the trade-off between intrinsic-extrinsic streams (i.e., the $\beta$ coefficient) or the collected rewards (i.e., $r^e_t$ and $r^i_t$) could also be advantageous (Badia, Sprechmann, et al., 2020). Nevertheless, the study is limited to the aforementioned parameters in order to avoid over-parameterized critic architectures.
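How the additional inputs $\{a_t, actor_{id}\}$ might be assembled can be sketched as below. The one-hot encoding of $actor_{id}$ follows the footnote, whereas the one-hot encoding of the action, the concatenation order and the names `ACTOR_IDS`, `one_hot` and `critic_side_input` are assumptions made for this illustration.

```python
# One-hot actor identifiers: skilled -> [1, 0], non-skilled -> [0, 1].
ACTOR_IDS = {"skilled": [1.0, 0.0], "non_skilled": [0.0, 1.0]}


def one_hot(index, size):
    """Return a list of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def critic_side_input(action_index, num_actions, agent):
    """Concatenate the encoded action a_t with the actor identifier."""
    return one_hot(action_index, num_actions) + ACTOR_IDS[agent]


side = critic_side_input(action_index=2, num_actions=3, agent="skilled")
```

Such a vector would be fed alongside the features extracted from $o_t$, so that the shared layers stay agent-agnostic while the later layers can specialize per agent.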
Overall, with the design of a centralized critic we aim to have a more robust and stable learning, where the shared-view value estimates of the environment should be easier to obtain, while not hindering the calculation and learning of the independent-view value estimates when the optimal solutions of the agents diverge. This closely aligns with the design objective DO1 established previously.
3.3.2 Centralized Intrinsic Curiosity Module
The most straightforward strategy to make the exploration of one agent depend on the exploration performed by others is to combine them by using a centralized module, which is directly related to the intrinsic reward generation (DO2). This idea relies on the principle of divide and conquer, where visiting an observation should be discouraged if the other agent has already been there, promoting the exploration of uncharted areas. The problem with this assumption is that if agents have different knowledge and/or capabilities, one agent may get discouraged from exploring areas that are indeed crucial for finding its own optimal solution, and be forced to visit unpromising areas instead.
[7] The information is encoded as a one-hot vector distinguishing between agents with different action domains, i.e., $actor_{id} \xrightarrow{skilled} [1, 0]$ or $actor_{id} \xrightarrow{non-skilled} [0, 1]$.
[Figure 3.6 (heatmaps): five panels (a) to (e) over a grid environment with the labels Start, Door and Goal.]
Figure 3.6: Evolution of the intrinsic rewards in a simplistic RL environment after 10 executions according to the number of visits (i.e., $r^i = 1/\sqrt{N(s)}$). The agent is initialized at the bottom-left corner and its goal is to arrive at the destination located at the bottom right. Going straight, in the middle there is a door that obstructs the path, which can only be opened by a skilled agent. (a) Intrinsic reward heatmap of a skilled agent able to traverse the corridor through the door and go straight. (b) Intrinsic reward heatmap of a non-skilled agent not capable of opening the door, hence arriving at the target through the longer path. (c) Resulting intrinsic reward heatmap when combining both types of agents' visits for a total of 10 executions per agent (20 in total). (d) Relative difference of rewards using the centralized novelty (as in subfigure (c)) with respect to using two skilled agents (subfigure (a)) for the same amount of interactions. (e) Relative difference of rewards using the centralized novelty (subfigure (c)) with respect to using two non-skilled agents (subfigure (b)) for the same amount of interactions. In (a, b, c) darker colors mean higher reward; brighter, the opposite. In (d, e) red means that the centralization with heterogeneous agents encourages visiting those locations more often with respect to using homogeneous agents, yielding higher intrinsic rewards in that location by virtue of having heterogeneous actions (blue, the opposite).
In practice, by using a centralized curiosity approach with multiple heterogeneous agents, the experienced novelty is affected. Let us see the expected modifications following the example illustrated in Figure 3.6.
Firstly, the intrinsic bonuses of those states that can be reached by both agents will be smaller (Figure 3.6.c, yellow areas). By the same token, intrinsic returns should be higher along those trajectories in which the agent visits more novel states. This behavior is exacerbated in those states that are only accessible by one of the agents (i.e., the skilled agent, Figure 3.6.a, corridor colored in purple), as they can only be visited by that agent and their novelty decreases at a slower pace when compared to the rest of possible states (Figure 3.6.d, red). Therefore, the skilled agent will end up being more encouraged to visit restricted areas (namely, states that can only be accessed by using the action that makes the agents different) when compared to its behavior in the decentralized intrinsic module approach.
In regard to the non-skilled agent, using a centralized curiosity with an additional, more skilled agent has little impact on its exploration procedure, as the novelty distribution will undergo no changes for it. Indeed, the parts that are critical for the skilled agent (the door and the corridor) do not influence the exploration of the non-skilled agent (Figure 3.6.e, corridor). The remaining state space will be similarly visited by both agents. However, if we assume that the skilled agent will be encouraged to visit more often those experiences leading to the corridor, inversely the non-skilled agent will be discouraged from going over those same locations. Eventually, the non-skilled agent will be pushed towards exploring other alternatives. This can be observed in Figure 3.6.e, in which the non-skilled agent is more encouraged to explore through the longer path (as indicated by the higher rewards colored in red) when combining its rewards with a skilled agent than when exploring independently.
In conclusion, adopting a centralized curiosity module can be beneficial when heterogeneous agents are involved. On the one hand, actions yielding observations that can only be achieved by one of the agents (i.e., opening the door and accessing the corridor) will have larger intrinsic rewards and, hence, higher returns, fostering the exploration of that part of the state space. At the same time, it discourages the agent that is not capable of executing such actions from exploring the state space that leads to such non-reproducible situations (i.e., the corridor), which makes it advantageous for that agent to focus on exploring other promising zones.
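The effect discussed above can be reproduced with a toy count-based sketch that uses the bonus $r^i = 1/\sqrt{N(s)}$ from Figure 3.6. The state names and the helper `visit_bonuses` are hypothetical; the sketch only contrasts an independent count table with one shared by both agents.

```python
import math
from collections import Counter


def visit_bonuses(trajectories):
    """Count visits over all given trajectories and return 1/sqrt(N(s))."""
    counts = Counter(s for trajectory in trajectories for s in trajectory)
    return {s: 1.0 / math.sqrt(n) for s, n in counts.items()}


skilled = ["start", "door", "corridor", "goal"]
non_skilled = ["start", "long_path", "goal"]

independent = visit_bonuses([non_skilled])        # non-skilled agent alone
centralized = visit_bonuses([skilled, non_skilled])  # shared count table
```

States reachable by both agents (`start`, `goal`) lose novelty faster under the shared table, whereas `corridor` and `long_path` keep the full bonus because only one agent visits each of them.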
3.3.2.1 Action-based Curiosity Module
Manifold means of calculating the novelty of a given state have been proposed in the literature. Mechanisms to deal with novelty are based on using either $s_t$ (Bellemare et al., 2016), $s_{t+1}$ (Burda, Edwards, Storkey, et al., 2018) or even the information related to the transition between successive states $\{s_t, s_{t+1}\}$ (Pathak et al., 2017) [8]. In this vein, when multiple agents use this module in a centralized manner, they update it more frequently with the experiences sampled from their own independent action distributions, leading to different visitation strategies such as those depicted in Figure 3.6. Notice that the agent will be discouraged from visiting states already inspected regardless of the actions taken before.
[8] The intrinsic reward is generated just with $s_{t+1}$, but the update of the whole ICM framework requires $a_t$, $s_t$ and $s_{t+1}$.
This implies that the agent will have the same curiosity to visit a state and execute an action frequently selected (at that state) as to select another action that has barely been chosen. Previous works have reported that no difference arises from considering the action (Tang et al., 2017), speculating that the policy itself was sufficiently random (i.e., had sufficient entropy) to entrust it with the exploration at each state. This hypothesis, however, was validated over RL environments with single agents whose individual exploration does not interfere with the interaction and learning of other agents. By contrast, when heterogeneous agents are involved, the action selection and its consequent exploration become more sensitive.
Therefore, we modify those intrinsic-reward approaches in order to account for the action as well, so that the generated intrinsic rewards become more informative for the critic (DO2). In fact, a strategy that takes into account both the action and the state when computing the novelty will encourage a more homogeneous action selection and a deeper exploration (Raileanu & Rocktäschel, 2020). This difference may not hinder convergence in single-agent RL problems, but can be problematic when having agents with different action spaces. In this latter case, actions that can only be executed by just one agent will be more affected, as shown previously in Figure 3.6.
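The difference between state-based and (state, action)-based novelty can be sketched with visitation counts. The class `CountCuriosity` and its `use_action` flag are hypothetical names, and the bonus $1/\sqrt{N}$ is borrowed from Figure 3.6 purely for illustration.

```python
import math
from collections import Counter


class CountCuriosity:
    """Count-based bonus keyed on the state or on the (state, action) pair."""

    def __init__(self, use_action=False):
        self.use_action = use_action
        self.counts = Counter()

    def bonus(self, state, action):
        key = (state, action) if self.use_action else state
        self.counts[key] += 1
        return 1.0 / math.sqrt(self.counts[key])


state_only = CountCuriosity(use_action=False)
state_action = CountCuriosity(use_action=True)

# Same state, two different actions.
b1 = [state_only.bonus("s", "a0"), state_only.bonus("s", "a1")]
b2 = [state_action.bonus("s", "a0"), state_action.bonus("s", "a1")]
```

With state-only counts, the second action inherits the reduced novelty of the state; with (state, action) counts, the barely chosen action still receives the full bonus.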
3.3.2.2 Tree Filtering
The previous exploration strategies aim at sharing as much information as possible between the agents. Nevertheless, there might be states embedded in a trajectory that are not accessible by some agents, while specific chunks of that trajectory might, in turn, be reproducible.
On the one hand, a trajectory can be regarded as shareable by both agents if the actions taken by the agent responsible for gathering the experiences belong to the mutual action space [9].
On the other hand, let us consider a trajectory gathered by the skilled agent that is not fully reproducible by the non-skilled agent. Can that information be used in some way by the non-skilled agent (rather than being discarded)? This is what tree filtering is all about. In order to explain it, and for the sake of clarity, consider the trajectory shown in Figure 3.7, where we can distinguish two main chunks of experiences:
• $\{(s_{49}, a_2), (s_{50}, a_3), \ldots\}$:
From $s_{49}$ onward, the whole trajectory is assumed to be reproducible by the non-skilled agent too. In spite of the non-skilled agent not being responsible for collecting such experiences, the curiosity of both agents at them is updated (i.e., decreased). As a consequence, future returns and, subsequently, their critic estimates, will reflect it [10].
• $\{\ldots, (s_{45}, a_1), (s_{46}, a_1), (s_{47}, a_2), (s_{48}, a_4)\}$:
At state $s_{48}$, the skilled agent executed an action that does not belong to the mutual action space, $a_4$, which is not reproducible by the other agent.
Should we then decrease the novelty of the non-skilled agent for all those $\{s, a\}$ tuples?
If so, that novelty reduction will be noticed when the non-skilled agent collects a trajectory containing any of those experiences and updates the critic. Let us examine the consequences:
– Regarding $(s_{48}, a_4)$, no impact will be caused, since this tuple is indeed impossible to experience in any trajectory performed by the non-skilled agent.
– Nonetheless, for the rest of the feasible tuples, $\{\ldots, (s_{45}, a_1), (s_{46}, a_1), (s_{47}, a_2)\}$, the intrinsic reward signal will be lowered, discouraging the non-skilled agent from developing its own exploration strategy on account of an external update made by the skilled agent, which is not playing the role of an equally skilled agent.
In order to encourage the non-skilled agent to create its own personal experience, the novelty updates of the tuples from $s_{48}$ back to the initial state are not performed on the non-skilled agent, allowing it to keep on working on its independent individual view.
[9] This also applies when selecting an action out of that mutual action space which has no effect on the environment, or which is interchangeable with one of the actions of the mutual action space.
[10] If the non-skilled agent is not capable of reproducing some of those states, the novelty update, from the perspective of that agent, will be insignificant, as it would never be able to explore that situation; on the contrary, it would assume that an agent with at least the same capabilities would have previously explored them (pretending that the non-skilled agent itself gathered them).
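The filtering rule described above could be sketched as follows, under two loudly stated assumptions: each agent keeps its own visitation-count table, and the oracle that decides reproducibility is simply membership of the action in the mutual action space. The function names and the action labels are hypothetical.

```python
def reproducible_suffix_start(trajectory, mutual_actions):
    """Index of the first step after the last non-mutual action."""
    start = 0
    for i, (_, action) in enumerate(trajectory):
        if action not in mutual_actions:  # assumed oracle: set membership
            start = i + 1
    return start


def update_counts(trajectory, mutual_actions, skilled_counts, non_skilled_counts):
    """Skilled view gets every tuple; non-skilled view only the suffix."""
    cut = reproducible_suffix_start(trajectory, mutual_actions)
    for i, pair in enumerate(trajectory):
        skilled_counts[pair] = skilled_counts.get(pair, 0) + 1
        if i >= cut:
            non_skilled_counts[pair] = non_skilled_counts.get(pair, 0) + 1


# Trajectory gathered by the skilled agent, mirroring the example in the text.
trajectory = [("s45", "a1"), ("s46", "a1"), ("s47", "a2"),
              ("s48", "a4"), ("s49", "a2"), ("s50", "a3")]
skilled_counts, non_skilled_counts = {}, {}
update_counts(trajectory, {"a1", "a2", "a3"}, skilled_counts, non_skilled_counts)
```

Only the tuples from `s49` onward reduce the novelty seen by the non-skilled agent; everything from `s48` back to the initial state is left untouched in its view.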
As a result of this filtering process, we propose to consider novelty along sequences rather than novelty as the attractiveness of isolated states [11]. That is, we aim to minimize the error in the globally generated novelty estimation of paths by taking into account the intrinsic rewards generated at each experience and also their reproducibility, thus polishing the collection of intrinsic rewards by allowing room for independent views of the environment (DO2). Ideally, the novelty through a path would be handled by an intrinsic curiosity module that takes into account sequences rather than single experiences. However, as we will further elaborate in Section 3.7, the design of such a novelty reward function is not trivial at all.
3.3.3 Summary of the Proposed Modules
To sum up, the proposed collaborative framework is composed of a centralized critic and modified intelligent exploration strategies, where:
• The use of a centralized critic enhances the learning process by ensuring more diverse experiences. At the same time, a robust knowledge
[11] In practice, the novelty of a sequence is calculated as the discounted intrinsic return of each of the experiences belonging to that trajectory, which is a sum of independent intrinsic bonuses as in Expression (2.3).
Table 3.1: Conv2D(A1,A2,B,C,D,E): Convolutional layer with A1 input channels and A2 output channels, kernel size B, stride C, padding D and activation function E (ELU: Exponential Linear Unit)
Network | Architecture | Training Parameters
Actor | Conv2D(4,32,3,2,1,ELU) + Conv2D(32,32,3,2,1,ELU) + Conv2D(32,32,3,2,1,ELU) + Conv2D(32,32,3,2,1,ELU) + Dense(256,ELU) + Dense(# actions, softmax) | Orthogonal initialization; Adam optimizer; PPO loss
Critic | Conv2D(4,32,3,2,1,ELU) + Conv2D(32,32,3,2,1,ELU) + Conv2D(32,32,3,2,1,ELU) + Conv2D(32,32,3,2,1,ELU) + Dense(256,ELU) + LSTM(128) + Dense(256,ELU) + ... + Dense(5) [extrinsic] & Dense(5) [intrinsic] | Orthogonal initialization; Adam optimizer; MSE loss in both critic heads
intrinsic returns in order to mitigate issues derived from the reward scale (Burda, Edwards, Storkey, et al., 2018), i.e.:
$$r^i_t = \frac{r^i_t}{\sigma(G^i_t(\tau))} \qquad (3.6)$$
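A minimal sketch of the scaling in Expression (3.6) is given below. It assumes that $\sigma$ is the standard deviation of the discounted intrinsic returns of a single rollout; implementations commonly maintain a running estimate instead, which is not modeled here. The function names and the `eps` safeguard are hypothetical.

```python
import statistics


def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over the rollout."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def normalize_intrinsic(rewards, gamma=0.99, eps=1e-8):
    """Divide each intrinsic reward by the std of the intrinsic returns."""
    sigma = statistics.pstdev(discounted_returns(rewards, gamma))
    return [r / (sigma + eps) for r in rewards]  # eps avoids division by zero


scaled = normalize_intrinsic([1.0, 0.0, 0.0], gamma=1.0)
```

Dividing by the spread of the returns, rather than of the rewards themselves, keeps the scale of the intrinsic value targets comparable across training.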
Moreover, a crucial matter when using ANNs is normalizing the input to prevent several problems. This also applies to IM methods that use ANNs for the reward generation, and it becomes especially important when using RND [13]. Hence, the input to the RND modules is standardized and clipped to values between -5 and 5 as follows:
$$o_{clipped} = \max\left[-5, \min\left[\frac{o - \mu}{\sigma}, 5\right]\right] \qquad (3.7)$$
Recall that the latter is only applied when using RND, i.e., only in Setups 1 and 2. More information regarding how RND performs in ViZDooM and why we decided not to use it in Setup 3 can be found in Appendix A.
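Expression (3.7) can be illustrated element-wise as in the sketch below. The function name `standardize_and_clip` and the `eps` term are assumptions; how $\mu$ and $\sigma$ are estimated (e.g., from running statistics) is left out.

```python
def standardize_and_clip(obs, mean, std, clip=5.0, eps=1e-8):
    """o_clipped = max(-clip, min((o - mu) / sigma, clip)), per element."""
    return [
        max(-clip, min((o - m) / (s + eps), clip))
        for o, m, s in zip(obs, mean, std)
    ]


clipped = standardize_and_clip(obs=[0.0, 100.0, -100.0],
                               mean=[0.0, 0.0, 0.0],
                               std=[1.0, 1.0, 1.0])
```

Outliers are saturated at ±5, so a fixed (frozen) target network never receives inputs far outside the range it was initialized for.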
3.4.4 Evaluation Metrics
In general, the main goal of knowledge reuse in RL is to accelerate the learning process. In order to analyze the benefits of using knowledge transfer, different metrics can be used (Taylor & Stone, 2009). However, a framework could report similar performance metrics to other possible options, but could still remain of interest due to other factors related to the training procedure, such as the number of required samples, the training time for a given computational power, and model complexity/size, among other factors. Consequently, discussions on the experimental results later held in this chapter consider two performance scores:
[13] The target network has its parameters fixed (frozen) and cannot adjust its values according to the training data. Consequently, the obtained embeddings might not convey enough meaningful information and could result in high-variance outcomes.
3.4. Experimental Setup 61
• Average extrinsic result (also referred to as Success Rate, SR), which is calculated as the average extrinsic score obtained through a window of the last 100 episodes.
• Number of steps to achieve the goal, measured from the starting point of the scenario until the agent reaches the target.
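The first score can be sketched as a sliding-window average of the extrinsic episode scores. The class name `SuccessRate` is hypothetical; only the 100-episode window comes from the text, and a window of 3 is used in the usage lines to keep the example short.

```python
from collections import deque


class SuccessRate:
    """Average extrinsic score over a window of the most recent episodes."""

    def __init__(self, window=100):
        self.scores = deque(maxlen=window)  # old episodes drop out automatically

    def add(self, extrinsic_score):
        self.scores.append(extrinsic_score)

    def value(self):
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


sr = SuccessRate(window=3)
for score in [1.0, 0.0, 0.0, 1.0]:
    sr.add(score)
current = sr.value()
```

With a reward of 1 given only when the goal is reached, this average is directly the fraction of successful episodes in the window.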
The reason for considering these two scores is that, by only inspecting the SR metric, the discussion only regards whether agents have reached the goal, disregarding the required number of steps (which represents the quality of the learned policy). Other works using this environment assume that no rewards are given except when arriving at the goal, although they actually give a small penalization referred to as living reward, equal to −0.0001 for each step. This small modification yields an optimal average extrinsic return of approximately 0.97 for 270 steps; that is, they have a reward function that parameterizes the optimality of the results subject to the number of steps. We instead fix a null living reward, and give a reward equal to 1 when achieving the goal (independent of the number of steps). In this way, we remain strict with regard to the sparse reward problem formulation.
Moreover, the environment itself is slightly different depending on the action space of each agent. Hence, in this case the skilled agent has different possibilities to achieve the target, the optimal one being the path that goes through the corridor (labeled in what follows as _OPT). Therefore, we trace not only whether every agent reaches the target, but also whether they navigate through their optimal paths.
Summary
On the one hand, Case Study 1 analyzes the impact of a standard centralized critic approach while using either an independent or a centralized RND-based curiosity module. Setup 1 and Setup 2 establish a corridor in different places (Figure 3.8) while allowing the agent to spawn at various locations based on the selected setting. More importantly, the agents' policies differ due to the presence of a crouch and move forward action in the policy of the skilled agent.
On the other hand, Case Study 2 examines a more sophisticated centralized critic design (with a UVFA architecture and LSTM layers). Instead of using RND, visitation counts are used to compute the curiosity and to assess the impact of making the latter independent, centralized and subject to the action space. In addition, it adopts a more challenging setup (Setup 3, Figure 3.10), where agents differ due to the existence of an open action for the skilled agent to open a gate and access the corridor.
As a result of the above case studies, different algorithmic configurations are considered (summarized in Table 3.2):
• Full Independent PPO (PPO): the baseline PPO algorithm.
• Independent Curiosity (IC_IC): the PPO algorithm with independent curiosity (IC) and independent critics (IC).
  – Independent Curiosity (IC_IC_3r): uses 3 parallel environments/runners to collect experiences.
  – Independent Curiosity (IC_IC_6r): uses 6 parallel environments/runners to collect experiences.
• Independent Critic + Centralized Curiosity (IC_CC): both agents share a unique/centralized curiosity module, yet they have independent critics.
• Centralized Critic + Independent Curiosity (CC_IC): both agents share a unique/centralized critic, but they remain independent in what refers to the generation of their intrinsic rewards.
• Centralized Critic + Centralized Curiosity (CC_CC ≡ CC_CC_sh): both agents share all parameters of both the critic and the curiosity modules to generate the intrinsic rewards. By default, solely the state is considered as input.
  – Centralized Critic + Centralized-Action-based Curiosity (CC_CC_sh_action): in this case, the intrinsic bonus is made dependent on the state and the action, instead of just uniquely on the state.
  – Centralized Critic + Centralized-Action Curiosity + Tree Filtering (CC_CC_sh_action_filter): this scheme is equal to the previous one, but during the generation of the rewards it prunes those rollouts whose experiences are not reproducible by the non-skilled agent (see Section 3.3.2.2)¹⁴.
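The practical difference between independent, centralized and action-based curiosity can be illustrated with a minimal sketch. It assumes a tabular visitation-count bonus of the form 1/sqrt(N), in the spirit of Case Study 2; the names `CountCuriosity` and `make_modules`, the state and action labels, and the choice of counting the current visit before computing the bonus are all illustrative assumptions and not the implementation used in the experiments. The critic is left out on purpose.

```python
import math
from collections import defaultdict


class CountCuriosity:
    """Visitation-count curiosity: bonus = 1 / sqrt(N(key))."""

    def __init__(self, use_action=False):
        self.use_action = use_action  # False: count s; True: count (s, a)
        self.counts = defaultdict(int)

    def bonus(self, state, action):
        key = (state, action) if self.use_action else state
        self.counts[key] += 1  # assumption: count the current visit first
        return 1.0 / math.sqrt(self.counts[key])


def make_modules(agents, centralized, use_action=False):
    """One module per agent (independent) or a single shared one (centralized)."""
    if centralized:
        shared = CountCuriosity(use_action)
        return {name: shared for name in agents}
    return {name: CountCuriosity(use_action) for name in agents}


agents = ["non_skilled", "skilled"]

# Independent curiosity (*_IC): each agent sees state "s0" as brand new.
ic = make_modules(agents, centralized=False)
ic_bonuses = [ic[a].bonus("s0", "forward") for a in agents]

# Centralized curiosity on the state (*_CC_sh): the second visitor gets less.
cc = make_modules(agents, centralized=True)
cc_bonuses = [cc[a].bonus("s0", "forward") for a in agents]

# Centralized curiosity on (state, action) (*_CC_sh_action): an action owned
# only by the skilled agent stays novel even in an already visited state.
sa = make_modules(agents, centralized=True, use_action=True)
sa["non_skilled"].bonus("s0", "forward")
open_bonus = sa["skilled"].bonus("s0", "open")
```

In this toy run the independent modules return a bonus of 1 to both agents, the shared state-based module returns a reduced bonus to the second visitor, and the shared state-action module still returns a bonus of 1 for the special action.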
3.5 Results and Analysis
Results produced after the experiments held over the aforementioned setup are discussed in this section. For the sake of clarity in the discussion, results are commented based on the following research questions (RQ):
• RQ1: Does a centralized critic provide any gain when compared to completely independent agents?
• RQ2: Does a centralized curiosity yield better performance levels than maintaining the curiosity locally at every agent?
• RQ3: Should we compute curiosity incentives based on the (state, action) pair rather than only the state itself?
• RQ4: Should agents have their intrinsic rewards updated only by experiences that are reproducible as per their action spaces?

Table 3.2: Summary of algorithmic configurations of critic and curiosity modules. Besides the setups, the case studies also differ in the use of (1) a standard or UVFA centralized critic and (2) a RND or visitation counts based curiosity module, as explained in Sections 3.4.1 and 3.4.2. *: sh and sh_action are used to distinguish the input of the centralized curiosity module.

                                 Critic                     Curiosity module
Case                             Independent  Centralized   Independent  Centralized  Centralized
Study  Configuration                                        (state)      (state)      (state-action)
1      PPO                       ✓
1      IC_IC                     ✓                          ✓
1      IC_CC                     ✓                                       ✓
1      CC_CC                                  ✓                          ✓
2      IC_IC_3r                  ✓                          ✓
2      IC_IC_6r                  ✓                          ✓
2      CC_IC                                  ✓             ✓
2      CC_CC_sh*                              ✓                          ✓
2      CC_CC_sh_action*                       ✓                                       Naive
2      CC_CC_sh_action_filter                 ✓                                       Filter

¹⁴ We assume an oracle that informs whether the action executed by the skilled agent is reproducible by the non-skilled agent.
We now analyze experimental results aiming to obtain informed responses to the above questions, using to this end the different configurations of the proposed collaborative framework that are represented in Table 3.2. Results are reported over 3 independent runs in order to account for their statistical variability. Unless otherwise stated, curves shown in the plots correspond to the average extrinsic return/success ratio (y-axis) obtained after a given number of train episodes (x-axis).
RQ1: Does a centralized critic provide any gain when compared to completely independent agents?
We begin our discussion by examining whether a centralized critic performs better than completely independent agents in the RL scenario under consideration. Responses to this question can be found in Figure 3.11, Figure 3.12 and Figure 3.13, which evince that a centralized critic (CC_XC) reaches better performance levels with respect to using independent critics (IC_XC).
With a centralized critic, both agents manage to solve the task consistently in all the considered setups and settings, while reaching the target through their optimal path in most of the attempts (as shown in the previously referred Figures with _OPT). By contrast, agents featuring individual critic modules (IC_XC) are more unstable and require a larger amount of episodes than those considered during training.
[Figure 3.11: plots omitted. Legend: PPO, IC_IC, IC_CC, CC_CC. Columns: Non-skilled agent, Skilled agent, Skilled agent (_OPT). x-axis: Episodes.]
Figure 3.11: Average extrinsic return achieved in Setup 1 for different settings (i.e., agent's spawn initialization, each represented in a different row). The last column represents the score obtained by the skilled agent when going through its shortest path (i.e., corridor).
[Figure 3.12: plots omitted. Legend: PPO, IC_IC, IC_CC, CC_CC. Columns: Non-skilled agent, Skilled agent, Skilled agent (_OPT). x-axis: Episodes.]
Figure 3.12: Same interpretation as in Figure 3.11, but for Setup 2.
Intuitively, one can postulate that the advantage of using a centralized critic is that more experiences are collected (and used) for the same/unique ANN. Thus, as we compute the gradients with a larger amount of data (gathered by two agents instead of just one), benefits in terms of variance are expected. If this is the case, we can simply increase the number of experiences collected by each worker by doubling the number of runners, which ensures that each agent has the same amount of experiences as it would have had when using a centralized critic. This hypothesis can be answered from Figure 3.13, where we observe that IC_IC_6r is not only unable to perform as CC_IC, but also performs worse than IC_IC_3r.
In addition to the lower variance, another key difference is that CC_IC is updated almost twice as fast, as it executes an optimization step per set of trajectories collected by each worker. On the contrary, in IC_IC_3r and IC_IC_6r each worker has its own critic module, which is updated once for the experiences collected by its respective actor. Nevertheless, if the number of optimization steps was the key factor to perform better, then with twice as many episodes any individual approach should achieve performance levels similar to those of a centralized critic. However, this is not the case either, thereby arriving at the conclusion that a centralized critic performs better than individual critic modules.
[Figure 3.13: plots omitted. Columns: Non-skilled agent, Skilled agent. Rows: 3 runners, 6 runners. x-axis: Episodes (0 to 6000); y-axis: 0.0 to 1.0.]
Figure 3.13: Average extrinsic return achieved in Setup 3 using independent curiosity for encouraging the exploration when using independent critics (IC_CC) and a single centralized critic for both agents (CC_CC). We show the curves when using either 3 (upper row) or 6 (bottom row) parallel agent runners for the independent critic case, whereas the centralized critic approach uses 3 parallel agents. Dashed lines with markers are used to plot the skilled agent's _OPT curves.
RQ2: Does a centralized curiosity yield better performance levels than maintaining the curiosity locally at every agent?
Before delving into this second RQ, it is important to highlight that the addition of a curiosity module is undeniably necessary with respect to not using it, as PPO on its own is not able to outperform the behavior of a random agent (included as a dashed horizontal line in each plot of Figures 3.11 and 3.12).
With independent critics, the comparison between an individual (IC_IC) and a centralized (IC_CC) curiosity module shows better performance when everything is kept independent. This statement is supported by the differences observed in Figures 3.11 and 3.12 for Setups 1 and 2, where IC_IC (green) exhibits higher success rates with a better sample efficiency. Besides, these differences are more noticeable for the skilled agent, which undergoes more difficulties to go through the corridor when sharing the curiosity module, CC_CC (red), as seen in the _OPT curves.
On the other hand, when using a centralized critic, the adoption of a centralized curiosity strategy (CC_CC_sh) is slightly better with respect to the independent curiosity counterpart (CC_IC), which can be confirmed by the results obtained in Figure 3.14¹⁵. By zooming into these results, for the skilled agent the CC_CC_sh approach achieves a SR of 90% with 1309 episodes on average, whereas CC_IC requires 1522 (an improvement of 14%). This can also be observed when the skilled agent reaches the destination through the corridor in over 80% of the total episodes. At this point of the learning process, the fully centralized approach requires 6% fewer episodes. In the case of the non-skilled agent, differences are visually negligible, but they represent an improvement of 8%. Furthermore, CC_IC finishes with a slightly better policy that requires fewer steps to achieve the goal.
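The 14% figure quoted for the skilled agent follows directly from the two episode counts reported above. A short check, where the helper name `relative_reduction` is an illustrative choice and the improvement is assumed to be measured relative to the CC_IC baseline:

```python
def relative_reduction(baseline, improved):
    """Fraction of episodes saved by `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Episodes needed by the skilled agent to reach a 90% success rate.
skilled_gain = relative_reduction(baseline=1522, improved=1309)
print(f"{skilled_gain:.1%}")  # prints 14.0%
```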
Interestingly, the results obtained in Setups 1 and 2 with independent critics go against the intuition explained in Section 3.3.2 about centralizing the curiosity module (IC_IC > IC_CC), although the outcomes in Setups 1, 2 and 3 when using a centralized critic reinforce this idea (CC_CC > IC_IC). We hypothesize that this occurs because the curiosity decreases for both agents when being shared, yet that knowledge is not persisted into their critic modules (when they have independent critics), which then estimate wrongly the intrinsic value of the state $V_i(s_t)$. This is effectively avoided when using a centralized critic. Therefore, results suggest that sharing the curiosity without sharing the critic as well is not actually beneficial. However, sharing both modules gives rise to consistently better results.
RQ3: Should we compute curiosity incentives based on the (state, action) pair rather than only the state itself?
Previously, we have concluded that sharing curiosity information between agents yields advantages in terms of success rate and number of steps to reach the target as long as the critic is also shared.
¹⁵ Indeed, the need of having a large number of episodes to actually see that the skilled agent is capable of traversing the corridor conceals any improvements that could arise from the experiments.
[Figure 3.14: plots omitted. Columns: Non-skilled agent, Skilled agent. Rows: Extrinsic return (0.0 to 1.0), Number of steps (0 to 1400). x-axis: Episodes (0 to 6000).]
Figure 3.14: Average extrinsic return (top row) and number of steps (bottom row) achieved in Setup 3 using a centralized critic while using either an independent curiosity (CC_IC) or a centralized approach (CC_CC_sh). Dashed lines with markers are used to plot the skilled agent's _OPT curves.
Now we turn the focus on evaluating whether the intrinsic reward should be made dependent on both the state and action rather than just the state. In the past, the work in (Tang et al., 2017) showed no empirical differences between both approaches. However, in the cases under study they were not dealing with heterogeneous agents, where the novelty may be influenced by the actions available at each agent. Thus, as foretold in Section 3.3.2, our hypothesis is that by making the curiosity subject to the $\{s, a\}$ tuple, CC_CC_sh_action, different exploration behaviors can be induced into the agents, making it easier for the skilled agent to go through the corridor (as a consequence of inducing a larger curiosity for that special action).
In light of the results depicted in Figure 3.15, it is fair to claim that our hypothesis holds: the skilled agent exhibits a convergence improvement of its success rate of almost 1000 episodes when considering success as traversing the corridor to reach the target. This enhancement can be attributed to a smoother exploration bonus, which is reflected in how the required steps decay more abruptly after finding out that path. On the other side, once the path is discovered, the agent gets stuck with a policy that is slightly worse than the two approaches analyzed previously. That is, it requires a greater number of steps to achieve the goal. We hypothesize that the reason for this effect is the same that leads the agent to find the path faster: the exploration component (intrinsic reward) is high when compared to the extrinsic bonuses, which makes the agent undergo noise in its learning process (higher entropy). The same behavior is also distilled into the policy learned by the non-skilled agent, whose scores are worse despite converging faster.
[Figure 3.15: plots omitted. Columns: Non-skilled agent, Skilled agent. Rows: Extrinsic return (0.0 to 1.0), Number of steps (0 to 1400). x-axis: Episodes (0 to 6000).]
Figure 3.15: Average extrinsic return (top row) and number of steps (bottom row) achieved in Setup 3 using a centralized critic and curiosity approach, yet making the curiosity subject to only the state (CC_CC_sh) or the state-action pair (CC_CC_sh_action). Dashed lines with markers are used to plot the skilled agent's _OPT curves.
RQ4: Should agents have their intrinsic rewards updated only by experiences that are reproducible as per their action spaces?
Finally, we evaluate the proposed collaborative framework configured with a centralized critic and a centralized action-based curiosity, but filtering according to the idea explained in Section 3.3.2.2, CC_CC_sh_action_filter. Differences should appear mainly for the non-skilled agent, so that its learning process changes by deleting those experiences that modify its curiosity inappropriately.
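One possible reading of this filtering step is sketched next. It assumes an oracle that tells whether an action is available to the non-skilled agent and that a rollout is discarded as soon as it contains one non-reproducible action; the action set, the rollout format (lists of (state, action) pairs) and the function names are hypothetical, and the exact pruning rule of Section 3.3.2.2 may differ.

```python
NON_SKILLED_ACTIONS = {"left", "right", "forward"}  # hypothetical action set


def is_reproducible(action):
    """Oracle: can the non-skilled agent execute this action?"""
    return action in NON_SKILLED_ACTIONS


def filter_rollouts(rollouts):
    """Keep only the rollouts in which every action passes the oracle."""
    return [r for r in rollouts if all(is_reproducible(a) for _, a in r)]


skilled_rollouts = [
    [("s0", "forward"), ("s1", "open"), ("s2", "forward")],  # uses the special action
    [("s0", "left"), ("s3", "forward")],                     # fully reproducible
]

# Only the second rollout would update the curiosity seen by the non-skilled agent.
kept = filter_rollouts(skilled_rollouts)
```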
Plots nested in Figure 3.16 validate this hypothesis. A narrow performance gap arises between the two compared approaches CC_CC_sh_action_filter and CC_CC_sh_action. Both workers converge to a SR of 90% faster when compared to any of the previously analyzed configurations of the framework, attaining an improvement of 7.7% (skilled agent) and 15% (non-skilled agent) in comparison to the second-best solution.
[Figure 3.16: plots omitted. Columns: Non-skilled agent, Skilled agent. Rows: Extrinsic return (0.0 to 1.0), Number of steps (0 to 1400). x-axis: Episodes (0 to 6000).]
Figure 3.16: Average extrinsic return (top row) and number of steps (bottom row) achieved in Setup 3 using a centralized critic and a centralized curiosity subject to the state-action pair, with (CC_CC_sh_action_filter) and without (CC_CC_sh_action) filtering the episodes in which the special action has been used (e.g., open). Dashed lines with markers are used to plot the skilled agent's _OPT curves.
3.5.1 Exploration versus Exploitation: When?
One of the major issues arising from the analysis of the results is that the number of steps of the optimal policy is far from the number of steps taken by executing the minimum number of actions¹⁶. The reason is that, even at the final stages of the training process, the learned policy is too stochastic and still features significant variability. Depending on the problem, this might be a good result as it allows the agent to adapt to changes more easily (Haarnoja et al., 2017). However, if the aim is to learn to perform the task as efficiently as possible, the optimal policy should be the one that converges with the minimum required steps towards the target.
The challenge lies in the absence of a specific objective incorporated into the reward function that guides the problem-solving process with the fewest possible steps. In fact, the policy's enhancement relies on precise value estimates, denoted as $V(s)$, based on the discounted return.
¹⁶ Experiments have considered a frame skip equal to 4, hence the optimal solution with 1 frame per step should require fewer interactions of the agent with the environment.
preventing from getting an optimal policy (remains too stochastic). This aligns with other works where, once a certain degree of knowledge has been obtained and the exploration is already considered sufficient, the fact of continuing to use it results to be counterproductive for the learning process (Rosser & Abed, 2021; Taïga et al., 2020).
3.7 Lessons Learned & Future Work
Grounded on the insights extracted from the experiments and the analysis of the results, in this section we sketch learned lessons and interesting directions for future research. Some of the reflections offered in what follows relate to the heterogeneity between agents, whereas others relate to issues that lie at the conjunction of both RL and IM.
3.7.1 When to Explore? Exploration-Exploitation Dilemma with Heterogeneous Agents
A well-known challenge in RL is about deciding when to explore and when to exploit in single agent scenarios. Besides the strong dependence on the characteristics of the environment, there are different types of exploration strategies that can be followed with diverse results (Pîslar et al., 2022).
Even in the simple single-agent scenario, it is not clear how to make the agent explore efficiently. In other words, when should a given agent explore? This question, often regarded as the exploration-exploitation dilemma, is yet unsolved, as it is not straightforward to determine when the agent (or even a human) has explored enough when learning to solve a task. This problem is exacerbated in settings with sparse rewards, specially when the completion of the task can require long-term training horizons.
It has been seen in this chapter that one way to deal with exploration is to use IM techniques, with which the agent can explore the environment more smartly. However, this approach introduces a non-stationary novelty bonus, yielding a bi-objective problem with conflicting objectives: the main task's extrinsic goal and the exploration-related intrinsic goal. As a consequence, a misalignment between these objectives can emerge, potentially leading to worse results than not using the aforementioned intrinsic streams whatsoever (Taïga et al., 2020).
In the considered concurrent learning problem the heterogeneous agents do not share anything (by default), as opposed to the assumptions made in multi-agent RL problems, where they share at least a team reward or the environment where they are deployed.
Should we impose a collaborative strategy when none of the actions executed by an agent influences the behavior of the other agents?
It is complex to give an answer, particularly if we do not know when an agent (independently of other agents) has explored enough on a given task, as depicted in the previous paragraphs. Therefore, in the current chapter, we have assumed some kind of latent knowledge between agents and tasks¹⁷ that has been formalized in terms of sharing the critic and curiosity module. We further assumed that both agents understand and perceive the environment in similar ways, which can be translated into developing congruent representations and exploration patterns, which, ultimately, can help bootstrapping the learning of the involved agents. Unfortunately, this might not be realistic in other RL scenarios.
3.7.2 Detachment-Derailment Problem
Solutions that rely on IM techniques exhibit the so-called detachment-derailment problem. This issue arises when an agent has explored the environment correctly, becoming close to discovering an interesting state space or to achieving the goal. At some point, however, the agent's learning gets stuck and the episode finishes. When the next episode is started, all decisions that the agent made to reach those spots are now regarded with less novelty (even being close to finding out promising locations). Consequently, the agent will be stimulated to examine other alternatives, even if it was in the right direction to discover novel states, degrading the effectiveness of the exploration. In this chapter, we realized that the detachment-derailment problem gets worse when the time horizon required to achieve any meaningful feedback signal increases.
Recently, it has been shown that an effective way to address this issue is by clustering representations, and by reinitializing the agent smartly in the environment (Ecoffet et al., 2021; Ugadiarov et al., 2021). However, these approaches require the environment to be reset-free¹⁸. In the scenario with heterogeneous agents tackled in this chapter, a similarity-based clustering of the state space might be suitable to identify promising states where the agent can be reset (Ecoffet et al., 2021; Ugadiarov et al., 2021). Unfortunately, it is difficult to make these techniques work in POMDPs with first-person-view observations due to (1) the dimensionality reduction of the state space, and (2) the generation of clusters and the determination on where (i.e., in which cluster of states) to reinitialize each agent considering that they might have different stimuli and optimal paths to the same goal.
Although implementing adequate mechanisms to deal with this phenomenon is difficult, analysing and developing procedures to keep track of previous not fully explored, albeit promising, routes could complement IM techniques and make them efficient even in extraordinarily complex circumstances.
¹⁷ Akin to the hypothesis behind Transfer Learning approaches.
¹⁸ An environment in which the agent position and/or state perception can be manually selected without any constraints. This property grants flexibility to select new/desired start positions arbitrarily.
3.7.3 Potential of Recurrent Rewards
Another issue encountered during this research springs from the fact that intrinsic bonuses are generated from a given experience tuple rather than a sequence of tuples. This issue affects not only the scenario tackled in this chapter, but also other RL environments that generate intrinsic rewards based on single experiences. This mainly occurs when having a POMDP, as changes in the environment cannot be directly reflected even if those changes have a clear impact in the environment. Next, we expose this problem by briefly discussing two hypothetical environments.
[Figure 3.20: diagrams omitted. Panels (a) and (b). Labels: Button (unlocks the door when pressed), Door, Agent location (observation); in panel (b), passes 1 and 2 mark a state visited twice.]
Figure 3.20: Hypothesized case studies to discuss how to deal with long-term dependencies within sparse POMDP problems.
In the environments shown in Figure 3.20, the agent can unlock the colored passage by pushing the button that is located at a different location, relatively far from the entrance to the corridor. For this purpose, an action named open is available to the agent, but it is useless anywhere except in front of the door. In these environments a first-person-view observation hinders the agent from understanding the correlation between pushing the button and opening the door. What is more, the value of reaching the location where the button is located (and all the subsequent states to the destination) will differ depending on whether:
• The button is pushed and the agent goes through the corridor.
• The button is pushed and the agent does not go through the corridor.
• The button is not pushed.
This issue, combined with long horizon returns and an agent that does not know how to interact and solve the problem correctly, leads to noisy updates and hampers the discovery of the correlation existing between the button and the door. This is even more complex in scenarios as the one in Figure 3.20.b, where a given observation (e.g., the one marked with an X) must be visited twice: (1) when searching for the button that opens the passage, and (2) to go through the passage itself¹⁹.
Due to these inconsistencies, we believe that novelty needs to be redefined in one of the following two ways:
• As the intrinsic reward of a given experience tuple, aiming to quantify how novel the experience is on its own.
• As the discounted expected return within a given trajectory, considering the calculated intrinsic bonus of the experiences that make up that specific trajectory, answering which degree of novelty this experience injects into future steps of the episode.
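The contrast between both definitions can be made concrete with a small sketch. It assumes that the second definition is estimated from a single trajectory as the discounted sum of the per-tuple bonuses from each step onwards; the function name `intrinsic_return` and the toy bonus values are illustrative only.

```python
def intrinsic_return(bonuses, gamma=0.99):
    """Discounted sum of intrinsic bonuses from each step to the end of the episode."""
    returns, running = [], 0.0
    for r in reversed(bonuses):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# First definition: novelty of each experience tuple on its own.
# Only the last step (e.g., crossing the passage) is novel by itself.
bonuses = [0.0, 0.0, 1.0]

# Second definition: novelty that each experience injects into future steps.
novelty_to_go = intrinsic_return(bonuses, gamma=0.5)  # [0.25, 0.5, 1.0]
```

Under the first definition the two initial steps carry no novelty at all, whereas under the second one they inherit part of the novelty that they make reachable later in the episode.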
The first definition relies solely on the experience itself to measure novelty. It is more practical and widely adopted in the research community. Nevertheless, this requires the temporal dependencies among the experiences to be modeled manually (e.g., stacking multiple instance frames, using memory mechanisms) or incorporating recurrent (and/or attention) modules at the actor, the critic or both (Hausknecht & Stone, 2015; Oh et al., 2016; Vaswani et al., 2017). In fact, in the architectures discussed in the experiments of this chapter, one of the algorithmic configurations adopted a LSTM-based neural architecture in the critic. However, there are no guarantees that this type of architecture retains the gathered knowledge at long-term horizons, nor is the novelty score used to compute the return stationary (it decreases over time). This instability in the expectation term over time ultimately hampers the long-term modeling capabilities of the recurrent/attention modules within ANN.
Alternatively, a solution could be to generate intrinsic rewards based not only on the current time step, but also on past experiences (i.e., a sequence of experiences, second definition). That is, designing a reward function that handles the temporal dependencies and provides a different reward value, so that an experience is determined to be novel taking into account a full episode or path with its inherent consequences. This problem has also been recently showcased in relation to goals in (Colas et al., 2022), opening a debate around how to address this problem in an online fashion with no previous knowledge about the environment. This discussion finds in the action heterogeneity of agents studied in this chapter another twist of its screw.
¹⁹ Recall that the agent is only provided with a first-person-view input; therefore, the same observation can receive different value estimates depending on whether the button was previously pushed or not.
Chapter 4
An Evaluation Study of Intrinsic Motivation Techniques applied to Reinforcement Learning over Hard Exploration Environments
The claimed effectiveness of IM techniques in environments with sparse rewards has been proven in the previous chapter, when applied either collaboratively or independently in multi- and single-agent problems. Experiments performed in the previous chapter, which considered RND and count-based strategies to compute the intrinsic rewards, showcased the large amount of IM approaches that can be adopted to foster the exploration by combining the produced intrinsic signal with its extrinsic counterpart (e.g., as in Expression (2.25) or Expression (3.2)).
In this context, modern IM solutions (Badia, Sprechmann, et al., 2020; Raileanu & Rocktäschel, 2020; Seurin et al., 2021; T. Zhang et al., 2020) propose not only their own method to calculate the exploration bonus, but also introduce other operations to weight and scale the magnitude of their generated intrinsic rewards. Table 4.1 lists several of such IM methods, building upon the early studies focused on the generation of curiosity information (Bellemare et al., 2016; Burda, Edwards, Storkey, et al., 2018; Pathak et al., 2017). Unfortunately, as per the current literature it remains unclear whether the research race towards superior IM methods is mainly driven by the proposed reward generation approach or, instead, biased by other design choices, such as different base RL algorithms, decay of the exploration bonus, adoption of episodic scaling techniques, neural network architectures and benchmarks for the evaluation of results.
Analogously to other studies in the field of RL (Andrychowicz et al., 2021a; Andrychowicz et al., 2021b; Henderson et al., 2019; Orsini et al., 2021), a fundamental matter is to distinguish which design criteria are actually important and their impact on the performance of the agent. This is specially relevant in hard exploration environments, since it is known that under such circumstances the proficiency of the agent is very sensitive w.r.t. the configuration of its compounding modules. For this reason, the goal of this chapter is to perform a fair evaluation of IM-based solutions present in the literature, aiming to decouple the contribution of the IM approach to the overall performance of the agent from the impact of additional design choices. As a result, insights will be given about which design choices matter when designing IM mechanisms, so that these approaches can be adapted and used in new RL problems thoughtfully.

Table 4.1: Classification of various IM methods based on different design choices. We provide the parameters with which those approaches have been evaluated in the MiniGrid benchmark, except for NGU (Atari).

Ref                                              RL algorithm  Vary β_i  Scale r_i  ANN architecture
ICM (Pathak et al., 2017)                        IMPALA        ✗         ✗          Shared AC, [3CNN,256LSTM,FC]
RND (Burda, Edwards, Storkey, et al., 2018)      IMPALA        ✗         ✗          Shared AC, [3CNN,256LSTM,FC]
RIDE (Raileanu & Rocktäschel, 2020)              IMPALA        ✗         ✓          Shared AC, [3CNN,256LSTM,FC]
BeBold (T. Zhang et al., 2020)                   IMPALA        ✗         ✓          Shared AC, [3CNN,256LSTM,FC]
DoWhaM (Seurin et al., 2021)                     IMPALA        ✗         ✓          Shared AC, [3CNN,1024LSTM,1024FC]
RAPID (Zha, Ma, et al., 2021)                    PPO           ✗         ✗          Independent AC, [2FC64]
AGAC (Flet-Berliac et al., 2021)                 PPO           ✗         ✓          Independent AC, [3CNN,512FC]
D&E (Jing et al., 2021)                          PPO           ✓         ✓          Independent AC, [3CNN,512FC]
NGU (Badia, Sprechmann, et al., 2020)            R2D2          ✓         ✓          Single Q(s,a,β), [4CNN,512LSTM,512FC]
4.1 Related Work
Before digging into the contribution of this chapter, we first briefly review the concepts on which some IM solutions support their curiosity mechanisms.
Intrinsic Motivation
As we have already explained in Section 2.3.1 of Chapter 2, two main groups of IM algorithms can be found in the literature: count-based and prediction-error methods. The former calculate the reward as inversely proportional to the square root of the number of times $N(s_t)$ a given state $s_t$ has been visited:
$$r_t^{\text{counts}} = \frac{1}{\sqrt{N(s_t)}} \tag{4.1}$$
This idea can be also extended to other visitation count approaches that are suitable for high-dimensional state domains (Bellemare et al., 2016; Machado et al., 2019; Ostrovski et al., 2017; Tang et al., 2017).
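As an illustration of how counts can be kept in high-dimensional domains, the following sketch discretizes states with a random-projection sign hash, in the spirit of the hashing-based counts of (Tang et al., 2017), and then applies the same inverse square-root bonus. The function names, the number of bits, the seed and the toy state are arbitrary assumptions; the scaling coefficient of the bonus is omitted.

```python
import math
import random
from collections import Counter


def make_simhash(dim, bits, seed=0):
    """Build a hash that maps a real-valued state to a tuple of `bits` signs."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def code(state):
        return tuple(sum(a * x for a, x in zip(row, state)) >= 0 for row in proj)

    return code


code = make_simhash(dim=4, bits=8)
counts = Counter()


def hashed_bonus(state):
    """Count the hash code of the state and return 1 / sqrt(count)."""
    key = code(state)
    counts[key] += 1
    return 1.0 / math.sqrt(counts[key])


s = [0.2, -1.0, 0.5, 3.0]
first = hashed_bonus(s)
# A positively rescaled state keeps the sign of every projection, so it falls
# in the same bucket and is treated as an already visited state.
second = hashed_bonus([2 * x for x in s])
```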
On the other hand, prediction-error methods generate the exploration bonus taking into account the ability of the method to reliably predict changes in the environment. In order to accomplish it, ICM (Pathak et al., 2017) proposed a framework to calculate the difference between the actual next state ($s_{t+1}$) and a prediction of the next state taking into account the current state and action, $\hat{s}_{t+1} = f(s_t, a_t)$, being $f$ the function that will learn the dynamics of the environment. Even more importantly, instead of calculating the error directly with the raw input state, in ICM a latent representation $\phi(\cdot)$ is learned to capture only the information that affects or is affected by the agent (preventing irrelevant features of the state space from biasing the prediction):
$$r_t^{\text{ICM}} = \| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \|_2 \tag{4.2}$$
where $\| \cdot \|_2$ stands for the $L_2$ (Euclidean) norm and $\hat{\phi}(s_{t+1})$ represents the prediction of $s_{t+1}$ taking into account $\phi(s_t)$ and the actual action $a_t$ as input; that is, $\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t)$. Please refer to Figure 2.9 for better clarity.
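A minimal numerical sketch of Expression (4.2) follows. The embedding `phi` and the forward model `forward_model` are fixed, hand-written stand-ins (in ICM both are learned networks), chosen only so that the arithmetic can be followed by hand; all names and values are hypothetical.

```python
import math


def l2(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def phi(state):
    """Stand-in embedding: keeps the first two features, drops the third one."""
    return state[:2]


def forward_model(z, action):
    """Stand-in latent dynamics: the action shifts the first latent feature."""
    return [z[0] + action, z[1]]


def icm_bonus(s_t, a_t, s_next):
    predicted = forward_model(phi(s_t), a_t)  # \hat{phi}(s_{t+1})
    return l2(predicted, phi(s_next))         # Expression (4.2)


# The third feature changes in both transitions but is ignored by phi.
well_predicted = icm_bonus([0.0, 0.0, 9.0], 1.0, [1.0, 0.0, 5.0])
surprising = icm_bonus([0.0, 0.0, 9.0], 1.0, [1.0, 2.0, 5.0])
```

The first transition matches the model and yields no bonus, while the second one deviates by 2 in the second latent feature and is rewarded accordingly.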
Upon the idea of how state embeddings are learned, RIDE (Raileanu & Rocktäschel, 2020) proposed to calculate the exploration bonus by the difference between two consecutive states in their latent space:
$$r_t^{\text{RIDE}} = \| \phi(s_{t+1}) - \phi(s_t) \|_2 \tag{4.3}$$
With this change, RIDE encourages the agent to perform actions that have an impact on the environment. The modification with respect to ICM can be seen in Figure 4.1¹.
[Figure 4.1: diagram omitted. Labels: $s_t$, $s_{t+1}$, Features, $\phi(s_t)$, $\phi(s_{t+1})$, Forward model, $\hat{\phi}(s_{t+1})$, $L_{FW}$, Inverse model, $a_t$, $\hat{a}_t$, $L_{INV}$, $r^{\text{RIDE}}$.]
Figure 4.1: RIDE framework (Raileanu & Rocktäschel, 2020) to generate the intrinsic reward.
¹ Note that the forward model is now just used to build a better approximation of the feature space in the same way as the inverse model does.
What is more, to ensure that the agent does not go back and forth between a sequence of states in order to get intrinsic rewards, the reward is discounted by the episodic state visitation counts:
$$r_t^{\text{RIDE}} = \frac{\| \phi(s_{t+1}) - \phi(s_t) \|_2}{\sqrt{N_{ep}(s_{t+1})}} \tag{4.4}$$
so that the bonus now is calculated by combining experiment- and episode-level exploration (Pîslar et al., 2022; Stanton & Clune, 2018). Similarly but more aggressively, in BeBold/NovelD (T. Zhang et al., 2020, 2022) the reward was restricted so that only the first time the agent visits a given state in an episode was valid:
$$r_t^{\text{BeBold}} = \max\left( \frac{1}{N(s_{t+1})} - \frac{1}{N(s_t)},\ 0 \right) \cdot \mathbb{I}\left[ N_e(s_{t+1}) = 1 \right] \tag{4.5}$$
where $N_e(\cdot)$ stands for the episodic state count that is reset every episode, and $\mathbb{I}[\cdot]$ is an indicator function taking value 1 if its argument is true (0 otherwise).
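Expressions (4.4) and (4.5) can be compared side by side with a short sketch. Embeddings and counts are passed in as plain numbers, so the way they are obtained (learned features, lifelong and episodic counters) is assumed to happen elsewhere; the function names are illustrative.

```python
import math


def l2(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ride_bonus(z_t, z_next, n_ep_next):
    """Expression (4.4): embedding change scaled by the episodic count of s_{t+1}."""
    return l2(z_next, z_t) / math.sqrt(n_ep_next)


def bebold_bonus(n_t, n_next, n_ep_next):
    """Expression (4.5): clipped gain in inverse counts, first episodic visit only."""
    gain = max(1.0 / n_next - 1.0 / n_t, 0.0)
    return gain if n_ep_next == 1 else 0.0


# RIDE: the same transition is worth less the more often s_{t+1} is revisited.
ride_first = ride_bonus([0.0, 0.0], [3.0, 4.0], n_ep_next=1)
ride_fourth = ride_bonus([0.0, 0.0], [3.0, 4.0], n_ep_next=4)

# BeBold: rewarded only when moving towards a less visited state for the
# first time in the episode.
frontier = bebold_bonus(n_t=4, n_next=1, n_ep_next=1)
revisit = bebold_bonus(n_t=4, n_next=1, n_ep_next=2)
backward = bebold_bonus(n_t=1, n_next=4, n_ep_next=1)
```

In this toy case the RIDE bonus halves from 5.0 to 2.5 on the fourth episodic visit, whereas BeBold gives 0.75 at the frontier and nothing for revisits or for moves towards more visited states.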
Following the idea of combining various degrees of exploration, NGU (Badia, Sprechmann, et al., 2020) calculated the intrinsic reward as the combination of two sub-rewards:
$$r_t^{i} = r_t^{\text{episodic}_i} \cdot \min\left\{ \max\left\{ r_t^{\text{lifelong}_i},\ 1 \right\},\ 5 \right\} \tag{4.6}$$
being $r_t^{\text{episodic}_i}$ calculated through an episodic memory (Pritzel et al., 2017) and $r_t^{\text{lifelong}_i}$ computed across the whole training. In addition, NGU adopted an UVFA (Schaul et al., 2015) framework so that the employed action-value function was subject to different $\beta$ coefficients, $Q(s_t, a_t, \beta)$, which allows learning policies with different explorative behaviors using a single network. Last but not least, FaSo (Bougie & Ichise, 2021) combined local and global exploration by generating two different intrinsic rewards, depending on the quality of the reconstruction of two contexts built from the same state.
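The clipping in Expression (4.6) means that the lifelong term acts only as a multiplier between 1 and 5 on the episodic term. A short sketch with hypothetical input values, where the function name and the `max_scale` parameter name are illustrative:

```python
def ngu_bonus(r_episodic, r_lifelong, max_scale=5.0):
    """Expression (4.6): episodic bonus modulated by the clipped lifelong bonus."""
    return r_episodic * min(max(r_lifelong, 1.0), max_scale)


# Lifelong values below 1 leave the episodic bonus untouched; values above 5
# are capped, so the episodic bonus is amplified at most five times.
values = [ngu_bonus(0.5, r_lifelong) for r_lifelong in (0.2, 3.0, 10.0)]
```

With an episodic bonus of 0.5, the three lifelong values yield 0.5, 1.5 and 2.5 respectively.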
Aside from the method to calculate the exploration bonus itself, new IM solutions are shown to yield better results in their respective publications, yet using additional components which were not used when compared to the selected baselines. Thus, rather than proposing a new intrinsic generation module, in this chapter we carry out an evaluation study to gauge the impact of such modifications (Table 4.1) and to ascertain the contribution of the IM reward generation to the overall performance of the agent.
Reinforcement Learning Studies
Other benchmarks/studies have been done in recent times surrounding RL:
to begin with, (Taïga et al., 2020) evaluates the performance of different
exploration bonuses (pseudo-counts, ICM, RND and noisy networks) in the
whole Atari 2600 suite with Rainbow (Hessel et al., 2017). By contrast,
(Burda, Edwards, Pathak, et al., 2018) carried out a large-scale study
based exclusively on prediction error bonuses over 54 environments, where
they investigated the efficacy of using different feature learning methods
with PPO (Schulman, Wolski, et al., 2017). This chapter also connects
with (Andrychowicz et al., 2021a; Andrychowicz et al., 2021b; Henderson
et al., 2019; Orsini et al., 2021), a series of evaluation studies aimed at
understanding which choices among high- and low-level algorithmic options
affect the learning process. As such, the studies in (Andrychowicz et al.,
2021a; Andrychowicz et al., 2021b) focus on on-policy deep actor-critic
methods (examining different policy losses, architectures and advantage
estimators). On the other hand, (Orsini et al., 2021) addresses adversarial
IM related decisions (multiple reward functions and observation
normalization methods), whereas (Henderson et al., 2019) investigates
reproducibility issues using different random seeds, activation functions,
codebases, and reward scales, among other experimental choices.
Contribution
To the best of our knowledge, there is no prior work that exhaustively
evaluates different choices for the implementation of intrinsic motivation
strategies. The study presented in this chapter of the Thesis takes a step
further by analyzing different weighting and scaling strategies for the
combination of intrinsic and extrinsic rewards, as well as the impact of
adopting different neural network architectures and dimensions. The design
choices evaluated here are applicable to any intrinsic curiosity generation
module, so that conclusions can be drawn about which ones are the most
suitable for a given task and an environment with sparse rewards.
4.2 Methodology of the Study
After reviewing different solutions proposed in the literature to cope with
hard exploration issues with IM techniques, we now proceed by describing
the methodology adopted in this chapter to gauge the advantages and
drawbacks of design choices that are present in some of them, giving an
informed hint of their utility when extrapolated to the rest of IM solutions.
The methodology is driven by the pursuit of responses to three research
questions (RQ):
• RQ1: Does the use of a static, parametric or adaptive decaying
intrinsic coefficient weight β affect the agent's training process?
• RQ2: What is the impact of using episodic counts to scale the
intrinsic bonus? Is it better to use episodic counts than to just
consider the first time a given state is visited by the agent?
• RQ3: Is the choice of the neural network architecture crucial for the
agent's performance and learning efficiency?
Departing from these questions, the following methodology has been
devised:
4.2.1 RQ1: Varying the Weight of the Intrinsic Reward Coefficient β
In general, it is not advisable to combine raw extrinsic and intrinsic reward
signals directly due to their potentially diverging value scales. Moreover,
filters with kernel 3×3, stride equal to 2, and padding 1) and a FC-256
layer. Originally, (Raileanu & Rocktäschel, 2020) used an LSTM of 256 units
instead of a FC-256. We analyze the results with no recurrence despite
operating in a POMDP setting, which also allows us to check whether
recurrent modules are actually necessary in these environments. What is
more, even though (Raileanu & Rocktäschel, 2020) defined the previously
mentioned architecture design, in their GitHub implementation they seem to
use larger networks
(https://github.com/facebookresearch/impact-driven-exploration). This is the
reason why in Table 4.1 we do not specify the FC units. This last
architecture will be labeled as the default architecture, in order to endow
the agent with more learning capabilities and to ensure that it is not
limited by a restricted network.
Figure 4.4: (a) Sophisticated/default and (b) lightweight network
architectures.
[Diagram omitted. (a) Three CNN layers (32 filters, 3×3 kernel, 2×2 stride,
padding 1) followed by a FC-256 layer and two heads: FC-7 for π(a|s) (7
values, distribution) and FC-1 for V(s) (1 value). (b) Actor: FC-64, FC-64,
FC-7 for π(a|s) (7 values, distribution); Critic: FC-64, FC-64, FC-1 for
V(s) (1 value).]
4.4 Results and Analysis
In this section experimental results are presented and discussed towards
answering the research questions posed in Section 4.2. Scripts and results
have been made available in a public GitHub repository
(https://github.com/aklein1995/intrinsic_motivation_techniques_study) to
foster reproducibility and stimulate follow-up studies. For all the
experiments described in this section we provide the mean and standard
deviation of the average return computed over the past 100 episodes,
performing 3 different runs (each with a different seed) to account for the
statistical variability of the results.
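The reporting protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each run is assumed to be a list of per-episode extrinsic returns, and the population standard deviation across seeds is used (the text does not state whether the population or the sample estimator is employed).

```python
from statistics import mean, pstdev


def last_k_average(episode_returns, k=100):
    """Average return over the most recent k episodes of a single run."""
    return mean(episode_returns[-k:])


def summarize_runs(runs, k=100):
    """Mean and standard deviation, across seeds, of the last-k average return."""
    per_run = [last_k_average(run, k) for run in runs]
    return mean(per_run), pstdev(per_run)


# Three synthetic runs (one per seed) with different final performance.
runs = [
    [0.0] * 50 + [1.0] * 100,  # solved: last 100 episodes return 1.0
    [0.5] * 150,               # partially solved
    [0.0] * 150,               # not solved
]
avg, std = summarize_runs(runs)
print(avg, std)
```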
4.4.1 RQ1: Does the use of a static, parametric or adaptive decaying
intrinsic coefficient weight β affect the agent's training process?
Our first set of results compares the multiple weighting strategies
introduced in Section 4.2.1, which tune in different ways the importance
granted to the intrinsic rewards with respect to the extrinsic signals coming
from the environment.
The results are shown in Table 4.2. It is straightforward to note that
RIDE outperforms COUNTS and RND. At this point we recall that RIDE is
configured with episodic count scaling, in accordance with the final solution
proposed in (Raileanu & Rocktäschel, 2020). Count-based rewards seem to be
the best solution when facing easy exploration scenarios (MN7S4 and MN10S4),
but their performance degrades when facing scenarios that require more
sophisticated exploration strategies. A similar pattern can be observed when
analyzing the results of RND, which is unable to solve MN7S8 and O2Dlh with
any kind of weighting strategy. In contrast, RIDE manages to solve all the
tasks with its naïve implementation, although it achieves better results
when using more sophisticated weighting strategies.

Table 4.2: Results of different IM strategies over several MiniGrid scenarios
with static (_s), multiple static (_ngu) (as in NGU, Badia, Sprechmann, et
al., 2020), parametric decay (_pd) or adaptive decay (_ad) weight β to
modulate the importance of the intrinsic bonus in the computation of the
reward. Cell values denote the training steps/frames (1e6 scale) at which the
optimal average extrinsic return is achieved; between parentheses, steps at
which 95% of the optimal average extrinsic return is reached. The best
results for every (IM strategy, scenario) combination are highlighted in bold.

Strategy        MN7S4           MN10S4          MN7S8           KS3R3           O2Dlh
COUNTS_s        0.93 (0.86)     1.87 (1.78)     >30             >30             >50
COUNTS_ngu      1.17 (1.11)     2.67 (2.35)     >30             >30             >50
COUNTS_pd       0.96 (0.83)     2.27 (1.67)     >30             22.91 (22.49)   >50
COUNTS_ad       1.03 (0.92)     1.81 (1.65)     24.23 (24.10)   >30             >50
COUNTS_ad1000   1.03 (0.92)     1.81 (1.65)     23.63 (23.56)   >30             >50
RND_s           3.83 (3.78)     7.84 (7.79)     >30             10.83 (9.72)    >50
RND_ngu         2.69 (2.62)     5.78 (5.75)     >30             8.12 (7.50)     >50
RND_pd          4.04 (3.94)     6.02 (5.99)     >30             9.24 (8.07)     >50
RND_ad          2.02 (1.39)     3.21 (2.65)     >30             6.02 (5.43)     >50
RND_ad1000      3.62 (1.42)     3.59 (3.50)     >30             7.47 (6.66)     >50
RIDE_s          2.49 (1.82)     2.27 (2.14)     4.00 (3.68)     6.63 (4.39)     30.88 (25.87)
RIDE_ngu        3.85 (2.40)     2.59 (1.26)     >30             7.18 (3.91)     36.07 (29.96)
RIDE_pd         5.20 (2.14)     5.01 (1.96)     3.73 (3.49)     6.42 (3.87)     29.27 (20.84)
RIDE_ad         2.89 (0.91)     1.60 (0.99)     >30             5.93 (2.99)     27.65 (20.91)
RIDE_ad1000     2.54 (0.91)     1.60 (0.99)     3.88 (3.70)     4.70 (3.00)     28.00 (23.01)
We now focus the discussion on the gaps arising from the use of different
weighting strategies. The static (default) weighting strategy (indicated
with a suffix _s appended to each approach) is surpassed by any of the
other proposed weighting approaches in the majority of the cases. When
using multiple static values (_ngu), the only approach that takes advantage
of such a strategy is RND, whereas it yields worse results for both COUNTS
and RIDE in all the cases. This might happen due to the slow pace at
which the intrinsic reward values decay in RND with respect to the other
strategies⁶. On the other hand, the use of parametric decay (_pd), which
decreases the weight of the intrinsic reward as the training evolves so that
exploration is favored mainly in its early stages, provides significant gains
in almost all simulated scenarios. This approach is similar to _ngu.
However, instead of using multiple agents with different static intrinsic
coefficients, the parametric decay strategy modulates a single value during
the course of training. When employing the _pd strategy, COUNTS is able to
get a valid solution in KS3R3, RND improves all its scores and RIDE improves
its behavior in the most challenging scenarios MN7S8, KS3R3 and O2Dlh.
Nevertheless, _ngu and _pd highly depend on the intrinsic coefficients given
to each agent and on the evolution of a single intrinsic coefficient during
training, respectively. This strongly impacts the agent's performance for a
given scenario and dictates when those approaches might be better. Indeed,
it can be seen as a tuning parameter like ε in ε-greedy strategies.

⁶ The error output by RND has higher amplitude values than those of RIDE.
Therefore, RND is a better candidate to benefit from the _ngu strategy
through the use of agents with smaller intrinsic coefficient weights
(avoiding over-exploration issues in the case of RND and, conversely,
inducing under-exploration issues with RIDE).
Finally, the use of adaptive decay (_ad) produces better results in
COUNTS and RND when compared to the static case (_s). For RIDE,
however, this statement does not strictly hold true, as its performance
degrades in MN7S4 and MN7S8 (the agent does not even solve the task in
the latter case). We hypothesize that this is due to the fact that the
initial intrinsic returns are too high. Hence, calculating the historical
average of the intrinsic returns biases the computation of the decay factor.
As outlined in Section 4.2.1, a workaround to overcome this issue is to
calculate returns with a moving average over a window of ω steps/rollouts.
We hence include in the benchmark an adaptive decay with a window size of
ω = 1000 rollouts (_ad1000). With this modification, RIDE improves its
behavior in all the complex scenarios. Nevertheless, _ad1000 performs
slightly worse than _ad in RND, but never worse than its static counterpart
_s. In general, _ad1000 promotes higher intrinsic coefficient values than
_ad, as the calculated average return is a better fit of the actual return
values. This leads to a lower decay value and a higher intrinsic
coefficient, forcing the agent to explore more intensely than with _ad (but
less than with _s).
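The bias attributed above to the historical average can be illustrated with a small Python sketch. It does not reproduce the decay factor of Section 4.2.1; it only contrasts the two averaging schemes on synthetic data. The function names, the synthetic return values and the window size used in the example are assumptions made for illustration.

```python
from collections import deque


def historical_average(intrinsic_returns):
    """Average over every rollout seen so far (as used by _ad)."""
    return sum(intrinsic_returns) / len(intrinsic_returns)


def windowed_average(intrinsic_returns, window):
    """Moving average over the most recent `window` rollouts (as in _ad1000)."""
    recent = deque(intrinsic_returns, maxlen=window)  # keeps only the tail
    return sum(recent) / len(recent)


# Synthetic intrinsic returns: very high at the start, low afterwards.
returns = [10.0] * 5 + [1.0] * 95

print(historical_average(returns))    # 1.45, still inflated by early rollouts
print(windowed_average(returns, 10))  # 1.0, tracks the current return level
```

With the windowed estimate, the reference value used to compute the decay follows the current magnitude of the intrinsic returns instead of being dominated by the initial, much larger ones.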
4.4.2 RQ2: What is the impact of using episodic counts to scale the
intrinsic bonus? Is it better to use episodic counts than to just consider
the first time a given state is visited by the agent?
Answers to this second question can be drawn from the results of Table
4.3. A first glance at this table reveals that the use of episodic counts or
first-time visitation strategies for scaling the generated intrinsic rewards
leads to better results. In the most challenging environments (MN7S8,
KS3R3 and O2Dlh), these differences are even wider, as they require a
more intense and efficient exploration by the agent. In fact, when the
training stage is extended to cope with a more complex task, intrinsic
rewards also decrease, inducing a less explorative behavior in the agent
the longer the training period is extended. Hence, the agent does not
seek as much novelty as it should, which might explain why the baseline
implementation of intrinsic motivation (_noep) fails in those scenarios as
opposed to when using the scaling strategies (e.g., COUNTS and RND
in O2Dlh). By contrast, in environments requiring less exploration (MN7S4
and MN10S4), differences are narrower when using episode-level exploration,
which may even be counterproductive in some cases (e.g., COUNTS at MN10S4
with _1st).
Table 4.3: Comparison of different IM strategies when using no scaling
(_noep), episodic (_ep) or first-time visit (_1st) to scale the generated
intrinsic reward and combine two types of exploration degrees. Interpretation
as in Table 4.2.

Strategy        MN7S4           MN10S4          MN7S8           KS3R3           O2Dlh
COUNTS_noep     0.93 (0.86)     1.87 (1.78)     >30             >30             >50
COUNTS_ep       0.76 (0.56)     1.55 (1.47)     2.77 (2.56)     3.99 (2.00)     33.17 (29.79)
COUNTS_1st      0.85 (0.48)     >20             1.64 (1.42)     1.97 (1.19)     45.26 (37.29)
RND_noep        3.83 (3.78)     7.84 (7.79)     >30             10.83 (9.72)    >50
RND_ep          1.41 (0.96)     1.72 (1.34)     3.60 (3.30)     4.31 (2.63)     18.54 (14.07)
RND_1st         1.18 (0.59)     1.36 (0.78)     1.97 (1.72)     4.78 (2.29)     21.19 (9.88)
RIDE_noep       4.71 (4.54)     5.29 (5.20)     >30             11.44 (9.63)    39.68 (35.15)
RIDE_ep         2.49 (1.82)     2.27 (2.14)     4.00 (3.68)     6.63 (4.39)     30.88 (25.87)
RIDE_1st        3.17 (1.34)     3.27 (2.33)     1.95 (1.83)     5.13 (2.26)     32.14 (28.03)
ICM_noep        2.67 (2.55)     >20             >30             8.02 (6.75)     34.04 (26.78)
ICM_ep          3.25 (1.26)     1.68 (1.59)     >30             5.32 (3.14)     19.05 (13.87)
ICM_1st         1.56 (0.87)     1.90 (1.07)     2.11 (1.77)     4.72 (4.23)     20.74 (10.09)
To better understand the superiority of RIDE over ICM as shown in
(Raileanu & Rocktäschel, 2020), we also evaluate the performance of both
approaches under equal conditions, with (_ep, _1st) and without (_noep)
scaling strategies. In this way, we can examine the actual improvement
between the two types of exploration bonus strategies. Surprisingly, ICM
gives better results in almost all the cases of the analyzed scenarios, yet
it exhibits a large variance in several environments, which leads to failure
(MN10S4 and MN7S8). The reason might reside in how RIDE encourages
the agent to perform actions that affect the environment, forcing the agent
to assess all possible actions, so that the entropy of the policy
distribution decays slowly. This hypothesis is buttressed by the results
obtained in MN7S4 and MN10S4: we recall that there are 3 useless actions in
these scenarios (pick up, drop and done), and RIDE performs clearly worse
(except for the _ep case in MN7S4). In more complex scenarios, where
those actions are relevant for the task, performance gaps between RIDE
and ICM become narrower.
For the sake of completeness of the results discussed for RQ1 and RQ2,
Figure 4.5 shows the training convergence plots of COUNTS, RND and
RIDE for different weighting and scaling strategies.
Figure 4.5: Convergence plots of the schemes reported in Tables 4.2 and 4.3.
Each column represents an Intrinsic Motivation type (COUNTS, RND and RIDE
from left to right); each row represents a different scenario (MN7S4, MN10S4,
MN7S8, KS3R3 and O2Dlh, from top to bottom). All figures depict the average
extrinsic return as a function of the number of training steps/frames (in a
scale of 1e7). For each scenario, optimal and suboptimal scores are
highlighted with horizontal black and brown lines, respectively.
[Plots omitted. Legend: static, ngu, pd, ad, ad 1000, ep, 1st. Horizontal
axes span up to 2·10^7 frames for MN7S4 and MN10S4, 3·10^7 for MN7S8 and
KS3R3, and 5·10^7 for O2Dlh. Marked return levels on the vertical axes:
0.77 (MN7S4), 0.76 (MN10S4), 0.65 (MN7S8), 0.90 (KS3R3) and 0.95 (O2Dlh).]
4.4.3 RQ3: Is the choice of the neural network architecture crucial for the
agent's performance and learning efficiency?
One of the most tedious parts when implementing an algorithm is to
determine which network architectures to use. First of all, when using an
actor-critic RL framework it is necessary to establish whether a single but
two-headed network or two different (and independent) networks will be
adopted for the actor and the critic modules. In addition, some IM
approaches rely on neural networks to generate the intrinsic rewards.
Herein we consider two of those solutions, RND and RIDE, and evaluate
the contribution of different neural network architectures to the overall
performance of the agent. We use architectures similar to the ones used
in RIDE and RAPID⁷: (a) a two-headed shared actor-critic network built
upon convolutional and dense layers and (b) two independent MLP networks
for the actor and the critic, respectively (Figure 4.4). Moreover,
we fix the RL algorithm (PPO) and detail the number of parameters and the
time taken for the forward and backward passes in each network for an
informed comparison.
Table 4.4: Comparison of the number of parameters and the time required for
forward and backward passes between the ANN architectures described in
Section 4.2.3 when being used with different IM modules.

                Lightweight (lw)            Default
                Parameters   Time (ms)      Parameters   Time (ms)
Actor           14,087       -              -
Critic          13,697       -              -
Actor+Critic    27,784       -              29,896       -
Dictionary      -            83.66          -            95.11
Total COUNTS    27,784       724.25         29,896       937.37
Embedding       13,632       -              19,392       -
RND             27,264       336.39         38,784       721.64
Total RND       55,048       986.13         68,937       1,408.42
Inverse         12,871       -              18,439       -
Forward         12,928       -              18,464       -
Embedding       13,632       -              19,392       -
RIDE            39,431       388.84         56,295       844.43
Total RIDE      67,215       1,177.75       86,191       1,791.70
First of all, Table 4.4 reports these details of the neural architectures
in use for COUNTS, RND and RIDE. It shows the differences in terms of the
number of parameters of each network, and the latency taken by the sum of
both forward and backward passes through those IM modules (we note that
COUNTS uses a dictionary and not a neural network for the reward
generation). In addition, we summarize the total number of parameters
depending on the implemented IM module, together with the actor-critic
parameters. As for the total elapsed time, we report the total amount of
time required for a rollout collection. This elapsed time takes into account
both the forward and backward passes in the IM modules, and just the forward
pass across the actor-critic, among other operations executed when
collecting samples. Times are measured when executing the experiments on an
Intel(R) Xeon(R) CPU E3-1505M v6 processor running at 3.00GHz.
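The parameter counts of the lightweight actor and critic in Table 4.4 can be checked with simple arithmetic. The sketch below assumes that the MLPs of Figure 4.4(b) receive a flattened 7×7×3 observation (147 inputs); this input size is an assumption on our side, although it reproduces the reported figures exactly. The helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total parameters of an MLP given its layer sizes, input first."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


OBS_DIM = 7 * 7 * 3  # assumption: flattened 7x7x3 MiniGrid observation

actor_params = mlp_params([OBS_DIM, 64, 64, 7])   # FC-64, FC-64, FC-7
critic_params = mlp_params([OBS_DIM, 64, 64, 1])  # FC-64, FC-64, FC-1

print(actor_params)                  # 14087
print(critic_params)                 # 13697
print(actor_params + critic_params)  # 27784
```

The three values match the Actor, Critic and Actor+Critic entries of the lightweight column in Table 4.4.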
⁷ Even with different neural architectures and base RL algorithms, they
successfully solve the same tasks in MiniGrid with different
sample-efficiency.
On the other hand, Table 4.5 shows the performance of the agent when
configured with such different network configurations. It can be seen that
when reducing the number of parameters in both the actor-critic and the
IM modules (_lw_tot), the agent's behavior degrades critically. This occurs
even with COUNTS, where the modification should have had less impact, as
the generation of intrinsic rewards does not depend on a neural network but
on a dictionary. In the case of RIDE, performance gets worse in all cases
except for MN7S4, where the exploration requirements are the lowest among
all the analyzed scenarios. As for RND, the fully lightweight configuration
of the networks renders the tasks unsolvable by the agent.
Table 4.5: Performance obtained with COUNTS, RND and RIDE when 1)
using the default network configurations, 2) using a lightweight
architecture for the IM modules and keeping the actor-critic with a default
configuration (_lw_im), and 3) when both the IM and the actor-critic modules
are implemented with the lightweight networks (_lw_tot). Values in the cells
represent the training steps/frames (in a scale of 1e6) when the optimal
average extrinsic return is achieved. Within brackets, the training steps
when a suboptimal behavior is accomplished.

Strategy        MN7S4           MN10S4          MN7S8           KS3R3           O2Dlh
COUNTS          0.93 (0.86)     1.87 (1.78)     >30             >30             >50
COUNTS_lw_im    0.93 (0.86)     1.87 (1.78)     >30             >30             >50
COUNTS_lw_tot   1.64 (1.48)     2.52 (2.36)     >30 (29.96)     >30             >50
RND             3.86 (3.79)     7.84 (7.79)     >30             10.84 (9.72)    >50
RND_lw_im       5.66 (5.44)     6.68 (6.61)     >30             10.97 (9.45)    >50
RND_lw_tot      >20             >20             >30             >30             >50
RIDE            2.49 (1.82)     2.27 (2.14)     4.01 (3.38)     6.63 (4.39)     30.88 (25.87)
RIDE_lw_im      1.63 (1.31)     1.75 (1.53)     >30             9.44 (5.08)     >50
RIDE_lw_tot     1.42 (1.05)     >20             >30             8.00 (5.69)     >50
Going back again to Table 4.4, it can be seen that the number of parameters
to be learned is mostly dependent on the IM networks under consideration,
whereas joining the actor and the critic into a single two-headed network
barely increases the dimensionality requirements⁸. Nevertheless, the time
required to perform a forward pass increases by approximately 25% when a
single actor-critic network is employed. Moreover, by using a single
network, part of the parameters of the network are shared between the actor
and the critic, which can induce more instabilities but also a faster
learning, since the model may share features between the actor and the
critic and require fewer samples to learn a given task. With this in mind,
we carry out an additional ablation study considering only the reduction of
parameters at the IM modules, and maintaining the actor-critic as a single
two-headed network.
⁸ We note that the number of parameters is slightly increased, but the
networks also differ in the type of layers that they use (the two-headed
network uses CNNs, while the independent actor-critic only uses dense
layers).

Such results are provided in the second row of every group of results in
Table 4.5 (_lw_im). These outcomes evince that when using RND_lw_im,
slightly worse results are achieved with respect to RND with the default
network setup. However, its performance does not degrade dramatically down
to failure as with RND_lw_tot. Hence, using parameter sharing in a single
actor-critic network yields a faster learning process and contributes
positively in this case, which also suggests that the dimensionality
reduction in IM modules is not that critical in RND. Regarding RIDE_lw_im,
in some cases (MN7S4 and MN10S4) it attains better results, whereas in
MN7S8 and KS3R3 it suffers from a notable performance degradation (MN7S8 is
not solved). It can also be observed that the use of the single actor-critic
network might be beneficial when reducing the complexity of the IM network
(_lw_im), as it mitigates the performance degradation in 3 out of 5
scenarios (still, MN7S8 and O2Dlh are not solved). This contrasts with the
results of separated actor-critic networks (_lw_tot), which fail to solve
MN7S8, O2Dlh and MN10S4.
Figure 4.6: Convergence plots of COUNTS and RIDE for some scenarios when
using the default network (blue), _lw_im (green) and _lw_tot (red). All the
figures depict the average extrinsic return as a function of the number of
training frames.
[Plots omitted. Panels: RIDE at MN7S8, RIDE at O2Dlh and COUNTS at MN7S8.]
Finally, we include Figure 4.6 in order to help the reader extract further
conclusions and gain insight about the behavior of the learning process.
This figure reveals that, in the two cases in which RIDE_lw_im failed
(namely, MN7S8 and O2Dlh), the agent learned to solve the task in two
out of the three experiments that were run (seeds). This underscores
the impact of using different actor-critic architectures. Moreover, with
the default actor-critic architecture and using the COUNTS approach, the
agent is also capable of solving the MN7S8 task in 2 out of the 3 runs. When
using COUNTS_lw_tot, the agent reaches suboptimal performance and
almost the optimal one within the frame budget.
4.5 Conclusions
In this chapter we have studied the actual impact of different design choices
when implementing RL agents augmented with IM mechanisms. More
concretely, we have evaluated multiple weighting strategies to grant
different importance when combining the intrinsic and extrinsic rewards
(i.e., the β coefficient). Moreover, we have analyzed the effect of applying
distinct degrees of exploration (to scale generated intrinsic rewards, r^i)
along with the influence of the complexity of the network architectures on
the performance of both actor-critic and IM modules. To conduct the study we
have utilized environments belonging to the MiniGrid benchmark, so as to
test the quality of the considered schemes in a variety of tasks
characterized by a hard to very-hard demand of an exploratory behavior of
the agent.
On the one hand, we have shown that using a static intrinsic coefficient
might not be the best strategy if focusing on sample efficiency. Adaptive
decay strategies have proven to be promising, although they require a good
parameterization of the sliding window. The parametric decay approach,
in turn, has performed competently. However, the parameter values of the
decay function are more dependent on the task at hand than in the previous
scheme, making this strategy more sensitive to the environment and the
task. This resembles what occurs with ε-greedy strategies in some
value-based algorithms. The use of multiple agents (as in NGU), each
featuring a different exploration-exploitation balance, also suffers from
the need for a good parameterization, but it reports worse results.
On the other hand, the use of episode-level exploration along with
experiment-level strategies seems to be preferable in environments with
hard exploration requirements. There is no clear winner between episodic
counts and first-visitation strategies, as their performance is subject to
the environment and the selected IM strategy. However, both achieve
significant performance gains. The adoption of any of these strategies can
be advised in future IM-related studies.
We have also analyzed the impact of the neural network architecture on
both the actor-critic and IM modules. Results have shown that reducing
the number of parameters in the IM modules deteriorates the performance
of the agent, making it fail in some challenging scenarios which are
feasible for the complex neural configuration. What is more, when reducing
the dimensions of the IM network, it is preferable to use a shared
two-headed actor-critic as it provides better results, although it is not
clear whether those results are due to the use of a single neural network
(and the underlying parameter sharing and common feature space for the actor
and the critic), or instead to the adoption of different neural processing
architectures (e.g. CNNs). Further research is necessary in this direction.
All in all, the evaluation study presented in this chapter can serve as a
reference for the community in the implementation of intrinsic motivation
strategies to address (1) tasks with sparse rewards; or (2) hard exploration
scenarios where classic exploration techniques do not suffice.
Chapter 5
Towards Improving Exploration in Self-Imitation Learning using Intrinsic
Motivation
The previous chapter has analyzed the impact of using different design
factors over rewards generated with IM techniques. We have evaluated
those algorithms not in singleton but in procedurally generated (PCG)
environments, where the generalization capabilities of the agent are
essential for it to exhibit an overall good performance. Continuing with the
idea of improving the sample efficiency over hard exploration PCG
environments, in this chapter we further examine the use of Imitation
Learning (IL) for this purpose.
Over the years, the use of IL and Transfer Learning has been widely adopted
to accelerate the learning process and to reduce the amount of required
training data (Hua et al., 2021; Nair et al., 2021; Wu et al., 2022). The
strategy of using expert demonstrations has also been adopted to tackle
exploration issues in hard exploration scenarios with sparse rewards, by
either initializing a buffer with good behavior trajectories (Hester et al.,
2017; Vecerik et al., 2018) or by generating a curriculum-style learning
and re-initializing the agent smartly (Aytar et al., 2018; Salimans & Chen,
2018).
Unfortunately, such expert demonstrations are not always available in
practice. This motivated the idea of storing trajectories (self-collected
by the agent) featuring good exploration properties for a later replay,
forging what is now known as self-Imitation Learning (self-IL¹). Despite
their effectiveness in alleviating the need for expert demonstrations,
self-IL methods are highly sensitive to the early discovery of sufficiently
good trajectories, which can be challenging in hard exploration scenarios.

¹ There is an approach named directly as SIL. Thus, for the sake of clarity,
in this chapter we refer as self-IL to the family of algorithms in which the
agent collects the experiences by itself to augment its sample efficiency,
whereas SIL will denote the specific approach presented in (Oh et al.,
2018).
As in previous chapters, we report the mean and standard deviation of
the average return computed over the past 100 episodes for each experiment,
performing 3 different runs (with different seeds) to account for the
statistical variability of the results. For transparency and reproducibility
of the experiments discussed later, the code is available in a public GitHub
repository: https://github.com/aklein1995/exploration_sil_im.
5.3.1 Environments
We evaluate our proposed approach over MiniGrid (Chevalier-Boisvert et
al., 2018), as explained in Chapter 4 (Section 4.3.1). Specifically, we
evaluate the framework over the following scenarios (for further information
about the environments and their tasks, please refer to Chevalier-Boisvert
et al., 2018): MultiRoom (MN7S8 and MN12S10), KeyCorridor (KS4R3) and
ObstructedMaze (O2Dlh). The criterion to select these environments relies
on their difficulty as verified in (Zha, Ma, et al., 2021), where MN12S10 and
KS4R3 were identified as the most difficult scenarios under analysis: the
first was solved by RAPID and RIDE, while the latter remained unsolved
for the given train steps by any of the baselines under consideration. In the
case of (Ning et al., 2021), where the performance of SIL+BeBold was
analyzed in MiniGrid, the most difficult environments were KS3R3 and MN6S,
which are more easily solvable than KS4R3 and MN12S10 (they use smaller
rooms and fewer rooms, respectively). Additionally, we include
another very hard exploration scenario, not considered in the aforementioned
works, which possesses different characteristics and requirements
than the previous environments: O2Dlh.
5.3.2 Baselines and Hyperparameters
We select RAPID (Zha, Ma, et al., 2021) and SIL (Oh et al., 2018) as
self-IL baseline methods, and BeBold (T. Zhang et al., 2020) as the IM.
All strategies use PPO as their core RL algorithm, which uses a number
of steps equal to 128 and 4 minibatches of size 32 for training (one unique
agent). Each train step comprises 4 epochs, where optimization updates
are carried out with a learning rate of 10^-4, a clipping factor of ε = 0.2,
γ = 0.99 and λ = 0.95 for the advantages calculation with GAE as per
Expression (2.18). Furthermore, the loss function (recall Expression (2.22))
is weighted by an entropy coefficient of c2 = 0.01 and a value coefficient of
c1 = 0.5. Moreover, we employ 2 independent fully-connected layers for
the actor and the critic (each with 64 neurons) for all the experiments
and baselines.
Specific parameters for RAPID are configured as in its original
implementation reported in the paper where it was first presented: a buffer
size of D = 10^4 experiences, batch size of 256 and 5 off-policy updates
after every episode completion. Moreover, the weights to rank the replay
buffer episodes, Expression (5.1), are set to w0 = 1, w1 = 0.1 and
w2 = 0.001 according to the sensitivity analysis shown in the original
approach (Zha, Ma, et al., 2021).
In the case of SIL, for the sake of fairness with respect to RAPID the
same replay buffer size (D = 10^4) and the same off-policy update ratio
(5) are used. Moreover, a SIL loss weight of 0.1 and a SIL value loss
weight of β_sil = 0.01 are set. Regarding PER (Schaul et al., 2016), we
select a prioritization exponent α_PER = 0.6 and a bias correction factor
β_PER = 0.1. All these parameter values were chosen according to the
supplementary material provided in (Oh et al., 2018)², and taking into
account that we aim to solve hard exploration environments. On the other
hand, the intrinsic reward when using BeBold is computed as described
in Section 5.2.2, calculating the novelty with visitation counts (taking
advantage of the discrete state space) and using an intrinsic coefficient of
β = 0.005. The value of this coefficient (together with that of the entropy
coefficient, c2) was tailored based on the results of a grid search carried
out over scenario MN7S8, whose results are shown in Figure 5.3, while
keeping the values of other parameters fixed (e.g. the RAPID weight
values referred to above, namely w0, w1 and w2).
Figure 5.3: Results of a grid search over the MN7S8 scenario to determine
β (intrinsic motivation coefficient) and c2 (entropy coefficient). (Left)
Returns obtained after 3·10^6 training steps; (Right) Number of steps (in
scale of millions, 10^6) required for the agent to achieve an optimal average
return (≈0.65) for the first time.
5.4 Results and Analysis
This section presents the results of the proposed approach in PCG environments, examining them in depth from different angles:
5.4.1 Performance of self-IL and IM Techniques: Independent versus Combined
To begin with, Figure 5.4 analyzes the actual impact on the performance of the agent when using IM and self-IL techniques, either independently or jointly. We observe that BeBold (light blue curve) shows a good behavior only in 2 out of the 4 environments under consideration (namely, MN7S8 and KS4R3). However, it completely fails when dealing with the challenging scenarios of the MultiRoom and ObstructedMaze series (i.e., MN12S10 and O2Dlh). When using just SIL (green curve), it performs poorly in all scenarios. We here recall what we stated at the beginning of this chapter: other works (e.g., (Ning et al., 2021)) have analyzed the complementarity of SIL and IM, but over problems with sparse rewards that are not as complex as the ones considered in this chapter.
2 http://proceedings.mlr.press/v80/oh18b/oh18b-supp.pdf
When it comes to RAPID, it is capable of solving MultiRoom environments, but struggles over KS4R3 and O2Dlh (as expected). These latter environments are assumed to have larger state spaces and an increasing difficulty from the perspective of exploration. On top of the self-IL approaches, BeBold fosters the exploration and, consequently, renders some actionable learning when using SIL (pink curve). However, results are worse than those obtained when using BeBold in isolation (light blue). This suggests that the SIL prioritization mechanisms are not working properly. Contrarily, results are outstanding when combined with RAPID (light green curve), drastically reducing the number of samples needed to achieve the same performance level, and attaining a better overall learning when compared to using RAPID in its naive version (blue plot). Besides these improvements, it is interesting to notice that the benefits of using IM remain even when the latter is not enough to learn in isolation: BeBold does not capture any knowledge over MN12S10 and O2Dlh, but it augments the capabilities of RAPID when used in those scenarios.
5.4.2 Evaluation of RAPID with Various IM Strategies
A key aspect to study empirically is the capacity of IM to enhance the agent's exploration while learning. Therefore, it is of utmost importance to assess the sensitivity of the proposed self-IL+IM combination with respect to the selection of the IM approach. With that in mind, and considering that the current implementation is based on BeBold's tabular version (see Section 5.2.2), we now evaluate the agent's performance with two other visitation count strategies: counts (i.e. $r^i_t = 1/\sqrt{N(s_{t+1})}$) and counts1st, which is the same as counts but with episodic restriction. This second set of experiments allows comparing very similar IM strategies that have proven to yield different results due to their intrinsic reward generation scheme (Andres et al., 2022; T. Zhang et al., 2020).
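The two count-based bonuses can be sketched as follows. The counts bonus is the formula given above; for counts1st it is assumed here that the "episodic restriction" means the bonus is only paid the first time a state is visited within the current episode (the class name `CountBonus` and the toy trace are hypothetical, and states are assumed to be hashable).

```python
import math
from collections import defaultdict


class CountBonus:
    """Tabular visitation-count bonus, r = 1 / sqrt(N(s_{t+1}))."""

    def __init__(self, first_visit_only=False):
        self.global_counts = defaultdict(int)     # N(s), kept across episodes
        self.first_visit_only = first_visit_only  # assumed meaning of "counts1st"
        self.seen_this_episode = set()

    def start_episode(self):
        self.seen_this_episode.clear()

    def reward(self, next_state):
        self.global_counts[next_state] += 1
        bonus = 1.0 / math.sqrt(self.global_counts[next_state])
        if self.first_visit_only:
            if next_state in self.seen_this_episode:
                return 0.0  # repeated visit in the same episode earns nothing
            self.seen_this_episode.add(next_state)
        return bonus


counts = CountBonus()
counts1st = CountBonus(first_visit_only=True)
counts.start_episode()
counts1st.start_episode()
trace = ["a", "b", "a"]  # the agent returns to state "a" within one episode
r_counts = [counts.reward(s) for s in trace]
r_counts1st = [counts1st.reward(s) for s in trace]
print(r_counts, r_counts1st)
```

Under this assumption, plain counts still rewards the agent for going back and forth between known states, whereas the episodic restriction removes that incentive.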
The results provided in Figure 5.5 suggest that there is a high relationship between what the agent can learn with IM (without self-IL) and what it actually does by combining them altogether. This can be regarded as a measure of the effectiveness of IM methods when implemented in isolation, where their base functionality of exploring is not wide-spread with the self-IL counterpart. At this point, by just inspecting the results reported in (Andres et al., 2022; T. Zhang et al., 2020), it is clear that counts is the worst method, followed by counts1st and BeBold: counts < counts1st < BeBold. Differences between counts1st and BeBold are unclear: most of the contribution seems to be related to the episodic restriction part. However, going beyond the boundaries of already explored regions seems to be promising as well, as it yields better results when compared to RND with episodic restriction (T. Zhang et al., 2020).
[Figure 5.4 panels: MN7S8, MN12S10, KS4R3, O2Dlh; x-axis: Timesteps (×10^7); y-axis: Avg Extrinsic Return; curves: BeBold, RAPID, RAPID+BeBold, SIL, SIL+BeBold.]
Figure 5.4: Results over multiple procedurally generated hard exploration environments in MiniGrid. Both RAPID and SIL always achieve better results when combined with BeBold.
The same comparative performance between IM methods holds when combining them with the ranking replay strategy, where RAPID+counts (red curve) performs slightly better than or equal to RAPID in isolation (blue plot), yet being the worst out of the three IM options. Moreover, the choice of one IM strategy over another can actually deteriorate the performance of the agent, as observed in KS4R3. In this particular case, the aforementioned RAPID+counts (red curve) is worse than using RAPID without IM (blue curve). Nevertheless, when selecting demonstrably good IM strategies, the agent combining self-IL+IM (both RAPID+counts1st, yellow curve, and RAPID+BeBold, light green curve) improves its performance even when it was not able to do it with just the IM strategy.
[Figure 5.5 panels: MN7S8, MN12S10, KS4R3, O2Dlh; x-axis: Timesteps (×10^7); y-axis: Avg Extrinsic Return; curves: RAPID, RAPID+BeBold, RAPID+Counts, RAPID+Counts+1st.]
Figure 5.5: Performance comparison of RAPID when combined with different IM methods, namely, counts, counts1st and BeBold.
5.4.3 Exploration-exploitation Parameters Evolution in self-IL+IM
By introducing IM into the on-policy loss, the agent has to deal with multiple objectives (exploration-exploitation) in various stages: 1) on-policy, by balancing the extrinsic and intrinsic rewards; and 2) off-policy, by keeping in the buffer the most promising experiences parameterized by the extrinsic, local and global scores.
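The on-policy side of this balance can be illustrated by comparing the discounted extrinsic and intrinsic returns of an episode. The sketch below assumes the usual discounted sum for the return and that the intrinsic stream is scaled by β before discounting; both the helper names and the reward sequences are hypothetical.

```python
def discounted_return(rewards, gamma=0.99):
    """G = sum_t gamma^t * r_t, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


BETA = 0.005  # intrinsic coefficient reported for BeBold in this chapter


def episode_returns(extrinsic, intrinsic, beta=BETA, gamma=0.99):
    """Return (G_e, G_i); scaling the intrinsic stream by beta is an assumption."""
    g_ext = discounted_return(extrinsic, gamma)
    g_int = discounted_return([beta * r for r in intrinsic], gamma)
    return g_ext, g_int


# Early in training: the goal is never reached, only novelty is rewarded.
early = episode_returns([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.7, 0.7])
# Later: the goal is reached and most states are no longer novel.
late = episode_returns([0.0, 0.0, 0.0, 0.9], [0.3, 0.2, 0.2, 0.1])
print(early, late)
```

In the first episode the intrinsic return is the only learning signal; in the second the extrinsic return dominates, even though the intrinsic coefficient is unchanged.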
In this regard, Figure 5.6 depicts the evolution of some representative values concerning how the exploration is carried out during an experiment. Initially G_i > G_e (i.e., the episodic discounted intrinsic and extrinsic returns calculated as described in Expression 2.3), which evinces that the agent's learning process is guided by IM in the absence of extrinsic signals from the environment. Eventually, extrinsic feedback is obtained and gains more importance for the agent's ability to complete the task. Similarly, the impact of the extrinsic score in Expression (5.1), w_0·S_ext, which promotes the exploitation of highly extrinsic rewarded episodes, quickly increases, so that those potential trajectories are more often replayed. However, the selection criterion is also subject to the local score, w_1·S_local, which aims to maximize the diversity of observations inside the episode, and which also increases until reaching its maximum value of 0.1 (which is subject to max(S_local)=1 and w_1=0.1). To a lower extent, the global score (w_2·S_global) also plays its role in the selection criterion, which can be helpful during the initial learning stages, when there are no successful episodes that complete the task, and also to break ties when two episodes require the same amount of steps for the completion of the task. However, its relative importance is lower in comparison to the other scores due to the selected value of the w_2 parameter (0.001; see footnote 3).
[Figure 5.6 panels: one column per scenario (MN7S8, MN12S10, KS4R3, O2Dlh); rows: Avg Ext Return, Gext VS Gin, Impact w0/w1/w2, Number of updates; x-axis: Frames/steps (1e6); legend: Gext, Gin, w0, w1, w2, on-policy, off-policy.]
Figure 5.6: Summary of the evolution of different critical values that impact the learning for a given seed in all the scenarios, using RAPID+BeBold. Plots in the first row denote the average extrinsic reward. Plots in the second row depict the difference between the discounted extrinsic (G_ext ≡ G_e) and intrinsic (G_int ≡ G_i) returns used in the on-policy update (RL-loss). Figures in the third row show the influence of each component/score of the ranking buffer (w_0, w_1 and w_2) when sampling from its collected experiences. Finally, plots in the last row indicate the average number of off-policy updates per 10 on-policy updates (ratio of updates, ξ). All depicted data correspond to the average value in the given time slots.
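The relative weight of the three scores can be checked with a small numeric sketch. Expression (5.1) is not reproduced in this section, so the weighted sum below is an assumption based on the three terms named in the text; the function names, the per-episode capacity and the score values are hypothetical.

```python
W0, W1, W2 = 1.0, 0.1, 0.001  # weights reported in this chapter


def episode_score(s_ext, s_local, s_global, w0=W0, w1=W1, w2=W2):
    """Assumed ranking score: w0*S_ext + w1*S_local + w2*S_global."""
    return w0 * s_ext + w1 * s_local + w2 * s_global


def keep_top(episodes, capacity):
    """Keep the highest-ranked episodes (simplified: capacity counted in episodes)."""
    ranked = sorted(episodes, key=lambda ep: episode_score(*ep["scores"]), reverse=True)
    return ranked[:capacity]


episodes = [
    {"name": "fails, little diversity", "scores": (0.0, 0.2, 0.5)},
    {"name": "fails, diverse", "scores": (0.0, 0.9, 0.5)},
    {"name": "solves task", "scores": (0.8, 0.6, 0.1)},
]
kept = keep_top(episodes, capacity=2)
print([ep["name"] for ep in kept])
```

With these weights any episode with a non-trivial extrinsic score outranks all failed ones, the local score orders the failed episodes, and the global score only matters when the other two are practically tied.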
Frequency of Updates
We now proceed by exposing how the ratio ξ between the number of on-policy and off-policy updates changes over the course of training. In what follows ξ is represented as an on-policy:off-policy ratio: a ξ value of 1:2 will thus imply that the off-policy updates are executed 2 times more frequently than the on-policy ones.
3 Recall that these weight values (w_0, w_1, w_2) were selected according to the results reported in (Zha, Ma, et al., 2021).
As was explained in Section 2.1.2, an episode can be longer or shorter than a trajectory. On-policy optimization steps are executed once a trajectory (see footnote 4) has been finished, and its length remains fixed during the whole training. By contrast, off-policy updates are applied once an episode finishes, whose length varies depending on the maximum steps per episode configured for each environment, and also on the optimality of the agent's policy at that moment. The decision to execute off-policy updates at the end of the episode was taken from the original paper where RAPID was proposed (Zha, Ma, et al., 2021).
Such ratio ξ can change from 1:1 to 1:3 in MultiRoom environments, and more dramatically in other scenarios like KS4R3, which initially implies a ratio of 4:1 and can evolve up to a 4:13 relation. In words, the off-policy loss can undergo a modification in its schedule that makes it update at more than 10× its initial frequency (Table 5.1). Such a balance has a critical importance in the agent's learning process, as the agent would turn to optimizing what is stored in the buffer rather than what it is actually experiencing (or vice versa). This generates in turn a big difference between both methods. In fact, in IL this ratio is usually balanced by either using a weight when combining both losses or by carefully tailoring the update frequency (Hester et al., 2017; Sovrano, 2019).
Table 5.1: On-policy versus off-policy ratios that can be achieved in each scenario when the supervised loss is backpropagated when the episode finishes. Each scenario has a different maximum number of steps (row 2) and also a different expected number of optimal steps (row 3) (we include an estimation of the optimal steps as it differs from seed to seed). We show the expected initial ratios (ξ) when the agent cannot solve the task (rows 4 & 6) and when it accomplishes the task via an estimated optimal policy (rows 5 & 7). We also report those values when the rollout size is T=128 (rows 4-5) and T=2048 (rows 6-7).

                           MN7S8   MN12S10   KS4R3   O2Dlh
Max steps per episode      140     240       480     576
Expected optimum steps     50      105       37      32
T=128    Initial           1:1     2:1       4:1     5:1
         Final             1:3     2:2       4:13    5:18
T=2048   Initial           1:14    2:17      4:17    5:18
         Final             1:40    2:40      4:216   5:320
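The entries of Table 5.1 can be approximated with simple arithmetic: k on-policy updates consume k·T environment steps, during which roughly k·T/L episodes of length L finish, each one triggering an off-policy update. The sketch below is only an approximation under that reading (the function name is hypothetical, and each finished episode is counted as one off-policy update event); since the optimal episode lengths are themselves estimates, some table entries differ slightly from the computed values.

```python
def off_policy_events(on_policy_updates, rollout_size, episode_length):
    """Episodes finished (off-policy update events) while k on-policy updates occur."""
    return on_policy_updates * rollout_size / episode_length


# (k, max steps per episode, expected optimal steps) taken from Table 5.1
scenarios = {
    "MN7S8": (1, 140, 50),
    "MN12S10": (2, 240, 105),
    "KS4R3": (4, 480, 37),
    "O2Dlh": (5, 576, 32),
}

for name, (k, max_steps, optimal_steps) in scenarios.items():
    for T in (128, 2048):
        initial = off_policy_events(k, T, max_steps)    # agent cannot solve the task yet
        final = off_policy_events(k, T, optimal_steps)  # agent follows a near-optimal policy
        print(f"{name} T={T}: initial {k}:{initial:.1f}, final {k}:{final:.1f}")
```

For instance, KS4R3 with T=128 gives roughly 4:1 initially and about 4:14 with a near-optimal policy, in line with the 4:1 and 4:13 reported in the table.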
5.4.4 Scheduling self-IL Updates
To shed further light on the importance of the aforementioned ratio ξ, we now fix the frequency of the off-policy loss to be constant and tied directly to the on-policy updates. We then analyze how the performance varies under several values of this ratio.
4 Here we refer as a trajectory to the experiences collected on-policy with a fixed amount of interactions, whereas an episode's length might vary depending on the environment and the learned policy.
Figure 5.7 summarizes the results obtained for this study. In the family of MultiRoom scenarios, the agent is very sensitive to a reduction of the frequency of the off-policy updates, which can eventually make the agent fail when increasing their complexity (e.g. 10:1 in MN12S10). Contrarily, in KS4R3 the originally adopted schema (blue curve) with a ratio of 4:1 performs much better than a more frequent update (green plot) of the off-policy part (1:1). This fact is also observed when using a more conservative ratio of 10:1 (red result), suggesting that, although a higher off-policy update frequency can be beneficial at initial stages to bootstrap the learning process in hard exploration tasks, it can eventually degrade the learned knowledge in the long term. These conclusions can also be inferred when using BeBold, but with a better sample-efficiency and optimal solutions. Similar conclusions hold when analyzing O2Dlh.
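A fixed on-policy:off-policy ratio such as 1:1 or 10:1 can be implemented with a small counter that is consulted after every on-policy update, instead of triggering the off-policy update at episode termination. The following is a minimal sketch of that idea, not the code used in the experiments; the class and function names are hypothetical.

```python
class FixedRatioScheduler:
    """Triggers `off` off-policy updates after every `on` on-policy updates."""

    def __init__(self, on, off):
        self.on = on
        self.off = off
        self.on_since_last = 0

    def after_on_policy_update(self):
        """Return how many off-policy updates to run now (0 if none are due)."""
        self.on_since_last += 1
        if self.on_since_last == self.on:
            self.on_since_last = 0
            return self.off
        return 0


def count_updates(scheduler, n_on_policy):
    """Total off-policy updates issued over n on-policy updates."""
    return sum(scheduler.after_on_policy_update() for _ in range(n_on_policy))


print(count_updates(FixedRatioScheduler(1, 1), 100))   # ratio 1:1
print(count_updates(FixedRatioScheduler(10, 1), 100))  # ratio 10:1
```

Unlike the default episode-based trigger, this schedule does not change as the policy improves and episodes get shorter.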
[Figure 5.7 panels: MN7S8, MN12S10, KS4R3, O2Dlh; x-axis: Timesteps (×10^7); y-axis: Avg Extrinsic Return; curves: RAPID, RAPID+BeBold, RAPID 1:1 ratio, RAPID+BeBold 1:1 ratio, RAPID 10:1 ratio, RAPID+BeBold 10:1 ratio.]
Figure 5.7: Results over multiple procedurally generated MiniGrid hard exploration environments using different ratios ξ between on-policy (PPO) and off-policy (RAPID) updates. The default RAPID approach has a dynamic update ratio, by which it executes an optimization step every time an episode finishes (see Table 5.1).
5.4.5 Addressing Inter-episode Variance
So far, the selected value of the ratio ξ seems to be decisive for the success and sample efficiency of the training process. However, the obtained outcomes are very noisy and barely close to optimal results.
We hypothesize that this can be due to one of the two losses being unstable. While the seminal work presenting RAPID used PPO with a rollout size (see footnote 5) of T=128, other similar works considering the same environment use a larger time horizon equal to T=2048, with better and more stable results (Andres et al., 2022; Flet-Berliac et al., 2021). In PCG environments each level is configured differently depending on the selected seed. Consequently, by training the agent with fewer episodes in a single update, it might get biased to learn specific features present in that subset of episodes, rather than getting the required high-level skills to solve the desired task in the whole possible episode/level distribution. Hence, the increase of the rollout size implies that the agent will be trained, in the on-policy update, with a larger set of episodes (see Table 5.1 to check episode lengths). This forces the algorithm to extract generalizable knowledge from this wide set of slightly different environments, avoiding learning by heart. Furthermore, this also reduces the variance of the on-policy updates through the ANN, as the minibatch size will be larger. However, the agent will perform fewer optimization steps during the training process for the same amount of steps/frames. On this basis, the following question arises:
How does the use of a larger rollout size impact the on-policy update regarding the performance and the stabilization of the learned knowledge?
The answer can be found by analyzing Figure 5.8. The on-policy update is substantially improved, as can be told from the performance of BeBold (light blue) without being corrupted by off-policy updates. Indeed, this IM approach is able to solve all the environments with the expected optimal steps, obtaining the best result in both KS4R3 and O2Dlh. On the contrary, RAPID (blue) performs worse, and its contribution when combined with BeBold (light green) is also not as good as observed in the previous analysis. The reason for these bad results also connects to what we have previously highlighted: the ratio ξ.
By increasing the rollout size (T) and by making the off-policy updates subject to the episode completion, the relevance of the off-policy loss in the agent's learning process grows up to be 14×, 8×, 4× and 4× more frequent than the on-policy counterpart in MN7S8, MN12S10, KS4R3 and O2Dlh, respectively, just at the start of the training process (Table 5.1). As we have already observed in Figure 5.7, these ratios do not necessarily guarantee a better learning process. Thus, when adjusting the ratio again with the new rollout size, the performance of both RAPID and RAPID+BeBold drastically changes, as reported in Figure 5.9. A better sample-efficiency can be noted when using a more conservative ratio (1:1, green and pink curves) in both KS4R3 and O2Dlh with respect to the default episode termination setting (blue and light green results). This also occurs when decreasing the off-policy updates down to a 10:1 ratio (red and yellow curves). In this case, the convergence speed can be affected, although it manages to achieve the optimal policy in fewer steps (the 1:1 ratio struggles more to finally achieve it). In contrast, when applying those updates at the end of the episode, which corresponds to approximately a 1:4 ratio initially in KS4R3 and O2Dlh (Table 5.1), results get worse, just surpassed by the BeBold approach. Concerning MultiRoom environments, increasing the number of off-policy updates seems to be a good strategy, which is difficult to outperform by any state-of-the-art solution. In fact, decreasing the frequency of the replayed experiences has a negative impact that can make the agent not learn in the absence of intrinsic rewards.
[Figure 5.8 panels: MN7S8, MN12S10, KS4R3, O2Dlh; x-axis: Timesteps (×10^7); y-axis: Avg Extrinsic Return; curves: BeBold (T=2048), RAPID (T=2048), RAPID+BeBold (T=2048).]
Figure 5.8: Results on multiple hard exploration procedurally-generated environments in MiniGrid when increasing the time horizon up to 2048 in on-policy (RL-loss) updates. Off-policy (supervised/imitation) updates remain with a fixed batch size of 256.
5 The rollout size is directly related to the number of minibatches and their size. The increase of the first implies that the minibatch size is also augmented (for the same number of minibatches). For instance, using T=1024 and 4 minibatches means having 256-sized minibatches, whereas with T=128 and using the same number of minibatches this size decreases to 32 units.
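The trade-off between rollout size, minibatch size and number of optimization steps can be made explicit with a few lines of arithmetic. The sketch below assumes a single environment, 4 epochs per update (as stated for the PPO configuration) and 4 minibatches (the number used in the rollout-size footnote example); the function name and the frame budget are hypothetical.

```python
def ppo_update_budget(total_steps, rollout_size, epochs=4, minibatches=4):
    """Minibatch size and number of gradient steps for a given frame budget."""
    rollouts = total_steps // rollout_size  # number of on-policy updates
    return {
        "minibatch_size": rollout_size // minibatches,
        "gradient_steps": rollouts * epochs * minibatches,
    }


small = ppo_update_budget(2 * 10**7, rollout_size=128)
large = ppo_update_budget(2 * 10**7, rollout_size=2048)
print(small)
print(large)
```

For the same number of frames, the larger rollout yields 16 times larger minibatches but roughly 16 times fewer gradient steps, which is the trade-off the discussion above refers to.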
The above discussed behaviors strengthen the claim posed in this chapter:
Chapter 6. Concluding Remarks
systematically evaluated by means of an ablation study. The conclusions drawn from this work can be summarised as follows:
• A centralized critic has greater stability and also leads to faster convergence to the optimal policy. As long as the critic is centralized, centralizing also the curiosity module brings advantages that are most noticeable when considering the action to generate the exploration bonus.
• The use of IM converts the problem into a bi-objective function in which the explorative side may induce noise into the attainment of the main task objective, ultimately slowing down the learning.
One way to address these issues might be through decoupling the exploration and the exploitation behaviours by two different agents (Schäfer et al., 2022) or transforming the problem into a multi-objective approach (Hayes et al., 2021). An interesting avenue would also be to reformulate our heterogeneous agent proposal into off-policy strategies (e.g., DQN) where the agents could share their replay buffers and benefit directly from episodes representing how others undertook the same task from different perspectives (Christianos et al., 2020). Additionally, tailoring techniques to leverage expert demonstrations so as to cope with the heterogeneity of the action spaces would be interesting to analyze (e.g., using IL techniques that only rely on observations and do not strictly depend on the actions (Torabi et al., 2018)).
• Chapter 4. Analysing fairly the contribution to performance of the state-of-the-art IM algorithms. IM techniques have been shown to be effective for promoting the exploration in RL. Nevertheless, it is not always clear if the proposals are superior due to the presence of novel reward-related procedures or to peripheral or additional design choices. On this ground, we conducted a study to try to detach both components and the conclusions were as follows:
• Using an adaptive intrinsic coefficient β based on the return of previous rollouts outperforms strategies relying on a fixed parameter.
• The inclusion of episode-level information (e.g., episodic visitation counts) in the generation of intrinsic rewards is beneficial in comparison with disregarding it.
• Adopting different neural network architectures is critical to guarantee the success. Indeed, when reducing the number of parameters of the IM modules the performance deteriorates, and it gets even worse if the actor-critic parameters are also decreased.
In future extensions, the study of more environments (e.g., Procgen, with high-dimensional observations (Cobbe, Hesse, et al., 2020)) and more IM algorithms to efficiently solve hard exploration environments would be of great interest.
• Chapter 5. How to collect good trajectories to improve the performance of self-IL algorithms. Attracted by the idea of replaying not only good trajectories in terms of performance but also novel trajectories, we proposed the use of IM to promote exploration and discover episodes with interesting properties for the agent's learning. We evinced that:
• As long as the selected IM approach and its fitting are appropriate, the benefits are clear.
• The method is sensitive to the diversity of the replayed trajectories and the rollout size, i.e. when to execute the updates of the agent's policy. These are decisive to make the agent generalize well to the whole level distribution of the task.
We firmly believe that the results can be improved even more if the diversity of the trajectories is guaranteed; that is, if the demonstrations are not biased and represent the whole level distribution. In addition, more effective ways to manage the scheduling of losses (or even the combination of them in a single loss function (Rajeswaran et al., 2018)) should be studied as well.
6.1 List of Publications
As a result of the research conducted during the development of this PhD Thesis, several contributions were published in conferences and journals related to the areas of reinforcement learning and neural networks:
• Journal publications:
– Alain Andres, Esther Villar-Rodriguez and Javier Del Ser, “Collaborative training of heterogeneous reinforcement learning agents in environments with sparse rewards: what and when to share?” Neural Computing & Applications, published on-line, 2022. https://doi.org/10.1007/s00521-022-07774-5 (IF: 5.102, Q2, 45/145 ARTIFICIAL INTELLIGENCE).
• Conference contributions:
– Alain Andres, Esther Villar-Rodriguez, Aritz D. Martinez and Javier Del Ser, “Collaborative Exploration and Reinforcement Learning between Heterogeneously Skilled Agents in Environments with Sparse Rewards,” 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, pp. 1-10, 2021. https://doi.org/10.1109/IJCNN52387.2021.9534146.
– Alain Andres, Esther Villar-Rodriguez and Javier Del Ser, “An Evaluation Study of Intrinsic Motivation Techniques Applied to Reinforcement Learning over Hard Exploration Environments,” in: A. Holzinger, P. Kieseberg, A. M. Tjoa, E. Weippl (eds). Machine Learning and Knowledge Extraction (CD-MAKE 2022), Lecture Notes in Computer Science, vol 13480, Springer, 2022. https://doi.org/10.1007/978-3-031-14463-9_13
– Alain Andres, Esther Villar-Rodriguez and Javier Del Ser, “Towards Improving Exploration in Self-Imitation Learning using Intrinsic Motivation,” IEEE Symposium Series on Computational Intelligence (SSCI), Singapore, pp. 890-899, 2022. https://doi.org/10.1109/SSCI51031.2022.10022199
– Alain Andres, Lukas Schäfer, Esther Villar-Rodriguez, Stefano V. Albrecht and Javier Del Ser, “Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments,” Adaptive and Learning Agents (ALA) Workshop at the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), accepted, London, UK, 2023.
6.2 Future Research Lines
This Thesis concludes by outlining future research lines that have been identified as interesting directions during the PhD Thesis:
As we have highlighted throughout this document, sample-efficiency is crucial in RL because, although simulators provide an unlimited number of interactions with a good throughput rate, real-world systems are actually slow, fragile and expensive to operate, preventing the adoption of RL solutions. This translates into a high cost in terms of agent-environment interactions.
One way to overcome it is using offline data to speed up the learning. Imitation Learning approaches have shown an incredible potential as long as demonstrations are available, although their success is usually highly dependent on the quality, quantity and also the diversity of the trajectories. Indeed, we analyzed this issue in PCG environments in a paper that is currently under review ("Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments"), where IL could overfit the model towards the provided examples. As explained in Section 2.3.2, the most broadly used IL technique is BC due to its simplicity and good results. However, better results can be expected when using more advanced techniques such as adversarial IL (Ho & Ermon, 2016; Orsini et al., 2021), curriculum strategies that prioritize some demonstrations over others (Bajaj et al., 2022) and even approaches that take into account temporal dependencies (Paine et al., 2019). Akin to Imitation Learning, Offline RL focuses on how to learn in the absence of online interactions. This subfield of RL has shown promising results when having data that do not resemble a demonstration but random data, or when being trained with suboptimal and noisy data (Kumar et al., 2022). However, this kind of algorithm exhibits challenges regarding the distribution shift between the offline data and the actual problem distribution, which is the reason why some approaches constrain the policy to not deviate too far from the behavior policy (Kostrikov et al., 2021; Kumar et al., 2020), whereas others focus on prioritizing the usage of experiences to maximize the data coverage or the discovery of skills (H. Liu & Abbeel, 2021a, 2021b), ultimately learning a good representation and a versatile policy (Yang & Nachum, 2021). In view of the necessities and potential of these techniques, using offline data envisages an exciting path.
Another fascinating branch is the one related to Representation Learning and few-shot learning, which are closely related when generalization is pursued. The ability to understand and automatically discover the key features that govern a task is indeed a game-changer, as it provides the policy with the capacity to quickly adapt when changes in the environment are made (e.g., goal modification, state domain variation), minimizing the total number of online interactions with the environment within the RL domain (X. Chen et al., 2021). Nevertheless, how to learn a valid representation is not trivial, sometimes requiring different representations for the actor's policy and the critic (Cobbe, Hilton, et al., 2020; Raileanu & Fergus, 2021). In fact, value-based methods might have some issues when it comes to generalization capabilities (Ehrenberg et al., 2022; Lyle et al., 2022), which can explain why the large majority of off-policy solutions (that tend to be more sample-efficient than their on-policy counterparts) struggle in PCG environments (Ehrenberg et al., 2022; Mohanty et al., 2021).
Last but not least, we feature world models (Ha & Schmidhuber, 2018; Wu et al., 2022) and unsupervised environment design (Dennis et al., 2020; Parker-Holder et al., 2022) as a proxy to avoid the large costs of real-world environment interactions by virtue of using techniques (e.g. generative models) to generate new instances of the problem without the necessity of explicitly having access to the environment itself.
Appendix A
Random Network Distillation - Limitations
One of the critical aspects when using any prediction error method is how the scale of rewards can vary, not only between environments, but also at different points in time in the same scene, making the selection of hyper-parameters difficult. Additionally, if such an IM approach uses DL, the normalization of inputs is important for an appropriate prediction. Moreover, the latter is crucial when using RND, as the target network's parameters are frozen and hence cannot adjust to the scale of the upcoming observations.
According to the recommendations (Burda, Edwards, Storkey, et al., 2018), we normalized the observations as in Expression (3.7). Unexpectedly, we found out that the reward scale was biased towards the features of each room in the ViZDooM environment. In order to account for that issue, we proceed as follows:
• First, we select observations gathered by the agent at different points of the Setup 3 shown in Figure 3.10, which results in the visualizations shown in Figure A.1.
• Afterwards, we train the predictor network, $\hat{\phi}(\cdot)$, during 100 consecutive randomly sampled episodes, and we store the parameters of both the frozen network, $\phi(\cdot)$, and the trained predictor network.
• Finally, we evaluate what the obtained intrinsic reward would have been at the selected checkpoints after each episode's updates.
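The procedure above relies on the basic RND mechanism: a frozen random target network, a predictor trained to match it, and an intrinsic reward given by the prediction error. The toy sketch below illustrates that mechanism only; it replaces the ANNs of the thesis with small linear maps, skips the observation normalization, and uses hypothetical names and hand-made 4-dimensional "observations".

```python
import random


def make_linear(n_in, n_out, rng):
    """Random linear map standing in for a network (illustrative simplification)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def forward(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def rnd_reward(target, predictor, x):
    """Intrinsic reward: squared error between predictor and frozen target outputs."""
    t, p = forward(target, x), forward(predictor, x)
    return sum((pi - ti) ** 2 for pi, ti in zip(p, t))


def train_step(target, predictor, x, lr=0.05):
    """One gradient step on the predictor only; the target is never modified."""
    t, p = forward(target, x), forward(predictor, x)
    for row, pi, ti in zip(predictor, p, t):
        grad_scale = 2.0 * (pi - ti)
        for j, xj in enumerate(x):
            row[j] -= lr * grad_scale * xj


rng = random.Random(0)
target = make_linear(4, 3, rng)     # frozen network
predictor = make_linear(4, 3, rng)  # trained network
frequent = [1.0, 0.0, 0.5, 0.0]     # observation seen at every update
rare = [0.0, 1.0, 0.0, 0.5]         # observation never seen during training

before = rnd_reward(target, predictor, frequent)
rare_before = rnd_reward(target, predictor, rare)
for _ in range(100):
    train_step(target, predictor, frequent)
after = rnd_reward(target, predictor, frequent)
rare_after = rnd_reward(target, predictor, rare)
print(before, after, rare_before, rare_after)
```

In this idealized setting the bonus of the frequently seen observation vanishes while the unseen one keeps its bonus, which is the behavior one would hope to observe at the far checkpoints; the appendix reports that the ViZDooM experiments did not follow this pattern.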
The evolution of the intrinsic rewards considering different changes is shown in Figure A.2. Overall, it can be seen that there is a trend in all the checkpoints to decrease the intrinsic reward over time. However, it is not consistent with the novelty we are pursuing, as the points that might have rarely been visited (the ones that are far from the start position and are very difficult to experience without knowledge, e.g., 49 and 50) have a lower bonus with respect to others that are close to the spawn location and that are more often observed (e.g., 0 or 14). In fact, the largest values are always given to observations at rooms 22 and 24. We also examined whether the issue was related to how the input was processed, by either providing higher dimensions and using RGB images instead of the default grayscale configuration (Figure A.2, middle), or to the adopted ANN architecture (Figure A.2, bottom). Nevertheless, there were no significant changes except in the amplitude of the novelty signal.
Therefore, we conclude that RND presents an unforeseen difficulty in capturing the actual curiosity, which should be taken into account when it is used in ViZDooM.
[Figure A.1 panels:
(0) Initial spawn position, looking forward
(14) At room 13, looking forward
(22) At corridor 16, oriented to the door
(23) At room 17, in front of the door
(40) At room 22, oriented to the wall
(41) At room 22, partially oriented to the next corridor
(46) At room 24, oriented to the wall
(47) At room 24, partially oriented to the next corridor
(49) At corridor 25, oriented to goal/vest
(50) At room 26, oriented to the goal/vest]
Figure A.1: Observations (grayscale, 120x160) at 10 different checkpoints of VizDoom's My way home environment.
[Figure A.2 panels: top row: default with 42x42 grayscale images and 512 output neurons ANN; middle row (Input Processing): 42x42 Color, 160x120 grayscale; bottom row (ANN modification): 100 output neurons, 10 output neurons; x-axis: episode (0 to 100); one curve per checkpoint (0, 14, 22, 23, 40, 41, 46, 47, 49, 50).]
Figure A.2: Intrinsic rewards evolution throughout 100 randomly sampled episodes at different checkpoints explained in Figure A.1. Cold colors represent locations that are close to the spawn position and farther from the goal/vest. At the top row the default performance with 42x42 grayscale images and the adopted ANN architecture is shown; the middle row shows the impact when varying the input image by either using 42x42 colored images (left) or 160x120 images; the bottom row results illustrate how changes in the ANN architecture affect the rewards when using 100 output neurons (left) or just 10 output neurons (right).
Bibliography
Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse
reinforcement learning. 21st International Conference on Machine
Learning (ICML), 1.
Abolafia, D. A., Norouzi, M., Shen, J., Zhao, R., & Le, Q. V. (2018). Neural
Program Synthesis with Priority Queue Training [arXiv:1801.03526].
Anderson, C. W. (1986). Learning and Problem-Solving with Multilayer
Connectionist Systems (Adaptive Strategy Learning, Neural Networks,
Reinforcement Learning). Doctoral Dissertation, 1–260.
Andres, A., Villar-Rodriguez, E., & Del Ser, J. (2022). An Evaluation
Study of Intrinsic Motivation Techniques Applied to Reinforcement
Learning over Hard Exploration Environments. In A. Holzinger,
P. Kieseberg, A. M. Tjoa, & E. Weippl (Eds.), Machine Learning
and Knowledge Extraction (pp. 201–220). Springer International
Publishing.
Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier,
R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S.,
& Bachem, O. (2021a). What Matters In On-Policy Reinforcement
Learning? A Large-Scale Empirical Study [arXiv:2006.05990].
Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier,
R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S.,
& Bachem, O. (2021b). What Matters for On-Policy Deep Actor-Critic
Methods? A Large-Scale Study. 9th International Conference
on Learning Representations (ICLR).
Aubret, A., Matignon, L., & Hassas, S. (2019). A survey on intrinsic
motivation in reinforcement learning [arXiv:1908.06976].
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time Analysis of
the Multiarmed Bandit Problem. Machine Learning, 47(2), 235–256.
Aytar, Y., Pfaff, T., Budden, D., Paine, T., & Wang, Z. (2018). Playing
hard exploration games by watching YouTube. Advances in Neural
Information Processing Systems (NeurIPS), 12.
Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A.,
Guo, D., & Blundell, C. (2020). Agent57: Outperforming the Atari
Human Benchmark. 37th International Conference on Machine
Learning (ICML), 119.
Badia, A. P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B.,
Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A.,
& Blundell, C. (2020). Never Give Up: Learning Directed
Exploration Strategies. International Conference on Learning
Representations (ICLR).