Daniel Lukats / masterthesis
Commit 499ab6e9 authored Jun 04, 2020 by Daniel Lukats
feedback from jürgen + kathrin
parent fb01642e
Showing 19 changed files with 79 additions and 73 deletions
thesis/01_introduction/motivation.tex  +8 −8
thesis/02_rl_theory/index.tex  +1 −1
thesis/02_rl_theory/mdp.tex  +5 −5
thesis/02_rl_theory/policy_gradients.tex  +4 −4
thesis/03_ppo/dependency.pdf  +0 −0
thesis/03_ppo/gae.tex  +16 −8
thesis/03_ppo/index.tex  +1 −1
thesis/03_ppo/loss.tex  +2 −1
thesis/04_implementation/adv_normalization.tex  +10 −13
thesis/04_implementation/full_algorithm.tex  +7 −7
thesis/04_implementation/index.tex  +1 −1
thesis/04_implementation/initialization.tex  +1 −1
thesis/04_implementation/value_clipping.tex  +2 −2
thesis/05_evaluation/ale.tex  +1 −1
thesis/05_evaluation/discussion.tex  +5 −5
thesis/05_evaluation/experiments.tex  +5 −7
thesis/05_evaluation/method.tex  +9 −7
thesis/05_evaluation/unsmoothed.png  +0 −0
thesis/06_conclusion/conclusion.tex  +1 −1
thesis/01_introduction/motivation.tex
...
@@ -16,20 +16,20 @@ learning agents \cite{liu2019}. Yet another concept that may be combined with th
 \cite{pathak2017}. Curiosity is a mechanism that incentivizes a methodical search for an optimal solution rather than a
 random search.
-Despite its widespread use, researchers found that several implementation choices \todo{is this fine or not?} are
-undocumented in the original paper \cite{ppo}, but have great effect on the performance of Proximal Policy Optimization.
-Consequently, the authors raise doubts on our mathematical understanding of the foundations of the algorithm
-\cite{ilyas2018, engstrom2019}.
+Despite its widespread use, \citeA{ilyas2018} as well as \citeA{engstrom2019} found that several implementation choices
+are undocumented in the original paper, but have great effect on the performance of Proximal Policy Optimization.
+Consequently, the authors raise doubts on our mathematical understanding of the foundations of the algorithm.
 The goal of this thesis is twofold. On the one hand, it provides the required fundamentals of reinforcement learning to
 understand policy gradient methods, so students may be introduced to reinforcement learning. It then builds upon these
 fundamentals to explain Proximal Policy Optimization. In order to gain a thorough understanding of the algorithm, a
-dedicated implementation was written \todo{with the OpenAI Gym benchmark and PyTorch deep learning frameworks?} instead
-of relying on an open source project.\footnote{The code is available on TODO URL HERE.}
+dedicated implementation based on the benchmarking framework \emph{OpenAI Gym} and the deep learning framework
+\emph{PyTorch} was written instead of relying on an open source implementation of Proximal Policy
+Optimization.\footnote{The code is available on TODO URL HERE.}
 On the other hand, this thesis examines not only the impact of the aforementioned implementation choices, but also
 common hyperparameter choices on a selection of five ATARI 2600 games. These video games form a typical benchmarking
-environment for evaluating reinforcement learning algorithms. The significance of the implementation choices was already
+environment for evaluating reinforcement learning algorithms. The importance of the implementation choices was already
 researched on robotics tasks \cite{ilyas2018, engstrom2019}, but the authors forewent an examination on ATARI games.
-Therefore, one may raise the question, if these choices have the same significance for ATARI 2600 games as they do for
+Therefore, one may raise the question, if these choices have the same importance for ATARI 2600 games as they do for
 robotics tasks.
thesis/02_rl_theory/index.tex
-\section{Fundamentals of Reinforcement Learning}
+\section{Reinforcement Learning}
 \label{sec:02:basics}
 \input{02_rl_theory/introduction}
...
thesis/02_rl_theory/mdp.tex
...
@@ -32,11 +32,11 @@ contains actions \emph{left} and \emph{right} as well as rewards $0$ and $1$. Th
 transition probabilities and can be found on the edges connecting the states. Further explanations and an example using
 elements of this Markov decision process are given in chapter \ref{sec:02:distributions}.
-Markov decision processes share most properties with Markov chains \todo{some source here}, the major difference being
-the transitions between states. Whereas in Markov chains the transition depends on the current state only, in Markov
-decision processes the transition function $p$ takes actions into account as well. Thus, the transition can be affected,
-although it remains stochastic in nature. We assume that the Markov property holds true for Markov decision processes,
-which means that these transitions depend on the current state and action only \cite[p.~49]{sutton18}.
+Markov decision processes extend Markov chains by introducing actions and rewards. Whereas in Markov chains a transition
+to a successor state depends on the current state only, in Markov decision processes the transition function $p$ takes
+actions into account as well. Thus, the transition can be affected, although it remains stochastic in nature. We assume
+that the Markov property holds true for Markov decision processes, which means that these transitions depend on the
+current state and action only \cite[p.~49]{sutton18}.
 \subsubsection{States, Actions and Rewards}
 \label{sec:02:mdp_vars}
...
thesis/02_rl_theory/policy_gradients.tex
...
@@ -23,10 +23,10 @@ optimize the parameterization $\boldsymbol\theta$ so that the likelihood of choo
 increased.
 Let $S_t = s$ and $\mathcal{A}(s) = \{1, 2\}$. A function $h$ parameterized with $\boldsymbol\theta_t$ could return a
-vector with two elements, for example $h(s, \boldsymbol\theta_t) = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$.\footnote{We
-    write $\left(h(s, \boldsymbol\theta_t)\right)_a$ to obtain the $a$th component of $h(s, \boldsymbol\theta_t)$,
-    e.g., $\begin{pmatrix} 4 \\ 2 \end{pmatrix}_1 = 4$.} The elements of this vector are numerical weights a policy
-could use to determine probabilities:
+vector with two elements, for example $h(s, \boldsymbol\theta_t) = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$.
+The elements of this vector are numerical weights a policy could use to determine probabilities:\footnote{We write
+    $\left(h(s, \boldsymbol\theta_t)\right)_n$ to obtain the $n$th component of $h(s, \boldsymbol\theta_t)$, e.g.,
+    $\begin{pmatrix} 4 \\ 2 \end{pmatrix}_1 = 4$.}
 \begin{align*}
     \pi(a \mid s, \boldsymbol\theta_t) &= \left(h(s, \boldsymbol\theta_t)\right)_a \cdot \left(\sum_{b \in \mathcal{A}(s)} \left(h(s, \boldsymbol\theta_t)\right)_b\right)^{-1} \text{, e.g.,} \\
...
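The weight-to-probability mapping shown above can be sketched numerically. This is an illustrative stand-in only, not code from the thesis; the function `h` simply returns the fixed example weights $(4, 2)$:

```python
# Sketch of the weight-based policy from the example above: h returns a
# vector of numerical weights, and the policy divides each weight by the
# sum of all weights to obtain probabilities.

def h(s, theta):
    # Hypothetical parameterized function; here it returns fixed weights
    # matching the example h(s, theta_t) = (4, 2).
    return [4.0, 2.0]

def policy(s, theta):
    weights = h(s, theta)
    total = sum(weights)
    return [w / total for w in weights]

probs = policy(s=0, theta=None)  # action 1 gets weight 4/6, action 2 gets 2/6
```

With the example weights, action $1$ is thus twice as likely as action $2$.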
thesis/03_ppo/dependency.pdf
0 → 100644
File added
thesis/03_ppo/gae.tex
 Both the loss described in equation \ref{eqn:naive_loss} and the loss proposed by \citeA{ppo} depend on advantage
 estimations. However, to estimate advantages value estimations are required. By definition (cf.~equation
-\ref{eqn:value_function}), the value function can be estimated using the return.
+\ref{eqn:value_function}), the value function can be estimated using the return. Figure \ref{fig:adv_dependency}
+highlights these relations.
+\begin{figure}[h]
+    \centering
+    \includegraphics[width=\textwidth]{03_ppo/dependency.pdf}
+    \caption{The advantage estimations required to calculate a loss or gradient estimator depend on the value function.
+        The value function in turn requires returns. Finally, these estimators and functions build upon rewards
+        $R_{t+1}$.}
+    \label{fig:adv_dependency}
+\end{figure}
 Both the advantage estimator (cf.~equation \ref{eqn:delta2}) and the return (cf.~equation \ref{eqn:return}) are
-suboptimal as they suffer from being biased and having high variance respectively. We further explain these issues and
+suboptimal, as they suffer from being biased and having high variance respectively. We further explain these issues and
 introduce advanced methods that alleviate this issue in this chapter.
-\todo[inline]{insert overview of Loss using advantage using value using return using rewards}
 \subsubsection{Generalized Advantage Estimation}
 \label{sec:03:gae_gae}
...
@@ -56,10 +63,11 @@ we do not require this special case.} Then \emph{Generalized Advantage Estimatio
     \label{eqn:gae_start}
     \delta_t^{\text{GAE}(\gamma, \lambda)} &\doteq (1 - \lambda) \left(\delta_t^{(1)} + \lambda\delta_t^{(2)} + \lambda^2\delta_t^{(3)} + \dots\right) \\
-    &= (1 - \lambda) (\delta_t + \lambda(\delta_t + \gamma\delta_{t+1}) + \lambda^2(\delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2}) + \dots) \nonumber \\
-    &= (1 - \lambda) (\delta_t(1 + \lambda + \lambda^2 + \dots) + \gamma\delta_{t+1}(\lambda + \lambda^2 + \lambda^3 + \dots) \nonumber \\
-    &\hspace{48pt} + \gamma^2\delta_{t+2}(\lambda^2 + \lambda^3 + \lambda^4 + \dots) + \dots) \nonumber \\
+    &= (1 - \lambda) \left(\delta_t + \lambda(\delta_t + \gamma\delta_{t+1}) + \lambda^2(\delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2}) + \dots\right) \nonumber \\
+    &= (1 - \lambda) \left(\delta_t(1 + \lambda + \lambda^2 + \dots) + \gamma\delta_{t+1}(\lambda + \lambda^2 + \lambda^3 + \dots) \nonumber \right. \\
+    &\hspace{48pt} \left. + \gamma^2\delta_{t+2}(\lambda^2 + \lambda^3 + \lambda^4 + \dots) + \dots\right) \nonumber \\
     &= (1 - \lambda) \left(\delta_t\left(\frac{1}{1 - \lambda}\right) + \gamma\delta_{t+1}\left(\frac{\lambda}{1 - \lambda}\right) + \gamma^2\delta_{t+2}\left(\frac{\lambda^2}{1 - \lambda}\right) + \dots\right) \nonumber \\
     \label{eqn:gae}
...
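As a numerical companion to the derivation above: for a finite rollout, the truncated sum $\delta_t^{\text{GAE}(\gamma, \lambda)} = \sum_l (\gamma\lambda)^l \delta_{t+l}$ can be computed with a backward recursion. This is an illustrative sketch (function name and inputs are assumptions, and the one-step errors $\delta_t$ are assumed precomputed), not code from the thesis:

```python
# Generalized Advantage Estimation over a finite rollout: the recursion
# A_t = delta_t + gamma * lambda * A_{t+1} evaluates the truncated sum
# sum_l (gamma * lambda)^l * delta_{t+l} in a single backward pass.

def gae(deltas, gamma, lam):
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

adv = gae([1.0, 0.5, -0.2], gamma=0.99, lam=0.95)
```

The recursion avoids recomputing the geometric tail for every time step, which is how the estimator is typically evaluated in practice.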
thesis/03_ppo/index.tex
...
@@ -6,7 +6,7 @@
 \label{sec:03:ppo_motivation}
 \input{03_ppo/motivation}
-\subsection{Modern Advantage and Return Estimation}
+\subsection{Advantage and Value Estimation}
 \label{sec:03:gae}
 \input{03_ppo/gae}
...
thesis/03_ppo/loss.tex
...
@@ -75,7 +75,7 @@ advantage estimations \cite{ppo}:
 Despite the use of importance sampling this loss can be unreliable. For actions with a very large likelihood ratio, for
-example, $\rho_t(\boldsymbol\theta) = 100$, gradient steps become excessively large possibly leading to performance
-collapses \cite{kakade2002}. \todo{highlight this a bit more?}
+example, $\rho_t(\boldsymbol\theta) = 100$, gradient steps become excessively large possibly leading to performance
+collapses \cite{kakade2002}.
 Hence, Proximal Policy Optimization algorithms intend to keep the likelihood ratio close to one \cite{ppo}. There are
 two variants of PPO, one that relies on clipping the ratio whereas the other is penalized using the Kullback-Leibler
...
@@ -140,6 +140,7 @@ This issue is solved by taking an elementwise minimum:
         \right)
     \right].
 \end{align}
+In practice, the advantage function $\hat{a}_{\pi_{\boldsymbol\theta_\text{old}}}$ is replaced with Generalized
+Advantage Estimations $\delta_t^{\text{GAE}(\gamma, \lambda)}$ (cf.~equation \ref{eqn:gae}). Often,
 $\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$ is called a surrogate objective, as it approximates the original objective
...
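The clipping behaviour of the surrogate objective can be sketched for a single (ratio, advantage) pair. This is an illustrative sketch of the elementwise term inside the expectation (names and the default $\epsilon$ are assumptions), not the thesis implementation:

```python
# Single-sample term of the clipped surrogate objective: the likelihood
# ratio rho is clipped to [1 - eps, 1 + eps], and the elementwise minimum
# of the clipped and unclipped products with the advantage is taken.

def l_clip(rho, advantage, eps=0.2):
    clipped_rho = max(1.0 - eps, min(1.0 + eps, rho))
    return min(rho * advantage, clipped_rho * advantage)
```

For a very large ratio such as $\rho = 100$ and a positive advantage, the objective is capped at $(1 + \epsilon) \cdot A$, so the excessively large gradient step described above cannot occur; for a negative advantage the minimum keeps the pessimistic unclipped term.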
thesis/04_implementation/adv_normalization.tex
-Albeit it is not mentioned by \citeA{ppo}, advantages are normalized before loss computation. Normalizing advantages is
-a well-known operation that lowers the variance of the gradient estimator.
+Albeit it is not mentioned by \citeA{ppo}, advantage estimations $\delta$ are normalized before loss computation.
+Normalizing advantages is a well-known operation that lowers the variance of the gradient estimator.
+\begin{align}
+    \label{eqn:adv_normalization}
+    \delta \gets \frac{\delta - \text{mean}(\delta)}{\text{std}(\delta) + 10^{-5}}
+\end{align}
-\begin{algorithm}
-    \caption{Advantage normalization}
-    \label{alg:adv_normalization}
-    \begin{algorithmic}
-        \Require advantage estimations $\delta$
-        \State $\delta \gets \frac{\delta - \text{mean(}\delta\text{)}}{\text{std(}\delta\text{)} + 10^{-5}}$
-    \end{algorithmic}
-\end{algorithm}
-The normalization shown in algorithm \ref{alg:adv_normalization} ensures that the advantages have a mean of $0$ and a
-standard deviation of $1$. We add $10^{-5}$ to the divisor for numerical stability.
+Due to the update\footnote{As common in computer science, we use $\gets$ to denote an assignment. Hence, the advantage
+estimations $\delta$ are updated with their normalized values.} seen in equation \ref{eqn:adv_normalization}, advantages
+are normalized, so they have a mean of $0$ and a standard deviation of $1$. We add $10^{-5}$ to the divisor for
+numerical stability.
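A minimal sketch of this update, using only the standard library (the function name is an assumption; the thesis code operates on tensors instead of lists):

```python
# Advantage normalization: shift to mean 0 and scale towards standard
# deviation 1, adding 1e-5 to the divisor for numerical stability.
import statistics

def normalize(advantages):
    mean = sum(advantages) / len(advantages)
    std = statistics.pstdev(advantages)  # population standard deviation
    return [(a - mean) / (std + 1e-5) for a in advantages]

normalized = normalize([1.0, 2.0, 3.0])
```

Because of the stability constant, the resulting standard deviation is marginally below $1$ rather than exactly $1$.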
thesis/04_implementation/full_algorithm.tex
...
@@ -6,37 +6,37 @@ $\mathcal{L}^\text{VF}$ with $\mathcal{L}^\text{VF-CLIP}$ and the use of gradient
 \begin{algorithm}[ht]
     \caption{Full Proximal Policy Optimization for ATARI 2600 games, modified from \protect\citeauthor{ppo}'s
         \protect\citeyear{ppo} and \protect\citeauthor{peng2018}'s \protect\citeyear{peng2018} publications. Lines commented
-        with a $*$ are changed from algorithm \ref{alg:ppo}.}
+        with a \Comment{} are changed from algorithm \ref{alg:ppo}.}
     \label{alg:ppo_full}
     \begin{algorithmic}[1]
         \Require number of iterations $I$, rollout horizon $T$, number of actors $N$, number of epochs $K$, minibatch
             size $M$, discount $\gamma$, GAE weight $\lambda$, learn rate $\alpha$, clipping parameter $\epsilon$,
             coefficients $c_1, c_2$, postprocessing $\phi$
-        \State $\boldsymbol\theta \gets$ orthogonal initialization \Comment{$*$}
+        \State $\boldsymbol\theta \gets$ orthogonal initialization \Comment{}
         \State Initialize environments $E$
        \State Number of minibatches $B \gets N \cdot T \;/\; M$
         \For{iteration=$1, 2, \dots, I$}
             \State $\tau \gets$ empty rollout
             \For{actor=$1, 2, \dots, N$}
-                \State $s \gets$ $\phi(\text{current state of environment } E_\text{actor})$ \Comment{$*$}
+                \State $s \gets$ $\phi_s(\text{current state of environment } E_\text{actor})$ \Comment{}
                 \State Append $s$ to $\tau$
                 \For{step=$1, 2, \dots, T$}
                     \State $a \sim \pi_{\boldsymbol\theta}(a \mid s)$
                     \State $\pi_{\boldsymbol\theta_\text{old}}(a \mid s) \gets \pi_{\boldsymbol\theta}(a \mid s)$
                     \State Execute action $a$ in environment $E_\text{actor}$
-                    \State $s \gets$ $\phi_s(\text{successor state})$ \Comment{$*$}
-                    \State $r \gets$ $\phi_r(\text{reward})$ \Comment{$*$}
+                    \State $s \gets$ $\phi_s(\text{successor state})$ \Comment{}
+                    \State $r \gets$ $\phi_r(\text{reward})$ \Comment{}
                     \State Append $a, s, r, \pi_{\boldsymbol\theta_\text{old}}(a \mid s), \hat{v}_\pi(s, \boldsymbol\theta_\text{old})$ to $\tau$
                 \EndFor
             \EndFor
             \State Compute Generalized Advantage Estimations $\delta_t^{\text{GAE}(\gamma, \lambda)}$
-            \State Normalize advantages \Comment{$*$}
+            \State Normalize advantages \Comment{}
             \State Compute $\lambda$-returns $G_t^\lambda$
             \For{epoch=$1, 2, \dots, K$}
                 \For{minibatch=$1, 2, \dots, B$}
-                    \State $\boldsymbol\theta \gets \boldsymbol\theta + \text{clip}(\alpha \cdot \nabla_{\boldsymbol\theta} \;\mathcal{L}^{\text{CLIP}+\text{VF-CLIP}+S}(\boldsymbol\theta))$ on minibatch with size $M$ \Comment{$*$}
+                    \State $\boldsymbol\theta \gets \boldsymbol\theta + \text{clip}(\alpha \cdot \nabla_{\boldsymbol\theta} \;\mathcal{L}^{\text{CLIP}+\text{VF-CLIP}+S}(\boldsymbol\theta))$ on minibatch with size $M$ \Comment{}
                 \EndFor
             \EndFor
             \State Anneal $\alpha$ and $\epsilon$ linearly
...
thesis/04_implementation/index.tex
-\section{From Theory to Practice}
+\section{Realization}
 \label{sec:04:implementation}
 \input{04_implementation/introduction}
...
thesis/04_implementation/initialization.tex
 \citeA{ilyas2018} found that the weights of the neural network used by \citeA[baselines/ppo2]{baselines} are initialized
 using orthogonal initialization.\footnote{Using orthogonal initialization with large neural networks was proposed by
-\citeA{saxe2014} The authors also provide mathematical foundations and examine the benefits of various initialization
+\citeA{saxe2014}. The authors also provide mathematical foundations and examine the benefits of various initialization
 schemes.}
 The impact of this choice appears to be subject of empirical examinations only \cite{ilyas2018, engstrom2019}.
 Table \ref{table:orthogonal_scaling} lists each layer of the neural network and the corresponding scaling factor that is
 used to initialize the layer.
...
...
thesis/04_implementation/value_clipping.tex
View file @
499ab6e9
...
...
@@ 33,5 +33,5 @@ $\epsilon$ is the same hyperparameter that is used to clip the likelihood ratio
Intuitively, this approach may be similar to clipping the probability ratio. To avoid gradient collapses, a trust region
is created with the clipping parameter
$
\epsilon
$
. Then an elementwise maximum is taken, so errors from previous
gradient steps can be corrected. A maximum is applied instead of a minimum because the value function loss is minimized.
Further analysis on the ramifications of optimizing a surrogate loss
for the value functions
is available
\cite
A
{
ilyas2020
}
.
Further analysis on the ramifications of optimizing a surrogate loss
of the value function
is available
\cite
{
ilyas2020
}
.
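The intuition above can be sketched for a single state. This is an illustrative sketch of the clipped value loss term (names and the default $\epsilon$ are assumptions), not the thesis implementation:

```python
# Clipped value loss for one state: the new value estimate is kept within
# eps of the old estimate, squared errors are computed for the clipped and
# unclipped estimates, and the elementwise maximum is taken so that errors
# from previous gradient steps can still be corrected.

def vf_clip_loss(v_new, v_old, v_target, eps=0.2):
    v_clipped = v_old + max(-eps, min(eps, v_new - v_old))
    return max((v_new - v_target) ** 2, (v_clipped - v_target) ** 2)
```

Note the maximum: if the clipped estimate is further from the target than the unclipped one, its larger error is used, mirroring the pessimistic minimum of the policy loss.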
thesis/05_evaluation/ale.tex
...
@@ -21,7 +21,7 @@ accurately. On BeamRider, agents achieve close to human performance.
 OpenAI Gym and ALE simplify evaluation greatly, as they provide a unified interface for all environments. We simply
 provide an action to the environment and obtain observations containing the current state of the environment as a
 $210 \times 160$ RGB image (before postprocessing), the reward and further information such as the number of lives remaining
-or if the game terminated. Just like the ATARI 2600 console, the Arcade Learning Environment runs at 60 Hz; without
+or if the game terminated. The Arcade Learning Environment runs at 60 frames per second when run in realtime; without
 frame skipping we would obtain 60 observations per second.
 Since the ALE returns a reward, we do not have to design a reward function. Instead, all reinforcement
...
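The unified interface described above can be illustrated with a tiny stand-in environment. The class below is invented purely for illustration (in the real setting the observation would be a $210 \times 160$ RGB image and the info dictionary would report lives remaining):

```python
# Toy stand-in for the classic Gym-style interface: reset() returns an
# initial observation, and step(action) returns a 4-tuple of observation,
# reward, termination flag and an info dictionary.

class ToyEnv:
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial observation

    def step(self, action):
        self.t += 1
        observation = self.t
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 3  # episode ends after three steps
        info = {"lives": 1}
        return observation, reward, done, info

env = ToyEnv()
obs = env.reset()
total = 0.0
done = False
while not done:
    obs, reward, done, info = env.step(1)
    total += reward
```

An agent only ever interacts through `reset` and `step`, which is exactly what makes evaluation across all games uniform.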
thesis/05_evaluation/discussion.tex
...
@@ -6,7 +6,7 @@ shown in this thesis are reviewed. Finally, we discuss the robustness of PPO to
 \label{sec:05:discussion_optims}
 Since the performance of agents trained with Proximal Policy Optimization as described in the publication is
-significantly worse than the performance achieved with the reference configuration, it begs the question if all
+notably worse than the performance achieved with the reference configuration, it begs the question if all
 optimizations are required and what their individual impact is. In order to answer this question, agents were
 trained twice for each optimization. Once with only the specific optimization enabled and once with all other
 optimizations enabled.
...
@@ -20,7 +20,7 @@ as is the performance on Seaquest.
 \paragraph{Gradient clipping.}
 Gradient clipping is disabled in experiment \emph{no\_gradient\_clipping}. We note that the performance on Breakout is
-significantly improved with a final score of roughly $400$ achieved by all three agents without gradient clipping.
+particularly improved with a final score of roughly $400$ achieved by all three agents without gradient clipping.
 However, some reward graphs on other games appear less stable, notably one of the training runs on BeamRider, Pong and
 Seaquest each.
...
@@ -44,7 +44,7 @@ only game that is not affected by this change is Pong, as the set of rewards in
 Furthermore, in Pong no player will score multiple times within $k = 4$ frames (cf.~chapter \ref{sec:04:postprocessing}).
 On the remaining games, rewards can achieve much larger magnitudes so reward binning or clipping has a notable effect on
-the rewards. Experiment \emph{reward\_clipping} shows a significant drop in performance on all games but Pong.
+the rewards. Experiment \emph{reward\_clipping} shows a distinct drop in performance on all games but Pong.
 This is echoed in the reward graphs and final scores with all agents performing much worse once rewards are no longer
 subject to reward binning. As a consequence, rewards should be binned and not clipped.
...
@@ -60,7 +60,7 @@ normalization, which appears to be detrimental to agents learning BeamRider, alt
 on Breakout, Pong, Seaquest and SpaceInvaders. Moreover, we observe that all optimizations contribute to the success of
 Proximal Policy Optimization. Only advantage normalization affected the performance of agents to a lesser degree.
 However, we cannot deduce that advantage normalization is not important for PPO, as the optimization may be of larger
-significance on other tasks.
+importance on other tasks.
 Since these optimizations are crucial to achieve competitive results on ATARI games, we must criticize that they were
 not disclosed by \citeA{ppo}. Even though they are not directly related to reinforcement learning, they should be
...
@@ -106,7 +106,7 @@ That does not mean that the reference configuration is the optimal configuration
 (cf.~experiment \emph{no\_epsilon\_annealing}).
 \item When the minibatch size is decreased from $256$ to $64$, agents learn Pong a lot faster (cf.~experiment
     \emph{16\_minibatches}).
-\item Both the reward graph and the final score of SeaQuest are significantly improved when the learn rate $\alpha$
-    is not annealed over the course of the training (cf.~experiment \emph{no\_alpha\_annealing}).
+\item Both the reward graph and the final score of Seaquest are notably improved when the learn rate $\alpha$
+    is not annealed over the course of the training (cf.~experiment \emph{no\_alpha\_annealing}).
 \end{itemize}
 However, performing such changes often leads to worse performances on at least one other game. Hence, it is hard to
...
thesis/05_evaluation/experiments.tex
 \todo[inline]{put a list of all experiments with short note on results in the appendix}
 A total of $42$ experiments were conducted to evaluate optimizations listed in chapter \ref{sec:04:implementation} and
 values of hyperparameters. For each experiment, two graphs are generated: one including only the run itself and one
 adding a trendline of the reference configuration of Proximal Policy Optimization for ATARI. The reference configuration
...
@@ -184,11 +182,11 @@ final scores are shown in table \ref{table:paper_score}.
     \label{table:paper_score}
 \end{table}
-Most reward graphs are significantly below the reward graphs of the reference configuration which is echoed in the final
-scores; only the agents trained on Pong remain close to the reported performance, but it takes the agents much longer
-to achieve strong performances. The results strongly deviate from the results reported by \citeA{ppo}. Therefore, we can
-conclude that the optimizations outlined in chapter \ref{sec:04:implementation} have strong effects on the course of
-training as well as on the final performance of trained agents.
+Most reward graphs display noticeably lower scores than the reward graphs of the reference configuration. This is also
+apparent in the final scores; only the agents trained on Pong remain close to the reported performance, but it takes
+the agents much longer to achieve strong performances. The results strongly deviate from the results reported by
+\citeA{ppo}. Therefore, we can conclude that the optimizations outlined in chapter \ref{sec:04:implementation} have
+strong effects on the course of training as well as on the final performance of trained agents.
 \begin{figure}[H]
     \centering
...
thesis/05_evaluation/method.tex
...
@@ -20,21 +20,23 @@ its training. We evaluate how quickly an agent learns by drawing an episode rewa
 terminates, the highscore is plotted. As we run $N$ environments simultaneously, several episodes could terminate at
 the same time. If this occurs, we plot the average of all terminated episodes. Figure \ref{fig:reward_graph}
 displays an unsmoothed reward graph for the game Pong. The x axis displays the training time step and the y axis shows
-the highscore. As a result, we get a graph that shows the performance of an agent over the course of its training.
+the highscore. As a result, we get a graph that shows the performance of an agent over the course of its training. A
+smoothed graph of the same data can be seen in figure \ref{fig:graph_breakout} on page \pageref{fig:graph_breakout}.
 \begin{figure}[ht]
     \centering
-    \includegraphics[width=\textwidth]{05_evaluation/Pong.png}
-    \caption{TODO: this actually isn't even a non-smoothed graph at the moment. Just imagine it was noisier. To be
-        fixed with colors and font size later.}
+    \includegraphics[width=\textwidth]{05_evaluation/unsmoothed.png}
+    \caption{Unsmoothed episode reward graphs like this one are very noisy and therefore hard to examine. We address
+        this issue by smoothing the graphs. Smoothed episode reward graphs may be seen in chapter
+        \ref{sec:05:experiments}.}
     \label{fig:reward_graph}
 \end{figure}
 Since agents encounter novel states regularly in the beginning of the training, the performance of episodes can vary
 greatly. This results in a noisy episode reward graph. Depending on the complexity of the game, the graph may remain
-noisy even until the end of the training, e.g., when learning Breakout. We alleviate this issue by smoothing the reward
-graph: outliers are featured less prominently and the noise is reduced, whilst the overall trend of the training is
-preserved.
+noisy even until the end of the training as seen in figure \ref{fig:reward_graph}. We alleviate this issue by smoothing
+the reward graph: outliers are featured less prominently and the noise is reduced, whilst the overall trend of the
+training is preserved. However, applying a suitable method to evaluate noise, such as confidence intervals, will be a
+topic for future research.
 The reward graphs are smoothed by computing the average of a sliding window with $16$ data points. This window is then
 centered, so each data point is the average of the $8$ previous episodes and the $7$ following ones. We note that this
...
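The centered sliding-window smoothing described above can be sketched as follows; this is an illustrative stand-in (names and edge handling are assumptions), not the plotting code of the thesis:

```python
# Centered sliding-window smoothing: each point becomes the average of the
# 8 previous data points, the point itself and the 7 following ones, i.e. a
# window of 16. Near the edges the available subset of the window is used.

def smooth(values, before=8, after=7):
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - before):i + after + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A constant series is left unchanged by the smoothing, while isolated outliers are averaged over 16 neighbouring episodes and therefore stand out far less.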
thesis/05_evaluation/unsmoothed.png
0 → 100644
474 KB
thesis/06_conclusion/conclusion.tex
...
@@ -21,7 +21,7 @@ However, multiple non-disclosed optimizations to the algorithm can be found in t
 popular hyperparameter choices on five ATARI 2600 games: BeamRider, Breakout, Pong, Seaquest and SpaceInvaders. The
 results of the experiments show that most of the optimizations are crucial to achieve good results on these ATARI games.
 This finding is shared by \citeA{ilyas2018} and \citeA{engstrom2019}, who examined Proximal Policy Optimization on
-robotics environments. Furthermore, we found that significant outliers are present in approximately $35\%$ of the
+robotics environments. Furthermore, we found that noticeable outliers are present in approximately $35\%$ of the
 experiments.
 As a consequence, we may call for two improvements: Firstly, all optimization choices pertaining to the algorithm, even
...