Commit e323d186 authored by Daniel Lukats

another large polishing pass

parent d8f8b71f
......@@ -37,8 +37,9 @@
\begin{table}[H]
\centering
\begin{tabular}{r l}
\large{Advisor} & \large{Prof. Dr.-Ing. Jürgen te Vrugt} \\
\large{Co-Advisor} & \large{Prof. Dr. Kathrin Ungru} \\
\large{Student number} & \large{978 528}
\end{tabular}
\end{table}
\vspace*{3cm}
......
......@@ -10,7 +10,7 @@ spaceflight, e.g., to train agents for interplanetary transfers \cite{nasa} and
\cite{gaudet2020}. For a final example, researchers applied Proximal Policy Optimization to medical imaging in order to
trace axons, which are microscopic neuronal structures \cite{dai2019}.
On the other hand, researchers combine the algorithm with further concepts. For example, Proximal Policy Optimization
can be used in meta reinforcement learning, which trains a reinforcement learning agent to train other reinforcement
learning agents \cite{liu2019}. Yet another concept that may be combined with the algorithm is curiosity
\cite{pathak2017}. Curiosity is a mechanism that incentivizes a methodical search for an optimal solution rather than a
......@@ -31,5 +31,5 @@ On the other hand, this thesis examines not only the impact of the aforementione
common hyperparameter choices on a selection of five ATARI 2600 games. These video games form a typical benchmarking
environment for evaluating reinforcement learning algorithms. The importance of the implementation choices was already
researched on robotics tasks \cite{ilyas2018, engstrom2019}, but the authors forewent an examination on ATARI games.
Therefore, one may raise the question of whether these choices have the same importance for ATARI 2600 games as they do for
robotics tasks.
For ease of reading, an overview of all definitions including the equation and page number can be found on page
\pageref{sec:maths_index}. The mathematical notation used in publications regarding reinforcement learning can differ
greatly. In this thesis, we adhere to the notation introduced by \citeA{sutton18} and adapt definitions and proofs from
other sources accordingly. Consequently, interested readers may notice differences between the notation used in
chapter \ref{sec:03:ppo} and the publications of \citeA{gae} and \citeA{ppo}:
\begin{itemize}
\item The advantage estimator is denoted $\delta$ instead of $\hat{A}$, as both $A$ and $a$ are heavily overloaded
......
......@@ -24,7 +24,7 @@ squared value error
\label{eqn:value_error}
\overline{\text{VE}}(\boldsymbol\omega) \doteq \sum_{s \in \mathcal{S}} \mu_\pi(s) [v_\pi(s) - \hat{v}_\pi(s, \boldsymbol\omega)]^2
\end{align}
is minimized \cite[p.~199]{sutton18}. As we sample trajectories to form estimations from, the states we observe
are naturally distributed in proportion to the stationary distribution $\mu$. As a consequence, we update the value
estimations of frequently observed states more often, leading to more accurate estimations. Therefore, we minimize
$\overline{\text{VE}}(\boldsymbol\omega)$ implicitly. We establish a method to learn the value function in chapter
......
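The implicit weighting by $\mu$ can be illustrated with a small sketch: if states are sampled by following the policy, an unweighted squared-error update touches frequently visited states more often and thereby approximates the $\mu$-weighted objective. The snippet below is a minimal illustration for a linear value estimate with a sampled return as target; the names and the linear parameterization are assumptions made for the example and are not taken from the thesis implementation.
\begin{verbatim}
import numpy as np

def msve_sgd_step(omega, features, sampled_return, alpha=1e-3):
    """One stochastic gradient step on the squared value error for a
    linear value estimate v_hat(s, omega) = omega . x(s).

    Because states are drawn by following the policy, frequently visited
    states contribute more updates, which implicitly realizes the
    mu-weighting of the mean squared value error."""
    v_hat = omega @ features                  # current estimate of v_pi(s)
    error = sampled_return - v_hat            # v_pi(s) approximated by a sampled return
    return omega + alpha * error * features   # gradient of 0.5 * error^2 w.r.t. omega
\end{verbatim}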
......@@ -39,9 +39,9 @@ which we introduce in chapter \ref{sec:02:mdp}. Subsequently, we define a goal t
\emph{value function}.
A key issue of reinforcement learning is the balance of \emph{exploration} and \emph{exploitation}
\cite[p.~3]{sutton18}. Often, we lack full knowledge of the environment; for example, a few frames of video input from a
video game do not carry information on the entire internal state of the game. We gather information by interacting with
the environment, slowly learning which actions help achieve the goal. As we do not gain full information of a state or
action by observing or choosing it once, we must make a decision: Do we explore more states and actions that we have
no information of? Or do we exploit the best-known way to achieve the goal, gaining more knowledge on a few select
states and actions?
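A simple mechanism that makes this trade-off explicit is $\varepsilon$-greedy action selection, which is referenced again later in this thesis. The sketch below only illustrates the dilemma; the function name and the value estimates are hypothetical and not part of the algorithms discussed here.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit
    the action with the highest estimated value (the best-known behavior)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))
    return int(np.argmax(action_values))
\end{verbatim}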
......@@ -26,8 +26,8 @@ interact once---the agent chooses an action and observes the environment.
\label{fig:intro_mdp}
\end{figure}
A simple Markov decision process is displayed in figure \ref{fig:intro_mdp}. Like Markov chains, it contains states
\cite[chapter 11.1]{grinstead1997}: $s_0, s_1$ and $s_2$ with $s_0$ being the initial state and $s_2$ being a terminal
state. Unlike Markov chains, it also contains actions \emph{left} and \emph{right} as well as rewards $0$ and $1$. The
rewards are written alongside transition probabilities and can be found on the edges connecting the states. Further
explanations and an example using elements of this Markov decision process are given in chapter
......@@ -169,7 +169,7 @@ actions and rewards to use as training data by running the agent in the environm
ATARI 2600 games have a multitude of \emph{terminal states} that indicate the game has come to an end, $\#\{s \mid s
\in \mathcal{S} \land s \text{ is terminal}\} \gg 1$. For example, in Pong the game ends once either player
has scored 21 points, whereas the loser's score can have any value in the range of 0 to 20. Furthermore, the paddles may be
in any position at the end of the game, which gives rise to even more terminal states.
Whenever the agent transitions to a terminal state, it cannot transition to other states anymore. Therefore, there is a
final time step $T$ after which no meaningful information can be gained. We call $T$ the \emph{finite} \emph{horizon};
......
......@@ -11,10 +11,10 @@ parameterized policy
\pi(a \mid s, \boldsymbol\theta) \doteq P(A_t = a \mid S_t = s, \boldsymbol\theta_t = \boldsymbol\theta)
\end{align}
returns the probability of taking the action $a$ given the state $s$ and parameter vector
$\boldsymbol\theta$ \cite[p.~321]{sutton18}. Often, we write $\pi_{\boldsymbol\theta}$ instead of $\pi(a \mid s,
\boldsymbol\theta)$.
Usually, $\boldsymbol\theta$ is the parameterization of a function that maps a state to a vector of numerical weights. Each
action corresponds to a weight that indicates the likelihood of picking the respective action; the higher the weight,
the more likely that action should be chosen. The policy $\pi$ uses the numerical weights to determine a
probability distribution \cite[p.~322]{sutton18}, e.g., by applying a softmax function
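A minimal sketch of this mapping is given below; it turns a vector of action weights into a probability distribution and samples an action from it. The helper name and the example weights are made up for illustration and do not stem from the thesis implementation.
\begin{verbatim}
import numpy as np

def softmax_policy(weights):
    """Turn numerical action weights into probabilities pi(a | s, theta)."""
    z = np.exp(weights - np.max(weights))   # subtract the maximum for numerical stability
    return z / z.sum()

rng = np.random.default_rng(0)
probs = softmax_policy(np.array([2.0, 0.5, -1.0]))
action = rng.choice(len(probs), p=probs)    # sample an action from the policy
\end{verbatim}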
......@@ -70,13 +70,13 @@ $\boldsymbol\theta$. We utilize neural networks as a differentiable function (cf
According to \citeA[pp.~322--323]{sutton18}, policy gradient methods provide two major advantages over other methods.
Firstly, parameterized policies can approach a deterministic policy gradually. Ideally, we must only assign a reasonable
value to the learning rate $\alpha$, but no further hyperparameters are required to ensure sufficient exploration and a
smooth transition to exploitation.\footnote{In practice, other parameters are required to prevent forgetting
and performance collapses (cf.~chapter \ref{sec:03:ppo_motivation}).} Secondly, parameterized policies enable us to
assign arbitrary probabilities to actions. This gives us the opportunity to discover stochastic approximations of optimal
policies.\footnote{A policy that is not capable of doing so is the $\varepsilon$-greedy policy, which chooses the best
available action with a probability of $1 - \varepsilon$ and a random one otherwise \cite[p.~322]{sutton18}.
$\varepsilon$ typically converges to $0$ over the course of the training resulting in a deterministic policy.}
\citeA[p.~323]{sutton18} give an example of a Markov decision process that can only be solved by a stochastic policy.
\citeA[p.~324]{sutton18} point out that combining a parameterized policy with function approximation may pose a
challenge: the performances of both $\pi_{\boldsymbol\theta}$ and $\hat{v}(s, \boldsymbol\omega)$ depend on the action
......@@ -103,10 +103,10 @@ policy gradients with the advantage function \cite{sutton2000}:
Using the advantage function is beneficial, as the policy gradient in equation \ref{eqn:advantage_gradient} has a lower
variance than the gradient in equation \ref{eqn:q_gradient} \cite{gae}.
Like the mean squared value error $\overline{\text{VE}}$, the gradient is weighted according to the distribution of
states $\mu$, but it does not depend on its derivative. By virtue of weighting with $\mu$, states an agent encounters
regularly have a greater effect on the gradient. Moreover, the magnitude of the gradient is controlled by the advantage
of an action: A large advantage mandates a large gradient step.
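The role of the advantage as a scaling factor can be sketched as follows. The snippet assumes a PyTorch policy network that outputs action preferences (logits) and batches of states, actions and advantage estimates; the framework choice and all names are illustrative rather than taken from the thesis implementation.
\begin{verbatim}
import torch

def policy_gradient_step(policy, optimizer, states, actions, advantages):
    """One stochastic gradient ascent step on log pi(a | s) weighted by the
    advantage: a large positive advantage yields a large step towards making
    the action more likely, a negative advantage pushes the policy away."""
    dist = torch.distributions.Categorical(logits=policy(states))
    loss = -(dist.log_prob(actions) * advantages).mean()  # negate for gradient descent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
\end{verbatim}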
\subsubsection{Gradient Estimation}
......
......@@ -57,7 +57,7 @@ it is defined to be.}
\end{align}
The advantage function compares the expected immediate reward and return of a successor state $r + v_\pi(s')$ with the
expected return of the current state $v_\pi(s)$. As the environment may be non-deterministic, the former term must be
weighted with its probability.
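As a brief numerical illustration (with made-up values, not taken from the thesis or the figure), consider an action with two possible outcomes: with probability $0.8$ it yields reward $1$ and a successor state with $v_\pi(s_1') = 2$, with probability $0.2$ it yields reward $0$ and a successor state with $v_\pi(s_2') = 1$. With $\gamma = 0.9$ and $v_\pi(s) = 2$,
\begin{align*}
a_\pi(s, a) = 0.8\,(1 + 0.9 \cdot 2) + 0.2\,(0 + 0.9 \cdot 1) - 2 = 2.24 + 0.18 - 2 = 0.42,
\end{align*}
so the action is expected to perform better than the policy's average behavior in $s$.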
If $a_\pi(s, a) = 0$, which means $\sum_{s', r}\left[ p(s', r \mid s, a) \cdot (r + \gamma v_\pi(s')) \right] =
......
......@@ -65,14 +65,14 @@ approach $0$ over the course of the training; the trust region shrinks. Lines 18
\end{algorithm}
Rollouts always have a length of $T$ time steps. Hence, the horizon of an episode might not align with the horizon of a
rollout---an episode might terminate earlier, or it might not terminate at all. We address these issues as follows (a
short sketch of both cases is given after the list):
\begin{itemize}
\item If an episode does not terminate whilst recording a rollout, we must adjust advantage and return
calculation. We can only compute $\lambda$-returns and GAE advantages up to the rollout time step $T$. Then we
bootstrap the missing rewards by adding an appropriately discounted value estimation $\hat{v}_\pi(S_T,
\boldsymbol\omega)$ of the final state included in the rollout.
\item If an episode terminates early, we only include rewards and values up until the episode terminated. Let
$T_\text{episode}$ denote this time step. Then all advantage and return estimations for time steps $t \le
T_\text{episode}$ use the rollout time step that corresponds with $T_\text{episode}$ as the upper bound of
summation (cf.~equations \ref{eqn:gae} and \ref{eqn:lambda_return}).
\end{itemize}
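Both cases can be sketched for a single environment as follows. The snippet assumes NumPy arrays holding the rewards, value estimations and terminal flags of one rollout of length $T$ plus a bootstrap value for $\hat{v}_\pi(S_T, \boldsymbol\omega)$; it is a simplified illustration rather than the thesis implementation.
\begin{verbatim}
import numpy as np

def gae_and_returns(rewards, values, bootstrap_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    values[t] estimates v(S_t); bootstrap_value stands in for the rewards
    beyond the rollout horizon if the episode did not terminate. dones[t]
    marks a terminal transition and stops both bootstrapping and the
    recursion, so estimates never reach across episode boundaries."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = bootstrap_value if t == T - 1 else values[t + 1]
        non_terminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
    returns = advantages + values          # lambda-returns used as value targets
    return advantages, returns
\end{verbatim}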
......@@ -12,7 +12,7 @@ highlights these relations.
\end{figure}
Both the advantage estimator (cf.~equation \ref{eqn:delta2}) and the return (cf.~equation \ref{eqn:return}) are
suboptimal, as they suffer from being biased and having high variance, respectively. We further explain these issues and
introduce advanced methods that alleviate them in this chapter.
\subsubsection{Generalized Advantage Estimation}
......@@ -50,13 +50,13 @@ Let $\delta_t^{(i)}$ denote the advantage estimator that contains $i$ rewards. T
as proposed by \citeA{gae}.
The more rewards are added, the more the value estimation $\hat{v}_\pi(S_{t+i}, \boldsymbol\omega)$ is discounted,
because $\delta_t^{(i)}$ contains a $\gamma^i$-discounted value estimation and $\gamma < 1$. Consequently, the
inaccuracy contributed by this term is reduced, yielding a less biased advantage estimator. However, as $i$ is
increased, the number of stochastic elements in the calculation grows. This increases the variance of the estimator.
Thus, $\delta_t^{(1)}$ has the highest bias and the lowest variance, whereas $\delta_t^{(T-t)}$ possesses the lowest
bias and the highest variance.
\citeA{gae} found that no particular advantage estimator $\delta_t^{(i)}$ grants the best results---instead, they must be
combined. Let $\lambda\in[0,1)$ denote a variance tuning factor.\footnote{Note that \citeA{gae} allow $\lambda = 1$, but
we do not require this special case.} Then \emph{Generalized Advantage Estimation} (GAE) is defined to be
\begin{align}
......
......@@ -21,7 +21,7 @@ policy and the current policy (similar to equation \ref{eqn:likelihood_ratio}) t
divergence. This way, they incentivize finding a new policy that is close to the old policy.
As the new loss still imposes a lower bound on the fitness $J(\boldsymbol\theta)$, it is a minorant to the fitness
function. The authors prove that maximizing this minorant guarantees a monotonically rising fitness
$J(\boldsymbol\theta)$; given sufficient optimization steps the algorithm converges to a local
optimum.\footnote{Algorithms like this one are called MM algorithms. This one is a \emph{minorize maximization} algorithm
\cite{hunter2004}.} The ensuing objective is called a \emph{surrogate} objective.
......@@ -34,14 +34,14 @@ However, TRPO achieves suboptimal results on tasks like video games and is both
execute \cite{ppo}. In order to compute an approximately optimal solution of the KL divergence, the inverse of a large
Hessian matrix\footnote{The Hessian matrix is a second-order derivative. It can be determined by deriving the gradient
of a function \cite[pp.~86--87]{goodfellow2016}.} is required. This led \citeA{ppo} to devise a simpler algorithm
called \emph{Proximal Policy Optimization} (PPO). We discuss this algorithm in the following chapters.
\subsubsection{Clipped Surrogate Objective}
\label{sec:03:policy_loss}
Proximal Policy Optimization is a policy gradient method that uses advantage estimation, e.g., Generalized Advantage
Estimation, to estimate the gradient. PPO performs multiple optimization steps on the same trajectory. The policy that
recorded the trajectory is denoted $\pi_{\boldsymbol\theta_\text{old}}$. With each optimization step, $\boldsymbol\theta$
moves further from $\boldsymbol\theta_\text{old}$.
As a consequence, an action that has become very unlikely under $\pi_{\boldsymbol\theta}$ might have a large advantage.
......@@ -75,7 +75,7 @@ advantage estimations \cite{ppo}:
Despite the use of importance sampling, this loss can be unreliable. For actions with a very large likelihood ratio, for
example, $\rho_t(\boldsymbol\theta) = 100$, gradient steps become excessively large, possibly leading to performance
collapses \cite{kakade2002}.
Hence, Proximal Policy Optimization algorithms intend to keep the likelihood ratio close to one \cite{ppo}. There are
two variants of PPO: one relies on clipping the ratio, whereas the other penalizes the objective using the Kullback-Leibler
......
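For reference, the clipping variant can be sketched as follows. The snippet assumes PyTorch tensors of log-probabilities under the current and the old policy as well as advantage estimates; all names are illustrative and the code is not the thesis implementation.
\begin{verbatim}
import torch

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized.

    The likelihood ratio rho = pi_theta(a|s) / pi_theta_old(a|s) is clipped
    to [1 - epsilon, 1 + epsilon] to keep it close to one."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
\end{verbatim}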
......@@ -6,7 +6,7 @@ when performing stochastic gradient descent. The longer the trajectories are the
the dynamics $p$ of the environment and the policy $\pi$ introduce non-deterministic behavior.
As the variance of the trajectories grows, the quality of the gradient estimator declines. Performing stochastic
gradient ascent with this estimator is no longer guaranteed to yield an improved policy---in fact, the performance of the
policy might deteriorate \cite{seita2017}. Even a single bad step might cause a collapse of policy performance
\cite{spinningup}. More importantly, the performance collapse might be irrecoverable. As the
policy was likely trained for several steps already, it has begun transitioning from exploration to exploitation. If an
......
Although Proximal Policy Optimization may be used with any differentiable function, neural networks are the most
commonly used solution, e.g., in the baselines repository by \citeA[baselines/common/models.py]{baselines}. With few
exceptions, most deep policy gradient algorithms use the architecture established by \citeA{nature_dqn} when learning
ATARI 2600 games.
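For orientation, this architecture can be sketched in PyTorch as below; the network processes a stack of four $84 \times 84$ grayscale frames. The framework choice and the function name are assumptions made for the example, not necessarily those of the thesis code.
\begin{verbatim}
import torch.nn as nn

def nature_cnn(num_actions):
    """Convolutional architecture of Mnih et al. (2015) for ATARI frames.

    Expects observations of shape (batch, 4, 84, 84), i.e. four stacked
    84 x 84 grayscale frames."""
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        nn.Linear(512, num_actions),  # policy head; a value head may share the torso
    )
\end{verbatim}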
\begin{sidewaysfigure}
\centering
......
Before performing a gradient step, the gradient $\nabla_\theta \mathcal{L}^{\text{CLIP}+\text{VFCLIP}+S}$ is
clipped \cite{ilyas2018}. All changes in weights are concatenated to a single vector. The weight changes are adjusted
such that the Euclidean norm of the vector does not exceed $0.5$. Gradient clipping is a technique commonly used to
prevent large gradient steps \cite[pp.~413--415]{goodfellow2016}.
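A sketch of such an optimization step is given below, assuming PyTorch (whose utility for global-norm clipping is used); the function name and arguments are illustrative rather than taken from the thesis code.
\begin{verbatim}
import torch

def optimization_step(model, optimizer, loss, max_norm=0.5):
    """Backpropagate the shared PPO loss and clip the global gradient norm.

    All parameter gradients are treated as one concatenated vector whose
    Euclidean norm is rescaled to at most max_norm before the update."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
\end{verbatim}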
Since Proximal Policy Optimization executes $N$ actors simultaneously, the training runtime can be greatly reduced
on multicore processors by parallelizing the execution. In this thesis, each actor operates in a dedicated process. They
are coordinated by a parent process that gathers the observations and optimizes the policy and the value function.
......
A sequence of post-processing steps is performed on observations returned by the ATARI 2600 emulator. Except for
\emph{reward clipping}, these operations are common when learning ATARI 2600 games. Operations such as \emph{fire
resets}, \emph{frame skipping} and handling \emph{episodic life} adapt specific behavior of ATARI games to the
requirements of reinforcement learning algorithms. Other operations reduce the complexity of the state space or lead to
more diverse starting conditions.
......@@ -10,7 +10,7 @@ post-processing steps. Mathematically, $\phi$ is part of the environment and the
Markov decision process.
\paragraph{No operation resets.}
Whenever an environment is reset, a random number of \emph{noop} actions are performed for up to 30 frames in order
to guarantee differing initial conditions across different trajectories. Originally, this approach was proposed to ensure
that an agent under evaluation does not overfit \cite{nature_dqn}, but it has since been adopted for training. This might
prove beneficial, as the random initial conditions can lead to more diverse trajectories. However, it appears that no
......@@ -71,7 +71,7 @@ Finally, the four most recent maximized images seen by an agent are combined to
elements of the tensor are set to 0, which represents the color black. The same applies to simulated episode ends and
beginnings as performed by the \emph{episodic life} operation.
Although no explanation is given by \citeA{nature_dqn}, one benefit is apparent: By stacking images, we provide an agent
further information on the state of the game such as direction and movement. If the agent saw only one frame, it
could not determine which direction the ball is moving in Pong. By showing it four frames at once, the agent can discern
if the ball is moving towards itself or the enemy or whether it will hit a wall or not (cf.~figure
......
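Frame stacking can be sketched with a fixed-length buffer as below; the class name and shapes are illustrative (assuming preprocessed $84 \times 84$ grayscale frames) and the snippet is not the thesis implementation.
\begin{verbatim}
import numpy as np
from collections import deque

class FrameStack:
    """Keep the four most recent preprocessed frames as one observation."""

    def __init__(self, size=4, shape=(84, 84)):
        self.frames = deque([np.zeros(shape, dtype=np.uint8)] * size, maxlen=size)

    def reset(self, first_frame):
        # At a (simulated) episode start the history consists of black frames.
        self.frames = deque([np.zeros_like(first_frame)] * len(self.frames),
                            maxlen=self.frames.maxlen)
        return self.push(first_frame)

    def push(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames)    # shape (4, 84, 84), oldest frame first
\end{verbatim}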
......@@ -30,7 +30,7 @@ Then the clipped value function loss $\mathcal{L}^\text{VFCLIP}$ is defined to b
$\epsilon$ is the same hyperparameter that is used to clip the likelihood ratio in $\mathcal{L}^\text{CLIP}$
(cf.~equation \ref{eqn:lclip} in chapter \ref{sec:03:policy_loss}).
Intuitively, this approach is similar to clipping the probability ratio. To avoid gradient collapses, a trust region
is created with the clipping parameter $\epsilon$. Then an elementwise maximum is taken, so errors from previous
gradient steps can be corrected. A maximum is applied instead of a minimum because the value function loss is minimized.
Further analysis on the ramifications of optimizing a surrogate loss of the value function is available
......
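A sketch of this loss is given below, assuming PyTorch tensors of the new value predictions, the value predictions recorded during the rollout and the return targets; the names are illustrative and the snippet is not the thesis implementation.
\begin{verbatim}
import torch

def clipped_value_loss(values, old_values, returns, epsilon=0.2):
    """Clipped value function loss L^VFCLIP.

    The new prediction may move at most epsilon away from the old one; the
    elementwise maximum keeps the larger error so that mistakes from earlier
    gradient steps can still be corrected."""
    clipped = old_values + torch.clamp(values - old_values, -epsilon, epsilon)
    unclipped_loss = (values - returns) ** 2
    clipped_loss = (clipped - returns) ** 2
    return torch.max(unclipped_loss, clipped_loss).mean()
\end{verbatim}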
......@@ -26,7 +26,7 @@ frame skipping we would obtain 60 observations per second.
Since the ALE returns a reward, we do not have to design a reward function. Instead, all reinforcement
learning algorithms are trained using the same reward function bar post-processing. The reward of an action is the
change in score, making the cumulated rewards of an episode the high score.
The action space of ATARI games is discrete. Every game accepts the actions $\mathcal{A} = \{\emph{\text{noop}},
\emph{\text{fire}}, \emph{\text{left}}, \emph{\text{right}}, \emph{\text{up}}, \emph{\text{down}}\}$. Actions that are
......
Three aspects of the experiments must be discussed. Firstly, the impact of the non-disclosed optimizations shall be
evaluated. Secondly, the prevalence of outliers and the reliability of reward graphs as proposed by \citeA{ppo} and
shown in this thesis are reviewed. Finally, we discuss the robustness of PPO to hyperparameter choices. Remember that
all experiments are available online at \url{https://github.com/Aethiles/ppo-results}.
......@@ -65,7 +65,7 @@ sees diminished results for no apparent reason.
\paragraph{Value function loss clipping.}
Agents trained without value function loss clipping show notably improved performances on Breakout but perform
slightly worse on Pong as shown in the experiment \emph{no\_value\_function\_loss\_clipping}. With only value function
loss clipping enabled, the performance of agents trained on BeamRider is worse than those trained without optimizations
and those trained with the reference configuration (cf.~experiment \emph{only\_value\_function\_loss\_clipping}). On the
......@@ -82,13 +82,13 @@ less pronounced.
Furthermore, optimizations likely interact with each other, which further complicates general
statements based on the experiments. This can be seen with value function loss clipping on Breakout. Both the
experiment with all optimizations but value function loss clipping and the experiment with only value function loss
clipping show improved performances over the reference configuration and the no optimizations experiment, respectively.
Most notably, agents trained on BeamRider benefit from the optimizations when trained with the paper configuration of $K
= 3$ and $c_1 = 1.0$, whereas they achieve subpar results with the reference configuration. This indicates that
hyperparameter choices revolving around the value function may have a larger impact than the optimizations.
Overall, the optimizations have a positive effect on the algorithm, with advantage normalization having the least impact.
Hence, we must criticize that they were not disclosed by \citeA{ppo}. Even though they are not directly related to
reinforcement learning, they should be disclosed to ensure reproducibility and a transparent discussion on the
performance of Proximal Policy Optimization algorithms.
......@@ -98,7 +98,7 @@ performance of Proximal Policy Optimization algorithms.
Figures \ref{fig:graph_seaquest} and \ref{fig:graph_penalty_beamrider} to \ref{fig:graph_paper_beamrider} contain
obvious outliers, some of which perform a lot better than other runs in that game, whereas others perform a lot worse.
Among all experiments conducted for this thesis, about $35\%$ of the graphs generated contain an obvious outlier.
Similar inconsistency can be seen in the plots published by \citeA{ppo} for a variety of games.
With merely three runs per configuration used in evaluation, it is hard to tell if this behavior is representative. For
example, we cannot determine if a single good performance must be attributed to luck or if in fact roughly a third of
......@@ -106,7 +106,7 @@ all trained agents achieve such a performance.
At the time of publication of Proximal Policy Optimization in 2017, OpenAI Gym contained 49 ATARI games. As of writing
this thesis, this number has grown to 57 games. It seems unlikely that each of these games provides a unique challenge.
Therefore, instead of training on all the games available in OpenAI Gym, one could train on a specific selection of
games chosen by experts to be as diverse as possible. If this subset is half or a third the size of the original set of
games, we can easily perform more training runs on each game without requiring more time to conduct our experiments.
......
......@@ -10,11 +10,14 @@ graphs is shown in this thesis. However, all graphs are made available online in
used to generate them.\footnote{The raw data and images are available at \url{https://github.com/Aethiles/ppo-results}.}
When we refer to an experiment, consult the repository unless a figure or table is specified.
Subsequently, we show three different experiments:
\begin{enumerate}
\item By evaluating the reference configuration, we determine that the implementation written for this thesis
achieves the intended performance.
\item \citeA{ppo} claim that PPO is robust to hyperparameter choice. Hence, we show the stability of the algorithm
when a parameter is severely misconfigured.
\item The effect of the non-disclosed optimizations is shown.
\end{enumerate}
We discuss the findings in chapter \ref{sec:05:discussion}.
\subsubsection{Reference Configuration}
......@@ -43,12 +46,12 @@ in table \ref{table:default_score}.
Comparing the shapes of the reward graphs with those given by \citeA{ppo} reveals no discernible differences except for
the magnitude of score obtained in BeamRider. On BeamRider and Seaquest, the algorithm implemented for this thesis
achieves a higher final score than the originally reported results. On Breakout and SpaceInvaders, the obtained score is
slightly lower.
Most likely, these differences can be attributed to differing configurations, as \citeA{ppo} train with a value function
loss coefficient $c_1 = 1.0$ and $K = 3$ epochs instead of $c_1 = 0.5$ and $K = 4$. Training with the original
configuration grants even more favorable results (cf.~experiment \emph{paper\_configuration}), which means that some of the
optimization methods outlined in chapter \ref{sec:04:implementation} likely were not used for the publication.
We can easily see a lot of noise in the plots for BeamRider, Breakout and SpaceInvaders (cf.~figures
\ref{fig:graph_beamrider}, \ref{fig:graph_breakout} and \ref{fig:graph_spaceinvaders}). This is expected, as our agents
......@@ -58,8 +61,7 @@ very apparent outlier in the Seaquest runs. The same can be observed in \citeaut
discuss this phenomenon in chapter \ref{sec:05:discussion}.
As the performance of the implementation in this thesis matches or exceeds publicized results, one can reasonably assume
that the implementation is suitable for the intended evaluation of hyperparameters and the optimizations.
\begin{figure}[H]
\centering
......@@ -94,9 +96,10 @@ hyperparameter values and optimization choices.
\subsubsection{Stability with Suboptimal Hyperparameter}
\label{sec:05:stability}
\citeA{ppo} claim that Proximal Policy Optimization is robust to hyperparameter choices. In this experiment, agents were
trained with an exploration penalty instead of an exploration bonus. Figures
\ref{fig:graph_penalty_beamrider}--\ref{fig:graph_penalty_spaceinvaders} display reward graphs for a training with the
entropy coefficient $c_2 = -0.01$. The corresponding final scores can be seen below in table \ref{table:penalty_score}.
\begin{table}[ht]
\centering
......@@ -124,10 +127,10 @@ In figure \ref{fig:graph_penalty_spaceinvaders} a steep drop in performance is v
approximately $5.6 \cdot 10^7$ time steps. This performance collapse could be related to the policy making too large
gradient steps. However, as policy updates occur within the trust region $\epsilon$, the performance recovers to a
degree. If we choose a significantly larger $\epsilon$, the performance collapse may be unrecoverable, as PPO becomes
more like simple policy gradient methods.
As some agents still learn, the algorithm appears to be robust to the specific values chosen for hyperparameters.
We further discuss this topic in chapter \ref{sec:05:robustness}.
\begin{figure}[H]
\centering
......@@ -183,12 +186,25 @@ final scores are shown in table \ref{table:paper_score}.
\label{table:paper_score}
\end{table}
Most reward graphs display noticeably lower scores than the reward graphs of the reference configuration. This is also
apparent in the final scores---only the agents trained on Pong remain close to the reported performance, but it takes
the agents much longer to achieve strong performances. The results strongly deviate from the results reported by
\citeA{ppo}. Therefore, we can conclude that the optimizations outlined in chapter \ref{sec:04:implementation} have
strong effects on the course of training as well as on the final performance of trained agents.
This result is not restricted to the original paper configuration---with one exception. As can be seen in
experiment \emph{no\_optimizations}, agents perform worse without the optimizations when trained with the reference
configuration as well. The notable---and unexpected---exceptions to this are agents trained on BeamRider, which achieve
noticeably improved performances. This suggests that agents trained on BeamRider are more sensitive to hyperparameter
choice than they are to the optimizations. Further research is required to make conclusive statements.
However, we can confidently state that advantage normalization, gradient clipping, orthogonal initialization, reward
binning and value function loss clipping combined have a pronounced positive effect on the performance of Proximal
Policy Optimization in general. We further discuss this matter and the respective individual impact of the optimizations
in chapter \ref{sec:05:discussion_optims}.
\vspace{0.58cm}
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{05_evaluation/BeamRider_paper.png}
......
......@@ -17,10 +17,10 @@ GeForce GTX 1060 6GB.
The two points of interest in evaluation are how quickly an agent learns and the performance of an agent at the end of
its training. We evaluate how quickly an agent learns by drawing an episode reward graph. Whenever an episode of a game
terminates, the score is plotted. As we run $N$ environments simultaneously, several episodes could terminate at
the same time. If this occurs, we plot the average of all terminated episodes. Figure \ref{fig:reward_graph}
displays an unsmoothed reward graph for the game Pong. The x axis displays the training time step and the y axis shows
the score. As a result, we get a graph that shows the performance of an agent over the course of its training. A
smoothed graph of the same data can be seen in figure \ref{fig:graph_breakout} on page \pageref{fig:graph_breakout}.
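The averaging of simultaneously terminating episodes can be sketched as follows; the helper is an illustration only (names assumed) and not the plotting code used for the figures in this thesis.
\begin{verbatim}
import numpy as np

def reward_graph_points(timesteps, episode_scores):
    """Collapse episodes that terminate at the same training time step into a
    single data point by averaging their final scores."""
    grouped = {}
    for t, score in zip(timesteps, episode_scores):
        grouped.setdefault(t, []).append(score)
    xs = sorted(grouped)
    ys = [float(np.mean(grouped[t])) for t in xs]
    return xs, ys
\end{verbatim}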
\begin{figure}[ht]
......
......@@ -7,7 +7,7 @@ specific policy.
In order to improve the policy, we introduced parameterization and policy gradients, which enable us to optimize a
policy via stochastic gradient descent. As the environments in this thesis are specific ATARI 2600 video games, we lack
full knowledge of the environments' dynamics. Therefore, we must estimate gradients by sampling training data from
interaction with the environments. Simple policy gradient estimators have been shown to be unreliable, resulting in
performance collapses and suboptimal solutions \cite{kakade2002}.
......@@ -22,9 +22,9 @@ popular hyperparameter choices on five ATARI 2600 games: BeamRider, Breakout, Po
results of the experiments show that most of the optimizations are crucial to achieve good results on these ATARI games,
although their specific impact depends on the game an agent is trained on. This finding is shared by \citeA{ilyas2018}
and \citeA{engstrom2019}, who examined Proximal Policy Optimization on robotics environments. Furthermore, we found that
noticeable outliers are present in approximately $35\%$ of the experiments.
Therefore, we may call for two improvements: Firstly, all optimization choices pertaining to the algorithm---even
those not purely related to reinforcement learning---must be published. In order to ensure proper reproducibility and
comparability, methods used for the portrayal of results, such as the smoothing of graphs, must be described as well.
Secondly, instead of testing each configuration thrice on every game available, performing more runs on an expert-picked
......
Based on the work conducted for this thesis, multiple avenues for future work present themselves; in the following, three
possible topics are highlighted. First, we could work on improved evaluation methods that allow for easier comparisons
of algorithms across publications. This includes increasing the number of training runs executed and potentially
selecting a diverse subset of games, so robust trendlines may be created and shared. Furthermore, we may examine
advanced regression methods and evaluate if these methods can be used to create more suitable baselines. Lastly, this
also includes devising a proper means of determining the noise of a reward graph, for example by displaying a standard
deviation or confidence intervals around the trendline.
Second, the code written for this thesis can be adjusted to support other benchmarking environments. Most
importantly, by supporting robotics and control tasks, we could reproduce the findings of \citeA{ilyas2018} and
\citeA{engstrom2019}. Moreover, we could also reproduce the findings of modified Proximal Policy Optimization algorithms
such as Truly Proximal Policy Optimization \cite{wang2019}. A number of improvements to Proximal Policy Optimization
......
......@@ -132,7 +132,7 @@
}
\newglossaryentry{ppo}{
name = {Proximal Policy Optimization},
description = {a class of deep reinforcement learning algorithms. The specific version used in this thesis is called
PPO clip},
}
......@@ -160,8 +160,8 @@
}
\newglossaryentry{surrogate}{
name = {surrogate objective},
description = {an objective that approximates the true objective, e.g., by posing a lower bound on improvement}
}
\newglossaryentry{terminal}{
......@@ -178,5 +178,6 @@
\newglossaryentry{trustregion}{
name = {trust region},
description = {a region around the old policy that should allow for safe policy updates, so the performance of the
new policy does not collapse}
}
......@@ -62,7 +62,7 @@
eq.~\ref{eqn:lvfclip}) \\
\\
$c_1$ & value function loss coefficient & (chapter \ref{sec:03:shared_loss}) \\
$c_2$ & entropy bonus coefficient & (chapter \ref{sec:03:shared_loss}) \\
$S$ & entropy bonus & (chapter \ref{sec:03:shared_loss}) \\
$\mathcal{L}^{\text{CLIP}+\text{VFCLIP}+S}(\boldsymbol\omega)$ & shared parameters PPO loss & (chapter
\ref{sec:03:shared_loss}, eq.~\ref{eqn:lshared}) \\
......