Commit 499ab6e9 authored by Daniel Lukats

feedback from jürgen + kathrin

parent fb01642e
......@@ -16,20 +16,20 @@ learning agents \cite{liu2019}. Yet another concept that may be combined with th
\cite{pathak2017}. Curiosity is a mechanism that incentivizes a methodical search for an optimal solution rather than a
random search.
Despite its widespread use, researchers found that several implementation choices\todo{is this fine or not?} are
undocumented in the original paper \cite{ppo}, but have great effect on the performance of Proximal Policy Optimization.
Consequently, the authors raise doubts on our mathematical understanding of the foundations of the algorithm
\cite{ilyas2018, engstrom2019}.
Despite its widespread use, \citeA{ilyas2018} as well as \citeA{engstrom2019} found that several implementation choices
are undocumented in the original paper, but have a great effect on the performance of Proximal Policy Optimization.
Consequently, the authors raise doubts about our mathematical understanding of the foundations of the algorithm.
The goal of this thesis is twofold. On the one hand, it provides the required fundamentals of reinforcement learning to
understand policy gradient methods, so students may be introduced to reinforcement learning. It then builds upon these
fundamentals to explain Proximal Policy Optimization. In order to gain a thorough understanding of the algorithm, a
dedicated implementation was written \todo{with the OpenAI Gym benchmark and PyTorch deep learning frameworks?} instead
of relying on an open source project.\footnote{The code is available on TODO URL HERE.}
dedicated implementation based on the benchmarking framework \emph{OpenAI Gym} and the deep learning framework
\emph{PyTorch} was written instead of relying on an open source implementation of Proximal Policy
Optimization.\footnote{The code is available on TODO URL HERE.}
On the other hand, this thesis examines the impact not only of the aforementioned implementation choices, but also
of common hyperparameter choices on a selection of five ATARI 2600 games. These video games form a typical benchmarking
environment for evaluating reinforcement learning algorithms. The significance of the implementation choices was already
environment for evaluating reinforcement learning algorithms. The importance of the implementation choices was already
investigated on robotics tasks \cite{ilyas2018, engstrom2019}, but the authors forwent an examination of ATARI games.
Therefore, one may raise the question, if these choices have the same significance for ATARI 2600 games as they do for
Therefore, one may raise the question of whether these choices have the same importance for ATARI 2600 games as they do for
robotics tasks.
\section{Fundamentals of Reinforcement Learning}
\section{Reinforcement Learning}
\label{sec:02:basics}
\input{02_rl_theory/introduction}
......
......@@ -32,11 +32,11 @@ contains actions \emph{left} and \emph{right} as well as rewards $0$ and $1$. Th
transition probabilities and can be found on the edges connecting the states. Further explanations and an example using
elements of this Markov decision process are given in chapter \ref{sec:02:distributions}.
Markov decision processes share most properties with Markov chains\todo{some source here}, the major difference being
the transitions between states. Whereas in Markov chains the transition depends on the current state only, in Markov
decision processes the transition function $p$ takes actions into account as well. Thus, the transition can be affected,
although it remains stochastic in nature. We assume that the Markov property holds true for Markov decision processes,
which means that these transitions depend on the current state and action only \cite[p.~49]{sutton18}.
Markov decision processes extend Markov chains by introducing actions and rewards. Whereas in Markov chains a transition
to a successor state depends on the current state only, in Markov decision processes the transition function $p$ takes
actions into account as well. Thus, the transition can be influenced by the choice of action, although it remains stochastic in nature. We assume
that the Markov property holds true for Markov decision processes, which means that these transitions depend on the
current state and action only \cite[p.~49]{sutton18}.
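To make the role of the transition function concrete, the following minimal Python sketch models a generic two-state Markov
decision process as a lookup table. All states, actions, probabilities and rewards are placeholder values chosen for
illustration and do not correspond to the Markov decision process depicted above.
\begin{verbatim}
import random

# Hypothetical transition function p(s', r | s, a) as a lookup table.
# Keys are (state, action) pairs; values are lists of
# (probability, successor state, reward) triples.
p = {
    ("s0", "left"):  [(0.8, "s0", 0), (0.2, "s1", 1)],
    ("s0", "right"): [(0.5, "s0", 0), (0.5, "s1", 1)],
    ("s1", "left"):  [(1.0, "s0", 0)],
    ("s1", "right"): [(0.3, "s0", 0), (0.7, "s1", 1)],
}

def step(state, action):
    """Sample a successor state and a reward. The outcome depends on the
    current state and the chosen action only (Markov property)."""
    outcomes = p[(state, action)]
    weights = [probability for probability, _, _ in outcomes]
    _, successor, reward = random.choices(outcomes, weights=weights)[0]
    return successor, reward

successor, reward = step("s0", "right")
\end{verbatim}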
\subsubsection{States, Actions and Rewards}
\label{sec:02:mdp_vars}
......
......@@ -23,10 +23,10 @@ optimize the parameterization $\boldsymbol\theta$ so that the likelihood of choo
increased.
Let $S_t = s$ and $\mathcal{A}(s) = \{1, 2\}$. A function $h$ parameterized with $\boldsymbol\theta_t$ could return a
vector with two elements, for example $h(s, \boldsymbol\theta_t) = \begin{pmatrix}4\\2\end{pmatrix}$.\footnote{We write
$\left(h(s, \boldsymbol\theta_t)\right)_a$ to obtain the $a$th component of $h(s, \boldsymbol\theta_t)$, e.g.,
$\begin{pmatrix}4\\2\end{pmatrix}_1 = 4$.} The elements of this vector are numerical weights a policy could use to
determine probabilities:
vector with two elements, for example $h(s, \boldsymbol\theta_t) = \begin{pmatrix}4\\2\end{pmatrix}$.
The elements of this vector are numerical weights a policy could use to determine probabilities:\footnote{We write
$\left(h(s, \boldsymbol\theta_t)\right)_n$ to obtain the $n$-th component of $h(s, \boldsymbol\theta_t)$, e.g.,
$\begin{pmatrix}4\\2\end{pmatrix}_1 = 4$.}
\begin{align*}
\pi(a\mid s, \boldsymbol\theta_t) &= \left(h(s, \boldsymbol\theta_t)\right)_a \cdot
\left(\sum_{b\in\mathcal{A}(s)}\left(h(s, \boldsymbol\theta_t)\right)_b\right)^{-1}\text{, e.g.,} \\
......
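To make this normalization concrete, the following minimal sketch evaluates it for the weight vector of the example above;
the variable names are chosen for illustration only.
\begin{verbatim}
# Weights h(s, theta_t) taken from the example above.
h = [4.0, 2.0]

# Divide each weight by the sum of all weights so that the
# resulting values form a probability distribution over actions.
total = sum(h)
pi = [weight / total for weight in h]  # [0.666..., 0.333...]
\end{verbatim}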
Both the loss described in equation \ref{eqn:naive_loss} and the loss proposed by \citeA{ppo} depend on advantage
estimations. However, to estimate advantages, value estimations are required. By definition (cf.~equation
\ref{eqn:value_function}), the value function can be estimated using the return.
\ref{eqn:value_function}), the value function can be estimated using the return. Figure \ref{fig:adv_dependency}
highlights these relations.
\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{03_ppo/dependency.pdf}
\caption{The advantage estimations required to calculate a loss or gradient estimator depend on the value function.
The value function in turn requires returns. Finally, these estimators and functions build upon rewards $R_{t+1}$.}
\label{fig:adv_dependency}
\end{figure}
Both the advantage estimator (cf.~equation \ref{eqn:delta2}) and the return (cf.~equation \ref{eqn:return}) are
suboptimal as they suffer from being biased and having high variance respectively. We further explain these issues and
suboptimal, as they suffer from being biased and having high variance, respectively. In this chapter, we further explain
these issues and introduce advanced methods that alleviate them.
\todo[inline]{insert overview of Loss using advantage using value using return using rewards}
\subsubsection{Generalized Advantage Estimation}
\label{sec:03:gae_gae}
......@@ -56,10 +63,11 @@ we do not require this special case.} Then \emph{Generalized Advantage Estimatio
\label{eqn:gae_start}
\delta_t^{\text{GAE}(\gamma, \lambda)} &\doteq (1 - \lambda) \left(\delta_t^{(1)} + \lambda\delta_t^{(2)} +
\lambda^2\delta_t^{(3)} + \dots \right) \\
&= (1 - \lambda) (\delta_t + \lambda(\delta_t + \gamma\delta_{t+1}) + \lambda^2(\delta_t + \gamma\delta_{t+1} +
\gamma^2\delta_{t+2}) + \dots ) \nonumber \\
&= (1 - \lambda) (\delta_t(1 + \lambda + \lambda^2 + \dots) + \gamma\delta_{t+1}(\lambda + \lambda^2 + \lambda^3 +
\dots) \nonumber \\&\hspace{48pt}+ \gamma^2\delta_{t+2}(\lambda^2 + \lambda^3 + \lambda^4 + \dots) + \dots) \nonumber \\
&= (1 - \lambda) \left(\delta_t + \lambda(\delta_t + \gamma\delta_{t+1}) + \lambda^2(\delta_t + \gamma\delta_{t+1} +
\gamma^2\delta_{t+2}) + \dots \right) \nonumber \\
&= (1 - \lambda) \left(\delta_t(1 + \lambda + \lambda^2 + \dots) + \gamma\delta_{t+1}(\lambda + \lambda^2 + \lambda^3 +
\dots) \nonumber \right. \\&\hspace{48pt}\left. + \gamma^2\delta_{t+2}(\lambda^2 + \lambda^3 + \lambda^4 + \dots) + \dots
\right) \nonumber \\
&= (1 - \lambda) \left(\delta_t\left(\frac{1}{1-\lambda}\right) + \gamma\delta_{t+1}\left(\frac{\lambda}{1-\lambda}
\right) + \gamma^2\delta_{t+2}\left(\frac{\lambda^2}{1-\lambda}\right) + \dots \right) \nonumber \\
\label{eqn:gae}
......
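As a rough illustration of equation \ref{eqn:gae}, the weighted sum of temporal difference residuals can be computed with a
backward recursion over a finite rollout. The following NumPy sketch uses illustrative function and variable names,
truncates the sum at the end of the rollout and ignores episode boundaries; it is not the implementation used in this
thesis.
\begin{verbatim}
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute Generalized Advantage Estimations for one rollout.

    rewards:    rewards R_{t+1} for t = 0, ..., T - 1
    values:     value estimates v(S_t) for t = 0, ..., T - 1
    last_value: value estimate of the state following the rollout
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    # Temporal difference residuals delta_t.
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(len(rewards))
    running = 0.0
    # Backward recursion yields sum_l (gamma * lambda)^l delta_{t+l}.
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
\end{verbatim}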
......@@ -6,7 +6,7 @@
\label{sec:03:ppo_motivation}
\input{03_ppo/motivation}
\subsection{Modern Advantage and Return Estimation}
\subsection{Advantage and Value Estimation}
\label{sec:03:gae}
\input{03_ppo/gae}
......
......@@ -75,7 +75,7 @@ advantage estimations \cite{ppo}:
Despite the use of importance sampling, this loss can be unreliable. For actions with a very large likelihood ratio, for
example, $\rho_t(\boldsymbol\theta) = 100$, gradient steps become excessively large, possibly leading to performance
collapes \cite{kakade2002}.\todo{highlight this a bit more?}
collapses \cite{kakade2002}.
Hence, Proximal Policy Optimization algorithms intend to keep the likelihood ratio close to one \cite{ppo}. There are
two variants of PPO: one relies on clipping the ratio, whereas the other is penalized using the Kullback-Leibler
......@@ -140,6 +140,7 @@ This issue is solved by taking an elementwise minimum:
\right)
\right].
\end{align}
In practice, the advantage function $\hat{a}_{\pi_{\boldsymbol\theta_\text{old}}}$ is replaced with Generalized
Advantage Estimations $\delta_t^{\text{GAE}(\gamma,\lambda)}$ (cf.~equation \ref{eqn:gae}). Often,
$\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$ is called a surrogate objective, as it approximates the original objective
......
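As an illustration of $\mathcal{L}^\text{CLIP}$, the following PyTorch sketch computes the clipped surrogate objective for
a minibatch. The tensor names \texttt{logp}, \texttt{logp\_old} and \texttt{adv} as well as the default value of $\epsilon$
are assumptions made for this example, and the result is negated because optimizers minimize a loss.
\begin{verbatim}
import torch

def clip_loss(logp, logp_old, adv, epsilon=0.2):
    """Clipped surrogate objective, negated so it can be minimized.

    logp:     log pi_theta(a | s) of the minibatch actions
    logp_old: log pi_theta_old(a | s) stored during the rollout
    adv:      advantage estimations, e.g., GAE
    """
    ratio = torch.exp(logp - logp_old)  # likelihood ratio rho_t
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Elementwise minimum of the unclipped and the clipped term.
    return -torch.min(ratio * adv, clipped * adv).mean()
\end{verbatim}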
Albeit it is not mentioned by \citeA{ppo}, advantages are normalized before loss computation. Normalizing advantages is
a well-known operation that lowers the variance of the gradient estimator.
Although it is not mentioned by \citeA{ppo}, advantage estimations $\delta$ are normalized before loss computation. Normalizing
advantages is a well-known operation that lowers the variance of the gradient estimator.
\begin{align}
\label{eqn:adv_normalization}
\delta \gets \frac{\delta - \text{mean}(\delta)}{\text{std}(\delta) + 10^{-5}}
\end{align}
\begin{algorithm}
\caption{Advantage normalization}
\label{alg:adv_normalization}
\begin{algorithmic}
\Require advantage estimations $\delta$
\State $\delta \gets \frac{\delta - \text{mean(}\delta\text{)}}{\text{std(}\delta\text{)} + 10^{-5}}$
\end{algorithmic}
\end{algorithm}
The normalization shown in algorithm \ref{alg:adv_normalization} ensures that the advantages have a mean of $0$ and a
standard deviation of $1$. We add $10^{-5}$ to the divisor for numerical stability.
Due to the update\footnote{As is common in computer science, we use $\gets$ to denote an assignment. Hence, the advantage
estimations $\delta$ are updated with their normalized values.} seen in equation \ref{eqn:adv_normalization}, advantages
are normalized so that they have a mean of $0$ and a standard deviation of $1$. We add $10^{-5}$ to the divisor for
numerical stability.
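In code, the update in equation \ref{eqn:adv_normalization} amounts to a single line; the sketch below assumes that the
advantage estimations are stored in a PyTorch tensor.
\begin{verbatim}
import torch

delta = torch.randn(2048)  # placeholder advantage estimations
delta = (delta - delta.mean()) / (delta.std() + 1e-5)
\end{verbatim}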
......@@ -6,37 +6,37 @@ $\mathcal{L}^\text{VF}$ with $\mathcal{L}^\text{VFCLIP}$ and the use of gradient
\begin{algorithm}[ht]
\caption{Full Proximal Policy Optimization for ATARI 2600 games, modified from \protect\citeauthor{ppo}'s
\protect\citeyear{ppo} and \protect\citeauthor{peng2018}'s \protect\citeyear{peng2018} publications. Lines commented
with a $*$ are changed from algorithm \ref{alg:ppo}.}
with a \Comment{} are changed from algorithm \ref{alg:ppo}.}
\label{alg:ppo_full}
\begin{algorithmic}[1]
\Require number of iterations $I$, rollout horizon $T$, number of actors $N$, number of epochs $K$, minibatch size $M$, discount $\gamma$, GAE weight
$\lambda$, learn rate $\alpha$, clipping parameter $\epsilon$, coefficients $c_1, c_2$, postprocessing $\phi$
\State $\boldsymbol\theta \gets $ orthogonal initialization \Comment{$*$}
\State $\boldsymbol\theta \gets $ orthogonal initialization \Comment{}
\State Initialize environments $E$
\State Number of minibatches $B \gets N \cdot T\; / \;M$
\For{iteration=$1, 2, \dots, I$}
\State $\tau \gets $ empty rollout
\For{actor=$1, 2, \dots, N$}
\State $s \gets $ $\phi(\text{current state of environment }E_\text{actor})$ \Comment{$*$}
\State $s \gets $ $\phi_s(\text{current state of environment }E_\text{actor})$ \Comment{}
\State Append $s$ to $\tau$
\For{step=$1, 2, \dots, T$}
\State $a \sim \pi_{\boldsymbol\theta}(a\mid s)$
\State $\pi_{\boldsymbol\theta_\text{old}}(a\mid s) \gets \pi_{\boldsymbol\theta}(a\mid s)$
\State Execute action $a$ in environment $E_\text{actor}$
\State $s \gets $ $\phi_s(\text{successor state})$ \Comment{$*$}
\State $r \gets $ $\phi_r(\text{reward})$ \Comment{$*$}
\State $s \gets $ $\phi_s(\text{successor state})$ \Comment{}
\State $r \gets $ $\phi_r(\text{reward})$ \Comment{}
\State Append $a, s, r, \pi_{\boldsymbol\theta_\text{old}}(a\mid s),
\hat{v}_\pi (s, \boldsymbol\theta_\text{old})$ to $\tau$
\EndFor
\EndFor
\State Compute Generalized Advantage Estimations $\delta_t^{\text{GAE}(\gamma, \lambda)}$
\State Normalize advantages \Comment{$*$}
\State Normalize advantages \Comment{}
\State Compute $\lambda$-returns $G_t^\lambda$
\For{epoch=$1, 2, \dots, K$}
\For{minibatch=$1, 2, \dots, B$}
\State $\boldsymbol\theta \gets \boldsymbol\theta + \text{clip}(\alpha \cdot
\nabla_{\boldsymbol\theta}\;\mathcal{L}^{\text{CLIP}+\text{VFCLIP}+S}(\boldsymbol\theta))$ on
minibatch with size $M$ \Comment{$*$}
minibatch with size $M$ \Comment{}
\EndFor
\EndFor
\State Anneal $\alpha$ and $\epsilon$ linearly
......
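The annealing step in the last line of the algorithm can be sketched as follows; the hyperparameter values are
placeholders, and the schedule simply interpolates linearly from the initial value to zero over the $I$ iterations.
\begin{verbatim}
def anneal(initial_value, iteration, num_iterations):
    """Linearly anneal a hyperparameter towards 0 over the training."""
    return initial_value * (1.0 - iteration / num_iterations)

# Placeholder values for illustration only.
alpha = anneal(2.5e-4, iteration=10, num_iterations=100)
epsilon = anneal(0.1, iteration=10, num_iterations=100)
\end{verbatim}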
\section{From Theory to Practice}
\section{Realization}
\label{sec:04:implementation}
\input{04_implementation/introduction}
......
\citeA{ilyas2018} found that the weights of the neural network used by \citeA[baselines/ppo2]{baselines} are initialized using
orthogonal initialization.\footnote{Using orthogonal initialization with large neural networks was proposed by
\citeA{saxe2014} The authors also provide mathematical foundations and examine the benefits of various initialization
\citeA{saxe2014}. The authors also provide mathematical foundations and examine the benefits of various initialization
schemes.} The impact of this choice appears to be the subject of empirical examination only \cite{ilyas2018, engstrom2019}.
Table \ref{table:orthogonal_scaling} lists each layer of the neural network and the corresponding scaling factor that is
used to initialize the layer.
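A minimal PyTorch sketch of such an initialization is shown below; the layer size and the scaling factor are placeholders,
as the actual values are listed in table \ref{table:orthogonal_scaling}.
\begin{verbatim}
import torch
from torch import nn

layer = nn.Linear(512, 4)  # placeholder layer of the network
# Orthogonal initialization with a layer-specific scaling factor.
nn.init.orthogonal_(layer.weight, gain=0.01)
nn.init.constant_(layer.bias, 0.0)
\end{verbatim}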
......
......@@ -33,5 +33,5 @@ $\epsilon$ is the same hyperparameter that is used to clip the likelihood ratio
Intuitively, this approach may be similar to clipping the probability ratio. To avoid gradient collapses, a trust region
is created with the clipping parameter $\epsilon$. Then an elementwise maximum is taken, so errors from previous
gradient steps can be corrected. A maximum is applied instead of a minimum because the value function loss is minimized.
Further analysis on the ramifications of optimizing a surrogate loss for the value functions is available
\citeA{ilyas2020}.
Further analysis of the ramifications of optimizing a surrogate loss of the value function is provided by
\citeA{ilyas2020}.
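The following PyTorch sketch illustrates this construction; the tensor names \texttt{values}, \texttt{values\_old} and
\texttt{returns} as well as the default value of $\epsilon$ are assumptions made for this example.
\begin{verbatim}
import torch

def clipped_value_loss(values, values_old, returns, epsilon=0.1):
    """Clipped value function loss with an elementwise maximum.

    values:     current value estimates
    values_old: value estimates stored during the rollout
    returns:    lambda-returns G_t
    """
    # Keep the new value estimates within a trust region of the old ones.
    clipped = values_old + torch.clamp(values - values_old,
                                       -epsilon, epsilon)
    # Elementwise maximum, because this loss is minimized.
    return torch.max((values - returns) ** 2,
                     (clipped - returns) ** 2).mean()
\end{verbatim}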
......@@ -21,7 +21,7 @@ accurately. On BeamRider, agents achieve close to human performance.
OpenAI Gym and ALE simplify evaluation greatly, as they provide a unified interface for all environments. We simply
provide an action to the environment and obtain observations containing the current state of the environment as a $210
\times 160$ RGB image (before post-processing), the reward and further information such as the number of lives remaining
or if the game terminated. Just like the ATARI 2600 console, the Arcade Learning Environment runs at 60 Hz---without
or if the game terminated. The Arcade Learning Environment runs at 60 frames per second in real time---without
frame skipping we would obtain 60 observations per second.
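The interaction with an environment can be sketched as follows. The environment id is only an example, and the exact
signatures of \texttt{reset} and \texttt{step} depend on the installed Gym version; the snippet assumes the classic
four-value return of \texttt{env.step}.
\begin{verbatim}
import gym

env = gym.make("PongNoFrameskip-v4")  # example environment id
observation = env.reset()             # 210 x 160 RGB image
for _ in range(100):
    action = env.action_space.sample()  # random policy as a stand-in
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()
\end{verbatim}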
Since the ALE returns a reward, we do not have to design a reward function. Instead, all reinforcement
......
......@@ -6,7 +6,7 @@ shown in this thesis are reviewed. Finally, we discuss the robustness of PPO to
\label{sec:05:discussion_optims}
Since the performance of agents trained with Proximal Policy Optimization as described in the publication is
significantly worse than the performance achieved with the reference configuration, it begs the question if all
notably worse than the performance achieved with the reference configuration, the question arises whether all
optimizations are required and what their individual impact is. In order to answer this question, agents were
trained twice for each optimization: once with only the specific optimization enabled and once with all other
optimizations enabled.
......@@ -20,7 +20,7 @@ as is the performance on Seaquest.
\paragraph{Gradient clipping.}
Gradient clipping is disabled in experiment \emph{no\_gradient\_clipping}. We note that the performance on Breakout is
significantly improved with a final score of roughly $400$ achieved by all three agents without gradient clipping.
markedly improved, with a final score of roughly $400$ achieved by all three agents without gradient clipping.
However, some reward graphs on other games appear less stable, notably one training run each on BeamRider, Pong and
Seaquest.
......@@ -44,7 +44,7 @@ only game that is not affected by this change is Pong, as the set of rewards in
Furthermore, in Pong no player will score multiple times within $k = 4$ frames (cf.~chapter \ref{sec:04:postprocessing}).
On the remaining games, rewards can reach much larger magnitudes, so reward binning or clipping has a notable effect on
the rewards. Experiment \emph{reward\_clipping} shows a significant drop in performance on all games but Pong.
the rewards. Experiment \emph{reward\_clipping} shows a distinct drop in performance on all games but Pong.
This is echoed in the reward graphs and final scores, with all agents performing much worse once rewards are no longer
subject to reward binning. As a consequence, rewards should be binned and not clipped.
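To make the distinction explicit, the following sketch contrasts the two operations, assuming, as is common for ATARI
preprocessing, that binning replaces each reward with its sign whereas clipping truncates it to the interval $[-1, 1]$;
the function names are chosen for illustration only.
\begin{verbatim}
import numpy as np

def bin_reward(reward):
    """Map a reward to -1, 0 or 1 depending on its sign."""
    return float(np.sign(reward))

def clip_reward(reward):
    """Truncate a reward to the interval [-1, 1]."""
    return float(np.clip(reward, -1.0, 1.0))

bin_reward(0.5)   # 1.0 -- small rewards are amplified to full magnitude
clip_reward(0.5)  # 0.5 -- small rewards keep their magnitude
\end{verbatim}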
......@@ -60,7 +60,7 @@ normalization, which appears to be detrimental to agents learning BeamRider, alt
on Breakout, Pong, Seaquest and SpaceInvaders. Moreover, we observe that all optimizations contribute to the success of
Proximal Policy Optimization. Only advantage normalization affected the performance of agents to a lesser degree.
However, we cannot deduce that advantage normalization is not important for PPO, as the optimization may be of larger
significance on other tasks.
importance on other tasks.
Since these optimizations are crucial to achieve competitive results on ATARI games, we must criticize that they were
not disclosed by \citeA{ppo}. Even though they are not directly related to reinforcement learning, they should be
......@@ -106,7 +106,7 @@ That does not mean that the reference configuration is the optimal configuration
(cf.~experiment \emph{no\_epsilon\_annealing}).
\item When the minibatch size is decreased from $256$ to $64$, agents learn Pong considerably faster (cf.~experiment
\emph{16\_minibatches}).
\item Both the reward graph and the final score of SeaQuest are significantly improved when the learn rate $\alpha$
\item Both the reward graph and the final score of Seaquest are notably improved when the learn rate $\alpha$
is not annealed over the course of the training (cf.~experiment \emph{no\_alpha\_annealing}).
\end{itemize}
However, performing such changes often leads to worse performance on at least one other game. Hence, it is hard to
......
\todo[inline]{put a list of all experiments with short note on results in the appendix}
A total of $42$ experiments were conducted to evaluate the optimizations listed in chapter \ref{sec:04:implementation} and the
values of hyperparameters. For each experiment, two graphs are generated: one including only the run itself and one
adding a trendline of the reference configuration of Proximal Policy Optimization for ATARI. The reference configuration
......@@ -184,11 +182,11 @@ final scores are shown in table \ref{table:paper_score}.
\label{table:paper_score}
\end{table}
Most reward graphs are significantly below the reward graphs of the reference configuration which is echoed in the final
scores---only the agents trained on Pong remain close to the reported performance, but it takes the agents much longer
to achieve strong performances. The results strongly deviate from the results reported by \citeA{ppo}. Therefore, we can
conclude that the optimizations outlined in chapter \ref{sec:04:implementation} have strong effects on the course of
training as well as on the final performance of trained agents.
Most reward graphs display noticeably lower scores than the reward graphs of the reference configuration. This is also
apparent in the final scores---only the agents trained on Pong remain close to the reported performance, but it takes
the agents much longer to achieve strong performances. The results strongly deviate from those reported by
\citeA{ppo}. Therefore, we can conclude that the optimizations outlined in chapter \ref{sec:04:implementation} have
strong effects on the course of training as well as on the final performance of trained agents.
\begin{figure}[H]
\centering
......
......@@ -20,21 +20,23 @@ its training. We evaluate how quickly an agent learns by drawing an episode rewa
terminates, the high-score is plotted. As we run $N$ environments simultaneously, several episodes could terminate at
the same time. If this occurs, we plot the average of all terminated episodes. Figure \ref{fig:reward_graph}
displays an unsmoothed reward graph for the game Pong. The x axis displays the training time step and the y axis shows
the high-score. As a result, we get a graph that shows the performance of an agent over the course of its training.
the high-score. As a result, we get a graph that shows the performance of an agent over the course of its training. A
smoothed graph of the same data can be seen in figure \ref{fig:graph_breakout} on page \pageref{fig:graph_breakout}.
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{05_evaluation/Pong.png}
\caption{TODO---this actually isnt even a non-smoothed graph atm. just imagine it was noisier. to be fixed with
colors and font size later}
\includegraphics[width=\textwidth]{05_evaluation/unsmoothed.png}
\caption{Unsmoothed episode reward graphs like this one are very noisy and therefore hard to examine. We address
this issue by smoothing the graphs. Smoothed episode reward graphs may be seen in chapter \ref{sec:05:experiments}.}
\label{fig:reward_graph}
\end{figure}
Since agents encounter novel states regularly in the beginning of the training, the performance of episodes can vary
greatly. This results in a noisy episode reward graph. Depending on the complexity of the game, the graph may remain
noisy even until the end of the training, e.g., when learning Breakout. We alleviate this issue by smoothing the reward
graph---outliers are featured less prominently and the noise is reduced, whilst the overall trend of the training is
preserved.
noisy even until the end of the training, as seen in figure \ref{fig:reward_graph}. We alleviate this issue by smoothing
the reward graph---outliers are featured less prominently and the noise is reduced, whilst the overall trend of the
training is preserved. However, applying a suitable method to quantify the noise, such as confidence intervals, remains a
topic for future research.
The reward graphs are smoothed by computing the average of a sliding window with $16$ data points. This window is then
centered, so each data point is the average of the $8$ previous episodes and the $7$ following ones. We note that this
......
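A minimal sketch of this smoothing is given below, assuming the episode scores are collected in an array; the handling of
the edges of the graph is simplified compared to the actual plotting code.
\begin{verbatim}
import numpy as np

def smooth(scores, window=16):
    """Smooth episode scores with a centered sliding window average.

    Each smoothed value averages the 8 preceding data points, the data
    point itself and the 7 following ones. At the edges the window
    shrinks, which is a simplification for this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    smoothed = np.empty(len(scores))
    for t in range(len(scores)):
        lo = max(0, t - window // 2)
        hi = min(len(scores), t + window // 2)
        smoothed[t] = scores[lo:hi].mean()
    return smoothed
\end{verbatim}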
......@@ -21,7 +21,7 @@ However, multiple non-disclosed optimizations to the algorithm can be found in t
popular hyperparameter choices on five ATARI 2600 games: BeamRider, Breakout, Pong, Seaquest and SpaceInvaders. The
results of the experiments show that most of the optimizations are crucial to achieve good results on these ATARI games.
This finding is shared by \citeA{ilyas2018} and \citeA{engstrom2019}, who examined Proximal Policy Optimization on
robotics environments. Furthermore, we found that significant outliers are present in approximately $35\%$ of the
robotics environments. Furthermore, we found that noticeable outliers are present in approximately $35\%$ of the
experiments.
As a consequence, we may call for two improvements: Firstly, all optimization choices pertaining to the algorithm---even
......