Commit 367ff482 authored by Daniel Lukats

final pass

parent e323d186
......@@ -31,7 +31,6 @@
}
\vspace{1.6cm}
{\large \today} \\
\vspace{1.6cm}
\begin{table}[H]
......@@ -39,7 +38,7 @@
\begin{tabular}{r l}
\large{Advisor} & \large{Prof. Dr.-Ing. Jürgen te Vrugt} \\
\large{Co-Advisor} & \large{Prof. Dr. Kathrin Ungru} \\
\large{Matriculation number} & \large{TODO}
\end{tabular}
\end{table}
\vspace*{3cm}
......
......@@ -17,17 +17,17 @@ learning agents \cite{liu2019}. Yet another concept that may be combined with th
random search.
Despite its widespread use, \citeA{ilyas2018} as well as \citeA{engstrom2019} found that several implementation choices
are undocumented in the original paper but have great effect on the performance of Proximal Policy Optimization.
Consequently, the authors raise doubts on our mathematical understanding of the foundations of the algorithm.
The goal of this thesis is twofold. On the one hand, it provides the required fundamentals of reinforcement learning to
understand policy gradient methods, so interested parties may be introduced to reinforcement learning. It then builds
upon these fundamentals to explain Proximal Policy Optimization. In order to gain a thorough understanding of the
algorithm, a dedicated implementation based on the benchmarking framework \emph{OpenAI Gym} and the deep learning
framework \emph{PyTorch} was written instead of relying on an open source implementation of Proximal Policy
Optimization.\footnote{The code is available at \url{https://github.com/Aethiles/ppo-pytorch}.}
On the other hand, this thesis examines the impact not only of the aforementioned implementation choices but also of
common hyperparameter choices on a selection of five ATARI 2600 games. These video games form a typical benchmarking
environment for evaluating reinforcement learning algorithms. The importance of the implementation choices was already
researched on robotics tasks \cite{ilyas2018, engstrom2019}, but the authors forwent an examination on ATARI games.
......
The fundamentals of reinforcement learning are given in chapter \ref{sec:02:basics}. These contain core terms
and definitions required to discuss and construct reinforcement learning algorithms.
Issues with the naive learning approach outlined in chapter \ref{sec:02:basics} are pointed out in chapter~\ref{sec:03:ppo}. This leads to the introduction of advanced estimation methods, which are used in \emph{Proximal Policy
Optimization}. With these estimation methods PPO is defined and ramifications of specific operations are explained.
Chapter \ref{sec:03:ppo} closes with an outline of the complete reinforcement learning algorithm.
......
......@@ -45,7 +45,8 @@ $v_\pi$:
\hat{a}_\pi(s, a, \boldsymbol\omega) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left( r + \gamma \hat{v}_\pi(s',
\boldsymbol\omega) \right) \right] - \hat{v}_\pi(s, \boldsymbol\omega) \\
\label{eqn:a_hat2}
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) \mid S_t = s, A_t = a \right] -
\hat{v}_\pi(s,
\boldsymbol\omega).
\end{align}
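For intuition, the expectation in equation \ref{eqn:a_hat2} can be replaced by a single observed transition, which gives a sample-based estimate of the advantage. A minimal sketch in PyTorch, assuming a hypothetical \texttt{value\_net} that maps a state tensor to the scalar estimate $\hat{v}_\pi(s, \boldsymbol\omega)$:

\begin{verbatim}
import torch

def advantage_sample(reward, state, next_state, value_net, gamma=0.99, done=False):
    # Single-transition estimate of the advantage:
    # delta = R_{t+1} + gamma * v_hat(S_{t+1}, w) - v_hat(S_t, w).
    # value_net is a hypothetical value function approximator; gamma is the discount.
    with torch.no_grad():
        v_s = value_net(state)
        v_next = torch.zeros_like(v_s) if done else value_net(next_state)
    return reward + gamma * v_next - v_s
\end{verbatim}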
......
......@@ -19,7 +19,7 @@ interact once---the agent chooses an action and observes the environment.
\begin{figure}[h]
\centering
\hspace*{1.3cm}
\includegraphics[width=0.6\textwidth]{02_rl_theory/simple_mdp.pdf}
\caption{A simple Markov decision process with three states $s_0, s_1, s_2$, actions \emph{left} and \emph{right} as
well as rewards $0$ and $1$.}
......@@ -65,7 +65,7 @@ we use a sequence of 4 consecutive grayscale images resized to $84 \times 84$ pi
\paragraph{Actions.}
At each time step $t$ our agent takes an \emph{action} depending on the state $S_t$ of the environment, which causes a
transition to one of several successor states. Let $A_t \in \mathcal{A}(s),\;S_t = s$, denote the action at time step $t$
and let $\mathcal{A}(s)$ denote the set of actions available in the state $s$ \cite[p.~48]{sutton18}.
The actions available to an agent depend on the environment and the state. An agent learning SpaceInvaders effectively
......@@ -74,16 +74,16 @@ all $s \in \mathcal{S}$. \emph{noop} means \emph{no operation}, which is the act
step. If the actions do not differ between states, we write $\mathcal{A}$ to denote the set of actions for convenience.
\paragraph{Rewards.}
Every time an agent completes an action, it observes the environment. These \emph{observations} consist of two
components. The first component of an observation is the \emph{reward}. Let $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$
denote the reward observed at time step $t + 1$ and let $\mathcal{R}$ denote the set of rewards \cite[p.~48]{sutton18}.
Rewards are our primary means of communicating with an agent. We use them to inform an agent whether it
achieved its goal or not. As stated in chapter \ref{sec:02:agent_environment}, the agent may receive positive rewards
only for achieving its goal, which raises the issue of delayed rewards. We will remedy this issue in chapter
\ref{sec:02:value_function} by introducing the concept of a \emph{return}.
The second component of an observation is the successor state $S_{t+1}$. Note that we define the reward to be $R_{t+1}$,
as it is the reward that is observed with the state $S_{t+1}$. Consequently, there is no reward $R_0$.
In ATARI 2600 games, the reward we provide to our agent is the change in score immediately following an action, e.g., if
......@@ -98,7 +98,7 @@ the agent scores a point in Pong, we provide a reward $R_{t+1} = 1$ in the next
\label{fig:episode}
\end{figure}
Figure \ref{fig:episode} shows the interaction of agent and environment in greater detail: Following each action $A_t$, the
agent observes the change in environment represented by the reward $R_{t+1}$ and the successor state $S_{t+1}$. Then,
the agent acts again and obtains another observation.
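This loop translates almost directly into code. The following sketch uses a stand-in environment with a Gym-like interface and a random placeholder policy; all names are illustrative and not part of the implementation discussed later.

\begin{verbatim}
import random

class ToyEnv:
    # A stand-in environment with a Gym-like interface (illustrative only, not the ALE).
    def __init__(self, horizon=10):
        self.horizon, self.t = horizon, 0
    def reset(self):
        self.t = 0
        return 0                                 # initial state S_0
    def step(self, action):
        self.t += 1
        reward = random.choice([0, 1])           # R_{t+1}
        done = self.t >= self.horizon            # finite horizon T reached
        return self.t, reward, done, {}          # S_{t+1}, R_{t+1}, done flag, info

env = ToyEnv()
state, done, episode_return = env.reset(), False, 0
while not done:
    action = random.choice(["left", "right"])    # placeholder for A_t ~ pi(. | S_t)
    state, reward, done, _ = env.step(action)    # observe R_{t+1} and S_{t+1}
    episode_return += reward                     # undiscounted return of the episode
\end{verbatim}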
......@@ -133,7 +133,7 @@ action $A_t = \text{\emph{right}}$ is $p(s_2, 1 \mid s_1, \text{\emph{right}}) =
\begin{figure}[h]
\centering
\hspace*{1.3cm}
\includegraphics[width=0.6\textwidth]{02_rl_theory/simple_mdp.pdf}
\caption{This Markov decision process consists of three states $\mathcal{S} = \{s_0, s_1, s_2\}$, offers two actions $\mathcal{A} =
\{\text{\emph{left}}, \text{\emph{right}}\}$ and can yield two rewards $\mathcal{R} = \{0, 1\}$. Each edge
......@@ -147,9 +147,10 @@ An agent's actions are determined by the \emph{policy} function, which \citeA[p
\begin{align}
\label{eqn:policy}
&\pi(a \mid s) \doteq P(A_t = a \mid S_t = s),\\
&\pi: \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]. \nonumber
\end{align}
The policy $\pi$ returns the probability of the agent picking the action $a$ in state $s$. We say that the
agent follows the policy $\pi$. A means for learning a policy $\pi$ is introduced in chapter
\ref{sec:02:policy_gradients} and elaborated on in chapter \ref{sec:03:ppo}.
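One common way to represent such a policy in code is a categorical distribution parameterized by the outputs of a neural network. A small PyTorch sketch with placeholder logits for a single state $s$:

\begin{verbatim}
import torch
from torch.distributions import Categorical

logits = torch.tensor([0.2, 1.5, -0.3])  # placeholder output of a policy network for one state s
pi = Categorical(logits=logits)          # pi(. | s), a distribution over A(s)
action = pi.sample()                     # A_t ~ pi(. | S_t)
prob = pi.probs[action]                  # pi(a | s), a probability in [0, 1]
log_prob = pi.log_prob(action)           # log pi(a | s), needed later for gradient estimation
\end{verbatim}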
......@@ -175,4 +176,4 @@ Whenever the agent transitions to a terminal state, it cannot transition to othe
final time step $T$ after which no meaningful information can be gained. We call $T$ the \emph{finite} \emph{horizon};
it marks the end of an episode. Each episode begins with an initial state $S_0$ and ends once an agent reaches a
terminal state $S_T$. Usually, agents attempt episodic tasks many times with each episode giving rise to a different
trajectory \cite[p.~54]{sutton18}; the horizon $T$ may differ, too.
......@@ -52,11 +52,13 @@ gradient estimators $\hat{g}$:
\end{align*}
This approach matches \emph{stochastic gradient descent} as commonly used in machine learning
\cite[pp.~151--153]{goodfellow2016}: The accurate gradient is estimated using a set of samples---we record a trajectory.
The learn rate $\alpha$ controls the step size along the gradient, as we do not desire to optimize $\boldsymbol\theta$
for one gradient estimation only. Instead, we perform stochastic gradient ascent many times, taking a step along the
gradient each time. By performing many small steps along different gradients, we strike a balance between different estimations
that are calculated from different trajectories.
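A single ascent step therefore only scales a gradient estimate by the learn rate and adds it to the parameters. A minimal sketch with placeholder tensors and values:

\begin{verbatim}
import torch

theta = torch.zeros(8)           # placeholder parameter vector theta
alpha = 2.5e-4                   # placeholder learn rate
g_hat = torch.randn_like(theta)  # stands in for a gradient estimate from one trajectory

theta += alpha * g_hat           # stochastic gradient ascent: theta <- theta + alpha * g_hat
\end{verbatim}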
In contrast to common stochastic gradient descent, we do not intend to
minimize an error. Rather, we maximize the value function when following $\pi_{\boldsymbol\theta}$ and thus define the
fitness function to be
\begin{align}
......@@ -79,7 +81,7 @@ $\varepsilon$ typically converges to $0$ over the course of the training resulti
\citeA[p.~323]{sutton18} give an example of a Markov decision process that can only be solved by a stochastic policy.
\citeA[p.~324]{sutton18} point out that combining a parameterized policy with function approximation may pose a
challenge: The performances of both $\pi_{\boldsymbol\theta}$ and $\hat{v}(s, \boldsymbol\omega)$ depend on the action
selections and on the distribution of the states $\mu_{\pi_{\boldsymbol\theta}}(s)$ that these actions are selected in.
Adjusting the policy results in different action choices, which in turn changes $\mu_{\pi_{\boldsymbol\theta}}(s)$.
Thus, we might assume that we require the derivative of $\mu_{\pi_{\boldsymbol\theta}}(s)$ to compute gradient
......@@ -92,7 +94,7 @@ estimates. This issue is remedied by the \emph{policy gradient theorem}, which p
with $\propto$ meaning \enquote{proportional to}. These gradients may not share the same magnitude, but they share the
same direction. Accordingly, we only need to adjust the step size $\alpha$ with which we perform stochastic gradient ascent.
Originally, this proof held true only with $q_{\pi_{\boldsymbol\theta}}(s, a)$, a function that returns the value of an
action $a$ given the state $s$. However, the policy gradient theorem was expanded upon to allow the combination of
policy gradients with the advantage function \cite{sutton2000}:
\begin{align}
......@@ -103,10 +105,10 @@ policy gradients with the advantage function \cite{sutton2000}:
Using the advantage function is beneficial, as the policy gradient in equation \ref{eqn:advantage_gradient} has a lower
variance than the gradient in equation \ref{eqn:q_gradient} \cite{gae}.
Like the mean squared value error $\overline{\text{VE}}$ in equation \ref{eqn:value_error}, the gradient is weighted
according to the distribution of states $\mu$, but it does not depend on its derivative. By virtue of weighting with
$\mu$, states an agent encounters regularly have a greater effect on the gradient. Moreover, the magnitude of the
gradient is controlled by the advantage of an action: A large advantage mandates a large gradient step.
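In code, the expectation in equation \ref{eqn:advantage_gradient} is typically estimated from samples via the likelihood-ratio form $\mathbb{E}\left[\nabla_{\boldsymbol\theta} \log \pi_{\boldsymbol\theta}(A_t \mid S_t) \cdot \hat{a}\right]$. A sketch of a surrogate whose gradient matches this estimate; the inputs are assumed to come from a recorded trajectory:

\begin{verbatim}
import torch

def policy_gradient_surrogate(log_probs, advantages):
    # log_probs: log pi_theta(A_t | S_t) for the recorded steps (computed with gradients),
    # advantages: advantage estimates for the same steps (treated as constants).
    # The gradient of this scalar matches the sample average of
    # grad log pi_theta(A_t | S_t) * advantage; the sign is flipped so that a
    # gradient-descent optimizer performs ascent on the objective.
    return -(log_probs * advantages.detach()).mean()
\end{verbatim}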
\subsubsection{Gradient Estimation}
......
......@@ -15,7 +15,7 @@ R_T$. Using this sequence we define the return
\end{align}
An agent strives to maximize the return $G_0$ and thus all following returns $G_t$, too. As early returns include the
reward sequence of later returns, returns may be expressed recursively: $G_t = R_{t+1} + \gamma G_{t+1}$.
$\gamma$ denotes the discount factor, which discounts future rewards. In finite-horizon Markov decision processes,
$\gamma$ may equal $1$. In infinite-horizon Markov decision processes, we must choose $\gamma < 1$ so $G_t$ remains
......
By optimizing the shared loss $\mathcal{L}^{\text{CLIP}+\text{VF}+S}$ through gradient ascent, we calculate an improved
policy. As not all information is learned with a single gradient step, \citeA{ppo} propose performing multiple
optimization passes over the same trajectory. This approach increases the sample efficiency of Proximal Policy
Optimization, as an agent can learn more quickly than an agent that trains on a trajectory only once.
More precisely, each training iteration consists of $K$ epochs. In each epoch, the entire data set is split into
......@@ -22,7 +22,7 @@ of algorithm \ref{alg:ppo}.
In the second phase, the value function approximation and the policy are optimized with the data the agent collected.
These two steps are repeated for a certain time or until the value function and policy converge to their respective
optimal functions. Afterwards, the learn rate $\alpha$ and the clipping parameter $\epsilon$ are decreased, so they linearly
approach $0$ over the course of the training; the trust region shrinks. Lines 18--25 of algorithm \ref{alg:ppo} show this phase.
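The linear annealing of $\alpha$ and $\epsilon$ can be expressed in a few lines; a sketch, with the concrete values treated as placeholders:

\begin{verbatim}
def linear_decay(initial_value, iteration, total_iterations):
    # Anneal a quantity linearly from its initial value towards 0 over training.
    # Applied to both the learn rate alpha and the clipping parameter epsilon,
    # so the trust region shrinks as training progresses.
    return initial_value * (1.0 - iteration / total_iterations)

# e.g. an initial learn rate of 2.5e-4 is halved after half of the training iterations
alpha_i = linear_decay(2.5e-4, iteration=5, total_iterations=10)
\end{verbatim}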
\begin{algorithm}[ht]
......@@ -47,7 +47,7 @@ approach $0$ over the course of the training; the trust region shrinks. Lines 18
\State $s \gets $ successor state
\State $r \gets $ reward
\State Append $a, s, r, \pi_{\boldsymbol\theta_\text{old}}(a\mid s),
\hat{v}_\pi (s, \boldsymbol\theta)$ to $\tau$
\EndFor
\EndFor
\State Compute Generalized Advantage Estimations $\delta_t^{\text{GAE}(\gamma, \lambda)}$
......
......@@ -18,7 +18,7 @@ introduce advanced methods that alleviate this issue in this chapter.
\subsubsection{Generalized Advantage Estimation}
\label{sec:03:gae_gae}
In chapter \ref{sec:02:advantage} we introduced the advantage function $a_\pi(s, a)$ and an estimator $\delta_t$ that we
combined with function approximation (cf.~equations \ref{eqn:a_hat} and \ref{eqn:delta2} in chapter
\ref{sec:02:function_approximation}):
\begin{align*}
......@@ -55,9 +55,10 @@ inaccuracy contributed by this term is reduced, yielding a less biased advantage
increased, the number of stochastic elements in the calculation grows. This increases the variance of the estimator.
Thus, $\delta_t^{(1)}$ has the highest bias and the lowest variance, whereas $\delta_t^{(T-t)}$ possesses the lowest
bias and the highest variance.
\citeA{gae} found that no particular advantage estimator $\delta_t^{(i)}$ grants the best results---instead, they must be
combined.
Let $\lambda\in[0,1)$ denote a variance tuning factor.\footnote{Note that \citeA{gae} allow $\lambda = 1$, but
we do not require this special case.} Then \emph{Generalized Advantage Estimation} (GAE) is defined to be
\begin{align}
\label{eqn:gae_start}
......@@ -80,7 +81,7 @@ result of the geometric series; the same can be achieved for sums $\lambda + \la
factoring $\lambda$ first.
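For completeness, the normalization can be verified directly with the geometric series, assuming $0 \le \lambda < 1$:
\begin{align*}
(1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} = (1 - \lambda) \cdot \frac{1}{1 - \lambda} = 1.
\end{align*}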
$\lambda$ allows us to control the impact of high-bias terms versus high-variance terms. If we choose $\lambda = 0$,
then the bias is the highest, as only $\delta_t$ is used for advantage estimation. As $\lambda$ approaches $1$, the bias
decreases whereas the variance of the estimator grows.
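In practice, Generalized Advantage Estimation is computed with a single backward pass over a recorded trajectory, using the recursion $\delta_t^{\text{GAE}(\gamma,\lambda)} = \delta_t + \gamma\lambda\,\delta_{t+1}^{\text{GAE}(\gamma,\lambda)}$ that follows from the definition. A sketch for one finite trajectory; the handling of terminal states is omitted for brevity:

\begin{verbatim}
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: R_1 .. R_T for one trajectory,
    # values: v_hat(S_0) .. v_hat(S_T), i.e. one more entry than rewards so the
    # last delta can bootstrap from the final value estimate.
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # accumulates (gamma * lambda)^l terms
        advantages[t] = running
    return advantages
\end{verbatim}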
Note that Generalized Advantage Estimation is defined for infinite-horizon Markov decision processes. However, GAE is
......@@ -112,6 +113,7 @@ choosing a specific $n$, we combine many $n$-step returns discounted by $\lambda
\label{eqn:lambda_return}
G_t^\lambda \doteq (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)}.
\end{align}
Again, we utilize $(1 - \lambda)$ to normalize the sum. As above, if we choose $\lambda = 0$, the return approximation has
a high bias. For $\lambda$ approaching $1$ the return approximation has high variance.
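Implementations frequently do not evaluate equation \ref{eqn:lambda_return} directly. When the $n$-step returns bootstrap from the value estimates, $G_t^\lambda = \delta_t^{\text{GAE}(\gamma,\lambda)} + \hat{v}_\pi(S_t, \boldsymbol\omega)$, so the $\lambda$-returns can be recovered from the advantages; a one-line continuation of the sketch above:

\begin{verbatim}
returns = gae(rewards, values) + values[:-1]   # lambda-returns used as value function targets
\end{verbatim}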
......
......@@ -14,7 +14,8 @@ loss $\mathcal{L}$, as is common in deep learning \cite[chapter 4.3]{goodfellow2
s)\right]
\end{align}
The gradient estimator $\hat{g}$ can be obtained by differentiating the loss in equation \ref{eqn:naive_loss}. Unlike in deep
learning, in deep reinforcement learning the loss notation also denotes objectives that are maximized rather than
minimized. Therefore, it depends on the
specific objective whether gradient ascent or gradient descent is performed.
We begin by highlighting issues with the gradient estimator from equation \ref{eqn:gradient_estimator} and motivate the
......
......@@ -23,7 +23,7 @@ divergence. This way they incentivize finding a new policy that is close to the
As the new loss still imposes a lower bound on the fitness $J(\boldsymbol\theta)$, it is a minorant to the fitness
function. The authors prove that maximizing this minorant guarantees a monotonically increasing fitness
$J(\boldsymbol\theta)$; given sufficient optimization steps the algorithm converges to a local
optimum.\footnote{Algorithms like this are called MM algorithms. This one is a \emph{minorize maximization} algorithm
\cite{hunter2004}.} The ensuing objective is called a \emph{surrogate} objective.
The mathematically proven algorithm is computationally demanding, as it needs to compute the KL divergence on all states
......@@ -33,7 +33,7 @@ algorithm is called \emph{Trust Region Policy Optimization} (TRPO).
However, TRPO achieves suboptimal results on tasks like video games and is complicated both to implement and to
execute \cite{ppo}. In order to compute an approximately optimal solution of the KL divergence, the inverse of a large
Hessian matrix\footnote{The Hessian matrix is a second-order derivative. It can be determined by differentiating the gradient
of a function \cite[pp.~86--87]{goodfellow2016}.} is required. This leads \citeA{ppo} to devise a simpler algorithm
called \emph{Proximal Policy Optimization} (PPO). We discuss this algorithm in the following chapters.
\subsubsection{Clipped Surrogate Objective}
......@@ -57,7 +57,8 @@ the likelihood ratio
\rho_t(\boldsymbol\theta) \doteq \frac{\pi_{\boldsymbol\theta}(A_t \mid S_t)}{\pi_{\boldsymbol\theta_\text{old}}(A_t
\mid S_t)},
\end{align}
where $\pi_{\boldsymbol\theta}$ is the current policy and $\pi_{\boldsymbol\theta_\text{old}}$ is a previous policy
\cite{ppo}.
The action and the state were sampled by an agent following $\pi_{\boldsymbol\theta_\text{old}}$. By evaluating the ratio
$\rho_t(\boldsymbol\theta)$, we determine if the probability of taking the action $a$ in the state $s$ has increased or
......@@ -78,7 +79,7 @@ example, $\rho_t(\boldsymbol\theta) = 100$, gradient steps become excessively la
collapses \cite{kakade2002}.
Hence, Proximal Policy Optimization algorithms intend to keep the likelihood ratio close to one \cite{ppo}. There are
two variants of PPO: one relies on clipping the ratio, whereas the other penalizes the objective with the Kullback-Leibler
divergence. We examine the clipped variant, as it is more widely used and achieved better results on various tasks such
as robotics and video games \cite{ppo, ppo_blog, kostrikov}.
......@@ -128,7 +129,8 @@ decreased---$\pi_{\boldsymbol\theta}(\emph{\text{noop}}\mid s) = 0.25$. If $\eps
0.9$. As the ratio moved outside of the clip range, it is clipped and the gradient becomes 0. This means that we cannot
correct the error we made in decreasing the likelihood of picking the action.
We solve this issue by taking an elementwise minimum. Then, $\mathcal{L}^\text{CLIP}$ denotes the clipped surrogate
objective \cite{ppo}
\begin{align}
\label{eqn:lclip}
\mathcal{L}^\text{CLIP}(\boldsymbol\theta) \doteq \mathbb{E}_{a,s\sim\pi_{\boldsymbol\theta_\text{old}}}
......@@ -142,9 +144,9 @@ This issue is solved by taking an elementwise minimum:
\end{align}
In practice, the advantage function $\hat{a}_{\pi_{\boldsymbol\theta_\text{old}}}$ is replaced with Generalized
Advantage Estimations $\delta_t^{\text{GAE}(\gamma,\lambda)}$ as defined in equation \ref{eqn:gae} \cite{ppo}.
$\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$ is called a surrogate objective, as it approximates the original objective
$J(\boldsymbol\theta)$.
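A sketch of $\mathcal{L}^\text{CLIP}$ in PyTorch; the log probabilities under $\pi_{\boldsymbol\theta_\text{old}}$ and the advantage estimates are assumed to have been recorded during the rollout:

\begin{verbatim}
import torch

def l_clip(log_probs_new, log_probs_old, advantages, epsilon):
    # rho_t(theta), computed from log probabilities for numerical stability
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return torch.min(unclipped, clipped).mean()   # objective to be maximized
\end{verbatim}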
Figure \ref{fig:min_clipping} compares $\text{clip}(\rho_t(\boldsymbol\theta), \epsilon) \cdot \delta$ and
$\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$. Using an elementwise minimum has the following effect:
......@@ -159,7 +161,7 @@ $\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$. Using an elementwise minimum has t
\centering
\includegraphics[width=\textwidth]{03_ppo/clipping.pdf}
\caption{The loss for an action with positive advantage can be seen on the left side, whereas the loss for an action with
a negative advantage is shown on the right side. By using a minimum in $\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$, we ensure that we
can correct errors we made in previous gradient steps: We can raise the probability of actions with positive
advantage even when $\rho_t(\boldsymbol\theta) \le 1 - \epsilon$. Vice versa, we can decrease the probability of
actions with negative advantage when $\rho_t(\boldsymbol\theta) \ge 1 + \epsilon$. Figure as in
......@@ -172,9 +174,9 @@ Clipping with $\epsilon$ keeps the new policy close to the old policy. The regio
large, the performance of the policy may collapse again. On the other hand, an agent may learn too slowly if $\epsilon$
is too small.
By optimizing the loss $\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$, we can optimize the policy
$\pi_{\boldsymbol\theta}$ without making excessively large gradient steps and still correct errors of previous
optimizations given a suitable choice for $\epsilon$.
\subsubsection{Value Function Loss}
\label{sec:03:loss_value}
......@@ -194,8 +196,8 @@ $\lambda$-returns \cite{ppo}:
\begin{align}
\label{eqn:lvf}
\mathcal{L}^\text{VF}(\boldsymbol\omega) \doteq \frac{1}{2} \cdot
\mathbb{E}_{s,G\sim\pi_{\boldsymbol\theta_\text{old}}}\left[\left(\hat{v}_{\pi_{\boldsymbol\theta}}(s, \boldsymbol\omega) -
G\right)^2\right],
\end{align}
with $G$ being $\lambda$-returns calculated from rewards observed by an agent following $\pi_{\boldsymbol\theta_\text{old}}$
(cf.~equation \ref{eqn:lambda_return}).
......@@ -208,9 +210,11 @@ parameters. $\hat{v}(s, \boldsymbol\theta)$ and $\pi(a\mid s, \boldsymbol\theta)
architecture and weights with differing output layers (cf. chapter \ref{sec:04:architecture}). We choose to share
parameters because we require less computation power, as only one neural network needs to be trained and executed.
In this case, the loss must contain a loss for the policy and a loss for the value function. Since we perform gradient
ascent on the policy loss but gradient descent on the value function loss, $\mathcal{L}^\text{VF}(\boldsymbol\theta)$ is
subtracted from $\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$.
\citeA{ppo} propose the following loss:
\begin{align}
\label{eqn:lshared}
\mathcal{L}^{\text{CLIP}+\text{VF}+S}(\boldsymbol\theta) \doteq \mathcal{L}^\text{CLIP}(\boldsymbol\theta) -
......@@ -222,7 +226,7 @@ $\mathcal{L}^\text{VF}$, whereas $c_2$ adjusts the impact of the \emph{entropy b
$S$ denotes an entropy bonus encouraging exploration, which addresses an issue raised by \citeA{kakade2002}: Policy
gradient methods commonly transition to exploitation too early, resulting in suboptimal policies. The closer the
distribution $\pi$ is to a uniform distribution, the larger its entropy. If all actions are assigned the same
probability, an agent following $\pi$ explores properly. An agent following a deterministic policy does not explore and
the entropy of this policy will be $0$.
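A sketch of the shared loss; $\mathcal{L}^\text{CLIP}$ and $\mathcal{L}^\text{VF}$ are assumed to have been computed as above, and the entropy bonus $S$ is taken from the categorical action distribution:

\begin{verbatim}
import torch
from torch.distributions import Categorical

def shared_objective(l_clip_value, l_vf_value, policy_logits, c1=0.5, c2=0.01):
    # L^{CLIP+VF+S} = L^CLIP - c1 * L^VF + c2 * S, maximized by gradient ascent
    # (equivalently, its negation is minimized with a standard optimizer).
    entropy = Categorical(logits=policy_logits).entropy().mean()   # entropy bonus S
    return l_clip_value - c1 * l_vf_value + c2 * entropy
\end{verbatim}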
......
......@@ -2,7 +2,7 @@ Policy gradient estimators such as the one introduced in equation \ref{eqn:gradi
\ref{sec:02:policy_gradients} or REINFORCE\footnote{A well-known reinforcement learning algorithm that is a simple
improvement over the estimator we introduced.} by \citeA{reinforce} suffer from a significant drawback. As
\citeA{kakade2002} demonstrate, some tasks require that we record long trajectories to guarantee a policy is improved
when performing stochastic gradient ascent. The longer the trajectories are, the higher their variance grows, as both
the dynamics $p$ of the environment and the policy $\pi$ introduce non-deterministic behavior.
As the variance of the trajectories grows, the quality of the gradient estimator declines. Performing stochastic
......
Although \citeA{ppo} do not mention it, advantage estimations $\delta$ are normalized before loss computation. Normalizing
advantages is a well-known operation that lowers the variance of the gradient estimator.
\begin{align}
\label{eqn:adv_normalization}
......
......@@ -13,7 +13,7 @@ ATARI 2600 games.
\end{sidewaysfigure}
This architecture is composed of three convolutional layers\footnote{Convolutional neural networks are explained in
detail by \citeA[chapter 9]{goodfellow2016}.} followed by a linear layer\footnote{Also known as dense layers,
fully-connected layers or occasionally multi-layer perceptrons.} and two output layers (cf.~figure
\ref{fig:architecture}). The input is scaled to $4 \times 84 \times 84$, that is $4$ grayscale images each sized $84
\times 84$. Table \ref{table:cnn_layers} outlines the structure of the convolutional layers:
......@@ -39,13 +39,13 @@ unit.
In this thesis, the policy and the value function share parameters. We choose this approach because it requires
training only one neural network instead of two, which in turn reduces the computation time slightly.
Furthermore, popular PPO implementations follow the same approach \cite{baselines, kostrikov}. In order to share
parameters, we construct two output layers, both of which follow the size $512$ linear layer (a brief sketch of this structure follows the list below):
\begin{itemize}
\item The policy output is computed by a linear layer with one output for each action available to the agent.
$\pi_{\boldsymbol\theta}$ is given by performing a softmax on the results of this output layer. Softmax
functions can be used to represent a probability distribution over discrete random variables
\cite[pp.~184--185]{goodfellow2016}, as is the case with actions in ATARI games.
\item The value function output is composed of a single artificial neuron. The output of this neuron is the value
estimation $\hat{v}_\pi(s, \boldsymbol\theta)$.
\end{itemize}
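A compact PyTorch sketch of this architecture. The kernel sizes and strides below are the values commonly used for ATARI agents and stand in for the exact figures of table \ref{table:cnn_layers}; the input scaling is likewise an assumption:

\begin{verbatim}
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    # Shared-parameter network: convolutional torso, one linear layer, two output heads.
    def __init__(self, num_actions):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 32x20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.policy_head = nn.Linear(512, num_actions)  # logits; a softmax yields pi(. | s)
        self.value_head = nn.Linear(512, 1)             # scalar value estimate v_hat(s)

    def forward(self, x):
        h = self.torso(x / 255.0)   # scale pixel values to [0, 1] (assumption)
        return self.policy_head(h), self.value_head(h)
\end{verbatim}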
Algorithm \ref{alg:ppo_full} (on page \pageref{alg:ppo_full}) shows the complete Proximal Policy Optimization algorithm when
learning ATARI 2600 games. Notable changes from algorithm \ref{alg:ppo} in chapter \ref{sec:03:algorithm} are the
inclusion of the environment adaptation $\phi$ (lines 7, 13 and 14), the introduction of orthogonal initialization (line
1), the replacement of $\mathcal{L}^\text{VF}$ with $\mathcal{L}^\text{VFCLIP}$ and the use of gradient clipping (both
in line 23).
\begin{algorithm}[ht]
\caption{Full Proximal Policy Optimization for ATARI 2600 games, modified from \protect\citeauthor{ppo}'s
......@@ -10,7 +11,7 @@ $\mathcal{L}^\text{VF}$ with $\mathcal{L}^\text{VFCLIP}$ and the use of gradient
\label{alg:ppo_full}
\begin{algorithmic}[1]
\Require number of iterations $I$, rollout horizon $T$, number of actors $N$, number of epochs $K$, minibatch size $M$, discount $\gamma$, GAE weight
$\lambda$, learn rate $\alpha$, clipping parameter $\epsilon$, coefficients $c_1, c_2$, adaptation function $\phi$
\State $\boldsymbol\theta \gets $ orthogonal initialization \Comment{}
\State Initialize environments $E$
\State Number of minibatches $B \gets N \cdot T\; / \;M$
......@@ -26,7 +27,7 @@ $\mathcal{L}^\text{VF}$ with $\mathcal{L}^\text{VFCLIP}$ and the use of gradient
\State $s \gets $ $\phi_s(\text{successor state})$ \Comment{}
\State $r \gets $ $\phi_r(\text{reward})$ \Comment{}
\State Append $a, s, r, \pi_{\boldsymbol\theta_\text{old}}(a\mid s),
\hat{v}_\pi (s, \boldsymbol\theta)$ to $\tau$
\EndFor
\EndFor
\State Compute Generalized Advantage Estimations $\delta_t^{\text{GAE}(\gamma, \lambda)}$
......@@ -47,6 +48,7 @@ $\mathcal{L}^\text{VF}$ with $\mathcal{L}^\text{VFCLIP}$ and the use of gradient
Table \ref{table:hyperparameters} lists popular values for all hyperparameters \cite{baselines, kostrikov}. It differs
slightly from the one used by \citeA{ppo}, as the authors trained agents for $K = 3$ epochs with a value function loss
coefficient $c_1 = 1.0$.
\begin{savenotes}
\begin{table}[ht]
\centering
\begin{tabular}{l|l}
......@@ -64,8 +66,11 @@ coefficient $c_1 = 1.0$.
Value function coeff. $c_1$ & $0.5$ \\
Entropy coeff. $c_2$ & $0.01$ \\
Maximum gradient norm & $0.5$ \\
Gradient optimizer & Adam\footnote{Consult \citeA[chapter 8.5.3]{goodfellow2016} for an explanation of the Adam
optimization algorithm.} \\
\end{tabular}
\caption{This table contains the most commonly used configuration of Proximal Policy Optimization for ATARI 2600
games.}
\label{table:hyperparameters}
\end{table}
\end{savenotes}
Before performing a gradient step, the gradient $\nabla_\theta \mathcal{L}^{\text{CLIP}+\text{VFCLIP}+S}$ is clipped
\cite{ilyas2018}. All changes in weights are concatenated into a single vector. The weight changes are adjusted such that
the Euclidean norm of the vector does not exceed $0.5$. Gradient clipping is a technique commonly used to prevent large
gradient steps \cite[pp.~413--415]{goodfellow2016}.
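In PyTorch, this corresponds to a single call between the backward pass and the optimizer step; \texttt{model}, \texttt{optimizer} and \texttt{shared\_objective} are placeholders for the shared network, its gradient optimizer and the value of the shared objective on the current minibatch:

\begin{verbatim}
import torch

# model, optimizer and shared_objective are placeholders, not part of a fixed API.
optimizer.zero_grad()
(-shared_objective).backward()                                      # ascend the objective
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)    # joint Euclidean norm <= 0.5
optimizer.step()
\end{verbatim}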
......@@ -2,9 +2,8 @@
\label{sec:04:implementation}
\input{04_implementation/introduction}
\subsection{Environment Adaptation}
\label{sec:04:postprocessing}
\input{04_implementation/postprocessing}
\subsection{Model Architecture}
......
......@@ -15,7 +15,7 @@ used to initialize the layer.
Policy output & $0.01$ \\
Value function output & $1$
\end{tabular}
\caption{Each layer is subject to orthogonal initialization, but the scaling factors differ. Layers that use a
rectified linear unit are initialized with $\sqrt{2}$, whereas the policy output is scaled by $0.01$.}
\label{table:orthogonal_scaling}
\end{table}
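A sketch of how these scaling factors may be applied with PyTorch's built-in initializer; zeroing the biases is an assumption, and the layer sizes are placeholders:

\begin{verbatim}
import torch.nn as nn

def init_orthogonal(layer, gain):
    # Orthogonal initialization with a layer-dependent scaling factor (gain).
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)   # biases are zeroed here (assumption)
    return layer

# gains as in the table above: sqrt(2) for ReLU layers, 0.01 for the policy output,
# 1 for the value function output; 6 stands in for the number of available actions
hidden = init_orthogonal(nn.Linear(512, 512), gain=2 ** 0.5)
policy_out = init_orthogonal(nn.Linear(512, 6), gain=0.01)
value_out = init_orthogonal(nn.Linear(512, 1), gain=1.0)
\end{verbatim}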
A sequence of adaptation steps is performed on observations returned by the ATARI 2600 emulator. Except for
\emph{reward clipping}, these adaptations are common when learning ATARI 2600 games. Operations such as \emph{fire
resets}, \emph{frame skipping} and handling \emph{episodic life} adapt specific behavior of ATARI games to the
requirements of reinforcement learning algorithms. Other operations reduce the complexity of the state space or lead to
more diverse starting conditions.
We use $\phi$ to denote the adaptation operations and overload it as follows: $\phi_s(S_t)$ processes states, whereas
$\phi_r(R_t)$ transforms rewards. In practice, both functions are executed at the same time due to the nature of the
adaptation steps. Mathematically, $\phi$ is part of the environment and therefore not visible in terms of the
Markov decision process.
\paragraph{No operation resets.}
Whenever an environment is reset, a random number of \emph{noop} actions are performed for up to 30 frames in order
to guarantee differing initial conditions across different trajectories. Originally, this approach was proposed to ensure
that an agent under evaluation does not overfit \cite{nature_dqn}, but it has since been adopted for training. This might
prove beneficial, as the random initial conditions can lead to more diverse trajectories. However, it appears that no
research has been conducted on the full implications of this choice.
\paragraph{Frame skipping.}
Instead of prompting the agent for an action every frame, we show the agent every $k$-th frame only \cite{ale, dqn}. As
humans rarely perform frame-by-frame action selections, repeating an action for several frames should prove no detriment
for an agent. Thus, an agent can easily train on up to $k$ times more frames without the computation time of training
being affected significantly. A common selection is $k = 4$, which means that an agent repeats an action four times and
......@@ -38,10 +38,10 @@ solve this issue by taking the maximum value of each pixel across the last two f
Some games like Breakout provide the player with several attempts, often called lives. Whenever the player loses a life,
they may resume the game, usually at the cost of a reduced score. \citeA{nature_dqn} propose ending a training episode
once a life is lost. As PPO operates on a steady number of samples, this approach is modified slightly. Instead of
ending the episode, we simulate the end of the game. Thus, the return and advantage calculations are cut off at this time
step. Finally, we reset the observation stack as detailed in the operation \emph{observation stacking}.
This adaptation is applied only to some of the games chosen for this thesis (see chapter
\ref{sec:05:ale} for more information on the selected games). BeamRider, Breakout, Seaquest and SpaceInvaders provide
a player with multiple lives, whereas Pong does not.
......@@ -61,7 +61,7 @@ is a common choice when using neural networks.
According to \citeA{ilyas2018}, rewards are clipped to the interval $[-5, 5]$. However, an examination of the baselines
repository reveals that rewards are binned \cite[baselines/common/atari\_wrappers.py]{baselines}:
\begin{align}
\phi_r(r) \doteq \sign r.
\end{align}
We discuss both choices in chapter \ref{sec:05:discussion_optims}.
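Binning then amounts to keeping only the sign of the score change; a minimal sketch:

\begin{verbatim}
import numpy as np

def phi_r(reward):
    # Reward binning: only the sign of the score change is passed to the agent.
    return float(np.sign(reward))

# phi_r(4.0) == 1.0, phi_r(-2.0) == -1.0, phi_r(0.0) == 0.0
\end{verbatim}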
......
......@@ -25,13 +25,14 @@ Then the clipped value function loss $\mathcal{L}^\text{VFCLIP}$ is defined to b
- G
\right)^2
\right]
\right],
\end{align}
with $\epsilon$ being the same hyperparameter that is used to clip the likelihood ratio in $\mathcal{L}^\text{CLIP}$
(cf.~equation \ref{eqn:lclip} in chapter \ref{sec:03:policy_loss}). $\boldsymbol\omega_\text{old}$ denotes the parameter
vector that was used when the rollout was recorded.
Intuitively, this approach can be understood as analogous to clipping the probability ratio $\rho_t(\boldsymbol\theta)$. To avoid gradient
collapses, a trust region is created with the clipping parameter $\epsilon$. Then, an elementwise maximum is taken so
errors from previous gradient steps can be corrected. A maximum is applied instead of a minimum because the value
function loss is minimized. A thorough analysis of the ramifications of using this surrogate loss was recently published by
\citeA{ilyas2020}.
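A sketch of the usual formulation of $\mathcal{L}^\text{VFCLIP}$; the value estimates recorded at rollout time play the role of $\hat{v}_\pi(s, \boldsymbol\omega_\text{old})$:

\begin{verbatim}
import torch

def l_vf_clip(values_new, values_old, returns, epsilon):
    # values_new: v_hat(s, omega), values_old: v_hat(s, omega_old) from the rollout,
    # returns: lambda-returns G. The elementwise maximum keeps the larger (clipped or
    # unclipped) squared error, so errors from earlier gradient steps can be corrected.
    clipped = values_old + torch.clamp(values_new - values_old, -epsilon, epsilon)
    unclipped_error = (values_new - returns) ** 2
    clipped_error = (clipped - returns) ** 2
    return 0.5 * torch.max(unclipped_error, clipped_error).mean()
\end{verbatim}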
......@@ -5,11 +5,11 @@ example, the 18 most recent frames are used to determine the movement of the pad
trained on video input from ATARI games, which leads to a large state space (cf.~chapter \ref{sec:04:postprocessing}).
Lastly, video games are diverse challenges with some---such as Pong---being easier to solve, whereas others like the game
Montezuma's Revenge remain challenging. For these reasons, ATARI 2600 video games are a popular benchmarking
environment.\footnote{One might note that video games are well-known and could pique people's interest as well.}
The \emph{Arcade Learning Environment} (ALE) provides a benchmarking environment containing 57 games
\cite{ale}.\footnote{When it was published, the Arcade Learning Environment contained fewer games. It contains 57 games
at the time of writing this thesis.} It is included, among other benchmarking environments, in the popular OpenAI Gym
framework \cite{gym}.
We conduct experiments on a selection of five games of the ALE: BeamRider, Breakout, Pong, Seaquest and SpaceInvaders
......@@ -20,12 +20,12 @@ accurately. On BeamRider, agents achieve close to human performance.
OpenAI Gym and ALE simplify evaluation greatly, as they provide a unified interface for all environments. We simply
provide an action to the environment and obtain observations containing the current state of the environment as a $210
\times 160$ RGB image (before adaptations), the reward and further information such as the number of lives remaining
or whether the game has terminated. The Arcade Learning Environment runs at 60 frames per second in real time---without
frame skipping we would obtain 60 observations per second.
Since the ALE returns a reward, we do not have to design a reward function. Instead, all reinforcement
learning algorithms are trained using the same reward function apart from the environment adaptations. The reward of an action is the
change in score, making the cumulative reward of an episode the high score.
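A minimal interaction with this interface; the environment id and the exact return signature of \texttt{step} follow the classic Gym API and may differ between Gym and ALE versions:

\begin{verbatim}
import gym

env = gym.make("PongNoFrameskip-v4")
obs = env.reset()                                   # 210 x 160 RGB image before adaptations
obs, reward, done, info = env.step(env.action_space.sample())
# reward is the change in score; info carries emulator details such as remaining lives
\end{verbatim}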
The action space of ATARI games is discrete. Every game accepts the actions $\mathcal{A} = \{\emph{\text{noop}},
......
......@@ -74,10 +74,10 @@ other hand, agents trained on Breakout and SpaceInvaders perform better than age
\paragraph{Conclusion.}
The conducted experiments show that the optimizations overall have a notable positive effect on the performance of
agents. However, no general statements on the impact of an optimization can be made, as the influence of an optimization
depends on the respective game an agent is trained on. For example, orthogonal initialization is shown to be crucial for
agents learning Pong or Seaquest, whilst the positive impact on Breakout and SpaceInvaders is notable but less
pronounced.