
Commit b9e224c0 authored by Daniel Lukats

large cleanup and feedback inclusion push

parent 23811425
......@@ -16,11 +16,12 @@ weight vector $\boldsymbol\omega$ changes the value estimations of multiple stat
estimation of a single state causes changes in the value estimations of other states, too. Thus, increasing the accuracy
of some states will decrease the accuracy of other states.
In order to alleviate this issue, the error introduced through estimation can be weighted according to the likelihood a
state is observed. Let $\mu_\pi$ denote the stationary distribution of states under $\pi$; then $\mu_\pi(s)$ is the percentage of time
an agent following $\pi$ spends in the state $s$. $\hat{v}_\pi(s, \boldsymbol\omega)$ shall be chosen such that the mean
squared value error
\begin{align}
\label{eqn:value_error}
\overline{\text{VE}}(\boldsymbol\omega) \doteq \sum_{s \in \mathcal{S}} \mu_\pi(s) [v_\pi(s) - \hat{v}_\pi(s, \boldsymbol\omega)]^2
\end{align}
is minimized \cite[p.~199]{sutton18}. As we sample trajectories from which we form our estimations, the states we observe
......@@ -40,17 +41,20 @@ equation \ref{eqn:advantage_function}, the advantage function was defined to be
Then, the advantage function is parameterized by inserting the parameterized value function $\hat{v}_\pi$ instead of
$v_\pi$:
\begin{align}
\label{eqn:a_hat}
\hat{a}_\pi(s, a, \boldsymbol\omega) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left( r + \gamma \hat{v}_\pi(s',
\boldsymbol\omega) \right) \right] - \hat{v}_\pi(s, \boldsymbol\omega) \\
\label{eqn:a_hat2}
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) \mid S_t = s, A_t = a \right] - \hat{v}_\pi(s,
\boldsymbol\omega).
\end{align}
We further extend function approximation to the advantage estimator
\begin{align}
\label{eqn:delta2}
\delta_t \doteq R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) - \hat{v}_\pi(S_t, \boldsymbol\omega).
\end{align}
Note that we do not indicate whether the advantage estimator uses the value function $v_\pi$ (like in equation
\ref{eqn:delta1}) or a parameterized value function $\hat{v}_\pi$. In subsequent chapters we consistently use function
approximation; therefore, it can be assumed that advantage estimators rely on parameterized value functions.
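For illustration, this estimator can be computed along a recorded trajectory with a few lines of Python. The following
sketch is merely illustrative and not part of the implementation discussed later; \texttt{v\_hat} is a hypothetical
stand-in for the parameterized value function $\hat{v}_\pi$.
\begin{verbatim}
import numpy as np

def td_residuals(rewards, states, omega, v_hat, gamma=0.99):
    """Compute delta_t = R_{t+1} + gamma * v_hat(S_{t+1}, w) - v_hat(S_t, w).

    rewards[t] holds R_{t+1}; states holds S_0, ..., S_T, i.e., one entry more
    than rewards. v_hat(state, omega) is a hypothetical parameterized value
    function supplied by the caller."""
    values = np.array([v_hat(s, omega) for s in states], dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]
\end{verbatim}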
......@@ -38,9 +38,10 @@ states and rewards as well as the actions agents may take. These requirements ar
which we introduce in chapter \ref{sec:02:mdp}. Subsequently, we define a goal the agent shall achieve utilizing the
\emph{value function}.
A key issue of reinforcement learning is the balance of \emph{exploration} and \emph{exploitation}
\cite[p.~3]{sutton18}. Often we lack full knowledge of the environment; for example, a few frames of video input from a
video game do not carry information on the entire internal state of the game. We gather information by interacting with
the environment, slowly learning which actions help achieve the goal. As we do not gain full information about a state
or action by observing or choosing it once, we have to make a decision: Do we explore more states and actions of which
we have no information? Or do we exploit the best known way of achieving the goal, gaining more knowledge on a few
select states and actions?
\todo[inline]{short intro sentence goes here}
\todo[inline]{note that the initial state is usually determined by a start distribution $p_0$ or something like that,
but we do not require that in this thesis so it is implicitly included?}
A Markov decision process is a stochastic process that encompasses observations, actions and rewards. These are the core
properties needed to train an agent with a reinforcement learning algorithm.
\subsubsection{Definition}
\label{sec:02:mdp_def}
......@@ -31,8 +29,8 @@ interact once---the agent chooses an action and observes the environment.
A simple Markov decision process is displayed in figure \ref{fig:intro_mdp}. Like Markov chains, it contains states
$s_0, s_1, s_2$ with $s_0$ being the initial state and $s_2$ being a terminal state. Unlike Markov chains, it also
contains actions \emph{left} and \emph{right} as well as rewards $0$ and $1$. The rewards are written alongside
transition probabilities and can be found on the edges connecting the states. Further explanations and an example using
elements of this Markov decision process are given in chapter \ref{sec:02:distributions}.
Markov decision processes share most properties with Markov chains\todo{some source here}, the major difference being
the transitions between states. Whereas in Markov chains the transition depends on the current state only, in Markov
......@@ -41,14 +39,15 @@ although it remains stochastic in nature. We assume that the Markov property hol
which means that these transitions depend on the current state and action only \cite[p.~49]{sutton18}.
\subsubsection{States, Actions and Rewards}
\label{sec:02:mdp_vars}
States, actions and rewards are core concepts of reinforcement learning. States and actions describe an environment and
an agent's capabilities within this environment. Rewards, on the other hand, are used to teach the agent.
\paragraph{States.}
We describe the environment using \emph{states}: Let $S_t \in \mathcal{S}$ denote the state of the environment at time
step $t$ and let $\mathcal{S}$ denote the finite set of states \cite[p.~48]{sutton18}. A state must contain all details
an agent requires to make well-informed decisions.
Figure \ref{fig:states} displays screenshots from two ATARI 2600 games, which may be used as states. After several
post-processing steps, we work with $\mathcal{S} \subseteq [0, 1]^{4 \times 84 \times 84}$ for all ATARI 2600 games, as
......@@ -66,7 +65,7 @@ we use a sequence of 4 consecutive grayscale images resized to $84 \times 84$ pi
\paragraph{Actions.}
At each time step $t$ our agent takes an \emph{action} depending on the state $S_t$ of the environment, which causes a
transition to one of several successor states. Let $A_t \in \mathcal{A}(s),\;S_t = s$ denote the action at time step $t$
and let $\mathcal{A}(s)$ denote the set of actions available in the state $s$ \cite[p.~48]{sutton18}.
The actions available to an agent depend on the environment and the state. An agent learning SpaceInvaders effectively
chooses from $\mathcal{A}(s) = \{\emph{\text{noop}}, \emph{\text{fire}}, \emph{\text{left}}, \emph{\text{right}}\}$ for
......@@ -76,8 +75,7 @@ step. If the actions do not differ between states, we write $\mathcal{A}$ to de
\paragraph{Rewards.}
Every time the agent completes an action, it observes the environment. These \emph{observations} consist of two
components. The first component of an observation is the \emph{reward}. Let $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$
denote the reward observed at time step $t + 1$ and let $\mathcal{R}$ denote the set of rewards \cite[p.~48]{sutton18}.
As teachers, we use rewards as our primary means of communicating with an agent. We use them to inform an agent whether
it achieved its goal or not. As stated in chapter \ref{sec:02:agent_environment}, the agent may receive positive rewards
......@@ -111,9 +109,11 @@ determine the path through the respective Markov decision process.
\paragraph{Dynamics.}
Transitions to successor states and the values of rewards associated with these states are determined by the
\emph{dynamics} of the environment. Let the successor state $S_{t+1} = s'$ and the reward $R_{t+1} = r$ be an
observation. Its likelihood given the current state $S_t = s$ and the agent's action $A_t = a$ is determined by the
following probability distribution \cite[p.~48]{sutton18}:
\begin{align}
\label{eqn:dynamics}
&p(s', r\mid s, a) \doteq P(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a), \\
&p: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]. \nonumber
\end{align}
......@@ -128,7 +128,7 @@ edge describes a transition from one state to a successor state. The reward alon
an observation. The probability of the observation is stated with the reward next to its respective edge. We assign a
probability of $0$ to observations not described by an edge. In the displayed Markov decision process, the probability
of observing the state $S_{t+1} = s_2$ as well as the reward $R_{t+1} = 1$ given the current state $S_t = s_1$ and the
action $A_t = \text{\emph{right}}$ is $p(s_2, 1 \mid s_1, \text{\emph{right}}) = 0.8$.
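For illustration, such dynamics can be stored as a lookup table and sampled from, as the following Python sketch shows.
Only the transition $p(s_2, 1 \mid s_1, \text{\emph{right}}) = 0.8$ is taken from figure \ref{fig:intro_mdp}; the
complementary entry with probability $0.2$ is a hypothetical placeholder.
\begin{verbatim}
import random

# p[(s, a)] lists ((s', r), probability) pairs; the probabilities sum to 1.
p = {
    ("s1", "right"): [(("s2", 1), 0.8),   # taken from the example above
                      (("s0", 0), 0.2)],  # hypothetical placeholder
}

def sample_observation(state, action):
    """Draw an observation (S_{t+1}, R_{t+1}) according to p(s', r | s, a)."""
    outcomes, probabilities = zip(*p[(state, action)])
    return random.choices(outcomes, weights=probabilities, k=1)[0]

successor_state, reward = sample_observation("s1", "right")
\end{verbatim}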
\begin{figure}[h]
\centering
......@@ -142,8 +142,9 @@ action $A_t = \text{\emph{left}}$ is $p(s_2, 1 \mid s_1, \text{\emph{left}}) = 0
\end{figure}
\paragraph{Policy.}
An agent's actions are determined by the \emph{policy} function, which \citeA[p.~58]{sutton18} define as follows:
\begin{align}
\label{eqn:policy}
&\pi(a \mid s) \doteq P(A_t = a \mid S_t = s),\\
&\pi: \mathcal{A} \times \mathcal{S} \rightarrow [0, 1], \nonumber
\end{align}
......@@ -157,17 +158,17 @@ proportional to the probability distribution. For example, an agent following th
\emph{noop} twice as often as \emph{fire} on average.
\subsubsection{Finite Horizon and Episodes}
\label{sec:02:episodes}
When an agent and an environment interact with each other over a series of discrete time steps $t = 0, 1, 2, \dots$, we
can observe a sequence of states, actions and rewards $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$. We call this
sequence the \emph{trajectory} \cite[p.~48]{sutton18}. When we train an agent, we often record trajectories of states,
actions and rewards to use as training data by running the agent in the environment.
ATARI 2600 games have a plethora of \emph{terminal states} that indicate the game has come to an end, $\#\{s \mid s
\in \mathcal{S} \land s \text{ is terminal}\} \gg 1$. For example, in Pong the game ends once either player
scored 21 points, whereas the loser's score can have any value in the range of 0 to 20. Furthermore, the paddles may be
in any position at the end of the game, which opens up even more terminal states.
Whenever the agent transitions to a terminal state, it cannot transition to other states anymore. Therefore, there is a
final time step $T$ after which no meaningful information can be gained. We call $T$ the \emph{finite} \emph{horizon};
......
......@@ -8,27 +8,28 @@ Just like the value function, the policy too can be parameterized when the state
$\boldsymbol\theta \in \mathbb{R}^{d'},\;d' \ll \#\mathcal{S}$, denote the policy's parameter vector. Then the
parameterized policy
\begin{align}
\pi(a \mid s, \boldsymbol\theta) \doteq P(A_t = a \mid S_t = s, \boldsymbol\theta_t = \boldsymbol\theta)
\end{align}
returns the probability of taking the action $a$ given the state $s$ and parameter vector
$\boldsymbol\theta$ \cite[p.~321]{sutton18}. Often we write $\pi_{\boldsymbol\theta}$ instead of $\pi(a \mid s,
\boldsymbol\theta)$.
Often $\boldsymbol\theta$ is the parameterization of a function that maps a state to a vector of numerical weights. Each
action corresponds to a weight that indicates the likelihood of picking the respective action. The higher the weight,
the more likely picking its respective action should be. The policy $\pi$ uses the numerical weights to determine a
probability distribution \cite[p.~322]{sutton18}, e.g., by applying a softmax function
\cite[pp.~184--185]{goodfellow2016}. The method used to determine the probability distribution remains fixed. Instead we
optimize the parameterization $\boldsymbol\theta$ so that the likelihood of choosing advantageous actions is
increased.
Let $S_t = s$ and $\mathcal{A}(s) = \{1, 2\}$. A function $h$ parameterized with $\boldsymbol\theta_t$ could return a
vector with two elements, for example $h(s, \boldsymbol\theta_t) = \begin{pmatrix}4\\2\end{pmatrix}$.\footnote{We write
$\left(h(s, \boldsymbol\theta_t)\right)_a$ to obtain the $a$th component of $h(s, \boldsymbol\theta_t)$, e.g.,
$\begin{pmatrix}4\\2\end{pmatrix}_1 = 4$.} The elements of this vector are numerical weights a policy could use to
determine probabilities:
\begin{align*}
\pi(a\mid s, \boldsymbol\theta_t) &= \left(h(s, \boldsymbol\theta_t)\right)_a \cdot
\left(\sum_{b\in\mathcal{A}(s)}\left(h(s, \boldsymbol\theta_t)\right)_b\right)^{-1}\text{, e.g.,} \\
\pi(1\mid s, \boldsymbol\theta_t) &= \begin{pmatrix}4\\2\end{pmatrix}_1 \cdot \left(\begin{pmatrix}4\\2
\end{pmatrix}_1 + \begin{pmatrix}4\\2\end{pmatrix}_2\right)^{-1} = \frac{4}{4 + 2} = \frac{2}{3}.
\end{align*}
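A minimal Python sketch reproduces this normalization; the weights $4$ and $2$ stand in for the components of the
hypothetical preference vector $h(s, \boldsymbol\theta_t)$.
\begin{verbatim}
import numpy as np

def policy_from_weights(weights):
    """Normalize nonnegative action weights h(s, theta) into probabilities."""
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum()

print(policy_from_weights([4, 2]))  # approximately [0.667 0.333], i.e., pi(1|s) = 2/3
\end{verbatim}
A softmax-based parameterization would instead compute \texttt{np.exp(weights) / np.exp(weights).sum()}, which
additionally permits negative weights.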
......@@ -39,6 +40,7 @@ derivatives. With full information of the environment, we may finally search for
policy.
\subsubsection{Gradient Ascent}
\label{sec:02:gradient_ascent}
Policy gradient algorithms seek to optimize the policy directly by performing stochastic gradient ascent
$\nabla_{\boldsymbol\theta} \pi(a \mid s, \boldsymbol\theta)$ to maximize a fitness function $J(\boldsymbol\theta)$. As
......@@ -71,12 +73,12 @@ value to the learn rate $\alpha$, but no further hyperparameters are required to
smooth transition to exploitation.\footnote{In practice a number of other parameters are required to prevent forgetting
and performance collapses (cf.~chapter \ref{sec:03:ppo_motivation}).} Secondly, parameterized policies enable us to
assign arbitrary probabilities to actions. This gives us the opportunity to discover stochastic approximations of optimal
policies.\footnote{A policy that is not capable of doing so is the $\varepsilon$-greedy policy, which chooses the best
available action with a probability of $1 - \varepsilon$ and a random one otherwise \cite[p.~322]{sutton18}.
$\varepsilon$ typically converges to $0$ over the course of the training resulting in a deterministic policy.}
\citeA[p.~323]{sutton18} give an example of a Markov decision process that can only be solved by a stochastic policy.
\citeA[p.~324]{sutton18} point out that combining a parameterized policy with function approximation may pose a
challenge: the performances of both $\pi_{\boldsymbol\theta}$ and $\hat{v}(s, \boldsymbol\omega)$ depend on the action
selections and on the distribution of the states $\mu_{\pi_{\boldsymbol\theta}}(s)$ that these actions are selected in.
Adjusting the policy results in different action choices, which in turn changes $\mu_{\pi_{\boldsymbol\theta}}(s)$.
......@@ -101,7 +103,7 @@ policy gradients with the advantage function \cite{sutton2000}:
Using the advantage function is beneficial, as the policy gradient in equation \ref{eqn:advantage_gradient} has a lower
variance than the gradient in equation \ref{eqn:q_gradient} \cite{gae}.
Similar to $\overline{\text{VE}}$, the gradient is weighted according to the distribution of states $\mu$, but it does not
depend on its derivative. By virtue of weighting with $\mu$, states an agent encounters regularly have a greater effect
on the gradient. Moreover, the magnitude of the gradient is controlled by the advantage of an action: A large advantage
mandates a large gradient step.
......
......@@ -10,6 +10,7 @@ learning approach and algorithm we introduce in chapters \ref{sec:02:policy_grad
For any given time step $t < T$, an agent aims to maximize the remaining sequence of rewards $R_{t+1}, R_{t+2}, \dots,
R_T$. Using this sequence we define the return
\begin{align}
\label{eqn:return}
G_t &\doteq \sum_{k=t+1}^T \gamma^{k-t-1} R_k,\; 0 \leq \gamma \leq 1.
\end{align}
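A short Python sketch makes the recursion behind this sum explicit; the reward sequence in the final line is a
hypothetical example.
\begin{verbatim}
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_{k=t+1}^{T} gamma^(k-t-1) * R_k, where rewards[0]
    holds R_{t+1} and rewards[-1] holds R_T."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g  # G_k = R_{k+1} + gamma * G_{k+1}
    return g

discounted_return([1.0, 0.0, 1.0], gamma=0.9)  # 1.0 + 0.9 * 0.0 + 0.81 * 1.0 = 1.81
\end{verbatim}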
......@@ -34,39 +35,41 @@ states based on the returns we may obtain by using the value function
\end{align}
with $\mathbb{E}_\pi$ denoting the expected value when following the policy $\pi$. The higher the value of a state, the
higher the return an agent is expected to obtain. As $v_\pi(s)$ depends on the policy $\pi$, we must devise a method to
optimize the policy an agent follows. However, to do so we first require the advantage function.
We call the policy that yields the maximum return the optimal policy $\pi_*$ and the corresponding value function the
optimal value function $v_*$. Assuming an optimal policy for the ATARI 2600 game Pong, $v_*(S_0) = 21$ with $\gamma =
1$, as the game ends after either player scored 21 times. Furthermore, an optimal Pong player always deflects the ball.
\subsubsection{Advantage Function}
\label{sec:02:advantage}
The value of a state depends on the expected behavior of an agent following the policy $\pi$. In order to increase the
probability of obtaining a high return, we require a means to identify actions that prove advantageous or
disadvantageous in this endeavor. The required functionality is provided by the \emph{advantage function}
\cite{baird1993, sutton2000}\footnote{The advantage function is commonly defined using the $q$ function: $a_\pi(s, a)
\doteq q_\pi(s,a) - v_\pi(s)$. The definition we provide is equivalent, as it merely substitutes $q_\pi(s,a)$ with the term
it is defined to be.}
\begin{align}
\label{eqn:advantage_function}
a_\pi(s, a) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left(r + \gamma v_\pi(s') \right)\right] - v_\pi(s) \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] - v_\pi(s).
\end{align}
The advantage function compares the expected immediate reward and return of a successor state $r + \gamma v_\pi(s')$ with the
expected return of the current state $v_\pi(s)$. As the environment may be non-deterministic, the former term has to be
weighted with its probability.
If $a_\pi(s, a) = 0$, which means $\sum_{s', r}\left[ p(s', r \mid s, a) \cdot (r + \gamma v_\pi(s')) \right] =
v_\pi(s)$, choosing the action $a$ more frequently does not improve the policy. However, if $a_\pi(s, a) > 0$, the given
action yields larger returns than the actions commonly chosen under $\pi$; increasing the likelihood of choosing the
action $a$ improves the policy. Actions that return a negative advantage---$a_\pi(s, a) < 0$---lead to worse results
than simply following $\pi$ would. If $a_\pi(s, a) \le 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}$, the policy is
optimal.
According to \citeA{gae}, a common approximation of the advantage function is provided by the estimator
\begin{align}
\label{eqn:delta1}
\delta_t \doteq R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t).
\end{align}
......
......@@ -6,24 +6,24 @@ Optimization, as an agent can learn quicker than an agent that trains on a traje
More precisely, each training iteration consists of $K$ epochs. In each epoch, the entire data set is split into
randomly drawn disjoint minibatches of size $M$.\footnote{For a detailed explanation of minibatch gradient methods,
refer to the work of \citeA[chapter 8.1.3]{goodfellow2016}.} This approach alleviates the issue of learning on highly
correlated observations, as observed by \citeA{dqn}.
We further ensure that we learn on sufficiently independent observations by running $N$ \emph{actors} at the same
time.\footnote{\emph{Actor} is the technical term used to describe agents that are run in parallel. Often, these agents
follow the same policy $\pi_{\boldsymbol\theta}$, as is the case with Proximal Policy Optimization.} Although each actor
is embedded in its own environment, the $N$ environments are described by the same Markov decision process \cite{a3c}.
However, the environments are independent and therefore give rise to different trajectories. The observations of all
actors are then combined to optimize the policy $\pi_{\boldsymbol\theta}$ that all $N$ actors follow.
PPO---like many other reinforcement learning algorithms---operates in two phases. In the first phase, an agent interacts with its
environment and generates a \emph{rollout} $\tau$. $\tau$ contains not only the states and rewards the agent observed, but also
the chosen actions, their respective probabilities and the values of the states. This phase can be seen in lines 5--17
of algorithm \ref{alg:ppo}.
In the second phase, the value function approximation and the policy are optimized with the data the agent collected.
These two steps are repeated for a certain time or until the value function and policy converge to their respective
optimal functions. Afterwards, the learn rate $\alpha$ and the clipping parameter $\epsilon$ are decreased so that they
linearly approach $0$ over the course of the training; the trust region shrinks. Lines 18--25 of algorithm \ref{alg:ppo} show this phase.
\begin{algorithm}[ht]
\caption{Proximal Policy Optimization, modified from \protect\citeauthor{ppo}'s \protect\citeyear{ppo} and
......@@ -73,6 +73,6 @@ rollout---an episode might terminate earlier or it might not terminate at all. W
\boldsymbol\omega)$ of the final state included in the rollout.
\item If an episode terminates early, we only include rewards and values up until the episode terminated. Let
$T_\text{episode}$ denote this time step. Then all advantage and return estimations for time steps $t \le
T_\text{episode}$ use the rollout time step that corresponds to $T_\text{episode}$ as the upper bound of
summation (cf.~equations \ref{eqn:gae} and \ref{eqn:lambda_return}).
\end{itemize}
......@@ -2,14 +2,18 @@ Both the loss described in equation \ref{eqn:naive_loss} and the loss proposed b
estimations. However, to estimate advantages value estimations are required. By definition (cf.~equation
\ref{eqn:value_function}), the value function can be estimated using the return.
Both the advantage estimator (cf.~equation \ref{eqn:delta2}) and the return (cf.~equation \ref{eqn:return}) are
suboptimal, as they suffer from being biased and having high variance, respectively. We further explain these issues and
introduce advanced methods that alleviate them in this chapter.
\todo[inline]{insert overview of Loss using advantage using value using return using rewards}
\subsubsection{Generalized Advantage Estimation}
\label{sec:03:gae_gae}
In chapter \ref{sec:02:advantage} we introduced the advantage function $a_\pi(s, a)$ and an estimator $\delta$ that we
combined with function approximation (cf.~equations \ref{eqn:a_hat} and \ref{eqn:delta2} in chapter
\ref{sec:02:function_approximation}):
\begin{align*}
\hat{a}_\pi(s, a, \boldsymbol\omega) &\doteq \mathbb{E}_\pi\left[R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) \mid
S_t = s, A_t = a \right] - \hat{v}_\pi(s, \boldsymbol\omega),\\
......@@ -76,6 +80,7 @@ commonly used with finite-horizon Markov decision processes, too. In the above d
at the cost of adding minor inaccuracy. The inaccuracy grows larger as the time step $t$ approaches the horizon $T$.
\subsubsection{$\lambda$-return}
\label{sec:03:lambda_return}
The same approach can be applied to return estimation. Remember that we defined the return to be
\begin{align*}
......
Deep learning methods achieved success on a variety of tasks, such as computer vision and speech recognition
\cite[chapter 1.2.4]{goodfellow2016}. For this reason, \citeA{dqn} combined reinforcement learning techniques and deep
learning methods in an algorithm called DQN. The benefits of using neural networks to parameterize the value function
and---occasionally---the policy were demonstrated by DQN and further \emph{deep reinforcement learning} algorithms,
e.g., by A3C \cite{a3c} and Rainbow DQN \cite{rainbow}.
The algorithm introduced in this chapter is called \emph{Proximal Policy Optimization} (PPO) \cite{ppo}. Because PPO is a deep
reinforcement learning algorithm, we adjust notation and replace our gradient estimator $\hat{g}$. Instead we utilize a
loss $\mathcal{L}$, as is common in deep learning \cite[chapter 4.3]{goodfellow2016}:
\begin{align}
\label{eqn:naive_loss}
\mathcal{L}(\boldsymbol\theta) \doteq
\mathbb{E}_{a,s\sim\pi_{\boldsymbol\theta}}\left[\hat{a}_{\pi_{\boldsymbol\theta}}(s, a, \boldsymbol\omega)\pi_{\boldsymbol\theta}(a\mid
s)\right]
\end{align}
The gradient estimator $\hat{g}$ can be obtained by differentiating the loss in equation \ref{eqn:naive_loss}. Unlike in
deep learning, the loss notation is also used to denote objectives that shall be maximized rather than minimized.
Therefore, it depends on the specific objective whether gradient ascent or gradient descent is performed.
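To make the notation concrete, the loss in equation \ref{eqn:naive_loss} can be estimated from a batch of sampled
state-action pairs. The following NumPy sketch assumes that the advantage estimates and action probabilities have
already been computed; in practice, an automatic differentiation framework supplies the gradient $\hat{g}$.
\begin{verbatim}
import numpy as np

def naive_policy_loss(advantage_estimates, action_probabilities):
    """Sample estimate of L(theta) = E[a_hat(s, a, w) * pi(a | s, theta)]."""
    advantages = np.asarray(advantage_estimates, dtype=float)
    probabilities = np.asarray(action_probabilities, dtype=float)
    return float(np.mean(advantages * probabilities))
\end{verbatim}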
We begin by highlighting issues with the gradient estimator from equation \ref{eqn:gradient_estimator} and motivate the
use of advanced algorithms such as Proximal Policy Optimization. Then we introduce sophisticated advantage and return
estimators as used in many modern policy gradient methods. Afterwards the loss is introduced and explained. We close by
providing the complete Proximal Policy Optimization algorithm.
Proximal Policy Optimization consists of two components: A loss and an algorithm that incorporates this loss. In the
following sections we introduce and explain the clipped Proximal Policy Optimization loss. Afterwards, we reveal the
learning algorithm that uses this loss to optimize a policy.
\subsubsection{Background}
......@@ -11,8 +13,8 @@ However, this algorithm can only be applied to mixed policies of the form
\end{align*}
where $\pi^{(0)}$ and $\pi^{(1)}$ are different policies and $\alpha < 1$.
\citeA{trpo} prove that this method can be applied to stochastic policies---as defined in equation
\ref{eqn:policy}---by changing the lower bound to a \emph{Kullback-Leibler} (KL) \emph{divergence}. The KL
divergence can be used to assess how much two probability distributions deviate from one another
\cite[pp.~74--75]{goodfellow2016}. They further transform the loss to a likelihood ratio of the candidate for the new
policy and the current policy (similar to equation \ref{eqn:likelihood_ratio}) that is penalized using the KL
......@@ -21,8 +23,8 @@ divergence. This way they incentivize finding a new policy that is close to the
As the new loss still imposes a lower bound on the fitness $J(\boldsymbol\theta)$, it is a minorant to the fitness
function. The authors prove that maximizing this minorant guarantees a monotonically rising fitness
$J(\boldsymbol\theta)$; given sufficient optimization steps the algorithm converges to a local
optimum.\footnote{Algorithms like this one are called MM algorithms. This one is a \emph{minorize maximization} algorithm
\cite{hunter2004}.} The ensuing objective is called a \emph{surrogate} objective.
The mathematically proven algorithm is computationally demanding, as it needs to compute the KL divergence on all states
in the state space for each optimization step. Hence, \citeA{trpo} perform multiple approximations. The resulting
......@@ -62,7 +64,8 @@ $\rho_t(\boldsymbol\theta)$, we determine if the probability of taking the actio
decreased. For example, if $\rho_t(\boldsymbol\theta) > 1$, the action is more likely under $\pi_{\boldsymbol\theta}$
than it is under $\pi_{\boldsymbol\theta_\text{old}}$. We note that $\rho_t(\boldsymbol\theta_\text{old}) = 1$.
A loss derived from \emph{Conservative Policy Iteration} is constructed by multiplying the likelihood ratio and
advantage estimations \cite{ppo}:
\begin{align}
\label{eqn:loss_cpi}
\mathcal{L}^\text{CPI}(\boldsymbol\theta) \doteq \mathbb{E}_{a,s\sim\pi_{\boldsymbol\theta_\text{old}}} \left[
......@@ -138,7 +141,9 @@ This issue is solved by taking an elementwise minimum:
\right].
\end{align}
In practice, the advantage function $\hat{a}_{\pi_{\boldsymbol\theta_\text{old}}}$ is replaced with Generalized
Advantage Estimations $\delta_t^{\text{GAE}(\gamma,\lambda)}$ (cf.~equation \ref{eqn:gae}). Often,
$\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$ is called a surrogate objective, as it approximates the original objective
$J(\boldsymbol\theta)$ but is not identical.
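The clipped objective translates into a few lines of code. The following NumPy sketch mirrors common implementations
rather than the reference code of \citeA{ppo}; per-time-step action probabilities and advantage estimates are assumed
to be given.
\begin{verbatim}
import numpy as np

def clipped_surrogate(new_probs, old_probs, advantages, epsilon=0.1):
    """L^CLIP: elementwise minimum of the unclipped and the clipped ratio term."""
    ratio = np.asarray(new_probs, dtype=float) / np.asarray(old_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))
\end{verbatim}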
Figure \ref{fig:min_clipping} compares $\text{clip}(\rho_t(\boldsymbol\theta), \epsilon) \cdot \delta$ and
$\mathcal{L}^\text{CLIP}$. Using an elementwise minimum has the following effect:
......@@ -152,7 +157,7 @@ $\mathcal{L}^\text{CLIP}$. Using an elementwise minimum has the following effect
\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{03_ppo/clipping.pdf}
\caption{The loss for an action with positive advantage can be seen on the left side, whereas the loss for an action with
a negative advantage is shown on the right side. By using a minimum in $\mathcal{L}^\text{CLIP}$, we ensure that we
can correct errors we made in previous gradient steps: We can raise the probability of actions with positive
advantage even when $\rho_t(\boldsymbol\theta) \le 1 - \epsilon$. Vice versa, we can decrease the probability of
......@@ -173,7 +178,7 @@ $\epsilon$.
\subsubsection{Value Function Loss}
\label{sec:03:loss_value}
In order to determine advantages, we need to know the value function and therefore require a means to learn it. As we
perform function approximation, the value function is parameterized by $\boldsymbol\omega$, a neural network. To learn
reliable value estimations, we perform stochastic gradient descent
\begin{align*}
......@@ -186,6 +191,7 @@ We estimate the value function by recording a trajectory and calculating $\lambd
encountered. Then, the loss is the mean squared error of value estimations of $\boldsymbol\omega_t$ and the observed
$\lambda$-returns \cite{ppo}:
\begin{align}
\label{eqn:lvf}
\mathcal{L}^\text{VF}(\boldsymbol\omega) \doteq \frac{1}{2} \cdot
\mathbb{E}_{s,G\sim\pi_{\boldsymbol\theta_\text{old}}}\left[(\hat{v}_{\pi_{\boldsymbol\theta}}(s, \boldsymbol\omega) -
G)^2\right],
......@@ -194,29 +200,31 @@ with $G$ being $\lambda$-returns calculated from rewards observed by an agent fo
(cf.~equation \ref{eqn:lambda_return}).
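As a minimal sketch, this loss amounts to half a mean squared error between the value estimations and the recorded
$\lambda$-returns:
\begin{verbatim}
import numpy as np

def value_function_loss(value_estimates, lambda_returns):
    """L^VF: half the mean squared error between v_hat(s, omega) and G."""
    difference = (np.asarray(value_estimates, dtype=float)
                  - np.asarray(lambda_returns, dtype=float))
    return 0.5 * float(np.mean(difference ** 2))
\end{verbatim}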
\subsubsection{Shared Parameterization Loss}
\label{sec:03:shared_loss}
When combining function approximation with policy parameterization, the value function and the policy may share the same
parameters. $\hat{v}(s, \boldsymbol\theta)$ and $\pi(a\mid s, \boldsymbol\theta)$ share the same neural network
architecture and weights with differing output layers (cf. chapter \ref{sec:04:architecture}). We choose to share
parameters because we require less computation power, as only one neural network needs to be trained and executed.
In this case, the loss must contain a loss for the policy and a loss for the value function. Since we perform
gradient ascent on the policy loss but gradient descent on the value function loss, $\mathcal{L}^\text{VF}$ is
subtracted from $\mathcal{L}^\text{CLIP}$. \citeA{ppo} propose the following loss:
\begin{align}
\label{eqn:lshared}
\mathcal{L}^{\text{CLIP}+\text{VF}+S}(\boldsymbol\theta) \doteq \mathcal{L}^\text{CLIP}(\boldsymbol\theta) -
c_1 \cdot \mathcal{L}^\text{VF}(\boldsymbol\theta) + c_2 \cdot
\mathbb{E}_{s\sim\pi_{\boldsymbol\theta}}\left[S\left[\pi_{\boldsymbol\theta}\right](s)\right],
\end{align}
with $c_1$ and $c_2$ being hyperparameters. $c_1$ controls the impact of the value function loss
$\mathcal{L}^\text{VF}$, whereas $c_2$ adjusts the impact of the \emph{entropy bonus} $S$.
$S$ denotes an entropy bonus encouraging exploration, which addresses an issue raised by \citeA{kakade2002}: Policy
gradient methods commonly transition to exploitation too early, resulting in suboptimal policies. The closer the
distribution $\pi$ is to a uniform distribution, the larger its entropy is. If all actions are assigned the same
probability, an agent following $\pi$ explores properly. An agent following a deterministic policy does not explore and
the entropy of this policy will be $0$.
Hence, the entropy bonus naturally declines over the course of the training as the policy transitions from exploration
to exploitation and approaches a (locally) optimal policy. A common choice for $S$ is to determine the mean entropy of
$\pi_{\boldsymbol\theta}$ over all observed states.
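A sketch of the combined loss follows; \texttt{clip\_loss} and \texttt{vf\_loss} stand for the quantities defined in the
previous sections, and the default coefficients are common choices rather than values prescribed by this thesis.
\begin{verbatim}
import numpy as np

def mean_entropy(action_probabilities):
    """Mean entropy of a batch of categorical action distributions.

    action_probabilities has shape (batch, num_actions); rows sum to 1."""
    probs = np.asarray(action_probabilities, dtype=float)
    return float(np.mean(-np.sum(probs * np.log(probs + 1e-8), axis=1)))

def shared_loss(clip_loss, vf_loss, action_probabilities, c1=0.5, c2=0.01):
    """L^{CLIP+VF+S}: to be maximized by stochastic gradient ascent."""
    return clip_loss - c1 * vf_loss + c2 * mean_entropy(action_probabilities)
\end{verbatim}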
Policy gradient estimators such as the one introduced in equation \ref{eqn:gradient_estimator} in chapter
\ref{sec:02:policy_gradients} or REINFORCE\footnote{A well-known reinforcement learning algorithm that is a simple
improvement over the estimator we introduced.} by \citeA{reinforce} suffer from a significant drawback. As
\citeA{kakade2002} demonstrate, some tasks require that we record long trajectories to guarantee a policy is improved
when performing stochastic gradient descent. The longer the trajectories are the higher their variance grows, as both
the dynamics $p$ of the environment and the policy $\pi$ introduce non-deterministic behavior.
As the variance of the trajectories grows, the quality of the gradient estimator declines. Performing stochastic
gradient ascent with this estimator is no longer guaranteed to yield an improved policy---in fact the performance of the
......
Since Proximal Policy Optimization executes $N$ actors simultaneously, the training runtime can be greatly reduced
on multicore processors by parallelizing the execution. In this thesis, each actor operates in a dedicated process. They
are coordinated by a parent process that gathers the observations and optimizes the policy and the value function.
The child processes communicate with the parent process through pipes. The actors transmit observations consisting of their
environment's current state, the last reward and information on the remaining lives to the parent process. Afterwards,
they await an action to perform.
\todo[inline]{maybe try a shared memory setup}
The parent process determines each actor's action by sampling from the policy using the respective environment's state.
It creates rollouts from the actors' observations, the actions and their respective probabilities as well as value
estimations. After rollout generation has concluded, the parent process computes the loss and performs stochastic
gradient ascent.
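A stripped-down sketch of the child process side of this setup is given below; the environment interface
(\texttt{make\_env}, \texttt{reset}, \texttt{step}) is a hypothetical stand-in and deliberately simplified compared to
the actual implementation.
\begin{verbatim}
import multiprocessing as mp

def actor_worker(connection, make_env):
    """Child process: step a dedicated environment and report observations."""
    env = make_env()                       # hypothetical environment factory
    state = env.reset()
    connection.send((state, 0.0, False))   # initial state, reward, done flag
    while True:
        action = connection.recv()         # wait for the parent's action
        if action is None:                 # sentinel value: shut down
            break
        state, reward, done = env.step(action)
        if done:
            state = env.reset()
        connection.send((state, reward, done))
    connection.close()

# Parent side (sketch): one pipe and one process per actor.
# parent_end, child_end = mp.Pipe()
# worker = mp.Process(target=actor_worker, args=(child_end, make_env))
\end{verbatim}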
......@@ -38,7 +38,7 @@ Some games like Breakout provide the player with several attempts, often called
they may resume the game, usually at the cost of a reduced score. \citeA{nature_dqn} propose ending a training episode
once a life is lost. As PPO operates on a steady number of samples, this approach is modified slightly. Instead of
ending the episode, we simulate the end of the game. Thus, the return and advantage calculations are cut off at this time
step. Finally, we reset the observation stack as detailed in the operation \emph{observation stacking}.
This post-processing operation is applied to some of the games chosen for this thesis only (see chapter
\ref{sec:05:ale} for more information on the selected games). BeamRider, Breakout, Seaquest and SpaceInvaders provide
......@@ -62,7 +62,7 @@ repository reveals that rewards are binned \cite[baselines/common/atari\_wrapper
\begin{align}
\phi_r(r) \doteq \sign r
\end{align}
We discuss both choices in chapter \ref{sec:05:discussion_optims}.
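In code, the binning is a one-liner, as the following NumPy sketch shows:
\begin{verbatim}
import numpy as np

def bin_reward(reward):
    """phi_r(r) = sign(r): map any raw ATARI reward to -1, 0 or +1."""
    return float(np.sign(reward))
\end{verbatim}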
\paragraph{Observation stacking.}
Finally, the four most recent maximized images seen by an agent are combined to a tensor with shape $4 \times 84 \times
......
Although it is neither mentioned nor motivated by \citeA{ppo}, clipping is applied to the value function loss as well
\cite{ilyas2018}. Let $\text{clip}_v$ denote a clipping function similar to equation \ref{eqn:clip}:
\begin{align}
\label{eqn:value_clip}
\text{clip}_v(\boldsymbol\omega, \boldsymbol\omega_\text{old}, \epsilon, S_t) \doteq
\begin{cases}
\hat{v}_\pi(S_t, \boldsymbol\omega_\text{old}) - \epsilon &\text{if } \hat{v}_\pi(S_t, \boldsymbol\omega) \le
......@@ -13,6 +14,7 @@ Although it is neither mentioned nor motivated by \citeA{ppo}, clipping is appli
Then the clipped value function loss $\mathcal{L}^\text{VFCLIP}$ is defined to be
\begin{align}
\label{eqn:lvfclip}
\mathcal{L}^\text{VFCLIP}(\boldsymbol\omega) \doteq \max
\left[
\mathcal{L}^\text{VF}(\boldsymbol\omega),\mathbb{E}_{s, G\sim\pi}
......@@ -28,5 +30,8 @@ Then the clipped value function loss $\mathcal{L}^\text{VFCLIP}$ is defined to b
$\epsilon$ is the same hyperparameter that is used to clip the likelihood ratio in $\mathcal{L}^\text{CLIP}$
(cf.~equation \ref{eqn:lclip} in chapter \ref{sec:03:policy_loss}).
Intuitively, this approach may be similar to clipping the probability ratio. To avoid gradient collapses, a trust region
is created with the clipping parameter $\epsilon$. Then an elementwise maximum is taken, so errors from previous
gradient steps can be corrected. A maximum is applied instead of a minimum because the value function loss is minimized.
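A NumPy sketch of the clipped value function loss mirrors the structure of $\mathcal{L}^\text{CLIP}$; it follows
equation \ref{eqn:lvfclip} and common implementations rather than any particular reference code.
\begin{verbatim}
import numpy as np

def clipped_value_loss(values, old_values, lambda_returns, epsilon=0.1):
    """L^VFCLIP: elementwise maximum of unclipped and clipped squared errors."""
    values = np.asarray(values, dtype=float)
    old_values = np.asarray(old_values, dtype=float)
    returns = np.asarray(lambda_returns, dtype=float)
    # clip_v: keep the new estimate within epsilon of the old estimate
    clipped_values = old_values + np.clip(values - old_values, -epsilon, epsilon)
    unclipped_error = (values - returns) ** 2
    clipped_error = (clipped_values - returns) ** 2
    return 0.5 * float(np.mean(np.maximum(unclipped_error, clipped_error)))
\end{verbatim}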
Further analysis of the ramifications of optimizing a surrogate loss for the value function is provided by
\citeA{ilyas2020}.