
Commit ced7fcb3 authored by Daniel Lukats

added a few missing figures

parent dc2664d6
@@ -2,14 +2,14 @@ Although \gls{ppo} may be used with any differentiable function, neural networks
e.g., in the baselines repository by \citeA[baselines/common/models.py]{baselines}. With few exceptions, most deep
policy gradient algorithms use the architecture established by \citeA{nature_dqn} when learning ATARI 2600 games.
\begin{figure}[ht]
\begin{sidewaysfigure}
\centering
\includegraphics[width=\textwidth]{04_implementation/architecture.pdf}
\caption{This figure shows the architecture used in this thesis. The first four execution steps---from left to
right---use a ReLU. The fourth layer feeds into two output layers. The action output utilizes a softmax
right---use a rectified linear unit. The fourth layer feeds into two output layers. The action output utilizes a softmax
to generate a probability distribution, whereas the value function output has no particular activation function.}
\label{fig:architecture}
\end{figure}
\end{sidewaysfigure}
This architecture is composed of three convolutional layers\footnote{Convolutional neural networks are explained in
detail by \citeA[chapter 9]{goodfellow2016}} followed by a linear layer\footnote{Also known as dense layers,
@@ -26,7 +26,8 @@ fully-connected layers or occasionally multi-layer perceptrons.} and two output
2 & 64 & $4 \times 4$ & 2 \\
3 & 64 & $3 \times 3$ & 1
\end{tabular}
\caption{The neural network contains 3 layers, that utilize different numbers of filters, kernel sizes and strides.}
\caption{The neural network contains three convolutional layers that utilize different numbers of filters, kernel
sizes and strides.}
\label{table:cnn_layers}
\end{table}
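To make the layer dimensions concrete, the following PyTorch sketch reproduces this architecture, including the
orthogonal initialization discussed in section \ref{sec:04:initialization}. It is a minimal illustration rather than
the exact code of this thesis: the hidden size of 512 units in the linear layer and the zero-initialized biases are
assumptions based on common implementations such as \citeA{baselines}.
\begin{verbatim}
import torch
import torch.nn as nn

class AtariActorCritic(nn.Module):
    def __init__(self, num_actions, hidden=512):
        super().__init__()
        # three convolutional layers as listed in the table above
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            # linear (dense) layer; 84x84 inputs shrink to 7x7 feature maps
            nn.Linear(64 * 7 * 7, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # softmax output
        self.value_head = nn.Linear(hidden, 1)             # no activation
        for module in self.modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                nn.init.orthogonal_(module.weight)  # orthogonal initialization
                nn.init.zeros_(module.bias)         # assumed zero biases

    def forward(self, observation):
        # observation: stacked frames with shape (batch, 4, 84, 84)
        x = self.features(observation)
        action_probs = torch.softmax(self.policy_head(x), dim=-1)
        value = self.value_head(x).squeeze(-1)
        return action_probs, value
\end{verbatim}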
Algorithm \ref{alg:ppo_full} shows the complete Proximal Policy Optimization Algorithm when learning ATARI 2600 games.
Notable changes from algorithm \ref{alg:ppo} in chapter \ref{sec:03:algorithm} are the inclusion of the postprocessing
$\phi$, the introduction of orthogonal initialization, the replacement of $\mathcal{L}^\text{VF}$ with
$\mathcal{L}^\text{VFCLIP}$ and the use of gradient clipping.
$\phi$ (lines 7, 13 and 14), the introduction of orthogonal initialization (line 1), the replacement of
$\mathcal{L}^\text{VF}$ with $\mathcal{L}^\text{VFCLIP}$ and the use of gradient clipping (both in line 23).
\begin{algorithm}[ht]
\caption{Full Proximal Policy Optimization for ATARI 2600 games, modified from \protect\citeauthor{ppo}'s
\protect\citeyear{ppo} and \protect\citeauthor{peng2018}'s \protect\citeyear{peng2018} publications}
\protect\citeyear{ppo} and \protect\citeauthor{peng2018}'s \protect\citeyear{peng2018} publications. Lines commented
with a $*$ are changed from algorithm \ref{alg:ppo}.}
\label{alg:ppo_full}
\begin{algorithmic}[1]
\Require number of iterations $I$, rollout horizon $T$, number of agents $N$, number of epochs $K$, minibatch size $M$, discount $\gamma$, GAE weight
\Require number of iterations $I$, rollout horizon $T$, number of actors $N$, number of epochs $K$, minibatch size $M$, discount $\gamma$, GAE weight
$\lambda$, learn rate $\alpha$, clipping parameter $\epsilon$, coefficients $c_1, c_2$, postprocessing $\phi$
\State $\boldsymbol\theta \gets $ orthogonal initialization
\State $\boldsymbol\theta \gets $ orthogonal initialization \Comment{$*$}
\State Initialize environments $E$
\State Number of minibatches $B \gets N \cdot T\; / \;M$
\For{iteration=$1, 2, \dots, I$}
\State $\tau \gets $ empty rollout
\For{agent=$1, 2, \dots, N$}
\State $s \gets $ $\phi(\text{current state of environment }E_\text{agent})$
\For{actor=$1, 2, \dots, N$}
\State $s \gets $ $\phi(\text{current state of environment }E_\text{actor})$ \Comment{$*$}
\State Append $s$ to $\tau$
\For{step=$1, 2, \dots, T$}
\State $a \sim \pi_{\boldsymbol\theta}(a\mid s)$
\State $\pi_{\boldsymbol\theta_\text{old}}(a\mid s) \gets \pi_{\boldsymbol\theta}(a\mid s)$
\State Execute action $a$ in environment $E_\text{agent}$
\State $s \gets $ $\phi_s(\text{successor state})$
\State $r \gets $ $\phi_r(\text{reward})$
\State Execute action $a$ in environment $E_\text{actor}$
\State $s \gets $ $\phi_s(\text{successor state})$ \Comment{$*$}
\State $r \gets $ $\phi_r(\text{reward})$ \Comment{$*$}
\State Append $a, s, r, \pi_{\boldsymbol\theta_\text{old}}(a\mid s),
\hat{v}_\pi (s, \boldsymbol\theta_\text{old})$ to $\tau$
\EndFor
\EndFor
\State Compute Generalized Advantage Estimations $\delta_t^{\text{GAE}(\gamma, \lambda)}$
\State Normalize advantages
\State Normalize advantages \Comment{$*$}
\State Compute $\lambda$-returns $G_t^\lambda$
\For{epoch=$1, 2, \dots, K$}
\For{minibatch=$1, 2, \dots, B$}
\State $\boldsymbol\theta \gets \boldsymbol\theta + \text{clip}(\alpha \cdot
\nabla_{\boldsymbol\theta}\;\mathcal{L}^{\text{CLIP}+\text{VFCLIP}+S}(\boldsymbol\theta))$ on
minibatch with size $M$
minibatch with size $M$ \Comment{$*$}
\EndFor
\EndFor
\State Anneal $\alpha$ and $\epsilon$ linearly
@@ -43,4 +44,28 @@ $\mathcal{L}^\text{VFCLIP}$ and the use of gradient clipping.
\end{algorithmic}
\end{algorithm}
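The following sketch illustrates how the update in line 23 can be realized in PyTorch: the clipped surrogate objective,
the clipped value function loss and the entropy bonus are combined into a single loss that is minimized, which is
equivalent to the gradient ascent written in the pseudocode, and gradient clipping is applied to the gradient norm
(cf.\ the maximum gradient norm in table \ref{table:hyperparameters}). The minibatch field names are assumptions made
for illustration, not the interface used in this thesis.
\begin{verbatim}
import torch

def ppo_minibatch_update(policy, optimizer, batch,
                         epsilon=0.1, c1=0.5, c2=0.01, max_grad_norm=0.5):
    probs, values = policy(batch["states"])
    dist = torch.distributions.Categorical(probs)
    # probability ratio between the current and the old policy
    ratio = dist.log_prob(batch["actions"]).exp() / batch["old_probs"]

    # clipped surrogate objective
    advantages = batch["advantages"]
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages)

    # clipped value function loss
    returns, old_values = batch["returns"], batch["old_values"]
    clipped_values = old_values + torch.clamp(values - old_values, -epsilon, epsilon)
    value_loss = torch.max((values - returns) ** 2,
                           (clipped_values - returns) ** 2)

    # entropy bonus encourages exploration
    entropy = dist.entropy()

    loss = (-surrogate + c1 * value_loss - c2 * entropy).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
\end{verbatim}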
\todo[inline]{put all hyperparameters here?}
Table \ref{table:hyperparameters} lists popular values for all hyperparameters \cite{baselines, kostrikov}. This
configuration differs slightly from the one used by \citeA{ppo}, as the authors trained agents for $K = 3$ epochs with a
value function loss coefficient of $c_1 = 1.0$.
\begin{table}[ht]
\centering
\begin{tabular}{l|l}
Hyperparameter & Value \\
\hline
Rollout horizon $T$ & 128 \\
Number of actors $N$ & $8$ \\
Iterations $I$ & $9765$ \\
Number of epochs $K$ & 4 \\
Minibatch size $M$ & $256$ \\
Learn rate $\alpha$ & $2.5 \cdot 10^{-4}$ \\
Discount $\gamma$ & $0.99$ \\
Variance tuning parameter $\lambda$ & $0.95$ \\
Clipping parameter $\epsilon$ & $0.1$ \\
Value function coeff. $c_1$ & $0.5$ \\
Entropy coeff. $c_2$ & $0.01$ \\
Maximum gradient norm & $0.5$ \\
\end{tabular}
\caption{This table contains the most commonly used configuration of Proximal Policy Optimization for ATARI 2600
games.}
\label{table:hyperparameters}
\end{table}
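As a sanity check, the values in table \ref{table:hyperparameters} are consistent with each other: with $N = 8$ actors
and a rollout horizon of $T = 128$, each iteration gathers $N \cdot T = 1024$ transitions, which a minibatch size of
$M = 256$ splits into $B = N \cdot T / M = 4$ minibatches per epoch. Over $I = 9765$ iterations the actors therefore
experience $I \cdot N \cdot T \approx 10^7$ time steps, which corresponds to the training budget frequently used for
ATARI 2600 games.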
\section{Realization}
\section{From Theory to Application}
\label{sec:04:implementation}
\input{04_implementation/introduction}
@@ -18,7 +18,7 @@
\label{sec:04:initialization}
\input{04_implementation/initialization}
\subsection{Value Clipping}
\subsection{Value Function Loss Clipping}
\label{sec:04:value_clipping}
\input{04_implementation/value_clipping}
......@@ -65,15 +65,24 @@ repository reveals that rewards are binned \cite[baselines/common/atari\_wrapper
We evaluate both choices in chapter TODO ref 5.
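For reference, a minimal sketch of such a binning operation is shown below; it maps every reward to its sign, so the
magnitude of the original game score is discarded.
\begin{verbatim}
import numpy as np

def bin_reward(reward):
    # bin rewards into {-1, 0, +1} by taking their sign;
    # the magnitude of the original game score is lost
    return float(np.sign(reward))
\end{verbatim}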
\paragraph{Observation stacking.}
Finally, the four most recent maximized images seen by an agent are combined to a tensor with shape $4 \times 84 \times 84$.
This operation is often called observation stacking or frame stacking. Although no explanation is given by
\citeA{nature_dqn}, one advantage is at hand: by stacking images we provide an agent further information on the state of
the game such as direction and movement. If the agent would see one frame only, it could not determine which direction
the ball is moving in Pong. By showing it four frames at once, the agent can discern if the ball is moving towards it or
the enemy or whether it will hit a wall or not.
Finally, the four most recent maximized images seen by an agent are combined to a tensor with shape $4 \times 84 \times
84$. This operation is often called observation stacking or frame stacking. At the beginning of an episode, all
elements of the tensor are set to 0, which represents the color black. The same applies to simulated episode ends and
beginnings as performed by the \emph{episodic life} operation.
Although no explanation is given by \citeA{nature_dqn}, one advantage is evident: by stacking images we provide an agent
with further information on the state of the game, such as direction and movement. If the agent saw only a single frame,
it could not determine in which direction the ball is moving in Pong. By showing it four frames at once, the agent can
discern whether the ball is moving towards itself or the opponent and whether it will hit a wall (cf.~figure
\ref{fig:full_state}).
\begin{figure}[h]
\centering
\includegraphics[scale=1]{04_implementation/state.png}
\caption{A state of Pong consists of the four most recent transformed frames; the oldest frame is to the left. The
movement of the ball can easily be seen in these four frames.}
\label{fig:full_state}
\end{figure}
Because each frame is converted to grayscale and scaled to $[0, 1]$, the set of states $\mathcal{S}$ is a subset of
$[0, 1]^{4 \times 84 \times 84}$.
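The following sketch illustrates how such an observation stack can be maintained; the class name and the use of a
double-ended queue are illustrative assumptions rather than the implementation of this thesis.
\begin{verbatim}
from collections import deque

import numpy as np

class FrameStack:
    def __init__(self, num_frames=4, shape=(84, 84)):
        self.frames = deque(maxlen=num_frames)
        self.shape = shape
        self.reset()

    def reset(self):
        # at (simulated) episode starts the stack holds only zeros, i.e. black frames
        for _ in range(self.frames.maxlen):
            self.frames.append(np.zeros(self.shape, dtype=np.float32))

    def push(self, frame):
        # frame: a grayscale image scaled to [0, 1] with shape (84, 84);
        # the oldest of the four stored frames is dropped automatically
        self.frames.append(frame)
        return np.stack(self.frames, axis=0)  # observation of shape (4, 84, 84)
\end{verbatim}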