
Commit e8a88d22 authored by Daniel Lukats

more polishing

parent 499ab6e9
\thispagestyle{empty}
\section*{Eidesstattliche Erklärung}
Ich versichere, die von mir vorgelegte Arbeit selbstständig verfasst zu haben. Alle Stellen, die wörtlich oder
......
......@@ -10,34 +10,37 @@
\end{subfigure}
\end{figure}
%\hrulefill
\vspace*{2cm}
\vspace*{3.7cm}
{
\vspace*{2.5cm}
{\huge\bfseries Evaluating Hyperparameter and Implementation Choices of Proximal Policy Optimization}\\
\vspace{0.3cm}
{\Large{On a selection of ATARI 2600 Games}}\\
\vspace{1cm}
{\huge Master's thesis} \\
\vspace{0.3cm}
\vspace{3cm}
{\LARGE{On a Selection of ATARI 2600 Games}}\\
\vspace{1.6cm}
{\huge Master's Thesis} \\
\vspace{0.4cm}
{\huge Daniel Lukats} \\
\vspace{1.6cm}
{\large
submitted to obtain the degree of \\
\textsc{Master of Science (M.Sc.)} in \textsc{Computer Science} \\
at \\
\textsc{FH M\"unster -- University of Applied Sciences} \\
\textsc{Department of Electrical Engineering and Computer Science}
}
\vfill
\large{Daniel Lukats} \\
\vspace{1.6cm}
% \large{978528} \\ TODO
\today \\
\vspace{1cm}
{\large \today} \\
\vspace{1.6cm}
\begin{table}[H]
\centering
\begin{tabular}{l l}
\large{First examiner} & \large{Prof. Dr.-Ing. Jürgen te Vrugt}\\
\large{Second examiner} & \large{Prof. Dr. Kathrin Ungru}
\begin{tabular}{r l}
\large{Advisor} & \large{Prof. Dr.-Ing. Jürgen te Vrugt}\\
\large{Co-Advisor} & \large{Prof. Dr. Kathrin Ungru}
\end{tabular}
\end{table}
\vspace*{3cm}
}
}
\end{titlepage}
......@@ -25,7 +25,7 @@ understand policy gradient methods, so students may be introduced to reinforceme
fundamentals to explain Proximal Policy Optimization. In order to gain a thorough understanding of the algorithm, a
dedicated implementation based on the benchmarking framework \emph{OpenAI Gym} and the deep learning framework
\emph{PyTorch} was written instead of relying on an open source implementation of Proximal Policy
Optimization.\footnote{The code is available on TODO URL HERE.}
Optimization.\footnote{The code is available at \url{https://github.com/Aethiles/ppo-pytorch}.}
On the other hand, this thesis examines the impact not only of the aforementioned implementation choices but also of
common hyperparameter choices on a selection of five ATARI 2600 games. These video games form a typical benchmarking
......
A Markov decision process is a stochastic process that encompasses observations, actions and rewards. These are the core
properties needed to train an agent with a reinforcement learning algorithm.
properties needed for reinforcement learning.
\subsubsection{Definition}
\label{sec:02:mdp_def}
......@@ -26,11 +26,12 @@ interact once---the agent chooses an action and observes the environment.
\label{fig:intro_mdp}
\end{figure}
A simple Markov decision process is displayed in figure \ref{fig:intro_mdp}. Like Markov chains, it contains states
$s_0, s_1, s_2$ with $s_0$ being the initial state and $s_2$ being a terminal state. Unlike Markov chains, it also
contains actions \emph{left} and \emph{right} as well as rewards $0$ and $1$. The rewards are written alongside
transition probabilities and can be found on the edges connecting the states. Further explanations and an example using
elements of this Markov decision process are given in chapter \ref{sec:02:distributions}.
A simple Markov decision process is displayed in figure \ref{fig:intro_mdp}. Like Markov chains \cite[chapter
11.1]{grinstead1997}, it contains states $s_0, s_1, s_2$ with $s_0$ being the initial state and $s_2$ being a terminal
state. Unlike Markov chains, it also contains actions \emph{left} and \emph{right} as well as rewards $0$ and $1$. The
rewards are written alongside transition probabilities and can be found on the edges connecting the states. Further
explanations and an example using elements of this Markov decision process are given in chapter
\ref{sec:02:distributions}.
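For readers who prefer code over diagrams, the Markov decision process of figure \ref{fig:intro_mdp} can be sketched as
a simple data structure. The following Python snippet is an illustration only; in particular, the transition
probabilities are placeholder values and are not taken from the figure.
\begin{lstlisting}[language=Python]
# Illustrative encoding of a small Markov decision process: each state maps
# its available actions to a list of (probability, successor state, reward)
# tuples. The probabilities below are placeholders, not those of the figure.
mdp = {
    "s0": {
        "left":  [(1.0, "s1", 0)],
        "right": [(0.5, "s1", 0), (0.5, "s2", 1)],
    },
    "s1": {
        "left":  [(1.0, "s0", 0)],
        "right": [(1.0, "s2", 1)],
    },
    "s2": {},  # terminal state: no further actions
}
initial_state = "s0"
\end{lstlisting}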
Markov decision processes extend Markov chains by introducing actions and rewards. Whereas in Markov chains a transition
to a successor state depends on the current state only, in Markov decision processes the transition function $p$ takes
......
......@@ -142,12 +142,12 @@ This issue is solved by taking an elementwise minimum:
\end{align}
In practice, the advantage function $\hat{a}_{\pi_{\boldsymbol\theta_\text{old}}}$ is replaced with Generalized
Advantage Estimations $\delta_t^{\text{GAE}(\gamma,\lambda)}$ (cf.~equation \ref{eqn:gae}). Often,
Advantage Estimations $\delta_t^{\text{GAE}(\gamma,\lambda)}$ as defined in equation \ref{eqn:gae} \cite{ppo}. Often,
$\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$ is called a surrogate objective, as it approximates the original objective
$J(\boldsymbol\theta)$ but is not identical.
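The elementwise minimum can be made concrete with a few lines of PyTorch. The sketch below is illustrative and does not
mirror the implementation accompanying this thesis; the function name and signature are chosen for this example only.
\begin{lstlisting}[language=Python]
import torch

def clipped_surrogate(log_prob_new, log_prob_old, advantages, epsilon=0.1):
    # Probability ratio rho_t(theta) between the new and the old policy.
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    # Clipping restricts the ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The elementwise minimum keeps the more pessimistic of both terms.
    return torch.min(unclipped, clipped).mean()
\end{lstlisting}
Here, advantages would hold the Generalized Advantage Estimations $\delta_t^{\text{GAE}(\gamma,\lambda)}$ mentioned
above.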
Figure \ref{fig:min_clipping} compares $\text{clip}(\rho_t(\boldsymbol\theta), \epsilon) \cdot \delta$ and
$\mathcal{L}^\text{CLIP}$. Using an elementwise minimum has the following effect:
$\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$. Using an elementwise minimum has the following effect:
\begin{itemize}
\item If the advantage estimation $\delta > 0$ and $\rho_t(\boldsymbol\theta) \le 1 - \epsilon$, the loss is not clipped and the
likelihood of the corresponding action can be increased.
......
......@@ -2,8 +2,9 @@
\label{sec:04:implementation}
\input{04_implementation/introduction}
\subsection{Post-processing}
\subsection{TODO Environment Adaptation}
\label{sec:04:postprocessing}
TODO ALSO FIX THE NAME OF THIS IN THE REST OF THE THESIS
\input{04_implementation/postprocessing}
\subsection{Model Architecture}
......
......@@ -8,7 +8,5 @@ on the mathematical foundation of the algorithm \cite{ilyas2018, engstrom2019, w
In this chapter, we introduce the required operations to adapt the ATARI environment to PPO as well as the algorithmic
improvements. Often, mathematical reasoning for an operation is given neither in its respective original publication nor
in publications investigating it. In these cases, it can be assumed that the operation has been proven to be
beneficial empirically.
We evaluate some of these choices in chapter \ref{sec:05:evaluation}.
in publications investigating it. In these cases, it can be assumed that the operation has been proven to be beneficial
empirically. We evaluate some of these choices in chapter \ref{sec:05:evaluation}.
......@@ -6,7 +6,8 @@ more diverse starting conditions.
We use $\phi$ to denote the post-processing function and overload it as follows: $\phi_s(S_t)$ processes states whereas
$\phi_r(R_t)$ transforms rewards. In practice, both functions are executed at the same time due to the nature of the
post-processing steps.
post-processing steps. Mathematically, $\phi$ is part of the environment and therefore not visible in terms of the
Markov decision process.
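Conceptually, the joint application of $\phi_s$ and $\phi_r$ can be pictured as a thin wrapper around the environment.
The following sketch assumes the OpenAI Gym interface and serves as an illustration only; the concrete post-processing
steps are described in the remainder of this section.
\begin{lstlisting}[language=Python]
import gym

class PostProcessing(gym.Wrapper):
    """Illustrative wrapper applying phi_s to states and phi_r to rewards."""

    def __init__(self, env, phi_s, phi_r):
        super().__init__(env)
        self.phi_s = phi_s
        self.phi_r = phi_r

    def reset(self, **kwargs):
        return self.phi_s(self.env.reset(**kwargs))

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        # Both transformations are applied in the same step, as noted above.
        return self.phi_s(state), self.phi_r(reward), done, info
\end{lstlisting}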
\paragraph{No operation resets.}
Whenever an environment is reset, a random number of \emph{noop} actions is performed for up to 30 frames in order
......
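A minimal sketch of such a no-operation reset, assuming the Gym interface and a dedicated noop action with index 0; it
is meant as an illustration of the idea rather than as the thesis implementation.
\begin{lstlisting}[language=Python]
import random

def noop_reset(env, noop_action=0, noop_max=30):
    # Perform a random number of noop actions (up to noop_max frames) after a
    # reset so that episodes begin in more diverse starting conditions.
    state = env.reset()
    for _ in range(random.randint(1, noop_max)):
        state, _, done, _ = env.step(noop_action)
        if done:
            state = env.reset()
    return state
\end{lstlisting}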
Three aspects of the experiments must be discussed. Firstly, the impact of the non-disclosed optimizations shall be
evaluated. Secondly, the prevalence of outliers and the reliability of reward graphs as proposed by \citeA{ppo} and
shown in this thesis are reviewed. Finally, we discuss the robustness of PPO to hyperparameter choices.
shown in this thesis are reviewed. Finally, we discuss the robustness of PPO to hyperparameter choices. Remember that
all experiments are available online at \url{https://github.com/Aethiles/ppo-results}.
\subsubsection{Non-disclosed Optimizations}
\label{sec:05:discussion_optims}
......@@ -54,7 +55,9 @@ TODO
\paragraph{Conclusion.}
\todo[inline]{finish this once value function loss clipping is done} The previous experiments show that optimizations
TODO
The previous experiments show that optimizations
have different effects depending on the game an agent is trained on. This is most apparent with advantage
normalization, which appears to be detrimental to agents learning BeamRider, although it has a slight beneficial effect
on Breakout, Pong, Seaquest and SpaceInvaders. Moreover, we observe that all optimizations contribute to the success of
......
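As a point of reference for the discussion of advantage normalization: it is commonly implemented as a standardization
of the advantage estimates of a batch to zero mean and unit standard deviation. The sketch below illustrates this under
that assumption and is not taken from the thesis implementation.
\begin{lstlisting}[language=Python]
import torch

def normalize_advantages(advantages, eps=1e-8):
    # Assumed form of advantage normalization: standardize the advantage
    # estimates of a batch; eps guards against division by zero.
    return (advantages - advantages.mean()) / (advantages.std() + eps)
\end{lstlisting}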
......@@ -7,9 +7,10 @@ denotes the hyperparameters outlined in chapter \ref{sec:04:full_algorithm} with
As we test each configuration on $5$ games, this makes for a total of $400$ graphs. For this reason, only a selection of
graphs is shown in this thesis. However, all graphs are made available online including the configuration and raw data
used to generate them.\footnote{TODO URL HERE} When we refer to an experiment, consult the repository TODO unless a figure or
table is specified.
used to generate them.\footnote{The raw data and images are available at \url{https://github.com/Aethiles/ppo-results}.}
When we refer to an experiment, consult the repository unless a figure or table is specified.
TODO enumerate here
Subsequently, we show three different experiments. The first experiment we evaluate is done with the reference
configuration. By evaluating this experiment, we can discern whether the algorithm implemented for this thesis achieves
the intended performance. The second experiment shows the stability of PPO even when severely misconfigured, by training
......
thesis/05_evaluation/unsmoothed.png (image replaced: 474 KB → 475 KB)
\section{List of Experiments}
TODO
\label{sec:a:experiments}
Table \ref{table:experiment_results} contains an overview of all experiments and a short summary of the observations. The
corresponding graphs are available at \url{https://github.com/Aethiles/ppo-results}. If a game is not mentioned in the
observations column, the graphs show no conclusive deviation from the reference graphs. In the \emph{only\_<name>}
experiments, the reference trendline is replaced with a trendline of the \emph{no\_optimizations} experiment.
\rowcolors{2}{gray!25}{white}
\begin{longtable}{p{3.8cm}|p{10.04cm}}
Experiment & Observations \\
\hline
\endhead
3 epochs\footnote{A common choice to reduce training time.} & slightly improved on Breakout and SpaceInvaders \\
4 actors & worse on BeamRider; improved on Seaquest \\
16 actors & improved on BeamRider; worse on Pong and Seaquest \\
16 actors,\newline Minibatch size 128 & improved on BeamRider, Pong, Seaquest; worse on SpaceInvaders \\
Minibatch size 32 & notably worse on BeamRider; worse on Breakout and SpaceInvaders; improved on Seaquest; faster
learning on Pong but less stable after $4\cdot10^6$ time steps \\
Minibatch size 64 & notably worse on BeamRider; improved on Pong and Seaquest \\
Minibatch size 128 & notably worse on BeamRider with three very different runs; improved on Seaquest \\
Minibatch size 32,\newline $\epsilon$ not annealed & notably worse on BeamRider and SpaceInvaders; improved on Pong
and Seaquest \\
Minibatch size 64,\newline $\epsilon$ not annealed & worse on BeamRider; improved on Breakout, Pong and Seaquest \\
Minibatch size 128,\newline $\epsilon$ not annealed & improved on Breakout, Pong and Seaquest \\
\hline
$\alpha$ and $\epsilon$ not annealed & worse on BeamRider; noisier on Pong; improved on Seaquest \\
$\alpha$ not annealed & worse on BeamRider; notably improved on Seaquest\footnote{As Seaquest often features
prominent outliers, it cannot be concluded that this configuration is better for Seaquest.} \\
$\epsilon$ not annealed & less stable on BeamRider; notably improved on Breakout \\
\hline
$\alpha = 2.5\cdot10^{-3}$ & no learning on BeamRider and two runs of Breakout; notably worse on one run of
Breakout, Seaquest and SpaceInvaders \\
$\alpha = 2.5\cdot10^{-5}$ & notably worse to no learning on all games \\
$\epsilon = 0.05$ & improved on Seaquest; worse on SpaceInvaders with three very different runs \\
$\epsilon = 0.2$ & improved on Breakout; noisier and worse on Pong \\
$\epsilon = 0.5$ & worse on all games; noisier on Pong and possibly on Breakout \\
$\gamma = 0.9$ & notably worse on all games \\
$\gamma = 0.95$ & notably worse on BeamRider, Seaquest and SpaceInvaders \\
$\lambda = 0.9$ & improved on Breakout \\
$\lambda = 0.99$ & worse on BeamRider, Breakout and SpaceInvaders \\
\hline
$c_1 = 0.2$ & no learning on Breakout; improved on Seaquest;\newline performance collapse in one run of SpaceInvaders \\
$c_1 = 1$ & improved on BeamRider and Seaquest \\
$c_1 = 1$, $\epsilon = 0.2$ & worse on BeamRider; slightly improved on Breakout \\
$c_2 = -0.01$ & notably worse to no learning on all games but SpaceInvaders; outliers may still achieve reference
performance \\
$c_2 = 0$ & notably worse and noisier on Pong, worse on Seaquest \\
$c_2 = 0.1$ & worse on BeamRider and Pong; notably worse on Breakout; improved on Seaquest \\
\hline
paper configuration & improved on BeamRider, Breakout and Seaquest \\
no optimizations,\newline paper configuration & notably worse on all games \\
no optimizations & notably worse on all games but BeamRider;\newline notably improved on BeamRider\footnote{This
result is unexpected and merits further investigation.} \\
no advantage\newline normalization & improved on BeamRider and Seaquest; slightly worse on Breakout and Pong;
slightly improved on SpaceInvaders \\
only advantage\newline normalization & \\
no gradient\newline clipping & worse on BeamRider; notably improved on Breakout and Seaquest \\
only gradient\newline clipping & \\
no orthogonal\newline initialization & worse on BeamRider, Pong and Seaquest \\
only orthogonal\newline initialization & \\
no reward binning & notably worse on all games but Pong \\
only reward binning & \\
no value function loss clipping & notably improved on Breakout; slightly worse on Pong \\
only value function loss clipping & \\
\hline
reward clipping & improved on BeamRider; worse on Breakout, Seaquest and SpaceInvaders \\
no input scaling & no learning on BeamRider; better on Seaquest; worse on SpaceInvaders \\
adam epsilon\footnote{A parameter for numerical stability.} $10^{-8}$ & slightly improved on Seaquest \\
conv3 32 filters\footnote{Architectural change by \citeA{kostrikov}.} & slightly worse on BeamRider; slightly
improved on Seaquest \\
small net\footnote{An architecture that has only two convolutional layers, the first with 16 filters and the second
with 32 \cite{dqn}.} & worse on all games but BeamRider \\
\caption{This table contains a list of all experiments and a short summary of the observations.}
\label{table:experiment_results}
\end{longtable}
\thispagestyle{empty}
\section*{Abstract}
\todo[inline]{first abstract. gather feedback on this}
Reinforcement learning denotes a class of machine learning algorithms that train an agent through trial-and-error.
Instead of correcting the agent's behavior, it is merely rewarded when it achieves a given goal and punished when it
fails to do so.
\emph{Proximal Policy Optimization} algorithms are a class of deep reinforcement learning algorithms \cite{ppo}. The
most popular variant, PPO clip, enjoys widespread use and achieves competitive results on many benchmarking tasks.
However, a number of non-disclosed optimization choices can be found in two implementations provided by the authors
\cite{baselines}.
Recently, reinforcement learning was combined with deep learning to utilize neural networks. \emph{Proximal Policy
Optimization} is one of these deep reinforcement learning algorithms \cite{ppo}. Although Proximal Policy Optimization
enjoys widespread use and achieves competitive results on many benchmarking tasks, a number of non-disclosed
optimizations can be found in implementations provided by the authors \cite{baselines}.
Previous research has shown that these optimizations are crucial to the success of Proximal Policy Optimization on
robotics tasks \cite{ilyas2018, engstrom2019}, but to the author's knowledge no research was conducted on other
benchmarking tasks like ATARI 2600 video games. Therefore, this thesis explains the PPO clip variant of Proximal Policy
Optimization and evaluates the impact of the aforementioned optimizations on a selection of five ATARI games.
Therefore, this thesis introduces the mathematical foundations required to understand Proximal Policy Optimization and
builds upon them to explain the algorithm itself. Afterwards, the aforementioned optimization choices as well as common
configuration choices are evaluated on five ATARI 2600 video games. ATARI games are a common benchmarking
task for reinforcement learning algorithms.
The experiments reveal that most optimization choices have a significant effect on the performance of the algorithm;
only a single optimization is found to be less impactful. Furthermore, the experiments support \citeauthor{ppo}'s
\citeyear{ppo} claim that PPO clip is robust to hyperparameter choices, as agents learn even when configured
suboptimally. Finally, significant deviations are apparent in approximately 35\% of the experiments. As a consequence,
the reliability of the evaluation methods employed by the authors of the original publication may be questioned.
The experiments reveal that most optimization choices have a strong effect on the performance of Proximal Policy
Optimization. Furthermore, they support the authors' claims that the algorithm is robust to configuration choices.
Finally, notable outliers are apparent in approximately 35\% of the experiments. As a consequence, reproducibility of
the results can be a challenge.
......@@ -284,6 +284,14 @@
note = {ISBN: 9780262035613},
}
@book{grinstead1997,
title = {Introduction to Probability},
author = {Charles M. Grinstead and J. Laurie Snell},
publisher = {American Mathematical Society},
year = 1997,
note = {ISBN: 9780821807491},
}
@book{sutton18,
author = {Richard S. Sutton and Andrew G. Barto},
title = {Reinforcement Learning: An Introduction},
......
......@@ -5,7 +5,7 @@
\newglossaryentry{actor}{
name = {actor},
description = {denotes an agent that is run in parallel in multiple environments},
description = {denotes an agent that is run in parallel in multiple instances of the same environment},
}
\newglossaryentry{advantage}{
......@@ -15,7 +15,7 @@
\newglossaryentry{advantage function}{
name = {advantage function},
description = {describes if an action yields higher or lower reward than the expected behavior would}
description = {determines if an action yields higher or lower reward than the expected behavior would}
}
\newglossaryentry{agent}{
......@@ -45,7 +45,7 @@
\newglossaryentry{entropybonus}{
name = {entropy bonus},
description = {a component of a loss that encourages exploration},
description = {a component of the loss that encourages exploration},
}
\newglossaryentry{environment}{
......@@ -55,17 +55,18 @@
\newglossaryentry{episode}{
name = {episode},
description = {TODO}
description = {the sequence from an initial state to a terminal state. In this thesis an episode usually means one
attempt at a game}
}
\newglossaryentry{exploration}{
name = {exploration},
description = {},
description = {denotes choosing suboptimal actions that the agent has little or no knowledge of yet},
}
\newglossaryentry{exploitation}{
name = {exploitation},
description = {},
description = {denotes choosing actions currently believed to be optimal, reinforcing the agent's knowledge of them},
}
\newglossaryentry{gae-acr}{
......@@ -98,9 +99,14 @@
description = {a configuration parameter that is not learned by the agent}
}
\newglossaryentry{learnrate}{
name = {learn rate},
description = {controls the step size in gradient ascent or descent}
}
\newglossaryentry{loss}{
name = {loss},
description = {the objective of minimization or---in reinforcement learning---maximization}
description = {the objective of minimization or maximization}
}
\newglossaryentry{mdp}{
......
......@@ -2,19 +2,24 @@
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
% Font
% FONT
\usepackage{lmodern}
% Header
% HEADER
\usepackage[headsepline]{scrlayer-scrpage}
\automark[section]{section}
\ohead{\headmark}
%%\rohead{\headmark}
% FIGURES
\usepackage{graphicx}
\usepackage{rotating}
\usepackage{longtable}
\usepackage[table]{xcolor}
% TODO
\usepackage{todonotes}
% FOOTNOTES
\usepackage{footnote}
\makesavenoteenv{longtable}
% MATHS
\usepackage{amsmath}
......@@ -32,26 +37,7 @@
\usepackage{blindtext}
\usepackage{listings}
% TODO check hidelinks here
\usepackage[hyphens]{url}
\usepackage[hidelinks,breaklinks]{hyperref} % must be loaded before apacite
\hypersetup{
pdfauthor = {Daniel Lukats},
pdftitle = {Master's thesis},
}
% Bibliography setup
% \usepackage[backend=biber, style=ieee]{biblatex}
% \addbibresource{bibliography.bib}
% \appto{\bibsetup}{\raggedright}
\usepackage{apacite}
% \usepackage{regexpatch}
% \makeatletter
% \xpatchcmd{\@@cite}{\def\BCA##1##2{{\@BAstyle ##1}}}{\def\BCA##1##2{{\@BAstyle ##2}}}{}{}
% \makeatother
% Glossary setup
% GLOSSARY
\usepackage[acronym, style=altlist, nonumberlist, toc, section=section, xindy]{glossaries}
\input{glossary}
\makeglossaries
......@@ -65,5 +51,16 @@
\usepackage{csquotes}
% URL AND HYPERREF
\usepackage[hyphens]{url}
\usepackage[hidelinks,breaklinks]{hyperref} % must be loaded before apacite
\hypersetup{
pdfauthor = {Daniel Lukats},
pdftitle = {Master's Thesis},
}
% BIB SETUP
\usepackage{apacite}
\widowpenalty = 10000
\clubpenalty = 10000
......@@ -2,7 +2,7 @@
\input{includes}
%\input{commands}
%\raggedbottom
\raggedbottom
\begin{document}
\pagenumbering{gobble}
......@@ -15,14 +15,16 @@
\clearpage
\thispagestyle{empty}
\hspace{0pt}
\setcounter{page}{0}
\setcounter{figure}{0}
\clearpage
\input{abstract.tex}
\listoftodos
\cleardoublepage
\mbox{}
\thispagestyle{empty}
\setcounter{page}{0}
\setcounter{figure}{0}
\clearpage
\phantomsection
\pdfbookmark[1]{Table of contents}{}
\tableofcontents
......@@ -50,8 +52,10 @@
\cleardoublepage
% \bibliography{bibliography}
\section*{List of Mathematical Symbols and Definitions}
\addcontentsline{toc}{section}{List of Mathematical Symbols and Definitions}
\setcounter{secnumdepth}{0}
\section{List of Mathematical Symbols and Definitions}
% \sectionmark{List of Mathematical Symbols and Definitions}
% \addcontentsline{toc}{section}{List of Mathematical Symbols and Definitions}
\label{sec:maths_index}
\input{maths_index.tex}
\glsaddall
......@@ -64,6 +68,7 @@
\bibliography{bibliography.bib}
\cleardoublepage
\setcounter{secnumdepth}{1}
\appendix
\input{07_appendix/experiments.tex}
......
......@@ -34,7 +34,6 @@
$\delta_t$ & advantage estimator & (chapter \ref{sec:02:advantage}, eq.~\ref{eqn:delta1}) \\
$\delta_t^{\text{GAE}(\gamma, \lambda)}$ & Generalized Advantage Estimation & (chapter \ref{sec:03:gae_gae},
eq.~\ref{eqn:gae}) \\
\\
$\boldsymbol\omega$ & parameter vector & (chapter \ref{sec:02:function_approximation}) \\
$\hat{v}_\pi(s, \boldsymbol\omega)$ & parameterized value function & (chapter \ref{sec:02:function_approximation}) \\
$\hat{a}_\pi(s, a, \boldsymbol\omega)$ & parameterized advantage function & (chapter
......