\section{Introduction}
\label{sec:01:introduction}
\todo[inline]{this entire chapter has an issue with implementation. fix that word}
\subsection{Motivation}
\label{sec:01:motivation}
\input{01_introduction/motivation}
\subsection{Notation}
\label{sec:01:notation}
\todo[inline]{short note that notation adheres to Sutton as much as possible to provide a consistent experience.
Schulman's notation, for example, differs, although most letters are assigned to the same concepts}
\input{01_introduction/notation}
\todo[inline]{very rough first motivation; definitely needs to be wordcrafted more}
In recent years, reinforcement learning algorithms have achieved several breakthroughs such as defeating a world champion in
the game of Go \cite{alphago} or defeating professional players in restricted versions of the video games StarCraft II
\cite{alphastar} and Dota 2 \cite{openaifive}. The latter was achieved using---amongst others---an algorithm called
\emph{Proximal Policy Optimization} \cite{ppo}.
Proximal Policy Optimization is used in
\todo[inline]{examples here such as curiosity-driven learning}
Despite its widespread use, researchers found that several implementation choices have a great effect on the algorithm
but are not documented.
\subsection{Goal}
\todo[inline]{a few more citations here}
Proximal Policy Optimization is used in several research topics. Some researchers look to apply the algorithm to tasks
beyond benchmarking, for example to train an agent for interplanetary transfers \cite{nasa}. Others utilize Proximal
Policy Optimization in their research of further concepts such as guided exploration through a concept called curiosity
\cite{pathak2017}.
\todo[inline]{examine some of these choices on a restricted set of tasks}
Despite its widespread use, researchers found that several undocumented implementation choices have a great effect on the
performance of Proximal Policy Optimization. Consequently, the authors of these studies raise doubts about our mathematical
understanding of the foundations of the algorithm \cite{ilyas2018, engstrom2019}.
\todo[inline]{provide necessary background of reinforcement learning to understand the algorithm}
The goal of this thesis is twofold. On the one hand, it examines implementation and hyperparameter choices of Proximal
Policy Optimization on ATARI 2600 games, which serve as a typical benchmarking environment. On the other hand, it provides
the required fundamentals of reinforcement learning to understand policy gradient methods and explains Proximal Policy
Optimization, so other students may be introduced to reinforcement learning TODO. In order to gain a thorough
understanding of the algorithm, an implementation was created and published instead of relying on an open source
project.
Depending on the author, the notation used in publications can differ greatly. In this thesis, we adhere to the notation
provided by \citeA{sutton18} and adapt definitions and proofs from other sources accordingly. As a consequence,
interested readers may notice differences between the notation used in chapter \ref{sec:03:ppo} and the publications of
\citeA{gae} and \citeA{ppo}:
\begin{itemize}
\item The advantage estimator is denoted $\delta$ instead of $\hat{A}$, as both $A$ and $a$ are heavily overloaded
already.
\item The likelihood ratio used in chapter \ref{sec:03:ppo_loss} is denoted $\rho_t$ instead of $r_t$ to avoid
confusion with rewards $R_t$ or $r$; a brief example follows this list. Furthermore, using $\rho$ is consistent with
\citeauthor{sutton18}'s \citeyear{sutton18} definition of \emph{importance sampling}.
\end{itemize}
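As a brief example of this notation, the clipped surrogate objective of \citeA{ppo} can be sketched as
\[
    \rho_t(\theta) = \frac{\pi_\theta(A_t \mid S_t)}{\pi_{\theta_\mathrm{old}}(A_t \mid S_t)}, \qquad
    L^{\mathrm{CLIP}}(\theta) = \mathrm{E}_t \Big[ \min\big( \rho_t(\theta)\, \delta_t,\;
    \mathrm{clip}(\rho_t(\theta), 1 - \epsilon, 1 + \epsilon)\, \delta_t \big) \Big],
\]
where \citeA{ppo} instead write $r_t(\theta)$ for the ratio and $\hat{A}_t$ for the advantage estimate. This is only a
sketch; the symbols $\pi_\theta$, $\epsilon$ and the expectation $\mathrm{E}_t$ are defined precisely in chapters
\ref{sec:02:basics} and \ref{sec:03:ppo}.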
For ease of reading, an overview of all definitions is provided on page TODO.
In this thesis, an active \emph{we} is used regularly. This \emph{we} is meant to include the author and all readers.
\begin{enumerate}
\item Introduction
\item Theory on RL
\item PPO
\item Experiments
\item Conclusion \& Outlook
\end{enumerate}
Firstly, the fundamentals of reinforcement learning are given in chapter \ref{sec:02:basics}. These contain core terms
and definitions required to discuss and construct reinforcement learning algorithms.
Secondly, issues with the naive learning approach outlined in chapter \ref{sec:02:basics} are pointed out in chapter
\ref{sec:03:ppo}. This leads to the introduction of advanced estimation methods, which are used in \emph{Proximal Policy
Optimization} (PPO). Chapter \ref{sec:03:ppo} closes with an explanation of PPO.
Thirdly, undocumented design and implementation choices are elaborated and---if possible---explained in chapter
\ref{sec:04:implementation}. These choices are examined on ATARI 2600 games in chapter \ref{sec:05:evaluation}.
Finally, we summarize and discuss the results of this thesis in chapter \ref{sec:06:conclusion} and outline possible
future work that builds upon it.
\todo[inline]{refer to \protect\cite{wang2019,ilyas2018,engstrom2019} and open source repositories such as
\protect\cite{kostrikov} or jayasiri}
Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm and therefore the topic of various
research questions. The publications by \citeA{ilyas2018} and \citeA{engstrom2019} are closest to the topic of this
thesis, as the authors thoroughly research undocumented properties and implementation choices of the algorithm. However,
their work focuses on control tasks and foregoes an examination of the modifications on ATARI video games. Furthermore,
the authors assume prior knowledge of reinforcement learning and provide only a short background on policy gradient
methods.
This thesis differs in that it does not assume any prior knowledge of reinforcement learning and therefore provides the
necessary foundations to grasp Proximal Policy Optimization.
Multiple open source implementations of PPO are available \cite{baselines, jayasiri, kostrikov}. As these publications
mostly consist of source code, no evaluation across various tasks is available for comparison. The notable exception
is the original publication by \citeA{ppo}, which likely evaluated the algorithm using an implementation provided
in the OpenAI baselines repository. Among these implementations, the work of \citeA{jayasiri} stands out, as the author added extensive
comments elaborating some concepts and implementation choices. The code created for this thesis differs, as it is
suitable for learning ATARI tasks only, whereas \citeA{baselines} and \citeA{kostrikov} support robotics and control
tasks as well. \citeauthor{jayasiri}'s \citeyear{jayasiri} code is intended for a single ATARI game only.
Lastly, PPO is researched in master's theses, for example by \citeA{chen2018} and \citeA{gueldenring2019}. The authors'
theses differ from this one, as \citeA{chen2018} researches the application of PPO to engineering tasks, whereas
\citeA{gueldenring2019} examines PPO in the context of navigating a robot. Since both authors focus on application, they
chose to utilize open source implementations of PPO instead of implementing the algorithm themselves.