\section{Introduction}
\label{sec:01:introduction}
\subsection{Motivation}
\label{sec:01:motivation}
\input{01_introduction/motivation}
In recent years, reinforcement learning algorithms have achieved several breakthroughs, such as defeating a world champion in
the game of Go \cite{alphago} and defeating professional players in restricted versions of the video games StarCraft II
\cite{alphastar} and Dota 2 \cite{openaifive}. The latter was achieved using, amongst other algorithms,
\emph{Proximal Policy Optimization} \cite{ppo}.
Proximal Policy Optimization is used in several research topics. On the one hand, researchers look to apply the
algorithm to real-world tasks. For example, Proximal Policy Optimization is used to learn human walking behavior, which
can aid in developing control schemes for prostheses \cite{anand2019}. Furthermore, the algorithm is utilized in
spaceflight, e.g., to train agents for interplanetary transfers \cite{nasa} and autonomous planetary landings
\cite{gaudet2020}. As a final example, researchers applied Proximal Policy Optimization to medical imaging in order to
trace axons, which are microscopic neuronal structures \cite{dai2019}.
On the other hand, researchers combine the algorithm with further concepts. For example, Proximal Policy Optimization
can be used in meta reinforcement learning, in which a reinforcement learning agent is trained to train other
reinforcement learning agents \cite{liu2019}. Yet another concept that may be combined with the algorithm is curiosity
\cite{pathak2017}, a mechanism that incentivizes a methodical search for an optimal solution rather than a random one.
Despite its widespread use, researchers found that several undocumented implementation choices have a great effect on
the performance of Proximal Policy Optimization. Consequently, the authors of these studies raise doubts about our
mathematical understanding of the algorithm's foundations \cite{ilyas2018, engstrom2019}.
The goal of this thesis is twofold. On the one hand, it provides the fundamentals of reinforcement learning required to
understand policy gradient methods, so that students may be introduced to reinforcement learning. Building upon these
fundamentals, it explains Proximal Policy Optimization. In order to gain a thorough understanding of the algorithm, a
dedicated implementation was written with the OpenAI Gym benchmark and the PyTorch deep learning framework instead of
relying on an open source project.\footnote{The code is available on TODO URL HERE.}
On the other hand, this thesis examines not only the impact of the aforementioned implementation choices but also
common hyperparameter choices on a selection of five ATARI 2600 games. These video games form a typical benchmarking
environment for evaluating reinforcement learning algorithms. The significance of the implementation choices has
already been researched on robotics tasks \cite{ilyas2018, engstrom2019}, but the authors forewent an examination on
ATARI games. Therefore, the question arises whether these choices are as significant for ATARI 2600 games as they are
for robotics tasks.
The mathematical notation used in publications on reinforcement learning can differ greatly. In this thesis, we adhere
to the notation introduced by \citeA{sutton18} and adapt definitions and proofs from other sources accordingly. As a
consequence, interested readers may notice differences between the notation used in chapter \ref{sec:03:ppo} and the
publications of \citeA{gae} and \citeA{ppo}:
\begin{itemize}
\item The advantage estimator is denoted $\delta$ instead of $\hat{A}$, as both $A$ and $a$ are heavily overloaded
already.
\item The likelihood ratio used in chapter \ref{sec:03:ppo_loss} is denoted $\rho_t$ instead of $r_t$ to avoid
confusion with rewards $R_t$ or $r$. Furthermore, using $\rho$ is consistent with \citeauthor{sutton18}'s
\citeyear{sutton18} definition of \emph{importance sampling}. A brief sketch of both symbols follows this list.
\end{itemize}
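As a preview and to make the two items above concrete, the following is only a minimal sketch; the precise definitions,
in particular the generalized advantage estimator actually used in this thesis, are developed in chapters
\ref{sec:02:basics} and \ref{sec:03:ppo}. In its simplest one-step form, the advantage estimate coincides with the
temporal-difference error, and the likelihood ratio compares the current policy to the policy that collected the data:
\begin{align*}
    \delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}) - \hat{v}(S_t), &
    \rho_t &= \frac{\pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta_\text{old})}.
\end{align*}
Here, $\hat{v}$ denotes an estimate of the state-value function, $\gamma$ the discount factor, and $\theta$ and
$\theta_\text{old}$ the parameters of the current policy and of the policy used to gather experience. The ratio
corresponds to the term written $r_t(\theta)$ by \citeA{ppo}.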
The fundamentals of reinforcement learning are given in chapter \ref{sec:02:basics}. These contain core terms
and definitions required to discuss and construct reinforcement learning algorithms.
Issues with the naive learning approach outlined in chapter \ref{sec:02:basics} are pointed out in chapter
\ref{sec:03:ppo}. This leads to the introduction of advanced estimation methods, which are used in \emph{Proximal Policy
Optimization} (PPO). With these estimation methods, PPO is defined and the ramifications of specific operations are
explained. Chapter \ref{sec:03:ppo} closes with an outline of the complete reinforcement learning algorithm.
Undocumented design and implementation choices are elaborated and---if possible---explained in chapter
\ref{sec:04:implementation}. Before these choices are examined on ATARI 2600 games, the benchmarking framework and the
evaluation method are introduced in chapter \ref{sec:05:evaluation}. A discussion of the results completes the chapter.
Finally, we summarize the results of this thesis in chapter \ref{sec:06:conclusion} and discuss possible future work
that builds upon it.
Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm and therefore the topic of various
research questions. The publications by \citeA{ilyas2018} and \citeA{engstrom2019} are closest to the topic of this
thesis, as the authors thoroughly research undocumented properties and implementation choices of the algorithm. However,
their work focuses on robotics tasks and foregoes an examination of the modifications on ATARI video games. Furthermore,
the authors assume prior knowledge of reinforcement learning and provide only a short background on policy gradient
methods.
Multiple open source implementations of PPO are available \cite{baselines, jayasiri, kostrikov}. As these publications
mostly consist of source code, no evaluation on various tasks is available for comparison. The notable exception is the
baselines repository \cite{baselines}, which was published alongside the original publication by
\citeA{ppo}. Among these, the work of \citeA{jayasiri} stands out, as the author added extensive comments elaborating
some concepts and implementation choices. The code created for this thesis differs, as it is suitable for learning ATARI
tasks only, whereas \citeA{baselines} and \citeA{kostrikov} support robotics and control tasks as well.
\citeauthor{jayasiri}'s \citeyear{jayasiri} code is intended for a single ATARI game only.
Lastly, PPO is the topic of master's theses, for example by \citeA{chen2018} and \citeA{gueldenring2019}. The authors'
theses differ from this one, as \citeA{chen2018} researches the application of PPO to engineering tasks, whereas
\citeA{gueldenring2019} examines PPO in the context of navigating a robot. Since both authors focus on application, they
chose to utilize open source implementations of PPO instead of implementing the algorithm themselves.