\section{Conclusion and Future Work}
In this thesis we introduced the necessary concepts of reinforcement learning to discuss \emph{Proximal Policy
Optimization}. Among these is the Markov decision process, which consists of the states, actions, rewards, and
transition function of the environment an agent interacts with, as well as a policy guiding the agent's behavior. We
built upon these concepts to define the value function and the advantage function, which allow us to examine the
performance of an agent following a specific policy.
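Recalling the standard definitions---the notation here follows common convention and may differ slightly from the definitions chapter---the state-value function and the advantage function can be written as
\begin{equation*}
V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s\right],
\qquad
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s),
\end{equation*}
where $Q^\pi$ denotes the action-value function and $\gamma$ the discount factor.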
In order to improve the policy, we introduced parameterization and policy gradients, which enable us to optimize a
policy via stochastic gradient descent. As the environments in this thesis are specific ATARI 2600 video games, we lack
full knowledge of the environments' dynamics. Therefore, we must estimate gradients by sampling training data from
interaction with the environments. Vanilla policy gradient estimators have been shown to be unreliable, resulting in
performance collapses and suboptimal solutions \cite{kakade2002}.
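In common notation, such a sampled gradient estimate takes the generic form
\begin{equation*}
\hat{g} = \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log \pi_\theta(a_n \mid s_n)\, \hat{A}_n,
\end{equation*}
where $\hat{A}_n$ is an estimate of the advantage; the exact choice of estimator varies between algorithms.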
Hence, advanced learning algorithms are required. One such algorithm is \emph{Proximal Policy Optimization}, which
uses parallel environments to generate diverse training data \cite{ppo}. An agent learns from this data for multiple
epochs instead of just once. To ensure that the performance of a policy does not collapse after optimizing, Proximal
Policy Optimization keeps the new policy close to the old one by clipping the loss whenever the policies deviate too
much.
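This clipping mechanism is captured by the surrogate objective of \cite{ppo},
\begin{equation*}
L^{\mathit{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\;
\operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\end{equation*}
where $\epsilon$ is the clipping range hyperparameter.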
However, several undisclosed optimizations can be found in a popular implementation of the algorithm
\cite[baselines/ppo2]{baselines}. We examined these optimizations as well as popular hyperparameter choices. The
results of our experiments show that most of the optimizations are crucial to achieving good results on ATARI games.
This finding is echoed by \citeA{ilyas2018} and \citeA{engstrom2019}, who examined Proximal Policy Optimization on
robotics environments. Furthermore, we found significant outliers in approximately $35\%$ of the experiments.
As a consequence, we call for two improvements: First, all choices pertaining to the algorithm, even those not purely
related to reinforcement learning, as well as the portrayal of its results must be published to ensure proper
reproducibility. Second, instead of testing each configuration thrice on every available game, performing more runs on
an expert-picked set of diverse games grants more robust results. With more training runs, it becomes easier to
discern whether a striking performance is to be expected or merely the result of chance.
\subsection{Future Work}
Based on the work conducted for this thesis, three possible avenues for future work present themselves. First, we
could work on improved evaluation methods that go beyond comparing graphs printed in different publications. This
includes selecting a diverse subset of games so that robust trendlines can be created and shared.
Second, the code written for this thesis can be adjusted to support other benchmarking environments. Most
importantly, by supporting robotics and control tasks, we could reproduce the findings of \citeA{ilyas2018} and
\citeA{engstrom2019}.
Finally, Proximal Policy Optimization can be combined with further learning techniques. In particular,
curiosity-driven learning as detailed by \citeA{pathak2017} is an interesting technique to combine with PPO. As
PPO already contains an entropy bonus encouraging exploration, replacing it with a guided exploration technique could
yield interesting results.