
Commit 962e9da4 authored by Daniel Lukats

updated conclusion

parent 941dc9b5
In this thesis we introduced the necessary concepts of reinforcement learning to discuss \emph{Proximal Policy
Optimization}. Among these are an agent and the environment it observes, as well as the Markov decision process
consisting of states, actions and rewards. In addition, the Markov decision process contains a transition function for
the environment the agent interacts with and a policy guiding the agent's behavior. We built upon these concepts to
define the value function and the advantage function, which allow us to examine the performance of an agent following a
specific policy.
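As a brief reminder, these two quantities take the standard form, stated here with a discount factor $\gamma$ and the
action-value function $Q^\pi$; the notation is the common one and may differ slightly from the definitions in the main
chapters:
\[
    V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right],
    \qquad
    A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).
\]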
In order to improve the policy, we introduced parameterization and policy gradients, which enable us to optimize a
policy via stochastic gradient descent. As the environments in this thesis are specific ATARI 2600 video games, we lack
full knowledge of the environments' dynamics. Therefore, we must estimate gradients by sampling training data from
interaction with the environments. Simple policy gradient estimators have been shown to be unreliable, resulting in
performance collapses and suboptimal solutions \cite{kakade2002}.
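In its simplest form, such a sampled estimator approximates the policy gradient from $N$ recorded trajectories. It is
written here in a common form with a parameterized policy $\pi_\theta$ and advantage estimates $\hat{A}_t$; the exact
estimator used in the main chapters may differ in its details:
\[
    \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1}
    \nabla_\theta \log \pi_\theta\bigl(a_t^{(i)} \mid s_t^{(i)}\bigr)\, \hat{A}_t^{(i)}.
\]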
Hence, advanced learning algorithms are required. One of these algorithms is \emph{Proximal Policy Optimization}, which
uses parallel environments to generate diverse training data \cite{ppo}. An agent is optimized on this data for multiple
epochs instead of just once. To ensure that the performance of a policy does not collapse after optimizing, Proximal
Policy Optimization intends to keep the new policy close to the old policy by clipping losses that deviate too much.
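This clipping is expressed by the clipped surrogate objective of \citeA{ppo}, restated here with the probability ratio
$r_t(\theta)$ and the clipping parameter $\epsilon$:
\[
    L^{\mathit{CLIP}}(\theta) =
    \hat{\mathbb{E}}_t\Bigl[\min\bigl(r_t(\theta)\hat{A}_t,\
    \mathrm{clip}\bigl(r_t(\theta), 1-\epsilon, 1+\epsilon\bigr)\hat{A}_t\bigr)\Bigr],
    \qquad
    r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
\]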
However, multiple undisclosed optimizations of the algorithm can be found in two implementations \cite{baselines} that
are provided by the authors of the original publication. We examined these optimizations and
popular hyperparameter choices on five ATARI 2600 games: BeamRider, Breakout, Pong, Seaquest and SpaceInvaders. The
results of the experiments show that most of the optimizations are crucial to achieve good results on these ATARI games.
This finding is shared by \citeA{ilyas2018} and \citeA{engstrom2019}, who examined Proximal Policy Optimization on
robotics environments. Furthermore, we found that significant outliers are present in approximately $35\%$ of the
experiments.
As a consequence, we call for two improvements: Firstly, all optimization choices pertaining to the algorithm---even
those not purely related to reinforcement learning---must be published. In order to ensure proper reproducibility and
comparability, methods used for the portrayal of results, such as smoothing of graphs, must be described as well.
Secondly, instead of testing each configuration thrice on every game available, performing more runs on an expert-picked
set of diverse games grants more reliable results. With more training runs, it becomes easier to discern if a striking
performance is to be expected or simply a result of chance.
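To illustrate why additional runs help, assume the final scores of independent runs scatter with a per-run standard
deviation $\sigma$. The standard error of the mean score then scales as
\[
    \mathrm{SE} = \frac{\sigma}{\sqrt{n}},
\]
so increasing the number of runs per game from $n = 3$ to $n = 12$ already halves the uncertainty of the reported
average, making it easier to tell chance outliers from genuine improvements.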