Commit d7a55f09 authored by Daniel Lukats

feedback anna and fabian

parent df835ecc
Typical benchmarks involve classic control tasks such as balancing an inverse pendulum, robotics tasks or ATARI video
games. Although the ATARI 2600 console was released in 1977, the games developed for the console prove challenging even
today \cite{ale}. The authors note that despite the simple graphics, ATARI games can be difficult to predict: For
example, the 18 most recent frames are used to determine the movement of the paddle in Pong. Moreover, agents are
trained on video input from ATARI games, which leads to a large state space (cf.~chapter \ref{sec:04:postprocessing}).
Lastly, video games are diverse challenges with some---such as Pong---being easier to solve, whereas others like the game
Montezuma's Revenge remain challenging. For these reasons, ATARI 2600 video games are a popular benchmarking
environment.\footnote{One might note that video games are well-known and could pique people's interest as well.}
The \emph{Arcade Learning Environment} (ALE) provides a benchmarking environment containing 57 games
@@ -21,9 +21,9 @@
OpenAI Gym and ALE simplify evaluation greatly, as they provide a unified interface for all environments. We simply
provide an action to the environment and obtain observations containing the current state of the environment as a $210
\times 160$ RGB image (before post-processing, cf.~chapter \ref{sec:04:postprocessing}), the reward and
further information such as the number of lives remaining or if the game terminated. Just like the ATARI 2600
console, the Arcade Learning Environment runs at 60 Hz---without frame skipping we would obtain 60 observations
per second.
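As an illustration, a minimal interaction loop with a Gym-wrapped ALE environment might look as follows. This is
a sketch only: the environment id and the four-value return signature of \texttt{step} are assumptions that depend
on the installed Gym version, and none of it is taken from the thesis code.
\begin{verbatim}
import gym

env = gym.make("PongNoFrameskip-v4")    # hypothetical environment id

observation = env.reset()               # 210 x 160 RGB image before post-processing
for _ in range(60):                     # about one second of play at 60 Hz
    action = env.action_space.sample()  # sample a random action
    observation, reward, done, info = env.step(action)
    # info carries extra data such as the number of lives remaining
    if done:                            # the episode terminated
        observation = env.reset()
env.close()
\end{verbatim}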
Since the ALE returns a reward, we do not have to design a reward function. Instead, all reinforcement
learning algorithms are trained using the same reward function bar post-processing. The reward of an action is the
@@ -18,7 +18,8 @@ as is the performance on Seaquest.
Gradient clipping is disabled in experiment \emph{no\_gradient\_clipping}. We note that the performance on Breakout is
significantly improved with a final score of roughly $400$ achieved by all three agents without gradient clipping.
However, some reward graphs on other games appear less stable, notably one of the training runs on BeamRider, Pong and
Seaquest each.
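One common form of gradient clipping rescales the gradients by their global norm before the optimizer step;
whether the experiments use exactly this variant is not stated in this excerpt. The following is a minimal,
self-contained PyTorch sketch, with a placeholder network, data and threshold rather than the configuration used
in the thesis.
\begin{verbatim}
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)

inputs = torch.randn(8, 4)
targets = torch.randn(8, 2)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# Rescale all gradients so that their global norm does not exceed 0.5;
# this bounds how far a single update can move the network's weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
\end{verbatim}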
As gradient clipping restricts the magnitude of change of the neural network's weights, it is another device that keeps
the new policy $\pi_{\boldsymbol\theta}$ close to the old policy $\pi_{\boldsymbol\theta_\text{old}}$. If the new policy
@@ -37,10 +38,10 @@ slower.
Training runs without any form of reward binning or clipping can be seen in experiment \emph{no\_reward\_binning}. The
only game that is not affected by this change is Pong, as the set of rewards in Pong is $\{-1, 0, 1\}$ already.
Furthermore, in Pong no player will score multiple times within $k = 4$ frames (cf.~chapter \ref{sec:04:postprocessing}).
On the remaining games, rewards can achieve much larger magnitudes, so reward binning or clipping has a notable effect on
the rewards. Experiment \emph{reward\_clipping} shows a significant drop in performance on all games but Pong.
This is echoed in the reward graphs and final scores with all agents performing much worse once rewards are no longer
subject to reward binning. As a consequence, rewards should be binned and not clipped.
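To make the distinction concrete, the following NumPy sketch contrasts the two transformations under the assumption
that binning maps each reward to its sign while clipping truncates it to $[-1, 1]$; the exact definitions used in
the thesis are given in chapter \ref{sec:04:postprocessing}.
\begin{verbatim}
import numpy as np

rewards = np.array([0.0, 1.0, 4.0, -7.0, 0.5])

# Binning keeps only the sign of each reward, so every reward ends up in {-1, 0, 1}.
binned = np.sign(rewards)               # [ 0.  1.  1. -1.  1.]

# Clipping truncates the magnitude to the interval [-1, 1] but keeps fractions.
clipped = np.clip(rewards, -1.0, 1.0)   # [ 0.   1.   1.  -1.   0.5]

# For Pong, whose rewards already lie in {-1, 0, 1}, both transformations are no-ops.
\end{verbatim}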
@@ -62,7 +63,7 @@ Similar inconsistency can be seen in the plots published by \citeA{ppo} for a va
With merely three runs per configuration used in evaluation, it is hard to tell if this behavior is representative. We
cannot determine if a single good performance must be attributed to luck or if in fact only roughly a third of all runs
achieve such a performance. For all we know, the true spread of outcomes could cover the entire range between the
best and the worst observed run.
At the time of publication of Proximal Policy Optimization in 2017, OpenAI Gym contained 49 ATARI games. As of writing
this thesis, this number has grown to 57 games. It seems unlikely that each of these games provides a unique challenge.
@@ -82,7 +83,7 @@ or of entirely different algorithms.
\citeA{ppo} state that Proximal Policy Optimization is robust in terms of hyperparameter choice. The results of the
experiments support this statement in general. A large deviation from the published hyperparameters is required to
impede PPO such that little or no learning occurs. This can be seen in the experiment shown in chapter
\ref{sec:05:stability} and with BeamRider in experiment \emph{epsilon\_0\_5}. When hyperparameters remain
remotely close to the original values, the performance may deviate but agents still learn, for example when $\gamma =
0.95$ or $\lambda = 0.9$ (see experiments \emph{gamma\_0\_95} or \emph{lambda\_0\_9}). Therefore, it can be concluded that the
reference configuration is a sensible choice for learning ATARI video games.
......
@@ -45,7 +45,8 @@ Comparing the shapes of the reward graphs with those given by \citeA{ppo} reveal
the magnitude of score obtained in BeamRider. On BeamRider and Seaquest, the algorithm implemented for this thesis
achieves a higher final score than the originally reported results. On Breakout and SpaceInvaders, the obtained score is
slightly lower. Most likely, these differences can be attributed to differing configurations, as \citeA{ppo} train with
a value function loss coefficient $c_1 = 1.0$ and $K = 3$ epochs instead of $c_1 = 0.5$ and $K = 4$. The inherent
randomness of training may contribute to these differences as well.
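For reference, the value function loss coefficient $c_1$ and the number of epochs $K$ enter training through the
combined objective of \citeA{ppo},
\[
L_t(\boldsymbol\theta) = \hat{\mathbb{E}}_t\left[ L_t^\text{CLIP}(\boldsymbol\theta)
  - c_1 L_t^\text{VF}(\boldsymbol\theta) + c_2 S[\pi_{\boldsymbol\theta}](s_t) \right],
\]
which is optimized for $K$ epochs on each batch of collected data; a larger $c_1$ therefore weights the value
function loss more heavily relative to the policy objective. The notation follows \citeA{ppo} and may differ
slightly from the one introduced earlier in this thesis.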
Training with the original configuration grants even more favorable results (cf.~experiment \emph{paper}), which means
that some of the optimization methods outlined in chapter \ref{sec:04:implementation} likely were not used for the
@@ -117,14 +118,14 @@ the entropy coefficient $c_2 = -0.01$ are shown. The corresponding final scores
The results seen with this configuration are the worst among all experiments on BeamRider, Breakout and Pong. However,
single runs of Breakout and Pong (cf.~figures \ref{fig:graph_penalty_breakout} and \ref{fig:graph_penalty_pong}) still
achieve performances comparable to the reference configuration, as highlighted by the red\todo{check this color} trendline.
We further note that Seaquest and SpaceInvaders overall are not as affected by this change. This indicates that games
pose different challenges and that hyperparameters have different effects depending on the respective game.
In figure \ref{fig:graph_penalty_spaceinvaders} a steep drop in performance is visible in the first run after
approximately $5.6 \cdot 10^7$ time steps. This performance collapse could be related to the policy making too large
gradient steps. However, as policy updates occur within the trust region $\epsilon$, the performance recovers to a
degree. If we choose a significantly larger $\epsilon$, the performance collapse may be unrecoverable, as PPO becomes
more similar to simple policy gradient methods.
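As a reminder of how $\epsilon$ bounds policy updates, the clipped surrogate objective of \citeA{ppo} reads
\[
L_t^\text{CLIP}(\boldsymbol\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\boldsymbol\theta)\,\hat{A}_t,\;
  \operatorname{clip}\!\left(r_t(\boldsymbol\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t \right) \right]
\qquad\text{with}\qquad
r_t(\boldsymbol\theta) = \frac{\pi_{\boldsymbol\theta}(a_t \mid s_t)}{\pi_{\boldsymbol\theta_\text{old}}(a_t \mid s_t)};
\]
the larger $\epsilon$ is chosen, the wider the clipping range becomes and the less the objective differs from an
unclipped policy gradient objective. Again, the notation follows \citeA{ppo}.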
Since some agents still learn under this configuration, the algorithm appears to be robust to the specific values chosen for its hyperparameters.
We further discuss this in chapter \ref{sec:05:robustness}.
Recall from chapter \ref{sec:01:motivation} that modern video games such as StarCraft II or Dota 2 pose a great challenge
for reinforcement learning. Solving them requires significant resources and a combination of reinforcement learning
algorithms with other methods.\footnote{One technique is imitation learning, which allows an agent to learn from human
example.} As a consequence, these games are infeasible as a means of comparing different algorithms.
However, benchmarking reinforcement learning algorithms is a necessity when developing improved algorithms. Key metrics are
not only the rewards an agent collects after training, but also the sample efficiency, which describes how quickly an
agent learns. Even the required computation time can be an important criterion, as algorithms with a low computation
time can be trained on more time steps than highly demanding algorithms in the same duration.
In this chapter, we introduce the \emph{Arcade Learning Environment} (ALE), the benchmarking framework we use to
evaluate Proximal Policy Optimization. Afterwards, we discuss the evaluation method and issues one encounters when
attempting to reproduce results. We close by highlighting three experiments and discussing the results.
@@ -10,9 +10,10 @@ In order to evaluate the performance of the Proximal Policy Optimization algorit
the three metrics mentioned in the introduction to this chapter are used. The run time is not subject to evaluation for
several reasons. Firstly, the goal of this thesis is the reproduction and evaluation of results published by
\citeA{ppo}. The authors provide no measure of the computation time of their algorithm other than stating that it runs
faster than others. Furthermore, a comparison of stated computation times is impractical, as it is reasonable to assume
that hardware configurations differ greatly. Nevertheless, the computation time of the code written for this thesis
roughly matches that of the code written by \citeA{kostrikov} on a machine with an Intel i7-6700, 32 GB of RAM and a
GeForce GTX 1060 6GB.
The two points of interest in evaluation are how quickly an agent learns and the performance of an agent at the end of
its training. We evaluate how quickly an agent learns by drawing an episode reward graph. Whenever an episode of a game
@@ -53,4 +54,4 @@ We note that more advanced regression methods are available. Their suitability f
future research. Although simple, the chosen method is intuitive and suitable for comparisons as done in the following.
The final score is determined by computing the average episode reward over the final 100 terminating episodes of
training \cite{ppo}. The final scores given by \citeA{ppo} are averaged scores of the three training runs.
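A minimal sketch of this computation, assuming episode rewards are recorded in chronological order during training
(all variable names are hypothetical):
\begin{verbatim}
import numpy as np

def final_score(episode_rewards, n=100):
    # Average reward of the last n terminating episodes of one training run.
    return float(np.mean(episode_rewards[-n:]))

# Hypothetical data: one array of episode rewards per training run.
runs = [np.random.uniform(0, 400, size=1500) for _ in range(3)]
reported_score = float(np.mean([final_score(r) for r in runs]))  # averaged over runs
\end{verbatim}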
@@ -13,11 +13,11 @@ Debugging can be troublesome due to the frameworks used. Even when multiple impl
identical results is all but simple. The Arcade Learning Environment and the deep learning framework may rely on
randomness, for example when sampling actions. Thus, one must use identical seeds and interface with the random
number generator in the same way, so that its state remains the same across all algorithms used for comparison. If
the different algorithms do not use the same deep learning framework, this issue may
be hard to overcome.
But even when the same framework is used---such as PyTorch, which is used for this thesis---ensuring identical seeds and
random number generator states might not be sufficient: Some operations are inherently non-deterministic, so these
operations must not be used \cite{pytorch}.
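The following sketch shows the kind of seeding such a comparison requires in a PyTorch setup; the calls and flags
are standard but version dependent, and they are listed as an assumption rather than a guaranteed recipe for
determinism.
\begin{verbatim}
import random

import numpy as np
import torch

SEED = 42  # hypothetical seed; every compared implementation must use the same value

random.seed(SEED)        # Python's built-in random number generator
np.random.seed(SEED)     # NumPy, used e.g. by environment wrappers
torch.manual_seed(SEED)  # PyTorch generators (CPU and, in recent versions, CUDA)

# Prefer deterministic cuDNN kernels; operations without a deterministic
# implementation must be avoided entirely.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
\end{verbatim}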
Because PPO as executed in the following experiments is trained on minibatches of size $128 \cdot 8$, manually