Commit 62ca06f6 authored by Daniel Lukats

first draft of chapter 5 -- figures missing

parent 53bf5e19
Typical benchmarks involve classic control tasks such as balancing an inverted pendulum, robotics tasks or ATARI video
games. Although the ATARI 2600 console was released in 1977, the games developed for the console prove challenging even
today \cite{ale}. The authors note that despite the simple graphics, ATARI games can be difficult to predict. For
example, the 18 most recent frames are used to determine the movement of the paddle in Pong. Moreover, agents are
trained on video input from ATARI games, which leads to a large state space (cf.~chapter \ref{sec:04:postprocessing}).
Lastly, video games pose diverse challenges, with some---such as Pong---being easier to solve, whereas the game
Montezuma's Revenge remains challenging to this day. For these reasons, ATARI 2600 video games are a popular benchmarking
environment.\footnote{One might note that video games are well-known and could pique people's interest as well.}
The \emph{Arcade Learning Environment} (ALE) provides a benchmarking environment containing 57 games
\cite{ale}.\footnote{When it was published, the Arcade Learning Environment contained fewer games. It contains 57 games
at the time of writing this thesis.} It is included, among other benchmark environments, in the popular OpenAI Gym
framework \cite{gym}.
We conduct experiments on a selection of five games of the ALE: BeamRider, Breakout, Pong, Seaquest and SpaceInvaders
are highlighted by \citeA{dqn} due to their differing complexity. Reinforcement learning agents easily surpass human
performance on Breakout and Pong, whereas they fail to achieve human performance on Seaquest and SpaceInvaders.
\citeA{dqn} state that the latter games require long-term strategy and cannot be solved by reacting quickly and
accurately. On BeamRider, agents achieve close to human performance.\todo{discuss that i disagree with this? human score
is 7800, my score is like 3800 best? dqn is not much better}
OpenAI Gym and ALE simplify evaluation greatly, as they provide a unified interface for all environments. We simply
provide an action to the environment and obtain observations containing the current state of the environment as a $210
\times 160$ RGB image, the reward, and further information such as the number of lives remaining or whether the game
terminated. Just like the ATARI 2600 console, the Arcade Learning Environment runs at 60 Hz---without frame skipping we
would obtain 60 observations per second.
Since the ALE returns a reward, we do not have to design a reward function. Instead, all reinforcement
learning algorithms are trained using the same reward function apart from post-processing. The reward of an action is the
change in score, making the accumulated rewards of an episode the high-score.
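To illustrate this interface, the following minimal sketch shows an interaction loop, assuming the classic Gym API; the
environment id, the random policy and all variable names are purely illustrative.
\begin{verbatim}
import gym

# Minimal interaction loop with the classic Gym API (assumed here; details
# differ between Gym versions). The environment id is illustrative.
env = gym.make("PongNoFrameskip-v4")

observation = env.reset()  # 210 x 160 x 3 RGB image as a numpy array
episode_return = 0.0
done = False

while not done:
    action = env.action_space.sample()  # random action for illustration
    observation, reward, done, info = env.step(action)
    episode_return += reward  # sum of score changes equals the final score

print("episode score:", episode_return)
\end{verbatim}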
The action space of ATARI games is discrete. Every game accepts the actions $\mathcal{A} = \{\emph{\text{noop}},
\emph{\text{fire}}, \emph{\text{left}}, \emph{\text{right}}, \emph{\text{up}}, \emph{\text{down}}\}$. Actions that are
\todo[inline]{short intro}
\subsubsection{Non-disclosed Optimizations}
Since agents trained with Proximal Policy Optimization exactly as described in the publication perform significantly
worse than agents trained with the reference configuration, the question arises whether all optimizations are required
and what their individual impact is. Due to time and resource constraints, we disable individual optimizations one at a
time rather than testing all possible combinations.
\paragraph{Advantage normalization.}
The effect of disabling advantage normalization can be seen in experiment \emph{no\_advantage\_normalization}.
Normalizing advantages has a slight positive effect on Breakout, Pong and SpaceInvaders. However, it notably degrades
the performance on BeamRider and Seaquest.
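As a reminder of the operation in question, a minimal sketch of per-minibatch advantage normalization is given below;
the function name and the constant $10^{-8}$ are illustrative assumptions.
\begin{verbatim}
import torch

def normalize_advantages(advantages: torch.Tensor) -> torch.Tensor:
    # Standardize the advantages of a minibatch to zero mean and unit
    # variance; the small constant guards against division by zero.
    return (advantages - advantages.mean()) / (advantages.std() + 1e-8)
\end{verbatim}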
\paragraph{Gradient clipping.}
Gradient clipping is disabled in experiment \emph{no\_gradient\_clipping}. We note that the performance on Breakout
improves significantly, with all three agents reaching a final score of roughly $400$ without gradient clipping.
However, some reward graphs on other games appear less stable, notably one training run each on BeamRider, Pong and Seaquest.
As gradient clipping restricts the magnitude of change of the neural network's weights, it is another device that keeps
the new policy $\pi_{\boldsymbol\theta}$ close to the old policy $\pi_{\boldsymbol\theta_\text{old}}$. If the new policy
deviates a little too far, we may observe more noise, as the policy becomes slightly less optimal. However, the surrogate
objective is still clipped with $\epsilon$, which keeps the new policy near the old one, so no complete performance collapse occurs.
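For reference, the following sketch shows gradient clipping by global norm as it is commonly implemented in PyTorch; the
toy model, the loss and the threshold of $0.5$ are illustrative assumptions.
\begin{verbatim}
import torch

model = torch.nn.Linear(4, 2)                  # toy model, for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
loss = model(torch.randn(8, 4)).pow(2).mean()  # placeholder loss

optimizer.zero_grad()
loss.backward()
# Rescale all gradients if their global L2 norm exceeds the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
\end{verbatim}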
\paragraph{Orthogonal initialization.}
Experiment \emph{no\_orthogonal\_initialization} contains reward graphs of training runs conducted without this
optimization. Although orthogonal initialization has no notable impact on Breakout and SpaceInvaders, it is an important
optimization for BeamRider, Pong and Seaquest. Performance on the latter games is worse when the neural network is not
initialized with an orthogonal initialization scheme: on all three, the final score is lower and learning proceeds more
slowly.
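A minimal sketch of such an initialization scheme is shown below; the layer sizes and gains are illustrative assumptions,
with $\sqrt{2}$ being a common gain for hidden layers and a smaller gain for the policy head.
\begin{verbatim}
from torch import nn

def init_orthogonal(layer: nn.Linear, gain: float) -> nn.Linear:
    # Orthogonal weight initialization with a configurable gain;
    # biases are set to zero.
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

hidden = init_orthogonal(nn.Linear(512, 512), gain=2 ** 0.5)
policy_head = init_orthogonal(nn.Linear(512, 6), gain=0.01)
\end{verbatim}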
\paragraph{Reward binning.}
Training runs without any form of reward binning or clipping can be seen in experiment \emph{no\_reward\_binning}. The
only game that is not affected by this change is Pong, as the set of rewards in Pong is $\{-1, 0, 1\}$ already.
Furthermore, no player will score multiple times within $k = 4$ frames (cf.~chapter \ref{sec:04:postprocessing}).
On the remaining games, rewards can reach much larger magnitudes, so reward binning or clipping has a notable effect on
the rewards. However, experiment \emph{reward\_clipping} shows a significant drop in performance on all games but Pong.
This is echoed in the reward graphs and final scores, with all agents performing much worse once rewards are no longer
subject to reward binning.
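To make the distinction explicit, the following sketch contrasts reward binning with reward clipping; the function names
are illustrative.
\begin{verbatim}
import numpy as np

def bin_reward(reward: float) -> float:
    # Binning maps every positive reward to +1 and every negative reward
    # to -1, while 0 remains 0 -- i.e. the sign of the score change.
    return float(np.sign(reward))

def clip_reward(reward: float) -> float:
    # Clipping limits the reward to [-1, 1]; it differs from binning only
    # for reward magnitudes below 1.
    return float(np.clip(reward, -1.0, 1.0))
\end{verbatim}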
\paragraph{Value function loss clipping.}
TODO
\paragraph{Conclusion.}
TODO
\subsubsection{Outliers}
Figures \ref{fig:graph_seaquest} and \ref{fig:graph_penalty_beamrider} to \ref{fig:graph_paper_beamrider} contain
obvious outliers, some of which perform considerably better than the other runs on the same game, whereas others perform
considerably worse. Among all experiments conducted for this thesis, about $35\%$ of the generated graphs contain an
obvious outlier. Similar inconsistency can be seen in the plots published by \citeA{ppo} for a variety of games.
With merely three runs per configuration used in evaluation, it is hard to tell if this behavior is representative. We
cannot determine if a single good performance must be attributed to luck or if in fact only roughly a third of all runs
achieve such a performance. For all we know, the actual spread could be the entire range between a good run and a bad
run.
At the time of publication of Proximal Policy Optimization in 2017, OpenAI Gym contained 49 ATARI games. As of writing
this thesis, this number has grown to 57 games. It seems unlikely that each of these games provides a unique challenge.
Therefore, instead of training on all games available in OpenAI Gym, one could train on a specific selection of games
chosen by experts to be as diverse as possible. If this subset is half or a third the size of the original set of games,
we can easily perform more training runs on each game without requiring more time to conduct our experiments.
Even if no such subset can be selected, performing more runs per configuration might be a worthwhile effort. By
increasing the number of training runs, we achieve more robust results. Thus, it becomes easier to discern whether a
run's performance is representative of the configuration under test. Moreover, we can compute more reliable trendlines
and averages that could allow for easier comparisons of configurations of the same algorithm or of entirely different
algorithms.
\subsubsection{Robustness to Hyperparameter Choices}
\label{sec:05:robustness}
\citeA{ppo} state that Proximal Policy Optimization is robust in terms of hyperparameter choice. The results of the
experiments generally support this statement. Only large deviations from the published hyperparameters impede PPO to the
point that little or no learning occurs. This can be seen in the experiment shown in chapter \ref{sec:05:stability}
and on BeamRider in experiment \emph{epsilon\_0\_5}. When hyperparameters remain reasonably close to the original
values, the performance may deviate but agents still learn, for example when $\gamma = 0.95$ or $\lambda = 0.9$
(see experiments \emph{gamma\_0\_95} and \emph{lambda\_0\_9}). Therefore, it can be concluded that the reference
configuration is a sensible choice for learning ATARI video games.
That does not mean that the reference configuration is the optimal configuration for every game:
\begin{itemize}
\item Agents perform much better on Breakout when $\epsilon$ is not annealed over the course of the training
(cf.~experiment \emph{no\_epsilon\_annealing}; a sketch of the linear annealing schedule follows this list).
\item When the minibatch size is decreased from $256$ to $64$, agents learn Pong considerably faster (cf.~experiment
\emph{16\_minibatches}).
\item Both the reward graph and the final score on Seaquest are significantly improved when the learning rate $\alpha$
is not annealed over the course of the training (cf.~experiment \emph{no\_alpha\_annealing}).
\end{itemize}
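As referenced in the list above, annealing here refers to linearly decaying a coefficient such as the clipping range
$\epsilon$ or the learning rate $\alpha$ over the course of the training. A minimal sketch under that assumption, with
illustrative values:
\begin{verbatim}
def linear_anneal(initial_value: float, iteration: int,
                  total_iterations: int) -> float:
    # Linearly decay a coefficient (e.g. epsilon or the learning rate alpha)
    # from its initial value to 0 over the course of the training.
    remaining = 1.0 - iteration / total_iterations
    return initial_value * remaining

# Example: the clipping range after roughly half of the training.
epsilon = linear_anneal(0.1, iteration=4882, total_iterations=9765)
\end{verbatim}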
However, making such changes often leads to worse performance on at least one other game. Hence, it is hard to
conclude which configuration is superior, and a larger set of games may be required to draw extensive conclusions.
\section{Evaluation}
\label{sec:05:evaluation}
\input{05_evaluation/introduction}
\subsection{Arcade Learning Environment}
\label{sec:05:ale}
\input{05_evaluation/reproduction}
\subsection{Experiments}
\todo[inline]{refer to \protect\cite{results}}
\todo[inline]{check URL of \protect\cite{results}}
\todo[inline]{probably need multiple evaluation sections and possibly multiple hyperparameter sections}
\todo[inline]{one experiment on CNN from \protect\cite{kostrikov}}
\label{sec:05:experiments}
\input{05_evaluation/experiments}
\subsection{Discussion}
\label{sec:05:discussion}
\input{05_evaluation/discussion}
Recalling from chapter \ref{sec:01:motivation}, modern video games such as StarCraft II or Dota 2 pose a great challenge
for reinforcement learning. Solving them requires significant resources and a combination of reinforcement learning
algorithms with other methods.\footnote{One such technique is imitation learning, which allows an agent to learn from
human examples.} As a consequence, they are infeasible as a means of comparing different algorithms.
However, benchmarking reinforcement learning algorithms is a necessity when developing improved algorithms. Key metrics
are not only the performance of an algorithm after training, but also the sample efficiency, which describes how quickly
an agent learns. Even the required computation time can be an important criterion, as algorithms with a low computation
time can be tested more often than highly demanding ones.
In this chapter, the \emph{Arcade Learning Environment} (ALE) is introduced, which is the benchmarking framework used to
evaluate Proximal Policy Optimization. Afterwards, we discuss the evaluation method and issues one encounters when
attempting to reproduce results. We close by highlighting three experiments and discussing the results.
% Review the we here
Proximal Policy Optimization is evaluated by running the algorithm three times with the same configuration but different
seeds. As many sources of stochasticity are involved, averaging over several runs gives a more plausible picture of the
algorithm's performance with a given configuration. Two performance indicators are compared by \citeA{ppo}:
\begin{itemize}
\item The average score of the last 100 episodes completed in training indicates the final performance of a trained
agent.
\item An average reward graph can be used to assess how fast an agent learns. It is created by plotting the average
score of all game episodes that terminate at a given time step.
\end{itemize}
Benchmarking reinforcement learning algorithms on ATARI 2600 video games is commonly done by training agents on
$10,000,000$ time steps of data. Since Proximal Policy Optimization executes $N$ actors simultaneously, the number of
iterations must be adjusted to account for $N$ as well as the rollout horizon $T$. The number of iterations $I$ then is
\begin{align*}
I = \left\lfloor 10,000,000\;/\;(T \cdot N)\right\rfloor.
\end{align*}
With the common choices of $N = 8$ and $T = 128$, we train for $I = 9765$ iterations.
\todo[inline]{we do not measure run time of the algorithm}
In order to evaluate the performance of the Proximal Policy Optimization algorithm implemented for this thesis, two of
the three metrics mentioned in the introduction to this chapter are used. The run time is not subject to evaluation for
several reasons. Firstly, the goal of this thesis is the reproduction and evaluation of results published by
\citeA{ppo}. The authors provide no measure of the computation time of their algorithm other than stating that it runs
faster than others. Furthermore, a comparison of stated computation times is impractical, as it is reasonable to assume that
hardware configurations differ greatly. Nevertheless, the computation time of the code written for this thesis roughly
matches that of the code written by \citeA{kostrikov}.
\todo[inline]{trendline of most widespread setup for comparison}
The two points of interest in the evaluation are how quickly an agent learns and the performance of an agent at the end
of its training. We evaluate how quickly an agent learns by drawing an episode reward graph: whenever an episode of a
game terminates, its high-score is plotted. As we run $N$ environments simultaneously, several episodes may terminate at
the same time; if this occurs, we plot the average of all terminated episodes. Figure \ref{fig:reward_graph}
displays an unsmoothed reward graph for the game Pong. The x-axis displays the training time step and the y-axis shows
the high-score. As a result, we obtain a graph that shows the performance of an agent over the course of its training.
\todo[inline]{smoothing of graphs}
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{05_evaluation/Pong.png}
\caption{TODO---this actually isnt even a non-smoothed graph atm. just imagine it was noisier. to be fixed with
colors and font size later}
\label{fig:reward_graph}
\end{figure}
Since agents encounter novel states regularly in the beginning of the training, the performance of episodes can vary
greatly. This results in a noisy episode reward graph. Depending on the complexity of the game, the graph may remain
noisy even until the end of the training, e.g., when learning Breakout. We alleviate this issue by smoothing the reward
graph---outliers are featured less prominently and the noise is reduced, whilst the overall trend of the training is
preserved.
The reward graphs are smoothed by computing the average of a sliding window spanning $16$ data points. This window is
centered, so each smoothed data point is the average of the data point itself, the $8$ preceding episodes and the $7$
following ones. We note that this operation does not remove all noise, which allows us to assess whether certain choices
have a stabilizing effect. Furthermore, retaining some noise is a deliberate choice, as noisy graphs are to be expected
depending on the environment.
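A minimal sketch of such a centered sliding-window average is given below, assuming NumPy; the same function with a
window of $256$ data points yields the trendline described next.
\begin{verbatim}
import numpy as np

def centered_moving_average(scores: np.ndarray,
                            window: int = 16) -> np.ndarray:
    # Centered sliding-window average: for window=16, each output value is
    # the mean of the value itself, the 8 preceding and the 7 following
    # values. Near the edges np.convolve treats missing values as zero,
    # which is acceptable for a simple visualization.
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")
\end{verbatim}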
As the performance in training depends on chance, \citeA{ppo} train three agents on each game for any given
configuration. To enable easier comparison of configurations, we add a simple trendline. We compute this trendline by
combining the data of all three runs and computing the average of a sliding window spanning $256$ data points. As with
the individual runs, this window is centered, such that each smoothed data point is the average of the data point
itself, the $128$ preceding data points and the $127$ following ones.
To compare the performance of a given configuration with that of a reference configuration, we add a trendline generated
with the reference configuration.
We note that more advanced regression methods are available; their suitability for the task at hand is left as a topic
for future research. Although simple, the chosen method is intuitive and suitable for the comparisons carried out in the
following.
The final score is determined by computing the average score of the final 100 episodes that terminate during training
\cite{ppo}. The final scores given by \citeA{ppo} are averages over three training runs.
issues caused by tooling
\todo[inline]{sources of randomness -- deep learning framework seed (random actions) environment seed (potentially
random states) numpy resets}
\todo[inline]{non-determinism even with same seeds -- refer to pytorch}
issues due to methodology
\todo[inline]{only graphs to compare learning speed}
\todo[inline]{graphs by ppo authors have random smoothing}
\todo[inline]{in case of PPO exact configuration is unknown. which loss was used in graphs, which neural net architecture?}
Reproducing results can be challenging for multiple reasons. Since the learning speed is assessed with reward graphs,
it is hard to compare results from various sources as one needs to compare multiple graphs in different figures. This
issue is exacerbated by the fact that the reward graphs plotted by \citeA{ppo} are smoothed, but the smoothing method is
not disclosed in the paper.
Another problem stems from the fact that the exact configuration of the algorithm used to achieve the published
performance is unknown. It is unclear if optimizations such as advantage normalization (cf.~chapter
\ref{sec:04:adv_normalization}) and value function loss clipping (cf.~chapter \ref{sec:04:value_clipping}) were used in the
publication. These optimizations are not part of the initial PPO code \cite[baselines/ppo1]{baselines} and were
not mentioned in an update on the repository either \cite{ppo_blog}.
Debugging can be troublesome due to the frameworks used. Even when multiple implementations are available, ensuring
identical results is anything but simple. The Arcade Learning Environment and the deep learning framework may rely on
randomness, for example when sampling actions. Thus, one must ensure identical seeds and interface with the random
number generator in the same way, so that its state remains the same in all algorithms used for comparison. If the
different algorithms utilize different deep learning frameworks, this issue may be hard to overcome.
But even when the same framework is used---such as PyTorch, which is used for this thesis---ensuring identical seeds and
random number generator states might not be sufficient: some operations are inherently non-deterministic, so these
operations must be avoided \cite{pytorch}.
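The following sketch illustrates the kind of seeding routine this requires; the exact set of random number generators to
seed depends on the implementation and is an assumption here.
\begin{verbatim}
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # Seed all random number generators involved in training. Even then,
    # some operations remain non-deterministic unless deterministic
    # algorithms are enforced (cf. the PyTorch documentation).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
\end{verbatim}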
Because PPO as executed in the following experiments is trained on batches of $128 \cdot 8$ time steps, manually
calculating the loss is infeasible. Likewise, manually performing backpropagation and calculating weight changes for a
neural network with approximately $80000$ parameters is unrealistic.
In order to ensure that the algorithm implemented for this thesis works as intended, several tests were conducted.
Approximately 250 unit and integration tests cover the algorithm. Most of these tests validate the loss calculation,
with many small-scale tests based on manually calculated results. A few larger tests, spanning losses calculated from
up to $128$ time steps of data, were generated using data obtained from the project of \citeA{kostrikov}. Furthermore,
several tests were run by replacing components of the code of both \citeA{kostrikov} and \citeA{jayasiri} with
components written for this thesis.
As the algorithm is tested extensively and the results of the experiment shown in chapter \ref{sec:05:first_exp} match
those published by \citeA{ppo}, one can trust that the code written for this thesis allows for a reliable evaluation.