diff --git a/thesis/05_evaluation/ale.tex b/thesis/05_evaluation/ale.tex
index 550509e512d1e462c4aed5ab09ba497479ebcb9f..2029cc7203cdaca99469eecad5045b194450cb7b 100644
--- a/thesis/05_evaluation/ale.tex
+++ b/thesis/05_evaluation/ale.tex
@@ -1,37 +1,33 @@
-As mentioned in chapter \ref{sec:01:motivation}, video games such as Starcraft 2 and Dota 2 pose a great challenge
-for reinforcement learning. Solving them requires significant resources and a combination of reinforcement learning
-algorithms with other concepts.\footnote{One technique is imitation learning, that allows an agent to learn from human
-example.} As a consequence, they are infeasible as a means of comparing different algorithms.
-
-However, comparing reinforcement learning algorithms is a necessity when developing improved algorithms. Key metrics are
-not only the performance of an algorithm after training, but also the sample efficiency that describes how quick an
-agent learns. Even the required computation time can be an important criterion, as algorithms with a low computation
-time can be tested more often than highly demanding algorithms.
-
 Typical benchmarks involve classic control tasks such as balancing an inverse pendulum, robotics tasks or ATARI video
 games. Although the ATARI 2600 console was released in 1977, the games developed for the console prove challenging
 even today \cite{ale}. The authors note that despite the simple graphics, ATARI games can be difficult to predict. For
-example, the 18 most recent frames are used to determine the movement of the paddle in Pong. \todo{probably one more
-sentence on challenges} For these reasons, ATARI 2600 video games are a popular benchmarking environment.\footnote{One
-might note that video games are well-known and could pique peoples' interest as well.}
+example, the 18 most recent frames are used to determine the movement of the paddle in Pong. Moreover, agents are
+trained on video input from ATARI games, which leads to a large state space (cf.~chapter \ref{sec:04:postprocessing}).
+Lastly, video games pose diverse challenges: some---such as Pong---are comparatively easy to solve, whereas the game
+Montezuma's Revenge remains challenging. For these reasons, ATARI 2600 video games are a popular benchmarking
+environment.\footnote{One might note that video games are well-known and could pique people's interest as well.}
 
-The \emph{Arcade Learning Environment} (ALE) provides a benchmark environment containing \todo{FIND THE NUMBER} TODO
-games \cite{ale}. It is included in the OpenAI gym framework, which is a popular framework containing other benchmark
-environments such as the aforementioned control and robotics tasks, too \cite{gym}. Due to resource and time
-constraints, we restrict evaluation to 5 games of the ALE only.
+The \emph{Arcade Learning Environment} (ALE) provides a benchmarking environment containing 57 games
+\cite{ale}.\footnote{When it was published, the Arcade Learning Environment contained fewer games. It contains 57 games
+at the time of writing this thesis.} It is included, among other benchmark environments, in the popular OpenAI Gym
+framework \cite{gym}.
 
-BeamRider, Breakout, Pong, Seaquest and SpaceInvaders are highlighted by \citeA{dqn} due to their differing complexity.
-Reinforcement learning agents easily surpass human performance in Breakout and Pong, whereas they achieve human
-performance on neither Beamrider nor Seaquest.
-\citeA{dqn} state, that the latter games require long-term strategy and
-cannot be solved by reacting quickly and accurately. On Space Invaders, agents should achieve about human performance.
-\todo{check this paragraph}
+We conduct experiments on a selection of five games of the ALE: BeamRider, Breakout, Pong, Seaquest and SpaceInvaders.
+These games are highlighted by \citeA{dqn} due to their differing complexity. Reinforcement learning agents easily
+surpass human performance on Breakout and Pong, whereas they fail to achieve human performance on Seaquest and
+SpaceInvaders. \citeA{dqn} state that the latter games require long-term strategy and cannot be solved by reacting
+quickly and accurately. On BeamRider, agents achieve close to human performance.\todo{discuss that i disagree with this?
+human score is 7800, my score is like 3800 best? dqn is not much better}
 
-OpenAI gym and ALE simplify evaluation greatly, as they provide a unified interface for all environments. We simply
+OpenAI Gym and ALE simplify evaluation greatly, as they provide a unified interface for all environments. We simply
 provide an action to the environment and obtain observations containing the current state of the environment as a
 $210 \times 160$ RGB image, the reward and further information such as the number of lives remaining or if the game
-terminated. Since the ALE returns a reward, we do not have to design a reward function. Instead, all reinforcement
+terminated. Just like the ATARI 2600 console, the Arcade Learning Environment runs at 60 Hz---without frame skipping we
+would obtain 60 observations per second.
+
+Since the ALE returns a reward, we do not have to design a reward function. Instead, all reinforcement
 learning algorithms are trained using the same reward function bar post-processing. The reward of an action is the
-change in score, making the cumulated rewards of an episode the high-score.
+change in score, making the cumulative reward of an episode the high-score.
 
 The action space of ATARI games is discrete. Every game accepts the actions $\mathcal{A} = \{\emph{\text{noop}},
 \emph{\text{fire}}, \emph{\text{left}}, \emph{\text{right}}, \emph{\text{up}}, \emph{\text{down}}\}$. Actions that are
diff --git a/thesis/05_evaluation/discussion.tex b/thesis/05_evaluation/discussion.tex
new file mode 100644
index 0000000000000000000000000000000000000000..e6d5b1d5980e1c329ab59bef63ed83f88bf0465a
--- /dev/null
+++ b/thesis/05_evaluation/discussion.tex
@@ -0,0 +1,100 @@
+In the following, we discuss notable findings from the experiments: the impact of the optimizations that are not
+disclosed in the publication, the outliers visible in several reward graphs and the robustness of Proximal Policy
+Optimization to the choice of hyperparameters.
+
+\subsubsection{Non-disclosed Optimizations}
+
+Since the performance of agents trained with Proximal Policy Optimization as described in the publication is
+significantly worse than the performance achieved with the reference configuration, the question arises which of the
+optimizations are actually required and what their individual impact is. Due to time and resource constraints, we
+disable each optimization individually instead of testing every possible combination of optimizations.
+
+\paragraph{Advantage normalization.}
+
+The effect of disabling advantage normalization can be seen in experiment \emph{no\_advantage\_normalization}.
+Normalizing advantages has a slight positive effect on Breakout, Pong and SpaceInvaders. However, the performance on
+BeamRider is notably worse, as is the performance on Seaquest.
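+
+For reference, advantage normalization standardizes the advantage estimates before they enter the surrogate objective.
+The following formulation is a generic sketch and does not restate the exact definition of chapter
+\ref{sec:04:adv_normalization}:
+\begin{align*}
+    \hat{A}_t \gets \frac{\hat{A}_t - \operatorname{mean}(\hat{A})}{\operatorname{std}(\hat{A}) + c}
+\end{align*}
+Here, $\operatorname{mean}(\hat{A})$ and $\operatorname{std}(\hat{A})$ are computed over the advantage estimates of the
+current update and $c$ is a small constant that prevents division by zero. Without this standardization, the scale of
+the advantages, and thus the effective size of the policy updates, depends on the magnitude of the rewards of the
+respective game.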
+
+\paragraph{Gradient clipping.}
+
+Gradient clipping is disabled in experiment \emph{no\_gradient\_clipping}. We note that the performance on Breakout is
+significantly improved: all three agents trained without gradient clipping achieve a final score of roughly $400$.
+However, some reward graphs on other games appear less stable, notably one training run each on BeamRider, Pong and
+Seaquest.
+
+As gradient clipping restricts the magnitude of change of the neural network's weights, it is another device that keeps
+the new policy $\pi_{\boldsymbol\theta}$ close to the old policy $\pi_{\boldsymbol\theta_\text{old}}$. If the new policy
+deviates a little too far, we could see more noise, as the policy becomes slightly less optimal. However, the surrogate
+objective clipped with $\epsilon$ still keeps the new policy near the old one, so we see no total performance collapse.
+
+\paragraph{Orthogonal initialization.}
+
+Experiment \emph{no\_orthogonal\_initialization} contains reward graphs of trainings conducted without this
+optimization. Although orthogonal initialization has no notable impact on Breakout and SpaceInvaders, it is an important
+optimization for BeamRider, Pong and Seaquest. On these three games, the final score is lower and learning proceeds more
+slowly when the neural network is not initialized with an orthogonal initialization scheme.
+
+\paragraph{Reward binning.}
+
+Training runs without any form of reward binning or clipping can be seen in experiment \emph{no\_reward\_binning}. The
+only game that is not affected by this change is Pong, as the set of rewards in Pong is $\{-1, 0, 1\}$ already.
+Furthermore, no player will score multiple times within $k = 4$ frames (cf.~chapter \ref{sec:04:postprocessing}).
+
+On the remaining games, rewards can reach much larger magnitudes, so reward binning or clipping has a notable effect on
+the rewards. However, experiment \emph{reward\_clipping} shows a significant drop in performance on all games but Pong.
+This is echoed in the reward graphs and final scores, with all agents performing much worse once rewards are no longer
+subject to reward binning.
+
+\paragraph{Value function loss clipping.}
+
+TODO
+
+\paragraph{Conclusion.}
+
+TODO
+
+\subsubsection{Outliers}
+
+Figures \ref{fig:graph_seaquest} and \ref{fig:graph_penalty_beamrider} to \ref{fig:graph_paper_beamrider} contain
+obvious outliers: some runs perform a lot better than the other runs on that game, whereas others perform a lot worse.
+Among all experiments conducted for this thesis, about $35\%$ of the graphs generated contain an obvious outlier.
+Similar inconsistency can be seen in the plots published by \citeA{ppo} for a variety of games.
+
+With merely three runs per configuration used in evaluation, it is hard to tell if this behavior is representative. We
+cannot determine if a single good performance must be attributed to luck or if in fact only roughly a third of all runs
+achieve such a performance. For all we know, the actual spread could be the entire range between a good run and a bad
+run.
+
+At the time of publication of Proximal Policy Optimization in 2017, OpenAI Gym contained 49 ATARI games. As of writing
+this thesis, this number has grown to 57 games. It seems unlikely that each of these games provides a unique challenge.
+Therefore, instead of training on all of the games available in OpenAI Gym, one could train on a specific selection of
+games chosen by experts to be as diverse as possible.
+If this subset is half or a third the size of the original set of games, we can easily perform more training runs on
+each game without requiring more time to conduct our experiments.
+
+Even if no such subset can be selected, performing more runs for each configuration might be a worthwhile effort. By
+increasing the number of training runs, we achieve more robust results. Thus, it becomes easier to discern whether a
+run's performance is representative of the configuration under test. Moreover, we can compute more robust trendlines and
+averages that could allow for easier comparisons of configurations of the same algorithm or of entirely different
+algorithms.
+
+\subsubsection{Robustness to Hyperparameter Choices}
+\label{sec:05:robustness}
+
+\citeA{ppo} state that Proximal Policy Optimization is robust in terms of hyperparameter choice. The results of the
+experiments generally support this statement. A huge deviation from the published hyperparameters is required to impede
+PPO such that little or no learning occurs. This can be seen in the experiment shown in chapter \ref{sec:05:stability}
+and on BeamRider in experiment \emph{epsilon\_0\_5}. When hyperparameters remain at least remotely close to the original
+values, the performance may deviate but agents still learn, for example when $\gamma = 0.95$ or $\lambda = 0.9$ (see
+experiments \emph{gamma\_0\_95} and \emph{lambda\_0\_9}). Therefore, it can be concluded that the reference
+configuration is a sensible choice for learning ATARI video games.
+
+That does not mean that the reference configuration is the optimal configuration for every game:
+\begin{itemize}
+    \item Agents perform much better on Breakout when $\epsilon$ is not annealed over the course of the training
+      (cf.~experiment \emph{no\_epsilon\_annealing}).
+    \item When the minibatch size is decreased from $256$ to $64$, agents learn Pong a lot faster (cf.~experiment
+      \emph{16\_minibatches}).
+    \item Both the reward graph and the final score on Seaquest are significantly improved when the learning rate
+      $\alpha$ is not annealed over the course of the training (cf.~experiment \emph{no\_alpha\_annealing}).
+\end{itemize}
+However, such changes often lead to worse performance on at least one other game. Hence, it is hard to conclude which
+configuration is superior, and a larger set of games may be required to draw extensive conclusions.
diff --git a/thesis/05_evaluation/experiments.tex b/thesis/05_evaluation/experiments.tex
new file mode 100644
index 0000000000000000000000000000000000000000..cf5aeeac7787476f644ee9d995fca7c9adf039f6
--- /dev/null
+++ b/thesis/05_evaluation/experiments.tex
@@ -0,0 +1,221 @@
+\todo[inline]{put a list of all experiments with short note on results in the appendix}
+
+A total of $42$ experiments were conducted to evaluate the optimizations listed in chapter \ref{sec:04:implementation}
+and the values of hyperparameters. For each experiment, two graphs are generated: one showing only the runs of the
+experiment itself and one additionally containing a trendline of the reference configuration of Proximal Policy
+Optimization for ATARI. The reference configuration denotes the hyperparameters outlined in chapter
+\ref{sec:04:full_algorithm} with all optimizations listed in chapters
+\ref{sec:04:adv_normalization}--\ref{sec:04:gradient_clipping}. It uses reward binning instead of reward clipping
+(cf.~chapter \ref{sec:04:postprocessing}).
+
+As we test each configuration on $5$ games, this makes for a total of $400$ graphs.
+For this reason, only a selection of graphs is shown in this thesis. However, all graphs, including the configurations
+and the raw data used to generate them, are made available online \cite{results}. When we refer to an experiment without
+specifying a figure or table, consult the repository TODO.
+
+In the following, we present three experiments. The first is conducted with the reference configuration and shows
+whether the algorithm implemented for this thesis achieves the intended performance. The second demonstrates the
+stability of PPO even when it is severely misconfigured, by training with an entropy penalty instead of an entropy
+bonus. Finally, we evaluate the effects of the non-disclosed optimizations. We discuss the findings in chapter
+\ref{sec:05:discussion}.
+
+\subsubsection{Reference Configuration}
+\label{sec:05:first_exp}
+
+Figures \ref{fig:graph_beamrider}--\ref{fig:graph_spaceinvaders} show the reward graphs of three training runs on each
+of the games chosen for evaluation. The final scores of the three runs and the scores reported by \citeA{ppo} are shown
+in table \ref{table:default_score}.
+
+\begin{table}[ht]
+    \centering
+    \begin{tabular}{l|r r r r}
+        Game & Run 1 & Run 2 & Run 3 & \citeA{ppo} \\
+        \hline
+        BeamRider & $2432.3$ & $2324.5$ & $2246.6$ & $1590.0$ \\
+        Breakout & $257.3$ & $252.9$ & $230.4$ & $274.8$ \\
+        Pong & $20.9$ & $20.9$ & $20.6$ & $20.7$ \\
+        Seaquest & $1756.6$ & $1744.4$ & $928.0$ & $1204.5$ \\
+        SpaceInvaders & $1026.1$ & $763.8$ & $670.6$ & $942.5$
+    \end{tabular}
+    \caption{Final scores achieved in training with the reference configuration and final scores given by the authors of
+    Proximal Policy Optimization.}
+    \label{table:default_score}
+\end{table}
+
+Comparing the shapes of the reward graphs with those given by \citeA{ppo} reveals no discernible differences except for
+the magnitude of the score obtained in BeamRider. On BeamRider and Seaquest, the algorithm implemented for this thesis
+achieves a higher final score than the originally reported results. On Breakout and SpaceInvaders, the obtained score is
+slightly lower. Most likely, these differences can be attributed to differing configurations, as \citeA{ppo} train with
+a value function loss coefficient $c_1 = 1.0$ and $K = 3$ epochs instead of $c_1 = 0.5$ and $K = 4$.
+
+Training with the original configuration yields even more favorable results (cf.~experiment \emph{paper}), which
+suggests that some of the optimization methods outlined in chapter \ref{sec:04:implementation} were likely not used for
+the original publication.
+
+We can easily see a lot of noise in the plots for BeamRider, Breakout and SpaceInvaders (cf.~figures
+\ref{fig:graph_beamrider}, \ref{fig:graph_breakout} and \ref{fig:graph_spaceinvaders}). This is expected, as our agents
+follow stochastic policies and may pick suboptimal actions that allow them to explore the state space. Furthermore, the
+agents still encounter novel states of which they have no knowledge, leading to more random actions. Finally, there is a
+very apparent outlier in the Seaquest runs. The same can be observed in \citeauthor{ppo}'s \citeyear{ppo} graphics. We
+discuss this phenomenon in chapter \ref{sec:05:discussion}.
+
+As the performance of the implementation in this thesis matches or exceeds the published results, one can reasonably
+assume that the results in this thesis are reliable.
+Therefore, the implementation is suitable for the comparison of various hyperparameter values and optimization choices.
+
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/BeamRider.png}
+    \caption{BeamRider trained with the reference configuration}
+    \label{fig:graph_beamrider}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Breakout.png}
+    \caption{Breakout trained with the reference configuration}
+    \label{fig:graph_breakout}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Pong.png}
+    \caption{Pong trained with the reference configuration}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Seaquest.png}
+    \caption{Seaquest trained with the reference configuration}
+    \label{fig:graph_seaquest}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/SpaceInvaders.png}
+    \caption{SpaceInvaders trained with the reference configuration}
+    \label{fig:graph_spaceinvaders}
+\end{figure}
+
+\subsubsection{Stability With a Suboptimal Hyperparameter}
+\label{sec:05:stability}
+
+Figures \ref{fig:graph_penalty_beamrider}--\ref{fig:graph_penalty_spaceinvaders} show reward graphs for trainings with
+the entropy coefficient $c_2 = -0.01$. The corresponding final scores can be seen in table \ref{table:penalty_score}.
+
+\begin{table}[ht]
+    \centering
+    \begin{tabular}{l|r r r r}
+        Game & Run 1 & Run 2 & Run 3 & Average of reference \\
+        \hline
+        BeamRider & $1580.3$ & $1040.9$ & $701.0$ & $2334.5$ \\
+        Breakout & $283.81$ & $1.8$ & $1.2$ & $246.9$ \\
+        Pong & $20.4$ & $-21.0$ & $-21.0$ & $20.8$ \\
+        Seaquest & $1707.6$ & $957.2$ & $907.0$ & $1476.3$ \\
+        SpaceInvaders & $854.8$ & $789.0$ & $562.3$ & $820.2$
+    \end{tabular}
+    \caption{Final scores achieved in training with the entropy coefficient $c_2 = -0.01$ and the average final score of
+    the reference configuration.}
+    \label{table:penalty_score}
+\end{table}
+
+The results seen with this configuration are the worst among all experiments on BeamRider, Breakout and Pong. However,
+single runs of Breakout and Pong (cf.~figures \ref{fig:graph_penalty_breakout} and \ref{fig:graph_penalty_pong}) still
+achieve performances comparable to the reference configuration, as highlighted by the red\todo{check this color} trendline.
+We further note that Seaquest and SpaceInvaders overall are not as affected by this change. This indicates that games
+pose different challenges and that hyperparameters have different effects depending on the respective game.
+
+In figure \ref{fig:graph_penalty_spaceinvaders}, a steep drop in performance is visible in the first run after
+approximately $5.6 \cdot 10^7$ time steps. This performance collapse could be related to the policy taking too large a
+step. However, as policy updates occur within the trust region defined by $\epsilon$, the performance recovers to a
+degree. If a significantly larger $\epsilon$ were chosen, the performance collapse might be unrecoverable, as PPO
+becomes more similar to simple policy gradient methods.
+
+As some agents still learn despite the misconfiguration, the algorithm appears to be robust to the specific values
+chosen for its hyperparameters. We further discuss this in chapter \ref{sec:05:robustness}.
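+
+To see why $c_2 = -0.01$ constitutes a misconfiguration, recall the combined objective that \citeA{ppo} maximize (the
+notation follows the publication; cf.~chapter \ref{sec:04:full_algorithm} for the configuration used in this thesis):
+\begin{align*}
+    L_t^{\text{CLIP}+\text{VF}+\text{S}}(\boldsymbol\theta) = \hat{\mathbb{E}}_t\left[
+    L_t^{\text{CLIP}}(\boldsymbol\theta) - c_1 L_t^{\text{VF}}(\boldsymbol\theta)
+    + c_2 S[\pi_{\boldsymbol\theta}](s_t) \right]
+\end{align*}
+The entropy $S[\pi_{\boldsymbol\theta}](s_t)$ is weighted by $c_2$. With the published value $c_2 = 0.01$, the term
+rewards entropy and thereby encourages exploration; with $c_2 = -0.01$, it instead penalizes entropy and pushes the
+policy towards deterministic behavior before the agent has explored the state space.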
+
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/BeamRider_entropy_penalty.png}
+    \caption{BeamRider with $c_2 = -0.01$}
+    \label{fig:graph_penalty_beamrider}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Breakout_entropy_penalty.png}
+    \caption{Breakout with $c_2 = -0.01$}
+    \label{fig:graph_penalty_breakout}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Pong_entropy_penalty.png}
+    \caption{Pong with $c_2 = -0.01$}
+    \label{fig:graph_penalty_pong}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Seaquest_entropy_penalty.png}
+    \caption{Seaquest with $c_2 = -0.01$}
+    \label{fig:graph_penalty_seaquest}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/SpaceInvaders_entropy_penalty.png}
+    \caption{SpaceInvaders with $c_2 = -0.01$}
+    \label{fig:graph_penalty_spaceinvaders}
+\end{figure}
+
+\subsubsection{No Optimizations}
+
+The following experiment uses none of the optimizations: no advantage normalization, no reward binning or clipping, no
+value function loss clipping, no gradient clipping and no orthogonal initialization. The hyperparameters are changed
+from the reference configuration to match the hyperparameters given by \citeA{ppo}: $c_1 = 1.0$ and $K = 3$ instead of
+$c_1 = 0.5$ and $K = 4$. The reward graphs can be seen in figures
+\ref{fig:graph_paper_beamrider}--\ref{fig:graph_paper_spaceinvaders}; the final scores are shown in table
+\ref{table:paper_score}.
+
+\begin{table}[ht]
+    \centering
+    \begin{tabular}{l|r r r r}
+        Game & Run 1 & Run 2 & Run 3 & \citeA{ppo} \\
+        \hline
+        BeamRider & $2618.7$ & $756.2$ & $750.6$ & $1590.0$ \\
+        Breakout & $212.4$ & $176.2$ & $126.6$ & $274.8$ \\
+        Pong & $20.2$ & $18.5$ & $18.2$ & $20.7$ \\
+        Seaquest & $890.2$ & $882.6$ & $873.4$ & $1204.5$ \\
+        SpaceInvaders & $559.7$ & $553.0$ & $413.8$ & $942.5$
+    \end{tabular}
+    \caption{Final scores achieved in training without the optimizations, using the hyperparameters of the publication,
+    and the final scores given by the authors of Proximal Policy Optimization.}
+    \label{table:paper_score}
+\end{table}
+
+Most reward graphs are significantly below the reward graphs of the reference configuration, which is echoed in the
+final scores---only the agents trained on Pong remain close to the reported performance, but it takes the agents much
+longer to achieve strong performances. The results strongly deviate from the results reported by \citeA{ppo}. Therefore,
+we can conclude that the optimizations outlined in chapter \ref{sec:04:implementation} have strong effects on the course
+of training as well as on the final performance of trained agents.
+
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/BeamRider_paper.png}
+    \caption{BeamRider trained exactly as described in the Proximal Policy Optimization publication}
+    \label{fig:graph_paper_beamrider}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Breakout_paper.png}
+    \caption{Breakout trained exactly as described in the Proximal Policy Optimization publication}
+    \label{fig:graph_paper_breakout}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Pong_paper.png}
+    \caption{Pong trained exactly as described in the Proximal Policy Optimization publication}
+    \label{fig:graph_paper_pong}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Seaquest_paper.png}
+    \caption{Seaquest trained exactly as described in the Proximal Policy Optimization publication}
+    \label{fig:graph_paper_seaquest}
+\end{figure}
+\begin{figure}[hp]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/SpaceInvaders_paper.png}
+    \caption{SpaceInvaders trained exactly as described in the Proximal Policy Optimization publication}
+    \label{fig:graph_paper_spaceinvaders}
+\end{figure}
diff --git a/thesis/05_evaluation/index.tex b/thesis/05_evaluation/index.tex
index 5c035c32be20773ace2b657bdc3c258f669c6de6..7d8930664de4147704c5211f500e47fd77625006 100644
--- a/thesis/05_evaluation/index.tex
+++ b/thesis/05_evaluation/index.tex
@@ -1,5 +1,6 @@
 \section{Evaluation}
 \label{sec:05:evaluation}
+\input{05_evaluation/introduction}
 
 \subsection{Arcade Learning Environment}
 \label{sec:05:ale}
@@ -14,10 +15,9 @@
 \input{05_evaluation/reproduction}
 
 \subsection{Experiments}
-\todo[inline]{refer to \protect\cite{results}}
-
-\todo[inline]{check URL of \protect\cite{results}}
-TODO probably need multiple evaluation sections and possibly multiple hyper parameter sections
-\todo[inline]{one experiment on CNN from \protect\cite{kostrikov}}
+\label{sec:05:experiments}
+\input{05_evaluation/experiments}
 
 \subsection{Discussion}
+\label{sec:05:discussion}
+\input{05_evaluation/discussion}
diff --git a/thesis/05_evaluation/introduction.tex b/thesis/05_evaluation/introduction.tex
new file mode 100644
index 0000000000000000000000000000000000000000..65daea25719d4269ae4a480d77fc06de50fa35b8
--- /dev/null
+++ b/thesis/05_evaluation/introduction.tex
@@ -0,0 +1,14 @@
+As discussed in chapter \ref{sec:01:motivation}, modern video games such as StarCraft II or Dota 2 pose a great
+challenge for reinforcement learning. Solving them requires significant resources and a combination of reinforcement
+learning algorithms with other methods.\footnote{One technique is imitation learning, which allows an agent to learn
+from human examples.} As a consequence, they are infeasible as a means of comparing different algorithms.
+
+However, benchmarking reinforcement learning algorithms is a necessity when developing improved algorithms. Key metrics
+are not only the performance of an algorithm after training, but also the sample efficiency, which describes how quickly
+an agent learns. Even the required computation time can be an important criterion, as algorithms with a low computation
+time can be tested more often than highly demanding algorithms.
+
+In this chapter, the \emph{Arcade Learning Environment} (ALE) is introduced, which is the benchmarking framework used to
+evaluate Proximal Policy Optimization.
+Afterwards, we discuss the evaluation method and issues one encounters when attempting to reproduce results. We close
+by highlighting three experiments and discussing the results.
+% Review the we here
diff --git a/thesis/05_evaluation/method.tex b/thesis/05_evaluation/method.tex
index ca9464051c32f276954fc26da3322e46d331db74..1e8170e917f86824ac098c0cf0f89408d3f99a5c 100644
--- a/thesis/05_evaluation/method.tex
+++ b/thesis/05_evaluation/method.tex
@@ -1,15 +1,56 @@
-Proximal Policy Optimization is evaluated by running the algorithm thrice with the same configuration on different
-seeds. As many sources of stochasticity are involved, this might give a plausible understanding of the algorithm's
-performance with the configuration.\todo{word this properly} Two performance indicators are compared by \cite{ppo}:
-\begin{itemize}
-    \item The average score of the last 100 episodes completed in training indicates the final performance of a trained
-      agent.
-    \item An average reward graph can be used to assess how fast an agent learns. It is created by plotting the average
-      score of all game episodes that terminate at a given time step.
-\end{itemize}
+Benchmarking reinforcement learning algorithms on ATARI 2600 video games is commonly done by training agents on
+$10,000,000$ time steps of data. Since Proximal Policy Optimization executes $N$ actors simultaneously, the number of
+iterations must be adjusted to account for $N$ as well as the rollout horizon $T$. The number of iterations $I$ then is:
+\begin{align*}
+    I = \left\lfloor \frac{10,000,000}{T \cdot N} \right\rfloor
+\end{align*}
+With the common choices of $N = 8$ and $T = 128$, we train for $I = 9765$ iterations.
 
-\todo[inline]{we do not measure run time of the algorithm}
+In order to evaluate the performance of the Proximal Policy Optimization algorithm implemented for this thesis, two of
+the three metrics mentioned in the introduction to this chapter are used. The run time is not subject to evaluation for
+several reasons. Firstly, the goal of this thesis is the reproduction and evaluation of the results published by
+\citeA{ppo}. The authors provide no measure of the computation time of their algorithm other than stating that it runs
+faster than others. Furthermore, a comparison of stated computation times is impractical, as it is reasonable to assume
+that hardware configurations differ greatly. Nevertheless, the computation time of the code written for this thesis
+roughly matches that of the code written by \citeA{kostrikov}.
 
-\todo[inline]{trendline of most widespread setup for comparison}
+The two points of interest in evaluation are how quickly an agent learns and the performance of an agent at the end of
+its training. We evaluate how quickly an agent learns by drawing an episode reward graph. Whenever an episode of a game
+terminates, the high-score is plotted. As we run $N$ environments simultaneously, several episodes could terminate at
+the same time. If this occurs, we plot the average of all terminated episodes. Figure \ref{fig:reward_graph} displays an
+unsmoothed reward graph for the game Pong. The x-axis displays the training time step and the y-axis shows the
+high-score. As a result, we get a graph that shows the performance of an agent over the course of its training.
 
-\todo[inline]{smoothing of graphs}
+\begin{figure}[ht]
+    \centering
+    \includegraphics[width=\textwidth]{05_evaluation/Pong.png}
+    \caption{TODO---this actually isnt even a non-smoothed graph atm. just imagine it was noisier. to be fixed with
+    colors and font size later}
+    \label{fig:reward_graph}
+\end{figure}
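+
+To make the construction of these data points explicit, the following sketch shows how the values plotted in an
+unsmoothed reward graph can be computed. The function name and the data layout, a list of pairs of termination time
+step and episode return, are chosen for illustration only and do not correspond to identifiers in the actual
+implementation.
+\begin{verbatim}
+from collections import defaultdict
+
+def reward_graph_points(episode_results):
+    # episode_results: list of (time_step, episode_return) pairs
+    # collected whenever an episode terminates during training.
+    by_step = defaultdict(list)
+    for time_step, episode_return in episode_results:
+        by_step[time_step].append(episode_return)
+    # Episodes terminating at the same time step are averaged into
+    # a single data point.
+    return [(t, sum(r) / len(r)) for t, r in sorted(by_step.items())]
+\end{verbatim}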
+
+Since agents encounter novel states regularly in the beginning of the training, the performance of episodes can vary
+greatly. This results in a noisy episode reward graph. Depending on the complexity of the game, the graph may remain
+noisy even until the end of the training, e.g., when learning Breakout. We alleviate this issue by smoothing the reward
+graph---outliers are featured less prominently and the noise is reduced, whilst the overall trend of the training is
+preserved.
+
+The reward graphs are smoothed by computing the average of a sliding window with $16$ data points. This window is then
+centered, so each data point is the average of the $8$ previous episodes and the $7$ following ones. We note that this
+operation does not remove all noise, which allows us to assess whether certain choices have a stabilizing effect.
+Furthermore, retaining some noise is a deliberate choice, as noisy graphs are to be expected depending on the
+environment.
+
+As the performance in training depends on chance, \citeA{ppo} train three agents on each game for any given
+configuration. To enable easier comparison of configurations, we add a simple trendline. We compute this trendline by
+combining the data of all three runs. Then, we compute the average of a sliding window with $256$ data points. As with
+the individual runs, this window is centered such that each data point is the average of the $128$ preceding data points
+and the $127$ following ones.
+
+To compare the performance of a given configuration with that of a reference configuration, we add a trendline generated
+with the reference configuration.
+
+We note that more advanced regression methods are available. Their suitability for the task at hand is left as a topic
+for future research. Although simple, the chosen method is intuitive and suffices for the comparisons carried out in the
+following.
+
+The final score is determined by computing the average score of the final 100 episodes that terminate during training
+\cite{ppo}. The final scores given by \citeA{ppo} are the averaged scores of three training runs.
diff --git a/thesis/05_evaluation/reproduction.tex b/thesis/05_evaluation/reproduction.tex
index ce51bd2b4a4445a525c9120c0d4af99c618870c4..2a36fe276a14fcdab50975c577397cfdfa9db2d7 100644
--- a/thesis/05_evaluation/reproduction.tex
+++ b/thesis/05_evaluation/reproduction.tex
@@ -1,9 +1,35 @@
-issues given because of tooling
-\todo[inline]{sources of randomness -- deep learning framework seed (random actions) environment seed (potentially
-random states) numpy resets}
-\todo[inline]{non-determinism even with same seeds -- refer to pytorch}
-
-issues due to methodology
-\todo[inline]{only graphs to compare learning speed}
-\todo[inline]{graphs by ppo authors have random smoothing}
-\todo[inline]{in case of PPO exact configuration is unknown. which loss was used in graphs, which neural net architecture?}
+Reproducing results can be challenging for multiple reasons. Since the learning speed is assessed with reward graphs,
+it is hard to compare results from various sources, as one needs to compare multiple graphs across different figures.
+This issue is exacerbated by the fact that the reward graphs plotted by \citeA{ppo} are smoothed, but the smoothing
+method is not disclosed in the paper.
+
+Another problem stems from the fact that the exact configuration of the algorithm used to achieve the published
+performance is unknown. It is unclear if optimizations such as advantage normalization (cf.~chapter
+\ref{sec:04:adv_normalization}) and value function loss clipping (cf.~chapter \ref{sec:04:value_clipping}) were used in
+the publication. These optimizations are not part of the initial PPO code \cite[baselines/ppo1]{baselines} and were
+not mentioned in an update on the repository either \cite{ppo_blog}.
+
+Debugging can be troublesome due to the frameworks used. Even when multiple implementations are available, ensuring
+identical results is anything but simple. The Arcade Learning Environment and the deep learning framework may rely on
+randomness, for example when sampling actions. Thus, one must not only use identical seeds but also interface with the
+random number generator in the same way, so that its state remains the same across all algorithms used for comparison.
+If the different algorithms utilize different deep learning frameworks, this issue may be hard to overcome.
+
+But even when the same framework is used---such as PyTorch, which is used for this thesis---ensuring identical seeds and
+random number generator states might not be sufficient: some operations are inherently non-deterministic, so these
+operations must not be used \cite{pytorch}.
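+
+To make these sources of randomness concrete, the following sketch shows how the random number generators involved can
+be seeded in Python, NumPy and PyTorch. It is a minimal illustration rather than the exact setup used by the referenced
+implementations; the environments are additionally seeded through their own interface.
+\begin{verbatim}
+import random
+import numpy as np
+import torch
+
+def seed_everything(seed):
+    # Seed the Python, NumPy and PyTorch random number generators.
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    # Raise an error if an operation without a deterministic
+    # implementation is encountered (see the PyTorch documentation).
+    torch.use_deterministic_algorithms(True)
+\end{verbatim}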
+
+Because PPO as executed in the following experiments computes its losses over batches of $128 \cdot 8$ time steps,
+manually calculating the loss is infeasible. Likewise, manually working through a backpropagation pass and calculating
+the weight changes for a neural network with approximately $80000$ parameters is unrealistic.
+
+In order to ensure that the algorithm implemented for this thesis works as intended, several tests were conducted.
+Approximately 250 unit and integration tests cover the algorithm. Most of these tests validate the loss calculation,
+with many small-scale tests being the results of manual calculations. A few larger tests spanning losses calculated from
+up to $128$ time steps of data were generated from data obtained from the project of \citeA{kostrikov}. Furthermore,
+several tests were run by replacing components of the code of both \citeA{kostrikov} and \citeA{jayasiri} with
+components written for this thesis.
+
+As the algorithm is tested extensively and the results of the experiment shown in chapter \ref{sec:05:first_exp} match
+those published by \citeA{ppo}, one can trust that the code written for this thesis allows for a reliable evaluation.