Commit d8f8b71f authored by Daniel Lukats

updated chapters 5, 6 and A to account for latest experiment results

parent 53030137
@@ -14,16 +14,20 @@ optimizations enabled.
\paragraph{Advantage normalization.}
The effect of disabling advantage normalization can be seen in experiment \emph{no\_advantage\_normalization}. Without
advantage normalization, the performance improves on BeamRider as well as Seaquest. However, agents trained on Breakout,
Pong and SpaceInvaders perform worse. If advantage normalization is the only optimization used (cf.~experiment
\emph{only\_advantage\_normalization}), agents perform worse than agents trained without any optimization on all games
but SpaceInvaders. Thus, advantage normalization appears to have at best a minor impact on performance.
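As an illustration, the following sketch shows advantage normalization as it is typically implemented, namely by
shifting the advantages of a batch to zero mean and scaling them to unit standard deviation; the function and variable
names are illustrative only and not taken from any particular implementation.
\begin{verbatim}
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    # Shift to zero mean and scale to unit standard deviation.
    # The small epsilon guards against division by zero when all
    # advantages in a batch are (nearly) identical.
    return (advantages - advantages.mean()) / (advantages.std() + eps)
\end{verbatim}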
\paragraph{Gradient clipping.}
Gradient clipping is disabled in experiment \emph{no\_gradient\_clipping}. We note that the performance on Breakout is
particularly improved, with a final score of roughly $400$ achieved by all three agents without gradient clipping.
However, some reward graphs on other games appear less stable, notably one training run each on BeamRider, Pong and
Seaquest. If only gradient clipping is enabled (as seen in experiment \emph{only\_gradient\_clipping}), the
performance on Breakout is even worse than with no optimization enabled at all. In contrast, the performance of agents
trained on BeamRider and SpaceInvaders improves.
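For reference, a minimal PyTorch-style sketch of global gradient norm clipping as it is commonly applied between the
backward pass and the optimizer step; the helper function and the threshold of $0.5$ are illustrative assumptions, not
values taken from the experiments.
\begin{verbatim}
import torch

def clipped_update(model, optimizer, loss, max_norm=0.5):
    # Backpropagate, then rescale all gradients so that their global
    # L2 norm does not exceed max_norm before applying the update.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
\end{verbatim}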
As gradient clipping restricts the magnitude of change of the neural network's weights, it is another device that keeps
the new policy $\pi_{\boldsymbol\theta}$ close to the old policy $\pi_{\boldsymbol\theta_\text{old}}$. If the new policy
@@ -36,7 +40,13 @@ Experiment \emph{no\_orthogonal\_initialization} contains reward graphs of train
optimization. Although orthogonal initialization has no notable impact on Breakout and SpaceInvaders, it is an important
optimization for BeamRider, Pong and Seaquest. Performances on the latter games are worse when the neural network is not
initialized with an orthogonal initialization scheme. On all three games, the final score is reduced and learning
proceeds more slowly. Enabling only orthogonal initialization leads to improved results over no optimizations on Breakout, Pong,
Seaquest and SpaceInvaders (cf.~experiment \emph{only\_orthogonal\_initialization}). Only the performance on BeamRider
is diminished.
Most notably, agents learn Pong much faster when the neural network is initialized with an orthogonal initialization
scheme. The results also indicate that agents cannot achieve meaningful results on Seaquest without orthogonal
initialization.
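As an illustration, a short sketch of orthogonal initialization in the style of common PPO implementations; the
layer-dependent gains ($\sqrt{2}$ for hidden layers, $0.01$ for the policy head and $1.0$ for the value head) follow
widely used conventions and are stated here as assumptions rather than as the exact values used in the experiments.
\begin{verbatim}
import math
import torch.nn as nn

def init_layer(layer, gain=math.sqrt(2)):
    # Orthogonal weight initialization with a layer-dependent gain;
    # biases are initialized to zero.
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

# Hidden layers typically use a gain of sqrt(2), the policy head a
# small gain such as 0.01 and the value head a gain of 1.0.
\end{verbatim}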
\paragraph{Reward binning.}
@@ -49,26 +59,39 @@ the rewards. Experiment \emph{reward\_clipping} shows a distinct drop in perform
This is echoed in the reward graphs and final scores with all agents performing much worse once rewards are no longer
subject to reward binning. As a consequence, rewards should be binned and not clipped.
As expected, agents trained with only reward binning achieve notably better results than agents trained without
optimizations (cf.~experiment \emph{only\_reward\_binning}). The unexpected exception to this observation is Pong, which
sees diminished results for no apparent reason.
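To make the distinction explicit, the following sketch contrasts the two reward transformations; the function names are
illustrative. Reward binning maps every reward to its sign, whereas reward clipping merely restricts rewards to the
interval $[-1, 1]$, so small rewards keep their magnitude. On Pong, whose rewards already lie in $\{-1, 0, +1\}$, both
transformations leave the reward signal unchanged.
\begin{verbatim}
import numpy as np

def bin_reward(reward):
    # Reward binning: map every reward to -1, 0 or +1.
    return np.sign(reward)

def clip_reward(reward):
    # Reward clipping: restrict rewards to [-1, 1] while preserving
    # the magnitude of small rewards.
    return np.clip(reward, -1.0, 1.0)
\end{verbatim}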
\paragraph{Value function loss clipping.}
Agents trained without value function loss clipping achieve a notably improved performance on Breakout but perform
slightly worse on Pong, as shown in experiment \emph{no\_value\_function\_loss\_clipping}. With only value function
loss clipping enabled, the performance of agents trained on BeamRider is worse than that of agents trained without
optimizations and of those trained with the reference configuration (cf.~experiment
\emph{only\_value\_function\_loss\_clipping}). On the other hand, agents trained on Breakout and SpaceInvaders perform
better than agents trained without optimizations.
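For clarity, a minimal PyTorch-style sketch of a clipped value function loss as it appears in common PPO
implementations: the change of the new value prediction relative to the old value estimate is clipped analogously to
the clipped policy objective, and the elementwise maximum of the clipped and unclipped squared errors is taken. The
clipping range of $0.1$ is an illustrative assumption.
\begin{verbatim}
import torch

def clipped_value_loss(values, old_values, returns, clip_range=0.1):
    # Clip the change of the new value prediction relative to the old
    # value estimate, then take the elementwise maximum of the clipped
    # and unclipped squared errors.
    values_clipped = old_values + torch.clamp(values - old_values,
                                              -clip_range, clip_range)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
\end{verbatim}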
\paragraph{Conclusion.}
The conducted experiments show that the optimizations overall have a notable positive effect on the performance of
agents. However, no general statements on the impact of an optimization can be made, as the consequences of an
optimization depend on the respective game an agent is trained on. For example, orthogonal initialization is shown to be
crucial for agents learning Pong or Seaquest whilst the positive impact on Breakout and SpaceInvaders is notable but
less pronounced.
Moreover, optimizations likely interact with each other, which further complicates general statements based on the
experiments. This can be seen with value function loss clipping on Breakout: both the experiment with all optimizations
except value function loss clipping and the experiment with only value function loss clipping show improved performance
over the reference configuration and the no optimizations experiment, respectively.
Most notably, agents trained on BeamRider benefit from the optimizations when trained with the paper configuration of $K
= 3$ and $c_1 = 1.0$, whereas they achieve subpar results with the reference configuration. This indicates that
hyperparameter choices revolving around the value function may have a larger impact than the optimizations.
Overall, all optimizations have a positive effect on the algorithm, with advantage normalization having the least impact.
Hence, we must criticize that they were not disclosed by \citeA{ppo}. Even though they are not directly related to
reinforcement learning, they should be disclosed to ensure reproducibility and a transparent discussion on the
performance of Proximal Policy Optimization algorithms.
\subsubsection{Outliers}
@@ -19,14 +19,14 @@ Policy Optimization intends to keep the new policy close to the old policy by cl
However, multiple non-disclosed optimizations can be found in two implementations of the algorithm
\cite{baselines} that are provided by the authors of the original publication. We examined these optimizations and
popular hyperparameter choices on five ATARI 2600 games: BeamRider, Breakout, Pong, Seaquest and SpaceInvaders. The
results of the experiments show that most of the optimizations are crucial to achieve good results on these ATARI games,
although their specific impact depends on the game an agent is trained on. This finding is shared by \citeA{ilyas2018}
and \citeA{engstrom2019}, who examined Proximal Policy Optimization on robotics environments. Furthermore, we found that
noticeable outliers are present in approximately $35\%$ of the experiments.
As a consequence, we may call for two improvements: Firstly, all optimization choices pertaining to the algorithm---even
those not purely related to reinforcement learning---must be published. In order to ensure proper reproducibility and
comparability, methods used for the portrayal of results, such as the smoothing of graphs, must be described as well.
Secondly, instead of testing each configuration thrice on every game available, performing more runs on an expert-picked
set of diverse games grants more reliable results. With more training runs, it becomes easier to discern whether a
striking performance is to be expected or is simply a result of chance.
@@ -54,15 +54,20 @@ experiments, the reference trendline is replaced with a trendline of the \emph{n
result is unexpected and merits further investigation.} \\
no advantage\newline normalization & improved on BeamRider and Seaquest; slightly worse on Breakout and Pong;
slightly improved on SpaceInvaders \\
only advantage\newline normalization & worse than no optimizations on all games but SpaceInvaders \\
no gradient\newline clipping & worse on BeamRider; notably improved on Breakout and Seaquest \\
only gradient\newline clipping & improved on all games but BeamRider compared to no optimizations; slightly worse
than no optimizations on BeamRider \\
no orthogonal\newline initialization & worse on BeamRider, Pong and Seaquest \\
only orthogonal\newline initialization & better than no optimizations on all games but BeamRider; worse than no
optimizations on BeamRider \\
no reward binning & notably worse on all games but Pong \\
only reward binning & notably better than no optimizations on all games but Pong; worse than no optimizations on
Pong\footnote{This result is unexpected because reward binning should have no effect on the reward signal of Pong.
Further research is required.} \\
no value function loss clipping & notably improved on Breakout; slightly worse on Pong \\
only value function loss clipping & worse than no optimizations on BeamRider; better than no optimizations on
Breakout and SpaceInvaders \\
\hline
reward clipping & improved on BeamRider; worse on Breakout, Seaquest and SpaceInvaders \\
no input scaling & no learning on BeamRider; better on Seaquest; worse on SpaceInvaders \\