
Commit 47356079 authored by Daniel Lukats

more feedback

parent a357ea6f
@@ -5,11 +5,13 @@ Optimization, as an agent can learn quicker than an agent that trains on a traje
More precisely, each training iteration consists of $K$ epochs. In each epoch, the entire data set is split into
randomly drawn disjoint minibatches of size $M$.\footnote{For a detailed explanation of minibatch gradient methods,
refer to the work of \citeA[chapter 8.1.3]{goodfellow2016}.} This approach also alleviates the issue of learning on
highly correlated observation as observed by \citeA{dqn}. We further ensure that we learn on sufficiently independent
observations by running $N$ \emph{actors} at the same time. Each actor is embedded in an environment, with the only
difference between the actors' environments being the random seeds used to initialize them \cite{a3c}.
\todo[inline]{short note on actors being a manifestation of the agent in that environment}
refer to the work of \citeA[chapter 8.1.3]{goodfellow2016}.} This approach alleviates the issue of learning on highly
correlated observations noted by \citeA{dqn}. We further ensure that we learn on sufficiently independent observations
by running $N$ \emph{actors} at the same time.\footnote{\emph{Actor} is the technical term used to describe agents that
are run in parallel. Often, these agents follow the same policy $\pi_{\boldsymbol\theta}$, as is the case with Proximal
Policy Optimization.} Although each actor is embedded in its own environment, all $N$ environments are described by the
same Markov decision process \cite{a3c}. However, the environments are independent of one another and therefore give
rise to different trajectories.
After each epoch, the learning rate $\alpha$ and the clipping parameter $\epsilon$ are decreased such that they scale
linearly from their initial values to $0$ over the course of the training.
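To make this scheme concrete, the following sketch outlines one possible structure of the training loop in plain Python
with NumPy. It is an illustration only, not the implementation evaluated in this work; all concrete values and the
names \texttt{horizon}, \texttt{total\_iterations}, and \texttt{update} are assumptions made for the example.
\begin{verbatim}
# Illustrative sketch of the training-loop mechanics described above; the
# concrete values and helper names are assumptions, not taken from the thesis.
import numpy as np

K = 4                    # epochs per training iteration
M = 64                   # minibatch size
N = 8                    # parallel actors
horizon = 128            # timesteps gathered by each actor per iteration
alpha_0, epsilon_0 = 2.5e-4, 0.1
total_iterations = 1000
total_epochs = K * total_iterations

def update(minibatch, alpha, epsilon):
    """Placeholder for one stochastic gradient step on a minibatch."""
    pass

for iteration in range(total_iterations):
    # indices into the N * horizon transitions gathered by the N actors
    indices = np.arange(N * horizon)
    for epoch in range(K):
        # linear schedules: alpha and epsilon shrink after each epoch and
        # reach 0 at the end of training
        progress = (iteration * K + epoch) / total_epochs
        alpha = alpha_0 * (1.0 - progress)
        epsilon = epsilon_0 * (1.0 - progress)

        np.random.shuffle(indices)                  # randomly drawn ...
        for start in range(0, len(indices), M):
            minibatch = indices[start:start + M]    # ... disjoint minibatches
            update(minibatch, alpha, epsilon)
\end{verbatim}
Shuffling the index array once per epoch keeps the minibatches within an epoch disjoint, while the progress fraction
realizes the linear decay of $\alpha$ and $\epsilon$ to $0$.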
......
@@ -8,7 +8,7 @@ environment and the policy $\pi$ introduce non-deterministic behavior.
As the variance of the trajectories grows, the quality of the gradient estimator declines. Performing stochastic
gradient ascent with this estimator is no longer guaranteed to yield an improved policy---in fact the performance of the
policy might deteriorate \cite{seita2017}. Even a single bad step might cause a collapse of policy performance
\cite{spinningup}. More importantly\todo{not quite the right phrase here}, the performance collapse might be irrecoverable. As the
\cite{spinningup}. Worse still, the performance collapse might be irrecoverable. As the
policy was likely trained for several steps already, it has begun transitioning from exploration to exploitation. If an
agent now strongly favors bad actions, it cannot progress towards the goal. In turn, we cannot sample meaningful data to
train the policy further.
......
\citeA{ilyas2018} found that the weights of the neural network used by \citeA{baselines} are initialized using
\citeA{ilyas2018} found that the weights of the neural network used by \citeA[baselines/ppo2]{baselines} are initialized using
orthogonal initialization.\footnote{Using orthogonal initialization with large neural networks was proposed by
\citeA{saxe2014}. The authors also provide mathematical foundations and examine the benefits of various initialization
schemes.} \todo{improve this as the years are confusing and maybe explain the found} The impact of this choice appears
to be subject of empirical examinations only \cite{ilyas2018, engstrom2019}. Table \ref{table:orthogonal_scaling} lists
each layer of the neural network and the corresponding scaling factor that is used to initialize the layer.
schemes.} The impact of this choice appears to have been examined only empirically \cite{ilyas2018, engstrom2019}.
Table \ref{table:orthogonal_scaling} lists each layer of the neural network and the corresponding scaling factor used
to initialize the layer.
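To illustrate how such an initialization could be realized, the following PyTorch sketch applies orthogonal
initialization with a per-layer scaling factor (the \texttt{gain} argument). The concrete factors shown ($\sqrt{2}$ for
a hidden layer, $0.01$ for the policy head, $1$ for the value head) and the zero bias initialization are assumptions of
the sketch; the factors actually examined are those in Table \ref{table:orthogonal_scaling}, and baselines/ppo2 itself
implements the idea in TensorFlow rather than PyTorch.
\begin{verbatim}
# Illustrative sketch: orthogonal initialization with a per-layer scaling
# factor. The factors and the zero biases are assumptions; the factors used
# in this work are listed in the table referenced above.
import math
import torch.nn as nn

def orthogonal_layer(layer, scale):
    """Initialize a layer's weights orthogonally, scaled by the given factor."""
    nn.init.orthogonal_(layer.weight, gain=scale)
    nn.init.constant_(layer.bias, 0.0)
    return layer

hidden      = orthogonal_layer(nn.Linear(64, 64), scale=math.sqrt(2))
policy_head = orthogonal_layer(nn.Linear(64, 6), scale=0.01)   # e.g. 6 discrete actions
value_head  = orthogonal_layer(nn.Linear(64, 1), scale=1.0)
\end{verbatim}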
\begin{table}[h]
\centering
......
@@ -73,9 +73,8 @@ games, we can easily perform more training runs on each game without requiring m
Even if no subset can be selected, performing more runs on each configuration might be a worthwhile effort. By
increasing the number of training runs, we achieve more robust results. Thus, it becomes easier to discern if a run's
performance is representative of the configuration under test. Moreover, we can compute stronger
trendlines/average\todo{fix this thingy} that could allow for easier comparisons of configurations of the same algorithm
or of entirely different algorithms.
performance is representative of the configuration under test. Moreover, we can compute more reliable trendlines that
could allow for easier comparisons of configurations of the same algorithm or of entirely different algorithms.
\subsubsection{Robustness to Hyperparameter Choices}
\label{sec:05:robustness}
......
@@ -93,7 +93,7 @@ hyperparameter values and optimization choices.
\label{fig:graph_spaceinvaders}
\end{figure}
\subsubsection{Stability With Suboptimal Hyperparameter}
\subsubsection{Stability with a Suboptimal Hyperparameter}
\label{sec:05:stability}
In figures \ref{fig:graph_penalty_beamrider}--\ref{fig:graph_penalty_spaceinvaders}, reward graphs for a training with
......