
Commit dc2664d6 authored by Daniel Lukats

minor adjustments to ppo paper

parent 4685f9d4
@@ -4,41 +4,44 @@ optimization passes over the same trajectory. This approach increases the sample
Optimization, as an agent can learn more quickly than an agent that trains on a trajectory only once.
More precisely, each training iteration consists of $K$ epochs. In each epoch, the entire data set is split into
randomly drawn disjoint minibatches of size $M$.\footnote{For a detailed explanation of minibatch gradient methods,
refer to the work of \citeA[chapter 8.1.3]{goodfellow2016}.} This approach also alleviates the issue of learning on
highly correlated observations, as noted by \citeA{dqn}. We further ensure that we learn on sufficiently independent
observations by running $N$ \emph{actors} at the same time. Each actor is embedded in an environment, with the only
difference between the actors' environments being the random seeds used to initialize them \cite{a3c}.
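The following Python sketch illustrates this minibatch scheme; the sample count and minibatch size are arbitrary
placeholders and not values used in this work.
\begin{verbatim}
import numpy as np

def minibatch_indices(num_samples, minibatch_size, rng):
    """Split num_samples indices into randomly drawn, disjoint minibatches."""
    indices = rng.permutation(num_samples)
    return np.split(indices, num_samples // minibatch_size)

# Illustrative sizes: N * T = 2048 recorded samples, minibatch size M = 64
rng = np.random.default_rng(0)
for epoch in range(4):  # K = 4 epochs
    for batch in minibatch_indices(2048, 64, rng):
        pass  # perform one gradient step on the samples indexed by `batch`
\end{verbatim}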
\todo[inline]{short note on actors being a manifestation of the agent in that environment}
After each epoch, the learning rate $\alpha$ and the clipping parameter $\epsilon$ are decreased such that they scale
linearly from their initial values to 0 over the course of the training.
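Such a linear schedule can be written as a small helper function; the initial values below are purely illustrative
assumptions and not values used in this work.
\begin{verbatim}
def linear_anneal(initial_value, step, total_steps):
    """Scale a hyperparameter linearly from initial_value down to 0."""
    return initial_value * (1.0 - step / total_steps)

# Illustrative values only, e.g. alpha_0 = 2.5e-4 and epsilon_0 = 0.1
alpha = linear_anneal(2.5e-4, step=10, total_steps=100)
epsilon = linear_anneal(0.1, step=10, total_steps=100)
\end{verbatim}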
PPO, like many other reinforcement learning algorithms, operates in two phases. In the first phase, an agent interacts with its
environment and generates a rollout $\tau$. $\tau$ contains not only the states and rewards the agent observed, but also
the chosen actions, their respective probabilities, and the values of the states. This phase is shown in lines 5--17
of algorithm \ref{alg:ppo}.
In the second phase, the value function approximation and the policy are optimized with the data the agent collected.
These two steps are repeated for a fixed number of iterations or until the value function and policy converge to their
respective optimal functions. Lines 18--25 of algorithm \ref{alg:ppo} show this phase.
\begin{algorithm}[ht]
\caption{Proximal Policy Optimization, modified from \protect\citeauthor{ppo}'s \protect\citeyear{ppo} and
\protect\citeauthor{peng2018}'s \protect\citeyear{peng2018} works}
\label{alg:ppo}
\begin{algorithmic}[1]
\Require number of iterations $I$, rollout horizon $T$, number of actors $N$, number of epochs $K$, minibatch size $M$, discount $\gamma$, GAE weight
$\lambda$, learning rate $\alpha$, clipping parameter $\epsilon$, coefficients $c_1, c_2$
\State $\boldsymbol\theta \gets $ random weights
\State Initialize environments $E$
\State Number of minibatches $B \gets N \cdot T\; / \;M$
\For{iteration=$1, 2, \dots, I$}
\State $\tau \gets $ empty rollout
\For{actor=$1, 2, \dots, N$} \Comment{record data}
\State $s \gets $ current state of environment $E_\text{actor}$
\State Append $s$ to $\tau$
\For{step=$1, 2, \dots, T$}
\State $a \sim \pi_{\boldsymbol\theta}(a\mid s)$
\State $\pi_{\boldsymbol\theta_\text{old}}(a\mid s) \gets \pi_{\boldsymbol\theta}(a\mid s)$
\State Execute action $a$ in environment $E_\text{actor}$
\State $s \gets $ successor state
\State $r \gets $ reward
\State Append $a, s, r, \pi_{\boldsymbol\theta_\text{old}}(a\mid s),
@@ -47,7 +50,7 @@ shows these phases and all steps required to fulfill them.
\EndFor
\State Compute Generalized Advantage Estimations $\delta_t^{\text{GAE}(\gamma, \lambda)}$
\State Compute $\lambda$-returns $G_t^\lambda$
\For{epoch=$1, 2, \dots, K$} \Comment{optimize}
\For{minibatch=$1, 2, \dots, B$}
\State $\boldsymbol\theta \gets \boldsymbol\theta + \alpha \cdot
\nabla_{\boldsymbol\theta}\;\mathcal{L}^{\text{CLIP}+\text{VF}+S}(\boldsymbol\theta)$ on
......
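The Generalized Advantage Estimations and $\lambda$-returns in algorithm \ref{alg:ppo} can be computed with a single
backwards pass over a recorded trajectory. The following Python sketch shows one common recursive formulation; it
assumes a single trajectory without episode boundaries, and the argument layout is a simplifying assumption rather
than the implementation used in this work.
\begin{verbatim}
import numpy as np

def gae_and_lambda_returns(rewards, values, last_value, gamma, lam):
    """Compute GAE advantages and lambda-returns for one trajectory.

    rewards: array of shape (T,), rewards observed during the rollout
    values: array of shape (T,), value estimates v(s_t, omega) of the states
    last_value: bootstrap value estimate for the state after the final step
    """
    T = len(rewards)
    values = np.append(values, last_value)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae  # GAE recursion
        advantages[t] = gae
    lambda_returns = advantages + values[:-1]  # G_t^lambda = GAE_t + v(s_t)
    return advantages, lambda_returns
\end{verbatim}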
\todo[inline]{short intro text}
Both the loss described in equation \ref{eqn:naive_loss} and the loss proposed by \citeA{ppo} depend on advantage
estimations. However, to estimate advantages, value estimations are required. By definition (cf.~equation
\ref{eqn:value_function}), the value function can be estimated using the return.
Both the advantage estimator and the return are suboptimal: the former is biased, whereas the latter suffers from high
variance. In this chapter, we explain these issues in more detail and introduce advanced methods that alleviate them.
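For reference, the two estimators can be written as follows (the notation mirrors chapter \ref{sec:02:advantage}; the
exact indexing conventions may differ slightly from the equations referenced above):
\begin{align*}
\delta_t &\doteq r_{t+1} + \gamma\, v(s_{t+1}, \boldsymbol\omega) - v(s_t, \boldsymbol\omega), &
G_t &\doteq \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}.
\end{align*}
The estimator $\delta_t$ is biased whenever the approximation $v(\cdot, \boldsymbol\omega)$ is inaccurate, whereas the
return $G_t$ is unbiased but accumulates the variance of every reward along the trajectory.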
\subsubsection{Generalized Advantage Estimation}
\label{sec:03:gae_gae}
In chapter \ref{sec:02:advantage} we introduced the advantage function $a_\pi(s, a)$ and an estimator $\delta$ that we
combined with function approximation in chapter \ref{sec:02:function_approximation} (TODO replace with equation references):
......
@@ -6,7 +6,7 @@
\label{sec:03:ppo_motivation}
\input{03_ppo/motivation}
\subsection{Modern Advantage and Return Estimation}
\label{sec:03:gae}
\input{03_ppo/gae}
......
\todo[inline]{motivation for deep RL and deep policy gradients}
\todo[inline]{short note on using Loss $L$ instead of gradient estimator with explanation of loss and reference to
\protect\citeA{goodfellow2016}}
\todo[inline]{short outline of content of the chapter}
As deep learning achieved success on a variety of tasks such as computer vision and speech recognition\todo{a goodfellow
citation goes here}, \citeA{dqn} adapted reinforcement learning techniques to deep learning in an algorithm called DQN.
The benefits of using neural networks to parameterize the value function and---occasionally---the policy were
demonstrated by DQN and subsequent \emph{deep reinforcement learning} algorithms such as A3C \cite{a3c} and Rainbow DQN
\cite{rainbow}.
The algorithm introduced in this chapter is called \emph{Proximal Policy Optimization} (PPO) \cite{ppo}. Because PPO is a deep
reinforcement learning algorithm, we adjust notation and replace our gradient estimator $\hat{g}$. Instead, we use a
loss $\mathcal{L}$, as is common in deep learning \todo{citation on loss here}:
\begin{align}
\label{eqn:naive_loss}
\mathcal{L}(\boldsymbol\theta) \doteq
\mathbb{E}_{a,s\sim\pi_{\boldsymbol\theta}}\left[\hat{a}_{\pi_{\boldsymbol\theta}}(s, a, \boldsymbol\omega)\pi_{\boldsymbol\theta}(a\mid
s)\right]
\end{align}
The gradient estimator $\hat{g}$ can be obtained by differentiating the loss in equation \ref{eqn:naive_loss} with
respect to $\boldsymbol\theta$.
We begin by highlighting issues with the gradient estimator from equation \ref{eqn:gradient_estimator} and motivate the
use of advanced algorithms such as Proximal Policy Optimization. Then we introduce advanced advantage and return
estimators as used in many modern policy gradient methods. Afterwards, we introduce and explain the loss optimized by
PPO. We close by
providing the complete Proximal Policy Optimization algorithm.
\todo[inline]{short introductory sentence}
\subsubsection{Background}
@@ -136,6 +136,8 @@ This issue is solved by taking an elementwise minimum:
\right)
\right].
\end{align}
In practice, the advantage function $\hat{a}_{\pi_{\boldsymbol\theta_\text{old}}}$ is replaced with Generalized
Advantage Estimations $\delta_t^{\text{GAE}(\gamma,\lambda)}$ (cf.~equation \ref{eqn:gae}).
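A minimal Python sketch of this clipped objective, assuming advantages computed via GAE and using $\epsilon = 0.2$
purely as an illustrative default:
\begin{verbatim}
import numpy as np

def clipped_surrogate(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Elementwise minimum of the unclipped and the clipped policy objective.

    All arguments are numpy arrays of shape (batch_size,).
    """
    ratio = np.exp(log_prob_new - log_prob_old)  # rho_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.minimum(unclipped, clipped).mean()  # estimate of L^CLIP
\end{verbatim}
Maximizing this quantity (or minimizing its negation with a gradient-based optimizer) increases the probability of
actions with positive advantage only as long as the probability ratio stays within the clipping range.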
Figure \ref{fig:min_clipping} compares $\text{clip}(\rho_t(\boldsymbol\theta), 1-\epsilon, 1+\epsilon) \cdot \delta$ with
$\mathcal{L}^\text{CLIP}$. Using an elementwise minimum has the following effect:
@@ -158,6 +160,11 @@ $\mathcal{L}^\text{CLIP}$. Using an elementwise minimum has the following effect
\label{fig:min_clipping}
\end{figure}
Clipping with $\epsilon$ keeps the new policy close to the old policy. The region determined by $\epsilon$ is called a
\emph{trust region}. The larger $\epsilon$ is, the more the new policy may deviate; for example, with $\epsilon = 0.2$
the probability ratio $\rho_t(\boldsymbol\theta)$ is clipped to the interval $[0.8, 1.2]$. If $\epsilon$ is too large,
the performance of the policy may collapse again. If $\epsilon$ is too small, however, an agent may learn too slowly.
By optimizing the loss $\mathcal{L}^\text{CLIP}$, we can improve the policy $\pi_{\boldsymbol\theta}$ without making
excessively large gradient steps while still being able to correct errors of previous optimizations, given a suitable
choice of $\epsilon$.
......
@@ -8,7 +8,7 @@ environment and the policy $\pi$ introduce non-deterministic behavior.
As the variance of the trajectories grows, the quality of the gradient estimator declines. Performing stochastic
gradient ascent with this estimator is no longer guaranteed to yield an improved policy---in fact the performance of the
policy might deteriorate \cite{seita2017}. Even a single bad step might cause a collapse of policy performance
\cite[algorithms/trpo.html]{spinningup}. Worse yet, the performance collapse might be irrecoverable. As the
policy has likely been trained for several steps already, it has begun transitioning from exploration to exploitation. If an
agent now strongly favors bad actions, it cannot progress towards the goal. In turn, we cannot sample meaningful data to
train the policy further.
@@ -16,7 +16,7 @@ train the policy further.
\citeA{dqn} further note that deep learning algorithms assume independent data samples. This poses another challenge, as
trajectories are commonly sequences of highly correlated states, actions and rewards.
Proximal Policy Optimization addresses these challenges by combining several known concepts. To obtain less correlated
data, multiple environments are run simultaneously---the agent uses one \emph{actor} in each environment. Furthermore,
policy updates are restricted so that the new policy does not deviate far from the old one. Instead of performing only a
single optimization pass over the recorded data, multiple passes are performed so that agents learn faster.