Commit 912fde77 authored by Daniel Lukats

feedback fabian

parent 8023e597
@@ -72,9 +72,9 @@ A loss could be constructed by multiplying the likelihood ratio and advantage es
Despite the use of importance sampling, this loss can be unreliable. For actions with a very large likelihood ratio, for
example, $\rho_t(\boldsymbol\theta) = 100$, gradient steps become excessively large, possibly leading to performance
collapses \cite{kakade2002}.
collapses \cite{kakade2002}.\todo{highlight this a bit more?}
Hence, Proximal Policy Optimization algorithms intend to keep the likelihood ratio close to $1$ \cite{ppo}. There are
Hence, Proximal Policy Optimization algorithms intend to keep the likelihood ratio close to one \cite{ppo}. There are
two variants of PPO: one relies on clipping the likelihood ratio, whereas the other applies a penalty based on the Kullback-Leibler
divergence. We examine the clipped variant, as it is more widely used and has achieved better results on various tasks such
as robotics and video games \cite{ppo, ppo_blog, kostrikov}.
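To make the idea concrete, a minimal sketch of the clipped surrogate loss from \cite{ppo} is given below. It is an
illustration rather than the implementation used in this thesis; the identifiers \texttt{ratio}, \texttt{advantage},
\texttt{epsilon} and the default value $0.2$ are assumed placeholders for $\rho_t(\boldsymbol\theta)$, the advantage
estimate and the clip range parameter $\epsilon$.
\begin{verbatim}
import torch

def clipped_surrogate_loss(ratio, advantage, epsilon=0.2):
    # Unclipped importance-sampled objective rho_t * advantage.
    unclipped = ratio * advantage
    # Clipping the ratio to [1 - epsilon, 1 + epsilon] removes the incentive
    # to move the likelihood ratio far away from 1.
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # The elementwise minimum is a pessimistic bound on the objective; the
    # negation turns maximization into a loss suitable for gradient descent.
    return -torch.min(unclipped, clipped).mean()
\end{verbatim}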
@@ -102,7 +102,8 @@ the clip range. Figure \ref{fig:full_clipping} shows the behavior of $\text{clip
\includegraphics[width=\textwidth]{03_ppo/full_clipping.pdf}
\caption{The left side of this figure shows the loss for an action with a positive advantage, whereas the loss for
an action with a negative advantage is shown on the right side. The loss is clipped if the likelihood ratio
(cf.~equation \ref{eqn:likelihood_ratio}) moves outside of the clip range $[1-\epsilon, 1+\epsilon]$. Inspired by
(cf.~equation \ref{eqn:likelihood_ratio}) moves outside of the clip range $[1-\epsilon, 1+\epsilon]$.
Figure inspired by
\protect\citeauthor{ppo}'s \protect\citeyear{ppo} and \protect\citeauthor{wang2019}'s \protect\citeyear{wang2019}
publications.}
\label{fig:full_clipping}
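For reference, figure \ref{fig:full_clipping} visualizes the per-action clipped objective of \cite{ppo}, which can be
written as follows; the advantage estimate is denoted $\hat{A}_t$ here, which may deviate from the notation used
elsewhere in this thesis:
\begin{equation*}
\min\Bigl(\rho_t(\boldsymbol\theta)\,\hat{A}_t,\; \text{clip}\bigl(\rho_t(\boldsymbol\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t\Bigr).
\end{equation*}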
@@ -8,7 +8,7 @@ environment and the policy $\pi$ introduce non-deterministic behavior.
As the variance of the trajectories grows, the quality of the gradient estimator declines. Performing stochastic
gradient ascent with this estimator is no longer guaranteed to yield an improved policy; in fact, the performance of the
policy might deteriorate \cite{seita2017}. Even a single bad step might cause a collapse of policy performance
\cite[algorithms/trpo.html]{spinningup}. Worse yet, the performance collapse might be irrecoverable. As the
\cite{spinningup}. Worse yet, the performance collapse might be irrecoverable. As the
policy has likely been trained for several steps already, it has begun transitioning from exploration to exploitation. If an
agent now strongly favors bad actions, it cannot progress towards the goal. In turn, we cannot sample meaningful data to
train the policy further.