Commit 912fde77 by Daniel Lukats

### feedback fabian

parent 8023e597
@@ -72,9 +72,9 @@ A loss could be constructed by multiplying the likelihood ratio and advantage es
 Despite the use of importance sampling this loss can be unreliable. For actions with a very large
 likelihood ratio, for example, $\rho_t(\boldsymbol\theta) = 100$, gradient steps become excessively
 large possibly leading to performance
-collapes \cite{kakade2002}.
+collapes \cite{kakade2002}.\todo{highlight this a bit more?}
-Hence, Proximal Policy Optimization algorithms intend to keep the likelihood ratio close to $1$ \cite{ppo}. There are
+Hence, Proximal Policy Optimization algorithms intend to keep the likelihood ratio close to one \cite{ppo}. There are
 two variants of PPO, one that relies on clipping the ratio whereas the other is penalized using the
 Kullback-Leibler divergence. We examine the clipped variant, as it is more widely used and achieved
 better results on various tasks such as robotics and video games \cite{ppo, ppo_blog, kostrikov}.

@@ -102,7 +102,8 @@ the clip range. Figure \ref{fig:full_clipping} shows the behavior of $\text{clip
 \includegraphics[width=\textwidth]{03_ppo/full_clipping.pdf}
 \caption{The left side of this figure shows the loss for an action with a positive advantage, whereas
 the loss for an action with negative advantage is shown on the right side. The loss is clipped if the likelihood ratio
-(cf.~equation \ref{eqn:likelihood_ratio}) moves outside of the clip range $[1-\epsilon, 1+\epsilon]$. Inspired by
+(cf.~equation \ref{eqn:likelihood_ratio}) moves outside of the clip range $[1-\epsilon, 1+\epsilon]$. Figure inspired by
 \protect\citeauthor{ppo}'s \protect\citeyear{ppo} and \protect\citeauthor{wang2019}'s \protect\citeyear{wang2019} publications.}
 \label{fig:full_clipping}

...

@@ -8,7 +8,7 @@
 environment and the policy $\pi$ introduce non-deterministic behavior. As the variance of the
 trajectories grows, the quality of the gradient estimator declines.
 Performing stochastic gradient ascent with this estimator is no longer guaranteed to yield an improved
 policy---in fact the performance of the policy might deteriorate \cite{seita2017}. Even a single bad
 step might cause a collapse of policy performance
-\cite[algorithms/trpo.html]{spinningup}. More importantly\todo{not quite the right phrase here}, the performance collapse might be irrecoverable. As the
+\cite{spinningup}. More importantly\todo{not quite the right phrase here}, the performance collapse might be irrecoverable. As the
 policy was likely trained for several steps already, it has begun transitioning from exploration to
 exploitation. If an agent now strongly favors bad actions, it cannot progress towards the goal. In
 turn, we cannot sample meaningful data to train the policy further.
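The clipping of the likelihood ratio discussed in the first two hunks can be sketched in a few lines of code. The following is a minimal NumPy illustration, not code from the thesis; the function name and the default clip range parameter $\epsilon = 0.2$ are assumptions for the example:

```python
import numpy as np

def clipped_surrogate_loss(ratio, advantage, epsilon=0.2):
    """Sketch of PPO's clipped surrogate objective for a single action.

    ratio:     likelihood ratio rho_t(theta) (new policy prob / old policy prob)
    advantage: advantage estimate for the action
    epsilon:   clip range parameter (0.2 is an assumed default, not from the thesis)
    """
    unclipped = ratio * advantage
    # restrict the ratio to [1 - epsilon, 1 + epsilon] before weighting the advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # taking the elementwise minimum keeps the objective from rewarding
    # ratios far outside the clip range
    return np.minimum(unclipped, clipped)
```

With a very large ratio such as $\rho_t(\boldsymbol\theta) = 100$ and a positive advantage, the clipped term caps the objective at $(1 + \epsilon)$ times the advantage, so the resulting gradient step stays small.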