### overhauled chapter 2 a bit

parent ea404dfb
 ... ... @@ -37,8 +37,11 @@ by utilizing $\hat{v}_\pi$:
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) \mid S_t = s, A_t = a \right] - v_\pi(s, \boldsymbol\omega)
\end{align}
We further extend function approximation to the advantage estimator
\begin{align}
\delta_t \doteq R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) - \hat{v}_\pi(S_t, \boldsymbol\omega).
\end{align}
\todo{maybe short note that we do not explicitly mention if the value function is approximated or not}
Note that we do not indicate whether the advantage estimator uses the value function $v_\pi$ or a parameterized value function $\hat{v}_\pi$. In subsequent chapters we consistently use function approximation; it can therefore be assumed that advantage estimators rely on parameterized value functions.
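If a concrete computation helps, the one-step estimator above maps directly to a few lines of code. The following is a minimal Python/NumPy sketch under assumed conventions (a linear value function $\hat{v}_\pi(s, \boldsymbol\omega) = \boldsymbol\omega^\top \mathbf{x}(s)$ and an episodic task); the function name and argument layout are illustrative, not taken from the thesis.

```python
import numpy as np

def td_error_advantages(rewards, features, omega, gamma=0.99):
    """One-step TD-error advantage estimates for a single episode:

        delta_t = R_{t+1} + gamma * v_hat(S_{t+1}, omega) - v_hat(S_t, omega)

    assuming (for illustration) a linear value function v_hat(s, omega) = omega @ x(s).

    rewards:  rewards R_1, ..., R_T, shape (T,)
    features: feature vectors x(S_0), ..., x(S_T), shape (T + 1, d)
    omega:    value-function weights, shape (d,)
    """
    values = features @ omega        # v_hat(S_t, omega) for t = 0, ..., T
    values[-1] = 0.0                 # terminal state contributes no future return
    # delta_0, ..., delta_{T-1}
    return rewards + gamma * values[1:] - values[:-1]
```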
 ... ... @@ -75,6 +75,7 @@ Adjusting the policy results in different action choices, which in turn changes
Thus, we might assume that we require the derivative of $\mu_{\pi_{\boldsymbol\theta}}(s)$ to compute gradient estimates.
This issue is remedied by the \emph{policy gradient theorem}, which proves that
\begin{align}
\label{eqn:q_gradient}
\nabla J(\boldsymbol\theta) \propto \sum_s \mu_{\pi_{\boldsymbol\theta}}(s) \sum_a q_{\pi_{\boldsymbol\theta}}(s, a) \nabla_{\boldsymbol\theta} \pi(a \mid s, \boldsymbol\theta),
\end{align}
 ... ... @@ -85,11 +86,12 @@ Originally this proof held true only with $q_{\pi_{\boldsymbol\theta}}(s, a)$, a
action $a$ given the state $s$.
However, the policy gradient theorem was later extended to allow combining policy gradients with the advantage function \cite{sutton2000}:
\begin{align}
\label{eqn:advantage_gradient}
\nabla J(\boldsymbol\theta) \propto \sum_s \mu_{\pi_{\boldsymbol\theta}}(s) \sum_a a_{\pi_{\boldsymbol\theta}}(s, a, \boldsymbol\omega) \nabla_{\boldsymbol\theta} \pi(a \mid s, \boldsymbol\theta).
\end{align}
\todo[inline]{note that the usage of the advantage function instead of q function is an improvement}
Using the advantage function is beneficial, as the policy gradient in equation \ref{eqn:advantage_gradient} has a lower variance than the gradient in equation \ref{eqn:q_gradient} \cite{gae}.
Similar to $\overline{\text{VE}}$, the gradient is weighted according to the distribution of states $\mu$, but it does not depend on the derivative of $\mu$.
By virtue of weighting with $\mu$, states an agent encounters regularly have a greater effect
 ... ...
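A short justification that could accompany the variance claim (sketched here as the standard baseline argument; not necessarily how the thesis derives it): because the policy normalizes to one over actions, a state-dependent baseline such as $\hat{v}_\pi(s, \boldsymbol\omega)$ contributes nothing to the inner sum, so the advantage form keeps the same expected gradient as the $q$ form.

```latex
% A state-dependent baseline vanishes inside the sum over actions,
% since the policy sums to one for every state.
\begin{align}
\sum_a \hat{v}_\pi(s, \boldsymbol\omega) \nabla_{\boldsymbol\theta} \pi(a \mid s, \boldsymbol\theta)
  = \hat{v}_\pi(s, \boldsymbol\omega) \nabla_{\boldsymbol\theta} \sum_a \pi(a \mid s, \boldsymbol\theta)
  = \hat{v}_\pi(s, \boldsymbol\omega) \nabla_{\boldsymbol\theta} 1
  = 0
\end{align}
```

Hence replacing $q_{\pi_{\boldsymbol\theta}}(s, a)$ with $a_{\pi_{\boldsymbol\theta}}(s, a, \boldsymbol\omega) = q_{\pi_{\boldsymbol\theta}}(s, a) - \hat{v}_\pi(s, \boldsymbol\omega)$ leaves the gradient unchanged in expectation, while the centered targets typically yield lower-variance sample estimates \cite{gae}.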
 ... ... @@ -28,6 +28,7 @@ agent playing Pong that fails to deflect a ball will receive a negative reward,
maximize returns, agents must avoid actions and states that have a high likelihood of yielding negative rewards.
We rate states based on the returns we may obtain by using the value function
\begin{align}
\label{eqn:value_function}
&v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s],
\end{align}
with $\mathbb{E}_\pi$ denoting the expected value when following the policy $\pi$.
The higher the value of a state, the
 ... ...
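To make the expectation over $G_t$ concrete, here is a small Python sketch (illustrative only, assuming an episodic task and a discount factor $\gamma$; the function name is hypothetical) that computes the discounted return at every time step of one recorded episode. Averaging these returns over many visits to a state $s$ gives a Monte Carlo estimate of $v_\pi(s)$.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + ... for every step of one episode.

    rewards: the sequence R_1, ..., R_T observed while following pi.
    Averaging G_t over many visits to a state s yields a Monte Carlo
    estimate of the value function v_pi(s) = E_pi[G_t | S_t = s].
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the end of the episode
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```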