Commit 614ec44d authored by Daniel Lukats

overhauled chapter 2 a bit

parent ea404dfb
......@@ -37,8 +37,11 @@ by utilizing $\hat{v}_\pi$:
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) \mid S_t = s, A_t = a \right] - v_\pi(s,
We further extend function approximation to the advantage estimator
\delta_t \doteq R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) - \hat{v}_\pi(S_t, \boldsymbol\omega).
\todo{maybe short note that we do not explicitly mention if the value function is approximated or not}
Note that we do not indicate whether the advantage estimator uses the value function $v_\pi$ or a parameterized value
function $\hat{v}_\pi$. Subsequent chapters consistently use function approximation, so advantage estimators can be
assumed to rely on parameterized value functions.
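
As a worked illustration with made-up numbers (chosen only for this example, not taken from any experiment), assume
$R_{t+1} = 1$, $\gamma = 0.9$, $\hat{v}_\pi(S_{t+1}, \boldsymbol\omega) = 2$ and
$\hat{v}_\pi(S_t, \boldsymbol\omega) = 2.5$. The advantage estimator then evaluates to
\begin{align*}
\delta_t &= 1 + 0.9 \cdot 2 - 2.5 = 0.3,
\end{align*}
so the observed transition turned out slightly better than the parameterized value function predicted for $S_t$.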
......@@ -75,6 +75,7 @@ Adjusting the policy results in different action choices, which in turn changes
Thus, we might assume that we require the derivative of $\mu_{\pi_{\boldsymbol\theta}}(s)$ to compute gradient
estimates. This issue is remedied by the \emph{policy gradient theorem}, which proves that
\nabla J(\boldsymbol\theta) \propto \sum_s \mu_{\pi_{\boldsymbol\theta}}(s) \sum_a q_{\pi_{\boldsymbol\theta}}(s, a)
\nabla_{\boldsymbol\theta} \pi(a \mid s, \boldsymbol\theta),
......@@ -85,11 +86,12 @@ Originally this proof held true only with $q_{\pi_{\boldsymbol\theta}}(s, a)$, a
action $a$ given the state $s$. However, the policy gradient theorem was expanded upon to allow the combination of
policy gradients with the advantage function \cite{sutton2000}:
\nabla J(\boldsymbol\theta) \propto \sum_s \mu_{\pi_{\boldsymbol\theta}}(s) \sum_a a_{\pi_{\boldsymbol\theta}}(s, a,
\boldsymbol\omega) \nabla_{\boldsymbol\theta} \pi(a \mid s, \boldsymbol\theta).
\todo[inline]{note that the usage of the advantage function instead of q function is an improvement}
Using the advantage function is beneficial, as the policy gradient in equation \ref{eqn:advantage_gradient} has a lower
variance than the gradient in equation \ref{eqn:q_gradient} \cite{gae}.
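
A brief sketch of why such a substitution is admissible, under the assumption that the term subtracted from
$q_{\pi_{\boldsymbol\theta}}(s, a)$ is some function $b(s)$ that does not depend on the action (the symbol $b$ is
introduced here for illustration only): since the action probabilities sum to one in every state,
\begin{align*}
\sum_a b(s) \nabla_{\boldsymbol\theta} \pi(a \mid s, \boldsymbol\theta)
&= b(s) \nabla_{\boldsymbol\theta} \sum_a \pi(a \mid s, \boldsymbol\theta)
= b(s) \nabla_{\boldsymbol\theta} 1 = 0.
\end{align*}
Subtracting a state value from the exact action value therefore leaves the sum over actions unchanged, and only the
variance of sample-based gradient estimates is affected.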
Similar to $\overline{\text{VE}}$, the gradient is weighted according to the distribution of states $\mu$, but it does not
depend on its derivative. By virtue of weighting with $\mu$, states an agent encounters regularly have a greater effect
......@@ -28,6 +28,7 @@ agent playing Pong that fails to deflect a ball will receive a negative reward,
maximize returns, agents must avoid actions and states that have a high likelihood of yielding negative rewards. We rate
states based on the returns we may obtain by using the value function
&v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s],
with $\mathbb{E}_\pi$ denoting the expected value when following the policy $\pi$. The higher the value of a state, the