
Commit 8023e597 authored by Daniel Lukats

feedback fabian

parent 3cf06ccb
In simple environments with a small state space, we may construct a tabular solution storing values for all states we
encounter. Among the algorithms creating such tables are \emph{SARSA} and \emph{Q Learning} \cite[chapters 6.4 and
6.5]{sutton18}. If the state space grows very large, creating a table becomes infeasible, both in terms of computation
time and memory consumption. Hence, we must approximate both the value function and the policy (cf.~chapter
\ref{sec:02:policy_gradients}).
Let $\hat{v}_\pi(s, \boldsymbol\omega)$ be an estimator of the value function $v_\pi(s)$, that is, $\hat{v}_\pi(s, \boldsymbol\omega) \approx v_\pi(s)$.
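As a rough illustration, the contrast between a tabular solution and a parameterized estimator such as $\hat{v}_\pi$ can be sketched as follows; the linear estimator, the feature vector and all numerical values are illustrative assumptions and not the architecture used later in this thesis.
\begin{verbatim}
import numpy as np

num_states = 10                      # toy example; ATARI state spaces are far too
value_table = np.zeros(num_states)   # large to enumerate in a table like this one

def v_hat(state_features, omega):
    # parameterized estimate: a fixed-size weight vector omega shared by all states
    return float(np.dot(omega, state_features))

omega = np.zeros(4)                          # four weights, independent of the number of states
features = np.array([1.0, 0.5, -0.2, 0.0])   # hypothetical feature vector for a state s
print(value_table[3], v_hat(features, omega))
\end{verbatim}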
......@@ -29,19 +29,28 @@ states we observe regularly more frequently leading to more accurate estimations
$\overline{\text{VE}}(\boldsymbol\omega)$ implicitly. We establish a method to learn the value function in chapter
\ref{sec:03:loss_value}.
We can easily extend function approximation to the advantage function $a_\pi(s, a)$ by utilizing $\hat{v}_\pi$. In
equation \ref{eqn:advantage_function}, the advantage function was defined to be
\begin{align*}
a_\pi(s, a) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left(r + \gamma v_\pi(s') \right)\right] -
v_\pi(s) \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] - v_\pi(s).
\end{align*}
Then, the advantage function is parameterized by inserting the parameterized value function $\hat{v}_\pi$ instead of
$v_\pi$:
\begin{align}
\hat{a}_\pi(s, a, \boldsymbol\omega) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left( r + \gamma \hat{v}_\pi(s',
\boldsymbol\omega) \right) \right] - \hat{v}_\pi(s, \boldsymbol\omega) \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) \mid S_t = s, A_t = a \right] - \hat{v}_\pi(s,
\boldsymbol\omega).
\end{align}
We further extend function approximation to the advantage estimator
\begin{align}
\delta_t \doteq R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) - \hat{v}_\pi(S_t, \boldsymbol\omega).
\end{align}
Note that we do not indicate whether the advantage estimator uses the value function $v_\pi$ or a parameterized value
function $\hat{v}_\pi$. In subsequent chapters we consistently use function approximation; therefore, it can be assumed
that advantage estimators rely on parameterized value functions.
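As a minimal sketch, $\delta_t$ could be computed from a recorded transition as follows; the linear value estimator and all variable names are illustrative assumptions.
\begin{verbatim}
import numpy as np

def v_hat(state_features, omega):
    # hypothetical linear value estimator v_hat(s, omega)
    return float(np.dot(omega, state_features))

def advantage_estimate(reward, state_feat, next_state_feat, omega, gamma=0.99):
    # delta_t = R_{t+1} + gamma * v_hat(S_{t+1}, omega) - v_hat(S_t, omega)
    return reward + gamma * v_hat(next_state_feat, omega) - v_hat(state_feat, omega)

omega = np.array([0.1, -0.3, 0.2])
delta = advantage_estimate(1.0, np.array([1.0, 0.0, 0.5]),
                           np.array([0.0, 1.0, 0.5]), omega)
\end{verbatim}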
......@@ -9,11 +9,11 @@ agent's choices. Finally, we reduce complexity by approximating and parameterizi
\label{sec:02:agent_environment}
In reinforcement learning, an \gls{agent}---which is the acting and learning entity---is embedded in and interacts with
an \gls{environment} \cite[chapter 3.1]{sutton18}. The environment describes the world surrounding the agent and is
beyond the agent's immediate control. However, agents can affect the environment through actions, which they choose
based on the environment's observed state. After taking an action, agents observe the environment again. The interplay
of agent and environment is shown in figure \ref{fig:action_observation}. In this thesis, we use a set of ATARI 2600
games as environments (cf.~chapter \ref{sec:05:ale}).
\begin{figure}[h]
\centering
......@@ -25,12 +25,12 @@ shown in figure \ref{fig:action_observation}. In this thesis, we use a set of AT
\end{figure}
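The interplay depicted in figure \ref{fig:action_observation} can be sketched as a simple loop; each observation also includes the reward discussed in the following paragraph. The toy environment and the uniformly random action choice are illustrative assumptions rather than the setup of chapter \ref{sec:05:ale}.
\begin{verbatim}
import random

class ToyEnvironment:
    # illustrative stand-in for an environment; not an ATARI 2600 game
    def reset(self):
        self.steps = 0
        return 0                            # initial observed state
    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.steps >= 10             # a terminal state ends the interaction
        return self.steps, reward, done

env = ToyEnvironment()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])           # the agent picks an action (here uniformly at random)
    state, reward, done = env.step(action)   # ... and observes the environment again
\end{verbatim}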
In addition to a new state, observations are associated with a reward. Agents seek to maximize the rewards they observe
through trial-and-error. Hence, we can teach an agent to achieve a certain goal provided we design rewards properly:
\enquote{[\dots] the reward signal is your way of communicating to the [agent] \emph{what} you want it to achieve, not
\emph{how} you want it achieved} \cite[p.~54]{sutton18}. As a consequence, only the action achieving the goal must yield
a positive reward. However, a sequence of actions instead of a single one may be essential to achieving the goal,
raising the issue that rewards for vital actions may be delayed. Delayed reward and the trial-and-error approach are
considered to be the \enquote{most important distinguishing features of reinforcement learning} by
\citeA[p.~2]{sutton18}.
In order to describe agent and environment, we require a mathematical construct that encompasses the
......
......@@ -148,7 +148,8 @@ An agent's actions are determined by the \emph{policy} function
&\pi: \mathcal{A} \times \mathcal{S} \rightarrow [0, 1], \nonumber
\end{align}
which returns the probability of the agent picking the action $a$ in state $s$ \cite[p.~58]{sutton18}. We say that the
agent follows the policy $\pi$. A means for learning a policy $\pi$ is introduced in chapter
\ref{sec:02:policy_gradients} and elaborated on in chapter \ref{sec:03:ppo}.
Although the policy is stochastic and does not yield actions directly, an agent can determine an action by sampling
proportional to the probability distribution. For example, an agent following the policy $\pi(\emph{\text{noop}}
......@@ -160,7 +161,7 @@ proportional to the probability distribution. For example, an agent following th
When an agent and an environment interact with each other over a series of discrete time steps $t = 0, 1, 2, \dots$, we
can observe a sequence of states, actions and rewards $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$. We call this
sequence the \gls{trajectory}. When we train an agent, we often record trajectories of states, actions and rewards to
use as training data by running the agent in the environment.\todo{citation here?}
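For illustration, such a trajectory could be recorded as parallel lists whose indices follow the convention above; the placeholder policy and environment step are hypothetical.
\begin{verbatim}
import random

def sample_action(state):
    return random.choice([0, 1])                  # placeholder policy

def env_step(state, action):
    next_state, reward = state + 1, float(action)
    return next_state, reward, next_state >= 5    # done once a terminal state is reached

states, actions, rewards = [0], [], []            # S_0; A_t and R_{t+1} are appended below
done = False
while not done:
    action = sample_action(states[-1])
    next_state, reward, done = env_step(states[-1], action)
    actions.append(action)                        # A_t
    rewards.append(reward)                        # R_{t+1}
    states.append(next_state)                     # S_{t+1}
\end{verbatim}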
ATARI 2600 games have a plenitude of \emph{terminal states} that indicate the game has come to an end, for example when
either player scores 20 points in Pong or the player loses their final life in Breakout. $\#\{s \mid s \in \mathcal{S}
......
\todo[inline]{short introductory text}
Policy gradients are one method of learning a policy. The main idea is parameterizing the policy with a differentiable
function that depends on a parameter vector $\boldsymbol\theta$. The policy is then learned by optimizing $\boldsymbol\theta$ with gradient ascent.
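Schematically, and deferring the concrete performance measure to chapter \ref{sec:03:ppo}, such a gradient ascent step updates the parameters in the direction of the gradient of some performance measure $J$ with a step size $\alpha$:
\begin{align*}
\boldsymbol\theta_{t+1} \doteq \boldsymbol\theta_t + \alpha \nabla_{\boldsymbol\theta} J(\boldsymbol\theta_t).
\end{align*}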
\subsubsection{Policy Parameterization}
\label{sec:02:policy_theta}
......@@ -13,6 +14,13 @@ returns the probability of taking the action $a$ at time $t$ given the state $s$
$\boldsymbol\theta$ \cite[p.~321]{sutton18}. Often we write $\pi_{\boldsymbol\theta}$ instead of $\pi(a \mid s,
\boldsymbol\theta)$.
The parameterization $\boldsymbol\theta$ often returns a vector of numerical weights. Each action corresponds to a
weight that indicates the likelihood of picking the respective action: the higher the weight, the more likely the
action should be picked. The policy $\pi$ uses the numerical weights to determine a probability distribution
\cite[p.~322]{sutton18}, e.g., by applying a softmax function \cite[pp.~184--185]{goodfellow2016}. The method used to
determine the probability distribution remains fixed. Instead, we optimize the parameterization $\boldsymbol\theta$ so
that the likelihood of choosing advantageous actions is increased.
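A minimal sketch of such a softmax parameterization follows; the two-element weight vector matches the example below and is purely illustrative.
\begin{verbatim}
import numpy as np

def softmax_policy(weights):
    # turn the numerical action weights theta_t(s) into a distribution pi(. | s)
    exp_w = np.exp(weights - np.max(weights))     # subtract the maximum for numerical stability
    return exp_w / exp_w.sum()

weights = np.array([4.0, 2.0])                    # hypothetical theta_t(s) for two actions
probs = softmax_policy(weights)                   # approximately [0.88, 0.12]
action = np.random.choice(len(weights), p=probs)  # sample an action proportional to pi
\end{verbatim}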
Let $S_t = s$ and $\mathcal{A}(s) = \{1, 2\}$. A parameterization $\boldsymbol\theta_t$ could return a vector with two
elements, for example $\boldsymbol\theta_t(s) = \begin{pmatrix}4\\2\end{pmatrix}$.\footnote{We write
$\left(\boldsymbol\theta_t(s)\right)_a$ to obtain the $a$th component of $\boldsymbol\theta_t(s)$, e.g.,
......
Based on the components present in a Markov decision process, we can define a formal goal an agent shall achieve. As an
agent's behavior depends on the policy it follows, we require a means to assess choices made by the policy as well.
Therefore, we introduce \emph{returns} and define two functions: The \emph{value function} allows us to judge states,
whereas the \emph{advantage function} can be used to assess the impact of an action. These concepts are key in the
learning approach and algorithm we introduce in chapters \ref{sec:02:policy_gradients} and \ref{sec:03:ppo}.
\subsubsection{Return}
\label{sec:02:return}
......@@ -13,12 +13,13 @@ R_T$. Using this sequence we define the return
G_t &\doteq \sum_{k=t+1}^T \gamma^{k-t-1} R_k,\; 0 \leq \gamma \leq 1.
\end{align}
$\gamma$ denotes the discount factor, which discounts future rewards. In finite-horizon Markov decision processes,
$\gamma$ may equal $1$. In infinite-horizon Markov decision processes, we must choose $\gamma < 1$ so $G_t$ remains
finite \cite[p.~54]{sutton18}.
Returns may be defined recursively, as early returns include the reward sequences of later returns: $G_t = R_{t+1} +
\gamma G_{t+1}$.
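For a brief worked example, consider a short episode with rewards $R_1 = 1$, $R_2 = 0$ and $R_3 = 2$, horizon $T = 3$ and discount factor $\gamma = 0.5$. Applying the recursive definition from the end of the episode yields
\begin{align*}
G_2 &= R_3 = 2, \\
G_1 &= R_2 + \gamma G_2 = 0 + 0.5 \cdot 2 = 1, \\
G_0 &= R_1 + \gamma G_1 = 1 + 0.5 \cdot 1 = 1.5.
\end{align*}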
\subsubsection{Value Function}
\label{sec:02:value_function}
......@@ -44,11 +45,12 @@ optimal value function $v_*$. Assuming an optimal policy for the ATARI 2600 game
The value of a state depends on the expected behavior of an agent following the policy $\pi$. In order to increase the
probability of obtaining a high return, we require a means to identify actions that prove advantageous or
disadvantageous in this endeavour. The required functionality is provided by the \emph{advantage function}
\cite{baird1993, sutton2000}\footnote{The advantage function is commonly defined using the $q$ function: $a_\pi(s, a)
\doteq q_\pi(s,a) - v_\pi(s)$. The definition we provide is equivalent, as it merely substitutes $q_\pi(s,a)$ with the term
it is defined to be.}
\begin{align}
\label{eqn:advantage_function}
a_\pi(s, a) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left(r + \gamma v_\pi(s') \right)\right] - v_\pi(s) \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] - v_\pi(s)
\end{align}
......