Commit a357ea6f authored by Daniel Lukats

removed some unneeded \gls

parent 89e4a2d0
For ease of reading, an overview of all definitions including the equation and page number can be found on page
\pageref{sec:maths_index}. The mathematical notation used in publications regarding reinforcement learning can differ
greatly. In this thesis, we adhere to the notation introduced by \citeA{sutton18} and adapt definitions and proofs from
other sources accordingly. As a consequence, interested readers may notice differences between the notation used in
chapter \ref{sec:03:ppo} and the publications of \citeA{gae} and \citeA{ppo}:
\begin{itemize}
\item The advantage estimator is denoted $\delta$ instead of $\hat{A}$, as both $A$ and $a$ are heavily overloaded
already.
\item The probability ratio is denoted $\rho$ instead of $r$, which avoids confusion with rewards $R_t$ or $r$.
Furthermore, using $\rho$ is consistent with \citeauthor{sutton18}'s \citeyear{sutton18} definition of
\emph{importance sampling}.
\end{itemize}
Besides that, it should be noted that an active \emph{we} is used regularly in this thesis. This \emph{we} is meant to
include the author and all readers.
\subsection{Agent and Environment}
\label{sec:02:agent_environment}
In reinforcement learning, an \emph{agent}---which is the acting and learning entity---is embedded in and interacts with
an \emph{environment} \cite[chapter 3.1]{sutton18}. The environment describes the world surrounding the agent and is
beyond the agent's immediate control. However, agents can affect the environment through actions, which they choose
based on the environment's observed state. After taking an action, agents observe the environment again. The interplay
of agent and environment is shown in figure \ref{fig:action_observation}. In this thesis, we use a set of ATARI 2600
games as environments. In these games, rewards for vital actions may be delayed; delayed rewards are considered to be
the \enquote{most important distinguishing features of reinforcement learning} by \citeA[p.~2]{sutton18}.
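Viewed from the outside, this interplay is a simple loop: the agent observes a state, chooses an action and receives a
reward together with the next observation. The following Python sketch illustrates this loop; the \texttt{reset} and
\texttt{step} methods of the environment and the \texttt{act} method of the agent are assumed interfaces chosen purely
for illustration and are not taken from any particular implementation.
\begin{verbatim}
# Illustrative sketch of the agent-environment loop.
# The env.reset()/env.step() and agent.act() interfaces are assumptions.
def run_episode(env, agent):
    history = []              # records (state, action, reward) triples
    state = env.reset()       # observe the initial state
    done = False
    while not done:
        action = agent.act(state)                    # choose an action based on the observed state
        next_state, reward, done = env.step(action)  # act, then observe reward and next state
        history.append((state, action, reward))
        state = next_state
    return history
\end{verbatim}
Each pass through the loop corresponds to one exchange of action and observation in figure
\ref{fig:action_observation}.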
In order to describe agent and environment, we require a mathematical construct that encompasses the environment's
states and rewards as well as the actions agents may take. These requirements are met by a Markov decision process,
which we introduce in chapter \ref{sec:02:mdp}. Subsequently, we define a goal the agent shall achieve by means of the
\emph{value function}.
\subsubsection{Definition}
\label{sec:02:mdp_def}
We consider a finite Markov decision process with finite horizon defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p,
\pi)$. It consists of three finite sets of random variables and two probability distribution functions:
the set of states $\mathcal{S}$, the set of actions $\mathcal{A}$, the set of rewards $\mathcal{R}$, the dynamics
function $p$ and lastly the policy function $\pi$.\footnote{Definitions by other researchers may differ slightly.}
\label{fig:intro_mdp}
\end{figure}
A simple Markov decision process is displayed in figure \ref{fig:intro_mdp}. Like Markov chains, it contains states
$s_0, s_1, s_2$ with $s_0$ being the initial state and $s_2$ being a terminal state. Unlike Markov chains, it also
contains actions \emph{left} and \emph{right} as well as rewards $0$ and $1$. The rewards are written alongside
transition probabilities and can be found on the edges connecting the states. An example using elements of this Markov
decision process is given in chapter \ref{sec:02:distributions}.
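Reading these elements off the figure, the three sets of this particular Markov decision process are
\[
    \mathcal{S} = \{s_0, s_1, s_2\}, \qquad \mathcal{A} = \{\text{\emph{left}}, \text{\emph{right}}\}, \qquad
    \mathcal{R} = \{0, 1\},
\]
while the dynamics $p$ are given by the probabilities written on the edges and the policy $\pi$ is left to the agent.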
Markov decision processes share most properties with Markov chains\todo{some source here}, the major difference being
the transitions between states. Whereas in Markov chains the transition depends on the current state only, in Markov
decision processes the transition function $p$ takes actions into account as well. Thus, the transition can be affected,
although it remains stochastic in nature. We assume that the Markov property holds true for Markov decision processes,
which means that these transitions depend on the current state and action only \cite[p.~49]{sutton18}.
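One common way to state the Markov property in the notation used here is that conditioning on the entire history of
states, actions and rewards yields the same distribution over the next state and reward as conditioning on the current
state and action alone:
\[
    \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, R_1, \dots, R_t, S_t, A_t\}
        = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t, A_t\}.
\]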
\subsubsection{States, Actions and Rewards}
States, actions and rewards are core concepts of reinforcement learning. States describe the environment, whereas
actions describe an agent's capabilities within this environment. Rewards on the other hand are used to teach the
agent.
\paragraph{States.}
We describe the environment using \emph{states}: Let $S_t \in \mathcal{S}$ denote the state of the environment at time
step $t$ and let $\mathcal{S}$ denote the finite set of states. A state must contain all details an agent requires to
make well-informed decisions.\todo{citation here probably?}
As teachers, rewards are our primary means of communicating with an agent. We use them to inform an agent whether it
achieved its goal or not. As stated in chapter \ref{sec:02:agent_environment}, the agent may receive positive rewards
only for achieving its goal, which raises the issue of delayed rewards. We will remedy this issue in chapter
\ref{sec:02:value_function} by introducing the concept of a \emph{return}.
The second component of an observation is the successor state $S_{t+1}$. Note that we define the reward to be $R_{t+1}$
as it is the reward that is observed with the state $S_{t+1}$. Consequently, there is no reward $R_0$. If the agent
scores a point in Pong, for example, we provide a reward $R_{t+1} = 1$ in the next time step.
\begin{figure}[h]
    \centering
    \includegraphics[width=0.7\textwidth]{02_rl_theory/agent_environment_detailed}
    \caption{An agent always observes the current state $S_t$ to decide on an action $A_t$. After it has acted, it
        observes the environment again. It obtains a reward $R_{t+1}$ and perceives successor state $S_{t+1}$
        \protect\cite{sutton18}.}
\label{fig:episode}
\end{figure}
\label{sec:02:distributions}
The dynamics function $p$ and the policy $\pi$ describe the behavior of the environment and the agent. Combined, they
determine the path through the respective Markov decision process.
\paragraph{Dynamics.}
Transitions to successor states and the values of rewards associated with these states are determined by the dynamics
function $p$, also called the dynamics of the environment.
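In the notation of \citeA{sutton18}, which this thesis follows, the dynamics function can be written as the conditional
probability of observing a successor state $s'$ and a reward $r$ given the current state $s$ and the chosen action $a$,
\[
    p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}.
\]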
The simple Markov decision process from chapter \ref{sec:02:mdp_def} is shown again in figure \ref{fig:dynamics}. Each
edge describes a transition from one state to a successor state. The reward along the edge and the successor state form
an observation. The probability of the observation is stated with the reward next to its respective edge. We assign a
probability of $0$ to observations not described by an edge. In the displayed Markov decision process, the probability
of observing the state $S_{t+1} = s_2$ as well as the reward $R_{t+1} = 1$ given the current state $S_t = s_1$ and the
action $A_t = \text{\emph{left}}$ is $p(s_2, 1 \mid s_1, \text{\emph{left}}) = 0.8$.
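Since $p$ specifies a probability distribution over observations for every state and action, the probabilities of all
observations available from a given state-action pair sum to one,
\[
    \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1
    \qquad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}.
\]
In the example above, the edge with probability $0.8$ therefore leaves a probability mass of $0.2$ for the remaining
observations reachable from $S_t = s_1$ and $A_t = \text{\emph{left}}$.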
When an agent and an environment interact with each other over a series of discrete time steps $t = 0, 1, 2, \dots$, we
can observe a sequence of states, actions and rewards $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$. We call this
sequence the \emph{trajectory}. When we train an agent, we often record trajectories of states, actions and rewards to
use as training data by running the agent in the environment.\todo{citation here?}
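As an illustration, consider a single episode in the Markov decision process of figure \ref{fig:intro_mdp}. The
transition out of $s_0$ below is assumed purely for illustration, since only the edge
$p(s_2, 1 \mid s_1, \text{\emph{left}}) = 0.8$ was made explicit above; such an episode could give rise to the
trajectory
\[
    S_0 = s_0, \; A_0 = \text{\emph{right}}, \; R_1 = 0, \; S_1 = s_1, \; A_1 = \text{\emph{left}}, \; R_2 = 1, \;
    S_2 = s_2.
\]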
ATARI 2600 games have a plenitude of \emph{terminal states} that indicate the game has come to an end, for example when
one player wins a game of Pong. In that case, the score of the loser can have any value between $0$ and $19$.
Furthermore, the paddles may be at different positions at the end of the game, which opens up even more terminal
states.\todo{someone check this}
Whenever the agent transitions to a terminal state, it cannot transition to other states anymore. Therefore, there is a
final time step $T$ after which no meaningful information can be gained. We call $T$ the \emph{finite} \emph{horizon};
it marks the end of an episode. Each episode begins with an initial state $S_0$ and ends once an agent reaches a
terminal state $S_T$. Usually, agents attempt episodic tasks many times with each episode giving rise to a different
trajectory; the horizon $T$ may differ, too.
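In the illustrative trajectory sketched above, the terminal state $s_2$ is reached after two time steps, so that
episode has the horizon $T = 2$ and its final reward is $R_T = R_2 = 1$.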