Daniel Lukats / master-thesis

Commit a357ea6f, authored Jun 01, 2020 by Daniel Lukats
parent 89e4a2d0

removed some unneeded \gls

Showing 3 changed files with 34 additions and 34 deletions (+34 −34)
thesis/01_introduction/notation.tex   +7 −7
thesis/02_rl_theory/introduction.tex  +4 −4
thesis/02_rl_theory/mdp.tex           +23 −23
thesis/01_introduction/notation.tex
The mathematical notation used in publications regarding reinforcement learning can differ greatly. In this thesis, we
adhere to the notation introduced by \citeA{sutton18} and adapt definitions and proofs from other sources accordingly.
As a consequence, interested readers may notice differences between the notation used in chapter \ref{sec:03:ppo} and
the publications of \citeA{gae} and \citeA{ppo}. For ease of reading, an overview of all definitions including the
equation and page number can be found on page \pageref{sec:maths_index}.

The mathematical notation used in publications regarding reinforcement learning can differ greatly. In this thesis, we
adhere to the notation introduced by \citeA{sutton18} and adapt definitions and proofs from other sources accordingly.
As a consequence, interested readers may notice differences between the notation used in chapter \ref{sec:03:ppo} and
the publications of \citeA{gae} and \citeA{ppo}:
\begin{itemize}
    \item The advantage estimator is denoted $\delta$ instead of $\hat{A}$, as both $A$ and $a$ are heavily overloaded
        already.
...
@@ -10,6 +10,6 @@ publications of \citeA{gae} and \citeA{ppo}:
        confusions with rewards $R_t$ or $r$. Furthermore, using $\rho$ is consistent with \citeauthor{sutton18}'s
        \citeyear{sutton18} definition of \emph{importance sampling}.
\end{itemize}
For ease of reading, an overview of all definitions is provided on page TODO.

In this thesis, an active \emph{we} is used regularly. This \emph{we} is meant to include the author and all readers.

Besides that, it should be noted that an active \emph{we} is used regularly in this thesis. This \emph{we} is meant to
include the author and all readers.
thesis/02_rl_theory/introduction.tex
...
@@ -8,8 +8,8 @@ agent's choices. Finally, we reduce complexity by approximating and parameterizi
\subsection{Agent and Environment}
\label{sec:02:agent_environment}
In reinforcement learning, an \gls{agent}---which is the acting and learning entity---is embedded in and interacts with
an \gls{environment} \cite[chapter 3.1]{sutton18}. The environment describes the world surrounding the agent and is

In reinforcement learning, an \emph{agent}---which is the acting and learning entity---is embedded in and interacts
with an \emph{environment} \cite[chapter 3.1]{sutton18}. The environment describes the world surrounding the agent and
is beyond the agent's immediate control. However, agents can affect the environment through actions, which they choose
based on the environment's observed state. After taking an action, agents observe the environment again. The interplay
of agent and environment is shown in figure \ref{fig:action_observation}. In this thesis, we use a set of ATARI 2600
...
@@ -33,8 +33,8 @@ raising the issue that rewards for vital actions may be delayed. Delayed reward
        considered to be the \enquote{most important distinguishing features of reinforcement learning} by
        \citeA[p.~2]{sutton18}.

In order to describe agent and environment, we require a mathematical construct that encompasses the environment's
states and rewards as well as the actions agents may take. These requirements are met by a \gls{mdp},

In order to describe agent and environment, we require a mathematical construct that encompasses the environment's
states and rewards as well as the actions agents may take. These requirements are met by a Markov decision process,
which we introduce in chapter \ref{sec:02:mdp}. Subsequently, we define a goal the agent shall achieve utilizing the
\emph{value function}.
...
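The observe--act--observe cycle of agent and environment described in this section can be sketched in a few lines of Python. The `Env` class below is a made-up toy environment, not the ATARI setup used in the thesis; only the interaction loop itself follows the text.

```python
import random

class Env:
    """A toy environment: state 0 is initial, state 1 is terminal (assumed)."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The successor state and the reward depend on the current state and action.
        self.state = min(self.state + action, 1)
        reward = 1.0 if self.state == 1 else 0.0
        done = self.state == 1
        return self.state, reward, done

random.seed(0)  # deterministic run for the example
env = Env()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])          # the agent chooses an action ...
    state, reward, done = env.step(action)  # ... and observes the environment again
```

The loop terminates once the toy environment reaches its terminal state; in the episodic ATARI tasks of the thesis this corresponds to the end of a game.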
thesis/02_rl_theory/mdp.tex
...
@@ -6,7 +6,7 @@ but we do not require that in this thesis so it is implicitly included?}
\subsubsection{Definition}
\label{sec:02:mdp_def}

We consider a finite \gls{mdp} with finite horizon defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p,

We consider a finite Markov decision process with finite horizon defined by the tuple
$(\mathcal{S}, \mathcal{A}, \mathcal{R}, p,
\pi)$. It is comprised of three finite sets of random variables and two probability distribution functions:
the set of states $\mathcal{S}$, the set of actions $\mathcal{A}$, the set of rewards $\mathcal{R}$, the dynamics
function $p$ and lastly the policy function $\pi$.\footnote{Definitions by other researchers may differ slightly, e.g.,
...
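Read literally, the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p, \pi)$ is three finite sets plus two conditional distributions, which can be written down as plain data. A minimal Python sketch follows; only $p(s_2, 1 \mid s_1, \text{left}) = 0.8$ appears later in the text, and the remaining probabilities are invented placeholders.

```python
# A finite MDP as plain data: sets S, A, R plus dynamics p and policy pi.
S = ["s0", "s1", "s2"]   # states (s2 assumed terminal)
A = ["left", "right"]    # actions
R = [0, 1]               # rewards

# p maps (state, action) to a distribution over (successor, reward) pairs.
p = {
    ("s0", "right"): {("s1", 0): 1.0},
    ("s1", "left"):  {("s2", 1): 0.8, ("s0", 0): 0.2},
    ("s1", "right"): {("s2", 0): 1.0},
}

# pi maps a state to a distribution over actions.
pi = {
    "s0": {"right": 1.0},
    "s1": {"left": 0.5, "right": 0.5},
}

# Every conditional distribution must sum to one.
for dist in list(p.values()) + list(pi.values()):
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Because both $p$ and $\pi$ are finite tables, checking that each conditional distribution is normalized is a one-line loop.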
@@ -28,17 +28,17 @@ interact once---the agent chooses an action and observes the environment.
\label{fig:intro_mdp}
\end{figure}

A simple MDP is displayed in figure \ref{fig:intro_mdp}. Like Markov chains, it contains states $s_0, s_1, s_2$ with
$s_0$ being the initial state and $s_2$ being a terminal state. Unlike Markov chains, it also contains actions
\emph{left} and \emph{right} as well as rewards $0$ and $1$. The rewards are written alongside transition probabilities
and can be found on the edges connecting the states. An example using elements of this MDP is given in chapter
\ref{sec:02:distributions}.

A simple Markov decision process is displayed in figure \ref{fig:intro_mdp}. Like Markov chains, it contains states
$s_0, s_1, s_2$ with $s_0$ being the initial state and $s_2$ being a terminal state. Unlike Markov chains, it also
contains actions \emph{left} and \emph{right} as well as rewards $0$ and $1$. The rewards are written alongside
transition probabilities and can be found on the edges connecting the states. An example using elements of this Markov
decision process is given in chapter \ref{sec:02:distributions}.
\glspl{mdp} share most properties with Markov chains \todo{some source here}, the major difference being the transitions
between states. Whereas in Markov chains the transition depends on the current state only, in Markov decision processes
the transition function $p$ takes actions into account as well. Thus, the transition can be affected, although it
remains stochastic in nature. We assume that the Markov property holds true for Markov decision processes, which means
that these transitions depend on the current state and action only \cite[p.~49]{sutton18}.

Markov decision processes share most properties with Markov chains \todo{some source here}, the major difference being
the transitions between states. Whereas in Markov chains the transition depends on the current state only, in Markov
decision processes the transition function $p$ takes actions into account as well. Thus, the transition can be affected,
although it remains stochastic in nature. We assume that the Markov property holds true for Markov decision processes,
which means that these transitions depend on the current state and action only \cite[p.~49]{sutton18}.
\subsubsection{States, Actions and Rewards}
...
@@ -46,7 +46,7 @@ States, actions and rewards are core concepts of reinforcement learning. States
        an agent's capabilities within this environment. Rewards on the other hand are used to teach the agent.

\paragraph{States.}
We describe the environment using \glspl{state}: Let $S_t \in \mathcal{S}$ denote the state of the environment at time

We describe the environment using \emph{states}: Let $S_t \in \mathcal{S}$ denote the state of the environment at time
step $t$ and let $\mathcal{S}$ denote the finite set of states. A state must contain all details an agent requires to
make well-informed decisions. \todo{citation here probably?}
...
@@ -82,7 +82,7 @@ citation here I guess}
As teachers, rewards are our primary means of communicating with an agent. We use them to inform an agent whether it
achieved its goal or not. As stated in chapter \ref{sec:02:agent_environment}, the agent may receive positive rewards
only for achieving its goal, which raises the issue of delayed rewards. We will remedy this issue in chapter
\ref{sec:02:value_function} by introducing \glspl{return}.
\ref{sec:02:value_function} by introducing the concept of a \emph{return}.
The second component of an observation is the successor state $S_{t+1}$. Note that we define the reward to be
$R_{t+1}$ as it is the reward that is observed with the state $S_{t+1}$. Consequently, there is no reward $R_0$.
...
@@ -95,7 +95,7 @@ the agent scores a point in Pong, we provide a reward $R_{t+1} = 1$ in the next
\includegraphics[width=0.7\textwidth]{02_rl_theory/agent_environment_detailed}

\caption{An agent always observes the current state $S_t$ to decide on an action $A_t$. After it has acted, it
observes the environment again. It obtains a reward $R_{t+1}$ and perceives successor state $S_{t+1}$
\protect\cite{sutton18}}.

\caption{An agent always observes the current state $S_t$ to decide on an action $A_t$. After it has acted, it
observes the environment again. It obtains a reward $R_{t+1}$ and perceives successor state $S_{t+1}$
\protect\cite{sutton18}.}
\label{fig:episode}
\end{figure}
...
@@ -107,7 +107,7 @@ the agent acts again and obtains another observation.
\label{sec:02:distributions}
The dynamics function $p$ and the policy $\pi$ describe the behavior of the environment and the agent. Combined, they
determine the movement \todo{find a better word} through a Markov decision process.
determine the path through the respective Markov decision process.
\paragraph{Dynamics.}
Transitions to successor states and the values of rewards associated with these states are determined by the
...
@@ -126,9 +126,9 @@ We call $p$ the dynamics function or dynamics of the environment. Since $p$ is a
The simple Markov decision process from chapter \ref{sec:02:mdp_def} is shown again in figure \ref{fig:dynamics}. Each
edge describes a transition from one state to a successor state. The reward along the edge and the successor state form
an observation. The probability of the observation is stated with the reward next to its respective edge. We assign a
probability of $0$ to observations not described by an edge. In the displayed \gls{mdp}, the probability of observing
the state $S_{t+1} = s_2$ as well as the reward $R_{t+1} = 1$ given the current state $S_t = s_1$ and the action
$A_t = \text{\emph{left}}$ is $p(s_2, 1 \mid s_1, \text{\emph{left}}) = 0.8$.
probability of $0$ to observations not described by an edge. In the displayed Markov decision process, the probability
of observing the state $S_{t+1} = s_2$ as well as the reward $R_{t+1} = 1$ given the current state $S_t = s_1$ and the
action $A_t = \text{\emph{left}}$ is $p(s_2, 1 \mid s_1, \text{\emph{left}}) = 0.8$.
\begin{figure}[h]
\centering
...
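The lookup described above is straightforward to sketch in code. Only the value $p(s_2, 1 \mid s_1, \text{left}) = 0.8$ is taken from the text; the complementary $0.2$ entry is an assumed placeholder, and all observations without an edge get probability $0$ as stated.

```python
# Dynamics as a table: p[(s, a)][(s_next, r)] = probability.
# Only p(s2, 1 | s1, left) = 0.8 comes from the text; the 0.2 entry is assumed.
p = {
    ("s1", "left"): {("s2", 1): 0.8, ("s0", 0): 0.2},
}

def dynamics(s_next, r, s, a):
    """Return p(s_next, r | s, a); unlisted observations get probability 0."""
    return p.get((s, a), {}).get((s_next, r), 0.0)

print(dynamics("s2", 1, "s1", "left"))  # 0.8
print(dynamics("s2", 0, "s1", "left"))  # 0.0 -- no such edge in the figure
```

Returning $0$ for missing table entries mirrors the convention of assigning probability $0$ to observations not described by an edge.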
@@ -160,7 +160,7 @@ proportional to the probability distribution. For example, an agent following th
When an agent and an environment interact with each other over a series of discrete time steps $t = 0, 1, 2, \dots$, we
can observe a sequence of states, actions and rewards $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$. We call this
sequence the \gls{trajectory}. When we train an agent, we often record trajectories of states, actions and rewards to
sequence the \emph{trajectory}. When we train an agent, we often record trajectories of states, actions and rewards to
use as training data by running the agent in the environment. \todo{citation here?}
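Recording such a trajectory amounts to repeatedly sampling from $\pi$ and $p$. The dynamics and policy below are toy placeholders (only the $0.8$ transition echoes the text's example), with $s_2$ assumed to be terminal.

```python
import random

random.seed(0)  # deterministic sampling for the example

# Toy dynamics p(s', r | s, a) and policy pi(a | s); s2 is terminal.
p = {
    ("s0", "right"): {("s1", 0): 1.0},
    ("s1", "left"):  {("s2", 1): 0.8, ("s0", 0): 0.2},
    ("s1", "right"): {("s2", 0): 1.0},
}
pi = {"s0": {"right": 1.0}, "s1": {"left": 0.5, "right": 0.5}}

def sample(dist):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, probs)[0]

# Record S_0, A_0, R_1, S_1, A_1, R_2, ... until a terminal state or horizon T.
state, trajectory, T = "s0", ["s0"], 100
for t in range(T):
    if state == "s2":  # terminal state reached before the horizon
        break
    action = sample(pi[state])
    state, reward = sample(p[(state, action)])
    trajectory += [action, reward, state]

print(trajectory)  # a finite sequence S_0, A_0, R_1, S_1, ... ending in "s2"
```

Each run of this loop corresponds to one episode; rerunning it without the fixed seed yields a different trajectory, matching the remark below that the horizon $T$ may differ between episodes.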
ATARI 2600 games have a plenitude of \emph{terminal states} that indicate the game has come to an end, for example when
...
@@ -170,7 +170,7 @@ of the loser can have any value between $0$ and $19$. Furthermore, the paddles m
        game, which opens up even more terminal states. \todo{someone check this}
Whenever the agent transitions to a terminal state, it cannot transition to other states anymore. Therefore, there is a
final time step $T$ after which no meaningful information can be gained. We call $T$ the \emph{finite} \gls{horizon}; it
marks the end of an episode. Each episode begins with an initial state $S_0$ and ends once an agent reaches a terminal
state $S_T$. Usually, agents attempt episodic tasks many times with each episode giving rise to a different trajectory;
the horizon $T$ may differ, too.

final time step $T$ after which no meaningful information can be gained. We call $T$ the \emph{finite} \emph{horizon};
it marks the end of an episode. Each episode begins with an initial state $S_0$ and ends once an agent reaches a
terminal state $S_T$. Usually, agents attempt episodic tasks many times with each episode giving rise to a different
trajectory; the horizon $T$ may differ, too.