master-thesis — Commit 3cf06ccb
authored May 31, 2020 by Daniel Lukats
introduction with feedback from anna
parent 3d41c25c
Showing 5 changed files with 50 additions and 38 deletions.
thesis/01_introduction/index.tex
\section{Introduction}
\label{sec:01:introduction}

\todo[inline]{this entire chapter has an issue with the word implementation. fix that word}

\subsection{Motivation}
\label{sec:01:motivation}
\input{01_introduction/motivation}
...
thesis/01_introduction/motivation.tex
\todo[inline]{very rough first motivation; definitely needs to be wordcrafted more}

In recent years, reinforcement learning algorithms have achieved several breakthroughs, such as defeating a world
champion in the game of Go \cite{alphago} or defeating professional players in restricted versions of the video games
StarCraft II \cite{alphastar} and Dota 2 \cite{openaifive}. The latter was achieved using---amongst others---an
algorithm called \emph{Proximal Policy Optimization} \cite{ppo}.
\todo[inline]{a few more citations here}
Proximal Policy Optimization is used in several research topics. On the one hand, researchers look to apply the
algorithm to real-world tasks. For example, Proximal Policy Optimization is used to learn human walking behavior, which
can aid in developing control schemes for prostheses \cite{anand2019}. Furthermore, the algorithm is utilized in
spaceflight, e.g., to train agents for interplanetary transfers \cite{nasa} and autonomous planetary landings
\cite{gaudet2020}. For a final example, researchers applied Proximal Policy Optimization to medical imaging in order to
trace axons, which are microscopic neuronal structures \cite{dai2019}.

On the other hand, researchers combine the algorithm with further concepts. For example, Proximal Policy Optimization
can be used in meta reinforcement learning, which trains a reinforcement learning agent to train other reinforcement
learning agents \cite{liu2019}. Yet another concept that may be combined with the algorithm is curiosity
\cite{pathak2017}. Curiosity is a mechanism that incentivizes a methodical search for an optimal solution rather than a
random search.
Despite its widespread use, researchers found that several undocumented implementation choices have a great effect on
the performance of Proximal Policy Optimization. Consequently, the authors raise doubts about our mathematical
understanding of the foundations of the algorithm \cite{ilyas2018, engstrom2019}.
The goal of this thesis is twofold. On the one hand, it examines implementation and hyperparameter choices of Proximal
Policy Optimization on ATARI 2600 games, which is a typical benchmarking environment. On the other hand, it provides the
required fundamentals of reinforcement learning to understand policy gradient methods and explains Proximal Policy
Optimization, so other students may be introduced to reinforcement learning TODO. In order to gain a thorough
understanding of the algorithm, an implementation was created and published instead of relying on an open source
project.
In more detail, this thesis examines not only the impact of the aforementioned implementation choices, but also common
hyperparameter choices on a selection of five ATARI 2600 games. These video games form a typical benchmarking
environment for evaluating reinforcement learning algorithms. The significance of the implementation choices was already
researched on robotics tasks \cite{ilyas2018, engstrom2019}, but the authors forewent an examination on ATARI games.
Therefore, one may raise the question whether these choices have the same significance for ATARI 2600 games as they do
for robotics tasks.
thesis/01_introduction/notation.tex
The mathematical notation used in publications regarding reinforcement learning can differ greatly. In this thesis, we
adhere to the notation introduced by \citeA{sutton18} and adapt definitions and proofs from other sources accordingly.
As a consequence, interested readers may notice differences between the notation used in chapter \ref{sec:03:ppo} and
the publications of \citeA{gae} and \citeA{ppo}:
\begin{itemize}
    \item The advantage estimator is denoted $\delta$ instead of $\hat{A}$, as both $A$ and $a$ are heavily overloaded
        already.
    \item The likelihood ratio used in chapter \ref{sec:03:ppo_loss} is denoted $\rho_t$ instead of $r_t$ to avoid
        confusion with rewards $R_t$ or $r$. Furthermore, using $\rho$ is consistent with \citeauthor{sutton18}'s
        \citeyear{sutton18} definition of \emph{importance sampling}.
...
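As a point of orientation for the two renamed symbols above: the likelihood ratio $\rho_t$ and the advantage estimator $\delta$ appear together in the clipped surrogate objective of Proximal Policy Optimization. The following is only a sketch in the adopted notation, following \citeA{ppo} with the symbols renamed as described; the precise definitions are given in chapter \ref{sec:03:ppo}.

```latex
% Likelihood ratio between current and old policy (denoted r_t(\theta) by \citeA{ppo}):
\rho_t(\theta) = \frac{\pi_\theta(A_t \mid S_t)}{\pi_{\theta_{\text{old}}}(A_t \mid S_t)}

% Clipped surrogate objective, with the advantage estimator denoted \delta_t:
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\!\left( \rho_t(\theta)\, \delta_t,\;
    \operatorname{clip}\!\left(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \delta_t \right) \right]
```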
thesis/01_introduction/outline.tex
The fundamentals of reinforcement learning are given in chapter \ref{sec:02:basics}. These contain core terms and
definitions required to discuss and construct reinforcement learning algorithms.
Issues with the naive learning approach outlined in chapter \ref{sec:02:basics} are pointed out in chapter
\ref{sec:03:ppo}. This leads to the introduction of advanced estimation methods, which are used in \emph{Proximal Policy
Optimization} (PPO). With these estimation methods PPO is defined and the ramifications of specific operations are
explained. Chapter \ref{sec:03:ppo} closes with an outline of the complete reinforcement learning algorithm.
Undocumented design and implementation choices are elaborated and---if possible---explained in chapter
\ref{sec:04:implementation}. Before these choices are examined on ATARI 2600 games, the benchmarking framework and the
evaluation method are introduced in chapter \ref{sec:05:evaluation}. A discussion of the results completes the chapter.
Finally, we summarize the results of this thesis in chapter \ref{sec:06:conclusion} and discuss possible future work
that builds upon this thesis.
thesis/01_introduction/related_work.tex
Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm and therefore the topic of various
research questions. The publications by \citeA{ilyas2018} and \citeA{engstrom2019} are closest to the topic of this
thesis, as the authors thoroughly research undocumented properties and implementation choices of the algorithm. However,
their work focuses on robotics tasks and foregoes an examination of the modifications on ATARI video games. Furthermore,
the authors assume prior knowledge of reinforcement learning and provide only a short background on policy gradient
methods.
Multiple open source implementations of PPO are available \cite{baselines, jayasiri, kostrikov}. As these publications
mostly consist of source code, no evaluation on various tasks is available for comparison. The notable exception to this
matter is the baselines repository \cite{baselines}, which was published alongside the original publication by
\citeA{ppo}. Among these, the work of \citeA{jayasiri} stands out, as the author added extensive comments elaborating
some concepts and implementation choices. The code created for this thesis differs, as it is suitable for learning ATARI
tasks only, whereas \citeA{baselines} and \citeA{kostrikov} support robotics and control tasks as well.
\citeauthor{jayasiri}'s \citeyear{jayasiri} code is intended for a single ATARI game only.
Lastly, PPO is the topic of master's theses, for example by \citeA{chen2018} and \citeA{gueldenring2019}. The authors'
theses differ from this one, as \citeA{chen2018} researches the application of PPO to engineering tasks, whereas
\citeA{gueldenring2019} examines PPO in the context of navigating a robot. Since both authors focus on application, they
chose to utilize open source implementations of PPO instead of implementing the algorithm themselves.