
Commit b9e224c0 by Daniel Lukats

### large cleanup and feedback inclusion push

parent 23811425
The weight vector $\boldsymbol\omega$ changes the value estimations of multiple states: the estimation of a single state causes changes in the value estimations of other states, too. Thus, increasing the accuracy of some states will decrease the accuracy of other states. In order to alleviate this issue, the error introduced through estimation can be weighted according to the likelihood a state is observed. Let $\mu$ denote the stationary distribution of states, then $\mu_\pi(s)$ is the percentage of time an agent following $\pi$ spends in the state $s$. $\hat{v}_\pi(s, \boldsymbol\omega)$ shall be chosen such that the mean squared value error
\begin{align}
\label{eqn:value_error}
\overline{\text{VE}}(\boldsymbol\omega) \doteq \sum_{s \in \mathcal{S}} \mu_\pi(s) \left[v_\pi(s) - \hat{v}_\pi(s, \boldsymbol\omega)\right]^2
\end{align}
is minimized \cite[p.~199]{sutton18}. As we sample trajectories to form our estimations from, the states we observe

Then, the advantage function is parameterized by inserting the parameterized value function $\hat{v}_\pi$ instead of $v_\pi$:
\begin{align}
\label{eqn:a_hat}
\hat{a}_\pi(s, a, \boldsymbol\omega) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left( r + \gamma \hat{v}_\pi(s', \boldsymbol\omega) \right) \right] - \hat{v}_\pi(s, \boldsymbol\omega) \\
\label{eqn:a_hat2}
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) \mid S_t = s, A_t = a \right] - \hat{v}_\pi(s, \boldsymbol\omega).
\end{align}
We further extend function approximation to the advantage estimator
\begin{align}
\label{eqn:delta2}
\delta_t \doteq R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) - \hat{v}_\pi(S_t, \boldsymbol\omega).
\end{align}
Note that we do not indicate whether the advantage estimator uses the value function $v_\pi$ (like in equation \ref{eqn:delta1}) or a parameterized value function $\hat{v}_\pi$. In subsequent chapters we consistently use function approximation, therefore it can be assumed that advantage estimators rely on parameterized value functions.
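The estimator above can be sketched in a few lines of Python. This is a hedged illustration with assumed toy values, not code from the thesis: the parameterized value function is stood in for by precomputed estimates.

```python
# Sketch of the one-step advantage estimator
# delta_t = R_{t+1} + gamma * v_hat(S_{t+1}) - v_hat(S_t).
# The value function approximator is replaced by assumed, precomputed estimates.
def one_step_advantages(rewards, values, next_values, gamma=0.99):
    """rewards[t] is R_{t+1}; values[t] is v_hat(S_t); next_values[t] is v_hat(S_{t+1})."""
    return [r + gamma * nv - v for r, v, nv in zip(rewards, values, next_values)]

# Two transitions with assumed value estimates and gamma = 1.
deltas = one_step_advantages([1.0, 0.0], [0.5, 0.8], [0.8, 0.0], gamma=1.0)
```

A positive $\delta_t$ indicates the transition went better than the value estimate of $S_t$ suggested, a negative one that it went worse.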
states and rewards as well as the actions agents may take. These requirements are met by Markov decision processes, which we introduce in chapter \ref{sec:02:mdp}. Subsequently, we define a goal the agent shall achieve utilizing the \emph{value function}.

A key issue of reinforcement learning is the balance of \emph{exploration} and \emph{exploitation} \cite[p.~3]{sutton18}. Often we lack full knowledge of the environment; for example, a few frames of video input from a video game do not carry information on the entire internal state of the game. We gather information by interacting with the environment, slowly learning which actions help achieve the goal. As we do not gain full information of a state or action by observing or choosing it once, we have to make a decision: Do we explore more states and actions that we have no information of? Or do we exploit the best known way of achieving the goal, gaining more knowledge on a few select states and actions?
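One simple way to make this explore-or-exploit decision concrete is an $\varepsilon$-greedy rule, which appears again later in this thesis. The sketch below is illustrative only; the action values are assumed:

```python
import random

# Epsilon-greedy action selection: with probability epsilon the agent explores
# a uniformly random action, otherwise it exploits the action with the highest
# estimated value. The action-value list is an assumed placeholder.
def epsilon_greedy(action_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # explore
    return max(range(len(action_values)), key=action_values.__getitem__)  # exploit

action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)  # always exploits -> action 1
```

With $\varepsilon = 1$ the agent explores only; annealing $\varepsilon$ towards $0$ shifts the balance from exploration to exploitation over the course of training.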
\todo[inline]{note that the initial state is usually determined by a start distribution $p_0$ or something like that, but we do not require that in this thesis so it is implicitly included?}
A Markov decision process is a stochastic process that encompasses observations, actions and rewards. These are the core properties needed to train an agent with a reinforcement learning algorithm.

\subsubsection{Definition} \label{sec:02:mdp_def}

A simple Markov decision process is displayed in figure \ref{fig:intro_mdp}. Like Markov chains, it contains states $s_0, s_1, s_2$ with $s_0$ being the initial state and $s_2$ being a terminal state. Unlike Markov chains, it also contains actions \emph{left} and \emph{right} as well as rewards $0$ and $1$. The rewards are written alongside transition probabilities and can be found on the edges connecting the states. Further explanations and an example using elements of this Markov decision process are given in chapter \ref{sec:02:distributions}.

Markov decision processes share most properties with Markov chains\todo{some source here}, the major difference being the transitions between states. Whereas in Markov chains a transition depends on the current state only, in Markov decision processes it depends on the current state and the chosen action, although it remains stochastic in nature. We assume that the Markov property holds, which means that these transitions depend on the current state and action only \cite[p.~49]{sutton18}.

\subsubsection{States, Actions and Rewards} \label{sec:02:mdp_vars}

States, actions and rewards are core concepts of reinforcement learning.
States and actions describe an environment and an agent's capabilities within this environment. Rewards on the other hand are used to teach the agent.

\paragraph{States.} We describe the environment using \emph{states}: Let $S_t \in \mathcal{S}$ denote the state of the environment at time step $t$ and let $\mathcal{S}$ denote the finite set of states \cite[p.~48]{sutton18}. A state must contain all details an agent requires to make well-informed decisions. Figure \ref{fig:states} displays screenshots from two ATARI 2600 games, which may be used as states. Pending several post-processing steps, we work with $\mathcal{S} \subseteq [0, 1]^{4 \times 84 \times 84}$ for all ATARI 2600 games, as we use a sequence of 4 consecutive grayscale images resized to $84 \times 84$ pixels.

\paragraph{Actions.} At each time step $t$ our agent takes an \emph{action} depending on the state $S_t$ of the environment, which causes a transition to one of several successor states. Let $A_t \in \mathcal{A}(s),\;S_t = s$, denote the action at time step $t$ and let $\mathcal{A}(s)$ denote the set of actions available in the state $s$ \cite[p.~48]{sutton18}. The actions available to an agent depend on the environment and the state. An agent learning SpaceInvaders effectively chooses from $\mathcal{A}(s) = \{\emph{\text{noop}}, \emph{\text{fire}}, \emph{\text{left}}, \emph{\text{right}}\}$ for every time step. If the actions do not differ between states, we write $\mathcal{A}$ to denote the set of actions.

\paragraph{Rewards.} Every time the agent completes an action, it observes the environment. These \emph{observations} consist of two components.
The first component of an observation is the \emph{reward}. Let $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ denote the reward observed at time step $t + 1$ and let $\mathcal{R}$ denote the set of rewards \cite[p.~48]{sutton18}. As teachers, rewards are our primary means of communicating with an agent. We use them to inform an agent whether it achieved its goal or not. As stated in chapter \ref{sec:02:agent_environment}, the agent may receive positive rewards

\paragraph{Dynamics.} Transitions to successor states and the values of rewards associated with these states are determined by the \emph{dynamics} of the environment. Let the successor state $S_{t+1} = s'$ and the reward $R_{t+1} = r$ be an observation. Its likelihood given the current state $S_t = s$ and the agent's action $A_t = a$ is determined by the following probability distribution \cite[p.~48]{sutton18}:
\begin{align}
\label{eqn:dynamics}
&p(s', r\mid s, a) \doteq P(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a), \\
&p: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]. \nonumber
\end{align}

Each edge describes a transition from one state to a successor state. The reward along an edge, together with the successor state, forms an observation. The probability of the observation is stated with the reward next to its respective edge. We assign a probability of $0$ to observations not described by an edge.
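Such a dynamics function can be sketched as a lookup table over observations. In the sketch below only the $0.8$ transition is taken from the example Markov decision process of this chapter; the complementary $0.2$ transition is an assumed value used purely for illustration:

```python
# Hypothetical encoding of the dynamics p(s', r | s, a) as a lookup table.
# Keys are (state, action) pairs; values map observations (s', r) to
# probabilities. Observations not listed have probability 0.
dynamics = {
    ("s1", "right"): {("s2", 1): 0.8, ("s1", 0): 0.2},  # 0.2 branch is assumed
}

def p(s_next, r, s, a):
    """Return p(s', r | s, a); unlisted observations have probability 0."""
    return dynamics.get((s, a), {}).get((s_next, r), 0.0)
```

For each state-action pair, the listed probabilities form a probability distribution over observations and therefore sum to $1$.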
In the displayed Markov decision process, the probability of observing the state $S_{t+1} = s_2$ as well as the reward $R_{t+1} = 1$ given the current state $S_t = s_1$ and the action $A_t = \text{\emph{right}}$ is $p(s_2, 1 \mid s_1, \text{\emph{right}}) = 0.8$.

\begin{figure}[h]
\centering
\end{figure}

\paragraph{Policy.} An agent's actions are determined by the \emph{policy} function, which \citeA[p.~58]{sutton18} define as follows:
\begin{align}
\label{eqn:policy}
&\pi(a \mid s) \doteq P(A_t = a \mid S_t = s),\\
&\pi: \mathcal{A} \times \mathcal{S} \rightarrow [0, 1], \nonumber
\end{align}

proportional to the probability distribution. For example, an agent following the policy chooses \emph{noop} twice as often as \emph{fire} on average.

\subsubsection{Finite Horizon and Episodes} \label{sec:02:episodes}

When an agent and an environment interact with each other over a series of discrete time steps $t = 0, 1, 2, \dots$, we can observe a sequence of states, actions and rewards $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$. We call this sequence the \emph{trajectory} \cite[p.~48]{sutton18}. When we train an agent, we often record trajectories of states, actions and rewards to use as training data by running the agent in the environment. ATARI 2600 games have a plenitude of \emph{terminal states} that indicate the game has come to an end, $\#\{s \mid s \in \mathcal{S} \land s \text{ is terminal}\} \gg 1$. For example, in Pong the game ends once either player scored 21 points, whereas the loser's score can have any value in the range of 0 to 20. Furthermore, the paddles may be in any position at the end of the game, which opens up even more terminal states. Whenever the agent transitions to a terminal state, it cannot transition to other states anymore. Therefore, there is a final time step $T$ after which no meaningful information can be gained. We call $T$ the \emph{finite horizon};

Just like the value function, the policy too can be parameterized when the state space is large. Let $\boldsymbol\theta \in \mathbb{R}^{d'},\;d' \ll \#\mathcal{S}$, denote the policy's parameter vector. Then the parameterized policy
\begin{align}
\pi(a \mid s, \boldsymbol\theta) \doteq P(A_t = a \mid S_t = s, \boldsymbol\theta_t = \boldsymbol\theta)
\end{align}
returns the probability of taking the action $a$ given the state $s$ and the parameter vector $\boldsymbol\theta$ \cite[p.~321]{sutton18}. Often we write $\pi_{\boldsymbol\theta}$ instead of $\pi(a \mid s, \boldsymbol\theta)$.
Often $\boldsymbol\theta$ is the parameterization of a function that maps a state to a vector of numerical weights. Each action corresponds to a weight that indicates the likelihood of picking the respective action. The higher the weight, the more likely picking its respective action should be. The policy $\pi$ uses the numerical weights to determine a probability distribution \cite[p.~322]{sutton18}, e.g., by applying a softmax function \cite[pp.~184--185]{goodfellow2016}. The method used to determine the probability distribution remains fixed. Instead we optimize the parameterization $\boldsymbol\theta$ so that the likelihood of choosing advantageous actions is increased. Let $S_t = s$ and $\mathcal{A}(s) = \{1, 2\}$.
A function $h$ parameterized with $\boldsymbol\theta_t$ could return a vector with two elements, for example $h(s, \boldsymbol\theta_t) = \begin{pmatrix}4\\2\end{pmatrix}$.\footnote{We write $\left(h(s, \boldsymbol\theta_t)\right)_a$ to obtain the $a$th component of $h(s, \boldsymbol\theta_t)$, e.g., $\begin{pmatrix}4\\2\end{pmatrix}_1 = 4$.} The elements of this vector are numerical weights a policy could use to determine probabilities:
\begin{align*}
\pi(a\mid s, \boldsymbol\theta_t) &= \left(h(s, \boldsymbol\theta_t)\right)_a \cdot \left(\sum_{b\in\mathcal{A}(s)}\left(h(s, \boldsymbol\theta_t)\right)_b\right)^{-1}\text{, e.g.,} \\
\pi(1\mid s, \boldsymbol\theta_t) &= \begin{pmatrix}4\\2\end{pmatrix}_1 \cdot \left(\begin{pmatrix}4\\2\end{pmatrix}_1 + \begin{pmatrix}4\\2\end{pmatrix}_2\right)^{-1} = \frac{4}{4 + 2} = \frac{2}{3}.
\end{align*}

With full information of the environment, we may finally search for the optimal policy.

\subsubsection{Gradient Ascent} \label{sec:02:gradient_ascent}

Policy gradient algorithms seek to optimize the policy directly by performing stochastic gradient ascent using $\nabla_{\boldsymbol\theta} \pi(a \mid s, \boldsymbol\theta)$ to maximize a fitness function $J(\boldsymbol\theta)$.

A value is assigned to the learn rate $\alpha$, but no further hyperparameters are required for a smooth transition to exploitation.\footnote{In practice a number of other parameters are required to prevent forgetting and performance collapses (cf.~chapter \ref{sec:03:ppo_motivation}).} Secondly, parameterized policies enable us to assign arbitrary probabilities to actions.
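The worked weight example from this section can be sketched as follows. Here `h` stands in for any parameterized weight function (the dictionary-backed `theta` is an assumption made purely for illustration), a softmax variant is included for comparison, and Python indexes actions from $0$ rather than $1$:

```python
import math

# h is a stand-in for a parameterized function mapping a state to action
# weights; in practice this could be a neural network.
def h(state, theta):
    return theta[state]  # assumed: theta maps each state to a weight vector

# Policy that normalizes weights proportionally, as in the worked example.
def pi_proportional(a, s, theta):
    weights = h(s, theta)
    return weights[a] / sum(weights)

# Alternative: a softmax turns arbitrary weights into probabilities.
def pi_softmax(a, s, theta):
    exps = [math.exp(w) for w in h(s, theta)]
    return exps[a] / sum(exps)

theta = {"s": [4.0, 2.0]}
prob = pi_proportional(0, "s", theta)  # 4 / (4 + 2) = 2/3
```

Either way, the mapping from weights to probabilities stays fixed while the parameterization is optimized.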
This gives us the opportunity to discover stochastic approximates of optimal policies.\footnote{A policy that is not capable of doing so is the $\varepsilon$-greedy policy, which chooses the best available action with a probability of $1 - \varepsilon$ and a random one otherwise \cite[p.~322]{sutton18}. $\varepsilon$ typically converges to $0$ over the course of the training, resulting in a deterministic policy.} \citeA[p.~323]{sutton18} give an example of a Markov decision process that can only be solved by a stochastic policy.

\citeA[p.~324]{sutton18} point out that combining a parameterized policy with function approximation may pose a challenge: the performances of both $\pi_{\boldsymbol\theta}$ and $\hat{v}(s, \boldsymbol\omega)$ depend on the action selections and on the distribution of the states $\mu_{\pi_{\boldsymbol\theta}}(s)$ that these actions are selected in. Adjusting the policy results in different action choices, which in turn changes $\mu_{\pi_{\boldsymbol\theta}}(s)$.

Using the advantage function is beneficial, as the policy gradient in equation \ref{eqn:advantage_gradient} has a lower variance than the gradient in equation \ref{eqn:q_gradient} \cite{gae}. Similar to $\overline{\text{VE}}$, the gradient is weighted according to the distribution of states $\mu$, but it does not depend on its derivative. By virtue of weighting with $\mu$, states an agent encounters regularly have a greater effect on the gradient.
Moreover, the magnitude of the gradient is controlled by the advantage of an action: A large advantage mandates a large gradient step.

For any given time step $t < T$, an agent aims to maximize the remaining sequence of rewards $R_{t+1}, R_{t+2}, \dots, R_T$. Using this sequence we define the return
\begin{align}
\label{eqn:return}
G_t &\doteq \sum_{k=t+1}^T \gamma^{k-t-1} R_k,\; 0 \leq \gamma \leq 1.
\end{align}

states based on the returns we may obtain by using the value function
\begin{align}
\label{eqn:value_function}
v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right]
\end{align}
with $\mathbb{E}_\pi$ denoting the expected value when following the policy $\pi$. The higher the value of a state, the higher the return an agent is expected to obtain. As $v_\pi(s)$ depends on the policy $\pi$, we must devise a method to optimize the policy an agent follows. However, to do so we first require the advantage function.

We call the policy that yields the maximum return the optimal policy $\pi_*$ and the corresponding value function the optimal value function $v_*$. Assuming an optimal policy for the ATARI 2600 game Pong, $v_*(S_0) = 21$ with $\gamma = 1$, as the game ends after either player scored 21 times. Furthermore, an optimal Pong player always deflects the ball.

\subsubsection{Advantage Function} \label{sec:02:advantage}

The value of a state depends on the expected behavior of an agent following the policy $\pi$. In order to increase the probability of obtaining a high return, we require a means to identify actions that prove advantageous or disadvantageous in this endeavor.
The required functionality is provided by the \emph{advantage function} \cite{baird1993, sutton2000}\footnote{The advantage function is commonly defined using the $q$ function: $a_\pi(s, a) \doteq q_\pi(s,a) - v_\pi(s)$. The definition we provide is equal, as it merely substitutes $q_\pi(s,a)$ with the term it is defined to be.}
\begin{align}
\label{eqn:advantage_function}
a_\pi(s, a) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left(r + \gamma v_\pi(s') \right)\right] - v_\pi(s) \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] - v_\pi(s).
\end{align}
The advantage function compares the expected immediate reward and return of a successor state $r + \gamma v_\pi(s')$ with the expected return of the current state $v_\pi(s)$. As the environment may be non-deterministic, the former term has to be weighted with its probability. If $a_\pi(s, a) = 0$, which means $\sum_{s', r}\left[ p(s', r \mid s, a) \cdot (r + \gamma v_\pi(s')) \right] = v_\pi(s)$, choosing the action $a$ more frequently does not improve the policy.
However, if $a_\pi(s, a) > 0$, the given action yields larger returns than the actions commonly chosen under $\pi$; increasing the likelihood of choosing the action $a$ improves the policy. Actions that return a negative advantage---$a_\pi(s, a) < 0$---lead to worse results than simply following $\pi$ would. If $a_\pi(s, a) \le 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}$, the policy is optimal.

According to \citeA{gae}, a common approximation of the advantage function is provided by the estimator
\begin{align}
\label{eqn:delta1}
\delta_t \doteq R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t).
\end{align}

More precisely, each training iteration consists of $K$ epochs. In each epoch, the entire data set is split into randomly drawn disjunct minibatches of size $M$.\footnote{For a detailed explanation of minibatch gradient methods, refer to the work of \citeA[chapter 8.1.3]{goodfellow2016}.} This approach alleviates the issue of learning on highly correlated observations as observed by \citeA{dqn}. After each epoch, the learn rate $\alpha$ and the clipping parameter $\epsilon$ are decreased, so they linearly scale from their initial values to 0 over the course of the training.
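The linear decay of $\alpha$ and $\epsilon$ can be sketched as follows; the initial values and epoch count are assumed for illustration and are not taken from the thesis:

```python
# Linear annealing: a hyperparameter scales linearly from its initial value
# to 0 over the course of the training.
def linear_anneal(initial, step, total_steps):
    return initial * (1.0 - step / total_steps)

alpha_0, epsilon_0, total_epochs = 2.5e-4, 0.1, 100          # assumed values
alpha_mid = linear_anneal(alpha_0, 50, total_epochs)         # half the initial learn rate
epsilon_final = linear_anneal(epsilon_0, 100, total_epochs)  # reaches 0 at the end
```

The same schedule shape serves both hyperparameters; only the initial value differs.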
We further ensure that we learn on sufficiently independent observations by running $N$ \emph{actors} at the same time.\footnote{\emph{Actor} is the technical term used to describe agents that are run in parallel. Often, these agents follow the same policy $\pi_{\boldsymbol\theta}$, as is the case with Proximal Policy Optimization.} Although each actor is embedded in its own environment, the $N$ environments are described by the same Markov decision process \cite{a3c}. However, the environments are independent and therefore give rise to different trajectories. The observations of all actors are then combined to optimize the policy $\pi_{\boldsymbol\theta}$ that all $N$ actors follow.

PPO---like many other reinforcement learning algorithms---operates in two phases. In the first phase, an agent interacts with its environment and generates a \emph{rollout} $\tau$. $\tau$ contains not only the states and rewards the agent observed, but also the chosen actions, their respective probabilities and the values of the states. This phase can be seen in lines 5--17 of algorithm \ref{alg:ppo}. In the second phase, the value function approximation and the policy are optimized with the data the agent collected. These two steps are repeated for a certain time or until the value function and policy converge to their respective optimal functions. Afterwards, the learn rate $\alpha$ and the clipping $\epsilon$ are decreased, so they linearly approach $0$ over the course of the training; the trust region shrinks. Lines 18--25 of algorithm \ref{alg:ppo} show this phase.

\begin{algorithm}[ht]
\caption{Proximal Policy Optimization, modified from \protect\citeauthor{ppo}'s \protect\citeyear{ppo} and

rollout---an episode might terminate earlier or it might not terminate at all.
We use the value estimation $\hat{v}_\pi(S_T, \boldsymbol\omega)$ of the final state included in the rollout.
\item If an episode terminates early, we only include rewards and values up until the episode terminated. Let $T_\text{episode}$ denote this time step. Then all advantage and return estimations for time steps $t \le T_\text{episode}$ use the rollout time step that corresponds with $T_\text{episode}$ as the upper bound of summation (cf.~equations \ref{eqn:gae} and \ref{eqn:lambda_return}).
\end{itemize}

Both the loss described in equation \ref{eqn:naive_loss} and the loss proposed by … require advantage estimations. However, to estimate advantages, value estimations are required. By definition (cf.~equation \ref{eqn:value_function}), the value function can be estimated using the return. Both the advantage estimator (cf.~equation \ref{eqn:delta2}) and the return (cf.~equation \ref{eqn:return}) are suboptimal, as they suffer from being biased and having high variance, respectively. We further explain these issues and introduce advanced methods that alleviate them in this chapter.
\todo[inline]{insert overview of Loss using advantage using value using return using rewards}

\subsubsection{Generalized Advantage Estimation} \label{sec:03:gae_gae}

In chapter \ref{sec:02:advantage} we introduced the advantage function $a_\pi(s, a)$ and an estimator $\delta$ that we combined with function approximation (cf.~equations \ref{eqn:a_hat} and \ref{eqn:delta2} in chapter \ref{sec:02:function_approximation}):
\begin{align*}
\hat{a}_\pi(s, a, \boldsymbol\omega) &\doteq \mathbb{E}_\pi\left[R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) \mid S_t = s, A_t = a \right] - \hat{v}_\pi(s, \boldsymbol\omega),\\
\delta_t &\doteq R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) - \hat{v}_\pi(S_t, \boldsymbol\omega).
\end{align*}

commonly used with finite-horizon Markov decision processes, too. In the above definition, this comes at the cost of adding minor inaccuracy. The inaccuracy grows larger as the time step $t$ approaches the horizon $T$.

\subsubsection{$\lambda$-return} \label{sec:03:lambda_return}

The same approach can be applied to return estimation. Remember that we defined the return to be
\begin{align*}
G_t &\doteq \sum_{k=t+1}^T \gamma^{k-t-1} R_k,\; 0 \leq \gamma \leq 1.
\end{align*}

Deep learning methods achieved success on a variety of tasks, such as computer vision and speech recognition \cite[chapter 1.2.4]{goodfellow2016}.
For this reason, \citeA{dqn} combined reinforcement learning techniques and deep learning methods in an algorithm called DQN. The benefits of using neural networks to parameterize the value function and---occasionally---the policy were demonstrated by DQN and further \emph{deep reinforcement learning} algorithms, e.g., by A3C \cite{a3c} and Rainbow DQN \cite{rainbow}.

The algorithm introduced in this chapter is called \emph{Proximal Policy Optimization} (PPO) \cite{ppo}. Because PPO is a deep reinforcement learning algorithm, we adjust notation and replace our gradient estimator $\hat{g}$. Instead we utilize a loss $\mathcal{L}$, as is common in deep learning \cite[chapter 4.3]{goodfellow2016}:
\begin{align}
\label{eqn:naive_loss}
\mathcal{L}(\boldsymbol\theta) \doteq \mathbb{E}_{a,s\sim\pi_{\boldsymbol\theta}}\left[\hat{a}_{\pi_{\boldsymbol\theta}}(s, a, \boldsymbol\omega)\pi_{\boldsymbol\theta}(a\mid s)\right]
\end{align}
The gradient estimator $\hat{g}$ can be obtained by deriving the loss in equation \ref{eqn:naive_loss}. Unlike in deep learning, a loss notation is also used to denote objectives that shall be maximized rather than minimized. Therefore, it depends on the specific objective whether gradient ascent or gradient descent is performed.

We begin by highlighting issues with the gradient estimator from equation \ref{eqn:gradient_estimator} and motivate the use of advanced algorithms such as Proximal Policy Optimization. Then we introduce sophisticated advantage and return estimators as used in many modern policy gradient methods. Afterwards the loss is introduced and explained. We close by providing the complete Proximal Policy Optimization algorithm.
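The expectation in the loss $\mathcal{L}(\boldsymbol\theta)$ can be estimated by averaging over sampled state-action pairs. The sketch below is illustrative only; the advantage estimations and action probabilities are assumed placeholder values, not the output of an actual agent:

```python
# Sample-based estimate of a loss of the form E[a_hat * pi(a | s)]:
# average the advantage-weighted action probabilities over sampled pairs.
def naive_loss(advantages, action_probs):
    """advantages[i] plays the role of a_hat(s_i, a_i, omega);
    action_probs[i] plays the role of pi_theta(a_i | s_i)."""
    samples = [a_hat * p for a_hat, p in zip(advantages, action_probs)]
    return sum(samples) / len(samples)

loss = naive_loss([1.5, -0.5], [0.6, 0.2])  # (0.9 - 0.1) / 2, approximately 0.4
```

Because this objective is maximized, a gradient step on it is an ascent step, matching the remark above that the direction depends on the specific objective.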
\todo[inline]{short intro sentence goes here}
Proximal Policy Optimization consists of two components: a loss and an algorithm that incorporates this loss. In the following sections we introduce and explain the clipped Proximal Policy Optimization loss. Afterwards, we reveal the learning algorithm that uses this loss to optimize a policy.

\subsubsection{Background}
... ...
However, this algorithm can only be applied to mixed policies of the form
\begin{align*}
\pi(a \mid s) = (1 - \alpha) \cdot \pi^{(0)}(a \mid s) + \alpha \cdot \pi^{(1)}(a \mid s),
\end{align*}
where $\pi^{(0)}$ and $\pi^{(1)}$ are different policies and $\alpha < 1$.
\citeA{trpo} prove that this method can be applied to stochastic policies---as defined in equation \ref{eqn:policy}---by changing the lower bound to a \emph{Kullback-Leibler} (KL) \emph{divergence}. The KL divergence can be used to assess how much two probability distributions deviate from one another \cite[pp.~74--75]{goodfellow2016}. They further transform the loss to a likelihood ratio of the candidate for the new policy and the current policy (similar to equation \ref{eqn:likelihood_ratio}) that is penalized using the KL divergence. This way they incentivize finding a new policy that is close to the current policy.
... ...
As the new loss still imposes a lower bound on the fitness $J(\boldsymbol\theta)$, it is a minorant to the fitness function. The authors prove that maximizing this minorant guarantees a monotonically rising fitness $J(\boldsymbol\theta)$; given sufficient optimization steps the algorithm converges to a local optimum.\footnote{Algorithms like this one are called MM algorithms.
This one is a \emph{minorize maximization} algorithm \cite{hunter2004}.} The ensuing objective is called a \emph{surrogate} objective.

The mathematically proven algorithm is computationally demanding, as it needs to compute the KL divergence on all states in the state space for each optimization step. Hence, \citeA{trpo} perform multiple approximations. The resulting
... ...
Given the likelihood ratio $\rho_t(\boldsymbol\theta)$, we determine if the probability of taking the action increased or decreased. For example, if $\rho_t(\boldsymbol\theta) > 1$, the action is more likely under $\pi_{\boldsymbol\theta}$ than it is under $\pi_{\boldsymbol\theta_\text{old}}$. We note that $\rho_t(\boldsymbol\theta_\text{old}) = 1$. A loss derived from \emph{Conservative Policy Iteration} is constructed by multiplying the likelihood ratio and advantage estimations \cite{ppo}:
\begin{align}
\label{eqn:loss_cpi}
\mathcal{L}^\text{CPI}(\boldsymbol\theta) \doteq \mathbb{E}_{a,s\sim\pi_{\boldsymbol\theta_\text{old}}} \left[
... ...
This issue is solved by taking an elementwise minimum:
... ...
\right].
\end{align}
In practice, the advantage function $\hat{a}_{\pi_{\boldsymbol\theta_\text{old}}}$ is replaced with Generalized Advantage Estimations $\delta_t^{\text{GAE}(\gamma,\lambda)}$ (cf.~equation \ref{eqn:gae}). Often, $\mathcal{L}^\text{CLIP}(\boldsymbol\theta)$ is called a surrogate objective, as it approximates the original objective $J(\boldsymbol\theta)$ but is not identical.

Figure \ref{fig:min_clipping} compares $\text{clip}(\rho_t(\boldsymbol\theta), \epsilon) \cdot \delta$ and $\mathcal{L}^\text{CLIP}$. Using an elementwise minimum has the following effect:
\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{03_ppo/clipping.pdf}
\caption{The loss for an action with a positive advantage can be seen on the left side, whereas the loss for an action with a negative advantage is shown on the right side. By using a minimum in $\mathcal{L}^\text{CLIP}$, we ensure that we can correct errors we made in previous gradient steps: We can raise the probability of actions with positive advantage even when $\rho_t(\boldsymbol\theta) \le 1 - \epsilon$. Vice versa, we can decrease the probability of
... ...

\subsubsection{Value Function Loss}
\label{sec:03:loss_value}
In order to determine advantages, we need to know the value function and therefore require a means to learn it. As we perform function approximation, the value function is parameterized by $\boldsymbol\omega$, a neural network. To learn reliable value estimations, we perform stochastic gradient descent
\begin{align*}
... ...
We estimate the value function by recording a trajectory and calculating $\lambda$-returns for every state encountered. Then, the loss is the mean squared error of value estimations of $\boldsymbol\omega_t$ and the observed $\lambda$-returns \cite{ppo}:
\begin{align}
\label{eqn:lvf}
\mathcal{L}^\text{VF}(\boldsymbol\omega) \doteq \frac{1}{2} \cdot \mathbb{E}_{s,G\sim\pi_{\boldsymbol\theta_\text{old}}}\left[(\hat{v}_{\pi_{\boldsymbol\theta}}(s, \boldsymbol\omega) - G)^2\right],
\end{align}
with $G$ being $\lambda$-returns calculated from rewards observed by an agent following $\pi_{\boldsymbol\theta_\text{old}}$ (cf.~equation \ref{eqn:lambda_return}).
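To make the policy loss $\mathcal{L}^\text{CLIP}$ and the value function loss $\mathcal{L}^\text{VF}$ concrete, the following is a minimal numerical sketch (not part of the thesis implementation). The likelihood ratios, advantage estimates, value estimates and $\lambda$-returns are assumed to be precomputed, and $\epsilon = 0.2$ is an illustrative choice:

```python
import numpy as np

def clipped_policy_loss(ratios, advantages, epsilon=0.2):
    """Sample estimate of the clipped surrogate objective: the
    elementwise minimum of the unclipped and the clipped term."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(np.minimum(ratios * advantages, clipped * advantages))

def value_function_loss(values, lambda_returns):
    """Sample estimate of the value loss: half the mean squared error
    between value estimates and observed lambda-returns."""
    values = np.asarray(values, dtype=float)
    lambda_returns = np.asarray(lambda_returns, dtype=float)
    return 0.5 * np.mean((values - lambda_returns) ** 2)

# A step whose ratio grew beyond 1 + epsilon contributes only the
# clipped term when its advantage is positive.
print(clipped_policy_loss([1.5, 0.9], [1.0, -1.0]))
print(value_function_loss([0.5, 1.0], [1.0, 1.0]))
```

Note that the clipping alone would not suffice: the elementwise minimum is what allows errors from earlier gradient steps to be corrected.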
\subsubsection{Shared Parameterization Loss}
\label{sec:03:shared_loss}
When combining function approximation with policy parameterization, the value function and the policy may share the same parameters: $\hat{v}(s, \boldsymbol\theta)$ and $\pi(a\mid s, \boldsymbol\theta)$ share the same neural network architecture and weights with differing output layers (cf.~chapter \ref{sec:04:architecture}). We choose to share parameters because we require less computation power, as only one neural network needs to be trained and executed.

In this case, the loss must contain a loss for the policy and a loss for the value function. Since we perform gradient ascent on the policy loss but gradient descent on the value function loss, $\mathcal{L}^\text{VF}$ is subtracted from $\mathcal{L}^\text{CLIP}$. \citeA{ppo} propose the following loss:
\begin{align}
\label{eqn:lshared}
\mathcal{L}^{\text{CLIP}+\text{VF}+S}(\boldsymbol\theta) \doteq \mathcal{L}^\text{CLIP}(\boldsymbol\theta) - c_1 \cdot \mathcal{L}^\text{VF}(\boldsymbol\theta) + c_2 \cdot \mathbb{E}_{s\sim\pi_{\boldsymbol\theta}}\left[S\left[\pi_{\boldsymbol\theta}\right](s)\right],
\end{align}
with $c_1$ and $c_2$ being hyperparameters. $c_1$ controls the impact of the value function loss $\mathcal{L}^\text{VF}$, whereas $c_2$ adjusts the impact of the \emph{entropy bonus} $S$. $S$ denotes an entropy bonus encouraging exploration, which addresses an issue raised by \citeA{kakade2002}: policy gradient methods commonly transition to exploitation too early, resulting in suboptimal policies. The closer the distribution $\pi$ is to a uniform distribution, the larger its entropy is. If all actions are assigned the same probability, an agent following $\pi$ explores properly.
An agent following a deterministic policy does not explore and the entropy of this policy will be $0$. Hence, the entropy bonus naturally declines over the course of the training as the policy transitions from exploration to exploitation and approaches a (locally) optimal policy. A common choice for $S$ is to determine the mean entropy of $\pi_{\boldsymbol\theta}$ over all observed states.
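The entropy bonus and the combined loss can be sketched as follows; this is an illustration rather than the thesis implementation, and the coefficient defaults $c_1 = 0.5$ and $c_2 = 0.01$ are hypothetical values, not ones prescribed by the text:

```python
import numpy as np

def entropy(probs):
    """Entropy of a single categorical action distribution pi(. | s).
    Actions with probability zero contribute nothing to the sum."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return -np.sum(nonzero * np.log(nonzero))

def entropy_bonus(per_state_probs):
    """Mean entropy of the policy over all observed states."""
    return np.mean([entropy(p) for p in per_state_probs])

def shared_loss(l_clip, l_vf, s, c1=0.5, c2=0.01):
    """Combined loss: L^CLIP - c1 * L^VF + c2 * S."""
    return l_clip - c1 * l_vf + c2 * s

# The entropy is maximal for a uniform distribution and zero for a
# deterministic policy.
uniform = [0.25, 0.25, 0.25, 0.25]
deterministic = [1.0, 0.0, 0.0, 0.0]
print(entropy(uniform), entropy(deterministic))
print(entropy_bonus([uniform, deterministic]))
```

As training progresses and the policy concentrates on fewer actions, the bonus shrinks on its own, matching the decline described above.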
Policy gradient estimators such as the one introduced in equation \ref{eqn:gradient_estimator} in chapter \ref{sec:02:policy_gradients} or REINFORCE\footnote{A well-known reinforcement learning algorithm that is a simple improvement over the estimator we introduced.} by \citeA{reinforce} suffer from a significant drawback. As \citeA{kakade2002} demonstrate, some tasks require that we record long trajectories to guarantee a policy is improved when performing stochastic gradient descent. The longer the trajectories are, the higher their variance grows, as both the dynamics $p$ of the environment and the policy $\pi$ introduce non-deterministic behavior. As the variance of the trajectories grows, the quality of the gradient estimator declines. Performing stochastic gradient ascent with this estimator is no longer guaranteed to yield an improved policy---in fact the performance of the
... ...
Since Proximal Policy Optimization executes $N$ actors simultaneously, the runtime of a training can be greatly reduced on multicore processors by parallelizing the execution. In this thesis, each actor operates in a dedicated process. They are coordinated by a parent process that gathers the observations and optimizes the policy and the value function.

The child processes communicate with the parent process through pipes. The actors transmit observations consisting of their environment's current state, the last reward and information on the remaining lives to the parent process. Afterwards, they await an action to perform.
\todo[inline]{maybe try a shared memory setup}
The parent process determines each actor's action by sampling from the policy using the respective environment's state. It creates rollouts from the actors' observations, the actions and their respective probabilities as well as value estimations. After rollout generation has concluded, the parent process computes the loss and performs stochastic gradient descent.
... ...
Some games like Breakout provide the player with several attempts, often called
... ...
they may resume the game, usually at the cost of a reduced score. \citeA{nature_dqn} propose ending a training episode once a life is lost. As PPO operates on a steady number of samples, this approach is modified slightly. Instead of ending the episode, we simulate the end of the game. Thus, the return and advantage calculation are cut off at this time step. Finally, we reset the observation stack as detailed in the operation \emph{observation stacking}.

This post-processing operation is applied to some of the games chosen for this thesis only (see chapter \ref{sec:05:ale} for more information on the selected games). BeamRider, Breakout, Seaquest and SpaceInvaders provide
... ...
repository reveals that rewards are binned \cite[baselines/common/atari\_wrapper
\begin{align}
\phi_r(r) \doteq \sign r
\end{align}
We discuss both choices in chapter \ref{sec:05:discussion_optims}.

\paragraph{Observation stacking.}
Finally, the four most recent maximized images seen by an agent are combined to a tensor with shape $4 \times 84 \times
... ...

Although it is neither mentioned nor motivated by \citeA{ppo}, clipping is applied to the value function loss as well \cite{ilyas2018}. Let $\text{clip}_v$ denote a clipping function similar to equation \ref{eqn:clip}:
\begin{align}
\label{eqn:value_clip}
\text{clip}_v(\boldsymbol\omega, \boldsymbol\omega_\text{old}, \epsilon, S_t) \doteq \begin{cases}
\hat{v}_\pi(S_t, \boldsymbol\omega_\text{old}) - \epsilon &\text{if } \hat{v}_\pi(S_t, \boldsymbol\omega) \le \hat{v}_\pi(S_t, \boldsymbol\omega_\text{old}) - \epsilon,\\
\hat{v}_\pi(S_t, \boldsymbol\omega_\text{old}) + \epsilon &\text{if } \hat{v}_\pi(S_t, \boldsymbol\omega) \ge \hat{v}_\pi(S_t, \boldsymbol\omega_\text{old}) + \epsilon,\\
\hat{v}_\pi(S_t, \boldsymbol\omega) &\text{otherwise.}
\end{cases}
\end{align}
Then the clipped value function loss $\mathcal{L}^\text{VFCLIP}$ is defined to be
\begin{align}
\label{eqn:lvfclip}
\mathcal{L}^\text{VFCLIP}(\boldsymbol\omega) \doteq \max \left[ \mathcal{L}^\text{VF}(\boldsymbol\omega),\mathbb{E}_{s, G\sim\pi}\left[ \frac{1}{2} \cdot \left(\text{clip}_v(\boldsymbol\omega, \boldsymbol\omega_\text{old}, \epsilon, s) - G\right)^2 \right] \right].
\end{align}
$\epsilon$ is the same hyperparameter that is used to clip the likelihood ratio in $\mathcal{L}^\text{CLIP}$ (cf.~equation \ref{eqn:lclip} in chapter \ref{sec:03:policy_loss}). Intuitively, this approach may be similar to clipping the probability ratio. To avoid gradient collapses, a trust region is created with the clipping parameter $\epsilon$. Then an elementwise maximum is taken, so errors from previous gradient steps can be corrected. A maximum is applied instead of a minimum because the value function loss is minimized. Further analysis on the ramifications of optimizing a surrogate loss for the value function is provided by \citeA{ilyas2020}.
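Implementations commonly compute this loss with an elementwise maximum over the batch, analogous to the elementwise minimum in $\mathcal{L}^\text{CLIP}$. The following sketch assumes that elementwise variant (and an illustrative $\epsilon = 0.2$), so it approximates rather than reproduces the definition above:

```python
import numpy as np

def clipped_value_loss(values, old_values, returns, epsilon=0.2):
    """Sketch of a clipped value function loss: the value estimate is
    clipped to stay within epsilon of the old estimate, and an
    elementwise maximum of the two squared errors is taken, because the
    value loss is minimized rather than maximized."""
    values = np.asarray(values, dtype=float)
    old_values = np.asarray(old_values, dtype=float)
    returns = np.asarray(returns, dtype=float)
    clipped = np.clip(values, old_values - epsilon, old_values + epsilon)
    unclipped_error = 0.5 * (values - returns) ** 2
    clipped_error = 0.5 * (clipped - returns) ** 2
    return np.mean(np.maximum(unclipped_error, clipped_error))
```

Taking the maximum ensures the loss never understates the error of a value estimate that has drifted far from the old one, which mirrors the trust region imposed on the policy.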