
In simple environments with a small state space, we may construct a tabular solution storing values for all states we encounter. Among the algorithms creating such tables are \emph{SARSA} and \emph{Q-Learning} \cite[chapters 6.4 and 6.5]{sutton18}. If the state space grows very large, creating a table becomes infeasible, both in terms of computation time and memory consumption. Hence, we must approximate both the value function and the policy (cf.~chapter \ref{sec:02:policy_gradients}). Let $\hat{v}_\pi(s, \boldsymbol\omega)$ be an estimator of the value function $v_\pi(s)$ ...

... states we observe regularly more frequently, leading to more accurate estimations of $\overline{\text{VE}}(\boldsymbol\omega)$ implicitly. We establish a method to learn the value function in chapter \ref{sec:03:loss_value}. We can easily extend function approximation to the advantage function $a_\pi(s, a)$ by utilizing $\hat{v}_\pi$. In equation \ref{eqn:advantage_function}, the advantage function was defined to be
\begin{align*}
a_\pi(s, a) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left(r + \gamma v_\pi(s') \right)\right] - v_\pi(s) \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] - v_\pi(s).
\end{align*}
Then, the advantage function is parameterized by inserting the parameterized value function $\hat{v}_\pi$ instead of $v_\pi$:
\begin{align}
\hat{a}_\pi(s, a, \boldsymbol\omega) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left( r + \gamma \hat{v}_\pi(s', \boldsymbol\omega) \right) \right] - \hat{v}_\pi(s, \boldsymbol\omega) \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) \mid S_t = s, A_t = a \right] - \hat{v}_\pi(s, \boldsymbol\omega).
\end{align}
We further extend function approximation to the advantage estimator
\begin{align}
\delta_t \doteq R_{t+1} + \gamma \hat{v}_\pi(S_{t+1}, \boldsymbol\omega) - \hat{v}_\pi(S_t, \boldsymbol\omega).
\end{align}
Note that we do not indicate whether the advantage estimator uses the value function $v_\pi$ or a parameterized value function $\hat{v}_\pi$. In subsequent chapters we consistently use function approximation; therefore, it can be assumed that advantage estimators rely on parameterized value functions.

... agent's choices. Finally, we reduce complexity by approximating and parameterizing ...

\label{sec:02:agent_environment}
In reinforcement learning, an \gls{agent}---which is the acting and learning entity---is embedded in and interacts with an \gls{environment} \cite[chapter 3.1]{sutton18}. The environment describes the world surrounding the agent and is beyond the agent's immediate control.
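The one-step advantage estimator $\delta_t$ defined above can be sketched in a few lines of Python. The value estimator \texttt{v\_hat} stands in for the parameterized value function $\hat{v}_\pi(\cdot, \boldsymbol\omega)$ and is a made-up toy function for illustration, not the estimator used later in this thesis:

```python
# Sketch of the one-step advantage estimator
#   delta_t = R_{t+1} + gamma * v_hat(S_{t+1}) - v_hat(S_t),
# assuming a hypothetical value estimator v_hat mapping a state to a float.

def td_advantage(reward, state, next_state, v_hat, gamma=0.99):
    """Return delta_t for one transition (S_t, R_{t+1}, S_{t+1})."""
    return reward + gamma * v_hat(next_state) - v_hat(state)

# Toy value estimator: the value of an integer state is the state itself.
v_hat = lambda s: float(s)
delta = td_advantage(reward=1.0, state=2, next_state=3, v_hat=v_hat, gamma=0.5)
# 1.0 + 0.5 * 3.0 - 2.0 = 0.5
```

A positive $\delta_t$ then indicates the transition turned out better than the current value estimate predicted.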
However, agents can affect the environment through actions, which they choose based on the environment's observed state. After taking an action, agents observe the environment again. The interplay of agent and environment is shown in figure \ref{fig:action_observation}. In this thesis, we use a set of ATARI 2600 games as environments (cf.~chapter \ref{sec:05:ale}).

\begin{figure}[h]
\centering
...
\end{figure}

In addition to a new state, observations are associated with a reward. Agents seek to maximize the rewards they observe through trial-and-error. Hence, we can teach an agent to achieve a certain goal provided we design rewards properly: \enquote{[\dots] the reward signal is your way of communicating to the [agent] \emph{what} you want it to achieve, not \emph{how} you want it achieved} \cite[p.~54]{sutton18}. As a consequence, only the action achieving the goal must yield a positive reward. However, a sequence of actions instead of a single one may be essential to achieving the goal, raising the issue that rewards for vital actions may be delayed.
Delayed reward and the trial-and-error approach are considered to be the \enquote{most important distinguishing features of reinforcement learning} by \citeA[p.~2]{sutton18}.

In order to describe agent and environment, we require a mathematical construct that encompasses the ...

An agent's actions are determined by the \emph{policy} function
\begin{align}
&\pi: \mathcal{A} \times \mathcal{S} \rightarrow [0, 1], \nonumber
\end{align}
which returns the probability of the agent picking the action $a$ in state $s$ \cite[p.~58]{sutton18}. We say that the agent follows the policy $\pi$. A means for learning a policy $\pi$ is introduced in chapter \ref{sec:02:policy_gradients} and elaborated on in chapter \ref{sec:03:ppo}. Although the policy is stochastic and does not yield actions directly, an agent can determine actions by sampling proportionally to the probability distribution. For example, an agent following the policy $\pi(\emph{\text{noop}} ...

When an agent and an environment interact with each other over a series of discrete time steps $t = 0, 1, 2, \dots$, we can observe a sequence of states, actions and rewards $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$. We call this sequence the \gls{trajectory}. When we train an agent, we often record trajectories of states, actions and rewards to use as training data by running the agent in the environment.\todo{citation here?}

ATARI 2600 games have a plenitude of \emph{terminal states} that indicate the game has come to an end, for example when either player scored 20 points in Pong or the player lost their final life in Breakout. $\#\{s \mid s \in \mathcal{S} \dots\}$ ...

\todo[inline]{short introductory text}
Policy gradients are one method of learning a policy.
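The action-sampling step described above---drawing an action proportionally to the policy's probability distribution---can be sketched as follows. The action names and probabilities are made up for illustration and do not correspond to a concrete ATARI action set:

```python
import random

# Hypothetical policy output for a single state: action -> probability.
policy = {"noop": 0.2, "left": 0.5, "right": 0.3}

def sample_action(policy, rng=random):
    """Draw one action, each proportionally to its probability under the policy."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

action = sample_action(policy)  # "left" roughly half of the time
```

Repeated calls reproduce the policy's distribution empirically, which is exactly how an agent turns the stochastic policy into concrete actions.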
The main idea is parameterizing the policy with a differentiable function and its parameter vector $\boldsymbol\theta$. The policy is then learned by optimizing $\boldsymbol\theta$ with gradient ascent.

\subsubsection{Policy Parameterization}
\label{sec:02:policy_theta}

... returns the probability of taking the action $a$ at time $t$ given the state $s$ and $\boldsymbol\theta$ \cite[p.~321]{sutton18}. Often we write $\pi_{\boldsymbol\theta}$ instead of $\pi(a \mid s, \boldsymbol\theta)$.

The parameterization $\boldsymbol\theta$ often returns a vector of numerical weights. Each action corresponds to a weight that indicates the likelihood of picking the respective action: the higher the weight, the more likely picking its respective action should be. The policy $\pi$ uses the numerical weights to determine a probability distribution \cite[p.~322]{sutton18}, e.g., by applying a softmax function \cite[pp.~184--185]{goodfellow2016}. The method used to determine the probability distribution remains fixed; instead, we optimize the parameterization $\boldsymbol\theta$ so that the likelihood of choosing an advantageous action is increased.

Let $S_t = s$ and $\mathcal{A}(s) = \{1, 2\}$. A parameterization $\boldsymbol\theta_t$ could return a vector with two elements, for example $\boldsymbol\theta_t(s) = \begin{pmatrix}4\\2\end{pmatrix}$.\footnote{We write $\left(\boldsymbol\theta_t(s)\right)_a$ to obtain the $a$th component of $\boldsymbol\theta_t(s)$, e.g., \dots} ...

Based on the components present in a Markov decision process, we can define a formal goal an agent shall achieve. As an agent's behavior depends on the policy it follows, we require a means to assess choices made by the policy as well. Therefore, we introduce \emph{returns} and define two functions: the \emph{value function} allows us to judge states, whereas the \emph{advantage function} can be used to assess the impact of an action.
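Returning to the parameterization example above: applying a softmax to the weight vector $\begin{pmatrix}4\\2\end{pmatrix}$ yields a concrete probability distribution over the two actions. A minimal sketch (the numerically stable max-subtraction trick is standard practice, not specific to this thesis):

```python
import math

def softmax(weights):
    """Map a vector of action weights to a probability distribution."""
    # Subtracting the maximum weight keeps the exponentials numerically stable
    # without changing the resulting distribution.
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([4.0, 2.0])
# probs[0] = e^4 / (e^4 + e^2) ≈ 0.881, probs[1] = e^2 / (e^4 + e^2) ≈ 0.119
```

The higher weight thus translates into a far higher probability of picking action $1$, while action $2$ retains a nonzero chance of being explored.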
These concepts are key in the learning approach and algorithm we introduce in chapters \ref{sec:02:policy_gradients} and \ref{sec:03:ppo}.

\subsubsection{Return}
\label{sec:02:return}

... $R_T$. Using this sequence we define the return
\begin{align}
G_t &\doteq \sum_{k=t+1}^T \gamma^{k-t-1} R_k,\; 0 \leq \gamma \leq 1.
\end{align}
$\gamma$ denotes the discount factor, which discounts future rewards. In finite-horizon Markov decision processes, $\gamma$ may equal $1$. In infinite-horizon Markov decision processes, we must choose $\gamma < 1$ so $G_t$ remains finite \cite[p.~54]{sutton18}. Returns may be defined recursively, as early returns include the reward sequences of later returns: $G_t = R_{t+1} + \gamma G_{t+1}$.

\subsubsection{Value Function}
\label{sec:02:value_function}

... optimal value function $v_*$.
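The recursive form of the return introduced above, $G_t = R_{t+1} + \gamma G_{t+1}$, suggests computing all returns of a finite trajectory in a single backward pass over the rewards. A small sketch, assuming a finite reward sequence $R_1, \dots, R_T$ given as a Python list:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_0, ..., G_{T-1} for rewards R_1, ..., R_T
    via the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T is zero past the terminal state
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
# G_2 = 2.0, G_1 = 0.0 + 0.5 * 2.0 = 1.0, G_0 = 1.0 + 0.5 * 1.0 = 1.5
```

The backward pass runs in linear time, whereas evaluating the summation formula for each $t$ independently would be quadratic in the trajectory length.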
Assuming an optimal policy for the ATARI 2600 game ...

The value of a state depends on the expected behavior of an agent following the policy $\pi$. In order to increase the probability of obtaining a high return, we require a means to identify actions that prove advantageous or disadvantageous in this endeavour. The required functionality is provided by the \emph{advantage function} \cite{baird1993, sutton2000}\footnote{The advantage function is commonly defined using the $q$ function: $a_\pi(s, a) \doteq q_\pi(s,a) - v_\pi(s)$. The definition we provide is equivalent, as it merely substitutes $q_\pi(s,a)$ with the term it is defined to be.}
\begin{align}
\label{eqn:advantage_function}
a_\pi(s, a) &\doteq \sum_{s', r}\left[ p(s', r \mid s, a) \cdot \left(r + \gamma v_\pi(s') \right)\right] - v_\pi(s) \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] - v_\pi(s)
\end{align}