 # Sunforger

## Basic Concepts of Reinforcement Learning

People gain knowledge of the world and receive feedback through interaction with the world (environment). Following this paradigm, people have proposed the method of reinforcement learning.

## Foundation#

First, let's introduce several concepts in reinforcement learning.

Agent, which can be understood as a decision-maker.

Environment, with which the agent interacts and receives feedback.

The term "feedback" is interesting and has rich connotations. Specifically, each decision of the agent, or each action, will incur a cost.

For example, it may cause a change in the state of the environment. At this time, the observation of the environment by the agent will change.

Based on the obtained environmental observation, we construct the agent state.

For fully observable environments, we can consider the agent state as the observation of the environment.
Generally, unless otherwise specified, all environments are assumed to be fully observable.

The cost incurred by taking an action is not limited to the change in the environment state. The action itself also has a good or bad distinction. We measure the quality of the decision/action using reward.

If an action has a positive reward, it means that at least in the short term, the action is good. Conversely, if the reward is negative, it means that the action is not wise.

The following figure intuitively shows the principle of reinforcement learning. So, what is reinforcement learning all about?

The goal of reinforcement learning is to optimize the agent's action strategy through continuous interaction and choose better actions to maximize the sum of rewards (including rewards at each step).

## Modeling#

We generally use Markov Decision Processes (MDPs) to model reinforcement learning.

Here is a diagram illustrating a Markov process. There are a few points to note:

1. A Markov process (Markov Chain) is composed of a tuple $M=(S, P)$, and a Markov Decision Process (MDP) is composed of a tuple $M=(S,A,P,R)$. Sometimes, a discount factor $\gamma$ is included, which will be mentioned later.
2. Due to randomness, given the initial state and the final state, there may be multiple Markov Chains corresponding to a Markov Process.

## Important Concepts#

During the interaction between the agent and the environment, at a certain time t:

1. Based on the observation of the environment $O_t$ and the received reward $R_t$, the agent constructs the agent state $S_t$. Usually, unless otherwise specified, we can assume that $S_t=O_t$. The agent then decides how to take action (submit $A_t$ to the environment).

2. The environment receives the action $A_t$ submitted by the agent and needs to bear the "cost" brought by $A_t$. It then provides updated observations $O_{t+1}$ and rewards $R_{t+1}$ to the agent.

This process repeats.

Through the interaction between the agent and the environment, the following interaction trajectory is generated. We denote it as $\mathcal H_t$. This interaction trajectory stores the observations, actions, and rewards at each interaction.

$\mathcal H_t = O_0, A_0, R_1,O_1, A_1,\cdots, O_{t-1}, A_{t-1}, R_t, O_t$

Based on this sequence $\mathcal H_t$, we can construct the agent state $S_t$.

When the environment satisfies the property of being fully observable, we can consider the agent state $S_t = O_t$. Therefore, we can replace all the O symbols in the equation for $\mathcal H_t$ with the symbol S (which is often used in many materials).

In this case, there is no need to construct $S_t$ using $\mathcal H_t$. We can directly use $O_t$ as $S_t$.

We determine what action to take based on the state, so we abstract the policy function $\pi$, which takes the state as input and outputs the corresponding action. We denote it as $\pi(a|s)$. Sometimes, it is abbreviated as $\pi$.

We assume that the state space S is discrete and there are only |S| different states. Similarly, we assume that the action space A is discrete and there are only |A| different actions.

In this setting, how should we understand the policy function $\pi(a|s)$?

Given that the system is currently in state $s$, where $s\in S$,

Under the condition of state $s$, the policy function $\pi(a|s)$ considers what action (which a) should be taken.

The policy function can be understood as a class of composite functions. In addition to random policies, we can generally consider that the policy function includes two parts: action evaluation and action selection.

Action evaluation is generally done using Q-values.

Action selection is generally done using argmax or $\epsilon$-greedy.

We will discuss this in more detail later.

Through interaction with the environment, the individual or agent obtains rewards. As mentioned above, the goal of reinforcement learning is to maximize the total reward by selecting actions (finding a suitable policy).

We define the total reward (also known as return or future reward) as $G_t$.

$G_t = \sum_{k=0}^\infty {\color{red} \gamma^k}R_{t+k+1} = R_{t+1} + {\color{red} \gamma } R_{t+2} + {\color{red} \gamma^2}R_{t+3} + \cdots + {\color{red} \gamma^k}R_{t+k+1} + \cdots$

A few points to note:

1. The total reward, or the sum of rewards, should start from $R_1$, but here it starts from $R_{t+1}$. This is because the values of $R_1 \sim R_t$ are fixed constants and cannot be optimized. Therefore, we focus more on future rewards.

2. The $\gamma$ in the equation represents the discount factor, which is usually limited to the range $0<\gamma<1$.

If $\gamma$ is small, close to 0, as k increases, $\gamma^k$ becomes smaller and smaller, which means that the weight of $R_{t+k+1}$ becomes smaller. This means that we are more inclined to consider short-term effects rather than long-term effects.

If $\gamma$ is close to 1, it means that we will consider long-term effects more.

We define the state value function (also known as value function or value) as the expected cumulative return obtained by following a policy $\pi$ starting from state s. The state value function is used to measure how "good" a state s is and is defined as follows:

\begin{align*} V^\pi(s) &= \mathbb E_\pi[G_t|S_t=s] {\color{red}=\mathbb E_\pi[R_{t+1}+R_{t+2}+\cdots|S_t = s] } \\ &=\mathbb E_\pi [R_{t+1}+G_{t+1}|S_t = s] \\ &=\mathbb E_\pi[R_{t+1}+V^\pi(S_{t+1})|S_t = s]\\ &={\color{red} \sum_a\pi(a|s)\sum_{s^\prime}p_{ss^\prime}^a \left[ r_{ss^\prime}^a + \gamma V^\pi(s^\prime) \right] } \end{align*}

The first line is the definition of the state value function.
The second and third lines expand the definition of the return $G_t$ recursively, resulting in the form of the Bellman equation.
The fourth line expands the Bellman equation, and we need to pay attention to the values of $p_{ss^\prime}^a$ and $r_{ss^\prime}^a$.

$p_{ss^\prime}^a$ is the state transition probability, which should be described in the Markov Decision Process. Specifically,

When we are in state s and choose an action a based on the policy function $\pi(a|s)$, it will cause a change in the observation of the environment, and thus the state will change.

However, the effect of action a is not fixed. We cannot guarantee that state s will always change to a fixed state $s^\prime$. In other words, $s^\prime$ can be equal to state_1, state_2, state_3, or some other state_i, so it corresponds to a probability distribution $p_{ss^\prime}^a$. Similarly, we have $r_{ss^\prime}^a$.

We define the action value function as the expected cumulative return obtained by following a policy $\pi$ starting from state s and taking action a.

\begin{align*} Q^\pi(s,a) &= \mathbb E_\pi[G_t|S_t=s, A_t=a] {\color{red}=\mathbb E_\pi[R_{t+1}+R_{t+2}+\cdots|S_t = s, A_t = a] } \\ &=\mathbb E_\pi [R_{t+1}+G_{t+1}|S_t = s, A_t = a] \\ &=\mathbb E_\pi[R_{t+1}+Q^\pi(S_{t+1}, A_{t+1})|S_t = s, A_t=a]\\ &={\color{red} \sum_{s^\prime}p_{ss^\prime}^a \left[ r_{ss^\prime}^a +\gamma \sum_{a^\prime} \pi(a^\prime|s^\prime) Q^\pi(s^\prime, a^\prime)\right] } \end{align*}

Similar to $V^\pi(s)$, no further explanation is needed.

We define the advantage function as the difference between Q and V.

$A(s,a) = Q(s, a) - V(s)$

It represents the degree to which taking action a in state s is better or worse than following the current policy $\pi$. The advantage function is mainly used to optimize the policy and help the agent understand which actions are advantageous in the current state.

How to Understand the Advantage Function? - Zhihu

Model-based vs. Model-free

The so-called model includes the state transition probability and the reward function.

If the model is known, it is model-based, and we can perform planning under complete information. In other words, we can use dynamic programming algorithms to learn the desired policy.

Knowing the model means that when the action and state are determined, we can know the state transition probability $p_{ss^\prime}^a$ and the corresponding $r_{ss^\prime}^a$.

On the contrary, if learning does not depend on the model, it is model-free. For example, Policy Gradient methods are model-free.

We will discuss this in more detail later.

On-Policy vs. Off-Policy

On-Policy means that the behavior policy (used for episode sampling) and the target policy (used for policy optimization) are the same.

Off-Policy means that the two policies are different.

We will discuss this in more detail later.

A (Long) Peek into Reinforcement Learning