Fundamental Concepts of Reinforcement Learning

People learn about the world and receive feedback by interacting with it (the environment).



Following this paradigm, researchers have proposed reinforcement learning methods.


First, let's introduce several concepts in reinforcement learning.

Agent, which can be understood as the decision-maker or actor.

Environment, i.e., the world the agent lives in. The agent interacts with the environment to obtain feedback.

The word "feedback" is very interesting and has rich connotations. Specifically, every decision of the agent, i.e., every action, will incur a cost.

For example, an action causes the state of the environment to change; accordingly, the agent's observation of the environment changes as well.

Based on the obtained environmental observation, we construct the agent state (Agent State).

For Fully Observable Environments, we can consider Agent State = Observation = environment state. In general, unless otherwise specified, all environments are assumed to be fully observable.

The cost incurred by taking an action is not limited to the change in the environment state: the action itself can also be good or bad. We measure the quality of a decision/action with a reward.

If an action has a positive reward, it means that at least in the short term, this action is good. Conversely, if the reward is negative, it means that the action is not wise.

The following figure intuitively shows the principle of reinforcement learning.


So, what exactly is reinforcement learning?

The goal of reinforcement learning is to optimize the agent's policy through continuous interaction: by selecting better actions, the agent maximizes the sum of rewards accumulated over all steps.
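This interaction loop can be sketched in a few lines of Python. The `Environment` class and its `reset`/`step` interface below are illustrative assumptions (modeled on common RL toolkits), not a specific library's API.

```python
class Environment:
    """A made-up 2-state toy environment: reach state 1 to earn reward."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves toward the goal state; action 0 stays put.
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0
        done = self.state == 1
        return self.state, reward, done

def run_episode(env, policy, max_steps=10):
    """Interact with the environment and accumulate the sum of rewards."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)                   # agent selects an action
        obs, reward, done = env.step(action)   # environment gives feedback
        total_reward += reward
        if done:
            break
    return total_reward

print(run_episode(Environment(), policy=lambda s: 1))  # → 1.0
```

Even in this toy setting, the essential pieces are visible: the agent observes, acts, and receives a reward, and the quantity being maximized is the accumulated reward.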


We generally use Markov Decision Processes to model reinforcement learning.

Before proceeding, please read The relationship between Markov and Reinforcement Learning (RL) - Zhihu

Here is a schematic diagram of a Markov process.



There are a few points to note:

  1. A Markov process (Markov Process) is a 2-tuple $M=(S, P)$. A Markov decision process (Markov Decision Process) is a 4-tuple $M=(S, A, P, R)$. Sometimes a discount factor $\gamma$ for the reward is also included, which will be mentioned later.
  2. Due to randomness, even given the initial state and the final state, the Markov chain (Markov Chain) generated by a Markov process (Markov Process) is not unique.

Important Concepts

During the interaction between the agent and the environment, at time $t$:

  1. Based on the observation of the environment $O_t$ and the received reward $R_t$, the agent constructs the agent state $S_t$ (usually, unless otherwise specified, $S_t = O_t$) and decides how to act (submitting $A_t$ to the environment).

  2. The environment receives the action $A_t$ submitted by the agent and bears the "cost" brought by $A_t$. It then provides the updated $O_{t+1}$ and $R_{t+1}$ to the agent as feedback.

This process continues.

The interaction between the agent and the environment produces the following interaction trajectory, which we denote as $\mathcal H_t$. This trajectory stores the observation, action, and reward at each interaction.

$$\mathcal H_t = O_0, A_0, R_1, O_1, A_1, \cdots, O_{t-1}, A_{t-1}, R_t, O_t$$
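As a sketch, the trajectory could be recorded like this; the `reset`/`step` interface and the trivial one-state environment are hypothetical, chosen only to make the ordering $O_0, A_0, R_1, O_1, \dots$ concrete.

```python
def collect_history(reset, step, policy, t):
    """Record O_0, A_0, R_1, O_1, ..., R_t, O_t as a flat list."""
    obs = reset()
    history = [obs]                         # O_0
    for _ in range(t):
        action = policy(obs)
        obs, reward = step(action)
        history += [action, reward, obs]    # ..., A_k, R_{k+1}, O_{k+1}
    return history

# A trivial one-state environment for illustration.
reset = lambda: 0
step = lambda a: (0, float(a))              # (next observation, reward)
hist = collect_history(reset, step, policy=lambda s: 1, t=2)
print(hist)  # → [0, 1, 1.0, 0, 1, 1.0, 0]
```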

Based on this sequence $\mathcal H_t$, we can construct the agent state $S_t$.

When the environment satisfies the Fully Observable property, we take the agent state $S_t = O_t$. Therefore, we can replace all the $O$ symbols in $\mathcal H_t$ with the $S$ symbol (many materials also use the $S$ symbol directly).

In this case, there is no need to construct $S_t$ from $\mathcal H_t$; we can use $O_t$ directly as $S_t$.

We decide what action to take based on the state, and abstract this into a policy function $\pi$. The policy function takes the state as input and outputs the probability of each action, denoted $\pi(a|s)$; it is sometimes abbreviated as $\pi$.

We assume that the state space $S$ is discrete, with $|S|$ distinct states. Similarly, we assume that the action space $A$ is discrete, with $|A|$ distinct actions.

In this setting, how should we understand the policy function $\pi(a|s)$?

The system is currently in state $s$, where $s \in S$.

Conditioned on state $s$, the policy function $\pi(a|s)$ gives the probability of taking each action $a$.

The policy function can be understood as a class of composite functions. In addition to random policies, we can generally consider that the policy function includes two parts: action evaluation and action selection.

Action evaluation is generally done using the Q value.

Action selection is generally done using argmax or $\epsilon$-greedy.
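A minimal sketch of these two parts, with made-up Q values; `epsilon_greedy` is an illustrative helper, not a library function.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With prob. epsilon explore uniformly; otherwise pick argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

q_s = [0.2, 0.8, 0.5]       # Q(s, a) for actions 0, 1, 2 in some state s
print(epsilon_greedy(q_s, epsilon=0.0))  # → 1 (pure argmax when epsilon=0)
```

Here the Q table does the action evaluation, while `epsilon_greedy` does the action selection, occasionally exploring a random action instead of the current best one.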

We will discuss this in more detail later.

Through interaction between the agent and the environment, rewards are obtained. As mentioned above, the goal of reinforcement learning is to maximize the total reward by selecting actions (i.e., finding a better policy) through interaction.

We define the total reward (also known as the return or future reward) as $G_t$.

$$G_t = \sum_{k=0}^\infty {\color{red}\gamma^k} R_{t+k+1} = R_{t+1} + {\color{red}\gamma} R_{t+2} + {\color{red}\gamma^2} R_{t+3} + \cdots + {\color{red}\gamma^k} R_{t+k+1} + \cdots$$

A few points to note:

  1. The total reward, or the sum of rewards, should start from $R_1$, so why does it start from $R_{t+1}$? Because the values of $R_1 \sim R_t$ are already fixed constants and cannot be optimized. Therefore, we focus on future rewards.

  2. The $\gamma$ in the equation is the discount factor mentioned above. Usually, its range is limited to $0<\gamma<1$.

    If $\gamma$ is small (close to 0), then as $k$ increases, $\gamma^k$ becomes smaller and smaller, i.e., the weight of $R_{t+k+1}$ shrinks quickly. This means we favor short-term effects over long-term effects.

    If $\gamma$ is close to 1, it means that we take long-term effects into account more.
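A small numeric example of the discount factor's effect, computing $G_t$ for a finite reward sequence (the rewards here are made up):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} over the recorded rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]                # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.9))   # 1 + 0.9  + 0.81 = 2.71
print(discounted_return(rewards, 0.1))   # 1 + 0.1  + 0.01 = 1.11
```

With $\gamma = 0.9$, later rewards still contribute substantially; with $\gamma = 0.1$, the return is dominated by the first reward.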

We define the state value function (also known as the value function, or simply the value) as the expected cumulative return obtained by starting from state $s$ and following policy $\pi$. The state value function measures how "good" a state $s$ is, and is defined as follows:

$$\begin{align*} V^\pi(s) &= \mathbb E_\pi[G_t \mid S_t=s] {\color{red} = \mathbb E_\pi[R_{t+1} + \gamma R_{t+2} + \cdots \mid S_t = s]} \\ &= \mathbb E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\ &= \mathbb E_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s] \\ &= {\color{red} \sum_a \pi(a|s) \sum_{s^\prime} p_{ss^\prime}^a \left[ r_{ss^\prime}^a + \gamma V^\pi(s^\prime) \right]} \end{align*}$$

The first line of the equation is the definition of the state value function.
The second and third lines of the equation are the recursive form obtained by expanding the return $G_t$ according to its definition, i.e., the Bellman Equation.
The fourth line expands the Bellman Equation, and we need to pay attention to the $p_{ss^\prime}^a$ and $r_{ss^\prime}^a$ terms in the equation.

$p_{ss^\prime}^a$ is the state transition probability, which is specified by the Markov Decision Process. Specifically:

When we are in state $s$ and choose an action $a$ according to the policy function $\pi(a|s)$, the observation of the environment changes, so the state changes as well.

However, the effect of action $a$ is not deterministic. We cannot guarantee that state $s$ always transitions to one fixed state $s^\prime$. In other words, $s^\prime$ may be state_1, state_2, state_3, or some other state_i, so the transition corresponds to a probability distribution $p_{ss^\prime}^a$. Similarly, we have the corresponding reward $r_{ss^\prime}^a$.
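The expanded Bellman Equation can be turned into a fixed-point iteration, commonly called iterative policy evaluation. The 2-state, 2-action MDP below is entirely made up for illustration:

```python
def policy_evaluation(pi, p, r, gamma=0.9, iters=1000):
    """Iterate V(s) <- sum_a pi(a|s) sum_s' p(s'|s,a) * (r + gamma * V(s'))."""
    n = len(p)                      # number of states
    V = [0.0] * n
    for _ in range(iters):
        V = [sum(pi[s][a] * sum(p[s][a][s2] * (r[s][a][s2] + gamma * V[s2])
                                for s2 in range(n))
                 for a in range(len(pi[s])))
             for s in range(n)]
    return V

# Hypothetical MDP: p[s][a][s'] is the transition probability,
# r[s][a][s'] the corresponding reward (reward only on entering state 1
# from state 0; state 1 itself yields nothing).
p = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
r = [[[0.0, 1.0], [0.0, 1.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]       # uniform random policy
V = policy_evaluation(pi, p, r)
print([round(v, 3) for v in V])
```

Because only state 0 can earn a reward, the iteration converges to $V(0) > V(1)$; repeated application of the Bellman Equation contracts toward the unique fixed point when $0 < \gamma < 1$.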

We define the action value function as the expected cumulative return obtained by starting from state $s$, taking action $a$, and thereafter following policy $\pi$.

$$\begin{align*} Q^\pi(s,a) &= \mathbb E_\pi[G_t \mid S_t=s, A_t=a] {\color{red} = \mathbb E_\pi[R_{t+1} + \gamma R_{t+2} + \cdots \mid S_t = s, A_t = a]} \\ &= \mathbb E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\ &= \mathbb E_\pi[R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \\ &= {\color{red} \sum_{s^\prime} p_{ss^\prime}^a \left[ r_{ss^\prime}^a + \gamma \sum_{a^\prime} \pi(a^\prime|s^\prime) Q^\pi(s^\prime, a^\prime) \right]} \end{align*}$$

Similar to $V^\pi(s)$, so no further explanation is given.

We define the advantage function as the difference between Q and V.

$$A(s, a) = Q(s, a) - V(s)$$

It represents how much better or worse taking action $a$ in state $s$ is compared to following the current policy $\pi$. The main purpose of the advantage function is to guide policy optimization and help the agent see more clearly which actions are advantageous in the current state.

How to understand the advantage function? - Zhihu
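A minimal sketch with made-up numbers, using the fact that $V(s) = \sum_a \pi(a|s)\, Q(s,a)$ under the current policy:

```python
def advantages(q_s, pi_s):
    """Return A(s, a) = Q(s, a) - V(s) for every action a in one state s."""
    v_s = sum(p * q for p, q in zip(pi_s, q_s))   # V(s) under pi
    return [q - v_s for q in q_s]

q_s = [1.0, 3.0]          # Q(s, a) for actions 0 and 1 (illustrative values)
pi_s = [0.5, 0.5]         # pi(a|s)
print(advantages(q_s, pi_s))  # → [-1.0, 1.0]  (action 1 is advantageous)
```

A positive advantage marks an action better than the policy's average behavior in that state; a negative one marks an action worse than average.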

Model-based vs. Model-free

The so-called model includes the state transition probability and the reward function.

If the model is known, it is model-based, and we will plan under complete information. In other words, we can use dynamic programming algorithms to learn the desired policy.

Knowing the model means that, once the action and state are determined, we know the state transition probability $p_{ss^\prime}^a$ and the corresponding reward $r_{ss^\prime}^a$.

On the contrary, if learning does not depend on the model, it is called model-free. For example, the Policy Gradient method is model-free.

We will discuss this in more detail later.

On-Policy vs. Off-Policy

On-Policy means that the behavior policy used for sampling episodes and the target policy being optimized are the same policy.

Off-Policy means that the two policies are different.
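One way to make the distinction concrete is to compare the bootstrap targets of SARSA (on-policy) and Q-learning (off-policy); the numbers below are illustrative:

```python
def sarsa_target(reward, q_next, a_next, gamma=0.9):
    """On-policy target: bootstrap with the action the behavior policy took."""
    return reward + gamma * q_next[a_next]

def q_learning_target(reward, q_next, gamma=0.9):
    """Off-policy target: bootstrap with the greedy (target-policy) action."""
    return reward + gamma * max(q_next)

q_next = [0.5, 2.0]                         # Q(s', a) in the next state
print(sarsa_target(1.0, q_next, a_next=0))  # 1.0 + 0.9 * 0.5 = 1.45
print(q_learning_target(1.0, q_next))       # 1.0 + 0.9 * 2.0 = 2.8
```

SARSA evaluates the policy it actually follows, while Q-learning evaluates the greedy policy even when exploration chose a different action, which is exactly the behavior/target policy split.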

We will discuss this in more detail later.

A (Long) Peek into Reinforcement Learning
