Basic Concepts of Reinforcement Learning

People learn about the world and receive feedback by interacting with it (the environment).



Following this paradigm, people have proposed the method of reinforcement learning.


First, let's introduce several concepts in reinforcement learning.

Agent, which can be understood as a decision-maker.

Environment, with which the agent interacts and receives feedback.

The term "feedback" is rich in meaning. Specifically, every decision the agent makes, i.e., every action, incurs a cost.

For example, an action may change the state of the environment; the agent's observation of the environment then changes accordingly.

Based on the obtained environmental observation, we construct the agent state.

For fully observable environments, we can consider the agent state as the observation of the environment.
Generally, unless otherwise specified, all environments are assumed to be fully observable.

The cost incurred by taking an action is not limited to changing the environment's state: the action itself can also be good or bad. We measure the quality of a decision/action with the reward.

If an action has a positive reward, it means that at least in the short term, the action is good. Conversely, if the reward is negative, it means that the action is not wise.

The following figure intuitively shows the principle of reinforcement learning.


So, what is reinforcement learning all about?

The goal of reinforcement learning is to optimize the agent's policy through continuous interaction, choosing better actions so as to maximize the cumulative sum of rewards over all steps.


We generally use Markov Decision Processes (MDPs) to model reinforcement learning.

Before proceeding, please read The Relationship Between Markov and Reinforcement Learning.

Here is a diagram illustrating a Markov process.



There are a few points to note:

  1. A Markov process (Markov chain) is given by a tuple $M=(S, P)$, and a Markov Decision Process (MDP) by a tuple $M=(S, A, P, R)$. Sometimes a discount factor $\gamma$ is also included, which will be mentioned later.
  2. Because transitions are random, a single Markov process can produce many different Markov chains, even between the same initial and final states.
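Point 2 can be illustrated with a short sketch. The three-state transition matrix below is entirely made up; sampling the same process twice from the same start state generally yields two different chains:

```python
import random

# Hypothetical Markov process M = (S, P): three states with a
# row-stochastic transition matrix P (each row sums to 1).
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]

def sample_chain(start, length, rng):
    """Sample one realization (one Markov chain) of the process."""
    chain = [start]
    for _ in range(length):
        s = chain[-1]
        chain.append(rng.choices(range(len(P)), weights=P[s])[0])
    return chain

rng = random.Random(0)
# Two runs from the same start state generally differ due to randomness.
chain_a = sample_chain(0, 5, rng)
chain_b = sample_chain(0, 5, rng)
```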

Important Concepts

During the interaction between the agent and the environment, at a certain time $t$:

  1. Based on the observation of the environment $O_t$ and the received reward $R_t$, the agent constructs the agent state $S_t$. Usually, unless otherwise specified, we can assume that $S_t = O_t$. The agent then decides how to act (submits $A_t$ to the environment).

  2. The environment receives the action $A_t$ submitted by the agent and bears the "cost" brought by $A_t$. It then provides the updated observation $O_{t+1}$ and reward $R_{t+1}$ to the agent.

This process repeats.
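The two steps above can be sketched as a plain interaction loop. The `ToyEnv` class and its reward scheme are hypothetical stand-ins for a real environment, and the policy here is simply uniform random:

```python
import random

class ToyEnv:
    """A hypothetical two-state environment, for illustration only."""
    def reset(self):
        self.state = 0
        return self.state                   # initial observation O_0

    def step(self, action):
        # Made-up reward scheme: action 1 taken in state 0 pays +1.
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = random.choice([0, 1])  # stochastic transition
        return self.state, reward           # O_{t+1}, R_{t+1}

env = ToyEnv()
obs = env.reset()
total_reward = 0.0
for t in range(100):
    action = random.choice([0, 1])          # a (random) policy picks A_t
    obs, reward = env.step(action)          # environment returns O_{t+1}, R_{t+1}
    total_reward += reward                  # RL aims to maximize this sum
```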

Through the interaction between the agent and the environment, the following interaction trajectory is generated, denoted $\mathcal H_t$. It records the observations, actions, and rewards at each interaction step.

$$\mathcal H_t = O_0, A_0, R_1, O_1, A_1, \cdots, O_{t-1}, A_{t-1}, R_t, O_t$$

Based on this sequence $\mathcal H_t$, we can construct the agent state $S_t$.

When the environment is fully observable, we can take the agent state $S_t = O_t$. Therefore, we can replace every $O$ symbol in the equation for $\mathcal H_t$ with the symbol $S$ (as is often done in many materials).

In this case, there is no need to construct $S_t$ from $\mathcal H_t$; we can use $O_t$ directly as $S_t$.

We decide what action to take based on the state, so we abstract the policy function $\pi$, which takes the state as input and outputs the probability of each action. We denote it as $\pi(a|s)$, sometimes abbreviated as $\pi$.

We assume that the state space $S$ is discrete, with only $|S|$ different states. Similarly, we assume that the action space $A$ is discrete, with only $|A|$ different actions.

In this setting, how should we understand the policy function $\pi(a|s)$?

Given that the system is currently in state $s$, where $s \in S$, the policy function $\pi(a|s)$ tells us which action $a$ should be taken (and with what probability).

The policy function can be understood as a class of composite functions. Apart from purely random policies, a policy function generally consists of two parts: action evaluation and action selection.

Action evaluation is generally done using Q-values.

Action selection is generally done using argmax or $\epsilon$-greedy.

We will discuss this in more detail later.
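The evaluation/selection split can be sketched as follows. The Q-values below are made-up numbers standing in for an action-evaluation step:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Action selection: exploit the argmax of the Q-values with
    probability 1 - epsilon, otherwise explore uniformly at random."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # argmax

# Hypothetical Q-values (action evaluation) for the current state.
q = [0.2, 0.8, 0.5]
greedy_action = epsilon_greedy(q, epsilon=0.0)  # epsilon = 0 reduces to pure argmax
```

With `epsilon = 0` the selection is deterministic argmax; a small positive `epsilon` trades a little exploitation for exploration.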

Through interaction with the environment, the individual or agent obtains rewards. As mentioned above, the goal of reinforcement learning is to maximize the total reward by selecting actions (finding a suitable policy).

We define the total reward (also known as the return or future reward) as $G_t$.

$$G_t = \sum_{k=0}^\infty {\color{red} \gamma^k} R_{t+k+1} = R_{t+1} + {\color{red} \gamma} R_{t+2} + {\color{red} \gamma^2} R_{t+3} + \cdots + {\color{red} \gamma^k} R_{t+k+1} + \cdots$$

A few points to note:

  1. The total reward, or the sum of rewards, should start from $R_1$, but here it starts from $R_{t+1}$. This is because the values of $R_1 \sim R_t$ are already fixed constants and cannot be optimized. Therefore, we focus on future rewards.

  2. The $\gamma$ in the equation is the discount factor, usually restricted to the range $0 < \gamma < 1$.

    If $\gamma$ is small (close to 0), then as $k$ increases, $\gamma^k$ becomes smaller and smaller, so the weight of $R_{t+k+1}$ shrinks. This means we lean toward short-term effects rather than long-term effects.

    If $\gamma$ is close to 1, long-term effects are weighted more heavily.
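The effect of the discount factor can be checked numerically. The reward values below are arbitrary; the function folds the sum backwards using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} by folding backwards,
    using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards R_{t+1}, R_{t+2}, R_{t+3} (arbitrary example values).
g_short = discounted_return([1.0, 1.0, 1.0], gamma=0.1)  # 1 + 0.1 + 0.01 = 1.11
g_long = discounted_return([1.0, 1.0, 1.0], gamma=0.9)   # 1 + 0.9 + 0.81 = 2.71
```

The same reward stream is worth far more under $\gamma = 0.9$ than under $\gamma = 0.1$: a larger $\gamma$ makes future rewards matter more.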

We define the state value function (also known as the value function, or simply the value) as the expected cumulative return obtained by following a policy $\pi$ starting from state $s$. The state value function measures how "good" a state $s$ is and is defined as follows:

$$\begin{align*} V^\pi(s) &= \mathbb E_\pi[G_t \mid S_t=s] {\color{red}=\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\cdots \mid S_t = s] } \\ &=\mathbb E_\pi [R_{t+1}+\gamma G_{t+1} \mid S_t = s] \\ &=\mathbb E_\pi[R_{t+1}+\gamma V^\pi(S_{t+1}) \mid S_t = s]\\ &={\color{red} \sum_a\pi(a|s)\sum_{s^\prime}p_{ss^\prime}^a \left[ r_{ss^\prime}^a + \gamma V^\pi(s^\prime) \right] } \end{align*}$$

The first line is the definition of the state value function.
The second and third lines expand the return recursively via $G_t = R_{t+1} + \gamma G_{t+1}$, yielding the Bellman equation.
The fourth line expands the Bellman equation over actions and next states; note the meanings of $p_{ss^\prime}^a$ and $r_{ss^\prime}^a$.

$p_{ss^\prime}^a$ is the state transition probability, which is specified by the Markov Decision Process. Specifically:

When we are in state $s$ and choose an action $a$ according to the policy $\pi(a|s)$, the observation of the environment changes, and thus the state changes.

However, the effect of action $a$ is not deterministic: we cannot guarantee that state $s$ always transitions to one fixed state $s^\prime$. In other words, $s^\prime$ may be state_1, state_2, state_3, or some other state_i, so the transition follows a probability distribution $p_{ss^\prime}^a$. Similarly, we have the reward $r_{ss^\prime}^a$.

We define the action value function as the expected cumulative return obtained by following a policy $\pi$ after starting from state $s$ and taking action $a$.

$$\begin{align*} Q^\pi(s,a) &= \mathbb E_\pi[G_t \mid S_t=s, A_t=a] {\color{red}=\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\cdots \mid S_t = s, A_t = a] } \\ &=\mathbb E_\pi [R_{t+1}+\gamma G_{t+1} \mid S_t = s, A_t = a] \\ &=\mathbb E_\pi[R_{t+1}+\gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t=a]\\ &={\color{red} \sum_{s^\prime}p_{ss^\prime}^a \left[ r_{ss^\prime}^a +\gamma \sum_{a^\prime} \pi(a^\prime|s^\prime) Q^\pi(s^\prime, a^\prime)\right] } \end{align*}$$

The derivation parallels that of $V^\pi(s)$, so no further explanation is needed.

We define the advantage function as the difference between $Q$ and $V$:

$$A(s, a) = Q(s, a) - V(s)$$

It represents how much better or worse taking action $a$ in state $s$ is compared with following the current policy $\pi$. The advantage function is mainly used to optimize the policy, helping the agent understand which actions are advantageous in the current state.
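As a tiny numerical sketch (all Q-values are made up, and $V$ is taken as the policy-weighted average of $Q$ under a uniform policy):

```python
# Hypothetical Q-values for one state, one entry per action.
q = {"left": 1.0, "right": 3.0}

# Under a uniform policy, V(s) is the average of the Q-values.
v = sum(q.values()) / len(q)

# A(s, a) = Q(s, a) - V(s): positive means better than the policy average.
advantage = {a: q_sa - v for a, q_sa in q.items()}
```

Here "right" has positive advantage and "left" negative: relative to the current policy, shifting probability toward "right" would improve the expected return.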

How to Understand the Advantage Function? - Zhihu

Model-based vs. Model-free

The so-called model includes the state transition probability and the reward function.

If the model is known, it is model-based, and we can perform planning under complete information. In other words, we can use dynamic programming algorithms to learn the desired policy.

Knowing the model means that once the state and action are determined, we know the state transition probability $p_{ss^\prime}^a$ and the corresponding reward $r_{ss^\prime}^a$.
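As a sketch of planning with a known model, here is iterative policy evaluation on a tiny hypothetical MDP (two states, two actions, made-up transition probabilities and rewards), repeatedly applying the Bellman expectation equation from above:

```python
# A tiny hypothetical MDP. p[s][a] lists (probability, next_state, reward)
# triples, i.e. the model p_{ss'}^a and r_{ss'}^a is assumed known.
p = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.5, 0, 1.0), (0.5, 1, 2.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # uniform random policy
gamma = 0.9

# Iterative policy evaluation: sweep the Bellman expectation equation
# until V converges to the fixed point V^pi.
V = {s: 0.0 for s in p}
for _ in range(500):
    V = {
        s: sum(
            pi[s][a] * sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])
            for a in p[s]
        )
        for s in p
    }
```

Because the update is a $\gamma$-contraction, the iteration converges; at the fixed point each $V[s]$ equals its own Bellman backup.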

On the contrary, if learning does not depend on the model, it is model-free. For example, Policy Gradient methods are model-free.

We will discuss this in more detail later.

On-Policy vs. Off-Policy

On-Policy means that the behavior policy (used for episode sampling) and the target policy (used for policy optimization) are the same.

Off-Policy means that the two policies are different.

We will discuss this in more detail later.

A (Long) Peek into Reinforcement Learning
