Learning to Control: Introduction to Reinforcement Learning

What is the goal of reinforcement learning (RL), and how does it work?

The goal of RL is to learn a good policy with little or no supervision.

RL Terminology

Markov decision process (MDP)

  • a set of states s \in S
  • a set of actions a \in A
  • Action-dependent transition probabilities T\left(s, a, s^{\prime}\right)=P\left(s^{\prime} \mid s, a\right) so that
    \sum_{s^{\prime} \in S} T\left(s, a, s^{\prime}\right)=1
  • Reward functions R\left(s, a, s^{\prime}\right), representing the reward for starting in state s, taking action a, and ending up in state s^{\prime} after one step. (The reward function may also depend only on s, or only on s and a.)
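To make the definition concrete, here is a minimal Python sketch of how such an (S, A, T, R) tuple might be stored; the container class and method names are illustrative placeholders, not part of the MDP definition itself.

    # A minimal, hypothetical MDP container: states, actions, transition
    # probabilities T(s, a, s') and rewards R(s, a, s').
    from dataclasses import dataclass, field

    @dataclass
    class MDP:
        states: list                             # the set of states S
        actions: list                            # the set of actions A
        T: dict = field(default_factory=dict)    # (s, a, s') -> probability
        R: dict = field(default_factory=dict)    # (s, a, s') -> reward

        def transition_prob(self, s, a, s_next):
            return self.T.get((s, a, s_next), 0.0)

        def reward(self, s, a, s_next):
            return self.R.get((s, a, s_next), 0.0)

        def check_probabilities(self):
            # For every (s, a), the outgoing probabilities must sum to 1.
            for s in self.states:
                for a in self.actions:
                    total = sum(self.transition_prob(s, a, s2) for s2 in self.states)
                    assert abs(total - 1.0) < 1e-9, (s, a, total)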

Markov Property

Let X_{i}, i=1,2, \ldots be a discrete Markov chain with states \{s_{j}, j \in \mathbb{N}\}.

  • For n \geq 3, P\left[X_{n}=x_{n} \mid X_{n-1}=x_{n-1}, \ldots, X_{1}=x_{1}\right]=P\left[X_{n}=x_{n} \mid X_{n-1}=x_{n-1}\right].
  • For n \geq 3 and n-j>1, P\left[X_{n}=x_{n} \mid X_{n-j}=x_{n-j}, \ldots, X_{1}=x_{1}\right]=P\left[X_{n}=x_{n} \mid X_{n-j}=x_{n-j}\right].

An AI agent navigates the 3×3 grid depicted above, where the middle square is not accessible to the agent (and hence is greyed out).

The MDP is defined as follows:

  • Every state s is defined by the current position of the agent in the grid (and is independent of its previous actions and positions).
  • The actions a are the 4 directions “up”, “down”, “left”, “right”.
  • The transition probability from state s via action a to state s^{\prime} is given by T\left(s,a,s^\prime\right)=P\left(s^\prime\mid s,a\right).
  • The agent receives a reward of +1 for arriving at the top-right cell and a reward of -1 for arriving at the cell immediately below it. It receives a reward of 0 at all other cells, as illustrated in the corresponding figure.
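As an illustration, the sketch below encodes one plausible deterministic version of this grid world in Python. The (row, column) coordinates, the rule that a move into the blocked cell or off the grid leaves the agent in place, and the choice of cell labels are assumptions made for the sketch; only the overall layout and the +1/-1 rewards come from the description above.

    # Hypothetical encoding of the 3x3 grid world (middle cell blocked).
    # Cells are (row, col) with (0, 0) at the bottom-left; these coordinates
    # and the "blocked/off-grid moves keep you in place" rule are assumptions.
    BLOCKED = {(1, 1)}
    STATES = [(r, c) for r in range(3) for c in range(3) if (r, c) not in BLOCKED]
    MOVES = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}

    def step(state, action):
        """Deterministic transition: return the next state s'."""
        dr, dc = MOVES[action]
        nxt = (state[0] + dr, state[1] + dc)
        return nxt if nxt in STATES else state  # blocked or off-grid: stay put

    def reward(s, a, s_next):
        """+1 for arriving at the top-right cell, -1 for the cell below it."""
        if s_next == (2, 2):
            return 1.0
        if s_next == (1, 2):
            return -1.0
        return 0.0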

Markovian Setting

Let s be any given state in this MDP. The agent takes actions a_{1}, a_{2}, \ldots, a_{n} starting from state s_{0} and as a result visits states s_{1}, s_{2}, \ldots, s_{n}=s in that order.

  • Rewards collected after the n^{\text{th}} step do not depend on the previous states s_{1}, s_{2}, \ldots, s_{n-1}.
  • Rewards collected after the n^{\text{th}} step can depend on the current state s.
  • Rewards collected after the n^{\text{th}} step do not depend on the previous actions a_{1}, a_{2}, \ldots, a_{n}.

The total number of unique states in the MDP described above and depicted by the grid is 8: the 3×3 grid has 9 cells, and the inaccessible middle cell is not a state.

Utility Function

The main problem for MDPs is to optimize the agent’s behavior.

Finite horizon-based utility:

The utility function is the sum of rewards obtained after acting for a fixed number of steps n. For example, when the rewards depend only on the states, the utility function is

  • U\left[s_{0}, s_{1}, \ldots, s_{n}\right]=\sum_{i=0}^{n} R\left(s_{i}\right) for some fixed number of steps n
  • In particular, U\left[s_{0}, s_{1}, \ldots, s_{n+m}\right]=U\left[s_{0}, s_{1}, \ldots, s_{n}\right] for any positive integer m.

(Infinite horizon) discounted reward-based utility:

The utility function is the discounted sum of rewards over an unbounded number of steps, with discount factor 0 \leq \gamma < 1:

        \[U\left[s_{0}, s_{1}, \ldots\right]=\sum_{k=0}^{\infty} \gamma^{k} R\left(s_{k}\right)\]
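Both utility definitions translate directly into code. The sketch below assumes rewards depend only on the states and truncates the infinite sum at the length of the given reward sequence; the function names are placeholders.

    # Finite-horizon and discounted utilities for a state-only reward R(s).
    def finite_horizon_utility(rewards):
        """U[s_0, ..., s_n] = sum of R(s_i) over the first n+1 states."""
        return sum(rewards)

    def discounted_utility(rewards, gamma=0.5):
        """U[s_0, s_1, ...] = sum_k gamma^k * R(s_k), truncated to the sequence given."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # Example: rewards 0, 0, 1 with gamma = 0.5 give 0 + 0 + 0.25 = 0.25.
    assert discounted_utility([0, 0, 1], gamma=0.5) == 0.25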

The action at state s that maximizes a finite horizon-based utility can depend on how many steps have been taken.

The action at state s that maximizes a discounted reward-based utility does not depend on how many steps have been taken.

Definition of Optimal Policy

Given an MDP and a utility function U\left[s_{0}, s_{1}, \ldots, s_{n}\right], our goal is to find an optimal policy function that maximizes the expectation of the utility. Here, a policy is a function \pi: S \rightarrow A that assigns an action \pi(s) to any state s. We denote the optimal policy by \pi^{*}. The optimal policy assigns to every state an action that maximizes the expected utility.
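Since a policy is just a mapping from states to actions, its expected utility can be estimated by sampling rollouts. The sketch below does this for the discounted utility, reusing the hypothetical MDP container from earlier; sample_next, the fixed rollout horizon, and the number of rollouts are illustrative choices, and truncating the infinite discounted sum at a finite horizon is only an approximation.

    import random

    def sample_next(mdp, s, a):
        """Draw s' from the distribution T(s, a, .)."""
        probs = [mdp.transition_prob(s, a, s2) for s2 in mdp.states]
        return random.choices(mdp.states, weights=probs, k=1)[0]

    def estimate_expected_utility(mdp, policy, s0, gamma, horizon=50, rollouts=1000):
        """Monte Carlo estimate of E[U] for a policy given as a dict state -> action."""
        total = 0.0
        for _ in range(rollouts):
            s, discount = s0, 1.0
            for _ in range(horizon):
                a = policy[s]
                s_next = sample_next(mdp, s, a)
                total += discount * mdp.reward(s, a, s_next)
                discount *= gamma
                s = s_next
        return total / rollouts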

MDP Example: Negative Living Reward

Optimal Policy: Numerical Example

In this setup, the agent receives a reward (penalty) of -10 for every action that it takes, on top of the +1 or -1 it receives when it reaches the corresponding cells. Since the agent always starts at state s_{0} and the outcome of each action is deterministic, the discounted reward depends only on the action sequence and can be written as:

U\left[a_{1}, a_{2}, \ldots\right]=R\left(s_{0}, a_{1}\right)+\gamma R\left(s_{1}, a_{2}\right)+\gamma^{2} R\left(s_{2}, a_{3}\right)+\cdots

For the case \gamma=0, the discounted reward is the same as the reward received after the first step. The agent receives a reward of -10 if it chooses any one of the actions “Left”, “Down”, or “Right”. It receives an additional reward of -1 (for a total of -11) if it chooses to go up. So the best discounted reward in this case is -10.

For \gamma=0.5, the best discounted reward is obtained by the following sequence of actions starting from the initial state: “Up”, “Up”. The agent receives rewards of -11 and -9 for these two actions, respectively. It reaches the top-right state and comes to a halt after taking this sequence of actions. The discounted reward amounts to -11+0.5 \cdot(-9)=-15.5.
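A quick numerical check of this arithmetic (a throwaway Python snippet using the reward sequence above):

    # Discounted reward for the sequence "Up", "Up" with gamma = 0.5:
    # the two steps earn -11 (= -10 living reward - 1) and -9 (= -10 + 1).
    gamma = 0.5
    rewards = [-11, -9]
    print(sum(gamma ** k * r for k, r in enumerate(rewards)))  # -15.5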

Value Function

A value function V(s) of a given state s is the expected reward (i.e., the expectation of the utility function) if the agent acts optimally starting at state s. In the given MDP, since the outcome of each action is deterministic, the expected reward simply equals the utility.

Bellman Equations

    \[V^{*}(s)=\max_{a} Q^{*}(s, a)\]

    \[Q^{*}(s, a)=\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left(R\left(s, a, s^{\prime}\right)+\gamma V^{*}\left(s^{\prime}\right)\right)\]

The value function V^{*}(s) is the expected reward from starting at state s and acting optimally. The Q-function Q^{*}(s, a) is the expected reward from starting at state s, taking action a, and acting optimally afterwards.
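As a sketch, the two Bellman relations can be written as a pair of small helper functions over the hypothetical dictionary-style MDP container from earlier; V and Q are assumed to be plain dictionaries keyed by state and by (state, action), respectively.

    # Bellman relations: V*(s) = max_a Q*(s, a) and
    # Q*(s, a) = sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * V*(s')).
    def q_from_v(mdp, V, s, a, gamma):
        return sum(
            mdp.transition_prob(s, a, s2) * (mdp.reward(s, a, s2) + gamma * V[s2])
            for s2 in mdp.states
        )

    def v_from_q(mdp, Q, s):
        return max(Q[(s, a)] for a in mdp.actions)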

Value Function in Terms of Q Function (numeric example)

    \[Q^{*}\left(s, a_{1}\right)=10, \quad Q^{*}\left(s, a_{2}\right)=-1, \quad Q^{*}\left(s, a_{3}\right)=0, \quad Q^{*}\left(s, a_{4}\right)=11\]

    \[V^{*}(s)=11\]

Let:

    \[\begin{aligned} T\left(s, a_{1}, s^{\prime}\right) &=1 \\ R\left(s, a_{1}, s^{\prime}\right) &=5 \\ \gamma &=0.5 \end{aligned}\]

Then V^{*}\left(s^{\prime}\right)=10.
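Spelling out the arithmetic behind these two values, using the Bellman equations above and the fact that s^{\prime} is the only possible successor of (s, a_{1}) since T\left(s, a_{1}, s^{\prime}\right)=1:

    \[V^{*}(s)=\max_{a} Q^{*}(s, a)=\max (10,\,-1,\,0,\,11)=11\]

    \[Q^{*}\left(s, a_{1}\right)=T\left(s, a_{1}, s^{\prime}\right)\left(R\left(s, a_{1}, s^{\prime}\right)+\gamma V^{*}\left(s^{\prime}\right)\right) \;\Longrightarrow\; 10=1 \cdot\left(5+0.5\, V^{*}\left(s^{\prime}\right)\right) \;\Longrightarrow\; V^{*}\left(s^{\prime}\right)=10\]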

Value Iteration

The value iteration update rule is:

    \[V_{k+1}^{*}(s)=\max_{a}\left[\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left(R\left(s, a, s^{\prime}\right)+\gamma V_{k}^{*}\left(s^{\prime}\right)\right)\right]\]

where V_{k}^{*}(s) is the expected reward from state s after acting optimally for k steps. In the example considered here, a non-zero reward is obtained only in state s_{4} when transitioning to s_{5}, and \gamma=0.5. The third step of value iteration can be worked out as follows:

    \[\begin{aligned}
    V_{3}^{*}(1) &= 0+\gamma \, V_{2}^{*}(2) = 0+0.5 \cdot 0 = 0 \\
    V_{3}^{*}(2) &= 0+\gamma \, V_{2}^{*}(3) = 0+0.5 \cdot 0 = 0 \\
    V_{3}^{*}(3) &= 0+\gamma \, V_{2}^{*}(4) = 0+0.5 \cdot 0.5 = 0.25 \\
    V_{3}^{*}(4) &= 0+\gamma \, V_{2}^{*}(5) = 0+0.5 \cdot 1 = 0.5 \\
    V_{3}^{*}(5) &= V_{2}^{*}(5) = 1
    \end{aligned}\]
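A generic implementation of the value iteration update, as a sketch: it reuses the hypothetical MDP container from earlier and returns the table of V_{k}^{*}(s) values after a given number of sweeps, starting from V_{0}^{*}(s)=0. The exact transition and reward model behind the five-state example above comes from a figure not reproduced here, so the code is not tied to those particular numbers.

    # Value iteration:
    # V_{k+1}(s) = max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V_k(s')).
    def value_iteration(mdp, gamma, num_steps):
        V = {s: 0.0 for s in mdp.states}          # V_0(s) = 0 for every state
        for _ in range(num_steps):
            V = {
                s: max(
                    sum(
                        mdp.transition_prob(s, a, s2)
                        * (mdp.reward(s, a, s2) + gamma * V[s2])
                        for s2 in mdp.states
                    )
                    for a in mdp.actions
                )
                for s in mdp.states
            }
        return V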

Complexity of Value Iteration

Each iteration updates every state, and each update maximizes over all actions and sums over all possible successor states, so a single iteration of value iteration costs

    \[\mathcal{O}\left(|S|^{2} \cdot|A|\right)\]

Q-Value Iteration Update Rule

    \[Q_{k+1}^{*}(s, a)=\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left(R\left(s, a, s^{\prime}\right)+\gamma \max_{a^{\prime}} Q_{k}^{*}\left(s^{\prime}, a^{\prime}\right)\right)\]
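The same sketch, adapted to the Q-value update (again assuming the hypothetical dictionary-style MDP container from earlier); the greedy policy \pi^{*}(s)=\arg\max_{a} Q^{*}(s, a) can then be read off the resulting table.

    # Q-value iteration:
    # Q_{k+1}(s,a) = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * max_{a'} Q_k(s',a')).
    def q_value_iteration(mdp, gamma, num_steps):
        Q = {(s, a): 0.0 for s in mdp.states for a in mdp.actions}   # Q_0 = 0
        for _ in range(num_steps):
            Q = {
                (s, a): sum(
                    mdp.transition_prob(s, a, s2)
                    * (mdp.reward(s, a, s2)
                       + gamma * max(Q[(s2, a2)] for a2 in mdp.actions))
                    for s2 in mdp.states
                )
                for s in mdp.states
                for a in mdp.actions
            }
        return Q

    def greedy_policy(mdp, Q):
        # pi*(s) = argmax_a Q*(s, a)
        return {s: max(mdp.actions, key=lambda a: Q[(s, a)]) for s in mdp.states}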