What is the goal of reinforcement learning (RL), and how does it work?
The goal of RL is to learn a good policy with little or no supervision; instead of labeled examples, the agent learns from the rewards it collects while interacting with the environment.
RL Terminology
Markov decision process (MDP)
An MDP is defined by:
- a set of states $s \in S$
- a set of actions $a \in A$
- action-dependent transition probabilities $T(s, a, s') = P(s' \mid s, a)$, so that for each state $s$ and action $a$, $\sum_{s' \in S} T(s, a, s') = 1$
- reward functions $R(s, a, s')$, representing the reward for starting in state $s$, taking action $a$, and ending up in state $s'$ after one step. (The reward function may also depend only on $s$, or only on $s$ and $a$.)
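To make these components concrete, here is a minimal sketch in Python of how an MDP could be stored as plain data structures. The two-state MDP, its transition probabilities, and its rewards are made up purely for illustration; none of the names refer to a particular library.

```python
# A minimal sketch of an MDP as plain Python data structures.
# All names (states, actions, T, R) are illustrative.

states = ["s1", "s2"]            # the set of states S
actions = ["a1", "a2"]           # the set of actions A

# Transition probabilities T(s, a, s') = P(s' | s, a).
# For each (s, a), the probabilities over next states must sum to 1.
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}

# Reward function R(s, a, s'): reward for starting in s, taking a,
# and ending up in s' after one step; unlisted triples give reward 0.
R = {
    ("s1", "a1", "s2"): 1.0,
}

# Sanity check: each conditional distribution sums to 1.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```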
Markov Property
Let $s_0, s_1, s_2, \ldots$ be a discrete Markov chain with states in $S$. The Markov property states that:
- For any time step $t$, the distribution of the next state depends only on the current state: $P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t)$.
- For an MDP, the transition probabilities and the rewards depend only on the current state $s_t$ and the action $a_t$ taken in it, and not on any earlier states or actions.
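As a small illustration of the Markov property, the sketch below simulates a two-state Markov chain with made-up transition probabilities: sampling the next state only ever consults the current state, never the earlier history.

```python
import random

# Hypothetical transition matrix of a two-state Markov chain.
P = {
    "x": {"x": 0.9, "y": 0.1},
    "y": {"x": 0.5, "y": 0.5},
}

def step(current_state):
    """Sample the next state; only the current state is used,
    which is exactly what the Markov property asserts."""
    next_states = list(P[current_state].keys())
    probs = list(P[current_state].values())
    return random.choices(next_states, weights=probs)[0]

# Simulate a short trajectory; the history is recorded but never
# consulted when sampling the next state.
state, history = "x", []
for _ in range(10):
    history.append(state)
    state = step(state)
print(history)
```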

An AI agent navigates in the 3×3 grid depicted above, where the middle square is not accessible by the agent (and hence is greyed out).
The MDP is defined as follows:
- Every state s is defined by the current position of the agent in the grid (and is independent of its previous actions and positions).
- The actions $a$ are the 4 directions “up”, “down”, “left”, and “right”.
- The transition probabilities from state $s$ via action $a$ to state $s'$ are deterministic: $T(s, a, s') = 1$ if $s'$ is the cell reached by moving one step in direction $a$ from $s$ (the agent stays in place if the move would leave the grid or enter the inaccessible middle cell), and $T(s, a, s') = 0$ otherwise.
- The agent receives a reward of +1 for arriving at the top right cell and a reward of -1 for arriving at the cell immediately below it. It receives a reward of 0 at every other cell.
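The grid MDP above can be written down directly in code. The sketch below assumes a (row, column) coordinate convention with row 0 at the bottom, so the top-right cell is (2, 2) and the cell below it is (1, 2); it also treats transitions as deterministic, in line with the numerical example later in these notes.

```python
# A sketch of the 3x3 grid MDP described above. Coordinates are
# (row, col) with row 0 at the bottom; the middle cell (1, 1) is blocked.

BLOCKED = (1, 1)
states = [(r, c) for r in range(3) for c in range(3) if (r, c) != BLOCKED]
actions = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}

def transition(s, a):
    """Deterministic next state: move one cell in direction a,
    staying in place if the move leaves the grid or hits the blocked cell."""
    r, c = s
    dr, dc = actions[a]
    nxt = (r + dr, c + dc)
    return nxt if nxt in states else s

def reward(s_next):
    """+1 for arriving at the top-right cell, -1 for the cell below it."""
    if s_next == (2, 2):
        return 1.0
    if s_next == (1, 2):
        return -1.0
    return 0.0

print(len(states))                       # 8 unique states
print(transition((0, 2), "up"))          # (1, 2), the -1 cell
print(reward(transition((1, 2), "up")))  # +1 on reaching the top-right cell
```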
Markovian Setting
Let $s_0$ be any given state in this MDP. The agent takes actions $a_1, a_2, \ldots, a_k$ starting from state $s_0$, and as a result visits states $s_1, s_2, \ldots, s_k$ in that order. By the Markov property:
- Rewards collected after the $k$-th step do not depend on the previous states $s_0, s_1, \ldots, s_{k-1}$.
- Rewards collected after the $k$-th step can depend on the current state $s_k$.
- Rewards collected after the $k$-th step do not depend on the previous actions $a_1, a_2, \ldots, a_k$.
The total number of unique states in the MDP depicted by the grid above is 8: the nine cells of the 3×3 grid minus the one inaccessible middle cell.
Utility Function
The main problem for MDPs is to optimize the agent’s behavior. To do so, we first need to specify the criterion to maximize in terms of accumulated rewards, i.e. a utility function.
Finite horizon-based utility:
The utility function is the sum of rewards after acting for a fixed number $n$ of steps. For example, in the case when the rewards depend only on the states, the utility function is
$$U(s_0, s_1, \ldots, s_n) = \sum_{i=0}^{n} R(s_i)$$
for some fixed number of steps $n$.
- In particular, $U(s_0, s_1, \ldots, s_{n+m}) = U(s_0, s_1, \ldots, s_n)$ for any positive integer $m$: rewards collected after step $n$ do not change the utility.
(Infinite horizon) discounted reward based utility:
The utility function is the discounted sum of rewards
$$U = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots = \sum_{k=0}^{\infty} \gamma^k R(s_k),$$
where $0 \le \gamma < 1$ is the discount factor.
The action at state s that maximizes a finite horizon-based utility can depend on how many steps have been taken.
The action at state s that maximizes a discounted reward-based utility does not depend on how many steps have been taken.
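Both utility definitions are easy to compute for a given sequence of per-step rewards. The sketch below uses a made-up reward sequence and a discount factor of 0.5 purely for illustration.

```python
# Sketch: computing the two utilities for a hypothetical sequence of
# per-step rewards R(s_0), R(s_1), ... (rewards depend only on states).

rewards = [0.0, 0.0, -1.0, 1.0, 0.0]   # made-up example rewards

def finite_horizon_utility(rewards, n):
    """Sum of rewards for the first n+1 states; later rewards are ignored."""
    return sum(rewards[: n + 1])

def discounted_utility(rewards, gamma):
    """R(s_0) + gamma*R(s_1) + gamma^2*R(s_2) + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(finite_horizon_utility(rewards, 3))      # 0 + 0 - 1 + 1 = 0
print(finite_horizon_utility(rewards, 3 + 2))  # identical: extra steps don't count
print(discounted_utility(rewards, 0.5))        # -0.25 + 0.125 = -0.125
```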
Definition of Optimal Policy
Given an MDP and a utility function $U(s_0, s_1, \ldots)$, our goal is to find an optimal policy function that maximizes the expectation of the utility. Here, a policy is a function $\pi : S \to A$ that assigns an action $\pi(s) = a$ to any state $s$. We denote the optimal policy by $\pi^*$. The optimal policy assigns, at every state, an action that maximizes the expected utility.
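As a small illustration, a deterministic policy for the grid MDP can be stored as a plain mapping from states to actions; the action choices below are arbitrary and are not claimed to be the optimal policy $\pi^*$.

```python
# Sketch: a policy as a mapping from grid states (row, col) to actions.
# The particular action choices are made up for illustration only.
policy = {
    (0, 0): "up",    (0, 1): "right", (0, 2): "up",
    (1, 0): "up",                     (1, 2): "up",
    (2, 0): "right", (2, 1): "right",             # (2, 2) is the goal cell
}

def act(state):
    """Return the action this policy assigns to the given state."""
    return policy[state]

print(act((0, 2)))   # "up"
```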
MDP Example: Negative Living Reward
Optimal policy – Numerical Example
In this setup, the agent receives a reward (or penalty) of -10 for every action that it takes, on top of the +1 and -1 it receives when it reaches the corresponding cells. Since the agent always starts at the state $s_0$ and the outcome of each action is deterministic, the discounted reward depends only on the action sequence and can be written as
$$U = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots,$$
where $R_k$ is the total reward collected at the $k$-th step.
For the case $\gamma = 0$, the discounted reward is the same as the reward received after the first step. The agent would receive a reward of -10 if it chooses any one of the actions “Left”, “Down”, or “Right”. It would receive an additional reward of -1 (for a total of -11) if it chooses to go up. So the best discounted reward under this condition is -10.
For larger values of the discount factor, the best discounted reward would occur for the following sequence of actions starting from the initial state: “Up”, “Up”. The agent would receive rewards of -11 and -9 for these two actions respectively (-10 - 1 for the first step and -10 + 1 for the second). The agent would reach the top right state and come to a complete halt after taking the above sequence of actions. The discounted reward amounts to
$$-11 + \gamma \cdot (-9) = -11 - 9\gamma,$$
which equals, for example, $-15.5$ when $\gamma = 0.5$.
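The arithmetic in this example can be checked directly. The sketch below reuses the grid conventions from the earlier code, assumes the agent starts at the bottom-right cell (0, 2), applies the living reward of -10 per action, and halts the episode once the top-right cell is reached.

```python
# Sketch: discounted reward of an action sequence starting from the
# bottom-right cell (0, 2), with a living reward of -10 per action.

LIVING_REWARD = -10.0

def arrival_reward(s_next):
    return {(2, 2): 1.0, (1, 2): -1.0}.get(s_next, 0.0)

def move(s, a):
    deltas = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}
    blocked, (r, c) = (1, 1), s
    dr, dc = deltas[a]
    nxt = (r + dr, c + dc)
    inside = 0 <= nxt[0] <= 2 and 0 <= nxt[1] <= 2 and nxt != blocked
    return nxt if inside else s

def discounted_return(s0, action_seq, gamma):
    total, s = 0.0, s0
    for k, a in enumerate(action_seq):
        s = move(s, a)
        total += (gamma ** k) * (LIVING_REWARD + arrival_reward(s))
        if s == (2, 2):          # the agent halts at the top-right cell
            break
    return total

print(discounted_return((0, 2), ["left"], gamma=0.0))       # -10.0: best when gamma = 0
print(discounted_return((0, 2), ["up", "up"], gamma=0.5))   # -11 + 0.5*(-9) = -15.5
```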
Value Function
The value function V(s) of a given state s is the expected reward (i.e., the expectation of the utility function) if the agent acts optimally starting at state s. In the given MDP, since the action outcomes are deterministic, the expected reward simply equals the utility function.

Bellman Equations
The value function $V^*(s)$ is the expected reward from starting at state $s$ and acting optimally. The Q-function $Q^*(s, a)$ is the expected reward from starting at state $s$, taking action $a$, and acting optimally afterwards. The two are related by the Bellman equations:
$$V^*(s) = \max_a Q^*(s, a),$$
$$Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right].$$
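The sketch below evaluates one Bellman backup on a tiny made-up two-state MDP: `q_value` implements the second equation and `value` the first, given some current estimate of $V^*$.

```python
# Sketch of the Bellman relations V(s) = max_a Q(s, a) and
# Q(s, a) = sum_s' T(s, a, s') * (R(s, a, s') + gamma * V(s')).
# The tiny two-state MDP below is made up purely for illustration.

gamma = 0.9
states, actions = ["s1", "s2"], ["a1", "a2"]
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}
R = {("s1", "a1", "s2"): 1.0}      # all other triples give reward 0
V = {"s1": 0.0, "s2": 0.0}         # some current estimate of V*

def q_value(s, a):
    """Q(s, a): expected reward + discounted value of the next state."""
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, p in T[(s, a)].items())

def value(s):
    """V(s): best Q-value over all actions."""
    return max(q_value(s, a) for a in actions)

print(q_value("s1", "a1"))   # 0.8 * (1 + 0) = 0.8
print(value("s1"))           # max(0.8, 0.0) = 0.8
```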
Value Function in Terms of Q Function (numeric example)

Value Iteration
The value iteration update rule is:
$$V_{k+1}^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k^*(s') \right],$$
where $V_k^*(s)$ is the expected reward from state $s$ after acting optimally for $k$ steps, and the values are initialized as $V_0^*(s) = 0$ for all $s$. In the grid example, a non-zero reward is obtained only when transitioning into the top right cell (reward +1) or the cell immediately below it (reward -1). The 3rd step of value iteration can be worked out by applying the update rule with the values $V_2^*$ obtained after two updates.
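Putting the pieces together, the sketch below runs value iteration on the grid MDP with the +1/-1 arrival rewards. The discount factor of 0.5 is illustrative, the top-right cell is treated as terminal (the agent halts there), and a greedy policy is read off from the converged values at the end.

```python
# Sketch: value iteration on the 3x3 grid MDP (deterministic transitions,
# +1 / -1 rewards on arrival, top-right cell treated as terminal).
# Coordinates follow the earlier grid sketch: (row, col), row 0 at the bottom.

GAMMA = 0.5                      # illustrative discount factor
BLOCKED, GOAL, PENALTY = (1, 1), (2, 2), (1, 2)
states = [(r, c) for r in range(3) for c in range(3) if (r, c) != BLOCKED]
actions = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    nxt = (s[0] + actions[a][0], s[1] + actions[a][1])
    return nxt if nxt in states else s

def arrival_reward(s_next):
    return {GOAL: 1.0, PENALTY: -1.0}.get(s_next, 0.0)

V = {s: 0.0 for s in states}     # V_0(s) = 0 for all s
for k in range(100):             # repeated value iteration updates
    V_new = {}
    for s in states:
        if s == GOAL:            # terminal state: no further reward
            V_new[s] = 0.0
            continue
        # V_{k+1}(s) = max_a [ R(s, a, s') + gamma * V_k(s') ]  (deterministic T)
        V_new[s] = max(arrival_reward(step(s, a)) + GAMMA * V[step(s, a)]
                       for a in actions)
    V = V_new

# Greedy policy extracted from the converged values.
policy = {s: max(actions, key=lambda a: arrival_reward(step(s, a)) + GAMMA * V[step(s, a)])
          for s in states if s != GOAL}
print({s: round(v, 3) for s, v in V.items()})
print(policy)
```

Note that each update sweeps over every state, maximizes over every action, and (in general) sums over every possible next state, which is where the per-iteration cost of value iteration comes from.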
Complexity of Value Iteration