## What is the goal of reinforcement learning (RL), and how does it work?

The goal of RL is to learn a good policy with no or only limited supervision.

## RL Terminology

### Markov decision process (MDP)

- a set of states $s \in S$
- a set of actions $a \in A$
- action-dependent transition probabilities $T(s, a, s') = P(s' \mid s, a)$, so that for each state $s$ and action $a$, $\sum_{s' \in S} T(s, a, s') = 1$
- reward functions $R(s, a, s')$, representing the reward for starting in state $s$, taking action $a$, and ending up in state $s'$ after one step. (The reward function may also depend only on $s$, or only on $s$ and $a$.)

### Markov Property

Let $s_0, s_1, s_2, \dots$ be a discrete Markov chain with states in $S$.

- For all $t \ge 0$, $P(s_{t+1} \mid s_t, s_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t)$: the next state depends only on the current state, not on the earlier history.
- For all $t \ge 0$ and all $s, s' \in S$, $P(s_{t+1} = s' \mid s_t = s)$ does not depend on $t$: the chain is time-homogeneous.

An AI agent navigates in the 3×3 grid depicted above, where the middle square is not accessible by the agent (and hence is greyed out).

The MDP is defined as follows:

- Every state s is defined by the current position of the agent in the grid (and is independent of its previous actions and positions).
- The actions $a$ are the 4 directions "up", "down", "left", "right".
- The transition probability from state $s$ via action $a$ to state $s'$ is given by $T(s, a, s')$; in this example the outcome of each action is deterministic, so $T(s, a, s') = 1$ for the resulting cell and $0$ for all other cells.
- The agent receives a reward of $+1$ for arriving at the top right cell, and a reward of $-1$ for arriving at the cell immediately below it. It does not receive any non-zero reward at the other cells, as illustrated in the following figure.
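The grid MDP above can be sketched in code. This is a minimal sketch under stated assumptions: coordinates are `(col, row)` with `(0, 0)` the bottom-left cell, the blocked middle cell is `(1, 1)`, and a move into a wall or the blocked cell leaves the agent in place (the notes do not spell out the wall behavior, so that convention is ours).

```python
# 3x3 grid MDP; the middle cell is inaccessible.
BLOCKED = (1, 1)
STATES = [(x, y) for x in range(3) for y in range(3) if (x, y) != BLOCKED]
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(s, a):
    """Deterministic transition: returns the next state s'."""
    dx, dy = ACTIONS[a]
    nxt = (s[0] + dx, s[1] + dy)
    if nxt not in STATES:   # wall or blocked middle cell: stay in place
        return s
    return nxt

def reward(s_next):
    """Reward for *arriving* at a cell: +1 at the top right, -1 just below it."""
    if s_next == (2, 2):
        return 1.0
    if s_next == (2, 1):
        return -1.0
    return 0.0
```

Since the only excluded cell is the middle one, `len(STATES)` is 8, matching the state count given below.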

## Markovian Setting

Let $s$ be any given state in this MDP. The agent takes actions $a_1, a_2, \dots, a_n$ starting from state $s$ and as a result visits states $s_1, s_2, \dots, s_n$ in that order.

- Rewards collected after the $n$-th step do not depend on the previous states $s_1, \dots, s_{n-1}$.
- Rewards collected after the $n$-th step can depend on the current state $s_n$.
- Rewards collected after the $n$-th step do not depend on the previous actions $a_1, \dots, a_n$.

The total number of unique states in the MDP described above and depicted by the grid is $9 - 1 = 8$: the 3×3 grid has 9 cells, one of which (the middle cell) is inaccessible.

## Utility Function

The main problem for MDPs is to optimize the agent’s behavior.

Finite horizon-based utility:

The utility function is the sum of rewards after acting for a fixed number of $n$ steps. For example, in the case when the rewards depend only on the states, the utility function is

$$U = R(s_0) + R(s_1) + \dots + R(s_n)$$

for some fixed number of steps $n$.

(Infinite horizon) discounted reward-based utility:

$$U = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots = \sum_{t=0}^{\infty} \gamma^t R(s_t), \qquad 0 \le \gamma < 1.$$

In particular, $0 \le \gamma^t < 1$ for any positive integer $t$, so if the rewards are bounded in magnitude by $R_{\max}$, the utility is bounded by $R_{\max}/(1 - \gamma)$ even over an infinite horizon.
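The two utilities above can be sketched as short functions. The names `rewards` (the sequence $R(s_0), R(s_1), \dots$) and `gamma` are ours, not from the notes:

```python
def finite_horizon_utility(rewards, n):
    """Undiscounted sum of the rewards from the first n steps."""
    return sum(rewards[:n])

def discounted_utility(rewards, gamma):
    """Sum of gamma**t * R_t over all collected rewards, 0 <= gamma < 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For instance, `discounted_utility([1, 1, 1], 0.5)` evaluates $1 + 0.5 + 0.25 = 1.75$.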

The action at state s that maximizes a finite horizon-based utility can depend on how many steps have been taken.

The action at state s that maximizes a discount reward-based utility does not depend on how many steps have been taken.

## Definition of Optimal Policy

Given an MDP and a utility function $U$, our goal is to find an optimal policy function that maximizes the expectation of the utility. Here, a policy is a function $\pi: S \to A$ that assigns an action $\pi(s)$ to any state $s$. We denote the optimal policy by $\pi^*$. The optimal policy assigns, at every state, an action that maximizes the expected utility.

## MDP Example: Negative Living Reward

**Optimal policy – Numerical Example** In this setup, the agent receives a reward (or penalty) of $-10$ for every action that it takes, on top of the $+1$ and $-1$ when it reaches the corresponding cells. Since the agent always starts at the state $s_0$, and the outcome of each action is deterministic, the discounted reward depends only on the action sequence and can be written as

$$U = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots,$$

where $R_t$ is the total reward collected at the $t$-th step.

For the case $n = 1$, the discounted reward is the same as the reward received after the first step. The agent would receive a reward of $-10$ if it chooses any one of the actions "Left", "Down", "Right". It would receive an additional reward of $-1$ (for a total of $-11$) if it selects to go up. So, the best discounted reward under this condition would be $-10$.

For $n \ge 2$, the best discounted reward would occur for the following sequence of actions starting from the initial state: "Up", "Up". It would receive rewards of $-11$ and $-9$ for these two actions respectively ($-10 - 1$ for entering the cell below the goal, then $-10 + 1$ for reaching the goal). The agent would reach the top right state and come to a complete halt after taking the above sequence of actions. The discounted reward amounts to

$$-11 + \gamma \cdot (-9) = -11 - 9\gamma.$$
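The "Up", "Up" computation can be checked numerically. The notes leave $\gamma$ unspecified, so `gamma = 0.5` below is an arbitrary choice for illustration:

```python
# Per-step rewards for "Up", "Up": -10 - 1 = -11 (living penalty plus the
# -1 cell), then -10 + 1 = -9 (living penalty plus the +1 goal cell).
gamma = 0.5   # illustrative value; not specified in the notes
rewards = [-11, -9]
utility = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(utility)   # -11 + 0.5 * (-9) = -15.5
```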

**Value Function**

A value function $V(s)$ of a given state $s$ is the expected reward (i.e., the expectation of the utility function) if the agent acts optimally starting at state $s$. In the given MDP, since the action outcomes are deterministic, the expected reward simply equals the utility function.

## Bellman Equations

The value function $V^*(s)$ is the expected reward from starting at state $s$ and acting optimally. The Q-function $Q^*(s, a)$ is the expected reward from starting at state $s$, then acting with action $a$, and acting optimally afterwards. The two are related by the Bellman equations:

$$V^*(s) = \max_a Q^*(s, a)$$

$$Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$$

## Value Function in Terms of Q Function (numeric example)

The value function is obtained from the Q-function by maximizing over actions:

$$V^*(s) = \max_a Q^*(s, a).$$

The value of a state is the value of its best action at that state.
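As a numeric sketch of $V(s) = \max_a Q(s, a)$, take some hypothetical Q-values at a single state (the numbers below are made up for illustration; they are not from the notes):

```python
# Hypothetical Q-values Q(s, a) for one fixed state s.
Q = {"up": 0.5, "down": -0.2, "left": 0.1, "right": 0.9}

V = max(Q.values())             # V(s) = max_a Q(s, a)
best_action = max(Q, key=Q.get) # the arg-max gives the optimal action
print(V, best_action)           # 0.9 right
```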

## Value Iteration

The value iteration update rule is:

$$V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right],$$

where $V_k(s)$ is the expected reward from state $s$ after acting optimally for $k$ steps, initialized with $V_0(s) = 0$ for all $s$. In the grid example, a non-zero reward is obtained only when transitioning into the top right cell ($+1$) or the cell immediately below it ($-1$). The 3rd step of the value iteration is worked out by applying this update rule with $k = 2$.
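Value iteration on the grid can be sketched in a few lines. Assumptions (ours, where the notes are silent): deterministic moves, staying in place at walls and the blocked middle cell, the $+1$ and $-1$ cells treated as terminal with value 0, and $\gamma = 0.5$ chosen only for illustration:

```python
# Minimal value iteration on the 3x3 grid MDP (middle cell blocked).
GAMMA = 0.5                          # illustrative discount factor
BLOCKED, GOAL, PIT = (1, 1), (2, 2), (2, 1)
STATES = [(x, y) for x in range(3) for y in range(3) if (x, y) != BLOCKED]
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(s, a):
    """Deterministic move; stay in place at walls or the blocked cell."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in STATES else s

def reward(s_next):
    """+1 for entering the goal, -1 for entering the cell below it."""
    return {GOAL: 1.0, PIT: -1.0}.get(s_next, 0.0)

V = {s: 0.0 for s in STATES}         # V_0(s) = 0 for all s
for k in range(3):                   # three sweeps: V_1, V_2, V_3
    V = {s: 0.0 if s in (GOAL, PIT) else
            max(reward(step(s, a)) + GAMMA * V[step(s, a)] for a in MOVES)
         for s in STATES}
print(V)
```

After the third sweep, for example, the cell left of the goal has value $1$ (one step to the $+1$ cell), and the top-left cell has value $\gamma \cdot 1 = 0.5$.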

## Complexity of Value Iteration
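Each sweep of value iteration updates every state, and each update maximizes over all actions while summing over all possible next states, so a single iteration costs $O(|S|^2 |A|)$ time. A back-of-the-envelope count for the grid MDP above:

```python
# One value-iteration sweep touches every (s, a, s') triple:
# |S| states x |A| actions x |S| possible next states.
n_states, n_actions = 8, 4           # the 8-state grid MDP with 4 actions
ops_per_sweep = n_states ** 2 * n_actions
print(ops_per_sweep)                 # 8 * 8 * 4 = 256
```

In the deterministic grid each action has a single successor, so the inner sum collapses and a sweep is only $O(|S| |A|)$ there; the $O(|S|^2 |A|)$ bound is for general stochastic transitions.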