Machine Learning

Revisiting MDP Fundamentals

Policy Function and Value Function

The goal of the optimal policy function is to maximize the expected discounted reward, even if this means taking actions that lead to lower immediate next-step rewards from a few states. Recall from the previous lecture that, for all s, the (optimal) value function is:

    \[V^{*}(s)=\max_{a} Q^{*}(s, a)=\max_{a}\left[\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left(R\left(s, a, s^{\prime}\right)+\gamma V^{*}\left(s^{\prime}\right)\right)\right], \quad \text{where } 0 \leq \gamma<1\]
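
As a concrete illustration, here is a minimal value iteration sketch in Python that repeatedly applies this Bellman optimality update (representing T and R as NumPy arrays indexed by (s, a, s') is an assumption made for this example, not something fixed by the lecture):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Compute V*(s) by iterating V(s) <- max_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s')).

    T, R: arrays of shape (n_states, n_actions, n_states); gamma in [0, 1).
    """
    n_states = T.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_s' T(s, a, s') * (R(s, a, s') + gamma * V(s'))
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
        V_new = Q.max(axis=1)                  # V*(s) = max_a Q*(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # value function and a greedy policy
        V = V_new
```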

Estimating Transition Probabilities and Rewards

What might be some issue(s) with trying to estimate \widehat{T}, \hat{R} in the following manner?

    \[\begin{aligned}\hat{T}\left(s, a, s^{\prime}\right) &=\frac{\operatorname{count}\left(s, a, s^{\prime}\right)}{\sum_{s^{\prime}} \operatorname{count}\left(s, a, s^{\prime}\right)} \\ \hat{R}\left(s, a, s^{\prime}\right) &=\frac{\sum_{t=1}^{\operatorname{count}\left(s, a, s^{\prime}\right)} R_{t}\left(s, a, s^{\prime}\right)}{\operatorname{count}\left(s, a, s^{\prime}\right)}\end{aligned}\]

  • Certain states might not be visited at all while collecting the statistics for \hat{T}, \hat{R}
  • Certain states might be visited much less often than others, leading to very noisy estimates of \widehat{T}, \hat{R}
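
As a sketch of how these count-based estimates could be computed in practice (the list-of-tuples data format and function name below are illustrative assumptions), note that any (s, a) pair absent from the data gets no estimate at all, and rarely visited pairs produce noisy ones:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T_hat(s, a, s') and R_hat(s, a, s') from observed (s, a, r, s') tuples."""
    counts = defaultdict(int)         # count(s, a, s')
    reward_sums = defaultdict(float)  # running sum of rewards observed for (s, a, s')
    sa_totals = defaultdict(int)      # sum_s' count(s, a, s')

    for s, a, r, s_next in transitions:
        counts[(s, a, s_next)] += 1
        reward_sums[(s, a, s_next)] += r
        sa_totals[(s, a)] += 1

    T_hat = {key: c / sa_totals[key[:2]] for key, c in counts.items()}
    R_hat = {key: reward_sums[key] / c for key, c in counts.items()}
    # (s, a) pairs never seen in `transitions` have no entries here, and pairs seen
    # only a few times rest on tiny samples -- exactly the two issues listed above.
    return T_hat, R_hat
```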

Model Free vs Model Based Approaches

Say we want to estimate the expectation of a function f(x).

    \[\mathbb{E}[f(x)]=\sum_{x} p(x) \cdot f(x)\]


A Model Based Approach works by sampling K points from the distribution P(X) and first estimating the distribution itself,

    \[\hat{p}(x)=\frac{\operatorname{count}(x)}{K}\]


before estimating the expectation of f(X) as follows:

    \[\mathbb{E}[f(X)] \approx \sum_{x} \hat{p}(x) f(x)\]
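
For concreteness, a small numerical sketch of the model-based estimate (the particular distribution over {0, 1, 2, 3} and the choice f(x) = x² are assumptions made purely for illustration):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Illustrative setup: X takes values 0..3 with true distribution p, and f(x) = x^2.
p_true = np.array([0.1, 0.2, 0.3, 0.4])

def f(x):
    return x ** 2

K = 10_000
samples = rng.choice(4, size=K, p=p_true)

# Model-based: first build p_hat(x) = count(x) / K, then plug it into the sum over x.
counts = Counter(samples)
p_hat = np.array([counts[x] / K for x in range(4)])
estimate = sum(p_hat[x] * f(x) for x in range(4))
print(estimate)  # close to the true E[f(X)] = 5.0 for this p
```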


A Model Free Approach would sample K points from P(X) and directly estimate the expectation of f(X) as follows:

    \[\mathbb{E}[f(X)] \approx \frac{\sum_{i=1}^{K} f\left(X_{i}\right)}{K}\]
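
Under the same illustrative setup, the model-free estimate averages f over the samples directly, skipping the intermediate \hat{p}(x) entirely:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative setup: X in {0, 1, 2, 3} with true distribution p, f(x) = x^2.
p_true = np.array([0.1, 0.2, 0.3, 0.4])

def f(x):
    return x ** 2

K = 10_000
samples = rng.choice(4, size=K, p=p_true)

# Model-free: no density estimate, just the sample average of f.
estimate = np.mean(f(samples))
print(estimate)  # also close to the true E[f(X)] = 5.0
```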

Exploration vs Exploitation

Examples of exploration:

  • Try a new restaurant as opposed to going to your favorite one
  • Play a random move in a game as opposed to playing the move that you believe is best
  • Take a different path to work as opposed to the one that you believe is the fastest

Epsilon-Greedy Approach

Epsilon controls the exploration aspect of the RL algorithm: the higher the value of ε, the higher the chance that the agent takes a random action during the learning phase, and hence the higher the chance that it explores new states and actions. As the agent learns to act well and has sufficiently explored its environment, ε should be decayed so that the value and Q-function samples become less noisy, with some of the randomness in the agent's policy eliminated.
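
A minimal sketch of epsilon-greedy action selection with decay (the Q-table layout, decay schedule, and parameter values here are illustrative assumptions, not values prescribed by the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon take a random action (explore); otherwise act greedily (exploit)."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniformly random action
    return int(np.argmax(Q[s]))               # exploit: argmax_a Q[s, a]

# Decay epsilon as learning proceeds, so later value / Q samples become less noisy.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1_000):
    # ... run one episode, selecting actions via epsilon_greedy(Q, s, epsilon) ...
    epsilon = max(epsilon_min, epsilon * decay)
```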
