Machine Learning

Revisiting MDP Fundamentals

Policy Function and Value Function

The goal of the optimal policy is to maximize the expected discounted reward, even if this means taking actions that yield lower immediate rewards from some states. Recall from the previous lecture that for all s, the optimal value function is:

V^{*}(s)=\max_{a} Q^{*}(s, a)
=\max_{a}\left[\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left(R\left(s, a, s^{\prime}\right)+\gamma V^{*}\left(s^{\prime}\right)\right)\right], \quad \text{where } 0 \leq \gamma<1
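This recursion can be solved by value iteration: repeatedly apply the Bellman update until V stops changing. A minimal sketch in Python, assuming the MDP is given as NumPy arrays `T[s, a, s']` and `R[s, a, s']` (names chosen here for illustration):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Compute V*(s) by iterating the Bellman optimality update.

    T: array of shape (S, A, S') with transition probabilities T(s, a, s')
    R: array of shape (S, A, S') with rewards R(s, a, s')
    """
    n_states = T.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * V(s'))
        Q = np.einsum("ijk,ijk->ij", T, R + gamma * V)
        V_new = Q.max(axis=1)  # V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because 0 ≤ γ < 1, each sweep is a contraction, so the iterates converge to the unique fixed point V*.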

Estimating Transition Probabilities and Rewards

What might be some issue(s) with trying to estimate \hat{T}, \hat{R} in the following manner?

    \[\begin{aligned}\hat{T} &=\frac{\operatorname{count}\left(s, a, s^{\prime}\right)}{\sum_{s^{\prime}} \operatorname{count}\left(s, a, s^{\prime}\right)} \\ \hat{R} &=\frac{\sum_{t=1}^{\operatorname{count}\left(s, a, s^{\prime}\right)} R_{t}\left(s, a, s^{\prime}\right)}{\operatorname{count}\left(s, a, s^{\prime}\right)}\end{aligned}\]

  • Certain states might not be visited at all while collecting the statistics for \hat{T}, \hat{R}
  • Certain states might be visited much less often than others, leading to very noisy estimates of \hat{T}, \hat{R}
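For concreteness, here is one way the count-based estimates could be computed from a log of observed transitions; the `(s, a, s', r)` tuple format is an assumption for illustration, not something fixed by the notes:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T_hat and R_hat from observed (s, a, s_next, r) tuples.

    Returns two dicts keyed by (s, a, s'): the empirical transition
    probabilities and the average observed rewards.
    """
    counts = defaultdict(int)        # count(s, a, s')
    reward_sums = defaultdict(float) # sum of rewards seen for (s, a, s')
    visits = defaultdict(int)        # sum_{s'} count(s, a, s')
    for s, a, s_next, r in transitions:
        counts[(s, a, s_next)] += 1
        reward_sums[(s, a, s_next)] += r
        visits[(s, a)] += 1
    T_hat = {k: c / visits[(k[0], k[1])] for k, c in counts.items()}
    R_hat = {k: reward_sums[k] / c for k, c in counts.items()}
    return T_hat, R_hat
```

Note that any (s, a, s') never seen in the log simply has no entry at all, which is exactly the first issue listed above.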

Model Free vs Model Based Approaches

Say we want to estimate the expectation of a function f(x).

    \[\mathbb{E}[f(x)]=\sum_{x} p(x) \cdot f(x)\]

The Model Based Approach works by sampling K points from the distribution P(X) and first estimating the distribution itself,

    \[\hat{p}(x)=\frac{\operatorname{count}(x)}{K}\]

before estimating the expectation of f(X) as follows:

    \[\mathbb{E}[f(X)] \approx \sum_{x} \hat{p}(x) f(x)\]

The Model Free Approach would sample K points from P(X) and directly estimate the expectation of f(X) as follows:

    \[\mathbb{E}[f(X)] \approx \frac{\sum_{i=1}^{K} f\left(X_{i}\right)}{K}\]
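The two routes can be compared on the same draws. If the model-based route uses the plain empirical counts count(x)/K for p̂(x), the two numbers coincide exactly on a fixed sample; the distinction matters once the learned model is reused for further computation (as with \hat{T}, \hat{R} in an MDP). A small sketch with a made-up three-valued distribution:

```python
import random
from collections import Counter

random.seed(0)
f = lambda x: x * x
samples = [random.choice([0, 1, 2]) for _ in range(10000)]  # K draws from P(X)
K = len(samples)

# Model-based: first estimate p_hat(x) from counts, then sum p_hat(x) * f(x)
p_hat = {x: c / K for x, c in Counter(samples).items()}
model_based = sum(p * f(x) for x, p in p_hat.items())

# Model-free: average f directly over the same samples
model_free = sum(f(x) for x in samples) / K
```

With X uniform on {0, 1, 2}, both estimates should land near the true value E[X²] = (0 + 1 + 4)/3 ≈ 1.67.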

Exploration vs Exploitation

Examples of exploration:

  • Try a new restaurant as opposed to going to your favorite one
  • Play a random move in a game as opposed to playing the move that you believe is best
  • Take a different path to work as opposed to the one that you believe is the fastest

Epsilon-Greedy Approach

Epsilon (\epsilon) controls the exploration aspect of the RL algorithm: the higher the value of \epsilon, the more likely the agent is to take a random action during the learning phase, and hence the more likely it is to explore new states and actions. As the agent learns to act well and has sufficiently explored its environment, \epsilon should be decayed so that the value and Q-function samples become less noisy, with some of the randomness in the agent’s policy eliminated.
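A minimal sketch of the action-selection rule and a decay schedule; the multiplicative decay rate and floor are illustrative choices, not prescribed by the notes:

```python
import random

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon take a random action, else the greedy one.

    Q: dict-like, Q[s] is a list of action values for state s.
    """
    if random.random() < epsilon:
        return random.randrange(len(Q[s]))   # explore
    return max(range(len(Q[s])), key=lambda a: Q[s][a])  # exploit

def decay(epsilon, rate=0.995, floor=0.01):
    """Shrink epsilon multiplicatively each episode toward a small floor."""
    return max(floor, epsilon * rate)
```

The floor keeps a little exploration alive late in training; setting it to 0 recovers a purely greedy policy in the limit.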

