Policy Function and Value Function
The goal of the optimal policy function is to maximize the expected discounted reward, even if this means taking actions that lead to lower immediate next-step rewards from a few states. Recall from the previous lecture that, for all $s$, the (optimal) value function is

$$V^*(s) = \max_a Q^*(s, a),$$

where

$$Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right].$$
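One standard way to turn this recursion into an algorithm is value iteration, which repeatedly applies the update above until the values stop changing. Below is a minimal sketch, assuming the transition probabilities `T[s, a, s']`, rewards `R[s, a, s']`, and discount `gamma` are given as arrays; these inputs and the function name are illustrative, not part of the lecture.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Compute V* by repeatedly applying the Bellman optimality update.

    T[s, a, s'] : probability of reaching s' from s under action a (assumed input)
    R[s, a, s'] : reward collected on that transition (assumed input)
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q*(s, a) = sum_{s'} T(s, a, s') [ R(s, a, s') + gamma * V(s') ]
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
        V_new = Q.max(axis=1)                  # V*(s) = max_a Q*(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # optimal values and a greedy policy
        V = V_new
```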
Estimating Transition Probabilities and Rewards
What might be some issue(s) with trying to estimate $T(s, a, s')$ and $R(s, a, s')$ from observed transitions in the following count-based manner (a code sketch follows the list below)?

$$\hat{T}(s, a, s') = \frac{\text{count}(s, a, s')}{\sum_{s''} \text{count}(s, a, s'')}, \qquad \hat{R}(s, a, s') = \text{average reward observed on transitions } (s, a, s')$$

- Certain states might not be visited at all while collecting the statistics for $\hat{T}$ and $\hat{R}$
- Certain states might be visited much less often than others, leading to very noisy estimates of $\hat{T}$ and $\hat{R}$
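A minimal sketch of such count-based estimation, assuming the collected experience is given as a list of `(s, a, r, s_next)` tuples (this data format and the helper names are assumptions for illustration):

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T_hat and R_hat from observed (s, a, r, s_next) tuples."""
    counts = defaultdict(int)        # counts[(s, a, s_next)]
    reward_sums = defaultdict(float) # sum of rewards seen on (s, a, s_next)
    totals = defaultdict(int)        # totals[(s, a)] = number of times (s, a) was tried

    for s, a, r, s_next in transitions:
        counts[(s, a, s_next)] += 1
        reward_sums[(s, a, s_next)] += r
        totals[(s, a)] += 1

    def T_hat(s, a, s_next):
        # Undefined (here: 0.0) for state-action pairs that were never visited,
        # and noisy for pairs visited only a few times.
        return counts[(s, a, s_next)] / totals[(s, a)] if totals[(s, a)] else 0.0

    def R_hat(s, a, s_next):
        c = counts[(s, a, s_next)]
        return reward_sums[(s, a, s_next)] / c if c else 0.0

    return T_hat, R_hat
```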
Model Free vs Model Based Approaches
Say we want to estimate the expectation of a function $f(x)$, where $x \sim P$.

The Model Based Approach works by sampling $K$ points $x_1, \dots, x_K$ from the distribution $P$ and estimating

$$\hat{P}(x) = \frac{\text{count}(x)}{K}$$

before estimating the expectation of $f$ as follows:

$$\mathbb{E}[f(x)] \approx \sum_x \hat{P}(x) \, f(x).$$

The Model Free Approach would sample $K$ points $x_1, \dots, x_K$ from $P$ and directly estimate the expectation of $f$ as follows:

$$\mathbb{E}[f(x)] \approx \frac{1}{K} \sum_{i=1}^{K} f(x_i).$$
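A small numerical sketch of the two estimators, using a hypothetical discrete distribution over $\{0, 1, 2\}$ and $f(x) = x^2$ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete distribution P and test function f (illustrative choices).
support = np.array([0, 1, 2])
P = np.array([0.2, 0.5, 0.3])
f = lambda x: x ** 2

K = 10_000
samples = rng.choice(support, size=K, p=P)

# Model based: first estimate P_hat from counts, then take the expectation under P_hat.
P_hat = np.bincount(samples, minlength=len(support)) / K
model_based = np.sum(P_hat * f(support))

# Model free: average f directly over the samples, never estimating P.
model_free = np.mean(f(samples))

true_value = np.sum(P * f(support))
print(model_based, model_free, true_value)  # both estimates are close to 1.7
```

With this simple count estimate of $\hat{P}$ the two numbers coincide exactly; the distinction matters more in RL, where the estimated model $(\hat{T}, \hat{R})$ is reused for planning rather than for a single expectation.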
Exploration vs Exploitation
Examples of exploration:
- Try a new restaurant as opposed to going to your favorite one
- Play a random move in a game as opposed to playing the move that you believe is best
- Take a different path to work as opposed to the one that you believe is the fastest
Epsilon-Greedy Approach
Epsilon ($\epsilon$) controls the exploration aspect of the RL algorithm: the higher the value of $\epsilon$, the higher the chance that the agent takes a random action during the learning phase, and hence the higher the chance that it explores new states and actions. As the agent learns to act well and has sufficiently explored its environment, $\epsilon$ should be decayed so that the value and Q-function samples become less noisy, with some of the randomness in the agent's policy eliminated.
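A minimal epsilon-greedy action-selection sketch with a simple decay schedule; the Q-table shape, the decay rate, and the placeholder state are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, epsilon, n_actions):
    """With probability epsilon explore (random action); otherwise exploit argmax_a Q[state, a]."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

# Hypothetical decay schedule: start fully exploratory, decay toward a small floor.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
Q = np.zeros((10, 4))                         # e.g. 10 states, 4 actions (assumed sizes)
state = 0                                     # placeholder state for illustration

for episode in range(1000):
    action = epsilon_greedy_action(Q, state, epsilon, n_actions=4)
    # ... run the episode, updating Q from the observed rewards ...
    epsilon = max(epsilon_min, epsilon * decay)
```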