January 8th, 2019

Day 22: Incremental Implementation of Monte Carlo Off-policy Methods

In this blog, we will discuss how Monte Carlo off-policy algorithms can be implemented incrementally, that is, how we can update the value estimates episode by episode without keeping the returns of all previous episodes in memory.


To this end, for ordinary importance sampling, the new estimate is the previous estimate plus the difference between the current episode's importance-weighted return and that estimate, divided by the number of episodes observed so far. In other words, the estimate is simply a running sample average of the weighted returns.
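In symbols (following the notation of Sutton and Barto, where G_n is the return of the n-th episode and W_n is its importance-sampling ratio), this is the familiar incremental sample-average update:

    V_{n+1} = V_n + (1/n) * (W_n * G_n - V_n)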

For weighted importance sampling, on the other hand, the new estimate is the previous estimate plus the difference between the current episode's return and that estimate, multiplied by the ratio of the current episode's importance-sampling weight to the cumulative sum of all the weights so far.
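In symbols, writing C_n = W_1 + W_2 + ... + W_n for the cumulative sum of the weights, the weighted update is:

    V_{n+1} = V_n + (W_n / C_n) * (G_n - V_n)

If C_n is still zero (every weight so far has been zero), the estimate is simply left unchanged.

The following is a minimal Python sketch of the two incremental updates for a single state. It is not the book's pseudo-code, and the class and variable names are my own; it assumes that for each finished episode we have already computed its return G and its importance-sampling ratio W (the product of pi(a|s)/b(a|s) over the episode's steps).

    class OrdinaryISEstimator:
        """Incremental ordinary importance sampling: a running average of W * G."""

        def __init__(self):
            self.value = 0.0  # current estimate V_n
            self.n = 0        # number of episodes processed so far

        def update(self, G, W):
            self.n += 1
            # V_{n+1} = V_n + (1/n) * (W_n * G_n - V_n)
            self.value += (W * G - self.value) / self.n
            return self.value


    class WeightedISEstimator:
        """Incremental weighted importance sampling: step size W_n / C_n."""

        def __init__(self):
            self.value = 0.0  # current estimate V_n
            self.C = 0.0      # cumulative sum of weights C_n

        def update(self, G, W):
            self.C += W
            if self.C > 0:
                # V_{n+1} = V_n + (W_n / C_n) * (G_n - V_n)
                self.value += (W / self.C) * (G - self.value)
            return self.value

For example, feeding both estimators the same (return, weight) pairs episode by episode:

    ordinary, weighted = OrdinaryISEstimator(), WeightedISEstimator()
    for G, W in [(1.0, 0.5), (0.0, 2.0), (1.0, 1.0)]:  # hypothetical (G, W) pairs
        ordinary.update(G, W)
        weighted.update(G, W)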


The following pseudo-code summarizes the incremental off-policy evaluation method, as presented in Sutton's RL book: