January 9th, 2019

Day 23: Off-policy Monte Carlo Control

Unlike on-policy Monte Carlo methods, in which we evaluate and improve the same policy that is used to generate the episodes, off-policy methods keep these two roles separate: one policy generates the data while a different policy is learned.

 

In the off-policy method, we generate episodes with a behavior policy and apply either ordinary or weighted importance sampling to estimate the value of every state-action pair. The target policy is then the greedy policy with respect to these estimates. Note that we do not replace the behavior policy with the derived greedy policy after every episode.
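
To make the difference between the two estimators concrete, here is a small Python sketch; the function and argument names are only illustrative and not part of the algorithm:

```python
def is_estimates(returns, ratios):
    """Ordinary vs. weighted importance-sampling estimates of one state's value.

    `returns` holds the returns G observed from that state under the behavior
    policy; `ratios` holds the matching importance-sampling ratios rho
    (products of pi(a|s) / b(a|s) over the rest of each episode).
    """
    weighted_returns = [rho * g for rho, g in zip(ratios, returns)]
    ordinary = sum(weighted_returns) / len(returns)   # unbiased, can have huge variance
    total_rho = sum(ratios)
    # Weighted estimator: biased but much lower variance; 0 if every ratio is 0.
    weighted = sum(weighted_returns) / total_rho if total_rho > 0 else 0.0
    return ordinary, weighted
```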

 

One very important point to keep in mind is that the behavior policy must cover all of the actions, in other words, it must be a soft policy that assigns nonzero probability to every action. This coverage is what guarantees convergence to the optimal policy with off-policy methods. However, a very soft behavior policy can make learning slow, because the method only learns from the tails of episodes after the last non-greedy action, so states appearing in the early portions of long episodes are updated rarely.
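
For example, an epsilon-greedy behavior policy is soft: every action keeps a probability of at least epsilon / |A|. A quick sketch of the corresponding action probabilities (the helper name is mine):

```python
def epsilon_greedy_probs(q_values, epsilon=0.1):
    """Action probabilities of an epsilon-greedy behavior policy for one state.

    Every action gets at least epsilon / |A| probability, so the policy is soft
    and covers every action the greedy target policy might take.
    """
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs
```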

The following pseudo-code summarizes an off-policy MC controller using weighted importance sampling. For more details, check Sutton and Barto's RL book.
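
A rough Python sketch of that algorithm is below. The env object, with reset(), step(action) returning (next_state, reward, done), and a finite actions list, is an assumed interface for illustration, not something prescribed by the book:

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes=10_000, gamma=1.0, epsilon=0.1):
    """Off-policy MC control with weighted importance sampling (sketch)."""
    Q = defaultdict(lambda: defaultdict(float))   # Q[state][action]
    C = defaultdict(lambda: defaultdict(float))   # cumulative importance weights
    target = {}                                   # greedy target policy

    def greedy(state):
        qs = Q[state]
        return max(env.actions, key=lambda a: qs[a])

    for _ in range(num_episodes):
        # Generate an episode with an epsilon-greedy (soft) behavior policy,
        # recording the behavior probability of each chosen action.
        episode = []
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = greedy(state)
            b_prob = epsilon / len(env.actions)
            if action == greedy(state):
                b_prob += 1.0 - epsilon
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, b_prob))
            state = next_state

        # Walk backwards through the episode, accumulating the return G
        # and the importance weight W.
        G, W = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            target[state] = greedy(state)
            if action != target[state]:
                break              # the weight would be zero for earlier steps
            W *= 1.0 / b_prob      # target policy is deterministic, so pi(a|s) = 1

    return Q, target
```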