
Today I coded a Monte Carlo method for policy evaluation.

Discounting-aware importance sampling and per-decision importance sampling are two refinements that help reduce the variance of importance-sampling estimates.
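To make the per-decision idea concrete, here is a minimal sketch (the function name and episode format are illustrative, not from the original post): each reward is weighted only by the importance ratios of the actions taken *up to* that reward, rather than by the full-episode ratio, which is what lowers the variance.

```python
def per_decision_return(rewards, rhos, gamma=1.0):
    """Per-decision importance-sampled return.

    rewards[k] is the reward R_{k+1}; rhos[k] = pi(A_k|S_k) / b(A_k|S_k).
    Returns sum_k gamma^k * (rhos[0] * ... * rhos[k]) * rewards[k], so each
    reward is scaled only by the ratios of the decisions that preceded it.
    """
    g, rho_prod = 0.0, 1.0
    for k, (r, rho) in enumerate(zip(rewards, rhos)):
        rho_prod *= rho          # ratio product up to and including step k
        g += (gamma ** k) * rho_prod * r
    return g
```

For example, with `rewards=[1.0, 0.0, 0.0]` and `rhos=[2.0, 0.5, 0.5]`, the first reward is scaled only by 2.0 instead of the full-episode product 0.5.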

Unlike on-policy Monte Carlo methods, in which we estimate a policy's value while using that same policy for control, off-policy methods handle these two roles with separate policies.

In this blog post, we will discuss how off-policy Monte Carlo algorithms can be implemented incrementally, i.e., episode by episode, without keeping track of the returns from all past episodes.
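The incremental idea can be sketched as follows, using weighted importance sampling: for each state we keep a value estimate `V` and a cumulative weight `C`, and fold each new weighted return in with a single update, so no per-episode returns need to be stored. The function and the `(state, action, reward)` episode format are assumptions for illustration.

```python
from collections import defaultdict

def evaluate_incremental(episodes, target_pi, behavior_b, gamma=1.0):
    """Incremental off-policy MC evaluation with weighted importance sampling.

    episodes: iterable of episodes, each a list of (state, action, reward).
    target_pi(a, s), behavior_b(a, s): action probabilities of the two policies.
    """
    V = defaultdict(float)   # value estimate per state
    C = defaultdict(float)   # cumulative importance weight per state
    for episode in episodes:
        g, w = 0.0, 1.0
        for s, a, r in reversed(episode):   # work backwards through time
            g = gamma * g + r               # return from this step onward
            w *= target_pi(a, s) / behavior_b(a, s)
            if w == 0.0:                    # target policy would never act this way
                break
            C[s] += w
            V[s] += (w / C[s]) * (g - V[s])  # incremental weighted-IS update
    return V
```

Because the update is a running weighted average, each episode is processed once and discarded, which is exactly the episode-by-episode behavior described above.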


In off-policy prediction methods, we use data generated by a policy other than ("off") the target policy to learn the target policy's values. The policy we want to learn about is called the target policy, and the external policy that generates the data is called the behavior policy. The only data we have comes from the behavior policy, and from this data we want to estimate values for the target policy.
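The correction that makes this possible is the importance-sampling ratio: the probability of the episode's action sequence under the target policy divided by its probability under the behavior policy. A minimal sketch, assuming both policies can be queried for action probabilities (names here are illustrative):

```python
def importance_ratio(episode, target_pi, behavior_b):
    """Full-episode importance-sampling ratio.

    rho = prod over steps t of target_pi(A_t|S_t) / behavior_b(A_t|S_t),
    for an episode given as a list of (state, action, reward) tuples.
    """
    rho = 1.0
    for s, a, _r in episode:
        rho *= target_pi(a, s) / behavior_b(a, s)
    return rho
```

Multiplying a return observed under the behavior policy by this ratio yields an unbiased estimate of the return under the target policy, which is the basic building block of the methods above.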