
The major disadvantage of dynamic programming, Monte Carlo, and temporal-difference methods is that they all require estimating a value function. Such a tabular value function grows with the number of states and actions, so the solution to this problem is to use function approximators for the value function.
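As a minimal sketch of this idea, the value function can be represented as a linear combination of features instead of a table. The feature size, step size, and one-hot feature map below are illustrative assumptions, not part of the original post:

```python
import numpy as np

n_features = 8
w = np.zeros(n_features)  # weights replace the per-state value table
alpha = 0.1               # step size (illustrative)

def features(state):
    """Hypothetical one-hot features for a small discrete state space."""
    x = np.zeros(n_features)
    x[state % n_features] = 1.0
    return x

def v_hat(state):
    """Approximate value: v(s) ~ w . x(s)."""
    return w @ features(state)

def td_update(state, reward, next_state, gamma=0.9):
    """Semi-gradient TD(0) update on the weights."""
    global w
    target = reward + gamma * v_hat(next_state)
    w += alpha * (target - v_hat(state)) * features(state)

td_update(state=3, reward=1.0, next_state=4)
```

With one-hot features this reduces to tabular TD(0), but the same update works for any feature map, which is what lets the representation stay fixed as the state space grows.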


In the tree-backup algorithm, the update is from the estimated action values of the leaf nodes of the tree. The action nodes in the interior, corresponding to the actual actions taken, do not participate. Each leaf node contributes to the target with a weight proportional to its probability of occurring under the target policy.
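The one-step case of this weighting can be sketched as follows; the action values and target-policy probabilities are illustrative numbers, not from the post:

```python
import numpy as np

def tree_backup_target(reward, next_q, target_pi, gamma=0.9):
    """One-step tree-backup target:
    G = R + gamma * sum_a pi(a|S') * Q(S', a).
    Each leaf action value is weighted by its probability
    under the target policy."""
    return reward + gamma * np.dot(target_pi, next_q)

next_q = np.array([1.0, 2.0, 0.0])       # Q(S', a) for each action
target_pi = np.array([0.5, 0.25, 0.25])  # pi(a | S')
g = tree_backup_target(1.0, next_q, target_pi)
# G = 1.0 + 0.9 * (0.5*1.0 + 0.25*2.0 + 0.25*0.0) = 1.9
```

In the n-step version, the term for the action actually taken is replaced by a deeper target of the same form, so only the untaken leaf actions contribute their current estimates at each level.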

In n-step off-policy learning, we define the importance sampling ratio between the target policy and the behavior policy. Using this ratio and the rewards from n steps ahead, we can update the state-action values and find an optimal policy.
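The ratio itself is just a product of per-step probability ratios over the n-step window; a small sketch with illustrative probabilities:

```python
import numpy as np

def importance_ratio(pi_probs, b_probs):
    """Importance sampling ratio over an n-step window:
    rho = prod_t pi(A_t | S_t) / b(A_t | S_t)."""
    return float(np.prod(np.asarray(pi_probs) / np.asarray(b_probs)))

# Two steps: the target policy favors the first action taken,
# the behavior policy chose both with probability 0.5.
rho = importance_ratio(pi_probs=[0.9, 0.5], b_probs=[0.5, 0.5])
# rho = (0.9/0.5) * (0.5/0.5) = 1.8
```

The n-step return is then scaled by this ratio before updating, which corrects for the fact that the data was generated by the behavior policy rather than the target policy.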

January 30th, 2019

Day 38: n-step Sarsa

In the n-step Sarsa algorithm, we use the rewards from n steps ahead to estimate the value of every state-action pair.
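The target for that update can be sketched as the sum of n discounted rewards plus a bootstrapped action value n steps ahead; the rewards and bootstrap value below are illustrative:

```python
def n_step_sarsa_target(rewards, q_boot, gamma=0.9):
    """n-step Sarsa target:
    G = sum_k gamma^k * R_{t+k+1} + gamma^n * Q(S_{t+n}, A_{t+n})."""
    g = sum(gamma**k * r for k, r in enumerate(rewards))
    return g + gamma**len(rewards) * q_boot

# Three observed rewards, then bootstrap from Q at the pair 3 steps ahead.
g = n_step_sarsa_target(rewards=[1.0, 0.0, 2.0], q_boot=5.0)
# G = 1.0 + 0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```

Updating Q(S_t, A_t) toward this target by a step-size fraction of the error gives the full algorithm.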

n-step Temporal-Difference (TD) methods are combinations of Monte Carlo (MC) and one-step TD methods. In MC, we must wait for a complete episode to estimate the state value, while in one-step TD we update the value function at every time step.
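This interpolation can be made concrete: the n-step return uses n real rewards and then bootstraps, so n=1 recovers one-step TD and n at least the episode length recovers the MC return. The reward and value numbers below are illustrative:

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """G_{t:t+n} = sum of up to n discounted rewards plus
    gamma^n * V(S_{t+n}); if t+n reaches the end of the
    episode, this is the full Monte Carlo return."""
    T = len(rewards)
    h = min(t + n, T)  # bootstrapping horizon
    g = sum(gamma**(k - t) * rewards[k] for k in range(t, h))
    if h < T:
        g += gamma**(h - t) * values[h]
    return g

rewards = [1.0, 0.0, 2.0]   # one short episode
values = [0.0, 0.5, 1.0]    # current V(S_k) estimates
td1 = n_step_return(rewards, values, t=0, n=1)  # one-step TD target
mc = n_step_return(rewards, values, t=0, n=3)   # full MC return
# td1 = 1.0 + 0.9*0.5 = 1.45 ; mc = 1.0 + 0.0 + 0.81*2.0 = 2.62
```

Intermediate n trades the low variance of bootstrapping against the low bias of waiting for more real rewards.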