paper:
https://arxiv.org/abs/1511.05952
Motivation
Experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance.
Main idea
Some transitions may not be immediately useful to the agent. The authors propose to replay transitions with high expected learning progress, as measured by the magnitude of their TD error, more frequently. This prioritization can lead to a loss of diversity, which they alleviate with stochastic prioritization, and it can introduce bias, which they correct with importance sampling.
Methods
The goal is to decide which experiences to replay.
Example: ‘Blind Cliffwalk’
Reward: the green arrow gives a reward of 1; every other transition gives 0.
Actions: ‘wrong’ (dashed arrow) terminates the episode immediately, whenever it is taken.
‘right’ (black arrows) progresses through a sequence of n states, at the end of which lies the final reward of 1 (green arrow).
Key property: successes are rare, so rewarding transitions form only a tiny fraction of the replay memory.
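A minimal sketch of the Blind Cliffwalk setup described above, assuming a fixed ‘right’ action per state (the class and names are illustrative, not the paper’s code):

```python
import random

class BlindCliffwalk:
    """Chain of n states: the 'right' action advances one state (black arrow),
    the 'wrong' action terminates the episode (dashed arrow). Only reaching
    the end of the chain yields reward 1 (green arrow)."""

    def __init__(self, n):
        self.n = n
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 0:                 # 'wrong': terminate with reward 0
            self.state = 0
            return self.state, 0.0, True
        self.state += 1                 # 'right': advance one state
        if self.state == self.n:        # end of the chain: reward 1
            return self.state, 1.0, True
        return self.state, 0.0, False

# With a random policy a successful episode has probability 2**-n,
# which is why rewarding transitions are rare in the replay memory.
env = BlindCliffwalk(n=8)
state, done = env.reset(), False
while not done:
    state, reward, done = env.step(random.randint(0, 1))
```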
Uniform agent and oracle agent:
The uniform agent samples from the buffer uniformly at random.
The oracle greedily selects the transition that maximally reduces the global loss in its current state (in hindsight, after the parameter update); this is not realistic, but it shows how much a good choice of replay order could help.
Figure: median number of learning steps required to learn the value function, as a function of the total number of transitions in the replay memory.
PRIORITIZING WITH TD-ERROR
The tuple stored is (s_t, a_t, r_t, s_{t+1}, priority).
With Q-learning, each update yields a TD error, δ_t = r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t), and its magnitude |δ| is used as the transition’s priority.
New transitions arrive without a known TD-error, so we put them at maximal priority in order to guarantee that all experience is seen at least once.
This algorithm results in a substantial reduction in the effort required to solve the Blind Cliffwalk task.
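A small sketch of greedy TD-error prioritization as described above (a flat list is used for clarity; the paper uses a binary heap for the priority queue):

```python
import numpy as np

def td_error(Q, s, a, r, s_next, done, gamma=0.99):
    """One-step Q-learning TD error: δ = r + γ max_a' Q(s', a') − Q(s, a)."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    return target - Q[s, a]

class GreedyPrioritizedBuffer:
    def __init__(self):
        self.transitions, self.priorities = [], []

    def add(self, transition):
        # A new transition has no known TD error yet: store it with maximal
        # priority so that it is replayed at least once.
        self.transitions.append(transition)
        self.priorities.append(max(self.priorities, default=1.0))

    def sample_greedy(self):
        i = int(np.argmax(self.priorities))   # replay the largest-|δ| transition
        return i, self.transitions[i]

    def update_priority(self, i, delta):
        self.priorities[i] = abs(delta)       # priority = |TD error| after the update
```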
STOCHASTIC PRIORITIZATION
Greedy prioritization lacks diversity: the system repeatedly replays a small subset of the experience and is prone to over-fitting.
Stochastic prioritization samples transition i with probability P(i) = p_i^α / Σ_k p_k^α, where p_i > 0 is the priority of transition i and the exponent α determines how much prioritization is used (α = 0 recovers uniform sampling).
Two variants for the priority p_i:
1) Rank-based variant: p_i = 1 / rank(i), where rank(i) is the rank of transition i when the replay memory is sorted by |δ_i|.
2) Proportional variant: p_i = |δ_i| + ε, where ε is a small positive constant that prevents transitions with zero TD error from never being revisited.
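A sketch of how the two priority definitions and the sampling probability P(i) could be computed (array names and the α value are illustrative):

```python
import numpy as np

def proportional_priorities(td_errors, eps=1e-6):
    # Proportional variant: p_i = |δ_i| + ε.
    return np.abs(td_errors) + eps

def rank_based_priorities(td_errors):
    # Rank-based variant: p_i = 1 / rank(i), ranked by |δ_i| in descending order.
    order = np.argsort(-np.abs(td_errors))        # indices sorted by |δ|, largest first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=0.6):
    # P(i) = p_i^α / Σ_k p_k^α; α = 0 recovers uniform sampling.
    scaled = priorities ** alpha
    return scaled / scaled.sum()

td = np.array([2.0, 0.5, 0.0, 1.0])
P = sampling_probabilities(proportional_priorities(td))
batch = np.random.choice(len(td), size=2, p=P)    # indices of a prioritized minibatch
```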
ANNEALING THE BIAS
Prioritized replay introduces bias because it changes the sampling distribution; this bias can be corrected with importance-sampling (IS) weights w_i = (1/N · 1/P(i))^β, normalized by 1/max_i w_i so that they only scale updates downward.
Update the Q-value using w_i δ_i instead of δ_i.
In practice, we linearly anneal β from its initial value β0 to 1.
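A sketch of the IS correction, w_i = (N · P(i))^(−β) normalized by the maximum weight, together with a linear β schedule (function names and the β0 value are illustrative):

```python
import numpy as np

def is_weights(P, batch_idx, N, beta):
    # w_i = (N * P(i))^(-β); dividing by the maximum possible weight ensures
    # the correction only ever scales updates downward.
    w = (N * P[batch_idx]) ** (-beta)
    w_max = (N * P.min()) ** (-beta)
    return w / w_max

def beta_schedule(step, total_steps, beta0=0.4):
    # Linearly anneal β from β0 at the start of training up to 1 at the end.
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# The weighted update then uses w_i * δ_i in place of δ_i, e.g. for a tabular
# Q-function: Q[s, a] += lr * w * delta
```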
ALGORITHM
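In the spirit of the paper’s Algorithm 1, a compact tabular sketch that ties the pieces together on a tiny chain task (the environment, hyperparameters, and variable names are illustrative; the paper combines prioritized replay with Double DQN rather than tabular Q-learning):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 6, 2, 0.99, 0.25
alpha, beta0, eps = 0.6, 0.4, 1e-6
Q = np.zeros((n_states + 1, n_actions))            # extra row = absorbing terminal state
memory, priorities = [], []                        # replay memory and priorities p_i

def env_step(s, a):
    # 'right' (a=1) advances along the chain, 'wrong' (a=0) terminates;
    # reward 1 only for completing the whole chain.
    if a == 0:
        return n_states, 0.0, True
    if s + 1 == n_states:
        return n_states, 1.0, True
    return s + 1, 0.0, False

total_steps, t = 20000, 0
while t < total_steps:
    s, done = 0, False
    while not done and t < total_steps:
        a = int(rng.integers(n_actions))                  # random behaviour policy
        s2, r, done = env_step(s, a)
        memory.append((s, a, r, s2, done))
        priorities.append(max(priorities, default=1.0))   # new transition: max priority
        t += 1

        # Prioritized minibatch replay.
        p = np.asarray(priorities) ** alpha
        P = p / p.sum()
        beta = min(1.0, beta0 + (1.0 - beta0) * t / total_steps)
        batch = rng.choice(len(memory), size=min(8, len(memory)), p=P)
        w_max = (len(memory) * P.min()) ** (-beta)        # largest possible IS weight
        for i in batch:
            si, ai, ri, si2, di = memory[i]
            target = ri + (0.0 if di else gamma * Q[si2].max())
            delta = target - Q[si, ai]
            w = (len(memory) * P[i]) ** (-beta) / w_max   # normalized IS weight
            Q[si, ai] += lr * w * delta                   # weighted TD update
            priorities[i] = abs(delta) + eps              # refresh p_i = |δ_i| + ε
        s = s2

print(Q[:n_states])   # learned action values along the chain
```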
ATARI EXPERIMENTS
Performance is measured as the average score per episode and as learning speed; both the rank-based and the proportional variants outperform the uniform-replay baseline on most of the Atari games.