
PER (Prioritized Experience Replay)

paper:

https://arxiv.org/abs/1511.05952

Motivation

Previously, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance.

Main idea

Some transitions may not be immediately useful to the agent. The authors propose to replay more frequently the transitions with high expected learning progress, as measured by the magnitude of their TD error. This prioritization can lead to a loss of diversity, which they alleviate with stochastic prioritization, and it introduces bias, which they correct with importance sampling.

Methods

The goal is to choose which experiences to replay.

Example: ‘Blind Cliffwalk’

[Figure: the ‘Blind Cliffwalk’ environment]

Reward: the reward on the green arrow transition is 1; all other rewards are 0.

Action ‘wrong’ (dashed arrows): the episode is terminated whenever the agent takes the ‘wrong’ action.

Action ‘right’ (black arrows): taking the ‘right’ action progresses through a sequence of n states, at the end of which lies a final reward of 1 (green arrow).

Key feature: successes are rare (see the environment sketch below).
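A minimal sketch of this environment, assuming the simple two-action formulation described above (the class name `BlindCliffwalk` and the action encoding are illustrative choices, not from the paper):

```python
import random

class BlindCliffwalk:
    """Toy chain of n states with two actions; only the full sequence of
    'right' actions reaches the single reward of 1, so successes are rare."""

    def __init__(self, n):
        self.n = n          # length of the chain
        self.state = 0      # always start at the first state

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = 'wrong' (dashed arrow): reward 0, episode terminates
        if action == 0:
            return self.state, 0.0, True
        # action 1 = 'right' (black arrow): advance one state
        self.state += 1
        if self.state == self.n:
            return self.state, 1.0, True   # green arrow: final reward of 1
        return self.state, 0.0, False

# A uniformly random policy reaches the reward with probability 2^-n,
# which is why the rewarding transition is so rare in the replay memory.
env = BlindCliffwalk(n=8)
state, done = env.reset(), False
while not done:
    state, reward, done = env.step(random.choice([0, 1]))
```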

Uniform agent and oracle agent:

The uniform agent samples transitions from the buffer uniformly at random.

The oracle greedily selects the transition that maximally reduces the global loss in its current state (in hindsight, after the parameter update). This is not realistic, but it shows how far uniform sampling is from the best possible replay order.

[Figure: median number of learning steps required to learn the value function, as a function of the total number of transitions in the replay memory (uniform agent vs. oracle).]

PRIORITIZING WITH TD-ERROR

The stored tuple is $(s_t, a_t, r_t, s_{t+1}, \text{priority})$.

With Q-learning, each transition yields a TD error $\delta_t = r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)$, and its magnitude $|\delta_t|$ is used as the transition’s priority.

New transitions arrive without a known TD-error, so we put them at maximal priority in order to guarantee that all experience is seen at least once.
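A minimal sketch of this bookkeeping, assuming a simple list-based buffer (class and method names are illustrative, not from the paper):

```python
class GreedyPrioritizedBuffer:
    """Replay buffer that greedily replays the transition with the largest
    |TD error|; new transitions get maximal priority so that every
    experience is replayed at least once."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.transitions = []     # stored (s, a, r, s_next, done) tuples
        self.priorities = []      # |TD error| of each stored transition
        self.max_priority = 1.0   # largest priority seen so far

    def add(self, transition):
        if len(self.transitions) >= self.capacity:   # drop the oldest entry
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append(self.max_priority)    # unknown TD error -> max priority

    def sample_greedy(self):
        # index of the transition with the highest priority
        i = max(range(len(self.priorities)), key=self.priorities.__getitem__)
        return i, self.transitions[i]

    def update_priority(self, i, td_error):
        self.priorities[i] = abs(td_error)
        self.max_priority = max(self.max_priority, abs(td_error))
```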

[Figure: greedy TD-error prioritization vs. uniform sampling on the Blind Cliffwalk]

This algorithm results in a substantial reduction in the effort required to solve the Blind Cliffwalk task.

STOCHASTIC PRIORITIZATION

Greedy prioritization lacks diversity, which makes the system prone to over-fitting.

Stochastic prioritization samples transition $i$ with probability:

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$

where $p_i > 0$ is the priority of transition $i$, and the exponent $\alpha$ determines how much prioritization is used ($\alpha = 0$ recovers the uniform case).

Two variants for defining the priority $p_i$ (see the sampling sketch after the list):

1) Rank-based variant

$$p_i = \frac{1}{\text{rank}(i)}$$

where $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$.

2) The proportional variant

$$p_i = |\delta_i| + \epsilon$$

where $\epsilon$ is a small positive constant that prevents transitions from never being revisited once their error reaches zero.

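A minimal sketch of stochastic prioritized sampling covering both variants (plain NumPy for clarity; the α value below is just an example, and an efficient implementation would use better data structures for sampling at scale):

```python
import numpy as np

def sampling_probabilities(td_errors, alpha=0.6, variant="proportional", eps=1e-6):
    """Turn |TD errors| into probabilities P(i) = p_i^alpha / sum_k p_k^alpha."""
    td_errors = np.abs(np.asarray(td_errors, dtype=np.float64))
    if variant == "proportional":
        priorities = td_errors + eps                 # p_i = |delta_i| + eps
    else:                                            # rank-based variant
        ranks = np.empty_like(td_errors)
        ranks[np.argsort(-td_errors)] = np.arange(1, len(td_errors) + 1)
        priorities = 1.0 / ranks                     # p_i = 1 / rank(i), rank 1 = largest error
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Sample a minibatch of indices according to these probabilities.
td_errors = [0.5, 0.01, 2.0, 0.1]
probs = sampling_probabilities(td_errors, variant="rank")
batch_idx = np.random.choice(len(td_errors), size=2, p=probs)
```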

ANNEALING THE BIAS

Prioritized replay introduces bias because it changes the sampling distribution; this bias can be corrected by using importance-sampling (IS) weights:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}$$

where $N$ is the size of the replay memory and $\beta$ controls how much of the bias is compensated ($\beta = 1$ fully compensates for the non-uniform probabilities). For stability, the weights are normalized by $1/\max_i w_i$.

The Q-learning update then uses $w_i \delta_i$ instead of $\delta_i$.

In practice, $\beta$ is linearly annealed from its initial value $\beta_0$ to 1 over the course of training.
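A minimal sketch of the IS weights and the linear β schedule (the β₀ default and the function names below are illustrative assumptions):

```python
import numpy as np

def importance_weights(probs, batch_idx, beta):
    """w_i = (1/N * 1/P(i))^beta, normalized by max_i w_i for stability."""
    N = len(probs)                                 # probs: array of P(i) over the buffer
    weights = (N * probs[batch_idx]) ** (-beta)
    return weights / weights.max()

def beta_schedule(step, total_steps, beta0=0.4):
    """Linearly anneal beta from beta0 up to 1 over the course of training."""
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# The sampled TD errors are scaled before the update, e.g.
# weighted_delta = importance_weights(probs, batch_idx, beta) * delta[batch_idx]
```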

ALGORITHM

[Algorithm: Double DQN with proportional prioritized replay (Algorithm 1 in the paper)]
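Putting the pieces together, a hedged sketch of one update step with proportional prioritized replay on a tabular Q-table (hyperparameter values and names are illustrative; the paper’s Algorithm 1 combines prioritized replay with Double DQN and an efficient sum-tree for sampling, which are omitted here):

```python
import numpy as np

def prioritized_q_learning_step(Q, buffer, priorities, batch_size,
                                alpha=0.6, beta=0.4, gamma=1.0, lr=0.25, eps=1e-6):
    """One update with proportional prioritized replay on a tabular Q-table.

    buffer     : list of (s, a, r, s_next, done) transitions
    priorities : np.ndarray of |TD error| for each stored transition
    """
    # 1. Stochastic prioritization: P(i) = p_i^alpha / sum_k p_k^alpha
    scaled = (priorities + eps) ** alpha
    probs = scaled / scaled.sum()
    batch = np.random.choice(len(buffer), batch_size, p=probs)

    # 2. Importance-sampling weights, normalized by their maximum for stability
    weights = (len(buffer) * probs[batch]) ** (-beta)
    weights /= weights.max()

    for i, w in zip(batch, weights):
        s, a, r, s_next, done = buffer[i]
        target = r if done else r + gamma * Q[s_next].max()
        delta = target - Q[s, a]          # TD error
        Q[s, a] += lr * w * delta         # weighted update: w_i * delta_i instead of delta_i
        priorities[i] = abs(delta)        # 3. refresh the stored priority
    return Q, priorities
```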

ATARI EXPERIMENTS

Average score per episode

[Figure: average score per episode on Atari, prioritized vs. uniform replay]

The learning speed

[Figure: learning speed comparison, prioritized vs. uniform replay]