paper:

https://arxiv.org/abs/1511.05952

### Motivation

Experience transitions were uniformly, sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance.

### Main idea

Some transitions may not be immediately useful to the agent, They propose to more frequently replay transitions with high expected learning progress, as measured by the **magnitude of their TD error**. This prioritization can lead to **a loss of diversity**, which they alleviate with stochastic prioritization, and introduce bias, which they correct with **importance sampling**.

### Methods

Goal is choosing which experience to replay.

#### example: ‘Blind Cliffwalk’

**reward**: The green arrow reward is 1, others 0

**action**: “wrong”: The dashed arrow, and the episode is terminated whenever the agent takes the ‘wrong’ action.

“right”: Taking the ‘right’ action progresses through a sequence of n states (black arrows), at the end of which lies a final reward of 1 (green arrow)

**Feature**: the successes are rare.

**Uniform agent and oracle agent**:

Uniform agent sample form buffer uniformly at random.

Oracle greedily selects the transition that maximally reduces the global loss in its current state (in hindsight, after the parameter update). (not realistic)

Median number of learning steps required to learn the value function as a function of the size of the total

number of transitions in the replay memory.

#### PRIORITIZING WITH TD-ERROR

The tuple stored is (st,at,rt,st+1,priority)

When we use Q-learning, we would get TD-error.

New transitions arrive without a known TD-error, so we put them at maximal priority in order to guarantee that all experience is seen at least once.

this algorithm results in a substantial reduction in the effort required to solve the Blind Cliffwalk task.

#### STOCHASTIC PRIORITIZATION

Greedy prioritization is lack of diversity that makes the system prone to over-fitting.

Stochastic prioritization’s probability:

Two variants:

1) Rank-based variant

2) The proportional variant

#### ANNEALING THE BIAS

Prioritized replay introduces bias, and bias can be corrected this bias by using importance-sampling (IS) weights

Update Q value using wiδi instead of δi.

In practice, we linearly anneal β from its initial value β0 to 1.

#### ALGORITHM

### ATARI EXPERIMENTS

Average score per episode

The learning speed