Experience Replay - A biologically inspired mechanism in Reinforcement Learning

Jian Gao
May 2, 2019

Hippocampal replay and waking rest may facilitate consolidation of new memories

Sleep Stages

Non-rapid eye movement sleep (NREM)

Stage 1 Slow eye movements and hypnic jerks. Alpha waves disappear and theta waves appear. Lasts about 10 minutes, roughly 5% of total sleep.
Stage 2 No eye movement; dreaming is very rare and the sleeper is easily awakened. EEG recordings show "sleep spindles" and "K-complexes".
Stage 3 Slow wave sleep (SWS), or deep sleep. Delta waves occur. Dreaming is more common than in the other NREM stages, but the content of the dreams is disconnected, less vivid, and less memorable than that of dreams during REM sleep. It can occur in the morning hours or during naps. Roughly 15 to 25% of total sleep in adults. The sleeper is difficult to wake and may need up to 30 minutes to become fully alert.

Rapid eye movement sleep (REM)

Rapid movement of the eyes, low muscle tone throughout the body, and vivid dreaming. REM sleep may favor the preservation of certain types of memories, for example complex processes (how to escape from an elaborate maze, new ways of moving the body, new techniques of problem solving).

Hippocampal Replay

Replay occurs mainly in NREM Stage 3 and is thought to be essential for encoding self-experienced events into memory.

The hippocampus is a brain region associated with memory and spatial navigation.

In sleep, neural activity in the hippocampus related to a recent experience has been observed to spontaneously reoccur, and this 'replay' is important for memory consolidation.

During SWS, place cells fire in a sequential order that indicates replay and possibly memory consolidation. The replayed sequences are time-compressed and occur during high-frequency field oscillations, which appear to play a causal role in memory consolidation.

Biasing the content of hippocampal replay:

Sleep replay can be manipulated by external stimulation. For example, rats were trained on an auditory-spatial association task. During sleep, a task-related auditory cue biased reactivation events toward replaying the spatial memory associated with that cue.

(This relates to the notion of 'prioritized sweeping' in reinforcement learning.)

Reinforcement Learning

Setting: The agent interacts with an environment through a sequence of observations, actions and rewards.

Goal: Select actions in a fashion that maximizes the cumulative future reward. Each action is assigned a value, and the agent chooses the action with the maximal value.
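As a small illustration of "cumulative future reward", here is a minimal sketch that sums a reward sequence with a discount factor. The discount factor gamma and the function name discounted_return are assumptions for illustration; the notes above do not specify how future rewards are weighted.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum rewards r_0 + gamma * r_1 + gamma^2 * r_2 + ... (gamma is an assumed parameter)."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted tail.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three-step episode: 1 + 0.5 * 0 + 0.25 * 2
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(example_return)  # 1.5
```

With gamma below 1, rewards that arrive later count for less, so the agent prefers reaching the same reward sooner.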

Q-learning: learn a function that maps (state, action) to value. Q(s,a)
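To make the mapping from (state, action) to value concrete, here is a minimal sketch of the standard tabular Q-learning update with a greedy action choice. The learning rate alpha, the discount factor gamma, and the toy table size are assumptions chosen for illustration, not values from these notes.

```python
import numpy as np


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Target: reward plus the discounted value of the best next action.
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error


def greedy_action(Q, s):
    """Choose the action with the maximal estimated value in state s."""
    return int(np.argmax(Q[s]))


# Toy table: 3 states, 2 actions, all values start at zero.
Q = np.zeros((3, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)
print(Q[0, 1], greedy_action(Q, 0))
```

After this single update, the value of action 1 in state 0 has moved a fraction alpha of the way toward the observed reward, so the greedy choice in state 0 becomes action 1.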

Deep Q-learning: use a deep convolutional neural network Q(s, a; theta_i) to approximate the optimal action-value function, where theta_i denotes the Q-network parameters at iteration i.

Core components of the Deep Q-Network (DQN) agent

The replay memory
Separate target Q-network
Deep convolutional network architecture
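The role of the separate target Q-network can be sketched as follows: learning targets are computed from a frozen copy of the parameters, which is only refreshed from the online network every so often. This is a hedged sketch, not the DQN implementation: the linear q_values function stands in for the convolutional network, the random perturbation stands in for a gradient step, and SYNC_EVERY, gamma, and all function names are hypothetical.

```python
import numpy as np


def q_values(theta, states):
    """Linear stand-in for the Q-network: one row of action values per state."""
    return states @ theta


def td_targets(theta_target, rewards, next_states, dones, gamma=0.99):
    """Compute y = r + gamma * max_a' Q(s', a'; theta_target) for a batch."""
    next_max = q_values(theta_target, next_states).max(axis=1)
    # Terminal transitions (dones == 1) do not bootstrap from the next state.
    return rewards + gamma * (1.0 - dones) * next_max


rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 2))   # online parameters: 4 features, 2 actions
theta_target = theta.copy()       # frozen copy used only for computing targets

SYNC_EVERY = 100  # hypothetical sync period, not taken from the notes above
for step in range(1, 301):
    theta += 0.01 * rng.normal(size=theta.shape)  # placeholder for a gradient step
    if step % SYNC_EVERY == 0:
        theta_target = theta.copy()  # periodically copy online -> target
```

Between syncs the targets stay fixed while the online parameters move, which is the point of keeping the two sets of parameters separate.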

The representations learned by DQN allow it to accurately predict state and action values.

Experience replay samples randomly from the stored data, which removes correlations in the observation sequence and smooths over changes in the data distribution.

At each time step t, store the agent's experience e_t = (s_t, a_t, r_t, s_{t+1}), that is (state, action, reward, next state), in a data set D_t = {e_1, ..., e_t}.

In training, apply Q-learning updates on samples (mini-batches) of experience drawn uniformly at random from the stored experience pool.
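The store-then-sample scheme described above can be sketched with a small fixed-size buffer. This is a minimal sketch under stated assumptions: the class name ReplayMemory, the capacity limit (oldest experiences are dropped first), and the integer toy states are illustrative choices, not details given in these notes.

```python
import random
from collections import deque, namedtuple

# One stored experience e_t = (s_t, a_t, r_t, s_{t+1}).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Fixed-size pool of experiences; the capacity limit is an assumption."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random mini-batch, drawn without replacement within the batch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=1000)
for t in range(50):
    memory.push(state=t, action=t % 2, reward=0.0, next_state=t + 1)
batch = memory.sample(batch_size=8)
```

Because the mini-batch is drawn from the whole pool rather than from the most recent steps, consecutive (and therefore correlated) experiences rarely end up in the same update.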

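The hippocampus may implement a similar process in the brain: time-compressed reactivation of recent experience during offline periods (such as sleep or waking rest) is a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia.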