Episodic Memory Deep Q-Networks

Episodic Memory Deep Q-Networks (EMDQN) is a method for addressing reinforcement learning tasks. In reinforcement learning, the goal of the agent is to maximize its cumulative future reward. When the interaction between the agent and the environment is episodic (i.e., the environment resets to a standard starting state, as in many games), the agent seeks to maximize its expected return, which in its basic form is the sum of all rewards collected in an episode. With EMDQN, the agent's training speed and efficiency are enhanced through a table-based episodic memory.[1] This allows the agent to identify high-reward policies (the most rewarding state-action choices) more quickly.

EMDQN Algorithm

EMDQN, similar to Deep Q-Networks (DQN), uses a neural network as the Q-function to estimate the Q-value of each state-action pair. However, it also combines two learning models for the agent: a model that simulates the striatum (as used by deep RL methods) and a model that simulates the hippocampus (as used by table-based methods). The first model, simulating the striatum, provides an inference target, denoted S, while the second model, simulating the hippocampus, acts as a memory target and is denoted H.[1] This gives a new loss function:[1]

                  L(θ) = α(S − Qθ(s, a))² + β(H − Qθ(s, a))²

Here, α and β are weights on the two targets and Qθ(s, a) is the Q-value produced by the network with parameters θ. The inference target S follows the standard single-step DQN target: the immediate reward r plus the discounted maximum Q-value of the next state s′ under the target network Qθ′, with discount factor γ:

                  S = r + γ max_{a′} Qθ′(s′, a′)
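As a concrete sketch of how this inference target can be computed from a single sampled transition, the snippet below follows the standard DQN recipe the equation describes; the function name and argument layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def inference_target(reward, next_q_values, done, gamma=0.99):
    """Single-step inference target S = r + γ · max_a' Qθ'(s', a').

    reward        -- immediate reward r for taking action a in state s
    next_q_values -- 1-D array of target-network Q-values Qθ'(s', a') for all actions
    done          -- True if s' is terminal, in which case there is no bootstrap term
    gamma         -- discount factor γ
    """
    bootstrap = 0.0 if done else gamma * float(np.max(next_q_values))
    return reward + bootstrap
```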

The memory target H is defined by the equation:

                  H(s, a) = max_{i = 1, …, E} Ri(s, a)

In the equation, E is the number of episodes experienced so far and Ri(s, a) denotes the future return observed for the state-action pair (s, a) in episode i. H is therefore a growing memory table whose indexes are state-action pairs.[1] At the end of each episode, after its transition tuples (s, a, r) have been collected and stored, the memory table H is updated using:

                  H(st, at) ← max(H(st, at), R(st, at))
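A minimal sketch of this growing memory table is given below, assuming states and actions are hashable so they can serve directly as dictionary keys; the class and method names are illustrative, not the paper's code. R(st, at), the Monte-Carlo return used in the update, is computed by walking the finished episode backwards.

```python
class EpisodicMemory:
    """Growing table H indexed by (state, action), keeping the best return seen so far."""

    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.table = {}  # (state, action) -> best Monte-Carlo return observed

    def update_from_episode(self, episode):
        """episode: list of (state, action, reward) tuples in time order."""
        G = 0.0
        # Accumulate the discounted Monte-Carlo return R(s_t, a_t) from the episode's end.
        for state, action, reward in reversed(episode):
            G = reward + self.gamma * G
            key = (state, action)
            # H(s_t, a_t) <- max(H(s_t, a_t), R(s_t, a_t))
            self.table[key] = max(self.table.get(key, float("-inf")), G)

    def lookup(self, state, action):
        """Memory target H(s, a); None if the pair has never been visited."""
        return self.table.get((state, action))
```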

In the update above, R(st, at) is the Monte-Carlo return for the state-action pair in the current episode, i.e. the discounted sum of all rewards received from time step t until the end of the episode.[1] Defining λ = β/α, by analogy with the TD(λ) algorithm (which computes the λ-return, an estimate of the return built from subsequent rewards and later return estimates),[2] where, as mentioned before, α is the weight of the inference target S and β the weight of the memory target H, gives a final loss function of:

                  L(θ) = α[(S − Qθ(s, a))² + λ(H − Qθ(s, a))²]
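The combined objective could then be evaluated over a batch of transitions roughly as sketched below, using NumPy for clarity; the variable names, the handling of missing memory entries, and the batch-mean form are assumptions made for illustration.

```python
import numpy as np

def emdqn_loss(q_values, inference_targets, memory_targets, alpha, lam):
    """Batch version of L(θ) = α[(S − Qθ)² + λ(H − Qθ)²], with λ = β/α.

    q_values          -- array of Qθ(s, a) for each sampled transition
    inference_targets -- array of single-step targets S
    memory_targets    -- array of memory targets H, with np.nan where no table entry exists
    alpha, lam        -- weight of the inference term and λ = β/α
    """
    inference_term = (inference_targets - q_values) ** 2
    # Apply the memory term only where an episodic-memory entry exists.
    has_memory = ~np.isnan(memory_targets)
    memory_term = np.where(has_memory, (memory_targets - q_values) ** 2, 0.0)
    return alpha * np.mean(inference_term + lam * memory_term)
```

In practice this computation would live inside the deep-learning framework used for the Q-network so that gradients can flow through Qθ; the NumPy form above only shows the arithmetic.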

Advantages of Episodic Memory Deep Q-Networks

Factors Addressed by EMDQN

Episodic Memory Deep Q-Networks uses episodic memory to speed up DQN learning.[1] A typical DQN learns from experience very slowly, taking hundreds of millions of environment interactions to arrive at an effective policy.[1] EMDQN counters this slow learning rate by leveraging episodic memory to inform decision making with past observations.[1] With this, EMDQN enhances DQN in three ways. First, EMDQN addresses the low data efficiency caused by the single-step (or close-by multi-step) reward updates of Q-learning by using Monte-Carlo returns.[1] It does so through episodic memory, and more specifically the memory target H: the agent can draw on the best return observed for a state-action pair in past episodes to inform its current decision and maximize its reward.[3]
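As a toy illustration of this data-efficiency argument, under simplified assumptions that are not taken from the paper (a tabular chain of states with a single delayed reward at the end), a single-step update only pushes reward information one step backwards per pass over the episode, whereas the Monte-Carlo return credits every step of the episode in one pass:

```python
def one_step_td_pass(values, rewards, gamma=0.99):
    """One forward pass of single-step updates over a chain episode.

    values[t] estimates the return from step t. After the first pass only the final
    step reflects the delayed reward; each further pass moves it back one more step.
    """
    for t in range(len(rewards)):
        bootstrap = gamma * values[t + 1] if t + 1 < len(values) else 0.0
        values[t] = rewards[t] + bootstrap
    return values

def monte_carlo_returns(rewards, gamma=0.99):
    """Monte-Carlo returns: every step of the episode is credited in a single pass."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

rewards = [0.0] * 9 + [1.0]                    # delayed reward at the end of a 10-step episode
print(one_step_td_pass([0.0] * 10, rewards))   # only the last value becomes non-zero
print(monte_carlo_returns(rewards))            # every value reflects the final reward
```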

Second, EMDQN draws on two learning models used in reinforcement learning, deep reinforcement learning methods and table-based methods, instead of relying on a single model.[1] EMDQN does this in an attempt to better simulate neurological mechanisms of the human brain that work in tandem. Deep reinforcement learning models simulate the human striatum, which is located in the basal ganglia and is primarily involved in movement. Table-based models simulate the human hippocampus, which is primarily involved in long-term memory, among many other crucial functions.[1] Using both learning models enables the RL process to mimic the human brain more effectively.[1] This is done by adjusting the value of λ, the weight that balances the memory target against the inference target (named by analogy with the TD(λ) algorithm, in which λ trades off subsequent rewards against later return estimates).[2] Setting a larger value of λ moves the algorithm closer to episodic control, while a smaller value moves it closer to standard DQN behavior, which is one way of modeling interactions between the two brain regions.[1] So, when the hippocampus-like memory model is needed more for a task, λ can be raised; likewise, when striatum-like generalization is needed more for decision making, λ can be lowered accordingly.[1][2]

Lastly, as mentioned, a DQN agent needs millions of environment interactions to develop an effective policy for an RL task.[1] EMDQN seeks to address this by using samples more efficiently and less expensively. Common RL algorithms gather all samples for the agent, regardless of whether a sample yields a reward. With EMDQN, the episodic memory component allows the highest-reward samples to be selected, used in the neural network component, and exploited in training the agent through table updates.[1]

Applications of Episodic Memory Deep Q-Networks

A problem faced in DQN is overestimation, in which the network learns unrealistically high action values.[1][4] This poses a challenge because it can hurt performance in practice. Lin et al. (2018) used EMDQN to mitigate this overestimation. Comparing DQN and EMDQN on two game tasks over 200 epochs, they found that DQN's estimated action values grew drastically high during training, leading to drops in performance and score. This did not occur for EMDQN, where the episodic memory component stabilized the Q-values produced by the Q-function.[1] This mitigation is attributed to the adjustable λ value in the dual-learning model, which can be increased or decreased based on the needs of the task.

References

  1. Lin, Zichuan; Zhao, Tianqi; Yang, Guangwen; Zhang, Lintao (2018). "Episodic Memory Deep Q-Networks". International Joint Conferences on Artificial Intelligence Organization: 2433–2439.
  2. Seijen, Harm van; Sutton, Richard S. (2014). "True Online TD(λ)".
  3. Loynd, Ricky; Hausknecht, Matthew; Li, Lihong; Deng, Li (2018). "Now I Remember! Episodic Memory for Reinforcement Learning".
  4. Hasselt, Hado van; Guez, Arthur; Silver, David (2015). "Deep Reinforcement Learning with Double Q-Learning".

