Single-Task Continual Offline Reinforcement Learning

Sibo Gai Donglin Wang
Abstract

In this paper, we study the continual learning problem of single-task offline reinforcement learning. Previous work on continual reinforcement learning has mostly dealt with the multi-task setting, that is, learning multiple related or unrelated tasks in sequence, where a task, once learned, is not relearned but only used in subsequent processes. Offline reinforcement learning, however, requires continually learning multiple different datasets for the same task. Existing algorithms try to achieve the best results on each offline dataset they learn, so skills acquired from high-quality datasets are overwritten when poorer datasets are learned later. On the other hand, if too much emphasis is placed on stability, a network that has learned a poor offline dataset lacks the plasticity to learn a subsequent better dataset. How to design a strategy that always preserves the best performance for each state in the data learned so far is a new challenge and the focus of this study. We therefore propose a new algorithm, called Ensemble Offline Reinforcement Learning Based on Experience Replay, which introduces multiple value networks to learn the same dataset and judges whether a strategy has been learned from the degree of dispersion among the value networks, improving the performance of the network in single-task offline reinforcement learning.

1 Introduction

Existing approaches to continual reinforcement learning (RL) typically study agents that continually learn multiple tasks and aim to achieve the best possible performance on each task, a setting that could be named "multi-task continual learning". Specifically, it can be divided into three scenarios: task-incremental learning, domain-incremental learning, and class-incremental learning [1]. In all of these scenarios, whether or not the continual learning method knows the boundaries between tasks, it learns multiple tasks sequentially. Defining these scenarios is meaningful in tasks such as image classification or semantic segmentation because, in these tasks, all the data are sampled independently, so shifts in the data distribution only occur when the task changes. However, defining tasks is difficult in reinforcement learning. Although tasks can be defined in various ways, in real-world environments, the states of environments, tasks, or situations of the robots are often not discrete, abrupt, or clearly distinguishable, but rather continuous, gradual, and indistinct. Thus, formulating continual RL as a multi-task problem encounters difficulties. Meanwhile, many previous works have shown that agents trained in more extensive, stochastic environments exhibit superior performance, learning speed, and robustness when transferred to the real world compared to algorithms focused on a single concrete task [2]. If continual RL is defined as a multi-task continual learning problem, the agent cannot become sufficiently effective on each task given insufficient randomness and learning time. Therefore, in this work, we define single-task continual learning. As illustrated in Fig. 1, in single-task continual learning, the agent continually learns different subspaces of the states and actions of the same task. It can adapt to gradual changes in the environment, agent, and task goals through the generalization and robustness of neural networks and accommodate the latest environmental changes [3].

Figure 1: The diagram of single-task continual learning. The algorithm needs to learn a sequence of datasets of a single task sequentially and is expected to perform best on the task as a whole, rather than on individual datasets.

Specifically, in the continual offline RL setting, proposing the single-task continual RL is especially crucial, and this is the topic we focus on in this work. In this paradigm, the agent no longer directly interacts with the environment to learn but instead learns from offline datasets. In single-task offline RL with multiple datasets, each offline dataset belongs to a subspace of this RL task. These offline datasets may originate from data collected by other people or robots interacting with the environment through alternative means, such as many existing offline reinforcement learning datasets collected through human control, online reinforcement learning, or random wandering [4, 5]. Additionally, considering that reward functions represented by task goals in RL can often be computed from states and actions without interacting with the environment [2], offline RL datasets can also be obtained by modifying the rewards in datasets collected for other tasks.

To make the best possible use of all offline datasets without evaluating their quality, we propose the single-task continual offline reinforcement learning (STCORL) problem. Specifically, for a continual offline RL task, the agent needs to continuously learn multiple datasets related to this task and achieve the best possible performance at each learning stage. However, STCORL faces challenges that would not appear in traditional continual learning. Simply applying conventional continual learning algorithms to STCORL cannot adequately solve the problem. The root cause is the quality of the datasets.

As a single-task continual learning problem, STCORL needs to solve the same-input-different-output problem, or overwriting problem [6]. Specifically, assuming the new dataset contains data with the same input states as the old dataset, the network must determine the relationship between the actions in the new data and the outputs of the existing network. If the quality of the new data exceeds that of the network's current outputs, the network should prioritize plasticity to strengthen learning on the new data. In contrast, if the network's current outputs are of superior quality, stability should take precedence to consolidate what has already been learned. Additionally, selective learning is essential for single-task continual learning. Usually, each dataset contains both high-quality and low-quality parts, so algorithms should assess data at the level of individual samples rather than entire datasets. Moreover, appraising data values themselves, specifically Q-values or state values in STCORL, also requires continual memory. In summary, single-task continual learning necessitates simultaneously implementing both learning and not learning. Boosting model stability is the typical approach in multi-task continual reinforcement learning, since inadequate plasticity on new tasks can usually be resolved through sustained learning [7]; in single-task continual learning, by contrast, balancing new and old data is indispensable. Therefore, single-task continual learning poses greater challenges than multi-task continual reinforcement learning. Specifically in STCORL, the prevailing effective offline reinforcement learning algorithms that adopt behavior cloning (BC) and conservative learning can introduce further problems. To address the core overestimation problem of offline RL, these algorithms suppress and constrain the learned policy so that it more closely approximates the behavior policy across all states. This learning strategy originally aims to keep algorithms from learning too many out-of-distribution (OOD) actions, which introduce risk. However, in STCORL, such suppression and constraints cause the network to always conform to the most recently learned offline dataset and to forget knowledge learned from previously seen datasets. Skills learned from higher-quality datasets are suppressed as overestimated OOD data when subsequent datasets are learned. We term the problem of abandoning skills acquired from old datasets when learning new datasets active forgetting.

To solve the active forgetting problem, this work proposes a new offline reinforcement learning algorithm based on [8] and [9], called experience-replay-based ensemble implicit Q-learning (EREIQL), which is better suited to single-task offline reinforcement learning. Specifically, EREIQL introduces an ensemble value function: it initializes multiple value functions so that every state is assigned a sufficiently low initial value. Although EREIQL does not adopt any conservative strategy that leads to active forgetting, unseen states maintain the initialized low values during learning, while learned states sustain the highest learned values through the expectile regression of [8]. Meanwhile, the policy in EREIQL adopts advantage weighted regression, which avoids learning inferior actions with lower Q-values and thus prevents performance drops. Finally, EREIQL incorporates experience replay to mitigate the catastrophic forgetting that stems from continual learning itself. The main contributions are: 1) We propose a new single-task offline reinforcement learning problem and point out that existing offline reinforcement learning algorithms lead to active forgetting in the STCORL problem; 2) We put forward a new algorithm, EREIQL, that avoids active forgetting through passively conservative methods; 3) We test the performance of prevailing continual learning algorithms on STCORL and identify experience replay as the most effective method; 4) Experiments on various datasets show that the proposed EREIQL achieves superior performance on different STCORL tasks.

2 Related Work

2.1 Continual Learning

Continual learning aims to use a single network to continuously learn multiple tasks while consuming acceptable resources, enabling excellent performance across tasks. These algorithms can be broadly categorized into three classes: rehearsal-based, regularization-based, and dynamic-architecture-based.

Regularization-based continual learning methods mitigate forgetting by constraining the change rates of essential parameters to be minor while allowing insignificant parameters to vary greatly when learning new tasks. Examples include [10, 11, 12]. Dynamic architecture-based methods reserve separate parameters for each task, such as [13, 14, 15]. As our work centers on rehearsal-based methods, the other two categories are not recounted here.

Rehearsal-based continual learning maintains performance on previous tasks by retaining some data from them in a replay buffer and using it when learning new tasks. The first critical issue here is constructing the replay buffer. Related research includes randomly storing [16, 17] data from different tasks and selective storing [18, 19] based on characteristics like value, uniqueness, and representativeness. Another focal research direction is leveraging selected data, most commonly by blending it with new task data into fresh batches for learning [20] or distilling knowledge from old data and retained old networks [21]. Another line of methods uses a generator to produce samples following the same distribution as the task data instead of directly storing a replay buffer [22, 23]. These approaches avoid occupying space with old task data but increase overall complexity.
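The batch-blending form of rehearsal described above can be sketched as follows. This is a minimal illustration rather than the procedure of any cited method; the function name `mixed_batch`, the `replay_fraction` parameter, and the uniform sampling scheme are assumptions made for the example.

```python
import random

def mixed_batch(new_data, replay_buffer, batch_size, replay_fraction, rng):
    """Blend stored old samples with new-task samples into one training batch.

    Hypothetical helper: the name, the replay_fraction parameter, and the
    uniform sampling are illustrative assumptions.
    """
    # Number of replayed samples, capped by what the buffer actually holds.
    n_replay = min(int(batch_size * replay_fraction), len(replay_buffer))
    n_new = batch_size - n_replay
    # Old samples are drawn without replacement, new ones with replacement.
    batch = rng.sample(replay_buffer, n_replay) + rng.choices(new_data, k=n_new)
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
old = [("old", i) for i in range(4)]
new = [("new", i) for i in range(10)]
batch = mixed_batch(new, old, batch_size=8, replay_fraction=0.25, rng=rng)
```

With these settings, two of the eight samples in each batch come from the replay buffer and the remaining six from the new data.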

2.2 Continual Reinforcement Learning

Compared to traditional continual learning, continual RL methods focus on several aspects. The first is how to select data for storage. Related work here includes [19, 24], which seek to retain the most critical data points using strategies like choosing the highest-value experiences and sampling evenly across the state space where possible. The second focal issue in continual reinforcement learning is integrating continual learning into reinforcement learning algorithms. Major work here includes [25, 26], showing that continual learning can play a role in RL and works better in the actor network than in the critic network. [27] notes that although continual learning aims to alleviate forgetting, existing methods that reduce forgetting can simultaneously enhance forward transfer capabilities. For more details, please refer to the survey [28] and foundational work [29, 30].

2.3 Offline Reinforcement Learning

Offline reinforcement learning refers to the RL approach where agents learn skills not through interacting with the environment, but from offline datasets of experiences and trajectories collected from other agents or humans. The most critical problem agents need to solve in this learning method is over-estimation. Initially proposed solutions to this problem constrain the deviation between the learned policy and that in the offline data. These algorithms include [31, 32, 33], which incorporate a KL divergence constraint during policy learning, limiting the discrepancy between the agent’s policy and offline policy. [34] suggests this deviation should emphasize actions whose behavior policy probability is zero, requiring the probability of selecting these actions to be zero too. Further, [35] shows this deviation can be more simply achieved by appending a behavior cloning term to online reinforcement learning algorithms.

However, as these methods restrict the distance between the learned and behavior policies, they are susceptible to offline data quality [36]. Another line of RL algorithms tackles this issue by learning a conservative Q function. These include Conservative Q-Learning (CQL) proposed by [37]. CQL and its successor [38] avoid direct policy constraints, addressing the data quality problem by enforcing lower values for OOD data. Other approaches like [8] concurrently leverage value and Q-functions, averting over-estimation by using only in-distribution Q-values from the samples. Follow-up work includes [39], offering a more theoretical explanation and analysis. Finally, instead of reducing Q values of OOD data through constraints, [9] assigns them lower initial values through ensemble learning for conservative learning. Other related work includes [40, 41].

3 Problem Definition and Preliminary

3.1 Single Task Continual Offline Reinforcement Learning

In this work, we propose single-task continual offline reinforcement learning (STCORL). Unlike traditional continual learning, STCORL, as a typical representative of single-task continual learning, learns only one offline reinforcement learning task $T$. An offline reinforcement learning task can be formulated as a Markov decision process (MDP) tuple $\{\mathcal{S},\mathcal{A},P,\rho_{0},r,\gamma\}$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ the transition probability, $\rho_{0}$ the initial state distribution over $\mathcal{S}$, $r:\mathcal{S}\times\mathcal{A}\rightarrow[-R_{\mathrm{max}},R_{\mathrm{max}}]$ the reward function, and $\gamma\in[0,1)$ the discount factor. The final return is the cumulative discounted reward during motion, $R_{t,n}=\sum_{i=t}^{H}\gamma^{(i-t)}r(\mathbf{s}_{i},\mathbf{a}_{i})$, with $H$ denoting the maximum number of execution steps.
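As a small worked illustration of the return defined above, the following sketch computes the discounted sum for a finite list of rewards. The function name and the list-based representation of a trajectory are assumptions made for the example, and the dataset index $n$ is ignored.

```python
def discounted_return(rewards, gamma, t=0):
    """Compute sum_{i=t}^{H} gamma**(i - t) * rewards[i], where H = len(rewards) - 1."""
    total = 0.0
    # Work backwards from step H so that each step needs one multiply-add.
    for reward in reversed(rewards[t:]):
        total = reward + gamma * total
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))       # 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5, t=1))  # 1.5
```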

We employ the actor-critic RL architecture, which achieves the best performance in offline RL. Actor-critic RL consists of three parameterized networks: a critic network $Q(\mathbf{s},\mathbf{a})$, a value function $V(\mathbf{s})$, and an actor network $\pi(\mathbf{a}|\mathbf{s})$. Q-learning trains the critic network using the Bellman operator: $\mathcal{B}^{*}Q(\mathbf{s},\mathbf{a})=r(\mathbf{s},\mathbf{a})+\gamma\mathbb{E}_{\mathbf{s}'\sim P(\mathbf{s}'|\mathbf{s},\mathbf{a})}\left[\max_{\mathbf{a}'}Q(\mathbf{s}',\mathbf{a}')\right]$.
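To make the Bellman operator concrete, here is a minimal tabular sketch. It assumes a small discrete MDP stored in plain dictionaries, which is an illustrative simplification; the paper itself uses parameterized networks.

```python
def bellman_backup(Q, P, r, gamma):
    """Apply (B* Q)(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s, a)}[max_a' Q(s', a')].

    Q: dict state -> dict action -> value
    P: dict (state, action) -> dict next_state -> probability
    r: dict (state, action) -> reward
    """
    new_Q = {}
    for s, actions in Q.items():
        new_Q[s] = {}
        for a in actions:
            # Expectation over next states of the greedy next-state value.
            expected_max = sum(
                prob * max(Q[s_next].values())
                for s_next, prob in P[(s, a)].items()
            )
            new_Q[s][a] = r[(s, a)] + gamma * expected_max
    return new_Q
```

The function returns a new table and leaves `Q` unchanged, so repeated application corresponds to synchronous value iteration on this toy representation.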

In offline RL, an agent learns not through environmental interaction but from offline datasets. In STCORL, the agent needs to learn from a series of offline datasets $\mathcal{D}=\{D_{1},\dots,D_{N}\}$. Each dataset $D_{n}=(\mathbf{s}_{i,n},\mathbf{a}_{i,n},\mathbf{s}_{i,n}',r_{i,n})$ is commonly assumed to be sampled from some (unknown) behavior policy $\pi_{n}^{\beta}$. These behavior policies can be seen as independent and identically distributed (i.i.d.) samples from a distribution $\Pi^{\beta}$, with no information on the relationships among behavior policies available to the algorithm. As all datasets are sampled from the same environment, the transition probability $\mathbf{s}'\sim P(\mathbf{s},\mathbf{a})$ and the reward function $r\sim R(\mathbf{s},\mathbf{a})$ are the same for every dataset. For convenience, the subscript $n$ denoting a specific dataset is omitted in the remainder when a particular dataset need not be specified.

3.2 Conservative Learning and Active Forgetting

Due to the bootstrapping learning architecture of RL, the Bellman equation includes the Q value of OOD state-action pairs through the term $\max_{\mathbf{a}'}Q(\mathbf{s}',\mathbf{a}')$. In offline RL, this term is never updated during optimization because the related data does not exist in the dataset, so its errors accumulate into the Q values of all preceding states through bootstrapping. Moreover, if an OOD state-action pair has an initially over-estimated Q value, the agent will be prone to selecting such OOD actions, which deteriorates the safety of offline RL algorithms.

To address this problem, a successful class of offline RL algorithms such as Conservative Q-Learning (CQL) [37] employs conservative learning. Specifically, these algorithms actively suppress the Q values of the critic network for actions selected by the policy during learning while boosting Q values for actions present in the dataset:

\begin{align}
Q^{\text{update}} = \mathop{\mathrm{argmin}}\limits_{Q}\; & \alpha_{\text{CQL}}\cdot\Big(\mathbb{E}_{\mathbf{s}\in D,\,\mathbf{a}\sim\pi(\mathbf{a}|\mathbf{s})}\left[Q(\mathbf{s},\mathbf{a})\right] \tag{2}\\
& -\mathbb{E}_{\mathbf{s}\sim D,\,\mathbf{a}\sim\pi_{\beta}(\mathbf{a}|\mathbf{s})}\left[Q(\mathbf{s},\mathbf{a})\right]\Big) \notag\\
& +\frac{1}{2}\,\mathbb{E}_{\mathbf{s},\mathbf{a},\mathbf{s}'\sim D}\left[\left(Q(\mathbf{s},\mathbf{a})-\mathcal{B}^{*}Q(\mathbf{s},\mathbf{a})\right)^{2}\right] \tag{3}
\end{align}

where $\alpha_{\text{CQL}}$ controls the conservative term.
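The structure of this objective can be sketched for a tabular Q function with discrete actions. The sketch below is an illustration under simplifying assumptions, not the CQL implementation: the dictionary representation, the function name `cql_objective`, and the use of dataset actions in place of samples from $\pi_{\beta}$ are all choices made for the example.

```python
def cql_objective(Q, batch, policy, alpha_cql, gamma):
    """Evaluate a tabular version of the conservative objective.

    Q:      dict state -> dict action -> value
    batch:  list of (s, a, s_next, r) transitions from the current dataset D
    policy: dict state -> dict action -> probability (the learned policy pi)
    """
    n = len(batch)
    # E_{s in D, a ~ pi}[Q(s, a)]: minimisation pushes these values down.
    q_policy = sum(
        sum(prob * Q[s][a] for a, prob in policy[s].items())
        for s, _, _, _ in batch
    ) / n
    # E_{(s, a) ~ D}[Q(s, a)]: enters with a minus sign, so it is pushed up.
    q_data = sum(Q[s][a] for s, a, _, _ in batch) / n
    # Mean squared Bellman error against r + gamma * max_a' Q(s', a').
    bellman = sum(
        (Q[s][a] - (r + gamma * max(Q[s_next].values()))) ** 2
        for s, a, s_next, r in batch
    ) / n
    return alpha_cql * (q_policy - q_data) + 0.5 * bellman

Q = {"s1": {"a1": 1.0, "a2": 0.0}, "end": {"a1": 0.0, "a2": 0.0}}
batch = [("s1", "a2", "end", 0.0)]           # current dataset only contains a2
policy = {"s1": {"a1": 1.0, "a2": 0.0}}      # learned policy still prefers a1
before = cql_objective(Q, batch, policy, alpha_cql=1.0, gamma=0.9)
Q["s1"]["a1"] = 0.5
after = cql_objective(Q, batch, policy, alpha_cql=1.0, gamma=0.9)
```

In this toy case the objective drops from `before` to `after` when $Q(\mathbf{s}_1,\mathbf{a}_1)$ is lowered, because the conservative term only sees the current dataset and cannot tell whether $\mathbf{a}_1$ was learned from an earlier one.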

Figure 2: The diagram of active forgetting. In this picture, the network learns two datasets sequentially, which contain the data points $(\mathbf{s}_{1},\mathbf{a}_{1})$ and $(\mathbf{s}_{1},\mathbf{a}_{2})$ respectively. Learning a worse action after a better action directly results in forgetting. This kind of forgetting is not affected by the distribution shift.

Through this learning approach, when exposed to subsequent datasets, agents actively suppress the Q values of data deemed OOD relative to the dataset currently being learned. As depicted in Fig. 2, after learning $(\mathbf{s}_{1},\mathbf{a}_{1})$ from the first dataset, the network learns from the following dataset another sample with the same (or a very similar) state but a different action $\mathbf{a}_{2}$. Because the network cannot distinguish whether $(\mathbf{s}_{1},\mathbf{a}_{1})$ is truly OOD (outside all the learned datasets) or was learned from a previous dataset, it suppresses the Q value of the learned sample $(\mathbf{s}_{1},\mathbf{a}_{1})$, which prompts the policy network to avoid choosing $\mathbf{a}_{1}$ at $\mathbf{s}_{1}$ and results in forgetting. This type of forgetting in the critic network is termed active forgetting because the network actively forgets a learned action. Because this problem is caused by conservative learning, we also name it "conservative forgetting". Although algorithms like implicit Q-learning (IQL) [8] do not actively suppress acquired Q values, they still lead to active forgetting. IQL learns the Q network and the value network $V$ through expectile regression and the policy network via advantage weighted regression (AWR) to avoid overestimation:

\begin{align}
L_{V} &= \mathbb{E}_{(\mathbf{s},\mathbf{a})\sim D}\left[L_{2}^{\tau}\left(Q(\mathbf{s},\mathbf{a})-V(\mathbf{s})\right)\right]. \tag{4}\\
L_{Q} &= \mathbb{E}_{(\mathbf{s},\mathbf{a},\mathbf{s}')\sim D}\left[\left(r(\mathbf{s},\mathbf{a})+\gamma V(\mathbf{s}')-Q(\mathbf{s},\mathbf{a})\right)^{2}\right]. \tag{5}
\end{align}

Here $L_{2}^{\tau}(u)=|\tau-\mathbf{1}(u<0)|u^{2}$ denotes the expectile loss and $\tau$ the expectile threshold. According to [8], the expectile threshold $\tau$ ranges from 0.7 to 0.9. Although the update pace is constrained, the value $V$ of a state can still decrease due to a suboptimal Q value, causing the network to actively forget previously acquired value functions.
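The effect described here can be checked numerically with a small sketch. It assumes a single state whose value is fit to a handful of scalar Q targets, and it finds the expectile by a coarse grid search; both the helper names and the grid search are illustrative assumptions, not part of IQL.

```python
def expectile_loss(u, tau):
    """L_2^tau(u) = |tau - 1(u < 0)| * u**2."""
    weight = abs(tau - (1.0 if u < 0 else 0.0))
    return weight * u * u

def expectile(q_values, tau, steps=1000):
    """Grid-search the v that minimises sum_q L_2^tau(q - v) (illustrative only)."""
    lo, hi = min(q_values), max(q_values)
    candidates = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda v: sum(expectile_loss(q - v, tau) for q in q_values),
    )

v_old = expectile([1.0], tau=0.9)        # only the good action has been seen
v_new = expectile([1.0, 0.0], tau=0.9)   # a worse action for the same state arrives
```

In this toy setting `v_new` is lower than `v_old`: even with a high $\tau$, a suboptimal Q target pulls the fitted value of the state downwards.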

Active forgetting in the policy network may concurrently arise with that in the critic network. The policy network only learns data from newer datasets. For IQL, this takes the form:

\begin{equation}
L_{\pi}=\mathbb{E}_{(\mathbf{s},\mathbf{a})\sim D}\left[\exp\left(\alpha\left(Q(\mathbf{s},\mathbf{a})-V(\mathbf{s})\right)\right)\log\pi(\mathbf{a}|\mathbf{s})\right], \tag{6}
\end{equation}

where $\alpha$ denotes the advantage weighting coefficient. Here, when data from the new dataset is of inferior quality in state $\mathbf{s}$, its actions receive lower weights and are hence learned more slowly.

Although the AWR used in IQL can alleviate the learning of suboptimal actions, only data from the new dataset is learned, so inferior datasets still overwrite the action knowledge acquired from previously seen datasets.
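The weighting in Eq. (6) can be illustrated with scalar values. The sketch below assumes that Q values, state values, and log-probabilities are already available as plain floats; the helper names and the example numbers are hypothetical.

```python
import math

def awr_weight(q, v, alpha):
    """Weight exp(alpha * (Q(s, a) - V(s))) applied to log pi(a|s) in Eq. (6)."""
    return math.exp(alpha * (q - v))

def awr_objective(samples, alpha):
    """Average weighted log-likelihood over (q, v, log_prob) triples."""
    return sum(awr_weight(q, v, alpha) * lp for q, v, lp in samples) / len(samples)

good = awr_weight(q=1.0, v=1.0, alpha=3.0)  # zero advantage
bad = awr_weight(q=0.0, v=1.0, alpha=3.0)   # negative advantage
```

The weight of the inferior action is much smaller than that of the good one but remains strictly positive, so the inferior action is still learned, only more slowly.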

Figure 3: The diagram of catastrophic forgetting. In this picture, the network needs to learn two datasets sequentially, which contain the data points $(\mathbf{s}_{1},\mathbf{a}_{1})$ and $(\mathbf{s}_{2},\mathbf{a}_{2})$ respectively. Even though these two points have different states, learning the later one also affects the action selected in the previous state because of the distribution shift.

We would like to emphasize the difference between active forgetting and catastrophic forgetting in STCORL. Although this section focuses on active forgetting, catastrophic forgetting also remains problematic in STCORL. As shown in Fig. 3, catastrophic forgetting means that after learning $(\mathbf{s}_{1},\mathbf{a}_{1})$, the network learns from the following dataset another sample with a different state, $(\mathbf{s}_{2},\mathbf{a}_{2})$, and this affects the Q value at $(\mathbf{s}_{1},\mathbf{a}_{1})$. In comparison, active forgetting means that a superior action for the same input state is actively overridden by an inferior action when learning suboptimal new data. Active and catastrophic forgetting are not independent issues but mutually reinforce each other. When catastrophic forgetting impairs the critic network so that it can no longer appropriately distinguish the Q values of different actions, active forgetting in the policy network follows. The converse holds when policy network degradation due to catastrophic forgetting hinders the critic network.

| | superior new action | inferior new action |
|---|---|---|
| learn | correct optimization | active forgetting |
| not learn | active forgetting (active rejection) | correct optimization |

Table 1: The four cases of active forgetting and normal learning when the network meets two actions for the same state in different datasets.

Notably, if more optimal outputs supersede inferior ones on identical inputs, i.e., the Q value increases for the same $\mathbf{s}$, $\mathbf{a}$ after better trajectories are discovered, this is the desired STCORL behavior rather than active forgetting. Conversely, if the network fails to fit a better dataset because it has already learned from a suboptimal dataset, performance eventually degrades; this can be termed active rejection, a variant of active forgetting. For clarity, we summarize the four situations in Tab. 1. While learning a single dataset, active forgetting and regular optimization take place concurrently for different inputs. This trait also makes knowledge distillation infeasible for STCORL, as it naturally induces active forgetting. As discussed in [42], BC constitutes the most efficacious approach in continual reinforcement learning. Hence, a new effective continual learning algorithm needs to be developed for STCORL.

3.3 Ensemble Implicit Q Learning

Most offline RL algorithms lead to active forgetting. Among prevalent offline RL algorithms, as far as the authors know, SAC-N and EDAC [9] do not incur active forgetting. Conservative learning aims to prevent over-estimated initial Q values for OOD states from causing over-estimation. To this end, SAC-N introduces ensemble Q learning with multiple critic networks and uses the minimum output among the networks as the Q value for each state. Because this avoids actively diminishing Q values, active forgetting is averted. However, ensemble methods (including follow-up work such as MSG [40] and LB-SAC [41]) cannot achieve the best STCORL performance. This is primarily because, in contrast to these methods, IQL does not utilize OOD data during learning, so the catastrophic forgetting that arises mutually in the actor and critic networks exerts limited influence and does not trigger severe active forgetting. Additionally, the policy network training approach adopted in SAC-N proves more unstable than the AWR used in IQL [43].
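The ensemble minimum used by SAC-N can be illustrated with a few lines of code. This is a minimal sketch with made-up numbers: the function name `pessimistic_q` is hypothetical, and the example values only mimic the common assumption that critics agree on in-distribution inputs and disagree on OOD inputs.

```python
from typing import Sequence


def pessimistic_q(ensemble_q: Sequence[float]) -> float:
    """Return the minimum Q estimate across the critic ensemble."""
    return min(ensemble_q)


# Illustrative (made-up) critic outputs for one state-action pair each.
in_distribution = [10.1, 9.9, 10.0]      # critics roughly agree
out_of_distribution = [14.0, 3.0, 9.0]   # critics disagree strongly

q_id = pessimistic_q(in_distribution)        # close to the ensemble mean
q_ood = pessimistic_q(out_of_distribution)   # far below the ensemble mean
```

Under this assumption the minimum penalizes OOD actions passively, without any explicit term that pushes Q values down.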

Figure 4: The diagram of EIQL. By using ensemble value networks, EIQL keeps the initialized value network lower than the Q network at any state, so that EIQL can use a $\tau$ very close to 1 to avoid active forgetting.

Therefore, this section combines SAC-N and IQL into a new ensemble implicit Q learning method (EIQL). As shown in Fig. 4, EIQL initializes multiple value functions, assigning a sufficiently low value to each state. It then sustains the highest learned values through expectile regression:

\begin{align}
L_{V} &= \mathop{\mathbb{E}}\limits_{\left(\mathbf{s},\mathbf{a}\right)\sim D}\left[L_{2}^{\tau}\left(Q\left(\mathbf{s},\mathbf{a}\right)-\min_{j} V^{j}\left(\mathbf{s}\right)\right)\right]; \tag{7}\\
L_{Q} &= \mathop{\mathbb{E}}\limits_{\left(\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\right)\sim D}\left[\left(r+\gamma\min_{j} V^{j}\left(\mathbf{s}^{\prime}\right)-Q\left(\mathbf{s},\mathbf{a}\right)\right)\right], \tag{8}
\end{align}

where $V^{j}, j=1,\dots,M$ denotes the $j^{\text{th}}$ value function. Hence, only the minimum value function undergoes learning in EIQL. Through this approach, sufficiently low initialization of the value functions for OOD data enables a higher expectile threshold $\tau$ in EIQL, alleviating active forgetting. In our experiments, $\tau$ is $0.99$ for EREIQL.
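The per-sample terms inside Eqs. (7) and (8) can be sketched as follows. This is a minimal sketch under stated assumptions: $L_{2}^{\tau}$ is taken to be the asymmetric squared loss $|\tau-\mathbb{1}(u<0)|\,u^{2}$ of IQL [8], the TD error in Eq. (8) is assumed to be squared as in IQL, and the function names are hypothetical.

```python
from typing import Sequence


def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared loss: weight tau if diff >= 0, else (1 - tau)."""
    weight = tau if diff >= 0.0 else 1.0 - tau
    return weight * diff * diff


def value_loss(q_sa: float, v_ensemble: Sequence[float], tau: float) -> float:
    """Per-sample term of Eq. (7): only the minimum value estimate is used."""
    return expectile_loss(q_sa - min(v_ensemble), tau)


def q_loss(reward: float, gamma: float,
           v_next_ensemble: Sequence[float], q_sa: float) -> float:
    """Per-sample term of Eq. (8), with the TD error squared (assumption)."""
    td_error = reward + gamma * min(v_next_ensemble) - q_sa
    return td_error * td_error


# With tau = 0.99, a Q value above the minimum V is penalized heavily,
# while a Q value below it is almost ignored.
loss_above = value_loss(q_sa=5.0, v_ensemble=[1.0, 2.0, 3.0], tau=0.99)
loss_below = value_loss(q_sa=0.0, v_ensemble=[1.0, 2.0, 3.0], tau=0.99)
```

The asymmetry is the point of the design: the value estimate is pulled up strongly toward better actions and pulled down only weakly by worse ones.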

On the other hand, to address active forgetting in the policy network, EIQL introduces a cloned policy network and merges its outputs with the new dataset for learning. Assume the network has finished learning dataset $n-1$ and starts on dataset $n$. For a state $\mathbf{s}$, the policy network $\pi_{n}$ either selects the best action in the new data or maintains the action of the previous policy $\pi_{n-1}$:

\begin{align}
L_{\pi_{n}} = {} & \mathbb{E}_{\left(\mathbf{s},\mathbf{a}\right)\sim D_{n}}\left[\exp\left(\alpha\left(\Delta\left(\mathbf{a},\mathbf{s}\right)\right)\right)p\left(\mathbf{a}|\mathbf{s}\right)\right] \tag{9}\\
& + \mathbb{E}_{\left(\mathbf{s}\right)\sim D_{n}}\left[\exp\left(\alpha\left(\Delta\left(\mathbf{a}^{\prime},\mathbf{s}\right)\right)\right)\log\pi_{n}\left(\mathbf{a}^{\prime}|\mathbf{s}\right)\right], \tag{10}
\end{align}

where $\mathbf{a}^{\prime}=\pi_{n-1}$ denotes the action from the cloned network, $\Delta\left(\mathbf{a},\mathbf{s}\right)=Q\left(\mathbf{s},\mathbf{a}\right)-\min_{j} V^{j}\left(\mathbf{s}\right)$ is the difference between the Q function and the value function, and $p\left(\mathbf{a}|\mathbf{s}\right)=\log\pi_{n}\left(\mathbf{a}|\mathbf{s}\right)$ is the log-probability of the policy.
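The two terms of Eqs. (9) and (10) can be sketched for a single state. This is a hedged illustration, not the paper's implementation: it assumes a one-dimensional Gaussian policy with fixed standard deviation, and it assumes, as in advantage weighted regression, that the weighted log-likelihood is maximized. All names are hypothetical.

```python
import math
from typing import Sequence


def gaussian_log_prob(action: float, mean: float, std: float) -> float:
    """Log-density of a 1-D Gaussian policy (an assumed policy class)."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def advantage(q_sa: float, v_ensemble: Sequence[float]) -> float:
    """Delta(a, s) = Q(s, a) - min_j V^j(s)."""
    return q_sa - min(v_ensemble)


def policy_objective(mean: float, std: float,
                     a_data: float, q_data: float,
                     a_clone: float, q_clone: float,
                     v_ensemble: Sequence[float], alpha: float) -> float:
    """Sum of the dataset term (9) and the cloned-policy term (10) at one state."""
    w_data = math.exp(alpha * advantage(q_data, v_ensemble))
    w_clone = math.exp(alpha * advantage(q_clone, v_ensemble))
    return (w_data * gaussian_log_prob(a_data, mean, std)
            + w_clone * gaussian_log_prob(a_clone, mean, std))


# The cloned action (1.0) has the higher Q value, so the objective prefers
# a policy centred on it over one centred on the new dataset action (0.0).
common = dict(std=1.0, a_data=0.0, q_data=1.0, a_clone=1.0, q_clone=3.0,
              v_ensemble=[1.0, 2.0], alpha=1.0)
keep_old = policy_objective(mean=1.0, **common)
follow_new = policy_objective(mean=0.0, **common)
```

Because both actions are weighted by their own advantage, whichever of the old and new action has the larger Q value dominates the update at that state.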

Finally, existing continual learning algorithms still need to be introduced into the critic, value, and policy networks of EIQL to mitigate catastrophic forgetting. Notably, according to [25], continual learning algorithms prove inadequate for critic networks in multi-task continual RL, because once a task is finished, only the policy network is subsequently deployed. Hence, continual learning of critic networks is unnecessary there. It is imperative for single-task continual learning, however, since learning future datasets needs the assistance of the critic network.

We choose experience replay (ER) [20] without its behavior cloning (BC) component as the continual learning method. Subsequent experiments will demonstrate that ER is the most efficacious continual learning approach for STCORL. For data selection, we adopt the average Q value per trajectory. Although more advanced strategies such as [42] may perform better, they lie beyond the scope of this work. Integrating EIQL and ER yields the complete algorithm, termed rehearsal-based ensemble implicit Q learning (EREIQL).
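Selecting replay data by the average Q value per trajectory could look like the sketch below. The text does not state the direction of the ranking, so this sketch assumes that trajectories with the highest average Q value are retained; the function name and the dictionary layout are likewise hypothetical.

```python
from typing import Dict, List


def select_trajectories(traj_q_values: Dict[str, List[float]],
                        capacity: int) -> List[str]:
    """Return the ids of the trajectories kept in the replay buffer.

    traj_q_values maps a trajectory id to the Q values of its transitions.
    ASSUMPTION: higher average Q value means higher priority for storage.
    """
    scores = {
        traj_id: sum(q_values) / len(q_values)
        for traj_id, q_values in traj_q_values.items()
        if q_values  # skip empty trajectories
    }
    ranked = sorted(scores, key=lambda traj_id: scores[traj_id], reverse=True)
    return ranked[:capacity]


kept = select_trajectories(
    {"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0], "c": [0.0, 8.0]},
    capacity=2,
)
```

Here the averages are 2.0, 5.0, and 4.0, so trajectories `"b"` and `"c"` are kept under the stated assumption.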

4 Experiments and Results


Implementation Details

Following [44], we employ multi-layer perceptrons (MLPs) with three 256-neuron hidden layers and ReLU activation as the Q-function and policy networks in our experiments. The Q network adopts the Adam optimizer [45] with a learning rate of 0.001, while the policy network uses a learning rate of 0.003. Each task runs for 50,000 steps before switching to the next. A replay buffer of 75 trajectories is kept throughout the whole learning process. Reported results are averaged across five random seeds.
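For reference, the stated settings can be collected in one configuration object. This is only a convenience sketch: the class and field names are hypothetical and do not correspond to any library API, and only values stated in the text are included.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters as stated in the text; field names are hypothetical."""
    hidden_sizes: Tuple[int, ...] = (256, 256, 256)  # three hidden layers
    activation: str = "relu"
    q_learning_rate: float = 1e-3        # Adam, Q network
    policy_learning_rate: float = 3e-3   # policy network
    steps_per_dataset: int = 50_000      # steps before switching datasets
    replay_buffer_trajectories: int = 75
    num_seeds: int = 5
    expectile_tau: float = 0.99          # value of tau given in Section 3.3


config = ExperimentConfig()
```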

Baseline Algorithms

To address the aforementioned issues, this section contrasts two algorithm groups to assess the performance of existing algorithms:

Continual learning algorithm baselines include:

  • BC [20]: Basic rehearsal-based continual learning with a behavior cloning term for the policy network.

  • Averaged gradient episodic memory (AGEM) [46]: A gradient episodic memory (GEM)-based algorithm that only employs parts of old tasks per update to ensure positive inner products between new task update directions and learned tasks.

  • Elastic weight consolidation (EWC) [10]: Selects and softly locks essential parameters via Fisher matrix.

  • Synaptic intelligence (SI) [11]: A second derivative-based method that restricts gradient update magnitudes.

  • Riemannian walk (R-Walk) [12]: A Riemannian-manifold-based parameter weight method.
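As an illustration of the gradient constraint used by the AGEM baseline, the projection step can be sketched in a few lines. This follows the commonly described A-GEM rule: if the new-task gradient has a negative inner product with a reference gradient computed on a sample of stored data, the conflicting component is removed. The function name and the plain-list representation of gradients are assumptions made for the example.

```python
from typing import List


def agem_project(grad: List[float], ref_grad: List[float]) -> List[float]:
    """Project grad so that its inner product with ref_grad is non-negative."""
    dot = sum(g * r for g, r in zip(grad, ref_grad))
    if dot >= 0.0:
        # No conflict with the stored data: keep the gradient unchanged.
        return list(grad)
    # dot < 0 implies ref_grad is non-zero, so the division is safe.
    ref_sq = sum(r * r for r in ref_grad)
    scale = dot / ref_sq
    return [g - scale * r for g, r in zip(grad, ref_grad)]


projected = agem_project([1.0, -1.0], [0.0, 1.0])
```

In this example the second component conflicts with the reference gradient and is removed, while the first component is left untouched.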

Offline RL algorithm baselines include:

  • TD3+BC [35]: TD3-based RL with a behavior cloning loss.

  • IQL [8]: Achieves conservative learning via expectile regression and advantage weighted regression without using OOD Q values.

  • SAC-N [9]: Passively realizes conservative Q learning by taking the minimum among ensemble Q network outputs.

  • EDAC [9]: Reduces the number of required networks by constraining gradients to minimize OOD data discreteness effects during in-distribution (ID) data learning based on SAC-N.

Offline Datasets

We use three offline tasks from [47]: Hopper, HalfCheetah and Walker2D. For each task, the network needs to learn a total of nine datasets, in the order Random1-Random2-Random3-Medium1-Medium2-Medium3-Random1-Random2-Random3. Here, 1, 2, and 3 denote the first, second, and third sub-datasets obtained by randomly dividing the original dataset into three parts. This training order tests three abilities of the algorithm: improving performance when learning a better dataset after a worse one (plasticity), maintaining performance when learning a worse dataset after a better one (stability), and improving performance when learning different datasets of the same quality.
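The dataset schedule described above can be sketched as follows. The helper names, the fixed seed, and the round-robin split after shuffling are assumptions for illustration; the text only states that each original dataset is randomly divided into three parts.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_three(trajectories: Sequence[T], seed: int = 0) -> List[List[T]]:
    """Randomly divide one dataset into three disjoint sub-datasets."""
    shuffled = list(trajectories)
    random.Random(seed).shuffle(shuffled)
    # Round-robin assignment keeps the three parts nearly equal in size.
    return [shuffled[i::3] for i in range(3)]


def build_sequence() -> List[Tuple[str, int]]:
    """Return the nine (quality, part) pairs in training order."""
    quality_order = ["random", "medium", "random"]
    return [(quality, part) for quality in quality_order for part in (1, 2, 3)]


sequence = build_sequence()
parts = split_into_three(list(range(10)), seed=0)
```

The resulting `sequence` starts with `("random", 1)`, moves to `("medium", 1)` at the fourth position, and returns to the random sub-datasets for the last three positions.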

Metrics

Following [48], we adopt the average performance (PER) and the backward transfer (BWT) as evaluation metrics,

$$\text{PER}=\frac{1}{N}\sum\limits_{n=1}^{N}a_{N,n},\quad \text{BWT}=\frac{1}{N-1}\sum\limits_{n=1}^{N-1}\left(a_{n,n}-a_{N,n}\right), \tag{11}$$

where $a_{i,j}$ means the final cumulative reward on dataset $j$ after learning dataset $i$. For PER, higher is better; for BWT, lower is better. These two metrics show the performance of learning new tasks while alleviating the forgetting problem. Also, since this is a single-task learning problem, we propose to report the final performance after learning all the datasets as a reference:

$$\text{LST}=a_{N,N}. \tag{12}$$

In the following, we use LST to denote this metric. As with PER, higher is better for LST.
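The three metrics of Eqs. (11) and (12) can be computed directly from the matrix of results. In this sketch, `a[i][j]` holds the performance on dataset $j$ after learning dataset $i$ with zero-based indices; the function names and the example numbers are made up for illustration.

```python
from typing import List

Matrix = List[List[float]]


def per(a: Matrix) -> float:
    """Average performance over all datasets after learning the last one."""
    n = len(a)
    return sum(a[n - 1][j] for j in range(n)) / n


def bwt(a: Matrix) -> float:
    """Average drop from just-learned performance to final performance."""
    n = len(a)
    return sum(a[j][j] - a[n - 1][j] for j in range(n - 1)) / (n - 1)


def lst(a: Matrix) -> float:
    """Performance on the last dataset after learning all datasets."""
    return a[-1][-1]


# Made-up 3x3 example; entries above the diagonal are not used.
results = [
    [10.0, 0.0, 0.0],
    [8.0, 20.0, 0.0],
    [6.0, 15.0, 30.0],
]
```

For this example, PER is (6 + 15 + 30) / 3 = 17.0, BWT is ((10 - 6) + (20 - 15)) / 2 = 4.5, and LST is 30.0.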

4.1 Reinforcement Learning Algorithm Performance in STCORL

Table 2: Results of different algorithms in the STCORL setting. LST is the last performance, PER is the mean performance, and BWT is the backward transfer. For LST and PER, higher is better; for BWT, lower is better.
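| Algorithm | HalfCheetah LST | HalfCheetah PER | HalfCheetah BWT | Hopper LST | Hopper PER | Hopper BWT | Walker2D LST | Walker2D PER | Walker2D BWT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TD3+BC | 1704.80 | 3892.53 | 2625.30 | 39.88 | 963.80 | 2042.40 | -8.90 | 56.34 | 165.50 |
| SAC-N | -474.56 | -513.12 | 224.95 | 248.17 | 108.56 | 292.70 | -2.20 | 8.18 | 132.45 |
| EDAC | -596.00 | -386.90 | 295.47 | 36.26 | 38.75 | 77.45 | -14.60 | 79.28 | 613.93 |
| IQL | 1422.21 | 2463.16 | 3158.92 | 327.58 | 618.85 | 650.26 | 2692.34 | 1578.40 | 1851.49 |
| EREIQL | 3996.05 | 3385.72 | 2625.30 | 1852.40 | 1070.68 | 1573.92 | 3230.55 | 1716.75 | 1760.88 |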

Tab. 2 illustrates the STCORL performance of different RL algorithms with ER. As shown, the SAC-N and EDAC algorithms cannot maintain performance, showing that their Q-optimizing policy is unstable and unreliable in STCORL. Moreover, neither TD3+BC nor IQL attains adequate STCORL results. IQL suffers from inferior plasticity, which hinders learning to higher levels, whereas TD3+BC demonstrates far worse stability, with performance decreasing sharply when exposed to poorer datasets. Overall, these existing algorithms fail at STCORL. In contrast, the proposed EREIQL achieves the best performance, successfully tackling STCORL. We would like to emphasize that the low BWT of some baselines is caused not by an ability to relieve catastrophic forgetting but by an inability to learn the datasets at all, which shows that BWT is not a fitting metric when a baseline may fail completely.

4.2 Continual Learning Algorithm Performance in STCORL


Figure 5: Performance of Half-Cheetah across continual learning algorithms. The network learns a task for 500 epochs and turns to the next. Higher is better.

Fig. 5 presents the STCORL performance of different continual learning algorithms. As shown, apart from experience replay, the other rehearsal-based and regularization-based algorithms cannot achieve adequate STCORL performance. This accords with the finding from [25] that existing continual learning algorithms perform poorly on critic networks.

4.3 Impact of Storage Space


Figure 6: Performance of Half-Cheetah across the different sizes of the replay buffer. The network learns a task for 500 epochs and turns to the next. Higher is better.

It may be noticed that we select a huge replay buffer in our experiments compared with related works such as [42, 25]. As Fig. 6 shows, STCORL poses substantially greater difficulty than the multi-task continual offline RL or the multi-task continual online RL in [42, 25]. To maintain learned performance, algorithms require at least 75 trajectories here versus just one in [42]. Per [25], this discrepancy may be due to the behavior cloning of the policy network employed in [42], which maintains RL performance better. In contrast, STCORL requires sustaining critic network performance, which demands larger replay buffers.


Figure 7: Performance of EREIQL in Half-Cheetah without continual learning. The network learns a task for 500 epochs and turns to the next. Solid lines indicate continual learning and dashed lines denote re-initialization before learning each dataset while retaining the replay buffer. Higher is better.

Notably, 75 trajectories is no trifling amount. A plausible hypothesis holds that EREIQL does not resolve STCORL challenges but rather owes its performance solely to the considerable replay buffer size, which essentially constitutes a mixed dataset where EREIQL happens to boast superior learning capabilities over other algorithms. To eliminate this possibility, Fig. 7 additionally provides results comparing continual learning and learning from scratch, with solid lines indicating continual learning and dashed lines denoting re-initialization before learning each dataset while retaining the replay buffer. EREIQL acquires and maintains the best performance across all datasets through continual single-task learning instead of merely possessing stronger mixed-dataset learning abilities.

4.4 Impact of Network Ensemble Size


Figure 8: Performance of EREIQL in Half-Cheetah across different ensemble sizes. The network learns a task for 500 epochs and turns to the next. Higher is better.

Fig. 8 demonstrates the impact of the number of value networks in EREIQL with the storage space fixed at 75 trajectories. EREIQL attains relatively strong performance with 30 to 100 networks. Compared to SAC-N, EREIQL achieves higher performance with fewer networks.


Figure 9: Performance of Half-Cheetah across the different sizes of the replay buffer with 1000 ensemble value networks. The network learns a task for 500 epochs and turns to the next. Higher is better.

This finding raises the question of whether more networks could reduce replay buffer demands. To test this, Fig. 9 presents the effect of the replay buffer size on performance with 1000 value networks. Indeed, more networks lessen the reliance of EREIQL on the replay buffer.

4.5 Impact of Expectile Threshold $\tau$

Figure 10: Performance of Half-Cheetah across different expectile regression thresholds $\tau$. The network learns a task for 500 epochs and turns to the next. Higher is better.

As described, the primary advantage of EIQL over IQL lies in enabling higher $\tau$ values through ensemble learning to alleviate active forgetting. To validate this, Fig. 10 shows the impact of varying $\tau$. EIQL can set $\tau\geq 0.99$ without hindering implicit $Q$-learning, while IQL is limited to around 0.9. Larger $\tau$ values give greater certainty in the acquired value functions, avoiding active forgetting. The capacity to utilize larger $\tau$ constitutes the core factor behind the efficacy of EREIQL in resolving STCORL.
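A small numeric sketch shows why a larger $\tau$ suppresses active forgetting in the value function. It applies one gradient-descent step to a scalar value estimate under the expectile loss of Eq. (7), assuming the IQL form $|\tau-\mathbb{1}(u<0)|\,u^{2}$; the function name, the learning rate, and the example values are hypothetical.

```python
def expectile_step(v: float, q: float, tau: float, lr: float = 0.1) -> float:
    """One gradient-descent step on weight * (q - v)^2 with respect to v."""
    diff = q - v
    weight = tau if diff >= 0.0 else 1.0 - tau
    # d/dv [weight * (q - v)^2] = -2 * weight * (q - v)
    return v + lr * 2.0 * weight * diff


# A learned value of 10 meets an inferior Q value of 0 from a poorer dataset.
v_after_inferior_high_tau = expectile_step(v=10.0, q=0.0, tau=0.99)
v_after_inferior_low_tau = expectile_step(v=10.0, q=0.0, tau=0.7)

# A low initial value of 0 meets a superior Q value of 10.
v_after_superior_high_tau = expectile_step(v=0.0, q=10.0, tau=0.99)
```

With $\tau=0.99$ the inferior sample lowers the value only from 10 to about 9.98, whereas with $\tau=0.7$ it drops to about 9.4; the superior sample, in contrast, raises a low value from 0 to about 1.98 in a single step.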

5 Conclusion and Future Work

In this work, we identify the active forgetting issue, which is caused by data quality and leads to declining stability in continual learning, and which is beyond the capability of traditional continual learning algorithms that balance stability and plasticity. Specifically, we first propose a new continual learning problem, single-task continual learning, and give a practical application scenario: single-task continual offline reinforcement learning. We point out the influence of data quality on STCORL and present two effects that learning consecutive datasets of different qualities has on the overall performance of algorithms: catastrophic forgetting and active forgetting, the latter being the phenomenon of actively discarding previously learned superior data after learning subsequent inferior data. To address this issue, we propose a new offline RL approach based on ensemble learning, EREIQL, which improves the plasticity on new data through ensemble learning while allowing strong stability on old data. Our experiments first show that STCORL is more challenging than common multi-task continual learning and that prevailing continual learning algorithms fail in this more general continual learning setting. Then the effectiveness of EREIQL is validated. However, the current EREIQL still has high time and space complexity. The next step will be further research on feasible and efficient continual learning and RL algorithms for STCORL settings.

References

  • [1] G. M. van de Ven and A. S. Tolias, “Three scenarios for continual learning,” arXiv:1904.07734 [cs, stat], Apr. 2019.
  • [2] I. M. A. Nahrendra, B. Yu, and H. Myung, “DreamWaQ: Learning Robust Quadrupedal Locomotion With Implicit Terrain Imagination via Deep Reinforcement Learning,” Mar. 2023.
  • [3] N. Anand and D. Precup, “Prediction and Control in Continual Reinforcement Learning,” in Thirty-Seventh Conference on Neural Information Processing Systems, Nov. 2023.
  • [4] N. Gürtler, F. Widmaier, C. Sancaktar, S. Blaes, P. Kolev, S. Bauer, M. Wüthrich, M. Wulfmeier, M. Riedmiller, A. Allshire, Q. Wang, R. McCarthy, H. Kim, J. Baek, W. Kwon, S. Qian, Y. Toshimitsu, M. Y. Michelis, A. Kazemipour, A. Raayatsanati, H. Zheng, B. G. Cangan, B. Schölkopf, and G. Martius, “Real Robot Challenge 2022: Learning Dexterous Manipulation from Offline Data in the Real World,” Nov. 2023.
  • [5] H. Zhang, S. Yang, and D. Wang, “A Real-World Quadrupedal Locomotion Benchmark for Offline Reinforcement Learning,” arXiv.org, 2023.
  • [6] S. Kessler, J. Parker-Holder, P. Ball, S. Zohren, and S. J. Roberts, “Same State, Different Task: Continual Reinforcement Learning without Interference,” AAAI, vol. 36, pp. 7143–7151, June 2022.
  • [7] D. Lopez-Paz and M. Ranzato, “Gradient Episodic Memory for Continual Learning,” arXiv:1706.08840 [cs], Nov. 2017.
  • [8] I. Kostrikov, A. Nair, and S. Levine, “Offline Reinforcement Learning with Implicit Q-Learning,” Oct. 2021.
  • [9] G. An, S. Moon, J.-H. Kim, and H. O. Song, “Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble,” Oct. 2021.
  • [10] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell, “Overcoming catastrophic forgetting in neural networks,” Proc Natl Acad Sci USA, vol. 114, pp. 3521–3526, Mar. 2017.
  • [11] F. Zenke, B. Poole, and S. Ganguli, “Continual Learning Through Synaptic Intelligence,” arXiv:1703.04200 [cs, q-bio, stat], June 2017.
  • [12] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. S. Torr, “Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence,” arXiv:1801.10112 [cs], Aug. 2018.
  • [13] H. Kang, J. Yoon, S. R. Madjid, S. J. Hwang, and C. D. Yoo, “Forget-free Continual Learning with Soft-Winning SubNetworks,” Mar. 2023.
  • [14] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, “PathNet: Evolution Channels Gradient Descent in Super Neural Networks,” arXiv:1701.08734 [cs], Jan. 2017.
  • [15] Y. Ge, Y. Li, S. Ni, J. Zhao, M.-H. Yang, and L. Itti, “CLR: Channel-wise Lightweight Reprogramming for Continual Learning,” July 2023.
  • [16] J. Bang, H. Kim, Y. Yoo, J.-W. Ha, and J. Choi, “Rainbow Memory: Continual Learning with a Memory of Diverse Samples,” arXiv:2103.17230 [cs], Mar. 2021.
  • [17] A. Prabhu, P. H. S. Torr, and P. K. Dokania, “GDumb: A Simple Approach that Questions Our Progress in Continual Learning,” in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, (Berlin, Heidelberg), pp. 524–540, Springer-Verlag, Aug. 2020.
  • [18] J. Yoon, D. Madaan, E. Yang, and S. J. Hwang, “Online Coreset Selection for Rehearsal-based Continual Learning,” in International Conference on Learning Representations, Sept. 2021.
  • [19] T. L. Hayes and C. Kanan, “Selective Replay Enhances Learning in Online Continual Analogical Reasoning,” arXiv:2103.03987 [cs], Apr. 2021.
  • [20] D. Rolnick, A. Ahuja, J. Schwarz, T. P. Lillicrap, and G. Wayne, “Experience Replay for Continual Learning,” arXiv:1811.11682 [cs, stat], Nov. 2019.
  • [21] J. Smith, Y.-C. Hsu, J. Balloch, Y. Shen, H. Jin, and Z. Kira, “Always Be Dreaming: A New Approach for Data-Free Class-Incremental Learning,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9354–9364, 2021.
  • [22] Y. Luo, Y. Wong, M. Kankanhalli, and Q. Zhao, “Learning to Predict Gradients for Semi-Supervised Continual Learning,” arXiv:2201.09196 [cs], Jan. 2022.
  • [23] J. Millichamp and X. Chen, “Brain-inspired feature exaggeration in generative replay for continual learning,” arXiv:2110.15056 [cs], Oct. 2021.
  • [24] C. Kaplanis, C. Clopath, and M. Shanahan, “Continual Reinforcement Learning with Multi-Timescale Replay,” arXiv:2004.07530 [cs, stat], Apr. 2020.
  • [25] M. Wołczyk, M. Zajac, R. Pascanu, Ł. Kuciński, and P. Miłoś, “Continual World: A Robotic Benchmark For Continual Reinforcement Learning,” Oct. 2021.
  • [26] M. Wołczyk, M. Zajac, R. Pascanu, Ł. Kuciński, and P. Miłoś, “Disentangling Transfer in Continual Reinforcement Learning,” Sept. 2022.
  • [27] J. Chen, T. Nguyen, D. Gorur, and A. Chaudhry, “Is forgetting less a good inductive bias for forward transfer?,” Mar. 2023.
  • [28] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, and N. Díaz-Rodríguez, “Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges,” Information Fusion, vol. 58, pp. 52–68, June 2020.
  • [29] M. B. Ring, “Toward a Formal Framework for Continual Learning,” in Inductive Transfer : 10 Years Later NIPS 2005 Workshop, 2005.
  • [30] D. Abel, A. Barreto, B. Van Roy, D. Precup, H. van Hasselt, and S. Singh, “A Definition of Continual Reinforcement Learning,” July 2023.
  • [31] X. B. Peng, A. Kumar, G. Zhang, and S. Levine, “Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning,” arXiv:1910.00177 [cs, stat], Oct. 2019.
  • [32] A. Nair, A. Gupta, M. Dalal, and S. Levine, “AWAC: Accelerating Online Reinforcement Learning with Offline Datasets,” Apr. 2021.
  • [33] Z. Zhuang, K. Lei, J. Liu, D. Wang, and Y. Guo, “Behavior Proximal Policy Optimization,” Feb. 2023.
  • [34] A. Kumar, J. Fu, G. Tucker, and S. Levine, “Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction,” arXiv:1906.00949 [cs, stat], Nov. 2019.
  • [35] S. Fujimoto and S. S. Gu, “A Minimalist Approach to Offline Reinforcement Learning,” arXiv:2106.06860 [cs, stat], June 2021.
  • [36] Y. J. Ma, D. Jayaraman, and O. Bastani, “Conservative Offline Distributional Reinforcement Learning,” arXiv:2107.06106 [cs], Oct. 2021.
  • [37] A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative Q-Learning for Offline Reinforcement Learning,” arXiv:2006.04779 [cs, stat], Aug. 2020.
  • [38] J. Lyu, X. Ma, X. Li, and Z. Lu, “Mildly Conservative Q-Learning for Offline Reinforcement Learning,” Oct. 2022.
  • [39] H. Xu, L. Jiang, J. Li, Z. Yang, Z. Wang, V. W. K. Chan, and X. Zhan, “Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization,” Mar. 2023.
  • [40] S. K. S. Ghasemipour, S. S. Gu, and O. Nachum, “Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters,” May 2022.
  • [41] A. Nikulin, V. Kurenkov, D. Tarasov, D. Akimov, and S. Kolesnikov, “Q-Ensemble for Offline RL: Don’t Scale the Ensemble, Scale the Batch Size,” Jan. 2023.
  • [42] S. Gai, D. Wang, and L. He, “Offline Experience Replay for Continual Offline Reinforcement Learning,” May 2023.
  • [43] P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine, “IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies,” May 2023.
  • [44] T. Seno and M. Imai, “D3rlpy: An offline deep reinforcement learning library,” J. Mach. Learn. Res., vol. 23, Jan. 2022.
  • [45] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” CoRR, Dec. 2014.
  • [46] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny, “Efficient Lifelong Learning with A-GEM,” arXiv:1812.00420 [cs, stat], Jan. 2019.
  • [47] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine, “D4RL: Datasets for Deep Data-Driven Reinforcement Learning,” ArXiv, Apr. 2020.
  • [48] M. M. Derakhshani, X. Zhen, L. Shao, and C. Snoek, “Kernel Continual Learning,” in ICML, vol. 139, July 2021.