Dynamically Anchored Prompting for Task-Imbalanced Continual Learning

Chenxing Hong1    Yan **1,2    Zhiqi Kang4    Yizhou Chen1    Mengke Li3    Yang Lu1,2 (Corresponding Author: [email protected])    Hanzi Wang1,2
1Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China,
Xiamen University, Xiamen, China
2Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China
3Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
4Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
[email protected], [email protected], [email protected], [email protected], [email protected], {luyang, hanzi.wang}@xmu.edu.cn
Abstract

Existing continual learning literature relies heavily on a strong assumption that tasks arrive with a balanced data stream, which is often unrealistic in real-world applications. In this work, we explore task-imbalanced continual learning (TICL) scenarios where the distribution of task data is non-uniform across the whole learning process. We find that imbalanced tasks significantly challenge the capability of models to control the trade-off between stability and plasticity from the perspective of recent prompt-based continual learning methods. On top of the above finding, we propose Dynamically Anchored Prompting (DAP), a prompt-based method that only maintains a single general prompt to adapt to the shifts within a task stream dynamically. This general prompt is regularized in the prompt space with two specifically designed prompt anchors, called boosting anchor and stabilizing anchor, to balance stability and plasticity in TICL. Remarkably, DAP achieves this balance by only storing a prompt across the data stream, therefore offering a substantial advantage in rehearsal-free CL. Extensive experiments demonstrate that the proposed DAP results in 4.5% to 15% absolute improvements over state-of-the-art methods on benchmarks under task-imbalanced settings. Our code is available at https://github.com/chenxing6666/DAP

1 Introduction

Human beings possess the remarkable ability to learn new tasks and solve evolving challenges by leveraging knowledge from their past experiences. Inspired by this, continual learning (CL) methods are designed to address a series of tasks using a single model while preserving performance on previously mastered tasks Wang et al. (2022c); Zhang et al. (2020); Aljundi et al. (2018). However, achieving this goal is challenging for deep models, as they tend to easily forget previously learned information, a phenomenon known as catastrophic forgetting De Lange et al. (2021); Kirkpatrick et al. (2017). This issue primarily arises from the network’s tendency to overwrite old knowledge with new data during the training process.

Figure 1: An illustration of the scenario of task-imbalanced continual learning (TICL). (a) Three cases in TICL compared to ordinary task-balanced CL. (b) Performance degradation of DualPrompt on TICL CIFAR-100. The number of balanced and imbalanced tasks is ensured to be the same.

Although existing methods in CL have achieved notable progress, they generally assume a balanced distribution of training data across tasks, i.e., each task holds the same number of training samples. In practical applications, data streams often exhibit an imbalanced distribution Huang et al. (2019), where the data volume for each task can vary significantly, with some tasks presenting a large number of samples while others may have far fewer. This gives rise to task-imbalanced CL scenarios and introduces potential new issues for the CL process. The imbalance among tasks will amplify the difficulty of striking a balance between learning and retaining knowledge over time.

To address the aforementioned problem, this paper first investigates this more general and realistic scenario: task-imbalanced continual learning (TICL). TICL characterizes environments where the number of samples in each class is imbalanced across the data stream, reflecting the long-tailed nature observed in most real-world data distributions. In such distributions, a few classes dominate in terms of the number of samples, while many others have significantly fewer. In the context of CL, this imbalance manifests at the task level, leading to imbalanced tasks, as shown in Figure 1(a). We call the tasks with a relatively large number of samples the data-rich tasks and the tasks with a relatively small number of samples the data-scarce tasks. Specifically, we consider three cases of TICL:

  • Descending TICL: The learning process starts with data-rich tasks followed by data-scarce ones.

  • Ascending TICL: The learning process starts with data-scarce tasks preceding data-rich ones.

  • Shuffled TICL: Tasks arrive in a random sequence without a prescribed order.

While descending and ascending TICL are two extreme cases, the shuffled TICL can be regarded as the general form of TICL.

To show the challenges brought by the proposed scenarios, we conduct a quick experiment using DualPrompt Wang et al. (2022d), a typical prompt-based CL method, in Figure 1(b).

It can be observed that in the descending TICL case, the model initially learns well on data-rich tasks but its performance rapidly declines when training data becomes scarce. This indicates the poor plasticity of the model in adapting to new tasks with fewer samples. On the contrary, in the ascending TICL case, the model initially struggles with learning data-scarce tasks and remains poor in the following tasks. This is due to the severe forgetting caused by the following data-rich tasks (see Figure 2 for more analysis). These results reveal a more realistic plasticity and stability dilemma: given an incoming task, the model learns poorly when its training data is scarce and forgets rapidly when there is abundant data. Undoubtedly, both learning and forgetting become more challenging in TICL. Therefore, there is a clear need to design a specific method to address this issue.

In this paper, we propose a novel prompt-based approach named Dynamically Anchored Prompting (DAP). DAP only maintains a general prompt to learn from imbalanced task streams by strategically addressing the dilemma of stability and plasticity. It tackles the dilemma by decoupling stability and plasticity through two specialized prompts: the boosting prompt and the stabilizing prompt. The stabilizing prompt focuses on preserving the knowledge of past tasks, thereby ensuring stability and mitigating catastrophic forgetting. In contrast, the boosting prompt enhances the model’s generalization ability to learn and adapt to new tasks, promoting plasticity. The general prompt is updated by a novel dynamic stability-plasticity regularization (DSPR) strategy, which dynamically regularizes the general prompt in the prompt space based on task attributes, ensuring a flexible and adaptive learning process. Since DAP only stores a general prompt, it achieves superior performance with lower memory requirements, aligning well with the objectives of rehearsal-free CL. In summary, our contributions are three-fold:

  • We analyze a more realistic CL scenario with imbalanced tasks along with prompt-based learning algorithms, uncovering their defects caused by the stability and plasticity dilemma. This dilemma highlights a critical challenge: an effective balance between preserving existing knowledge and accommodating new learning demands.

  • We propose DAP, a novel approach to dynamically balance the stability and plasticity with a single regularized general prompt, which effectively addresses the challenges in TICL.

  • We evaluate the performance of DAP on benchmark datasets. Our proposed DAP exhibits significant improvements over previous state-of-the-art methods, with large margins ranging from 4.5% to 15%.

2 Related Work

2.1 Continual Learning

The existing research in CL within machine learning primarily assumes a balanced task distribution and focuses on addressing catastrophic forgetting, which can be categorized into three strategies De Lange et al. (2021); Wang et al. (2023): architectural expansion, regularization, and rehearsal. Architectural expansion adapts the model’s structure for new tasks, suitable where adding to the model is practical Kang et al. (2022); Wang et al. (2022b). Regularization, either in the weight or prediction space, aims to retain past task knowledge during new task training, with knowledge distillation being particularly effective in prediction space Castro et al. (2018); Cha et al. (2021). Rehearsal methods, using original or synthetic data, are efficient but face data privacy and storage issues Chaudhry et al. (2021); Wang et al. (2022a); Kang et al. (2023). These issues emphasize the need for rehearsal-free methods in CL, addressing both privacy concerns and computational efficiency Vaishnavh et al. (2018).

2.2 Prompting for Continual Learning

The recent trend in CL research focuses on combining prompting techniques with Vision Transformers (ViTs). This approach involves using a pre-trained, frozen backbone model from ImageNet, circumventing the need for a replay buffer. Prompting, initially applied in transfer learning with pre-trained language models like GPT-3, involves adding language-based instructions to the input text to guide the model in understanding downstream tasks. Traditional prompting methods were heuristic, but recent developments like Prompt Tuning Kemker and Kanan (2017) and Prefix Tuning Kemker et al. (2018) introduced the concept of learnable prompts in a continuous space, becoming mainstream in prompt-based learning.

In the realm of prompt-based CL, various methods have been proposed. Specifically, L2P Wang et al. (2022e) introduced a prompt pool concept to adjust the frozen ViT backbone for CL tasks. Building on this, DualPrompt Wang et al. (2022d) employs two different prompt types: G-Prompt for learning task-invariant knowledge and E-Prompt for task-specific knowledge, drawing inspiration from complementary learning systems. CODA-Prompt Smith et al. (2023) adopts input-conditioned prompts through an innovative attention-based end-to-end key-query mechanism that integrates the entire training sequence.

2.3 Imbalanced Continual Learning

While there is existing research on addressing imbalance in CL, it primarily concentrates on specific aspects of imbalance. For instance, BIC Wu et al. (2019) focuses on the imbalance between limited stored samples and current task samples, addressing challenges in storage under highly imbalanced conditions. PRS Kim et al. (2020) investigates the long-tail problem in multi-label scenarios, also relying on sample storage. Two-Stage-CL Liu et al. (2022) delves into long-tail class imbalances in CL, storing substantial amounts of original data.

It is worth noting that these studies are primarily rehearsal-based methods. This underscores the significant challenge of storing samples where class imbalance is inherent, as data selection itself is imbalanced. Furthermore, our approach to TICL diverges fundamentally from these existing works. While they concentrate on addressing imbalances within samples or classes, our focus is on the imbalance across different tasks.

3 Problem Formulation

3.1 Preliminaries

Our CL protocol adopts the class-incremental CL setting. The training data is denoted as a sequence of $T$ tasks $\mathcal{D}=\{\mathcal{D}_{1},\ldots,\mathcal{D}_{T}\}$, where $\mathcal{D}_{t}=\{(x_{i}^{t},y_{i}^{t})\}, i=1,...,N_{t}$ is sampled from a joint data distribution in the input and label space $X_{t}\times Y_{t}$ at task $t$, whose size (i.e., the number of samples in this task) is denoted as $N_{t}$. The target model is formulated as $f:X\rightarrow Y$, integrating a patch embedding layer $f_{p}$, and a backbone $f_{b}$ consisting of a stack of transformer encoder layers followed by a classifier, thus $f=f_{p}\circ f_{b}$. We employ a ViT-Base model, pre-trained on ImageNet, as the frozen feature extractor. In this class-incremental setting, similar to recent prompt-based CL methods Smith et al. (2023); Wang et al. (2022d, e), task boundaries are clearly defined with no shared classes between them, and task identity is provided only during training.

3.2 Task-Imbalanced Continual Learning

In this paper, we formally study task-imbalanced continual learning (TICL). TICL is characterized by the distribution of task sizes, i.e., $N_{t}$, throughout the learning process. In conventional task-balanced CL, all tasks have the same size, i.e., $N_{i}=N_{j}, \forall i\neq j$ and $i,j\in\{1,...,T\}$. Any distribution deviating from the uniform one introduces an inherent imbalance among tasks. Given the long-tail nature of real-world data distributions, we generally assume that the task sizes also follow a long-tail distribution. Specifically, the task sizes follow $N_{I_{1}}>...>N_{I_{T}}, \forall I_{i}<I_{j}$, where $I_{1},...,I_{T}$ are the sorting indices based on the task size in non-increasing order.
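To make the three TICL cases concrete, the following minimal sketch generates long-tailed task sizes and arranges them in descending, ascending, or shuffled order. The geometric decay profile, the function names `long_tailed_task_sizes` and `arrange`, and the parameters `max_size` and `imbalance_ratio` are assumptions made for illustration; the formulation above only requires that the task sizes be long-tailed.

```python
import random


def long_tailed_task_sizes(num_tasks, max_size, imbalance_ratio):
    """Sizes decaying geometrically from max_size to max_size / imbalance_ratio.

    The geometric profile is an illustrative assumption, not taken from the paper.
    """
    if num_tasks == 1:
        return [max_size]
    sizes = []
    for k in range(num_tasks):
        # k = 0 gives max_size; k = num_tasks - 1 gives max_size / imbalance_ratio.
        frac = k / (num_tasks - 1)
        sizes.append(max(1, round(max_size * imbalance_ratio ** (-frac))))
    return sizes


def arrange(sizes, case, seed=0):
    """Order task sizes for the descending, ascending, or shuffled TICL case."""
    if case == "descending":
        return sorted(sizes, reverse=True)
    if case == "ascending":
        return sorted(sizes)
    if case == "shuffled":
        shuffled = list(sizes)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown TICL case: {case}")


sizes = long_tailed_task_sizes(num_tasks=5, max_size=5000, imbalance_ratio=100)
descending = arrange(sizes, "descending")
ascending = arrange(sizes, "ascending")
shuffled = arrange(sizes, "shuffled", seed=0)
```

The descending and ascending streams contain exactly the same tasks and differ only in arrival order, which is what makes them useful as extreme cases of the shuffled stream.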

3.3 Case Study

Figure 2: Performance analysis of DualPrompt on different TICL cases with evaluation metrics. (a) Stability: $F$ shows the model stability to resist forgetting. (b) Plasticity: $P$ shows the model plasticity to learn new tasks.

We first study how far the model performance might drop in two extreme cases of TICL: descending TICL and ascending TICL. To properly quantify plasticity and stability Sun et al. (2022), we adopt two metrics to evaluate a model on the task $t$:

$$P=Acc_{t,t},\quad F=Acc_{t,t}-Acc_{T,t}, \tag{1}$$

where $Acc_{i,j}$ represents the accuracy on the test set of the task $j$ after finishing training on the task $i$. $P$ measures the model’s plasticity, i.e., the ability to learn new tasks, while $F$ measures the model’s stability, i.e., the resistance to catastrophic forgetting. Here, we examine the performance of DualPrompt Wang et al. (2022d) on descending TICL and ascending TICL with evaluation metrics $P$ and $F$. As a recent prompt-based CL method, it achieves a good balance between stability and plasticity in task-balanced scenarios. In addition, to rule out the effect of the small absolute size of each task and focus on the task imbalance problem, we also compare with the one-shot case, where each task contains only one sample for each class.
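As a minimal sketch of how the two metrics in Eq. (1) can be read off an accuracy matrix, the snippet below computes $P$ and $F$ for every task. The matrix values are hypothetical numbers chosen for illustration, and tasks are 0-indexed here for convenience.

```python
def plasticity(acc, t):
    """P for task t: accuracy on task t right after training on task t."""
    return acc[t][t]


def forgetting(acc, t):
    """F for task t: accuracy drop on task t between learning it and the end."""
    final = len(acc) - 1  # index of the last task T
    return acc[t][t] - acc[final][t]


# acc[i][j]: accuracy on the test set of task j after training on task i.
# Hypothetical values for a 3-task stream; entries with j > i are unused.
acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.75, 0.60],
]

P = [plasticity(acc, t) for t in range(len(acc))]
F = [forgetting(acc, t) for t in range(len(acc))]
```

By construction, $F$ is zero for the last task, and a smaller $F$ on earlier tasks corresponds to less forgetting.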

Figure 3: Overview of the proposed Dynamically Anchored Prompting (DAP) for task-imbalanced continual learning. Task-imbalanced training data stream represents the sequential arrival of data from Task 0 to Task t. There are two phases for each task. In in-task phase 1, the initialized task-specific prompt learns knowledge related to the current task, serving as a boosting anchor. In in-task phase 2, the general prompt is trained, with the assistance of the boosting anchor to ensure plasticity. Meanwhile, the center of all previously learned task-specific prompts serves as the stabilizing anchor, emphasizing stability. It’s worth noting that the boosting anchors of past tasks are not stored, as the stabilizing anchor is updated in an online manner.

Stability. Stability is extremely hard to achieve for a model trained on ascending TICL, as catastrophic forgetting occurs more easily on past data-scarce tasks. As shown in Figure 2(a), the model’s performance on earlier tasks declines significantly in the ascending case, which indicates severe catastrophic forgetting. The result of the one-shot case reveals that this forgetting is not just due to limited data quantity. On the contrary, in the descending case, the model keeps good stability because the data-rich tasks come first, and the following data-scarce tasks can hardly overwrite the knowledge of the data-rich tasks in the model.

Plasticity. Conversely, plasticity is extremely hard to achieve for a model trained on descending TICL, as the model fits the earlier data-rich tasks so well that it cannot generalize well to the following data-scarce tasks. As shown in Figure 2(b), the model performance in the descending case significantly drops. This raises the issue that the model initially trained on data-rich tasks struggles to adapt to the following data-scarce tasks, prioritizing stability over learning new information. We also compare with the one-shot case. The model shows consistent learning performance starting from the fifth task, surprisingly suggesting better adaptation if the last few tasks are not data-rich.

There exists a stability-plasticity dilemma in TICL scenarios. Accordingly, we propose a method to effectively balance this dilemma.

4 Dynamically Anchored Prompting

To address the issues identified in the case study, we propose Dynamically Anchored Prompting (DAP) for TICL in this paper. The proposed method is designed to dynamically balance and optimize the trade-off between stability and plasticity, enabling the model to learn flexibly in TICL.

Different from the existing prompt-based CL methods Wang et al. (2022d); Smith et al. (2023); Wang et al. (2022e), the key idea of DAP is to maintain a prompt across all the tasks, which is called the general prompt. The general prompt aims to learn to generalize each task with the knowledge of the pre-trained model. It can be easily used during inference as there is no prompt pool, and therefore no prompt selection is needed. During the inference, the challenge of adopting only a general prompt is to balance the knowledge learned from each task, especially for imbalanced tasks. Therefore, to achieve the goal of using the general prompt to generalize well across all tasks, the proposed DAP adopts a two-phase learning scheme for each training task, termed in-task phases. Given a specific training task, during the first in-task phase, we optimize a task-specific prompt, which is then utilized to regularize the general prompt during the second in-task phase. To tackle the imbalance problem of TICL, we propose dynamic stability-plasticity regularization to make the general prompt learn well and stably no matter the amount of training data in the next incoming task. After two-phase tuning for each task, we only need to save the general prompt for inference. The overall framework of the proposed DAP is shown in Figure 3.

4.1 Anchored Prompting

The goal of DAP is to obtain a high-quality general prompt that properly addresses the intrinsic problem of TICL. For each task, we first optimize the task-specific prompt in the first in-task phase. The task-specific prompt is used to form two key components: the boosting anchor $\mathbf{p}_{b}$ and the stabilizing anchor $\mathbf{p}_{s}$. These anchors are designed to regularize the relationship between the general prompt $\mathbf{p}_{g}$ and each anchor, effectively handling the plasticity and stability dilemma during training. The general prompt is then optimized in the prompt space with a new strategy called dynamic stability-plasticity regularization.

Boosting Anchor. Specifically, the boosting anchor $\mathbf{p}_{b}$ aims to maintain model plasticity, ensuring adaptability to new tasks. It is especially useful when the size of the current task is small compared to the past tasks. The boosting anchor $\mathbf{p}_{b}$ is simply set to the task-specific prompt optimized on the current task to capture the task-relevant information. Thus, $\mathbf{p}_{b}$ functions as a critical focal point, guiding the model towards learning trajectories that maximize plasticity. To optimize $\mathbf{p}_{b}$, we formulate the following loss function:

$$\mathcal{L}_{1}=\sum_{i=1}^{N_{t}}l_{CE}(f_{b}([\mathbf{p}_{b},f_{p}(x_{i}^{t})]),y_{i}^{t}). \tag{2}$$

Here, $f_{p}(x_{i}^{t})$ and $y_{i}^{t}$ represent the patched features of the $i$-th sample and its corresponding label from task $t$, respectively. $\mathbf{p}_{b}$ is re-initialized for each new task. The concatenation $[\mathbf{p}_{b},f_{p}(x_{i}^{t})]$ is fed into the pre-trained transformer as the input. $\mathcal{L}_{1}$ represents the total loss for the first in-task phase, and $l_{CE}$ is the cross-entropy loss. The goal of minimizing this loss is to make $\mathbf{p}_{b}$ fully learn the knowledge for adapting the pre-trained model for the current task.
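The structure of Eq. (2) can be sketched in a few lines: prompt tokens are prepended to the patch embeddings, the result is passed through the backbone, and per-sample cross-entropy losses are summed. In the sketch below, `toy_backbone` is a hypothetical stand-in (mean pooling followed by a linear classifier) used only so that the example runs without a real ViT; it is not the frozen transformer used in the paper, and no gradient step is shown.

```python
import math


def prepend_prompt(prompt_tokens, patch_tokens):
    """Build [p_b, f_p(x)]: prompt tokens followed by patch embeddings."""
    return prompt_tokens + patch_tokens


def toy_backbone(tokens, class_weights):
    """Hypothetical stand-in for the frozen f_b: mean-pool tokens, then linear."""
    dim = len(tokens[0])
    pooled = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [sum(w[d] * pooled[d] for d in range(dim)) for w in class_weights]


def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def phase1_loss(prompt_tokens, batch, class_weights):
    """Sum of per-sample cross-entropy losses, mirroring the form of Eq. (2)."""
    total = 0.0
    for patch_tokens, label in batch:
        tokens = prepend_prompt(prompt_tokens, patch_tokens)
        total += cross_entropy(toy_backbone(tokens, class_weights), label)
    return total


# Two prompt tokens and two samples with three patch tokens each (dimension 2).
p_b = [[0.1, 0.2], [0.3, 0.4]]
batch = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 0),
    ([[0.0, 2.0], [1.0, 0.0], [0.5, 0.5]], 2),
]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]  # one row per class
loss = phase1_loss(p_b, batch, weights)
```

In an actual training loop only the prompt would receive gradient updates, while the backbone parameters stay fixed.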

Stabilizing Anchor. To ensure model stability, the stabilizing anchor $\mathbf{p}_{s}$ is designed to prevent knowledge forgetting from past tasks by monitoring the learned task-specific prompts. To maintain the knowledge of all of the learned tasks, $\mathbf{p}_{s}$ is calculated as the weighted center of the boosting anchors of all learned tasks in the prompt space from task 1 to the current task $t$. The weight is associated with the inverse of the size of each task. To make the algorithm rehearsal-free, i.e., without storing any data or prompt, $\mathbf{p}_{s}$ can be updated in an online manner:

$$\mathbf{p}_{s}\leftarrow\frac{1}{\sum_{i=1}^{t}N_{i}}\left(\frac{\mathbf{p}_{s}}{\sum_{i=1}^{t-1}N_{i}}+\frac{\mathbf{p}_{b}}{N_{t}}\right). \tag{3}$$

The rationale behind this update is that task-specific prompts are optimized to generalize well in the corresponding task, so their weighted average represents stability. The weights in Eq. (3) associated with the corresponding task size $N_{i}$ emphasize the past tasks with a smaller size to make the stabilizing anchor uniformly represent the knowledge of past tasks.
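The idea of an online, inverse-size-weighted center can be sketched as follows. This is one reading of the prose description (weights proportional to $1/N_{i}$), not a term-by-term transcription of Eq. (3), and the official implementation may normalize differently. The class name `StabilizingAnchor` and the choice to keep a running weighted sum and a running weight total are assumptions; prompts are treated as flattened vectors.

```python
class StabilizingAnchor:
    """Running inverse-size-weighted mean of boosting anchors (illustrative)."""

    def __init__(self, dim):
        self.weighted_sum = [0.0] * dim  # running sum of p_b / N_i
        self.weight_total = 0.0          # running sum of 1 / N_i

    def update(self, boosting_anchor, task_size):
        """Fold in the current task's boosting anchor and return the new p_s."""
        w = 1.0 / task_size
        self.weighted_sum = [
            s + w * v for s, v in zip(self.weighted_sum, boosting_anchor)
        ]
        self.weight_total += w
        return self.value()

    def value(self):
        """Current stabilizing anchor p_s."""
        return [s / self.weight_total for s in self.weighted_sum]


anchor = StabilizingAnchor(dim=2)
anchor.update([1.0, 0.0], task_size=100)        # data-scarce task
p_s = anchor.update([0.0, 1.0], task_size=400)  # data-rich task
```

Only the running quantities are kept between tasks, so no past boosting anchor has to be stored. In this example the smaller task receives four times the weight of the larger one.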

Anchor Alignment. The purpose of maintaining the boosting anchor and the stabilizing anchor in each task is to dynamically regularize the learning of the general prompt, such that it can flexibly balance stability and plasticity in the TICL scenario. We employ cosine similarity to measure the proximity between the general prompt and anchors in each task for anchor alignment, as it focuses on the orientation rather than the magnitude of vector representations, effectively capturing the inherent relationships between prompts:

$$\mathcal{L}_{a}(\mathbf{p}_{g},\mathbf{p})=1-\frac{\mathbf{p}_{g}\cdot\mathbf{p}}{\|\mathbf{p}_{g}\|\|\mathbf{p}\|}, \tag{4}$$

where $\mathbf{p}$ can be either $\mathbf{p}_{b}$ or $\mathbf{p}_{s}$ to address the plasticity or the stability, respectively. Different from the methods that simultaneously adopt two sets of prompts Wang et al. (2022d), the proposed DAP only utilizes the boosting anchors (i.e., the task-specific prompts) as a constant to regularize the general prompt by anchor alignment. Therefore, task-specific prompts are not involved in the final inference process. The advantages of only maintaining the general prompt throughout the CL process are two-fold. On one hand, it avoids the error produced by matching the improper prompts in the prompt pool Wang et al. (2022e). On the other hand, it makes the proposed DAP fully rehearsal-free.
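Eq. (4) is simple enough to write out directly. The sketch below assumes that prompts are flattened into plain vectors of equal length and that neither vector is all zeros; the function name `anchor_alignment_loss` is chosen here for illustration.

```python
import math


def anchor_alignment_loss(p_g, p):
    """Eq. (4): one minus the cosine similarity of two flattened prompts."""
    dot = sum(a * b for a, b in zip(p_g, p))
    norm_g = math.sqrt(sum(a * a for a in p_g))
    norm_p = math.sqrt(sum(b * b for b in p))
    return 1.0 - dot / (norm_g * norm_p)


aligned = anchor_alignment_loss([1.0, 2.0], [2.0, 4.0])     # same direction
orthogonal = anchor_alignment_loss([1.0, 0.0], [0.0, 3.0])  # perpendicular
opposite = anchor_alignment_loss([1.0, 0.0], [-5.0, 0.0])   # reversed
```

The loss depends only on direction: rescaling either prompt leaves it unchanged, and its value ranges from 0 for aligned prompts to 2 for opposite ones.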

4.2 Dynamic Stability-Plasticity Regularization

Anchored prompting offers an opportunity to address either plasticity or stability individually, yet it falls short in dynamically adapting within complex, imbalanced task streams. Recall the three cases in TICL described in Figure 1(a). Updating the general prompt by anchor alignment with a single anchor may only work for the descending TICL or the ascending TICL. For shuffled TICL, the incoming task cannot be guaranteed to be a data-scarce or data-rich task, such that the emphasis on updating the general prompt cannot be determined. The challenge of achieving balance in such fluctuating scenarios remains unresolved.

To address the remaining issue in the DAP framework, we introduce a strategy that enables the model to adjust its focus between plasticity and stability according to the nature of TICL tasks. In the second in-task phase, we optimize the general prompt $\mathbf{p}_{g}$ by considering the balance between stability and plasticity. Different from $\mathbf{p}_{b}$, which is initialized in each new task, $\mathbf{p}_{g}$ is initialized in task 1 and updated through the whole CL process in order to generalize well on all learned classes. Thus, we propose a dynamic stability-plasticity factor $\lambda$ as the coefficient between the two anchor alignments. The factor $\lambda$ modulates the balance between stability and plasticity by considering the size of the current task $t$ and the sizes of past tasks:

$$\lambda=\frac{N_{t}-N_{\text{min}}}{N_{\text{max}}-N_{\text{min}}+\epsilon}, \tag{5}$$

where $N_{\text{min}}=\min_{i=0,...,t}N_{i}$ and $N_{\text{max}}=\max_{i=0,...,t}N_{i}$ are the minimum and maximum sizes of learned tasks, and $\epsilon$ is a small positive constant to prevent division by zero. It is basically the min-max normalization that measures the difference between the size of the current task and the minimum size of the learned tasks, which is then normalized into the range of $[0,1]$. As with many re-weighting techniques in long-tail learning Peng et al. (2023); Chen et al. (2023); Zhang et al. (2023b), the task size $N_{t}$ is an important indicator to represent the learning difficulty. It is relatively easier to learn from a task with a larger size. Accordingly, we use the factor $\lambda$ and $1-\lambda$ as the regularization coefficients to reflect this relationship. In the second in-task phase, the general prompt updated in the current task $t$ is given by the following loss function with the dynamic stability-plasticity factor $\lambda$:
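The factor in Eq. (5) can be computed from the list of task sizes seen so far, as in the minimal sketch below. The function name and the default `eps=1e-8` are assumptions made for illustration; the text only states that $\epsilon$ is a small positive constant.

```python
def stability_plasticity_factor(task_sizes, eps=1e-8):
    """Eq. (5) for the latest task, given the sizes of all tasks seen so far."""
    n_t = task_sizes[-1]
    n_min = min(task_sizes)
    n_max = max(task_sizes)
    return (n_t - n_min) / (n_max - n_min + eps)


lam_mid = stability_plasticity_factor([500, 100, 300])  # current task in between
lam_small = stability_plasticity_factor([500, 100])     # smallest task so far
lam_large = stability_plasticity_factor([100, 500])     # largest task so far
lam_first = stability_plasticity_factor([500])          # first task only
```

With these inputs, the in-between task gives a value close to 0.5, the smallest task gives exactly 0, and the largest task gives a value just below 1 because of $\epsilon$. For the very first task the numerator is zero, so the factor is 0.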

$$\mathcal{L}_{2}=\sum_{i=1}^{N_{t}}l_{CE}(f_{b}([\mathbf{p}_{g},f_{p}(x_{i}^{t})]),y_{i}^{t})+\lambda\cdot\mathcal{L}_{a}(\mathbf{p}_{g},\mathbf{p}_{s})+(1-\lambda)\cdot\mathcal{L}_{a}(\mathbf{p}_{g},\mathbf{p}_{b}). \tag{6}$$

Therefore, a larger $N_{t}$ yields a larger $\lambda$, which strengthens the pull toward the stabilizing anchor $\mathbf{p}_{s}$ and enhances stability to prevent forgetting. On the other hand, a smaller $N_{t}$, indicative of a more challenging task, decreases $\lambda$ and thus increases the weight $1-\lambda$ on the boosting anchor $\mathbf{p}_{b}$, ensuring sufficient plasticity for learning new, complex tasks. If the size of the current task $t$ is the largest or smallest so far, $\lambda$ becomes approximately 1 or exactly 0, respectively.
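
As a worked illustration of Eq. (5), the following minimal Python sketch computes $\lambda$ from the sizes of the tasks seen so far. The function name `dynamic_lambda`, the default value of `eps`, and the task sizes are assumptions made for this example; they are not taken from the released code.

```python
def dynamic_lambda(task_sizes, eps=1e-8):
    """Min-max normalize the current task size over all tasks seen so far.

    task_sizes: sizes N_0, ..., N_t of the learned tasks; the last entry
    is the current task t. eps is an assumed small positive constant.
    """
    if not task_sizes:
        raise ValueError("task_sizes must contain at least the current task")
    n_t = task_sizes[-1]
    n_min = min(task_sizes)
    n_max = max(task_sizes)
    return (n_t - n_min) / (n_max - n_min + eps)


# Hypothetical task sizes for a shuffled stream.
sizes = [500, 120, 800, 300]
lambdas = [dynamic_lambda(sizes[: t + 1]) for t in range(len(sizes))]
```

With these hypothetical sizes, the first task and the smallest task give $\lambda=0$, the largest task gives $\lambda$ close to 1, and the last task gives $\lambda=180/680\approx 0.26$.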

The effective utilization of the prompt anchors and the flexible adjustment of λ𝜆\lambdaitalic_λ according to the task sizes are the core of DAP, which adapts the pre-trained model well in the scenario of TICL. It facilitates a balance between the acquisition of new knowledge and the retention of prior learning.
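
To make the interplay of the terms in Eq. (6) concrete, here is a small sketch of how the cross-entropy term and the two anchor terms combine. The anchor loss $\mathcal{L}_{a}$ is defined earlier in the paper; the squared Euclidean distance used below is only a stand-in assumption, and the function names `anchor_distance` and `phase2_loss` are hypothetical.

```python
def anchor_distance(p, q):
    """Stand-in for L_a: squared Euclidean distance (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def phase2_loss(ce_loss, p_g, p_s, p_b, lam, anchor_loss=anchor_distance):
    """Cross-entropy plus lambda-weighted anchor regularization.

    ce_loss: summed cross-entropy on the current task (a plain float here).
    p_g, p_s, p_b: general prompt, stabilizing anchor, and boosting anchor,
    given as flat lists of floats for simplicity.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    stability_term = lam * anchor_loss(p_g, p_s)
    plasticity_term = (1.0 - lam) * anchor_loss(p_g, p_b)
    return ce_loss + stability_term + plasticity_term


# Toy prompts: with lam = 0 only the boosting anchor contributes,
# with lam = 1 only the stabilizing anchor contributes.
p_g, p_s, p_b = [0.0, 0.0], [1.0, 0.0], [0.0, 2.0]
loss_smallest_task = phase2_loss(0.5, p_g, p_s, p_b, lam=0.0)
loss_largest_task = phase2_loss(0.5, p_g, p_s, p_b, lam=1.0)
```

In an actual training loop the prompts would be tensors and the loss would be differentiated with respect to $\mathbf{p}_{g}$ only; the sketch shows just the weighting.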

Table 1: Comparison results (%) on TICL-CIFAR-100 and TICL-ImageNet-R. ‘Pre’ refers to pretraining and ‘P’ stands for prompt. $A_{N}$ gives the accuracy averaged over tasks, and $A_{L}$ gives the last accuracy. The best result in each column is marked with *.
Method      TICL-CIFAR-100                                  TICL-ImageNet-R                                 Buffer
            Descending      Ascending       Shuffled        Descending      Ascending       Shuffled
            A_N (↑) A_L (↑) A_N (↑) A_L (↑) A_N (↑) A_L (↑) A_N (↑) A_L (↑) A_N (↑) A_L (↑) A_N (↑) A_L (↑)
BiC         27.92   38.02   36.08   37.90   27.11   33.63   21.91   18.9    13.52   16.01   16.36   16.32   20/cls
PODNET      26.48   23.82   32.31   28.24   28.49   26.21   22.32   18.90   16.61   16.20   17.11   18.70   20/cls
EEIL++      31.49   38.24   35.93   37.85   31.63   39.31   18.50   17.81   15.76   15.60   15.79   16.30   20/cls
LUCIR++     27.74   21.26   42.39   28.92   35.62   25.94   18.36   21.29   8.05    8.27    15.62   15.62   20/cls
Pre+FT      65.83   22.02   19.52   25.58   43.30   33.56   40.60   7.68    18.22   21.15   21.37   22.62   0
Pre+iCaRL   53.00   28.73   41.70   26.88   48.62   31.02   48.41   29.55   24.40   29.17   40.21   23.02   20/cls
CODA-P      81.91*  58.98   54.54   41.84   60.90   42.56   52.39   35.21   28.21   32.62   40.02   34.78   0
L2-P        66.51   50.26   53.50   48.73   51.43   49.43   50.05   31.72   27.24   29.42   30.19   26.21   0
Dual-P      70.51   51.79   54.50   45.72   49.49   48.82   51.47   31.12   25.03   25.42   34.68   27.38   0
Ours        79.09   61.49*  56.30*  55.47*  61.43*  56.12*  58.47*  40.25*  31.42*  36.47*  43.22*  36.38*  0

5 Experimental Results

In this section, we first introduce the experimental setup and then compare the proposed DAP with different existing CL methods applied on TICL benchmarks. Finally, we specifically evaluate the effectiveness of DAP and conduct ablation studies on its key components.

5.1 Implementation Details

Datasets. Given that long-tailed distributions are the most prevalent form of imbalance in the real world, we adopt the long-tailed setting to construct imbalances. The long-tailed distribution typically follows an exponential decay in sample size across classes Cao et al. (2019). This decay is parameterized by $\rho$, the ratio between the sizes of the least and most frequent classes. $\rho=1$ is the conventional CIL case, and $\rho\in(0,1)$ indicates different degrees of long-tailed distribution.

We follow the experimental datasets used in previous works Wang et al. (2022d); Khan et al. (2023), first conducting a long-tail division of the datasets, and then dividing them into 10 disjoint tasks. Specifically, the CIFAR-100 dataset Krizhevsky et al. (2009) includes 100 classes of natural images, with 500 training samples for the head classes, and subsequently decreasing for the remaining classes according to the long-tail division method. To ensure balance within each task, we select an equal number of samples from each class within a task, based on the maximum class quantity present in that task. The ImageNet-R dataset Wang et al. (2022d) contains 200 classes of images, divided in a similar manner for long-tail calculation. Aligned with this long-tailed distribution approach, we examine three cases in our study: Descending TICL, where learners first encounter data-rich tasks followed by data-scarce ones; Ascending TICL, featuring data-scarce tasks preceding data-rich ones; Shuffled TICL, where tasks arrive in a random sequence without a prescribed order of data volume.
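
The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the decay formula $n_{c}=n_{\text{head}}\cdot\rho^{c/(C-1)}$ is the common exponential profile from the long-tailed literature, classes are assigned to tasks contiguously, and the value $\rho=0.01$ as well as the function names are hypothetical rather than taken from the paper's code.

```python
import random


def long_tailed_class_sizes(num_classes, n_head, rho):
    """Exponentially decaying class sizes; rho = n_tail / n_head in (0, 1]."""
    if num_classes < 2:
        return [n_head] * num_classes
    return [
        max(1, int(n_head * rho ** (c / (num_classes - 1))))
        for c in range(num_classes)
    ]


def build_ticl_tasks(class_sizes, num_tasks, order="descending", seed=0):
    """Split classes into disjoint tasks and balance the classes within each task.

    Every class in a task is assigned the largest class size found in that
    task, following the description in the text. Returns a list of
    (class_ids, samples_per_class) tuples in the requested arrival order.
    """
    per_task = len(class_sizes) // num_tasks
    tasks = []
    for t in range(num_tasks):
        ids = list(range(t * per_task, (t + 1) * per_task))
        tasks.append((ids, max(class_sizes[c] for c in ids)))
    if order == "ascending":
        tasks.reverse()
    elif order == "shuffled":
        random.Random(seed).shuffle(tasks)
    elif order != "descending":
        raise ValueError(f"unknown order: {order}")
    return tasks


# CIFAR-100-like setup: 100 classes, 500 samples for the head class.
sizes = long_tailed_class_sizes(num_classes=100, n_head=500, rho=0.01)
descending = build_ticl_tasks(sizes, num_tasks=10, order="descending")
ascending = build_ticl_tasks(sizes, num_tasks=10, order="ascending")
shuffled = build_ticl_tasks(sizes, num_tasks=10, order="shuffled")
```

The three calls at the end correspond to the Descending, Ascending, and Shuffled TICL cases; only the arrival order of the tasks differs between them.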

Comparison Methods. In our experiments, we compare our Dynamically Anchored Prompting (DAP) with two groups of rehearsal-based methods. The first group includes classical approaches such as PODNET Douillard et al. (2020) and BiC Wu et al. (2019). The second group comprises methods designed for long-tailed distributions, namely EEIL++ Liu et al. (2022) and LUCIR++ Liu et al. (2022). Additionally, we include two pre-trained baselines in our comparison, FineTune Khan et al. (2023) and iCaRL Rebuffi et al. (2017); both start from the same ImageNet pre-trained ViT-Base Dosovitskiy et al. (2020) model to ensure a fair comparison. Lastly, we compare DAP against the current state-of-the-art (SOTA) prompt-based methods, including L2P Wang et al. (2022e), DualPrompt Wang et al. (2022d), and CODA-Prompt Smith et al. (2023). These methods represent the latest advancements in prompt-based CL.

Evaluation Protocol. For the test set, we follow the setting of long-tailed learning research Cao et al. (2019), where the training set is imbalanced while the test set is balanced. Our test set is therefore consistent with that of balanced CL Wang et al. (2022c, d). In this manner, the testing accuracy can be averaged over all classes to reflect the performance of each class with equal weight. For evaluation, we report average values with standard errors for two widely used CL metrics: the average accuracy ($A_{N}$ $\uparrow$) Lopez-Paz et al. (2017), i.e., the accuracy averaged over tasks, and the last accuracy ($A_{L}$ $\uparrow$) Zhang et al. (2023a), i.e., the accuracy at the end of the learning process.
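
As a small illustration of the two metrics, the sketch below computes them from a list of per-task evaluation results. It assumes that the entry at index $t$ is the accuracy on the balanced test set of all classes seen so far, measured after learning task $t$; this reading of the metrics, the function name, and the numbers are assumptions for the example.

```python
def average_and_last_accuracy(acc_after_each_task):
    """Return (A_N, A_L) from per-task evaluation results.

    acc_after_each_task[t] is assumed to be the accuracy (%) on the balanced
    test set of all classes seen so far, measured after learning task t.
    """
    if not acc_after_each_task:
        raise ValueError("need at least one evaluation result")
    a_n = sum(acc_after_each_task) / len(acc_after_each_task)
    a_l = acc_after_each_task[-1]
    return a_n, a_l


# Hypothetical accuracies after each of four tasks.
a_n, a_l = average_and_last_accuracy([90.0, 80.0, 70.0, 60.0])
```

Here $A_{N}=75.0$ and $A_{L}=60.0$, which shows why the two metrics can diverge when early tasks are learned well and later performance degrades.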

Implementation. Following the settings of L2P Wang et al. (2022e), we train DAP using Adam with $\beta_{1},\beta_{2}$ of 0.9, a learning rate of 0.01, and a batch size of 64. We resize the input images to a $224\times 224$ resolution and normalize them between 0 and 1. To ensure that the models converge, we train on TICL-CIFAR-100 for 5 epochs per task and on TICL-ImageNet-R for 50 epochs per task.

5.2 Comparison to the State-of-The-Art

We compare various rehearsal-based and prompt-based methods for TICL-CIFAR-100 and TICL-ImageNet-R in Table 1. We observe that DAP consistently outperforms all rehearsal-based methods by a considerable margin, with a substantial improvement ranging from 10% to 30%, establishing a new state-of-the-art in all cases. It also demonstrates a significant advantage over other prompt-based methods, showing an absolute increase of 4.5% to 15%.

5.3 Effectiveness of the General Prompt

To verify that adopting a single general prompt $\mathbf{p}_{g}$ is able to accumulate knowledge across the data stream, we adopt a linear probing experiment He et al. (2020) to evaluate the performance of the representation layer. Specifically, following each incremental learning task, we freeze the representation layer and introduce an additional classification layer known as a linear probe, which is trained on all classes of the benchmark dataset.
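
The linear probing procedure can be sketched in a dependency-free way as follows: the features are treated as fixed inputs, and only a softmax classification layer is trained on top of them. The toy two-dimensional features, the learning rate, the number of epochs, and the function names are assumptions for illustration; in the actual experiment the features would come from the frozen backbone with the general prompt attached.

```python
import math


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax classifier on frozen features with full-batch gradient descent."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    n = len(features)
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            logits = [sum(w * v for w, v in zip(weights[c], x)) + bias[c]
                      for c in range(num_classes)]
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(z - m) for z in logits]
            total = sum(exps)
            for c in range(num_classes):
                err = exps[c] / total - (1.0 if c == y else 0.0)
                grad_b[c] += err / n
                for d in range(dim):
                    grad_w[c][d] += err * x[d] / n
        for c in range(num_classes):
            bias[c] -= lr * grad_b[c]
            for d in range(dim):
                weights[c][d] -= lr * grad_w[c][d]
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        scores = [sum(w * v for w, v in zip(weights[c], x)) + bias[c]
                  for c in range(len(weights))]
        correct += int(scores.index(max(scores)) == y)
    return correct / len(labels)


# Toy stand-in for frozen features of two classes.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
w, b = train_linear_probe(feats, labels, num_classes=2)
acc = probe_accuracy(w, b, feats, labels)
```

Because the features are never updated, the probe accuracy reflects only the quality of the representation, which is what the experiment is meant to isolate.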

We conducted a detailed analysis of the descending, ascending, and shuffled cases. As illustrated in Figure 4, in the descending case, we observe a rapid initial improvement, indicating that the model indeed learns significant global knowledge when presented with abundant data initially. However, as the tasks progress and data availability decreases, we notice a stagnation in learning, suggesting that under typical conditions, the model struggles to acquire new knowledge with limited data. In contrast, DAP continues to improve even with less data, overcoming this learning stagnation. In the ascending case, despite continuous learning, the final performance is lower than in the descending case, implying some degree of knowledge forgetting. Yet, DAP still shows an upward trend at the end. A similar pattern is observed in the shuffled case. This demonstrates that DAP successfully accumulates knowledge, effectively handling the disparities in task size in TICL environments. Moreover, DAP consistently outperforms DualPrompt Wang et al. (2022d) across all tasks.

Figure 4: Performance comparison between DAP and DualPrompt with linear probe.

5.4 Ablation Study

In this section, we delve into an in-depth ablation study to validate the effectiveness and contributions of different components in our model.

Method                              Desc.   Asc.    Shuff.
(1) w/ only task-specific prompt    64.75   52.56   52.23
(2) w/ only general prompt          66.79   50.49   47.50
(3) DAP                             79.09   56.30   61.43
Table 2: Ablation on general prompt and task-specific prompt.

Dynamic Factor $\lambda$. The factor $\lambda$ is essential for calibrating the balance between stability and plasticity. Therefore, to assess the effectiveness of the dynamic regularization, we compare the dynamically adjusted $\lambda$ in the shuffled case against fixed values of $\lambda$ from 0 to 1. As illustrated in Figure 5, the dynamically adjusted $\lambda$ consistently outperforms any fixed value of $\lambda$. This supports our premise that a dynamic $\lambda$ adapts more flexibly to the evolving requirements of learning new information while retaining previously acquired knowledge. Generalization on all classes is ensured irrespective of the data abundance in each task during learning.

Figure 5: Ablation on dynamic factor $\lambda$.

Method                            A_N (↑)   F (↓)   P (↑)
(1) w/ only general prompt        47.50     18.28   63.29
(2) w/ only boosting anchor       58.15     15.45   70.38
(3) w/ only stabilizing anchor    59.47      6.19   64.86
(4) DAP                           61.43      8.04   68.04
Table 3: Ablation on boosting anchor and stabilizing anchor in the shuffled case. $A_{N}$ gives the accuracy averaged over tasks, $F$ gives the average forgetting, and $P$ gives the average plasticity.

Task-specific Prompt and General Prompt. We further examine the roles of prompts within the DAP framework. Since DAP optimizes the task-specific prompt for each task in the first in-task phase and continually updates the general prompt in the second in-task phase, we can investigate the performance of using either prompt exclusively. This study aims to discern whether each prompt can sustain the model’s efficacy across various learning scenarios on its own. We compare DAP with two variants: (1) Using only the task-specific prompt. The task-specific prompts for all tasks are stored for inference, and it is assumed that the correct prompt is selected for each test sample. (2) Using only the general prompt. The general prompt is continually optimized without dynamic stability-plasticity regularization. As shown in Table 2, employing either the task-specific prompt or the general prompt alone yields results that are much inferior to DAP. The task-specific prompt cannot harness information across tasks, while updating the general prompt without regularization cannot balance stability and plasticity well. This verifies the strength of DAP’s design, which updates the general prompt with regularization, in addressing the demands of TICL.

Ablation on Anchors. To ablate the boosting anchor and stabilizing anchor, we focus exclusively on the shuffled case of TICL, because this case uniquely requires the model to balance both stability and plasticity. We conduct experiments in this setting using each anchor type in isolation, without the dynamic stability-plasticity regularization, to assess the capability of each anchor. As seen from Table 3, employing the boosting anchor or the stabilizing anchor in isolation enhances only the model’s plasticity or stability, respectively, at a significant cost to the other. The best performance is attained by combining both boosting and stabilizing anchors with the dynamic factor.
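
For readers who want to reproduce the $F$ and $P$ columns, the sketch below shows one common way to compute such quantities from an accuracy matrix. The definitions used here are assumptions, since this section does not spell them out: forgetting is taken as the mean drop from the best earlier accuracy on each old task to its final accuracy, and plasticity as the mean accuracy on each task right after it is learned. The matrix values are hypothetical.

```python
def average_forgetting(acc):
    """Mean drop from the best earlier accuracy to the final accuracy per old task.

    acc[i][j] is the accuracy on task j after learning task i (only j <= i is used).
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        best_before_end = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_before_end - acc[num_tasks - 1][j])
    return sum(drops) / len(drops)


def average_plasticity(acc):
    """Mean accuracy on each task measured right after learning it."""
    return sum(acc[j][j] for j in range(len(acc))) / len(acc)


# Hypothetical lower-triangular accuracy matrix for three tasks.
acc = [
    [80.0],
    [70.0, 60.0],
    [65.0, 55.0, 50.0],
]
forgetting = average_forgetting(acc)
plasticity = average_plasticity(acc)
```

In this toy example the average forgetting is 10.0 and the average plasticity is about 63.3; a lower $F$ indicates better stability and a higher $P$ indicates better plasticity.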

6 Conclusion

In this paper, we formally define task-imbalanced continual learning (TICL) and systematically study its three cases. We discovered that imbalanced tasks significantly deteriorate the performance of prompt-based CL methods because they raise a new challenge to consider the dilemma of stability and plasticity. To counteract this, we introduced Dynamically Anchored Prompting (DAP). DAP addresses the challenge by separating stability and plasticity with two prompts, one for stabilization and the other for plasticity, serving as anchors to guide the learning process of a general prompt. DAP improves the performance of baseline prompt-based TICL methods to set a new state-of-the-art.

7 Acknowledgements

This study was supported in part by the National Natural Science Foundation of China under Grants 62376233, U21A20514, and 62306181; in part by the FuXiaQuan National Independent Innovation Demonstration Zone Collaborative Innovation Platform under Grant 3502ZCQXT2022008; in part by the Natural Science Foundation of Guangdong Province under Grant 2024A1515010163; in part by the China Fundamental Research Funds for the Central Universities under Grant 20720230038; and in part by Xiaomi Young Talents Program.

References

  • Aljundi et al. (2018) Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In ECCV, pages 139–154, 2018.
  • Cao et al. (2019) Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. NeurIPS, 32, 2019.
  • Castro et al. (2018) Francisco M Castro, Manuel J Marin-Jimenez, Nicolas Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In ECCV, 2018.
  • Cha et al. (2021) Hyuntak Cha, Jaeho Lee, and Jinwoo Shin. Co2l: Contrastive continual learning. In CVPR, pages 9516–9525, 2021.
  • Chaudhry et al. (2021) Arslan Chaudhry, Albert Gordo, Puneet Dokania, Philip Torr, and David Lopez-Paz. Using hindsight to anchor past knowledge in continual learning. In AAAI, volume 35, pages 6993–7001, 2021.
  • Chen et al. (2023) ** Wang. Area: adaptive reweighting via effective area for long-tailed classification. In ICCV, pages 19277–19287, 2023.
  • De Lange et al. (2021) Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. TPAMI, 44(7):3366–3385, 2021.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Douillard et al. (2020) Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. Podnet: Pooled outputs distillation for small-tasks incremental learning. In ECCV, pages 86–102. Springer, 2020.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
  • Huang et al. (2019) Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Deep imbalanced learning for face recognition and attribute prediction. TPAMI, 42(11):2781–2794, 2019.
  • Kang et al. (2022) Haeyong Kang, Rusty John Lloyd Mina, Sultan Rizky Hikmawan Madjid, Jaehong Yoon, Mark Hasegawa-Johnson, Sung Ju Hwang, and Chang D Yoo. Forget-free continual learning with winning subnetworks. In ICML, pages 10734–10750. PMLR, 2022.
  • Kang et al. (2023) Zhiqi Kang, Enrico Fini, Moin Nabi, Elisa Ricci, and Karteek Alahari. A soft nearest-neighbor framework for continual semi-supervised learning. In ICCV, pages 11868–11877, 2023.
  • Kemker and Kanan (2017) Ronald Kemker and Christopher Kanan. Fearnet: Brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563, 2017.
  • Kemker et al. (2018) Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. Measuring catastrophic forgetting in neural networks. In AAAI, volume 32, 2018.
  • Khan et al. (2023) Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, and Muhammad Zeshan Afzal. Introducing language guidance in prompt-based continual learning. In ICCV, pages 11463–11473, 2023.
  • Kim et al. (2020) Chris Dongjoo Kim, Jinseo Jeong, and Gunhee Kim. Imbalanced continual learning with partitioning reservoir sampling. In ECCV, pages 411–428. Springer, 2020.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Liu et al. (2022) Xialei Liu, Yu-Song Hu, Xu-Sheng Cao, Andrew D Bagdanov, Ke Li, and Ming-Ming Cheng. Long-tailed class incremental learning. In European Conference on Computer Vision, pages 495–512. Springer, 2022.
  • Lopez-Paz et al. (2017) David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Scholkopf, and Léon Bottou. Discovering causal signals in images. In CVPR, pages 6979–6987, 2017.
  • Peng et al. (2023) Hanyu Peng, Weiguo Pian, Mingming Sun, and Ping Li. Dynamic re-weighting for long-tailed semi-supervised learning. In WACV, pages 6464–6474, 2023.
  • Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In CVPR, pages 2001–2010, 2017.
  • Smith et al. (2023) James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In CVPR, pages 11909–11919, 2023.
  • Sun et al. (2022) Qing Sun, Fan Lyu, Fanhua Shang, Wei Feng, and Liang Wan. Exploring example influence in continual learning. NIPS, 35:27075–27086, 2022.
  • Vaishnavh et al. (2018) N Vaishnavh, C Raffel, and IJ Goodfellow. Theoretical insights into memorization in gans. In NeurIPS, 2018.
  • Wang et al. (2022a) Fu-Yun Wang, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. Foster: Feature boosting and compression for class-incremental learning. In ECCV, pages 398–414. Springer, 2022.
  • Wang et al. (2022b) Liyuan Wang, Xingxing Zhang, Qian Li, Jun Zhu, and Yi Zhong. Coscl: Cooperation of small continual learners is stronger than a big one. In ECCV, pages 254–271. Springer, 2022.
  • Wang et al. (2022c) Zhen Wang, Liu Liu, Yiqun Duan, Yajing Kong, and Dacheng Tao. Continual learning with lifelong vision transformer. In CVPR, pages 171–181, 2022.
  • Wang et al. (2022d) Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In ECCV, pages 631–648, 2022.
  • Wang et al. (2022e) Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In CVPR, pages 139–149, 2022.
  • Wang et al. (2023) Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487, 2023.
  • Wu et al. (2019) Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In CVPR, pages 374–382, 2019.
  • Zhang et al. (2020) Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry Heck, Heming Zhang, and C-C Jay Kuo. Class-incremental learning via deep model consolidation. In CVPR, pages 1131–1140, 2020.
  • Zhang et al. (2023a) Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. Slca: Slow learner with classifier alignment for continual learning on a pre-trained model. arXiv preprint arXiv:2303.05118, 2023.
  • Zhang et al. (2023b) Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, and Jiashi Feng. Deep long-tailed learning: A survey. TPAMI, 2023.