REP: Resource-Efficient Prompting for
On-device Continual Learning

Sungho Jeon1  Xinyue Ma1  Kwang In Kim2  Myeongjae Jeon1
1UNIST  2POSTECH
{sungho, xinyuema, mjjeon}@unist.ac.kr
{kimkin}@postech.ac.kr
Abstract

On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical. This is extremely challenging because a CL method must preserve accuracy while learning new tasks from continuously drifting data, and must maintain both high energy and memory efficiency to be deployable on real-world devices. Typically, a CL method leverages one of two types of backbone networks: CNN or ViT. It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance, making each option attractive only for a single aspect. In this paper, we revisit this comparison while embracing powerful pre-trained ViT models of various sizes, including ViT-Ti (5.8M parameters). Our detailed analysis reveals that many practical options exist today for making ViT-based methods more suitable for on-device CL, even when accuracy, energy, and memory are all considered. To further expand this impact, we introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods. Our key focus is on avoiding catastrophic trade-offs with accuracy while trimming computational and memory costs throughout the training process. We achieve this by exploiting swift prompt selection that enhances input data using a carefully provisioned model, and by developing two novel algorithms, adaptive token merging (AToM) and adaptive layer dropping (ALD), that optimize the prompt updating stage. In particular, AToM and ALD perform selective skipping across the data and model-layer dimensions without compromising task-specific features in vision transformer models. Extensive experiments on three image classification datasets validate REP’s superior resource efficiency over current state-of-the-art methods.

1 Introduction

In on-device continual learning (CL), neural network models acquire new knowledge from new data locally, without dependence on remote servers for model training. This new data is typically non-independent and non-identically distributed (non-IID) and forms multiple sequential tasks, where each task may include data that diverges from previously encountered data. A crucial challenge for any CL task, including on-device scenarios [1, 2, 3], is to effectively address catastrophic forgetting [4]. Severe forgetting occurs when a model rapidly loses previously learned knowledge while adapting to new data. It impairs overall model accuracy and user experiences of mobile and edge AI services [5].

At the same time, learning methods in on-device CL must fulfill computational (equivalently, energy) and memory efficiency requirements. Energy is scarce and can be intermittent in small devices deployed in the wild [6, 7, 8], yet it is quickly drained by GPU operations during training [9]. Achieving higher energy efficiency while training a new task in CL invariably enhances device durability, making energy cost a critical soft constraint in edge AI. In contrast, memory efficiency often acts as a hard constraint. Device memory capacity is typically very limited (256MB–8GB) and unified to operate all CPU and GPU programs. Thus, a CL task that exhausts all available memory risks system crashes due to out-of-memory errors. Currently, no silver-bullet hardware solution to these resource constraints exists, as smartphones and IoT devices adhere to small form factors, making it difficult to dramatically augment battery and memory capacity.

Figure 1: (a) Devices grouped by memory capacity. IoT devices typically have less than 1GB of memory, while high-end edge platforms can offer up to 8GB. ViT- and CNN-based methods with specific backbone models are mapped to the memory capacity they can fit into. (b) A memory breakdown of the methods and backbones corresponding to the 1–4GB memory budget. While it is necessary to reserve memory for replay buffers and model checkpoints in CNN-based methods (MEMO and BudgetCL), feature maps dominate the overall memory usage for all cases. Similar trends are observed in other memory ranges, as depicted in Fig. 2(b).

Then, which existing methods stand out in on-device CL when considering accuracy and system efficiency together? To conduct a thorough analysis, we closely examine CNN- and ViT-based CL methods. CL methods using CNN models, such as ResNet-10/18/34 [10], are known for system efficiency due to their relatively small size (5–22M parameters). Conversely, CL methods employing ViT models excel in model performance, as pre-trained ViT models [11, 12, 13] are renowned for their robust capabilities. Out of various emerging methodologies, we are particularly interested in prompt-based rehearsal-free methods [14, 15, 16, 17, 18]. Prompts are a small set of parameters that progressively learn incoming tasks to combat forgetting. Updates to such “small” prompts are minimal and do not significantly harm the lifespan of device storage ($\sim$10K writes), making them an ideal fit for on-device CL. We rigorously compare the abilities of competing state-of-the-art methods while aligning memory budgets to reflect diverse device specifications, as shown in Fig. 1(a).

We find that ViT-based methods indeed considerably outperform CNN-based ones, with larger backbone networks yielding higher accuracy. However, surprisingly, while small CNNs appear to promise high resource efficiency, their actual memory usage far exceeds expectations. For instance, a recent proposal MEMO [19], when set to use 1.2GB of memory, consumes up to 4.2GB as measured by CUDA APIs [20], which is 3.5$\times$ the amount set by MEMO. This disparity arises because training requires substantial memory to stash feature maps for every training iteration and to operate ML runtime (Fig. 1(b)). Importantly, thanks to the recent evolution in small-memory ViT models, such as ViT-Ti [11] (5.8M parameters; similar size to ResNet-10), we can always spot more powerful ViT models that match the system efficiency of the CNN models under consideration. Taking accuracy, energy, and memory all into account, ViT-based methods are observed to be more suitable for on-device CL.

To further empower ViT-based CL for system efficiency, we introduce REP (Resource-Efficient Prompting), built upon our analysis of cost-accuracy trade-offs across the end-to-end learning process. We confirm two key design insights. (1) The prompt selection stage, which constructs a prompt subset to augment input or intermediate data, is highly amenable to numerous promising options with fast approximations [21, 22]. (2) In contrast, the prompt update stage, which involves forward-backward passes over the backbone model, presents a range of optimizations with vastly different cost-accuracy trade-offs [23, 24, 25, 26, 27]. With these insights, we develop several complementary techniques for both stages that trade minimal accuracy for high resource efficiency: (1) model downsizing for prompt selection and (2) adaptive token merging (AToM) and adaptive layer dropping (ALD) for prompt update. In ViT models, task-specific knowledge is concentrated in shallow layers, while generalized feature information is spread across all layers [15] (see Fig. 4 for our analysis). So, AToM and ALD carry out selective skipping across the data and model-layer dimensions in a “non-uniform” manner to preserve critical task-specific features in the shallow layers. By integrating all these contributions, REP achieves a level of resource efficiency compatible with real-world commodity devices, including NVIDIA Jetson TX2 [28] and Nano [29].

2 Related Work

Class-Incremental Learning. Our work primarily targets class-incremental learning (CIL) among various CL scenarios. In CIL, each incoming task introduces new classes within the same domain during the training phase, while task ID is not at hand during the inference phase [30]. In prior art using CNN backbone networks, rehearsal-based methods, which archive and replay representative samples from previous tasks [31, 32, 33, 34], have been a popular choice due to their superior performance [35]. Several recent methods are noteworthy along this line. For instance, BudgetCL [36] and CarM [37] suggest data-driven methods that outperform other complex strategies based on algorithmic optimizations, particularly when training is constrained by compute budgets. MEMO [19] effectively trades memory used to store old samples for saving task-specific model layers. We compare prompt-based methods with some of these methods and confirm better performance under resource constraints.

Continual Learning for the Edge. There has been a notable emphasis on efficient memory and energy usage in on-device learning, particularly in non-CL settings [38, 39, 40]. More recently, a handful of studies have explored extending CL capabilities to edge devices. Hayes et al. [2] investigated online CL methods using CNN models tailored for embedded devices and provided valuable algorithmic insights. Similarly, Kwon et al. [1] undertook a comparative analysis of rehearsal vs. regularization methods to uncover the cost-accuracy trade-offs, including storage, compute, and memory requirements. Ma et al. [9] proposed an ML platform called Miro, which dynamically configures key design parameters of CNN-based CL methods to reduce energy costs within the device’s memory capacity. Unfortunately, none of these studies address the specific challenge of enabling vision transformers for on-device CL, which stands as a main goal of our research.

Prompting for Continual Learning. The use of prompts, which are explicit instructions or queries given to the model during training or inference, has shown promise in guiding continual learning for vision transformers [14, 15, 16, 17, 18]. The general concept is to fine-tune prompts for new tasks while retaining knowledge acquired from previous tasks. L2P [14] optimizes a query function on a shared prompt pool to select the top-N best-matching prompts for input tokens. DualPrompt [15] proposes a dual-prompt approach that utilizes general and expert prompts to express task-invariant and task-specific knowledge, respectively. CODA-Prompt [17] introduces attention-conditioned prompts, which inherently facilitate higher prompting capacity. Our work focuses on prompting techniques that emphasize system efficiency and can be applied to all the above methods.

3 Empirical Analysis

We focus on the popular disjoint-CIL setup, where tasks consist of distinct sets of non-overlapping classes and samples from old classes are never given in subsequent tasks [34, 32, 30].

Data Generation. To organize task streams, we use two image classification datasets: CIFAR-100 [41] (100 classes) and ImageNet-R [42] (200 classes). Compared to CIFAR-100, ImageNet-R is known for exhibiting much higher intra-class variability among images and an uneven class-size distribution. By default, we divide each dataset into 10 tasks to create Split CIFAR-100 (i.e., 10 classes per task) and Split ImageNet-R (i.e., 20 classes per task) [14, 15, 17].

Methods. We implement L2P [14], DualPrompt [15], and CODA-Prompt [17] as representative ViT-based prompting methods, and BudgetCL [43] and MEMO [19] as state-of-the-art CNN-based methods using PyTorch. All prompting methods capitalize on ImageNet pre-trained models as backbones: ViT-L (307M), ViT-B (86M), and ViT-Ti (5.8M)—ViT-Ti is one of the smallest vision transformer models to our knowledge. For CNN-based methods, we adopt pre-trained ResNet-10 (RN10; 5M), ResNet-18 (RN18; 11M), ResNet-26 (RN26; 14M), ResNet-34 (RN34; 22M), and ResNet-101 (RN101; 43M) models. We categorize the prompt-based methods into three groups according to their memory consumption based on Fig. 1(a).

For fair comparisons, we align the memory usage of the CNN-based methods with the best-performing ViT-based method within the same memory range. Both BudgetCL and MEMO leverage spare memory to boost model accuracy by storing and replaying past samples, with MEMO also storing model checkpoints from history. For detailed implementations and hyperparameters, please refer to Sec. B.2.

Hardware and Metrics. Most power consumption that influences energy usage during training comes from GPU operations (see Fig. 7 in Sec. G.1). Therefore, training wall-clock time correlates linearly with energy usage, although energy is more intuitively measured in Joules. Given that not all CL methods are currently feasible on small devices due to memory limitations, we use NVIDIA RTX 3090 Ti as a reference GPU to compare CNN- and ViT-based methods and use the wall-clock time required to complete all tasks as a proxy for energy cost. (In Sec. G.2, we evaluate our memory-efficient method directly on NVIDIA Jetson TX2, although it is applicable beyond this specific device.) We also report the final average accuracy by averaging the accuracy of all classes after the last task training [34, 44, 35, 45, 31], and measure memory usage using CUDA APIs [20].

3.1 Results and Findings

We first uncover the energy-accuracy trade-offs to highlight the impact of each method on the accuracy gained relative to its training cost. Fig. 2(a) visually represents the comparison results across device memory capacities and datasets. In each graph, the x-axis indicates wall-clock time, which corresponds to energy cost (lower is better), while the y-axis indicates final average accuracy (higher is better). Thus, a more cost-effective method appears closer to the upper-left corner of the graph. Overall, ViT-based methods outperform CNN-based methods by a large margin, achieving 26–36% higher accuracy with 45–90% less time and energy spent under the same memory budget.

Figure 2: Energy-accuracy trade-offs of various ViT- and CNN-based methods over three different memory budgets compiled by Fig. 1(a). (a) Accuracy over the GPU time spent. Experiments on Split CIFAR-100 and Split ImageNet-R are on the first and second rows, respectively. ViT-based methods consistently outperform CNN-based methods. (b) Memory breakdown of 4–8GB and up to 1GB budgets. The breakdown for 1–4GB is covered in Fig. 1(b).

Why CNN-based Methods Underperform. Although CNN-based methods use smaller backbone models, their limited model capacity and the increasing training costs associated with replaying past samples and model checkpoints make them less competent in terms of accuracy and resource efficiency. We elaborate on two key characteristics that explain these shortcomings in more detail:

• Performance scaling. ViT-based methods scale well, with larger backbone networks yielding higher accuracy. In Fig. 2(a), the accuracy gain is at least 20% when switching from ViT-Ti to ViT-B, and an additional 7% or more when moving from ViT-B to ViT-L. In contrast, CNN-based methods scale poorly with increased backbone size. Specifically, on Split ImageNet-R, a more challenging dataset, MEMO/RN34 is only 2.95% better than MEMO/RN18, despite being 1.9$\times$ larger. On Split CIFAR-100, increasing the backbone size leads to little or no gain. Trading model size for a larger replay buffer is generally more efficient. This means that enlarging the backbone size for CNN models is not always beneficial, even with abundant memory available.

• Memory efficiency. CNN-based methods are often considered more memory-efficient because they use backbones an order of magnitude smaller than ViT models. Thus, it is commonly believed that ample memory can be allocated to memory buffers for replay samples or past model checkpoints [19], which help improve model performance. However, this is a major misconception: in CL, which involves typical training iterations, a substantial amount of memory must be allocated for managing feature maps [46]. Feature maps dominate overall memory usage and appear in all methods, thereby leaving little room for memory buffers. For a device memory range of 1–4GB in Fig. 1(b), the backbone of MEMO/RN18 consumes only 43MB of memory (6.5% of ViT-L), but feature maps occupy as much as 1.7GB (more than ViT-based methods) as training goes through layers extended for past model checkpoints. This causes MEMO/RN18 to be short on memory for the replay buffer, dramatically degrading model accuracy, as observed in Fig. 2(a). Fig. 2(b) shows profiling results for other memory ranges that reveal a similar trend.

What Our Study Implies. Based on the study so far, ViT-based methods appear to be more favored than CNN-based methods for on-device CL. However, CODA-Prompt consumes significantly more resources with the same ViT backbone, making it less efficient compared to other methods that operate within the same memory budget. This inefficiency stems from CODA-Prompt’s attention-based prompt selection algorithm, as opposed to the prompt-pool-based selection used in L2P and DualPrompt. The attention-based approach results in significantly larger feature maps. For instance, the feature maps of CODA-Prompt with ViT-Ti backbone become as large as those of L2P and DualPrompt using the larger ViT-B backbone.

It is important to note that the best-performing method is not fixed and can dynamically change depending on the backbone in use. In our study, L2P performs best with ViT-L backbone, while DualPrompt is more effective with ViT-B or ViT-Ti backbone (Fig. 2(a)). In Sec. 5.1, we further observe that the choice of method and backbone is more dynamic. Therefore, to enhance ViT-based CL for better resource efficiency, we need to design adaptive techniques tailored to any backbone, regardless of the specific learning method in use.

4 REP: Resource-Efficient Prompting

Figure 3: Overview of REP. REP stands for Resource-Efficient Prompting for prompt-based rehearsal-free CL. REP calculates query features from input samples using a lightweight model (e.g., ViT-Ti) to swiftly extract prompts from the prompt pool. These prompts are then inserted into a main backbone model (e.g., ViT-L) for training, which prioritizes model accuracy.

We aim to reduce the cost of prompt-based CL to broaden its benefits and applicability. As the learning process typically involves two key stages, prompt selection and prompt update, each stage is optimized based on its unique characteristics. An overview of our method REP is provided in Fig. 3.

4.1 Prompt Selection

The prompt selection stage operates a neural network model $\bm{f_{\text{query}}}$ to extract representations of input data from a prompt pool [14, 15]. Formally, for a given input $x_i^j$ in task $T_i$, $\bm{f_{\text{query}}}$ calculates a query feature $\bm{q}(x_i^j) \in \mathbb{R}^{D}$ and selects the prompt $\bm{p^{*}}$ that maximizes the cosine similarity with the query:

$p^{*} = \underset{p_k \in P}{\mathrm{argmax}}\ \frac{\langle f_{\text{query}}(x_i^j), p_k \rangle}{\| f_{\text{query}}(x_i^j) \| \, \| p_k \|} \qquad p_{\text{efficient}}^{*} = \underset{p_k \in P}{\mathrm{argmax}}\ \frac{\langle \phi(q_{\text{efficient}}(x_i^j)), p_k \rangle}{\| \phi(q_{\text{efficient}}(x_i^j)) \| \, \| p_k \|}$ (1)

Although $\bm{f_{\text{query}}}$ performs only model inference, it is typically equivalent in size to the primary backbone model and can therefore incur a noticeable increase in computation time and memory usage, especially when implemented with a large transformer model. To mitigate these costs, we adopt a more resource-efficient model $\bm{f_{\text{efficient}}}$, which has a smaller depth and width than $\bm{f_{\text{query}}}$.

In our preliminary study, we observe that decreasing only one of these dimensions (depth or width) can easily compromise model performance. So, we opt for a small pre-trained ViT model, ViT-Ti, as $\bm{f_{\text{efficient}}}$, which we have found proficient in extracting essential representations without excessively cutting down on depth or width. When $\bm{f_{\text{efficient}}}$ generates a query feature $\bm{q_{\text{efficient}}}(x_i^j) \in \mathbb{R}^{d}$ (where $d \leq D$), we apply a nonlinear random projection $\bm{\phi}$ [47] to maintain the integrity of high-dimensional features in the selected prompt $\bm{p_{\text{efficient}}^{*}}$, as given in Eq. 1.

In pursuing system efficiency at this stage, a more comprehensive strategy would be to create a range of downsized models and select the most compact $\bm{f_{\text{efficient}}}$ that still achieves a high level of prompt overlap with $\bm{f_{\text{query}}}$. Our choice of ViT-Ti as $\bm{f_{\text{efficient}}}$ meets this criterion; e.g., in Split ImageNet-R, ViT-Ti achieves a prompt overlap of 76.3% on average with ViT-L, the largest $\bm{f_{\text{query}}}$ in consideration, based on canonical-correlation analysis (CCA), and almost identical accuracy (Tab. 2).
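The selection rule in Eq. 1 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the Gaussian weights and the ReLU nonlinearity inside `make_random_projection` are assumptions (the text only states that the projection is nonlinear and random), and the function names and toy vectors are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def make_random_projection(d, D, seed=0):
    """Return phi: R^d -> R^D built from a fixed random matrix followed by ReLU.

    Assumption: Gaussian weights and ReLU are illustrative choices only.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]

    def phi(q):
        # Project the d-dimensional query up to D dimensions, then apply ReLU.
        return [max(0.0, sum(w * x for w, x in zip(row, q))) for row in weights]

    return phi


def select_prompt(query, prompt_pool):
    """Index of the prompt in the pool with the highest cosine similarity to the query."""
    scores = [cosine(query, p) for p in prompt_pool]
    return max(range(len(prompt_pool)), key=scores.__getitem__)


# Toy usage: a 4-dimensional "efficient" query lifted to the 8-dimensional prompt space.
phi = make_random_projection(d=4, D=8)
lifted_query = phi([0.2, -0.1, 0.7, 0.3])
toy_pool = [[1.0] * 8, [0.5, 0.0] * 4, [0.0, 1.0] * 4]
chosen = select_prompt(lifted_query, toy_pool)
```

In this sketch the projection is sampled once and then kept fixed, so the same query always maps to the same point in the prompt space.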

4.2 Prompt Update

The prompt update stage aims to refine learnable parameters for effective training and task adaptability by combining the softmax cross-entropy loss of the classifier $\bm{L_{class}}$ and the cosine similarity loss of the prompt $\bm{L_{prompt}}$. The combined loss function $\bm{L}$ is expressed as:

$L = L_{class}\left(f_{\text{update}}(x_i^j), y_i^j\right) + \lambda L_{prompt}\left(p^{*}, q(x_i^j)\right),$ (2)

where $\bm{f_{\text{update}}}$ denotes the backbone model, and $\lambda$ is a regularization constant.
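As a rough illustration of how the two terms of Eq. 2 combine, the sketch below evaluates the loss for one sample in plain Python. Two points are assumptions rather than details from the text: the prompt term is taken to be one minus the cosine similarity, and the value `lam=0.1` is only a placeholder for $\lambda$. The function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def cosine_distance(u, v):
    """One minus cosine similarity (assumed form of the prompt loss)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def combined_loss(logits, label, prompt, query, lam=0.1):
    """L = L_class + lam * L_prompt, following the structure of Eq. 2."""
    return softmax_cross_entropy(logits, label) + lam * cosine_distance(prompt, query)


# Toy usage: three-class logits, a selected prompt, and a query feature.
loss = combined_loss([2.0, 0.5, -1.0], 0, [0.6, 0.8], [1.0, 0.0])
```

A larger `lam` puts more weight on pulling the selected prompt toward the query feature relative to the classification objective.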

Figure 4: Mean attention distances for frozen blocks along (a) layer IDs and (b/c) attention heads. We run the first task of Split ImageNet-R. (a/b) L2P with ViT-L, and (c) DualPrompt with ViT-B.

Unveiling System-Efficiency Opportunities. Apart from the learnable parameters, the remaining parameters belong to a number of frozen transformer blocks. This raises the question:

Are all frozen blocks useful for executing the loss function $\bm{L}$, or can we skip some computations to save energy and memory usage?

To answer this, we explore how the mean distance of attention scores in the self-attention mechanisms of ViT models changes when training a new task. The attention distance has been widely used to analyze layer representation during the training of new models or the fine-tuning of pre-trained models [48, 49]. In our study, we repurpose this concept to examine the behavior of frozen layers during data drift. Specifically, for a given head in an attention layer within frozen blocks, if the query patch position is $(x_0, y_0)$ and it attends to positions $(x_i, y_i)$ with weights $a_i$, the distance $d_i$ (typically Euclidean or pixel distance) becomes $d_i = (x_i - x_0)^2 + (y_i - y_0)^2$. We then compute the weighted average distance across the positions as $\frac{\sum_i a_i \cdot d_i}{\sum_i a_i}$. Lower distances indicate local “task-specific” information, whereas higher distances imply global “general” information. We measure such mean attention distances along two dimensions: by layer ID and by attention head. Fig. 4 shows these results when handling the first task of Split ImageNet-R.
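The weighted average described above can be sketched for a single query patch and a single head as follows. Treating the per-position distance as the plain Euclidean distance between patch grid coordinates (via `math.hypot`) is an assumption about the intended form, and the function name and the toy weights are hypothetical.

```python
import math


def mean_attention_distance(query_pos, key_positions, weights):
    """Attention-weighted average distance from one query patch to the attended patches.

    Assumption: distance is Euclidean distance in patch-grid coordinates.
    """
    x0, y0 = query_pos
    distances = [math.hypot(x - x0, y - y0) for x, y in key_positions]
    total_weight = sum(weights)
    return sum(a * d for a, d in zip(weights, distances)) / total_weight


# Toy usage: the same three attended positions under two different attention patterns.
positions = [(0, 0), (3, 4), (6, 8)]  # distances 0, 5, and 10 from the query at (0, 0)
local_head = mean_attention_distance((0, 0), positions, [0.9, 0.1, 0.0])
global_head = mean_attention_distance((0, 0), positions, [0.1, 0.2, 0.7])
```

Here `local_head` comes out far smaller than `global_head`, matching the reading that low values indicate attention concentrated near the query patch.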

In Fig. 4(a), we present analysis results for layer ID using L2P with ViT-L. Clearly, lower layers exhibit highly dispersed attention distances, indicating that they capture task-specific and global information simultaneously. However, as we examine deeper layers, these distances become increasingly concentrated at high values, indicating that deeper layers specialize in capturing global information. Taking a closer look at attention heads in a few selected layers in Fig. 4(b), attention distances across the heads indeed vary more widely in shallow layers than in deeper layers. We observe similar behavior when using DualPrompt with ViT-B in Fig. 4(c). DualPrompt differs from L2P in its use of prompts, as it integrates prompts directly into the input sequence before the attention layer of each transformer block, rather than only into the first block. This suggests that the disparity in mean attention distances across layers may not be greatly affected by the method of leveraging prompts.

Next, we explain how we bring these insights into practice through two compute-skipping techniques that focus more on non-task-specific representations in deeper layers: adaptive token merging and adaptive layer dropping.

Adaptive Token Merging (AToM). The first dimension of skipping is data at the token level via token merging. Conventional token merging (ToMe) [25] reduces the number of tokens by uniformly merging $n$ redundant tokens per layer, controlled by a scheduler $\bm{r}$. The scheduler function $\bm{r}(l) \rightarrow n$ is applied to each layer $l$, and according to [25], it merges all tokens, including prompt tokens. However, insights from Fig. 4 and our additional analysis highlight two major problems.

Figure 5: The gradient norm for the prompt when applying conventional token merging (ToMe) vs AToM. L2P with ViT-L runs on Split ImageNet-R.

First, there is a loss of task-specific information. The prompt tokens in CL carry essential task-specific information. However, ToMe indiscriminately combines these prompt tokens with others, thereby diminishing their intrinsic value. According to our empirical data in Fig. 5, this approach can cause gradient explosions in the prompt tokens, even with gradient clipping, leading to learning instability. Second, there is a lack of layer-specific adaptability. ToMe does not account for the disparity between shallow and deep layers, treating all layers uniformly. Therefore, there is a risk of excessive loss of valuable information from shallow layers, which are mainly responsible for adaptability to diverse sequential tasks.

Our adaptive token merging (AToM; Algorithm 1 in App. A) addresses the loss of task-specific information by excluding prompt tokens during token merging, thus maintaining their specificity. To enhance task-specific adaptability, AToM exploits a new scheduler $\bm{r'}(l) \rightarrow n'$, which dynamically adjusts the number of tokens to merge based on layer depth, as follows:

$$r'(l) = \min(\delta \times (l-1),\ r_{\max}), \tag{3}$$

where $l$ denotes the layer index, $\delta$ is the step change in the number of tokens to merge defined as $\frac{r_{\max}}{L-1}$ (with $L$ being the number of layers), and $r_{\max}$ is the maximum number of tokens to merge (by default, $2 \times n$). With this $r'(l)$, merging occurs more frequently in deeper layers than in shallow layers, preserving important task-related representations.
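As a rough illustration of Eq. (3), the following Python sketch tabulates $r'(l)$ for a 24-layer backbone such as ViT-L. The function names and the choice to keep fractional values (the text does not say how they are rounded) are assumptions made for illustration, not the authors' implementation.

```python
def atom_merge_count(layer: int, num_layers: int, r_max: float) -> float:
    """Eq. (3): r'(l) = min(delta * (l - 1), r_max) with delta = r_max / (L - 1)."""
    if num_layers < 2:
        raise ValueError("num_layers must be at least 2 so that delta is defined")
    if not 1 <= layer <= num_layers:
        raise ValueError("layer must be in 1..num_layers")
    delta = r_max / (num_layers - 1)
    return min(delta * (layer - 1), r_max)


def atom_schedule(num_layers: int, n: int) -> list:
    """Per-layer merge counts using the default r_max = 2 * n from the text."""
    r_max = 2 * n
    return [atom_merge_count(l, num_layers, r_max) for l in range(1, num_layers + 1)]


# Hypothetical setting: 24 layers (ViT-L) and n = 8.
schedule = atom_schedule(num_layers=24, n=8)
print([round(r, 2) for r in schedule])
```

Under this reading, with the default $r_{\max} = 2 \times n$ the per-layer counts sum to $n \times L$, the same total as merging $n$ tokens uniformly in every layer, so the schedule shifts merging toward deeper layers rather than increasing its overall amount.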

Adaptive Layer Dropping (ALD). The second dimension of skipping involves dropping layers. Inspired by insights from Fig. 4 and prior work on progressive layer dropping (PLD), we propose adaptive layer dropping (ALD; Algorithm 2 in App. A) with two key features: (1) the dropping schedule considers both temporal and spatial aspects, and (2) it drops layers non-uniformly to preserve critical task-specific features in shallower layers. In contrast, PLD only considers the temporal aspect and does not differentiate between layers when dropping. This results in poorer performance compared to ALD, as shown in Tab. 3.

ALD prioritizes retaining shallow layers that contain richer information essential for model performance after token merging. Thus, instead of operating on its own schedule parameters, ALD leverages feedback from token merging, specifically the count of merged tokens at each layer, to guide layer dropping decisions. The layer-keeping probability $\theta_{t,l}$ is defined as:

$$\theta_{t,l} = \left((1-\bar{\theta})\exp(-\gamma \cdot t) + \bar{\theta}\right) \times \alpha, \tag{4}$$

where $\theta_{t,l}$ is the probability of keeping layer $l$ at time step $t$, $\bar{\theta}$ is the minimum probability, $\gamma$ controls the rate of decay, and $\alpha$ is the adjustment factor. When the number of merged tokens at a layer surpasses the spatial threshold $\tau$, ALD adjusts the layer-dropping probability based on $\alpha$. Since deeper layers tend to merge more tokens with AToM, ALD is more likely to exceed $\tau$ in deeper layers and drop more aggressively. These parameters should be tuned to balance between efficiency and the need to process the nuanced information contained within the merged tokens. We empirically determine their values.
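The interaction between the temporal decay and the merging feedback can be sketched in Python as below. Eq. (4) writes the factor $\alpha$ unconditionally, whereas the surrounding text and Algorithm 2 apply it only when the merged-token count exceeds $\tau$; the sketch follows the latter reading. The function name and all numeric values are hypothetical and chosen only to show the shape of the schedule.

```python
import math


def ald_keep_probability(t, merged_tokens, theta_bar, gamma, tau, alpha):
    """Keep probability for one layer at training step t."""
    # Temporal part: decays from 1.0 at t = 0 toward theta_bar.
    keep = (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar
    # Spatial part: scale by alpha only when the layer merges more than tau tokens.
    if merged_tokens > tau:
        keep *= alpha
    return keep


# Hypothetical values, not the settings used in the paper.
params = dict(theta_bar=0.5, gamma=0.01, tau=8, alpha=0.9)
for step in (0, 100, 1000):
    shallow = ald_keep_probability(step, merged_tokens=2, **params)
    deep = ald_keep_probability(step, merged_tokens=16, **params)
    print(step, round(shallow, 3), round(deep, 3))
```

With $\alpha < 1$, a layer that merges more than $\tau$ tokens always receives a lower keep probability than a layer below the threshold at the same step, which is the spatial bias toward dropping deeper layers described above.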

One might contemplate exploiting stochastic depth [23], which applies gradually increasing dropping probabilities for deeper layers without a temporal schedule. We compared ALD and stochastic depth and found that ALD generally offers 1.8–3.3% higher accuracy. This may be attributed to early training iterations after task insertion, which play a critical role in stabilizing training losses [9].
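For comparison, the linear-decay rule of stochastic depth [23] can be sketched as follows. Its keep probability depends only on the layer index and has no time component. The function name and the final-layer probability of 0.5 are illustrative choices here, not settings reported in this paper.

```python
def stochastic_depth_keep_prob(layer: int, num_layers: int, p_last: float = 0.5) -> float:
    """Linear decay: the keep probability falls with depth and reaches p_last at the last layer."""
    if not 1 <= layer <= num_layers:
        raise ValueError("layer must be in 1..num_layers")
    return 1.0 - (layer / num_layers) * (1.0 - p_last)


# Hypothetical 24-layer backbone; the values are the same at every training step.
probs = [stochastic_depth_keep_prob(l, 24) for l in range(1, 25)]
print(round(probs[0], 3), round(probs[-1], 3))
```

Because this rule already drops layers at the first iteration after a task is inserted, it differs from a schedule whose keep probability starts at 1.0 and decays over time.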

Implications for Resource Efficiency. REP triggers AToM and ALD for each task insertion to optimize both time and memory usage. Much like our prompt selection optimization, AToM consistently enhances time and memory efficiency. In contrast, ALD solely contributes to reducing time costs because its layer-keeping probability $\theta_{t,l}$ starts at 1.0, i.e., no layer dropping, which necessitates traversing all layers.

5 Experiments of REP

Our experiment setup for REP extends the methodology described in Sec. 3 by including the Split PlantDisease [50] dataset for image classification. Split PlantDisease contains 54,303 plant disease images across 38 categories. Due to its highly imbalanced nature, we drop 3 plant disease categories with very few images and organize the remaining 35 classes into 7 tasks, each with an equal number of classes.
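The task construction amounts to partitioning 35 class IDs into 7 disjoint groups of 5. A minimal Python sketch is given below; the helper name, the placeholder class IDs, and the contiguous ordering are assumptions, since the text does not state which classes are dropped or how the remaining ones are ordered.

```python
def split_into_tasks(class_ids, num_tasks):
    """Partition class IDs into num_tasks disjoint, equally sized groups."""
    if num_tasks <= 0 or len(class_ids) % num_tasks != 0:
        raise ValueError("number of classes must be divisible by num_tasks")
    per_task = len(class_ids) // num_tasks
    return [class_ids[i * per_task:(i + 1) * per_task] for i in range(num_tasks)]


# 38 categories minus the 3 dropped ones leaves 35 classes (IDs here are placeholders).
tasks = split_into_tasks(list(range(35)), num_tasks=7)
print([len(task) for task in tasks])
```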

5.1 REP on Best-Performing Methods

In each graph in Fig. 6, we demonstrate the efficacy of REP when applied to the best-performing CL method for each combination of memory budget and dataset observed in Fig. 2(a). The accuracy is indicated by the stars ($\bigstar$), while the GPU wall-clock time is indicated by the bars ($\blacksquare$). The memory consumption in GB is noted on top of the bars.

Figure 6: REP over the best methods across various memory budgets and datasets.

Resource Efficiency. The best-performing baselines optimized with REP demand significantly smaller amounts of system resources. Specifically, REP reduces memory consumption by 33% for L2P/ViT-L, 11% for DualPrompt/ViT-B, and 5% for DualPrompt/ViT-Ti. Despite the frozen backbone not requiring gradients for parameter updates, it still performs the backward pass to calculate classification and prompt losses for updating learnable parameters. Therefore, all intermediate data need to be stored in memory, similar to training a non-frozen model. This memory usage is reduced by applying AToM and ALD. In addition, REP exhibits a 16–47% reduction in GPU wall-clock time compared to competing methods, thereby leading to great energy savings.

Accuracy. L2P, DualPrompt, and CODA-Prompt were initially evaluated using only the ViT-B backbone, focusing primarily on accuracy in their original studies. In comparison, our evaluation extends these methods across diverse backbones while taking into account their energy-accuracy trade-offs in on-device CL across various memory constraints. When augmented with our techniques, these methods show significant improvements in resource efficiency without considerable loss in accuracy across all memory budgets: 0.09–1.90% for Split PlantDisease, 0.03–1.07% for Split CIFAR-100, and 0.26–0.86% for Split ImageNet-R.

5.2 Ablation and Additional Study

| AToM (w/ $\theta$ = 0.5): # merged tokens ($n$) | Acc. ($\uparrow$) | GPU Time (s) | Mem. (GB) | ALD (w/ $n$ = 8): keep ratio ($\theta$) | Acc. ($\uparrow$) | GPU Time (s) | Mem. (GB) |
|---|---|---|---|---|---|---|---|
| 2 | 76.16±0.74 | 2808.43 | 5.48 | 0.25 | 73.18±0.55 | 2362.07 | 4.45 |
| 4 | 75.32±0.32 | 2714.04 | 5.16 | 0.50 | 75.34±0.86 | 2542.43 | 4.45 |
| 8 | 75.34±0.86 | 2542.43 | 4.45 | 0.75 | 74.44±0.60 | 2861.35 | 4.45 |

Table 1: REP over different # of merged tokens ($n$) and % of keep ratio ($\theta$).

| Ablated components | Acc. ($\uparrow$) | GPU Time (s) | Mem. (GB) |
|---|---|---|---|
| REP (ours) | 75.34±0.86 | 2542.43 | 4.45 |
| (1) $f_{\text{efficient}}$ | 74.79±0.38 | 3698.78 | 5.53 |
| (2) AToM | 74.91±0.77 | 4442.04 | 5.60 |
| (3) ALD | 74.53±0.78 | 4243.97 | 6.67 |
| (4) $f_{\text{efficient}}$ + AToM | 74.15±0.82 | 2861.35 | 4.46 |
| (5) $f_{\text{efficient}}$ + ALD | 74.52±0.43 | 3206.74 | 5.51 |
| (6) AToM + ALD | 74.57±0.37 | 3448.76 | 4.77 |

Table 2: Ablating REP’s components. They contribute to resource efficiency and accuracy.

We validate our proposed techniques and algorithm designs along with the chosen hyperparameters. Refer to App. E for additional results. The study here primarily uses the ViT-L backbone on Split ImageNet-R, where L2P is the best non-optimized method.

Component Ablation. In Tab. 2, we ablate REP’s components to assess their contributions to system efficiency. Using L2P as a baseline, we examine all combinations with our components applied partially. Each component plays a key role in reducing memory usage and compute time, with ALD affecting compute time only, as expected. Notably, AToM substantially improves both memory usage and compute time.

| Method | Acc. ($\uparrow$) | GPU Time (s) | Mem. (GB) |
|---|---|---|---|
| REP (ours) | 75.34±0.86 | 2542.43 | 4.45 |
| (1) w/ ToMe | 70.19±0.74 | 2917.84 | 3.74 |
| (2) w/ PLD | 73.33±0.71 | 2741.90 | 4.46 |

Table 3: Using traditional token merging (ToMe) or layer dropping (PLD) instead of our adaptive algorithms harms model accuracy.

Algorithm Validation. Token merging (ToMe) [25] and progressive layer dropping (PLD) [24] are specifically designed to accelerate traditional transformer-based model training. To validate the importance of incorporating our adaptive techniques instead for CL, we evaluate how REP performs when AToM and ALD are replaced with ToMe and PLD, respectively. The results are presented in Tab. 3. Although applying ToMe or PLD improves system efficiency over L2P, it results in an accuracy loss of 5.15% or 2.01%, respectively, compared to using our techniques. This indicates that ToMe and PLD are less desirable for achieving our goal, i.e., enhancing system efficiency without compromising accuracy.

AToM and ALD Intensity. Tab. 1 shows the effects of varying intensities of AToM and ALD, focusing on the number of merged tokens ($n$) in AToM and the keep ratio ($\theta$) in ALD. The default parameter values used by our system are 8 for $n$ and 0.5 for $\theta$. AToM appears to maintain stable accuracy even as more tokens are merged. In contrast, for ALD, a lower keep ratio improves system efficiency by reducing iteration time and memory usage, but it can markedly impair model accuracy if the ratio is too low. Overall, when used with caution, AToM and ALD can effectively balance system efficiency and accuracy, making them suitable for various resource-limited on-device CL scenarios.

6 Discussion and Conclusion

Since our study focuses on scenarios wherein a sequence of tasks is presented within learning environments, our algorithm does not determine the curriculum of learning. In general, actively selecting such curricula could lead to improved performance in the final average accuracy. Future work should explore extending our model for curriculum-based continual learning. Moreover, although using a replay buffer may not be feasible in real-world scenarios where data privacy matters [14], on-device CL is already viewed as a means to preserve data privacy. Therefore, exploring rehearsal-based REP is a promising direction for future research. A critical question to address is determining the optimal memory space allocation between the replay buffer and model training.

References

  • [1] Young D Kwon, Jagmohan Chauhan, Abhishek Kumar, Pan Hui, and Cecilia Mascolo. Exploring System Performance of Continual Learning for Mobile and Embedded Sensing Applications. In IEEE/ACM SEC, 2021.
  • [2] Tyler L Hayes and Christopher Kanan. Online Continual Learning for Embedded Devices. In CoLLAs, 2022.
  • [3] Yuqing Zhao, Divya Saxena, and Jiannong Cao. Memory-Efficient Domain Incremental Learning for Internet of Things. In SenSys, 2023.
  • [4] Michael McCloskey and Neal J. Cohen. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychol. Learn. Motiv. - Adv. Res. Theory, 1989.
  • [5] Dianlei Xu, Tong Li, Yong Li, Xiang Su, Sasu Tarkoma, Tao Jiang, Jon Crowcroft, and Pan Hui. Edge Intelligence: Architectures, Challenges, and Applications. arXiv preprint arXiv:2003.12172, 2020.
  • [6] Jaeheon Kwak, Sunjae Lee, Dae R Jeong, Arjun Kumar, Dongjae Shin, Ilju Kim, Donghwa Shin, Kilho Lee, Jinkyu Lee, and Insik Shin. MixMax: Leveraging Heterogeneous Batteries to Alleviate Low Battery Experience for Mobile Users. In MobiSys, 2023.
  • [7] Seunghyeok Jeon, Yonghun Choi, Yeonwoo Cho, and Hojung Cha. HarvNet: Resource-Optimized Operation of Multi-Exit Deep Neural Networks on Energy Harvesting Devices. In MobiSys, 2023.
  • [8] Saad Ahmed, Bashima Islam, Kasim Sinan Yildirim, Marco Zimmerling, Przemysław Pawełczak, Muhammad Hamad Alizai, Brandon Lucia, Luca Mottola, Jacob Sorber, and Josiah Hester. The Internet of Batteryless Things. CACM, 67(3):64–73, 2024.
  • [9] Xinyue Ma, Suyeon Jeong, Minjia Zhang, Di Wang, Jonghyun Choi, and Myeongjae Jeon. Cost-effective On-device Continual Learning over Memory Hierarchy with Miro. In MobiCom, 2023.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016.
  • [11] Andreas Peter Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. TMLR, 2022.
  • [12] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers. ICCV, 2021.
  • [13] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning Robust Visual Features without Supervision. TMLR, 2024.
  • [14] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning To Prompt for Continual Learning. In CVPR, 2022.
  • [15] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning. In ECCV, 2022.
  • [16] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-Prompts Learning with Pre-trained Transformers: An Occam’s Razor for Domain Incremental Learning. In NeurIPS, 2022.
  • [17] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning. In CVPR, 2023.
  • [18] Dahuin Jung, Dongyoon Han, Jihwan Bang, and Hwanjun Song. Generating Instance-level Prompts for Rehearsal-free Continual Learning. In ICCV, 2023.
  • [19] Da-Wei Zhou, Qi-Wei Wang, Han-Jia Ye, and De-Chuan Zhan. A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning. In ICLR, 2023.
  • [20] CUDA Semantics: Memory Management. https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management.
  • [21] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast Inference from Transformers via Speculative Decoding. In ICML, 2023.
  • [22] Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via Proxy: Efficient Data Selection for Deep Learning. In ICLR, 2020.
  • [23] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep Networks with Stochastic Depth. In ECCV, 2016.
  • [24] Minjia Zhang and Yuxiong He. Accelerating Training of Transformer-Based Language Models with Progressive Layer Drop**. In NeurIPS, 2020.
  • [25] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token Merging: Your ViT But Faster. In ICLR, 2023.
  • [26] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In NeurIPS, 2021.
  • [27] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning. In NeurIPS, 2022.
  • [28] Jetson TX2 Module. https://developer.nvidia.com/embedded/jetson-tx2.
  • [29] Jetson Nano Module. https://developer.nvidia.com/embedded/jetson-nano.
  • [30] Alexander Gepperth and Barbara Hammer. Incremental Learning Algorithms and Applications. In ESANN, 2016.
  • [31] Jihwan Bang, Heesu Kim, YoungJoon Yoo, Jung-Woo Ha, and Jonghyun Choi. Rainbow Memory: Continual Learning with a Memory of Diverse Samples. In CVPR, 2021.
  • [32] Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Cordelia Schmid, and Karteek Alahari. End-to-End Incremental Learning. In ECCV, 2018.
  • [33] Arslan Chaudhry, Puneet K. Dokania, Thalaiyasingam Ajanthan, and Philip H. S. Torr. Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence. In ECCV, 2018.
  • [34] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental Classifier and Representation Learning. In CVPR, 2017.
  • [35] Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. GDumb: A Simple Approach that Questions Our Progress in Continual Learning. In ECCV, 2020.
  • [36] Ameya Prabhu, Hasan Abed Al Kader Hammoud, Puneet K. Dokania, Philip H.S. Torr, Ser-Nam Lim, Bernard Ghanem, and Adel Bibi. Computationally Budgeted Continual Learning: What Does Matter? In CVPR, 2023.
  • [37] Soobee Lee, Minindu Weerakoon, Jonghyun Choi, Minjia Zhang, Di Wang, and Myeongjae Jeon. CarM: Hierarchical Episodic Memory for Continual Learning. In DAC, 2022.
  • [38] In Gim and JeongGil Ko. Memory-Efficient DNN Training on Mobile Devices. In MobiSys, 2022.
  • [39] Qipeng Wang, Mengwei Xu, Chao Jin, Xinran Dong, Jinliang Yuan, Xin Jin, Gang Huang, Yunxin Liu, and Xuanzhe Liu. Melon: Breaking the Memory Wall for Resource-Efficient on-Device Machine Learning. In MobiSys, 2022.
  • [40] Yue Wang, Ziyu Jiang, Xiaohan Chen, Pengfei Xu, Yang Zhao, Yingyan Lin, and Zhangyang Wang. E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings. In NeurIPS, 2019.
  • [41] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
  • [42] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. In ICCV, 2021.
  • [43] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On Tiny Episodic Memories in Continual Learning. arXiv:1902.10486, 2019.
  • [44] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large Scale Incremental Learning. In CVPR, 2019.
  • [45] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark Experience for General Continual Learning: a Strong, Simple Baseline. In NeurIPS, 2020.
  • [46] Gangmuk Lim, Jeongseob Ahn, Wencong Xiao, Youngjin Kwon, and Myeongjae Jeon. Zico: Efficient GPU Memory Sharing for Concurrent DNN Training. In USENIX ATC, pages 161–175, 2021.
  • [47] Mark D McDonnell, Dong Gong, Amin Parvaneh, Ehsan Abbasnejad, and Anton van den Hengel. Random Projection in Deep Neural Networks. NeurIPS, 36, 2024.
  • [48] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? In NeurIPS, 2021.
  • [49] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • [50] Sharada Prasanna Mohanty, David Hughes, and Marcel Salathe. Using deep learning for image-based plant disease detection, 2016.
  • [51] L2P PyTorch Implementation. https://github.com/JH-LEE-KR/l2p-pytorch.
  • [52] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
  • [53] DualPrompt PyTorch Implementation. https://github.com/JH-LEE-KR/dualprompt-pytorch.
  • [54] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Caltech-ucsd birds 200. Technical Report CNS-TR-2011-001, Caltech, 2011.
  • [55] OEM PRODUCT DESIGN GUIDE: NVIDIA Jetson TX2. https://usermanual.wiki/Pdf/jetsontx2oemproductdesignguide.2134990230.pdf.
  • [56] Convenient Power Measurements on the Jetson TX2/Tegra X2 Board. https://embeddeddl.wordpress.com/2018/04/25/convenient-power-measurements-on-the-jetson-tx2-tegra-x2-board/.
  • [57] Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need, 2023.

Appendix A Algorithm Details

Algorithm 1: Adaptive token merging (AToM)
Input: $T$: Initial set of all tokens, $r'$: Adaptive merging schedule function, $P$: Set of prompt tokens, $L$: Number of layers in the model
Output: $T'_{\text{final}}$: Set of tokens after processing all layers
$T'_{\text{final}} \leftarrow T$  ▷ Initialize with the initial token set
for $l \in \{1, 2, \dots, L\}$ do
    $T_{\text{attn}} \leftarrow \text{MSA}(T'_{\text{final}})$  ▷ Self-attention for layer $l$
    $T_{\text{eligible}} \leftarrow T_{\text{attn}} \setminus P$  ▷ Exclude prompt tokens
    $n' \leftarrow r'(l)$  ▷ Adaptive token merging at layer $l$
    $T'_{\text{merged}} \leftarrow \text{Merge}(T_{\text{eligible}}, n')$
    $T_{\text{concat}} \leftarrow \text{Concat}(T'_{\text{merged}}, P)$  ▷ Concatenate merged tokens with prompt tokens
    $T'_{\text{final}} \leftarrow \text{MLP}(T_{\text{concat}}, l)$  ▷ Apply MLP to concatenated tokens
end for
return $T'_{\text{final}}$
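The part of Algorithm 1 that is specific to AToM (prompt tokens bypass merging and are concatenated back afterwards) can be approximated in a few lines of Python. In this sketch the `Merge` step is a deliberately simple stand-in that greedily averages the most similar pair of tokens by cosine similarity, not the bipartite soft matching of ToMe [25]; tokens are plain lists of floats, the attention and MLP steps are omitted, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_most_similar(tokens, n_merge):
    """Simplified Merge: repeatedly average the most similar pair of tokens."""
    tokens = [list(tok) for tok in tokens]
    for _ in range(min(n_merge, len(tokens) - 1)):
        best = None
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                score = cosine(tokens[i], tokens[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        tokens[i] = [(x + y) / 2.0 for x, y in zip(tokens[i], tokens[j])]
        del tokens[j]
    return tokens


def atom_merge_step(prompt_tokens, other_tokens, n_merge):
    """Merge only non-prompt tokens, then concatenate the untouched prompts."""
    merged = merge_most_similar(other_tokens, n_merge)
    return merged + [list(p) for p in prompt_tokens]


prompts = [[1.0, 0.0]]
patches = [[0.0, 1.0], [0.0, 1.1], [1.0, 1.0], [1.0, 0.1]]
out = atom_merge_step(prompts, patches, n_merge=2)
print(len(out), out[-1])
```

Whatever merging rule is substituted for the stand-in, the prompt tokens leave the step unchanged, which is the property the algorithm relies on to keep task-specific information intact.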
Algorithm 2: Adaptive layer dropping (ALD)
Input: $X$: Input tensor, $\theta_{t,l}$: Keep ratio of layer, $\bar{\theta}$: Minimum ratio, $\gamma$: Decay rate, $\tau$: Spatial threshold, $\alpha$: Adjustment factor, $L$: Number of layers
Output: $X_{\text{out}}$: Processed output for transformer layers
for $l \in \{1, 2, \dots, L\}$ do
    $\theta_{t,l} \leftarrow (1-\bar{\theta})\exp(-\gamma \cdot t) + \bar{\theta}$  ▷ Current keep ratio
    if $r'(l) > \tau$ then
        $\theta_{t,l} \leftarrow \theta_{t,l} \times \alpha$  ▷ Adjust based on merging feedback
    end if
    if $\text{Bernoulli}(\theta_t) = 1$ then
        $X_{\text{out}} \leftarrow \text{Exec}(l, X_{\text{out}})$  ▷ Layer $l$ is executed
    else
        $X_{\text{out}} \leftarrow X_{\text{out}}$  ▷ Layer $l$ is dropped
    end if
end for
return $X_{\text{out}}$ for each layer $l$
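The execute-or-skip loop of Algorithm 2 can be sketched in Python as follows, assuming the per-layer keep probabilities have already been computed. Layers are modelled as plain callables, the output is assumed to start from the input $X$, and the function and parameter names are hypothetical.

```python
import random


def run_with_layer_dropping(x, layers, keep_probs, rng):
    """Run each layer with its keep probability; a dropped layer passes x through."""
    out = x
    executed = []
    for index, (layer, keep) in enumerate(zip(layers, keep_probs), start=1):
        if rng.random() < keep:  # plays the role of Bernoulli(keep) = 1
            out = layer(out)
            executed.append(index)
        # Otherwise the layer is dropped and `out` is left unchanged.
    return out, executed


# Toy layers that add 1, so the output counts how many layers were executed.
layers = [lambda v: v + 1 for _ in range(4)]
rng = random.Random(0)
out_all, ran_all = run_with_layer_dropping(0, layers, [1.0] * 4, rng)
out_none, ran_none = run_with_layer_dropping(0, layers, [0.0] * 4, rng)
print(out_all, ran_all, out_none, ran_none)
```

Since `rng.random()` returns values in the half-open interval from 0.0 to 1.0, a keep probability of 1.0 always executes the layer and 0.0 always drops it, matching the two extremes of the schedule.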

Appendix B Implementation Details

Unless otherwise stated, we use the term REP–L to denote REP applied to L2P with a ViT-L backbone, which stands out as the best-performing baseline method for Split CIFAR-100 and Split ImageNet-R within the 4–8GB memory range. Similarly, the notations L, B, and Ti are used to generally denote ViT-L, ViT-B, and ViT-Ti backbones, respectively, in the prompt update stage of any prompt-based method.

B.1 Prompt-Based Methods for REP

Here are details on the three major prompt-based methods to which REP is integrated for resource efficiency.

Learning to prompt (L2P) [14] positions prompts at the first layer of the transformer architecture. These prompts are learnable parameters, which dynamically evolve with the training process. The mechanism begins with a prompt pool $P$ that contains a set of distinct prompts $\{p_1, p_2, \dots, p_m\} \subset \mathbb{R}^{D}$.

For a given input $x_i^j$ in task $T_i$, L2P calculates a query feature $q(x_i^j) \in \mathbb{R}^{D}$ and selects the corresponding prompt. The prompt $p^{*}$ is selected based on maximizing the cosine similarity with respect to the query:

$$p^{*} = \underset{p_k \in P}{\mathrm{argmax}}\ \frac{\langle q(x_i^j), p_k \rangle}{\|q(x_i^j)\| \, \|p_k\|}. \tag{5}$$

The number of the selected prompts is a learnable parameter.
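The selection rule in Eq. (5) can be illustrated with a short Python sketch that ranks prompts by cosine similarity to the query and returns the best-matching indices. It follows Eq. (5) as written, comparing the query with the prompt vectors directly; the function names, the `top_k` argument, and the toy vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_prompts(query, prompt_pool, top_k=1):
    """Return indices of the top_k prompts ranked by cosine similarity to the query."""
    ranked = sorted(
        range(len(prompt_pool)),
        key=lambda k: cosine_similarity(query, prompt_pool[k]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy pool of three 2-dimensional prompts and one query feature.
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.9, 0.1]
print(select_prompts(query, pool, top_k=1))
```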

The selected prompt $p^{*}$ is concatenated with the input embedding $e(x_i^j)$ to form the prompt-augmented input $e'(x_i^j) = [p^{*}; e(x_i^j)]$. The training objective of L2P balances classification loss $L_{class}$ and prompt-adjustment loss $L_{prompt}$:

$$L = L_{class}\left(e'(x_i^j), y_i^j\right) + \lambda L_{prompt}\left(p^{*}, q(x_i^j)\right), \tag{6}$$

where $L_{class}$ and $L_{prompt}$ employ soft-max cross-entropy and cosine similarity loss, respectively, and the regularization parameter $\lambda$ is set to 0.1 following [51].
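A minimal numeric sketch of Eq. (6) in Python is given below. It assumes that the classifier output is available as a list of logits and that the cosine similarity loss takes the form $1 - \cos(p^{*}, q)$; both are illustrative assumptions, as are the function names.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def cosine_distance(a, b):
    """Assumed form of the prompt loss: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2p_loss(logits, label, prompt, query, lam=0.1):
    """L = L_class + lambda * L_prompt, with lambda = 0.1 as stated in the text."""
    return softmax_cross_entropy(logits, label) + lam * cosine_distance(prompt, query)


# Toy values: two classes with equal logits, prompt orthogonal to the query.
loss = l2p_loss([0.0, 0.0], 0, [1.0, 0.0], [0.0, 1.0])
print(round(loss, 4))
```

With $\lambda = 0.1$, the prompt term contributes at most 0.2 to the total under this assumed form, so the classification loss dominates the objective.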

DualPrompt [15] leverages prompts at multiple layers of the transformer architecture. It introduces a general prompt $g \in \mathbb{R}^{L_g \times D}$ and a set of task-specific prompts $E = \{e_1, e_2, \ldots, e_T\} \subset \mathbb{R}^{L_e \times D}$. These prompts are incorporated at specified layers in the transformer model.

For a given input $x_i^j$ in task $T_i$, the model's transformer layers $f$ are modified by attaching $g$ and $e_i$ to the layers, resulting in a prompted architecture $f_{g,e_i}$. The feature transformation $h_i^j$ for the input sample $x_i^j$ is then obtained as:

$$h_i^j = f_{g,e_i}(x_i^j). \tag{7}$$

Similar to L2P, DualPrompt optimizes $L_{class}$ and $L_{prompt}$:

$$L = L_{class}(h_i^j, y_i^j) + L_{prompt}(g, e_i). \tag{8}$$

CODA-Prompt [17] decomposes learnable prompts into components and uses an attention mechanism from a pre-trained ViT model to select relevant prompts. Instead of a single prompt, CODA-Prompt learns a set of prompt components $P = \{\bm{P_1}, \bm{P_2}, \ldots, \bm{P_M}\} \subset \mathbb{R}^{L_p \times D \times M}$. The final prompt $\bm{p}$ is a weighted sum:

$$\bm{p} = \sum_{m} \alpha_m \bm{P_m}, \tag{9}$$

where weights $\alpha$ are based on the query $q(x)$ and keys $K \in \mathbb{R}^{D \times M}$:

$$\alpha = \gamma(q(x), K). \tag{10}$$
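Eqs. (9) and (10) can be sketched as below. This is an illustration only: $\gamma$ is assumed to be the cosine similarity between the query and each key, the attention vector $A$ is omitted, and `assemble_prompt` is a hypothetical name. Keys are stored as a list of $M$ vectors of dimension $D$, and each component as an $L_p \times D$ nested list.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assemble_prompt(query, keys, components):
    """Return (alpha, p) where p = sum_m alpha[m] * components[m]."""
    # Assumption: gamma in Eq. (10) is cosine similarity per key.
    alpha = [cosine_similarity(query, k) for k in keys]
    length, dim = len(components[0]), len(components[0][0])
    prompt = [[0.0] * dim for _ in range(length)]
    for weight, comp in zip(alpha, components):
        for i in range(length):
            for j in range(dim):
                prompt[i][j] += weight * comp[i][j]  # weighted sum of Eq. (9)
    return alpha, prompt
```

A component whose key is orthogonal to the query receives zero weight and therefore does not contribute to the final prompt.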

When the task changes, the current components are frozen and new ones are added, with orthogonality encouraged by the following loss:

$$L_{\text{ortho}}(B) = \|BB^{\top} - I\|^{2}, \tag{11}$$

where $B$ represents $P$, $K$, or $A$ ($A$ represents the attention vector). The full optimization target is:

$$\min_{P^n, K^n, A^n, \phi^n} L(f_{\phi}(f_{\theta,P,K,A}(x)), y) + \lambda (L_{\text{ortho}}(P) + L_{\text{ortho}}(K) + L_{\text{ortho}}(A)), \tag{12}$$

where $P^n$, $K^n$, and $A^n$ are new components, and $\lambda$ balances the orthogonality loss.
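The orthogonality penalty of Eq. (11) can be illustrated with a small sketch. It assumes that the norm is the Frobenius norm and that each row of $B$ holds one (flattened) component; `ortho_loss` is a hypothetical helper name.

```python
def ortho_loss(B):
    """Squared Frobenius norm of B B^T - I for a row-major matrix B."""
    n = len(B)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Entry (i, j) of the Gram matrix B B^T.
            gram = sum(a * b for a, b in zip(B[i], B[j]))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return total
```

Orthonormal rows give a loss of zero, whereas two identical unit rows are penalized through the off-diagonal entries of $BB^{\top}$.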

B.2 Hyperparameters and Configurations

For all prompt-based CL methods, we employ the Adam optimizer [52] with $\beta_1 = 0.99$ and $\beta_2 = 0.999$. In contrast, BudgetCL and MEMO adopt the stochastic gradient descent (SGD) optimizer with a momentum coefficient of 0.9 and a weight decay of 0.002, a configuration validated as effective by similar replay buffer-based methods [14]. In REP, we set the prompt pool size to 10 and the prompt length to 5, following the original implementations of L2P and DualPrompt [51, 53]. The learning rate for the prompt-based methods is set to 0.001875. For CODA-Prompt, we implement a cosine decay learning rate strategy, as outlined in its original paper [17]. For BudgetCL and MEMO, the initial learning rate is 0.1 and decays by a factor of 0.1 at epochs 80 and 150, following the original paper [19].
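The step schedule described for BudgetCL and MEMO can be sketched as follows. The function name is hypothetical, and the sketch assumes 0-indexed epochs with the decay taking effect at the start of epochs 80 and 150.

```python
def step_lr(epoch, base_lr=0.1, milestones=(80, 150), gamma=0.1):
    """Learning rate after multiplying by gamma at each milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed
```

Under these assumptions, the learning rate is 0.1 for epochs 0 to 79, about 0.01 for epochs 80 to 149, and about 0.001 from epoch 150 onward.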

We consistently use a mini-batch size of 16 for all methods to maintain a uniform computational load for each training iteration. For fair comparisons, each method performs 1,875, 1,080, and 2,583 iterations per task insertion for Split CIFAR-100, Split ImageNet-R, and Split PlantDisease, respectively. This setup ensures that, among prompt-based methods, iteration time correlates linearly with energy usage, since training wall-clock time reflects energy consumption, as discussed in the main paper.

BudgetCL and MEMO select 8 samples from the new task and another 8 samples from old tasks in the replay buffer to form each 16-sample mini-batch. In contrast to prompt-based methods, we adopt the setup in [19] and train for 200 epochs on all datasets. Thus, BudgetCL and MEMO perform more iterations per epoch with a bigger replay buffer, translating into longer GPU time and more energy usage.
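The mini-batch composition described above can be sketched as below. The text does not specify how samples are selected, so uniform sampling without replacement is an assumption, and `mixed_batch` is a hypothetical helper.

```python
import random

def mixed_batch(new_samples, replay_buffer, per_source=8, rng=random):
    """Build one mini-batch: per_source new-task samples plus per_source replayed samples."""
    # Assumption: both halves are drawn uniformly without replacement.
    new_part = rng.sample(new_samples, per_source)
    old_part = rng.sample(replay_buffer, per_source)
    return new_part + old_part
```

With the default `per_source=8`, each call returns 16 samples, matching the mini-batch size used for all methods.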

Appendix C Competing Methods with Original Setups

| Method | Split CIFAR-100 Acc. (↑) | Split CIFAR-100 GPU Time (s) | Split ImageNet-R Acc. (↑) | Split ImageNet-R GPU Time (s) | Mem. (GB) |
|---|---|---|---|---|---|
| REP–L (ours) | 86.89±0.57 | 4420.99 | 75.34±0.16 | 2542.43 | 4.45 |
| L2P–B | 84.36±0.68 | 2802.42 | 59.66±0.81 | 1609.88 | 14.20 |
| DualPrompt–B | 85.27±0.75 | 2624.06 | 67.45±0.88 | 1507.42 | 12.30 |
| CODA-Prompt–B | 86.09±0.73 | 3226.58 | 75.52±0.74 | 1853.54 | 18.83 |

Table 4: Competing methods with original setups (based on ViT-B backbone).

For the evaluation of L2P, DualPrompt, and CODA-Prompt in the main paper, we adhere to the hyperparameters specified in the original studies, except for the number of iterations per task and batch size. We adopt a smaller batch size to account for on-device memory limitations, which prevent the use of the original batch sizes. Nonetheless, we present the results of these three methods using their original hyperparameters in Tab. 4. Although these methods are based on the ViT-B backbone, they consume significantly more memory than REP–L (with the ViT-L backbone), primarily due to their originally large batch sizes.

Appendix D Additional Datasets

| Method | Split CUB-200 Acc. (↑) | Split CUB-200 Fgt. (↓) | GPU Time (s) | Mem. (GB) |
|---|---|---|---|---|
| REP–L (ours) | 74.70±1.04 | 6.67±0.34 | 529.67 | 4.45 |
| L2P–L | 74.70±1.02 | 6.68±1.35 | 1005.57 | 6.67 |
| DualPrompt–L | 72.35±0.98 | 7.82±0.31 | 953.42 | 5.93 |
| CODA-Prompt–L | 79.46±0.84 | 5.84±0.99 | 1257.82 | 13.95 |

Table 5: Results on Split CUB-200, based on using ViT-L as the backbone model.

We also expand our experiments on REP by incorporating an additional dataset, Split CUB-200 [54], which is designed for fine-grained image classification. Split CUB-200 contains 5,994 bird images categorized into 200 classes, which are split into 5 CL tasks of 40 classes each. We present the experimental results for all prompt-based methods using ViT-L as the backbone in Tab. 5. REP–L outperforms competing methods in resource efficiency, bringing significant reductions in both computation time and memory usage. Importantly, REP–L is on par with its baseline model L2P in accuracy. Note that while adopting CODA-Prompt to implement REP could deliver better accuracy, its memory usage exceeds all memory ranges considered in Fig. 1(a).
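To make the size of these reductions concrete, the relative savings of REP–L over its baseline L2P–L can be computed directly from the GPU time and memory values in Tab. 5. The helper name below is hypothetical; the numbers are copied from the table.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (1.0 - ours / baseline)

# Values for L2P-L (baseline) and REP-L taken from Tab. 5.
gpu_time_saving = relative_reduction(1005.57, 529.67)
memory_saving = relative_reduction(6.67, 4.45)
print(f"GPU time: {gpu_time_saving:.1f}% lower, memory: {memory_saving:.1f}% lower")
```

On these numbers, REP–L needs roughly 47% less GPU time and 33% less memory than L2P–L on Split CUB-200.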

Appendix E Extended Ablation and Additional Study

Because the main paper discusses the ablation study based on Split ImageNet-R, we present here an extended ablation study of REP using Split CIFAR-100. The results are shown in Tab. 6. Similar to the findings from Split ImageNet-R, the ablation of any component within REP for the Split CIFAR-100 dataset impacts resource efficiency. All components contribute to the reduction of computation time and memory usage, with AToM particularly standing out in both aspects of resource efficiency.

Furthermore, we perform algorithm validation on Split CIFAR-100, for which we present the results in Tab. 7. Consistent with the Split ImageNet-R results, using static token merging (ToMe) or progressive layer dropping (PLD) instead of adaptive token merging (AToM) or adaptive layer dropping (ALD) negatively affects model accuracy on the Split CIFAR-100 dataset. This experiment further corroborates the efficacy of our proposed algorithms in optimizing resource usage without compromising model accuracy in CL scenarios.

| Ablated components | Acc. (↑) | GPU Time (s) | Mem. (GB) |
|---|---|---|---|
| REP–L (ours) | 86.89±0.57 | 4420.99 | 4.45 |
| (1) $f_{\text{efficient}}$ | 86.38±0.43 | 6096.14 | 5.53 |
| (2) AToM | 86.43±0.23 | 7724.21 | 5.60 |
| (3) ALD | 86.79±0.73 | 7379.79 | 6.67 |
| (4) $f_{\text{efficient}}$ + AToM | 85.97±0.54 | 4975.57 | 4.46 |
| (5) $f_{\text{efficient}}$ + ALD | 86.07±0.18 | 5576.16 | 5.51 |
| (6) AToM + ALD | 86.08±0.48 | 5997.01 | 4.77 |

Table 6: Extended component ablation for Split CIFAR-100. Ablating any component of REP results in lower resource efficiency by increasing training time and memory consumption.

| Variant | Acc. (↑) | GPU Time (s) | Mem. (GB) |
|---|---|---|---|
| REP–L (ours) | 86.89±0.57 | 4420.99 | 4.45 |
| (1) w/ ToMe | 83.55±0.14 | 5073.80 | 3.74 |
| (2) w/ PLD | 84.40±0.35 | 4767.86 | 4.46 |

Table 7: Extended algorithm validation for Split CIFAR-100. Changing our adaptive algorithms to conventional static token merging (ToMe) or progressive layer dropping (PLD) harms model accuracy.

Appendix F REP under Varying Compute Budgets

| Method | 313 | 625 | 938 | 1250 | 1563 | 1875 |
|---|---|---|---|---|---|---|
| REP–L (ours) | 84.06 | 85.47 | 86.60 | 86.66 | 86.57 | 86.89 |
| L2P–L | 84.36 | 86.28 | 87.34 | 87.76 | 87.83 | 88.24 |

Table 8: Impact of REP under various computation budgets. Columns give the number of iterations per task.

We compare REP–L with L2P–L across various compute budgets defined by the number of iterations per task. The comparison results using Split CIFAR-100 are shown in Tab. 8. The accuracy of REP–L is comparable to that of L2P–L across all six compute budgets, with the performance gap becoming increasingly marginal at lower compute budgets. We observe that the overall accuracy for both methods does not drop significantly with reduced compute budgets. This is attributed to the fact that Split CIFAR-100 is considered a less complex CL benchmark.
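The per-budget accuracy gap can be read off Tab. 8 with a few lines of code. The snippet below only recomputes differences from the table values; the variable names are illustrative.

```python
iterations = [313, 625, 938, 1250, 1563, 1875]
rep_l = [84.06, 85.47, 86.60, 86.66, 86.57, 86.89]  # REP-L accuracy from Tab. 8
l2p_l = [84.36, 86.28, 87.34, 87.76, 87.83, 88.24]  # L2P-L accuracy from Tab. 8

# Accuracy gap (L2P-L minus REP-L) in percentage points per compute budget.
gaps = [round(b - a, 2) for a, b in zip(rep_l, l2p_l)]
for n, gap in zip(iterations, gaps):
    print(f"{n:>5} iterations: gap = {gap:.2f}")
```

The gap is smallest at 313 iterations (0.30 points) and largest at 1,875 iterations (1.35 points), although it is not strictly monotonic in between.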

Appendix G REP on Edge Device

Miro [9] introduces a dynamic approach for fine-tuning key design parameters of rehearsal-based CL methods to achieve high model accuracy while simultaneously reducing energy costs, i.e., high cost-effectiveness. Miro’s methodology centers on identifying the optimal memory size within the device’s capacity to accommodate both old and new samples for training, thereby maximizing cost-effectiveness. We compare REP with Miro directly on a reference edge device NVIDIA Jetson TX2 [28] (TX2). TX2 is equipped with a 256-core NVIDIA Pascal GPU, 8GB of RAM, and a 32GB eMMC 5.1 drive. The RAM is a single unified memory shared by the CPU and GPU. We report energy usage as joules (J) obtained by multiplying power by time. Power usage is measured by reading built-in sensor values provided by Jetson devices for the GPU, RAM, CPU, and I/O connection [55, 56].
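The energy computation described above can be sketched as follows. This is an illustration under stated assumptions: power readings are taken at a fixed sampling interval, each sample is a mapping from component name to watts, and `energy_joules` is a hypothetical helper. How the sensor values are read from the device is outside the scope of this sketch.

```python
def energy_joules(samples, interval_s):
    """Approximate energy as sum over samples of (total power in W) * (interval in s)."""
    total = 0.0
    for rails in samples:
        # Sum the per-component power (e.g., GPU, RAM, CPU, I/O) for this sample.
        total += sum(rails.values()) * interval_s
    return total

# Hypothetical readings, for illustration only.
readings = [
    {"gpu": 3.0, "cpu": 1.0, "ram": 0.5, "io": 0.5},
    {"gpu": 4.0, "cpu": 1.0, "ram": 0.5, "io": 0.5},
]
print(energy_joules(readings, interval_s=2.0))
```

A shorter sampling interval yields a closer approximation to the true integral of power over time.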

G.1 Power Usage Breakdown

Figure 7: Power breakdown for training a ViT-B model on NVIDIA Jetson TX2.

When implementing CL on edge devices, the majority of power consumption that influences energy usage during the training of a new task comes from GPU operations. Fig. 7 shows power consumption on TX2 across individual system components. Static power is measured when the system is inactive, while dynamic power is measured during the training of a ViT-B model. During training, power usage surges up to 6.5×, with the GPU contributing 60% to the dynamic power.

G.2 REP vs Miro

In this experiment, we explore energy-accuracy trade-offs for REP–L and Miro on TX2. Fig. 8 visually represents the comparison results. In each graph, the x-axis signifies total energy usage (lower is better), while the y-axis signifies final average accuracy (higher is better). Hence, a more cost-effective strategy is positioned closer to the upper-left corner of the graph. Regarding Miro, we maintain the hyperparameters as suggested in the original work [9] but incorporate the use of pre-trained ResNet-18 (RN18; 11M), ResNet-34 (RN34; 22M), and ResNet-50 (RN50; 25M) [10] models instead of non-pre-trained ones to enhance overall performance. For REP–L, we vary the number of training iterations per task insertion to match the energy usage of Miro variants with different ResNet models.

Figure 8: Energy-accuracy trade-offs for Split CIFAR-100 and Split ImageNet-R between REP–L and Miro. To provide a spectrum of energy-accuracy trade-offs, we use pre-trained ResNet-18 (RN18), ResNet-34 (RN34), and ResNet-50 (RN50) for Miro.

When operating within the same energy budget, REP–L consistently outperforms Miro variants, achieving 22–33% higher accuracy across datasets. The cost-effectiveness advantage of REP–L is more pronounced on Split ImageNet-R than on Split CIFAR-100. Both REP–L and Miro are memory-efficient in that they fit comfortably within the on-device memory capacity. Despite exploring larger ResNet models for Miro, we do not observe accuracy levels comparable to those REP–L achieves with lower energy usage on either dataset. This experiment also underscores the importance of specifically optimizing vision transformers for on-device CL scenarios to advance performance boundaries.

Appendix H Diverse Task Sequence Lengths

| Method | 5T Acc. (↑) | 5T Fgt. (↓) | 20T Acc. (↑) | 20T Fgt. (↓) | GPU Time (s) | Mem. (GB) |
|---|---|---|---|---|---|---|
| REP–L (ours) | 88.29±0.85 | 4.98±0.74 | 83.81±0.87 | 5.98±0.21 | 4420.99 | 4.45 |
| L2P–L | 89.27±0.28 | 4.43±0.34 | 84.17±0.55 | 6.03±0.11 | 8393.16 | 6.67 |
| DualPrompt–L | 87.91±0.50 | 4.37±0.19 | 83.76±0.77 | 6.25±0.64 | 7957.83 | 5.93 |
| CODA-Prompt–L | 88.79±0.63 | 3.10±0.22 | 83.98±0.41 | 5.17±0.33 | 8014.18 | 14.63 |

Table 9: Results on Split CIFAR-100 organized as 5 tasks (5T) and 20 tasks (20T).

| Method | 5T Acc. (↑) | 5T Fgt. (↓) | 20T Acc. (↑) | 20T Fgt. (↓) | GPU Time (s) | Mem. (GB) |
|---|---|---|---|---|---|---|
| REP–L (ours) | 77.06±0.78 | 2.04±0.71 | 72.58±0.59 | 4.50±0.54 | 2542.43 | 4.45 |
| L2P–L | 77.54±0.40 | 1.02±0.20 | 72.77±0.19 | 4.70±0.49 | 4826.74 | 6.67 |
| DualPrompt–L | 73.11±0.46 | 2.18±0.47 | 69.49±0.43 | 4.97±0.90 | 4576.39 | 5.93 |
| CODA-Prompt–L | 76.83±0.55 | 1.09±0.11 | 73.38±0.23 | 5.17±0.33 | 4608.79 | 14.63 |

Table 10: Results on Split ImageNet-R organized as 5 tasks (5T) and 20 tasks (20T).

We examine the scalability and robustness of REP using REP–L over two task sequence lengths, 5 and 20 tasks, on the Split CIFAR-100 and Split ImageNet-R datasets. A longer task sequence represents a more challenging scenario where a CL method requires more frequent model updates. The results are shown in Tab. 9 and Tab. 10. We observe that our method consistently preserves both model accuracy and resource efficiency across all cases.

Appendix I Impact of Varying Hyperparameter Values

Guided by the hyperparameter tuning strategies used in DualPrompt [15] and CODA-Prompt [17], we fine-tune key hyperparameters for our ALD approach using REP–L. Our focus is primarily on optimizing the threshold parameter ($\alpha$) and the adjustment factor ($\tau$), which are crucial for ALD's adaptability and efficiency. We conduct a hyperparameter search using cross-validation, for which we designate 20% of our training dataset as a validation set. Fig. 9 summarizes the results. We set AToM's merging parameter ($n$) to 8.
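The search procedure for $\alpha$ and $\tau$ can be sketched as a hold-out split followed by a grid search. This is a sketch only: the function names, the random split, and the candidate values in the usage example are hypothetical, and `evaluate` stands in for training REP–L with a given setting and returning its validation accuracy.

```python
import itertools
import random

def holdout_split(n_samples, val_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out val_fraction of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]  # (train indices, validation indices)

def grid_search(alphas, taus, evaluate):
    """Return the (alpha, tau) pair with the highest validation score."""
    return max(itertools.product(alphas, taus), key=lambda pair: evaluate(*pair))

# Usage with a stand-in scoring function (not a real training run).
train_idx, val_idx = holdout_split(100)
best = grid_search([0.1, 0.5], [1, 2], lambda a, t: -(a - 0.5) ** 2 - (t - 1) ** 2)
```

The hold-out split is fixed by the seed so that every candidate setting is scored on the same validation samples.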

We also explore the impact of varying prompt-related hyperparameters, as detailed in Fig. 10. Note that computational costs increase with larger prompt pool sizes and longer prompt lengths, and higher costs do not always translate into improved accuracy. Based on Fig. 10, we opt for a prompt pool size of 10 and a prompt length of 5 to balance computational efficiency and model accuracy.

Figure 9: Result for ALD’s hyperparameter tuning via cross-validation.
Figure 10: Impact of prompt-related hyperparameters of REP–L. Note that when we increase the pool size and prompt length, there is a notable increase in computation cost.

Appendix J Discussion about Additional Related Works

| Method | Acc. | GPU Time (s) | Mem. (GB) |
|---|---|---|---|
| REP–L (ours) | 72.58 | 2542.43 | 4.45 |
| VPT-Deep–L | 66.70 | 1444.62 | 12.61 |
| VPT-Shallow–L | 64.53 | 1271.69 | 6.89 |
| SSF–L | 72.16 | 1573.70 | 22.09 |
| Adapter–L | 62.05 | 680.05 | 12.01 |

Table 11: REP vs ADAM.

Adapter-based Rehearsal-free Methods. Considering ADAM [57] as a state-of-the-art method for adapter-based CL, we evaluate all its variations proposed in [57] on Split ImageNet-R with 20 tasks and compare their accuracies with REP–L in Tab. 11. ADAM significantly reduces end-to-end training costs using an adapter model. Unlike REP–L, which performs the backward pass to update prompts, ADAM only updates the FC layers without the backward pass. However, training such a model is memory-intensive and often exceeds on-device memory capacity. Also, when implemented with the pre-trained ViT-L model, ADAM’s capability for generalization seems to be somewhat restricted. This observation does not necessarily indicate a fundamental limitation of ADAM but rather underscores the complexities of using it effectively in its current form.

Advantages of REP over NAS. While NAS (Neural Architecture Search) can be integrated into REP to drop some layers, it usually involves searching the model from scratch, which is time-consuming. In contrast, our method applies adaptive schedules directly to an existing pre-trained model, significantly saving computation time. As a result, REP can naturally adapt to device-internal resource conditions without relying on external server resources.

Memory Cost Compared to Online or Few-shot CL. In on-device CL, the memory capacity for storing data from new tasks may be limited. Thus, one might consider adopting online or few-shot CL methods to save memory costs, as they require fewer new-task samples to be maintained in memory.

Offline CL (our setting) typically yields better accuracy with higher resource usage. However, memory allocation for new samples is not a major constraint in offline CL. All prompt-based methods require 155–269MB to store new samples for both Split CIFAR-100 and Split ImageNet-R, accounting for only 3.4–5.9% of REP’s memory usage. Moreover, modern DL frameworks can store data in storage and proactively prefetch upcoming mini-batches for training, virtually eliminating memory concerns for new samples.
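The quoted percentages can be reproduced from the reported numbers. The sketch below assumes that REP's memory usage is the 4.45 GB reported for REP–L in the tables and that 1 GB is counted as 1,024 MB; the helper name is hypothetical.

```python
def share_of_memory(sample_mb, total_gb):
    """Percentage of total memory taken by stored samples (1 GB = 1024 MB assumed)."""
    return 100.0 * sample_mb / (total_gb * 1024)

low = share_of_memory(155, 4.45)   # smaller sample footprint
high = share_of_memory(269, 4.45)  # larger sample footprint
print(f"{low:.1f}% to {high:.1f}% of REP-L's memory usage")
```

Under these assumptions, the result matches the 3.4–5.9% range stated above.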

Appendix K Societal Impacts

Our method broadly promotes efficient AI for robotic agents and surveillance systems. Once edge AI becomes easily deployable using our method, it may be misused for unwanted mass surveillance. Adversaries could then potentially obtain private information such as identity, clothing, and personal attributes (e.g., age and gender). Although our method is not intended to foster such misuse, we currently do not propose safeguards to prevent these problematic scenarios.