REP: Resource-Efficient Prompting for
On-device Continual Learning

Sungho Jeon¹ Xinyue Ma¹ Kwang In Kim² Myeongjae Jeon¹
¹UNIST ²POSTECH
{sungho, xinyuema, mjjeon}@unist.ac.kr
{kimkin}@postech.ac.kr

Abstract

On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical. This is extremely challenging because it must preserve accuracy while learning new tasks with continuously drifting data and maintain both high energy and memory efficiency to be deployable on real-world devices. Typically, a CL method leverages one of two types of backbone networks: CNN or ViT. It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance, making each option attractive only for a single aspect. In this paper, we revisit this comparison while embracing powerful pre-trained ViT models of various sizes, including ViT-Ti (5.8M parameters). Our detailed analysis reveals that many practical options exist today for making ViT-based methods more suitable for on-device CL, even when accuracy, energy, and memory are all considered. To further expand this impact, we introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods. Our key focus is on avoiding catastrophic trade-offs with accuracy while trimming computational and memory costs throughout the training process. We achieve this by exploiting swift prompt selection that enhances input data using a carefully provisioned model, and by develo** two novel algorithms–adaptive token merging (AToM) and adaptive layer drop** (ALD)–that optimize the prompt updating stage. In particular, AToM and ALD perform selective skip** across the data and model-layer dimensions without compromising task-specific features in vision transformer models. Extensive experiments on three image classification datasets validate REP’s superior resource efficiency over current state-of-the-art methods.

1 Introduction

In on-device continual learning (CL), neural network models acquire new knowledge from new data locally, without dependence on remote servers for model training. This new data is typically non-independent and non-identically distributed (non-IID) and forms multiple sequential tasks, where each task may include data that diverges from previously encountered data. A crucial challenge for any CL task, including on-device scenarios [1, 2, 3], is to effectively address catastrophic forgetting [4]. Severe forgetting occurs when a model rapidly loses previously learned knowledge while adapting to new data. It impairs overall model accuracy and user experiences of mobile and edge AI services [5].

At the same time, learning methods in on-device CL must fulfill computational (equivalently, energy) and memory efficiency requirements. Energy is scarce and can be intermittent in small devices deployed in the wild [6, 7, 8], yet it is quickly drained by GPU operations during training [9]. Achieving higher energy efficiency while training a new task in CL invariably enhances device durability, making energy cost a critical soft constraint in edge AI. In contrast, memory efficiency often acts as a hard constraint. Device memory capacity is typically very limited (256MB–8GB) and unified to operate all CPU and GPU programs. Thus, a CL task that exhausts all available memory risks system crashes due to out-of-memory errors. Currently, no silver-bullet hardware solution to these resource constraints exists, as smartphones and IoT devices adhere to small form factors, making it difficult to dramatically augment battery and memory capacity.

Refer to caption — Figure 1: (a) Devices grouped by memory capacity. IoT devices typically have less than 1GB of memory, while high-end edge platforms can offer up to 8GB. ViT- and CNN-based methods with specific backbone models are mapped to the memory capacity they can fit into. (b) A memory breakdown of the methods and backbones corresponding to the 1–4GB memory budget. While it is necessary to reserve memory for replay buffers and model checkpoints in CNN-based methods (MEMO and BudgetCL), feature maps dominate the overall memory usage for all cases. Similar trends are observed in other memory ranges, as depicted in Fig. 2(b).

Then, which existing methods stand out in on-device CL when considering accuracy and system efficiency together? To conduct a thorough analysis, we closely examine CNN- and ViT-based CL methods. CL methods using CNN models, such as ResNet-10/18/34 [10], are known for system efficiency due to their relatively small size (5–22M parameters). Conversely, CL methods employing ViT models excel in model performance, as pre-trained ViT models [11, 12, 13] are renowned for their robust capabilities. Out of various emerging methodologies, we are particularly interested in prompt-based rehearsal-free methods [14, 15, 16, 17, 18]. Prompts are a small set of parameters that progressively learn incoming tasks to combat forgetting. Updates to such “small” prompts are minimal and do not significantly harm the lifespan of device storage ( $\sim$ 10K writes), making them an ideal fit for on-device CL. We rigorously compare the abilities of competing state-of-the-art methods while aligning memory budgets to reflect diverse device specifications, as shown in Fig. 1(a).

We find that ViT-based methods indeed considerably outperform CNN-based ones, with larger backbone networks yielding higher accuracy. However, surprisingly, while small CNNs appear to promise high resource efficiency, their actual memory usage far exceeds expectations. For instance, a recent proposal MEMO [19], when set to use 1.2GB of memory, consumes up to 4.2GB as measured by CUDA APIs [20]—3.5 $\times$ the amount set by MEMO. This disparity arises because training requires substantial memory to stash feature maps for every training iteration and to operate ML runtime (Fig. 1(b)). Importantly, thanks to the recent evolution in small-memory ViT models, such as ViT-Ti [11] (5.8M parameters; similar size to ResNet-10), we can always spot more powerful ViT models that match the system efficiency of the CNN models under consideration. Taking accuracy, energy, and memory all into account, ViT-based methods are observed to be more suitable for on-device CL.

To further empower ViT-based CL for system efficiency, we introduce REP (Resource-Efficient Prompting), built upon our analysis of cost-accuracy trade-offs across the end-to-end learning process. We confirm two key design insights. (1) The prompt selection stage, which constructs a prompt subset to augment input or intermediate data, is highly amenable to numerous promising options with fast approximations [21, 22]. (2) In contrast, the prompt update stage, which involves forward-backward passes over the backbone model, presents a range of optimizations with vastly different cost-accuracy trade-offs [23, 24, 25, 26, 27]. With these insights, we develop several complementary techniques for both stages that trade minimal accuracy for high resource efficiency: (1) model downsizing for prompt selection and (2) adaptive token merging (AToM) and adaptive layer drop** (ALD) for prompt update. In ViT models, task-specific knowledge is concentrated in shallow layers, while generalized feature information is spread across all layers [15] (See Fig. 4 for our analysis). So, AToM and ALD carry out selective skip** across the data and model-layer dimensions in a “non-uniform” manner to preserve critical task-specific features in the shallow layers. By integrating all these contributions, REP achieves a level of resource efficiency compatible with real-world commodity devices, including NVIDIA Jetson TX2 [28] and Nano [29].

2 Related Work

Class-Incremental Learning. Our work primarily targets class-incremental learning (CIL) among various CL scenarios. In CIL, each incoming task introduces new classes within the same domain during the training phase, while task ID is not at hand during the inference phase [30]. In prior art using CNN backbone networks, rehearsal-based methods, which archive and replay representative samples from previous tasks [31, 32, 33, 34], have been a popular choice due to their superior performance [35]. Several recent methods are noteworthy along this line. For instance, BudgetCL [36] and CarM [37] suggest data-driven methods that outperform other complex strategies based on algorithmic optimizations, particularly when training is constrained by compute budgets. MEMO [19] effectively trades memory used to store old samples for saving task-specific model layers. We compare prompt-based methods with some of these methods and confirm better performance under resource constraints.

Continual Learning for the Edge. There has been a notable emphasis on efficient memory and energy usage in on-device learning, particularly in non-CL settings [38, 39, 40]. More recently, a handful of studies have explored extending CL capabilities to edge devices. Hayes et al. [2] investigated online CL methods using CNN models tailored for embedded devices and provided valuable algorithmic insights. Similarly, Kwon et al. [1] undertook a comparative analysis of rehearsal vs. regularization methods to uncover the cost-accuracy trade-offs, including storage, compute, and memory requirements. Ma et al. [9] proposed an ML platform called Miro, which dynamically configures key design parameters of CNN-based CL methods to reduce energy costs within the device’s memory capacity. Unfortunately, none of these studies address the specific challenge of enabling vision transformers for on-device CL, which stands as a main goal of our research.

Prompting for Continual Learning. The use of prompts, which are explicit instructions or queries given to the model during training or inference, has shown promise in guiding continual learning for vision transformers [14, 15, 16, 17, 18]. The general concept is to fine-tune prompts for new tasks while retaining knowledge acquired from previous tasks. L2P [14] optimizes a query function on a shared prompt pool to select the top-N best-matching prompts for input tokens. DualPrompt [15] proposes a dual-prompt approach that utilizes general and expert prompts to express task-invariant and task-specific knowledge, respectively. CODA-Prompt [17] introduces attention-conditioned prompts, which inherently facilitate higher prompting capacity. Our work focuses on prompting techniques that emphasize system efficiency and can be applied to all the above methods.

3 Empirical Analysis

We focus on the popular disjoint-CIL setup, where tasks consist of distinct sets of non-overlap** classes and samples from old classes are never given in subsequent tasks [34, 32, 30].

Data Generation. To organize task streams, we use two image classification datasets: CIFAR-100 [41] (100 classes) and ImageNet-R [42] (200 classes). Compared to CIFAR-100, ImageNet-R is known for exhibiting much higher intra-class variability among images and an uneven class-size distribution. By default, we divide each dataset into 10 tasks to create Split CIFAR-100 (i.e., 10 classes per task) and Split ImageNet-R (i.e., 20 classes per task) [14, 15, 17].

Methods. We implement L2P [14], DualPrompt [15], and CODA-Prompt [17] as representative ViT-based prompting methods, and BudgetCL [43] and MEMO [19] as state-of-the-art CNN-based methods using PyTorch. All prompting methods capitalize on ImageNet pre-trained models as backbones: ViT-L (307M), ViT-B (86M), and ViT-Ti (5.8M)—ViT-Ti is one of the smallest vision transformer models to our knowledge. For CNN-based methods, we adopt pre-trained ResNet-10 (RN10; 5M), ResNet-18 (RN18; 11M), ResNet-26 (RN26; 14M), ResNet-34 (RN34; 22M), and ResNet-101 (RN101; 43M) models. We categorize the prompt-based methods into three groups according to their memory consumption based on Fig. 1(a).

For fair comparisons, we align the memory usage of the CNN-based methods with the best-performing ViT-based method within the same memory range. Both BudgetCL and MEMO leverage spare memory to boost model accuracy by storing and replaying past samples, with MEMO also storing model checkpoints from history. For detailed implementations and hyperparameters, please refer to Sec. B.2.

Hardware and Metrics. Most power consumption that influences energy usage during training comes from GPU operations (see Fig. 7 in Sec. G.1). Therefore, training wall-clock time correlates linearly with energy usage, although energy is more intuitively measured in Joules. Given that not all CL methods are currently feasible on small devices due to memory limitations, we use NVIDIA RTX 3090 Ti as a reference GPU to compare CNN- and ViT-based methods and use the wall-clock time required to complete all tasks as a proxy for energy cost¹¹1In Sec. G.2, we evaluate our memory-efficient method directly on NVIDIA Jetson TX2, although it is applicable beyond this specific device.. We also report the final average accuracy by averaging the accuracy of all classes after the last task training [34, 44, 35, 45, 31], and measure memory usage using CUDA APIs [20].

3.1 Results and Findings

We first uncover the energy-accuracy trade-offs to highlight the impact of each method on the accuracy gained relative to its training cost. Fig. 2(a) visually represents the comparison results across device memory capacities and datasets. In each graph, the x-axis indicates wall-clock time, which corresponds to energy cost (lower is better), while the y-axis indicates final average accuracy (higher is better). Thus, a more cost-effective method appears closer to the upper-left corner of the graph. Overall, ViT-based methods outperform CNN-based methods by a large margin, achieving 26–36% higher accuracy with 45–90% less time and energy spent under the same memory budget.

Why CNN-based Methods Underperform. Although CNN-based methods use smaller backbone models, their limited model capacity and the increasing training costs associated with replaying past samples and model checkpoints make them less competent in terms of accuracy and resource efficiency. We elaborate on two key characteristics that explain these shortcomings in more detail:

$\bullet$ Performance scaling. ViT-based methods scale well, with larger backbone networks yielding higher accuracy. In Fig. 2(a), the accuracy gain is at least 20% when switching from ViT-Ti to ViT-B, and additional 7% or more when moving from ViT-B to ViT-L. In contrast, CNN-based methods scale poorly with increased backbone size. Specifically, on Split ImageNet-R, a more challenging dataset, MEMO/RN34 is only 2.95% better than MEMO/RN18, despite being 1.9 $\times$ larger. On Split CIFAR-100, increasing the backbone size leads to little or no gain. Exchanging model size for replay buffer is generally more efficient. This means that enlarging the backbone size for CNN models is not always beneficial, even with abundant memory available.

$\bullet$ Memory efficiency. CNN-based methods are often considered more memory-efficient because they use backbones a magnitude smaller than ViT models. Thus, it is commonly believed that ample memory can be allocated to memory buffers for replay samples or past model checkpoints [19], which help improve model performance. However, there is a huge misconception: in CL, which involves typical training iterations, a substantial amount of memory must be allocated for managing feature maps [46]. Feature maps dominate overall memory usage and appear in all methods, thereby leaving little room for memory buffers. For a device memory range of 1–4GB in Fig. 1(b), the backbone of MEMO/RN18 consumes only 43MB of memory (6.5% of ViT-L), but feature maps occupy as much as 1.7GB (more than ViT-based methods) as training goes through layers extended for past model checkpoints. This causes MEMO/RN18 to be short on memory for the replay buffer, dramatically degrading model accuracy, as observed Fig. 2(a). Fig. 2(b) shows profiling results for other memory ranges that reveal a similar trend.

What Our Study Implies. Based on the study so far, ViT-based methods appear to be more favored than CNN-based methods for on-device CL. However, CODA-Prompt consumes significantly more resources with the same ViT backbone, making it less efficient compared to other methods that operate within the same memory budget. This inefficiency stems from CODA-Prompt’s attention-based prompt selection algorithm, as opposed to the prompt-pool-based selection used in L2P and DualPrompt. The attention-based approach results in significantly larger feature maps. For instance, the feature maps of CODA-Prompt with ViT-Ti backbone become as large as those of L2P and DualPrompt using the larger ViT-B backbone.

It is important to note that the best-performing method is not fixed and can dynamically change depending on the backbone in use. In our study, L2P performs best with ViT-L backbone, while DualPrompt is more effective with ViT-B or ViT-Ti backbone (Fig. 2(a)). In Sec. 5.1, we further observe that the choice of method and backbone is more dynamic. Therefore, to enhance ViT-based CL for better resource efficiency, we need to design adaptive techniques tailored to any backbone, regardless of the specific learning method in use.

4 REP: Resource-Efficient Prompting

We aim to reduce the cost of prompt-based CL to broaden its benefits and applicability. As the learning process typically involves two key stages, prompt selection and prompt update, each stage is optimized based on its unique characteristics. An overview of our method REP is provided in Fig. 3.

4.1 Prompt Selection

The prompt selection stage operates a neural network model $\bm{f_{\text{query}}}$ to extract representations of input data from a prompt pool [14, 15]. Formally, for a given input $x_{i}^{j}$ in task $T_{i}$ , $\bm{f_{\text{query}}}$ calculates a query feature $\bm{q}(x_{i}^{j})\in\mathbb{R}^{D}$ and selects the prompt $\bm{p^{*}}$ that maximizes the cosine similarity with the query:

\begin{array}[]{cc}p^{*}=\underset{p_{k}\in P}{\mathrm{argmax}}\ \frac{\langle f% _{\text{query}}(x_{i}^{j}),p_{k}\rangle}{\|f_{\text{query}}(x_{i}^{j})\|\|p_{k% }\|}&\qquad p_{\text{efficient}}^{*}=\underset{p_{k}\in P}{\mathrm{argmax}}\ % \frac{\langle\phi(q_{\text{efficient}}(x_{i}^{j})),p_{k}\rangle}{\|\phi(q_{% \text{efficient}}(x_{i}^{j}))\|\|p_{k}\|}\end{array}

(1)

While $\bm{f_{\text{query}}}$ performs solely model inference and is typically equivalent in size to the primary backbone model, it can incur a noticeable increase in computation time and memory usage, especially when implemented with a large transformer model. To mitigate these costs, we need to adopt a more resource-efficient model $\bm{f_{\text{efficient}}}$ , which has a smaller depth and width than $\bm{f_{\text{query}}}$ .

In our preliminary study, we observe that simply decreasing one dimension can easily compromise model performance. So, we opt for a small pre-trained ViT model ViT-Ti as $\bm{f_{\text{efficient}}}$ , which we have found proficient in extracting essential representations without excessively cutting down on depth or width. When $\bm{f_{\text{efficient}}}$ generates a query feature $\bm{q_{\text{efficient}}}(x_{i}^{j})\in\mathbb{R}^{d}$ (where $d\leq D$ ), we apply a nonlinear random projection $\bm{\phi}$ [47] to maintain the integrity of high-dimensional features in the selected prompt $\bm{p_{\text{efficient}}^{*}}$ , as dictated in Eq. 1.

In pursuing system efficiency at this stage, a more comprehensive strategy would be to create a range of downsized models and select the most compact $\bm{f_{\text{efficient}}}$ that still achieves a high level of prompt overlap with $\bm{f_{\text{query}}}$ . Our choice of ViT-Ti as $\bm{f_{\text{efficient}}}$ meets this criterion; e.g., in Split ImageNet-R, ViT-Ti achieves a prompt overlap of 76.3% on average with ViT-L, the largest $\bm{f_{\text{query}}}$ in consideration, based on canonical-correlation analysis (CCA), and almost identical accuracy (Tab. 2).

4.2 Prompt Update

The prompt update stage aims to refine learnable parameters for effective training and task adaptability by combining the softmax cross-entropy loss of the classifier $\bm{L_{class}}$ and the cosine similarity loss of the prompt $\bm{L_{prompt}}$ . The combined loss function $\bm{L}$ is expressed as:

\displaystyle L=L_{class}\left(f_{\text{update}}(x_{i}^{j}),y_{i}^{j}\right)+% \lambda L_{prompt}\left(p^{*},q(x_{i}^{j})\right),

(2)

where $\bm{f_{\text{update}}}$ denotes the backbone model, and $\lambda$ is a regularization constant.

Unveiling System-Efficiency Opportunities. Remaining parameters that are not relevant to learnable parameters consist of a number of frozen transformer blocks. This raises the question:

Are all frozen blocks useful for executing the loss function $\bm{L}$ , or can we skip
some computations to save energy and memory usage?

To answer this, we explore how the mean distance of attention scores in the self-attention mechanisms of ViT models changes when training a new task. The attention distance has been widely used to analyze layer representation during the training of new models or the fine-tuning of pre-trained models [48, 49]. In our study, we repurpose this concept to examine the behavior of frozen layers during data drift. Specifically, for a given head in an attention layer within frozen blocks, if the query patch position is $(x_{0},y_{0})$ and it attends to positions $(x_{i},y_{i})$ with weights $a_{i}$ , the distance $d_{i}$ —typically Euclidean or pixel distance—becomes $d_{i}=(x_{i}-x_{0})^{2}+(y_{i}-y_{0})^{2}$ . We then compute the weighted average distance across the positions as $\frac{\sum_{i}a_{i}\cdot d_{i}}{\sum_{i}a_{i}}$ . Lower distances indicate local “task-specific” information, whereas higher distances imply global “general” information. We measure such mean attention distances along two dimensions: by layer ID and by attention head. Fig. 4 shows these results when handling the first task of Split ImageNet-R.

In Fig. 4(a), we present analysis results for layer ID using L2P with ViT-L. Clearly, lower layers exhibit highly dispersed attention distances, indicating that they capture task-specific and global information simultaneously. However, as we examine deeper layers, these distances become increasingly concentrated at high values, indicating that deeper layers specialize in capturing global information. Taking a closer look at attention heads in a few selected layers in Fig. 4(b), attention distances across the heads indeed vary more widely in shallow layers than in deeper layers. We observe similar behavior when using DualPrompt with ViT-B in Fig. 4(c). DualPrompt differs from L2P in its use of prompts, as it integrates prompts directly into the input sequence before the attention layer of each transformer block, rather than only into the first block. This suggests that the disparity in mean attention distances across layers may not be greately affected by the method of leveraging prompts.

Next, we explain how we bring these insights into practice through two compute-skip** techniques that focus more on non-task-specific representations in deeper layers: adaptive token merging and adaptive layer drop**.

Adaptive Token Merging (AToM). The first dimension of skip** is data at the token level via token merging. Conventional token merging (ToMe) [25] reduces the number of tokens by uniformly merging $n$ redundant tokens per layer, controlled by a scheduler $\bm{r}$ . The scheduler function $\bm{r}(l)\rightarrow n$ is applied to each layer $l$ , and according to [25], it merges all tokens, including prompt tokens. However, insights from Fig. 4 and our additional analysis highlight two major problems.

First, there is a loss of task-specific information. The prompt tokens in CL carry essential task-specific information. However, ToMe indiscriminately combines these prompt tokens with others, thereby diminishing their intrinsic value. According to our empirical data in Fig. 5, this approach can cause gradient explosions in the prompt tokens, even with gradient clip**, leading to learning instability. Second, there is a lack of layer-specific adaptability. ToMe does not account for the disparity between shallow and deep layers, treating all layers uniformly. Therefore, there is a risk of excessive loss of valuable information from shallow layers, which are mainly responsible for adaptability to diverse sequential tasks.

Our adaptive token merging (AToM; Algorithm 1 in App. A) addresses the loss of task-specific information by excluding prompt tokens during token merging, thus maintaining their specificity. To enhance task-specific adaptability, AToM exploits a new scheduler $\bm{r^{\prime}}(l)\rightarrow n^{\prime}$ , which dynamically adjusts the number of tokens to merge based on layer depth, as follows:

\displaystyle r^{\prime}(l)=\min(\delta\times(l-1),r_{\max}),

(3)

where $l$ denotes the layer index, $\delta$ is the step change in the number of tokens to merge defined as $\frac{r_{\max}}{\text{L}-1}$ (with $L$ being the number of layers), and $r_{\max}$ is the maximum number of tokens to merge (by default, $2\times n$ ). With this $r^{\prime}(l)$ , merging occurs more frequently in deeper layers than in shallow layers, preserving important task-related representations.

Adaptive Layer Drop** (ALD). The second dimension of skip** involves drop** layers. Inspired by insights from Fig. 4 and prior work on progressive layer drop** (PLD), we propose adaptive layer drop** (ALD; Algorithm 2 in App. A) with two key features: (1) the drop** schedule considers both temporal and spatial aspects, and (2) it manages to drop layers non-uniformly to preserve critical task-specific features in shallower layers. On the contrary, PLD only considers the temporal aspect and does not differentiate between layers when drop**. This results in poorer performance compared to ALD, as shown in Tab. 3.

ALD prioritizes retaining shallow layers that contain richer information essential for model performance after token merging. Thus, instead of operating on its own schedule parameters, ALD leverages feedback from token merging, specifically the count of merged tokens at each layer, to guide layer drop** decisions. The layer-kee** probability $\theta_{t,l}$ is defined as:

\displaystyle\theta_{t,l}=\left((1-\bar{\theta})\exp(-\gamma\cdot t)+\bar{% \theta}\right)\times\alpha,

(4)

where $\theta_{t,l}$ is the probability of kee** layer $l$ at time step $t$ , $\bar{\theta}$ is the minimum probability, $\gamma$ controls the rate of decay, and $\alpha$ is the adjustment factor. When the number of merged tokens surpasses $\tau$ , ALD adjusts the layer-drop** probability based on $\alpha$ . Since deeper layers tend to merge more tokens with AToM, ALD is more likely to exceed $\tau$ in deeper layers and drop more aggressively. These parameters should be tuned to balance between efficiency and the need to process the nuanced information contained within the merged tokens. We empirically determine their values.

One might contemplate exploiting stochastic depth [23], which applies gradually increasing drop** probabilities for deeper layers without a temporal schedule. We compared ALD and stochastic depth and found that ALD generally offers 1.8–3.3% higher accuracy. This may be attributed to early training iterations after task insertion, which play a critical role in stabilizing training losses [9].

Implications for Resource Efficiency. REP triggers AToM and ALD for each task insertion to optimize both time and memory usage. Much like our prompt selection optimization, AToM consistently enhances time and memory efficiency. In contrast, ALD solely contributes to reducing time costs because it operates layer-kee** probability $\theta_{t,l}$ starting at 1.0, i.e., no layer drop**, which necessitates traversing full layers.

5 Experiments of REP

Our experiment setup for REP extends the methodology described in Sec. 3 by including Split PlantDisease [50] dataset for image classification. Split PlantDisease contains 54,303 plant disease images across 38 categories. Due to its highly imbalanced nature, we drop 3 plant disease categories with very few images and organize the remaining 35 classes into 7 tasks, each with an equal number of classes.

5.1 REP on Best-Performing Methods

In each graph in Fig. 6, we demonstrate the efficacy of REP when applied to the best-performing CL method for each combination of memory budget and dataset observed in Fig. 2(a). The accuracy is indicated by the stars ( $\bigstar$ ), while the GPU wall-clock time is indicated by the bars ( $\blacksquare$ ). The memory consumption in GB is noted on top of the bars.

Resource Efficiency. The best-performing baselines optimized with REP demand significantly smaller amounts of system resources. Specifically, REP reduces memory consumption by 33% for L2P/ViT-L, 11% for DualPrompt/ViT-B, and 5% for DualPrompt/ViT-Ti. Despite the frozen backbone not requiring gradients for parameter updates, it still performs the backward pass to calculate classification and prompt losses for updating learnable parameters. Therefore, all intermediate data need to be stored in memory, similar to training a non-frozen model. This memory usage is reduced by applying AToM and ALD. In addition, REP exhibits a 16–47% reduction in GPU wall-clock time compared to competing methods, thereby leading to great energy savings.

Accuracy. L2P, DualPrompt, and CODA-Prompt were initially evaluated using only ViT-B backbone, focusing primarily on accuracy in their original studies. In comparison, our evaluation extends these methods across diverse backbones while taking into account their energy-accuracy trade-offs in on-device CL across various memory constraints. When augmented with our techniques, these methods show significant improvements in resource efficiency without considerable loss in accuracy across all memory budgets: 0.09–1.90% for Split PlantDisease, 0.03–1.07% for Split CIFAR-100, and 0.26–0.86% for Split ImageNet-R.

5.2 Ablation and Additional Study

# merged	AToM (w/ $\theta$ = 0.5)			Keep	ALD (w/ $n$ = 8)
tokens ( $n$ )	Acc. ( $\uparrow$ )	GPU Time (s)	Mem. (GB)	ratio ( $\theta$ )	Acc. ( $\uparrow$ )	GPU Time (s)	Mem. (GB)
2	76.16 $\pm$ 0.74	2808.43	5.48	0.25	73.18 $\pm$ 0.55	2362.07	4.45
4	75.32 $\pm$ 0.32	2714.04	5.16	0.50	75.34 $\pm$ 0.86	2542.43	4.45
8	75.34 $\pm$ 0.86	2542.43	4.45	0.75	74.44 $\pm$ 0.60	2861.35	4.45

Table 1: REP over different # of merged tokens (

n

) and % of keep ratio (

\theta

Ablated components	Acc. ( $\uparrow$ )	GPU Time (s)	Mem. (GB)
REP (ours)	75.34 $\pm$ 0.86	2542.43	4.45
(1) $f_{\text{efficient}}$	74.79 $\pm$ 0.38	3698.78	5.53
(2) AToM	74.91 $\pm$ 0.77	4442.04	5.60
(3) ALD	74.53 $\pm$ 0.78	4243.97	6.67
(4) $f_{\text{efficient}}$ + AToM	74.15 $\pm$ 0.82	2861.35	4.46
(5) $f_{\text{efficient}}$ + ALD	74.52 $\pm$ 0.43	3206.74	5.51
(6) AToM + ALD	74.57 $\pm$ 0.37	3448.76	4.77

Table 2: Ablating REP’s components. They contribute to resource efficiency and accuracy.

We validate our proposed techniques and algorithm designs along with the chosen hyperparameters. Refer to the App. E for additional results. The study here primarily uses the ViT-L backbone on Split ImageNet-R, where L2P is the best non-optimized method.

Component Ablation. In Tab. 2, we ablate REP’s components to assess their contributions to system efficiency. Using L2P as a baseline, we examine all combinations with our components applied partially. Each component plays a key role in reducing memory usage and compute time, with ALD affecting compute time only, as expected. Notably, AToM substantially optimizes both resource usages.

	Acc. ( $\uparrow$ )	GPU Time (s)	Mem. (GB)
REP (ours)	75.34 $\pm$ 0.86	2542.43	4.45
(1) w/ ToMe	70.19 $\pm$ 0.74	2917.84	3.74
(2) w/ PLD	73.33 $\pm$ 0.71	2741.90	4.46

Table 3: Using traditional token merging (ToMe) or layer drop** (PLD) instead our adaptive algorithms harms model accuracy.

Algorithm Validation. Token merging (ToMe) [25] and progressive layer drop** (PLD) [24] are specifically designed to accelerate traditional transformer-based model training. To validate the importance of incorporating our adaptive techniques instead for CL, we evaluate how REP performs in case AToM and ALD are replaced with ToMe and PLD, respectively. The results are presented in Tab. 3. Although applying ToMe or PLD improves system efficiency over L2P, it results in an accuracy loss of 5.15% or 2.01%, respectively, compared to using our techniques. This indicates that ToMe and PLD are less desirable for achieving our goal, i.e., enhancing system efficiency without compromising accuracy.

AToM and ALD Intensity. Tab. 1 shows the effects of varying intensities of AToM and ALD, focusing on the number of merged tokens ( $n$ ) in AToM and the keep ratio ( $\theta$ ) in ALD. The default parameter values used by our system are 8 for $n$ and 0.5 for $\theta$ . AToM appears to maintain stable accuracy even as more tokens are merged. In contrast, for ALD, a lower keep ratio improves system efficiency by reducing iteration time and memory usage, but it can markedly impair model accuracy if the ratio is too low. Overall, when used with caution, AToM and ALD can effectively balance system efficiency and accuracy, making them suitable for various resource-limited on-device CL scenarios.

6 Discussion and Conclusion

Since our study focuses on scenarios wherein a sequence of tasks is presented within learning environments, our algorithm does not determine the curriculum of learning. In general, actively selecting such curricula could lead to improved performance in the final average accuracy. Future work should explore extending our model for curriculum-based continual learning. Moreover, although using replay buffer may not be feasible in real-world scenarios where data privacy matters [14], on-device CL is already viewed as a means to preserve data privacy. Therefore, exploring rehearsal-based REP is a promising direction for future research. A critical question to address is determining the optimal memory space allocation between replay buffer and model training.

References

[1] Young D Kwon, Jagmohan Chauhan, Abhishek Kumar, Pan Hui HKUST, and Cecilia Mascolo. Exploring System Performance of Continual Learning for Mobile and Embedded Sensing Applications. In IEEE/ACM SEC, 2021.
[2] Tyler L Hayes and Christopher Kanan. Online Continual Learning for Embedded Devices. In CoLLAs, 2022.
[3] Yuqing Zhao, Divya Saxena, and Jiannong Cao. Memory-Efficient Domain Incremental Learning for Internet of Things. In SenSys, 2023.
[4] M. McCloskey and Neal. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychol. Learn. Motiv. - Adv. Res. Theory, 1989.
[5] Dianlei Xu, Tong Li, Yong Li, Xiang Su, Sasu Tarkoma, Tao Jiang, Jon Crowcroft, and Pan Hui. Edge Intelligence: Architectures, Challenges, and Applications. arXiv preprint arXiv:2003.12172, 2020.
[6] Jaeheon Kwak, Sunjae Lee, Dae R Jeong, Arjun Kumar, Dongjae Shin, Ilju Kim, Donghwa Shin, Kilho Lee, **kyu Lee, and Insik Shin. MixMax: Leveraging Heterogeneous Batteries to Alleviate Low Battery Experience for Mobile Users. In MobiSys, 2023.
[7] Seunghyeok Jeon, Yonghun Choi, Yeonwoo Cho, and Hojung Cha. HarvNet: Resource-Optimized Operation of Multi-Exit Deep Neural Networks on Energy Harvesting Devices. In MobiSys, 2023.
[8] Saad Ahmed, Bashima Islam, Kasim Sinan Yildirim, Marco Zimmerling, Przemysław Pawełczak, Muhammad Hamad Alizai, Brandon Lucia, Luca Mottola, Jacob Sorber, and Josiah Hester. The Internet of Batteryless Things. CACM, 67(3):64–73, 2024.
[9] Xinyue Ma, Suyeon Jeong, Minjia Zhang, Di Wang, Jonghyun Choi, and Myeongjae Jeon. Cost-effective On-device Continual Learning over Memory Hierarchy with Miro. In MobiCom, 2023.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016.
[11] Andreas Peter Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. TMLR, 2022.
[12] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers. ICCV, 2021.
[13] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning Robust Visual Features without Supervision. TMLR, 2024.
[14] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning To Prompt for Continual Learning. In CVPR, 2022.
[15] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning. In ECCV, 2022.
[16] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-Prompts Learning with Pre-trained Transformers: An Occam’s Razor for Domain Incremental Learning. In NeurIPS, 2022.
[17] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning. In CVPR, 2023.
[18] Dahuin Jung, Dongyoon Han, Jihwan Bang, and Hwanjun Song. Generating Instance-level Prompts for Rehearsal-free Continual Learning. In ICCV, 2023.
[19] Da-Wei Zhou, Qi-Wei Wang, Han-Jia Ye, and De-Chuan Zhan. A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning. In ICLR, 2023.
[20] CUDA Semantics: Memory Management. https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management.
[21] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast Inference from Transformers via Speculative Decoding. In ICML, 2023.
[22] Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via Proxy: Efficient Data Selection for Deep Learning. In ICLR, 2020.
[23] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep Networks with Stochastic Depth. In ECCV, 2016.
[24] Minjia Zhang and Yuxiong He. Accelerating Training of Transformer-Based Language Models with Progressive Layer Drop**. In NeurIPS, 2020.
[25] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token Merging: Your ViT But Faster. In ICLR, 2023.
[26] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In NeurIPS, 2021.
[27] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning. In NeurIPS, 2022.
[28] Jetson TX2 Module. https://developer.nvidia.com/embedded/jetson-tx2.
[29] Jetson Nano Module. https://developer.nvidia.com/embedded/jetson-nano.
[30] Alexander Gepperth and Barbara Hammer. Incremental Learning Algorithms and Applications. In ESANN, 2016.
[31] Jihwan Bang, Heesu Kim, YoungJoon Yoo, Jung-Woo Ha, and Jonghyun Choi. Rainbow Memory: Continual Learning with a Memory of Diverse Samples. In CVPR, 2021.
[32] Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Cordelia Schmid, and Karteek Alahari. End-to-End Incremental Learning. In ECCV, 2018.
[33] Arslan Chaudhry, Puneet K. Dokania, Thalaiyasingam Ajanthan, and Philip H. S. Torr. Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence. In ECCV, 2018.
[34] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental Classifier and Representation Learning. In CVPR, 2017.
[35] Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. GDumb: A Simple Approach that Questions Our Progress in Continual Learning. In ECCV, 2020.
[36] Ameya Prabhu, Hasan Abed Al Kader Hammoud, Puneet K. Dokania, Philip H.S. Torr, Ser-Nam Lim, Bernard Ghanem, and Adel Bibi. Computationally Budgeted Continual Learning: What Does Matter? In CVPR, 2023.
[37] Soobee Lee, Minindu Weerakoon, Jonghyun Choi, Minjia Zhang, Di Wang, and Myeongjae Jeon. CarM: Hierarchical Episodic Memory for Continual Learning. In DAC, 2022.
[38] In Gim and JeongGil Ko. Memory-Efficient DNN Training on Mobile Devices. In MobiSys, 2022.
[39] Qipeng Wang, Mengwei Xu, Chao **, Xinran Dong, **liang Yuan, Xin **, Gang Huang, Yunxin Liu, and Xuanzhe Liu. Melon: Breaking the Memory Wall for Resource-Efficient on-Device Machine Learning. In MobiSys, 2022.
[40] Yue Wang, Ziyu Jiang, Xiaohan Chen, Pengfei Xu, Yang Zhao, Yingyan Lin, and Zhangyang Wang. E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings. In NeurIPS, 2019.
[41] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
[42] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. In ICCV, 2021.
[43] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On Tiny Episodic Memories in Continual Learning. arXiv:1902.10486, 2019.
[44] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large Scale Incremental Learning. In CVPR, 2019.
[45] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and SIMONE CALDERARA. Dark Experience for General Continual Learning: a Strong, Simple Baseline. In NeurIPS, 2020.
[46] Gangmuk Lim, Jeongseob Ahn, Wencong Xiao, Young** Kwon, and Myeongjae Jeon. Zico: Efficient $\{$ GPU $\}$ memory sharing for concurrent $\{$ DNN $\}$ training. In USENIX ATC, pages 161–175, 2021.
[47] Mark D McDonnell, Dong Gong, Amin Parvaneh, Ehsan Abbasnejad, and Anton van den Hengel. Random Projection in Deep Neural Networks. NeurIPS, 36, 2024.
[48] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? In NeurIPS, 2021.
[49] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
[50] Sharada Prasanna Mohanty, David Hughes, and Marcel Salathe. Using deep learning for image-based plant disease detection, 2016.
[51] L2P PyTorch Implementation. https://github.com/JH-LEE-KR/l2p-pytorch.
[52] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[53] DualPrompt PyTorch Implementation. https://github.com/JH-LEE-KR/dualprompt-pytorch.
[54] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Caltech-ucsd birds 200. Technical Report CNS-TR-2011-001, Caltech, 2011.
[55] OEM PRODUCT DESIGN GUIDE: NVIDIA Jetson TX2. https://usermanual.wiki/Pdf/jetsontx2oemproductdesignguide.2134990230.pdf.
[56] Convenient Power Measurements on the Jetson TX2/Tegra X2 Board. https://embeddeddl.wordpress.com/2018/04/25/convenient-power-measurements-on-the-jetson-tx2-tegra-x2-board/.
[57] Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need, 2023.

Appendix A Algorithm Details

Input :

T

: Initial set of all tokens,

r^{\prime}

: Adaptive merging schedule function,

P

: Set of prompt tokens,

L: Number of layers in the model

Output :

T^{\prime}_{\text{final}}

: Set of tokens after processing all layers

T^{\prime}_{\text{final}}\leftarrow T

\triangleright

Initialize with the initial token set

for $l\in\{1,2,\dots,\text{L}\}$ do

T_{\text{attn}}\leftarrow\text{MSA}(T^{\prime}_{\text{final}})

\triangleright

Self-attention for layer

l

T_{\text{eligible}}\leftarrow T_{\text{attn}}\setminus P

\triangleright

Exclude prompt tokens

n^{\prime}\leftarrow r^{\prime}(l)

\triangleright

Adaptive token merging at layer

l

T^{\prime}_{\text{merged}}\leftarrow\text{Merge}(T_{\text{eligible}},n^{\prime})

T_{\text{concat}}\leftarrow\text{Concat}(T^{\prime}_{\text{merged}},P)

\triangleright

Concatenate merged tokens with prompt tokens

T^{\prime}_{\text{final}}\leftarrow\text{MLP}(T_{\text{concat}},l)

\triangleright

Apply MLP to concatenated tokens

end for

return

T^{\prime}_{\text{final}}

Algorithm 1 Adaptive token merging (AToM)

Input :

X

: Input tensor,

\theta_{t,l}

: Keep ratio of layer,

\bar{\theta}

: Minimum ratio,

\gamma

: Decay rate,

\tau

: Spatial threshold,

\alpha

: Adjustment factor,

L

: Number of layers

Output :

X_{\text{out}}

: Processed output for transformer layers

for $l\in\{1,2,\dots,L\}$ do

\theta_{t,l}\leftarrow(1-\bar{\theta})\exp(-\gamma\cdot t)+\bar{\theta}

\triangleright

Current keep ratio

if $r^{\prime}(l)>\tau$ then

\theta_{t,l}\leftarrow\theta_{t,l}\times\alpha

\triangleright

Adjust based on merging feedback

end if

if $\text{Bernoulli}(\theta_{t})=1$ then

X_{\text{out}}\leftarrow\text{Exec}(l,X_{\text{out}})

\triangleright

Layer

l

is executed

else

X_{\text{out}}\leftarrow X_{\text{out}}

\triangleright

Layer

l

is dropped

end if

end for

return

X_{\text{out}}

for each layer

l

Algorithm 2 Adaptive layer drop** (ALD)

Appendix B Implementation Details

Unless otherwise stated, we use the term REP–L to denote REP applied to L2P with a ViT-L backbone, which stands out as the best-performing baseline method for Split CIFAR-100 and Split ImageNet-R within the 4–8GB memory range. Similarly, the notations L, B, and Ti are used to generally denote ViT-L, ViT-B, and ViT-Ti backbones, respectively, in the prompt update stage of any prompt-based method.

B.1 Prompt-Based Methods for REP

Here are details on the three major prompt-based methods to which REP is integrated for resource efficiency.

Learning to prompt (L2P) [14] positions prompts at the first layer of the transformer architecture. These prompts are learnable parameters, which dynamically evolve with the training process. The mechanism begins with a prompt pool $P$ that contains a set of distinct prompts $\{p_{1},p_{2},...,p_{m}\}\subset\mathbb{R}^{D}$ .

For a given input $x_{i}^{j}$ in task $T_{i}$ , L2P calculates a query feature $q(x_{i}^{j})\in\mathbb{R}^{D}$ and selects the corresponding prompt. The prompt $p^{*}$ is selected based on maximizing the cosine similarity with respect to the query:

\displaystyle p^{*}=\underset{p_{k}\in P}{\mathrm{argmax}}\ \frac{\langle q(x_% {i}^{j}),p_{k}\rangle}{\|q(x_{i}^{j})\|\|p_{k}\|}.

(5)

The number of the selected prompts is a learnable parameter.

The selected prompt $p^{*}$ is concatenated with the input embedding $e(x_{i}^{j})$ to form the prompt-augmented input $e^{\prime}(x_{i}^{j})=[p^{*};e(x_{i}^{j})]$ . The training objective of L2P balances classification loss $L_{class}$ and prompt-adjustment loss $L_{prompt}$ :

\displaystyle L=L_{class}\left(e^{\prime}(x_{i}^{j}),y_{i}^{j}\right)+\lambda L% _{prompt}\left(p^{*},q(x_{i}^{j})\right),

(6)

where $L_{class}$ and $L_{prompt}$ employ soft-max cross-entropy and cosine similarity loss, respectively, and the regularization parameter $\lambda$ is set to 0.1 following [51].

DualPrompt [15] leverages prompts at multiple layers of the transformer architecture. It introduces a general prompt $g\in\mathbb{R}^{L_{g}\times D}$ and a set of task-specific prompts $E=\{e_{1},e_{2},...,e_{T}\}\subset\mathbb{R}^{L_{e}\times D}$ . These prompts are incorporated at specified layers in the transformer model.

For a given input $x_{i}^{j}$ in task $T_{i}$ , the model’s transformer layers $f$ are modified by attaching $g$ and $e_{i}$ to the layers, resulting in a prompted architecture $f_{g,e_{i}}$ . The feature transformation $h_{i}^{j}$ for the input sample $x_{i}^{j}$ is then obtained as:

\displaystyle h_{i}^{j}=f_{g,e_{i}}(x_{i}^{j}).

(7)

Similar to L2P, DualPrompt optimizes $L_{class}$ and $L_{prompt}$ :

\displaystyle L=L_{class}(h_{i}^{j},y_{i}^{j})+L_{prompt}(g,e_{i}).

(8)

CODA-Prompt [17] decomposes learnable prompts into components and uses an attention mechanism from a pre-trained ViT model to select relevant prompts. Instead of a single prompt, CODA-Prompt learns a set of prompt components $P=\{\bm{P_{1}},\bm{P_{2}},...,\bm{P_{M}}\}\subset\mathbb{R}^{L_{p}\times D% \times M}$ . The final prompt $\bm{p}$ is a weighted sum:

\displaystyle\bm{p}=\sum_{m}\alpha_{m}\bm{P_{m}},

(9)

where weights $\alpha$ are based on the query $q(x)$ and keys $K\in\mathbb{R}^{D\times M}$ :

\displaystyle\alpha=\gamma(q(x),K).

(10)

When the task changes, the current components are frozen, and new one is added, ensuring orthogonality:

\displaystyle L_{\text{ortho}}(B)=||BB^{\top}-I||^{2},

(11)

where $B$ represents $P$ , $K$ , or $A$ ( $A$ represents the attention vector). The full optimization target is:

\displaystyle\min_{P^{n},K^{n},A^{n},\phi^{n}}L(f_{\phi}(f_{\theta},P,K,A(x)),% y)+\lambda(L_{\text{ortho}}(P)+L_{\text{ortho}}(K)+L_{\text{ortho}}(A)),

(12)

where $P^{n}$ , $K^{n}$ , and $A^{n}$ are new components, and $\lambda$ balances the orthogonality loss.

B.2 Hyperparameters and Configurations

For all prompt-based CL methods, we employ the ADAM optimizer [52] configured with hyperparameters $\beta_{1}=0.99$ and $\beta_{2}=0.999$ . In contrast, BudgetCL and MEMO adopt the stochastic gradient descent (SGD) optimizer with a momentum coefficient of 0.9 and a weight decay of 0.002, which is validated as effective by similar replay buffer-based methods [14]. In REP, we set the prompt pool size to 10 and the prompt length to 5, following the original implementations of L2P and DualPrompt [51, 53]. The learning rate for the prompt-based methods is set to 0.001875. For CODA-Prompt, we implement a cosine decay learning rate strategy, as outlined in its original paper [17]. For BudgetCL and MEMO, the initial learning rate starts from 0.1 and decays by 0.1 at 80 and 150 epochs following the original paper [19].

We consistently use the mini-batch size of 16 for all methods to maintain a uniform computational load for each training iteration. For fair comparisons, each method performs 1,875 1,080, and 2583 iterations per task insertion for Split CIFAR-100, Split ImageNet-R, and Split PlantDisease, respectively. This setup ensures that among prompt-based methods, iteration time itself correlates linearly with energy usage, as training wall-clock time similarly reflects energy consumption as discussed in the main paper.

BudgetCL and MEMO select 8 samples from the new task and another 8 samples from old tasks in the replay buffer to organize 16 samples in each mini-batch. In contrast to prompt-based methods, we adopt the setup in [19] and train 200 epochs for all datasets used. Thus, BudgetCL and MEMO perform more iterations per epoch for a bigger replay buffer, translating into longer GPU time and more energy usage.

Appendix C Competing Methods with Original Setups

Method	Split CIFAR-100		Split ImageNet-R		Mem. (GB)
Method	Acc. ( $\uparrow$ )	GPU Time (s)	Acc. ( $\uparrow$ )	GPU Time (s)	Mem. (GB)
REP–L (ours)	86.89 $\pm$ 0.57	4420.99	75.34 $\pm$ 0.16	2542.43	4.45
L2P–B	84.36 $\pm$ 0.68	2802.42	59.66 $\pm$ 0.81	1609.88	14.20
DualPrompt–B	85.27 $\pm$ 0.75	2624.06	67.45 $\pm$ 0.88	1507.42	12.30
CODA-Prompt–B	86.09 $\pm$ 0.73	3226.58	75.52 $\pm$ 0.74	1853.54	18.83

Table 4: Competing methods with original setups (based on ViT-B backbone).

For the evaluation of L2P, DualPrompt, and CODA-Prompt in the main paper, we adhere to the hyperparameters specified in the original studies, except for the number of iterations per task and batch size. We adopt a smaller batch size to account for on-device memory limitations, which prevent the use of the original batch sizes. Nonetheless, we present the results of these three methods using their original hyperparameters in Tab. 4. Although these methods are based on the ViT-B backbone, they consume significantly more memory than REP–L (with the ViT-L backbone), primarily due to their originally large batch sizes.

Appendix D Additional Datasets

Method	Split CUB-200		GPU Time (s)	Mem. (GB)
Method	Acc. ( $\uparrow$ )	Fgt. ( $\downarrow$ )	GPU Time (s)	Mem. (GB)
REP–L (ours)	74.70 $\pm$ 1.04	6.67 $\pm$ 0.34	529.67	4.45
L2P–L	74.70 $\pm$ 1.02	6.68 $\pm$ 1.35	1005.57	6.67
DualPrompt–L	72.35 $\pm$ 0.98	7.82 $\pm$ 0.31	953.42	5.93
CODA-Prompt–L	79.46 $\pm$ 0.84	5.84 $\pm$ 0.99	1257.82	13.95

Table 5: Results on Split CUB-200, based on using ViT-L as the backbone model.

We also expand our experiments on REP by incorporating an additional dataset: Split CUB-200 [54], which is designed for fine-grained image classification. Split CUB-200 contains 5,994 bird images categorized into 200 classes, which are split into 5 CL tasks, each having 40 classes. We present the experimental results for all prompt-based methods using ViT-L as the backbone in Tab. 5. REP–L outperforms competing methods in resource efficiency by bringing significant reductions in both computation time and memory usage. Importantly, REP–L is on par with its baseline model L2P in accuracy, and note that while adopting CODA-Prompt to implement REP can deliver better accuracy, its memory usage exceeds all memory ranges considered in Fig. 1(a).

Appendix E Extended Ablation and Additional Study

As the paper mainly discusses the ablation study based on Split ImageNet-R, we here present an extended ablation study of REP using Split CIFAR-100. The results are shown in Tab. 7. Similar to the findings from Split ImageNet-R, the ablation of any component within REP for the Split CIFAR-100 dataset impacts resource efficiency. All components contribute to the reduction of computation time and memory usage, with AToM particularly standing out in both aspects of resource efficiency.

Furthermore, we perform algorithm validation on Split CIFAR-100, for which we present the results in Tab. 7. Consistent with the Split ImageNet-R results, using static token merging (ToMe) or progressive layer drop** (PLD) instead of adaptive token merging (AToM) or adaptive layer drop** (ALD) negatively affects model accuracy on the Split CIFAR-100 dataset. This experiment further corroborates the efficacy of our proposed algorithms in optimizing resource usage without compromising model accuracy in CL scenarios.

Ablated components	Acc. ( $\uparrow$ )	GPU Time (s)	Mem. (GB)
REP–L (ours)	86.89 $\pm$ 0.57	4420.99	4.45
(1) $f_{\text{efficient}}$	86.38 $\pm$ 0.43	6096.14	5.53
(2) AToM	86.43 $\pm$ 0.23	7724.21	5.60
(3) ALD	86.79 $\pm$ 0.73	7379.79	6.67
(4) $f_{\text{efficient}}$ + AToM	85.97 $\pm$ 0.54	4975.57	4.46
(5) $f_{\text{efficient}}$ + ALD	86.07 $\pm$ 0.18	5576.16	5.51
(6) AToM + ALD	86.08 $\pm$ 0.48	5997.01	4.77

Table 6: Extended component ablation for Split CIFAR-100. Ablating any component of REP results in lower resource efficiency by increasing training time and memory consumption.

	Acc. ( $\uparrow$ )	GPU Time (s)	Mem. (GB)
REP–L (ours)	86.89 $\pm$ 0.57	4420.99	4.45
(1) w/ ToMe	83.55 $\pm$ 0.14	5073.80	3.74
(2) w/ PLD	84.40 $\pm$ 0.35	4767.86	4.46

Table 7: Extended algorithm validation for Split CIFAR-100. Changing our adaptive algorithms to conventional static token merging (ToMe) or progressive layer drop** (PLD) harms model accuracy.

Appendix F REP under Varying Compute Budgets

Method	Number of Iterations per Task.
Method	313	625	938	1250	1563	1875
REP–L (ours)	84.06	85.47	86.60	86.66	86.57	86.89
L2P–L	84.36	86.28	87.34	87.76	87.83	88.24

Table 8: Impact of REP under various computation budget.

We compare REP–L with L2P across various compute budgets defined by the number of iterations per task. The comparison results using Split CIFAR-100 are shown in Tab. 8. Evidently, the accuracy of REP–L is comparable to that of L2P across six compute budgets, with the performance gap getting increasingly marginal at lower compute budgets. We observe that the overall accuracy for both methods does not drop significantly with reduced compute budgets. This is attributed to the fact that Split CIFAR-100 is considered a less complex CL benchmark.

Appendix G REP on Edge Device

Miro [9] introduces a dynamic approach for fine-tuning key design parameters of rehearsal-based CL methods to achieve high model accuracy while simultaneously reducing energy costs, i.e., high cost-effectiveness. Miro’s methodology centers on identifying the optimal memory size within the device’s capacity to accommodate both old and new samples for training, thereby maximizing cost-effectiveness. We compare REP with Miro directly on a reference edge device NVIDIA Jetson TX2 [28] (TX2). TX2 is equipped with a 256-core NVIDIA Pascal GPU, 8GB of RAM, and a 32GB eMMC 5.1 drive. The RAM is a single unified memory shared by the CPU and GPU. We report energy usage as joules (J) obtained by multiplying power by time. Power usage is measured by reading built-in sensor values provided by Jetson devices for the GPU, RAM, CPU, and I/O connection [55, 56].

G.1 Power Usage Breakdown

When implementing CL on edge devices, the majority of power consumption that influences energy usage during the training of a new task comes from GPU operations. Fig. 7 shows power consumption on TX2 across individual system components. Static power is measured when the system is inactive, while dynamic power is measured during the training of a ViT-B model. During training, power usage surges up to 6.5 $\times$ , with the GPU contributing 60% to the dynamic power.

G.2 REP vs Miro

In this experiment, we explore energy-accuracy trade-offs for REP–L and Miro on TX2. Fig. 8 visually represents the comparison results. In each graph, the x-axis signifies total energy usage (lower is better), while the y-axis signifies final average accuracy (higher is better). Hence, a more cost-effective strategy is positioned closer to the upper-left corner of the graph. Regarding Miro, we maintain the hyperparameters as suggested in the original work [9] but incorporate the use of pre-trained ResNet-18 (RN18; 11M), ResNet-34 (RN34; 22M), and ResNet-50 (RN50; 25M)[10] models instead of non-pre-trained ones to enhance overall performance. For REP–L, we vary the number of training iterations per task insertion to match the energy usage of Miro variants with different ResNet models.

\sidecaptionvpos

figurec

When operating within the same energy budget, REP–L consistently outperforms Miro variants, achieving 22–33% higher accuracy across datasets. REP–L demonstrates superior cost-effectiveness, especially on Split ImageNet-R compared to Split CIFAR-100. Both REP–L and Miro prove to be memory-efficient to some extent as they fit comfortably within the on-device memory capacity. Despite our exploration of larger ResNet models for Miro, we do not observe the accuracy levels comparable to REP–L with lower energy usage for either dataset. This experiment also underscores the importance of specifically optimizing vision transformers for on-device CL scenarios to advance performance boundaries.

Appendix H Diverse Task Sequence Lengths

Method	Split CIFAR-100 (5T)		Split CIFAR-100 (20T)		GPU Time (s)	Mem. (GB)
Method	Acc. ( $\uparrow$ )	Fgt. ( $\downarrow$ )	Acc. ( $\uparrow$ )	Fgt. ( $\downarrow$ )	GPU Time (s)	Mem. (GB)
REP–L (ours)	88.29 $\pm$ 0.85	4.98 $\pm$ 0.74	83.81 $\pm$ 0.87	5.98 $\pm$ 0.21	4420.99	4.45
L2P–L	89.27 $\pm$ 0.28	4.43 $\pm$ 0.34	84.17 $\pm$ 0.55	6.03 $\pm$ 0.11	8393.16	6.67
DualPrompt–L	87.91 $\pm$ 0.50	4.37 $\pm$ 0.19	83.76 $\pm$ 0.77	6.25 $\pm$ 0.64	7957.83	5.93
CODA-Prompt–L	88.79 $\pm$ 0.63	3.10 $\pm$ 0.22	83.98 $\pm$ 0.41	5.17 $\pm$ 0.33	8014.18	14.63

Table 9: Results on Split CIFAR-100 organized as 5 tasks (5T) and 20 tasks (20T).

Method	Split ImageNet-R (5T)		Split ImageNet-R (20T)		GPU Time (s)	Mem. (GB)
Method	Acc. ( $\uparrow$ )	Fgt. ( $\downarrow$ )	Acc. ( $\uparrow$ )	Fgt. ( $\downarrow$ )	GPU Time (s)	Mem. (GB)
REP–L (ours)	77.06 $\pm$ 0.78	2.04 $\pm$ 0.71	72.58 $\pm$ 0.59	4.50 $\pm$ 0.54	2542.43	4.45
L2P–L	77.54 $\pm$ 0.40	1.02 $\pm$ 0.20	72.77 $\pm$ 0.19	4.70 $\pm$ 0.49	4826.74	6.67
DualPrompt–L	73.11 $\pm$ 0.46	2.18 $\pm$ 0.47	69.49 $\pm$ 0.43	4.97 $\pm$ 0.90	4576.39	5.93
CODA-Prompt–L	76.83 $\pm$ 0.55	1.09 $\pm$ 0.11	73.38 $\pm$ 0.23	5.17 $\pm$ 0.33	4608.79	14.63

Table 10: Results on Split ImageNet-R organized as 5 tasks (5T) and 20 tasks (20T).

We examine the scalability and robustness of REP using REP–L over two task sequence lengths, 5 and 20 tasks, on the Split CIFAR-100 and Split ImageNet-R datasets. A longer task sequence represents a more challenging scenario where a CL method requires more frequent model updates. The results are shown in Tab. 9 and Tab. 10. We observe that our method consistently preserves both model accuracy and resource efficiency across all cases.

Appendix I Impact of Varying Hyperparameter Values

Guided by the hyperparameter tuning strategies used in DualPrompt [15] and CODA-Prompt [17], we fine-tune key hyperparameters for our ALD approach using REP–L. Our focus is primarily on optimizing the threshold parameter ( $\alpha$ ) and the adjustment factor ( $\tau$ ), which are crucial for ALD’s adaptability and efficiency. We conduct a hyperparameter search using cross-validation, for which we designate 20% of our training dataset as a validation set. Fig. 10 summarizes the results. We set the AToM’s merging parameter ( $n$ ) to 8.

We also explore the impact of varying prompt-related hyperparameters, as detailed in Fig. 10. Note that computational costs increase with larger prompt sizes and longer prompt lengths, and higher costs do not always translate into improved accuracy. In Fig. 10, we opt for a prompt pool size of 10 and a prompt length of 5 to strike the right balance between computational efficiency and model accuracy.

Appendix J Discussion about Additional Related Works

Method	Acc.	GPU Time (s)	Mem. (GB)
REP–L (ours)	72.58	2542.43	4.45
VPT-Deep–L	66.70	1444.62	12.61
VPT-Shallow–L	64.53	1271.69	6.89
SSF–L	72.16	1573.70	22.09
Adapter–L	62.05	680.05	12.01

Table 11: REP vs ADAM.

Adapter-based Rehearsal-free Methods. Considering ADAM [57] as a state-of-the-art method for adapter-based CL, we evaluate all its variations proposed in [57] on Split ImageNet-R with 20 tasks and compare their accuracies with REP–L in Tab. 11. ADAM significantly reduces end-to-end training costs using an adapter model. Unlike REP–L, which performs the backward pass to update prompts, ADAM only updates the FC layers without the backward pass. However, training such a model is memory-intensive and often exceeds on-device memory capacity. Also, when implemented with the pre-trained ViT-L model, ADAM’s capability for generalization seems to be somewhat restricted. This observation does not necessarily indicate a fundamental limitation of ADAM but rather underscores the complexities of using it effectively in its current form.

Advantages of REP over NAS. While NAS (Neural Architecture Search) can be integrated into REP to drop some layers, it usually involves searching the model from scratch, which is time-consuming. In contrast, our method applies adaptive schedules directly to an existing pre-trained model, significantly saving computation time. So, it can naturally adapt to device-internal resource conditions without relying on external server resources.

Memory Cost Compared to Online or Few-shot CL. In on-device CL, the memory capacity for storing data from new tasks may be limited. Thus, one might consider adopting online or few-shot CL methods to save memory costs, as they require fewer new-task samples to be maintained in memory.

Offline CL (our setting) typically yields better accuracy with higher resource usage. However, memory allocation for new samples is not a major constraint in offline CL. All prompt-based methods require 155–269MB to store new samples for both Split CIFAR-100 and Split ImageNet-R, accounting for only 3.4–5.9% of REP’s memory usage. Moreover, modern DL frameworks can store data in storage and proactively prefetch upcoming mini-batches for training, virtually eliminating memory concerns for new samples.

Appendix K Societal Impacts

Our method widely promotes AI for robotics agents and surveillance systems with high efficiency. Once edge AI is easily deployable using our method, it may be used for monitoring unwanted mass populations. Then, adversaries could potentially obtain private information such as identity, clothing information, and personal attributes (e.g., age, gender, etc.). Although our method has no intention to foster such misuse, we currently do not propose secure solutions to prevent these problematic scenarios from happening.

REP: Resource-Efficient Prompting for On-device Continual Learning