License: CC BY 4.0
arXiv:2312.01188v1 [cs.LG] 02 Dec 2023

Efficient Expansion and Gradient Based Task Inference for Replay Free Incremental Learning

Soumya Roy
Amazon

Vinay K Verma*
Amazon/Duke University

Deepak Gupta
Amazon

Abstract

This paper proposes a simple but highly efficient expansion-based model for continual learning. Recent feature transformation, masking and factorization-based methods are efficient, but they grow the model only over the global or shared parameter. Therefore, these approaches do not fully utilize the previously learned information, because the same task-specific parameter forgets the earlier knowledge, and they show limited transfer learning ability. Moreover, most of these models have constant parameter growth for all tasks, irrespective of the task complexity. Our work proposes a simple filter and channel expansion-based method that grows the model over the previous task parameters and not just over the global parameter. Therefore, it fully utilizes all the previously learned information without forgetting, which results in better knowledge transfer. The growth rate in our proposed model is a function of task complexity: for a simple task the model has a smaller parameter growth, while for complex tasks the model requires more parameters to adapt to the current task. Recent expansion-based models show promising results for task incremental learning (TIL). However, for class incremental learning (CIL), prediction of the task id is a crucial challenge; hence, their results degrade rapidly as the number of tasks increases. In this work, we propose a robust task prediction method that leverages entropy-weighted data augmentations and the model’s gradient using pseudo labels. We evaluate our model on various datasets and architectures in the TIL, CIL and generative continual learning settings. The proposed approach shows state-of-the-art results in all these settings. Our extensive ablation studies show the efficacy of the proposed components.

1 Introduction

*The work started before joining Amazon.

Recent deep learning models outperform humans on many challenging tasks in a static environment [9, 38]. However, in a continuously changing environment where novel tasks arrive sequentially, humans significantly outperform deep learning models. When presented with sequential tasks, the model suffers from catastrophic forgetting and remembers only the current task sequence. Continual learning refers to continuously learning and adapting to new environments while exploiting knowledge acquired from the past tasks without forgetting.

TIL and CIL are the two most popular settings in continual learning. In the TIL setting, task ids are known during training and inference; in CIL, however, task ids are present only during training. The recent literature leverages three approaches to solve the continual learning problem. Replay-based methods [33, 32, 50] are the most popular and show promising results, but they require storing a fraction of previous task samples for replay. Using past samples may violate privacy and increases training and storage costs. Moreover, learning is biased towards the current task since replay samples from the past tasks are limited. Regularization-based methods [15, 1, 4] regularize the previous task parameters to overcome catastrophic forgetting while learning the current task. They provide sub-optimal solutions and work well only for a limited task sequence. Expansion-based methods [35, 52, 58, 51] are promising, as they can model many task sequences, but efficient expansion and task prediction are the critical bottlenecks. Therefore, most of these models work only for the simple TIL setting, where task ids are available during inference. Recent works [47, 40, 44] propose efficient expansion-based models, but these models do not fully utilize the previously learned parameters. They fine-tune their expansion parameter for every new task; therefore, the parameter does not retain previous information and only the current task knowledge transfers to the next task. So, their knowledge transfer is not optimal. Moreover, most of these approaches assume that all tasks have the same complexity; therefore, their expansion parameters grow equally. However, in practice, tasks are diverse and can have different complexities. Task prediction is another key challenge in expansion-based approaches; it is sensitive to the parameter and its performance decreases rapidly as the number of tasks increases.

In this work, we propose a simple and highly efficient filter and channel expansion-based model which is generic and can be applied to any convolutional network. The model fully utilizes the previously learned information, hence providing better knowledge transfer. Our expansion-based model learns a task-specific parameter for each novel task. However, during learning, the $i^{th}$ task model uses the global parameter and all learned task-specific parameters from $2$ to the $(i-1)^{th}$ task. Parameters up to the $(i-1)^{th}$ task preserve all the previously learned information and maximize knowledge transfer during learning of the current task. The expansion in the task-specific parameter is achieved by increasing the number of filters in each layer. We also propose a gradient-based approach to grow the task-specific parameter as a function of the task complexity; therefore for a simple task, the model has a smaller parameter growth while for complex tasks, the model requires more parameters to adapt to the current task. Our expansion-based model shows promising results in the TIL scenario. We further propose a replay-free task prediction method to solve the more challenging CIL setting. The task prediction method leverages entropy-weighted augmentations and pseudo labels to calculate the sample loss. Finally, we approximate the model’s gradient by the mean gradient of each layer and the norm of the task gradient is used to predict the task id.

2 Related Work

Recently there has been a lot of interest in the continual learning paradigm [29], because of its extensive applicability to various domains. We can broadly divide modern continual learning approaches into three categories: replay, regularization and expansion-based models. Replay-based methods [33, 48, 3, 31, 34, 32, 42] keep a memory bank to store a fraction of samples from previous tasks. Handling sample bias [32] is a crucial challenge since the current task has a large number of samples, while the previous task samples are very few. Generative replay methods [37, 43, 23] learn a generative model for the sample replay. Regularization [15, 21, 22, 1] is another popular approach that regularizes the weights learned during previous tasks while training on the current task. SI [55], EWC [15] and IMM [21] focus on regularizing previous task weights based on their importance. These approaches cannot model a large number of tasks and model performance degrades quickly as the number of tasks increases. They provide sub-optimal solutions since they try to learn a joint weight that can generalize across all tasks. Recent methods like IL2A [61], SSRE [62] and FeTrIL [30] use class prototypes to tackle replay-free class-incremental learning and are baselines for our method. TCL [13] provides a theoretical justification for decomposing the CIL problem into two sub-problems: TIL and task prediction. Task prediction is done using supervised contrastive learning, ensemble class prediction and output calibration by leveraging replay data.

The expansion-based models are closely related to our work, hence we mostly focus on these approaches. Recently, a wide range of methods [35, 52, 49, 25, 36, 58, 45, 7, 51, 40, 26, 44] propose expansion-based strategies to incorporate the growing task sequence. PNN [35] directly adds a new neural network column for each novel task. It freezes the weights of the previous tasks to overcome catastrophic forgetting, and lateral connections help forward knowledge transfer. Side-tuning [58] proposes a simpler approach by training a small task-specific network and fusing the output to the base network. DEN [52] not only focuses on network expansion but also optimizes sub-problems like selective training, dynamic model expansion using loss threshold and duplication. RCL [49] uses reinforcement learning to determine the growth of the architecture. Recent advancements leverage Bayesian non-parametric models where the data itself determines the expansion [18, 20, 27], but these methods work well only for small datasets and few task sequences like MNIST and CIFAR-10. EFT [44] partitions a model into global and task-specific local parameters and leverages efficient convolution operations to construct these local transforms. The global network can be any architecture and the method outperforms most baselines on diverse task sequences in both TIL and CIL settings.

Masking [24, 25, 36, 47, 26] is another expansion-based strategy that learns different binary/ternary masks per task. Iterative training and pruning strategies [24, 12] have also been proposed for expansion-based continual learning. These approaches are costly to train, since pruning is an expensive step. PackNet [25] requires saving masks to recover networks of previous models, which can take considerable storage space as the number of tasks grows. HAT [36] proposes hard attention masks for each task. TFM [26] applies ternary masks to feature maps, which results in less memory per mask, as the feature maps are often smaller than the number of weights in the model. TFM uses all previous tasks to learn the mask of the current task; thus, it has similar motivations to our method. APD [51] decomposes the network parameters into task-shared and sparse task-specific parameters; however, the significant changes made to the architecture make it harder to scale and prevent the use of pre-trained weights. SupSup [47] finds a supermask for each task and uses gradient-based optimization to speed up inference; also, the supermasks can be stored in a fixed-size Hopfield network [11]. Masking approaches are promising, but they are mostly limited to the simpler TIL setting. Choosing a suitable mask requires predicting the task id, which is challenging. Our proposed approach shows promising results in both scenarios without relying on replay samples.

3 Proposed Model

In this section, we provide a detailed description of our proposed model. We use three key components to overcome catastrophic forgetting.

3.1 Notations

In incremental learning, tasks $\mathcal{T}_{1}, \mathcal{T}_{2}, \dots, \mathcal{T}_{T}$ arrive sequentially. Each task $\mathcal{T}_{i} = \{(\mathbf{x}_{k}, y_{k})\}_{k=1}^{N_{i}}$, where $\mathbf{x}_{k}$ has label $y_{k} \in \mathcal{Y}_{i}$ and $\mathcal{Y}_{i}$ is the set of all classes in task $\mathcal{T}_{i}$ ($|\mathcal{Y}_{i}| = K_{i}$). During training on the $i^{th}$ task, only $\mathcal{T}_{i}$ is available to train the model, and the class sets of different tasks are disjoint: $\mathcal{Y}_{i} \cap \mathcal{Y}_{j} = \phi$ for $i \neq j$. Our proposed model can be applied to the TIL as well as the CIL setting. In the TIL setting, the task id $i$ is known during both training and inference of the task $\mathcal{T}_{i}$. In the CIL setting, however, the task id $i$ is unknown during inference.

3.2 Efficient Dynamic Expansion (ablated in Sec. 5.3)

Figure 1: Expansion of the $j^{th}$ layer during task $i$ after training on the $(i-1)^{th}$ task.

In this work, we propose an efficient, dynamically expandable model which grows with the number of tasks and accumulates all the previously learned knowledge. Let $\mathcal{M}_{\mathbf{\theta}_{i}}$ be the model for task $\mathcal{T}_{i}$ where $\mathbf{\theta}_{i}$ is the model parameter. The model parameter $\mathbf{\theta}_{i} = \{\Omega, \tau_{i}, \tau_{2:i-1}\}$, i.e., it contains three types of parameters: the global parameter ($\Omega$), the task-specific parameter ($\tau_{i}$) and all the learned task-specific parameters before the current task ($\tau_{2:i-1}$). Therefore, the parameter $\mathbf{\theta}$ grows with each novel task and $|\mathbf{\theta}_{1}| < |\mathbf{\theta}_{2}| < \dots < |\mathbf{\theta}_{T}|$, where $|\cdot|$ represents cardinality.

Now we describe our expansion method for a particular task $\mathcal{T}_{i}$. For task $\mathcal{T}_{i-1}$, we can represent the $j^{th}$ convolutional layer of the task model $\mathcal{M}_{\mathbf{\theta}_{i-1}}$ as $l_{i-1}^{j} \in \mathbb{R}^{c_{j} \times k \times k \times d_{j}}$, where $c_{j}$ is the number of filters, $k$ is the kernel size and $d_{j}$ is the number of feature maps of layer $(j-1)$. After training on the $(i-1)^{th}$ task, we freeze its parameter $\mathbf{\theta}_{i-1}$. Let us assume that $g_{j-1}$, $g_{j}$ and $g_{j+1}$ are the growth rates for the layers $(j-1)$, $j$ and $(j+1)$ respectively for the next task $\mathcal{T}_{i}$. Before training the model for task $\mathcal{T}_{i}$, we grow the model by the defined growth rates and this growth is local to the $i^{th}$ task (hence the name task-specific parameter). The $i^{th}$ task model adds $g_{j-1}$, $g_{j}$ and $g_{j+1}$ filters at the layers $(j-1)$, $j$ and $(j+1)$ respectively. Therefore, for the $i^{th}$ task at the $j^{th}$ layer, we have $c_{j} + g_{j}$ filters, but layer $(j-1)$ also produces $c_{j-1} + g_{j-1}$ feature maps. To accommodate these feature maps, we have to increase the number of channels in each filter of layer $j$. So $l_{i}^{j} \in \mathbb{R}^{(c_{j} + g_{j}) \times k \times k \times d'_{j}}$, where $d'_{j} = c_{j-1} + g_{j-1}$ is the number of feature maps after adding the task-specific filters to layer $(j-1)$. Similarly, the next layer $l_{i}^{j+1} \in \mathbb{R}^{(c_{j+1} + g_{j+1}) \times k \times k \times d'_{j+1}}$, where $d'_{j+1} = c_{j} + g_{j}$. Figure 1 shows our expansion strategy at a particular layer $j$ during training of task $\mathcal{T}_{i}$. For each task, we train the batch norm and final linear layers from scratch. So the global parameter ($\Omega$) is the parameter of the first task model without the batch norm and final linear layers.
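The shape bookkeeping above can be made concrete with a small sketch. It only tracks tensor shapes in the paper's (filters, $k$, $k$, channels) ordering and does not train anything. The function names `expand_conv_shapes` and `num_weights`, the toy filter counts and the assumption that the network input channels stay fixed are all illustrative choices, not part of the original method description.

```python
from typing import List, Sequence, Tuple

# Shape convention follows the text: (filters, k, k, input feature maps).
Shape = Tuple[int, int, int, int]


def expand_conv_shapes(filters: Sequence[int], growth: Sequence[int],
                       kernel_size: int, in_channels: int) -> List[Shape]:
    """Return the weight shape of every conv layer after adding growth[j] filters to layer j."""
    if len(filters) != len(growth):
        raise ValueError("filters and growth must have one entry per layer")
    shapes: List[Shape] = []
    prev_out = in_channels  # assumption: the network input itself is not expanded
    for c_j, g_j in zip(filters, growth):
        out_j = c_j + g_j  # c_j + g_j filters at layer j
        # each filter needs d'_j = c_{j-1} + g_{j-1} channels
        shapes.append((out_j, kernel_size, kernel_size, prev_out))
        prev_out = out_j
    return shapes


def num_weights(shapes: Sequence[Shape]) -> int:
    """Total number of convolution weights for the given layer shapes."""
    return sum(c * k1 * k2 * d for c, k1, k2, d in shapes)


# Toy three-layer network with 3x3 kernels and RGB input (hypothetical sizes).
base = expand_conv_shapes([16, 32, 64], [0, 0, 0], kernel_size=3, in_channels=3)
grown = expand_conv_shapes([16, 32, 64], [2, 4, 8], kernel_size=3, in_channels=3)
added = num_weights(grown) - num_weights(base)
```

In this toy setting the second layer grows from shape (32, 3, 3, 16) to (36, 3, 3, 18): four extra filters, and two extra channels in every filter to absorb the new feature maps of the first layer. The value `added` counts every newly created weight entry; treating all of them as task-specific is an assumption of this sketch.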

3.3 Gradient aggregation for task prediction (ablated in Sec. 5.4)

Figure 2: Overview of the gradient aggregation method that leverages weighted cross-entropy using pseudo label.

The proposed expansion-based model shows promising results in the TIL setting. However, in the CIL setting, we do not know the task id during inference; hence we have to predict it. In this work, we propose a robust method for task prediction that can predict the task id for a large number of task sequences.

Let $\mathcal{M}_{\mathbf{\theta}_{1}}, \mathcal{M}_{\mathbf{\theta}_{2}}, \dots, \mathcal{M}_{\mathbf{\theta}_{T}}$ be the learned models for the tasks $\mathcal{T}_{1}, \mathcal{T}_{2}, \dots, \mathcal{T}_{T}$ with parameters $\mathbf{\theta}_{1}, \mathbf{\theta}_{2}, \dots, \mathbf{\theta}_{T}$. Once we have finished training for the task $\mathcal{T}_{i}$, we can predict the task id of a sample from any class seen up to the $i^{th}$ task. The gradient embedding of test sample $\mathbf{x}_{k}$, for the model with parameter $\mathbf{\theta}_{i}$ over the loss function $\mathcal{L}(\mathbf{x}_{k}, \mathbf{\theta}_{i})$, can be computed as:

$$\mathbf{\theta}'_{ik} = \nabla_{\mathbf{\theta}_{i}} \mathcal{L}(\mathbf{x}_{k}, \mathbf{\theta}_{i}) \tag{1}$$

To predict the task id of sample $\mathbf{x}_{k}$, we measure the gradient norm using each model as follows:

$$\hat{i} = \operatorname*{arg\,min}_{i \in T} \frac{\|\mathbf{\theta}'_{ik}\|}{|\mathbf{\theta}'_{ik}|} \tag{2}$$

Here $|\mathbf{\theta}'_{ik}|$ is the number of parameters in the gradient vector. We need to normalize the embedding, as each task has a different gradient length and the norm of a longer embedding can be higher. Once we have task id $\hat{i}$ for the sample $\mathbf{x}_{k}$, we can choose the model $\mathcal{M}_{\mathbf{\theta}_{\hat{i}}}$ to evaluate the sample.
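As a minimal sketch of the selection rule in Eq. 2, the snippet below takes one already-computed gradient vector per task model and returns the task whose length-normalized norm is smallest. The helper names, the use of the $\ell_1$ norm and the toy gradient values are assumptions made for illustration; computing the gradients themselves (Eq. 1) is left out.

```python
from typing import Dict, Sequence


def normalized_l1_norm(grad: Sequence[float]) -> float:
    """l1 norm of a flattened gradient divided by its number of entries."""
    if len(grad) == 0:
        raise ValueError("gradient vector must not be empty")
    return sum(abs(g) for g in grad) / len(grad)


def predict_task_id(grads_per_task: Dict[int, Sequence[float]]) -> int:
    """Return the task id whose model yields the smallest normalized gradient norm."""
    return min(grads_per_task, key=lambda i: normalized_l1_norm(grads_per_task[i]))


# Hypothetical gradients of one test sample under three task models.
# The vectors have different lengths because later models have more parameters.
toy_grads = {
    1: [0.4, -0.2, 0.6],
    2: [0.05, -0.05, 0.1, 0.0],
    3: [0.3, 0.3, 0.3, 0.3, 0.3],
}
predicted = predict_task_id(toy_grads)
```

Dividing by the vector length matters here: without it, the five-entry gradient of task 3 would be penalized simply for belonging to a larger model.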

Minimizing the norm of the loss gradient leads to flatter minima in weight space [59]. A flat minimum is a large connected region in weight space where the loss remains approximately constant, and such minima are favorable for generalization [10, 59]. In Eq. 2, we test the shape of the loss function $\mathcal{L}$ using $\mathbf{x}_{k}$ to estimate which model might perform better on that sample. We use the $\ell_{1}$-norm as it gives marginally better results than the $\ell_{2}$-norm. For a fair comparison with earlier baselines, we do not use the gradient norm loss during training time.

For a convolutional network with $L$ layers, we define the (flattened) gradient vector for layer $j$ as $\mathbf{\theta}_{ik}^{\prime j}$ ($j = 1 \dots L$). An advantage of considering layer-wise gradients is that it allows us to use task- and layer-dependent gradient modification functions. Such functions can help us give more attention to certain parameters in a layer. Thus we can re-write the gradient embedding for task $\mathcal{T}_{i}$ and input $\mathbf{x}_{k}$ as the vector:

$$\mathbf{\theta}'_{ik} = \mathcal{CONCAT}\left(\mathcal{G}_{i}^{1}(\mathbf{\theta}_{ik}^{\prime 1}), \mathcal{G}_{i}^{2}(\mathbf{\theta}_{ik}^{\prime 2}), \dots, \mathcal{G}_{i}^{L}(\mathbf{\theta}_{ik}^{\prime L})\right) \tag{3}$$

where $\mathcal{G}_{i}^{j}$ is the gradient modification function for task $i$ and layer $j$, and $\mathcal{CONCAT}$ concatenates multiple gradient vectors to create the final embedding vector. As our inference time grows linearly with the number of tasks, we can borrow the superposition idea of [47] to get sub-linear run times. However, in this paper, we do not focus on this direction. We present an overview of our entire method in Figure 2. Next we choose $\mathcal{L}(\mathbf{x}_{k}, \mathbf{\theta}_{i})$ and $\mathcal{G}_{i}^{j}$.
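The structure of Eq. 3 can be sketched as a function that applies one modification function per layer and concatenates the results. The two modifiers shown are placeholders: `identity` keeps the layer gradient unchanged, and `layer_mean` collapses it to its mean, a hypothetical choice loosely motivated by the introduction's remark that the model gradient is approximated by the mean gradient of each layer. All names and values are illustrative assumptions.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Modifier = Callable[[Sequence[float]], Vector]


def identity(grad: Sequence[float]) -> Vector:
    """Placeholder modification function: keep the layer gradient as is."""
    return list(grad)


def layer_mean(grad: Sequence[float]) -> Vector:
    """Hypothetical modification function: reduce a layer gradient to its mean."""
    return [sum(grad) / len(grad)]


def gradient_embedding(layer_grads: Sequence[Sequence[float]],
                       modifiers: Sequence[Modifier]) -> Vector:
    """Apply one modifier per layer and concatenate the outputs, as in Eq. 3."""
    if len(layer_grads) != len(modifiers):
        raise ValueError("need exactly one modifier per layer")
    embedding: Vector = []
    for grad, modify in zip(layer_grads, modifiers):
        embedding.extend(modify(grad))
    return embedding


# Toy flattened gradients for a three-layer network.
layer_grads = [[1.0, 3.0], [2.0, -2.0, 6.0], [0.5]]
full = gradient_embedding(layer_grads, [identity] * 3)
compact = gradient_embedding(layer_grads, [layer_mean] * 3)
```

With per-layer means, the embedding has one entry per layer regardless of how many filters a task added, which keeps embeddings of different task models comparable in length.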

3.3.1 Choosing loss function

The most common loss function for training deep networks is cross-entropy. However, task prediction using a single sample 𝐱ksubscript𝐱𝑘\mathbf{x}_{k}bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT may not be robust. So our approach leverages simple test-time data augmentations to improve task prediction performance. From input 𝐱ksubscript𝐱𝑘\mathbf{x}_{k}bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, we create a batch of A𝐴Aitalic_A augmented samples 𝒳k=[𝐱k1,𝐱k2,𝐱kA]subscript𝒳𝑘superscriptsubscript𝐱𝑘1superscriptsubscript𝐱𝑘2superscriptsubscript𝐱𝑘𝐴\mathcal{X}_{k}=[\mathbf{x}_{k}^{1},\mathbf{x}_{k}^{2}\dots,\mathbf{x}_{k}^{A}]caligraphic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = [ bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT … , bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ]. For each augmented sample in 𝒳ksubscript𝒳𝑘\mathcal{X}_{k}caligraphic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, we consider the class with the maximum probability as it’s pseudo label and add it to the array 𝒴k^^subscript𝒴𝑘\hat{\mathcal{Y}_{k}}over^ start_ARG caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG. Subsequently, we take the maximally occurring pseudo label in 𝒴k^^subscript𝒴𝑘\hat{\mathcal{Y}_{k}}over^ start_ARG caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG as the batch pseudo label y^ksubscript^𝑦𝑘\hat{y}_{k}over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. Mathematically, this can be expressed as:

$$\hat{\mathcal{Y}_k}=[\operatorname*{arg\,max}_{K_i}\mathbf{\theta}_i(\mathbf{x}_k^1),\dots,\operatorname*{arg\,max}_{K_i}\mathbf{\theta}_i(\mathbf{x}_k^A)] \tag{4}$$
$$\hat{y}_k=\mathcal{MAXCOUNT}(\hat{\mathcal{Y}_k})$$

where $\mathcal{MAXCOUNT}$ takes an array as input and returns the most frequently occurring element in the array. $\mathbf{\theta}_i(\mathbf{x}_k^1)$ is the output of network $\mathbf{\theta}_i$ for image $\mathbf{x}_k^1$.

The pseudo label $\hat{y}_k$ is used to calculate the cross-entropy loss over the batch $\mathcal{X}_k$ using the model $\mathbf{\theta}_i$. This loss measures the robustness of the model with respect to sample perturbations. However, this augmentation strategy can create samples of different complexities, i.e., some samples can be more confusing to the model than others. So, instead of giving equal weights to every augmented sample, we calculate the batch cross-entropy loss in a weighted manner. We leverage the entropy measure to estimate the uncertainty of each augmented sample. For augmented sample $\mathbf{x}_k^a$ and model $\mathbf{\theta}_i$, we use $\mathcal{CE}(\mathbf{\theta}_i(\mathbf{x}_k^a),\hat{y}_k)$ and $\mathcal{ENT}(\mathbf{\theta}_i(\mathbf{x}_k^a))$ to denote the sample cross-entropy loss and sample entropy, respectively. So the final loss function can be expressed as:

$$\mathcal{L}(\mathbf{x}_k,\mathbf{\theta}_i)=\frac{1}{A}\sum_{a\in A}\mathcal{CE}(\mathbf{\theta}_i(\mathbf{x}_k^a),\hat{y}_k)\times\mathcal{ENT}(\mathbf{\theta}_i(\mathbf{x}_k^a)) \tag{5}$$

We use this loss to calculate the gradient in Eq. 1. It should be noted that calculating the gradient of Eq. 5 for an augmentation $\mathbf{x}_k^a$ is mathematically equivalent to weighting the gradients of the KL-divergence and entropy terms in the corresponding cross-entropy loss $\mathcal{CE}(\mathbf{\theta}_i(\mathbf{x}_k^a),\hat{y}_k)$.
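The pseudo-labelling step (Eq. 4) and the entropy-weighted loss (Eq. 5) can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the network outputs are assumed to be logits that are turned into probabilities with a softmax, natural logarithms are assumed, and ties in the majority vote are broken by first occurrence. The sketch computes the loss value only; obtaining the gradient used in Eq. 1 would additionally require an automatic differentiation framework.

```python
import math
from collections import Counter


def softmax(logits):
    """Convert raw scores to probabilities (assumed output transformation)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def batch_pseudo_label(batch_probs):
    """Eq. 4: argmax per augmentation, then the most frequent label."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in batch_probs]
    # most_common breaks ties by first occurrence (an assumption of this sketch)
    return Counter(labels).most_common(1)[0][0]


def entropy_weighted_loss(batch_logits):
    """Eq. 5: mean over augmentations of cross-entropy times entropy."""
    batch_probs = [softmax(z) for z in batch_logits]
    y_hat = batch_pseudo_label(batch_probs)
    terms = [
        -math.log(max(p[y_hat], 1e-12)) * entropy(p)  # CE * ENT per sample
        for p in batch_probs
    ]
    return y_hat, sum(terms) / len(terms)


# Three augmentations of one input, three classes (made-up logits).
label, loss = entropy_weighted_loss(
    [[2.0, 0.0, 0.0], [1.5, 0.2, 0.1], [0.0, 3.0, 0.0]]
)
```

In this toy batch two of the three augmentations vote for class 0, so class 0 becomes the batch pseudo label even though the third augmentation disagrees.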

3.3.2 Choosing gradient modification function

The cardinality of the gradient in Eq. 1 is equal to the number of model parameters. Therefore, any operation on this vector is very costly. Moreover, noise in gradient directions can decrease the model’s robustness. Hence, we only use the mean gradient for each convolutional filter. For fully connected layers, we use the mean gradient for each output neuron. In a standard convolutional network, the initial layers capture generic information about the input image while the later layers contain task-specific information. Hence, we use the gradients of the last two convolutional layers and the fully connected layer. We call this strategy mean filters; it reduces the gradient cardinality by around $99.98\%$ and even helps improve task prediction.
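A minimal sketch of the mean filters reduction is given below, assuming gradients are available as nested Python lists. The layout is an assumption of this sketch: a convolutional gradient is indexed as [filter][input channel][row][column] and a fully connected gradient as [output neuron][input]. The helper names are hypothetical.

```python
def flatten(values):
    """Yield every scalar in an arbitrarily nested list."""
    for v in values:
        if isinstance(v, list):
            yield from flatten(v)
        else:
            yield v


def mean_filters(grad):
    """Keep one mean gradient value per filter (or per output neuron)."""
    reduced = []
    for unit in grad:  # first axis: filters or output neurons
        scalars = list(flatten(unit)) if isinstance(unit, list) else [unit]
        reduced.append(sum(scalars) / len(scalars))
    return reduced


# Toy convolutional gradient: 2 filters, 1 input channel, 2x2 kernels.
conv_grad = [
    [[[1.0, 2.0], [3.0, 4.0]]],
    [[[0.0, 0.0], [0.0, 8.0]]],
]
# Toy fully connected gradient: 3 output neurons, 2 inputs each.
fc_grad = [[1.0, 3.0], [2.0, 2.0], [0.0, -2.0]]

# Reduced gradient vector: one entry per filter plus one per output neuron.
reduced = mean_filters(conv_grad) + mean_filters(fc_grad)
```

Here 14 gradient entries are reduced to 5; for real layers with many weights per filter the reduction is far larger.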

3.4 Static and Adaptive Growth Rate (ablated in Sec. 5.1)

Most expansion-based models [40, 44] treat the growth rate as a hyperparameter that remains constant across all tasks, irrespective of the task complexity. Here, we propose an adaptive growth rate for the expansion parameter that depends on the task complexity. The key challenge is calculating the task complexity without access to the previous task samples. For the $i^{th}$ task, we leverage the current task samples $\mathcal{T}_i$, the $(i-1)^{th}$ task model and the mean gradient on the previous task samples $\mathcal{T}_{i-1}$ to construct a compatibility score. Subsequently, we use this score to decide the growth rate for the $i^{th}$ task.

Let’s assume that before training for task $\mathcal{T}_i$, we expand each layer $j$ of the $(i-1)^{th}$ task model by $g_j$ filters (i.e., growth rate is $g_j$). Using scaling factor $\alpha\in[0,1]$, we can write $g_j$ as:

$$g_j=\left[\alpha\times g_j^{min}+(1-\alpha)\times g_j^{max}\right] \tag{6}$$

where $g_j^{max}$ and $g_j^{min}$ are respectively the pre-defined maximum and minimum growth rates for layer $j$ and $\left[.\right]$ is the rounding function.
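Eq. 6 is a linear interpolation between the two growth bounds followed by rounding, which the following sketch illustrates. The function name is hypothetical, and rounding half up is an assumption, since the text only specifies "the rounding function".

```python
import math


def growth_rate(alpha, g_min, g_max):
    """Eq. 6: interpolate between g_min and g_max, then round."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    value = alpha * g_min + (1.0 - alpha) * g_max
    return math.floor(value + 0.5)  # round half up (assumed rounding rule)


# Made-up bounds for one layer: between 1 and 16 new filters.
full_growth = growth_rate(0.0, 1, 16)     # alpha = 0 recovers g_max
minimal_growth = growth_rate(1.0, 1, 16)  # alpha = 1 recovers g_min
midway_growth = growth_rate(0.5, 1, 16)   # 8.5 before rounding
```

A larger $\alpha$ therefore yields a smaller expansion for the layer.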

In the case of static growth, the growth rate for a layer is constant for all tasks and $\alpha=0$. However, in adaptive growth, we expand our model based on the relative complexity of the novel task $\mathcal{T}_i$ and use $\alpha$ to encode how similar $\mathcal{T}_i$ is to $\mathcal{T}_{i-1}$. Since our models are optimized using gradient-based methods, we can measure the uncertainty of a sample using gradients [2]. Given the $(i-1)^{th}$ task model, we can compare the model confidences in two samples $\mathbf{x}_1$ and $\mathbf{x}_2$ using $\mathbf{\theta}_{(i-1)1}^{\prime}$ and $\mathbf{\theta}_{(i-1)2}^{\prime}$ from Eq. 1. For tasks $\mathcal{T}_{i-1}$ and $\mathcal{T}_i$, we can write the scaling factor $\alpha$ as:

$$\alpha=|\mathcal{MEAN}(\{\mathbf{\theta}_{(i-1)k}^{\prime}\}_{k\in\mathcal{T}_i})\cdot\mathcal{MEAN}(\{\mathbf{\theta}_{(i-1)k}^{\prime}\}_{k\in\mathcal{T}_{i-1}})| \tag{7}$$

where $\mathcal{MEAN}$ gives us the normalized mean vector from a set of input vectors. We can use this $\alpha$ to decide the growth rate for the $i^{th}$ task using Eq. 6. So for each task, we only need to remember the mean gradient of the last task. We can reduce the gradient cardinality to around $0.02\%$ of its original size using the mean filters strategy. Based on the type of growth, we get two model variants: the static parameter growth (SPG) model and the adaptive parameter growth (APG) model. For a fair comparison between SPG and APG models, we set $g_j^{min}$ to 1 and $g_j^{max}$ to the growth rate used in SPG. The parameter growth rate is ablated in Sec. 5.2.
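The scaling factor of Eq. 7 can be sketched as the absolute dot product of two normalized mean gradient vectors. The sketch below assumes that "normalized" means L2 normalization, which the text does not state explicitly, and uses hypothetical function names; the per-sample gradient vectors are made-up numbers.

```python
import math


def normalized_mean(vectors):
    """MEAN in Eq. 7: element-wise mean, then L2 normalization (assumed)."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    norm = math.sqrt(sum(m * m for m in mean))
    return [m / norm for m in mean] if norm > 0.0 else mean


def scaling_factor(current_grads, previous_grads):
    """Eq. 7: |MEAN(current task grads) . MEAN(previous task grads)|."""
    u = normalized_mean(current_grads)
    v = normalized_mean(previous_grads)
    return abs(sum(a * b for a, b in zip(u, v)))


# Per-sample gradient vectors under the previous task model (toy values).
aligned = scaling_factor([[1.0, 0.0], [3.0, 0.0]], [[2.0, 0.0]])
orthogonal = scaling_factor([[0.0, 1.0]], [[1.0, 0.0]])
partial = scaling_factor([[1.0, 1.0]], [[1.0, 0.0]])
```

With unit-length mean vectors the result lies in [0, 1] by the Cauchy-Schwarz inequality, so it can be plugged directly into Eq. 6. Only the mean gradient of the previous task needs to be stored, since the normalized mean of a single stored vector is that vector rescaled to unit length.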

4 Experiments and Results

We conduct our experiments in TIL, CIL as well as generative continual learning settings. We also evaluate our method using different base architectures on diverse datasets. Our approach offers significant gains over the latest baselines in all the experiments.

4.1 Datasets and Architectures

Our experiments are performed on three datasets: CIFAR-100 [16], ImageNet-100 [17] and Tiny ImageNet [19]. We divide the CIFAR-100 dataset into 5, 10, and 20 task sequences (let’s call them CIFAR100/5, CIFAR100/10 and CIFAR100/20 respectively). CIFAR100/5 has a shorter task sequence, but each task contains 20 classes. On the other hand, CIFAR100/20 contains only 5 classes per task but has a large number of tasks. We use the ResNet-18 [9] architecture for all CIFAR-100 experiments. For the ImageNet-100 dataset, we use the same class subset and class order as DER [50]. Similar to CIFAR-100, we divide the ImageNet-100 dataset into 5, 10, and 20 tasks. The ImageNet-100 dataset is challenging as it contains relatively bigger images, i.e., the image size is $224\times 224$. We use the standard ResNet-18 architecture for experimenting on this dataset. Tiny ImageNet is a smaller subset of the ImageNet [17] dataset that contains 200 classes at a $64\times 64$ resolution. To demonstrate the efficacy of our proposed model across different architectures, we use the VGG-16 [39] architecture with batch norm for the Tiny ImageNet dataset. We divide the dataset into 10 tasks and evaluate the model performance in the TIL setting. Furthermore, we explore our continual learning approach using the StackGAN-v2 [56] architecture for three diverse tasks: cats (ImageNet [5]), birds (CUB-200 [46]) and churches (LSUN [53]).

4.2 Baselines

Our proposed method leverages network expansion to learn new tasks and does not use any pretrained model or sample storage for replay. Therefore, a comparison with replay-based methods would not be fair, and we compare our model with both regularization-based and expansion-based approaches. We use various regularization-based methods like LwF [22], EWC [15], SI [55], MAS [1], SDC [54], DMC [57], L2T [51], IL2A [61], LwF+adBiC [41], SSRE [62] and FeTrIL [30] as our baselines. We also use expansion-based approaches like PNN [35], HAT [36], PB [24], PackNet [25], TFM [26], APD [51] and EFT [44] for comparison. For the continual GAN experiment, we consider EWC [15], MeRGAN-RA [48] and EFT [44] as baselines. As we share our hyperparameters with EFT [44], we borrow most baseline results from that paper; the rest are obtained by either using the paper’s publicly available code or by running the PyCIL [60] framework for that model.

Figure 3: The comparison for the TIL scenario on TinyImageNet-200/10. The result for the $i^{th}$ task is reported after learning all the tasks.

4.3 Implementation Details

We provide the model architecture, hyperparameter settings and other experimental details for the CIFAR-100, ImageNet-100 and Tiny ImageNet datasets in the supplementary material. There, we also provide results for multiple seed values along with their standard deviations.

In the following section, we discuss the results in the TIL, CIL and generative continual learning settings. We use average incremental accuracy as the evaluation metric. For TIL, we compute it as the average accuracy over all tasks, including the initial one [44]. In case of CIL, we report the average accuracy for all seen classes [61, 44].

4.4 Results

4.4.1 Task Incremental Learning (TIL)

TIL is a relatively simpler setting where the task id is known during inference. Once we know the task id, we can select the model corresponding to that id for inference. We evaluate the TIL scenario for the Tiny ImageNet [19] dataset using the VGG-16 architecture. The Tiny ImageNet dataset contains 200 classes and we divide the dataset into 10 tasks each containing 20 classes. Our static parameter growth (SPG) model has an average parameter growth of $3.7\%$ and achieves an average incremental accuracy of $68.6\%$. In comparison, EFT [44] has an average incremental accuracy of $66.8\%$ with $3.6\%$ average parameter growth. On the other hand, our adaptive parameter growth (APG) model gives an average incremental accuracy of $68.5\%$ with a smaller average parameter growth of $3.4\%$. The task-wise accuracy for the SPG model is shown in Fig. 3.

4.4.2 Class Incremental Learning (CIL)

| Methods | CIFAR-100 (5 tasks) | CIFAR-100 (10 tasks) | CIFAR-100 (20 tasks) | ImageNet-100 (5 tasks) | ImageNet-100 (10 tasks) | ImageNet-100 (20 tasks) |
|---|---|---|---|---|---|---|
| LwF [22] | 34.7 | 23.9 | 14.2 | 55.1 | 37.7 | 22.0 |
| EWC [15] | 25.0 | 16.4 | 9.6 | - | - | - |
| SI [55] | 29.6 | 23.3 | 13.3 | - | - | - |
| MAS [1] | 24.4 | 15.4 | 10.6 | - | - | - |
| RWalk [4] | 31.6 | 17.9 | 11.0 | - | - | - |
| EWC+SDC [54] | - | 19.3 | - | - | - | - |
| DMC [57] | 46.6 | 36.2 | 23.9 | - | - | - |
| IL2A [61] | 50.2 | 29.7 | 14.7 | 40.6 | 21.7 | 9.7 |
| EFT [44] | 52.7 | 45.5 | 30.3 | 49.2 | 42.5 | 36.3 |
| LwF+adBiC [41] | 44.1 | 33.3 | 19.5 | - | - | - |
| SSRE [62] | 41.9 | 30.0 | 13.7 | 37.5 | 24.0 | 16.2 |
| FeTrIL [30] | 45.0 | 34.3 | 20.8 | 44.0 | 31.2 | 19.8 |
| Ours (SPG) [4.2%] | 59.4 | 50.6 | 35.6 | 61.7 | 48.6 | 38.3 |
| Ours (APG) [3.7%] | 59.2 | 50.5 | 36.4 | 61.8 | 49.8 | 38.5 |

Table 1: Average incremental accuracy till last task for CIFAR-100 and ImageNet-100 in the CIL setting. [X%] shows the average parameter growth of the model over all task sequences and datasets.

The TIL setting requires us to know the task id during inference. This setting is not practical as, in the real world, we may not know the actual source of a datapoint. Thus, in expansion-based methods, prediction of the task id is a key challenge. To demonstrate the efficacy of our proposed gradient aggregation method for task prediction, we evaluate our model for diverse task sequences using average incremental accuracy. In all scenarios, our proposed model significantly improves over the previous state-of-the-art approaches.

In Table 1, we show results for 5, 10 and 20 task sequences on the CIFAR-100 dataset. Using $4.3\%$ average parameter growth and the static parameter growth (SPG) model variant, we achieve an absolute gain of $5.1\%$ over our best baseline EFT [44] in the CIFAR100/10 setting. Similarly, using $4\%$ average parameter growth, we get an absolute gain of $6.7\%$ over EFT in the CIFAR100/5 setting. The CIFAR100/20 setting is the most challenging CIL scenario as the task prediction accuracy drops rapidly as the number of tasks increases. However, even in the CIFAR100/20 setting, we achieve an absolute gain of $5.3\%$ over EFT using $4.1\%$ average parameter growth. The adaptive parameter growth (APG) model variant shows similar average incremental accuracy, but has $12.2\%$ less average parameter growth. Table 1 also shows results for 5, 10 and 20 task sequences on the ImageNet-100 dataset. Occasionally APG outperforms SPG as training large networks on small tasks can lead to overfitting [26].
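The absolute gains quoted above are differences in percentage points and can be checked directly against the CIFAR-100 columns of Table 1. The short sketch below only reproduces that arithmetic; the variable and function names are illustrative.

```python
# Average incremental accuracy (%) on CIFAR-100, copied from Table 1,
# keyed by the number of tasks in the sequence.
eft = {5: 52.7, 10: 45.5, 20: 30.3}
spg = {5: 59.4, 10: 50.6, 20: 35.6}


def absolute_gain(ours, baseline):
    """Difference in percentage points, rounded to one decimal place."""
    return round(ours - baseline, 1)


gains = {tasks: absolute_gain(spg[tasks], eft[tasks]) for tasks in eft}
print(gains)
```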

Like [61, 44], we report the average accuracy for all seen classes. Hence, our results for recent exemplar-free class-incremental learning (EFCIL) methods like SSRE [62] and FeTrIL [30] differ from the results reported in those papers, as they measure average accuracy for all states.

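| Method | Cats, task 1 | Cats, task 2 | Cats, task 3 | Birds, task 1 | Birds, task 2 | Birds, task 3 | Churches, task 1 | Churches, task 2 | Churches, task 3 | Final Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Finetune | 29.0 | 156.9 | 189.6 | - | 21.2 | 174.5 | - | - | 11.4 | 125.2 |
| EWC [15] | 29.0 | 147.3 | 190.7 | - | 65.9 | 165.4 | - | - | 38.2 | 131.4 |
| MeRGAN-RA [48] | 29.0 | 56.4 | 58.2 | - | 50.9 | 53.7 | - | - | 23.2 | 45.1 |
| EFT [44] | 29.0 | 29.0 | 29.0 | - | 44.1 | 44.1 | - | - | 32.3 | 35.1 |
| Ours | 29.0 | 29.0 | 29.0 | - | 40.9 | 40.9 | - | - | 29.3 | 33.1 |

Table 2: Results for GAN (StackGAN-v2) when trained in a sequential manner. Columns give the FID on each dataset after training task $i$; the final average is the FID reported after training the final task.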

4.4.3 Generative Continual Learning

Generative Adversarial Networks (GANs) [8] also suffer from catastrophic forgetting if data arrives as a sequence of tasks. However, training a separate model for each task is costly. Hence, we apply our proposed static filter expansion approach to train the GAN architecture in a continual learning fashion. We choose the StackGAN-v2 [56] architecture for the generative learning experiment. Table 2 shows the results for the continual generative models.

4.4.4 Heterogeneous Task Sequence

| Dataset | L2T $\downarrow$ | L2T $\uparrow$ | PB $\downarrow$ | PB $\uparrow$ | PNN $\downarrow$ | PNN $\uparrow$ | APD $\downarrow$ | APD $\uparrow$ | EFT $\downarrow$ | EFT $\uparrow$ | Ours $\downarrow$ | Ours $\uparrow$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVHN | 10.7 | 88.4 | 96.8 | 96.4 | 96.8 | 96.2 | 96.8 | 96.8 | 96.8 | 95.5 | 96.6 | 95.6 |
| CIFAR-10 | 41.4 | 35.8 | 83.6 | 90.8 | 85.8 | 87.7 | 90.1 | 91.0 | 89.2 | 90.4 | 90.5 | 91.7 |
| CIFAR-100 | 29.6 | 12.2 | 41.2 | 67.2 | 41.6 | 67.2 | 61.1 | 67.2 | 64.6 | 71.5 | 64.5 | 71.8 |
| Average | 27.2 | 45.5 | 73.9 | 84.8 | 74.7 | 83.7 | 83.0 | 85.0 | 83.5 | 85.8 | 83.9 | 86.4 |

Table 3: Accuracy on the heterogeneous dataset sequence. $\downarrow$ follows the SVHN $\rightarrow$ CIFAR-10 $\rightarrow$ CIFAR-100 task order and $\uparrow$ represents the CIFAR-100 $\rightarrow$ CIFAR-10 $\rightarrow$ SVHN task order.

| Methods | 5 tasks | 10 tasks | 20 tasks |
|---|---|---|---|
| EFT | 52.7 (3.9%) | 45.5 (3.9%) | 30.3 (3.9%) |
| Ours (SPG) | 59.4 (4.0%) | 50.6 (4.3%) | 35.6 (4.1%) |
| Ours (APG) | 59.2 (3.2%) | 50.5 (3.8%) | 36.4 (3.7%) |

Table 4: Average incremental accuracy till last task with average parameter growth in brackets for CIFAR-100.

We evaluate our proposed static filter expansion approach on a heterogeneous task sequence. We select the SVHN [28], CIFAR-10 and CIFAR-100 datasets to construct a task sequence and perform our experiments in the increasing (SVHN $\rightarrow$ CIFAR10 $\rightarrow$ CIFAR100) as well as decreasing (CIFAR100 $\rightarrow$ CIFAR10 $\rightarrow$ SVHN) orders of task complexity. The task sequence is heterogeneous in two ways: first, the SVHN and CIFAR datasets are completely different in nature, and second, the number of classes changes (increases/decreases) between tasks, i.e., goes from $10\rightarrow 100$ or $100\rightarrow 10$. The results for the heterogeneous setting are shown in Table 3. We can observe that in both scenarios (i.e., forward and reverse sequence), our approach outperforms the recent baselines on average using just $4.7\%$ average parameter growth.

5 Ablations

5.1 Adaptive Growth Rate: Toy Experiment

The CIFAR-100 [16] dataset contains $20$ superclasses where each superclass has $5$ similar classes. For this toy experiment, we consider $4$ superclasses: $aquatic\ mammals$, $fish$, $vehicles\ 1$ and $vehicles\ 2$. Using these superclasses, we construct two task sequences: $ordered$ and $mixed$. Each task sequence has two tasks and each task has 10 distinct classes. Task $1$ of the $ordered$ sequence contains $(\{aquatic\ mammals\}\cup\{fish\})$ classes, while task $2$ has $(\{vehicles\ 1\}\cup\{vehicles\ 2\})$ classes. On the other hand, the $mixed$ sequence picks $2$-$3$ classes from each superclass and constructs two tasks, each with 10 distinct classes. So the $ordered$ task sequence has two very different tasks, while the $mixed$ task sequence has two somewhat similar tasks.
In this experiment, after training on the first task, we find the gradient similarity between the first and second tasks to decide the growth rate for the second task. For the $ordered$ and $mixed$ task sequences, we compute the gradient similarities $\alpha_{ordered}$ and $\alpha_{mixed}$ using Eq. 7. We observe $\alpha_{mixed}>\alpha_{ordered}$ and $(\alpha_{mixed}-\alpha_{ordered})=0.33$. This result confirms that our gradient measure can be used for finding similarities between tasks.

| Methods | Growth | Total Params | Accuracy |
|---|---|---|---|
| Baseline | - | 11.2M | - |
| EFT | 3.9% | 15.8M | 45.5 |
| SPG | 3.6% | 15.8M | 51.0 |
| SPG | 3.9% | 16.3M | 50.3 |
| SPG | 4.3% | 17.0M | 50.6 |
| APG | 3.1% | 15.2M | 49.8 |
| APG | 3.5% | 15.7M | 50.1 |
| APG | 3.8% | 16.2M | 50.5 |

Table 5: Average incremental accuracy with different average parameter growths for CIFAR-100/10. Total Params is the total number of parameters that were used for training all 10 tasks.
Figure 4: The effect of forward transfer; blue is for the model which leverages the previous task and yellow represents the model which uses the global parameter.

5.2 Parameter growth vs Performance

We define the parameter growth from task $\mathcal{T}_i$ to $\mathcal{T}_{i+1}$ as follows: $\frac{P_{i+1}-P_i+E_{i+1}}{P_i}$ where $P_i$ is the total number of parameters used during task $\mathcal{T}_i$ and $E_{i+1}$ is the number of parameters in task $\mathcal{T}_i$ which are trained from scratch during task $\mathcal{T}_{i+1}$ (let’s call them exclusive parameters). For example, batch norm and the final linear layer in the ResNet-18 architecture are trained from scratch for every new task in our proposed approach and are thus counted as exclusive parameters. We define average parameter growth for $T$ tasks as the mean of all $T$ parameter growths. Since methods like EFT [44] grow their networks from the first task, we define the parameter growth for the first task as the change of parameters from an equivalent standard network.
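The definition above can be illustrated with a short sketch. The function names and the parameter counts are hypothetical and chosen only to show the arithmetic; the first entry of the totals list stands for the equivalent standard network used as the reference for the first task.

```python
def parameter_growth(p_prev, p_next, exclusive_next):
    """(P_{i+1} - P_i + E_{i+1}) / P_i for one task transition."""
    return (p_next - p_prev + exclusive_next) / p_prev


def average_parameter_growth(totals, exclusives):
    """Mean growth over all transitions.

    totals[0] is the reference network size and totals[i] the total
    parameter count during task i; exclusives[i] is the number of
    parameters trained from scratch for transition i -> i + 1.
    """
    growths = [
        parameter_growth(totals[i], totals[i + 1], exclusives[i])
        for i in range(len(totals) - 1)
    ]
    return sum(growths) / len(growths)


# Made-up counts: a 1000-parameter reference network, two tasks, each
# adding 30 shared-structure parameters and 10 exclusive parameters.
first = parameter_growth(1000, 1030, 10)
average = average_parameter_growth([1000, 1030, 1060], [10, 10])
```

In this toy example the first transition has a growth of 4% and the second slightly less, because the same number of new parameters is divided by a larger network.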

In Table 4, we compare both our variants with our best baseline EFT [44] on 5, 10 and 20 splits of CIFAR-100. We show that our variants achieve significantly better results without any significant increase in average parameter growth. In Table 5, we show that the variants have almost no degradation in accuracy even when we significantly reduce average parameter growth. This is due to the forward transfer from previous tasks. We also show the total number of parameters that were used during training of all 10 tasks.

5.3 Forward Transfer

We have also performed experiments to demonstrate the forward transfer ability of our proposed filter expansion model. In this experiment, we have explored the scenario where our model grows over just the global or shared parameter. This type of growth is used in masking [51, 47] and EFT [44]. We have also shown the result for our current approach where the parameter grows over the previous tasks to accumulate all the earlier knowledge. The result for the CIFAR100/10 setting is shown in Fig. 4. Our approach shows consistently better results across all tasks.

5.4 Task Prediction

| Methods | Accuracy |
|---|---|
| Ensemble Class Prediction [13] | 37.5 |
| Entropy [44] | 55.3 |
| cross-entropy | 54.9 |
| $\nabla$(cross-entropy) + mean filters | 56.6 |
| $\nabla$(cross-entropy) + aug + mean filters | 59.0 |
| $\nabla$(cross-entropy) + entropy aug | 58.8 |
| $\nabla$(cross-entropy) + entropy aug + mean filters | 59.4 |

Table 6: Average incremental accuracy till last task for different task prediction methods on CIFAR100/5 split.

We have experimented with different variants of our proposed task prediction method on the same task models and present the results on the CIFAR100/5 split in Table 6. We start off with the ensemble class prediction approach of TCL [13] and show that it gives the lowest accuracy. Subsequently, we present the result for the vanilla entropy-based task prediction approach used in multiple previous works [47, 44]. Lastly, we explore different components of our gradient aggregation approach and show how they contribute to the final accuracy. Our proposed approach leverages the gradients of cross-entropy, entropy weighted augmentations and mean filters, leading to an absolute gain of $4.1\%$ over the vanilla entropy baseline. It should be noted that entropy weighted augmentations significantly outperform vanilla augmentations for initial tasks. For example, entropy weighted augmentations lead to respective relative gains of $1.1\%$ and $1.3\%$ over vanilla augmentations for the initial $\lceil\frac{T}{2}\rceil$ tasks (minus the first task, which has the same accuracy for both baselines) on the CIFAR100/5 ($T$=5) and CIFAR100/10 ($T$=10) splits.

6 Conclusion

In this work, we propose a highly efficient expansion-based model that accumulates all the previously learned information. The expansion rate is dynamic and depends on the task complexity. The proposed expansion approach uses a simple and efficient in-layer filter and channel expansion which is generic and can be applied to any convolutional network. We also propose a replay-free, flat minima based task prediction strategy which can be used to predict the task id over long task sequences. The approach shows promising results for the TIL, CIL and generative continual learning settings. In expansion-based approaches, task prediction is the key limitation and it will be interesting to explore it further in the future.

References

  • [1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory Aware Synapses: Learning What (not) to Forget. European Conference on Computer Vision, 2018.
  • [2] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In ICLR, 2020.
  • [3] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–248, 2018.
  • [4] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence. European Conference on Computer Vision, 2018.
  • [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A Large-Scale Hierarchical Image Database. Computer Vision and Pattern Recognition, 2009.
  • [6] Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. Podnet: Pooled outputs distillation for small-tasks incremental learning. In ECCV, 2020.
  • [7] Qiang Gao, Zhipeng Luo, and Diego Klabjan. Efficient architecture search for continual learning. arXiv preprint arXiv:2006.04027, 2020.
  • [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. Neural information processing systems, 2014.
  • [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. Computer Vision and Pattern Recognition, 2016.
  • [10] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural computation, 9(1):1–42, 1997.
  • [11] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, pages 2554–2558, 1982.
  • [12] Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, Picking and Growing for Unforgetting Continual Learning. NeurIPS, 2019.
  • [13] Gyuhak Kim, Changnan Xiao, Tatsuya Konishi, Zixuan Ke, and Bing Liu. A theoretical study on solving continual learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [15] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences, 2017.
  • [16] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.
  • [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [18] Abhishek Kumar, Sunabha Chatterjee, and Piyush Rai. Nonparametric Bayesian Structure Adaptation for Continual Learning. arXiv preprint arXiv:1912.03624, 2019.
  • [19] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
  • [20] Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim. A Neural Dirichlet Process Mixture Model for Task-Free Continual Learning. ICLR, 2020.
  • [21] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in neural information processing systems, pages 4652–4662, 2017.
  • [22] Zhizhong Li and Derek Hoiem. Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [23] Xialei Liu, Chenshen Wu, Mikel Menta, Luis Herranz, Bogdan Raducanu, Andrew D Bagdanov, Shangling Jui, and Joost van de Weijer. Generative feature replay for class-incremental learning. In CVPR Workshops, pages 226–227, 2020.
  • [24] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–82, 2018.
  • [25] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
  • [26] Marc Masana, Tinne Tuytelaars, and Joost van de Weijer. Ternary feature masks: continual learning without any forgetting. CVPR-Workshop, 2021.
  • [27] Nikhil Mehta, Kevin J Liang, Vinay Kumar Verma, and Lawrence Carin. Continual Learning using a Bayesian Nonparametric Dictionary of Weight Factors. Artificial Intelligence and Statistics, 2021.
  • [28] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • [29] German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual Lifelong Learning with Neural Networks: A Review. Neural Networks, 2019.
  • [30] G. Petit, A. Popescu, H. Schindler, D. Picard, and B. Delezoide. Fetril: Feature translation for exemplar-free class-incremental learning. WACV, 2023.
  • [31] Jathushan Rajasegaran, Munawar Hayat, Salman H Khan, Fahad Shahbaz Khan, and Ling Shao. Random Path Selection for Continual Learning. Neural Information Processing Systems, 2019.
  • [32] Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Mubarak Shah. itaml: An incremental task-agnostic meta-learning approach. arXiv preprint arXiv:2003.11652, 2020.
  • [33] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental Classifier and Representation Learning. Computer Vision and Pattern Recognition, 2017.
  • [34] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience Replay for Continual Learning. Neural Information Processing Systems, 2019.
  • [35] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive Neural Networks. arXiv preprint arXiv:1606.04671, 2016.
  • [36] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548–4557, 2018.
  • [37] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual Learning with Deep Generative Replay. Neural Information Processing Systems, 2017.
  • [38] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
  • [39] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [40] Pravendra Singh, Vinay Kumar Verma, Pratik Mazumder, Lawrence Carin, and Piyush Rai. Calibrating cnns for lifelong learning. Advances in Neural Information Processing Systems, 33, 2020.
  • [41] Habib Slim, Eden Belouadah, Adrian Popescu, and Darian Onchis. Dataset knowledge transfer for class-incremental learning without memory. In WACV, 2022.
  • [42] Gido M van de Ven, Hava T Siegelmann, and Andreas S Tolias. Brain-inspired replay for continual learning with artificial neural networks. Nature communications, pages 1–14, 2020.
  • [43] Gido M van de Ven and Andreas S Tolias. Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635, 2018.
  • [44] Vinay Kumar Verma, Kevin J Liang, Nikhil Mehta, Piyush Rai, and Lawrence Carin. Efficient feature transformations for discriminative and generative continual learning. In CVPR, pages 13865–13875, 2021.
  • [45] Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F Grewe. Continual learning with hypernetworks. In International Conference on Learning Representations, 2020.
  • [46] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • [47] Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, and Ali Farhadi. Supermasks in superposition. Advances in Neural Information Processing Systems, 33, 2020.
  • [48] Chenshen Wu, Luis Herranz, Xialei Liu, Joost van de Weijer, Bogdan Raducanu, et al. Memory Replay GANs: Learning to Generate New Categories without forgetting. Neural Information Processing Systems, pages 5962–5972, 2018.
  • [49] Ju Xu and Zhanxing Zhu. Reinforced continual learning. In Advances in Neural Information Processing Systems, pages 899–908, 2018.
  • [50] Shipeng Yan, Jiangwei Xie, and Xuming He. Der: Dynamically expandable representation for class incremental learning. In CVPR, pages 3014–3023, 2021.
  • [51] Jaehong Yoon, Saehoon Kim, Eunho Yang, and Sung Ju Hwang. Scalable and Order-robust Continual Learning with Additive Parameter Decomposition. International Conference on Learning Representations, abs/1902.09432, 2020.
  • [52] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong Learning with Dynamically Expandable Networks. International Conference on Learning Representations, 2018.
  • [53] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop. arXiv preprint arXiv:1506.03365, 2015.
  • [54] Lu Yu, Bartlomiej Twardowski, Xialei Liu, Luis Herranz, Kai Wang, Yongmei Cheng, Shangling Jui, and Joost van de Weijer. Semantic drift compensation for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6982–6991, 2020.
  • [55] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual Learning Through Synaptic Intelligence. International Conference on Machine Learning, 2017.
  • [56] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [57] Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry Heck, Heming Zhang, and C-C Jay Kuo. Class-Incremental Learning via Deep Model Consolidation. Winter Conference on Applications of Computer Vision, 2020.
  • [58] Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning: Network adaptation via additive side networks. arXiv preprint arXiv:1912.13503, 2019.
  • [59] Yang Zhao, Hao Zhang, and Xiuyuan Hu. Penalizing gradient norm for efficiently improving generalization in deep learning. 2022.
  • [60] Da-Wei Zhou, Fu-Yun Wang, Han-Jia Ye, and De-Chuan Zhan. Pycil: A python toolbox for class-incremental learning, 2021.
  • [61] Fei Zhu, Zhen Cheng, Xu-Yao Zhang, and Cheng-lin Liu. Class-incremental learning via dual augmentation. NeurIPS, 34:14306–14318, 2021.
  • [62] Kai Zhu, Wei Zhai, Yang Cao, Jiebo Luo, and Zheng-Jun Zha. Self-sustaining representation expansion for non-exemplar class-incremental learning. In CVPR, pages 9286–9295, 2022.

Supplementary

In this supplementary material, we present additional ablation results and document our experimental settings.

Appendix A Additional Ablations

A.1 Influence of seeds

We use 5 different seeds to understand how the results of our proposed method change with different seed values. The seed value impacts our method in four different ways: network initialization, sample order, random augmentation and class order. The question of seed influence is more important in adaptive parameter growth (APG) than static parameter growth (SPG) as APG uses task complexity to grow the model. As is clear from Table 7, our reported results are below the seed mean and reasonably stable for different seed values on all three splits of CIFAR-100. In Table 7, we also provide the parameter growth of a method averaged over all seeds and task sequences. It is interesting to note that the average parameter growth of the APG model is remarkably stable for different seeds. Thus, we can conclude that our methods are robust under different experimental conditions.

| Method | CIFAR100/5 | CIFAR100/10 | CIFAR100/20 |
|---|---|---|---|
| 5 seeds (SPG) [4.1%] | 59.8 ± 0.5 | 50.8 ± 0.7 | 36.9 ± 0.7 |
| Reported (SPG) [4.1%] | 59.4 | 50.6 | 35.6 |
| 5 seeds (APG) [3.6%] | 59.3 ± 0.5 | 50.8 ± 0.5 | 37.4 ± 0.7 |
| Reported (APG) [3.6%] | 59.2 | 50.5 | 36.4 |

Table 7: Mean and standard deviation for different splits of CIFAR-100. [X%] shows the average parameter growth of the model over all task sequences.

A.2 Task Prediction

For completeness, we present the average task prediction accuracy on the CIFAR100/5 split in Table 8.

| Methods | Accuracy |
|---|---|
| Ensemble Class Prediction [13] | 39.8 |
| Entropy [44] | 57.7 |
| cross-entropy | 57.2 |
| $\nabla$(cross-entropy) + mean filters | 58.8 |
| $\nabla$(cross-entropy) + aug + mean filters | 61.3 |
| $\nabla$(cross-entropy) + entropy aug | 61.2 |
| $\nabla$(cross-entropy) + entropy aug + mean filters | 61.9 |

Table 8: Average task prediction accuracy till last task for different task prediction methods on CIFAR100/5 split.

A.3 Task-wise accuracy

We present the task-wise accuracy of the SPG model on different splits of CIFAR-100 in Fig. 5. Since this is the CIL scenario, the accuracy of a task $i$ refers to the average incremental accuracy till task $i$. To avoid clutter, we only present results of important baselines. It should be noted that different methods have different first task accuracies as they have different optimization hyperparameters and expansion/regularization strategies. For example, EFT [44] expands from the first task while IL2A [61] works better with the Adam optimizer [14]. Similarly, the task-wise accuracy of the SPG model on different splits of ImageNet-100 is shown in Fig. 6.

Figure 5: Task-wise CIL results of SPG model on 5, 10 and 20 splits of CIFAR-100.

Figure 6: Task-wise CIL results of SPG model on 5, 10 and 20 splits of ImageNet-100.

Appendix B Experimental Settings

In this section, we provide details about our hyperparameter settings and baselines.

B.1 CIFAR-100

B.1.1 Training hyperparameters

Since EFT [44] is our best performing baseline, we borrow the class order and hyperparameter settings (including seed) from their publicly available code. We train our model for 250 epochs with batch size of 128, initial learning rate of 0.01, learning rate drop of 0.1 at 100, 150 and 200 epochs, SGD optimizer with momentum of 0.9 and weight decay of 5e-3.
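The step schedule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: epochs are 0-indexed, the drop takes effect at the milestone epoch itself, and the helper name `step_lr` is hypothetical rather than taken from the authors' code.

```python
def step_lr(epoch, base_lr=0.01, milestones=(100, 150, 200), gamma=0.1):
    """Learning rate used during `epoch` (0-indexed) under step decay."""
    # Count the milestones already reached; each one multiplies the rate by gamma.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

# Full 250-epoch schedule for the CIFAR-100 setting described above.
schedule = [step_lr(e) for e in range(250)]
print(schedule[0], schedule[100], schedule[150], schedule[200])
```

The other datasets in this appendix use the same form of schedule with different values for `base_lr` and `milestones`.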

We use the ResNet-18 [9] architecture for the CIFAR datasets to evaluate our method. It should be noted that we train batch norm and linear layers from scratch for each task.

B.1.2 Expansion hyperparameters

We follow the same expansion hyperparameters for every task sequence. Let $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ be the number of filters used for creating the four residual blocks (using the _make_layer() function in the standard PyTorch implementation of ResNet-18; see github.com/pytorch/vision/blob/main/torchvision/models/resnet.py). For each task, we increase the filters as follows:

$\alpha_1 = \alpha_1 + 1,\ \alpha_2 = \alpha_2 + 5,\ \alpha_3 = \alpha_3 + 10,\ \alpha_4 = \alpha_4 + 10$ (8)

For the first task, $\alpha_1 = 64$, $\alpha_2 = 128$, $\alpha_3 = 256$ and $\alpha_4 = 512$, which is the standard ResNet-18 filter distribution. The criterion for selecting this hyperparameter is that we wanted to have an average parameter growth of around 4% like EFT [44].

For adaptive parameter growth, we define the minimum filter growth as:

$\alpha_1 = \alpha_1 + 1,\ \alpha_2 = \alpha_2 + 1,\ \alpha_3 = \alpha_3 + 1,\ \alpha_4 = \alpha_4 + 1$ (9)

The maximum filter growth is defined using Eq. 8.
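A minimal sketch of how the two growth rules could be applied to the ResNet-18 block widths is given below. How APG chooses a growth value from the task complexity is defined in the main paper and is not reproduced here, so `requested` is a purely hypothetical input; the function names are likewise illustrative.

```python
BASE_WIDTHS = (64, 128, 256, 512)   # standard ResNet-18 block widths (first task)
MAX_GROWTH = (1, 5, 10, 10)         # per-task growth from Eq. 8
MIN_GROWTH = (1, 1, 1, 1)           # per-task growth from Eq. 9

def static_widths(task, growth=MAX_GROWTH, base=BASE_WIDTHS):
    """Block widths when training task `task` (1-indexed) under static growth."""
    return tuple(b + (task - 1) * g for b, g in zip(base, growth))

def clamp_growth(requested, lo=MIN_GROWTH, hi=MAX_GROWTH):
    """Clamp a hypothetical per-block growth request to the range [Eq. 9, Eq. 8]."""
    return tuple(max(l, min(h, r)) for r, l, h in zip(requested, lo, hi))

print(static_widths(1))             # (64, 128, 256, 512)
print(static_widths(5))             # (68, 148, 296, 552)
print(clamp_growth((0, 3, 25, 7)))  # (1, 3, 10, 7)
```

Clamping is one simple way to honour both bounds; the actual adaptive rule may select values inside this range differently.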

B.1.3 Augmentations

For class incremental learning (CIL), we apply 10 instances of a random data augmentation scheme, along with the standard unaugmented test sample, to create the batch $\mathcal{X}_k$ (i.e., $A = 11$). It should be noted that we use the same random data augmentation scheme for task prediction that we use for training the network. Our data augmentation scheme is the same as that of EFT [44], i.e., from the PyTorch library, we use:

  1. RandomCrop(32, padding=4)
  2. RandomHorizontalFlip()
  3. RandomRotation(10)
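The construction of the batch of $A = 11$ views for one test sample can be sketched as follows. To keep the example dependency-free, a toy horizontal flip on a nested list stands in for the PyTorch transforms listed above; `random_hflip`, `build_view_batch` and the choice to place the clean sample first are assumptions made for illustration only.

```python
import random

def random_hflip(image, p=0.5, rng=random):
    """Stand-in augmentation: flip a row-major image left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

def build_view_batch(image, augment, num_aug=10):
    """Return the clean sample followed by `num_aug` augmented views (A = num_aug + 1)."""
    return [image] + [augment(image) for _ in range(num_aug)]

sample = [[0, 1, 2], [3, 4, 5]]   # toy 2x3 "image"
batch = build_view_batch(sample, random_hflip)
print(len(batch))                 # 11
```

In practice `augment` would be the composed training-time transform, so that task prediction sees the same augmentation distribution as training.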

B.1.4 Baselines

We borrow most of the baseline results from the EFT paper. We also run the publicly available code of IL2A [61] using our class order, split (5/10/20) and seed setting. Results for SSRE [62] and FeTrIL [30] are obtained by running the PyCIL [60] framework using our class order, split and seed settings. It should be noted that in the main paper, we define average incremental accuracy as the average accuracy for all seen classes.

B.2 Tiny ImageNet

B.2.1 Training hyperparameters

Like CIFAR-100, we borrow the class order, hyperparameter settings (including seed) and baselines from the EFT [44] paper. We train our model for 140 epochs with batch size of 128, initial learning rate of 0.01, learning rate drop of 0.1 at 70, 100 and 120 epochs, SGD optimizer with momentum of 0.9 and weight decay of 5e-4. To evaluate our method, we use the VGG-16 [39] architecture with batch norm for the Tiny ImageNet dataset.

B.2.2 Expansion hyperparameters

If $\alpha_{1,j}$ is the original number of filters for layer $j$ in VGG-16 and $\alpha_{i,j}$ is its value before task $i+1$, then for task $i+1$, we increase the filters as follows:

$\alpha_{i+1,j} = \alpha_{i,j} + 1 \quad \text{if } \alpha_{1,j} = 64 \text{ or } 128$
$\alpha_{i+1,j} = \alpha_{i,j} + 8 \quad \text{if } \alpha_{1,j} = 256 \text{ or } 512$ (10)

For adaptive parameter growth, we define the minimum filter growth as:

$\alpha_{i+1,j} = \alpha_{i,j} + 1$

We define the maximum filter growth using Eq. 10.
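The width-dependent rule of Eq. 10 can be sketched as below. The list of convolutional widths is that of the standard VGG-16 configuration; the function names and the `adaptive_minimum` flag are hypothetical and only illustrate the two bounds described above.

```python
# Convolutional layer widths of the standard VGG-16, in layer order.
VGG16_WIDTHS = (64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512)

def per_task_growth(original_width, adaptive_minimum=False):
    """Filters added to a layer per task, keyed on the layer's original width."""
    if adaptive_minimum:
        return 1                                      # minimum growth for APG
    return 1 if original_width in (64, 128) else 8    # Eq. 10

def widths_for_task(task, adaptive_minimum=False):
    """Per-layer widths when training task `task` (1-indexed)."""
    return tuple(w + (task - 1) * per_task_growth(w, adaptive_minimum)
                 for w in VGG16_WIDTHS)

print(widths_for_task(2))
# (65, 65, 129, 129, 264, 264, 264, 520, 520, 520, 520, 520, 520)
```

Keying the growth on the original width, rather than the current one, keeps the rule unchanged as layers grow past 64, 128, 256 or 512 filters.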

B.3 ImageNet-100

B.3.1 Training hyperparameters

We use the same class subset, class order and hyperparameter settings as DER [50]. We train our model for 120 epochs (unlike DER, we do not warm up) with batch size of 256, initial learning rate of 0.1, learning rate drop of 0.1 at 30, 60, 80 and 90 epochs, SGD optimizer with momentum of 0.9 and weight decay of 5e-4.

We use the ResNet-18 [9] architecture for the ImageNet dataset to evaluate our method. It should be noted that we train batch norm and linear layers from scratch for each task.

B.3.2 Expansion hyperparameters

We follow the same expansion hyperparameters as CIFAR-100, except for the ImageNet-100/20 split. If $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are the number of filters used for creating the four residual blocks (using the _make_layer() function in the standard PyTorch implementation of ResNet-18), then for each task in the ImageNet-100/20 split, we increase the filters as follows:

$\alpha_1 = \alpha_1 + 2,\ \alpha_2 = \alpha_2 + 10,\ \alpha_3 = \alpha_3 + 10,\ \alpha_4 = \alpha_4 + 10$ (11)

For the first task, $\alpha_1 = 64$, $\alpha_2 = 128$, $\alpha_3 = 256$ and $\alpha_4 = 512$, which is the standard ResNet-18 filter distribution. We use a larger growth than in Eq. 8 because the ImageNet-100/20 split is harder than the corresponding CIFAR-100/20 split. For adaptive parameter growth and the ImageNet-100/20 split, we define the minimum and maximum filter growths using Eq. 9 and Eq. 11 respectively.
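As a worked instance of Eq. 11, assume static growth with one expansion before each of tasks 2 through 20. The block widths when training the last task of the ImageNet-100/20 split would then be as follows. The superscript $(t)$ for the task index and the symbol $\Delta_k$ for the per-task growth of block $k$ are notation introduced here for illustration only.

```latex
\begin{align*}
\alpha_k^{(t)} &= \alpha_k^{(1)} + (t-1)\,\Delta_k,
  \qquad (\Delta_1, \Delta_2, \Delta_3, \Delta_4) = (2, 10, 10, 10) \\
\alpha_1^{(20)} &= 64 + 19 \cdot 2 = 102 \\
\alpha_2^{(20)} &= 128 + 19 \cdot 10 = 318 \\
\alpha_3^{(20)} &= 256 + 19 \cdot 10 = 446 \\
\alpha_4^{(20)} &= 512 + 19 \cdot 10 = 702
\end{align*}
```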

B.3.3 Augmentations

For class incremental learning (CIL), we apply 20 instances of a random data augmentation scheme, along with the standard unaugmented test sample, to create the batch $\mathcal{X}_k$ (i.e., $A = 21$). It should be noted that we use the same random data augmentation scheme for task prediction that we use for training the network. Our data augmentation scheme is the same as that of [6], i.e., from the PyTorch library, we use:

  1. RandomResizedCrop(224)
  2. RandomHorizontalFlip()
  3. ColorJitter(brightness=63 / 255)

B.3.4 Baselines

We run the baselines LwF [22], EFT [44] and IL2A [61] using their publicly available code. Results for SSRE [62] and FeTrIL [30] are obtained by running the PyCIL [60] framework. We use the same class subset, class order and seed for all our baseline experiments. It should be noted that in the main paper, we define average incremental accuracy as the average accuracy for all seen classes.

B.4 Generative (GAN) Continual Learning

We choose the StackGAN-v2 [56] architecture for the incremental GAN experiment. StackGAN-v2 contains four blocks in the generator and discriminator networks. In the generator network, there are 1024, 512, 256 and 128 filters from the first to the fourth block, and the final image construction layer contains 64 filters. We extend the last layer by 4 filters; hence the respective increases in filters are 64, 32, 16 and 8 from the first to the fourth block. During training of the $i^{th}$ task, all the previous task parameters are frozen; the model grows over the previous task parameters and not just over the global parameter. In our approach, we only grow the generator parameters, while the discriminator architecture is fixed for all the tasks and its parameters are trained on the current task without any constraint. For the above discussed filter growth, the generator achieves a growth rate of 11.5%. We also observe that further filter growth gives better results. Our selected task sequences (cats, birds and churches) are highly diverse. The cat images are generally indoor or outdoor animal images; however, the images of the next task (birds) have highly complex backgrounds and fine-grained details, so the adaptation from cats to birds is difficult. Our model shows significant gains on the birds dataset using only 11.5% extra parameters. The adaptation from the birds dataset to churches (birds to buildings) is also very difficult. Our proposed model adapts to this dataset and shows state-of-the-art results compared to the recent strong baselines.
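The per-block increases quoted above are consistent with growing every generator block by the same fraction as the last layer (4 / 64 = 1/16). The short sketch below checks this arithmetic; reading the growth as proportional is an interpretation, and the constant names are illustrative.

```python
GENERATOR_BLOCKS = (1024, 512, 256, 128)  # filters in generator blocks 1 to 4
LAST_LAYER_FILTERS = 64                   # final image construction layer
LAST_LAYER_GROWTH = 4                     # filters added to that layer

# Grow each block by the same fraction as the last layer (4 / 64 = 1/16).
block_growth = tuple(w * LAST_LAYER_GROWTH // LAST_LAYER_FILTERS
                     for w in GENERATOR_BLOCKS)
print(block_growth)  # (64, 32, 16, 8)
```

This only reproduces the filter counts; it does not attempt to recompute the 11.5% parameter growth, which depends on the full layer shapes.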

B.5 Heterogeneous Task Sequence

We borrow the baselines and hyperparameter settings (including seed) from the EFT [44] paper. To evaluate our method, we use the VGG-16 [39] architecture with batch norm.

SVHN → CIFAR10 → CIFAR100: If $\alpha_{i,j}$ is the number of filters for layer $j$ in VGG-16 before task $i+1$, then we increase the filters as follows:

$\alpha_{2,j} = \alpha_{1,j} + 10$
$\alpha_{3,j} = \alpha_{2,j} + 10 \quad \text{if } \alpha_{1,j} = 64 \text{ or } 128$
$\alpha_{3,j} = \alpha_{2,j} + 20 \quad \text{if } \alpha_{1,j} = 256 \text{ or } 512$

CIFAR100 → CIFAR10 → SVHN: If $\alpha_{i,j}$ is the number of filters for layer $j$ in VGG-16 before task $i+1$, then we increase the filters as follows:

$\alpha_{2,j} = \alpha_{1,j} + 10 \quad \text{if } \alpha_{1,j} = 64 \text{ or } 128$
$\alpha_{2,j} = \alpha_{1,j} + 20 \quad \text{if } \alpha_{1,j} = 256 \text{ or } 512$
$\alpha_{3,j} = \alpha_{2,j} + 10$
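The two schedules apply the same pair of expansions in opposite order, so they end with the same final layer widths. The sketch below verifies this with the standard VGG-16 widths; the dictionary keys, flags and helper names are hypothetical.

```python
# Convolutional layer widths of the standard VGG-16, in layer order.
VGG16_WIDTHS = (64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512)

def step(original_width, width_dependent):
    """+10 filters, or +20 on 256/512-wide layers when the step is width dependent."""
    return 20 if width_dependent and original_width in (256, 512) else 10

# For each ordering: is the first / second expansion the width-dependent one?
SCHEDULES = {
    "SVHN->CIFAR10->CIFAR100": (False, True),
    "CIFAR100->CIFAR10->SVHN": (True, False),
}

def final_widths(name):
    """Layer widths after both expansions of the named task sequence."""
    return tuple(w + sum(step(w, flag) for flag in SCHEDULES[name])
                 for w in VGG16_WIDTHS)

a = final_widths("SVHN->CIFAR10->CIFAR100")
b = final_widths("CIFAR100->CIFAR10->SVHN")
print(a == b, sorted(set(a)))  # True [84, 148, 286, 542]
```

Only the intermediate widths (those used for the second task) differ between the two orderings.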

B.6 Software

Experiments are run on a single V100 GPU using Linux, Python 3.6 and PyTorch 1.7.1.

B.7 Input Processing

The data transformation scheme used in our method is borrowed from EFT [44] for the CIFAR-100 and Tiny ImageNet datasets, while for ImageNet-100, we use the data transformation scheme used in [6]. The code for both of these methods is publicly available.