The impact of model size on catastrophic forgetting in Online Continual Learning

Eunhae Lee
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
Abstract

This study investigates the impact of model size on Online Continual Learning performance, with a focus on catastrophic forgetting. Employing ResNet architectures of varying sizes, the research examines how network depth and width affect model performance in class-incremental learning using the SplitCIFAR-10 dataset. Key findings reveal that larger models do not guarantee better Continual Learning performance; in fact, they often struggle more in adapting to new tasks, particularly in online settings. These results challenge the notion that larger models inherently mitigate catastrophic forgetting, highlighting the nuanced relationship between model size and Continual Learning efficacy. This study contributes to a deeper understanding of model scalability and its practical implications in Continual Learning scenarios.

1 Introduction

One of the biggest unsolved challenges in Continual Learning (CL) is preventing the forgetting of previously learned information when new information is acquired. Known as “catastrophic forgetting,” this phenomenon is particularly pertinent in scenarios where AI systems must adapt to new data without losing valuable insights from past experiences [21, 10, 13]. Numerous studies in recent years have investigated approaches to this problem, mostly by proposing new training strategies and measuring their impact on performance metrics such as accuracy and forgetting.

Yet, compared to the large number of studies establishing new strategies and evaluation approaches in visual continual learning, there is surprisingly little discussion of the impact of model size on CL performance. The size of a deep learning model (its number of parameters) is known to play a crucial role in its learning capabilities [12, 2]. Given the limitations in computational resources in most real-world circumstances, it is often not practical or feasible to choose the largest model available. In addition, smaller models sometimes perform just as well as larger models in specific contexts [3]. Given this context, a better understanding of how model size impacts performance in a Continual Learning setting can provide insights and implications for real-world deployment of CL systems.

This research examines how network depth and width affect model performance in class-incremental learning in both online and offline Continual Learning settings, using ResNet architectures of varying depths and widths. The hypothesis is set forth based on existing literature on model size and performance and is tested through an empirical experiment using ResNet models trained from scratch. The study aims to shed light on whether larger models truly offer an advantage in mitigating catastrophic forgetting, or if the reality is more nuanced.

2 Related Work

2.1 Online Continual Learning

Continual Learning (CL), also known as Lifelong Learning or Incremental Learning, is an approach that seeks to continually learn from non-iid data streams without forgetting previously acquired knowledge. The challenge in Continual Learning is generally known as the stability-plasticity dilemma [22], and the goal of Continual Learning is to strike a balance between learning stability and plasticity.

Traditional CL models assume new data arrives task by task, each with a stable data distribution, enabling offline training. However, this requires having access to all task data, which can be impractical due to privacy or resource limitations. This study focuses on the more realistic setting of Online Continual Learning (OCL), where data arrives in smaller batches and is not accessible after training, requiring models to learn from a single pass over an online data stream [26, 5, 20]. This allows the model to learn from data in real time.

Online Continual Learning can involve adapting to new classes (class-incremental) or changing data characteristics (domain-incremental). For class-incremental learning, the goal is to continually expand the model’s ability to recognize an increasing number of classes, maintaining its performance on all classes it has seen so far, despite not having continued access to the old class data [26, 9]. More recently, there have been studies investigating Unsupervised Continual Learning [29, 19]. However, to narrow the scope of the vast CL landscape and isolate the impact of model size on CL performance, this study focuses on the more common problem of class-incremental learning in supervised image classification.

2.2 Continual Learning techniques

Popular strategies to mitigate catastrophic forgetting in Continual Learning generally fall into three buckets [9]:

  1. Regularization-based approaches. Regularization-based approaches, such as Elastic Weight Consolidation (EWC) [13] and Learning without Forgetting (LwF) [14], modify the classification objective to preserve past representations or foster more insightful representations.

  2. Memory-based approaches. Memory- or replay-based approaches, including Experience Replay (ER) [7] and Maximally Interfered Retrieval [1], replay samples retrieved from a memory buffer along with every incoming mini-batch, with variations in how the memory is retrieved and how the model and memory are updated.

  3. Architectural approaches. Architectural approaches include parameter-isolation approaches, such as Progressive Neural Networks (PNNs) [24], in which new parameters are added for new tasks while previous parameters are left unchanged.

Many methods also combine two or more of these approaches, such as Averaged Gradient Episodic Memory (A-GEM) [6] and Incremental Classifier and Representation Learning (iCaRL) [23].

Specifically, this study focuses on Experience Replay (ER), a classic replay-based method widely used for Online Continual Learning. Despite its simplicity, recent studies have shown that ER still outperforms many newer methods, especially in Online Continual Learning [26, 20, 9].

2.3 Model size and performance

It is widely reported in the literature that larger and deeper models tend to achieve higher performance [12]. Bianco et al. surveyed key performance-related metrics to compare across various architectures, including accuracy, model complexity, computational complexity, and accuracy density [2]. The relationship between model width and performance has also been discussed [12], albeit less frequently.

In 2015, He et al. [11] introduced Residual Networks (ResNets), a major innovation in computer vision. ResNets tackled the degradation problem that occurs in deep networks by using residual blocks to increase the accuracy of deeper models. Residual blocks that contain two or more layers are stacked together, and “skip connections” are used in between these blocks. The skip connections act as a shortcut for the gradient to pass through, which alleviates the vanishing gradient problem. They also make it easier for the model to learn identity functions. As a result, ResNets make it practical to train networks with many more layers while reducing error rates. The authors compare models of different depths (composed of 18, 34, 50, 101, and 152 layers) and show that accuracy increases with the depth of the model [11].
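As a short worked illustration of why a skip connection eases gradient flow, the residual formulation of He et al. [11] can be written as below. The symbols $\mathcal{F}$ (the stacked layers), $\mathbf{x}$ and $\mathbf{y}$ (block input and output), and $\mathcal{L}$ (the loss) are notation introduced here for illustration. The sketch assumes an identity shortcut with matching dimensions and ignores the nonlinearity applied after the addition.

```latex
\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial \mathbf{x}}
  = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}
    \left( \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + I \right)
```

Under these assumptions, the identity term $I$ passes the upstream gradient through unchanged even when $\partial \mathcal{F} / \partial \mathbf{x}$ is small, and setting $\mathcal{F} = 0$ recovers the identity function.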

| | ResNet18 | ResNet34 | ResNet50 | ResNet101 | ResNet152 |
|---|---|---|---|---|---|
| Number of Layers | 18 | 34 | 50 | 101 | 152 |
| Number of Parameters | ~11.7 million | ~21.8 million | ~25.6 million | ~44.5 million | ~60 million |
| Top-1 Accuracy | 69.76% | 73.31% | 76.13% | 77.37% | 78.31% |
| Top-5 Accuracy | 89.08% | 91.42% | 92.86% | 93.68% | 94.05% |
| FLOPs | 1.8 billion | 3.6 billion | 3.8 billion | 7.6 billion | 11.3 billion |

Table 1: Comparison of ResNet Architectures. Accuracy scores are based on the ImageNet benchmark.

This leads to the question: do larger models perform better in Continual Learning? While much of the focus in CL research has been on developing various techniques and establishing benchmarks, the impact of model scale remains a less explored path.

Moreover, recent studies on model scales in slightly different contexts have shown conflicting results. Luo et al. [18] highlight a direct correlation between increasing model size and the severity of catastrophic forgetting in large language models (LLMs). They test models of varying sizes from 1 to 7 billion parameters. Yet, Dyer et al. [8] show a contrasting perspective in the context of pre-trained deep learning models. Their results show that large, pre-trained ResNets and Transformers are more resistant to forgetting than randomly initialized, trained-from-scratch models, and that this tendency increases with the scale of the model and the size of the dataset used for pre-training.

The relative lack of discussion on model size and the conflicting perspectives among existing studies indicate that the answer to the question is far from being definitive. The next two sections will elaborate on the specific approach and parameters of the study.

3 Method

3.1 Problem definition

Online Continual Learning can be defined as follows [5, 9]: The objective is to learn a function $f_{\theta}:\mathcal{X}\rightarrow\mathcal{Y}$ with parameters $\theta$ that predicts the label $Y\in\mathcal{Y}$ of the input $\mathbf{X}\in\mathcal{X}$. Over time steps $t\in\{1,2,\ldots,\infty\}$, a distribution-varying stream $\mathcal{S}$ reveals data sequentially, which is different from classical supervised learning.

At every time step,

  1. $\mathcal{S}$ reveals a set of data points (images) $\mathbf{X}_{t}\sim\pi_{t}$ from a non-stationary distribution $\pi_{t}$

  2. Learner $f_{\theta}$ makes predictions $\hat{Y}_{t}$ based on current parameters $\theta_{t}$

  3. $\mathcal{S}$ reveals true labels $Y_{t}$

  4. Compare the predictions with the true labels, compute the training loss $L(Y_{t},\hat{Y}_{t})$

  5. Learner updates the parameters of the model to $\theta_{t+1}$
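The five steps above can be sketched as a single-pass training loop. The example below is a minimal illustration only: it uses a small logistic-regression learner and a synthetic drifting stream as stand-ins (both are assumptions, not the ResNet models or data used in this study), and the names `online_step` and `toy_stream` are hypothetical.

```python
import math
import random

def predict_proba(theta, x):
    """Logistic model f_theta(x): probability that the label is 1."""
    z = sum(w * xi for w, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def online_step(theta, X_t, Y_t, lr=0.1):
    """One time step: predict, receive labels, compute loss, update."""
    # Step 2: predict with the current parameters theta_t.
    Y_hat = [predict_proba(theta, x) for x in X_t]
    # Steps 3-4: labels are revealed and the cross-entropy loss is computed.
    eps = 1e-12
    loss = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(Y_t, Y_hat)
    ) / len(Y_t)
    # Step 5: a single gradient step yields theta_{t+1}.
    grad = [0.0] * len(theta)
    for x, y, p in zip(X_t, Y_t, Y_hat):
        for d, xi in enumerate(x):
            grad[d] += (p - y) * xi / len(Y_t)
    theta_next = [w - lr * g for w, g in zip(theta, grad)]
    return theta_next, loss

def toy_stream(num_steps, batch_size=8, seed=0):
    """Step 1: a hypothetical stream S whose input distribution drifts with t."""
    rng = random.Random(seed)
    for t in range(num_steps):
        shift = t / num_steps  # the input means drift over time
        X_t, Y_t = [], []
        for _ in range(batch_size):
            x1 = rng.gauss(shift, 1.0)
            x2 = rng.gauss(-shift, 1.0)
            X_t.append([x1, x2, 1.0])        # last entry is a bias feature
            Y_t.append(1 if x1 > x2 else 0)  # label depends only on the input
        yield X_t, Y_t

theta = [0.0, 0.0, 0.0]
losses = []
for X_t, Y_t in toy_stream(num_steps=200):
    theta, loss = online_step(theta, X_t, Y_t)  # each batch is seen exactly once
    losses.append(loss)
```

Each mini-batch is consumed once and then discarded, which is the property that distinguishes this loop from offline training over a stored dataset.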

3.2 Task-agnostic and boundary-agnostic

In the context of class-incremental learning, the definitions of task-agnostic and boundary-agnostic from Soutif et al. [26] are adopted. A task-agnostic setting is one where task labels are not available, which means the model does not know that the samples belong to a certain task. A boundary-agnostic setting is one where information on task boundaries is unavailable, which means the model does not know when the data distribution changes to a new task. To reflect a more realistic Continual Learning setting, this study assumes a setting that is both task-agnostic and boundary-agnostic.

| Information available? | Yes | No |
|---|---|---|
| Task labels | Task-aware | Task-agnostic |
| Task boundaries | Boundary-aware | Boundary-agnostic |

Table 2: Task labels and task boundaries.

3.3 Experience Replay (ER)

In a class-incremental learning setting, the nature of the Experience Replay (ER) method aligns well with task-agnostic and boundary-agnostic settings. This is because ER focuses on replaying a subset of past experiences, which helps in maintaining knowledge of previous classes without needing explicit task labels or boundaries. This characteristic of ER allows it to adapt to new classes as they are introduced, while retaining the ability to recognize previously learned classes, making it inherently suitable for task-agnostic and boundary-agnostic Continual Learning scenarios.

In terms of implementation, ER involves randomly initializing an external memory buffer $\mathcal{M}$, then implementing before_training_exp and after_training_exp callbacks to use the dataloader to create mini-batches with samples from both the training stream and the memory buffer. Each mini-batch is balanced so that all tasks or experiences are equally represented in terms of stored samples [16]. As ER is known to be well-suited for Online Continual Learning, as explained in Section 2.2, it is the method used to compare performance across models of varying sizes.
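To make the replay mechanism concrete, the sketch below mixes each incoming mini-batch with samples drawn from a fixed-size memory. It is an illustration under stated assumptions rather than the Avalanche implementation used in this study: the buffer is filled by reservoir sampling, a common task-agnostic alternative to the balanced storage policy described above, and the names `ReservoirBuffer` and `replay_batches` are hypothetical.

```python
import random

class ReservoirBuffer:
    """Fixed-size memory M filled by reservoir sampling (no task labels needed)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []      # stored (x, y) pairs
        self.num_seen = 0      # stream samples observed so far
        self.rng = random.Random(seed)

    def update(self, batch):
        """Offer every sample of an incoming mini-batch to the buffer."""
        for item in batch:
            self.num_seen += 1
            if len(self.samples) < self.capacity:
                self.samples.append(item)
            else:
                # Keep the new item with probability capacity / num_seen.
                slot = self.rng.randrange(self.num_seen)
                if slot < self.capacity:
                    self.samples[slot] = item

    def sample(self, k):
        """Draw up to k stored samples to replay."""
        return self.rng.sample(self.samples, min(k, len(self.samples)))


def replay_batches(stream, buffer, replay_size):
    """Yield each incoming mini-batch concatenated with replayed samples."""
    for batch in stream:
        yield batch + buffer.sample(replay_size)
        buffer.update(batch)  # store the batch only after it has been used


# Hypothetical stream: 5 tasks, 2 classes each, 4 batches of 10 samples per task.
stream = [
    [(f"img_{task}_{b}_{i}", 2 * task + i % 2) for i in range(10)]
    for task in range(5)
    for b in range(4)
]
buffer = ReservoirBuffer(capacity=20)
mixed = list(replay_batches(stream, buffer, replay_size=10))
```

Because the buffer only ever sees individual samples, it needs neither task labels nor task boundaries, which is why replay fits the task-agnostic and boundary-agnostic setting.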

3.4 Benchmark

For this study, SplitCIFAR-10 [16] is used as the main benchmark to compare performance across models of different sizes. SplitCIFAR-10 splits the popular CIFAR-10 dataset into five tasks with disjoint classes, each task including two classes. Each task has 10,000 3×32×32 images for training and 2,000 images for testing. The model is exposed to these tasks or experiences sequentially, which simulates a real-world scenario where a learning system is exposed to new categories of data over time. This makes the benchmark suitable for class-incremental learning scenarios. It is used for testing both online and offline Continual Learning in this study.
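The task structure described above can be sketched in a few lines. This is an illustration only: it assumes classes are assigned to tasks in their natural order (the actual class order used by the benchmark may differ), it uses synthetic labels rather than loading CIFAR-10, and the helper names are hypothetical.

```python
def split_classes(num_classes=10, classes_per_task=2):
    """Partition class ids into disjoint, equally sized tasks in fixed order."""
    class_ids = list(range(num_classes))
    return [
        class_ids[i:i + classes_per_task]
        for i in range(0, num_classes, classes_per_task)
    ]

def task_stream(labels, tasks):
    """Group sample indices by the task their label belongs to."""
    class_to_task = {c: t for t, classes in enumerate(tasks) for c in classes}
    per_task = [[] for _ in tasks]
    for idx, y in enumerate(labels):
        per_task[class_to_task[y]].append(idx)
    return per_task

tasks = split_classes()
# CIFAR-10 has 5,000 training images per class; these labels are synthetic stand-ins.
labels = [c for c in range(10) for _ in range(5000)]
train_indices = task_stream(labels, tasks)
```

With two classes of 5,000 training images each, every task receives the 10,000 training images stated above.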

3.5 Metrics

To evaluate the continual learning performance of each model, key metrics that have been established in earlier work in Online Continual Learning are used.

Average Anytime Accuracy (AAA) [4]

The concept of Average Anytime Accuracy serves as an indicator of a model’s overall performance throughout its learning phase, extending the idea of average incremental accuracy to include continuous assessment scenarios. This metric assesses the effectiveness of the model across all stages of training, rather than at a single endpoint, offering a more comprehensive view of its learning trajectory.

$$\text{AAA}=\frac{1}{T}\sum_{t=1}^{T}(\text{AA})_{t} \tag{1}$$
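As a small worked example of Equation (1), the sketch below averages a list of Average Accuracy values recorded at successive evaluation points. The numbers are hypothetical and are not results from this study.

```python
def average_anytime_accuracy(aa_values):
    """Mean of the Average Accuracy values recorded at each evaluation point."""
    if not aa_values:
        raise ValueError("need at least one evaluation point")
    return sum(aa_values) / len(aa_values)

# Hypothetical AA values at T = 4 evaluation points.
aa_over_time = [0.9, 0.7, 0.6, 0.4]
aaa = average_anytime_accuracy(aa_over_time)  # (0.9 + 0.7 + 0.6 + 0.4) / 4
```

The final accuracy in this toy trajectory is 0.4, while the AAA is 0.65, which shows how the metric rewards accuracy held throughout training rather than only at the end.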

Average Cumulative Forgetting (ACF) [26, 27]

First, the Cumulative Accuracy $b_{k}^{t}$ for task $k$ after the model has been trained up to task $t$ is defined. It is the mean accuracy over the evaluation set $E^{k}_{\Sigma}$, which contains all instances $x$ and their true labels $y$ up to task $k$. The model’s prediction for each instance is given by $\operatorname*{arg\,max}_{c\in C^{k}_{\Sigma}} f^{t}(x)_{c}$, which selects the class $c\in C^{k}_{\Sigma}$ with the highest predicted logit $f^{t}(x)_{c}$. The indicator function $1_{y}(\hat{y})$ outputs 1 if the prediction matches the true label, and 0 otherwise. The sum of these outputs is then averaged over the size of the evaluation set to compute the cumulative accuracy.

$$b_{k}^{t}=\frac{1}{|E^{k}_{\Sigma}|}\sum_{(x,y)\in E^{k}_{\Sigma}}1_{y}\left(\operatorname*{arg\,max}_{c\in C^{k}_{\Sigma}} f^{t}(x)_{c}\right) \tag{2}$$

From the Cumulative Accuracy, the Average Cumulative Forgetting $F_{\Sigma}^{t}$ is obtained by computing the cumulative forgetting for each previous cumulative task $k$ and then averaging over all tasks learned so far:

$$F_{\Sigma}^{t}=\frac{1}{t-1}\sum_{k=1}^{t-1}\max_{i=1,\ldots,t}\left(b_{k}^{i}-b_{k}^{t}\right) \tag{3}$$
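Equation (3) can be checked on a small example. The sketch below assumes the cumulative accuracies are stored in a matrix in which row $k$ and column $i$ hold $b_{k}^{i}$; this layout, the function name, and the numbers are hypothetical and are not results from this study.

```python
def average_cumulative_forgetting(b, t):
    """Equation (3): b[k-1][i-1] holds b_k^i; returns F_Sigma^t for t >= 2."""
    if t < 2:
        raise ValueError("forgetting is defined only after at least two tasks")
    total = 0.0
    for k in range(1, t):                      # k = 1, ..., t-1
        current = b[k - 1][t - 1]              # b_k^t
        total += max(b[k - 1][i - 1] - current for i in range(1, t + 1))
    return total / (t - 1)

# Hypothetical cumulative accuracies for 3 tasks (rows: k, columns: i).
b = [
    [0.90, 0.60, 0.50],   # accuracy on the classes of task 1
    [0.45, 0.70, 0.55],   # accuracy on the classes of tasks 1-2
    [0.30, 0.47, 0.60],   # accuracy on the classes of tasks 1-3
]
acf_after_2 = average_cumulative_forgetting(b, t=2)  # 0.90 - 0.60
acf_after_3 = average_cumulative_forgetting(b, t=3)  # (0.40 + 0.15) / 2
```

Because the maximum in Equation (3) includes $i=t$, each term is non-negative, so the metric is zero only when no earlier accuracy exceeded the current one.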

Average Accuracy (AA) and Average Forgetting (AF) [20]

$a_{i,j}$ is the accuracy evaluated on the test set of task $j$ after training the network from task 1 to $i$, where $i$ is the current task being trained. Average Accuracy (AA) is computed by averaging this over the number of tasks.

$$\text{Average Accuracy}\,(AA_{i})=\frac{1}{i}\sum_{j=1}^{i}a_{i,j} \tag{4}$$

Average Forgetting measures how much a model’s performance on a previous task (task $j$) decreases after it has learned a new task (task $k$). It is calculated by comparing the highest accuracy the model had on task $j$ before it learned task $k$, $\max_{l\in\{1,\ldots,k-1\}}(a_{l,j})$, with the accuracy $a_{k,j}$ on task $j$ after learning task $k$. Setting $k=i$ in Equation (6) gives the terms averaged in Equation (5).

$$\text{Average Forgetting}\,(F_{i})=\frac{1}{i-1}\sum_{j=1}^{i-1}f_{i,j} \tag{5}$$

$$f_{k,j}=\max_{l\in\{1,\ldots,k-1\}}(a_{l,j})-a_{k,j},\quad\forall j<k \tag{6}$$
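A short numeric check of Equations (4) to (6) follows. It assumes the accuracies are stored in a matrix whose row $i$ and column $j$ hold $a_{i,j}$; the function names and the numbers are hypothetical and are not results from this study.

```python
def average_accuracy(a, i):
    """Equation (4): mean accuracy over tasks 1..i after training on task i."""
    return sum(a[i - 1][j - 1] for j in range(1, i + 1)) / i

def forgetting(a, k, j):
    """Equation (6): best earlier accuracy on task j minus accuracy after task k."""
    best_before = max(a[l - 1][j - 1] for l in range(1, k))
    return best_before - a[k - 1][j - 1]

def average_forgetting(a, i):
    """Equation (5): mean forgetting over the i-1 previously learned tasks."""
    if i < 2:
        raise ValueError("forgetting is defined only after at least two tasks")
    return sum(forgetting(a, i, j) for j in range(1, i)) / (i - 1)

# Hypothetical accuracy matrix: a[i-1][j-1] is a_{i,j}.
a = [
    [0.90, 0.00, 0.00],
    [0.60, 0.80, 0.00],
    [0.40, 0.50, 0.90],
]
aa_3 = average_accuracy(a, 3)    # (0.40 + 0.50 + 0.90) / 3
af_3 = average_forgetting(a, 3)  # ((0.90 - 0.40) + (0.80 - 0.50)) / 2
```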

In the context of class-incremental learning, the classical concept of forgetting (Average Forgetting) may not provide meaningful insight because it tends to increase as the task becomes more complex (that is, as more classes are added to the classification problem). Therefore, Soutif et al. [26] recommend against relying on classical forgetting as a metric in class-incremental learning, in both online and offline settings. For this reason, Average Anytime Accuracy (AAA) and Average Cumulative Forgetting (ACF) are used throughout this experiment, although Average Accuracy (AA) and Average Forgetting (AF) are computed as part of the process.

3.6 Model selection

To compare the learning performance across different model depths, popular ResNet architectures, including ResNet18, ResNet34, and ResNet50 are used. As mentioned in Section 2.3, ResNets were designed to increase the performance of deeper neural networks, and their performance metrics are well known. While using custom models with more variability in sizes was a consideration, popular existing architectures were chosen for better reproducibility.

Moreover, in order to observe the effect of model width on CL performance, a slim version of ResNet18 that has been implemented in previous work [17] was used to compare with the performance of ResNet18. The slim version of ResNet18 uses fewer filters per layer, reducing the model width and computational load while keeping the depth of the original model.

While there are more recent versions of ResNet (e.g. ResNeXt [28]) that have been shown to perform better without a significant increase in computational complexity [2], the original simpler models were chosen for this research to avoid introducing unnecessary variables. ResNet18 and ResNet34 have the basic residual network structure, while ResNet50, ResNet101, and ResNet152 use slightly modified building blocks that have three layers instead of the two in the original residual block structure. This "bottleneck design" was introduced to reduce the training time of larger models. The specifics of the design of these models are detailed in the table from the original paper by He et al. [11].

3.7 Saliency maps

Saliency maps were utilized in the study to qualitatively visualize the “attention” of the networks of different sizes. Saliency maps are commonly used to understand which areas of the input images are most influential for the model’s predictions. By visualizing the specific areas of an image that a Convolutional Neural Network (CNN) considers important for classification, saliency maps provide insight into the internal representation and decision-making process of the network [25].

4 Experiment

4.1 Setup

The setup of the experiment is as follows:

  • Each ResNet model was trained from scratch using the SplitCIFAR-10 benchmark with 2 classes per task, for 3 epochs with a mini-batch size of 64.

  • SGD optimizer with a 0.9 momentum and 1e-5 weight decay was used. The initial learning rate is set to 0.01 and the scheduler reduces it by a factor of 0.1 every 30 epochs, as done in [15].

  • Cross entropy loss is used as the criterion, as is common for image classification in Continual Learning.

  • Basic data augmentation is done on the training data to enhance model robustness and generalization by artificially expanding the dataset with varied, modified versions of the original images.

  • Each model is trained in both online and offline CL settings, the latter serving as baselines for performance comparison.

  • Memory size of 500 (representing 1% of the training dataset) is used to implement Experience Replay (ER).
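Two of the numbers in the setup above can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions: it assumes the scheduler is stepped once per epoch with 0-indexed epochs, and the function name `step_lr` is hypothetical rather than part of the code used in this study.

```python
def step_lr(epoch, base_lr=0.01, step_size=30, gamma=0.1):
    """Learning rate at a 0-indexed epoch under a step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate in each of the 3 training epochs.
lrs_in_run = [step_lr(e) for e in range(3)]

# Replay memory as a fraction of the CIFAR-10 training set (50,000 images).
memory_fraction = 500 / 50_000
```

Under these assumptions the learning rate stays at 0.01 for all 3 epochs, since the first decay would only occur at epoch 30, and the memory of 500 samples corresponds to the stated 1% of the training data.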

4.2 Implementation

The Continual Learning benchmark was implemented using the Avalanche framework [16], an open source Continual Learning library, and an adapted version of the code for Online Continual Learning by Soutif et al. [26]. The experiments were run on an NVIDIA Tesla T4 GPU.

5 Results

Accuracy decreases as model size increases.

Average Anytime Accuracy (AAA) decreases for larger models, with a sharper drop in performance between ResNet34 and ResNet50. The decrease in AAA is more pronounced in online learning than in offline learning (Figure 1).

Accuracy of larger models grows at a slower pace during training.

The rate at which validation stream accuracy increases with each task is lower for larger models (Figure 2). Interestingly, Slim-ResNet18 shows the highest accuracy and growth trend, suggesting the potential impact of model width on CL performance. This could indicate that larger models are worse at generalizing in a class-incremental learning scenario.

Figure 1: Average Anytime Accuracy (AAA) of different sized ResNets in Online and Offline Continual Learning

Figure 2: Validation stream accuracy (Online CL)

| Model | Average Anytime Acc (AAA) | Final Average Acc |
|---|---|---|
| Slim-ResNet18 | 0.6645 | 0.5364 |
| ResNet18 | 0.6110 | 0.3712 |
| ResNet34 | 0.5761 | 0.3568 |
| ResNet50 | 0.4594 | 0.3036 |

Table 3: Accuracy metrics across differently sized models (Online CL)
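The size of the drops reported in Table 3 can be made explicit with a short calculation. The AAA values below are copied from Table 3; the helper name `drop` is hypothetical.

```python
# AAA values copied from Table 3 (Online CL).
aaa = {
    "Slim-ResNet18": 0.6645,
    "ResNet18": 0.6110,
    "ResNet34": 0.5761,
    "ResNet50": 0.4594,
}

def drop(larger, smaller):
    """Absolute AAA decrease when moving from the smaller to the larger model."""
    return aaa[smaller] - aaa[larger]

drop_18_to_34 = drop("ResNet34", "ResNet18")   # 0.6110 - 0.5761
drop_34_to_50 = drop("ResNet50", "ResNet34")   # 0.5761 - 0.4594
```

The decrease from ResNet34 to ResNet50 (about 0.117) is more than three times the decrease from ResNet18 to ResNet34 (about 0.035), consistent with the sharper drop described above.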

Forgetting levels show more nuanced results across Online and Offline CL

In the Online CL setting, the Average Cumulative Forgetting (ACF) is lowest for ResNet34 (with a slight overlap with ResNet18 at Task 5), and highest for ResNet50. A noticeable observation in both ACF and AF is that ResNet50 performed better initially but forgetting levels started to increase after a few tasks. The results for Offline CL setting are slightly different, with ResNet50 having the lowest Average Cumulative Forgetting (ACF) (although with a slight increase at Task 4), followed by ResNet18, and finally ResNet34 (Figure 3).

The differences in forgetting between Online and Offline CL settings are aligned with the accuracy metrics in Figure 1, where the performance of ResNet50 decreased more starkly in the Online CL setting.

Figure 3: Forgetting curves, Online CL (left) and Offline CL (right). Solid lines: Average Forgetting (AF); Dotted lines: Average Cumulative Forgetting (ACF)

Observation of saliency maps.

Visual inspection of the saliency maps revealed some interesting observations. When it comes to the ability to highlight intuitive areas of interest in the images, there seemed to be a noticeable improvement from ResNet18 to ResNet34, but this was not necessarily the case from ResNet34 to ResNet50. This phenomenon was more salient in the online CL setting (Figure 4).

Interestingly, Slim-ResNet18 appears to highlight intuitive areas of interest better than most of the other models, and noticeably better than its full-width counterpart, ResNet18. Further exploration of the effect of model width on performance and representation quality would be an interesting avenue of research.

Figure 4: Saliency map visualizations for Online CL

6 Discussion

This research examined the impact of model size on Online Continual Learning performance by comparing key accuracy and forgetting metrics across ResNet models of different depths and widths. These results show that larger models do not necessarily lead to better continual learning performance. We saw that Average Anytime Accuracy (AAA) and stream accuracy dropped progressively with model size, hinting that larger models struggle to generalize to newly trained tasks, especially in an online CL setting. Forgetting curves showed similar trends but with more nuance; larger models perform well at first but suffer from increased forgetting with more incoming tasks. Interestingly, the problem was not as pronounced in the offline CL setting, which highlights the challenges of training models in a more realistic, Online Continual Learning context. Moreover, a qualitative inspection of the saliency maps suggests that the quality of the models’ internal representation also tends to drop with larger models, the difference being the most visible with a change in model width (i.e. ResNet18 vs Slim-ResNet18).

Why do larger models perform worse at Continual Learning? One reason could be that larger models have more parameters, which may make it harder to maintain stability in the learned features as new data is introduced. This could make them more prone to overfitting and to forgetting previously learned information, reducing their ability to generalize.

Building on this work, future research could investigate the impact of model size on CL performance by exploring the following questions: (1) Do pre-trained models generalize better in Online Continual Learning settings compared to models trained from scratch? (2) Does longer training improve the relative performance of larger models in CL settings? (3) Can different CL strategies (other than Experience Replay) mitigate the degradation of performance in larger models? (4) Do "slimmer" versions of existing models always perform better? (5) How might different hyperparameters (e.g., learning rate) impact CL performance of larger models?

7 Conclusion

The results from this study strongly suggest that model size matters when it comes to Continual Learning and forgetting, albeit in nuanced ways. By empirically exploring the under-explored role of model size in CL performance, these findings contribute to the ongoing discussion of how the scale of deep learning models affects performance and have implications for future areas of research.

References

  • [1] Rahaf Aljundi, Lucas Caccia, Eugene Belilovsky, Massimo Caccia, Min Lin, Laurent Charlin, and Tinne Tuytelaars. Online continual learning with maximally interfered retrieval, 2019.
  • [2] Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. Benchmark analysis of representative deep neural network architectures. IEEE Access, 6:64270–64277, 2018.
  • [3] Keno K. Bressem, Lisa C. Adams, Christoph Erxleben, Bernd Hamm, Stefan M. Niehues, and Janis L. Vahldiek. Comparing different deep learning architectures for classification of chest radiographs. Scientific Reports, 2020.
  • [4] Lucas Caccia, Rahaf Aljundi, Nader Asadi, Tinne Tuytelaars, Joelle Pineau, and Eugene Belilovsky. New insights on reducing abrupt representation change in online continual learning, 2022.
  • [5] Zhipeng Cai, Ozan Sener, and Vladlen Koltun. Online continual learning with natural distribution shifts: An empirical study with visual data. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021.
  • [6] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem, 2019.
  • [7] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning, 2019.
  • [8] Ethan Dyer, Aitor Lewkowycz, and Vinay Ramasesh. Effect of scale on catastrophic forgetting in neural networks. 2022.
  • [9] Yasir Ghunaim, Adel Bibi, Kumail Alhamoud, Motasem Alfarra, Hasan Abed Al Kader Hammoud, Ameya Prabhu, Philip H. S. Torr, and Bernard Ghanem. Real-time evaluation in online continual learning: A new hope, 2023.
  • [10] Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks, 2015.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
  • [12] Xia Hu, Lingyang Chu, Jian Pei, Weiqing Liu, and Jiang Bian. Model complexity of deep learning: A survey, 2021.
  • [13] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • [14] Zhizhong Li and Derek Hoiem. Learning without forgetting, February 2017. arXiv:1606.09282 [cs, stat].
  • [15] Zhiqiu Lin, Jia Shi, Deepak Pathak, and Deva Ramanan. The CLEAR benchmark: Continual learning on real-world imagery, 2022.
  • [16] Vincenzo Lomonaco, Lorenzo Pellegrini, Andrea Cossu, Antonio Carta, Gabriele Graffieti, Tyler L. Hayes, Matthias De Lange, Marc Masana, Jary Pomponi, Gido van de Ven, Martin Mundt, Qi She, Keiland Cooper, Jeremy Forest, Eden Belouadah, Simone Calderara, German I. Parisi, Fabio Cuzzolin, Andreas Tolias, Simone Scardapane, Luca Antiga, Subutai Ahmad, Adrian Popescu, Christopher Kanan, Joost van de Weijer, Tinne Tuytelaars, Davide Bacciu, and Davide Maltoni. Avalanche: an end-to-end library for continual learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2nd Continual Learning in Computer Vision Workshop, 2021.
  • [17] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. 2017.
  • [18] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2023.
  • [19] Divyam Madaan, Jaehong Yoon, Yuanchun Li, Yunxin Liu, and Sung Ju Hwang. Representational continuity for unsupervised continual learning, April 2022. arXiv:2110.06976 [cs].
  • [20] Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner. Online continual learning in image classification: An empirical survey, October 2021.
  • [21] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 of Psychology of Learning and Motivation, pages 109–165. Academic Press, 1989.
  • [22] Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology, 4, 2013.
  • [23] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning, 2017.
  • [24] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks, 2022.
  • [25] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014.
  • [26] Albin Soutif-Cormerais, Antonio Carta, Andrea Cossu, Julio Hurtado, Hamed Hemati, Vincenzo Lomonaco, and Joost Van de Weijer. A comprehensive empirical evaluation on online continual learning, September 2023.
  • [27] Albin Soutif-Cormerais, Marc Masana, Joost Van de Weijer, and Bartłomiej Twardowski. On the importance of cross-task features for class-incremental learning, 2021.
  • [28] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks, 2017.
  • [29] Xiaofan Yu, Yunhui Guo, Sicun Gao, and Tajana Rosing. SCALE: Online self-supervised lifelong learning without prior knowledge. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2484–2495, Vancouver, BC, Canada, June 2023. IEEE.