\addbibresource{references.bib}

On Sequential Loss Approximation for Continual Learning

Menghao Waiyan William Zhu\orcidlink{0009-0001-4180-5791} (corresponding author: [email protected])
Tsinghua-Berkeley Shenzhen Institute, Tsinghua Shenzhen International Graduate School, Shenzhen

Ercan Engin Kuruoğlu
Tsinghua-Berkeley Shenzhen Institute, Tsinghua Shenzhen International Graduate School, Shenzhen

Abstract

For continual learning, we introduce Autodiff Quadratic Consolidation (AQC), which approximates the previous loss function with a quadratic function, and Neural Consolidation (NC), which approximates the previous loss function with a neural network. Although they are not scalable to large neural networks, they can be used with a fixed pre-trained feature extractor. We empirically study these methods in class-incremental learning, for which regularization-based methods produce unsatisfactory results unless combined with replay. We find that for small datasets, quadratic approximation of the previous loss function leads to poor results, even with full Hessian computation, and NC can significantly improve the predictive performance, while for large datasets, when used with a fixed pre-trained feature extractor, AQC provides superior predictive performance. We also find that using tanh-output features can improve the predictive performance of AQC. In particular, in class-incremental Split MNIST, when a Convolutional Neural Network (CNN) with tanh-output features is pre-trained on EMNIST Letters and used as a fixed pre-trained feature extractor, AQC can achieve predictive performance comparable to joint training.

1 Introduction

Continual learning, also known as incremental learning or lifelong learning, is learning from a sequence of datasets called tasks which are not necessarily identically distributed. When a neural network (including a generalized linear model, which is essentially a neural network with no hidden layers) is trained on a task and fine-tuned on a new task, it loses predictive performance on the old task. This phenomenon is known as catastrophic forgetting \parencite{mccloskey_catastrophic_1989} and can be prevented by joint training on all tasks, but previous data may not be accessible due to computational or privacy constraints. Thus, the goal of continual learning is to prevent catastrophic forgetting without accessing previous data.

Continual-learning settings vary considerably, and three main types are commonly studied \parencite{van_de_ven_three_2022}:

1. Task-incremental learning, in which task IDs are provided and the classes change between tasks
2. Domain-incremental learning, in which task IDs are not provided and the classes remain the same between tasks
3. Class-incremental learning, in which task IDs are not provided and the classes change between tasks

Multi-headed models, in which there is one output head for each task, are commonly used for task-incremental learning. However, task-incremental learning has been criticized because task IDs make the problem of continual learning easier \parencite{farquhar_towards_2019}. In fact, if there is only one class per task, then the task ID could be used to make a perfect prediction. In accordance with the desiderata proposed by \textcite{farquhar_towards_2019}, we focus on class-incremental learning with single-headed models on more than two similar tasks with no access to previous data.

Sequential Bayesian inference provides an elegant approach to continual learning. We assume that the parameters of the model are random and use the previous posterior Probability Density Function (PDF) as the current prior PDF. For Maximum-A-Posteriori (MAP) prediction, this can be formulated as sequential loss approximation. In this work, we use Autodiff Quadratic Consolidation (AQC), which approximates the previous loss function with a quadratic function, and Neural Consolidation (NC), which approximates the previous loss function with a neural network. Since these methods are not scalable to large models with millions of parameters, we use a fixed pre-trained feature extractor for image datasets. We show empirically that for small datasets, neural-network approximation of the previous loss function leads to significant improvements in predictive performance over quadratic approximation of the previous loss function and that using tanh-output features improves the predictive performance of AQC.

2 Sequential Bayesian inference and sequential loss approximation

Let $\theta$ be the parameters of the neural network, $x_{1:t}$ be the input data from time $1$ to $t$ and $y_{1:t}$ be the output data from time $1$ to $t$. The independence assumptions are described by the following Bayesian network. In particular, $x_{1:t}$ are assumed to be independent, and given $\theta$ and $x_{1:t}$, $y_{1:t}$ are assumed to be conditionally independent.

For $i=1,2,\ldots,t$, $(x_{i},y_{i})$ represents a task or dataset at time $i$, and the individual points in each task are assumed to be independent and identically distributed.

Tasks are assumed to be similar, so the likelihood PDF at time $t$, $p(y_{t}|\theta,x_{t})$, is assumed to have the same form for all tasks. It is defined by the neural network, which maps $x_{t}$ to some parameters of $p(y_{t}|\theta,x_{t})$. In binary classification, $p(y_{t}|\theta,x_{t})$ is Bernoulli, and the neural network maps to the logit (log-odds) of the positive class, while in multi-class classification, $p(y_{t}|\theta,x_{t})$ is categorical, and the neural network maps to the un-normalized log-probabilities of the classes. The posterior PDF at time $t$, $p(\theta|y_{1:t},x_{1:t})$, can be obtained by recursive application of Bayes’ rule, where the posterior PDF at time $t-1$ is used as the prior PDF at time $t$:

p(\theta|y_{1:t},x_{1:t})=\frac{1}{z_{t}}p(\theta|y_{1:t-1},x_{1:t-1})p(y_{t}|\theta,x_{t})

where $z_{t}=\int_{-\infty}^{\infty}p(\theta|y_{1:t-1},x_{1:t-1})p(y_{t}|\theta,x_{t})\,d\theta$ is a normalization term that does not depend on $\theta$.
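As a small numerical illustration of this recursion, the following sketch uses a hypothetical one-parameter Bernoulli model whose parameter is restricted to a discrete grid, so that the normalization term becomes a finite sum. The grid, the toy data and the function names are assumptions made for illustration only. Updating task by task, with the previous posterior as the current prior, should give the same posterior as a single update on all the data.

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def bayes_update(prior, grid, ys):
    """Posterior on the grid: prior times likelihood, renormalized."""
    unnormalized = []
    for p_theta, theta in zip(prior, grid):
        likelihood = math.prod(theta if y == 1 else 1.0 - theta for y in ys)
        unnormalized.append(p_theta * likelihood)
    return normalize(unnormalized)


# Hypothetical setup: theta is the success probability of a Bernoulli model.
grid = [i / 100 for i in range(1, 100)]
prior = normalize([1.0] * len(grid))
tasks = [[1, 1, 0], [1, 0, 1, 1], [0, 1]]

# Sequential inference: the previous posterior is the current prior.
sequential = prior
for ys in tasks:
    sequential = bayes_update(sequential, grid, ys)

# Joint inference on all tasks at once.
batch = bayes_update(prior, grid, [y for ys in tasks for y in ys])

max_gap = max(abs(s - b) for s, b in zip(sequential, batch))
theta_map = grid[max(range(len(grid)), key=lambda i: sequential[i])]
print(f"largest difference: {max_gap:.2e}, MAP estimate: {theta_map}")
```

On a grid the recursion is exact up to floating-point error; the difficulty addressed in the rest of this section is that, for a neural network, the previous posterior has no such closed form and has to be approximated.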

MAP prediction uses the maximizer $\theta^{*}_{t}$ of the posterior PDF at time $t$ to make a point prediction. Since normalizing constants do not affect the maximizer, $\theta^{*}_{t}$ is also the maximizer of the joint PDF $p(\theta,y_{1:t}|x_{1:t})$. Maximizing the PDF is equivalent to minimizing the negative log PDF. Thus, the total loss at time $t$ can be defined as $\mathfrak{L}_{t}(\theta)=-\ln p(\theta,y_{1:t}|x_{1:t})$ and the task loss at time $t$ as $\mathfrak{l}_{t}(\theta)=-\ln p(y_{t}|\theta,x_{t})$. This leads to a recursion of loss functions for $t=1,2,\ldots$:

\mathfrak{L}_{t}(\theta)=\mathfrak{L}_{t-1}(\theta)+\mathfrak{l}_{t}(\theta)

Methods based on this recursion are also known as regularization-based methods as $\mathfrak{L}_{t-1}$ can be seen as regularizing $\mathfrak{l}_{t}$. Note that $\mathfrak{L}_{0}=\mathfrak{l}_{0}$, the negative log prior, may be in un-normalized form. In particular, a Gaussian prior PDF with zero mean vector and covariance matrix $\alpha^{-1}I$ (that is, precision matrix $\alpha I$) corresponds to an $L^{2}$ regularization term $\frac{1}{2}\alpha\lVert\theta\rVert_{2}^{2}$. In binary classification, $\mathfrak{l}_{t}$ is the sigmoid cross-entropy loss, while in multi-class classification, $\mathfrak{l}_{t}$ is the softmax cross-entropy loss.
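For completeness, the recursion can be obtained by factorizing the joint PDF under the independence assumptions above and taking the negative logarithm. The steps are sketched here:

\mathfrak{L}_{t}(\theta)=-\ln p(\theta,y_{1:t}|x_{1:t})=-\ln p(\theta,y_{1:t-1}|x_{1:t-1})-\ln p(y_{t}|\theta,x_{t})=\mathfrak{L}_{t-1}(\theta)+\mathfrak{l}_{t}(\theta)

Similarly, the $L^{2}$ term is the negative log-density of a zero-mean Gaussian with precision matrix $\alpha I$, up to a constant that does not depend on $\theta$, where $d$ (a symbol introduced here only for this constant) denotes the number of parameters:

-\ln\mathcal{N}(\theta;0,\alpha^{-1}I)=\frac{1}{2}\alpha\lVert\theta\rVert_{2}^{2}-\frac{d}{2}\ln\frac{\alpha}{2\pi}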

2.1 Autodiff quadratic consolidation

Quadratic approximation of $\mathfrak{L}_{t-1}$ corresponds to Laplace approximation of the previous posterior PDF and requires computing the minimum $\theta^{*}_{t-1}$ and the Hessian matrix at that minimum. Since the Hessian operator is linear, successive quadratic approximation amounts to adding up the Hessian matrices of the previous tasks:

\hat{\mathfrak{L}}_{t}(\theta)=\frac{1}{2}(\theta-\theta^{*}_{t-1})^{T}\left(\sum_{i=0}^{t-1}H(\mathfrak{l}_{i})(\theta^{*}_{i})\right)(\theta-\theta^{*}_{t-1})+\mathfrak{l}_{t}(\theta)
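The quadratic form can be motivated by a second-order Taylor expansion around the minimum. The following sketch assumes that $\theta^{*}_{t-1}$ is an exact minimizer of $\hat{\mathfrak{L}}_{t-1}$, so that the first-order term vanishes, and drops the additive constant, which does not affect the minimizer:

\hat{\mathfrak{L}}_{t-1}(\theta)\approx\hat{\mathfrak{L}}_{t-1}(\theta^{*}_{t-1})+\frac{1}{2}(\theta-\theta^{*}_{t-1})^{T}H(\hat{\mathfrak{L}}_{t-1})(\theta^{*}_{t-1})(\theta-\theta^{*}_{t-1})

By linearity, the Hessian matrix of $\hat{\mathfrak{L}}_{t-1}$ is that of its quadratic term plus that of the latest task loss, which gives the accumulated sum:

H(\hat{\mathfrak{L}}_{t-1})(\theta^{*}_{t-1})=\sum_{i=0}^{t-2}H(\mathfrak{l}_{i})(\theta^{*}_{i})+H(\mathfrak{l}_{t-1})(\theta^{*}_{t-1})=\sum_{i=0}^{t-1}H(\mathfrak{l}_{i})(\theta^{*}_{i})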

Autodiff Quadratic Consolidation (AQC) computes Hessian matrices by using automatic differentiation. The computational cost depends on the size of the model, i.e. the number of parameters in $\theta$, and the size of the dataset. If the model is small but the dataset is large, the computation can be made tractable by computing in mini-batches. Since the Hessian operator is linear, the Hessian matrix of the batch loss function is equal to the sum of the Hessian matrices of the mini-batch loss functions. If $\mathfrak{l}_{t,1},\mathfrak{l}_{t,2},\ldots,\mathfrak{l}_{t,b}$ are the mini-batch loss functions corresponding to the batch loss function $\mathfrak{l}_{t}$, then

H(\mathfrak{l}_{t})(\theta^{*}_{t})=H\left(\sum_{j=1}^{b}\mathfrak{l}_{t,j}\right)(\theta^{*}_{t})=\sum_{j=1}^{b}H(\mathfrak{l}_{t,j})(\theta^{*}_{t})
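The mini-batch accumulation can be sketched as follows. To keep the example dependency-free, it uses the closed-form Hessian of a logistic-regression model in place of automatic differentiation, so it illustrates only the additivity and the resulting quadratic penalty, not the authors' implementation. The toy inputs, the stand-in minimizer `theta_star` and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hessian_nll(theta, xs):
    """Hessian of the summed sigmoid cross-entropy loss of logistic regression.

    It equals sum_n p_n (1 - p_n) x_n x_n^T and does not depend on the labels.
    """
    d = len(theta)
    h = [[0.0] * d for _ in range(d)]
    for x in xs:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        w = p * (1.0 - p)
        for i in range(d):
            for j in range(d):
                h[i][j] += w * x[i] * x[j]
    return h


def add(a, b):
    """Element-wise sum of two matrices stored as nested lists."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def quadratic_penalty(theta, center, h):
    """0.5 * (theta - center)^T h (theta - center)."""
    diff = [t - c for t, c in zip(theta, center)]
    n = len(diff)
    return 0.5 * sum(diff[i] * h[i][j] * diff[j] for i in range(n) for j in range(n))


# Hypothetical inputs; the leading 1.0 acts as a bias feature.
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.3], [1.0, -0.7], [1.0, 1.5]]
theta_star = [0.2, -0.4]  # stands in for a minimizer found by training
alpha = 0.1

# Hessian computed on the whole batch at once.
full = hessian_nll(theta_star, xs)

# Hessian accumulated over mini-batches of size 2.
accumulated = [[0.0, 0.0], [0.0, 0.0]]
for start in range(0, len(xs), 2):
    accumulated = add(accumulated, hessian_nll(theta_star, xs[start:start + 2]))

# Adding the Hessian of the L2 term gives the matrix used in the penalty.
total_h = add([[alpha, 0.0], [0.0, alpha]], accumulated)
penalty = quadratic_penalty([0.3, -0.1], theta_star, total_h)
```

Only the running sum of Hessian matrices and the latest minimizer need to be stored between tasks, which is why the memory cost depends on the number of parameters rather than on the number of tasks.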

2.2 Neural consolidation

Neural Consolidation (NC) uses a consolidator neural network $h$ with parameters $\phi^{*}_{t-1}$ to approximate $\mathfrak{L}_{t-1}$:

\hat{\mathfrak{L}}_{t}(\theta)=h(\theta;\phi^{*}_{t-1})+\mathfrak{l}_{t}(\theta)

After finding the minimum $\theta^{*}_{t-1}$, the consolidator neural network is trained by using mini-batch gradient descent. At each step, a sample of $n$ points $\theta_{1},\theta_{2},\ldots,\theta_{n}$ is drawn uniformly at random from a ball of radius $r$ around $\theta^{*}_{t-1}$, and the consolidator loss function is minimized to obtain $\phi^{*}_{t-1}$:

\mathfrak{L}_{c}(\phi)=\frac{1}{2}\beta\lVert\phi\rVert_{2}^{2}+\sum_{i=1}^{n}\mathfrak{l}_{c,i}(\phi)

where $\beta$ is an $L^{2}$ regularization factor, and each $\mathfrak{l}_{c,i}(\phi)$ is the Huber loss:

\mathfrak{l}_{c,i}(\phi)=\begin{cases}\frac{1}{2}(h(\theta_{i};\phi)-\hat{\mathfrak{L}}_{t-1}(\theta_{i}))^{2}&\text{if }|h(\theta_{i};\phi)-\hat{\mathfrak{L}}_{t-1}(\theta_{i})|\leq 1\\ |h(\theta_{i};\phi)-\hat{\mathfrak{L}}_{t-1}(\theta_{i})|-\frac{1}{2}&\text{otherwise}\end{cases}

If the dataset is large, the true loss values can be computed by adding the mini-batch loss values.
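The two ingredients of the consolidator training step, sampling within a ball and the Huber loss, can be sketched as follows. The sampling scheme shown (a Gaussian direction combined with a radius scaled by $u^{1/d}$) is one standard way to sample uniformly from a ball and is an assumption here, since the text does not specify how the sample is generated. The center, the seed and the function names are hypothetical.

```python
import math
import random


def sample_in_ball(center, radius, rng):
    """Draw one point uniformly from the ball of the given radius around center."""
    d = len(center)
    # A normalized Gaussian vector gives a uniformly distributed direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(g * g for g in direction))
    # Scaling the radius by u ** (1 / d) makes the density uniform in volume.
    scale = radius * rng.random() ** (1.0 / d) / norm
    return [c + scale * g for c, g in zip(center, direction)]


def huber(prediction, target):
    """Huber loss with threshold 1, as in the consolidator loss function."""
    error = abs(prediction - target)
    return 0.5 * error ** 2 if error <= 1.0 else error - 0.5


rng = random.Random(0)      # fixed seed for reproducibility
center = [0.2, -0.4, 1.0]   # stands in for the minimum found on the previous task
radius = 10.0
points = [sample_in_ball(center, radius, rng) for _ in range(1024)]
distances = [math.dist(p, center) for p in points]
```

Sampling the radius uniformly instead would concentrate points near the center. In high dimensions the opposite effect dominates under uniform sampling, since most of the volume of a ball lies near its surface, which may be one reason why the choice of $r$ matters.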

2.3 Limitations

In sequential loss approximation, the total number of classes must be known in advance, even if the classes are learned incrementally, so that the number of parameters $\theta$ is fixed. Otherwise, the assumptions made above will be violated and sequential Bayesian inference will no longer be valid.

AQC and NC are local approximations, so if the minimum moves very far when learning a new task, the approximation will not be very accurate. NC is also sensitive to hyperparameters such as the radius $r$, the sample size $n$ and the architecture of the consolidator neural network. These algorithms do not scale well with model size, but the model size can be reduced by using a fixed pre-trained feature extractor. On the other hand, they scale well with dataset size, and memory-mapped files can be used to increase the efficiency of the algorithms.

3 Related work

There are several works using quadratic approximation of the previous loss function. Elastic Weight Consolidation (EWC) approximates the Hessian matrix by using a diagonal Fisher information matrix, which is computed by averaging the squared gradients over mini-batches \parencite{kirkpatrick_overcoming_2017}. However, it adds a penalty term for every new task, and \textcite{huszar_note_2018} points out that this is not well justified and that the penalty term can instead be updated by adding the Hessian matrices. Synaptic Intelligence (SI) also performs a diagonal approximation of the Hessian. However, it does so by using the change in loss during gradient descent, which is computed by summing the product of the gradient and the change in the parameter values during gradient descent \parencite{zenke_continual_2017}. Online Structured Laplace Approximation (OSLA) uses Kronecker factorization to perform a block-diagonal approximation of the Hessian, in which each diagonal block of the matrix corresponds to a layer of the neural network \parencite{ritter_online_2018}.

The above methods are designed to be scalable to large models and have mainly been shown to perform well in Permuted MNIST \parencite{goodfellow_empirical_2015}, with OSLA performing better than EWC and SI. In class-incremental Split MNIST, EWC and SI have been shown to perform as poorly as fine-tuning \parencite{van_de_ven_three_2022}. It has also been observed that regularization-based methods generally perform worse than replay-based methods, which use stored or generated previous data \parencite{mai_online_2022,van_de_ven_three_2022}. In our work, we examine how much predictive performance we can achieve by using a full Hessian computation and whether the quadratic approximation can be improved by approximation with a neural network, which is a powerful function approximator.

Recently, two types of pre-training have been studied for continual learning: pre-training for initialization and pre-training for feature extraction. In the former, the parameters of the neural network are initialized with the pre-trained parameters rather than randomly, which has been reported to reduce forgetting \parencite{lee_pre-trained_2023,mehta_empirical_2023}. In the latter, a pre-trained model is used as a feature extractor, which has also been reported to reduce forgetting \parencite{hu_continual_2021,li_continual_2022,yang_continual_2023}. Intuitively, neural networks do not need to learn lower-level features continually as much as they need to learn higher-level features, just as humans do not need to learn continually to recognize edges, corners and shapes as much as they need to learn to recognize objects. It has also been empirically shown that forgetting is more pronounced in later layers of a neural network \parencite{liu_generative_2020}. Thus, in our work, we use pre-training for feature extraction for image datasets.

4 Experiments

In our experiments, we compare EWC, SI, AQC and NC in class-incremental-learning settings based on small and large datasets. For EWC, we use the variant with Huszár’s corrected penalty \parencite{huszar_note_2018} and the Fisher information matrix, which is computed based on the empirical distribution of $x$ and the conditional distribution of $y$ given $x$ defined by the model (rather than the empirical Fisher information matrix, which is computed based on the empirical distribution of $x$ and $y$) \parencite{martens_new_2020}.
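The distinction between the two matrices can be illustrated with a minimal sketch for a logistic-regression model, for which the gradient of the negative log-likelihood with respect to $\theta$ is $(p-y)x$, with $p$ the predicted probability of the positive class. Only the diagonals are computed, as in EWC. The toy data, the parameter values and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """Model probability of the positive class for input x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def fisher_diagonal(theta, xs):
    """Diagonal Fisher: labels are averaged under the model's own predictions."""
    diag = [0.0] * len(theta)
    for x in xs:
        p = predict(theta, x)
        # y = 1 with probability p and y = 0 with probability 1 - p.
        for y, prob in ((1, p), (0, 1.0 - p)):
            for i, xi in enumerate(x):
                diag[i] += prob * ((p - y) * xi) ** 2
    return [v / len(xs) for v in diag]


def empirical_fisher_diagonal(theta, xs, ys):
    """Diagonal empirical Fisher: the observed labels are used instead."""
    diag = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        p = predict(theta, x)
        for i, xi in enumerate(x):
            diag[i] += ((p - y) * xi) ** 2
    return [v / len(xs) for v in diag]


# Hypothetical inputs (leading 1.0 as a bias feature) and labels.
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
ys = [1, 0, 1]
theta = [0.2, -0.4]

fisher = fisher_diagonal(theta, xs)
empirical = empirical_fisher_diagonal(theta, xs, ys)
```

For this model, the expectation over the model's labels reduces to $p(1-p)x_{i}^{2}$, which is the diagonal of the Hessian matrix of the loss, whereas the empirical version depends on the observed labels and generally differs from it.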

The experiments are run on an on-premises NVIDIA GeForce RTX 4090 GPU with 24 GB of memory. In every experiment, swish activation functions are used, except in some pre-trained neural networks that use tanh activation functions before the final layer; the regularization factors are $\alpha=\beta=0.1$; and optimization is done with the Adam optimizer with learning rate 0.01. For Split Iris and Split Wine, the datasets are split randomly with 80% for training and 20% for testing, while for Split MNIST and Split CIFAR-10, the original dataset splits are used. Then, each split dataset is split by class into multiple tasks. For NC, hyperparameters are chosen by manual tuning, while for EWC and SI, they are fixed at $\lambda=1$ and $\xi=1$, respectively. Evaluation is done by using the average accuracy, which is the average of the testing accuracy scores of the tasks up to and including the current task.
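As a small worked example of this metric, the sketch below computes the average accuracy for a hypothetical model that, after each task, classifies only the latest task correctly and forgets all earlier ones. The per-task accuracy values and the function name are assumptions chosen for illustration.

```python
def average_accuracy(task_accuracies):
    """Mean of the testing accuracies of all tasks seen so far."""
    return sum(task_accuracies) / len(task_accuracies)


# Row k holds the testing accuracy (in percent) on tasks 1..k after training on task k.
history = [
    [100.0],
    [0.0, 100.0],
    [0.0, 0.0, 100.0],
]
scores = [round(average_accuracy(accuracies), 2) for accuracies in history]
print(scores)  # [100.0, 50.0, 33.33]
```

Under this definition, every task counts equally regardless of its number of test points, so complete forgetting of earlier tasks yields $100/k$ percent after $k$ tasks.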

4.1 Split Iris

Iris is a dataset consisting of 150 instances of flowers with 4 features (sepal length, sepal width, petal length and petal width) and a 3-class label (setosa, versicolor and virginica), and Split Iris splits the dataset into 3 tasks by class. Before evaluating on Split Iris, we visualize loss functions by using Split Iris 1, which consists of one feature (petal length) and a binary label indicating whether the species is virginica or not, and predictions by using Split Iris 2, which consists of two features (petal length and petal width) and a 3-class label.

For Split Iris 1, we use two 2-parameter models for convenience of visualization of loss functions: a logistic-regression model, for which the consolidator neural network has two hidden layers each of 50 nodes, and a neural network with one hidden layer of one node and no bias, for which the consolidator neural network has two hidden layers each of 200 nodes. NC is done with radius $r=10$ and sample size $n=1024$. The visualization of the loss functions is shown in Figure 1, and the visualization of the final loss functions under different values of the radius $r$ and sample size $n$ is shown in Figure 2. It can be seen that NC approximates the loss functions significantly better than EWC, SI and AQC, but NC is sensitive to hyperparameters such as the radius $r$ and the sample size $n$.

For Split Iris 2, we use the following two models: a softmax-regression model, for which the consolidator neural network has two hidden layers each of 50 nodes, and a neural network with one hidden layer of 3 nodes, for which the consolidator neural network has two hidden layers each of 200 nodes. NC is done with radius $r=10$ and sample size $n=1024$. The visualizations of the predictions are shown in Figure 3. It can be seen that the predictions of NC are better than those of EWC, SI and AQC, especially on the softmax-regression model. The loss function of a neural network is much more difficult to fit, so the results of NC on the neural network are not as good as those on the softmax-regression model.

For Split Iris, we use the following two models: a softmax-regression model, for which the consolidator neural network has two hidden layers each of 50 nodes, and a neural network with one hidden layer of one node and no bias, for which the consolidator neural network has two hidden layers each of 200 nodes. NC is done with radius $r=20$ and sample size $n=1024$. The results are shown in Table 1. It can be seen that EWC, SI and AQC perform as poorly as fine-tuning, and NC outperforms them significantly.

[Figure panel (a): Logistic-regression model]

Table 1: Average accuracy as percentage for Split Iris

Method	Task 1	Task 2	Task 3
Softmax-regression model
Joint training	100.00	100.00	100.00
Fine-tuning	100.00	50.00	33.33
EWC	100.00	50.00	33.33
SI	100.00	50.00	33.33
AQC	100.00	50.00	33.33
NC	100.00	100.00	93.33
Neural network
Joint training	100.00	100.00	100.00
Fine-tuning	100.00	50.00	33.33
EWC	100.00	50.00	33.33
SI	100.00	50.00	33.33
AQC	100.00	50.00	33.33
NC	100.00	100.00	100.00

4.2 Split Wine

Wine is a dataset consisting of 178 instances of wine each with 13 features and a 3-class label, and Split Wine splits the dataset into 3 tasks by class. We use the following two models: a softmax-regression model, for which the consolidator neural network has two hidden layers each of 50 nodes, and a neural network with one hidden layer of 3 nodes, for which the consolidator neural network has two hidden layers each of 2000 nodes. NC is done with radius $r=20$ and sample size $n=1024$. The results are shown in Table 2. It can be seen that EWC, SI and AQC perform as poorly as fine-tuning, and NC outperforms them significantly on the softmax-regression model, but on the neural network, NC performs well only up to the second task.

Table 2: Average accuracy as percentage for Split Wine

Method	Task 1	Task 2	Task 3
Softmax-regression model
Joint training	100.00	97.06	95.01
Fine-tuning	100.00	50.00	33.33
EWC	100.00	50.00	33.33
SI	100.00	50.00	33.33
AQC	100.00	50.00	33.33
NC	100.00	94.12	76.08
Neural network
Joint training	100.00	97.06	91.68
Fine-tuning	100.00	50.00	33.33
EWC	100.00	50.00	33.33
SI	100.00	50.00	33.33
AQC	100.00	50.00	33.33
NC	100.00	91.18	33.33

4.3 Split MNIST

MNIST is a dataset consisting of 70,000 instances of handwritten digits (60,000 for training and 10,000 for testing), each with a $28\times 28$ image and a 10-class label \parencite{lecun_gradient_1998}. Split MNIST splits the dataset into 5 tasks, the first consisting of 1’s and 2’s, the second consisting of 3’s and 4’s and so on. EMNIST Letters consists of 26 classes of $28\times 28$ images of handwritten letters in both upper case and lower case \parencite{cohen_emnist_2017} and has no class overlap with MNIST.

We pre-train two Convolutional Neural Networks (CNNs) on EMNIST Letters and use them as fixed feature extractors. Each consists of a convolutional layer of 32 $3\times 3$ filters, an average-pooling layer with a stride of 2, a convolutional layer of 32 $3\times 3$ filters, an average-pooling layer with a stride of 2, a dense layer of 32 nodes and finally a dense layer of 26 nodes. Features are extracted from the layer before the final layer, i.e. the dense layer of 32 nodes; in that layer, one CNN uses swish activation functions, while the other uses tanh activation functions.

For Split MNIST, we use the pre-trained CNNs to extract features and perform continual learning on a dense layer with 10 nodes, for which the consolidator neural network has two hidden layers each of 5000 nodes. NC is done with radius $r=20$ and sample size $n=1024$. The results are shown in Table 3. It can be seen that AQC outperforms NC, EWC and SI in both cases. Notably, using tanh-output features significantly reduces forgetting in EWC, SI, AQC and NC, although it also slightly reduces the final average accuracy in joint training. In particular, with tanh-output features, AQC achieves performance comparable to joint training.

Table 3: Average accuracy as percentage for Split MNIST

Method	Task 1	Task 2	Task 3	Task 4	Task 5
Feature extractor with swish output
Joint training	100.00	98.63	98.60	97.04	95.06
Fine-tuning	100.00	49.22	33.32	24.96	19.54
EWC	100.00	49.22	33.32	24.96	19.54
SI	100.00	49.22	33.32	24.96	19.54
AQC	100.00	96.14	81.04	61.79	47.55
NC	100.00	56.05	45.75	33.29	26.23
Feature extractor with tanh output
Joint training	99.86	98.54	98.13	96.43	93.66
Fine-tuning	99.86	49.29	33.85	24.96	19.69
EWC	99.86	76.43	57.72	40.33	26.03
SI	99.86	76.31	57.27	39.41	25.33
AQC	99.86	98.52	97.96	96.11	92.26
NC	99.86	85.26	57.68	43.34	34.79

4.4 Split CIFAR-10

CIFAR-10 is a dataset consisting of 60,000 instances of natural images with a $32\times 32$ image and a 10-class label \parencite{krizhevsky_learning_2009}. Split CIFAR-10 splits the dataset into 5 tasks, similarly to Split MNIST. CIFAR-100 consists of 100 classes of $32\times 32$ natural images and has no class overlap with CIFAR-10 \parencite{krizhevsky_learning_2009}.

We pre-train two CNNs on CIFAR-100 and use them as fixed feature extractors. Each consists of two convolutional layers of 128 $3\times 3$ filters, an average-pooling layer with a stride of 2, two convolutional layers of 256 $3\times 3$ filters, an average-pooling layer with a stride of 2, a dense layer of 256 nodes, a dense layer of 128 nodes and finally a dense layer of 100 nodes. Features are extracted from the layer before the final layer, i.e. the dense layer of 128 nodes; in that layer, one CNN uses swish activation functions, while the other uses tanh activation functions.

For Split CIFAR-10, we use the pre-trained CNNs to extract features and perform continual learning on a dense layer with 10 nodes, for which the consolidator neural network has two hidden layers each of 5000 nodes. NC is done with radius $r=20$ and sample size $n=1024$. The results are shown in Table 4. As in Split MNIST, it can be seen that AQC outperforms NC, EWC and SI in both cases. Using tanh-output features significantly reduces forgetting only in AQC.

Table 4: Average accuracy as percentage for Split CIFAR-10

Method	Task 1	Task 2	Task 3	Task 4	Task 5
Feature extractor with swish output
Joint training	89.45	73.28	57.27	52.35	50.10
Fine-tuning	89.45	37.95	26.70	22.90	16.75
EWC	89.45	37.93	26.70	22.90	17.57
SI	89.45	37.95	26.70	22.90	16.77
AQC	89.45	47.58	31.98	26.10	21.03
NC	89.45	49.15	22.46	19.53	11.21
Feature extractor with tanh output
Joint training	88.50	71.80	57.23	52.35	48.50
Fine-tuning	88.50	37.03	26.67	22.56	16.88
EWC	88.50	38.75	26.80	22.60	20.85
SI	88.50	37.95	26.72	22.59	19.22
AQC	88.50	61.63	45.08	40.53	39.24
NC	88.50	5.40	5.45	8.36	8.05
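To give a sense of why a full Hessian computation is tractable in the Split MNIST and Split CIFAR-10 experiments, the following sketch counts the parameters of the 10-node dense layer trained on top of the 32-dimensional and 128-dimensional extracted features. It assumes that the layer includes a bias vector, which is not stated explicitly above; the function name is hypothetical.

```python
def dense_parameter_count(n_inputs, n_outputs, bias=True):
    """Number of weights, plus biases if present, in one dense layer."""
    return n_inputs * n_outputs + (n_outputs if bias else 0)


# Feature dimensions of the layers from which features are extracted.
feature_sizes = {"Split MNIST": 32, "Split CIFAR-10": 128}

for name, n_features in feature_sizes.items():
    n_parameters = dense_parameter_count(n_features, 10)
    # A full Hessian matrix has one entry per pair of parameters.
    print(name, n_parameters, n_parameters ** 2)
```

Under this assumption, the Hessian matrices have 330 and 1,290 rows, respectively, so storing them in full is inexpensive, whereas the same computation for the parameters of the whole CNN would not be.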

4.5 Data availability

All the datasets used in this paper are publicly available. Iris and Wine are available from the scikit-learn package \parencite{pedregosa_scikit-learn_2011}, which is released under the 3-clause BSD license. MNIST, EMNIST, CIFAR-10 and CIFAR-100 are available from the PyTorch package \parencite{ansel_pytorch_2024}, which is also released under the 3-clause BSD license.

4.6 Code availability

Documented code for the experiments is available online under an MIT license: https://github.com/blackblitz/bcl. Reproducibility is ensured by fixing the seed in pseudo-random-number generation and using deterministic GPU algorithms. Detailed instructions to reproduce the results are provided in the README file of the repository.

5 Conclusion

We proposed Autodiff Quadratic Consolidation (AQC) and Neural Consolidation (NC) for continual learning based on sequential loss approximation. Although they are not scalable to very large neural networks, they can be used together with fixed pre-trained feature extractors. In class-incremental learning settings with small datasets, methods based on quadratic approximation of the loss function, such as EWC, SI and AQC, perform as poorly as fine-tuning, and NC performs better. With large datasets, when used with pre-training, AQC performs very well and its performance can be improved by using tanh-output features, probably because the output of the tanh function is between -1 and 1, leading to a loss function that can be better approximated with a quadratic function.

NC is conceptually appealing as neural networks are much more powerful function approximators than quadratic functions. However, in practice, it is difficult to tune the hyperparameters. One possible reason that it does not work well with large datasets is that larger datasets lead to steeper loss functions, which may affect gradient descent. Moreover, although we have access to the loss function to be approximated, we need to take a sample of points from it to train the consolidator neural network. It would be better if we could fit directly to the loss function without generating a sample of points.

We hope that our work provides some insights on sequential loss approximation for continual learning.

Acknowledgments and Disclosure of Funding

We thank Pengcheng Hao, Yang Li and anonymous reviewers for their feedback.

\printbibliography