On Sequential Loss Approximation for Continual Learning

Menghao Waiyan William Zhu (ORCID: 0009-0001-4180-5791; corresponding author: [email protected])
Tsinghua-Berkeley Shenzhen Institute
Tsinghua Shenzhen International Graduate School
Shenzhen

Ercan Engin Kuruoğlu
Tsinghua-Berkeley Shenzhen Institute
Tsinghua Shenzhen International Graduate School
Shenzhen

Abstract

We introduce for continual learning Autodiff Quadratic Consolidation (AQC), which approximates the previous loss function with a quadratic function, and Neural Consolidation (NC), which approximates the previous loss function with a neural network. Although they are not scalable to large neural networks, they can be used with a fixed pre-trained feature extractor. We empirically study these methods in class-incremental learning, for which regularization-based methods produce unsatisfactory results, unless combined with replay. We find that for small datasets, quadratic approximation of the previous loss function leads to poor results, even with full Hessian computation, and NC could significantly improve the predictive performance, while for large datasets, when used with a fixed pre-trained feature extractor, AQC provides superior predictive performance. We also find that using tanh-output features can improve the predictive performance of AQC. In particular, in class-incremental Split MNIST, when a Convolutional Neural Network (CNN) with tanh-output features is pre-trained on EMNIST Letters and used as a fixed pre-trained feature extractor, AQC can achieve predictive performance comparable to joint training.

1 Introduction

Continual learning, also known as incremental learning or lifelong learning, is learning from a sequence of datasets called tasks which are not necessarily identically distributed. When a neural network (including a generalized linear model, which is essentially a neural network with no hidden layers) is trained on a task and fine-tuned on a new task, it loses predictive performance on the old task. This phenomenon is known as catastrophic forgetting \parencite{mccloskey_catastrophic_1989} and can be prevented by joint training on all tasks, but previous data may not be accessible due to computational or privacy constraints. Thus, the goal of continual learning is to prevent catastrophic forgetting without accessing previous data.

Continual-learning settings vary considerably, and three main types are commonly studied \parencite{van_de_ven_three_2022}:

  1. Task-incremental learning, in which task IDs are provided and the classes change between tasks

  2. Domain-incremental learning, in which task IDs are not provided and the classes remain the same between tasks

  3. Class-incremental learning, in which task IDs are not provided and the classes change between tasks

Multi-headed models, in which there is one output head for each task, are commonly used for task-incremental learning. However, task-incremental learning has been criticized because task IDs make the problem of continual learning easier \parencite{farquhar_towards_2019}. In fact, if there is only one class per task, then the task ID could be used to make a perfect prediction. In accordance with the desiderata proposed by \textcite{farquhar_towards_2019}, we focus on class-incremental learning with single-headed models on more than two similar tasks with no access to previous data.

Sequential Bayesian inference provides an elegant approach to continual learning. We assume that the parameters of the model are random and use the previous posterior Probability Density Function (PDF) as the current prior PDF. For Maximum-A-Posteriori (MAP) prediction, this can be formulated as sequential loss approximation. In this work, we use Autodiff Quadratic Consolidation (AQC), which approximates the previous loss function with a quadratic function, and Neural Consolidation (NC), which approximates the previous loss function with a neural network. Since these methods are not scalable to large models with millions of parameters, we use a fixed pre-trained feature extractor for image datasets. We show empirically that for small datasets, neural-network approximation of the previous loss function leads to significant improvements in predictive performance over quadratic approximation of the previous loss function and that using tanh-output features improves the predictive performance of AQC.

2 Sequential Bayesian inference and sequential loss approximation

Let $\theta$ be the parameters of the neural network, $x_{1:t}$ be the input data from time $1$ to $t$ and $y_{1:t}$ be the output data from time $1$ to $t$. The independence assumptions are described by the following Bayesian network. In particular, $x_{1:t}$ are assumed to be independent, and given $\theta$ and $x_{1:t}$, $y_{1:t}$ are assumed to be conditionally independent.

[Figure: Bayesian network over the nodes $\theta$, $x_1$, $y_1$, $x_2$, $y_2$, $\ldots$, $x_t$, $y_t$.]

For $i = 1, 2, \ldots, t$, $(x_i, y_i)$ represents a task or dataset at time $i$, and the individual points in each task are assumed to be independent and identically distributed.

Tasks are assumed to be similar, so the likelihood PDF at time $t$, $p(y_t|\theta,x_t)$, is assumed to be the same for all tasks. It is defined by the neural network, which maps $x_t$ to some parameters of $p(y_t|\theta,x_t)$. In binary classification, $p(y_t|\theta,x_t)$ is Bernoulli, and the neural network maps to the log-odds (logit) of the positive class, while in multi-class classification, $p(y_t|\theta,x_t)$ is categorical, and the neural network maps to the un-normalized log-probabilities of the classes. The posterior PDF at time $t$, $p(\theta|y_{1:t},x_{1:t})$, can be obtained by recursive application of Bayes’ rule, where the posterior PDF at time $t-1$ is used as the prior PDF at time $t$:

\[
p(\theta|y_{1:t},x_{1:t}) = \frac{1}{z_t} p(\theta|y_{1:t-1},x_{1:t-1}) p(y_t|\theta,x_t)
\]

where $z_t = \int_{-\infty}^{\infty} p(\theta|y_{1:t-1},x_{1:t-1}) p(y_t|\theta,x_t) \, d\theta$ is a normalization term that does not depend on $\theta$.

MAP prediction uses the maximum $\theta^*_t$ of the posterior PDF at time $t$ to make a point prediction. Since normalizing constants do not affect the maximum, $\theta^*_t$ is also equal to the maximum of the joint PDF $p(\theta,y_{1:t}|x_{1:t})$. Maximizing the PDF is equivalent to minimizing the negative log PDF. Thus, the total loss at time $t$ could be defined as $\mathfrak{L}_t(\theta) = -\ln p(\theta,y_{1:t}|x_{1:t})$ and the task loss at time $t$ as $\mathfrak{l}_t(\theta) = -\ln p(y_t|\theta,x_t)$. This leads to a recursion of loss functions for $t = 1, 2, \ldots$:

\[
\mathfrak{L}_t(\theta) = \mathfrak{L}_{t-1}(\theta) + \mathfrak{l}_t(\theta)
\]

Methods based on this recursion are also known as regularization-based methods as $\mathfrak{L}_{t-1}$ can be seen as regularizing $\mathfrak{l}_t$. Note that $\mathfrak{L}_0 = \mathfrak{l}_0$, the negative log prior, may be in un-normalized form. In particular, a Gaussian prior PDF with zero mean vector and covariance matrix $\alpha^{-1} I$ (i.e. precision matrix $\alpha I$) corresponds to an $L^2$ regularization term $\frac{1}{2}\alpha\lVert\theta\rVert_2^2$. In binary classification, $\mathfrak{l}_t$ is the sigmoid cross-entropy loss, while in multi-class classification, $\mathfrak{l}_t$ is the softmax cross-entropy loss.
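As a brief worked step, the recursion can be recovered by factorizing the joint PDF under the independence assumptions stated earlier. The sketch below writes $p(\theta)$ for the prior PDF, a symbol that the text leaves implicit, and ignores additive constants that do not depend on $\theta$.

```latex
\begin{align*}
\mathfrak{L}_t(\theta)
  &= -\ln p(\theta, y_{1:t} | x_{1:t}) \\
  &= -\ln p(\theta) - \sum_{i=1}^{t} \ln p(y_i | \theta, x_i) \\
  &= \mathfrak{l}_0(\theta) + \sum_{i=1}^{t} \mathfrak{l}_i(\theta) \\
  &= \mathfrak{L}_{t-1}(\theta) + \mathfrak{l}_t(\theta)
\end{align*}
```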

2.1 Autodiff quadratic consolidation

Quadratic approximation of $\mathfrak{L}_{t-1}$ corresponds to Laplace approximation of the previous posterior PDF and requires computing the minimum and the Hessian matrix at the minimum. Since the Hessian operator is linear, successive quadratic approximation results in addition of the Hessian matrices of the previous tasks:

\[
\hat{\mathfrak{L}}_t(\theta) = \frac{1}{2}(\theta-\theta^*_{t-1})^T \left(\sum_{i=0}^{t-1} H(\mathfrak{l}_i)(\theta^*_i)\right) (\theta-\theta^*_{t-1}) + \mathfrak{l}_t(\theta)
\]

Autodiff Quadratic Consolidation (AQC) computes Hessian matrices by using automatic differentiation. The computational cost depends on the size of the model, i.e. the number of parameters $\theta$, and the size of the dataset. If the model is small but the dataset is large, the computation can be made tractable by computing in mini-batches. Since the Hessian operator is linear, the Hessian matrix of the batch loss function is equal to the sum of the Hessian matrices of the mini-batch loss functions. If $\mathfrak{l}_{t,1}, \mathfrak{l}_{t,2}, \ldots, \mathfrak{l}_{t,b}$ are the mini-batch loss functions corresponding to the batch loss function $\mathfrak{l}_t$, then

\[
H(\mathfrak{l}_t)(\theta^*_t) = H\left(\sum_{j=1}^{b} \mathfrak{l}_{t,j}\right)(\theta^*_t) = \sum_{j=1}^{b} H(\mathfrak{l}_{t,j})(\theta^*_t)
\]
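The mini-batch identity can be checked numerically with a minimal sketch. It is an illustration under stated assumptions and not the implementation used in this work: it replaces automatic differentiation with the closed-form Hessian of a logistic-regression loss (summed, not averaged, over data points), and all function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hessian(theta, x):
    """Closed-form Hessian of the summed sigmoid cross-entropy loss.

    Assumption: used here as a stand-in for a Hessian obtained by
    automatic differentiation. It does not depend on the labels.
    """
    p = sigmoid(x @ theta)
    weights = p * (1.0 - p)
    return (x * weights[:, None]).T @ x

def minibatch_hessian(theta, x, batch_size):
    """Accumulate the Hessian one mini-batch at a time."""
    total = np.zeros((theta.size, theta.size))
    for start in range(0, len(x), batch_size):
        total += logistic_hessian(theta, x[start:start + batch_size])
    return total

def quadratic_penalty(theta, theta_star, hessian_sum):
    """Quadratic term of the approximate loss around theta_star."""
    diff = theta - theta_star
    return 0.5 * diff @ hessian_sum @ diff

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))           # synthetic inputs
theta_star = np.array([0.5, -1.0, 2.0])  # stand-in for a minimum

h_full = logistic_hessian(theta_star, x)
h_batched = minibatch_hessian(theta_star, x, batch_size=128)
print(np.allclose(h_full, h_batched))
```

The identity relies on the batch loss being a sum of the mini-batch losses. If the losses were averaged instead, the mini-batch Hessians would need to be reweighted by mini-batch size.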

2.2 Neural consolidation

Neural Consolidation (NC) uses a consolidator neural network $h$ with parameters $\phi^*_{t-1}$ to approximate $\mathfrak{L}_{t-1}$:

\[
\hat{\mathfrak{L}}_t(\theta) = h(\theta;\phi^*_{t-1}) + \mathfrak{l}_t(\theta)
\]

After finding the minimum $\theta^*_{t-1}$, the consolidator neural network is trained by using mini-batch gradient descent. At each step, a sample of $n$ points $\theta_1, \theta_2, \ldots, \theta_n$ is generated uniformly at random within a ball of radius $r$ around $\theta^*_{t-1}$, and the consolidator loss function is minimized to obtain $\phi^*_{t-1}$:

\[
\mathfrak{L}_c(\phi) = \frac{1}{2}\beta\lVert\phi\rVert_2^2 + \sum_{i=1}^{n} \mathfrak{l}_{c,i}(\phi)
\]

where $\beta$ is an $L^2$ regularization factor, and each $\mathfrak{l}_{c,i}(\phi)$ is the Huber loss:

\[
\mathfrak{l}_{c,i}(\phi) = \begin{cases}
\frac{1}{2}(h(\theta_i;\phi) - \hat{\mathfrak{L}}_{t-1}(\theta_i))^2 & \text{if } |h(\theta_i;\phi) - \hat{\mathfrak{L}}_{t-1}(\theta_i)| \leq 1 \\
|h(\theta_i;\phi) - \hat{\mathfrak{L}}_{t-1}(\theta_i)| - \frac{1}{2} & \text{otherwise}
\end{cases}
\]

If the dataset is large, the true loss values can be computed by adding the mini-batch loss values.
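The sampling step and the consolidator loss can be sketched as follows. This is a minimal illustration under assumptions: the text does not say how the uniform sample within the ball is drawn, so the sketch uses one common construction (a random direction scaled by a radius proportional to $u^{1/d}$ for uniform $u$), and the names `sample_in_ball`, `huber` and `consolidator_loss` are hypothetical.

```python
import numpy as np

def sample_in_ball(center, radius, n, rng):
    """Draw n points uniformly from the ball of given radius around center."""
    d = center.size
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scaling by u**(1/d) makes the points uniform in volume, not in radius.
    radii = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return center + radii * directions

def huber(prediction, target):
    """Huber loss with threshold 1, as in the piecewise definition."""
    a = np.abs(prediction - target)
    return np.where(a <= 1.0, 0.5 * a ** 2, a - 0.5)

def consolidator_loss(phi, predictions, targets, beta):
    """L2 penalty on phi plus the summed Huber loss over the sample."""
    return 0.5 * beta * phi @ phi + huber(predictions, targets).sum()

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])  # stand-in for the previous minimum
thetas = sample_in_ball(theta_star, radius=10.0, n=1024, rng=rng)
print(np.linalg.norm(thetas - theta_star, axis=1).max() <= 10.0)
```

In a full training loop, `predictions` would be the consolidator outputs $h(\theta_i;\phi)$ and `targets` the values $\hat{\mathfrak{L}}_{t-1}(\theta_i)$ evaluated at the sampled points.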

2.3 Limitations

In sequential loss approximation, the total number of classes must be known in advance, even if the classes are learned incrementally, so that the number of parameters $\theta$ is fixed. Otherwise, the assumptions made above will be violated and sequential Bayesian inference will no longer be valid.

AQC and NC are local approximations, so if the minimum moves very far when learning a new task, the approximation will not be very accurate. NC is also sensitive to hyperparameters such as the radius $r$, the sample size $n$ and the architecture of the consolidator neural network. These algorithms do not scale with model size, but the model size can be reduced by using a fixed pre-trained feature extractor. On the other hand, they scale with dataset size, and memory-mapped files can be used to increase the efficiency of the algorithms.

3 Related work

There are several works using quadratic approximation of the previous loss function. Elastic Weight Consolidation (EWC) approximates the Hessian matrix by using a diagonal Fisher information matrix, which is computed by averaging the squared gradients over mini-batches \parencite{kirkpatrick_overcoming_2017}. However, it adds a penalty term for every new task, and \textcite{huszar_note_2018} points out that this is not well justified and that the penalty term can instead be updated by adding the Hessian matrices. Synaptic Intelligence (SI) also performs a diagonal approximation of the Hessian. However, it does so by using the change in loss during gradient descent, which is computed by summing the product of the gradient and the change in the parameter values during gradient descent \parencite{zenke_continual_2017}. Online Structured Laplace Approximation (OSLA) uses Kronecker factorization to perform a block-diagonal approximation of the Hessian, in which each diagonal block of the matrix corresponds to a layer of the neural network \parencite{ritter_online_2018}.

The above methods are designed to be scalable to large models and have mainly been shown to perform well in Permuted MNIST \parencite{goodfellow_empirical_2015}, with OSLA performing better than EWC and SI. In class-incremental Split MNIST, EWC and SI have been shown to perform as poorly as fine-tuning \parencite{van_de_ven_three_2022}. It has also been observed that regularization-based methods generally perform worse than replay-based methods, which use stored or generated previous data \parencite{mai_online_2022,van_de_ven_three_2022}. In our work, we examine how much predictive performance we can achieve by using a full Hessian computation and whether the quadratic approximation can be improved by approximation with a neural network, which is a powerful function approximator.

Recently, two types of pre-training have been studied for continual learning: pre-training for initialization and pre-training for feature extraction. In the former, the parameters of the neural network are initialized with the pre-trained parameters rather than randomly, which has been reported to reduce forgetting \parencite{lee_pre-trained_2023,mehta_empirical_2023}. In the latter, a pre-trained model is used as a feature extractor, which has also been reported to reduce forgetting \parencite{hu_continual_2021,li_continual_2022,yang_continual_2023}. Intuitively, neural networks do not need to learn lower-level features continually as much as they need to learn higher-level features, just as humans do not need to learn continually to recognize edges, corners and shapes as much as they need to learn to recognize objects. It has also been empirically shown that forgetting is more pronounced in later layers of a neural network \parencite{liu_generative_2020}. Thus, in our work, we use pre-training for feature extraction for image datasets.

4 Experiments

In our experiments, we compare EWC, SI, AQC and NC in class-incremental-learning settings based on small and large datasets. For EWC, we use the variant with Huszár’s corrected penalty \parencite{huszar_note_2018} and the Fisher information matrix, which is computed based on the empirical distribution of $x$ and the conditional distribution of $y$ given $x$ defined by the model (rather than the empirical Fisher information matrix, which is computed based on the empirical distribution of $x$ and $y$) \parencite{martens_new_2020}.
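The distinction between the two Fisher matrices can be made concrete with a minimal sketch for a logistic-regression model, where the per-example gradient of the task loss is $(p - y)x$ with $p$ the predicted probability. The sketch is an illustration only: the function names are hypothetical, the data are synthetic, and the diagonals are summed over data points, which is an assumption about scaling.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_diagonal(theta, x):
    """Diagonal Fisher: labels are averaged over the model's own p(y | x)."""
    p = sigmoid(x @ theta)
    # For y ~ Bernoulli(p), the expectation of (p - y)^2 is p (1 - p).
    return ((p * (1.0 - p))[:, None] * x ** 2).sum(axis=0)

def empirical_fisher_diagonal(theta, x, y):
    """Diagonal empirical Fisher: uses the observed labels y."""
    p = sigmoid(x @ theta)
    return (((p - y) ** 2)[:, None] * x ** 2).sum(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))                    # synthetic inputs
y = rng.integers(0, 2, size=200).astype(float)   # synthetic binary labels
theta = np.array([0.3, -0.7, 1.2])               # arbitrary parameters

fisher = fisher_diagonal(theta, x)
emp_fisher = empirical_fisher_diagonal(theta, x, y)
print(fisher, emp_fisher)
```

For this particular model, the Fisher diagonal coincides with the diagonal of the Hessian of the summed task loss, whereas the empirical Fisher diagonal depends on the observed labels and generally differs.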

The experiments are run on an on-premise NVIDIA GeForce RTX 4090 GPU with 24 GB of memory. In every experiment, swish activation functions are used, except in some pre-trained neural networks that use tanh activation functions before the final layer; the regularization factor is $\alpha = \beta = 0.1$, and optimization is done by using the Adam optimizer with learning rate 0.01. For Split Iris and Split Wine, the datasets are split randomly with 80% for training and 20% for testing, while for Split MNIST and Split CIFAR-10, the original dataset splits are used. Then, each split dataset is split by class into multiple tasks. For NC, hyperparameters are chosen by manual tuning, while for EWC and SI, they are fixed at $\lambda = 1$ and $\xi = 1$, respectively. Evaluation is done by using the average accuracy, which is the average of testing accuracy scores of the tasks up to and including the current task.
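The evaluation metric can be illustrated with a short sketch. The per-task accuracies below are hypothetical and describe a model that remembers only the most recent task; the function name is also an assumption.

```python
def average_accuracy(task_accuracies):
    """Mean test accuracy over the tasks seen so far, current task included."""
    return sum(task_accuracies) / len(task_accuracies)

# Hypothetical per-task test accuracies (%) after learning tasks 1, 2 and 3,
# for a model that classifies only the latest task correctly.
history = [[100.0], [0.0, 100.0], [0.0, 0.0, 100.0]]
scores = [round(average_accuracy(accs), 2) for accs in history]
print(scores)  # [100.0, 50.0, 33.33]
```

Under this metric, complete forgetting of earlier tasks yields an average accuracy that decays as one over the number of tasks seen.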

4.1 Split Iris

Iris is a dataset consisting of 150 instances of flowers with 4 features (sepal length, sepal width, petal length and petal width) and a 3-class label (setosa, versicolor and virginica), and Split Iris splits the dataset into 3 tasks by class. Before evaluating on Split Iris, we visualize loss functions by using Split Iris 1, which consists of one feature (petal length) and a binary label indicating whether the species is virginica or not, and predictions by using Split Iris 2, which consists of two features (petal length and petal width) and a 3-class label.
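Splitting a dataset into tasks by class can be sketched as below. This is a minimal illustration with a toy array standing in for a 3-class dataset with 4 features (not the actual Iris data), and `split_by_class` is a hypothetical helper name.

```python
import numpy as np

def split_by_class(x, y, class_groups):
    """Return one (x, y) pair per task, keeping only the classes in each group."""
    tasks = []
    for group in class_groups:
        mask = np.isin(y, group)
        tasks.append((x[mask], y[mask]))
    return tasks

# Toy stand-in data: 6 instances, 4 features, 3 classes.
x = np.arange(24, dtype=float).reshape(6, 4)
y = np.array([0, 1, 2, 0, 1, 2])

# One class per task, giving 3 tasks.
tasks = split_by_class(x, y, class_groups=[[0], [1], [2]])
print([task_y.tolist() for _, task_y in tasks])
```

Passing groups with more than one class, such as pairs of digits, gives tasks with several classes each.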

For Split Iris 1, we use two 2-parameter models for convenience of visualization of loss functions: a logistic-regression model, for which the consolidator neural network has two hidden layers each of 50 nodes, and a neural network with one hidden layer of one node and no bias, for which the consolidator neural network has two hidden layers each of 200 nodes. NC is done with radius $r = 10$ and sample size $n = 1024$. The visualization of the loss functions is shown in Figure 1, and the visualization of the final loss functions under different values of the radius $r$ and sample size $n$ is shown in Figure 2. It can be seen that NC could approximate the loss functions significantly better than EWC, SI and AQC, but NC is sensitive to hyperparameters such as the radius $r$ and the sample size $n$.

For Split Iris 2, we use the following two models: a softmax-regression model, for which the consolidator neural network has two hidden layers each of 50 nodes, and a neural network with one hidden layer of 3 nodes, for which the consolidator neural network has two hidden layers each of 200 nodes. NC is done with radius $r = 10$ and sample size $n = 1024$. The visualizations of the predictions are shown in Figure 3. It can be seen that the predictions of NC are better than those of EWC, SI and AQC, especially on the softmax-regression model. The loss function of a neural network is much more difficult to fit, so the results of NC on the neural network are not as good as those on the softmax-regression model.

For Split Iris, we use the following two models: a softmax-regression model, for which the consolidator neural network has two hidden layers each of 50 nodes, and a neural network with one hidden layer of one node and no bias, for which the consolidator neural network has two hidden layers each of 200 nodes. NC is done with radius $r = 20$ and sample size $n = 1024$. The results are shown in Table 1. It can be seen that EWC, SI and AQC perform as poorly as fine-tuning, and NC outperforms them significantly.

Figure 1: Visualization of loss functions of the two-parameter models for Split Iris 1: (a) logistic-regression model; (b) neural network. The $x$-axis and $y$-axis represent the first parameter and the second parameter of the model, respectively. The white cross indicates the minimum found by using gradient descent.

Figure 2: Visualization of the final loss functions for Split Iris 1 learned by NC (a) under different values of $r$ and (b) under different values of $n$. The $x$-axis and $y$-axis represent the first parameter and the second parameter of the model, respectively. The white cross indicates the minimum found by using gradient descent.

Figure 3: Visualization of predictions for Split Iris 2: (a) softmax-regression model; (b) neural network. The $x$-axis and $y$-axis represent the first feature (petal length) and the second feature (petal width), respectively. Red, green and blue points represent the training data of classes setosa, virginica and versicolor, respectively. The amounts of red, green and blue in the pseudo-color plots represent the output probability of setosa, virginica and versicolor, respectively.

Table 1: Average accuracy (%) for Split Iris

\begin{tabular}{lrrr}
\hline
Method & Task 1 & Task 2 & Task 3 \\
\hline
\multicolumn{4}{l}{Softmax-regression model} \\
Joint training & 100.00 & 100.00 & 100.00 \\
Fine-tuning & 100.00 & 50.00 & 33.33 \\
EWC & 100.00 & 50.00 & 33.33 \\
SI & 100.00 & 50.00 & 33.33 \\
AQC & 100.00 & 50.00 & 33.33 \\
NC & 100.00 & 100.00 & 93.33 \\
\hline
\multicolumn{4}{l}{Neural network} \\
Joint training & 100.00 & 100.00 & 100.00 \\
Fine-tuning & 100.00 & 50.00 & 33.33 \\
EWC & 100.00 & 50.00 & 33.33 \\
SI & 100.00 & 50.00 & 33.33 \\
AQC & 100.00 & 50.00 & 33.33 \\
NC & 100.00 & 100.00 & 100.00 \\
\hline
\end{tabular}

4.2 Split Wine

Wine is a dataset consisting of 178 instances of wine each with 13 features and a 3-class label, and Split Wine splits the dataset into 3 tasks by class. We use the following two models: a softmax-regression model, for which the consolidator neural network has two hidden layers each of 50 nodes, and a neural network with one hidden layer of 3 nodes, for which the consolidator neural network has two hidden layers each of 2000 nodes. NC is done with radius $r = 20$ and sample size $n = 1024$. The results are shown in Table 2. It can be seen that EWC, SI and AQC perform as poorly as fine-tuning, and NC outperforms them significantly on the softmax-regression model, but on the neural network, NC performs well only up to the second task.

Table 2: Average accuracy as percentage for Split Wine
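\begin{tabular}{lrrr}
\hline
Method & Task 1 & Task 2 & Task 3 \\
\hline
\multicolumn{4}{l}{Softmax-regression model} \\
Joint training & 100.00 & 97.06 & 95.01 \\
Fine-tuning & 100.00 & 50.00 & 33.33 \\
EWC & 100.00 & 50.00 & 33.33 \\
SI & 100.00 & 50.00 & 33.33 \\
AQC & 100.00 & 50.00 & 33.33 \\
NC & 100.00 & 94.12 & 76.08 \\
\hline
\multicolumn{4}{l}{Neural network} \\
Joint training & 100.00 & 97.06 & 91.68 \\
Fine-tuning & 100.00 & 50.00 & 33.33 \\
EWC & 100.00 & 50.00 & 33.33 \\
SI & 100.00 & 50.00 & 33.33 \\
AQC & 100.00 & 50.00 & 33.33 \\
NC & 100.00 & 91.18 & 33.33 \\
\hline
\end{tabular}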
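
The fine-tuning rows in Tables 1 and 2 follow the pattern $100/t$ after task $t$. The sketch below reproduces that pattern under two assumptions that are not restated in this section: that average accuracy is the unweighted mean of per-task accuracies, and that a model with complete forgetting classifies the latest task perfectly and every earlier task incorrectly. The function names are illustrative only.

```python
def average_accuracy(per_task_accuracies):
    """Unweighted mean of per-task accuracies, as a percentage."""
    return 100.0 * sum(per_task_accuracies) / len(per_task_accuracies)


def forgetful_accuracies(num_tasks_seen):
    """Idealized model: 1.0 on the latest task and 0.0 on every earlier task."""
    return [0.0] * (num_tasks_seen - 1) + [1.0]


# Average accuracy after each of three single-class tasks.
curve = [round(average_accuracy(forgetful_accuracies(t)), 2) for t in (1, 2, 3)]
print(curve)  # [100.0, 50.0, 33.33]
```

Under these assumptions, a method whose row matches this curve has retained nothing about earlier classes.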

4.3 Split MNIST

MNIST is a dataset consisting of 60,000 instances of handwritten digits, each with a $28\times 28$ image and a 10-class label \parencite{lecun_gradient_1998}. Split MNIST splits the dataset into 5 tasks, the first consisting of 1’s and 2’s, the second consisting of 3’s and 4’s and so on. EMNIST Letters consists of 26 classes of $28\times 28$ images of handwritten letters in both upper case and lower case \parencite{cohen_emnist_2017} and has no class overlap with MNIST.
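
A class-incremental split of this kind can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the experiments: the helper `split_by_class`, the toy label vector and the grouping into consecutive class pairs are all assumptions made for the example.

```python
from collections import defaultdict


def split_by_class(labels, class_groups):
    """Return one list of example indices per task.

    labels: sequence of integer class labels, one per example.
    class_groups: list of lists; class_groups[t] holds the classes of task t.
    """
    # Map each class to the task that owns it.
    task_of_class = {}
    for task, group in enumerate(class_groups):
        for cls in group:
            task_of_class[cls] = task

    # Collect example indices per task, skipping classes outside every group.
    indices = defaultdict(list)
    for i, label in enumerate(labels):
        if label in task_of_class:
            indices[task_of_class[label]].append(i)
    return [indices[task] for task in range(len(class_groups))]


# Toy labels standing in for a real 10-class label vector.
toy_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1]
# Hypothetical grouping: five tasks of two consecutive classes each.
pairs = [[2 * t, 2 * t + 1] for t in range(5)]
tasks = split_by_class(toy_labels, pairs)
print(tasks[0])  # indices of the examples whose label is 0 or 1
```

Passing single-class groups such as `[[0], [1], [2]]` gives the three-task splits used for Split Iris and Split Wine.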

We pre-train two Convolutional Neural Networks (CNNs) on EMNIST Letters and use them as fixed feature extractors. Each consists of a convolutional layer of 32 $3\times 3$ filters, an average-pooling layer with a stride of 2, a convolutional layer of 32 $3\times 3$ filters, an average-pooling layer with a stride of 2, a dense layer of 32 nodes and finally a dense layer of 26 nodes. Features are extracted before the final layer, that is, from the output of the dense layer of 32 nodes. In that layer, one CNN uses swish activation functions, while the other uses tanh activation functions.

For Split MNIST, we use the pre-trained CNNs to extract features and perform continual learning on a dense layer with 10 nodes, for which the consolidator neural network has two hidden layers of 5000 nodes each. NC is done with radius $r=20$ and sample size $n=1024$. The results are shown in Table 3. AQC outperforms NC, EWC and SI with both feature extractors. Notably, using tanh-output features significantly reduces forgetting in EWC, SI, AQC and NC, although it also slightly reduces the final average accuracy in joint training (from 95.06\% to 93.66\%). In particular, with tanh-output features, AQC achieves a final average accuracy of 92.26\%, which is comparable to joint training.

Table 3: Average accuracy as percentage for Split MNIST
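\begin{tabular}{lrrrrr}
\hline
Method & Task 1 & Task 2 & Task 3 & Task 4 & Task 5 \\
\hline
\multicolumn{6}{l}{Feature extractor with swish output} \\
Joint training & 100.00 & 98.63 & 98.60 & 97.04 & 95.06 \\
Fine-tuning & 100.00 & 49.22 & 33.32 & 24.96 & 19.54 \\
EWC & 100.00 & 49.22 & 33.32 & 24.96 & 19.54 \\
SI & 100.00 & 49.22 & 33.32 & 24.96 & 19.54 \\
AQC & 100.00 & 96.14 & 81.04 & 61.79 & 47.55 \\
NC & 100.00 & 56.05 & 45.75 & 33.29 & 26.23 \\
\hline
\multicolumn{6}{l}{Feature extractor with tanh output} \\
Joint training & 99.86 & 98.54 & 98.13 & 96.43 & 93.66 \\
Fine-tuning & 99.86 & 49.29 & 33.85 & 24.96 & 19.69 \\
EWC & 99.86 & 76.43 & 57.72 & 40.33 & 26.03 \\
SI & 99.86 & 76.31 & 57.27 & 39.41 & 25.33 \\
AQC & 99.86 & 98.52 & 97.96 & 96.11 & 92.26 \\
NC & 99.86 & 85.26 & 57.68 & 43.34 & 34.79 \\
\hline
\end{tabular}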

4.4 Split CIFAR-10

CIFAR-10 is a dataset consisting of 60,000 instances, each with a $32\times 32$ natural image and a 10-class label \parencite{krizhevsky_learning_2009}. Split CIFAR-10 splits the dataset into 5 tasks, similarly to Split MNIST. CIFAR-100 consists of 100 classes of $32\times 32$ natural images and has no class overlap with CIFAR-10 \parencite{krizhevsky_learning_2009}.

We pre-train two CNNs on CIFAR-100 and use them as fixed feature extractors. Each consists of two convolutional layers of 128 $3\times 3$ filters, an average-pooling layer with a stride of 2, two convolutional layers of 256 $3\times 3$ filters, an average-pooling layer with a stride of 2, a dense layer of 256 nodes, a dense layer of 128 nodes and finally a dense layer of 100 nodes. Features are extracted before the final layer, that is, from the output of the dense layer of 128 nodes. In that layer, one CNN uses swish activation functions, while the other uses tanh activation functions.
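
Fixing the feature extractor keeps the continually trained model small, which is what makes full Hessian computation affordable. The arithmetic below is a rough sketch for a 10-output dense layer on top of the extracted features. It assumes that the layer has a bias term and that the feature widths are 32 and 128, the sizes of the penultimate dense layers described above. The helper `head_size` is illustrative only.

```python
def head_size(num_features, num_classes=10, use_bias=True):
    """Number of trainable parameters in a single dense layer."""
    return num_features * num_classes + (num_classes if use_bias else 0)


# Feature widths taken from the penultimate dense layers described in the text.
for name, width in [("Split MNIST", 32), ("Split CIFAR-10", 128)]:
    p = head_size(width)
    # A full Hessian has p * p entries; float32 uses 4 bytes per entry.
    print(f"{name}: {p} parameters, {p * p} Hessian entries, "
          f"{4 * p * p / 1e6:.2f} MB in float32")
```

Under these assumptions the Hessian has on the order of $10^5$ to $10^6$ entries, whereas the same computation over all weights of a deep network grows quadratically with the parameter count.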

For Split CIFAR-10, we use the pre-trained CNNs to extract features and perform continual learning on a dense layer with 10 nodes, for which the consolidator neural network has two hidden layers of 5000 nodes each. NC is done with radius $r=20$ and sample size $n=1024$. The results are shown in Table 4. As in Split MNIST, AQC outperforms NC, EWC and SI in final average accuracy with both feature extractors. Using tanh-output features significantly reduces forgetting only in AQC, whose final average accuracy rises from 21.03\% to 39.24\%.

Table 4: Average accuracy as percentage for Split CIFAR-10
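\begin{tabular}{lrrrrr}
\hline
Method & Task 1 & Task 2 & Task 3 & Task 4 & Task 5 \\
\hline
\multicolumn{6}{l}{Feature extractor with swish output} \\
Joint training & 89.45 & 73.28 & 57.27 & 52.35 & 50.10 \\
Fine-tuning & 89.45 & 37.95 & 26.70 & 22.90 & 16.75 \\
EWC & 89.45 & 37.93 & 26.70 & 22.90 & 17.57 \\
SI & 89.45 & 37.95 & 26.70 & 22.90 & 16.77 \\
AQC & 89.45 & 47.58 & 31.98 & 26.10 & 21.03 \\
NC & 89.45 & 49.15 & 22.46 & 19.53 & 11.21 \\
\hline
\multicolumn{6}{l}{Feature extractor with tanh output} \\
Joint training & 88.50 & 71.80 & 57.23 & 52.35 & 48.50 \\
Fine-tuning & 88.50 & 37.03 & 26.67 & 22.56 & 16.88 \\
EWC & 88.50 & 38.75 & 26.80 & 22.60 & 20.85 \\
SI & 88.50 & 37.95 & 26.72 & 22.59 & 19.22 \\
AQC & 88.50 & 61.63 & 45.08 & 40.53 & 39.24 \\
NC & 88.50 & 5.40 & 5.45 & 8.36 & 8.05 \\
\hline
\end{tabular}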

4.5 Data availability

All the datasets used in this paper are publicly available. Iris and Wine are available from the scikit-learn package \parencite{pedregosa_scikit-learn_2011}, which is released under the 3-clause BSD license. MNIST, EMNIST, CIFAR-10 and CIFAR-100 are available from the PyTorch package \parencite{ansel_pytorch_2024}, which is also released under the 3-clause BSD license.

4.6 Code availability

Documented code for the experiments is available online under an MIT license: https://github.com/blackblitz/bcl. Reproducibility is ensured by fixing the seed in pseudo-random-number generation and using deterministic GPU algorithms. Detailed instructions to reproduce the results are provided in the README file of the repository.

5 Conclusion

We proposed Autodiff Quadratic Consolidation (AQC) and Neural Consolidation (NC) for continual learning based on sequential loss approximation. Although they are not scalable to very large neural networks, they can be used together with fixed pre-trained feature extractors. In class-incremental learning settings with small datasets, methods based on quadratic approximation of the loss function, such as EWC, SI and AQC, perform as poorly as fine-tuning, and NC performs better. With large datasets, when used with a fixed pre-trained feature extractor, AQC outperforms EWC, SI and NC, and its performance can be improved by using tanh-output features, probably because the output of the tanh function lies between $-1$ and $1$, leading to a loss function that can be better approximated with a quadratic function.
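
The boundedness argument can be checked numerically. The short sketch below compares tanh with swish, taken here as $x\,\sigma(x)$, on a grid of pre-activation values. The grid range is an arbitrary choice for illustration, and the sketch shows only the range of the features, not its effect on the loss function.

```python
import math


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


# Pre-activation values on a grid from -10 to 10.
grid = [i / 10.0 for i in range(-100, 101)]
tanh_values = [math.tanh(x) for x in grid]
swish_values = [swish(x) for x in grid]

print(min(tanh_values), max(tanh_values))    # stays within [-1, 1]
print(min(swish_values), max(swish_values))  # bounded below, grows with x above
```

Tanh-output features therefore have a norm bounded by the square root of the feature dimension, while swish-output features can grow without bound for large pre-activations.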

NC is conceptually appealing as neural networks are much more powerful function approximators than quadratic functions. However, in practice, it is difficult to tune the hyperparameters. One possible reason that it does not work well with large datasets is that larger datasets lead to steeper loss functions, which may affect gradient descent. Moreover, although we have access to the loss function to be approximated, we need to take a sample of points from it to train the consolidator neural network. It would be better if we could fit directly to the loss function without generating a sample of points.
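
The sampling step mentioned above can be sketched as follows. The sampling distribution is not restated in this section, so drawing points uniformly from a ball of radius $r$ around the current parameters is an assumption of this sketch. The loss `toy_loss` and the parameter vector `theta_star` are hypothetical stand-ins.

```python
import math
import random


def sample_in_ball(center, radius, rng):
    """Draw one point uniformly from the ball of given radius around center."""
    d = len(center)
    # A normalized Gaussian vector gives a uniformly random direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in direction))
    # Scaling by u**(1/d) makes the point uniform in volume, not in radius.
    scale = radius * rng.random() ** (1.0 / d) / norm
    return [c + scale * v for c, v in zip(center, direction)]


def toy_loss(theta):
    """Hypothetical stand-in for the previous loss function."""
    return sum((t - 1.0) ** 2 for t in theta)


rng = random.Random(0)           # fixed seed for reproducibility
theta_star = [0.0, 0.0, 0.0]     # hypothetical current parameter vector
r, n = 20.0, 1024                # radius and sample size quoted in the experiments
points = [sample_in_ball(theta_star, r, rng) for _ in range(n)]
targets = [toy_loss(p) for p in points]  # regression targets for the consolidator
```

The pairs `(points, targets)` form the regression dataset on which a consolidator network would be trained. With a fixed sample size, the coverage of the ball becomes sparser as the parameter dimension grows.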

We hope that our work provides some insights on sequential loss approximation for continual learning.

Acknowledgments and Disclosure of Funding

We thank Pengcheng Hao, Yang Li and anonymous reviewers for their feedback.

\printbibliography