Last Iterate Convergence of Incremental Methods
and Applications in Continual Learning
Abstract
Incremental gradient and incremental proximal methods are a fundamental class of optimization algorithms used for solving finite sum problems, broadly studied in the literature. Yet, without strong convexity, their convergence guarantees have primarily been established for the ergodic (average) iterate. Motivated by applications in continual learning, we obtain the first convergence guarantees for the last iterate of both incremental gradient and incremental proximal methods, in general convex smooth (for both) and convex Lipschitz (for the proximal variants) settings. Our oracle complexity bounds for the last iterate nearly match (i.e., match up to a square-root-log or a log factor) the best known oracle complexity bounds for the average iterate, for both classes of methods. We further obtain generalizations of our results to weighted averaging of the iterates with increasing weights and for randomly permuted ordering of updates. We study incremental proximal methods as a model of continual learning with generalization and argue that a large amount of regularization is crucial to preventing catastrophic forgetting. Our results generalize last iterate guarantees for incremental methods compared to the state of the art, as such results were previously known only for overparameterized linear models, which correspond to convex quadratic problems with infinitely many solutions.
1 Introduction
We study the last iterate convergence of incremental (gradient and proximal) methods, which apply to problems of the form
$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{T} \sum_{t=1}^{T} f_t(x). \tag{1.1}$$
As is standard, we assume that each component function is convex and either smooth or Lipschitz-continuous and that a minimizer exists.
Incremental methods traverse all the component functions in a cyclic manner, updating their iterates by taking either gradient descent steps (in the case of incremental gradient methods) or proximal-point steps (in the case of incremental proximal methods) with respect to the individual component functions. For a more precise statement of these two classes of methods, see Sections 2 and 3. As in prior work (Bertsekas et al., 2011; Bertsekas, 2011; Li et al., 2019; Mishchenko et al., 2020; Cai et al., 2023a), we define the oracle complexity of these methods as the number of first-order or proximal oracle queries to individual component functions required to reach a solution with optimality gap $\epsilon$ on the worst-case instance from the considered problem class, where $\epsilon > 0$ is a given error parameter.
Our main motivation for studying the last iterate convergence of incremental methods comes from applications in continual learning (CL). In particular, CL models a sequential learning setting, where a machine learning model gets updated over time, based on the changing or evolving distribution of the data passed to the learner. A major challenge in such dynamic learning settings is the degradation of model performance on previously seen data, known as catastrophic forgetting (McCloskey and Cohen, 1989; Goodfellow et al., 2013), which has been well-documented in various empirical studies; see, e.g., recent surveys (De Lange et al., 2021; Parisi et al., 2019). On the theoretical front, however, much is still missing from the understanding of possibilities and limitations related to catastrophic forgetting, with results for basic learning settings being obtained only very recently (Evron et al., 2022, 2023; Lin et al., 2023b; Goldfarb and Hand, 2023; Goldfarb et al., 2024; Peng and Risteski, 2022; Peng et al., 2023; Chen et al., 2022; Cao et al., 2022; Balcan et al., 2015).
While there are different learning settings studied under the umbrella of CL, following recent work (Evron et al., 2022, 2023), we focus on CL settings with repeated replaying of tasks, where the forgetting after epochs/full passes over tasks is defined as (Doan et al., 2021; Evron et al., 2022, 2023)
(1.2) |
where is the model parameter vector used at time . Such settings arise in applications that naturally undergo cyclic changes in the data/tasks, due to diurnal or seasonal cycles (e.g., in agriculture, forestry, e-commerce, astronomy, etc.). Forgetting is catastrophic if .
Observe that Eq. (1.2) corresponds to the value of the objective function from Eq. (1.1) at the final iterate. Prior work (Evron et al., 2022) that obtained rigorous bounds for the forgetting in Eq. (1.2) applied to problems where each component function is a convex quadratic and all components share a common minimizer at which they attain the value zero. By contrast, we consider more general convex functions that are either smooth or Lipschitz continuous, and make no assumption about the minimizer beyond its being a minimizer of the (average) objective function. Since we do not assume that the optimal value is zero, our focus is on bounding the excess forgetting, i.e., the forgetting in excess of the optimal objective value, which is equivalently the optimality gap for the last iterate in Eq. (1.1).
The method considered in Evron et al. (2022) minimized each component function exactly, outputting the solution closest to the previous iterate in each iteration, using implicit regularization properties of SGD. To obtain the results, it was then crucial that the component functions were quadratic (so that there is an explicit, closed-form solution for each subproblem) and that all component functions shared a nonempty set of minimizers with value zero (so that forgetting can be controlled despite aggressive adaptation to the current task). Our work instead uses the incremental proximal method, which relies on explicit regularization to enforce closeness of models across differing tasks. This can degrade the performance on the current task, but in exchange it controls forgetting and applies to a much broader class of loss functions.
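To make the correspondence between forgetting and the objective value at the final iterate concrete, the following minimal sketch evaluates both the forgetting and the excess forgetting on a toy problem. The two scalar quadratic tasks, the function names, and the evaluation point are illustrative assumptions and do not come from the paper.

```python
def forgetting(losses, x_final):
    """Average loss over all tasks, evaluated at the final iterate."""
    return sum(loss(x_final) for loss in losses) / len(losses)


def excess_forgetting(losses, x_final, x_star):
    """Forgetting at the final iterate minus the optimal objective value."""
    return forgetting(losses, x_final) - forgetting(losses, x_star)


# Hypothetical tasks: two scalar quadratics with different minimizers (+1 and -1).
losses = [lambda x: 0.5 * (x - 1.0) ** 2, lambda x: 0.5 * (x + 1.0) ** 2]
x_star = 0.0  # minimizer of the average loss; the optimal value is 0.5, not 0

print(forgetting(losses, 0.3), excess_forgetting(losses, 0.3, x_star))
```

In this toy instance the tasks do not share a minimizer, so the forgetting itself cannot be driven to zero and only the excess forgetting is a meaningful target.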
1.1 Contributions
Our main contributions can be summarized as follows, where denotes the gradient variance at the optimum. The quantity is intrinsic to oracle complexity of incremental methods (Mishchenko et al., 2020; Nguyen et al., 2021; Cai et al., 2023a; Cha et al., 2023).
Last iterate convergence of Incremental Gradient Descent (IGD).
We provide the first oracle complexity guarantees for the last iterate of standard variants of IGD with either deterministic or randomly permuted ordering of the updates, applied to convex -smooth objectives. Up to a square-root-log factor, our oracle complexity bounds in Theorem 2.3 and Corollary 2.5 – which are for the deterministic variant and for the randomly permuted variant – match the best known oracle complexity bounds for these methods, previously known only for the (uniformly) average iterate (Mishchenko et al., 2020; Nguyen et al., 2021; Cai et al., 2023a; Cha et al., 2023). We further extend our results to increasing weighted averaging of the iterates in Corollary 2.4, which places more weight on the more recent iterates, removing the excess square-root-log factor in the resulting oracle complexity bound.
Last iterate convergence of Incremental Proximal Method (IPM).
We provide the first oracle complexity guarantees for the last iterate of IPM applied to convex and either smooth or Lipschitz-continuous objectives. When each component function is convex and -smooth, we show (in Theorem 3.1) that IPM has the same oracle complexity guarantee as IGD. This result is new for any variant of this method – with average or last iterate as its output. When component functions are convex and -Lipschitz, our oracle complexity in Theorem 3.3 matches the best known oracle complexity bound up to a log factor, which was previously known only for the (uniformly) average iterate (Bertsekas, 2011; Li et al., 2019). We further argue (in Corollary 3.4 and Corollary 3.5) that for both settings our analysis can be extended to admit inexact proximal point evaluations – an important setting not addressed by prior work on general IPM.
IPM as a model of CL.
We initiate the study of IPM as a model of CL, corresponding to sequential ridge-regularized model training commonly used in practice. On the positive side, our last-iterate convergence results for IPM in Theorem 3.1 and Theorem 3.3 demonstrate that forgetting (corresponding to the optimality gap at the last iterate) can be effectively controlled if the amount of employed regularization is sufficiently high. On the negative side, we show that for any constant amount of regularization, forgetting is always catastrophic, even for least squares problems. In particular, we provide a univariate quadratic example such that for any constant regularization parameter, the asymptotic limit of (excess) forgetting is non-zero. Further, we demonstrate that for forgetting to be made smaller than some target the regularization must be sufficiently high and depend polynomially on and These results are summarized in Theorem 3.2 and highlight the limitations of regularization as a black-box tool for controlling forgetting in CL.
1.2 Further related work
To remedy catastrophic forgetting, various empirical approaches have been proposed, and this work is most closely related to (i) memory-based approaches, which store samples from previous tasks and reuse those data for training on the current task (Robins, 1995; Lopez-Paz and Ranzato, 2017; Rolnick et al., 2019) and (ii) regularization-based approaches, which regularize the loss of the current task to ensure the new model parameter vector is close to the prior ones (Kirkpatrick et al., 2017); for a more complete survey, we refer to De Lange et al. (2021). On the theoretical front, the results on the forgetting for cyclic replaying of tasks considered in our work have only been established for linear models (Evron et al., 2022, 2023; Swartworth et al., 2024). In particular, the analysis for linear regression tasks in Evron et al. (2022); Swartworth et al. (2024) crucially relies on the exact minimization of each task using (S)GD to have closed-form updates between tasks, while Evron et al. (2023) uses alternating projections to analyze linear classification tasks. It is unclear how to extend either of these results to general convex loss functions that we consider.
On a technical level, our results are most closely related to 1) the literature on last iterate convergence guarantees for subgradient-based methods and stochastic gradient descent (SGD) and 2) the literature on incremental gradient methods and shuffled SGD. For 1), we draw inspiration from the recent results (Zamani and Glineur, 2023; Liu and Zhou, 2023), which rely on a clever construction of reference points with respect to which a gap quantity gets bounded to deduce a bound for the optimality gap of the last iterate . For the latter line of work 2), we generalize the analysis used exclusively for the optimality gap of the (uniformly) average iterate (Mishchenko et al., 2020; Nguyen et al., 2021; Cha et al., 2023; Cai et al., 2023a) to control the gap-like quantities , which require a more careful argument for controlling all error terms introduced by replacing by without introducing spurious unrealistic assumptions about the magnitudes of the component functions’ gradients. We finally note that both these related lines of work concern problems on which progress was made only in the very recent literature. In particular, while the oracle complexity upper bound for the average iterate of SGD in convex Lipschitz-continuous settings has been known for decades and its analysis is routinely taught in optimization and machine learning classes, there were no such results for the last iterate of SGD until 2013 (Shamir and Zhang, 2013) with improvements and generalizations to these results obtained as recently as in the past year (Liu and Zhou, 2023). Regarding 2), obtaining any nonasymptotic convergence guarantees for incremental gradient methods/shuffled SGD had remained open for decades (Bottou, 2009) until a recent line of work (Gürbüzbalaban et al., 2021; Shamir, 2016; Haochen and Sra, 2019; Nagaraj et al., 2019; Ahn et al., 2020; Rajput et al., 2020; Yun et al., 2022; Safran and Shamir, 2020). 
For nonconvex problems, Yu and Li (2023) proved the high-probability last iterate guarantee for shuffled SGD with stopping criteria, which is technically disjoint from our work. For smooth convex problems we consider, the convergence results were obtained only in the past few years (Mishchenko et al., 2020; Nguyen et al., 2021; Cha et al., 2023) and improved in Cai et al. (2023a) using a fine-grained analysis inspired by the recent advances in cyclic methods (Song and Diakonikolas, 2023; Cai et al., 2023b; Lin et al., 2023a). However, all those results are for the (uniformly) average iterate, while obtaining convergence results for the last iterate had remained open.
Concurrent independent work.
An independent and concurrent work to ours (Liu and Zhou, 2024) studied the last-iterate convergence of shuffled SGD for composite (strongly) convex smooth/Lipschitz optimization. For the same problems as studied in our Section 2, they obtained the same convergence results. The remaining results in Liu and Zhou (2024) and our work are not directly comparable, as the motivation for the two works and the studied settings are different. In particular, the focus of Liu and Zhou (2024) is on the last iterate convergence of shuffled SGD, and thus they study it in depth, considering different Lipschitz/smoothness constants for component functions, strong convexity, and composite settings. Our focus on the other hand is on settings relevant to continual learning, and thus we additionally consider the convergence of a weighted average of iterates and put more weight on the incremental proximal method, which were not considered in Liu and Zhou (2024). It is of note that while Liu and Zhou (2024) used proximal steps to handle the nonsmooth portion of the objective in their composite setting, the proximal maps are not applied component-wise, but at the end of a cycle, to only one (regularizer) function (e.g., to handle constraints or joint regularization).
1.3 Notation and preliminaries
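We consider the -dimensional real space , where is the norm, and denote . Given a proper, convex, lower semicontinuous function $f$, its proximal operator and Moreau envelope are defined by
$$\mathrm{prox}_{\eta f}(x) := \arg\min_{y} \Big\{ f(y) + \frac{1}{2\eta}\|y - x\|^2 \Big\}, \qquad M_{\eta f}(x) := \min_{y} \Big\{ f(y) + \frac{1}{2\eta}\|y - x\|^2 \Big\},$$
respectively, for a parameter $\eta > 0$. The Moreau envelope is $\frac{1}{\eta}$-smooth with the gradient $\nabla M_{\eta f}(x) = \frac{1}{\eta}\big(x - \mathrm{prox}_{\eta f}(x)\big) \in \partial f(\mathrm{prox}_{\eta f}(x))$, where $\partial f$ denotes the subdifferential of $f$.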
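As a small worked instance of these definitions, the sketch below computes the proximal operator and the Moreau envelope of the scalar absolute value function, for which both are known in closed form (soft-thresholding and the Huber function, respectively). The choice of function and the names `prox_abs`, `moreau_abs`, and `huber` are illustrative assumptions.

```python
def prox_abs(x, eta):
    """Proximal operator of f(y) = |y| with parameter eta: soft-thresholding."""
    if x > eta:
        return x - eta
    if x < -eta:
        return x + eta
    return 0.0


def moreau_abs(x, eta):
    """Moreau envelope of |.|, obtained by plugging the prox into the objective."""
    p = prox_abs(x, eta)
    return abs(p) + (x - p) ** 2 / (2.0 * eta)


def huber(x, eta):
    """Closed form of the same envelope: quadratic near zero, linear in the tails."""
    if abs(x) <= eta:
        return x * x / (2.0 * eta)
    return abs(x) - eta / 2.0


for x in (-2.0, -0.3, 0.0, 0.4, 3.0):
    print(x, prox_abs(x, 0.5), moreau_abs(x, 0.5), huber(x, 0.5))
```

The envelope is differentiable everywhere even though the absolute value is not, which is the property exploited later for the nonsmooth analysis.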
We make the following assumptions. The first one is made throughout the paper.
Assumption 1.
Each is convex and there exists a minimizer .
By Assumption 1, is also convex. In nonsmooth settings, we make an additional standard assumption that the component functions are Lipschitz-continuous.
Assumption 2.
Each is -Lipschitz, i.e., for any ; thus for all .
For the smooth settings, we make the following assumption.
Assumption 3.
Each is -smooth, i.e., for any .
We remark that Assumptions 2 and 3 imply that is also -Lipschitz and -smooth, respectively. These two assumptions can also be generalized to allow distinct Lipschitz/smoothness constants for the component functions, and our results would then scale with the average Lipschitz/smoothness constant using the techniques from Cai et al. (2023a); we omit this generalization to keep the focus on the intricacies of the last iterate convergence. When a function $f$ is $L$-smooth and convex, we will often make use of the following standard inequality that fully characterizes the class of $L$-smooth convex functions:
$$\frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2 \le f(x) - f(y) - \langle \nabla f(y), x - y \rangle \quad \text{for all } x, y. \tag{1.3}$$
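As a quick numerical sanity check of the standard inequality $\frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2 \le f(x) - f(y) - \langle \nabla f(y), x - y\rangle$ for $L$-smooth convex functions, the sketch below evaluates its slack on a grid for the scalar softplus function, which is convex and $\frac{1}{4}$-smooth. The test function and grid are illustrative assumptions.

```python
import math

L = 0.25  # smoothness constant of softplus: its second derivative is at most 1/4


def f(x):
    """Softplus, log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def grad(x):
    """Derivative of softplus, i.e., the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def slack(x, y):
    """Right-hand side minus left-hand side of the inequality; should be >= 0."""
    bregman = f(x) - f(y) - grad(y) * (x - y)
    return bregman - (grad(x) - grad(y)) ** 2 / (2.0 * L)


grid = [i / 2.0 for i in range(-10, 11)]
min_slack = min(slack(x, y) for x in grid for y in grid)
print(min_slack)
```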
Finally, when each is smooth, we assume bounded variance at , same as all prior work that considered the same settings of IGD/shuffled SGD as we do (Mishchenko et al., 2020; Nguyen et al., 2021; Tran et al., 2021, 2022; Cai et al., 2023a).
Assumption 4.
The quantity is bounded.
2 Last Iterate Convergence of Incremental Gradient Descent
In this section, we introduce our techniques for analyzing the last iterate guarantee and bound the oracle complexity for the last iterate of incremental gradient descent (IGD), assuming component functions are smooth and convex. In the context of CL, this corresponds to a simplified setup where the learner incrementally performs a single gradient step on each task and cyclically replays the tasks. Nevertheless, this setup serves as a warmup to the proximal setup we discuss in the next section. Additionally, it is of independent interest as incremental gradient methods are widely used in the optimization and machine learning literature, where despite the lack of prior theoretical justification, it is typically the last iterate that gets output by the algorithm in practice.
We summarize the IGD method in Alg. 1, assuming the incremental order in each epoch for simplicity and without loss of generality. The oracle complexity for the (uniformly) average iterate of IGD has been shown to be for an -optimality gap (Mishchenko et al., 2020; Cai et al., 2023a) under the same assumptions we make here (Assumptions 1, 3, and 4), while, as discussed before, there were no guarantees for either the last iterate or even a weighted average of the iterates. The main result of this section is that the same oracle complexity applies to the last iterate of IGD, up to a square-root-log factor. We then further generalize this result to weighted averages of iterates with increasing weights and to variants with randomly permuted order of cyclic updates.
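The cyclic update described above can be sketched in a few lines. This is a minimal illustration on scalar components with a constant step size, not a transcription of Alg. 1; the function name `igd`, the quadratic components, and all numerical values are assumptions made for the example.

```python
def igd(grads, x0, eta, epochs):
    """Cyclic incremental gradient descent on scalar components."""
    x = x0
    for _ in range(epochs):
        for g in grads:          # fixed order 1, ..., T in every epoch
            x = x - eta * g(x)   # one gradient step on the current component
    return x                     # the last iterate is returned, not an average


# Hypothetical components f_t(x) = 0.5 * (x - b_t)^2 with b = (1, -1, 3);
# the average objective is minimized at mean(b) = 1.
targets = [1.0, -1.0, 3.0]
grads = [lambda x, b=b: x - b for b in targets]
x_last = igd(grads, x0=10.0, eta=0.01, epochs=500)
print(x_last)
```

With a constant step size the last iterate of this toy run settles close to, but not exactly at, the minimizer of the average objective; the residual offset shrinks with the step size.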
We begin the analysis by deriving a bound on the gap with respect to an arbitrary but fixed reference point as summarized in the following lemma with its proof in Appendix A. This stands in contrast to arguments deriving bounds on the average iterate, which take While this may seem like a minor difference, it affects the analysis non-trivially: a direct extension of prior arguments would require replacing Assumption 4 – which imposes a bound on – with a bound on for an arbitrary , which would be a much stronger requirement.
Lemma 2.1.
Our next step is to specify our choice of the reference point for each epoch. In particular, we consider a sequence of points that is recursively defined as a convex combination of the algorithm iterate and the previous reference point :
(2.1) |
for with and to be set later. Observe that can also be written as a convex combination of the points and by unrolling the recursion, i.e.,
(2.2) |
where . If we set for all , then we have and recover the bound in Lemma 2.1, which leads to the average iterate guarantee. For general , we obtain the following lemma to relate the function value gap to the optimality gap , whose proof is deferred to Appendix A.
Lemma 2.2.
The role of the sequence of weights in Lemma 2.2 is to ensure that we can telescope the terms in Lemma 2.1. On the other hand, to succinctly see why such could lead to the desired last iterate guarantees, we note that the second part of Lemma 2.2 indicates that intrinsically includes retraction terms of the optimality gaps at the previous iterates. Hence, we can deduct from by properly choosing and to cancel out the optimality gap terms at the intermediate iterates. In this case, the convergence rate for the last iterate is characterized by the growth rate of .
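To illustrate why a recursively defined reference point is always a convex combination of the points it is built from, the sketch below unrolls a recursion of the generic form $z_k = (1 - \lambda_k) z_{k-1} + \lambda_k x_k$ with $\lambda_k \in [0, 1]$. This parametrization, the starting point `z_init`, and the numerical values are assumptions for illustration and need not coincide with the exact weights used in Eq. (2.1).

```python
def reference_point(z_init, xs, lams):
    """Apply z <- (1 - lam) * z + lam * x once per iterate."""
    z = z_init
    for x, lam in zip(xs, lams):
        z = (1.0 - lam) * z + lam * x
    return z


def unrolled_coefficients(lams):
    """Coefficients of the unrolled recursion.

    coeffs[0] multiplies z_init and coeffs[j] multiplies xs[j - 1].
    """
    coeffs = [1.0]
    for lam in lams:
        coeffs = [(1.0 - lam) * c for c in coeffs] + [lam]
    return coeffs


lams = [0.5, 0.25, 0.1]          # hypothetical mixing weights in [0, 1]
xs = [1.0, 2.0, 4.0]             # hypothetical scalar iterates
coeffs = unrolled_coefficients(lams)
z = reference_point(0.0, xs, lams)
print(coeffs, sum(coeffs), z)
```

The coefficients are nonnegative and sum to one for any choice of mixing weights in $[0, 1]$, so convexity of the objective can be applied to the reference point directly.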
Our choice of the reference points is inspired by the recent work (Zamani and Glineur, 2023; Liu and Zhou, 2023) on last iterate guarantees for subgradient methods and SGD with . However, their proof techniques are not directly applicable to incremental methods, due to several technical obstacles including additional nontrivial error terms of the form in Lemma 2.1 arising from the incremental gradient steps using reference points other than . Such error terms inherently deteriorate the growth rate of and could possibly lead to a worse last iterate rate compared to the rate on the average iterate. In the following theorem, we calibrate such degradation on the last iterate guarantees, and show that with a slightly smaller step size one can still achieve essentially the same rate as for the average iterate. The proof is provided in Appendix A due to space constraints.
Theorem 2.3.
The error term in Lemma 2.1 plays the role of slowing the last iterate rate, as calibrated by the dependence on in Eq. (2.3). To remedy such degradation compared to the average iterate rate (Mishchenko et al., 2020; Cai et al., 2023a), one natural thought is to make sufficiently small. In particular, we choose and show that the last iterate rate nearly matches the best known rate on the average iterate, with the trade-off of requiring order- smaller step sizes in comparison with Mishchenko et al. (2020); Cai et al. (2023a). This translates into the oracle complexity that is larger by at most a factor. For most cases of interest, this quantity can be treated as a constant: for example, for
On the other hand, with and constant , Lemmas 2.1 and 2.2 directly imply the average iterate guarantee, as a sanity check. Additionally, instead of zeroing the weights of the optimality gap terms for to obtain the last iterate guarantee, one can deduce the convergence rate on the increasing weighted averaging which places more weight on later iterates, as formalized in the following corollary whose proof is deferred to Appendix A.
Corollary 2.4 (Increasing Weighted Averaging).
We remark that increasing weighted averaging shaves off the (at most) square-root-log term appearing in the last iterate rate above and recovers the best known rate for the average iterate (Mishchenko et al., 2020; Cai et al., 2023a). The parameters are included in the weights controlling the growth rate of the increasing sequence . When , increasing weighted averaging reduces to the uniform weighted average.
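The effect of increasing weights can be sketched as follows. The polynomial weights $(k+1)^p$ used here are a hypothetical choice made only for illustration; the weights in Corollary 2.4 are defined through the parameters of the analysis and may differ.

```python
def weighted_average(iterates, power):
    """Average of scalar iterates with weights (k + 1) ** power, k = 0, 1, ..."""
    weights = [(k + 1) ** power for k in range(len(iterates))]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, iterates)) / total


iterates = [4.0, 2.0, 1.0]  # hypothetical per-epoch iterates
uniform = weighted_average(iterates, 0)      # power 0: uniform average
linear = weighted_average(iterates, 1)       # more weight on later iterates
nearly_last = weighted_average(iterates, 50)  # essentially the last iterate
print(uniform, linear, nearly_last)
```

In this sketch, power zero recovers the uniform average, while a very large power concentrates essentially all the weight on the final iterate, so the family interpolates between the two outputs discussed in this section.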
Shuffled SGD.
We extend our analysis to handle possible random permutations of the task ordering in each epoch, showing order- improvements in complexity when randomness is involved. We consider two main permutation strategies of particular interest in the literature on shuffled SGD: (i) random reshuffling (RR): generate a fresh random permutation at the beginning of each epoch; (ii) shuffle-once (SO): generate a single random permutation at the beginning and use it in all epochs. Those strategies lead to order-() improvements in bounding the variance term from Lemma 2.1, and we state the improved convergence results with permutations in the following corollary. The proof is deferred to Appendix A. Our last iterate guarantee nearly matches the best known average iterate convergence rate for shuffled SGD (Mishchenko et al., 2020; Nguyen et al., 2021; Cai et al., 2023a) and the lower bound results on the RR scheme (Cha et al., 2023), with a slightly (order-) smaller step size.
Corollary 2.5 (Shuffled SGD (RR/SO)).
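The difference between the two permutation strategies, random reshuffling (RR) and shuffle-once (SO), lies only in how the per-epoch task orders are generated. A minimal sketch follows; the function name `epoch_orders`, the string labels, and the seed handling are assumptions for illustration.

```python
import random


def epoch_orders(num_tasks, epochs, strategy, seed=0):
    """Return one task ordering per epoch for the chosen strategy."""
    rng = random.Random(seed)
    base = list(range(num_tasks))
    if strategy == "SO":
        rng.shuffle(base)                # one permutation, reused in every epoch
        return [list(base) for _ in range(epochs)]
    if strategy == "RR":
        orders = []
        for _ in range(epochs):
            perm = list(range(num_tasks))
            rng.shuffle(perm)            # fresh permutation in each epoch
            orders.append(perm)
        return orders
    # Any other label: deterministic cyclic order 0, ..., num_tasks - 1.
    return [list(base) for _ in range(epochs)]


so_orders = epoch_orders(5, 4, "SO", seed=1)
rr_orders = epoch_orders(5, 4, "RR", seed=1)
cyclic_orders = epoch_orders(5, 3, "cyclic")
print(so_orders, rr_orders, cyclic_orders)
```

Either list of orders can be fed to an incremental method by visiting the components of each epoch in the listed order.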
3 Incremental Proximal Method
In this section, we leverage the proof techniques developed in the previous section and derive the last iterate convergence bound for the incremental proximal method (IPM) summarized in Alg. 2. While the incremental proximal method is a fundamental method broadly studied in the optimization literature and thus the last iterate convergence bounds are of independent interest, our main motivation for considering this problem comes from CL applications, as discussed in the introduction. Thus we begin this section by briefly explaining this connection and reasoning.
The considered setup is motivated by the general -regularized CL setting (Heckel, 2022; Li et al., 2023). In particular, each proximal iteration can be interpreted as minimizing the (a.k.a. ridge) regularized loss corresponding to the current task , which aligns with the common machine learning practice of using regularization to improve the generalization error and prevent forgetting. When , the proximal point step reduces to the CL setting where the learner exactly minimizes the loss of the current task, i.e., , while the regularization effect vanishes and causes larger forgetting on previous tasks. When is small, the proximal point step is easier to compute with larger quadratic regularization and prevents deviating from the previous iterate, thus causing less forgetting. However, in this case the plasticity of the model may deteriorate.
We also note that our analysis of IPM is related to previous work on cyclic replays for overparameterized linear models with (S)GD (Evron et al., 2022; Swartworth et al., 2024), as (S)GD in this case acts as an implicit regularizer (Gunasekar et al., 2018; Zhang et al., 2021) (whereas the proximal point update acts as an explicit regularizer). The two lines of work are not directly comparable: Evron et al. (2022); Swartworth et al. (2024) consider exact minimization of the component/task loss function and bound the forgetting, but only address convex quadratics where the component loss functions have a nonempty intersecting set of minima (which implies in Assumption 4). On the other hand, our work addresses much more general (not necessarily quadratic) convex functions and does not require but instead relies on sufficiently large regularization.
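The two regimes described above can be seen on a scalar least-squares task, for which the proximal step has a closed form. In this minimal sketch, the task loss $\frac{a}{2}(y - b)^2$, the step size symbol `eta`, and all numerical values are assumptions chosen for illustration.

```python
def prox_quadratic(x_prev, a, b, eta):
    """argmin_y 0.5 * a * (y - b)**2 + (y - x_prev)**2 / (2 * eta), in closed form."""
    return (x_prev + eta * a * b) / (1.0 + eta * a)


x_prev, a, b = 0.0, 1.0, 2.0   # previous model at 0, current task minimized at 2

# Large step size (weak regularization): the step lands near the task minimizer.
print(prox_quadratic(x_prev, a, b, eta=1e6))
# Small step size (strong regularization): the step stays near the previous model.
print(prox_quadratic(x_prev, a, b, eta=1e-6))
```

The same closed form also satisfies the implicit relation $y = x_{\mathrm{prev}} - \eta\, a\,(y - b)$, i.e., a gradient step evaluated at the new point rather than the old one.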
3.1 Smooth convex setting
We first study the setting where the loss function of each task is convex and smooth, for which we can show faster convergence and which covers many regression tasks studied in prior CL work; see e.g., Evron et al. (2022); Goldfarb et al. (2024). In contrast, prior work either only focused on nonsmooth settings for IPM (Bertsekas, 2011; Li et al., 2019, 2020) or studied different algorithms without component-wise proximal steps for smooth settings (Bertsekas, 2015; Mishchenko et al., 2022).
Under component smoothness, the proximal iteration is equivalent to the backward gradient step: with step size $\eta$, current component $f_t$, and previous iterate $x$, the new iterate $x^{+}$ satisfies
$$x^{+} = x - \eta \nabla f_t(x^{+}).$$
Hence, much of the analysis from Section 2 can be adapted here, and the main difference lies in bounding the gap within each epoch with decomposition w.r.t. instead of in comparison to Lemma 2.1. Then, choosing defined by Eq. (2.1) and following the proof of Theorem 2.3, we obtain the last iterate guarantee stated in the following theorem, with its proof in Appendix B.
Theorem 3.1.
A few remarks are in order here. First, the last-iterate convergence rate of IPM matches the rate of IGD on the last iterate. This is also the first convergence result for IPM (with component-wise proximal updates) under convex smooth settings, in comparison with the prior results in convex Lipschitz setups (Bertsekas, 2011; Li et al., 2019, 2020). Second, the extensions to the increasing weighted averaging and RR/SO shuffling discussed in Section 2 also apply to this setting, which we omit for brevity. Lastly, astute readers may notice that the step size constraint in Theorem 3.1 stands in contrast to the fact that the proximal point method, which IPM reduces to when there is a single component function, converges for any positive step size. However, in the following theorem we show that such a step size restriction is necessary for IPM with component-wise proximal updates to reach the target optimality gap in this setting. We provide a proof sketch below, while the complete proof is in Appendix B.
Theorem 3.2.
Given let be the class of finite-sum functions whose each component is -smooth and convex. Then:
1. For any fixed step size , there exists a function such that the iterates of Alg. 2 satisfy as . As a consequence, the forgetting is catastrophic.
2. For any fixed step size that only depends on the parameters of the problem class ( error ), there exists a function such that the iterates of Alg. 2 satisfy .
3. Given , if the fixed step size satisfies , then there exists a function such that for all sufficiently large .
Proof sketch.
For all parts of the proof, we consider -dimensional quadratics for and . It is immediate that is minimized at . In this case, Alg. 2 using a fixed step size performs closed-form updates on , i.e., where . Given any initial point ,
For Part 1, since , we have for . As , for such that and , we have .
For Part 2, with , observe that as , we have for any . As , for () and : .
For Part 3, the case can be handled using a similar argument as in ), and thus we assume w.l.o.g. that . Let , where and . Further noticing that for and , then we have
Hence, if , since we can show , we have that with choosing large enough . Then for sufficiently large and such that and , we get . ∎
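The phenomenon in the first part of the theorem can be reproduced on a simple univariate instance. The sketch below uses two hypothetical quadratics $f_t(x) = \frac{L}{2}(x - b_t)^2$ with $b = (+1, -1)$, which is an assumed example and not necessarily the construction used in the proof. For this instance one full cycle is an affine contraction, and its fixed point works out to $-\eta L / (2 + \eta L)$, which is nonzero for every fixed step size $\eta > 0$.

```python
def ipm_last_iterate(targets, smooth, x0, eta, epochs):
    """Cyclic incremental proximal method on f_t(x) = 0.5 * smooth * (x - b_t)^2."""
    x = x0
    for _ in range(epochs):
        for b in targets:
            x = (x + eta * smooth * b) / (1.0 + eta * smooth)  # exact prox step
    return x


def excess_forgetting(x, smooth):
    # For targets (+1, -1) the average loss is 0.5 * smooth * (x^2 + 1),
    # which is minimized at x = 0, so the optimality gap is 0.5 * smooth * x^2.
    return 0.5 * smooth * x * x


smooth, eta = 1.0, 0.5
x_inf = ipm_last_iterate([1.0, -1.0], smooth, x0=5.0, eta=eta, epochs=200)
predicted = -eta * smooth / (2.0 + eta * smooth)  # fixed point of one full cycle
print(x_inf, predicted, excess_forgetting(x_inf, smooth))
```

The last iterate converges to a point that is biased toward the most recently visited task, so the excess forgetting has a strictly positive limit regardless of the number of epochs.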
Regularization effect.
We now discuss how the regularization parameter (the step size of proximal updates) affects the loss on the current task and (excess) forgetting, based on the above convergence results of IPM. An interesting aspect of our result in Theorem 3.1 is that there is a critical value
(3.1) |
such that if the step size decreases below this critical value, both the regularization error on the current task and our upper bound on the forgetting increase. For (i.e., when we decrease the regularization error), Theorem 3.2 shows that the forgetting would increase, at least in some regimes of the problem parameters. Moreover, Theorem 3.2 demonstrates that polynomial dependence on other parameters like and is necessary in the choice of the step size. In other words, strong regularization is needed to control the forgetting to a target error in general smooth convex settings. Another direct implication of Theorem 3.2 is that if no assumptions such as similarity are made on the tasks, then any -regularized model using a finite regularization parameter would suffer catastrophic forgetting, i.e., the forgetting would not be approaching zero as the number of epochs tends to infinity.
We further provide illustrative numerical results in Fig. 1 to facilitate our discussion. In particular, we choose , , () and for the example used in Theorem 3.2. In Fig. 1(a), we plot the optimality gap at the last iterate, i.e., the excess forgetting, against the step sizes after epochs. It can be observed that the forgetting first decreases with reducing the step size, but then increases beyond some critical value. Note that the critical values are around , which is nontrivially smaller than , while a larger leads to a smaller such critical value. These numerical examples corroborate our results from Theorems 3.1 and 3.2, which jointly suggest that the step size (amount of regularization) can neither be too small nor too large. On the other hand, in Fig. 1(b) we show the final stagnated average regularization error, i.e., over tasks, where is the minimizer of . We thus conclude from both plots in Fig. 1 that as the step size increases (equivalently, regularization parameter decreases), the regularization error decreases as well, but the forgetting increases.
To conclude this subsection, we finally note that our results bridge the gap of theoretically calibrating the trade-off between the forgetting and the regularization error for general convex smooth tasks with cyclic replays, which has only been studied for the setting of two linear regression tasks without cyclic replay (Heckel, 2022; Li et al., 2023). Further, our results on finite regularization complement the asymptotic weak regularization results in the technically disjoint setting of linear classification (Evron et al., 2023), where the analysis requires relating to the sequential max-margin projections.
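The qualitative behavior described above, where the step size can be neither too small nor too large for a fixed epoch budget, can be observed on a toy instance. The sketch below does not reproduce Fig. 1: the two quadratic tasks, the initial point, the epoch budget, and the step size grid are all assumptions made for illustration.

```python
def final_gap(eta, epochs=50, smooth=1.0, x0=10.0):
    """Excess forgetting after a fixed number of IPM epochs on two quadratics.

    Tasks are f_t(x) = 0.5 * smooth * (x - b_t)^2 with b = (+1, -1), so the
    average loss is minimized at 0 and the gap equals 0.5 * smooth * x^2.
    """
    x = x0
    for _ in range(epochs):
        for b in (1.0, -1.0):
            x = (x + eta * smooth * b) / (1.0 + eta * smooth)  # exact prox step
    return 0.5 * smooth * x * x


for eta in (1e-2, 1e-1, 1.0, 10.0, 100.0):
    print(f"eta={eta:g}  excess forgetting={final_gap(eta):.4f}")
```

Very small step sizes leave the iterate close to its starting point within the epoch budget, while very large ones converge quickly to a point biased toward the last task, so the gap is non-monotone in the step size.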
3.2 Convex Lipschitz setting
We now further relax the smoothness assumption and consider the convex Lipschitz setting, with applications such as linear classification tasks considered in Evron et al. (2023). To carry out the analysis, we leverage the standard fact that the proximal iteration is equivalent to the gradient step w.r.t. the Moreau envelope, i.e.,
while the gradient of the Moreau envelope belongs to the subdifferential , thus is bounded by the Lipschitz constant of . We use these observations to bound the gap for each epoch and then use the sequence defined in Eq. (2.1) to deduce the last iterate rate in the following theorem with proofs deferred to Appendix B. In contrast to Lemma 2.1, we do not have the additional error term , so the analysis is much simplified.
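Both observations can be checked numerically on the scalar absolute value function, which is convex and 1-Lipschitz. In the sketch below, the proximal step coincides with a gradient step on the Moreau envelope, and the envelope gradient never exceeds the Lipschitz constant in magnitude. The choice of function and the helper names are illustrative assumptions.

```python
def prox_abs(x, eta):
    """Proximal operator of |.| with parameter eta (soft-thresholding)."""
    sign = 1.0 if x >= 0 else -1.0
    return max(abs(x) - eta, 0.0) * sign


def moreau_grad_abs(x, eta):
    """Gradient of the Moreau envelope of |.|: x / eta clipped to [-1, 1]."""
    return max(-1.0, min(1.0, x / eta))


for eta in (0.1, 0.5, 2.0):
    for x in (-3.0, -0.2, 0.0, 0.3, 1.5):
        step = x - eta * moreau_grad_abs(x, eta)   # gradient step on the envelope
        print(eta, x, prox_abs(x, eta), step)
```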
Theorem 3.3.
The last iterate rate we obtained in Theorem 3.3 matches the prior best known results for the average iterate guarantees for incremental proximal methods (Bertsekas, 2011; Li et al., 2019) in nonsmooth settings, up to a logarithmic factor. Further, we take the step size only for analytical simplicity, while the diminishing step sizes will yield the same rate via a similar analysis, which we omit for brevity.
3.3 Inexact proximal point evaluations
In the last two subsections, we derived our results assuming that the proximal point operator can be evaluated exactly. However, computing the proximal point corresponds to solving a strongly convex problem, which can generally be done only up to finite accuracy. Thus, we now consider the case where is an approximation of obtained by solving the corresponding strongly convex problem to an -optimality gap for . Equivalently, using strong convexity and denoting , we have
(3.2) |
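To illustrate how an optimality gap on the proximal subproblem translates into a distance bound, the following sketch uses the standard consequence of strong convexity: if the subproblem objective is (1/eta)-strongly convex and is solved to gap eps, then the squared distance to the exact proximal point is at most 2 * eta * eps. This is presumably the kind of relation Eq. (3.2) expresses, but the exact form is an assumption here. The least-squares component function, the data `A`, `b`, `y`, and the inner gradient-descent solver are all hypothetical choices made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # hypothetical data for f(x) = 0.5 * ||A x - b||^2
b = rng.standard_normal(5)
y = rng.standard_normal(3)        # point at which the proximal operator is evaluated
eta = 0.5                         # illustrative step size

def phi(x):
    """Proximal subproblem objective f(x) + ||x - y||^2 / (2 * eta)."""
    return 0.5 * np.sum((A @ x - b) ** 2) + np.sum((x - y) ** 2) / (2 * eta)

# Exact proximal point from the first-order optimality condition.
H = A.T @ A + np.eye(3) / eta
x_exact = np.linalg.solve(H, A.T @ b + y / eta)

# Inexact proximal point: a few gradient-descent steps on phi, started at y.
step = 1.0 / np.linalg.eigvalsh(H).max()
x_tilde = y.copy()
records = []                      # pairs (optimality gap, squared distance)
for _ in range(20):
    grad = A.T @ (A @ x_tilde - b) + (x_tilde - y) / eta
    x_tilde = x_tilde - step * grad
    eps = phi(x_tilde) - phi(x_exact)
    dist_sq = float(np.sum((x_tilde - x_exact) ** 2))
    records.append((float(eps), dist_sq))
```

Every recorded pair satisfies dist_sq <= 2 * eta * eps, so controlling the subproblem gap at each iteration controls how far the inexact update can drift from the exact proximal step.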
We note that direct extensions of the previous analysis would not work, because inexact evaluations give rise to additional positive terms that cause issues for telescoping. However, we observe that the coefficients of these terms admit additional slackness, i.e., , while Lemma 2.2 only requires . Thus, as long as the approximation error at each iteration is small, we can still maintain the convergence rate of incremental proximal methods with exact proximal point evaluations. With these insights, we extend our convergence results to admit inexact proximal point evaluations in the following corollaries, with proofs provided in Appendix B.
Corollary 3.4 (Convex Smooth).
4 Conclusion
This work provides the first oracle complexity guarantees for the last iterate of standard incremental (gradient and proximal) methods, motivated by applications in continual learning. The obtained complexity bounds nearly match the best known oracle complexity bounds, which in the same settings were previously known only for the (uniformly) averaged iterate. Our results for the incremental proximal method further characterize the effect of regularization and its limitations in controlling catastrophic forgetting in continual learning applications. It would be interesting to investigate in future work whether other types of regularization involving task similarity can effectively control forgetting.
Acknowledgments
This research was supported in part by the U.S. Office of Naval Research under contract number N00014-22-1-2348.
References
- Ahn et al. (2020) Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. SGD with shuffling: optimal rates without component convexity and large epoch requirements. In Proc. NeurIPS’20, 2020.
- Balcan et al. (2015) Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efficient representations for lifelong learning and autoencoding. In Proc. COLT’15, 2015.
- Bertsekas (2011) Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, 2011.
- Bertsekas (2015) Dimitri P Bertsekas. Incremental aggregated proximal and augmented lagrangian algorithms. arXiv preprint arXiv:1509.09257, 2015.
- Bertsekas et al. (2011) Dimitri P Bertsekas et al. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.
- Bottou (2009) Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proc. Symposium on Learning and Data Science, Paris’09, 2009.
- Cai et al. (2023a) Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Empirical risk minimization with shuffled SGD: A primal-dual perspective and improved bounds. arXiv preprint arXiv:2306.12498, 2023a.
- Cai et al. (2023b) Xufeng Cai, Chaobing Song, Stephen J Wright, and Jelena Diakonikolas. Cyclic block coordinate descent with variance reduction for composite nonconvex optimization. In Proc. ICML’23, 2023b.
- Cao et al. (2022) Xinyuan Cao, Weiyang Liu, and Santosh Vempala. Provable lifelong learning of representations. In Proc. AISTATS’22, 2022.
- Cha et al. (2023) Jaeyoung Cha, Jaewook Lee, and Chulhee Yun. Tighter lower bounds for shuffling SGD: Random permutations and beyond. arXiv preprint arXiv:2303.07160, 2023.
- Chen et al. (2022) Xi Chen, Christos Papadimitriou, and Binghui Peng. Memory bounds for continual learning. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), 2022.
- De Lange et al. (2021) Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021.
- Doan et al. (2021) Thang Doan, Mehdi Abbana Bennani, Bogdan Mazoure, Guillaume Rabusseau, and Pierre Alquier. A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. In Proc. AISTATS’21, 2021.
- Evron et al. (2022) Itay Evron, Edward Moroshko, Rachel Ward, Nathan Srebro, and Daniel Soudry. How catastrophic can catastrophic forgetting be in linear regression? In Proc. COLT’22, 2022.
- Evron et al. (2023) Itay Evron, Edward Moroshko, Gon Buzaglo, Maroun Khriesh, Badea Marjieh, Nathan Srebro, and Daniel Soudry. Continual learning in linear classification on separable data. In Proc. ICML’23, 2023.
- Goldfarb and Hand (2023) Daniel Goldfarb and Paul Hand. Analysis of catastrophic forgetting for random orthogonal transformation tasks in the overparameterized regime. In Proc. AISTATS’23, 2023.
- Goldfarb et al. (2024) Daniel Goldfarb, Itay Evron, Nir Weinberger, Daniel Soudry, and Paul Hand. The joint effect of task similarity and overparameterization on catastrophic forgetting - an analytical model. In Proc. ICLR’24, 2024.
- Goodfellow et al. (2013) Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
- Gunasekar et al. (2018) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In Proc. ICML’18, 2018.
- Gürbüzbalaban et al. (2021) Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186:49–84, 2021.
- Haochen and Sra (2019) Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In Proc. ICML’19, 2019.
- Heckel (2022) Reinhard Heckel. Provable continual learning via sketched jacobian approximations. In Proc. AISTATS’22, 2022.
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
- Li et al. (2023) Haoran Li, Jingfeng Wu, and Vladimir Braverman. Fixed design analysis of regularization-based continual learning. arXiv preprint arXiv:2303.10263, 2023.
- Li et al. (2020) Jiajin Li, Caihua Chen, and Anthony Man-Cho So. Fast epigraphical projection-based incremental algorithms for Wasserstein distributionally robust support vector machine. In Proc. NeurIPS’20, 2020.
- Li et al. (2019) Xiao Li, Zhihui Zhu, Anthony Man-Cho So, and Jason D Lee. Incremental methods for weakly convex optimization. arXiv preprint arXiv:1907.11687, 2019.
- Lin et al. (2023a) Cheuk Yin Lin, Chaobing Song, and Jelena Diakonikolas. Accelerated cyclic coordinate dual averaging with extrapolation for composite convex optimization. In Proc. ICML’23, 2023a.
- Lin et al. (2023b) Sen Lin, Peizhong Ju, Yingbin Liang, and Ness Shroff. Theory on forgetting and generalization of continual learning. arXiv preprint arXiv:2302.05836, 2023b.
- Liu and Zhou (2023) Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. arXiv preprint arXiv:2312.08531, 2023.
- Liu and Zhou (2024) Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. arXiv preprint arXiv:2403.07723, 2024.
- Lohr (2021) Sharon L Lohr. Sampling: design and analysis. CRC press, 2021.
- Lopez-Paz and Ranzato (2017) David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Proc. NIPS’17, 2017.
- McCloskey and Cohen (1989) Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
- Mishchenko et al. (2020) Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. In Proc. NeurIPS’20, 2020.
- Mishchenko et al. (2022) Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and federated random reshuffling. In Proc. ICML’22, 2022.
- Nagaraj et al. (2019) Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In Proc. ICML’19, 2019.
- Nguyen et al. (2021) Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten Van Dijk. A unified convergence analysis for shuffling-type gradient methods. The Journal of Machine Learning Research, 22(1):9397–9440, 2021.
- Parisi et al. (2019) German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
- Peng and Risteski (2022) Binghui Peng and Andrej Risteski. Continual learning: a feature extraction formalization, an efficient algorithm, and fundamental obstructions. In Proc. NeurIPS’22, 2022.
- Peng et al. (2023) Liangzu Peng, Paris Giampouras, and René Vidal. The ideal continual learner: An agent that never forgets. In Proc. ICML’23, 2023.
- Rajput et al. (2020) Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In Proc. ICML’20, 2020.
- Robins (1995) Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
- Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Proc. NeurIPS’19, 2019.
- Safran and Shamir (2020) Itay Safran and Ohad Shamir. How good is SGD with random shuffling? In Proc. COLT’20, 2020.
- Shamir (2016) Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Proc. NeurIPS’16, 2016.
- Shamir and Zhang (2013) Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proc. ICML’13, 2013.
- Song and Diakonikolas (2023) Chaobing Song and Jelena Diakonikolas. Cyclic coordinate dual averaging with extrapolation. SIAM Journal on Optimization, 33(4):2935–2961, 2023. doi: 10.1137/22M1470104.
- Swartworth et al. (2024) William Swartworth, Deanna Needell, Rachel Ward, Mark Kong, and Halyun Jeong. Nearly optimal bounds for cyclic forgetting. In Proc. NeurIPS’24, 2024.
- Tran et al. (2021) Trang H Tran, Lam M Nguyen, and Quoc Tran-Dinh. SMG: A shuffling gradient-based method with momentum. In Proc. ICML’21, 2021.
- Tran et al. (2022) Trang H Tran, Katya Scheinberg, and Lam M Nguyen. Nesterov accelerated shuffling gradient method for convex optimization. In Proc. ICML’22, 2022.
- Yu and Li (2023) Hengxu Yu and Xiao Li. High probability guarantees for random reshuffling. arXiv preprint arXiv:2311.11841, 2023.
- Yun et al. (2022) Chulhee Yun, Shashank Rajput, and Suvrit Sra. Minibatch vs local SGD with shuffling: Tight convergence bounds and beyond. In Proc. ICLR’22, 2022.
- Zamani and Glineur (2023) Moslem Zamani and François Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023.
- Zhang et al. (2021) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
Appendix A Omitted Proofs From Section 2
See 2.1
Proof.
Since each is convex and -smooth, we have for (see Eq. (1.3)):
(A.1) | ||||
(A.2) |
On the other hand, letting , we have by the update in Alg. 1. Observe that is -strongly convex, and thus we also have
(A.3) |
Summing Eq. (A.1) and (A.2) and using Eq. (A.3), we conclude that for ,
Decomposing and summing over where we recall and , we have
For the term , we recall that by the IGD update, and , and thus we have
For the term , noticing that and decomposing , we use Young’s inequality with parameters and to obtain
Further using that and combining the above bounds on and , we obtain
To make the first two terms on the right-hand side both nonpositive, we choose
We further bound the term by the smoothness and convexity of each as follows:
where in the last equation we used . Hence, we finally obtain
where we used in the last inequality, thus completing the proof. ∎
See 2.2
Proof.
1. Using Eq. (2.2) and the convexity of , we have
It remains to multiply by on both sides and notice that by the lemma assumption,
2. This follows from the first part of the lemma, by decomposing .
∎
See 2.3
Proof.
Plugging defined by Eq. (2.1) into Lemma 2.1 and using the inequality to bound the term , we obtain
where . Multiplying on both sides with such that and noticing that by Eq. (2.1), we have
We then sum over and use Lemma 2.2 to obtain
(A.4) | ||||
where we also use . Unrolling the terms w.r.t. we get
(A.5) | ||||
and
Plugging back into Eq. (A.4), grouping like terms, and choosing , we obtain
(A.6) | ||||
To obtain the last iterate guarantee, it suffices to choose and such that
(A.7) | ||||
(A.8) |
Noticing that Eq. (A.8) is equivalent to , we see that for both inequalities to hold simultaneously, it suffices that
To maximize the growth rate of , we let . Without loss of generality, we take , and thus . Hence, dividing both sides of Eq. (A.6) by and choosing the constant step size for all , we obtain
(A.9) |
We first bound with the constant step size. Taking the natural logarithm of , we have
where for we use the fact that for . Further noticing that , then we have
On the other hand, we note that for and , then we follow the above argument and obtain
where is due to for any . Hence, we obtain the final bound:
To analyze the oracle complexity, we take
and analyze the two possible cases depending on which term in the min is smaller. If the first term in the min is smaller (which we can equivalently think of as being “large”), we get
Alternatively, if (which we can think of as having “small” ), we obtain
Hence, combining these two cases, we have
In particular, if we choose and , assuming without loss of generality that , then we have and thus
To guarantee given , the total number of individual gradient evaluations will be
∎
See 2.4
Proof.
We follow the proof of Theorem 2.3 up to Eq. (A.6) with constant step size , then we instead take and to obtain
Since is convex, we have where is the increasing weighted averaging of , thus (cf. Eq. (A.9))
where the last step is due to . Then we follow the proof of Theorem 2.3 and choose to obtain
To guarantee for , we choose and the total number of individual gradient evaluations will be
thus finishing the proof. ∎
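The convexity step used for the weighted average in the proof above is an instance of Jensen's inequality: the function value at a weighted average of iterates is at most the weighted average of the function values. A minimal numerical sketch follows, assuming a hypothetical convex test function and hypothetical linearly increasing weights; neither is taken from the paper, and the random vectors merely stand in for the iterates.

```python
import numpy as np

def f(x):
    """A convex test function (log-sum-exp plus a quadratic), for illustration only."""
    return float(np.log(np.sum(np.exp(x))) + 0.5 * x @ x)

rng = np.random.default_rng(2)
iterates = rng.standard_normal((10, 4))      # stand-ins for the iterates x_1, ..., x_K
weights = np.arange(1, 11, dtype=float)      # hypothetical increasing weights
weights /= weights.sum()                     # normalize so the weights sum to one

x_hat = weights @ iterates                   # increasing weighted average of the iterates
lhs = f(x_hat)                               # value at the averaged point
rhs = float(sum(w * f(x) for w, x in zip(weights, iterates)))   # averaged values
```

Here `lhs <= rhs` holds for any convex `f` and any nonnegative weights summing to one, which is why a bound on the weighted sum of optimality gaps transfers to the weighted-average iterate.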
See 2.5
Proof.
See A.1
Proof.
We first consider the random reshuffling strategy. Conditional on all the randomness up to but not including the -th epoch, the only randomness of comes from the permutation at the -th epoch. Further noticing that each partial sum can be seen as a batch sampled without replacement from , we have
where is due to and sampling without replacement (see, e.g., Lohr, 2021, Section 2.7). It remains to take expectation w.r.t. all randomness on both sides and use the law of total expectation. For the shuffle-once variant, we can directly take expectation since the randomness only comes from the initial random permutation, and the above argument still applies. ∎
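The sampling-without-replacement fact invoked above can be verified by brute force on a small instance. The sketch below assumes the classical finite-population identity: for n vectors with population variance sigma^2 (normalized by 1/n), the expected squared norm of the centered partial sum of the first k elements of a uniformly random permutation equals k (n - k) / (n - 1) * sigma^2. Whether the proof uses exactly this identity or an upper bound derived from it is not recoverable from the text, so this is an illustration only; the vectors `V` are random placeholders for the component gradients.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
V = rng.standard_normal((n, d))              # placeholder vectors (e.g. component gradients)
centered = V - V.mean(axis=0)                # subtract the population mean
sigma_sq = float(np.mean(np.sum(centered**2, axis=1)))   # population variance (1/n scaling)

def expected_sq_norm_partial_sum(k):
    """Average of ||sum of first k centered vectors||^2 over all n! permutations."""
    total = 0.0
    count = 0
    for perm in itertools.permutations(range(n)):
        partial = centered[list(perm[:k])].sum(axis=0)
        total += float(partial @ partial)
        count += 1
    return total / count

# Exhaustive expectation versus the closed-form finite-population formula.
exact = {k: expected_sq_norm_partial_sum(k) for k in range(1, n + 1)}
formula = {k: k * (n - k) / (n - 1) * sigma_sq for k in range(1, n + 1)}
```

The factor (n - k) / (n - 1) is the finite-population correction; it vanishes at k = n, reflecting that a full pass over the data sums the centered vectors to zero regardless of the order.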
Appendix B Omitted Proofs from Section 3
B.1 Convex Smooth Setting
Lemma B.1.
Proof.
Since each is convex and -smooth, we have for
Following the proof of Lemma 2.1, we add and subtract on the right-hand side of the second inequality and combine the above two inequalities to obtain
Decomposing and summing over , we note that and and obtain
For the term , we follow the argument from the proof of Theorem 2.3 to obtain
For the term , noticing that for and decomposing , we use Young’s inequality with parameters and to obtain
Further using the fact that and combining the above bounds on and , we obtain
The rest of the proof is the same as the proof of Lemma 2.1 and is thus omitted. ∎
See 3.2
Proof.
For all parts of the proof, we consider -dimensional quadratics
for , , and appropriately chosen sequences of .
It is immediate that is minimized at . Observe that Alg. 2 using a constant step size has closed-form updates on , i.e.,
where . Given any initial point , by iterating we have
(B.2) | ||||
(B.3) |
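To make the closed-form proximal updates on quadratics concrete, the following sketch runs a cyclic incremental proximal method on a toy one-dimensional instance. The component functions f_t(x) = 0.5 * (x - b_t)^2, the targets `b`, the epoch count, and the step-size grid are all hypothetical and are not the construction used in Theorem 3.2; they are chosen only because the proximal step then has the explicit form (x + eta * b_t) / (1 + eta).

```python
import numpy as np

def incremental_prox_quadratics(b, eta, epochs, x0=0.0):
    """Cyclic incremental proximal method on f_t(x) = 0.5 * (x - b_t)^2."""
    x = x0
    for _ in range(epochs):
        for b_t in b:
            # Exact proximal step: argmin_y 0.5 * (y - b_t)^2 + (y - x)^2 / (2 * eta).
            x = (x + eta * b_t) / (1.0 + eta)
    return x

def optimality_gap(x, b):
    """f(x) - f(x*) for f = average of the f_t; the minimizer x* is mean(b)."""
    return 0.5 * (x - np.mean(b)) ** 2

b = np.array([-1.0, 0.5, 2.0, 4.0])          # hypothetical per-task minimizers
epochs = 50
etas = [1e-4, 1e-2, 1.0, 100.0]              # illustrative step sizes
last_iterates = [incremental_prox_quadratics(b, eta, epochs) for eta in etas]
gaps = [optimality_gap(x, b) for x in last_iterates]
```

In this toy instance the gap is smallest at the intermediate step size: with a very small step size the iterate has barely moved from the initial point after the fixed number of epochs, while with a very large step size the last iterate essentially jumps to the minimizer of the most recent component, which is the forgetting behavior discussed in Section 3.1.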
1. Consider the weight in Eq. (B.3). Since , we have
Then for any such that and , we know that
where is due to being both -strongly convex and -smooth.
2. Consider the weights of in Eq. (B.3). Since , we have
thus for any
For , given any fixed , we have
Hence, for the sequence such that
then combining the bounds on the weights of with Eq. (B.3) we obtain
Since is -smooth and -strongly convex, we know that
thus finishing the proof of the second part. We note in passing that 1 on the right-hand side can be replaced by any constant using a simple rescaling.
3. Observe that given a fixed step size , we can choose a sequence such that for all , thus for any initial point :
Without loss of generality, taking a sufficiently large , we obtain
Then for any step size , we have . Consider , we can bound
Thus, for , we have
On the other hand, recalling the definition in Assumption 4, we have in this example that
Thus for any step size , we have .
We now proceed by bounding the weight of . In particular, let for some , and assume that without loss of generality by the discussion above. Since and , we have
which leads to
Hence, we have
Further noticing that
for and , then we have
Recall that , then we obtain
for such that . Thus, for the sequence such that and , we have
completing the proof. ∎
B.2 Convex Lipschitz Setting
Lemma B.2.
Proof.
Since is convex and closed, we have
By -Lipschitzness of each component function, we have that for
(B.5) | ||||
On the other hand, using convexity of , we have that for
Expanding the inner product in the above inequality leads to
(B.6) |
Combining Eq. (B.5) and (B.6) and noticing that and , we sum the inequalities over and obtain
∎
See 3.3
Proof.
Plugging defined in Eq. (2.1) into Eq. (B.4) and multiplying on both sides, we obtain
Summing over and using the second part of Lemma 2.2, we have
where we also recall that . Unrolling the terms on the left-hand side as Eq. (A.5) and choosing , we obtain
(B.7) | ||||
To obtain the last iterate guarantee, we choose and such that
For simplicity and without loss of generality, we make both inequalities tight and choose . In particular, we choose for such that , then we divide on both sides of Eq. (B.7) and obtain
Finally, choosing , we get
Hence, given , to guarantee , the total number of individual gradient evaluations will be
completing the proof. ∎
B.3 Inexact Proximal Point Evaluations
We first prove the convergence results for convex smooth settings. The following technical lemma bounds within each epoch with inexact proximal point evaluations.
Lemma B.3.
Proof.
Since each is convex and -smooth, we have for
Following the proof of Lemma 2.1, we add and subtract on the right-hand side of the second inequality and notice that then we combine the above inequalities to obtain
We decompose in the first inner product term on the right-hand side and sum the inequalities over , noting that and , to obtain
For the term , we follow the argument from the proof of Theorem 2.3 to obtain
For the term , noticing that for and , we use Young’s inequality with parameters and to obtain
For the term , we use Cauchy-Schwarz inequality and Young’s inequality to get
Further using the fact that and combining the above bounds on , and with for , we obtain
It remains to follow the proof of Lemma 2.1 and use for to obtain
thus finishing the proof. ∎
See 3.4
Proof.
Using Lemma B.3 and following the proof of Theorem 2.3, after multiplying on both sides, we have
Then we sum the above inequality over and follow the proof of Theorem 2.3. To telescope the terms , we need for such that
In this case, we maintain the same requirements on and to obtain the guarantee on the last iterate as in Theorem 2.3. In particular, we take the same choices with constant step sizes such that for , so it suffices to let for . Following the proof of Theorem 2.3, we obtain
Plugging in the choice that for and , we then have
Hence, given , to maintain the convergence rate with exact proximal point evaluations, it suffices to take for . Indeed, we have
It remains to follow the proof of Theorem 2.3, and we choose , assuming without loss of generality that . ∎
We now prove convergence with inexact proximal point evaluations in the convex Lipschitz setting.
Lemma B.4.
Proof.
By Lipschitzness of each component function, we have for
Decomposing and using triangle inequalities, we have
(B.9) |
where we use Eq. (3.2) and the fact that for . On the other hand, using convexity of , we have for that
Expanding the inner product in the above quantity and using Cauchy-Schwarz inequality with Eq. (3.2) leads to
(B.10) |
Using triangle inequalities and decomposing , we bound the term in Eq. (B.10) as follows
where we use Young’s inequality for . Combining Eq. (B.9) and (B.10) with the above bound on and noticing that and , we sum the inequalities over and obtain
thus finishing the proof. ∎
See 3.5
Proof.
Using Lemma B.4 with defined by Eq. (2.1) and multiplying on both sides, we have
Then we sum the inequalities over and follow the proof of Theorem 3.3. To telescope the terms , we need for such that
while we maintain other requirements on and to obtain the last iterate convergence as in Theorem 3.3. In particular, we take the same choice that and for , so it suffices to let . So we arrive at
Hence, given and taking the constant step size for simplicity, to maintain the convergence rate as in Theorem 3.3 with inexact proximal point evaluations, it suffices to let . Indeed, we have
and
It remains to follow the proof of Theorem 3.3, thus finishing the proof. ∎