Last Iterate Convergence of Incremental Methods
and Applications in Continual Learning

Xufeng Cai, Jelena Diakonikolas

Department of Computer Sciences, University of Wisconsin-Madison. XC ([email protected]), JD ([email protected]).

Abstract

Incremental gradient and incremental proximal methods are a fundamental class of optimization algorithms used for solving finite sum problems, broadly studied in the literature. Yet, without strong convexity, their convergence guarantees have primarily been established for the ergodic (average) iterate. Motivated by applications in continual learning, we obtain the first convergence guarantees for the last iterate of both incremental gradient and incremental proximal methods, in general convex smooth (for both) and convex Lipschitz (for the proximal variants) settings. Our oracle complexity bounds for the last iterate nearly match (i.e., match up to a square-root-log or a log factor) the best known oracle complexity bounds for the average iterate, for both classes of methods. We further obtain generalizations of our results to weighted averaging of the iterates with increasing weights and for randomly permuted ordering of updates. We study incremental proximal methods as a model of continual learning with generalization and argue that a large amount of regularization is crucial to preventing catastrophic forgetting. Our results generalize last iterate guarantees for incremental methods compared to the state of the art, as such results were previously known only for overparameterized linear models, which correspond to convex quadratic problems with infinitely many solutions.

1 Introduction

We study the last iterate convergence of incremental (gradient and proximal) methods, which apply to problems of the form

$$\min_{{\bm{x}}\in\mathbb{R}^{d}}\Big\{f({\bm{x}}):=\frac{1}{T}\sum_{t=1}^{T}f_{t}({\bm{x}})\Big\}. \tag{1.1}$$

As is standard, we assume that each component function $f_{t}$ is convex and either smooth or Lipschitz-continuous and that a minimizer ${\bm{x}}_{*}\in\operatorname*{arg\,min}_{{\bm{x}}}f({\bm{x}})$ exists.

Incremental methods traverse all the component functions $f_{t}$ in a cyclic manner, updating their iterates by taking either gradient descent steps (in the case of incremental gradient methods) or proximal-point steps (in the case of incremental proximal methods) with respect to the individual component functions $f_{t}$. For a more precise statement of these two classes of methods, see Sections 2 and 3. As in prior work (Bertsekas et al., 2011; Bertsekas, 2011; Li et al., 2019; Mishchenko et al., 2020; Cai et al., 2023a), we define the oracle complexity of these methods as the number of first-order or proximal oracle queries to individual component functions $f_{t}$ required to reach a solution ${\bm{x}}$ with optimality gap $f({\bm{x}})-f({\bm{x}}_{*})\leq\epsilon$ on the worst-case instance from the considered problem class, where $\epsilon>0$ is a given error parameter.

Our main motivation for studying the last iterate convergence of incremental methods comes from applications in continual learning (CL). In particular, CL models a sequential learning setting, where a machine learning model gets updated over time, based on the changing or evolving distribution of the data passed to the learner. A major challenge in such dynamic learning settings is the degradation of model performance on previously seen data, known as catastrophic forgetting (McCloskey and Cohen, 1989; Goodfellow et al., 2013), which has been well-documented in various empirical studies; see, e.g., recent surveys (De Lange et al., 2021; Parisi et al., 2019). On the theoretical front, however, much is still missing from the understanding of possibilities and limitations related to catastrophic forgetting, with results for basic learning settings being obtained only very recently (Evron et al., 2022, 2023; Lin et al., 2023b; Goldfarb and Hand, 2023; Goldfarb et al., 2024; Peng and Risteski, 2022; Peng et al., 2023; Chen et al., 2022; Cao et al., 2022; Balcan et al., 2015).

While there are different learning settings studied under the umbrella of CL, following recent work (Evron et al., 2022, 2023), we focus on the CL settings with repeated replaying of tasks, where the forgetting after $K$ epochs/full passes over $T$ tasks is defined as (Doan et al., 2021; Evron et al., 2022, 2023)

$$f_{K}({\bm{x}}_{KT}):=\frac{1}{T}\sum_{t=1}^{T}f_{t}({\bm{x}}_{KT}), \tag{1.2}$$

where ${\bm{x}}_{n}\in{\mathbb{R}}^{d}$ is the model parameter vector used at time $n\in\mathbb{N}_{+}$ . Such settings arise in applications that naturally undergo cyclic changes in the data/tasks, due to diurnal or seasonal cycles (e.g., in agriculture, forestry, e-commerce, astronomy, etc.). Forgetting is catastrophic if $f_{K}({\bm{x}}_{KT})\overset{K\rightarrow\infty}{\nrightarrow}0$ .

Observe that Eq. (1.2) corresponds to the value of the objective function from Eq. (1.1) at the final iterate ${\bm{x}}_{KT}.$ Prior work (Evron et al., 2022) that obtained rigorous bounds for the forgetting Eq. (1.2) applied to the problems where each $f_{t}$ is a convex quadratic function minimized by the same ${\bm{x}}_{*}$ such that $f_{t}({\bm{x}}_{*})=f({\bm{x}}_{*})=0.$ By contrast, we consider more general convex functions that are either smooth or Lipschitz continuous, and make no assumption about ${\bm{x}}_{*}$ beyond being a minimizer of the (average) function $f$ . Since we are not assuming that $f({\bm{x}}_{*})=0,$ our focus is on bounding the excess forgetting $f_{K}({\bm{x}}_{KT})-f({\bm{x}}_{*})$ , which is equivalently the optimality gap for the last iterate in Eq. (1.1).

The method considered in Evron et al. (2022) minimized each component function exactly, outputting the solution closest to the previous iterate in each iteration, using implicit regularization properties of SGD. To obtain the results, it was then crucial that the component functions were quadratic (so that there is an explicit, closed-form solution for each subproblem) and that all component functions shared a nonempty set of minimizers with value zero (so that forgetting can be controlled despite aggressive adaptation to the current task). Our work, using the incremental proximal method, instead considers explicit regularization to enforce closeness of models on differing tasks. This can potentially degrade the performance on the current task, but in exchange it can control forgetting and it addresses a much broader class of loss functions.

1.1 Contributions

Our main contributions can be summarized as follows, where $\sigma_{*}^{2}:=\frac{1}{T}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{*})\|^{2}$ denotes the gradient variance at the optimum. The quantity $\sigma_{*}^{2}$ is intrinsic to oracle complexity of incremental methods (Mishchenko et al., 2020; Nguyen et al., 2021; Cai et al., 2023a; Cha et al., 2023).

Last iterate convergence of Incremental Gradient Descent (IGD).

We provide the first oracle complexity guarantees for the last iterate of standard variants of IGD with either deterministic or randomly permuted ordering of the updates, applied to convex $L$-smooth objectives. Up to a square-root-log factor, our oracle complexity bounds in Theorem 2.3 and Corollary 2.5, which are $\widetilde{\mathcal{O}}\big(\frac{TL\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon}+\frac{TL^{1/2}\sigma_{*}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon^{3/2}}\big)$ for the deterministic variant and $\widetilde{\mathcal{O}}\big(\frac{TL\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon}+\frac{\sqrt{TL}\sigma_{*}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon^{3/2}}\big)$ for the randomly permuted variant, match the best known oracle complexity bounds for these methods, previously known only for the (uniformly) average iterate (Mishchenko et al., 2020; Nguyen et al., 2021; Cai et al., 2023a; Cha et al., 2023). We further extend our results to increasing weighted averaging of the iterates in Corollary 2.4, which places more weight on the more recent iterates, removing the excess square-root-log factor in the resulting oracle complexity bound.

Last iterate convergence of Incremental Proximal Method (IPM).

We provide the first oracle complexity guarantees for the last iterate of IPM applied to convex and either smooth or Lipschitz-continuous objectives. When each component function is convex and $L$-smooth, we show (in Theorem 3.1) that IPM has the same $\widetilde{\mathcal{O}}\big(\frac{TL\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon}+\frac{TL^{1/2}\sigma_{*}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon^{3/2}}\big)$ oracle complexity guarantee as IGD. This result is new for any variant of this method, with either the average or the last iterate as its output. When component functions are convex and $G$-Lipschitz, our oracle complexity $\widetilde{\mathcal{O}}\Big(\frac{G^{2}T\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon^{2}}\Big)$ in Theorem 3.3 matches the best known oracle complexity bound up to a log factor, which was previously known only for the (uniformly) average iterate (Bertsekas, 2011; Li et al., 2019). We further argue (in Corollary 3.4 and Corollary 3.5) that for both settings our analysis can be extended to admit inexact proximal point evaluations, an important setting not addressed by prior work on general IPM.

IPM as a model of CL.

We initiate the study of IPM as a model of CL, corresponding to sequential ridge-regularized model training commonly used in practice. On the positive side, our last-iterate convergence results for IPM in Theorem 3.1 and Theorem 3.3 demonstrate that forgetting (corresponding to the optimality gap at the last iterate) can be effectively controlled if the amount of employed regularization is sufficiently high. On the negative side, we show that for any constant amount of regularization, forgetting is always catastrophic, even for least squares problems. In particular, we provide a univariate quadratic example such that for any constant regularization parameter, the asymptotic limit of (excess) forgetting is non-zero. Further, we demonstrate that for forgetting to be made smaller than some target $\epsilon,$ the regularization must be sufficiently high and depend polynomially on $1/\epsilon,\,T,$ and $\sigma_{*}.$ These results are summarized in Theorem 3.2 and highlight the limitations of regularization as a black-box tool for controlling forgetting in CL.

1.2 Further related work

To remedy catastrophic forgetting, various empirical approaches have been proposed, and this work is most closely related to (i) memory-based approaches, which store samples from previous tasks and reuse those data for training on the current task (Robins, 1995; Lopez-Paz and Ranzato, 2017; Rolnick et al., 2019) and (ii) regularization-based approaches, which regularize the loss of the current task to ensure the new model parameter vector is close to the prior ones (Kirkpatrick et al., 2017); for a more complete survey, we refer to De Lange et al. (2021). On the theoretical front, the results on the forgetting for cyclic replaying of tasks considered in our work have only been established for linear models (Evron et al., 2022, 2023; Swartworth et al., 2024). In particular, the analysis for linear regression tasks in Evron et al. (2022); Swartworth et al. (2024) crucially relies on the exact minimization of each task using (S)GD to have closed-form updates between tasks, while Evron et al. (2023) uses alternating projections to analyze linear classification tasks. It is unclear how to extend either of these results to general convex loss functions that we consider.

On a technical level, our results are most closely related to 1) the literature on last iterate convergence guarantees for subgradient-based methods and stochastic gradient descent (SGD) and 2) the literature on incremental gradient methods and shuffled SGD. For 1), we draw inspiration from the recent results (Zamani and Glineur, 2023; Liu and Zhou, 2023), which rely on a clever construction of reference points ${\bm{z}}_{k}$ with respect to which a gap quantity $f({\bm{x}}_{k})-f({\bm{z}}_{k-1})$ gets bounded to deduce a bound for the optimality gap $f({\bm{x}}_{k})-f({\bm{x}}_{*})$ of the last iterate ${\bm{x}}_{k}$. For the latter line of work 2), we generalize the analysis used exclusively for the optimality gap of the (uniformly) average iterate (Mishchenko et al., 2020; Nguyen et al., 2021; Cha et al., 2023; Cai et al., 2023a) to control the gap-like quantities $f({\bm{x}}_{k})-f({\bm{z}}_{k-1})$, which require a more careful argument for controlling all error terms introduced by replacing ${\bm{x}}_{*}$ by ${\bm{z}}_{k-1}$ without introducing spurious unrealistic assumptions about the magnitudes of the component functions' gradients. We finally note that both these related lines of work concern problems on which progress was made only in the very recent literature. In particular, while the oracle complexity upper bound for the average iterate of SGD in convex Lipschitz-continuous settings has been known for decades and its analysis is routinely taught in optimization and machine learning classes, there were no such results for the last iterate of SGD until 2013 (Shamir and Zhang, 2013), with improvements and generalizations to these results obtained as recently as in the past year (Liu and Zhou, 2023).
Regarding 2), obtaining any nonasymptotic convergence guarantees for incremental gradient methods/shuffled SGD had remained open for decades (Bottou, 2009) until a recent line of work (Gürbüzbalaban et al., 2021; Shamir, 2016; Haochen and Sra, 2019; Nagaraj et al., 2019; Ahn et al., 2020; Rajput et al., 2020; Yun et al., 2022; Safran and Shamir, 2020). For nonconvex problems, Yu and Li (2023) proved a high-probability last iterate guarantee for shuffled SGD with stopping criteria, which is technically disjoint from our work. For the smooth convex problems we consider, the convergence results were obtained only in the past few years (Mishchenko et al., 2020; Nguyen et al., 2021; Cha et al., 2023) and improved in Cai et al. (2023a) using a fine-grained analysis inspired by the recent advances in cyclic methods (Song and Diakonikolas, 2023; Cai et al., 2023b; Lin et al., 2023a). However, all those results are for the (uniformly) average iterate, while obtaining convergence results for the last iterate had remained open.

Concurrent independent work.

An independent and concurrent work to ours (Liu and Zhou, 2024) studied the last-iterate convergence of shuffled SGD for composite (strongly) convex smooth/Lipschitz optimization. For the same problems as studied in our Section 2, they obtained the same convergence results. The remaining results in Liu and Zhou (2024) and our work are not directly comparable, as the motivation for the two works and the studied settings are different. In particular, the focus of Liu and Zhou (2024) is on the last iterate convergence of shuffled SGD, and thus they study it in depth, considering different Lipschitz/smoothness constants for component functions, strong convexity, and composite settings. Our focus on the other hand is on settings relevant to continual learning, and thus we additionally consider the convergence of a weighted average of iterates and put more weight on the incremental proximal method, which were not considered in Liu and Zhou (2024). It is of note that while Liu and Zhou (2024) used proximal steps to handle the nonsmooth portion of the objective in their composite setting, the proximal maps are not applied component-wise, but at the end of a cycle, to only one (regularizer) function (e.g., to handle constraints or joint regularization).

1.3 Notation and preliminaries

We consider the $d$ -dimensional real space $(\mathbb{R}^{d},\|\cdot\|)$ , where $\|\cdot\|$ is the $\ell_{2}$ norm, and denote $[T]:=\{1,2,\dots,T\}$ . Given a proper, convex, lower semicontinuous function $f$ , its proximal operator and Moreau envelope are defined by

$$\mathrm{prox}_{\eta f}({\bm{x}})=\operatorname*{arg\,min}_{{\bm{y}}\in{\mathbb{R}}^{d}}\Big\{\frac{1}{2\eta}\|{\bm{y}}-{\bm{x}}\|^{2}+f({\bm{y}})\Big\},\quad M_{\eta f}({\bm{x}})=\min_{{\bm{y}}\in{\mathbb{R}}^{d}}\Big\{\frac{1}{2\eta}\|{\bm{y}}-{\bm{x}}\|^{2}+f({\bm{y}})\Big\},$$

respectively, for a parameter $\eta>0$. The Moreau envelope is $\frac{1}{\eta}$-smooth with the gradient $\nabla M_{\eta f}({\bm{x}})=\frac{1}{\eta}({\bm{x}}-\mathrm{prox}_{\eta f}({\bm{x}}))\in\partial f(\mathrm{prox}_{\eta f}({\bm{x}}))$, where $\partial f(\cdot)$ denotes the subdifferential of $f$.
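As a sanity check on these definitions, the following minimal sketch evaluates the proximal operator and the Moreau envelope of a univariate quadratic $f(y)=\frac{a}{2}(y-b)^{2}$, for which the minimization has a closed form. The choice of $f$, the function names, and the numerical values are illustrative assumptions and do not come from the paper.

```python
def prox_quadratic(x: float, eta: float, a: float, b: float) -> float:
    """prox_{eta f}(x) for the assumed f(y) = 0.5 * a * (y - b)^2.

    Setting the derivative of (y - x)^2 / (2 eta) + f(y) to zero gives
    y = (x + eta * a * b) / (1 + eta * a).
    """
    return (x + eta * a * b) / (1.0 + eta * a)


def moreau_quadratic(x: float, eta: float, a: float, b: float) -> float:
    """Moreau envelope M_{eta f}(x): the objective value at the prox point."""
    p = prox_quadratic(x, eta, a, b)
    return (p - x) ** 2 / (2.0 * eta) + 0.5 * a * (p - b) ** 2


def moreau_grad(x: float, eta: float, a: float, b: float) -> float:
    """Gradient of the envelope via the identity (x - prox(x)) / eta."""
    return (x - prox_quadratic(x, eta, a, b)) / eta


# Illustrative values (assumptions).
x, eta, a, b = 3.0, 0.5, 2.0, 1.0
p = prox_quadratic(x, eta, a, b)
grad_env = moreau_grad(x, eta, a, b)
grad_f_at_prox = a * (p - b)  # f'(prox(x)); should coincide with grad_env
envelope = moreau_quadratic(x, eta, a, b)
```

For a differentiable $f$ the subdifferential is a singleton, so the inclusion above becomes the equality `grad_env == grad_f_at_prox`, which is what the sketch lets one check numerically.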

We make the following assumptions. The first one is made throughout the paper.

Assumption 1.

Each $f_{t}$ is convex and there exists a minimizer ${\bm{x}}_{*}\in\operatorname*{arg\,min}_{{\bm{x}}\in{\mathbb{R}}^{d}}f({\bm{x}})$ .

By Assumption 1, $f$ is also convex. In nonsmooth settings, we make an additional standard assumption that the component functions are Lipschitz-continuous.

Assumption 2.

Each $f_{t}$ is $G$ -Lipschitz, i.e., $|f_{t}({\bm{x}})-f_{t}({\bm{y}})|\leq G\|{\bm{x}}-{\bm{y}}\|$ for any ${\bm{x}},{\bm{y}}\in\mathbb{R}^{d}$ ; thus $\|g_{t}({\bm{x}})\|\leq G$ for all $g_{t}({\bm{x}})\in\partial f_{t}({\bm{x}})$ .

For the smooth settings, we make the following assumption.

Assumption 3.

Each $f_{t}$ is $L$ -smooth, i.e., $\|\nabla f_{t}({\bm{x}})-\nabla f_{t}({\bm{y}})\|\leq L\|{\bm{x}}-{\bm{y}}\|$ for any ${\bm{x}},{\bm{y}}\in\mathbb{R}^{d}$ .

We remark that Assumptions 2 and 3 imply that $f$ is also $G$-Lipschitz and $L$-smooth, respectively. These two assumptions can also be generalized to allow distinct Lipschitz/smoothness constants, and our results would scale with the average Lipschitz/smoothness constant using the techniques from Cai et al. (2023a), which we omit to keep the focus on the intricacies of the last iterate convergence. When $f$ is $L$-smooth and convex, we will often make use of the following standard inequality that fully characterizes the class of $L$-smooth convex functions:

$$\frac{1}{2L}\|\nabla f({\bm{x}})-\nabla f({\bm{y}})\|^{2}\leq f({\bm{y}})-f({\bm{x}})-\left\langle\nabla f({\bm{x}}),{\bm{y}}-{\bm{x}}\right\rangle,\quad\forall{\bm{x}},{\bm{y}}\in{\mathbb{R}}^{d}. \tag{1.3}$$

Finally, when each $f_{t}$ is smooth, we assume bounded variance at ${\bm{x}}_{*}$, as in all prior work that considered the same settings of IGD/shuffled SGD as we do (Mishchenko et al., 2020; Nguyen et al., 2021; Tran et al., 2021, 2022; Cai et al., 2023a).

Assumption 4.

The quantity $\sigma_{*}^{2}:=\frac{1}{T}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{*})\|^{2}$ is bounded.

2 Last Iterate Convergence of Incremental Gradient Descent

In this section, we introduce our techniques for analyzing the last iterate guarantee and bound the oracle complexity for the last iterate of incremental gradient descent (IGD), assuming component functions are smooth and convex. In the context of CL, this corresponds to a simplified setup where the learner incrementally performs a single gradient step on each task and cyclically replays the $T$ tasks. Nevertheless, this setup serves as a warmup to the proximal setup we discuss in the next section. Additionally, it is of independent interest as incremental gradient methods are widely used in the optimization and machine learning literature, where despite the lack of prior theoretical justification, it is typically the last iterate that gets output by the algorithm in practice.

We summarize the IGD method in Alg. 1, assuming the incremental order $1,2,\dots,T$ in each epoch for simplicity and without loss of generality. The oracle complexity for the (uniformly) average iterate of IGD has been shown to be ${\mathcal{O}}(\frac{TL}{\epsilon}+\frac{T\sqrt{L}\sigma_{*}}{\epsilon^{3/2}})$ for an $\epsilon$ -optimality gap (Mishchenko et al., 2020; Cai et al., 2023a) under the same assumptions we make here (Assumptions 1, 3, and 4), while, as discussed before, there were no guarantees for either the last iterate or even a weighted average of the iterates. The main result of this section is that the same oracle complexity applies to the last iterate of IGD, up to a square-root-log factor. We then further generalize this result to weighted averages of iterates with increasing weights and to variants with randomly permuted order of cyclic updates.

Algorithm 1 Incremental Gradient Descent (IGD)

Input: initial point ${\bm{x}}_{0}$, number of epochs $K$, step size $\{\eta_{k}\}$
for $k=1:K$
  ${\bm{x}}_{k-1,1}={\bm{x}}_{k-1}$
  for $t=1:T$
    ${\bm{x}}_{k-1,t+1}={\bm{x}}_{k-1,t}-\eta_{k}\nabla f_{t}({\bm{x}}_{k-1,t})$
  ${\bm{x}}_{k}={\bm{x}}_{k-1,T+1}$
return ${\bm{x}}_{K}$
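The update rule of Alg. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a constant step size, represents vectors as plain lists, and runs on hypothetical one-dimensional components $f_{t}(x)=\frac{1}{2}(x-b_{t})^{2}$ whose average is minimized at the mean of the $b_{t}$.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Grad = Callable[[Vector], Vector]


def igd(grads: Sequence[Grad], x0: Vector, num_epochs: int, eta: float) -> Vector:
    """Cyclic incremental gradient descent with a constant step size eta."""
    x = list(x0)
    for _ in range(num_epochs):      # epochs k = 1, ..., K
        for grad_t in grads:         # components t = 1, ..., T in fixed order
            g = grad_t(x)
            x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x                         # the last iterate x_K


# Hypothetical components f_t(x) = 0.5 * (x - b_t)^2; the minimizer of the
# average is mean(targets) = 1.0, and each component is 1-smooth.
targets = [-1.0, 0.0, 4.0]
grads = [lambda x, b=b: [x[0] - b] for b in targets]
x_last = igd(grads, x0=[10.0], num_epochs=200, eta=0.01)
```

With a constant step size the last iterate settles near, but not exactly at, the minimizer; the residual bias shrinks with the step size, which is the trade-off between the two terms of the bounds derived below.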

We begin the analysis by deriving a bound on the gap with respect to an arbitrary but fixed reference point ${\bm{z}},$ as summarized in the following lemma with its proof in Appendix A. This stands in contrast to arguments deriving bounds on the average iterate, which take ${\bm{z}}={\bm{x}}_{*}.$ While this may seem like a minor difference, it affects the analysis non-trivially: a direct extension of prior arguments would require replacing Assumption 4 – which imposes a bound on $\frac{1}{T}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{*})\|^{2}$ – with a bound on $\frac{1}{T}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{z}})\|^{2}$ for an arbitrary ${\bm{z}}$ , which would be a much stronger requirement.

Lemma 2.1.

Under Assumptions 1 and 3, for any ${\bm{z}}\in\mathbb{R}^{d}$ that is fixed in the $k$ -th cycle of Alg. 1 and any $\alpha,\beta>0$ such that $\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}$ , if $\eta_{k}\leq\frac{1}{\sqrt{\beta}TL}$ , then for all $k\in[K],$

$$T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\eta_{k}^{2}L\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}+\frac{\alpha}{\beta}T\big(f({\bm{z}})-f({\bm{x}}_{*})\big)+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big).$$

Our next step is to specify our choice of the reference point ${\bm{z}}$ for each epoch. In particular, we consider a sequence of points $\{{\bm{z}}_{k}\}_{k=-1}^{K-1}$ that is recursively defined as a convex combination of the algorithm iterate ${\bm{x}}_{k}$ and the previous reference point ${\bm{z}}_{k-1}$ :

$${\bm{z}}_{k}=(1-\lambda_{k}){\bm{x}}_{k}+\lambda_{k}{\bm{z}}_{k-1} \tag{2.1}$$

for $k\geq 0$ with ${\bm{z}}_{-1}={\bm{x}}_{*}$ and $\lambda_{k}\in[0,1]$ to be set later. Observe that ${\bm{z}}_{k}$ can also be written as a convex combination of the points $\{{\bm{x}}_{j}\}_{j=0}^{k}$ and ${\bm{x}}_{*}$ by unrolling the recursion, i.e.,

$${\bm{z}}_{k}=(1-\lambda_{k}){\bm{x}}_{k}+\Big(\prod_{i=0}^{k}\lambda_{i}\Big){\bm{x}}_{*}+\sum_{j=0}^{k-1}\Big(\prod_{i=j+1}^{k}\lambda_{i}\Big)(1-\lambda_{j}){\bm{x}}_{j}, \tag{2.2}$$

where $(1-\lambda_{k})+\prod_{i=0}^{k}\lambda_{i}+\sum_{j=0}^{k-1}\big(\prod_{i=j+1}^{k}\lambda_{i}\big)(1-\lambda_{j})=1$. If we set $\lambda_{k}=1$ for all $k$, then we have ${\bm{z}}_{k}={\bm{x}}_{*}$ and recover the bound $f({\bm{x}}_{k})-f({\bm{x}}_{*})$ in Lemma 2.1, which leads to the average iterate guarantee. For general $\{\lambda_{k}\}$, we obtain the following lemma to relate the function value gap $f({\bm{x}}_{k})-f({\bm{z}}_{k-1})$ to the optimality gap $f({\bm{x}}_{k})-f({\bm{x}}_{*})$, whose proof is deferred to Appendix A.
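The equivalence between the recursion in Eq. (2.1) and its unrolled form in Eq. (2.2) can be checked numerically. The sketch below uses scalar iterates and arbitrary illustrative values for the iterates, the minimizer, and $\{\lambda_{k}\}$; all names and numbers are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def reference_points(xs: Sequence[float], x_star: float,
                     lambdas: Sequence[float]) -> List[float]:
    """Recursion z_k = (1 - lam_k) x_k + lam_k z_{k-1}, with z_{-1} = x_star."""
    z_prev = x_star
    zs = []
    for x_k, lam_k in zip(xs, lambdas):
        z_prev = (1.0 - lam_k) * x_k + lam_k * z_prev
        zs.append(z_prev)
    return zs


def unrolled_coefficients(lambdas: Sequence[float],
                          k: int) -> Tuple[List[float], float]:
    """Coefficients of x_0, ..., x_k and of x_star in the unrolled form of z_k."""
    def prod(vals: Sequence[float]) -> float:
        out = 1.0
        for v in vals:
            out *= v
        return out

    # For j = k the product is empty, giving the leading (1 - lam_k) term.
    coef_x = [prod(lambdas[j + 1:k + 1]) * (1.0 - lambdas[j]) for j in range(k + 1)]
    coef_star = prod(lambdas[:k + 1])
    return coef_x, coef_star


# Illustrative values (assumptions).
xs = [2.0, -1.0, 0.5]
x_star = 3.0
lambdas = [0.5, 0.8, 0.9]

z_recursive = reference_points(xs, x_star, lambdas)[-1]
coef_x, coef_star = unrolled_coefficients(lambdas, k=2)
z_unrolled = sum(c * x for c, x in zip(coef_x, xs)) + coef_star * x_star
coef_total = sum(coef_x) + coef_star  # convex combination: should equal 1
```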

Lemma 2.2.

Let ${\bm{z}}_{k}$ be defined by Eq. (2.1) for a given sequence of parameters $\lambda_{k}\in(0,1)$ , where $k\geq 0$ and ${\bm{z}}_{-1}={\bm{x}}_{*}.$ Under Assumption 1, if there exists a sequence of nonnegative weights $w_{k}$ such that $\lambda_{k}w_{k}\leq w_{k-1}$ for $k\in[K-1]$ , then for all $k\in[K]$ :

1. $w_{k-1}\big(f({\bm{z}}_{k-1})-f({\bm{x}}_{*})\big)\leq\sum_{j=0}^{k-1}w_{j}(1-\lambda_{j})\big(f({\bm{x}}_{j})-f({\bm{x}}_{*})\big);$
2. $w_{k-1}\big(f({\bm{x}}_{k})-f({\bm{z}}_{k-1})\big)\geq w_{k-1}\big(f({\bm{x}}_{k})-f({\bm{x}}_{*})\big)-\sum_{j=0}^{k-1}w_{j}(1-\lambda_{j})\big(f({\bm{x}}_{j})-f({\bm{x}}_{*})\big)$.

The role of the sequence of weights $\{w_{k}\}$ in Lemma 2.2 is to ensure that we can telescope the terms $\|{\bm{x}}_{k-1}-{\bm{z}}_{k-1}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}$ in Lemma 2.1. On the other hand, to succinctly see why such $\{{\bm{z}}_{k}\}$ could lead to the desired last iterate guarantees, we note that the second part of Lemma 2.2 indicates that $w_{k-1}\big{(}f({\bm{x}}_{k})-f({\bm{z}}_{k-1})\big{)}$ intrinsically includes retraction terms of the optimality gaps at the previous iterates. Hence, we can deduct $w_{K-1}(f({\bm{x}}_{K})-f({\bm{x}}_{*}))$ from $\sum_{k=1}^{K}w_{k-1}\big{(}f({\bm{x}}_{k})-f({\bm{z}}_{k-1})\big{)}$ by properly choosing $\lambda_{k}$ and $w_{k}$ to cancel out the optimality gap terms at the intermediate iterates. In this case, the convergence rate for the last iterate is characterized by the growth rate of $\{w_{k}\}$ .

Our choice of the reference points $\{{\bm{z}}_{k}\}$ is inspired by the recent work (Zamani and Glineur, 2023; Liu and Zhou, 2023) on last iterate guarantees for subgradient methods and SGD with $\lambda_{k}=\frac{w_{k-1}}{w_{k}}$ . However, their proof techniques are not directly applicable to incremental methods, due to several technical obstacles including additional nontrivial error terms of the form $\frac{\alpha}{\beta}T\big{(}f({\bm{z}}_{k})-f({\bm{x}}_{*})\big{)}$ in Lemma 2.1 arising from the incremental gradient steps using reference points other than ${\bm{x}}_{*}$ . Such error terms inherently deteriorate the growth rate of $\{w_{k}\}$ and could possibly lead to a worse last iterate rate compared to the rate on the average iterate. In the following theorem, we calibrate such degradation on the last iterate guarantees, and show that with a slightly smaller step size one can still achieve essentially the same rate as for the average iterate. The proof is provided in Appendix A due to space constraints.

Theorem 2.3.

Under Assumptions 1, 3, and 4 and for positive parameters $\alpha,\beta>0$ such that $\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}$ , if the step size is fixed and satisfies $\eta_{k}=\eta\leq\frac{1}{\sqrt{\beta}TL}$ , the output ${\bm{x}}_{K}$ of Alg. 1 satisfies

$$f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\mathrm{e}\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}+\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2\eta TK^{\frac{1}{1+\alpha/\beta}}}. \tag{2.3}$$

With $\alpha=4,\beta=4\log K$, there exists a constant step size $\eta$ such that $f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\epsilon$ for $\epsilon>0$ after $\widetilde{\mathcal{O}}\big(\frac{TL\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon}+\frac{TL^{1/2}\sigma_{*}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon^{3/2}}\big)$ individual gradient evaluations.

The error term $f({\bm{z}}_{k-1})-f({\bm{x}}_{*})$ in Lemma 2.1 plays the role of slowing the last iterate rate, as calibrated by the dependence on $\alpha/\beta$ in Eq. (2.3). To remedy such degradation compared to the average iterate rate (Mishchenko et al., 2020; Cai et al., 2023a), one natural thought is to make $\alpha/\beta$ sufficiently small. In particular, we choose $\alpha/\beta=1/\log K$ and show that the last iterate rate nearly matches the best known rate on the average iterate, with the trade-off of requiring order- $\frac{1}{\sqrt{\log K}}$ smaller step sizes in comparison with Mishchenko et al. (2020); Cai et al. (2023a). This translates into the oracle complexity that is larger by at most a $\sqrt{\log(1/\epsilon)}$ factor. For most cases of interest, this quantity can be treated as a constant: for example, for $\epsilon=10^{-8},$ $\sqrt{\log(1/\epsilon)}\approx 4.29.$
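To make the effect of the choice $\alpha=4,\beta=4\log K$ concrete, the following sketch evaluates the step size cap $\frac{1}{\sqrt{\beta}TL}$ and the two $K$-dependent factors appearing in Eq. (2.3). With $\alpha/\beta=1/\log K$, the factor $K^{\frac{\alpha/\beta}{1+\alpha/\beta}}=\exp\big(\frac{\log K}{1+\log K}\big)$ stays below $\mathrm{e}$, and $K^{\frac{1}{1+\alpha/\beta}}$ is $K$ divided by that same factor. The function name and the values of $K$, $T$, $L$ are assumptions chosen for illustration.

```python
import math
from typing import Tuple


def last_iterate_constants(K: int, T: int, L: float) -> Tuple[float, float, float]:
    """Step size cap and K-dependent factors of Eq. (2.3) for alpha=4, beta=4 log K."""
    alpha = 4.0
    beta = 4.0 * math.log(K)
    if 1.0 / alpha + 1.0 / beta > 0.5:
        # The condition 1/alpha + 1/beta <= 1/2 holds exactly when log K >= 1.
        raise ValueError("need log K >= 1 for this parameter choice")
    eta_max = 1.0 / (math.sqrt(beta) * T * L)   # eta <= 1 / (sqrt(beta) T L)
    ratio = alpha / beta                         # equals 1 / log K
    growth = K ** (ratio / (1.0 + ratio))        # multiplies the variance term
    decay = K ** (1.0 / (1.0 + ratio))           # divides the distance term
    return eta_max, growth, decay


# Illustrative problem sizes (assumptions).
eta_max, growth, decay = last_iterate_constants(K=1000, T=10, L=1.0)

# The square-root-log overhead quoted in the text for epsilon = 1e-8.
log_factor = math.sqrt(math.log(1.0 / 1e-8))
```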

On the other hand, with $\lambda_{k}\equiv 1$ and constant $w_{k}$ , Lemmas 2.1 and 2.2 directly imply the average iterate guarantee, as a sanity check. Additionally, instead of zeroing the weights of the optimality gap terms $f({\bm{x}}_{k})-f({\bm{x}}_{*})$ for $k\in[K-1]$ to obtain the last iterate guarantee, one can deduce the convergence rate on the increasing weighted averaging which places more weight on later iterates, as formalized in the following corollary whose proof is deferred to Appendix A.

Corollary 2.4 (Increasing Weighted Averaging).

Under Assumptions 1, 3, and 4 and for parameters $\alpha,\beta>0$ such that $\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}$ , if the step size $\eta$ is fixed and such that $\eta\leq\frac{1}{\sqrt{\beta}TL}$ , then for any constant $c\in(0,1]$ and increasing sequence $\{w_{k}\}_{k=0}^{K-1}$ with $w_{k}=\frac{(1+\alpha/\beta)(K-k)+1-c}{(1+\alpha/\beta)(K-k)}w_{k-1}$ , Alg. 1 outputs $\hat{\bm{x}}_{K}=\sum_{k=1}^{K}\frac{w_{k-1}{\bm{x}}_{k}}{\sum_{k=1}^{K}w_{k-1}}$ satisfying

$$f(\hat{\bm{x}}_{K})-f({\bm{x}}_{*})\leq\frac{T^{2}\sigma_{*}^{2}\eta^{2}L}{c}+\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2c\eta TK}.$$

In particular, there exists a constant step size $\eta$ such that $f(\hat{\bm{x}}_{K})-f({\bm{x}}_{*})\leq\epsilon$ for $\epsilon>0$ after ${\mathcal{O}}\big(\frac{TL\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{c\epsilon}+\frac{TL^{1/2}\sigma_{*}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{c^{3/2}\epsilon^{3/2}}\big)$ individual gradient evaluations.

We remark that increasing weighted averaging shaves off the (at most) square-root-log term appearing in the last iterate rate above and recovers the best known rate for the average iterate (Mishchenko et al., 2020; Cai et al., 2023a). The parameters $\alpha,\beta$ are included in the weights $w_{k}$ controlling the growth rate of the increasing sequence $\{w_{k}\}$ . When $\alpha/\beta\rightarrow\infty$ , increasing weighted averaging reduces to the uniform weighted average.
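The weight recursion of Corollary 2.4 is easy to tabulate. The sketch below generates $\{w_{k}\}_{k=0}^{K-1}$ and forms the weighted average $\hat{\bm{x}}_{K}$ for scalar iterates. The normalization $w_{0}=1$ is an assumption (the average is invariant to rescaling the weights), as are the function names and the sample values.

```python
from typing import List, Sequence


def increasing_weights(K: int, c: float, ratio: float, w0: float = 1.0) -> List[float]:
    """Weights w_0, ..., w_{K-1} from the recursion of Corollary 2.4.

    ratio stands for alpha / beta; w0 = 1 is an assumed normalization.
    """
    weights = [w0]
    for k in range(1, K):
        denom = (1.0 + ratio) * (K - k)
        weights.append(weights[-1] * (denom + 1.0 - c) / denom)
    return weights


def weighted_average(iterates: Sequence[float], weights: Sequence[float]) -> float:
    """hat{x}_K = sum_k w_{k-1} x_k / sum_k w_{k-1} for scalar iterates x_1..x_K."""
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, iterates)) / total


# c = 1 gives constant weights, i.e., the uniform average.
uniform = increasing_weights(K=5, c=1.0, ratio=0.25)
# c < 1 puts more weight on later iterates.
increasing = increasing_weights(K=5, c=0.5, ratio=0.25)
x_hat = weighted_average([1.0, 2.0, 3.0, 4.0, 5.0], increasing)
```

Smaller $c$ makes the weights grow faster toward the final iterates, at the price of the $1/c$ factors in the bound above.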

Shuffled SGD.

We extend our analysis to handle random permutations of the task ordering in each epoch, showing order-$\sqrt{T}$ improvements in complexity when randomness is used. We consider two main permutation strategies of particular interest in the literature on shuffled SGD: (i) random reshuffling (RR): randomly generate a new permutation at the beginning of each epoch; (ii) shuffle-once (SO): generate a single random permutation at the beginning and use it in all epochs. These strategies lead to order-$(1/T)$ improvements in bounding the variance term $\eta_{k}^{2}L\sum_{t=1}^{T}\big\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\big\|^{2}$ from Lemma 2.1, and we state the improved convergence results with permutations in the following corollary. The proof is deferred to Appendix A. Our last iterate guarantee nearly matches the best known average iterate convergence rate for shuffled SGD (Mishchenko et al., 2020; Nguyen et al., 2021; Cai et al., 2023a) and the lower bound results on the RR scheme (Cha et al., 2023), with a slightly (order-$\frac{1}{\sqrt{\log K}}$) smaller step size.

Corollary 2.5 (Shuffled SGD (RR/SO)).

Under Assumptions 1, 3 and 4 and for positive parameters $\alpha,\beta>0$ such that $\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}$ , if the step size is fixed and such that $\eta\leq\frac{1}{\sqrt{\beta}TL}$ , the output ${\bm{x}}_{K}$ of Alg. 1 with uniformly random (SO/RR) shuffling satisfies

$$\mathbb{E}[f({\bm{x}}_{K})-f({\bm{x}}_{*})]\leq\mathrm{e}\eta^{2}\sigma_{*}^{2}TL(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}+\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2T\eta K^{\frac{1}{1+\alpha/\beta}}}.$$

With $\alpha=4,\beta=4\log K$, there exists a constant step size $\eta$ such that $\mathbb{E}[f({\bm{x}}_{K})-f({\bm{x}}_{*})]\leq\epsilon$ for $\epsilon>0$ after $\widetilde{\mathcal{O}}\big(\frac{TL\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon}+\frac{\sqrt{TL}\sigma_{*}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon^{3/2}}\big)$ individual gradient evaluations.
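The two permutation strategies differ only in when the ordering is drawn. A minimal sketch, assuming a constant step size, list-based vectors, and hypothetical one-dimensional components $f_{t}(x)=\frac{1}{2}(x-b_{t})^{2}$; the `scheme` and `seed` parameters are illustrative names and not taken from the paper:

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
Grad = Callable[[Vector], Vector]


def shuffled_sgd(grads: Sequence[Grad], x0: Vector, num_epochs: int,
                 eta: float, scheme: str = "RR", seed: int = 0) -> Vector:
    """Alg. 1 with a randomly permuted component order.

    scheme="RR": draw a fresh permutation at the start of every epoch.
    scheme="SO": draw one permutation up front and reuse it in all epochs.
    """
    if scheme not in ("RR", "SO"):
        raise ValueError("scheme must be 'RR' or 'SO'")
    rng = random.Random(seed)
    order = list(range(len(grads)))
    if scheme == "SO":
        rng.shuffle(order)
    x = list(x0)
    for _ in range(num_epochs):
        if scheme == "RR":
            rng.shuffle(order)
        for t in order:
            g = grads[t](x)
            x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical components; the minimizer of the average is mean(targets) = 1.0.
targets = [-1.0, 0.0, 4.0]
grads = [lambda x, b=b: [x[0] - b] for b in targets]
x_rr = shuffled_sgd(grads, [10.0], num_epochs=300, eta=0.01, scheme="RR")
x_so = shuffled_sgd(grads, [10.0], num_epochs=300, eta=0.01, scheme="SO")
```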

3 Incremental Proximal Method

In this section, we leverage the proof techniques developed in the previous section and derive the last iterate convergence bound for the incremental proximal method (IPM) summarized in Alg. 2. While the incremental proximal method is a fundamental method broadly studied in the optimization literature and thus the last iterate convergence bounds are of independent interest, our main motivation for considering this problem comes from CL applications, as discussed in the introduction. Thus we begin this section by briefly explaining this connection and reasoning.

The considered setup is motivated by the general $\ell_{2}$ -regularized CL setting (Heckel, 2022; Li et al., 2023). In particular, each proximal iteration ${\bm{x}}_{k-1,t+1}=\mathrm{prox}_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t})$ can be interpreted as minimizing the $\ell_{2}$ (a.k.a. ridge) regularized loss $f_{t}({\bm{x}})+\frac{1}{2\eta_{k}}\|{\bm{x}}-{\bm{x}}_{k-1,t}\|_{2}^{2}$ corresponding to the current task $t$ , which aligns with the common machine learning practice of using regularization to improve the generalization error and prevent forgetting. When $\eta_{k}\rightarrow\infty$ , the proximal point step reduces to the CL setting where the learner exactly minimizes the loss of the current task, i.e., ${\bm{x}}_{k-1,t+1}=\operatorname*{arg\,min}_{{\bm{x}}\in\mathbb{R}^{d}}f_{t}({\bm{x}})$ , while the regularization effect vanishes and causes larger forgetting on previous tasks. When $\eta_{k}$ is small, the quadratic regularization is larger, so the proximal point step is easier to compute and keeps the new iterate close to the previous one, thus causing less forgetting. However, in this case the plasticity of the model may deteriorate.

We also note that our analysis of IPM is related to previous work on cyclic replays for overparameterized linear models with (S)GD (Evron et al., 2022; Swartworth et al., 2024), as (S)GD in this case acts as an implicit regularizer (Gunasekar et al., 2018; Zhang et al., 2021) (whereas the proximal point update acts as an explicit regularizer). The two lines of work are not directly comparable: Evron et al. (2022); Swartworth et al. (2024) consider exact minimization of the component/task loss functions and bound the forgetting, but only address convex quadratics where the component loss functions have a nonempty intersecting set of minima (which implies $\sigma_{*}=0$ in Assumption 4). On the other hand, our work addresses much more general (not necessarily quadratic) convex functions and does not require $\sigma_{*}=0,$ but instead relies on sufficiently large regularization.

Algorithm 2 Incremental Proximal Method (IPM)

Input: initial point ${\bm{x}}_{0}$ , number of epochs $K$ , step sizes $\{\eta_{k}\}$

for $k=1:K$

    ${\bm{x}}_{k-1,1}={\bm{x}}_{k-1}$

    for $t=1:T$

        ${\bm{x}}_{k-1,t+1}=\mathrm{prox}_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t})=\operatorname*{arg\,min}_{{\bm{x}}\in\mathbb{R}^{d}}\big{\{}\frac{1}{2\eta_{k}}\|{\bm{x}}-{\bm{x}}_{k-1,t}\|^{2}+f_{t}({\bm{x}})\big{\}}$

    ${\bm{x}}_{k}={\bm{x}}_{k-1,T+1}$

return ${\bm{x}}_{K}$
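The loop structure of Alg. 2 can be sketched in a few lines once each task exposes its proximal operator. This is an illustrative sketch rather than the authors' code: the names `incremental_proximal_method` and `quadratic_prox` are hypothetical, and the closed-form prox is only for scalar quadratics $f_{t}(x)=\frac{L}{2}(x-\delta_{t})^{2}$ , chosen here because they admit an exact solution.

```python
def incremental_proximal_method(prox_ops, x0, step_sizes):
    """Cyclic incremental proximal method.

    prox_ops[t](y, eta) must return prox_{eta * f_t}(y).
    step_sizes holds one step size per epoch (eta_1, ..., eta_K).
    """
    x = x0
    for eta in step_sizes:      # epochs k = 1, ..., K
        for prox in prox_ops:   # tasks t = 1, ..., T in a fixed cyclic order
            x = prox(x, eta)
    return x


def quadratic_prox(delta, L):
    """Exact prox of f(x) = (L / 2) * (x - delta)^2 for scalar x.

    Setting the derivative of (x - y)^2 / (2 * eta) + f(x) to zero gives
    x = (y + eta * L * delta) / (1 + eta * L).
    """
    def prox(y, eta):
        return (y + eta * L * delta) / (1.0 + eta * L)
    return prox


# Two illustrative tasks (an assumption) with minimizers 0 and 2, so x_* = 1.
tasks = [quadratic_prox(0.0, 1.0), quadratic_prox(2.0, 1.0)]
x_small_step = incremental_proximal_method(tasks, 0.0, [1e-3] * 5000)
```

Passing the proximal operators as callables keeps the outer loop independent of how each subproblem is solved, so an approximate solver can be swapped in without changing it.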

3.1 Smooth convex setting

We first study the setting where the loss function of each task is convex and smooth, for which we can show faster convergence and which covers many regression tasks studied in prior CL work; see e.g., Evron et al. (2022); Goldfarb et al. (2024). In contrast, prior work either only focused on nonsmooth settings for IPM (Bertsekas, 2011; Li et al., 2019, 2020) or studied different algorithms without component-wise proximal steps for smooth settings (Bertsekas, 2015; Mishchenko et al., 2022).

Under component smoothness, the proximal iteration is equivalent to the backward gradient step:

\displaystyle{\bm{x}}_{k-1,t+1}={\bm{x}}_{k-1,t}-\eta_{k}\nabla f_{t}({\bm{x}}% _{k-1,t+1}).
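For completeness, the backward step is the first-order optimality condition of the proximal subproblem from Alg. 2. Since $f_{t}$ is convex and differentiable, the objective is strongly convex and differentiable, so its unique minimizer is characterized by a vanishing gradient:

```latex
\mathbf{0}
= \nabla_{\bm{x}}\Big(\frac{1}{2\eta_{k}}\|{\bm{x}}-{\bm{x}}_{k-1,t}\|^{2}+f_{t}({\bm{x}})\Big)\Big|_{{\bm{x}}={\bm{x}}_{k-1,t+1}}
= \frac{1}{\eta_{k}}\big({\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\big)+\nabla f_{t}({\bm{x}}_{k-1,t+1}).
```

Rearranging gives the implicit update above, in which the gradient is evaluated at the new point ${\bm{x}}_{k-1,t+1}$ rather than at ${\bm{x}}_{k-1,t}$ .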

Hence, much of the analysis from Section 2 can be adapted here, and the main difference lies in bounding the gap $f({\bm{x}}_{k})-f({\bm{z}})$ within each epoch with a decomposition w.r.t. ${\bm{x}}_{k-1,t+1}$ instead of ${\bm{x}}_{k-1,t}$ , in comparison to Lemma 2.1. Then, choosing ${\bm{z}}={\bm{z}}_{k-1}$ defined by Eq. (2.1) and following the proof of Theorem 2.3, we obtain the bound stated in the following theorem, with its proof in Appendix B.

Theorem 3.1.

Under Assumptions 1, 3 and 4 and for parameters $\alpha,\beta>0$ such that $\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}$ , if the step size is fixed and such that $\eta\leq\frac{1}{\sqrt{\beta}TL}$ , the output ${\bm{x}}_{K}$ of Alg. 2 satisfies

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;

\displaystyle\mathrm{e}\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}+\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2T\eta K^{\frac{1}{1+\alpha/\beta}}}.

A few remarks are in order here. First, the last-iterate convergence rate of IPM matches the rate of IGD on the last iterate. This is also the first convergence result for IPM (with component-wise proximal updates) under convex smooth settings, in comparison with the prior results in convex Lipschitz setups (Bertsekas, 2011; Li et al., 2019, 2020). Second, the extensions to the increasing weighted averaging and RR/SO shuffling discussed in Section 2 also apply to this setting, which we omit for brevity. Lastly, astute readers may find that the step size constraint in Theorem 3.1 stands in contrast to the fact that the proximal point method, which IPM reduces to when $T=1$ , converges for any positive step size. However, in the following theorem we show that such a step size restriction is necessary for IPM with component-wise proximal updates to reach the target optimality gap in this setting. We provide a proof sketch below, while the complete proof is in Appendix B.

Theorem 3.2.

Given $L>0,T\geq 2,$ let ${\mathcal{F}}_{T,L}$ be the class of finite-sum functions $f({\bm{x}})=\frac{1}{T}\sum_{t=1}^{T}f_{t}({\bm{x}})$ whose each component $f_{t}$ is $L$ -smooth and convex. Then:

1. For any fixed step size $\eta>0$ , there exists a function $f\in{\mathcal{F}}_{T,L}$ such that the iterates ${\bm{x}}_{k}$ of Alg. 2 satisfy $f({\bm{x}}_{k})-f({\bm{x}}_{*})\nrightarrow 0$ as $k\rightarrow\infty$ . As a consequence, the forgetting is catastrophic.

2. For any fixed step size $\eta>0$ that only depends on the parameters of the problem class ( $L,T,$ error $\epsilon>0$ ), there exists a function $f\in{\mathcal{F}}_{T,L}$ such that the iterates ${\bm{x}}_{k}$ of Alg. 2 satisfy $\lim_{k\rightarrow\infty}f({\bm{x}}_{k})-f({\bm{x}}_{*})>1$ .

3. Given $\varepsilon>0$ , if the fixed step size $\eta$ satisfies $\eta\geq\min\Big{\{}\frac{16\sqrt{\varepsilon}}{\sqrt{TL}\sigma_{*}},\,\frac{1}{TL}\Big{\}}$ , then there exists a function $f\in{\mathcal{F}}_{T,L}$ such that $f({\bm{x}}_{k})-f({\bm{x}}_{*})>\varepsilon$ for all sufficiently large $k$ .

Proof sketch.

For all parts of the proof, we consider $1$ -dimensional quadratics $f_{t}(x)=\frac{L}{2}(x-\delta_{t})^{2}$ for $\{\delta_{t}\}_{t\in[T]}\subseteq\mathbb{R}$ and $L>0$ . It is immediate that $f(x)=\frac{1}{T}\sum_{t=1}^{T}f_{t}(x)$ is minimized at $x_{*}=\frac{1}{T}\sum_{t=1}^{T}\delta_{t}$ . In this case, Alg. 2 using a fixed step size $\eta$ performs closed-form updates on $f$ , i.e., $x_{k+1}=\gamma^{T}x_{k}+(1-\gamma)\sum_{t=1}^{T}\gamma^{T-t}\delta_{t},$ where $\gamma=\frac{1}{\eta L+1}\in(0,1)$ . Given any initial point $x_{0}$ ,

\displaystyle x_{k}-x_{*}=\gamma^{kT}x_{0}+\sum_{t=1}^{T}\Big{(}\frac{\gamma^{% T-t}(1-\gamma)(1-\gamma^{kT})}{1-\gamma^{T}}-\frac{1}{T}\Big{)}\delta_{t}% \overset{k\rightarrow\infty}{\longrightarrow}\sum_{t=1}^{T}\Big{(}\frac{\gamma% ^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big{)}\delta_{t}.

For $1$ ), since $\gamma\in(0,1)$ , we have $\frac{1-\gamma}{1-\gamma^{T}}=\frac{1}{\sum_{t=0}^{T-1}\gamma^{t}}>\frac{1}{T}$ for $t=T$ . As $k\rightarrow\infty$ , for $\{\delta_{t}\}_{t\in[T]}$ such that $\mathrm{sgn}(\delta_{t})=\mathrm{sgn}\big{(}\frac{\gamma^{T-t}(1-\gamma)}{1-% \gamma^{T}}-\frac{1}{T}\big{)}$ and $\delta_{T}>0$ , we have $f(x_{k})-f(x_{*})=\frac{L}{2}(x_{k}-x_{*})^{2}\geq\frac{L}{2}\Big{(}\frac{1-% \gamma}{1-\gamma^{T}}-\frac{1}{T}\Big{)}^{2}\delta_{T}^{2}>0$ .

For $2$ ), with $\frac{1-\gamma}{1-\gamma^{T}}>\frac{1}{T}$ , observe that as $\gamma\in(0,1)$ , we have $\big{|}\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\big{|}\leq\frac% {T-1}{T}$ for any $t$ . As $k\rightarrow\infty$ , for $|\delta_{t}|<\frac{T\sqrt{2/L}}{(T-1)^{2}}$ ( $t\in[T-1]$ ) and $\delta_{T}\geq\frac{2\sqrt{2/L}}{\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}}$ : $f(x_{k})-f(x_{*})=\frac{L}{2}(x_{k}-x_{*})^{2}>1$ .

For $3$ ), the case $\eta\geq\frac{1}{TL}$ can be handled using a similar argument as in $1$ ), and thus we assume w.l.o.g. that $\gamma\geq 1-\frac{1}{T+1}$ . Let $\gamma=1-\kappa$ , where $\kappa>0$ and $\kappa\leq\frac{1}{T+1}<\frac{1}{T}$ . Further noticing that $(1-\kappa)^{T}\geq 1-\kappa T+\frac{\kappa^{2}T(T-1)}{4}$ for $T\geq 2$ and $\kappa T<1$ , we have

\displaystyle\frac{1-\gamma}{1-\gamma^{T}}=\frac{\kappa}{1-(1-\kappa)^{T}}\geq% \frac{\kappa}{\kappa T-\frac{\kappa^{2}T(T-1)}{4}}\geq\frac{1}{T}\Big{(}1+% \frac{\kappa(T-1)}{4}\Big{)}\geq\frac{1}{T}+\frac{\kappa}{8}.

Hence, if $\eta\geq\frac{16\sqrt{\varepsilon}}{\sqrt{TL}\sigma_{*}}$ , since we can show $\sigma_{*}\leq L\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}/T}$ , we have that $\kappa=\frac{\eta L}{\eta L+1}\geq\frac{16\sqrt{\varepsilon/L}}{16\sqrt{\varepsilon/L}+\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}}>\frac{8\sqrt{3\varepsilon/L}}{\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}}$ by choosing $\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}$ large enough. Then for sufficiently large $k$ and $\{\delta_{t}\}_{t\in[T]}$ such that $\delta_{T}^{2}>\frac{5}{6}\sum_{t=1}^{T}\delta_{t}^{2}$ and $\mathrm{sgn}(\delta_{t})=\mathrm{sgn}\big{(}\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\big{)}$ , we get $f(x_{k})-f(x_{*})\geq\frac{2L}{5}\frac{3\varepsilon\delta_{T}^{2}}{L\sum_{t=1}^{T}\delta_{t}^{2}}>\varepsilon$ . ∎
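The closed-form recursion used in the proof sketch is easy to check numerically. The sketch below is illustrative only: the helper names `ipm_quadratic_epoch` and `ipm_limit_point` and the example values are assumptions. It runs the scalar proximal updates $x\mapsto\gamma x+(1-\gamma)\delta_{t}$ for many epochs and compares the result with the fixed point $\sum_{t=1}^{T}\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}\delta_{t}$ ; subtracting $x_{*}$ from that fixed point gives the limit in the display above.

```python
def ipm_quadratic_epoch(x, deltas, eta, L):
    """One epoch of exact proximal steps on f_t(x) = (L / 2) * (x - delta_t)^2."""
    gamma = 1.0 / (eta * L + 1.0)
    for d in deltas:
        x = gamma * x + (1.0 - gamma) * d  # prox_{eta f_t}(x) in closed form
    return x


def ipm_limit_point(deltas, eta, L):
    """Fixed point of the epoch map: a weighted average of the delta_t
    whose weights favor the most recently visited tasks."""
    gamma = 1.0 / (eta * L + 1.0)
    T = len(deltas)
    scale = (1.0 - gamma) / (1.0 - gamma ** T)
    return sum(gamma ** (T - t) * scale * d for t, d in enumerate(deltas, start=1))


# Illustrative instance (an assumption): three tasks, L = 1, eta = 0.5.
deltas, eta, L = [-1.0, 0.0, 2.0], 0.5, 1.0
x = 10.0
for _ in range(200):
    x = ipm_quadratic_epoch(x, deltas, eta, L)
limit = ipm_limit_point(deltas, eta, L)
x_star = sum(deltas) / len(deltas)
gap = 0.5 * L * (limit - x_star) ** 2  # optimality gap that never vanishes
```

Because the weights $\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}$ are not uniform for any $\gamma\in(0,1)$ , the fixed point differs from the mean $x_{*}$ unless the $\delta_{t}$ happen to cancel, which is what Part 1 exploits.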

Regularization effect.

We now discuss how the regularization parameter (the step size of proximal updates) affects the loss on the current task and (excess) forgetting, based on the above convergence results of IPM. An interesting aspect of our result in Theorem 3.1 is that there is a critical value

\displaystyle\eta_{*}=\min\Big{\{}\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2/3}}{2^{1/3}T\sigma_{*}^{2/3}L^{1/3}K^{1/3}(1+\beta/\alpha)^{1/3}},\,\frac{1}{\sqrt{\beta}TL}\Big{\}}

(3.1)

such that if we decrease $\eta$ below $\eta_{*}$ , both the regularization error on the current task and our upper bound on the forgetting increase. For $\eta>\eta_{*}$ (i.e., when we decrease the regularization error), Theorem 3.2 shows that the forgetting would increase, at least in some regimes of the problem parameters. Moreover, Theorem 3.2 demonstrates that polynomial dependence on other parameters like $1/\epsilon$ and $\sigma_{*}$ is necessary in the choice of $\eta.$ In other words, strong regularization is needed to control the forgetting to a target error in general smooth convex settings. Another direct implication of Theorem 3.2 is that if no assumptions such as similarity are made on the tasks, then any $\ell_{2}$ -regularized model using a finite regularization parameter would suffer catastrophic forgetting, i.e., the forgetting would not approach zero as the number of epochs tends to infinity.
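One way to read the first entry of the minimum in Eq. (3.1) is as the step size that equates the two terms of the bound in Theorem 3.1. This is our interpretation rather than a step spelled out in the text. Writing $D=\|{\bm{x}}_{0}-{\bm{x}}_{*}\|$ (a shorthand introduced here) and noting that the two exponents of $K$ in Theorem 3.1 sum to one:

```latex
\mathrm{e}\,\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}
= \frac{\mathrm{e}\,D^{2}}{2T\eta K^{\frac{1}{1+\alpha/\beta}}}
\;\Longleftrightarrow\;
\eta^{3} = \frac{D^{2}}{2T^{3}\sigma_{*}^{2}L(1+\beta/\alpha)K}
\;\Longleftrightarrow\;
\eta = \frac{D^{2/3}}{2^{1/3}T\sigma_{*}^{2/3}L^{1/3}K^{1/3}(1+\beta/\alpha)^{1/3}}.
```

The second entry, $\frac{1}{\sqrt{\beta}TL}$ , is the step size cap required by Theorem 3.1.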

We further provide illustrative numerical results in Fig. 1 to facilitate our discussion. In particular, we choose $L=2$ , $T\in\{100,150,200\}$ , $\delta_{t}=1/t$ ( $t\in[T-1]$ ) and $\delta_{T}=T$ for the example $f(x)=\frac{L}{2T}\sum_{t=1}^{T}(x-\delta_{t})^{2}$ used in Theorem 3.2. In the forgetting panel of Fig. 1, we plot the optimality gap at the last iterate, i.e., the excess forgetting, against the step sizes after $K=10^{4}$ epochs. It can be observed that the forgetting first decreases as the step size is reduced, but then increases beyond some critical value. Note that the critical values are around $10^{-5}$ , which is nontrivially smaller than $1/L=1/2$ , while a larger $T$ leads to a smaller such critical value. These numerical examples corroborate our results from Theorems 3.1 and 3.2, which jointly suggest that the step size (amount of regularization) can neither be too small nor too large. On the other hand, in the regularization error panel of Fig. 1 we show the final stagnated average regularization error, i.e., $\frac{1}{T}\sum_{t=1}^{T}\big{(}f_{t}({\bm{x}}_{k-1,t+1})-f_{t}({\bm{x}}_{*,t})\big{)}$ over $T$ tasks, where ${\bm{x}}_{*,t}$ is the minimizer of $f_{t}$ . We thus conclude from both plots in Fig. 1 that as the step size increases (equivalently, regularization parameter decreases), the regularization error decreases as well, but the forgetting increases.
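The forgetting curve described above can be approximated with a short script. This is a sketch under stated assumptions, not the authors' experiment code: the function name `last_iterate_gap`, the initial point $x_{0}=0$ , and the step size grid are our choices, since the text does not specify them. It uses the fact that one epoch of Alg. 2 on these quadratics is the affine map $x\mapsto\gamma^{T}x+(1-\gamma)\sum_{t=1}^{T}\gamma^{T-t}\delta_{t}$ .

```python
def last_iterate_gap(eta, T=100, L=2.0, K=10**4, x0=0.0):
    """Optimality gap f(x_K) - f(x_*) of IPM on f_t(x) = (L / 2) * (x - delta_t)^2
    with delta_t = 1 / t for t < T and delta_T = T (x0 = 0 is an assumption)."""
    deltas = [1.0 / t for t in range(1, T)] + [float(T)]
    x_star = sum(deltas) / T
    gamma = 1.0 / (eta * L + 1.0)
    a = gamma ** T  # contraction factor of one full epoch
    b = (1.0 - gamma) * sum(gamma ** (T - t) * d for t, d in enumerate(deltas, start=1))
    x = x0
    for _ in range(K):
        x = a * x + b  # one epoch of T exact proximal steps
    return 0.5 * L * (x - x_star) ** 2


# Sweep over a log-spaced grid of step sizes (the grid is an assumption).
etas = [10.0 ** p for p in range(-9, 0)]
gaps = [last_iterate_gap(eta) for eta in etas]
```

With these choices the gap is large at both ends of the grid and much smaller in between, which is the qualitative shape the text describes; the exact location of the minimum depends on the unspecified $x_{0}$ and grid.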

To conclude this subsection, we finally note that our results fill the gap in theoretically calibrating the trade-off between the forgetting and the regularization error for general convex smooth tasks with cyclic replays, which has only been studied for the setting of two linear regression tasks without cyclic replay (Heckel, 2022; Li et al., 2023). Further, our results on finite regularization complement the asymptotic weak regularization results in the technically disjoint setting of linear classification (Evron et al., 2023), whose analysis requires relating the iterates to sequential max-margin projections.

3.2 Convex Lipschitz setting

We now further relax the smoothness assumption and consider the convex Lipschitz setting, with applications such as linear classification tasks considered in Evron et al. (2023). To carry out the analysis, we leverage the standard fact that the proximal iteration is equivalent to the gradient step w.r.t. the Moreau envelope, i.e.,

\displaystyle{\bm{x}}_{k-1,t+1}={\bm{x}}_{k-1,t}-\eta_{k}\nabla M_{\eta_{k}f_{% t}}({\bm{x}}_{k-1,t}),

while the gradient of the Moreau envelope $\nabla M_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t})$ belongs to the subdifferential $\partial f_{t}({\bm{x}}_{k-1,t+1})$ , and thus its norm is bounded by the Lipschitz constant of $f_{t}$ . We use these observations to bound the gap $f({\bm{x}}_{k})-f({\bm{z}})$ for each epoch and then use the sequence $\{{\bm{z}}_{k}\}$ defined in Eq. (2.1) to deduce the last iterate rate in the following theorem, with the proof deferred to Appendix B. In contrast to Lemma 2.1, we do not have the additional error term $f({\bm{z}})-f({\bm{x}}_{*})$ , so the analysis is much simplified.
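Both facts can be seen concretely on the scalar function $f(x)=|x-a|$ , which is convex and $1$ -Lipschitz but not smooth. The sketch below is illustrative only, and the function names `prox_abs` and `moreau_grad_abs` are hypothetical. Its proximal operator is soft-thresholding around $a$ , and the Moreau envelope gradient is $(y-\mathrm{prox}_{\eta f}(y))/\eta$ .

```python
def prox_abs(y, eta, a=0.0):
    """prox of eta * |x - a| at y: soft-thresholding around a."""
    d = y - a
    if d > eta:
        return y - eta
    if d < -eta:
        return y + eta
    return a


def moreau_grad_abs(y, eta, a=0.0):
    """Gradient of the Moreau envelope of |x - a| with parameter eta at y."""
    return (y - prox_abs(y, eta, a)) / eta


# For each y, the prox step equals a gradient step on the envelope, and the
# envelope gradient lies in [-1, 1], the subdifferential range of |x - a|.
checks = []
for y in (-3.0, -0.2, 0.0, 0.2, 3.0):
    g = moreau_grad_abs(y, 0.5)
    checks.append((prox_abs(y, 0.5), y - 0.5 * g, g))
```

When the proximal point lands exactly on the kink at $a$ , the envelope gradient takes a value strictly inside $[-1,1]$ , which is still a valid subgradient there.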

Theorem 3.3.

Under Assumptions 1 and 2, the output ${\bm{x}}_{K}$ of Alg. 2 satisfies

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;

\displaystyle\frac{1}{2T\sum_{k=1}^{K}\eta_{k}}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{% 2}+\frac{G^{2}T}{2}\sum_{k=1}^{K}\frac{\eta_{k}^{2}}{\sum_{j=k}^{K}\eta_{j}}.

Moreover, given $\epsilon>0$ , there exists a constant step size $\eta$ such that $f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\epsilon$ after $\widetilde{\mathcal{O}}\big{(}\frac{G^{2}T\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{% \epsilon^{2}}\big{)}$ individual proximal oracle queries.

The last iterate rate we obtained in Theorem 3.3 matches the prior best known results for the average iterate guarantees for incremental proximal methods (Bertsekas, 2011; Li et al., 2019) in nonsmooth settings, up to a logarithmic factor. Further, we take the $\Theta(\frac{1}{\sqrt{K}})$ step size only for analytical simplicity, while the diminishing step sizes $\eta_{k}=\Theta(\frac{1}{\sqrt{k}})$ will yield the same rate via a similar analysis, which we omit for brevity.
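To make the constant step size case concrete, the bound of Theorem 3.3 can be specialized as follows. The particular choice of $\eta$ below is ours, for illustration, and the proof in Appendix B may use a different constant. Here $D=\|{\bm{x}}_{0}-{\bm{x}}_{*}\|$ and $H_{K}=\sum_{j=1}^{K}\frac{1}{j}\leq 1+\ln K$ are shorthands introduced for this calculation.

```latex
\eta_{k}\equiv\eta:\quad
f({\bm{x}}_{K})-f({\bm{x}}_{*})
\leq \frac{D^{2}}{2T\eta K}+\frac{G^{2}T\eta}{2}\sum_{k=1}^{K}\frac{1}{K-k+1}
= \frac{D^{2}}{2T\eta K}+\frac{G^{2}T\eta}{2}H_{K},
\qquad
\eta=\frac{D}{GT\sqrt{KH_{K}}}
\;\Longrightarrow\;
f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq GD\sqrt{\frac{H_{K}}{K}}.
```

The right-hand side is at most $\epsilon$ once $K/H_{K}\geq G^{2}D^{2}/\epsilon^{2}$ , which corresponds to $K=\widetilde{\mathcal{O}}(G^{2}D^{2}/\epsilon^{2})$ epochs and hence $TK$ proximal queries, in agreement with the stated complexity.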

3.3 Inexact proximal point evaluations

In the last two subsections, we derived our results assuming that the proximal point operator can be evaluated exactly. However, computing the proximal point corresponds to solving a strongly convex problem, which is generally possible to do only up to finite accuracy. Thus, we now consider the case where ${\bm{x}}_{k-1,t+1}$ is an approximation of $\mathrm{prox}_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t})$ obtained by solving the corresponding strongly convex problem to an optimality gap of $\varepsilon_{k-1,t}^{2}/(2\eta_{k})$ for $\varepsilon_{k-1,t}>0$ . Equivalently, using strong convexity and denoting ${\bm{g}}_{k-1,t}:=\frac{1}{\eta_{k}}({\bm{x}}_{k-1,t}-{\bm{x}}_{k-1,t+1})$ , we have

\displaystyle\|{\bm{x}}_{k-1,t+1}-\mathrm{prox}_{\eta_{k}f_{t}}({\bm{x}}_{k-1,% t})\|\leq\varepsilon_{k-1,t},\quad\|{\bm{g}}_{k-1,t}-\nabla M_{\eta_{k}f_{t}}(% {\bm{x}}_{k-1,t})\|\leq\varepsilon_{k-1,t}/\eta_{k}.

(3.2)
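One simple way to produce a point satisfying Eq. (3.2) for a smooth component is to run gradient descent on the proximal subproblem and stop on a gradient-norm test. This is a sketch under our own assumptions, not a procedure from the paper: the function name `inexact_prox`, the scalar setting, and the stopping rule are all illustrative choices. The rule relies on the subproblem $\phi(x)=f(x)+\frac{1}{2\eta}(x-y)^{2}$ being $\frac{1}{\eta}$ -strongly convex, so $|\phi'(x)|\leq\varepsilon/\eta$ implies both $\phi(x)-\min\phi\leq\varepsilon^{2}/(2\eta)$ and $|x-\mathrm{prox}_{\eta f}(y)|\leq\varepsilon$ .

```python
import math


def inexact_prox(grad_f, y, eta, eps, smoothness, max_iters=100000):
    """Approximate prox_{eta f}(y) for a scalar convex f with L-Lipschitz gradient.

    Runs gradient descent on phi(x) = f(x) + (x - y)^2 / (2 * eta) and stops
    once |phi'(x)| <= eps / eta, which guarantees |x - prox| <= eps.
    """
    x = y
    step = 1.0 / (smoothness + 1.0 / eta)  # phi is (L + 1/eta)-smooth
    for _ in range(max_iters):
        g = grad_f(x) + (x - y) / eta
        if abs(g) <= eps / eta:
            break
        x -= step * g
    return x


# Illustrative use on the logistic loss f(x) = log(1 + exp(x)), whose
# derivative is the sigmoid and whose smoothness constant is 1/4.
def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


x_approx = inexact_prox(sigmoid, y=2.0, eta=1.0, eps=1e-6, smoothness=0.25)
```

The returned point can be passed to the outer loop of Alg. 2 in place of the exact proximal point, with the tolerance `eps` playing the role of $\varepsilon_{k-1,t}$ .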

We note that direct extensions of the previous analysis would not work, because inexact evaluations give rise to additional positive terms $\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}$ that cause issues for telescoping. However, we observe that the coefficients of these terms admit additional slack, i.e., $(\lambda_{k}^{2}w_{k}-w_{k-1})\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}$ , while Lemma 2.2 only requires $\lambda_{k}w_{k}\leq w_{k-1}$ . Thus, as long as the approximation error at each iteration is small, we can still maintain the convergence rate of incremental proximal methods with exact proximal point evaluations. With these insights, we extend our convergence results to admit inexact proximal point evaluations in the following corollaries, with proofs provided in Appendix B.

Corollary 3.4 (Convex Smooth).

Under Assumptions 1, 3 and 4 and for parameters $\alpha,\beta>0$ such that $\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}$ , if the step size is fixed and satisfies $\eta_{k}\equiv\eta\leq\frac{1}{\sqrt{\beta}TL}$ , the output ${\bm{x}}_{K}$ of Alg. 2 with inexact proximal point evaluations as in Eq. (3.2) with $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{\sqrt{\eta}}{1+(1+\alpha/\beta)(K-k+1)}$ satisfies

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;

\displaystyle\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2\eta TK^{% \frac{1}{1+\alpha/\beta}}}+2\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{% \frac{\alpha/\beta}{1+\alpha/\beta}}+\frac{\mathrm{e}}{2\eta T}\sum_{k=0}^{K-1% }\sum_{t=1}^{T}\frac{2T\varepsilon_{k,t}^{2}+\sqrt{\eta}\varepsilon_{k,t}}{(K-% k)^{\frac{1}{1+\alpha/\beta}}}.

Given $\epsilon>0$ , if $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\sqrt{\eta}\min\{\varepsilon,\frac{1}{3(K% -k+1)}\}$ , there exists $\eta$ such that $f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\epsilon$ after $\widetilde{\mathcal{O}}\big{(}\frac{TL\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{% \epsilon}+\frac{TL^{1/2}\sigma_{*}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon^% {3/2}}\big{)}$ individual inexact proximal point evaluations.

Corollary 3.5 (Convex Lipschitz).

Under Assumptions 1 and 2, the output ${\bm{x}}_{K}$ of Alg. 2 with inexact proximal point evaluations as in Eq. (3.2) with $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{\eta_{k}\eta_{k-1}GT}{\sum_{j=k}^{K% }\eta_{j}}$ satisfies

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\frac{\|{\bm{x}}_{0}-{\bm{x}}% _{*}\|^{2}}{2T\sum_{k=1}^{K}\eta_{k}}+\frac{G^{2}T}{2}\sum_{k=1}^{K}\frac{\eta% _{k}^{2}}{\sum_{j=k}^{K}\eta_{j}}+\sum_{k=1}^{K}\sum_{t=1}^{T}\Big{(}\frac{% \varepsilon_{k-1,t}^{2}}{2T\sum_{j=k}^{K}\eta_{j}}+\frac{3G\varepsilon_{k-1,t}% \eta_{k}}{\sum_{j=k}^{K}\eta_{j}}\Big{)}.

Given $\epsilon>0$ , if $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{2\eta GT}{K-k+1}$ , there exists a constant step size $\eta$ such that $f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\epsilon$ after $\widetilde{\mathcal{O}}\big{(}\frac{G^{2}T\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{% \epsilon^{2}}\big{)}$ individual inexact proximal point evaluations.

4 Conclusion

This work provides the first oracle complexity guarantees for the last iterate of standard incremental (gradient and proximal) methods, motivated by applications in continual learning. The obtained complexity bounds nearly match the best known oracle complexity bounds that in the same settings were previously known only for the (uniformly) average iterate. Our results for the incremental proximal method further characterize the effect of regularization and its limitations in controlling catastrophic forgetting in continual learning applications. It would be interesting to investigate in future work whether other types of regularization involving task similarity can effectively control forgetting.

Acknowledgments

This research was supported in part by the U.S. Office of Naval Research under contract number N00014-22-1-2348.

References

Ahn et al. (2020) Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. SGD with shuffling: optimal rates without component convexity and large epoch requirements. In Proc. NeurIPS’20, 2020.
Balcan et al. (2015) Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efficient representations for lifelong learning and autoencoding. In Proc. COLT’15, 2015.
Bertsekas (2011) Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, 2011.
Bertsekas (2015) Dimitri P Bertsekas. Incremental aggregated proximal and augmented lagrangian algorithms. arXiv preprint arXiv:1509.09257, 2015.
Bertsekas et al. (2011) Dimitri P Bertsekas et al. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.
Bottou (2009) Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proc. Symposium on Learning and Data Science, Paris’09, 2009.
Cai et al. (2023a) Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Empirical risk minimization with shuffled SGD: A primal-dual perspective and improved bounds. arXiv preprint arXiv:2306.12498, 2023a.
Cai et al. (2023b) Xufeng Cai, Chaobing Song, Stephen J Wright, and Jelena Diakonikolas. Cyclic block coordinate descent with variance reduction for composite nonconvex optimization. In Proc. ICML’23, 2023b.
Cao et al. (2022) Xinyuan Cao, Weiyang Liu, and Santosh Vempala. Provable lifelong learning of representations. In Proc. AISTATS’22, 2022.
Cha et al. (2023) Jaeyoung Cha, Jaewook Lee, and Chulhee Yun. Tighter lower bounds for shuffling SGD: Random permutations and beyond. arXiv preprint arXiv:2303.07160, 2023.
Chen et al. (2022) Xi Chen, Christos Papadimitriou, and Binghui Peng. Memory bounds for continual learning. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), 2022.
De Lange et al. (2021) Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021.
Doan et al. (2021) Thang Doan, Mehdi Abbana Bennani, Bogdan Mazoure, Guillaume Rabusseau, and Pierre Alquier. A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. In Proc. AISTATS’21, 2021.
Evron et al. (2022) Itay Evron, Edward Moroshko, Rachel Ward, Nathan Srebro, and Daniel Soudry. How catastrophic can catastrophic forgetting be in linear regression? In Proc. COLT’22, 2022.
Evron et al. (2023) Itay Evron, Edward Moroshko, Gon Buzaglo, Maroun Khriesh, Badea Marjieh, Nathan Srebro, and Daniel Soudry. Continual learning in linear classification on separable data. In Proc. ICML’23, 2023.
Goldfarb and Hand (2023) Daniel Goldfarb and Paul Hand. Analysis of catastrophic forgetting for random orthogonal transformation tasks in the overparameterized regime. In Proc. AISTATS’23, 2023.
Goldfarb et al. (2024) Daniel Goldfarb, Itay Evron, Nir Weinberger, Daniel Soudry, and Paul Hand. The joint effect of task similarity and overparameterization on catastrophic forgetting - an analytical model. In Proc. ICLR’24, 2024.
Goodfellow et al. (2013) Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
Gunasekar et al. (2018) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In Proc. ICML’18, 2018.
Gürbüzbalaban et al. (2021) Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186:49–84, 2021.
Haochen and Sra (2019) Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In Proc. ICML’19, 2019.
Heckel (2022) Reinhard Heckel. Provable continual learning via sketched jacobian approximations. In Proc. AISTATS’22, 2022.
Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
Li et al. (2023) Haoran Li, Jingfeng Wu, and Vladimir Braverman. Fixed design analysis of regularization-based continual learning. arXiv preprint arXiv:2303.10263, 2023.
Li et al. (2020) Jiajin Li, Caihua Chen, and Anthony Man-Cho So. Fast epigraphical projection-based incremental algorithms for wasserstein distributionally robust support vector machine. In Proc. NeurIPS’20, 2020.
Li et al. (2019) Xiao Li, Zhihui Zhu, Anthony Man-Cho So, and Jason D Lee. Incremental methods for weakly convex optimization. arXiv preprint arXiv:1907.11687, 2019.
Lin et al. (2023a) Cheuk Yin Lin, Chaobing Song, and Jelena Diakonikolas. Accelerated cyclic coordinate dual averaging with extrapolation for composite convex optimization. In Proc. ICML’23, 2023a.
Lin et al. (2023b) Sen Lin, Peizhong Ju, Yingbin Liang, and Ness Shroff. Theory on forgetting and generalization of continual learning. arXiv preprint arXiv:2302.05836, 2023b.
Liu and Zhou (2023) Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. arXiv preprint arXiv:2312.08531, 2023.
Liu and Zhou (2024) Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. arXiv preprint arXiv:2403.07723, 2024.
Lohr (2021) Sharon L Lohr. Sampling: design and analysis. CRC press, 2021.
Lopez-Paz and Ranzato (2017) David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Proc. NIPS’17, 2017.
McCloskey and Cohen (1989) Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
Mishchenko et al. (2020) Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. In Proc. NeurIPS’20, 2020.
Mishchenko et al. (2022) Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and federated random reshuffling. In Proc. ICML’22, 2022.
Nagaraj et al. (2019) Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In Proc. ICML’19, 2019.
Nguyen et al. (2021) Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten Van Dijk. A unified convergence analysis for shuffling-type gradient methods. The Journal of Machine Learning Research, 22(1):9397–9440, 2021.
Parisi et al. (2019) German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
Peng and Risteski (2022) Binghui Peng and Andrej Risteski. Continual learning: a feature extraction formalization, an efficient algorithm, and fundamental obstructions. In Proc. NeurIPS’22, 2022.
Peng et al. (2023) Liangzu Peng, Paris Giampouras, and René Vidal. The ideal continual learner: An agent that never forgets. In Proc. ICML’23, 2023.
Rajput et al. (2020) Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In Proc. ICML’20, 2020.
Robins (1995) Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Proc. NeurIPS’19, 2019.
Safran and Shamir (2020) Itay Safran and Ohad Shamir. How good is SGD with random shuffling? In Proc. COLT’20, 2020.
Shamir (2016) Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Proc. NeurIPS’16, 2016.
Shamir and Zhang (2013) Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proc. ICML’13, 2013.
Song and Diakonikolas (2023) Chaobing Song and Jelena Diakonikolas. Cyclic coordinate dual averaging with extrapolation. SIAM Journal on Optimization, 33(4):2935–2961, 2023. doi: 10.1137/22M1470104.
Swartworth et al. (2024) William Swartworth, Deanna Needell, Rachel Ward, Mark Kong, and Halyun Jeong. Nearly optimal bounds for cyclic forgetting. In Proc. NeurIPS’24, 2024.
Tran et al. (2021) Trang H Tran, Lam M Nguyen, and Quoc Tran-Dinh. SMG: A shuffling gradient-based method with momentum. In Proc. ICML’21, 2021.
Tran et al. (2022) Trang H Tran, Katya Scheinberg, and Lam M Nguyen. Nesterov accelerated shuffling gradient method for convex optimization. In Proc. ICML’22, 2022.
Yu and Li (2023) Hengxu Yu and Xiao Li. High probability guarantees for random reshuffling. arXiv preprint arXiv:2311.11841, 2023.
Yun et al. (2022) Chulhee Yun, Shashank Rajput, and Suvrit Sra. Minibatch vs local SGD with shuffling: Tight convergence bounds and beyond. In Proc. ICLR’22, 2022.
Zamani and Glineur (2023) Moslem Zamani and François Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023.
Zhang et al. (2021) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

Appendix A Omitted Proofs From Section 2

See Lemma 2.1.

Proof.

Since each $f_{t}$ is convex and $L$ -smooth, we have for $t\in[T]$ (see Eq. (1.3)):

\displaystyle f_{t}({\bm{x}}_{k})-f_{t}({\bm{x}}_{k-1,t})\leq\;\left\langle\nabla f_{t}({\bm{x}}_{k}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t}\right\rangle-\frac{1}{2L}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t})\|^{2},

(A.1)

\displaystyle f_{t}({\bm{x}}_{k-1,t})-f_{t}({\bm{z}})\leq\;\left\langle\nabla f_{t}({\bm{x}}_{k-1,t}),{\bm{x}}_{k-1,t}-{\bm{z}}\right\rangle-\frac{1}{2L}\|\nabla f_{t}({\bm{x}}_{k-1,t})-\nabla f_{t}({\bm{z}})\|^{2}.

(A.2)

On the other hand, letting $\Phi_{k-1,t}({\bm{x}}):=\left\langle\nabla f_{t}({\bm{x}}_{k-1,t}),{\bm{x}}\right\rangle+\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t}-{\bm{x}}\|^{2}$ , we have $\nabla\Phi_{k-1,t}({\bm{x}}_{k-1,t+1})=\mathbf{0}$ by the update ${\bm{x}}_{k-1,t+1}={\bm{x}}_{k-1,t}-\eta_{k}\nabla f_{t}({\bm{x}}_{k-1,t})$ in Alg. 1. Observe that $\Phi_{k-1,t}$ is $\frac{1}{\eta_{k}}$ -strongly convex, and thus we also have

\displaystyle\Phi_{k-1,t}({\bm{z}})\geq\Phi_{k-1,t}({\bm{x}}_{k-1,t+1})+\frac{% 1}{2\eta_{k}}\|{\bm{z}}-{\bm{x}}_{k-1,t+1}\|^{2}.

(A.3)

Summing Eq. (A.1) and (A.2) and using Eq. (A.3), we conclude that for $t\in[T]$ ,

\displaystyle f_{t}({\bm{x}}_{k})-f_{t}({\bm{z}})
\displaystyle\leq\;\left\langle\nabla f_{t}({\bm{x}}_{k}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t}\right\rangle+\left\langle\nabla f_{t}({\bm{x}}_{k-1,t}),{\bm{x}}_{k-1,t}-{\bm{x}}_{k-1,t+1}\right\rangle
\displaystyle-\frac{1}{2L}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t})\|^{2}-\frac{1}{2L}\|\nabla f_{t}({\bm{x}}_{k-1,t})-\nabla f_{t}({\bm{z}})\|^{2}
\displaystyle-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\|^{2}+\frac{1}{2\eta_{k}}\big{(}\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|^{2}\big{)}.

Decomposing $\nabla f_{t}({\bm{x}}_{k})=\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k% -1,t})+\nabla f_{t}({\bm{x}}_{k-1,t})$ and summing over $t\in[T]$ where we recall ${\bm{x}}_{k-1}={\bm{x}}_{k-1,1}$ and ${\bm{x}}_{k}={\bm{x}}_{k-1,T+1}$ , we have

\displaystyle T\big{(}f({\bm{x}}_{k})-f({\bm{z}})\big{)}\leq\;\underbrace{\sum_{t=1}^{T}\left\langle\nabla f_{t}({\bm{x}}_{k-1,t}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\right\rangle-\frac{1}{2\eta_{k}}\sum_{t=1}^{T}\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\|^{2}}_{{\mathcal{T}}_{1}}
\displaystyle+\underbrace{\sum_{t=1}^{T}\left\langle\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t}\right\rangle}_{{\mathcal{T}}_{2}}
\displaystyle-\frac{1}{2L}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t})\|^{2}-\frac{1}{2L}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k-1,t})-\nabla f_{t}({\bm{z}})\|^{2}
\displaystyle+\frac{1}{2\eta_{k}}\big{(}\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big{)}.

For the term ${\mathcal{T}}_{1}$ , we recall that by the IGD update, $\nabla f_{t}({\bm{x}}_{k-1,t})=-\frac{1}{\eta_{k}}({\bm{x}}_{k-1,t+1}-{\bm{x}}% _{k-1,t})$ and ${\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}=\sum_{s=t+1}^{T}({\bm{x}}_{k-1,s+1}-{\bm{x}}_{% k-1,s})$ , and thus we have

$$\begin{aligned}
{\mathcal{T}}_{1}=\;&-\frac{1}{\eta_{k}}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T}\left\langle{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t},{\bm{x}}_{k-1,s+1}-{\bm{x}}_{k-1,s}\right\rangle-\frac{1}{2\eta_{k}}\sum_{t=1}^{T}\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\|^{2}\\
=\;&-\frac{1}{2\eta_{k}}\Big\|\sum_{t=1}^{T}({\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t})\Big\|^{2}=-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k}-{\bm{x}}_{k-1}\|^{2}\leq 0.
\end{aligned}$$
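The collapse of ${\mathcal{T}}_{1}$ into a single squared norm is just the expansion of $\|\sum_{t}{\bm{d}}_{t}\|^{2}$ with ${\bm{d}}_{t}={\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}$. A minimal numerical sketch of that identity follows; the helper name `t1_both_forms` and the random test data are illustrative assumptions and not part of the paper.

```python
import random

def t1_both_forms(steps, eta):
    """Evaluate T_1 as the double sum and as -||sum of steps||^2 / (2 eta)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    T = len(steps)
    # Cross terms <d_t, d_s> over pairs t < s, where d_t = x_{k-1,t+1} - x_{k-1,t}.
    cross = sum(dot(steps[t], steps[s]) for t in range(T) for s in range(t + 1, T))
    squares = sum(dot(d, d) for d in steps)
    double_sum_form = -cross / eta - squares / (2 * eta)
    total = [sum(coords) for coords in zip(*steps)]  # plays the role of x_k - x_{k-1}
    closed_form = -dot(total, total) / (2 * eta)
    return double_sum_form, closed_form

random.seed(0)
# Arbitrary within-cycle displacements (assumed data): T = 6 steps in dimension 3.
steps = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(6)]
double_sum_form, closed_form = t1_both_forms(steps, eta=0.1)
```

Both returned values should agree up to floating-point error, and the closed form is nonpositive by construction.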

For the term ${\mathcal{T}}_{2}$ , noticing that ${\bm{x}}_{k}-{\bm{x}}_{k-1,t}=-\eta_{k}\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{k-% 1,s})$ and decomposing $\nabla f_{s}({\bm{x}}_{k-1,s})=(\nabla f_{s}({\bm{x}}_{k-1,s})-\nabla f_{s}({% \bm{z}}))+(\nabla f_{s}({\bm{z}})-\nabla f_{s}({\bm{x}}_{*}))+\nabla f_{s}({% \bm{x}}_{*})$ , we use Young’s inequality with parameters $\alpha>0$ and $\beta>0$ to obtain

$$\begin{aligned}
{\mathcal{T}}_{2}=\;&\sum_{t=1}^{T}\Big\langle\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t}),-\eta_{k}\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{k-1,s})\Big\rangle\\
\leq\;&\frac{1}{2L}\Big(\frac{1}{2}+\frac{1}{\alpha}+\frac{1}{\beta}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t})\|^{2}\\
&+\frac{\alpha\eta_{k}^{2}L}{2}\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\big(\nabla f_{s}({\bm{z}})-\nabla f_{s}({\bm{x}}_{*})\big)\Big\|^{2}+\eta_{k}^{2}L\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}\\
&+\frac{\beta\eta_{k}^{2}L}{2}\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\big(\nabla f_{s}({\bm{x}}_{k-1,s})-\nabla f_{s}({\bm{z}})\big)\Big\|^{2}.
\end{aligned}$$

Further using that $\|\sum_{i=1}^{n}{\bm{x}}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|{\bm{x}}_{i}\|^{2}$ and combining the above bounds on ${\mathcal{T}}_{1}$ and ${\mathcal{T}}_{2}$ , we obtain

$$\begin{aligned}
T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\;&\frac{1}{2L}\Big(\frac{1}{\alpha}+\frac{1}{\beta}-\frac{1}{2}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t})\|^{2}\\
&+\Big(\frac{\beta\eta_{k}^{2}T^{2}L}{2}-\frac{1}{2L}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k-1,t})-\nabla f_{t}({\bm{z}})\|^{2}\\
&+\frac{\alpha\eta_{k}^{2}T^{2}L}{2}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{z}})-\nabla f_{t}({\bm{x}}_{*})\|^{2}+\eta_{k}^{2}L\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}\\
&+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big).
\end{aligned}$$

To make the first two terms on the right-hand side both nonpositive, we choose

\displaystyle\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2},\quad\eta_{k}\leq% \frac{1}{\sqrt{\beta}TL}.

We further bound the term $\sum_{t=1}^{T}\|\nabla f_{t}({\bm{z}})-\nabla f_{t}({\bm{x}}_{*})\|^{2}$ by the smoothness and convexity of each $f_{t}$ as follows:

$$\begin{aligned}
\sum_{t=1}^{T}\|\nabla f_{t}({\bm{z}})-\nabla f_{t}({\bm{x}}_{*})\|^{2}\leq\;&2L\sum_{t=1}^{T}\big(f_{t}({\bm{z}})-f_{t}({\bm{x}}_{*})-\left\langle\nabla f_{t}({\bm{x}}_{*}),{\bm{z}}-{\bm{x}}_{*}\right\rangle\big)\\
=\;&2TL\big(f({\bm{z}})-f({\bm{x}}_{*})\big),
\end{aligned}$$

where in the last equation we used $\nabla f({\bm{x}}_{*})=\mathbf{0}$ . Hence, we finally obtain

$$\begin{aligned}
T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\;&\eta_{k}^{2}L\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}+\alpha T^{3}\eta_{k}^{2}L^{2}\big(f({\bm{z}})-f({\bm{x}}_{*})\big)\\
&+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big)\\
\leq\;&\eta_{k}^{2}L\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}+\frac{\alpha}{\beta}T\big(f({\bm{z}})-f({\bm{x}}_{*})\big)\\
&+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big),
\end{aligned}$$

where we used $\eta_{k}\leq\frac{1}{\sqrt{\beta}TL}$ in the last inequality, thus completing the proof. ∎
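As a sanity check of the inequality just proved, the following sketch runs IGD cycles on one-dimensional quadratics $f_{t}(x)=\frac{L}{2}(x-\delta_{t})^{2}$ and evaluates both sides for randomly drawn comparison points $z$. The test instance, the choice $\alpha=\beta=4$, and the helper names `igd_cycle` and `lemma_gap` are assumptions made for illustration only.

```python
import math
import random

def igd_cycle(x, deltas, L, eta):
    """One IGD cycle on f_t(x) = (L/2)(x - delta_t)^2, visiting components in order."""
    for d in deltas:
        x -= eta * L * (x - d)
    return x

def lemma_gap(x_prev, x_next, z, deltas, L, eta, alpha, beta):
    """Right-hand side minus left-hand side of the inequality (expected to be >= 0)."""
    T = len(deltas)
    x_star = sum(deltas) / T
    f = lambda x: sum(0.5 * L * (x - d) ** 2 for d in deltas) / T
    grads_star = [L * (x_star - d) for d in deltas]
    # sum_t || sum_{s >= t} grad f_s(x_*) ||^2
    tail = sum(sum(grads_star[t:]) ** 2 for t in range(T))
    lhs = T * (f(x_next) - f(z))
    rhs = (eta ** 2 * L * tail + (alpha / beta) * T * (f(z) - f(x_star))
           + ((x_prev - z) ** 2 - (x_next - z) ** 2) / (2 * eta))
    return rhs - lhs

random.seed(1)
L, alpha, beta = 2.0, 4.0, 4.0                 # satisfies 1/alpha + 1/beta = 1/2
deltas = [random.uniform(-3, 3) for _ in range(5)]
eta = 1.0 / (math.sqrt(beta) * len(deltas) * L)  # largest step size allowed by the lemma
x, worst_gap = 10.0, float("inf")
for _ in range(20):
    x_next = igd_cycle(x, deltas, L, eta)
    for _ in range(50):
        z = random.uniform(-10, 10)
        worst_gap = min(worst_gap, lemma_gap(x, x_next, z, deltas, L, eta, alpha, beta))
    x = x_next
```

A nonnegative `worst_gap` (up to rounding) is consistent with the lemma on this instance; it is of course not a substitute for the proof.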

See Lemma 2.2

Proof.

1.

Using Eq. (2.2) and the convexity of $f$, we have

$$\begin{aligned}
&f({\bm{z}}_{k-1})-f({\bm{x}}_{*})\\
\leq\;&(1-\lambda_{k-1})\big(f({\bm{x}}_{k-1})-f({\bm{x}}_{*})\big)+\sum_{j=0}^{k-2}\Big(\prod_{i=j+1}^{k-1}\lambda_{i}\Big)(1-\lambda_{j})\big(f({\bm{x}}_{j})-f({\bm{x}}_{*})\big).
\end{aligned}$$

It remains to multiply by $w_{k-1}$ on both sides and notice that by the lemma assumption,

\displaystyle w_{k-1}\big{(}\prod_{i=j+1}^{k-1}\lambda_{i}\big{)}(1-\lambda_{j% })\leq w_{k-2}\big{(}\prod_{i=j+1}^{k-2}\lambda_{i}\big{)}(1-\lambda_{j})\leq% \cdots\leq w_{j}(1-\lambda_{j}).

2.

This follows from the first part of the lemma, by decomposing $f({\bm{x}}_{k})-f({\bm{z}}_{k-1})=f({\bm{x}}_{k})-f({\bm{x}}_{*})-\big{(}f({% \bm{z}}_{k-1})-f({\bm{x}}_{*})\big{)}$ .

∎

See Theorem 2.3

Proof.

Plugging ${\bm{z}}_{k-1}$ defined by Eq. (2.1) into Lemma 2.1 and using the inequality $\|\sum_{i=1}^{n}{\bm{x}}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|{\bm{x}}_{i}\|^{2}$ to bound the term $\sum_{t=1}^{T}\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\|^{2}$ , we obtain

$$\begin{aligned}
T\big(f({\bm{x}}_{k})-f({\bm{z}}_{k-1})\big)\leq\;&T^{3}\eta_{k}^{2}\sigma_{*}^{2}L+\frac{\alpha}{\beta}T\big(f({\bm{z}}_{k-1})-f({\bm{x}}_{*})\big)\\
&+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}_{k-1}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}\big),
\end{aligned}$$

where $\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}$. Multiplying both sides by $\eta_{k}w_{k-1}$, with $w_{k}$ chosen such that $\lambda_{k}w_{k}\leq w_{k-1}$, and noticing that $\|{\bm{x}}_{k-1}-{\bm{z}}_{k-1}\|^{2}\leq\lambda_{k-1}\|{\bm{x}}_{k-1}-{\bm{z}}_{k-2}\|^{2}$ by Eq. (2.1), we have

$$\begin{aligned}
T\eta_{k}w_{k-1}\big(f({\bm{x}}_{k})-f({\bm{z}}_{k-1})\big)\leq\;&T^{3}\eta_{k}^{3}\sigma_{*}^{2}Lw_{k-1}+\frac{\alpha}{\beta}T\eta_{k}w_{k-1}\big(f({\bm{z}}_{k-1})-f({\bm{x}}_{*})\big)\\
&+\frac{1}{2}\big(w_{k-2}\|{\bm{x}}_{k-1}-{\bm{z}}_{k-2}\|^{2}-w_{k-1}\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}\big).
\end{aligned}$$

We then sum over $k\in[K]$ and use Lemma 2.2 to obtain

$$\begin{aligned}
&T\sum_{k=1}^{K}\eta_{k}\Big[w_{k-1}\big(f({\bm{x}}_{k})-f({\bm{x}}_{*})\big)-\sum_{j=0}^{k-1}w_{j}(1-\lambda_{j})\big(f({\bm{x}}_{j})-f({\bm{x}}_{*})\big)\Big]\\
\leq\;&T^{3}\sigma_{*}^{2}L\sum_{k=1}^{K}\eta_{k}^{3}w_{k-1}+\frac{w_{-1}}{2}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}+\frac{\alpha}{\beta}T\sum_{k=1}^{K}\eta_{k}\sum_{j=0}^{k-1}w_{j}(1-\lambda_{j})\big(f({\bm{x}}_{j})-f({\bm{x}}_{*})\big),
\end{aligned}\tag{A.4}$$

where we also use ${\bm{z}}_{-1}={\bm{x}}_{*}$ . Unrolling the terms w.r.t. $f({\bm{x}}_{k})-f({\bm{x}}_{*})$ $(k\in[K])$ we get

$$\begin{aligned}
&\sum_{k=1}^{K}\eta_{k}\Big[w_{k-1}\big(f({\bm{x}}_{k})-f({\bm{x}}_{*})\big)-\sum_{j=0}^{k-1}w_{j}(1-\lambda_{j})\big(f({\bm{x}}_{j})-f({\bm{x}}_{*})\big)\Big]\\
=\;&\eta_{K}w_{K-1}\big(f({\bm{x}}_{K})-f({\bm{x}}_{*})\big)-w_{0}(1-\lambda_{0})\big(f({\bm{x}}_{0})-f({\bm{x}}_{*})\big)\sum_{k=1}^{K}\eta_{k}\\
&+\sum_{k=1}^{K-1}\Big(\eta_{k}w_{k-1}-w_{k}(1-\lambda_{k})\sum_{j=k+1}^{K}\eta_{j}\Big)\big(f({\bm{x}}_{k})-f({\bm{x}}_{*})\big)
\end{aligned}\tag{A.5}$$

and

\displaystyle\sum_{k=1}^{K}\eta_{k}\sum_{j=0}^{k-1}w_{j}(1-\lambda_{j})\big{(}% f({\bm{x}}_{j})-f({\bm{x}}_{*})\big{)}=\sum_{k=0}^{K-1}\Big{(}w_{k}(1-\lambda_% {k})\sum_{j=k+1}^{K}\eta_{j}\Big{)}\big{(}f({\bm{x}}_{k})-f({\bm{x}}_{*})\big{% )}.

Plugging back into Eq. (A.4), grouping like terms, and choosing $\lambda_{0}=1$, we obtain

$$\begin{aligned}
&T\eta_{K}w_{K-1}\big(f({\bm{x}}_{K})-f({\bm{x}}_{*})\big)\\
&+T\sum_{k=1}^{K-1}\Big[\eta_{k}w_{k-1}-\big(1+\frac{\alpha}{\beta}\big)w_{k}(1-\lambda_{k})\sum_{j=k+1}^{K}\eta_{j}\Big]\big(f({\bm{x}}_{k})-f({\bm{x}}_{*})\big)\\
\leq\;&T^{3}\sigma_{*}^{2}L\sum_{k=1}^{K}\eta_{k}^{3}w_{k-1}+\frac{w_{-1}}{2}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}.
\end{aligned}\tag{A.6}$$

To obtain the last iterate guarantee, it suffices to choose $w_{k}$ and $\lambda_{k}$ such that

	$\displaystyle\lambda_{k}w_{k}\leq\;$	$\displaystyle w_{k-1},\quad 0\leq k\leq K-1,$		(A.7)
	$\displaystyle\eta_{k}w_{k-1}-\big{(}1+\frac{\alpha}{\beta}\big{)}w_{k}(1-% \lambda_{k})\sum_{j=k+1}^{K}\eta_{j}\geq\;$	$\displaystyle 0,\quad 1\leq k\leq K-1.$		(A.8)

Noticing that Eq. (A.8) is equivalent to $\lambda_{k}\geq 1-\frac{\eta_{k}w_{k-1}}{\big(1+\frac{\alpha}{\beta}\big)w_{k}\sum_{j=k+1}^{K}\eta_{j}}$, to have both inequalities satisfied at the same time, it suffices that

\displaystyle 1-\frac{\eta_{k}w_{k-1}}{\big{(}1+\frac{\alpha}{\beta}\big{)}w_{% k}\sum_{j=k+1}^{K}\eta_{j}}\leq\frac{w_{k-1}}{w_{k}}\iff w_{k}\leq\frac{\eta_{% k}+\big{(}1+\frac{\alpha}{\beta}\big{)}\sum_{j=k+1}^{K}\eta_{j}}{\big{(}1+% \frac{\alpha}{\beta}\big{)}\sum_{j=k+1}^{K}\eta_{j}}w_{k-1}.

To maximize the growth rate of $\{w_{k}\}$ , we let $w_{k}=\frac{\eta_{k}+(1+\frac{\alpha}{\beta})\sum_{j=k+1}^{K}\eta_{j}}{(1+% \frac{\alpha}{\beta})\sum_{j=k+1}^{K}\eta_{j}}w_{k-1}$ . Without loss of generality, we take $w_{-1}=w_{0}=\prod_{k=1}^{K-1}\frac{(1+\frac{\alpha}{\beta})\sum_{j=k+1}^{K}% \eta_{j}}{\eta_{k}+(1+\frac{\alpha}{\beta})\sum_{j=k+1}^{K}\eta_{j}}$ , and thus $w_{K-1}=1$ . Hence, dividing both sides of Eq. (A.6) by $T\eta_{K}w_{K-1}$ and choosing the constant step size $\eta_{k}\equiv\eta$ for all $k\in[K]$ , we obtain

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;

\displaystyle\eta^{2}T^{2}\sigma_{*}^{2}L\sum_{k=1}^{K}w_{k-1}+\frac{w_{-1}}{2% \eta T}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}.

(A.9)

We first bound $w_{-1}=\prod_{k=1}^{K-1}\frac{(1+\frac{\alpha}{\beta})(K-k)}{1+(1+\frac{\alpha% }{\beta})(K-k)}=\prod_{k=1}^{K-1}\frac{(1+\frac{\alpha}{\beta})k}{1+(1+\frac{% \alpha}{\beta})k}$ with the constant step size. Taking the natural logarithm of $w_{-1}$ , we have

\displaystyle\sum_{k=1}^{K-1}\log\Big(1-\frac{1}{1+\big(1+\frac{\alpha}{\beta}\big)k}\Big)\overset{(i)}{\leq}-\sum_{k=1}^{K-1}\frac{1}{1+\big(1+\frac{\alpha}{\beta}\big)k}\leq-\frac{1}{1+\frac{\alpha}{\beta}}\sum_{k=1}^{K-1}\frac{1}{k+1},

where for $(i)$ we use the fact that $\log(1+x)\leq x$ for $x>-1$. Further noticing that $\sum_{k=1}^{K-1}\frac{1}{k+1}=\sum_{k=1}^{K}\frac{1}{k}-1\geq\log(K)+\frac{1}{K}-1$, we have

\displaystyle\log(w_{-1})\leq-\frac{1}{1+\frac{\alpha}{\beta}}\log(K)+\frac{1}% {1+\frac{\alpha}{\beta}}\iff w_{-1}\leq\frac{\mathrm{e}^{1/(1+\alpha/\beta)}}{% K^{\frac{1}{1+\alpha/\beta}}}\leq\frac{\mathrm{e}}{K^{\frac{1}{1+\alpha/\beta}% }}.

On the other hand, we note that $w_{k-1}=\prod_{j=k}^{K-1}\frac{(1+\frac{\alpha}{\beta})(K-j)}{1+(1+\frac{% \alpha}{\beta})(K-j)}=\prod_{j=1}^{K-k}\frac{(1+\frac{\alpha}{\beta})j}{1+(1+% \frac{\alpha}{\beta})j}$ for $1\leq k\leq K-1$ and $w_{K-1}=1$ , then we follow the above argument and obtain

	$\displaystyle\sum_{k=1}^{K}w_{k-1}=\;$	$\displaystyle 1+\sum_{k=1}^{K-1}\prod_{j=1}^{K-k}\frac{(1+\frac{\alpha}{\beta}% )j}{1+(1+\frac{\alpha}{\beta})j}$
	$\displaystyle\leq\;$	$\displaystyle 1+\sum_{k=1}^{K-1}\frac{\mathrm{e}}{(K-k+1)^{\frac{1}{1+\alpha/% \beta}}}\overset{(i)}{\leq}\frac{\mathrm{e}(1+\alpha/\beta)}{\alpha/\beta}K^{% \frac{\alpha/\beta}{1+\alpha/\beta}},$

where $(i)$ is due to $\sum_{k=2}^{K}\frac{1}{k^{q}}\leq\int_{2}^{K+1}\frac{1}{(x-1)^{q}}dx=\frac{K^{% 1-q}-1}{1-q}$ for any $0<q<1$ . Hence, we obtain the final bound:

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;

\displaystyle\mathrm{e}\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{% \alpha/\beta}{1+\alpha/\beta}}+\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{% 2}}{2T\eta K^{\frac{1}{1+\alpha/\beta}}}.
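The two weight estimates that feed into this bound, $w_{-1}\leq\mathrm{e}/K^{1/(1+\alpha/\beta)}$ and $\sum_{k=1}^{K}w_{k-1}\leq\frac{\mathrm{e}(1+\alpha/\beta)}{\alpha/\beta}K^{\frac{\alpha/\beta}{1+\alpha/\beta}}$, can be checked numerically for a constant step size. The sketch below does so for a few values of $K$ and of the ratio $\alpha/\beta$; the function names and the tested values are illustrative assumptions.

```python
import math

def weights(K, ratio):
    """Return [w_{-1}, w_0, ..., w_{K-1}] for a constant step size, where ratio = alpha / beta."""
    c = 1.0 + ratio
    w = [1.0] * (K + 1)                 # w[i] holds w_{i-1}; in particular w[K] = w_{K-1} = 1
    for k in range(K - 1, 0, -1):       # recover w_{k-1} from w_k via the recursion in the proof
        j = K - k
        w[k] = w[k + 1] * (c * j) / (1.0 + c * j)
    w[0] = w[1]                         # w_{-1} = w_0
    return w

def bounds_hold(K, ratio):
    """Check the bound on w_{-1} and the bound on sum_{k=1}^K w_{k-1}."""
    c = 1.0 + ratio
    w = weights(K, ratio)
    first = w[0] <= math.e / K ** (1.0 / c)
    second = sum(w[1:]) <= math.e * c / ratio * K ** (ratio / c)
    return first and second

# ratio = 1 corresponds to alpha = beta; ratio = 1 / log K to alpha = 4, beta = 4 log K.
results = [bounds_hold(K, ratio)
           for K in (2, 10, 100, 1000)
           for ratio in (1.0, 1.0 / math.log(K))]
```

Every entry of `results` is expected to be `True`; the bounds are typically loose by a constant factor.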

To analyze the oracle complexity, we take

\displaystyle\eta=\min\Big{\{}\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2/3}}{2^{1/% 3}T\sigma_{*}^{2/3}L^{1/3}K^{1/3}(1+\beta/\alpha)^{1/3}},\,\frac{1}{\sqrt{% \beta}TL}\Big{\}}

and analyze the two possible cases depending on which term in the min is smaller. If the first term in the min is smaller (which we can equivalently think of as $K$ being “large”), we get

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;

\displaystyle\frac{3.5(1+\beta/\alpha)^{1/3}L^{1/3}\sigma_{*}^{2/3}\|{\bm{x}}_% {0}-{\bm{x}}_{*}\|^{4/3}}{K^{\frac{1}{1+\alpha/\beta}-\frac{1}{3}}}.

Alternatively, if $\eta=\frac{1}{\sqrt{\beta}TL}\leq\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2/3}}{2^% {1/3}T\sigma_{*}^{2/3}L^{1/3}K^{1/3}(1+\beta/\alpha)^{1/3}}$ (which we can think of as having “small” $K$ ), we obtain

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;

\displaystyle\frac{1.8(1+\beta/\alpha)^{1/3}L^{1/3}\sigma_{*}^{2/3}\|{\bm{x}}_% {0}-{\bm{x}}_{*}\|^{4/3}}{K^{\frac{1}{1+\alpha/\beta}-\frac{1}{3}}}+\frac{1.4% \sqrt{\beta}L\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{K^{\frac{1}{1+\alpha/\beta}}}.

Hence, combining these two cases, we have

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq

\displaystyle\frac{3.5(1+\beta/\alpha)^{1/3}L^{1/3}\sigma_{*}^{2/3}\|{\bm{x}}_% {0}-{\bm{x}}_{*}\|^{4/3}}{K^{\frac{1}{1+\alpha/\beta}-\frac{1}{3}}}+\frac{1.4% \sqrt{\beta}L\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{K^{\frac{1}{1+\alpha/\beta}}}.

In particular, if we choose $\alpha=4$ and $\beta=4\log K$ , assuming without loss of generality that $\log K>1$ , then we have $K^{\frac{\alpha/\beta}{1+\alpha/\beta}}=K^{\frac{1}{\log K+1}}\leq\mathrm{e}$ and thus

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\frac{9.4L^{\frac{1}{3}}% \sigma_{*}^{\frac{2}{3}}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{\frac{4}{3}}(1+\log K)^% {\frac{1}{3}}}{K^{\frac{2}{3}}}+\frac{7.4L\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}% \sqrt{\log K}}{K}.

To guarantee $f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\epsilon$ given $\epsilon>0$ , the total number of individual gradient evaluations will be

\displaystyle TK=\widetilde{\mathcal{O}}\Big{(}\frac{TL\|{\bm{x}}_{0}-{\bm{x}}% _{*}\|^{2}}{\epsilon}+\frac{TL^{1/2}\sigma_{*}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2% }}{\epsilon^{3/2}}\Big{)}.

∎
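For concreteness, the step size rule and the resulting last-iterate bound with $\alpha=4$ and $\beta=4\log K$ can be evaluated as in the sketch below. The function name and the sample problem constants are hypothetical; the formulas are transcribed from the proof above, and $\log K>1$ and $\sigma_{*}>0$ are assumed.

```python
import math

def step_size_and_bound(K, T, L, sigma_star, dist0):
    """Step size from the proof and the final bound, with alpha = 4 and beta = 4 log K."""
    alpha, beta = 4.0, 4.0 * math.log(K)
    r = beta / alpha
    # First term of the min: balances the two terms of the bound when K is "large".
    eta_large_K = dist0 ** (2 / 3) / (
        2 ** (1 / 3) * T * sigma_star ** (2 / 3) * L ** (1 / 3)
        * K ** (1 / 3) * (1 + r) ** (1 / 3))
    # Second term of the min: the step size cap required by Lemma 2.1.
    eta_small_K = 1.0 / (math.sqrt(beta) * T * L)
    eta = min(eta_large_K, eta_small_K)
    bound = (9.4 * L ** (1 / 3) * sigma_star ** (2 / 3) * dist0 ** (4 / 3)
             * (1 + math.log(K)) ** (1 / 3) / K ** (2 / 3)
             + 7.4 * L * dist0 ** 2 * math.sqrt(math.log(K)) / K)
    return eta, bound

# Hypothetical problem constants, chosen only to exercise the formulas.
eta_100, bound_100 = step_size_and_bound(K=100, T=10, L=1.0, sigma_star=1.0, dist0=1.0)
eta_10k, bound_10k = step_size_and_bound(K=10000, T=10, L=1.0, sigma_star=1.0, dist0=1.0)
```

Increasing the number of epochs $K$ shrinks both the step size and the bound, in line with the $\widetilde{\mathcal{O}}(K^{-2/3}+K^{-1})$ rate.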

See 2.4

Proof.

We follow the proof of Theorem 2.3 up to Eq. (A.6) with constant step size $\eta$ , then we instead take $\lambda_{k}=w_{k-1}/w_{k}$ and $w_{k}=\frac{(1+\frac{\alpha}{\beta})(K-k)+1-c}{(1+\frac{\alpha}{\beta})(K-k)}w% _{k-1}$ to obtain

\displaystyle c\eta T\sum_{k=1}^{K}w_{k-1}\big{(}f({\bm{x}}_{k})-f({\bm{x}}_{*% })\big{)}\leq T^{3}\eta^{3}\sigma_{*}^{2}L\sum_{k=1}^{K}w_{k-1}+\frac{w_{-1}}{% 2}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}.

Since $f$ is convex, we have $f(\hat{\bm{x}}_{K})-f({\bm{x}}_{*})\leq\frac{\sum_{k=1}^{K}w_{k-1}(f({\bm{x}}_% {k})-f({\bm{x}}_{*}))}{\sum_{k=1}^{K}w_{k-1}}$ where $\hat{\bm{x}}_{K}=\sum_{k=1}^{K}\frac{w_{k-1}{\bm{x}}_{k}}{\sum_{k=1}^{K}w_{k-1}}$ is the increasing weighted averaging of $\{{\bm{x}}_{k}\}_{k=1}^{K}$ , thus (cf. Eq. (A.9))

\displaystyle f(\hat{\bm{x}}_{K})-f({\bm{x}}_{*})\leq\frac{T^{2}\sigma_{*}^{2}% \eta^{2}L}{c}+\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}w_{-1}}{2c\eta T\sum_{k=1% }^{K}w_{k-1}}\leq\frac{T^{2}\sigma_{*}^{2}\eta^{2}L}{c}+\frac{\|{\bm{x}}_{0}-{% \bm{x}}_{*}\|^{2}}{2c\eta TK},

where the last step is due to $\sum_{k=1}^{K}w_{k-1}\geq Kw_{-1}$ . Then we follow the proof of Theorem 2.3 and choose $\eta=\min\{\frac{1}{\sqrt{\beta}TL},(\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{% 2T^{3}\sigma_{*}^{2}LK})^{1/3}\}$ to obtain

\displaystyle f(\hat{\bm{x}}_{K})-f({\bm{x}}_{*})\leq\frac{\sqrt{\beta}L\|{\bm% {x}}_{0}-{\bm{x}}_{*}\|^{2}}{2cK}+\frac{2^{1/3}L^{1/3}\sigma_{*}^{2/3}\|{\bm{x% }}_{0}-{\bm{x}}_{*}\|^{4/3}}{cK^{2/3}}.

To guarantee $f(\hat{\bm{x}}_{K})-f({\bm{x}}_{*})\leq\epsilon$ for $\epsilon>0$, we choose $\beta={\mathcal{O}}(1)$ and the total number of individual gradient evaluations will be

\displaystyle TK={\mathcal{O}}\Big{(}\frac{TL\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}% }{c\epsilon}+\frac{TL^{1/2}\sigma_{*}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{c^{3/2% }\epsilon^{3/2}}\Big{)},

thus finishing the proof. ∎

See 2.5

Proof.

We first introduce a lemma to bound the term $\sum_{t=1}^{T}\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\|^{2}$ in Lemma 2.1 with random permutations involved. Its proof is provided in Appendix A.

Lemma A.1.

Under Assumption 4 and for Alg. 1 with uniformly random (SO/RR) shuffling, we have

\displaystyle\textstyle\mathbb{E}\big{[}\sum_{t=1}^{T}\|\sum_{s=t}^{T}\nabla f% _{s}({\bm{x}}_{*})\|^{2}\big{]}\leq\frac{T(T+1)}{6}\sigma_{*}^{2}.

We then take expectation w.r.t. all the randomness on both sides of the inequality in Lemma 2.1 and use Lemma A.1. Then it remains to follow the analysis in the proof of Theorem 2.3. ∎

See Lemma A.1

Proof.

We first consider the random reshuffling strategy. Conditional on all the randomness up to but not including the $k$-th epoch, the only randomness in $\mathbb{E}_{k}[\sum_{t=1}^{T}\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\|^{2}]$ comes from the permutation $\pi^{(k)}$ used in the $k$-th epoch. Further noticing that each partial sum $\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})$ can be seen as a batch sampled without replacement from $\{\nabla f_{t}({\bm{x}}_{*})\}_{t\in[T]}$, we have

$$\begin{aligned}
\mathbb{E}_{k}\Big[\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}\Big]=\;&\sum_{t=1}^{T}\mathbb{E}_{\pi^{(k)}}\Big[\Big\|\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}\Big]\\
=\;&\sum_{t=1}^{T}(T-t+1)^{2}\mathbb{E}_{\pi^{(k)}}\Big[\Big\|\frac{1}{T-t+1}\sum_{s=t}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}\Big]\\
\overset{(i)}{\leq}\;&\sum_{t=1}^{T}(T-t+1)^{2}\frac{t-1}{(T-t+1)(T-1)}\sigma_{*}^{2}\\
=\;&\frac{T(T+1)}{6}\sigma_{*}^{2},
\end{aligned}$$

where $(i)$ is due to $\sum_{t=1}^{T}\nabla f_{t}({\bm{x}}_{*})=\mathbf{0}$ and sampling without replacement (see e.g., (Lohr, 2021, Section 2.7)). It remains to take expectation w.r.t. all randomness on both sides and use the law of total expectation. For the case of shuffle-once variant, we can directly take expectation since the randomness only comes from the initial random permutation, and the above argument still applies. ∎
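The expectation in Lemma A.1 can be computed exactly on a small instance by enumerating all permutations. The sketch below uses scalar "gradients" that sum to zero (as $\nabla f({\bm{x}}_{*})=\mathbf{0}$ requires) and takes $\sigma_{*}^{2}=\frac{1}{T}\sum_{t}\|\nabla f_{t}({\bm{x}}_{*})\|^{2}$, which is how we read Assumption 4; the specific numbers and the helper name are assumptions for illustration.

```python
from fractions import Fraction
from itertools import permutations

def expected_tail_sum(grads):
    """Average over all orderings pi of sum_t (sum_{s >= t} g_{pi(s)})^2, for scalar gradients."""
    T = len(grads)
    total, count = Fraction(0), 0
    for perm in permutations(grads):
        total += sum(sum(perm[t:]) ** 2 for t in range(T))
        count += 1
    return total / count

# Hypothetical component gradients at x_*; they sum to zero.
grads = [Fraction(g) for g in (3, -1, -4, 2, 0)]
T = len(grads)
sigma_sq = sum(g * g for g in grads) / T           # sigma_*^2 under the stated reading
expected = expected_tail_sum(grads)
closed_form = Fraction(T * (T + 1), 6) * sigma_sq  # right-hand side of Lemma A.1
```

Exact rational arithmetic avoids any rounding ambiguity. On zero-sum instances like this one, step $(i)$ appears to hold with equality, so `expected` and `closed_form` coincide.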

Appendix B Omitted Proofs from Section 3

B.1 Convex Smooth Setting

Lemma B.1.

Under Assumptions 1 and 3, for any ${\bm{z}}\in\mathbb{R}^{d}$ that is fixed in the $k$ -th cycle of Alg. 2 and for $\alpha>0,\beta>0$ such that $\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}$ , if the step sizes satisfy $\eta_{k}\leq\frac{1}{\sqrt{\beta}TL}$ , then we have for $k\in[K]$

$$\begin{aligned}
T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\;&\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}+\frac{\alpha}{\beta}T\big(f({\bm{z}})-f({\bm{x}}_{*})\big)\\
&+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big).
\end{aligned}\tag{B.1}$$

Proof.

Since each $f_{t}$ is convex and $L$ -smooth, we have for $t\in[T]$

$$\begin{aligned}
f_{t}({\bm{x}}_{k})-f_{t}({\bm{x}}_{k-1,t+1})&\leq\left\langle\nabla f_{t}({\bm{x}}_{k}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\right\rangle-\frac{1}{2L}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2},\\
f_{t}({\bm{x}}_{k-1,t+1})-f_{t}({\bm{z}})&\leq\left\langle\nabla f_{t}({\bm{x}}_{k-1,t+1}),{\bm{x}}_{k-1,t+1}-{\bm{z}}\right\rangle-\frac{1}{2L}\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-\nabla f_{t}({\bm{z}})\|^{2}.
\end{aligned}$$

Following the proof of Lemma 2.1, we add and subtract $\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}$ on the right-hand side of the second inequality and combine the above two inequalities to obtain

$$\begin{aligned}
f_{t}({\bm{x}}_{k})-f_{t}({\bm{z}})\leq\;&\left\langle\nabla f_{t}({\bm{x}}_{k}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\right\rangle-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\|^{2}\\
&-\frac{1}{2L}\Big(\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2}+\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-\nabla f_{t}({\bm{z}})\|^{2}\Big)\\
&+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|^{2}\big).
\end{aligned}$$

Decomposing $\nabla f_{t}({\bm{x}}_{k})=\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k% -1,t+1})+\nabla f_{t}({\bm{x}}_{k-1,t+1})$ and summing over $t\in[T]$ , we note that ${\bm{x}}_{k-1}={\bm{x}}_{k-1,1}$ and ${\bm{x}}_{k}={\bm{x}}_{k-1,T+1}$ and obtain

$$\begin{aligned}
&T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\\
\leq\;&\underbrace{\sum_{t=1}^{T}\left\langle\nabla f_{t}({\bm{x}}_{k-1,t+1}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\right\rangle-\frac{1}{2\eta_{k}}\sum_{t=1}^{T}\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\|^{2}}_{{\mathcal{T}}_{1}}\\
&+\underbrace{\sum_{t=1}^{T}\left\langle\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\right\rangle}_{{\mathcal{T}}_{2}}+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big)\\
&-\frac{1}{2L}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2}-\frac{1}{2L}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-\nabla f_{t}({\bm{z}})\|^{2}.
\end{aligned}$$

For the term ${\mathcal{T}}_{1}$, we follow the argument from the proof of Lemma 2.1 to obtain

\displaystyle{\mathcal{T}}_{1}=-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k}-{\bm{x}}_{k-% 1}\|^{2}\leq 0.

For the term ${\mathcal{T}}_{2}$ , noticing that ${\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}=-\eta_{k}\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}% _{k-1,s+1})$ for $1\leq t\leq T-1$ and decomposing $\nabla f_{s}({\bm{x}}_{k-1,s+1})=(\nabla f_{s}({\bm{x}}_{k-1,s+1})-\nabla f_{s% }({\bm{z}}))+(\nabla f_{s}({\bm{z}})-\nabla f_{s}({\bm{x}}_{*}))+\nabla f_{s}(% {\bm{x}}_{*})$ , we use Young’s inequality with parameters $\alpha>0$ and $\beta>0$ to obtain

$$\begin{aligned}
{\mathcal{T}}_{2}=\;&\sum_{t=1}^{T-1}\Big\langle\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1}),-\eta_{k}\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{k-1,s+1})\Big\rangle\\
\leq\;&\frac{1}{2L}\Big(\frac{1}{2}+\frac{1}{\alpha}+\frac{1}{\beta}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2}\\
&+\frac{\alpha\eta_{k}^{2}L}{2}\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big(\nabla f_{s}({\bm{z}})-\nabla f_{s}({\bm{x}}_{*})\big)\Big\|^{2}+\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}\\
&+\frac{\beta\eta_{k}^{2}L}{2}\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big(\nabla f_{s}({\bm{x}}_{k-1,s+1})-\nabla f_{s}({\bm{z}})\big)\Big\|^{2}.
\end{aligned}$$

Further using the fact that $\|\sum_{i=1}^{n}{\bm{x}}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|{\bm{x}}_{i}\|^{2}$ and combining the above bounds on ${\mathcal{T}}_{1}$ and ${\mathcal{T}}_{2}$ , we obtain

$$\begin{aligned}
T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\;&\frac{1}{2L}\Big(\frac{1}{\alpha}+\frac{1}{\beta}-\frac{1}{2}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2}\\
&+\Big(\frac{\beta\eta_{k}^{2}T^{2}L}{2}-\frac{1}{2L}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-\nabla f_{t}({\bm{z}})\|^{2}\\
&+\frac{\alpha\eta_{k}^{2}T^{2}L}{2}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{z}})-\nabla f_{t}({\bm{x}}_{*})\|^{2}+\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}\\
&+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big).
\end{aligned}$$

The rest of the proof is the same as the proof of Lemma 2.1 and is thus omitted. ∎

See 3.2

Proof.

For all parts of the proof, we consider $1$ -dimensional quadratics

\displaystyle f(x):=\frac{1}{T}\sum_{t=1}^{T}f_{t}(x),\quad\text{where}\quad f% _{t}(x)=\frac{L}{2}(x-\delta_{t})^{2}

for $t\in[T]$ , $L>0$ , and appropriately chosen sequences of $\{\delta_{t}\}_{t\in[T]}\subseteq\mathbb{R}$ .

It is immediate that $f(x)$ is minimized at $x_{*}=\frac{1}{T}\sum_{t=1}^{T}\delta_{t}$ . Observe that Alg. 2 using a constant step size $\eta>0$ has closed-form updates on $f$ , i.e.,

\displaystyle x_{k+1}=\gamma^{T}x_{k}+(1-\gamma)\sum_{t=1}^{T}\gamma^{T-t}\delta_{t},

where $\gamma=\frac{1}{\eta L+1}\in(0,1)$ . Given any initial point $x_{0}$ , by iterating we have

	$\displaystyle\textstyle x_{k}-x_{*}$	$\displaystyle=\gamma^{kT}x_{0}+\sum_{t=1}^{T}\Big{(}\frac{\gamma^{T-t}(1-% \gamma)(1-\gamma^{kT})}{1-\gamma^{T}}-\frac{1}{T}\Big{)}\delta_{t}$		(B.2)
		$\displaystyle\overset{k\rightarrow\infty}{\longrightarrow}\sum_{t=1}^{T}\Big{(% }\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big{)}\delta_{t}.$		(B.3)
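The closed form in Eq. (B.2) can be cross-checked by simulating the incremental proximal updates directly. For $f_{t}(x)=\frac{L}{2}(x-\delta_{t})^{2}$, one proximal step from $v$ with step size $\eta$ is $\gamma v+(1-\gamma)\delta_{t}$ with $\gamma=\frac{1}{\eta L+1}$, so a full cycle contracts the iterate by $\gamma^{T}$. The sketch below compares the simulation with the formula; the function names and the sample instance are assumptions for illustration.

```python
def prox_quadratic(v, delta, L, eta):
    """Proximal step for f_t(x) = (L/2)(x - delta)^2: argmin_x f_t(x) + (x - v)^2 / (2 eta)."""
    gamma = 1.0 / (eta * L + 1.0)
    return gamma * v + (1.0 - gamma) * delta

def run_cycles(x0, deltas, L, eta, cycles):
    """Run the incremental proximal method for the given number of full cycles."""
    x = x0
    for _ in range(cycles):
        for d in deltas:
            x = prox_quadratic(x, d, L, eta)
    return x

def closed_form_gap(x0, deltas, L, eta, k):
    """x_k - x_* as given by Eq. (B.2)."""
    T = len(deltas)
    gamma = 1.0 / (eta * L + 1.0)
    scale = (1.0 - gamma) * (1.0 - gamma ** (k * T)) / (1.0 - gamma ** T)
    return gamma ** (k * T) * x0 + sum(
        (gamma ** (T - t) * scale - 1.0 / T) * d for t, d in enumerate(deltas, start=1))

# Hypothetical instance: T = 4 components, k = 7 cycles.
x0, deltas, L, eta, k = 5.0, [1.0, -2.0, 0.5, 3.0], 2.0, 0.3, 7
x_star = sum(deltas) / len(deltas)
simulated_gap = run_cycles(x0, deltas, L, eta, k) - x_star
formula_gap = closed_form_gap(x0, deltas, L, eta, k)
```

The two gaps should agree up to floating-point error, and for large `k` both approach the limit in Eq. (B.3), which is generally nonzero.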

Consider the weight of $\delta_{T}$ in Eq. (B.3). Since $\gamma\in(0,1)$, we have

\displaystyle\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}=\frac{1}{\sum_{t=0}^{T-% 1}\gamma^{t}}-\frac{1}{T}>0.

Then for any $\{\delta_{t}\}_{t\in[T]}$ such that $\mathrm{sgn}(\delta_{t})=\mathrm{sgn}\big{(}\frac{\gamma^{T-t}(1-\gamma)}{1-% \gamma^{T}}-\frac{1}{T}\big{)}$ and $\delta_{T}>0$ , we know that

\displaystyle\lim_{k\rightarrow\infty}f(x_{k})-f(x_{*})\overset{(i)}{=}\frac{L% }{2}\lim_{k\rightarrow\infty}(x_{k}-x_{*})^{2}\geq\frac{L}{2}\Big{(}\frac{1-% \gamma}{1-\gamma^{T}}-\frac{1}{T}\Big{)}^{2}\delta_{T}^{2}>0,

where $(i)$ is due to $f$ being both $L$ -strongly convex and $L$ -smooth.

Consider the weights of $\delta_{t}$ in Eq. (B.3). Since $\gamma\in(0,1)$ , we have

\displaystyle 0\leq\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}\leq\gamma^{T-t}% \leq 1,

thus for any $t\in[T-1]$

\displaystyle\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\in\Big{[}% -\frac{1}{T},\frac{T-1}{T}\Big{)}.

For $t=T$ , given any fixed $\gamma<1$ , we have

\displaystyle\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}=\frac{1}{\sum_{t=0}^{T-% 1}\gamma^{t}}-\frac{1}{T}>0.

Hence, for any sequence $\{\delta_{t}\}$ such that

$$|\delta_{t}|<\frac{T\sqrt{2/L}}{(T-1)^{2}}\;(t\in[T-1]),\quad\delta_{T}>\frac{2\sqrt{2/L}}{\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}},$$

combining the bounds on the weights of $\delta_{t}$ with Eq. (B.3), we obtain

$$\begin{aligned}
\lim_{k\rightarrow\infty}x_{k}-x_{*}=\;&\sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{t}\\
\geq\;&\Big(\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{T}-\sum_{t=1}^{T-1}\frac{T-1}{T}|\delta_{t}|\\
>\;&\sqrt{2/L}.
\end{aligned}$$

Since $f$ is $L$ -smooth and $L$ -strongly convex, we know that

\displaystyle\lim_{k\rightarrow\infty}f(x_{k})-f(x_{*})=\frac{L}{2}\lim_{k% \rightarrow\infty}(x_{k}-x_{*})^{2}>1,

thus finishing the proof of the second part. We note in passing that 1 on the right-hand side can be replaced by any constant using a simple rescaling.
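The construction in this part can be instantiated numerically: pick $|\delta_{t}|$ strictly below the stated magnitude for $t\in[T-1]$ and $\delta_{T}$ strictly above its threshold, then evaluate the limit in Eq. (B.3). The sketch below does this for several step sizes; the safety factors `0.5` and `1.1` and the function names are assumptions for illustration.

```python
import math

def limit_gap(deltas, L, eta):
    """Limit of x_k - x_* from Eq. (B.3) for the incremental proximal method."""
    T = len(deltas)
    gamma = 1.0 / (eta * L + 1.0)
    return sum((gamma ** (T - t) * (1 - gamma) / (1 - gamma ** T) - 1.0 / T) * d
               for t, d in enumerate(deltas, start=1))

def hard_instance(T, L, eta):
    """Sequence {delta_t} satisfying the two conditions of the construction (requires T >= 2)."""
    gamma = 1.0 / (eta * L + 1.0)
    w_T = (1 - gamma) / (1 - gamma ** T) - 1.0 / T       # weight of delta_T, positive
    small = 0.5 * T * math.sqrt(2 / L) / (T - 1) ** 2    # strictly below the allowed magnitude
    big = 1.1 * 2 * math.sqrt(2 / L) / w_T               # strictly above the required threshold
    return [small] * (T - 1) + [big]

L, T = 3.0, 6
limiting_errors = []
for eta in (0.01, 0.1, 1.0, 10.0):
    deltas = hard_instance(T, L, eta)
    # f(x) - f(x_*) = (L/2)(x - x_*)^2 for these quadratics.
    limiting_errors.append(0.5 * L * limit_gap(deltas, L, eta) ** 2)
```

For each fixed step size the limiting optimality gap stays above 1, which illustrates that the instance must be chosen after the step size is fixed.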

Observe that given a fixed step size $\eta>0$ , we can choose a sequence $\{\delta_{t}\}_{t\in[T]}$ such that $\mathrm{sgn}(\delta_{t})=\mathrm{sgn}\big{(}\frac{\gamma^{T-t}(1-\gamma)}{1-% \gamma^{T}}-\frac{1}{T}\big{)}$ for all $t\in[T]$ , thus for any initial point $x_{0}\geq x_{*}$ :

$$\begin{aligned}
f(x_{k})-f(x_{*})=\;&\frac{L}{2}\Big(\gamma^{kT}(x_{0}-x_{*})+(1-\gamma^{kT})\sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{t}\Big)^{2}\\
\geq\;&\frac{(1-\gamma^{kT})^{2}L}{2}\sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)^{2}\delta_{t}^{2}\\
&+(1-\gamma^{kT})^{2}L\sum_{s\neq t\in[T]}\Big(\frac{\gamma^{T-s}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{s}\delta_{t}.
\end{aligned}$$

Without loss of generality, taking a sufficiently large $k\geq\frac{\log_{\gamma}(1-2/\sqrt{5})}{T}$ , we obtain

\displaystyle f(x_{k})-f(x_{*})\geq\frac{2L}{5}\sum_{t=1}^{T}\Big{(}\frac{% \gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big{)}^{2}\delta_{t}^{2}.

Then for any step size $\eta\geq 1/(TL)$, we have $\gamma=\frac{1}{\eta L+1}\leq\frac{T}{T+1}$. Considering $t=T$, we can bound

\displaystyle\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}\geq\frac{1-\frac{T}{T+1% }}{1-(\frac{T}{T+1})^{T}}-\frac{1}{T}=\frac{1}{T}\big{(}\frac{1}{1-(1-\frac{1}% {T})^{T}}-1\big{)}>\frac{\mathrm{e}-1}{T}.

Thus, for $\delta_{T}\geq\frac{\sqrt{5}T\sqrt{\varepsilon}}{\sqrt{2}(\mathrm{e}-1)L}$ , we have

\displaystyle f(x_{k})-f(x_{*})\geq\frac{2L}{5}\Big{(}\frac{1-\gamma}{1-\gamma% ^{T}}-\frac{1}{T}\Big{)}^{2}\delta_{T}^{2}>\varepsilon.

On the other hand, recalling the definition in Assumption 4, we have in this example that

\displaystyle\sigma_{*}^{2}=\frac{L^{2}}{T}\sum_{t=1}^{T}\Big(\frac{\sum_{s=1}^{T}\delta_{s}}{T}-\delta_{t}\Big)^{2}=\frac{L^{2}}{T}\Big(\frac{(\sum_{t=1}^{T}\delta_{t})^{2}}{T}-2\frac{(\sum_{t=1}^{T}\delta_{t})^{2}}{T}+\sum_{t=1}^{T}\delta_{t}^{2}\Big)\leq L^{2}\sum_{t=1}^{T}\delta_{t}^{2}/T.

Thus for any step size $\eta\geq\frac{16\sqrt{\varepsilon}}{\sqrt{TL}\sigma_{*}}\geq\frac{16\sqrt{% \varepsilon}}{L^{3/2}\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}}$ , we have $\gamma=\frac{1}{\eta L+1}\leq\frac{\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}}{16% \sqrt{\varepsilon/L}+\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}}$ .

We now proceed by bounding the weight of $\delta_{T}$ . In particular, let $\gamma=1-\kappa$ for some $\kappa>0$ , and assume that $\kappa\leq\frac{1}{T+1}<\frac{1}{T}$ without loss of generality by the discussion above. Since $\kappa T<1$ and $T\geq 2$ , we have

\displaystyle(1-\kappa)^{T}\geq 1-\kappa T+\frac{\kappa^{2}T(T-1)}{4},

which leads to

\displaystyle 1-\gamma^{T}=1-(1-\kappa)^{T}\leq\kappa T-\frac{\kappa^{2}T(T-1)% }{4}.

Hence, we have

\displaystyle\frac{1-\gamma}{1-\gamma^{T}}\geq\frac{\kappa}{\kappa T-\frac{% \kappa^{2}T(T-1)}{4}}=\frac{1}{T}\frac{1}{1-\kappa(T-1)/4}.

Further noticing that

\displaystyle\frac{1}{1-\kappa(T-1)/4}\geq 1+\frac{\kappa(T-1)}{4}\geq 1+\frac% {\kappa T}{8}

for $T\geq 2$ and $\kappa<\frac{1}{T}$ , then we have

\displaystyle\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}\geq\frac{1}{T}(1+\frac{% \kappa T}{8})-\frac{1}{T}=\frac{\kappa}{8}.
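The chain of estimates above gives the lower bound $\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}\geq\frac{\kappa}{8}$ whenever $\gamma=1-\kappa$, $\kappa\leq\frac{1}{T+1}$, and $T\geq 2$. A quick numerical sweep over a few values of $T$ and $\kappa$ is sketched below; the grid and the helper name are assumptions for illustration.

```python
def last_weight(kappa, T):
    """Weight of delta_T in Eq. (B.3), written in terms of kappa = 1 - gamma."""
    gamma = 1.0 - kappa
    return (1.0 - gamma) / (1.0 - gamma ** T) - 1.0 / T

checks = []
for T in (2, 3, 10, 100):
    for frac in (0.01, 0.1, 0.5, 1.0):
        kappa = frac / (T + 1)          # stays within the assumed range kappa <= 1 / (T + 1)
        checks.append(last_weight(kappa, T) >= kappa / 8)
```

All entries of `checks` are expected to be `True`. For $T=2$ the weight equals $\frac{\kappa}{2(2-\kappa)}$, which is roughly $\kappa/4$, so the constant $\frac{1}{8}$ leaves some slack.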

Recall that $\gamma\leq\frac{\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}}{16\sqrt{\varepsilon/L}+% \sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}}$ , then we obtain

\displaystyle\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}\geq\frac{2\sqrt{% \varepsilon/L}}{16\sqrt{\varepsilon/L}+\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}}% \geq\frac{\sqrt{3\varepsilon/L}}{\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}},

for $\{\delta_{t}\}_{t\in[T]}$ such that $\sqrt{\sum_{t=1}^{T}\delta_{t}^{2}}\geq 16\sqrt{3}(2+\sqrt{3})\sqrt{% \varepsilon/L}$ . Thus, for the sequence $\{\delta_{t}\}_{t\in[T]}$ such that $\delta_{T}^{2}>\frac{5}{6}\sum_{t=1}^{T}\delta_{t}^{2}$ and $\mathrm{sgn}(\delta_{t})=\mathrm{sgn}\big{(}\frac{\gamma^{T-t}(1-\gamma)}{1-% \gamma^{T}}-\frac{1}{T}\big{)}$ , we have

	$\displaystyle f(x_{k})-f(x_{*})\geq\;$	$\displaystyle\frac{2L}{5}\Big{(}\frac{(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}% \Big{)}^{2}\delta_{T}^{2}$
	$\displaystyle>\;$	$\displaystyle\frac{2L}{5}\frac{3\varepsilon\delta_{T}^{2}}{L\sum_{t=1}^{T}% \delta_{t}^{2}}$
	$\displaystyle>\;$	$\displaystyle\varepsilon,$

completing the proof. ∎
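The key inequality in the argument above, $\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}\geq\frac{\kappa}{8}$ for $\gamma=1-\kappa$ with $0<\kappa\leq\frac{1}{T+1}$ and $T\geq 2$, can be sanity-checked numerically. The sketch below is illustrative only; the function names `weight_gap` and `check_gap` and the sampled grid of values are assumptions made for this example and do not come from the paper.

```python
def weight_gap(kappa: float, T: int) -> float:
    """Return (1 - gamma) / (1 - gamma**T) - 1/T for gamma = 1 - kappa."""
    gamma = 1.0 - kappa
    return (1.0 - gamma) / (1.0 - gamma ** T) - 1.0 / T


def check_gap(T_values=(2, 3, 10, 100), fractions=(0.05, 0.25, 0.5, 1.0)) -> bool:
    """Check weight_gap(kappa, T) >= kappa / 8 on a small, hypothetical grid.

    kappa is sampled as a fraction of its largest admissible value 1 / (T + 1).
    """
    ok = True
    for T in T_values:
        for frac in fractions:
            kappa = frac / (T + 1)
            ok = ok and (weight_gap(kappa, T) >= kappa / 8.0)
    return ok


if __name__ == "__main__":
    print(check_gap())
```

A finite grid does not replace the proof; it only guards against sign or indexing slips when re-deriving the bound.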

B.2 Convex Lipschitz Setting

Lemma B.2.

Under Assumptions 1 and 2, for any ${\bm{z}}\in\mathbb{R}^{d}$ that is fixed in the $k$-th cycle of Alg. 2, we have for $k\in[K]$

\displaystyle T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big)+\frac{T(T-1)G^{2}\eta_{k}}{2}.

(B.4)

Proof.

Since $f_{t}$ is convex and closed, we have

\displaystyle\nabla M_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t})=\frac{1}{\eta_{k}}({\bm{x}}_{k-1,t}-{\bm{x}}_{k-1,t+1})\in\partial f_{t}({\bm{x}}_{k-1,t+1}).

By $G$-Lipschitzness of each component function, we have that for $t\in[T-1]$

		$\displaystyle f_{t}({\bm{x}}_{k})-f_{t}({\bm{x}}_{k-1,t+1})$
	$\displaystyle\leq\;$	$\displaystyle G\|{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\|\leq\eta_{k}G\sum_{s=t+1}^{T}\|\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|\leq(T-t)G^{2}\eta_{k}.$		(B.5)

On the other hand, using convexity of $f_{t}$, we have that for $t\in[T]$

\displaystyle f_{t}({\bm{z}})\geq f_{t}({\bm{x}}_{k-1,t+1})+\left\langle\nabla M_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t}),{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle.

Expanding the inner product in the above inequality leads to

	$\displaystyle f_{t}({\bm{x}}_{k-1,t+1})-f_{t}({\bm{z}})$
$\displaystyle\leq\;$	$\displaystyle-\frac{1}{\eta_{k}}\left\langle{\bm{x}}_{k-1,t}-{\bm{x}}_{k-1,t+1},{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle$
$\displaystyle=\;$	$\displaystyle\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|^{2}\big)-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\|^{2}.$	(B.6)

Combining Eqs. (B.5) and (B.6) and noticing that ${\bm{x}}_{k-1,T+1}={\bm{x}}_{k}$ and ${\bm{x}}_{k-1,1}={\bm{x}}_{k-1}$, we sum the inequalities over $t\in[T]$ and obtain

	$\displaystyle T(f({\bm{x}}_{k})-f({\bm{z}}))\leq\;$	$\displaystyle\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big)+\frac{T(T-1)G^{2}\eta_{k}}{2}$
		$\displaystyle-\frac{1}{2\eta_{k}}\sum_{t=1}^{T}\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\|^{2}$
	$\displaystyle\leq\;$	$\displaystyle\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big)+\frac{T(T-1)G^{2}\eta_{k}}{2}.$

∎
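To make Lemma B.2 concrete, the following sketch runs one cycle of the incremental proximal method on a toy one-dimensional problem and evaluates both sides of Eq. (B.4). Everything specific here is an assumption made for illustration: the components $f_{t}(x)=|x-a_{t}|$ (convex and $1$-Lipschitz, so $G=1$), the anchor values, and the helper names `prox_abs`, `ipm_cycle`, and `lemma_b2_sides`.

```python
def prox_abs(x: float, a: float, eta: float) -> float:
    """Proximal map of eta * |. - a| at x (soft-thresholding around a)."""
    d = x - a
    if d > eta:
        return x - eta
    if d < -eta:
        return x + eta
    return a


def ipm_cycle(x: float, anchors, eta: float) -> float:
    """One pass of exact proximal steps over f_t(x) = |x - a_t|, in order."""
    for a in anchors:
        x = prox_abs(x, a, eta)
    return x


def lemma_b2_sides(x_prev: float, z: float, anchors, eta: float, G: float = 1.0):
    """Return (lhs, rhs) of Eq. (B.4) for one cycle started at x_prev."""
    T = len(anchors)
    x_next = ipm_cycle(x_prev, anchors, eta)
    # T * (f(x_k) - f(z)) with f the average of the T components
    lhs = sum(abs(x_next - a) - abs(z - a) for a in anchors)
    rhs = ((x_prev - z) ** 2 - (x_next - z) ** 2) / (2 * eta) \
        + T * (T - 1) * G ** 2 * eta / 2
    return lhs, rhs


if __name__ == "__main__":
    print(lemma_b2_sides(2.0, 0.5, [0.0, 1.0], 0.5))
```

The absolute-value components are chosen only because their proximal map has a closed form; any convex $G$-Lipschitz components with a computable proximal map could be substituted.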

See Theorem 3.3.

Proof.

Plugging ${\bm{z}}_{k-1}$ defined in Eq. (2.1) into Eq. (B.4) and multiplying both sides by $\eta_{k}w_{k-1}$, we obtain

		$\displaystyle T\eta_{k}w_{k-1}\big(f({\bm{x}}_{k})-f({\bm{z}}_{k-1})\big)$
	$\displaystyle\leq\;$	$\displaystyle\frac{1}{2}\big(w_{k-2}\|{\bm{x}}_{k-1}-{\bm{z}}_{k-2}\|^{2}-w_{k-1}\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}\big)+\frac{T(T-1)G^{2}\eta_{k}^{2}w_{k-1}}{2}.$

Summing over $k\in[K]$ and using the second part of Lemma 2.2, we have

		$\displaystyle\sum_{k=1}^{K}T\eta_{k}\Big[w_{k-1}\big(f({\bm{x}}_{k})-f({\bm{x}}_{*})\big)-\sum_{j=0}^{k-1}w_{j}(1-\lambda_{j})\big(f({\bm{x}}_{j})-f({\bm{x}}_{*})\big)\Big]$
	$\displaystyle\leq\;$	$\displaystyle\frac{T(T-1)G^{2}}{2}\sum_{k=1}^{K}\eta_{k}^{2}w_{k-1}+\frac{w_{-1}}{2}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2},$

where we also recall that ${\bm{z}}_{-1}={\bm{x}}_{*}$. Unrolling the terms on the left-hand side as in Eq. (A.5) and choosing $\lambda_{0}=1$, we obtain

		$\displaystyle T\eta_{K}w_{K-1}\big(f({\bm{x}}_{K})-f({\bm{x}}_{*})\big)$
		$\displaystyle+T\sum_{k=1}^{K-1}\Big[\eta_{k}w_{k-1}-w_{k}(1-\lambda_{k})\sum_{j=k+1}^{K}\eta_{j}\Big]\big(f({\bm{x}}_{k})-f({\bm{x}}_{*})\big)$
	$\displaystyle\leq\;$	$\displaystyle\frac{T(T-1)G^{2}}{2}\sum_{k=1}^{K}\eta_{k}^{2}w_{k-1}+\frac{w_{-1}}{2}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}.$		(B.7)

To obtain the last iterate guarantee, we choose $\lambda_{k}$ and $w_{k}$ such that

	$\displaystyle\lambda_{k}w_{k}\leq\;$	$\displaystyle w_{k-1},\quad 0\leq k\leq K-1,$
	$\displaystyle\eta_{k}w_{k-1}-w_{k}(1-\lambda_{k})\sum_{j=k+1}^{K}\eta_{j}\geq\;$	$\displaystyle 0,\quad 1\leq k\leq K-1.$

For simplicity and without loss of generality, we make both inequalities tight, which gives $\lambda_{k}=\frac{w_{k-1}}{w_{k}}$ and the recursion $w_{k}=\frac{\sum_{j=k}^{K}\eta_{j}}{\sum_{j=k+1}^{K}\eta_{j}}w_{k-1}$. In particular, we choose $w_{k}=\frac{\eta_{K}}{\sum_{j=k+1}^{K}\eta_{j}}$ for $0\leq k\leq K-1$, so that $w_{K-1}=1$. Dividing both sides of Eq. (B.7) by $T\eta_{K}$, we obtain

	$\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;$	$\displaystyle\frac{w_{-1}}{2T\eta_{K}}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}+\frac{G^{2}T}{2\eta_{K}}\sum_{k=1}^{K}\eta_{k}^{2}w_{k-1}$
	$\displaystyle=\;$	$\displaystyle\frac{1}{2T\sum_{k=1}^{K}\eta_{k}}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}+\frac{G^{2}T}{2}\sum_{k=1}^{K}\frac{\eta_{k}^{2}}{\sum_{j=k}^{K}\eta_{j}}.$
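The weight choice above can be verified in exact rational arithmetic: with $w_{k}=\frac{\eta_{K}}{\sum_{j=k+1}^{K}\eta_{j}}$ and $\lambda_{k}=\frac{w_{k-1}}{w_{k}}$, every bracketed coefficient in Eq. (B.7) vanishes and $w_{K-1}=1$. The sketch below is a minimal check under these choices; the helper names and the sample step sizes $\eta_{k}=1/k$ are hypothetical and used only for illustration.

```python
from fractions import Fraction


def tail_sum(etas, k):
    """sum_{j=k}^{K} eta_j for 1-indexed step sizes eta_1, ..., eta_K."""
    return sum(etas[k - 1:], Fraction(0))


def last_iterate_weights(etas):
    """Return dicts w[k] (k = -1..K-1) and lam[k] (k = 0..K-1)."""
    K = len(etas)
    w = {k: etas[-1] / tail_sum(etas, k + 1) for k in range(K)}
    w[-1] = w[0]            # lambda_0 = 1 makes lambda_0 * w_0 = w_{-1} tight
    lam = {0: Fraction(1)}
    for k in range(1, K):
        lam[k] = w[k - 1] / w[k]
    return w, lam


def coefficient_residuals(etas):
    """eta_k w_{k-1} - w_k (1 - lambda_k) sum_{j>k} eta_j for k = 1..K-1."""
    K = len(etas)
    w, lam = last_iterate_weights(etas)
    return [
        etas[k - 1] * w[k - 1] - w[k] * (1 - lam[k]) * tail_sum(etas, k + 1)
        for k in range(1, K)
    ]


if __name__ == "__main__":
    sample = [Fraction(1, k) for k in range(1, 6)]  # hypothetical step sizes
    print(coefficient_residuals(sample))
```

Using `Fraction` avoids floating-point noise, so the residuals are exactly zero rather than merely small.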

Finally, choosing $\eta_{k}\equiv\eta=\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{GT\sqrt{K}}$, we get

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\frac{G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{2\sqrt{K}}\Big(1+\sum_{k=1}^{K}\frac{1}{K-k+1}\Big)\leq\frac{G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|\big(1+\frac{1}{2}\log K\big)}{\sqrt{K}}.

Hence, given $\epsilon>0$, to guarantee $f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\epsilon$, the total number of individual gradient evaluations will be

TK=\widetilde{\mathcal{O}}\Big(\frac{G^{2}T\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{\epsilon^{2}}\Big),

completing the proof. ∎
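As an illustration of the last-iterate bound just derived, the sketch below runs the incremental proximal method with the constant step size $\eta=\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{GT\sqrt{K}}$ on a toy problem and compares the final optimality gap with $\frac{G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|(1+\frac{1}{2}\log K)}{\sqrt{K}}$. The problem instance is an assumption made for this example: $f_{t}(x)=|x-a_{t}|$ in one dimension with $G=1$, for which a median of the anchors is a minimizer. All function names are hypothetical.

```python
import math


def prox_abs(x: float, a: float, eta: float) -> float:
    """Proximal map of eta * |. - a| at x (soft-thresholding around a)."""
    d = x - a
    if d > eta:
        return x - eta
    if d < -eta:
        return x + eta
    return a


def run_ipm(x0: float, anchors, eta: float, K: int) -> float:
    """K cycles of exact incremental proximal steps; returns the last iterate."""
    x = x0
    for _ in range(K):
        for a in anchors:
            x = prox_abs(x, a, eta)
    return x


def objective(x: float, anchors) -> float:
    """f(x) = (1/T) * sum_t |x - a_t|."""
    return sum(abs(x - a) for a in anchors) / len(anchors)


def last_iterate_check(x0: float, anchors, K: int, G: float = 1.0):
    """Return (gap, bound) for the constant step size used in the proof."""
    T = len(anchors)
    x_star = sorted(anchors)[T // 2]      # a median minimizes f
    D = abs(x0 - x_star)
    eta = D / (G * T * math.sqrt(K))
    x_K = run_ipm(x0, anchors, eta, K)
    gap = objective(x_K, anchors) - objective(x_star, anchors)
    bound = G * D * (1 + math.log(K) / 2) / math.sqrt(K)
    return gap, bound


if __name__ == "__main__":
    for K in (1, 4, 25, 100):
        print(K, last_iterate_check(10.0, [-1.0, 0.0, 2.0, 5.0, 3.0], K))
```

The step size uses the distance to a known minimizer, which is available here only because the toy problem is solved in closed form; in practice this quantity would have to be estimated.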

B.3 Inexact Proximal Point Evaluations

We first prove the convergence results for the convex smooth setting. The following technical lemma bounds $f({\bm{x}}_{k})-f({\bm{z}})$ within each epoch with inexact proximal point evaluations.

Lemma B.3.

	$\displaystyle T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\;$	$\displaystyle 2\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}+\frac{\alpha}{\beta}T\big(f({\bm{z}})-f({\bm{x}}_{*})\big)+\frac{T}{\eta_{k}}\sum_{t=1}^{T}\varepsilon_{k-1,t}^{2}$
		$\displaystyle+\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\frac{1}{2\eta_{k}}\Big(1-\frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{\sqrt{\eta_{k}}}\Big)\|{\bm{x}}_{k}-{\bm{z}}\|^{2}+\frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{2\sqrt{\eta_{k}}}.$

Proof.

Since each $f_{t}$ is convex and $L$ -smooth, we have for $t\in[T]$

	$\displaystyle f_{t}({\bm{x}}_{k})-f_{t}({\bm{x}}_{k-1,t+1})\leq\left\langle\nabla f_{t}({\bm{x}}_{k}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\right\rangle-\frac{1}{2L}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2},$
	$\displaystyle f_{t}({\bm{x}}_{k-1,t+1})-f_{t}({\bm{z}})\leq\left\langle\nabla f_{t}({\bm{x}}_{k-1,t+1}),{\bm{x}}_{k-1,t+1}-{\bm{z}}\right\rangle-\frac{1}{2L}\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-\nabla f_{t}({\bm{z}})\|^{2}.$

Following the proof of Lemma 2.1, we add and subtract $\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}$ on the right-hand side of the second inequality and note that $\left\langle{\bm{g}}_{k-1,t},{\bm{z}}\right\rangle+\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}\geq\left\langle{\bm{g}}_{k-1,t},{\bm{x}}_{k-1,t+1}\right\rangle+\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t}-{\bm{x}}_{k-1,t+1}\|^{2}+\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|^{2}$. Combining the above inequalities, we obtain

	$\displaystyle f_{t}({\bm{x}}_{k})-f_{t}({\bm{z}})\leq\;$	$\displaystyle\left\langle\nabla f_{t}({\bm{x}}_{k}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\right\rangle+\left\langle\nabla f_{t}({\bm{x}}_{k-1,t+1})-{\bm{g}}_{k-1,t},{\bm{x}}_{k-1,t+1}-{\bm{z}}\right\rangle$
		$\displaystyle-\frac{1}{2L}\Big(\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2}+\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-\nabla f_{t}({\bm{z}})\|^{2}\Big)$
		$\displaystyle-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\|^{2}+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|^{2}\big).$

We decompose $\nabla f_{t}({\bm{x}}_{k})=\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})+\nabla f_{t}({\bm{x}}_{k-1,t+1})-{\bm{g}}_{k-1,t}+{\bm{g}}_{k-1,t}$ in the first inner product term on the right-hand side and sum the inequalities over $t\in[T]$. Noting that ${\bm{x}}_{k-1}={\bm{x}}_{k-1,1}$ and ${\bm{x}}_{k}={\bm{x}}_{k-1,T+1}$, we obtain

	$\displaystyle T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\;$	$\displaystyle\underbrace{\sum_{t=1}^{T}\left\langle{\bm{g}}_{k-1,t},{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\right\rangle-\frac{1}{2\eta_{k}}\sum_{t=1}^{T}\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1,t}\|^{2}}_{{\mathcal{T}}_{1}}$
		$\displaystyle+\underbrace{\sum_{t=1}^{T}\left\langle\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1}),{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\right\rangle}_{{\mathcal{T}}_{2}}$
		$\displaystyle+\underbrace{\sum_{t=1}^{T}\left\langle\nabla f_{t}({\bm{x}}_{k-1,t+1})-{\bm{g}}_{k-1,t},{\bm{x}}_{k}-{\bm{z}}\right\rangle}_{{\mathcal{T}}_{3}}+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big)$
		$\displaystyle-\frac{1}{2L}\sum_{t=1}^{T}\Big(\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2}+\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-\nabla f_{t}({\bm{z}})\|^{2}\Big).$

For the term ${\mathcal{T}}_{1}$ , we follow the argument from the proof of Theorem 2.3 to obtain

\displaystyle{\mathcal{T}}_{1}=-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k}-{\bm{x}}_{k-1}\|^{2}\leq 0.

For the term ${\mathcal{T}}_{2}$, noticing that ${\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}=-\eta_{k}\sum_{s=t+1}^{T}{\bm{g}}_{k-1,s}$ for $1\leq t\leq T-1$ and ${\bm{g}}_{k-1,s}={\bm{g}}_{k-1,s}-\nabla f_{s}({\bm{x}}_{k-1,s+1})+\nabla f_{s}({\bm{x}}_{k-1,s+1})-\nabla f_{s}({\bm{z}})+\nabla f_{s}({\bm{z}})-\nabla f_{s}({\bm{x}}_{*})+\nabla f_{s}({\bm{x}}_{*})$, we use Young’s inequality with parameters $\alpha>0$ and $\beta>0$ to obtain

	$\displaystyle{\mathcal{T}}_{2}=\;$	$\displaystyle\sum_{t=1}^{T-1}\Big\langle\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1}),-\eta_{k}\sum_{s=t+1}^{T}{\bm{g}}_{k-1,s}\Big\rangle$
	$\displaystyle\leq\;$	$\displaystyle\frac{1}{2L}\Big(\frac{1}{2}+\frac{1}{\alpha}+\frac{1}{\beta}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2}$
		$\displaystyle+\frac{\alpha\eta_{k}^{2}L}{2}\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big(\nabla f_{s}({\bm{z}})-\nabla f_{s}({\bm{x}}_{*})\big)\Big\|^{2}+2\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}$
		$\displaystyle+\frac{\beta\eta_{k}^{2}L}{2}\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big(\nabla f_{s}({\bm{x}}_{k-1,s+1})-\nabla f_{s}({\bm{z}})\big)\Big\|^{2}$
		$\displaystyle+2\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big({\bm{g}}_{k-1,s}-\nabla f_{s}({\bm{x}}_{k-1,s+1})\big)\Big\|^{2}.$

For the term ${\mathcal{T}}_{3}$, we use the Cauchy-Schwarz inequality and Young’s inequality to get

		$\displaystyle\sum_{t=1}^{T}\left\langle\nabla f_{t}({\bm{x}}_{k-1,t+1})-{\bm{g}}_{k-1,t},{\bm{x}}_{k}-{\bm{z}}\right\rangle$
	$\displaystyle\leq\;$	$\displaystyle\frac{1}{2\sqrt{\eta_{k}}}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-{\bm{g}}_{k-1,t}\|\|{\bm{x}}_{k}-{\bm{z}}\|^{2}+\frac{\sqrt{\eta_{k}}}{2}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-{\bm{g}}_{k-1,t}\|.$

Further using the fact that $\|\sum_{i=1}^{n}{\bm{x}}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|{\bm{x}}_{i}\|^{2}$ and combining the above bounds on ${\mathcal{T}}_{1}$, ${\mathcal{T}}_{2}$ and ${\mathcal{T}}_{3}$ with $\|{\bm{g}}_{k-1,t}-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|\leq\frac{\varepsilon_{k-1,t}}{\eta_{k}}$ for $t\in[T]$, we obtain

	$\displaystyle T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\;$	$\displaystyle\frac{1}{2L}\Big(\frac{1}{\alpha}+\frac{1}{\beta}-\frac{1}{2}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2}$
		$\displaystyle+\Big(\frac{\beta\eta_{k}^{2}T^{2}L}{2}-\frac{1}{2L}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-\nabla f_{t}({\bm{z}})\|^{2}$
		$\displaystyle+\frac{\alpha\eta_{k}^{2}T^{2}L}{2}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{z}})-\nabla f_{t}({\bm{x}}_{*})\|^{2}+2\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}$
		$\displaystyle+2T^{2}L\sum_{t=1}^{T}\varepsilon_{k-1,t}^{2}+\frac{1}{2\eta_{k}^{3/2}}\sum_{t=1}^{T}\varepsilon_{k-1,t}\|{\bm{x}}_{k}-{\bm{z}}\|^{2}+\frac{1}{2\sqrt{\eta_{k}}}\sum_{t=1}^{T}\varepsilon_{k-1,t}$
		$\displaystyle+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big).$

It remains to follow the proof of Lemma 2.1 and use $\eta_{k}\leq\frac{1}{\sqrt{\beta}TL}\leq\frac{1}{2TL}$ for $\beta\geq 4$ to obtain

	$\displaystyle T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\;$	$\displaystyle 2\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}+\frac{\alpha}{\beta}T\big(f({\bm{z}})-f({\bm{x}}_{*})\big)+\frac{T}{\eta_{k}}\sum_{t=1}^{T}\varepsilon_{k-1,t}^{2}$
		$\displaystyle+\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\frac{1}{2\eta_{k}}\Big(1-\frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{\sqrt{\eta_{k}}}\Big)\|{\bm{x}}_{k}-{\bm{z}}\|^{2}+\frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{2\sqrt{\eta_{k}}},$

thus finishing the proof. ∎

See Corollary 3.4.

Proof.

Using Lemma B.3 with ${\bm{z}}={\bm{z}}_{k-1}$, multiplying both sides by $\eta_{k}w_{k-1}$, and following the proof of Theorem 2.3, we have

		$\displaystyle T\eta_{k}w_{k-1}\big(f({\bm{x}}_{k})-f({\bm{z}}_{k-1})\big)$
	$\displaystyle\leq\;$	$\displaystyle 2T^{3}\eta_{k}^{3}w_{k-1}L\sigma_{*}^{2}+\frac{\alpha}{\beta}T\eta_{k}w_{k-1}\big(f({\bm{z}}_{k-1})-f({\bm{x}}_{*})\big)+Tw_{k-1}\sum_{t=1}^{T}\varepsilon_{k-1,t}^{2}+\frac{w_{k-1}\sqrt{\eta_{k}}}{2}\sum_{t=1}^{T}\varepsilon_{k-1,t}$
		$\displaystyle+\frac{\lambda_{k-1}^{2}w_{k-1}}{2}\|{\bm{x}}_{k-1}-{\bm{z}}_{k-2}\|^{2}-\frac{w_{k-1}\big(1-\sum_{t=1}^{T}\varepsilon_{k-1,t}/\sqrt{\eta_{k}}\big)}{2}\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}.$

Then we sum the above inequality over $k\in[K]$ and follow the proof of Theorem 2.3. To telescope the terms $\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}$, we need $\sum_{t=1}^{T}\varepsilon_{k-1,t}/\sqrt{\eta_{k}}\leq 1-\lambda_{k}$ for $1\leq k\leq K-1$, so that

\displaystyle\lambda_{k}^{2}w_{k}\leq\lambda_{k}w_{k}\Big(1-\sum_{t=1}^{T}\varepsilon_{k-1,t}/\sqrt{\eta_{k}}\Big)\leq w_{k-1}\Big(1-\sum_{t=1}^{T}\varepsilon_{k-1,t}/\sqrt{\eta_{k}}\Big).

In this case, we maintain the same requirements on $\{\lambda_{k}\}$ and $\{w_{k}\}$ to obtain the guarantee on the last iterate as in Theorem 2.3. In particular, we take the same choices with constant step sizes $\eta_{k}\equiv\eta$ such that $\lambda_{k}=\frac{w_{k-1}}{w_{k}}=\frac{(1+\frac{\alpha}{\beta})(K-k)}{1+(1+\frac{\alpha}{\beta})(K-k)}$ for $0\leq k\leq K-1$, so it suffices to let $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{\sqrt{\eta}}{1+(1+\frac{\alpha}{\beta})(K-k+1)}$ for $1\leq k\leq K$. Following the proof of Theorem 2.3, we obtain

	$\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;$	$\displaystyle\frac{w_{-1}}{2\eta T}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}+2\eta^{2}T^{2}\sigma_{*}^{2}L\sum_{k=1}^{K}w_{k-1}$
		$\displaystyle+\frac{1}{\eta}\sum_{k=1}^{K}\sum_{t=1}^{T}w_{k-1}\varepsilon_{k-1,t}^{2}+\frac{1}{2\sqrt{\eta}T}\sum_{k=1}^{K}\sum_{t=1}^{T}w_{k-1}\varepsilon_{k-1,t}.$

Plugging in the choice that $w_{k-1}\leq\frac{\mathrm{e}}{(K-k+1)^{\frac{1}{1+\alpha/\beta}}}$ for $1\leq k\leq K-1$ and $w_{K-1}=1$, we then have

\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2\eta TK^{\frac{1}{1+\alpha/\beta}}}+2\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}+\frac{\mathrm{e}}{2\eta T}\sum_{k=0}^{K-1}\sum_{t=1}^{T}\frac{2T\varepsilon_{k,t}^{2}+\sqrt{\eta}\varepsilon_{k,t}}{(K-k)^{\frac{1}{1+\alpha/\beta}}}.

Hence, given $\varepsilon>0$, to maintain the convergence rate obtained with exact proximal point evaluations, it suffices to take $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\sqrt{\eta}\min\big\{\frac{\varepsilon}{4\mathrm{e}^{2}(1+\log K)},\frac{1}{1+(1+\frac{\alpha}{\beta})(K-k+1)}\big\}$ for $1\leq k\leq K$. Indeed, we have

	$\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;$	$\displaystyle\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2\eta TK^{\frac{1}{1+\alpha/\beta}}}+2\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}+\sum_{k=0}^{K-1}\frac{2\mathrm{e}\varepsilon}{(K-k)^{\frac{1}{1+\alpha/\beta}}}$
	$\displaystyle\overset{(i)}{\leq}\;$	$\displaystyle\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2\eta TK^{\frac{1}{1+\alpha/\beta}}}+2\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}+2\mathrm{e}\varepsilon(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}.$

It remains to follow the proof of Theorem 2.3, and we choose $\sum_{t=1}^{T}\varepsilon_{k-1,t}=\sqrt{\eta}\min\big\{\varepsilon,\frac{1}{3(K-k+1)}\big\}$, assuming without loss of generality that $\varepsilon\leq\frac{1}{4\mathrm{e}^{2}(1+\log K)}$. ∎

We now prove convergence with inexact proximal point evaluations in the convex Lipschitz setting.

Lemma B.4.

Under Assumptions 1 and 2, for any ${\bm{z}}\in\mathbb{R}^{d}$ that is fixed in the $k$-th cycle of Alg. 2, we have for $k\in[K]$

	$\displaystyle T(f({\bm{x}}_{k})-f({\bm{z}}))\leq\;$	$\displaystyle\frac{1}{2\eta_{k}}\Big(1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k}-{\bm{z}}\|^{2}$
		$\displaystyle+\frac{T(T-1)G^{2}\eta_{k}}{2}+\frac{1}{\eta_{k}}\Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}+3GT\sum_{t=1}^{T}\varepsilon_{k-1,t}.$		(B.8)

Proof.

By Lipschitzness of each component function, we have for $t\in[T-1]$

\displaystyle f_{t}({\bm{x}}_{k})-f_{t}({\bm{x}}_{k-1,t+1})\leq G\|{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\|=G\eta_{k}\Big\|\sum_{s=t+1}^{T}{\bm{g}}_{k-1,s}\Big\|.

Decomposing ${\bm{g}}_{k-1,s}={\bm{g}}_{k-1,s}-\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})+\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})$ and using the triangle inequality, we have

	$\displaystyle f_{t}({\bm{x}}_{k})-f_{t}({\bm{x}}_{k-1,t+1})$
$\displaystyle\leq\;$	$\displaystyle\eta_{k}G\sum_{s=t+1}^{T}\Big(\|{\bm{g}}_{k-1,s}-\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|+\|\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|\Big)$
$\displaystyle\overset{(i)}{\leq}\;$	$\displaystyle\eta_{k}G\sum_{s=t+1}^{T}\Big(\frac{\varepsilon_{k-1,s}}{\eta_{k}}+G\Big)\leq(T-t)G^{2}\eta_{k}+G\sum_{s=t+1}^{T}\varepsilon_{k-1,s},$	(B.9)

where we use Eq. (3.2) and the fact that $\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\in\partial f_{s}(\mathrm{prox}_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s}))$ for $(i)$. On the other hand, using convexity of $f_{t}$, we have for $t\in[T]$ that

	$\displaystyle f_{t}({\bm{z}})\geq\;$	$\displaystyle f_{t}({\bm{x}}_{k-1,t+1})+\left\langle\nabla M_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t}),{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle$
	$\displaystyle=\;$	$\displaystyle f_{t}({\bm{x}}_{k-1,t+1})+\left\langle{\bm{g}}_{k-1,t},{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle+\left\langle\nabla M_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t})-{\bm{g}}_{k-1,t},{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle.$

Expanding the inner product in the above inequality and using the Cauchy-Schwarz inequality with Eq. (3.2) leads to

	$\displaystyle f_{t}({\bm{x}}_{k-1,t+1})-f_{t}({\bm{z}})$
$\displaystyle\leq\;$	$\displaystyle-\frac{1}{\eta_{k}}\left\langle{\bm{x}}_{k-1,t}-{\bm{x}}_{k-1,t+1},{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle+\|\nabla M_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t})-{\bm{g}}_{k-1,t}\|\|{\bm{z}}-{\bm{x}}_{k-1,t+1}\|$
$\displaystyle\leq\;$	$\displaystyle\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|^{2}\big)+\frac{\varepsilon_{k-1,t}}{\eta_{k}}\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|.$	(B.10)

Using the triangle inequality and decomposing ${\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1}=-\eta_{k}\sum_{s=1}^{t}\big({\bm{g}}_{k-1,s}-\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})+\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\big)$, we bound the term ${\mathcal{T}}:=\sum_{t=1}^{T}\varepsilon_{k-1,t}\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|$ arising from Eq. (B.10) as follows:

	$\displaystyle{\mathcal{T}}\leq\;$	$\displaystyle\sum_{t=1}^{T}\varepsilon_{k-1,t}\big(\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1}\|+\|{\bm{x}}_{k-1}-{\bm{z}}\|\big)$
	$\displaystyle\leq\;$	$\displaystyle\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big(\eta_{k}\sum_{s=1}^{t}\big(\|{\bm{g}}_{k-1,s}-\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|+\|\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|\big)+\|{\bm{x}}_{k-1}-{\bm{z}}\|\Big)$
	$\displaystyle\leq\;$	$\displaystyle\sum_{t=1}^{T}\varepsilon_{k-1,t}\sum_{s=1}^{t}\varepsilon_{k-1,s}+\big(\eta_{k}GT+\|{\bm{x}}_{k-1}-{\bm{z}}\|\big)\sum_{t=1}^{T}\varepsilon_{k-1,t}$
	$\displaystyle\overset{(i)}{\leq}\;$	$\displaystyle\Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}+2\eta_{k}GT\sum_{t=1}^{T}\varepsilon_{k-1,t}+\frac{1}{4\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2},$

where we use Young’s inequality for $(i)$. Combining Eqs. (B.9) and (B.10) with the above bound on ${\mathcal{T}}$ and noticing that ${\bm{x}}_{k-1,T+1}={\bm{x}}_{k}$ and ${\bm{x}}_{k-1,1}={\bm{x}}_{k-1}$, we sum the inequalities over $t\in[T]$ and obtain

	$\displaystyle T(f({\bm{x}}_{k})-f({\bm{z}}))\leq\;$	$\displaystyle\frac{1}{2\eta_{k}}\Big(1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k}-{\bm{z}}\|^{2}$
		$\displaystyle+\frac{T(T-1)G^{2}\eta_{k}}{2}+\frac{1}{\eta_{k}}\Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}+3GT\sum_{t=1}^{T}\varepsilon_{k-1,t},$

thus finishing the proof. ∎

See Corollary 3.5.

Proof.

Using Lemma B.4 with ${\bm{z}}={\bm{z}}_{k-1}$ defined by Eq. (2.1) and multiplying both sides by $\eta_{k}w_{k-1}$, we have

		$\displaystyle T\eta_{k}w_{k-1}(f({\bm{x}}_{k})-f({\bm{z}}_{k-1}))$
	$\displaystyle\leq\;$	$\displaystyle\frac{w_{k-1}\lambda_{k-1}^{2}\big(1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\big)}{2}\|{\bm{x}}_{k-1}-{\bm{z}}_{k-2}\|^{2}-\frac{w_{k-1}}{2}\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}$
		$\displaystyle+\frac{T(T-1)G^{2}\eta_{k}^{2}w_{k-1}}{2}+\frac{w_{k-1}}{2}\Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}+3GT\eta_{k}w_{k-1}\sum_{t=1}^{T}\varepsilon_{k-1,t}.$

Then we sum the inequalities over $k\in[K]$ and follow the proof of Theorem 3.3. To telescope the terms $\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}$, we need $\lambda_{k-1}\leq\frac{1}{1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}}$ for $1\leq k\leq K-1$, so that

\displaystyle w_{k-1}\lambda_{k-1}^{2}\Big(1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)\leq w_{k-1}\lambda_{k-1}\leq w_{k-2},

while we maintain the other requirements on $\{\lambda_{k}\}$ and $\{w_{k}\}$ to obtain the last iterate convergence as in Theorem 3.3. In particular, we take the same choice that $w_{k}=\frac{\eta_{K}}{\sum_{j=k+1}^{K}\eta_{j}}$ and $\lambda_{k}=\frac{\sum_{j=k+1}^{K}\eta_{j}}{\sum_{j=k}^{K}\eta_{j}}$ for $0\leq k\leq K-1$, so it suffices to let $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{2\eta_{k}\eta_{k-1}GT}{\sum_{j=k}^{K}\eta_{j}}$. We thus arrive at

	$\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\;$	$\displaystyle\frac{w_{-1}}{2T\eta_{K}}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}+\frac{G^{2}T}{2\eta_{K}}\sum_{k=1}^{K}\eta_{k}^{2}w_{k-1}$
		$\displaystyle+\frac{3G}{\eta_{K}}\sum_{k=1}^{K}w_{k-1}\eta_{k}\sum_{t=1}^{T}\varepsilon_{k-1,t}+\frac{1}{2T\eta_{K}}\sum_{k=1}^{K}\Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}w_{k-1}$
	$\displaystyle=\;$	$\displaystyle\frac{1}{2T\sum_{k=1}^{K}\eta_{k}}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}+\frac{G^{2}T}{2}\sum_{k=1}^{K}\frac{\eta_{k}^{2}}{\sum_{j=k}^{K}\eta_{j}}$
		$\displaystyle+3G\sum_{k=1}^{K}\sum_{t=1}^{T}\frac{\varepsilon_{k-1,t}\eta_{k}}{\sum_{j=k}^{K}\eta_{j}}+\frac{1}{2T}\sum_{k=1}^{K}\frac{(\sum_{t=1}^{T}\varepsilon_{k-1,t})^{2}}{\sum_{j=k}^{K}\eta_{j}}.$

Hence, given $\varepsilon>0$ and taking the constant step size $\eta_{k}\equiv\eta=\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{GT\sqrt{K}}$ for simplicity, to maintain the convergence rate as in Theorem 3.3 with inexact proximal point evaluations, it suffices to let $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{2\eta GT}{K-k+1}$. Indeed, we have

\displaystyle 3G\sum_{k=1}^{K}\sum_{t=1}^{T}\frac{\varepsilon_{k-1,t}\eta_{k}}{\sum_{j=k}^{K}\eta_{j}}=3G\sum_{k=1}^{K}\frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{K-k+1}\leq 6\eta G^{2}T\sum_{k=1}^{K}\frac{1}{(K-k+1)^{2}}\leq\frac{10G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{\sqrt{K}},

and

\displaystyle\frac{1}{2T}\sum_{k=1}^{K}\frac{(\sum_{t=1}^{T}\varepsilon_{k-1,t})^{2}}{\sum_{j=k}^{K}\eta_{j}}\leq\frac{2G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{\sqrt{K}}\sum_{k=1}^{K}\frac{1}{(K-k+1)^{3}}\leq\frac{2.5G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{\sqrt{K}}.

It remains to follow the proof of Theorem 3.3, thus finishing the proof. ∎
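The last estimate in the proof above relies on the cubic sum $\sum_{k=1}^{K}\frac{1}{(K-k+1)^{3}}$ staying below $1.25$ for every $K$, which turns the prefactor $2$ into the constant $2.5$. The short sketch below checks this numerically for a few values of $K$; the function name `cubic_tail_sum` and the sampled values are assumptions made for illustration.

```python
def cubic_tail_sum(K: int) -> float:
    """Return sum_{k=1}^{K} 1 / (K - k + 1)**3, i.e. sum_{n=1}^{K} 1 / n**3."""
    return sum(1.0 / (K - k + 1) ** 3 for k in range(1, K + 1))


if __name__ == "__main__":
    for K in (1, 10, 1000, 100000):
        # The prefactor 2 in the proof times this sum should not exceed 2.5.
        print(K, 2 * cubic_tail_sum(K))
```

The sum is increasing in $K$, so checking a large $K$ gives a reasonable indication of the limiting value, although it is not a proof of the bound.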

