Last Iterate Convergence of Incremental Methods
and Applications in Continual Learning

Xufeng Cai, Jelena Diakonikolas
Department of Computer Sciences, University of Wisconsin-Madison. XC ([email protected]), JD ([email protected]).
Abstract

Incremental gradient and incremental proximal methods are a fundamental class of optimization algorithms used for solving finite sum problems, broadly studied in the literature. Yet, without strong convexity, their convergence guarantees have primarily been established for the ergodic (average) iterate. Motivated by applications in continual learning, we obtain the first convergence guarantees for the last iterate of both incremental gradient and incremental proximal methods, in general convex smooth (for both) and convex Lipschitz (for the proximal variants) settings. Our oracle complexity bounds for the last iterate nearly match (i.e., match up to a square-root-log or a log factor) the best known oracle complexity bounds for the average iterate, for both classes of methods. We further obtain generalizations of our results to weighted averaging of the iterates with increasing weights and to randomly permuted ordering of updates. We study incremental proximal methods as a model of continual learning with generalization and argue that a large amount of regularization is crucial to preventing catastrophic forgetting. Our results generalize last iterate guarantees for incremental methods compared to the state of the art, as such results were previously known only for overparameterized linear models, which correspond to convex quadratic problems with infinitely many solutions.

1 Introduction

We study the last iterate convergence of incremental (gradient and proximal) methods, which apply to problems of the form

$$\min_{\bm{x}\in\mathbb{R}^{d}}\Big\{f(\bm{x}) := \frac{1}{T}\sum_{t=1}^{T}f_{t}(\bm{x})\Big\}. \tag{1.1}$$

As is standard, we assume that each component function $f_t$ is convex and either smooth or Lipschitz-continuous and that a minimizer $\bm{x}_* \in \operatorname*{arg\,min}_{\bm{x}} f(\bm{x})$ exists.

Incremental methods traverse all the component functions $f_t$ in a cyclic manner, updating their iterates by taking either gradient descent steps (in the case of incremental gradient methods) or proximal-point steps (in the case of incremental proximal methods) with respect to the individual component functions $f_t$. For a more precise statement of these two classes of methods, see Sections 2 and 3. As in prior work (Bertsekas et al., 2011; Bertsekas, 2011; Li et al., 2019; Mishchenko et al., 2020; Cai et al., 2023a), we define the oracle complexity of these methods as the number of first-order or proximal oracle queries to individual component functions $f_t$ required to reach a solution $\bm{x}$ with optimality gap $f(\bm{x}) - f(\bm{x}_*) \leq \epsilon$ on the worst-case instance from the considered problem class, where $\epsilon > 0$ is a given error parameter.

Our main motivation for studying the last iterate convergence of incremental methods comes from applications in continual learning (CL). In particular, CL models a sequential learning setting, where a machine learning model gets updated over time, based on the changing or evolving distribution of the data passed to the learner. A major challenge in such dynamic learning settings is the degradation of model performance on previously seen data, known as catastrophic forgetting (McCloskey and Cohen, 1989; Goodfellow et al., 2013), which has been well-documented in various empirical studies; see, e.g., recent surveys (De Lange et al., 2021; Parisi et al., 2019). On the theoretical front, however, much is still missing from the understanding of possibilities and limitations related to catastrophic forgetting, with results for basic learning settings being obtained only very recently (Evron et al., 2022, 2023; Lin et al., 2023b; Goldfarb and Hand, 2023; Goldfarb et al., 2024; Peng and Risteski, 2022; Peng et al., 2023; Chen et al., 2022; Cao et al., 2022; Balcan et al., 2015).

While there are different learning settings studied under the umbrella of CL, following recent work (Evron et al., 2022, 2023), we focus on the CL settings with repeated replaying of tasks, where the forgetting after $K$ epochs/full passes over $T$ tasks is defined by (Doan et al., 2021; Evron et al., 2022, 2023)

$$f_{K}(\bm{x}_{KT}) := \frac{1}{T}\sum_{t=1}^{T}f_{t}(\bm{x}_{KT}), \tag{1.2}$$

where $\bm{x}_n \in \mathbb{R}^d$ is the model parameter vector used at time $n \in \mathbb{N}_+$. Such settings arise in applications that naturally undergo cyclic changes in the data/tasks, due to diurnal or seasonal cycles (e.g., in agriculture, forestry, e-commerce, astronomy, etc.). Forgetting is catastrophic if $f_K(\bm{x}_{KT}) \overset{K\rightarrow\infty}{\nrightarrow} 0$.

Observe that Eq. (1.2) corresponds to the value of the objective function from Eq. (1.1) at the final iterate $\bm{x}_{KT}$. Prior work (Evron et al., 2022) that obtained rigorous bounds for the forgetting Eq. (1.2) applied to the problems where each $f_t$ is a convex quadratic function minimized by the same $\bm{x}_*$ such that $f_t(\bm{x}_*) = f(\bm{x}_*) = 0$. By contrast, we consider more general convex functions that are either smooth or Lipschitz continuous, and make no assumption about $\bm{x}_*$ beyond being a minimizer of the (average) function $f$. Since we are not assuming that $f(\bm{x}_*) = 0$, our focus is on bounding the excess forgetting $f_K(\bm{x}_{KT}) - f(\bm{x}_*)$, which is equivalently the optimality gap for the last iterate in Eq. (1.1).

The method considered in Evron et al. (2022) minimized each component function exactly, outputting the solution closest to the previous iterate in each iteration, using implicit regularization properties of SGD. To obtain the results, it was then crucial that the component functions were quadratic (so that there is an explicit, closed-form solution for each subproblem) and that all component functions shared a nonempty set of minimizers with value zero (so that forgetting can be controlled despite aggressive adaptation to the current task). Our work instead uses the incremental proximal method, which relies on explicit regularization to enforce closeness of models on differing tasks. This can potentially degrade the performance on the current task, but in exchange it controls forgetting and addresses a much broader class of loss functions.

1.1 Contributions

Our main contributions can be summarized as follows, where $\sigma_*^2 := \frac{1}{T}\sum_{t=1}^{T}\|\nabla f_t(\bm{x}_*)\|^2$ denotes the gradient variance at the optimum. The quantity $\sigma_*^2$ is intrinsic to oracle complexity of incremental methods (Mishchenko et al., 2020; Nguyen et al., 2021; Cai et al., 2023a; Cha et al., 2023).

Last iterate convergence of Incremental Gradient Descent (IGD).

We provide the first oracle complexity guarantees for the last iterate of standard variants of IGD with either deterministic or randomly permuted ordering of the updates, applied to convex $L$-smooth objectives. Up to a square-root-log factor, our oracle complexity bounds in Theorem 2.3 and Corollary 2.5 – which are $\widetilde{\mathcal{O}}\big(\frac{TL\|\bm{x}_0-\bm{x}_*\|^2}{\epsilon}+\frac{TL^{1/2}\sigma_*\|\bm{x}_0-\bm{x}_*\|^2}{\epsilon^{3/2}}\big)$ for the deterministic variant and $\widetilde{\mathcal{O}}\big(\frac{TL\|\bm{x}_0-\bm{x}_*\|^2}{\epsilon}+\frac{\sqrt{TL}\sigma_*\|\bm{x}_0-\bm{x}_*\|^2}{\epsilon^{3/2}}\big)$ for the randomly permuted variant – match the best known oracle complexity bounds for these methods, previously known only for the (uniformly) average iterate (Mishchenko et al., 2020; Nguyen et al., 2021; Cai et al., 2023a; Cha et al., 2023). We further extend our results to increasing weighted averaging of the iterates in Corollary 2.4, which places more weight on the more recent iterates, removing the excess square-root-log factor in the resulting oracle complexity bound.

Last iterate convergence of Incremental Proximal Method (IPM).

We provide the first oracle complexity guarantees for the last iterate of IPM applied to convex and either smooth or Lipschitz-continuous objectives. When each component function is convex and $L$-smooth, we show (in Theorem 3.1) that IPM has the same $\widetilde{\mathcal{O}}\big(\frac{TL\|\bm{x}_0-\bm{x}_*\|^2}{\epsilon}+\frac{TL^{1/2}\sigma_*\|\bm{x}_0-\bm{x}_*\|^2}{\epsilon^{3/2}}\big)$ oracle complexity guarantee as IGD. This result is new for any variant of this method – with average or last iterate as its output. When component functions are convex and $G$-Lipschitz, our oracle complexity $\widetilde{\mathcal{O}}\Big(\frac{G^2T\|\bm{x}_0-\bm{x}_*\|^2}{\epsilon^2}\Big)$ in Theorem 3.3 matches the best known oracle complexity bound up to a log factor, which was previously known only for the (uniformly) average iterate (Bertsekas, 2011; Li et al., 2019). We further argue (in Corollary 3.4 and Corollary 3.5) that for both settings our analysis can be extended to admit inexact proximal point evaluations – an important setting not addressed by prior work on general IPM.

IPM as a model of CL.

We initiate the study of IPM as a model of CL, corresponding to sequential ridge-regularized model training commonly used in practice. On the positive side, our last-iterate convergence results for IPM in Theorem 3.1 and Theorem 3.3 demonstrate that forgetting (corresponding to the optimality gap at the last iterate) can be effectively controlled if the amount of employed regularization is sufficiently high. On the negative side, we show that for any constant amount of regularization, forgetting is always catastrophic, even for least squares problems. In particular, we provide a univariate quadratic example such that for any constant regularization parameter, the asymptotic limit of (excess) forgetting is non-zero. Further, we demonstrate that for forgetting to be made smaller than some target $\epsilon$, the regularization must be sufficiently high and depend polynomially on $1/\epsilon$, $T$, and $\sigma_*$. These results are summarized in Theorem 3.2 and highlight the limitations of regularization as a black-box tool for controlling forgetting in CL.
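The effect of a constant regularization parameter can be illustrated numerically. The sketch below is a hypothetical two-task instance chosen for illustration, not the example used in Theorem 3.2. It assumes the IPM step takes the form $\bm{x} \leftarrow \mathrm{prox}_{\eta f_t}(\bm{x})$, so that the regularization weight is $1/\eta$; the function names and the targets $\pm 1$ are illustrative assumptions.

```python
def prox_quadratic(x: float, a: float, eta: float) -> float:
    """Closed-form prox of f(y) = 0.5 * (y - a)^2 with parameter eta."""
    return (x + eta * a) / (1.0 + eta)


def ipm_last_iterate(targets, x0: float, eta: float, num_epochs: int) -> float:
    """Cyclic incremental proximal steps with a constant parameter eta (assumed form)."""
    x = x0
    for _ in range(num_epochs):
        for a in targets:  # one proximal step per task, in a fixed cyclic order
            x = prox_quadratic(x, a, eta)
    return x


def excess_forgetting(x: float, targets) -> float:
    """f(x) - f(x_*) for f = average of 0.5 * (x - a_t)^2, where x_* = mean(a_t)."""
    T = len(targets)
    x_star = sum(targets) / T
    f = lambda z: sum(0.5 * (z - a) ** 2 for a in targets) / T
    return f(x) - f(x_star)


targets = [1.0, -1.0]  # hypothetical two-task instance
for eta in (1.0, 0.1, 0.01):
    x = ipm_last_iterate(targets, x0=3.0, eta=eta, num_epochs=5000)
    print(eta, x, excess_forgetting(x, targets))
```

For this toy instance, one full cycle maps $x$ to $(x - \eta^2)/(1+\eta)^2$, whose fixed point is $-\eta/(2+\eta)$. The limiting excess forgetting is therefore $\eta^2/(2(2+\eta)^2)$, which is non-zero for every constant $\eta > 0$ and shrinks only as the regularization weight $1/\eta$ grows.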

1.2 Further related work

To remedy catastrophic forgetting, various empirical approaches have been proposed, and this work is most closely related to (i) memory-based approaches, which store samples from previous tasks and reuse those data for training on the current task (Robins, 1995; Lopez-Paz and Ranzato, 2017; Rolnick et al., 2019) and (ii) regularization-based approaches, which regularize the loss of the current task to ensure the new model parameter vector is close to the prior ones (Kirkpatrick et al., 2017); for a more complete survey, we refer to De Lange et al. (2021). On the theoretical front, the results on the forgetting for cyclic replaying of tasks considered in our work have only been established for linear models (Evron et al., 2022, 2023; Swartworth et al., 2024). In particular, the analysis for linear regression tasks in Evron et al. (2022); Swartworth et al. (2024) crucially relies on the exact minimization of each task using (S)GD to have closed-form updates between tasks, while Evron et al. (2023) uses alternating projections to analyze linear classification tasks. It is unclear how to extend either of these results to general convex loss functions that we consider.

On a technical level, our results are most closely related to 1) the literature on last iterate convergence guarantees for subgradient-based methods and stochastic gradient descent (SGD) and 2) the literature on incremental gradient methods and shuffled SGD. For 1), we draw inspiration from the recent results (Zamani and Glineur, 2023; Liu and Zhou, 2023), which rely on a clever construction of reference points $\bm{z}_k$ with respect to which a gap quantity $f(\bm{x}_k) - f(\bm{z}_{k-1})$ gets bounded to deduce a bound for the optimality gap $f(\bm{x}_k) - f(\bm{x}_*)$ of the last iterate $\bm{x}_k$. For the latter line of work 2), we generalize the analysis used exclusively for the optimality gap of the (uniformly) average iterate (Mishchenko et al., 2020; Nguyen et al., 2021; Cha et al., 2023; Cai et al., 2023a) to control the gap-like quantities $f(\bm{x}_k) - f(\bm{z}_{k-1})$, which requires a more careful argument for controlling all error terms introduced by replacing $\bm{x}_*$ by $\bm{z}_{k-1}$ without introducing spurious unrealistic assumptions about the magnitudes of the component functions’ gradients.
We finally note that both these related lines of work concern problems on which progress was made only in the very recent literature. In particular, while the oracle complexity upper bound for the average iterate of SGD in convex Lipschitz-continuous settings has been known for decades and its analysis is routinely taught in optimization and machine learning classes, there were no such results for the last iterate of SGD until 2013 (Shamir and Zhang, 2013) with improvements and generalizations to these results obtained as recently as in the past year (Liu and Zhou, 2023). Regarding 2), obtaining any nonasymptotic convergence guarantees for incremental gradient methods/shuffled SGD had remained open for decades (Bottou, 2009) until a recent line of work (Gürbüzbalaban et al., 2021; Shamir, 2016; Haochen and Sra, 2019; Nagaraj et al., 2019; Ahn et al., 2020; Rajput et al., 2020; Yun et al., 2022; Safran and Shamir, 2020). For nonconvex problems, Yu and Li (2023) proved a high-probability last iterate guarantee for shuffled SGD with stopping criteria, which is technically disjoint from our work. For the smooth convex problems we consider, the convergence results were obtained only in the past few years (Mishchenko et al., 2020; Nguyen et al., 2021; Cha et al., 2023) and improved in Cai et al. (2023a) using a fine-grained analysis inspired by the recent advances in cyclic methods (Song and Diakonikolas, 2023; Cai et al., 2023b; Lin et al., 2023a). However, all those results are for the (uniformly) average iterate, while obtaining convergence results for the last iterate had remained open.

Concurrent independent work.

An independent and concurrent work to ours (Liu and Zhou, 2024) studied the last-iterate convergence of shuffled SGD for composite (strongly) convex smooth/Lipschitz optimization. For the same problems as studied in our Section 2, they obtained the same convergence results. The remaining results in Liu and Zhou (2024) and our work are not directly comparable, as the motivation for the two works and the studied settings are different. In particular, the focus of Liu and Zhou (2024) is on the last iterate convergence of shuffled SGD, and thus they study it in depth, considering different Lipschitz/smoothness constants for component functions, strong convexity, and composite settings. Our focus, on the other hand, is on settings relevant to continual learning, and thus we additionally consider the convergence of a weighted average of iterates and put more weight on the incremental proximal method, neither of which was considered in Liu and Zhou (2024). It is of note that while Liu and Zhou (2024) used proximal steps to handle the nonsmooth portion of the objective in their composite setting, the proximal maps are not applied component-wise, but at the end of a cycle, to only one (regularizer) function (e.g., to handle constraints or joint regularization).

1.3 Notation and preliminaries

We consider the $d$-dimensional real space $(\mathbb{R}^d, \|\cdot\|)$, where $\|\cdot\|$ is the $\ell_2$ norm, and denote $[T] := \{1, 2, \dots, T\}$. Given a proper, convex, lower semicontinuous function $f$, its proximal operator and Moreau envelope are defined by

$$\mathrm{prox}_{\eta f}(\bm{x}) = \operatorname*{arg\,min}_{\bm{y}\in\mathbb{R}^d}\Big\{\frac{1}{2\eta}\|\bm{y}-\bm{x}\|^2 + f(\bm{y})\Big\}, \quad M_{\eta f}(\bm{x}) = \min_{\bm{y}\in\mathbb{R}^d}\Big\{\frac{1}{2\eta}\|\bm{y}-\bm{x}\|^2 + f(\bm{y})\Big\},$$

respectively, for a parameter $\eta > 0$. The Moreau envelope is $\frac{1}{\eta}$-smooth with the gradient $\nabla M_{\eta f}(\bm{x}) = \frac{1}{\eta}(\bm{x} - \mathrm{prox}_{\eta f}(\bm{x})) \in \partial f(\mathrm{prox}_{\eta f}(\bm{x}))$, where $\partial f(\cdot)$ denotes the subdifferential of $f$.
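As a concrete instance of these definitions, the following minimal sketch takes the univariate function $f(y) = |y|$, an illustrative choice that does not appear in the paper. Its proximal operator is the standard soft-thresholding map. The sketch compares the gradient identity for the Moreau envelope against a central finite difference; all function names are hypothetical.

```python
def prox_abs(x: float, eta: float) -> float:
    """Proximal operator of f(y) = |y| with parameter eta (soft-thresholding)."""
    if x > eta:
        return x - eta
    if x < -eta:
        return x + eta
    return 0.0


def moreau_abs(x: float, eta: float) -> float:
    """Moreau envelope of f(y) = |y|, evaluated at the proximal point."""
    p = prox_abs(x, eta)
    return (p - x) ** 2 / (2.0 * eta) + abs(p)


def moreau_grad_abs(x: float, eta: float) -> float:
    """Gradient of the Moreau envelope via the identity (x - prox(x)) / eta."""
    return (x - prox_abs(x, eta)) / eta


eta = 0.5
h = 1e-6  # finite-difference step; test points stay away from the kinks at +/- eta
for x in (-2.0, -0.2, 0.3, 1.5):
    fd = (moreau_abs(x + h, eta) - moreau_abs(x - h, eta)) / (2 * h)
    print(x, prox_abs(x, eta), moreau_grad_abs(x, eta), fd)
```

In this example the gradient always lies in $[-1, 1]$, consistent with it being a subgradient of $|\cdot|$ at the proximal point.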

We make the following assumptions. The first one is made throughout the paper.

Assumption 1.

Each $f_t$ is convex and there exists a minimizer $\bm{x}_* \in \operatorname*{arg\,min}_{\bm{x}\in\mathbb{R}^d} f(\bm{x})$.

By Assumption 1, $f$ is also convex. In nonsmooth settings, we make an additional standard assumption that the component functions are Lipschitz-continuous.

Assumption 2.

Each $f_t$ is $G$-Lipschitz, i.e., $|f_t(\bm{x}) - f_t(\bm{y})| \leq G\|\bm{x} - \bm{y}\|$ for any $\bm{x}, \bm{y} \in \mathbb{R}^d$; thus $\|g_t(\bm{x})\| \leq G$ for all $g_t(\bm{x}) \in \partial f_t(\bm{x})$.

For the smooth settings, we make the following assumption.

Assumption 3.

Each $f_t$ is $L$-smooth, i.e., $\|\nabla f_t(\bm{x}) - \nabla f_t(\bm{y})\| \leq L\|\bm{x} - \bm{y}\|$ for any $\bm{x}, \bm{y} \in \mathbb{R}^d$.

We remark that Assumptions 2 and 3 imply that $f$ is also $G$-Lipschitz and $L$-smooth, respectively. These two assumptions can also be generalized to allow distinct Lipschitz/smoothness constants for the component functions, and our results would scale with the average Lipschitz/smoothness constant using the techniques from Cai et al. (2023a), which we omit to keep the focus on the intricacies of the last iterate convergence. When $f$ is $L$-smooth and convex, we will often make use of the following standard inequality that fully characterizes the class of $L$-smooth convex functions:

$$\frac{1}{2L}\|\nabla f(\bm{x}) - \nabla f(\bm{y})\|^2 \leq f(\bm{y}) - f(\bm{x}) - \left\langle \nabla f(\bm{x}), \bm{y} - \bm{x}\right\rangle, \quad \forall \bm{x}, \bm{y} \in \mathbb{R}^d. \tag{1.3}$$
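As a quick sanity check of Eq. (1.3), consider the quadratic $f(\bm{x}) = \frac{L}{2}\|\bm{x}\|^2$, an illustrative choice that is not taken from the paper. It is convex and $L$-smooth, and the inequality holds with equality, as the short computation below shows.

```latex
% Worked check of (1.3) for f(x) = (L/2)||x||^2 (illustrative example).
\begin{align*}
\nabla f(\bm{x}) &= L\bm{x}, \\
\frac{1}{2L}\|\nabla f(\bm{x}) - \nabla f(\bm{y})\|^{2}
  &= \frac{L}{2}\|\bm{x} - \bm{y}\|^{2}, \\
f(\bm{y}) - f(\bm{x}) - \langle \nabla f(\bm{x}), \bm{y} - \bm{x}\rangle
  &= \frac{L}{2}\|\bm{y}\|^{2} + \frac{L}{2}\|\bm{x}\|^{2} - L\langle \bm{x}, \bm{y}\rangle
   = \frac{L}{2}\|\bm{x} - \bm{y}\|^{2}.
\end{align*}
```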

Finally, when each $f_t$ is smooth, we assume bounded variance at $\bm{x}_*$, as in all prior work that considered the same settings of IGD/shuffled SGD as we do (Mishchenko et al., 2020; Nguyen et al., 2021; Tran et al., 2021, 2022; Cai et al., 2023a).

Assumption 4.

The quantity $\sigma_*^2 := \frac{1}{T}\sum_{t=1}^{T}\|\nabla f_t(\bm{x}_*)\|^2$ is bounded.
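To make the quantity in Assumption 4 concrete, the following sketch evaluates it for univariate quadratics $f_t(x) = \frac{1}{2}(x - a_t)^2$. This family is a hypothetical illustration, and the function name and target values are assumptions. Here the minimizer of the average is the mean of the $a_t$, so the quantity reduces to the empirical variance of the targets and vanishes exactly when all tasks share a minimizer.

```python
def sigma_star_sq(targets):
    """sigma_*^2 for f_t(x) = 0.5 * (x - a_t)^2, whose average is minimized at mean(a_t)."""
    T = len(targets)
    x_star = sum(targets) / T
    grads = [x_star - a for a in targets]  # grad f_t(x_*) = x_* - a_t
    return sum(g * g for g in grads) / T


print(sigma_star_sq([1.0, 1.0, 1.0]))  # all tasks share the minimizer
print(sigma_star_sq([-1.0, 1.0]))      # tasks disagree about the minimizer
```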

2 Last Iterate Convergence of Incremental Gradient Descent

In this section, we introduce our techniques for analyzing the last iterate guarantee and bound the oracle complexity for the last iterate of incremental gradient descent (IGD), assuming component functions are smooth and convex. In the context of CL, this corresponds to a simplified setup where the learner incrementally performs a single gradient step on each task and cyclically replays the $T$ tasks. Nevertheless, this setup serves as a warmup to the proximal setup we discuss in the next section. Additionally, it is of independent interest as incremental gradient methods are widely used in the optimization and machine learning literature, where despite the lack of prior theoretical justification, it is typically the last iterate that gets output by the algorithm in practice.

We summarize the IGD method in Alg. 1, assuming the incremental order $1, 2, \dots, T$ in each epoch for simplicity and without loss of generality. The oracle complexity for the (uniformly) average iterate of IGD has been shown to be $\mathcal{O}\big(\frac{TL}{\epsilon} + \frac{T\sqrt{L}\sigma_*}{\epsilon^{3/2}}\big)$ for an $\epsilon$-optimality gap (Mishchenko et al., 2020; Cai et al., 2023a) under the same assumptions we make here (Assumptions 1, 3, and 4), while, as discussed before, there were no guarantees for either the last iterate or even a weighted average of the iterates. The main result of this section is that the same oracle complexity applies to the last iterate of IGD, up to a square-root-log factor. We then further generalize this result to weighted averages of iterates with increasing weights and to variants with randomly permuted order of cyclic updates.

Algorithm 1 Incremental Gradient Descent (IGD)
  Input: initial point $\bm{x}_0$, number of epochs $K$, step sizes $\{\eta_k\}$
  for $k = 1 : K$ do
     $\bm{x}_{k-1,1} = \bm{x}_{k-1}$
     for $t = 1 : T$ do
        $\bm{x}_{k-1,t+1} = \bm{x}_{k-1,t} - \eta_k \nabla f_t(\bm{x}_{k-1,t})$
     $\bm{x}_k = \bm{x}_{k-1,T+1}$
  return $\bm{x}_K$
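As an illustration, Alg. 1 can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function name `igd`, the callable-based interface for the component gradients, and the least-squares toy instance are all assumptions made for the example.

```python
import numpy as np

def igd(grads, x0, num_epochs, step_size):
    """Incremental gradient descent with the fixed cyclic order 1, ..., T.

    grads:     list of T callables; grads[t](x) returns the gradient of f_t at x
               (hypothetical interface, chosen for this sketch).
    step_size: callable k -> eta_k for epochs k = 1, ..., K.
    """
    x = np.asarray(x0, dtype=float).copy()
    for k in range(1, num_epochs + 1):
        eta = step_size(k)
        for grad_t in grads:
            # x_{k-1,t+1} = x_{k-1,t} - eta_k * grad f_t(x_{k-1,t})
            x = x - eta * grad_t(x)
    return x  # last iterate x_K


# Toy instance (illustrative only): f_t(x) = 0.5 * (a_t @ x - b_t)^2.
rng = np.random.default_rng(0)
T, d = 5, 3
A = rng.normal(size=(T, d))
b = rng.normal(size=T)
grads = [lambda x, a=A[t], bt=b[t]: a * (a @ x - bt) for t in range(T)]

L = np.max(np.sum(A**2, axis=1))  # each f_t is ||a_t||^2-smooth
x_K = igd(grads, np.zeros(d), num_epochs=200, step_size=lambda k: 1.0 / (T * L))
```

The constant step size $1/(TL)$ in the toy run is only a convenient choice of the right order; the results below require the smaller value $\eta \leq \frac{1}{\sqrt{\beta} T L}$.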

We begin the analysis by deriving a bound on the gap with respect to an arbitrary but fixed reference point $\bm{z}$, as summarized in the following lemma with its proof in Appendix A. This stands in contrast to arguments deriving bounds on the average iterate, which take $\bm{z} = \bm{x}_*$. While this may seem like a minor difference, it affects the analysis non-trivially: a direct extension of prior arguments would require replacing Assumption 4, which imposes a bound on $\frac{1}{T}\sum_{t=1}^{T}\|\nabla f_t(\bm{x}_*)\|^2$, with a bound on $\frac{1}{T}\sum_{t=1}^{T}\|\nabla f_t(\bm{z})\|^2$ for an arbitrary $\bm{z}$, which would be a much stronger requirement.

Lemma 2.1.

Under Assumptions 1 and 3, for any $\bm{z} \in \mathbb{R}^d$ that is fixed in the $k$-th cycle of Alg. 1 and any $\alpha, \beta > 0$ such that $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$, if $\eta_k \leq \frac{1}{\sqrt{\beta} T L}$, then for all $k \in [K]$,

$$T\big(f(\bm{x}_k) - f(\bm{z})\big) \leq \eta_k^2 L \sum_{t=1}^{T} \Big\| \sum_{s=t}^{T} \nabla f_s(\bm{x}_*) \Big\|^2 + \frac{\alpha}{\beta} T \big(f(\bm{z}) - f(\bm{x}_*)\big) + \frac{1}{2\eta_k}\big(\|\bm{x}_{k-1} - \bm{z}\|^2 - \|\bm{x}_k - \bm{z}\|^2\big).$$

Our next step is to specify our choice of the reference point $\bm{z}$ for each epoch. In particular, we consider a sequence of points $\{\bm{z}_k\}_{k=-1}^{K-1}$ that is recursively defined as a convex combination of the algorithm iterate $\bm{x}_k$ and the previous reference point $\bm{z}_{k-1}$:

$$\bm{z}_k = (1 - \lambda_k)\bm{x}_k + \lambda_k \bm{z}_{k-1} \tag{2.1}$$

for $k \geq 0$ with $\bm{z}_{-1} = \bm{x}_*$ and $\lambda_k \in [0, 1]$ to be set later. Observe that $\bm{z}_k$ can also be written as a convex combination of the points $\{\bm{x}_j\}_{j=0}^{k}$ and $\bm{x}_*$ by unrolling the recursion, i.e.,

$$\bm{z}_k = (1 - \lambda_k)\bm{x}_k + \Big(\prod_{i=0}^{k} \lambda_i\Big)\bm{x}_* + \sum_{j=0}^{k-1} \Big(\prod_{i=j+1}^{k} \lambda_i\Big)(1 - \lambda_j)\bm{x}_j, \tag{2.2}$$

where $(1 - \lambda_k) + \prod_{i=0}^{k}\lambda_i + \sum_{j=0}^{k-1}\big(\prod_{i=j+1}^{k}\lambda_i\big)(1 - \lambda_j) = 1$. If we set $\lambda_k = 1$ for all $k$, then we have $\bm{z}_k = \bm{x}_*$ and recover from Lemma 2.1 a bound on $f(\bm{x}_k) - f(\bm{x}_*)$, which leads to the average iterate guarantee. For general $\{\lambda_k\}$, we obtain the following lemma to relate the function value gap $f(\bm{x}_k) - f(\bm{z}_{k-1})$ to the optimality gap $f(\bm{x}_k) - f(\bm{x}_*)$, whose proof is deferred to Appendix A.
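The equivalence between the recursion (2.1) and its unrolled form (2.2) is easy to check numerically. The sketch below is illustrative only: the helper names `reference_points` and `unrolled_coefficients`, the random iterates, and the random choice of $\lambda_k$ are assumptions for the example, not part of the analysis.

```python
import numpy as np

def reference_points(xs, x_star, lams):
    """Recursion (2.1): z_k = (1 - lam_k) x_k + lam_k z_{k-1}, with z_{-1} = x_star."""
    z = np.asarray(x_star, dtype=float)
    zs = []
    for x_k, lam_k in zip(xs, lams):
        z = (1.0 - lam_k) * x_k + lam_k * z
        zs.append(z)
    return zs

def unrolled_coefficients(lams, k):
    """Coefficients of x_* and of (x_0, ..., x_k) in the unrolled form (2.2)."""
    coef_star = np.prod(lams[: k + 1])
    # For j = k the product over an empty range equals 1, giving (1 - lam_k).
    coefs = np.array(
        [np.prod(lams[j + 1 : k + 1]) * (1.0 - lams[j]) for j in range(k + 1)]
    )
    return coef_star, coefs

rng = np.random.default_rng(0)
k, d = 6, 3
xs = rng.normal(size=(k + 1, d))          # stand-ins for the iterates x_0, ..., x_k
x_star = rng.normal(size=d)               # stand-in for a minimizer
lams = rng.uniform(0.0, 1.0, size=k + 1)  # arbitrary lambda_0, ..., lambda_k in [0, 1]

z_recursive = reference_points(xs, x_star, lams)[-1]
c_star, c = unrolled_coefficients(lams, k)
z_unrolled = c_star * x_star + c @ xs
total_weight = c_star + c.sum()           # should equal 1 (convex combination)
```

Both forms should agree up to floating point error, and the coefficients should sum to one, matching the identity stated after Eq. (2.2).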

Lemma 2.2.

Let $\bm{z}_k$ be defined by Eq. (2.1) for a given sequence of parameters $\lambda_k \in (0, 1)$, where $k \geq 0$ and $\bm{z}_{-1} = \bm{x}_*$. Under Assumption 1, if there exists a sequence of nonnegative weights $w_k$ such that $\lambda_k w_k \leq w_{k-1}$ for $k \in [K-1]$, then for all $k \in [K]$:

  1. $w_{k-1}\big(f(\bm{z}_{k-1}) - f(\bm{x}_*)\big) \leq \sum_{j=0}^{k-1} w_j (1 - \lambda_j)\big(f(\bm{x}_j) - f(\bm{x}_*)\big)$;

  2. $w_{k-1}\big(f(\bm{x}_k) - f(\bm{z}_{k-1})\big) \geq w_{k-1}\big(f(\bm{x}_k) - f(\bm{x}_*)\big) - \sum_{j=0}^{k-1} w_j (1 - \lambda_j)\big(f(\bm{x}_j) - f(\bm{x}_*)\big)$.

The role of the sequence of weights $\{w_k\}$ in Lemma 2.2 is to ensure that we can telescope the terms $\|\bm{x}_{k-1} - \bm{z}_{k-1}\|^2 - \|\bm{x}_k - \bm{z}_{k-1}\|^2$ in Lemma 2.1. On the other hand, to succinctly see why such $\{\bm{z}_k\}$ could lead to the desired last iterate guarantees, we note that the second part of Lemma 2.2 indicates that $w_{k-1}\big(f(\bm{x}_k) - f(\bm{z}_{k-1})\big)$ intrinsically includes retraction terms of the optimality gaps at the previous iterates. Hence, we can extract $w_{K-1}\big(f(\bm{x}_K) - f(\bm{x}_*)\big)$ from $\sum_{k=1}^{K} w_{k-1}\big(f(\bm{x}_k) - f(\bm{z}_{k-1})\big)$ by properly choosing $\lambda_k$ and $w_k$ to cancel out the optimality gap terms at the intermediate iterates. In this case, the convergence rate for the last iterate is characterized by the growth rate of $\{w_k\}$.

Our choice of the reference points $\{\bm{z}_k\}$ is inspired by the recent work (Zamani and Glineur, 2023; Liu and Zhou, 2023) on last iterate guarantees for subgradient methods and SGD with $\lambda_k = \frac{w_{k-1}}{w_k}$. However, their proof techniques are not directly applicable to incremental methods, due to several technical obstacles including additional nontrivial error terms of the form $\frac{\alpha}{\beta} T \big(f(\bm{z}_k) - f(\bm{x}_*)\big)$ in Lemma 2.1 arising from the incremental gradient steps using reference points other than $\bm{x}_*$. Such error terms inherently deteriorate the growth rate of $\{w_k\}$ and could possibly lead to a worse last iterate rate compared to the rate on the average iterate. In the following theorem, we calibrate such degradation on the last iterate guarantees, and show that with a slightly smaller step size one can still achieve essentially the same rate as for the average iterate. The proof is provided in Appendix A due to space constraints.

Theorem 2.3.

Under Assumptions 1, 3, and 4 and for positive parameters $\alpha, \beta > 0$ such that $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$, if the step size is fixed and satisfies $\eta_k = \eta \leq \frac{1}{\sqrt{\beta} T L}$, the output $\bm{x}_K$ of Alg. 1 satisfies

$$f(\bm{x}_K) - f(\bm{x}_*) \leq \mathrm{e}\,\eta^2 T^2 \sigma_*^2 L (1 + \beta/\alpha) K^{\frac{\alpha/\beta}{1 + \alpha/\beta}} + \frac{\mathrm{e}\,\|\bm{x}_0 - \bm{x}_*\|^2}{2 \eta T K^{\frac{1}{1 + \alpha/\beta}}}. \tag{2.3}$$

With $\alpha = 4$, $\beta = 4 \log K$, there exists a constant step size $\eta$ such that $f(\bm{x}_K) - f(\bm{x}_*) \leq \epsilon$ for $\epsilon > 0$ after $\widetilde{\mathcal{O}}\big(\frac{T L \|\bm{x}_0 - \bm{x}_*\|^2}{\epsilon} + \frac{T L^{1/2} \sigma_* \|\bm{x}_0 - \bm{x}_*\|^2}{\epsilon^{3/2}}\big)$ individual gradient evaluations.

The error term $f(\bm{z}_{k-1}) - f(\bm{x}_*)$ in Lemma 2.1 plays the role of slowing the last iterate rate, as calibrated by the dependence on $\alpha/\beta$ in Eq. (2.3). To remedy such degradation compared to the average iterate rate (Mishchenko et al., 2020; Cai et al., 2023a), one natural thought is to make $\alpha/\beta$ sufficiently small. In particular, we choose $\alpha/\beta = 1/\log K$ and show that the last iterate rate nearly matches the best known rate on the average iterate, with the trade-off of requiring order-$\frac{1}{\sqrt{\log K}}$ smaller step sizes in comparison with Mishchenko et al. (2020); Cai et al. (2023a). This translates into the oracle complexity that is larger by at most a $\sqrt{\log(1/\epsilon)}$ factor. For most cases of interest, this quantity can be treated as a constant: for example, for $\epsilon = 10^{-8}$, $\sqrt{\log(1/\epsilon)} \approx 4.29$.
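To make the effect of this choice concrete, the following short calculation substitutes $\alpha/\beta = 1/\log K$ into Eq. (2.3). It is a sketch that tracks constants only loosely and is not taken from the proof in the appendix. It assumes $K \geq 3$, so that $\log K \geq 1$ and the constraint $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$ holds for $\alpha = 4$, $\beta = 4\log K$.

```latex
\begin{align*}
K^{\frac{\alpha/\beta}{1+\alpha/\beta}}
  &= K^{\frac{1}{1+\log K}}
   = \exp\Big(\frac{\log K}{1+\log K}\Big) \leq \mathrm{e},
\\
K^{\frac{1}{1+\alpha/\beta}}
  &= K \cdot K^{-\frac{1}{1+\log K}} \geq \frac{K}{\mathrm{e}},
\\
f(\bm{x}_K) - f(\bm{x}_*)
  &\leq \mathrm{e}^2 \eta^2 T^2 \sigma_*^2 L \,(1+\log K)
     + \frac{\mathrm{e}^2 \|\bm{x}_0 - \bm{x}_*\|^2}{2 \eta T K}.
\end{align*}
```

Up to the constant $\mathrm{e}^2$ and the factor $1 + \log K$, the right-hand side has the familiar form of a variance term plus a term decaying as $1/K$, while the step size restriction becomes $\eta \leq \frac{1}{2\sqrt{\log K}\, T L}$.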

On the other hand, with $\lambda_k \equiv 1$ and constant $w_k$, Lemmas 2.1 and 2.2 directly imply the average iterate guarantee, as a sanity check. Additionally, instead of zeroing the weights of the optimality gap terms $f(\bm{x}_k) - f(\bm{x}_*)$ for $k \in [K-1]$ to obtain the last iterate guarantee, one can deduce the convergence rate for increasing weighted averaging, which places more weight on later iterates, as formalized in the following corollary whose proof is deferred to Appendix A.

Corollary 2.4 (Increasing Weighted Averaging).

Under Assumptions 1, 3, and 4 and for parameters $\alpha, \beta > 0$ such that $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$, if the step size $\eta$ is fixed and such that $\eta \leq \frac{1}{\sqrt{\beta} T L}$, then for any constant $c \in (0, 1]$ and increasing sequence $\{w_k\}_{k=0}^{K-1}$ with $w_k = \frac{(1 + \alpha/\beta)(K - k) + 1 - c}{(1 + \alpha/\beta)(K - k)} w_{k-1}$, Alg. 1 outputs $\hat{\bm{x}}_K = \frac{\sum_{k=1}^{K} w_{k-1} \bm{x}_k}{\sum_{k=1}^{K} w_{k-1}}$ satisfying

$$f(\hat{\bm{x}}_K) - f(\bm{x}_*) \leq \frac{T^2 \sigma_*^2 \eta^2 L}{c} + \frac{\|\bm{x}_0 - \bm{x}_*\|^2}{2 c \eta T K}.$$

In particular, there exists a constant step size $\eta$ such that $f(\hat{\bm{x}}_K) - f(\bm{x}_*) \leq \epsilon$ for $\epsilon > 0$ after $\mathcal{O}\big(\frac{T L \|\bm{x}_0 - \bm{x}_*\|^2}{c \epsilon} + \frac{T L^{1/2} \sigma_* \|\bm{x}_0 - \bm{x}_*\|^2}{c^{3/2} \epsilon^{3/2}}\big)$ individual gradient evaluations.
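The weight recursion in Corollary 2.4 can be evaluated directly. The following is a minimal sketch: the function names, the argument `ratio` standing for $\alpha/\beta$, and the normalization $w_0 = 1$ are assumptions made for illustration (the averaged point does not depend on the scale of the weights).

```python
import numpy as np

def increasing_weights(K, ratio, c, w0=1.0):
    """Weights w_0, ..., w_{K-1} from the recursion in Corollary 2.4.

    ratio: the value of alpha / beta (hypothetical parameter name).
    c:     constant in (0, 1].
    w0:    arbitrary positive normalization (assumption of this sketch).
    """
    w = np.empty(K)
    w[0] = w0
    for k in range(1, K):
        denom = (1.0 + ratio) * (K - k)
        w[k] = (denom + 1.0 - c) / denom * w[k - 1]
    return w

def weighted_average(iterates, w):
    """x_hat_K = sum_k w_{k-1} x_k / sum_k w_{k-1}, with iterates = [x_1, ..., x_K]."""
    iterates = np.asarray(iterates, dtype=float)
    return (w[:, None] * iterates).sum(axis=0) / w.sum()

# Example: K = 10 epochs, alpha / beta = 0.5, c = 0.5.
w = increasing_weights(K=10, ratio=0.5, c=0.5)
```

For $c < 1$ each ratio $w_k / w_{k-1}$ exceeds one, so later iterates receive more weight; for $c = 1$ the recursion gives constant weights and the output is the uniform average.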

We remark that increasing weighted averaging shaves off the (at most) square-root-log term appearing in the last iterate rate above and recovers the best known rate for the average iterate (Mishchenko et al., 2020; Cai et al., 2023a). The parameters $\alpha, \beta$ are included in the weights $w_k$, controlling the growth rate of the increasing sequence $\{w_k\}$. When $\alpha/\beta \rightarrow \infty$, increasing weighted averaging reduces to the uniform weighted average.

Shuffled SGD.

We extend our analysis to handle random permutations of the task ordering in each epoch, showing order-$\sqrt{T}$ improvements in complexity when randomness is involved. We consider two main permutation strategies of particular interest in the literature on shuffled SGD: (i) random reshuffling (RR): generate a new random permutation at the beginning of each epoch; (ii) shuffle-once (SO): generate a single random permutation at the beginning and use it in all epochs. These strategies lead to order-$(1/T)$ improvements in bounding the variance term $\eta_k^2 L \sum_{t=1}^{T} \big\| \sum_{s=t}^{T} \nabla f_s(\bm{x}_*) \big\|^2$ from Lemma 2.1, and we state the improved convergence results with permutations in the following corollary. The proof is deferred to Appendix A. Our last iterate guarantee nearly matches the best known average iterate convergence rate for shuffled SGD (Mishchenko et al., 2020; Nguyen et al., 2021; Cai et al., 2023a) and the lower bound results on the RR scheme (Cha et al., 2023), with a slightly (order-$\frac{1}{\sqrt{\log K}}$) smaller step size.

Corollary 2.5 (Shuffled SGD (RR/SO)).

Under Assumptions 13 and 4 and for positive parameters α,β>0𝛼𝛽0\alpha,\beta>0italic_α , italic_β > 0 such that 1α+1β121𝛼1𝛽12\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}divide start_ARG 1 end_ARG start_ARG italic_α end_ARG + divide start_ARG 1 end_ARG start_ARG italic_β end_ARG ≤ divide start_ARG 1 end_ARG start_ARG 2 end_ARG, if the step size is fixed and such that η1βTL𝜂1𝛽𝑇𝐿\eta\leq\frac{1}{\sqrt{\beta}TL}italic_η ≤ divide start_ARG 1 end_ARG start_ARG square-root start_ARG italic_β end_ARG italic_T italic_L end_ARG, the output 𝐱Ksubscript𝐱𝐾{\bm{x}}_{K}bold_italic_x start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT of Alg. 1 with uniformly random (SO/RR) shuffling satisfies

𝔼[f(𝒙K)f(𝒙)]𝔼delimited-[]𝑓subscript𝒙𝐾𝑓subscript𝒙absent\displaystyle\mathbb{E}[f({\bm{x}}_{K})-f({\bm{x}}_{*})]\leq\;blackboard_E [ italic_f ( bold_italic_x start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) - italic_f ( bold_italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ] ≤ eη2σ2TL(1+β/α)Kα/β1+α/β+e𝒙0𝒙22TηK11+α/β.esuperscript𝜂2superscriptsubscript𝜎2𝑇𝐿1𝛽𝛼superscript𝐾𝛼𝛽1𝛼𝛽esuperscriptnormsubscript𝒙0subscript𝒙22𝑇𝜂superscript𝐾11𝛼𝛽\displaystyle\mathrm{e}\eta^{2}\sigma_{*}^{2}TL(1+\beta/\alpha)K^{\frac{\alpha% /\beta}{1+\alpha/\beta}}+\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2T% \eta K^{\frac{1}{1+\alpha/\beta}}}.roman_e italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_T italic_L ( 1 + italic_β / italic_α ) italic_K start_POSTSUPERSCRIPT divide start_ARG italic_α / italic_β end_ARG start_ARG 1 + italic_α / italic_β end_ARG end_POSTSUPERSCRIPT + divide start_ARG roman_e ∥ bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_T italic_η italic_K start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 1 + italic_α / italic_β end_ARG end_POSTSUPERSCRIPT end_ARG .

With $\alpha = 4$, $\beta = 4\log K$, there exists a constant step size $\eta$ such that $\mathbb{E}[f({\bm{x}}_K) - f({\bm{x}}_*)] \leq \epsilon$ for $\epsilon > 0$ after $\widetilde{\mathcal{O}}\big(\frac{T L \|{\bm{x}}_0 - {\bm{x}}_*\|^2}{\epsilon} + \frac{\sqrt{T L}\,\sigma_* \|{\bm{x}}_0 - {\bm{x}}_*\|^2}{\epsilon^{3/2}}\big)$ individual gradient evaluations.

3 Incremental Proximal Method

In this section, we leverage the proof techniques developed in the previous section and derive the last iterate convergence bound for the incremental proximal method (IPM) summarized in Alg. 2. While the incremental proximal method is a fundamental method broadly studied in the optimization literature and thus the last iterate convergence bounds are of independent interest, our main motivation for considering this problem comes from CL applications, as discussed in the introduction. Thus we begin this section by briefly explaining this connection and reasoning.

The considered setup is motivated by the general $\ell_2$-regularized CL setting (Heckel, 2022; Li et al., 2023). In particular, each proximal iteration ${\bm{x}}_{k-1,t+1} = \mathrm{prox}_{\eta_k f_t}({\bm{x}}_{k-1,t})$ can be interpreted as minimizing the $\ell_2$ (a.k.a. ridge) regularized loss $f_t({\bm{x}}) + \frac{1}{2\eta_k}\|{\bm{x}} - {\bm{x}}_{k-1,t}\|_2^2$ corresponding to the current task $t$, which aligns with the common machine learning practice of using regularization to improve the generalization error and prevent forgetting.
When $\eta_k \rightarrow \infty$, the proximal point step reduces to the CL setting where the learner exactly minimizes the loss of the current task, i.e., ${\bm{x}}_{k-1,t+1} = \operatorname*{arg\,min}_{{\bm{x}} \in \mathbb{R}^d} f_t({\bm{x}})$; the regularization effect then vanishes, which causes more forgetting on previous tasks. When $\eta_k$ is small, the stronger quadratic regularization makes the proximal point step easier to compute and keeps the new iterate close to the previous one, thus causing less forgetting. However, in this case the plasticity of the model may deteriorate.

We also note that our analysis of IPM is related to previous work on cyclic replays for overparameterized linear models with (S)GD (Evron et al., 2022; Swartworth et al., 2024), as (S)GD in this case acts as an implicit regularizer (Gunasekar et al., 2018; Zhang et al., 2021), whereas the proximal point update acts as an explicit regularizer. The two lines of work are not directly comparable: Evron et al. (2022) and Swartworth et al. (2024) consider exact minimization of each component/task loss function and bound the forgetting, but only address convex quadratics where the component loss functions have a nonempty intersecting set of minima (which implies $\sigma_* = 0$ in Assumption 4). On the other hand, our work addresses much more general (not necessarily quadratic) convex functions and does not require $\sigma_* = 0$, but instead relies on sufficiently large regularization.

Algorithm 2 Incremental Proximal Method (IPM)
  Input: initial point ${\bm{x}}_0$, number of epochs $K$, step sizes $\{\eta_k\}$
  for $k = 1 : K$ do
     ${\bm{x}}_{k-1,1} = {\bm{x}}_{k-1}$
     for $t = 1 : T$ do
        ${\bm{x}}_{k-1,t+1} = \mathrm{prox}_{\eta_k f_t}({\bm{x}}_{k-1,t}) = \operatorname*{arg\,min}_{{\bm{x}} \in \mathbb{R}^d} \big\{ \frac{1}{2\eta_k}\|{\bm{x}} - {\bm{x}}_{k-1,t}\|^2 + f_t({\bm{x}}) \big\}$
     ${\bm{x}}_k = {\bm{x}}_{k-1,T+1}$
  return ${\bm{x}}_K$
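To make the loop structure of Alg. 2 concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each task exposes its proximal operator as a callable; the names `quadratic_prox` and `incremental_proximal_method` are hypothetical, and the quadratic tasks are chosen only because their proximal step has a closed form.

```python
from typing import Callable, Sequence

import numpy as np

# A prox oracle maps (current point, step size) to the proximal point.
Prox = Callable[[np.ndarray, float], np.ndarray]


def quadratic_prox(center, curvature: float) -> Prox:
    """Return the prox operator of f(x) = (curvature / 2) * ||x - center||^2."""
    center = np.asarray(center, dtype=float)

    def prox(x: np.ndarray, eta: float) -> np.ndarray:
        # Zeroing the gradient of (1 / (2 * eta)) * ||y - x||^2 + f(y) in y gives
        # y = (x + eta * curvature * center) / (1 + eta * curvature).
        return (x + eta * curvature * center) / (1.0 + eta * curvature)

    return prox


def incremental_proximal_method(
    x0, proxes: Sequence[Prox], step_sizes: Sequence[float]
) -> np.ndarray:
    """Run one epoch per entry of step_sizes, visiting the tasks in order."""
    x = np.asarray(x0, dtype=float)
    for eta in step_sizes:      # epoch k uses step size eta_k
        for prox in proxes:     # inner loop over tasks t = 1, ..., T
            x = prox(x, eta)
    return x                    # last iterate x_K


if __name__ == "__main__":
    # Two illustrative 1-D tasks; their average is minimized at x_* = 1.
    tasks = [quadratic_prox([0.0], 1.0), quadratic_prox([2.0], 1.0)]
    x_K = incremental_proximal_method([5.0], tasks, [0.01] * 2000)
    print(x_K)  # close to, but not exactly, x_* = 1
```

In this sketch the step size is passed per epoch, mirroring the sequence $\{\eta_k\}$ in the algorithm's input; a constant step size corresponds to a list with one repeated value.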

3.1 Smooth convex setting

We first study the setting where the loss function of each task is convex and smooth, for which we can show faster convergence and which covers many regression tasks studied in prior CL work; see e.g., Evron et al. (2022); Goldfarb et al. (2024). In contrast, prior work either only focused on nonsmooth settings for IPM (Bertsekas, 2011; Li et al., 2019, 2020) or studied different algorithms without component-wise proximal steps for smooth settings (Bertsekas, 2015; Mishchenko et al., 2022).

Under component smoothness, the proximal iteration is equivalent to the backward gradient step:

$${\bm{x}}_{k-1,t+1} = {\bm{x}}_{k-1,t} - \eta_k \nabla f_t({\bm{x}}_{k-1,t+1}).$$
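For completeness, this equivalence can be read off from the first-order optimality condition of the proximal subproblem in Alg. 2. The short derivation below is a sketch that uses only the convexity and differentiability of $f_t$, which make the optimality condition both necessary and sufficient.

```latex
\begin{align*}
{\bm{x}}_{k-1,t+1}
  &= \operatorname*{arg\,min}_{{\bm{x}} \in \mathbb{R}^d}
     \Big\{ \frac{1}{2\eta_k}\|{\bm{x}} - {\bm{x}}_{k-1,t}\|^2 + f_t({\bm{x}}) \Big\} \\
\iff\quad {\bm{0}}
  &= \frac{1}{\eta_k}\big({\bm{x}}_{k-1,t+1} - {\bm{x}}_{k-1,t}\big)
     + \nabla f_t({\bm{x}}_{k-1,t+1}) \\
\iff\quad {\bm{x}}_{k-1,t+1}
  &= {\bm{x}}_{k-1,t} - \eta_k \nabla f_t({\bm{x}}_{k-1,t+1}).
\end{align*}
```

The gradient is evaluated at the new point ${\bm{x}}_{k-1,t+1}$ rather than at ${\bm{x}}_{k-1,t}$, which is what distinguishes this implicit step from the incremental gradient update.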

Hence, much of the analysis from Section 2 can be adapted here, and the main difference lies in bounding the gap $f({\bm{x}}_k) - f({\bm{z}})$ within each epoch with a decomposition w.r.t. ${\bm{x}}_{k-1,t+1}$ instead of ${\bm{x}}_{k-1,t}$, in comparison to Lemma 2.1. Then, choosing ${\bm{z}} = {\bm{z}}_{k-1}$ defined by Eq. (2.1) and following the proof of Theorem 2.3, we obtain the bound stated in the following theorem, whose proof is in Appendix B.

Theorem 3.1.

Under Assumptions 1, 3 and 4 and for parameters $\alpha, \beta > 0$ such that $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$, if the step size is fixed and such that $\eta \leq \frac{1}{\sqrt{\beta} T L}$, the output ${\bm{x}}_K$ of Alg. 2 satisfies

$$f({\bm{x}}_K) - f({\bm{x}}_*) \leq \mathrm{e}\eta^2 T^2 \sigma_*^2 L (1 + \beta/\alpha) K^{\frac{\alpha/\beta}{1 + \alpha/\beta}} + \frac{\mathrm{e}\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{2 T \eta K^{\frac{1}{1 + \alpha/\beta}}}.$$

With $\alpha = 4$, $\beta = 4\log K$, there exists a constant step size $\eta$ such that $f({\bm{x}}_K) - f({\bm{x}}_*) \leq \epsilon$ for $\epsilon > 0$ after $\widetilde{\mathcal{O}}\big(\frac{T L \|{\bm{x}}_0 - {\bm{x}}_*\|^2}{\epsilon} + \frac{T L^{1/2} \sigma_* \|{\bm{x}}_0 - {\bm{x}}_*\|^2}{\epsilon^{3/2}}\big)$ individual proximal oracle queries.
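To see why the choice $\alpha = 4$, $\beta = 4\log K$ costs only a logarithmic factor, the following worked computation substitutes $\alpha/\beta = 1/\log K$ into the bound of Theorem 3.1. It is a sketch under the assumptions that $\log$ denotes the natural logarithm and that $K \geq 3$, so that $\log K \geq 1$ and the condition $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$ holds.

```latex
\begin{align*}
K^{\frac{\alpha/\beta}{1+\alpha/\beta}}
  &= K^{\frac{1}{1+\log K}}
   = \exp\Big(\frac{\log K}{1+\log K}\Big) \leq \mathrm{e}, \\
K^{\frac{1}{1+\alpha/\beta}}
  &= K^{\frac{\log K}{1+\log K}}
   = K \cdot K^{-\frac{1}{1+\log K}} \geq \frac{K}{\mathrm{e}}, \\
f({\bm{x}}_K) - f({\bm{x}}_*)
  &\leq \mathrm{e}^2 \eta^2 T^2 \sigma_*^2 L \,(1 + \log K)
   + \frac{\mathrm{e}^2 \|{\bm{x}}_0 - {\bm{x}}_*\|^2}{2 T \eta K}.
\end{align*}
```

The second term decays as $1/(\eta T K)$, as for the average iterate, while the first term picks up the factor $1 + \log K$ that is hidden in the $\widetilde{\mathcal{O}}$ notation.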

A few remarks are in order here. First, the last-iterate convergence rate of IPM matches the rate of IGD on the last iterate. This is also the first convergence result for IPM (with component-wise proximal updates) under convex smooth settings, in comparison with the prior results in convex Lipschitz setups (Bertsekas, 2011; Li et al., 2019, 2020). Second, the extensions to increasing weighted averaging and RR/SO shuffling discussed in Section 2 also apply to this setting; we omit them for brevity. Lastly, astute readers may find that the step size constraint in Theorem 3.1 stands in contrast to the fact that the proximal point method, which IPM reduces to when $T = 1$, converges for any positive step size. However, in the following theorem we show that such a step size restriction is necessary for IPM with component-wise proximal updates to reach the target optimality gap in this setting. We provide a proof sketch below, while the complete proof is in Appendix B.

Theorem 3.2.

Given $L > 0$, $T \geq 2$, let $\mathcal{F}_{T,L}$ be the class of finite-sum functions $f({\bm{x}}) = \frac{1}{T}\sum_{t=1}^{T} f_t({\bm{x}})$ in which each component $f_t$ is $L$-smooth and convex. Then:

  1. For any fixed step size $\eta > 0$, there exists a function $f \in \mathcal{F}_{T,L}$ such that the iterates ${\bm{x}}_k$ of Alg. 2 satisfy $f({\bm{x}}_k) - f({\bm{x}}_*) \nrightarrow 0$ as $k \rightarrow \infty$. As a consequence, the forgetting is catastrophic.

  2. For any fixed step size $\eta > 0$ that only depends on the parameters of the problem class ($L$, $T$, error $\epsilon > 0$), there exists a function $f \in \mathcal{F}_{T,L}$ such that the iterates ${\bm{x}}_k$ of Alg. 2 satisfy $\lim_{k \rightarrow \infty} f({\bm{x}}_k) - f({\bm{x}}_*) > 1$.

  3. Given $\varepsilon > 0$, if the fixed step size $\eta$ satisfies $\eta \geq \min\Big\{\frac{16\sqrt{\varepsilon}}{\sqrt{T L}\,\sigma_*},\, \frac{1}{T L}\Big\}$, then there exists a function $f \in \mathcal{F}_{T,L}$ such that $f({\bm{x}}_k) - f({\bm{x}}_*) > \varepsilon$ for all sufficiently large $k$.

Proof sketch.

For all parts of the proof, we consider $1$-dimensional quadratics $f_t(x) = \frac{L}{2}(x - \delta_t)^2$ for $\{\delta_t\}_{t \in [T]} \subseteq \mathbb{R}$ and $L > 0$. It is immediate that $f(x) = \frac{1}{T}\sum_{t=1}^{T} f_t(x)$ is minimized at $x_* = \frac{1}{T}\sum_{t=1}^{T} \delta_t$. In this case, Alg. 2 using a fixed step size $\eta$ performs closed-form updates on $f$, i.e., $x_{k+1} = \gamma^{T} x_k + (1 - \gamma)\sum_{t=1}^{T} \gamma^{T-t}\delta_t$, where $\gamma = \frac{1}{\eta L + 1} \in (0, 1)$. Given any initial point $x_0$,

$$x_k - x_* = \gamma^{kT} x_0 + \sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)(1-\gamma^{kT})}{1-\gamma^{T}} - \frac{1}{T}\Big)\delta_t \;\overset{k \rightarrow \infty}{\longrightarrow}\; \sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}} - \frac{1}{T}\Big)\delta_t.$$

For 1), since $\gamma \in (0, 1)$, we have $\frac{1-\gamma}{1-\gamma^{T}} = \frac{1}{\sum_{t=0}^{T-1}\gamma^{t}} > \frac{1}{T}$, i.e., the coefficient of $\delta_t$ in the limit above is positive for $t = T$. As $k \rightarrow \infty$, for $\{\delta_t\}_{t \in [T]}$ such that $\mathrm{sgn}(\delta_t) = \mathrm{sgn}\big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}} - \frac{1}{T}\big)$ and $\delta_T > 0$, we have $f(x_k) - f(x_*) = \frac{L}{2}(x_k - x_*)^2 \geq \frac{L}{2}\Big(\frac{1-\gamma}{1-\gamma^{T}} - \frac{1}{T}\Big)^2\delta_T^2 > 0$.

For 2), with $\frac{1-\gamma}{1-\gamma^{T}} > \frac{1}{T}$, observe that as $\gamma \in (0, 1)$, we have $\big|\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}} - \frac{1}{T}\big| \leq \frac{T-1}{T}$ for any $t$. As $k \rightarrow \infty$, for $|\delta_t| < \frac{T\sqrt{2/L}}{(T-1)^2}$ ($t \in [T-1]$) and $\delta_T \geq \frac{2\sqrt{2/L}}{\frac{1-\gamma}{1-\gamma^{T}} - \frac{1}{T}}$, we obtain $f(x_k) - f(x_*) = \frac{L}{2}(x_k - x_*)^2 > 1$.

For 3), the case $\eta \geq \frac{1}{T L}$ can be handled using an argument similar to that in 1), and thus we assume w.l.o.g. that $\gamma \geq 1 - \frac{1}{T+1}$. Let $\gamma = 1 - \kappa$, where $\kappa > 0$ and $\kappa \leq \frac{1}{T+1} < \frac{1}{T}$. Further noticing that $(1-\kappa)^{T} \geq 1 - \kappa T + \frac{\kappa^2 T(T-1)}{4}$ for $T \geq 2$ and $\kappa T < 1$, we have

$$\frac{1-\gamma}{1-\gamma^{T}} = \frac{\kappa}{1-(1-\kappa)^{T}} \geq \frac{\kappa}{\kappa T - \frac{\kappa^2 T(T-1)}{4}} \geq \frac{1}{T}\Big(1 + \frac{\kappa(T-1)}{4}\Big) \geq \frac{1}{T} + \frac{\kappa}{8}.$$

Hence, if $\eta \geq \frac{16\sqrt{\varepsilon}}{\sqrt{T L}\,\sigma_*}$, since we can show $\sigma_* \leq L\sqrt{\sum_{t=1}^{T}\delta_t^2 / T}$, we have that $\kappa = \frac{\eta L}{\eta L + 1} \geq \frac{16\sqrt{\varepsilon/L}}{16\sqrt{\varepsilon/L} + \sqrt{\sum_{t=1}^{T}\delta_t^2}} > \frac{8\sqrt{3\varepsilon/L}}{\sqrt{\sum_{t=1}^{T}\delta_t^2}}$ by choosing $\sqrt{\sum_{t=1}^{T}\delta_t^2}$ large enough. Then for sufficiently large $k$ and $\{\delta_t\}_{t \in [T]}$ such that $\delta_T^2 > \frac{5}{6}\sum_{t=1}^{T}\delta_t^2$ and $\mathrm{sgn}(\delta_t) = \mathrm{sgn}\big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}} - \frac{1}{T}\big)$, we get $f(x_k) - f(x_*) \geq \frac{2L}{5}\,\frac{3\varepsilon\delta_T^2}{L\sum_{t=1}^{T}\delta_t^2} > \varepsilon$. ∎

Regularization effect.

We now discuss how the regularization parameter (the step size of proximal updates) affects the loss on the current task and (excess) forgetting, based on the above convergence results of IPM. An interesting aspect of our result in Theorem 3.1 is that there is a critical value

$$\eta_{*} = \min\Big\{\frac{\|\bm{x}_{0}-\bm{x}_{*}\|^{2/3}}{2^{1/3}T\sigma_{*}^{2/3}L^{1/3}K^{1/3}(1+\beta/\alpha)^{1/3}},\ \frac{1}{\sqrt{\beta}TL}\Big\} \tag{3.1}$$

such that if we decrease $\eta$ below $\eta_{*}$, both the regularization error on the current task and our upper bound on the forgetting increase. For $\eta > \eta_{*}$ (i.e., when we decrease the regularization error), Theorem 3.2 shows that the forgetting would increase, at least in some regimes of the problem parameters. Moreover, Theorem 3.2 demonstrates that polynomial dependence on other parameters like $1/\epsilon$ and $\sigma_{*}$ is necessary in the choice of $\eta$. In other words, strong regularization is needed to control the forgetting to a target error in general smooth convex settings. Another direct implication of Theorem 3.2 is that if no assumptions such as similarity are made on the tasks, then any $\ell_{2}$-regularized model using a finite regularization parameter would suffer catastrophic forgetting, i.e., the forgetting would not approach zero as the number of epochs tends to infinity.
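As a rough illustration of how the critical value in Eq. (3.1) scales with the problem parameters, the following sketch evaluates the formula directly. The function name `critical_step_size`, its argument names, and the default values $\alpha = \beta = 4$ are assumptions made only for this example; the only constraint taken from the text is $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$.

```python
import math


def critical_step_size(dist0, T, K, L, sigma_star, alpha=4.0, beta=4.0):
    """Evaluate the critical step size eta_* of Eq. (3.1).

    dist0 stands for ||x_0 - x_*||. The defaults alpha = beta = 4 are an
    illustrative assumption satisfying 1/alpha + 1/beta <= 1/2.
    """
    if 1.0 / alpha + 1.0 / beta > 0.5:
        raise ValueError("need 1/alpha + 1/beta <= 1/2")
    # Second argument of the min: the step size cap 1 / (sqrt(beta) T L).
    cap = 1.0 / (math.sqrt(beta) * T * L)
    if sigma_star == 0.0:
        # The first argument of the min is unbounded when sigma_* = 0.
        return cap
    first = dist0 ** (2.0 / 3.0) / (
        2.0 ** (1.0 / 3.0)
        * T
        * sigma_star ** (2.0 / 3.0)
        * L ** (1.0 / 3.0)
        * K ** (1.0 / 3.0)
        * (1.0 + beta / alpha) ** (1.0 / 3.0)
    )
    return min(first, cap)
```

Under these assumptions, increasing the number of epochs $K$ or the number of tasks $T$ shrinks the returned value, in line with the $K^{-1/3}$ and $T^{-1}$ dependence visible in Eq. (3.1).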

We further provide illustrative numerical results in Fig. 1 to facilitate our discussion. In particular, we choose $L=2$, $T\in\{100,150,200\}$, $\delta_{t}=1/t$ ($t\in[T-1]$) and $\delta_{T}=T$ for the example $f(x)=\frac{L}{2T}\sum_{t=1}^{T}(x-\delta_{t})^{2}$ used in Theorem 3.2. In Fig. 1(a), we plot the optimality gap at the last iterate, i.e., the excess forgetting, against the step sizes after $K=10^{4}$ epochs. It can be observed that the forgetting first decreases as the step size is reduced, but then increases beyond some critical value. Note that the critical values are around $10^{-5}$, which is nontrivially smaller than $1/L=1/2$, while a larger $T$ leads to a smaller such critical value. These numerical examples corroborate our results from Theorems 3.1 and 3.2, which jointly suggest that the step size (amount of regularization) can neither be too small nor too large. On the other hand, in Fig. 1(b) we show the final stagnated average regularization error, i.e., $\frac{1}{T}\sum_{t=1}^{T}\big(f_{t}(\bm{x}_{k-1,t+1})-f_{t}(\bm{x}_{*,t})\big)$ over $T$ tasks, where $\bm{x}_{*,t}$ is the minimizer of $f_{t}$. We thus conclude from both plots in Fig. 1 that as the step size increases (equivalently, the regularization parameter decreases), the regularization error decreases as well, but the forgetting increases.

[Figure 1, two panels: (a) Forgetting; (b) Regularization error (avg)]
Figure 1: Numerical results of performing IPM on $T$ component least square functions, corresponding to the $\ell_{2}$-regularized CL setting of $T$ linear regression tasks with cyclic replays.
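The experiment described above can be sketched in a few lines, since the proximal step of a one-dimensional quadratic has a closed form. The snippet below is a minimal sketch and not the code used to produce Fig. 1: the function name `ipm_last_iterate_gap`, the initial point $x_0 = 0$, and the default number of epochs (much smaller than $K = 10^{4}$, to keep the run short) are assumptions made for illustration.

```python
def ipm_last_iterate_gap(T, eta, L=2.0, K=200, x0=0.0):
    """Run K epochs of cyclic incremental proximal steps on
    f(x) = L / (2T) * sum_t (x - delta_t)^2 and return f(x_K) - f(x_*).

    The components are f_t(x) = (L / 2) * (x - delta_t)^2 with
    delta_t = 1/t for t < T and delta_T = T, as in the text.
    """
    deltas = [1.0 / t for t in range(1, T)] + [float(T)]
    x = x0
    for _ in range(K):
        for d in deltas:
            # Closed-form prox of eta * f_t at x:
            # argmin_y (L / 2)(y - d)^2 + (y - x)^2 / (2 eta).
            x = (x + eta * L * d) / (1.0 + eta * L)
    x_star = sum(deltas) / T  # minimizer of f is the mean of the deltas
    # For this quadratic, f(x) - f(x_*) = (L / 2) * (x - x_*)^2.
    return 0.5 * L * (x - x_star) ** 2
```

Sweeping `eta` over a logarithmic grid and plotting the returned gap is one way to explore the two regimes discussed above: with a very large step size the last iterate is pulled toward $\delta_T$, while with a very small step size and a fixed epoch budget the iterate has not yet moved far from $x_0$.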

To conclude this subsection, we note that our results bridge a gap by theoretically calibrating the trade-off between the forgetting and the regularization error for general convex smooth tasks with cyclic replays; this trade-off was previously studied only for two linear regression tasks without cyclic replay (Heckel, 2022; Li et al., 2023). Further, our results on finite regularization complement the asymptotic weak regularization results in the technically disjoint setting of linear classification (Evron et al., 2023), whose analysis relies on relating the iterates to sequential max-margin projections.

3.2 Convex Lipschitz setting

We now further relax the smoothness assumption and consider the convex Lipschitz setting, with applications such as linear classification tasks considered in Evron et al. (2023). To carry out the analysis, we leverage the standard fact that the proximal iteration is equivalent to the gradient step w.r.t. the Moreau envelope, i.e.,

$$\bm{x}_{k-1,t+1} = \bm{x}_{k-1,t} - \eta_{k}\nabla M_{\eta_{k}f_{t}}(\bm{x}_{k-1,t}),$$

while the gradient of the Moreau envelope $\nabla M_{\eta_{k}f_{t}}(\bm{x}_{k-1,t})$ belongs to the subdifferential $\partial f_{t}(\bm{x}_{k-1,t+1})$, and thus its norm is bounded by the Lipschitz constant of $f_{t}$. We use these observations to bound the gap $f(\bm{x}_{k})-f(\bm{z})$ for each epoch and then use the sequence $\{\bm{z}_{k}\}$ defined in Eq. (2.1) to deduce the last iterate rate in the following theorem, with the proof deferred to Appendix B. In contrast to Lemma 2.1, we do not have the additional error term $f(\bm{z})-f(\bm{x}_{*})$, so the analysis is much simplified.
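The two facts used above can be checked numerically on a simple nonsmooth example. The sketch below takes the one-dimensional function $f(x) = G|x|$, whose proximal map is soft-thresholding and whose Moreau envelope is the Huber function; this choice of $f$ and the helper names `prox_abs` and `moreau_grad_abs` are assumptions made for illustration, and the envelope is taken as $M(x) = \min_y \{ f(y) + \frac{1}{2\eta}(y - x)^2 \}$.

```python
import math


def prox_abs(x, eta, G=1.0):
    """Proximal map of eta * G * |.| at x (soft-thresholding)."""
    return math.copysign(max(abs(x) - eta * G, 0.0), x)


def moreau_grad_abs(x, eta, G=1.0):
    """Gradient of the Moreau envelope of G * |.| with parameter eta.

    The envelope is the Huber function: x^2 / (2 eta) for |x| <= eta G
    and G |x| - eta G^2 / 2 otherwise, so the gradient is computed here
    from that closed form rather than from the prox.
    """
    if abs(x) <= eta * G:
        return x / eta
    return math.copysign(G, x)
```

For any `x` and `eta > 0`, `x - eta * moreau_grad_abs(x, eta, G)` should agree with `prox_abs(x, eta, G)`, and the magnitude of `moreau_grad_abs` never exceeds the Lipschitz constant `G`.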

Theorem 3.3.

Under Assumptions 1 and 2, the output $\bm{x}_{K}$ of Alg. 2 satisfies

$$f(\bm{x}_{K})-f(\bm{x}_{*}) \leq \frac{1}{2T\sum_{k=1}^{K}\eta_{k}}\|\bm{x}_{0}-\bm{x}_{*}\|^{2} + \frac{G^{2}T}{2}\sum_{k=1}^{K}\frac{\eta_{k}^{2}}{\sum_{j=k}^{K}\eta_{j}}.$$

Moreover, given $\epsilon>0$, there exists a constant step size $\eta$ such that $f(\bm{x}_{K})-f(\bm{x}_{*})\leq\epsilon$ after $\widetilde{\mathcal{O}}\big(\frac{G^{2}T\|\bm{x}_{0}-\bm{x}_{*}\|^{2}}{\epsilon^{2}}\big)$ individual proximal oracle queries.

The last iterate rate we obtained in Theorem 3.3 matches the prior best known results for the average iterate guarantees for incremental proximal methods (Bertsekas, 2011; Li et al., 2019) in nonsmooth settings, up to a logarithmic factor. Further, we take the $\Theta(\frac{1}{\sqrt{K}})$ step size only for analytical simplicity, while the diminishing step sizes $\eta_{k}=\Theta(\frac{1}{\sqrt{k}})$ will yield the same rate via a similar analysis, which we omit for brevity.
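To see where the logarithmic factor comes from, one can evaluate the right-hand side of the bound in Theorem 3.3 for a given step size sequence. With a constant step size $\eta$, the second term becomes $\frac{G^{2}T\eta}{2}\sum_{m=1}^{K}\frac{1}{m}$, i.e., a harmonic sum. The sketch below is illustrative only: the names `last_iterate_bound` and `tuned_constant_step` are hypothetical, and the tuned step size is one choice obtained by balancing the two terms after bounding the harmonic sum by $1+\ln K$, not a constant stated in the theorem.

```python
import math


def last_iterate_bound(etas, T, G, dist0):
    """Right-hand side of the Theorem 3.3 bound for step sizes
    etas = [eta_1, ..., eta_K]; dist0 stands for ||x_0 - x_*||."""
    first = dist0 ** 2 / (2.0 * T * sum(etas))
    suffix = 0.0  # running value of sum_{j=k}^{K} eta_j
    acc = 0.0
    for eta in reversed(etas):
        suffix += eta
        acc += eta ** 2 / suffix
    return first + 0.5 * G ** 2 * T * acc


def tuned_constant_step(K, T, G, dist0):
    """Constant step size balancing the two terms of the bound
    (an illustrative choice, assuming the harmonic sum <= 1 + ln K)."""
    return dist0 / (G * T * math.sqrt(K * (1.0 + math.log(K))))
```

With this constant step size the bound is at most $G\|\bm{x}_{0}-\bm{x}_{*}\|\sqrt{(1+\ln K)/K}$, so reaching accuracy $\epsilon$ takes $K = \widetilde{\mathcal{O}}(G^{2}\|\bm{x}_{0}-\bm{x}_{*}\|^{2}/\epsilon^{2})$ epochs, i.e., $TK$ proximal queries, consistent with the complexity stated in the theorem.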

3.3 Inexact proximal point evaluations

In the last two subsections, we derived our results assuming that the proximal point operator can be evaluated exactly. However, computing the proximal point corresponds to solving a strongly convex problem, which is generally possible to do only up to finite accuracy. Thus, we now consider the case where $\bm{x}_{k-1,t+1}$ is an approximation of $\mathrm{prox}_{\eta_{k}f_{t}}(\bm{x}_{k-1,t})$ obtained by solving the corresponding strongly convex problem to an optimality gap of $\varepsilon_{k-1,t}^{2}/(2\eta_{k})$ for $\varepsilon_{k-1,t}>0$. Equivalently, using strong convexity and denoting $\bm{g}_{k-1,t}:=\frac{1}{\eta_{k}}(\bm{x}_{k-1,t}-\bm{x}_{k-1,t+1})$, we have

$$\|\bm{x}_{k-1,t+1}-\mathrm{prox}_{\eta_{k}f_{t}}(\bm{x}_{k-1,t})\|\leq\varepsilon_{k-1,t},\quad \|\bm{g}_{k-1,t}-\nabla M_{\eta_{k}f_{t}}(\bm{x}_{k-1,t})\|\leq\varepsilon_{k-1,t}/\eta_{k}. \tag{3.2}$$

We note that direct extensions of the previous analysis would not work, because inexact evaluations give rise to additional positive terms $\|\bm{x}_{k}-\bm{z}_{k-1}\|^{2}$ that cause issues for telescoping. However, we observe that the coefficients of these terms admit additional slackness, i.e., $(\lambda_{k}^{2}w_{k}-w_{k-1})\|\bm{x}_{k}-\bm{z}_{k-1}\|^{2}$, while Lemma 2.2 only requires $\lambda_{k}w_{k}\leq w_{k-1}$. Thus, as long as the approximation error at each iteration is small, we can still maintain the convergence rate of incremental proximal methods with exact proximal point evaluations. With these insights, we extend our convergence results to admit inexact proximal point evaluations in the following corollaries, with proofs provided in Appendix B.
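One way to produce an approximate proximal point satisfying Eq. (3.2) is to run gradient descent on the proximal subproblem and stop on a gradient-norm test. The sketch below assumes a one-dimensional, differentiable, convex $f$ with an $L$-Lipschitz derivative; the function name `inexact_prox` and its arguments are hypothetical. Because the subproblem $\phi(y) = f(y) + \frac{1}{2\eta}(y-x)^{2}$ is $\frac{1}{\eta}$-strongly convex, the test $|\phi'(y)| \leq \varepsilon/\eta$ implies both an optimality gap of at most $\varepsilon^{2}/(2\eta)$ and a distance of at most $\varepsilon$ to the exact proximal point.

```python
def inexact_prox(grad_f, x, eta, eps, smooth_L, max_iter=10_000):
    """Approximate prox_{eta f}(x) by gradient descent on
    phi(y) = f(y) + (y - x)^2 / (2 eta).

    Stops once |phi'(y)| <= eps / eta, which by (1/eta)-strong convexity
    of phi certifies |y - prox_{eta f}(x)| <= eps. If max_iter is reached
    first, the last iterate is returned without that certificate.
    """
    y = x
    # phi' is (smooth_L + 1/eta)-Lipschitz, so this step size is safe.
    step = 1.0 / (smooth_L + 1.0 / eta)
    for _ in range(max_iter):
        g = grad_f(y) + (y - x) / eta  # derivative of phi at y
        if abs(g) <= eps / eta:
            break
        y -= step * g
    return y
```

For instance, with $f(y) = \frac{L}{2}(y-\delta)^{2}$ the exact proximal point is $(x + \eta L\delta)/(1+\eta L)$, which gives a simple way to check that the returned point and the induced $\bm{g}$ satisfy both inequalities in Eq. (3.2).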

Corollary 3.4 (Convex Smooth).

Under Assumptions 1, 3, and 4 and for parameters $\alpha,\beta$ such that $\frac{1}{\alpha}+\frac{1}{\beta}\leq\frac{1}{2}$, if the step size is fixed and satisfies $\eta_{k}\equiv\eta\leq\frac{1}{\sqrt{\beta}TL}$, the output $\bm{x}_{K}$ of Alg. 2 with inexact proximal point evaluations as in Eq. (3.2) with $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{\sqrt{\eta}}{1+(1+\alpha/\beta)(K-k+1)}$ satisfies

$$f(\bm{x}_{K})-f(\bm{x}_{*}) \leq \frac{\mathrm{e}\|\bm{x}_{0}-\bm{x}_{*}\|^{2}}{2\eta TK^{\frac{1}{1+\alpha/\beta}}} + 2\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}} + \frac{\mathrm{e}}{2\eta T}\sum_{k=0}^{K-1}\sum_{t=1}^{T}\frac{2T\varepsilon_{k,t}^{2}+\sqrt{\eta}\varepsilon_{k,t}}{(K-k)^{\frac{1}{1+\alpha/\beta}}}.$$

Given $\epsilon>0$, if $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\sqrt{\eta}\min\{\varepsilon,\frac{1}{3(K-k+1)}\}$, there exists $\eta$ such that $f(\bm{x}_{K})-f(\bm{x}_{*})\leq\epsilon$ after $\widetilde{\mathcal{O}}\big(\frac{TL\|\bm{x}_{0}-\bm{x}_{*}\|^{2}}{\epsilon}+\frac{TL^{1/2}\sigma_{*}\|\bm{x}_{0}-\bm{x}_{*}\|^{2}}{\epsilon^{3/2}}\big)$ individual inexact proximal point evaluations.

Corollary 3.5 (Convex Lipschitz).

Under Assumptions 1 and 2, the output $\bm{x}_{K}$ of Alg. 2 with inexact proximal point evaluations as in Eq. (3.2) with $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{\eta_{k}\eta_{k-1}GT}{\sum_{j=k}^{K}\eta_{j}}$ satisfies

$$f(\bm{x}_{K})-f(\bm{x}_{*}) \leq \frac{\|\bm{x}_{0}-\bm{x}_{*}\|^{2}}{2T\sum_{k=1}^{K}\eta_{k}} + \frac{G^{2}T}{2}\sum_{k=1}^{K}\frac{\eta_{k}^{2}}{\sum_{j=k}^{K}\eta_{j}} + \sum_{k=1}^{K}\sum_{t=1}^{T}\Big(\frac{\varepsilon_{k-1,t}^{2}}{2T\sum_{j=k}^{K}\eta_{j}}+\frac{3G\varepsilon_{k-1,t}\eta_{k}}{\sum_{j=k}^{K}\eta_{j}}\Big).$$

Given $\epsilon>0$, if $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{2\eta GT}{K-k+1}$, there exists a constant step size $\eta$ such that $f(\bm{x}_{K})-f(\bm{x}_{*})\leq\epsilon$ after $\widetilde{\mathcal{O}}\big(\frac{G^{2}T\|\bm{x}_{0}-\bm{x}_{*}\|^{2}}{\epsilon^{2}}\big)$ individual inexact proximal point evaluations.

4 Conclusion

This work provides the first oracle complexity guarantees for the last iterate of standard incremental (gradient and proximal) methods, motivated by applications in continual learning. The obtained complexity bounds nearly match the best known oracle complexity bounds, which in the same settings were previously known only for the (uniformly) averaged iterate. Our results for the incremental proximal method further characterize the effect of regularization and its limitations in controlling catastrophic forgetting in continual learning applications. It would be interesting to investigate in future work whether other types of regularization involving task similarity can effectively control forgetting.

Acknowledgments

This research was supported in part by the U.S. Office of Naval Research under contract number N00014-22-1-2348.

References

  • Ahn et al. (2020) Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. SGD with shuffling: optimal rates without component convexity and large epoch requirements. In Proc. NeurIPS’20, 2020.
  • Balcan et al. (2015) Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efficient representations for lifelong learning and autoencoding. In Proc. COLT’15, 2015.
  • Bertsekas (2011) Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, 2011.
  • Bertsekas (2015) Dimitri P Bertsekas. Incremental aggregated proximal and augmented lagrangian algorithms. arXiv preprint arXiv:1509.09257, 2015.
  • Bertsekas et al. (2011) Dimitri P Bertsekas et al. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.
  • Bottou (2009) Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proc. Symposium on Learning and Data Science, Paris’09, 2009.
  • Cai et al. (2023a) Xufeng Cai, Cheuk Yin Lin, and Jelena Diakonikolas. Empirical risk minimization with shuffled SGD: A primal-dual perspective and improved bounds. arXiv preprint arXiv:2306.12498, 2023a.
  • Cai et al. (2023b) Xufeng Cai, Chaobing Song, Stephen J Wright, and Jelena Diakonikolas. Cyclic block coordinate descent with variance reduction for composite nonconvex optimization. In Proc. ICML’23, 2023b.
  • Cao et al. (2022) Xinyuan Cao, Weiyang Liu, and Santosh Vempala. Provable lifelong learning of representations. In Proc. AISTATS’22, 2022.
  • Cha et al. (2023) Jaeyoung Cha, Jaewook Lee, and Chulhee Yun. Tighter lower bounds for shuffling SGD: Random permutations and beyond. arXiv preprint arXiv:2303.07160, 2023.
  • Chen et al. (2022) Xi Chen, Christos Papadimitriou, and Binghui Peng. Memory bounds for continual learning. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), 2022.
  • De Lange et al. (2021) Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021.
  • Doan et al. (2021) Thang Doan, Mehdi Abbana Bennani, Bogdan Mazoure, Guillaume Rabusseau, and Pierre Alquier. A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. In Proc. AISTATS’21, 2021.
  • Evron et al. (2022) Itay Evron, Edward Moroshko, Rachel Ward, Nathan Srebro, and Daniel Soudry. How catastrophic can catastrophic forgetting be in linear regression? In Proc. COLT’22, 2022.
  • Evron et al. (2023) Itay Evron, Edward Moroshko, Gon Buzaglo, Maroun Khriesh, Badea Marjieh, Nathan Srebro, and Daniel Soudry. Continual learning in linear classification on separable data. In Proc. ICML’23, 2023.
  • Goldfarb and Hand (2023) Daniel Goldfarb and Paul Hand. Analysis of catastrophic forgetting for random orthogonal transformation tasks in the overparameterized regime. In Proc. AISTATS’23, 2023.
  • Goldfarb et al. (2024) Daniel Goldfarb, Itay Evron, Nir Weinberger, Daniel Soudry, and Paul Hand. The joint effect of task similarity and overparameterization on catastrophic forgetting - an analytical model. In Proc. ICLR’24, 2024.
  • Goodfellow et al. (2013) Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
  • Gunasekar et al. (2018) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In Proc. ICML’18, 2018.
  • Gürbüzbalaban et al. (2021) Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186:49–84, 2021.
  • Haochen and Sra (2019) Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In Proc. ICML’19, 2019.
  • Heckel (2022) Reinhard Heckel. Provable continual learning via sketched Jacobian approximations. In Proc. AISTATS’22, 2022.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • Li et al. (2023) Haoran Li, Jingfeng Wu, and Vladimir Braverman. Fixed design analysis of regularization-based continual learning. arXiv preprint arXiv:2303.10263, 2023.
  • Li et al. (2020) Jiajin Li, Caihua Chen, and Anthony Man-Cho So. Fast epigraphical projection-based incremental algorithms for Wasserstein distributionally robust support vector machine. In Proc. NeurIPS’20, 2020.
  • Li et al. (2019) Xiao Li, Zhihui Zhu, Anthony Man-Cho So, and Jason D Lee. Incremental methods for weakly convex optimization. arXiv preprint arXiv:1907.11687, 2019.
  • Lin et al. (2023a) Cheuk Yin Lin, Chaobing Song, and Jelena Diakonikolas. Accelerated cyclic coordinate dual averaging with extrapolation for composite convex optimization. In Proc. ICML’23, 2023a.
  • Lin et al. (2023b) Sen Lin, Peizhong Ju, Yingbin Liang, and Ness Shroff. Theory on forgetting and generalization of continual learning. arXiv preprint arXiv:2302.05836, 2023b.
  • Liu and Zhou (2023) Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. arXiv preprint arXiv:2312.08531, 2023.
  • Liu and Zhou (2024) Zijian Liu and Zhengyuan Zhou. On the last-iterate convergence of shuffling gradient methods. arXiv preprint arXiv:2403.07723, 2024.
  • Lohr (2021) Sharon L Lohr. Sampling: design and analysis. CRC press, 2021.
  • Lopez-Paz and Ranzato (2017) David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Proc. NIPS’17, 2017.
  • McCloskey and Cohen (1989) Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
  • Mishchenko et al. (2020) Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. In Proc. NeurIPS’20, 2020.
  • Mishchenko et al. (2022) Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and federated random reshuffling. In Proc. ICML’22, 2022.
  • Nagaraj et al. (2019) Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In Proc. ICML’19, 2019.
  • Nguyen et al. (2021) Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten Van Dijk. A unified convergence analysis for shuffling-type gradient methods. The Journal of Machine Learning Research, 22(1):9397–9440, 2021.
  • Parisi et al. (2019) German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
  • Peng and Risteski (2022) Binghui Peng and Andrej Risteski. Continual learning: a feature extraction formalization, an efficient algorithm, and fundamental obstructions. In Proc. NeurIPS’22, 2022.
  • Peng et al. (2023) Liangzu Peng, Paris Giampouras, and René Vidal. The ideal continual learner: An agent that never forgets. In Proc. ICML’23, 2023.
  • Rajput et al. (2020) Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In Proc. ICML’20, 2020.
  • Robins (1995) Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
  • Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Proc. NeurIPS’19, 2019.
  • Safran and Shamir (2020) Itay Safran and Ohad Shamir. How good is SGD with random shuffling? In Proc. COLT’20, 2020.
  • Shamir (2016) Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Proc. NeurIPS’16, 2016.
  • Shamir and Zhang (2013) Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proc. ICML’13, 2013.
  • Song and Diakonikolas (2023) Chaobing Song and Jelena Diakonikolas. Cyclic coordinate dual averaging with extrapolation. SIAM Journal on Optimization, 33(4):2935–2961, 2023. doi: 10.1137/22M1470104.
  • Swartworth et al. (2024) William Swartworth, Deanna Needell, Rachel Ward, Mark Kong, and Halyun Jeong. Nearly optimal bounds for cyclic forgetting. In Proc. NeurIPS’24, 2024.
  • Tran et al. (2021) Trang H Tran, Lam M Nguyen, and Quoc Tran-Dinh. SMG: A shuffling gradient-based method with momentum. In Proc. ICML’21, 2021.
  • Tran et al. (2022) Trang H Tran, Katya Scheinberg, and Lam M Nguyen. Nesterov accelerated shuffling gradient method for convex optimization. In Proc. ICML’22, 2022.
  • Yu and Li (2023) Hengxu Yu and Xiao Li. High probability guarantees for random reshuffling. arXiv preprint arXiv:2311.11841, 2023.
  • Yun et al. (2022) Chulhee Yun, Shashank Rajput, and Suvrit Sra. Minibatch vs local SGD with shuffling: Tight convergence bounds and beyond. In Proc. ICLR’22, 2022.
  • Zamani and Glineur (2023) Moslem Zamani and François Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023.
  • Zhang et al. (2021) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

Appendix A Omitted Proofs From Section 2

See 2.1

Proof.

Since each $f_t$ is convex and $L$-smooth, we have for $t \in [T]$ (see Eq. (1.3)):

\begin{align*}
f_t(\bm{x}_k) - f_t(\bm{x}_{k-1,t}) \leq\; & \left\langle \nabla f_t(\bm{x}_k), \bm{x}_k - \bm{x}_{k-1,t} \right\rangle - \frac{1}{2L}\|\nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t})\|^2, \tag{A.1} \\
f_t(\bm{x}_{k-1,t}) - f_t(\bm{z}) \leq\; & \left\langle \nabla f_t(\bm{x}_{k-1,t}), \bm{x}_{k-1,t} - \bm{z} \right\rangle - \frac{1}{2L}\|\nabla f_t(\bm{x}_{k-1,t}) - \nabla f_t(\bm{z})\|^2. \tag{A.2}
\end{align*}
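As a numerical sanity check of the inequality behind Eq. (A.1) and (A.2), the following minimal sketch evaluates both sides for a logistic loss at random pairs of points. The choice of loss, the helper names `logistic` and `slack`, and the smoothness constant $L = \|\bm{a}\|^2/4$ used for this particular loss are illustrative assumptions and do not come from the paper.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logistic(a, label, x):
    """Value and gradient of f(x) = log(1 + exp(-label * <a, x>)), label in {-1, +1}."""
    m = label * dot(a, x)
    value = math.log1p(math.exp(-m))
    coeff = -label / (1.0 + math.exp(m))
    return value, [coeff * ai for ai in a]

def slack(a, label, u, v):
    """Right side minus left side of the (A.1)-type inequality at points u, v."""
    L = dot(a, a) / 4.0  # smoothness constant of this logistic loss (assumption of the sketch)
    fu, gu = logistic(a, label, u)
    fv, gv = logistic(a, label, v)
    diff_pts = [p - q for p, q in zip(u, v)]
    diff_grad = [p - q for p, q in zip(gu, gv)]
    rhs = dot(gu, diff_pts) - dot(diff_grad, diff_grad) / (2.0 * L)
    return rhs - (fu - fv)

random.seed(0)
min_slack = min(
    slack([random.gauss(0, 1) for _ in range(3)],
          random.choice([-1, 1]),
          [random.gauss(0, 1) for _ in range(3)],
          [random.gauss(0, 1) for _ in range(3)])
    for _ in range(1000)
)
print(min_slack)  # should be nonnegative up to floating-point rounding
```

A nonnegative slack on every sampled pair is consistent with the inequality; it illustrates the statement on one function and does not replace the proof.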

On the other hand, letting $\Phi_{k-1,t}(\bm{x}) := \left\langle \nabla f_t(\bm{x}_{k-1,t}), \bm{x} \right\rangle + \frac{1}{2\eta_k}\|\bm{x}_{k-1,t} - \bm{x}\|^2$, we have $\nabla \Phi_{k-1,t}(\bm{x}_{k-1,t+1}) = \mathbf{0}$ by the update $\bm{x}_{k-1,t+1} = \bm{x}_{k-1,t} - \eta_k \nabla f_t(\bm{x}_{k-1,t})$ in Alg. 1. Observe that $\Phi_{k-1,t}$ is $\frac{1}{\eta_k}$-strongly convex, and thus we also have

\begin{align*}
\Phi_{k-1,t}(\bm{z}) \geq \Phi_{k-1,t}(\bm{x}_{k-1,t+1}) + \frac{1}{2\eta_k}\|\bm{z} - \bm{x}_{k-1,t+1}\|^2. \tag{A.3}
\end{align*}

Summing Eq. (A.1) and (A.2) and using Eq. (A.3), we conclude that for $t \in [T]$,

\begin{align*}
& f_t(\bm{x}_k) - f_t(\bm{z}) \\
\leq\; & \left\langle \nabla f_t(\bm{x}_k), \bm{x}_k - \bm{x}_{k-1,t} \right\rangle + \left\langle \nabla f_t(\bm{x}_{k-1,t}), \bm{x}_{k-1,t} - \bm{x}_{k-1,t+1} \right\rangle \\
& - \frac{1}{2L}\|\nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t})\|^2 - \frac{1}{2L}\|\nabla f_t(\bm{x}_{k-1,t}) - \nabla f_t(\bm{z})\|^2 \\
& - \frac{1}{2\eta_k}\|\bm{x}_{k-1,t+1} - \bm{x}_{k-1,t}\|^2 + \frac{1}{2\eta_k}\big(\|\bm{x}_{k-1,t} - \bm{z}\|^2 - \|\bm{x}_{k-1,t+1} - \bm{z}\|^2\big).
\end{align*}

Decomposing $\nabla f_t(\bm{x}_k) = \nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t}) + \nabla f_t(\bm{x}_{k-1,t})$ and summing over $t \in [T]$, where we recall $\bm{x}_{k-1} = \bm{x}_{k-1,1}$ and $\bm{x}_k = \bm{x}_{k-1,T+1}$, we have

\begin{align*}
T\big(f(\bm{x}_k) - f(\bm{z})\big) \leq\; & \underbrace{\sum_{t=1}^{T} \left\langle \nabla f_t(\bm{x}_{k-1,t}), \bm{x}_k - \bm{x}_{k-1,t+1} \right\rangle - \frac{1}{2\eta_k}\sum_{t=1}^{T}\|\bm{x}_{k-1,t+1} - \bm{x}_{k-1,t}\|^2}_{\mathcal{T}_1} \\
& + \underbrace{\sum_{t=1}^{T} \left\langle \nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t}), \bm{x}_k - \bm{x}_{k-1,t} \right\rangle}_{\mathcal{T}_2} \\
& - \frac{1}{2L}\sum_{t=1}^{T}\|\nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t})\|^2 - \frac{1}{2L}\sum_{t=1}^{T}\|\nabla f_t(\bm{x}_{k-1,t}) - \nabla f_t(\bm{z})\|^2 \\
& + \frac{1}{2\eta_k}\big(\|\bm{x}_{k-1} - \bm{z}\|^2 - \|\bm{x}_k - \bm{z}\|^2\big).
\end{align*}

For the term $\mathcal{T}_1$, we recall that by the IGD update, $\nabla f_t(\bm{x}_{k-1,t}) = -\frac{1}{\eta_k}(\bm{x}_{k-1,t+1} - \bm{x}_{k-1,t})$ and $\bm{x}_k - \bm{x}_{k-1,t+1} = \sum_{s=t+1}^{T}(\bm{x}_{k-1,s+1} - \bm{x}_{k-1,s})$, and thus we have

\begin{align*}
\mathcal{T}_1 =\; & -\frac{1}{\eta_k}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} \left\langle \bm{x}_{k-1,t+1} - \bm{x}_{k-1,t}, \bm{x}_{k-1,s+1} - \bm{x}_{k-1,s} \right\rangle - \frac{1}{2\eta_k}\sum_{t=1}^{T}\|\bm{x}_{k-1,t+1} - \bm{x}_{k-1,t}\|^2 \\
=\; & -\frac{1}{2\eta_k}\Big\|\sum_{t=1}^{T}(\bm{x}_{k-1,t+1} - \bm{x}_{k-1,t})\Big\|^2 = -\frac{1}{2\eta_k}\|\bm{x}_k - \bm{x}_{k-1}\|^2 \leq 0.
\end{align*}
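The closed form for $\mathcal{T}_1$ can be checked numerically. The sketch below runs one epoch of incremental gradient descent on least-squares components $f_t(\bm{x}) = \frac{1}{2}(\langle \bm{a}_t, \bm{x}\rangle - b_t)^2$ and compares the definition of $\mathcal{T}_1$ with $-\frac{1}{2\eta_k}\|\bm{x}_k - \bm{x}_{k-1}\|^2$. The component functions, the random data, the step size, and the helper names `igd_epoch` and `t1_both_sides` are assumptions made for illustration only.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def igd_epoch(x0, data, eta):
    """One pass of incremental gradient descent; returns inner iterates and gradients."""
    iterates, grads = [list(x0)], []
    for a, b in data:
        x = iterates[-1]
        g = [(dot(a, x) - b) * ai for ai in a]  # gradient of 0.5 * (<a, x> - b)^2
        grads.append(g)
        iterates.append([xi - eta * gi for xi, gi in zip(x, g)])
    return iterates, grads

def t1_both_sides(x0, data, eta):
    """Return (T_1 from its definition, -||x_k - x_{k-1}||^2 / (2 * eta))."""
    iterates, grads = igd_epoch(x0, data, eta)
    x_end, T = iterates[-1], len(data)
    # iterates[t] plays the role of x_{k-1,t+1} (0-based), grads[t] is the gradient used in step t
    lhs = sum(dot(grads[t], sub(x_end, iterates[t + 1])) for t in range(T))
    steps = [sub(iterates[t + 1], iterates[t]) for t in range(T)]
    lhs -= sum(dot(d, d) for d in steps) / (2 * eta)
    diff = sub(x_end, iterates[0])
    rhs = -dot(diff, diff) / (2 * eta)
    return lhs, rhs

random.seed(0)
data = [([random.gauss(0, 1) for _ in range(3)], random.gauss(0, 1)) for _ in range(5)]
lhs, rhs = t1_both_sides([1.0, -2.0, 0.5], data, eta=0.05)
print(lhs, rhs)  # the two values should agree up to rounding
```

The identity uses only the form of the update, so it should hold for any differentiable components, not just the quadratics chosen here.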

For the term $\mathcal{T}_2$, noticing that $\bm{x}_k - \bm{x}_{k-1,t} = -\eta_k \sum_{s=t}^{T} \nabla f_s(\bm{x}_{k-1,s})$ and decomposing $\nabla f_s(\bm{x}_{k-1,s}) = (\nabla f_s(\bm{x}_{k-1,s}) - \nabla f_s(\bm{z})) + (\nabla f_s(\bm{z}) - \nabla f_s(\bm{x}_*)) + \nabla f_s(\bm{x}_*)$, we use Young’s inequality with parameters $\alpha > 0$ and $\beta > 0$ to obtain

\begin{align*}
\mathcal{T}_2 =\; & \sum_{t=1}^{T} \Big\langle \nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t}), -\eta_k \sum_{s=t}^{T} \nabla f_s(\bm{x}_{k-1,s}) \Big\rangle \\
\leq\; & \frac{1}{2L}\Big(\frac{1}{2} + \frac{1}{\alpha} + \frac{1}{\beta}\Big)\sum_{t=1}^{T}\|\nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t})\|^2 \\
& + \frac{\alpha\eta_k^2 L}{2}\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\big(\nabla f_s(\bm{z}) - \nabla f_s(\bm{x}_*)\big)\Big\|^2 + \eta_k^2 L\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_s(\bm{x}_*)\Big\|^2 \\
& + \frac{\beta\eta_k^2 L}{2}\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\big(\nabla f_s(\bm{x}_{k-1,s}) - \nabla f_s(\bm{z})\big)\Big\|^2.
\end{align*}
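The bound on $\mathcal{T}_2$ appears to apply Young’s inequality in the form $\langle \bm{a}, -\eta \bm{b}\rangle \leq \frac{1}{2Lc}\|\bm{a}\|^2 + \frac{cL\eta^2}{2}\|\bm{b}\|^2$ three times, with $c = \beta$, $c = \alpha$, and $c = 2$ for the three pieces of the decomposition, which would account for the coefficients $\frac{1}{2L}\big(\frac{1}{2} + \frac{1}{\alpha} + \frac{1}{\beta}\big)$, $\frac{\alpha\eta_k^2 L}{2}$, $\eta_k^2 L$, and $\frac{\beta\eta_k^2 L}{2}$. This reading is an interpretation rather than something stated in the text. The minimal sketch below checks that form of the inequality on random vectors; the helper name `young_gap` and the sampling ranges are assumptions of the sketch.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def young_gap(a, b, eta, L, c):
    """Right side minus left side of <a, -eta*b> <= |a|^2/(2*L*c) + c*L*eta^2*|b|^2/2."""
    lhs = -eta * dot(a, b)
    rhs = dot(a, a) / (2 * L * c) + c * L * eta ** 2 * dot(b, b) / 2
    return rhs - lhs

random.seed(1)
gaps = []
for _ in range(1000):
    a = [random.gauss(0, 1) for _ in range(4)]
    b = [random.gauss(0, 1) for _ in range(4)]
    eta = random.uniform(0.01, 1.0)
    L = random.uniform(0.1, 10.0)
    c = random.uniform(0.1, 10.0)
    gaps.append(young_gap(a, b, eta, L, c))
min_gap = min(gaps)
print(min_gap)  # should be nonnegative up to floating-point rounding
```

The gap equals half the squared norm of $\bm{a}/\sqrt{Lc} + \sqrt{cL}\,\eta\,\bm{b}$, so it is nonnegative for every $c > 0$; the parameters $\alpha$ and $\beta$ are left free in the proof so that they can be tuned later.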

Further using that $\|\sum_{i=1}^{n} \bm{x}_i\|^2 \leq n\sum_{i=1}^{n}\|\bm{x}_i\|^2$ and combining the above bounds on $\mathcal{T}_1$ and $\mathcal{T}_2$, we obtain

\begin{align*}
T\big(f({\bm{x}}_k) - f({\bm{z}})\big) \leq\; & \frac{1}{2L}\Big(\frac{1}{\alpha} + \frac{1}{\beta} - \frac{1}{2}\Big)\sum_{t=1}^{T}\|\nabla f_t({\bm{x}}_k) - \nabla f_t({\bm{x}}_{k-1,t})\|^2 \\
& + \Big(\frac{\beta\eta_k^2 T^2 L}{2} - \frac{1}{2L}\Big)\sum_{t=1}^{T}\|\nabla f_t({\bm{x}}_{k-1,t}) - \nabla f_t({\bm{z}})\|^2 \\
& + \frac{\alpha\eta_k^2 T^2 L}{2}\sum_{t=1}^{T}\|\nabla f_t({\bm{z}}) - \nabla f_t({\bm{x}}_*)\|^2 + \eta_k^2 L\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\Big\|^2 \\
& + \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1} - {\bm{z}}\|^2 - \|{\bm{x}}_k - {\bm{z}}\|^2\big).
\end{align*}

To make the first two terms on the right-hand side both nonpositive, we choose

\begin{align*}
\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}, \quad \eta_k \leq \frac{1}{\sqrt{\beta}TL}.
\end{align*}
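As an illustrative instance (one admissible choice, not necessarily the one used later in the analysis), taking $\alpha = \beta = 4$ meets the first condition with equality and reduces the step-size condition to $\eta_k \leq \frac{1}{2TL}$:

\begin{align*}
\frac{1}{\alpha} + \frac{1}{\beta} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}, \qquad \frac{\beta\eta_k^2 T^2 L}{2} - \frac{1}{2L} = 2\eta_k^2 T^2 L - \frac{1}{2L} \leq 0 \iff \eta_k \leq \frac{1}{2TL}.
\end{align*}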

We further bound the term $\sum_{t=1}^{T}\|\nabla f_t({\bm{z}}) - \nabla f_t({\bm{x}}_*)\|^2$ by the smoothness and convexity of each $f_t$ as follows:

\begin{align*}
\sum_{t=1}^{T}\|\nabla f_t({\bm{z}}) - \nabla f_t({\bm{x}}_*)\|^2 \leq\; & 2L\sum_{t=1}^{T}\big(f_t({\bm{z}}) - f_t({\bm{x}}_*) - \left\langle \nabla f_t({\bm{x}}_*), {\bm{z}} - {\bm{x}}_* \right\rangle\big) \\
=\; & 2TL\big(f({\bm{z}}) - f({\bm{x}}_*)\big),
\end{align*}

where in the last equality we used $\nabla f({\bm{x}}_*) = \mathbf{0}$. Hence, we finally obtain

\begin{align*}
T\big(f({\bm{x}}_k) - f({\bm{z}})\big) \leq\; & \eta_k^2 L\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\Big\|^2 + \alpha T^3\eta_k^2 L^2\big(f({\bm{z}}) - f({\bm{x}}_*)\big) \\
& + \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1} - {\bm{z}}\|^2 - \|{\bm{x}}_k - {\bm{z}}\|^2\big) \\
\leq\; & \eta_k^2 L\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\Big\|^2 + \frac{\alpha}{\beta}T\big(f({\bm{z}}) - f({\bm{x}}_*)\big) \\
& + \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1} - {\bm{z}}\|^2 - \|{\bm{x}}_k - {\bm{z}}\|^2\big),
\end{align*}

where we used $\eta_k \leq \frac{1}{\sqrt{\beta}TL}$ in the last inequality, thus completing the proof. ∎
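The gradient-difference bound used in the proof above, $\sum_{t=1}^{T}\|\nabla f_t({\bm{z}}) - \nabla f_t({\bm{x}}_*)\|^2 \leq 2TL\big(f({\bm{z}}) - f({\bm{x}}_*)\big)$, can be sanity-checked numerically. The sketch below is only an illustration under assumptions that are not part of the paper: the components are random least-squares losses $f_t({\bm{x}}) = \frac{1}{2}\|A_t{\bm{x}} - b_t\|^2$, and the sizes `T`, `d`, `m` and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, m = 5, 4, 6  # hypothetical sizes: components, dimension, rows per component

# Assumed test problem: f_t(x) = 0.5 * ||A_t x - b_t||^2, which is convex and
# smooth with constant ||A_t||_2^2; L is the largest such constant.
A = rng.standard_normal((T, m, d))
b = rng.standard_normal((T, m))
L = max(np.linalg.norm(A[t], 2) ** 2 for t in range(T))


def f_t(t, x):
    r = A[t] @ x - b[t]
    return 0.5 * r @ r


def grad_t(t, x):
    return A[t].T @ (A[t] @ x - b[t])


def f(x):
    # f is the average of the components, as in Eq. (1.1).
    return sum(f_t(t, x) for t in range(T)) / T


# x_star minimizes f: it solves the stacked least-squares problem,
# so the full gradient vanishes there.
x_star, *_ = np.linalg.lstsq(A.reshape(T * m, d), b.reshape(T * m), rcond=None)

z = rng.standard_normal(d)  # arbitrary comparison point
lhs = sum(np.linalg.norm(grad_t(t, z) - grad_t(t, x_star)) ** 2 for t in range(T))
rhs = 2 * T * L * (f(z) - f(x_star))
print(lhs <= rhs + 1e-8)
```

The inner-product terms $\langle \nabla f_t({\bm{x}}_*), {\bm{z}} - {\bm{x}}_* \rangle$ do not appear in `rhs` because they sum to $T\langle \nabla f({\bm{x}}_*), {\bm{z}} - {\bm{x}}_* \rangle = 0$, which is exactly the step the proof takes.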

See 2.2

Proof.


  1. Using Eq. (2.2) and the convexity of $f$, we have

    \begin{align*}
    & f({\bm{z}}_{k-1}) - f({\bm{x}}_*) \\
    \leq\; & (1-\lambda_{k-1})\big(f({\bm{x}}_{k-1}) - f({\bm{x}}_*)\big) + \sum_{j=0}^{k-2}\Big(\prod_{i=j+1}^{k-1}\lambda_i\Big)(1-\lambda_j)\big(f({\bm{x}}_j) - f({\bm{x}}_*)\big).
    \end{align*}

    It remains to multiply by $w_{k-1}$ on both sides and notice that by the lemma assumption,

    \begin{align*}
    w_{k-1}\Big(\prod_{i=j+1}^{k-1}\lambda_i\Big)(1-\lambda_j) \leq w_{k-2}\Big(\prod_{i=j+1}^{k-2}\lambda_i\Big)(1-\lambda_j) \leq \cdots \leq w_j(1-\lambda_j).
    \end{align*}
  2. This follows from the first part of the lemma, by decomposing $f({\bm{x}}_k) - f({\bm{z}}_{k-1}) = f({\bm{x}}_k) - f({\bm{x}}_*) - \big(f({\bm{z}}_{k-1}) - f({\bm{x}}_*)\big)$.
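The two ingredients of the first part can be checked numerically: the unrolled convex combination and the weight chain. The sketch below assumes, consistently with how ${\bm{z}}_{k-1}$ is used in these proofs but not quoted from Eq. (2.1), that ${\bm{z}}_k = \lambda_k {\bm{z}}_{k-1} + (1-\lambda_k){\bm{x}}_k$ with ${\bm{z}}_{-1} = {\bm{x}}_*$ and $\lambda_k \in (0, 1]$. The points, the weights, and the factor `0.9` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 8, 3                          # hypothetical horizon and dimension
x_star = np.zeros(d)                 # stands in for the minimizer x_*
x = rng.standard_normal((K, d))      # x_0, ..., x_{K-1}
lam = rng.uniform(0.2, 1.0, size=K)  # lambda_0, ..., lambda_{K-1} in (0, 1]

# Assumed recursion: z_k = lam_k * z_{k-1} + (1 - lam_k) * x_k, with z_{-1} = x_*.
z = []
z_prev = x_star
for k in range(K):
    z_prev = lam[k] * z_prev + (1 - lam[k]) * x[k]
    z.append(z_prev)


def unrolled(k):
    """Closed form of z_k as a convex combination of x_0, ..., x_k and x_*."""
    out = (1 - lam[k]) * x[k]
    for j in range(k):
        out = out + np.prod(lam[j + 1 : k + 1]) * (1 - lam[j]) * x[j]
    return out + np.prod(lam[: k + 1]) * x_star


unroll_err = max(np.linalg.norm(z[k] - unrolled(k)) for k in range(K))

# Weights satisfying lam_k * w_k <= w_{k-1}; the factor 0.9 makes it strict.
w = np.empty(K)
w_prev = 1.0                         # plays the role of w_{-1}
for k in range(K):
    w[k] = 0.9 * w_prev / lam[k]
    w_prev = w[k]

# Weight chain from the proof: w_k * prod_{i=j+1}^{k} lam_i <= w_j for all j <= k.
chain_ok = all(
    w[k] * np.prod(lam[j + 1 : k + 1]) <= w[j] + 1e-12
    for k in range(K)
    for j in range(k + 1)
)
print(unroll_err, chain_ok)
```

The coefficient on ${\bm{x}}_*$ in `unrolled` contributes nothing to the bound in the proof, since $f({\bm{x}}_*) - f({\bm{x}}_*) = 0$.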

See 2.3

Proof.

Plugging ${\bm{z}}_{k-1}$ defined by Eq. (2.1) into Lemma 2.1 and using the inequality $\|\sum_{i=1}^{n}{\bm{x}}_i\|^2 \leq n\sum_{i=1}^{n}\|{\bm{x}}_i\|^2$ to bound the term $\sum_{t=1}^{T}\|\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\|^2$, we obtain

\begin{align*}
T\big(f({\bm{x}}_k) - f({\bm{z}}_{k-1})\big) \leq\; & T^3\eta_k^2\sigma_*^2 L + \frac{\alpha}{\beta}T\big(f({\bm{z}}_{k-1}) - f({\bm{x}}_*)\big) \\
& + \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1} - {\bm{z}}_{k-1}\|^2 - \|{\bm{x}}_k - {\bm{z}}_{k-1}\|^2\big),
\end{align*}

where $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$. Multiplying both sides by $\eta_k w_{k-1}$, with $w_k$ such that $\lambda_k w_k \leq w_{k-1}$, and noticing that $\|{\bm{x}}_{k-1} - {\bm{z}}_{k-1}\|^2 \leq \lambda_{k-1}\|{\bm{x}}_{k-1} - {\bm{z}}_{k-2}\|^2$ by Eq. (2.1), we have

\begin{align*}
T\eta_k w_{k-1}\big(f({\bm{x}}_k) - f({\bm{z}}_{k-1})\big) \leq\; & T^3\eta_k^3\sigma_*^2 L w_{k-1} + \frac{\alpha}{\beta}T\eta_k w_{k-1}\big(f({\bm{z}}_{k-1}) - f({\bm{x}}_*)\big) \\
& + \frac{1}{2}\big(w_{k-2}\|{\bm{x}}_{k-1} - {\bm{z}}_{k-2}\|^2 - w_{k-1}\|{\bm{x}}_k - {\bm{z}}_{k-1}\|^2\big).
\end{align*}

We then sum over $k \in [K]$ and use Lemma 2.2 to obtain

\begin{align*}
& T\sum_{k=1}^{K}\eta_k\Big[w_{k-1}\big(f({\bm{x}}_k) - f({\bm{x}}_*)\big) - \sum_{j=0}^{k-1}w_j(1-\lambda_j)\big(f({\bm{x}}_j) - f({\bm{x}}_*)\big)\Big] \tag{A.4} \\
\leq\; & T^3\sigma_*^2 L\sum_{k=1}^{K}\eta_k^3 w_{k-1} + \frac{w_{-1}}{2}\|{\bm{x}}_0 - {\bm{x}}_*\|^2 + \frac{\alpha}{\beta}T\sum_{k=1}^{K}\eta_k\sum_{j=0}^{k-1}w_j(1-\lambda_j)\big(f({\bm{x}}_j) - f({\bm{x}}_*)\big),
\end{align*}

where we also use ${\bm{z}}_{-1} = {\bm{x}}_*$. Unrolling the terms w.r.t. $f({\bm{x}}_k) - f({\bm{x}}_*)$ $(k \in [K])$, we get

\begin{align*}
& \sum_{k=1}^{K}\eta_k\Big[w_{k-1}\big(f({\bm{x}}_k) - f({\bm{x}}_*)\big) - \sum_{j=0}^{k-1}w_j(1-\lambda_j)\big(f({\bm{x}}_j) - f({\bm{x}}_*)\big)\Big] \tag{A.5} \\
=\; & \eta_K w_{K-1}\big(f({\bm{x}}_K) - f({\bm{x}}_*)\big) - w_0(1-\lambda_0)\big(f({\bm{x}}_0) - f({\bm{x}}_*)\big)\sum_{k=1}^{K}\eta_k \\
& + \sum_{k=1}^{K-1}\Big(\eta_k w_{k-1} - w_k(1-\lambda_k)\sum_{j=k+1}^{K}\eta_j\Big)\big(f({\bm{x}}_k) - f({\bm{x}}_*)\big)
\end{align*}

and

\begin{align*}
\sum_{k=1}^{K}\eta_k\sum_{j=0}^{k-1}w_j(1-\lambda_j)\big(f({\bm{x}}_j) - f({\bm{x}}_*)\big) = \sum_{k=0}^{K-1}\Big(w_k(1-\lambda_k)\sum_{j=k+1}^{K}\eta_j\Big)\big(f({\bm{x}}_k) - f({\bm{x}}_*)\big).
\end{align*}
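Both rearrangements above are purely algebraic (an exchange of the order of summation), so they hold for arbitrary numbers. The following sketch checks them on random data; every array is a hypothetical stand-in, with `g[k]` playing the role of $f({\bm{x}}_k) - f({\bm{x}}_*)$.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 7                                    # hypothetical number of epochs
eta = rng.uniform(0.1, 1.0, size=K + 1)  # eta[k] for k = 1..K (eta[0] is unused)
w = rng.uniform(0.5, 2.0, size=K)        # w[k] for k = 0..K-1
lam = rng.uniform(0.0, 1.0, size=K)      # lam[k] for k = 0..K-1
g = rng.uniform(0.0, 1.0, size=K + 1)    # g[k] stands in for f(x_k) - f(x_*), k = 0..K

c = w * (1 - lam) * g[:K]                # c[j] = w_j (1 - lam_j) g_j


def tail(k):
    """Sum of eta_j over j = k+1, ..., K."""
    return eta[k + 1 : K + 1].sum()


# Exchange of summation order: sum_k eta_k sum_{j<k} c_j = sum_k c_k tail(k).
double_sum = sum(eta[k] * c[:k].sum() for k in range(1, K + 1))
swapped = sum(c[k] * tail(k) for k in range(K))

# Left- and right-hand sides of the unrolled identity (A.5).
a5_lhs = sum(eta[k] * (w[k - 1] * g[k] - c[:k].sum()) for k in range(1, K + 1))
a5_rhs = (
    eta[K] * w[K - 1] * g[K]
    - c[0] * tail(0)
    + sum((eta[k] * w[k - 1] - w[k] * (1 - lam[k]) * tail(k)) * g[k] for k in range(1, K))
)
print(abs(double_sum - swapped), abs(a5_lhs - a5_rhs))
```

Both printed differences should be at the level of floating-point round-off.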

Plugging back into Eq. (A.4), grouping like terms, and choosing $\lambda_0 = 1$, we obtain

\begin{align*}
& T\eta_K w_{K-1}\big(f({\bm{x}}_K) - f({\bm{x}}_*)\big) \tag{A.6} \\
& + T\sum_{k=1}^{K-1}\Big[\eta_k w_{k-1} - \Big(1 + \frac{\alpha}{\beta}\Big)w_k(1-\lambda_k)\sum_{j=k+1}^{K}\eta_j\Big]\big(f({\bm{x}}_k) - f({\bm{x}}_*)\big) \\
\leq\; & T^3\sigma_*^2 L\sum_{k=1}^{K}\eta_k^3 w_{k-1} + \frac{w_{-1}}{2}\|{\bm{x}}_0 - {\bm{x}}_*\|^2.
\end{align*}

To obtain the last iterate guarantee, it suffices to choose $w_k$ and $\lambda_k$ such that

\begin{align}
\lambda_k w_k &\leq w_{k-1}, \quad 0 \leq k \leq K-1, \tag{A.7}\\
\eta_k w_{k-1} - \big(1+\frac{\alpha}{\beta}\big)w_k(1-\lambda_k)\sum_{j=k+1}^{K}\eta_j &\geq 0, \quad 1 \leq k \leq K-1. \tag{A.8}
\end{align}

Noticing that Eq. (A.8) is equivalent to $\lambda_k \geq 1 - \frac{\eta_k w_{k-1}}{(1+\frac{\alpha}{\beta})w_k\sum_{j=k+1}^{K}\eta_j}$, to have both inequalities satisfied at the same time, it suffices that

\[
1 - \frac{\eta_k w_{k-1}}{\big(1+\frac{\alpha}{\beta}\big)w_k\sum_{j=k+1}^{K}\eta_j} \leq \frac{w_{k-1}}{w_k} \iff w_k \leq \frac{\eta_k + \big(1+\frac{\alpha}{\beta}\big)\sum_{j=k+1}^{K}\eta_j}{\big(1+\frac{\alpha}{\beta}\big)\sum_{j=k+1}^{K}\eta_j}\,w_{k-1}.
\]

To maximize the growth rate of $\{w_k\}$, we let $w_k = \frac{\eta_k + (1+\frac{\alpha}{\beta})\sum_{j=k+1}^{K}\eta_j}{(1+\frac{\alpha}{\beta})\sum_{j=k+1}^{K}\eta_j}\,w_{k-1}$.
Without loss of generality, we take $w_{-1} = w_0 = \prod_{k=1}^{K-1}\frac{(1+\frac{\alpha}{\beta})\sum_{j=k+1}^{K}\eta_j}{\eta_k + (1+\frac{\alpha}{\beta})\sum_{j=k+1}^{K}\eta_j}$, and thus $w_{K-1} = 1$. Hence, dividing both sides of Eq. (A.6) by $T\eta_K w_{K-1}$ and choosing the constant step size $\eta_k \equiv \eta$ for all $k \in [K]$, we obtain

\[
f({\bm{x}}_K) - f({\bm{x}}_*) \leq \eta^2 T^2\sigma_*^2 L\sum_{k=1}^{K} w_{k-1} + \frac{w_{-1}}{2\eta T}\|{\bm{x}}_0 - {\bm{x}}_*\|^2. \tag{A.9}
\]
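As a numerical sanity check of the weight construction above, the following minimal Python sketch builds $w_{-1}, w_0, \dots, w_{K-1}$ from the recursion taken with equality, and evaluates the left-hand side of (A.8) with $\lambda_k = w_{k-1}/w_k$, the value at which (A.7) is tight. The function names `build_weights` and `a8_residuals`, and the argument `ratio` standing for $\alpha/\beta$, are illustrative assumptions and not part of the original proof.

```python
def build_weights(etas, ratio):
    """Return [w_{-1}, w_0, ..., w_{K-1}] for step sizes etas = [eta_1, ..., eta_K].

    `ratio` stands for alpha/beta. Entry i of the result holds w_{i-1}.
    """
    K = len(etas)
    c = 1.0 + ratio
    # tail[k] = sum_{j=k+1}^{K} eta_j for k = 0, ..., K (so tail[K] = 0).
    tail = [0.0] * (K + 1)
    for k in range(K - 1, -1, -1):
        tail[k] = tail[k + 1] + etas[k]  # etas[k] stores eta_{k+1}
    # growth[k-1] = w_k / w_{k-1} for k = 1, ..., K-1.
    growth = [(etas[k - 1] + c * tail[k]) / (c * tail[k]) for k in range(1, K)]
    w0 = 1.0
    for g in growth:
        w0 /= g  # normalise so that w_{K-1} = 1
    w = [w0, w0]  # w_{-1} = w_0
    for g in growth:
        w.append(w[-1] * g)
    return w


def a8_residuals(etas, ratio):
    """Left-hand side of (A.8) for k = 1, ..., K-1 with lambda_k = w_{k-1}/w_k."""
    K = len(etas)
    c = 1.0 + ratio
    w = build_weights(etas, ratio)
    residuals = []
    for k in range(1, K):
        w_prev, w_k = w[k], w[k + 1]  # w_{k-1} and w_k
        lam = w_prev / w_k
        tail = sum(etas[k:])  # eta_{k+1} + ... + eta_K
        residuals.append(etas[k - 1] * w_prev - c * w_k * (1.0 - lam) * tail)
    return residuals
```

Under these assumptions, the last entry of `build_weights(etas, ratio)` should equal 1 up to rounding, the weights should be nondecreasing, and every entry of `a8_residuals(etas, ratio)` should vanish up to rounding, which reflects that the chosen recursion makes (A.8) hold with equality.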

We first bound $w_{-1} = \prod_{k=1}^{K-1}\frac{(1+\frac{\alpha}{\beta})(K-k)}{1+(1+\frac{\alpha}{\beta})(K-k)} = \prod_{k=1}^{K-1}\frac{(1+\frac{\alpha}{\beta})k}{1+(1+\frac{\alpha}{\beta})k}$ with the constant step size. Taking the natural logarithm of $w_{-1}$, we have

\[
\sum_{k=1}^{K-1}\log\Big(1 - \frac{1}{1+\big(1+\frac{\alpha}{\beta}\big)k}\Big) \overset{(i)}{\leq} -\sum_{k=1}^{K-1}\frac{1}{1+\big(1+\frac{\alpha}{\beta}\big)k} \leq -\frac{1}{1+\frac{\alpha}{\beta}}\sum_{k=1}^{K-1}\frac{1}{k+1},
\]

where for $(i)$ we use the fact that $\log(1+x) \leq x$ for $x > -1$. Further noticing that $\sum_{k=1}^{K-1}\frac{1}{k+1} = \sum_{k=1}^{K}\frac{1}{k} - 1 \geq \log(K) + \frac{1}{K} - 1$, we have

\[
\log(w_{-1}) \leq -\frac{1}{1+\frac{\alpha}{\beta}}\log(K) + \frac{1}{1+\frac{\alpha}{\beta}} \iff w_{-1} \leq \frac{\mathrm{e}^{1/(1+\alpha/\beta)}}{K^{\frac{1}{1+\alpha/\beta}}} \leq \frac{\mathrm{e}}{K^{\frac{1}{1+\alpha/\beta}}}.
\]
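The bound on $w_{-1}$ can be checked numerically. The sketch below is an illustration only: the function names and the tested values of $K$ and of `ratio` (standing for $\alpha/\beta$) are assumptions chosen for the example. It evaluates the product defining $w_{-1}$ under a constant step size and compares it with $\mathrm{e}^{1/(1+\alpha/\beta)} / K^{1/(1+\alpha/\beta)}$.

```python
import math


def w_minus_one(K, ratio):
    """w_{-1} = prod_{k=1}^{K-1} c*k / (1 + c*k), with c = 1 + alpha/beta."""
    c = 1.0 + ratio
    # Sum logarithms instead of multiplying directly, mirroring the proof.
    return math.exp(sum(math.log(c * k / (1.0 + c * k)) for k in range(1, K)))


def w_minus_one_bound(K, ratio):
    """Upper bound e^{1/c} / K^{1/c} derived in the text."""
    c = 1.0 + ratio
    return math.exp(1.0 / c) / K ** (1.0 / c)


if __name__ == "__main__":
    for K in (2, 10, 100, 1000):
        for ratio in (0.1, 1.0, 4.0):
            print(K, ratio, w_minus_one(K, ratio), w_minus_one_bound(K, ratio))
```

In each printed row the third column should not exceed the fourth, in line with the inequality above.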

On the other hand, we note that $w_{k-1} = \prod_{j=k}^{K-1}\frac{(1+\frac{\alpha}{\beta})(K-j)}{1+(1+\frac{\alpha}{\beta})(K-j)} = \prod_{j=1}^{K-k}\frac{(1+\frac{\alpha}{\beta})j}{1+(1+\frac{\alpha}{\beta})j}$ for $1 \leq k \leq K-1$ and $w_{K-1} = 1$. Following the above argument, we obtain

\begin{align*}
\sum_{k=1}^{K} w_{k-1} &= 1 + \sum_{k=1}^{K-1}\prod_{j=1}^{K-k}\frac{(1+\frac{\alpha}{\beta})j}{1+(1+\frac{\alpha}{\beta})j}\\
&\leq 1 + \sum_{k=1}^{K-1}\frac{\mathrm{e}}{(K-k+1)^{\frac{1}{1+\alpha/\beta}}} \overset{(i)}{\leq} \frac{\mathrm{e}(1+\alpha/\beta)}{\alpha/\beta}K^{\frac{\alpha/\beta}{1+\alpha/\beta}},
\end{align*}

where $(i)$ is due to $\sum_{k=2}^{K}\frac{1}{k^q} \leq \int_2^{K+1}\frac{1}{(x-1)^q}\,dx = \frac{K^{1-q}-1}{1-q}$ for any $0 < q < 1$. Hence, we obtain the final bound:

\[
f({\bm{x}}_K) - f({\bm{x}}_*) \leq \mathrm{e}\eta^2 T^2\sigma_*^2 L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}} + \frac{\mathrm{e}\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{2T\eta K^{\frac{1}{1+\alpha/\beta}}}.
\]
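The sum bound used to reach the final bound above can also be checked numerically. This is a minimal sketch under the constant step size assumption: `weight_sum`, `weight_sum_bound`, and the argument `ratio` (standing for $\alpha/\beta$) are hypothetical names introduced for illustration.

```python
import math


def weight_sum(K, ratio):
    """sum_{k=1}^{K} w_{k-1}, where w_{k-1} = prod_{j=1}^{K-k} c*j / (1 + c*j)."""
    c = 1.0 + ratio
    total, prod = 1.0, 1.0  # the k = K term is w_{K-1} = 1
    for j in range(1, K):
        prod *= c * j / (1.0 + c * j)  # prod now equals w_{K-j-1}
        total += prod
    return total


def weight_sum_bound(K, ratio):
    """e * (1 + beta/alpha) * K^{(alpha/beta) / (1 + alpha/beta)}."""
    return math.e * (1.0 + 1.0 / ratio) * K ** (ratio / (1.0 + ratio))
```

For any $K \geq 1$ and any positive `ratio`, `weight_sum(K, ratio)` should stay below `weight_sum_bound(K, ratio)`. It also never exceeds $K$, since every weight is at most 1.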

To analyze the oracle complexity, we take

\[
\eta = \min\Big\{\frac{\|{\bm{x}}_0 - {\bm{x}}_*\|^{2/3}}{2^{1/3}T\sigma_*^{2/3}L^{1/3}K^{1/3}(1+\beta/\alpha)^{1/3}},\, \frac{1}{\sqrt{\beta}TL}\Big\}
\]

and analyze the two possible cases depending on which term in the min is smaller. If the first term in the min is smaller (which we can equivalently think of as $K$ being “large”), we get

\[
f({\bm{x}}_K) - f({\bm{x}}_*) \leq \frac{3.5(1+\beta/\alpha)^{1/3}L^{1/3}\sigma_*^{2/3}\|{\bm{x}}_0 - {\bm{x}}_*\|^{4/3}}{K^{\frac{1}{1+\alpha/\beta}-\frac{1}{3}}}.
\]
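The constant 3.5 can be verified by direct substitution. In the sketch below, $D := \|{\bm{x}}_0 - {\bm{x}}_*\|$ and $q := \frac{1}{1+\alpha/\beta}$ are shorthand introduced only for this check (so that $\frac{\alpha/\beta}{1+\alpha/\beta} = 1-q$). Plugging the first term of the min into the two terms of the final bound gives

```latex
\begin{align*}
\mathrm{e}\,\eta^2T^2\sigma_*^2L(1+\beta/\alpha)K^{1-q}
  &= \frac{\mathrm{e}}{2^{2/3}}\cdot\frac{(1+\beta/\alpha)^{1/3}L^{1/3}\sigma_*^{2/3}D^{4/3}}{K^{q-\frac{1}{3}}},\\
\frac{\mathrm{e}\,D^2}{2T\eta K^{q}}
  &= \frac{\mathrm{e}}{2^{2/3}}\cdot\frac{(1+\beta/\alpha)^{1/3}L^{1/3}\sigma_*^{2/3}D^{4/3}}{K^{q-\frac{1}{3}}}.
\end{align*}
```

The two terms are balanced, and their sum has coefficient $2^{1/3}\mathrm{e} \approx 3.42 \leq 3.5$.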

Alternatively, if $\eta = \frac{1}{\sqrt{\beta}TL} \leq \frac{\|{\bm{x}}_0 - {\bm{x}}_*\|^{2/3}}{2^{1/3}T\sigma_*^{2/3}L^{1/3}K^{1/3}(1+\beta/\alpha)^{1/3}}$ (which we can think of as having “small” $K$), we obtain

\[
f({\bm{x}}_K) - f({\bm{x}}_*) \leq \frac{1.8(1+\beta/\alpha)^{1/3}L^{1/3}\sigma_*^{2/3}\|{\bm{x}}_0 - {\bm{x}}_*\|^{4/3}}{K^{\frac{1}{1+\alpha/\beta}-\frac{1}{3}}} + \frac{1.4\sqrt{\beta}L\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{K^{\frac{1}{1+\alpha/\beta}}}.
\]

Hence, combining these two cases, we have

\[
f({\bm{x}}_K) - f({\bm{x}}_*) \leq \frac{3.5(1+\beta/\alpha)^{1/3}L^{1/3}\sigma_*^{2/3}\|{\bm{x}}_0 - {\bm{x}}_*\|^{4/3}}{K^{\frac{1}{1+\alpha/\beta}-\frac{1}{3}}} + \frac{1.4\sqrt{\beta}L\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{K^{\frac{1}{1+\alpha/\beta}}}.
\]

In particular, if we choose $\alpha = 4$ and $\beta = 4\log K$, assuming without loss of generality that $\log K > 1$, then we have $K^{\frac{\alpha/\beta}{1+\alpha/\beta}} = K^{\frac{1}{\log K+1}} \leq \mathrm{e}$ and thus

\[
f({\bm{x}}_K) - f({\bm{x}}_*) \leq \frac{9.4L^{\frac{1}{3}}\sigma_*^{\frac{2}{3}}\|{\bm{x}}_0 - {\bm{x}}_*\|^{\frac{4}{3}}(1+\log K)^{\frac{1}{3}}}{K^{\frac{2}{3}}} + \frac{7.4L\|{\bm{x}}_0 - {\bm{x}}_*\|^2\sqrt{\log K}}{K}.
\]
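The substitution behind this bound can be spelled out as follows. This is a sketch, not the authors' own derivation: the shorthand $q := \frac{1}{1+\alpha/\beta}$ is introduced only here, and the constants are traced back to the unrounded coefficients $2^{1/3}\mathrm{e}$ and $\mathrm{e}/2$ that the final bound yields in the two cases (rounded up to 3.5 and 1.4 in the combined bound).

```latex
\begin{align*}
&\alpha = 4,\ \beta = 4\log K
  \;\Longrightarrow\; 1+\beta/\alpha = 1+\log K,\quad \sqrt{\beta} = 2\sqrt{\log K},\quad
  q = \frac{\log K}{1+\log K} = 1 - \frac{1}{1+\log K},\\
&K^{-(q-\frac{1}{3})} = K^{-\frac{2}{3}}\,K^{\frac{1}{1+\log K}} \leq \frac{\mathrm{e}}{K^{2/3}},
  \qquad
  K^{-q} = K^{-1}\,K^{\frac{1}{1+\log K}} \leq \frac{\mathrm{e}}{K},\\
&2^{1/3}\mathrm{e}\cdot\mathrm{e} \approx 9.31 \leq 9.4,
  \qquad
  \frac{\mathrm{e}}{2}\cdot 2\cdot\mathrm{e} = \mathrm{e}^2 \approx 7.39 \leq 7.4.
\end{align*}
```

Here $K^{\frac{1}{1+\log K}} = \exp\big(\frac{\log K}{1+\log K}\big) \leq \mathrm{e}$ because the exponent is below 1.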

To guarantee $f({\bm{x}}_K) - f({\bm{x}}_*) \leq \epsilon$ given $\epsilon > 0$, the total number of individual gradient evaluations will be

\[
TK = \widetilde{\mathcal{O}}\Big(\frac{TL\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{\epsilon} + \frac{TL^{1/2}\sigma_*\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{\epsilon^{3/2}}\Big).
\]

See 2.4

Proof.

We follow the proof of Theorem 2.3 up to Eq. (A.6) with constant step size $\eta$, but instead take $\lambda_k = w_{k-1}/w_k$ and $w_k = \frac{(1+\frac{\alpha}{\beta})(K-k)+1-c}{(1+\frac{\alpha}{\beta})(K-k)}\,w_{k-1}$ to obtain

\[
c\eta T\sum_{k=1}^{K} w_{k-1}\big(f({\bm{x}}_k) - f({\bm{x}}_*)\big) \leq T^3\eta^3\sigma_*^2 L\sum_{k=1}^{K} w_{k-1} + \frac{w_{-1}}{2}\|{\bm{x}}_0 - {\bm{x}}_*\|^2.
\]

Since $f$ is convex, we have $f(\hat{\bm{x}}_K) - f({\bm{x}}_*) \leq \frac{\sum_{k=1}^{K} w_{k-1}(f({\bm{x}}_k) - f({\bm{x}}_*))}{\sum_{k=1}^{K} w_{k-1}}$, where $\hat{\bm{x}}_K = \sum_{k=1}^{K} \frac{w_{k-1}{\bm{x}}_k}{\sum_{k=1}^{K} w_{k-1}}$ is the weighted average of $\{{\bm{x}}_k\}_{k=1}^{K}$ with increasing weights. Thus (cf. Eq. (A.9)),

\[
f(\hat{\bm{x}}_K) - f({\bm{x}}_*) \leq \frac{T^2\sigma_*^2\eta^2 L}{c} + \frac{\|{\bm{x}}_0 - {\bm{x}}_*\|^2 w_{-1}}{2c\eta T\sum_{k=1}^{K} w_{k-1}} \leq \frac{T^2\sigma_*^2\eta^2 L}{c} + \frac{\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{2c\eta TK},
\]

where the last step is due to $\sum_{k=1}^{K} w_{k-1} \geq K w_{-1}$. Then we follow the proof of Theorem 2.3 and choose $\eta = \min\big\{\frac{1}{\sqrt{\beta}TL}, \big(\frac{\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{2T^3\sigma_*^2 LK}\big)^{1/3}\big\}$ to obtain

\[
f(\hat{\bm{x}}_K) - f({\bm{x}}_*) \leq \frac{\sqrt{\beta}L\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{2cK} + \frac{2^{1/3}L^{1/3}\sigma_*^{2/3}\|{\bm{x}}_0 - {\bm{x}}_*\|^{4/3}}{cK^{2/3}}.
\]

To guarantee $f(\hat{\bm{x}}_K) - f({\bm{x}}_*) \leq \epsilon$ for $\epsilon > 0$, we choose $\beta = {\mathcal{O}}(1)$, and the total number of individual gradient evaluations is

\[
TK = {\mathcal{O}}\Big(\frac{TL\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{c\epsilon} + \frac{TL^{1/2}\sigma_*\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{c^{3/2}\epsilon^{3/2}}\Big),
\]

thus finishing the proof. ∎
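As a numerical sanity check of the step-size choice in this proof, the following Python sketch evaluates the intermediate bound $\frac{T^2\sigma_*^2\eta^2 L}{c} + \frac{\|{\bm{x}}_0 - {\bm{x}}_*\|^2}{2c\eta TK}$ at the prescribed $\eta$ and checks that it does not exceed the final two-term bound. The function names, the shorthand `D` for $\|{\bm{x}}_0 - {\bm{x}}_*\|$, and the sampled parameter ranges are illustrative assumptions, not part of the paper; `c` is treated simply as a positive constant.

```python
import math
import random


def step_size(T, L, sigma, D, K, beta):
    """eta = min{1/(sqrt(beta) T L), (D^2 / (2 T^3 sigma^2 L K))^(1/3)}."""
    eta_smooth = 1.0 / (math.sqrt(beta) * T * L)
    eta_noise = (D ** 2 / (2 * T ** 3 * sigma ** 2 * L * K)) ** (1.0 / 3.0)
    return min(eta_smooth, eta_noise)


def intermediate_bound(eta, T, L, sigma, D, K, c):
    """T^2 sigma^2 eta^2 L / c + D^2 / (2 c eta T K)."""
    return T ** 2 * sigma ** 2 * eta ** 2 * L / c + D ** 2 / (2 * c * eta * T * K)


def final_bound(L, sigma, D, K, beta, c):
    """sqrt(beta) L D^2 / (2 c K) + 2^(1/3) L^(1/3) sigma^(2/3) D^(4/3) / (c K^(2/3))."""
    smooth_term = math.sqrt(beta) * L * D ** 2 / (2 * c * K)
    noise_term = (2 ** (1 / 3) * L ** (1 / 3) * sigma ** (2 / 3)
                  * D ** (4 / 3) / (c * K ** (2 / 3)))
    return smooth_term + noise_term


def check_random_instances(trials=1000, seed=0):
    """Return True if the intermediate bound never exceeds the final bound."""
    rng = random.Random(seed)
    for _ in range(trials):
        T = rng.randint(1, 50)
        K = rng.randint(1, 1000)
        L = rng.uniform(0.1, 10.0)
        sigma = rng.uniform(0.1, 10.0)
        D = rng.uniform(0.1, 10.0)
        beta = rng.uniform(1.0, 16.0)
        c = rng.uniform(0.1, 1.0)
        eta = step_size(T, L, sigma, D, K, beta)
        lhs = intermediate_bound(eta, T, L, sigma, D, K, c)
        rhs = final_bound(L, sigma, D, K, beta, c)
        if lhs > rhs * (1 + 1e-9):  # small relative slack for rounding
            return False
    return True


print(check_random_instances())
```

The check works because $\eta$ never exceeds the second argument of the minimum, which controls the first term, while $1/\eta$ is at most the sum of the reciprocals of the two arguments, which controls the second term.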

See 2.5

Proof.

We first introduce a lemma to bound the term $\sum_{t=1}^{T}\|\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\|^2$ in Lemma 2.1 with random permutations involved. Its proof is provided in Appendix A.

Lemma A.1.

Under Assumptions 4 and for Alg. 1 with uniformly random (SO/RR) shuffling, we have

\[
\textstyle\mathbb{E}\big[\sum_{t=1}^{T}\|\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\|^2\big] \leq \frac{T(T+1)}{6}\sigma_*^2.
\]

We then take expectation w.r.t. all the randomness on both sides of the inequality in Lemma 2.1 and use Lemma A.1. Then it remains to follow the analysis in the proof of Theorem 2.3. ∎

See A.1

Proof.

We first consider the random reshuffling strategy. Conditional on all the randomness up to but not including the $k$-th epoch, the only randomness in $\mathbb{E}_k[\sum_{t=1}^{T}\|\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\|^2]$ comes from the permutation $\pi^{(k)}$ at the $k$-th epoch. Further noticing that each partial sum $\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)$ can be seen as a batch sampled without replacement from $\{\nabla f_t({\bm{x}}_*)\}_{t\in[T]}$, we have

\begin{align*}
\mathbb{E}_k\Big[\sum_{t=1}^{T}\Big\|\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\Big\|^2\Big] =\; & \sum_{t=1}^{T}\mathbb{E}_{\pi^{(k)}}\Big[\Big\|\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\Big\|^2\Big] \\
=\; & \sum_{t=1}^{T}(T-t+1)^2\,\mathbb{E}_{\pi^{(k)}}\Big[\Big\|\frac{1}{T-t+1}\sum_{s=t}^{T}\nabla f_s({\bm{x}}_*)\Big\|^2\Big] \\
\overset{(i)}{\leq}\; & \sum_{t=1}^{T}(T-t+1)^2\frac{t-1}{(T-t+1)(T-1)}\sigma_*^2 \\
=\; & \frac{T(T+1)}{6}\sigma_*^2,
\end{align*}

where $(i)$ is due to $\sum_{t=1}^{T}\nabla f_t({\bm{x}}_*) = \mathbf{0}$ and sampling without replacement (see, e.g., (Lohr, 2021, Section 2.7)). It remains to take expectation w.r.t. all randomness on both sides and use the law of total expectation. For the shuffle-once variant, we can directly take expectation since the randomness only comes from the initial random permutation, and the above argument still applies. ∎
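The bound of Lemma A.1 can be checked exactly on a small instance by enumerating all permutations. The sketch below does this in plain Python. It assumes that $\sigma_*^2$ denotes $\frac{1}{T}\sum_{t=1}^{T}\|\nabla f_t({\bm{x}}_*)\|^2$ (the definition is not restated in this appendix, so this is an assumption), and it uses random centered vectors as hypothetical stand-ins for the component gradients at ${\bm{x}}_*$; all function names are illustrative.

```python
import itertools
import random


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(x * x for x in v)


def centered_random_vectors(T, d, seed=0):
    """T random vectors in R^d, shifted so that they sum to zero."""
    rng = random.Random(seed)
    vecs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
    mean = [sum(v[j] for v in vecs) / T for j in range(d)]
    return [[v[j] - mean[j] for j in range(d)] for v in vecs]


def expected_suffix_energy(grads):
    """Exact average over all permutations of sum_t ||sum_{s>=t} g_{pi(s)}||^2."""
    T, d = len(grads), len(grads[0])
    total, count = 0.0, 0
    for perm in itertools.permutations(range(T)):
        suffix = [0.0] * d
        # Walk the permutation backwards so that `suffix` holds the
        # partial sum over positions t, ..., T at each step.
        for idx in reversed(perm):
            suffix = [a + b for a, b in zip(suffix, grads[idx])]
            total += sq_norm(suffix)
        count += 1
    return total / count


T, d = 5, 3
grads = centered_random_vectors(T, d)
sigma_sq = sum(sq_norm(g) for g in grads) / T  # assumed definition of sigma_*^2
lhs = expected_suffix_energy(grads)
rhs = T * (T + 1) / 6 * sigma_sq
print(lhs, rhs)
```

Under the assumed definition of $\sigma_*^2$, the two printed values agree up to rounding, which suggests that step $(i)$ holds with equality in that case.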

Appendix B Omitted Proofs from Section 3

B.1 Convex Smooth Setting

Lemma B.1.

Under Assumptions 1 and 3, for any ${\bm{z}} \in \mathbb{R}^d$ that is fixed in the $k$-th cycle of Alg. 2 and for $\alpha > 0, \beta > 0$ such that $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$, if the step sizes satisfy $\eta_k \leq \frac{1}{\sqrt{\beta}TL}$, then we have for $k \in [K]$

\begin{align*}
T\big(f({\bm{x}}_k) - f({\bm{z}})\big) \leq\; & \eta_k^2 L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_s({\bm{x}}_*)\Big\|^2 + \frac{\alpha}{\beta}T\big(f({\bm{z}}) - f({\bm{x}}_*)\big) \tag{B.1} \\
& + \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1} - {\bm{z}}\|^2 - \|{\bm{x}}_k - {\bm{z}}\|^2\big).
\end{align*}
Proof.

Since each $f_t$ is convex and $L$-smooth, we have for $t \in [T]$

\begin{align*}
f_t({\bm{x}}_k) - f_t({\bm{x}}_{k-1,t+1}) & \leq \left\langle \nabla f_t({\bm{x}}_k), {\bm{x}}_k - {\bm{x}}_{k-1,t+1}\right\rangle - \frac{1}{2L}\|\nabla f_t({\bm{x}}_k) - \nabla f_t({\bm{x}}_{k-1,t+1})\|^2, \\
f_t({\bm{x}}_{k-1,t+1}) - f_t({\bm{z}}) & \leq \left\langle \nabla f_t({\bm{x}}_{k-1,t+1}), {\bm{x}}_{k-1,t+1} - {\bm{z}}\right\rangle - \frac{1}{2L}\|\nabla f_t({\bm{x}}_{k-1,t+1}) - \nabla f_t({\bm{z}})\|^2.
\end{align*}

Following the proof of Lemma 2.1, we add and subtract $\frac{1}{2\eta_k}\|{\bm{x}}_{k-1,t} - {\bm{z}}\|^2$ on the right-hand side of the second inequality and combine the above two inequalities to obtain

\begin{align*}
f_t({\bm{x}}_k) - f_t({\bm{z}}) \leq\; & \left\langle \nabla f_t({\bm{x}}_k), {\bm{x}}_k - {\bm{x}}_{k-1,t+1}\right\rangle - \frac{1}{2\eta_k}\|{\bm{x}}_{k-1,t+1} - {\bm{x}}_{k-1,t}\|^2 \\
& - \frac{1}{2L}\Big(\|\nabla f_t({\bm{x}}_k) - \nabla f_t({\bm{x}}_{k-1,t+1})\|^2 + \|\nabla f_t({\bm{x}}_{k-1,t+1}) - \nabla f_t({\bm{z}})\|^2\Big) \\
& + \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1,t} - {\bm{z}}\|^2 - \|{\bm{x}}_{k-1,t+1} - {\bm{z}}\|^2\big).
\end{align*}
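The distance terms in this inequality come from the standard three-point identity. As a sketch, assume the proximal step of Alg. 2 satisfies the optimality condition ${\bm{x}}_{k-1,t+1} = {\bm{x}}_{k-1,t} - \eta_k\nabla f_t({\bm{x}}_{k-1,t+1})$; this form is not restated here, but it is consistent with the expression for ${\bm{x}}_k - {\bm{x}}_{k-1,t+1}$ used for the term ${\mathcal{T}}_2$ in this proof. The inner product in the second inequality can then be rewritten as follows, using $2\langle {\bm{a}} - {\bm{b}}, {\bm{b}} - {\bm{c}}\rangle = \|{\bm{a}} - {\bm{c}}\|^2 - \|{\bm{b}} - {\bm{c}}\|^2 - \|{\bm{a}} - {\bm{b}}\|^2$:

```latex
\begin{align*}
\left\langle \nabla f_t({\bm{x}}_{k-1,t+1}), {\bm{x}}_{k-1,t+1} - {\bm{z}}\right\rangle
=\; & \frac{1}{\eta_k}\left\langle {\bm{x}}_{k-1,t} - {\bm{x}}_{k-1,t+1}, {\bm{x}}_{k-1,t+1} - {\bm{z}}\right\rangle \\
=\; & \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1,t} - {\bm{z}}\|^2 - \|{\bm{x}}_{k-1,t+1} - {\bm{z}}\|^2
      - \|{\bm{x}}_{k-1,t+1} - {\bm{x}}_{k-1,t}\|^2\big).
\end{align*}
```

Adding this to the first inequality yields the combined bound, with the two gradient-difference terms carried over unchanged.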

Decomposing $\nabla f_t({\bm{x}}_k) = \nabla f_t({\bm{x}}_k) - \nabla f_t({\bm{x}}_{k-1,t+1}) + \nabla f_t({\bm{x}}_{k-1,t+1})$ and summing over $t \in [T]$, we note that ${\bm{x}}_{k-1} = {\bm{x}}_{k-1,1}$ and ${\bm{x}}_k = {\bm{x}}_{k-1,T+1}$ and obtain

\begin{align*}
& T\big(f({\bm{x}}_k) - f({\bm{z}})\big) \\
\leq\; & \underbrace{\sum_{t=1}^{T}\left\langle \nabla f_t({\bm{x}}_{k-1,t+1}), {\bm{x}}_k - {\bm{x}}_{k-1,t+1}\right\rangle - \frac{1}{2\eta_k}\sum_{t=1}^{T}\|{\bm{x}}_{k-1,t+1} - {\bm{x}}_{k-1,t}\|^2}_{{\mathcal{T}}_1} \\
& + \underbrace{\sum_{t=1}^{T}\left\langle \nabla f_t({\bm{x}}_k) - \nabla f_t({\bm{x}}_{k-1,t+1}), {\bm{x}}_k - {\bm{x}}_{k-1,t+1}\right\rangle}_{{\mathcal{T}}_2} + \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1} - {\bm{z}}\|^2 - \|{\bm{x}}_k - {\bm{z}}\|^2\big) \\
& - \frac{1}{2L}\sum_{t=1}^{T}\|\nabla f_t({\bm{x}}_k) - \nabla f_t({\bm{x}}_{k-1,t+1})\|^2 - \frac{1}{2L}\sum_{t=1}^{T}\|\nabla f_t({\bm{x}}_{k-1,t+1}) - \nabla f_t({\bm{z}})\|^2.
\end{align*}

For the term ${\mathcal{T}}_1$, we follow the argument from the proof of Theorem 2.3 to obtain

\[
{\mathcal{T}}_1 = -\frac{1}{2\eta_k}\|{\bm{x}}_k - {\bm{x}}_{k-1}\|^2 \leq 0.
\]
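The closed form for ${\mathcal{T}}_1$ is a purely algebraic identity once the inner steps are written as ${\bm{x}}_{k-1,t+1} - {\bm{x}}_{k-1,t} = -\eta_k {\bm{g}}_t$, where ${\bm{g}}_t$ stands for $\nabla f_t({\bm{x}}_{k-1,t+1})$. The Python sketch below checks it numerically with arbitrary random vectors in place of the gradients. The update form, the names `g` and `eta`, and the helper functions are assumptions made for illustration only.

```python
import random


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def t1_and_closed_form(g, eta):
    """Return (T_1, -||x_k - x_{k-1}||^2 / (2 eta)) for steps -eta * g[t]."""
    T, d = len(g), len(g[0])
    # tail[t] = sum_{s > t} g[s], so x_k - x_{k-1,t+1} = -eta * tail[t].
    tail = [None] * T
    running = [0.0] * d
    for t in range(T - 1, -1, -1):
        tail[t] = running[:]
        running = [a + b for a, b in zip(running, g[t])]
    total = running  # sum of all g[t]; x_k - x_{k-1} = -eta * total

    inner = sum(dot(g[t], [-eta * a for a in tail[t]]) for t in range(T))
    step_norms = sum(eta ** 2 * dot(g[t], g[t]) for t in range(T))
    t1 = inner - step_norms / (2 * eta)
    closed_form = -(eta ** 2) * dot(total, total) / (2 * eta)
    return t1, closed_form


rng = random.Random(1)
g = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(7)]
t1, closed_form = t1_and_closed_form(g, eta=0.3)
print(t1, closed_form)
```

Both quantities reduce to $-\frac{\eta_k}{2}\|\sum_{t=1}^{T}{\bm{g}}_t\|^2$, since the cross terms $\langle {\bm{g}}_t, {\bm{g}}_s\rangle$ with $t < s$ and the squared norms together complete the square.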

For the term ${\mathcal{T}}_2$, noticing that ${\bm{x}}_k - {\bm{x}}_{k-1,t+1} = -\eta_k\sum_{s=t+1}^{T}\nabla f_s({\bm{x}}_{k-1,s+1})$ for $1 \leq t \leq T-1$ and decomposing $\nabla f_s({\bm{x}}_{k-1,s+1}) = (\nabla f_s({\bm{x}}_{k-1,s+1}) - \nabla f_s({\bm{z}})) + (\nabla f_s({\bm{z}}) - \nabla f_s({\bm{x}}_*)) + \nabla f_s({\bm{x}}_*)$, we use Young’s inequality with parameters $\alpha > 0$ and $\beta > 0$ to obtain

\begin{align*}
{\mathcal{T}}_2 =\; & \sum_{t=1}^{T-1}\Big\langle \nabla f_t({\bm{x}}_k) - \nabla f_t({\bm{x}}_{k-1,t+1}), -\eta_k\sum_{s=t+1}^{T}\nabla f_s({\bm{x}}_{k-1,s+1})\Big\rangle \\
\leq\; & \frac{1}{2L}\Big(\frac{1}{2} + \frac{1}{\alpha} + \frac{1}{\beta}\Big)\sum_{t=1}^{T}\|\nabla f_t({\bm{x}}_k) - \nabla f_t({\bm{x}}_{k-1,t+1})\|^2 \\
& + \frac{\alpha\eta_k^2 L}{2}\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big(\nabla f_s({\bm{z}}) - \nabla f_s({\bm{x}}_*)\big)\Big\|^2 + \eta_k^2 L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_s({\bm{x}}_*)\Big\|^2 \\
& + \frac{\beta\eta_k^2 L}{2}\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big(\nabla f_s({\bm{x}}_{k-1,s+1}) - \nabla f_s({\bm{z}})\big)\Big\|^2.
\end{align*}

Further using the fact that $\|\sum_{i=1}^{n}{\bm{x}}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|{\bm{x}}_{i}\|^{2}$ and combining the above bounds on ${\mathcal{T}}_{1}$ and ${\mathcal{T}}_{2}$, we obtain

\begin{align*}
T\big(f({\bm{x}}_{k}) - f({\bm{z}})\big) \leq\; & \frac{1}{2L}\Big(\frac{1}{\alpha}+\frac{1}{\beta}-\frac{1}{2}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k}) - \nabla f_{t}({\bm{x}}_{k-1,t+1})\|^{2} \\
& + \Big(\frac{\beta\eta_{k}^{2}T^{2}L}{2}-\frac{1}{2L}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k-1,t+1}) - \nabla f_{t}({\bm{z}})\|^{2} \\
& + \frac{\alpha\eta_{k}^{2}T^{2}L}{2}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{z}}) - \nabla f_{t}({\bm{x}}_{*})\|^{2} + \eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2} \\
& + \frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2} - \|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big).
\end{align*}

The rest of the proof is the same as the proof of Lemma 2.1 and is thus omitted. ∎

See 3.2

Proof.

For all parts of the proof, we consider $1$-dimensional quadratics

\[
f(x) := \frac{1}{T}\sum_{t=1}^{T}f_{t}(x), \quad\text{where}\quad f_{t}(x) = \frac{L}{2}(x-\delta_{t})^{2}
\]

for $t\in[T]$, $L>0$, and appropriately chosen sequences $\{\delta_{t}\}_{t\in[T]}\subseteq\mathbb{R}$.

It is immediate that $f(x)$ is minimized at $x_{*}=\frac{1}{T}\sum_{t=1}^{T}\delta_{t}$. Observe that Alg. 2 using a constant step size $\eta>0$ has closed-form updates on $f$, i.e.,

\[
x_{k+1} = \gamma^{T}x_{k} + (1-\gamma)\sum_{t=1}^{T}\gamma^{T-t}\delta_{t},
\]

where $\gamma=\frac{1}{\eta L+1}\in(0,1)$. Given any initial point $x_{0}$, by iterating we have

\begin{align*}
x_{k}-x_{*} &= \gamma^{kT}x_{0} + \sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)(1-\gamma^{kT})}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{t} \tag{B.2} \\
&\overset{k\rightarrow\infty}{\longrightarrow} \sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{t}. \tag{B.3}
\end{align*}
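
The closed-form recursion and Eqs. (B.2) and (B.3) can be checked numerically. The following is a minimal sketch, assuming that Alg. 2 applies one proximal step per component in the order $t=1,\dots,T$, so that each step on $f_t$ reads $x\mapsto\gamma x+(1-\gamma)\delta_t$. The function names and the numerical values of $\delta_t$, $\eta$, $L$, and $x_0$ are illustrative choices and do not come from the paper.

```python
import math


def incremental_prox_epoch(x, deltas, eta, L):
    """One pass over f_t(x) = L/2 * (x - delta_t)^2, one proximal step per component."""
    gamma = 1.0 / (eta * L + 1.0)
    for delta in deltas:
        # prox of eta * f_t at x: argmin_u L/2 (u - delta)^2 + (u - x)^2 / (2 eta)
        x = gamma * x + (1.0 - gamma) * delta
    return x


def closed_form_gap(k, x0, deltas, eta, L):
    """x_k - x_* as given by Eq. (B.2)."""
    T = len(deltas)
    gamma = 1.0 / (eta * L + 1.0)
    gap = gamma ** (k * T) * x0
    for t, delta in enumerate(deltas, start=1):
        weight = (gamma ** (T - t) * (1.0 - gamma) * (1.0 - gamma ** (k * T))
                  / (1.0 - gamma ** T)) - 1.0 / T
        gap += weight * delta
    return gap


def limit_gap(deltas, eta, L):
    """Limit of x_k - x_* as k grows, as given by Eq. (B.3)."""
    T = len(deltas)
    gamma = 1.0 / (eta * L + 1.0)
    return sum(
        (gamma ** (T - t) * (1.0 - gamma) / (1.0 - gamma ** T) - 1.0 / T) * delta
        for t, delta in enumerate(deltas, start=1)
    )


# Illustrative (hypothetical) problem instance.
deltas = [0.5, -1.0, 2.0, 0.25]
eta, L, x0 = 0.3, 2.0, 1.7
x_star = sum(deltas) / len(deltas)

x = x0
for k in range(1, 6):
    x = incremental_prox_epoch(x, deltas, eta, L)
    # The simulated iterate should agree with Eq. (B.2) up to rounding.
    assert math.isclose(x - x_star, closed_form_gap(k, x0, deltas, eta, L), abs_tol=1e-12)

for _ in range(200):
    x = incremental_prox_epoch(x, deltas, eta, L)
# After many epochs the gap should agree with the limit in Eq. (B.3).
assert math.isclose(x - x_star, limit_gap(deltas, eta, L), abs_tol=1e-12)
print("limiting gap x_k - x_*:", limit_gap(deltas, eta, L))
```

With a constant step size the limiting gap is in general nonzero, which is the behavior the three parts of the proof quantify.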
  1.

    Consider the weight of $\delta_{T}$ in Eq. (B.3). Since $\gamma\in(0,1)$, we have

    \[
    \frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T} = \frac{1}{\sum_{t=0}^{T-1}\gamma^{t}}-\frac{1}{T} > 0.
    \]

    Then for any $\{\delta_{t}\}_{t\in[T]}$ such that $\mathrm{sgn}(\delta_{t})=\mathrm{sgn}\big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\big)$ and $\delta_{T}>0$, we know that

    \[
    \lim_{k\rightarrow\infty}f(x_{k})-f(x_{*}) \overset{(i)}{=} \frac{L}{2}\lim_{k\rightarrow\infty}(x_{k}-x_{*})^{2} \geq \frac{L}{2}\Big(\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}\Big)^{2}\delta_{T}^{2} > 0,
    \]

    where $(i)$ is due to $f$ being both $L$-strongly convex and $L$-smooth.

  2.

    Consider the weights of $\delta_{t}$ in Eq. (B.3). Since $\gamma\in(0,1)$, we have

    \[
    0 \leq \frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}} \leq \gamma^{T-t} \leq 1,
    \]

    thus for any $t\in[T-1]$

    \[
    \frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T} \in \Big[-\frac{1}{T},\frac{T-1}{T}\Big).
    \]

    For $t=T$, given any fixed $\gamma<1$, we have

    \[
    \frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T} = \frac{1}{\sum_{t=0}^{T-1}\gamma^{t}}-\frac{1}{T} > 0.
    \]

    Hence, for any sequence $\{\delta_{t}\}$ such that

    \[
    |\delta_{t}| < \frac{T\sqrt{2/L}}{(T-1)^{2}}\;(t\in[T-1]), \quad \delta_{T} > \frac{2\sqrt{2/L}}{\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}},
    \]

    combining the bounds on the weights of $\delta_{t}$ with Eq. (B.3), we obtain

    \begin{align*}
    \lim_{k\rightarrow\infty}x_{k}-x_{*} =\; & \sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{t} \\
    \geq\; & \Big(\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{T} - \sum_{t=1}^{T-1}\frac{T-1}{T}|\delta_{t}| \\
    >\; & \sqrt{2/L}.
    \end{align*}

    Since $f$ is $L$-smooth and $L$-strongly convex, we know that

    \[
    \lim_{k\rightarrow\infty}f(x_{k})-f(x_{*}) = \frac{L}{2}\lim_{k\rightarrow\infty}(x_{k}-x_{*})^{2} > 1,
    \]

    thus finishing the proof of the second part. We note in passing that the $1$ on the right-hand side can be replaced by any constant using a simple rescaling.

  3.

    Observe that given a fixed step size $\eta>0$, we can choose a sequence $\{\delta_{t}\}_{t\in[T]}$ such that $\mathrm{sgn}(\delta_{t})=\mathrm{sgn}\big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\big)$ for all $t\in[T]$, thus for any initial point $x_{0}\geq x_{*}$:

    \begin{align*}
    f(x_{k})-f(x_{*}) =\; & \frac{L}{2}\Big(\gamma^{kT}(x_{0}-x_{*}) + (1-\gamma^{kT})\sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{t}\Big)^{2} \\
    \geq\; & \frac{(1-\gamma^{kT})^{2}L}{2}\sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)^{2}\delta_{t}^{2} \\
    & + (1-\gamma^{kT})^{2}L\sum_{s\neq t\in[T]}\Big(\frac{\gamma^{T-s}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)\delta_{s}\delta_{t}.
    \end{align*}

    Without loss of generality, taking a sufficiently large $k\geq\frac{\log_{\gamma}(1-2/\sqrt{5})}{T}$, we obtain

    \[
    f(x_{k})-f(x_{*}) \geq \frac{2L}{5}\sum_{t=1}^{T}\Big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^{T}}-\frac{1}{T}\Big)^{2}\delta_{t}^{2}.
    \]

    Then for any step size $\eta\geq 1/(TL)$, we have $\gamma=\frac{1}{\eta L+1}\leq\frac{T}{T+1}$. Considering $t=T$, we can bound

    \[
    \frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T} \geq \frac{1-\frac{T}{T+1}}{1-(\frac{T}{T+1})^{T}}-\frac{1}{T} = \frac{1}{T}\Big(\frac{1}{1-(1-\frac{1}{T})^{T}}-1\Big) > \frac{\mathrm{e}-1}{T}.
    \]

    Thus, for $\delta_{T}\geq\frac{\sqrt{5}T\sqrt{\varepsilon}}{\sqrt{2}(\mathrm{e}-1)L}$, we have

    \[
    f(x_{k})-f(x_{*}) \geq \frac{2L}{5}\Big(\frac{1-\gamma}{1-\gamma^{T}}-\frac{1}{T}\Big)^{2}\delta_{T}^{2} > \varepsilon.
    \]

    On the other hand, recalling the definition in Assumption 4, we have in this example that

    \[
    \sigma_{*}^{2} = \frac{L^{2}}{T}\sum_{t=1}^{T}\Big(\frac{\sum_{t=1}^{T}\delta_{t}}{T}-\delta_{t}\Big)^{2} = \frac{L^{2}}{T}\Big(\frac{(\sum_{t=1}^{T}\delta_{t})^{2}}{T} - 2\frac{(\sum_{t=1}^{T}\delta_{t})^{2}}{T} + \sum_{t=1}^{T}\delta_{t}^{2}\Big) \leq L^{2}\sum_{t=1}^{T}\delta_{t}^{2}/T.
    \]

    Thus, for any step size $\eta \geq \frac{16\sqrt{\varepsilon}}{\sqrt{TL}\,\sigma_*} \geq \frac{16\sqrt{\varepsilon}}{L^{3/2}\sqrt{\sum_{t=1}^{T}\delta_t^2}}$, we have $\gamma = \frac{1}{\eta L + 1} \leq \frac{\sqrt{\sum_{t=1}^{T}\delta_t^2}}{16\sqrt{\varepsilon/L} + \sqrt{\sum_{t=1}^{T}\delta_t^2}}$.

    We now proceed by bounding the weight of $\delta_T$. In particular, let $\gamma = 1 - \kappa$ for some $\kappa > 0$, and assume that $\kappa \leq \frac{1}{T+1} < \frac{1}{T}$ without loss of generality by the discussion above. Since $\kappa T < 1$ and $T \geq 2$, we have

    \[
    (1-\kappa)^T \geq 1 - \kappa T + \frac{\kappa^2 T(T-1)}{4},
    \]

    which leads to

    \[
    1 - \gamma^T = 1 - (1-\kappa)^T \leq \kappa T - \frac{\kappa^2 T(T-1)}{4}.
    \]

    Hence, we have

    \[
    \frac{1-\gamma}{1-\gamma^T} \geq \frac{\kappa}{\kappa T - \frac{\kappa^2 T(T-1)}{4}} = \frac{1}{T}\,\frac{1}{1 - \kappa(T-1)/4}.
    \]

    Further noticing that

    \[
    \frac{1}{1 - \kappa(T-1)/4} \geq 1 + \frac{\kappa(T-1)}{4} \geq 1 + \frac{\kappa T}{8}
    \]

    for $T \geq 2$ and $\kappa < \frac{1}{T}$, we have

    \[
    \frac{1-\gamma}{1-\gamma^T} - \frac{1}{T} \geq \frac{1}{T}\Big(1 + \frac{\kappa T}{8}\Big) - \frac{1}{T} = \frac{\kappa}{8}.
    \]
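The chain of inequalities above can be spot-checked numerically. The sketch below evaluates $\frac{1-\gamma}{1-\gamma^T} - \frac{1}{T}$ with $\gamma = 1-\kappa$ on a small grid satisfying $\kappa \leq \frac{1}{T+1}$ and compares it with $\kappa/8$. The helper name `last_weight_gap` and the grid are illustrative choices, not taken from the paper.

```python
import math


def last_weight_gap(kappa, T):
    """Return (1 - gamma) / (1 - gamma^T) - 1/T for gamma = 1 - kappa."""
    # 1 - (1 - kappa)^T, computed via expm1/log1p to limit cancellation
    # when kappa * T is small.
    one_minus_gamma_T = -math.expm1(T * math.log1p(-kappa))
    return kappa / one_minus_gamma_T - 1.0 / T


if __name__ == "__main__":
    for T in (2, 3, 10, 100, 1000):
        for frac in (1e-3, 1e-2, 0.1, 0.5, 1.0):
            kappa = frac / (T + 1)  # enforces kappa <= 1/(T+1)
            gap = last_weight_gap(kappa, T)
            assert gap >= kappa / 8
            print(f"T={T:5d} kappa={kappa:.3e} gap={gap:.3e} kappa/8={kappa / 8:.3e}")
```

This only probes a handful of points and is no substitute for the argument above; it is meant as a guard against transcription errors in the constants.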

    Recalling that $\gamma \leq \frac{\sqrt{\sum_{t=1}^{T}\delta_t^2}}{16\sqrt{\varepsilon/L} + \sqrt{\sum_{t=1}^{T}\delta_t^2}}$, we obtain

    \[
    \frac{1-\gamma}{1-\gamma^T} - \frac{1}{T} \geq \frac{2\sqrt{\varepsilon/L}}{16\sqrt{\varepsilon/L} + \sqrt{\sum_{t=1}^{T}\delta_t^2}} \geq \frac{\sqrt{3\varepsilon/L}}{\sqrt{\sum_{t=1}^{T}\delta_t^2}},
    \]

    for $\{\delta_t\}_{t\in[T]}$ such that $\sqrt{\sum_{t=1}^{T}\delta_t^2} \geq 16\sqrt{3}(2+\sqrt{3})\sqrt{\varepsilon/L}$. Thus, for the sequence $\{\delta_t\}_{t\in[T]}$ such that $\delta_T^2 > \frac{5}{6}\sum_{t=1}^{T}\delta_t^2$ and $\mathrm{sgn}(\delta_t) = \mathrm{sgn}\big(\frac{\gamma^{T-t}(1-\gamma)}{1-\gamma^T} - \frac{1}{T}\big)$, we have

    \[
    \begin{aligned}
    f(x_k) - f(x_*) \geq\; & \frac{2L}{5}\Big(\frac{1-\gamma}{1-\gamma^T} - \frac{1}{T}\Big)^2\delta_T^2 \\
    >\; & \frac{2L}{5}\,\frac{3\varepsilon\delta_T^2}{L\sum_{t=1}^{T}\delta_t^2} \\
    >\; & \varepsilon,
    \end{aligned}
    \]

completing the proof. ∎

B.2 Convex Lipschitz Setting

Lemma B.2.

Under Assumptions 1 and 2, for any ${\bm{z}} \in \mathbb{R}^d$ that is fixed in the $k$-th cycle of Alg. 2, we have for $k \in [K]$

\[
T\big(f({\bm{x}}_k) - f({\bm{z}})\big) \leq \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1} - {\bm{z}}\|^2 - \|{\bm{x}}_k - {\bm{z}}\|^2\big) + \frac{T(T-1)G^2\eta_k}{2}. \tag{B.4}
\]
Proof.

Since $f_t$ is convex and closed, we have

\[
\nabla M_{\eta_k f_t}({\bm{x}}_{k-1,t}) = \frac{1}{\eta_k}({\bm{x}}_{k-1,t} - {\bm{x}}_{k-1,t+1}) \in \partial f_t({\bm{x}}_{k-1,t+1}).
\]

By $G$-Lipschitzness of each component function, we have that for $t \in [T-1]$

\[
\begin{aligned}
& f_t({\bm{x}}_k) - f_t({\bm{x}}_{k-1,t+1}) \\
\leq\; & G\|{\bm{x}}_k - {\bm{x}}_{k-1,t+1}\| \leq \eta_k G\sum_{s=t+1}^{T}\|\nabla M_{\eta_k f_s}({\bm{x}}_{k-1,s})\| \leq (T-t)G^2\eta_k.
\end{aligned}
\tag{B.5}
\]

On the other hand, using convexity of $f_t$, we have that for $t \in [T]$

\[
f_t({\bm{z}}) \geq f_t({\bm{x}}_{k-1,t+1}) + \left\langle \nabla M_{\eta_k f_t}({\bm{x}}_{k-1,t}), {\bm{z}} - {\bm{x}}_{k-1,t+1} \right\rangle.
\]

Expanding the inner product in the above inequality leads to

\[
\begin{aligned}
& f_t({\bm{x}}_{k-1,t+1}) - f_t({\bm{z}}) \\
\leq\; & -\frac{1}{\eta_k}\left\langle {\bm{x}}_{k-1,t} - {\bm{x}}_{k-1,t+1}, {\bm{z}} - {\bm{x}}_{k-1,t+1} \right\rangle \\
=\; & \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1,t} - {\bm{z}}\|^2 - \|{\bm{x}}_{k-1,t+1} - {\bm{z}}\|^2\big) - \frac{1}{2\eta_k}\|{\bm{x}}_{k-1,t+1} - {\bm{x}}_{k-1,t}\|^2.
\end{aligned}
\tag{B.6}
\]

Combining Eq. (B.5) and (B.6) and noticing that ${\bm{x}}_{k-1,T+1} = {\bm{x}}_k$ and ${\bm{x}}_{k-1,1} = {\bm{x}}_{k-1}$, we sum the inequalities over $t \in [T]$ and obtain

\[
\begin{aligned}
T\big(f({\bm{x}}_k) - f({\bm{z}})\big) \leq\; & \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1} - {\bm{z}}\|^2 - \|{\bm{x}}_k - {\bm{z}}\|^2\big) + \frac{T(T-1)G^2\eta_k}{2} \\
& - \frac{1}{2\eta_k}\sum_{t=1}^{T}\|{\bm{x}}_{k-1,t+1} - {\bm{x}}_{k-1,t}\|^2 \\
\leq\; & \frac{1}{2\eta_k}\big(\|{\bm{x}}_{k-1} - {\bm{z}}\|^2 - \|{\bm{x}}_k - {\bm{z}}\|^2\big) + \frac{T(T-1)G^2\eta_k}{2}.
\end{aligned}
\]
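To make the per-cycle inequality (B.4) concrete, here is a minimal sketch that runs one cycle of incremental proximal updates on a toy problem and measures the slack in (B.4). The toy components $f_t(x) = |x - a_t|$ on the real line (so $G = 1$), the anchors $a_t$, and all function names are assumptions made for illustration only; they do not come from the paper.

```python
import math
import random


def prox_abs(x, a, eta):
    """Proximal map of eta * |. - a| at x: soft-thresholding centered at a."""
    d = x - a
    return a + math.copysign(max(abs(d) - eta, 0.0), d)


def cycle(x_prev, anchors, eta):
    """One pass x_{k-1,1} -> x_{k-1,T+1} = x_k over the components in order."""
    x = x_prev
    for a in anchors:
        x = prox_abs(x, a, eta)
    return x


def f(x, anchors):
    """Toy objective f(x) = (1/T) * sum_t |x - a_t|."""
    return sum(abs(x - a) for a in anchors) / len(anchors)


def lemma_slack(x_prev, z, anchors, eta, G=1.0):
    """Right-hand side minus left-hand side of (B.4); should be nonnegative."""
    T = len(anchors)
    x_new = cycle(x_prev, anchors, eta)
    lhs = T * (f(x_new, anchors) - f(z, anchors))
    rhs = ((x_prev - z) ** 2 - (x_new - z) ** 2) / (2 * eta) \
        + T * (T - 1) * G ** 2 * eta / 2
    return rhs - lhs


if __name__ == "__main__":
    random.seed(0)
    worst = float("inf")
    for _ in range(2000):
        T = random.randint(1, 8)
        anchors = [random.uniform(-10.0, 10.0) for _ in range(T)]
        slack = lemma_slack(random.uniform(-10.0, 10.0),
                            random.uniform(-10.0, 10.0),
                            anchors,
                            random.uniform(0.01, 2.0))
        worst = min(worst, slack)
    print("smallest observed slack:", worst)
```

The reference point `z` is arbitrary here, mirroring the lemma's "for any fixed ${\bm{z}}$"; the proof of the theorem later instantiates it with ${\bm{z}}_{k-1}$.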

See 3.3

Proof.

Plugging ${\bm{z}}_{k-1}$ defined in Eq. (2.1) into Eq. (B.4) and multiplying both sides by $\eta_k w_{k-1}$, we obtain

\[
\begin{aligned}
& T\eta_k w_{k-1}\big(f({\bm{x}}_k) - f({\bm{z}}_{k-1})\big) \\
\leq\; & \frac{1}{2}\big(w_{k-2}\|{\bm{x}}_{k-1} - {\bm{z}}_{k-2}\|^2 - w_{k-1}\|{\bm{x}}_k - {\bm{z}}_{k-1}\|^2\big) + \frac{T(T-1)G^2\eta_k^2 w_{k-1}}{2}.
\end{aligned}
\]

Summing over $k \in [K]$ and using the second part of Lemma 2.2, we have

\[
\begin{aligned}
& \sum_{k=1}^{K} T\eta_k\Big[w_{k-1}\big(f({\bm{x}}_k) - f({\bm{x}}_*)\big) - \sum_{j=0}^{k-1} w_j(1-\lambda_j)\big(f({\bm{x}}_j) - f({\bm{x}}_*)\big)\Big] \\
\leq\; & \frac{T(T-1)G^2}{2}\sum_{k=1}^{K}\eta_k^2 w_{k-1} + \frac{w_{-1}}{2}\|{\bm{x}}_0 - {\bm{x}}_*\|^2,
\end{aligned}
\]

where we also recall that ${\bm{z}}_{-1} = {\bm{x}}_*$. Unrolling the terms on the left-hand side as in Eq. (A.5) and choosing $\lambda_0 = 1$, we obtain

\[
\begin{aligned}
& T\eta_K w_{K-1}\big(f({\bm{x}}_K) - f({\bm{x}}_*)\big) \\
& + T\sum_{k=1}^{K-1}\Big[\eta_k w_{k-1} - w_k(1-\lambda_k)\sum_{j=k+1}^{K}\eta_j\Big]\big(f({\bm{x}}_k) - f({\bm{x}}_*)\big) \\
\leq\; & \frac{T(T-1)G^2}{2}\sum_{k=1}^{K}\eta_k^2 w_{k-1} + \frac{w_{-1}}{2}\|{\bm{x}}_0 - {\bm{x}}_*\|^2.
\end{aligned}
\tag{B.7}
\]

To obtain the last iterate guarantee, we choose $\lambda_k$ and $w_k$ such that

\[
\begin{aligned}
\lambda_k w_k \leq\; & w_{k-1}, \quad 0 \leq k \leq K-1, \\
\eta_k w_{k-1} - w_k(1-\lambda_k)\sum_{j=k+1}^{K}\eta_j \geq\; & 0, \quad 1 \leq k \leq K-1.
\end{aligned}
\]

For simplicity and without loss of generality, we make both inequalities tight, which gives $w_k = \frac{\sum_{j=k}^{K}\eta_j}{\sum_{j=k+1}^{K}\eta_j} w_{k-1}$. In particular, we choose $w_k = \frac{\eta_K}{\sum_{j=k+1}^{K}\eta_j}$ for $0 \leq k \leq K-1$, so that $w_{K-1} = 1$. Dividing both sides of Eq. (B.7) by $T\eta_K$, and using $T - 1 \leq T$ together with $w_{-1} = w_0$ (which follows from $\lambda_0 = 1$ and the first condition being tight), we obtain

\[
\begin{aligned}
f({\bm{x}}_K) - f({\bm{x}}_*) \leq\; & \frac{w_{-1}}{2T\eta_K}\|{\bm{x}}_0 - {\bm{x}}_*\|^2 + \frac{G^2 T}{2\eta_K}\sum_{k=1}^{K}\eta_k^2 w_{k-1} \\
=\; & \frac{1}{2T\sum_{k=1}^{K}\eta_k}\|{\bm{x}}_0 - {\bm{x}}_*\|^2 + \frac{G^2 T}{2}\sum_{k=1}^{K}\frac{\eta_k^2}{\sum_{j=k}^{K}\eta_j}.
\end{aligned}
\]
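The weight choice used in this step can be verified in exact arithmetic. The sketch below builds $w_k = \eta_K / \sum_{j=k+1}^{K}\eta_j$ and $\lambda_k = w_{k-1}/w_k$ for an arbitrary positive step-size sequence, then evaluates the second condition, which should vanish identically. The function names and the convention $w_{-1} = w_0$ (derived here from $\lambda_0 = 1$ with the first condition tight) are illustrative assumptions.

```python
from fractions import Fraction


def tight_weights(etas):
    """Given [eta_1, ..., eta_K], return (eta, w, lam, tail) for the tight choice.

    w[k] = eta_K / sum_{j=k+1}^K eta_j for 0 <= k <= K-1, and
    lam[k] = w[k-1] / w[k], so lam[k] * w[k] = w[k-1] holds with equality.
    """
    K = len(etas)
    eta = {k: Fraction(etas[k - 1]) for k in range(1, K + 1)}

    def tail(k):
        # sum_{j=k}^K eta_j
        return sum((eta[j] for j in range(k, K + 1)), Fraction(0))

    w = {k: eta[K] / tail(k + 1) for k in range(K)}
    # Assumption: lambda_0 = 1 and lambda_0 * w_0 = w_{-1} give w_{-1} = w_0.
    w[-1] = w[0]
    lam = {k: w[k - 1] / w[k] for k in range(K)}
    return eta, w, lam, tail


def second_condition_slack(etas):
    """eta_k w_{k-1} - w_k (1 - lam_k) sum_{j>k} eta_j for k = 1, ..., K-1."""
    eta, w, lam, tail = tight_weights(etas)
    K = len(etas)
    return [eta[k] * w[k - 1] - w[k] * (1 - lam[k]) * tail(k + 1)
            for k in range(1, K)]


if __name__ == "__main__":
    etas = [Fraction(1, k) for k in range(1, 7)]  # arbitrary positive steps
    eta, w, lam, tail = tight_weights(etas)
    print("w_{K-1} =", w[len(etas) - 1])
    print("slacks  =", second_condition_slack(etas))
```

Using `Fraction` avoids floating-point noise, so an exact zero slack confirms that both conditions hold with equality for this choice.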

Finally, choosing ηkη=𝒙0𝒙GTKsubscript𝜂𝑘𝜂normsubscript𝒙0subscript𝒙𝐺𝑇𝐾\eta_{k}\equiv\eta=\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{GT\sqrt{K}}italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ≡ italic_η = divide start_ARG ∥ bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ∥ end_ARG start_ARG italic_G italic_T square-root start_ARG italic_K end_ARG end_ARG, we get

f(𝒙K)f(𝒙)G𝒙0𝒙2K(1+k=1K1Kk+1)G𝒙0𝒙(1+logK/2)K.𝑓subscript𝒙𝐾𝑓subscript𝒙𝐺normsubscript𝒙0subscript𝒙2𝐾1superscriptsubscript𝑘1𝐾1𝐾𝑘1𝐺normsubscript𝒙0subscript𝒙1𝐾2𝐾\displaystyle f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\frac{G\|{\bm{x}}_{0}-{\bm{x}% }_{*}\|}{2\sqrt{K}}\Big{(}1+\sum_{k=1}^{K}\frac{1}{K-k+1}\Big{)}\leq\frac{G\|{% \bm{x}}_{0}-{\bm{x}}_{*}\|(1+\log K/2)}{\sqrt{K}}.italic_f ( bold_italic_x start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) - italic_f ( bold_italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ≤ divide start_ARG italic_G ∥ bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ∥ end_ARG start_ARG 2 square-root start_ARG italic_K end_ARG end_ARG ( 1 + ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_K - italic_k + 1 end_ARG ) ≤ divide start_ARG italic_G ∥ bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ∥ ( 1 + roman_log italic_K / 2 ) end_ARG start_ARG square-root start_ARG italic_K end_ARG end_ARG .

Hence, given ϵ>0italic-ϵ0\epsilon>0italic_ϵ > 0, to guarantee f(𝒙K)f(𝒙)ϵ𝑓subscript𝒙𝐾𝑓subscript𝒙italic-ϵf({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\epsilonitalic_f ( bold_italic_x start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) - italic_f ( bold_italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ≤ italic_ϵ, the total number of individual gradient evaluations will be

TK=𝒪~(G2T𝒙0𝒙2ϵ2),𝑇𝐾~𝒪superscript𝐺2𝑇superscriptnormsubscript𝒙0subscript𝒙2superscriptitalic-ϵ2TK=\widetilde{\mathcal{O}}\Big{(}\frac{G^{2}T\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}% }{\epsilon^{2}}\Big{)},italic_T italic_K = over~ start_ARG caligraphic_O end_ARG ( divide start_ARG italic_G start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_T ∥ bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_ϵ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) ,

completing the proof. ∎
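As a sanity check on the last step of the proof, the following minimal sketch evaluates the general step-size bound for a constant step size and compares it with the closed form and with its logarithmic upper bound. The function names and the values of `T`, `G`, and `D` (standing for $\|\bm{x}_0 - \bm{x}_*\|$) are illustrative assumptions and not part of the paper.

```python
import math

def last_iterate_bound(etas, T, G, D):
    """D^2 / (2 T sum_k eta_k) + (G^2 T / 2) * sum_k eta_k^2 / sum_{j >= k} eta_j."""
    K = len(etas)
    # suffix[k] holds eta_k + ... + eta_K (0-based indexing).
    suffix = [0.0] * (K + 1)
    for k in range(K - 1, -1, -1):
        suffix[k] = suffix[k + 1] + etas[k]
    first = D ** 2 / (2 * T * suffix[0])
    second = 0.5 * G ** 2 * T * sum(etas[k] ** 2 / suffix[k] for k in range(K))
    return first + second

def constant_step_closed_form(K, G, D):
    """G D / (2 sqrt(K)) * (1 + sum_{k=1}^K 1 / (K - k + 1))."""
    harmonic = sum(1.0 / (K - k + 1) for k in range(1, K + 1))
    return G * D / (2 * math.sqrt(K)) * (1 + harmonic)

def log_upper_bound(K, G, D):
    """G D (1 + log(K) / 2) / sqrt(K)."""
    return G * D * (1 + math.log(K) / 2) / math.sqrt(K)

T, G, D = 10, 2.0, 3.0  # illustrative values only
for K in (1, 10, 100, 1000):
    eta = D / (G * T * math.sqrt(K))  # the constant step size chosen in the proof
    general = last_iterate_bound([eta] * K, T, G, D)
    print(K, general, constant_step_closed_form(K, G, D), log_upper_bound(K, G, D))
```

For each `K`, the first two printed values should agree up to rounding, and the third should be at least as large, since the harmonic sum is at most $1 + \log K$.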

B.3 Inexact Proximal Point Evaluations

We first prove the convergence results for convex smooth settings. The following technical lemma bounds $f(\bm{x}_k) - f(\bm{z})$ within each epoch with inexact proximal point evaluations.
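To make the notion of an inexact proximal point evaluation concrete, here is a minimal sketch of one cycle under stated assumptions. Alg. 2 is not restated in this section, so the sketch assumes that each inner step approximates $\mathrm{prox}_{\eta_k f_t}$, that $\bm{g}_{k-1,t} := (\bm{x}_{k-1,t} - \bm{x}_{k-1,t+1})/\eta_k$, and that the error is measured as $\varepsilon_{k-1,t} = \eta_k\|\bm{g}_{k-1,t} - \nabla f_t(\bm{x}_{k-1,t+1})\|$. The logistic-loss components, the inner gradient-descent solver, and all function names are hypothetical choices made for illustration.

```python
import math
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def grad(a, label, x):
    """Gradient of the logistic loss f_t(x) = log(1 + exp(-label * <a, x>))."""
    coeff = -label * sigmoid(-label * dot(a, x))
    return [coeff * ai for ai in a]

def inexact_prox(a, label, x, eta, inner_steps):
    """Approximate prox_{eta f_t}(x) by gradient descent on the subproblem
    y -> f_t(y) + ||y - x||^2 / (2 eta); fewer inner steps means a larger error."""
    L_sub = dot(a, a) / 4.0 + 1.0 / eta  # smoothness constant of the subproblem
    y = list(x)
    for _ in range(inner_steps):
        g_f = grad(a, label, y)
        y = [yi - (gi + (yi - xi) / eta) / L_sub for yi, gi, xi in zip(y, g_f, x)]
    return y

def one_cycle(A, labels, x, eta, inner_steps):
    """One pass over all components; returns the new iterate and realized errors."""
    errors = []
    for a_t, y_t in zip(A, labels):
        x_next = inexact_prox(a_t, y_t, x, eta, inner_steps)
        g = [(xi - ni) / eta for xi, ni in zip(x, x_next)]        # assumed g_{k-1,t}
        diff = [gi - hi for gi, hi in zip(g, grad(a_t, y_t, x_next))]
        errors.append(eta * math.sqrt(dot(diff, diff)))           # realized eps_{k-1,t}
        x = x_next
    return x, errors

random.seed(0)
d, T, eta = 3, 5, 0.1  # illustrative sizes and step size
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
labels = [random.choice([-1.0, 1.0]) for _ in range(T)]
x0 = [0.0] * d
_, coarse = one_cycle(A, labels, x0, eta, inner_steps=1)
_, fine = one_cycle(A, labels, x0, eta, inner_steps=50)
print(max(coarse), max(fine))
```

In this sketch the realized error equals $\eta_k$ times the gradient norm of the proximal subproblem at the returned point, so it can be driven down by spending more inner iterations.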

Lemma B.3.

Under Assumptions 1 and 3, for any $\bm{z} \in \mathbb{R}^d$ that is fixed in the $k$-th cycle of Alg. 2 and for $\alpha > 0, \beta > 0$ such that $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$, if the step sizes satisfy $\eta_k \leq \frac{1}{\sqrt{\beta}TL}$, then we have for $k \in [K]$

\begin{align*}
T\big(f(\bm{x}_k) - f(\bm{z})\big) \leq\; & 2\eta_k^2 L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_s(\bm{x}_*)\Big\|^2 + \frac{\alpha}{\beta}T\big(f(\bm{z}) - f(\bm{x}_*)\big) + \frac{T}{\eta_k}\sum_{t=1}^{T}\varepsilon_{k-1,t}^2 \\
& + \frac{1}{2\eta_k}\|\bm{x}_{k-1} - \bm{z}\|^2 - \frac{1}{2\eta_k}\Big(1 - \frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{\sqrt{\eta_k}}\Big)\|\bm{x}_k - \bm{z}\|^2 + \frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{2\sqrt{\eta_k}}.
\end{align*}
Proof.

Since each $f_t$ is convex and $L$-smooth, we have for $t \in [T]$

\begin{align*}
f_t(\bm{x}_k) - f_t(\bm{x}_{k-1,t+1}) & \leq \left\langle \nabla f_t(\bm{x}_k), \bm{x}_k - \bm{x}_{k-1,t+1}\right\rangle - \frac{1}{2L}\|\nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t+1})\|^2, \\
f_t(\bm{x}_{k-1,t+1}) - f_t(\bm{z}) & \leq \left\langle \nabla f_t(\bm{x}_{k-1,t+1}), \bm{x}_{k-1,t+1} - \bm{z}\right\rangle - \frac{1}{2L}\|\nabla f_t(\bm{x}_{k-1,t+1}) - \nabla f_t(\bm{z})\|^2.
\end{align*}
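The two inequalities above are instances of the standard bound $f(\bm{x}) - f(\bm{y}) \leq \langle \nabla f(\bm{x}), \bm{x} - \bm{y}\rangle - \frac{1}{2L}\|\nabla f(\bm{x}) - \nabla f(\bm{y})\|^2$ for convex $L$-smooth functions. The sketch below checks it numerically on a convex (not strongly convex) diagonal quadratic; the test function, its coefficients `q`, and the helper names are assumptions chosen for illustration.

```python
import random

def f(q, x):
    """Convex quadratic f(x) = 0.5 * sum_i q_i x_i^2 with q_i >= 0."""
    return 0.5 * sum(qi * xi * xi for qi, xi in zip(q, x))

def grad_f(q, x):
    return [qi * xi for qi, xi in zip(q, x)]

def smoothness_gap(q, x, y):
    """Right-hand side minus left-hand side of
    f(x) - f(y) <= <grad f(x), x - y> - ||grad f(x) - grad f(y)||^2 / (2 L)."""
    L = max(q)  # smoothness constant of the diagonal quadratic
    gx, gy = grad_f(q, x), grad_f(q, y)
    inner = sum(gi * (xi - yi) for gi, xi, yi in zip(gx, x, y))
    sq = sum((gi - hi) ** 2 for gi, hi in zip(gx, gy))
    return inner - sq / (2 * L) - (f(q, x) - f(q, y))

random.seed(0)
q = [0.0, 0.5, 1.0, 4.0]  # includes a zero eigenvalue, so f is not strongly convex
gaps = []
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in q]
    y = [random.gauss(0, 1) for _ in q]
    gaps.append(smoothness_gap(q, x, y))
print(min(gaps))  # should be nonnegative up to rounding
```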

Following the proof of Lemma 2.1, we add and subtract $\frac{1}{2\eta_k}\|\bm{x}_{k-1,t} - \bm{z}\|^2$ on the right-hand side of the second inequality and notice that

\begin{align*}
\left\langle \bm{g}_{k-1,t}, \bm{z}\right\rangle + \frac{1}{2\eta_k}\|\bm{x}_{k-1,t} - \bm{z}\|^2 \geq \left\langle \bm{g}_{k-1,t}, \bm{x}_{k-1,t+1}\right\rangle + \frac{1}{2\eta_k}\|\bm{x}_{k-1,t} - \bm{x}_{k-1,t+1}\|^2 + \frac{1}{2\eta_k}\|\bm{x}_{k-1,t+1} - \bm{z}\|^2.
\end{align*}

Combining the above inequalities, we obtain

\begin{align*}
f_t(\bm{x}_k) - f_t(\bm{z}) \leq\; & \left\langle \nabla f_t(\bm{x}_k), \bm{x}_k - \bm{x}_{k-1,t+1}\right\rangle + \left\langle \nabla f_t(\bm{x}_{k-1,t+1}) - \bm{g}_{k-1,t}, \bm{x}_{k-1,t+1} - \bm{z}\right\rangle \\
& - \frac{1}{2L}\Big(\|\nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t+1})\|^2 + \|\nabla f_t(\bm{x}_{k-1,t+1}) - \nabla f_t(\bm{z})\|^2\Big) \\
& - \frac{1}{2\eta_k}\|\bm{x}_{k-1,t+1} - \bm{x}_{k-1,t}\|^2 + \frac{1}{2\eta_k}\big(\|\bm{x}_{k-1,t} - \bm{z}\|^2 - \|\bm{x}_{k-1,t+1} - \bm{z}\|^2\big).
\end{align*}

We decompose $\nabla f_t(\bm{x}_k) = \nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t+1}) + \nabla f_t(\bm{x}_{k-1,t+1}) - \bm{g}_{k-1,t} + \bm{g}_{k-1,t}$ in the first inner product term on the right-hand side and sum the inequalities over $t \in [T]$. Noting that $\bm{x}_{k-1} = \bm{x}_{k-1,1}$ and $\bm{x}_k = \bm{x}_{k-1,T+1}$, we obtain

\begin{align*}
T\big(f(\bm{x}_k) - f(\bm{z})\big) \leq\; & \underbrace{\sum_{t=1}^{T}\left\langle \bm{g}_{k-1,t}, \bm{x}_k - \bm{x}_{k-1,t+1}\right\rangle - \frac{1}{2\eta_k}\sum_{t=1}^{T}\|\bm{x}_{k-1,t+1} - \bm{x}_{k-1,t}\|^2}_{\mathcal{T}_1} \\
& + \underbrace{\sum_{t=1}^{T}\left\langle \nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t+1}), \bm{x}_k - \bm{x}_{k-1,t+1}\right\rangle}_{\mathcal{T}_2} \\
& + \underbrace{\sum_{t=1}^{T}\left\langle \nabla f_t(\bm{x}_{k-1,t+1}) - \bm{g}_{k-1,t}, \bm{x}_k - \bm{z}\right\rangle}_{\mathcal{T}_3} + \frac{1}{2\eta_k}\big(\|\bm{x}_{k-1} - \bm{z}\|^2 - \|\bm{x}_k - \bm{z}\|^2\big) \\
& - \frac{1}{2L}\sum_{t=1}^{T}\Big(\|\nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t+1})\|^2 + \|\nabla f_t(\bm{x}_{k-1,t+1}) - \nabla f_t(\bm{z})\|^2\Big).
\end{align*}

For the term $\mathcal{T}_1$, we follow the argument from the proof of Theorem 2.3 to obtain

\begin{align*}
\mathcal{T}_1 = -\frac{1}{2\eta_k}\|\bm{x}_k - \bm{x}_{k-1}\|^2 \leq 0.
\end{align*}
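For completeness, the identity for $\mathcal{T}_1$ can be verified directly. The following short derivation is a sketch that assumes the inner update $\bm{x}_{k-1,t+1} = \bm{x}_{k-1,t} - \eta_k\bm{g}_{k-1,t}$ (Alg. 2 is not restated here, so this update rule is an assumption), which gives $\bm{x}_k - \bm{x}_{k-1,t+1} = -\eta_k\sum_{s=t+1}^{T}\bm{g}_{k-1,s}$ and $\bm{x}_k - \bm{x}_{k-1} = -\eta_k\sum_{t=1}^{T}\bm{g}_{k-1,t}$:

\begin{align*}
\mathcal{T}_1 =\; & -\eta_k\sum_{t=1}^{T-1}\sum_{s=t+1}^{T}\left\langle \bm{g}_{k-1,t}, \bm{g}_{k-1,s}\right\rangle - \frac{\eta_k}{2}\sum_{t=1}^{T}\|\bm{g}_{k-1,t}\|^2 \\
=\; & -\frac{\eta_k}{2}\Big(\sum_{t=1}^{T}\|\bm{g}_{k-1,t}\|^2 + 2\sum_{1 \leq t < s \leq T}\left\langle \bm{g}_{k-1,t}, \bm{g}_{k-1,s}\right\rangle\Big) \\
=\; & -\frac{\eta_k}{2}\Big\|\sum_{t=1}^{T}\bm{g}_{k-1,t}\Big\|^2 = -\frac{1}{2\eta_k}\|\bm{x}_k - \bm{x}_{k-1}\|^2.
\end{align*}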

For the term $\mathcal{T}_2$, noticing that $\bm{x}_k - \bm{x}_{k-1,t+1} = -\eta_k\sum_{s=t+1}^{T}\bm{g}_{k-1,s}$ for $1 \leq t \leq T-1$ and $\bm{g}_{k-1,s} = \bm{g}_{k-1,s} - \nabla f_s(\bm{x}_{k-1,s+1}) + \nabla f_s(\bm{x}_{k-1,s+1}) - \nabla f_s(\bm{z}) + \nabla f_s(\bm{z}) - \nabla f_s(\bm{x}_*) + \nabla f_s(\bm{x}_*)$, we use Young's inequality with parameters $\alpha > 0$ and $\beta > 0$ to obtain

\begin{align*}
\mathcal{T}_2 =\; & \sum_{t=1}^{T-1}\Big\langle \nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t+1}), -\eta_k\sum_{s=t+1}^{T}\bm{g}_{k-1,s}\Big\rangle \\
\leq\; & \frac{1}{2L}\Big(\frac{1}{2} + \frac{1}{\alpha} + \frac{1}{\beta}\Big)\sum_{t=1}^{T}\|\nabla f_t(\bm{x}_k) - \nabla f_t(\bm{x}_{k-1,t+1})\|^2 \\
& + \frac{\alpha\eta_k^2 L}{2}\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big(\nabla f_s(\bm{z}) - \nabla f_s(\bm{x}_*)\big)\Big\|^2 + 2\eta_k^2 L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_s(\bm{x}_*)\Big\|^2 \\
& + \frac{\beta\eta_k^2 L}{2}\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big(\nabla f_s(\bm{x}_{k-1,s+1}) - \nabla f_s(\bm{z})\big)\Big\|^2 \\
& + 2\eta_k^2 L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\big(\bm{g}_{k-1,s} - \nabla f_s(\bm{x}_{k-1,s+1})\big)\Big\|^2.
\end{align*}

For the term $\mathcal{T}_3$, we use the Cauchy-Schwarz inequality and Young's inequality to get

\begin{align*}
& \sum_{t=1}^{T}\left\langle \nabla f_t(\bm{x}_{k-1,t+1}) - \bm{g}_{k-1,t}, \bm{x}_k - \bm{z}\right\rangle \\
\leq\; & \frac{1}{2\sqrt{\eta_k}}\sum_{t=1}^{T}\|\nabla f_t(\bm{x}_{k-1,t+1}) - \bm{g}_{k-1,t}\|\|\bm{x}_k - \bm{z}\|^2 + \frac{\sqrt{\eta_k}}{2}\sum_{t=1}^{T}\|\nabla f_t(\bm{x}_{k-1,t+1}) - \bm{g}_{k-1,t}\|.
\end{align*}
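The bound on $\mathcal{T}_3$ can be unpacked term by term. As a sketch, write $\bm{a}_t = \nabla f_t(\bm{x}_{k-1,t+1}) - \bm{g}_{k-1,t}$ and $\bm{b} = \bm{x}_k - \bm{z}$ (shorthand introduced here only for illustration). The Cauchy-Schwarz inequality gives $\langle \bm{a}_t, \bm{b}\rangle \leq \|\bm{a}_t\|\|\bm{b}\|$, and Young's inequality $uv \leq \frac{u^2}{2} + \frac{v^2}{2}$ with $u = \eta_k^{-1/4}\|\bm{b}\|$ and $v = \eta_k^{1/4}$ gives

\begin{align*}
\left\langle \bm{a}_t, \bm{b}\right\rangle \leq \|\bm{a}_t\|\,\big(\eta_k^{-1/4}\|\bm{b}\|\big)\,\eta_k^{1/4} \leq \|\bm{a}_t\|\Big(\frac{\|\bm{b}\|^2}{2\sqrt{\eta_k}} + \frac{\sqrt{\eta_k}}{2}\Big).
\end{align*}

Summing over $t \in [T]$ recovers the stated bound.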

Further using the fact that $\|\sum_{i=1}^{n}{\bm{x}}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|{\bm{x}}_{i}\|^{2}$ and combining the above bounds on ${\mathcal{T}}_{1}$, ${\mathcal{T}}_{2}$ and ${\mathcal{T}}_{3}$ with $\|{\bm{g}}_{k-1,t}-\nabla f_{t}({\bm{x}}_{k-1,t+1})\|\leq\frac{\varepsilon_{k-1,t}}{\eta_{k}}$ for $t\in[T]$, we obtain

\begin{align*}
T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\; &\frac{1}{2L}\Big(\frac{1}{\alpha}+\frac{1}{\beta}-\frac{1}{2}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k})-\nabla f_{t}({\bm{x}}_{k-1,t})\|^{2} \\
&+\Big(\frac{\beta\eta_{k}^{2}T^{2}L}{2}-\frac{1}{2L}\Big)\sum_{t=1}^{T}\|\nabla f_{t}({\bm{x}}_{k-1,t+1})-\nabla f_{t}({\bm{z}})\|^{2} \\
&+\frac{\alpha\eta_{k}^{2}T^{2}L}{2}\sum_{t=1}^{T}\|\nabla f_{t}({\bm{z}})-\nabla f_{t}({\bm{x}}_{*})\|^{2}+2\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2} \\
&+2T^{2}L\sum_{t=1}^{T}\varepsilon_{k-1,t}^{2}+\frac{1}{2\eta_{k}^{3/2}}\sum_{t=1}^{T}\varepsilon_{k-1,t}\|{\bm{x}}_{k}-{\bm{z}}\|^{2}+\frac{1}{2\sqrt{\eta_{k}}}\sum_{t=1}^{T}\varepsilon_{k-1,t} \\
&+\frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k}-{\bm{z}}\|^{2}\big).
\end{align*}

It remains to follow the proof of Lemma 2.1 and use $\eta_{k}\leq\frac{1}{\sqrt{\beta}TL}\leq\frac{1}{2TL}$ for $\beta\geq 4$ to obtain

\begin{align*}
T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\; &2\eta_{k}^{2}L\sum_{t=1}^{T-1}\Big\|\sum_{s=t+1}^{T}\nabla f_{s}({\bm{x}}_{*})\Big\|^{2}+\frac{\alpha}{\beta}T\big(f({\bm{z}})-f({\bm{x}}_{*})\big)+\frac{T}{\eta_{k}}\sum_{t=1}^{T}\varepsilon_{k-1,t}^{2} \\
&+\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\frac{1}{2\eta_{k}}\Big(1-\frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{\sqrt{\eta_{k}}}\Big)\|{\bm{x}}_{k}-{\bm{z}}\|^{2}+\frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{2\sqrt{\eta_{k}}},
\end{align*}

thus finishing the proof. ∎

See 3.4

Proof.

Using Lemma B.3 and following the proof of Theorem 2.3, after multiplying both sides by $\eta_{k}w_{k-1}$, we have

\begin{align*}
&T\eta_{k}w_{k-1}\big(f({\bm{x}}_{k})-f({\bm{z}}_{k-1})\big) \\
\leq\; &2T^{3}\eta_{k}^{3}w_{k-1}L\sigma_{*}^{2}+\frac{\alpha}{\beta}T\eta_{k}w_{k-1}\big(f({\bm{z}}_{k-1})-f({\bm{x}}_{*})\big)+Tw_{k-1}\sum_{t=1}^{T}\varepsilon_{k-1,t}^{2}+\frac{w_{k-1}\sqrt{\eta_{k}}}{2}\sum_{t=1}^{T}\varepsilon_{k-1,t} \\
&+\frac{\lambda_{k-1}^{2}w_{k-1}}{2}\|{\bm{x}}_{k-1}-{\bm{z}}_{k-2}\|^{2}-\frac{w_{k-1}\big(1-\sum_{t=1}^{T}\varepsilon_{k-1,t}/\sqrt{\eta_{k}}\big)}{2}\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}.
\end{align*}

Then we sum the above inequality over $k\in[K]$ and follow the proof of Theorem 2.3. To telescope the terms $\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}$, we need $\sum_{t=1}^{T}\varepsilon_{k-1,t}/\sqrt{\eta_{k}}\leq 1-\lambda_{k}$ for $1\leq k\leq K-1$ such that

\begin{align*}
\lambda_{k}^{2}w_{k}\leq\lambda_{k}w_{k}\Big(1-\sum_{t=1}^{T}\varepsilon_{k-1,t}/\sqrt{\eta_{k}}\Big)\leq w_{k-1}\Big(1-\sum_{t=1}^{T}\varepsilon_{k-1,t}/\sqrt{\eta_{k}}\Big).
\end{align*}

In this case, we maintain the same requirements on $\{\lambda_{k}\}$ and $\{w_{k}\}$ to obtain the guarantee on the last iterate as in Theorem 2.3. In particular, we take the same choices with constant step sizes $\eta_{k}\equiv\eta$ such that $\lambda_{k}=\frac{w_{k-1}}{w_{k}}=\frac{(1+\frac{\alpha}{\beta})(K-k)}{1+(1+\frac{\alpha}{\beta})(K-k)}$ for $0\leq k\leq K-1$, so it suffices to let $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{\sqrt{\eta}}{1+(1+\frac{\alpha}{\beta})(K-k+1)}$ for $1\leq k\leq K$. Following the proof of Theorem 2.3, we obtain

\begin{align*}
f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\; &\frac{w_{-1}}{2\eta T}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}+2\eta^{2}T^{2}\sigma_{*}^{2}L\sum_{k=1}^{K}w_{k-1} \\
&+\frac{1}{\eta}\sum_{k=1}^{K}\sum_{t=1}^{T}w_{k-1}\varepsilon_{k-1,t}^{2}+\frac{1}{2\sqrt{\eta}T}\sum_{k=1}^{K}\sum_{t=1}^{T}w_{k-1}\varepsilon_{k-1,t}.
\end{align*}

Plugging in the choice that $w_{k-1}\leq\frac{\mathrm{e}}{(K-k+1)^{\frac{1}{1+\alpha/\beta}}}$ for $1\leq k\leq K-1$ and $w_{K-1}=1$, we then have

\begin{align*}
f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\; \frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2\eta TK^{\frac{1}{1+\alpha/\beta}}}+2\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}+\frac{\mathrm{e}}{2\eta T}\sum_{k=0}^{K-1}\sum_{t=1}^{T}\frac{2T\varepsilon_{k,t}^{2}+\sqrt{\eta}\varepsilon_{k,t}}{(K-k)^{\frac{1}{1+\alpha/\beta}}}.
\end{align*}

Hence, given $\varepsilon>0$, to maintain the convergence rate obtained with exact proximal point evaluations, it suffices to take $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\sqrt{\eta}\min\big\{\frac{\varepsilon}{4\mathrm{e}^{2}(1+\log K)},\frac{1}{1+(1+\frac{\alpha}{\beta})(K-k+1)}\big\}$ for $1\leq k\leq K$. Indeed, we have

\begin{align*}
f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\; &\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2\eta TK^{\frac{1}{1+\alpha/\beta}}}+2\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}+\sum_{k=0}^{K}\frac{2\mathrm{e}\varepsilon}{(K-k)^{\frac{1}{1+\alpha/\beta}}} \\
\overset{(i)}{\leq}\; &\frac{\mathrm{e}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}}{2\eta TK^{\frac{1}{1+\alpha/\beta}}}+2\eta^{2}T^{2}\sigma_{*}^{2}L(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}+2\mathrm{e}\varepsilon(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}.
\end{align*}
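
A sketch of how step $(i)$ can be checked, assuming it is the standard integral comparison for a power sum and reading the sum over $k=0,\dots,K-1$ as in the preceding bound (the shorthand $p$ and the index $j=K-k$ are introduced here only for illustration): with $p:=\frac{1}{1+\alpha/\beta}\in(0,1)$,

\begin{align*}
\sum_{j=1}^{K}\frac{1}{j^{p}}\leq 1+\int_{1}^{K}x^{-p}\,\mathrm{d}x=1+\frac{K^{1-p}-1}{1-p}\leq\frac{K^{1-p}}{1-p},
\end{align*}

where $1-p=\frac{\alpha/\beta}{1+\alpha/\beta}$ and hence $\frac{1}{1-p}=1+\beta/\alpha$, which matches the factor $(1+\beta/\alpha)K^{\frac{\alpha/\beta}{1+\alpha/\beta}}$ above.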

It remains to follow the proof of Theorem 2.3, and we choose $\sum_{t=1}^{T}\varepsilon_{k-1,t}=\sqrt{\eta}\min\big\{\varepsilon,\frac{1}{3(K-k+1)}\big\}$, assuming without loss of generality that $\varepsilon\leq\frac{1}{4\mathrm{e}^{2}(1+\log K)}$. ∎
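
The weight recursion and the inexactness budget used in this proof can be explored numerically. The following is a minimal sketch, not the authors' code: the function names `last_iterate_weights` and `inexactness_budget` and the test values of $K$, $\alpha$, $\beta$ are hypothetical and chosen only for illustration. It builds $w_{-1},\dots,w_{K-1}$ from $w_{k-1}=\lambda_{k}w_{k}$ with $w_{K-1}=1$, and evaluates the per-cycle bound on $\sum_{t=1}^{T}\varepsilon_{k-1,t}$ stated above.

```python
import math


def last_iterate_weights(K, alpha, beta):
    """Hypothetical helper: return [w_{-1}, w_0, ..., w_{K-1}].

    Uses w_{k-1} = lambda_k * w_k with w_{K-1} = 1 and
    lambda_k = c (K - k) / (1 + c (K - k)), where c = 1 + alpha / beta.
    """
    c = 1.0 + alpha / beta
    w = [0.0] * (K + 1)      # w[i] stores w_{i-1}
    w[K] = 1.0               # w_{K-1} = 1
    for k in range(K - 1, -1, -1):
        lam_k = c * (K - k) / (1.0 + c * (K - k))
        w[k] = lam_k * w[k + 1]   # w_{k-1} = lambda_k * w_k
    return w


def inexactness_budget(K, alpha, beta, eta, eps):
    """Hypothetical helper: bound on sum_t eps_{k-1,t} for k = 1, ..., K.

    Implements sqrt(eta) * min{ eps / (4 e^2 (1 + log K)),
                                1 / (1 + c (K - k + 1)) }.
    """
    c = 1.0 + alpha / beta
    accuracy_cap = eps / (4.0 * math.e ** 2 * (1.0 + math.log(K)))
    return [
        math.sqrt(eta) * min(accuracy_cap, 1.0 / (1.0 + c * (K - k + 1)))
        for k in range(1, K + 1)
    ]


if __name__ == "__main__":
    K, alpha, beta = 50, 1.0, 4.0   # illustrative values only
    w = last_iterate_weights(K, alpha, beta)
    p = 1.0 / (1.0 + alpha / beta)
    # Compare w_{k-1} (K - k + 1)^p against the constant e used in the proof.
    worst = max(w[k] * (K - k + 1) ** p for k in range(K + 1))
    print(f"max_k w_(k-1) * (K-k+1)^p = {worst:.4f}, e = {math.e:.4f}")
```

The budget shrinks as $K-k$ grows, so under this schedule the earliest cycles require the most accurate proximal evaluations.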

We now prove convergence with inexact proximal point evaluations in the convex Lipschitz setting.

Lemma B.4.

Under Assumptions 1 and 2, for any ${\bm{z}}\in\mathbb{R}^{d}$ that is fixed in the $k$-th cycle of Alg. 2, we have for $k\in[K]$

\begin{align*}
T\big(f({\bm{x}}_{k})-f({\bm{z}})\big)\leq\; &\frac{1}{2\eta_{k}}\Big(1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k}-{\bm{z}}\|^{2} \\
&+\frac{T(T-1)G^{2}\eta_{k}}{2}+\frac{1}{\eta_{k}}\Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}+3GT\sum_{t=1}^{T}\varepsilon_{k-1,t}. \tag{B.8}
\end{align*}
Proof.

By Lipschitzness of each component function, we have for $t\in[T-1]$

$$
f_{t}({\bm{x}}_{k})-f_{t}({\bm{x}}_{k-1,t+1})\leq G\|{\bm{x}}_{k}-{\bm{x}}_{k-1,t+1}\|=G\eta_{k}\Big\|\sum_{s=t+1}^{T}{\bm{g}}_{k-1,s}\Big\|.
$$

Decomposing ${\bm{g}}_{k-1,s}={\bm{g}}_{k-1,s}-\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})+\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})$ and using triangle inequalities, we have

$$
\begin{aligned}
& f_{t}({\bm{x}}_{k})-f_{t}({\bm{x}}_{k-1,t+1}) \\
\leq\; & \eta_{k}G\sum_{s=t+1}^{T}\Big(\|{\bm{g}}_{k-1,s}-\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|+\|\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|\Big) \\
\overset{(i)}{\leq}\; & \eta_{k}G\sum_{s=t+1}^{T}\Big(\frac{\varepsilon_{k-1,s}}{\eta_{k}}+G\Big)\leq(T-t)G^{2}\eta_{k}+G\sum_{s=t+1}^{T}\varepsilon_{k-1,s},
\end{aligned}
\tag{B.9}
$$

where we use Eq. (3.2) and the fact that $\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\in\partial f_{s}(\mathrm{prox}_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s}))$, so that $\|\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|\leq G$ by the $G$-Lipschitzness of $f_{s}$, for $(i)$. On the other hand, using convexity of $f_{t}$, we have for $t\in[T]$ that

$$
\begin{aligned}
f_{t}({\bm{z}})\geq\; & f_{t}({\bm{x}}_{k-1,t+1})+\left\langle\nabla M_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t}),{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle \\
=\; & f_{t}({\bm{x}}_{k-1,t+1})+\left\langle{\bm{g}}_{k-1,t},{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle+\left\langle\nabla M_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t})-{\bm{g}}_{k-1,t},{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle.
\end{aligned}
$$

Expanding the inner product in the above quantity and using the Cauchy-Schwarz inequality with Eq. (3.2) leads to

$$
\begin{aligned}
& f_{t}({\bm{x}}_{k-1,t+1})-f_{t}({\bm{z}}) \\
\leq\; & -\frac{1}{\eta_{k}}\left\langle{\bm{x}}_{k-1,t}-{\bm{x}}_{k-1,t+1},{\bm{z}}-{\bm{x}}_{k-1,t+1}\right\rangle+\|\nabla M_{\eta_{k}f_{t}}({\bm{x}}_{k-1,t})-{\bm{g}}_{k-1,t}\|\|{\bm{z}}-{\bm{x}}_{k-1,t+1}\| \\
\leq\; & \frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}-\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|^{2}\big)+\frac{\varepsilon_{k-1,t}}{\eta_{k}}\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|.
\end{aligned}
\tag{B.10}
$$
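The passage from the inner product to the difference of squared distances in (B.10) is the standard three-point identity. A short sketch of that step follows; the shorthand ${\bm{a}}$ and ${\bm{b}}$ is introduced here for illustration only and does not appear elsewhere in the paper.

```latex
% Shorthand (illustrative): a = x_{k-1,t} - x_{k-1,t+1},  b = z - x_{k-1,t+1},
% so that a - b = x_{k-1,t} - z.
\begin{align*}
-\frac{1}{\eta_{k}}\langle{\bm{a}},{\bm{b}}\rangle
  &= \frac{1}{2\eta_{k}}\big(\|{\bm{a}}-{\bm{b}}\|^{2}-\|{\bm{a}}\|^{2}-\|{\bm{b}}\|^{2}\big) \\
  &= \frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}
       -\|{\bm{x}}_{k-1,t}-{\bm{x}}_{k-1,t+1}\|^{2}
       -\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|^{2}\big) \\
  &\leq \frac{1}{2\eta_{k}}\big(\|{\bm{x}}_{k-1,t}-{\bm{z}}\|^{2}
       -\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|^{2}\big).
\end{align*}
```

The last line simply drops the nonpositive term $-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k-1,t}-{\bm{x}}_{k-1,t+1}\|^{2}$.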

Using triangle inequalities and decomposing ${\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1}=-\eta_{k}\sum_{s=1}^{t}\big({\bm{g}}_{k-1,s}-\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})+\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\big)$, we bound the term ${\mathcal{T}}:=\sum_{t=1}^{T}\varepsilon_{k-1,t}\|{\bm{x}}_{k-1,t+1}-{\bm{z}}\|$ in Eq. (B.10) as follows

$$
\begin{aligned}
{\mathcal{T}}\leq\; & \sum_{t=1}^{T}\varepsilon_{k-1,t}\big(\|{\bm{x}}_{k-1,t+1}-{\bm{x}}_{k-1}\|+\|{\bm{x}}_{k-1}-{\bm{z}}\|\big) \\
\leq\; & \sum_{t=1}^{T}\varepsilon_{k-1,t}\Big(\eta_{k}\sum_{s=1}^{t}\big(\|{\bm{g}}_{k-1,s}-\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|+\|\nabla M_{\eta_{k}f_{s}}({\bm{x}}_{k-1,s})\|\big)+\|{\bm{x}}_{k-1}-{\bm{z}}\|\Big) \\
\leq\; & \sum_{t=1}^{T}\varepsilon_{k-1,t}\sum_{s=1}^{t}\varepsilon_{k-1,s}+\big(\eta_{k}GT+\|{\bm{x}}_{k-1}-{\bm{z}}\|\big)\sum_{t=1}^{T}\varepsilon_{k-1,t} \\
\overset{(i)}{\leq}\; & \Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}+2\eta_{k}GT\sum_{t=1}^{T}\varepsilon_{k-1,t}+\frac{1}{4\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2},
\end{aligned}
$$

where we use Young's inequality for $(i)$. Combining Eq. (B.9) and (B.10) with the above bound on ${\mathcal{T}}$ and noticing that ${\bm{x}}_{k-1,T+1}={\bm{x}}_{k}$ and ${\bm{x}}_{k-1,1}={\bm{x}}_{k-1}$, we sum the inequalities over $t\in[T]$ and obtain

$$
\begin{aligned}
T(f({\bm{x}}_{k})-f({\bm{z}}))\leq\; & \frac{1}{2\eta_{k}}\Big(1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)\|{\bm{x}}_{k-1}-{\bm{z}}\|^{2}-\frac{1}{2\eta_{k}}\|{\bm{x}}_{k}-{\bm{z}}\|^{2} \\
& +\frac{T(T-1)G^{2}\eta_{k}}{2}+\frac{1}{\eta_{k}}\Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}+3GT\sum_{t=1}^{T}\varepsilon_{k-1,t},
\end{aligned}
$$

thus finishing the proof. ∎
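As a sanity check of the lemma in its simplest case, the following sketch runs one epoch of the incremental proximal method with exact proximal steps (all $\varepsilon_{k-1,t}=0$) and evaluates both sides of (B.8). The toy components $f_t(y)=|y-b_t|$ on the real line (so $G=1$) and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import random


def prox_abs(x, b, eta):
    """Exact prox of f(y) = |y - b| with step eta: soft-thresholding centred at b."""
    d = x - b
    shrink = max(abs(d) - eta, 0.0)
    return b + (shrink if d > 0 else -shrink)


def run_epoch(x, bs, eta):
    """One pass over f_t(y) = |y - b_t|; returns x_k and the scaled steps g_t."""
    steps = []
    for b in bs:
        x_next = prox_abs(x, b, eta)
        # With an exact prox, (x - x_next) / eta is the Moreau-envelope gradient,
        # which is a subgradient of f_t at x_next and hence bounded by G = 1.
        steps.append((x - x_next) / eta)
        x = x_next
    return x, steps


def objective(x, bs):
    """f(x) = (1/T) * sum_t |x - b_t|."""
    return sum(abs(x - b) for b in bs) / len(bs)


def lemma_slack(x_prev, z, bs, eta, G=1.0):
    """Right-hand side minus left-hand side of (B.8) when every eps_{k-1,t} is 0."""
    T = len(bs)
    x_new, _ = run_epoch(x_prev, bs, eta)
    lhs = T * (objective(x_new, bs) - objective(z, bs))
    rhs = ((x_prev - z) ** 2 - (x_new - z) ** 2) / (2 * eta)
    rhs += T * (T - 1) * G ** 2 * eta / 2
    return rhs - lhs


random.seed(0)
bs = [random.uniform(-3.0, 3.0) for _ in range(6)]
print(lemma_slack(x_prev=2.0, z=0.5, bs=bs, eta=0.1))
```

A nonnegative slack is what the lemma predicts for any reference point $z$; the inexact case would additionally require a model of the errors $\varepsilon_{k-1,t}$ from Eq. (3.2), which this sketch does not attempt.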

See 3.5

Proof.

Using Lemma B.4 with ${\bm{z}}={\bm{z}}_{k-1}$ defined by Eq. (2.1) and multiplying both sides by $\eta_{k}w_{k-1}$, we have

$$
\begin{aligned}
& T\eta_{k}w_{k-1}(f({\bm{x}}_{k})-f({\bm{z}}_{k-1})) \\
\leq\; & \frac{w_{k-1}\lambda_{k-1}^{2}\big(1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\big)}{2}\|{\bm{x}}_{k-1}-{\bm{z}}_{k-2}\|^{2}-\frac{w_{k-1}}{2}\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2} \\
& +\frac{T(T-1)G^{2}\eta_{k}^{2}w_{k-1}}{2}+\frac{w_{k-1}}{2}\Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}+3GT\eta_{k}w_{k-1}\sum_{t=1}^{T}\varepsilon_{k-1,t}.
\end{aligned}
$$

Then we sum the inequalities over $k\in[K]$ and follow the proof of Theorem 3.3. To telescope the terms $\|{\bm{x}}_{k}-{\bm{z}}_{k-1}\|^{2}$, we need $\lambda_{k-1}\leq\frac{1}{1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}}$ for $1\leq k\leq K-1$ such that

$$
w_{k-1}\lambda_{k-1}^{2}\Big(1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)\leq w_{k-1}\lambda_{k-1}\leq w_{k-2},
$$

while we maintain other requirements on $\{\lambda_{k}\}$ and $\{w_{k}\}$ to obtain the last iterate convergence as in Theorem 3.3. In particular, we take the same choice that $w_{k}=\frac{\eta_{K}}{\sum_{j=k+1}^{K}\eta_{j}}$ and $\lambda_{k}=\frac{\sum_{j=k+1}^{K}\eta_{j}}{\sum_{j=k}^{K}\eta_{j}}$ for $0\leq k\leq K-1$, so it suffices to let $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{2\eta_{k}\eta_{k-1}GT}{\sum_{j=k}^{K}\eta_{j}}$. So we arrive at

$$
\begin{aligned}
f({\bm{x}}_{K})-f({\bm{x}}_{*})\leq\; & \frac{w_{-1}}{2T\eta_{K}}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}+\frac{G^{2}T}{2\eta_{K}}\sum_{k=1}^{K}\eta_{k}^{2}w_{k-1} \\
& +\frac{3G}{\eta_{K}}\sum_{k=1}^{K}w_{k-1}\eta_{k}\sum_{t=1}^{T}\varepsilon_{k-1,t}+\frac{1}{2T\eta_{K}}\sum_{k=1}^{K}\Big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\Big)^{2}w_{k-1} \\
=\; & \frac{1}{2T\sum_{k=1}^{K}\eta_{k}}\|{\bm{x}}_{0}-{\bm{x}}_{*}\|^{2}+\frac{G^{2}T}{2}\sum_{k=1}^{K}\frac{\eta_{k}^{2}}{\sum_{j=k}^{K}\eta_{j}} \\
& +3G\sum_{k=1}^{K}\sum_{t=1}^{T}\frac{\varepsilon_{k-1,t}\eta_{k}}{\sum_{j=k}^{K}\eta_{j}}+\frac{1}{2T}\sum_{k=1}^{K}\frac{(\sum_{t=1}^{T}\varepsilon_{k-1,t})^{2}}{\sum_{j=k}^{K}\eta_{j}}.
\end{aligned}
$$
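The weight and tolerance choices used to reach this bound can be tabulated numerically. The sketch below computes $w_{k}$, $\lambda_{k}$ and the per-epoch error budget $\frac{2\eta_{k}\eta_{k-1}GT}{\sum_{j=k}^{K}\eta_{j}}$ from a step-size list; the function names are hypothetical, and the boundary indices that involve $\eta_{0}$ or $w_{-1}$ are deliberately left out because their convention is fixed elsewhere in the paper.

```python
def tail_sums(etas):
    """etas[k-1] holds eta_k for k = 1..K; returns S with S[k] = sum_{j=k}^{K} eta_j."""
    K = len(etas)
    S = [0.0] * (K + 2)  # S[K+1] = 0 by convention, S[0] is unused
    for k in range(K, 0, -1):
        S[k] = S[k + 1] + etas[k - 1]
    return S


def weights(etas):
    """w_k = eta_K / S[k+1] for k = 0..K-1 and lambda_k = S[k+1] / S[k] for k = 1..K-1."""
    K = len(etas)
    S = tail_sums(etas)
    w = {k: etas[K - 1] / S[k + 1] for k in range(0, K)}
    lam = {k: S[k + 1] / S[k] for k in range(1, K)}
    return w, lam


def eps_budget(etas, G, T):
    """Largest allowed sum_t eps_{k-1,t} for k = 2..K under the sufficient condition."""
    K = len(etas)
    S = tail_sums(etas)
    return {k: 2 * etas[k - 1] * etas[k - 2] * G * T / S[k] for k in range(2, K + 1)}


etas = [0.5, 0.4, 0.3, 0.2, 0.1]  # illustrative step sizes eta_1..eta_5
w, lam = weights(etas)
print(w)
print(lam)
print(eps_budget(etas, G=1.0, T=4))
```

With these choices $w_{k}\lambda_{k}=w_{k-1}$ holds with equality, and the budget is exactly the value at which $\lambda_{k-1}=\big(1+\frac{1}{2\eta_{k}GT}\sum_{t=1}^{T}\varepsilon_{k-1,t}\big)^{-1}$, so the telescoping condition is met with no room to spare.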

Hence, given $\varepsilon>0$ and taking the constant step size $\eta_{k}\equiv\eta=\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{GT\sqrt{K}}$ for simplicity, to maintain the convergence rate as in Theorem 3.3 with inexact proximal point evaluations, it suffices to let $\sum_{t=1}^{T}\varepsilon_{k-1,t}\leq\frac{2\eta GT}{K-k+1}$. Indeed, we have

\[
3G\sum_{k=1}^{K}\sum_{t=1}^{T}\frac{\varepsilon_{k-1,t}\eta_{k}}{\sum_{j=k}^{K}\eta_{j}}=3G\sum_{k=1}^{K}\frac{\sum_{t=1}^{T}\varepsilon_{k-1,t}}{K-k+1}\leq\frac{6G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{\sqrt{K}}\sum_{k=1}^{K}\frac{1}{(K-k+1)^{2}}\leq\frac{10G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{\sqrt{K}},
\]

and

\[
\frac{1}{2T}\sum_{k=1}^{K}\frac{\big(\sum_{t=1}^{T}\varepsilon_{k-1,t}\big)^{2}}{\sum_{j=k}^{K}\eta_{j}}\leq\frac{2G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{\sqrt{K}}\sum_{k=1}^{K}\frac{1}{(K-k+1)^{3}}\leq\frac{2.5G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{\sqrt{K}}.
\]

It remains to follow the proof of Theorem 3.3, thus finishing the proof. ∎
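
As a numerical sanity check on the two inexactness terms (a sketch only, not part of the proof), the Python snippet below evaluates them under the constant step size $\eta=\frac{\|{\bm{x}}_{0}-{\bm{x}}_{*}\|}{GT\sqrt{K}}$, with each per-epoch error budget $\sum_{t=1}^{T}\varepsilon_{k-1,t}$ set to its cap $\frac{2\eta GT}{K-k+1}$, and reports them in units of $G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|/\sqrt{K}$. The function name `inexactness_terms`, the argument `dist` (standing for $\|{\bm{x}}_{0}-{\bm{x}}_{*}\|$), and the parameter values in the usage loop are illustrative assumptions.

```python
import math


def inexactness_terms(K, G=1.0, T=1, dist=1.0):
    """Return the two error terms divided by G * dist / sqrt(K).

    Assumes the constant step size eta = dist / (G * T * sqrt(K)) and that
    the per-epoch error budget sum_t eps_{k-1,t} sits exactly at its cap
    2 * eta * G * T / (K - k + 1).
    """
    eta = dist / (G * T * math.sqrt(K))
    first = 0.0   # 3G * sum_k sum_t eps_{k-1,t} * eta_k / sum_{j>=k} eta_j
    second = 0.0  # (1 / 2T) * sum_k (sum_t eps_{k-1,t})^2 / sum_{j>=k} eta_j
    for k in range(1, K + 1):
        remaining = K - k + 1                   # number of epochs k, ..., K
        tail = remaining * eta                  # sum_{j=k}^K eta_j
        eps_sum = 2 * eta * G * T / remaining   # cap on sum_t eps_{k-1,t}
        first += 3 * G * eps_sum * eta / tail
        second += eps_sum ** 2 / (2 * T * tail)
    unit = G * dist / math.sqrt(K)
    return first / unit, second / unit


if __name__ == "__main__":
    for K in (1, 10, 100, 1000):
        c1, c2 = inexactness_terms(K, G=2.0, T=5, dist=3.0)
        print(K, round(c1, 3), round(c2, 3))
```

After normalization, the two values are constant multiples of the partial sums $\sum_{m=1}^{K}m^{-2}$ and $\sum_{m=1}^{K}m^{-3}$, both of which converge, so the additional terms remain $O\big(G\|{\bm{x}}_{0}-{\bm{x}}_{*}\|/\sqrt{K}\big)$ regardless of $G$, $T$, and $K$.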