On the Convergence of Multi-objective Optimization under Generalized Smoothness

Qi Zhang*, Peiyao Xiao*, Kaiyi Ji, Shaofeng Zou

*Equal contribution. Qi Zhang and Shaofeng Zou are with the Department of Electrical Engineering, University at Buffalo, Buffalo, NY 14228 USA (e-mail: [email protected], [email protected]). Peiyao Xiao and Kaiyi Ji are with the Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14228 USA (e-mail: [email protected], [email protected]).
Abstract

Multi-objective optimization (MOO) is receiving increasing attention in fields such as multi-task learning. Recent works provide effective algorithms with theoretical analysis, but they are limited by the standard $L$-smooth or bounded-gradient assumptions, which typically do not hold for neural networks such as long short-term memory (LSTM) models and transformers. In this paper, we study a more general and realistic class of $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm. We develop two novel single-loop algorithms for $\ell$-smooth MOO problems, Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant, Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad), which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of both algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where a total of $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples are needed in the deterministic and stochastic settings, respectively. Our algorithms can also guarantee a tighter $\epsilon$-level CA distance in each iteration using more samples. Moreover, we propose an efficient variant of GSMGrad named GSMGrad-FA that uses only constant-level time and space while achieving the same performance guarantee as GSMGrad. Our experiments validate our theory and demonstrate the effectiveness of the proposed methods.

1 Introduction

There have been a variety of emerging applications of multi-objective optimization (MOO), such as online advertising [26], autonomous driving [18], and reinforcement learning [31]. Mathematically, the MOO problem takes the following formulation.

$$F^{*}=\min_{x\in\mathbb{R}^{m}}F(x):=(f_{1}(x),f_{2}(x),\ldots,f_{K}(x)), \tag{1}$$

where $K$ is the total number of objectives and $f_k(x)$ is the $k$-th objective function given model parameters $x$. Under the stochastic setting, $f_k(x)=\mathbb{E}_s[f_k(x;s)]$, where $s$ denotes data samples. This problem is challenging due to gradient conflict: objectives with larger gradients can dominate the update direction, causing significant performance degradation on the less fortunate objectives with smaller gradients. A variety of MOO-based methods have been proposed to mitigate this conflict and find a more balanced solution among all objectives. In particular, the multiple gradient descent algorithm (MGDA) [12] aims to find a conflict-avoidant (CA) update direction that maximizes the minimal improvement among all objectives and converges to a Pareto stationary point, at which there is no common descent direction for all objective functions. This idea inspired numerous follow-up methods, including but not limited to CAGrad [24], PCGrad [34], GradDrop [8], FAMO [23] and FairGrad [3], with convergence guarantees in the deterministic setting with full-gradient computations. The theoretical understanding of the convergence and complexity of stochastic MOO was not well developed until very recently. [25] proposed stochastic multi-gradient (SMG) as a stochastic version of MGDA and established its convergence guarantee. [36] analyzed the non-convergence issues of MGDA, CAGrad and PCGrad in the stochastic setting, and further proposed a convergent approach named CR-MOGM.
More recently, [13] and [6] proposed single-loop stochastic MOO methods named MoCo and MoDo, and proved their convergence to an $\epsilon$-accurate Pareto stationary point while guaranteeing an $\epsilon$-level average CA distance over all iterations (the CA distance is the distance between the updating direction and the CA direction; its formal definition is given in Section 2.4). [32] proposed a double-loop algorithm named SDMGrad that obtains an unbiased stochastic multi-gradient via a double-sampling strategy. They established the convergence of SDMGrad with a guaranteed $\epsilon$-level CA distance in every iteration, which we call the iteration-wise CA distance.

However, all existing works are limited by the standard $L$-smooth and bounded-gradient assumptions. A recent study [35] indicates that such assumptions may not hold in the training of neural networks. Instead, an alternative $(L_0,L_1)$-smoothness condition was observed and studied, which allows the Lipschitz constant of the gradient to grow linearly with the gradient norm and the gradient norm itself to be unbounded. The analysis of existing MOO methods cannot be generalized directly to this $(L_0,L_1)$-smoothness because the smoothness constant and the gradient norm may both be unbounded. In addition, all existing works on generalized smoothness [29, 35, 22, 21, 19, 11, 9] are limited to single-task problems, which are fundamentally different from MOO problems. These methods cannot be directly generalized to the MOO tasks studied in this paper: even if each single task is generalized smooth, a linear combination of these tasks is not necessarily generalized smooth. In this paper, we aim to fill this gap by proposing novel MOO algorithms, which not only converge under the generalized smoothness condition but also mitigate gradient conflict effectively with a guaranteed sufficiently small CA distance.

1.1 Our Contributions

We propose two single-loop MOO methods, Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant SGSMGrad, and provide them with a comprehensive convergence analysis under the generalized smoothness condition in different settings. Our detailed contributions are listed below.

Weakest assumptions in MOO. In this paper, we investigate the $\ell$-smooth assumption, where $\ell$ is a general non-decreasing function of the gradient norm, which includes both the standard $L$-smooth and $(L_0,L_1)$-smooth assumptions as special cases. This assumption finds many applications, such as LSTM models [35], transformers [11], distributionally robust optimization [19] and higher-order polynomial functions [9]. In addition, we do not make any bounded-gradient assumption, which is required in previous analyses to ensure a bounded multi-gradient approximation. To the best of our knowledge, this is the first work to investigate generalized smoothness in MOO problems.

New single-loop algorithms. Both GSMGrad and SGSMGrad are easy to implement: they update the objective weights $w$ and the model parameters $x$ simultaneously via a single-loop structure. A warm-start initialization sub-procedure is also introduced at the beginning of both algorithms to ensure a sufficiently small iteration-wise CA distance under the single-loop structure. We also propose a computation- and memory-efficient variant of GSMGrad named GSMGrad-FA, which updates the objective weights $w$ using only forward passes of $F(\cdot)$ rather than the gradient $\nabla F$. This reduces the time and space cost from $O(K)$ to $O(1)$ without hurting the performance guarantee.

Convergence analysis and optimal complexity. We provide a comprehensive analysis of our proposed algorithms under the generalized $\ell$-smooth condition in both deterministic and stochastic settings. To achieve an $\epsilon$-accurate Pareto stationary point and an $\epsilon$-level average CA distance, we show that GSMGrad and SGSMGrad require $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples in the deterministic and stochastic settings, respectively. Furthermore, to achieve a more aggressive $\epsilon$-level iteration-wise CA distance, GSMGrad and SGSMGrad require more samples, on the order of $\mathcal{O}(\epsilon^{-11})$ and $\mathcal{O}(\epsilon^{-17})$ in the deterministic and stochastic settings, respectively, due to smaller step sizes and mini-batch data sampling. Typically, achieving an $\epsilon$-level iteration-wise CA distance results in much higher sample complexity, such as $\mathcal{O}(\epsilon^{-24})$ in [13], $\mathcal{O}(\epsilon^{-16})$ in [6] and $\mathcal{O}(\epsilon^{-12})$ in [32] for the non-convex stochastic setting. Moreover, we show that GSMGrad-FA achieves the same performance guarantee as GSMGrad.

Supportive experiments. Our experiments on the MTL benchmark Cityscapes [10] validate our theory and demonstrate the effectiveness of our proposed algorithms.

1.2 Related Works

Gradient-based multi-objective optimization. A variety of gradient manipulation techniques have emerged for simultaneous learning of multiple tasks. One prevalent category of methods adjusts the weights of various objectives according to factors such as uncertainty [20], gradient norm [7], and training complexity [17]. Methods based on MOO have garnered increased attention due to their systematic designs, enhanced training stability and model-agnostic nature. For instance, [30] framed multi-task learning (MTL) as a MOO problem and introduced an optimization method akin to MGDA [12]. Afterward, many MGDA-based methods were proposed to mitigate gradient conflict with promising empirical performance. Among them, PCGrad [34] avoids conflict by projecting the gradient of each task onto the normal plane of the gradients of other tasks. GradDrop [8] randomly drops out conflicting gradients. CAGrad [24] constrains the update direction to be close to the average gradient. NashMTL [27] and FairGrad [3] formulated MTL as a bargaining game and a resource allocation problem, respectively. Theoretically, [13] proposed a provably convergent stochastic MOO method named MoCo based on an auxiliary tracking variable for gradient approximation. [6] characterized the trade-off among optimization, generalization, and conflict avoidance in MOO. [32] proposed a stochastic MOO method named SDMGrad with a preference-oriented regularizer, and analyzed its convergence. However, all these works rely on the $L$-smoothness and bounded-gradient assumptions. The details can be found in Table 1. This paper focuses on MOO problems with generalized $\ell$-smooth objectives.

| Method | Smoothness (1) | Assumption (2) | Sample complexity |
|---|---|---|---|
| SMG [25] | (LS) | (BG) | N/A (3) |
| CR-MOGM [36] | (LS) | (BF), (BG) | $\mathcal{O}(\epsilon^{-4})$ |
| MoCo [13] | (LS) | (BF), (BG) | $\mathcal{O}(\epsilon^{-4})$ |
| MoDo [6] | (LS) | (BG) | $\mathcal{O}(\epsilon^{-4})$ |
| SDMGrad [32] | (LS) | (BG) | $\mathcal{O}(\epsilon^{-4})$ |
| SGSMGrad (this paper) | (GS) | N/A | $\mathcal{O}(\epsilon^{-4})$ |

Table 1: Comparison of existing stochastic methods for MOO problems in obtaining an $\epsilon$-accurate Pareto stationary point. Notes: (1) (LS) indicates that the objectives are standard $L$-smooth, while (GS) indicates that the objectives are generalized $\ell$-smooth as defined in Definition 1; (2) (BF) indicates that the bounded function value assumption is required, and (BG) indicates that the bounded gradient assumption is required; (3) the analysis in [25] focuses on convex objective functions.

Generalized smoothness. The generalized $(L_0,L_1)$-smoothness condition was first proposed by [35], motivated by extensive empirical observations in training neural networks. A clipping algorithm was developed by [35] and its convergence rate was provided. Later, [19] analyzed the convergence of a normalized momentum method. The SPIDER algorithm was also applied to solve generalized smooth problems in [29, 9], where [9] studied a new notion of $\alpha$-symmetric generalized smoothness, which includes $(L_0,L_1)$-smoothness as a special case. Very recently, a new $\ell$-smoothness condition was studied in [21, 22], which is the weakest smoothness condition and includes all the smoothness conditions discussed above. However, all the existing works on generalized smoothness are limited to single-task optimization, and the understanding of MOO is insufficient. This paper provides the first study of MOO under the generalized $\ell$-smoothness condition.

2 Preliminaries

2.1 Generalized smoothness

The standard $L$-smoothness condition is widely investigated in existing optimization studies [15, 16]. A function $f:\mathcal{X}\to\mathbb{R}$ is $L$-smooth if there exists a finite constant $L$ such that for any $x,y\in\mathcal{X}$, $\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|$. Nevertheless, recent studies show that the standard $L$-smoothness assumption does not hold in settings such as the training of LSTM models [35] and transformers [11], distributionally robust optimization [19], and higher-order polynomial functions [9]. Instead, a generalized $(L_0,L_1)$-smoothness assumption was observed and studied in the training of LSTM models in [35], which assumes that for any $x\in\mathcal{X}$, $\|\nabla^2 f(x)\|\leq L_0+L_1\|\nabla f(x)\|$. This assumption allows the Lipschitz constant of the gradient to be unbounded and reduces to $L$-smoothness if $L_1=0$. Later, a more general assumption was proposed and studied in [21]:

Definition 1.

($\ell$-smoothness, Definition 1 in [21]). A real-valued differentiable function $f:\mathcal{X}\rightarrow\mathbb{R}$ is $\ell$-smooth if $\|\nabla^2 f(x)\|\leq\ell(\|\nabla f(x)\|)$ almost everywhere in $\mathcal{X}$, where $\ell:[0,+\infty)\rightarrow(0,+\infty)$ is a continuous non-decreasing function.

The $(L_0,L_1)$-smoothness is a special case of $\ell$-smoothness, where $\ell(a)=L_0+L_1 a$. There is another definition of generalized smoothness, which is widely used and is equivalent to $\ell$-smoothness:

Definition 2.

($(r,\ell)$-smoothness, Definition 2 in [21]). A real-valued differentiable function $f:\mathcal{X}\rightarrow\mathbb{R}$ is $(r,\ell)$-smooth if 1) for any $x\in\mathcal{X}$, $B(x,r(\|\nabla f(x)\|))\subseteq\mathcal{X}$, and 2) for any $x_1,x_2\in B(x,r(\|\nabla f(x)\|))$, $\|\nabla f(x_1)-\nabla f(x_2)\|\leq\ell(\|\nabla f(x)\|)\|x_1-x_2\|$, where $r,\ell:[0,+\infty)\rightarrow(0,+\infty)$ are continuous functions, $r$ is non-increasing, $\ell$ is non-decreasing, and $B(x,R)$ is the Euclidean ball centered at $x$ with radius $R$.

In $B(x,r(\|\nabla f(x)\|))$, the function $f$ is also $L$-smooth with $L=\ell(\|\nabla f(x)\|)$. Proposition 3.2 in [21] shows that Definition 1 and Definition 2 are equivalent: an $(r,\ell)$-smooth function is $\ell$-smooth, and an $\ell$-smooth function satisfying Assumption 1 is $(r,m)$-smooth, where $m(u):=\ell(u+a)$ and $r(u):=a/m(u)$ for any $a>0$.
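As a concrete illustration of the two definitions, the following sketch checks numerically that $f(x)=e^x$ on $\mathbb{R}$ satisfies Definition 1 with $\ell(u)=1+u$ (that is, $(L_0,L_1)=(1,1)$) even though its second derivative is unbounded, and that the local Lipschitz bound of Definition 2 holds with $m(u)=\ell(u+a)$ and $r(u)=a/m(u)$. The test function, the choice $a=1$, the sampling grid, and all helper names are assumptions made for illustration and do not come from the paper.

```python
import math

# Illustrative example (an assumption, not from the paper): f(x) = exp(x),
# so f'(x) = f''(x) = exp(x), and ell(u) = 1 + u, i.e. (L0, L1) = (1, 1).
def grad(x: float) -> float:
    return math.exp(x)

def hess(x: float) -> float:
    return math.exp(x)

def ell(u: float) -> float:
    return 1.0 + u

def satisfies_definition_1(points) -> bool:
    """Check |f''(x)| <= ell(|f'(x)|) at every sample point."""
    return all(abs(hess(x)) <= ell(abs(grad(x))) for x in points)

def satisfies_definition_2(points, a: float = 1.0, n_inner: int = 21) -> bool:
    """Check the local Lipschitz bound with m(u) = ell(u + a), r(u) = a / m(u)."""
    for x in points:
        u = abs(grad(x))
        m = ell(u + a)
        r = a / m
        # Sample pairs (x1, x2) inside the ball B(x, r).
        inner = [x - r + 2.0 * r * i / (n_inner - 1) for i in range(n_inner)]
        for x1 in inner:
            for x2 in inner:
                if abs(grad(x1) - grad(x2)) > m * abs(x1 - x2) + 1e-12:
                    return False
    return True

grid = [-5.0 + 0.5 * i for i in range(31)]  # x in [-5, 10]
ok_def1 = satisfies_definition_1(grid)
ok_def2 = satisfies_definition_2(grid)
# The second derivative keeps growing with x, so no single constant L
# bounds it on the whole real line.
max_hess = max(hess(x) for x in grid)
```

The check is only a sampled sanity test, not a proof; it is meant to show how the radius $r$ shrinks as the gradient norm grows so that a local Lipschitz constant remains valid.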

2.2 Pareto concepts in multi-objective optimization (MOO)

As described before, MOO aims to find points at which there is no common descent direction for all objectives. Considering two points $x_1,x_2\in\mathbb{R}^m$, we say that $x_1$ dominates $x_2$ if $f_i(x_1)\leq f_i(x_2)$ for all $i\in[K]$ and $F(x_1)\neq F(x_2)$. We say a point is Pareto optimal if it is not dominated by any other point. In other words, we cannot improve one objective without compromising another when we reach a Pareto optimal point. In the general non-convex setting, MOO aims to find a Pareto stationary point defined as follows.

Definition 3.

We say $x\in\mathbb{R}^m$ is a Pareto stationary point if $\min_{w\in\mathcal{W}}\|\nabla F(x)w\|^2=0$. In practice, we call $x$ an $\epsilon$-accurate Pareto stationary point if $\min_{w\in\mathcal{W}}\|\nabla F(x)w\|^2\leq\epsilon^2$.
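For $K=2$ objectives, the quantity in Definition 3 has a closed form, which makes the definition easy to check by hand. The sketch below is a minimal illustration under that assumption; the function names and the example gradients are hypothetical and not taken from the paper.

```python
def min_norm_sq_two(g1, g2):
    """Return (min over w in [0, 1] of ||w * g1 + (1 - w) * g2||^2, minimizing w)."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(c * c for c in diff)
    if denom == 0.0:
        w = 0.5  # identical gradients: every weight gives the same combination
    else:
        # Unconstrained minimizer of the quadratic in w, clipped to [0, 1].
        w = -sum(c * b for c, b in zip(diff, g2)) / denom
        w = min(1.0, max(0.0, w))
    d = [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]
    return sum(c * c for c in d), w

def is_eps_pareto_stationary(g1, g2, eps):
    """Definition 3 for two objectives: min-norm squared at most eps^2."""
    value, _ = min_norm_sq_two(g1, g2)
    return value <= eps ** 2

# Opposing gradients: some convex combination vanishes, so the point is stationary.
opposing_value, opposing_w = min_norm_sq_two([1.0, 0.0], [-2.0, 0.0])
# Orthogonal gradients: a common descent direction exists, so it is not stationary.
orthogonal_value, orthogonal_w = min_norm_sq_two([1.0, 0.0], [0.0, 1.0])
```

With more than two objectives no such closed form is available in general, and the minimization over the simplex has to be solved iteratively.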

2.3 Multiple-gradient descent algorithm (MGDA) and its stochastic variants

Deterministic MGDA. One of the major challenges of MOO is gradient conflict, i.e., the gradients of different objectives may vary heavily in scale such that the largest gradient dominates the update direction. As a result, the performance of the objectives with smaller gradients may be significantly compromised [34]. To this end, we seek a balanced update direction for all objectives. Thus, we consider the minimum improvement across all objectives and maximize it by solving the following problem

$$\max_{d\in\mathbb{R}^m}\min_{i\in[K]}\Big\{\frac{1}{\alpha}\big(f_i(x)-f_i(x-\alpha d)\big)\Big\}\approx\max_{d\in\mathbb{R}^m}\min_{i\in[K]}\langle\nabla f_i(x),d\rangle, \tag{2}$$

where $\alpha$ is the step size, $d$ is the update direction, and the first-order Taylor approximation is applied at $x$. To solve the problem in eq. 2 efficiently, we use the following relation

$$\max_{d\in\mathbb{R}^m}\min_{i\in[K]}\langle\nabla f_i(x),d\rangle-\frac{1}{2}\|d\|^2=\max_{d\in\mathbb{R}^m}\min_{w\in\mathcal{W}}\Big\langle\sum_{i=1}^{K}\nabla f_i(x)w_i,d\Big\rangle-\frac{1}{2}\|d\|^2, \tag{3}$$

where $\mathcal{W}$ is the probability simplex over $[K]$, and the regularization term $-\frac{1}{2}\|d\|^2$ regulates the magnitude of the update direction. The solution to the problem in eq. 3 can be obtained by solving the following problem [32]

$$d^*=\nabla F(x)w^*;\quad\text{s.t.}\;\;w^*\in\arg\min_{w\in\mathcal{W}}\frac{1}{2}\|\nabla F(x)w\|^2. \tag{4}$$
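A short sketch of why eq. 3 leads to eq. 4 may be helpful. It relies on exchanging the max and the min, which is justified here because the objective is concave in $d$, linear in $w$, and $\mathcal{W}$ is compact and convex; this is a standard argument, written out under those assumptions rather than quoted from [32].

```latex
\begin{align*}
\min_{i\in[K]}\langle\nabla f_i(x),d\rangle
  &= \min_{w\in\mathcal{W}}\Big\langle\sum_{i=1}^{K}w_i\nabla f_i(x),d\Big\rangle
  && \text{(a linear function on the simplex is minimized at a vertex)}\\
\max_{d\in\mathbb{R}^m}\min_{w\in\mathcal{W}}
  \Big\{\langle\nabla F(x)w,d\rangle-\tfrac{1}{2}\|d\|^2\Big\}
  &= \min_{w\in\mathcal{W}}\max_{d\in\mathbb{R}^m}
  \Big\{\langle\nabla F(x)w,d\rangle-\tfrac{1}{2}\|d\|^2\Big\}
  && \text{(minimax exchange)}\\
  &= \min_{w\in\mathcal{W}}\tfrac{1}{2}\|\nabla F(x)w\|^2
  && \text{(the inner maximizer is } d=\nabla F(x)w\text{)}.
\end{align*}
```

The outer minimizer is $w^*$, and substituting it into the inner maximizer gives $d^*=\nabla F(x)w^*$.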

The above approach has been widely used, e.g., in deterministic MGDA and its variants such as CAGrad and PCGrad [12, 34, 24].
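The min-norm problem in eq. 4 can be approximated numerically by projected gradient descent over the simplex. The following is a minimal sketch under assumed settings: the step size, iteration count, helper names, and the three example gradients are illustrative choices, and the sort-based simplex projection is a standard routine rather than something specified in this paper.

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0.0:
            theta = t  # keep the threshold of the last index that passes
    return [max(vi - theta, 0.0) for vi in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def combine(grads, w):
    """Return sum_i w_i * grads[i], i.e. the product nabla F(x) w."""
    dim = len(grads[0])
    return [sum(w[i] * grads[i][j] for i in range(len(grads))) for j in range(dim)]

def ca_direction(grads, step=0.05, iters=5000):
    """Approximate eq. 4 by projected gradient descent on the weights w."""
    k = len(grads)
    w = [1.0 / k] * k
    for _ in range(iters):
        d = combine(grads, w)
        # The gradient of 0.5 * ||nabla F(x) w||^2 w.r.t. w_i is <g_i, d>.
        grad_w = [dot(g, d) for g in grads]
        w = project_simplex([wi - step * gi for wi, gi in zip(w, grad_w)])
    return combine(grads, w), w

# Hypothetical gradients of K = 3 objectives in R^2.
grads = [[4.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_star, w_star = ca_direction(grads)
min_improvement = min(dot(g, d_star) for g in grads)
```

At the solution, every objective satisfies $\langle\nabla f_i(x),d^*\rangle\geq\|d^*\|^2$, so a small step along $-d^*$ decreases all objectives to first order, which is the sense in which the direction avoids conflict.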

Stochastic MGDA. SMG [25] is the first stochastic MGDA. It directly replaces the gradients with stochastic gradients and the update rule becomes

$$d_s^*=\nabla F(x;s)w_s^*;\quad\text{s.t.}\;\;w_s^*\in\arg\min_{w\in\mathcal{W}}\frac{1}{2}\|\nabla F(x;s)w\|^2,$$

where $\nabla F(x;s)$ is the estimate of $\nabla F(x)$ based on the sample $s$. However, this leads to a biased estimate $d_s^*$ of the update direction, and thus it requires an increasing batch size. To address this issue, MoCo [13] introduces a tracking variable $Y$ as a stochastic estimate of the true gradient. Afterward, a double-sampling strategy was proposed by [6, 32] to generate a near-unbiased update direction.
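The bias arises because $w_s^*$ depends nonlinearly on the sampled gradients, so averaging per-sample directions does not recover the direction of the averaged gradients. The toy example below makes this concrete with two objectives, a one-dimensional parameter, and two equally likely samples; all numbers and names are made up for illustration.

```python
def min_norm_scalar(a: float, b: float) -> float:
    """Min-norm element of the segment between scalars a and b (K = 2, m = 1)."""
    if a * b <= 0.0:
        return 0.0  # the segment contains zero
    return a if abs(a) < abs(b) else b

# Stochastic gradients of the two objectives under samples A and B
# (each sample has probability 1/2; values are hypothetical).
g1 = {"A": 1.0, "B": 1.0}
g2 = {"A": 3.0, "B": -1.0}

# True gradients are expectations over the sample s.
g1_true = 0.5 * (g1["A"] + g1["B"])
g2_true = 0.5 * (g2["A"] + g2["B"])
d_true = min_norm_scalar(g1_true, g2_true)

# Expected SMG-style direction: average of per-sample min-norm directions.
d_expected = 0.5 * sum(min_norm_scalar(g1[s], g2[s]) for s in ("A", "B"))
bias = d_expected - d_true
```

Here both true gradients equal 1, so the CA direction is 1, while the per-sample directions are 1 and 0 and average to 0.5.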

All the works mentioned above require bounded gradients, such as [6, 13, 32], or $L$-smoothness, such as [3, 24, 27, 33]. Their analyses do not apply to the $\ell$-smooth objectives studied in this paper, since the Lipschitz constant of the gradient is potentially unbounded.

2.4 Conflict-avoidant (CA) direction and CA distance

We call the update direction $d^*$ in eq. 4 the conflict-avoidant (CA) direction since it mitigates gradient conflict. Though it may not be feasible to calculate the exact CA direction, we aim to find an update direction that is close to the CA direction. Therefore, it is important to measure the gap between the CA direction and the estimated update direction, which we define as the CA distance.

Definition 4.

$\|d-d^*\|$ is the CA distance between the estimated update direction $d$ and the CA direction $d^*$.

The larger the CA distance is, the further the estimated update direction will be from the CA direction, and the more conflict there will be. In the single-loop algorithms MoCo and MoDo [13, 6], the average CA distance over iterations is of the order of $\epsilon$, while the double-loop algorithm SDMGrad [32] guarantees an $\epsilon$-order CA distance in every iteration. In our work, we analyze the CA distance in both cases and provide convergence results accordingly.

3 Single-loop Algorithms for MOO Under Generalized Smoothness

In this section, we present our main algorithms: Generalized Smooth Multi-objective Gradient descent (GSMGrad) and Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad), both of which are easy to implement with a simple single-loop structure. We also introduce an efficient variant of GSMGrad with constant-level computational and memory costs.

3.1 Generalized Smooth Multi-objective Gradient descent (GSMGrad)

We build on MGDA by computing an approximate weight $w_t$ and an update direction $d_t$ according to eq. 4, where $t$ is the iteration number. However, since the minimizer $w^*$ of the convex problem in eq. 4 may not be unique, we add an $\ell_2$ regularization term, and the problem becomes

$$w_\rho^* = \arg\min_{w\in\mathcal{W}} \frac{1}{2}\|\nabla F(x)w\|^2 + \frac{\rho}{2}\|w\|^2. \tag{5}$$

Besides the benefit of a unique solution, adding an $\ell_2$ regularization term also makes $w_\rho^*(x)$ Lipschitz continuous [13]. Note that $w^*(x)$ may not be Lipschitz continuous because $\nabla F(x)^\top\nabla F(x)$ may not be positive definite. Nevertheless, the analysis of the CA distance is difficult because $w^*$ may not be Lipschitz continuous. Thus, we will characterize the gap between $w^*$ and $w_\rho^*$ plus the change of $w_\rho^*$ after adding this $\ell_2$ regularization term. As a result, the update rules become Lines 4-5 in Algorithm 1. We first update $w_t$ by a projected gradient descent step and compute the update direction $d_t=\nabla F(x_t)w_t$ to update the model parameters.

For our single-loop algorithm, the CA distance is proportional to the term $\|w_t-w_{t,\rho}^*\|$, which decreases as the algorithm iterates, with some error terms controlled by appropriately chosen small step sizes. If we initialize $w_0$ randomly, $\|w_0-w_{0,\rho}^*\|$ will be of constant order, and so will the first CA distance. Meanwhile, we can only get an $\epsilon$-order CA distance after a certain iteration number $t'>1$, when $\|w_{t'}-w_{t',\rho}^*\|$ is of order $\epsilon$. Thus, we add an extra warm start process using Algorithm 2 to guarantee that the new $w_0$ is close enough to $w_{0,\rho}^*$ and that the CA distance is small in every iteration. However, this warm start process is not needed if we only require a small average CA distance.

Algorithm 1 Generalized Smooth Multi-objective Gradient descent (GSMGrad)
1:  Initialize: model parameters $x_0$, weights $w_0$ and a constant $\rho$
2:  $w_0$ = Warm-start($w_0$, $x_0$, $\rho$) for small-level iteration-wise CA distance only
3:  for $t=0,1,\ldots,T-1$ do
4:     $w_{t+1}=\Pi_{\mathcal{W}}\big(w_t-\beta[\nabla F(x_t)^\top\nabla F(x_t)w_t+\rho w_t]\big)$
5:     $x_{t+1}=x_t-\alpha\nabla F(x_t)w_t$
6:  end for

Algorithm 2 Warm-start($w_0$, $x_0$, $\rho$)
1:  for $n=0,1,\ldots,N-1$ do
2:     $w_{n+1}=\Pi_{\mathcal{W}}\big(w_n-\beta'[\nabla F(x_0)^\top\nabla F(x_0)w_n+\rho w_n]\big)$
3:  end for
4:  Output $w_N$
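
To make the update order in Algorithms 1 and 2 concrete, the following is a minimal pure-Python sketch on a toy problem. Everything problem-specific is an assumption made for illustration and is not taken from the paper: the two quadratic objectives $f_1(x)=\frac{1}{2}\|x-a\|^2$ and $f_2(x)=\frac{1}{2}\|x-b\|^2$, the helper names (`project_simplex`, `weight_step`, `toy_grads`), the choice of $\mathcal{W}$ as the probability simplex, and the step sizes, which are picked by hand rather than from Theorem 1.

```python
import math

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (our choice of W)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            theta = t  # threshold taken from the largest valid prefix
    return [max(vi - theta, 0.0) for vi in v]

def direction(grads, w):
    """d = grad F(x) w, the weighted sum of the task gradients."""
    return [sum(wi * g[k] for wi, g in zip(w, grads)) for k in range(len(grads[0]))]

def weight_step(grads, w, beta, rho):
    """One projected step on (1/2)||grad F(x) w||^2 + (rho/2)||w||^2 (eq. 5)."""
    d = direction(grads, w)
    gram_w = [sum(gk * dk for gk, dk in zip(g, d)) for g in grads]  # grad F^T grad F w
    return project_simplex([wi - beta * (gi + rho * wi) for wi, gi in zip(w, gram_w)])

def warm_start(grad_fn, w, x0, rho, beta_prime, n_steps):
    """Algorithm 2: refine w with x fixed at x0; the gradients are computed once."""
    grads = grad_fn(x0)
    for _ in range(n_steps):
        w = weight_step(grads, w, beta_prime, rho)
    return w

def gsmgrad(grad_fn, x, w, alpha, beta, rho, n_iters):
    """Algorithm 1: both updates in an iteration use the same (x_t, w_t)."""
    norms = []
    for _ in range(n_iters):
        grads = grad_fn(x)
        d = direction(grads, w)                         # d_t = grad F(x_t) w_t
        w_next = weight_step(grads, w, beta, rho)       # Line 4
        x = [xk - alpha * dk for xk, dk in zip(x, d)]   # Line 5
        w = w_next
        norms.append(math.sqrt(sum(dk * dk for dk in d)))
    return x, w, norms

# Hypothetical toy problem: two quadratics with minimizers A and B.
A, B = (1.0, 0.0), (-1.0, 0.0)

def toy_grads(x):
    return [[x[0] - A[0], x[1] - A[1]], [x[0] - B[0], x[1] - B[1]]]

w0 = warm_start(toy_grads, [0.5, 0.5], [3.0, 2.0], rho=1e-3, beta_prime=0.1, n_steps=20)
x_T, w_T, norms = gsmgrad(toy_grads, [3.0, 2.0], w0, alpha=0.1, beta=0.1, rho=1e-3, n_iters=300)
```

In this toy run the warm start drives the weight towards the first objective, whose gradient at $x_0$ has the smaller norm, and the recorded values of $\|\nabla F(x_t)w_t\|$ shrink as $x_t$ approaches the segment between $a$ and $b$, which is the Pareto set of this example.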

3.2 Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad)

Under the stochastic setting, our algorithm keeps the same structure, with a warm start process and an update loop if we aim to control the CA distance in every iteration. In Algorithm 2, we perform the same projected gradient descent without using stochastic gradients. This is because we only need to compute $\nabla F(x_0)^\top\nabla F(x_0)$ once and can reuse it throughout the loop, which does not bring a computational burden. Then, in the update loop, we update the weight and model parameters accordingly. We use a double-sampling strategy here to make the weight gradient estimator unbiased [32], such that $d_t$ is a near-unbiased multi-gradient: $\mathbb{E}[\nabla G_2(x_t)^\top\nabla G_3(x_t)w_t+\rho w_t]=\nabla F(x_t)^\top\nabla F(x_t)w_t+\rho w_t$, where $\nabla G_2(x_t)$ and $\nabla G_3(x_t)$ are independent and unbiased estimates of $\nabla F(x_t)$. Similarly, we do not involve a warm start process if we only require the average CA distance to be small.

Algorithm 3 Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad)
1:  Initialize: model parameters $x_0$, weights $w_0$ and a constant $\rho$
2:  $w_0$ = Warm-start($w_0$, $x_0$, $\rho$) for small-level iteration-wise CA distance only
3:  for $t=0,1,\ldots,T-1$ do
4:     $x_{t+1}=x_t-\alpha\nabla G_1(x_t)w_t$
5:     $w_{t+1}=\Pi_{\mathcal{W}}\big(w_t-\beta[\nabla G_2(x_t)^\top\nabla G_3(x_t)w_t+\rho w_t]\big)$
6:  end for
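
The double-sampling structure of Algorithm 3 can be sketched in the same spirit. The sketch below is illustrative only: it assumes toy objectives $f_i(x)=\frac{1}{2}\|x-c_i\|^2$ and a hypothetical noise model in which each stochastic gradient is the true gradient plus independent Gaussian noise. The names `noisy_grads` and `sgsmgrad` and the hyperparameter values are ours, not the paper's. The point it shows is that three independent draws are used per iteration: one for the $x$ update and two for the cross term $\nabla G_2(x_t)^\top\nabla G_3(x_t)w_t$.

```python
import random

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (our choice of W)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def noisy_grads(x, centers, sigma, rng):
    """Unbiased stochastic gradients of f_i(x) = 0.5*||x - c_i||^2 (assumed noise model)."""
    return [[xk - ck + rng.gauss(0.0, sigma) for xk, ck in zip(x, c)] for c in centers]

def combine(grads, w):
    """Weighted sum of task gradients, i.e. (grad G) w."""
    return [sum(wi * g[k] for wi, g in zip(w, grads)) for k in range(len(grads[0]))]

def sgsmgrad(x, w, centers, alpha, beta, rho, sigma, n_iters, seed=0):
    rng = random.Random(seed)
    for _ in range(n_iters):
        # Three independent draws, all evaluated at the current x_t.
        g1 = noisy_grads(x, centers, sigma, rng)
        g2 = noisy_grads(x, centers, sigma, rng)
        g3 = noisy_grads(x, centers, sigma, rng)
        d3 = combine(g3, w)                                       # grad G_3(x_t) w_t
        cross = [sum(a * b for a, b in zip(g, d3)) for g in g2]   # grad G_2^T grad G_3 w_t
        x_next = [xk - alpha * dk for xk, dk in zip(x, combine(g1, w))]                  # Line 4
        w = project_simplex([wi - beta * (ci + rho * wi) for wi, ci in zip(w, cross)])   # Line 5
        x = x_next
    return x, w

centers = [(1.0, 0.0), (-1.0, 0.0)]
x_T, w_T = sgsmgrad([3.0, 2.0], [0.5, 0.5], centers,
                    alpha=0.05, beta=0.05, rho=1e-3, sigma=0.1, n_iters=2000)
```

Using the same draw twice in the cross term would add a noise-variance contribution to its expectation, which is the bias that the two independent samples $G_2$ and $G_3$ avoid.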

3.3 Fast approximation via Taylor expansion

Similarly to most MGDA-type algorithms, our methods require $\mathcal{O}(K)$ space and time to compute and store all task gradients at each iteration for updating the weight $w_t$. This becomes a drawback when the number of tasks or the model size is large. Motivated by [23], one solution is to use Taylor's theorem to approximate the gradient for updating the weight $w_t$ as

$$F(x_t)-F(x_{t+1})=\nabla F(x_t)^\top(x_t-x_{t+1})-R(x_t)=\alpha\nabla F(x_t)^\top\nabla F(x_t)w_t-R(x_t),$$

where $R(x_t)$ is the remainder term, of order $R(x_t)=o(\|x_t-x_{t+1}\|^2)$, which can be made sufficiently small by adjusting the step size. We thus propose GSMGrad-FA in Algorithm 4 (shown in Appendix B), where we update $x_t$ along the update direction $d_t=\nabla F(x_t)w_t$ to get $x_{t+1}$, followed by the update rule of $w_t$

$$w_{t+1}=\Pi_{\mathcal{W}}\Big(w_t-\beta\Big[\frac{F(x_t)-F(x_{t+1})}{\alpha}+\rho w_t\Big]\Big). \tag{6}$$

As a result, in the model parameter update, we only require one backward pass to calculate the gradient of $F(x_t)w_t$ w.r.t. $x_t$, without storing the individual task gradients, and additional forward passes to compute $F(x_{t+1})$ in the weight update. This approach significantly reduces computational and memory costs in practical implementations. More importantly, we also provide a theoretical guarantee for this efficient method (in eq. 6).
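
The weight update in eq. 6 can be illustrated with a short sketch in which the weight step only sees function values, never the per-task gradients. The setup is assumed for illustration: two toy quadratics with different curvature, $f_1(x)=\frac{1}{2}\|x-a\|^2$ and $f_2(x)=\|x-b\|^2$, a closed-form `weighted_grad` standing in for the single backward pass, and hand-picked step sizes. The function names are hypothetical and this is not the paper's Algorithm 4, which is given in Appendix B.

```python
import math

A, B = (1.0, 0.0), (-1.0, 0.0)  # hypothetical minimizers of the two toy objectives

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (our choice of W)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def losses(x):
    """F(x) = (f_1(x), f_2(x)) with f_1 = 0.5*||x - A||^2 and f_2 = ||x - B||^2."""
    f1 = 0.5 * sum((xk - ak) ** 2 for xk, ak in zip(x, A))
    f2 = sum((xk - bk) ** 2 for xk, bk in zip(x, B))
    return [f1, f2]

def weighted_grad(x, w):
    """Gradient of the scalar w^T F(x); stands in for one backward pass."""
    return [w[0] * (xk - ak) + 2.0 * w[1] * (xk - bk) for xk, ak, bk in zip(x, A, B)]

def fa_signal(x, x_next, alpha):
    """(F(x_t) - F(x_{t+1})) / alpha, the surrogate for grad F^T grad F w_t in eq. 6."""
    return [(fo - fn) / alpha for fo, fn in zip(losses(x), losses(x_next))]

def gsmgrad_fa(x, w, alpha, beta, rho, n_iters):
    norms = []
    for _ in range(n_iters):
        d = weighted_grad(x, w)                              # d_t from one gradient call
        x_next = [xk - alpha * dk for xk, dk in zip(x, d)]   # x_{t+1} = x_t - alpha d_t
        s = fa_signal(x, x_next, alpha)                      # forward evaluations only
        w = project_simplex([wi - beta * (si + rho * wi) for wi, si in zip(w, s)])
        x = x_next
        norms.append(math.sqrt(sum(dk * dk for dk in d)))
    return x, w, norms

x_T, w_T, norms = gsmgrad_fa([3.0, 2.0], [0.5, 0.5], alpha=0.05, beta=0.05, rho=1e-3, n_iters=500)
```

For these quadratics the surrogate differs from the exact product $\nabla F(x_t)^\top\nabla F(x_t)w_t$ by a term proportional to $\alpha\|d_t\|^2$, so a single step from $x=(3,2)$ with $\alpha=10^{-3}$ gives values close to the exact entries $16$ and $52$.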

4 Convergence Analysis under Average CA distance

In this section, we provide the theoretical results for Algorithms 1 and 3 without warm starts to obtain an $\epsilon$-accurate Pareto stationary point, with the average CA distance over iterations in $\mathcal{O}(\epsilon)$.

4.1 Deterministic setting

Assumption 1.

Each objective function $f_i$, $\forall i\in[K]$, is twice differentiable and lower bounded by $f_i^*:=\inf_{x\in\mathbb{R}^m}f_i(x)>-\infty$.

Assumption 2.

Each objective function $f_i$, $\forall i\in[K]$, is $\ell$-smooth as defined in Definition 1, where $\ell:[0,+\infty)\rightarrow(0,+\infty)$ is a continuous non-decreasing function such that $\varphi(a)=\frac{a^2}{2\ell(2a)}$ is monotonically increasing for any $a\geq 0$.

These assumptions are the most relaxed among existing MOO works, since those works directly assume objective smoothness or bounded gradients/function values [24, 13, 27, 32, 6, 33, 3]. They also include the widely studied standard $L$-smoothness [28, 15, 16] and $(L_0,L_1)$-smoothness [35] as special cases. Moreover, for any $0\leq\gamma\leq 2$ and $x\in\mathcal{X}$, our assumption even holds for functions $f$ such that $\|\nabla^2 f(x)\|\leq L_0+L_1\|\nabla f(x)\|^\gamma$, whereas $\gamma$ is limited to $[0,1]$ in [9].
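
As a quick check of the monotonicity condition in Assumption 2, suppose $\ell(a)=L_0+L_1a^\gamma$ with $L_0,L_1>0$. This form is our reading of how the Hessian bound above maps to $\ell$; the exact correspondence depends on Definition 1. Then

$$\varphi(a)=\frac{a^2}{2\,(L_0+2^\gamma L_1a^\gamma)},\qquad \varphi'(a)=\frac{2L_0a+(2-\gamma)\,2^\gamma L_1a^{\gamma+1}}{2\,(L_0+2^\gamma L_1a^\gamma)^2}.$$

The numerator is non-negative for all $a\geq 0$ whenever $\gamma\leq 2$, so $\varphi$ is increasing, while for $\gamma>2$ the second term eventually dominates and $\varphi'$ becomes negative for large $a$. Under this reading, $\gamma=0$ corresponds to standard smoothness with constant $L_0+L_1$ and $\gamma=1$ to $(L_0,L_1)$-smoothness.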

We then provide our theoretical results. Let $c>0$ and $F>0$ be some constants such that $\Delta+c\leq F$, where $\Delta=\max_{i\in[K]}\{f_i(x_0)-f_i^*\}$. Define $M=\sup\{z\geq 0\mid\varphi(z)\leq F\}$. We then have the following convergence rate for Algorithm 1:

Theorem 1.

Let Assumptions 1 and 2 hold. Set $\beta\sim\mathcal{O}\big(\frac{1}{M^2}\big)$, $\alpha\sim\mathcal{O}\big(\frac{1}{M^2}+\frac{1}{M\ell(M+1)}\big)$, $T\geq\max\Big(\Theta\big(\frac{1}{\alpha\epsilon^2}\big),\Theta\big(\frac{1}{\beta\epsilon^2}\big)\Big)$ and $\rho\sim\mathcal{O}(\epsilon^2)$. We then have that $\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t\|^2\leq\epsilon^2$.

The full version with detailed constants and a detailed proof can be found in Appendix C.1. Theorem 1 provides the first convergence rate to obtain an $\epsilon$-accurate Pareto stationary point for MOO problems with $\ell$-smooth objectives. Moreover, it achieves the optimal sample complexity in the order of $\mathcal{O}(\epsilon^{-2})$ for GD with a single standard $L$-smooth objective [5]. MOO problems with $\ell$-smooth objectives are challenging for two reasons: 1) $\|\nabla F(x)\|$ is potentially unbounded in our $\ell$-smoothness setting, making all existing analyses in MOO [24, 13, 27, 32, 6, 33, 3] inapplicable; 2) the update of $x$ includes gradient information from every task, making the existing adaptive methods for single generalized smooth functions invalid.

To address these challenges, we find that a bounded function value implies a bounded gradient norm. Thus, in our proof, we use induction to show that with the parameters selected in Theorem 1, for any $w\in\mathcal{W}$ and $t\leq T$, $F(x_t)w$ is upper bounded by $F$. Consequently, for any $i\in[K]$, we have that $\|\nabla f_i(x)\|\leq M$, which solves the unbounded gradient norm problem in our generalized smoothness setting. Then we can show that $\|\nabla F(x_t)w_t\|$ converges.
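
The constant $M=\sup\{z\geq 0\mid\varphi(z)\leq F\}$ can be evaluated numerically for a given $\ell$, since $\varphi$ is monotonically increasing under Assumption 2. The sketch below uses plain bisection; the function name `gradient_bound`, the tolerance, and the cap used to report an unbounded $M$ are our own choices for illustration, and the two example choices of $\ell$ are assumptions rather than cases worked out in the paper.

```python
import math

def phi(z, ell):
    """phi(z) = z^2 / (2 * ell(2z)) from Assumption 2."""
    return z * z / (2.0 * ell(2.0 * z))

def gradient_bound(F, ell, tol=1e-10, cap=1e12):
    """Approximate M = sup{z >= 0 : phi(z) <= F}, assuming phi is increasing."""
    lo, hi = 0.0, 1.0
    # Grow the bracket until phi(hi) exceeds F, or give up and report infinity.
    while phi(hi, ell) <= F:
        lo, hi = hi, 2.0 * hi
        if hi > cap:
            return math.inf
    # Bisection: phi(lo) <= F < phi(hi) holds throughout.
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if phi(mid, ell) <= F:
            lo = mid
        else:
            hi = mid
    return lo

# Constant ell (standard L-smoothness with L = 1): phi(z) = z^2 / 2, so M = sqrt(2F).
M_smooth = gradient_bound(2.0, lambda a: 1.0)
# Linear ell(a) = 1 + a: phi(z) = z^2 / (2 + 4z), so M solves z^2 - 8z - 4 = 0 for F = 2.
M_linear = gradient_bound(2.0, lambda a: 1.0 + a)
```

For constant $\ell\equiv L$ this recovers the familiar relation between a function-value gap $F$ and a gradient norm of $\sqrt{2LF}$.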

Corollary 1.

Under the same setting as in Theorem 1, $\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t-\nabla F(x_t)w_t^*\|^2=\mathcal{O}(\epsilon^2)$.

The proof is available in Appendix C.2. Corollary 1 shows that the average CA distance converges.

4.2 Stochastic setting

In the stochastic setting, we assume that we have access to an unbiased stochastic gradient $\nabla f_i(x;s)$ instead of the true gradient $\nabla f_i(x)$, where $s$ denotes the collected samples. To prove convergence, we make the following assumption.

Assumption 3.

There exists some $\sigma\geq 0$ such that $\mathbb{E}[\|\nabla f_i(x;s)-\nabla f_i(x)\|^2]\leq\sigma^2$ for any $i\in[K]$.

Assumption 3 indicates bounded gradient variances, which is widely studied [32, 21, 13].

Let $s_{t,i}=(s_{t,i,1},s_{t,i,2},\ldots,s_{t,i,K})$ be the $i$-th collection of samples at time $t$ in the stochastic model and $F(x_t;s_{t,i})=(f_1(x_t;s_{t,i,1}),f_2(x_t;s_{t,i,2}),\ldots,f_K(x_t;s_{t,i,K}))$. In this section, we choose $G_i(x_t)=F(x_t;s_{t,i})$ for any $i\in[3]$.
Define $\varepsilon_{t,i}=(\varepsilon_{t,i,1},\varepsilon_{t,i,2},\ldots,\varepsilon_{t,i,K})=\nabla F(x_t)-\nabla G_i(x_t)$.

Let $F,c>0$ and $0<\delta\leq\frac{1}{2}$ be some constants such that $F\geq\frac{4(\Delta+c)}{\delta}$ and $M=\sup\{z\geq 0\mid\varphi(z)\leq F\}$. Define the following random variables: $\tau_1=\min\{t\mid\exists i\in[K],\,f_i(x_{t+1})-f_i^*>F\}\wedge T$, $\tau_2=\min\{t\mid\exists i\in[K],\,j\in[3],\,\|\varepsilon_{t,j,i}\|>\frac{L_0}{\sqrt{\alpha\rho}}\}\wedge T$, $\tau_3=\min\{t\mid\exists i,j\in[K],\,\|\varepsilon_{t,2,i}\|\|\varepsilon_{t,3,j}\|>\frac{L_1}{\sqrt{\alpha\rho}}\}\wedge T$ and $\tau=\min\{\tau_1,\tau_2,\tau_3\}$, where $L_0,L_1>0$ are some constants and $a\wedge b$ denotes $\min(a,b)$. We have the following theorem:

Theorem 2.

Let Assumptions 1, 2, and 3 hold. Set $\alpha\leq\min\big\{\mathcal{O}(\ell(M+1)),\mathcal{O}(\beta),\mathcal{O}(\rho),\mathcal{O}\big(\frac{1}{\sqrt{T}}\big),\mathcal{O}(\epsilon^2)\big\}$, $\beta\leq\min\big\{\mathcal{O}\big(\frac{1}{\alpha T}\big),\mathcal{O}(\epsilon^2),\mathcal{O}(\rho)\big\}$, $\rho\leq\min\big\{\mathcal{O}\big(\frac{1}{\alpha T}\big),\mathcal{O}\big(\frac{1}{\sqrt{\alpha T}}\big),\mathcal{O}\big(\frac{\epsilon}{\sqrt{\beta}}\big),\mathcal{O}(\epsilon^2)\big\}$ and $T\geq\Theta\big(\frac{1}{\alpha\epsilon^2}+\frac{1}{\epsilon^4}\big)$.
We then have that $\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t\|^2\leq\epsilon^2$ with probability at least $1-\delta$.

The full version and detailed proof can be found in Appendix C.3. When we set $\alpha,\beta,\rho\sim\mathcal{O}(\epsilon^{2})$ and $T\sim\mathcal{O}(\epsilon^{-4})$, we can find an $\epsilon$-stationary point with $\mathcal{O}(\epsilon^{-4})$ samples, which matches the optimal sample complexity of SGD for a single $L$-smooth objective [1]. Note that in the proof of Theorem 1, we show that for each $t\leq T$ and $w\in W$, $F(x_{t})w$ is bounded when small constant step sizes $\alpha$ and $\beta$ are used. However, this condition does not necessarily hold in our stochastic setting because the gradient noise is unbounded. To solve this problem, we introduce the stopping time $\tau$. Its advantages are as follows: 1) for any $t\leq\tau$ and $w\in W$, $F(x_{t})w$ is bounded; 2) for any $t<\tau$, the norm of the gradient noise is bounded; 3) by the optional stopping theorem, for any $w\in W$ and $i\in[3]$, we have $\mathbb{E}\left[\sum_{t=0}^{\tau}\varepsilon_{t,i}w\right]=0$.
Based on these properties, we can further get the following lemma:

Lemma 1.

Using the parameters selected in Theorem 2, we have that

$$\mathbb{E}[F(x_{\tau})w]-F^{*}w\leq\frac{\delta F}{8}-\frac{\alpha}{2}\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\nabla F(x_{t})w_{t}\|^{2}\right]. \tag{7}$$

The proof of Lemma 1 is available in Appendix C.4. Lemma 1 indicates that $\frac{\alpha}{2}\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\nabla F(x_{t})w_{t}\|^{2}\right]$ is bounded by some constant, and if $\tau=T$ with high probability, we have that $\frac{1}{T}\mathbb{E}\left[\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}\,\Big|\,\tau=T\right]\sim\mathcal{O}\left(\frac{1}{\alpha T}\right)$.
Note that $\{\tau<T\}=\{\tau_{2}<T\}\cup\{\tau_{3}<T\}\cup\{\tau_{1}<T,\tau_{2}=T,\tau_{3}=T\}$. The first two events are related to the gradient noise, and their probabilities can be bounded by Assumption 3 and Chebyshev's inequality. The last event indicates that for some $i\in[K]$, we have $f_{i}(x_{\tau})-f_{i}^{*}\leq\frac{F}{2}$. Based on Lemma 1 and Markov's inequality, we can show that $\mathbb{P}(\{\tau_{1}<T,\tau_{2}=T,\tau_{3}=T\})\leq\frac{\delta}{4}$, and we can further show that $\mathbb{P}(\tau=T)\geq 1-\frac{\delta}{2}$.
We then have that $\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}$ converges with high probability. Similar to Corollary 1, Theorem 2 also implies that the average of the CA distances converges over time with high probability.
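The step from Lemma 1 to the $\mathcal{O}\left(\frac{1}{\alpha T}\right)$ rate can be spelled out as follows. This is only a sketch under two assumptions that are not restated in this section: that $F(x)w\geq F^{*}w$ for every $x$ and $w\in W$ (so the left-hand side of (7) is nonnegative), and that $\mathbb{P}(\tau=T)\geq 1-\frac{\delta}{2}$ as claimed above. The precise constants are those of Appendix C and may differ.

```latex
% Step 1: rearrange (7) and drop the nonnegative term E[F(x_tau) w] - F^* w.
\frac{\alpha}{2}\,\mathbb{E}\Big[\sum_{t=0}^{\tau-1}\|\nabla F(x_t)w_t\|^2\Big]
  \;\le\; \frac{\delta F}{8} - \big(\mathbb{E}[F(x_\tau)w] - F^{*}w\big)
  \;\le\; \frac{\delta F}{8}.

% Step 2: the summands are nonnegative, so restricting to the event {tau = T}
% can only decrease the expectation.
\mathbb{P}(\tau=T)\,
\mathbb{E}\Big[\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t\|^2 \,\Big|\, \tau=T\Big]
  \;\le\; \mathbb{E}\Big[\sum_{t=0}^{\tau-1}\|\nabla F(x_t)w_t\|^2\Big]
  \;\le\; \frac{\delta F}{4\alpha}.

% Step 3: divide by T and by P(tau = T) >= 1 - delta/2.
\frac{1}{T}\,
\mathbb{E}\Big[\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t\|^2 \,\Big|\, \tau=T\Big]
  \;\le\; \frac{\delta F}{4\alpha T\,(1-\delta/2)}
  \;=\; \mathcal{O}\Big(\frac{1}{\alpha T}\Big).
```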

5 Convergence Analysis under Iteration-wise CA distance

In Section 4 we show that the average CA distance is bounded under generalized smooth conditions. The average CA distance is also studied in MoCo [13] and MoDo [6] under the bounded-gradient assumption, and these works only provide guarantees on the CA distance averaged over iterations. However, an $\epsilon$-level average CA distance only implies that the smallest CA distance is $\epsilon$-level. Since we want to keep the update direction close to the CA direction, a tighter bound on the CA distance is desirable. In this section, we show that the CA distance is $\mathcal{O}(\epsilon)$ at every iteration with the help of a warm-start process, and we give convergence results for Algorithms 1, 3, and 4.
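A toy numerical example may make the gap between the two guarantees concrete. The sequence below is entirely hypothetical (it is not produced by any of the algorithms); it only shows that an average below $\epsilon$ bounds the smallest entry but says nothing about the largest one.

```python
# Toy illustration: a small average does not control the worst iteration.
T = 1000
eps = 0.01

# Hypothetical CA distances: tiny everywhere except one bad iteration.
ca_dist = [eps / 2] * T
ca_dist[0] = 4.0

average = sum(ca_dist) / T
smallest = min(ca_dist)
largest = max(ca_dist)

print(f"average  = {average:.6f}")   # below eps
print(f"smallest = {smallest:.6f}")  # never exceeds the average
print(f"largest  = {largest:.6f}")   # far above eps
```

An iteration-wise bound rules out sequences like this one, which is why it needs stronger parameter choices than the average-case results.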

5.1 Deterministic setting

Deterministic setting without fast approximation. We first provide results about bounded iteration-wise CA distance for Algorithm 1 with a warm start.

Theorem 3.

Let Assumptions 1 and 2 hold. Set $\beta^{\prime}\leq\frac{1}{M^{2}}$, $\rho\sim\mathcal{O}(\epsilon^{2})$, $\beta\sim\mathcal{O}(\epsilon^{4})$, $\alpha\sim\mathcal{O}(\epsilon^{9})$, $N\sim\mathcal{O}(\epsilon^{-2})$ as constants, and $T\sim\Theta(\epsilon^{-11})$. All the parameters satisfy the requirements in the formal version of Theorem 1 and we have $\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|\sim\mathcal{O}(\epsilon)$.

The finite-time error bound and the full proof can be found in Appendix D.1. Since our parameters satisfy the requirements in the formal version of Theorem 1, we can find an $\epsilon$-accurate Pareto stationary point with $\mathcal{O}(\epsilon^{-11})$ samples. In the analysis of the CA distance, we show that the CA distance can be bounded by the term $\|w_{t}-w_{t,\rho}^{*}\|$ plus the strong-convexity constant $\rho$. Meanwhile, there is a decay relation between $\|w_{t+1}-w_{t+1,\rho}^{*}\|$ and $\|w_{t}-w_{t,\rho}^{*}\|$ with some error terms controlled by the step sizes. Nevertheless, these error terms accumulate when we telescope the decay relation, and the accumulated error becomes the dominating term. Thus, the step sizes have to be much smaller than the choices in Theorem 1 to guarantee a small iteration-wise CA distance.
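A generic recursion illustrates why the accumulated error dominates. Assume, purely for illustration, that $a_{t}=\|w_{t}-w_{t,\rho}^{*}\|$ satisfies a one-step contraction with factor $c\in(0,1)$ and per-step error $e\geq 0$. Both $c$ and $e$ are placeholders here and are not the constants derived in Appendix D.1.

```latex
% Assumed one-step decay relation (placeholder constants c and e):
a_{t+1} \;\le\; (1-c)\,a_t + e .

% Unrolling (telescoping) over t steps and summing the geometric series:
a_t \;\le\; (1-c)^t a_0 + e\sum_{s=0}^{t-1}(1-c)^s
    \;\le\; (1-c)^t a_0 + \frac{e}{c} .
```

Under this sketch, the first term is what a warm start keeps small through $a_{0}$, while the second term $\frac{e}{c}$ does not vanish with $t$. Keeping it at the $\mathcal{O}(\epsilon)$ level for every iteration constrains the ratio between step-size-dependent quantities, which is consistent with the much smaller step sizes required here.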

Deterministic setting with fast approximation. Next, we show the convergence rate of Algorithm 4 and its bounded iteration-wise CA distance.

Theorem 4.

Let Assumptions 1 and 2 hold. Set $N\sim\mathcal{O}(\epsilon^{-2})$, $\beta^{\prime}\leq\frac{1}{M^{2}}$, $\rho\leq\min\left(\mathcal{O}(\epsilon^{2}),\mathcal{O}\left(\frac{1}{\alpha T}\right)\right)$, $\beta\leq\mathcal{O}(\epsilon^{2})$, $\alpha\leq\min\left(\mathcal{O}(\beta),\mathcal{O}(\epsilon^{2}),\mathcal{O}\left(\frac{1}{\beta T}\right)\right)$ as constants, and $T\geq\max\left(\Theta\left(\frac{1}{\alpha\epsilon^{2}}\right),\Theta\left(\frac{1}{\beta\epsilon^{2}}\right)\right)$.
We have $\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}\leq\epsilon^{2}$.

The full version with detailed constants and proof can be found in Appendix D.2. The analysis in Appendix C.1 extends easily to the convergence analysis of Algorithm 4, because the only extra effort is dealing with the remainder term, which can be simply bounded by the smallest step size. As a result, the sample complexity to reach a Pareto stationary point remains $\mathcal{O}(\epsilon^{-11})$.

Theorem 5.

Let Assumptions 1 and 2 hold. We choose $\beta^{\prime}\leq\frac{1}{M^{2}}$, $\rho\sim\mathcal{O}(\epsilon^{2})$, $\beta\sim\mathcal{O}(\epsilon^{4})$, $\alpha\sim\mathcal{O}(\epsilon^{9})$, $N\sim\mathcal{O}(\epsilon^{-2})$ as constants, and $T\sim\Theta(\epsilon^{-11})$. We have that $\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|\sim\mathcal{O}(\epsilon)$.
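As a sanity check, the orders chosen in Theorem 5 can be compared against the constraints of Theorem 4. The computation below ignores all constants and assumes $\epsilon<1$, so it is an order-of-magnitude sketch rather than a verification of the formal statements in Appendix D.

```latex
% Plug in  alpha ~ eps^9,  beta ~ eps^4,  rho ~ eps^2,  T ~ eps^{-11}.
\frac{1}{\alpha\epsilon^{2}} = \epsilon^{-11}, \qquad
\frac{1}{\beta\epsilon^{2}} = \epsilon^{-6}
\;\;\Longrightarrow\;\;
T \sim \epsilon^{-11} \;\ge\; \max\big(\epsilon^{-11},\,\epsilon^{-6}\big).

\frac{1}{\alpha T} = \epsilon^{2}
\;\;\Longrightarrow\;\;
\rho \sim \epsilon^{2} \;\le\; \min\Big(\epsilon^{2},\,\frac{1}{\alpha T}\Big).

\frac{1}{\beta T} = \epsilon^{7}
\;\;\Longrightarrow\;\;
\alpha \sim \epsilon^{9} \;\le\; \min\Big(\beta,\,\epsilon^{2},\,\frac{1}{\beta T}\Big)
      = \min\big(\epsilon^{4},\,\epsilon^{2},\,\epsilon^{7}\big).

\beta \sim \epsilon^{4} \;\le\; \epsilon^{2}.
```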

5.2 Stochastic setting

In this section, we show that Algorithm 3 with a warm start and mini-batches achieves a bounded iteration-wise CA distance with high probability. Here we choose $G_{i}(x_{t})=\frac{1}{n_{s}}\sum_{j=n_{s}i-n_{s}+1}^{n_{s}i}F(x_{t};s_{t,j})$, where $n_{s}$ is the size of the mini-batch.
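The block-averaging rule above can be sketched in a few lines. This is a minimal illustration on hypothetical scalar "gradient samples" (true value plus Gaussian noise), not the estimator used in the experiments; the helper names `minibatch_average` and `spread` are introduced here only for the example.

```python
import random

def minibatch_average(samples, i, n_s):
    """Average the i-th block of n_s samples (blocks are 1-indexed)."""
    start = n_s * i - n_s            # 0-based position of sample n_s*i - n_s + 1
    block = samples[start:n_s * i]   # samples n_s*i - n_s + 1, ..., n_s*i
    if len(block) != n_s:
        raise ValueError("not enough samples for this block")
    return sum(block) / n_s

random.seed(0)
TRUE_GRAD = 1.0
# Hypothetical stochastic gradients: true value plus zero-mean noise.
samples = [TRUE_GRAD + random.gauss(0.0, 1.0) for _ in range(30_000)]

def spread(n_s, blocks=300):
    """Empirical standard deviation of the mini-batch estimates."""
    ests = [minibatch_average(samples, i, n_s) for i in range(1, blocks + 1)]
    mean = sum(ests) / len(ests)
    return (sum((e - mean) ** 2 for e in ests) / len(ests)) ** 0.5

print("spread with n_s = 1:  ", spread(1))
print("spread with n_s = 100:", spread(100))
```

Larger mini-batches shrink the spread of the estimate around the true value, which is the mechanism the stochastic analysis relies on to control the per-iteration estimation error.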

Theorem 6.

Let Assumptions 1, 2 and 3 hold. Set $\beta^{\prime}\leq\frac{1}{M^{2}}$, $\alpha\sim\mathcal{O}(\epsilon^{9})$, $\beta\sim\mathcal{O}(\epsilon^{4})$, $\rho\sim\mathcal{O}(\epsilon^{2})$, $n_{s}\sim\mathcal{O}(\epsilon^{-6})$, $N\sim\mathcal{O}(\epsilon^{-2})$, and $T\sim\Theta(\epsilon^{-11})$, and all the parameters satisfy the requirements in Theorem 2. We then have $\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|\sim\mathcal{O}(\epsilon)$ with probability at least $1-\delta$.

The full version with detailed constants and proof can be found in Appendix D.4. Since our parameters satisfy all requirements in Theorem 2, we can find an $\epsilon$-accurate Pareto stationary point with high probability. Compared with Theorem 2, guaranteeing an iteration-wise CA distance requires a mini-batch method in our analysis, in addition to the warm-start process. This is because, conditioned on $\tau=T$, the gradient estimate is no longer unbiased. In Theorem 2, the optional stopping theorem is applied, which indicates that the expectation of the cumulative gradient noise is zero. However, the optional stopping theorem does not apply to each individual iteration, so the estimation error is instead controlled by the size of the mini-batch. The sample complexity to reach a Pareto stationary point then becomes $\mathcal{O}(\epsilon^{-17})$, since each of the $T\sim\Theta(\epsilon^{-11})$ iterations uses a mini-batch of $n_{s}\sim\mathcal{O}(\epsilon^{-6})$ samples.
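The contrast between the stopped sum and the conditional expectation can be checked exactly on a toy martingale. The sketch below uses fair $\pm 1$ noise and a hypothetical stopping rule (stop once the partial sum reaches a threshold, or at the horizon); none of this is the paper's $\tau_{1},\tau_{2},\tau_{3}$, it only illustrates the phenomenon by enumerating every noise path.

```python
from fractions import Fraction
from itertools import product

T = 10          # horizon
THRESHOLD = 2   # hypothetical rule: stop once the partial sum reaches 2

def stopped_sum(noise):
    """Return (tau, S_tau) with tau = min(first t such that S_t >= THRESHOLD, T)."""
    s = 0
    for t, e in enumerate(noise, start=1):
        s += e
        if s >= THRESHOLD:
            return t, s
    return T, s

# All 2^T equally likely paths of zero-mean +/-1 noise.
results = [stopped_sum(p) for p in product((-1, 1), repeat=T)]

# Expectation of the sum stopped at tau, computed exactly.
mean_stopped = Fraction(sum(s for _, s in results), len(results))

# Expectation of the sum conditioned on the event {tau = T}.
survivors = [s for tau, s in results if tau == T]
mean_given_tau_T = Fraction(sum(survivors), len(survivors))

print("E[S_tau]          =", mean_stopped)
print("E[S_T | tau = T]  =", mean_given_tau_T)
```

The stopped sum has exactly zero mean, while conditioning on never stopping early biases the noise, since the surviving paths are precisely those whose noise stayed low. This is the sense in which the per-iteration estimate is biased given $\tau=T$.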

6 Experiments

In this experiment, we evaluate performance on the Cityscapes dataset [10], which involves 2 pixel-wise tasks: 7-class semantic segmentation (Task 1) and depth estimation (Task 2). Following the experiment setup of [32], we use SegNet [2] as the model and compare MGDA [12], PCGrad [34], GradDrop [8], CAGrad [24], MoCo [13], MoDo [6], Nash-MTL [27], and SDMGrad [32] with our methods, GSMGrad and GSMGrad-FA, both with warm-start initialization. We use the metric $\Delta m\%$ to reflect the overall performance; it measures the average per-task performance drop relative to the single-task (STL) baseline. It can be observed in Figure 1 that GSMGrad achieves better results on Task 2 and a much more balanced overall performance. Meanwhile, the proposed GSMGrad-FA is much faster than GSMGrad, as shown in Table 2 in the Appendix.

In addition, we illustrate the relationship between the gradient norm and the local smoothness for each task. To do so, we compute both quantities according to the method described in Section H.3 of [35]. We plot the local smoothness constant against the gradient norm in Figure 2 for the semantic segmentation task and in Figure 3 (in the appendix) for the depth estimation task. Both results demonstrate a positive correlation between the two quantities, which further substantiates the necessity of our analysis. More experimental details can be found in Appendix A.

| Method | Segmentation: mIoU ↑ | Segmentation: Pix Acc ↑ | Depth: Abs Err ↓ | Depth: Rel Err ↓ | $\Delta m\%$ ↓ |
|---|---|---|---|---|---|
| STL | 74.01 | 93.16 | 0.0125 | 27.77 | - |
| MGDA [12] | 68.84 | 91.54 | 0.0309 | 33.50 | 44.14 |
| PCGrad [34] | 75.13 | 93.48 | 0.0154 | 42.07 | 18.29 |
| GradDrop [8] | 75.27 | 93.53 | 0.0157 | 47.54 | 23.73 |
| CAGrad [24] | 75.16 | 93.48 | 0.0141 | 37.60 | 11.64 |
| MoCo [13] | 75.42 | 93.55 | 0.0149 | 34.19 | 9.90 |
| MoDo [6] | 74.55 | 93.32 | 0.0159 | 41.51 | 18.89 |
| Nash-MTL [27] | 75.41 | 93.66 | 0.0129 | 35.02 | 6.82 |
| SDMGrad [32] | 74.53 | 93.52 | 0.0137 | 34.01 | 7.79 |
| GSMGrad | 75.41 | 93.46 | 0.0133 | 31.07 | 3.93 |
| GSMGrad-FA | 74.38 | 93.24 | 0.0160 | 41.78 | 19.44 |

Figure 1: Multi-task learning on Cityscapes dataset.
Figure 2: Gradient norm vs smoothness

7 Conclusion

In this paper, we investigate the multi-objective problem with a more challenging, relaxed and realistic $\ell$-smooth assumption. We propose the first efficient MOO algorithm GSMGrad and its stochastic variant SGSMGrad for this problem. We provide the convergence guarantee for both algorithms to find an $\epsilon$-accurate Pareto stationary point with $\epsilon$-level average/iteration-wise CA distance. Extensive experiments are conducted to validate our theoretical results.

References

  • [1] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023.
  • [2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
  • [3] Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. arXiv preprint arXiv:2402.15638, 2024.
  • [4] Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recovery. Convex optimization in signal processing and communications, pages 42–88, 2009.
  • [5] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, 184(1):71–120, 2020.
  • [6] Lisha Chen, Heshan Fernando, Yiming Ying, and Tianyi Chen. Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance. Advances in Neural Information Processing Systems, 36, 2024.
  • [7] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning, pages 794–803. PMLR, 2018.
  • [8] Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. Advances in Neural Information Processing Systems, 33:2039–2050, 2020.
  • [9] Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. arXiv preprint arXiv:2303.02854, 2023.
  • [10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [11] Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Robustness to unbounded smoothness of generalized signsgd. Advances in Neural Information Processing Systems, 35:9955–9968, 2022.
  • [12] Jean-Antoine Désidéri. Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313–318, 2012.
  • [13] Heshan Devaka Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent approach. In The Eleventh International Conference on Learning Representations, 2022.
  • [14] Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023.
  • [15] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • [16] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
  • [17] Michelle Guo, Albert Haque, De-An Huang, Serena Yeung, and Li Fei-Fei. Dynamic task prioritization for multitask learning. In Proceedings of the European conference on computer vision (ECCV), pages 270–287, 2018.
  • [18] Xinyu Huang, Peng Wang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The apolloscape open dataset for autonomous driving and its application. IEEE transactions on pattern analysis and machine intelligence, 42(10):2702–2719, 2019.
  • [19] Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems, 34:2771–2782, 2021.
  • [20] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
  • [21] Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems, 36, 2024.
  • [22] Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions. Advances in Neural Information Processing Systems, 36, 2024.
  • [23] Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu. Famo: Fast adaptive multitask optimization. Advances in Neural Information Processing Systems, 36, 2024.
  • [24] Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878–18890, 2021.
  • [25] Suyun Liu and Luis Nunes Vicente. The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning. Annals of Operations Research, pages 1–30, 2021.
  • [26] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1930–1939, 2018.
  • [27] Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017, 2022.
  • [28] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
  • [29] Amirhossein Reisizadeh, Haochuan Li, Subhro Das, and Ali Jadbabaie. Variance-reduced clipping for non-convex optimization. arXiv preprint arXiv:2303.00883, 2023.
  • [30] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in neural information processing systems, 31, 2018.
  • [31] Philip S Thomas, Joelle Pineau, Romain Laroche, et al. Multi-objective spibb: Seldonian offline policy improvement with safety constraints in finite mdps. Advances in Neural Information Processing Systems, 34:2004–2017, 2021.
  • [32] Peiyao Xiao, Hao Ban, and Kaiyi Ji. Direction-oriented multi-objective learning: Simple and provable stochastic algorithms. Advances in Neural Information Processing Systems, 36, 2024.
  • [33] Haibo Yang, Zhuqing Liu, Jia Liu, Chaosheng Dong, and Michinari Momma. Federated multi-objective learning. Advances in Neural Information Processing Systems, 36, 2024.
  • [34] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
  • [35] Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.
  • [36] Shiji Zhou, Wenpeng Zhang, Jiyan Jiang, Wenliang Zhong, Jinjie Gu, and Wenwu Zhu. On the convergence of stochastic multi-objective gradient manipulation and beyond. Advances in Neural Information Processing Systems, 35:38103–38115, 2022.

Appendix A Experimental details

A.1 Relation between gradient norms and the local smoothness

We show the relation between local smoothness and gradient norms of each task in this part. Both results demonstrate a positive correlation between them, which further substantiates the necessity of our analysis.

Figure 3: Local smoothness constant vs. gradient norm for each task when training SegNet on the Cityscapes dataset. Task 1 is on the left and Task 2 is on the right.

A.2 Running time comparison between GSMGrad and GSMGrad-FA

We compare the average running time of the proposed algorithms, GSMGrad and GSMGrad-FA. The time in Table 2 is an average of the total running time over epochs (in minutes). The result solidifies the advantage of the fast approximation.

| Method | Average running time (min) |
|---|---|
| GSMGrad | 2.93 |
| GSMGrad-FA | 1.93 |
Table 2: Average running time comparison between GSMGrad and GSMGrad-FA.

A.3 Implementation details

Multi-task learning on Cityscapes dataset. Following the experiment setup in [32], we train our method for 200 epochs, using SGD optimizers for both model parameters and weights, with a batch size of 8 for Cityscapes. We compute the averaged test performance over the last 10 epochs as the final performance measure. We fix $\beta=0.5$ and do a grid search over the hyperparameters $N\in[10,20,40,50]$, $\alpha\in[0.0001,0.0002,0.0005,0.001]$, and $\rho\in[0.01,0.05,0.1,0.2,0.5,0.6,0.7,0.8,0.9,1]$, and choose the best result. The best performance of GSMGrad is obtained with $N=40$, $\alpha=0.0005$, $\beta=0.5$, and $\rho=0.5$. The best hyperparameters for GSMGrad-FA turn out to be the same as those for GSMGrad. All experiments are run on an NVIDIA RTX A6000.
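The search described above covers a small Cartesian grid, which can be sketched as follows. The grids and the fixed $\beta$ are taken from the text; the scoring function `toy_score` is a hypothetical placeholder standing in for "train the model and return $\Delta m\%$", and its minimum is placed by construction at the configuration the text reports, so it carries no experimental information.

```python
from itertools import product

# Grids as listed in the text; beta is held fixed.
N_GRID = [10, 20, 40, 50]
ALPHA_GRID = [0.0001, 0.0002, 0.0005, 0.001]
RHO_GRID = [0.01, 0.05, 0.1, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
BETA = 0.5

def grid_search(evaluate):
    """Return (best configuration, number of configurations); lower score is better."""
    configs = [
        {"N": n, "alpha": a, "beta": BETA, "rho": r}
        for n, a, r in product(N_GRID, ALPHA_GRID, RHO_GRID)
    ]
    return min(configs, key=evaluate), len(configs)

def toy_score(cfg):
    """Hypothetical placeholder for a full training run (NOT real results)."""
    return abs(cfg["N"] - 40) + abs(cfg["alpha"] - 0.0005) + abs(cfg["rho"] - 0.5)

best, n_configs = grid_search(toy_score)
print(n_configs, "configurations; best:", best)
```

With these grids the search visits 4 × 4 × 10 = 160 configurations, each of which would correspond to one full training run in practice.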

$\Delta m\%$ reflects the average per-task performance drop of method $m$ relative to the single-task learning (STL) baseline $b$. We calculate it by the following equation

\[
\Delta m\%=\frac{1}{K}\sum_{k=1}^{K}(-1)^{l_{k}}(M_{m,k}-M_{b,k})/M_{b,k}\times 100,
\]

where $K$ is the number of metrics, $M_{b,k}$ is the value of metric $M_{k}$ obtained by baseline $b$, and $M_{m,k}$ is the value obtained by the compared method $m$. We set $l_{k}=1$ if the evaluation metric $M_{k}$ on task $k$ prefers a higher value and $l_{k}=0$ otherwise.
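As a minimal sketch of how this metric can be evaluated, the snippet below implements the formula directly. The function name `delta_m_percent` and the numbers in the usage note are hypothetical and are not taken from the experiments.

```python
from typing import Sequence


def delta_m_percent(
    method: Sequence[float],
    baseline: Sequence[float],
    higher_is_better: Sequence[bool],
) -> float:
    """Average relative per-metric performance drop (in percent) versus a baseline.

    method[k]           -> M_{m,k}
    baseline[k]         -> M_{b,k}
    higher_is_better[k] -> l_k = 1 if True, l_k = 0 otherwise
    """
    if not (len(method) == len(baseline) == len(higher_is_better)):
        raise ValueError("all inputs must have the same length K")
    if len(method) == 0:
        raise ValueError("at least one metric is required")

    total = 0.0
    for m_mk, m_bk, hib in zip(method, baseline, higher_is_better):
        sign = -1.0 if hib else 1.0  # (-1)^{l_k}
        total += sign * (m_mk - m_bk) / m_bk
    return 100.0 * total / len(method)
```

For example, with two hypothetical metrics, `delta_m_percent([0.9, 0.2], [1.0, 0.1], [True, False])` averages a 10% drop on a higher-is-better metric and a 100% degradation on a lower-is-better metric, giving a value of about 55. A negative value would indicate that the method improves on the baseline on average.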

Generalized smoothness illustration. To illustrate the relation between gradient norms and local smoothness, we run SGD on each task separately without the warm start process. Since there is no weight update process, we only need to choose $\alpha=0.0005$ for both tasks.

Appendix B Algorithm

We show our GSMGrad with Fast Approximation (GSMGrad-FA):

Algorithm 4 GSMGrad with Fast Approximation (GSMGrad-FA)
1:  Initialize: model parameters $x_{0}$, weights $w_{0}$ and a constant $\rho$
2:  $w_{0}$ = Warm-start($w_{0}$, $x_{0}$, $\rho$)
3:  for $t=0,1,\dots,T-1$ do
4:     $x_{t+1}=x_{t}-\alpha\nabla F(x_{t})w_{t}$
5:     Update $w_{t}$ according to eq. 6
6:  end for

Appendix C Detailed Proofs for Average CA Distance

C.1 Formal version and proof of Theorem 1

Let $c_{1}>0$, $c_{2}>0$, $c_{3}\geq 0$ and $F>0$ be some constants such that

\[
\Delta+c_{1}+c_{2}+c_{3}\leq F. \tag{8}
\]

Define $M=\sup\{z\geq 0\mid\varphi(z)\leq F\}$. We then have the following convergence rate for Algorithm 1 without warm start:

Theorem 7.

Suppose Assumptions 1 and 2 are satisfied, and we choose constant step sizes such that $\beta\leq\frac{1}{4KM^{2}}$, $\alpha\leq\min\left(c_{1}\beta,\frac{1}{2\ell(M+1)},\frac{1}{M\ell(M+1)}\right)$, $T\geq\max\left(\frac{10\Delta}{\alpha\epsilon^{2}},\frac{10}{\epsilon^{2}\beta}\right)\sim\Theta(\epsilon^{-2})$, and $\rho\leq\min\left(\frac{\epsilon^{2}}{20},\sqrt{\frac{\epsilon^{2}}{10\beta}},\frac{c_{2}}{2T\alpha},\sqrt{\frac{c_{3}}{T\alpha\beta}}\right)\sim\mathcal{O}(\epsilon^{2})$. We have that

\[
\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}\leq\epsilon^{2}. \tag{9}
\]
Proof.

Compared with standard $L$-smoothness, generalized smoothness is more challenging to handle because the Lipschitz constant of the gradient can be unbounded. Lemma 2 demonstrates that a bounded function value implies a bounded gradient norm, which further implies a bounded Lipschitz constant. In the following, we resolve the unbounded Lipschitz constant problem by showing that the function value stays bounded with the parameters selected in Theorem 1. We prove by induction that for any $i\in[K]$ and $t\leq T$ we have $f_{i}(x_{t})-f_{i}^{*}\leq F$.

Base Case: since all $c_{1},c_{2},c_{3},M$ are non-negative, according to (8) we have that $f_{i}(x_{0})-f_{i}^{*}\leq\Delta\leq F$ holds for any $i\in[K]$.

Induction step: assume that for any $i\in[K]$ and $t\leq k<T$, we have that $f_{i}(x_{t})-f_{i}^{*}\leq F$ holds. We then prove that $f_{i}(x_{k+1})-f_{i}^{*}\leq F$ holds for any $i\in[K]$.

For $f_{i}(x_{t})-f_{i}^{*}\leq F$, based on the monotonicity shown in Lemma 2, we have that $\|\nabla f_{i}(x_{t})\|\leq M$. From Assumption 2, we have that $f_{i}(x)$ is $\left(\frac{1}{\ell(\|\nabla f_{i}(x)\|+1)},\ell(\|\nabla f_{i}(x)\|+1)\right)$-smooth by setting $a=1$. For any $t\leq k$, we have that

\[
\|x_{t+1}-x_{t}\|=\alpha\|\nabla F(x_{t})w_{t}\|\leq\alpha M\leq\frac{1}{\ell(M+1)}\leq\frac{1}{\ell(\|\nabla f_{i}(x_{t})\|+1)},
\]

where the second inequality is due to $\alpha\leq\frac{1}{M\ell(M+1)}$ and the last inequality is due to $\|\nabla f_{i}(x_{t})\|\leq M$ together with the fact that $\ell$ is non-decreasing. Based on Assumption 2, Definition 2 and Lemma 3.3 in [21], we have the following descent lemma:

\[
\begin{aligned}
f_{i}(x_{t+1})&\leq f_{i}(x_{t})-\alpha\langle\nabla f_{i}(x_{t}),\nabla F(x_{t})w_{t}\rangle+\frac{\ell(\|\nabla f_{i}(x_{t})\|+1)}{2}\alpha^{2}\|\nabla F(x_{t})w_{t}\|^{2}\\
&\leq f_{i}(x_{t})-\alpha\langle\nabla f_{i}(x_{t}),\nabla F(x_{t})w_{t}\rangle+\frac{\ell(M+1)}{2}\alpha^{2}\|\nabla F(x_{t})w_{t}\|^{2}.
\end{aligned}
\]

As a result, for any $w\in\mathcal{W}$, we have that

\[
F(x_{t+1})w\leq F(x_{t})w-\alpha\langle\nabla F(x_{t})w,\nabla F(x_{t})w_{t}\rangle+\frac{\ell(M+1)}{2}\alpha^{2}\|\nabla F(x_{t})w_{t}\|^{2}. \tag{10}
\]

Based on the update process of $w$, we have that

\[
w_{t+1}=\Pi_{\mathcal{W}}\Big(w_{t}-\beta\Big(\nabla F(x_{t})^{\top}\nabla F(x_{t})w_{t}+\rho w_{t}\Big)\Big).
\]

It then follows that

\[
\begin{aligned}
\|w_{t+1}-w\|^{2}
&=\Big\|\Pi_{\mathcal{W}}\Big(w_{t}-\beta\Big(\nabla F(x_{t})^{\top}\nabla F(x_{t})w_{t}+\rho w_{t}\Big)\Big)-w\Big\|^{2}\\
&\leq\Big\|\Big(w_{t}-\beta\Big(\nabla F(x_{t})^{\top}\nabla F(x_{t})w_{t}+\rho w_{t}\Big)\Big)-w\Big\|^{2}\\
&=\|w_{t}-w\|^{2}-2\beta\left\langle w_{t}-w,(\nabla F(x_{t})^{\top}\nabla F(x_{t})+\rho I)w_{t}\right\rangle\\
&\quad+\beta^{2}\left\|(\nabla F(x_{t})^{\top}\nabla F(x_{t})+\rho I)w_{t}\right\|^{2},
\end{aligned}
\]

where the inequality is due to the non-expansiveness of projection. By rearranging the above inequality, we have that

\[
\begin{aligned}
&\langle w_{t}-w,\nabla F(x_{t})^{\top}\nabla F(x_{t})w_{t}\rangle\\
&\leq\frac{1}{2\beta}\left(\|w_{t}-w\|^{2}-\|w_{t+1}-w\|^{2}\right)+2\rho+\beta KM^{2}\|\nabla F(x_{t})w_{t}\|^{2}+\beta\rho^{2}.
\end{aligned}\tag{11}
\]
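The two updates analyzed so far, the step on $x$ along $\nabla F(x_{t})w_{t}$ and the projected step on $w$, can be sketched numerically as follows. This is only an illustration under stated assumptions: $\mathcal{W}$ is taken to be the probability simplex, the projection uses a standard sort-based routine, and the two quadratic objectives, the step sizes, and all function names (`project_simplex`, `gsmgrad_step`, `run_toy`) are hypothetical choices that do not come from the paper or satisfy the constants of the theorem.

```python
import numpy as np


def project_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - 1.0                  # shifted cumulative sums
    idx = np.arange(1, v.size + 1)
    k = idx[u - css / idx > 0][-1]            # size of the support of the projection
    tau = css[k - 1] / k                      # common shift applied to the support
    return np.maximum(v - tau, 0.0)


def gsmgrad_step(x, w, jac, alpha, beta, rho):
    """One iteration of the two updates, both evaluated at (x_t, w_t).

    x_{t+1} = x_t - alpha * G w_t
    w_{t+1} = Proj_W(w_t - beta * (G^T G w_t + rho * w_t)),  G = grad F(x_t)
    """
    G = jac(x)                                # d x K matrix, columns are task gradients
    d = G @ w                                 # update direction grad F(x_t) w_t
    x_next = x - alpha * d
    w_next = project_simplex(w - beta * (G.T @ d + rho * w))
    return x_next, w_next


def run_toy(T=200, alpha=0.1, beta=0.1, rho=0.01):
    """Hypothetical toy problem: f_1 = 0.5*||x - a||^2 and f_2 = 0.5*||x - b||^2."""
    a, b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])

    def jac(x):
        return np.column_stack([x - a, x - b])

    x, w = np.array([0.0, 2.0]), np.array([0.9, 0.1])
    norms = []                                # ||grad F(x_t) w_t|| along the run
    for _ in range(T):
        norms.append(np.linalg.norm(jac(x) @ w))
        x, w = gsmgrad_step(x, w, jac, alpha, beta, rho)
    return x, w, norms
```

Calling `run_toy()` returns the final iterate, the final weights, and the history of $\|\nabla F(x_{t})w_{t}\|$, which can be inspected to see the quantity bounded in (9) shrink on this toy problem. The projection keeps $w_{t}$ feasible at every step, which is what allows the non-expansiveness argument above to be applied with any comparison point $w\in\mathcal{W}$.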

Plugging (11) into (10), we can show that

\[
\begin{aligned}
&F(x_{t+1})w-F(x_{t})w\\
&\leq-\alpha\|\nabla F(x_{t})w_{t}\|^{2}+\frac{\ell(M+1)}{2}\alpha^{2}\|\nabla F(x_{t})w_{t}\|^{2}\\
&\quad+\frac{\alpha}{2\beta}\left(\|w_{t}-w\|^{2}-\|w_{t+1}-w\|^{2}\right)+\alpha\beta KM^{2}\|\nabla F(x_{t})w_{t}\|^{2}+\alpha\beta\rho^{2}+2\alpha\rho.
\end{aligned}\tag{12}
\]

Summing (12) from $t=0$ to $k$, for any $w\in\mathcal{W}$ we have that

\[
\begin{aligned}
&F(x_{k+1})w-F(x_{0})w\\
&\leq-\sum_{t=0}^{k}\alpha\|\nabla F(x_{t})w_{t}\|^{2}+\sum_{t=0}^{k}\left(\frac{\ell(M+1)}{2}\alpha^{2}+\alpha\beta KM^{2}\right)\|\nabla F(x_{t})w_{t}\|^{2}\\
&\quad+\frac{\alpha}{2\beta}\|w_{0}-w\|^{2}+T\alpha\beta\rho^{2}+2T\alpha\rho\\
&\leq\frac{\alpha}{2\beta}\|w_{0}-w\|^{2}+T\alpha\beta\rho^{2}+2T\alpha\rho,
\end{aligned}\tag{13}
\]

where the first inequality is due to $k<T$ and the last inequality holds because $\alpha\leq\frac{1}{2\ell(M+1)}$ and $\beta KM^{2}\leq\frac{1}{4}$. Thus for any $i\in[K]$ it can be shown that

\[
f_{i}(x_{k+1})-f_{i}^{*}\leq f_{i}(x_{0})-f_{i}^{*}+\frac{\alpha}{\beta}+T\alpha\beta\rho^{2}+2T\alpha\rho\leq F,
\]

since we have that $\frac{\alpha}{\beta}\leq c_{1}$, $T\alpha\beta\rho^{2}\leq c_{3}$ and $2T\alpha\rho\leq c_{2}$. This completes the induction step, so $f_{i}(x_{k+1})-f_{i}^{*}\leq F$ and (13) hold for all $k<T$ and $i\in[K]$.

Specifically, for $\alpha\leq\frac{1}{2\ell(M+1)}$ and $\beta\leq\frac{1}{4KM^{2}}$, according to (13) with $k=T-1$ we have that

\[
\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}\leq\frac{2(F(x_{0})w-F^{*}w)}{\alpha T}+\frac{2}{\beta T}+2\beta\rho^{2}+4\rho\leq\epsilon^{2},
\]

which completes our proof. ∎
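As a hedged term-by-term check of the last display, the following sketch assumes that $\mathcal{W}$ is the probability simplex, so that $\|w_{0}-w\|^{2}\leq 2$ and $F(x_{0})w-F^{*}w\leq\Delta$, with $F^{*}$ denoting the vector of optimal values $f_{i}^{*}$; these assumptions are not restated in this appendix. The step-size conditions first make the coefficient of $\sum_{t}\|\nabla F(x_{t})w_{t}\|^{2}$ in (13) negative, and each of the four remaining terms is then at most $\epsilon^{2}/5$:

\[
\begin{aligned}
-\alpha+\frac{\ell(M+1)}{2}\alpha^{2}+\alpha\beta KM^{2}&\leq-\alpha+\frac{\alpha}{4}+\frac{\alpha}{4}=-\frac{\alpha}{2},\\
\frac{2\Delta}{\alpha T}\leq\frac{\epsilon^{2}}{5},\qquad\frac{2}{\beta T}&\leq\frac{\epsilon^{2}}{5},\qquad 2\beta\rho^{2}\leq\frac{\epsilon^{2}}{5},\qquad 4\rho\leq\frac{\epsilon^{2}}{5}.
\end{aligned}
\]

The four bounds in the second line follow from $T\geq\frac{10\Delta}{\alpha\epsilon^{2}}$, $T\geq\frac{10}{\epsilon^{2}\beta}$, $\rho\leq\sqrt{\frac{\epsilon^{2}}{10\beta}}$ and $\rho\leq\frac{\epsilon^{2}}{20}$, respectively, and their sum is $\frac{4}{5}\epsilon^{2}\leq\epsilon^{2}$.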

Lemma 2.

(Lemma 3.5 in [21]) If a function $f$ is $\ell$-smooth, we have that

\[
\varphi(\|\nabla f(x)\|)=\frac{\|\nabla f(x)\|^{2}}{2\ell(2\|\nabla f(x)\|)}\leq f(x)-f^{*}. \tag{14}
\]
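To make the constant $M=\sup\{z\geq 0\mid\varphi(z)\leq F\}$ concrete, the sketch below evaluates it for one illustrative choice, $\ell(u)=L_{0}+L_{1}u$. This affine form, the parameter values, and the function names are assumptions made only for this example. In this case $\varphi(z)\leq F$ is equivalent to $z^{2}-4L_{1}Fz-2L_{0}F\leq 0$, which gives a closed form that is cross-checked here by bisection.

```python
import math


def ell(u: float, L0: float, L1: float) -> float:
    """Illustrative smoothness function ell(u) = L0 + L1 * u (an assumption)."""
    return L0 + L1 * u


def phi(z: float, L0: float, L1: float) -> float:
    """phi(z) = z^2 / (2 * ell(2 z)), as in Lemma 2."""
    return z * z / (2.0 * ell(2.0 * z, L0, L1))


def gradient_bound_closed_form(F: float, L0: float, L1: float) -> float:
    """Largest z with phi(z) <= F, i.e. the positive root of z^2 - 4 L1 F z - 2 L0 F."""
    return 2.0 * L1 * F + math.sqrt(4.0 * L1 ** 2 * F ** 2 + 2.0 * L0 * F)


def gradient_bound_bisection(F: float, L0: float, L1: float, tol: float = 1e-10) -> float:
    """Same quantity by bisection; relies on phi being increasing for this ell."""
    hi = 1.0
    while phi(hi, L0, L1) <= F:               # grow the bracket until phi(hi) > F
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid, L0, L1) <= F:
            lo = mid
        else:
            hi = mid
    return lo
```

With $L_{1}=0$ the closed form reduces to $M=\sqrt{2L_{0}F}$, which matches the familiar bound $\|\nabla f(x)\|^{2}\leq 2L(f(x)-f^{*})$ for $L$-smooth functions. For $L_{1}>0$ the bound grows linearly in $F$, which is one way to see why controlling the function value is the key step in the induction.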

C.2 Proof of Corollary 1

Proof.

Recall that

\[
\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|^{2}\leq\|\nabla F(x_{t})w_{t}\|^{2}-\|\nabla F(x_{t})w_{t}^{*}\|^{2}\leq\|\nabla F(x_{t})w_{t}\|^{2},
\]

where the first inequality follows from the optimality condition. Then we have

\[
\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|^{2}\leq\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}=\mathcal{O}(\epsilon^{2}),
\]

where we follow the same setting in Theorem 1. The proof is complete. ∎
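The first inequality in this proof can be unpacked as follows. This is a sketch under the assumption, not restated in this appendix, that $w_{t}^{*}\in\arg\min_{w\in\mathcal{W}}\frac{1}{2}\|\nabla F(x_{t})w\|^{2}$ over a convex set $\mathcal{W}$. First-order optimality at $w_{t}^{*}$, tested against the feasible point $w_{t}\in\mathcal{W}$, gives

\[
\langle\nabla F(x_{t})w_{t}^{*},\nabla F(x_{t})(w_{t}-w_{t}^{*})\rangle\geq 0
\quad\Longrightarrow\quad
\langle\nabla F(x_{t})w_{t},\nabla F(x_{t})w_{t}^{*}\rangle\geq\|\nabla F(x_{t})w_{t}^{*}\|^{2},
\]

and expanding the square then yields

\[
\begin{aligned}
\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|^{2}
&=\|\nabla F(x_{t})w_{t}\|^{2}-2\langle\nabla F(x_{t})w_{t},\nabla F(x_{t})w_{t}^{*}\rangle+\|\nabla F(x_{t})w_{t}^{*}\|^{2}\\
&\leq\|\nabla F(x_{t})w_{t}\|^{2}-\|\nabla F(x_{t})w_{t}^{*}\|^{2}.
\end{aligned}
\]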

C.3 Formal version and proof of Theorem 2

Let c1>0,c2>0,c3>0,c4>0,c5>0,c6>0formulae-sequencesubscript𝑐10formulae-sequencesubscript𝑐20formulae-sequencesubscript𝑐30formulae-sequencesubscript𝑐40formulae-sequencesubscript𝑐50subscript𝑐60c_{1}>0,c_{2}>0,c_{3}>0,c_{4}>0,c_{5}>0,c_{6}>0italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT > 0 , italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT > 0 , italic_c start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT > 0 , italic_c start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT > 0 , italic_c start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT > 0 , italic_c start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT > 0 be some constants. Let F>0𝐹0F>0italic_F > 0 and 0<δ120𝛿120<\delta\leq\frac{1}{2}0 < italic_δ ≤ divide start_ARG 1 end_ARG start_ARG 2 end_ARG be some constants such that

$$F\geq\frac{8(\Delta+c_{1}+c_{2}+c_{3}+c_{4}+c_{5}+c_{6})}{\delta},$$

where $\Delta=\max_{i\in[d]}\{f_{i}(x_{0})-f^{*}\}$. Let $L_{0}>0,L_{1}>0,b_{1}>0,b_{2}>0,b_{3}>0$ be some constants.

For

$$\alpha\leq\min\left\{\frac{\ell(M+1)}{2\max(1,M)},\; c_{1}\beta,\; \rho\min\left\{\frac{1}{4L_{0}^{2}\ell(M+1)^{2}},\frac{b_{1}^{2}}{(3M\sqrt{K}L_{0}+M\sqrt{K}L_{1})^{2}},\frac{b_{2}}{\ell(M+1)L_{0}^{2}}\right\},\; \frac{c_{2}}{\sqrt{2T}M(3\sigma+\sigma^{2})},\; \frac{\sqrt{c_{3}}}{\sqrt{\ell(M+1)\sigma^{2}T}},\; \frac{\delta\epsilon^{2}}{48\sigma^{2}\ell(M+1)}\right\},$$

$$\beta\leq\min\left\{\frac{c_{4}}{4K\alpha T(M^{2}+\sigma^{2})^{2}},\; \frac{\delta\epsilon^{2}}{192K(M^{2}+\sigma^{2})^{2}},\; \frac{b_{3}\rho}{8KM^{2}L_{0}^{2}+4KL_{1}^{2}}\right\},$$

$$\rho\leq\min\left\{\frac{c_{5}}{\alpha T},\; \frac{\sqrt{c_{6}}}{\sqrt{\alpha\beta T}},\; \frac{\delta\epsilon^{2}}{48},\; \sqrt{\frac{\delta\epsilon^{2}}{48\beta}},\; \frac{\delta L_{0}^{2}}{24K\sigma^{2}\alpha T},\; \frac{\delta L_{1}^{2}}{8K^{2}\sigma^{4}\alpha T}\right\},$$

and

$$T\geq\max\left\{\frac{48\Delta+48c_{1}}{\delta\alpha\epsilon^{2}},\; \frac{4608M^{2}(3\sigma+\sigma^{2})^{2}}{\delta^{2}\epsilon^{4}}\right\}$$

such that

$$b_{1}+b_{2}+b_{3}+c_{1}+\alpha\rho(1+\beta\rho)+4\alpha\beta KM^{4}\leq\frac{F}{2}. \tag{15}$$

Define the following random variables

$$\begin{aligned}
\tau_{1}&=\min\{t \mid \exists i\in[K],\ f_{i}(x_{t+1})-f_{i}^{*}>F\}\wedge T,\\
\tau_{2}&=\min\Big\{t \;\Big|\; \exists i\in[K],\ j\in[3],\ \|\varepsilon_{t,j,i}\|>\frac{L_{0}}{\sqrt{\alpha\rho}}\Big\}\wedge T,\\
\tau_{3}&=\min\Big\{t \;\Big|\; \exists i,j\in[K],\ \|\varepsilon_{t,2,i}\|\|\varepsilon_{t,3,j}\|>\frac{L_{1}}{\sqrt{\alpha\rho}}\Big\}\wedge T,\\
\tau&=\min\{\tau_{1},\tau_{2},\tau_{3}\}.
\end{aligned}$$
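To make the role of these stopping times concrete, the display below unpacks the event $\{\tau=T\}$. It is only a restatement of the definitions above, under the assumed (standard) conventions that $t$ ranges over nonnegative integers and that the minimum over an empty set is $+\infty$, so that $\wedge\, T$ caps each stopping time at $T$.

```latex
\{\tau = T\}
\;=\;
\Big\{ \forall\, t < T:\quad
  f_i(x_{t+1}) - f_i^* \le F \ \ \forall i \in [K],\quad
  \|\varepsilon_{t,j,i}\| \le \tfrac{L_0}{\sqrt{\alpha\rho}} \ \ \forall i \in [K],\ j \in [3],\quad
  \|\varepsilon_{t,2,i}\|\,\|\varepsilon_{t,3,j}\| \le \tfrac{L_1}{\sqrt{\alpha\rho}} \ \ \forall i, j \in [K]
\Big\}.
```

In words, on $\{\tau=T\}$ the function-value gaps and the noise terms stay bounded along the whole trajectory, which is the regime in which the proof below applies the per-step bound from Lemma 1.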

We then have the following theorem:

Theorem 8.

If Assumptions 1, 2 and 3 hold, with the parameters selected above, we have that

$$\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}\leq\epsilon^{2},$$

with probability at least $1-\delta$.

Proof.

Small probability of the event $\{\tau<T\}$.

We first show that the probability of the event $\{\tau<T\}$ is small: $\mathbb{P}(\tau<T)\leq\delta$. Note that

$$\{\tau<T\}=\{\tau_{2}<T\}\cup\{\tau_{3}<T\}\cup\{\tau_{1}<T,\tau_{2}=T,\tau_{3}=T\}.$$

For any $i\in[K]$, $j\in[3]$, we have that

$$\mathbb{P}\left(\|\varepsilon_{t,j,i}\|>\frac{L_{0}}{\sqrt{\alpha\rho}}\right)=\mathbb{P}\left(\|\varepsilon_{t,j,i}\|^{2}>\frac{L_{0}^{2}}{\alpha\rho}\right)\leq\frac{\sigma^{2}\alpha\rho}{L_{0}^{2}},$$

where the last inequality is due to Chebyshev’s inequality. Based on the union bound, we have that

$$\mathbb{P}(\{\tau_{2}<T\})\leq\sum_{t=0}^{T-1}\sum_{i=1}^{K}\sum_{j=1}^{3}\mathbb{P}\left(\|\varepsilon_{t,j,i}\|>\frac{L_{0}}{\sqrt{\alpha\rho}}\right)\leq\frac{3K\sigma^{2}\alpha\rho T}{L_{0}^{2}}\leq\frac{\delta}{8} \tag{16}$$

since $\rho\leq\frac{\delta L_{0}^{2}}{24K\sigma^{2}\alpha T}$. Similarly, we have that

$$\mathbb{P}\left(\|\varepsilon_{t,2,i}\|\|\varepsilon_{t,3,j}\|>\frac{L_{1}}{\sqrt{\alpha\rho}}\right)=\mathbb{P}\left(\|\varepsilon_{t,2,i}\|^{2}\|\varepsilon_{t,3,j}\|^{2}>\frac{L_{1}^{2}}{\alpha\rho}\right)\leq\frac{\sigma^{4}\alpha\rho}{L_{1}^{2}}.$$

It follows that

$$\mathbb{P}(\{\tau_{3}<T\})\leq\sum_{t=0}^{T-1}\sum_{i=1}^{K}\sum_{j=1}^{K}\mathbb{P}\left(\|\varepsilon_{t,2,i}\|\|\varepsilon_{t,3,j}\|>\frac{L_{1}}{\sqrt{\alpha\rho}}\right)\leq\frac{K^{2}\sigma^{4}\alpha\rho T}{L_{1}^{2}}\leq\frac{\delta}{8}. \tag{17}$$
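The last inequalities in (16) and (17) follow by substituting the two upper bounds on $\rho$ from the parameter selection. As a short worked check, using only quantities defined above (the union bound in (16) has $3KT$ terms and the one in (17) has $K^{2}T$ terms):

```latex
\frac{3K\sigma^{2}\alpha\rho T}{L_{0}^{2}}
  \;\le\; \frac{3K\sigma^{2}\alpha T}{L_{0}^{2}} \cdot \frac{\delta L_{0}^{2}}{24K\sigma^{2}\alpha T}
  \;=\; \frac{\delta}{8},
\qquad
\frac{K^{2}\sigma^{4}\alpha\rho T}{L_{1}^{2}}
  \;\le\; \frac{K^{2}\sigma^{4}\alpha T}{L_{1}^{2}} \cdot \frac{\delta L_{1}^{2}}{8K^{2}\sigma^{4}\alpha T}
  \;=\; \frac{\delta}{8}.
```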

We then bound the probability of the event $\{\tau_{1}<T,\tau_{2}=T,\tau_{3}=T\}$. Since $\tau=\tau_{1}<T$, we have that for some $i\in[K]$, $f_{i}(x_{\tau+1})-f_{i}^{*}>F$.

According to (C.4) shown in Lemma 1, for any $i\in[K]$ and $t=\tau$ we have that

$$\begin{aligned}
f_{i}(x_{\tau+1})-f_{i}(x_{\tau})
&\leq\alpha\|\nabla F(x_{\tau})w\|\|\varepsilon_{t,1}w_{t}\|+\ell(M+1)\alpha^{2}\|\varepsilon_{t,1}w_{t}\|^{2}+\frac{\alpha}{\beta}+\alpha\rho+\alpha\beta\rho^{2}\\
&\quad+\alpha M\|\varepsilon_{t,2}\|+\alpha M\|\varepsilon_{t,3}\|+\alpha\|\varepsilon_{t,2}^{\top}\varepsilon_{t,3}w_{t}\|\\
&\quad+4\alpha\beta KM^{4}+4\alpha\beta M^{2}\|\varepsilon_{t,2}\|^{2}+4\alpha\beta M^{2}\|\varepsilon_{t,3}\|^{2}+4\alpha\beta\|\varepsilon_{t,2}^{\top}\varepsilon_{t,3}w_{t}\|^{2}\\
&\leq\alpha M\frac{L_{0}}{\sqrt{\alpha\rho}}+\ell(M+1)\alpha\frac{L_{0}^{2}}{\rho}+\frac{\alpha}{\beta}+\alpha\rho+\alpha\beta\rho^{2}\\
&\quad+\alpha M\frac{\sqrt{K}L_{0}}{\sqrt{\alpha\rho}}+\alpha M\frac{\sqrt{K}L_{0}}{\sqrt{\alpha\rho}}+\alpha M\frac{\sqrt{K}L_{1}}{\sqrt{\alpha\rho}}\\
&\quad+4\alpha\beta KM^{4}+4\alpha\beta M^{2}\frac{KL_{0}^{2}}{\alpha\rho}+4\alpha\beta M^{2}\frac{KL_{0}^{2}}{\alpha\rho}+4\alpha\beta\frac{KL_{1}^{2}}{\alpha\rho}\\
&\leq b_{1}+b_{2}+b_{3}+c_{1}+\alpha\rho(1+\beta\rho)+4\alpha\beta KM^{4}\\
&\leq\frac{F}{2},
\end{aligned}$$

where the second inequality holds because $\tau_{2}=\tau_{3}=T$, the third is due to $\frac{\beta}{\rho}\leq\frac{b_{3}}{8KM^{2}L_{0}^{2}+4KL_{1}^{2}}$ and $\frac{\alpha}{\rho}\leq\min\left\{\frac{b_{1}^{2}}{(3M\sqrt{K}L_{0}+M\sqrt{K}L_{1})^{2}},\frac{b_{2}}{\ell(M+1)L_{0}^{2}}\right\}$, and the last follows from (15).
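The step that collects the noise-dependent terms into $b_{1}$, $b_{2}$, $b_{3}$ and $c_{1}$ can be spelled out as follows. This is a sketch rather than part of the original argument: it groups the terms by their dependence on $\alpha/\rho$ and $\beta/\rho$, and it assumes $K\geq 1$ so that $L_{0}\leq\sqrt{K}L_{0}$.

```latex
\begin{aligned}
% terms of order \sqrt{\alpha/\rho}
\alpha M\frac{L_0}{\sqrt{\alpha\rho}} + 2\,\alpha M\frac{\sqrt{K}L_0}{\sqrt{\alpha\rho}} + \alpha M\frac{\sqrt{K}L_1}{\sqrt{\alpha\rho}}
  &\le \sqrt{\frac{\alpha}{\rho}}\,\bigl(3M\sqrt{K}L_0 + M\sqrt{K}L_1\bigr) \le b_1, \\
% term of order \alpha/\rho
\ell(M+1)\,\alpha\,\frac{L_0^2}{\rho}
  &\le b_2, \\
% terms of order \beta/\rho
2\cdot 4\alpha\beta M^2\frac{KL_0^2}{\alpha\rho} + 4\alpha\beta\frac{KL_1^2}{\alpha\rho}
  &= \frac{\beta}{\rho}\,\bigl(8KM^2L_0^2 + 4KL_1^2\bigr) \le b_3, \\
% remaining terms
\frac{\alpha}{\beta} \le c_1, \qquad
\alpha\rho + \alpha\beta\rho^2 &= \alpha\rho\,(1+\beta\rho).
\end{aligned}
```

The bound $\frac{\alpha}{\beta}\leq c_{1}$ uses $\alpha\leq c_{1}\beta$ from the parameter selection; the term $4\alpha\beta KM^{4}$ is carried through unchanged.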

Recall, however, that for some $i\in[K]$, $f_{i}(x_{\tau+1})-f_{i}^{*}>F$. Combining this with the bound above, for this task we have that

$$f_{i}(x_{\tau})-f_{i}^{*}>\frac{F}{2}.$$

According to Lemma 1, we have that

$$\mathbb{E}[f_{i}(x_{\tau})-f_{i}^{*}]\leq\frac{\delta F}{8}.$$

By Markov's inequality, it follows that

$$\mathbb{P}\left(f_{i}(x_{\tau})-f_{i}^{*}>\frac{F}{2}\right)\leq\frac{\mathbb{E}[f_{i}(x_{\tau})-f_{i}^{*}]}{F/2}\leq\frac{\delta}{4}, \tag{18}$$

which indicates that $\mathbb{P}(\tau_{1}<T,\tau_{2}=T,\tau_{3}=T)\leq\frac{\delta}{4}$. It follows that $\mathbb{P}(\tau<T)\leq\frac{\delta}{2}$.
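For completeness, the final bound is a union bound over the three-event decomposition of $\{\tau<T\}$ given earlier, combined with (16), (17) and (18):

```latex
\mathbb{P}(\tau<T)
  \;\le\; \mathbb{P}(\tau_2<T) + \mathbb{P}(\tau_3<T) + \mathbb{P}(\tau_1<T,\ \tau_2=T,\ \tau_3=T)
  \;\le\; \frac{\delta}{8} + \frac{\delta}{8} + \frac{\delta}{4}
  \;=\; \frac{\delta}{2}.
```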

Convergence of $\frac{1}{T}\mathbb{E}\left[\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}\,\Big|\,\tau=T\right]$. Based on (C.4) in Lemma 1, we have that

$$\begin{aligned}
\frac{1}{T}\mathbb{E}\left[\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}\,\Big|\,\tau=T\right]
&\leq\frac{1}{T}\frac{1}{\mathbb{P}(\tau=T)}\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\nabla F(x_{t})w_{t}\|^{2}\right]\\
&\leq\frac{4F(x_{0})w-4F^{*}w+\frac{4\alpha}{\beta}}{\alpha T}+\frac{4\sqrt{2}M(3\sigma+\sigma^{2})}{\sqrt{T}}+4\alpha\sigma^{2}\ell(M+1)+4\rho\\
&\quad+4\beta\rho^{2}+16\beta KM^{4}+32\beta KM^{2}\sigma^{2}+16\beta K\sigma^{4}\\
&\leq\frac{\delta}{2}\epsilon^{2},
\end{aligned}$$

where the second inequality is due to $\delta<\frac{1}{2}$ and the last inequality is due to our selection of parameters. As a result, we have that

\[
\mathbb{P}\left(\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t\|^2>\epsilon^2\,\Big|\,\tau=T\right)\leq\frac{\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t\|^2\,\Big|\,\tau=T\right]}{\epsilon^2}\leq\frac{\delta}{2}, \tag{19}
\]

where the first inequality follows from Markov's inequality. Thus we have that

\begin{align*}
&\mathbb{P}\left(\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t\|^2\leq\epsilon^2\right) \\
&\geq 1-\mathbb{P}\left(\tau<T\right)-\mathbb{P}\left(\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t\|^2>\epsilon^2\,\Big|\,\tau=T\right)\mathbb{P}\left(\tau=T\right) \\
&\geq 1-\delta,
\end{align*}

where the last inequality is due to (16), (17), (18), and (19). This completes the proof. ∎
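
The final chain can be expanded as follows. This is a sketch under two assumptions that are not restated in this part of the proof: that $\tau\leq T$ always holds, and that the earlier bounds give $\mathbb{P}(\tau<T)\leq\frac{\delta}{2}$. The shorthand $X_T:=\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_t)w_t\|^2$ is introduced here only for readability.

\begin{align*}
\mathbb{P}\left(X_T\leq\epsilon^2\right)
&\geq \mathbb{P}\left(X_T\leq\epsilon^2,\ \tau=T\right) \\
&= \mathbb{P}(\tau=T)-\mathbb{P}\left(X_T>\epsilon^2,\ \tau=T\right) \\
&= 1-\mathbb{P}(\tau<T)-\mathbb{P}\left(X_T>\epsilon^2\,\Big|\,\tau=T\right)\mathbb{P}(\tau=T) \\
&\geq 1-\frac{\delta}{2}-\frac{\delta}{2}\cdot 1 = 1-\delta.
\end{align*}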

C.4 Proof of Lemma 1

Proof.

For all $i\in[K]$ and $t\leq\tau$, we have $f_i(x_t)-f_i^*\leq F$, which further implies that $\|\nabla f_i(x_t)\|\leq M$. Moreover, we have that for any $t\leq\tau$ and $i\in[K]$,

\[
\|x_{t+1}-x_t\|\leq\alpha\|\nabla G_1(x_t)w_t\|\leq\alpha\left(\|\nabla F(x_t)w_t\|+\|\varepsilon_{t,1}w_t\|\right)\leq\alpha\left(M+\frac{L_0}{\sqrt{\alpha\rho}}\right)\leq\frac{1}{\ell(M+1)}.
\]

Since $f_i(x)$ is $\left(\frac{1}{\ell(\|\nabla f_i(x)\|+1)},\ell(\|\nabla f_i(x)\|+1)\right)$-smooth, it follows that

\[
f_i(x_{t+1})-f_i(x_t)\leq-\alpha\langle\nabla f_i(x_t),\nabla G_1(x_t)w_t\rangle+\frac{\ell(\|\nabla f_i(x_t)\|+1)}{2}\alpha^2\|\nabla G_1(x_t)w_t\|^2.
\]
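
As a numerical illustration of this local descent bound, the sketch below checks it for a one-dimensional toy function. Everything in it is an assumption made for illustration and does not come from the paper: the function $f(x)=e^x$, the choice $\ell(u)=u$ (for which $|f''(x)|\leq\ell(|f'(x)|)$ holds with equality), and the helper names `ell`, `descent_gap`, and `check_local_bound`.

```python
import math

def ell(u: float) -> float:
    """Illustrative choice ell(u) = u, matching f(x) = exp(x)."""
    return u

def descent_gap(x: float, step: float) -> float:
    """Right-hand side minus left-hand side of the local quadratic upper bound."""
    grad = math.exp(x)                 # f'(x) for f = exp
    L = ell(abs(grad) + 1.0)           # local smoothness constant ell(|f'(x)| + 1)
    lhs = math.exp(x + step)           # f(x + step)
    rhs = math.exp(x) + grad * step + 0.5 * L * step ** 2
    return rhs - lhs

def check_local_bound(xs, n_steps: int = 50) -> bool:
    """Check the bound on a grid of displacements inside the radius 1 / ell(|f'(x)| + 1)."""
    for x in xs:
        grad = math.exp(x)
        radius = 1.0 / ell(abs(grad) + 1.0)
        for k in range(-n_steps, n_steps + 1):
            step = radius * k / n_steps
            # Small relative tolerance to absorb floating-point rounding.
            if descent_gap(x, step) < -1e-12 * max(1.0, grad):
                return False
    return True

print(check_local_bound([-3.0, -1.0, 0.0, 1.0, 3.0, 5.0]))
```

Inside the radius $1/\ell(|f'(x)|+1)$ the gap stays non-negative on the tested grid, while a displacement far outside the radius (for example `descent_gap(0.0, 3.0)`) violates the bound, which is why the step-size condition on $\|x_{t+1}-x_t\|$ is verified before the smoothness inequality is applied.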

As a result, for any $w\in\mathcal{W}$, we have that

\begin{align*}
F(x_{t+1})w &\leq F(x_t)w-\alpha\langle\nabla F(x_t)w,\nabla G_1(x_t)w_t\rangle+\frac{\ell(M+1)}{2}\alpha^2\|\nabla G_1(x_t)w_t\|^2 \\
&\leq F(x_t)w-\alpha\langle\nabla F(x_t)w,\nabla F(x_t)w_t\rangle+\alpha\langle\nabla F(x_t)w,\varepsilon_{t,1}w_t\rangle \\
&\quad+\ell(M+1)\alpha^2\|\nabla F(x_t)w_t\|^2+\ell(M+1)\alpha^2\|\varepsilon_{t,1}w_t\|^2. \tag{20}
\end{align*}
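
The passage from the first to the second line of (20) can be spelled out as below. The sketch assumes the sign convention $\varepsilon_{t,1}:=\nabla F(x_t)-\nabla G_1(x_t)$, which is inferred from the signs in (20) rather than restated here, and uses the elementary bound $\|a-b\|^2\leq 2\|a\|^2+2\|b\|^2$.

\begin{align*}
-\alpha\langle\nabla F(x_t)w,\nabla G_1(x_t)w_t\rangle
&= -\alpha\langle\nabla F(x_t)w,\nabla F(x_t)w_t\rangle+\alpha\langle\nabla F(x_t)w,\varepsilon_{t,1}w_t\rangle, \\
\frac{\ell(M+1)}{2}\alpha^2\|\nabla F(x_t)w_t-\varepsilon_{t,1}w_t\|^2
&\leq \ell(M+1)\alpha^2\|\nabla F(x_t)w_t\|^2+\ell(M+1)\alpha^2\|\varepsilon_{t,1}w_t\|^2.
\end{align*}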

Based on the update process of $w$, we have that

\begin{align*}
&\|w_{t+1}-w\|^2 \\
&= \left\|\Pi_{\mathcal{W}}\big(w_t-\beta[\nabla G_2(x_t)^\top\nabla G_3(x_t)w_t+\rho w_t]\big)-w\right\|^2 \\
&\leq \left\|\big(w_t-\beta[\nabla G_2(x_t)^\top\nabla G_3(x_t)w_t+\rho w_t]\big)-w\right\|^2 \\
&= \|w_t-w\|^2-2\beta\langle w_t-w,(\nabla G_2(x_t)^\top\nabla G_3(x_t)+\rho)w_t\rangle \\
&\quad+\beta^2\|(\nabla G_2(x_t)^\top\nabla G_3(x_t)+\rho)w_t\|^2,
\end{align*}

where the inequality follows from the non-expansiveness of projection. It follows that

\begin{align*}
&2\beta\langle w_t-w,\nabla F(x_t)^\top\nabla F(x_t)w_t\rangle \\
&\leq \left(\|w_t-w\|^2-\|w_{t+1}-w\|^2\right)+2\beta\rho+2\beta^2\rho^2 \\
&\quad+2\beta\left\langle w_t-w,\varepsilon_{t,2}^\top\nabla F(x_t)w_t+\nabla F(x_t)^\top\varepsilon_{t,3}w_t-\varepsilon_{t,2}^\top\varepsilon_{t,3}w_t\right\rangle \\
&\quad+2\beta^2\left\|\nabla F(x_t)^\top\nabla F(x_t)w_t-\varepsilon_{t,2}^\top\nabla F(x_t)w_t-\nabla F(x_t)^\top\varepsilon_{t,3}w_t+\varepsilon_{t,2}^\top\varepsilon_{t,3}w_t\right\|^2 \\
&\leq \left(\|w_t-w\|^2-\|w_{t+1}-w\|^2\right)+2\beta\rho+2\beta^2\rho^2 \\
&\quad+2\beta\left\langle w_t-w,\varepsilon_{t,2}^\top\nabla F(x_t)w_t+\nabla F(x_t)^\top\varepsilon_{t,3}w_t-\varepsilon_{t,2}^\top\varepsilon_{t,3}w_t\right\rangle \\
&\quad+8\beta^2KM^4+8\beta^2M^2\|\varepsilon_{t,2}\|^2+8\beta^2M^2\|\varepsilon_{t,3}\|^2+8\beta^2\|\varepsilon_{t,2}^\top\varepsilon_{t,3}w_t\|^2. \tag{21}
\end{align*}
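
The projected $w$-update analysed above can be sketched in a few lines. The sketch assumes that $\mathcal{W}$ is the probability simplex (this excerpt does not restate its definition) and uses the standard sort-based simplex projection, which is not necessarily the routine used by the authors. The names `project_simplex`, `w_update`, and `dist` are hypothetical helpers, and `G2`, `G3` stand for the $d\times K$ matrices $\nabla G_2(x_t)$ and $\nabla G_3(x_t)$, filled here with random numbers.

```python
import random

def project_simplex(v):
    """Euclidean projection of v onto {w : w_k >= 0, sum_k w_k = 1}.

    Standard sort-based method; assumes W is the probability simplex.
    """
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # threshold from the largest index j that passes the test
    return [max(x - theta, 0.0) for x in v]

def w_update(w, G2, G3, beta, rho):
    """One step w <- Proj_W(w - beta * (G2^T G3 w + rho * w)).

    G2 and G3 are d x K matrices given as lists of rows (illustrative inputs).
    """
    K = len(w)
    g3w = [sum(row[k] * w[k] for k in range(K)) for row in G3]   # G3 w, length d
    grad = [sum(G2[r][k] * g3w[r] for r in range(len(G2))) + rho * w[k]
            for k in range(K)]                                    # G2^T (G3 w) + rho w
    return project_simplex([w[k] - beta * grad[k] for k in range(K)])

def dist(a, b):
    """Euclidean distance between two vectors given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

rng = random.Random(0)
K, d = 3, 5
G2 = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(d)]
G3 = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(d)]
w_next = w_update([1.0 / K] * K, G2, G3, beta=0.1, rho=0.01)

# Non-expansiveness: ||Proj(a) - Proj(b)|| <= ||a - b|| for arbitrary a, b.
a = [rng.gauss(0, 1) for _ in range(K)]
b = [rng.gauss(0, 1) for _ in range(K)]
print(w_next, dist(project_simplex(a), project_simplex(b)) <= dist(a, b) + 1e-12)
```

The last check mirrors the only property of $\Pi_{\mathcal{W}}$ that the derivation uses: projecting can only shrink the distance to any fixed $w\in\mathcal{W}$, since $\Pi_{\mathcal{W}}(w)=w$.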

Combining (20) and (21), we get that

\begin{align*}
F(x_{t+1})w-F(x_t)w &\leq -\alpha\|\nabla F(x_t)w_t\|^2+\alpha\langle\nabla F(x_t)w,\varepsilon_{t,1}w_t\rangle \\
&\quad+\ell(M+1)\alpha^2\|\nabla F(x_t)w_t\|^2+\ell(M+1)\alpha^2\|\varepsilon_{t,1}w_t\|^2 \\
&\quad+\frac{\alpha}{2\beta}\left(\|w_t-w\|^2-\|w_{t+1}-w\|^2\right)+\alpha\rho+\alpha\beta\rho^2 \\
&\quad+\alpha\left\langle w_t-w,\varepsilon_{t,2}^\top\nabla F(x_t)w_t+\nabla F(x_t)^\top\varepsilon_{t,3}w_t-\varepsilon_{t,2}^\top\varepsilon_{t,3}w_t\right\rangle \\
&\quad+4\alpha\beta KM^4+4\alpha\beta M^2\|\varepsilon_{t,2}\|^2+4\alpha\beta M^2\|\varepsilon_{t,3}\|^2+4\alpha\beta\|\varepsilon_{t,2}^\top\varepsilon_{t,3}w_t\|^2. \tag{22}
\end{align*}
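
The combination step rests on the following identity, written out here as a sketch of the presumably intended argument: the cross term in (20) is split into a squared norm and the inner product that is bounded in (21).

\begin{align*}
-\alpha\langle\nabla F(x_t)w,\nabla F(x_t)w_t\rangle
&= -\alpha\|\nabla F(x_t)w_t\|^2+\alpha\langle\nabla F(x_t)(w_t-w),\nabla F(x_t)w_t\rangle \\
&= -\alpha\|\nabla F(x_t)w_t\|^2+\frac{\alpha}{2\beta}\cdot 2\beta\langle w_t-w,\nabla F(x_t)^\top\nabla F(x_t)w_t\rangle.
\end{align*}

Multiplying (21) by $\frac{\alpha}{2\beta}$ turns $2\beta\rho$ into $\alpha\rho$, $2\beta^2\rho^2$ into $\alpha\beta\rho^2$, and each $8\beta^2$ coefficient into $4\alpha\beta$, which matches the terms appearing in (22).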

Taking expectations and summing (22) from $t=0$ to $\tau-1$, we have that

\begin{align*}
\mathbb{E}[F(x_\tau)w]-F(x_0)w &\leq -\frac{\alpha}{2}\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\nabla F(x_t)w_t\|^2\right]+\alpha\mathbb{E}\left[\sum_{t=0}^{\tau-1}\langle\nabla F(x_t)w,\varepsilon_{t,1}w_t\rangle\right] \\
&\quad+\alpha\mathbb{E}\left[\sum_{t=0}^{\tau-1}\left\langle w_t-w,\varepsilon_{t,2}^\top\nabla F(x_t)w_t+\nabla F(x_t)^\top\varepsilon_{t,3}w_t-\varepsilon_{t,2}^\top\varepsilon_{t,3}w_t\right\rangle\right] \\
&\quad+\ell(M+1)\alpha^2\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\varepsilon_{t,1}w_t\|^2\right]+\frac{\alpha}{2\beta}\|w_0-w\|^2+\alpha\rho T+\alpha\beta\rho^2T \\
&\quad+4\alpha\beta KM^4T+4\alpha\beta M^2\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\varepsilon_{t,2}\|^2\right] \\
&\quad+4\alpha\beta M^2\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\varepsilon_{t,3}\|^2\right]+4\alpha\beta\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\varepsilon_{t,2}^\top\varepsilon_{t,3}w_t\|^2\right] \\
&\leq -\frac{\alpha}{2}\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\nabla F(x_t)w_t\|^2\right]+\alpha\mathbb{E}\left[\sum_{t=0}^{\tau-1}\langle\nabla F(x_t)w,\varepsilon_{t,1}w_t\rangle\right] \\
&\quad+\alpha\mathbb{E}\left[\sum_{t=0}^{\tau-1}\left\langle w_t-w,\varepsilon_{t,2}^\top\nabla F(x_t)w_t+\nabla F(x_t)^\top\varepsilon_{t,3}w_t-\varepsilon_{t,2}^\top\varepsilon_{t,3}w_t\right\rangle\right] \\
&\quad+\ell(M+1)\alpha^2T\sigma^2+\frac{\alpha}{\beta}+\alpha\rho T+\alpha\beta\rho^2T \\
&\quad+4\alpha\beta KM^4T+4\alpha\beta KM^2T\sigma^2
\end{align*}
+4αβKM2Tσ2+4αβTKσ4,4𝛼𝛽𝐾superscript𝑀2𝑇superscript𝜎24𝛼𝛽𝑇𝐾superscript𝜎4\displaystyle+4\alpha\beta KM^{2}T\sigma^{2}+4\alpha\beta TK\sigma^{4},+ 4 italic_α italic_β italic_K italic_M start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_T italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 4 italic_α italic_β italic_T italic_K italic_σ start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT , (23)

where the last inequality holds because $\tau\leq T$ and, for any $i\in[K]$, $j\in[3]$, $\mathbb{E}\left[\sum_{t=0}^{\tau-1}\|\varepsilon_{t,j,i}\|^{2}\right]\leq\mathbb{E}\left[\sum_{t=0}^{T-1}\|\varepsilon_{t,j,i}\|^{2}\right]\leq TK\sigma^{2}$. By the optional stopping theorem, we have that

\begin{align*}
\mathbb{E}\left[\sum_{t=0}^{\tau}\langle\nabla F(x_{t})w,\varepsilon_{t,1}w_{t}\rangle\right]=0,
\end{align*}

which further implies that

\begin{align*}
\mathbb{E}\left[\sum_{t=0}^{\tau-1}\langle\nabla F(x_{t})w,\varepsilon_{t,1}w_{t}\rangle\right]
&=-\mathbb{E}[\langle\nabla F(x_{\tau})w,\varepsilon_{\tau,1}w_{\tau}\rangle] \\
&\leq\mathbb{E}[M\|\varepsilon_{\tau,1}w_{\tau}\|]\leq M\sqrt{\mathbb{E}[\|\varepsilon_{\tau,1}w_{\tau}\|^{2}]} \\
&\leq M\sqrt{\mathbb{E}\left[\sum_{t=0}^{T}\|\varepsilon_{t,1}w_{t}\|^{2}\right]}\leq M\sigma\sqrt{T+1} \\
&\leq\sqrt{2}M\sigma\sqrt{T}. \tag{24}
\end{align*}
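The last two steps of this chain are compressed. A hedged unpacking, assuming (as the bound $M\sigma\sqrt{T+1}$ suggests, though it is not restated in this section) that every noise term satisfies $\mathbb{E}\|\varepsilon_{t,1}w_{t}\|^{2}\leq\sigma^{2}$ and that $T\geq 1$:

```latex
\begin{align*}
\mathbb{E}\big[\|\varepsilon_{\tau,1}w_{\tau}\|^{2}\big]
&\leq\mathbb{E}\Big[\sum_{t=0}^{T}\|\varepsilon_{t,1}w_{t}\|^{2}\Big]
&&\text{(since } 0\leq\tau\leq T \text{ and every summand is nonnegative)} \\
&\leq (T+1)\sigma^{2}
&&\text{(assumed per-step bound, summed over } T+1 \text{ terms)}, \\
\sqrt{T+1}
&\leq\sqrt{2T}=\sqrt{2}\sqrt{T}
&&\text{(since } T+1\leq 2T \text{ for } T\geq 1\text{)}.
\end{align*}
```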

Similarly, we have $\mathbb{E}\left[\sum_{t=0}^{\tau-1}\langle\nabla F(x_{t})w,\varepsilon_{t,2}w_{t}\rangle\right]\leq\sqrt{2}M\sigma\sqrt{T}$, $\mathbb{E}\left[\sum_{t=0}^{\tau-1}\langle\nabla F(x_{t})w,\varepsilon_{t,3}w_{t}\rangle\right]\leq\sqrt{2}M\sigma\sqrt{T}$ and $\mathbb{E}\left[\sum_{t=0}^{\tau-1}\langle\nabla F(x_{t})w,\varepsilon_{t,2}^{\top}\varepsilon_{t,3}w_{t}\rangle\right]\leq\sqrt{2K}M\sigma^{2}\sqrt{T}$.
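The optional stopping argument used here can be checked exactly on a toy martingale. The sketch below is only an illustration under stated assumptions: the noise is a fair $\pm 1$ coin (a stand-in for the zero-mean terms $\varepsilon_{t,j}$), the weight `coeff` is a hypothetical predictable coefficient that depends only on the past, and the stopping time is the first time the running sum leaves a band, capped at the horizon. The function names are invented for this example.

```python
from fractions import Fraction
from itertools import product


def stopped_sum(signs, threshold):
    """Return M_tau for one noise path.

    tau is the first time |M| >= threshold, or len(signs) if that never happens,
    so tau is a bounded stopping time.
    """
    m = Fraction(0)
    for eps in signs:
        if abs(m) >= threshold:   # tau already reached: stop accumulating
            break
        coeff = 1 + max(m, 0)     # predictable weight: depends only on the past
        m += coeff * eps          # increment with zero conditional mean
    return m


def expected_stopped_sum(horizon, threshold):
    """Exact E[M_tau], enumerating all 2**horizon equally likely sign paths."""
    paths = list(product((-1, 1), repeat=horizon))
    return sum(stopped_sum(p, threshold) for p in paths) / len(paths)


if __name__ == "__main__":
    # Exact rational arithmetic: the stopped martingale has mean zero.
    print(expected_stopped_sum(10, 4))
```

Because the enumeration uses exact fractions, the mean of the stopped sum is computed without sampling error; the asymmetric weight shows that the result does not rely on symmetry of the process.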

Based on (23) and (24), we have that

\begin{align*}
\mathbb{E}[F(x_{\tau})w]-F^{*}w
&\leq F(x_{0})w-F^{*}w-\mathbb{E}\left[\sum_{t=0}^{\tau-1}\frac{\alpha}{2}\|\nabla F(x_{t})w_{t}\|^{2}\right]+\alpha\sqrt{2T}M(3\sigma+\sigma^{2}) \\
&\quad+\ell(M+1)\alpha^{2}T\sigma^{2}+\frac{\alpha}{\beta}+\alpha\rho T+\alpha\beta\rho^{2}T \\
&\quad+4\alpha\beta KM^{4}T+4\alpha\beta KM^{2}T\sigma^{2} \\
&\quad+4\alpha\beta KM^{2}T\sigma^{2}+4\alpha\beta TK\sigma^{4} \\
&\leq\frac{\delta F}{8}-\mathbb{E}\left[\sum_{t=0}^{\tau-1}\frac{\alpha}{2}\|\nabla F(x_{t})w_{t}\|^{2}\right], \tag{25}
\end{align*}

which completes the proof. ∎

Appendix D Detailed Proofs for Iteration-wise CA Distance

We first provide some useful lemmas, which will be used in our main theorems.

Lemma 3 (Continuity of $w_{t,\rho}^{*}$).

Suppose Assumptions 1 and 2 are satisfied. If for any $i\in[K]$, $\|\nabla f_{i}(x_{t})\|\leq M$ and $\|x_{t}-x_{t+1}\|\leq\frac{1}{\ell(M+1)}$, we have

\begin{align*}
\|w_{\rho}^{*}(x_{t})-w_{\rho}^{*}(x_{t+1})\|\leq 2\rho^{-1}KM\ell(M+1)\|x_{t}-x_{t+1}\|=L_{w}\|x_{t}-x_{t+1}\|.
\end{align*}
Proof.

Let $w_{Q,\rho}(x_{t})\in\mathcal{W}$ denote the $Q$-th iterate of projected gradient descent (PGD) with a constant step size $\beta$ applied to the function $J(w)=\frac{1}{2}\|\nabla F(x_{t})w\|^{2}+\frac{\rho}{2}\|w\|^{2}$. The update rule is $w_{Q+1,\rho}(x_{t})=\Pi_{\mathcal{W}}\Big(\big((1-\beta\rho)I-\beta\nabla F(x_{t})^{\top}\nabla F(x_{t})\big)w_{Q,\rho}(x_{t})\Big)$. By the non-expansiveness of projection, we have

\begin{align*}
&\|w_{Q+1,\rho}(x_{t})-w_{Q+1,\rho}(x_{t+1})\| \\
&\leq\big\|\big((1-\beta\rho)I-\beta\nabla F(x_{t})^{\top}\nabla F(x_{t})\big)w_{Q,\rho}(x_{t}) \\
&\qquad-\big((1-\beta\rho)I-\beta\nabla F^{\top}(x_{t+1})\nabla F(x_{t+1})\big)w_{Q,\rho}(x_{t+1})\big\| \\
&\leq\|(1-\beta\rho)I-\beta\nabla F(x_{t})^{\top}\nabla F(x_{t})\|\,\|w_{Q,\rho}(x_{t})-w_{Q,\rho}(x_{t+1})\| \\
&\qquad+\beta\big\|\big(\nabla F(x_{t})^{\top}\nabla F(x_{t})-\nabla F^{\top}(x_{t+1})\nabla F(x_{t+1})\big)w_{Q,\rho}(x_{t+1})\big\| \\
&\leq(1-\beta\rho)\|w_{Q,\rho}(x_{t})-w_{Q,\rho}(x_{t+1})\| \\
&\qquad+\beta\big\|\big(\nabla F(x_{t})^{\top}\nabla F(x_{t})-\nabla F^{\top}(x_{t+1})\nabla F(x_{t+1})\big)w_{Q,\rho}(x_{t+1})\big\|.
\end{align*}

Since we set $w_{0,\rho}(x_{t})=w_{0,\rho}(x_{t+1})$ and $\|w_{Q,\rho}(x_{t+1})\|\leq 1$, telescoping the above inequality from $t=0,1,\ldots,T-1$ gives

\begin{align*}
\|w_{Q,\rho}(x_{t})-w_{Q,\rho}(x_{t+1})\|\leq\rho^{-1}\|\nabla F(x_{t})^{\top}\nabla F(x_{t})-\nabla F^{\top}(x_{t+1})\nabla F(x_{t+1})\|. \tag{26}
\end{align*}
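The bound (26) can be probed numerically. The following is a minimal sketch, not the authors' implementation. It assumes $\mathcal{W}$ is the probability simplex (this section only uses $\|w\|\leq 1$), uses arbitrary matrices as stand-ins for $\nabla F(x_{t})$ and $\nabla F(x_{t+1})$, and picks the step size $\beta=1/(\rho+\lambda_{\max})$ so that the contraction factor $1-\beta\rho$ applies. The helper names `project_simplex`, `pgd_iterates`, and `sensitivity_check` are hypothetical.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    idx = np.nonzero(u - css / k > 0)[0][-1]  # last index with a positive gap
    return np.maximum(v - css[idx] / (idx + 1), 0.0)


def pgd_iterates(G, rho, beta, w0, num_iters):
    """Run PGD on J(w) = 0.5*||G w||^2 + 0.5*rho*||w||^2 over the simplex."""
    A = G.T @ G
    w = w0.copy()
    for _ in range(num_iters):
        # Same update as in the text: Pi_W(((1 - beta*rho) I - beta G^T G) w)
        w = project_simplex((1.0 - beta * rho) * w - beta * (A @ w))
    return w


def sensitivity_check(G_t, G_next, rho, num_iters=200):
    """Return (lhs, rhs) of the bound (26) for two stand-in gradient matrices."""
    A_t, A_next = G_t.T @ G_t, G_next.T @ G_next
    lam_max = max(np.linalg.norm(A_t, 2), np.linalg.norm(A_next, 2))
    beta = 1.0 / (rho + lam_max)              # assumed step size, shared by both runs
    K = G_t.shape[1]
    w0 = np.full(K, 1.0 / K)                  # common initialization w_0
    w_t = pgd_iterates(G_t, rho, beta, w0, num_iters)
    w_next = pgd_iterates(G_next, rho, beta, w0, num_iters)
    lhs = np.linalg.norm(w_t - w_next)
    rhs = np.linalg.norm(A_t - A_next, 2) / rho
    return lhs, rhs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G_t = rng.normal(size=(5, 3))
    G_next = G_t + 0.05 * rng.normal(size=(5, 3))
    print(sensitivity_check(G_t, G_next, rho=0.1))
```

Both runs share the same initial point and step size, which mirrors the condition $w_{0,\rho}(x_{t})=w_{0,\rho}(x_{t+1})$ used in the derivation.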

Then, according to the Cauchy-Schwarz inequality, it follows that

\begin{align*}
\|w_{\rho}^{*}(x_{t})-w_{\rho}^{*}(x_{t+1})\|
&\leq\lim_{Q\rightarrow\infty}\big(\|w_{\rho}^{*}(x_{t})-w_{Q,\rho}(x_{t})\|+\|w_{\rho}^{*}(x_{t+1})-w_{Q,\rho}(x_{t+1})\| \\
&\qquad+\|w_{Q,\rho}(x_{t})-w_{Q,\rho}(x_{t+1})\|\big) \\
&\overset{(i)}{\leq}\lim_{Q\rightarrow\infty}\big(\|w_{\rho}^{*}(x_{t})-w_{Q,\rho}(x_{t})\|+\|w_{\rho}^{*}(x_{t+1})-w_{Q,\rho}(x_{t+1})\|\big) \\
&\qquad+\rho^{-1}\|\nabla F(x_{t})^{\top}\nabla F(x_{t})-\nabla F^{\top}(x_{t+1})\nabla F(x_{t+1})\| \\
&\overset{(ii)}{\leq}\lim_{Q\rightarrow\infty}2\sqrt{\frac{4}{\rho\beta Q}}+\rho^{-1}\|\nabla F(x_{t})^{\top}\nabla F(x_{t})-\nabla F^{\top}(x_{t+1})\nabla F(x_{t+1})\|, \tag{27}
\end{align*}

where $(i)$ follows from (26) and $(ii)$ follows from the convergence of PGD (Theorem 1.1, [4]) on $\rho$-strongly convex objectives, namely

\begin{align*}
\|w_{\rho}^{*}(x_{t})-w_{Q,\rho}(x_{t})\|^{2}\leq\frac{2}{\rho}\big(J(w_{\rho}^{*}(x_{t}))-J(w_{Q,\rho}(x_{t}))\big)\leq\frac{2}{\rho}\frac{\|w_{0,\rho}(x_{t})-w_{\rho}^{*}(x_{t})\|^{2}}{2\beta Q}\leq\frac{4}{\rho\beta Q}.
\end{align*}

Then the left-hand side of Lemma 3 can be bounded by

\begin{align*}
\|w_{\rho}^{*}(x_{t})-w_{\rho}^{*}(x_{t+1})\|
&\leq\rho^{-1}\|\nabla F(x_{t})^{\top}\nabla F(x_{t})-\nabla F^{\top}(x_{t+1})\nabla F(x_{t+1})\| \\
&\leq\rho^{-1}\|\nabla F(x_{t})+\nabla F(x_{t+1})\|\,\|\nabla F(x_{t})-\nabla F(x_{t+1})\| \\
&\leq 2\rho^{-1}KM\ell(M+1)\|x_{t}-x_{t+1}\|,
\end{align*}

where the last inequality follows from $\|\nabla f_{i}(x_{t})\|\leq M$ and the fact that $f_{i}(x)$ is $\Big(\frac{1}{\ell(\|\nabla f_{i}(x)\|+1)},\ell(\|\nabla f_{i}(x)\|+1)\Big)$-smooth by setting $a=1$. The proof is complete. ∎
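The middle inequality in the final display of this proof is stated without derivation. One way to see it, writing $A=\nabla F(x_{t})$ and $B=\nabla F(x_{t+1})$ (shorthand introduced here only for illustration) and using any submultiplicative matrix norm that is invariant under transposition, such as the spectral or Frobenius norm, is the following symmetrization:

```latex
\begin{align*}
A^{\top}A-B^{\top}B
&=\tfrac{1}{2}\big[(A+B)^{\top}(A-B)+(A-B)^{\top}(A+B)\big], \\
\|A^{\top}A-B^{\top}B\|
&\leq\tfrac{1}{2}\big(\|A+B\|\,\|A-B\|+\|A-B\|\,\|A+B\|\big)
=\|A+B\|\,\|A-B\|.
\end{align*}
```

Expanding the two products on the right of the first line shows that the cross terms $A^{\top}B$ and $B^{\top}A$ cancel, which leaves $2(A^{\top}A-B^{\top}B)$.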

Lemma 4.

Given $w_{t}^{*}=\arg\min_{w\in\mathcal{W}}\frac{1}{2}\|\nabla F(x_{t})w\|^{2}$ and $w_{t,\rho}^{*}=\arg\min_{w\in\mathcal{W}}\frac{1}{2}\|\nabla F(x_{t})w\|^{2}+\frac{\rho}{2}\|w\|^{2}$, we have

F(xt)wtF(xt)wt,ρρ.norm𝐹subscript𝑥𝑡superscriptsubscript𝑤𝑡𝐹subscript𝑥𝑡superscriptsubscript𝑤𝑡𝜌𝜌\displaystyle\|\nabla F(x_{t})w_{t}^{*}-\nabla F(x_{t})w_{t,\rho}^{*}\|\leq% \sqrt{\rho}.∥ ∇ italic_F ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - ∇ italic_F ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_w start_POSTSUBSCRIPT italic_t , italic_ρ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ ≤ square-root start_ARG italic_ρ end_ARG .
Proof.

Recall that $w_{t,\rho}^* = \arg\min_{w\in\mathcal{W}} \frac{1}{2}\|\nabla F(x_t)w\|^2 + \frac{\rho}{2}\|w\|^2$. Since $w_t^* \in \mathcal{W}$ is feasible for this problem, we have

$$\frac{1}{2}\|\nabla F(x_t)w_t^*\|^2 + \frac{\rho}{2}\|w_t^*\|^2 - \frac{1}{2}\|\nabla F(x_t)w_{t,\rho}^*\|^2 - \frac{\rho}{2}\|w_{t,\rho}^*\|^2 \geq 0.$$

By rearranging the above inequality, we have

$$\|\nabla F(x_t)w_{t,\rho}^*\|^2 - \|\nabla F(x_t)w_t^*\|^2 \leq \rho\left(\|w_t^*\|^2 - \|w_{t,\rho}^*\|^2\right) \leq \rho.$$

Then, recalling that $w_t^* = \arg\min_{w\in\mathcal{W}} \frac{1}{2}\|\nabla F(x_t)w\|^2$, we have

$$\begin{aligned}
\|\nabla F(x_t)w_t^* - \nabla F(x_t)w_{t,\rho}^*\|^2 &= \|\nabla F(x_t)w_t^*\|^2 + \|\nabla F(x_t)w_{t,\rho}^*\|^2 - 2\langle \nabla F(x_t)w_t^*, \nabla F(x_t)w_{t,\rho}^* \rangle \\
&\leq \|\nabla F(x_t)w_{t,\rho}^*\|^2 - \|\nabla F(x_t)w_t^*\|^2 \\
&\leq \rho,
\end{aligned}$$

where the first inequality follows from the optimality of $w_t^*$, which gives

$$-2\langle w_{t,\rho}^*, \nabla F(x_t)^\top \nabla F(x_t)w_t^* \rangle \leq -2\|\nabla F(x_t)w_t^*\|^2.$$

The proof is complete. ∎
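As a sanity check on Lemma 4, the following sketch solves both weight problems numerically and compares the two resulting directions. It assumes that $\mathcal{W}$ is the probability simplex and uses a random matrix `G` in place of $\nabla F(x_t)$; the helper names `project_simplex` and `solve_weights`, the dimensions, and the value of `rho` are illustrative choices and are not part of the analysis.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    k = idx[u - css / idx > 0][-1]
    tau = css[k - 1] / k
    return np.maximum(v - tau, 0.0)


def solve_weights(G, rho, iters=5000):
    """Projected gradient descent on 0.5*||G w||^2 + 0.5*rho*||w||^2 over the simplex."""
    K = G.shape[1]
    H = G.T @ G + rho * np.eye(K)
    step = 1.0 / np.linalg.eigvalsh(H)[-1]  # 1 / largest eigenvalue of the Hessian
    w = np.full(K, 1.0 / K)
    for _ in range(iters):
        w = project_simplex(w - step * (H @ w))
    return w


rng = np.random.default_rng(0)
G = rng.normal(size=(5, 3))      # stand-in for grad F(x_t); columns are the K gradients
rho = 0.1

w_star = solve_weights(G, 0.0)   # approximates w_t^* (no regularization)
w_rho = solve_weights(G, rho)    # approximates w_{t,rho}^*
gap = np.linalg.norm(G @ w_star - G @ w_rho)
print(gap, np.sqrt(rho))         # Lemma 4 predicts gap <= sqrt(rho)
```

The solver is plain projected gradient descent, chosen only because it is short; any quadratic-programming routine over the simplex could be substituted.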

Lemma 5.

Suppose Assumptions 1 and 2 are satisfied. If for any $i \in [K]$, $\|\nabla f_i(x_t)\| \leq M$ and $\|x_t - x_{t+1}\| \leq \frac{1}{\ell(M+1)}$, we have

$$\|R(x_t)\| \leq \frac{\alpha^2 K \ell(M+1) M^2}{2}.$$
Proof.

According to Taylor's theorem, we have the following result for any objective function $f_i$, $i \in [K]$:

$$f_i(x_{t+1}) = f_i(x_t) + \nabla f_i^\top(x_t)(x_{t+1} - x_t) + R_i(x_t),$$

where $R_i(x_t)$ is the remainder term. Then according to the descent lemma of each objective function $f_i(x)$, we have

$$\begin{aligned}
f_i(x_{t+1}) &\leq f_i(x_t) + \nabla f_i^\top(x_t)(x_{t+1} - x_t) + \alpha^2 \frac{\ell(\|\nabla f_i(x_t)\| + 1)}{2}\|\nabla F(x_t)w_t\|^2 \\
&\leq f_i(x_t) + \nabla f_i^\top(x_t)(x_{t+1} - x_t) + \alpha^2 \frac{\ell(M+1)}{2}\|\nabla F(x_t)w_t\|^2.
\end{aligned}$$

Then we can obtain

$$R_i(x_t) \leq \alpha^2 \frac{\ell(M+1)}{2}\|\nabla F(x_t)w_t\|^2.$$

Thus, according to the Cauchy-Schwarz inequality, we have

$$\|R(x_t)\| \leq \alpha^2 \frac{\ell(M+1)}{2}\|\nabla F(x_t)w_t\|^2.$$

The proof is complete. ∎
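To make the remainder bound concrete, the sketch below evaluates $R_i(x_t)$ for convex quadratic objectives. This is a special case chosen purely for illustration: it assumes a constant $\ell \equiv L$ (standard $L$-smoothness), weights `w` on the probability simplex, and the update $x_{t+1} = x_t - \alpha \nabla F(x_t) w_t$. The matrices `A`, the point `x`, and the step size are made up for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 3, 4

# Hypothetical objectives f_i(x) = 0.5 * x^T A_i x with A_i positive semidefinite.
A = []
for _ in range(K):
    B = rng.normal(size=(d, d))
    A.append(B @ B.T)


def f(i, x):
    return 0.5 * x @ A[i] @ x


def grad(i, x):
    return A[i] @ x


L = max(np.linalg.eigvalsh(Ai)[-1] for Ai in A)        # constant ell = L (assumption)
x = rng.normal(size=d)
G = np.stack([grad(i, x) for i in range(K)], axis=1)   # d x K, columns are gradients
M = max(np.linalg.norm(G[:, i]) for i in range(K))     # bound on each gradient norm
w = np.full(K, 1.0 / K)                                # uniform weights on the simplex
alpha = 1.0 / (L * (M + 1.0))                          # keeps ||x_t - x_{t+1}|| <= 1 / L

x_next = x - alpha * (G @ w)
# Taylor remainder R_i = f_i(x_{t+1}) - f_i(x_t) - <grad f_i(x_t), x_{t+1} - x_t>
R = np.array([f(i, x_next) - f(i, x) - grad(i, x) @ (x_next - x) for i in range(K)])

per_objective_bound = alpha**2 * L / 2 * np.linalg.norm(G @ w) ** 2
lemma_bound = alpha**2 * K * L * M**2 / 2
print(R, per_objective_bound)
print(np.linalg.norm(R), lemma_bound)
```

For quadratics the remainder is exactly $\frac{1}{2}(x_{t+1}-x_t)^\top A_i (x_{t+1}-x_t)$, so each $R_i$ is nonnegative and the per-objective bound can be compared entry by entry.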

D.1 Proof of Theorem 3

Theorem 9.

Suppose Assumptions 1 and 2 are satisfied. We choose $\beta' \leq \frac{1}{M^2}$, $N \sim \mathcal{O}(\epsilon^{-2})$, $\beta \leq \min\left(\frac{\epsilon^2\rho}{C_1^2}, \epsilon^2\right)$, $\alpha \leq \min\left(c_1\beta, \frac{1}{\ell(M+1)}, \frac{\beta\rho\epsilon}{2L_w\sqrt{M}}, \frac{\rho\epsilon^2}{2L_w M C_1}\right)$, $T \geq \max\left(\frac{10\Delta}{\alpha\epsilon^2}, \frac{10}{\epsilon^2\beta}\right) \sim \Theta(\epsilon^{-11})$, and $\rho \leq \min\left(\frac{\epsilon^2}{20}, \frac{c_2}{2T\alpha}\right) \sim \mathcal{O}(\epsilon^2)$. The CA distance in every iteration takes the order of $\mathcal{O}(\epsilon)$.

Proof.

Since our parameters satisfy all requirements in Theorem 1, we have that $\|\nabla f_i(x_t)\| \leq M$. According to the definition of CA distance, we have

$$\begin{aligned}
\|\nabla F(x_t)w_t - \nabla F(x_t)w_t^*\| &= \|\nabla F(x_t)w_t - \nabla F(x_t)w_{t,\rho}^* + \nabla F(x_t)w_{t,\rho}^* - \nabla F(x_t)w_t^*\| \\
&\overset{(i)}{\leq} \|\nabla F(x_t)w_t - \nabla F(x_t)w_{t,\rho}^*\| + \|\nabla F(x_t)w_{t,\rho}^* - \nabla F(x_t)w_t^*\| \\
&\overset{(ii)}{\leq} \sqrt{K}M\|w_t - w_{t,\rho}^*\| + \sqrt{\rho},
\end{aligned} \tag{28}$$

where $(i)$ follows from the triangle inequality and $(ii)$ follows from $\|\nabla f_i(x_t)\| \leq M$ for any $i$ and Lemma 4. Then, for the first term on the right-hand side (RHS) of the above inequality, we have

$$\|w_{t+1} - w_{t+1,\rho}^*\|^2 = \|w_{t+1} - w_{t,\rho}^*\|^2 + \|w_{t+1,\rho}^* - w_{t,\rho}^*\|^2 - 2\langle w_{t+1} - w_{t,\rho}^*, w_{t+1,\rho}^* - w_{t,\rho}^* \rangle. \tag{29}$$

For the first term on the RHS of the above equality, we have

$$\begin{aligned}
\|w_{t+1} - w_{t,\rho}^*\|^2 &\overset{(i)}{\leq} \|w_t - \beta[\nabla F(x_t)^\top \nabla F(x_t)w_t + \rho w_t] - w_{t,\rho}^*\|^2 \\
&= \|w_t - w_{t,\rho}^*\|^2 - 2\beta\langle \nabla F(x_t)^\top \nabla F(x_t)w_t + \rho w_t, w_t - w_{t,\rho}^* \rangle + \beta^2\|\nabla F(x_t)^\top \nabla F(x_t)w_t + \rho w_t\|^2 \\
&\overset{(ii)}{\leq} (1 - 2\beta\rho)\|w_t - w_{t,\rho}^*\|^2 + \beta^2(\rho + \sqrt{K}M^2)^2,
\end{aligned} \tag{30}$$

where $(i)$ follows from the non-expansiveness of projection and $(ii)$ follows from properties of strong convexity and the Cauchy-Schwarz inequality. Then for the second term on the RHS in eq. 29, we have

$$\|w_{t+1,\rho}^* - w_{t,\rho}^*\|^2 \leq L_w^2\|x_t - x_{t+1}\|^2 = L_w^2\alpha^2\|\nabla F(x_t)w_t\|^2 \leq \alpha^2 L_w^2 M^2. \tag{31}$$
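The one-step bound in eq. 30 can also be checked numerically. The sketch below assumes that $\mathcal{W}$ is the probability simplex, that the weight update is the projected step $w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \beta[\nabla F(x_t)^\top \nabla F(x_t)w_t + \rho w_t]\big)$ appearing in step $(i)$ of eq. 30, and that a random matrix `G` stands in for $\nabla F(x_t)$. The reference solution `w_ref` for $w_{t,\rho}^*$ is itself computed approximately, and all names and constants are illustrative.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    k = idx[u - css / idx > 0][-1]
    tau = css[k - 1] / k
    return np.maximum(v - tau, 0.0)


rng = np.random.default_rng(2)
G = rng.normal(size=(6, 3))                          # stand-in for grad F(x_t)
K = G.shape[1]
M = max(np.linalg.norm(G[:, i]) for i in range(K))   # bound on each gradient norm
rho, beta = 0.05, 0.01
H = G.T @ G + rho * np.eye(K)                        # Hessian of the regularized problem

# Reference w_{t,rho}^*: run projected gradient descent until it has converged.
w_ref = np.full(K, 1.0 / K)
step = 1.0 / np.linalg.eigvalsh(H)[-1]
for _ in range(20000):
    w_ref = project_simplex(w_ref - step * (H @ w_ref))

w_t = np.array([1.0, 0.0, 0.0])                      # arbitrary current weights
w_next = project_simplex(w_t - beta * (H @ w_t))     # one projected update

lhs = np.linalg.norm(w_next - w_ref) ** 2
rhs = ((1 - 2 * beta * rho) * np.linalg.norm(w_t - w_ref) ** 2
       + beta**2 * (rho + np.sqrt(K) * M**2) ** 2)
print(lhs, rhs)                                      # eq. 30 predicts lhs <= rhs
```

The contraction factor $1 - 2\beta\rho$ comes only from the $\frac{\rho}{2}\|w\|^2$ term, which is why the regularized problem, rather than the original one, is tracked in this part of the proof.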

Then for the last term on the RHS in eq. 29, we have

$$\begin{aligned}
-2\langle w_{t+1} - w_{t,\rho}^*, w_{t+1,\rho}^* - w_{t,\rho}^* \rangle &\leq 2\|w_{t+1} - w_{t,\rho}^*\|\|w_{t+1,\rho}^* - w_{t,\rho}^*\| \\
&\leq 2(\|w_{t+1} - w_t\| + \|w_t - w_{t,\rho}^*\|)\|w_{t+1,\rho}^* - w_{t,\rho}^*\| \\
&\overset{(i)}{\leq} 2\alpha\beta L_w\|\nabla F(x_t)^\top \nabla F(x_t)w_t + \rho w_t\|\|\nabla F(x_t)w_t\| + \beta\rho\|w_t - w_{t,\rho}^*\|^2 + \frac{4}{\beta\rho}\|w_{t+1,\rho}^* - w_{t,\rho}^*\|^2 \\
&\leq 2\alpha\beta L_w M(\sqrt{K}M^2 + \rho) + \beta\rho\|w_t - w_{t,\rho}^*\|^2 + \frac{4\alpha^2 L_w^2 M^2}{\beta\rho},
\end{aligned} \tag{32}$$

where $(i)$ follows from the update rule in Algorithm 1, Lemma 3, and Young's inequality. Then, substituting section D.1, eq. 31, and section D.1 into eq. 29, we have

$$
\begin{aligned}
\|w_{t+1} - w_{t+1,\rho}^*\|^2 \leq\ & (1-\beta\rho) \|w_t - w_{t,\rho}^*\|^2 + \beta^2 (\rho + \sqrt{K} M^2)^2 + \alpha^2 L_w^2 M^2 \\
& + 2\alpha\beta L_w M (\sqrt{K} M^2 + \rho) + \frac{4\alpha^2 L_w^2 M^2}{\beta\rho} \\
\leq\ & (1-\beta\rho) \|w_t - w_{t,\rho}^*\|^2 + \beta^2 C_1^2 + \alpha^2 L_w^2 M^2 + 2\alpha\beta L_w M C_1 + \frac{4\alpha^2 L_w^2 M^2}{\beta\rho},
\end{aligned}
$$

where the last inequality follows from Lemma 5 and $C_1 = \sqrt{K} M^2 + \rho$. Telescoping over $t = 0, 1, \ldots, T-1$ then gives

$$
\|w_T - w_{T,\rho}^*\|^2 \leq (1-\beta\rho)^T \|w_0 - w_{0,\rho}^*\|^2 + \frac{\beta}{\rho} C_1^2 + \frac{\alpha^2}{\beta\rho} L_w^2 M^2 + \frac{2\alpha L_w M}{\rho} C_1 + \frac{4\alpha^2 L_w^2 M^2}{\beta^2 \rho^2}.
$$
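The telescoping step can be sanity-checked numerically. The sketch below iterates the one-step recursion $a_{t+1} \leq (1-\beta\rho) a_t + c$ with equality (the worst case) and compares the result with the closed-form bound $(1-\beta\rho)^T a_0 + \frac{c}{\beta\rho}$. The function names and all numeric values are illustrative assumptions, not quantities taken from the paper.

```python
def unroll_recursion(a0, beta, rho, c, T):
    """Iterate a_{t+1} = (1 - beta*rho) * a_t + c for T steps (worst case of the bound)."""
    a = a0
    for _ in range(T):
        a = (1.0 - beta * rho) * a + c
    return a


def telescoped_bound(a0, beta, rho, c, T):
    """Closed-form upper bound (1 - beta*rho)^T * a0 + c / (beta*rho)."""
    return (1.0 - beta * rho) ** T * a0 + c / (beta * rho)


def per_step_constant(alpha, beta, rho, L_w, M, C1):
    """Additive constant of the one-step recursion, written term by term."""
    return (beta ** 2 * C1 ** 2
            + alpha ** 2 * L_w ** 2 * M ** 2
            + 2.0 * alpha * beta * L_w * M * C1
            + 4.0 * alpha ** 2 * L_w ** 2 * M ** 2 / (beta * rho))


# Hypothetical constants; the check only needs 0 < beta * rho < 1.
alpha, beta, rho, L_w, M, K = 1e-4, 1e-2, 0.5, 2.0, 1.0, 3
C1 = K ** 0.5 * M ** 2 + rho
c = per_step_constant(alpha, beta, rho, L_w, M, C1)
exact = unroll_recursion(2.0, beta, rho, c, T=500)
bound = telescoped_bound(2.0, beta, rho, c, T=500)
print(exact <= bound)
```

The gap between the two quantities is $\frac{c}{\beta\rho}(1-\beta\rho)^T$, which is the part of the geometric series that the closed-form bound discards.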

Then, recalling that $L_w = \mathcal{O}(\frac{1}{\rho})$ and substituting the above inequality (with $T$ replaced by $t$) into section D.1, we have

$$
\begin{aligned}
\|\nabla F(x_t) w_t - \nabla F(x_t) w_t^*\| \leq\ & \sqrt{K} M \Big[ (1-\beta\rho)^t \|w_0 - w_{0,\rho}^*\|^2 + \frac{\beta}{\rho} C_1^2 + \frac{\alpha^2}{\beta\rho} L_w^2 M^2 \\
& + \frac{2\alpha L_w M}{\rho} C_1 + \frac{4\alpha^2 L_w^2 M^2}{\beta^2 \rho^2} \Big]^{\frac{1}{2}} + \sqrt{\rho} \\
=\ & \mathcal{O}\Big( (1-\beta\rho)^{\frac{t}{2}} \|w_0 - w_{0,\rho}^*\| + \sqrt{\frac{\beta}{\rho}} + \frac{\alpha}{\beta\rho^2} + \sqrt{\rho} \Big).
\end{aligned}
$$

Since we run projected gradient descent on the strongly convex function $J(w_n) = \frac{1}{2} \|\nabla F(x_0) w_n\|^2 + \frac{\rho}{2} \|w_n\|^2$ in the $N$-loop of Algorithm 1, Theorem 10.5 of [14] gives, for any step size $\beta' \in (0, \frac{1}{M^2}]$,

$$
\|w_0 - w_{0,\rho}^*\|^2 = \|w_N - w_{0,\rho}^*\|^2 \leq 2 \Big( 1 - \frac{\rho}{M^2} \Big)^N.
$$

Thus, $\|w_0 - w_{0,\rho}^*\| = \mathcal{O}(\epsilon)$ for $N \sim \mathcal{O}(\rho^{-1})$. The CA distance is of order $\epsilon$ in every iteration when we choose $\rho \sim \mathcal{O}(\epsilon^2)$, $\beta \sim \mathcal{O}(\epsilon^4)$, $\alpha \sim \mathcal{O}(\epsilon^9)$, and $N \sim \mathcal{O}(\epsilon^{-2})$: with these choices, $\sqrt{\beta/\rho}$, $\frac{\alpha}{\beta\rho^2}$, and $\sqrt{\rho}$ are each $\mathcal{O}(\epsilon)$. The proof is complete. ∎
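The warm start used in this proof (the $N$-loop) can be sketched as projected gradient descent on $J(w) = \frac{1}{2} \|\nabla F(x_0) w\|^2 + \frac{\rho}{2} \|w\|^2$. The sketch below assumes that $\mathcal{W}$ is the probability simplex and uses a standard sort-based projection; the names `project_simplex`, `warm_start`, and `step` (standing in for $\beta'$), as well as the toy gradients, are hypothetical and not taken from Algorithm 1.

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:          # holds for a prefix of j; keep the last valid shift
            theta = t
    return [max(v_i - theta, 0.0) for v_i in v]


def gram(grads):
    """Gram matrix G[i][j] = <g_i, g_j>, i.e. grad F(x0)^T grad F(x0)."""
    return [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]


def warm_start(grads, rho, step, N):
    """Run N projected gradient steps on J(w), starting from uniform weights."""
    K = len(grads)
    G = gram(grads)
    w = [1.0 / K] * K
    for _ in range(N):
        # Gradient of J at w: (G + rho * I) w
        g = [sum(G[i][j] * w[j] for j in range(K)) + rho * w[i] for i in range(K)]
        w = project_simplex([w[i] - step * g[i] for i in range(K)])
    return w


# Toy example: two objectives in R^2 with gradients g_1 = (1, 0), g_2 = (0, 2).
w = warm_start([[1.0, 0.0], [0.0, 2.0]], rho=0.1, step=0.2, N=200)
print(w)
```

In this toy instance the minimizer of $J$ over the simplex has a closed form, $w^* = (4.1/5.2,\ 1.1/5.2)$, so the output of `warm_start` can be compared against it directly.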

D.2 Formal Version and Its Proof of Theorem 4

Let $c_1 > 0$, $c_2' > 0$, $c_3', c_4' \geq 0$, and $F > 0$ be constants such that

$$
\Delta + c_1 + c_2' + c_3' + c_4' \leq F.
$$

We then have the following convergence rate for Algorithm 4.

Theorem 10.

Suppose Assumptions 1 and 2 are satisfied, and we choose constant step sizes such that $\beta \leq \frac{\epsilon^2}{C_1'^2}$, $\alpha \leq \min\left( c_1 \beta, \sqrt{\frac{2 c_3'}{K \ell(M+1) M^2 T}}, \frac{2 c_4'}{\beta C_1'^2 T}, \frac{\epsilon^2}{K \ell(M+1) M^2} \right)$, $\rho \leq \min\left( \frac{\epsilon^2}{2}, \frac{c_2'}{\alpha T} \right)$, and $T \geq \max\left( \frac{10\Delta}{\alpha\epsilon^2}, \frac{10}{\epsilon^2 \beta} \right)$. Then we have

$$
\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla F(x_t) w_t\|^2 \leq \epsilon^2.
$$
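The conditions of Theorem 10 are coupled: the bound on $\alpha$ depends on $T$, and the bound on $T$ depends on $\alpha$. Rather than solving for the parameters, the minimal sketch below only checks whether a candidate tuple satisfies each condition. It assumes that the condition on $\rho$ is read as a minimum of its two arguments, that `ell` is the scalar value $\ell(M+1)$, and that $C_1' = \sqrt{K} M^2 + \rho + \frac{\alpha^2 K \ell(M+1) M^2}{2}$ as defined in the proof; the function name and the numeric values are hypothetical.

```python
import math


def theorem10_conditions(eps, alpha, beta, rho, T, *, K, M, ell, Delta, c1, c2, c3, c4):
    """Return a dict mapping each hyperparameter condition to True/False.

    `ell` stands for ell(M + 1); c2, c3, c4 stand for c_2', c_3', c_4'.
    """
    lip = K * ell * M ** 2                                      # K * ell(M+1) * M^2
    C1p = math.sqrt(K) * M ** 2 + rho + alpha ** 2 * lip / 2.0  # C_1' from the proof
    return {
        "beta": beta <= eps ** 2 / C1p ** 2,
        "alpha": alpha <= min(c1 * beta,
                              math.sqrt(2.0 * c3 / (lip * T)),
                              2.0 * c4 / (beta * C1p ** 2 * T),
                              eps ** 2 / lip),
        "rho": rho <= min(eps ** 2 / 2.0, c2 / (alpha * T)),
        "T": T >= max(10.0 * Delta / (alpha * eps ** 2), 10.0 / (eps ** 2 * beta)),
    }


# Illustrative values only; they are not tuned settings from the experiments.
report = theorem10_conditions(
    0.1, 1e-3, 4e-3, 1e-3, 2_000_000,
    K=2, M=1.0, ell=1.0, Delta=1.0, c1=1.0, c2=4.0, c3=4.0, c4=10.0,
)
print(report)
```

A failed entry in the returned dictionary points to the parameter that has to be reduced (or, for `T`, increased) before the guarantee applies.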
Proof.

Following similar steps to those in Section C.1, we first prove by induction that $f_i(x_t) - f_i^* \leq F$ for any $i \in [K]$ and $t \leq T$.

Base case: since all constants $c_1, c_2', c_3', c_4'$ are non-negative, $f_i(x_0) - f_i^* \leq \Delta \leq F$ holds for any $i \in [K]$.

Induction step: assume that $f_i(x_t) - f_i^* \leq F$ holds for any $i \in [K]$ and $t \leq k < T$. We then prove that $f_i(x_{k+1}) - f_i^* \leq F$ holds for any $i \in [K]$. Following similar steps to those in Section C.1, we have

$$
F(x_{t+1}) w \leq F(x_t) w - \alpha \langle \nabla F(x_t) w, \nabla F(x_t) w_t \rangle + \frac{\alpha^2 \ell(M+1)}{2} \|\nabla F(x_t) w_t\|^2. \tag{33}
$$

Based on the update rule of $w$ and the non-expansiveness of projection, we have

$$
\begin{aligned}
\|w_{t+1} - w\|^2 \leq\ & \Big\| w_t - \beta \Big( \frac{F(x_t) - F(x_{t+1})}{\alpha} + \rho w_t \Big) - w \Big\|^2 \\
=\ & \Big\| w_t - \beta \Big( \nabla F(x_t)^\top \nabla F(x_t) w_t + \rho w_t + \frac{R(x_t)}{\alpha} \Big) - w \Big\|^2 \\
\overset{(i)}{\leq}\ & \|w_t - w\|^2 - 2\beta \langle w_t - w, (\nabla F(x_t)^\top \nabla F(x_t) + \rho I) w_t \rangle + 2 \frac{\beta}{\alpha} \|R(x_t)\| + \beta^2 (C_1')^2 \\
\overset{(ii)}{\leq}\ & \|w_t - w\|^2 - 2\beta \langle w_t - w, (\nabla F(x_t)^\top \nabla F(x_t) + \rho I) w_t \rangle \\
& + \alpha\beta K \ell(M+1) M^2 + \beta^2 (C_1')^2,
\end{aligned}
$$

where $(i)$ follows from the Cauchy-Schwarz inequality with $C_1' = \sqrt{K} M^2 + \rho + \frac{\alpha^2 K \ell(M+1) M^2}{2}$, and $(ii)$ follows from Lemma 5. Then we have

$$
\begin{aligned}
\langle w_t - w, \nabla F(x_t)^\top \nabla F(x_t) w_t \rangle \leq\ & \frac{1}{2\beta} \big( \|w_t - w\|^2 - \|w_{t+1} - w\|^2 \big) + \rho \\
& + \frac{\alpha K \ell(M+1) M^2}{2} + \frac{\beta (C_1')^2}{2}.
\end{aligned}
$$
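The $w$-update analyzed above replaces the product $\nabla F(x_t)^\top \nabla F(x_t) w_t$ by the difference quotient $\frac{F(x_t) - F(x_{t+1})}{\alpha}$, with the remainder $\frac{R(x_t)}{\alpha}$ absorbed into the bound. The following minimal sketch illustrates this approximation on toy quadratic objectives $f_i(x) = \frac{1}{2} \|x - a_i\|^2$, assuming the update $x_{t+1} = x_t - \alpha \nabla F(x_t) w_t$; the objectives, the function names, and the numbers are hypothetical and only serve to show the size of the remainder.

```python
def objectives(x, centers):
    """F(x): one toy quadratic f_i(x) = 0.5 * ||x - a_i||^2 per center a_i."""
    return [0.5 * sum((xj - aj) ** 2 for xj, aj in zip(x, a)) for a in centers]


def gradients(x, centers):
    """Gradient of each toy objective: grad f_i(x) = x - a_i."""
    return [[xj - aj for xj, aj in zip(x, a)] for a in centers]


def fa_step(x, w, centers, alpha):
    """Compare (F(x_t) - F(x_{t+1})) / alpha with grad F^T grad F w_t."""
    grads = gradients(x, centers)
    # Update direction d = grad F(x_t) w_t, a weighted sum of the gradients.
    d = [sum(w_i * g[j] for w_i, g in zip(w, grads)) for j in range(len(x))]
    x_next = [xj - alpha * dj for xj, dj in zip(x, d)]
    f_now, f_next = objectives(x, centers), objectives(x_next, centers)
    finite_diff = [(a - b) / alpha for a, b in zip(f_now, f_next)]
    # Exact product: i-th entry is <grad f_i(x_t), d>.
    gram_vec = [sum(gj * dj for gj, dj in zip(g, d)) for g in grads]
    return finite_diff, gram_vec, d


centers = [[1.0, 0.0], [0.0, 2.0]]
finite_diff, gram_vec, d = fa_step([0.5, 0.5], [0.3, 0.7], centers, alpha=1e-3)
gap = max(abs(a - b) for a, b in zip(finite_diff, gram_vec))
print(gap)
```

For these quadratics the remainder is exactly $\frac{\alpha}{2} \|\nabla F(x_t) w_t\|^2$ per objective, so the gap shrinks linearly in $\alpha$ while only function values at $x_t$ and $x_{t+1}$ are needed.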

Substituting this bound on $\langle w_t - w, \nabla F(x_t)^\top \nabla F(x_t) w_t \rangle$ into eq. 33, we obtain

$$
\begin{aligned}
F(x_{t+1}) w - F(x_t) w \leq\ & -\alpha \|\nabla F(x_t) w_t\|^2 + \frac{\alpha^2 \ell(M+1)}{2} \|\nabla F(x_t) w_t\|^2 \\
& + \frac{\alpha}{2\beta} \big( \|w_t - w\|^2 - \|w_{t+1} - w\|^2 \big) + \alpha\rho + \frac{\alpha^2 K \ell(M+1) M^2}{2} + \frac{\alpha\beta (C_1')^2}{2}.
\end{aligned}
$$

Summing the above inequality from $t = 0$ to $k$, for any $w \in \mathcal{W}$, we have

$$
\begin{aligned}
F(x_{k+1}) w - F(x_0) w \leq\ & -\sum_{t=0}^{k} \alpha \|\nabla F(x_t) w_t\|^2 + \sum_{t=0}^{k} \frac{\alpha^2 \ell(M+1)}{2} \|\nabla F(x_t) w_t\|^2 \\
& + \frac{\alpha}{2\beta} \|w_0 - w\|^2 + \alpha\rho T + \frac{\alpha^2 K \ell(M+1) M^2 T}{2} + \frac{\alpha\beta (C_1')^2 T}{2} \\
\leq\ & \frac{\alpha}{\beta} + \alpha\rho T + \frac{\alpha^2 K \ell(M+1) M^2 T}{2} + \frac{\alpha\beta (C_1')^2 T}{2},
\end{aligned}
\tag{34}
$$

where the last inequality follows from $\alpha \leq \frac{1}{\ell(M+1)}$ together with $\|w_0 - w\|^2 \leq 2$ for $w_0, w \in \mathcal{W}$. Thus, for any $i \in [K]$, it can be shown that

$$
f_i(x_{k+1}) - f_i^* \leq f_i(x_0) - f_i^* + \frac{\alpha}{\beta} + \alpha\rho T + \frac{\alpha^2 K \ell(M+1) M^2 T}{2} + \frac{\alpha\beta (C_1')^2 T}{2} \leq F,
$$

since we have that $\frac{\alpha}{\beta}\leq c_{1}$, $\alpha\rho T\leq c_{2}^{\prime}$, $\frac{\alpha^{2}K\ell(M+1)M^{2}T}{2}\leq c_{3}^{\prime}$, and $\frac{\alpha\beta(C_{1}^{\prime})^{2}T}{2}\leq c_{4}^{\prime}$. This completes the induction step and shows that $f_{i}(x_{k})-f_{i}^{*}\leq F$ and section D.2 hold for all $k<T$ and $i\in[K]$. Specifically, for $\alpha\leq\frac{1}{\ell(M+1)}$, we have

\begin{equation*}
\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}\leq\frac{2F(x_{0})w-2F^{*}w}{\alpha T}+\frac{2}{\beta T}+2\rho+\alpha K\ell(M+1)M^{2}+\beta(C_{1}^{\prime})^{2}.
\end{equation*}

Then following the choice of step sizes, we can obtain

\begin{equation*}
\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_{t})w_{t}\|^{2}\leq\epsilon^{2}.
\end{equation*}

The proof is complete. ∎

D.3 Formal Version and Its Proof of Theorem 5

Theorem 11.

Suppose Assumptions 1 and 2 are satisfied. We choose $\beta^{\prime}\leq\frac{1}{M^{2}}$, $N\sim\mathcal{O}(\epsilon^{-2})$, $\beta\leq\min\left(\frac{\epsilon^{2}\rho}{(C_{1}^{\prime})^{2}},\epsilon^{2}\right)$, $\alpha\leq\min\left(c_{1}\beta,\frac{2c_{3}^{\prime}}{\beta c_{1}^{\prime}T},\frac{1}{\ell(M+1)},\frac{\beta\rho\epsilon}{2L_{w}\sqrt{M}},\frac{\rho\epsilon^{2}}{2L_{w}MC_{1}^{\prime}}\right)$, $T\geq\max\left(\frac{10\Delta}{\alpha\epsilon^{2}},\frac{10}{\epsilon^{2}\beta}\right)\sim\Theta(\epsilon^{-11})$, and $\rho\leq\min\left(\frac{\epsilon^{2}}{20},\frac{c_{2}^{\prime}}{2T\alpha}\right)\sim\mathcal{O}(\epsilon^{2})$. The CA distance in every iteration takes the order of $\mathcal{O}(\epsilon)$.

Proof.

According to the definition of CA distance, we have

\begin{align*}
\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|
&= \|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t,\rho}^{*}+\nabla F(x_{t})w_{t,\rho}^{*}-\nabla F(x_{t})w_{t}^{*}\| \\
&\overset{(i)}{\leq} \|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t,\rho}^{*}\|+\|\nabla F(x_{t})w_{t,\rho}^{*}-\nabla F(x_{t})w_{t}^{*}\| \\
&\overset{(ii)}{\leq} \sqrt{K}M\|w_{t}-w_{t,\rho}^{*}\|+\sqrt{\rho}, \tag{35}
\end{align*}

where $(i)$ follows from the triangle inequality and $(ii)$ follows from $\|\nabla f_{i}(x_{t})\|\leq M$ for any $i$ and Lemma 3. Then for the first term on the right-hand side (RHS) of the above inequality, we have

\begin{equation*}
\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}=\|w_{t+1}-w_{t,\rho}^{*}\|^{2}+\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\|^{2}-2\langle w_{t+1}-w_{t,\rho}^{*},w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\rangle. \tag{36}
\end{equation*}
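As a quick check of eq. 36, the identity can be read as the expansion of a squared norm of a difference. The shorthand $a$ and $b$ below is introduced here only for illustration and is not part of the original notation: with $a=w_{t+1}-w_{t,\rho}^{*}$ and $b=w_{t+1,\rho}^{*}-w_{t,\rho}^{*}$, we have $w_{t+1}-w_{t+1,\rho}^{*}=a-b$, so

\begin{align*}
\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}
&= \|a-b\|^{2} \\
&= \|a\|^{2}+\|b\|^{2}-2\langle a,b\rangle .
\end{align*}

The three terms $\|a\|^{2}$, $\|b\|^{2}$, and $-2\langle a,b\rangle$ are the three terms on the RHS of eq. 36, and they are bounded one at a time in what follows.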

For the first term on the RHS of eq. 36, we have

\begin{align*}
\|w_{t+1}-w_{t,\rho}^{*}\|^{2}
&\overset{(i)}{\leq} \Big\|w_{t}-\beta\Big(\frac{F(x_{t})-F(x_{t+1})}{\alpha}+\rho w_{t}\Big)-w_{t,\rho}^{*}\Big\|^{2} \\
&= \Big\|w_{t}-\beta\Big(\nabla F(x_{t})^{\top}\nabla F(x_{t})w_{t}+\rho w_{t}+\frac{R(x_{t})}{\alpha}\Big)-w_{t,\rho}^{*}\Big\|^{2} \\
&= \|w_{t}-w_{t,\rho}^{*}\|^{2}-2\beta\langle\nabla F(x_{t})^{\top}\nabla F(x_{t})w_{t}+\rho w_{t},w_{t}-w_{t,\rho}^{*}\rangle \\
&\quad -2\frac{\beta}{\alpha}\langle R(x_{t}),w_{t}-w_{t,\rho}^{*}\rangle+\beta^{2}\|\nabla F(x_{t})^{\top}\nabla F(x_{t})w_{t}+R(x_{t})+\rho w_{t}\|^{2} \\
&\overset{(ii)}{\leq} (1-2\beta\rho)\|w_{t}-w_{t,\rho}^{*}\|^{2}+2\frac{\beta}{\alpha}\|R(x_{t})\|+\beta^{2}(\rho+\sqrt{K}M^{2}+\|R(x_{t})\|)^{2}, \tag{37}
\end{align*}

where $(i)$ follows from the non-expansiveness of projection and $(ii)$ follows from properties of strong convexity and the Cauchy-Schwarz inequality. Then for the second term on the RHS in eq. 36, we have

\begin{equation*}
\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\|^{2}\leq L_{w}^{2}\|x_{t}-x_{t+1}\|^{2}=L_{w}^{2}\alpha^{2}\|\nabla F(x_{t})w_{t}\|^{2}\leq\alpha^{2}L_{w}^{2}M. \tag{38}
\end{equation*}

Then for the last term on the RHS in eq. 36, we have

\begin{align*}
-2\langle w_{t+1}-w_{t,\rho}^{*},w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\rangle
&\leq 2\|w_{t+1}-w_{t,\rho}^{*}\|\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\| \\
&\leq 2(\|w_{t+1}-w_{t}\|+\|w_{t}-w_{t,\rho}^{*}\|)\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\| \\
&\overset{(i)}{\leq} 2\alpha\beta L_{w}\Big\|\frac{F(x_{t})-F(x_{t+1})}{\alpha}+\rho w_{t}\Big\|\|\nabla F(x_{t})w_{t}\|+\beta\rho\|w_{t}-w_{t,\rho}^{*}\|^{2}+\frac{4}{\beta\rho}\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\|^{2} \\
&\leq 2\alpha\beta L_{w}M(\sqrt{K}M^{2}+\|R(x_{t})\|+\rho)+\beta\rho\|w_{t}-w_{t,\rho}^{*}\|^{2}+\frac{4\alpha^{2}L_{w}^{2}M}{\beta\rho}. \tag{39}
\end{align*}
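Step $(i)$ above is not justified explicitly. One way to obtain it is sketched here, assuming the projected update for $w_{t+1}$ that underlies step $(i)$ of eq. 37, that $w_{t}$ already lies in the feasible set, and the $L_{w}$-Lipschitz continuity of $w_{t,\rho}^{*}$ in $x_{t}$ used in eq. 38. The shorthand $\Delta_{t}=w_{t+1,\rho}^{*}-w_{t,\rho}^{*}$ is introduced only for this sketch.

\begin{align*}
\|w_{t+1}-w_{t}\| &\leq \beta\Big\|\frac{F(x_{t})-F(x_{t+1})}{\alpha}+\rho w_{t}\Big\|, \\
\|\Delta_{t}\| &\leq L_{w}\|x_{t+1}-x_{t}\| = \alpha L_{w}\|\nabla F(x_{t})w_{t}\|, \\
2\|w_{t}-w_{t,\rho}^{*}\|\|\Delta_{t}\| &\leq \beta\rho\|w_{t}-w_{t,\rho}^{*}\|^{2}+\frac{1}{\beta\rho}\|\Delta_{t}\|^{2}
\leq \beta\rho\|w_{t}-w_{t,\rho}^{*}\|^{2}+\frac{4}{\beta\rho}\|\Delta_{t}\|^{2}.
\end{align*}

The first line is non-expansiveness of the projection, the second is the Lipschitz property, and the third is Young's inequality $2ab\leq ca^{2}+b^{2}/c$ with $c=\beta\rho>0$. Multiplying the first two lines together and doubling gives the first term of step $(i)$, and the third line gives the remaining two terms. The constant $4$ is looser than Young's inequality requires, which does not affect the validity of the bound.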

Then substituting eq. 37, eq. 38 and eq. 39 into eq. 36, we have

\begin{align*}
\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}
&\leq (1-\beta\rho)\|w_{t}-w_{t,\rho}^{*}\|^{2}+2\frac{\beta}{\alpha}\|R(x_{t})\|+\beta^{2}(\rho+\sqrt{K}M^{2}+\|R(x_{t})\|)^{2} \\
&\quad +\alpha^{2}L_{w}^{2}M+2\alpha\beta L_{w}M(\sqrt{K}M^{2}+\|R(x_{t})\|+\rho)+\frac{4\alpha^{2}L_{w}^{2}M}{\beta\rho} \\
&\leq (1-\beta\rho)\|w_{t}-w_{t,\rho}^{*}\|^{2}+\alpha\beta K\ell(M+1)M^{2}+\beta^{2}(C_{1}^{\prime})^{2} \\
&\quad +\alpha^{2}L_{w}^{2}M+2\alpha\beta L_{w}MC_{1}^{\prime}+\frac{4\alpha^{2}L_{w}^{2}M}{\beta\rho},
\end{align*}

where the last inequality follows from Lemma 5 and $C_{1}^{\prime}=\sqrt{K}M^{2}+\rho+\frac{\alpha^{2}K\ell(M+1)M^{2}}{2}$. Then, telescoping over $t=0,1,\ldots,T-1$, we obtain

\begin{align*}
\|w_{T}-w_{T,\rho}^{*}\|^{2}
&\leq (1-\beta\rho)^{T}\|w_{0}-w_{0,\rho}^{*}\|^{2}+\frac{\alpha}{\rho}K\ell(M+1)M^{2}+\frac{\beta}{\rho}(C_{1}^{\prime})^{2} \\
&\quad +\frac{\alpha^{2}}{\beta\rho}L_{w}^{2}M+\frac{2\alpha L_{w}M}{\rho}C_{1}^{\prime}+\frac{4\alpha^{2}L_{w}^{2}M}{\beta^{2}\rho^{2}}.
\end{align*}
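The unrolling behind this bound can be sketched as follows. The shorthand $a_{t}$, $q$, and $c$ is introduced here only for illustration, and the sketch assumes $0<\beta\rho\leq 1$. Let $a_{t}=\|w_{t}-w_{t,\rho}^{*}\|^{2}$, let $q=1-\beta\rho$, and let $c$ collect the five terms of the preceding one-step bound that do not involve $a_{t}$, so that $a_{t+1}\leq q\,a_{t}+c$. Applying this recursively and bounding the geometric series gives

\begin{align*}
a_{T} &\leq q^{T}a_{0}+c\sum_{k=0}^{T-1}q^{k} \\
&\leq q^{T}a_{0}+\frac{c}{1-q}
= (1-\beta\rho)^{T}a_{0}+\frac{c}{\beta\rho}.
\end{align*}

Dividing each of the five terms of $c$ by $\beta\rho$ reproduces the five constant terms in the bound on $\|w_{T}-w_{T,\rho}^{*}\|^{2}$; for example, $\alpha\beta K\ell(M+1)M^{2}/(\beta\rho)=\frac{\alpha}{\rho}K\ell(M+1)M^{2}$.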

Then substituting the above bound, applied at iteration $t$ in place of $T$, into eq. 35, we have

\begin{align*}
\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|
\leq{}& \sqrt{K}M\Big[(1-\beta\rho)^{t}\|w_{0}-w_{0,\rho}^{*}\|^{2}+\frac{\alpha}{\rho}K\ell(M+1)M^{2}+\frac{\beta}{\rho}(C_{1}^{\prime})^{2}\\
&+\frac{\alpha^{2}}{\beta\rho}L_{w}^{2}M+\frac{2\alpha L_{w}M}{\rho}C_{1}^{\prime}+\frac{4\alpha^{2}L_{w}^{2}M}{\beta^{2}\rho^{2}}\Big]^{\frac{1}{2}}+\sqrt{\rho}\\
={}& \mathcal{O}\Big((1-\beta\rho)^{\frac{t}{2}}\|w_{0}-w_{0,\rho}^{*}\|+\sqrt{\frac{\alpha}{\rho^{2}}}+\sqrt{\frac{\beta}{\rho}}+\frac{\alpha}{\beta\rho^{2}}+\sqrt{\rho}\Big).
\end{align*}

Since we run projected gradient descent on the strongly convex function $J(w_{n})=\frac{1}{2}\|\nabla F(x_{0})w_{n}\|^{2}+\frac{\rho}{2}\|w_{n}\|^{2}$ in the N-loop of Algorithm 4, according to Theorem 10.5 of [14], we have

\begin{align*}
\|w_{0}-w_{0,\rho}^{*}\|^{2}=\|w_{N}-w_{0,\rho}^{*}\|^{2}\leq 2\Big(1-\frac{\rho}{M^{2}+\rho}\Big)^{N}.
\end{align*}

Thus, $\|w_{0}-w_{0,\rho}^{*}\|=\mathcal{O}(\epsilon)$ as $N\sim\mathcal{O}(\rho^{-1})$. The CA distance is of order $\epsilon$ in every iteration when we choose $\rho\sim\mathcal{O}(\epsilon^{2})$, $\beta\sim\mathcal{O}(\epsilon^{4})$, $\alpha\sim\mathcal{O}(\epsilon^{9})$, and $N\sim\mathcal{O}(\epsilon^{-2})$. The proof is complete. ∎
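As a rough numerical illustration of the contraction bound above, the following sketch computes the smallest number of inner iterations $N$ for which $2\big(1-\frac{\rho}{M^{2}+\rho}\big)^{N}\leq\epsilon^{2}$. The function name `inner_iterations` and the sample values are assumptions made for this example only. Since $-\log\big(1-\frac{\rho}{M^{2}+\rho}\big)\approx\frac{\rho}{M^{2}+\rho}$ for small $\rho$, the resulting count grows like $\rho^{-1}$ times a logarithmic factor in $1/\epsilon$.

```python
import math


def inner_iterations(rho: float, M: float, eps: float) -> int:
    """Smallest N with 2 * (1 - rho / (M**2 + rho))**N <= eps**2.

    Assumes rho > 0, M > 0 and eps > 0 (illustrative helper, not from the paper).
    """
    q = 1.0 - rho / (M**2 + rho)  # per-step contraction factor, lies in (0, 1)
    if eps**2 >= 2.0:
        return 0  # the bound already holds before any iteration
    n = math.ceil(math.log(eps**2 / 2.0) / math.log(q))
    # Guard against floating-point rounding right at the boundary.
    while 2.0 * q**n > eps**2:
        n += 1
    return n


eps = 0.1
N = inner_iterations(rho=eps**2, M=1.0, eps=eps)
print("inner iterations needed:", N)
```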

D.4 Formal Version of Theorem 6 and Its Proof

Let $\alpha,\beta,\rho,T$ satisfy all requirements of Theorem 2 with $\delta<\frac{1}{2}$. Moreover, for $\rho\sim\mathcal{O}(\epsilon^{2})$, $N\sim\mathcal{O}(\epsilon^{-2})$, $\beta\leq\frac{\delta\rho\epsilon^{2}}{60(1+KM^{4})}\sim\mathcal{O}(\epsilon^{4})$, $n_{s}\geq\max\big\{\frac{1}{K\sigma},\frac{36K\sigma(6+20\beta\rho)}{\delta\rho^{2}\epsilon^{2}}\big\}\sim\mathcal{O}(\epsilon^{-6})$, $\alpha\leq\sqrt{\frac{\delta\beta^{2}\rho^{2}\epsilon^{2}}{12L_{w}^{2}(2M^{2}+4\sigma^{2})(\beta\rho+1)}}\sim\mathcal{O}(\epsilon^{9})$, and $T\sim\Theta(\epsilon^{-11})$, we have the following theorem:

Theorem 12.

If Assumptions 1, 2 and 3 hold, then with the parameter values given above, for each $t\leq T$,

\begin{align*}
\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|\sim\mathcal{O}(\epsilon),
\end{align*}

with probability at least $1-\delta$.
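To get a feel for the magnitudes involved, the following sketch evaluates the bounds stated before Theorem 12 for given problem constants. It is only an illustration: the function name `theorem12_parameters` and the constants `c_rho` and `c_N` hidden in the $\mathcal{O}(\cdot)$ notation are assumptions made here, the horizon $T$ is left out because only its order is specified, and the additional requirements inherited from Theorem 2 are not checked.

```python
import math


def theorem12_parameters(eps, delta, K, M, sigma, L_w, c_rho=1.0, c_N=1.0):
    """Evaluate the parameter bounds stated before Theorem 12.

    c_rho and c_N are hypothetical constants standing in for the O(.) notation.
    beta and alpha are set to their upper bounds, n_s to its lower bound.
    """
    rho = c_rho * eps**2
    N = math.ceil(c_N * eps**-2)
    beta = delta * rho * eps**2 / (60.0 * (1.0 + K * M**4))
    n_s = math.ceil(max(
        1.0 / (K * sigma),
        36.0 * K * sigma * (6.0 + 20.0 * beta * rho) / (delta * rho**2 * eps**2),
    ))
    alpha = math.sqrt(
        delta * beta**2 * rho**2 * eps**2
        / (12.0 * L_w**2 * (2.0 * M**2 + 4.0 * sigma**2) * (beta * rho + 1.0))
    )
    return {"rho": rho, "N": N, "beta": beta, "n_s": n_s, "alpha": alpha}


params = theorem12_parameters(eps=0.1, delta=0.25, K=3, M=1.0, sigma=1.0, L_w=1.0)
for name, value in params.items():
    print(name, value)
```

Halving `eps` in this sketch shrinks `beta` and `alpha` and inflates `n_s` quickly, which reflects the price of enforcing the CA distance bound in every iteration.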

Proof.

When $\tau=T$ and $t<\tau$, according to the definition of the CA distance, we have

\begin{align*}
\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|
={}& \|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t,\rho}^{*}+\nabla F(x_{t})w_{t,\rho}^{*}-\nabla F(x_{t})w_{t}^{*}\|\\
\overset{(i)}{\leq}{}& \|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t,\rho}^{*}\|+\|\nabla F(x_{t})w_{t,\rho}^{*}-\nabla F(x_{t})w_{t}^{*}\|\\
\overset{(ii)}{\leq}{}& \sqrt{K}M\|w_{t}-w_{t,\rho}^{*}\|+\sqrt{\rho}, \tag{40}
\end{align*}

where $(i)$ follows from the triangle inequality and $(ii)$ follows from $\|\nabla f_{i}(x_{t})\|\leq M$ for any $i\in[K]$ together with Lemma 4. We then show by induction that $\mathbb{E}[\|w_{t}-w_{t,\rho}^{*}\|^{2}|\tau=T]\leq\frac{\delta}{2}\epsilon^{2}$ for any $t\leq\tau$.
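Part of step $(ii)$ in eq. 40 is the elementary bound $\|\nabla F(x_{t})v\|\leq\sqrt{K}M\|v\|$, which holds whenever each of the $K$ columns $\nabla f_{i}(x_{t})$ has norm at most $M$. The sketch below checks this bound on random data. The list `G` stands in for $\nabla F(x_{t})$, and the dimensions, seed, and helper names are arbitrary choices for illustration; the $\sqrt{\rho}$ term coming from Lemma 4 is not reproduced here.

```python
import math
import random


def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(t * t for t in x))


def matvec(columns, v):
    """Return sum_i v[i] * columns[i] for a list of equal-length column vectors."""
    d = len(columns[0])
    return [sum(v[i] * columns[i][j] for i in range(len(columns))) for j in range(d)]


def rescale(col, M):
    """Shrink col so that its Euclidean norm is at most M."""
    n = norm(col)
    return col if n <= M else [t * M / n for t in col]


rng = random.Random(0)
K, d, M = 4, 10, 2.0
# K gradient vectors of dimension d, each with norm at most M.
G = [rescale([rng.gauss(0.0, 1.0) for _ in range(d)], M) for _ in range(K)]
v = [rng.uniform(-1.0, 1.0) for _ in range(K)]  # plays the role of w_t - w_{t,rho}^*

lhs = norm(matvec(G, v))
rhs = math.sqrt(K) * M * norm(v)
print(lhs <= rhs)
```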

Base case: Since we run projected gradient descent on the strongly convex function $J(w_{n})=\frac{1}{2}\|\nabla F(x_{0})w_{n}\|^{2}+\frac{\rho}{2}\|w_{n}\|^{2}$ in the N-loop of Algorithm 1, according to Theorem 10.5 of [14], by choosing $\beta^{\prime}\in(0,\frac{1}{M^{2}}]$ we have

\begin{align*}
\|w_{0}-w_{0,\rho}^{*}\|^{2}=\|w_{N}-w_{0,\rho}^{*}\|^{2}\leq 2\Big(1-\frac{\rho}{M^{2}}\Big)^{N}.
\end{align*}

Thus, $\|w_{0}-w_{0,\rho}^{*}\|^{2}=\mathcal{O}(\frac{\delta}{2}\epsilon^{2})$ as $N\sim\mathcal{O}(\rho^{-1})$.
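The N-loop referenced in the base case can be sketched as projected gradient descent on $J$. The sketch assumes that the weights are constrained to the probability simplex (the feasible set is not restated in this section), represents $\nabla F(x_{0})$ as a list `G` of $K$ gradient vectors, and uses the conservative step size $1/(\|G\|_{F}^{2}+\rho)$ rather than a specific $\beta^{\prime}$. The function names are illustrative and not taken from the paper.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum_i w_i = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        if uj - t > 0:
            theta = t  # threshold from the last index that satisfies the condition
    return [max(x - theta, 0.0) for x in v]


def projected_gd(G, rho, N):
    """Run N projected gradient steps on J(w) = 0.5*||sum_i w_i G[i]||^2 + 0.5*rho*||w||^2."""
    K, d = len(G), len(G[0])
    # Gram matrix G^T G, so that the gradient of J is gram @ w + rho * w.
    gram = [[sum(G[i][k] * G[j][k] for k in range(d)) for j in range(K)] for i in range(K)]
    # trace(G^T G) = ||G||_F^2 upper-bounds the largest eigenvalue of G^T G.
    step = 1.0 / (sum(gram[i][i] for i in range(K)) + rho)
    w = [1.0 / K] * K  # start from uniform weights (an assumption of this sketch)
    for _ in range(N):
        grad = [sum(gram[i][j] * w[j] for j in range(K)) + rho * w[i] for i in range(K)]
        w = project_simplex([w[i] - step * grad[i] for i in range(K)])
    return w


G = [[1.0, 0.0], [0.0, 2.0]]  # two orthogonal toy gradients
w = projected_gd(G, rho=0.1, N=500)
print(w)
```

For this toy instance the minimizer can be worked out by hand from the stationarity condition $(1+\rho)w_{1}=(4+\rho)w_{2}$ with $w_{1}+w_{2}=1$, which gives a simple way to sanity-check the iteration.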

Induction: Assume that $\mathbb{E}[\|w_{t}-w_{t,\rho}^{*}\|^{2}|\tau=T]\leq\frac{\delta}{2}\epsilon^{2}$. We will show that $\mathbb{E}[\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}|\tau=T]\leq\frac{\delta}{2}\epsilon^{2}$ holds for any $t<\tau$. We first split $\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}$ into three parts:

\begin{align*}
\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}=\|w_{t+1}-w_{t,\rho}^{*}\|^{2}+\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\|^{2}-2\langle w_{t+1}-w_{t,\rho}^{*},w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\rangle. \tag{41}
\end{align*}
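The identity in eq. 41 follows by adding and subtracting $w_{t,\rho}^{*}$ and expanding the squared norm. The shorthand $a$ and $b$ below is introduced only for this illustration:

```latex
\begin{align*}
a &:= w_{t+1}-w_{t,\rho}^{*}, \qquad b := w_{t+1,\rho}^{*}-w_{t,\rho}^{*},\\
\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2} &= \|a-b\|^{2} = \|a\|^{2}+\|b\|^{2}-2\langle a,b\rangle .
\end{align*}
```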

For the first term on the RHS of eq. 41, we have that

\begin{align*}
\|w_{t+1}-w_{t,\rho}^{*}\|^{2}
\overset{(i)}{\leq}{}& \Big\|w_{t}-\beta\Big(\nabla G_{2}(x_{t})^{\top}\nabla G_{3}(x_{t})w_{t}+\rho w_{t}\Big)-w_{t,\rho}^{*}\Big\|^{2}\\
={}& \|w_{t}-w_{t,\rho}^{*}\|^{2}-2\beta\langle\nabla G_{2}(x_{t})^{\top}\nabla G_{3}(x_{t})w_{t}+\rho w_{t},w_{t}-w_{t,\rho}^{*}\rangle\\
&+\beta^{2}\|\nabla G_{2}(x_{t})^{\top}\nabla G_{3}(x_{t})w_{t}+\rho w_{t}\|^{2}\\
\overset{(ii)}{\leq}{}& (1-2\beta\rho)\|w_{t}-w_{t,\rho}^{*}\|^{2}\\
&+2\beta\left\langle w_{t}-w_{t,\rho}^{*},\varepsilon_{t,2}^{\top}\nabla F(x_{t})w_{t}+\nabla F(x_{t})^{\top}\varepsilon_{t,3}w_{t}-\varepsilon_{t,2}^{\top}\varepsilon_{t,3}w_{t}\right\rangle\\
&+\beta^{2}\|\rho w_{t}+\nabla F(x_{t})^{\top}\nabla F(x_{t})w_{t}-\varepsilon_{t,2}^{\top}\nabla F(x_{t})w_{t}-\nabla F(x_{t})^{\top}\varepsilon_{t,3}w_{t}+\varepsilon_{t,2}^{\top}\varepsilon_{t,3}w_{t}\|^{2}, \tag{42}
\end{align*}

where $(i)$ follows from the non-expansiveness of the projection and $(ii)$ follows from properties of strong convexity and the Cauchy-Schwarz inequality. Taking the conditional expectation of eq. 42, we have that for any $a_{1}>0$,

\begin{align*}
\mathbb{E}[\|w_{t+1}-w_{t,\rho}^{*}\|^{2}|\tau=T]
\leq{}& \frac{\delta}{2}(1-2\beta\rho)\epsilon^{2}+2\beta\mathbb{E}[\|w_{t}-w_{t,\rho}^{*}\|\|\varepsilon_{t,2}^{\top}\nabla F(x_{t})w_{t}+\nabla F(x_{t})^{\top}\varepsilon_{t,3}w_{t}-\varepsilon_{t,2}^{\top}\varepsilon_{t,3}w_{t}\||\tau=T]\\
&+\mathbb{E}[\beta^{2}\|\rho w_{t}+\nabla F(x_{t})^{\top}\nabla F(x_{t})w_{t}-\varepsilon_{t,2}^{\top}\nabla F(x_{t})w_{t}-\nabla F(x_{t})^{\top}\varepsilon_{t,3}w_{t}+\varepsilon_{t,2}^{\top}\varepsilon_{t,3}w_{t}\|^{2}|\tau=T]\\
\leq{}& \beta\big(\mathbb{E}[a_{1}\|w_{t}-w_{t,\rho}^{*}\|^{2}+\|\varepsilon_{t,2}^{\top}\nabla F(x_{t})w_{t}+\nabla F(x_{t})^{\top}\varepsilon_{t,3}w_{t}-\varepsilon_{t,2}^{\top}\varepsilon_{t,3}w_{t}\|^{2}/a_{1}|\tau=T]\big)\\
&+\frac{\delta}{2}(1-2\beta\rho)\epsilon^{2}+5\beta^{2}\rho^{2}+5\beta^{2}KM^{4}+5\beta^{2}\mathbb{E}[M\|\varepsilon_{t,2}\|^{2}|\tau=T]\\
&+5\beta^{2}\mathbb{E}[M\|\varepsilon_{t,3}\|^{2}|\tau=T]+5\beta^{2}\mathbb{E}[\|\varepsilon_{t,2}\|^{2}\|\varepsilon_{t,3}\|^{2}|\tau=T], \tag{43}
\end{align*}

where the last inequality holds because, for $t\leq\tau=T$ and any $i\in[K]$, we have $\|\nabla f_{i}(x_{t})\|\leq M$. Then, for the second term on the RHS of eq. 41, we have

\begin{align}
\mathbb{E}[\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\|^{2}|\tau=T] &\leq \mathbb{E}[L_{w}^{2}\|x_{t}-x_{t+1}\|^{2}|\tau=T] \nonumber\\
&= \mathbb{E}[L_{w}^{2}\alpha^{2}\|\nabla F(x_{t},s_{t,1})w_{t}\|^{2}|\tau=T] \nonumber\\
&\leq \mathbb{E}[\alpha^{2}L_{w}^{2}(M+\|\epsilon_{t,1}w_{t}\|)^{2}|\tau=T], \tag{44}
\end{align}

where the first inequality is due to Lemma 3, with $L_{w}\sim\mathcal{O}(\rho)$. Then, for the last term on the RHS of eq. 41, for any $a_{2}>0$ and $a_{3}>0$, we have that

\begin{align}
& \mathbb{E}[-2\langle w_{t+1}-w_{t,\rho}^{*},w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\rangle|\tau=T] \nonumber\\
\leq{} & \mathbb{E}[2\|w_{t+1}-w_{t,\rho}^{*}\|\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\||\tau=T] \nonumber\\
\leq{} & \mathbb{E}[2(\|w_{t+1}-w_{t}\|+\|w_{t}-w_{t,\rho}^{*}\|)\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\||\tau=T] \nonumber\\
\leq{} & \mathbb{E}\left[a_{2}\|w_{t+1}-w_{t}\|^{2}+\frac{1}{a_{2}}\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\|^{2}+a_{3}\|w_{t}-w_{t,\rho}^{*}\|^{2}+\frac{1}{a_{3}}\|w_{t+1,\rho}^{*}-w_{t,\rho}^{*}\|^{2}\,\Big|\,\tau=T\right] \nonumber\\
\overset{(i)}{\leq}{} & \mathbb{E}\left[a_{2}\beta^{2}\|\nabla G_{2}(x_{t})^{\top}\nabla G_{3}(x_{t})w_{t}+\rho w_{t}\|^{2}+a_{3}\frac{\delta}{2}\epsilon^{2}+\left(\frac{1}{a_{2}}+\frac{1}{a_{3}}\right)\alpha^{2}L_{w}^{2}(M+\|\epsilon_{t,1}w_{t}\|)^{2}\,\Big|\,\tau=T\right] \nonumber\\
\leq{} & a_{2}(5\beta^{2}\rho^{2}+5\beta^{2}KM^{4}+5\beta^{2}\mathbb{E}[M\|\epsilon_{t,2}\|^{2}|\tau=T] \nonumber\\
& \quad+5\beta^{2}\mathbb{E}[M\|\epsilon_{t,3}\|^{2}|\tau=T]+5\beta^{2}\mathbb{E}[\|\epsilon_{t,2}\|^{2}\|\epsilon_{t,3}\|^{2}|\tau=T]) \nonumber\\
& +\mathbb{E}\Big[\left(\frac{1}{a_{2}}+\frac{1}{a_{3}}\right)\alpha^{2}L_{w}^{2}(M+\|\epsilon_{t,1}w_{t}\|)^{2}\,\Big|\,\tau=T\Big]+a_{3}\frac{\delta}{2}\epsilon^{2}, \tag{45}
\end{align}

where $(i)$ follows from the non-expansiveness of projection and (D.4), and the last inequality is from (D.4). Then, substituting eq. 43, eq. 44 and eq. 45 into eq. 41, we have

\begin{align}
& \mathbb{E}[\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}|\tau=T] \nonumber\\
\leq{} & (1-2\beta\rho+\beta a_{1}+a_{3})\frac{\delta}{2}\epsilon^{2} \nonumber\\
& +\beta^{2}(5\rho^{2}+5KM^{4})(1+a_{2}) \nonumber\\
& +M\left(\frac{3\beta}{a_{1}}+5\beta^{2}+5\beta^{2}a_{2}\right)\mathbb{E}[\|\varepsilon_{t,2}\|^{2}|\tau=T] \nonumber\\
& +M\left(\frac{3\beta}{a_{1}}+5\beta^{2}+5\beta^{2}a_{2}\right)\mathbb{E}[\|\varepsilon_{t,3}\|^{2}|\tau=T] \nonumber\\
& +\left(\frac{3\beta}{a_{1}}+5\beta^{2}+5\beta^{2}a_{2}\right)\mathbb{E}[\|\varepsilon_{t,2}\|^{2}\|\varepsilon_{t,3}\|^{2}|\tau=T] \nonumber\\
& +\mathbb{E}\Big[\left(1+\frac{1}{a_{2}}+\frac{1}{a_{3}}\right)\alpha^{2}L_{w}^{2}(M+\|\epsilon_{t,1}w_{t}\|)^{2}\,\Big|\,\tau=T\Big] \nonumber\\
\leq{} & (1-2\beta\rho+\beta a_{1}+a_{3})\frac{\delta}{2}\epsilon^{2} \nonumber\\
& +\beta^{2}(5\rho^{2}+5KM^{4})(1+a_{2}) \nonumber\\
& +M\left(\frac{3\beta}{a_{1}}+5\beta^{2}+5\beta^{2}a_{2}\right)\left(\frac{4K\sigma}{n_{s}}+\frac{2K^{2}\sigma^{2}}{n_{s}^{2}}\right) \nonumber\\
& +\left(1+\frac{1}{a_{2}}+\frac{1}{a_{3}}\right)\alpha^{2}L_{w}^{2}\left(2M^{2}+\frac{4\sigma^{2}}{n_{s}}\right), \tag{46}
\end{align}

where the last inequality holds because, for any $i\in[3]$,

\begin{align*}
\mathbb{E}[\|\epsilon_{t,i}\||\tau=T]\leq\sqrt{\mathbb{E}[\|\epsilon_{t,i}\|^{2}|\tau=T]}\leq\sqrt{\mathbb{E}[\|\epsilon_{t,i}\|^{2}]/\mathbb{P}(\tau=T)}\leq\sqrt{\frac{2K}{n_{s}}}\sigma
\end{align*}

and

\begin{align*}
\mathbb{E}[\|\epsilon_{t,2}\|\|\epsilon_{t,3}\||\tau=T] &\leq \sqrt{\mathbb{E}[\|\epsilon_{t,2}\|^{2}\|\epsilon_{t,3}\|^{2}|\tau=T]} \\
&\leq \sqrt{\mathbb{E}[\|\epsilon_{t,2}\|^{2}\|\epsilon_{t,3}\|^{2}]/\mathbb{P}(\tau=T)} \\
&\leq \sqrt{\mathbb{E}[\|\epsilon_{t,2}\|^{2}]\mathbb{E}[\|\epsilon_{t,3}\|^{2}]/\mathbb{P}(\tau=T)} \\
&\leq \frac{\sqrt{2}K\sigma^{2}}{n_{s}}\sigma.
\end{align*}
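The two bounds above combine Jensen's inequality with the elementary fact that, for a nonnegative random variable $X$ and an event $A$ with $\mathbb{P}(A)>0$, $\mathbb{E}[X|A]\leq\mathbb{E}[X]/\mathbb{P}(A)$. As a minimal sanity check, the Python sketch below evaluates the chain $\mathbb{E}[X|A]\leq\sqrt{\mathbb{E}[X^{2}|A]}\leq\sqrt{\mathbb{E}[X^{2}]/\mathbb{P}(A)}$ on a small discrete distribution. The outcome table and all function names are hypothetical and chosen only for illustration; here $X$ stands in for a noise norm and $A$ for the event $\{\tau=T\}$.

```python
from math import sqrt

# Hypothetical toy probability space (an assumption for illustration only).
# Each row: (probability, nonnegative value of X, whether the outcome lies in A).
OUTCOMES = [
    (0.5, 1.0, True),
    (0.25, 3.0, True),
    (0.125, 0.5, False),
    (0.125, 6.0, False),
]


def event_probability():
    """P(A): total probability of the outcomes that lie in A."""
    return sum(p for p, _, in_a in OUTCOMES if in_a)


def expectation(g):
    """Unconditional expectation E[g(X)]."""
    return sum(p * g(x) for p, x, _ in OUTCOMES)


def conditional_expectation(g):
    """Conditional expectation E[g(X) | A]."""
    return sum(p * g(x) for p, x, in_a in OUTCOMES if in_a) / event_probability()


def bound_chain():
    """Return (E[X|A], sqrt(E[X^2|A]), sqrt(E[X^2]/P(A)))."""
    lhs = conditional_expectation(lambda x: x)
    jensen = sqrt(conditional_expectation(lambda x: x * x))
    unconditional = sqrt(expectation(lambda x: x * x) / event_probability())
    return lhs, jensen, unconditional


if __name__ == "__main__":
    lhs, jensen, unconditional = bound_chain()
    print(lhs, jensen, unconditional)
```

The second inequality only discards the nonnegative contribution of outcomes outside $A$, which is why it needs no independence between $X$ and $A$.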

According to eq. 46, with $a_{1}=0.5\rho$, $a_{2}=1$, $a_{3}=0.5\beta\rho$, $\beta\leq\frac{\delta\rho\epsilon^{2}}{60(1+KM^{4})}$, $n_{s}\geq\max\left\{\frac{1}{K\sigma},\frac{36K\sigma(6+20\beta\rho)}{\delta\rho^{2}\epsilon^{2}}\right\}$ and $\alpha\leq\sqrt{\frac{\delta\beta^{2}\rho^{2}\epsilon^{2}}{12L_{w}^{2}(2M^{2}+4\sigma^{2})(\beta\rho+1)}}$, we have that

\begin{align*}
\mathbb{E}[\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}|\tau=T]\leq\frac{\delta}{2}\epsilon^{2}.
\end{align*}

This completes the induction and shows that, for any $t<\tau$, we have $\mathbb{E}[\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}|\tau=T]\leq\frac{\delta}{2}\epsilon^{2}$.
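To see how the parameter choice closes the induction step, the following worked calculation substitutes $a_{1}=0.5\rho$, $a_{2}=1$ and $a_{3}=0.5\beta\rho$ into the bound (46). It is a sketch rather than part of the original argument: the extra conditions $\rho\leq 1$ and $n_{s}\geq 1$ are assumptions made here only to simplify the arithmetic.

```latex
\begin{align*}
1-2\beta\rho+\beta a_{1}+a_{3}
  &= 1-2\beta\rho+0.5\beta\rho+0.5\beta\rho = 1-\beta\rho, \\
\beta^{2}(5\rho^{2}+5KM^{4})(1+a_{2})
  &= 10\beta^{2}(\rho^{2}+KM^{4})
   \leq 10\beta\cdot\frac{\delta\rho\epsilon^{2}}{60(1+KM^{4})}\,(\rho^{2}+KM^{4})
   \leq \frac{\beta\rho\delta\epsilon^{2}}{6}
   \quad (\text{assuming } \rho\leq 1), \\
\left(1+\frac{1}{a_{2}}+\frac{1}{a_{3}}\right)\alpha^{2}L_{w}^{2}\left(2M^{2}+\frac{4\sigma^{2}}{n_{s}}\right)
  &= \frac{2(\beta\rho+1)}{\beta\rho}\,\alpha^{2}L_{w}^{2}\left(2M^{2}+\frac{4\sigma^{2}}{n_{s}}\right)
   \leq \frac{\beta\rho\delta\epsilon^{2}}{6}
   \quad (\text{assuming } n_{s}\geq 1).
\end{align*}
```

The first line leaves a slack of $\beta\rho\cdot\frac{\delta}{2}\epsilon^{2}=3\cdot\frac{\beta\rho\delta\epsilon^{2}}{6}$ below the target $\frac{\delta}{2}\epsilon^{2}$, and the $\beta$ and $\alpha$ terms each consume one third of it. The condition on $n_{s}$ appears intended to keep the remaining noise term within the last third; this sketch does not verify that part.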

As a result, we have that

\begin{align*}
\mathbb{P}\left(\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}>\epsilon^{2}\Big|\tau=T\right)\leq\frac{\mathbb{E}\left[\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}\Big|\tau=T\right]}{\epsilon^{2}}\leq\frac{\delta}{2},
\end{align*}

where the first inequality follows from Markov's inequality. Thus we have that

\begin{align}
\mathbb{P}\left(\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}\leq\epsilon^{2}\right)
&\geq 1-\mathbb{P}\left(\tau<T\right)-\mathbb{P}\left(\|w_{t+1}-w_{t+1,\rho}^{*}\|^{2}>\epsilon^{2}\Big|\tau=T\right)\mathbb{P}\left(\tau=T\right) \nonumber\\
&\geq 1-\delta, \tag{47}
\end{align}

where the last inequality holds because our parameters satisfy all the requirements in Theorem 2, and thus $\mathbb{P}(\tau<T)\leq\frac{\delta}{2}$. Then, based on (D.4), by setting $\rho\sim\mathcal{O}(\epsilon^{2})$, we have that $\|\nabla F(x_{t})w_{t}-\nabla F(x_{t})w_{t}^{*}\|\sim\mathcal{O}(\epsilon)$ with probability at least $1-\delta$ for each iteration $t$, which completes the proof. ∎
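The last step of the proof, Markov's inequality followed by a split on the event $\{\tau=T\}$, can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the function names and the numeric values of $\delta$ and $\epsilon$ are hypothetical, and the inputs are taken at the worst-case values allowed by the proof, namely $\mathbb{P}(\tau<T)=\delta/2$ and a conditional mean of $\frac{\delta}{2}\epsilon^{2}$.

```python
def markov_tail_bound(conditional_mean, threshold):
    """Markov's inequality: P(Z > threshold | A) <= E[Z | A] / threshold for Z >= 0."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return min(1.0, conditional_mean / threshold)


def success_lower_bound(p_early_stop, conditional_mean, eps):
    """Lower bound on P(Z <= eps^2), following the structure of eq. (47):
    1 - P(tau < T) - P(Z > eps^2 | tau = T) * P(tau = T)."""
    p_full_run = 1.0 - p_early_stop
    tail = markov_tail_bound(conditional_mean, eps ** 2)
    return 1.0 - p_early_stop - tail * p_full_run


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    delta, eps = 0.1, 0.01
    bound = success_lower_bound(
        p_early_stop=delta / 2,
        conditional_mean=(delta / 2) * eps ** 2,
        eps=eps,
    )
    print(bound, ">=", 1 - delta)
```

Because the tail probability is multiplied by $\mathbb{P}(\tau=T)\leq 1$, the computed bound is never smaller than $1-\delta$ whenever both inputs are at most $\delta/2$ in the sense above.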