Revisiting Decentralized ProxSkip: Achieving Linear Speedup†

†Corresponding author: **de Cao.

Luyao Guo¹, Sulaiman A. Alghunaim², Kun Yuan³, Laurent Condat⁴, **de Cao¹
¹Southeast University ²Kuwait University ³Peking University
⁴King Abdullah University of Science and Technology (KAUST)
{ly_guo, jdcao}@seu.edu.cn
[email protected] [email protected]

Abstract

The ProxSkip algorithm for decentralized and federated learning is gaining increasing attention due to its proven benefits in improving communication complexity while maintaining robustness against data heterogeneity. However, existing analyses of ProxSkip are limited to the strongly convex setting and do not achieve linear speedup, where convergence performance increases linearly with respect to the number of nodes. So far, questions remain open about how ProxSkip behaves in the non-convex setting and whether linear speedup is achievable.

In this paper, we revisit decentralized ProxSkip and address both questions. We demonstrate that the leading communication complexity of ProxSkip is $\mathcal{O}(\nicefrac{{p\sigma^{2}}}{{n\epsilon^{2}}})$ for non-convex and convex settings, and $\mathcal{O}(\nicefrac{{p\sigma^{2}}}{{n\epsilon}})$ for the strongly convex setting, where $n$ represents the number of nodes, $p$ denotes the probability of communication, $\sigma^{2}$ signifies the level of stochastic noise, and $\epsilon$ denotes the desired accuracy level. This result illustrates that ProxSkip achieves linear speedup and can asymptotically reduce communication overhead proportional to the probability of communication. Additionally, for the strongly convex setting, we further prove that ProxSkip can achieve linear speedup with network-independent stepsizes.

1 Introduction

In this work, we consider the following decentralized optimization problem by a group of agents $[n]:=\{1,2,\ldots,n\}$ connected over a network:

		$\displaystyle f^{\star}=\min_{{\bf{x}}\in\mathbb{R}^{d}}\Big[f({\bf{x}}):=\frac{1}{n}\sum_{i=1}^{n}f_{i}({\bf{x}})\Big],\quad\text{with }f_{i}({\bf{x}})=\mathbb{E}_{\xi_{i}\sim\mathcal{D}_{i}}[F_{i}({\bf{x}},\xi_{i})],$		(1)

where $\{\mathcal{D}_{i}\}^{n}_{i=1}$ represent data distributions, which can be heterogeneous across the $n$ nodes, and $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a smooth local function accessed by node $i$. In this setup, a network of nodes (also referred to as agents, workers, or clients) collaboratively seeks to minimize the average of the nodes’ objectives. Solving problem (1) in a decentralized manner has garnered considerable attention in recent years [2, 3, 4, 5, 6]. Decentralized methods do not rely on a central coordinator and communicate only with neighbors in an arbitrary communication topology. Nevertheless, decentralized optimization algorithms may still face challenges arising from communication bottlenecks.

Table 1: Comparisons with existing convergence rates of ProxSkip for decentralized optimization. $\sigma^{2}$ is the variance of the stochastic gradient. $\zeta_{0}:=\max\{1-\alpha\mu,1-p^{2}\}$; the definitions of $\zeta_{\mathrm{new}}$ and $\zeta$ can be found in [50] and Theorem 1, respectively. CVX, N-CVX, and S-CVX mean convex, non-convex, and strongly convex, respectively.

Reference	convergence rate (N-CVX)	convergence rate (CVX)	convergence rate (S-CVX)	decentralized	linear speedup
[47, 48]	no results	no results	$\zeta_{0}^{T}$, $\sigma^{2}=0$	✗	✗
[46, 49]	no results	no results	$\zeta_{0}^{T}+\mathcal{O}(\alpha\sigma^{2})$	✗	✗
[50]	no results	no results	$\zeta_{\mathrm{new}}^{T}+\mathcal{O}(\alpha\sigma^{2})$	✗	✗
[46]	no results	no results	$\zeta^{T}$, $\sigma^{2}=0$	✓	✗
[51]	no results	$\mathcal{O}\left(\frac{1}{\alpha T}\right)$	$\zeta^{T}$, $\sigma^{2}=0$	✓	✗
this work	$\mathcal{O}\left(\frac{1}{\alpha T}+\frac{\alpha\sigma^{2}}{n}+\alpha^{2}\sigma^{2}\right)$	$\mathcal{O}\left(\frac{1}{\alpha T}+\frac{\alpha\sigma^{2}}{n}+\alpha^{2}\sigma^{2}\right)$	$\mathcal{O}\left(\zeta^{T}+\frac{\alpha\sigma^{2}}{n}+\alpha^{2}\sigma^{2}\right)$	✓	✓

To reduce communication costs, many techniques have been proposed. These techniques include compressing models and gradients [7, 8, 9, 10, 11, 12, 13], using accelerated schemes [14, 15, 16, 17, 18, 19], and implementing local updates [20, 21, 22, 23, 24]. By applying these strategies, it is possible to reduce the amount of information exchanged between different nodes during training, thereby improving the efficiency of distributed training setups.

In this work, we mainly focus on performing local updates as a means to reduce communication frequency. In centralized settings (federated learning), local-SGD/FedAvg [22, 23, 25, 26] has emerged as one of the most widely adopted learning methods that employ local updates. However, when dealing with heterogeneous data, local-SGD/FedAvg encounters the challenge of “client-drift.” This phenomenon arises from the diversity of functions on each node, causing each client to converge towards the minimizer of its own function $f_{i}$, which may be far from the minimizer of the global objective $f$. To tackle this issue, several algorithms have been proposed, including Scaffold [27], Scaffold with momentum [28], FedLin [29], FedPD [30], FedDyn [31], VRL-SGD [32], FedGATE [33], and SCALLION/SCAFCOM [34]. In decentralized settings, local-DSGD has been introduced in [35]. Similarly to local-SGD, it also encounters client-drift when dealing with heterogeneous data. To mitigate the drift in local-DSGD, several algorithms have been proposed. Notably, gradient-tracking (GT) based approaches, such as local-GT [39] and $K$-GT [40], have been developed. Additionally, algorithms based on Exact-Diffusion/NIDS/D² [41, 42, 43, 44], such as LED [45], have been introduced. Distinct from these periodic local-update methods [27, 29, 30, 35, 39, 40, 45], methods incorporating probabilistic local updates have been proposed, such as ProxSkip [46] and its extended versions, including TAMUNA [47], CompressedScaffnew [48], VR-ProxSkip [49], ODEProx [50], and RandProx [51].

It is known that ProxSkip does not depend on the heterogeneity of the data and exhibits linear convergence on distributed strongly convex problems in the absence of stochastic noise [46]. When the network is sufficiently well-connected, ProxSkip [46] and its extensions [47, 48, 49, 50, 51] are gaining increasing attention due to their proven benefits in improving communication complexity. When deploying ProxSkip in machine learning, it becomes imperative to understand its behavior on non-convex tasks and its sensitivity to stochastic noise. However, existing ProxSkip convergence analyses focus on convex settings, and their main limitation is the inability to prove linear speedup in terms of the number of nodes. Notice that although [50] presents the ODEProx algorithm and gives a more rigorous analysis of ProxSkip in the strongly convex setting, this new analysis shares the same limitation as the original ProxSkip analysis, namely, the inability to achieve linear speedup. Achieving linear speedup is highly desirable for a decentralized/federated learning algorithm as it enables effective utilization of the massive parallelism inherent in large decentralized/federated learning systems. Consequently, two fundamental open questions emerge:

(1) How does ProxSkip behave on non-convex tasks?

(2) Can we establish a linear speedup bound for ProxSkip in the presence of stochastic noise?

In this paper, we revisit ProxSkip for decentralized learning and provide answers to both questions. Specifically, we develop a new analysis with a novel proof technique under non-convex, convex, and strongly convex settings. Through this analysis, we obtain several new results that are comparable to the bounds of state-of-the-art decentralized algorithms while achieving linear speedup bounds.

We highlight our contributions as follows:

• We establish the non-asymptotic convergence rate under stochastic non-convex, convex, and strongly convex settings of ProxSkip for problem (1). In particular, we prove that ProxSkip at iteration $T$ converges with rate

	N-CVX/CVX:	$\displaystyle\mathcal{O}\bigg(\frac{1}{\alpha T}+\frac{\alpha^{2}}{(1-\lambda_{2})^{2}T}+\frac{\alpha\sigma^{2}}{n}+\frac{\sigma^{2}\alpha^{2}}{1-\lambda_{2}}\bigg),$		(2)
	S-CVX:	$\displaystyle\mathcal{O}\bigg((1-\alpha\mu)^{T}a_{0}+\frac{\alpha\sigma^{2}}{\mu n}+\frac{\sigma^{2}\alpha^{2}}{\mu(1-\lambda_{2})}\bigg),$		(3)

where $\alpha$ is the stepsize of ProxSkip, $\sigma^{2}$ denotes the variance of the stochastic gradient, $1-\lambda_{2}$ is a topology-dependent quantity that approaches $0$ for a large and sparse network, $\mu$ is the strong convexity constant, and $a_{0}$ is a constant that depends on the initialization. To the best of our knowledge, this is the first work that establishes the convergence rate of probabilistic decentralized methods for non-convex settings. We offer a comparison of convergence rates of ProxSkip for problem (1) in Table 1.

• We prove that, after enough transient time, the expected communication complexity of ProxSkip is $\mathcal{O}(\nicefrac{{p\sigma^{2}}}{{n\epsilon^{2}}})$ (or $\tilde{\mathcal{O}}(\nicefrac{{p\sigma^{2}}}{{n\epsilon}})$ for S-CVX), where $\epsilon$ denotes the desired accuracy level, demonstrating that ProxSkip achieves linear speedup with respect to the number of nodes $n$. In addition, for the strongly convex setting, we further prove that ProxSkip can achieve linear speedup with network-independent stepsizes. The proposed new proof technique overcomes the analytical limitations of [46, 47, 48, 49, 50, 51]. To the best of our knowledge, we prove for the first time that ProxSkip can achieve linear speedup.

• We elucidate the effects of noise, local steps, and data heterogeneity on the convergence of ProxSkip in stochastic non-convex, convex, and strongly convex settings. We demonstrate the robustness of ProxSkip against data heterogeneity while enhancing communication efficiency by local updates. Furthermore, we show that the convergence rates exhibited by ProxSkip in stochastic settings are comparable with those of existing state-of-the-art decentralized algorithms incorporating local updates [35, 40, 45] (see Table 2).

2 Setup

All vectors are column vectors unless otherwise stated. We let $\mathbf{x}_{i}^{t}\in\mathbb{R}^{d}$ represent the local state of node $i$ at the $t$ -th iteration. For the sake of convenience in notation, we use bold capital letters to denote stacked variables. For instance,

	$\displaystyle\mathbf{X}^{t}:=$	$\displaystyle\ [\mathbf{x}_{1}^{t},\mathbf{x}_{2}^{t},\ldots,\mathbf{x}_{n}^{t}]^{\sf T}\in\mathbb{R}^{n\times d},$
	$\displaystyle\mathbf{G}^{t}:=$	$\displaystyle\ [\mathbf{g}_{1}^{t},\mathbf{g}_{2}^{t},\ldots,\mathbf{g}_{n}^{t}]^{\sf T}\in\mathbb{R}^{n\times d},$
	$\displaystyle\nabla F(\mathbf{X}^{t}):=$	$\displaystyle\ [\nabla f_{1}(\mathbf{x}^{t}_{1}),\nabla f_{2}(\mathbf{x}^{t}_{2}),\ldots,\nabla f_{n}(\mathbf{x}^{t}_{n})]^{\sf T}\in\mathbb{R}^{n\times d}.$

2.1 Network graph

Algorithm 1 ProxSkip for decentralized stochastic optimization

1: Input: $\alpha>0$, $\beta>0$, $0<p\leq 1$, $\chi\geq 1$, initial iterates ${\bf{x}}^{0}_{i}={\bf{x}}^{0}\in\mathbb{R}^{d},~i=1,\dots,n$, initial dual variables ${\bf{y}}^{0}_{i}=0,~i=1,\dots,n$, weights for averaging ${\bf{W}}_{a}={\bf{I}}-\nicefrac{{1}}{{2\chi}}({\bf{I}}-{\bf{W}}):=(\widetilde{W}_{ij})^{n}_{i,j=1}$
2: Flip coins $[\theta_{0},\ldots,\theta_{T-1}]$, where $\theta_{t}\in\{0,1\}$, with $\mathop{{\bf{P}}}(\theta_{t}=1)=p$
3: for $t=0,1,\dotsc,T-1$, every node $i$ do
4:     Sample $\xi_{i}^{t}$, compute gradient ${\bf{g}}_{i}^{t}=\nabla F_{i}({\bf{x}}^{t}_{i},\xi^{t}_{i})$
5:     $\hat{{\bf{z}}}^{t}_{i}={\bf{x}}^{t}_{i}-\alpha{\bf{g}}^{t}_{i}-{\bf{y}}^{t}_{i}$    $\triangleright$ update the prediction variate $\hat{{\bf{z}}}^{t}_{i}$
6:     if $\theta_{t}=1$ then
7:         ${\bf{x}}^{t+1}_{i}=\sum_{j=1}^{n}\widetilde{W}_{ij}\hat{{\bf{z}}}^{t}_{j}$    $\triangleright$ communicate with probability $p$
8:         ${\bf{y}}^{t+1}_{i}={\bf{y}}^{t}_{i}+\beta(\hat{{\bf{z}}}^{t}_{i}-{\bf{x}}^{t+1}_{i})$    $\triangleright$ update the control variate ${\bf{y}}^{t+1}_{i}$
9:     else
10:         ${\bf{y}}^{t+1}_{i}={\bf{y}}^{t}_{i},~{\bf{x}}^{t+1}_{i}=\hat{{\bf{z}}}^{t}_{i}$    $\triangleright$ skip communication
11:     end if
12: end for

In this work, we focus on decentralized scenarios (undirected and connected network), where a network of $n$ nodes is interconnected by a graph with a set of edges $\mathcal{E}$ , where node $i$ is connected to node $j$ if $(i,j)\in\mathcal{E}$ . To describe the algorithm, we introduce the global mixing matrix $\mathbf{W}=[W_{ij}]$ , where $W_{ij}=W_{ji}=0$ if $(i,j)\notin\mathcal{E}$ , and $W_{ij}>0$ otherwise. We impose the following standard assumption on the mixing matrix.

Assumption 1.

The mixing matrix $\mathbf{W}\in[0,1]^{n\times n}$ is symmetric, doubly stochastic, and primitive. Let $\lambda_{1}=1$ denote the largest eigenvalue of the mixing matrix $\mathbf{W}$ , and the remaining eigenvalues are denoted as $1>\lambda_{2}\geq\lambda_{3}\geq\cdots\geq\lambda_{n}>-1$ .

We introduce two quantities as follows: $\mathbf{W}_{a}=\mathbf{I}-\nicefrac{{1}}{{2\chi}}(\mathbf{I}-\mathbf{W})$ and $\mathbf{W}_{b}=(\mathbf{I}-\mathbf{W})^{1/2}$, where $\chi\geq 1$. Under Assumption 1, the matrix $\mathbf{W}_{a}$ is positive semi-definite and doubly stochastic. Furthermore, we have $\mathbf{I}-\mathbf{W}_{a}=\nicefrac{{1}}{{2\chi}}\mathbf{W}_{b}^{2}$, and $\mathbf{W}_{a}$ is well-conditioned when $\chi$ is large.
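As a concrete illustration of Assumption 1 and of $\mathbf{W}_{a}$, the following minimal sketch builds a mixing matrix for a ring graph and the corresponding averaging matrix. The ring topology, the uniform weight $1/3$, and all function names are assumptions made for this example only; they are not prescribed by the analysis. For this circulant choice the eigenvalues of $\mathbf{W}$ are $\frac{1}{3}+\frac{2}{3}\cos(2\pi k/n)$, so $\lambda_{2}$ is available in closed form.

```python
import math

def ring_mixing_matrix(n):
    """Illustrative W for a ring: weight 1/3 on a node and on each of its two neighbors."""
    if n < 3:
        raise ValueError("the ring construction needs n >= 3")
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W

def averaging_matrix(W, chi):
    """W_a = I - (I - W) / (2 * chi), written as (1 - 1/(2 chi)) I + W / (2 chi)."""
    n = len(W)
    return [[(1.0 if i == j else 0.0) * (1.0 - 1.0 / (2.0 * chi)) + W[i][j] / (2.0 * chi)
             for j in range(n)] for i in range(n)]

def ring_lambda2(n):
    """Second-largest eigenvalue of the circulant ring matrix built above."""
    return 1.0 / 3.0 + (2.0 / 3.0) * math.cos(2.0 * math.pi / n)

def is_symmetric_doubly_stochastic(M, tol=1e-12):
    """Check nonnegativity, symmetry, and unit row sums (hence unit column sums)."""
    n = len(M)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in M)
    sym_ok = all(abs(M[i][j] - M[j][i]) < tol for i in range(n) for j in range(n))
    nonneg = all(M[i][j] >= 0.0 for i in range(n) for j in range(n))
    return rows_ok and sym_ok and nonneg
```

With $n=8$ this gives $\lambda_{2}\approx 0.80$, and each eigenvalue $\lambda$ of $\mathbf{W}$ is mapped to $1-\nicefrac{(1-\lambda)}{2\chi}$ for $\mathbf{W}_{a}$, which is why a larger $\chi$ pushes the spectrum of $\mathbf{W}_{a}$ towards $1$.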

2.2 Algorithm description

The ProxSkip algorithm [46] for problem (1) can be written as


$\displaystyle\hat{\mathbf{Z}}^{t}$	$\displaystyle=\mathbf{X}^{t}-\alpha\mathbf{G}^{t}-\mathbf{Y}^{t},$	(4a)
$\displaystyle\mathbf{X}^{t+1}$	$\displaystyle=(1-\theta_{t})\hat{\mathbf{Z}}^{t}+\theta_{t}\mathbf{W}_{a}\hat{\mathbf{Z}}^{t},$	(4b)
$\displaystyle\mathbf{Y}^{t+1}$	$\displaystyle=\mathbf{Y}^{t}+\beta(\hat{\mathbf{Z}}^{t}-\mathbf{X}^{t+1}).$	(4c)

Here, $\alpha>0$ is the stepsize (learning rate), $\beta>0$ is the parameter of the control variate update (4c), $\mathbf{G}^{t}=[\mathbf{g}_{1}^{t},\mathbf{g}_{2}^{t},\ldots,\mathbf{g}_{n}^{t}]^{\sf T}\in\mathbb{R}^{n\times d}$ with $\mathbf{g}_{i}^{t}$ denoting a stochastic estimate of the gradient $\nabla f_{i}(\mathbf{x}_{i}^{t})$, $\theta_{t}=1$ with probability $p$ and $\theta_{t}=0$ with probability $1-p$, and $\mathbf{Y}^{t}$ is the control variate. At each iteration $t\geq 0$, communication takes place with probability $p\in(0,1]$. In the absence of communication ($\theta_{t}=0$), the update $\mathbf{X}^{t+1}=\hat{\mathbf{Z}}^{t}=\mathbf{X}^{t}-\alpha\mathbf{G}^{t}-\mathbf{Y}^{t}$ is performed, and (4c) leaves the control variate unchanged, i.e., $\mathbf{Y}^{t+1}=\mathbf{Y}^{t}$. This allows multiple iterations of local computation to be performed between communication rounds. Decomposing the updates for individual nodes, we provide a detailed implementation in Algorithm 1.
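The updates (4a)-(4c) can be sketched in a few lines of plain Python. This is an illustrative toy, not the implementation used in the paper: it assumes scalar local states ($d=1$), a ring graph with uniform weight $1/3$, and quadratic local objectives $f_{i}(x)=\frac{1}{2}(x-b_{i})^{2}$ with optional Gaussian gradient noise. The function names, the test problem, and the seed are all hypothetical choices made for this sketch.

```python
import random

def ring_averaging_matrix(n, chi):
    """W_a = I - (I - W)/(2*chi) for a ring W with weight 1/3 on self and both neighbors (n >= 3)."""
    W_a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W_a[i][i] = 1.0 - 1.0 / (3.0 * chi)
        W_a[i][(i - 1) % n] = 1.0 / (6.0 * chi)
        W_a[i][(i + 1) % n] = 1.0 / (6.0 * chi)
    return W_a

def proxskip_step(x, y, g, W_a, alpha, beta, theta):
    """One iteration of (4a)-(4c) for scalar local states."""
    n = len(x)
    z = [x[i] - alpha * g[i] - y[i] for i in range(n)]                 # (4a) prediction
    if theta == 1:
        # (4b) with theta_t = 1: mix predictions with the averaging matrix W_a
        x_next = [sum(W_a[i][j] * z[j] for j in range(n)) for i in range(n)]
    else:
        # (4b) with theta_t = 0: skip communication, keep the local prediction
        x_next = list(z)
    y_next = [y[i] + beta * (z[i] - x_next[i]) for i in range(n)]      # (4c) control variate
    return x_next, y_next

def run_proxskip(b, W_a, alpha, beta, p, T, noise_std=0.0, seed=0):
    """Run T iterations on the toy objectives f_i(x) = 0.5 * (x - b[i])**2."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n  # common initial iterate
    y = [0.0] * n  # zero initial control variates
    for _ in range(T):
        theta = 1 if rng.random() < p else 0  # one coin flip shared by all nodes
        g = [x[i] - b[i] + rng.gauss(0.0, noise_std) for i in range(n)]
        x, y = proxskip_step(x, y, g, W_a, alpha, beta, theta)
    return x, y
```

For instance, `run_proxskip([1.0, 2.0, 3.0, 6.0, -2.0], ring_averaging_matrix(5, 1.0), alpha=0.5, beta=0.5, p=0.5, T=2000)` uses the parameter regime of Theorem 1 ($\alpha\leq\nicefrac{1}{L}$, $\beta=p$, $\chi=1$) without noise, and every local state should approach the minimizer of the average objective, here the mean of the $b_{i}$, even though each node's own minimizer is $b_{i}$. Two structural facts are visible directly in `proxskip_step`: the control variates always sum to zero, and a skipped round leaves them untouched.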

2.3 Assumptions

We further use the following standard assumptions:

Assumption 2.

A solution exists to problem (1), and $f^{\star}>-\infty$. Moreover, $f_{i}$ is $L$-smooth, i.e.,

\|\nabla f_{i}({\bf{x}})-\nabla f_{i}({\bf{y}})\|\leq L\|{\bf{x}}-{\bf{y}}\|,\text{ for any }{\bf{x}},{\bf{y}}\in\mathbb{R}^{d}.

Assumption 3.

Each function $f_{i}$ is $\mu$ -strongly convex for constant $\mu\geq 0$ , i.e.,

f_{i}({\bf{x}})-f_{i}({\bf{y}})+\frac{\mu}{2}\|{\bf{x}}-{\bf{y}}\|^{2}\leq\langle\nabla f_{i}({\bf{x}}),{\bf{x}}-{\bf{y}}\rangle,\text{ for any }{\bf{x}},{\bf{y}}\in\mathbb{R}^{d}.

Assumption 4.

For every iteration $t\geq 0$, the local stochastic gradient ${\bf{g}}^{t}_{i}=\nabla F_{i}({\bf{x}}_{i}^{t},\xi^{t}_{i})$ is an unbiased estimate, i.e.,

\mathbb{E}_{\xi^{t}_{i}}[\nabla F_{i}({\bf{x}}_{i}^{t},\xi^{t}_{i})\;|\;{\bf{x}}_{i}^{t}\ ]=\nabla f_{i}({\bf{x}}_{i}^{t}),

and there exists a constant $\sigma>0$ such that

\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\xi_{i}^{t}}\big[\|\nabla F_{i}({\bf{x}}_{i}^{t},\xi^{t}_{i})-\nabla f_{i}({\bf{x}}_{i}^{t})\|^{2}\big]\leq\sigma^{2}.

3 Convergence results

We now present our novel convergence results for ProxSkip. In Section 3.1, we recall the existing results in [46]. In Section 3.2, the convergence rates and communication complexities for non-convex and convex functions are presented in Theorem 2 and Corollary 1, respectively. In Section 3.3, we further prove that ProxSkip can achieve linear speedup with network-independent stepsizes.

3.1 Preliminary

We start by recalling the existing convergence results of ProxSkip [46, 51].

Theorem 1.

Suppose that Assumptions 1, 2, and 4 hold, and $f_{i}$ is $\mu$ -strongly convex for some $0<\mu\leq L$ . If $0<\alpha\leq\nicefrac{{1}}{{L}}$ , $\beta=p$ , and $\chi\geq 1$ , it holds that

$\displaystyle\mathbb{E}\!\left[\big\|\bar{{\bf{x}}}^{t+1}-{\bf{x}}^{\star}\big\|^{2}\right]\leq\zeta^{t+1}a_{0}+\frac{\alpha^{2}\sigma^{2}}{1-\zeta},$		(5)

where $a_{0}$ is a constant that depends on the initialization and $\zeta=\max\{1-\alpha\mu,1-\frac{(1-\lambda_{2})p^{2}}{2\chi}\}<1$ .

When $\sigma^{2}=0$, by setting $\alpha=\nicefrac{{1}}{{L}}$ and $\chi=1$, we can deduce from (5) that the communication complexity of ProxSkip to achieve $\epsilon$-accuracy, i.e., $\mathbb{E}[\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}]\leq\epsilon$, is given by $\mathcal{O}((p\kappa+\nicefrac{{1}}{{p(1-\lambda_{2})}})\log\nicefrac{{1}}{{\epsilon}})$, where $\kappa=\nicefrac{{L}}{{\mu}}$. If the network is sufficiently well-connected, i.e., $\nicefrac{{1}}{{(1-\lambda_{2})\kappa}}<1$, and we set $p=\sqrt{\nicefrac{{1}}{{(1-\lambda_{2})\kappa}}}$, the communication complexity becomes $\mathcal{O}(\sqrt{\nicefrac{{\kappa}}{{(1-\lambda_{2})}}}\ \log\nicefrac{{1}}{{\epsilon}})$, which is the optimal communication complexity as proven by [52].
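The choice of $p$ can be motivated by a short balancing argument. The following is a sketch under the setting just stated ($\sigma^{2}=0$, $\alpha=\nicefrac{1}{L}$, $\chi=1$) that ignores absolute constants. By (5), the number of iterations needed is of order $\frac{1}{1-\zeta}\log\frac{1}{\epsilon}$, and each iteration triggers a communication with probability $p$, so the expected number of communication rounds is of order

$\displaystyle\frac{p}{1-\zeta}\log\frac{1}{\epsilon}=p\max\Big\{\kappa,\frac{2}{(1-\lambda_{2})p^{2}}\Big\}\log\frac{1}{\epsilon}\leq\Big(p\kappa+\frac{2}{p(1-\lambda_{2})}\Big)\log\frac{1}{\epsilon}.$

The first term increases with $p$ and the second decreases with $p$, so up to constants the bound is smallest when the two terms balance:

$\displaystyle p\kappa=\frac{1}{p(1-\lambda_{2})}\iff p=\sqrt{\frac{1}{(1-\lambda_{2})\kappa}},\qquad\text{which gives}\qquad p\kappa=\frac{1}{p(1-\lambda_{2})}=\sqrt{\frac{\kappa}{1-\lambda_{2}}}.$

The well-connectedness condition $\nicefrac{1}{(1-\lambda_{2})\kappa}<1$ is exactly what makes this value of $p$ a valid probability.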

When $\sigma^{2}\neq 0$, based on (5) and the fact that $\frac{\alpha^{2}\sigma^{2}}{1-\zeta}=\frac{\alpha^{2}\sigma^{2}}{\alpha\mu}=\mathcal{O}(\alpha\sigma^{2})$ when $\zeta=1-\alpha\mu$, we can conclude that the local solution ${\bf{x}}_{i}^{t}$ generated by ProxSkip converges to the global minimizer ${\bf{x}}^{\star}$ at a linear rate until it reaches an $\mathcal{O}(\alpha\sigma^{2})$-neighborhood of ${\bf{x}}^{\star}$. However, relying solely on (5) is not sufficient to obtain the desired linear speedup term $\mathcal{O}(\frac{\alpha\sigma^{2}}{n})+\mathcal{O}(\alpha^{2})$, because the noise term in (5) does not decrease with $n$. This indicates that the direct extension of the analysis techniques proposed in [46] or [51] to the stochastic scenario does not guarantee linear speedup, despite ensuring convergence. Therefore, further analysis is required to achieve the desired linear speedup.

3.2 Main theorem—Convergence rate of ProxSkip

We are now ready to present the new convergence results for ProxSkip.

Theorem 2.

Suppose that Assumptions 1, 2, and 4 hold. Let $\bar{{\bf{x}}}^{t}=\frac{1}{n}\sum_{i=1}^{n}{\bf{x}}_{i}^{t}$ denote the average of the iterates of Algorithm 1, and let ${\bf{x}}^{\star}$ be a solution of (1). For sufficiently small $\alpha$, $\chi=\mathcal{O}(\max\{1,\nicefrac{{(1-p)}}{{(1-\lambda_{2})}}\})$, and $\beta=1$, we have the following convergence results.

Non-convex: Let $F_{0}=f(\bar{{\bf{x}}}^{0})-f^{\star}$ and $\varsigma^{2}_{0}=\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_{i}(\bar{{\bf{x}}}^{0})-\nabla f(\bar{{\bf{x}}}^{0})\|^{2}$. It holds that

$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\big\|\nabla f(\bar{{\bf{x}}}^{t})\big\|^{2}\right]\leq\mathcal{O}\bigg(\underbrace{\frac{F_{0}}{\alpha T}+\frac{\chi^{2}L^{2}\varsigma^{2}_{0}\alpha^{2}}{(1-\lambda_{2})^{2}T}}_{\mathrm{deterministic\ part}}+\underbrace{\frac{\alpha\sigma^{2}L}{n}+\frac{\chi L^{2}\sigma^{2}\alpha^{2}}{1-\lambda_{2}}}_{\mathrm{stochastic\ part}}\bigg).$		(6)

Convex: Let $R_{0}^{2}=\|\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\|^{2}$ . Under the additional Assumption 3 with $\mu\geq 0$ , it holds that

$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t})-f^{\star}\right]\leq\mathcal{O}\bigg(\underbrace{\frac{R_{0}^{2}}{\alpha T}+\frac{\chi^{2}L\varsigma^{2}_{0}\alpha^{2}}{(1-\lambda_{2})^{2}T}}_{\mathrm{deterministic\ part}}+\underbrace{\frac{\alpha\sigma^{2}}{n}+\frac{\chi L\sigma^{2}\alpha^{2}}{1-\lambda_{2}}}_{\mathrm{stochastic\ part}}\bigg).$		(7)

Strongly convex: Under the additional Assumption 3 with $\mu>0$ , it holds that

$\displaystyle\mathbb{E}\!\left[\big\|\bar{{\bf{x}}}^{T}-{\bf{x}}^{\star}\big\|^{2}\right]\leq\underbrace{\Big(1-\frac{\alpha\mu}{4}\Big)^{T}\Big(\big\|\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\big\|^{2}+\frac{8\chi\alpha^{2}\varsigma_{0}^{2}}{1-\lambda_{2}}\Big)}_{\mathrm{deterministic\ part}}+\mathcal{O}\bigg(\underbrace{\frac{\alpha\sigma^{2}}{\mu n}+\frac{\chi L\sigma^{2}\alpha^{2}}{\mu(1-\lambda_{2})}}_{\mathrm{stochastic\ part}}\bigg).$		(8)

For the non-convex setting, Theorem 2 demonstrates that the ProxSkip algorithm converges to a radius around some stationary point. Without any additional assumptions, a stationary point is the best guarantee possible and is a satisfactory criterion to measure the performance of distributed methods with nonconvex objectives [35]. For the convex case, Theorem 2 shows that ProxSkip converges around some optimal solution. When $\sigma^{2}=0$ , i.e., in the deterministic case, ProxSkip converges exactly with sublinear and linear rates for N-CVX/CVX and S-CVX settings, respectively.

Note that the stochastic parts of the convergence rates (6), (7), and (8) can all be rewritten as $\mathcal{O}(\frac{\alpha\sigma^{2}}{n}+\alpha^{2}\sigma^{2})$. It follows from Theorem 2 that

\begin{array}{rl}
\text{N-CVX: }&\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}\right]\leq\mathcal{O}(\frac{1}{\alpha T})+\mathcal{O}(\frac{\alpha\sigma^{2}}{n}+\alpha^{2}\sigma^{2}),\\
\text{CVX: }&\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t})-f^{\star}\right]\leq\mathcal{O}(\frac{1}{\alpha T})+\mathcal{O}(\frac{\alpha\sigma^{2}}{n}+\alpha^{2}\sigma^{2}),\\
\text{S-CVX: }&\mathbb{E}\!\left[\|\bar{{\bf{x}}}^{T}-{\bf{x}}^{\star}\|^{2}\right]\leq\mathcal{O}((1-\alpha\mu)^{T})+\mathcal{O}(\frac{\alpha\sigma^{2}}{n}+\alpha^{2}\sigma^{2}).
\end{array}

Thus, Theorem 2 establishes that the linear speedup term $\mathcal{O}(\frac{\alpha\sigma^{2}}{n})$ can be achieved. When the stepsize is sufficiently small, the term $\mathcal{O}(\frac{\alpha\sigma^{2}}{n})$ dominates the $\mathcal{O}(\alpha^{2}\sigma^{2})$ term in the convergence rates (6), (7), and (8), so the rates improve linearly with the number of nodes $n$.

Setting $\alpha=\sqrt{\frac{n}{T}}$ for sufficiently large $T$ in the non-convex and convex settings, the rates are bounded by

\mathcal{O}\left(\sqrt{\frac{1}{nT}}+\frac{\sigma^{2}}{\sqrt{nT}}+\frac{n\chi\sigma^{2}}{(1-\lambda_{2})T}+\frac{n\chi^{2}}{(1-\lambda_{2})^{2}T^{2}}\right)=\mathcal{O}\left(\sqrt{\frac{1}{nT}}\right).

For the strongly convex setting, letting $\alpha=\frac{4\ln T^{2}}{\mu T}$ for sufficiently large $T$, it holds that $(1-\frac{\alpha\mu}{4})^{T}\leq\exp(-\frac{\alpha\mu T}{4})=\frac{1}{T^{2}}$, where $\exp(\cdot)$ denotes the exponential function; thus the rate is bounded by

\tilde{\mathcal{O}}\left(\frac{\sigma^{2}}{nT}+\frac{1}{T^{2}}+\frac{\chi\sigma^{2}}{(1-\lambda_{2})T^{2}}+\frac{\chi}{(1-\lambda_{2})T^{4}}\right)=\tilde{\mathcal{O}}\left(\frac{1}{nT}\right).

When $T$ is sufficiently large, the term $\frac{1}{\sqrt{nT}}$ (or $\frac{1}{nT}$ for the strongly convex setting) dominates the rate. In this regime, ProxSkip requires $T=\Omega\left(\frac{1}{n\epsilon^{2}}\right)$ (or $T=\Omega\left(\frac{1}{n\epsilon}\right)$) iterations to reach a desired $\epsilon$-accurate solution, so the number of iterations required decreases linearly with $n$.
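The two stepsize choices used in this discussion are easy to check numerically. The sketch below is only an illustration with hypothetical function names: it evaluates $\alpha=\sqrt{n/T}$, for which $\frac{1}{\alpha T}=\frac{1}{\sqrt{nT}}$, and $\alpha=\frac{4\ln T^{2}}{\mu T}$, for which the linear term satisfies $(1-\frac{\alpha\mu}{4})^{T}\leq\frac{1}{T^{2}}$ because $1-x\leq e^{-x}$.

```python
import math

def stepsize_sublinear(n, T):
    """alpha = sqrt(n / T), the choice discussed for the non-convex and convex rates."""
    return math.sqrt(n / T)

def stepsize_strongly_convex(mu, T):
    """alpha = 4 ln(T^2) / (mu T), the choice discussed for the strongly convex rate."""
    return 4.0 * math.log(T ** 2) / (mu * T)

def linear_term(mu, T):
    """(1 - alpha * mu / 4)^T evaluated at the strongly convex stepsize."""
    alpha = stepsize_strongly_convex(mu, T)
    return (1.0 - alpha * mu / 4.0) ** T
```

Note that `linear_term` does not actually depend on `mu`, since the product $\alpha\mu$ equals $\frac{4\ln T^{2}}{T}$; the strong convexity constant only enters through the remaining terms of (8). Both choices are meaningful only once $T$ is large enough for $\alpha$ to meet the stepsize condition of Theorem 2.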

In addition, based on Theorem 2, we can even get a tighter rate by carefully selecting the stepsize to obtain the following result.

Corollary 1.

Under the same settings as in Theorem 2, we have the following convergence results.

Non-convex: It holds that

$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}]\leq\epsilon\quad\text{ after }\quad\mathcal{O}\left(\frac{p\sigma^{2}}{n\epsilon^{2}}+\frac{p\sqrt{\chi}}{\sqrt{1-\lambda_{2}}}\frac{\sigma}{\epsilon^{\nicefrac{{3}}{{2}}}}+\frac{\chi}{(1-\lambda_{2})\epsilon}\right)$		(9)

expected communication rounds.

Convex: Under the additional Assumption 3 with $\mu\geq 0$ , it holds that

$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[f(\bar{{\bf{x}}}^{t})-f^{\star}]\leq\epsilon\quad\text{ after }\quad\mathcal{O}\left(\frac{p\sigma^{2}}{n\epsilon^{2}}+\frac{p\sqrt{\chi}}{\sqrt{1-\lambda_{2}}}\frac{\sigma}{\epsilon^{\nicefrac{{3}}{{2}}}}+\frac{\chi}{(1-\lambda_{2})\epsilon}\right)$		(10)

expected communication rounds.

Strongly Convex: Under the additional Assumption 3 with $\mu>0$ , it holds that

$\displaystyle\mathbb{E}[\|\bar{{\bf{x}}}^{T}-{\bf{x}}^{\star}\|^{2}]\leq\epsilon\quad\text{ after }\quad\tilde{\mathcal{O}}\left(\frac{p\sigma^{2}}{n\epsilon}+\frac{p\sqrt{\chi}}{\sqrt{1-\lambda_{2}}}\frac{\sigma}{\sqrt{\epsilon}}+\frac{\chi\log\nicefrac{{1}}{{\epsilon}}}{1-\lambda_{2}}\right)$		(11)

expected communication rounds. Here, the notation $\tilde{\mathcal{O}}(\cdot)$ ignores logarithmic factors.

We provide Table 2 to compare the convergence results of ProxSkip with those of existing state-of-the-art decentralized optimization algorithms with local updates, such as local-DSGD [35], $K$-GT [40], and LED [45], in terms of the number of communication rounds needed to achieve accuracy $\epsilon>0$.

Table 2: A comparison with existing methods employing local steps. $\rho=1-\lambda_{2}$, $K$ denotes the number of local steps, and LS-NI denotes linear speedup with network-independent stepsizes.

Method	# communication rounds (N-CVX/CVX)	# communication rounds (S-CVX)	LS-NI (S-CVX)
local-DSGD [35]	$\mathcal{O}\left(\frac{\sigma^{2}}{nK\epsilon^{2}}+\left(\frac{\sigma}{\sqrt{\rho K}}+\frac{\varsigma}{\rho}\right)\frac{1}{\epsilon^{\nicefrac{{3}}{{2}}}}+\frac{1}{\rho\epsilon}\right)$ ^a	$\tilde{\mathcal{O}}\left(\frac{\sigma^{2}}{nK\epsilon}+\left(\frac{\sigma}{\sqrt{\rho K}}+\frac{\varsigma}{\rho}\right)\frac{1}{\sqrt{\epsilon}}+\frac{1}{\rho}\log\nicefrac{{1}}{{\epsilon}}\right)$	✗
$K$-GT [40]	$\mathcal{O}\left(\frac{\sigma^{2}}{nK\epsilon^{2}}+\left(\frac{\sigma}{\rho^{2}\sqrt{K}}\right)\frac{1}{\epsilon^{\nicefrac{{3}}{{2}}}}+\frac{1}{\rho^{2}\epsilon}\right)$ ^b	no results	✗
Periodical GT [40]	$\mathcal{O}\left(\frac{\sigma^{2}}{nK\epsilon^{2}}+\left(\frac{\sigma}{\rho^{2}}\right)\frac{1}{\epsilon^{\nicefrac{{3}}{{2}}}}+\frac{1}{\rho^{2}\epsilon}\right)$ ^b	no results	✗
LED [45]	$\mathcal{O}\left(\frac{\sigma^{2}}{nK\epsilon^{2}}+\left(\frac{\sigma}{\sqrt{\rho K}}\right)\frac{1}{\epsilon^{\nicefrac{{3}}{{2}}}}+\frac{1}{\rho\epsilon}\right)$	$\tilde{\mathcal{O}}\left(\frac{\sigma^{2}}{nK\epsilon}+\left(\frac{\sigma}{\sqrt{\rho K}}\right)\frac{1}{\sqrt{\epsilon}}+\frac{1}{\rho}\log\nicefrac{{1}}{{\epsilon}}\right)$	✗
ProxSkip	$\mathcal{O}\left(\frac{p\sigma^{2}}{n\epsilon^{2}}+\frac{p\sqrt{\chi}}{\sqrt{\rho}}\frac{\sigma}{\epsilon^{\nicefrac{{3}}{{2}}}}+\frac{\chi}{\rho\epsilon}\right)$	$\tilde{\mathcal{O}}\left(\frac{p\sigma^{2}}{n\epsilon}+\frac{p\sqrt{\chi}}{\sqrt{\rho}}\frac{\sigma}{\sqrt{\epsilon}}+\frac{\chi}{\rho}\log\nicefrac{{1}}{{\epsilon}}\right)$	✓

^a $\varsigma^{2}$ is the function heterogeneity constant such that $\nicefrac{{1}}{{n}}\sum_{i=1}^{n}\|\nabla f_{i}({\bf{x}})-\nabla f({\bf{x}}^{\star})\|^{2}\leq\varsigma^{2}$.
^b The result is for the non-convex setting, and no corresponding result is given for the convex setting.

Achieving acceleration by $p$ and $n$ . According to (9), (10), and (11), when $\epsilon$ is sufficiently small, the convergence rate is dominated by noise and is unaffected by the graph parameter $1-\lambda_{2}$ for ProxSkip. After enough transient time, ProxSkip with $\mathcal{O}\big{(}\frac{p\sigma^{2}}{n\epsilon^{2}}\big{)}$ (or $\tilde{\mathcal{O}}\big{(}\frac{p\sigma^{2}}{n\epsilon}\big{)}$ for the strongly convex setting) achieves linear speedup by the probability of communication $p$ and the number of nodes $n$ .
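To make the roles of $p$ and $n$ concrete, the three terms of the bound (9) can be evaluated for given problem parameters. The sketch below is illustrative only: it sets every constant hidden in the $\mathcal{O}(\cdot)$ notation to $1$, including the one in $\chi=\mathcal{O}(\max\{1,\nicefrac{(1-p)}{(1-\lambda_{2})}\})$, so the returned numbers show scaling rather than actual round counts. The function names are hypothetical.

```python
import math

def chi_choice(p, lambda2):
    """chi = max{1, (1 - p) / (1 - lambda2)}, with the hidden constant taken as 1 (assumption)."""
    return max(1.0, (1.0 - p) / (1.0 - lambda2))

def comm_rounds_nonconvex(p, n, sigma, eps, lambda2):
    """Return the three terms of bound (9), with all hidden constants set to 1 (assumption)."""
    rho = 1.0 - lambda2
    chi = chi_choice(p, lambda2)
    noise = p * sigma ** 2 / (n * eps ** 2)              # linear-speedup term
    mixed = p * math.sqrt(chi / rho) * sigma / eps ** 1.5
    network = chi / (rho * eps)
    return noise, mixed, network
```

For example, with `p=0.25`, `n=8`, `sigma=1.0`, `eps=1e-6`, and `lambda2=0.8`, the first term is the largest of the three, and doubling `n` or halving `p` halves it, which is the linear speedup in $n$ and the reduction proportional to $p$ described above.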

Removing dependence on data heterogeneity. According to Table 2, the second term of the communication complexity of local-DSGD [35], a popular algorithm for decentralized optimization, is as follows:

\text{N-CVX/CVX: }\mathcal{O}\left(\left(\nicefrac{{\sigma}}{{\sqrt{\rho K}}}+\nicefrac{{\varsigma}}{{\rho}}\right)\epsilon^{-\nicefrac{{3}}{{2}}}\right),\quad\text{S-CVX: }\tilde{\mathcal{O}}\left(\left(\nicefrac{{\sigma}}{{\sqrt{\rho K}}}+\nicefrac{{\varsigma}}{{\rho}}\right)\epsilon^{-\nicefrac{{1}}{{2}}}\right).

Here, $\varsigma^{2}$ represents the function heterogeneity constant such that $\nicefrac{{1}}{{n}}\sum_{i=1}^{n}\|\nabla f_{i}(\mathbf{x})-\nabla f(\mathbf{x}^{\star})\|^{2}\leq\varsigma^{2}$. We note that ProxSkip lacks the additional term $\frac{\varsigma}{\rho}\frac{1}{\epsilon^{\nicefrac{{3}}{{2}}}}$ ($\frac{\varsigma}{\rho}\frac{1}{\sqrt{\epsilon}}$ for the strongly convex case). Thus, ProxSkip effectively eliminates dependence on the data heterogeneity level $\varsigma^{2}$.

Comparable with existing decentralized algorithms incorporating local updates. When $p<\lambda_{2}$, we have $\chi=\mathcal{O}(\max\{1,\frac{1-p}{1-\lambda_{2}}\})=\mathcal{O}(\frac{1}{1-\lambda_{2}})$; when $p\geq\lambda_{2}$, $\chi=\mathcal{O}(\max\{1,\frac{1-p}{1-\lambda_{2}}\})=\mathcal{O}(1)$. Highlighting the network quantities (recall that $\rho=1-\lambda_{2}$), the second and third terms of the communication complexity of ProxSkip are $\mathcal{O}\left(p\rho^{-1}+\rho^{-2}\right)$ when $p<1-\rho$, and $\mathcal{O}\left(p\rho^{-\nicefrac{{1}}{{2}}}+\rho^{-1}\right)$ when $p\geq 1-\rho$. Compared with GT-based methods [40], the network-dependent bounds are improved. Let $p=\nicefrac{{1}}{{K}}$. Since the first term of the communication complexity of ProxSkip, local-DSGD [35], $K$-GT [40], Periodical-GT [40], and LED [45] is $\mathcal{O}(\frac{\sigma^{2}}{nK\epsilon^{2}})$ (or $\tilde{\mathcal{O}}(\frac{\sigma^{2}}{nK\epsilon})$ for the strongly convex setting), where $K$ denotes the number of local steps, the convergence rates of ProxSkip are comparable with those of these existing decentralized algorithms incorporating local updates.

3.3 Achieving linear speedup with network-independent stepsizes

Theorem 3.

Suppose that Assumptions 1, 2, and 4 hold, and $f_{i}$ is $\mu$-strongly convex for some $0<\mu\leq L$. If $0<\alpha\leq\frac{1}{2L}$ and $\beta=p$, there exists $\chi=\mathcal{O}(\max\{\frac{1}{p},\frac{1}{1-\lambda_{2}},\frac{1-p}{1-\lambda_{2}}\})$ such that

$\displaystyle\mathbb{E}\!\left[\big\|\bar{{\bf{x}}}^{t+1}-{\bf{x}}^{\star}\big\|^{2}\right]\leq\zeta_{0}^{t+1}a_{0}+\frac{\alpha\sigma^{2}}{n\mu}+\mathcal{O}(\alpha^{2}\sigma^{2}),$		(12)

where $a_{0}$ is a constant that depends on the initialization and $\zeta_{0}=\max\{1-\alpha\mu,\sqrt{1-\frac{(1-\lambda_{2})p^{2}}{2\chi}}\}<1$ .

According to this rate, a linear speedup term of $\mathcal{O}(\frac{\alpha\sigma^{2}}{n})+\mathcal{O}(\alpha^{2})$ can be achieved. Importantly, the upper bound on the step size is independent of network topologies, making it a favorable property for practical implementation. Referring to Table 2, in the strongly convex setting, while local-DSGD [35], $K$ -GT [40], and LED [45] achieve linear speedup bounds, this property hinges on the requirement of network-dependent step sizes, wherein these step sizes are correlated with $1-\lambda_{2}$ . In contrast, the step size condition for ProxSkip is $0<\alpha\leq\frac{1}{2L}$ , which remains independent of $1-\lambda_{2}$ .

Notably, [53] proves for the first time that NIDS/ED/D² can achieve linear speedup with network-independent stepsizes. However, whether a linear speedup bound can be achieved with local updates using network-independent stepsizes has remained an open question. Theorem 3 answers this question in the affirmative.

3.4 Proof sketch of the main theorem

The existing convergence analyses of ProxSkip [46, 47, 48, 49, 50, 51] rely on primal-dual methodologies. Nevertheless, these analyses are limited to the use of first-order (stochastic) gradient information, leading to a suboptimal exploitation of the available function data. To fully utilize function and gradient information, we propose a new proof that uses matrix factorization techniques to equivalently transform the iteration of ProxSkip into an “SGD + consensus” form.

Here, we provide a proof sketch for Theorem 2 concerning non-convex objectives.

Step 1. (Lemma 1) We first give the equivalent form of update (4) as follows.

\displaystyle\bar{{\bf{x}}}^{t+1}=\bar{{\bf{x}}}^{t}-\alpha\bar{{\bf{g}}}^{t},% \quad\mathcal{E}^{t+1}={\bf{\Gamma}}\mathcal{E}^{t}+\bm{\Theta}_{1}^{t}+\bm{% \Theta}_{2}^{t},

(13)

where $\bar{{\bf{g}}}^{t}=\frac{1}{n}\sum_{i=1}^{n}{\bf{g}}_{i}^{t}$ , $\|{\bf{\Gamma}}\|<1$ , $\mathcal{E}^{t}$ measures “consensus”, $\bm{\Theta}_{1}^{t}$ is related to the stochastic gradient, and $\bm{\Theta}_{2}^{t}$ measures “communication error” ( $\bm{\Theta}_{2}^{t}=0$ , if $\theta_{t}=1$ ). This description may be less rigorous, but it helps to understand the proof more clearly. See Lemma 1 in Appendix for more details.

Step 2. (Lemma 2) Based on this equivalent update of ProxSkip and by the $L$ -smoothness of $f_{i}$ , we establish the following descent inequality.

\displaystyle\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t+1})\right]\leq

\displaystyle f(\bar{{\bf{x}}}^{t})-\frac{\alpha}{2}\big{\|}\nabla f(\bar{{\bf% {x}}}^{t})\big{\|}^{2}+\frac{2\alpha L^{2}}{n}\big{\|}\mathcal{E}^{t}\big{\|}_% {\mathrm{F}}^{2}+{\frac{L\alpha^{2}\sigma^{2}}{2n}}.

Averaging both sides over $t=0,1,\ldots,T-1$ , we have

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\big{\|}\nabla f(% \bar{{\bf{x}}}^{t})\big{\|}^{2}\right]\leq\frac{2F_{0}}{\alpha T}+\frac{4L^{2}% }{nT}\sum_{t=0}^{T-1}\big{\|}\mathcal{E}^{t}\big{\|}_{\mathrm{F}}^{2}+{\frac{% \alpha L\sigma^{2}}{n}}.

(14)

Step 3. (Lemma 2) Subsequently, we establish the following consensus inequality.

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\right]% \leq\tilde{\gamma}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+A_{1}n\alpha^{4}\|% \nabla f(\bar{{\bf{x}}}^{t})\|^{2}+A_{2}n\alpha^{2}\sigma^{2}+A_{3}\alpha^{4}% \sigma^{2},

where $0<\tilde{\gamma}<1$ and $A_{1},A_{2},A_{3}>0$ . Unrolling this recurrence, we have

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}\right]\leq% \frac{\|\mathcal{E}^{0}\|_{\mathrm{F}}^{2}}{1-\tilde{\gamma}}+\frac{A_{2}n% \alpha^{2}\sigma^{2}+A_{3}\alpha^{4}\sigma^{2}}{1-\tilde{\gamma}}+A_{1}n\alpha% ^{4}\sum_{k=0}^{t-1}\tilde{\gamma}^{t-1-k}\|\nabla f(\bar{{\bf{x}}}^{k})\|^{2}.
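As a sanity check on the unrolling step, the following Python sketch iterates the consensus recurrence with equality (the worst case of the inequality) and compares it with the unrolled right-hand side. Here `b` stands for $A_{1}n\alpha^{4}$ and `c` for $A_{2}n\alpha^{2}\sigma^{2}+A_{3}\alpha^{4}\sigma^{2}$ ; the function names and all numerical values are hypothetical and serve only as an illustration.

```python
def run_recurrence(e0, gamma_t, b, c, grads):
    """Iterate e_{t+1} = gamma_t * e_t + b * grads[t] + c (inequality taken with equality)."""
    errors = [e0]
    for g in grads:
        errors.append(gamma_t * errors[-1] + b * g + c)
    return errors


def unrolled_bound(e0, gamma_t, b, c, grads, t):
    """Right-hand side of the unrolled consensus bound at iteration t."""
    geometric = sum(gamma_t ** (t - 1 - k) * grads[k] for k in range(t))
    return e0 / (1.0 - gamma_t) + c / (1.0 - gamma_t) + b * geometric


# Hypothetical values: grads[k] plays the role of ||grad f(x_bar^k)||^2.
grads = [1.0 / (k + 1) for k in range(50)]
errors = run_recurrence(2.0, 0.9, 0.05, 0.01, grads)
gaps = [unrolled_bound(2.0, 0.9, 0.05, 0.01, grads, t) - errors[t]
        for t in range(len(errors))]
```

Every entry of `gaps` is non-negative, since $\tilde{\gamma}^{t}\leq\frac{1}{1-\tilde{\gamma}}$ and $\sum_{j=0}^{t-1}\tilde{\gamma}^{j}\leq\frac{1}{1-\tilde{\gamma}}$ .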

Step 4. Since $\sum_{t=0}^{T-1}\sum_{k=0}^{t-1}\tilde{\gamma}^{t-1-k}\|\nabla f(\bar{{\bf{x}}}^{k})\|^{2}\leq\frac{1}{1-\tilde{\gamma}}\sum_{t=0}^{T-1}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}$ , choosing $\alpha$ such that $\frac{4L^{2}A_{1}\alpha^{4}}{1-\tilde{\gamma}}\leq\frac{1}{2}$ gives $\frac{4L^{2}}{nT}\sum_{t=0}^{T-1}\big{\|}\mathcal{E}^{t}\big{\|}_{\mathrm{F}}^{2}\leq\mathcal{O}(\alpha^{2}\sigma^{2})+\frac{1}{2T}\sum_{t=0}^{T-1}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}$ . Combining this with (14) completes the proof, i.e.,

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\|\nabla f(\bar{{\bf% {x}}}^{t})\|^{2}\right]\leq\mathcal{O}\left(\frac{1}{\alpha T}\right)+\mathcal% {O}\left(\frac{\alpha\sigma^{2}}{n}+\alpha^{2}\sigma^{2}\right).
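For completeness, the double-sum bound used in Step 4 follows from exchanging the order of summation. A short derivation, writing $a_{k}=\|\nabla f(\bar{{\bf{x}}}^{k})\|^{2}\geq 0$ as shorthand (a notation introduced here only for brevity), reads:

```latex
\begin{aligned}
\sum_{t=0}^{T-1}\sum_{k=0}^{t-1}\tilde{\gamma}^{t-1-k}a_{k}
&=\sum_{k=0}^{T-2}a_{k}\sum_{t=k+1}^{T-1}\tilde{\gamma}^{t-1-k}
=\sum_{k=0}^{T-2}a_{k}\sum_{j=0}^{T-2-k}\tilde{\gamma}^{j}\\
&\leq\sum_{k=0}^{T-2}\frac{a_{k}}{1-\tilde{\gamma}}
\leq\frac{1}{1-\tilde{\gamma}}\sum_{t=0}^{T-1}a_{t}.
\end{aligned}
```

The first inequality bounds the finite geometric sum by $\frac{1}{1-\tilde{\gamma}}$ , and the second adds the non-negative term $a_{T-1}$ .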

4 Experiments

We empirically verify the theoretical results of ProxSkip for stochastic decentralized optimization. The experimental results for the deterministic case can be found in [46].

Setup. Similar to [46], we demonstrate our findings on the logistic regression problem with a regularizer. The objective function is $f({\bf{x}})=\frac{1}{n}\sum_{i=1}^{n}\big{\{}\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\ln(1+e^{-(\mathcal{A}_{ij}^{\sf T}{{\bf{x}}})\mathcal{B}_{ij}})\big{\}}+r({\bf{x}})$ . Here, $r({\bf{x}})$ is the regularizer, and each node $i$ holds its own training data $\left(\mathcal{A}_{ij},\mathcal{B}_{ij}\right)\in\mathbb{R}^{d}\times\{-1,1\},j=1,\cdots,m_{i}$ , consisting of sample vectors $\mathcal{A}_{ij}$ and corresponding classes $\mathcal{B}_{ij}$ . We use the dataset ijcnn1 from the widely used LIBSVM library [54], which has $d=22$ attributes and $\sum_{i=1}^{n}m_{i}=49950$ samples. Moreover, the training samples are randomly and evenly distributed over all the $n$ agents. We control the stochastic noise $\sigma^{2}$ by adding Gaussian noise to every stochastic gradient, i.e., the stochastic gradients are generated as $\nabla F_{i}({\bf{x}})=\nabla f_{i}({\bf{x}})+\omega_{i}$ , where $\omega_{i}\thicksim\mathcal{N}(0,\sigma^{2}{\bf{I}}_{d})$ and $\sigma^{2}=10^{-3}$ .
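The local objective and the noisy gradient oracle described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used for our experiments: the function names are hypothetical, the regularizer and its gradient are passed in as callables, and data loading from ijcnn1 is left out.

```python
import math
import random


def _sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def local_loss(x, samples, reg):
    """f_i(x) = mean_j ln(1 + exp(-(a_j^T x) b_j)) + r(x) for one node's samples."""
    total = 0.0
    for a, b in samples:
        margin = b * sum(ai * xi for ai, xi in zip(a, x))
        # Stable evaluation of ln(1 + exp(-margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(samples) + reg(x)


def local_grad(x, samples, reg_grad):
    """Exact gradient of f_i at x."""
    grad = [0.0] * len(x)
    for a, b in samples:
        margin = b * sum(ai * xi for ai, xi in zip(a, x))
        # d/dx ln(1 + exp(-margin)) = -b * a / (1 + exp(margin)).
        coeff = -b * _sigmoid(-margin)
        for k in range(len(x)):
            grad[k] += coeff * a[k]
    return [g / len(samples) + r for g, r in zip(grad, reg_grad(x))]


def stochastic_grad(x, samples, reg_grad, sigma2, rng):
    """Gradient plus Gaussian noise omega ~ N(0, sigma2 * I), as in the setup."""
    std = math.sqrt(sigma2)
    return [g + rng.gauss(0.0, std) for g in local_grad(x, samples, reg_grad)]
```

For example, `stochastic_grad(x, samples, lambda v: list(v), 1e-3, random.Random(0))` would correspond to the regularizer $r({\bf{x}})=\frac{1}{2}\|{\bf{x}}\|^{2}$ with $\sigma^{2}=10^{-3}$ ; the seed is an arbitrary choice.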

For all experiments, we first compute the solution ${\bf{x}}^{\star}$ to (1) by centralized methods, and then run over a randomly generated connected network with $n$ agents and $\frac{\iota n(n-1)}{2}$ undirected edges, where $\iota$ is the connectivity ratio. The mixing matrix ${\bf{W}}$ is generated with the Metropolis-Hastings rule. All stochastic results are averaged over 10 runs.
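One common form of the Metropolis-Hastings rule sets $w_{ij}=\nicefrac{{1}}{{(1+\max\{d_{i},d_{j}\})}}$ for each edge $(i,j)$ , where $d_{i}$ is the degree of node $i$ , and assigns the remaining mass to the diagonal. The Python sketch below implements this variant; that this is exactly the variant used in our experiments is an assumption, and the function name is hypothetical.

```python
def metropolis_hastings_weights(n, edges):
    """Build a symmetric, doubly stochastic mixing matrix for an undirected graph."""
    neighbors = [set() for _ in range(n)]
    for i, j in edges:
        if i != j:  # ignore self-loops; the diagonal is set below
            neighbors[i].add(j)
            neighbors[j].add(i)
    degree = [len(nb) for nb in neighbors]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            W[i][j] = 1.0 / (1 + max(degree[i], degree[j]))
        # The diagonal entry absorbs the remaining mass so each row sums to one.
        W[i][i] = 1.0 - sum(W[i])
    return W


# Illustrative example: a ring of 5 agents.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
W_ring = metropolis_hastings_weights(5, ring)
```

Because $w_{ij}=w_{ji}$ by construction, the resulting matrix is symmetric, and the diagonal correction makes every row (hence every column) sum to one.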

Achieving linear speedup by $n$ and $\nicefrac{{1}}{{p}}$ . We choose the regularizer $r({\bf{x}})=\frac{1}{2}\|{\bf{x}}\|^{2}$ to demonstrate the results in the strongly convex setting. The results are shown in Fig. 1. The relative error $\nicefrac{{\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|}}{{\|{\bf{x}}^{\star}\|}}$ is shown on the $y$ -axis. Here, we set $\alpha=\frac{1}{2L}$ , which is independent of the network topology. We show the performance of ProxSkip for different numbers of nodes $n$ , network connectivity ratios $\iota$ , and communication probabilities $p$ . The results show that, as the number of nodes increases, the relative error of ProxSkip is reduced under a constant and network-independent stepsize, which validates our results about linear speedup. Moreover, Fig. 1 shows that we can save communication rounds by reducing $p$ , i.e., increasing the number of local steps reduces the amount of communication required to achieve the same level of accuracy.

Figure 1: Experimental results for ProxSkip on the logistic regression problem with a strongly convex regularizer $r({\bf{x}})=\frac{1}{2}\|{\bf{x}}\|^{2}$ over the ijcnn1 dataset.

Comparing with existing decentralized algorithms. We choose the regularizer $r({\bf{x}})=\sum_{j=1}^{d}\frac{{\bf{x}}(j)^{2}}{1+{\bf{x}}(j)^{2}}$ , $n=10$ , and $\iota=0.1$ to demonstrate the results in the non-convex setting, where ${\bf{x}}=\mathrm{col}\{{\bf{x}}(j)\}_{j=1}^{d}\in\mathbb{R}^{d}$ . In this case, we compare ProxSkip to the decentralized methods local-DSGD [35], $K$ -GT [40], and LED [45] for different numbers of local steps $\nicefrac{{1}}{{p}}=10,5,1$ . We use the same stepsize $\alpha=0.01$ for all algorithms. Fig. 2 shows that ProxSkip and LED perform similarly, and that they outperform the other methods as we increase the number of local steps.
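The non-convex regularizer and its gradient admit a simple closed form. The Python sketch below (function names are hypothetical) implements $r({\bf{x}})$ , its gradient, and the scalar second derivative $\frac{2-6v^{2}}{(1+v^{2})^{3}}$ , which is negative for $|v|>\nicefrac{{1}}{{\sqrt{3}}}$ and thus confirms that $r$ is not convex.

```python
def nonconvex_reg(x):
    """r(x) = sum_j x_j^2 / (1 + x_j^2)."""
    return sum(v * v / (1.0 + v * v) for v in x)


def nonconvex_reg_grad(x):
    """Gradient of r: each coordinate is 2 x_j / (1 + x_j^2)^2."""
    return [2.0 * v / (1.0 + v * v) ** 2 for v in x]


def reg_second_derivative(v):
    """Second derivative of v^2 / (1 + v^2), equal to (2 - 6 v^2) / (1 + v^2)^3."""
    return (2.0 - 6.0 * v * v) / (1.0 + v * v) ** 3
```

Since each summand is bounded by one, $r$ takes values in $[0,d)$ , and its gradient is Lipschitz continuous, so adding it preserves the smoothness of the logistic loss.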

5 Conclusion

This paper revisits the convergence bounds of ProxSkip for stochastic decentralized optimization. We present a new analysis with a novel proof technique applicable to stochastic non-convex, convex, and strongly convex settings. Through this comprehensive analysis, we derive several new results that rival the bounds of state-of-the-art decentralized algorithms [35, 40, 45]. We establish that the leading communication complexity of ProxSkip is $\mathcal{O}(pn^{-1}\sigma^{2})$ , indicating that ProxSkip can achieve acceleration by $p$ and $n$ . Our proposed proof technique overcomes the analytical limitations of prior work [46, 47, 48, 49, 50, 51] and might be of independent interest in the community.

References

[1]
[2] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. Adv. Neural Inf. Process. Sys., pp. 5330–5340, 2017.
[3] M. Assran, N. Loizou, N. Ballas, and M. Rabbat, “Stochastic gradient push for distributed deep learning,” in Proc. Int. Conf. Mach. Learn., pp. 344–353, 2019.
[4] A. Koloskova, T. Lin, and S. Stich, “An improved analysis of gradient tracking for decentralized machine learning,” in Proc. Adv. Neural Inf. Process. Sys., pp. 11422–11435, 2021.
[5] S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Trans. Signal Process., vol. 70, pp. 3264–3279, 2022.
[6] L. Guo, X. Shi, S. Yang, and J. Cao, “DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,” IEEE Trans. Autom. Control, doi: 10.1109/TAC.2023.3301289, 2023.
[7] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Proc. Adv. Neural Inf. Process. Sys., 2017.
[8] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proc. Int. Conf. Mach. Learn., pp. 560–569, 2018.
[9] J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient sparsification for communication-efficient distributed optimization,” in Proc. Adv. Neural Inf. Process. Sys., 2018.
[10] A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in Proc. Int. Conf. Mach. Learn., pp. 3478–3487, 2019.
[11] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, “Error feedback fixes signSGD and other gradient compression schemes,” in Proc. Int. Conf. Mach. Learn., pp. 3252–3261, 2019.
[12] I. Fatkhullin, A. Tyurin, and P. Richtárik, “Momentum provably improves error feedback!,” in Proc. Adv. Neural Inf. Process. Sys., 2023.
[13] A. Tyurin and P. Richtárik, “2Direction: Theoretically faster distributed training with bidirectional communication compression,” in Proc. Adv. Neural Inf. Process. Sys., 2023.
[14] A. Sadiev, D. Kovalev, and P. Richtárik, “Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact Prox,” in Proc. Adv. Neural Inf. Process. Sys., pp. 21777–21791, 2022.
[15] D. Kovalev, A. Salim, and P. Richtárik, “Optimal and practical algorithms for smooth and strongly convex decentralized optimization,” in Proc. Adv. Neural Inf. Process. Sys., pp. 18342–18352, 2020.
[16] H. Li, C. Fang, W. Yin and Z. Lin, “Decentralized accelerated gradient methods with increasing penalty parameters,” IEEE Trans. Signal Process., vol. 68, pp. 4855–4870, 2020.
[17] H. Li, Z. Lin, and Y. Fang, “Variance reduced EXTRA and DIGing and their optimal acceleration for strongly convex decentralized optimization,” J. Mach. Learn. Res., vol. 23, 2022.
[18] H. Hendrikx, F. Bach, and L. Massoulié, “An optimal algorithm for decentralized finite-sum optimization,” SIAM J. Optim., vol. 31, no. 4, pp. 2753–2783, 2021.
[19] Z. Song, L. Shi, S. Pu, and M. Yan, “Optimal gradient tracking for decentralized optimization,” Math. Program., 2023. doi: 10.1007/s10107-023-01997-7.
[20] T. Lin, S. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” in Proc. Int. Conf. Learn. Represent., 2018, arXiv:1808.07217. [Online]. Available: https://arxiv.longhoe.net/abs/1808.07217.
[21] B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir, and N. Srebro, “Is Local SGD Better than Minibatch SGD?,” in Proc. Int. Conf. Mach. Learn., pp. 10334–10343, 2020.
[22] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A.T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” 2016, arXiv:1610.05492. [Online]. Available: https://arxiv.longhoe.net/abs/1610.05492.
[23] S. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learn. Represent., 2018, arXiv:1805.09767. https://arxiv.longhoe.net/abs/1805.09767.
[24] H. Yang, M. Fang, and J. Liu, “Achieving linear speedup with partial worker participation in non-IID federated learning,” in Proc. Int. Conf. Learn. Represent., 2021.
[25] A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data,” in Proc. Int. Conf. Artif. Intell. Statist., pp. 4519–4529, 2020.
[26] J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,” J. Mach. Learn. Res., vol. 22, no. 213, 2021.
[27] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” in Proc. Int. Conf. Mach. Learn., pp. 5132–5143, 2020.
[28] Z. Cheng, X. Huang, and K. Yuan, “Momentum benefits non-IID federated learning simply and provably,” in Proc. Int. Conf. Learn. Represent., 2024.
[29] A. Mitra, R. Jaafar, G. J. Pappas, and H. Hassani, “Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,” in Proc. Adv. Neural Inf. Process. Sys., pp. 14606–14619, 2021.
[30] X. Zhang, M. Hong, S. Dhople, W. Yin, and Y. Liu, “FedPD: A federated learning framework with adaptivity to Non-IID data,” IEEE Trans. Signal Process., vol. 69, pp. 6055–6070, 2021.
[31] D. A. E. Acar, Y. Zhao, R. M. Navarro, M. Mattina, P. N. Whatmough, and V. Saligrama, “Federated learning based on dynamic regularization,” in Proc. Int. Conf. Learn. Represent., 2021.
[32] X. Liang, S. Shen, J. Liu, Z. Pan, E. Chen, and Y. Cheng, “Variance reduced local SGD with lower communication complexity,” 2019, arXiv:1912.12844. https://arxiv.longhoe.net/abs/1912.12844.
[33] F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi, “Federated learning with compression: Unified analysis and sharp guarantees,” in Proc. Int. Conf. Artif. Intell. Statist., pp. 2350–2358, 2021.
[34] X. Huang, P. Li, and X. Li, “Stochastic controlled averaging for federated learning with communication compression,” in Proc. Int. Conf. Learn. Represent., 2024.
[35] A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates,” in Proc. Int. Conf. Mach. Learn., pp. 5381–5393, 2020.
[36] S. Pu and A. Nedić, “Distributed stochastic gradient tracking methods,” Math. Program., vol. 187, pp. 409–457, 2021.
[37] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM J. Optim., vol. 27, no. 4, pp. 2597–2633, 2017.
[38] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245–1260, Sep. 2018.
[39] E. D. H. Nguyen, S. A. Alghunaim, K. Yuan, and C. A. Uribe, “On the performance of gradient tracking with local updates,” 2022, arXiv:2210.04757, [Online]. Available: https://arxiv.longhoe.net/abs/2210.04757.
[40] Y. Liu, T. Lin, A. Koloskova, and S. U. Stich, “Decentralized gradient tracking with local steps,” 2023, arXiv:2301.01313, [Online]. Available: https://arxiv.longhoe.net/abs/2301.01313.
[41] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning-part I: Algorithm development,” IEEE Trans. Signal Process., vol. 67, no. 3, pp. 708–723, Feb. 2019.
[42] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4494–4506, Sep. 2019.
[43] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D²: Decentralized training over decentralized data,” in Proc. Int. Conf. Mach. Learn., pp. 4848–4856, 2018.
[44] L. Guo, X. Shi, J. Cao, and Z. Wang, “Decentralized inexact proximal gradient method with network-independent stepsizes for convex composite optimization,” IEEE Trans. Signal Process., vol. 71, pp. 786–801, 2023.
[45] S. A. Alghunaim, “Local exact-diffusion for decentralized optimization and learning,” 2023, arXiv:2302.00620, [Online]. Available: https://arxiv.longhoe.net/abs/2302.00620.
[46] K. Mishchenko, G. Malinovsky, S. Stich, and P. Richtárik, “ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,” in Proc. Int. Conf. Mach. Learn., pp. 15750–15769, 2022.
[47] L. Condat, I. Agarský, G. Malinovsky, and P. Richtárik, “TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation,” 2023, arXiv:2302.09832, [Online]. Available: https://arxiv.longhoe.net/abs/2302.09832.
[48] L. Condat, I. Agarský, and P. Richtárik, “Provably doubly accelerated federated learning: The first theoretically successful combination of local training and communication compression,” 2023, arXiv:2210.13277, [Online]. Available: https://arxiv.longhoe.net/abs/2210.13277.
[49] G. Malinovsky, K. Yi, and P. Richtárik, “Variance reduced ProxSkip: Algorithm, theory and application to federated learning,” in Proc. Adv. Neural Inf. Process. Sys., pp. 15176–15189, 2022.
[50] Z. Hu and H. Huang, “Tighter analysis for ProxSkip,” in Proc. Int. Conf. Mach. Learn., pp. 13469–13496, 2023.
[51] L. Condat and P. Richtárik, “RandProx: Primal-dual optimization algorithms with randomized proximal updates,” in Proc. Int. Conf. Learn. Represent., 2023.
[52] K. Scaman, F. Bach, S. Bubeck, Y.-T. Lee, and L. Massoulié, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” in Proc. Int. Conf. Mach. Learn., pp. 3027–3036, 2017.
[53] H. Yuan, S. A. Alghunaim, and K. Yuan, “Achieving linear speedup with network-independent learning rates in decentralized stochastic optimization,” in Proc. IEEE Conf. Decis. Control, pp. 139–144, 2023.
[54] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, 2011, Art. no. 27.

Appendix

Appendix A Preliminaries

A.1 Basic Facts

The stochastic processes of randomized communication and gradient estimation generate two sequences of $\sigma$ -algebras. We denote by $\mathcal{G}^{t}$ the $\sigma$ -algebra of gradient estimation at the $t$ -th iteration and by $\mathcal{F}^{t}$ the $\sigma$ -algebra of randomized communication at the same step. The sequences $\{\mathcal{G}^{t}\}_{t\geq 0}$ and $\{\mathcal{F}^{t}\}_{t\geq 0}$ satisfy

\mathcal{G}^{0}\subset\mathcal{F}^{0}\subset\mathcal{G}^{1}\subset\mathcal{F}^% {1}\subset\mathcal{G}^{2}\subset\mathcal{F}^{2}\subset\cdots\subset\mathcal{G}% ^{t}\subset\mathcal{F}^{t}\subset\cdots\ .

With these notations, we can clarify the stochastic dependencies among the variables generated by Algorithm 1, i.e., $({\bf{G}}^{t},\hat{{\bf{Z}}}^{t})$ is measurable in $\mathcal{G}^{t}$ and $({\bf{Y}}^{t+1},{\bf{X}}^{t+1})$ is measurable in $\mathcal{F}^{t}$ .

The Bregman divergence of $f$ at points $(x,y)$ is defined by

D_{f}(x,y):=f(x)-f(y)-\langle\nabla f(y),x-y\rangle.

It is easy to verify that $\langle\nabla f(x)-\nabla f(y),x-y\rangle=D_{f}(x,y)+D_{f}(y,x)$ . If $f$ is convex, from the definition of convex function, we have $D_{f}(x,y)\geq 0$ and $D_{f}(y,x)\geq 0$ . Thus

\displaystyle\langle\nabla f(x)-\nabla f(y),x-y\rangle\geq D_{f}(x,y),\text{ % and }\langle\nabla f(x)-\nabla f(y),x-y\rangle\geq D_{f}(y,x).

(15)
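The identity $\langle\nabla f(x)-\nabla f(y),x-y\rangle=D_{f}(x,y)+D_{f}(y,x)$ and the resulting inequalities (15) are easy to check numerically. The Python sketch below does so for one convex test function; the function $f(x)=\sum_{i}(x_{i}^{4}+\frac{1}{2}x_{i}^{2})$ , the helper names, and the test points are illustrative choices and do not appear elsewhere in this paper.

```python
def bregman(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    gy = grad_f(y)
    return f(x) - f(y) - sum(g * (xi - yi) for g, xi, yi in zip(gy, x, y))


def quartic(x):
    """Example convex function f(x) = sum_i (x_i^4 + x_i^2 / 2)."""
    return sum(v ** 4 + 0.5 * v ** 2 for v in x)


def quartic_grad(x):
    """Gradient of the example function: 4 x_i^3 + x_i per coordinate."""
    return [4.0 * v ** 3 + v for v in x]


x, y = [0.3, -1.2], [1.0, 0.4]
inner = sum((gx - gy) * (xi - yi)
            for gx, gy, xi, yi in zip(quartic_grad(x), quartic_grad(y), x, y))
d_xy = bregman(quartic, quartic_grad, x, y)
d_yx = bregman(quartic, quartic_grad, y, x)
```

Here `inner` equals `d_xy + d_yx` up to rounding, and both divergences are non-negative because the test function is convex.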

For an $L$ -smooth and $\mu$ -strongly convex function $f$ , by [46, Appendix A] we have

\displaystyle\frac{\mu}{2}\|x-y\|^{2}\leq D_{f}(x,y)\leq\frac{L}{2}\|x-y\|^{2},

(16)

\displaystyle\frac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^{2}\leq D_{f}(x,y)\leq\frac{1}{2\mu}\|\nabla f(x)-\nabla f(y)\|^{2}.

(17)

Under the $L$ -smoothness condition, we have

\displaystyle f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|x-y\|^{% 2},\ \forall x,y\in\mathbb{R}^{d}

(18)

A.2 Notations

For any $n\times m$ matrices $\mathbf{a}$ and $\mathbf{b}$ , their inner product is denoted as $\langle\mathbf{a},\mathbf{b}\rangle=\mathrm{Trace}(\mathbf{a}^{\sf T}\mathbf{b})$ . For a given matrix $\mathbf{a}$ , the Frobenius norm is given by $\|\mathbf{a}\|_{\mathrm{F}}$ , while the spectral norm is given by $\|\mathbf{a}\|$ . Define the gradient and communication noise as

	gradient noise:	$\displaystyle{\bf{S}}^{t}=[{\bf{s}}^{t}_{1},\ldots,{\bf{s}}^{t}_{n}]^{\sf T}={% \bf{G}}^{t}-\nabla F({\bf{X}}^{t}),\text{ where }{\bf{s}}^{t}_{i}={\bf{g}}_{i}% ^{t}-\nabla f_{i}({\bf{x}}_{i}^{t});$
	communication noise:	$\displaystyle{\bf{E}}^{t}=\frac{\theta_{t}-1}{2\chi}{\bf{W}}_{b}\hat{{\bf{Z}}}% ^{t}.$

We also define the following notations to simplify the analysis:

\displaystyle\bar{{\bf{x}}}^{t}\triangleq\big{(}\frac{1}{n}\sum_{i=1}^{n}{\bf{% x}}_{i}^{t}\big{)}^{\sf T},\quad\bar{{\bf{X}}}^{t}={\bf{1}}\otimes\bar{{\bf{x}% }}^{t},\quad\bar{{\bf{s}}}^{t}\triangleq\big{(}\frac{1}{n}\sum_{i=1}^{n}{\bf{s% }}_{i}^{t}\big{)}^{\sf T},\quad\overline{\nabla F}({\bf{X}}^{t})\triangleq\big% {(}\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bf{x}}^{t}_{i})\big{)}^{\sf T}.

With Assumption 1 and [5, Section IV-B], the mixing matrix ${\bf{W}}$ can be decomposed as

{\bf{W}}={\bf{P\Lambda P}}^{-1}=\left[\begin{array}[]{cc}{\bf{1}}&\hat{{\bf{P}% }}\\ \end{array}\right]\left[\begin{array}[]{cc}{\bf{I}}&0\\ 0&\hat{{\bf{\Lambda}}}\\ \end{array}\right]\left[\begin{array}[]{c}\frac{1}{n}{\bf{1}}^{\sf T}\\ \hat{{\bf{P}}}^{\sf T}\\ \end{array}\right],

where $\hat{{\bf{\Lambda}}}=\mathrm{diag}\{\lambda_{2},\ldots,\lambda_{n}\}$ , and matrix $\hat{{\bf{P}}}\in\mathbb{R}^{n\times(n-1)}$ satisfies

\hat{{\bf{P}}}^{\sf T}\hat{{\bf{P}}}={\bf{I}},\ {\bf{1}}^{\sf T}\hat{{\bf{P}}}% =0,\ \hat{{\bf{P}}}\hat{{\bf{P}}}^{\sf T}={\bf{I}}-\frac{1}{n}{\bf{11}}^{\sf T}.

Therefore, the matrices ${\bf{W}}_{a}$ and ${\bf{W}}_{b}^{2}$ can be decomposed as

\displaystyle{\bf{W}}_{a}=\left[\begin{array}[]{cc}{\bf{1}}&\hat{{\bf{P}}}\\ \end{array}\right]\underbrace{\left[\begin{array}[]{cc}{\bf{1}}&0\\ 0&\hat{{\bf{\Lambda}}}_{a}\\ \end{array}\right]}_{:={\bf{\Lambda}}_{a}}\left[\begin{array}[]{c}\frac{1}{n}{% \bf{1}}^{\sf T}\\ \hat{{\bf{P}}}^{\sf T}\\ \end{array}\right],~{}{\bf{W}}_{b}^{2}=\left[\begin{array}[]{cc}{\bf{1}}&\hat{% {\bf{P}}}\\ \end{array}\right]\underbrace{\left[\begin{array}[]{cc}0&0\\ 0&\hat{{\bf{\Lambda}}}_{b}^{2}\\ \end{array}\right]}_{:={\bf{\Lambda}}_{b}^{2}}\left[\begin{array}[]{c}\frac{1}% {n}{\bf{1}}^{\sf T}\\ \hat{{\bf{P}}}^{\sf T}\\ \end{array}\right],

(29)

where $\hat{{\bf{\Lambda}}}_{a}={\bf{I}}-\frac{1}{2\chi}({\bf{I}}-\hat{{\bf{\Lambda}}})$ , $\hat{{\bf{\Lambda}}}_{b}=\sqrt{{\bf{I}}-\hat{{\bf{\Lambda}}}}$ . Since $\lambda_{i}\in(-1,1)$ for $i=2,\dots,n$ , it holds that $1-\frac{1}{2\chi}(1-\lambda_{i})\in[0,1)$ and $0\preceq{\bf{W}}_{a}\prec{\bf{I}}$ for $\chi\geq 1$ .
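The spectral relations above can be illustrated by mapping the eigenvalues of ${\bf{W}}$ directly. The Python sketch below assumes, as the decomposition (29) suggests, that each eigenvalue $\lambda_{i}$ of ${\bf{W}}$ is mapped to $1-\frac{1}{2\chi}(1-\lambda_{i})$ for ${\bf{W}}_{a}$ and to $\sqrt{1-\lambda_{i}}$ for ${\bf{W}}_{b}$ ; the function names and the sample spectrum are hypothetical.

```python
import math


def wa_eigenvalues(w_eigs, chi):
    """Map each eigenvalue lam of W to 1 - (1 - lam) / (2 * chi)."""
    return [1.0 - (1.0 - lam) / (2.0 * chi) for lam in w_eigs]


def wb_eigenvalues(w_eigs):
    """Map each eigenvalue lam of W to sqrt(1 - lam), clipped at zero for rounding safety."""
    return [math.sqrt(max(1.0 - lam, 0.0)) for lam in w_eigs]


# Illustrative spectrum: lambda_1 = 1 and two further eigenvalues in (-1, 1).
spectrum = [1.0, 0.5, -0.9]
wa = wa_eigenvalues(spectrum, chi=1.0)
wb = wb_eigenvalues(spectrum)
```

With $\chi=1$ the eigenvalue $1$ is preserved, the eigenvalue $0$ of ${\bf{W}}_{b}$ corresponds to the consensus direction, and the remaining eigenvalues of ${\bf{W}}_{a}$ lie in $[0,1)$ .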

Appendix B Proof of Theorem 2 and Corollary 1

B.1 Transformation and Some Descent Inequalities

Here, we introduce an auxiliary variable ${\bf{R}}^{t}={\bf{Y}}^{t}+\alpha\nabla F(\bar{{\bf{X}}}^{t})$ , where $\bar{{\bf{X}}}^{t}=\mathbf{1}\otimes\bar{{\bf{x}}}^{t}$ . It follows from (4b) and (4c) that, when $\beta=1$ and $p=1$ , ${\bf{Y}}^{t+1}={\bf{Y}}^{t}+\frac{1}{2\chi}{\bf{W}}_{b}^{2}\hat{{\bf{Z}}}^{t}$ . For any fixed point $({\bf{X}},{\bf{Y}})$ of update (4), it holds that $\hat{{\bf{Z}}}={\bf{X}}$ , ${\bf{Y}}+\alpha\nabla F({{\bf{X}}})=0$ , and ${\bf{W}}_{b}{\bf{X}}=0$ . Thus, ${\bf{R}}=0$ implies that $\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}({\bf{x}})=0$ , i.e., ${\bf{x}}$ is a stationary point of problem (1). Using this new variable, we give the following error dynamics of Algorithm 1.

Lemma 1.

Suppose that Assumption 1 holds. There exist an invertible matrix ${\bf{Q}}$ and a diagonal matrix ${\bf{\Gamma}}$ such that


$\displaystyle\bar{{\bf{x}}}^{t+1}$	$\displaystyle=\bar{{\bf{x}}}^{t}-\alpha\overline{\nabla F}({\bf{X}}^{t})-% \alpha\bar{{\bf{s}}}^{t},$	(30a)
$\displaystyle\mathcal{E}^{t+1}$	$\displaystyle=\underbrace{{\bf{\Gamma}}\mathcal{E}^{t}-\alpha{\bf{Q}}^{-1}% \left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}}}^{\sf T}{\bf{% \Sigma}}_{1}^{t}\\ \frac{1}{2\chi}\hat{{\bf{\Lambda}}}_{b}^{2}\hat{{\bf{P}}}^{\sf T}{\bf{\Sigma}}% _{1}^{t}+\hat{{\bf{P}}}^{\sf T}{\bf{\Sigma}}_{2}^{t}\\ \end{array}\right]}_{:=\mathbb{G}^{t}}+\underbrace{{\bf{Q}}^{-1}\left[\begin{% array}[]{c}-\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\\ \hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\\ \end{array}\right]}_{:=\mathbb{F}^{t}},$	(30f)

where $\gamma\triangleq\|{\bf{\Gamma}}\|=\sqrt{1-\frac{1}{2\chi}(1-\lambda_{2})}<1$ ,

\displaystyle\mathcal{E}^{t}\triangleq{\bf{Q}}^{-1}\left[\begin{array}[]{c}% \hat{{\bf{P}}}^{\sf T}{{\bf{X}}}^{t}\\ \hat{{\bf{P}}}^{\sf T}{{\bf{R}}}^{t}\\ \end{array}\right],\quad\left\{\begin{array}[]{l}{\bf{\Sigma}}_{1}^{t}=\nabla F% ({\bf{X}}^{t})-\nabla F(\bar{{\bf{X}}}^{t})+{\bf{S}}^{t},\\ {\bf{\Sigma}}_{2}^{t}=\nabla F(\bar{{\bf{X}}}^{t})-\nabla F(\bar{{\bf{X}}}^{t+% 1})\end{array}\right..

Moreover, we have

\|{\bf{Q}}\|^{2}\leq 2\text{ and }\|{\bf{Q}}^{-1}\|^{2}\leq\frac{2\chi}{(1+% \lambda_{n})(1-\lambda_{2})}.

In addition, we have

\displaystyle\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\leq 4\|% \mathcal{E}^{t}\|_{\mathrm{F}}^{2}.

(31)

Proof.

See Appendix D. ∎

Based on Lemma 1, we give the following descent inequalities.

Lemma 2.

Suppose that Assumptions 1, 2, and 4 hold. If $\alpha\leq\frac{1}{2L}$ , it holds that

\displaystyle\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t+1})\;|\;\mathcal{G}^{t}\right]\leq f(\bar{{\bf{x}}}^{t})-\frac{\alpha}{2}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}+\frac{2\alpha L^{2}}{n}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{L\alpha^{2}\sigma^{2}}{2n},

(32)

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]\leq\tilde{\gamma}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{4n\alpha^{4}L^{2}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}+\frac{2\alpha^{4}L^{2}\sigma^{2}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}+\frac{2n\alpha^{2}\sigma^{2}(2\chi^{2}+(1-p))}{\chi^{2}},

(33)

where

\displaystyle\tilde{\gamma}=\gamma+\frac{32\alpha^{2}L^{2}+16\alpha^{4}L^{4}% \frac{2\chi}{1-\lambda_{2}}}{1-\gamma}+\frac{2(1-p)\big{(}3+\frac{24\chi\alpha% ^{2}L^{2}}{(1+\lambda_{n})(1-\lambda_{2})}\big{)}}{\chi^{2}}.

(34)

Moreover, if $f_{i}$ is $\mu$ -convex ( $\mu\geq 0$ ) and $\alpha\leq\frac{1}{4L}$ , it holds that

\displaystyle\mathbb{E}\!\left[\|\bar{{\bf{x}}}^{t+1}-{\bf{x}}^{\star}\|^{2}\;|\;\mathcal{G}^{t}\right]\leq(1-\mu\alpha)\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}+\frac{6\alpha L}{n}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}\sigma^{2}}{n}-\alpha(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star})),

(35)

where ${\bf{x}}^{\star}$ solves problem (1).

Proof.

See Appendix E. ∎

B.2 Convergence Analysis: Non-convex

With Lemma 1 and Lemma 2, we further have the following theorem.

Theorem 4.

Suppose that Assumptions 1, 2, and 4 hold. If $\beta=1$ , and $\alpha$ and $\chi$ satisfy $\chi\geq\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}$ and

\displaystyle 0<\alpha\leq\min\left\{\frac{1}{2L},\frac{1-\lambda_{2}}{32\sqrt% {3}\chi L},\sqrt{\frac{(1+\lambda_{n})(1-\lambda_{2})}{2\chi}}\frac{1}{2L},% \sqrt[4]{\frac{(1-\lambda_{2})^{3}}{12\chi^{3}}}\frac{1}{4L}\right\},

(36)

it holds that $\tilde{\gamma}\leq\frac{1+\gamma}{2}<1$ and

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\big{\|}\nabla f(\bar{{\bf{x}}}^{t})\big{\|}^{2}\right]\leq\frac{4(f(\bar{{\bf{x}}}^{0})-f^{\star})}{\alpha T}+\frac{128\chi^{2}L^{2}\alpha^{2}\varsigma^{2}_{0}}{(1-\lambda_{2})^{2}T}+\frac{2L\alpha\sigma^{2}}{n}+\frac{\alpha^{2}L^{2}\sigma^{2}\big{(}\chi^{3}+256\chi(2\chi^{2}+(1-p))\big{)}}{2(1-\lambda_{2})\chi^{2}},

(37)

where $\varsigma^{2}_{0}=\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_{i}(\bar{{\bf{x}}}^{0})-% \nabla f(\bar{{\bf{x}}}^{0})\|^{2}$ .

Proof.

See Appendix F. ∎
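The conditions of Theorem 4 can be checked numerically for concrete network parameters. The Python sketch below evaluates $\gamma$ , the quantity $\tilde{\gamma}$ in (34), the smallest admissible $\chi$ , and the stepsize bound (36); the function names and the sample parameter values are hypothetical and serve only as an illustration of the claim $\tilde{\gamma}\leq\frac{1+\gamma}{2}$ .

```python
import math


def gamma(chi, lam2):
    """gamma = sqrt(1 - (1 - lam2) / (2 * chi)) from Lemma 1."""
    return math.sqrt(1.0 - (1.0 - lam2) / (2.0 * chi))


def gamma_tilde(alpha, L, chi, lam2, lamn, p):
    """Contraction factor defined in (34)."""
    g = gamma(chi, lam2)
    kappa = 2.0 * chi / (1.0 - lam2)
    term_grad = (32.0 * alpha ** 2 * L ** 2
                 + 16.0 * alpha ** 4 * L ** 4 * kappa) / (1.0 - g)
    term_comm = 2.0 * (1.0 - p) * (
        3.0 + 24.0 * chi * alpha ** 2 * L ** 2 / ((1.0 + lamn) * (1.0 - lam2))
    ) / chi ** 2
    return g + term_grad + term_comm


def min_chi(p, lam2):
    """Smallest chi allowed by Theorem 4."""
    return max(288.0 * (1.0 - p) / (1.0 - lam2), 1.0)


def max_stepsize(L, chi, lam2, lamn):
    """Upper bound on alpha from (36)."""
    return min(
        1.0 / (2.0 * L),
        (1.0 - lam2) / (32.0 * math.sqrt(3.0) * chi * L),
        math.sqrt((1.0 + lamn) * (1.0 - lam2) / (2.0 * chi)) / (2.0 * L),
        ((1.0 - lam2) ** 3 / (12.0 * chi ** 3)) ** 0.25 / (4.0 * L),
    )


# Hypothetical network: lam2 = 0.8, lamn = -0.5, communication probability p = 0.1.
chi = min_chi(0.1, 0.8)
alpha = max_stepsize(1.0, chi, 0.8, -0.5)
gt = gamma_tilde(alpha, 1.0, chi, 0.8, -0.5, 0.1)
```

For these sample values, `gt` lies below `(1 + gamma(chi, 0.8)) / 2`, in line with the statement of the theorem.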

Based on Theorem 4, we can obtain a tighter rate by carefully selecting the stepsize, similar to [35, Lemma 17], [40, Lemma C.13], or [45, Corollary 1].

Corollary 2.

Suppose that Assumptions 1, 2, and 4 hold. If $\beta=1$ and $\chi$ satisfies $\chi\geq\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}$ , then there exists a constant stepsize $\alpha=\mathcal{O}\left(\frac{1-\lambda_{2}}{\chi L}\right)$ such that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla f(\bar{{\bf{x}}}^{% t})\|^{2}]\leq\mathcal{O}\left(\left(\frac{\sigma^{2}}{nT}\right)^{\frac{1}{2}% }+\left(\frac{\chi\sigma^{2}}{(1-\lambda_{2})T^{2}}\right)^{\frac{1}{3}}+\frac% {\chi}{(1-\lambda_{2})T}\right).

(38)

Proof.

See Appendix G. ∎

When $p<\lambda_{2}$ , we have $\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}=\mathcal{O}\left(\frac{1}{1-\lambda_{2}}\right)$ . Choose $\chi=\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}$ . Since communication is triggered with probability $p$ in each iteration, for any desired accuracy $\epsilon>0$ , the expected number of communication rounds required to achieve $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}]\leq\epsilon$ is bounded by

p<\lambda_{2}:\ p\times\text{(iteration complexity)}=\mathcal{O}\left(\frac{p% \sigma^{2}}{n\epsilon^{2}}+\frac{p}{1-\lambda_{2}}\frac{\sigma}{\epsilon^{% \nicefrac{{3}}{{2}}}}+\frac{1}{(1-\lambda_{2})^{2}}\frac{1}{\epsilon}\right).

When $p\geq\lambda_{2}$ , we have $\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}=\mathcal{O}(1)$ . If we choose $\chi$ such that $\chi=\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}$ , then for any desired accuracy $\epsilon>0$ , the expected communication complexity is bounded by

p\geq\lambda_{2}:\ p\times\text{(iteration complexity)}=\mathcal{O}\left(\frac% {p\sigma^{2}}{n\epsilon^{2}}+\frac{p}{\sqrt{1-\lambda_{2}}}\frac{\sigma}{% \epsilon^{\nicefrac{{3}}{{2}}}}+\frac{1}{1-\lambda_{2}}\frac{1}{\epsilon}% \right).
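To illustrate how the two regimes above behave, the following Python sketch evaluates the order-level expressions with all hidden constants set to one. It is only an illustration of scaling, not a prediction of actual round counts; the function name and the sample arguments are hypothetical.

```python
def nonconvex_comm_rounds(p, lam2, sigma2, n, eps):
    """Order-level estimate (constants dropped) of the expected communication rounds."""
    gap = 1.0 - lam2
    sigma = sigma2 ** 0.5
    leading = p * sigma2 / (n * eps ** 2)  # linear-speedup term
    if p < lam2:
        # Regime p < lam2, where chi = O(1 / (1 - lam2)).
        return leading + p * sigma / (gap * eps ** 1.5) + 1.0 / (gap ** 2 * eps)
    # Regime p >= lam2, where chi = O(1).
    return leading + p * sigma / (gap ** 0.5 * eps ** 1.5) + 1.0 / (gap * eps)


# Doubling n halves the leading term; the network-dependent terms are unchanged.
rounds_n10 = nonconvex_comm_rounds(0.1, 0.9, 1.0, 10, 1e-3)
rounds_n20 = nonconvex_comm_rounds(0.1, 0.9, 1.0, 20, 1e-3)
```

For small $\epsilon$ the first term dominates, so the estimate decreases with both a larger $n$ and a smaller $p$ .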

B.3 Convergence Analysis: Convex

By Lemma 1 and Lemma 2, we can also deduce the following theorem.

Theorem 5.

Suppose that Assumptions 1, 2, and 4 hold. Under the additional Assumption 3 with $\mu\geq 0$ , if $\beta=1$ , and $\alpha$ and $\chi$ satisfy $\chi\geq\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}$ and

\displaystyle 0<\alpha\leq\min\left\{\frac{1}{2L},\frac{1-\lambda_{2}}{32\sqrt% {3}\chi L},\sqrt{\frac{(1+\lambda_{n})(1-\lambda_{2})}{2\chi}}\frac{1}{2L},% \sqrt[4]{\frac{(1-\lambda_{2})^{3}}{24\chi^{3}}}\frac{1}{4L}\right\},

(39)

it holds that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t})-f^{\star}\right]\leq\frac{2\|\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\|^{2}}{\alpha T}+\frac{192\chi^{2}\alpha^{2}L\varsigma^{2}_{0}}{(1-\lambda_{2})^{2}T}+\frac{2\alpha\sigma^{2}}{n}+\frac{L\alpha^{2}\sigma^{2}\big{(}\chi^{3}+384\chi(2\chi^{2}+(1-p))\big{)}}{2(1-\lambda_{2})\chi^{2}}.

(40)

Proof.

See Appendix H. ∎

Similar to the analysis of the non-convex setting, Theorem 5 yields the following result.

Corollary 3.

Suppose that Assumptions 1, 2, and 4 hold. Under the additional Assumption 3 with $\mu\geq 0$ , if $\beta=1$ and $\chi\geq\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}$ , then there exists a constant stepsize $\alpha=\mathcal{O}\left(\frac{1-\lambda_{2}}{\chi L}\right)$ such that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t}% )-f^{\star}\right]\leq\mathcal{O}\left(\left(\frac{\sigma^{2}}{nT}\right)^{% \frac{1}{2}}+\left(\frac{\chi\sigma^{2}}{(1-\lambda_{2})T^{2}}\right)^{\frac{1% }{3}}+\frac{\chi}{(1-\lambda_{2})T}\right).

(41)

Proof.

See Appendix I. ∎

Similar to the analysis of the non-convex setting, choose $\chi=\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}$ . When $p<\lambda_{2}$ , we have $\chi=\mathcal{O}\left(\frac{1}{1-\lambda_{2}}\right)$ and, for any desired accuracy $\epsilon>0$ , the expected communication complexity is bounded by

p<\lambda_{2}:\ p\times\text{(iteration complexity)}=\mathcal{O}\left(\frac{p\sigma^{2}}{n\epsilon^{2}}+\frac{p}{1-\lambda_{2}}\frac{\sigma\sqrt{L}}{\epsilon^{\nicefrac{{3}}{{2}}}}+\frac{1}{(1-\lambda_{2})^{2}}\frac{L}{\epsilon}\right).

When $p\geq\lambda_{2}$ , we have $\chi=\mathcal{O}(1)$ and the expected communication complexity is bounded by

p\geq\lambda_{2}:\ p\times\text{(iteration complexity)}=\mathcal{O}\left(\frac{p\sigma^{2}}{n\epsilon^{2}}+\frac{p}{\sqrt{1-\lambda_{2}}}\frac{\sigma\sqrt{L}}{\epsilon^{\nicefrac{{3}}{{2}}}}+\frac{1}{1-\lambda_{2}}\frac{L}{\epsilon}\right).

B.4 Convergence Analysis: Strongly Convex

By Lemma 1 and Lemma 2, we can also deduce the following theorem.

Theorem 6.

Suppose that Assumptions 1, 2, and 4 hold. Under the additional Assumption 3 with $\mu>0$ , if $\beta=1$ , and $\alpha$ and $\chi$ satisfy $\chi\geq\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}$ and

\displaystyle 0<\alpha\leq\min\left\{\frac{1}{2L},\frac{1-\lambda_{2}}{32\sqrt% {3}\chi L},\sqrt{\frac{(1+\lambda_{n})(1-\lambda_{2})}{2\chi}}\frac{1}{2L},% \frac{72\mu}{L^{2}},\frac{1-\gamma}{12L+\nicefrac{{\mu}}{{2}}},\frac{\sqrt[3]{% 4\mu(1-\gamma)}}{L}\right\},

(42)

where $\gamma=\sqrt{1-\frac{1}{2\chi}(1-\lambda_{2})}<1$ , it holds that

	$\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\big{\|}^{2}\right]\leq$	$\displaystyle\Big{(}1-\frac{\alpha\mu}{4}\Big{)}^{t}\Big{(}\big{\|}\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\big{\|}^{2}+\frac{8\chi\alpha^{2}\varsigma_{0}^{2}}{1-\lambda_{2}}\Big{)}$
		$\displaystyle+\frac{2\alpha\sigma^{2}}{\mu n}+\frac{7L\alpha^{2}\sigma^{2}(192% \chi^{2}+(4\chi^{2}+2(1-p)))}{12\mu(1-\lambda_{2})\chi}.$		(43)

Proof.

See Appendix J. ∎

Based on Theorem 6, we can even get a tighter rate by carefully selecting the step size similar to [35].

Corollary 4.

Suppose that Assumptions 1, 2, and 4 hold. Under the additional Assumption 3 with $\mu>0$ , if $\beta=1$ and $\chi\geq\max\left\{\frac{288(1-p)}{1-\lambda_{2}},1\right\}$ , there exists a constant $\alpha=\mathcal{O}\left(\frac{\mu(1-\lambda_{2})}{\chi L^{2}}\right)$ such that

\displaystyle\mathbb{E}[\|\bar{{\bf{x}}}^{T}-{\bf{x}}^{\star}\|^{2}]\leq\tilde% {\mathcal{O}}\left(\frac{\sigma^{2}}{nT}+\frac{\sigma^{2}\chi}{(1-\lambda_{2})% T^{2}}+\mathrm{exp}\Big{[}-\frac{(1-\lambda_{2})T}{\chi}\Big{]}\right).

(44)

Proof.

See Appendix K. ∎

Similar to the analysis of the non-convex and convex settings, we have $\chi=\max\{\nicefrac{{288(1-p)}}{{1-\lambda_{2}}},1\}\leq\mathcal{O}(\nicefrac{{1}}{{(1-\lambda_{2})}})$ if $p<\lambda_{2}$ and $\chi=\max\{\nicefrac{{288(1-p)}}{{1-\lambda_{2}}},1\}=\mathcal{O}(1)$ if $p\geq\lambda_{2}$ . Thus, for any desired accuracy $\epsilon>0$ , the expected number of communication rounds required to achieve $\mathbb{E}[\|\bar{{\bf{x}}}^{T}-{\bf{x}}^{\star}\|^{2}]\leq\epsilon$ is bounded by

p\times\text{(iteration complexity)}=\tilde{\mathcal{O}}\left(\frac{p\sigma^{2% }}{n\epsilon}+\frac{p}{{1-\lambda_{2}}}\frac{\sigma}{\sqrt{\epsilon}}+\frac{% \mathrm{log}\nicefrac{{1}}{{\epsilon}}}{(1-\lambda_{2})^{2}}\right),\ p\in(0,% \lambda_{2}),

and

p\times\text{(iteration complexity)}=\tilde{\mathcal{O}}\left(\frac{p\sigma^{2% }}{n\epsilon}+\frac{p}{\sqrt{1-\lambda_{2}}}\frac{\sigma}{\sqrt{\epsilon}}+% \frac{\mathrm{log}\nicefrac{{1}}{{\epsilon}}}{1-\lambda_{2}}\right),\ p\in[% \lambda_{2},1].
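To make the linear speedup in the leading term concrete, the two bounds above can be evaluated numerically. The following is only an illustrative sketch: the function name `comm_rounds_strongly_convex` is hypothetical, and all constants hidden by $\tilde{\mathcal{O}}(\cdot)$ are set to one, which is an assumption and not a statement about the actual constants.

```python
import math


def comm_rounds_strongly_convex(p, sigma, n, eps, lam2):
    """Return the three terms inside the O-tilde bound (hidden constants assumed to be 1)."""
    gap = 1.0 - lam2
    log_term = math.log(1.0 / eps)
    leading = p * sigma ** 2 / (n * eps)  # the only term that depends on n
    if p < lam2:
        middle = p / gap * sigma / math.sqrt(eps)
        tail = log_term / gap ** 2
    else:
        middle = p / math.sqrt(gap) * sigma / math.sqrt(eps)
        tail = log_term / gap
    return leading, middle, tail


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, comm_rounds_strongly_convex(p=0.2, sigma=1.0, n=n, eps=1e-4, lam2=0.9))
```

Doubling $n$ halves the leading term while the two network-dependent terms stay unchanged, which is the sense in which the speedup is linear for small $\epsilon$ .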

Appendix C Proof of Theorem 3

Here, we further prove that ProxSkip can achieve linear speedup with network-independent stepsizes. To facilitate the analysis, we introduce new iterates $\{{\bf{U}}^{t}\}$ such that ${\bf{Y}}^{t}=\alpha{\bf{W}}_{b}{\bf{U}}^{t}$ ; similar techniques can be found, e.g., in [41, 42, 6]. Since ${\bf{I}}-{\bf{W}}_{a}=\frac{1}{2\chi}{\bf{W}}^{2}_{b}$ , from (4b) and (4c), we have

\left\{\begin{array}[]{rl}{\bf{X}}^{t+1}\ =&(1-\theta_{t})\hat{{\bf{Z}}}^{t}+% \theta_{t}{\bf{W}}_{a}\hat{{\bf{Z}}}^{t}\\ \alpha{\bf{W}}_{b}{\bf{U}}^{t+1}\ =&\alpha{\bf{W}}_{b}{\bf{U}}^{t}+\beta(\hat{% {\bf{Z}}}^{t}-{\bf{X}}^{t+1})\end{array}\right.\Longleftrightarrow\left\{% \begin{array}[]{rl}{\bf{W}}_{b}{\bf{U}}^{t+1}\ =&{\bf{W}}_{b}{\bf{U}}^{t}+% \frac{\beta\theta_{t}}{2\chi\alpha}{\bf{W}}_{b}^{2}\hat{{\bf{Z}}}^{t}\\ {\bf{X}}^{t+1}\ =&\hat{{\bf{Z}}}^{t}-\frac{\alpha}{\beta}{\bf{W}}_{b}({\bf{U}}% ^{t+1}-{\bf{U}}^{t})\end{array}\right.

Therefore, letting ${\bf{Y}}^{0}=0$ , we have the following equivalent form of ProxSkip (4) in the sense that they generate an identical sequence $({\bf{X}}^{t},\hat{{\bf{Z}}}^{t})$ .


$\displaystyle\hat{{\bf{Z}}}^{t}$	$\displaystyle=\ {\bf{X}}^{t}-\alpha{\bf{G}}^{t}-\alpha{\bf{W}}_{b}{\bf{U}}^{t},$	(45a)
$\displaystyle{\bf{U}}^{t+1}$	$\displaystyle=\ {\bf{U}}^{t}+\frac{\beta\theta_{t}}{2\chi\alpha}{\bf{W}}_{b}% \hat{{\bf{Z}}}^{t},$	(45b)
$\displaystyle{\bf{X}}^{t+1}$	$\displaystyle=\ \hat{{\bf{Z}}}^{t}-\frac{\alpha}{\beta}{\bf{W}}_{b}({\bf{U}}^{% t+1}-{\bf{U}}^{t}).$	(45c)
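The equivalence between the $({\bf{X}},{\bf{Y}})$ recursion and the $({\bf{X}},{\bf{U}})$ recursion (45) can be checked numerically on a toy instance. The sketch below is not the authors' implementation. It assumes a hypothetical two-node network with scalar variables, mixing matrix ${\bf{W}}=[[1-a,a],[a,1-a]]$ , ${\bf{W}}_{b}=({\bf{I}}-{\bf{W}})^{1/2}$ , exact gradients of the toy objectives $f_{i}(x)=\frac{1}{2}(x-b_{i})^{2}$ , and the same communication coin $\theta_{t}$ in both recursions.

```python
import math
import random


def matvec(A, v):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def grad(x, b):
    """Exact local gradients of the toy objectives f_i(x) = 0.5 * (x - b_i)^2."""
    return [xi - bi for xi, bi in zip(x, b)]


def max_gap_between_forms(steps=50, alpha=0.1, beta=1.0, chi=2.0, p=0.3, a=0.3, seed=0):
    rng = random.Random(seed)
    # Hypothetical 2-node network: I - W = a * [[1, -1], [-1, 1]].
    c = math.sqrt(a / 2.0)
    Wb = [[c, -c], [-c, c]]            # Wb @ Wb = I - W
    h = a / (2.0 * chi)
    Wa = [[1.0 - h, h], [h, 1.0 - h]]  # I - Wa = Wb^2 / (2 * chi)
    b = [1.0, -3.0]
    X1, Y = [0.0, 0.0], [0.0, 0.0]     # state of the (X, Y) recursion, Y^0 = 0
    X2, U = [0.0, 0.0], [0.0, 0.0]     # state of the (X, U) recursion (45)
    gap = 0.0
    for _ in range(steps):
        theta = 1.0 if rng.random() < p else 0.0  # shared communication coin
        # (X, Y) recursion: Z = X - alpha*G - Y, then the two updates on the left above.
        Z1 = [x - alpha * g - y for x, g, y in zip(X1, grad(X1, b), Y)]
        WaZ1 = matvec(Wa, Z1)
        X1_next = [(1.0 - theta) * z + theta * w for z, w in zip(Z1, WaZ1)]
        Y = [y + beta * (z - xn) for y, z, xn in zip(Y, Z1, X1_next)]
        X1 = X1_next
        # (X, U) recursion (45a)-(45c).
        WbU = matvec(Wb, U)
        Z2 = [x - alpha * g - alpha * wu for x, g, wu in zip(X2, grad(X2, b), WbU)]
        WbZ2 = matvec(Wb, Z2)
        U_next = [u + beta * theta / (2.0 * chi * alpha) * wz for u, wz in zip(U, WbZ2)]
        dU = [un - u for un, u in zip(U_next, U)]
        X2 = [z - (alpha / beta) * w for z, w in zip(Z2, matvec(Wb, dU))]
        U = U_next
        gap = max(gap, max(abs(u - v) for u, v in zip(X1 + Z1, X2 + Z2)))
    return gap


if __name__ == "__main__":
    print(max_gap_between_forms())  # expected to be at the level of rounding error
```

The returned value is the largest entrywise difference between the two sequences $({\bf{X}}^{t},\hat{{\bf{Z}}}^{t})$ over the run; up to floating-point rounding it should vanish, since the two recursions differ only by the change of variables ${\bf{Y}}^{t}=\alpha{\bf{W}}_{b}{\bf{U}}^{t}$ .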

The equivalent form (45) is more convenient for the subsequent convergence analysis. The optimality condition of problem (1) is given in the following lemma.

Lemma 3.

Suppose that Assumption 1 holds. If there exists a point $({\bf{X}}^{\star},{\bf{U}}^{\star})$ that satisfies:


$\displaystyle 0$	$\displaystyle=\nabla F({\bf{X}}^{\star})+{\bf{W}}_{b}{\bf{U}}^{\star},$	(46a)
$\displaystyle 0$	$\displaystyle={\bf{W}}_{b}{\bf{X}}^{\star},$	(46b)

then it holds that ${\bf{X}}^{\star}=[{\bf{x}}^{\star},{\bf{x}}^{\star},\ldots,{\bf{x}}^{\star}]^{\sf T}$ , where ${\bf{x}}^{\star}\in\mathbb{R}^{d}$ is a stationary point of problem (1).

From Lemma 3, when ${\bf{G}}^{t}=\nabla F({\bf{X}}^{t})$ , we have that any fixed point of (45) satisfies the condition (46). We also define the following notations to simplify the analysis:

\displaystyle\widetilde{{\bf{Z}}}^{t}\triangleq\hat{{\bf{Z}}}^{t}-{\bf{X}}^{% \star},\quad\widetilde{{\bf{X}}}^{t}\triangleq{\bf{X}}^{t}-{\bf{X}}^{\star},% \quad\widetilde{{\bf{U}}}^{t}\triangleq\alpha({\bf{U}}^{t}-{\bf{U}}^{\star}),% \quad\bar{{\bf{e}}}^{t}\triangleq\bar{{\bf{x}}}^{t}-({\bf{x}}^{\star})^{\sf T},

where $({\bf{X}}^{\star},{\bf{U}}^{\star})$ satisfies (46). Similar to Lemma 1, we give another form of the error dynamics of ProxSkip, which will be used to prove the linear speedup of ProxSkip with network-independent stepsizes under strong convexity.

Lemma 4.

Suppose that Assumption 1 holds. If $\beta=p$ and $\chi p\geq 1$ , there exist an invertible matrix ${\bf{Q}}^{\mathrm{s}}$ and a diagonal matrix ${\bf{\Gamma}}$ such that


$\displaystyle\bar{{\bf{e}}}^{t+1}$	$\displaystyle=\bar{{\bf{e}}}^{t}-\alpha\overline{\nabla F}({\bf{X}}^{t})-\alpha\bar{{\bf{s}}}^{t},$	(47a)
$\displaystyle\mathcal{E}_{\mathrm{s}}^{t+1}$	$\displaystyle=\underbrace{{\bf{\Gamma}}\mathcal{E}_{\mathrm{s}}^{t}-\alpha% \upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\left[\begin{array}[]{c}\hat{{\bf{\Lambda}% }}_{a}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star})% +{\bf{S}}^{t})\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}% }^{t})-\nabla F({\bf{X}}^{\star})+{\bf{S}}^{t})\\ \end{array}\right]}_{:=\mathbb{G}_{\mathrm{s}}^{t}}+\underbrace{\upsilon({\bf{% Q}}^{\mathrm{s}})^{-1}\left[\begin{array}[]{c}-\hat{{\bf{\Lambda}}}_{b}\hat{{% \bf{P}}}^{\sf T}{\bf{E}}^{t}\\ p\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\\ \end{array}\right]}_{:=\mathbb{F}_{\mathrm{s}}^{t}},$	(47f)

where $\upsilon$ is an arbitrary strictly positive constant,

\mathcal{E}_{\mathrm{s}}^{t}\triangleq\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}% \left[\begin{array}[]{c}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{t}\\ \hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{U}}}^{t}\\ \end{array}\right]\text{ and }\gamma\triangleq\|{\bf{\Gamma}}\|=\sqrt{1-\frac{% 1}{2\chi}(1-\lambda_{2})}<1.

Moreover, we have $\|{\bf{Q}}^{\mathrm{s}}\|^{2}\|({\bf{Q}}^{\mathrm{s}})^{-1}\|^{2}\leq\frac{8% \chi^{2}}{p^{2}(1+\lambda_{n})}$ .

Proof.

See Appendix L. ∎

With these error dynamics, similar to Lemma 2, we give the following descent inequalities.

Lemma 5.

Suppose that Assumptions 1, 2, and 4 hold, and $f_{i}$ is $\mu$ -strongly convex for some $0<\mu\leq L$ . Let $\upsilon=1/\|({\bf{Q}}^{\mathrm{s}})^{-1}\|$ . If $\alpha\leq\frac{1}{2L}$ , it holds that

	$\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{t+1}-{\bf{x}}^{\star}\big{\|}^{2}\;|\;\mathcal{G}^{t}\right]$	$\displaystyle\leq(1-\mu\alpha)\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}+\frac{2\alpha L\vartheta_{\mathrm{s}}}{n}\|\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}\sigma^{2}}{n},$		(48)
	$\displaystyle\mathbb{E}\!\left[\|\mathcal{E}_{\mathrm{s}}^{t+1}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$	$\displaystyle\leq\tilde{\gamma}_{\mathrm{s}}\|\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}+D_{1}\|\widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}+D_{2}n\alpha^{2}\sigma^{2},$		(49)

where $\vartheta_{\mathrm{s}}=\|{\bf{Q}}^{\mathrm{s}}\|^{2}\|({\bf{Q}}^{\mathrm{s}})^{-1}\|^{2}$ , $\tilde{\gamma}_{\mathrm{s}}=\gamma+\frac{3(1-p)(2+p^{2})}{\chi^{2}}$ , and

D_{1}=\frac{\alpha^{2}L^{2}(2\chi^{2}+p^{2})}{2\chi^{2}(1-\gamma)}+\frac{3% \alpha^{2}L^{2}(1-p)(2+p^{2})}{2\chi^{2}},\ D_{2}=\frac{(1-p)(2+p^{2})+(p^{2}+% 2\chi^{2})}{2\chi^{2}}.

Proof.

See Appendix M. ∎

With Lemmas 4 and 5, we have the following theorem.

Theorem 7.

Suppose that Assumptions 1, 2, and 4 hold, and $f_{i}$ is $\mu$ -strongly convex for some $0<\mu\leq L$ . If $0<\alpha\leq\frac{1}{2L}$ , $\beta=p$ , and

\displaystyle\chi>\max\left\{\frac{1}{p},\frac{36}{1-\lambda_{2}},\frac{72(1-p% )}{1-\lambda_{2}}\right\},

(50)

it holds that $\tilde{\gamma}_{\mathrm{s}}<1$ and

\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{t+1}-{\bf{x}}^{\star}\big{\|}^{2}\right]\leq\zeta_{0}^{t+1}a_{0}+\mathcal{O}\left(\frac{\alpha^{4}\sigma^{2}L^{3}\chi^{4}}{\mu p^{2}(1-\lambda_{2})^{2}(1-\zeta_{0})}+\frac{\alpha^{2}\sigma^{2}L\chi^{3}}{\mu p^{2}(1-\lambda_{2})}\right)+\frac{\alpha\sigma^{2}}{n\mu},

(51)

where $a_{0}$ is a constant that depends on the initialization and $\zeta_{0}=\max\{1-\alpha\mu,\sqrt{1-\frac{(1-\lambda_{2})p^{2}}{2\chi}}\}<1$ .

Proof.

See Appendix N. ∎

Appendix D Proof of Lemma 1

Proof.

It follows from (4b), $\mathbf{I}-\mathbf{W}_{a}=\frac{1}{2\chi}\mathbf{W}_{b}^{2}$ and ${\bf{E}}^{t}=\frac{1}{2\chi}(\theta_{t}-1){\bf{W}}_{b}\hat{{\bf{Z}}}^{t}$ that

	$\displaystyle\mathbf{X}^{t+1}$	$\displaystyle=(1-\theta_{t})\hat{\mathbf{Z}}^{t}+\theta_{t}\mathbf{W}_{a}\hat{% \mathbf{Z}}^{t}$
		$\displaystyle={\bf{W}}_{a}\hat{\mathbf{Z}}^{t}+(1-\theta_{t})\hat{\mathbf{Z}}^% {t}+\theta_{t}\mathbf{W}_{a}\hat{\mathbf{Z}}^{t}-{\bf{W}}_{a}\hat{\mathbf{Z}}^% {t}$
		$\displaystyle={\bf{W}}_{a}\hat{\mathbf{Z}}^{t}+(1-\theta_{t})({\bf{I}}-{\bf{W}% }_{a})\hat{\mathbf{Z}}^{t}$
		$\displaystyle={\bf{W}}_{a}\hat{\mathbf{Z}}^{t}-{\bf{W}}_{b}{\bf{E}}^{t}.$

Since $\beta=1$ , it follows from (4b), (4c), and $\mathbf{I}-\mathbf{W}_{a}=\frac{1}{2\chi}\mathbf{W}_{b}^{2}$ that

{\bf{Y}}^{t+1}={\bf{Y}}^{t}+\frac{1}{2\chi}{\bf{W}}_{b}^{2}\hat{{\bf{Z}}}^{t}+% {\bf{W}}_{b}{\bf{E}}^{t}={\bf{Y}}^{t}+({\bf{I}}-{\bf{W}}_{a})\hat{{\bf{Z}}}^{t% }+{\bf{W}}_{b}{\bf{E}}^{t}.

By ${\bf{R}}^{t}={\bf{Y}}^{t}+\alpha\nabla F(\bar{{\bf{X}}}^{t})$ , ${\bf{\Sigma}}_{2}^{t}=\nabla F(\bar{{\bf{X}}}^{t})-\nabla F(\bar{{\bf{X}}}^{t+% 1})$ , and ${\bf{E}}^{t}=\frac{1}{2\chi}(\theta_{t}-1){\bf{W}}_{b}\hat{{\bf{Z}}}^{t}$ , we have

	$\displaystyle{\bf{R}}^{t+1}-{\bf{R}}^{t}$	$\displaystyle={\bf{Y}}^{t+1}-{\bf{Y}}^{t}+\alpha(\nabla F(\bar{{\bf{X}}}^{t+1}% )-\nabla F(\bar{{\bf{X}}}^{t}))$
		$\displaystyle=({\bf{I}}-{\bf{W}}_{a})\hat{{\bf{Z}}}^{t}+{\bf{W}}_{b}{\bf{E}}^{% t}+\alpha(\nabla F(\bar{{\bf{X}}}^{t+1})-\nabla F(\bar{{\bf{X}}}^{t}))$
		$\displaystyle=({\bf{I}}-{\bf{W}}_{a})\hat{{\bf{Z}}}^{t}+{\bf{W}}_{b}{\bf{E}}^{% t}-\alpha{\bf{\Sigma}}_{2}^{t}.$

Note that ${\bf{\Sigma}}_{1}^{t}=\nabla F({\bf{X}}^{t})-\nabla F(\bar{{\bf{X}}}^{t})+{\bf% {S}}^{t}$ . Algorithm 1 (update (4)) is equivalent to

	$\displaystyle\hat{{\bf{Z}}}^{t}$	$\displaystyle={\bf{X}}^{t}-{\bf{R}}^{t}-\alpha{\bf{\Sigma}}_{1}^{t},$
	$\displaystyle{\bf{X}}^{t+1}$	$\displaystyle={\bf{W}}_{a}\hat{{\bf{Z}}}^{t}-{\bf{W}}_{b}{\bf{E}}^{t},$
	$\displaystyle{\bf{R}}^{t+1}$	$\displaystyle={\bf{R}}^{t}+({\bf{I}}-{\bf{W}}_{a})\hat{{\bf{Z}}}^{t}-\alpha{% \bf{\Sigma}}_{2}^{t}+{\bf{W}}_{b}{\bf{E}}^{t},$

which can also be rewritten as (since ${\bf{W}}_{a}={\bf{I}}-\frac{1}{2\chi}{\bf{W}}^{2}_{b}$ )

\displaystyle\left[\begin{array}[]{c}{\bf{X}}^{t+1}\\ {\bf{R}}^{t+1}\\ \end{array}\right]=

\displaystyle\left[\begin{array}[]{cc}{\bf{W}}_{a}&-{\bf{W}}_{a}\\ {\bf{I}}-{\bf{W}}_{a}&{\bf{W}}_{a}\\ \end{array}\right]\left[\begin{array}[]{c}{\bf{X}}^{t}\\ {\bf{R}}^{t}\\ \end{array}\right]-\alpha\left[\begin{array}[]{c}{\bf{W}}_{a}{\bf{\Sigma}}_{1}% ^{t}\\ \frac{1}{2\chi}{\bf{W}}_{b}^{2}{\bf{\Sigma}}_{1}^{t}+{\bf{\Sigma}}_{2}^{t}\\ \end{array}\right]+\left[\begin{array}[]{c}-{\bf{W}}_{b}{\bf{E}}^{t}\\ {\bf{W}}_{b}{\bf{E}}^{t}\\ \end{array}\right].

Multiplying both sides of the above by $\mathrm{diag}\{{\bf{P}}^{-1},{\bf{P}}^{-1}\}$ on the left, and using (29) and

{\bf{P}}^{-1}{\bf{X}}^{t}=\left[\begin{array}[]{c}\bar{{\bf{x}}}^{t}\\ \hat{{\bf{P}}}^{\sf T}{\bf{X}}^{t}\\ \end{array}\right],\ {\bf{P}}^{-1}{\bf{R}}^{t}=\left[\begin{array}[]{c}\alpha\overline{\nabla F}(\bar{{\bf{X}}}^{t})\\ \hat{{\bf{P}}}^{\sf T}{\bf{R}}^{t}\\ \end{array}\right],\ {\bf{P}}^{-1}\nabla F({\bf{X}}^{t})=\left[\begin{array}[]{c}\overline{\nabla F}({\bf{X}}^{t})\\ \hat{{\bf{P}}}^{\sf T}\nabla F({\bf{X}}^{t})\\ \end{array}\right],\ {\bf{P}}^{-1}{\bf{E}}^{t}=\left[\begin{array}[]{c}0\\ \hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\\ \end{array}\right],

we have

	$\displaystyle\bar{{\bf{x}}}^{t+1}=$	$\displaystyle~{}\bar{{\bf{x}}}^{t}-\alpha\overline{\nabla F}({\bf{X}}^{t})-% \alpha\bar{{\bf{s}}}^{t},$
	$\displaystyle\left[\begin{array}[]{c}\hat{{\bf{P}}}^{\sf T}{\bf{X}}^{t+1}\\ \hat{{\bf{P}}}^{\sf T}{\bf{R}}^{t+1}\\ \end{array}\right]=$	$\displaystyle\left[\begin{array}[]{cc}\hat{{\bf{\Lambda}}}_{a}&-\hat{{\bf{% \Lambda}}}_{a}\\ {\bf{I}}-\hat{{\bf{\Lambda}}}_{a}&\hat{{\bf{\Lambda}}}_{a}\\ \end{array}\right]\left[\begin{array}[]{c}\hat{{\bf{P}}}^{\sf T}{\bf{X}}^{t}\\ \hat{{\bf{P}}}^{\sf T}{\bf{R}}^{t}\\ \end{array}\right]-\alpha\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{% {\bf{P}}}^{\sf T}{\bf{\Sigma}}_{1}^{t}\\ \frac{1}{2\chi}\hat{{\bf{\Lambda}}}_{b}^{2}\hat{{\bf{P}}}^{\sf T}{\bf{\Sigma}}% _{1}^{t}+\hat{{\bf{P}}}^{\sf T}{\bf{\Sigma}}_{2}^{t}\\ \end{array}\right]+\left[\begin{array}[]{c}-\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{% P}}}^{\sf T}{\bf{E}}^{t}\\ \hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\\ \end{array}\right].$

Let

{\bf{H}}=\left[\begin{array}[]{cc}\hat{{\bf{\Lambda}}}_{a}&-\hat{{\bf{\Lambda}% }}_{a}\\ {\bf{I}}-\hat{{\bf{\Lambda}}}_{a}&\hat{{\bf{\Lambda}}}_{a}\\ \end{array}\right]=\left[\begin{array}[]{cc}{\bf{I}}-\frac{1}{2\chi}({\bf{I}}-% \hat{{\bf{\Lambda}}})&-({\bf{I}}-\frac{1}{2\chi}({\bf{I}}-\hat{{\bf{\Lambda}}}% ))\\ \frac{1}{2\chi}({{\bf{I}}-\hat{{\bf{\Lambda}}}})&{\bf{I}}-\frac{1}{2\chi}({\bf% {I}}-\hat{{\bf{\Lambda}}})\\ \end{array}\right],

where $\hat{{\bf{\Lambda}}}=\mathrm{diag}\{\lambda_{2},\ldots,\lambda_{n}\}$ , and $\lambda_{i}\in(-1,1)$ . Since the blocks of ${\bf{H}}$ are diagonal matrices, there exists a permutation matrix ${\bf{Q}}_{1}$ such that ${\bf{Q}}_{1}{\bf{H}}{\bf{Q}}_{1}^{\sf T}=\mathrm{blkdiag}\{H_{i}\}_{i=2}^{n}$ , where

H_{i}=\left[\begin{array}[]{cc}1-\frac{1}{2\chi}(1-\lambda_{i})&-(1-\frac{1}{2% \chi}(1-\lambda_{i}))\\ \frac{1}{2\chi}(1-\lambda_{i})&1-\frac{1}{2\chi}(1-\lambda_{i})\\ \end{array}\right].

Setting $\nu_{i}=1-\frac{1}{2\chi}(1-\lambda_{i})$ , we have $\nu_{i}\in(0,1)$ and $H_{i}$ can be rewritten as

H_{i}=\left[\begin{array}[]{cc}\nu_{i}&-\nu_{i}\\ 1-\nu_{i}&\nu_{i}\\ \end{array}\right]\in\mathbb{R}^{2\times 2}.

It holds that $\mathrm{Tr}(H_{i})=2\nu_{i}$ , $\mathrm{det}(H_{i})=\nu_{i}$ . Thus, the eigenvalues of $H_{i}$ are

\displaystyle\gamma_{(1,2),i}

\displaystyle=\frac{1}{2}\Big{[}\mathrm{Tr}(H_{i})\pm\sqrt{\mathrm{Tr}(H_{i})^% {2}-4\mathrm{det}(H_{i})}\Big{]}=\nu_{i}\pm\sqrt{\nu_{i}^{2}-\nu_{i}}.

Notice that $|\gamma_{(1,2),i}|<1$ when $-\nicefrac{{1}}{{3}}<\nu_{i}<1$ , which holds under Assumption 1 since ${\bf{W}}_{a}\succ 0$ , i.e., $0<\nu_{i}<1\ (i=2,\ldots,n)$ . For $0<\nu_{i}<1$ , the eigenvalues of $H_{i}$ are complex and distinct:

\displaystyle\gamma_{(1,2),i}=\nu_{i}\pm j\sqrt{\nu_{i}-\nu_{i}^{2}},\ |\gamma% _{(1,2),i}|<1,

where $j^{2}=-1$ . Through algebraic multiplication it can be verified that $H_{i}=Q_{2,i}\Gamma_{i}Q_{2,i}^{-1}$ , where $\Gamma_{i}=\mathrm{diag}\{\gamma_{1,i},\gamma_{2,i}\}$ and

Q_{2,i}=\left[\begin{array}[]{cc}\sqrt{\nu_{i}}&\sqrt{\nu_{i}}\\ -j\sqrt{1-\nu_{i}}&j\sqrt{1-\nu_{i}}\end{array}\right],\quad Q_{2,i}^{-1}=% \left[\begin{array}[]{cc}\frac{1}{2\sqrt{\nu_{i}}}&\frac{j}{2\sqrt{1-\nu_{i}}}% \\ \frac{1}{2\sqrt{\nu_{i}}}&-\frac{j}{2\sqrt{1-\nu_{i}}}\end{array}\right].
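The factorization $H_{i}=Q_{2,i}\Gamma_{i}Q_{2,i}^{-1}$ can also be confirmed numerically for any fixed $\nu_{i}\in(0,1)$ . The sketch below uses plain Python complex arithmetic; the helper names `matmul2` and `check_decomposition` are hypothetical and are introduced only for this illustration.

```python
import math


def matmul2(A, B):
    """Product of two 2x2 matrices given as nested lists (entries may be complex)."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def check_decomposition(nu):
    """Return (max entrywise error of Q*Gamma*Q^{-1} - H, |gamma_1|, |gamma_2|)."""
    r = math.sqrt(nu - nu * nu)
    g1, g2 = complex(nu, r), complex(nu, -r)  # eigenvalues nu +/- j*sqrt(nu - nu^2)
    H = [[nu, -nu], [1.0 - nu, nu]]
    s, c = math.sqrt(nu), math.sqrt(1.0 - nu)
    Q = [[s, s], [-1j * c, 1j * c]]
    Qinv = [[1.0 / (2.0 * s), 1j / (2.0 * c)], [1.0 / (2.0 * s), -1j / (2.0 * c)]]
    Gamma = [[g1, 0.0], [0.0, g2]]
    R = matmul2(matmul2(Q, Gamma), Qinv)
    err = max(abs(R[i][j] - H[i][j]) for i in range(2) for j in range(2))
    return err, abs(g1), abs(g2)


if __name__ == "__main__":
    # Example value: chi = 2 and lambda_i = 0.4 give nu_i = 1 - 0.6 / 4 = 0.85.
    print(check_decomposition(0.85))
```

Both eigenvalues have modulus $\sqrt{\nu_{i}}$ , which is where the expression for $\|{\bf{\Gamma}}\|$ used later comes from.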

Note that

Q_{2,i}Q_{2,i}^{*}=\left[\begin{array}[]{cc}2\nu_{i}&0\\ 0&2(1-\nu_{i})\end{array}\right],\text{ and }(Q_{2,i}^{-1})(Q_{2,i}^{-1})^{*}=% \frac{1}{4\nu_{i}(1-\nu_{i})}\left[\begin{array}[]{cc}1&1-2\nu_{i}\\ 1-2\nu_{i}&1\end{array}\right].

Since the spectral radius of a matrix is upper bounded by any of its norms and $0<\nu_{i}<1$ , it holds that

\|Q_{2,i}\|^{2}\leq\|Q_{2,i}Q_{2,i}^{*}\|_{1}\leq 2,\text{ and }\|Q_{2,i}^{-1}% \|^{2}\leq\|(Q_{2,i}^{-1})(Q_{2,i}^{-1})^{*}\|_{1}\leq\frac{2}{4\nu_{i}(1-\nu_% {i})}.

Using $\nu_{i}\geq 1-\frac{1}{2\chi}(1-\lambda_{n})$ and $1-\nu_{i}=\frac{1}{2\chi}(1-\lambda_{i})\geq\frac{1}{2\chi}(1-\lambda_{2})$ , we have

\|Q_{2,i}^{-1}\|^{2}\leq\frac{\chi}{(1-\frac{1}{2\chi}(1-\lambda_{n}))(1-% \lambda_{2})}\leq\frac{2\chi}{(1+\lambda_{n})(1-\lambda_{2})}\ .
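As a sanity check, the final bound can be compared against the exact value of $\|Q_{2,i}^{-1}\|^{2}$ . The sketch below assumes $\chi\geq 1$ and reads the largest eigenvalue of $(Q_{2,i}^{-1})(Q_{2,i}^{-1})^{*}$ directly off the $2\times 2$ matrix displayed above, namely $(1+|1-2\nu_{i}|)/(4\nu_{i}(1-\nu_{i}))$ ; the function names and the grid over $\lambda_{i}\in[\lambda_{n},\lambda_{2}]$ are illustrative choices.

```python
def q_inv_norm_sq(nu):
    """Exact squared spectral norm of Q_{2,i}^{-1}: largest eigenvalue of Q^{-1}(Q^{-1})^*."""
    return (1.0 + abs(1.0 - 2.0 * nu)) / (4.0 * nu * (1.0 - nu))


def bound_holds(lam2, lamn, chi, grid=200):
    """Check ||Q_{2,i}^{-1}||^2 <= 2*chi / ((1 + lam_n)(1 - lam_2)) on a grid of lambda_i."""
    bound = 2.0 * chi / ((1.0 + lamn) * (1.0 - lam2))
    for k in range(grid + 1):
        lam = lamn + (lam2 - lamn) * k / grid
        nu = 1.0 - (1.0 - lam) / (2.0 * chi)
        if q_inv_norm_sq(nu) > bound:
            return False
    return True


if __name__ == "__main__":
    print(bound_holds(lam2=0.9, lamn=-0.8, chi=1.0))
    print(bound_holds(lam2=0.5, lamn=-0.5, chi=3.0))
```

In these examples the exact norm is well below the bound, so the inequality is loose but sufficient for the analysis.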

Let ${\bf{Q}}={\bf{Q}}_{1}^{\sf T}{\bf{Q}}_{2}$ with ${\bf{Q}}_{2}=\mathrm{blkdiag}\{Q_{2,i}\}_{i=2}^{n}$ . We have ${\bf{Q}}^{-1}{\bf{H}}{\bf{Q}}={\bf{\Gamma}}$ , where ${\bf{\Gamma}}=\mathrm{blkdiag}\{\Gamma_{i}\}_{i=2}^{n}$ , i.e., there exists an invertible matrix ${\bf{Q}}$ such that ${\bf{H}}={\bf{Q}}{\bf{\Gamma}}{\bf{Q}}^{-1}$ , and

\|{\bf{\Gamma}}\|=\sqrt{1-\frac{1}{2\chi}(1-\lambda_{2})}<1.

Therefore, we finally obtain (30). Moreover, we have

\|{\bf{Q}}\|^{2}\leq 2\text{ and }\|{\bf{Q}}^{-1}\|^{2}\leq\frac{2\chi}{(1+% \lambda_{n})(1-\lambda_{2})}.

Then, we prove $\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\leq 4\|\mathcal{E}^{t}\|_% {\mathrm{F}}^{2}$ . Since

\mathcal{E}^{t}={\bf{Q}}^{-1}\left[\begin{array}[]{c}\hat{{\bf{P}}}^{\sf T}{{% \bf{X}}}^{t}\\ \hat{{\bf{P}}}^{\sf T}{{\bf{R}}}^{t}\\ \end{array}\right]\text{ and }{\bf{Q}}^{-1}={\bf{Q}}_{1}\left[\begin{array}[]{% cc}\frac{1}{2}\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}&\frac{j}{2}({\bf{I}}-% \hat{{\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\\ \frac{1}{2}\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}&-\frac{j}{2}({\bf{I}}-\hat{% {\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\\ \end{array}\right]

taking the squared norm, we have

	$\displaystyle\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}$	$\displaystyle=\left\|{\bf{Q}}_{1}\left[\begin{array}[]{cc}\frac{1}{2}\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}&\frac{j}{2}({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\\ \frac{1}{2}\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}&-\frac{j}{2}({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\\ \end{array}\right]\left[\begin{array}[]{c}\hat{{\bf{P}}}^{\sf T}{{\bf{X}}}^{t}\\ \hat{{\bf{P}}}^{\sf T}{{\bf{R}}}^{t}\\ \end{array}\right]\right\|_{\mathrm{F}}^{2}\leq\frac{1}{4}\left\|\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{{\bf{X}}}^{t}+j({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{{\bf{R}}}^{t}\\ \hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{{\bf{X}}}^{t}-j({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{{\bf{R}}}^{t}\\ \end{array}\right]\right\|_{\mathrm{F}}^{2}$
		$\displaystyle\leq\|\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}+\|({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{{\bf{R}}}^{t}\|_{\mathrm{F}}^{2}.$

On the other hand, noting that

{\bf{Q}}=\left[\begin{array}[]{cc}\hat{{\bf{\Lambda}}}_{a}^{\frac{1}{2}}&\hat{% {\bf{\Lambda}}}_{a}^{\frac{1}{2}}\\ -j({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{\frac{1}{2}}&j({\bf{I}}-\hat{{\bf{% \Lambda}}}_{a})^{\frac{1}{2}}\\ \end{array}\right]{\bf{Q}}_{1}^{\sf T},

it holds that

\left[\begin{array}[]{c}\hat{{\bf{P}}}^{\sf T}{{\bf{X}}}^{t}\\ \hat{{\bf{P}}}^{\sf T}{{\bf{R}}}^{t}\\ \end{array}\right]=\left[\begin{array}[]{cc}\hat{{\bf{\Lambda}}}_{a}^{\frac{1}% {2}}&\hat{{\bf{\Lambda}}}_{a}^{\frac{1}{2}}\\ -j({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{\frac{1}{2}}&j({\bf{I}}-\hat{{\bf{% \Lambda}}}_{a})^{\frac{1}{2}}\\ \end{array}\right]{\bf{Q}}_{1}^{\sf T}\mathcal{E}^{t}=\left[\begin{array}[]{c}% \hat{{\bf{\Lambda}}}_{a}^{\frac{1}{2}}({\bf{Q}}_{1,u}^{\sf T}+{\bf{Q}}_{1,l}^{% \sf T})\mathcal{E}^{t}\\ -j({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{\frac{1}{2}}({\bf{Q}}_{1,u}^{\sf T}+{% \bf{Q}}_{1,l}^{\sf T})\mathcal{E}^{t}\\ \end{array}\right],

where ${\bf{Q}}_{1,u}^{\sf T}$ and ${\bf{Q}}_{1,l}^{\sf T}$ are the upper and lower blocks of ${\bf{Q}}_{1}^{\sf T}=[{\bf{Q}}_{1,u}^{\sf T};{\bf{Q}}_{1,l}^{\sf T}]$ . Then, it holds that

\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}=\|\hat{{\bf{P}}}^{\sf T}{% {\bf{X}}}^{t}\|^{2}_{\mathrm{F}}=\|\hat{{\bf{\Lambda}}}_{a}^{\frac{1}{2}}({\bf% {Q}}_{1,u}^{\sf T}+{\bf{Q}}_{1,l}^{\sf T})\mathcal{E}^{t}\|^{2}_{\mathrm{F}}% \leq 4\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2},

where we used $\|{\bf{Q}}_{1,u}^{\sf T}+{\bf{Q}}_{1,l}^{\sf T}\|^{2}\leq 4$ , which holds since ${\bf{Q}}_{1}$ is a permutation matrix with $\|{\bf{Q}}_{1}\|=1$ . ∎

Appendix E Proof of Lemma 2

Proof.

Proof of the descent inequality (32). Since $f$ is $L$ -smooth, setting $y=\bar{{\bf{x}}}^{t+1}$ and $x=\bar{{\bf{x}}}^{t}$ in (18) gives

\displaystyle f(\bar{{\bf{x}}}^{t+1})\leq f(\bar{{\bf{x}}}^{t})+\langle\nabla f% (\bar{{\bf{x}}}^{t}),\bar{{\bf{x}}}^{t+1}-\bar{{\bf{x}}}^{t}\rangle+\frac{L}{2% }\|\bar{{\bf{x}}}^{t+1}-\bar{{\bf{x}}}^{t}\|^{2}.

From (30a), i.e.,

\bar{{\bf{x}}}^{t+1}=\bar{{\bf{x}}}^{t}-\alpha\overline{\nabla F}({\bf{X}}^{t}% )-\alpha\bar{{\bf{s}}}^{t},

where $\overline{\nabla F}({\bf{X}}^{t})=\big{(}\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}% ({\bf{x}}^{t}_{i})\big{)}^{\sf T}$ , we have

\displaystyle f(\bar{{\bf{x}}}^{t+1})\leq f(\bar{{\bf{x}}}^{t})-\alpha\big{% \langle}\nabla f(\bar{{\bf{x}}}^{t}),\overline{\nabla F}({\bf{X}}^{t})+\bar{{% \bf{s}}}^{t}\big{\rangle}+\frac{L\alpha^{2}}{2}\|\overline{\nabla F}({\bf{X}}^% {t})+\bar{{\bf{s}}}^{t}\|^{2}.

Taking the conditional expectation with respect to $\mathcal{G}^{t}$ and using

\mathbb{E}\!\left[\bar{{\bf{s}}}^{t}\;|\;\mathcal{G}^{t}\right]=0,\ \mathbb{E}% \!\left[\|\bar{{\bf{s}}}^{t}\|^{2}\;|\;\mathcal{G}^{t}\right]\leq\frac{\sigma^% {2}}{n},

it holds that

	$\displaystyle\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t+1})\;|\;\mathcal{G}^{t}\right]$	$\displaystyle\leq\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t})\;|\;\mathcal{G}^{t}\right]-\alpha\big{\langle}\nabla f(\bar{{\bf{x}}}^{t}),\overline{\nabla F}({\bf{X}}^{t})\big{\rangle}+\mathbb{E}\!\left[\frac{L\alpha^{2}}{2}\|\overline{\nabla F}({\bf{X}}^{t})+\bar{{\bf{s}}}^{t}\|^{2}\;|\;\mathcal{G}^{t}\right]$
		$\displaystyle=\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t})\;|\;\mathcal{G}^{t}\right]-\alpha\big{\langle}\nabla f(\bar{{\bf{x}}}^{t}),\overline{\nabla F}({\bf{X}}^{t})\big{\rangle}+\frac{L\alpha^{2}}{2}\Big{(}\mathbb{E}\!\left[\|\overline{\nabla F}({\bf{X}}^{t})\|^{2}\;|\;\mathcal{G}^{t}\right]+\mathbb{E}\!\left[\|\bar{{\bf{s}}}^{t}\|^{2}\;|\;\mathcal{G}^{t}\right]\Big{)}$
		$\displaystyle\leq\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t})\;|\;\mathcal{G}^{t}\right]-\alpha\big{\langle}\nabla f(\bar{{\bf{x}}}^{t}),\overline{\nabla F}({\bf{X}}^{t})\big{\rangle}+\frac{L\alpha^{2}}{2}\mathbb{E}\!\left[\|\overline{\nabla F}({\bf{X}}^{t})\|^{2}\;|\;\mathcal{G}^{t}\right]+\frac{L\alpha^{2}\sigma^{2}}{2n}.$

Since $2\langle a,b\rangle=\|a\|^{2}+\|b\|^{2}-\|a-b\|^{2}$ , we have

-\big{\langle}\nabla f(\bar{{\bf{x}}}^{t}),\overline{\nabla F}({\bf{X}}^{t})% \big{\rangle}=-\frac{1}{2}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}-\frac{1}{2}\|% \overline{\nabla F}({\bf{X}}^{t})\|^{2}+\frac{1}{2}\|\overline{\nabla F}({\bf{% X}}^{t})-\nabla f(\bar{{\bf{x}}}^{t})\|^{2}.

Combining the last two equations and by $\alpha L\leq\frac{1}{2}$ , we get

	$\displaystyle\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t+1})\;|\;\mathcal{G}^{t}\right]$	$\displaystyle\leq f(\bar{{\bf{x}}}^{t})-\frac{\alpha}{2}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}-\frac{\alpha}{2}\|\overline{\nabla F}({\bf{X}}^{t})\|^{2}+\frac{\alpha}{2}\|\overline{\nabla F}({\bf{X}}^{t})-\nabla f(\bar{{\bf{x}}}^{t})\|^{2}$
		$\displaystyle\quad+\frac{L\alpha^{2}}{2}\|\overline{\nabla F}({\bf{X}}^{t})\|^{2}+\frac{L\alpha^{2}\sigma^{2}}{2n}$
		$\displaystyle\leq f(\bar{{\bf{x}}}^{t})-\frac{\alpha}{2}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}-\frac{\alpha}{2}(1-\alpha L)\|\overline{\nabla F}({\bf{X}}^{t})\|^{2}+\frac{\alpha}{2}\|\overline{\nabla F}({\bf{X}}^{t})-\nabla f(\bar{{\bf{x}}}^{t})\|^{2}+\frac{L\alpha^{2}\sigma^{2}}{2n}$
		$\displaystyle\leq f(\bar{{\bf{x}}}^{t})-\frac{\alpha}{2}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}+\frac{\alpha}{2}\|\overline{\nabla F}({\bf{X}}^{t})-\nabla f(\bar{{\bf{x}}}^{t})\|^{2}+\frac{L\alpha^{2}\sigma^{2}}{2n}.$

By (31), i.e., $\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\leq 4\|\mathcal{E}^{t}\|_% {\mathrm{F}}^{2}$ , we have

	$\displaystyle\frac{\alpha}{2}\|\overline{\nabla F}({\bf{X}}^{t})-\nabla f(\bar{{\bf{x}}}^{t})\|^{2}$	$\displaystyle=\frac{\alpha}{2}\|\frac{1}{n}\sum_{i=1}^{n}(\nabla f_{i}({\bf{x}}_{i}^{t})-\nabla f_{i}(\bar{{\bf{x}}}^{t}))\|^{2}$
		$\displaystyle\leq\frac{\alpha}{2n}\sum_{i=1}^{n}\|\nabla f_{i}({\bf{x}}_{i}^{t})-\nabla f_{i}(\bar{{\bf{x}}}^{t})\|^{2}\leq\frac{\alpha L^{2}}{2n}\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\leq\frac{2\alpha L^{2}}{n}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}.$

Thus, the descent inequality (32) holds.

Proof of the inequality (33). Taking the conditional expectation with respect to $\mathcal{F}^{t}$ , from (30f), we have

	$\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]$	$\displaystyle=\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}+\mathbb{E}\!\left[\|\mathbb{F}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]$
		$\displaystyle=\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}+\mathbb{E}\!\left[\|{\bf{Q}}^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]+\mathbb{E}\!\left[\|{\bf{Q}}^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right].$

Since ${\bf{E}}^{t}=\frac{(\theta_{t}-1)}{2\chi}{\bf{W}}_{b}\hat{{\bf{Z}}}^{t}$ , $\mathop{\rm Prob}(\theta_{t}=1)=p$ , and $\mathop{\rm Prob}(\theta_{t}=0)=1-p$ , we have

	$\displaystyle\mathbb{E}\!\left[\|{\bf{Q}}^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]+\mathbb{E}\!\left[\|{\bf{Q}}^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]$
	$\displaystyle=\frac{1-p}{4\chi^{2}}\Big{(}\|{\bf{Q}}^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{W}}_{b}\hat{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}+\|{\bf{Q}}^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{W}}_{b}\hat{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}\Big{)}\leq{\frac{2(1-p)}{\chi^{2}}}\|{\bf{Q}}^{-1}\hat{{\bf{P}}}^{\sf T}\hat{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}.$

Hence, it gives that

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\;|\;% \mathcal{F}^{t}\right]\leq\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}+{\frac{2(1-p)}{% \chi^{2}}}\|{\bf{Q}}^{-1}\hat{{\bf{P}}}^{\sf T}\hat{{\bf{Z}}}^{t}\|_{\mathrm{F% }}^{2}.

Taking the conditional expectation with respect to $\mathcal{G}^{t}\subset\mathcal{F}^{t}$ , and using the unbiasedness of ${\bf{G}}^{t}$ , we have

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\;|\;% \mathcal{G}^{t}\right]

\displaystyle\leq\mathbb{E}\!\left[\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}\;|\;% \mathcal{G}^{t}\right]+\frac{2(1-p)}{\chi^{2}}\mathbb{E}\!\left[\|{\bf{Q}}^{-1% }\hat{{\bf{P}}}^{\sf T}\hat{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{% t}\right].

(52)

We first bound $\mathbb{E}\!\left[\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$ . Recall the definition of $\mathbb{G}^{t}$ :

	$\displaystyle\mathbb{G}^{t}$	$\displaystyle={\bf{\Gamma}}\mathcal{E}^{t}-\alpha{\bf{Q}}^{-1}\left[\begin{% array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t% })-\nabla F(\bar{{\bf{X}}}^{t})+{\bf{S}}^{t})\\ \frac{1}{2\chi}\hat{{\bf{\Lambda}}}_{b}^{2}\hat{{\bf{P}}}^{\sf T}(\nabla F({% \bf{X}}^{t})-\nabla F(\bar{{\bf{X}}}^{t})+{\bf{S}}^{t})+\hat{{\bf{P}}}^{\sf T}% (\nabla F(\bar{{\bf{X}}}^{t})-\nabla F(\bar{{\bf{X}}}^{t+1}))\\ \end{array}\right]$
		$\displaystyle={\bf{\Gamma}}\mathcal{E}^{t}-\alpha{\bf{Q}}^{-1}\underbrace{% \left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}}}^{\sf T}(\nabla F% ({\bf{X}}^{t})-\nabla F(\bar{{\bf{X}}}^{t}))\\ \frac{1}{2\chi}\hat{{\bf{\Lambda}}}_{b}^{2}\hat{{\bf{P}}}^{\sf T}(\nabla F({% \bf{X}}^{t})-\nabla F(\bar{{\bf{X}}}^{t}))+\hat{{\bf{P}}}^{\sf T}(\nabla F(% \bar{{\bf{X}}}^{t})-\nabla F(\bar{{\bf{X}}}^{t+1}))\\ \end{array}\right]}_{{\bf{F}}^{t}}-\alpha\underbrace{{\bf{Q}}^{-1}\left[\begin% {array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}}}^{\sf T}\\ \frac{1}{2\chi}\hat{{\bf{\Lambda}}}_{b}^{2}\hat{{\bf{P}}}^{\sf T}\\ \end{array}\right]}_{{\bf{C}}}{\bf{S}}^{t}$
		$\displaystyle={\bf{\Gamma}}\mathcal{E}^{t}-\alpha{\bf{Q}}^{-1}{\bf{F}}^{t}-% \alpha{\bf{C}}{\bf{S}}^{t}.$

Note that ${\bf{Q}}^{-1}={\bf{Q}}_{1}\left[\begin{array}[]{cc}\frac{1}{2}\hat{{\bf{% \Lambda}}}_{a}^{-\frac{1}{2}}&\frac{j}{2}({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{% -\frac{1}{2}}\\ \frac{1}{2}\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}&-\frac{j}{2}({\bf{I}}-\hat{% {\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\\ \end{array}\right]$ and ${\bf{I}}-\hat{{\bf{\Lambda}}}_{a}=\frac{1}{2\chi}\hat{{\bf{\Lambda}}}_{b}^{2}$ . It follows that

\displaystyle{\bf{CS}}^{t}

\displaystyle={\bf{Q}}_{1}\left[\begin{array}[]{cc}\frac{1}{2}\hat{{\bf{% \Lambda}}}_{a}^{-\frac{1}{2}}&\frac{j}{2}({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{% -\frac{1}{2}}\\ \frac{1}{2}\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}&-\frac{j}{2}({\bf{I}}-\hat{% {\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\\ \end{array}\right]\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}% }}^{\sf T}\\ ({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})\hat{{\bf{P}}}^{\sf T}\\ \end{array}\right]{\bf{S}}^{t}=\frac{1}{2}{\bf{Q}}_{1}\left[\begin{array}[]{c}% \hat{{\bf{\Lambda}}}_{a}^{\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^{t}+j({% \bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^% {t}\\ \hat{{\bf{\Lambda}}}_{a}^{\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^{t}-j({% \bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^% {t}\\ \end{array}\right],

where ${\bf{Q}}_{1}$ is a permutation matrix $\|{\bf{Q}}_{1}\|=1$ . Therefore, we have

\|{\bf{CS}}^{t}\|^{2}_{\mathrm{F}}\leq\frac{1}{4}(\|\hat{{\bf{\Lambda}}}_{a}^{% \frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^{t}+j({\bf{I}}-\hat{{\bf{\Lambda}}}% _{a})^{\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^{t}\|^{2}+\|\hat{{\bf{% \Lambda}}}_{a}^{\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^{t}-j({\bf{I}}-\hat% {{\bf{\Lambda}}}_{a})^{\frac{1}{2}}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^{t}\|^{2})% \leq 2\|{\bf{S}}^{t}\|^{2}_{\mathrm{F}}.

Then, using Cauchy-Schwarz inequality, $\|\hat{{\bf{\Lambda}}}_{a}\|\leq 1$ , $\|\hat{{\bf{\Lambda}}}_{b}^{2}\|\leq 2$ , and $\|\hat{{\bf{P}}}^{\sf T}\|\leq 1$ , we have

	$\displaystyle\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}$	$\displaystyle=\|{\bf{\Gamma}}\mathcal{E}^{t}-\alpha{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}-2\alpha\langle{\bf{\Gamma}}\mathcal{E}^{t},{\bf{C}}{\bf{S}}^{t}\rangle+2\alpha^{2}\langle{\bf{Q}}^{-1}{\bf{F}}^{t},{\bf{C}}{\bf{S}}^{t}\rangle+\alpha^{2}\|{\bf{CS}}^{t}\|_{\mathrm{F}}^{2}$
		$\displaystyle\leq\|{\bf{\Gamma}}\mathcal{E}^{t}-\alpha{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}-2\alpha\langle{\bf{\Gamma}}\mathcal{E}^{t},{\bf{C}}{\bf{S}}^{t}\rangle+\alpha^{2}\|{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}+2\alpha^{2}\|{\bf{C}}{\bf{S}}^{t}\|_{\mathrm{F}}^{2}$
		$\displaystyle\leq\|{\bf{\Gamma}}\mathcal{E}^{t}-\alpha{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}+\alpha^{2}\|{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}-2\alpha\langle{\bf{\Gamma}}\mathcal{E}^{t},{\bf{C}}{\bf{S}}^{t}\rangle+4\alpha^{2}\|{\bf{S}}^{t}\|_{\mathrm{F}}^{2}.$

For any matrices ${\bf{a}}$ and ${\bf{b}}$ , it holds from Jensen’s inequality that $\|{\bf{a+b}}\|_{\mathrm{F}}^{2}\leq\frac{1}{\theta}\|{\bf{a}}\|_{\mathrm{F}}^{% 2}+\frac{1}{1-\theta}\|{\bf{b}}\|_{\mathrm{F}}^{2}$ for any $\theta\in(0,1)$ . Therefore, letting $\theta=\|{\bf{\Gamma}}\|:=\gamma$ , it holds that

\|{\bf{\Gamma}}\mathcal{E}^{t}-\alpha{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^% {2}\leq\frac{1}{\gamma}\|{\bf{\Gamma}}\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac% {1}{1-\gamma}\|\alpha{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}\leq\gamma\|% \mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}}{1-\gamma}\|{\bf{Q}}^{-1}{% \bf{F}}^{t}\|_{\mathrm{F}}^{2}.
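The splitting inequality $\|{\bf{a+b}}\|_{\mathrm{F}}^{2}\leq\frac{1}{\theta}\|{\bf{a}}\|_{\mathrm{F}}^{2}+\frac{1}{1-\theta}\|{\bf{b}}\|_{\mathrm{F}}^{2}$ used above can be sanity-checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`sq_norm`, `split_bound`, `check_split`) are hypothetical and not part of the paper, and matrices are represented as flattened lists so that the squared Euclidean norm coincides with the squared Frobenius norm.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a flat list (Frobenius norm of a flattened matrix)."""
    return sum(x * x for x in v)


def split_bound(a, b, theta):
    """Right-hand side ||a||^2 / theta + ||b||^2 / (1 - theta) for theta in (0, 1)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    return sq_norm(a) / theta + sq_norm(b) / (1.0 - theta)


def check_split(trials=1000, dim=6, seed=0):
    """Return True if ||a + b||^2 <= split_bound(a, b, theta) on random draws."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        theta = rng.uniform(0.01, 0.99)
        lhs = sq_norm([x + y for x, y in zip(a, b)])
        # Small tolerance to absorb floating-point rounding.
        if lhs > split_bound(a, b, theta) + 1e-9:
            return False
    return True
```

The bound is attained when ${\bf{a}}=\theta{\bf{c}}$ and ${\bf{b}}=(1-\theta){\bf{c}}$ for some ${\bf{c}}$, which is why $\theta$ can be tuned (here to $\gamma$) without losing anything in the worst case.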

Since $\frac{1}{1-\gamma}>1$ , we have

\displaystyle\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}\leq\gamma\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{2\alpha^{2}}{1-\gamma}\|{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}-2\alpha\langle{\bf{\Gamma}}\mathcal{E}^{t},{\bf{C}}{\bf{S}}^{t}\rangle+4\alpha^{2}\|{\bf{S}}^{t}\|_{\mathrm{F}}^{2}.

Note that

{\bf{S}}^{t}={\bf{G}}^{t}-\nabla F({\bf{X}}^{t}),\ \mathbb{E}\!\left[{\bf{S}}^% {t}\;|\;\mathcal{G}^{t}\right]=0,\ \mathbb{E}\!\left[\|{\bf{S}}^{t}\|_{\mathrm% {F}}^{2}\;|\;\mathcal{G}^{t}\right]\leq n\sigma^{2}.

Taking the conditional expectation of the above inequality, it follows that

\begin{aligned}
\mathbb{E}\!\left[\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]
&\leq\gamma\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{2\alpha^{2}}{1-\gamma}\mathbb{E}\!\left[\|{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]-2\alpha\mathbb{E}\!\left[\langle{\bf{\Gamma}}\mathcal{E}^{t},{\bf{C}}{\bf{S}}^{t}\rangle\;|\;\mathcal{G}^{t}\right]+4\alpha^{2}\mathbb{E}\!\left[\|{\bf{S}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]\\
&\leq\gamma\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{2\alpha^{2}}{1-\gamma}\mathbb{E}\!\left[\|{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]+4n\alpha^{2}\sigma^{2}.
\end{aligned}

(53)

The term $\mathbb{E}\!\left[\|{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$ can be bounded as follows. Note that

{\bf{Q}}^{-1}={\bf{Q}}_{1}\left[\begin{array}[]{cc}\frac{1}{2}\hat{{\bf{% \Lambda}}}_{a}^{-\frac{1}{2}}&\frac{j}{2}({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{% -\frac{1}{2}}\\ \frac{1}{2}\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}&-\frac{j}{2}({\bf{I}}-\hat{% {\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\\ \end{array}\right]\text{ and }{\bf{I}}-\hat{{\bf{\Lambda}}}_{a}=\frac{1}{2\chi% }\hat{{\bf{\Lambda}}}_{b}^{2}.

It follows that

\displaystyle{\bf{Q}}^{-1}{\bf{F}}^{t}

\displaystyle={\bf{Q}}_{1}\left[\begin{array}[]{cc}\frac{1}{2}\hat{{\bf{% \Lambda}}}_{a}^{-\frac{1}{2}}&\frac{j}{2}({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{% -\frac{1}{2}}\\ \frac{1}{2}\hat{{\bf{\Lambda}}}_{a}^{-\frac{1}{2}}&-\frac{j}{2}({\bf{I}}-\hat{% {\bf{\Lambda}}}_{a})^{-\frac{1}{2}}\\ \end{array}\right]\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}% }}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F(\bar{{\bf{X}}}^{t}))\\ ({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t% })-\nabla F(\bar{{\bf{X}}}^{t}))+\hat{{\bf{P}}}^{\sf T}(\nabla F(\bar{{\bf{X}}% }^{t})-\nabla F(\bar{{\bf{X}}}^{t+1}))\\ \end{array}\right].

By $\|({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{-1}\|=\frac{2\chi}{1-\lambda_{2}}$ and $L$ -smoothness of $f_{i}$ , we have

\displaystyle\mathbb{E}\!\left[\|{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}% \;|\;\mathcal{G}^{t}\right]\leq 4L^{2}\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{% \mathrm{F}}^{2}+\frac{2\chi nL^{2}}{1-\lambda_{2}}\mathbb{E}\!\left[\|\bar{{% \bf{x}}}^{t}-\bar{{\bf{x}}}^{t+1}\|^{2}\;|\;\mathcal{G}^{t}\right].

(54)

Since $\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\leq 4\|\mathcal{E}^{t}\|_% {\mathrm{F}}^{2}$ , it holds that

\displaystyle\mathbb{E}\!\left[\|{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}% \;|\;\mathcal{G}^{t}\right]\leq 16L^{2}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+% \frac{2\chi nL^{2}}{1-\lambda_{2}}\mathbb{E}\!\left[\|\bar{{\bf{x}}}^{t}-\bar{% {\bf{x}}}^{t+1}\|^{2}\;|\;\mathcal{G}^{t}\right].

(55)

Since $\bar{{\bf{x}}}^{t+1}=\bar{{\bf{x}}}^{t}-\alpha\overline{\nabla F}({\bf{X}}^{t}% )-\alpha\bar{{\bf{s}}}^{t}$ , $\mathbb{E}\!\left[\bar{{\bf{s}}}^{t}\;|\;\mathcal{G}^{t}\right]=0$ , and $\mathbb{E}\!\left[\|\bar{{\bf{s}}}^{t}\|^{2}\;|\;\mathcal{G}^{t}\right]\leq% \frac{\sigma^{2}}{n}$ , it gives that

\begin{aligned}
\mathbb{E}\!\left[\|\bar{{\bf{x}}}^{t}-\bar{{\bf{x}}}^{t+1}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]
&=\mathbb{E}\!\left[\|\alpha\overline{\nabla F}({\bf{X}}^{t})+\alpha\bar{{\bf{s}}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]\\
&=\alpha^{2}\mathbb{E}\!\left[\|\bar{{\bf{s}}}^{t}+(\overline{\nabla F}({\bf{X}}^{t})-\overline{\nabla F}(\bar{{\bf{X}}}^{t}))+\overline{\nabla F}(\bar{{\bf{X}}}^{t})\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]\\
&\leq\alpha^{2}\mathbb{E}\!\left[\|\bar{{\bf{s}}}^{t}\|^{2}\;|\;\mathcal{G}^{t}\right]+2\alpha^{2}\|\overline{\nabla F}({\bf{X}}^{t})-\overline{\nabla F}(\bar{{\bf{X}}}^{t})\|_{\mathrm{F}}^{2}+2\alpha^{2}\|\overline{\nabla F}(\bar{{\bf{X}}}^{t})\|_{\mathrm{F}}^{2}\\
&\leq\frac{\alpha^{2}\sigma^{2}}{n}+\frac{2\alpha^{2}L^{2}}{n}\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}+2\alpha^{2}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}\\
&\leq\frac{\alpha^{2}\sigma^{2}}{n}+\frac{8\alpha^{2}L^{2}}{n}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+2\alpha^{2}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}.
\end{aligned}

Then, substituting it into (55), we have

\displaystyle\mathbb{E}\!\left[\|{\bf{Q}}^{-1}{\bf{F}}^{t}\|_{\mathrm{F}}^{2}% \;|\;\mathcal{G}^{t}\right]\leq

\displaystyle(16L^{2}+\frac{16\alpha^{2}L^{4}\chi}{1-\lambda_{2}})\|\mathcal{E% }^{t}\|_{\mathrm{F}}^{2}+\frac{4n\alpha^{2}L^{2}\chi}{1-\lambda_{2}}\|\nabla f% (\bar{{\bf{x}}}^{t})\|^{2}+\frac{2\alpha^{2}L^{2}\sigma^{2}\chi}{1-\lambda_{2}}.

(56)

Thus, combining (E) and (56), it holds that

\begin{aligned}
\mathbb{E}\!\left[\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]\leq{}&\gamma\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{32\alpha^{2}L^{2}+16\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}\\
&+\frac{8n\alpha^{4}L^{2}\chi}{(1-\gamma)(1-\lambda_{2})}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}+\frac{4\alpha^{4}L^{2}\sigma^{2}\chi}{(1-\gamma)(1-\lambda_{2})}+4n\alpha^{2}\sigma^{2}.
\end{aligned}

(57)

Then, we bound $\mathbb{E}\!\left[\|{\bf{Q}}^{-1}\hat{{\bf{P}}}^{\sf T}\hat{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$. Using $\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\leq 4\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}$, $\hat{{\bf{Z}}}^{t}={\bf{X}}^{t}-{\bf{R}}^{t}-\alpha(\nabla F({\bf{X}}^{t})-\nabla F(\bar{{\bf{X}}}^{t})+{\bf{S}}^{t})$, and $\|{\bf{Q}}^{-1}\|^{2}\leq\frac{2\chi}{(1+\lambda_{n})(1-\lambda_{2})}$, we have

\begin{aligned}
&\mathbb{E}\!\left[\|{\bf{Q}}^{-1}\hat{{\bf{P}}}^{\sf T}\hat{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]=\mathbb{E}\!\left[\|{\bf{Q}}^{-1}\hat{{\bf{P}}}^{\sf T}({\bf{X}}^{t}-{\bf{R}}^{t}-\alpha(\nabla F({\bf{X}}^{t})-\nabla F(\bar{{\bf{X}}}^{t})+{\bf{S}}^{t}))\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]\\
&\quad=\|{\bf{Q}}^{-1}\hat{{\bf{P}}}^{\sf T}({\bf{X}}^{t}-{\bf{R}}^{t}-\alpha(\nabla F({\bf{X}}^{t})-\nabla F(\bar{{\bf{X}}}^{t})))\|_{\mathrm{F}}^{2}+\mathbb{E}\!\left[\|\alpha{\bf{S}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]\\
&\quad\leq 3\|{\bf{Q}}^{-1}\hat{{\bf{P}}}^{\sf T}{\bf{X}}^{t}\|_{\mathrm{F}}^{2}+3\|{\bf{Q}}^{-1}\hat{{\bf{P}}}^{\sf T}{\bf{R}}^{t}\|_{\mathrm{F}}^{2}+3\alpha^{2}L^{2}\|{\bf{Q}}^{-1}\|^{2}\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}+n\alpha^{2}\sigma^{2}\\
&\quad\leq 3\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{24\chi\alpha^{2}L^{2}}{(1+\lambda_{n})(1-\lambda_{2})}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+n\alpha^{2}\sigma^{2}.
\end{aligned}

(58)

Combining (52) and (E), we have

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\;|\;% \mathcal{G}^{t}\right]

\displaystyle\leq\mathbb{E}\!\left[\|\mathbb{G}^{t}\|_{\mathrm{F}}^{2}\;|\;% \mathcal{G}^{t}\right]+\frac{2(1-p)}{\chi^{2}}\left(3\|\mathcal{E}^{t}\|_{% \mathrm{F}}^{2}+\frac{24\chi\alpha^{2}L^{2}}{(1+\lambda_{n})(1-\lambda_{2})}\|% \mathcal{E}^{t}\|_{\mathrm{F}}^{2}+n\alpha^{2}\sigma^{2}\right).

Therefore, combining it and (E), the inequality (33) follows.

Proof of the inequality (2). Let $\bar{{\bf{e}}}^{t}\triangleq\bar{{\bf{x}}}^{t}-({\bf{x}}^{\star})^{\sf T}$ . It follows from (30a), i.e., $\bar{{\bf{x}}}^{t+1}=\bar{{\bf{x}}}^{t}-\alpha\overline{\nabla F}({\bf{X}}^{t}% )-\alpha\bar{{\bf{s}}}^{t}$ , Assumption 4, and $\sum_{i=1}^{n}\nabla f_{i}({\bf{x}}^{\star})=0$ that

\begin{aligned}
&\mathbb{E}\!\left[\big{\|}\bar{{\bf{e}}}^{t+1}\big{\|}^{2}\;|\;\mathcal{G}^{t}\right]=\big{\|}\bar{{\bf{e}}}^{t}-\frac{\alpha}{n}\sum_{i=1}^{n}(\nabla f_{i}({\bf{x}}_{i}^{t})-\nabla f_{i}({\bf{x}}^{\star}))\big{\|}^{2}+\alpha^{2}\mathbb{E}\!\left[\|\bar{{\bf{s}}}^{t}\|^{2}\;|\;\mathcal{G}^{t}\right]\\
&\leq\big{\|}\bar{{\bf{e}}}^{t}-\frac{\alpha}{n}\sum_{i=1}^{n}(\nabla f_{i}({\bf{x}}_{i}^{t})-\nabla f_{i}({\bf{x}}^{\star}))\big{\|}^{2}+\frac{\alpha^{2}\sigma^{2}}{n}\\
&=\|\bar{{\bf{e}}}^{t}\|^{2}+\alpha^{2}\Big{\|}\frac{1}{n}\sum_{i=1}^{n}(\nabla f_{i}({\bf{x}}_{i}^{t})-\nabla f_{i}({\bf{x}}^{\star}))\Big{\|}^{2}+\frac{\alpha^{2}\sigma^{2}}{n}-\frac{2\alpha}{n}\sum_{i=1}^{n}\langle\nabla f_{i}({\bf{x}}_{i}^{t}),\bar{{\bf{e}}}^{t}\rangle.
\end{aligned}

(59)

It follows from the $L$ -smoothness of $f$ and $f_{i}$ and Jensen’s inequality that

\begin{aligned}
&\alpha^{2}\Big{\|}\frac{1}{n}\sum_{i=1}^{n}(\nabla f_{i}({\bf{x}}_{i}^{t})-\nabla f_{i}({\bf{x}}^{\star}))\Big{\|}^{2}=\alpha^{2}\Big{\|}\frac{1}{n}\sum_{i=1}^{n}(\nabla f_{i}({\bf{x}}_{i}^{t})-\nabla f_{i}(\bar{{\bf{x}}}^{t})+\nabla f_{i}(\bar{{\bf{x}}}^{t})-\nabla f_{i}({\bf{x}}^{\star}))\Big{\|}^{2}\\
&\leq 2\alpha^{2}\Big{\|}\frac{1}{n}\sum_{i=1}^{n}(\nabla f_{i}({\bf{x}}_{i}^{t})-\nabla f_{i}(\bar{{\bf{x}}}^{t}))\Big{\|}^{2}+2\alpha^{2}\Big{\|}\frac{1}{n}\sum_{i=1}^{n}(\nabla f_{i}(\bar{{\bf{x}}}^{t})-\nabla f_{i}({\bf{x}}^{\star}))\Big{\|}^{2}\\
&\leq\frac{2\alpha^{2}}{n}\sum_{i=1}^{n}\Big{\|}\nabla f_{i}({\bf{x}}_{i}^{t})-\nabla f_{i}(\bar{{\bf{x}}}^{t})\Big{\|}^{2}+2\alpha^{2}\Big{\|}\nabla f(\bar{{\bf{x}}}^{t})-\nabla f({\bf{x}}^{\star})\Big{\|}^{2}\\
&\leq\frac{2\alpha^{2}L^{2}}{n}\|{\bf{X}}^{t}-{\bf{1}}\bar{{\bf{x}}}^{t}\|_{\mathrm{F}}^{2}+4\alpha^{2}L\big{(}f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star})-\langle\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star},\nabla f({\bf{x}}^{\star})\rangle\big{)}\\
&=\frac{2\alpha^{2}L^{2}}{n}\|{\bf{X}}^{t}-{\bf{1}}\bar{{\bf{x}}}^{t}\|_{\mathrm{F}}^{2}+4\alpha^{2}L(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star})).
\end{aligned}

(60)

Then, we bound $-\frac{2\alpha}{n}\sum_{i=1}^{n}\langle\nabla f_{i}({\bf{x}}_{i}^{t}),\bar{{\bf{e}}}^{t}\rangle$. Since $f_{i}$ is $L$-smooth and $\mu$-strongly convex, and $-\frac{1}{n}\sum_{i=1}^{n}\|{\bf{x}}_{i}^{t}-{\bf{x}}^{\star}\|^{2}\leq-\|\frac{1}{n}\sum_{i=1}^{n}({\bf{x}}_{i}^{t}-{\bf{x}}^{\star})\|^{2}$ by Jensen's inequality, it follows from (16) that

\begin{aligned}
&-\frac{2\alpha}{n}\sum_{i=1}^{n}\langle\nabla f_{i}({\bf{x}}_{i}^{t}),\bar{{\bf{e}}}^{t}\rangle=\frac{2\alpha}{n}\sum_{i=1}^{n}\big{(}-\langle\nabla f_{i}({\bf{x}}_{i}^{t}),\bar{{\bf{x}}}^{t}-{\bf{x}}_{i}^{t}\rangle-\langle\nabla f_{i}({\bf{x}}_{i}^{t}),{\bf{x}}^{t}_{i}-{\bf{x}}^{\star}\rangle\big{)}\\
&\leq\frac{2\alpha}{n}\sum_{i=1}^{n}\Big{(}-f_{i}(\bar{{\bf{x}}}^{t})+f_{i}({\bf{x}}_{i}^{t})+\frac{L}{2}\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{t}_{i}\|^{2}-\frac{\mu}{2}\|{\bf{x}}_{i}^{t}-{\bf{x}}^{\star}\|^{2}-f_{i}({\bf{x}}_{i}^{t})+f_{i}({\bf{x}}^{\star})\Big{)}\\
&\leq-2\alpha(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star}))+\frac{\alpha L}{n}\sum_{i=1}^{n}\|\bar{{\bf{x}}}^{t}-{\bf{x}}_{i}^{t}\|^{2}-\mu\alpha\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}\\
&=-2\alpha(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star}))+\frac{\alpha L}{n}\|{\bf{X}}^{t}-{\bf{1}}\bar{{\bf{x}}}^{t}\|_{\mathrm{F}}^{2}-\mu\alpha\|\bar{{\bf{e}}}^{t}\|^{2}.
\end{aligned}

(61)

Substituting (E) and (E) into (E), and using $f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star})\geq 0$ , we have

\begin{aligned}
\mathbb{E}\!\left[\big{\|}\bar{{\bf{e}}}^{t+1}\big{\|}^{2}\;|\;\mathcal{G}^{t}\right]\leq{}&(1-\mu\alpha)\|\bar{{\bf{e}}}^{t}\|^{2}+\Big{(}\frac{\alpha L}{n}+\frac{2\alpha^{2}L^{2}}{n}\Big{)}\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\\
&+\frac{\alpha^{2}\sigma^{2}}{n}-2\alpha(1-2\alpha L)(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star})).
\end{aligned}

(62)

Since $\alpha\leq\frac{1}{4L}$ , it holds that

\begin{aligned}
\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{t+1}-{\bf{x}}^{\star}\big{\|}^{2}\;|\;\mathcal{G}^{t}\right]\leq{}&(1-\mu\alpha)\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}+\Big{(}\frac{\alpha L}{n}+\frac{2\alpha^{2}L^{2}}{n}\Big{)}\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\\
&+\frac{\alpha^{2}\sigma^{2}}{n}-2\alpha(1-2\alpha L)(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star}))\\
\leq{}&(1-\mu\alpha)\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}+\frac{3\alpha L}{2n}\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}\sigma^{2}}{n}-\alpha(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star})).
\end{aligned}

Combining with $\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\leq 4\|\mathcal{E}^{t}\|_% {\mathrm{F}}^{2}$ , we complete the proof. ∎
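For convenience, the elementary estimates behind the last two steps can be spelled out. This only unpacks the condition $\alpha\leq\frac{1}{4L}$, the fact that $f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star})\geq 0$, and the bound $\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}\leq 4\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}$; it introduces no new assumption.

```latex
\begin{aligned}
\alpha\leq\frac{1}{4L}
&\;\Longrightarrow\;
\frac{2\alpha^{2}L^{2}}{n}=\frac{\alpha L}{n}\cdot 2\alpha L\leq\frac{\alpha L}{2n}
\;\Longrightarrow\;
\frac{\alpha L}{n}+\frac{2\alpha^{2}L^{2}}{n}\leq\frac{3\alpha L}{2n},\\
\alpha\leq\frac{1}{4L}
&\;\Longrightarrow\;
1-2\alpha L\geq\frac{1}{2}
\;\Longrightarrow\;
-2\alpha(1-2\alpha L)\big(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star})\big)\leq-\alpha\big(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star})\big),\\
\frac{3\alpha L}{2n}\|{\bf{X}}^{t}-\bar{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}
&\leq\frac{3\alpha L}{2n}\cdot 4\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}
=\frac{6\alpha L}{n}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}.
\end{aligned}
```

The resulting coefficient $\frac{6\alpha L}{n}$ is the one that multiplies $\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}$ when this bound is reused in the later proofs.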

Appendix F Proof of Theorem 4

Proof.

From the condition of stepsize, we have

\alpha\leq\sqrt{\frac{(1+\lambda_{n})(1-\lambda_{2})}{2\chi}}\frac{1}{2L}% \Longrightarrow\frac{24\chi\alpha^{2}L^{2}}{(1+\lambda_{n})(1-\lambda_{2})}% \leq 3.

Then, it follows from the definition of $\tilde{\gamma}$ (34) that

\begin{aligned}
\tilde{\gamma}
&=\gamma+\frac{32\alpha^{2}L^{2}+16\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}+\frac{2(1-p)\big{(}3+\frac{24\chi\alpha^{2}L^{2}}{(1+\lambda_{n})(1-\lambda_{2})}\big{)}}{\chi^{2}}\\
&\leq\gamma+\frac{32\alpha^{2}L^{2}+16\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}+\frac{12(1-p)}{\chi^{2}}.
\end{aligned}

To ensure $\tilde{\gamma}\leq\frac{1+\gamma}{2}$ , we need to choose $\alpha$ and $\chi$ such that

\frac{32\alpha^{2}L^{2}+16\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma% }+\frac{12(1-p)}{\chi^{2}}\leq\frac{1-\gamma}{2}.

By solving these inequalities

\displaystyle\frac{32\alpha^{2}L^{2}}{1-\gamma}\leq\frac{1-\gamma}{6},\quad% \frac{16\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}\leq\frac{1-% \gamma}{6},\quad\frac{12(1-p)}{\chi^{2}}\leq\frac{1-\gamma}{6},

and using $\gamma=\sqrt{1-\frac{1}{2\chi}(1-\lambda_{2})}$ , we have

\alpha\leq\min\left\{\frac{1-\lambda_{2}}{32\sqrt{3}\chi L},\sqrt[4]{\frac{(1-% \lambda_{2})^{3}}{12\chi^{3}}}\frac{1}{4L}\right\},\ \chi\geq\frac{288(1-p)}{1% -\lambda_{2}}.

Thus, if $\alpha$ and $\chi$ satisfy the conditions above, then $\tilde{\gamma}\leq\frac{1+\gamma}{2}<1$.
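The implication can be checked numerically for concrete network parameters. The sketch below uses only the Python standard library and evaluates $\gamma$ and $\tilde{\gamma}$ directly from the formulas in this proof. The function names are hypothetical, and the clamp $\chi\geq 1$ in `min_chi` is an assumption made here so that $\gamma$ is real-valued when $p=1$; it is not taken from the text of this proof.

```python
import math


def contraction_factors(alpha, L, lam2, lamn, p, chi):
    """Return (gamma, gamma_tilde) computed from the formulas in the proof."""
    gamma = math.sqrt(1.0 - (1.0 - lam2) / (2.0 * chi))
    kappa = 2.0 * chi / (1.0 - lam2)  # equals ||(I - Lambda_a)^{-1}||
    gamma_tilde = (
        gamma
        + (32 * alpha**2 * L**2 + 16 * alpha**4 * L**4 * kappa) / (1.0 - gamma)
        + 2 * (1 - p)
        * (3 + 24 * chi * alpha**2 * L**2 / ((1 + lamn) * (1 - lam2)))
        / chi**2
    )
    return gamma, gamma_tilde


def max_stepsize(L, lam2, lamn, chi):
    """Largest alpha allowed by the three step-size conditions used in the proof."""
    return min(
        (1 - lam2) / (32 * math.sqrt(3) * chi * L),
        math.sqrt((1 + lamn) * (1 - lam2) / (2 * chi)) / (2 * L),
        ((1 - lam2) ** 3 / (12 * chi**3)) ** 0.25 / (4 * L),
    )


def min_chi(lam2, p):
    """Smallest chi with chi >= 288(1-p)/(1-lam2); chi >= 1 is an assumption."""
    return max(1.0, 288 * (1 - p) / (1 - lam2))
```

For example, with `chi = min_chi(0.9, 0.2)` and `alpha = max_stepsize(1.0, 0.9, -0.5, chi)`, calling `contraction_factors(alpha, 1.0, 0.9, -0.5, 0.2, chi)` returns a pair satisfying $\tilde{\gamma}\leq\frac{1+\gamma}{2}$.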

Define the Lyapunov function

\mathcal{L}^{t}=f(\bar{{\bf{x}}}^{t})-f^{\star}+\frac{2\alpha L^{2}}{n(1-\tilde{\gamma})}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}.

Since $\frac{1}{1-\gamma}\leq\frac{4\chi}{1-\lambda_{2}}$ and $\frac{1}{1-\tilde{\gamma}}\leq\frac{2}{1-\gamma}$ , we have

\frac{\frac{2\chi}{1-\lambda_{2}}}{(1-\gamma)^{2}}\leq\frac{32\chi^{3}}{(1-% \lambda_{2})^{3}}\text{ and }\frac{16\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}% }}{(1-\tilde{\gamma})(1-\gamma)}\leq\frac{1024\alpha^{4}L^{4}\chi^{3}}{(1-% \lambda_{2})^{3}}.

Thus, it gives that

\alpha\leq\sqrt[4]{\frac{(1-\lambda_{2})^{3}}{8\chi^{3}}}\frac{1}{4L}% \Rightarrow\frac{1}{2}<1-\frac{16\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{(% 1-\tilde{\gamma})(1-\gamma)}.

Then, since $\alpha\leq\frac{1}{2L}$ , by (32) and (33), it gives

\begin{aligned}
\mathbb{E}\!\left[\mathcal{L}^{t+1}\;|\;\mathcal{G}^{t}\right]\leq{}&f(\bar{{\bf{x}}}^{t})-f^{\star}-\frac{\alpha}{2}\big{\|}\nabla f(\bar{{\bf{x}}}^{t})\big{\|}^{2}+\frac{2\alpha L^{2}}{n}\big{\|}\mathcal{E}^{t}\big{\|}_{\mathrm{F}}^{2}+\frac{L\alpha^{2}\sigma^{2}}{2n}\\
&+\frac{2\alpha L^{2}}{n(1-\tilde{\gamma})}\Big{(}\tilde{\gamma}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{4n\alpha^{4}L^{2}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}+\frac{2\alpha^{4}L^{2}\sigma^{2}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}+\frac{2n\alpha^{2}\sigma^{2}(2\chi^{2}+(1-p))}{\chi^{2}}\Big{)}\\
={}&f(\bar{{\bf{x}}}^{t})-f^{\star}+\frac{2\alpha L^{2}}{n(1-\tilde{\gamma})}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}-\frac{\alpha}{2}\Big{(}1-\frac{16\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{(1-\tilde{\gamma})(1-\gamma)}\Big{)}\big{\|}\nabla f(\bar{{\bf{x}}}^{t})\big{\|}^{2}\\
&+\frac{L\alpha^{2}\sigma^{2}}{2n}+\frac{4\sigma^{2}L^{4}\alpha^{5}\frac{2\chi}{1-\lambda_{2}}}{n(1-\tilde{\gamma})(1-\gamma)}+\frac{4L^{2}\sigma^{2}\alpha^{3}(2\chi^{2}+(1-p))}{(1-\tilde{\gamma})\chi^{2}}\\
\leq{}&\mathcal{L}^{t}-\frac{\alpha}{4}\big{\|}\nabla f(\bar{{\bf{x}}}^{t})\big{\|}^{2}+\frac{L\alpha^{2}\sigma^{2}}{2n}+\frac{4\sigma^{2}L^{4}\alpha^{5}\frac{2\chi}{1-\lambda_{2}}}{n(1-\tilde{\gamma})(1-\gamma)}+\frac{4L^{2}\sigma^{2}\alpha^{3}(2\chi^{2}+(1-p))}{(1-\tilde{\gamma})\chi^{2}},
\end{aligned}

where the last inequality holds because the condition (36) implies $\frac{1}{2}<1-\frac{16\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{(1-\tilde{% \gamma})(1-\gamma)}$ . Taking full expectation, we have

\displaystyle\mathbb{E}\!\left[\mathcal{L}^{t+1}\right]\leq\mathbb{E}\!\left[% \mathcal{L}^{t}\right]-\frac{\alpha}{4}\mathbb{E}\!\left[\big{\|}\nabla f(\bar% {{\bf{x}}}^{t})\big{\|}^{2}\right]+\frac{L\alpha^{2}\sigma^{2}}{2n}+\frac{4% \sigma^{2}L^{4}\alpha^{5}\frac{2\chi}{1-\lambda_{2}}}{n(1-\tilde{\gamma})(1-% \gamma)}+\frac{4L^{2}\sigma^{2}\alpha^{3}(2\chi^{2}+(1-p))}{(1-\tilde{\gamma})% \chi^{2}}.

Summing it over $t=0,1,\cdots,T-1$ , we can obtain

\displaystyle\frac{\alpha}{4}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\big{\|}\nabla f% (\bar{{\bf{x}}}^{t})\big{\|}^{2}\right]\leq\mathcal{L}^{0}+T\Big{(}\frac{L% \alpha^{2}\sigma^{2}}{2n}+\frac{4\sigma^{2}L^{4}\alpha^{5}\frac{2\chi}{1-% \lambda_{2}}}{n(1-\tilde{\gamma})(1-\gamma)}+\frac{4L^{2}\sigma^{2}\alpha^{3}(% 2\chi^{2}+(1-p))}{(1-\tilde{\gamma})\chi^{2}}\Big{)},

which implies that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\big{\|}\nabla f(% \bar{{\bf{x}}}^{t})\big{\|}^{2}\right]\leq\frac{4\mathcal{L}^{0}}{\alpha T}+% \frac{2L\alpha\sigma^{2}}{n}+\frac{16\sigma^{2}L^{4}\alpha^{4}\frac{2\chi}{1-% \lambda_{2}}}{n(1-\tilde{\gamma})(1-\gamma)}+\frac{16L^{2}\sigma^{2}\alpha^{2}% (2\chi^{2}+(1-p))}{(1-\tilde{\gamma})\chi^{2}}.

Since ${\bf{X}}^{0}=[{\bf{x}}^{0},\cdots,{\bf{x}}^{0}]^{\sf T}$ , by [45, (75)], we have $\|\mathcal{E}^{0}\|_{\mathrm{F}}^{2}\leq 2\alpha^{2}\|({\bf{I}}-\hat{{\bf{% \Lambda}}}_{a})^{-1}\|\|\nabla F({\bf{X}}^{0})-{\bf{1}}_{n}\otimes(\nabla f({% \bf{x}}^{0}))^{\sf T}\|^{2}$ . Notice that $\varsigma^{2}_{0}=\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_{i}(\bar{{\bf{x}}}^{0})-% \nabla f(\bar{{\bf{x}}}^{0})\|^{2}$ . It holds that

\begin{aligned}
\mathcal{L}^{0}
&=f(\bar{{\bf{x}}}^{0})-f^{\star}+\frac{2\alpha L^{2}}{n(1-\tilde{\gamma})}\|\mathcal{E}^{0}\|_{\mathrm{F}}^{2}\\
&\leq f(\bar{{\bf{x}}}^{0})-f^{\star}+\frac{2\alpha L^{2}}{n(1-\tilde{\gamma})}\Big{(}2\alpha^{2}\|({\bf{I}}-\hat{{\bf{\Lambda}}}_{a})^{-1}\|\|\nabla F({\bf{X}}^{0})-{\bf{1}}_{n}\otimes(\nabla f({\bf{x}}^{0}))^{\sf T}\|^{2}\Big{)}\\
&\leq f(\bar{{\bf{x}}}^{0})-f^{\star}+\frac{32\chi^{2}\alpha^{3}L^{2}\varsigma^{2}_{0}}{(1-\lambda_{2})^{2}}.
\end{aligned}

(63)

Using (63) and

\frac{1}{1-\gamma}\leq\frac{4\chi}{1-\lambda_{2}},\ \frac{1}{1-\tilde{\gamma}}% \leq\frac{2}{1-\gamma},

we have

\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\big{\|}\nabla f(\bar{{\bf{x}}}^{t})\big{\|}^{2}\right]
&\leq\frac{4(f(\bar{{\bf{x}}}^{0})-f^{*})}{\alpha T}+\frac{128\chi^{2}L^{2}\alpha^{2}\varsigma^{2}_{0}}{(1-\lambda_{2})^{2}T}+\frac{2L\alpha\sigma^{2}}{n}\\
&\quad+\frac{1024\sigma^{2}L^{4}\alpha^{4}\chi^{3}}{n(1-\lambda_{2})^{3}}+\frac{128\chi\alpha^{2}L^{2}\sigma^{2}(2\chi^{2}+(1-p))}{(1-\lambda_{2})\chi^{2}}.
\end{aligned}

Since $\alpha\leq\frac{1-\lambda_{2}}{32\sqrt{3}\chi L}$, we have $\frac{1024\sigma^{2}L^{4}\alpha^{4}\chi^{3}}{n(1-\lambda_{2})^{3}}\leq\frac{\alpha^{2}L^{2}\sigma^{2}\chi}{3n(1-\lambda_{2})}\leq\frac{\alpha^{2}L^{2}\sigma^{2}\chi}{2(1-\lambda_{2})}$. It therefore holds that

\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\big{\|}\nabla f(\bar{{\bf{x}}}^{t})\big{\|}^{2}\right]
&\leq\frac{4(f(\bar{{\bf{x}}}^{0})-f^{*})}{\alpha T}+\frac{128\chi^{2}L^{2}\alpha^{2}\varsigma^{2}_{0}}{(1-\lambda_{2})^{2}T}\\
&\quad+\frac{2L\alpha\sigma^{2}}{n}+\frac{\alpha^{2}L^{2}\sigma^{2}\chi^{3}+256\chi\alpha^{2}L^{2}\sigma^{2}(2\chi^{2}+(1-p))}{2(1-\lambda_{2})\chi^{2}},
\end{aligned}

i.e., (4) holds. ∎

Appendix G Proof of Corollary 2

Proof.

We derive a tighter rate by carefully selecting the step size similar to [35]. We rewrite (4) as

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\big{\|}\nabla f(% \bar{{\bf{x}}}^{t})\big{\|}^{2}\right]\leq\underbrace{\frac{c_{0}}{\alpha T}+c% _{1}\alpha+c_{2}\alpha^{2}}_{:=\Psi_{T}}+\frac{a_{0}\alpha^{2}}{T},

(64)

where

\displaystyle c_{0}=4(f(\bar{{\bf{x}}}^{0})-f^{*}),\ c_{1}=\frac{2L\sigma^{2}}% {n},\ c_{2}=\frac{L^{2}\sigma^{2}\big{(}\chi^{3}+256\chi(2\chi^{2}+(1-p))\big{% )}}{2(1-\lambda_{2})\chi^{2}},\ a_{0}=\frac{128\chi^{2}L^{2}\varsigma^{2}_{0}}% {(1-\lambda_{2})^{2}}.

(65)

From the condition of stepsize, we have

\alpha\leq\frac{1}{\underline{\alpha}}=\min\left\{\frac{1}{2L},\frac{1-\lambda% _{2}}{32\sqrt{3}\chi L},\sqrt{\frac{(1+\lambda_{n})(1-\lambda_{2})}{2\chi}}% \frac{1}{2L},\sqrt[4]{\frac{(1-\lambda_{2})^{3}}{12\chi^{3}}}\frac{1}{4L}% \right\}=\mathcal{O}\left(\frac{1-\lambda_{2}}{\chi L}\right).

Setting

\alpha=\min\left\{\left(\frac{c_{0}}{c_{1}T}\right)^{\frac{1}{2}},\left(\frac{% c_{0}}{c_{2}T}\right)^{\frac{1}{3}},\frac{1}{\underline{\alpha}}\right\},

we have the following cases.

- When $\alpha=\frac{1}{\underline{\alpha}}$ and is smaller than both $\left(\frac{c_{0}}{c_{1}T}\right)^{\frac{1}{2}}$ and $\left(\frac{c_{0}}{c_{2}T}\right)^{\frac{1}{3}}$ , then

\Psi_{T}=\frac{c_{0}}{\alpha T}+c_{1}\alpha+c_{2}\alpha^{2}=\frac{\underline{\alpha}c_{0}}{T}+\frac{c_{1}}{\underline{\alpha}}+\frac{c_{2}}{\underline{\alpha}^{2}}\leq\frac{\underline{\alpha}c_{0}}{T}+c_{1}^{\frac{1}{2}}\left(\frac{c_{0}}{T}\right)^{\frac{1}{2}}+c_{2}^{\frac{1}{3}}\left(\frac{c_{0}}{T}\right)^{\frac{2}{3}}.

- When $\alpha=\left(\frac{c_{0}}{c_{1}T}\right)^{\frac{1}{2}}\leq\left(\frac{c_{0}}{c% _{2}T}\right)^{\frac{1}{3}}$ , then

\Psi_{T}\leq 2c_{1}^{\frac{1}{2}}\left(\frac{c_{0}}{T}\right)^{\frac{1}{2}}+c_% {2}\left(\frac{c_{0}}{c_{1}T}\right)\leq 2c_{1}^{\frac{1}{2}}\left(\frac{c_{0}% }{T}\right)^{\frac{1}{2}}+c_{2}^{\frac{1}{3}}\left(\frac{c_{0}}{T}\right)^{% \frac{2}{3}}.

- When $\alpha=\left(\frac{c_{0}}{c_{2}T}\right)^{\frac{1}{3}}\leq\left(\frac{c_{0}}{c% _{1}T}\right)^{\frac{1}{2}}$ , then

\Psi_{T}\leq 2c_{2}^{\frac{1}{3}}\left(\frac{c_{0}}{T}\right)^{\frac{2}{3}}+c_% {1}\left(\frac{c_{0}}{c_{2}T}\right)^{\frac{1}{3}}\leq 2c_{2}^{\frac{1}{3}}% \left(\frac{c_{0}}{T}\right)^{\frac{2}{3}}+c_{1}^{\frac{1}{2}}\left(\frac{c_{0% }}{T}\right)^{\frac{1}{2}}.

Combining the above three cases together it holds that

\Psi_{T}=\frac{c_{0}}{\alpha T}+c_{1}\alpha+c_{2}\alpha^{2}\leq 2c_{1}^{\frac{1}{2}}\left(\frac{c_{0}}{T}\right)^{\frac{1}{2}}+2c_{2}^{\frac{1}{3}}\left(\frac{c_{0}}{T}\right)^{\frac{2}{3}}+\frac{\underline{\alpha}c_{0}}{T}.
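The three-case argument above can be illustrated with a short numerical sketch. It uses only the Python standard library; the function names are hypothetical, `alpha_max` stands for the upper bound $\frac{1}{\underline{\alpha}}$ on the step size, and $c_{1},c_{2}>0$ is assumed so that the candidate step sizes are well defined.

```python
def choose_stepsize(c0, c1, c2, T, alpha_max):
    """Step size alpha = min{(c0/(c1 T))^(1/2), (c0/(c2 T))^(1/3), alpha_max}."""
    return min(
        (c0 / (c1 * T)) ** 0.5,
        (c0 / (c2 * T)) ** (1.0 / 3.0),
        alpha_max,
    )


def psi(alpha, c0, c1, c2, T):
    """Psi_T = c0 / (alpha T) + c1 alpha + c2 alpha^2."""
    return c0 / (alpha * T) + c1 * alpha + c2 * alpha**2


def psi_bound(c0, c1, c2, T, alpha_max):
    """Combined bound 2 c1^(1/2) (c0/T)^(1/2) + 2 c2^(1/3) (c0/T)^(2/3) + c0/(alpha_max T)."""
    return (
        2 * c1**0.5 * (c0 / T) ** 0.5
        + 2 * c2 ** (1.0 / 3.0) * (c0 / T) ** (2.0 / 3.0)
        + c0 / (alpha_max * T)
    )
```

Whichever of the three candidates attains the minimum, `psi(choose_stepsize(...), ...)` stays below `psi_bound(...)`, which mirrors the way the separate cases are merged into a single estimate.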

Substituting the above into (64), we conclude that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\big{\|}\nabla f(\bar{{\bf{x}}}^{t})\big{\|}^{2}\right]\leq 2c_{1}^{\frac{1}{2}}\left(\frac{c_{0}}{T}\right)^{\frac{1}{2}}+2c_{2}^{\frac{1}{3}}\left(\frac{c_{0}}{T}\right)^{\frac{2}{3}}+\frac{\underline{\alpha}c_{0}+a_{0}/\underline{\alpha}^{2}}{T}.

(66)

Therefore, plugging the parameters (65) into (66), we obtain

\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}]\leq{}&\mathcal{O}\left(\sqrt{\frac{L(f(\bar{{\bf{x}}}^{0})-f^{\star})\sigma^{2}}{nT}}\right)\\
&+\mathcal{O}\left(\sqrt[3]{\frac{\chi^{3}+\chi(1-p)}{(1-\lambda_{2})\chi^{2}}}\left(\frac{L(f(\bar{{\bf{x}}}^{0})-f^{\star})\sigma}{T}\right)^{\frac{2}{3}}\right)+\mathcal{O}\left(\frac{\frac{\chi L(f(\bar{{\bf{x}}}^{0})-f^{\star})}{1-\lambda_{2}}+\varsigma_{0}^{2}}{T}\right),
\end{aligned}

i.e., the rate (38) holds. ∎

Appendix H Proof of Theorem 5

Proof.

Plugging $\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}\leq 2L(f(\bar{{\bf{x}}}^{t})-f^{\star})$ into (33) gives

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\;|\;% \mathcal{G}^{t}\right]\leq

\displaystyle\tilde{\gamma}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{8n\alpha% ^{4}L^{3}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}(f(\bar{{\bf{x}}}^{t})-f^{\star% })+\frac{2\alpha^{4}L^{2}\sigma^{2}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}+% \frac{2n\alpha^{2}\sigma^{2}(2\chi^{2}+(1-p))}{\chi^{2}}.

(67)

Similarly to Lemma 4, we know that

\alpha\leq\min\left\{\frac{1-\lambda_{2}}{32\sqrt{3}\chi L},\sqrt{\frac{(1+% \lambda_{n})(1-\lambda_{2})}{2\chi}}\frac{1}{2L},\sqrt[4]{\frac{(1-\lambda_{2}% )^{3}}{12\chi^{3}}}\frac{1}{4L}\right\},\ \chi\geq\frac{288(1-p)}{1-\lambda_{2% }}\Longrightarrow\tilde{\gamma}\leq\frac{1+\gamma}{2}<1.

Define the Lyapunov function

\mathcal{L}_{\mathrm{c}}^{t}=\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}+\frac% {6\alpha L}{n(1-\tilde{\gamma})}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}.

Since $\frac{1}{1-\gamma}\leq\frac{4\chi}{1-\lambda_{2}}$ and $\frac{1}{1-\tilde{\gamma}}\leq\frac{2}{1-\gamma}$ , we have

\frac{\frac{2\chi}{1-\lambda_{2}}}{(1-\gamma)^{2}}\leq\frac{32\chi^{3}}{(1-\lambda_{2})^{3}}\text{ and }\frac{48\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{(1-\tilde{\gamma})(1-\gamma)}\leq\frac{96\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{(1-\gamma)^{2}}.

It gives that

\alpha\leq\sqrt[4]{\frac{(1-\lambda_{2})^{3}}{24\chi^{3}}}\frac{1}{4L}% \Rightarrow\frac{1}{2}<1-\frac{48\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{(% 1-\tilde{\gamma})(1-\gamma)}.

Thus, according to (2), (67), and $\mu=0$ , we have

\begin{aligned}
\mathbb{E}\!\left[\mathcal{L}_{\mathrm{c}}^{t+1}\;|\;\mathcal{G}^{t}\right]\leq{}&\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}+\frac{6\alpha L}{n}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}\sigma^{2}}{n}-\alpha(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star}))\\
&+\frac{6\alpha L}{n(1-\tilde{\gamma})}\Big{(}\tilde{\gamma}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{8n\alpha^{4}L^{3}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}(f(\bar{{\bf{x}}}^{t})-f^{\star})+\frac{2\alpha^{4}L^{2}\sigma^{2}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}+\frac{2n\alpha^{2}\sigma^{2}(2\chi^{2}+(1-p))}{\chi^{2}}\Big{)}\\
={}&\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}+\frac{6\alpha L}{n(1-\tilde{\gamma})}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}-\alpha\Big{(}1-\frac{48\alpha^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{(1-\tilde{\gamma})(1-\gamma)}\Big{)}(f(\bar{{\bf{x}}}^{t})-f^{\star})\\
&+\frac{\alpha^{2}\sigma^{2}}{n}+\frac{12\alpha^{5}L^{3}\sigma^{2}\frac{2\chi}{1-\lambda_{2}}}{n(1-\tilde{\gamma})(1-\gamma)}+\frac{12\alpha^{3}L\sigma^{2}(2\chi^{2}+(1-p))}{(1-\tilde{\gamma})\chi^{2}}\\
\leq{}&\mathcal{L}_{\mathrm{c}}^{t}-\frac{\alpha}{2}(f(\bar{{\bf{x}}}^{t})-f^{\star})+\frac{\alpha^{2}\sigma^{2}}{n}+\frac{12\alpha^{5}L^{3}\sigma^{2}\frac{2\chi}{1-\lambda_{2}}}{n(1-\tilde{\gamma})(1-\gamma)}+\frac{12\alpha^{3}L\sigma^{2}(2\chi^{2}+(1-p))}{(1-\tilde{\gamma})\chi^{2}}.
\end{aligned}

Taking full expectation, we have

\displaystyle\mathbb{E}\!\left[\mathcal{L}_{\mathrm{c}}^{t+1}\right]\leq% \mathbb{E}\!\left[\mathcal{L}_{\mathrm{c}}^{t}\right]-\frac{\alpha}{2}\mathbb{% E}\!\left[f(\bar{{\bf{x}}}^{t})-f^{\star}\right]+\frac{\alpha^{2}\sigma^{2}}{n% }+\frac{12\alpha^{5}L^{3}\sigma^{2}\frac{2\chi}{1-\lambda_{2}}}{n(1-\tilde{% \gamma})(1-\gamma)}+\frac{12\alpha^{3}L\sigma^{2}(2\chi^{2}+(1-p))}{(1-\tilde{% \gamma})\chi^{2}}.

(68)

Summing the inequality (68) over $t=0,1,\cdots,T-1$ , we can obtain

\displaystyle\frac{\alpha}{2}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t})-f^{\star}\right]\leq\mathcal{L}_{\mathrm{c}}^{0}+T\Big{(}\frac{\alpha^{2}\sigma^{2}}{n}+\frac{12\alpha^{5}L^{3}\sigma^{2}\frac{2\chi}{1-\lambda_{2}}}{n(1-\tilde{\gamma})(1-\gamma)}+\frac{12\alpha^{3}L\sigma^{2}(2\chi^{2}+(1-p))}{(1-\tilde{\gamma})\chi^{2}}\Big{)},

which implies that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t}% )-f^{\star}\right]\leq\frac{2\mathcal{L}_{\mathrm{c}}^{0}}{\alpha T}+\frac{2% \alpha\sigma^{2}}{n}+\frac{24\alpha^{4}L^{3}\sigma^{2}\frac{2\chi}{1-\lambda_{% 2}}}{n(1-\tilde{\gamma})(1-\gamma)}+\frac{24\alpha^{2}L\sigma^{2}(2\chi^{2}+(1% -p))}{(1-\tilde{\gamma})\chi^{2}}.

(69)

Since ${\bf{X}}^{0}=[{\bf{x}}^{0},\cdots,{\bf{x}}^{0}]^{\sf T}$, similarly to (63), we have

\displaystyle\mathcal{L}_{\mathrm{c}}^{0}

\displaystyle=\|\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\|^{2}+\frac{6\alpha L}{n(1% -\tilde{\gamma})}\|\mathcal{E}^{0}\|_{\mathrm{F}}^{2}\leq\|\bar{{\bf{x}}}^{0}-% {\bf{x}}^{\star}\|^{2}+\frac{96\chi^{2}\alpha^{3}L\varsigma^{2}_{0}}{(1-% \lambda_{2})^{2}}.

(70)

Substituting (70) into (69) and using

\tilde{\gamma}\leq\frac{1+\gamma}{2}<1,\ \frac{1}{1-\gamma}\leq\frac{4\chi}{1-% \lambda_{2}},

we can derive that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t}% )-f^{\star}\right]\leq

\displaystyle\frac{2\|\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\|^{2}}{\alpha T}+% \frac{192\chi^{2}\alpha^{2}L\varsigma^{2}_{0}}{(1-\lambda_{2})^{2}T}+\frac{2% \alpha\sigma^{2}}{n}+\frac{1536\chi^{3}\alpha^{4}L^{3}\sigma^{2}}{n(1-\lambda_% {2})^{3}}+\frac{192\alpha^{2}L\sigma^{2}\chi(2\chi^{2}+(1-p))}{(1-\lambda_{2})% \chi^{2}}.

Since $\alpha\leq\frac{1-\lambda_{2}}{32\sqrt{3}\chi L}$, we have $\frac{1536\sigma^{2}L^{3}\alpha^{4}\chi^{3}}{n(1-\lambda_{2})^{3}}\leq\frac{\alpha^{2}L\sigma^{2}\chi}{2n(1-\lambda_{2})}\leq\frac{\alpha^{2}L\sigma^{2}\chi}{2(1-\lambda_{2})}$. It therefore holds that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t}% )-f^{\star}\right]\leq

\displaystyle\frac{2\|\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\|^{2}}{\alpha T}+\frac{192\chi^{2}\alpha^{2}L\varsigma^{2}_{0}}{(1-\lambda_{2})^{2}T}+\frac{2\alpha\sigma^{2}}{n}+\frac{\alpha^{2}L\sigma^{2}\chi^{3}+384\alpha^{2}L\sigma^{2}\chi(2\chi^{2}+(1-p))}{2(1-\lambda_{2})\chi^{2}},

i.e., (5) holds. ∎

Appendix I Proof of Corollary 3

Proof.

We derive a tighter rate by carefully selecting the step size, as in Corollary 2. From the condition of stepsize, we have

\alpha\leq\frac{1}{\underline{\alpha}}=\min\left\{\frac{1}{4L},\frac{1-\lambda% _{2}}{32\sqrt{3}\chi L},\sqrt{\frac{(1+\lambda_{n})(1-\lambda_{2})}{2\chi}}% \frac{1}{2L},\sqrt[4]{\frac{(1-\lambda_{2})^{3}}{24\chi^{3}}}\frac{1}{4L}% \right\}=\mathcal{O}\left(\frac{1-\lambda_{2}}{\chi L}\right).

Similarly to the proof of Theorem 4, it follows that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t}% )-f^{\star}\right]\leq\underbrace{\frac{c_{0}}{\alpha T}+c_{1}\alpha+c_{2}% \alpha^{2}}_{:=\Psi_{T}}+\frac{a_{0}\alpha^{2}}{T},

where

c_{0}=2\|\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\|^{2},\ c_{1}=\frac{2\sigma^{2}}{% n},\ c_{2}=\frac{L\sigma^{2}\big{(}\chi^{3}+384\chi(2\chi^{2}+(1-p))\big{)}}{2% (1-\lambda_{2})\chi^{2}},\ a_{0}=\frac{192\chi^{2}L\varsigma^{2}_{0}}{(1-% \lambda_{2})^{2}}.

Then, the following rate can be obtained by following the same arguments used for the non-convex case,

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[f(\bar{{\bf{x}}}^{t}% )-f^{\star}\right]\leq

\displaystyle\mathcal{O}\left(\sqrt{\frac{\|\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star% }\|^{2}\sigma^{2}}{nT}}+\sqrt[3]{\frac{\chi^{3}+\chi(1-p)}{(1-\lambda_{2})\chi% ^{2}}}L^{\frac{1}{3}}\Big{(}\frac{\|\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\|^{2}% \sigma}{T}\Big{)}^{\frac{2}{3}}+\frac{\frac{\chi L\|\bar{{\bf{x}}}^{0}-{\bf{x}% }^{\star}\|^{2}}{1-\lambda_{2}}+\varsigma_{0}^{2}}{T}\right),

i.e., (41) holds. ∎
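The stepsize tuning behind this rate can be illustrated with a short sketch. It assumes the standard three-way choice $\alpha=\min\{\sqrt{c_{0}/(c_{1}T)},\sqrt[3]{c_{0}/(c_{2}T)},1/\underline{\alpha}\}$, which is common in the literature but is not spelled out in this appendix; the function names and the parameter `alpha_max` (standing for $1/\underline{\alpha}$) are hypothetical.

```python
def tuned_stepsize(c0, c1, c2, T, alpha_max):
    """Assumed tuning rule: the smallest of the three balancing stepsizes."""
    candidates = [alpha_max]
    if c1 > 0:
        candidates.append((c0 / (c1 * T)) ** 0.5)
    if c2 > 0:
        candidates.append((c0 / (c2 * T)) ** (1.0 / 3.0))
    return min(candidates)


def psi(alpha, c0, c1, c2, T):
    """Psi_T = c0 / (alpha * T) + c1 * alpha + c2 * alpha^2."""
    return c0 / (alpha * T) + c1 * alpha + c2 * alpha**2


def rate_bound(c0, c1, c2, T, alpha_max):
    """Bound 2*sqrt(c0*c1/T) + 2*c2^(1/3)*(c0/T)^(2/3) + c0/(alpha_max*T)."""
    return (2.0 * (c0 * c1 / T) ** 0.5
            + 2.0 * c2 ** (1.0 / 3.0) * (c0 / T) ** (2.0 / 3.0)
            + c0 / (alpha_max * T))


if __name__ == "__main__":
    # Illustrative constants only (assumptions, not taken from the paper).
    c0, c1, c2, T, alpha_max = 1.0, 0.1, 5.0, 100, 0.05
    a = tuned_stepsize(c0, c1, c2, T, alpha_max)
    print(a, psi(a, c0, c1, c2, T), rate_bound(c0, c1, c2, T, alpha_max))
```

Under this rule, each of the three terms of $\Psi_{T}$ is dominated by one of the three terms of `rate_bound`, which matches the $\sqrt{\cdot/T}$, $T^{-2/3}$, and $T^{-1}$ structure of the stated rate.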

Appendix J Proof of Theorem 6

Proof.

Recall (33)

\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{t+1}-{\bf{x}}^{\star}% \big{\|}^{2}\;|\;\mathcal{G}^{t}\right]\leq(1-\mu\alpha)\|\bar{{\bf{x}}}^{t}-{% \bf{x}}^{\star}\|^{2}+\frac{6\alpha L}{n}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+% \frac{\alpha^{2}\sigma^{2}}{n}-\alpha(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star}% )),

From (33) and (2), we have

\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{t+1}-{\bf{x}}^{\star}% \big{\|}^{2}\;|\;\mathcal{G}^{t}\right]\leq(1-\mu\alpha)\|\bar{{\bf{x}}}^{t}-{% \bf{x}}^{\star}\|^{2}+\frac{6\alpha L}{n}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+% \frac{\alpha^{2}\sigma^{2}}{n},

and

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\;|\;% \mathcal{G}^{t}\right]\leq

\displaystyle\tilde{\gamma}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{4n\alpha% ^{4}L^{4}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{% \star}\|^{2}+\frac{2\alpha^{4}L^{2}\sigma^{2}\frac{2\chi}{1-\lambda_{2}}}{1-% \gamma}+\frac{2n\alpha^{2}\sigma^{2}(2\chi^{2}+(1-p))}{\chi^{2}},

where the last inequality follows from $\|\nabla f(\bar{{\bf{x}}}^{t})\|^{2}\leq L^{2}\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}$ . Similarly to Lemma 4, we know that

\alpha\leq\min\left\{\frac{1-\lambda_{2}}{32\sqrt{3}\chi L},\sqrt{\frac{(1+% \lambda_{n})(1-\lambda_{2})}{2\chi}}\frac{1}{2L},\sqrt[4]{\frac{(1-\lambda_{2}% )^{3}}{12\chi^{3}}}\frac{1}{4L}\right\},\ \chi\geq\frac{288(1-p)}{1-\lambda_{2% }}\Longrightarrow\tilde{\gamma}\leq\frac{1+\gamma}{2}<1.

Since $\alpha\leq\frac{1-\lambda_{2}}{32\sqrt{3}\chi L}$ and $\frac{\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}\leq\frac{8\chi^{2}}{(1-\lambda_{2% })^{2}}$ , we have $\frac{\alpha^{2}\frac{2\chi}{1-\lambda_{2}}}{1-\gamma}\leq\frac{1}{384L^{2}}$ . Thus, it holds that

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\;|\;% \mathcal{G}^{t}\right]\leq

\displaystyle\tilde{\gamma}\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}+\frac{n\alpha^% {2}L^{2}}{96}\|\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\|^{2}+\frac{n\alpha^{2}% \sigma^{2}(192\chi^{2}+(4\chi^{2}+2(1-p)))}{192\chi^{2}}.

Then, it follows that

\displaystyle\left[\begin{array}[]{c}\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^% {t+1}-{\bf{x}}^{\star}\big{\|}^{2}\right]\\ \frac{1}{n}\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\right]\\ \end{array}\right]\leq\underbrace{\left[\begin{array}[]{cc}1-\mu\alpha&6\alpha L% \\ \frac{\alpha^{2}L^{2}}{96}&\frac{1+\gamma}{2}\\ \end{array}\right]}_{:=A}\left[\begin{array}[]{c}\mathbb{E}\!\left[\big{\|}% \bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\big{\|}^{2}\right]\\ \frac{1}{n}\mathbb{E}\!\left[\|\mathcal{E}^{t}\|_{\mathrm{F}}^{2}\right]\\ \end{array}\right]+\underbrace{\left[\begin{array}[]{c}\frac{\alpha^{2}\sigma^% {2}}{n}\\ \frac{\alpha^{2}\sigma^{2}(192\chi^{2}+(4\chi^{2}+2(1-p)))}{192\chi^{2}}\\ \end{array}\right]}_{:=b}.

(79)

Note that

\displaystyle\alpha\leq\min\left\{\frac{72\mu}{L^{2}},\frac{1-\gamma}{12L+% \nicefrac{{\mu}}{{2}}}\right\}\Longrightarrow\|A\|\leq\|A\|_{1}=\max\left\{1-% \mu\alpha+\frac{\alpha^{2}L^{2}}{96},6\alpha L+\frac{1+\gamma}{2}\right\}\leq 1% -\frac{\mu\alpha}{4}<1.

Since $\|A\|<1$ , we can iterate inequality (79) to get

\displaystyle\left[\begin{array}[]{c}\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^% {t+1}-{\bf{x}}^{\star}\big{\|}^{2}\right]\\ \frac{1}{n}\mathbb{E}\!\left[\|\mathcal{E}^{t+1}\|_{\mathrm{F}}^{2}\right]\\ \end{array}\right]\leq A^{t}\left[\begin{array}[]{c}\mathbb{E}\!\left[\big{\|}% \bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\big{\|}^{2}\right]\\ \frac{1}{n}\mathbb{E}\!\left[\|\mathcal{E}^{0}\|_{\mathrm{F}}^{2}\right]\\ \end{array}\right]+\sum_{\ell=0}^{t-1}A^{\ell}b\leq A^{t}\left[\begin{array}[]% {c}\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\big{\|}^{2}% \right]\\ \frac{1}{n}\mathbb{E}\!\left[\|\mathcal{E}^{0}\|_{\mathrm{F}}^{2}\right]\\ \end{array}\right]+(I-A)^{-1}b.

Taking the induced 1-norm and using properties of induced norms, it holds that

\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\big% {\|}^{2}\right]+\frac{1}{n}\mathbb{E}\!\left[\|\mathcal{E}^{t}\|_{\mathrm{F}}^% {2}\right]\leq\|A^{t}\|_{1}a_{0}+\|(I-A)^{-1}b\|_{1}\leq\|A\|_{1}^{t}a_{0}+\|(% I-A)^{-1}b\|_{1},

(80)

where $a_{0}=\big{\|}\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\big{\|}^{2}+\frac{1}{n}\|% \mathcal{E}^{0}\|_{\mathrm{F}}^{2}$ . We now bound the last term by noting that

	$\displaystyle(I-A)^{-1}b$	$\displaystyle=\left[\begin{array}[]{cc}\mu\alpha&-6\alpha L\\ -\frac{\alpha^{2}L^{2}}{96}&\frac{1-\gamma}{2}\\ \end{array}\right]^{-1}b=\frac{1}{\mathrm{det}(I-A)}\left[\begin{array}[]{cc}% \frac{1-\gamma}{2}&6\alpha L\\ \frac{\alpha^{2}L^{2}}{96}&\mu\alpha\\ \end{array}\right]b$
		$\displaystyle=\frac{1}{\mu\alpha(1-\gamma)(\frac{1}{2}-\frac{\alpha^{3}L^{3}}{% 16\mu(1-\gamma)})}\left[\begin{array}[]{cc}\frac{1-\gamma}{2}&6\alpha L\\ \frac{\alpha^{2}L^{2}}{96}&\mu\alpha\\ \end{array}\right]\left[\begin{array}[]{c}\frac{\alpha^{2}\sigma^{2}}{n}\\ \frac{\alpha^{2}\sigma^{2}(192\chi^{2}+(4\chi^{2}+2(1-p)))}{192\chi^{2}}\\ \end{array}\right]$
		$\displaystyle\leq\frac{4}{\alpha\mu(1-\gamma)}\left[\begin{array}[]{c}\frac{(1% -\gamma)\alpha^{2}\sigma^{2}}{2n}+\frac{6L\alpha^{3}\sigma^{2}(192\chi^{2}+(4% \chi^{2}+2(1-p)))}{192\chi^{2}}\\ \frac{\alpha^{4}L^{2}\sigma^{2}}{96n}+\frac{\mu\alpha^{3}\sigma^{2}(192\chi^{2% }+(4\chi^{2}+2(1-p)))}{192\chi^{2}}\\ \end{array}\right],$

where the last step holds for $\alpha\leq\sqrt[3]{4\mu(1-\gamma)}\frac{1}{L}$ . Therefore,

\|(I-A)^{-1}b\|_{1}\leq\frac{2\alpha\sigma^{2}}{\mu n}+\frac{(6L\alpha^{2}% \sigma^{2}+\mu\alpha^{2}\sigma^{2})(192\chi^{2}+(4\chi^{2}+2(1-p)))}{48\mu(1-% \gamma)\chi^{2}}.

Substituting the above into (80) and using $\|A\|^{t}_{1}\leq(1-\frac{\alpha\mu}{4})^{t}$ and $\mu\leq L$ , we obtain

\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{t}-{\bf{x}}^{\star}\big{\|}^{2}% \right]\leq\Big{(}1-\frac{\alpha\mu}{4}\Big{)}^{t}a_{0}+\frac{2\alpha\sigma^{2% }}{\mu n}+\frac{7L\alpha^{2}\sigma^{2}(192\chi^{2}+(4\chi^{2}+2(1-p)))}{48\mu(% 1-\gamma)\chi^{2}}.

Since ${\bf{X}}^{0}=[{\bf{x}}^{0},\cdots,{\bf{x}}^{0}]^{\sf T}$ , by [45, (75)], we have $\|\mathcal{E}^{0}\|_{\mathrm{F}}^{2}\leq 2\alpha^{2}\|({\bf{I}}-\hat{{\bf{% \Lambda}}}_{a})^{-1}\|\|\nabla F({\bf{X}}^{0})-{\bf{1}}_{n}\otimes(\nabla f({% \bf{x}}^{0}))^{\sf T}\|^{2}=\frac{2n\alpha^{2}\varsigma_{0}^{2}}{1-\gamma}$ . Note that $\frac{1}{1-\gamma}\leq\frac{4\chi}{1-\lambda_{2}}$ . It holds that

a_{0}=\big{\|}\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\big{\|}^{2}+\frac{1}{n}\|\mathcal{E}^{0}\|_{\mathrm{F}}^{2}\leq\big{\|}\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\big{\|}^{2}+\frac{8\chi\alpha^{2}\varsigma_{0}^{2}}{1-\lambda_{2}}.

Thus, we finally obtain (6). ∎
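The two linear-algebra facts used in this proof, namely $\|A\|_{1}\leq 1-\frac{\mu\alpha}{4}$ under the stated stepsize condition and the closed form of $(I-A)^{-1}b$ for a $2\times 2$ matrix, can be checked with a small pure-Python sketch. The helper names and the sample values of $\mu$, $L$, $\gamma$, and $b$ are assumptions made for illustration.

```python
def build_A(alpha, mu, L, gamma):
    """Matrix A from (79): [[1 - mu*alpha, 6*alpha*L], [alpha^2 L^2 / 96, (1+gamma)/2]]."""
    return [[1.0 - mu * alpha, 6.0 * alpha * L],
            [alpha**2 * L**2 / 96.0, (1.0 + gamma) / 2.0]]


def induced_one_norm(M):
    """Induced 1-norm of a 2x2 matrix: the maximum absolute column sum."""
    return max(abs(M[0][0]) + abs(M[1][0]), abs(M[0][1]) + abs(M[1][1]))


def stepsize_cap(mu, L, gamma):
    """Condition alpha <= min{72*mu/L^2, (1-gamma)/(12*L + mu/2)}."""
    return min(72.0 * mu / L**2, (1.0 - gamma) / (12.0 * L + mu / 2.0))


def fixed_point(A, b):
    """Return (I - A)^{-1} b via the 2x2 adjugate formula."""
    m00, m01 = 1.0 - A[0][0], -A[0][1]
    m10, m11 = -A[1][0], 1.0 - A[1][1]
    det = m00 * m11 - m01 * m10
    return [(m11 * b[0] - m01 * b[1]) / det,
            (-m10 * b[0] + m00 * b[1]) / det]


if __name__ == "__main__":
    mu, L, gamma = 0.1, 1.0, 0.9  # illustrative values (assumptions)
    alpha = 0.5 * stepsize_cap(mu, L, gamma)
    A = build_A(alpha, mu, L, gamma)
    print(induced_one_norm(A), 1.0 - mu * alpha / 4.0)
    print(fixed_point(A, [1e-3, 2e-3]))
```

The second column sum of $A$ meets the bound $1-\frac{\mu\alpha}{4}$ with equality when $\alpha=\frac{1-\gamma}{12L+\mu/2}$, so a numerical comparison at that value should allow a small floating-point tolerance.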

Appendix K Proof of Corollary 4

Proof.

Recall (6)

	$\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{T}-{\bf{x}}^{\star}\big{\|}^{2}\right]$	$\displaystyle\leq\Big{(}1-\frac{\alpha\mu}{4}\Big{)}^{T}\Big{(}c_{0}+b_{0}\alpha^{2}\Big{)}+c_{1}\alpha+c_{2}\alpha^{2}$
		$\displaystyle\leq{\mathrm{exp}}\Big{(}-\frac{\alpha\mu}{2}T\Big{)}\Big{(}c_{0}+b_{0}\alpha^{2}\Big{)}+c_{1}\alpha+c_{2}\alpha^{2},$		(81)

where

c_{0}=\big{\|}\bar{{\bf{x}}}^{0}-{\bf{x}}^{\star}\big{\|}^{2},\ b_{0}=\frac{8% \chi\varsigma_{0}^{2}}{1-\lambda_{2}},\ c_{1}=\frac{2\sigma^{2}}{\mu n},\ c_{2% }=\frac{7L\sigma^{2}(192\chi^{2}+(4\chi^{2}+2(1-p)))}{12\mu(1-\lambda_{2})\chi}.

From the stepsize condition, we have

\displaystyle\alpha\leq\frac{1}{\underline{\alpha}}\triangleq\min\left\{\frac{% 1}{2L},\frac{1-\lambda_{2}}{32\sqrt{3}\chi L},\sqrt{\frac{(1+\lambda_{n})(1-% \lambda_{2})}{2\chi}}\frac{1}{2L},\frac{72\mu}{L^{2}},\frac{1-\gamma}{12L+% \nicefrac{{\mu}}{{2}}},\sqrt[3]{4\mu(1-\gamma)}\frac{1}{L}\right\}=\mathcal{O}% \left(\frac{\mu(1-\lambda_{2})}{\chi L^{2}}\right).

Now we select $\alpha=\min\left\{\frac{\ln\left(\max\left\{1,\mu\left(c_{0}+b_{0}/\underline{% \alpha}^{2}\right)T/c_{1}\right\}\right)}{\mu T},\frac{1}{\underline{\alpha}}% \right\}\leq\frac{1}{\underline{\alpha}}$ to get the following cases.

- If $\alpha=\frac{\ln\left(\max\left\{1,\mu\left(c_{0}+b_{0}/\underline{\alpha}^{2}% \right)T/c_{1}\right\}\right)}{\mu T}\leq\frac{1}{\underline{\alpha}}$ then

\displaystyle\exp\left(-\frac{\alpha\mu}{2}T\right)\left(c_{0}+\alpha^{2}b_{0}\right)

\displaystyle\leq\tilde{\mathcal{O}}\left(\left(c_{0}+\frac{b_{0}}{\underline{% \alpha}^{2}}\right)\exp\left[-\ln\left(\max\left\{1,\mu\left(c_{0}+\frac{b_{0}% }{\underline{\alpha}^{2}}\right)T/c_{1}\right\}\right)\right]\right)=\mathcal{% O}\left(\frac{c_{1}}{\mu T}\right)

- Otherwise, $\alpha=\frac{1}{\underline{\alpha}}\leq\frac{\ln\left(\max\left\{1,\mu\left(c_{0}+b_{0}/\underline{\alpha}^{2}\right)T/c_{1}\right\}\right)}{\mu T}$ and

\exp\left(-\frac{\alpha\mu}{2}T\right)\left(c_{0}+\alpha^{2}b_{0}\right)=% \tilde{\mathcal{O}}\left(\exp\left[-\frac{\mu T}{2\underline{\alpha}}\right]% \left(c_{0}+\frac{b_{0}}{\underline{\alpha}^{2}}\right)\right).

Combining these cases with (81), we have

	$\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{x}}}^{T}-{\bf{x}}^{\star}\big{\|}^{2}\right]$	$\displaystyle\leq\exp\left(-\frac{\alpha\mu}{2}T\right)\left(c_{0}+\alpha^{2}b_{0}\right)+c_{1}\alpha+c_{2}\alpha^{2}$
		$\displaystyle\leq\tilde{\mathcal{O}}\left(\frac{c_{1}}{\mu T}\right)+\tilde{\mathcal{O}}\left(\frac{c_{2}}{\mu^{2}T^{2}}\right)+\tilde{\mathcal{O}}\left(\exp\left[-\frac{\mu T}{2\underline{\alpha}}\right]\left(c_{0}+\frac{b_{0}}{\underline{\alpha}^{2}}\right)\right).$

Therefore, (44) holds. ∎
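The stepsize selection used in this proof can be written as a small helper. This is a sketch only: the function name `corollary4_stepsize` and the argument `alpha_under` (standing for $\underline{\alpha}$) are hypothetical, and the sample constants are assumptions rather than values from the paper.

```python
import math


def corollary4_stepsize(c0, b0, c1, mu, T, alpha_under):
    """alpha = min{ ln(max{1, mu*(c0 + b0/alpha_under^2)*T/c1}) / (mu*T), 1/alpha_under }."""
    log_arg = max(1.0, mu * (c0 + b0 / alpha_under**2) * T / c1)
    return min(math.log(log_arg) / (mu * T), 1.0 / alpha_under)


if __name__ == "__main__":
    # Illustrative constants only (assumptions, not taken from the paper).
    c0, b0, c1, mu, alpha_under = 1.0, 1.0, 0.1, 0.1, 100.0
    for T in (10, 10**3, 10**6):
        print(T, corollary4_stepsize(c0, b0, c1, mu, T, alpha_under))
```

For small $T$ the cap $1/\underline{\alpha}$ is active (the second case of the proof), while for large $T$ the logarithmic branch takes over (the first case).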

Appendix L Proof of Lemma 4

Proof.

Note that ProxSkip (45) has the following equivalent updates:


$\displaystyle\widetilde{{\bf{Z}}}^{t}$	$\displaystyle=\widetilde{{\bf{X}}}^{t}-{\bf{W}}_{b}\widetilde{{\bf{U}}}^{t}-% \alpha(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star})+{\bf{S}}^{t}),$	(82a)
$\displaystyle\widetilde{{\bf{X}}}^{t+1}$	$\displaystyle={\bf{W}}_{a}\widetilde{{\bf{Z}}}^{t}-{\bf{W}}_{b}{\bf{E}}^{t},$	(82b)
$\displaystyle\widetilde{{\bf{U}}}^{t+1}$	$\displaystyle=\widetilde{{\bf{U}}}^{t}+\frac{p}{2\chi}{\bf{W}}_{b}\widetilde{{% \bf{Z}}}^{t}+p{\bf{E}}^{t}.$	(82c)

We rewrite the recursion (82) into the following matrix representation:

\left[\begin{array}[]{c}\widetilde{{\bf{X}}}^{t+1}\\ \widetilde{{\bf{U}}}^{t+1}\\ \end{array}\right]=\left[\begin{array}[]{cc}{\bf{W}}_{a}&-{\bf{W}}_{a}{\bf{W}}% _{b}\\ \frac{p}{2\chi}{\bf{W}}_{b}&{\bf{I}}-\frac{p}{2\chi}{\bf{W}}_{b}^{2}\\ \end{array}\right]\left[\begin{array}[]{c}\widetilde{{\bf{X}}}^{t}\\ \widetilde{{\bf{U}}}^{t}\\ \end{array}\right]-\alpha\left[\begin{array}[]{c}{\bf{W}}_{a}(\nabla F({\bf{X}% }^{t})-\nabla F({\bf{X}}^{\star})+{\bf{S}}^{t})\\ \frac{p}{2\chi}{\bf{W}}_{b}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star})+% {\bf{S}}^{t})\\ \end{array}\right]+\left[\begin{array}[]{c}-{\bf{W}}_{b}{\bf{E}}^{t}\\ p{\bf{E}}^{t}\\ \end{array}\right].

Multiplying both sides of the above by $\mathrm{diag}\{{\bf{P}}^{-1},{\bf{P}}^{-1}\}$ on the left and using (29), we have

\left[\begin{array}[]{c}{\bf{P}}^{-1}\widetilde{{\bf{X}}}^{t+1}\\ {\bf{P}}^{-1}\widetilde{{\bf{U}}}^{t+1}\\ \end{array}\right]=\left[\begin{array}[]{cc}\hat{{\bf{\Lambda}}}_{a}&-\hat{{% \bf{\Lambda}}}_{a}\hat{{\bf{\Lambda}}}_{b}\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}&{\bf{I}}-\frac{p}{2\chi}\hat{{\bf{% \Lambda}}}_{b}^{2}\\ \end{array}\right]\left[\begin{array}[]{c}{\bf{P}}^{-1}\widetilde{{\bf{X}}}^{t% }\\ {\bf{P}}^{-1}\widetilde{{\bf{U}}}^{t}\\ \end{array}\right]-\alpha\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}{\bf{% P}}^{-1}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star})+{\bf{S}}^{t})\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}{\bf{P}}^{-1}(\nabla F({\bf{X}}^{t})-% \nabla F({\bf{X}}^{\star})+{\bf{S}}^{t})\\ \end{array}\right]+\left[\begin{array}[]{c}-\hat{{\bf{\Lambda}}}_{b}{\bf{P}}^{% -1}{\bf{E}}^{t}\\ p{\bf{P}}^{-1}{\bf{E}}^{t}\\ \end{array}\right].

Since $\widetilde{{\bf{U}}}^{t}$ lies in the range space of ${\bf{W}}_{b}$ , we have ${\bf{1}}^{\sf T}\widetilde{{\bf{U}}}^{t}=0,\ t\geq 0$ . By the structure of ${\bf{P}}$ , we have

{\bf{P}}^{-1}\widetilde{{\bf{X}}}^{t}=\left[\begin{array}[]{c}\bar{{\bf{e}}}^{% t}\\ \hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{t}\\ \end{array}\right],\ {\bf{P}}^{-1}\widetilde{{\bf{U}}}^{t}=\left[\begin{array}% []{c}0\\ \hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{U}}}^{t}\\ \end{array}\right],\ {\bf{P}}^{-1}\nabla F({\bf{X}}^{t})=\left[\begin{array}[]% {c}\overline{\nabla F}({\bf{X}}^{t})\\ \hat{{\bf{P}}}^{\sf T}\nabla F({\bf{X}}^{t})\\ \end{array}\right],\ {\bf{P}}^{-1}{\bf{E}}^{t}=\left[\begin{array}[]{c}0\\ \hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\\ \end{array}\right].

Therefore, it holds that

	$\displaystyle\bar{{\bf{e}}}^{t+1}$	$\displaystyle=\bar{{\bf{e}}}^{t}-\alpha\overline{\nabla F}({\bf{X}}^{t})-% \alpha\bar{{\bf{s}}}^{t},$
	$\displaystyle\left[\begin{array}[]{c}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}% }^{t+1}\\ \hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{U}}}^{t+1}\\ \end{array}\right]$	$\displaystyle=\left[\begin{array}[]{cc}\hat{{\bf{\Lambda}}}_{a}&-\hat{{\bf{% \Lambda}}}_{a}\hat{{\bf{\Lambda}}}_{b}\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}&{\bf{I}}-\frac{p}{2\chi}\hat{{\bf{% \Lambda}}}_{b}^{2}\\ \end{array}\right]\left[\begin{array}[]{c}\hat{{\bf{P}}}^{\sf T}\widetilde{{% \bf{X}}}^{t}\\ \hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{U}}}^{t}\\ \end{array}\right]-\alpha\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{% {\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star})+{\bf{S}}^{% t})\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}% }^{t})-\nabla F({\bf{X}}^{\star})+{\bf{S}}^{t})\\ \end{array}\right]+\left[\begin{array}[]{c}-\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{% P}}}^{\sf T}{\bf{E}}^{t}\\ p\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\\ \end{array}\right].$

Let

{\bf{H}}^{\mathrm{s}}=\left[\begin{array}[]{cc}\hat{{\bf{\Lambda}}}_{a}&-\hat{% {\bf{\Lambda}}}_{a}\hat{{\bf{\Lambda}}}_{b}\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}&{\bf{I}}-\frac{p}{2\chi}\hat{{\bf{% \Lambda}}}_{b}^{2}\\ \end{array}\right]=\left[\begin{array}[]{cc}{\bf{I}}-\frac{1}{2\chi}({\bf{I}}-% \hat{{\bf{\Lambda}}})&-({\bf{I}}-\frac{1}{2\chi}({\bf{I}}-\hat{{\bf{\Lambda}}}% ))\sqrt{{\bf{I}}-\hat{{\bf{\Lambda}}}}\\ \frac{p}{2\chi}\sqrt{{\bf{I}}-\hat{{\bf{\Lambda}}}}&{\bf{I}}-\frac{p}{2\chi}({% \bf{I}}-\hat{{\bf{\Lambda}}})\\ \end{array}\right]

where $\hat{{\bf{\Lambda}}}=\mathrm{diag}\{\lambda_{2},\cdots,\lambda_{n}\}$ , and $\lambda_{i}\in(-1,1)$ . Since the blocks of ${\bf{H}}^{\mathrm{s}}$ are diagonal matrices, there exists a permutation matrix ${\bf{Q}}^{\mathrm{s}}_{1}$ such that ${\bf{Q}}^{\mathrm{s}}_{1}{\bf{H}}^{\mathrm{s}}({\bf{Q}}^{\mathrm{s}}_{1})^{\sf T% }=\mathrm{blkdiag}\{H^{\mathrm{s}}_{i}\}_{i=2}^{n}$ , where

H^{\mathrm{s}}_{i}=\left[\begin{array}[]{cc}1-\frac{1}{2\chi}(1-\lambda_{i})&-% (1-\frac{1}{2\chi}(1-\lambda_{i}))\sqrt{1-\lambda_{i}}\\ \frac{p}{2\chi}\sqrt{1-\lambda_{i}}&1-\frac{p}{2\chi}(1-\lambda_{i})\\ \end{array}\right].

Setting $\nu_{i}=1-\frac{1}{2\chi}(1-\lambda_{i})$ , we have $\nu_{i}\in(0,1)$ and $H^{\mathrm{s}}_{i}$ can be rewritten as

H^{\mathrm{s}}_{i}=\left[\begin{array}[]{cc}\nu_{i}&-\nu_{i}\sqrt{2\chi(1-\nu_% {i})}\\ \frac{p}{2\chi}\sqrt{2\chi(1-\nu_{i})}&1-p(1-\nu_{i})\\ \end{array}\right].

Since

\displaystyle\mathrm{Tr}(H^{\mathrm{s}}_{i})=(1+p)\nu_{i}+(1-p),\quad\mathrm{% det}(H^{\mathrm{s}}_{i})=\nu_{i},

the eigenvalues of $H^{\mathrm{s}}_{i}$ are

	$\displaystyle\gamma_{(1,2),i}$	$\displaystyle=\frac{1}{2}\Big{[}\mathrm{Tr}(H^{\mathrm{s}}_{i})\pm\sqrt{% \mathrm{Tr}(H^{\mathrm{s}}_{i})^{2}-4\mathrm{det}(H^{\mathrm{s}}_{i})}\Big{]}$
		$\displaystyle=\frac{1}{2}\Big{[}(1+p)\nu_{i}+(1-p)\Big{]}\pm\frac{1}{2}\sqrt{% \underbrace{(1+p)^{2}\nu_{i}^{2}+(2(1+p)(1-p)-4)\nu_{i}+(1-p)^{2}}_{:=\Delta_{% i}(\nu_{i},p)}}.$

Consider the sign of $\Delta_{i}(\nu_{i},p)$ . Note that $\Delta_{i}(\nu_{i},p)$ is a quadratic function of $\nu_{i}$ , and

(1+p)^{2}>0,\ \Delta_{i}(0,p)=(1-p)^{2},\ \Delta_{i}(1,p)=0,\ \Delta_{i}(c_{i}% ,p)=0,\text{ where }c_{i}=\frac{(1-p)^{2}}{(1+p)^{2}}<1.

We have

\left\{\begin{array}[]{cc}\Delta_{i}(\nu_{i},p)>0,&\nu_{i}\in(0,c_{i})\\ \Delta_{i}(\nu_{i},p)<0,&\nu_{i}\in(c_{i},1)\end{array}\right..

Since $\nu_{i}=1-\frac{1}{2\chi}(1-\lambda_{i})\geq 1-\frac{1}{2\chi}(1-\lambda_{n}),% i=2,\ldots,n$ and $\lambda_{n}\in(-1,1)$ , it holds that

\chi\geq\frac{1}{p}\geq\frac{(1+p)^{2}}{4p}>\frac{(1-\lambda_{n})(1+p)^{2}}{8p% }\Longrightarrow\nu_{i}\geq 1-\frac{1}{2\chi}(1-\lambda_{n})>\frac{(1-p)^{2}}{% (1+p)^{2}}.

As a result, when $\chi\geq\frac{1}{p}$ , we have $\nu_{i}\in(c_{i},1)$ , i.e., $\Delta_{i}(\nu_{i},p)<0$ . It implies that

\displaystyle\gamma_{(1,2),i}

\displaystyle=\frac{1}{2}\big{[}(1+p)\nu_{i}+(1-p)\big{]}\pm j\frac{1}{2}\sqrt% {4\nu_{i}-\big{[}(1+p)\nu_{i}+(1-p)\big{]}^{2}},\text{ and }|\gamma_{(1,2),i}|% =\sqrt{\nu_{i}}<1,

where $j^{2}=-1$ . Since $\gamma_{1,i}\neq\gamma_{2,i}$ , there exists an invertible $Q^{\mathrm{s}}_{2,i}$ such that $H^{\mathrm{s}}_{i}=Q^{\mathrm{s}}_{2,i}\Gamma_{i}(Q^{\mathrm{s}}_{2,i})^{-1}$ , where $\Gamma_{i}=\mathrm{diag}\{\gamma_{1,i},\gamma_{2,i}\}$ . Using [5, Appendix B.2] and letting $r=\sqrt{1-\nu_{i}}$ , we have

Q^{\mathrm{s}}_{2,i}=\left[\begin{array}[]{cc}\frac{1}{2}(p-1)\sqrt{1-\nu_{i}}% +\frac{1}{2}j\sqrt{(1+p)^{2}(\nu_{i}-c_{i})}&\frac{1}{2}(p-1)\sqrt{1-\nu_{i}}-% \frac{1}{2}j\sqrt{(1+p)^{2}(\nu_{i}-c_{i})}\\ p\sqrt{\nicefrac{{1}}{{2\chi}}}&p\sqrt{\nicefrac{{1}}{{2\chi}}}\end{array}\right]

(Q^{\mathrm{s}}_{2,i})^{-1}=\frac{\sqrt{2\chi}}{p\sqrt{(1+p)^{2}(\nu_{i}-c_{i}% )}}\left[\begin{array}[]{cc}-jp\sqrt{\nicefrac{{1}}{{2\chi}}}&\frac{1}{2}\sqrt% {(1+p)^{2}(\nu_{i}-c_{i})}+\frac{1}{2}j(p-1)\sqrt{1-\nu_{i}}\\ jp\sqrt{\nicefrac{{1}}{{2\chi}}}&\frac{1}{2}\sqrt{(1+p)^{2}(\nu_{i}-c_{i})}-% \frac{1}{2}j(p-1)\sqrt{1-\nu_{i}}\end{array}\right]

Since the spectral radius of a matrix is upper bounded by any of its norms, $0<p_{0}\leq p<1$ , and $0<\nu_{i}<1$ , it holds that

\|Q^{\mathrm{s}}_{2,i}\|^{2}\leq\|Q^{\mathrm{s}}_{2,i}(Q^{\mathrm{s}}_{2,i})^{*}\|_{1}\leq 4.

Following a similar argument for $(Q^{\mathrm{s}}_{2,i})^{-1}$ , and using $p^{2}(1+p)^{2}(\nu_{i}-c_{i})=p^{2}(1+p)^{2}(1-\frac{1}{2\chi}(1-\lambda_{i}))-p^{2}(1-p)^{2}\geq 4p^{3}-\frac{4p^{2}(1-\lambda_{n})}{2\chi}\geq\frac{2p^{2}(1+\lambda_{n})}{\chi}$ , we have

\displaystyle\|(Q_{2,i}^{\mathrm{s}})^{-1}\|^{2}\leq\frac{2\chi}{p^{2}(1+p)^{2% }(\nu_{i}-c_{i})}\leq\frac{\chi^{2}}{p^{2}(1+\lambda_{n})}.

Let ${\bf{Q}}^{\mathrm{s}}=({\bf{Q}}^{\mathrm{s}}_{1})^{\sf T}{\bf{Q}}^{\mathrm{s}}_{2}$ with ${\bf{Q}}^{\mathrm{s}}_{2}=\mathrm{blkdiag}\{Q^{\mathrm{s}}_{2,i}\}_{i=2}^{n}$ . We have $({\bf{Q}}^{\mathrm{s}})^{-1}{\bf{H}}^{\mathrm{s}}{\bf{Q}}^{\mathrm{s}}={\bf{\Gamma}}$ , where ${\bf{\Gamma}}=\mathrm{blkdiag}\{\Gamma_{i}\}_{i=2}^{n}$ , i.e., there exists an invertible matrix ${\bf{Q}}^{\mathrm{s}}$ such that ${\bf{H}}^{\mathrm{s}}={\bf{Q}}^{\mathrm{s}}{\bf{\Gamma}}({\bf{Q}}^{\mathrm{s}})^{-1}$ , and

\|{\bf{\Gamma}}\|=\sqrt{1-\frac{1}{2\chi}(1-\lambda_{2})}<1.

Moreover, we have $\|{\bf{Q}}^{\mathrm{s}}\|^{2}\|({\bf{Q}}^{\mathrm{s}})^{-1}\|^{2}\leq\frac{8% \chi^{2}}{p^{2}(1+\lambda_{n})}$ . We thus complete the proof. ∎
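The spectral claim of this lemma, that for $\chi\geq\frac{1}{p}$ each block $H^{\mathrm{s}}_{i}$ has a complex-conjugate pair of eigenvalues of modulus $\sqrt{\nu_{i}}$, can be checked numerically. The sketch below uses only the standard library; the function names and the sampled values of $p$ and $\lambda_{i}$ are assumptions made for illustration.

```python
import cmath
import math


def nu(lam, chi):
    """nu_i = 1 - (1 - lambda_i) / (2 * chi)."""
    return 1.0 - (1.0 - lam) / (2.0 * chi)


def H_block(lam, p, chi):
    """2x2 block H_i^s built from lambda_i, p, and chi as in the lemma."""
    v = nu(lam, chi)
    s = math.sqrt(1.0 - lam)
    return [[v, -v * s],
            [p / (2.0 * chi) * s, 1.0 - p / (2.0 * chi) * (1.0 - lam)]]


def eigenvalues_2x2(M):
    """Eigenvalues from trace and determinant; cmath handles a negative discriminant."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + root) / 2.0, (tr - root) / 2.0


if __name__ == "__main__":
    p = 0.3          # illustrative communication probability (assumption)
    chi = 1.0 / p    # smallest chi allowed by the condition chi >= 1/p
    for lam in (-0.9, 0.0, 0.5):
        g1, g2 = eigenvalues_2x2(H_block(lam, p, chi))
        print(lam, abs(g1), abs(g2), math.sqrt(nu(lam, chi)))
```

Because the two eigenvalues are complex conjugates, their common modulus is the square root of the determinant, which equals $\nu_{i}$.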

Appendix M Proof of Lemma 5

Proof.

Proof of (48). It follows from (E) and $0<\alpha L\leq\frac{1}{2}$ that

	$\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{e}}}^{t+1}\big{\|}^{2}\;|\;\mathcal{G}^{t}\right]$	$\displaystyle\leq(1-\mu\alpha)\|\bar{{\bf{e}}}^{t}\|^{2}+\Big{(}\frac{\alpha L}{n}+\frac{2\alpha^{2}L^{2}}{n}\Big{)}\|{\bf{X}}^{t}-{\bf{1}}\bar{{\bf{x}}}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}\sigma^{2}}{n}-2\alpha(1-2\alpha L)(f(\bar{{\bf{x}}}^{t})-f({\bf{x}}^{\star}))$
		$\displaystyle\leq(1-\mu\alpha)\|\bar{{\bf{e}}}^{t}\|^{2}+\frac{2\alpha L}{n}\|{\bf{X}}^{t}-{\bf{1}}\bar{{\bf{x}}}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}\sigma^{2}}{n}.$

Note that $\hat{{\bf{P}}}^{\sf T}\hat{{\bf{P}}}={\bf{I}},\ {\bf{1}}^{\sf T}\hat{{\bf{P}}}% =0,\ \hat{{\bf{P}}}\hat{{\bf{P}}}^{\sf T}={\bf{I}}-\frac{1}{n}{\bf{11}}^{\sf T}$ . We obtain

\|\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}=\|\hat{{% \bf{P}}}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}=\|({% \bf{I}}-\frac{1}{n}{\bf{11}}^{\sf T})\widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{% 2}=\|{\bf{X}}^{t}-{\bf{1}}\bar{{\bf{x}}}^{t}\|_{\mathrm{F}}^{2}.

On the other hand, $\|\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}=\|\upsilon% ^{-1}{\bf{Q}}^{\mathrm{s}}\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}-\|% \hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{U}}}^{t}\|_{\mathrm{F}}^{2}$ . It holds that

\|{\bf{X}}^{t}-{\bf{1}}\bar{{\bf{x}}}^{t}\|_{\mathrm{F}}^{2}\leq\|\upsilon^{-1% }{\bf{Q}}^{\mathrm{s}}\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}\leq% \upsilon^{-2}\|{\bf{Q}}^{\mathrm{s}}\|^{2}\|\mathcal{E}_{\mathrm{s}}^{t}\|_{% \mathrm{F}}^{2}.

Therefore, (48) follows.

Proof of (49). Taking the conditional expectation with respect to $\mathcal{F}^{t}$ , it follows from (47f) that

	$\displaystyle\mathbb{E}\!\left[\|\mathcal{E}_{\mathrm{s}}^{t+1}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]$	$\displaystyle=\|\mathbb{G}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}+\mathbb{E}\!\left[\|\mathbb{F}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]+2\mathbb{E}\!\left[\langle\mathbb{G}^{t}_{\mathrm{s}},\mathbb{F}^{t}_{\mathrm{s}}\rangle\;|\;\mathcal{F}^{t}\right]$
		$\displaystyle=\|\mathbb{G}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}+\mathbb{E}\!\left[\|\mathbb{F}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]$
		$\displaystyle=\|\mathbb{G}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}+\mathbb{E}\!\left[\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]+\mathbb{E}\!\left[\|\upsilon p({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right].$

Since ${\bf{E}}^{t}=\frac{(\theta_{t}-1)}{2\chi}{\bf{W}}_{b}\hat{{\bf{Z}}}^{t}$ , $\mathop{\rm Prob}(\theta_{t}=1)=p$ , and $\mathop{\rm Prob}(\theta_{t}=0)=1-p$ , we have

	$\displaystyle\mathbb{E}\!\left[\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]+\mathbb{E}\!\left[\|\upsilon p({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}{\bf{E}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{F}^{t}\right]$
	$\displaystyle=\frac{1-p}{4\chi^{2}}\Big{(}\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{W}}_{b}\hat{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}+\|\upsilon p({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}{\bf{W}}_{b}\hat{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}\Big{)}$
	$\displaystyle=\frac{1-p}{4\chi^{2}}\Big{(}\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{W}}_{b}(\hat{{\bf{Z}}}^{t}-{\bf{X}}^{\star})\|_{\mathrm{F}}^{2}+\|\upsilon p({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}{\bf{W}}_{b}(\hat{{\bf{Z}}}^{t}-{\bf{X}}^{\star})\|_{\mathrm{F}}^{2}\Big{)}$
	$\displaystyle\leq\frac{(1-p)(2+p^{2})}{2\chi^{2}}\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}.$

Hence, it gives that

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}_{\mathrm{s}}^{t+1}\|_{\mathrm{F}}% ^{2}\;|\;\mathcal{F}^{t}\right]\leq\|\mathbb{G}_{\mathrm{s}}^{t}\|_{\mathrm{F}% }^{2}+\frac{(1-p)(2+p^{2})}{2\chi^{2}}\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}% \hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}.

Taking the conditional expectation with respect to $\mathcal{G}^{t}\subset\mathcal{F}^{t}$ , and using the unbiasedness of ${\bf{G}}^{t}$ , we have

\displaystyle\mathbb{E}\!\left[\|\mathcal{E}_{\mathrm{s}}^{t+1}\|_{\mathrm{F}}% ^{2}\;|\;\mathcal{G}^{t}\right]

\displaystyle\leq\mathbb{E}\!\left[\|\mathbb{G}_{\mathrm{s}}^{t}\|_{\mathrm{F}% }^{2}\;|\;\mathcal{G}^{t}\right]+\frac{(1-p)(2+p^{2})}{2\chi^{2}}\mathbb{E}\!% \left[\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{% \bf{Z}}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right].

(83)

Let $\upsilon=1/\|({\bf{Q}}^{\mathrm{s}})^{-1}\|$ . $\mathbb{E}\!\left[\|\mathbb{G}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}\;|\;% \mathcal{G}^{t}\right]$ can be bounded as follows:

	$\displaystyle\mathbb{E}\!\left[\|\mathbb{G}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]=\mathbb{E}\!\left[\left\|{\bf{\Gamma}}\mathcal{E}_{\mathrm{s}}^{t}-\upsilon\alpha({\bf{Q}}^{\mathrm{s}})^{-1}\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star})+{\bf{S}}^{t})\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star})+{\bf{S}}^{t})\\ \end{array}\right]\right\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$
	$\displaystyle=\left\|{\bf{\Gamma}}\mathcal{E}_{\mathrm{s}}^{t}-\upsilon\alpha({\bf{Q}}^{\mathrm{s}})^{-1}\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star}))\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star}))\\ \end{array}\right]\right\|_{\mathrm{F}}^{2}+\upsilon^{2}\alpha^{2}\mathbb{E}\!\left[\left\|({\bf{Q}}^{\mathrm{s}})^{-1}\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^{t}\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^{t}\\ \end{array}\right]\right\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$
	$\displaystyle\leq\left\|{\bf{\Gamma}}\mathcal{E}_{\mathrm{s}}^{t}-\upsilon\alpha({\bf{Q}}^{\mathrm{s}})^{-1}\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star}))\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star}))\\ \end{array}\right]\right\|_{\mathrm{F}}^{2}+\frac{(p^{2}+2\chi^{2})n\alpha^{2}\sigma^{2}}{2\chi^{2}}\ .$

The last inequality holds due to $\|\hat{{\bf{\Lambda}}}_{a}\|\leq 1$ , $\|\hat{{\bf{\Lambda}}}_{b}\|^{2}\leq 2$ , and $\upsilon=1/\|({\bf{Q}}^{\mathrm{s}})^{-1}\|$ . For any vectors ${\bf{a}}$ and ${\bf{b}}$ , it follows from Jensen’s inequality that $\|{\bf{a+b}}\|^{2}\leq\frac{1}{\theta}\|{\bf{a}}\|^{2}+\frac{1}{1-\theta}\|{\bf{b}}\|^{2}$ for any $\theta\in(0,1)$ . Therefore, letting $\theta=\|{\bf{\Gamma}}\|:=\gamma$ , it holds that

	$\displaystyle\left\|{\bf{\Gamma}}\mathcal{E}_{\mathrm{s}}^{t}-\upsilon\alpha({\bf{Q}}^{\mathrm{s}})^{-1}\left[\begin{array}[]{c}\hat{{\bf{\Lambda}}}_{a}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star}))\\ \frac{p}{2\chi}\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star}))\\ \end{array}\right]\right\|_{\mathrm{F}}^{2}$
	$\displaystyle\leq\frac{1}{\gamma}\|{\bf{\Gamma}}\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}(2\chi^{2}+p^{2})}{2\chi^{2}(1-\gamma)}\|\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star})\|_{\mathrm{F}}^{2}$
	$\displaystyle\leq\gamma\|\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}L^{2}(2\chi^{2}+p^{2})}{2\chi^{2}(1-\gamma)}\|{\bf{X}}^{t}-{\bf{X}}^{\star}\|_{\mathrm{F}}^{2}\ .$

Then, we have

\displaystyle\mathbb{E}\!\left[\|\mathbb{G}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2% }\;|\;\mathcal{G}^{t}\right]\leq\gamma\|\mathcal{E}_{\mathrm{s}}^{t}\|_{% \mathrm{F}}^{2}+\frac{\alpha^{2}L^{2}(2\chi^{2}+p^{2})}{2\chi^{2}(1-\gamma)}\|% \widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}+\frac{(p^{2}+2\chi^{2})n\alpha^{2}% \sigma^{2}}{2\chi^{2}}\ .

(84)
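The Jensen-type inequality $\|{\bf{a+b}}\|^{2}\leq\frac{1}{\theta}\|{\bf{a}}\|^{2}+\frac{1}{1-\theta}\|{\bf{b}}\|^{2}$ used to obtain (84) can be spot-checked on random vectors. The sketch below is illustrative only; the helper names, the vector dimension, and the fixed random seed are assumptions.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(x * x for x in v)


def jensen_gap(a, b, theta):
    """Right-hand side minus left-hand side of ||a+b||^2 <= ||a||^2/theta + ||b||^2/(1-theta)."""
    lhs = sq_norm([x + y for x, y in zip(a, b)])
    rhs = sq_norm(a) / theta + sq_norm(b) / (1.0 - theta)
    return rhs - lhs


def min_gap_over_random_trials(trials=200, dim=5, seed=0):
    """Smallest gap over random vectors and random theta in (0, 1)."""
    rng = random.Random(seed)
    smallest = float("inf")
    for _ in range(trials):
        a = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        theta = rng.uniform(0.01, 0.99)
        smallest = min(smallest, jensen_gap(a, b, theta))
    return smallest


if __name__ == "__main__":
    print(min_gap_over_random_trials())
```

The inequality follows from writing ${\bf{a+b}}=\theta({\bf{a}}/\theta)+(1-\theta)({\bf{b}}/(1-\theta))$ and applying convexity of the squared norm, so the gap should be nonnegative up to rounding.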

In addition, we bound $\mathbb{E}\!\left[\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}% \widetilde{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$ as follows:

		$\displaystyle\mathbb{E}\!\left[\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{Z}}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]=\mathbb{E}\!\left[\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}(\widetilde{{\bf{X}}}^{t}-\alpha(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star})+{\bf{S}}^{t})-{\bf{W}}_{b}\widetilde{{\bf{U}}}^{t})\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$
		$\displaystyle=\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}(\widetilde{{\bf{X}}}^{t}-\alpha(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star}))-{\bf{W}}_{b}\widetilde{{\bf{U}}}^{t})\|_{\mathrm{F}}^{2}+\mathbb{E}\!\left[\alpha^{2}\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}{\bf{S}}^{t}\|_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$
		$\displaystyle\leq 3\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}+3\alpha^{2}\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}(\nabla F({\bf{X}}^{t})-\nabla F({\bf{X}}^{\star}))\|_{\mathrm{F}}^{2}+3\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}{\bf{W}}_{b}\widetilde{{\bf{U}}}^{t}\|_{\mathrm{F}}^{2}+n\alpha^{2}\sigma^{2}$
		$\displaystyle\leq 3\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}+6\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{U}}}^{t}\|_{\mathrm{F}}^{2}+3\alpha^{2}L^{2}\|{\bf{X}}^{t}-{\bf{X}}^{\star}\|_{\mathrm{F}}^{2}+n\alpha^{2}\sigma^{2}$
		$\displaystyle\leq 6\|\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}+3\alpha^{2}L^{2}\|\widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}+n\alpha^{2}\sigma^{2}.$		(85)

Therefore, substituting (84) and (85) into (83), we can conclude (49). ∎

Appendix N Proof of Theorem 7

Proof.

From [51, eq. (27)], we have

		$\displaystyle\mathbb{E}\!\left[\big{\|}{\bf{X}}^{t+1}-{\bf{X}}^{\star}\big{\|}_{\mathrm{F}}^{2}\;|\;\mathcal{F}_{t}\right]+\frac{2\chi\alpha^{2}}{p^{2}}\mathbb{E}\!\left[\big{\|}{\bf{U}}^{t+1}-{\bf{U}}^{\star}\big{\|}_{\mathrm{F}}^{2}\;|\;\mathcal{F}_{t}\right]$
		$\displaystyle\leq\big{\|}\tilde{{\bf{V}}}^{t}-{\bf{V}}^{\star}\big{\|}_{\mathrm{F}}^{2}+\alpha^{2}\Big{(}\frac{2\chi}{p^{2}}-(1-\lambda_{2})\Big{)}\big{\|}{\bf{U}}^{t}-{\bf{U}}^{\star}\big{\|}_{\mathrm{F}}^{2}\ .$		(86)

Then, recalling the definition of $\tilde{{\bf{V}}}^{t}$ and ${\bf{V}}^{\star}$ , it gives that

	$\displaystyle\big{\|}\tilde{{\bf{V}}}^{t}-{\bf{V}}^{\star}\big{\|}_{\mathrm{F}}^{2}$	$\displaystyle=\big{\|}({\bf{X}}^{t}-\alpha\nabla F({\bf{X}}^{t}))-({\bf{X}}^{\star}-\alpha\nabla F({\bf{X}}^{\star}))+(\alpha\nabla F({\bf{X}}^{t})-\alpha{\bf{G}}^{t})\big{\|}_{\mathrm{F}}^{2}$
		$\displaystyle=\big{\|}({\bf{X}}^{t}-\alpha\nabla F({\bf{X}}^{t}))-({\bf{X}}^{\star}-\alpha\nabla F({\bf{X}}^{\star}))\big{\|}_{\mathrm{F}}^{2}+\big{\|}\alpha\nabla F({\bf{X}}^{t})-\alpha{\bf{G}}^{t}\big{\|}_{\mathrm{F}}^{2}$
		$\displaystyle\quad+2\big{\langle}({\bf{X}}^{t}-\alpha\nabla F({\bf{X}}^{t}))-({\bf{X}}^{\star}-\alpha\nabla F({\bf{X}}^{\star})),\alpha\nabla F({\bf{X}}^{t})-\alpha{\bf{G}}^{t}\big{\rangle}.$

Taking the conditional expectation with respect to $\mathcal{G}^{t}\subset\mathcal{F}^{t}$, and using the unbiasedness of ${\bf{G}}^{t}$ (so that the cross term vanishes) together with its bounded variance, we have

	$\displaystyle\mathbb{E}\!\left[\big{\|}\tilde{{\bf{V}}}^{t}-{\bf{V}}^{\star}\big{\|}_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]$	$\displaystyle\leq\big{\|}({\bf{X}}^{t}-\alpha\nabla F({\bf{X}}^{t}))-({\bf{X}}^{\star}-\alpha\nabla F({\bf{X}}^{\star}))\big{\|}_{\mathrm{F}}^{2}+n\alpha^{2}\sigma^{2}.$		(87)

By [51, Lemma 1], when $0<\alpha<2/L$ and $\mu>0$, we have

		$\displaystyle\big{\|}({\bf{X}}^{t}-\alpha\nabla F({\bf{X}}^{t}))-({\bf{X}}^{\star}-\alpha\nabla F({\bf{X}}^{\star}))\big{\|}_{\mathrm{F}}^{2}\leq\max\{(1-\alpha\mu)^{2},(\alpha L-1)^{2}\}\big{\|}{\bf{X}}^{t}-{\bf{X}}^{\star}\big{\|}_{\mathrm{F}}^{2},$		(88)

and $\max\{(1-\alpha\mu)^{2},(\alpha L-1)^{2}\}\in(0,1)$. Combining this with (87) gives

		$\displaystyle\mathbb{E}\!\left[\big{\|}\tilde{{\bf{V}}}^{t}-{\bf{V}}^{\star}\big{\|}_{\mathrm{F}}^{2}\;|\;\mathcal{G}^{t}\right]\leq\max\{(1-\alpha\mu)^{2},(\alpha L-1)^{2}\}\big{\|}{\bf{X}}^{t}-{\bf{X}}^{\star}\big{\|}_{\mathrm{F}}^{2}+n\alpha^{2}\sigma^{2}.$		(89)
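As a sanity check, the contraction factor in (88) can be verified numerically on a toy problem. The sketch below is illustrative only: it assumes a diagonal quadratic $f(x)=\frac{1}{2}\sum_{i}h_{i}x_{i}^{2}$ with curvatures $h_{i}\in[\mu,L]$, which is our choice and not taken from the paper, and the helper names are hypothetical. Note that the inequality holds for any pair of points, so the second point does not need to be a minimizer.

```python
import random

def contraction_factor(alpha, mu, L):
    """Factor max{(1 - alpha*mu)^2, (alpha*L - 1)^2} appearing in (88)."""
    return max((1 - alpha * mu) ** 2, (alpha * L - 1) ** 2)

def gradient_step_ratio(alpha, curvatures, x, y):
    """Ratio ||(x - a*grad f(x)) - (y - a*grad f(y))||^2 / ||x - y||^2 for the
    toy quadratic f(x) = 0.5 * sum_i h_i * x_i^2, whose gradient is h_i * x_i."""
    num = sum(((1 - alpha * h) * (xi - yi)) ** 2
              for h, xi, yi in zip(curvatures, x, y))
    den = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return num / den

random.seed(0)
mu, L = 0.5, 4.0                      # assumed strong convexity / smoothness constants
curvatures = [mu, 1.3, 2.7, L]        # Hessian eigenvalues, all inside [mu, L]
x = [random.uniform(-1, 1) for _ in curvatures]
y = [random.uniform(-1, 1) for _ in curvatures]

for alpha in (0.1, 0.25, 0.45):       # every value satisfies 0 < alpha < 2/L = 0.5
    rho = contraction_factor(alpha, mu, L)
    ratio = gradient_step_ratio(alpha, curvatures, x, y)
    assert 0 < rho < 1 and ratio <= rho + 1e-12
```

For this quadratic the gradient step scales each coordinate of the difference by $1-\alpha h_{i}$, so the ratio is at most $\max_{i}(1-\alpha h_{i})^{2}$, which is attained at an endpoint of $[\mu,L]$.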

Then, it follows from (86) and (89) that

	$\displaystyle\mathbb{E}\!\left[\big{\|}{\bf{X}}^{t+1}-{\bf{X}}^{\star}\big{\|}_{\mathrm{F}}^{2}\right]+\frac{2\chi\alpha^{2}}{p^{2}}\mathbb{E}\!\left[\big{\|}{\bf{U}}^{t+1}-{\bf{U}}^{\star}\big{\|}_{\mathrm{F}}^{2}\right]$
	$\displaystyle\leq\max\{(1-\alpha\mu)^{2},(\alpha L-1)^{2}\}\big{\|}{\bf{X}}^{t}-{\bf{X}}^{\star}\big{\|}_{\mathrm{F}}^{2}+n\alpha^{2}\sigma^{2}+(\frac{2\chi\alpha^{2}}{p^{2}}-(1-\lambda_{2})\alpha^{2})\big{\|}{\bf{U}}^{t}-{\bf{U}}^{\star}\big{\|}_{\mathrm{F}}^{2}$
	$\displaystyle\leq\max\{(1-\mu\alpha)^{2},(\alpha L-1)^{2},1-\frac{(1-\lambda_{2})p^{2}}{2\chi}\}\Big{(}\|{\bf{X}}^{t}-{\bf{X}}^{\star}\|_{\mathrm{F}}^{2}+\frac{2\chi\alpha^{2}}{p^{2}}\|{\bf{U}}^{t}-{\bf{U}}^{\star}\|_{\mathrm{F}}^{2}\Big{)}+n\alpha^{2}\sigma^{2}$
	$\displaystyle=\underbrace{\max\{1-(2\mu\alpha-\mu^{2}\alpha^{2}),1-(2\alpha L-\alpha^{2}L^{2}),1-\frac{(1-\lambda_{2})p^{2}}{2\chi}\}}_{:=\zeta}\Big{(}\|{\bf{X}}^{t}-{\bf{X}}^{\star}\|_{\mathrm{F}}^{2}+\frac{2\chi\alpha^{2}}{p^{2}}\|{\bf{U}}^{t}-{\bf{U}}^{\star}\|_{\mathrm{F}}^{2}\Big{)}+n\alpha^{2}\sigma^{2}.$

Since $0<\alpha<\frac{2}{L}$, $0<\frac{1-\lambda_{2}}{2}<1$, and $0<p^{2}\leq 1$, we have $0<\zeta<1$. Writing $\Psi^{t}=\|{\bf{X}}^{t}-{\bf{X}}^{\star}\|_{\mathrm{F}}^{2}+\frac{2\chi\alpha^{2}}{p^{2}}\|{\bf{U}}^{t}-{\bf{U}}^{\star}\|_{\mathrm{F}}^{2}$, the above inequality becomes

\mathbb{E}\!\left[\Psi^{t+1}\right]\leq\zeta\Psi^{t}+n\alpha^{2}\sigma^{2}.

Taking the full expectation and unrolling the recurrence, we have

		$\displaystyle\mathbb{E}\!\left[\Psi^{T}\right]\leq\zeta^{T}\Psi^{0}+\frac{n\alpha^{2}\sigma^{2}}{1-\zeta}.$		(90)
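The unrolling step behind (90) can be illustrated with a short numerical sketch. It iterates the worst case of the recursion, $\Psi^{t+1}=\zeta\Psi^{t}+c$ with $c$ standing in for $n\alpha^{2}\sigma^{2}$, and compares the result with the right-hand side of (90). The numerical values and function names are assumptions chosen for illustration.

```python
def unroll(zeta, c, psi0, T):
    """Iterate the worst case psi_{t+1} = zeta * psi_t + c for T steps."""
    psi = psi0
    for _ in range(T):
        psi = zeta * psi + c
    return psi

def bound_90(zeta, c, psi0, T):
    """Right-hand side of (90): zeta^T * psi0 + c / (1 - zeta)."""
    return zeta ** T * psi0 + c / (1 - zeta)

# Illustrative values only; c plays the role of n * alpha^2 * sigma^2.
zeta, c, psi0 = 0.9, 0.01, 5.0
for T in (0, 1, 10, 100, 1000):
    assert unroll(zeta, c, psi0, T) <= bound_90(zeta, c, psi0, T) + 1e-12
```

The gap between the two quantities is the tail $c\,\zeta^{T}/(1-\zeta)$ of the geometric series, which is why the bound becomes tight as $T$ grows.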

Note that

		$\displaystyle 1-\frac{p^{2}}{2\chi\kappa_{w}}=1-\frac{p^{2}(1-\lambda_{2})}{2\chi}\leq\sqrt{1-\frac{p^{2}(1-\lambda_{2})}{2\chi}}<1,\text{ and }\gamma=\sqrt{1-\frac{1-\lambda_{2}}{2\chi}}.$

Since $\tilde{\gamma}_{\mathrm{s}}=\gamma+\frac{3(1-p)(2+p^{2})}{\chi^{2}}=\sqrt{1-\frac{1-\lambda_{2}}{2\chi}}+\frac{3(1-p)(2+p^{2})}{\chi^{2}}$, we have

		$\displaystyle\chi\geq\frac{36}{1-\lambda_{2}}\Longrightarrow\tilde{\gamma}_{\mathrm{s}}\leq\sqrt{1-\frac{p^{2}}{2\chi\kappa_{w}}}=\sqrt{1-\frac{p^{2}(1-\lambda_{2})}{2\chi}}<1.$

From (90), we have $\mathbb{E}[\|\widetilde{{\bf{X}}}^{t}\|_{\mathrm{F}}^{2}]\leq\mathbb{E}[\Psi^{t}]\leq\zeta^{t}\Psi^{0}+\frac{n\alpha^{2}\sigma^{2}}{1-\zeta}$. Substituting this into (49), we get

		$\displaystyle\mathbb{E}\!\left[\|\mathcal{E}_{\mathrm{s}}^{t+1}\|_{\mathrm{F}}^{2}\right]\leq\tilde{\gamma}_{\mathrm{s}}\mathbb{E}\!\left[\|\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}\right]+F_{1}\zeta^{t}+F_{2},$		(91)

where $F_{1}=D_{1}\Psi^{0}$ and $F_{2}=\frac{D_{1}n\alpha^{2}\sigma^{2}}{1-\zeta}+D_{2}n\alpha^{2}\sigma^{2}$ . Unrolling the recurrence (91), we have

	$\displaystyle\mathbb{E}\!\left[\|\mathcal{E}_{\mathrm{s}}^{t+1}\|_{\mathrm{F}}^{2}\right]$	$\displaystyle\leq\tilde{\gamma}_{\mathrm{s}}\mathbb{E}\!\left[\|\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}\right]+F_{1}\zeta^{t}+F_{2}$
		$\displaystyle\leq\tilde{\gamma}_{\mathrm{s}}^{t+1}{\|\mathcal{E}_{\mathrm{s}}^{0}\|_{\mathrm{F}}^{2}}+F_{1}\sum_{j=0}^{t}\tilde{\gamma}_{\mathrm{s}}^{j}\zeta^{t-j}+F_{2}\sum_{j=0}^{t}\tilde{\gamma}_{\mathrm{s}}^{j}$
		$\displaystyle=\tilde{\gamma}_{\mathrm{s}}^{t+1}{\|\mathcal{E}_{\mathrm{s}}^{0}\|_{\mathrm{F}}^{2}}+F_{1}\frac{\zeta^{t+1}-\tilde{\gamma}_{\mathrm{s}}^{t+1}}{\zeta-\tilde{\gamma}_{\mathrm{s}}}+F_{2}\frac{1-\tilde{\gamma}_{\mathrm{s}}^{t+1}}{1-\tilde{\gamma}_{\mathrm{s}}}$
		$\displaystyle=\tilde{\gamma}_{\mathrm{s}}^{t+1}\Big{(}{\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{0}\|_{\mathrm{F}}^{2}+\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{U}}}^{0}\|_{\mathrm{F}}^{2}}\Big{)}+F_{1}\frac{\zeta^{t+1}-\tilde{\gamma}_{\mathrm{s}}^{t+1}}{\zeta-\tilde{\gamma}_{\mathrm{s}}}+F_{2}\frac{1-\tilde{\gamma}_{\mathrm{s}}^{t+1}}{1-\tilde{\gamma}_{\mathrm{s}}}.$		(92)

Since ${\bf{X}}^{0}=[{\bf{x}}^{0},\cdots,{\bf{x}}^{0}]^{\sf T}$ and ${\bf{U}}^{0}=0$ , we have

		$\displaystyle\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{0}\|_{\mathrm{F}}^{2}+\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{U}}}^{0}\|_{\mathrm{F}}^{2}\leq\alpha^{2}\|\hat{{\bf{P}}}^{\sf T}{\bf{U}}^{\star}\|_{\mathrm{F}}^{2}.$

Multiplying (46a) by $\hat{{\bf{P}}}^{\sf T}$ and using (29), we have

		$\displaystyle 0=\alpha\hat{{\bf{P}}}^{\sf T}\nabla F({\bf{X}}^{\star})+\alpha\hat{{\bf{\Lambda}}}_{b}\hat{{\bf{P}}}^{\sf T}{\bf{U}}^{\star}.$

Then, it holds that

		$\displaystyle\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{X}}}^{0}\|_{\mathrm{F}}^{2}+\|\upsilon({\bf{Q}}^{\mathrm{s}})^{-1}\hat{{\bf{P}}}^{\sf T}\widetilde{{\bf{U}}}^{0}\|_{\mathrm{F}}^{2}\leq\alpha^{2}\|\hat{{\bf{P}}}^{\sf T}{\bf{U}}^{\star}\|_{\mathrm{F}}^{2}\leq\frac{\alpha^{2}}{1-\lambda_{2}}\|\nabla F({\bf{X}}^{\star})\|_{\mathrm{F}}^{2}.$		(93)

Combining (92) and (93), and using $1-\tilde{\gamma}_{\mathrm{s}}^{t+1}<1$, we obtain

		$\displaystyle\mathbb{E}\!\left[\|\mathcal{E}_{\mathrm{s}}^{t+1}\|_{\mathrm{F}}^{2}\right]\leq\tilde{\gamma}_{\mathrm{s}}^{t+1}\frac{\alpha^{2}}{1-\lambda_{2}}\|\nabla F({\bf{X}}^{\star})\|_{\mathrm{F}}^{2}+F_{1}\frac{\zeta^{t+1}-\tilde{\gamma}_{\mathrm{s}}^{t+1}}{\zeta-\tilde{\gamma}_{\mathrm{s}}}+\frac{F_{2}}{1-\tilde{\gamma}_{\mathrm{s}}}.$		(94)

Note that

		$\displaystyle\left\{\begin{array}{cc}\frac{\zeta^{t+1}-\tilde{\gamma}_{\mathrm{s}}^{t+1}}{\zeta-\tilde{\gamma}_{\mathrm{s}}}\leq\frac{\zeta^{t+1}}{\zeta-\tilde{\gamma}_{\mathrm{s}}},&\zeta>\tilde{\gamma}_{\mathrm{s}};\\ \frac{\zeta^{t+1}-\tilde{\gamma}_{\mathrm{s}}^{t+1}}{\zeta-\tilde{\gamma}_{\mathrm{s}}}\leq\frac{\tilde{\gamma}_{\mathrm{s}}^{t+1}}{\tilde{\gamma}_{\mathrm{s}}-\zeta},&\zeta<\tilde{\gamma}_{\mathrm{s}}.\end{array}\right.$
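The two-rate sum that arises when unrolling (91), together with the case distinction above, can be checked numerically. The sketch below compares the direct sum $\sum_{j=0}^{t}\tilde{\gamma}_{\mathrm{s}}^{j}\zeta^{t-j}$ with its closed form and with the envelope $\max\{\zeta,\tilde{\gamma}_{\mathrm{s}}\}^{t+1}/|\zeta-\tilde{\gamma}_{\mathrm{s}}|$. The rate pairs and function names are illustrative assumptions.

```python
def two_rate_sum(gamma, zeta, t):
    """Direct evaluation of sum_{j=0}^{t} gamma^j * zeta^(t-j)."""
    return sum(gamma ** j * zeta ** (t - j) for j in range(t + 1))

def closed_form(gamma, zeta, t):
    """(zeta^(t+1) - gamma^(t+1)) / (zeta - gamma), valid for zeta != gamma."""
    return (zeta ** (t + 1) - gamma ** (t + 1)) / (zeta - gamma)

def envelope(gamma, zeta, t):
    """Upper bound max(zeta, gamma)^(t+1) / |zeta - gamma| from the case split."""
    return max(gamma, zeta) ** (t + 1) / abs(zeta - gamma)

# One illustrative pair per case: zeta > gamma, then zeta < gamma.
for gamma, zeta in ((0.6, 0.9), (0.95, 0.7)):
    for t in (0, 1, 5, 50):
        s = two_rate_sum(gamma, zeta, t)
        assert abs(s - closed_form(gamma, zeta, t)) < 1e-9
        assert s <= envelope(gamma, zeta, t) + 1e-12
```

In both cases the slower of the two rates dominates, which is what motivates collecting them into a single rate in the next step.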

In both cases, $\frac{\zeta^{t+1}-\tilde{\gamma}_{\mathrm{s}}^{t+1}}{\zeta-\tilde{\gamma}_{\mathrm{s}}}\leq\frac{\zeta_{0}^{t+1}}{|\zeta-\tilde{\gamma}_{\mathrm{s}}|}$, where $\zeta_{0}=\max\{\zeta,\tilde{\gamma}_{\mathrm{s}},1-\mu\alpha\}=\max\{1-\alpha\mu,\sqrt{1-\frac{(1-\lambda_{2})p^{2}}{2\chi}}\}$. Substituting (94) into (48), taking the full expectation, and unrolling the recurrence, we have

	$\displaystyle\mathbb{E}\!\left[\big{\|}\bar{{\bf{e}}}^{t+1}\big{\|}^{2}\right]$	$\displaystyle\leq(1-\mu\alpha)\|\bar{{\bf{e}}}^{t}\|^{2}+\frac{2\alpha L\vartheta_{\mathrm{s}}}{n}\|\mathcal{E}_{\mathrm{s}}^{t}\|_{\mathrm{F}}^{2}+\frac{\alpha^{2}\sigma^{2}}{n}$
		$\displaystyle\leq(1-\mu\alpha)\|\bar{{\bf{e}}}^{t}\|^{2}+\frac{2\alpha L\vartheta_{\mathrm{s}}}{n}\Big{(}\tilde{\gamma}_{\mathrm{s}}^{t}\frac{\alpha^{2}}{1-\lambda_{2}}\|\nabla F({\bf{X}}^{\star})\|_{\mathrm{F}}^{2}+F_{1}\frac{\zeta_{0}^{t}}{|\zeta-\tilde{\gamma}_{\mathrm{s}}|}+\frac{F_{2}}{1-\tilde{\gamma}_{\mathrm{s}}}\Big{)}+\frac{\alpha^{2}\sigma^{2}}{n}$
		$\displaystyle\leq(1-\mu\alpha)\|\bar{{\bf{e}}}^{t}\|^{2}+\frac{2\alpha L\vartheta_{\mathrm{s}}(\frac{\alpha^{2}}{1-\lambda_{2}}\|\nabla F({\bf{X}}^{\star})\|_{\mathrm{F}}^{2}+\nicefrac{{F_{1}}}{{|\zeta-\tilde{\gamma}_{\mathrm{s}}|}})}{n}\zeta_{0}^{t}+\frac{2\alpha L\vartheta_{\mathrm{s}}F_{2}}{n(1-\tilde{\gamma}_{\mathrm{s}})}+\frac{\alpha^{2}\sigma^{2}}{n}$
		$\displaystyle\leq\zeta_{0}^{t}a_{0}+\frac{2LF_{2}\vartheta_{\mathrm{s}}}{n\mu(1-\tilde{\gamma}_{\mathrm{s}})}+\frac{\alpha\sigma^{2}}{n\mu}.$

Note that $\chi\geq\frac{72(1-p)}{1-\lambda_{2}}\Longrightarrow\tilde{\gamma}_{\mathrm{s}}\leq\frac{1+\gamma}{2}<1$, so that $\frac{1}{1-\tilde{\gamma}_{\mathrm{s}}}\leq\frac{8\chi}{1-\lambda_{2}}$. Since $\vartheta_{\mathrm{s}}=\|{\bf{Q}}^{\mathrm{s}}\|^{2}\|({\bf{Q}}^{\mathrm{s}})^{-1}\|^{2}\leq\frac{8\chi^{2}}{p^{2}(1+\lambda_{n})}$ and $F_{2}=\frac{D_{1}n\alpha^{2}\sigma^{2}}{1-\zeta}+D_{2}n\alpha^{2}\sigma^{2}$, where

		$\displaystyle D_{1}=\frac{\alpha^{2}L^{2}(2+p^{2})}{2(1-\gamma)}+\frac{3\alpha^{2}L^{2}(1-p)(2+p^{2})}{2},\ D_{2}=\frac{(2-p)(2+p^{2})}{2},$

we have

		$\displaystyle\frac{2LF_{2}\vartheta_{\mathrm{s}}}{n\mu(1-\tilde{\gamma}_{\mathrm{s}})}\leq\mathcal{O}\left(\frac{\alpha^{4}\sigma^{2}L^{3}\chi^{4}}{\mu p^{2}(1-\lambda_{2})^{2}(1-\zeta)}+\frac{\alpha^{2}\sigma^{2}L\chi^{3}}{\mu p^{2}(1-\lambda_{2})}\right).$

The linear speedup result (51) is thus proved. ∎
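The step $\frac{1}{1-\tilde{\gamma}_{\mathrm{s}}}\leq\frac{8\chi}{1-\lambda_{2}}$ used in the proof above follows from $\tilde{\gamma}_{\mathrm{s}}\leq\frac{1+\gamma}{2}$ and $\sqrt{1-x}\leq 1-\frac{x}{2}$ with $x=\frac{1-\lambda_{2}}{2\chi}$. The sketch below checks it numerically over a small grid. The grid values and function names are illustrative assumptions, and $\tilde{\gamma}_{\mathrm{s}}$ is replaced by its largest admissible value $\frac{1+\gamma}{2}$.

```python
import math

def gamma_of(chi, lam2):
    """gamma = sqrt(1 - (1 - lambda_2) / (2 * chi)) as defined in the proof."""
    return math.sqrt(1 - (1 - lam2) / (2 * chi))

def inverse_gap_bound_holds(chi, lam2):
    """Check 1 / (1 - (1 + gamma) / 2) <= 8 * chi / (1 - lambda_2)."""
    gamma_tilde_max = (1 + gamma_of(chi, lam2)) / 2   # worst case allowed for gamma_tilde_s
    lhs = 1 / (1 - gamma_tilde_max)
    rhs = 8 * chi / (1 - lam2)
    return lhs <= rhs * (1 + 1e-12)                   # small slack for rounding

# Illustrative grid: lambda_2 in (-1, 1) and chi >= 1.
for lam2 in (-0.5, 0.0, 0.5, 0.9, 0.99):
    for chi in (1.0, 10.0, 100.0, 1e4):
        assert inverse_gap_bound_holds(chi, lam2)
```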

