On the Trade-off between Flatness and Optimization
in Distributed Learning

Ying Cao*, Zhaoxian Wu, Kun Yuan, Ali H. Sayed

* Corresponding author. Ying Cao and Ali H. Sayed are with the Institute of Electrical and Micro Engineering at EPFL, Lausanne. Zhaoxian Wu was with Sun Yat-sen University. Kun Yuan is with the Center for Machine Learning Research (CMLR) at Peking University. Emails: {ying.cao, ali.sayed}@epfl.ch, [email protected], [email protected].
Abstract

This paper proposes a theoretical framework to evaluate and compare the performance of gradient-descent algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have noticed that convergence toward flat local minima tends to enhance the generalization ability of learning algorithms. This work discovers two interesting results. First, it shows that decentralized learning strategies are able to escape faster from local minimizers and favor convergence toward flatter minima relative to the centralized solution in the large-batch training regime. Second, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimizer but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies of the diffusion type deliver enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance.

Index Terms:
Flatness, escaping efficiency, distributed learning, decentralized learning, nonconvex learning.

I Introduction and related works

As modern society continues to evolve into a more interconnected world, and large-scale applications become increasingly prevalent, the interest in distributed learning has grown significantly [1, 2, 3]. In the distributed scenario, a group of agents, each with its own objective function, collaborate to optimize a global objective. To achieve the goal, two popular gradient descent based methodologies are commonly employed in algorithm design [4, 5, 6, 7]. The first one, known as the centralized method, requires all agents to send their data to a central server that manages all computations. In the second approach, which is called decentralized, agents are connected by a graph topology, and they process data locally and exchange information with their neighbors. Furthermore, the terms consensus and diffusion refer to two prominent decentralized methods in the optimization and machine learning communities [8, 9, 10, 1, 11, 12, 13, 14]. It is widely known that decentralized implementations offer enhanced data privacy, resilience to failure and reduced computation burden compared to the centralized approach [6, 9, 15, 16, 17, 18, 19].

Figure 1: The evolution of the test accuracy for centralized, consensus and diffusion training with local batch size 512 on CIFAR100 dataset. Consensus exhibits the worst test performance. The manuscript provides detailed theoretical analysis to explain how distributed learning methods perform in relation to flatness, test accuracy, and optimization performance.

As for the fundamental properties of the three strategies including centralized, consensus and diffusion, the works by [6, 20, 11, 21, 15, 22, 23] provide convergence guarantees for convex and non-convex environments. The results generally suggest that decentralized operations could, at best, only match the optimization performance of the centralized approach [6, 24, 25], or potentially degrade the generalization ability of models [26, 10, 27]. Nevertheless, the work [28] empirically demonstrated that the network disagreement, i.e., the distance among agents, in the middle of decentralized training can enhance generalization over centralized training. Afterwards, the work [9] provably verified how the consensus strategy enhances the generalization performance of models over centralized solutions in the large-batch setting from the perspective of sharpness-aware minimization (SAM), which is primarily designed to reduce the sharpness of machine learning models [29]. Specifically, the results in [9] showed that the network disagreement among agents, which changes over time, during training with the consensus method implicitly introduces an extra regularization term related to SAM to the original risk function, which helps drive the iterates to flatter models that are known to generalize better. However, as Figure 1 shows, flatter models obtained by consensus do not always guarantee better test accuracy in the context of distributed learning. This is because the final test accuracy depends on both generalization and optimization performance. Although decentralized training tends to favor flatter minima that generalize better than those found by the centralized method, it could also implicitly degrade the optimization performance in some cases. This observation motivates us to examine the behavior of centralized and decentralized strategies more closely and to compare more directly both their generalization and optimization abilities. 
As a result, this work discovers two interesting results. First, it shows that decentralized learning strategies are able to escape faster away from local minimizers and favor convergence toward flatter minima relative to the centralized solution. Second, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimizer but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies of the diffusion type deliver enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance.

Notation. In this paper, we use boldface letters to denote random variables. For a scalar-valued function $f(\mu)$ where $\mu$ is a scalar, we say that $f(\mu)=\pm O(\mu)$ if $|f(\mu)|\leq c|\mu|$ for some constant $c>0$. We say $f(\mu)=o(\mu)$ if $f(\mu)/\mu\to 0$ as $\mu\to 0$. For any matrix $X$, $X=O(\mu)$ signifies that the magnitudes of the individual entries of $X$ are $O(\mu)$ or the norm of $X$ is $O(\mu)$.

I-A Related works

Understanding the optimization behavior of mini-batch gradient descent (GD) algorithms and their influence on the generalization of models has emerged as a prominent topic of interest in recent years. Large-batch training is increasingly important in distributed learning for its potential to enhance the training speed and scalability of modern neural networks. However, it has already been observed in the literature that small-batch GD tends to converge to flatter minima than the large-batch version [30], and that flat minima usually generalize better than sharp ones [31, 32, 33, 34, 35, 36]. Intuitively, the loss values around flat minima change slowly when the model parameters are adjusted, thereby reducing the disparity between the training and test data [30]. The works by [37, 38, 39, 40, 34, 41] utilized the stochastic differential equation (SDE) approach as a fundamental tool to analyze the regularization effects of mini-batch GD. This type of analysis inherently introduces an extra layer of approximation error because it substitutes the discrete-time update with a different, continuous-time equation. Moreover, to enable the analysis with SDE, it is necessary to make some extra assumptions on the gradient noise, such as being Gaussian distributed [34, 39], parameter-independent [37] or heavy-tailed distributed [42]. To overcome these limitations, another line of works [43, 44, 33] focused on the dynamical stability of mini-batch GD with discrete-time analysis. Specifically, they determined the conditions on the Hessian matrices to ensure that the distance to a local minimum does not increase, thereby stabilizing the iterates in the vicinity of a local minimum.

I-B Contribution

Contrary to the aforementioned works focusing on the single-agent case, we will study in this work the optimization bias of distributed algorithms in the multi-agent setting, which allows us to compare the performance of various distributed strategies. Inspired by [34], we start by investigating the escaping efficiency of algorithms from local minima, and then relate it to the trade-off between flatness and optimization. Our contributions are listed as follows:

(1) We propose a general framework to examine the escaping efficiency of distributed algorithms from local minima by staying in the discrete-time domain. Since the dependence of the Hessian matrix on the immediate iterate in the original recursion makes the optimization analysis intractable, it is necessary to resort to a short-term model where the original Hessian is replaced by one evaluated at a local minimum. We rigorously verify that the approximation error between the short-term and true models is negligible, ensuring that the short-term model represents the true model accurately enough. Afterwards, we obtain closed-form expressions of the short-term excess risk, which quantifies the escaping efficiency. Note that we follow a different discrete-time analysis from the dynamical stability approach used in [43, 44, 33], which focused on studying the stability of algorithms. Similar to [34], our emphasis is on quantifying the extent to which algorithms can escape from local minima.

(2) According to the results obtained from (1), we compare the escaping efficiency of the centralized, consensus and diffusion methods. We show that decentralized approaches, i.e., consensus and diffusion, gain additional efficacy from the network heterogeneity and graph structure, enabling them to escape faster from local minima than the centralized strategy. We further verify that consensus outperforms diffusion in terms of escaping ability from local minima. We also show that higher escaping efficiency encourages algorithms to favor flatter minima by relating escaping efficiency to the flatness metric.

(3) If the additional power generated by the network heterogeneity and graph structure is not sufficiently strong to allow decentralized methods to leave the current basin, then both decentralized and centralized methods will remain stuck within the basin of the current local minimum. A similar rule also holds for the comparison between consensus and diffusion. This motivates us to pursue next a long-term analysis, which corresponds to the optimization performance. In this context, we verify that the extra power boosting the escaping efficiency could in turn deteriorate the optimization performance. This reveals the inherent trade-off between flatness and optimization, with diffusion showing superior optimization performance over consensus in terms of smaller estimation error.

(4) We finally illustrate the performance of centralized, consensus and diffusion training strategies on real data. In addition to reporting the flatness and optimization performance of the three distributed methods, we show that diffusion achieves a favorable balance between flatness and optimization, thereby achieving higher test accuracy than the other two methods (i.e., centralized and consensus).

II Problem Statement

II-A Empirical risk minimization

Consider a collection of $K$ agents (or nodes) linked by a graph topology. Each agent receives a streaming sequence of data realizations arising from independently distributed observations. The agents wish to collaborate to solve a distributed learning problem of the following form:

$$\min_{w\in\mathbbm{R}^{M}}\left\{J(w)\overset{\Delta}{=}\frac{1}{K}\sum_{k=1}^{K}J_{k}(w)\right\}\tag{1}$$

where $M$ is the dimension of the unknown vector $w$, $J_{k}(w)$ is the risk function for the $k$-th agent, and $J(w)$ is the aggregate risk across the graph. Each individual risk is defined as the empirical average of a loss function over a collection of training data points, namely,

$$J_{k}(w)\overset{\Delta}{=}\frac{1}{N_{k}}\sum_{i=1}^{N_{k}}Q_{k}(w;x_{k,i})\tag{2}$$

where $Q_{k}(\cdot)$ is the possibly nonconvex loss function and the $\{x_{k,i}=\{\gamma_{k,i},h_{k,i}\}\}$ refers to $N_{k}$ training samples with feature vector $\gamma_{k,i}$ and true label $h_{k,i}$ arising as realizations from a random source $\boldsymbol{x}_{k}$ associated with agent $k$.

Since we aim to understand the optimization behavior of algorithms around local minima of nonconvex risk functions, we need to distinguish between the local minima of $J_{k}(w)$ and $J(w)$. Thus, we let $w^{\star}$ denote a local minimizer for the aggregate risk $J(w)$, and let $w_{k}^{\star}$ denote a local minimizer for the individual risk $J_{k}(w)$. If all agents in the network share the same risk function, i.e., $J_{k}(w)=J(w)$, then $w^{\star}=w_{k}^{\star}$, and we refer to the network as being homogeneous. In this case, all local minimizers of $J(w)$ are also local minima of $J_{k}(w)$. Otherwise, we refer to the network as being heterogeneous, where the $w_{k}^{\star}$ of all agents can be distinct among themselves and also in relation to $w^{\star}$. In this paper, we focus on this latter more general case.
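As a small illustration of (1)–(2) and of the heterogeneous case, the following sketch evaluates the local and aggregate risks for a toy network. The scalar model, the quadratic loss, and the data values are assumptions chosen only for illustration; they do not come from the paper.

```python
# Hypothetical toy setup: scalar model w and quadratic loss Q(w; x) = 0.5 * (w - x)^2.
def loss(w, x):
    return 0.5 * (w - x) ** 2


def local_risk(w, samples):
    """J_k(w) as in (2): empirical average of the loss over agent k's samples."""
    return sum(loss(w, x) for x in samples) / len(samples)


def aggregate_risk(w, data):
    """J(w) as in (1): uniform average of the K local risks."""
    return sum(local_risk(w, samples) for samples in data) / len(data)


# K = 2 agents holding different data, i.e., a heterogeneous network.
data = [[0.0, 2.0], [4.0, 6.0]]

# For this assumed loss, each local minimizer is the mean of the local samples
# (1.0 and 5.0), while the aggregate risk is minimized at 3.0.
local_minimizers = [sum(samples) / len(samples) for samples in data]
aggregate_minimizer = sum(local_minimizers) / len(local_minimizers)
```

In this toy example the three minimizers are all distinct, which is the situation the heterogeneous setting is meant to capture.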

II-B Gradient-descent algorithms

We consider three popular classes of algorithms that can be used to seek a solution for (1). The first algorithm is the mini-batch centralized method, in which all data from across the graph are shared with a central processor. At every instant $n$, $B$ samples denoted by $\{\boldsymbol{x}_{k,n}^{b}\}$ are selected uniformly at random with replacement from the training data available for each agent $k$. Starting from a random initial condition $\boldsymbol{w}_{-1}$, the central processor updates the estimate by using:

$$\boldsymbol{w}_{n}=\boldsymbol{w}_{n-1}-\mu\times\frac{1}{KB}\sum_{k=1}^{K}\sum_{b=1}^{B}\nabla_{w}Q_{k}(\boldsymbol{w}_{n-1};\boldsymbol{x}_{k,n}^{b})\tag{3}$$

Here, $B$ is the batch size and $\mu$ is a small positive step-size parameter. Observe that we are using boldface letters to refer to the data samples and iterates in (3); it is our convention in this paper to use the boldface notation to refer to random quantities.
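A minimal sketch of one centralized mini-batch iteration is given below, assuming a scalar model and the quadratic loss $Q(w;x)=\frac{1}{2}(w-x)^2$. The function names and the loss are hypothetical and serve only to make the sampling and averaging in (3) concrete.

```python
import random


def grad_loss(w, x):
    # Gradient with respect to w of the assumed loss Q(w; x) = 0.5 * (w - x)^2.
    return w - x


def centralized_step(w, data, mu, B, rng):
    """One iteration of the centralized update: average K*B sampled gradients, then step."""
    total = 0.0
    for samples in data:          # loop over the K agents
        for _ in range(B):        # B draws per agent, uniformly and with replacement
            total += grad_loss(w, rng.choice(samples))
    return w - mu * total / (len(data) * B)


# Usage: two agents, batch size 4, step size 0.1, seeded generator for reproducibility.
rng = random.Random(0)
w = centralized_step(0.0, [[0.0, 2.0], [4.0, 6.0]], mu=0.1, B=4, rng=rng)
```

Passing the random generator explicitly keeps the sketch reproducible; any sampling scheme that draws uniformly with replacement would serve the same purpose.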

The other two algorithms are of the decentralized type, where the data remains local and agents interact locally with their neighbors to solve (1) through a collaborative process. We consider two types of decentralized methods, which have been studied extensively in the literature, namely, the consensus and diffusion strategies [4, 5, 6, 7, 20, 24].

Before listing the algorithms, we describe the graph structure that drives their operation. The agents are assumed to be linked by a weighted graph topology. The weight on the link from agent $\ell$ to agent $k$ is denoted by $a_{\ell k}$; this value is used to scale information sent from $\ell$ to $k$. Each $a_{\ell k}$ lies within $[0,1]$; it is strictly positive if there exists a link from $\ell$ to $k$ over which information can be shared. We collect the $\{a_{\ell k}\}$ into a $K\times K$ matrix $A$. In this paper, we consider a symmetric doubly-stochastic matrix $A$: the entries on each column of $A$ (and, by symmetry, on each row) add up to $1$.

Assumption II.1.

(Strongly-connected graph). The graph is assumed to be strongly connected. This means that there exists a path with nonzero weights $\{a_{\ell k}\}$ linking any pair of agents and, in addition, at least one node $k$ in the network has a self-loop with $a_{kk}>0$.

$\square$

It follows from the Perron-Frobenius theorem [6] that $A$ has a single eigenvalue at $1$. Moreover, if we let $\pi=\{\pi_{k}\}_{k=1}^{K}$ denote the corresponding right eigenvector, then all its entries are positive and they can be normalized to add up to $1$:

$$A\pi=\pi,\quad\mathbbm{1}^{\sf T}\pi=1,\quad\pi_{k}>0\tag{4}$$

For the doubly-stochastic matrix $A$, we have $\pi=\frac{1}{K}\mathbbm{1}$. This means that the entries $\{\pi_{k}\}$ of the Perron vector $\pi$ are identical.
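To make the properties of the combination matrix concrete, the sketch below builds a symmetric doubly-stochastic matrix for a small graph and applies it to the uniform vector. The Metropolis weighting rule and the three-node line graph are assumptions made for illustration; the text above does not prescribe how $A$ is constructed.

```python
def metropolis_weights(neighbors):
    """Build a symmetric doubly-stochastic A from an undirected graph.

    neighbors[k] lists the agents linked to k (excluding k itself).
    The Metropolis rule used here is one common choice, assumed for illustration.
    """
    K = len(neighbors)
    deg = [len(nb) for nb in neighbors]
    A = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for l in neighbors[k]:
            A[l][k] = 1.0 / (1 + max(deg[k], deg[l]))
        # The self-weight absorbs the remainder so that column k adds up to one.
        A[k][k] = 1.0 - sum(A[l][k] for l in neighbors[k])
    return A


def apply_matrix(A, v):
    """Matrix-vector product A v for a list-of-rows matrix."""
    return [sum(A[l][k] * v[k] for k in range(len(v))) for l in range(len(A))]


# Hypothetical line graph 0 - 1 - 2.
A = metropolis_weights([[1], [0, 2], [1]])
K = len(A)
pi = [1.0 / K] * K            # candidate Perron vector with identical entries
A_pi = apply_matrix(A, pi)    # should reproduce pi up to rounding
```

Every agent in this example keeps a positive self-weight, so the self-loop condition in the strongly-connected graph assumption is also met.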

The diffusion strategy consists of the following two steps:

$$\boldsymbol{\phi}_{k,n}=\boldsymbol{w}_{k,n-1}-\frac{\mu}{B}\sum_{b=1}^{B}\nabla_{w}Q_{k}(\boldsymbol{w}_{k,n-1};\boldsymbol{x}_{k,n}^{b})\tag{5a}$$
$$\boldsymbol{w}_{k,n}=\sum_{\ell\in\mathcal{N}_{k}}a_{\ell k}\boldsymbol{\phi}_{\ell,n}\tag{5b}$$

At every iteration $n$, every agent $k$ samples $B$ data points and uses (5a) to update its iterate $\boldsymbol{w}_{k,n-1}$ to the intermediate value $\boldsymbol{\phi}_{k,n}$. Subsequently, the same agent combines the intermediate iterates from across its neighbors using (5b). The symbol $\mathcal{N}_{k}$ denotes the collection of neighbors of $k$. It is useful to remark that the centralized implementation (3) can be viewed as a special case of (5a)–(5b) if we select the combination matrix as $A=\pi\mathbbm{1}^{\sf T}$.
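The two diffusion steps can be sketched as follows for scalar models. This is an illustrative sketch only: the mini-batch gradients are passed in as precomputed numbers (an assumption that keeps the example deterministic), and the function name and inputs are hypothetical.

```python
def diffusion_step(w, grads, A, mu):
    """One adapt-then-combine iteration for scalar models.

    w[k]     : iterate of agent k at time n-1
    grads[k] : mini-batch gradient of agent k evaluated at w[k]
    A[l][k]  : combination weight a_{lk}, zero when l is not a neighbor of k,
               so summing over all l matches the sum over the neighborhood
    """
    K = len(w)
    phi = [w[k] - mu * grads[k] for k in range(K)]                      # adapt step
    return [sum(A[l][k] * phi[l] for l in range(K)) for k in range(K)]  # combine step


# Hypothetical two-agent example with A = pi 1^T and pi = [1/2, 1/2].
A = [[0.5, 0.5], [0.5, 0.5]]
w_next = diffusion_step([0.0, 2.0], [1.0, -1.0], A, mu=0.1)
```

With this fully averaging choice of $A$, both agents hold the same value after the combine step, which is consistent with the remark that the centralized recursion arises as a special case.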

In comparison, the consensus strategy involves the following steps:

$$\boldsymbol{\phi}_{k,n}=\sum_{\ell\in\mathcal{N}_{k}}a_{\ell k}\boldsymbol{w}_{\ell,n-1}\tag{6a}$$
$$\boldsymbol{w}_{k,n}=\boldsymbol{\phi}_{k,n}-\frac{\mu}{B}\sum_{b=1}^{B}\nabla_{w}Q_{k}(\boldsymbol{w}_{k,n-1};\boldsymbol{x}_{k,n}^{b})\tag{6b}$$

In this case, the existing iterates $\boldsymbol{w}_{\ell,n-1}$ are first combined to generate the intermediate value $\boldsymbol{\phi}_{k,n}$, after which (6b) is applied. Observe the asymmetry on the right-hand side of (6b): the starting iterate is $\boldsymbol{\phi}_{k,n}$, while the loss gradients are evaluated at the different iterate $\boldsymbol{w}_{k,n-1}$. In contrast, the same iterate $\boldsymbol{w}_{k,n-1}$ appears in both terms on the right-hand side of (5a). This symmetry has been shown to enlarge the stability range of diffusion implementations over their consensus counterparts in the convex case; namely, diffusion is mean-square stable for a wider range of step-sizes $\mu$ [45, 1, 7].
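For comparison, a consensus iteration can be sketched in the same style. As before, scalar models and precomputed gradients are assumptions made for illustration, and the names are hypothetical. Note that the gradient passed for agent $k$ is evaluated at its previous iterate and not at the combined value, which is the asymmetry described above.

```python
def consensus_step(w, grads, A, mu):
    """One combine-then-adapt iteration for scalar models.

    w[k]     : iterate of agent k at time n-1
    grads[k] : mini-batch gradient of agent k evaluated at w[k] (not at phi[k])
    A[l][k]  : combination weight a_{lk}, zero when l is not a neighbor of k
    """
    K = len(w)
    phi = [sum(A[l][k] * w[l] for l in range(K)) for k in range(K)]  # combine step
    return [phi[k] - mu * grads[k] for k in range(K)]                 # adapt step


# Same hypothetical two-agent inputs as one might use for a diffusion step.
A = [[0.5, 0.5], [0.5, 0.5]]
w_next = consensus_step([0.0, 2.0], [1.0, -1.0], A, mu=0.1)
```

In this example the gradient corrections are applied after the averaging and are therefore not averaged themselves, so the two agents end the iteration with different values even though $A$ averages fully.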

II-C Escaping efficiency from local minima

We first characterize the basin (or valley) of a local minimum:

Definition II.2.

(Basin of attraction [38].) For a given local minimizer $w^{\star}$, its basin of attraction $\Omega(w^{\star})$ (or the valley of $w^{\star}$) is defined as the set of all points starting from which $w_{k,n}$ or $w_{n}\to w^{\star}$ as $n\to\infty$ if the step size $\mu$ is sufficiently small and there is no noise in the gradient-descent algorithms.

$\square$

We illustrate the basin of attraction related to $w^{\star}$ in Figure 2, where $\Omega(w^{\star})=(w_{1},w_{2})$ is a bounded open set. Also, we use $\partial\Omega(w^{\star})$ to denote the boundary of $\Omega(w^{\star})$, according to which $\partial\Omega(w^{\star})=\{w_{1},w_{2}\}$ in Figure 2. If $w_{k,n}$ or $w_{n}\notin\Omega(w^{\star})$, then we say the algorithm escapes from $w^{\star}$ exactly.
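The definition of a basin can be checked numerically on a toy problem. The sketch below runs noise-free gradient descent on a hypothetical one-dimensional double-well risk $J(w)=w^{4}/4-w^{2}/2$, which has local minimizers at $w=\pm 1$ separated by a stationary point at $w=0$. The risk, step size, and iteration count are assumptions chosen for illustration.

```python
def grad(w):
    # Gradient of the assumed double-well risk J(w) = w^4 / 4 - w^2 / 2.
    return w ** 3 - w


def noiseless_gd(w0, mu=0.01, steps=5000):
    """Run gradient descent without gradient noise from the starting point w0."""
    w = w0
    for _ in range(steps):
        w -= mu * grad(w)
    return w


# Starting points on either side of w = 0 settle at different minimizers,
# so they belong to different basins of attraction.
right = noiseless_gd(0.3)
left = noiseless_gd(-0.3)
```

In this toy landscape the point $w=0$ plays the role of a boundary point: an iterate that crosses it has left the valley of one minimizer and entered the valley of the other.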

Figure 2: An illustration of the valley around local minimizer $w^{\star}$.

To analyze the escaping behavior of learning algorithms from local minima, it is necessary to quantify their escaping ability. Inspired by the continuous-time definition of escaping efficiency for the single-agent case from [34], and by the metrics used to assess the optimization performance of stochastic learning algorithms [6, 7], we use the following metric to measure the discrete-time escaping efficiency for general non-convex environments.

Definition II.3.

(Escaping efficiency). Assuming that all agents start from points close to a local minimizer $w^{\star}$, the escaping efficiency over the network at iteration $n$ is defined by:

$$\mathrm{ER}_{n}\overset{\Delta}{=}\frac{1}{K}\sum_{k=1}^{K}\mathds{E}J(\boldsymbol{w}_{k,n})-J(w^{\star})\tag{7}$$

The larger this value is, the higher the escaping efficiency away from $w^{\star}$ will be. If the algorithm ultimately converges to $w^{\star}$, then the larger $\mathrm{ER}_{n}$ is, the worse the optimization performance will be.

$\square$

$\mathrm{ER}_{n}$ assesses the average excess-risk value across the network, and it quantifies the deviation between the current model at iteration $n$ and the local minimum. For a fixed $n$, a larger value of $\mathrm{ER}_{n}$ indicates an expected faster escape from the local minimum (since the iterate will be further away from it). To determine whether an algorithm escapes from a local basin or not, earlier studies [46, 47, 48] have established various criteria based on the analysis methods they use. They nevertheless generally adhered to the principle of the minimum effort to reach the boundary of the local basin. In a similar vein, we introduce the risk barrier, which is defined as the infimum of the risk gap between the boundary points and the local minimum:

$$h\overset{\Delta}{=}\inf_{w\in\partial\Omega(w^{\star})}\left\{J(w)-J(w^{\star})\right\} \tag{8}$$

For example, the risk barrier of $w^{\star}$ in Figure 2 is $J(w_{2})-J(w^{\star})$. When $\mathrm{ER}_{n}$ exceeds the corresponding risk barrier defined in (8), namely,

$$\mathrm{ER}_{n}\geq h \tag{9}$$

then we say that the network escapes from $w^{\star}$ on average. If, on the other hand, $\boldsymbol{w}_{k,n}$ gets stuck around (or converges to) $w^{\star}$, then $\mathrm{ER}_{n}$ in that case would serve as a measure of the resulting optimization performance [6, 7].
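As a rough numerical illustration of definitions (7) to (9), the sketch below estimates the escaping efficiency by a sample average over agents and Monte Carlo runs, and compares it with a risk barrier. The one-dimensional double-well risk `toy_risk`, the helper names, and the iterate values are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def toy_risk(w):
    """Hypothetical 1-D double-well risk with local minima at w = -1 and w = +1."""
    return (w**2 - 1.0)**2

def escaping_efficiency(risk, iterates, w_star):
    """Sample estimate of ER_n in (7).

    iterates has shape (runs, K): one row per Monte Carlo run, one column per
    agent. The mean over runs stands in for the expectation.
    """
    iterates = np.asarray(iterates, dtype=float)
    per_agent = risk(iterates).mean(axis=0)    # approximates E J(w_{k,n}) for each agent k
    return per_agent.mean() - risk(w_star)     # average over the K agents, minus J(w*)

def has_escaped(er_n, barrier):
    """Escape criterion (9): ER_n >= h."""
    return er_n >= barrier

w_star = 1.0
# The basin of w* = 1 has the single boundary point w = 0, so h = J(0) - J(1).
h = toy_risk(0.0) - toy_risk(w_star)

near = np.array([[0.9, 1.0, 1.1], [1.05, 0.95, 1.0]])   # iterates still close to w*
at_boundary = np.zeros((2, 3))                           # iterates sitting on the basin boundary

er_near = escaping_efficiency(toy_risk, near, w_star)
er_boundary = escaping_efficiency(toy_risk, at_boundary, w_star)
print(er_near, has_escaped(er_near, h))
print(er_boundary, has_escaped(er_boundary, h))
```

In this toy setting the iterates near $w^{\star}$ give a small excess risk that stays below the barrier, whereas iterates on the basin boundary reach it.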

II-D Flatness metrics

Next, we formally characterize the notion of flat minima. Consider a local minimum $w^{\star}$ and a small drift $\Delta w$ around it. The change in the risk value can be approximated by

$$J(w^{\star}+\Delta w)-J(w^{\star})\approx\|\Delta w\|^{2}_{H^{\star}} \tag{10}$$

where $H^{\star}$ refers to the (positive definite) Hessian matrix of $J(w)$ at $w^{\star}$. If the above change in the risk value is small, then we refer to $w^{\star}$ as a flat minimum. Otherwise, if the change in the risk value is large, then we refer to $w^{\star}$ as a sharp minimum. Motivated by (10), various metrics related to the Hessian matrix are applied in the literature to measure the flatness of local minima, e.g., the spectral norm $\|H^{\star}\|_{2}$ [33, 43], the Frobenius norm $\|H^{\star}\|_{F}$ [33, 44] and the trace $\mathrm{Tr}(H^{\star})$ [49, 33, 50]. In this paper, we use the trace of the Hessian as a flatness measure, but the other two metrics may also be applied due to the norm equivalence, namely:

$$\frac{1}{M}\mathrm{Tr}(H^{\star})\leq\|H^{\star}\|_{2}\leq\|H^{\star}\|_{F}\leq\mathrm{Tr}(H^{\star})\leq M\|H^{\star}\|_{2} \tag{11}$$

Note that for highly ill-conditioned minima, which are commonly observed in over-parameterized neural networks [34, 51], the $d$ largest eigenvalues of $H^{\star}$ dominate the remaining $M-d$ eigenvalues, where $d\ll M$. In this case, the norm equivalence in (11) can be more tightly guaranteed since now the $M$ in the upper and lower bounds can be approximately replaced by $d$. More insights related to the flatness measure $\mathrm{Tr}(H^{\star})$ can be found in [49, 50].
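The chain of inequalities in (11), and the tightening that occurs for ill-conditioned Hessians, can be checked numerically. The following is a minimal sketch under assumed values: the matrix `H`, its eigenvalues, and the choice $M=6$, $d=2$ are hypothetical and serve only to illustrate the three flatness metrics.

```python
import numpy as np

def flatness_metrics(H):
    """Return (trace, spectral norm, Frobenius norm) of a symmetric PSD matrix H."""
    eigvals = np.linalg.eigvalsh(H)            # real eigenvalues in ascending order
    trace = eigvals.sum()
    spectral = eigvals.max()                   # equals ||H||_2 when H is PSD
    frobenius = np.sqrt((eigvals**2).sum())    # equals ||H||_F when H is symmetric
    return trace, spectral, frobenius

# Hypothetical ill-conditioned Hessian: d = 2 dominant eigenvalues out of M = 6.
M, d = 6, 2
eigs = np.array([10.0, 8.0, 1e-3, 1e-3, 1e-3, 1e-3])
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))   # random orthogonal basis
H = Q @ np.diag(eigs) @ Q.T
H = (H + H.T) / 2                                   # remove round-off asymmetry

tr, spec, fro = flatness_metrics(H)
tol = 1e-9
chain_holds = (tr / M <= spec + tol and spec <= fro + tol
               and fro <= tr + tol and tr <= M * spec + tol)
# With d dominant eigenvalues, Tr(H) is bounded by d * ||H||_2 up to the small tail.
tail = eigs[d:].sum()
tight_upper_holds = tr <= d * spec + tail + tol
print(tr, spec, fro, chain_holds, tight_upper_holds)
```

Here the trace is within a factor of about $d$ of the spectral norm, rather than a factor of $M$, which is the sense in which $M$ can be approximately replaced by $d$.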

III Escaping efficiency of multi-agent learning

Motivated by the definitions in the last section, we now examine the $\mathrm{ER}_{n}$ performance of decentralized and centralized methods.

III-A Modeling conditions

We introduce the following commonly used assumptions. First, since our purpose is to understand the behavior of the algorithms around the same local minimum $w^{\star}$, we assume all agents start from points near $w^{\star}$.

Assumption III.1.

(Starting points of all agents). All models start their updates from initial points that are sufficiently close to $w^{\star}$, namely,

$$\mathbb{E}\|\boldsymbol{w}_{k,-1}-w^{\star}\|^{4}\leq o\left(\frac{\mu^{2}}{B^{2}}\right) \tag{12}$$

□

By using Jensen’s inequality on (12), we have

$$\mathbb{E}\|\boldsymbol{w}_{k,-1}-w^{\star}\|^{2}\leq\left(\mathbb{E}\|\boldsymbol{w}_{k,-1}-w^{\star}\|^{4}\right)^{\frac{1}{2}}\leq o\left(\frac{\mu}{B}\right) \tag{13}$$

Since we will establish that mini-batch gradient descent algorithms approach an $O(\frac{\mu}{B})$ neighborhood of $w^{\star}$, it is reasonable to assume all agents start from the $o(\frac{\mu}{B})$ neighborhood, which is dominated by $O(\frac{\mu}{B})$. This assumption is weaker than the one used in [34], where models should start exactly from $w^{\star}$.
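The Jensen step in (13) also holds for sample averages, which gives a quick sanity check. The sketch below is only illustrative: the array `errors` is a hypothetical set of draws of the initial error $\boldsymbol{w}_{k,-1}-w^{\star}$, and its scale and dimension are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical initial errors w_{k,-1} - w*: 1000 draws in dimension M = 5.
errors = 0.01 * rng.standard_normal((1000, 5))
sq_norms = np.sum(errors**2, axis=1)

second_moment = sq_norms.mean()                      # sample version of E||.||^2
fourth_moment_root = np.sqrt((sq_norms**2).mean())   # sample version of (E||.||^4)^(1/2)
jensen_holds = second_moment <= fourth_moment_root
print(second_moment, fourth_moment_root, jensen_holds)
```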

We further require the loss function $Q_{k}$ of all agents to be smooth [11, 21].

Assumption III.2.

(Smoothness condition). For each agent $k$, the gradient of $Q_{k}$ relative to $w$ is Lipschitz. Specifically, for any $w_{1},w_{2}\in\mathbb{R}^{M}$, it holds that:

$$\|\nabla Q_{k}(w_{2};x)-\nabla Q_{k}(w_{1};x)\|\leq L\|w_{2}-w_{1}\| \tag{14}$$

□

Next, consider the Hessian matrix of agent $k$ at $w^{\star}$ denoted by:

$$H_{k}^{\star}\overset{\Delta}{=}\nabla^{2}J_{k}(w^{\star})=\frac{1}{N_{k}}\sum_{i=1}^{N_{k}}\nabla^{2}Q_{k}(w^{\star};x_{k,i}) \tag{15}$$

and the global Hessian matrix of $J(w)$ at $w^{\star}$:

$$\bar{H}\overset{\Delta}{=}\frac{1}{K}\sum_{k=1}^{K}\nabla^{2}J_{k}(w^{\star})=\frac{1}{K}\sum_{k=1}^{K}H_{k}(w^{\star}) \tag{16}$$
Assumption III.3.

(Small Hessian disagreement). The local Hessian at $w^{\star}$ is sufficiently close to the global Hessian, namely,

$$\|H_{k}^{\star}-\bar{H}\|\leq\epsilon \tag{17}$$

with a small constant $\epsilon$. □

Assumption III.3 can be satisfied when the data heterogeneity among agents is sufficiently small. For example, suppose all agents observe independent and identically distributed data, and consider the stochastic Hessian defined by:

$$\tilde{H}=\mathbb{E}_{\boldsymbol{x}}\nabla^{2}Q(w;\boldsymbol{x}) \tag{18}$$

Then, by resorting to the matrix Bernstein inequality [52], for any $\epsilon>0$, we have:

$$\mathbb{P}\left(\|H_{k}-\tilde{H}\|\leq\frac{\epsilon}{2}\right)\geq 1-Me^{-\frac{3\epsilon^{2}N_{k}}{4L(3L+\epsilon)}} \tag{19}$$

Thus, when each agent collects a sufficient amount of data, all local Hessian matrices are guaranteed to be close to the stochastic Hessian with high probability, from which we get:

$$\begin{aligned}
\left\|H_{k}-\frac{1}{K}\sum_{\ell}H_{\ell}\right\| &\leq\sum_{\ell}\frac{1}{K}\|H_{k}-H_{\ell}\| \\
&\leq\frac{1}{K}\sum_{\ell}\left(\|H_{k}-\tilde{H}\|+\|H_{\ell}-\tilde{H}\|\right) \\
&\leq\epsilon
\end{aligned} \tag{20}$$

Intuitively, even when all agents independently collect data from different distributions, if the data distributions among agents are sufficiently close to each other, then (III-A) can still be satisfied.
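To make the Hessian disagreement in (17) and the first step of the bound in (20) concrete, the sketch below builds local Hessians from i.i.d. data and compares the direct disagreement with the pairwise-average bound. It assumes a least-squares loss, for which the local Hessian is the sample covariance of the features and does not depend on $w$. The loss, the helper names, and the sizes `K`, `M`, `N_k` are hypothetical choices for illustration.

```python
import numpy as np

def local_hessian(features):
    """Hessian of J_k(w) = (1 / 2N_k) * sum_i (gamma_i - h_i^T w)^2,
    i.e. (1 / N_k) * sum_i h_i h_i^T (independent of w)."""
    return features.T @ features / features.shape[0]

def disagreement_and_bound(hessians):
    """Per-agent ||H_k - H_bar||_2 and the bound (1/K) * sum_l ||H_k - H_l||_2."""
    K = len(hessians)
    H_bar = np.mean(hessians, axis=0)
    direct = np.array([np.linalg.norm(Hk - H_bar, 2) for Hk in hessians])
    bound = np.array([sum(np.linalg.norm(Hk - Hl, 2) for Hl in hessians) / K
                      for Hk in hessians])
    return direct, bound

rng = np.random.default_rng(3)
K, M = 4, 3
results = {}
for N_k in (50, 5000):   # few versus many samples per agent
    hessians = [local_hessian(rng.standard_normal((N_k, M))) for _ in range(K)]
    results[N_k] = disagreement_and_bound(hessians)

for N_k, (direct, bound) in results.items():
    print(N_k, direct.max(), bound.max())
```

The printed disagreement can be compared across the two sample sizes; by (19) one would expect it to shrink as each agent collects more data.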

III-B Properties of gradient noise

To enable the analysis, we describe properties associated with the gradient noise process for later use. For any $w\in\mathbb{R}^{M}$, the stochastic gradient noise at agent $k$ at iteration $n$ is defined by the difference:

$$\boldsymbol{s}_{k,n}(w)\overset{\Delta}{=}\nabla Q_{k}(w;\boldsymbol{x}_{k,n})-\nabla J_{k}(w) \tag{21}$$

where the term stochastic refers to the case $B=1$ (i.e., batch size equal to 1). In the mini-batch case, the gradient noise is instead given by:

$$\boldsymbol{s}_{k,n}^{B}(w)\overset{\Delta}{=}\frac{1}{B}\sum_{b=1}^{B}\nabla Q_{k}(w;\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(w) \tag{22}$$

We denote the covariance matrix of the gradient noise by:

$$R_{s,k,n}(w)\overset{\Delta}{=}\mathbb{E}\left\{\boldsymbol{s}_{k,n}(w)\boldsymbol{s}_{k,n}(w)^{\mathsf{T}}\right\} \tag{23}$$

which is symmetric and non-negative definite. Likewise, for the mini-batch case,

$$R_{s,k,n}^{B}(w)\overset{\Delta}{=}\mathbb{E}\left\{\boldsymbol{s}^{B}_{k,n}(w)\boldsymbol{s}^{B}_{k,n}(w)^{\mathsf{T}}\right\} \tag{24}$$
Lemma III.4.

(Gradient noise terms). Let $\mathcal{F}_{n-1}$ denote the filtration generated by the past history of the random process $\boldsymbol{w}_{k,j}$ for all $j\leq n-1$ and $k=1,\ldots,K$. For any $\boldsymbol{w}\in\mathcal{F}_{n-1}$, we define the error vector

$$\tilde{\boldsymbol{w}}\overset{\Delta}{=}w^{\star}-\boldsymbol{w} \tag{25}$$

Then, under Assumption III.2, it holds that the gradient noise defined in (22) has zero mean:

$$\mathbb{E}\left[\boldsymbol{s}_{k,n}^{B}\mid\mathcal{F}_{n-1}\right]=0 \tag{26}$$

while its second and fourth-order moments are upper bounded by terms related to the batch size as follows:

$$\mathbb{E}[\|\boldsymbol{s}_{k,n}^{B}(\boldsymbol{w})\|^{2}\mid\mathcal{F}_{n-1}]\leq\alpha_{2}\|\tilde{\boldsymbol{w}}\|^{2}+\beta_{2}^{2} \tag{27}$$
$$\mathbb{E}[\|\boldsymbol{s}_{k,n}^{B}(\boldsymbol{w})\|^{4}\mid\mathcal{F}_{n-1}]\leq\alpha_{4}\|\tilde{\boldsymbol{w}}\|^{4}+\beta_{4}^{2} \tag{28}$$

where the nonnegative scalars $\{\alpha_{2},\beta_{2}^{2},\alpha_{4},\beta_{4}^{2}\}$ are on the order of

$$\alpha_{2}=O\left(\frac{1}{B}\right),\;\;\;\beta_{2}^{2}=O\left(\frac{1}{B}\right) \tag{29}$$
$$\alpha_{4}=O\left(\frac{1}{B^{2}}\right),\;\;\beta_{4}^{2}=O\left(\frac{1}{B^{2}}\right) \tag{30}$$

Moreover, the covariance matrices (23) and (24) are scaled versions of each other:

$$R_{s,k,n}^{B}(w)=\frac{1}{B}R_{s,k,n}(w) \tag{31}$$
Proof.

See Appendix A. ∎
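The scaling relation (31) can be verified exactly on a small example by enumerating every possible mini-batch. The sketch below is not the argument of Appendix A. It assumes the $B$ samples are drawn uniformly and independently with replacement from a finite set of per-sample gradients; the array `grads` and the function name are hypothetical.

```python
import itertools
import numpy as np

def noise_covariance(grads, B):
    """Exact covariance of the mini-batch gradient noise (22) when the B samples
    are drawn uniformly with replacement from the rows of `grads`."""
    N, M = grads.shape
    full_grad = grads.mean(axis=0)                 # plays the role of grad J_k(w)
    R = np.zeros((M, M))
    for batch in itertools.product(range(N), repeat=B):   # all N**B equally likely batches
        s = grads[list(batch)].mean(axis=0) - full_grad   # noise realization for this batch
        R += np.outer(s, s)
    return R / N**B

rng = np.random.default_rng(2)
grads = rng.standard_normal((5, 3))   # hypothetical per-sample gradients at a fixed w
R1 = noise_covariance(grads, B=1)     # covariance (23)
R3 = noise_covariance(grads, B=3)     # covariance (24) with B = 3
scaling_holds = np.allclose(R3, R1 / 3)
print(scaling_holds, np.trace(R1), np.trace(R3))
```

Taking traces shows the same $1/B$ factor in the second-order moment of the noise, which is consistent with the orders stated in (29).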

We can further relate the noise covariance matrix at $w^{\star}$ and the Hessian matrix $\bar{H}$. In the single-agent case, when $B=1$ and negative log-likelihood losses are used, it is well-known that there is an exact equivalence between the Hessian and the gradient covariance matrices at local minima in stochastic risk minimization [34], and this relationship can also be approximately used in the context of empirical risk minimization [34, 53, 39]. By log-likelihood losses we mean choosing $Q_{k}(w;\boldsymbol{x}_{k})$ to be of the form

$$Q_{k}(w;\boldsymbol{x}_{k})=-\log p_{k}(\boldsymbol{\gamma}_{k}\mid\boldsymbol{h}_{k};w) \tag{32}$$

where $\boldsymbol{h}_{k}$ and $\boldsymbol{\gamma}_{k}$ represent the feature vector and label associated with $\boldsymbol{x}_{k}$. In the multi-agent case, we define the gradient covariance matrix of the global risk function at $w^{\star}$ for the case $B=1$ as:

$$\bar{R}_{s}\overset{\Delta}{=}\mathbb{E}\left\{\left(\frac{1}{K}\sum_{k=1}^{K}\boldsymbol{s}_{k,n}(w^{\star})\right)\left(\frac{1}{K}\sum_{\ell=1}^{K}\boldsymbol{s}_{\ell,n}(w^{\star})\right)^{\mathsf{T}}\right\} \tag{33}$$
Lemma III.5.

(Relation to Hessian matrix). Under Assumption II.1, if negative log-likelihood losses are used by all agents, it holds that

$$\bar{R}_{s}\approx\frac{1}{K}\bar{H} \tag{34}$$
Proof.

See Appendix B. ∎

Lemma III.5 establishes the approximate equivalence between $\bar{R}_{s}$ and $\frac{1}{K}\bar{H}$ in the context of empirical risk minimization. Since negative log-likelihood losses, e.g., mean-square loss and cross-entropy loss, are popular in the machine learning community, we mainly focus on this category in this paper, but our framework can also be applied to other losses.
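A scalar example may help build intuition for (34). The sketch below is not the proof in Appendix B. It considers the stochastic-risk setting with a hypothetical logistic model, assumes that the labels follow the model at $w^{\star}$, and assumes that the gradient noise is independent across the $K$ agents, so that the variance of the averaged noise is the single-agent variance divided by $K$. All names and numerical values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian_and_noise_variance(w_star):
    """Scalar logistic model with loss Q(w; gamma) = -log p(gamma; w), gamma in {0, 1},
    where p(gamma = 1; w) = sigmoid(w). Labels are assumed to follow the model at w*."""
    p = sigmoid(w_star)
    # grad Q = sigmoid(w) - gamma, and the Hessian of Q is sigmoid(w) * (1 - sigmoid(w)).
    hessian = p * (1.0 - p)
    # The expected gradient at w* is zero, so the noise variance is E (p - gamma)^2,
    # with gamma = 1 with probability p and gamma = 0 with probability 1 - p.
    noise_var = p * (p - 1.0)**2 + (1.0 - p) * p**2
    return hessian, noise_var

K = 8
H_bar, R_single = hessian_and_noise_variance(w_star=0.7)
# Independent noise across the K agents (an assumption of this sketch):
R_bar_s = R_single / K
print(R_bar_s, H_bar / K)
```

In this idealized setting the relation holds with equality; for empirical risks it is only approximate, as stated in Lemma III.5.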

III-C Network performance analysis

We now proceed with the network analysis, which will enable us to derive expressions for the value of $\mathrm{ER}_{n}$. We will then use these expressions to deduce properties about the behavior of the centralized and decentralized algorithms in relation to flat minima and escaping efficiency. To begin with, we follow the decomposition from [6] and note that the combination matrix $A$ admits an eigen-decomposition of the form:

$$A=VPV^{\mathsf{T}} \tag{35}$$

where the matrices V𝑉Vitalic_V and P𝑃Pitalic_P are:

$$V=\left[\frac{1}{\sqrt{K}}\mathbbm{1}\quad V_{\alpha}\right],\quad P=\left[\begin{array}{cc}1&0\\ 0&P_{\alpha}\end{array}\right] \tag{36}$$

Here, $P_{\alpha}\in\mathbb{R}^{(K-1)\times(K-1)}$ is a diagonal matrix whose diagonal entries run from the second largest-magnitude eigenvalue $\lambda_{2}$ to the smallest-magnitude eigenvalue $\lambda_{K}$ of $A$, and $V_{\alpha}\in\mathbb{R}^{K\times(K-1)}$. Consider the extended network version of the combination policy:

$$\mathcal{A}\overset{\Delta}{=}A\otimes I_{M} \tag{37}$$

where $\otimes$ denotes the Kronecker product, and $I_{M}$ is the identity matrix of size $M$. Then, $\mathcal{A}$ satisfies

$$\mathcal{A}=\mathcal{V}\mathcal{P}\mathcal{V}^{\mathsf{T}} \tag{38}$$

where

$$\mathcal{V}=V\otimes I_{M},\quad\mathcal{P}=P\otimes I_{M},\quad\mathcal{V}^{\mathsf{T}}=V^{\mathsf{T}}\otimes I_{M} \tag{39}$$
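The decompositions (35) to (39) can be reproduced numerically for a small network. The following is a minimal sketch that assumes a symmetric, doubly stochastic combination matrix on a ring; the function `ring_combination_matrix`, the self-weight, and the sizes `K` and `M` are hypothetical and are not prescribed by the paper.

```python
import numpy as np

def ring_combination_matrix(K, self_weight=0.5):
    """Hypothetical symmetric, doubly stochastic combination matrix on a ring."""
    A = self_weight * np.eye(K)
    off = (1.0 - self_weight) / 2.0
    for k in range(K):
        A[k, (k - 1) % K] += off   # left neighbor
        A[k, (k + 1) % K] += off   # right neighbor
    return A

K, M = 5, 2
A = ring_combination_matrix(K)

eigvals, eigvecs = np.linalg.eigh(A)        # A is symmetric, so eigh applies
order = np.argsort(-np.abs(eigvals))        # sort by decreasing magnitude
P = np.diag(eigvals[order])
V = eigvecs[:, order]
V_alpha, P_alpha = V[:, 1:], P[1:, 1:]      # blocks appearing in (36)

# Extended quantities from (37) and (39).
A_ext = np.kron(A, np.eye(M))
V_ext = np.kron(V, np.eye(M))
P_ext = np.kron(P, np.eye(M))

decomposition_ok = np.allclose(A, V @ P @ V.T)                  # (35)
first_vector_ok = np.allclose(np.abs(V[:, 0]), 1 / np.sqrt(K))  # first column is +/- (1/sqrt(K)) * ones
extended_ok = np.allclose(A_ext, V_ext @ P_ext @ V_ext.T)       # (38)
print(decomposition_ok, first_vector_ok, extended_ok)
```

The eigenvector returned by the solver is determined only up to sign, which is why the check on the first column uses absolute values.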

We further collect quantities from across the network into the block variables:

$$\begin{aligned}
\mathcal{H}_{n}&\overset{\Delta}{=}\mathrm{diag}\left\{H_{1,n}(\boldsymbol{w}_{1,n}),H_{2,n}(\boldsymbol{w}_{2,n}),\ldots,H_{K,n}(\boldsymbol{w}_{K,n})\right\}\\
\boldsymbol{s}_{n}^{B}&\overset{\Delta}{=}\mathrm{col}\left\{\boldsymbol{s}_{k,n}^{B}(\boldsymbol{w}_{k,n-1})\right\}\\
d&\overset{\Delta}{=}\mathrm{col}\left\{\nabla J_{k}(w^{\star})\right\}\\
\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}&=\mathrm{col}\left\{\tilde{\boldsymbol{w}}_{k,n}\right\}\overset{\Delta}{=}\mathrm{col}\left\{w^{\star}-\boldsymbol{w}_{k,n}\right\}
\end{aligned} \tag{40}$$

where colcol\mathrm{col}roman_col denotes a block column vector, and each Hessian matrix Hk,n(𝒘k,n)subscript𝐻𝑘𝑛subscript𝒘𝑘𝑛H_{k,n}(\boldsymbol{w}_{k,n})italic_H start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT ( bold_italic_w start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT ) is defined by

Hk,n(𝒘k,n)=Δ[012Jk(wt𝒘~k,n)𝑑t]subscript𝐻𝑘𝑛subscript𝒘𝑘𝑛Δdelimited-[]superscriptsubscript01superscript2subscript𝐽𝑘superscript𝑤𝑡subscript~𝒘𝑘𝑛differential-d𝑡\displaystyle H_{k,n}(\boldsymbol{w}_{k,n})\overset{\Delta}{=}\left[\int_{0}^{% 1}\nabla^{2}J_{k}(w^{\star}-t\tilde{\boldsymbol{w}}_{k,n})dt\right]italic_H start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT ( bold_italic_w start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT ) overroman_Δ start_ARG = end_ARG [ ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_J start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_w start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT - italic_t over~ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT ) italic_d italic_t ] (41)

Using (37)–(41), we can rewrite algorithms (3), (5a)–(5b), and (6a)–(6b) in the following unified form:

$$\widetilde{\boldsymbol{\mathcal{W}}}_n = \mathcal{A}_2(\mathcal{A}_1 - \mu\mathcal{H}_{n-1})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1} + \mu\mathcal{A}_2 d + \mu\mathcal{A}_2\boldsymbol{s}_n^B \tag{42}$$

where

$$\mathcal{A}_1 = A_1 \otimes I_M, \quad \mathcal{A}_2 = A_2 \otimes I_M \tag{43}$$

and the choices for the matrices $\{A_1, A_2\}$ depend on the nature of the algorithm. For instance, for the consensus algorithm we set

$$A_1 = A, \quad A_2 = I_K \tag{44}$$

while for diffusion we set

$$A_1 = I_K, \quad A_2 = A \tag{45}$$

Moreover, the centralized method can be viewed as a special case of diffusion for which

$$A_1 = I_K, \quad A_2 = A = \pi\mathbbm{1}^{\mathsf{T}} = \frac{1}{K}\mathbbm{1}\mathbbm{1}^{\mathsf{T}} \tag{46}$$
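The following minimal sketch illustrates how the single recursion (42) covers the three methods through the choices (44)–(46). It assumes scalar parameters ($M = 1$), quadratic local risks so that the Hessians are constant, a symmetric combination matrix, and zero gradient noise. The function names `combination_pair` and `step` and all numerical values are hypothetical and serve only as an illustration.

```python
def matvec(X, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in X]

def combination_pair(method, A):
    """Return (A1, A2) as in (44)-(46) for a K x K combination matrix A."""
    K = len(A)
    I_K = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    if method == "consensus":
        return A, I_K
    if method == "diffusion":
        return I_K, A
    if method == "centralized":
        return I_K, [[1.0 / K] * K for _ in range(K)]
    raise ValueError(f"unknown method: {method}")

def step(w_tilde, A1, A2, hess, d, noise, mu):
    """One iteration of (42) with M = 1 and constant Hessians hess[k]."""
    mixed = matvec(A1, w_tilde)
    # inner = (A1 - mu*H) w_tilde + mu*d + mu*s, computed entry by entry
    inner = [m - mu * h * w + mu * (dk + sk)
             for m, h, w, dk, sk in zip(mixed, hess, w_tilde, d, noise)]
    return matvec(A2, inner)

# Hypothetical setup: K = 2 agents, no gradient noise.
A = [[0.75, 0.25], [0.25, 0.75]]
hess = [1.0, 1.0]      # constant Hessians (quadratic risks, assumed)
d = [1.0, -1.0]        # local gradients at w*, summing to zero
mu = 0.1

final = {}
for method in ("centralized", "consensus", "diffusion"):
    A1, A2 = combination_pair(method, A)
    w_tilde = [0.0, 0.0]                  # start exactly at the local minimum
    for _ in range(200):
        w_tilde = step(w_tilde, A1, A2, hess, d, [0.0, 0.0], mu)
    final[method] = w_tilde
```

In this toy run the centralized iterate stays at zero because the averaged gradient vector vanishes, whereas consensus and diffusion settle at deviations of about $\pm 0.167$ and $\pm 0.091$, respectively, driven only by the heterogeneity vector $d$.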

Unfortunately, the dependence of $\mathcal{H}_{n-1}$ on $\widetilde{\boldsymbol{\mathcal{W}}}_{n-1}$ makes the analysis of (42) intractable. To overcome this challenge, and inspired by [11, 21, 6], we introduce the alternative block diagonal matrix

$$\mathcal{H} \overset{\Delta}{=} \mathrm{diag}\left\{H_1^\star, H_2^\star, \ldots, H_K^\star\right\} \tag{47}$$

where each $H_k^\star$ is the Hessian of $J_k$ evaluated at $w^\star$, i.e., $H_k^\star = \nabla^2 J_k(w^\star)$. Using $\mathcal{H}$ in place of $\mathcal{H}_{n-1}$, we replace recursion (42) by the following so-called short-term model:

$$\widetilde{\boldsymbol{\mathcal{W}}}_n' = \mathcal{A}_2(\mathcal{A}_1 - \mu\mathcal{H})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1}' + \mu\mathcal{A}_2 d + \mu\mathcal{A}_2\boldsymbol{s}_n^B \tag{48}$$

This approximation naturally raises the following question: how accurately does the short-term model in (48) approximate the true recursion in (42)? The following two lemmas answer this question by treating the decentralized and centralized methods separately over a finite time horizon.

Lemma III.6.

(Deviation bounds of decentralized methods). For a fixed small step size $\mu$ and local batch size $B$ such that

$$\frac{1}{B} \leq c\mu^{\eta} \tag{49}$$

where $c \ll \infty$ and $\eta \geq 0$, and under Assumptions II.1, III.1, and III.2, it can be verified for consensus and diffusion that the second- and fourth-order moments of $\widetilde{\boldsymbol{\mathcal{W}}}_n$ are upper bounded over a finite number of iterations $n \leq O(\frac{1}{\mu})$, namely,

$$\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^2 \leq O\left(\frac{\mu}{B}\right) + O(\mu^2) = O(\mu^{\gamma}) \tag{50}$$
$$\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^4 \leq O(\mu^{2\gamma}) \tag{51}$$

where $\gamma = \min\{1+\eta, 2\}$. Also, the second-order moment of $\widetilde{\boldsymbol{\mathcal{W}}}_n'$ is upper bounded by

$$\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n'\|^2 \leq O(\mu^{\gamma}) \tag{52}$$

Moreover, the approximation error caused by the short-term model in (48) is upper bounded by

$$\left|\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n'\|^2 - \mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^2\right| \leq O(\mu^{1.5\gamma}) \tag{53}$$
Proof.

See Appendices C, D and E. ∎

Lemma III.7.

(Deviation bounds of the centralized method). Under the same conditions as in Lemma III.6, the second- and fourth-order moments of $\widetilde{\boldsymbol{\mathcal{W}}}_n$, the second-order moment of $\widetilde{\boldsymbol{\mathcal{W}}}_n'$, and the approximation error of the short-term model for the centralized method are upper bounded over a finite number of iterations $n \leq O(\frac{1}{\mu})$. Specifically, it holds that

$$\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^2 \leq O(\mu^{1+\eta}) \tag{54}$$
$$\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n'\|^2 \leq O(\mu^{1+\eta}) \tag{55}$$
$$\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^4 \leq O(\mu^{2(1+\eta)}) \tag{56}$$

and

$$\left|\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n'\|^2 - \mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^2\right| \leq O(\mu^{1.5(1+\eta)}) \tag{57}$$
Proof.

See Appendices C, D and E. ∎

Comparing the bounds in Lemmas III.6 and III.7, we find that the upper bounds on the second-order moments associated with the decentralized methods incorporate extra $O(\mu^2)$ terms, which are generated by the network heterogeneity and graph structure. We will discuss the effects of the $O(\mu^2)$ terms on the escaping efficiency later. Moreover, Lemmas III.6 and III.7 demonstrate that $\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^2$ and $\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n'\|^2$ dominate $\left|\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n'\|^2 - \mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^2\right|$ for all methods when $\mu$ is sufficiently small. In other words, the gap between the mean-square deviations of the short-term and true models is negligible relative to the mean-square deviation of the true model itself.
Furthermore, we can manipulate the bounds in Lemmas III.6 and III.7 to obtain an expression for $\mathrm{ER}_n$. Using (7) and (10), we can approximate $\mathrm{ER}_n$ as follows:

$$\begin{aligned}
\mathrm{ER}_n &= \frac{1}{2K}\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^2_{I\otimes\bar{H}} + o(\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^2) \\
&= \frac{1}{2K}\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n'\|^2_{I\otimes\bar{H}} + o(\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_n\|^2)
\end{aligned} \tag{58}$$

which means that $\mathrm{ER}_n$ can be evaluated by means of the short-term recursion. Thus, the original recursion in (42) can be replaced by (48) in the small step-size regime. More details about the equality (58) can be found in Appendix F.

Building upon the results of Lemmas III.6 and III.7, we next analyze the escaping efficiency of the three algorithms by using (48), and establish the following theorem.

Theorem III.8.

(Escaping efficiency of distributed algorithms). Consider a network of agents running distributed algorithms covered by the short-term model (48). Under Assumptions II.1, III.1, III.2, and III.3, and after $n$ iterations with

$$n \leq O\left(\frac{1}{\mu}\right) \tag{59}$$

it holds that

$$\mathrm{ER}_{n,cen} = \frac{\mu}{B}e(n) + o(\mu^{1+\eta}) \tag{60}$$
$$\mathrm{ER}_{n,con} = \frac{\mu}{B}e(n) + \mu^2 f_{con}(n) \pm o(\mu^2) \pm o(\mu^{1+\eta}) \tag{61}$$
$$\mathrm{ER}_{n,dif} = \frac{\mu}{B}e(n) + \mu^2 f_{dif}(n) \pm o(\mu^2) \pm o(\mu^{1+\eta}) \tag{62}$$

where

$$e(n) = \frac{1}{4K}\mathrm{Tr}\left(\left(I - (I - \mu\bar{H})^{2(n+1)}\right)\bar{H}\right) \tag{63}$$
$$f_{con}(n) = \frac{1}{2K}\left\|d^{\mathsf{T}}\mathcal{V}_{\alpha}(I - \mathcal{P}_{\alpha})^{-1}(I - \mathcal{P}_{\alpha}^{n+1})\right\|^2_{I\otimes\bar{H}} \tag{64}$$
$$f_{dif}(n) = \frac{1}{2K}\left\|d^{\mathsf{T}}\mathcal{V}_{\alpha}\mathcal{P}_{\alpha}(I - \mathcal{P}_{\alpha})^{-1}(I - \mathcal{P}_{\alpha}^{n+1})\right\|^2_{I\otimes\bar{H}} \tag{65}$$

and $\mathrm{ER}_{n,cen}$, $\mathrm{ER}_{n,con}$, and $\mathrm{ER}_{n,dif}$ represent the excess risk of the centralized, consensus, and diffusion methods at iteration $n$, respectively.

Proof.

See Appendix F. ∎
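As an illustration of how the common term in (60)–(62) behaves, the sketch below evaluates $e(n)$ in (63) through the eigenvalues of $\bar{H}$. This is valid when $\bar{H}$ is symmetric, in which case the trace equals the sum of $(1-(1-\mu\lambda_i)^{2(n+1)})\lambda_i$ over the eigenvalues $\lambda_i$. The function names and the numerical values below are hypothetical and chosen only for illustration.

```python
def e_of_n(eigvals, mu, n, K):
    """Evaluate (63) from the eigenvalues of a symmetric H_bar."""
    return sum((1.0 - (1.0 - mu * lam) ** (2 * (n + 1))) * lam
               for lam in eigvals) / (4.0 * K)

def er_centralized_leading(eigvals, mu, n, K, B):
    """Leading term (mu / B) * e(n) of (60); the o(.) remainder is ignored."""
    return mu / B * e_of_n(eigvals, mu, n, K)

# Hypothetical values: two Hessian eigenvalues, K = 4 agents.
eigvals = [0.5, 2.0]
mu, K = 0.01, 4

e_values = [e_of_n(eigvals, mu, n, K) for n in (0, 10, 100)]
e_ceiling = sum(eigvals) / (4.0 * K)   # Tr(H_bar) / (4K), approached as n grows

er_small_batch = er_centralized_leading(eigvals, mu, 100, K, B=16)
er_large_batch = er_centralized_leading(eigvals, mu, 100, K, B=32)
```

Under these assumptions $e(n)$ grows with $n$ toward $\mathrm{Tr}(\bar{H})/(4K)$, and doubling the batch size halves the leading term of the centralized excess risk.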

Theorem III.8 characterizes the escaping efficiency of the three algorithms around a local minimum over a finite time horizon. For all three methods, a larger step size $\mu$ or a smaller batch size $B$ (i.e., larger gradient noise) enables higher escaping efficiency. Also, as mentioned in Appendix F, the algorithms require $O(\frac{1}{\mu})$ iterations to successfully exit the local basin. Thus, if the algorithms cannot leave the local minimum within $O(\frac{1}{\mu})$ iterations, we say that they cannot effectively escape the local basin and are therefore trapped in it. Furthermore, we observe that the network heterogeneity and graph structure implicitly introduce additional $O(\mu^2)$ terms into decentralized methods. To see this, note that for homogeneous networks where all agents share the same local minimum $w^\star$, we have

$$\nabla J_k(w^\star) = \nabla J(w^\star) = 0 \tag{66}$$

which makes $d = 0$. Meanwhile, in the centralized setting, we have $V_{\alpha} = 0$ and $P_{\alpha} = 0$. In both cases, the extra $O(\mu^2)$ terms vanish. On the other hand, in heterogeneous networks, the nonzero $O(\mu^2)$ terms lead to larger $\mathrm{ER}_n$ values for decentralized methods compared to the centralized strategy.

However, the difference between the centralized and decentralized methods is only significant in the large-batch regime. Specifically, if $B$ is not sufficiently large, so that $\eta < 1$, then $1+\eta < 2$ and the additional $O(\mu^2)$ terms in (61) and (62) are dominated by the $o(\mu^{1+\eta})$ terms associated with the approximation error due to (48) and the initialization condition in Assumption III.1. As the batch size $B$ increases, however, the effect of the gradient noise progressively decreases and the influence of the $O(\mu^2)$ terms becomes more prominent. In the extreme scenario where full-batch gradient descent is applied, the centralized method becomes completely noise-free, which means that there is no noise to help the algorithm escape from local minima. Nevertheless, the noisy terms related to network heterogeneity and graph structure continue to exist in decentralized approaches.
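The regime distinction above can be made concrete with a small numerical sketch. It takes $c = 1$ in (49), which is an assumption made here for illustration, and computes the largest admissible $\eta$ for a given pair $(\mu, B)$ together with $\gamma = \min\{1+\eta, 2\}$ from Lemma III.6. The function names are hypothetical.

```python
import math

def eta_from_batch(mu, B, c=1.0):
    """Largest eta >= 0 with 1/B <= c * mu**eta, for 0 < mu < 1.

    The choice c = 1 is an illustrative assumption, not a value from the paper.
    """
    if not 0.0 < mu < 1.0 or c * B < 1.0:
        raise ValueError("need 0 < mu < 1 and c * B >= 1")
    # 1/B <= c * mu**eta  <=>  eta <= log(c * B) / log(1 / mu)
    return math.log(c * B) / math.log(1.0 / mu)

def gamma_exponent(mu, B, c=1.0):
    """gamma = min{1 + eta, 2} as defined after (51)."""
    return min(1.0 + eta_from_batch(mu, B, c), 2.0)

mu = 0.01
gamma_small = gamma_exponent(mu, B=10)     # eta = 0.5, so 1 + eta < 2
gamma_large = gamma_exponent(mu, B=1000)   # eta = 1.5, so gamma saturates at 2
```

With $\mu = 0.01$, a batch size of $B = 10$ gives $\eta = 0.5$, so the gradient-noise term of order $\mu^{1.5}$ dominates the $O(\mu^2)$ terms, whereas $B = 1000$ gives $\eta = 1.5$ and the $O(\mu^2)$ terms become the leading contribution.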

When comparing the escaping efficiency of two methods from a local minimum, such as the centralized and diffusion methods, there are three possible cases on average: (1) both $\mathrm{ER}_{n,cen}$ and $\mathrm{ER}_{n,dif}$ exceed the risk barrier $h$, in which case both algorithms leave the basin around the local minimum; (2) $\mathrm{ER}_{n,dif}$ is larger than $h$ while $\mathrm{ER}_{n,cen}$ is smaller than $h$, in which case the network driven by diffusion escapes from the local minimum while the centralized algorithm remains stuck in the current basin; (3) both $\mathrm{ER}_{n,cen}$ and $\mathrm{ER}_{n,dif}$ are smaller than $h$, in which case the two methods remain trapped in the current basin. These three situations illustrate how higher escaping efficiency corresponds to a higher likelihood of escape from a local minimum. It is also clear that a decentralized method is more effective at escaping from a local minimum than the centralized method in the large-batch regime. However, larger values of $\mathrm{ER}_n$, while good for escaping efficiency, nevertheless worsen the optimization performance. We will discuss this point later in the steady state when $n \to \infty$.

We can further compare the escaping efficiency of diffusion and consensus, and show that the extra $\mathcal{P}_{\alpha}$ appearing in $\mathrm{ER}_{n,dif}$ leads to smaller excess risk for diffusion than for consensus.

Corollary III.9.

(Smaller $\mathrm{ER}_n$ for diffusion). Under the same conditions as in Theorem III.8, it holds that

$$\mathrm{ER}_{n,dif} \leq \mathrm{ER}_{n,con} \tag{67}$$
Proof.

See Appendix G. ∎

Corollary III.9 implies that consensus runs farther away from the local minimum than the diffusion strategy for the same number of iterations due to its worse ER performance. This translates into faster escape for consensus compared to diffusion.
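A rough numerical illustration of the comparison between (64) and (65) is sketched below for the scalar case $M = 1$. It assumes that $\mathcal{P}_{\alpha}$ is diagonal with entries strictly inside the unit interval and treats the entries of $d^{\mathsf{T}}\mathcal{V}_{\alpha}$ as given numbers. Both are assumptions made for this example, and the function names and values are hypothetical.

```python
def geometric_factor(lam, n):
    """(1 - lam)^(-1) * (1 - lam**(n + 1)), i.e. sum_{j=0}^{n} lam**j, |lam| < 1."""
    return (1.0 - lam ** (n + 1)) / (1.0 - lam)

def mode_terms(coeffs, lams, n, h_bar, K):
    """Scalar-case (M = 1) evaluation of f_con(n) in (64) and f_dif(n) in (65).

    coeffs[i] stands for the i-th entry of d^T V_alpha and lams[i] for the
    matching diagonal entry of P_alpha (both hypothetical inputs).
    """
    con = sum((c * geometric_factor(l, n)) ** 2 for c, l in zip(coeffs, lams))
    # Diffusion carries one extra factor of P_alpha, i.e. lams[i] per mode.
    dif = sum((c * l * geometric_factor(l, n)) ** 2 for c, l in zip(coeffs, lams))
    scale = h_bar / (2.0 * K)
    return scale * con, scale * dif

f_con, f_dif = mode_terms([1.0, -0.5], [0.5, -0.2], n=20, h_bar=1.0, K=3)
single_con, single_dif = mode_terms([1.0], [0.5], n=5, h_bar=1.0, K=2)
```

In this setting each mode of the diffusion term is scaled by the square of the corresponding entry of $\mathcal{P}_{\alpha}$, which is smaller than one, so `f_dif` cannot exceed `f_con`.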

IV Trade-off between flatness and optimization

In this section, we elaborate on the important trade-off between flatness and optimization.

Suppose we run a decentralized or centralized algorithm to solve a possibly nonconvex optimization problem. A useful question is where the algorithm would prefer to go if it escapes from the current basin. In other words, assuming that the algorithm escapes from the basin around some local minimum $w^\star$ and continues to evolve until it settles down in the basin of another local minimum, we would like to examine what preferential properties this second minimum has relative to the earlier one. To answer this question, we relate the escaping efficiency of the algorithms, measured by $\mathrm{ER}_n$, to the flatness of local minima, measured by $\mathrm{Tr}(\bar{H})$.

From expressions (60)–(62), and in the one-dimensional case where $\bar{H}$ is a scalar, it is clear that larger values of $\mathrm{Tr}(\bar{H})$, i.e., sharper minima, enable higher escaping efficiency for all three methods. Thus, the three algorithms are more likely to escape from sharp minima than from flat ones in the one-dimensional case. In higher dimensions, it is generally intractable to clarify the relationship between $\mathrm{ER}_{n}$ and $\mathrm{Tr}(\bar{H})$ directly. Fortunately, motivated by the Markov inequality, we can appeal to an upper bound for $\mathrm{ER}_{n}$, denoted by $\mathcal{U}_{n}$. Specifically, according to the Markov inequality [7], the probability that the excess risk value remains below a basis threshold $h$ satisfies:

$$\mathbb{P}\left(\frac{1}{K}\sum_{k=1}^{K}J(\boldsymbol{w}_{k,n})-J(w^{\star})\leq h\right)\geq 1-\frac{\mathrm{ER}_{n}}{h}\geq 1-\frac{\mathcal{U}_{n}}{h} \tag{68}$$

where, from (60)–(65), upper bounds for the dominant terms of the three methods can be derived:

$$\begin{aligned}
e(n) &= \frac{1}{4K}\mathrm{Tr}\left(\left(I-(I-\mu\bar{H})^{2(n+1)}\right)\bar{H}\right) \\
&= \frac{1}{4K}\sum_{i=1}^{M}\lambda_{i}(\bar{H})\left(1-(1-\mu\lambda_{i}(\bar{H}))^{2(n+1)}\right) \\
&\leq \frac{1}{4K}\sum_{i=1}^{M}\lambda_{i}(\bar{H})=\frac{1}{4K}\mathrm{Tr}(\bar{H})=\mathcal{U}\big(e(n)\big)
\end{aligned} \tag{69}$$

where $\lambda_{i}(\bar{H})$ represents the $i$-th eigenvalue of $\bar{H}$. Also,

$$\begin{aligned}
f_{con}(n) &= \frac{1}{2K}\left\|d^{\sf T}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}(I-\mathcal{P}_{\alpha}^{n+1})\right\|^{2}_{I\otimes\bar{H}} \\
&\leq \frac{1}{2}\lambda_{\max}(\bar{H})\left\|d^{\sf T}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}(I-\mathcal{P}_{\alpha}^{n+1})\right\|^{2} \\
&\leq \frac{1}{2}\mathrm{Tr}(\bar{H})\left\|d^{\sf T}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}(I-\mathcal{P}_{\alpha}^{n+1})\right\|^{2} \\
&= \mathcal{U}\big(f_{con}(n)\big)
\end{aligned} \tag{70}$$

and, similarly,

$$\begin{aligned}
f_{dif}(n) &= \frac{1}{2K}\left\|d^{\sf T}\mathcal{V}_{\alpha}\mathcal{P}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}(I-\mathcal{P}_{\alpha}^{n+1})\right\|^{2}_{I\otimes\bar{H}} \\
&\leq \frac{1}{2}\mathrm{Tr}(\bar{H})\left\|d^{\sf T}\mathcal{V}_{\alpha}\mathcal{P}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}(I-\mathcal{P}_{\alpha}^{n+1})\right\|^{2} \\
&= \mathcal{U}\big(f_{dif}(n)\big)
\end{aligned} \tag{71}$$
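The bounds in (69)–(71) can be sanity-checked numerically. The following is a minimal sketch, not the authors' code: the eigenvalues, step size, number of agents and the block vector `x_blocks` (which stands in for the row vector inside the norms in (70)–(71)) are all hypothetical, and $\bar{H}$ is assumed to be expressed in its eigenbasis so that it is diagonal. The sketch checks only the two eigenvalue steps; the source additionally replaces the factor $1/(2K)$ by $1/2$, which can only loosen the bound since $K\geq 1$.

```python
# Illustrative check of the bounds in (69)-(71); every number below is made up.

def e_n(eigs, mu, n, K):
    """Evaluate e(n) in (69) from the eigenvalues of H-bar."""
    total = sum(lam * (1.0 - (1.0 - mu * lam) ** (2 * (n + 1))) for lam in eigs)
    return total / (4.0 * K)

def e_upper(eigs, K):
    """Upper bound Tr(H-bar) / (4K) from (69)."""
    return sum(eigs) / (4.0 * K)

def weighted_sq_norm(x_blocks, eigs):
    """Squared norm of x weighted by I (x) H-bar, with H-bar diagonal (eigenbasis)."""
    return sum(lam * xi * xi for block in x_blocks for lam, xi in zip(eigs, block))

def sq_norm(x_blocks):
    """Plain squared Euclidean norm of the stacked vector."""
    return sum(xi * xi for block in x_blocks for xi in block)

eigs = [0.1, 0.5, 2.0]   # hypothetical nonnegative eigenvalues of H-bar
mu, K = 0.05, 4          # hypothetical step size and number of agents
x_blocks = [[1.0, -2.0, 0.5], [0.3, 0.0, -1.0], [2.0, 1.0, 1.0], [0.0, 0.7, -0.2]]

# e(n) grows with n but never exceeds Tr(H-bar) / (4K).
values = [e_n(eigs, mu, n, K) for n in (0, 10, 100, 1000)]
bound = e_upper(eigs, K)

# Weighted norm <= lambda_max * norm <= Tr(H-bar) * norm, as in (70)-(71).
lhs = weighted_sq_norm(x_blocks, eigs)
mid = max(eigs) * sq_norm(x_blocks)
rhs = sum(eigs) * sq_norm(x_blocks)

print(values, bound)
print(lhs, mid, rhs)
```

The second chain holds because all eigenvalues are nonnegative, so the largest eigenvalue is at most the trace.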

Then, the upper bound variables for the three methods are given by

\begin{align}
\mathcal{U}_{n,cen} &= \frac{\mu}{B}\mathcal{U}\big(e(n)\big) \tag{72} \\
\mathcal{U}_{n,con} &= \frac{\mu}{B}\mathcal{U}\big(e(n)\big)+\mu^{2}\mathcal{U}\big(f_{con}(n)\big) \tag{73} \\
\mathcal{U}_{n,dif} &= \frac{\mu}{B}\mathcal{U}\big(e(n)\big)+\mu^{2}\mathcal{U}\big(f_{dif}(n)\big) \tag{74}
\end{align}

Observe that these upper bounds on the escaping efficiency increase with the sharpness of the local minimum through $\mathrm{Tr}(\bar{H})$. Substituting (72)–(74) into (68) separately, we find that flat minima, where $\mathrm{Tr}(\bar{H})$ is small, provide high-probability guarantees that the algorithms stay around the corresponding local basin.
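To make the substitution of (72) into (68) concrete, the sketch below evaluates the resulting lower bound $1-\mathcal{U}_{n,cen}/h$ for a flat and a sharp minimum, and also checks the Markov inequality itself on synthetic nonnegative samples. All numerical values (traces, step size, batch size, number of agents, threshold) are hypothetical choices for illustration and do not come from the paper's experiments.

```python
import random

def markov_lower_bound(mean_value, h):
    """Lower bound on P(X <= h) for a nonnegative X with the given mean."""
    return max(0.0, 1.0 - mean_value / h)

def stay_probability_bound_cen(trace_H, mu, B, K, h):
    """Plug U_{n,cen} = mu * Tr(H-bar) / (4 B K), from (69) and (72), into (68)."""
    u_cen = mu * trace_H / (4.0 * B * K)
    return markov_lower_bound(u_cen, h)

# Empirical sanity check of the Markov inequality on synthetic samples.
rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(10_000)]
h = 3.0
empirical_mean = sum(samples) / len(samples)
empirical_prob = sum(s <= h for s in samples) / len(samples)

# Hypothetical flat and sharp minima under the same algorithm settings.
mu, B, K, h_risk = 0.1, 128, 16, 1e-4
flat = stay_probability_bound_cen(trace_H=5.0, mu=mu, B=B, K=K, h=h_risk)
sharp = stay_probability_bound_cen(trace_H=50.0, mu=mu, B=B, K=K, h=h_risk)

print(empirical_prob, markov_lower_bound(empirical_mean, h))
print(flat, sharp)
```

With these made-up values the guarantee for the flat minimum is nontrivial, while the bound for the sharp minimum is clipped to zero, i.e., it gives no guarantee of staying in the basin. Note that this is only a lower bound on the staying probability, not the probability itself.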

Combining the discussion of the one-dimensional and higher-dimensional cases, we conclude that the three algorithms prefer to stay around flat minima, which means that they favor flatter basins if they successfully escape from a current minimum. Moreover, as discussed before, higher escaping efficiency makes decentralized algorithms more likely to escape from a local minimum than the centralized method. This fact further indicates that decentralized methods favor flatter minima than their centralized counterpart. However, as already noted, higher escaping efficiency may deteriorate the optimization performance. This motivates us to analyze the excess-risk performance in the long run when $n\to\infty$, which corresponds to the optimization performance in the steady state.

Strictly speaking, the short-term model defined in (48) may not be valid for nonconvex risk functions in the long run. Fortunately, it has been rigorously verified that (48) can approximate well the true model (42) when $n\to\infty$ in the strongly convex case [6]. Since here we want to examine the convergence behavior of algorithms around local minima given that the algorithms are stuck in the current basin, we can resort to a strong convexity assumption around local minima, for which the following result can be guaranteed.

Corollary IV.1.

(Steady-state excess risks). Consider a network of agents running distributed algorithms covered by (48). After sufficient iterations such that $n\to\infty$, under Assumptions II.1, III.1, III.2 and III.3, and assuming the algorithms are already trapped in the basin of a local minimum $w^{\star}$ and $J(w)$ is locally strongly convex around $w^{\star}$, it holds that

\begin{align}
\mathrm{ER}_{\infty,cen} &= \frac{\mu}{4BK}\mathrm{Tr}\left(\bar{H}\right)\pm O\left(\mu^{1.5(1+\eta)}\right) \tag{75} \\
\mathrm{ER}_{\infty,con} &= \frac{\mu}{4BK}\mathrm{Tr}\left(\bar{H}\right)+\frac{\mu^{2}}{2K}\left\|d^{\sf T}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}\right\|^{2}_{I\otimes\bar{H}}\pm O(\mu^{1.5\gamma})\pm o(\mu^{2}) \tag{76} \\
\mathrm{ER}_{\infty,dif} &= \frac{\mu}{4BK}\mathrm{Tr}\left(\bar{H}\right)+\frac{\mu^{2}}{2K}\left\|d^{\sf T}\mathcal{V}_{\alpha}\mathcal{P}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}\right\|^{2}_{I\otimes\bar{H}}\pm O(\mu^{1.5\gamma})\pm o(\mu^{2}) \tag{77}
\end{align}

and, moreover,

$$\mathrm{ER}_{\infty,cen}\leq\mathrm{ER}_{\infty,dif}\leq\mathrm{ER}_{\infty,con} \tag{78}$$
Proof.

See Appendix H. ∎

In Corollary IV.1, the factors that help algorithms escape local minima are now seen to adversely affect the optimization performance. Again, the difference between decentralized and centralized methods is only significant in the large-batch regime. By integrating the findings of Theorem III.8 and Corollaries III.9 and IV.1, we deduce that network heterogeneity and graph structure inject additional noisy terms into decentralized methods, and these facilitate their escape from sharp minima compared to the centralized method. However, these added noisy terms cause some deterioration in the long-term optimization performance. Furthermore, although consensus exhibits higher escaping efficiency, this comes at the expense of reduced optimization performance. This observation reveals an intrinsic trade-off between flatness and optimization within the context of multi-agent learning.
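The remark about the large-batch regime can be read off the leading terms of (75)–(77): the first term scales as $1/B$, whereas the additional network term does not depend on $B$. The sketch below illustrates this scaling only. The argument `bias_sq_norm` is a hypothetical scalar standing in for the weighted squared norm in (76) or (77), and the remaining values are made up; the higher-order error terms are ignored.

```python
def steady_state_terms(trace_H, bias_sq_norm, mu, B, K):
    """Leading terms of (76)/(77): gradient-noise part and extra network part."""
    gradient_noise_term = mu * trace_H / (4.0 * B * K)
    network_term = mu ** 2 * bias_sq_norm / (2.0 * K)
    return gradient_noise_term, network_term

trace_H = 20.0        # hypothetical Tr(H-bar)
bias_sq_norm = 1.5    # hypothetical value of the weighted squared norm
mu, K = 0.05, 16      # hypothetical step size and number of agents

ratios = []
for B in (128, 256, 512):
    noise, network = steady_state_terms(trace_H, bias_sq_norm, mu, B, K)
    # Ratio of the extra decentralized term to the centralized leading term.
    ratios.append(network / noise)
    print(B, noise, network, network / noise)
```

The ratio equals $2\mu B$ times `bias_sq_norm` divided by the trace, so it grows linearly with the local batch size, which is consistent with the gap between decentralized and centralized methods becoming visible only for large batches.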

V Simulation results

In this section, we illustrate the performance of the three algorithms on the CIFAR10 and CIFAR100 datasets across different neural network architectures. We choose ResNet-18 [54], WideResNet-28-10 [55] and DenseNet-121 [56] as the base neural network structures for the two datasets. As for the graph structure of the multi-agent system, we use the Metropolis rule [6] to randomly generate a doubly-stochastic graph with 16 nodes, and its structure is shown in Figure 3(a). In the main text, we present the results for ResNet-18, while the results for the other two neural networks are included in Appendix J. In the decentralized experiments, the full training dataset is divided into $K$ subsets, and each agent can only observe one subset. The centralized setting is the same as traditional single-agent learning. Moreover, we simulate three different local batch sizes, namely $B=128, 256, 512$. Note that the distributed setting with local batch size $B$ means the global batch size is $KB$. As for the learning scheme, we rely on the linear decaying rule [54, 57], where the initial learning rate $0.2$ is divided by $10$ when the training process reaches $50\%$ and $75\%$ of the total number of epochs. That is, moderately small step sizes are applied first to search for flat minima, and then smaller ones are used to guarantee the stability (i.e., convergence) of the algorithms.
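For readers who wish to reproduce a setup of this kind, the following sketch builds Metropolis weights for a 16-node graph and implements a learning-rate schedule of the type described above. It is an illustration under stated assumptions rather than the authors' code: the way the random graph is generated (a ring plus random chords to guarantee connectivity), the number of extra edges, the seed, and the total number of epochs are all hypothetical. The Metropolis rule is taken in its standard form, where the weight between two neighbors is the reciprocal of the larger of their neighborhood sizes (each counting the agent itself).

```python
import random

def random_connected_graph(num_nodes, extra_edges, seed=0):
    """Ring plus random chords; the ring keeps the graph connected (sketch assumption)."""
    rng = random.Random(seed)
    edges = {tuple(sorted((k, (k + 1) % num_nodes))) for k in range(num_nodes)}
    while len(edges) < num_nodes + extra_edges:
        a, b = rng.sample(range(num_nodes), 2)
        edges.add((min(a, b), max(a, b)))
    return edges

def metropolis_weights(num_nodes, edges):
    """Combination matrix from the Metropolis rule; symmetric and doubly stochastic."""
    neighbors = [set() for _ in range(num_nodes)]
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = [len(nb) + 1 for nb in neighbors]   # neighborhood size, counting the agent itself
    A = [[0.0] * num_nodes for _ in range(num_nodes)]
    for k in range(num_nodes):
        for l in neighbors[k]:
            A[l][k] = 1.0 / max(n[k], n[l])
        A[k][k] = 1.0 - sum(A[l][k] for l in neighbors[k])
    return A

def step_size(epoch, total_epochs, initial=0.2):
    """Divide the initial rate by 10 at 50% and again at 75% of training."""
    if epoch >= 0.75 * total_epochs:
        return initial / 100.0
    if epoch >= 0.5 * total_epochs:
        return initial / 10.0
    return initial

edges = random_connected_graph(16, extra_edges=12, seed=0)
A = metropolis_weights(16, edges)
print([round(sum(row), 12) for row in A])
print([step_size(e, 200) for e in (0, 99, 100, 149, 150, 199)])
```

Because the Metropolis weights are symmetric and each column sums to one by construction, the rows sum to one as well, which gives the doubly-stochastic property.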

Figure 3: (a) The randomly-generated graph structure; (b) The ring structure.

We first illustrate the flatness and optimization performance of the three algorithms on CIFAR10 and CIFAR100. On the one hand, we visualize the loss landscape around the obtained models in Figure 4. To do so, we use the visualization method from [58], where for any model $w$ we compute $J(w+\alpha\boldsymbol{v})$ using random directions $\boldsymbol{v}$ that match the norm of $w$. It can be observed from Figure 4 that the decentralized methods arrive at models that are significantly flatter than the centralized models. However, the difference in flatness between diffusion and consensus is subtle. In other words, diffusion and consensus converge to models with similar flatness. On the other hand, the evolution of the training loss is shown in Figure 5, from which we observe that all three methods converge after sufficient iterations. Also, consensus consistently exhibits larger training loss in the steady state than diffusion, especially when the local batch size $B$ is 512, while the optimization performance of diffusion and centralized is comparable. Therefore, we conclude that diffusion enables flatter models than centralized without noticeably sacrificing optimization performance. Note that we also observe some loss spikes in the training loss curves of the centralized method in Figure 5. One explanation for this phenomenon is that the centralized models may be oscillating around a sharp minimum that is unstable when using moderately small step sizes. This suggests intuitively that decentralized methods aid in stabilizing the training process.
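The one-dimensional loss slices $J(w+\alpha\boldsymbol{v})$ can be sketched on a toy problem as follows. This is a simplified illustration and not the procedure of [58] itself: here the random direction is rescaled so that its whole-vector norm equals that of $w$, whereas the method in [58] may use a finer-grained normalization, and the quadratic losses, curvatures and model vector are hypothetical.

```python
import math
import random

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def random_direction_like(w, rng):
    """Gaussian direction rescaled to have the same norm as w."""
    v = [rng.gauss(0.0, 1.0) for _ in w]
    scale = norm(w) / norm(v)
    return [scale * x for x in v]

def loss_slice(loss, w, v, alphas):
    """Evaluate the loss along the line w + alpha * v."""
    return [loss([wi + a * vi for wi, vi in zip(w, v)]) for a in alphas]

def make_quadratic(curvatures, minimizer):
    """Toy loss with diagonal Hessian; larger curvatures mean a sharper minimum."""
    def loss(w):
        return 0.5 * sum(h * (wi - ci) ** 2 for h, wi, ci in zip(curvatures, w, minimizer))
    return loss

rng = random.Random(0)
w_star = [1.0, -2.0, 0.5, 3.0]                 # hypothetical trained model
v = random_direction_like(w_star, rng)
alphas = [i / 10.0 for i in range(-10, 11)]    # alpha from -1 to 1

flat_curve = loss_slice(make_quadratic([0.1, 0.2, 0.1, 0.3], w_star), w_star, v, alphas)
sharp_curve = loss_slice(make_quadratic([1.0, 2.0, 1.0, 3.0], w_star), w_star, v, alphas)

for a, f, s in zip(alphas, flat_curve, sharp_curve):
    print(f"{a:+.1f}  flat={f:.4f}  sharp={s:.4f}")
```

Along the same direction, the curve for the sharper toy loss rises faster away from $\alpha=0$, which is the visual signature used to compare flatness in plots of this type.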

Figure 4: Loss visualization around the obtained models with ResNet-18. Panels: (a) CIFAR10, B = 128; (b) CIFAR10, B = 256; (c) CIFAR10, B = 512; (d) CIFAR100, B = 128; (e) CIFAR100, B = 256; (f) CIFAR100, B = 512.
Figure 5: The evolution of training loss with ResNet-18: (a)–(c) CIFAR10; (d)–(f) CIFAR100. Panels: (a) B = 128; (b) B = 256; (c) B = 512; (d) B = 128; (e) B = 256; (f) B = 512.

We next compare the test accuracy of the three methods on CIFAR10 and CIFAR100. To better show the performance of the three methods across different settings, we also simulate the ring graph, shown in Figure 3(b), in addition to the random graph structure. Note that the ring graph can be viewed as a specific case of a random graph generated by the Metropolis rule. Thus, the random graph is more general. The simulation results are shown in Table I. In each configuration, we run the simulation 3 times with different seeds, and the reported results are the mean over the last 10 epochs across these repetitions. We observe from Table I that diffusion consistently outperforms the other two methods in each setting. In particular, when $B=512$, consensus even performs worse than centralized on the CIFAR100 dataset, which is attributed to the optimization issue demonstrated in Figure 5.
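The aggregation described above can be sketched as follows. The accuracy curves are synthetic placeholders, and the choice of reporting the spread as the sample standard deviation across runs is an assumption of this sketch; the text does not state how the $\pm$ values in Table I are computed.

```python
import statistics

def summarize_runs(accuracy_curves, last=10):
    """Average the last `last` epochs of each run, then aggregate across runs."""
    per_run = [statistics.fmean(curve[-last:]) for curve in accuracy_curves]
    mean = statistics.fmean(per_run)
    # Assumption: spread is the sample standard deviation across runs.
    spread = statistics.stdev(per_run) if len(per_run) > 1 else 0.0
    return mean, spread

# Synthetic per-epoch test accuracies (in %) for three seeds; not real results.
curves = [
    [50.0] * 5 + [90.0] * 10,
    [50.0] * 5 + [91.0] * 10,
    [50.0] * 5 + [92.0] * 10,
]
mean_acc, spread_acc = summarize_runs(curves, last=10)
print(f"{mean_acc:.2f} +/- {spread_acc:.2f}")
```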

TABLE I: Test accuracy of distributed algorithms on CIFAR10 and CIFAR100 under multiple configurations with ResNet-18.

| Dataset | Method | Graph | $128\times 16$ | $256\times 16$ | $512\times 16$ |
|---|---|---|---|---|---|
| CIFAR10 | Centralized | - | $91.06\pm 0.41\%$ | $90.36\pm 0.16\%$ | $89.14\pm 0.41\%$ |
| CIFAR10 | Consensus | Random | $92.11\pm 0.01\%$ | $91.48\pm 0.18\%$ | $90.22\pm 0.21\%$ |
| CIFAR10 | Consensus | Ring | $91.67\pm 0.15\%$ | $91.30\pm 0.04\%$ | $89.24\pm 0.40\%$ |
| CIFAR10 | Diffusion | Random | $\mathbf{92.77}\pm 0.01\%$ | $\mathbf{92.03}\pm 0.17\%$ | $\mathbf{91.49}\pm 0.08\%$ |
| CIFAR10 | Diffusion | Ring | $\mathbf{92.75}\pm 0.20\%$ | $\mathbf{92.27}\pm 0.09\%$ | $\mathbf{91.32}\pm 0.17\%$ |
| CIFAR100 | Centralized | - | $70.20\pm 0.40\%$ | $68.96\pm 0.12\%$ | $68.65\pm 0.37\%$ |
| CIFAR100 | Consensus | Random | $70.42\pm 0.59\%$ | $69.79\pm 0.14\%$ | $67.38\pm 0.45\%$ |
| CIFAR100 | Consensus | Ring | $70.45\pm 0.67\%$ | $68.03\pm 0.18\%$ | $64.97\pm 0.39\%$ |
| CIFAR100 | Diffusion | Random | $\mathbf{71.32}\pm 0.50\%$ | $\mathbf{70.03}\pm 0.28\%$ | $\mathbf{69.74}\pm 0.83\%$ |
| CIFAR100 | Diffusion | Ring | $\mathbf{71.30}\pm 0.37\%$ | $\mathbf{69.64}\pm 0.58\%$ | $\mathbf{69.38}\pm 0.58\%$ |
TABLE II: The generalization gap between training and test accuracy with ResNet-18.

| Dataset | Method | Graph | $128\times 16$ | $256\times 16$ | $512\times 16$ |
|---|---|---|---|---|---|
| CIFAR10 | Centralized | - | 0.0892 | 0.0960 | 0.1053 |
| CIFAR10 | Consensus | Random | 0.0743 | 0.0777 | 0.0686 |
| CIFAR10 | Consensus | Ring | 0.0767 | 0.0760 | 0.0583 |
| CIFAR10 | Diffusion | Random | 0.0718 | 0.0789 | 0.0805 |
| CIFAR10 | Diffusion | Ring | 0.0719 | 0.0764 | 0.0798 |
| CIFAR100 | Centralized | - | 0.2976 | 0.3099 | 0.3126 |
| CIFAR100 | Consensus | Random | 0.2849 | 0.2817 | 0.2027 |
| CIFAR100 | Consensus | Ring | 0.2761 | 0.2764 | 0.1731 |
| CIFAR100 | Diffusion | Random | 0.2861 | 0.2989 | 0.2977 |
| CIFAR100 | Diffusion | Ring | 0.2863 | 0.3029 | 0.2998 |

The simulation results in Table I can be interpreted through the trade-off between flatness and optimization performance. Note that our simulation results do not contradict the traditional view that flatter models generalize better [31, 32, 33, 34, 35]. Table II reports the generalization gap, i.e., the difference between the training and test accuracy, in all experiments; it shows that the flatter models found by decentralized methods yield a smaller generalization gap than the centralized approach. However, we emphasize that the final test accuracy is determined by both optimization and generalization performance. When $B$ is 512 on the CIFAR100 dataset, even though consensus enables better generalization, it simultaneously loses too much optimization performance. Thus, its final test accuracy is worse than that of centralized. By the same reasoning, diffusion achieves a favorable balance between flatness and optimization, thereby exhibiting higher test accuracy than the other two methods.

VI Conclusion

We analyzed the learning behavior of three popular algorithms around local minima in nonconvex environments. The results show that decentralized methods exhibit accelerated evasion from local minima in contrast to the centralized strategy. They also show that while consensus outperforms diffusion in terms of escaping ability around local minima, this feature nevertheless comes at the expense of a deteriorated optimization performance. As a result, consensus is observed to lead to lower classification accuracy than diffusion. In other words, the results in this paper highlight an important trade-off between escaping efficiency and optimization performance in the context of multi-agent learning.

Although we focused on the traditional mini-batch gradient descent implementation, one useful extension would be to consider other types of optimizers, such as stochastic gradient descent with momentum [59] and ADAM [40]. In addition, it has been observed in the single-agent case that the structure of the gradient noise can influence the escaping efficiency and stability of stochastic gradient algorithms [34, 44]. It would be useful to examine the nature of this effect in the multi-agent case.

Acknowledgments

We acknowledge the assistance of ChatGPT in improving the English expressions in this paper. We also appreciate Dr. Elsa Rizk for her valuable suggestions, which helped us present the results more clearly.

References

  • [1] A. H. Sayed, “Adaptive networks,” Proceedings of the IEEE, vol. 102, no. 4, pp. 460–497, 2014.
  • [2] T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, “Distributed learning in the nonconvex world: From batch data to streaming and beyond,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 26–38, 2020.
  • [3] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
  • [4] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [5] A. G. Dimakis, S. Kar, J. M. Moura, M. G. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
  • [6] A. H. Sayed, “Adaptation, learning, and optimization over networks,” Foundations and Trends in Machine Learning, vol. 7, pp. 311–801, 2014.
  • [7] ——, Inference and Learning from Data.   Cambridge University Press, 2022.
  • [8] J. Chen and A. H. Sayed, “Distributed pareto optimization via diffusion strategies,” IEEE J. Sel. Top. Signal Process., vol. 7, no. 2, pp. 205–220, 2013.
  • [9] T. Zhu, F. He, K. Chen, M. Song, and D. Tao, “Decentralized SGD and average-direction SAM are asymptotically equivalent,” in Proc. ICML, Honolulu, 2023, pp. 43 005–43 036.
  • [10] T. Zhu, F. He, L. Zhang, Z. Niu, M. Song, and D. Tao, “Topology-aware generalization of decentralized SGD,” in Proc. ICML, Baltimore, 2022, pp. 27 479–27 503.
  • [11] S. Vlaski and A. H. Sayed, “Distributed learning in non-convex environments - part I: agreement at a linear rate,” IEEE Trans. Signal Process., vol. 69, pp. 1242–1256, 2021.
  • [12] M. Kayaalp, S. Vlaski, and A. H. Sayed, “Dif-MAML: Decentralized multi-agent meta-learning,” IEEE Open Journal of Signal Processing, vol. 3, pp. 71–93, 2022.
  • [13] Z. Wang, F. R. M. Pavan, and A. H. Sayed, “Decentralized GAN training through diffusion learning,” in Proc. MLSP, Xi’an, 2022, pp. 1–6.
  • [14] K. Yuan, S. A. Alghunaim, and X. Huang, “Removing data heterogeneity influence enhances network topology dependence of decentralized SGD,” Journal of Machine Learning Research, vol. 24, no. 280, pp. 1–53, 2023.
  • [15] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. NeurIPS, Long Beach, 2017, pp. 5330–5340.
  • [16] B. Ying, K. Yuan, H. Hu, Y. Chen, and W. Yin, “Bluefog: Make decentralized algorithms practical for optimization and deep learning,” arXiv preprint arXiv:2111.04287, 2021.
  • [17] B. Ying, K. Yuan, Y. Chen, H. Hu, P. Pan, and W. Yin, “Exponential graph is provably efficient for decentralized deep training,” in Proc. NeurIPS, 2021, pp. 13 975–13 987.
  • [18] T. Sun, D. Li, and B. Wang, “Decentralized federated averaging,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 4289–4301, 2023.
  • [19] J. Xu, W. Zhang, and F. Wang, “A(DP)$^2$SGD: Asynchronous decentralized parallel stochastic gradient descent with differential privacy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 8036–8047, 2021.
  • [20] J. Chen and A. H. Sayed, “On the learning behavior of adaptive networks - part I: Transient analysis,” IEEE Trans. Inf. Theory, vol. 61, no. 6, pp. 3487–3517, 2015.
  • [21] S. Vlaski and A. H. Sayed, “Distributed learning in non-convex environments - part II: polynomial escape from saddle-points,” IEEE Trans. Signal Process., vol. 69, pp. 1257–1270, 2021.
  • [22] A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. U. Stich, “A unified theory of decentralized SGD with changing topology and local updates,” in Proc. ICML, vol. 119, 2020, pp. 5381–5393.
  • [23] S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Trans. Signal Process., vol. 70, pp. 3264–3279, 2022.
  • [24] J. Chen and A. H. Sayed, “On the learning behavior of adaptive networks - part II: performance analysis,” IEEE Trans. Inf. Theory, vol. 61, no. 6, pp. 3518–3548, 2015.
  • [25] K. Yuan, S. A. Alghunaim, B. Ying, and A. H. Sayed, “On the influence of bias-correction on distributed stochastic optimization,” IEEE Trans. Signal Process., vol. 68, pp. 4352–4367, 2020.
  • [26] T. Sun, D. Li, and B. Wang, “Stability and generalization of decentralized stochastic gradient descent,” in Proc. AAAI, 2021, pp. 9756–9764.
  • [27] X. Deng, T. Sun, S. Li, and D. Li, “Stability-based generalization analysis of the asynchronous decentralized SGD,” in Proc. AAAI, Washington, 2023, pp. 7340–7348.
  • [28] L. Kong, T. Lin, A. Koloskova, M. Jaggi, and S. U. Stich, “Consensus control for decentralized deep learning,” in Proc. ICML, 2021, pp. 5686–5696.
  • [29] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” in Proc. ICLR, 2021.
  • [30] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” in Proc. ICLR, Toulon, 2017.
  • [31] K. Lyu, Z. Li, and S. Arora, “Understanding the generalization benefit of normalization layers: Sharpness reduction,” in Proc. NeurIPS, New Orleans, 2022.
  • [32] K. Gatmiry, Z. Li, C.-Y. Chuang, S. Reddi, T. Ma, and S. Jegelka, “The inductive bias of flatness regularization for deep matrix factorization,” in Proc. NeurIPS, New Orleans, 2023.
  • [33] L. Wu and W. J. Su, “The implicit regularization of dynamical stability in stochastic gradient descent,” in Proc. ICML, Honolulu, 2023, pp. 37 656–37 684.
  • [34] Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma, “The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects,” in Proc. ICML, Long Beach, 2019, pp. 7654–7663.
  • [35] M. S. Nacson, K. Ravichandran, N. Srebro, and D. Soudry, “Implicit bias of the step size in linear diagonal neural networks,” in Proc. ICML, Baltimore, 2022, pp. 16 270–16 295.
  • [36] Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio, “Fantastic generalization measures and where to find them,” in Proc. ICLR, Addis Ababa, 2020.
  • [37] S. Mandt, M. D. Hoffman, and D. M. Blei, “Stochastic gradient descent as approximate Bayesian inference,” J. Mach. Learn. Res., vol. 18, pp. 134:1–134:35, 2017.
  • [38] T. Mori, Z. Liu, K. Liu, and M. Ueda, “Power-law escape rate of SGD,” in Proc. ICML, Baltimore, 2022, pp. 15 959–15 975.
  • [39] Z. Xie, I. Sato, and M. Sugiyama, “A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima,” in Proc. ICLR, 2021.
  • [40] P. Zhou, J. Feng, C. Ma, C. Xiong, S. C. Hoi, and W. E, “Towards theoretically understanding why SGD generalizes better than Adam in deep learning,” in Proc. NeurIPS, 2020.
  • [41] S. Pesme, L. Pillaud-Vivien, and N. Flammarion, “Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity,” in Proc. NeurIPS, 2021, pp. 29 218–29 230.
  • [42] U. Simsekli, L. Sagun, and M. Gürbüzbalaban, “A tail-index analysis of stochastic gradient noise in deep neural networks,” in Proc. ICML, Long Beach, 2019, pp. 5827–5837.
  • [43] L. Wu, C. Ma, and W. E, “How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective,” in Proc. NeurIPS, Montréal, 2018, pp. 8289–8298.
  • [44] L. Wu, M. Wang, and W. Su, “The alignment property of SGD noise and how it helps select flat minima: A stability analysis,” in Proc. NeurIPS, New Orleans, 2022.
  • [45] S. Tu and A. H. Sayed, “Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks,” IEEE Trans. Signal Process., vol. 60, no. 12, pp. 6217–6234, 2012.
  • [46] A. Bovier, M. Eckhoff, V. Gayrard, and M. Klein, “Metastability in reversible diffusion processes. I. Sharp asymptotics for capacities and exit times,” J. Eur. Math. Soc. (JEMS), vol. 6, no. 4, pp. 399–424, 2004.
  • [47] H. Ibayashi and M. Imaizumi, “Why does SGD prefer flat minima?: Through the lens of dynamical systems,” in Proc. AAAI workshop: When Machine Learning meets Dynamical Systems: Theory and Applications, Washington, 2023.
  • [48] T. H. Nguyen, U. Simsekli, M. Gürbüzbalaban, and G. Richard, “First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise,” in Proc. NeurIPS, Vancouver, 2019, pp. 273–283.
  • [49] K. Ahn, A. Jadbabaie, and S. Sra, “How to escape sharp minima with random perturbations,” 2024. [Online]. Available: https://arxiv.longhoe.net/pdf/2305.15659.pdf
  • [50] K. Wen, Z. Li, and T. Ma, “Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization,” in Proc. NeurIPS, New Orleans, 2023.
  • [51] Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney, “Hessian-based analysis of large batch training and robustness to adversaries,” in Proc. NeurIPS, Montréal, 2018, pp. 4954–4964.
  • [52] J. A. Tropp, “An introduction to matrix concentration inequalities,” 2015. [Online]. Available: https://arxiv.longhoe.net/abs/1501.01571
  • [53] E. Barshan, M.-E. Brunet, and G. K. Dziugaite, “RelatIF: Identifying explanatory training samples via relative influence,” in Proc. AISTATS, 2020, pp. 1899–1909.
  • [54] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, Las Vegas, 2016, pp. 770–778.
  • [55] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in Proc. BMVC, York, 2016.
  • [56] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. CVPR, Honolulu, 2017, pp. 2261–2269.
  • [57] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” in Proc. ICLR, Addis Ababa, 2020.
  • [58] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Proc. NeurIPS, Montréal, 2018, pp. 6391–6401.
  • [59] K. Yuan, Y. Chen, X. Huang, Y. Zhang, P. Pan, Y. Xu, and W. Yin, “DecentLaM: Decentralized momentum SGD for large-batch deep training,” in Proc. ICCV, Montreal, 2021, pp. 3009–3019.

Appendix A Proof for Lemma III.4: Properties associated with gradient noise

Proofs in this section follow from [6, 12, 7]. We first note that the mini-batch gradient is an unbiased estimator of the true gradient. That is, for any $\boldsymbol{w}\in\mathcal{F}_{n-1}$:

$$
\mathds{E}\left[\boldsymbol{s}_{k,n}^{B}(\boldsymbol{w})\,|\,\mathcal{F}_{n-1}\right]=\frac{1}{B}\sum_{b}\mathds{E}\left[\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(\boldsymbol{w})\,|\,\mathcal{F}_{n-1}\right]=0 \tag{79}
$$

Using (79) and the independence of data among agents, for any two distinct agents $k$ and $\ell$, and $\boldsymbol{w}_{k},\boldsymbol{w}_{\ell}\in\mathcal{F}_{n-1}$, we have

$$
\mathds{E}\left[\boldsymbol{s}_{k,n}(\boldsymbol{w}_{k})\boldsymbol{s}_{\ell,n}^{\sf T}(\boldsymbol{w}_{\ell})\,|\,\mathcal{F}_{n-1}\right]=\mathds{E}\left[\boldsymbol{s}_{k,n}(\boldsymbol{w}_{k})\,|\,\mathcal{F}_{n-1}\right]\times\mathds{E}\left[\boldsymbol{s}_{\ell,n}^{\sf T}(\boldsymbol{w}_{\ell})\,|\,\mathcal{F}_{n-1}\right]=0 \tag{80}
$$
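The zero-mean and cross-agent uncorrelatedness properties in (79) and (80) can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the paper: a toy least-squares loss $Q(w;h,y)=\tfrac{1}{2}(y-h^{\sf T}w)^2$ with standard Gaussian features, so that $\nabla J(w)=w-w^o$, and hypothetical names such as `batch_gradient_noise` and `check_noise_properties`.

```python
import numpy as np


def batch_gradient_noise(w, w_o, batch_size, num_trials, rng, noise_std=0.1):
    """Draw num_trials samples of s^B(w) = (1/B) sum_b grad Q(w; x^b) - grad J(w).

    Toy model (an assumption, not the paper's setting): y = h^T w_o + v with
    h ~ N(0, I), so the true gradient of the risk is grad J(w) = w - w_o.
    """
    dim = w.shape[0]
    h = rng.standard_normal((num_trials, batch_size, dim))          # features
    v = noise_std * rng.standard_normal((num_trials, batch_size))   # label noise
    y = h @ w_o + v
    residual = h @ w - y                                            # shape (trials, B)
    grad_q = (h * residual[..., None]).mean(axis=1)                 # mini-batch gradient
    grad_j = w - w_o                                                # exact gradient
    return grad_q - grad_j


def check_noise_properties(num_trials=200_000, batch_size=4, seed=0):
    """Estimate E[s_k] and E[s_k s_l^T] for two agents with independent data."""
    rng = np.random.default_rng(seed)
    w_o = np.array([1.0, -2.0, 0.5])   # hypothetical model generating the data
    w_k = np.array([0.5, 0.0, 1.0])    # iterate held by agent k
    w_l = np.array([-1.0, 1.0, 0.0])   # iterate held by agent l
    s_k = batch_gradient_noise(w_k, w_o, batch_size, num_trials, rng)
    s_l = batch_gradient_noise(w_l, w_o, batch_size, num_trials, rng)
    mean_k = s_k.mean(axis=0)              # empirical version of (79)
    cross = s_k.T @ s_l / num_trials       # empirical version of (80)
    return mean_k, cross


if __name__ == "__main__":
    mean_k, cross = check_noise_properties()
    print("max |E s_k|       :", np.abs(mean_k).max())
    print("max |E s_k s_l^T| :", np.abs(cross).max())
```

Both printed quantities should be close to zero up to Monte Carlo error, which shrinks as `num_trials` grows.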

Then, for $B=1$ we have

$$
\begin{aligned}
\mathds{E}\left[\|\boldsymbol{s}_{k,n}(\boldsymbol{w})\|^{2}\,|\,\mathcal{F}_{n-1}\right]
&= \mathds{E}\left[\|\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})\|^{2}\,|\,\mathcal{F}_{n-1}\right] \\
&\overset{(a)}{\leq} 2\mathds{E}\|\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{2}+4\mathds{E}\|\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{2}+8\|\nabla J_{k}(\boldsymbol{w})-\nabla J_{k}(w^{\star})\|^{2} \\
&\quad +8\|\nabla J_{k}(w^{\star})\|^{2} \\
&\overset{(b)}{\leq} 10\mathds{E}\|\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{2}+12\mathds{E}\|\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{2} \\
&\overset{(c)}{\leq} 10L^{2}\|\boldsymbol{w}-w^{\star}\|^{2}+12\mathds{E}\|\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{2}\overset{\Delta}{=}a_{1}
\end{aligned} \tag{81}
$$

where $(a)$ follows from Jensen’s inequality, $(b)$ follows from the following inequalities:

$$
\|\nabla J_{k}(w^{\star})\|^{2}=\|\mathds{E}\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{2}\leq\mathds{E}\|\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{2} \tag{82}
$$
$$
\|\nabla J_{k}(\boldsymbol{w})-\nabla J_{k}(w^{\star})\|^{2}=\|\mathds{E}\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\mathds{E}\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{2}\leq\mathds{E}\|\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{2} \tag{83}
$$

and $(c)$ follows from the Lipschitz condition in Assumption III.2. Similarly, for the fourth-order moment of the stochastic gradient noise, we have:

$$
\begin{aligned}
&\mathds{E}\left[\|\boldsymbol{s}_{k,n}(\boldsymbol{w})\|^{4}\,|\,\mathcal{F}_{n-1}\right] \\
&= \mathds{E}\left[\|\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})\|^{4}\,|\,\mathcal{F}_{n-1}\right] \\
&\leq 8\mathds{E}\|\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{4}+64\mathds{E}\|\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{4} \\
&\quad +512\mathds{E}\|\nabla J_{k}(\boldsymbol{w})-\nabla J_{k}(w^{\star})\|^{4}+512\mathds{E}\|\nabla J_{k}(w^{\star})\|^{4} \\
&\leq 520L^{4}\|\boldsymbol{w}-w^{\star}\|^{4}+576\mathds{E}\|\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})\|^{4}\overset{\Delta}{=}a_{2}
\end{aligned} \tag{84}
$$
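The constants in (81) and (84) are consistent with a repeated use of two elementary bounds. The following worked step is a sketch of one way to arrive at them, not a derivation stated in the text; the labels $a$, $b$, $c$, $d$ are introduced here only for illustration. Write the noise as a sum of four terms:

$$
\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})
=\underbrace{\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})}_{a}
+\underbrace{\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k,n})}_{b}
+\underbrace{\nabla J_{k}(w^{\star})-\nabla J_{k}(\boldsymbol{w})}_{c}
+\underbrace{\left(-\nabla J_{k}(w^{\star})\right)}_{d}
$$

Applying $\|u+v\|^{2}\leq 2\|u\|^{2}+2\|v\|^{2}$ three times, peeling off one term at a time, gives

$$
\|a+b+c+d\|^{2}\leq 2\|a\|^{2}+2\|b+c+d\|^{2}\leq 2\|a\|^{2}+4\|b\|^{2}+4\|c+d\|^{2}\leq 2\|a\|^{2}+4\|b\|^{2}+8\|c\|^{2}+8\|d\|^{2}
$$

and squaring the same elementary bound yields $\|u+v\|^{4}\leq 8\|u\|^{4}+8\|v\|^{4}$, so that

$$
\|a+b+c+d\|^{4}\leq 8\|a\|^{4}+64\|b\|^{4}+512\|c\|^{4}+512\|d\|^{4}
$$

Bounding the $c$ and $d$ terms by the $a$ and $b$ terms through Jensen's inequality, as in (82) and (83), merges the coefficients into $2+8=10$ and $4+8=12$ for the second-order moment, and into $8+512=520$ and $64+512=576$ for the fourth-order moment.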

We now verify that the terms related to the gradient noise shrink as the batch size grows. For any $\boldsymbol{w}\in\mathcal{F}_{n-1}$, we examine how the second- and fourth-order conditional moments of the gradient noise vary with the batch size. Recall from (81) and (84) that, for $B=1$:

$$
\mathds{E}[\|\boldsymbol{s}_{k,n}(\boldsymbol{w})\|^{2}\,|\,\mathcal{F}_{n-1}]\leq a_{1},\quad\mathds{E}[\|\boldsymbol{s}_{k,n}(\boldsymbol{w})\|^{4}\,|\,\mathcal{F}_{n-1}]\leq a_{2} \tag{85}
$$

where $a_{1}^{2}\leq a_{2}$, while for general $B>1$:

$$
\begin{aligned}
\mathds{E}[\|\boldsymbol{s}_{k,n}^{B}(\boldsymbol{w})\|^{2}\,|\,\mathcal{F}_{n-1}]
&= \mathds{E}\left[\left\|\frac{1}{B}\sum_{b}\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(\boldsymbol{w})\right\|^{2}\Bigg|\,\mathcal{F}_{n-1}\right] \\
&\overset{(a)}{=} \frac{1}{B^{2}}\sum_{b}\mathds{E}[\|\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(\boldsymbol{w})\|^{2}\,|\,\mathcal{F}_{n-1}] \\
&= \frac{1}{B}\mathds{E}[\|\boldsymbol{s}_{k,n}(\boldsymbol{w})\|^{2}\,|\,\mathcal{F}_{n-1}]\leq\frac{1}{B}a_{1}
\end{aligned} \tag{86}
$$
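The $1/B$ scaling in (86) can be illustrated with a short Monte Carlo experiment. The sketch below rests on assumptions that are not in the paper: the same kind of toy least-squares loss with standard Gaussian features (so that $\nabla J(w)=w-w^o$), fixed illustrative vectors, and a hypothetical helper `noise_second_moment`.

```python
import numpy as np


def noise_second_moment(batch_size, num_trials=100_000, seed=0):
    """Monte Carlo estimate of E||s^B(w)||^2 for a toy least-squares risk.

    Assumed model (illustrative only): y = h^T w_o + v, h ~ N(0, I),
    Q(w; h, y) = 0.5 * (y - h^T w)^2, hence grad J(w) = w - w_o.
    """
    rng = np.random.default_rng(seed)
    w_o = np.array([1.0, -2.0, 0.5])   # hypothetical model generating the data
    w = np.array([0.5, 0.0, 1.0])      # fixed point where the noise is evaluated
    h = rng.standard_normal((num_trials, batch_size, w.size))
    y = h @ w_o + 0.1 * rng.standard_normal((num_trials, batch_size))
    # Mini-batch gradient: average of h * (h^T w - y) over the batch axis.
    grad_q = (h * (h @ w - y)[..., None]).mean(axis=1)
    s = grad_q - (w - w_o)             # gradient noise s^B(w)
    return float((s ** 2).sum(axis=1).mean())


if __name__ == "__main__":
    base = noise_second_moment(1)
    for batch_size in (2, 4, 8):
        ratio = base / noise_second_moment(batch_size)
        print(f"B = {batch_size}: E||s^1||^2 / E||s^B||^2 = {ratio:.2f}")
```

If the samples in a batch are independent, as assumed in step $(a)$, each printed ratio should be close to the corresponding batch size $B$, up to sampling error.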

where $(a)$ follows from the independence among data, which together with (79) makes the cross terms vanish. For the fourth-order conditional moment of the gradient noise, we establish the bound by induction on $B$. Assume

$$
\mathds{E}[\|\boldsymbol{s}_{k,n}^{B-1}(\boldsymbol{w})\|^{4}\,|\,\mathcal{F}_{n-1}]\leq\frac{3a_{2}}{(B-1)^{2}} \tag{87}
$$

then we have

$$
\begin{aligned}
&\mathds{E}[\|\boldsymbol{s}_{k,n}^{B}(\boldsymbol{w})\|^{4}\,|\,\mathcal{F}_{n-1}]=\mathds{E}\left[\left\|\frac{B-1}{B}\cdot\frac{1}{B-1}\sum_{b=1}^{B}\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(\boldsymbol{w})\right\|^{4}\Bigg|\,\mathcal{F}_{n-1}\right] \\
&= \mathds{E}\left\|\frac{B-1}{B}\cdot\frac{1}{B-1}\sum_{b=1}^{B-1}\left(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(\boldsymbol{w})\right)+\frac{1}{B}\Big(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})\Big)\right\|^{4} \\
&= \mathds{E}\left(\left\|\frac{B-1}{B}\cdot\frac{1}{B-1}\sum_{b=1}^{B-1}\left(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(\boldsymbol{w})\right)+\frac{1}{B}\Big(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})\Big)\right\|^{2}\right)^{2} \\
&= \mathds{E}\left\|\frac{B-1}{B}\cdot\frac{1}{B-1}\sum_{b=1}^{B-1}\left(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(\boldsymbol{w})\right)\right\|^{4}+\mathds{E}\left\|\frac{1}{B}\Big(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})\Big)\right\|^{4} \\
&\quad +4\mathds{E}\left[\left(\frac{B-1}{B}\cdot\frac{1}{B-1}\sum_{b=1}^{B-1}\left(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(\boldsymbol{w})\right)\right)^{\sf T}\left(\frac{1}{B}\Big(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})\Big)\right)\right]^{2} \\
&\quad +2\mathds{E}\left\|\frac{B-1}{B}\cdot\frac{1}{B-1}\sum_{b=1}^{B-1}\left(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}(\boldsymbol{w})\right)\right\|^{2}\left\|\frac{1}{B}\Big(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})\Big)\right\|^{2}
\end{aligned}
$$
\displaystyle\leq 𝔼B1B1B1b=1B1(Qk(𝒘;𝒙k,nb)Jk(𝒘))4+𝔼1B(Qk(𝒘;𝒙k,n)Jk(𝒘))4𝔼superscriptnorm𝐵1𝐵1𝐵1superscriptsubscript𝑏1𝐵1subscript𝑄𝑘𝒘superscriptsubscript𝒙𝑘𝑛𝑏subscript𝐽𝑘𝒘4𝔼superscriptnorm1𝐵subscript𝑄𝑘𝒘subscript𝒙𝑘𝑛subscript𝐽𝑘𝒘4\displaystyle\mathds{E}\left\|\frac{B-1}{B}\cdot\frac{1}{B-1}\sum\limits_{b=1}% ^{B-1}\left(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J_{k}% (\boldsymbol{w})\right)\right\|^{4}+\mathds{E}\left\|\frac{1}{B}\Big{(}\nabla Q% _{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})\Big{)}% \right\|^{4}blackboard_E ∥ divide start_ARG italic_B - 1 end_ARG start_ARG italic_B end_ARG ⋅ divide start_ARG 1 end_ARG start_ARG italic_B - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B - 1 end_POSTSUPERSCRIPT ( ∇ italic_Q start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_w ; bold_italic_x start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ) - ∇ italic_J start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_w ) ) ∥ start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT + blackboard_E ∥ divide start_ARG 1 end_ARG start_ARG italic_B end_ARG ( ∇ italic_Q start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_w ; bold_italic_x start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT ) - ∇ italic_J start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_w ) ) ∥ start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT
+6𝔼B1B1B1b=1B1(Qk(𝒘;𝒙k,nb)Jk(𝒘))2𝔼1B(Qk(𝒘;𝒙k,n)Jk(𝒘))26𝔼superscriptnorm𝐵1𝐵1𝐵1superscriptsubscript𝑏1𝐵1subscript𝑄𝑘𝒘superscriptsubscript𝒙𝑘𝑛𝑏subscript𝐽𝑘𝒘2𝔼superscriptnorm1𝐵subscript𝑄𝑘𝒘subscript𝒙𝑘𝑛subscript𝐽𝑘𝒘2\displaystyle\ +6\mathds{E}\left\|\frac{B-1}{B}\cdot\frac{1}{B-1}\sum\limits_{% b=1}^{B-1}\left(\nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n}^{b})-\nabla J% _{k}(\boldsymbol{w})\right)\right\|^{2}\mathds{E}\left\|\frac{1}{B}\Big{(}% \nabla Q_{k}(\boldsymbol{w};\boldsymbol{x}_{k,n})-\nabla J_{k}(\boldsymbol{w})% \Big{)}\right\|^{2}+ 6 blackboard_E ∥ divide start_ARG italic_B - 1 end_ARG start_ARG italic_B end_ARG ⋅ divide start_ARG 1 end_ARG start_ARG italic_B - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B - 1 end_POSTSUPERSCRIPT ( ∇ italic_Q start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_w ; bold_italic_x start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ) - ∇ italic_J start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_w ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ divide start_ARG 1 end_ARG start_ARG italic_B end_ARG ( ∇ italic_Q start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_w ; bold_italic_x start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT ) - ∇ italic_J start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_w ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
\displaystyle\leq (B1)4B4𝔼[𝒔k,nB1(𝒘)4|n1]+a2B4+6((B1)2B4×1B1a12)superscript𝐵14superscript𝐵4𝔼delimited-[]conditionalsuperscriptnormsuperscriptsubscript𝒔𝑘𝑛𝐵1𝒘4subscript𝑛1subscript𝑎2superscript𝐵46superscript𝐵12superscript𝐵41𝐵1superscriptsubscript𝑎12\displaystyle\frac{(B-1)^{4}}{B^{4}}\mathds{E}[\|\boldsymbol{s}_{k,n}^{B-1}(% \boldsymbol{w})\|^{4}|\mathcal{F}_{n-1}]+\frac{a_{2}}{B^{4}}+6\left(\frac{(B-1% )^{2}}{B^{4}}\times\frac{1}{B-1}a_{1}^{2}\right)divide start_ARG ( italic_B - 1 ) start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG start_ARG italic_B start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG blackboard_E [ ∥ bold_italic_s start_POSTSUBSCRIPT italic_k , italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B - 1 end_POSTSUPERSCRIPT ( bold_italic_w ) ∥ start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT | caligraphic_F start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ] + divide start_ARG italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG italic_B start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG + 6 ( divide start_ARG ( italic_B - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_B start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG × divide start_ARG 1 end_ARG start_ARG italic_B - 1 end_ARG italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT )
\displaystyle\leq 3(B1)2a2B4+a2B4+6(B1)B4a2=(3B22)a2B43a2B23superscript𝐵12subscript𝑎2superscript𝐵4subscript𝑎2superscript𝐵46𝐵1superscript𝐵4subscript𝑎23superscript𝐵22subscript𝑎2superscript𝐵43subscript𝑎2superscript𝐵2\displaystyle\frac{3(B-1)^{2}a_{2}}{B^{4}}+\frac{a_{2}}{B^{4}}+6\frac{(B-1)}{B% ^{4}}a_{2}=\frac{(3B^{2}-2)a_{2}}{B^{4}}\leq\frac{3a_{2}}{B^{2}}divide start_ARG 3 ( italic_B - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG italic_B start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG + divide start_ARG italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG italic_B start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG + 6 divide start_ARG ( italic_B - 1 ) end_ARG start_ARG italic_B start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = divide start_ARG ( 3 italic_B start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 2 ) italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG italic_B start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG ≤ divide start_ARG 3 italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG italic_B start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (88)
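The closing equality and inequality in (88) are elementary algebra on the numerators. Spelled out as a worked step, and assuming $a_{2}\geq 0$ (an assumption made here because $a_{2}$ bounds a fourth-order moment):

```latex
\begin{align*}
3(B-1)^{2}+1+6(B-1) &= (3B^{2}-6B+3)+1+(6B-6) = 3B^{2}-2, \\
\frac{(3B^{2}-2)\,a_{2}}{B^{4}} &\leq \frac{3B^{2}\,a_{2}}{B^{4}} = \frac{3a_{2}}{B^{2}}.
\end{align*}
```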

Combining (A), (A), (A) and (A), we can verify that the variance and fourth-order moment of the gradient noise are upper bounded by terms that depend on the batch size $B$ and on the difference between the current model and the local minimizer, denoted by $\boldsymbol{w}-w^{\star}$:

\begin{align*}
\mathds{E}[\|\boldsymbol{s}_{k,n}^{B}(\boldsymbol{w})\|^{2}|\mathcal{F}_{n-1}] &\leq O\left(\frac{1}{B}\right)\mathds{E}\|\boldsymbol{w}-w^{\star}\|^{2}+O\left(\frac{1}{B}\right) \tag{89} \\
\mathds{E}[\|\boldsymbol{s}_{k,n}^{B}(\boldsymbol{w})\|^{4}|\mathcal{F}_{n-1}] &\leq O\left(\frac{1}{B^{2}}\right)\mathds{E}\|\boldsymbol{w}-w^{\star}\|^{4}+O\left(\frac{1}{B^{2}}\right) \tag{90}
\end{align*}
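The $1/B$ and $1/B^{2}$ scalings in (89) and (90) can be illustrated numerically. The following is a minimal sketch, not the paper's experiment: it assumes the per-sample gradient noise is i.i.d. standard Gaussian in dimension `M` and independent of $\boldsymbol{w}$, so it only probes the dependence on the batch size. The function name `batch_noise_moments` and all parameter values are hypothetical.

```python
import numpy as np

def batch_noise_moments(B, M=3, trials=50_000, seed=0):
    """Monte Carlo estimates of E||s^B||^2 and E||s^B||^4, where s^B is the
    average of B i.i.d. zero-mean noise vectors (standard Gaussian: an
    assumption made only for this illustration)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((trials, B, M))  # per-sample gradient noise
    s = noise.mean(axis=1)                       # mini-batch gradient noise
    sq_norm = np.sum(s**2, axis=1)
    return sq_norm.mean(), (sq_norm**2).mean()

# Rescaling by B and B^2 should give roughly constant values across B.
for B in (1, 4, 16):
    m2, m4 = batch_noise_moments(B)
    print(B, m2 * B, m4 * B**2)
```

Under this Gaussian assumption, $B\,\|\boldsymbol{s}^{B}\|^{2}$ follows a chi-square distribution with $M$ degrees of freedom, so the rescaled second and fourth moments should stay close to $M$ and $M(M+2)$ for every batch size.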

As for the gradient covariance matrices, we have

\begin{align*}
R_{s,k,n}^{B}(w) &= \mathds{E}\left\{\left[\frac{1}{B}\sum\limits_{i}\nabla Q_{k}(w;\boldsymbol{x}_{k,n}^{i})-\nabla J_{k}(w)\right]\left[\frac{1}{B}\sum\limits_{i}\nabla Q_{k}(w;\boldsymbol{x}_{k,n}^{i})-\nabla J_{k}(w)\right]^{\sf T}\right\} \\
&= \frac{1}{B^{2}}\mathds{E}\left\{\sum\limits_{i}\sum\limits_{j}\left[\nabla Q_{k}(w;\boldsymbol{x}_{k,n}^{i})-\nabla J_{k}(w)\right]\left[\nabla Q_{k}(w;\boldsymbol{x}_{k,n}^{j})-\nabla J_{k}(w)\right]^{\sf T}\right\} \\
&\overset{(a)}{=} \frac{1}{B^{2}}\mathds{E}\left\{\sum\limits_{i}\left[\nabla Q_{k}(w;\boldsymbol{x}_{k,n}^{i})-\nabla J_{k}(w)\right]\left[\nabla Q_{k}(w;\boldsymbol{x}_{k,n}^{i})-\nabla J_{k}(w)\right]^{\sf T}\right\} \\
&= \frac{1}{B}\mathds{E}\left[\nabla Q_{k}(w;\boldsymbol{x}_{k})-\nabla J_{k}(w)\right]\left[\nabla Q_{k}(w;\boldsymbol{x}_{k})-\nabla J_{k}(w)\right]^{\sf T}=\frac{1}{B}R_{s,k,n}(w) \tag{91}
\end{align*}

where $(a)$ holds because all data are sampled independently at each agent, so the cross terms with $i\neq j$ have zero mean and vanish.
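The $1/B$ scaling of the covariance in (91) can be checked exactly on a toy problem. The sketch below is an illustration under stated assumptions rather than the paper's setup: it uses a quadratic loss $\frac{1}{2}(\gamma-h^{\sf T}w)^{2}$, a four-sample local dataset, and uniform sampling with replacement, and it enumerates every possible batch instead of simulating. The helper names `per_sample_grads` and `exact_batch_covariance` are hypothetical.

```python
import itertools
import numpy as np

def per_sample_grads(w, H, gamma):
    # Gradient of the quadratic loss 0.5*(gamma - h^T w)^2 for every sample.
    residual = gamma - H @ w
    return -residual[:, None] * H              # shape (N, M)

def exact_batch_covariance(grads, B):
    # Exact covariance of the mini-batch gradient noise, obtained by
    # enumerating all N^B equally likely batches (sampling with replacement).
    N, M = grads.shape
    full_grad = grads.mean(axis=0)             # gradient of the local risk
    R_B = np.zeros((M, M))
    for batch in itertools.product(range(N), repeat=B):
        s = grads[list(batch)].mean(axis=0) - full_grad
        R_B += np.outer(s, s)
    return R_B / N**B

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 2))                # toy features (assumed)
gamma = rng.standard_normal(4)                 # toy targets (assumed)
w = rng.standard_normal(2)

grads = per_sample_grads(w, H, gamma)
R_1 = exact_batch_covariance(grads, B=1)       # single-sample covariance
R_3 = exact_batch_covariance(grads, B=3)       # batch size 3
print(np.allclose(R_3, R_1 / 3))
```

Enumeration is only feasible for tiny `N` and `B`, but it avoids Monte Carlo error and makes the cancellation of the $i\neq j$ cross terms visible up to floating-point precision.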

Appendix B Proof for Lemma III.5

We establish the relationship between the gradient covariance and Hessian matrices by extending the result for the single-agent case [34] to the multi-agent setting. For any $w\in\mathbbm{R}^{M}$, the negative log-likelihood loss is expressed as

\begin{align*}
Q_{k}(w;\boldsymbol{x}_{k})=Q_{k}(w;\boldsymbol{h}_{k},\boldsymbol{\gamma}_{k})=-\log p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w) \tag{92}
\end{align*}

where $\boldsymbol{\gamma}_{k}$ is the true label of $\boldsymbol{h}_{k}$. Then, in stochastic risk minimization, where the risk function is defined as

\begin{align*}
J_{k}(w)=\mathds{E}_{\boldsymbol{x}_{k}}Q_{k}(w;\boldsymbol{x}_{k}) \tag{93}
\end{align*}

we have the following gradient covariance matrix:

\begin{align*}
\bar{R}_{s} &\overset{(a)}{=} \mathds{E}\left[\left(\frac{1}{K}\sum\limits_{k}\big(\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k})-\nabla J_{k}(w^{\star})\big)\right)\times\left(\frac{1}{K}\sum\limits_{\ell}\big(\nabla Q_{\ell}(w^{\star};\boldsymbol{x}_{\ell})-\nabla J_{\ell}(w^{\star})\big)\right)^{\sf T}\right] \\
&\overset{(b)}{=} \frac{1}{K^{2}}\mathds{E}\left[\sum\limits_{k}\nabla Q_{k}(w^{\star};\boldsymbol{x}_{k})\nabla^{\sf T}Q_{k}(w^{\star};\boldsymbol{x}_{k})\right] \\
&= \frac{1}{K^{2}}\sum\limits_{k}\mathds{E}\left[\frac{\nabla p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})\nabla^{\sf T}p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})}{p_{k}^{2}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})}\right] \tag{94}
\end{align*}

where $(a)$ follows from (33), and $(b)$ follows from (79), the independence among agents, and the following equality:

\begin{align*}
\frac{1}{K}\sum\limits_{k}\nabla J_{k}(w^{\star})=\nabla J(w^{\star})=0 \tag{95}
\end{align*}

As for the Hessian matrix, we have

\begin{align*}
\bar{H} &= \frac{1}{K}\mathds{E}\left[\sum\limits_{k}\nabla^{2}Q_{k}(w^{\star};\boldsymbol{x}_{k})\right] \\
&= -\frac{1}{K}\mathds{E}\left[\sum\limits_{k}\nabla^{2}\log p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})\right] \\
&= -\frac{1}{K}\mathds{E}\left[\sum\limits_{k}\frac{\nabla^{2}p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})\,p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})-\nabla p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})\nabla^{\sf T}p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})}{p_{k}^{2}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})}\right] \\
&\overset{(a)}{=} \frac{1}{K}\sum\limits_{k}\mathds{E}\left[\frac{\nabla p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})\nabla^{\sf T}p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})}{p_{k}^{2}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})}\right] \tag{96}
\end{align*}

where $(a)$ follows from the following equality:

\begin{align*}
\frac{1}{K}\mathds{E}\sum\limits_{k}\frac{\nabla^{2}p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})}{p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})} &= \frac{1}{K}\sum\limits_{k}\mathds{E}_{\boldsymbol{h}_{k}}\mathds{E}_{\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k}}\left[\frac{\nabla^{2}p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})}{p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})}\right] \\
&= \frac{1}{K}\sum\limits_{k}\mathds{E}_{\boldsymbol{h}_{k}}\nabla^{2}\underbrace{\int p_{k}(\boldsymbol{\gamma}_{k}|\boldsymbol{h}_{k},w^{\star})\,d\boldsymbol{\gamma}_{k}}_{1} \\
&= 0 \tag{97}
\end{align*}

From (94) and (96), we obtain an exact relationship between $\bar{R}_{s}$ and $\bar{H}$ for stochastic risk minimization:

\begin{align*}
\bar{R}_{s}=\frac{1}{K}\bar{H} \tag{98}
\end{align*}

Strictly speaking, since stochastic risk minimization samples the targets from all possible predictions while empirical optimization uses the specific training label for each sample, the equivalence need not hold exactly in empirical settings. Fortunately, we can still approximate $\bar{R}_{s}$ by $\frac{1}{K}\bar{H}$ with a Monte Carlo estimate based on the training set [53, 34], when $N_{k}$ is sufficiently large at all agents. Thus, in this paper, we use

R¯s1KH¯subscript¯𝑅𝑠1𝐾¯𝐻\displaystyle\bar{R}_{s}\approx\frac{1}{K}\bar{H}over¯ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ≈ divide start_ARG 1 end_ARG start_ARG italic_K end_ARG over¯ start_ARG italic_H end_ARG (99)
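The relation between the gradient-noise second moment and the Hessian can be illustrated numerically. The sketch below is a single-agent illustration only: it assumes a logistic model $p(\gamma|h,w)=1/(1+e^{-\gamma h^{\sf T}w})$ with labels $\gamma\in\{+1,-1\}$, which is not necessarily the loss used in the paper, and the function name `gradient_moment_and_hessian` and the chosen dimensions are hypothetical. It checks that the identity is exact when the label is averaged over the model distribution, and only approximate when one sampled label per feature vector is used.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient_moment_and_hessian(features, w, rng):
    """Compare the gradient second moment with the Hessian for the logistic loss.

    Assumed loss: Q(w; gamma, h) = log(1 + exp(-gamma * h^T w)), gamma in {+1, -1},
    so that p(gamma | h, w) = sigmoid(gamma * h^T w).
    """
    n = features.shape[0]
    z = features @ w
    p_pos = sigmoid(z)  # model probability of gamma = +1

    # Hessian of the loss: sigmoid(z) * (1 - sigmoid(z)) * h h^T, independent of gamma.
    hessian = (features * (p_pos * (1.0 - p_pos))[:, None]).T @ features / n

    # Exact second moment of the gradient when gamma is drawn from the model:
    # the gradient is -sigmoid(-z) h for gamma = +1 and +sigmoid(z) h for gamma = -1.
    weight = p_pos * sigmoid(-z) ** 2 + (1.0 - p_pos) * sigmoid(z) ** 2
    exact_moment = (features * weight[:, None]).T @ features / n

    # Monte Carlo version: a single sampled label per feature vector.
    gamma = np.where(rng.random(n) < p_pos, 1.0, -1.0)
    grads = -(gamma * sigmoid(-gamma * z))[:, None] * features
    sampled_moment = grads.T @ grads / n
    return exact_moment, hessian, sampled_moment


rng = np.random.default_rng(0)
features = rng.standard_normal((200_000, 3))  # illustrative feature vectors
w = np.array([0.5, -1.0, 0.25])               # illustrative model
exact_moment, hessian, sampled_moment = gradient_moment_and_hessian(features, w, rng)
rel_err = np.linalg.norm(sampled_moment - hessian) / np.linalg.norm(hessian)
print("exact identity holds:", np.allclose(exact_moment, hessian))
print("relative Monte Carlo error:", rel_err)
```

The Monte Carlo error shrinks as the number of samples grows, which mirrors the requirement that $N_{k}$ be sufficiently large.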

Appendix C Proof for Lemmas III.6 and III.7: Upper bound for the second-order error moment

The proof in this section is similar to [6, 20, 24], with two important differences: we now focus on nonconvex (as opposed to convex) objective functions and on the finite-horizon (as opposed to infinite-horizon) case. To examine the size of $\mathrm{ER}_{n}$, we first analyze, for later use, the mean-square error $\mathds{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{2}$ associated with $\widetilde{\boldsymbol{\mathcal{W}}}_{n}$ in (42), which can be shown to be upper bounded by a term that depends on $\mu$ and the gradient noise.

Since

\begin{equation}
VV^{\sf T}=\left[\frac{1}{\sqrt{K}}\mathbbm{1}\quad V_{\alpha}\right]\left[\begin{array}{c}\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\\ V_{\alpha}^{\sf T}\end{array}\right]=V^{\sf T}V=\left[\begin{array}{c}\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\\ V_{\alpha}^{\sf T}\end{array}\right]\left[\frac{1}{\sqrt{K}}\mathbbm{1}\quad V_{\alpha}\right]=I_{K} \tag{104}
\end{equation}

we have

\begin{equation}
\frac{1}{K}\mathbbm{1}\mathbbm{1}^{\sf T}+V_{\alpha}V_{\alpha}^{\sf T}=I_{K},\quad\mathbbm{1}^{\sf T}V_{\alpha}=0,\quad V_{\alpha}^{\sf T}V_{\alpha}=I_{K-1} \tag{105}
\end{equation}
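The identities in (105) can be checked numerically on a small example. The following sketch assumes a symmetric, doubly stochastic combination matrix on a ring with uniform weights $1/3$, and takes $V_{\alpha}$ and $P_{\alpha}$ from its eigendecomposition with $P_{\alpha}$ diagonal. The ring topology, the weights, and the helper names `ring_combination_matrix` and `split_eigenbasis` are illustrative assumptions, not part of the paper.

```python
import numpy as np


def ring_combination_matrix(K):
    """Symmetric, doubly stochastic matrix for a ring of K >= 3 agents (weights 1/3)."""
    A = np.zeros((K, K))
    for k in range(K):
        for j in (k - 1, k, k + 1):
            A[k, j % K] = 1.0 / 3.0
    return A


def split_eigenbasis(A):
    """Return (V_alpha, P_alpha): eigenvectors and eigenvalues other than the one at 1.

    numpy.linalg.eigh returns eigenvalues in ascending order, so the eigenvalue 1
    (with eigenvector proportional to the all-ones vector) is the last one.
    """
    eigvals, eigvecs = np.linalg.eigh(A)
    return eigvecs[:, :-1], np.diag(eigvals[:-1])


K = 6
A = ring_combination_matrix(K)
V_alpha, P_alpha = split_eigenbasis(A)
ones = np.ones((K, 1))
averaging = ones @ ones.T / K  # (1/K) * 1 1^T

# The three identities in (105).
print(np.allclose(averaging + V_alpha @ V_alpha.T, np.eye(K)))
print(np.allclose(ones.T @ V_alpha, 0.0))
print(np.allclose(V_alpha.T @ V_alpha, np.eye(K - 1)))

# Consistency with the decomposition A = V diag(1, P_alpha) V^T.
print(np.allclose(averaging + V_alpha @ P_alpha @ V_alpha.T, A))

# Spectral radius of P_alpha; strictly below one for this connected ring.
rho = np.max(np.abs(np.diag(P_alpha)))
print("rho(P_alpha) =", rho)
```

For this ring with $K=6$, the eigenvalues of $A$ are $\frac{1}{3}+\frac{2}{3}\cos(2\pi j/K)$, so the spectral radius of $P_{\alpha}$ is $2/3$.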

Also, by the mean-value theorem [6], we have

\begin{equation}
\nabla J_{k}(w^{\star})-\nabla J_{k}(\boldsymbol{w}_{k,n})=H_{k,n}(\boldsymbol{w}_{k,n})\tilde{\boldsymbol{w}}_{k,n} \tag{106}
\end{equation}

Note that for heterogeneous networks, we have

\begin{equation}
\nabla J_{k}(w^{\star})\neq 0,\quad\nabla J(w^{\star})=\frac{1}{K}\sum\limits_{k}\nabla J_{k}(w^{\star})=0 \tag{107}
\end{equation}

Recall the error recursion in (42). Since $A_{1}$ and $A_{2}$ are each either $A$ or $I_{K}$, we have

\begin{equation}
A_{1}A_{2}=A,\;A_{1}\mathbbm{1}=A_{2}\mathbbm{1}=\mathbbm{1} \tag{108}
\end{equation}

from which we have

\begin{align}
\widetilde{\boldsymbol{\mathcal{W}}}_{n}&=\mathcal{A}_{2}(\mathcal{A}_{1}-\mu\mathcal{H}_{n-1})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1}+\mu\mathcal{A}_{2}d+\mu\mathcal{A}_{2}\boldsymbol{s}_{n}^{B} \notag\\
&=(\mathcal{A}-\mu\mathcal{A}_{2}\mathcal{H}_{n-1})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1}+\mu\mathcal{A}_{2}d+\mu\mathcal{A}_{2}\boldsymbol{s}_{n}^{B} \notag\\
&\overset{(a)}{=}\mathcal{V}(\mathcal{P}-\mu\mathcal{V}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathcal{V})\mathcal{V}^{\sf T}\widetilde{\boldsymbol{\mathcal{W}}}_{n-1}+\mu\mathcal{A}_{2}d+\mu\mathcal{A}_{2}\boldsymbol{s}_{n}^{B} \tag{109}
\end{align}

where $(a)$ follows from (35). Multiplying (C) from the left by $\mathcal{V}^{\sf T}$, the following recursion can be obtained:

\begin{align}
\mathcal{V}^{\sf T}\widetilde{\boldsymbol{\mathcal{W}}}_{n}={}&\left[\begin{array}{c}(\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\otimes I_{M})\widetilde{\boldsymbol{\mathcal{W}}}_{n}\\ (V_{\alpha}^{\sf T}\otimes I_{M})\widetilde{\boldsymbol{\mathcal{W}}}_{n}\end{array}\right]\overset{\Delta}{=}\left[\begin{array}{c}\bar{\boldsymbol{w}}_{n}\\ \check{\boldsymbol{w}}_{n}\end{array}\right] \tag{114}\\
={}&(\mathcal{P}-\mu\mathcal{V}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathcal{V})\mathcal{V}^{\sf T}\widetilde{\boldsymbol{\mathcal{W}}}_{n-1}+\mu\mathcal{V}^{\sf T}\mathcal{A}_{2}d+\mu\mathcal{V}^{\sf T}\mathcal{A}_{2}\boldsymbol{s}_{n}^{B} \notag\\
={}&\left(\left[\begin{array}{cc}I_{M}&\boldsymbol{0}\\ \boldsymbol{0}&P_{\alpha}\otimes I_{M}\end{array}\right]-\mu\left[\begin{array}{c}\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\otimes I_{M}\\ V_{\alpha}^{\sf T}\otimes I_{M}\end{array}\right]\mathcal{A}_{2}\mathcal{H}_{n-1}\left[\frac{1}{\sqrt{K}}\mathbbm{1}\otimes I_{M}\quad V_{\alpha}\otimes I_{M}\right]\right)\left[\begin{array}{c}\bar{\boldsymbol{w}}_{n-1}\\ \check{\boldsymbol{w}}_{n-1}\end{array}\right] \tag{121}\\
&+\mu\left[\begin{array}{c}\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\otimes I_{M}\\ V_{\alpha}^{\sf T}\otimes I_{M}\end{array}\right]\mathcal{A}_{2}d+\mu\left[\begin{array}{c}\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\otimes I_{M}\\ V_{\alpha}^{\sf T}\otimes I_{M}\end{array}\right]\mathcal{A}_{2}\boldsymbol{s}_{n}^{B} \tag{126}
\end{align}

Note that according to (107), we have

\begin{equation}
(\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\otimes I_{M})\mathcal{A}_{2}d=(\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\otimes I_{M})d=\frac{1}{\sqrt{K}}\sum_{k}\nabla J_{k}(w^{\star})=0 \tag{127}
\end{equation}

Also, consider the average of the Hessian matrices of all agents denoted by

\begin{equation}
\bar{H}_{n-1}=\frac{1}{K}\sum_{k}H_{k,n-1} \tag{128}
\end{equation}

According to (108), the recursion (114) can be split as

\begin{align}
\bar{\boldsymbol{w}}_{n}={}&(I-\mu\bar{H}_{n-1})\bar{\boldsymbol{w}}_{n-1}-\mu(\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\otimes I_{M})\mathcal{H}_{n-1}(V_{\alpha}\otimes I)\check{\boldsymbol{w}}_{n-1}+\mu(\frac{1}{\sqrt{K}}\mathbbm{1}^{\sf T}\otimes I_{M})\boldsymbol{s}_{n}^{B} \tag{129}\\
\check{\boldsymbol{w}}_{n}={}&\left(P_{\alpha}\otimes I_{M}-\mu((V_{\alpha}^{\sf T}A_{2})\otimes I_{M})\mathcal{H}_{n-1}(V_{\alpha}\otimes I_{M})\right)\check{\boldsymbol{w}}_{n-1}-\mu((V_{\alpha}^{\sf T}A_{2})\otimes I_{M})\mathcal{H}_{n-1}(\frac{1}{\sqrt{K}}\mathbbm{1}\otimes I_{M})\bar{\boldsymbol{w}}_{n-1} \notag\\
&+\mu((V_{\alpha}^{\sf T}A_{2})\otimes I_{M})d+\mu((V_{\alpha}^{\sf T}A_{2})\otimes I_{M})\boldsymbol{s}_{n}^{B} \tag{130}
\end{align}
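The split into (129) and (130) can be verified numerically for one step of the recursion. The sketch below draws random symmetric Hessian blocks, a vector $d$ whose blocks sum to zero as in (107), and a random noise vector, then compares $\mathcal{V}^{\sf T}\widetilde{\boldsymbol{\mathcal{W}}}_{n}$ computed from the network recursion with the two split recursions. The ring combination matrix, the choice $A_{1}=I_{K}$, $A_{2}=A$, the sizes, and all variable names are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, mu = 5, 3, 0.05  # agents, dimension, step size (illustrative values)

# Symmetric, doubly stochastic ring matrix with uniform weights 1/3 (assumed).
A = np.zeros((K, K))
for k in range(K):
    for j in (k - 1, k, k + 1):
        A[k, j % K] = 1.0 / 3.0
A1, A2 = np.eye(K), A  # one admissible choice satisfying A1 @ A2 = A

# V_alpha, P_alpha from the eigendecomposition (eigenvalue 1 is the last one).
eigvals, eigvecs = np.linalg.eigh(A)
V_alpha, P_alpha = eigvecs[:, :-1], np.diag(eigvals[:-1])
one = np.ones((K, 1)) / np.sqrt(K)  # (1/sqrt(K)) times the all-ones vector
I_M = np.eye(M)


def ext(X):
    """Kronecker extension X (x) I_M."""
    return np.kron(X, I_M)


# Block-diagonal matrix of random symmetric Hessians and their average.
blocks = [(G + G.T) / 2 for G in rng.standard_normal((K, M, M))]
H = np.zeros((K * M, K * M))
for k, Hk in enumerate(blocks):
    H[k * M:(k + 1) * M, k * M:(k + 1) * M] = Hk
H_bar = sum(blocks) / K

# d stacks per-agent gradients at w*, constrained to sum to zero as in (107).
g = rng.standard_normal((K, M))
g -= g.mean(axis=0)
d = g.reshape(-1)
s = rng.standard_normal(K * M)       # gradient-noise realization
W_prev = rng.standard_normal(K * M)  # network error vector at time n-1

# Network recursion, first line of (109).
W_next = ext(A2) @ (ext(A1) - mu * H) @ W_prev + mu * ext(A2) @ d + mu * ext(A2) @ s

# Transformed coordinates at time n-1.
w_bar_prev = ext(one.T) @ W_prev
w_chk_prev = ext(V_alpha.T) @ W_prev

# Split recursions (129) and (130).
w_bar_next = ((I_M - mu * H_bar) @ w_bar_prev
              - mu * ext(one.T) @ H @ ext(V_alpha) @ w_chk_prev
              + mu * ext(one.T) @ s)
VA2 = ext(V_alpha.T @ A2)
w_chk_next = ((ext(P_alpha) - mu * VA2 @ H @ ext(V_alpha)) @ w_chk_prev
              - mu * VA2 @ H @ ext(one) @ w_bar_prev
              + mu * VA2 @ d + mu * VA2 @ s)

print(np.allclose(ext(one.T) @ W_next, w_bar_next))
print(np.allclose(ext(V_alpha.T) @ W_next, w_chk_next))
```

The $d$ term drops out of the recursion for $\bar{\boldsymbol{w}}_{n}$ because its blocks sum to zero, which is exactly the statement of (127).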

Note that in the centralized method, we have

\begin{equation}
P_{\alpha}=0,\quad V_{\alpha}=0 \tag{131}
\end{equation}

so that we only need to analyze the term $\bar{\boldsymbol{w}}_{n}$ for the centralized method. We start with the decentralized methods.

Conditioning both sides of (129) on $\mathcal{F}_{n-1}$, we now analyze the second-order moments of the two terms separately. First, for $\check{\boldsymbol{w}}_{n}$, consider

\begin{equation}
\mathcal{P}_{\alpha}=P_{\alpha}\otimes I_{M},\quad\mathcal{V}_{\alpha}=V_{\alpha}\otimes I_{M},\quad\mathds{1}=\mathbbm{1}\otimes I_{M} \tag{132}
\end{equation}

According to (129), we have

\begin{align}
\mathds{E}[\|\check{\boldsymbol{w}}_{n}\|^{2}|\mathcal{F}_{n-1}]\overset{(a)}{=}{}&\mathds{E}\|\left(\mathcal{P}_{\alpha}-\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\right)\check{\boldsymbol{w}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathds{1}\bar{\boldsymbol{w}}_{n-1}+\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}d\|^{2} \notag\\
&+\mu^{2}\mathds{E}[\|\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\boldsymbol{s}_{n}^{B}\|^{2}|\mathcal{F}_{n-1}] \tag{133}
\end{align}

where (a) follows from (79).

We bound the first term on the right-hand side of (133). Note that

\begin{equation}
\|P_{\alpha}\check{\boldsymbol{w}}_{n-1}\|^{2}\leq\check{\boldsymbol{w}}_{n-1}^{\sf T}P_{\alpha}^{2}\check{\boldsymbol{w}}_{n-1}\leq\lambda_{\max}(P_{\alpha}^{2})\|\check{\boldsymbol{w}}_{n-1}\|^{2} \tag{134}
\end{equation}

and we know from [6] that $0<\rho(P_{\alpha})<1$, where $\rho(P_{\alpha})$ denotes the spectral radius of $P_{\alpha}$. Letting $t=\rho(P_{\alpha})$, we have

\mathds{E}\|\left(\mathcal{P}_{\alpha}-\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\right)\check{\boldsymbol{w}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathds{1}\bar{\boldsymbol{w}}_{n-1}+\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}d\|^{2}
= \mathds{E}\|t\cdot\frac{1}{t}\mathcal{P}_{\alpha}\check{\boldsymbol{w}}_{n-1}+(1-t)\cdot\frac{1}{1-t}\left(-\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\check{\boldsymbol{w}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathds{1}\bar{\boldsymbol{w}}_{n-1}+\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}d\right)\|^{2}
\overset{(a)}{\leq} \frac{1}{t}\mathds{E}\|\mathcal{P}_{\alpha}\check{\boldsymbol{w}}_{n-1}\|^{2}+\frac{1}{(1-t)}\mathds{E}\|-\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\check{\boldsymbol{w}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathds{1}\bar{\boldsymbol{w}}_{n-1}+\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}d\|^{2}
caligraphic_V start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT caligraphic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_d ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
(b)𝑏\displaystyle\overset{(b)}{\leq}start_OVERACCENT ( italic_b ) end_OVERACCENT start_ARG ≤ end_ARG (t+O(μ2))𝔼𝒘ˇn12+O(μ2)𝔼𝒘¯n12+O(μ2)𝑡𝑂superscript𝜇2𝔼superscriptnormsubscriptˇ𝒘𝑛12𝑂superscript𝜇2𝔼superscriptnormsubscript¯𝒘𝑛12𝑂superscript𝜇2\displaystyle(t+O(\mu^{2}))\mathds{E}\|\check{\boldsymbol{w}}_{n-1}\|^{2}+O(% \mu^{2})\mathds{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{2}+O(\mu^{2})( italic_t + italic_O ( italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ) blackboard_E ∥ overroman_ˇ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_O ( italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) blackboard_E ∥ over¯ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_O ( italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) (135)

where $(a)$ and $(b)$ follow from Jensen's inequality.
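The convexity step used in $(a)$ can be checked numerically. The sketch below is only an illustration, not part of the proof: it draws random vectors `x` and `y` (hypothetical stand-ins for the two terms inside the norm) and a random weight `t`, and verifies that $\|x+y\|^2 \leq \frac{1}{t}\|x\|^2+\frac{1}{1-t}\|y\|^2$. The helper names `sq_norm` and `jensen_gap` are assumptions introduced here.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(x * x for x in v)


def jensen_gap(x, y, t):
    """Return (1/t)||x||^2 + (1/(1-t))||y||^2 - ||x + y||^2.

    Writing x + y = t*(x/t) + (1-t)*(y/(1-t)) and applying Jensen's
    inequality to the convex function ||.||^2 shows the gap is >= 0.
    """
    lhs = sq_norm([a + b for a, b in zip(x, y)])
    rhs = sq_norm(x) / t + sq_norm(y) / (1.0 - t)
    return rhs - lhs


random.seed(0)
gaps = []
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(5)]
    y = [random.gauss(0, 1) for _ in range(5)]
    t = random.uniform(0.01, 0.99)
    gaps.append(jensen_gap(x, y, t))

# Smallest gap over all random trials; should not be negative.
min_gap = min(gaps)

# Equality case: x = t*z and y = (1-t)*z for a common vector z.
equality_gap = jensen_gap([0.3, 0.6], [0.7, 1.4], 0.3)
```

The bound is tight when the two rescaled vectors coincide, which is why the choice of $t$ matters for how sharp the resulting recursion is.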

We now bound the second term of (133), which is related to the gradient noise. Consider

\begin{equation}
\check{\boldsymbol{s}}_{n}^{B}=\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\boldsymbol{s}_{n}^{B} \tag{136}
\end{equation}

we have

\begin{equation}
\mu^{2}\mathds{E}\left[\|\check{\boldsymbol{s}}_{n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right]\leq\mu^{2}\|\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\|^{2}\,\mathds{E}\left[\|\boldsymbol{s}_{n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right] \tag{137}
\end{equation}

It therefore remains to bound $\mathds{E}\left[\|\boldsymbol{s}_{n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right]$, for which we have

\begin{align}
\mathds{E}\left[\|\boldsymbol{s}_{n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right] &= \sum\limits_{k}\mathds{E}\left[\|\boldsymbol{s}_{k,n}^{B}(\boldsymbol{w}_{k,n-1})\|^{2}\,|\,\mathcal{F}_{n-1}\right] \nonumber\\
&\overset{(a)}{\leq} O\left(\frac{1}{B}\right)\sum\limits_{k}\mathds{E}\|\tilde{\boldsymbol{w}}_{k,n-1}\|^{2}+O\left(\frac{1}{B}\right) \nonumber\\
&\leq O\left(\frac{1}{B}\right)\mathds{E}\|{\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}}_{n-1}\|^{2}+O\left(\frac{1}{B}\right) \nonumber\\
&= O\left(\frac{1}{B}\right)\left(\mathds{E}\|\mathcal{V}\mathcal{V}^{\sf T}{\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}}_{n-1}\|^{2}\right)+O\left(\frac{1}{B}\right) \nonumber\\
&\leq O\left(\frac{1}{B}\right)\left(\mathds{E}\|\check{\boldsymbol{w}}_{n-1}\|^{2}+\mathds{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{2}\right)+O\left(\frac{1}{B}\right) \tag{138}
\end{align}

where $(a)$ follows from (89). Combining (133), (135), (137) and (138), we obtain

\begin{equation}
\mathds{E}\|\check{\boldsymbol{w}}_{n}\|^{2}\leq\left(\rho(P_{\alpha})+O(\mu^{2})\right)\mathds{E}\|\check{\boldsymbol{w}}_{n-1}\|^{2}+O(\mu^{2})\mathds{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{2}+O(\mu^{2}) \tag{139}
\end{equation}

We now analyse the size of $\mathds{E}\|\bar{\boldsymbol{w}}_{n}\|^{2}$. Consider

\begin{equation}
\bar{\boldsymbol{s}}^{B}_{n}\overset{\Delta}{=}\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\boldsymbol{s}_{n}^{B}=\frac{1}{\sqrt{K}}\sum\limits_{k}\boldsymbol{s}_{k,n}^{B} \tag{140}
\end{equation}

for which we have

\begin{equation}
\mathds{E}\left[\left\|\frac{1}{\sqrt{K}}\sum\limits_{k}\boldsymbol{s}_{k,n}^{B}\right\|^{2}\,\Big|\,\mathcal{F}_{n-1}\right]\overset{(a)}{\leq}\frac{1}{K}\sum\limits_{k}\mathds{E}\left[\|\boldsymbol{s}_{k,n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right]=\frac{1}{K}\mathds{E}\left[\|\boldsymbol{s}_{n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right] \tag{141}
\end{equation}

where $(a)$ follows from the sampling independence among agents.
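Step $(a)$ can be unpacked as follows. This is a sketch under the assumption, not restated in this passage, that each agent's gradient noise has zero mean conditioned on $\mathcal{F}_{n-1}$; together with independence across agents, this makes every cross term vanish:

```latex
\begin{align*}
\mathds{E}\left[\left\|\frac{1}{\sqrt{K}}\sum_{k}\boldsymbol{s}_{k,n}^{B}\right\|^{2}\Big|\,\mathcal{F}_{n-1}\right]
&=\frac{1}{K}\sum_{k}\mathds{E}\left[\|\boldsymbol{s}_{k,n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right]
+\frac{1}{K}\sum_{k\neq\ell}\mathds{E}\left[(\boldsymbol{s}_{k,n}^{B})^{\sf T}\boldsymbol{s}_{\ell,n}^{B}\,|\,\mathcal{F}_{n-1}\right]\\
&=\frac{1}{K}\sum_{k}\mathds{E}\left[\|\boldsymbol{s}_{k,n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right]
+\frac{1}{K}\sum_{k\neq\ell}\mathds{E}\left[\boldsymbol{s}_{k,n}^{B}\,|\,\mathcal{F}_{n-1}\right]^{\sf T}\mathds{E}\left[\boldsymbol{s}_{\ell,n}^{B}\,|\,\mathcal{F}_{n-1}\right]\\
&=\frac{1}{K}\sum_{k}\mathds{E}\left[\|\boldsymbol{s}_{k,n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right]
\end{align*}
```

Under these two assumptions the relation holds with equality, so the inequality stated in (141) is satisfied.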

Next, choose $0<t=1-O(\mu)<1$. Using (129), we have

\begin{align}
\mathds{E}\left[\|\bar{\boldsymbol{w}}_{n}\|^{2}\,|\,\mathcal{F}_{n-1}\right] &\overset{(a)}{=} \mathds{E}\left\|(I-\mu\bar{H}_{n-1})\bar{\boldsymbol{w}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathds{1}^{\sf T}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\check{\boldsymbol{w}}_{n-1}\right\|^{2}+\frac{\mu^{2}}{K}\mathds{E}\left[\|\mathds{1}^{\sf T}\boldsymbol{s}_{n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right] \nonumber\\
&\overset{(b)}{\leq} \frac{1}{t}\mathds{E}\|(I-\mu\bar{H}_{n-1})\bar{\boldsymbol{w}}_{n-1}\|^{2}+\frac{\mu^{2}}{K(1-t)}\mathds{E}\|\mathds{1}^{\sf T}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\check{\boldsymbol{w}}_{n-1}\|^{2}+\frac{\mu^{2}}{K}\mathds{E}\left[\|\boldsymbol{s}_{n}^{B}\|^{2}\,|\,\mathcal{F}_{n-1}\right] \nonumber\\
&\overset{(c)}{\leq} \left(\frac{(1+\mu L)^{2}}{1-O(\mu)}+O\left(\frac{\mu^{2}}{B}\right)\right)\mathds{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{2}+O(\mu)\mathds{E}\|\check{\boldsymbol{w}}_{n-1}\|^{2}+O\left(\frac{\mu^{2}}{B}\right) \tag{142}
\end{align}

where $(a)$ follows from (79), $(b)$ follows from Jensen's inequality and (141), and $(c)$ follows from (138) and the Lipschitz condition in Assumption III.2.

Combining (139) and (142), we obtain

\begin{equation}
\mathds{E}\|\mathcal{V}^{\sf T}{\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}}_{n}\|^{2}=\mathds{E}\left[\begin{array}{c}\|\bar{\boldsymbol{w}}_{n}\|^{2}\\ \|\check{\boldsymbol{w}}_{n}\|^{2}\end{array}\right]\leq\left[\begin{array}{cc}\frac{(1+\mu L)^{2}}{1-O(\mu)}&O(\mu)\\ O(\mu^{2})&\rho(P_{\alpha})+O(\mu^{2})\end{array}\right]\left[\begin{array}{c}\mathds{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{2}\\ \mathds{E}\|\check{\boldsymbol{w}}_{n-1}\|^{2}\end{array}\right]+\left[\begin{array}{c}O(\frac{\mu^{2}}{B})\\ O(\mu^{2})\end{array}\right] \tag{151}
\end{equation}

Let

\begin{equation}
\Gamma_{1}=\left[\begin{array}{cc}\frac{(1+\mu L)^{2}}{1-O(\mu)}&O(\mu)\\ O(\mu^{2})&\rho(P_{\alpha})+O(\mu^{2})\end{array}\right] \tag{152}
\end{equation}

and by iterating (151), we obtain

\begin{align}
\mathds{E}\|\mathcal{V}^{\sf T}{\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}}_{n}\|^{2} &\leq \Gamma_{1}^{n+1}\mathds{E}\|\mathcal{V}^{\sf T}{\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}}_{-1}\|^{2}+\sum\limits_{i=0}^{n}\Gamma_{1}^{i}\left[\begin{array}{c}O(\frac{\mu^{2}}{B})\\ O(\mu^{2})\end{array}\right] \nonumber\\
&= \Gamma_{1}^{n+1}\mathds{E}\|\mathcal{V}^{\sf T}{\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}}_{-1}\|^{2}+(I-\Gamma_{1})^{-1}(I-\Gamma_{1}^{n+1})\left[\begin{array}{c}O(\frac{\mu^{2}}{B})\\ O(\mu^{2})\end{array}\right] \tag{157}
\end{align}
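The second equality in (157) is the matrix geometric-series identity $\sum_{i=0}^{n}\Gamma_1^{i}=(I-\Gamma_1)^{-1}(I-\Gamma_1^{n+1})$, which holds whenever $I-\Gamma_1$ is invertible. The sketch below checks it numerically for one $2\times 2$ instance. The values of `mu`, `L` and `rho`, and the choice of replacing every $O(\cdot)$ term by a constant equal to one, are assumptions made only for illustration.

```python
def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def matadd(a, b, sign=1.0):
    """Entrywise a + sign * b for 2x2 matrices."""
    return [[a[i][j] + sign * b[i][j] for j in range(2)] for i in range(2)]


def inverse(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]


def geometric_sum(gamma, n):
    """Return (sum_{i=0}^{n} gamma^i, gamma^(n+1)) by direct accumulation."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    power = IDENTITY
    for _ in range(n + 1):
        total = matadd(total, power)
        power = matmul(power, gamma)
    return total, power


# Hypothetical constants standing in for the O(.) terms of Gamma_1.
mu, L, rho = 0.01, 1.0, 0.5
gamma = [[(1 + mu * L) ** 2 / (1 - mu), mu],
         [mu ** 2, rho + mu ** 2]]

n = 50
direct, gamma_pow = geometric_sum(gamma, n)
closed = matmul(inverse(matadd(IDENTITY, gamma, -1.0)),
                matadd(IDENTITY, gamma_pow, -1.0))

# Largest entrywise difference between the direct sum and the closed form.
max_err = max(abs(direct[i][j] - closed[i][j])
              for i in range(2) for j in range(2))
```

Note that the identity does not require $\rho(\Gamma_1)<1$; here the top-left entry exceeds one, which is why the analysis restricts the horizon to $n\leq O(1/\mu)$.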

To proceed, it is necessary to compute $(I-\Gamma_{1})^{-1}(I-\Gamma_{1}^{n+1})$. Since

\begin{equation}
1-\frac{(1+\mu L)^{2}}{1-O(\mu)}=\frac{(1-O(\mu))-(1+\mu L)^{2}}{1-O(\mu)}=-O(\mu) \tag{158}
\end{equation}

we have

\begin{align}
(I-\Gamma_{1})^{-1} &= \left[\begin{array}{cc}1-\frac{(1+\mu L)^{2}}{1-O(\mu)}&-O(\mu)\\ -O(\mu^{2})&1-\rho(P_{\alpha})-O(\mu^{2})\end{array}\right]^{-1}=\left[\begin{array}{cc}-O(\mu)&-O(\mu)\\ -O(\mu^{2})&O(1)\end{array}\right]^{-1} \tag{163}\\
&= -\frac{1}{O(\mu)}\left[\begin{array}{cc}O(1)&O(\mu)\\ O(\mu^{2})&-O(\mu)\end{array}\right]=\left[\begin{array}{cc}-O(\frac{1}{\mu})&-O(1)\\ -O(\mu)&O(1)\end{array}\right] \tag{168}
\end{align}
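The orders of magnitude in (168) can be sanity-checked numerically. The sketch below builds one concrete $\Gamma_1$ by replacing each $O(\cdot)$ term with a hypothetical constant (`L = 1`, `c = 1`, `rho = 0.5` are assumptions, not values from the paper), inverts $I-\Gamma_1$ for two step sizes, and rescales each entry by its claimed order. If the orders are right, the rescaled entries should be nearly independent of $\mu$.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def i_minus_gamma_inverse(mu, L=1.0, c=1.0, rho=0.5):
    """(I - Gamma_1)^{-1} for one hypothetical instance of Gamma_1."""
    gamma = [[(1 + mu * L) ** 2 / (1 - c * mu), mu],
             [mu ** 2, rho + mu ** 2]]
    i_minus = [[1 - gamma[0][0], -gamma[0][1]],
               [-gamma[1][0], 1 - gamma[1][1]]]
    return inverse_2x2(i_minus)


def scaled_entries(mu):
    """Entries of the inverse divided by their claimed orders.

    Claimed orders in (168): 1/mu, 1 (top row) and mu, 1 (bottom row).
    """
    inv = i_minus_gamma_inverse(mu)
    return [inv[0][0] * mu, inv[0][1], inv[1][0] / mu, inv[1][1]]


coarse = scaled_entries(1e-2)
fine = scaled_entries(1e-3)

# Largest change in any rescaled entry when mu shrinks by a factor of 10.
max_drift = max(abs(a - b) for a, b in zip(coarse, fine))
```

For this instance the rescaled entries also carry the signs shown in (168): the first three are negative and the last is positive.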

As for the matrix power $\Gamma_{1}^{n+1}$,

\begin{equation}
\Gamma_{1}^{n+1}=\left[\begin{array}{cc}\left(\frac{(1+\mu L)^{2}}{1-O(\mu)}\right)^{n+1}&O(\mu)\\ O(\mu^{2})&\rho^{n+1}(P_{\alpha})\end{array}\right] \tag{171}
\end{equation}

for which, by resorting to Lemma 2 in [11] with $n\leq O(\frac{1}{\mu})$, we have

\begin{equation}
\left(\frac{(1+\mu L)^{2}}{1-O(\mu)}\right)^{n+1}=O(1),\quad\quad 1-\left(\frac{(1+\mu L)^{2}}{1-O(\mu)}\right)^{n+1}=-O(1) \tag{172}
\end{equation}
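The first relation in (172) reflects the familiar fact that a factor of the form $1+O(\mu)$ raised to a power of order $1/\mu$ stays bounded. The sketch below illustrates this for one hypothetical choice of constants (`L = 1` and `c = 1` standing in for the $O(\mu)$ term); it is an illustration of the scaling, not a restatement of the cited lemma.

```python
import math


def growth_factor(mu, L=1.0, c=1.0):
    """Top-left entry of Gamma_1 for hypothetical constants L and c."""
    return (1 + mu * L) ** 2 / (1 - c * mu)


def power_at_horizon(mu, L=1.0, c=1.0):
    """growth_factor ** (n + 1) evaluated at the horizon n = round(1/mu)."""
    n = int(round(1.0 / mu))
    return growth_factor(mu, L, c) ** (n + 1)


step_sizes = [1e-1, 1e-2, 1e-3, 1e-4]
values = [power_at_horizon(mu) for mu in step_sizes]

# As mu -> 0 the power approaches exp(2L + c), a constant independent of mu.
limit = math.exp(2 * 1.0 + 1.0)
```

In this instance the powers stay above one and settle near `limit` as `mu` shrinks, which is consistent with both relations in (172).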

then we have

IΓ1n+1=[1((1+μL)21O(μ))n+1O(μ)O(μ2)1ρn+1(Pα)]=[O(1)O(μ)O(μ2)O(1)]𝐼superscriptsubscriptΓ1𝑛1delimited-[]1superscriptsuperscript1𝜇𝐿21𝑂𝜇𝑛1𝑂𝜇𝑂superscript𝜇21superscript𝜌𝑛1subscript𝑃𝛼delimited-[]𝑂1𝑂𝜇𝑂superscript𝜇2𝑂1\displaystyle I-\Gamma_{1}^{n+1}=\left[\begin{array}[]{cc}1-\left(\frac{(1+\mu L% )^{2}}{1-O(\mu)}\right)^{n+1}&-O(\mu)\\ -O(\mu^{2})&1-\rho^{n+1}(P_{\alpha})\end{array}\right]=\left[\begin{array}[]{% cc}-O(1)&-O(\mu)\\ -O(\mu^{2})&O(1)\end{array}\right]italic_I - roman_Γ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT = [ start_ARRAY start_ROW start_CELL 1 - ( divide start_ARG ( 1 + italic_μ italic_L ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_O ( italic_μ ) end_ARG ) start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT end_CELL start_CELL - italic_O ( italic_μ ) end_CELL end_ROW start_ROW start_CELL - italic_O ( italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) end_CELL start_CELL 1 - italic_ρ start_POSTSUPERSCRIPT italic_n + 1 end_POSTSUPERSCRIPT ( italic_P start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARRAY ] = [ start_ARRAY start_ROW start_CELL - italic_O ( 1 ) end_CELL start_CELL - italic_O ( italic_μ ) end_CELL end_ROW start_ROW start_CELL - italic_O ( italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) end_CELL start_CELL italic_O ( 1 ) end_CELL end_ROW end_ARRAY ] (177)

Also, substituting (171), (172) and Assumption III.1 into the term $\Gamma_{1}^{n+1}\|\mathcal{V}^{\sf T}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}\|^{2}$, we obtain

$$\Gamma_{1}^{n+1}\mathds{E}\|\mathcal{V}^{\sf T}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}\|^{2}\leq o\left(\frac{\mu}{B}\right) \tag{178}$$

Then, substituting (163), (177) and (178) into (157), we obtain

$$\mathds{E}\left[\begin{array}{c}\|\bar{\boldsymbol{w}}_{n}\|^{2}\\ \|\check{\boldsymbol{w}}_{n}\|^{2}\end{array}\right]=\mathds{E}\|\mathcal{V}^{\sf T}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\leq o\left(\frac{\mu}{B}\right)+\left[\begin{array}{cc}-O(\frac{1}{\mu})&-O(1)\\ -O(\mu)&O(1)\end{array}\right]\left[\begin{array}{cc}-O(1)&-O(\mu)\\ -O(\mu^{2})&O(1)\end{array}\right]\left[\begin{array}{c}O(\frac{\mu^{2}}{B})\\ O(\mu^{2})\end{array}\right]\leq\left[\begin{array}{c}O(\frac{\mu}{B})+O(\mu^{2})\\ O(\mu^{2})\end{array}\right] \tag{189}$$

from which we have

$$\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\leq\|\mathcal{V}\|^{2}\,\mathds{E}\|\mathcal{V}^{\sf T}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\leq O\left(\frac{\mu}{B}\right)+O(\mu^{2}) \tag{190}$$

According to (49), we obtain

$$\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\leq O(\mu^{1+\eta})+O(\mu^{2})=O(\mu^{\gamma}) \tag{191}$$

where $\gamma=\min\{1+\eta,2\}$.

We finally examine the bounds of the centralized method. As mentioned via (131), in the centralized method, $\check{\boldsymbol{w}}_{n}$ is always $0$. Thus, we only need to analyze $\mathds{E}\|\bar{\boldsymbol{w}}_{n}\|^{2}$, for which according to (C), we have:

$$\mathds{E}\|\bar{\boldsymbol{w}}_{n}\|^{2}\leq\left(\frac{(1+\mu L)^{2}}{1-O(\mu)}+O\left(\frac{\mu^{2}}{B}\right)\right)^{n+1}\mathds{E}\|\bar{\boldsymbol{w}}_{-1}\|^{2}+\frac{1-\left(\frac{(1+\mu L)^{2}}{1-O(\mu)}+O\left(\frac{\mu^{2}}{B}\right)\right)^{n+1}}{1-\left(\frac{(1+\mu L)^{2}}{1-O(\mu)}+O\left(\frac{\mu^{2}}{B}\right)\right)}\times O\left(\frac{\mu^{2}}{B}\right) \tag{192}$$
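The bound (192) has the shape of a scalar recursion of the type $x_{n}\leq a\,x_{n-1}+c$ unrolled over $n+1$ steps. The following Python snippet is a minimal numerical sketch of that unrolling, not part of the original proof. The values of `mu`, `L`, `B` and the initial value are illustrative assumptions, and every hidden constant in the $O(\cdot)$ terms is set to 1. It checks that iterating the recursion with equality reproduces the geometric-series expression.

```python
import math


def unroll(a, c, x_init, steps):
    """Iterate x <- a * x + c for `steps` steps, starting from x_init."""
    x = x_init
    for _ in range(steps):
        x = a * x + c
    return x


def closed_form(a, c, x_init, steps):
    """Geometric-series form: a^steps * x_init + (1 - a^steps) / (1 - a) * c."""
    growth = a ** steps
    return growth * x_init + (1.0 - growth) / (1.0 - a) * c


# Illustrative (assumed) values; every O(.) constant is taken to be 1.
mu, L, B = 0.01, 1.0, 100
a = (1.0 + mu * L) ** 2 / (1.0 - mu) + mu ** 2 / B  # per-step factor in (192)
c = mu ** 2 / B                                      # driving noise term
n = 49                                               # indices -1..n give n + 1 steps

iterated = unroll(a, c, x_init=1e-4, steps=n + 1)
formula = closed_form(a, c, x_init=1e-4, steps=n + 1)
print(iterated, formula)
```

With these assumed values the factor `a` is slightly larger than one, so both the numerator and the denominator of the fraction in (192) are negative and their ratio stays positive.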

Following the same steps as in the derivation (171)–(191), we obtain for the centralized method

$$\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\leq O\left(\frac{\mu}{B}\right)=O(\mu^{1+\eta}) \tag{193}$$

By comparing (191) and (193), we see that the bound for the decentralized methods contains an extra $O(\mu^{2})$ noise term related to the network heterogeneity and graph structure.
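To make the comparison concrete, the sketch below evaluates both bounds with all $O(\cdot)$ constants set to 1. That normalization, the function names, and the chosen values of `mu` and `eta` are assumptions made only for illustration. It shows that the extra $O(\mu^{2})$ term is negligible when $\eta<1$ and dominant when $\eta>1$, which is what the exponent $\gamma=\min\{1+\eta,2\}$ encodes.

```python
def decentralized_exponent(eta):
    """gamma = min{1 + eta, 2}, the exponent appearing in (191)."""
    return min(1.0 + eta, 2.0)


def centralized_exponent(eta):
    """Exponent 1 + eta appearing in (193)."""
    return 1.0 + eta


def bounds(mu, eta):
    """Return (decentralized, centralized) bounds with all O(.) constants set to 1."""
    centralized = mu ** (1.0 + eta)
    decentralized = centralized + mu ** 2
    return decentralized, centralized


for eta in (0.5, 1.0, 1.5):
    dec, cen = bounds(mu=0.01, eta=eta)
    # The last column is the ratio decentralized / centralized.
    print(eta, decentralized_exponent(eta), centralized_exponent(eta), dec / cen)
```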

Appendix D Proof for Lemmas III.6 and III.7: Upper bound for the fourth-order moment

In this section, we analyze the size of $\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{4}$ for later use. We recall the relation of $\bar{\boldsymbol{w}}_{n}$ to $\check{\boldsymbol{w}}_{n}$ in (129), and use the following equality:

$$\|\boldsymbol{a}+\boldsymbol{b}\|^{4}=\|\boldsymbol{a}\|^{4}+\|\boldsymbol{b}\|^{4}+2\|\boldsymbol{a}\|^{2}\|\boldsymbol{b}\|^{2}+4\boldsymbol{a}^{\sf T}\boldsymbol{b}\|\boldsymbol{a}\|^{2}+4\boldsymbol{a}^{\sf T}\boldsymbol{b}\|\boldsymbol{b}\|^{2}+4(\boldsymbol{a}^{\sf T}\boldsymbol{b})^{2} \tag{194}$$

where $\boldsymbol{a}$ and $\boldsymbol{b}$ are two column vectors. When $\mathds{E}\boldsymbol{b}=0$, with the Cauchy–Schwarz inequality, we have:

$$\mathds{E}\|\boldsymbol{a}+\boldsymbol{b}\|^{4}\leq\mathds{E}\|\boldsymbol{a}\|^{4}+3\mathds{E}\|\boldsymbol{b}\|^{4}+8\mathds{E}\|\boldsymbol{a}\|^{2}\|\boldsymbol{b}\|^{2} \tag{195}$$
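The step from (194) to (195) can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code. It verifies the identity (194) on random vectors, and then verifies a pointwise version of (195) in which the cross term $4\boldsymbol{a}^{\sf T}\boldsymbol{b}\|\boldsymbol{a}\|^{2}$ is kept on the right-hand side. That cross term is the one that vanishes in expectation when $\boldsymbol{b}$ has zero mean given $\boldsymbol{a}$ (our reading of the condition $\mathds{E}\boldsymbol{b}=0$), while the remaining cross terms are bounded using $(\boldsymbol{a}^{\sf T}\boldsymbol{b})^{2}\leq\|\boldsymbol{a}\|^{2}\|\boldsymbol{b}\|^{2}$ and $2xy\leq x^{2}+y^{2}$. The helper names and the Gaussian test vectors are hypothetical choices.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def fourth_power(u):
    """Return ||u||^4."""
    return dot(u, u) ** 2


def expansion_194(a, b):
    """Right-hand side of the identity (194)."""
    na, nb, ab = dot(a, a), dot(b, b), dot(a, b)
    return na**2 + nb**2 + 2 * na * nb + 4 * ab * na + 4 * ab * nb + 4 * ab**2


def pointwise_gap_195(a, b):
    """Right-hand side of (195) plus the zero-mean cross term, minus ||a + b||^4."""
    na, nb, ab = dot(a, a), dot(b, b), dot(a, b)
    upper = na**2 + 3 * nb**2 + 8 * na * nb + 4 * ab * na
    total = [x + y for x, y in zip(a, b)]
    return upper - fourth_power(total)


rng = random.Random(0)  # fixed seed so the check is reproducible
samples = [
    ([rng.gauss(0, 1) for _ in range(5)], [rng.gauss(0, 1) for _ in range(5)])
    for _ in range(1000)
]

identity_ok = all(
    math.isclose(
        fourth_power([x + y for x, y in zip(a, b)]),
        expansion_194(a, b),
        rel_tol=1e-9,
        abs_tol=1e-9,
    )
    for a, b in samples
)
smallest_gap = min(pointwise_gap_195(a, b) for a, b in samples)
print(identity_ok, smallest_gap)
```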

We first analyze the size of $\mathds{E}\|\bar{\boldsymbol{w}}_{n}\|^{4}$. Letting $t=1-O(\mu)$, we have

$$\begin{aligned}
\mathds{E}\|\bar{\boldsymbol{w}}_{n}\|^{4}\overset{(a)}{\leq}{}&\mathds{E}\left\|(I-\mu\bar{H}_{n-1})\bar{\boldsymbol{w}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathds{1}^{\sf T}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\check{\boldsymbol{w}}_{n-1}\right\|^{4}+3\mu^{4}\mathds{E}\left\|\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\boldsymbol{s}_{n}^{B}\right\|^{4}\\
&+8\mu^{2}\mathds{E}\left\|(I-\mu\bar{H}_{n-1})\bar{\boldsymbol{w}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathds{1}^{\sf T}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\check{\boldsymbol{w}}_{n-1}\right\|^{2}\left\|\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\boldsymbol{s}_{n}^{B}\right\|^{2}\\
\overset{(b)}{\leq}{}&\frac{1}{t^{3}}\mathds{E}\|(I-\mu\bar{H}_{n-1})\bar{\boldsymbol{w}}_{n-1}\|^{4}+\frac{O(\mu^{4})}{(1-t)^{3}}\mathds{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}+3\mu^{4}\mathds{E}\left\|\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\boldsymbol{s}_{n}^{B}\right\|^{4}\\
&+8\mu^{2}\mathds{E}\left[\left(\frac{1}{t}\|(I-\mu\bar{H}_{n-1})\bar{\boldsymbol{w}}_{n-1}\|^{2}+\frac{O(\mu^{2})}{1-t}\|\check{\boldsymbol{w}}_{n-1}\|^{2}\right)\cdot\left\|\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\boldsymbol{s}_{n}^{B}\right\|^{2}\right]\\
\leq{}&\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}\mathds{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}+O(\mu)\mathds{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}+O(\mu^{4})\mathds{E}\left\|\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\boldsymbol{s}_{n}^{B}\right\|^{4}+O(\mu^{2})\mathds{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{2}\left\|\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\boldsymbol{s}_{n}^{B}\right\|^{2}\\
&+O(\mu^{3})\mathds{E}\|\check{\boldsymbol{w}}_{n-1}\|^{2}\left\|\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\boldsymbol{s}_{n}^{B}\right\|^{2}
\end{aligned} \tag{196}$$

where $(a)$ follows from (79) and (195), and $(b)$ follows from Jensen's inequality.
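The Jensen step can be made explicit. Writing $\boldsymbol{x}+\boldsymbol{y}=t\cdot\frac{\boldsymbol{x}}{t}+(1-t)\cdot\frac{\boldsymbol{y}}{1-t}$ for $t\in(0,1)$ and using the convexity of $\|\cdot\|^{4}$ gives $\|\boldsymbol{x}+\boldsymbol{y}\|^{4}\leq\frac{1}{t^{3}}\|\boldsymbol{x}\|^{4}+\frac{1}{(1-t)^{3}}\|\boldsymbol{y}\|^{4}$. With $t=1-O(\mu)$, the factor $\frac{1}{(1-t)^{3}}$ is $O(\mu^{-3})$, which is why an $O(\mu^{4})$ coefficient turns into $O(\mu)$. The Python sketch below checks this pointwise inequality on random vectors. The function names, the dimension, and the sampling range for `t` are illustrative assumptions.

```python
import math
import random


def sq_norm(u):
    """Return ||u||^2 for a list of floats."""
    return sum(x * x for x in u)


def jensen_bound(x, y, t):
    """Upper bound ||x||^4 / t^3 + ||y||^4 / (1 - t)^3, valid for 0 < t < 1."""
    return sq_norm(x) ** 2 / t**3 + sq_norm(y) ** 2 / (1.0 - t) ** 3


def sum_fourth_power(x, y):
    """Return ||x + y||^4."""
    return sq_norm([a + b for a, b in zip(x, y)]) ** 2


rng = random.Random(1)  # fixed seed so the check is reproducible
worst_gap = math.inf
for _ in range(1000):
    x = [rng.gauss(0, 1) for _ in range(4)]
    y = [rng.gauss(0, 1) for _ in range(4)]
    t = rng.uniform(0.05, 0.95)
    worst_gap = min(worst_gap, jensen_bound(x, y, t) - sum_fourth_power(x, y))

# The bound is tight when x / t equals y / (1 - t).
v = [1.0, -2.0, 0.5, 3.0]
t_eq = 0.3
x_eq = [t_eq * c for c in v]
y_eq = [(1.0 - t_eq) * c for c in v]
print(worst_gap, jensen_bound(x_eq, y_eq, t_eq), sum_fourth_power(x_eq, y_eq))
```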

Then, for $\check{\boldsymbol{w}}_{n-1}$, similar to the proof when analyzing the second-order error in Appendix C, we consider $t=\rho(P_{\alpha})$, which is the spectral radius of $P_{\alpha}$, and obtain

$$\begin{aligned}
\mathds{E}\|\check{\boldsymbol{w}}_{n}\|^{4}\overset{(a)}{\leq}{}&\mathds{E}\left\|\left(\mathcal{P}_{\alpha}-\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\right)\check{\boldsymbol{w}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathds{1}\bar{\boldsymbol{w}}_{n-1}+\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}d\right\|^{4}+3\mu^{4}\mathds{E}\|\mathcal{V}^{\sf T}_{\alpha}\mathcal{A}_{2}\boldsymbol{s}_{n}^{B}\|^{4}\\
&+8\mu^{2}\mathds{E}\left[\|\mathcal{V}^{\sf T}_{\alpha}\mathcal{A}_{2}\boldsymbol{s}_{n}^{B}\|^{2}\left\|\left(\mathcal{P}_{\alpha}-\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathcal{V}_{\alpha}\right)\check{\boldsymbol{w}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}_{n-1}\mathds{1}\bar{\boldsymbol{w}}_{n-1}+\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}d\right\|^{2}\right]\\
\overset{(b)}{\leq}{}&(\rho(P_{\alpha})+O(\mu^{4}))\mathds{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}+O(\mu^{4})\mathds{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}+O(\mu^{4})\mathds{E}\|d\|^{4}+3\mu^{4}\mathds{E}\|\mathcal{V}^{\sf T}_{\alpha}\mathcal{A}_{2}\boldsymbol{s}_{n}^{B}\|^{4}\\
&+8\mu^{2}\mathds{E}\left[\|\mathcal{V}^{\sf T}_{\alpha}\mathcal{A}_{2}\boldsymbol{s}_{n}^{B}\|^{2}\left((\rho(P_{\alpha})+O(\mu^{2}))\|\check{\boldsymbol{w}}_{n-1}\|^{2}+O(\mu^{2})\|\bar{\boldsymbol{w}}_{n-1}\|^{2}+O(\mu^{2})\|d\|^{2}\right)\right]
\end{aligned} \tag{197}$$

where $(a)$ follows from (79) and (195), and $(b)$ follows from Jensen's inequality.

To proceed with the analysis of (196) and (197), we need to bound the fourth-order moment associated with the gradient noise. Recalling (136) and (140), we have

$$\mathds{E}\|\bar{\boldsymbol{s}}^{B}_{n}\|^{4}+\mathds{E}\|\check{\boldsymbol{s}}_{n}^{B}\|^{4}\leq\mathds{E}\left[\|\bar{\boldsymbol{s}}^{B}_{n}\|^{2}+\|\check{\boldsymbol{s}}^{B}_{n}\|^{2}\right]^{2}=\mathds{E}\|\mathcal{V}^{\sf T}\mathcal{A}_{2}\boldsymbol{s}_{n}^{B}\|^{4}\leq O(1)\,\mathds{E}\|\boldsymbol{s}_{n}^{B}\|^{4} \tag{198}$$

Also,

$$
\begin{aligned}
\mathbb{E}\|\boldsymbol{s}^{B}_{n}\|^{4}
&=\mathbb{E}\left[\|\boldsymbol{s}^{B}_{n}\|^{2}\right]^{2}
=\mathbb{E}\left[\sum_{k}\|\boldsymbol{s}_{k,n}^{B}\|^{2}\right]^{2}
=\mathbb{E}\left[\sum_{k}\frac{1}{K}\cdot K\|\boldsymbol{s}_{k,n}^{B}\|^{2}\right]^{2}
\overset{(a)}{\leq}K\sum_{k}\mathbb{E}\|\boldsymbol{s}_{k,n}^{B}\|^{4}\\
&\overset{(b)}{\leq}O\!\left(\frac{1}{B^{2}}\right)\sum_{k}\mathbb{E}\|\tilde{\boldsymbol{w}}_{k,n}\|^{4}+O\!\left(\frac{1}{B^{2}}\right)
\end{aligned}
\tag{199}
$$

where $(a)$ follows from Jensen's inequality, and $(b)$ follows from (A). Moreover, we have

$$
\sum_{k}\|\tilde{\boldsymbol{w}}_{k,n}\|^{4}
\leq\left(\sum_{k}\|\tilde{\boldsymbol{w}}_{k,n}\|^{2}\right)^{2}
=\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}
\tag{200}
$$
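Step $(a)$ in (199) and the inequality in (200) are the two sides of one elementary bound on nonnegative numbers $a_k$, namely $\sum_k a_k^2 \leq (\sum_k a_k)^2 \leq K\sum_k a_k^2$. The snippet below is a minimal numerical sanity check of that bound, not part of the proof. The list `a` is a hypothetical stand-in for the values $\|\boldsymbol{s}_{k,n}^{B}\|^{2}$ or $\|\tilde{\boldsymbol{w}}_{k,n}\|^{2}$, and the function name is illustrative only.

```python
import random


def check_power_sum_bounds(a):
    """Check sum(a_k^2) <= (sum a_k)^2 <= K * sum(a_k^2) for nonnegative a_k.

    Returns a pair (lower_ok, upper_ok). The lower bound mirrors (200) and
    the upper bound mirrors step (a) in (199).
    """
    K = len(a)
    sum_sq = sum(x * x for x in a)      # sum_k a_k^2
    sq_sum = sum(a) ** 2                # (sum_k a_k)^2
    tol = 1e-12 * max(1.0, sq_sum)      # small slack for floating-point rounding
    lower_ok = sum_sq <= sq_sum + tol
    upper_ok = sq_sum <= K * sum_sq + tol
    return lower_ok, upper_ok


random.seed(0)
results = [
    check_power_sum_bounds([random.random() for _ in range(K)])
    for K in (1, 2, 5, 20)
    for _ in range(100)
]
all_ok = all(lo and up for lo, up in results)
print(all_ok)
```

The upper bound is tight when all $a_k$ are equal, which is why the factor $K$ in (199) cannot be improved in general.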

Substituting (200) into (D), we have

$$
\begin{aligned}
\mathbb{E}\|\boldsymbol{s}^{B}_{n}\|^{4}
&\leq O\!\left(\frac{1}{B^{2}}\right)\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}+O\!\left(\frac{1}{B^{2}}\right)\\
&\leq O\!\left(\frac{1}{B^{2}}\right)\|\mathcal{V}\|^{4}\,\mathbb{E}\|\mathcal{V}^{\mathsf{T}}\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}+O\!\left(\frac{1}{B^{2}}\right)\\
&\overset{(a)}{\leq}O\!\left(\frac{1}{B^{2}}\right)\left(\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}+\mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}\right)+O\!\left(\frac{1}{B^{2}}\right)
\end{aligned}
\tag{201}
$$

where $(a)$ follows from Jensen's inequality, which gives

$$
\mathbb{E}\|\mathcal{V}^{\mathsf{T}}\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}
=\mathbb{E}\left[\|\bar{\boldsymbol{w}}_{n-1}\|^{2}+\|\check{\boldsymbol{w}}_{n-1}\|^{2}\right]^{2}
\leq 2\,\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}+2\,\mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}
\tag{202}
$$

Substituting (198), (D), (D) and (C) into (D), we have

$$
\begin{aligned}
\mathbb{E}\|\bar{\boldsymbol{w}}_{n}\|^{4}
\leq{}&\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}
+O(\mu)\,\mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}
+O\!\left(\frac{\mu^{4}}{B^{2}}\right)\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}
+O\!\left(\frac{\mu^{4}}{B^{2}}\right)\mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}
+O\!\left(\frac{\mu^{4}}{B^{2}}\right)\\
&+O\!\left(\frac{\mu^{2}}{B}\right)\mathbb{E}\,\|\bar{\boldsymbol{w}}_{n-1}\|^{2}\left(\|\check{\boldsymbol{w}}_{n-1}\|^{2}+\|\bar{\boldsymbol{w}}_{n-1}\|^{2}+O(1)\right)
+O\!\left(\frac{\mu^{3}}{B}\right)\mathbb{E}\,\|\check{\boldsymbol{w}}_{n-1}\|^{2}\left(\|\check{\boldsymbol{w}}_{n-1}\|^{2}+\|\bar{\boldsymbol{w}}_{n-1}\|^{2}+O(1)\right)\\
\overset{(a)}{\leq}{}&\left(\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}+O\!\left(\frac{\mu^{2}}{B}\right)\right)\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}
+O(\mu)\,\mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}
+O\!\left(\frac{\mu^{3}}{B^{2}}\right)+O\!\left(\frac{\mu^{5}}{B}\right)
\end{aligned}
\tag{203}
$$

where in $(a)$ we apply the result of (189) and the following inequality:

$$
\mathbb{E}\,\|\bar{\boldsymbol{w}}_{n-1}\|^{2}\|\check{\boldsymbol{w}}_{n-1}\|^{2}
\leq\frac{1}{2}\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}+\frac{1}{2}\mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}
\tag{204}
$$
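For completeness, (204) is the elementary bound $xy\leq\frac{1}{2}(x^{2}+y^{2})$ applied inside the expectation. A brief sketch, using the shorthand $x=\|\bar{\boldsymbol{w}}_{n-1}\|^{2}$ and $y=\|\check{\boldsymbol{w}}_{n-1}\|^{2}$ (notation introduced here only for illustration):

$$
0\leq(x-y)^{2}=x^{2}-2xy+y^{2}
\quad\Longrightarrow\quad
xy\leq\frac{1}{2}x^{2}+\frac{1}{2}y^{2}
$$

Taking expectations of both sides and substituting back $x$ and $y$ gives (204).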

Similarly, for $\mathbb{E}\|\check{\boldsymbol{w}}_{n}\|^{4}$, we substitute (198), (D) and (C) into (D), and obtain

$$
\begin{aligned}
\mathbb{E}\|\check{\boldsymbol{w}}_{n}\|^{4}
\leq{}&\left(\rho(P_{\alpha})+O(\mu^{4})\right)\mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}
+O(\mu^{4})\,\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}+O(\mu^{4})
+O\!\left(\frac{\mu^{4}}{B^{2}}\right)\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}
+O\!\left(\frac{\mu^{4}}{B^{2}}\right)\mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}
+O\!\left(\frac{\mu^{4}}{B^{2}}\right)\\
&+O\!\left(\frac{\mu^{2}}{B}\right)\mathbb{E}\left[\left(\|\bar{\boldsymbol{w}}_{n-1}\|^{2}+\|\check{\boldsymbol{w}}_{n-1}\|^{2}+O(1)\right)\left(\|\check{\boldsymbol{w}}_{n-1}\|^{2}+O(\mu^{2})\|\bar{\boldsymbol{w}}_{n-1}\|^{2}+O(\mu^{2})\right)\right]\\
\leq{}&\left(\rho(P_{\alpha})+O\!\left(\frac{\mu^{2}}{B}\right)\right)\mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}
+O\!\left(\frac{\mu^{2}}{B}\right)\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}+O(\mu^{4})
\end{aligned}
\tag{205}
$$

Combining (D) and (D), we obtain

$$
\mathbb{E}\left[\begin{array}{c}\|\bar{\boldsymbol{w}}_{n}\|^{4}\\ \|\check{\boldsymbol{w}}_{n}\|^{4}\end{array}\right]
\leq
\left[\begin{array}{cc}\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}&O(\mu)\\ O\!\left(\frac{\mu^{2}}{B}\right)&\rho(P_{\alpha})+O\!\left(\frac{\mu^{2}}{B}\right)\end{array}\right]
\left[\begin{array}{c}\mathbb{E}\|\bar{\boldsymbol{w}}_{n-1}\|^{4}\\ \mathbb{E}\|\check{\boldsymbol{w}}_{n-1}\|^{4}\end{array}\right]
+\left[\begin{array}{c}O\!\left(\frac{\mu^{3}}{B^{2}}\right)+O\!\left(\frac{\mu^{5}}{B}\right)\\ O(\mu^{4})\end{array}\right]
\tag{214}
$$

Let

$$
\Gamma_{2}=\left[\begin{array}{cc}\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}&O(\mu)\\ O\!\left(\frac{\mu^{2}}{B}\right)&\rho(P_{\alpha})+O\!\left(\frac{\mu^{2}}{B}\right)\end{array}\right]
\tag{217}
$$

and by iterating (214), we have

$$
\mathbb{E}\left[\begin{array}{c}\|\bar{\boldsymbol{w}}_{n}\|^{4}\\ \|\check{\boldsymbol{w}}_{n}\|^{4}\end{array}\right]
\leq\Gamma_{2}^{n+1}\left[\begin{array}{c}\|\bar{\boldsymbol{w}}_{-1}\|^{4}\\ \|\check{\boldsymbol{w}}_{-1}\|^{4}\end{array}\right]
+(I-\Gamma_{2})^{-1}(I-\Gamma_{2}^{n+1})\left[\begin{array}{c}O\!\left(\frac{\mu^{3}}{B^{2}}\right)+O\!\left(\frac{\mu^{5}}{B}\right)\\ O(\mu^{4})\end{array}\right]
\tag{224}
$$

where, under Assumption III.1, we have

$$
\mathbb{E}\|\bar{\boldsymbol{w}}_{-1}\|^{4}
=\frac{1}{K}\Big\|\sum_{k}\tilde{\boldsymbol{w}}_{k,-1}\Big\|^{4}
\overset{(a)}{\leq}\frac{1}{K}\sum_{k}\|\tilde{\boldsymbol{w}}_{k,-1}\|^{4}
\leq o\!\left(\frac{\mu^{2}}{B^{2}}\right)
\tag{225}
$$

where $(a)$ follows from Jensen's inequality. Also,

$$
\mathbb{E}\|\check{\boldsymbol{w}}_{-1}\|^{4}
=\mathbb{E}\|\mathcal{V}_{\alpha}^{\mathsf{T}}\widetilde{\boldsymbol{\mathcal{W}}}_{-1}\|^{4}
\leq\mathbb{E}\,\|\mathcal{V}_{\alpha}^{\mathsf{T}}\|^{4}\left(\|\widetilde{\boldsymbol{\mathcal{W}}}_{-1}\|^{2}\right)^{2}
\leq o\!\left(\frac{\mu^{2}}{B^{2}}\right)
\tag{226}
$$
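As a reminder of how the iterated bound (224) arises from the one-step recursion (214), the following sketch unrolls the recursion. Here $x_{n}$ denotes the vector on the left-hand side of (214) and $c$ the constant driving vector; both are shorthand introduced only for this illustration, and $\preceq$ denotes entrywise inequality. The sketch assumes that the entries of $\Gamma_{2}$ are nonnegative, so that entrywise inequalities are preserved under multiplication by $\Gamma_{2}$, and that $I-\Gamma_{2}$ is invertible.

$$
x_{n}\preceq\Gamma_{2}x_{n-1}+c
\preceq\Gamma_{2}^{2}x_{n-2}+(I+\Gamma_{2})c
\preceq\cdots
\preceq\Gamma_{2}^{n+1}x_{-1}+\sum_{i=0}^{n}\Gamma_{2}^{i}\,c,
\qquad
\sum_{i=0}^{n}\Gamma_{2}^{i}=(I-\Gamma_{2})^{-1}(I-\Gamma_{2}^{n+1})
$$

The bounds (225) and (226) then control the contribution of the initial term $\Gamma_{2}^{n+1}x_{-1}$.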

Note that

$$
(I-\Gamma_{2})^{-1}
=\left[\begin{array}{cc}1-\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}&-O(\mu)\\ -O\!\left(\frac{\mu^{2}}{B}\right)&1-\rho(P_{\alpha})+O\!\left(\frac{\mu^{2}}{B}\right)\end{array}\right]^{-1}
=\left[\begin{array}{cc}-O(\mu)&-O(\mu)\\ -O\!\left(\frac{\mu^{2}}{B}\right)&O(1)\end{array}\right]^{-1}
=\left[\begin{array}{cc}-O\!\left(\frac{1}{\mu}\right)&-O(1)\\ -O\!\left(\frac{\mu}{B}\right)&O(1)\end{array}\right]
\tag{233}
$$

and

$$
\Gamma_{2}^{n+1}
=\left[\begin{array}{cc}\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}&O(\mu)\\ O\!\left(\frac{\mu^{2}}{B}\right)&\rho(P_{\alpha})+O\!\left(\frac{\mu^{2}}{B}\right)\end{array}\right]^{n+1}
=\left[\begin{array}{cc}\left(\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}\right)^{n+1}+o(\mu)&O(\mu)\\ O\!\left(\frac{\mu^{2}}{B}\right)&\left(\rho(P_{\alpha})+O\!\left(\frac{\mu^{2}}{B}\right)\right)^{n+1}+o(\mu)\end{array}\right]
\tag{238}
$$

so that

$$
I-\Gamma_{2}^{n+1}=\left[\begin{array}{cc}1-\left(\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}\right)^{n+1}+o(\mu)&-O(\mu)\\ -O(\frac{\mu^{2}}{B})&1-\left(\rho(P_{\alpha})+O(\frac{\mu^{2}}{B})\right)^{n+1}+o(\mu)\end{array}\right] \tag{241}
$$

Similar to $\Gamma_{1}$, by resorting to Lemma 2 in [11], and with $n\leq O(\frac{1}{\mu})$, we have

$$
\left(\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}\right)^{n+1}=O(1),\quad\quad 1-\left(\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}\right)^{n+1}=-O(1) \tag{242}
$$

so that

\begin{align}
I-\Gamma_{2}^{n+1}&=\left[\begin{array}{cc}-O(1)&-O(\mu)\\ -O(\frac{\mu^{2}}{B})&O(1)\end{array}\right] \tag{245}\\
\Gamma_{2}^{n+1}&=\left[\begin{array}{cc}O(1)&O(\mu)\\ O(\frac{\mu^{2}}{B})&O(1)\end{array}\right] \tag{248}
\end{align}
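The $O(1)$ claims in (242) can be checked numerically. The following is a minimal sketch and not part of the original proof: it replaces the unspecified $O(\mu)$ term in the denominator by $c\mu$ with a hypothetical constant `c`, takes the horizon $n = 1/\mu$, and evaluates the growth factor for decreasing step sizes. Under these assumptions the factor stays bounded as $\mu \to 0$ and approaches $e^{4L+3c}$, which is consistent with both the factor and one minus the factor being of order one.

```python
import math


def growth_factor(mu, L=1.0, c=1.0):
    """Return ((1 + mu*L)**4 / (1 - c*mu)**3) ** (n + 1) with n = 1/mu.

    L plays the role of the Lipschitz constant; c is a hypothetical
    constant standing in for the O(mu) term in the denominator.
    """
    a = (1 + mu * L) ** 4 / (1 - c * mu) ** 3
    n = round(1 / mu)  # horizon satisfying n <= O(1/mu)
    return a ** (n + 1)


if __name__ == "__main__":
    limit = math.exp(4 * 1.0 + 3 * 1.0)  # small-mu limit for L = c = 1
    for mu in (1e-2, 1e-3, 1e-4):
        print(f"mu={mu:g}  factor={growth_factor(mu):.1f}  limit={limit:.1f}")
```

The factor grows with the constants $L$ and $c$ but not with $1/\mu$, which is the property the proof relies on.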

Substituting (225), (226), (233), (245) and (248) into (224), we obtain

$$
\mathbb{E}\left[\begin{array}{c}\|\bar{\boldsymbol{w}}_{n}\|^{4}\\ \|\check{\boldsymbol{w}}_{n}\|^{4}\end{array}\right]\leq o(\frac{\mu^{2}}{B^{2}})+\left[\begin{array}{cc}-O(\frac{1}{\mu})&-O(1)\\ -O(\frac{\mu}{B})&O(1)\end{array}\right]\left[\begin{array}{cc}-O(1)&-O(\mu)\\ -O(\frac{\mu^{2}}{B})&O(1)\end{array}\right]\left[\begin{array}{c}O(\frac{\mu^{3}}{B^{2}})+O(\frac{\mu^{5}}{B})\\ O(\mu^{4})\end{array}\right]\leq\left[\begin{array}{c}O(\frac{\mu^{2}}{B^{2}})+O(\mu^{4})\\ O(\mu^{4})\end{array}\right] \tag{259}
$$

and, therefore,

\begin{align}
\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}&\leq\|\mathcal{V}\|^{4}\,\mathbb{E}\|\mathcal{V}^{\mathsf{T}}\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}=\|\mathcal{V}\|^{4}\,\mathbb{E}\left(\|\bar{\boldsymbol{w}}_{n}\|^{2}+\|\check{\boldsymbol{w}}_{n}\|^{2}\right)^{2}\overset{(a)}{\leq}2\|\mathcal{V}\|^{4}\,\mathbb{E}\left(\|\bar{\boldsymbol{w}}_{n}\|^{4}+\|\check{\boldsymbol{w}}_{n}\|^{4}\right) \notag\\
&\leq O(\frac{\mu^{2}}{B^{2}})+O(\frac{\mu^{4}}{B})=O(\mu^{2\gamma}) \tag{260}
\end{align}

where $(a)$ follows from Jensen’s inequality, and we observe from (191) and (D) that

$$
\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{2}=O\left((\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4})^{\frac{1}{2}}\right) \tag{261}
$$

Also, by using Jensen’s inequality again, we have

$$
\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{3}=\mathbb{E}\left(\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}\right)^{\frac{3}{4}}\leq\left(\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}\right)^{\frac{3}{4}}=O(\mu^{1.5\gamma}) \tag{262}
$$
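The two uses of Jensen’s inequality above can be illustrated on empirical moments. The sketch below is only an illustration under stated assumptions: the samples are hypothetical stand-ins for the norms $\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|$ (absolute values of Gaussian draws), and the values `a`, `b` are arbitrary. It checks the upper-bound direction of the second-moment relation, the third-moment bound in (262), and step $(a)$ in (260), which is the scalar inequality $(p+q)^{2}\leq 2(p^{2}+q^{2})$.

```python
import random


def moments(samples):
    """Empirical second, third and fourth moments of nonnegative samples."""
    n = len(samples)
    m2 = sum(x ** 2 for x in samples) / n
    m3 = sum(x ** 3 for x in samples) / n
    m4 = sum(x ** 4 for x in samples) / n
    return m2, m3, m4


rng = random.Random(0)
# Hypothetical stand-in for the norms: absolute values of Gaussian draws.
norms = [abs(rng.gauss(0.0, 0.1)) for _ in range(10_000)]
m2, m3, m4 = moments(norms)

# Jensen with the concave maps x -> x**(1/2) and x -> x**(3/4).
second_ok = m2 <= m4 ** 0.5
third_ok = m3 <= m4 ** 0.75

# Step (a): (p + q)**2 <= 2 * (p**2 + q**2) with p = a**2 and q = b**2.
a, b = 0.3, 0.7
step_a_ok = (a ** 2 + b ** 2) ** 2 <= 2 * (a ** 4 + b ** 4)

if __name__ == "__main__":
    print(second_ok, third_ok, step_a_ok)
```

Because Jensen’s inequality holds for any probability measure, including the empirical one, the first two checks do not depend on the particular random draws.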

As for the centralized method, substituting (193) into (D) and iterating, we obtain

$$
\mathbb{E}\|\bar{\boldsymbol{w}}_{n}\|^{2}\leq\left(\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}+O(\frac{\mu^{2}}{B})\right)^{n+1}\mathbb{E}\|\bar{\boldsymbol{w}}_{-1}\|^{4}+\frac{1-\left(\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}\right)^{n+1}}{1-\frac{(1+\mu L)^{4}}{(1-O(\mu))^{3}}}\times O(\frac{\mu^{3}}{B^{2}}) \tag{263}
$$

Combining this with (242), it holds for the centralized method that

\begin{align}
\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}&\leq O(\frac{\mu^{2}}{B^{2}})=O(\mu^{2(1+\eta)}) \tag{264}\\
\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{3}&\leq\left(\mathbb{E}\|\widetilde{\boldsymbol{\mathcal{W}}}_{n}\|^{4}\right)^{\frac{3}{4}}\leq O(\mu^{1.5(1+\eta)}) \tag{265}
\end{align}
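The passage from (263) to (264) rests on the geometric-sum factor being of order $1/\mu$, so that multiplying it by $O(\mu^{3}/B^{2})$ leaves $O(\mu^{2}/B^{2})$. The sketch below checks this scaling numerically. It is an illustration only: the $O(\mu)$ term is replaced by $c\mu$ with a hypothetical constant `c`, the horizon is fixed at $n=1/\mu$, and the function name is introduced here for convenience.

```python
import math


def scaled_geometric_sum(mu, L=1.0, c=1.0):
    """Return mu * (1 - a**(n+1)) / (1 - a) with n = 1/mu.

    Here a = (1 + mu*L)**4 / (1 - c*mu)**3, and c is a hypothetical
    constant replacing the O(mu) term. If the geometric sum is O(1/mu),
    this scaled quantity stays bounded as mu shrinks.
    """
    a = (1 + mu * L) ** 4 / (1 - c * mu) ** 3
    n = round(1 / mu)
    geometric_sum = (1 - a ** (n + 1)) / (1 - a)
    return mu * geometric_sum


if __name__ == "__main__":
    for mu in (1e-2, 1e-3, 1e-4):
        print(f"mu={mu:g}  mu * sum = {scaled_geometric_sum(mu):.2f}")
```

For $L=c=1$ the scaled sum settles near $(e^{7}-1)/7$, a constant independent of $\mu$ and $B$.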

Appendix E Proof for Lemmas III.6 and III.7: Approximation error of the short-term model

The argument is similar to [6] except that we now focus on nonconvex (as opposed to convex) risk functions. To clarify how far the algorithm can escape from a local minimum $w^{\star}$, it is necessary to assess the size of the distance between $\boldsymbol{w}_{k,n}$ and $w^{\star}$ rather than upper bound it. However, the dependence of $\mathcal{H}_{n-1}$ on $\boldsymbol{\mathcal{W}}_{n-1}$ makes the analysis difficult. This motivates the short-term model in (48). Consider

$$
\mathcal{C}\overset{\Delta}{=}\mathcal{A}_{2}(\mathcal{A}_{1}-\mu\mathcal{H})=\mathcal{A}-\mu\mathcal{A}_{2}\mathcal{H} \tag{266}
$$

and recall the short-term model in (48):

$$
\widetilde{\boldsymbol{\mathcal{W}}}_{n}^{\prime}=\mathcal{C}\widetilde{\boldsymbol{\mathcal{W}}}_{n-1}^{\prime}+\mu\mathcal{A}_{2}d+\mu\mathcal{A}_{2}\boldsymbol{s}_{n}^{B} \tag{267}
$$

where the Hessian matrix $H_{k,n}(\boldsymbol{w}_{k,n})$ is approximated by the Hessian matrix at $w^{\star}$, and $\widetilde{\boldsymbol{\mathcal{W}}}_{-1}^{\prime}=\widetilde{\boldsymbol{\mathcal{W}}}_{-1}$.

In this section, we analyze the approximation error caused by the short-term model. Consider $\boldsymbol{z}_{n}=\widetilde{\boldsymbol{\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\mathcal{W}}}_{n}$, which measures the difference between the true model and the short-term model. Subtracting (42) from (267), we obtain:

$$
\boldsymbol{z}_{n}=\mathcal{C}\boldsymbol{z}_{n-1}+\mu\mathcal{A}_{2}(\mathcal{H}_{n-1}-\mathcal{H})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1} \tag{268}
$$

Note that, by a Taylor expansion, when $\boldsymbol{w}_{k,n}$ is sufficiently close to $w^{\star}$, we have:

$$
H_{k,n}(\boldsymbol{w}_{k,n})-H_{k}^{\star}=-\nabla H_{k}^{\star}\tilde{\boldsymbol{w}}_{k,n}+O(\|\tilde{\boldsymbol{w}}_{k,n}\|^{2}) \tag{269}
$$

from which we get

$$
\|(H_{k,n}(\boldsymbol{w}_{k,n})-H_{k}^{\star})\tilde{\boldsymbol{w}}_{k,n}\|\leq\|\nabla H_{k}^{\star}\|\|\tilde{\boldsymbol{w}}_{k,n}\|^{2}+O(\|\tilde{\boldsymbol{w}}_{k,n}\|^{3}) \tag{270}
$$

so that

$$
\|(\mathcal{H}_{n-1}-\mathcal{H})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1}\|\leq O(\|\widetilde{\boldsymbol{\mathcal{W}}}_{n-1}\|^{2}) \tag{271}
$$
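The role of the driving term in (268) and of the quadratic bound in (271) can be seen on a toy problem. The sketch below is an assumption-laden illustration, not the setting of the lemmas: it uses a single agent with a scalar risk $J(w)=\frac{h}{2}w^{2}+\frac{c}{3}w^{3}$, whose local minimum is $w^{\star}=0$ with Hessian $h$, and Gaussian gradient noise of variance $\sigma^{2}/B$. All parameter values and the function name are hypothetical. The true recursion uses the exact gradient $hw+cw^{2}$, while the short-term recursion freezes the Hessian at $w^{\star}$ and shares the same noise, so their difference is driven only by the term $cw^{2}$, which is quadratic in the distance to $w^{\star}$.

```python
import random


def simulate(mu=0.01, h=1.0, c=0.5, sigma=1.0, B=16,
             steps=200, trials=500, seed=0):
    """Compare a true recursion with its short-term (frozen-Hessian) model.

    Returns the empirical mean-square deviation of the true iterate from
    w* = 0 and the mean-square gap between the two models after `steps`.
    """
    rng = random.Random(seed)
    sum_w2 = 0.0
    sum_z2 = 0.0
    for _ in range(trials):
        w = 0.0      # true iterate, started at the local minimum
        w_st = 0.0   # short-term iterate, same initial condition
        for _ in range(steps):
            s = rng.gauss(0.0, sigma / B ** 0.5)  # mini-batch gradient noise
            w = w - mu * (h * w + c * w * w + s)  # exact gradient of J
            w_st = w_st - mu * (h * w_st + s)     # Hessian frozen at w*
        sum_w2 += w * w
        sum_z2 += (w_st - w) ** 2
    return sum_w2 / trials, sum_z2 / trials


if __name__ == "__main__":
    msd_w, msd_z = simulate()
    print(f"E|w|^2 ~ {msd_w:.3e}   E|z|^2 ~ {msd_z:.3e}")
```

In this toy setting the gap between the two models is orders of magnitude smaller than the deviation from $w^{\star}$ itself, which is the qualitative behavior the short-term model is meant to capture.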

We now analyze the size of $\mathbb{E}\|\boldsymbol{z}_{n}\|^{2}$. Substituting (35) into (268), we have

\begin{align}
\boldsymbol{z}_{n}&=(\mathcal{V}\mathcal{P}\mathcal{V}^{\mathsf{T}}-\mu\mathcal{A}_{2}\mathcal{H})\boldsymbol{z}_{n-1}+\mu\mathcal{A}_{2}(\mathcal{H}_{n-1}-\mathcal{H})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1} \notag\\
&=\mathcal{V}(\mathcal{P}-\mu\mathcal{V}^{\mathsf{T}}\mathcal{A}_{2}\mathcal{H}\mathcal{V})\mathcal{V}^{\mathsf{T}}\boldsymbol{z}_{n-1}+\mu\mathcal{A}_{2}(\mathcal{H}_{n-1}-\mathcal{H})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1} \tag{272}
\end{align}

Similar to Appendix C, we multiply both sides of (272) by $\mathcal{V}^{\mathsf{T}}$ so that it can be decomposed as

\begin{align}
\mathcal{V}^{\mathsf{T}}\boldsymbol{z}_{n}={}&\left[\begin{array}{c}\frac{1}{\sqrt{K}}\mathbb{1}^{\mathsf{T}}\boldsymbol{z}_{n}\\ \mathcal{V}_{\alpha}^{\mathsf{T}}\boldsymbol{z}_{n}\end{array}\right]\overset{\Delta}{=}\left[\begin{array}{c}\bar{\boldsymbol{z}}_{n}\\ \check{\boldsymbol{z}}_{n}\end{array}\right] \tag{277}\\
={}&\left(\left[\begin{array}{cc}I_{M}&0\\ 0&\mathcal{P}_{\alpha}\end{array}\right]-\mu\left[\begin{array}{c}\frac{1}{\sqrt{K}}\mathbb{1}^{\mathsf{T}}\\ \mathcal{V}_{\alpha}^{\mathsf{T}}\end{array}\right]\mathcal{A}_{2}\mathcal{H}\left[\frac{1}{\sqrt{K}}\mathbb{1}\quad\mathcal{V}_{\alpha}\right]\right)\left[\begin{array}{c}\bar{\boldsymbol{z}}_{n-1}\\ \check{\boldsymbol{z}}_{n-1}\end{array}\right]+\mu\mathcal{V}^{\mathsf{T}}\mathcal{A}_{2}(\mathcal{H}_{n-1}-\mathcal{H})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1} \tag{284}
\end{align}
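The block-diagonal matrix appearing in (284) comes from changing coordinates with $\mathcal{V}=[\frac{1}{\sqrt{K}}\mathbb{1}\;\;\mathcal{V}_{\alpha}]$. A minimal sketch of this change of coordinates, assuming the smallest nontrivial case of $K=2$ agents, scalar models ($M=1$), and a symmetric doubly stochastic combination matrix with a hypothetical off-diagonal weight `a`, is given below. The helper functions are written out by hand to keep the example self-contained.

```python
import math


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*X)]


a = 0.3                            # hypothetical combination weight
A = [[1 - a, a], [a, 1 - a]]       # symmetric, doubly stochastic
s = 1 / math.sqrt(2)
# First column: (1/sqrt(K)) * ones. Second column: V_alpha.
V = [[s, s], [s, -s]]

# P = V^T A V should equal diag(1, P_alpha) with P_alpha = 1 - 2a.
P = matmul(matmul(transpose(V), A), V)

if __name__ == "__main__":
    for row in P:
        print([round(x, 6) for x in row])
```

The top-left entry equals one and corresponds to the network average $\bar{\boldsymbol{z}}_{n}$, while the remaining entry has magnitude $|1-2a|<1$ and governs the contraction of the disagreement component $\check{\boldsymbol{z}}_{n}$.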

Consider

$$
\bar{H}=\frac{1}{K}\mathbb{1}^{\mathsf{T}}\mathcal{H}\mathbb{1}=\frac{1}{K}\sum_{k}H_{k}(w^{\star}) \tag{285}
$$

which is positive-definite since it is the Hessian matrix of the global risk $J$ at a local minimum. Then we have

\begin{align}
\bar{\boldsymbol{z}}_{n}&=(I_{M}-\mu\bar{H})\bar{\boldsymbol{z}}_{n-1}-\mu\frac{1}{\sqrt{K}}\mathbb{1}^{\mathsf{T}}\mathcal{H}\mathcal{V}_{\alpha}\check{\boldsymbol{z}}_{n-1}+\mu\frac{1}{\sqrt{K}}\mathbb{1}^{\mathsf{T}}(\mathcal{H}_{n-1}-\mathcal{H})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1} \notag\\
\check{\boldsymbol{z}}_{n}&=(\mathcal{P}_{\alpha}-\mu\mathcal{V}_{\alpha}^{\mathsf{T}}\mathcal{A}_{2}\mathcal{H}\mathcal{V}_{\alpha})\check{\boldsymbol{z}}_{n-1}-\frac{1}{\sqrt{K}}\mu\mathcal{V}_{\alpha}^{\mathsf{T}}\mathcal{A}_{2}\mathcal{H}\mathbb{1}\bar{\boldsymbol{z}}_{n-1}+\mu\mathcal{V}_{\alpha}^{\mathsf{T}}\mathcal{A}_{2}(\mathcal{H}_{n-1}-\mathcal{H})\widetilde{\boldsymbol{\mathcal{W}}}_{n-1} \tag{286}
\end{align}

Still, for the centralized method, we only need to analyze $\bar{\boldsymbol{z}}_{n}$, since $\check{\boldsymbol{z}}_{n}$ is always $0$ in that case.

We first analyze $\mathds{E}\|\bar{\boldsymbol{z}}_{n}\|^{2}$. Letting $t=\|I_{M}-\mu\bar{H}\|=1-O(\mu)>0$, we have

$$
\begin{aligned}
\mathds{E}\|\bar{\boldsymbol{z}}_{n}\|^{2} ={}& \mathds{E}\left\|(I_{M}-\mu\bar{H})\bar{\boldsymbol{z}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}\check{\boldsymbol{z}}_{n-1}+\mu\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}(\mathcal{H}_{n-1}-\mathcal{H})\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\right\|^{2} \\
\overset{(a)}{\leq}{}& \frac{1}{t}\mathds{E}\|(I_{M}-\mu\bar{H})\bar{\boldsymbol{z}}_{n-1}\|^{2}+\frac{O(\mu^{2})}{1-t}\mathds{E}\|\check{\boldsymbol{z}}_{n-1}\|^{2}+\frac{O(\mu^{2})}{1-t}\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4} \\
={}& (1-O(\mu))\mathds{E}\|\bar{\boldsymbol{z}}_{n-1}\|^{2}+O(\mu)\mathds{E}\|\check{\boldsymbol{z}}_{n-1}\|^{2}+O(\mu)\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4}
\end{aligned}
\tag{287}
$$

where (a) follows from Jensen’s inequality and (271).
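Step $(a)$ can be read as an instance of the standard convexity (Jensen) bound on the squared norm. The sketch below states that bound for two placeholder vectors $a$ and $b$ (symbols introduced here for illustration only) and shows how the coefficients in (287) would arise, assuming $1-t$ is bounded below by a positive multiple of $\mu$:

```latex
\begin{aligned}
\|a+b\|^{2} &= \left\| t\cdot\frac{a}{t} + (1-t)\cdot\frac{b}{1-t} \right\|^{2}
\le \frac{1}{t}\|a\|^{2} + \frac{1}{1-t}\|b\|^{2}, \qquad 0<t<1, \\
\frac{1}{t}\,\|(I_{M}-\mu\bar{H})\bar{\boldsymbol{z}}_{n-1}\|^{2}
&\le \frac{t^{2}}{t}\,\|\bar{\boldsymbol{z}}_{n-1}\|^{2}
= (1-O(\mu))\,\|\bar{\boldsymbol{z}}_{n-1}\|^{2}, \\
\frac{O(\mu^{2})}{1-t} &= O(\mu).
\end{aligned}
```

Here $a$ plays the role of $(I_{M}-\mu\bar{H})\bar{\boldsymbol{z}}_{n-1}$ and $b$ collects the two $\mu$-scaled terms. Splitting $\|b\|^{2}$ into its two parts costs only a constant factor, which the $O(\cdot)$ notation absorbs.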

Next, we analyze $\mathds{E}\|\check{\boldsymbol{z}}_{n}\|^{2}$. Letting $t=\rho(P_{\alpha})$ denote the spectral radius of $P_{\alpha}$, we have

$$
\begin{aligned}
\mathds{E}\|\check{\boldsymbol{z}}_{n}\|^{2} ={}& \mathds{E}\left\|(\mathcal{P}_{\alpha}-\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathcal{V}_{\alpha})\check{\boldsymbol{z}}_{n-1}-\frac{\mu}{\sqrt{K}}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathds{1}\bar{\boldsymbol{z}}_{n-1}+\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}(\mathcal{H}_{n-1}-\mathcal{H})\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\right\|^{2} \\
\overset{(a)}{\leq}{}& t\,\mathds{E}\|\check{\boldsymbol{z}}_{n-1}\|^{2}+O(\mu^{2})\mathds{E}\|\check{\boldsymbol{z}}_{n-1}\|^{2}+O(\mu^{2})\mathds{E}\|\bar{\boldsymbol{z}}_{n-1}\|^{2}+O(\mu^{2})\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4} \\
={}& (\rho(P_{\alpha})+O(\mu^{2}))\mathds{E}\|\check{\boldsymbol{z}}_{n-1}\|^{2}+O(\mu^{2})\mathds{E}\|\bar{\boldsymbol{z}}_{n-1}\|^{2}+O(\mu^{2})\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4}
\end{aligned}
\tag{288}
$$

where $(a)$ follows from Jensen's inequality. By combining (287) and (288), we obtain the recursion associated with the size of $\mathds{E}\|\mathcal{V}_{\alpha}^{\sf T}\boldsymbol{z}_{n}\|^{2}$:

$$
\mathds{E}\|\mathcal{V}_{\alpha}^{\sf T}\boldsymbol{z}_{n}\|^{2}=\mathds{E}\left[\begin{array}{c}\|\bar{\boldsymbol{z}}_{n}\|^{2}\\ \|\check{\boldsymbol{z}}_{n}\|^{2}\end{array}\right]\leq\Gamma^{\prime}\,\mathds{E}\left[\begin{array}{c}\|\bar{\boldsymbol{z}}_{n-1}\|^{2}\\ \|\check{\boldsymbol{z}}_{n-1}\|^{2}\end{array}\right]+\left[\begin{array}{c}O(\mu)\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4}\\ O(\mu^{2})\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4}\end{array}\right]
\tag{295}
$$

where $\boldsymbol{z}_{-1}=0$, and

$$
\Gamma^{\prime}=\left[\begin{array}{cc}1-O(\mu)&O(\mu)\\ O(\mu^{2})&\rho(P_{\alpha})+O(\mu^{2})\end{array}\right]
\tag{298}
$$

By iterating (295), we have

$$
\mathds{E}\|\mathcal{V}_{\alpha}^{\sf T}\boldsymbol{z}_{n}\|^{2}=(I-\Gamma^{\prime})^{-1}\left(I-(\Gamma^{\prime})^{n+1}\right)\left[\begin{array}{c}O(\mu)\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4}\\ O(\mu^{2})\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4}\end{array}\right]
\tag{301}
$$

for which it can be verified that

$$
(I-\Gamma^{\prime})^{-1}=\left[\begin{array}{cc}O(\mu)&-O(\mu)\\ -O(\mu^{2})&1-\rho(P_{\alpha})+O(\mu^{2})\end{array}\right]^{-1}=\left[\begin{array}{cc}O(\frac{1}{\mu})&O(1)\\ O(\mu)&O(1)\end{array}\right]
\tag{306}
$$

and

$$
I-(\Gamma^{\prime})^{n+1}=\left[\begin{array}{cc}1-(1-O(\mu))^{n+1}&-O(\mu)\\ -O(\mu^{2})&1-(\rho(P_{\alpha})+O(\mu^{2}))^{n+1}\end{array}\right]=\left[\begin{array}{cc}1-(1-O(\mu))^{n+1}&-O(\mu)\\ -O(\mu^{2})&O(1)\end{array}\right]
\tag{311}
$$
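The orders claimed in (306) can be sanity-checked numerically. The sketch below is illustrative only: it replaces each $O(\cdot)$ entry of $\Gamma^{\prime}$ in (298) by a concrete value built from hypothetical constants `c = 1.0` and `rho = 0.5` (the latter standing in for $\rho(P_{\alpha})$), neither of which comes from the paper, and divides each entry of the inverse by its claimed order.

```python
import numpy as np


def gamma_prime(mu, rho=0.5, c=1.0):
    """Concrete stand-in for Gamma' in (298); rho and c are illustrative constants."""
    return np.array([[1.0 - c * mu, c * mu],
                     [c * mu**2, rho + c * mu**2]])


def normalized_inverse(mu, rho=0.5, c=1.0):
    """Entries of (I - Gamma')^{-1} divided by the orders claimed in (306)."""
    inv = np.linalg.inv(np.eye(2) - gamma_prime(mu, rho, c))
    claimed = np.array([[1.0 / mu, 1.0],
                        [mu, 1.0]])
    return inv / claimed


if __name__ == "__main__":
    # Ratios that stay bounded as mu shrinks are consistent with (306).
    for mu in (1e-1, 1e-2, 1e-3):
        print(mu, normalized_inverse(mu).round(3))
```

For this stand-in matrix, the ratios remain bounded as `mu` decreases, which is what the order pattern in (306) predicts. Other choices of the constants change the ratios but not their boundedness, provided `rho` stays strictly below one.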

Again, similar to (172), we have:

$$
1-(1-O(\mu))^{n+1}<1=O(1)
\tag{312}
$$
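The closed form in (301) is the finite geometric sum obtained by unrolling a linear recursion from a zero initial condition. The following is a minimal sketch of that identity, assuming a constant driving vector `b` (a simplification: in (295) the driving term depends on $n$ and is only bounded uniformly) and a hypothetical concrete matrix `G` with the same order pattern as $\Gamma^{\prime}$.

```python
import numpy as np


def iterate_recursion(G, b, n):
    """Run x_k = G x_{k-1} + b for k = 0, ..., n, starting from x_{-1} = 0."""
    x = np.zeros_like(b)
    for _ in range(n + 1):
        x = G @ x + b
    return x


def closed_form(G, b, n):
    """Evaluate (I - G)^{-1} (I - G^{n+1}) b without forming the inverse."""
    eye = np.eye(G.shape[0])
    rhs = (eye - np.linalg.matrix_power(G, n + 1)) @ b
    return np.linalg.solve(eye - G, rhs)


if __name__ == "__main__":
    mu = 1e-2  # illustrative step size
    G = np.array([[1.0 - mu, mu],
                  [mu**2, 0.5 + mu**2]])  # hypothetical stand-in for Gamma'
    b = np.array([mu, mu**2])             # hypothetical constant driving term
    print(iterate_recursion(G, b, 200))
    print(closed_form(G, b, 200))
```

The two printed vectors should agree up to floating-point error. When the driving term is replaced by a uniform upper bound and `G` has nonnegative entries, the same unrolling yields an entrywise upper bound rather than an equality.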

Then we substitute (D), (306), (311) and (312) into (301), and obtain

$$
\mathds{E}\left[\begin{array}{c}\|\bar{\boldsymbol{z}}_{n}\|^{2}\\ \|\check{\boldsymbol{z}}_{n}\|^{2}\end{array}\right]\leq\left[\begin{array}{cc}O(\frac{1}{\mu})&O(1)\\ O(\mu)&O(1)\end{array}\right]\left[\begin{array}{cc}O(1)&-O(\mu)\\ -O(\mu^{2})&O(1)\end{array}\right]\left[\begin{array}{c}O(\mu^{1+2\gamma})\\ O(\mu^{2+2\gamma})\end{array}\right]\leq O(\mu^{2\gamma})=O(\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4})
\tag{321}
$$

from which we have

$$
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}=\mathds{E}\|\mathcal{V}^{-\sf T}\mathcal{V}^{\sf T}\boldsymbol{z}_{n}\|^{2}=O(\mu^{2\gamma})
\tag{322}
$$
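The order bookkeeping behind (321) can be sketched as two successive matrix-vector products, applied right to left. This is a worked check, not part of the original derivation; every $O(\cdot)$ is read as a bound on magnitude, so the signs of the off-diagonal entries play no role, and the entries follow the ordering of (295), with $\bar{\boldsymbol{z}}_{n}$ first.

```latex
\begin{aligned}
\left[\begin{array}{cc} O(1) & O(\mu) \\ O(\mu^{2}) & O(1) \end{array}\right]
\left[\begin{array}{c} O(\mu^{1+2\gamma}) \\ O(\mu^{2+2\gamma}) \end{array}\right]
&= \left[\begin{array}{c} O(\mu^{1+2\gamma}) + O(\mu^{3+2\gamma}) \\ O(\mu^{3+2\gamma}) + O(\mu^{2+2\gamma}) \end{array}\right]
= \left[\begin{array}{c} O(\mu^{1+2\gamma}) \\ O(\mu^{2+2\gamma}) \end{array}\right], \\
\left[\begin{array}{cc} O(\frac{1}{\mu}) & O(1) \\ O(\mu) & O(1) \end{array}\right]
\left[\begin{array}{c} O(\mu^{1+2\gamma}) \\ O(\mu^{2+2\gamma}) \end{array}\right]
&= \left[\begin{array}{c} O(\mu^{2\gamma}) + O(\mu^{2+2\gamma}) \\ O(\mu^{2+2\gamma}) + O(\mu^{2+2\gamma}) \end{array}\right]
= \left[\begin{array}{c} O(\mu^{2\gamma}) \\ O(\mu^{2+2\gamma}) \end{array}\right].
\end{aligned}
```

Both entries are therefore at most $O(\mu^{2\gamma})$ for small $\mu$, and the $\check{\boldsymbol{z}}_{n}$ component is smaller by a factor of order $\mu^{2}$. Since $\mathcal{V}^{-\sf T}$ does not depend on $\mu$, the same rate carries over to (322).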

Since

$$
\begin{aligned}
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2} &=\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}+\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2} \\
&\leq\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}+\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}+2\left|\mathds{E}(\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n})^{\sf T}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\right| \\
&\leq\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}+\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}+2\sqrt{\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\,\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}} \\
&\overset{(a)}{\leq}O(\mu^{\gamma})
\end{aligned}
\tag{323}
$$

where $(a)$ uses the results of (191) and (322). It follows that

$$
\begin{aligned}
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}-\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2} &\leq\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}+2\sqrt{\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\,\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}} \\
&\overset{(a)}{=}O(\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4})+O\left(\sqrt{\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4}\cdot(\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4})^{\frac{1}{2}}}\right) \\
&=O(\mu^{1.5\gamma})
\end{aligned}
\tag{324}
$$

where $(a)$ uses the results of (261) and (321). Similar to (323), we have

$$
\begin{aligned}
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2} &=\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}+\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2} \\
&\leq\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}+\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}+2\left|\mathds{E}(\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime})^{\sf T}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\right| \\
&\leq\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}+\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}+2\sqrt{\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\,\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}}
\end{aligned}
\tag{325}
$$

from which we have

\begin{align}
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}-\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}
&\leq \mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}+2\sqrt{\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\,\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}} \nonumber\\
&\leq O(\mu^{1.5\gamma}) \tag{326}
\end{align}

Combining (E) and (326), we have

\begin{align}
\left|\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}-\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\right|\leq O(\mu^{1.5\gamma}) \tag{327}
\end{align}

which means that the approximation error incurred by replacing $\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}$ with $\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}$ is negligible compared with the size of $\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}$. In other words, $\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}$ carries sufficient information about $\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}$.
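As an illustrative sanity check, the Cauchy-Schwarz argument behind (325) to (327) can be tried on synthetic data. The sketch below is not part of the original derivation: the sample arrays `w` and `w_prime` (standing in for $\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}$ and $\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}$), the perturbation size `eps`, and the Gaussian sampling model are all assumptions made for illustration, and expectations are replaced by empirical averages.

```python
import math
import random

def mean_sq_norm(vectors):
    """Empirical average of the squared Euclidean norm."""
    return sum(sum(x * x for x in v) for v in vectors) / len(vectors)

def second_moment_gap_and_bound(w, w_prime):
    """Return (|E||w'||^2 - E||w||^2|, Cauchy-Schwarz bound), mirroring (325)-(327)."""
    diff = [[a - b for a, b in zip(u, v)] for u, v in zip(w_prime, w)]
    e_diff = mean_sq_norm(diff)
    e_prime = mean_sq_norm(w_prime)
    gap = abs(e_prime - mean_sq_norm(w))
    bound = e_diff + 2.0 * math.sqrt(e_diff * e_prime)
    return gap, bound

rng = random.Random(0)
dim, samples, eps = 4, 2000, 1e-2  # hypothetical sizes, chosen for illustration
w_prime = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(samples)]
# Hypothetical "true" error vector: the short-term model plus a small perturbation.
w = [[x + eps * rng.gauss(0.0, 1.0) for x in v] for v in w_prime]

gap, bound = second_moment_gap_and_bound(w, w_prime)
print(f"gap = {gap:.3e}, bound = {bound:.3e}")
```

In this toy setting the gap between the two second moments is controlled by the mean-square distance between the two sequences, which is the mechanism the proof relies on.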

Similarly, for the centralized method, we only need to analyze $\mathds{E}\|\bar{\boldsymbol{z}}_{n}\|^{2}$. Iterating (E) gives

\begin{align}
\mathds{E}\|\bar{\boldsymbol{z}}_{n}\|^{2}=\frac{1-(1-O(\mu))^{n+1}}{1-(1-O(\mu))}\times O(\mu)\,\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n-1}\|^{4} \tag{328}
\end{align}

Substituting (193), (264), and (312) into (328), and using (E) and (E), we obtain

\begin{align}
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2} &\leq O(\mu^{2(1+\eta)}) \tag{329}\\
\left|\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}-\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\right| &\leq O(\mu^{1.5(1+\eta)}) \tag{330}
\end{align}
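The exponent in (330) can be traced with a short order-of-magnitude calculation. The following is only a sketch: it assumes that $\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}\leq O(\mu^{1+\eta})$ in the centralized case (in line with (331) for $\gamma^{\prime}=1+\eta$) and repeats the Cauchy-Schwarz step used for (326) with the bound (329).

```latex
\begin{align*}
\left|\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}-\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\right|
&\leq \mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}
 +2\sqrt{\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\,\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}}\\
&\leq O(\mu^{2(1+\eta)})+2\sqrt{O(\mu^{2(1+\eta)})\,O(\mu^{1+\eta})}\\
&= O(\mu^{2(1+\eta)})+O(\mu^{1.5(1+\eta)})\\
&= O(\mu^{1.5(1+\eta)})
\end{align*}
```

The last step uses the fact that, for small $\mu$, the term with the smaller exponent dominates.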

Appendix F Proof for Theorem III.8

In this section, we derive closed-form expressions for the excess-risk performance of the centralized, consensus, and diffusion strategies over a finite time horizon $n\leq O(\frac{1}{\mu})$. Specifically, we quantify how far the algorithms can escape from the local minimum $w^{\star}$ in terms of the risk function. To do this, we use the upper bounds in Lemmas III.6 and III.7, where the centralized and decentralized methods have different expressions. For simplicity, we use $\gamma^{\prime}$ to unify the results for the decentralized and centralized methods in Lemmas III.6 and III.7, namely,

\begin{align}
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\leq O(\mu^{\gamma^{\prime}}),\quad
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{4}\leq O(\mu^{2\gamma^{\prime}}),\quad
\left|\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}-\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\right|\leq O(\mu^{1.5\gamma^{\prime}}) \tag{331}
\end{align}

where $\gamma^{\prime}=1+\eta$ in the centralized method, while $\gamma^{\prime}=\gamma=\min\{1+\eta,2\}$ in decentralized methods.

For each agent, the excess risk corresponding to its model is:

\begin{align}
\mathrm{ER}_{k,n} &= \mathds{E}J(\boldsymbol{w}_{k,n})-J(w^{\star}) \nonumber\\
&= \frac{1}{K}\sum\limits_{\ell=1}^{K}\left(\mathds{E}J_{\ell}(\boldsymbol{w}_{k,n})-J_{\ell}(w^{\star})\right) \nonumber\\
&\overset{(a)}{=} \frac{1}{K}\sum\limits_{\ell=1}^{K}\left(\nabla J_{\ell}(w^{\star})^{\sf T}\,\mathds{E}(\boldsymbol{w}_{k,n}-w^{\star})+\frac{1}{2}\mathds{E}(\boldsymbol{w}_{k,n}-w^{\star})^{\sf T}H_{\ell}^{\star}(\boldsymbol{w}_{k,n}-w^{\star})\pm O(\mathds{E}\|\boldsymbol{w}_{k,n}-w^{\star}\|^{3})\right) \nonumber\\
&\overset{(b)}{=} \frac{1}{2}\mathds{E}\|\boldsymbol{w}_{k,n}-w^{\star}\|^{2}_{\bar{H}}\pm O(\mathds{E}\|\boldsymbol{w}_{k,n}-w^{\star}\|^{3}) \tag{332}
\end{align}

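where $(a)$ follows from the Taylor expansion of each $J_{\ell}$ around $w^{\star}$, and $(b)$ follows from (107), with the first-order terms vanishing on average because $w^{\star}$ is a stationary point of $J$. Then we have the following average excess risk (or escaping efficiency) across the network, involving the models of all agents: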

\begin{align}
\mathrm{ER}_{n}=\frac{1}{K}\sum\limits_{k}\mathrm{ER}_{k,n}
&\overset{(a)}{=}\frac{1}{2K}\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}_{I\otimes\bar{H}}\pm O\left((\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{4})^{\frac{3}{4}}\right) \nonumber\\
&\overset{(b)}{=}\frac{1}{2K}\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}_{I\otimes\bar{H}}\pm O(\mu^{1.5\gamma^{\prime}}) \tag{333}
\end{align}

where $(a)$ follows from the Taylor expansion and the following inequality:

\begin{align}
\mathds{E}\|\boldsymbol{w}_{k}-w^{\star}\|^{3}
=\mathds{E}\left(\|\boldsymbol{w}_{k}-w^{\star}\|^{4}\right)^{\frac{3}{4}}
\leq\left(\mathds{E}\|\boldsymbol{w}_{k}-w^{\star}\|^{4}\right)^{\frac{3}{4}}
\leq\left(\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{4}\right)^{\frac{3}{4}}
\leq O(\mu^{1.5\gamma^{\prime}}) \tag{334}
\end{align}

and $(b)$ follows from the following inequalities, which are similar to (E) and (326):

\begin{align}
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}_{I\otimes\bar{H}}-\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}_{I\otimes\bar{H}}
&\leq \mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}_{I\otimes\bar{H}}+2\|\bar{H}\|\sqrt{\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\,\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}} \nonumber\\
&\leq \|\bar{H}\|\,\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}+2\|\bar{H}\|\sqrt{\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}-\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}\,\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}} \nonumber\\
&= O(\mu^{1.5\gamma^{\prime}}) \tag{335}\\
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}_{I\otimes\bar{H}}-\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}_{I\otimes\bar{H}}
&\leq O(\mu^{1.5\gamma^{\prime}}) \tag{336}
\end{align}

so that

\begin{align}
\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}_{I\otimes\bar{H}}-O(\mu^{1.5\gamma^{\prime}})
\leq\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}\|^{2}_{I\otimes\bar{H}}
\leq\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}_{I\otimes\bar{H}}+O(\mu^{1.5\gamma^{\prime}}) \tag{337}
\end{align}
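Two ingredients of the argument in (332) to (334) can be illustrated on a toy problem. The sketch below is an assumption-laden example rather than the paper's setting: it uses two hypothetical scalar agent risks `J1` and `J2` whose gradients at $w^{\star}=0$ are individually nonzero but cancel on average, so that the excess risk reduces to half the Hessian-weighted squared distance up to a cubic remainder. It also checks the moment inequality $\mathds{E}|x|^{3}\leq(\mathds{E}|x|^{4})^{3/4}$ on Gaussian samples, with expectations replaced by empirical averages.

```python
import random

G, H1, H2, C3 = 0.7, 1.0, 3.0, 0.5  # hypothetical constants for the toy risks

def J1(w):
    """Toy risk of agent 1: quadratic plus a linear and a cubic term."""
    return 0.5 * H1 * w * w + G * w + C3 * w ** 3

def J2(w):
    """Toy risk of agent 2: its linear term cancels that of agent 1."""
    return 0.5 * H2 * w * w - G * w

def J(w):
    """Aggregate risk: average of the two agent risks (K = 2)."""
    return 0.5 * (J1(w) + J2(w))

W_STAR = 0.0             # stationary point of J, since (G - G) / 2 = 0
H_BAR = 0.5 * (H1 + H2)  # average Hessian at W_STAR

def excess_risk(w):
    return J(w) - J(W_STAR)

def quadratic_approx(w):
    """Second-order term (1/2) * H_BAR * (w - W_STAR)^2."""
    return 0.5 * H_BAR * (w - W_STAR) ** 2

for w in (1e-1, 1e-2, 1e-3):
    remainder = abs(excess_risk(w) - quadratic_approx(w))
    print(f"w = {w:.0e}: excess = {excess_risk(w):.3e}, remainder = {remainder:.3e}")

# Moment inequality of the type used in (334): E|x|^3 <= (E|x|^4)^(3/4).
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
m3 = sum(abs(x) ** 3 for x in xs) / len(xs)
m4 = sum(x ** 4 for x in xs) / len(xs)
print(f"E|x|^3 = {m3:.3f}, (E|x|^4)^(3/4) = {m4 ** 0.75:.3f}")
```

In this example the remainder shrinks cubically as `w` approaches `W_STAR`, while the quadratic term shrinks only quadratically, which is why the cubic term can be absorbed into a higher-order $O(\cdot)$ correction.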

To proceed, we examine the size of $\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}_{I\otimes\bar{H}}$. We start from the short-term model in (267) and iterate it,

\begin{align}
\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}=\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}+\mu\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d+\mu\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}\boldsymbol{s}_{i}^{B} \tag{338}
\end{align}

from which we proceed to examine the size of $\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\|^{2}_{I\otimes\bar{H}}$ as follows:

\begin{align}
\mathds{E}\left\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{n}^{\prime}\right\|^{2}_{I\otimes\bar{H}}
=\mathds{E}\left\|\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}+\mu\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}
+\mu^{2}\,\mathds{E}\left\|\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}\boldsymbol{s}_{i}^{B}\right\|^{2}_{I\otimes\bar{H}} \tag{339}
\end{align}

We first examine the term $\mathds{E}\left\|\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}+\mu\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}$. To do so, it is necessary to compute $\sum\limits_{i=0}^{n}\mathcal{C}^{i}$:

\begin{align}
\sum\limits_{i=0}^{n}\mathcal{C}^{i}=(I-\mathcal{C})^{-1}(I-\mathcal{C}^{n+1}) \tag{340}
\end{align}
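The finite geometric-sum identity in (340) holds for any square matrix for which $I-\mathcal{C}$ is invertible. A minimal numerical sketch is given below; the $2\times 2$ matrix `C` and the horizon `n` are hypothetical values chosen for illustration, and the small matrix helpers are written out so that the example depends only on the Python standard library.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matadd(a, b, sign=1.0):
    """Entrywise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matpow(a, p):
    """a raised to the nonnegative integer power p."""
    out = identity(len(a))
    for _ in range(p):
        out = matmul(out, a)
    return out

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def geometric_sum_direct(c, n):
    """Left-hand side of (340): sum of c^i for i = 0, ..., n."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(n + 1):
        total = matadd(total, matpow(c, i))
    return total

def geometric_sum_closed(c, n):
    """Right-hand side of (340): (I - c)^{-1} (I - c^{n+1})."""
    eye = identity(2)
    return matmul(inv2(matadd(eye, c, -1.0)), matadd(eye, matpow(c, n + 1), -1.0))

C = [[0.9, 0.05], [0.02, 0.6]]  # hypothetical matrix; I - C is invertible
n = 25
direct = geometric_sum_direct(C, n)
closed = geometric_sum_closed(C, n)
max_err = max(abs(x - y) for rd, rc in zip(direct, closed) for x, y in zip(rd, rc))
print(f"max entrywise difference: {max_err:.2e}")
```

The two sides agree up to floating-point rounding, so the closed form can replace the explicit sum.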

Recall that

\begin{align}
\mathcal{C}=\mathcal{A}-\mu\mathcal{A}_{2}\mathcal{H}=\mathcal{V}\mathcal{P}\mathcal{V}^{\sf T}-\mu\mathcal{A}_{2}\mathcal{H}=\mathcal{V}(\mathcal{P}-\mu\mathcal{V}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathcal{V})\mathcal{V}^{\sf T} \tag{341}
\end{align}

and consider

\begin{align}
\bar{\mathcal{C}}=\mathcal{P}-\mu\mathcal{V}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathcal{V}^{-\sf T} \tag{342}
\end{align}

then, it holds that

\begin{align}
(I-\mathcal{C})^{-1} &=\mathcal{V}(I-\bar{\mathcal{C}})^{-1}\mathcal{V}^{\sf T} \tag{343}\\
(I-\mathcal{C}^{n+1}) &=\mathcal{V}(I-\bar{\mathcal{C}}^{n+1})\mathcal{V}^{\sf T} \tag{344}
\end{align}
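The two relations (343) and (344) rely only on $\mathcal{C}=\mathcal{V}\bar{\mathcal{C}}\mathcal{V}^{\sf T}$ with $\mathcal{V}$ orthogonal. The sketch below checks both numerically. The random orthogonal matrix, the random `C_bar`, and the dimensions are hypothetical stand-ins and not the paper's actual $\mathcal{V}$ or $\bar{\mathcal{C}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 7                                   # illustrative sizes only

# Random orthogonal V (Q factor of a QR decomposition) and arbitrary C_bar.
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
C_bar = 0.3 * rng.standard_normal((m, m))
C = V @ C_bar @ V.T                           # C = V C_bar V^T
I = np.eye(m)

# (343): (I - C)^{-1} = V (I - C_bar)^{-1} V^T
lhs_inv = np.linalg.inv(I - C)
rhs_inv = V @ np.linalg.inv(I - C_bar) @ V.T

# (344): I - C^{n+1} = V (I - C_bar^{n+1}) V^T
lhs_pow = I - np.linalg.matrix_power(C, n + 1)
rhs_pow = V @ (I - np.linalg.matrix_power(C_bar, n + 1)) @ V.T

print("(343) holds:", np.allclose(lhs_inv, rhs_inv))
print("(344) holds:", np.allclose(lhs_pow, rhs_pow))
```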

Substituting (36) and (132) into (342), we obtain

\begin{equation}
\bar{\mathcal{C}}=\left[\begin{array}{cc}I&0\\ 0&\mathcal{P}_{\alpha}\end{array}\right]-\mu\left[\begin{array}{c}\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\\ \mathcal{V}_{\alpha}^{\sf T}\end{array}\right]\mathcal{A}_{2}\mathcal{H}\left[\begin{array}{cc}\frac{1}{\sqrt{K}}\mathds{1}&\mathcal{V}_{\alpha}\end{array}\right]=\left[\begin{array}{cc}I-\mu\bar{H}&-\mu\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}\\ -\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\frac{1}{\sqrt{K}}\mathds{1}&\mathcal{P}_{\alpha}-\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathcal{V}_{\alpha}\end{array}\right] \tag{352}
\end{equation}

from which we have

\begin{equation}
(I-\bar{\mathcal{C}})^{-1}=\left[\begin{array}{cc}\mu\bar{H}&\mu\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}\\ \mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\frac{1}{\sqrt{K}}\mathds{1}&I-\mathcal{P}_{\alpha}+\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathcal{V}_{\alpha}\end{array}\right]^{-1} \tag{355}
\end{equation}

We appeal to the block matrix inversion formula:

\begin{equation}
\left[\begin{array}{cc}A&B\\ C&D\end{array}\right]^{-1}=\left[\begin{array}{cc}A^{-1}&0\\ 0&0\end{array}\right]+\left[\begin{array}{cc}A^{-1}B\Delta^{-1}CA^{-1}&-A^{-1}B\Delta^{-1}\\ -\Delta^{-1}CA^{-1}&\Delta^{-1}\end{array}\right] \tag{362}
\end{equation}

where the Schur complement $\Delta$ is defined by

\begin{equation}
\Delta=D-CA^{-1}B \tag{363}
\end{equation}
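The block inversion formula (362) with the Schur complement (363) can be verified numerically. The following is a minimal sketch in which the blocks are hypothetical random matrices, shifted so that $A$ and $\Delta$ are comfortably invertible; the function name `block_inverse` is introduced here for illustration only.

```python
import numpy as np

def block_inverse(A, B, C, D):
    """Invert [[A, B], [C, D]] via (362), using Delta = D - C A^{-1} B from (363)."""
    A_inv = np.linalg.inv(A)
    Delta_inv = np.linalg.inv(D - C @ A_inv @ B)
    top_left = A_inv + A_inv @ B @ Delta_inv @ C @ A_inv
    top_right = -A_inv @ B @ Delta_inv
    bottom_left = -Delta_inv @ C @ A_inv
    return np.block([[top_left, top_right], [bottom_left, Delta_inv]])

# Hypothetical blocks (not from the paper); the 5*I shift keeps A and Delta well conditioned.
rng = np.random.default_rng(2)
p, q = 3, 4
A = rng.standard_normal((p, p)) + 5 * np.eye(p)
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
D = rng.standard_normal((q, q)) + 5 * np.eye(q)

M = np.block([[A, B], [C, D]])
print("formula (362) matches direct inversion:",
      np.allclose(block_inverse(A, B, C, D), np.linalg.inv(M)))
```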

Applying this formula to (355), we have

\begin{equation}
\Delta=I-\mathcal{P}_{\alpha}+O(\mu) \tag{364}
\end{equation}

and

\begin{align}
A^{-1} &=\frac{1}{\mu}\bar{H}^{-1} \tag{365}\\
\Delta^{-1} &=(I-\mathcal{P}_{\alpha}+O(\mu))^{-1} \tag{366}\\
A^{-1}B\Delta^{-1}CA^{-1} &=\frac{1}{K}\bar{H}^{-1}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathds{1}\bar{H}^{-1} \tag{367}\\
-A^{-1}B\Delta^{-1} &=-\frac{1}{\sqrt{K}}\bar{H}^{-1}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1} \tag{368}\\
-\Delta^{-1}CA^{-1} &=-\frac{1}{\sqrt{K}}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathds{1}\bar{H}^{-1} \tag{369}
\end{align}

so that

\begin{equation}
(I-\bar{\mathcal{C}})^{-1}=\left[\begin{array}{cc}
\frac{1}{\mu}\bar{H}^{-1}+\frac{1}{K}\bar{H}^{-1}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathds{1}\bar{H}^{-1} & -\frac{1}{\sqrt{K}}\bar{H}^{-1}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\\
-\frac{1}{\sqrt{K}}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathds{1}\bar{H}^{-1} & (I-\mathcal{P}_{\alpha}+O(\mu))^{-1}
\end{array}\right] \tag{372}
\end{equation}

As for $\bar{\mathcal{C}}^{n+1}$, we have

\begin{equation}
\bar{\mathcal{C}}^{n+1}=\left[\begin{array}{cc}I-\mu\bar{H}&-\mu\frac{1}{\sqrt{K}}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}\\ -\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\frac{1}{\sqrt{K}}\mathds{1}&\mathcal{P}_{\alpha}-\mu\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathcal{V}_{\alpha}\end{array}\right]^{n+1}=\left[\begin{array}{cc}(I-\mu\bar{H})^{n+1}+O(\mu^{2})&O(\mu)\\ O(\mu)&\mathcal{P}_{\alpha}^{n+1}+O(\mu^{2})\end{array}\right] \tag{377}
\end{equation}

We therefore obtain

\begin{equation}
I-\bar{\mathcal{C}}^{n+1}=\left[\begin{array}{cc}I-(I-\mu\bar{H})^{n+1}+O(\mu^{2})&O(\mu)\\ O(\mu)&I-\mathcal{P}_{\alpha}^{n+1}+O(\mu^{2})\end{array}\right] \tag{380}
\end{equation}

and also

\begin{equation}
\mathcal{C}^{n+1}=\mathcal{V}\bar{\mathcal{C}}^{n+1}\mathcal{V}^{\sf T}=O(1) \tag{381}
\end{equation}

Note that for any two invertible matrices $X$ and $Y$, where $X=O(1)$ and $Y=o(1)$, i.e., $X$ dominates $Y$, we have

\begin{equation}
(X+Y)^{-1}=X^{-1}-X^{-1}Y(I+X^{-1}Y)^{-1}X^{-1}=X^{-1}+o(1) \tag{382}
\end{equation}

which means that the inverse of the sum of two matrices is well approximated by the inverse of the dominant matrix. Likewise, for any two vectors $x$ and $y$ with $x=O(1)$ and $y=o(1)$, i.e., $x$ dominates $y$, we have

\begin{equation}
\|x+y\|^{2}=\|x\|^{2}+\|y\|^{2}+2x^{\sf T}y=\|x\|^{2}\pm o(1) \tag{383}
\end{equation}

which means that the squared norm of the sum of two vectors is well approximated by the squared norm of the dominant vector.
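The two approximation rules (382) and (383) can be illustrated numerically. In the sketch below, the matrices `X`, `Y0` and the vectors `x`, `y0` are hypothetical random instances, and the scale `eps` plays the role of the vanishing $o(1)$ perturbation. It checks that the middle expression in (382) is an exact identity, that the gap to $X^{-1}$ shrinks with `eps`, and that the gap in (383) is bounded by $\|y\|^{2}+2\|x\|\|y\|$ (Cauchy-Schwarz).

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4
# Hypothetical O(1) matrix X (shifted to be well conditioned) and perturbation direction Y0.
X = rng.standard_normal((m, m)) + 6 * np.eye(m)
Y0 = rng.standard_normal((m, m))
I = np.eye(m)
X_inv = np.linalg.inv(X)

def inverse_gap(eps):
    """Return ||(X + Y)^{-1} - X^{-1}|| for Y = eps * Y0, after checking the identity in (382)."""
    Y = eps * Y0
    exact = np.linalg.inv(X + Y)
    identity = X_inv - X_inv @ Y @ np.linalg.inv(I + X_inv @ Y) @ X_inv
    assert np.allclose(exact, identity)      # middle expression of (382) is exact
    return np.linalg.norm(exact - X_inv)

gaps = [inverse_gap(eps) for eps in (1e-1, 1e-2, 1e-3)]
print("inverse gaps:", gaps)

# Vector counterpart (383): | ||x + y||^2 - ||x||^2 | <= ||y||^2 + 2 ||x|| ||y||.
x = rng.standard_normal(m)
y0 = rng.standard_normal(m)

def norm_gap_and_bound(eps):
    y = eps * y0
    gap = abs(np.dot(x + y, x + y) - np.dot(x, x))
    bound = np.dot(y, y) + 2 * np.linalg.norm(x) * np.linalg.norm(y)
    return gap, bound

vector_checks = [norm_gap_and_bound(eps) for eps in (1e-1, 1e-2, 1e-3)]
print("vector gaps and bounds:", vector_checks)
```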

We recall the term $\mathds{E}\left\|\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}+\mu\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}$, and obtain:

\begin{equation}
\begin{aligned}
\mathds{E}\left\|\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}+\mu\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}
=\;&\mu^{2}\left\|\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}+\mu\mathds{E}\left(\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right)^{\sf T}(I\otimes\bar{H})(\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime})\\
&+\mathds{E}\|\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}\|^{2}_{I\otimes\bar{H}}
\end{aligned} \tag{384}
\end{equation}

Note that we have:

\begin{equation}
\begin{aligned}
&\left|\mu\mathds{E}\left(\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right)^{\sf T}(I\otimes\bar{H})(\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime})+\mathds{E}\|\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}\|^{2}_{I\otimes\bar{H}}\right|\\
&\leq\mu\mathds{E}\left\|\left(\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right)^{\sf T}(I\otimes\bar{H})\mathcal{C}^{n+1}\right\|\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}\|+\|\mathcal{C}^{n+1}\|^{2}\|\bar{H}\|\mathds{E}\|\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}\|^{2}\\
&\overset{(a)}{\leq}O(\mu)\times o\left(\sqrt{\frac{\mu}{B}}\right)+o\left(\frac{\mu}{B}\right)\\
&\overset{(b)}{=}o(\mu^{0.5(3+\eta)})+o(\mu^{1+\eta})
\end{aligned} \tag{385}
\end{equation}

where $(a)$ follows from Assumption III.1 and (381), and $(b)$ follows from (49). We then have:

\begin{equation}
\mathds{E}\left\|\mathcal{C}^{n+1}\widetilde{\boldsymbol{\scriptstyle\mathcal{W}}}_{-1}^{\prime}+\mu\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}=\mu^{2}\left\|\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}\pm o(\mu^{0.5(3+\eta)})\pm o(\mu^{1+\eta}) \tag{386}
\end{equation}

We now examine $\mu^{2}\left\|\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}$. To do so, we substitute (340), (343), (344), (372), (380), (382) and (383) into $\left\|\sum\limits_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}$, and obtain

$$
\begin{aligned}
\left\|\sum_{i=0}^{n}\mathcal{C}^{i}\mathcal{A}_{2}d\right\|^{2}_{I\otimes\bar{H}}
&= d^{\sf T}\mathcal{A}_{2}(I-\mathcal{C})^{-\sf T}(I-\mathcal{C}^{n+1})^{\sf T}(I\otimes\bar{H})(I-\mathcal{C}^{n+1})(I-\mathcal{C})^{-1}\mathcal{A}_{2}d \\
&\overset{(a)}{=} d^{\sf T}\mathcal{A}_{2}\mathcal{V}(I-\bar{\mathcal{C}})^{-\sf T}(I-\bar{\mathcal{C}}^{n+1})^{\sf T}\mathcal{V}^{\sf T}(I\otimes\bar{H})\mathcal{V}(I-\bar{\mathcal{C}}^{n+1})(I-\bar{\mathcal{C}})^{-1}\mathcal{V}^{\sf T}\mathcal{A}_{2}d \\
&\overset{(b)}{=} \left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}(I-\bar{\mathcal{C}})^{-\sf T}(I-\bar{\mathcal{C}}^{n+1})^{\sf T}\right\|^{2}_{I\otimes\bar{H}} \\
&\overset{(c)}{=} \left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}(I-\mathcal{P}_{\alpha}^{n+1})\right\|^{2}_{I\otimes\bar{H}}\pm o(1)
\end{aligned}
\tag{387}
$$

where $(a)$ follows from (343) and (344), and $(b)$ follows from the following equality:

$$
\mathcal{V}^{\sf T}(I\otimes\bar{H})\mathcal{V}=(V^{\sf T}\otimes I)(I\otimes\bar{H})(V\otimes I)=(V^{\sf T}V)\otimes\bar{H}=I\otimes\bar{H}
\tag{388}
$$

and $(c)$ follows from the following equality:

$$
\begin{aligned}
&\left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}(I-\bar{\mathcal{C}})^{-\sf T}(I-\bar{\mathcal{C}}^{n+1})^{\sf T}\right\|^{2}_{I\otimes\bar{H}} \\
&= \Bigg\|d^{\sf T}\mathcal{A}_{2}\left[\begin{array}{cc}\frac{1}{\sqrt{K}}\mathds{1} & \mathcal{V}_{\alpha}\end{array}\right]\left[\begin{array}{cc}\frac{1}{\mu}\bar{H}^{-1}+O(1) & -\frac{1}{\sqrt{K}}\bar{H}^{-1}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\\ -\frac{1}{\sqrt{K}}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathds{1}\bar{H}^{-1} & (I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\end{array}\right]^{\sf T} \\
&\quad\;\times\left[\begin{array}{cc}I-(I-\mu\bar{H})^{n+1}+O(\mu^{2}) & O(\mu)\\ O(\mu) & I-\mathcal{P}_{\alpha}^{n+1}+O(\mu^{2})\end{array}\right]^{\sf T}\Bigg\|^{2}_{I\otimes\bar{H}} \\
&\overset{(d)}{=} \Bigg\|\left[\begin{array}{cc}0 & d^{\sf T}\mathcal{A}_{2}\mathcal{V}_{\alpha}\end{array}\right]\left[\begin{array}{cc}\frac{1}{\mu}\bar{H}^{-1}+O(1) & -\frac{1}{\sqrt{K}}\bar{H}^{-1}\mathds{1}^{\sf T}\mathcal{H}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\\ -\frac{1}{\sqrt{K}}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\mathcal{V}_{\alpha}^{\sf T}\mathcal{A}_{2}\mathcal{H}\mathds{1}\bar{H}^{-1} & (I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\end{array}\right]^{\sf T} \\
&\quad\;\times\left[\begin{array}{cc}I-(I-\mu\bar{H})^{n+1}+O(\mu^{2}) & O(\mu)\\ O(\mu) & I-\mathcal{P}_{\alpha}^{n+1}+O(\mu^{2})\end{array}\right]^{\sf T}\Bigg\|^{2}_{I\otimes\bar{H}} \\
&= \frac{1}{K}\left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}\mathcal{V}_{\alpha}^{\sf T}\mathcal{H}\mathds{1}\bar{H}^{-1}\left(I-(I-\mu\bar{H})^{n+1}\right)+O(\mu)\right\|^{2}_{I\otimes\bar{H}} \\
&\quad\;+\left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha}+O(\mu))^{-1}(I-\mathcal{P}_{\alpha}^{n+1})+O(\mu)\right\|^{2}_{I\otimes\bar{H}} \\
&\overset{(e)}{=} \frac{1}{K}\left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}\mathcal{V}_{\alpha}^{\sf T}\mathcal{H}\mathds{1}\bar{H}^{-1}\left(I-(I-\mu\bar{H})^{n+1}\right)\right\|^{2}_{\bar{H}}+\left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}(I-\mathcal{P}_{\alpha}^{n+1})\right\|^{2}_{I\otimes\bar{H}}\pm O(\mu) \\
&\overset{(f)}{=} \left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}(I-\mathcal{P}_{\alpha}^{n+1})\right\|^{2}_{I\otimes\bar{H}}+O(\epsilon^{2})\pm O(\mu)
\end{aligned}
\tag{403}
$$

where $(d)$ follows from (107) and (108), $(e)$ follows from (382) and (383), and in $(f)$ we apply the following inequality:

$$
\begin{aligned}
&\left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}\mathcal{V}_{\alpha}^{\sf T}\mathcal{H}\mathds{1}\bar{H}^{-1}\left(I-(I-\mu\bar{H})^{n+1}\right)\right\|^{2}_{\bar{H}} \\
&\overset{(g)}{=} \left\|d^{\sf T}\mathcal{A}_{2}\mathcal{V}_{\alpha}(I-\mathcal{P}_{\alpha})^{-1}\mathcal{V}_{\alpha}^{\sf T}(\mathcal{H}-I\otimes\bar{H})\mathds{1}\bar{H}^{-1}\left(I-(I-\mu\bar{H})^{n+1}\right)\right\|^{2}_{\bar{H}} \\
&\leq O\left(\|\mathcal{H}-I\otimes\bar{H}\|^{2}\right)\overset{(h)}{=}O(\epsilon^{2})
\end{aligned}
\tag{404}
$$

where $(h)$ follows from Assumption III.3, and $(g)$ follows from the following equality:

$$
\mathcal{V}_{\alpha}^{\sf T}(I\otimes\bar{H})\mathds{1}=(V_{R}^{\sf T}\mathbbm{1})\otimes\bar{H}=0
\tag{405}
$$
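The Kronecker identities (388) and (405), together with the finite geometric sum $\sum_{i=0}^{n}\mathcal{C}^{i}=(I-\mathcal{C}^{n+1})(I-\mathcal{C})^{-1}$ used in the first line of (387), can be checked numerically. The following is only a minimal sketch under stated assumptions: it takes $\mathcal{V}=V\otimes I$, $\mathcal{V}_{\alpha}=V_{R}\otimes I$ and the block vector $\mathds{1}=\mathbbm{1}\otimes I$, with $V=[\frac{1}{\sqrt{K}}\mathbbm{1}\;\;V_{R}]$ orthogonal, and it uses randomly generated stand-ins for $\bar{H}$ and $\mathcal{C}$. The sizes `K`, `M`, `N` and all function names are hypothetical and do not come from the paper.

```python
import numpy as np


def build_orthogonal_basis(K, rng):
    """Return (ones, V_R, V) with V = [ones/sqrt(K), V_R] orthogonal (assumed structure)."""
    ones = np.ones((K, 1))
    # QR of [ones, random columns]: columns 1..K-1 of Q are orthonormal
    # and orthogonal to the all-ones vector.
    Q, _ = np.linalg.qr(np.hstack([ones, rng.standard_normal((K, K - 1))]))
    V_R = Q[:, 1:]
    V = np.hstack([ones / np.sqrt(K), V_R])
    return ones, V_R, V


def kron_identity_residuals(K=4, M=3, seed=0):
    """Largest absolute residuals of identities (388) and (405)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((M, M))
    H_bar = B @ B.T + M * np.eye(M)       # random symmetric positive-definite stand-in
    ones, V_R, V = build_orthogonal_basis(K, rng)
    I_M = np.eye(M)
    big_H = np.kron(np.eye(K), H_bar)     # I (x) H_bar
    cal_V = np.kron(V, I_M)               # assumed: cal_V = V (x) I
    cal_V_alpha = np.kron(V_R, I_M)       # assumed: cal_V_alpha = V_R (x) I
    big_ones = np.kron(ones, I_M)         # assumed: block vector 1 (x) I
    res_388 = np.abs(cal_V.T @ big_H @ cal_V - big_H).max()
    res_405 = np.abs(cal_V_alpha.T @ big_H @ big_ones).max()
    return res_388, res_405


def geometric_sum_residual(N=6, n=5, seed=0):
    """Residual of sum_{i=0}^{n} C^i = (I - C^{n+1}) (I - C)^{-1}."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((N, N))
    C = 0.5 * A / np.linalg.norm(A, 2)    # spectral norm 0.5, so I - C is invertible
    partial_sum = sum(np.linalg.matrix_power(C, i) for i in range(n + 1))
    closed_form = (np.eye(N) - np.linalg.matrix_power(C, n + 1)) @ np.linalg.inv(np.eye(N) - C)
    return np.abs(partial_sum - closed_form).max()


if __name__ == "__main__":
    print(kron_identity_residuals())
    print(geometric_sum_residual())
```

All three residuals should be at the level of floating-point round-off. The check covers only the exact algebraic identities; the $O(\mu)$, $O(\epsilon^{2})$ and $o(1)$ terms in (387) and (403) depend on quantities defined elsewhere in the paper and are not reproduced here.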

Substituting (F) into (386), we obtain