FedAWARE: Maximizing Gradient Diversity
for Heterogeneous Federated Server-side Optimization

Dun Zeng1, Zenglin Xu2 (corresponding author), Yu Pan3, Qifan Wang4, Xiaoying Tang5
1UESTC  2Fudan University  3Harbin Institute of Technology, Shenzhen
4Meta AI  5CUHK, Shenzhen
Abstract

Federated learning (FL) is a distributed learning framework in which numerous clients collaborate with a central server to train a model without sharing local data. However, standard federated optimization in real-world applications faces both statistical and system heterogeneity challenges, which result in unfavorable convergence behavior. Previous works have attempted to modify the local training process (client side) to tackle these challenges, but they overlooked that server-side updates can efficiently coordinate the diverse local updates. This work explores the effect of server-side updates on heterogeneity issues. We first introduce the gradient diversity maximization direction and find that moving the global model continuously in this direction yields fast and stable convergence. Then, we derive a novel server-side optimizer, FedAWARE, with rigorous convergence analysis for general non-convex settings. Extensive experiments across multiple heterogeneous federated settings on four datasets show that FedAWARE achieves competitive convergence performance compared with state-of-the-art adaptive federated optimizers. Furthermore, our results show that FedAWARE can enhance the performance of FL algorithms as a plug-in module. Our source code is available at https://github.com/dunzeng/FedAWARE.

1 Introduction

We study a standard cross-device federated learning (FL) task [15], which minimizes a finite sum of local empirical objectives:

$$\min_{\boldsymbol{x}\in\mathcal{X}} f(\boldsymbol{x}) := \sum_{i=1}^{N}\lambda_i f_i(\boldsymbol{x}) := \sum_{i=1}^{N}\lambda_i\,\mathbb{E}_{\xi_i\sim\mathcal{D}_i}\left[F_i(\boldsymbol{x},\xi_i)\right], \tag{1}$$

where $\boldsymbol{x}\in\mathcal{X}\subseteq\mathbb{R}^{d}$ denotes the parameters of the machine learning model, $f(\boldsymbol{x})$ is the global objective weighted by $\lambda_i$ (s.t. $\sum_{i=1}^{N}\lambda_i=1$ and $\lambda_i\geq 0,\ \forall i$), $\xi_i$ is a stochastic data batch, and $\mathcal{D}_i$ denotes the dataset on the $i$-th client ($i\in\{1,2,\ldots,N\}$). The federated optimization framework that minimizes the global objective involves client optimization and server optimization procedures [35], as shown in Figure 1 and Algorithm 2. Take FedAvg [33] as an example:

$$\text{Client: } \boldsymbol{g}_i^{t}=\boldsymbol{x}_i^{t,K_i}-\boldsymbol{x}_i^{t,0}=\eta_l\sum_{k=0}^{K_i-1}\nabla F_i(\boldsymbol{x}_i^{t,k});\qquad \text{Server: } \boldsymbol{x}^{t+1}=\boldsymbol{x}^{t}-\eta\sum_{i=1}^{N}\lambda_i\boldsymbol{g}_i^{t} \tag{2}$$

where $\nabla F_i(\boldsymbol{x})$ denotes the stochastic gradient over a mini-batch of samples, $\boldsymbol{x}_i^{t,k}$ denotes client $i$'s model after $k$ local update steps in the $t$-th communication round, $\eta_l$ is the client learning rate, and $\eta$ is the server learning rate. This federated optimization framework is general and covers a broad range of FL algorithms [33, 35, 45, 41]. However, it suffers from two major optimization challenges [25, 26].

Figure 1: Illustration of federated optimization framework and challenges. System heterogeneity (left) induces unbalanced local updates and enlarges the statistical heterogeneity impacts (right). FedAWARE can tackle these challenges on the server side.
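
As a concrete illustration of the client and server steps in (2), the following minimal sketch runs FedAvg on two toy quadratic clients. Everything here is an assumption made for illustration and not the authors' implementation: the objectives $f_i(\boldsymbol{x}) = \frac{1}{2}\|\boldsymbol{x}-\boldsymbol{c}_i\|^2$, the use of exact rather than stochastic gradients, the function names, and the convention that $\boldsymbol{g}_i$ is the start-minus-end difference of the local model so that the server step is a descent step.

```python
def local_update(x, grad_fn, steps, lr):
    """Client side: run `steps` local gradient steps from the global model x.

    Returns g_i = x_start - x_end, i.e. the lr-scaled sum of local gradients
    (sign convention assumed here so that the server step descends).
    """
    x_local = list(x)
    for _ in range(steps):
        grad = grad_fn(x_local)
        x_local = [xi - lr * gi for xi, gi in zip(x_local, grad)]
    return [a - b for a, b in zip(x, x_local)]


def fedavg_round(x, grad_fns, local_steps, weights, local_lr, server_lr):
    """Server side: weighted average of client updates, then one global step."""
    updates = [local_update(x, fn, k, local_lr)
               for fn, k in zip(grad_fns, local_steps)]
    aggregate = [sum(w * u[d] for w, u in zip(weights, updates))
                 for d in range(len(x))]
    return [xi - server_lr * ai for xi, ai in zip(x, aggregate)]


def quadratic_grad(center):
    """Gradient of the hypothetical client objective 0.5 * ||x - center||^2."""
    return lambda x: [xi - ci for xi, ci in zip(x, center)]


def run(local_steps, rounds=50):
    grad_fns = [quadratic_grad([1.0, 0.0]), quadratic_grad([0.0, 1.0])]
    x = [0.0, 0.0]
    for _ in range(rounds):
        x = fedavg_round(x, grad_fns, local_steps, [0.5, 0.5], 0.1, 1.0)
    return x


balanced = run([3, 3])      # K_1 = K_2
unbalanced = run([1, 5])    # K_1 != K_2 (system heterogeneity)
```

In this toy setting the balanced run approaches the minimizer $(0.5, 0.5)$ of the weighted objective, whereas the unbalanced run settles closer to the optimum of the client that takes more local steps. This is one simple way to see how unbalanced local updates can amplify the effect of heterogeneous client objectives.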

Statistical heterogeneity is a crucial challenge that drives local updates in divergent directions. This problem mainly arises from data being non-identically distributed (Non-IID) across devices. System heterogeneity is another critical challenge: it causes unbalanced local update steps across clients [26, 42] due to varying hardware specifications, operating systems, and software configurations in cross-device training [13]. Hence, faster clients may perform more local update steps than slower clients within a fixed wall-clock time interval, and the unbalanced local updates further enlarge the variance of local updates. Together, these two challenges lead to unfavorable convergence behavior, which in turn causes communication inefficiency and degraded model performance. For a more in-depth discussion of FL and its challenges, we refer the reader to [15, 41].

To address statistical heterogeneity, previous federated optimization algorithms manipulate local updates, for example by introducing control variates [16], momentum terms [10, 35], dynamic regularization [1], or local penalty terms [26] into client-side optimization. On the other hand, some works [16, 33, 35] assume that the number of local updates is constant and balanced across all clients, i.e., $K_i=K_j,\ \forall i,j\in[N]$, thereby ignoring the impact of system heterogeneity. To alleviate the impact of unbalanced local update steps, other algorithms [26, 42, 1] primarily modify client-side optimization or use advanced local solvers. For example, FedProx [26] uses local penalty terms to prevent large local updates $\boldsymbol{g}_i$, and FedNova [42] normalizes the local updates $\boldsymbol{g}_i$ by the number of local update steps $K_i$ to keep the optimization objective consistent. However, the potential of server-side optimization has not been fully explored, as most algorithms update the global model only via vanilla averaged gradient descent.

In this paper, we argue that the server can be powerful enough to handle the heterogeneity challenges without advanced local solvers. To this end, we explore the impact of such hybrid heterogeneity (i.e., statistical heterogeneity with unbalanced local updates), particularly on server-side optimization in the federated optimization framework (2). Our results answer a critical question:

How can server-side aggregation of local updates mitigate the impacts of heterogeneity?

Contributions. The main contributions of this work can be summarised as follows:

  • We first propose the gradient diversity maximization direction, that is, the aggregation result with the minimal norm among linear combinations of local updates. Our findings suggest that continuously moving the global model in this direction yields fast and stable convergence. Besides, we highlight gradient diversity as a new training dynamic for FL and find that stationary points with higher gradient diversity tend to achieve better test accuracy.

  • We introduce a novel federated server-side optimization method called Federated Adaptive Weighted AggREgator (FedAWARE), which adapts the direction of global model updates to enlarge gradient diversity. We provide rigorous convergence guarantees that support the gradient diversity maximization theory and the effectiveness of FedAWARE in tackling heterogeneity issues.

  • We show that FedAWARE is orthogonal to existing algorithms and can serve as a plug-in module for enhancing state-of-the-art FL algorithms. Specifically, we modify the gradient descent direction of these algorithms to maximize their gradient diversity dynamics and find that their accuracy improves.

  • We evaluate our findings under various degrees of hybrid heterogeneity across four datasets. Extensive empirical evidence demonstrates that FedAWARE consistently outperforms baseline optimization algorithms. The results support our findings about gradient diversity and emphasize the importance of adaptivity in heterogeneous federated optimization.

2 Preliminaries

Related works. This paper is related to a broad range of federated optimization algorithms that address heterogeneity issues, as discussed in Appendix A. Representative adaptive optimization algorithms [35, 52, 1] are the most relevant to our work because they can be viewed as server-side optimization methods that incorporate adaptive momentum to achieve fast convergence under heterogeneity. We show in Section 4 that our method optimizes different terms, and in Section 5 that it can further enhance their performance.

Settings. In this paper, we allow the numbers of local steps $K_i$, $i\in[N]$, $[N]=\{1,\dots,N\}$, to be unequal in order to model system heterogeneity. Typically, the number of local update steps depends on factors such as the chosen batch size, the number of training epochs, and the number of data samples, all of which can vary across clients. Different devices in cross-device federated learning may have varying computational abilities and data quantities. Therefore, more powerful devices may perform local training with larger batch sizes and more epochs [26]. Besides, clients typically run multiple steps of local SGD and upload only their updates, and the FL server accesses only the uploaded information. We focus on the non-convex optimization problem for (1), which typically covers the training of neural networks. Therefore, our analyses rely on common non-convex optimization assumptions [35, 45, 1] on the local objectives $f_i(\boldsymbol{x})$, $i\in[N]$, and we use $\|\cdot\|$ to denote the $\ell_2$ norm by default:

Assumption 2.1 (Smoothness)

Each objective $f_i(\boldsymbol{x})$, $i\in[N]$, is $L$-smooth, i.e., for all $\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{d}$, it holds that $\|\nabla f_i(\boldsymbol{x})-\nabla f_i(\boldsymbol{y})\|\leq L\|\boldsymbol{x}-\boldsymbol{y}\|$.

Assumption 2.2 (Unbiasedness and Bounded Local Variance)

For each $i\in[N]$ and $\boldsymbol{x}\in\mathbb{R}^{d}$, we assume access to an unbiased stochastic gradient $\nabla F_i(\boldsymbol{x},\xi_i)$ of the client's true gradient $\nabla f_i(\boldsymbol{x})$, i.e., $\mathbb{E}_{\xi_i\sim\mathcal{D}_i}\left[\nabla F_i(\boldsymbol{x},\xi_i)\right]=\nabla f_i(\boldsymbol{x})$. Each function $f_i$ has $\sigma_l$-bounded (local) variance, i.e., $\mathbb{E}_{\xi_i\sim\mathcal{D}_i}\left[\left\|\nabla F_i(\boldsymbol{x},\xi_i)-\nabla f_i(\boldsymbol{x})\right\|^{2}\right]\leq\sigma_l^{2}$.
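
Returning to the settings above, the dependence of the local step count $K_i$ on data size, batch size, and epochs can be sketched as follows. The formula assumes one SGD step per mini-batch with the final partial batch kept. The client configurations are hypothetical values chosen only to show how quickly the $K_i$ can diverge.

```python
import math


def num_local_steps(num_samples, batch_size, epochs):
    """Local SGD steps in one round, assuming one step per mini-batch
    and that the last (possibly smaller) batch is not dropped."""
    return epochs * math.ceil(num_samples / batch_size)


# Hypothetical clients: (number of samples, batch size, local epochs).
clients = [(1000, 32, 2), (200, 16, 1), (5000, 64, 5)]
local_steps = [num_local_steps(*cfg) for cfg in clients]
```

Under these assumed configurations the three clients run 64, 13, and 395 local steps per round, so the largest client performs roughly thirty times as many updates as the smallest.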

3 Analysis of Statistical Heterogeneity and System Heterogeneity Impacts

This section provides our theoretical insights into the general impacts of hybrid heterogeneity on federated optimization. In Section 3.1, we discuss the theoretical measurement of statistical heterogeneity and highlight the power of gradient diversity. In Section 3.2, we show a loose upper bound on the averaged local updates and discuss how previous works minimize it. Moreover, we propose the gradient diversity maximization direction, which is the aggregation result with the minimal norm among linear combinations of local updates. Proofs can be found in Appendix B.
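
To make the notion of a minimal-norm combination of local updates concrete, here is a minimal sketch of one way such weights could be computed. It rests on assumptions not stated in this section: the combination weights are restricted to the probability simplex, and the solver is a plain Frank-Wolfe iteration with exact line search. The function name and the toy updates are hypothetical, and the sketch should not be read as the FedAWARE algorithm itself.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def min_norm_weights(updates, iters=100):
    """Approximate argmin_w ||sum_i w_i * g_i||^2 over the simplex
    (w_i >= 0, sum_i w_i = 1) with Frank-Wolfe and exact line search."""
    n = len(updates)
    gram = [[dot(gi, gj) for gj in updates] for gi in updates]
    w = [1.0 / n] * n
    for _ in range(iters):
        gw = [dot(row, w) for row in gram]        # gw[j] = <g_j, v>, v = sum_i w_i g_i
        vv = dot(w, gw)                           # ||v||^2
        j = min(range(n), key=lambda i: gw[i])    # vertex minimizing the linearization
        denom = vv - 2.0 * gw[j] + gram[j][j]     # ||g_j - v||^2
        if denom <= 1e-12:
            break
        gamma = max(0.0, min(1.0, (vv - gw[j]) / denom))
        if gamma <= 0.0:
            break
        w = [(1.0 - gamma) * wi for wi in w]
        w[j] += gamma
    return w


# Toy local updates: the larger update receives the smaller weight.
updates = [[2.0, 0.0], [0.0, 1.0]]
weights = min_norm_weights(updates)
direction = [sum(w * g[d] for w, g in zip(weights, updates)) for d in range(2)]
```

For these two orthogonal toy updates the weights come out near $(0.2, 0.8)$, so the minimal-norm aggregate down-weights the larger update instead of averaging the two uniformly.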

3.1 Gradient Diversity is a Fine-Grained Measurement of Statistical Heterogeneity Impacts

We propose using gradient diversity to measure the degree of statistical heterogeneity. Its definition is provided below:

Definition 3.1 (Gradient diversity [26, 51, 9])

The gradient diversity is defined as:

$$D(\boldsymbol{x}) := \sqrt{\sum_{i=1}^{N}\lambda_i\frac{\|\nabla f_i(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}} \geq 1. \tag{3}$$

Similar definitions have been proposed for differing purposes [26, 51, 9]. Gradient diversity quantifies the degree to which individual gradients diverge from each other. Specifically, $D(\boldsymbol{x})=1$ corresponds to the independent and identically distributed (IID) setting, in which all local gradients agree. In contrast, gradient diversity is large when the gradients are almost orthogonal or point in opposite directions. Therefore, a larger value of $D(\boldsymbol{x})$ indicates more statistical heterogeneity among the local functions. In federated optimization, FedProx [26] and subsequent works [50, 43, 5] introduce an assumed upper bound on gradient diversity into their convergence rates to represent the worst-case statistical heterogeneity.
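
The definition in (3) can be evaluated directly when the local gradients are available. The sketch below is illustrative only: the function name and the toy gradients are assumptions, and the global gradient is taken to be the $\lambda$-weighted sum of local gradients, following (1).

```python
import math


def sq_norm(v):
    return sum(x * x for x in v)


def gradient_diversity(local_grads, weights):
    """D(x) from (3): sqrt(sum_i w_i ||g_i||^2 / ||sum_i w_i g_i||^2)."""
    dim = len(local_grads[0])
    global_grad = [sum(w * g[d] for w, g in zip(weights, local_grads))
                   for d in range(dim)]
    numerator = sum(w * sq_norm(g) for w, g in zip(weights, local_grads))
    denominator = sq_norm(global_grad)
    if denominator == 0.0:
        return math.inf  # local gradients cancel exactly
    return math.sqrt(numerator / denominator)


equal_weights = [0.5, 0.5]
d_identical = gradient_diversity([[1.0, 2.0], [1.0, 2.0]], equal_weights)
d_orthogonal = gradient_diversity([[1.0, 0.0], [0.0, 1.0]], equal_weights)
d_opposing = gradient_diversity([[1.0, 0.0], [-0.9, 0.0]], equal_weights)
```

Identical gradients give $D = 1$, orthogonal unit gradients give $D = \sqrt{2}$, and nearly opposing gradients give a much larger value, which matches the qualitative description above.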

Analogous to gradient diversity, recent works commonly make assumptions that measure statistical heterogeneity through first-order gradients. For example, Assumption 3.1 is used in [12, 46, 23], while Assumption 3.2 is used in [50, 43, 5]. These notions are closely related, as shown in Lemma 3.1.

Assumption 3.1 (Bounded global variance)

We assume the averaged global variance is bounded, i.e., $\sum_{i=1}^{N}\lambda_i\left\|\nabla f_i(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}\leq\sigma_g^{2}$ for all $\boldsymbol{x}\in\mathcal{X}$.

Assumption 3.2 (Bounded gradient dissimilarity)

There exist constants $B\geq 1$, $G\geq 0$ such that $\sum_{i=1}^{N}\lambda_i\left\|\nabla f_i(\boldsymbol{x})\right\|^{2}\leq B^{2}\|\nabla f(\boldsymbol{x})\|^{2}+G^{2}$ for all $\boldsymbol{x}\in\mathcal{X}$.

Lemma 3.1 (Heterogeneity assumption equivalence)

Let Assumption 3.1 hold. Then we have

$$\sum_{i=1}^{N}\lambda_i\left\|\nabla f_i(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}\leq\sigma_g^{2}\;\Leftrightarrow\;\sum_{i=1}^{N}\lambda_i\left\|\nabla f_i(\boldsymbol{x})\right\|^{2}\leq\|\nabla f(\boldsymbol{x})\|^{2}+\sigma_g^{2},$$

which is the special case of Assumption 3.2 with $B=1$, $G=\sigma_g$.

Furthermore, Assumptions 3.1 and 3.2 define an upper bound on gradient diversity, as discussed below:

Corollary 3.1 (Bounded gradient diversity)

Let Assumption 3.1 hold. Then $D(\boldsymbol{x})\leq\sqrt{1+\frac{\sigma_g^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}$, which is also connected to Assumption 3.2 with $G=0$. In this case, $B$ is the upper bound of gradient diversity.
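
The equivalence in Lemma 3.1 and the bound in Corollary 3.1 both follow from the identity $\sum_i\lambda_i\|\nabla f_i-\nabla f\|^2=\sum_i\lambda_i\|\nabla f_i\|^2-\|\nabla f\|^2$, which holds whenever $\nabla f=\sum_i\lambda_i\nabla f_i$ and $\sum_i\lambda_i=1$. The following sketch checks it numerically on hypothetical toy gradients. The variable names are illustrative, and $\sigma_g^2$ is set to the exact weighted variance so that the bound is attained with equality.

```python
import math


def sq_norm(v):
    return sum(x * x for x in v)


# Hypothetical local gradients and aggregation weights (weights sum to 1).
local_grads = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
weights = [0.5, 0.3, 0.2]

global_grad = [sum(w * g[d] for w, g in zip(weights, local_grads))
               for d in range(2)]

# Left-hand side of Assumption 3.1: weighted variance around the global gradient.
variance = sum(w * sq_norm([a - b for a, b in zip(g, global_grad)])
               for w, g in zip(weights, local_grads))
# Weighted mean of squared local gradient norms.
mean_sq_norm = sum(w * sq_norm(g) for w, g in zip(weights, local_grads))

# Identity behind Lemma 3.1.
identity_gap = variance - (mean_sq_norm - sq_norm(global_grad))

# Corollary 3.1 with sigma_g^2 taken as the exact variance.
diversity = math.sqrt(mean_sq_norm / sq_norm(global_grad))
bound = math.sqrt(1.0 + variance / sq_norm(global_grad))
```

Here `identity_gap` is zero up to rounding and `diversity` coincides with `bound`. Any larger $\sigma_g^2$ would only loosen the bound.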

Theoretical observation: the evolution of gradient diversity is ignored. In FL training, previous works focus only on the upper bound of gradient diversity, implicitly through Assumptions 3.1 and 3.2. Moreover, these works typically suggest that a larger upper bound of gradient diversity leads to a worse convergence rate. However, the gradient diversity $D(\boldsymbol{x})$ measures the diversity information in local updates during training, which is not considered. Therefore, it is important to understand how the evolution of $D(\boldsymbol{x})$ affects a federated optimization process. Following previous works, we use Assumption 3.1 as the implicit upper bound of $D(\boldsymbol{x})$ in this paper. In addition, our convergence analysis in Section 4 focuses on the evolution of $D(\boldsymbol{x})$ during FL training.

3.2 Hybrid Heterogeneity Causes Loose Upper Bound of Averaged Local Updates

This section shows that additional system heterogeneity further amplifies the impact of statistical heterogeneity in federated optimization. We begin by examining the scenario in which all clients run the same number of local update steps, as captured in the following lemma:

Lemma 3.2 (Upper bound of balanced local updates [35])

Under Assumptions 2.1, 2.2, and 3.1, for communication round $t\in[T]$, any client $i\in[N]$ with balanced local iteration steps $k\in[K]$, and local learning rate $\eta_l\leq\frac{1}{K}$, the average of local updates can be bounded as:

$$\sum_{i=1}^{N}\lambda_i\mathbb{E}\left\|\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq 5\eta_l\left(\sigma_l^{2}+6K\sigma_g^{2}+6K\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right). \tag{4}$$

Then, we further extend the lemma to a more general scenario where clients run unbalanced local update steps. Corollary 3.2 discusses a loose upper bound on the averaged local updates:

Corollary 3.2 (Loose upper bound of unbalanced local updates)

We denote the deviation of client $i$'s local gradient from the global gradient by $\zeta_i(\boldsymbol{x})=\left\|\nabla f_i(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}$. Under Assumptions 2.1, 2.2, and 3.1, for communication round $t\in[T]$, all clients $i\in[N]$ with arbitrary local iteration steps $k\in[K_i]$, and local learning rate $\eta_l\leq\frac{1}{K_i}$, the average of local updates can be bounded as:

$$\sum_{i=1}^{N}\lambda_i\mathbb{E}\left\|\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq\Phi_{\text{Hetero}}+5\eta_l\left(\sigma_l^{2}+6K_{\text{min}}^{2}\sigma_g^{2}+6\tilde{K}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right), \tag{5}$$

where $\tilde{K}=\sum_{i=1}^{N}\lambda_i K_i$, $K_{\text{min}}=\min(K_1,\dots,K_N)$, and $\Phi_{\text{Hetero}}=\sum_{i=1}^{N}30\eta_l(K_i-K_{\text{min}})\zeta_i(\boldsymbol{x}^{t})$.
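
The quantities entering (5) are straightforward to compute for a given configuration. The sketch below evaluates $K_{\text{min}}$, $\tilde{K}$, and $\Phi_{\text{Hetero}}$ as defined in Corollary 3.2. The step counts, weights, and dissimilarities $\zeta_i$ are hypothetical values chosen for illustration, with the learning rate picked to satisfy $\eta_l\leq 1/K_i$.

```python
def hetero_terms(local_steps, weights, zetas, local_lr):
    """Return (K_min, K_tilde, Phi_Hetero) as defined in Corollary 3.2."""
    k_min = min(local_steps)
    k_tilde = sum(w * k for w, k in zip(weights, local_steps))
    phi_hetero = sum(30.0 * local_lr * (k - k_min) * z
                     for k, z in zip(local_steps, zetas))
    return k_min, k_tilde, phi_hetero


weights = [0.5, 0.3, 0.2]
zetas = [1.0, 0.5, 2.0]   # hypothetical ||grad f_i - grad f||^2 per client
local_lr = 0.01           # satisfies eta_l <= 1 / K_i for all K_i below

unbalanced = hetero_terms([2, 5, 10], weights, zetas, local_lr)
balanced = hetero_terms([5, 5, 5], weights, zetas, local_lr)
```

With balanced steps $\Phi_{\text{Hetero}}$ vanishes, while in the unbalanced configuration it is dominated by the client that combines the most extra steps with the largest dissimilarity.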

Remark 3.1

We do not assume a bound on the local dissimilarity $\zeta_i(\boldsymbol{x}^{t})$ in $\Phi_{\text{Hetero}}$. Therefore, when the numbers of local update steps become unbalanced, the upper bound (4) is replaced by (5) with the induced hybrid heterogeneity term $\Phi_{\text{Hetero}}$. There will always be at least one client $i$ such that $K_i-K_{\text{min}}=0$, making Assumption 3.1 inapplicable. Consequently, the term $\Phi_{\text{Hetero}}$ is enlarged by the unbalanced local steps, making (5) a very loose bound, which can negatively impact the performance of federated optimization.

Theoretical observation: alleviating hybrid heterogeneity by minimizing (5). Typically, previous works in federated optimization implicitly minimize a few terms in the upper bound (5) to improve optimization performance. For example, FedProx [26] leverages a penalty term to reduce the local updates $\mathbb{E}\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\|^{2}$ in Corollary 3.2. Similarly, SCAFFOLD [16], FedAvgM [10], and FedDyn [1] correct the local updates using variance regularization terms to narrow the variances $\sigma_{l}$ and $\sigma_{g}$. FedNova [42] normalizes local updates by the number of local steps to reduce the scale effects of $K_{i},\forall i$. Therefore, minimizing the upper bound (5) can efficiently mitigate hybrid heterogeneity, and previous works achieve this by manipulating local updates. Recall that non-convex federated optimization decreases $\|\nabla f(\boldsymbol{x}^{t})\|$ from $t=0$ to $T$; we propose that the term $\|\nabla f(\boldsymbol{x}^{t})\|$ can be further minimized by the server at each round $t$ to tighten the bound (5).
Moreover, previous convergence analyses of FedAvg variants [26, 16, 42] expected a low upper bound of gradient diversity (discussed in Appendix D.7) for a better convergence rate. However, minimizing $\|\nabla f(\boldsymbol{x}^{t})\|$ potentially enlarges the upper bound of gradient diversity by Corollary 3.2. Therefore, the upper bounds of gradient diversity in Corollary 3.1 and of hybrid heterogeneity in Corollary 3.2 both depend on the global gradient norm $\|\nabla f(\boldsymbol{x}^{t})\|$, revealing an interesting trade-off in convergence analysis.

Empirical observation: gradient diversity of FedAvg. Since the ultimate goal of an FL system is to train a well-generalized global model, it is important to identify the relation among gradient diversity, global gradient, and model quality. To this end, we conduct observation experiments of FedAvg on heterogeneous federated CIFAR10 (detailed in Section 5). We show the training dynamics of test accuracy, gradient diversity, and global first-order gradient in Figure 2. Higher test accuracy curves correspond to trajectory points with higher gradient diversity. Hence, larger gradient diversity values during federated optimization do not lead to worse convergence performance. The observation empirically reveals a positive relationship between gradient diversity dynamics and model quality.

3.3 Gradient Diversity Maximization for Tighter Averaged Local Updates

Figure 2: Observation: gradient diversity is related to convergence quality.

We consider the simple and effective federated weighted averaging form on server-side optimization for a tighter (5) (i.e., minimizing $\|\nabla f(\boldsymbol{x})\|$). The weight typically represents the importance of each local function in the global objective. Local function reweighting schemes have been adopted to improve fairness [27, 34], robustness [24], and generalization [28] by adjusting $\lambda$ for FedAvg.

Analogously, we use dynamic weights $\tilde{\lambda}$ to study a surrogate global objective [2, 42, 43]: $\tilde{f}(\boldsymbol{x})=\sum_{i=1}^{N}\tilde{\lambda}_{i}f_{i}(\boldsymbol{x})$. We focus on the adaptive global objective with appropriate adaptive weights $\tilde{\lambda}$. This is technically different from previous work that considers the global objective to be deterministic for a fixed $\lambda$ [33, 35, 45]. Then, we replace the global gradient of (3) with the surrogate global gradient:

Definition 3.2 (Surrogate gradient diversity)

The surrogate gradient diversity is defined as:

$$\tilde{D}(\boldsymbol{x}):=\sqrt{\sum_{i=1}^{N}\frac{\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla\tilde{f}(\boldsymbol{x})\|^{2}}}\geq 1,\;\text{where}\;\nabla\tilde{f}(\boldsymbol{x})=\sum_{i=1}^{N}\tilde{\lambda}_{i}\nabla f_{i}(\boldsymbol{x}) \tag{6}$$

is the surrogate global gradient. We keep the original weights $\lambda$ for the local gradients in the numerator.

We provide the gap between surrogate gradient diversity and original gradient diversity in Appendix D.4. This modification allows us to capture the convergence differences between the practically applied weights $\tilde{\lambda}$ and the theoretically assumed weights $\lambda$, as elaborated in Section 4.
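As an illustration of Definition 3.2, the sketch below evaluates the surrogate gradient diversity in (6) for a toy problem. It is a minimal example under stated assumptions: the function name `surrogate_gradient_diversity`, the two client gradients, and the weight vectors are hypothetical, and gradients are represented as plain Python lists.

```python
import math


def surrogate_gradient_diversity(grads, lam, lam_tilde):
    """Eq. (6): sqrt(sum_i lam_i ||g_i||^2 / ||sum_i lam_tilde_i g_i||^2)."""
    dim = len(grads[0])
    # The numerator keeps the original weights lam on the local gradient norms.
    numerator = sum(l * sum(v * v for v in g) for l, g in zip(lam, grads))
    # The denominator uses the surrogate global gradient built from lam_tilde.
    surrogate = [sum(lt * g[j] for lt, g in zip(lam_tilde, grads)) for j in range(dim)]
    denominator = sum(v * v for v in surrogate)
    return math.sqrt(numerator / denominator)


# Two hypothetical clients with orthogonal gradients of different scales.
grads = [[1.0, 0.0], [0.0, 2.0]]
lam = [0.5, 0.5]

d_original = surrogate_gradient_diversity(grads, lam, lam)          # lam_tilde = lam
d_reweighted = surrogate_gradient_diversity(grads, lam, [0.8, 0.2])  # smaller surrogate norm
```

Setting `lam_tilde = lam` recovers the original gradient diversity, while weights that shrink the norm of the surrogate global gradient (here `[0.8, 0.2]`) yield a larger value, since only the denominator changes.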

Deriving gradient diversity maximization direction. Since the FL server cannot directly access local data samples due to privacy concerns [58], it relies solely on the local updates uploaded from clients at each round. The objective of server-side gradient diversity maximization is to minimize the denominator of (6), with the numerator determined by the local updates. We provide an illustrative case with $N=2$ in Appendix D.1. For the general case, the local updates $\nabla f_{i}(\boldsymbol{x}^{t}),i\in[N]$ are observed by the FL server at the $t$-th round. We can find a minimum of the norm of the weighted averaged global gradient $\|\nabla f(\boldsymbol{x})\|$. Therefore, our gradient diversity maximization direction becomes:

$$\boldsymbol{d}=\sum_{i=1}^{N}\tilde{\lambda}_{i}\nabla f_{i}(\boldsymbol{x}),\quad\text{s.t.}\;\sum_{i=1}^{N}\tilde{\lambda}_{i}=1,\;\tilde{\lambda}_{i}\geq 0\;\forall i, \tag{7}$$

where $\tilde{\lambda}=\arg\min_{\tilde{\lambda}}\|\sum_{i=1}^{N}\tilde{\lambda}_{i}\nabla f_{i}(\boldsymbol{x})\|^{2}$. This constrained minimization problem amounts to finding a minimum-norm point in the convex hull of the set of input gradients. As the dimension of gradients can be in the millions, we use the Frank-Wolfe algorithm [14] to solve it. Importantly, we highlight that the gradient diversity maximization direction is a non-zero vector (i.e., $\|\boldsymbol{d}\|>0$ from (7)) in the practice of non-convex optimization, as discussed in Appendix D.5. Besides, (7) is similar to a sub-problem in multi-objective optimization works [14, 38, 60]. We clarify the differences in Appendix D.2.
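The minimum-norm problem in (7) can be sketched with a generic Frank-Wolfe iteration with exact line search that works only on the $N\times N$ Gram matrix of the gradients. This is a minimal illustration, not the released FedAWARE implementation: the function name `min_norm_weights`, the iteration budget, the tolerance, and the toy gradients are all assumptions made for the example.

```python
def min_norm_weights(grads, iters=200, tol=1e-12):
    """Frank-Wolfe with exact line search for min_w ||sum_i w_i g_i||^2 on the simplex."""
    n = len(grads)
    # Gram matrix of pairwise inner products <g_i, g_j>; the high-dimensional
    # gradients are touched only here.
    gram = [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]
    w = [1.0 / n] * n  # start from uniform weights
    for _ in range(iters):
        gw = [sum(gram[i][j] * w[j] for j in range(n)) for i in range(n)]  # <g_i, d>
        t = min(range(n), key=gw.__getitem__)  # vertex minimizing the linearized objective
        a = sum(w[i] * gw[i] for i in range(n))  # ||d||^2
        b, c = gw[t], gram[t][t]                 # <d, g_t> and ||g_t||^2
        denom = a - 2.0 * b + c                  # ||d - g_t||^2
        if denom <= tol:
            break
        # Exact minimizer of ||(1 - gamma) d + gamma g_t||^2, clipped to [0, 1].
        gamma = min(1.0, max(0.0, (a - b) / denom))
        if gamma <= tol:
            break
        w = [(1.0 - gamma) * wi for wi in w]
        w[t] += gamma
    return w


# Two hypothetical client gradients.
grads = [[1.0, 0.0], [0.0, 2.0]]
weights = min_norm_weights(grads)
direction = [sum(w * g[j] for w, g in zip(weights, grads)) for j in range(len(grads[0]))]
```

For this two-client toy case, the weights are approximately (0.8, 0.2) and the resulting direction is approximately (0.8, 0.4), which is the closest point to the origin on the segment between the two gradients. For more clients the iteration converges only sublinearly, so the iteration budget trades accuracy for server-side cost.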

Algorithm 1 FedAWARE and its plug-in extension (green box)
Require:  $\boldsymbol{x}^{0},\boldsymbol{m}^{0},\alpha$
1:  for round $t\in[T]$ do
2:     Server samples clients $S^{t}$ and broadcasts model $\boldsymbol{x}^{t}$
3:     for client $i\in S^{t}$ in parallel do
4:        $\boldsymbol{x}^{t,0}_{i}=\boldsymbol{x}^{t}$
5:        for local update step $k\in[K_{i}]$ do
6:           Compute an unbiased estimate $\boldsymbol{g}_{i}(\boldsymbol{x}^{t,k})$ of $\nabla f_{i}(\boldsymbol{x}^{t,k})$
7:           $\boldsymbol{x}^{t,k}_{i}=\boldsymbol{x}^{t,k-1}_{i}-\eta_{l}\nabla F_{i}(\boldsymbol{x}^{t,k-1}_{i})$
8:        end for
9:        Client uploads local updates $\boldsymbol{g}_{i}^{t}=\boldsymbol{x}_{i}^{t,0}-\boldsymbol{x}_{i}^{t,K_{i}}$
10:     end for
11:     Server updates local momentum $\boldsymbol{m}_{i}^{t}=\begin{cases}(1-\alpha)\boldsymbol{m}_{i}^{t-1}+\alpha\boldsymbol{g}_{i}^{t},&\text{if}\;i\in S^{t}\\ \boldsymbol{m}_{i}^{t-1},&\text{if}\;i\notin S^{t}\end{cases}$
12:     Server computes estimates $\boldsymbol{d}^{t}$ by replacing $\nabla f_{i}(\boldsymbol{x})$ with $\boldsymbol{m}_{i}^{t}$ in (7)
13:     if FedAWARE is plugged into an optimizer ServerOpt (e.g., FedYogi) then
14:        Server computes the estimates of the applied algorithm $\tilde{\boldsymbol{d}}^{t}=\textsc{ServerOpt}(\{\boldsymbol{g}_{i}^{t}\}_{i\in S^{t}})$
15:        Server modifies the direction of estimates $\boldsymbol{d}^{t}=\boldsymbol{d}_{\text{proj}}^{t}$ following (9)
16:     end if
17:     Server updates $\boldsymbol{x}^{t+1}=\boldsymbol{x}^{t}-\eta\boldsymbol{d}^{t}$
18:  end for

4 Methodology and Convergence Guarantee of FedAWARE

In this section, we present a simple yet effective server-side optimizer, FedAWARE, as shown in Algorithm 1. In particular, Algorithm 1 also includes an optional extension of FedAWARE to other server-side optimizers, shown in green. We provide the convergence analysis of FedAWARE (without the extension) in Appendix C, and we discuss the implications of our novel extension in Remark 4.2.

Support for partial participation. Solving (7) typically requires access to all local first-order gradients from clients, which is often infeasible in FL systems. To overcome this limitation, FedAWARE approximates the local updates using the historical momentum of clients, denoted as $\boldsymbol{m}_{i},i\in[N]$. The update rule of the momentum is controlled by a constant $\alpha$ in Line 11 of Algorithm 1.
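The momentum rule for partial participation can be sketched as follows. This is a minimal illustration of the case distinction in Algorithm 1 (momentum is refreshed for sampled clients and kept unchanged otherwise); the function name `update_momentum`, the dictionary-based storage, and the toy values are assumptions made for the example.

```python
def update_momentum(momentum, updates, alpha):
    """Server-side per-client momentum update under partial participation.

    momentum: dict client_id -> list of floats (m_i^{t-1}) for all N clients
    updates:  dict client_id -> list of floats (g_i^t) for sampled clients only
    """
    new_momentum = {}
    for i, m in momentum.items():
        if i in updates:
            # Sampled client: m_i^t = (1 - alpha) * m_i^{t-1} + alpha * g_i^t
            new_momentum[i] = [(1.0 - alpha) * mv + alpha * gv
                               for mv, gv in zip(m, updates[i])]
        else:
            # Unsampled client: keep the stale momentum m_i^{t-1}
            new_momentum[i] = list(m)
    return new_momentum


# Hypothetical round: client 0 participates, client 1 does not.
momentum = {0: [0.0, 0.0], 1: [1.0, 1.0]}
updates = {0: [2.0, 4.0]}
new_m = update_momentum(momentum, updates, alpha=0.5)
```

The server can then solve (7) over all $N$ momentum vectors in every round, even though only the sampled clients uploaded fresh updates.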

Lemma 4.1

Under Assumptions 2.1, 2.2, and 3.1, for any $i\in[N]$ and $t\in[T]$, the gap between the momentum $\boldsymbol{m}_{i}$ and the gradient $\nabla f_{i}(\boldsymbol{x}^{t})$ can be bounded:

$$\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right]\leq\beta^{2}L^{2}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2},$$

where $\beta=(1-\alpha)p$ is a constant, with $p=p_{1}=\cdots=p_{N}$ and $p_{i}=\text{Prob}(i\in S^{t})$.

We provide Lemma 4.1 to theoretically bound the error of the momentum approximation, which highlights that the approximation can stabilize (5) with a proper $\beta$. In this paper, we set $\alpha$ and $p$ to be the same for all clients. Our detailed analyses in Appendix C show that FedAWARE can be further improved by designing $\alpha$ and $p$ for each client, which we will study in future work. Then, we present the convergence guarantee of Algorithm 1:

Theorem 4.1 (Convergence to the stationary points of Algorithm 1)

Let Assumptions 2.1, 2.2, and 3.1 hold. Following the momentum approximation in Lemma 4.1, and assuming an ideal (possibly dynamic) $\lambda$ such that $\|\nabla f(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}$ always holds, we have:

$$\min_{t\in[T]}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]\leq\epsilon_{\text{opt}}+\sum_{t=0}^{T-1}\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{T\rho^{t}\tilde{D}(\boldsymbol{x}^{t})}\sqrt{\sigma^{2}+\Phi_{\text{Hetero}}}, \tag{8}$$

where $\sigma^{2}=5\eta_{l}(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2})$, $\chi_{\tilde{\lambda}\|\lambda}^{2}=\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}$, $\tilde{G}=\max_{t\in[T]}\sqrt{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x}^{t})\|^{2}}$, and $\rho^{t}>0$ is a coefficient detailed in the Appendix. The term $\epsilon_{\text{opt}}$ vanishes at rate $\mathcal{O}(1/\sqrt{T})$.

Remark 4.1

The term $\sqrt{\sigma^{2}+\Phi_{\text{Hetero}}}$ reflects the impact of hybrid heterogeneity on the convergence rate. In other words, large heterogeneity terms increase the convergence error, resulting in worse optimization performance. Our convergence analysis evaluates the feasibility of updating the global model along the gradient diversity maximization direction. Concretely, Algorithm 1 minimizes a surrogate global objective given by $\tilde{\lambda}$, while using the term $\chi$ to denote the difference between the surrogate objective and the original objective (see details in Lemma C.2 in the Appendix). Specifically, the surrogate objective (1) is dynamically determined by the $\tilde{\lambda}$ in Algorithm 1. Hence, the convergence rate characterizes both the status of the final stationary point and the speed of reaching it. It suggests that Algorithm 1 tends to make the model converge to a point with large gradient diversity. Besides, if we let $\eta_{l}=\mathcal{O}(1/T)$ decay with communication rounds, our convergence rate matches the best-known rate $\mathcal{O}(1/\sqrt{T})$ in non-convex optimization works [45, 16].

The implications of surrogate gradient diversity dynamics. Theorem 4.1 describes the convergence behavior of a broad class of FedAvg variants. The final term in (8) suggests that the convergence performance is related to the overall surrogate gradient diversity dynamics along the optimization trajectory $\{\boldsymbol{x}^{0},\dots,\boldsymbol{x}^{T-1}\}$. In other words, our gradient diversity maximization findings suggest that heterogeneous federated optimization algorithms should create a trajectory $\{\boldsymbol{x}^{0},\dots,\boldsymbol{x}^{T-1}\}$ that results in a higher curve of gradient diversity $\tilde{D}(\boldsymbol{x}^{t}),t=0,\dots,T-1$. We evaluate this point in comparison with other algorithms in Figure 3 and in plug-in experiments in Figure 4. Besides, as gradient diversity quantifies the conflicts of individual gradients, surrogate gradient diversity maximization (minimizing the objective in (7)) can be intuitively viewed as adjusting the conflicts among local functions.

Figure 3: Training dynamics of federated optimization algorithms on the CIFAR10 task. FedAWARE converges to stationary points with higher gradient diversity than baselines. Convergence experiments on Fashion-MNIST/MNIST are shown in Figures 7 and 8.

FedAWARE as a plug-in module. FedAWARE uses a novel adaptive weighted aggregation that only induces the surrogate gradient diversity terms $\tilde{D}(\cdot)$ in the convergence rate. Therefore, FedAWARE can be applied to other optimization works, as we minimize different terms (discussed in Appendix D.7). We present a novel extension of FedAWARE and evaluate it in Section 5:

Remark 4.2 (Extension of FedAWARE)

Given global estimates $\tilde{\boldsymbol{d}}^{t}$ output by an arbitrary federated optimization algorithm (Line 14, Algorithm 1), we project $\tilde{\boldsymbol{d}}^{t}$ onto the direction of $\boldsymbol{d}^{t}$ (Line 15, Algorithm 1):

$$\boldsymbol{d}_{\text{proj}}^{t}=\frac{\langle\tilde{\boldsymbol{d}}^{t},\boldsymbol{d}^{t}\rangle}{\langle\boldsymbol{d}^{t},\boldsymbol{d}^{t}\rangle}\boldsymbol{d}^{t},\;\boldsymbol{x}^{t+1}=\boldsymbol{x}^{t}-\eta\boldsymbol{d}_{\text{proj}}^{t} \tag{9}$$

This procedure redirects the descent direction of the applied algorithm toward the direction of implicitly enlarged gradient diversity while maintaining the advantages of the original algorithm. Please see Figures 4 and 9 for empirical evidence.

This extension allows FedAWARE to be a plug-in module to enhance the applied algorithms.
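The projection in (9) amounts to a standard vector projection followed by the usual server step. The sketch below is a minimal illustration with plain Python lists; the helper names `dot`, `project_direction`, and `server_step`, as well as the toy vectors and learning rate, are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_direction(d_tilde, d):
    """Eq. (9): project d_tilde onto the line spanned by d (assumes ||d|| > 0)."""
    scale = dot(d_tilde, d) / dot(d, d)
    return [scale * v for v in d]


def server_step(x, d_tilde, d, eta):
    """x^{t+1} = x^t - eta * d_proj^t."""
    d_proj = project_direction(d_tilde, d)
    return [xv - eta * pv for xv, pv in zip(x, d_proj)]


# Hypothetical estimates: d_tilde from ServerOpt, d from solving (7).
d_proj = project_direction([3.0, 4.0], [1.0, 0.0])
x_next = server_step([1.0, 1.0], [3.0, 4.0], [1.0, 0.0], eta=0.5)
```

The projected estimate keeps the direction of $\boldsymbol{d}^{t}$ and takes its magnitude from the applied optimizer's estimate $\tilde{\boldsymbol{d}}^{t}$; note that the scale factor becomes negative whenever the two vectors have a negative inner product.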

5 Experiments

We evaluate our theoretical findings and proposed algorithms in standard heterogeneous federated settings and show that hybrid heterogeneity damages federated optimization algorithms. We show error bars over three independent runs with different random seeds.

Experiment Setup. We compare baseline algorithms that follow the federated optimization framework in Figure 1. Our baselines include FedAvg [33], FedProx [26], SCAFFOLD [16], FedDyn [1], and FedNova [42] for the regularization-based algorithms. We also compare with FedAvgM [10] for the momentum-based algorithms. For the adaptive federated optimization algorithms, we compare with FedYogi [35] and FedAMS [45]. Due to the page limit, we report the results of training a 5-layer CNN on the CIFAR-10 dataset [20] in the main paper. We provide convergence experiments on MNIST/Fashion-MNIST in Appendix E.2 and extension experiments on the text dataset AGNews [57] with the Pythia model [3] in Appendix E.3.

Figure 4: Plug-in experiments on the AGNews-Dirichlet setting with the GPT2-based model Pythia-70M. The same experiments on the CIFAR10-Dirichlet+ setting are shown in Figure 9.

Data partition. We construct experiments on various heterogeneous settings: (1) Pathological: We adopt the identical Non-IID setup from FedAvg [33]. We sort the data by label, partition it into 200 blocks of equal size, and assign 2 blocks to each of 100 clients. In this setting, we set a constant data batch size and local epoch for all clients; hence, the number of local steps is balanced across clients. (2) Dirichlet: We create federated datasets for 100 clients following the latent Dirichlet allocation over labels [10]. The hyper-parameter of the Dirichlet distribution is set to 0.1, indicating extreme statistical heterogeneity. In this setting, we maintain a fixed data batch size and local training epoch for all clients. The number of local steps differs among clients, resulting in system heterogeneity. (3) Pathological+/Dirichlet+: Based on the data partitions described above, we randomly set the local epoch $E_{i}\sim\mathcal{U}(2,5)$ and the batch size $B_{i}\sim\mathcal{U}(10,|\mathcal{D}_{i}|)$ at round $t$ for each selected client $i\in S^{t}$. Here, $\mathcal{U}$ denotes the uniform distribution. Therefore, the local steps $K_{i}=\lfloor E_{i}\cdot|\mathcal{D}_{i}|/B_{i}\rfloor$ are unstable.
Each client conducts local mini-batch SGD with multiple local epochs to update the model. More details are provided in Appendix E.1.
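
The Dirichlet and Dirichlet+ settings can be illustrated with a short sketch. The code below is not the authors' implementation: it uses one common per-class variant of Dirichlet label partitioning (the exact procedure of [10] may differ), it assumes that $E_i$ and $B_i$ are drawn as integers, and the function names `dirichlet_partition` and `sample_local_steps` are hypothetical.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_clients=100, alpha=0.1, seed=0):
    """Split sample indices across clients, class by class, with Dirichlet(alpha) shares."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # A Dirichlet draw is a vector of Gamma(alpha, 1) draws normalized to sum to 1.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas) or 1.0  # guard against numerical underflow
        start, cumulative = 0, 0.0
        for client, g in enumerate(gammas):
            cumulative += g / total
            if client == num_clients - 1:
                end = len(idxs)  # last client takes the remainder
            else:
                end = min(len(idxs), int(cumulative * len(idxs)))
            clients[client].extend(idxs[start:end])
            start = end
    return clients


def sample_local_steps(num_samples, rng, epoch_range=(2, 5), min_batch=10):
    """Draw E_i ~ U(2, 5) and B_i ~ U(10, |D_i|); return (E_i, B_i, K_i).

    K_i = floor(E_i * |D_i| / B_i). Assumes num_samples > 0 and integer draws.
    """
    epochs = rng.randint(*epoch_range)
    batch_size = rng.randint(min(min_batch, num_samples), num_samples)
    return epochs, batch_size, (epochs * num_samples) // batch_size
```

With a small concentration parameter such as 0.1, most clients receive samples from only a few classes and some clients may receive none, so callers should skip empty clients before sampling local steps.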

Main results: convergence behavior of FedAWARE. We report the training dynamics of testing accuracy, gradient diversity, and training loss in Figure 3. The training loss curves indicate that FedAWARE converges at a rate comparable to state-of-the-art federated optimizers. Furthermore, the testing accuracy curves show that FedAWARE is more stable (fewer jitters and lower variance) than the baselines. Besides, the gradient diversity dynamics indicate that FedAWARE converges to stationary points with larger gradient diversity. Notably, FedAvgM and FedYogi have lower gradient diversity while achieving relatively high testing accuracy. This does not contradict our theory, since gradient diversity is not the only factor in rate (8), as further elaborated in Appendix D. Moreover, designing a better global update direction that maximizes gradient diversity can potentially achieve higher performance. We support this point with the following gradient projection experiments.

Main results: gradient projection extension of FedAWARE. We plug FedAWARE into standard baselines (FedAvg, FedAvgM) and state-of-the-art adaptive optimizers (FedAMS and FedYogi) following Remark 4.2; the results are shown in Figure 4. The loss curves show that gradient projection maintains the convergence rate of the algorithms it is applied to. Besides, gradient projection modifies the model update direction and enlarges the gradient diversity. Hence, these algorithms obtain more stable test accuracy curves.

Observation: adaptivity in federated optimization is important. Our experiments highlight the importance of adaptivity in federated optimization. For example, standard federated optimization algorithms (e.g., FedAvg, SCAFFOLD) suffer from performance degradation in the Pathological+/Dirichlet+ settings. In comparison, adaptive federated optimization algorithms (FedYogi, FedDyn, FedAMS, and FedAWARE) maintain high convergence quality in such extremely heterogeneous settings. Moreover, our gradient diversity findings suggest a new form of adaptivity in FL.

Discussion: practical gradient diversity dynamics. Our results suggest that gradient diversity is an informative training dynamic in heterogeneous federated optimization. In Appendix E.1, we suggest a practical gradient diversity dynamic (30) for monitoring the optimization status.

6 Conclusion

This paper presents an initial exploration of gradient diversity maximization in federated server-side optimization. We examine this issue through the lens of FL server optimization and conclude that pursuing gradient diversity maximization in the global update direction helps mitigate hybrid heterogeneity. Following our theoretical findings, we propose a simple yet effective server-side optimizer applicable in scenarios exhibiting extreme hybrid heterogeneity. Beyond FL settings, we believe gradient diversity maximization can also be applied to the general finite-sum optimization task (1), especially when handling a finite sum of conflicting objectives.

References

  • [1] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2020.
  • [2] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
  • [3] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
  • [4] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
  • [5] Michael Crawshaw, Yajie Bao, and Mingrui Liu. Federated learning with client subsampling, data heterogeneity, and unbounded smoothness: A new algorithm and lower bounds. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [6] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • [7] Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. Sgd: General analysis and improved rates. In International conference on machine learning, pages 5200–5209. PMLR, 2019.
  • [8] Werner H Greub. Linear algebra, volume 23. Springer Science & Business Media, 2012.
  • [9] Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.
  • [10] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
  • [11] Zeou Hu, Kiarash Shaloudegi, Guojun Zhang, and Yaoliang Yu. Federated learning meets multi-objective optimization. IEEE Transactions on Network Science and Engineering, 9(4):2039–2051, 2022.
  • [12] Minhui Huang, Dewei Zhang, and Kaiyi Ji. Achieving linear speedup in non-iid federated bilevel learning. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 14039–14059. PMLR, 2023.
  • [13] Ahmed Imteaj, Urmish Thakker, Shiqiang Wang, Jian Li, and M Hadi Amini. A survey on federated learning for resource-constrained iot devices. IEEE Internet of Things Journal, 9(1):1–24, 2021.
  • [14] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In International conference on machine learning, pages 427–435. PMLR, 2013.
  • [15] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • [16] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
  • [17] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. First analysis of local gd on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019.
  • [18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [19] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. arXiv preprint arXiv:2204.02937, 2022.
  • [20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [21] Chengxi Li, Gang Li, and Pramod K Varshney. Federated learning with soft clustering. IEEE Internet of Things Journal, 9(10):7773–7782, 2021.
  • [22] Daliang Li and Junpu Wang. Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.
  • [23] Junbo Li, Ang Li, Chong Tian, Qirong Ho, Eric P Xing, and Hongyi Wang. Fednar: Federated optimization with normalized annealing regularization. arXiv preprint arXiv:2310.03163, 2023.
  • [24] Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In International Conference on Learning Representations, 2020.
  • [25] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE signal processing magazine, 37(3):50–60, 2020.
  • [26] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
  • [27] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497, 2019.
  • [28] Zexi Li, Tao Lin, Xinyi Shang, and Chao Wu. Revisiting weighted aggregation in federated learning with neural networks. In International Conference on Machine Learning, pages 19767–19788. PMLR, 2023.
  • [29] Shiyu Liu, Shaogao Lv, Dun Zeng, Zenglin Xu, Hui Wang, and Yue Yu. Personalized federated learning via amortized bayesian meta-learning. arXiv preprint arXiv:2307.02222, 2023.
  • [30] Guodong Long, Ming Xie, Tao Shen, Tianyi Zhou, Xianzhi Wang, and Jing Jiang. Multi-center federated learning: clients clustering for better personalization. World Wide Web, 26(1):481–500, 2023.
  • [31] Xiaodong Ma, Jia Zhu, Zhihao Lin, Shanxuan Chen, and Yangjie Qin. A state-of-the-art survey on solving non-iid data in federated learning. Future Generation Computer Systems, 135:244–258, 2022.
  • [32] Othmane Marfoq, Giovanni Neglia, Aurélien Bellet, Laetitia Kameni, and Richard Vidal. Federated multi-task learning under a mixture of distributions. Advances in Neural Information Processing Systems, 34:15434–15447, 2021.
  • [33] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • [34] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In International Conference on Machine Learning, pages 4615–4625. PMLR, 2019.
  • [35] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • [36] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
  • [37] Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE transactions on neural networks and learning systems, 32(8):3710–3722, 2020.
  • [38] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in neural information processing systems, 31, 2018.
  • [39] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. Advances in neural information processing systems, 30, 2017.
  • [40] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [41] Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H Brendan McMahan, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.
  • [42] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 33:7611–7623, 2020.
  • [43] Lin Wang, YongXin Guo, Tao Lin, and Xiaoying Tang. Delta: Diverse client sampling for fasting federated learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [44] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting He, and Kevin Chan. Adaptive federated learning in resource constrained edge computing systems. IEEE journal on selected areas in communications, 37(6):1205–1221, 2019.
  • [45] Yujia Wang, Lu Lin, and Jinghui Chen. Communication-efficient adaptive federated learning. In International Conference on Machine Learning, pages 22802–22838. PMLR, 2022.
  • [46] Xidong Wu, Jianhui Sun, Zhengmian Hu, Aidong Zhang, and Heng Huang. Solving a class of non-convex minimax optimization in federated learning. arXiv preprint arXiv:2310.03613, 2023.
  • [47] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [48] Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, and Haibin Lin. Local adaalter: Communication-efficient stochastic gradient descent with adaptive learning rates. arXiv preprint arXiv:1911.09030, 2019.
  • [49] Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780, 2020.
  • [50] Rui Ye, Mingkai Xu, Jianyu Wang, Chenxin Xu, Siheng Chen, and Yanfeng Wang. Feddisco: Federated learning with discrepancy-aware collaboration. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 39879–39902. PMLR, 2023.
  • [51] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 1998–2007. PMLR, 2018.
  • [52] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. Advances in neural information processing systems, 31, 2018.
  • [53] Dun Zeng, Xiangjing Hu, Shiyu Liu, Yue Yu, Qifan Wang, and Zenglin Xu. Stochastic clustered federated learning. arXiv preprint arXiv:2303.00897, 2023.
  • [54] Dun Zeng, Siqi Liang, Xiangjing Hu, Hui Wang, and Zenglin Xu. Fedlab: A flexible federated learning framework. Journal of Machine Learning Research, 24(100):1–7, 2023.
  • [55] Shenglai Zeng, Zonghang Li, Hongfang Yu, Yihong He, Zenglin Xu, Dusit Niyato, and Han Yu. Heterogeneous federated learning via grouped sequential-to-parallel training. In International Conference on Database Systems for Advanced Applications, pages 455–471. Springer, 2022.
  • [56] Jianqing Zhang, Yang Hua, Jian Cao, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, and Haibing Guan. Eliminating domain bias for federated learning in representation space. Advances in Neural Information Processing Systems, 36, 2024.
  • [57] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28, 2015.
  • [58] Yifei Zhang, Dun Zeng, Jinglong Luo, Zenglin Xu, and Irwin King. A survey of trustworthy federated learning with perspectives on security, robustness, and privacy. arXiv preprint arXiv:2302.10637, 2023.
  • [59] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
  • [60] Shiji Zhou, Wenpeng Zhang, Jiyan Jiang, Wenliang Zhong, Jinjie Gu, and Wenwu Zhu. On the convergence of stochastic multi-objective gradient manipulation and beyond. Advances in Neural Information Processing Systems, 35:38103–38115, 2022.
  • [61] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. Advances in neural information processing systems, 32, 2019.
  • [62] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In International conference on machine learning, pages 12878–12889. PMLR, 2021.

Appendix A Related Works

Many studies have addressed statistical heterogeneity in the FL literature [31], including multi-task federated learning [32, 39], personalized federated learning [40, 29], knowledge distillation [62, 22], and client clustering [37, 53, 55]. In contrast, this paper primarily focuses on federated optimization algorithms that train a robust global model against heterogeneity issues, as shown in Algorithm 2.

Algorithm 2 FedOpt [35]
Input: $\boldsymbol{x}^0$, $\textsc{ClientOpt}$, $\textsc{ServerOpt}$
1:  for round $t \in [T]$ do
2:     Server samples clients $S^t$ and broadcasts model $\boldsymbol{x}^t$
3:     for client $i \in S^t$ in parallel do
4:        $\boldsymbol{x}_i^{t,0} = \boldsymbol{x}^t$
5:        for local update step $k \in [K_i]$ do
6:           Compute an unbiased estimate $\nabla F_i(\boldsymbol{x}_i^{t,k-1})$ of $\nabla f_i(\boldsymbol{x}_i^{t,k-1})$
7:           $\boldsymbol{x}_i^{t,k} = \textsc{ClientOpt}(\boldsymbol{x}_i^{t,k-1}, \nabla F_i(\boldsymbol{x}_i^{t,k-1}), \eta_l, t)$ (e.g., $\boldsymbol{x}_i^{t,k} = \boldsymbol{x}_i^{t,k-1} - \eta_l \nabla F_i(\boldsymbol{x}_i^{t,k-1})$)
8:        end for
9:        Client uploads local update $\boldsymbol{g}_i^t = \boldsymbol{x}_i^{t,0} - \boldsymbol{x}_i^{t,K_i}$
10:     end for
11:     Server computes an estimate $\boldsymbol{d}^t = \textsc{ServerOpt}(\{\boldsymbol{g}_i^t\}_{i \in S^t})$
12:     Server updates $\boldsymbol{x}^{t+1} = \boldsymbol{x}^t - \eta \boldsymbol{d}^t$
13:  end for
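
The structure of Algorithm 2 can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' code: models are lists of floats, the gradient oracle is deterministic (standing in for the unbiased stochastic estimate), ClientOpt is plain SGD, ServerOpt is a uniform average (FedAvg-style), and all function names are hypothetical.

```python
import random


def client_update(x, grad_fn, num_steps, lr_local):
    """Run K_i local SGD steps from the broadcast model; return g_i = x^{t,0} - x^{t,K_i}."""
    local = list(x)
    for _ in range(num_steps):
        grad = grad_fn(local)
        local = [w - lr_local * g for w, g in zip(local, grad)]
    return [w0 - wk for w0, wk in zip(x, local)]


def server_average(updates):
    """ServerOpt stand-in: uniform average of the uploaded client updates."""
    n = len(updates)
    return [sum(col) / n for col in zip(*updates)]


def fedopt(x0, grad_fns, local_steps, rounds, clients_per_round,
           lr_local=0.1, lr_server=1.0, server_opt=server_average, seed=0):
    """Outer loop of Algorithm 2: sample clients, collect updates, apply the server step."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(rounds):
        selected = rng.sample(range(len(grad_fns)), clients_per_round)
        updates = [client_update(x, grad_fns[i], local_steps[i], lr_local)
                   for i in selected]
        d = server_opt(updates)
        x = [w - lr_server * di for w, di in zip(x, d)]
    return x


# Toy objectives f_i(x) = 0.5 * ||x - c_i||^2 with gradient x - c_i.
centers = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]]
grad_fns = [lambda x, c=c: [xi - ci for xi, ci in zip(x, c)] for c in centers]
x_final = fedopt([5.0, 5.0], grad_fns, local_steps=[5, 5, 5, 5],
                 rounds=50, clients_per_round=4)
```

Swapping `server_opt` for a different aggregation rule changes only the server step, which is the part of the pipeline that server-side optimizers modify.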

The pioneering federated optimization algorithm, FedAvg [33], significantly reduces communication costs. Subsequent works built upon FedAvg to address challenges related to convergence guarantees and heterogeneity. For example, some approaches introduce a regularization term in the client objectives [26], while others incorporate server momentum [10]. Several studies have analyzed the convergence rate of FedAvg and demonstrated its degradation under system heterogeneity [26, 44] and statistical heterogeneity [59, 17]. SCAFFOLD [16] uses control variates to mitigate client drift and achieves convergence rates independent of the level of heterogeneity. FedNova [42] addresses objective inconsistency arising from system heterogeneity by normalizing local updates. For more detailed comparisons, we refer to the survey [15]. Besides, adaptive methods [52, 36, 48] have proven effective in non-convex optimization. In the context of federated optimization, FedYogi [35, 52] and FedAMS [45] are representative adaptive federated optimization algorithms that incorporate Adam-like momentum and adaptive terms to address heterogeneity issues. These related works demonstrate the ongoing efforts to address heterogeneity in FL, with various approaches focusing on different aspects of the problem.
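
To make "Adam-like momentum and adaptive terms" concrete, the following is a minimal sketch of a Yogi-style server step written in the notation of Algorithm 2, where clients upload $\boldsymbol{g}_i^t = \boldsymbol{x}_i^{t,0} - \boldsymbol{x}_i^{t,K_i}$. It follows the commonly stated Yogi update rule; the function name, the state layout, the default hyper-parameters, and the initialization of the second moment are illustrative assumptions rather than the configuration used in this paper.

```python
import math


def yogi_server_step(x, updates, state, lr=0.01, beta1=0.9, beta2=0.99, tau=1e-3):
    """One Yogi-style server step on the averaged client updates."""
    n = len(updates)
    g = [sum(col) / n for col in zip(*updates)]  # averaged pseudo-gradient
    # First moment: exponential moving average (Adam-like momentum).
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(state["m"], g)]
    # Second moment: Yogi's sign-controlled additive update.
    v = []
    for vi, gi in zip(state["v"], g):
        diff = vi - gi * gi
        sign = (diff > 0) - (diff < 0)
        v.append(vi - (1 - beta2) * gi * gi * sign)
    # Per-coordinate adaptive step; subtract because g = x^{t,0} - x^{t,K_i}.
    x_new = [xi - lr * mi / (math.sqrt(vi) + tau) for xi, mi, vi in zip(x, m, v)]
    return x_new, {"m": m, "v": v}


# Hypothetical initialization: zero momentum, second moment set to tau**2.
state = {"m": [0.0], "v": [1e-6]}
x, state = yogi_server_step([1.0], [[0.4], [0.6]], state)
```

The second moment rescales each coordinate of the server step, which is the "adaptive term" referred to above; the first moment smooths the update direction across rounds.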

Appendix B Proof of Lemmas and Corollaries

Corollary 1 (Proof of Corollary 3.1)

Let Assumption 3.1 hold. Then, in the case of bounded variance, i.e., $\sum_{i=1}^{N}\lambda_i\left\|\nabla f_i(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^2\leq\sigma_g^2$, it follows that $D(\boldsymbol{x})\leq\sqrt{1+\frac{\sigma_g^2}{\|\nabla f(\boldsymbol{x})\|^2}}$.

Proof 1

Since $\nabla f(\boldsymbol{x})=\sum_{i=1}^{N}\lambda_i\nabla f_i(\boldsymbol{x})$, and assuming the weights satisfy $\sum_{i=1}^{N}\lambda_i=1$, expanding the square turns the variance into a difference of squared norms:

$$\begin{aligned}
& \sum_{i=1}^{N}\lambda_i\left\|\nabla f_i(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^2=\sum_{i=1}^{N}\lambda_i\left\|\nabla f_i(\boldsymbol{x})\right\|^2-\|\nabla f(\boldsymbol{x})\|^2\leq\sigma_g^2 \\
\Leftrightarrow\ & \sum_{i=1}^{N}\lambda_i\left\|\nabla f_i(\boldsymbol{x})\right\|^2\leq\sigma_g^2+\|\nabla f(\boldsymbol{x})\|^2 \\
\Rightarrow\ & D(\boldsymbol{x})=\sqrt{\frac{\sum_{i=1}^{N}\lambda_i\|\nabla f_i(\boldsymbol{x})\|^2}{\|\nabla f(\boldsymbol{x})\|^2}}\leq\sqrt{\frac{\sigma_g^2+\|\nabla f(\boldsymbol{x})\|^2}{\|\nabla f(\boldsymbol{x})\|^2}}=\sqrt{1+\frac{\sigma_g^2}{\|\nabla f(\boldsymbol{x})\|^2}}.
\end{aligned}$$

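
The relation between gradient diversity and gradient variance in Corollary 1 can be checked numerically. The sketch below is illustrative only: it assumes weights that sum to one, treats gradients as lists of floats, and uses hypothetical function names. Under these assumptions, taking $\sigma_g^2$ equal to the actual weighted variance makes the bound hold with equality.

```python
import math


def weighted_mean(grads, weights):
    """Global gradient: sum_i lambda_i * grad_i."""
    dim = len(grads[0])
    return [sum(w * g[j] for w, g in zip(weights, grads)) for j in range(dim)]


def gradient_diversity(grads, weights):
    """D(x) = sqrt( sum_i lambda_i ||grad_i||^2 / ||sum_i lambda_i grad_i||^2 )."""
    mean = weighted_mean(grads, weights)
    weighted_sq_norms = sum(w * sum(v * v for v in g) for w, g in zip(weights, grads))
    return math.sqrt(weighted_sq_norms / sum(v * v for v in mean))


def gradient_variance(grads, weights):
    """sum_i lambda_i ||grad_i - mean||^2, the left-hand side of the variance bound."""
    mean = weighted_mean(grads, weights)
    return sum(w * sum((gj - mj) ** 2 for gj, mj in zip(g, mean))
               for w, g in zip(weights, grads))


grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [1 / 3, 1 / 3, 1 / 3]
mean_sq_norm = sum(v * v for v in weighted_mean(grads, weights))
diversity = gradient_diversity(grads, weights)
bound = math.sqrt(1 + gradient_variance(grads, weights) / mean_sq_norm)
```

Note that $D(\boldsymbol{x}) \geq 1$ always holds under these assumptions, and it grows as the client gradients disagree more relative to the norm of the global gradient.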
Lemma 1 (Proof of Lemma 3.2 and Corollary 3.2)

Let Assumptions 2.2 and 3.1 hold. For every client $i\in[N]$ and any local iteration step $k\in[K_i]$, the local updates can be bounded as follows:

$$\mathbb{E}\left\|\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^t\right\|^2\leq 5K_i\eta_l^2\left(\sigma_l^2+6K_i\zeta_i(\boldsymbol{x}^t)+6K_i\left\|\nabla f(\boldsymbol{x}^t)\right\|^2\right).$$

Proof 2

For $k\in[K_i]$, we have

$$\begin{aligned}
& \mathbb{E}\left[\left\|\boldsymbol{g}_i^t\right\|^2\right]=\mathbb{E}\left\|\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^t\right\|^2=\mathbb{E}\left\|\boldsymbol{x}_i^{t,k-1}-\boldsymbol{x}^t-\eta_l\nabla F_i(\boldsymbol{x}_i^{t,k-1})\right\|^2 \\
={} & \mathbb{E}\left\|\boldsymbol{x}_i^{t,k-1}-\boldsymbol{x}^t-\eta_l\left(\nabla F_i(\boldsymbol{x}_i^{t,k-1})\pm\nabla f_i(\boldsymbol{x}_i^{t,k-1})\right)\right\|^2 \\
={} & \mathbb{E}\left\|\boldsymbol{x}_i^{t,k-1}-\boldsymbol{x}^t-\eta_l\nabla f_i(\boldsymbol{x}_i^{t,k-1})\right\|^2+\mathbb{E}\left\|\eta_l\left(\nabla F_i(\boldsymbol{x}_i^{t,k-1})-\nabla f_i(\boldsymbol{x}_i^{t,k-1})\right)\right\|^2 \\
={} & \mathbb{E}\left[\left\|\boldsymbol{x}_i^{t,k-1}-\boldsymbol{x}^t\right\|^2-2\left\langle\boldsymbol{x}_i^{t,k-1}-\boldsymbol{x}^t,\eta_l\nabla f_i(\boldsymbol{x}_i^{t,k-1})\right\rangle+\left\|\eta_l\nabla f_i(\boldsymbol{x}_i^{t,k-1})\right\|^2\right]+\eta_l^2\sigma_l^2
\end{aligned}$$
=\displaystyle== 𝔼[𝒙it,k1𝒙t2212Ki1(𝒙it,k1𝒙t),2Ki1ηlfi(𝒙it,k1)\displaystyle\mathbb{E}\Big{[}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}% ^{t}\right\|^{2}-2\langle\frac{1}{\sqrt{2K_{i}-1}}(\boldsymbol{x}_{i}^{t,k-1}-% \boldsymbol{x}^{t}),\sqrt{2K_{i}-1}\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,% k-1})\rangleblackboard_E [ ∥ bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT - bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 2 ⟨ divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 end_ARG end_ARG ( bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT - bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) , square-root start_ARG 2 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 end_ARG italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT ) ⟩
+ηlfi(𝒙it,k1)2]+ηl2σl2\displaystyle+\left\|\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^% {2}\Big{]}+\eta_{l}^{2}\sigma_{l}^{2}+ ∥ italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
\displaystyle\leq (1+12Ki1)𝔼[𝒙it,k1𝒙t2]+2Ki𝔼[ηlfi(𝒙it,k1)2]+ηl2σl2112subscript𝐾𝑖1𝔼delimited-[]superscriptnormsuperscriptsubscript𝒙𝑖𝑡𝑘1superscript𝒙𝑡22subscript𝐾𝑖𝔼delimited-[]superscriptnormsubscript𝜂𝑙subscript𝑓𝑖superscriptsubscript𝒙𝑖𝑡𝑘12superscriptsubscript𝜂𝑙2superscriptsubscript𝜎𝑙2\displaystyle\left(1+\frac{1}{2K_{i}-1}\right)\mathbb{E}\left[\left\|% \boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}\right]+2K_{i}\mathbb% {E}\left[\left\|\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}% \right]+\eta_{l}^{2}\sigma_{l}^{2}( 1 + divide start_ARG 1 end_ARG start_ARG 2 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 end_ARG ) blackboard_E [ ∥ bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT - bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 2 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E [ ∥ italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=\displaystyle== (1+12Ki1)𝔼[𝒙it,k1𝒙t2]+2Ki𝔼[ηl(fi(𝒙it,k1)±f(𝒙t)±fi(𝒙t))2]112subscript𝐾𝑖1𝔼delimited-[]superscriptnormsuperscriptsubscript𝒙𝑖𝑡𝑘1superscript𝒙𝑡22subscript𝐾𝑖𝔼delimited-[]superscriptnormsubscript𝜂𝑙plus-or-minussubscript𝑓𝑖superscriptsubscript𝒙𝑖𝑡𝑘1𝑓superscript𝒙𝑡subscript𝑓𝑖superscript𝒙𝑡2\displaystyle\left(1+\frac{1}{2K_{i}-1}\right)\mathbb{E}\left[\left\|% \boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}\right]+2K_{i}\mathbb% {E}\left[\left\|\eta_{l}\left(\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\pm% \nabla f\left(\boldsymbol{x}^{t}\right)\pm\nabla f_{i}\left(\boldsymbol{x}^{t}% \right)\right)\right\|^{2}\right]( 1 + divide start_ARG 1 end_ARG start_ARG 2 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 end_ARG ) blackboard_E [ ∥ bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT - bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 2 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E [ ∥ italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ( ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT ) ± ∇ italic_f ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ± ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
+ηl2σl2superscriptsubscript𝜂𝑙2superscriptsubscript𝜎𝑙2\displaystyle+\eta_{l}^{2}\sigma_{l}^{2}+ italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
\displaystyle\leq (1+12Ki1)𝔼𝒙it,k1𝒙t2+6Ki𝔼[ηl(fi(𝒙it,k1)fi(𝒙t))2]112subscript𝐾𝑖1𝔼superscriptnormsuperscriptsubscript𝒙𝑖𝑡𝑘1superscript𝒙𝑡26subscript𝐾𝑖𝔼delimited-[]superscriptnormsubscript𝜂𝑙subscript𝑓𝑖superscriptsubscript𝒙𝑖𝑡𝑘1subscript𝑓𝑖superscript𝒙𝑡2\displaystyle\left(1+\frac{1}{2K_{i}-1}\right)\mathbb{E}\left\|\boldsymbol{x}_% {i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}+6K_{i}\mathbb{E}\left[\left\|\eta_{% l}\left(\nabla f_{i}\left(\boldsymbol{x}_{i}^{t,k-1}\right)-\nabla f_{i}\left(% \boldsymbol{x}^{t}\right)\right)\right\|^{2}\right]( 1 + divide start_ARG 1 end_ARG start_ARG 2 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 end_ARG ) blackboard_E ∥ bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT - bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 6 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E [ ∥ italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ( ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT ) - ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
+6Ki𝔼[ηl(fi(𝒙t))2]+ηl2σl26subscript𝐾𝑖𝔼delimited-[]superscriptnormsubscript𝜂𝑙subscript𝑓𝑖superscript𝒙𝑡2superscriptsubscript𝜂𝑙2superscriptsubscript𝜎𝑙2\displaystyle+6K_{i}\mathbb{E}\left[\left\|\eta_{l}\left(\nabla f_{i}\left(% \boldsymbol{x}^{t}\right)\right)\right\|^{2}\right]+\eta_{l}^{2}\sigma_{l}^{2}+ 6 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E [ ∥ italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ( ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
\displaystyle\leq (1+12Ki1+6Kiηl2L2)𝔼𝒙it,k1𝒙t2+ηl2(σl2+6Kiζi(𝒙t)+6Ki𝔼[f(𝒙t)2])112subscript𝐾𝑖16subscript𝐾𝑖superscriptsubscript𝜂𝑙2superscript𝐿2𝔼superscriptnormsuperscriptsubscript𝒙𝑖𝑡𝑘1superscript𝒙𝑡2superscriptsubscript𝜂𝑙2superscriptsubscript𝜎𝑙26subscript𝐾𝑖subscript𝜁𝑖superscript𝒙𝑡6subscript𝐾𝑖𝔼delimited-[]superscriptnorm𝑓superscript𝒙𝑡2\displaystyle\left(1+\frac{1}{2K_{i}-1}+6K_{i}\eta_{l}^{2}L^{2}\right)\mathbb{% E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}+\eta_{l}^{2% }(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\mathbb{E}\left[% \left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}\right])( 1 + divide start_ARG 1 end_ARG start_ARG 2 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 end_ARG + 6 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) blackboard_E ∥ bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t , italic_k - 1 end_POSTSUPERSCRIPT - bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_σ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 6 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_ζ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) + 6 italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E [ ∥ ∇ italic_f ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] )
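
The step that introduces the factor $1+\frac{1}{2K_i-1}$ is an instance of Young's inequality, $-2\langle a,b\rangle\le\frac{1}{c}\|a\|^2+c\|b\|^2$, applied with $c=2K_i-1$. The sketch below checks this inequality numerically on random vectors. The helper name `young_gap`, the vector dimension, and the sampled values are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def young_gap(a, b, c):
    """Return RHS - LHS of  -2<a, b> <= (1/c)||a||^2 + c||b||^2.

    The gap equals ||a/sqrt(c) + sqrt(c) b||^2, so it is never negative.
    """
    lhs = -2.0 * np.dot(a, b)
    rhs = np.dot(a, a) / c + c * np.dot(b, b)
    return rhs - lhs

rng = np.random.default_rng(0)
for K_i in (1, 2, 5, 20):
    c = 2 * K_i - 1          # scaling used in the derivation
    for _ in range(1000):
        a = rng.normal(size=8)
        b = rng.normal(size=8)
        # allow a tiny negative slack for floating-point rounding
        assert young_gap(a, b, c) >= -1e-12
```

Adding the remaining $\|\eta_l\nabla f_i\|^2$ term to the $c\|b\|^2=(2K_i-1)\|\eta_l\nabla f_i\|^2$ part gives the coefficient $2K_i$ that appears in the bound.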

Unrolling the recursion, we obtain

\[
\begin{aligned}
\mathbb{E}\left\|\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^t\right\|^2\leq{}& \sum_{p=0}^{k-1}\left(1+\frac{1}{2K_i-1}+6K_i\eta_l^2L^2\right)^p\eta_l^2\left(\sigma_l^2+6K_i\zeta_i(\boldsymbol{x}^t)+6K_i\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^t)\right\|^2\right]\right) \\
\leq{}& (K_i-1)\left[\left(1+\frac{1}{K_i-1}\right)^{K_i}-1\right]\eta_l^2\left(\sigma_l^2+6K_i\zeta_i(\boldsymbol{x}^t)+6K_i\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^t)\right\|^2\right]\right) \\
\leq{}& 5K_i\eta_l^2\left(\sigma_l^2+6K_i\zeta_i(\boldsymbol{x}^t)+6K_i\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|^2\right)
\end{aligned}
\tag{10}
\]

where we use the fact that $\left(1+\frac{1}{K_i-1}\right)^{K_i}\leq 5$ for $K_i>1$.
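
The last two inequalities in (10) rest on a geometric sum with ratio $r=1+\frac{1}{K_i-1}$: the sum $\sum_{p=0}^{k-1}r^p$ equals $(K_i-1)(r^k-1)$, which is at most $(K_i-1)(r^{K_i}-1)\le 4(K_i-1)\le 5K_i$. A minimal numerical check of these facts follows; the helper names and the range of $K_i$ values are illustrative assumptions.

```python
def geometric_sum(K_i, k):
    """Sum_{p=0}^{k-1} r^p with r = 1 + 1/(K_i - 1), computed term by term."""
    r = 1.0 + 1.0 / (K_i - 1)
    return sum(r ** p for p in range(k))

def closed_form(K_i, k):
    """Closed form (K_i - 1) * (r^k - 1) of the same geometric sum."""
    r = 1.0 + 1.0 / (K_i - 1)
    return (K_i - 1) * (r ** k - 1.0)

for K_i in range(2, 100):
    r = 1.0 + 1.0 / (K_i - 1)
    assert r ** K_i <= 5.0                      # the fact used in the proof
    for k in range(1, K_i + 1):
        s = geometric_sum(K_i, k)
        assert abs(s - closed_form(K_i, k)) <= 1e-9 * max(1.0, s)
        assert s <= 5 * K_i                     # final constant in (10)
```

The check only covers $K_i\ge 2$, matching the condition $K_i>1$ under which the ratio $r$ is defined.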

Lemma B.1 (Upper bound of balanced local updates, [35])

Let Assumptions 2.2 and 3.1 hold. For any client $i\in[N]$ with arbitrary local iteration steps $k\in[K]$ and local learning rate $\eta_l\leq\frac{1}{K}$, the average of local updates can be bounded as follows:

\[
\mathbb{E}\left\|\sum_{i=1}^{N}\lambda_i\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^t\right\|^2\leq\sum_{i=1}^{N}\lambda_i\mathbb{E}\left\|\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^t\right\|^2\leq 5\eta_l\left(\sigma_l^2+6K\sigma_g^2+6K\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|^2\right)
\tag{11}
\]
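
For intuition, the right-hand side of (11) can be read off from the per-client bound (10) under two assumptions that are not spelled out in this excerpt: all clients run the same number of local steps, $K_i=K$, and the weighted heterogeneity satisfies $\sum_{i}\lambda_i\zeta_i(\boldsymbol{x}^t)\le\sigma_g^2$ with $\sum_i\lambda_i=1$. A sketch of the step under those assumptions:

```latex
\begin{aligned}
\sum_{i=1}^{N}\lambda_i\,\mathbb{E}\left\|\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^t\right\|^2
&\le 5K\eta_l^2\Big(\sigma_l^2+6K\sum_{i=1}^{N}\lambda_i\zeta_i(\boldsymbol{x}^t)+6K\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|^2\Big) \\
&\le 5\eta_l\,(K\eta_l)\Big(\sigma_l^2+6K\sigma_g^2+6K\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|^2\Big) \\
&\le 5\eta_l\Big(\sigma_l^2+6K\sigma_g^2+6K\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|^2\Big),
\end{aligned}
```

where the last line uses $K\eta_l\le 1$. The first inequality in (11) is Jensen's inequality for $\|\cdot\|^2$ with weights $\lambda_i$ that sum to one.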
Corollary B.1 (Loose upper bound of unbalanced local updates)

Let Assumptions 2.2 and 3.1 hold. For any client $i\in[N]$ with arbitrary local iteration steps $k\in[K_i]$ and local learning rate $\eta_l\leq\frac{1}{K_i}$, the average of local updates can be bounded as follows:

\[
\begin{aligned}
\mathbb{E}\left\|\sum_{i=1}^{N}\lambda_i\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^t\right\|^2
&\leq\sum_{i=1}^{N}\lambda_i\mathbb{E}\left\|\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^t\right\|^2\leq 5\eta_l\left(\sigma_l^2+6K_{\text{min}}^2\sigma_g^2+6\tilde{K}\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|^2\right)+\Phi_{\text{Hetero}},
\end{aligned}
\tag{12}
\]

where $\tilde{K}=\sum_{i=1}^{N}\lambda_iK_i$ and $\Phi_{\text{Hetero}}=\sum_{i=1}^{N}30\eta_l(K_i-K_{\text{min}})\zeta_i(\boldsymbol{x}^t)$.
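
To make the two quantities in Corollary B.1 concrete, the sketch below evaluates $\tilde{K}$ and $\Phi_{\text{Hetero}}$ for made-up inputs. The function name `hetero_terms`, the example weights, step counts, and $\zeta_i$ values are all hypothetical; only the two formulas are taken from the statement above.

```python
def hetero_terms(lams, Ks, zetas, eta_l):
    """Compute (K_tilde, Phi_Hetero) as defined in Corollary B.1.

    lams  : client weights lambda_i
    Ks    : local step counts K_i
    zetas : heterogeneity values zeta_i(x^t), assumed given
    eta_l : local learning rate
    """
    K_min = min(Ks)
    K_tilde = sum(lam * K for lam, K in zip(lams, Ks))
    phi = sum(30.0 * eta_l * (K - K_min) * z for K, z in zip(Ks, zetas))
    return K_tilde, phi

# Hypothetical example: three clients with unbalanced local steps.
K_tilde, phi = hetero_terms(
    lams=[0.5, 0.3, 0.2], Ks=[2, 4, 10], zetas=[1.0, 0.5, 0.25], eta_l=0.1
)
print(K_tilde, phi)
```

When every client uses the same number of steps, each factor $K_i-K_{\text{min}}$ is zero, so $\Phi_{\text{Hetero}}$ vanishes and $\tilde{K}$ reduces to the common step count.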

Appendix C Convergence Analysis of FedAWARE

This section provides a detailed proof of Theorem 4.1. Our theory examines the impact of gradient diversity in federated optimization and matches the convergence of classic aggregation-based methods such as FedAvg. Unlike FedAvg, Algorithm 1 modifies the aggregation result through momentum approximation and an adaptive weighting strategy.

Proof 3

Recall that the update rule of Algorithm 1 is:

\[
\boldsymbol{x}^{t+1}=\boldsymbol{x}^t-\eta\boldsymbol{d}^t,
\]

where

\[
\boldsymbol{d}^t=\sum_{i=1}^{N}\tilde{\lambda}_i^t\boldsymbol{m}_i^t,\quad\text{s.t.}\;\tilde{\lambda}^t=\underset{\lambda}{\arg\min}\left\|\sum_{i=1}^{N}\lambda_i\boldsymbol{m}_i^t\right\|^2.
\]
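
The weight selection above is a minimum-norm problem over convex combinations of the vectors $\boldsymbol{m}_i^t$. The sketch below solves a small instance with a Frank-Wolfe iteration and exact line search. It assumes that $\lambda$ ranges over the probability simplex (the constraint is not shown in this excerpt), and the solver choice, the function name `min_norm_weights`, and the toy inputs are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def min_norm_weights(M, iters=500):
    """Approximate argmin over the simplex of ||sum_i lam_i * M[i]||^2.

    M has shape (N, dim); row i plays the role of m_i^t.
    Uses Frank-Wolfe with exact line search on 0.5 * lam^T G lam.
    """
    n = M.shape[0]
    G = M @ M.T                       # Gram matrix, G[i, j] = <m_i, m_j>
    lam = np.full(n, 1.0 / n)         # start from uniform weights
    for _ in range(iters):
        grad = G @ lam                # gradient of 0.5 * lam^T G lam
        j = int(np.argmin(grad))      # simplex vertex minimizing the linear model
        direction = -lam.copy()
        direction[j] += 1.0           # e_j - lam
        denom = direction @ G @ direction
        if denom <= 1e-12:            # direction has no effect on the objective
            break
        gamma = float(np.clip(-(grad @ direction) / denom, 0.0, 1.0))
        if gamma == 0.0:              # no further decrease along this direction
            break
        lam = lam + gamma * direction
    return lam

# Toy example (hypothetical values): two orthogonal unit vectors.
M = np.array([[1.0, 0.0], [0.0, 1.0]])
lam = min_norm_weights(M)
d = lam @ M                           # aggregated direction d^t = sum_i lam_i m_i
print(lam, d)
```

Working with the $N\times N$ Gram matrix keeps the per-iteration cost independent of the model dimension, which is convenient when $N$ is much smaller than the number of parameters.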

By the $L$-smoothness of $f$, we have:

\[
\begin{aligned}
f\left(\boldsymbol{x}^{t+1}\right)
&=f\left(\boldsymbol{x}^t-\eta\boldsymbol{d}^t\right)\leq f\left(\boldsymbol{x}^t\right)-\eta\left\langle\nabla f\left(\boldsymbol{x}^t\right),\boldsymbol{d}^t\right\rangle+\frac{L}{2}\eta^2\left\|\boldsymbol{d}^t\right\|^2 \\
&\leq f\left(\boldsymbol{x}^t\right)-\eta\left\langle\nabla f\left(\boldsymbol{x}^t\right),\boldsymbol{d}^t-\nabla f\left(\boldsymbol{x}^t\right)+\nabla f\left(\boldsymbol{x}^t\right)\right\rangle+\frac{L}{2}\eta^2\left\|\boldsymbol{d}^t\right\|^2 \\
&\leq f\left(\boldsymbol{x}^t\right)-\eta\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|^2+\eta\left\langle\nabla f\left(\boldsymbol{x}^t\right),\nabla f\left(\boldsymbol{x}^t\right)-\boldsymbol{d}^t\right\rangle+\frac{L}{2}\eta^2\left\|\boldsymbol{d}^t\right\|^2.
\end{aligned}
\]
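
The first inequality is the standard descent bound for an $L$-smooth function. The sketch below checks it numerically on a quadratic $f(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^\top A\boldsymbol{x}$, whose smoothness constant is the largest eigenvalue of $A$. The quadratic test function, the helper names, and the sampled directions are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T                               # symmetric positive semidefinite Hessian
L = float(np.linalg.eigvalsh(A)[-1])      # largest eigenvalue = smoothness constant

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

def descent_gap(x, d, eta):
    """Upper bound minus f(x - eta * d); non-negative when the bound holds."""
    upper = f(x) - eta * (grad_f(x) @ d) + 0.5 * L * eta**2 * (d @ d)
    return upper - f(x - eta * d)

for _ in range(1000):
    x = rng.normal(size=5)
    d = rng.normal(size=5)                # arbitrary update direction
    # allow a tiny negative slack for floating-point rounding
    assert descent_gap(x, d, eta=0.1) >= -1e-9
```

For this quadratic the gap equals $\frac{\eta^2}{2}\left(L\|\boldsymbol{d}\|^2-\boldsymbol{d}^\top A\boldsymbol{d}\right)$, so the bound is tight exactly when $\boldsymbol{d}$ is a top eigenvector of $A$.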

Taking the full expectation over the randomness at time step $t$ on both sides, we have:

\[
\mathbb{E}\left[f(\boldsymbol{x}^{t+1})\right]-f(\boldsymbol{x}^t)\leq-\eta\mathbb{E}\left\|\nabla f\left(\boldsymbol{x}^t\right)\right\|^2+\eta\underbrace{\mathbb{E}\left[\left\langle\nabla f(\boldsymbol{x}^t),\nabla f\left(\boldsymbol{x}^t\right)-\boldsymbol{d}^t\right\rangle\right]}_{T_1}+\frac{L}{2}\eta^2\underbrace{\mathbb{E}\left[\left\|\boldsymbol{d}^t\right\|^2\right]}_{T_2}.
\tag{13}
\]

We now bound $T_1$ and $T_2$ in turn.

Bounding $T_1$. By the Cauchy-Schwarz inequality, we get

\[
\begin{aligned}
T_1
&\leq\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^t)\right\|\left\|\nabla f(\boldsymbol{x}^t)-\boldsymbol{d}^t\right\|\right]\leq\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_i\nabla f_i(\boldsymbol{x}^t)-\sum_{i=1}^{N}\tilde{\lambda}_i^t\boldsymbol{m}_i^t\right\|\right] \\
&=\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_i\nabla f_i(\boldsymbol{x}^t)\pm\sum_{i=1}^{N}\tilde{\lambda}_i^t\nabla f_i(\boldsymbol{x}^t)-\sum_{i=1}^{N}\tilde{\lambda}_i^t\boldsymbol{m}_i^t\right\|\right] \\
&\leq\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}(\lambda_i-\tilde{\lambda}_i^t)\nabla f_i(\boldsymbol{x}^t)\right\|\right]+\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\tilde{\lambda}_i^t\left(\nabla f_i(\boldsymbol{x}^t)-\boldsymbol{m}_i^t\right)\right\|\right] \\
&\leq\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\frac{(\lambda_i-\tilde{\lambda}_i^t)}{\lambda_i}\cdot\lambda_i\nabla f_i(\boldsymbol{x}^t)\right\|\right]+\mathbb{E}\left\|\nabla f(\boldsymbol{x}^t)\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\frac{\tilde{\lambda}_i^t}{\lambda_i}\cdot\lambda_i\left(\nabla f_i(\boldsymbol{x}^t)-\boldsymbol{m}_i^t\right)\right\|\right]
\end{aligned}
\]
start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - bold_italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ ]
i=1N(λiλ~it)2λi2𝔼f(𝒙t)𝔼[i=1Nλifi(𝒙t)]absentsuperscriptsubscript𝑖1𝑁superscriptsubscript𝜆𝑖superscriptsubscript~𝜆𝑖𝑡2superscriptsubscript𝜆𝑖2𝔼norm𝑓superscript𝒙𝑡𝔼delimited-[]normsuperscriptsubscript𝑖1𝑁subscript𝜆𝑖subscript𝑓𝑖superscript𝒙𝑡\displaystyle\leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t% })^{2}}{\lambda_{i}^{2}}}\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|% \mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t% })\right\|\right]≤ square-root start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG ( italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over~ start_ARG italic_λ end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG blackboard_E ∥ ∇ italic_f ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ blackboard_E [ ∥ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ ]
+i=1N(λ~it)2λi2𝔼f(𝒙t)𝔼[i=1Nλi(fi(𝒙t)𝒎it)]superscriptsubscript𝑖1𝑁superscriptsuperscriptsubscript~𝜆𝑖𝑡2superscriptsubscript𝜆𝑖2𝔼norm𝑓superscript𝒙𝑡𝔼delimited-[]normsuperscriptsubscript𝑖1𝑁subscript𝜆𝑖subscript𝑓𝑖superscript𝒙𝑡superscriptsubscript𝒎𝑖𝑡\displaystyle\quad+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{% \lambda_{i}^{2}}}\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{% E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-% \boldsymbol{m}_{i}^{t})\right\|\right]+ square-root start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG ( over~ start_ARG italic_λ end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG blackboard_E ∥ ∇ italic_f ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ blackboard_E [ ∥ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - bold_italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ ]
i=1N(λiλ~it)2λi2𝔼f(𝒙t)2+i=1N(λ~it)2λi2𝔼f(𝒙t)𝔼[i=1Nλi(fi(𝒙t)𝒎it)],absentsuperscriptsubscript𝑖1𝑁superscriptsubscript𝜆𝑖superscriptsubscript~𝜆𝑖𝑡2superscriptsubscript𝜆𝑖2𝔼superscriptnorm𝑓superscript𝒙𝑡2superscriptsubscript𝑖1𝑁superscriptsuperscriptsubscript~𝜆𝑖𝑡2superscriptsubscript𝜆𝑖2𝔼norm𝑓superscript𝒙𝑡𝔼delimited-[]normsuperscriptsubscript𝑖1𝑁subscript𝜆𝑖subscript𝑓𝑖superscript𝒙𝑡superscriptsubscript𝒎𝑖𝑡\displaystyle\leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t% })^{2}}{\lambda_{i}^{2}}}\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|% ^{2}+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}% }\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|% \sum_{i=1}^{N}\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^% {t})\right\|\right],≤ square-root start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG ( italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over~ start_ARG italic_λ end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG blackboard_E ∥ ∇ italic_f ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + square-root start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG ( over~ start_ARG italic_λ end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG blackboard_E ∥ ∇ italic_f ( bold_italic_x start_POSTSUPERSCRIPT 
italic_t end_POSTSUPERSCRIPT ) ∥ blackboard_E [ ∥ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - bold_italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ ] ,

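where we use the Cauchy-Schwarz inequality to separate the weights $\tilde{\lambda}$ from the norms; the last step also uses the identity $\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})=\nabla f(\boldsymbol{x}^{t})$, which follows from the definition of $f$ in (1).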

Then, using the fact that $\mathbb{E}^{2}\|A\|\leq\mathbb{E}\|A\|^{2}$, we have

$$
\begin{aligned}
T_{1} &\leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\sqrt{\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|^{2}\right]}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\| \\
&\leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\sqrt{\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t}\right\|^{2}\right]}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|.
\end{aligned}
$$
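The two inequalities above are both Jensen-type steps: $(\mathbb{E}\|A\|)^{2}\leq\mathbb{E}\|A\|^{2}$, and $\|\sum_{i}\lambda_{i}\boldsymbol{v}_{i}\|^{2}\leq\sum_{i}\lambda_{i}\|\boldsymbol{v}_{i}\|^{2}$ for weights on the simplex. The following is a minimal numerical sketch of both, assuming the weights $\lambda_{i}$ are non-negative and sum to one; the helper names, dimensions, and random inputs are illustrative and not taken from the paper.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector stored as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def jensen_gaps(samples, weights, vectors):
    """Return the slack of the two Jensen-type inequalities.

    gap_expectation = mean(||A||^2) - mean(||A||)^2
    gap_weighted    = sum_i w_i ||v_i||^2 - ||sum_i w_i v_i||^2
    Both are non-negative; the second needs weights on the simplex.
    """
    norms = [norm(a) for a in samples]
    mean_norm = sum(norms) / len(norms)
    mean_sq = sum(n * n for n in norms) / len(norms)
    gap_expectation = mean_sq - mean_norm ** 2

    dim = len(vectors[0])
    mixed = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    weighted_sq = sum(w * norm(v) ** 2 for w, v in zip(weights, vectors))
    gap_weighted = weighted_sq - norm(mixed) ** 2
    return gap_expectation, gap_weighted


rng = random.Random(0)
# Empirical samples standing in for the random vector A (assumption).
samples = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(1000)]
# Random simplex weights standing in for lambda_i (assumption).
raw = [rng.random() for _ in range(8)]
weights = [r / sum(raw) for r in raw]
vectors = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(8)]

gap_expectation, gap_weighted = jensen_gaps(samples, weights, vectors)
print(gap_expectation >= 0, gap_weighted >= 0)
```

The first gap is the empirical variance of $\|A\|$ and the second is a weighted variance of the vectors, which is why neither can be negative.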

Letting $\chi_{\tilde{\lambda}\|\lambda}^{2}=\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}$ and $\chi_{\tilde{\lambda}\|\lambda}=\sum_{i=1}^{N}\frac{\tilde{\lambda}_{i}^{t}}{\lambda_{i}}$, we have

$$
T_{1}\leq\sqrt{1+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}\sqrt{\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t}\right\|^{2}\right]}\cdot\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|. \tag{14}
$$

Bounding $T_{2}$. By the definitions and the triangle inequality, we can decompose $T_{2}$:

$$
\begin{aligned}
T_{2} &=\mathbb{E}[\|\boldsymbol{d}^{t}\|^{2}]=\mathbb{E}\left[\left\|\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\boldsymbol{m}_{i}^{t}\right\|^{2}\right]=\mathbb{E}\left[\left\|\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\boldsymbol{m}_{i}^{t}-\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\nabla f_{i}(\boldsymbol{x}^{t})+\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&\leq 2\mathbb{E}\left[\left\|\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}(\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t}))\right\|^{2}\right]+2\mathbb{E}\left[\left\|\sum_{i=1}^{N}(\tilde{\lambda}_{i}^{t}-\lambda_{i})\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&\leq 2\mathbb{E}\left[\left\|\sum_{i=1}^{N}\frac{\tilde{\lambda}_{i}^{t}}{\lambda_{i}}\cdot\lambda_{i}(\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t}))\right\|^{2}\right]+2\mathbb{E}\left[\left\|\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t}-\lambda_{i})}{\lambda_{i}}\cdot\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&\leq 2\chi_{\tilde{\lambda}\|\lambda}^{2}\,\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}(\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t}))\right\|^{2}\right]+2(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\,\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&\leq 2\chi_{\tilde{\lambda}\|\lambda}^{2}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|(\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t}))\right\|^{2}\right]+2(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\,\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right],
\end{aligned}
\tag{15}
$$

where we again use the Cauchy-Schwarz inequality to separate $\tilde{\lambda}$ from the norms.
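The coefficient $N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}$ in (15) is the expansion of $\sum_{i=1}^{N}(\tilde{\lambda}_{i}^{t}-\lambda_{i})^{2}/\lambda_{i}^{2}$ under the definitions of $\chi_{\tilde{\lambda}\|\lambda}^{2}$ and $\chi_{\tilde{\lambda}\|\lambda}$ given before (14). A small sketch that checks this expansion numerically is given below; the function names `chi_terms` and `normalise` and the random weight vectors are hypothetical, chosen only for illustration.

```python
import random


def normalise(values):
    """Scale positive values so they sum to one (a point on the simplex)."""
    total = sum(values)
    return [v / total for v in values]


def chi_terms(lam, lam_tilde):
    """Return (chi^2, chi) as defined before (14):
    chi^2 = sum_i (lam_tilde_i / lam_i)^2 and chi = sum_i lam_tilde_i / lam_i.
    """
    ratios = [lt / l for lt, l in zip(lam_tilde, lam)]
    return sum(r * r for r in ratios), sum(ratios)


rng = random.Random(1)
N = 6
# Illustrative static weights lambda and adapted weights lambda-tilde.
lam = normalise([rng.random() + 0.1 for _ in range(N)])
lam_tilde = normalise([rng.random() + 0.1 for _ in range(N)])

chi_sq, chi = chi_terms(lam, lam_tilde)
# Left-hand side: the sum that appears under the norm in the T_2 bound.
direct = sum(((lt - l) / l) ** 2 for l, lt in zip(lam, lam_tilde))
# Right-hand side: the expanded coefficient used in (15).
expanded = N + chi_sq - 2 * chi
print(abs(direct - expanded) < 1e-9)
```

When $\tilde{\lambda}^{t}=\lambda$, both $\chi_{\tilde{\lambda}\|\lambda}^{2}$ and $\chi_{\tilde{\lambda}\|\lambda}$ equal $N$, so the coefficient vanishes and only the momentum-error term of (15) remains.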

Investigating momentum approximation. Observing (14) and (15), we need to bound the approximation error of local momentum. We present our analysis in the following lemma.

Lemma C.1 (Bound of local momentum)

For any $i\in[N]$ and $t\in[T]$, letting $\beta_{i}=(1-\alpha)p_{i}\leq 1$ and $p_{i}=\text{Prob}(i\in S^{t})$, the gap between the momentum $\boldsymbol{m}_{i}$ and the gradient $\nabla f_{i}(\boldsymbol{x}^{t})$ can be bounded:

$$
\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right]\leq\beta^{2}L^{2}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}. \tag{16}
$$
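The rate $\beta_{i}=(1-\alpha)p_{i}$ in the lemma can be read as an effective mixing weight: a client keeps its stale momentum with probability $1-p_{i}$ and applies the momentum update with probability $p_{i}$. The sketch below assumes the sampled-or-stale update that the first line of the proof expands, namely $\alpha\boldsymbol{m}_{i}^{t-1}+(1-\alpha)\boldsymbol{g}_{i}^{t}$ when client $i$ is sampled and $\boldsymbol{m}_{i}^{t-1}$ otherwise, and checks that its expectation over sampling is $(1-\beta_{i})\boldsymbol{m}_{i}^{t-1}+\beta_{i}\boldsymbol{g}_{i}^{t}$. Scalars stand in for vectors, and all names and numerical values are illustrative assumptions.

```python
import random


def momentum_step(m_prev, g, alpha, sampled):
    """Apply the momentum update only if the client was sampled this round."""
    if sampled:
        return alpha * m_prev + (1 - alpha) * g
    return m_prev  # stale momentum is kept unchanged


def expected_momentum(m_prev, g, alpha, p):
    """Closed form of the expectation over sampling, with beta = (1 - alpha) * p."""
    beta = (1 - alpha) * p
    return (1 - beta) * m_prev + beta * g


# Illustrative values (assumptions): previous momentum, fresh update,
# momentum coefficient alpha, and participation probability p.
m_prev, g, alpha, p = 0.3, 1.5, 0.9, 0.25

# Exact expectation over the two sampling outcomes.
exact = (1 - p) * momentum_step(m_prev, g, alpha, False) \
    + p * momentum_step(m_prev, g, alpha, True)

# Monte Carlo estimate of the same expectation.
rng = random.Random(2)
trials = 50_000
average = sum(
    momentum_step(m_prev, g, alpha, rng.random() < p) for _ in range(trials)
) / trials

print(abs(exact - expected_momentum(m_prev, g, alpha, p)) < 1e-12)
```

A smaller participation probability $p_{i}$ therefore shrinks $\beta_{i}$, so the momentum of a rarely sampled client tracks fresh updates more slowly.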
Proof 4
$$
\begin{aligned}
&\quad\;\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|(1-p_{i})\boldsymbol{m}_{i}^{t-1}+p_{i}(\alpha\boldsymbol{m}_{i}^{t-1}+(1-\alpha)\boldsymbol{g}_{i}^{t})-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|[1-(1-\alpha)p_{i}]\boldsymbol{m}_{i}^{t-1}+(1-\alpha)p_{i}\boldsymbol{g}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|(1-\beta_{i})\boldsymbol{m}_{i}^{t-1}+\beta_{i}\boldsymbol{g}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right]\qquad\triangleright\ \text{letting } \beta_{i}=(1-\alpha)p_{i}
\end{aligned}
$$
=i=1Nλi𝔼[(1βi)(𝒎it1fi(𝒙t1))+βi𝒈it+(1βi)fi(𝒙t1)fi(𝒙t)2]absentsuperscriptsubscript𝑖1𝑁subscript𝜆𝑖𝔼delimited-[]superscriptnorm1subscript𝛽𝑖superscriptsubscript𝒎𝑖𝑡1subscript𝑓𝑖superscript𝒙𝑡1subscript𝛽𝑖superscriptsubscript𝒈𝑖𝑡1subscript𝛽𝑖subscript𝑓𝑖superscript𝒙𝑡1subscript𝑓𝑖superscript𝒙𝑡2\displaystyle=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left[\left\|(1-\beta_{i})(% \boldsymbol{m}_{i}^{t-1}-\nabla f_{i}(\boldsymbol{x}^{t-1}))+\beta_{i}% \boldsymbol{g}_{i}^{t}+(1-\beta_{i})\nabla f_{i}(\boldsymbol{x}^{t-1})-\nabla f% _{i}(\boldsymbol{x}^{t})\right\|^{2}\right]}= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E [ ∥ ( 1 - italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( bold_italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT - ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ) ) + italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT + ( 1 - italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ) - ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
=i=1Nλi𝔼[(1βi)(𝒎it1fi(𝒙t1))+βi[𝒈itηlKifi(𝒙t)]\displaystyle=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\Big{[}\|(1-\beta_{i})(% \boldsymbol{m}_{i}^{t-1}-\nabla f_{i}(\boldsymbol{x}^{t-1}))+\beta_{i}[% \boldsymbol{g}_{i}^{t}-\eta_{l}K_{i}\nabla f_{i}(\boldsymbol{x}^{t})]= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E [ ∥ ( 1 - italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( bold_italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT - ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ) ) + italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ bold_italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ]
+((1βi)fi(𝒙t1)(1βiηlKi))fi(𝒙t)2]\displaystyle\quad+\left((1-\beta_{i})\nabla f_{i}(\boldsymbol{x}^{t-1})-(1-% \beta_{i}\eta_{l}K_{i})\right)\nabla f_{i}(\boldsymbol{x}^{t})\|^{2}\Big{]}+ ( ( 1 - italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ) - ( 1 - italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
i=1Nλi(1βi)2𝔼[𝒎it1fi(𝒙t1)2]+i=1Nλiβi2𝔼[𝒈itηlKifi(𝒙t)2]R1absentsuperscriptsubscript𝑖1𝑁subscript𝜆𝑖superscript1subscript𝛽𝑖2𝔼delimited-[]superscriptnormsuperscriptsubscript𝒎𝑖𝑡1subscript𝑓𝑖superscript𝒙𝑡12superscriptsubscript𝑖1𝑁subscript𝜆𝑖superscriptsubscript𝛽𝑖2subscript𝔼delimited-[]superscriptnormsuperscriptsubscript𝒈𝑖𝑡subscript𝜂𝑙subscript𝐾𝑖subscript𝑓𝑖superscript𝒙𝑡2subscript𝑅1\displaystyle\leq\sum_{i=1}^{N}\lambda_{i}(1-\beta_{i})^{2}\mathbb{E}{\left[% \left\|\boldsymbol{m}_{i}^{t-1}-\nabla f_{i}(\boldsymbol{x}^{t-1})\right\|^{2}% \right]}+\sum_{i=1}^{N}\lambda_{i}\beta_{i}^{2}\underbrace{\mathbb{E}\left[% \left\|\boldsymbol{g}_{i}^{t}-\eta_{l}K_{i}\nabla f_{i}(\boldsymbol{x}^{t})% \right\|^{2}\right]}_{R_{1}}≤ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( 1 - italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E [ ∥ bold_italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT - ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT under⏟ start_ARG blackboard_E [ ∥ bold_italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - italic_η start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∇ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] 
end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT
+i=1Nλi(1βi)2L2𝔼𝒙t1𝒙t2𝔼η𝒅t12.superscriptsubscript𝑖1𝑁subscript𝜆𝑖superscript1subscript𝛽𝑖2superscript𝐿2subscript𝔼superscriptnormsuperscript𝒙𝑡1superscript𝒙𝑡2𝔼superscriptnorm𝜂superscript𝒅𝑡12\displaystyle\quad+\sum_{i=1}^{N}\lambda_{i}(1-\beta_{i})^{2}L^{2}\underbrace{% \mathbb{E}\left\|\boldsymbol{x}^{t-1}-\boldsymbol{x}^{t}\right\|^{2}}_{\mathbb% {E}\|\eta\boldsymbol{d}^{t-1}\|^{2}}.+ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( 1 - italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT under⏟ start_ARG blackboard_E ∥ bold_italic_x start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT - bold_italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT blackboard_E ∥ italic_η bold_italic_d start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT .
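
The purely algebraic steps in this chain (collecting the momentum coefficients into $\beta_i=(1-\alpha)p_i$, and splitting off $\eta_l K_i\nabla f_i(\boldsymbol{x}^{t})$) can be sanity-checked numerically. The following is a minimal sketch that uses scalar stand-ins for the vectors; the function names and the sampled values are hypothetical and are not taken from the authors' code.

```python
import random


def momentum_mix(m_prev, g, p, alpha):
    """(1 - p) * m_prev + p * (alpha * m_prev + (1 - alpha) * g)."""
    return (1 - p) * m_prev + p * (alpha * m_prev + (1 - alpha) * g)


def momentum_beta(m_prev, g, p, alpha):
    """The same quantity written with beta = (1 - alpha) * p."""
    beta = (1 - alpha) * p
    return (1 - beta) * m_prev + beta * g


def decomposed_error(m_prev, g, grad_prev, grad_cur, beta, eta_l, k_i):
    """Three-part split of (1 - beta) * m_prev + beta * g - grad_cur."""
    part_a = (1 - beta) * (m_prev - grad_prev)
    part_b = beta * (g - eta_l * k_i * grad_cur)
    part_c = (1 - beta) * grad_prev - (1 - beta * eta_l * k_i) * grad_cur
    return part_a + part_b + part_c


random.seed(0)
max_gap = 0.0
for _ in range(1000):
    # Scalar stand-ins for m_i^{t-1}, g_i^t, grad f_i(x^{t-1}), grad f_i(x^t).
    m_prev, g, grad_prev, grad_cur = (random.uniform(-5, 5) for _ in range(4))
    p, alpha = random.random(), random.random()
    eta_l, k_i = random.uniform(0.001, 0.1), random.randint(1, 10)
    beta = (1 - alpha) * p

    mixed = momentum_mix(m_prev, g, p, alpha)
    gap_beta = abs(mixed - momentum_beta(m_prev, g, p, alpha))
    gap_split = abs(
        (mixed - grad_cur)
        - decomposed_error(m_prev, g, grad_prev, grad_cur, beta, eta_l, k_i)
    )
    max_gap = max(max_gap, gap_beta, gap_split)

# Both rewrites should agree up to floating-point rounding.
print("largest discrepancy:", max_gap)
```

The check covers only the equalities; the final inequality relies on the assumptions of the analysis and is not something this sketch verifies.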

Note that the term $R_1$ denotes the heterogeneity brought by the local computation. For $\tau\in[t]$, the term $R_1$ can be bounded as follows:

\begin{align*}
R_1 &= \mathbb{E}\left[\left\|\sum_{k=1}^{K_i}\eta_l\nabla F_i(\boldsymbol{x}^{t,k})-\eta_lK_i\nabla f_i(\boldsymbol{x}^{t})\right\|^2\right]=\mathbb{E}\left[\left\|\sum_{k=1}^{K_i}\eta_l\left(\nabla F_i(\boldsymbol{x}^{t,k})-\nabla f_i(\boldsymbol{x}^{t})\right)\right\|^2\right] \tag{17} \\
&= \mathbb{E}\left[\left\|\sum_{k=1}^{K_i}\eta_l\left(\nabla f_i(\boldsymbol{x}^{t,k})-\nabla f_i(\boldsymbol{x}^{t})\right)\right\|^2\right] && \triangleright\ \text{using Assumption 2.2} \\
&\leq L^2\eta_l^2\sum_{k=1}^{K_i}\mathbb{E}\left\|\boldsymbol{x}_i^{t,k}-\boldsymbol{x}^{t}\right\|^2 && \triangleright\ \text{using Assumption 2.1} \\
&\leq L^2\eta_l^2K_i\,\mathbb{E}\left\|\boldsymbol{x}_i^{t,K_i}-\boldsymbol{x}^{t}\right\|^2.
\end{align*}

Next, using (15) to bound the second underbraced term, $\mathbb{E}\|\eta\boldsymbol{d}^{t-1}\|^2$, we have

\begin{align*}
\mathbb{E}\|\eta\boldsymbol{d}^{t-1}\|^2=\eta^2\mathbb{E}\|\boldsymbol{d}^{t-1}\|^2 &\leq 2\eta^2\chi_{\tilde{\lambda}\|\lambda}^{2}\sum_{i=1}^{N}\lambda_i\mathbb{E}\left[\left\|\boldsymbol{m}_i^{t-1}-\nabla f_i(\boldsymbol{x}^{t-1})\right\|^2\right] \tag{18} \\
&\quad +2\eta^2\left(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}\right)\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t-1})\right\|^2\right].
\end{align*}

Combining the equations, we have

\begin{align*}
\sum_{i=1}^{N}\lambda_i\mathbb{E}\left[\left\|\boldsymbol{m}_i^{t}-\nabla f_i(\boldsymbol{x}^{t})\right\|^2\right] &\leq \sum_{i=1}^{N}\lambda_i(1-\beta_i)^2\left(1+2L^2\eta^2\chi_{\tilde{\lambda}\|\lambda}^{2}\right)\mathbb{E}\left[\left\|\boldsymbol{m}_i^{t-1}-\nabla f_i(\boldsymbol{x}^{t-1})\right\|^2\right] \tag{19} \\
&\quad +\sum_{i=1}^{N}\lambda_i\beta_i^2L^2\eta_l^2K_i\,\mathbb{E}\left\|\boldsymbol{x}_i^{t,K_i}-\boldsymbol{x}^{t}\right\|^2 \\
&\quad +\left[\sum_{i=1}^{N}\lambda_i(1-\beta_i)^2\right]2L^2\eta^2\left(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}\right)\cdot\left\|\nabla f(\boldsymbol{x}^{t-1})\right\|^2.
\end{align*}

As the last term of (19) typically vanishes over time $T$ due to the factors $\beta$ and $\eta$, we omit it and mainly focus on the effect of the second term. Unrolling the recursion for $i\in[N]$, we have

\begin{align*}
&\sum_{i=1}^{N}\lambda_i\mathbb{E}\left[\left\|\boldsymbol{m}_i^{t}-\nabla f_i(\boldsymbol{x}^{t})\right\|^2\right] \tag{20} \\
&\leq \sum_{i=1}^{N}\lambda_i\sum_{\tau=1}^{t}\left((1-\beta_i)^2\left(1+2L^2\eta^2\chi_{\tilde{\lambda}\|\lambda}^{2}\right)\right)^{\tau-1}\beta_i^2L^2\eta_l^2K_i\,\mathbb{E}\left\|\boldsymbol{x}_i^{t,K_i}-\boldsymbol{x}^{t}\right\|^2 \\
&\leq \sum_{i=1}^{N}\lambda_i\,t\max_{\tau\in[t]}\left((1-\beta_i)^2\left(1+2L^2\eta^2\chi_{\tilde{\lambda}\|\lambda}^{2}\right)\right)^{\tau-1}\beta_i^2L^2\eta_l^2K_i\,\mathbb{E}\left\|\boldsymbol{x}_i^{t,K_i}-\boldsymbol{x}^{t}\right\|^2 \\
&\leq \sum_{i=1}^{N}\lambda_i\,t\,\beta_i^2L^2\eta_l^2K_i\,\mathbb{E}\left\|\boldsymbol{x}_i^{t,K_i}-\boldsymbol{x}^{t}\right\|^2,
\end{align*}

where we let $\mathbb{E}\left[\left\|\boldsymbol{m}_i^{0}-\nabla f_i(\boldsymbol{x}^{0})\right\|^2\right]=0$ and $(1-\beta_i)^2\left(1+2L^2\eta^2\chi_{\tilde{\lambda}\|\lambda}^{2}\right)\leq 1$. To further optimize the upper bound, we let $\eta_l\leq\frac{1}{\sqrt{tK_i}}$ and $\beta=\beta_1=\dots=\beta_N$ to conclude the proof:

\begin{align*}
\sum_{i=1}^{N}\lambda_i\mathbb{E}\left[\left\|\boldsymbol{m}_i^{t}-\nabla f_i(\boldsymbol{x}^{t})\right\|^2\right]\leq\beta^2L^2\sum_{i=1}^{N}\lambda_i\mathbb{E}\left\|\boldsymbol{x}_i^{t,K_i}-\boldsymbol{x}^{t}\right\|^2. \tag{21}
\end{align*}
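
The unrolling and the final step-size choice can be illustrated with a small numerical sketch. It assumes, purely for illustration, a scalar recursion $a_s\leq q\,a_{s-1}+c$ with $a_0=0$, a contraction factor $0\leq q\leq 1$, and a per-round term $c$ held constant; the names `unroll`, `geometric_form`, and `coefficient` and all sample values are hypothetical.

```python
import math


def unroll(q, c, t):
    """Iterate a_s = q * a_{s-1} + c for t steps, starting from a_0 = 0."""
    a = 0.0
    for _ in range(t):
        a = q * a + c
    return a


def geometric_form(q, c, t):
    """Closed form of the unrolled recursion: sum over tau of q^(tau-1) * c."""
    return sum(q ** (tau - 1) * c for tau in range(1, t + 1))


def coefficient(beta, lip, eta_l, k_i, t):
    """Coefficient t * beta^2 * L^2 * eta_l^2 * K_i in front of the drift term."""
    return t * beta ** 2 * lip ** 2 * eta_l ** 2 * k_i


q, c, t = 0.9, 2.0, 50
# With 0 <= q <= 1 every summand is at most c, so the sum is at most t * c.
print(unroll(q, c, t), geometric_form(q, c, t), t * c)

beta, lip, k_i = 0.5, 2.0, 5
eta_l = 1.0 / math.sqrt(t * k_i)  # the largest local step size allowed above
# At this step size the coefficient collapses to beta^2 * L^2.
print(coefficient(beta, lip, eta_l, k_i, t), beta ** 2 * lip ** 2)
```

Any smaller $\eta_l$ only shrinks the coefficient, which is why the condition is stated as an upper bound on the local learning rate.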

Remark. The condition $\mathbb{E}\left[\left\|\boldsymbol{m}_i^{0}-\nabla f_i(\boldsymbol{x}^{0})\right\|^2\right]=0$ corresponds to an error-free initialization of the server-stored local momentum. Typically, this can be achieved with a single round of full participation at the start of FL. Otherwise, it only introduces an additional constant factor in the approximation without breaking our analysis. In addition, the condition $(1-\beta_i)^2\left(1+2L^2\eta^2\chi_{\tilde{\lambda}\|\lambda}^{2}\right)\leq 1$ can be satisfied by choosing $\eta$ and $\alpha$ appropriately.
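
To make the last point of the remark concrete, the condition can be solved for the server learning rate: for $0<\beta_i<1$ it is equivalent to $\eta^2\leq\frac{(1-\beta_i)^{-2}-1}{2L^2\chi_{\tilde{\lambda}\|\lambda}^{2}}$. The sketch below computes this threshold under the assumption that $L$ and $\chi_{\tilde{\lambda}\|\lambda}^{2}$ are known positive scalars; the helper names and the example values are hypothetical.

```python
import math


def contraction_factor(alpha, p, lip, chi_sq, eta):
    """(1 - beta)^2 * (1 + 2 L^2 eta^2 chi^2) with beta = (1 - alpha) * p."""
    beta = (1 - alpha) * p
    return (1 - beta) ** 2 * (1 + 2 * lip ** 2 * eta ** 2 * chi_sq)


def max_server_lr(alpha, p, lip, chi_sq):
    """Largest eta for which the contraction factor stays at or below 1."""
    beta = (1 - alpha) * p
    if not 0 < beta < 1:
        raise ValueError("this sketch assumes 0 < beta < 1")
    if lip <= 0 or chi_sq <= 0:
        raise ValueError("L and chi^2 must be positive")
    return math.sqrt((1 / (1 - beta) ** 2 - 1) / (2 * lip ** 2 * chi_sq))


# Example values (illustrative only): alpha = 0.5, p = 0.2, L = 1, chi^2 = 2.
eta_max = max_server_lr(0.5, 0.2, 1.0, 2.0)
print(eta_max, contraction_factor(0.5, 0.2, 1.0, 2.0, eta_max))
```

A smaller $\alpha$ or a larger participation probability $p_i$ increases $\beta_i$ and therefore loosens the restriction on $\eta$.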

Putting it together. Substituting the corresponding terms (14) and (15) into (13), we have

$$
\begin{aligned}
&\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t+1})\right]-f(\boldsymbol{x}^{t})}{\eta}\leq-\left(1-\sqrt{N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}}-(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\right)\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&\quad+\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}\sqrt{\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t}\right\|^{2}\right]}\cdot\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|+L\eta\chi_{\tilde{\lambda}\|\lambda}^{2}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right].
\end{aligned}
$$

Then, substituting the momentum approximation with (21) and reorganising the terms, we have

$$
\begin{aligned}
&\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t+1})\right]-f(\boldsymbol{x}^{t})}{\eta}\leq-\left(1-\sqrt{N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}}-(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\right)\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&\quad+\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}\sqrt{\beta^{2}L^{2}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}}\cdot\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|+\eta\chi_{\tilde{\lambda}\|\lambda}^{2}\beta^{2}L^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}.
\end{aligned}
\tag{22}
$$

Inducing gradient diversity. In the main paper, we discussed how gradient diversity relates to the degree of statistical heterogeneity and to optimization quality. To demonstrate this relation, our idea is to introduce the gradient diversity into our convergence rate by replacing the first-order global gradient $\nabla f(\boldsymbol{x}^{t})$ on the right-hand side of (22). However, Algorithm 1 minimizes a surrogate objective $\tilde{f}(\boldsymbol{x})$, which is determined by $\tilde{\lambda}$. Therefore, we first clarify the convergence relation between the surrogate objective $\tilde{f}(\boldsymbol{x})$ and the original objective $f(\boldsymbol{x})$ in the lemma below.

Lemma C.2 (Surrogate convergence relation)

For all $\boldsymbol{x}\in\mathcal{X}$, the difference between the primary gradient $\nabla f(\boldsymbol{x})$ and the surrogate gradient $\nabla\tilde{f}(\boldsymbol{x})=\sum_{i=1}^{N}\tilde{\lambda}_{i}\nabla f_{i}(\boldsymbol{x})$ can be bounded:

Proof 5

According to the definition, we have

$$
\begin{aligned}
\left\|\nabla\tilde{f}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}&=\left\|\sum_{i=1}^{N}(\tilde{\lambda}_{i}-\lambda_{i})\nabla f_{i}(\boldsymbol{x})\right\|^{2} \\
&\leq\left[\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}-\lambda_{i})^{2}}{\lambda_{i}^{2}}\right]\cdot\left[\left\|\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x})\right\|^{2}\right] \\
&\leq(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\left(\sigma_{g}^{2}+\left\|\nabla f(\boldsymbol{x})\right\|^{2}\right).
\end{aligned}
\tag{23}
$$

Then, we investigate

$$
\begin{aligned}
\left\|\nabla\tilde{f}(\boldsymbol{x})\right\|^{2}&\leq 2\left\|\nabla\tilde{f}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}+2\left\|\nabla f(\boldsymbol{x})\right\|^{2} \\
&\leq 2(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\left(\sigma_{g}^{2}+\left\|\nabla f(\boldsymbol{x})\right\|^{2}\right)+2\left\|\nabla f(\boldsymbol{x})\right\|^{2} \\
&\leq 2(1+N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\left\|\nabla f(\boldsymbol{x})\right\|^{2}+2(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\sigma_{g}^{2}.
\end{aligned}
$$
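The first step of the bound on the surrogate gradient norm is the elementary inequality $\|\boldsymbol{a}\|^{2}\leq 2\|\boldsymbol{a}-\boldsymbol{b}\|^{2}+2\|\boldsymbol{b}\|^{2}$ applied to the surrogate and primary gradients. The following minimal sketch checks that step numerically on toy client gradients. The helper names (`weighted_sum`, `sq_norm`, `surrogate_gap`) and the toy data are assumptions made for illustration and do not come from the paper.

```python
import random


def weighted_sum(weights, grads):
    """Return sum_i weights[i] * grads[i] for a list of equal-length vectors."""
    dim = len(grads[0])
    return [sum(w * g[k] for w, g in zip(weights, grads)) for k in range(dim)]


def sq_norm(vec):
    """Squared Euclidean norm."""
    return sum(c * c for c in vec)


def surrogate_gap(lam, lam_tilde, grads):
    """Return (lhs, rhs) of ||g_tilde||^2 <= 2 ||g_tilde - g||^2 + 2 ||g||^2.

    g       = sum_i lam[i] * grads[i]        (primary gradient)
    g_tilde = sum_i lam_tilde[i] * grads[i]  (surrogate gradient)
    """
    g = weighted_sum(lam, grads)
    g_tilde = weighted_sum(lam_tilde, grads)
    diff = [a - b for a, b in zip(g_tilde, g)]
    return sq_norm(g_tilde), 2.0 * sq_norm(diff) + 2.0 * sq_norm(g)


# Toy check with two clients and orthogonal gradients.
print(surrogate_gap([0.5, 0.5], [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))

# Randomized check: the inequality holds for any weights and gradients.
rng = random.Random(0)
for _ in range(100):
    grads = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
    raw = [rng.random() for _ in range(5)]
    raw_t = [rng.random() for _ in range(5)]
    lam = [r / sum(raw) for r in raw]
    lam_tilde = [r / sum(raw_t) for r in raw_t]
    lhs, rhs = surrogate_gap(lam, lam_tilde, grads)
    assert lhs <= rhs + 1e-12
```

The check covers only this first step; the later steps additionally rely on the bound (23) and the constant $\sigma_{g}^{2}$, which are not simulated here.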

Therefore, we have

$$
\begin{aligned}
\min_{t\in[T]}\left\|\nabla\tilde{f}(\boldsymbol{x}^{t})\right\|^{2}&\leq\frac{1}{T}\sum_{t=0}^{T-1}\left\|\nabla\tilde{f}(\boldsymbol{x}^{t})\right\|^{2} \\
&\leq 2(1+N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\frac{1}{T}\sum_{t=0}^{T-1}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+2(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\sigma_{g}^{2}.
\end{aligned}
\tag{24}
$$

Remark. Inequality (24) shows the convergence relation, which is largely determined by $\lambda$ and $\tilde{\lambda}$. Importantly, we are interested in the case $\|\nabla f(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}$, which indicates that $\lambda$ is a better weighting than $\tilde{\lambda}$. In this case, we can minimize the right-hand side of (24) by tuning $\tilde{\lambda}$. In contrast, the case $\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla f(\boldsymbol{x}^{t})\|^{2}$ indicates that Algorithm 1 converges to stationary points with a lower surrogate gradient norm. According to non-convex analysis theory, this denotes a better optimization result.

In this paper, we assume that $\lambda$ is the best weight for federated optimization, which always induces $\|\nabla f(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}$. Therefore, we can connect the surrogate gradient diversity with the original objective:

Corollary C.1

Combining the assumption $\|\nabla f(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}$ with Definition 3.2, we have

$$
\left\|\nabla f(\boldsymbol{x}^{t})\right\|\leq\left\|\nabla\tilde{f}(\boldsymbol{x}^{t})\right\|\leq\frac{\sqrt{\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\leq\frac{\tilde{G}}{\tilde{D}(\boldsymbol{x}^{t})},
\tag{25}
$$

where we use $\tilde{G}=\max_{t\in[T]}\sqrt{\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}}$.
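To make the constant $\tilde{G}$ concrete, the sketch below computes it from per-round client gradients following the definition just given. It also checks a related standard fact: when $\lambda$ is a probability vector, Jensen's inequality gives $\|\sum_{i}\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})\|\leq\sqrt{\sum_{i}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x}^{t})\|^{2}}\leq\tilde{G}$. The function names and the toy gradients are hypothetical. The sketch does not compute $\tilde{D}(\boldsymbol{x}^{t})$, since Definition 3.2 lies outside this section.

```python
import math


def weighted_grad_energy(lam, grads):
    """sqrt(sum_i lam[i] * ||grads[i]||^2) for the client gradients of one round."""
    return math.sqrt(sum(w * sum(c * c for c in g) for w, g in zip(lam, grads)))


def max_weighted_grad_energy(lam, grads_per_round):
    """G_tilde: the maximum over rounds of the weighted gradient energy."""
    return max(weighted_grad_energy(lam, grads) for grads in grads_per_round)


def global_grad_norm(lam, grads):
    """|| sum_i lam[i] * grads[i] ||, the norm of the lambda-weighted gradient."""
    dim = len(grads[0])
    agg = [sum(w * g[k] for w, g in zip(lam, grads)) for k in range(dim)]
    return math.sqrt(sum(c * c for c in agg))


# Toy data: two clients, two rounds, two-dimensional gradients.
lam = [0.5, 0.5]
rounds = [
    [[3.0, 4.0], [0.0, 0.0]],  # round 0
    [[1.0, 0.0], [0.0, 1.0]],  # round 1
]
G_tilde = max_weighted_grad_energy(lam, rounds)
print(G_tilde, [global_grad_norm(lam, grads) for grads in rounds])
```

In the toy data the aggregated gradient norm stays below $\tilde{G}$ in every round, as the Jensen argument predicts.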

Substituting inequality (25) into (22) and rearranging the terms, we obtain

$$
\begin{aligned}
&\left(1-\sqrt{N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}}-(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\right)\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&\leq\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t})\right]-f(\boldsymbol{x}^{t+1})}{\eta}+\eta\chi_{\tilde{\lambda}\|\lambda}^{2}\beta^{2}L^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}+\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\sqrt{\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}} \\
&\leq\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t})\right]-f(\boldsymbol{x}^{t+1})}{\eta}+\eta\chi_{\tilde{\lambda}\|\lambda}^{2}\beta^{2}L^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2} \\
&\quad+\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\sqrt{5\eta_{l}\left(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2}+6\tilde{K}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right)+\Phi_{\text{Hetero}}} \\
&\leq\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t})\right]-f(\boldsymbol{x}^{t+1})}{\eta}+\eta\chi_{\tilde{\lambda}\|\lambda}^{2}\beta^{2}L^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2} \\
&\quad+\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\sqrt{5\eta_{l}\left(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2}\right)+\Phi_{\text{Hetero}}}+\frac{6\tilde{K}\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2},
\end{aligned}
\tag{26}
$$

where the last inequality replaces the averaged local updates term with Corollary 3.2.

Then, rearranging the above inequality, we have

$$
\begin{aligned}
\rho^{t}\,\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]&\leq\underbrace{\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t})\right]-f(\boldsymbol{x}^{t+1})}{\eta}+\eta\chi_{\tilde{\lambda}\|\lambda}^{2}\beta^{2}L^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}}_{\text{Vanishing terms about }\eta} \\
&\quad+\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\sqrt{5\eta_{l}\left(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2}\right)+\Phi_{\text{Hetero}}}
\end{aligned}
\tag{27}
$$

where we let $\rho^{t}=\left(1-\sqrt{N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}}-\left(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}\right)-6\tilde{K}\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}/\tilde{D}(\boldsymbol{x}^{t})\right)>0$ for notation brevity.

Finally, taking the full expectation on both sides, summing the terms from $t=0$ to $T-1$, and rearranging, we have

$$
\begin{aligned}
\min_{t\in[T]}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]
&\leq\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right] \\
&\leq\epsilon_{\text{opt}}+\sum_{t=0}^{T-1}\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{T\rho^{t}\tilde{D}(\boldsymbol{x}^{t})}\sqrt{\sigma^{2}+\Phi_{\text{Hetero}}},
\end{aligned}
\tag{28}
$$

where we use $\epsilon_{\text{opt}}$ to absorb the vanishing terms in (27) and $\sigma^{2}=5\eta_{l}\left(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2}\right)$.

Summary. Analogous to previous non-convex analyses, the term $\epsilon_{\text{opt}}$ in (22) yields a sub-linear convergence rate $\mathcal{O}(1/\sqrt{T})$ under a proper setting of $\eta$, and it is therefore omitted in the final bound. In contrast, we focus on the impact of the heterogeneity terms on the convergence result: converging to stationary points with large $\tilde{D}(\boldsymbol{x}^{t})$ mitigates the convergence error induced by the heterogeneity terms $\sqrt{\sigma^{2}+\Phi_{\text{Hetero}}}$.

Appendix D Discussion

D.1 Case illustration on gradient diversity maximization direction

Figure 5: Solution of (29)

To better illustrate this concept, we consider an FL system with two clients, where the gradient diversity is computed as follows:

$$
\tilde{D}(\boldsymbol{x})=\sqrt{\frac{\lambda_{1}\|\nabla f_{1}(\boldsymbol{x})\|^{2}+\lambda_{2}\|\nabla f_{2}(\boldsymbol{x})\|^{2}}{\|\tilde{\lambda}_{1}\nabla f_{1}(\boldsymbol{x})+\tilde{\lambda}_{2}\nabla f_{2}(\boldsymbol{x})\|^{2}}}.
\tag{29}
$$

We note that the server cannot access local data samples. In each round, it only observes the local updates uploaded by the clients. Hence, given a fixed point $\boldsymbol{x}$, maximizing gradient diversity at the server is equivalent to minimizing the denominator of (6). As a result, a global gradient direction that maximizes the gradient diversity leads to the solution² shown in Figure 5.

² We exclude the special solutions $\{\lambda_{1},\lambda_{2}\}=\{1,0\}$ if $\nabla f_{1}(\boldsymbol{x})^{\top}\nabla f_{2}(\boldsymbol{x})\geq\nabla f_{1}(\boldsymbol{x})^{\top}\nabla f_{1}(\boldsymbol{x})$, and $\{\lambda_{1},\lambda_{2}\}=\{0,1\}$ if $\nabla f_{1}(\boldsymbol{x})^{\top}\nabla f_{2}(\boldsymbol{x})\geq\nabla f_{2}(\boldsymbol{x})^{\top}\nabla f_{2}(\boldsymbol{x})$. They indicate a low degree of statistical heterogeneity.
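The two-client case admits a closed-form solution, which the following minimal Python sketch illustrates. It assumes the server minimizes the denominator of (29) over convex weights $(w, 1-w)$; the helper names `two_client_min_norm_weight` and `surrogate_diversity` are hypothetical and not taken from the paper's code. The clipping to $[0, 1]$ reproduces the special solutions excluded in the footnote.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def two_client_min_norm_weight(g1, g2):
    """Return w in [0, 1] that minimizes ||w * g1 + (1 - w) * g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: every convex weight gives the same norm.
        return 0.5
    # Unconstrained minimizer of the quadratic in w, then clipped to [0, 1].
    w = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    return min(1.0, max(0.0, w))


def surrogate_diversity(g1, g2, lam, lam_tilde):
    """Evaluate (29) for objective weights lam and aggregation weights lam_tilde."""
    numerator = lam[0] * dot(g1, g1) + lam[1] * dot(g2, g2)
    d = [lam_tilde[0] * a + lam_tilde[1] * b for a, b in zip(g1, g2)]
    denominator = dot(d, d)
    if denominator == 0.0:
        return math.inf
    return math.sqrt(numerator / denominator)


g1, g2 = (2.0, 0.0), (0.0, 1.0)
w = two_client_min_norm_weight(g1, g2)
uniform = surrogate_diversity(g1, g2, (0.5, 0.5), (0.5, 0.5))
maximized = surrogate_diversity(g1, g2, (0.5, 0.5), (w, 1.0 - w))
print(w, uniform, maximized)
```

In this toy example the min-norm weights shift mass toward the gradient with the smaller norm, and the resulting diversity is larger than under uniform aggregation weights.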

D.2 Connection with multi-objective optimization (MOO)

Figure 6: Comparing with FedMGDA+ on CIFAR10 task.

We note that (7) is related to a sub-problem from the MOO literature [38, 11, 60]. However, Algorithm 1 is a federated optimization algorithm that differs from MOO in several respects. Specifically, FedAWARE violates the basic Pareto partial order restriction in MOO. Besides, FedAWARE meets the basic protocols of FL, including partial client participation and decoupled server/client optimization procedures, whereas MOO requires full participation and local first-order gradients in the FL context. In the FL literature, FedMGDA+ [11] also involves a variant of (7) in federated optimization, but it focuses on FL fairness rather than heterogeneity. We found that it is inefficient against hybrid heterogeneity issues, as shown in Figure 6.

D.3 Limitations and feasible solutions

The main limitation is that running Algorithm 1 requires the server to store the local momentum. In practical settings, especially cross-silo settings, the FL server typically has sufficient storage and computing resources. Besides, we argue that this memory consumption is worthwhile, since FedAWARE can be used to enhance other algorithms. Moreover, the hardware requirements can be alleviated by the following engineering options:

Computing (7) with the last few layers of a neural network. This is because numerous studies [49, 19, 4] have shown that the last layers contain crucial network information. Hence, we can trade off the memory storage and quality of the surrogate.

Adopting client clustering techniques [37, 21, 30]. We can cluster clients according to their similarity and save only the momentum of the cluster-averaged gradients. The server's storage cost is then proportional to the number of client clusters.
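To give a rough sense of the trade-off, the following back-of-the-envelope sketch compares the server-side storage of the three options. All sizes are illustrative assumptions: the full model size is the nominal 70M parameters suggested by the name Pythia-70M, while the last-layer size and the number of clusters are hypothetical values chosen only for the arithmetic.

```python
def momentum_storage_bytes(num_vectors: int, params_per_vector: int,
                           bytes_per_param: int = 4) -> int:
    """Bytes needed to keep one float32 momentum vector per stored entity."""
    return num_vectors * params_per_vector * bytes_per_param


NUM_CLIENTS = 100               # number of clients, as in the AGNews setup
FULL_MODEL_PARAMS = 70_000_000  # nominal size of a 70M-parameter model (assumption)
LAST_LAYER_PARAMS = 1_000_000   # hypothetical size of the last few layers
NUM_CLUSTERS = 10               # hypothetical number of client clusters

options = {
    "full momentum per client": momentum_storage_bytes(NUM_CLIENTS, FULL_MODEL_PARAMS),
    "last layers per client": momentum_storage_bytes(NUM_CLIENTS, LAST_LAYER_PARAMS),
    "full momentum per cluster": momentum_storage_bytes(NUM_CLUSTERS, FULL_MODEL_PARAMS),
}

for name, size in options.items():
    print(f"{name}: {size / 1e9:.2f} GB")
```

Under these assumed sizes, both options reduce the storage by one to two orders of magnitude relative to keeping a full momentum vector for every client.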

D.4 What is the gap between gradient diversity and surrogate gradient diversity?

The gap changes dynamically in practice. We explain it from empirical and theoretical perspectives. Empirically, please refer to the curves of FedAWARE and FedAvg in Figure 3. Theoretically, we provide an explicit bound on the gap (related to Lemma C.2). If we assume $\lambda$ is the optimal weight ($\|\nabla f(\boldsymbol{x})\|\leq\|\nabla\tilde{f}(\boldsymbol{x})\|$), we have

$$
\begin{aligned}
\left(D(\boldsymbol{x})-\tilde{D}(\boldsymbol{x})\right)^{2}
&=D^{2}(\boldsymbol{x})+\tilde{D}^{2}(\boldsymbol{x})-2D(\boldsymbol{x})\tilde{D}(\boldsymbol{x}) \\
&=\frac{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}+\frac{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla\tilde{f}(\boldsymbol{x})\|^{2}}-2\frac{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|\,\|\nabla\tilde{f}(\boldsymbol{x})\|} \\
&\leq\tilde{G}\left(\frac{1}{\|\nabla f(\boldsymbol{x})\|^{2}}+\frac{1}{\|\nabla\tilde{f}(\boldsymbol{x})\|^{2}}-2\frac{1}{\|\nabla f(\boldsymbol{x})\|\,\|\nabla\tilde{f}(\boldsymbol{x})\|}\right) \\
&=\tilde{G}\left(\frac{1}{\|\nabla f(\boldsymbol{x})\|}-\frac{1}{\|\nabla\tilde{f}(\boldsymbol{x})\|}\right)^{2} \\
&=\tilde{G}\left(\frac{\|\nabla\tilde{f}(\boldsymbol{x})\|-\|\nabla f(\boldsymbol{x})\|}{\|\nabla f(\boldsymbol{x})\|\,\|\nabla\tilde{f}(\boldsymbol{x})\|}\right)^{2} \\
&\leq\frac{\tilde{G}}{\|\nabla f(\boldsymbol{x})\|^{2}}\left(\|\nabla\tilde{f}(\boldsymbol{x})\|-\|\nabla f(\boldsymbol{x})\|\right)^{2} \\
&\leq\frac{\tilde{G}}{\|\nabla f(\boldsymbol{x})\|^{2}}\underbrace{\|\nabla\tilde{f}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\|^{2}}_{(23)},
\end{aligned}
$$

where $\tilde{G}=\max_{t\in[T]}\sqrt{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x}^{t})\|^{2}}$. This inequality shows that the gap $D(\boldsymbol{x})-\tilde{D}(\boldsymbol{x})$ is primarily determined by the global objectives with the different weights $\lambda$ and $\tilde{\lambda}$.

D.5 The implications of zero vector in FedAWARE

As FedAWARE seeks the minimum of $\|\boldsymbol{d}^{t}\|$ during training, a case with $\|\boldsymbol{d}^{t}\|=0$ indicates that all the momentum vectors are linearly dependent [8]. This case means that we have found a saddle point or a local optimum (possibly the global optimum) of the surrogate objective. Following stochastic optimization theory, non-convex optimizers [18, 7] typically use stochasticity and Nesterov momentum to let the model escape from saddle points. Analogously, FedAWARE uses random client sampling and local momentum to guarantee the convergence quality. Moreover, we note that client-side optimization typically uses non-convex optimizers directly. Therefore, it can collaboratively help the server-side optimization escape from saddle points.
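The zero-vector case can be reproduced with a small min-norm solver. The sketch below assumes that $\boldsymbol{d}^{t}$ is the minimum-norm point in the convex hull of the stored momentum vectors and solves this problem with a generic Frank-Wolfe iteration with exact line search; `min_norm_weights` is a hypothetical helper and not necessarily the solver used in Algorithm 1.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_weights(vectors, num_iters=100):
    """Frank-Wolfe sketch for min ||sum_i w_i v_i||^2 over the probability simplex."""
    n, dim = len(vectors), len(vectors[0])
    weights = [1.0 / n] * n
    d = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    for _ in range(num_iters):
        # Frank-Wolfe vertex: the vector with the smallest inner product with d.
        j = min(range(n), key=lambda i: dot(vectors[i], d))
        diff = [a - b for a, b in zip(d, vectors[j])]
        denom = dot(diff, diff)
        if denom == 0.0:
            break
        # Exact line search on the segment between d and vectors[j].
        gamma = min(1.0, max(0.0, dot(d, diff) / denom))
        if gamma == 0.0:
            break  # optimality condition holds: no vertex decreases the norm
        weights = [(1.0 - gamma) * w for w in weights]
        weights[j] += gamma
        d = [(1.0 - gamma) * a + gamma * b for a, b in zip(d, vectors[j])]
    return weights, d


# Two opposite (hence linearly dependent) momentum vectors give a zero direction.
_, d_zero = min_norm_weights([(1.0, 0.0), (-1.0, 0.0)])
print(math.sqrt(dot(d_zero, d_zero)))

# Two orthogonal vectors of different norms give a non-zero direction.
w_opt, d_opt = min_norm_weights([(2.0, 0.0), (0.0, 1.0)])
print(w_opt, d_opt)
```

When the returned direction has zero norm, the server update vanishes, which is the situation discussed above where sampling randomness and local momentum are relied upon to move the model away from such points.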

D.6 Feasibility of combining with faster local solvers

Enhancing the convergence rate can be achieved through a combination of improved local computing methods, as our server-side optimizer operates independently of the local update modification process. For instance, FedNova [42] has explored the benefits of employing faster local solvers, such as those leveraging proximal gradients, gradient tracking, adaptive learning rates, momentum, and more, within the context of FedAvg. While our work primarily addresses server-side optimization in the presence of unbalanced local updates from local SGD, determining the optimal combination of local solvers remains a direction for future research.

D.7 Discussion on novelty

This work is the first to propose gradient diversity maximization on server-side optimization against heterogeneity issues in FL. We provide additional discussion in this section to clarify our novelty.

Novelty in tackling heterogeneity. FedAWARE addresses hybrid heterogeneity by adjusting the denominator factor $\tilde{D}(\boldsymbol{x}^{t})$ in Theorem 4.1. In contrast, previous works reduce the numerator terms in Theorem 4.1, which capture the impacts of hybrid heterogeneity. As discussed in Section 3.2, previous works minimize the upper bound of (5) to improve optimization quality. These works typically reduce the impacts of hybrid heterogeneity by manipulating the local computation process. FedAWARE instead captures the gradient diversity trend and uses it to design global updates without manipulating local updates. This approach provides a promising perspective on addressing heterogeneity issues in FL.

Novelty in convergence analyses. The convergence analysis of FedAWARE introduces novel elements compared to previous federated optimization algorithms. While previous algorithms, such as FedAvg [33], FedProx [26], and FedYogi [35], assume a static global objective with uniform weights, FedAWARE analyzes the convergence of a surrogate objective with weights $\tilde{\lambda}$ based on stored momentum information. We use the term $\chi$ to highlight the difference from traditional federated optimization objectives. Furthermore, since FedAWARE does not manipulate the local computation process, we focus on the unbalanced local update steps, which absorb factors such as local training batch size, epochs, and local datasets. This approach allows us to address the concerns raised by FedNova [42] regarding objective inconsistency.

Suggesting training dynamics of gradient diversity. In our analysis, we first demonstrate the importance of gradient diversity dynamics in the convergence analysis. Moreover, it is important to note that gradient diversity is not the sole factor for mitigating the impacts of heterogeneity. According to the convergence result in Theorem 4.1, the upper bound is determined by multiple factors, where gradient diversity is one feasible factor to minimize the upper bound. Furthermore, we have established a connection between gradient diversity and common assumptions on statistical heterogeneity in Section 3.1. It is worth mentioning that previous works [26, 16, 42] take the upper bound of gradient diversity as a negative factor in the convergence analysis³. Therefore, while they address heterogeneity issues differently, they ignore the evolution of gradient diversity.

³ Please refer to the notation $B$ of Theorem 4 in FedProx, the notation $B$ of Theorem 1 in SCAFFOLD, and the notation $\kappa$ of Theorem 1 in FedNova for evidence.

Future works. In this paper, we only apply the gradient diversity maximization findings to server-side optimization. Further collaboration of server-side and client-side optimization for gradient diversity maximization is left to future works. Moreover, we found that FedAWARE achieves better generalization performance. We plan to explain the theoretical relation between the generalization performance and the gradient diversity dynamics in the generalization theories.

Figure 7: Experiments on MNIST task: Train loss & test accuracy & gradient diversity of federated optimization algorithms.
Figure 8: Experiments on FashionMNIST task: Train loss & test accuracy & gradient diversity of federated optimization algorithms.
Figure 9: Plug-in experiments on CIFAR10-Dirichlet+ setting. We observe similar phenomena that match our conclusion in the main paper about the AGNews task. Gradient diversity maximization can generally enhance the convergence speed and generalization performance of the applied FL algorithms. Notably, FedAWARE gives all algorithms a smooth and stable improvement in generalization performance during training (smooth test accuracy curves), highlighting that our method makes the applied optimization algorithms more robust to heterogeneity impacts.

Appendix E Additional Experiments

E.1 Experiment Details

Platform. The experiment implementations are supported by FedLab [54]. Our experiments run on a Linux server with four 2080Ti GPUs.

Datasets and Model. Our experiments mainly follow prior works [45, 1, 35, 56], including the image classification datasets MNIST, Fashion-MNIST, and CIFAR10, and the text classification dataset AGNews [57]. For the MNIST and Fashion-MNIST tasks, we use fully connected neural network architectures with 2 hidden layers. For the CIFAR10 task, we use a CNN model, similar to [33], consisting of 3 convolutional layers with 64 $3\times 3$ filters followed by 2 fully connected layers with 1024 and 64 neurons, and a softmax layer. For the AGNews task, we choose Pythia-70M from the GPT2-based large language model family Pythia [3].

Hyperparameters. For the server, we set the client participation rate to $10\%$ and use $\eta=1$ for FedAvg, FedAvgM, FedProx, SCAFFOLD, FedNova, FedDyn, and FedAWARE. For the momentum parameter of FedAvgM, we select it from $\{0.7, 0.9, 0.97, 0.997\}$ following the original paper. For the weight of the penalty term in FedProx, we tune it over the grid $\{0.01, 0.1, 1, 10\}$. For FedYogi, we set the momentum parameter $\beta_{1}=0.9$, the second-moment parameter $\beta_{2}=0.99$, and the adaptivity $\tau=10^{-4}$ following the original paper. Besides, we select $\eta$ for FedYogi by grid search over $\{10^{-3}, 10^{-2.5}, 10^{-2}, \dots, 10^{1}\}$. The parameter of FedDyn is chosen among $\{0.1, 0.01, 0.001\}$ from the original paper. For FedAMS, we set $\beta_{1}=0.9$ and $\beta_{2}=0.99$ following the original paper. Then, we grid search for the best global learning rate $\eta\in\{1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}\}$ and the best stabilization term $\epsilon\in\{10^{-8}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$. For FedAWARE, we set $\alpha=0.5$. For local training, we set the learning rate $\eta_{l}=0.01$, batch size $64$, and local epoch $3$ for all clients in the pathological and Dirichlet settings. We report the best performance of these algorithms.
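For reproducibility, the search space above can be written down as a plain configuration. The sketch below is only one possible encoding: the key names are illustrative and do not correspond to FedLab options, the FedDyn parameter is left unnamed as in the text, and attaching the last $\eta$ and $\epsilon$ grids to FedAMS is our reading of the paragraph.

```python
from itertools import product

# Algorithms that use a fixed server learning rate eta = 1.
FIXED_SERVER_LR = ["FedAvg", "FedAvgM", "FedProx", "SCAFFOLD", "FedNova", "FedDyn", "FedAWARE"]

HPARAMS = {
    "client_participation_rate": 0.10,
    "server_eta": {name: 1.0 for name in FIXED_SERVER_LR},
    "grids": {
        "FedAvgM": {"momentum": [0.7, 0.9, 0.97, 0.997]},
        "FedProx": {"penalty_weight": [0.01, 0.1, 1, 10]},
        "FedYogi": {
            "beta1": [0.9],
            "beta2": [0.99],
            "tau": [1e-4],
            # 10^-3, 10^-2.5, ..., 10^1
            "eta": [10 ** (k / 2) for k in range(-6, 3)],
        },
        "FedDyn": {"parameter": [0.1, 0.01, 0.001]},
        # Assumption: the eta and epsilon grids belong to FedAMS.
        "FedAMS": {
            "beta1": [0.9],
            "beta2": [0.99],
            "eta": [1, 1e-1, 1e-2, 1e-3, 1e-4],
            "epsilon": [1e-8, 1e-4, 1e-3, 1e-2, 1e-1],
        },
        "FedAWARE": {"alpha": [0.5]},
    },
    "local": {"lr": 0.01, "batch_size": 64, "epochs": 3},
}


def expand_grid(grid):
    """List every combination of values in a {name: [values]} grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


for algo, grid in HPARAMS["grids"].items():
    print(algo, len(expand_grid(grid)), "configurations")
```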

Implementation of gradient diversity dynamics. To fairly compare the gradient diversity dynamics of all algorithms, we estimate the gradient diversity as follows:

$$
\hat{D}(\boldsymbol{x}^{t}):=\sqrt{\frac{\sum_{i\in S^{t}}\|\boldsymbol{g}_{i}^{t}\|^{2}}{\|\sum_{i\in S^{t}}\boldsymbol{g}_{i}^{t}\|^{2}}}=\sqrt{\frac{\sum_{i\in S^{t}}\|\boldsymbol{g}_{i}^{t}\|^{2}}{\sum_{i\in S^{t}}\|\boldsymbol{g}_{i}^{t}\|^{2}+\sum_{i\neq j}\langle\boldsymbol{g}_{i}^{t},\boldsymbol{g}_{j}^{t}\rangle}},
\tag{30}
$$

where $S^{t}$ is the selected client set at the $t$-th round and $\boldsymbol{g}_{i}^{t}$ is the local update uploaded by selected client $i$. We report the estimated $\hat{D}(\boldsymbol{x}^{t})$ values in our experiment figures. This estimate is applicable in practice because it matches the client sampling protocol in federated learning and avoids accessing local first-order gradients (a privacy issue [61]). Besides, we argue that $\hat{D}(\boldsymbol{x}^{t})$ shares a similar ability to capture the directional relation among local updates. For example, it is large when the inner products between the local updates of different clients are small. In particular, the estimated gradient diversity is large when the gradients are almost orthogonal, or even point in opposite directions. Most importantly, as the gradient diversity dynamics of all algorithms are computed in the same way, their values faithfully indicate the stationary points in their landscape.
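The estimator in (30) only needs the uploaded updates of the sampled clients. A minimal Python sketch follows; it treats each update as a flat list of floats, and the function name `estimated_gradient_diversity` is hypothetical rather than part of the released code.

```python
import math


def estimated_gradient_diversity(updates):
    """Compute (30): sqrt(sum_i ||g_i||^2 / ||sum_i g_i||^2) over sampled clients."""
    if not updates:
        raise ValueError("at least one client update is required")
    sum_of_sq_norms = sum(sum(x * x for x in g) for g in updates)
    summed = [sum(column) for column in zip(*updates)]
    sq_norm_of_sum = sum(x * x for x in summed)
    if sq_norm_of_sum == 0.0:
        # Updates cancel out exactly: the ratio is unbounded.
        return math.inf
    return math.sqrt(sum_of_sq_norms / sq_norm_of_sum)


aligned = estimated_gradient_diversity([(1.0, 0.0), (1.0, 0.0)])
orthogonal = estimated_gradient_diversity([(1.0, 0.0), (0.0, 1.0)])
nearly_opposite = estimated_gradient_diversity([(1.0, 0.0), (-0.9, 0.1)])
print(aligned, orthogonal, nearly_opposite)
```

The three toy inputs illustrate the behavior described above: the value is smallest for aligned updates, larger for orthogonal updates, and largest for updates pointing in nearly opposite directions.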

E.2 Experiments on MNIST/FashionMNIST

We provide additional experiments on MNIST [6] and FashionMNIST [47] datasets using the same data partition settings as CIFAR10 experiments. The results of the MNIST task are shown in Figure 7; the results of the FashionMNIST task are shown in Figure 8.

Discussion on MNIST task. According to the training loss, FedAWARE is slightly slower than FedYogi and FedAMS in the MNIST task. Besides, FedAWARE converges to comparable stationary points with similar training loss and test accuracy at the end of training. Most importantly, the training dynamics of gradient diversity curves show that FedAWARE finds a series of stationary points with large gradient diversity.

Discussion on FashionMNIST task. The main conclusions of the FashionMNIST task are similar to those of the MNIST task. The main difference is that FedAWARE achieves a convergence speed comparable to FedYogi and FedAMS.

Overall, our conclusions in the main paper still hold on the federated MNIST/FashionMNIST tasks.

E.3 Experiments on AGNews

We choose the GPT2-based pretrained large language model Pythia-70M [3] for the AGNews task [57]. AGNews is a text classification task with 4 labels, consisting of 119,999 training samples and 7,599 test samples. We partition the AGNews dataset across 100 clients using the Dirichlet partition strategy with a hyperparameter of 0.1. For the training hyperparameters, we set the local batch size to 32, the local epoch to 1, the learning rate to $10^{-4}$, and $\alpha=0.99$ for FedAWARE. We conduct the gradient projection experiment, and the results are shown in Figure 4.
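For readers unfamiliar with the Dirichlet partition strategy, the sketch below shows a common label-based variant using only the Python standard library: for every class, client proportions are drawn from a symmetric Dirichlet distribution and the samples of that class are split accordingly. This is a generic illustration under that assumption; `dirichlet_label_partition` is a hypothetical helper and not the FedLab partitioner used in our experiments.

```python
import random
from collections import defaultdict


def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha) proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    client_indices = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of normalized Gamma(alpha, 1) draws.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        if total > 0.0:
            proportions = [x / total for x in draws]
        else:
            proportions = [1.0 / num_clients] * num_clients
        # Turn cumulative proportions into contiguous slices of this class.
        start, cumulative = 0, 0.0
        for client in range(num_clients):
            cumulative += proportions[client]
            if client == num_clients - 1:
                end = len(indices)
            else:
                end = min(len(indices), int(round(cumulative * len(indices))))
            end = max(end, start)
            client_indices[client].extend(indices[start:end])
            start = end
    return client_indices


# Toy usage: 400 samples with 4 labels, 10 clients, concentration 0.1.
toy_labels = [i % 4 for i in range(400)]
parts = dirichlet_label_partition(toy_labels, num_clients=10, alpha=0.1)
print([len(p) for p in parts])
```

A small concentration value such as 0.1 typically gives each client samples from only a few classes, which is what makes the resulting setting statistically heterogeneous.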