FedAWARE: Maximizing Gradient Diversity
for Heterogeneous Federated Server-side Optimization

Dun Zeng¹, Zenglin Xu²*, Yu Pan³, Qifan Wang⁴, Xiaoying Tang⁵
¹UESTC ²Fudan University ³Harbin Institute of Technology, Shenzhen
⁴Meta AI ⁵CUHK, Shenzhen
*Corresponding author: [email protected]

Abstract

Federated learning (FL) is a distributed learning framework where numerous clients collaborate with a central server to train a model without sharing local data. However, standard federated optimization in real-world applications faces both statistical and system heterogeneity challenges, which result in unfavorable convergence behavior. Previous works attempted to modify the local training process (client-side) to tackle heterogeneity challenges. However, they overlooked that server-side updates can coordinate the diverse local updates efficiently. This work explores the effect of server-side updates against heterogeneity issues. We first introduce our findings on the gradient diversity maximization direction, which suggest that moving the global model continuously in this direction yields fast and stable convergence. Then, we derive a novel server-side optimizer, FedAWARE, with rigorous convergence analysis for general non-convex settings. Our extensive experiments across multiple heterogeneous federated settings using four datasets show that FedAWARE achieves competitive convergence performance in comparison to state-of-the-art adaptive federated optimizers. Furthermore, our results show that FedAWARE can enhance the performance of FL algorithms as a plug-in module. Our source code is available at https://github.com/dunzeng/FedAWARE.

1 Introduction

We study a standard cross-device federated learning (FL) task [15], which minimizes a finite sum of local empirical objectives:

\min_{\boldsymbol{x}\in\mathcal{X}}f(\boldsymbol{x}):=\sum_{i=1}^{N}\lambda_{i}f_{i}(\boldsymbol{x}):=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}_{\xi_{i}\sim\mathcal{D}_{i}}[F_{i}(\boldsymbol{x},\xi_{i})],

(1)

where $\boldsymbol{x}\in\mathcal{X}\subseteq\mathbb{R}^{d}$ denotes the parameters of the machine learning model, $f(\boldsymbol{x})$ is the global objective weighted by $\lambda_{i}$ (s.t. $\sum_{i=1}^{N}\lambda_{i}=1$, $\lambda_{i}\geq 0,\forall i$), $\xi_{i}$ is stochastic batch data, and $\mathcal{D}_{i}$ denotes the dataset on the $i$-th client ($i\in\{1,2,\ldots,N\}$). The federated optimization framework that minimizes the global objective involves client optimization and server optimization procedures [35], as shown in Figure 1 and Algorithm 2. Take FedAvg [33] as an example:

\text{Client: }\boldsymbol{g}_{i}^{t}=\boldsymbol{x}_{i}^{t,0}-\boldsymbol{x}_{i}^{t,K_{i}}=\eta_{l}\sum_{k=0}^{K_{i}-1}\nabla F_{i}(\boldsymbol{x}_{i}^{t,k});\quad\quad\text{Server: }\boldsymbol{x}^{t+1}=\boldsymbol{x}^{t}-\eta\sum_{i=1}^{N}\lambda_{i}\boldsymbol{g}_{i}^{t}

(2)

where $\nabla F_{i}(\boldsymbol{x})$ denotes the stochastic gradient over a mini-batch of samples, $\boldsymbol{x}_{i}^{t,k}$ denotes client $i$’s model after $k$ local update steps in the $t$-th communication round, $\eta_{l}$ is the client learning rate, and $\eta$ is the server learning rate. The federated optimization framework is powerful and covers a broad range of FL algorithms [33, 35, 45, 41]. However, it suffers from two major optimization challenges [25, 26].
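The client and server steps in (2) can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes scalar quadratic local objectives $f_{i}(x)=\frac{1}{2}(x-c_{i})^{2}$ with exact gradients, and the helper names `local_update`, `fedavg_round`, and `run` are hypothetical.

```python
# Toy FedAvg following (2); assumes f_i(x) = 0.5 * (x - c_i)^2 (illustrative only).

def local_update(x, center, lr_local, num_steps):
    """Run K_i local gradient steps and return g_i = x^{t,0} - x^{t,K_i}."""
    x_local = x
    for _ in range(num_steps):
        x_local -= lr_local * (x_local - center)  # exact gradient of the toy objective
    return x - x_local


def fedavg_round(x, centers, weights, lr_local, lr_server, steps):
    """Server step: x^{t+1} = x^t - eta * sum_i lambda_i * g_i^t."""
    updates = [local_update(x, c, lr_local, k) for c, k in zip(centers, steps)]
    return x - lr_server * sum(w * g for w, g in zip(weights, updates))


def run(steps, rounds=200):
    x = 0.0
    for _ in range(rounds):
        x = fedavg_round(x, [1.0, 3.0], [0.5, 0.5], 0.1, 1.0, steps)
    return x


x_balanced = run(steps=[5, 5])     # same K_i on both clients
x_unbalanced = run(steps=[1, 10])  # second client runs more local steps
```

In this toy setting, equal local steps let the iterate settle at the minimizer of the weighted objective ($x=2$), whereas unequal steps pull it toward the client that runs more steps, which is the effect of system heterogeneity discussed next.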

Figure 1: Illustration of the federated optimization framework and its challenges. System heterogeneity (left) induces unbalanced local updates and enlarges the impact of statistical heterogeneity (right). FedAWARE can tackle these challenges on the server side.

Statistical heterogeneity is a crucial challenge that causes local updates to point in divergent directions. This problem mainly arises from data being non-identically distributed (Non-IID) among devices. System heterogeneity is another critical challenge, which causes unbalanced local update steps across clients [26, 42] due to the varying hardware specifications, operating systems, and software configurations in cross-device training [13]. Hence, faster clients may perform more local update steps than slower clients within a fixed wall-clock time interval. The unbalanced local updates further enlarge the variance of local updates. These two challenges result in unfavorable convergence behavior of FL, inducing communication inefficiency and model performance degradation. For a more in-depth discussion of FL and its challenges, we refer the reader to [15, 41].

To address statistical heterogeneity, previous federated optimization algorithms manipulate local updates, for example by introducing control variates [16], momentum terms [10, 35], dynamic regularization [1], or local penalty terms [26] into client-side optimization. On the other hand, some works [16, 33, 35] assume that the number of local updates is constant and balanced across all clients, i.e., $K_{i}=K_{j},\forall i,j\in[N]$, thereby ignoring the impact of system heterogeneity. To alleviate the impact of unbalanced local update steps, existing algorithms [26, 42, 1] primarily modify client-side optimization or use advanced local solvers. For example, FedProx [26] uses local penalty terms to prevent large local updates $\boldsymbol{g}_{i}$, and FedNova [42] normalizes the local updates $\boldsymbol{g}_{i}$ by the number of local update steps $K_{i}$ to keep the optimization objective consistent. However, the potential of server-side optimization is not fully explored, as most algorithms only update the global model via vanilla averaged gradient descent.

In this paper, we argue that the server can be powerful enough to handle the heterogeneity challenges without using advanced local solvers. To this end, we explore the impacts of such hybrid heterogeneity (i.e., statistical heterogeneity with unbalanced local updates), particularly on the server-side optimization in federated optimization framework (2). Our results answer a critical question:

How can server-side aggregation of local updates mitigate the impacts of heterogeneity?

Contributions. The main contributions of this work can be summarised as follows:

• We first propose the gradient diversity maximization direction, that is, the aggregation result with the minimal norm of linearly combined local updates. Our findings suggest that moving the global model continuously in the gradient diversity maximization direction yields fast and stable convergence. Besides, we highlight gradient diversity as a new training dynamic for FL and find that stationary points with higher gradient diversity tend to achieve better test accuracy.

• We introduce a novel federated server-side optimization method called Federated Adaptive Weighted AggREgator (FedAWARE), which adapts the direction of global model updates to enlarge gradient diversity. We provide rigorous convergence guarantees for the gradient diversity maximization theory and for the effectiveness of FedAWARE in tackling heterogeneity issues.

• We show that FedAWARE is orthogonal to existing algorithms and can serve as a plug-in module for enhancing state-of-the-art FL algorithms. In detail, we modify the gradient descent direction of these algorithms to maximize their gradient diversity dynamics and find that their accuracy improves.

• We assess our findings under various degrees of hybrid heterogeneity across four datasets. Extensive empirical evidence demonstrates that FedAWARE consistently outperforms baseline optimization algorithms. The results support our findings about gradient diversity and emphasize the importance of adaptivity in heterogeneous federated optimization.

2 Preliminaries

Related works. This paper is related to a broad range of federated optimization algorithms that address heterogeneity issues, as discussed in Appendix A. Representative adaptive optimization algorithms [35, 52, 1] are the most relevant to our work because they can be viewed as server-side optimization methods that incorporate adaptive momentum to achieve fast convergence under heterogeneity. We show in Section 4 that our method optimizes different terms, and in Section 5 that it can further enhance their performance.

Settings. In this paper, we allow the numbers of local steps $K_{i},i\in[N]$, where $[N]=\{1,\dots,N\}$, to differ across clients, which models system heterogeneity. Typically, the number of local update steps depends on factors such as the chosen batch size, the number of training epochs, and the number of data samples, all of which can vary across clients. Different devices in cross-device federated learning may have varying computational abilities and data quantities. Therefore, more powerful devices may perform local training with larger batch sizes and more epochs [26]. Besides, clients typically run multiple steps of local SGD and only upload their updates, and the FL server only accesses the uploaded information. We focus on the non-convex optimization problem for (1), which typically covers the training of neural networks. Therefore, our analyses rely on common non-convex optimization assumptions [35, 45, 1] on the local objectives $f_{i}(\boldsymbol{x}),i\in[N]$, and we use $\|\cdot\|$ to denote the $\ell_{2}$ norm by default:

Assumption 2.1 (Smoothness)

Each objective $f_{i}(\boldsymbol{x})$ for all $i\in[N]$ is $L$-smooth, i.e., for all $\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{d}$, it holds that $\|\nabla f_{i}(\boldsymbol{x})-\nabla f_{i}(\boldsymbol{y})\|\leq L\|\boldsymbol{x}-\boldsymbol{y}\|$.

Assumption 2.2 (Unbiasedness and Bounded Local Variance)

For each $i\in[N]$ and $\boldsymbol{x}\in\mathbb{R}^{d}$, we assume access to an unbiased stochastic gradient $\nabla F_{i}(\boldsymbol{x},\xi_{i})$ of the client’s true gradient $\nabla f_{i}(\boldsymbol{x})$, i.e., $\mathbb{E}_{\xi_{i}\sim\mathcal{D}_{i}}\left[\nabla F_{i}(\boldsymbol{x},\xi_{i})\right]=\nabla f_{i}(\boldsymbol{x})$. The function $f_{i}$ has $\sigma_{l}$-bounded (local) variance, i.e., $\mathbb{E}_{\xi_{i}\sim\mathcal{D}_{i}}\left[\left\|\nabla F_{i}(\boldsymbol{x},\xi_{i})-\nabla f_{i}(\boldsymbol{x})\right\|^{2}\right]\leq\sigma_{l}^{2}$.

3 Analysis of Statistical Heterogeneity and System Heterogeneity Impacts

This section provides our theoretical insights into the general impacts of hybrid heterogeneity on federated optimization. In Section 3.1, we discuss the theoretical measurement of statistical heterogeneity and highlight the power of gradient diversity. In Section 3.2, we show a loose upper bound of averaged local updates and discuss how previous works minimize it. In Section 3.3, we propose the gradient diversity maximization direction, which is the aggregation result with the minimal norm of linearly combined local updates. Proofs can be found in Appendix B.

3.1 Gradient Diversity is a Fine-Grained Measurement of Statistical Heterogeneity Impacts

We propose using gradient diversity to measure the degree of statistical heterogeneity. The definition of gradient diversity is provided below:

Definition 3.1 (Gradient diversity [26, 51, 9])

The gradient diversity is defined as:

D(\boldsymbol{x}):=\sqrt{\sum_{i=1}^{N}\lambda_{i}\frac{\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}\geq 1.

(3)

Similar definitions have been proposed for differing purposes [26, 51, 9]. Gradient diversity quantifies the degree to which individual gradients diverge from each other. Specifically, $D(\boldsymbol{x})=1$ corresponds to the independent and identically distributed (IID) setting, in which all the local functions descend in the same direction. In particular, gradient diversity is large when the gradients are almost orthogonal or point in opposite directions. Therefore, a larger value of $D(\boldsymbol{x})$ indicates more statistical heterogeneity among the local functions. In federated optimization, FedProx [26] and subsequent works [50, 43, 5] assume an upper bound on gradient diversity in their convergence rates to represent the worst-case statistical heterogeneity.
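Definition 3.1 can be evaluated directly once the local gradients at a point are known. The snippet below is a small illustrative sketch; the function names and the two-client toy gradients are assumptions chosen to show the IID-like case ($D=1$) and the orthogonal case.

```python
import math


def sq_norm(v):
    """Squared l2 norm of a vector given as a list of floats."""
    return sum(a * a for a in v)


def weighted_sum(weights, grads):
    """Global gradient sum_i lambda_i * grad_i."""
    dim = len(grads[0])
    return [sum(w * g[j] for w, g in zip(weights, grads)) for j in range(dim)]


def gradient_diversity(weights, grads):
    """D(x) from Definition 3.1; undefined if the global gradient is exactly zero."""
    global_grad = weighted_sum(weights, grads)
    numerator = sum(w * sq_norm(g) for w, g in zip(weights, grads))
    return math.sqrt(numerator / sq_norm(global_grad))


# Identical local gradients: no disagreement between clients.
d_identical = gradient_diversity([0.5, 0.5], [[1.0, 0.0], [1.0, 0.0]])
# Orthogonal local gradients: clients disagree, so diversity grows.
d_orthogonal = gradient_diversity([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```

Here `d_identical` equals 1 and `d_orthogonal` equals $\sqrt{2}$, matching the intuition that near-orthogonal gradients yield larger diversity.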

Analogous to gradient diversity, recent works commonly make assumptions that measure statistical heterogeneity through first-order gradients. For example, Assumption 3.1 is used by recent works [12, 46, 23], while Assumption 3.2 is used by works [50, 43, 5]. It is important to note that these assumptions are related to each other, as shown in Lemma 3.1.

Assumption 3.1 (Bounded global variance)

We assume the averaged global variance is bounded, i.e., $\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}\leq\sigma_{g}^{2}$ for all $\boldsymbol{x}\in\mathcal{X}$.

Assumption 3.2 (Bounded gradient dissimilarity)

There exist constants $B\geq 1,G\geq 0$ such that $\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})\right\|^{2}\leq B^{2}\|\nabla f(\boldsymbol{x})\|^{2}+G^{2}$ for all $\boldsymbol{x}\in\mathcal{X}$.

Lemma 3.1 (Heterogeneity assumption equivalence)

Let Assumption 3.1 hold. Then we have

\displaystyle\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}\leq\sigma_{g}^{2}\;\Leftrightarrow\;\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})\right\|^{2}\leq\|\nabla f(\boldsymbol{x})\|^{2}+\sigma_{g}^{2},

as the special case of Assumption 3.2 with $B=1,G=\sigma_{g}$ .
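The equivalence follows from expanding the square and using $\nabla f(\boldsymbol{x})=\sum_{i}\lambda_{i}\nabla f_{i}(\boldsymbol{x})$ together with $\sum_{i}\lambda_{i}=1$. A short derivation, written out here for convenience:

```latex
\begin{aligned}
\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}
&=\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}
 -2\Big\langle\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x}),\nabla f(\boldsymbol{x})\Big\rangle
 +\sum_{i=1}^{N}\lambda_{i}\|\nabla f(\boldsymbol{x})\|^{2}\\
&=\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}
 -2\|\nabla f(\boldsymbol{x})\|^{2}+\|\nabla f(\boldsymbol{x})\|^{2}\\
&=\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}-\|\nabla f(\boldsymbol{x})\|^{2}.
\end{aligned}
```

Bounding the left-hand side by $\sigma_{g}^{2}$ is therefore the same as bounding $\sum_{i}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}$ by $\|\nabla f(\boldsymbol{x})\|^{2}+\sigma_{g}^{2}$. Dividing by $\|\nabla f(\boldsymbol{x})\|^{2}$ (when it is non-zero) gives $D(\boldsymbol{x})^{2}\leq 1+\sigma_{g}^{2}/\|\nabla f(\boldsymbol{x})\|^{2}$.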

Furthermore, Assumptions 3.1 and 3.2 define upper bounds of gradient diversity, as discussed below:

Corollary 3.1 (Bounded gradient diversity)

Let Assumption 3.1 hold. Then $D(\boldsymbol{x})\leq\sqrt{1+\frac{\sigma_{g}^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}$. This is also connected to Assumption 3.2 with $G=0$, in which case $B$ is the upper bound of gradient diversity.

Theoretical observation: the evolution of gradient diversity is ignored. In FL training, previous works only consider the upper bound of gradient diversity, implicitly through Assumptions 3.1 and 3.2. Moreover, these works typically suggest that a larger upper bound of gradient diversity leads to a worse convergence rate. However, $D(\boldsymbol{x})$ itself, which measures the diversity information in local updates during training, is not considered. Therefore, it is important to understand how the evolution of $D(\boldsymbol{x})$ affects a federated optimization process. Following previous works, we use Assumption 3.1 as the implicit upper bound of $D(\boldsymbol{x})$ in this paper. In addition, our convergence analysis in Section 4 focuses on the evolution of $D(\boldsymbol{x})$ during the FL training process.

3.2 Hybrid Heterogeneity Causes Loose Upper Bound of Averaged Local Updates

This section shows that additional system heterogeneity will further enhance the impacts of statistical heterogeneity in federated optimization. We begin our analysis by examining the scenario where all clients run the same steps of local updates. This is captured in the following lemma:

Lemma 3.2 (Upper bound of balanced local updates [35])

Under Assumptions 2.1, 2.2, and 3.1, for communication round $t\in[T]$, any client $i\in[N]$ with balanced local iteration steps $k\in[K]$, and local learning rate $\eta_{l}\leq\frac{1}{K}$, the average of local updates can be bounded as:

\displaystyle\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq 5\eta_{l}(\sigma_{l}^{2}+6K\sigma_{g}^{2}+6K\mathbb{E}\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}).

(4)

Then, we further extend the lemma to a more general scenario where clients run unbalanced local update steps. Corollary 3.2 discusses a loose upper bound on the averaged local updates:

Corollary 3.2 (Loose upper bound of unbalanced local updates)

We denote the deviation of the local gradient of client $i$ from the global gradient as $\zeta_{i}(\boldsymbol{x})=\left\|\nabla f_{i}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}$. Under Assumptions 2.1, 2.2, and 3.1, for communication round $t\in[T]$, all clients $i\in[N]$ with arbitrary local iteration steps $k\in[K_{i}]$, and local learning rate $\eta_{l}\leq\frac{1}{K_{i}}$, the average of local updates can be bounded as:

\displaystyle\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq\Phi_{\text{Hetero}}+5\eta_{l}(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2}+6\tilde{K}\mathbb{E}\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}),

(5)

where $\tilde{K}=\sum_{i=1}^{N}\lambda_{i}K_{i}$, $K_{\text{min}}=\min(K_{1},\dots,K_{N})$, and $\Phi_{\text{Hetero}}=\sum_{i=1}^{N}30\eta_{l}(K_{i}-K_{\text{min}})\zeta_{i}(\boldsymbol{x}^{t})$.

Remark 3.1

We do not assume a bound on the local dissimilarity $\zeta_{i}(\boldsymbol{x}^{t})$ in $\Phi_{\text{Hetero}}$. Therefore, when the numbers of local update steps become unbalanced, the upper bound (4) is replaced by (5) with the induced hybrid heterogeneity term $\Phi_{\text{Hetero}}$. There is always at least one client $i$ such that $K_{i}-K_{\text{min}}=0$, which makes Assumption 3.1 inapplicable to $\Phi_{\text{Hetero}}$. Consequently, the term $\Phi_{\text{Hetero}}$ is enlarged by the unbalanced local steps, making (5) a very loose bound, which can negatively impact the performance of federated optimization.
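To see how unbalanced steps inflate $\Phi_{\text{Hetero}}$, the term can be evaluated directly from its definition in Corollary 3.2. The sketch below is purely illustrative: the function name `phi_hetero` and the toy values for $\eta_{l}$, $K_{i}$, and $\zeta_{i}$ are assumptions, not values from the experiments.

```python
def phi_hetero(lr_local, local_steps, dissimilarities):
    """Phi_Hetero = sum_i 30 * eta_l * (K_i - K_min) * zeta_i, as in Corollary 3.2."""
    k_min = min(local_steps)
    return sum(
        30.0 * lr_local * (k - k_min) * zeta
        for k, zeta in zip(local_steps, dissimilarities)
    )


# Balanced local steps: every (K_i - K_min) is zero, so the term vanishes.
phi_balanced = phi_hetero(0.01, [5, 5, 5], [1.0, 1.0, 1.0])
# Unbalanced local steps: clients with K_i > K_min contribute their dissimilarity.
phi_unbalanced = phi_hetero(0.01, [2, 5, 10], [1.0, 1.0, 1.0])
```

With these toy numbers the balanced case gives zero, while the unbalanced case gives $30\cdot 0.01\cdot(0+3+8)=3.3$, so the extra term scales with both the step imbalance and the per-client dissimilarity.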

Theoretical observation: alleviating hybrid heterogeneity by minimizing (5). Typically, previous works in federated optimization implicitly minimize a few terms in the upper bound (5) to improve optimization performance. For example, FedProx [26] leverages a penalty term to reduce the local updates $\mathbb{E}\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\|^{2}$ in Corollary 3.2. Similarly, SCAFFOLD [16], FedAvgM [10], and FedDyn [1] correct the local updates using variance regularization terms to narrow the variances $\sigma_{l}$ and $\sigma_{g}$. FedNova [42] normalizes local updates by the number of local steps to reduce the scale effects of $K_{i},\forall i$. Therefore, minimizing the upper bound (5) can efficiently mitigate hybrid heterogeneity, and previous works achieve this by manipulating local updates. Recall that non-convex federated optimization decreases $\|\nabla f(\boldsymbol{x}^{t})\|$ from $t=0$ to $T$; we propose that the term $\|\nabla f(\boldsymbol{x}^{t})\|$ can be further minimized by the server at each round $t$ to tighten the bound (5). Moreover, previous convergence analyses of FedAvg variants [26, 16, 42] expected a low upper bound of gradient diversity (discussed in Appendix D.7) for a better convergence rate. However, minimizing $\|\nabla f(\boldsymbol{x}^{t})\|$ potentially enlarges the upper bound of gradient diversity given by Corollary 3.1. Therefore, the upper bounds on gradient diversity in Corollary 3.1 and on hybrid heterogeneity in Corollary 3.2 both depend on the global gradient norm $\|\nabla f(\boldsymbol{x}^{t})\|$, revealing an interesting trade-off in convergence analysis.

Empirical observation: gradient diversity of FedAvg. Since the ultimate goal of an FL system is to train a well-generalized global model, it is important to identify the relation among gradient diversity, the global gradient, and model quality. To this end, we conduct observation experiments with FedAvg on heterogeneous federated CIFAR10 (detailed in Section 5). We show the training dynamics of test accuracy, gradient diversity, and the global first-order gradient in Figure 2. Higher test accuracy curves correspond to trajectory points with higher gradient diversity. Hence, larger gradient diversity values during federated optimization do not imply worse convergence performance. The observation empirically reveals a positive relationship between gradient diversity dynamics and model quality.

3.3 Gradient Diversity Maximization for Tighter Averaged Local Updates

We consider the simple and effective federated weighted averaging form of server-side optimization to tighten (5) (i.e., minimizing $\|\nabla f(\boldsymbol{x})\|$). The weight typically represents the importance of each local function in the global objective. Local function reweighting schemes have been adopted to improve fairness [27, 34], robustness [24], and generalization [28] by adjusting $\lambda$ for FedAvg.

Analogously, we use dynamic weights $\tilde{\lambda}$ to study a surrogate global objective [2, 42, 43]: $\tilde{f}(\boldsymbol{x})=\sum_{i=1}^{N}\tilde{\lambda}_{i}f_{i}(\boldsymbol{x})$. We focus on the adaptive global objective with appropriate adaptive weights $\tilde{\lambda}$. This is technically different from previous works that consider the global objective to be deterministic for a fixed $\lambda$ [33, 35, 45]. Then, we replace the global gradient of (3) with the surrogate global gradient:

Definition 3.2 (Surrogate gradient diversity)

The surrogate gradient diversity is defined as:

\tilde{D}(\boldsymbol{x}):=\sqrt{\sum_{i=1}^{N}\frac{\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla\tilde{f}(\boldsymbol{x})\|^{2}}}\geq 1,\;\text{where}\;\nabla\tilde{f}(\boldsymbol{x})=\sum_{i=1}^{N}\tilde{\lambda}_{i}\nabla f_{i}(\boldsymbol{x})

(6)

is the surrogate global gradient. Note that we keep the original weights $\lambda$ for the local gradients in the numerator.

We provide the gap between surrogate gradient diversity and original gradient diversity in Appendix D.4. This modification allows us to capture the convergence differences between the practically applied weights $\tilde{\lambda}$ and the theoretically assumed weights $\lambda$, as elaborated in Section 4.

Deriving the gradient diversity maximization direction. Since the FL server cannot directly access local data samples due to privacy concerns [58], it relies solely on the local updates uploaded by clients at each round. The objective of server-side gradient diversity maximization is to minimize the denominator of (6), with the numerator determined by the local updates. We provide an illustrative case with $N=2$ in Appendix D.1. In the general case, the local updates $\nabla f_{i}(\boldsymbol{x}^{t}),i\in[N]$ are observed by the FL server at the $t$-th round, and we can find weights that minimize the norm of the weighted averaged global gradient $\|\nabla\tilde{f}(\boldsymbol{x})\|$. Therefore, our gradient diversity maximization direction becomes:

\displaystyle\boldsymbol{d}=\sum_{i=1}^{N}\tilde{\lambda}_{i}\nabla f_{i}(\boldsymbol{x}),\quad\text{s.t.}\;\sum_{i=1}^{N}\tilde{\lambda}_{i}=1,\tilde{\lambda}_{i}\geq 0\;\forall i,

(7)

where $\tilde{\lambda}=\arg\min_{\tilde{\lambda}}\|\sum_{i=1}^{N}\tilde{\lambda}_{i}\nabla f_{i}(\boldsymbol{x})\|^{2}$. This constrained minimization problem amounts to finding a minimum-norm point in the convex hull of the set of input gradients. As the dimension of the gradients can be in the millions, we use the Frank-Wolfe algorithm [14] to solve it. Importantly, we highlight that the gradient diversity maximization direction is a non-zero vector (i.e., $\|\boldsymbol{d}\|>0$ in (7)) in the practice of non-convex optimization, as discussed in Appendix D.5. Besides, (7) is similar to a sub-problem that appears in multi-objective optimization works [14, 38, 60]. We clarify the differences in Appendix D.2.
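The minimum-norm problem in (7) can be sketched with a basic Frank-Wolfe loop over the simplex. This is a minimal sketch under stated assumptions, not the authors' implementation: it works on the Gram matrix of the gradients, uses an exact line search toward the selected vertex, and the names `min_norm_weights`, `combine`, `max_iters`, and `tol` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_weights(grads, max_iters=200, tol=1e-12):
    """Frank-Wolfe for min over the simplex of ||sum_i w_i * g_i||^2."""
    n = len(grads)
    gram = [[dot(gi, gj) for gj in grads] for gi in grads]
    weights = [1.0 / n] * n
    for _ in range(max_iters):
        # scores[i] = <g_i, d> with d = sum_j w_j g_j (half the objective gradient).
        scores = [dot(row, weights) for row in gram]
        s = min(range(n), key=lambda i: scores[i])  # best simplex vertex
        ww = dot(weights, scores)                   # ||d||^2
        ws = scores[s]                              # <d, g_s>
        if ww - ws <= tol:                          # Frank-Wolfe gap is closed
            break
        denom = ww - 2.0 * ws + gram[s][s]          # ||d - g_s||^2
        step = 1.0 if denom <= 0.0 else min(1.0, (ww - ws) / denom)
        weights = [(1.0 - step) * w for w in weights]
        weights[s] += step
    return weights


def combine(weights, grads):
    """Direction d = sum_i w_i * g_i."""
    dim = len(grads[0])
    return [sum(w * g[j] for w, g in zip(weights, grads)) for j in range(dim)]


grads = [[1.0, 0.0], [-1.0, 1.0]]  # two partially conflicting toy gradients
weights = min_norm_weights(grads)
direction = combine(weights, grads)
```

Working on the $N\times N$ Gram matrix keeps the per-iteration cost independent of the model dimension once the inner products are computed, which is one reason this style of solver is convenient when gradients are very high-dimensional.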

Algorithm 1 FedAWARE and its plug-in extension (green box)

Input: $\boldsymbol{x}^{0},\boldsymbol{m}^{0},\alpha$
1: for round $t\in[T]$ do
2:   Server samples clients $S^{t}$ and broadcasts model $\boldsymbol{x}^{t}$
3:   for client $i\in S^{t}$ in parallel do
4:     $\boldsymbol{x}^{t,0}_{i}=\boldsymbol{x}^{t}$
5:     for local update step $k\in[K_{i}]$ do
6:       Compute an unbiased estimate $\boldsymbol{g}_{i}(\boldsymbol{x}^{t,k})$ of $\nabla f_{i}(\boldsymbol{x}^{t,k})$
7:       $\boldsymbol{x}^{t,k}_{i}=\boldsymbol{x}^{t,k-1}_{i}-\eta_{l}\nabla F_{i}(\boldsymbol{x}^{t,k-1}_{i})$
8:     end for
9:     Client uploads local update $\boldsymbol{g}_{i}^{t}=\boldsymbol{x}_{i}^{t,0}-\boldsymbol{x}_{i}^{t,K_{i}}$
10:   end for
11:   Server updates local momentum $\boldsymbol{m}_{i}^{t}=\begin{cases}(1-\alpha)\boldsymbol{m}_{i}^{t-1}+\alpha\boldsymbol{g}_{i}^{t},&\text{if}\;i\in S^{t}\\ \boldsymbol{m}_{i}^{t-1},&\text{if}\;i\notin S^{t}\end{cases}$
12:   Server computes estimate $\boldsymbol{d}^{t}$ by replacing $\nabla f_{i}(\boldsymbol{x})$ with $\boldsymbol{m}_{i}^{t}$ in (7)
13:   if FedAWARE is plugged into an optimizer ServerOpt (e.g., FedYogi) then
14:     Server computes the estimate of the applied algorithm $\tilde{\boldsymbol{d}}^{t}=\textsc{ServerOpt}(\{\boldsymbol{g}_{i}^{t}\}_{i\in S^{t}})$
15:     Server modifies the direction of the estimate $\boldsymbol{d}^{t}=\boldsymbol{d}_{\text{proj}}^{t}$ following (9)
16:   end if
17:   Server updates $\boldsymbol{x}^{t+1}=\boldsymbol{x}^{t}-\eta\boldsymbol{d}^{t}$
18: end for

4 Methodology and Convergence Guarantee of FedAWARE

In this section, we present a simple yet effective server-side optimizer, FedAWARE, as shown in Algorithm 1. Algorithm 1 also includes an optional extension of FedAWARE to other server-side optimizers, shown in green. We provide the detailed convergence analysis of FedAWARE (without the extension) in Appendix C, and we discuss the implications of our novel extension in Remark 4.2.

Support for partial participation. Solving (7) typically requires access to all local first-order gradients from clients, which is often infeasible in FL systems. To overcome this limitation, FedAWARE approximates the local updates using the historical momentum of clients, denoted as $\boldsymbol{m}_{i},i\in[N]$. The momentum update rule is controlled by a constant $\alpha$ in Line 11 of Algorithm 1.
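The momentum rule for partial participation (sampled clients are refreshed, absent clients keep their previous momentum) can be sketched as follows. The dictionary-based data layout and the function name `update_momentum` are assumptions made for illustration.

```python
def update_momentum(momentum, updates, alpha):
    """Per-client momentum update under partial participation.

    momentum: dict mapping every client id to its momentum vector m_i^{t-1}.
    updates:  dict mapping only the sampled client ids (S^t) to their g_i^t.
    """
    new_momentum = {}
    for client_id, m_prev in momentum.items():
        if client_id in updates:
            g = updates[client_id]
            # Sampled client: m_i^t = (1 - alpha) * m_i^{t-1} + alpha * g_i^t.
            new_momentum[client_id] = [
                (1.0 - alpha) * m + alpha * gi for m, gi in zip(m_prev, g)
            ]
        else:
            # Absent client: the server reuses its stale momentum unchanged.
            new_momentum[client_id] = list(m_prev)
    return new_momentum


momentum = {0: [0.0, 0.0], 1: [1.0, 1.0]}
updates = {0: [2.0, 4.0]}  # only client 0 participates this round
momentum = update_momentum(momentum, updates, alpha=0.5)
```

After this round, client 0's momentum moves halfway toward its new update while client 1's entry is untouched, so the server always has one vector per client to feed into (7).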

Lemma 4.1

Under Assumptions 2.1, 2.2, and 3.1, for any $i\in[N]$ and $t\in[T]$, the gap between the momentum $\boldsymbol{m}_{i}$ and the gradient $\nabla f_{i}(\boldsymbol{x}^{t})$ can be bounded:

\displaystyle\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right]\leq\beta^{2}L^{2}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}},

where $\beta=(1-\alpha)p$ is a constant, $p=p_{1}=\cdots=p_{N}$, and $p_{i}=\text{Prob}(i\in S^{t})$.

Lemma 4.1 bounds the error of the momentum approximation and highlights that the approximation can stabilize the bound (5) with a proper $\beta$. In this paper, we set $\alpha$ and $p$ to be the same for all clients. Our detailed analyses in Appendix C show that FedAWARE can be further improved by designing $\alpha$ and $p$ for each client, which we will study in future work. Then, we present the convergence guarantee of Algorithm 1:

Theorem 4.1 (Convergence to the stationary points of Algorithm 1)

Let Assumptions 2.1, 2.2, and 3.1 hold. Following the momentum approximation in Lemma 4.1, and assuming an ideal (possibly dynamic) $\lambda$ such that $\|\nabla f(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}$ always holds, we have:

\displaystyle\min_{t\in[T]}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]\leq\epsilon_{\text{opt}}+\sum_{t=0}^{T-1}\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{T\rho^{t}\tilde{D}(\boldsymbol{x}^{t})}\sqrt{\sigma^{2}+\Phi_{\text{Hetero}}},

(8)

where $\sigma^{2}=5\eta_{l}(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2})$, $\chi_{\tilde{\lambda}\|\lambda}^{2}=\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}$, $\tilde{G}=\max_{t\in[T]}\sqrt{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x}^{t})\|^{2}}$, and $\rho^{t}>0$ is a coefficient detailed in the Appendix. The term $\epsilon_{\text{opt}}$ vanishes at rate $\mathcal{O}(1/\sqrt{T})$.

Remark 4.1

The term $\sqrt{\sigma^{2}+\Phi_{\text{Hetero}}}$ reflects the impact of hybrid heterogeneity on the convergence rate. In other words, large heterogeneity terms increase the convergence error, resulting in worse optimization performance. Our convergence analysis evaluates the feasibility of updating the global model along the gradient diversity maximization direction. Concretely, Algorithm 1 minimizes a surrogate global objective given by $\tilde{\lambda}$, and the term $\chi$ denotes the difference between the surrogate objective and the original objective (see details in Lemma C.2 in the Appendix). In particular, the surrogate of objective (1) is dynamically determined by the $\tilde{\lambda}$ in Algorithm 1. Hence, the convergence rate characterizes both the final stationary point and the speed of reaching it. It suggests that Algorithm 1 tends to make the model converge to a point with large gradient diversity. Besides, if we let $\eta_{l}=\mathcal{O}(1/T)$ decay with communication rounds, our convergence rate matches the best-known rate $\mathcal{O}(1/\sqrt{T})$ in non-convex optimization works [45, 16].

The implications of surrogate gradient diversity dynamics. Theorem 4.1 describes the convergence behavior of a broad class of FedAvg variants. The final term in (8) suggests that the convergence performance is related to the overall surrogate gradient diversity dynamics along the optimization trajectory $\{\boldsymbol{x}^{0},\dots,\boldsymbol{x}^{T-1}\}$. In other words, our gradient diversity maximization findings suggest that heterogeneous federated optimization algorithms should create a trajectory $\{\boldsymbol{x}^{0},\dots,\boldsymbol{x}^{T-1}\}$ that yields a higher curve of gradient diversity $\tilde{D}(\boldsymbol{x}^{t}),t=0,\dots,T-1$. We evaluate this point in comparison with other algorithms in Figure 3 and in the plug-in experiments in Figure 4. Besides, as gradient diversity quantifies the conflicts among individual gradients, surrogate gradient diversity maximization (i.e., minimizing the objective in (7)) can be intuitively viewed as adjusting the conflicts among local functions.

FedAWARE as a plug-in module. FedAWARE uses a novel adaptive weighted aggregation that only induces the surrogate gradient diversity terms $\tilde{D}(\cdot)$ in the convergence rate. Therefore, FedAWARE can be applied to other optimization works, as we minimize different terms (discussed in Appendix D.7). We present a novel extension of FedAWARE and evaluate it in Section 5:

Remark 4.2 (Extension of FedAWARE)

If the global estimate $\tilde{\boldsymbol{d}}^{t}$ is output by an arbitrary federated optimization algorithm (Line 14, Algorithm 1), we project $\tilde{\boldsymbol{d}}^{t}$ onto the direction of $\boldsymbol{d}^{t}$ (Line 15, Algorithm 1):

\displaystyle\boldsymbol{d}_{\text{proj}}^{t}=\frac{\langle\tilde{\boldsymbol{d}}^{t},\boldsymbol{d}^{t}\rangle}{\langle\boldsymbol{d}^{t},\boldsymbol{d}^{t}\rangle}\boldsymbol{d}^{t},\;\boldsymbol{x}^{t+1}=\boldsymbol{x}^{t}-\eta\boldsymbol{d}_{\text{proj}}^{t}

(9)

This procedure redirects the descent direction of the applied algorithm toward the direction that implicitly enlarges gradient diversity, while maintaining the advantages of the original algorithm. Please see Figures 4 and 9 for empirical evidence.

This extension allows FedAWARE to be a plug-in module to enhance the applied algorithms.
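The projection in (9) is an ordinary vector projection followed by a server step. A minimal sketch, assuming plain Python lists as vectors and a non-zero $\boldsymbol{d}^{t}$ (the function names are hypothetical):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_onto(d_tilde, d):
    """Project the base optimizer's estimate d_tilde onto the direction d, as in (9).

    Assumes d is non-zero; otherwise the projection is undefined.
    """
    scale = dot(d_tilde, d) / dot(d, d)
    return [scale * di for di in d]


def server_step(x, d_proj, lr_server):
    """x^{t+1} = x^t - eta * d_proj."""
    return [xi - lr_server * di for xi, di in zip(x, d_proj)]


d_tilde = [2.0, 2.0]  # estimate from some ServerOpt (toy values)
d = [1.0, 0.0]        # gradient diversity maximization direction (toy values)
d_proj = project_onto(d_tilde, d)
x_next = server_step([1.0, 1.0], d_proj, lr_server=0.5)
```

The projected update keeps the direction of $\boldsymbol{d}^{t}$ but inherits its step length from $\tilde{\boldsymbol{d}}^{t}$, which is how the base optimizer's adaptive scaling is retained.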

5 Experiments

We evaluate our theoretical findings and proposed algorithms in standard heterogeneous federated settings and show that hybrid heterogeneity harms federated optimization algorithms. We report error bars over three independent runs with different random seeds.

Experiment Setup. We compare against baseline algorithms that follow the federated optimization framework in Figure 1. Our baselines include FedAvg [33], FedProx [26], SCAFFOLD [16], FedDyn [1], and FedNova [42] as regularization-based algorithms, and FedAvgM [10] as a momentum-based algorithm. For adaptive federated optimization algorithms, we compare with FedYogi [35] and FedAMS [45]. Due to the page limit, we report the results of training a 5-layer CNN on the CIFAR-10 dataset [20] in the main paper. We provide convergence experiments on MNIST/Fashion-MNIST in Appendix E.2 and extension experiments on the text dataset AGNews [57] with the Pythia model [3] in Appendix E.3.

Data partition. We construct experiments on various heterogeneous settings: (1) Pathological: We adopt the identical Non-IID setup from FedAvg [33]. We sort the data by label, partition them into 200 blocks of equal size, and assign 2 blocks to each of 100 clients. In this setting, we set a constant data batch size and local epoch for all clients; hence, the number of local steps is balanced across clients. (2) Dirichlet: We create federated datasets for 100 clients following the latent Dirichlet allocation over labels [10]. The hyper-parameter of the Dirichlet distribution is set to 0.1, indicating extreme statistical heterogeneity. In this setting, we maintain a fixed data batch size and a fixed number of local training epochs for all clients. The number of local steps differs among clients, resulting in system heterogeneity. (3) Pathological+/Dirichlet+: Based on the data partitions described above, we randomly set the local epoch $E_{i}\sim\mathcal{U}(2,5)$ and the batch size $B_{i}\sim\mathcal{U}(10,|\mathcal{D}_{i}|)$ at round $t$ for each selected client $i\in S^{t}$. Here, $\mathcal{U}$ denotes the uniform distribution. Therefore, the number of local steps $K_{i}=\lfloor E_{i}\cdot|\mathcal{D}_{i}|/B_{i}\rfloor$ varies across clients and rounds. Each client conducts local mini-batch SGD with multiple local epochs to update the model. More details are provided in Appendix E.1.
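The per-round sampling of local epochs and batch sizes in the Pathological+/Dirichlet+ settings can be sketched as below. This is an illustration only: it assumes $\mathcal{U}$ is a discrete uniform distribution over integers with inclusive bounds (the text does not specify this), that every client holds at least 10 samples, and the function name `sample_local_steps` is hypothetical.

```python
import random


def sample_local_steps(num_samples, rng):
    """Sample (E_i, B_i) and return them with K_i = floor(E_i * |D_i| / B_i).

    Assumes integer-valued uniform sampling with inclusive bounds and
    num_samples >= 10 so that the batch-size range is non-empty.
    """
    epochs = rng.randint(2, 5)                  # E_i ~ U(2, 5)
    batch_size = rng.randint(10, num_samples)   # B_i ~ U(10, |D_i|)
    local_steps = (epochs * num_samples) // batch_size
    return epochs, batch_size, local_steps


rng = random.Random(0)  # fixed seed so that a run is reproducible
client_sizes = [500, 120, 60]  # toy |D_i| values, not taken from the experiments
sampled = [sample_local_steps(n, rng) for n in client_sizes]
```

Under these assumptions $K_{i}$ ranges from 2 (two epochs with a full-data batch) up to $\lfloor 5|\mathcal{D}_{i}|/10\rfloor$, so the spread of local steps across clients can be large even within a single round.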

Main results: convergence behavior of FedAWARE. We report the training dynamics of testing accuracy, gradient diversity, and training loss in Figure 3. The training loss curves indicate that FedAWARE converges at a rate comparable to state-of-the-art federated optimizers. Furthermore, the testing accuracy curves show that FedAWARE is more stable (fewer jitters and lower variance) than the other baselines. In addition, the gradient diversity dynamics indicate that FedAWARE converges to stationary points with larger gradient diversity. Notably, FedAvgM and FedYogi have lower gradient diversity while achieving relatively high testing accuracy. This does not contradict our theory, as gradient diversity is not the only factor in rate (8), as further elaborated in Appendix D. Moreover, designing a better global update direction that maximizes gradient diversity can potentially achieve higher performance. We support this point with the following gradient projection experiments.

Main results: gradient projection extension of FedAWARE. We plug FedAWARE into standard baselines (FedAvg, FedAvgM) and state-of-the-art adaptive optimizers (FedAMS and FedYogi) following Remark 4.2, and the results are shown in Figure 4. The loss curves show that gradient projection maintains the convergence rate of the algorithms it is applied to. In addition, gradient projection modifies the model update direction and increases gradient diversity. Hence, the modified algorithms obtain more stable test accuracy curves.

Observation: adaptivity in federated optimization is important. Our experiments highlight the importance of adaptivity in federated optimization. For example, standard federated optimization algorithms (e.g., FedAvg, SCAFFOLD) suffer from performance degradation in the Pathological+/Dirichlet+ settings. In comparison, adaptive federated optimization algorithms (FedYogi, FedDyn, FedAMS, and FedAWARE) maintain high convergence quality in such extremely heterogeneous settings. Moreover, our gradient diversity findings suggest a new form of adaptivity in FL.

Discussion: practical gradient diversity dynamics. Our results suggest the importance of gradient diversity as a new training dynamic in heterogeneous federated optimization. In Appendix E.1, we suggest a practical gradient diversity dynamic (30) for monitoring the optimization status.

6 Conclusion

This paper presents an initial exploration of gradient diversity maximization in federated server-side optimization. We examine this issue through the lens of FL server optimization and conclude that pursuing gradient diversity maximization in the global update direction helps mitigate hybrid heterogeneity. Following our theoretical findings, we propose a simple yet effective server-side optimizer applicable in scenarios exhibiting extreme hybrid heterogeneity. Beyond FL settings, we believe gradient diversity maximization can also be applied to the general finite-sum optimization task (1), especially when handling a finite sum of conflicting objectives.

References

[1] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2020.
[2] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[3] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
[4] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
[5] Michael Crawshaw, Yajie Bao, and Mingrui Liu. Federated learning with client subsampling, data heterogeneity, and unbounded smoothness: A new algorithm and lower bounds. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[6] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[7] Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. Sgd: General analysis and improved rates. In International conference on machine learning, pages 5200–5209. PMLR, 2019.
[8] Werner H Greub. Linear algebra, volume 23. Springer Science & Business Media, 2012.
[9] Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.
[10] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
[11] Zeou Hu, Kiarash Shaloudegi, Guojun Zhang, and Yaoliang Yu. Federated learning meets multi-objective optimization. IEEE Transactions on Network Science and Engineering, 9(4):2039–2051, 2022.
[12] Minhui Huang, Dewei Zhang, and Kaiyi Ji. Achieving linear speedup in non-iid federated bilevel learning. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 14039–14059. PMLR, 2023.
[13] Ahmed Imteaj, Urmish Thakker, Shiqiang Wang, Jian Li, and M Hadi Amini. A survey on federated learning for resource-constrained iot devices. IEEE Internet of Things Journal, 9(1):1–24, 2021.
[14] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In International conference on machine learning, pages 427–435. PMLR, 2013.
[15] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
[16] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
[17] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. First analysis of local gd on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019.
[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. arXiv preprint arXiv:2204.02937, 2022.
[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[21] Chengxi Li, Gang Li, and Pramod K Varshney. Federated learning with soft clustering. IEEE Internet of Things Journal, 9(10):7773–7782, 2021.
[22] Daliang Li and Junpu Wang. Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.
[23] Junbo Li, Ang Li, Chong Tian, Qirong Ho, Eric P Xing, and Hongyi Wang. Fednar: Federated optimization with normalized annealing regularization. arXiv preprint arXiv:2310.03163, 2023.
[24] Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In International Conference on Learning Representations, 2020.
[25] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE signal processing magazine, 37(3):50–60, 2020.
[26] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
[27] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497, 2019.
[28] Zexi Li, Tao Lin, Xinyi Shang, and Chao Wu. Revisiting weighted aggregation in federated learning with neural networks. In International Conference on Machine Learning, pages 19767–19788. PMLR, 2023.
[29] Shiyu Liu, Shaogao Lv, Dun Zeng, Zenglin Xu, Hui Wang, and Yue Yu. Personalized federated learning via amortized bayesian meta-learning. arXiv preprint arXiv:2307.02222, 2023.
[30] Guodong Long, Ming Xie, Tao Shen, Tianyi Zhou, Xianzhi Wang, and Jing Jiang. Multi-center federated learning: clients clustering for better personalization. World Wide Web, 26(1):481–500, 2023.
[31] Xiaodong Ma, Jia Zhu, Zhihao Lin, Shanxuan Chen, and Yangjie Qin. A state-of-the-art survey on solving non-iid data in federated learning. Future Generation Computer Systems, 135:244–258, 2022.
[32] Othmane Marfoq, Giovanni Neglia, Aurélien Bellet, Laetitia Kameni, and Richard Vidal. Federated multi-task learning under a mixture of distributions. Advances in Neural Information Processing Systems, 34:15434–15447, 2021.
[33] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
[34] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In International Conference on Machine Learning, pages 4615–4625. PMLR, 2019.
[35] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
[36] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
[37] Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE transactions on neural networks and learning systems, 32(8):3710–3722, 2020.
[38] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in neural information processing systems, 31, 2018.
[39] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. Advances in neural information processing systems, 30, 2017.
[40] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[41] Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H Brendan McMahan, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.
[42] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 33:7611–7623, 2020.
[43] Lin Wang, YongXin Guo, Tao Lin, and Xiaoying Tang. Delta: Diverse client sampling for fasting federated learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[44] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting He, and Kevin Chan. Adaptive federated learning in resource constrained edge computing systems. IEEE journal on selected areas in communications, 37(6):1205–1221, 2019.
[45] Yujia Wang, Lu Lin, and Jinghui Chen. Communication-efficient adaptive federated learning. In International Conference on Machine Learning, pages 22802–22838. PMLR, 2022.
[46] Xidong Wu, Jianhui Sun, Zhengmian Hu, Aidong Zhang, and Heng Huang. Solving a class of non-convex minimax optimization in federated learning. arXiv preprint arXiv:2310.03613, 2023.
[47] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[48] Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, and Haibin Lin. Local adaalter: Communication-efficient stochastic gradient descent with adaptive learning rates. arXiv preprint arXiv:1911.09030, 2019.
[49] Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780, 2020.
[50] Rui Ye, Mingkai Xu, Jianyu Wang, Chenxin Xu, Siheng Chen, and Yanfeng Wang. Feddisco: Federated learning with discrepancy-aware collaboration. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 39879–39902. PMLR, 2023.
[51] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 1998–2007. PMLR, 2018.
[52] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. Advances in neural information processing systems, 31, 2018.
[53] Dun Zeng, Xiangjing Hu, Shiyu Liu, Yue Yu, Qifan Wang, and Zenglin Xu. Stochastic clustered federated learning. arXiv preprint arXiv:2303.00897, 2023.
[54] Dun Zeng, Siqi Liang, Xiangjing Hu, Hui Wang, and Zenglin Xu. Fedlab: A flexible federated learning framework. Journal of Machine Learning Research, 24(100):1–7, 2023.
[55] Shenglai Zeng, Zonghang Li, Hongfang Yu, Yihong He, Zenglin Xu, Dusit Niyato, and Han Yu. Heterogeneous federated learning via grouped sequential-to-parallel training. In International Conference on Database Systems for Advanced Applications, pages 455–471. Springer, 2022.
[56] Jianqing Zhang, Yang Hua, Jian Cao, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, and Haibing Guan. Eliminating domain bias for federated learning in representation space. Advances in Neural Information Processing Systems, 36, 2024.
[57] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28, 2015.
[58] Yifei Zhang, Dun Zeng, Jinglong Luo, Zenglin Xu, and Irwin King. A survey of trustworthy federated learning with perspectives on security, robustness, and privacy. arXiv preprint arXiv:2302.10637, 2023.
[59] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
[60] Shiji Zhou, Wenpeng Zhang, Jiyan Jiang, Wenliang Zhong, Jinjie Gu, and Wenwu Zhu. On the convergence of stochastic multi-objective gradient manipulation and beyond. Advances in Neural Information Processing Systems, 35:38103–38115, 2022.
[61] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. Advances in neural information processing systems, 32, 2019.
[62] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In International conference on machine learning, pages 12878–12889. PMLR, 2021.


Appendix A Related Works

Plenty of studies have been proposed to address statistical heterogeneity issues in the FL literature [31], including multi-task federated learning [32, 39], personalized federated learning [40, 29], knowledge distillation [62, 22], and client clustering [37, 53, 55]. In contrast, this paper primarily focuses on federated optimization algorithms that train a robust global model against heterogeneity issues, as shown in Algorithm 2.

Algorithm 2 FedOpt [35]

Input: $\boldsymbol{x}^{0}$, ClientOpt, ServerOpt
1: for round $t\in[T]$ do
2:   Server samples clients $S^{t}$ and broadcasts model $\boldsymbol{x}^{t}$
3:   for client $i\in S^{t}$ in parallel do
4:     $\boldsymbol{x}^{t,0}_{i}=\boldsymbol{x}^{t}$
5:     for local update step $k\in[K_{i}]$ do
6:       Compute an unbiased estimate $\nabla F_{i}(\boldsymbol{x}^{t,k-1}_{i})$ of $\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})$
7:       $\boldsymbol{x}^{t,k}_{i}=\textsc{ClientOpt}(\boldsymbol{x}^{t,k-1}_{i},\nabla F_{i}(\boldsymbol{x}^{t,k-1}_{i}),\eta_{l},t)$ (e.g., $\boldsymbol{x}^{t,k}_{i}=\boldsymbol{x}^{t,k-1}_{i}-\eta_{l}\nabla F_{i}(\boldsymbol{x}^{t,k-1}_{i})$)
8:     end for
9:     Client uploads local updates $\boldsymbol{g}_{i}^{t}=\boldsymbol{x}_{i}^{t,0}-\boldsymbol{x}_{i}^{t,K_{i}}$
10:   end for
11:   Server computes an estimate $\boldsymbol{d}^{t}=\textsc{ServerOpt}(\{\boldsymbol{g}_{i}^{t}\}_{i\in S^{t}})$
12:   Server updates $\boldsymbol{x}^{t+1}=\boldsymbol{x}^{t}-\eta\boldsymbol{d}^{t}$
13: end for
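The loop of Algorithm 2 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: ClientOpt is plain SGD, the default ServerOpt is an unweighted average of the uploaded updates, and the names `fedopt`, `client_grad_fns`, and `server_opt` are hypothetical rather than taken from any library.

```python
import numpy as np

def fedopt(x0, client_grad_fns, local_steps, rounds=50, clients_per_round=5,
           eta_l=0.05, eta=1.0, server_opt=None, seed=0):
    """Sketch of the FedOpt loop with SGD as ClientOpt.

    client_grad_fns[i](x, rng) returns a stochastic gradient of f_i at x.
    server_opt maps the list of uploaded updates g_i^t to a direction d^t;
    the default is a plain average (an assumption for this sketch).
    """
    rng = np.random.default_rng(seed)
    if server_opt is None:
        server_opt = lambda updates: np.mean(updates, axis=0)
    x = np.asarray(x0, dtype=float)
    n = len(client_grad_fns)
    for _ in range(rounds):
        sampled = rng.choice(n, size=min(clients_per_round, n), replace=False)
        updates = []
        for i in sampled:
            x_i = x.copy()                      # x_i^{t,0} = x^t
            for _ in range(local_steps[i]):     # K_i local steps
                x_i -= eta_l * client_grad_fns[i](x_i, rng)
            updates.append(x - x_i)             # g_i^t = x_i^{t,0} - x_i^{t,K_i}
        x = x - eta * server_opt(updates)       # x^{t+1} = x^t - eta * d^t
    return x

# Toy usage: quadratic clients f_i(x) = 0.5 * ||x - c_i||^2 with unbalanced K_i.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
grads = [lambda x, rng, c=c: x - c for c in centers]
x_final = fedopt(np.zeros(2), grads, local_steps=[1, 3, 5], clients_per_round=3)
```

Passing a different `server_opt` callable is the only change needed to swap the server-side rule, which is the design point the FedOpt framework makes.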

The pioneering federated optimization algorithm, FedAvg [33], significantly reduces communication costs. Subsequent works built upon FedAvg to address challenges related to convergence guarantees and heterogeneity issues. For example, some approaches introduced a regularization term in the client objectives [26], while others incorporated server momentum [10]. Several studies have analyzed the convergence rate of FedAvg and demonstrated its degradation with system heterogeneity [26, 44] and statistical heterogeneity [59, 17]. SCAFFOLD [16] utilizes control variates to mitigate client drift and achieve convergence rates independent of the level of heterogeneity. FedNova [42] addresses objective inconsistency issues arising from system heterogeneity through local update regularization. For more detailed comparisons, we refer to the survey [15]. In addition, adaptive methods [52, 36, 48] have proven to be effective in non-convex optimization. In the context of federated optimization, FedYogi [35, 52] and FedAMS [45] are representative adaptive federated optimization algorithms that incorporate Adam-like momentum and adaptive terms to address heterogeneity issues. These related works demonstrate the ongoing efforts to address heterogeneity issues in FL, with various approaches focusing on different aspects of the problem.

Appendix B Proof of Lemmas and Corollaries

Corollary 1 (Proof of Corollary 3.1)

Let Assumption 3.1 hold. Then, in the case of bounded variance, i.e., $\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}\leq\sigma_{g}^{2}$, it follows that $D(\boldsymbol{x})\leq\sqrt{1+\frac{\sigma_{g}^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}$.

Proof 1

\begin{aligned}
&\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}=\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})\right\|^{2}-\|\nabla f(\boldsymbol{x})\|^{2}\leq\sigma_{g}^{2}\\
\Leftrightarrow\;&\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})\right\|^{2}\leq\sigma_{g}^{2}+\|\nabla f(\boldsymbol{x})\|^{2}\\
\Rightarrow\;&D(\boldsymbol{x})=\sqrt{\frac{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}\leq\sqrt{\frac{\sigma_{g}^{2}+\|\nabla f(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}=\sqrt{1+\frac{\sigma_{g}^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}.
\end{aligned}

The first equality uses $\nabla f(\boldsymbol{x})=\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x})$ and assumes the weights satisfy $\sum_{i=1}^{N}\lambda_{i}=1$.
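As a numerical illustration of the corollary, the sketch below evaluates $D(\boldsymbol{x})$ from a stack of local gradients and compares it with the bound. The function name `gradient_diversity`, the synthetic gradients, and the choice of weights summing to one are assumptions made for illustration only. Here $\sigma_{g}^{2}$ is set to the exact weighted variance, so the bound is attained.

```python
import numpy as np

def gradient_diversity(grads, weights):
    """D(x) = sqrt( sum_i lambda_i ||g_i||^2 / ||sum_i lambda_i g_i||^2 )."""
    grads = np.asarray(grads, dtype=float)       # shape (N, d), row i is grad f_i(x)
    weights = np.asarray(weights, dtype=float)   # shape (N,), assumed to sum to 1
    global_grad = weights @ grads                # grad f(x)
    num = np.sum(weights * np.sum(grads ** 2, axis=1))
    return np.sqrt(num / np.sum(global_grad ** 2))

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 5)) + 1.0            # synthetic local gradients
weights = np.full(8, 1 / 8)
global_grad = weights @ grads
# Weighted variance of the local gradients around the global gradient.
sigma_g_sq = np.sum(weights * np.sum((grads - global_grad) ** 2, axis=1))
bound = np.sqrt(1.0 + sigma_g_sq / np.sum(global_grad ** 2))
d = gradient_diversity(grads, weights)
```

With any larger valid $\sigma_{g}^{2}$ the bound is loose, and $D(\boldsymbol{x})\geq 1$ always holds by Jensen's inequality.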

Lemma 1 (Proof of Lemma 3.2 and Corollary 3.2)

Let Assumptions 2.2 and 3.1 hold. For every client $i\in[N]$ with arbitrary local iteration steps $k\in[K_{i}]$, the local updates can be bounded as follows,

\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq 5K_{i}\eta_{l}^{2}\left(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}\right).

Proof 2

For $k\in[K_{i}]$, we have

\begin{aligned}
&\mathbb{E}\left[\left\|\boldsymbol{g}_{i}^{t}\right\|^{2}\right]=\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}=\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}-\eta_{l}\nabla F_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}\\
=\;&\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}-\eta_{l}\left(\nabla F_{i}(\boldsymbol{x}_{i}^{t,k-1})\pm\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right)\right\|^{2}\\
=\;&\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}-\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}+\mathbb{E}\left\|\eta_{l}\left(\nabla F_{i}(\boldsymbol{x}_{i}^{t,k-1})-\nabla f_{i}\left(\boldsymbol{x}_{i}^{t,k-1}\right)\right)\right\|^{2}\\
=\;&\mathbb{E}\left[\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}-2\left\langle\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t},\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\rangle+\left\|\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}\right]+\eta_{l}^{2}\sigma_{l}^{2}\\
=\;&\mathbb{E}\Big[\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}-2\Big\langle\frac{1}{\sqrt{2K_{i}-1}}(\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}),\sqrt{2K_{i}-1}\,\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\Big\rangle\\
&+\left\|\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}\Big]+\eta_{l}^{2}\sigma_{l}^{2}\\
\leq\;&\left(1+\frac{1}{2K_{i}-1}\right)\mathbb{E}\left[\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}\right]+2K_{i}\mathbb{E}\left[\left\|\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}\right]+\eta_{l}^{2}\sigma_{l}^{2}\\
=\;&\left(1+\frac{1}{2K_{i}-1}\right)\mathbb{E}\left[\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}\right]+2K_{i}\mathbb{E}\left[\left\|\eta_{l}\left(\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\pm\nabla f\left(\boldsymbol{x}^{t}\right)\pm\nabla f_{i}\left(\boldsymbol{x}^{t}\right)\right)\right\|^{2}\right]\\
&+\eta_{l}^{2}\sigma_{l}^{2}\\
\leq\;&\left(1+\frac{1}{2K_{i}-1}\right)\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}+6K_{i}\mathbb{E}\left[\left\|\eta_{l}\left(\nabla f_{i}\left(\boldsymbol{x}_{i}^{t,k-1}\right)-\nabla f_{i}\left(\boldsymbol{x}^{t}\right)\right)\right\|^{2}\right]\\
&+6K_{i}\mathbb{E}\left[\left\|\eta_{l}\left(\nabla f_{i}\left(\boldsymbol{x}^{t}\right)\right)\right\|^{2}\right]+\eta_{l}^{2}\sigma_{l}^{2}\\
\leq\;&\left(1+\frac{1}{2K_{i}-1}+6K_{i}\eta_{l}^{2}L^{2}\right)\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}+\eta_{l}^{2}\left(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\mathbb{E}\left[\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}\right]\right).
\end{aligned}

Unrolling the recursion, we obtain

\begin{aligned}
\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq\;&\sum_{p=0}^{k-1}\left(1+\frac{1}{2K_{i}-1}+6K_{i}\eta_{l}^{2}L^{2}\right)^{p}\eta_{l}^{2}\left(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\mathbb{E}\left[\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}\right]\right)\\
\leq\;&(K_{i}-1)\left[\left(1+\frac{1}{K_{i}-1}\right)^{K_{i}}-1\right]\eta_{l}^{2}\left(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\mathbb{E}\left[\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}\right]\right)\\
\leq\;&5K_{i}\eta_{l}^{2}\left(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\mathbb{E}\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}\right),
\end{aligned}

(10)

where we use the fact that $(1+\frac{1}{K_{i}-1})^{K_{i}}\leq 5$ for $K_{i}>1$ .
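The constants in the last two steps can be checked numerically. The sketch below assumes the per-step growth factor has already been upper bounded by $1+\frac{1}{K_{i}-1}$ (which requires a sufficiently small $\eta_{l}$, a condition not restated here), and then verifies the geometric-sum closed form and the final constant for a range of $K_{i}$. The helper names are hypothetical.

```python
def geometric_sum(K, k):
    """sum_{p=0}^{k-1} (1 + 1/(K-1))^p, evaluated term by term."""
    r = 1.0 + 1.0 / (K - 1)
    return sum(r ** p for p in range(k))

def closed_form(K, k):
    """(K-1) * [(1 + 1/(K-1))^k - 1], the closed form of the same sum."""
    r = 1.0 + 1.0 / (K - 1)
    return (K - 1) * (r ** k - 1.0)

checks = []
for K in range(2, 200):
    r_pow = (1.0 + 1.0 / (K - 1)) ** K
    checks.append(
        abs(geometric_sum(K, K) - closed_form(K, K)) < 1e-6 * closed_form(K, K)
        and r_pow <= 5.0                   # the fact used in the proof
        and closed_form(K, K) <= 5 * K     # hence the 5 * K_i constant
    )
all_ok = all(checks)
```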

Lemma B.1 (Upper bound of balanced local updates, [35])

Let Assumptions 2.2 and 3.1 hold. For any client $i\in[N]$ with arbitrary local iteration steps $k\in[K]$ and local learning rate $\eta_{l}\leq\frac{1}{K}$, the average of local updates can be bounded as follows,

\mathbb{E}\left\|\sum_{i=1}^{N}\lambda_{i}\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq 5\eta_{l}\left(\sigma_{l}^{2}+6K\sigma_{g}^{2}+6K\mathbb{E}\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}\right)

(11)

Corollary B.1 (Loose upper bound of unbalanced local updates)

Let Assumptions 2.2 and 3.1 hold. For any client $i\in[N]$ with arbitrary local iteration steps $k\in[K_{i}]$ and local learning rate $\eta_{l}\leq\frac{1}{K_{i}}$, the average of local updates can be bounded as follows,

\mathbb{E}\left\|\sum_{i=1}^{N}\lambda_{i}\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}\leq 5\eta_{l}\left(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2}+6\tilde{K}\mathbb{E}\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}\right)+\Phi_{\text{Hetero}},

(12)

where $\tilde{K}=\sum_{i=1}^{N}\lambda_{i}K_{i}$ and $\Phi_{\text{Hetero}}=\sum_{i=1}^{N}30\eta_{l}(K_{i}-K_{\text{min}})\zeta_{i}(\boldsymbol{x}^{t})$.

Appendix C Convergence Analysis of FedAWARE

This section provides a detailed proof of Theorem 4.1. Our theory examines the impact of gradient diversity in federated optimization and matches the convergence of classic aggregation-based methods, such as FedAvg. Unlike FedAvg, Algorithm 1 modifies the aggregation result through momentum approximation and adaptive weighting strategies.

Proof 3

We recall that the update rule of Algorithm 1 is:

\boldsymbol{x}^{t+1}=\boldsymbol{x}^{t}-\eta\boldsymbol{d}^{t},

where

\boldsymbol{d}^{t}=\sum_{i=1}^{N}\tilde{\lambda}^{t}_{i}\boldsymbol{m}_{i}^{t},\quad\text{s.t.}\;\tilde{\lambda}^{t}=\underset{\lambda}{\arg\min}\left\|\sum_{i=1}^{N}\lambda_{i}\boldsymbol{m}_{i}^{t}\right\|^{2}.
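The weights $\tilde{\lambda}^{t}$ solve a minimum-norm problem over convex combinations of the momentum vectors. The sketch below solves it with a Frank-Wolfe iteration and exact line search. Two assumptions are made explicit: the feasible set is taken to be the probability simplex (it is not restated in the display above), and Frank-Wolfe is only one reasonable solver, not necessarily the one used in the released code. The name `min_norm_weights` is hypothetical.

```python
import numpy as np

def min_norm_weights(M, iters=200):
    """Frank-Wolfe for min over the simplex of ||sum_i lambda_i m_i||^2.

    M has shape (N, d); row i is the momentum vector m_i.
    """
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    gram = M @ M.T                       # gram[i, j] = <m_i, m_j>
    lam = np.full(n, 1.0 / n)            # start from uniform weights
    for _ in range(iters):
        grad = gram @ lam                # half the gradient of the objective
        j = int(np.argmin(grad))         # simplex vertex minimizing the linearization
        direction = -lam
        direction[j] += 1.0              # e_j - lam
        # Exact line search for the quadratic objective along `direction`.
        denom = direction @ gram @ direction
        if denom <= 1e-12:
            break
        step = np.clip(-(lam @ gram @ direction) / denom, 0.0, 1.0)
        lam = lam + step * direction     # stays in the simplex for step in [0, 1]
    return lam

# Toy usage with two orthogonal vectors of different lengths.
M = np.array([[1.0, 0.0], [0.0, 2.0]])
lam = min_norm_weights(M)
d = lam @ M                              # resulting update direction d^t
```

For this toy input the minimizer puts more weight on the shorter vector, which illustrates how the min-norm rule down-weights clients with large updates.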

Using the smoothness, we have:

\begin{aligned}
f\left(\boldsymbol{x}^{t+1}\right)&=f\left(\boldsymbol{x}^{t}-\eta\boldsymbol{d}^{t}\right)\leq f\left(\boldsymbol{x}^{t}\right)-\eta\left\langle\nabla f\left(\boldsymbol{x}^{t}\right),\boldsymbol{d}^{t}\right\rangle+\frac{L}{2}\eta^{2}\left\|\boldsymbol{d}^{t}\right\|^{2}\\
&\leq f\left(\boldsymbol{x}^{t}\right)-\eta\left\langle\nabla f\left(\boldsymbol{x}^{t}\right),\boldsymbol{d}^{t}-\nabla f\left(\boldsymbol{x}^{t}\right)+\nabla f\left(\boldsymbol{x}^{t}\right)\right\rangle+\frac{L}{2}\eta^{2}\left\|\boldsymbol{d}^{t}\right\|^{2}\\
&\leq f\left(\boldsymbol{x}^{t}\right)-\eta\|\nabla f\left(\boldsymbol{x}^{t}\right)\|^{2}+\eta\left\langle\nabla f\left(\boldsymbol{x}^{t}\right),\nabla f\left(\boldsymbol{x}^{t}\right)-\boldsymbol{d}^{t}\right\rangle+\frac{L}{2}\eta^{2}\left\|\boldsymbol{d}^{t}\right\|^{2}.
\end{aligned}

Taking full expectation over randomness at time step $t$ on both sides, we have:

\mathbb{E}\left[f(\boldsymbol{x}^{t+1})\right]-f(\boldsymbol{x}^{t})\leq-\eta\mathbb{E}\|\nabla f\left(\boldsymbol{x}^{t}\right)\|^{2}+\eta\underbrace{\mathbb{E}[\langle\nabla f(\boldsymbol{x}^{t}),\nabla f\left(\boldsymbol{x}^{t}\right)-\boldsymbol{d}^{t}\rangle]}_{T_{1}}+\frac{L}{2}\eta^{2}\underbrace{\mathbb{E}[\|\boldsymbol{d}^{t}\|^{2}]}_{T_{2}}.

(13)

We now bound $T_{1}$ and $T_{2}$ in turn.

Bounding $T_{1}$. By the Cauchy-Schwarz inequality, we get

\begin{aligned}
T_{1}&\leq\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|\left\|\nabla f(\boldsymbol{x}^{t})-\boldsymbol{d}^{t}\right\|\right]\leq\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})-\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\boldsymbol{m}_{i}^{t}\right\|\right]\\
&=\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})\pm\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\nabla f_{i}(\boldsymbol{x}^{t})-\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\boldsymbol{m}_{i}^{t}\right\|\right]\\
&\leq\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}(\lambda_{i}-\tilde{\lambda}_{i}^{t})\nabla f_{i}(\boldsymbol{x}^{t})\right\|\right]+\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|\right]\\
&\leq\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})}{\lambda_{i}}\cdot\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})\right\|\right]+\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\frac{\tilde{\lambda}_{i}^{t}}{\lambda_{i}}\cdot\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|\right]\\
&\leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})\right\|\right]\\
&\quad+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|\right]\\
&\leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|\right],
\end{aligned}

where we use the Cauchy-Schwarz inequality to separate $\tilde{\lambda}$ from the norms.

Then, using the fact that $\mathbb{E}^{2}\|A\|\leq\mathbb{E}\|A\|^{2}$ , we have

\begin{aligned}
T_{1}&\leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\sqrt{\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|^{2}\right]}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\\
&\leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\sqrt{\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t}\right\|^{2}\right]}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|.
\end{aligned}

Letting $\chi_{\tilde{\lambda}\|\lambda}^{2}=\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}$ and $\chi_{\tilde{\lambda}\|\lambda}=\sum_{i=1}^{N}\frac{\tilde{\lambda}_{i}^{t}}{\lambda_{i}}$, so that $\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}=N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}$, we have

T_{1}\leq\sqrt{N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}\sqrt{\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t}\right\|^{2}\right]}\cdot\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|.

(14)

Bounding $T_{2}$ . By the definitions and triangle inequality, we can decompose the $T_{1}$ ,

$\displaystyle T_{2}$	$\displaystyle=\mathbb{E}[\\|\boldsymbol{d}^{t}\\|^{2}]=\mathbb{E}\left[\left\\|% \sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\boldsymbol{m}_{i}^{t}\right\\|^{2}\right]% =\mathbb{E}\left[\left\\|\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\boldsymbol{m}_{i% }^{t}-\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\nabla f_{i}(\boldsymbol{x}^{t})+% \sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\nabla f_{i}(\boldsymbol{x}^{t})\right\\|^% {2}\right]$	(15)
	$\displaystyle\leq 2\mathbb{E}\left[\left\\|\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t% }(\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t}))\right\\|^{2}\right]+% 2\mathbb{E}\left[\left\\|\sum_{i=1}^{N}(\tilde{\lambda}_{i}^{t}-\lambda_{i})% \nabla f_{i}(\boldsymbol{x}^{t})\right\\|^{2}\right]$
	$\displaystyle\leq 2\mathbb{E}\left[\left\\|\sum_{i=1}^{N}\frac{\tilde{\lambda}_% {i}^{t}}{\lambda_{i}}\cdot\lambda_{i}(\boldsymbol{m}_{i}^{t}-\nabla f_{i}(% \boldsymbol{x}^{t}))\right\\|^{2}\right]+2\mathbb{E}\left[\left\\|\sum_{i=1}^{N}% \frac{(\tilde{\lambda}_{i}^{t}-\lambda_{i})}{\lambda_{i}}\cdot\lambda_{i}% \nabla f_{i}(\boldsymbol{x}^{t})\right\\|^{2}\right]$
	$\displaystyle\leq 2\chi_{\tilde{\lambda}\\|\lambda}^{2}\mathbb{E}\left[\left\\|% \sum_{i=1}^{N}\lambda_{i}(\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{% t}))\right\\|^{2}\right]+2(N+\chi_{\tilde{\lambda}\\|\lambda}^{2}-2\chi_{\tilde{% \lambda}\\|\lambda})\mathbb{E}\left[\left\\|\nabla f(\boldsymbol{x}^{t})\right\\|% ^{2}\right]$
	$\displaystyle\leq 2\chi_{\tilde{\lambda}\\|\lambda}^{2}\sum_{i=1}^{N}\lambda_{i% }\mathbb{E}\left[\left\\|(\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t% }))\right\\|^{2}\right]+2(N+\chi_{\tilde{\lambda}\\|\lambda}^{2}-2\chi_{\tilde{% \lambda}\\|\lambda})\mathbb{E}\left[\left\\|\nabla f(\boldsymbol{x}^{t})\right\\|% ^{2}\right],$

where we also use the Cauchy-Schwarz inequality to decompose $\tilde{\lambda}$.
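As a quick sanity check on the coefficients in (15), the following sketch evaluates $\chi_{\tilde{\lambda}\|\lambda}^{2}$ and $\chi_{\tilde{\lambda}\|\lambda}$ numerically and confirms the expansion $\sum_{i=1}^{N}(\tilde{\lambda}_{i}-\lambda_{i})^{2}/\lambda_{i}^{2}=N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}$. The function name and the example weights are illustrative assumptions and are not taken from the released code.

```python
def chi_terms(lam, lam_tilde):
    """Return (chi_sq, chi, gap) for weights lam and surrogate weights lam_tilde.

    chi_sq = sum_i (lam_tilde_i / lam_i)^2
    chi    = sum_i  lam_tilde_i / lam_i
    gap    = sum_i ((lam_tilde_i - lam_i) / lam_i)^2
    """
    ratios = [lt / l for lt, l in zip(lam_tilde, lam)]
    chi_sq = sum(r * r for r in ratios)
    chi = sum(ratios)
    gap = sum((r - 1.0) ** 2 for r in ratios)
    return chi_sq, chi, gap


# Hypothetical example: 4 clients, uniform lam, skewed lam_tilde.
lam = [0.25, 0.25, 0.25, 0.25]
lam_tilde = [0.4, 0.3, 0.2, 0.1]
chi_sq, chi, gap = chi_terms(lam, lam_tilde)

# The gap equals N + chi^2 - 2*chi, and it vanishes when lam_tilde == lam.
assert abs(gap - (len(lam) + chi_sq - 2.0 * chi)) < 1e-9
assert abs(chi_terms(lam, lam)[2]) < 1e-12
```

The gap term is zero exactly when the surrogate weights coincide with the original weights, which is the case where the surrogate objective and the original objective agree.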

Investigating momentum approximation. Observing (14) and (15), we need to bound the approximation error of local momentum. We present our analysis in the following lemma.

Lemma C.1 (Bound of local momentum)

For any $i\in[N]$ and $t\in[T]$ , letting $\beta_{i}=(1-\alpha)p_{i}\leq 1$ and $p_{i}=\text{Prob}(i\in S^{t})$ , the gap between momentum $\boldsymbol{m}_{i}$ and gradient $\nabla f_{i}(\boldsymbol{x}^{t})$ can be bounded:

\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\boldsymbol{m}_{i}^{t}-\nabla f% _{i}(\boldsymbol{x}^{t})\right\|^{2}\right]\leq\beta^{2}L^{2}\sum_{i=1}^{N}% \lambda_{i}\mathbb{E}{\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}% \right\|^{2}}.

(16)

Proof 4

		$\displaystyle\quad\;\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left[\left\\|% \boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\\|^{2}\right]}$
		$\displaystyle=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left[\left\\|(1-p_{i})% \boldsymbol{m}_{i}^{t-1}+p_{i}(\alpha\boldsymbol{m}_{i}^{t-1}+(1-\alpha)% \boldsymbol{g}_{i}^{t})-\nabla f_{i}(\boldsymbol{x}^{t})\right\\|^{2}\right]}$
		$\displaystyle=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left[\left\\|[1-(1-\alpha)p_% {i}]\boldsymbol{m}_{i}^{t-1}+(1-\alpha)p_{i}\boldsymbol{g}_{i}^{t}-\nabla f_{i% }(\boldsymbol{x}^{t})\right\\|^{2}\right]}$
		$\displaystyle=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left[\left\\|(1-\beta_{i})% \boldsymbol{m}_{i}^{t-1}+\beta_{i}\boldsymbol{g}_{i}^{t}-\nabla f_{i}(% \boldsymbol{x}^{t})\right\\|^{2}\right]}\quad\quad\quad\text{$\triangleright$ % letting $\beta_{i}=(1-\alpha)p_{i}$}$
		$\displaystyle=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left[\left\\|(1-\beta_{i})(% \boldsymbol{m}_{i}^{t-1}-\nabla f_{i}(\boldsymbol{x}^{t-1}))+\beta_{i}% \boldsymbol{g}_{i}^{t}+(1-\beta_{i})\nabla f_{i}(\boldsymbol{x}^{t-1})-\nabla f% _{i}(\boldsymbol{x}^{t})\right\\|^{2}\right]}$
		$\displaystyle=\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\Big{[}\\|(1-\beta_{i})(% \boldsymbol{m}_{i}^{t-1}-\nabla f_{i}(\boldsymbol{x}^{t-1}))+\beta_{i}[% \boldsymbol{g}_{i}^{t}-\eta_{l}K_{i}\nabla f_{i}(\boldsymbol{x}^{t})]$
		$\displaystyle\quad+\left((1-\beta_{i})\nabla f_{i}(\boldsymbol{x}^{t-1})-(1-\beta_{i}\eta_{l}K_{i})\nabla f_{i}(\boldsymbol{x}^{t})\right)\\|^{2}\Big{]}$
		$\displaystyle\leq\sum_{i=1}^{N}\lambda_{i}(1-\beta_{i})^{2}\mathbb{E}{\left[% \left\\|\boldsymbol{m}_{i}^{t-1}-\nabla f_{i}(\boldsymbol{x}^{t-1})\right\\|^{2}% \right]}+\sum_{i=1}^{N}\lambda_{i}\beta_{i}^{2}\underbrace{\mathbb{E}\left[% \left\\|\boldsymbol{g}_{i}^{t}-\eta_{l}K_{i}\nabla f_{i}(\boldsymbol{x}^{t})% \right\\|^{2}\right]}_{R_{1}}$
		$\displaystyle\quad+\sum_{i=1}^{N}\lambda_{i}(1-\beta_{i})^{2}L^{2}\underbrace{% \mathbb{E}\left\\|\boldsymbol{x}^{t-1}-\boldsymbol{x}^{t}\right\\|^{2}}_{\mathbb% {E}\\|\eta\boldsymbol{d}^{t-1}\\|^{2}}.$

Note that the term $R_{1}$ denotes the heterogeneity brought by the local computation. For $\tau\in[t]$ , term $R_{1}$ can be bounded,

$\displaystyle R_{1}$	$\displaystyle=\mathbb{E}{\left[\left\\|\sum_{k=1}^{K_{i}}\eta_{l}\nabla F_{i}(% \boldsymbol{x}^{t,k})-\eta_{l}K_{i}\nabla f_{i}(\boldsymbol{x}^{t})\right\\|^{2% }\right]}=\mathbb{E}{\left[\left\\|\sum_{k=1}^{K_{i}}\eta_{l}\left(\nabla F_{i}% (\boldsymbol{x}^{t,k})-\nabla f_{i}(\boldsymbol{x}^{t})\right)\right\\|^{2}% \right]}$	(17)
	$\displaystyle=\mathbb{E}{\left[\left\\|\sum_{k=1}^{K_{i}}\eta_{l}\left(\nabla f% _{i}(\boldsymbol{x}^{t,k})-\nabla f_{i}(\boldsymbol{x}^{t})\right)\right\\|^{2}% \right]}\quad\quad\text{$\triangleright$ using Assumption~{}\ref{asp:% unbiasedness}}$
	$\displaystyle\leq L^{2}\eta_{l}^{2}\sum_{k=1}^{K_{i}}\mathbb{E}{\left\\|% \boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\\|^{2}}\quad\quad\text{$% \triangleright$ using Assumption~{}\ref{asp:smoothness}}$
	$\displaystyle\leq L^{2}\eta_{l}^{2}K_{i}\mathbb{E}{\left\\|\boldsymbol{x}_{i}^{% t,K_{i}}-\boldsymbol{x}^{t}\right\\|^{2}}.$

Next, using (15) at round $t-1$ to bound the last term, we have

	$\displaystyle\mathbb{E}\|\eta\boldsymbol{d}^{t-1}\|^{2}=\eta^{2}\mathbb{E}\|\boldsymbol{d}^{t-1}\|^{2}$	$\displaystyle\leq 2\eta^{2}\chi_{\tilde{\lambda}\|\lambda}^{2}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\boldsymbol{m}_{i}^{t-1}-\nabla f_{i}(\boldsymbol{x}^{t-1})\right\|^{2}\right]$		(18)
		$\displaystyle\quad+2\eta^{2}(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t-1})\right\|^{2}\right].$

Combining the equations, we have

		$\displaystyle\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left[\left\|\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right]}\leq\sum_{i=1}^{N}\lambda_{i}(1-\beta_{i})^{2}(1+2L^{2}\eta^{2}\chi_{\tilde{\lambda}\|\lambda}^{2})\mathbb{E}{\left[\left\|\boldsymbol{m}_{i}^{t-1}-\nabla f_{i}(\boldsymbol{x}^{t-1})\right\|^{2}\right]}$		(19)
		$\displaystyle\quad+\sum_{i=1}^{N}\lambda_{i}\beta_{i}^{2}L^{2}\eta_{l}^{2}K_{i}\mathbb{E}{\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}}+\left[\sum_{i=1}^{N}\lambda_{i}(1-\beta_{i})^{2}\right]2L^{2}\eta^{2}(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\cdot\left\|\nabla f(\boldsymbol{x}^{t-1})\right\|^{2}.$

As the last term of (19) typically vanishes over time $T$ due to factors $\beta$ and $\eta$ , we omit the last term and mainly focus on the effects of the second term. Unrolling the recursion for $i\in[N]$ , we have

	$\displaystyle\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\\|\boldsymbol{m}_{i% }^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\\|^{2}\right]$	(20)
$\displaystyle\leq$	$\displaystyle\sum_{i=1}^{N}\lambda_{i}\sum_{\tau=1}^{t}\left((1-\beta_{i})^{2}% (1+2L^{2}\eta^{2}\chi_{\tilde{\lambda}\\|\lambda}^{2})\right)^{\tau-1}\beta_{i}% ^{2}L^{2}\eta_{l}^{2}K_{i}\mathbb{E}{\left\\|\boldsymbol{x}_{i}^{t,K_{i}}-% \boldsymbol{x}^{t}\right\\|^{2}}$
$\displaystyle\leq$	$\displaystyle\sum_{i=1}^{N}\lambda_{i}t\max_{\tau\in[t]}\left((1-\beta_{i})^{2% }(1+2L^{2}\eta^{2}\chi_{\tilde{\lambda}\\|\lambda}^{2})\right)^{\tau-1}\beta_{i% }^{2}L^{2}\eta_{l}^{2}K_{i}\mathbb{E}{\left\\|\boldsymbol{x}_{i}^{t,K_{i}}-% \boldsymbol{x}^{t}\right\\|^{2}}$
$\displaystyle\leq$	$\displaystyle\sum_{i=1}^{N}\lambda_{i}t\beta_{i}^{2}L^{2}\eta_{l}^{2}K_{i}% \mathbb{E}{\left\\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\\|^{2}},$

where we let $\mathbb{E}\left[\left\|\boldsymbol{m}_{i}^{0}-\nabla f_{i}(\boldsymbol{x}^{0})% \right\|^{2}\right]=0$ and $(1-\beta_{i})^{2}(1+2L^{2}\eta^{2}\chi_{\tilde{\lambda}\|\lambda}^{2})\leq 1$ . To further optimize the upper bound, we let $\eta_{l}\leq\frac{1}{\sqrt{tK_{i}}}$ and $\beta=\beta_{1}=\dots=\beta_{N}$ to conclude the proof:

\displaystyle\sum_{i=1}^{N}\lambda_{i}\mathbb{E}\left[\left\|\boldsymbol{m}_{i% }^{t}-\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right]\leq\beta^{2}L^{2}% \sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left\|\boldsymbol{x}_{i}^{t,K_{i}}-% \boldsymbol{x}^{t}\right\|^{2}}.

(21)

Remark. The condition $\mathbb{E}\left[\left\|\boldsymbol{m}_{i}^{0}-\nabla f_{i}(\boldsymbol{x}^{0})\right\|^{2}\right]=0$ denotes error-free initialization of the server-stored local momentum. Typically, this can be implemented by a single round of full participation at the start of FL. Otherwise, it only induces an additional constant factor in the approximation error without breaking our analysis. Besides, the condition $(1-\beta_{i})^{2}(1+2L^{2}\eta^{2}\chi_{\tilde{\lambda}\|\lambda}^{2})\leq 1$ can be easily satisfied by choosing $\eta$ and $\alpha$ appropriately.
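The momentum recursion analysed in Proof 4 can be sketched as follows: a sampled client's server-stored momentum is refreshed with its uploaded update, while an unsampled client's momentum is left untouched, so that in expectation the mixing coefficient is $\beta_{i}=(1-\alpha)p_{i}$. This is a minimal illustration only; the function names and the list-based vectors are assumptions, not the released implementation.

```python
def update_momentum(m_prev, g, alpha, participated):
    """One round of server-stored local momentum for a single client.

    m_prev: previous momentum (list of floats)
    g: uploaded local update (list of floats); ignored if the client was not sampled
    alpha: momentum coefficient in [0, 1)
    """
    if not participated:
        return list(m_prev)  # unsampled clients keep their stale momentum
    return [alpha * m + (1.0 - alpha) * gi for m, gi in zip(m_prev, g)]


def expected_momentum(m_prev, g, alpha, p):
    """Expectation over participation with probability p (g treated as fixed)."""
    beta = (1.0 - alpha) * p
    return [(1.0 - beta) * m + beta * gi for m, gi in zip(m_prev, g)]


# Hypothetical numbers: alpha = 0.5 and a 10% participation probability.
m_prev, g, alpha, p = [1.0, -2.0], [3.0, 0.0], 0.5, 0.1
kept = update_momentum(m_prev, g, alpha, participated=False)
refreshed = update_momentum(m_prev, g, alpha, participated=True)

# Mixing the two outcomes with probabilities (1 - p, p) recovers the beta form.
mixture = [(1.0 - p) * a + p * b for a, b in zip(kept, refreshed)]
expected = expected_momentum(m_prev, g, alpha, p)
assert all(abs(a - b) < 1e-12 for a, b in zip(mixture, expected))
```

With a small participation probability, $\beta_{i}$ is small and the stored momentum changes slowly, which is why the recursion in (19) contracts with factor $(1-\beta_{i})^{2}$.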

Putting together. Substituting (13) with corresponding terms (14) and (15), we have

		$\displaystyle\quad\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t+1})\right]-f(% \boldsymbol{x}^{t})}{\eta}\leq-\left(1-\sqrt{N+\chi_{\tilde{\lambda}\\|\lambda}% ^{2}-2\chi_{\tilde{\lambda}\\|\lambda}}-(N+\chi_{\tilde{\lambda}\\|\lambda}^{2}-% 2\chi_{\tilde{\lambda}\\|\lambda})\right)\mathbb{E}\left[\left\\|\nabla f(% \boldsymbol{x}^{t})\right\\|^{2}\right]$
		$\displaystyle+\sqrt{\chi_{\tilde{\lambda}\\|\lambda}^{2}}\sqrt{\sum_{i=1}^{N}% \lambda_{i}\mathbb{E}\left[\left\\|\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol% {m}_{i}^{t}\right\\|^{2}\right]}\cdot\mathbb{E}\left\\|\nabla f(\boldsymbol{x}^{% t})\right\\|+L\eta\chi_{\tilde{\lambda}\\|\lambda}^{2}\sum_{i=1}^{N}\lambda_{i}% \mathbb{E}\left[\left\\|(\boldsymbol{m}_{i}^{t}-\nabla f_{i}(\boldsymbol{x}^{t}% ))\right\\|^{2}\right].$

Then, substituting the momentum approximation with (21) and reorganising the terms, we have

		$\displaystyle\quad\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t+1})\right]-f(\boldsymbol{x}^{t})}{\eta}\leq-\left(1-\sqrt{N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda}}-(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\right)\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]$		(22)
		$\displaystyle\quad+\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}\sqrt{\beta^{2}L^{2}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}}}\cdot\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|+\eta\chi_{\tilde{\lambda}\|\lambda}^{2}\beta^{2}L^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}}.$

Inducing gradient diversity. In the main paper, we discussed how gradient diversity relates to the degree of statistical heterogeneity and to optimization quality. To demonstrate this relation, we introduce the gradient diversity into our convergence rate by replacing the first-order global gradient $\nabla f(\boldsymbol{x}^{t})$ on the right-hand side of (22). However, Algorithm 1 minimizes a surrogate objective $\tilde{f}(\boldsymbol{x})$, which is determined by $\tilde{\lambda}$. Therefore, we first clarify the convergence relation between the surrogate objective $\tilde{f}(\boldsymbol{x})$ and the original objective $f(\boldsymbol{x})$ in the lemma below.

Lemma C.2 (Surrogate convergence relation)

For all $x\in\mathcal{X}$ , the differences between the primary gradient $\nabla f(\boldsymbol{x})$ and surrogate gradient $\nabla\tilde{f}(\boldsymbol{x})=\sum_{i=1}^{N}\tilde{\lambda}_{i}\nabla f_{i}(% \boldsymbol{x})$ can be bounded:

Proof 5

According to the definition, we have

$\displaystyle\\|\nabla\tilde{f}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\\|^{2}$	$\displaystyle=\\|\sum_{i=1}^{N}(\tilde{\lambda}_{i}-\lambda_{i})\nabla f_{i}(% \boldsymbol{x})\\|^{2}$	(23)
	$\displaystyle\leq\left[\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}-\lambda_{i})^{% 2}}{\lambda_{i}^{2}}\right]\cdot\left[\left\\|\sum_{i=1}^{N}\lambda_{i}\nabla f% _{i}(\boldsymbol{x})\right\\|^{2}\right]$
	$\displaystyle\leq(N+\chi_{\tilde{\lambda}\\|\lambda}^{2}-2\chi_{\tilde{\lambda}% \\|\lambda})\left(\sigma_{g}^{2}+\left\\|\nabla f(\boldsymbol{x})\right\\|^{2}% \right).$

Then, we investigate

	$\displaystyle\\|\nabla\tilde{f}(\boldsymbol{x})\\|^{2}$	$\displaystyle\leq 2\\|\nabla\tilde{f}(\boldsymbol{x})-\nabla f(\boldsymbol{x})% \\|^{2}+2\\|\nabla f(\boldsymbol{x})\\|^{2}$
		$\displaystyle\leq 2(N+\chi_{\tilde{\lambda}\\|\lambda}^{2}-2\chi_{\tilde{% \lambda}\\|\lambda})\left(\sigma_{g}^{2}+\left\\|\nabla f(\boldsymbol{x})\right% \\|^{2}\right)+2\\|\nabla f(\boldsymbol{x})\\|^{2}$
		$\displaystyle\leq 2(1+N+\chi_{\tilde{\lambda}\\|\lambda}^{2}-2\chi_{\tilde{% \lambda}\\|\lambda})\left\\|\nabla f(\boldsymbol{x})\right\\|^{2}+2(N+\chi_{% \tilde{\lambda}\\|\lambda}^{2}-2\chi_{\tilde{\lambda}\\|\lambda})\sigma_{g}^{2}.$

Therefore, we have

	$\displaystyle\min_{t\in[T]}\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}$	$\displaystyle\leq\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}$		(24)
		$\displaystyle\leq 2(1+N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\frac{1}{T}\sum_{t=0}^{T-1}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+2(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{\lambda}\|\lambda})\sigma_{g}^{2}.$

Remark. Inequality (24) shows the convergence relation, which is largely determined by $\lambda$ and $\tilde{\lambda}$. Importantly, we are interested in the case $\|\nabla f(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}$, which indicates that $\lambda$ is better than $\tilde{\lambda}$. In this case, we can minimize the right-hand side of (24) by tuning $\tilde{\lambda}$. In contrast, the case $\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla f(\boldsymbol{x}^{t})\|^{2}$ indicates that Algorithm 1 converges to stationary points with a lower surrogate gradient norm. From the perspective of non-convex analysis, this denotes a better optimization result.

In this paper, we assume that $\lambda$ is the best weight for federated optimization, which implies $\|\nabla f(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|^{2}$. Therefore, we can connect the surrogate gradient diversity with the original objective:

Corollary C.1

Connecting the assumption $\|\nabla f(\boldsymbol{x}^{t})\|^{2}\leq\|\nabla\tilde{f}(\boldsymbol{x}^{t})% \|^{2}$ and Definition 3.2, we have

\|\nabla f(\boldsymbol{x}^{t})\|\leq\|\nabla\tilde{f}(\boldsymbol{x}^{t})\|\leq\frac{\sqrt{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x}^{t})\|^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\leq\frac{\tilde{G}}{\tilde{D}(\boldsymbol{x}^{t})},

(25)

where we use $\tilde{G}=\max_{t\in[T]}\sqrt{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(% \boldsymbol{x}^{t})\|^{2}}$ .

Substituting inequality (25) into (22) and rearranging the terms, we have

		$\displaystyle\quad\left(1-\sqrt{N+\chi_{\tilde{\lambda}\\|\lambda}^{2}-2\chi_{% \tilde{\lambda}\\|\lambda}}-(N+\chi_{\tilde{\lambda}\\|\lambda}^{2}-2\chi_{% \tilde{\lambda}\\|\lambda})\right)\mathbb{E}\left[\left\\|\nabla f(\boldsymbol{x% }^{t})\right\\|^{2}\right]$		(26)
		$\displaystyle\leq\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t})\right]-f(% \boldsymbol{x}^{t+1})}{\eta}+\eta\chi_{\tilde{\lambda}\\|\lambda}^{2}\beta^{2}L% ^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left\\|\boldsymbol{x}_{i}^{t,K_{i}}-% \boldsymbol{x}^{t}\right\\|^{2}}+\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{% \lambda}\\|\lambda}^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\sqrt{\sum_{i=1}^{N}% \lambda_{i}\mathbb{E}{\left\\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}% \right\\|^{2}}}$
		$\displaystyle\leq\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t})\right]-f(% \boldsymbol{x}^{t+1})}{\eta}+\eta\chi_{\tilde{\lambda}\\|\lambda}^{2}\beta^{2}L% ^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left\\|\boldsymbol{x}_{i}^{t,K_{i}}-% \boldsymbol{x}^{t}\right\\|^{2}}$
		$\displaystyle\quad+\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\\|\lambda}% ^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\sqrt{5\eta_{l}(\sigma_{l}^{2}+6K_{\text{% min}}^{2}\sigma_{g}^{2}+6\tilde{K}\\|\nabla f(\boldsymbol{x}^{t})\\|^{2})+\Phi_{% \text{Hetero}}}$
		$\displaystyle\leq\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t})\right]-f(% \boldsymbol{x}^{t+1})}{\eta}+\eta\chi_{\tilde{\lambda}\\|\lambda}^{2}\beta^{2}L% ^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left\\|\boldsymbol{x}_{i}^{t,K_{i}}-% \boldsymbol{x}^{t}\right\\|^{2}}$
		$\displaystyle\quad+\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\\|\lambda}% ^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\sqrt{5\eta_{l}(\sigma_{l}^{2}+6K_{\text{% min}}^{2}\sigma_{g}^{2})+\Phi_{\text{Hetero}}}+\frac{6\tilde{K}\beta L\tilde{G% }\sqrt{\chi_{\tilde{\lambda}\\|\lambda}^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\\|% \nabla f(\boldsymbol{x}^{t})\\|^{2},$

where the last inequality replaces the averaged local updates term with Corollary 3.2.

Then, rearranging the above equation, we have

		$\displaystyle\rho^{t}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]\leq\underbrace{\frac{\mathbb{E}\left[f(\boldsymbol{x}^{t})\right]-f(\boldsymbol{x}^{t+1})}{\eta}+\eta\chi_{\tilde{\lambda}\|\lambda}^{2}\beta^{2}L^{3}\sum_{i=1}^{N}\lambda_{i}\mathbb{E}{\left\|\boldsymbol{x}_{i}^{t,K_{i}}-\boldsymbol{x}^{t}\right\|^{2}}}_{\text{Vanishing terms about $\eta$}}$		(27)
		$\displaystyle\quad+\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{\tilde{D}(\boldsymbol{x}^{t})}\sqrt{5\eta_{l}(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2})+\Phi_{\text{Hetero}}},$

where we let $\rho^{t}=\left(1-\sqrt{N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{% \lambda}\|\lambda}}-(N+\chi_{\tilde{\lambda}\|\lambda}^{2}-2\chi_{\tilde{% \lambda}\|\lambda})-6\tilde{K}\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|% \lambda}^{2}}/\tilde{D}(\boldsymbol{x}^{t})\right)>0$ for notation brevity.

Finally, taking the full expectation on both sides, summing over $t=0$ to $T-1$, and rearranging terms, we have

	$\displaystyle\min_{t\in[T]}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]$	$\displaystyle\leq\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]$		(28)
		$\displaystyle\leq\epsilon_{\text{opt}}+\sum_{t=0}^{T-1}\frac{\beta L\tilde{G}\sqrt{\chi_{\tilde{\lambda}\|\lambda}^{2}}}{T\rho^{t}\tilde{D}(\boldsymbol{x}^{t})}\sqrt{\sigma^{2}+\Phi_{\text{Hetero}}},$

where we use $\epsilon_{\text{opt}}$ to absorb vanishing terms in (27) and $\sigma^{2}=5\eta_{l}(\sigma_{l}^{2}+6K_{\text{min}}^{2}\sigma_{g}^{2})$ .

Summary. Analogous to previous non-convex analyses, the term $\epsilon_{\text{opt}}$ in (22) will result in a sub-linear convergence rate $\mathcal{O}(1/\sqrt{T})$ with a proper setting of $\eta$, and is therefore omitted in the final bound. In contrast, we focus on the impact of the heterogeneity terms on the convergence results. Convergence to stationary points with large $\tilde{D}(\boldsymbol{x}^{t})$ mitigates the convergence error induced by the heterogeneity terms $\sqrt{\sigma^{2}+\Phi_{\text{Hetero}}}$.

Appendix D Discussion

D.1 Case illustration on gradient diversity maximization direction

To better illustrate this concept, we consider an FL system with two clients, where the gradient diversity is computed as follows:

\tilde{D}(\boldsymbol{x})=\sqrt{\frac{\lambda_{1}\|\nabla f_{1}(\boldsymbol{x}% )\|^{2}+\lambda_{2}\|\nabla f_{2}(\boldsymbol{x})\|^{2}}{\|\tilde{\lambda}_{1}% \nabla f_{1}(\boldsymbol{x})+\tilde{\lambda}_{2}\nabla f_{2}(\boldsymbol{x})\|% ^{2}}}.

(29)

We note that the server cannot access local data samples. In each round, it only observes the local updates uploaded from clients. Hence, given a fixed point $\boldsymbol{x}$, maximizing gradient diversity at the server is equivalent to minimizing the denominator of (6). As a result, a global gradient direction that maximizes the gradient diversity leads to the solution shown in Figure 5 (see footnote 2).

Footnote 2: We exclude the special solutions $\{\lambda_{1},\lambda_{2}\}=\{1,0\}$ if $\nabla f_{1}(\boldsymbol{x})^{\top}\nabla f_{2}(\boldsymbol{x})\geq\nabla f_{1}(\boldsymbol{x})^{\top}\nabla f_{1}(\boldsymbol{x})$, and $\{\lambda_{1},\lambda_{2}\}=\{0,1\}$ if $\nabla f_{1}(\boldsymbol{x})^{\top}\nabla f_{2}(\boldsymbol{x})\geq\nabla f_{2}(\boldsymbol{x})^{\top}\nabla f_{2}(\boldsymbol{x})$. They indicate a low degree of statistical heterogeneity.
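For the two-client case, minimizing the denominator of (29) over weights on the probability simplex has a well-known closed form (the two-vector minimum-norm problem). The sketch below computes these weights and the resulting value of (29); clipping to $[0,1]$ produces the boundary solutions $\{1,0\}$ and $\{0,1\}$. The function names are hypothetical, plain Python lists stand in for gradient vectors, and uniform $\lambda$ is an assumed default.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_client_weights(g1, g2):
    """Weights (w1, w2) on the simplex minimizing ||w1*g1 + w2*g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:  # identical gradients: every weighting gives the same norm
        return 0.5, 0.5
    # Unconstrained minimizer of ||g2 + w1*(g1 - g2)||^2, clipped to [0, 1].
    w1 = min(1.0, max(0.0, -dot(g2, diff) / denom))
    return w1, 1.0 - w1


def surrogate_diversity(g1, g2, lam=(0.5, 0.5)):
    """Evaluate (29) using the min-norm weights in the denominator."""
    w1, w2 = two_client_weights(g1, g2)
    d = [w1 * a + w2 * b for a, b in zip(g1, g2)]
    numerator = lam[0] * dot(g1, g1) + lam[1] * dot(g2, g2)
    denominator = dot(d, d)
    return math.inf if denominator == 0.0 else math.sqrt(numerator / denominator)


# Hypothetical orthogonal gradients with unequal norms.
g1, g2 = [2.0, 0.0], [0.0, 1.0]
print(two_client_weights(g1, g2))   # more weight goes to the shorter gradient
print(surrogate_diversity(g1, g2))
```

In this example the min-norm weights favour the client with the smaller gradient norm, which shrinks the denominator of (29) relative to uniform averaging and therefore raises the surrogate diversity.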

D.2 Connection with multi-objective optimization (MOO)

We note that (7) is related to a sub-problem from the MOO literature [38, 11, 60]. However, Algorithm 1 is a federated optimization algorithm that differs from MOO in several respects. Specifically, FedAWARE violates the basic Pareto partial order restriction in MOO. Besides, FedAWARE meets basic protocols in FL, including partial client participation and decoupled server/client optimization procedures, while MOO requires full participation and local first-order gradients in the FL context. In the FL literature, FedMGDA+ [11] also involves a variant of (7) in federated optimization, but it focuses on FL fairness rather than heterogeneity. We found it to be inefficient against hybrid heterogeneity issues, as shown in Figure 6.
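In the MOO literature, a min-norm sub-problem of this kind is commonly solved with Frank-Wolfe iterations. The sketch below assumes that (7) takes the form $\min_{\tilde{\lambda}\in\Delta}\|\sum_{i}\tilde{\lambda}_{i}\boldsymbol{m}_{i}\|^{2}$ over the probability simplex $\Delta$. This form, the function name, and the iteration budget are assumptions made for illustration and may differ from the solver used in the released code.

```python
def min_norm_weights(vectors, iters=100, tol=1e-12):
    """Frank-Wolfe sketch for min_w ||sum_i w_i * v_i||^2 with w on the simplex."""
    n = len(vectors)
    # Gram matrix of pairwise inner products.
    gram = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    w = [1.0 / n] * n
    for _ in range(iters):
        # Gradient of 0.5 * w^T G w with respect to w.
        grad = [sum(gram[i][k] * w[k] for k in range(n)) for i in range(n)]
        j = min(range(n), key=grad.__getitem__)  # best simplex vertex
        d = [-wi for wi in w]
        d[j] += 1.0                              # direction e_j - w
        gd = [sum(gram[i][k] * d[k] for k in range(n)) for i in range(n)]
        curvature = sum(di * gdi for di, gdi in zip(d, gd))
        if curvature <= tol:
            break
        slope = sum(gi * di for gi, di in zip(grad, d))
        step = min(1.0, max(0.0, -slope / curvature))  # exact line search
        if step == 0.0:
            break
        w = [wi + step * di for wi, di in zip(w, d)]
    return w


# Hypothetical momentum vectors for two clients.
weights = min_norm_weights([[2.0, 0.0], [0.0, 1.0]])
print(weights)
```

Only inner products between the stored vectors are needed, so the cost of each iteration depends on the number of clients rather than on the model dimension once the Gram matrix is built.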

D.3 Limitations and feasible solutions

The main limitation is that running Algorithm 1 requires the server to store the local momentum of clients. We argue that this memory consumption is a worthwhile cost, since FedAWARE can be used to enhance other algorithms.

In practical settings, especially cross-silo settings, the FL server typically has sufficient storage and computing resources. Besides, the hardware requirements can be alleviated by the following engineering options:

Computing (7) with the last few layers of a neural network. This is because numerous studies [49, 19, 4] have shown that the last layers contain crucial network information. Hence, we can trade off the memory storage and quality of the surrogate.

Adopting client clustering techniques [37, 21, 30]. We can cluster clients according to their similarity and store the momentum of cluster-averaged gradients. The server then only needs storage proportional to the number of client clusters.

D.4 What is the gap between gradient diversity and surrogate gradient diversity?

The gap changes dynamically in practice. We explain it from empirical and theoretical perspectives. Empirically, please refer to the curves of FedAWARE and FedAvg in Figure 3. Theoretically, we provide an explicit bound on the gap (related to Lemma C.2). If we assume $\lambda$ is the optimal weight ( $\|\nabla f(\boldsymbol{x})\|\leq\|\nabla\tilde{f}(\boldsymbol{x})\|$ ), we have

	$\displaystyle(D(\boldsymbol{x})-\tilde{D}(\boldsymbol{x}))^{2}$	$\displaystyle=D^{2}(\boldsymbol{x})+\tilde{D}^{2}(\boldsymbol{x})-2D(\boldsymbol{x})\tilde{D}(\boldsymbol{x})$
		$\displaystyle=\frac{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}+\frac{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla\tilde{f}(\boldsymbol{x})\|^{2}}-2\frac{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|\|\nabla\tilde{f}(\boldsymbol{x})\|}$
		$\displaystyle\leq\tilde{G}^{2}\left(\frac{1}{\|\nabla f(\boldsymbol{x})\|^{2}}+\frac{1}{\|\nabla\tilde{f}(\boldsymbol{x})\|^{2}}-2\frac{1}{\|\nabla f(\boldsymbol{x})\|\|\nabla\tilde{f}(\boldsymbol{x})\|}\right)$
		$\displaystyle=\tilde{G}^{2}\left(\frac{1}{\|\nabla f(\boldsymbol{x})\|}-\frac{1}{\|\nabla\tilde{f}(\boldsymbol{x})\|}\right)^{2}$
		$\displaystyle=\tilde{G}^{2}\left(\frac{\|\nabla\tilde{f}(\boldsymbol{x})\|-\|\nabla f(\boldsymbol{x})\|}{\|\nabla f(\boldsymbol{x})\|\|\nabla\tilde{f}(\boldsymbol{x})\|}\right)^{2}$
		$\displaystyle\leq\frac{\tilde{G}^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}(\|\nabla\tilde{f}(\boldsymbol{x})\|-\|\nabla f(\boldsymbol{x})\|)^{2}$
		$\displaystyle\leq\frac{\tilde{G}^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}\underbrace{\|\nabla\tilde{f}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\|^{2}}_{(23)},$

where $\tilde{G}=\max_{t\in[T]}\sqrt{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(% \boldsymbol{x}^{t})\|^{2}}$ . This inequality shows that the gap $D(\boldsymbol{x})-\tilde{D}(\boldsymbol{x})$ is primarily decided by the global objectives with different weights $\lambda$ or $\tilde{\lambda}$ .

D.5 The implications of zero vector in FedAWARE

As FedAWARE seeks the minimum of $\|\boldsymbol{d}^{t}\|$ during training, the case $\|\boldsymbol{d}^{t}\|=0$ indicates that all the momentum vectors are linearly dependent [8]. This case means that we have found a saddle point or a local optimum (possibly the global optimum) of the surrogate objective. Following stochastic optimization theory, non-convex optimizers [18, 7] typically use stochasticity and Nesterov momentum to let the model escape from saddle points. Analogously, FedAWARE uses random client sampling and local momentum to guarantee the convergence quality. Moreover, we note that client-side optimization typically uses non-convex optimizers directly. Therefore, it can help the server-side optimization escape from saddle points.

D.6 Feasibility of combining with faster local solvers

Enhancing the convergence rate can be achieved through a combination of improved local computing methods, as our server-side optimizer operates independently of the local update modification process. For instance, FedNova [42] has explored the benefits of employing faster local solvers, such as those leveraging proximal gradients, gradient tracking, adaptive learning rates, momentum, and more, within the context of FedAvg. While our work primarily addresses server-side optimization in the presence of unbalanced local updates from local SGD, determining the optimal combination of local solvers remains a direction for future research.

D.7 Discussion on novelty

This work is the first to propose gradient diversity maximization on server-side optimization against heterogeneity issues in FL. We provide additional discussion in this section to clarify our novelty.

Novelty in tackling heterogeneity. FedAWARE addresses hybrid heterogeneity by adjusting the denominator factor $\tilde{D}(\boldsymbol{x}^{t})$ in Theorem 4.1. In contrast, previous works reduce the numerator terms of Theorem 4.1, which capture the impacts of hybrid heterogeneity. As discussed in Section 3.2, previous works minimize the upper bound of (5) to improve optimization quality. These works typically reduce the impacts of hybrid heterogeneity by manipulating the local computation process. FedAWARE instead captures the gradient diversity trends and uses them to design global updates without manipulating local updates. This approach provides a promising perspective on addressing heterogeneity issues in FL.

Novelty in convergence analyses. The convergence analysis of FedAWARE introduces novel elements compared to previous federated optimization algorithms. While previous algorithms, such as FedAvg [33], FedProx [26], and FedYogi [35], assume a static global objective with uniform weights, FedAWARE analyzes the convergence of a surrogate objective with weights $\tilde{\lambda}$ based on stored momentum information. We use the term $\chi$ to quantify how this surrogate objective differs from traditional federated optimization objectives. Furthermore, since FedAWARE does not manipulate the local computation process, we focus on the unbalanced local update steps, which absorb factors such as local training batch size, epochs, and local datasets. This approach allows us to address the concerns raised by FedNova [42] regarding objective inconsistency.

Suggesting training dynamics of gradient diversity. In our analysis, we first demonstrate the importance of gradient diversity dynamics in the convergence analysis. Moreover, it is important to note that gradient diversity is not the sole factor for mitigating the impacts of heterogeneity. According to the convergence result in Theorem 4.1, the upper bound is determined by multiple factors, where gradient diversity is one feasible factor to minimize the upper bound. Furthermore, we have established a connection between gradient diversity and common assumptions on statistical heterogeneity in Section 3.1. It is worth mentioning that previous works [26, 16, 42] take the upper bound of gradient diversity as a negative factor in the convergence analysis (see footnote 3). Therefore, while they address heterogeneity issues differently, they ignore the evolution of gradient diversity.

Footnote 3: Please refer to the notation $B$ of Theorem 4 in FedProx, the notation $B$ of Theorem 1 in SCAFFOLD, and the notation $\kappa$ of Theorem 1 in FedNova for evidence.

Future work. In this paper, we only apply the gradient diversity maximization findings to server-side optimization. Further collaboration between server-side and client-side optimization for gradient diversity maximization is left to future work. Moreover, we found that FedAWARE achieves better generalization performance. We plan to explain the theoretical relation between generalization performance and gradient diversity dynamics using generalization theory.

Appendix E Additional Experiments

E.1 Experiment Details

Platform. The experiments are implemented with FedLab [54]. Our experiments run on a Linux server with four 2080Ti GPUs.

Datasets and Model. Our experiments mainly follow prior works [45, 1, 35, 56], including the image classification datasets MNIST, Fashion-MNIST, and CIFAR10, and the text classification dataset AGNews [57]. For the MNIST and Fashion-MNIST tasks, we use fully connected neural network architectures with 2 hidden layers. For the CIFAR10 task, we use a CNN model, similar to [33], consisting of 3 convolutional layers with 64 3 $\times$ 3 filters followed by 2 fully connected layers with 1024 and 64 neurons, and a softmax layer. For the AGNews task, we choose Pythia-70M from the GPT2-based large language model family Pythia [3].

Hyperparameters. For the server, we set the client participation rate to $10\%$ and use $\eta=1$ for FedAvg, FedAvgM, FedProx, SCAFFOLD, FedNova, FedDyn, and FedAWARE. For the momentum parameter of FedAvgM, we select it from $\{0.7,0.9,0.97,0.997\}$ following the original paper. For the weight of the penalty term in FedProx, we tune it over the grid $\{0.01,0.1,1,10\}$. For FedYogi, we set the momentum parameter $\beta_{1}=0.9$, the second-moment parameter $\beta_{2}=0.99$, and the adaptivity $\tau=10^{-4}$ following the original paper. Besides, we select $\eta$ for FedYogi by grid search over $\{10^{-3},10^{-2.5},10^{-2},\dots,10^{1}\}$. The parameter of FedDyn is chosen from $\{0.1,0.01,0.001\}$ following the original paper. For FedAMS, we set $\beta_{1}=0.9,\beta_{2}=0.99$ following the original paper. Then, we grid search for the best global learning rate $\eta\in\{1,10^{-1},10^{-2},10^{-3},10^{-4}\}$ and the best stabilization term $\epsilon\in\{10^{-8},10^{-4},10^{-3},10^{-2},10^{-1}\}$. For FedAWARE, we set $\alpha=0.5$. For local training, we set the learning rate $\eta_{l}=0.01$, batch size $64$, and local epochs $3$ for all clients under the pathological and Dirichlet settings. We report the best performance of these algorithms.
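For reference, the search grids described above can be written down as a small configuration sketch. The dictionary layout and key names are hypothetical (they are not FedLab options or names from the released code); only the numerical values are taken from the text.

```python
# FedYogi server learning-rate grid: 10^-3, 10^-2.5, ..., 10^1 (nine values).
fedyogi_eta_grid = [10 ** (-3 + 0.5 * k) for k in range(9)]

# Hypothetical layout: lists are search grids, scalars are fixed settings.
search_space = {
    "FedAvgM": {"eta": 1, "momentum": [0.7, 0.9, 0.97, 0.997]},
    "FedProx": {"eta": 1, "penalty_weight": [0.01, 0.1, 1, 10]},
    "FedYogi": {"beta1": 0.9, "beta2": 0.99, "tau": 1e-4, "eta": fedyogi_eta_grid},
    "FedDyn": {"eta": 1, "param": [0.1, 0.01, 0.001]},
    "FedAMS": {
        "beta1": 0.9,
        "beta2": 0.99,
        "eta": [1, 1e-1, 1e-2, 1e-3, 1e-4],
        "epsilon": [1e-8, 1e-4, 1e-3, 1e-2, 1e-1],
    },
    "FedAWARE": {"eta": 1, "alpha": 0.5},
}

# Settings shared by all clients and algorithms.
common = {
    "participation_rate": 0.10,
    "local_lr": 0.01,
    "batch_size": 64,
    "local_epochs": 3,
}

print(len(fedyogi_eta_grid), fedyogi_eta_grid[0], fedyogi_eta_grid[-1])
```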

Implementation of gradient diversity dynamics. To fairly compare the gradient diversity dynamics of all algorithms, we estimate the gradient diversity as follows:

\hat{D}(\boldsymbol{x}^{t}):=\sqrt{\frac{\sum_{i\in S^{t}}\|\boldsymbol{g}_{i}% ^{t}\|^{2}}{\|\sum_{i\in S^{t}}\boldsymbol{g}_{i}^{t}\|^{2}}}=\sqrt{\frac{\sum% _{i\in S^{t}}\|\boldsymbol{g}_{i}^{t}\|^{2}}{\sum_{i\in S^{t}}\|\boldsymbol{g}% _{i}^{t}\|^{2}+\sum_{i\neq j}\langle\boldsymbol{g}_{i}^{t},\boldsymbol{g}_{j}^% {t}\rangle}},

(30)

where $S^{t}$ is the set of clients selected at the $t$-th round and $\boldsymbol{g}_{i}^{t}$ is the local update uploaded by selected client $i$. We report the estimated value $\hat{D}(\boldsymbol{x}^{t})$ in our experiment figures. This estimate is practical because it matches the client sampling protocol of federated learning and avoids accessing local first-order gradients, which would raise privacy issues [61]. Moreover, we argue that $\hat{D}(\boldsymbol{x}^{t})$ shares with the gradient diversity the ability to capture the directional relation among local updates. For example, it is large when the inner products between the local updates of different clients are small. In particular, the estimated gradient diversity is large when the updates are almost orthogonal or even point in opposite directions. Most importantly, since the gradient diversity dynamics of all algorithms are computed in the same way, their values faithfully indicate the stationary points in their landscape.
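The estimate in Equation (30) can be sketched in a few lines of Python. This is an illustrative implementation under the assumption that each uploaded update is available on the server as a flat list of floats; the function name `estimated_gradient_diversity` and the toy updates are hypothetical and not taken from the released code.

```python
import math


def estimated_gradient_diversity(updates):
    """Estimate D-hat from the local updates of the selected clients.

    `updates` is a list of equal-length lists of floats, one per client.
    Returns sqrt(sum_i ||g_i||^2 / ||sum_i g_i||^2).
    """
    sum_sq_norms = sum(sum(v * v for v in g) for g in updates)
    summed = [sum(coords) for coords in zip(*updates)]
    sq_norm_of_sum = sum(v * v for v in summed)
    if sq_norm_of_sum == 0.0:
        # The updates cancel exactly, so the ratio is unbounded.
        return math.inf
    return math.sqrt(sum_sq_norms / sq_norm_of_sum)


# Toy two-client examples with two-dimensional updates.
aligned = estimated_gradient_diversity([[1.0, 0.0], [1.0, 0.0]])
orthogonal = estimated_gradient_diversity([[1.0, 0.0], [0.0, 1.0]])
opposed = estimated_gradient_diversity([[1.0, 0.0], [-0.9, 0.0]])
```

In these toy cases the value grows from aligned to orthogonal to nearly opposed updates, which matches the qualitative behavior described above.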

E.2 Experiments on MNIST/FashionMNIST

We provide additional experiments on MNIST [6] and FashionMNIST [47] datasets using the same data partition settings as CIFAR10 experiments. The results of the MNIST task are shown in Figure 7; the results of the FashionMNIST task are shown in Figure 8.

Discussion on MNIST task. According to the training loss, FedAWARE is slightly slower than FedYogi and FedAMS in the MNIST task. Besides, FedAWARE converges to comparable stationary points with similar training loss and test accuracy at the end of training. Most importantly, the training dynamics of gradient diversity curves show that FedAWARE finds a series of stationary points with large gradient diversity.

Discussion on FashionMNIST task. The main conclusions of the FashionMNIST task are similar to those of the MNIST task. The main difference is that FedAWARE achieves a convergence speed comparable to FedYogi and FedAMS.

Overall, our conclusions in the main paper still hold on the federated MNIST/FashionMNIST tasks.

E.3 Experiments on AGNews

We choose the GPT2-based pretrained large language model Pythia-70M [3] for the AGNews task [57]. AGNews is a text classification task with 4 labels, consisting of 119,999 training samples and 7,599 test samples. We partition the AGNews dataset across 100 clients using the Dirichlet partition strategy with a concentration hyperparameter of 0.1. For the training hyperparameters, we set the local batch size to 32, the number of local epochs to 1, the learning rate to $10^{-4}$, and $\alpha=0.99$ for FedAWARE. We conduct the gradient projection experiment, and the results are shown in Figure 4.
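For readers unfamiliar with Dirichlet partitioning, the following is a minimal sketch of one common label-wise variant, using only the Python standard library. It is an assumption about how such a split can be produced, not the partitioning code used in our experiments; the function `dirichlet_partition`, the seed, and the toy labels are hypothetical.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, class by class.

    For each class, client proportions are drawn from Dirichlet(alpha, ..., alpha),
    sampled here by normalizing independent Gamma(alpha, 1) draws.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        if total == 0.0:
            # Degenerate draw (numerical underflow): fall back to a uniform split.
            weights, total = [1.0] * num_clients, float(num_clients)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += weights[c]
            if c == num_clients - 1:
                end = len(idxs)  # the last client takes the remainder
            else:
                cut = int(round(cumulative / total * len(idxs)))
                end = min(len(idxs), max(start, cut))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Toy stand-in for a 4-class dataset, split across 100 clients with alpha = 0.1.
toy_labels = [i % 4 for i in range(2000)]
parts = dirichlet_partition(toy_labels, num_clients=100, alpha=0.1)
```

A small concentration value such as 0.1 places most of each class on a few clients, which is what makes the resulting split strongly heterogeneous.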

$$\begin{aligned}
& \sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})-\nabla f(\boldsymbol{x})\right\|^{2}\leq\sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})\right\|^{2}-\|\nabla f(\boldsymbol{x})\|^{2}\leq\sigma_{g}^{2} \\
\Leftrightarrow\; & \sum_{i=1}^{N}\lambda_{i}\left\|\nabla f_{i}(\boldsymbol{x})\right\|^{2}\leq\sigma_{g}^{2}+\|\nabla f(\boldsymbol{x})\|^{2} \\
\Rightarrow\; & D(\boldsymbol{x})=\sqrt{\frac{\sum_{i=1}^{N}\lambda_{i}\|\nabla f_{i}(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}\leq\sqrt{\frac{\sigma_{g}^{2}+\|\nabla f(\boldsymbol{x})\|^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}\leq\sqrt{1+\frac{\sigma_{g}^{2}}{\|\nabla f(\boldsymbol{x})\|^{2}}}.
\end{aligned}$$
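The chain above can be checked numerically on random data. The sketch below assumes the weights $\lambda_{i}$ sum to one and takes $\sigma_{g}^{2}$ to be the exact weighted dissimilarity of the sampled gradients, in which case the bound on $D(\boldsymbol{x})$ is attained with equality; all variable names and the random gradients are illustrative.

```python
import math
import random

rng = random.Random(0)
num_clients, dim = 5, 3

# Random positive weights normalized to sum to one, and random local gradients.
raw = [rng.random() + 0.1 for _ in range(num_clients)]
lam = [r / sum(raw) for r in raw]
grads = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_clients)]


def sq_norm(v):
    return sum(a * a for a in v)


# Global gradient: the lambda-weighted average of the local gradients.
global_grad = [sum(lam[i] * grads[i][d] for i in range(num_clients)) for d in range(dim)]

weighted_sq = sum(lam[i] * sq_norm(grads[i]) for i in range(num_clients))
dissimilarity = sum(
    lam[i] * sq_norm([a - b for a, b in zip(grads[i], global_grad)])
    for i in range(num_clients)
)

diversity = math.sqrt(weighted_sq / sq_norm(global_grad))
bound = math.sqrt(1.0 + dissimilarity / sq_norm(global_grad))
```

With any larger value of $\sigma_{g}^{2}$ the right-hand side only grows, so the inequality continues to hold.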

$$\begin{aligned}
& \mathbb{E}\left[\left\|\boldsymbol{g}_{i}^{t}\right\|^{2}\right]=\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2}=\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}-\eta_{l}\nabla F_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2} \\
={} & \mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}-\eta_{l}\left(\nabla F_{i}(\boldsymbol{x}_{i}^{t,k-1})\pm\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right)\right\|^{2} \\
={} & \mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}-\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}+\mathbb{E}\left\|\eta_{l}\left(\nabla F_{i}(\boldsymbol{x}_{i}^{t,k-1})-\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right)\right\|^{2} \\
={} & \mathbb{E}\left[\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}-2\left\langle\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t},\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\rangle+\left\|\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}\right]+\eta_{l}^{2}\sigma_{l}^{2} \\
={} & \mathbb{E}\Big[\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}-2\Big\langle\frac{1}{\sqrt{2K_{i}-1}}(\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}),\sqrt{2K_{i}-1}\,\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\Big\rangle \\
& +\left\|\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}\Big]+\eta_{l}^{2}\sigma_{l}^{2} \\
\leq{} & \left(1+\frac{1}{2K_{i}-1}\right)\mathbb{E}\left[\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}\right]+2K_{i}\mathbb{E}\left[\left\|\eta_{l}\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\right\|^{2}\right]+\eta_{l}^{2}\sigma_{l}^{2} \\
={} & \left(1+\frac{1}{2K_{i}-1}\right)\mathbb{E}\left[\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}\right]+2K_{i}\mathbb{E}\left[\left\|\eta_{l}\left(\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})\pm\nabla f(\boldsymbol{x}^{t})\pm\nabla f_{i}(\boldsymbol{x}^{t})\right)\right\|^{2}\right]+\eta_{l}^{2}\sigma_{l}^{2} \\
\leq{} & \left(1+\frac{1}{2K_{i}-1}\right)\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}+6K_{i}\mathbb{E}\left[\left\|\eta_{l}\left(\nabla f_{i}(\boldsymbol{x}_{i}^{t,k-1})-\nabla f_{i}(\boldsymbol{x}^{t})\right)\right\|^{2}\right] \\
& +6K_{i}\mathbb{E}\left[\left\|\eta_{l}\nabla f_{i}(\boldsymbol{x}^{t})\right\|^{2}\right]+\eta_{l}^{2}\sigma_{l}^{2} \\
\leq{} & \left(1+\frac{1}{2K_{i}-1}+6K_{i}\eta_{l}^{2}L^{2}\right)\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k-1}-\boldsymbol{x}^{t}\right\|^{2}+\eta_{l}^{2}\left(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]\right)
\end{aligned}$$

$$\begin{aligned}
\mathbb{E}\left\|\boldsymbol{x}_{i}^{t,k}-\boldsymbol{x}^{t}\right\|^{2} & \leq\sum_{p=0}^{k-1}\left(1+\frac{1}{2K_{i}-1}+6K_{i}\eta_{l}^{2}L^{2}\right)^{p}\eta_{l}^{2}\left(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]\right) \\
& \leq(K_{i}-1)\left[\left(1+\frac{1}{K_{i}-1}\right)^{K_{i}}-1\right]\eta_{l}^{2}\left(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right]\right) \\
& \leq 5K_{i}\eta_{l}^{2}\left(\sigma_{l}^{2}+6K_{i}\zeta_{i}(\boldsymbol{x}^{t})+6K_{i}\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}\right)
\end{aligned}\tag{10}$$
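The last two steps of this bound rest on a geometric sum and on the constant $5K_{i}$. The sketch below checks both numerically, assuming $K_{i}\geq 2$ and a local learning rate small enough that the per-step growth factor is at most $1+\frac{1}{K_{i}-1}$; the function names and the tested range of $K_{i}$ are illustrative choices.

```python
def geometric_sum(num_local_steps, k):
    """Sum of r**p for p = 0..k-1 with ratio r = 1 + 1/(K - 1)."""
    ratio = 1.0 + 1.0 / (num_local_steps - 1)
    return sum(ratio ** p for p in range(k))


def closed_form(num_local_steps, k):
    """Closed form (K - 1) * [(1 + 1/(K - 1))**k - 1] of the same sum."""
    ratio = 1.0 + 1.0 / (num_local_steps - 1)
    return (num_local_steps - 1) * (ratio ** k - 1.0)


# Largest ratio of the closed form (evaluated at k = K) to K over a range of K.
worst_ratio = max(closed_form(K, K) / K for K in range(2, 201))
```

Over this range the ratio stays well below 5, so the constant in the final line is a loose but convenient upper bound.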

$$\begin{aligned}
f\left(\boldsymbol{x}^{t+1}\right) & =f\left(\boldsymbol{x}^{t}-\eta\boldsymbol{d}^{t}\right)\leq f\left(\boldsymbol{x}^{t}\right)-\eta\left\langle\nabla f\left(\boldsymbol{x}^{t}\right),\boldsymbol{d}^{t}\right\rangle+\frac{L}{2}\eta^{2}\left\|\boldsymbol{d}^{t}\right\|^{2} \\
& \leq f\left(\boldsymbol{x}^{t}\right)-\eta\left\langle\nabla f\left(\boldsymbol{x}^{t}\right),\boldsymbol{d}^{t}-\nabla f\left(\boldsymbol{x}^{t}\right)+\nabla f\left(\boldsymbol{x}^{t}\right)\right\rangle+\frac{L}{2}\eta^{2}\left\|\boldsymbol{d}^{t}\right\|^{2} \\
& \leq f\left(\boldsymbol{x}^{t}\right)-\eta\left\|\nabla f\left(\boldsymbol{x}^{t}\right)\right\|^{2}+\eta\left\langle\nabla f\left(\boldsymbol{x}^{t}\right),\nabla f\left(\boldsymbol{x}^{t}\right)-\boldsymbol{d}^{t}\right\rangle+\frac{L}{2}\eta^{2}\left\|\boldsymbol{d}^{t}\right\|^{2}.
\end{aligned}$$

$$\begin{aligned}
T_{1} & \leq\mathbb{E}\left[\left\|\nabla f(\boldsymbol{x}^{t})\right\|\left\|\nabla f(\boldsymbol{x}^{t})-\boldsymbol{d}^{t}\right\|\right]\leq\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})-\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\boldsymbol{m}_{i}^{t}\right\|\right] \\
& =\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})\pm\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\nabla f_{i}(\boldsymbol{x}^{t})-\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}\boldsymbol{m}_{i}^{t}\right\|\right] \\
& \leq\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}(\lambda_{i}-\tilde{\lambda}_{i}^{t})\nabla f_{i}(\boldsymbol{x}^{t})\right\|\right]+\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\tilde{\lambda}_{i}^{t}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|\right] \\
& \leq\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})}{\lambda_{i}}\cdot\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})\right\|\right]+\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\frac{\tilde{\lambda}_{i}^{t}}{\lambda_{i}}\cdot\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|\right] \\
& \leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}\nabla f_{i}(\boldsymbol{x}^{t})\right\|\right] \\
& \quad+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|\right] \\
& \leq\sqrt{\sum_{i=1}^{N}\frac{(\lambda_{i}-\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|^{2}+\sqrt{\sum_{i=1}^{N}\frac{(\tilde{\lambda}_{i}^{t})^{2}}{\lambda_{i}^{2}}}\,\mathbb{E}\left\|\nabla f(\boldsymbol{x}^{t})\right\|\mathbb{E}\left[\left\|\sum_{i=1}^{N}\lambda_{i}(\nabla f_{i}(\boldsymbol{x}^{t})-\boldsymbol{m}_{i}^{t})\right\|\right],
\end{aligned}$$
