License: CC Zero
arXiv:2401.04993v1 [cs.LG] 10 Jan 2024

AdaFed: Fair Federated Learning via Adaptive Common Descent Direction

Shayan Mohajer Hamidi [email protected]
Department of Electrical and Computer Engineering
University of Waterloo
En-Hui Yang [email protected]
Department of Electrical and Computer Engineering
University of Waterloo
Abstract

Federated learning (FL) is a promising technology through which edge devices/clients collaboratively train a machine learning model orchestrated by a server. Learning an unfair model is a known critical problem in federated learning, where the trained model may unfairly advantage or disadvantage some of the devices. To tackle this problem, we propose AdaFed. The goal of AdaFed is to find an updating direction for the server along which (i) all the clients’ loss functions are decreasing; and (ii) more importantly, the loss functions of the clients with larger values decrease at a higher rate. AdaFed adaptively tunes this common direction based on the values of local gradients and loss functions. We validate the effectiveness of AdaFed on a suite of federated datasets, and demonstrate that AdaFed outperforms state-of-the-art fair FL methods.

1 Introduction

Conventionally, a machine learning (ML) model is trained in a centralized manner, where the training data is available at a data center or a cloud server. However, in many new applications, devices do not want to share their private data with a remote server. As a remedy, federated learning (FL) was proposed in McMahan et al. (2017), where each device participates in training using only its locally available dataset with the help of a server. Specifically, in FL, devices share only their local updates with the server, and not their raw datasets. A well-known setup to carry out such decentralized training is FedAvg (McMahan et al., 2017), which combines local stochastic gradient descent (SGD) on each client with iterative model averaging. The server sends the most recent global model to some selected clients (Eichner et al., 2019; Wang et al., 2021a), and these clients then perform a number of epochs of local SGD on their local training data and send the local gradients back to the central server. The server then computes the (weighted) average of the gradients to update the global model, and the process repeats.

In FedAvg, the vector of averaged gradients computed by the server is in fact a common direction along which the global model is updated. However, a common direction found in this manner may not be a descent direction for some clients. Consequently, the learnt model could perform quite poorly once applied to the private datasets of those clients, yielding an unfair global model (Li et al., 2019a; Bonawitz et al., 2019; Kairouz et al., 2021); that is, although the average accuracy might be high, the learnt model is prone to perform poorly for clients whose data distributions differ from those of the majority.

One possible method to find a direction that is descent for all the clients is to treat the FL task as a multi-objective minimization (MoM) problem (Hu et al., 2022). In this setup, a Pareto-stationary solution of the MoM yields a descent direction for all the clients. However, having a common descent direction is not enough per se to train a fair model with uniform test accuracies across the clients¹. This is because data heterogeneity across different clients makes the local loss functions vary significantly in values, and therefore those loss functions with larger values should decrease at a higher rate to learn a fair model.

¹ Similarly to other fields like ML (Barocas et al., 2017), communications (Huaizhou et al., 2013), and justice (Rawls, 2020), the notion of fairness does not have a unique definition in FL. However, following (Li et al., 2019a; 2021), we use the standard deviation of the clients’ test accuracies, along with some other metrics discussed in Section 7, to measure how uniformly the global model performs across the clients. Please refer to Appendix C for more in-depth discussions.

To address the above-mentioned issues and to train a fair global model, in this work, we propose AdaFed. The aim of AdaFed is to help the server find a common direction $\boldsymbol{\mathfrak{d}}_t$ (i) that is descent for all the clients, which is a necessary condition to decrease the clients’ loss functions in the SGD algorithm; and (ii) along which the loss functions with larger values decrease at higher rates. The latter is meant to yield a global model with uniform test accuracies across the clients.

We note that if the directional derivatives of the clients’ loss functions along the normalized common direction $\boldsymbol{\mathfrak{d}}_t$ are all positive, then $-\boldsymbol{\mathfrak{d}}_t$ is a common descent direction for all the clients. As such, AdaFed adaptively tunes $\boldsymbol{\mathfrak{d}}_t$ such that these directional derivatives (i) remain positive over the course of the FL process, and (ii) are larger for loss functions with higher values, forcing them to decrease more during the global update by the server.

The contributions of the paper are summarized as follows:

  • We introduce AdaFed, a method to realize fair FL via adaptive common descent direction.

  • We provide a closed-form solution for the common direction found by AdaFed. This is in contrast with many existing fair FL methods which deploy iterative or generic quadratic programming methods.

  • Under some common assumptions in FL literature, we prove the convergence of AdaFed under different FL setups to a Pareto-stationary solution.

  • By conducting thorough experiments over seven different datasets (six vision datasets, and a language one), we show that AdaFed can yield a higher level of fairness among the clients while achieving similar prediction accuracy compared to the state-of-the-art fair FL algorithms.

  • The experiments conducted in this paper evaluate many existing fair FL algorithms over different datasets under different FL setups, and therefore can pave the way for future research.

2 Related Works

There are many different perspectives in the literature on combating unfairness in FL. These methods include client selection (Nishio & Yonetani, 2019; Huang et al., 2020a; 2022; Yang et al., 2021), contribution evaluation (Zhang et al., 2020; Lyu et al., 2020; Song et al., 2021; Le et al., 2021), incentive mechanisms (Zhang et al., 2021; Kang et al., 2019; Ye et al., 2020; Zhang et al., 2020), and methods based on the loss function. Our work falls into the last category. In this approach, the goal is to attain uniform performance across the clients in terms of test accuracy. To this end, the works using this approach aim to reduce the variance of test accuracy across the participating clients. In the following, we briefly review some of these works.

One of the pioneering methods in this realm is agnostic federated learning (AFL) (Mohri et al., 2019). AFL optimizes the global model for the worst-case realization of a weighted combination of the user distributions. Their approach boils down to solving a saddle-point optimization problem, for which they used a fast stochastic optimization algorithm. Yet, AFL performs well only for a small number of clients, and when the number of participating clients becomes large, the generalization guarantees of the model may not be satisfied. Du et al. (2021) deployed the notion of AFL and proposed the AgnosticFair algorithm. Specifically, they linearly parametrized model weights by kernel functions and showed that AFL can be viewed as a special case of AgnosticFair. To overcome the generalization problem in AFL, q-fair federated learning (q-FFL) (Li et al., 2019a) was proposed to achieve more uniform test accuracy across users. The main idea of q-FFL stemmed from fair resource allocation methods in wireless communication networks (Huaizhou et al., 2013; Hamidi et al., 2019). Afterward, Li et al. (2020a) developed TERM, a tilted empirical risk minimization algorithm which handles outliers and class imbalance in statistical estimation procedures. Compared to q-FFL, TERM has demonstrated better performance in many FL applications. Deploying a similar notion, Huang et al. (2020b) proposed using training accuracy and frequency to adjust the weights of devices to promote fairness. Furthermore, FCFC (Cui et al., 2021) minimizes the loss of the worst-performing client, leading to a version of AFL. Later, Li et al. (2021) devised Ditto, a multitask personalized FL algorithm. After optimizing a global objective function, Ditto allows local devices to run more steps of SGD, subject to some constraints, to minimize their own losses. Ditto can significantly improve testing accuracy among local devices and encourage fairness.

Our approach is more similar to FedMGDA+ (Hu et al., 2022), which treats the FL task as a multi-objective optimization problem. In this scenario, the goal is to minimize the loss function of each FL client simultaneously. To avoid sacrificing the performance of any client, FedMGDA+ uses Pareto-stationary solutions to find a common descent direction for all selected clients.

3 Notation and Preliminaries

3.1 Notation

We denote by $[K]$ the set of integers $\{1,2,\cdots,K\}$. In addition, we define $\{f_k\}_{k\in[K]}=\{f_1,f_2,\dots,f_K\}$ for a scalar/function $f$. We use bold-symbol small letters to represent vectors. Denote by $\mathfrak{u}_i$ the $i$-th element of vector $\boldsymbol{\mathfrak{u}}$. For two vectors $\boldsymbol{\mathfrak{u}},\boldsymbol{\mathfrak{v}}\in\mathbb{R}^{d}$, we say $\boldsymbol{\mathfrak{u}}\leq\boldsymbol{\mathfrak{v}}$ iff $\mathfrak{u}_i\leq\mathfrak{v}_i$ for all $i\in[d]$, i.e., two vectors are compared w.r.t. partial ordering.
In addition, denote by $\boldsymbol{\mathfrak{v}}\cdot\boldsymbol{\mathfrak{u}}$ their inner product, and by $\text{proj}_{\boldsymbol{\mathfrak{u}}}(\boldsymbol{\mathfrak{v}})=\frac{\boldsymbol{\mathfrak{v}}\cdot\boldsymbol{\mathfrak{u}}}{\boldsymbol{\mathfrak{u}}\cdot\boldsymbol{\mathfrak{u}}}\boldsymbol{\mathfrak{u}}$ the projection of $\boldsymbol{\mathfrak{v}}$ onto the line spanned by $\boldsymbol{\mathfrak{u}}$.
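As a small illustration of this notation, the following sketch implements the inner product, the projection $\text{proj}_{\boldsymbol{\mathfrak{u}}}(\boldsymbol{\mathfrak{v}})$, and the componentwise comparison in plain Python. The function names (`dot`, `proj`, `leq`) and the sample vectors are illustrative choices, not part of the paper.

```python
from typing import List

Vector = List[float]


def dot(v: Vector, u: Vector) -> float:
    """Standard Euclidean inner product v . u."""
    return sum(vi * ui for vi, ui in zip(v, u))


def proj(u: Vector, v: Vector) -> Vector:
    """Projection of v onto the line spanned by u: ((v . u) / (u . u)) u."""
    uu = dot(u, u)
    if uu == 0.0:
        raise ValueError("cannot project onto the zero vector")
    scale = dot(v, u) / uu
    return [scale * ui for ui in u]


def leq(u: Vector, v: Vector) -> bool:
    """Partial ordering: u <= v iff u_i <= v_i for every coordinate i."""
    return all(ui <= vi for ui, vi in zip(u, v))


v = [3.0, 4.0]
u = [1.0, 0.0]
p = proj(u, v)                                  # component of v along u
residual = [vi - pi for vi, pi in zip(v, p)]    # what is left after removing it
print(p, dot(residual, u))                      # the residual is orthogonal to u
print(leq([1.0, 2.0], [1.0, 3.0]), leq([1.0, 2.0], [2.0, 1.0]))
```

The second comparison shows why the ordering is only partial: neither of the two vectors is componentwise smaller than the other.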

3.2 Preliminaries and Definitions

In Hu et al. (2022), the authors demonstrated that FL can be regarded as a multi-objective minimization (MoM) problem. In particular, denote by $\boldsymbol{f}(\boldsymbol{\theta})=\{f_k(\boldsymbol{\theta})\}_{k\in[K]}$ the set of local clients’ objective functions. Then, the aim of MoM is to solve

$$\boldsymbol{\theta}^{*}=\arg\min_{\boldsymbol{\theta}}\boldsymbol{f}(\boldsymbol{\theta}), \tag{1}$$

where the minimization is performed w.r.t. the partial ordering. Finding $\boldsymbol{\theta}^{*}$ could enforce fairness among the users since, by setting $\boldsymbol{\theta}=\boldsymbol{\theta}^{*}$, it is not possible to reduce any of the local objective functions $f_k$ without increasing at least another one. Here, $\boldsymbol{\theta}^{*}$ is called a Pareto-optimal solution of Equation 1. In addition, the collection of function values $\{f_k(\boldsymbol{\theta}^{*})\}_{k\in[K]}$ of all the Pareto points $\boldsymbol{\theta}^{*}$ is called the Pareto front.

Although finding Pareto-optimal solutions can be challenging, there are several methods to identify the Pareto-stationary solutions instead, which are defined as follows:

Definition 3.1.

Pareto-stationary (Mukai, 1980): The vector $\boldsymbol{\theta}^{*}$ is said to be Pareto-stationary iff there exists a convex combination of the gradient vectors $\{\mathfrak{g}_k(\boldsymbol{\theta}^{*})\}_{k\in[K]}$ which is equal to zero; that is, $\sum_{k=1}^{K}\lambda_k\mathfrak{g}_k(\boldsymbol{\theta}^{*})=0$, where $\boldsymbol{\lambda}\geq 0$, and $\sum_{k=1}^{K}\lambda_k=1$.

Lemma 3.2.

(Mukai, 1980) Any Pareto-optimal solution is Pareto-stationary. On the other hand, if all $\{f_k(\boldsymbol{\theta})\}_{k\in[K]}$’s are convex, then any Pareto-stationary solution is weakly Pareto optimal².

² $\boldsymbol{\theta}^{*}$ is called a weakly Pareto-optimal solution of Equation 1 if there does not exist any $\boldsymbol{\theta}$ such that $\boldsymbol{f}(\boldsymbol{\theta})<\boldsymbol{f}(\boldsymbol{\theta}^{*})$, meaning that it is not possible to improve all of the objective functions in $\boldsymbol{f}(\boldsymbol{\theta}^{*})$ simultaneously. Obviously, any Pareto-optimal solution is also weakly Pareto optimal, but the converse may not hold.
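To make Definition 3.1 concrete, consider a hypothetical one-dimensional example with two convex objectives (chosen here purely for illustration; it does not appear in the paper):

```latex
f_1(\theta) = \theta^2, \qquad f_2(\theta) = (\theta-1)^2,
\qquad \mathfrak{g}_1(\theta) = 2\theta, \qquad \mathfrak{g}_2(\theta) = 2(\theta-1).

% Pareto-stationarity asks for some \lambda \in [0,1] such that
\lambda\,\mathfrak{g}_1(\theta) + (1-\lambda)\,\mathfrak{g}_2(\theta) = 0
\;\Longleftrightarrow\; 2(\theta - 1 + \lambda) = 0
\;\Longleftrightarrow\; \lambda = 1-\theta .
```

Since $\lambda=1-\theta$ must lie in $[0,1]$, exactly the points $\theta\in[0,1]$ are Pareto-stationary. Both objectives are convex, so by Lemma 3.2 each of these points is weakly Pareto optimal. Outside $[0,1]$ the two gradients share the same sign, so both losses can still be reduced at once.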

There are many methods in the literature to find Pareto-stationary solutions, among which we elaborate on two well-known ones, namely linear scalarization and the multiple gradient descent algorithm (MGDA) (Mukai, 1980; Fliege & Svaiter, 2000; Désidéri, 2012).

• Linear scalarization: this approach is essentially the core principle behind the FedAvg algorithm. To elucidate, in FedAvg, the server updates $\boldsymbol{\theta}$ by minimizing the weighted average of clients’ loss functions:

$$\min_{\boldsymbol{\theta}}f(\boldsymbol{\theta})=\sum_{k=1}^{K}\lambda_k f_k(\boldsymbol{\theta}), \tag{2}$$

where the weights $\{\lambda_k\}_{k\in[K]}$ are assigned by the server and satisfy $\sum_{k=1}^{K}\lambda_k=1$. These fixed $\{\lambda_k\}_{k\in[K]}$ are assigned based on some a priori information about the clients, such as the size of their datasets. We note that different values for $\{\lambda_k\}_{k\in[K]}$ yield different Pareto-stationary solutions.

Referring to Definition 3.1, any solution of Equation 2 is a Pareto-stationary solution of Equation 1. To perform FedAvg, at iteration $t$, each client $k\in[K]$ sends its gradient vector $\mathfrak{g}_k(\boldsymbol{\theta}_t)$ to the server, and the server updates the global model as

$$\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\eta_t\boldsymbol{\mathfrak{d}}_t,~~\text{where}~~\boldsymbol{\mathfrak{d}}_t=\sum_{k=1}^{K}\lambda_k\mathfrak{g}_k(\boldsymbol{\theta}_t). \tag{3}$$

However, linear scalarization can only converge to Pareto points that lie on the convex envelope of the Pareto front (Boyd & Vandenberghe, 2004). Furthermore, the weighted average of the gradients with pre-defined weights yields a vector $\boldsymbol{\mathfrak{d}}_t$ whose direction might not be descent for all the clients, because some clients may have conflicting gradients with opposing directions due to the heterogeneity of their local datasets (Wang et al., 2021b). As a result, FedAvg may result in an unfair accuracy distribution among the clients (Li et al., 2019a; Mohri et al., 2019).

• MGDA: To mitigate the above issue, Hu et al. (2022) proposed to exploit the MGDA algorithm in FL to converge to a fair solution on the Pareto front. Unlike linear scalarization, MGDA adaptively tunes $\{\lambda_k\}_{k\in[K]}$ by finding the minimal-norm element of the convex hull of the gradient vectors, defined as follows (we drop the dependence of $\mathfrak{g}_k$ on $\boldsymbol{\theta}_t$ for ease of notation hereafter):

$$\mathcal{G}=\Big\{\boldsymbol{g}\in\mathbb{R}^{d}\,\Big|\,\boldsymbol{g}=\sum_{k=1}^{K}\lambda_k\mathfrak{g}_k;~\lambda_k\geq 0;~\sum_{k=1}^{K}\lambda_k=1\Big\}. \tag{4}$$

Denote the minimal-norm element of $\mathcal{G}$ by $\boldsymbol{\mathfrak{d}}(\mathcal{G})$. Then, either (i) $\boldsymbol{\mathfrak{d}}(\mathcal{G})=0$, in which case a convex combination of the gradients vanishes and the current point $\boldsymbol{\theta}_t$ is Pareto-stationary (Definition 3.1; see also Lemma 3.2); or (ii) $\boldsymbol{\mathfrak{d}}(\mathcal{G})\neq 0$ and the direction of $-\boldsymbol{\mathfrak{d}}(\mathcal{G})$ is a common descent direction for all the objective functions $\{f_k(\boldsymbol{\theta})\}_{k\in[K]}$ (Désidéri, 2009), meaning that all the directional derivatives $\{\mathfrak{g}_k\cdot\boldsymbol{\mathfrak{d}}(\mathcal{G})\}_{k\in[K]}$ are positive. Having positive directional derivatives is a necessary condition to ensure that the common direction is descent for all the objective functions.
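The contrast between the two approaches can be sketched for $K=2$, where the minimal-norm element of $\mathcal{G}$ has a simple closed form. The snippet below is a minimal illustration under stated assumptions: the gradients are hypothetical, the helper names (`weighted_average`, `min_norm_two`) are not from the paper, and the two-vector formula is used only because it is easy to verify by hand; it is not the general MGDA solver.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_average(grads: List[Vector], lam: List[float]) -> Vector:
    """Linear scalarization (Equation 3): d = sum_k lam_k * g_k with fixed weights."""
    dim = len(grads[0])
    return [sum(l * g[i] for l, g in zip(lam, grads)) for i in range(dim)]


def min_norm_two(g1: Vector, g2: Vector) -> Tuple[Vector, float]:
    """Minimal-norm point of the segment {lam*g1 + (1-lam)*g2 : 0 <= lam <= 1}.

    Setting the derivative of ||lam*g1 + (1-lam)*g2||^2 to zero gives
    lam = g2 . (g2 - g1) / ||g1 - g2||^2, which is then clipped to [0, 1].
    """
    diff = [b - a for a, b in zip(g1, g2)]      # g2 - g1
    denom = dot(diff, diff)
    if denom == 0.0:                            # identical gradients
        return list(g1), 1.0
    lam = min(1.0, max(0.0, dot(g2, diff) / denom))
    d = [lam * a + (1.0 - lam) * b for a, b in zip(g1, g2)]
    return d, lam


# Two hypothetical, partially conflicting client gradients.
g1, g2 = [1.0, 0.0], [-2.0, 1.0]

d_avg = weighted_average([g1, g2], [0.5, 0.5])
d_mgda, lam = min_norm_two(g1, g2)

# A positive g_k . d means that stepping along -d decreases f_k to first order.
print("fixed weights:", d_avg, [dot(g, d_avg) for g in (g1, g2)])
print("min-norm     :", d_mgda, lam, [dot(g, d_mgda) for g in (g1, g2)])
```

With equal fixed weights, the directional derivative for the first client is negative, so its loss increases to first order when the server steps along $-\boldsymbol{\mathfrak{d}}_t$. The minimal-norm combination instead yields positive directional derivatives for both clients (both approximately 0.1 in this example).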

4 Motivation and Methodology

We first discuss our motivation in Section 4.1, and then elaborate on the methodology in Section 4.2.

4.1 Motivation

Although any solution on the Pareto front is fair in the sense that decreasing one of the loss functions is not possible without sacrificing some others, not all such solutions impose uniformity among the loss functions (see Figure 1a). As such, we aim to find solutions on the Pareto front which enjoy such uniformity.

First, we note that having a common descent direction is a necessary condition to find such uniform solutions, but not a sufficient one. Additionally, we stipulate that the rate of decrease in the loss function should be greater for clients whose loss functions are larger. In fact, the purpose of this paper is to find an update direction for the server that satisfies both of the following conditions at the same time:

  • Condition (I): It is a descent direction for all $\{f_k(\boldsymbol{\theta})\}_{k\in[K]}$, which is a necessary condition for the loss functions to decrease when the server updates the global model along that direction.

  • Condition (II): It is more inclined toward the clients with larger losses, and therefore the directional derivatives of loss functions along the common direction are larger for those with larger loss functions.

To satisfy Condition (I), it is enough to find $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ using the MGDA algorithm (as Hu et al. (2022) use MGDA to enforce fairness in the FL setup). Nevertheless, we aim to further satisfy Condition (II) on top of Condition (I). To this end, we investigate the direction of $\boldsymbol{\mathfrak{d}}(\mathcal{G})$, and note that it is more inclined toward that of $\min\{\|\mathfrak{g}_k\|_2^2\}_{k\in[K]}$, that is, toward the gradient with the smallest norm. For instance, consider the simple example depicted in Figure 1b, where $\|\mathfrak{g}_1\|_2^2<\|\mathfrak{g}_2\|_2^2$. The convex hull $\mathcal{G}$ and $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ are depicted for $\mathfrak{g}_1$ and $\mathfrak{g}_2$. As seen, the direction of $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ is mostly influenced by that of $\mathfrak{g}_1$.
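As a numeric illustration of this inclination, take two hypothetical orthogonal gradients (values chosen for this example, not taken from the paper). For two vectors, minimizing $\|\lambda\mathfrak{g}_1+(1-\lambda)\mathfrak{g}_2\|_2^2$ over $\lambda\in[0,1]$ gives $\lambda^{*}=\mathfrak{g}_2\cdot(\mathfrak{g}_2-\mathfrak{g}_1)/\|\mathfrak{g}_1-\mathfrak{g}_2\|_2^2$, clipped to $[0,1]$ when needed:

```latex
\mathfrak{g}_1 = (1, 0), \qquad \mathfrak{g}_2 = (0, 2), \qquad
\lambda^{*} = \frac{(0,2)\cdot(-1,2)}{\|(1,-2)\|_2^2} = \frac{4}{5},

\boldsymbol{\mathfrak{d}}(\mathcal{G})
  = \tfrac{4}{5}\,\mathfrak{g}_1 + \tfrac{1}{5}\,\mathfrak{g}_2 = (0.8,\ 0.4),

\cos\angle\big(\boldsymbol{\mathfrak{d}}(\mathcal{G}), \mathfrak{g}_1\big)
  = \frac{0.8}{1\cdot\sqrt{0.8}} \approx 0.894, \qquad
\cos\angle\big(\boldsymbol{\mathfrak{d}}(\mathcal{G}), \mathfrak{g}_2\big)
  = \frac{0.8}{2\cdot\sqrt{0.8}} \approx 0.447 .
```

Although $\mathfrak{g}_2$ is twice as long as $\mathfrak{g}_1$, the minimal-norm vector makes an angle of about 26.6° with $\mathfrak{g}_1$ and about 63.4° with $\mathfrak{g}_2$, so its direction is dominated by the smaller gradient.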

Figure 1: (a) The Pareto front for two objective functions $f_1(\boldsymbol{\theta})$ and $f_2(\boldsymbol{\theta})$. MGDA may converge to any point on the Pareto front. (b)-(c) Illustration of the convex hull $\mathcal{G}$ and the minimal-norm vector $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ for two gradient vectors $\mathfrak{g}_1$ and $\mathfrak{g}_2$. In (b), $\|\mathfrak{g}_1\|_2^2<\|\mathfrak{g}_2\|_2^2$, where the direction of $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ is more inclined toward $\mathfrak{g}_1$. In (c), $\|\mathfrak{g}_1\|_2^2=\|\mathfrak{g}_2\|_2^2=1$, where the direction of $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ is the same as that of the bisector of $\mathfrak{g}_1$ and $\mathfrak{g}_2$.

However, this phenomenon is not favourable for satisfying Condition (II), since after some rounds of communication between the server and clients, the gradient $\mathfrak{g}_k$ becomes small in norm for those objective functions which are close to their minimum points. Consequently, the direction of $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ is mostly controlled by these small gradients, which is undesirable. Note that $\mathfrak{g}_k\cdot\boldsymbol{\mathfrak{d}}(\mathcal{G})$ represents how fast $f_k(\boldsymbol{\theta})$ changes if $\boldsymbol{\theta}$ changes in the direction of $\boldsymbol{\mathfrak{d}}(\mathcal{G})$. In fact, the direction of $\boldsymbol{\mathfrak{d}}(\mathcal{G})$ should be more inclined toward the gradients of those clients with larger loss functions.

One possible solution could be to naively normalize $\{\mathfrak{g}_k\}_{k\in[K]}$ by their norm to obtain $\{\frac{\mathfrak{g}_k}{\|\mathfrak{g}_k\|_2^2}\}_{k\in[K]}$, whose convex hull is denoted by $\mathcal{G}_{\text{norm}}$, and then use this normalized set of gradients to find $\boldsymbol{\mathfrak{d}}(\mathcal{G}_{\text{norm}})$. Yet, the normalization makes all the $\{\mathfrak{g}_k\cdot\boldsymbol{\mathfrak{d}}(\mathcal{G}_{\text{norm}})\}_{k\in[K]}$ equal (see Figure 1c), which is still undesirable as the rate of decrease becomes equal for all $\{f_k(\boldsymbol{\theta})\}_{k\in[K]}$.

Based on these observations, the gradient vectors should be somehow scaled if one aims to also satisfy Condition (II). Finding such a scaling factor is not straightforward in general. To tackle this issue, and to be able to find a closed-form formula, we find the minimal-norm vector in the convex hull of mutually orthogonal scaled gradients instead, and prove that this yields a common direction for which both Conditions (I) and (II) are satisfied.
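The reason orthogonality permits a closed form is a standard fact: for mutually orthogonal vectors $\boldsymbol{u}_k$, the squared norm of a convex combination reduces to $\sum_k\lambda_k^2\|\boldsymbol{u}_k\|_2^2$, whose minimizer on the simplex has $\lambda_k$ proportional to $1/\|\boldsymbol{u}_k\|_2^2$. The sketch below illustrates only this fact with hypothetical vectors and an illustrative function name (`min_norm_orthogonal`); it is not a reproduction of the full AdaFed procedure.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def min_norm_orthogonal(vectors: List[Vector]) -> Tuple[Vector, List[float]]:
    """Minimal-norm convex combination of mutually orthogonal, nonzero vectors.

    With orthogonal u_k, ||sum_k lam_k u_k||^2 = sum_k lam_k^2 ||u_k||^2, and
    minimizing it subject to sum_k lam_k = 1 gives lam_k proportional to
    1 / ||u_k||^2, so every weight is automatically positive.
    """
    inv_sq = [1.0 / dot(u, u) for u in vectors]
    total = sum(inv_sq)
    lam = [w / total for w in inv_sq]
    dim = len(vectors[0])
    d = [sum(l * u[i] for l, u in zip(lam, vectors)) for i in range(dim)]
    return d, lam


# Hypothetical orthogonal vectors standing in for scaled, orthogonalized gradients.
u1, u2 = [1.0, 0.0], [0.0, 2.0]
d, lam = min_norm_orthogonal([u1, u2])
print(lam, d, [dot(u, d) for u in (u1, u2)])
```

In this example every inner product $\boldsymbol{u}_k\cdot\boldsymbol{d}$ is positive, since each equals $\lambda_k\|\boldsymbol{u}_k\|_2^2=1/\sum_j\|\boldsymbol{u}_j\|_2^{-2}$. No iterative or quadratic-programming solver is needed to obtain the weights.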

4.2 Methodology

To devise an appropriate scaling as explained above, we carry out the following two phases.

4.2.1 Phase 1, orthogonalization

In order to be able to find a closed-form formula for the common descent direction, in the first phase, we orthogonalize the gradients.

Once the gradient updates $\{\mathfrak{g}_k\}_{k\in[K]}$ are transmitted by the clients, the server first generates a mutually orthogonal set $\{\tilde{\mathfrak{g}}_k\}_{k\in[K]}$ (here, orthogonality is in the sense of the standard inner product in Euclidean space) that spans the same $K$-dimensional subspace of $\mathbb{R}^d$ as that spanned by $\{\mathfrak{g}_k\}_{k\in[K]}$. To this end, the server applies a modified Gram–Schmidt orthogonalization process to $\{\mathfrak{g}_k\}_{k\in[K]}$ in the following manner (the reason for such normalization is to satisfy Conditions (I) and (II); this will be proven later in this section):

\begin{align}
\tilde{\mathfrak{g}}_1 &= \mathfrak{g}_1 / |f_1|^{\gamma}, \tag{5} \\
\tilde{\mathfrak{g}}_k &= \frac{\mathfrak{g}_k - \sum_{i=1}^{k-1} \text{proj}_{\tilde{\mathfrak{g}}_i}(\mathfrak{g}_k)}{|f_k|^{\gamma} - \sum_{i=1}^{k-1} \frac{\mathfrak{g}_k \cdot \tilde{\mathfrak{g}}_i}{\tilde{\mathfrak{g}}_i \cdot \tilde{\mathfrak{g}}_i}}, \quad \text{for}~k = 2, 3, \dots, K, \tag{6}
\end{align}

where $\gamma > 0$ is a scalar.
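The recursion in Equations 5 and 6 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `scaled_gram_schmidt`, the list-based vectors, and the toy gradients and losses are assumptions made for this example, and the scaling denominator is assumed to be nonzero.

```python
def dot(u, v):
    """Standard Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def scaled_gram_schmidt(grads, losses, gamma=1.0):
    """Sketch of Equations 5-6: orthogonalize the gradients and rescale them.

    grads  : list of K gradient vectors (lists of floats), assumed linearly independent
    losses : list of K loss values f_k
    gamma  : positive scalar exponent
    """
    ortho = []
    for g, f in zip(grads, losses):
        # Projection coefficients (g_k . g~_i) / (g~_i . g~_i) for all i < k.
        coeffs = [dot(g, q) / dot(q, q) for q in ortho]
        # Numerator of Eq. 6: g_k minus its projections onto the earlier g~_i.
        residual = list(g)
        for c, q in zip(coeffs, ortho):
            residual = [r - c * qi for r, qi in zip(residual, q)]
        # Denominator of Eq. 6; the sum is empty for the first vector.
        # Assumption of this sketch: the denominator is nonzero (no guard here).
        denom = abs(f) ** gamma - sum(coeffs)
        ortho.append([r / denom for r in residual])
    return ortho


# Hypothetical toy inputs: three clients, model dimension d = 3.
grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
losses = [2.0, 1.0, 3.0]
g_tilde = scaled_gram_schmidt(grads, losses, gamma=1.0)
```

In this toy example the second denominator is negative ($1 - 2 = -1$), which flips the sign of the second vector but leaves the set mutually orthogonal.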

• Why is such orthogonalization possible?

First, note that the orthogonalization approach in Phase 1 is feasible if we assume that the $K$ gradient vectors $\{\mathfrak{g}_k\}_{k\in[K]}$ are linearly independent. Indeed, this assumption is reasonable considering that (i) the gradient vectors $\{\mathfrak{g}_k\}_{k\in[K]}$ are $K$ vectors in $d$-dimensional space, and $d \gg K$ for current deep neural networks (DNNs); and (ii) the random nature of the gradient vectors due to the non-iid distributions of the local datasets. (Also, note that to tackle the non-iid distribution of user-specific data, it is common practice for the server to select a different subset of clients in each round (McMahan et al., 2017).) The validity of this assumption is further confirmed in our thorough experiments over different datasets and models.
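The linear-independence assumption can be checked numerically before orthogonalizing. The following is a minimal sketch, not part of the original method: the helper name `is_linearly_independent` and the relative tolerance `tol` are hypothetical choices for this illustration. It runs plain Gram–Schmidt and reports failure as soon as a residual (numerically) vanishes.

```python
def dot(u, v):
    """Standard Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def is_linearly_independent(vectors, tol=1e-10):
    """Return True if no vector lies (numerically) in the span of the earlier ones."""
    basis = []
    for v in vectors:
        residual = list(v)
        for q in basis:
            c = dot(residual, q) / dot(q, q)
            residual = [r - c * qi for r, qi in zip(residual, q)]
        # Compare the residual norm with the original norm (relative test).
        # A zero vector also fails this test.
        if dot(residual, residual) <= (tol ** 2) * dot(v, v):
            return False
        basis.append(residual)
    return True


# Hypothetical toy gradients.
independent = is_linearly_independent([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
dependent = is_linearly_independent([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
```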

4.2.2 Phase 2, finding optimal $\boldsymbol{\lambda}^*$

In this phase, we aim to find the minimum-norm vector in the convex hull of the mutually-orthogonal gradients found in Phase 1.

First, denote by $\tilde{\mathcal{G}}$ the convex hull of the gradient vectors $\{\tilde{\mathfrak{g}}_k\}_{k\in[K]}$ obtained in Phase 1; that is,

\begin{align}
\tilde{\mathcal{G}} = \Big\{\boldsymbol{g} \in \mathbb{R}^d \,\Big|\, \boldsymbol{g} = \sum_{k=1}^{K} \lambda_k \tilde{\mathfrak{g}}_k;~ \lambda_k \geq 0;~ \sum_{k=1}^{K} \lambda_k = 1\Big\}. \tag{7}
\end{align}

In the following, we find the minimal-norm element in $\tilde{\mathcal{G}}$, and then we show that this element is a descent direction for all the objective functions.

Denote by $\boldsymbol{\lambda}^*$ the weights corresponding to the minimal-norm vector in $\tilde{\mathcal{G}}$. To find the weight vector $\boldsymbol{\lambda}^*$, we solve

\begin{align}
\boldsymbol{g}^* = \arg\min_{\boldsymbol{g} \in \tilde{\mathcal{G}}} \|\boldsymbol{g}\|_2^2, \tag{8}
\end{align}

which accordingly finds $\boldsymbol{\lambda}^*$. For an element $\boldsymbol{g} \in \tilde{\mathcal{G}}$, we have

\begin{align}
\|\boldsymbol{g}\|_2^2 = \Big\|\sum_{k=1}^{K} \lambda_k \tilde{\mathfrak{g}}_k\Big\|_2^2 = \sum_{k=1}^{K} \lambda_k^2 \|\tilde{\mathfrak{g}}_k\|_2^2, \tag{9}
\end{align}

where we used the fact that $\{\tilde{\mathfrak{g}}_k\}_{k\in[K]}$ are orthogonal.
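To make the step in Equation 9 explicit, the squared norm can be expanded into pairwise inner products, where every cross term vanishes by orthogonality. The index $j$ below is introduced only for this illustration:

```latex
\begin{align*}
\Big\|\sum_{k=1}^{K} \lambda_k \tilde{\mathfrak{g}}_k\Big\|_2^2
&= \sum_{k=1}^{K} \sum_{j=1}^{K} \lambda_k \lambda_j \,(\tilde{\mathfrak{g}}_k \cdot \tilde{\mathfrak{g}}_j) \\
&= \sum_{k=1}^{K} \lambda_k^2 \|\tilde{\mathfrak{g}}_k\|_2^2
 + \sum_{k \neq j} \lambda_k \lambda_j \underbrace{(\tilde{\mathfrak{g}}_k \cdot \tilde{\mathfrak{g}}_j)}_{=\,0}
 = \sum_{k=1}^{K} \lambda_k^2 \|\tilde{\mathfrak{g}}_k\|_2^2 .
\end{align*}
```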

To solve Equation 8, we first ignore the inequality constraints $\lambda_k \geq 0$, for $k \in [K]$, and later observe that they are automatically satisfied. We therefore form the following Lagrangian for the minimization problem in Equation 8:

\begin{align}
L(\tilde{\boldsymbol{\mathfrak{g}}}, \boldsymbol{\lambda}) = \|\boldsymbol{g}\|_2^2 - \alpha\Big(\sum_{k=1}^{K} \lambda_k - 1\Big) = \sum_{k=1}^{K} \lambda_k^2 \|\tilde{\mathfrak{g}}_k\|_2^2 - \alpha\Big(\sum_{k=1}^{K} \lambda_k - 1\Big). \tag{10}
\end{align}

Hence,

\begin{align}
\frac{\partial L}{\partial \lambda_k} = 2 \lambda_k \|\tilde{\mathfrak{g}}_k\|_2^2 - \alpha, \tag{11}
\end{align}

and by setting Equation 11 to zero, we obtain:

\begin{align}
\lambda_k^* = \frac{\alpha}{2 \|\tilde{\mathfrak{g}}_k\|_2^2}. \tag{12}
\end{align}

On the other hand, since $\sum_{k=1}^{K} \lambda_k = 1$, from Equation 12 we obtain

\begin{align}
\alpha = \frac{2}{\sum_{k=1}^{K} \frac{1}{\|\tilde{\mathfrak{g}}_k\|_2^2}}, \tag{13}
\end{align}

from which the optimal $\boldsymbol{\lambda}^*$ is obtained as follows

\begin{align}
\lambda_k^* = \frac{1}{\|\tilde{\mathfrak{g}}_k\|_2^2 \sum_{i=1}^{K} \frac{1}{\|\tilde{\mathfrak{g}}_i\|_2^2}}, \quad \text{for}~k \in [K]. \tag{14}
\end{align}

Note that $\lambda_k^* > 0$, and therefore the minimum-norm vector we found belongs to $\tilde{\mathcal{G}}$.
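The closed form in Equation 14 is straightforward to evaluate. The sketch below is only an illustration under stated assumptions: the helper names `min_norm_weights` and `combine` and the three orthogonal vectors are hypothetical. It computes the weights, checks that they form a valid convex combination, and compares the resulting norm with that of uniform weights.

```python
def dot(u, v):
    """Standard Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_weights(ortho):
    """Eq. 14: weights proportional to 1 / ||g~_k||^2, normalized to sum to 1."""
    inv_sq_norms = [1.0 / dot(q, q) for q in ortho]
    total = sum(inv_sq_norms)
    return [w / total for w in inv_sq_norms]


def combine(weights, vectors):
    """Weighted sum of the given vectors, computed coordinate by coordinate."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]


# Hypothetical mutually orthogonal vectors standing in for the g~_k.
ortho = [[0.5, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 0.25]]

lam = min_norm_weights(ortho)                    # [4/21, 1/21, 16/21]
d_min = combine(lam, ortho)                      # minimum-norm element of the hull
d_uniform = combine([1.0 / 3.0] * 3, ortho)      # another element, for comparison

min_sq_norm = dot(d_min, d_min)                  # equals 1 / sum_i 1/||g~_i||^2 = 1/21
uniform_sq_norm = dot(d_uniform, d_uniform)      # 21/144, which is larger
```

Vectors with smaller norm receive larger weight, which is why the short third vector dominates the combination in this example.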

Using the $\boldsymbol{\lambda}^*$ found in (14), we can calculate $\boldsymbol{\mathfrak{d}}_t = \sum_{k=1}^{K} \lambda_k^* \tilde{\mathfrak{g}}_k$ as the minimum-norm element in the convex hull $\tilde{\mathcal{G}}$. In the following (Theorem 4.1), we show that the negative of $\boldsymbol{\mathfrak{d}}_t$ satisfies both Conditions (I) and (II).

Theorem 4.1.

The negative of $\boldsymbol{\mathfrak{d}}_t = \sum_{k=1}^{K} \lambda_k^* \tilde{\mathfrak{g}}_k$ satisfies both Conditions (I) and (II).

Proof.

We find the directional derivatives of the loss functions $\{f_k\}_{k\in[K]}$ along $\boldsymbol{\mathfrak{d}}_t$. For all $k \in [K]$, we have

\begin{align}
\mathfrak{g}_k \cdot \boldsymbol{\mathfrak{d}}_t &= \left(\tilde{\mathfrak{g}}_k \Big(|f_k|^{\gamma} - \sum_{i=1}^{k-1} \frac{\mathfrak{g}_k \cdot \tilde{\mathfrak{g}}_i}{\tilde{\mathfrak{g}}_i \cdot \tilde{\mathfrak{g}}_i}\Big) + \sum_{i=1}^{k-1} \text{proj}_{\tilde{\mathfrak{g}}_i}(\mathfrak{g}_k)\right) \cdot \Big(\sum_{i=1}^{K} \lambda_i^* \tilde{\mathfrak{g}}_i\Big) \tag{15} \\
&= \lambda_k^* \|\tilde{\mathfrak{g}}_k\|_2^2 \left(|f_k|^{\gamma} - \sum_{i=1}^{k-1} \frac{\mathfrak{g}_k \cdot \tilde{\mathfrak{g}}_i}{\tilde{\mathfrak{g}}_i \cdot \tilde{\mathfrak{g}}_i}\right) + \sum_{i=1}^{k-1} \frac{\mathfrak{g}_k \cdot \tilde{\mathfrak{g}}_i}{\tilde{\mathfrak{g}}_i \cdot \tilde{\mathfrak{g}}_i} \lambda_i^* \|\tilde{\mathfrak{g}}_i\|_2^2 \tag{16} \\
&= \frac{\alpha}{2} \left(|f_k|^{\gamma} - \sum_{i=1}^{k-1} \frac{\mathfrak{g}_k \cdot \tilde{\mathfrak{g}}_i}{\tilde{\mathfrak{g}}_i \cdot \tilde{\mathfrak{g}}_i}\right) + \frac{\alpha}{2} \sum_{i=1}^{k-1} \frac{\mathfrak{g}_k \cdot \tilde{\mathfrak{g}}_i}{\tilde{\mathfrak{g}}_i \cdot \tilde{\mathfrak{g}}_i} \tag{17} \\
&= \frac{\alpha}{2} |f_k|^{\gamma} = \frac{|f_k|^{\gamma}}{\sum_{i=1}^{K} \frac{1}{\|\tilde{\mathfrak{g}}_i\|_2^2}} > 0, \tag{18}
\end{align}

where (i) Equation 15 is obtained by using the definition of $\tilde{\mathfrak{g}}_k$ in Equation 6, (ii) Equation 16 follows from the orthogonality of the vectors $\{\tilde{\mathfrak{g}}_k\}_{k=1}^{K}$, and (iii) Equation 17 is obtained by using Equation 12.

As seen in Equation 18, the directional derivatives along $\boldsymbol{\mathfrak{d}}_t$ are positive, meaning that $-\boldsymbol{\mathfrak{d}}_t$ is a descent direction for all $\{f_k\}_{k\in[K]}$. In addition, the values of these directional derivatives are proportional to $|f_k|^{\gamma}$. This implies that if the server updates the global model in the direction of $-\boldsymbol{\mathfrak{d}}_t$, the rate of decrease is higher for those functions with larger loss values. Thus, $-\boldsymbol{\mathfrak{d}}_t$ satisfies both Conditions (I) and (II). ∎
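The identity in Equation 18 can also be checked numerically. The following sketch combines both phases on hypothetical toy gradients and losses and compares each inner product $\mathfrak{g}_k \cdot \boldsymbol{\mathfrak{d}}_t$ with the predicted value. The function names and inputs are assumptions made for this illustration, not the authors' code.

```python
def dot(u, v):
    """Standard Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def scaled_gram_schmidt(grads, losses, gamma=1.0):
    """Phase 1 (Eqs. 5-6); assumes independent gradients and nonzero denominators."""
    ortho = []
    for g, f in zip(grads, losses):
        coeffs = [dot(g, q) / dot(q, q) for q in ortho]
        residual = list(g)
        for c, q in zip(coeffs, ortho):
            residual = [r - c * qi for r, qi in zip(residual, q)]
        denom = abs(f) ** gamma - sum(coeffs)
        ortho.append([r / denom for r in residual])
    return ortho


def common_direction(grads, losses, gamma=1.0):
    """Phase 2 (Eq. 14): returns d_t and the sum of inverse squared norms."""
    ortho = scaled_gram_schmidt(grads, losses, gamma)
    inv_sq_norms = [1.0 / dot(q, q) for q in ortho]
    total = sum(inv_sq_norms)
    dim = len(grads[0])
    direction = [
        sum((w / total) * q[j] for w, q in zip(inv_sq_norms, ortho))
        for j in range(dim)
    ]
    return direction, total


# Hypothetical toy inputs: three clients, model dimension d = 3.
grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
losses = [2.0, 1.0, 3.0]
gamma = 1.0

d_t, total = common_direction(grads, losses, gamma)
derivs = [dot(g, d_t) for g in grads]                   # left-hand side of Eq. 18
predicted = [abs(f) ** gamma / total for f in losses]   # right-hand side of Eq. 18
```

For these inputs the inner products come out as $2/21$, $1/21$, and $3/21$, so the client with the largest loss has the largest directional derivative.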

Remark 4.2.

As seen, Equation 14 yields a closed-form formula for the optimal weights of the orthogonal scaled gradients $\{\tilde{\mathfrak{g}}_k\}_{k\in[K]}$, based on which the common direction is obtained. In contrast, FedMGDA+ (Hu et al., 2022) runs an iterative algorithm to find the updating directions. The complexity of such algorithms depends heavily on the size of the model (and the number of participating devices). As recent DNNs are large, deploying such iterative algorithms slows down the FL process. Furthermore, we note that the computational cost of the proposed algorithm is negligible (see Appendix F for details).

5 The AdaFed algorithm

At iteration $t$, the server computes $\boldsymbol{\mathfrak{d}}_t$ using the methodology described in Section 4.2, and then updates the global model as

\begin{align}
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \boldsymbol{\mathfrak{d}}_t. \tag{19}
\end{align}

Similarly to the conventional GD, we note that updating the global model as in (19) is a necessary condition to have $\boldsymbol{f}(\boldsymbol{\theta}_{t+1}) \leq \boldsymbol{f}(\boldsymbol{\theta}_t)$. In Theorem 5.1, we state the sufficient condition to satisfy $\boldsymbol{f}(\boldsymbol{\theta}_{t+1}) \leq \boldsymbol{f}(\boldsymbol{\theta}_t)$.

Theorem 5.1.

Assume that $\boldsymbol{f} = \{f_k\}_{k\in[K]}$ are $L$-Lipschitz smooth. If the step-size $\eta_t \in [0, \frac{2}{L} \min\{|f_k|^{\gamma}\}_{k\in[K]}]$, then $\boldsymbol{f}(\boldsymbol{\theta}_{t+1}) \leq \boldsymbol{f}(\boldsymbol{\theta}_t)$, and equality is achieved iff $\boldsymbol{\mathfrak{d}}_t = \boldsymbol{0}$.
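As a numerical illustration of the statement, the sketch below applies one update of the form in Equation 19 to three hypothetical quadratic client losses $f_k(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta} - c_k\|_2^2 + b_k$, which are $1$-smooth. The centers $c_k$, offsets $b_k$, helper names, and the choice of half the maximal step size are all assumptions made for this example, not settings from the paper.

```python
def dot(u, v):
    """Standard Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def common_direction(grads, losses, gamma=1.0):
    """Phases 1 and 2 (Eqs. 5-6 and 14); assumes linearly independent gradients."""
    ortho = []
    for g, f in zip(grads, losses):
        coeffs = [dot(g, q) / dot(q, q) for q in ortho]
        residual = list(g)
        for c, q in zip(coeffs, ortho):
            residual = [r - c * qi for r, qi in zip(residual, q)]
        denom = abs(f) ** gamma - sum(coeffs)
        ortho.append([r / denom for r in residual])
    inv_sq_norms = [1.0 / dot(q, q) for q in ortho]
    total = sum(inv_sq_norms)
    return [
        sum((w / total) * q[j] for w, q in zip(inv_sq_norms, ortho))
        for j in range(len(grads[0]))
    ]


# Hypothetical quadratic losses: f_k(theta) = 0.5 * ||theta - c_k||^2 + b_k.
centers = [[-1.0, 0.0, 0.0], [-1.0, -1.0, 0.0], [0.0, -1.0, -1.0]]
offsets = [1.5, 0.0, 2.0]
L = 1.0       # smoothness constant of each quadratic above
gamma = 1.0


def loss(k, theta):
    diff = [t - c for t, c in zip(theta, centers[k])]
    return 0.5 * dot(diff, diff) + offsets[k]


def grad(k, theta):
    return [t - c for t, c in zip(theta, centers[k])]


theta_t = [0.0, 0.0, 0.0]
losses = [loss(k, theta_t) for k in range(3)]    # [2.0, 1.0, 3.0]
grads = [grad(k, theta_t) for k in range(3)]

d_t = common_direction(grads, losses, gamma)
eta_max = (2.0 / L) * min(abs(f) ** gamma for f in losses)   # bound of Theorem 5.1
eta_t = 0.5 * eta_max                                        # any value in [0, eta_max]

theta_next = [t - eta_t * di for t, di in zip(theta_t, d_t)]  # update of Eq. 19
new_losses = [loss(k, theta_next) for k in range(3)]
```

With this step size every client loss decreases in the toy example.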

Proof.

If all the $\{f_k\}_{k\in[K]}$ are $L$-smooth, then

\begin{align}
\boldsymbol{f}(\boldsymbol{\theta}_{t+1}) \leq \boldsymbol{f}(\boldsymbol{\theta}_t) + \boldsymbol{\mathfrak{g}}^T (\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t) + \frac{L}{2} \|\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t\|_2^2. \tag{20}
\end{align}

Now, for client $k \in [K]$, by using the update rule in Equation 19 in Equation 20 we obtain

\begin{align}
f_k(\boldsymbol{\theta}_{t+1}) \leq f_k(\boldsymbol{\theta}_t) - \eta_t \mathfrak{g}_k \cdot \boldsymbol{\mathfrak{d}}_t + \eta_t^2 \frac{L}{2} \|\boldsymbol{\mathfrak{d}}_t\|_2^2. \tag{21}
\end{align}

To impose $f_k(\boldsymbol{\theta}_{t+1}) \leq f_k(\boldsymbol{\theta}_t)$, we should have

\begin{align}
\eta_t \mathfrak{g}_k \cdot \boldsymbol{\mathfrak{d}}_t \geq \eta_t^2 \frac{L}{2} \|\boldsymbol{\mathfrak{d}}_t\|_2^2 \quad
&\Leftrightarrow \quad \mathfrak{g}_k \cdot \boldsymbol{\mathfrak{d}}_t \geq \frac{\eta_t L}{2} \sum_{j=1}^{K} \frac{\|\tilde{\mathfrak{g}}_j\|_2^2}{\|\tilde{\mathfrak{g}}_j\|_2^4 \left(\sum_{i=1}^{K} \frac{1}{\|\tilde{\mathfrak{g}}_i\|_2^2}\right)^2} \tag{22} \\
&\Leftrightarrow \quad \frac{|f_k|^{\gamma}}{\sum_{i=1}^{K} \frac{1}{\|\tilde{\mathfrak{g}}_i\|_2^2}} \geq \frac{\eta_t L}{2} \frac{1}{\left(\sum_{i=1}^{K} \frac{1}{\|\tilde{\mathfrak{g}}_i\|_2^2}\right)^2} \sum_{j=1}^{K} \frac{1}{\|\tilde{\mathfrak{g}}_j\|_2^2} \tag{23}
\end{align}
ηt2L|fk|γ.absentsubscript𝜂𝑡2𝐿superscriptsubscript𝑓𝑘𝛾\displaystyle\Leftrightarrow~{}~{}\eta_{t}\leq\frac{2}{L}|f_{k}|^{\gamma}.⇔ italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≤ divide start_ARG 2 end_ARG start_ARG italic_L end_ARG | italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT italic_γ end_POSTSUPERSCRIPT . (24)

Therefore, if the step-size $\eta_t \in [0, \frac{2}{L}\min\{|f_k|^{\gamma}\}_{k\in[K]}]$, then $\boldsymbol{f}(\boldsymbol{\theta}_{t+1}) \leq \boldsymbol{f}(\boldsymbol{\theta}_t)$. ∎
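As a quick numerical illustration of the bound in (24), the following sketch computes the upper end of the admissible step-size interval. The function name `max_safe_step_size` and the sample values of $L$, $\gamma$ and the losses are assumptions made only for this example; in practice $L$ is rarely known exactly.

```python
def max_safe_step_size(losses, smoothness, gamma):
    """Return (2 / L) * min_k |f_k|^gamma, the upper end of the interval above."""
    if smoothness <= 0:
        raise ValueError("the smoothness constant L must be positive")
    return (2.0 / smoothness) * min(abs(f) ** gamma for f in losses)


# Hypothetical values: three clients, L = 4, gamma = 1.
eta_max = max_safe_step_size([0.9, 1.7, 0.4], smoothness=4.0, gamma=1.0)
print(eta_max)
```

The client with the smallest loss determines the bound, so the admissible interval shrinks as any single client approaches zero loss.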

Lastly, similar to many recent FL algorithms (McMahan et al., 2017; Li et al., 2019a; Hu et al., 2022), we allow each client to perform several local epochs $e$ before sending its gradient update to the server. In this case, with a slight abuse of notation, the pseudo-gradients (the negatives of the local updates) are used in place of the gradient vectors. We provide a convergence guarantee for this scenario in Section 5.1. We summarize AdaFed in Algorithm 1.
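The pseudo-gradient described above can be sketched as follows. This is a minimal illustration under stated assumptions: the client runs full-batch gradient descent (so one epoch is one step), and the quadratic toy loss, `CENTER`, `toy_grad` and `local_pseudo_gradient` are hypothetical names introduced only for this example.

```python
def local_pseudo_gradient(theta, grad_fn, local_lr, num_steps):
    """Run `num_steps` local gradient steps and return theta_init - theta_local."""
    theta_local = list(theta)
    for _ in range(num_steps):
        grad = grad_fn(theta_local)
        theta_local = [w - local_lr * g for w, g in zip(theta_local, grad)]
    # The pseudo-gradient is the negative of the local update.
    return [w0 - w for w0, w in zip(theta, theta_local)]


# Toy client loss f_k(theta) = 0.5 * ||theta - CENTER||^2, so grad = theta - CENTER.
CENTER = [1.0, -2.0]


def toy_grad(theta):
    return [w - c for w, c in zip(theta, CENTER)]


one_step = local_pseudo_gradient([0.0, 0.0], toy_grad, local_lr=0.1, num_steps=1)
five_steps = local_pseudo_gradient([0.0, 0.0], toy_grad, local_lr=0.1, num_steps=5)
```

With a single step, the pseudo-gradient equals the local learning rate times the true gradient; with more steps it accumulates the whole local trajectory, which is why it is only a surrogate for the gradient.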

Remark 5.2.

When $e>1$, an alternative approach is to use the accumulated loss rather than the loss from the last iteration in line (9) of Algorithm 1. However, based on our experiments, we observed that using the accumulated loss does not affect the overall performance of the algorithm, including its convergence speed, accuracy and fairness. This stands in contrast to the use of pseudo-gradients, which serves clear purposes of accelerating convergence and reducing communication costs.

Algorithm 1 AdaFed
1: Input: Number of global epochs $T$, number of local epochs $e$, global learning rate $\eta_t$, local learning rate $\eta$, initial global model $\boldsymbol{\theta}_0$, local datasets $\{\mathcal{D}_k\}_{k\in K}$.
2: for $t = 0, 1, \dots, T-1$ do
3:     Server randomly selects a subset of devices $\mathcal{S}_t$ and sends $\boldsymbol{\theta}_t$ to them.
4:     for device $k \in \mathcal{S}_t$ in parallel do [local training]
5:         Store the value $\boldsymbol{\theta}_t$ in $\boldsymbol{\theta}_{\text{init}}$; that is, $\boldsymbol{\theta}_{\text{init}} \leftarrow \boldsymbol{\theta}_t$.
6:         for $e$ epochs do
7:             Perform (stochastic) gradient descent over local dataset $\mathcal{D}_k$ to update: $\boldsymbol{\theta}_t \leftarrow \boldsymbol{\theta}_t - \eta \nabla f_k(\boldsymbol{\theta}_t, \mathcal{D}_k)$.
8:         end for
9:         Send the pseudo-gradient $\mathfrak{g}_k := \boldsymbol{\theta}_{\text{init}} - \boldsymbol{\theta}_t$ and local loss value $f_k(\boldsymbol{\theta}_t)$ to the server.
10:     end for
11:     for $k = 1, 2, \dots, |\mathcal{S}_t|$ do
12:         Find $\tilde{\mathfrak{g}}_k$ from Equations 5 and 6.
13:     end for
14:     Find $\boldsymbol{\lambda}^*$ from Equation 14.
15:     Calculate $\boldsymbol{\mathfrak{d}}_t := \sum_{k=1}^{K} \lambda_k^* \tilde{\mathfrak{g}}_k$.
16:     $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \eta_t \boldsymbol{\mathfrak{d}}_t$.
17: end for
18: Output: Global model $\boldsymbol{\theta}_T$.
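The server-side steps of Algorithm 1 (lines 11 to 16) can be sketched in plain Python. Equations 5, 6 and 14 are not restated in this section, so the code below rests on an assumption rather than on the authors' implementation: $\tilde{\mathfrak{g}}_k$ is taken from a Gram-Schmidt step rescaled by $|f_k|^{\gamma}$, and $\lambda_k^*$ is taken proportional to $1/\|\tilde{\mathfrak{g}}_k\|_2^2$. These choices are picked because they reproduce the identity $\mathfrak{g}_k \cdot \boldsymbol{\mathfrak{d}}_t = |f_k|^{\gamma} / \sum_i \|\tilde{\mathfrak{g}}_i\|_2^{-2}$ used between (22) and (23). The helper names are hypothetical, and the gradients are assumed to be linearly independent.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adafed_direction(grads, losses, gamma):
    """Return (direction, weights) from client gradients and loss values.

    ASSUMPTION: the orthogonalization and weights are an illustrative
    reading, not a restatement of Equations 5, 6 and 14 of the paper.
    """
    ortho = []
    for g, f in zip(grads, losses):
        scale = abs(f) ** gamma
        # Projection coefficients of g onto the orthogonal vectors built so far.
        coeffs = [dot(g, u) / dot(u, u) for u in ortho]
        residual = list(g)
        for c, u in zip(coeffs, ortho):
            residual = [r - c * x for r, x in zip(residual, u)]
        denom = scale - sum(coeffs)
        if denom == 0:
            raise ZeroDivisionError("degenerate scaling; not handled in this sketch")
        ortho.append([r / denom for r in residual])

    inv_sq_norms = [1.0 / dot(u, u) for u in ortho]
    total = sum(inv_sq_norms)
    weights = [w / total for w in inv_sq_norms]  # non-negative, sum to one
    dim = len(grads[0])
    direction = [sum(w * u[j] for w, u in zip(weights, ortho)) for j in range(dim)]
    return direction, weights


# Hypothetical toy round with three clients and a three-dimensional model.
grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
losses = [1.0, 2.0, 3.0]
direction, weights = adafed_direction(grads, losses, gamma=1.0)

eta_t = 0.1
theta = [0.0, 0.0, 0.0]
theta_next = [w - eta_t * d for w, d in zip(theta, direction)]  # line 16
```

In this toy round every inner product $\mathfrak{g}_k \cdot \boldsymbol{\mathfrak{d}}_t$ is positive and proportional to the client's loss, which is the behaviour the method is designed to produce.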

5.1 Convergence results

In the following, we prove the convergence guarantee of AdaFed based on how the clients update the local models: (i) using SGD with $e=1$, (ii) using GD with $e>1$, and (iii) using GD with $e=1$. The strongest convergence guarantee is provided for the last case.

Theorem 5.3 ($e=1$ & local SGD).

Assume that $\boldsymbol{f}=\{f_k\}_{k\in[K]}$ are $l$-Lipschitz continuous and $L$-Lipschitz smooth, and that the global step-size $\eta_t$ satisfies the following three conditions: (i) $\eta_t \in (0, \frac{1}{2L}]$, (ii) $\lim_{T\rightarrow\infty}\sum_{t=0}^{T}\eta_t \rightarrow \infty$, and (iii) $\lim_{T\rightarrow\infty}\sum_{t=0}^{T}\eta_t\sigma_t < \infty$; where $\sigma^2_t = \mathbb{E}[\|\tilde{\boldsymbol{\mathfrak{g}}}\boldsymbol{\lambda}^* - \tilde{\boldsymbol{\mathfrak{g}}}_s\boldsymbol{\lambda}_s^*\|]^2$ is the variance of the stochastic common descent direction. Then

$$\lim_{T\rightarrow\infty}~\min_{t=0,\dots,T}\mathbb{E}[\|\boldsymbol{\mathfrak{d}}_t\|] \rightarrow 0. \tag{25}$$

Theorem 5.4 ($e>1$ & local GD).

Assume that $\boldsymbol{f}=\{f_k\}_{k\in[K]}$ are $l$-Lipschitz continuous and $L$-Lipschitz smooth. Denote by $\eta_t$ and $\eta$ the global and local learning rates, respectively. Also, define $\zeta_t = \|\boldsymbol{\lambda}^* - \boldsymbol{\lambda}^*_e\|$, where $\boldsymbol{\lambda}^*_e$ denotes the optimum weights obtained from pseudo-gradients after $e$ local epochs. We have

$$\lim_{T\rightarrow\infty}~\min_{t=0,\dots,T}\|\boldsymbol{\mathfrak{d}}_t\| \rightarrow 0, \tag{26}$$

if the following conditions are satisfied: (i) $\eta_t \in (0, \frac{1}{2L}]$, (ii) $\lim_{T\rightarrow\infty}\sum_{t=0}^{T}\eta_t \rightarrow \infty$, (iii) $\lim_{t\rightarrow\infty}\eta_t \rightarrow 0$, (iv) $\lim_{t\rightarrow\infty}\eta \rightarrow 0$, and (v) $\lim_{t\rightarrow\infty}\zeta_t \rightarrow 0$.

Before introducing Theorem 5.5, we first introduce some notation. Denote by $\vartheta$ the Pareto-stationary solution set of the minimization problem $\arg\min_{\boldsymbol{\theta}}\boldsymbol{f}(\boldsymbol{\theta})$ [footnote 6: in general, the Pareto-stationary solutions of a multi-objective minimization problem form a set of infinite cardinality (Mukai, 1980)]. Then, denote by $\boldsymbol{\theta}_t^*$ the projection of $\boldsymbol{\theta}_t$ onto the set $\vartheta$; that is, $\boldsymbol{\theta}_t^* = \arg\min_{\boldsymbol{\theta}\in\vartheta}\|\boldsymbol{\theta}_t - \boldsymbol{\theta}\|_2^2$.

Theorem 5.5 ($e=1$ & local GD).

Assume that $\boldsymbol{f}=\{f_k\}_{k\in[K]}$ are $l$-Lipschitz continuous and $\sigma$-convex, and that the global step-size $\eta_t$ satisfies the following two conditions: (i) $\lim_{t\rightarrow\infty}\sum_{j=0}^{t}\eta_j \rightarrow \infty$, and (ii) $\lim_{t\rightarrow\infty}\sum_{j=0}^{t}\eta^2_j < \infty$. Then almost surely $\boldsymbol{\theta}_t \rightarrow \boldsymbol{\theta}_t^*$; that is,

$$\mathbb{P}\left(\lim_{t\rightarrow\infty}\left(\boldsymbol{\theta}_t - \boldsymbol{\theta}_t^*\right) = 0\right) = 1, \tag{27}$$

where $\mathbb{P}(E)$ denotes the probability of event $E$.
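One standard family of step sizes that meets both conditions of Theorem 5.5 is the harmonic schedule $\eta_t = \eta_0/(t+1)$: its sum diverges while the sum of its squares converges. The sketch below checks this numerically over a finite horizon. The function name, the value $\eta_0 = 0.5$ and the optional cap (motivated by the bound $\frac{1}{2L}$ in Theorems 5.3 and 5.4) are assumptions for illustration, not settings taken from the paper.

```python
def harmonic_step_size(t, eta0, cap=None):
    """eta_t = eta0 / (t + 1), optionally capped (for example at 1 / (2L))."""
    eta = eta0 / (t + 1)
    return eta if cap is None else min(eta, cap)


steps = [harmonic_step_size(t, eta0=0.5) for t in range(100_000)]
partial_sum = sum(steps)                    # grows like 0.5 * ln(T), unbounded
partial_sum_sq = sum(s * s for s in steps)  # bounded by 0.25 * pi^2 / 6
```

Capping changes only finitely many terms, so both the divergence of the sum and the convergence of the squared sum are preserved.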

The proofs for Theorems 5.3, 5.4 and 5.5 are provided in Sections A.1, A.2 and A.3, respectively, and we further discuss that the assumptions we made in the theorems are common in the FL literature.

Note that Theorems 5.3, 5.4 and 5.5 each provide some form of convergence to a Pareto-optimal solution of the optimization problem in Equation 1. Specifically, the diminishing $\boldsymbol{\mathfrak{d}}_t$ in Theorems 5.3 and 5.4 implies that the iterates approach a Pareto-optimal point (Désidéri, 2009). On the other hand, Theorem 5.5 explicitly provides this convergence guarantee in the almost-sure sense.

6 AdaFed features and a comparative analysis with FedAdam

6.1 AdaFed features

Aside from satisfying fairness among the users, we mention some notable features of AdaFed in this section.

The inequality $\boldsymbol{f}(\boldsymbol{\theta}_{t+1}) \leq \boldsymbol{f}(\boldsymbol{\theta}_t)$ motivates clients to participate in the FL task, since their loss functions decrease upon participating. In addition, since the common direction is more inclined toward the gradients of the loss functions with larger values, a new client can join the FL task in the middle of the FL process. In this case, for some consecutive rounds, the loss function of the newly joined client decreases more than those of the other clients.

The parameter $\gamma$ is a hyper-parameter of AdaFed. In fact, different $\gamma$ values yield variable levels of fairness. Thus, $\gamma$ should be tuned to achieve the desired fairness level [footnote 7: most (if not all) of the fair FL methods introduce an extra hyper-parameter to tune in order to establish a trade-off between fairness and accuracy, for instance: (i) $\epsilon$ in FedMGDA+ (Hu et al., 2022), (ii) $q$ in Q-FFL & q-FFL (Li et al., 2019a), and (iii) $t$ in TERM (Li et al., 2020a). Similarly, AdaFed introduces a new parameter to make this trade-off]. In general, a moderate $\gamma$ value enforces a larger respective directional derivative for the devices with the worst performance (larger loss functions), imposing more uniformity to the training accuracy distribution.
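To make the role of $\gamma$ concrete, recall from the step between (22) and (23) that $\mathfrak{g}_k \cdot \boldsymbol{\mathfrak{d}}_t$ is proportional to $|f_k|^{\gamma}$ with a factor shared by all clients. The two-client loss values below are hypothetical and serve only as a worked illustration.

```latex
\frac{\mathfrak{g}_i \cdot \boldsymbol{\mathfrak{d}}_t}{\mathfrak{g}_j \cdot \boldsymbol{\mathfrak{d}}_t}
  = \left(\frac{|f_i|}{|f_j|}\right)^{\gamma},
\qquad
|f_i| = 2,\ |f_j| = 1:\quad
\gamma = 0 \Rightarrow 1,\quad
\gamma = 1 \Rightarrow 2,\quad
\gamma = 2 \Rightarrow 4 .
```

With $\gamma = 0$ both clients receive the same directional derivative, while larger $\gamma$ tilts the common direction increasingly toward the client with the larger loss.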

Lastly, we note that AdaFed is orthogonal to the popular FL methods such as Fedprox (Li et al., 2020b) and q-FFL (Li et al., 2019a). Therefore, it could be combined with the existing FL algorithms to achieve a better performance, especially with those using personalization (Li et al., 2021).

6.2 Comparison with FedAdam and its variants

Similarly to AdaFed, there are some FL algorithms in the literature in which the server adaptively updates the global model. Despite this similarity, we note that the purpose of AdaFed is rather different from these algorithms. For instance, let us consider FedAdam (Reddi et al., 2020); this algorithm changes the global update rule of FedAvg from one-step SGD to one-step adaptive gradient optimization by adopting an Adam optimizer on the server side. Specifically, after gathering local pseudo-gradients and finding their average as $\mathfrak{g}_t = \frac{1}{|\mathcal{S}_t|}\sum_{k\in\mathcal{S}_t}\mathfrak{g}^k_t$, the server updates the global model with the Adam optimizer:

\begin{align}
\boldsymbol{m}_t &= \beta_1 \boldsymbol{m}_{t-1} + (1-\beta_1)\,\mathfrak{g}_t, \tag{28} \\
\boldsymbol{v}_t &= \beta_2 \boldsymbol{v}_{t-1} + (1-\beta_2)\,\mathfrak{g}^2_t, \tag{29} \\
\boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t + \eta \frac{\boldsymbol{m}_t}{\sqrt{\boldsymbol{v}_t} + \boldsymbol{\epsilon}}, \tag{30}
\end{align}

where $\beta_1$ and $\beta_2$ are two hyper-parameters of the algorithm, and $\boldsymbol{\epsilon}$ is used for numerical stabilization. We note that other variants of this algorithm, such as FedAdagrad and FedYogi (Reddi et al., 2020) and FedAMSGrad (Tong et al., 2020), involve slight changes in the variance term $\boldsymbol{v}_t$.
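For comparison with Algorithm 1, Equations 28 to 30 can be written as a single element-wise server step. This is a minimal sketch: the function name and the default values of `beta1`, `beta2` and `eps` are assumptions, and the sign of the update follows Equation 30 exactly as written, so it presumes a matching sign convention for the averaged pseudo-gradient.

```python
import math


def fedadam_server_step(theta, avg_pseudo_grad, m, v, lr,
                        beta1=0.9, beta2=0.99, eps=1e-3):
    """One server update following Equations (28)-(30), element-wise."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, avg_pseudo_grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, avg_pseudo_grad)]
    theta = [w + lr * mi / (math.sqrt(vi) + eps)
             for w, mi, vi in zip(theta, m, v)]
    return theta, m, v


# Hypothetical first round: zero-initialized moments and a two-parameter model.
theta, m, v = fedadam_server_step(
    theta=[0.0, 0.0], avg_pseudo_grad=[1.0, -2.0], m=[0.0, 0.0], v=[0.0, 0.0], lr=0.1
)
```

Note that the state carried across rounds (`m`, `v`) depends only on the averaged pseudo-gradient, not on the clients' loss values, which is the structural difference from AdaFed that the text points out.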

The primary objective of FedAdam (as well as its variants) is to enhance convergence behavior (Reddi et al., 2020); this is achieved by retaining information from previous epochs, which helps prevent significant fluctuations in the server update. In contrast, AdaFed is designed to promote fairness among clients. Such differences could be confirmed by comparing Algorithm 1 with Equations 28, 29 and 30.

7 Experiments

In this section, we conclude the paper with several experiments to demonstrate the performance of AdaFed, and compare its effectiveness with state-of-the-art alternatives under some performance metrics.

• Datasets: We conduct a thorough set of experiments over seven datasets. The results for four datasets, namely CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), FEMNIST (Caldas et al., 2018) and Shakespeare (McMahan et al., 2017), are reported in this section; those for Fashion MNIST (Xiao et al., 2017), TinyImageNet (Le & Yang, 2015) and CINIC-10 (Darlow et al., 2018) are reported in Appendix D. In particular, to demonstrate the effectiveness of AdaFed in different FL scenarios, we consider two different FL setups for each of the datasets reported in this section. In addition, we test the effectiveness of AdaFed on a real-world noisy dataset, namely Clothing1M (Xiao et al., 2015), in Appendix I.

• Benchmarks: We compare the performance of AdaFed against various fair FL algorithms in the literature including: q-FFL (Li et al., 2019a), TERM (Li et al., 2020a), FedMGDA+ (Hu et al., 2022), AFL (Mohri et al., 2019), Ditto (Li et al., 2021), FedFA (Huang et al., 2020b), and lastly FedAvg (McMahan et al., 2017). In our experiments, we conduct a grid-search to find the best hyper-parameters for each of the benchmark methods including AdaFed. The details are reported in Appendix E, and here we only report the results obtained from the best hyper-parameters.

• Performance metrics: Denote by $a_k$ the prediction accuracy on device $k$. We use $\bar{a} = \frac{1}{K}\sum_{k=1}^{K} a_k$ as the average test accuracy of the underlying FL algorithm, and use $\sigma_a = \sqrt{\frac{1}{K}\sum_{k=1}^{K}(a_k - \bar{a})^2}$ as the standard deviation of the accuracy across the clients (similarly to Li et al. (2019a; 2021)). Furthermore, we report worst 10% (5%) and best 10% (5%) accuracies as a common metric in fair FL algorithms (Li et al., 2020a).

• Notations: We use bold and underlined numbers to denote the best and second best performance, respectively. We use $e$ and $K$ to represent the number of local epochs and that of clients, respectively.
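The performance metrics above can be computed from a list of per-client accuracies as in the following sketch. The standard deviation uses the $1/K$ normalization from the definition of $\sigma_a$. Reading "worst 10%" as the mean accuracy over the 10% of clients with the lowest accuracy is an assumption, as are the function name and the sample accuracies.

```python
import math


def fairness_metrics(accuracies, tail_fraction=0.10):
    """Mean, population std, and mean accuracy of the worst/best tail of clients."""
    k = len(accuracies)
    mean = sum(accuracies) / k
    std = math.sqrt(sum((a - mean) ** 2 for a in accuracies) / k)  # 1/K, not 1/(K-1)
    tail = max(1, round(tail_fraction * k))  # at least one client in each tail
    ordered = sorted(accuracies)
    return {
        "mean": mean,
        "std": std,
        "worst": sum(ordered[:tail]) / tail,
        "best": sum(ordered[-tail:]) / tail,
    }


# Hypothetical accuracies (in percent) for ten clients.
metrics = fairness_metrics([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
```

A fairer model keeps `mean` high while shrinking `std` and raising `worst`.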

7.1 CIFAR-10

CIFAR-10 dataset (Krizhevsky et al., 2009) contains 40K training and 10K test colour images of size $32 \times 32$, which are labeled for 10 classes. The batch size is equal to 64 for both of the following setups.

• Setup 1: Following (Wang et al., 2021b), we sort the samples by class, and then split them into 200 shards. Each client randomly selects two shards without replacement so that each has the same local dataset size. We use a feedforward neural network with 2 hidden layers. We fix $e=1$ and $K=100$. We carry out 2000 rounds of communication, and sample 10% of the clients in each round. We run SGD on local datasets with step size $\eta=0.1$.

• Setup 2: We distribute the dataset among the clients using Dirichlet allocation (Wang et al., 2020) with $\beta=0.5$. We use ResNet-18 (He et al., 2016) with Group Normalization (Wu & He, 2018). We perform 100 communication rounds in each of which all clients participate. We set $e=1$, $K=10$ and $\eta=0.01$.
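The shard-based split of Setup 1 can be sketched as follows. This is an illustration under assumptions, not the partitioning code used in the experiments: the function name is hypothetical, the toy label list stands in for the real dataset, and samples that do not fill a whole shard are simply dropped.

```python
import random


def shard_partition(labels, num_clients, shards_per_client, seed=0):
    """Sort indices by label, cut them into equal shards, deal shards to clients."""
    rng = random.Random(seed)
    num_shards = num_clients * shards_per_client
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shard_size = len(order) // num_shards  # any remainder is dropped
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    rng.shuffle(shards)  # random assignment without replacement
    return [
        [i for shard in shards[c * shards_per_client:(c + 1) * shards_per_client]
         for i in shard]
        for c in range(num_clients)
    ]


# Toy stand-in: 10 classes with 100 samples each, 100 clients, 2 shards per client.
toy_labels = [c for c in range(10) for _ in range(100)]
clients = shard_partition(toy_labels, num_clients=100, shards_per_client=2)
```

Because each shard comes from a label-sorted list, a client holding two shards typically sees at most two classes, which is what makes this split strongly non-IID.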

Results for both setups are reported in Table 1. Additionally, we depict the average accuracy over the course of training for setup 1 in Section G.1.

Table 1: Test accuracy on CIFAR-10. The reported results are averaged over 5 different random seeds.

| Algorithm | Setup 1: $\bar{a}$ | Setup 1: $\sigma_a$ | Setup 1: Worst 5% | Setup 1: Best 5% | Setup 2: $\bar{a}$ | Setup 2: $\sigma_a$ | Setup 2: Worst 10% | Setup 2: Best 10% |
|---|---|---|---|---|---|---|---|---|
| FedAvg | 46.85 | 3.54 | 19.84 | 69.28 | 63.55 | 5.44 | 53.40 | 72.24 |
| q-FFL | 46.30 | 3.27 | 23.39 | 68.02 | 57.27 | 5.60 | 47.29 | 66.92 |
| FedMGDA+ | 45.34 | 3.37 | 24.00 | 68.51 | 62.05 | 4.88 | 52.69 | 70.77 |
| FedFA | 46.40 | 3.61 | 19.33 | 69.30 | 63.05 | 4.95 | 48.69 | 70.88 |
| TERM | 47.11 | 3.66 | 28.21 | 69.51 | 64.15 | 5.90 | 56.21 | 72.20 |
| Ditto | 46.31 | 3.44 | 27.14 | 68.44 | 63.49 | 5.70 | 55.99 | 71.34 |
| AdaFed | 46.42 | 3.01 | 31.12 | 69.41 | 64.80 | 4.50 | 58.24 | 72.45 |

Table 2: Test accuracy on CIFAR-100. The reported results are averaged over 5 different random seeds.

| Algorithm | Setup 1: $\bar{a}$ | Setup 1: $\sigma_a$ | Setup 1: Worst 10% | Setup 1: Best 10% | Setup 2: $\bar{a}$ | Setup 2: $\sigma_a$ | Setup 2: Worst 10% | Setup 2: Best 10% |
|---|---|---|---|---|---|---|---|---|
| FedAvg | 30.05 | 4.03 | 25.20 | 40.31 | 20.15 | 6.40 | 11.20 | 33.80 |
| q-FFL | 28.86 | 4.44 | 25.38 | 39.77 | 20.20 | 6.24 | 11.09 | 34.02 |
| FedMGDA+ | 29.12 | 4.17 | 25.67 | 39.71 | 20.15 | 5.41 | 11.12 | 33.92 |
| AFL | 30.28 | 3.68 | 25.33 | 39.45 | 18.92 | 4.90 | 11.29 | 28.60 |
| TERM | 30.34 | 3.51 | 27.03 | 39.35 | 17.88 | 5.98 | 10.09 | 31.68 |
| Ditto | 29.81 | 3.79 | 26.90 | 39.39 | 17.52 | 5.65 | 10.21 | 31.25 |
| AdaFed | 31.42 | 3.03 | 28.91 | 40.41 | 20.02 | 4.45 | 11.81 | 34.11 |

7.2 CIFAR-100

CIFAR-100 (Krizhevsky et al., 2009) contains the same number of samples as those in CIFAR-10, yet it contains 100 classes instead.

The model for both setups is ResNet-18 (He et al., 2016) with Group Normalization (Wu & He, 2018), where all clients participate in each round. We also set $e=1$ and $\eta=0.01$. The batch size is equal to 64. The results are reported in Table 2 for both of the following setups:

• Setup 1: We set $K=10$ and $\beta=0.5$ for Dirichlet allocation, and use 400 communication rounds.

• Setup 2: We set $K=50$ and $\beta=0.05$ for Dirichlet allocation, and use 200 communication rounds.

Additionally, we perform the same experiments with more local epochs, specifically $e=10, 20$, as presented in Appendix H.
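Dirichlet allocation with concentration $\beta$ can be sketched as below. This is one common per-class reading of the scheme and an assumption on our part: the exact procedure in Wang et al. (2020) may differ in details. The function name and the toy labels are hypothetical, and the Dirichlet sample is drawn by normalizing Gamma variates from the standard library.

```python
import random


def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split each class across clients with proportions drawn from Dir(beta, ..., beta)."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # A Dirichlet sample is a vector of Gamma(beta, 1) draws divided by their sum.
        draws = [rng.gammavariate(beta, 1.0) for _ in range(num_clients)]
        total = sum(draws) or 1.0  # guard against an all-zero underflow
        cumulative, start = 0.0, 0
        for c, d in enumerate(draws):
            cumulative += d / total
            # The last client takes whatever remains so that no sample is lost.
            end = len(idx) if c == num_clients - 1 else int(cumulative * len(idx))
            clients[c].extend(idx[start:end])
            start = end
    return clients


# Toy stand-in: 10 classes with 50 samples each, K = 10 clients, beta = 0.5.
toy_labels = [c for c in range(10) for _ in range(50)]
parts = dirichlet_partition(toy_labels, num_clients=10, beta=0.5)
```

Smaller $\beta$ concentrates each class on fewer clients, so $\beta=0.05$ yields a far more heterogeneous split than $\beta=0.5$.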

7.3 FEMNIST

FEMNIST (Federated Extended MNIST) Caldas et al. (2018) is a federated image classification dataset distributed over 3,550 devices by the dataset creators. This dataset has 62 classes containing 28×28282828\times 2828 × 28-pixel images of digits (0-9) and English characters (A-Z, a-z) written by different people.

For implementation, we use a CNN model with 2 convolutional layers followed by 2 fully-connected layers. The batch size is 32, and e=2𝑒2e=2italic_e = 2 for both of the following setups:

\bullet FEMNIST-original: We use the setting in Li et al. (2021), and randomly sample K=500𝐾500K=500italic_K = 500 devices (from the 3550 ones) and train models using the default data stored in each device.

\bullet FEMNIST-skewed: Here K=100𝐾100K=100italic_K = 100. We first sample 10 lower case characters (‘a’-‘j’) from Extended MNIST (EMNIST), and then randomly assign 5 classes to each of the 100 devices.

Similarly to Li et al. (2019a), we use two additional fairness metrics for this dataset: (i) the angle between the accuracy distribution and the all-ones vector $\boldsymbol{1}$, denoted by Angle ($^{\circ}$), and (ii) the KL divergence between the normalized accuracy $a$ and the uniform distribution $u$, denoted by KL ($a\|u$). Results for both setups are reported in Table 3. In addition, we report the distribution of accuracies across clients in Section G.2.
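The two fairness metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the accuracies are normalized to sum to one before computing the KL divergence, and the natural logarithm is used; the text does not fix these details.

```python
import numpy as np

def angle_to_ones_deg(acc):
    """Angle (in degrees) between the per-client accuracy vector and the all-ones vector."""
    acc = np.asarray(acc, dtype=float)
    ones = np.ones_like(acc)
    cos = acc @ ones / (np.linalg.norm(acc) * np.linalg.norm(ones))
    # Clip guards against tiny floating-point overshoots outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def kl_to_uniform(acc):
    """KL(a || u) with a = acc / sum(acc) (assumed normalization) and u uniform."""
    acc = np.asarray(acc, dtype=float)
    a = acc / acc.sum()
    u = np.full_like(a, 1.0 / a.size)
    mask = a > 0  # terms with a_k = 0 contribute zero
    return float(np.sum(a[mask] * np.log(a[mask] / u[mask])))
```

Both quantities are zero when every client attains the same accuracy and grow as the accuracy vector becomes less uniform, so smaller values indicate a fairer model.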

Table 3: Test accuracy on FEMNIST. The reported results are averaged over 5 different random seeds.
FEMNIST-original FEMNIST-skewed
Algorithm $\bar{a}$ $\sigma_a$ Angle ($^{\circ}$) KL ($a\|u$) $\bar{a}$ $\sigma_a$ Angle ($^{\circ}$) KL ($a\|u$)
FedAvg 80.42 11.16 10.18 0.017 79.24 22.30 12.29 0.054
q-FFL 80.91 10.62 9.71 0.016 84.65 18.56 12.01 0.038
FedMGDA+ 81.00 10.41 10.04 0.016 85.41 17.36 11.63 0.032
TERM 81.08 10.32 9.15 0.015 84.29 13.88 11.27 0.025
AFL 82.45 9.85 9.01 0.012 85.21 14.92 11.44 0.027
Ditto 83.77 10.13 9.34 0.014 92.51 14.32 11.45 0.022
AdaFed 82.26 6.58 8.12 0.009 92.21 7.56 9.44 0.011

7.4 Text Data

We use The Complete Works of William Shakespeare (McMahan et al., 2017) as the dataset, and train an RNN that takes an 80-character sequence as input and predicts the next character. The dataset contains about 1,129 speaking roles, and each speaking role is naturally treated as a device. Each device stores the text of its role, which is used to train the RNN on that device. The dataset is available on the LEAF website (Caldas et al., 2018). We use $e=1$, and let all the devices participate in each round. The results are reported in Table 4 for the following two setups:

• Setup 1: Following McMahan et al. (2017), we subsample 31 speaking roles, and assign each role to a client ($K=31$) to complete 500 communication rounds. We use a model with two LSTM layers (Hochreiter & Schmidhuber, 1997) and one densely-connected layer. The initial step-size is $\eta=0.8$, with a decay rate of 0.95.

• Setup 2: Among the 31 speaking roles, the 20 with more than 10,000 samples are selected and assigned to 20 clients ($K=20$). We use one LSTM layer followed by a fully-connected layer. We set $\eta=2$, and the number of communication rounds is 100.
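The next-character prediction task described above pairs each 80-character input with the character that follows it. A minimal sketch of how such training pairs could be built from one role's text is given below; the function name and the stride of one character are assumptions for illustration, and the LEAF preprocessing may differ.

```python
def next_char_examples(text, seq_len=80):
    """Return (input_sequence, next_character) pairs from a sliding window over text.

    Hypothetical helper; a stride of one character is assumed.
    """
    return [(text[i:i + seq_len], text[i + seq_len])
            for i in range(len(text) - seq_len)]

# Toy text standing in for the lines of a single speaking role.
role_text = "To be, or not to be, that is the question: " * 3
examples = next_char_examples(role_text)
```

Each client would build its examples only from its own role's text, so the amount of local data varies with how much that role speaks.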

Table 4: Test accuracy on Shakespeare. The reported results are averaged over 5 different random seeds.
Setup 1 Setup 2
Algorithm $\bar{a}$ $\sigma_a$ Worst 10% Best 10% $\bar{a}$ $\sigma_a$ Worst 10% Best 10%
FedAvg 53.21 9.25 51.01 54.41 50.48 1.24 48.20 52.10
q-FFL 53.90 7.52 51.52 54.47 50.72 1.07 48.90 52.29
FedMGDA+ 53.08 8.14 52.84 54.51 50.41 1.09 48.18 51.99
AFL 54.58 8.44 52.87 55.84 52.45 1.23 50.02 54.17
TERM 54.16 8.21 52.09 55.15 52.17 1.11 49.14 53.62
Ditto 60.74 8.32 53.57 55.92 53.12 1.20 50.94 55.23
AdaFed 55.65 6.55 53.79 55.86 52.89 0.98 51.02 54.48

7.5 Analysis of Results

Tables 1, 2, 3 and 4 yield several notable insights. Compared to the other benchmark methods, AdaFed leads to significantly fairer solutions. In addition, the average accuracy is not sacrificed, and in some cases it even improves. The advantage of AdaFed also grows as the level of non-iidness increases. For instance, on FEMNIST-skewed in Table 3, AdaFed attains $\sigma_a = 7.56$, whereas the next-best method (TERM) attains 13.88. Note that the average accuracy of Ditto over FEMNIST is greater than that of AdaFed. This is expected, since Ditto provides a personalized solution for each device, while AdaFed returns only a single global parameter $\boldsymbol{\theta}$.

We also observe a similar trend in three other datasets reported in Appendix D. We further analyse the effect of the hyper-parameter $\gamma$ in AdaFed in Appendix E.

7.5.1 Percentage of improved clients

We measure the training loss before and after each communication round for all participating clients and report the percentage of clients whose loss function decreased or remained unchanged, as defined below

$$\rho_t = \frac{\sum_{k\in\mathcal{S}_t} \mathbb{I}\{\boldsymbol{f}_k(\boldsymbol{\theta}_{t+1}) \leq \boldsymbol{f}_k(\boldsymbol{\theta}_t)\}}{|\mathcal{S}_t|}, \tag{31}$$

where $\mathcal{S}_t$ is the set of participating clients in round $t$, and $\mathbb{I}(\cdot)$ is the indicator function. We then plot $\rho_t$ versus communication rounds for different fair FL benchmarks, including AdaFed. The curves for the CIFAR-10 and CIFAR-100 datasets are reported in Figure 2a and Figure 2b, respectively. As seen, both AdaFed and FedMGDA+ consistently outperform the other benchmark methods in that fewer clients' performance degrades after participation. This feature is unique to these two methods. We further note that, after a sufficient number of communication rounds, the curves for both AdaFed and FedMGDA+ converge to 100% (with some fluctuation).
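For concreteness, the quantity in Equation (31) can be computed from the per-client training losses recorded before and after a round. The sketch below assumes the losses are already available as two equal-length arrays; the function name and the example values are hypothetical.

```python
import numpy as np

def improved_fraction(loss_before, loss_after):
    """Fraction of participating clients whose loss decreased or stayed unchanged."""
    before = np.asarray(loss_before, dtype=float)
    after = np.asarray(loss_after, dtype=float)
    if before.shape != after.shape or before.size == 0:
        raise ValueError("expected two non-empty arrays of equal shape")
    # The indicator I{f_k(theta_{t+1}) <= f_k(theta_t)} averaged over clients.
    return float(np.mean(after <= before))

# Four clients: three improve or stay the same, one gets worse.
rho_t = improved_fraction([1.0, 2.0, 3.0, 4.0], [0.9, 2.0, 3.5, 3.0])
```

Multiplying the result by 100 gives the percentage plotted against communication rounds.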

Figure 2: The percentage of improved clients as a function of communication rounds for (a) CIFAR-10 Setup 1 in Section 7.1; and (b) CIFAR-100 Setup 1 in Section 7.2.
Figure 3: The training loss of two clients trained in the AdaFed framework versus the communication rounds for (a) CIFAR-10 Setup 1 in Section 7.1; and (b) CIFAR-100 Setup 1 in Section 7.2.
Figure 4: The convergence of $\|\boldsymbol{\mathfrak{d}}_t\|$ as a function of communication rounds for (a) $e=1$ and local SGD, (b) $e>1$ and local GD, and (c) $e=1$ and local GD. The dataset is CIFAR-10.

7.5.2 Rate of decrease in loss function

In this part, we observe the loss function values for two clients over the course of training to verify Theorem 4.1. This theorem asserts that the rate of decrease in the loss function is higher for clients with larger initial loss function values. To this end, we select two clients—one with a low initial loss function and one with a high initial loss function—and depict their respective training loss as a function of communication rounds.

The curves are illustrated in Figure 3a and Figure 3b for CIFAR-10 and CIFAR-100 datasets, respectively. As observed in both curves, the rate of decrease in the loss function of the client with a larger initial loss function is higher. Additionally, close to the end of the training task, the values for the loss function of the clients converge to almost the same value, indicating fairness among the clients.

7.5.3 Convergence of $\|\boldsymbol{\mathfrak{d}}_t\|$

In this part, we aim to observe the behaviour of $\|\boldsymbol{\mathfrak{d}}_t\|$ over the course of training. Particularly, we consider three cases following Theorems 5.3, 5.4 and 5.5, namely (i) $e=1$ & local SGD, (ii) $e>1$ & local GD, and (iii) $e=1$ & local GD.

For training, we follow the setup in Section 7.1; however, we change $e$ and the local training method (either GD or SGD) to generate the three cases mentioned above. Then, for these three cases, we normalize $\|\boldsymbol{\mathfrak{d}}_t\|$, and depict it versus the communication rounds.

As observed in Figures 4a, 4b and 4c, in all three cases, $\|\boldsymbol{\mathfrak{d}}_t\|$ tends to zero. Nonetheless, the curve for $e=1$ & local GD is smoother.

8 Conclusion

In this paper, we proposed AdaFed, a method to enforce fairness in FL. AdaFed adaptively tunes a common direction along which the server updates the global model. The common direction found by AdaFed enjoys two properties: (i) it is a descent direction for all the local loss functions, and (ii) the loss functions of the worst-performing clients decrease at a higher rate along this direction. These properties were satisfied by using the notion of directional derivative in the multi-objective optimization task. We then derived a closed-form formula for this common direction, and proved that AdaFed converges to a Pareto-stationary point. The effectiveness of AdaFed was demonstrated via thorough experimental results.

References

  • Barocas et al. (2017) Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness in machine learning. Nips tutorial, 1:2017, 2017.
  • Bonawitz et al. (2019) Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, et al. Towards federated learning at scale: System design. Proceedings of machine learning and systems, 1:374–388, 2019.
  • Boyd & Vandenberghe (2004) Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Caldas et al. (2018) Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
  • Cui et al. (2021) Sen Cui, Weishen Pan, Jian Liang, Changshui Zhang, and Fei Wang. Addressing algorithmic disparity and performance inconsistency in federated learning. Advances in Neural Information Processing Systems, 34:26091–26102, 2021.
  • Darlow et al. (2018) Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. Cinic-10 is not imagenet or cifar-10. arXiv preprint arXiv:1810.03505, 2018.
  • Désidéri (2009) Jean-Antoine Désidéri. Multiple-gradient descent algorithm (MGDA). PhD thesis, INRIA, 2009.
  • Désidéri (2012) Jean-Antoine Désidéri. Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313–318, 2012.
  • Du et al. (2021) Wei Du, Depeng Xu, Xintao Wu, and Hanghang Tong. Fairness-aware agnostic federated learning. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pp.  181–189. SIAM, 2021.
  • Eichner et al. (2019) Hubert Eichner, Tomer Koren, Brendan McMahan, Nathan Srebro, and Kunal Talwar. Semi-cyclic stochastic gradient descent. In International Conference on Machine Learning, pp.  1764–1773. PMLR, 2019.
  • Fliege & Svaiter (2000) Jörg Fliege and Benar Fux Svaiter. Steepest descent methods for multicriteria optimization. Mathematical methods of operations research, 51(3):479–494, 2000.
  • Hamidi & Damen (2024) Shayan Mohajer Hamidi and Oussama Damen. Fair wireless federated learning through the identification of a common descent direction. IEEE Communications Letters, pp.  1–1, 2024. doi: 10.1109/LCOMM.2024.3350378.
  • Hamidi et al. (2019) Shayan Mohajer Hamidi, Sanjeewa Herath, Alireza Bayesteh, and Amir Keyvan Khandani. Systems and methods for communication resource usage control, May 30 2019. US Patent App. 15/824,352.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hu et al. (2022) Zeou Hu, Kiarash Shaloudegi, Guojun Zhang, and Yaoliang Yu. Federated learning meets multi-objective optimization. IEEE Transactions on Network Science and Engineering, 2022.
  • Huaizhou et al. (2013) SHI Huaizhou, R Venkatesha Prasad, Ertan Onur, and IGMM Niemegeers. Fairness in wireless networks: Issues, measures and challenges. IEEE Communications Surveys & Tutorials, 16(1):5–24, 2013.
  • Huang et al. (2020a) Tiansheng Huang, Weiwei Lin, Wentai Wu, Ligang He, Keqin Li, and Albert Y Zomaya. An efficiency-boosting client selection scheme for federated learning with fairness guarantee. IEEE Transactions on Parallel and Distributed Systems, 32(7):1552–1564, 2020a.
  • Huang et al. (2022) Tiansheng Huang, Weiwei Lin, Li Shen, Keqin Li, and Albert Y Zomaya. Stochastic client selection for federated learning with volatile clients. IEEE Internet of Things Journal, 9(20):20055–20070, 2022.
  • Huang et al. (2020b) Wei Huang, Tianrui Li, Dexian Wang, Shengdong Du, and Junbo Zhang. Fairness and accuracy in federated learning. arXiv preprint arXiv:2012.10069, 2020b.
  • Kairouz et al. (2021) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • Kang et al. (2019) Jiawen Kang, Zehui Xiong, Dusit Niyato, Han Yu, Ying-Chang Liang, and Dong In Kim. Incentive design for efficient federated learning in mobile networks: A contract theory approach. In 2019 IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS), pp.  1–5. IEEE, 2019.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Langley (2000) P. Langley. Crafting papers on machine learning. In Pat Langley (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp.  1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
  • Le et al. (2021) Tra Huong Thi Le, Nguyen H Tran, Yan Kyaw Tun, Minh NH Nguyen, Shashi Raj Pandey, Zhu Han, and Choong Seon Hong. An incentive mechanism for federated learning in wireless cellular networks: An auction approach. IEEE Transactions on Wireless Communications, 20(8):4874–4887, 2021.
  • Le & Yang (2015) Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2019a) Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair resource allocation in federated learning. In International Conference on Learning Representations, 2019a.
  • Li et al. (2020a) Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In International Conference on Learning Representations, 2020a.
  • Li et al. (2020b) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020b.
  • Li et al. (2021) Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pp.  6357–6368. PMLR, 2021.
  • Li et al. (2019b) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019b.
  • Lyu et al. (2020) Lingjuan Lyu, Xinyi Xu, Qian Wang, and Han Yu. Collaborative fairness in federated learning. Federated Learning: Privacy and Incentive, pp.  189–204, 2020.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.  1273–1282. PMLR, 2017.
  • Mercier et al. (2018) Quentin Mercier, Fabrice Poirion, and Jean-Antoine Désidéri. A stochastic multiple gradient descent algorithm. European Journal of Operational Research, 271(3):808–817, 2018.
  • Mohri et al. (2019) Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In International Conference on Machine Learning, pp.  4615–4625. PMLR, 2019.
  • Mukai (1980) Hiroaki Mukai. Algorithms for multicriterion optimization. IEEE transactions on automatic control, 25(2):177–186, 1980.
  • Nishio & Yonetani (2019) Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019-2019 IEEE international conference on communications (ICC), pp.  1–7. IEEE, 2019.
  • Rawls (2020) John Rawls. A theory of justice: Revised edition. Harvard university press, 2020.
  • Reddi et al. (2020) Sashank J Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2020.
  • Song et al. (2021) Zhendong Song, Hongguang Sun, Howard H Yang, Xijun Wang, Yan Zhang, and Tony QS Quek. Reputation-based federated learning for secure wireless networks. IEEE Internet of Things Journal, 9(2):1212–1226, 2021.
  • Tong et al. (2020) Qianqian Tong, Guannan Liang, and Jinbo Bi. Effective federated adaptive gradient methods with non-iid decentralized data. arXiv preprint arXiv:2009.06557, 2020.
  • Wang et al. (2020) Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020.
  • Wang et al. (2021a) Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H Brendan McMahan, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021a.
  • Wang et al. (2021b) Zheng Wang, Xiaoliang Fan, Jianzhong Qi, Chenglu Wen, Cheng Wang, and Rongshan Yu. Federated learning with fair averaging. arXiv preprint arXiv:2104.14937, 2021b.
  • Wu & He (2018) Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pp.  3–19, 2018.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Xiao et al. (2015) Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2691–2699, 2015.
  • Xu et al. (2022) Jingyi Xu, Zihan Chen, Tony QS Quek, and Kai Fong Ernest Chong. Fedcorr: Multi-stage federated learning for label noise correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10184–10193, 2022.
  • Yang et al. (2021) Miao Yang, Ximin Wang, Hongbin Zhu, Haifeng Wang, and Hua Qian. Federated learning with class imbalance reduction. In 2021 29th European Signal Processing Conference (EUSIPCO), pp.  2174–2178. IEEE, 2021.
  • Ye et al. (2020) Dongdong Ye, Rong Yu, Miao Pan, and Zhu Han. Federated learning in vehicular edge computing: A selective model aggregation approach. IEEE Access, 8:23920–23935, 2020.
  • Zhang et al. (2020) Jingfeng Zhang, Cheng Li, Antonio Robles-Kelly, and Mohan Kankanhalli. Hierarchically fair federated learning. arXiv preprint arXiv:2004.10386, 2020.
  • Zhang et al. (2021) Jingwen Zhang, Yuezhou Wu, and Rong Pan. Incentive mechanism for horizontal federated learning based on reputation and reverse auction. In Proceedings of the Web Conference 2021, pp.  947–956, 2021.

Appendix A Convergence of AdaFed

In the following, we provide three theorems to analyse the convergence of AdaFed under different scenarios. Specifically, we consider three cases: (i) Theorem A.1 considers $e=1$ and using SGD for local updates, (ii) Theorem A.2 considers an arbitrary value for $e$ and using GD for local updates, and (iii) Theorem A.4 considers $e=1$ and using GD for local updates.

A.1 Case 1: $e=1$ & local SGD

Notations: We use the subscript $(\cdot)_s$ to indicate a stochastic value. Using this notation for the values we introduced in the paper, the notations used in the proof of Theorem A.1 are summarized in Table 5.

Table 5: Notations used in Theorem A.1 for $e=1$ & local SGD.

Notation Description
$\mathfrak{g}_{k,s}$ Stochastic gradient vector of client $k$.
$\boldsymbol{\mathfrak{g}}_s$ Matrix of stochastic gradient vectors $[\mathfrak{g}_{1,s},\dots,\mathfrak{g}_{K,s}]$.
$\tilde{\mathfrak{g}}_{k,s}$ Stochastic gradient vector of client $k$ after the orthogonalization process.
$\tilde{\boldsymbol{\mathfrak{g}}}_s$ Matrix of orthogonalized stochastic gradient vectors $[\tilde{\mathfrak{g}}_{1,s},\dots,\tilde{\mathfrak{g}}_{K,s}]$.
$\lambda_{k,s}^{*}$ Optimum weights obtained from Equation (14) using the stochastic gradients $\tilde{\boldsymbol{\mathfrak{g}}}_s$.
$\boldsymbol{\mathfrak{d}}_s$ Optimum direction obtained using the stochastic $\tilde{\boldsymbol{\mathfrak{g}}}_s$; that is, $\mathfrak{d}_s=\sum_{k=1}^{K}\lambda_{k,s}^{*}\tilde{\mathfrak{g}}_{k,s}$.
Theorem A.1.

Assume that $\boldsymbol{f}=\{f_k\}_{k\in[K]}$ are $l$-Lipschitz continuous and $L$-Lipschitz smooth, and that the step-size $\eta_t$ satisfies the following three conditions: (i) $\eta_t\in(0,\frac{1}{2L}]$, (ii) $\lim_{T\rightarrow\infty}\sum_{t=0}^{T}\eta_t\rightarrow\infty$ and (iii) $\lim_{T\rightarrow\infty}\sum_{t=0}^{T}\eta_t\sigma_t<\infty$; where $\sigma^2_t=\mathbb{E}[\|\tilde{\boldsymbol{\mathfrak{g}}}\boldsymbol{\lambda}^{*}-\tilde{\boldsymbol{\mathfrak{g}}}_s\boldsymbol{\lambda}_s^{*}\|]^2$ is the variance of the stochastic common descent direction. Then

$$\lim_{T\rightarrow\infty}~\min_{t=0,\dots,T}\mathbb{E}[\|\boldsymbol{\mathfrak{d}}_t\|]\rightarrow 0. \tag{32}$$
Proof.

Since the orthogonal vectors $\{\tilde{\mathfrak{g}_k}\}_{k\in[K]}$ span the same $K$-dimensional space as that spanned by the gradient vectors $\{\mathfrak{g}_k\}_{k\in[K]}$, then

$$\exists\{\lambda'_k\}_{k\in[K]}~~\text{s.t.}~~\boldsymbol{\mathfrak{d}}=\sum_{k=1}^{K}\lambda_k^{*}\tilde{\mathfrak{g}_k}=\sum_{k=1}^{K}\lambda_k'\mathfrak{g}_k=\boldsymbol{\mathfrak{g}}\boldsymbol{\lambda}'. \tag{33}$$

Similarly, for the stochastic gradients we have

$$\exists\{\lambda'_{k,s}\}_{k\in[K]}~~\text{s.t.}~~\boldsymbol{\mathfrak{d}}_s=\sum_{k=1}^{K}\lambda_{k,s}^{*}\tilde{\mathfrak{g}}_{k,s}=\sum_{k=1}^{K}\lambda_{k,s}'\mathfrak{g}_{k,s}=\boldsymbol{\mathfrak{g}}_s\boldsymbol{\lambda}'_s. \tag{34}$$

Define $\Delta_t=\boldsymbol{\mathfrak{g}}\boldsymbol{\lambda}'-\boldsymbol{\mathfrak{g}}_s\boldsymbol{\lambda}'_s=\tilde{\boldsymbol{\mathfrak{g}}}\boldsymbol{\lambda}^{*}-\tilde{\boldsymbol{\mathfrak{g}}}_s\boldsymbol{\lambda}_s^{*}$, where the last equality is due to the definitions in Equations 33 and 34.
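The span argument behind Equations 33 and 34 can be checked numerically. The sketch below is illustrative only: it uses plain Gram-Schmidt as a stand-in for the orthogonalization process (which may differ from the one AdaFed uses), random vectors in place of real client gradients, and arbitrary simplex weights in place of the optimum $\boldsymbol{\lambda}^{*}$ from Equation (14). It recovers weights $\boldsymbol{\lambda}'$ on the original gradients that reproduce the same direction.

```python
import numpy as np

def gram_schmidt(G):
    """Orthogonalize the columns of G (plain Gram-Schmidt, columns not normalized)."""
    G = np.asarray(G, dtype=float)
    Q = np.zeros_like(G)
    for k in range(G.shape[1]):
        v = G[:, k].copy()
        for j in range(k):
            # Remove the component of g_k along the j-th orthogonalized vector.
            v -= (Q[:, j] @ G[:, k]) / (Q[:, j] @ Q[:, j]) * Q[:, j]
        Q[:, k] = v
    return Q

rng = np.random.default_rng(0)
dim, K = 20, 4                      # hypothetical model dimension and client count
G = rng.normal(size=(dim, K))       # columns play the role of the gradients g_k
G_tilde = gram_schmidt(G)           # columns play the role of the orthogonalized gradients

lam_star = rng.random(K)
lam_star /= lam_star.sum()          # arbitrary weights, NOT the optimum from Equation (14)
d = G_tilde @ lam_star              # direction expressed in the orthogonal basis

# Solve G @ lam_prime = d; an exact solution exists because both bases span the same space.
lam_prime, *_ = np.linalg.lstsq(G, d, rcond=None)
```

Here `G @ lam_prime` matches `d` up to floating-point error, which is the existence statement of Equation 33 for this toy instance.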

We can find an upper bound for $\boldsymbol{f}(\boldsymbol{\theta}_{t+1})$ as follows

\begin{align}
\boldsymbol{f}(\boldsymbol{\theta}_{t+1}) &= \boldsymbol{f}(\boldsymbol{\theta}_{t}-\eta_{t}\mathfrak{d}_{t}) \tag{35}\\
&= \boldsymbol{f}\Big(\boldsymbol{\theta}_{t}-\eta_{t}\sum_{k=1}^{K}\lambda_{k,s}^{*}\tilde{\mathfrak{g}}_{k,s}\Big) \tag{36}\\
&= \boldsymbol{f}(\boldsymbol{\theta}_{t}-\eta_{t}\boldsymbol{\mathfrak{g}}_{s}\boldsymbol{\lambda}^{\prime}_{s}) \tag{37}\\
&\leq \boldsymbol{f}(\boldsymbol{\theta}_{t})-\eta_{t}\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\mathfrak{g}}_{s}^{T}\boldsymbol{\lambda}_{s}^{\prime}+\frac{L\eta_{t}^{2}}{2}\|\boldsymbol{\mathfrak{g}}_{s}^{T}\boldsymbol{\lambda}_{s}^{\prime}\|^{2} \tag{38}\\
&\leq \boldsymbol{f}(\boldsymbol{\theta}_{t})-\eta_{t}\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\lambda}^{\prime}+L\eta_{t}^{2}\|\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\lambda}^{\prime}\|^{2}+\eta_{t}\boldsymbol{\mathfrak{g}}^{T}\Delta_{t}+L\eta_{t}^{2}\|\Delta_{t}\|^{2} \tag{39}\\
&\leq \boldsymbol{f}(\boldsymbol{\theta}_{t})-\eta_{t}(1-L\eta_{t})\|\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\lambda}^{\prime}\|^{2}+l\eta_{t}\|\Delta_{t}\|+L\eta_{t}^{2}\|\Delta_{t}\|^{2}, \tag{40}
\end{align}

where (36) uses stochastic gradients in the updating rule of AdaFed, (37) is obtained from the definition in (34), (38) follows from the quadratic upper bound for the smooth functions $\boldsymbol{f}=\{f_{k}\}_{k\in[K]}$, and lastly (40) follows from the Lipschitz continuity of $\boldsymbol{f}=\{f_{k}\}_{k\in[K]}$.

Assuming $\eta_{t}\in(0,\frac{1}{2L}]$ and taking the expectation of both sides, we obtain:

$$\min_{t=0,\dots,T}\mathbb{E}[\|\mathfrak{d}_{t}\|]\leq\frac{\boldsymbol{f}(\boldsymbol{\theta}_{0})-\mathbb{E}[\boldsymbol{f}(\boldsymbol{\theta}_{T+1})]+\sum_{t=0}^{T}\eta_{t}(l\sigma_{t}+L\eta_{t}\sigma_{t}^{2})}{\frac{1}{2}\sum_{t=0}^{T}\eta_{t}}. \tag{41}$$

Using the assumptions (i) $\lim_{T\rightarrow\infty}\sum_{t=0}^{T}\eta_{t}\rightarrow\infty$ and (ii) $\lim_{T\rightarrow\infty}\sum_{t=0}^{T}\eta_{t}\sigma_{t}<\infty$, the theorem is concluded. Note that a vanishing $\boldsymbol{\mathfrak{d}}_{t}$ implies reaching a Pareto-stationary point of the original MoM problem. Yet, the convergence rate differs across scenarios, as we see in the following theorems. ∎

A.1.1 Discussing the assumptions

• The assumptions over the local loss functions: $l$-Lipschitz continuity and $L$-Lipschitz smoothness of the local loss functions are two standard assumptions in FL papers that provide convergence guarantees (Li et al., 2019b).

• The assumptions over the step-size: The three assumptions we enforced over the step-size can be easily satisfied. For instance, one can pick $\eta_{t}=\kappa_{1}\frac{1}{t}$ for some constant $\kappa_{1}$ such that $\eta_{t}\in(0,\frac{1}{2L}]$ is satisfied. Then, even if $\sigma_{t}$ has an extremely loose upper bound, say $\sigma_{t}<\frac{\kappa_{2}}{t^{\epsilon}}$ for a small $\epsilon\in\mathbb{R}_{+}$ and a constant $\kappa_{2}$, all three assumptions over the step-size in the theorem are satisfied. Note that the convergence rate of AdaFed depends on how fast $\sigma_{t}$ diminishes, which in turn depends on how heterogeneous the users are.
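The step-size example above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the constants `L`, `kappa1`, `kappa2` and `eps` are arbitrary illustrative values, the index starts at $t=1$ because $1/t$ is undefined at $t=0$, and $\sigma_t$ is replaced by its assumed upper bound $\kappa_2/t^{\epsilon}$. It prints the partial sums of $\eta_t$ (which keep growing) and of $\eta_t\sigma_t$ (which level off), and checks that every $\eta_t$ stays within $(0,\frac{1}{2L}]$.

```python
def partial_sums(kappa1, kappa2, eps, T):
    """Return (sum of eta_t, sum of eta_t * sigma-bound_t, max eta_t) for t = 1..T.

    eta_t = kappa1 / t is the step-size from the example;
    kappa2 / t**eps is the assumed upper bound on sigma_t.
    """
    sum_eta = 0.0
    sum_eta_sigma = 0.0
    max_eta = 0.0
    for t in range(1, T + 1):
        eta = kappa1 / t
        sigma_bound = kappa2 / t ** eps
        sum_eta += eta
        sum_eta_sigma += eta * sigma_bound
        max_eta = max(max_eta, eta)
    return sum_eta, sum_eta_sigma, max_eta


# Illustrative constants (assumptions, not values from the paper).
L = 2.0
kappa1 = 1.0 / (2.0 * L)  # largest kappa1 for which eta_1 <= 1/(2L)
kappa2 = 5.0
eps = 0.1

for T in (10**2, 10**3, 10**4, 10**5):
    s_eta, s_eta_sigma, max_eta = partial_sums(kappa1, kappa2, eps, T)
    print(T, round(s_eta, 3), round(s_eta_sigma, 3), max_eta <= 1.0 / (2.0 * L))
```

The first sum is a scaled harmonic series and diverges, while the second is a scaled $p$-series with exponent $1+\epsilon>1$; by the integral test it is bounded by $\kappa_1\kappa_2(1+1/\epsilon)$.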

A.2 Case 2: $e>1$ & local GD

The notations used in this subsection are elaborated in Table 6.

Table 6: Notations used in Theorem A.2 for $e>1$ and local GD.

| Notation | Description |
|---|---|
| $\boldsymbol{\theta}_{(k,e)^{t}}$ | Updated weight for client $k$ after $e$ local epochs at the $t$-th round of FL. |
| $\mathfrak{g}_{k,e}$ | $\mathfrak{g}_{k,e}=\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{(k,e)^{t}}$; that is, the update vector of client $k$ after $e$ local epochs. |
| $\boldsymbol{\mathfrak{g}}_{e}$ | Matrix of update vectors $[\mathfrak{g}_{1,e},\dots,\mathfrak{g}_{K,e}]$. |
| $\tilde{\mathfrak{g}}_{k,e}$ | Update vector of client $k$ after orthogonalization process. |
| $\tilde{\boldsymbol{\mathfrak{g}}}_{e}$ | Matrix of orthogonalized update vectors $[\tilde{\mathfrak{g}}_{1,e},\dots,\tilde{\mathfrak{g}}_{K,e}]$. |
| $\lambda_{k,e}^{*}$ | Optimum weights obtained from Equation (14) using $\tilde{\boldsymbol{\mathfrak{g}}}_{e}$. |
| $\boldsymbol{\mathfrak{d}}_{e}$ | Optimum direction obtained using $\tilde{\boldsymbol{\mathfrak{g}}}_{e}$; that is, $\mathfrak{d}_{e}=\sum_{k=1}^{K}\lambda_{k,e}^{*}\tilde{\mathfrak{g}}_{k,e}$. |

Theorem A.2.

Assume that $\boldsymbol{f}=\{f_{k}\}_{k\in[K]}$ are $l$-Lipschitz continuous and $L$-Lipschitz smooth. Denote by $\eta_{t}$ and $\eta$ the global and local learning rates, respectively. Also, define $\zeta_{t}=\|\boldsymbol{\lambda}^{*}-\boldsymbol{\lambda}^{*}_{e}\|$, where $\boldsymbol{\lambda}^{*}_{e}$ is the vector of optimum weights obtained from the pseudo-gradients after $e$ local epochs. Then,

$$\lim_{T\rightarrow\infty}~\min_{t=0,\dots,T}\|\boldsymbol{\mathfrak{d}}_{t}\|\rightarrow 0, \tag{42}$$

if the following conditions are satisfied: (i) $\eta_{t}\in(0,\frac{1}{2L}]$, (ii) $\lim_{T\rightarrow\infty}\sum_{t=0}^{T}\eta_{t}\rightarrow\infty$, (iii) $\lim_{t\rightarrow\infty}\eta_{t}\rightarrow 0$, (iv) $\lim_{t\rightarrow\infty}\eta\rightarrow 0$, and (v) $\lim_{t\rightarrow\infty}\zeta_{t}\rightarrow 0$.

Proof.

As discussed in the proof of Theorem A.1, we can write

\begin{align}
&\exists\,\{\lambda^{\prime}_{k}\}_{k\in[K]}~~\text{s.t.}~~\mathfrak{d}=\sum_{k=1}^{K}\lambda_{k}^{*}\tilde{\mathfrak{g}}_{k}=\sum_{k=1}^{K}\lambda_{k}^{\prime}\mathfrak{g}_{k}=\boldsymbol{\mathfrak{g}}\boldsymbol{\lambda}^{\prime}, \tag{43}\\
&\exists\,\{\lambda^{\prime}_{k,e}\}_{k\in[K]}~~\text{s.t.}~~\mathfrak{d}_{e}=\sum_{k=1}^{K}\lambda_{k,e}^{*}\tilde{\mathfrak{g}}_{k,e}=\sum_{k=1}^{K}\lambda_{k,e}^{\prime}\mathfrak{g}_{k,e}=\boldsymbol{\mathfrak{g}}_{e}\boldsymbol{\lambda}^{\prime}_{e}. \tag{44}
\end{align}

To prove Theorem A.2, we first introduce a lemma whose proof is provided in Appendix B.

Lemma A.3.

Using the notations of Theorem A.2, and assuming that $\boldsymbol{f}=\{f_{k}\}_{k\in[K]}$ are $L$-Lipschitz smooth, we have $\|\mathfrak{g}_{k,e}-\mathfrak{g}_{k}\|\leq\eta el$.
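The proof of the lemma is deferred to Appendix B and is not reproduced here. As an informal illustration of where a quantity of size $\eta el$ comes from, the sketch below runs $e$ local gradient-descent steps with learning rate $\eta$ on a function whose gradient norm never exceeds $l$, and checks that the resulting update vector $\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{(k,e)^{t}}$ has norm at most $\eta el$. The loss $f(\boldsymbol{\theta})=l\sqrt{1+\|\boldsymbol{\theta}\|^{2}}$, the starting point and all constants are hypothetical choices made only for this example; this is one elementary ingredient of such bounds, not the authors' argument.

```python
import math


def grad(theta, l):
    """Gradient of the hypothetical loss f(theta) = l * sqrt(1 + ||theta||^2).

    Its Euclidean norm is l * ||theta|| / sqrt(1 + ||theta||^2) < l,
    so f is l-Lipschitz continuous.
    """
    scale = l / math.sqrt(1.0 + sum(x * x for x in theta))
    return [scale * x for x in theta]


def local_gd(theta, l, eta, e):
    """Run e local gradient-descent steps with learning rate eta."""
    theta = list(theta)
    for _ in range(e):
        g = grad(theta, l)
        theta = [x - eta * gi for x, gi in zip(theta, g)]
    return theta


def norm(v):
    return math.sqrt(sum(x * x for x in v))


# Illustrative constants (assumptions, not values from the paper).
l, eta, e = 3.0, 0.05, 8
theta_t = [2.0, -1.0, 0.5]

theta_ke = local_gd(theta_t, l, eta, e)
update = [a - b for a, b in zip(theta_t, theta_ke)]  # update vector after e local steps

# Each step moves the weights by at most eta * l, so e steps move them by at most eta * e * l.
print(norm(update), "<=", eta * e * l)
```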

Using Lemma A.3, we have

\begin{align}
\|\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e}\|=\|\tilde{\boldsymbol{\mathfrak{g}}}\boldsymbol{\lambda}^{*}-\tilde{\boldsymbol{\mathfrak{g}}}_{e}\boldsymbol{\lambda}_{e}^{*}\| &\leq \|\tilde{\boldsymbol{\mathfrak{g}}}\boldsymbol{\lambda}^{*}-\tilde{\boldsymbol{\mathfrak{g}}}\boldsymbol{\lambda}^{*}_{e}\|+\|\tilde{\boldsymbol{\mathfrak{g}}}\boldsymbol{\lambda}^{*}_{e}-\tilde{\boldsymbol{\mathfrak{g}}}_{e}\boldsymbol{\lambda}^{*}_{e}\| \tag{45}\\
&\leq \|\tilde{\boldsymbol{\mathfrak{g}}}\|\,\|\boldsymbol{\lambda}^{*}-\boldsymbol{\lambda}^{*}_{e}\|+\|\boldsymbol{\mathfrak{g}}\boldsymbol{\lambda}^{\prime}_{e}-\boldsymbol{\mathfrak{g}}_{e}\boldsymbol{\lambda}^{\prime}_{e}\| \tag{46}\\
&\leq \|\tilde{\boldsymbol{\mathfrak{g}}}\|\,\|\boldsymbol{\lambda}^{*}-\boldsymbol{\lambda}^{*}_{e}\|+\eta el \tag{47}\\
&\leq \zeta_{t}l\sqrt{K}+\eta el, \tag{48}
\end{align}

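where Equation 45 follows from the triangle inequality, Equation 46 is obtained from Equations 43 and 44, Equation 47 uses Lemma A.3, and Equation 48 uses the definition of $\zeta_{t}$ together with the bound $\|\tilde{\boldsymbol{\mathfrak{g}}}\|\leq l\sqrt{K}$.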
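The two purely algebraic steps in this chain, the triangle-inequality split and the bound of the first term by $\zeta_{t}l\sqrt{K}$, can be sanity-checked on random data. The sketch below is an illustration under assumptions of ours: the matrices `G` and `G_e` are random stand-ins for $\tilde{\boldsymbol{\mathfrak{g}}}$ and $\tilde{\boldsymbol{\mathfrak{g}}}_{e}$ whose columns each have norm at most $l$ (which is how we read the $l\sqrt{K}$ factor), and the weight vectors are drawn from the probability simplex. It does not exercise Lemma A.3 itself.

```python
import math
import random


def matvec(cols, lam):
    """Compute sum_k lam[k] * cols[k], where cols is a list of K column vectors."""
    dim = len(cols[0])
    return [sum(lam[k] * cols[k][i] for k in range(len(cols))) for i in range(dim)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def random_column(dim, max_norm, rng):
    """Draw a random vector whose Euclidean norm is at most max_norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = max_norm * rng.random() / norm(v)
    return [scale * x for x in v]


def random_simplex(K, rng):
    """Draw nonnegative weights that sum to one."""
    w = [rng.random() + 1e-9 for _ in range(K)]
    total = sum(w)
    return [x / total for x in w]


rng = random.Random(0)
K, dim, l = 5, 20, 2.0  # illustrative sizes and Lipschitz constant

G = [random_column(dim, l, rng) for _ in range(K)]    # stand-in for the single-step matrix
G_e = [random_column(dim, l, rng) for _ in range(K)]  # stand-in for the e-epoch matrix
lam = random_simplex(K, rng)
lam_e = random_simplex(K, rng)

lhs = norm(sub(matvec(G, lam), matvec(G_e, lam_e)))      # ||G lam - G_e lam_e||
term1 = norm(matvec(G, sub(lam, lam_e)))                 # ||G lam - G lam_e||
term2 = norm(sub(matvec(G, lam_e), matvec(G_e, lam_e)))  # ||G lam_e - G_e lam_e||
zeta = norm(sub(lam, lam_e))

print(lhs <= term1 + term2 + 1e-12)                  # triangle inequality, as in (45)
print(term1 <= zeta * l * math.sqrt(K) + 1e-12)      # first term bounded by zeta * l * sqrt(K)
```

The second check holds because $\|\sum_{k}c_{k}\mathfrak{g}_{k}\|\leq l\sum_{k}|c_{k}|\leq l\sqrt{K}\,\|\boldsymbol{c}\|$ whenever every column has norm at most $l$.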

As seen, if $\lim_{t\rightarrow\infty}\eta\rightarrow 0$ and $\lim_{t\rightarrow\infty}\zeta_{t}\rightarrow 0$, then $\|\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e}\|\rightarrow 0$. Now, by writing the quadratic upper bound we obtain:

\begin{align}
\boldsymbol{f}(\boldsymbol{\theta}_{t+1}) &\leq \boldsymbol{f}(\boldsymbol{\theta}_{t})-\eta_{t}\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\mathfrak{g}}_{e}^{T}\boldsymbol{\lambda}_{e}^{\prime}+\frac{L\eta_{t}^{2}}{2}\|\boldsymbol{\mathfrak{g}}_{e}^{T}\boldsymbol{\lambda}_{e}^{\prime}\|^{2} \tag{49}\\
&\leq \boldsymbol{f}(\boldsymbol{\theta}_{t})-\eta_{t}\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\lambda}^{\prime}+L\eta_{t}^{2}\|\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\lambda}^{\prime}\|^{2}+\eta_{t}\boldsymbol{\mathfrak{g}}^{T}(\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e})+L\eta_{t}^{2}\|\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e}\|^{2} \tag{50}\\
&\leq \boldsymbol{f}(\boldsymbol{\theta}_{t})-\eta_{t}(1-L\eta_{t})\|\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\lambda}^{\prime}\|^{2}+l\eta_{t}\|\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e}\|+L\eta_{t}^{2}\|\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e}\|^{2}. \tag{51}
\end{align}

Noting that $\eta_{t}\in(0,\frac{1}{2L}]$, and telescoping over $t=0,\dots,T$, yields

$$\min_{t=0,\dots,T}\|\mathfrak{d}_{t}\|\leq\frac{\boldsymbol{f}(\boldsymbol{\theta}_{0})-\boldsymbol{f}(\boldsymbol{\theta}_{T+1})+\sum_{t=0}^{T}\eta_{t}(l\|\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e}\|+L\eta_{t}\|\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e}\|^{2})}{\frac{1}{2}\sum_{t=0}^{T}\eta_{t}}. \tag{52}$$

Using $\|\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e}\|\rightarrow 0$, Theorem A.2 is concluded. ∎
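The telescoping step and the final limit are stated tersely, so the following sketch spells out one way to read them. It is our reconstruction under stated assumptions rather than the authors' derivation: $\varepsilon_{t}$ is shorthand we introduce for the value of $\|\boldsymbol{\mathfrak{d}}-\boldsymbol{\mathfrak{d}}_{e}\|$ at round $t$, the term $\|\boldsymbol{\mathfrak{g}}^{T}\boldsymbol{\lambda}^{\prime}\|$ in (51) is read as $\|\mathfrak{d}_{t}\|$ via (43), and $\boldsymbol{f}$ is assumed to be bounded below so that $\boldsymbol{f}(\boldsymbol{\theta}_{0})-\boldsymbol{f}(\boldsymbol{\theta}_{T+1})$ stays bounded.

```latex
% Assumptions (ours): \varepsilon_t := \|\mathfrak{d}-\mathfrak{d}_e\| at round t,
% f bounded below, and \|g^T \lambda'\| in (51) read as \|\mathfrak{d}_t\| via (43).
\begin{align*}
\eta_t \le \tfrac{1}{2L} \;&\Rightarrow\; 1 - L\eta_t \ge \tfrac{1}{2}, \\
\tfrac{1}{2}\sum_{t=0}^{T} \eta_t \|\mathfrak{d}_t\|^2
  &\le \sum_{t=0}^{T}\bigl(\boldsymbol{f}(\boldsymbol{\theta}_t) - \boldsymbol{f}(\boldsymbol{\theta}_{t+1})\bigr)
     + \sum_{t=0}^{T} \eta_t\bigl(l\varepsilon_t + L\eta_t\varepsilon_t^2\bigr) \\
  &= \boldsymbol{f}(\boldsymbol{\theta}_0) - \boldsymbol{f}(\boldsymbol{\theta}_{T+1})
     + \sum_{t=0}^{T} \eta_t\bigl(l\varepsilon_t + L\eta_t\varepsilon_t^2\bigr), \\
\Bigl(\min_{t=0,\dots,T}\|\mathfrak{d}_t\|^2\Bigr)\sum_{t=0}^{T}\eta_t
  &\le \sum_{t=0}^{T} \eta_t \|\mathfrak{d}_t\|^2, \\
\varepsilon_t \to 0,\ \ \sum_{t=0}^{T}\eta_t \to \infty
  \;&\Rightarrow\;
  \frac{\sum_{t=0}^{T}\eta_t\bigl(l\varepsilon_t + L\eta_t\varepsilon_t^2\bigr)}{\sum_{t=0}^{T}\eta_t} \to 0 .
\end{align*}
```

The last implication holds because a weighted average of a sequence that tends to zero also tends to zero when the weights have a divergent sum. Read this way, the chain controls the squared norm $\min_{t}\|\mathfrak{d}_{t}\|^{2}$, which vanishes exactly when $\min_{t}\|\mathfrak{d}_{t}\|$ does, so the limit in (42) follows.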

A.3 Case 3: $e=1$ & local GD

Denote by $\vartheta$ the Pareto-stationary solution set of the minimization problem $\arg\min_{\boldsymbol{\theta}}\boldsymbol{f}(\boldsymbol{\theta})$. Then, define $\boldsymbol{\theta}_{t}^{*}=\arg\min_{\boldsymbol{\theta}\in\vartheta}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}\|_{2}^{2}$.
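A property that follows directly from this definition is that the nearest stationary point of the next iterate is never farther from it than the nearest stationary point of the current iterate, that is, $\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}_{t+1}^{*}\|_{2}^{2}\leq\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}_{t}^{*}\|_{2}^{2}$. The sketch below illustrates this on a toy case. The finite list `stationary_set` is a hypothetical stand-in for $\vartheta$ (which in general is not a finite set), and the two iterates are arbitrary points chosen for the example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(theta, candidates):
    """Return the candidate closest to theta in squared Euclidean distance."""
    return min(candidates, key=lambda c: sq_dist(theta, c))


# Hypothetical stand-in for the Pareto-stationary set (an assumption for illustration).
stationary_set = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]

theta_t = (0.4, 0.3)     # current iterate
theta_next = (0.8, 0.7)  # next iterate

star_t = nearest(theta_t, stationary_set)        # plays the role of theta_t^*
star_next = nearest(theta_next, stationary_set)  # plays the role of theta_{t+1}^*

print(star_t, star_next)
# The nearest point to theta_next cannot be farther from it than any other member of the set.
print(sq_dist(theta_next, star_next) <= sq_dist(theta_next, star_t))
```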

Theorem A.4.

Assume that $\boldsymbol{f}=\{f_{k}\}_{k\in[K]}$ are $l$-Lipschitz continuous and $\sigma$-convex, and that the step-size $\eta_{t}$ satisfies the following two conditions: (i) $\lim_{t\rightarrow\infty}\sum_{j=0}^{t}\eta_{j}\rightarrow\infty$ and (ii) $\lim_{t\rightarrow\infty}\sum_{j=0}^{t}\eta^{2}_{j}<\infty$. Then almost surely $\boldsymbol{\theta}_{t}\rightarrow\boldsymbol{\theta}_{t}^{*}$; that is,

(limt(𝜽t𝜽t*)=0)=1,subscript𝑡subscript𝜽𝑡superscriptsubscript𝜽𝑡01\displaystyle\mathbb{P}\left(\lim_{t\rightarrow\infty}\left(\boldsymbol{\theta% }_{t}-\boldsymbol{\theta}_{t}^{*}\right)=0\right)=1,blackboard_P ( roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) = 0 ) = 1 , (53)

where $\mathbb{P}(E)$ denotes the probability of event $E$.
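As an illustration that is not part of the original statement, a harmonic schedule is one familiar choice meeting both step-size conditions. The constant $\eta_0>0$ below is an arbitrary value introduced here for the example.

```latex
\eta_t = \frac{\eta_0}{t+1}, \qquad
\sum_{j=0}^{t}\eta_j = \eta_0\sum_{n=1}^{t+1}\frac{1}{n} \;\geq\; \eta_0\ln(t+2) \xrightarrow[t\rightarrow\infty]{} \infty, \qquad
\sum_{j=0}^{\infty}\eta_j^{2} = \eta_0^{2}\sum_{n=1}^{\infty}\frac{1}{n^{2}} = \frac{\pi^{2}}{6}\,\eta_0^{2} < \infty .
```

By contrast, a constant step-size violates condition (ii), and $\eta_t=\eta_0/(t+1)^{2}$ violates condition (i).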

Proof.

The proof is inspired by Mercier et al. (2018). Without loss of generality, we assume that all users participate in all rounds.

Based on the definition of $\boldsymbol{\theta}_{t}^{*}$ we can say

\begin{align}
\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{*}_{t+1}\|_{2}^{2}\leq\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{*}_{t}\|_{2}^{2} &= \|\boldsymbol{\theta}_{t}-\eta_{t}\mathfrak{d}_{t}-\boldsymbol{\theta}^{*}_{t}\|_{2}^{2} \tag{54}\\
&= \|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t}\|_{2}^{2}-2\eta_{t}(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t})\cdot\mathfrak{d}_{t}+\eta_{t}^{2}\|\mathfrak{d}_{t}\|_{2}^{2}. \tag{55}
\end{align}

To bound the third term in Equation 55, we note that from Equation 23, we have:

\begin{equation}
\eta_{t}^{2}\|\mathfrak{d}_{t}\|_{2}^{2}=\frac{\eta_{t}^{2}}{\sum_{k=1}^{K}\frac{1}{\|\tilde{\mathfrak{g}}_{k}\|_{2}^{2}}}\leq\frac{\eta_{t}^{2}l^{2}}{K}. \tag{56}
\end{equation}

To bound the second term, first note that the orthogonal vectors $\{\tilde{\mathfrak{g}}_{k}\}_{k\in[K]}$ span the same $K$-dimensional space as that spanned by the gradient vectors $\{\mathfrak{g}_{k}\}_{k\in[K]}$. Hence,

\begin{equation}
\exists\,\{\lambda^{\prime}_{k}\}_{k\in[K]}~~\text{s.t.}~~\mathfrak{d}=\sum_{k=1}^{K}\lambda_{k}^{*}\tilde{\mathfrak{g}}_{k}=\sum_{k=1}^{K}\lambda_{k}^{\prime}\mathfrak{g}_{k}. \tag{57}
\end{equation}
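Equations 56 and 57 can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses plain Gram-Schmidt as a stand-in for the paper's orthogonalization step, random vectors as stand-ins for client gradients, and takes $\lambda_k^{*}$ to be the minimum-norm convex weights over the orthogonal vectors, which is what the closed form in Equation 56 suggests. The helper names `gram_schmidt` and `common_direction` are hypothetical.

```python
import numpy as np

def gram_schmidt(G):
    """Orthogonalize the rows of G (K x dim) without normalizing them."""
    G_tilde = np.zeros_like(G)
    for k, g in enumerate(G):
        v = g.copy()
        for i in range(k):
            u = G_tilde[i]
            v -= (g @ u) / (u @ u) * u   # remove the component along u
        G_tilde[k] = v
    return G_tilde

def common_direction(G_tilde):
    """Minimum-norm convex combination of mutually orthogonal rows."""
    inv_sq = 1.0 / np.sum(G_tilde**2, axis=1)
    lam_star = inv_sq / inv_sq.sum()     # weights proportional to 1/||g~_k||^2
    return lam_star, lam_star @ G_tilde

rng = np.random.default_rng(0)
K, dim = 4, 10
G = rng.normal(size=(K, dim))            # assumed stand-ins for gradients g_k
G_tilde = gram_schmidt(G)
lam_star, d = common_direction(G_tilde)

sq_norms = np.sum(G_tilde**2, axis=1)
l = np.sqrt(sq_norms.max())              # any l with ||g~_k|| <= l would do
closed_form = 1.0 / np.sum(1.0 / sq_norms)

# Equation 57: coefficients of d in the original, non-orthogonal basis.
lam_prime, *_ = np.linalg.lstsq(G.T, d, rcond=None)

print("||d||^2 =", d @ d, " closed form =", closed_form, " l^2/K =", l**2 / K)
```

In this sketch `lam_prime` plays the role of $\{\lambda^{\prime}_{k}\}$; its entries need not be non-negative or sum to one, since only membership in the common span is used.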

Using Equation 57 and the $\sigma$-convexity of $\{f_{k}\}_{k\in[K]}$ we obtain

\begin{align}
(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t})\cdot\mathfrak{d}_{t} &= (\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t})\cdot\sum_{k=1}^{K}\lambda_{k}^{*}\tilde{\mathfrak{g}}_{k} \tag{58}\\
&= (\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t})\cdot\sum_{k=1}^{K}\lambda_{k}^{\prime}\mathfrak{g}_{k} \tag{59}\\
&\geq \sum_{k=1}^{K}\lambda_{k}^{\prime}\left(f_{k}(\boldsymbol{\theta}_{t})-f_{k}(\boldsymbol{\theta}^{*}_{t})\right)+\sigma\frac{\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t}\|_{2}^{2}}{2} \tag{60}\\
&\geq \frac{\lambda_{\alpha}^{\prime}M}{2}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t}\|_{2}^{2}+\sigma\frac{\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t}\|_{2}^{2}}{2} \tag{61}\\
&= \frac{\lambda_{\alpha}^{\prime}M+\sigma}{2}\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t}\|_{2}^{2}. \tag{62}
\end{align}

Now, we return to Equation 55 and take the conditional expectation w.r.t. $\boldsymbol{\theta}_{t}$ as follows:

\begin{equation}
\mathbb{E}[\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{*}_{t+1}\|_{2}^{2}~|~\boldsymbol{\theta}_{t}]\leq(1-\eta_{t}\mathbb{E}[\lambda_{\alpha}^{\prime}M+\sigma~|~\boldsymbol{\theta}_{t}])\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t}\|_{2}^{2}+\frac{\eta_{t}^{2}l^{2}}{K}. \tag{63}
\end{equation}

Assuming that $\mathbb{E}[\lambda_{\alpha}^{\prime}M+\sigma~|~\boldsymbol{\theta}_{t}]\geq c$ and taking another expectation, we obtain:

\begin{equation}
\mathbb{E}[\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{*}_{t+1}\|_{2}^{2}]\leq(1-\eta_{t}c)\,\mathbb{E}[\|\boldsymbol{\theta}_{t}-\boldsymbol{\theta}^{*}_{t}\|_{2}^{2}]+\frac{\eta_{t}^{2}l^{2}}{K}, \tag{64}
\end{equation}

which is a recursive expression. By solving Equation 64 we obtain

\begin{equation}
\mathbb{E}[\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{*}_{t+1}\|_{2}^{2}]\leq\underbrace{\prod_{j=0}^{t}(1-\eta_{j}c)\,\mathbb{E}[\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}_{0}\|_{2}^{2}]}_{\text{First term}}+\underbrace{\sum_{m=1}^{t}\frac{\prod_{j=1}^{t}(1-\eta_{j}c)\,\eta_{m}^{2}l^{2}}{K\prod_{j=1}^{m}(1-\eta_{j}c)}}_{\text{Second term}}. \tag{65}
\end{equation}

Observe that if both the First term and the Second term in Equation 65 tend to zero, then $\mathbb{E}[\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{*}_{t+1}\|_{2}^{2}]\rightarrow 0$. For the First term, from the arithmetic-geometric mean inequality we have

\begin{align}
\lim_{t\rightarrow\infty}\prod_{j=0}^{t}(1-\eta_{j}c)\leq\lim_{t\rightarrow\infty}\left(\frac{\sum_{j=0}^{t}(1-\eta_{j}c)}{t}\right)^{t} &= \lim_{t\rightarrow\infty}\left(1-c\frac{\sum_{j=0}^{t}\eta_{j}}{t}\right)^{t} \tag{66}\\
&= \lim_{t\rightarrow\infty}e^{-c\sum_{j=0}^{t}\eta_{j}}. \tag{67}
\end{align}

From Equation 67 it is seen that if $\lim_{t\rightarrow\infty}\sum_{j=0}^{t}\eta_{j}\rightarrow\infty$, then the First term converges to zero as $t\rightarrow\infty$.

On the other hand, consider the Second term in Equation 65. Obviously, if $\lim_{t\rightarrow\infty}\sum_{j=0}^{t}\eta^{2}_{j}<\infty$, then the Second term converges to zero as $t\rightarrow\infty$.

Hence, if (i) $\lim_{t\rightarrow\infty}\sum_{j=0}^{t}\eta_{j}\rightarrow\infty$ and (ii) $\lim_{t\rightarrow\infty}\sum_{j=0}^{t}\eta^{2}_{j}<\infty$, then $\mathbb{E}[\|\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}^{*}_{t+1}\|_{2}^{2}]\rightarrow 0$. Consequently, based on the standard supermartingale argument (Mercier et al., 2018), we have

\begin{equation}
\mathbb{P}\left(\lim_{t\rightarrow\infty}\left(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t}^{*}\right)=0\right)=1. \tag{68}
\end{equation}
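The role of the two step-size conditions in the recursion of Equation 64 can also be explored numerically. The following is a minimal sketch, not part of the original proof: it iterates the bound with equality, using illustrative values for $c$, $l$, $K$ and the initial error that are assumptions chosen here, and compares a decaying step-size $\eta_t=1/(t+1)$ with a constant one. The function name `iterate_bound` is hypothetical.

```python
def iterate_bound(a0, c, l, K, num_rounds, step_size):
    """Iterate a_{t+1} = (1 - eta_t * c) * a_t + eta_t**2 * l**2 / K."""
    a = a0
    history = [a]
    for t in range(num_rounds):
        eta = step_size(t)
        a = (1.0 - eta * c) * a + eta**2 * l**2 / K
        history.append(a)
    return history

# Assumed illustrative constants: a0 = 1, c = 0.5, l = 1, K = 10.
decaying = iterate_bound(1.0, 0.5, 1.0, 10, 10_000, lambda t: 1.0 / (t + 1))
constant = iterate_bound(1.0, 0.5, 1.0, 10, 10_000, lambda t: 0.1)

print("decaying step-size, final bound:", decaying[-1])
print("constant step-size, final bound:", constant[-1])
```

With the constant step-size $\eta=0.1$, the iterate settles at the fixed point $\eta l^{2}/(Kc)=0.02$ of the recursion instead of vanishing, which is consistent with condition (ii) being violated; the decaying schedule ends below that level.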

Appendix B Proof of Lemma A.3

Proof.

\begin{align}
\mathfrak{g}_{k,e}=\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{(k,e)^{t}} &= (\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{(k,1)^{t}})+(\boldsymbol{\theta}_{(k,1)^{t}}-\boldsymbol{\theta}_{(k,2)^{t}})+\dots+(\boldsymbol{\theta}_{(k,e-1)^{t}}-\boldsymbol{\theta}_{(k,e)^{t}}) \tag{69}\\
&= \mathfrak{g}_{k}(\boldsymbol{\theta}_{t})+\eta\mathfrak{g}_{k,1}+\dots+\eta\mathfrak{g}_{k,e-1}. \tag{70}
\end{align}

Hence,

\begin{equation}
\|\mathfrak{g}_{k,e}-\mathfrak{g}_{k}\|=\Big\|\eta\sum_{j=1}^{e}\mathfrak{g}_{k,j}\Big\|\leq\eta\sum_{j=1}^{e}\|\mathfrak{g}_{k,j}\|\leq\eta el. \tag{71}
\end{equation}

Appendix C More about fairness in FL

C.1 Sources of unfairness in federated learning

Unfairness in FL can arise from various sources and is a concern that needs to be addressed in FL systems. Here are some of the key reasons for unfairness in FL:

1. Non-Representative Data Distribution: Unfairness can occur when the distribution of data across participating devices or clients is non-representative of the overall population. Some devices may have more or less relevant data, leading to biased model updates.

2. Data Bias: If the data collected or used by different clients is inherently biased due to the data collection process, it can lead to unfairness. For example, if certain demographic groups are underrepresented in the training data of some clients, the federated model may not perform well for those groups.

3. Heterogeneous Data Sources: Federated learning often involves data from a diverse set of sources, including different device types, locations, or user demographics. Variability in data sources can introduce unfairness as the models may not generalize equally well across all sources.

4. Varying Data Quality: Data quality can vary among clients, leading to unfairness. Some clients may have noisy or less reliable data, while others may have high-quality data, affecting the model’s performance.

5. Data Sampling: The way data is sampled and used for local updates can introduce unfairness. If some clients have imbalanced or non-representative data sampling strategies, it can lead to biased model updates.

6. Aggregation Bias: The learned model may exhibit a bias towards devices with larger amounts of data or, if devices are weighted equally, it may favor more commonly occurring devices.

C.2 Fairness in conventional ML vs. FL

The concept of fairness is often used to address social biases or performance disparities among different individuals or groups in the machine learning (ML) literature (Barocas et al., 2017). However, in the context of FL, the notion of fairness differs slightly from traditional ML. In FL, fairness primarily pertains to the consistency of performance across various clients. In fact, the difference in the notion of fairness between traditional ML and FL arises from the distinct contexts and challenges of these two settings:

1. Centralized vs. decentralized data distribution:

  • In traditional ML, data is typically centralized, and fairness is often defined in terms of mitigating biases or disparities within a single, homogeneous dataset. Fairness is evaluated based on how the model treats different individuals or groups within that dataset.

  • In FL, data is distributed across multiple decentralized clients or devices. Each client may have its own unique data distribution, and fairness considerations extend to addressing disparities across these clients, ensuring that the federated model provides uniform and equitable performance for all clients.

2. Client autonomy and data heterogeneity:

  • In FL, clients are autonomous and may have different data sources, labeling processes, and data collection practices. Fairness in this context involves adapting to the heterogeneity and diversity among clients while still achieving equitable outcomes.

  • Traditional ML operates under a centralized, unified data schema and is not inherently designed to handle data heterogeneity across sources.

We should note that in certain cases where devices can be naturally clustered into groups with specific attributes, the definition of fairness in FL can be seen as a relaxed version of that in ML, i.e., we optimize for similar but not necessarily identical performance across devices (Li et al., 2019a).

Nevertheless, despite the differences mentioned above, to maintain consistency with the terminology used in the FL literature and the papers we have cited in the main body of this work, we will continue to use the term “fairness” to denote the uniformity of performance across different devices.

Appendix D Additional three datasets

In this section, we evaluate the performance of AdaFed against several benchmarks on three additional datasets, namely Fashion MNIST, CINIC-10, and TinyImageNet. The respective results are reported in Sections D.1, D.2 and D.3.

D.1 Fashion MNIST

Fashion MNIST (Xiao et al., 2017) is an extension of the MNIST dataset (LeCun et al., 1998) with images resized to $32\times 32$ pixels.

We use a fully-connected neural network with 2 hidden layers, and use the same setting as that used in Li et al. (2019a) for our experiments. We set $e=1$, use the full batch size, and set $\eta=0.1$. Then, we conduct 300 rounds of communication. For the benchmarks, we use the same methods as in the CIFAR-10 experiments. The results are reported in Table 7.

From the three classes reported in Table 7, we observe that the fairness level attained by AdaFed is not limited to a dominant class.

Table 7: Test accuracy on Fashion MNIST. The reported results are averaged over 5 different random seeds.

| Algorithm | $\bar{a}$ | $\sigma_{a}$ | shirt | pullover | T-shirt |
|---|---|---|---|---|---|
| FedAvg | 80.42 | 3.39 | 64.26 | 87.00 | 89.90 |
| q-FFL | 78.53 | 2.27 | 71.29 | 81.46 | 82.86 |
| FedMGDA+ | 79.29 | 2.53 | 72.46 | 79.74 | 85.66 |
| FedFA | 80.22 | 3.41 | 63.71 | 86.87 | 89.94 |
| AdaFed | 79.14 | 2.12 | 72.49 | 79.81 | 86.99 |

D.2 CINIC-10

CINIC-10 (Darlow et al., 2018) has 4.5 times as many images as the CIFAR-10 dataset (270,000 sample images in total). It is constructed from the ImageNet and CIFAR-10 datasets. As a result, this dataset fits FL scenarios, since the constituent elements of CINIC-10 are not drawn from the same distribution. Furthermore, we add more non-iidness to the dataset by distributing the data among the clients using Dirichlet allocation with $\beta=0.5$.
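One common way to realise such a Dirichlet split is sketched below. This is an illustration under assumptions rather than the exact procedure used in the experiments, which is not spelled out here: for each class, a vector of client shares is drawn from a symmetric Dirichlet distribution with concentration $\beta$, and that class's samples are divided accordingly. The function name `dirichlet_partition` and the toy labels are hypothetical.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split sample indices among clients using per-class Dirichlet shares."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        # Shuffle the indices of class c so each client gets a random subset.
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Share of class c assigned to each client.
        proportions = rng.dirichlet(beta * np.ones(num_clients))
        # Turn cumulative shares into split points over idx.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

labels = np.repeat(np.arange(10), 100)   # toy data: 10 classes x 100 samples
parts = dirichlet_partition(labels, num_clients=50, beta=0.5)
print("samples per client:", sorted(len(p) for p in parts))
```

Smaller values of $\beta$ concentrate each class on fewer clients, so the split becomes more heterogeneous; some clients may receive no samples of a given class.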

For the model, we use ResNet-18 with group normalization, and set $\eta=0.01$. There are 200 communication rounds in which all the clients participate with $e=1$. Also, $K=50$. Results are reported in Table 8.

Table 8: Test accuracy on CINIC-10. The reported results are averaged over 5 different random seeds.

| Algorithm | $\bar{a}$ | $\sigma_a$ | Worst 10% | Best 10% |
|---|---|---|---|---|
| q-FFL | 86.57 | 14.91 | 57.70 | 100.00 |
| Ditto | 86.31 | 15.14 | 56.91 | 100.00 |
| AFL | 86.49 | 15.12 | 57.62 | 100.00 |
| TERM | 86.40 | 15.10 | 57.30 | 100.00 |
| AdaFed | 86.34 | 14.85 | 57.88 | 99.99 |
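For reference, the four reported columns can be computed from per-client test accuracies as sketched below. The sketch assumes that $\sigma_a$ is the population standard deviation across clients and that "Worst 10%" and "Best 10%" are the mean accuracies of the lowest and highest tenth of the clients; the helper name `fairness_summary` is hypothetical.

```python
import statistics

def fairness_summary(accs, frac=0.10):
    """Mean, spread, and tail averages of per-client accuracies."""
    ordered = sorted(accs)
    m = max(1, round(len(ordered) * frac))   # number of clients in each tail
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.pstdev(ordered),
        "worst": statistics.fmean(ordered[:m]),
        "best": statistics.fmean(ordered[-m:]),
    }

# Toy usage with 20 made-up client accuracies 1.0, 2.0, ..., 20.0.
summary = fairness_summary([float(a) for a in range(1, 21)])
```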

D.3 TinyImageNet

Tiny-ImageNet (Le & Yang, 2015) is a subset of ImageNet with 100k samples of 200 classes. We distribute the dataset among $K=20$ clients using Dirichlet allocation with $\beta=0.05$.

We use ResNet-18 with group normalization, and set $\eta=0.02$. There are 400 communication rounds in which all the clients participate with $e=1$. The results are reported in Table 9.

Table 9: Test accuracy on TinyImageNet. The reported results are averaged over 5 different random seeds.

| Algorithm | $\bar{a}$ | $\sigma_a$ | Worst 10% | Best 10% |
|---|---|---|---|---|
| q-FFL | 18.90 | 3.20 | 13.12 | 23.72 |
| AFL | 16.55 | 2.38 | 12.40 | 20.25 |
| TERM | 16.41 | 2.77 | 11.52 | 21.02 |
| FedMGDA+ | 14.00 | 2.71 | 9.88 | 19.21 |
| AdaFed | 18.05 | 2.35 | 13.24 | 23.08 |

Appendix E Experiments details, tuning hyper-parameters

For the benchmark methods and also AdaFed, we used grid-search to find the best hyper-parameters for the underlying algorithms. The parameters we tested for each method are as follows:

  • AdaFed: $\gamma\in\{0,0.01,0.1,1,5,10\}$.

  • q-FFL: $q\in\{0,0.001,0.01,0.1,1,2,5,10\}$.

  • TERM: $t\in\{0.1,0.5,1,2,5\}$.

  • AFL: $\eta_t\in\{0.01,0.05,0.1,0.5,1\}$.

  • Ditto: $\lambda\in\{0.01,0.05,0.1,0.5,1,2,5\}$.

  • FedMGDA+: $\epsilon\in\{0.01,0.05,0.1,0.5,1\}$.

  • FedFA: $(\alpha,\beta)=\{(0.5,0.5)\}$, $(\gamma_s,\gamma_c)=\{(0.5,0.9)\}$.
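The search space listed above can be written down as a small configuration, as in the following sketch. The grids are copied from the list; the `grid_search` helper and its `score` callback are hypothetical, since the criterion used to pick the best value is not specified here. FedFA is omitted because its parameters are fixed rather than searched.

```python
# Candidate values per method, copied from the list above.
GRIDS = {
    "AdaFed": {"gamma": [0, 0.01, 0.1, 1, 5, 10]},
    "q-FFL": {"q": [0, 0.001, 0.01, 0.1, 1, 2, 5, 10]},
    "TERM": {"t": [0.1, 0.5, 1, 2, 5]},
    "AFL": {"eta_t": [0.01, 0.05, 0.1, 0.5, 1]},
    "Ditto": {"lambda": [0.01, 0.05, 0.1, 0.5, 1, 2, 5]},
    "FedMGDA+": {"epsilon": [0.01, 0.05, 0.1, 0.5, 1]},
}

def grid_search(method, score):
    """Return (parameter name, best value, best score) for one method.

    `score` is a user-supplied callable (an assumption of this sketch) that
    maps a parameter value to a number where larger is better.
    """
    (name, values), = GRIDS[method].items()
    best = max(values, key=score)
    return name, best, score(best)

# Toy score for illustration only: prefers values close to 1.
result = grid_search("AdaFed", lambda g: -abs(g - 1))
```

In practice `score` would train the model with the given value and return a validation metric that balances average accuracy against fairness.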

To better understand how the parameter $\gamma$ in AdaFed affects the performance of the FL task, we report the results for different values of $\gamma$ in this section.

E.1 CIFAR-10

The best hyper-parameters for the benchmark methods are: $q=10$ for q-FFL, $\epsilon=0.5$ for FedMGDA+, and $(\alpha,\beta)=\{(0.5,0.5)\}$, $(\gamma_s,\gamma_c)=\{(0.5,0.9)\}$ for FedFA. The detailed results for different $\gamma$ in AdaFed are reported in Table 10. We used $\gamma=5$ as the best point for Table 1.

Table 10: Tuning $\gamma$ in AdaFed over CIFAR-10. Reported results are averaged over 5 different random seeds.

| Algorithm | Setup 1: $\bar{a}$ | Setup 1: $\sigma_a$ | Setup 1: Worst 5% | Setup 1: Best 5% | Setup 2: $\bar{a}$ | Setup 2: $\sigma_a$ | Setup 2: Worst 10% | Setup 2: Best 10% |
|---|---|---|---|---|---|---|---|---|
| $\text{AdaFed}_{\gamma=0}$ | 45.44 | 3.43 | 20.18 | 68.04 | 59.88 | 4.89 | 48.12 | 70.62 |
| $\text{AdaFed}_{\gamma=0.01}$ | 45.77 | 3.36 | 23.55 | 68.07 | 60.39 | 4.81 | 49.43 | 70.63 |
| $\text{AdaFed}_{\gamma=0.1}$ | 46.01 | 3.18 | 27.12 | 68.12 | 60.98 | 4.70 | 50.91 | 70.70 |
| $\text{AdaFed}_{\gamma=1}$ | 46.55 | 3.18 | 27.75 | 68.20 | 63.24 | 4.54 | 54.55 | 71.12 |
| $\text{AdaFed}_{\gamma=5}$ | 46.42 | 3.01 | 31.12 | 67.73 | 64.80 | 4.50 | 58.24 | 72.45 |
| $\text{AdaFed}_{\gamma=10}$ | 46.00 | 2.88 | 35.21 | 67.35 | 63.25 | 4.66 | 51.74 | 71.25 |

E.2 CIFAR-100

The best hyper-parameters for the benchmark methods are: $q=0.1$ for q-FFL, $t=0.5$ for TERM, and $\eta_t=0.5$ for AFL. In addition, the detailed results for different $\gamma$ in AdaFed are reported in Table 11. We used $\gamma=1$ as the best point for Table 2.

Table 11: Tuning $\gamma$ in AdaFed over CIFAR-100. Reported results are averaged over 5 different random seeds.

| Algorithm | Setup 1: $\bar{a}$ | Setup 1: $\sigma_a$ | Setup 1: Worst 10% | Setup 1: Best 10% | Setup 2: $\bar{a}$ | Setup 2: $\sigma_a$ | Setup 2: Worst 10% | Setup 2: Best 10% |
|---|---|---|---|---|---|---|---|---|
| $\text{AdaFed}_{\gamma=0}$ | 29.41 | 4.45 | 24.41 | 39.21 | 17.05 | 6.71 | 10.04 | 27.41 |
| $\text{AdaFed}_{\gamma=0.01}$ | 30.12 | 4.05 | 25.23 | 39.41 | 17.77 | 6.09 | 10.43 | 28.42 |
| $\text{AdaFed}_{\gamma=0.1}$ | 31.05 | 3.52 | 26.13 | 40.12 | 19.51 | 4.95 | 10.89 | 32.10 |
| $\text{AdaFed}_{\gamma=1}$ | 31.42 | 3.03 | 28.91 | 40.41 | 20.02 | 4.45 | 11.81 | 34.11 |
| $\text{AdaFed}_{\gamma=5}$ | 31.23 | 2.95 | 28.12 | 40.20 | 19.79 | 4.31 | 11.86 | 33.67 |
| $\text{AdaFed}_{\gamma=10}$ | 31.34 | 2.91 | 28.52 | 40.15 | 19.61 | 4.56 | 11.42 | 32.91 |

E.3 Fashion MNIST

The best hyper-parameters for the benchmark methods are: $\epsilon=0.5$ for FedMGDA+, $q=0.1$ for q-FFL, and $(\alpha,\beta)=\{(0.5,0.5)\}$, $(\gamma_s,\gamma_c)=\{(0.5,0.9)\}$ for FedFA. The detailed results for different $\gamma$ in AdaFed are reported in Table 12. We used $\gamma=1$ as the best point for Table 7.

Table 12: Tuning $\gamma$ in AdaFed over Fashion MNIST. Reported results are averaged over 5 different seeds.

| Algorithm | $\bar{a}$ | $\sigma_a$ | shirt | pullover | T-shirt |
|---|---|---|---|---|---|
| $\text{AdaFed}_{\gamma=0}$ | 78.84 | 2.55 | 71.77 | 78.34 | 84.12 |
| $\text{AdaFed}_{\gamma=0.01}$ | 78.88 | 2.41 | 71.73 | 78.66 | 85.62 |
| $\text{AdaFed}_{\gamma=0.1}$ | 79.24 | 2.30 | 72.46 | 79.14 | 85.66 |
| $\text{AdaFed}_{\gamma=1}$ | 79.14 | 2.12 | 72.33 | 79.81 | 83.99 |
| $\text{AdaFed}_{\gamma=5}$ | 79.04 | 2.09 | 71.55 | 78.37 | 85.41 |
| $\text{AdaFed}_{\gamma=10}$ | 78.91 | 1.96 | 71.43 | 78.04 | 85.82 |

E.4 FEMNIST

The best hyper-parameters for the benchmark methods are: $\lambda=0.1$ for Ditto, $q=0.1$ for q-FFL, $t=0.5$ for TERM, and $\eta_t=0.5$ for AFL. Also, the detailed results for different $\gamma$ in AdaFed are reported in Table 13. We used $\gamma=1$ as the best point for Table 3.

Table 13: Tuning $\gamma$ in AdaFed over FEMNIST. Reported results are averaged over 5 different seeds.

| Algorithm | Original: $\bar{a}$ | Original: $\sigma_a$ | Original: Angle ($^{\circ}$) | Original: KL ($a\,\Vert\,u$) | Skewed: $\bar{a}$ | Skewed: $\sigma_a$ | Skewed: Angle ($^{\circ}$) | Skewed: KL ($a\,\Vert\,u$) |
|---|---|---|---|---|---|---|---|---|
| $\text{AdaFed}_{\gamma=0}$ | 81.32 | 13.59 | 10.85 | 0.019 | 84.39 | 13.54 | 11.32 | 0.024 |
| $\text{AdaFed}_{\gamma=0.01}$ | 82.67 | 12.03 | 10.68 | 0.018 | 87.66 | 12.02 | 10.91 | 0.019 |
| $\text{AdaFed}_{\gamma=0.1}$ | 81.60 | 8.72 | 9.23 | 0.011 | 88.62 | 10.59 | 10.75 | 0.017 |
| $\text{AdaFed}_{\gamma=1}$ | 82.26 | 6.58 | 8.12 | 0.009 | 92.21 | 7.56 | 9.44 | 0.011 |
| $\text{AdaFed}_{\gamma=5}$ | 80.10 | 5.16 | 7.29 | 0.007 | 90.12 | 5.82 | 7.31 | 0.009 |
| $\text{AdaFed}_{\gamma=10}$ | 80.05 | 3.03 | 6.44 | 0.007 | 84.38 | 4.49 | 6.99 | 0.008 |

Here, "Original" and "Skewed" refer to FEMNIST-original and FEMNIST-skewed, respectively.

E.5 Shakespeare

The best hyper-parameters for the benchmark methods are: $q=0.1$ for q-FFL, $\lambda=0.1$ for Ditto, and $\eta_t=0.5$ for AFL. Furthermore, the results obtained for different $\gamma$ values in AdaFed are reported in Table 14. We used $\gamma=0.1$ as the best point for Table 4.

Table 14: Tuning $\gamma$ in AdaFed over Shakespeare. Reported results are averaged over 5 different seeds.

| Algorithm | Setup 1: $\bar{a}$ | Setup 1: $\sigma_a$ | Setup 1: Worst 10% | Setup 1: Best 10% | Setup 2: $\bar{a}$ | Setup 2: $\sigma_a$ | Setup 2: Worst 10% | Setup 2: Best 10% |
|---|---|---|---|---|---|---|---|---|
| $\text{AdaFed}_{\gamma=0}$ | 48.40 | 13.5 | 44.12 | 51.29 | 48.80 | 1.58 | 46.23 | 51.12 |
| $\text{AdaFed}_{\gamma=0.01}$ | 53.55 | 8.01 | 50.96 | 54.46 | 51.67 | 1.10 | 48.71 | 53.16 |
| $\text{AdaFed}_{\gamma=0.1}$ | 55.65 | 6.55 | 53.79 | 55.86 | 52.89 | 0.98 | 51.02 | 54.48 |
| $\text{AdaFed}_{\gamma=1}$ | 53.91 | 5.10 | 51.94 | 54.06 | 51.44 | 1.06 | 50.88 | 54.52 |
| $\text{AdaFed}_{\gamma=5}$ | 54.40 | 4.15 | 52.17 | 54.77 | 51.20 | 1.05 | 50.72 | 54.61 |
| $\text{AdaFed}_{\gamma=10}$ | 54.56 | 4.22 | 52.20 | 54.73 | 51.19 | 1.07 | 50.70 | 54.01 |

E.6 CINIC-10

The best hyper-parameters for the benchmark methods are: $t=0.5$ for TERM, $q=0.1$ for q-FFL, $\lambda=0.1$ for Ditto, and $\eta_t=0.5$ for AFL. Furthermore, the results obtained for different $\gamma$ values in AdaFed are reported in Table 15. We used $\gamma=1$ as the best point for Table 8.

Table 15: Tuning $\gamma$ in AdaFed over CINIC-10. Reported results are averaged over 5 different seeds.

| Algorithm | $\bar{a}$ | $\sigma_a$ | Worst 10% | Best 10% |
|---|---|---|---|---|
| $\text{AdaFed}_{\gamma=0}$ | 85.17 | 15.71 | 54.67 | 99.92 |
| $\text{AdaFed}_{\gamma=0.01}$ | 85.87 | 15.54 | 56.12 | 99.95 |
| $\text{AdaFed}_{\gamma=0.1}$ | 86.13 | 15.32 | 57.01 | 99.98 |
| $\text{AdaFed}_{\gamma=1}$ | 86.34 | 14.85 | 57.88 | 99.99 |
| $\text{AdaFed}_{\gamma=5}$ | 86.03 | 15.01 | 57.72 | 99.98 |
| $\text{AdaFed}_{\gamma=10}$ | 85.49 | 15.08 | 57.23 | 99.99 |

E.7 TinyImageNet

The best hyper-parameters for the benchmark methods are: $\epsilon=0.05$ for FedMGDA+, $t=0.5$ for TERM, $q=0.1$ for q-FFL, and $\eta_t=0.5$ for AFL. Furthermore, the results obtained for different $\gamma$ values in AdaFed are reported in Table 16. We used $\gamma=1$ as the best point for Table 9.

Table 16: Tuning $\gamma$ in AdaFed over TinyImageNet. Reported results are averaged over 5 different seeds.

| Algorithm | $\bar{a}$ | $\sigma_a$ | Worst 10% | Best 10% |
|---|---|---|---|---|
| $\text{AdaFed}_{\gamma=0}$ | 13.25 | 3.14 | 9.82 | 19.24 |
| $\text{AdaFed}_{\gamma=0.01}$ | 14.38 | 2.72 | 10.12 | 19.97 |
| $\text{AdaFed}_{\gamma=0.1}$ | 16.20 | 2.65 | 11.65 | 21.12 |
| $\text{AdaFed}_{\gamma=1}$ | 18.05 | 2.35 | 13.24 | 23.08 |
| $\text{AdaFed}_{\gamma=5}$ | 17.76 | 2.31 | 12.44 | 22.84 |
| $\text{AdaFed}_{\gamma=10}$ | 17.05 | 2.38 | 12.58 | 23.67 |

E.8 The effect of parameter γ𝛾\gammaitalic_γ

In this section, we have reported the results for AdaFed over different datasets when $\gamma$ takes different values. Based on the tables reported in this section, we observe a largely similar trend over all the datasets. As a rule of thumb, a higher (lower) $\gamma$ yields higher (lower) fairness and slightly lower (higher) accuracy. Nevertheless, the best performance of AdaFed (in terms of establishing an appropriate trade-off between average accuracy and fairness) is achieved for a moderate value of $\gamma$. This is also consistent with the other fairness methods in the literature, where in most cases the best hyper-parameter is a moderate one.
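As a small illustration of this trend, the sketch below takes the TinyImageNet rows of Table 16 and reports which $\gamma$ maximizes the average accuracy, which minimizes $\sigma_a$, and which maximizes the worst-10% accuracy. The variable names are illustrative.

```python
# (gamma, mean accuracy, std across clients, worst 10%) copied from Table 16.
table16 = [
    (0,    13.25, 3.14,  9.82),
    (0.01, 14.38, 2.72, 10.12),
    (0.1,  16.20, 2.65, 11.65),
    (1,    18.05, 2.35, 13.24),
    (5,    17.76, 2.31, 12.44),
    (10,   17.05, 2.38, 12.58),
]

best_mean_gamma = max(table16, key=lambda row: row[1])[0]
lowest_std_gamma = min(table16, key=lambda row: row[2])[0]
best_worst_gamma = max(table16, key=lambda row: row[3])[0]
print(best_mean_gamma, lowest_std_gamma, best_worst_gamma)  # prints: 1 5 1
```

On this dataset, the moderate value $\gamma=1$ gives the highest average and worst-10% accuracy, while $\sigma_a$ is lowest at $\gamma=5$ by a small margin.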

Appendix F Computation cost of AdaFed

F.1 Comparing to FedMGDA+

First, note that the concept behind AdaFed builds upon that of FedMGDA+ (Hu et al., 2022) (and FairWire in Hamidi & Damen (2024)), in that both use the notion of Pareto optimality to enforce fairness in the FL task. Note that the optimal solutions in MoM usually form a set (in general of infinite cardinality). As discussed, what distinguishes FedMGDA+ from AdaFed is the point of this set to which each algorithm converges. In particular, AdaFed converges to more uniform solutions (see Figure 1a). This is because FedMGDA+ only satisfies Condition (I), whereas AdaFed satisfies both Conditions (I) and (II).

Interestingly, the cost of performing AdaFed is less than that of performing FedMGDA+. To elucidate, FedMGDA+ also finds the minimum-norm vector in the convex hull of the gradients’ space in order to find a common descent direction. To this end, they used generic quadratic programming which entails iteratively finding the minimum-norm vector in the convex hull of the local gradients. One of the pros of AdaFed is that it finds the common descent direction without performing any iterations over the gradient vectors. Thus, AdaFed not only yields a higher level of fairness compared to FedMGDA+, but also solves its complexity issue.
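To make the contrast concrete, the following sketch shows a generic iterative solver for the minimum-norm point in the convex hull of a set of gradients, here a Frank-Wolfe iteration with exact line search. It illustrates the kind of iterative procedure referred to above and is not the solver used in FedMGDA+; the function name and the stopping rule are assumptions.

```python
def min_norm_point(grads, iters=100, tol=1e-12):
    """Frank-Wolfe for min ||sum_i lam_i g_i||^2 over the probability simplex."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    v = list(grads[0])                      # start at a vertex of the hull
    for _ in range(iters):
        # Linear minimization step: vertex with the smallest inner product with v.
        u = min(grads, key=lambda g: dot(g, v))
        diff = [x - y for x, y in zip(v, u)]
        denom = dot(diff, diff)
        if denom <= tol:
            break
        # Exact line search on the segment from v to u, clipped to [0, 1].
        step = max(0.0, min(1.0, dot(v, diff) / denom))
        if step <= tol:
            break
        v = [x - step * d for x, d in zip(v, diff)]
    return v

# Toy usage: two orthogonal unit gradients; the min-norm point is their midpoint.
direction = min_norm_point([[1.0, 0.0], [0.0, 1.0]])
```

Each iteration costs one inner product per client gradient, and the number of iterations needed depends on the geometry of the gradients, which is the overhead that a closed-form direction avoids.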

F.2 Running time for AdaFed

Assume that the number of clients is $K$, and the dimension of the gradient vectors is $d$. Then, the orthogonalization for the $k$-th client, $k\in[K]$, needs $\mathcal{O}(2dk)$ operations (by operations we mean multiplications and additions). Hence, the total number of operations needed for the orthogonalization process is equal to $\mathcal{O}(2dK^{2})$ (also note that Gram-Schmidt is the most efficient algorithm for orthogonalization).
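This count can be checked with a short sketch. The code below uses plain classical Gram-Schmidt as a stand-in for AdaFed's orthogonalization step (any additional scaling in AdaFed is not modeled) and counts multiplications only. Under these assumptions, processing the $k$-th vector costs $d(2k-1)$ multiplications, which sums to exactly $dK^{2}$ over all $K$ clients; counting the additions as well roughly doubles this, in line with the quadratic growth in $K$ stated above.

```python
def gram_schmidt_with_count(vectors):
    """Classical Gram-Schmidt without normalization.

    Returns the orthogonalized vectors and the number of multiplications used.
    Assumes the input vectors are linearly independent.
    """
    basis, sq_norms, mults = [], [], 0
    for g in vectors:
        d = len(g)
        r = list(g)
        for b, nb in zip(basis, sq_norms):
            coeff = sum(x * y for x, y in zip(g, b)) / nb   # d multiplications
            r = [x - coeff * y for x, y in zip(r, b)]        # d multiplications
            mults += 2 * d
        sq_norms.append(sum(x * x for x in r))               # d multiplications
        mults += d
        basis.append(r)
    return basis, mults

# Toy usage with K = 3 gradients of dimension d = 4.
grads = [[1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
ortho, mults = gram_schmidt_with_count(grads)
```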

In our experimental setup, we observed that the overhead of AdaFed is negligible, resulting in almost the same overall running time for FedAvg and AdaFed. To support this observation, we refer to the FedMGDA+ paper, where the authors discuss the overhead of their proposed method; as they claim, the overhead is negligible, yielding the same running time as FedAvg. On the other hand, as explained in Section F.1, the complexity of AdaFed is lower than that of FedMGDA+.

Appendix G Curves and histograms

G.1 Training curves for CIFAR-10 and CIFAR-100

In this subsection, we depict the average test accuracy over the course of training for the CIFAR-10 dataset using setup one (see Section 7.1). In particular, we depict the average test accuracy across all the clients versus the number of communication rounds. Additionally, to demonstrate the convergence of the FL algorithms after 2000 communication rounds, we depict the training curve for 4000 rounds.

The curve for each method is obtained by using the best hyper-parameter of the respective method (we discussed the details of hyper-parameter tuning in Appendix E). Furthermore, the curves are averaged over five different seeds.

The results are shown in Figure 5 and Figure 6 for CIFAR-10 and CIFAR-100, respectively. In Figure 5 in particular, AdaFed converges faster than the benchmark methods. Specifically, AdaFed reaches an average test accuracy of 40% after around 400 communication rounds, whereas the benchmark methods reach this accuracy after around 900 rounds. This is another advantage of AdaFed in addition to imposing fairness across the clients.

Figure 5: Average test accuracy across clients for different FL methods on CIFAR-10. The setup for the experiments is elaborated in Section 7.1, setup 1.
Figure 6: Average test accuracy across clients for different FL methods on CIFAR-100. The setup for the experiments is elaborated in Section 7.2, setup 1.

G.2 Histogram of accuracies

To better observe the spread of the clients' accuracy, we depict the histogram of accuracy across 500 clients for the Original FEMNIST dataset (the setup for the experiment is discussed in Section D.1). To this end, we depict the histogram of the clients' accuracies for three different methods: (i) FedAvg, (ii) q-FFL, and (iii) AdaFed ($\gamma=5$); all using their well-tuned hyper-parameters. The result is depicted in Figure 7. As seen, the distribution of the accuracy is more concentrated (fairer) for AdaFed.

Figure 7: The distribution of clients' accuracy for the Original FEMNIST dataset using three different methods, namely: (i) FedAvg, (ii) q-FFL, and (iii) AdaFed.

Appendix H CIFAR-100 results with more local epochs

In this section, we test the performance of AdaFed using a larger number of local epochs $e$. To this end, we use the same setups as those used in the main body of the paper to produce the results for CIFAR-100, but we change the number of local epochs $e$ to 10 and 20 (we selected $e\in\{10,20\}$ because these values have been commonly utilized in the literature for the CIFAR-100 dataset). The results for $e=10$ and $e=20$ are reported in Table 17 and Table 18, respectively. We highlight two key observations from the tables:

  • AdaFed can still provide a higher level of fairness compared to the benchmark methods;

  • While increasing the number of local epochs from 1 to 10 results in higher average accuracy, this trend is not observed when further increasing $e$ to 20.

Table 17: Test accuracy on CIFAR-100 with $e=10$. The reported results are averaged over 5 different random seeds.

| Algorithm | Setup 1: $\bar{a}$ | Setup 1: $\sigma_a$ | Setup 1: Worst 10% | Setup 1: Best 10% | Setup 2: $\bar{a}$ | Setup 2: $\sigma_a$ | Setup 2: Worst 10% | Setup 2: Best 10% |
|---|---|---|---|---|---|---|---|---|
| FedAvg | 31.14 | 4.09 | 25.30 | 40.54 | 20.54 | 6.42 | 11.12 | 33.47 |
| q-FFL | 29.45 | 4.66 | 25.35 | 39.91 | 20.77 | 6.20 | 11.05 | 33.52 |
| AFL | 31.17 | 3.69 | 25.12 | 39.52 | 19.32 | 4.85 | 11.23 | 28.93 |
| TERM | 30.56 | 3.63 | 27.19 | 39.46 | 17.91 | 5.87 | 10.11 | 32.00 |
| AdaFed | 31.19 | 3.14 | 28.81 | 40.42 | 20.41 | 4.71 | 11.39 | 34.08 |

Table 18: Test accuracy on CIFAR-100 with $e=20$. The reported results are averaged over 5 different random seeds.

| Algorithm | Setup 1: $\bar{a}$ | Setup 1: $\sigma_a$ | Setup 1: Worst 10% | Setup 1: Best 10% | Setup 2: $\bar{a}$ | Setup 2: $\sigma_a$ | Setup 2: Worst 10% | Setup 2: Best 10% |
|---|---|---|---|---|---|---|---|---|
| FedAvg | 29.11 | 4.31 | 24.61 | 39.45 | 18.05 | 6.12 | 9.15 | 30.12 |
| q-FFL | 29.15 | 4.23 | 25.12 | 39.67 | 19.02 | 6.15 | 8.41 | 31.66 |
| AFL | 30.38 | 3.78 | 25.00 | 39.12 | 17.74 | 4.96 | 10.01 | 27.08 |
| TERM | 31.15 | 3.62 | 27.02 | 40.41 | 15.81 | 5.68 | 8.17 | 29.26 |
| AdaFed | 30.41 | 3.19 | 27.35 | 40.18 | 18.37 | 4.21 | 10.55 | 31.78 |

Appendix I Integration with a Label Noise Correction method

I.1 What is label noise in FL?

Label noise in FL refers to inaccuracies or errors in the ground truth labels associated with the data used for training. It occurs when the labels assigned to data points are incorrect or noisy due to various reasons. Label noise can be introduced at different stages of data collection, annotation, or transmission, and it can have a significant impact on the performance and reliability of FL models.

Label noise in FL is particularly challenging to address because FL relies on decentralized data sources, and participants may have limited control over label quality in remote environments. Dealing with label noise often involves developing robust models and FL algorithms that can adapt to the presence of inaccuracies in the labels.

I.2 Are the fair FL algorithms robust against label noise?

The primary intention of the fair FL algorithms including AdaFed is to ensure fairness among the clients while maintaining the average accuracy across them. Yet, these algorithms are not robust against label noise (mislabeled instances).

Nonetheless, AdaFed could be integrated with label-noise resistant methods in the literature yielding an FL method which (i) satisfies fairness among the clients, and (ii) is robust against the label noise. In particular, among the label-noise resistant FL algorithms in the literature, we select FedCorr (Xu et al., 2022) to be integrated with AdaFed.

FedCorr introduces a dimensionality-based filter to identify noisy clients, which is accomplished by measuring the local intrinsic dimensionality (LID) of local model prediction subspaces. They demonstrate that it is possible to distinguish clean datasets from noisy ones by observing the behavior of LID scores during the training process (we omit further discussions about FedCorr, and refer interested readers to their paper for more details).
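For readers unfamiliar with LID, a commonly used maximum-likelihood estimate from the distances to the $k$ nearest neighbors is $\widehat{\mathrm{LID}} = -\big(\tfrac{1}{k}\sum_{i=1}^{k}\log\tfrac{r_i}{r_k}\big)^{-1}$, where $r_i$ is the distance to the $i$-th nearest neighbor. The sketch below implements this estimator as general background; it is an assumption about the underlying technique and not a reproduction of FedCorr's code, and the function name `lid_mle` is hypothetical.

```python
import math

def lid_mle(neighbor_distances):
    """Maximum-likelihood LID estimate from positive k-nearest-neighbor distances."""
    r = sorted(neighbor_distances)
    r_max = r[-1]                                   # distance to the k-th neighbor
    mean_log = sum(math.log(x / r_max) for x in r) / len(r)
    if mean_log == 0.0:
        return float("inf")   # all neighbors equidistant: the estimate is unbounded
    return -1.0 / mean_log

# Toy usage: three neighbors at distances 1, 2 and 4.
estimate = lid_mle([1.0, 2.0, 4.0])
```

In this toy case the mean log-ratio equals $-\ln 2$, so the estimate is $1/\ln 2 \approx 1.44$.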

Similarly to FedCorr, we use a real-world noisy dataset, namely Clothing1M (Xiao et al., 2015), and we use exactly the same setting as they used for this dataset (https://github.com/Xu-**gyi/FedCorr). Clothing1M contains 1M clothing images in 14 classes. It is a dataset with noisy labels, since the data is collected from several online shopping websites and includes many mislabelled samples. In particular, we use local SGD with a momentum of 0.5, a batch size of 16, and five local epochs, and set the hyper-parameter $T_1=2$ in their algorithm. In addition, when integrated with AdaFed, we set $\gamma=5$ for AdaFed.

The results are summarized in Table 19. As observed, the average accuracy obtained by AdaFed is around 2.2 percentage points lower than that obtained by FedCorr, which shows that AdaFed is not robust against label noise. Moreover, as expected, AdaFed results in fairer client accuracies. On the other hand, when AdaFed is combined with FedCorr, the average accuracy improves while satisfactory fairness among the clients is maintained.

Table 19: Test accuracy on Clothing1M dataset. The reported results are averaged over 5 different random seeds.

| Algorithm | $\bar{a}$ | $\sigma_a$ | Worst 10% | Best 10% |
|---|---|---|---|---|
| FedAvg | 70.49 | 13.25 | 43.09 | 91.05 |
| FedCorr | 72.55 | 13.27 | 43.12 | 91.15 |
| AdaFed | 70.35 | 5.17 | 49.91 | 90.77 |
| FedCorr + AdaFed | 72.29 | 8.12 | 46.52 | 91.02 |