DP-BREM: Differentially-Private and Byzantine-Robust Federated Learning
with Client Momentum

Xiaolan Gu
University of Arizona
[email protected]
   Ming Li
University of Arizona
[email protected]
   Li Xiong
Emory University
[email protected]
Abstract

Federated Learning (FL) allows multiple participating clients to train machine learning models collaboratively while kee** their datasets local and only exchanging the gradient or model updates with a coordinating server. Existing FL protocols are vulnerable to attacks that aim to compromise data privacy and/or model robustness. Recently proposed defenses focused on ensuring either privacy or robustness, but not both. In this paper, we focus on simultaneously achieving differential privacy (DP) and Byzantine robustness for cross-silo FL, based on the idea of learning from history. The robustness is achieved via client momentum, which averages the updates of each client over time, thus reducing the variance of the honest clients and exposing the small malicious perturbations of Byzantine clients that are undetectable in a single round but accumulate over time. In our initial solution DP-BREM, DP is achieved by adding noise to the aggregated momentum, and we account for the privacy cost from the momentum, which is different from the conventional DP-SGD that accounts for the privacy cost from the gradient. Since DP-BREM assumes a trusted server (who can obtain clients’ local models or updates), we further develop the final solution called DP-BREM+, which achieves the same DP and robustness properties as DP-BREM without a trusted server by utilizing secure aggregation techniques, where DP noise is securely and jointly generated by the clients. Both theoretical analysis and experimental results demonstrate that our proposed protocols achieve better privacy-utility tradeoff and stronger Byzantine robustness than several baseline methods, under different DP budgets and attack settings.

1 Introduction

Federated learning (FL) [29] is an emerging paradigm that enables multiple clients to collaboratively learn models without explicitly sharing their data. The clients upload their local model updates to a coordinating server, which then shares the global average with the clients in an iterative process. This offers a promising solution to mitigate the potential privacy leakage of sensitive information about individuals (since the data stays local with each client), such as ty** history, shop** transactions, geographical locations, and medical records. However, recent works have demonstrated that FL may not always provide sufficient privacy and robustness guarantees. In terms of privacy leakage, exchanging the model updates throughout the training process can still reveal sensitive information [4, 31] and cause deep leakage such as pixel-wise accurate image recovery [51, 48], either to a third-party (including other participating clients) or the central server. In terms of robustness, the decentralization design of FL systems opens up the training process to be manipulated by malicious clients, aiming to either prevent the convergence of the global model (a.k.a. Byzantine attacks) [15, 3, 45], or implant a backdoor trigger into the global model to cause targeted misclassification (a.k.a. backdoor attacks) [2, 44].

To mitigate the privacy leakage in FL, Differential Privacy (DP) [12, 13] has been adopted as a rigorous privacy notion. Existing frameworks [30, 18, 26] applied DP in FL to provide client-level privacy under the assumption of a trusted server: whether a client has participated in the training process cannot be inferred by a third party from the released global model. Other works in FL [49, 26, 46, 41] focused on record-level privacy: whether a data record at a client has participated during training cannot be inferred by the server or other adversaries that have access to the model updates or the global model. Record-level privacy is more relevant in cross-silo (as versus cross-device) FL scenarios, such as multiple hospitals collaboratively learn a prediction model for COVID-19, in which case what needs to be protected is the privacy of each patient (corresponding to each record in a hospital’s dataset). In this paper, we focus on cross-silo FL with record-level DP, where each client possesses a set of raw records, and each record corresponds to an individual’s private data.

To defend against Byzantine attacks, robust FL protocols are proposed to ensure that the training procedure is robust to a fraction of potentially malicious clients. This problem has received significant attention from the community. Most existing approaches replace the averaging step at the server with a robust aggregation rule, such as the median [8, 47, 5, 32]. However, recent state-of-the-art attacks [3, 45, 38] have demonstrated the failure of the above robust aggregators. Furthermore, a recent work [22] shows that there exist realistic scenarios where these robust aggregators fail to converge, even if there are no Byzantine attackers and the data distribution is identical (i.i.d.) across the clients, and proposed a new solution called Learning From History (LFH) to address this issue. LFH achieves robustness via client momentum with the motivation of averaging the updates of each client over time, thus reducing the variance of the honest clients and exposing the small malicious perturbations of Byzantine clients that are undetectable in a single round but accumulate over time.

In this paper, we focus on achieving record-level DP and Byzantine robustness simultaneously in cross-silo FL. Existing FL protocols with DP-SGD [1] do not achieve the robustness property intrinsically. Directly implementing an existing robust aggregator over the privatized client gradients will lead to poor utility because these aggregators (such as median [47, 5, 32]) usually have large sensitivity and require large DP noise, leading to poor utility. It is desirable to achieve DP guarantees based on average-based aggregators. Although LFH [22] is an average-based Byzantine-robust FL protocol, it aggregates client momentum instead of gradient, thus it is non-trivial to achieve DP on top of LFH. We show that a direct combination of LFH with DP-SGD momentum has several limitations, leading to both poor utility and robustness. Therefore, we aim to address these limitations in our solution.

To achieve an enhanced privacy-utility tradeoff, we start our problem from an assumption that the server is trusted and developed a Differentially-Private and Byzantine-Robust fEderated learning algorithm with client Momentum (DP-BREM), which essentially is a DP version of the Byzantine-robust method LFH [22]. Instead of adding DP noise to the gradient and then aggregating momentum as post-processing, we add DP noise to the aggregated momentum with carefully computed sensitivity to account for the privacy cost. Since the noise is added to the final aggregate (instead of intermediate local gradient), our basic solution DP-BREM maintains the non-private LFH’s robustness as much as possible, which we show both theoretically (via convergence analysis) and empirically (via experimental results). Then, we relax our trust assumption to a malicious server (for privacy only) and develop our final solution DP-BREM+. It utilizes secure multiparty computation (MPC) techniques, including secure aggregation and secure noise generation, to achieve the same DP and robustness guarantees as in DP-BREM. In Table 1, we compare DP-BREM and DP-BREM+ with existing approaches (or the variants) that achieve both DP and Byzantine robustness (DDP-RP [43] and DP-RSA [50] are described in Sec. 7). These approaches will be evaluated and compared in experiments. Our main contributions are:

Table 1: Comparison of FL approaches with DP and Byzantine-robustness
  Approaches Differential Privacy (DP) §§\mathsection§ Byzantine Robustness
Trust Assumption of Server Noise Generator Perturbation Mechanism Standard Deviation of Noise in Aggregate Mechanism
  DP-FedSGD [30] with both record and client norm clip**s trusted server igi+𝒩(0,σ2)subscript𝑖subscript𝑔𝑖𝒩0superscript𝜎2\sum_{i}g_{i}+\mathcal{N}(0,\sigma^{2})∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) σ𝜎\sigmaitalic_σ client norm clip**
CM [47] with DP noise trusted server median({gi}i=1n)+𝒩(0,σ2)mediansuperscriptsubscriptsubscript𝑔𝑖𝑖1𝑛𝒩0superscript𝜎2\text{median}(\{g_{i}\}_{i=1}^{n})+\mathcal{N}(0,\sigma^{2})median ( { italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) + caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) σ𝜎\sigmaitalic_σ coordinate-wise median (CM)
DDP-RP [43] \diamond honest-but-curious clients (distributively) i[gi+𝒩(0,σ2τ)]subscript𝑖delimited-[]subscript𝑔𝑖𝒩0superscript𝜎2𝜏\sum_{i}[g_{i}+\mathcal{N}(0,\frac{\sigma^{2}}{\tau})]∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + caligraphic_N ( 0 , divide start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_τ end_ARG ) ] nτσ𝑛𝜏𝜎\sqrt{\frac{n}{\tau}}\cdot\sigmasquare-root start_ARG divide start_ARG italic_n end_ARG start_ARG italic_τ end_ARG end_ARG ⋅ italic_σ element-wise range proof
DP-RSA [50] untrusted client i[sign(gi)+𝒩(0,σ2)]subscript𝑖delimited-[]signsubscript𝑔𝑖𝒩0superscript𝜎2\sum_{i}[\text{sign}(g_{i})+\mathcal{N}(0,\sigma^{2})]∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ sign ( italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ] nσ𝑛𝜎\sqrt{n}\cdot\sigmasquare-root start_ARG italic_n end_ARG ⋅ italic_σ aggregation of sign-SGD
DP-LFH (baseline in Sec. 3.2) untrusted client i[mi+𝒩(0,σ2)]subscript𝑖delimited-[]subscript𝑚𝑖𝒩0superscript𝜎2\sum_{i}[m_{i}+\mathcal{N}(0,\sigma^{2})]∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ] nσ𝑛𝜎\sqrt{n}\cdot\sigmasquare-root start_ARG italic_n end_ARG ⋅ italic_σ LFH [22]: client momentum and centered clip**
DP-BREM (our initial solution) trusted server imi+𝒩(0,σ2)subscript𝑖subscript𝑚𝑖𝒩0superscript𝜎2\sum_{i}m_{i}+\mathcal{N}(0,\sigma^{2})∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) σ𝜎\sigmaitalic_σ
DP-BREM+ (our final solution) \dagger untrusted clients (jointly)
 
  • §§\mathsection§

    We show the privacy-utility tradeoff by fixing the same privacy cost (in terms of DP) and then comparing the standard deviation of the noise on the aggregation, where a smaller standard deviation means the DP noise has less negative impact on the utility. Note that different approaches have different aggregation strategies, where gisubscript𝑔𝑖g_{i}italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes local gradient, and misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes local momentum.

  • \diamond

    DDP-RP assumes an honest-but-curious server, i.e., following protocol instructions honestly, but may try to learn additional information. It guarantees distributed DP (DDP) with secure aggregation techniques, where clients add partial noise with a smaller standard deviation, depending on the number of honest clients or its lower bound, denoted by τ𝜏\tauitalic_τ. Thus, it provides better privacy-utility tradeoff than local DP (LDP).

  • \dagger

    DP-BREM+ achieves the same DP and robustness guarantees as DP-BREM, but has a different trust assumption on the server. It achieves the same noise as central DP, i.e., the DP noise is added to the aggregation, but does not require a trusted server because the noise is securely generated by clients (via the proposed noise generation protocol) and securely added to the aggregation (via a secure aggregation protocol).

1) We propose a novel differentially private and Byzantine-robust FL protocol called DP-BREM, which adds DP noise to the aggregated client momentum with carefully computed sensitivity. Our privacy analysis (shown in Theorem 1) accounts for the privacy cost from momentum, which is different from the conventional DP-SGD that accounts for the privacy cost from the gradient. We also provide the convergence analysis of DP-BREM (shown in Theorem 3), which indicates that there is only a small sacrifice in the convergence rate to satisfy DP (compared to the large sacrifice in convergence of the baseline solution shown in Section 3.2).

2) Considering that DP-BREM is developed under the assumption of a trusted server, we propose the final solution called DP-BREM+ (in Section 5), which achieves the same privacy and robustness properties as DP-BREM, even under a malicious server (for privacy only), using secure multiparty computation techniques. DP-BREM+ is built based on the framework of secure aggregation with verifiable inputs (SAVI) [35], but extends it to guarantee the integrity of DP noise via a novel secure distributed noise generation protocol. Our extended SAVI protocol is general enough to be applied to other DP and robust FL protocols that are average-based.

3) We conduct extensive experiments using MNIST and CIFAR-10 datasets (in Section 6) to demonstrate the effectiveness of our protocols. The results show that it can achieve better utility under the same record-level DP guarantees, as well as strong robustness against Byzantine clients under state-of-the-art attacks, compared to the baseline methods.

2 Preliminaries

2.1 Differential Privacy (DP)

Differential Privacy (DP) is a rigorous mathematical framework for the release of information derived from private data. Applied to machine learning, a differentially private training mechanism allows the public release of model parameters with a strong privacy guarantee: adversaries are limited in what they can learn about the original training data based on analyzing the parameters, even when they have access to arbitrary side information. The formal definition is as follows:

Definition 1 ((ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP [13, 12]).

For ϵ[0,)italic-ϵ0\epsilon\in[0,\infty)italic_ϵ ∈ [ 0 , ∞ ) and δ[0,1)𝛿01\delta\in[0,1)italic_δ ∈ [ 0 , 1 ), a randomized mechanism :𝒟:𝒟\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}caligraphic_M : caligraphic_D → caligraphic_R with a domain 𝒟𝒟\mathcal{D}caligraphic_D (e.g., possible training datasets) and range \mathcal{R}caligraphic_R (e.g., all possible trained models) satisfies (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-Differential Privacy (DP) if for any two neighboring datasets D,D𝒟𝐷superscript𝐷𝒟D,D^{\prime}\in\mathcal{D}italic_D , italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_D that differ in only one record and for any subset of outputs S𝑆S\subseteq\mathcal{R}italic_S ⊆ caligraphic_R, it holds that

[(D)S]eϵ[(D)S]+δdelimited-[]𝐷𝑆superscript𝑒italic-ϵdelimited-[]superscript𝐷𝑆𝛿\displaystyle\mathbb{P}[\mathcal{M}(D)\in S]\leqslant e^{\epsilon}\cdot\mathbb% {P}[\mathcal{M}(D^{\prime})\in S]+\deltablackboard_P [ caligraphic_M ( italic_D ) ∈ italic_S ] ⩽ italic_e start_POSTSUPERSCRIPT italic_ϵ end_POSTSUPERSCRIPT ⋅ blackboard_P [ caligraphic_M ( italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∈ italic_S ] + italic_δ

where ϵitalic-ϵ\epsilonitalic_ϵ and δ𝛿\deltaitalic_δ are privacy parameters (or privacy budget), and a smaller ϵitalic-ϵ\epsilonitalic_ϵ and δ𝛿\deltaitalic_δ indicate a more private mechanism.

Gaussian Mechanism. A common paradigm for approximating a deterministic real-valued function f:𝒟:𝑓𝒟f:\mathcal{D}\rightarrow\mathbb{R}italic_f : caligraphic_D → blackboard_R with a differentially-private mechanism is via additive noise calibrated to f𝑓fitalic_f’s sensitivity sfsubscript𝑠𝑓s_{f}italic_s start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT, which is defined as the maximum of the absolute distance |f(D)f(D)|𝑓𝐷𝑓superscript𝐷|f(D)-f(D^{\prime})|| italic_f ( italic_D ) - italic_f ( italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) |. The Gaussian Mechanism is defined by (D)=f(D)+𝒩(0,sf2σ2)𝐷𝑓𝐷𝒩0superscriptsubscript𝑠𝑓2superscript𝜎2\mathcal{M}(D)=f(D)+\mathcal{N}(0,s_{f}^{2}\cdot\sigma^{2})caligraphic_M ( italic_D ) = italic_f ( italic_D ) + caligraphic_N ( 0 , italic_s start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), where 𝒩(0,sf2σ2)𝒩0superscriptsubscript𝑠𝑓2superscript𝜎2\mathcal{N}(0,s_{f}^{2}\cdot\sigma^{2})caligraphic_N ( 0 , italic_s start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) is noise drawn from a Gaussian distribution. It was shown that \mathcal{M}caligraphic_M satisfies (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP if δ45e(σϵ)2/2𝛿45superscript𝑒superscript𝜎italic-ϵ22\delta\geqslant\frac{4}{5}e^{-(\sigma\epsilon)^{2}/2}italic_δ ⩾ divide start_ARG 4 end_ARG start_ARG 5 end_ARG italic_e start_POSTSUPERSCRIPT - ( italic_σ italic_ϵ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_POSTSUPERSCRIPT and ϵ<1italic-ϵ1\epsilon<1italic_ϵ < 1 [13]. Note that we use an advanced privacy analysis tool proposed in [11], which works for all ϵ>0italic-ϵ0\epsilon>0italic_ϵ > 0.

DP-SGD Algorithm. The most well-known differentially-private algorithm in machine learning is DP-SGD [1], which introduces two modifications to the vanilla stochastic gradient descent (SGD). First, a clip** step is applied to the gradient so that the gradient is in effect bounded for a finite sensitivity. The second modification is Gaussian noise augmentation on the summation of clipped gradients, which is equivalent to applying the Gaussian mechanism to the updated iterates. The privacy accountant of DP-SGD is shown in Appendix F.

2.2 Federated Learning (FL) with DP

Federated Learning (FL) [21, 29] is a collaborative learning setting to train machine learning models. We consider the horizontal cross-silo FL setting, which involves multiple clients, each holding their own private dataset of the same set of features, and a central server that implements the aggregation. Unlike the traditional centralized approach, data is not stored at a central server; instead, clients train models locally and exchange updated parameters with the server, which aggregates the received local model parameters and sends them to the clients. Based on the participating clients and scale, federated learning can be classified into two types: cross-device FL where clients are typically mobile devices and the client number can reach up to a scale of millions; cross-silo FL (our focus) where clients are organizations or companies and the client number is usually small (e.g., within a hundred).

FL with DP. In FL, the neighboring datasets D𝐷Ditalic_D and Dsuperscript𝐷D^{\prime}italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT in Definition 1 can be defined at two distinct levels: record-level and client-level. In cross-device FL, each device usually stores one individual’s data, then the whole devices’ data should be protected. It corresponds to client-level DP, where Dsuperscript𝐷D^{\prime}italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is obtained by adding or removing one client/device’s whole training dataset from D𝐷Ditalic_D. In cross-silo FL, each record corresponds to one individual’s data, then record-level DP should be provided, where Dsuperscript𝐷D^{\prime}italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is obtained by adding or removing a single training record/example from D𝐷Ditalic_D. Since we consider cross-silo FL, achieving record-level DP is our privacy goal.

2.3 Byzantine Attacks and Defenses

In a Byzantine attack, the adversary aims to destroy the convergence of the model. Due to the decentralization design, FL systems are vulnerable to Byzantine clients, who may not follow the protocol and can send arbitrary updates to the server. Also, they may have complete knowledge of the system and can collude with each other. Most state-of-the-art defense mechanisms [8, 47, 5, 32] play with median statistics of client gradients. However, recent attacks [3, 45] have empirically demonstrated the failure of the above robust aggregations.

LFH: Non-private Byzantine-Robust Defense. Recently, Karimireddy et al. [22] showed that most state-of-the-art robust aggregators require strong assumptions and may not converge even in the complete absence of Byzantine attackers. Then, they proposed a new Byzantine-robust scheme called "learning from history" (LFH) that essentially utilizes two simple strategies: client momentum (during local update) and centered clip** (during server aggregation). In each iteration t𝑡titalic_t, client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT receives the global model parameter 𝜽t1subscript𝜽𝑡1\bm{\theta}_{t-1}bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT from the server, and computes the local gradient of the random dataset batch 𝒟i,t𝒟isubscript𝒟𝑖𝑡subscript𝒟𝑖\mathcal{D}_{i,t}\subseteq\mathcal{D}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ⊆ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT by

𝒈t,i=1|𝒟i,t|𝒙𝒟i,t𝜽(𝒙,𝜽t1)subscript𝒈𝑡𝑖1subscript𝒟𝑖𝑡subscript𝒙subscript𝒟𝑖𝑡subscript𝜽𝒙subscript𝜽𝑡1\displaystyle\bm{g}_{t,i}=\frac{1}{|\mathcal{D}_{i,t}|}\sum\nolimits_{\bm{x}% \in\mathcal{D}_{i,t}}\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta}_{t-1})bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT bold_italic_x ∈ caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_ℓ ( bold_italic_x , bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) (1)

where 𝜽(𝒙,𝜽t1)subscript𝜽𝒙subscript𝜽𝑡1\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta}_{t-1})∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_ℓ ( bold_italic_x , bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) is the per-record gradient w.r.t. the loss function ()\ell(\cdot)roman_ℓ ( ⋅ ). The client momentum can be computed via

𝒎t,i=(1β)𝒈t,i+β𝒎t1,isubscript𝒎𝑡𝑖1𝛽subscript𝒈𝑡𝑖𝛽subscript𝒎𝑡1𝑖\displaystyle\bm{m}_{t,i}=(1-\beta)\bm{g}_{t,i}+\beta\bm{m}_{t-1,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT = ( 1 - italic_β ) bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT + italic_β bold_italic_m start_POSTSUBSCRIPT italic_t - 1 , italic_i end_POSTSUBSCRIPT (2)

where β[0,1)𝛽01\beta\in[0,1)italic_β ∈ [ 0 , 1 ). After receiving 𝒎t,isubscript𝒎𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT from all clients, the server implements aggregation with centered clip** via

𝒎t=𝒎t1+1ni=1n𝖢𝗅𝗂𝗉C(𝒎t,i𝒎t1)subscript𝒎𝑡subscript𝒎𝑡11𝑛superscriptsubscript𝑖1𝑛subscript𝖢𝗅𝗂𝗉𝐶subscript𝒎𝑡𝑖subscript𝒎𝑡1\displaystyle\bm{m}_{t}=\bm{m}_{t-1}+\frac{1}{n}\sum\nolimits_{i=1}^{n}\mathsf% {Clip}_{C}(\bm{m}_{t,i}-\bm{m}_{t-1})bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_m start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) (3)

where 𝖢𝗅𝗂𝗉C()subscript𝖢𝗅𝗂𝗉𝐶\mathsf{Clip}_{C}(\cdot)sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( ⋅ ) with scalar C>0𝐶0C>0italic_C > 0 is the clip** function:

𝖢𝗅𝗂𝗉C(𝒙)𝒙min{1,C/𝒙}subscript𝖢𝗅𝗂𝗉𝐶𝒙𝒙1𝐶norm𝒙\displaystyle\mathsf{Clip}_{C}(\bm{x})\coloneqq\bm{x}\cdot\min\{1,~{}C/\|\bm{x% }\|\}sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_x ) ≔ bold_italic_x ⋅ roman_min { 1 , italic_C / ∥ bold_italic_x ∥ } (4)

and 𝒙norm𝒙\|\bm{x}\|∥ bold_italic_x ∥ is the L2-norm of any vector 𝒙𝒙\bm{x}bold_italic_x. The clip** operation 𝖢𝗅𝗂𝗉C(𝒎t,i𝒎t1)subscript𝖢𝗅𝗂𝗉𝐶subscript𝒎𝑡𝑖subscript𝒎𝑡1\mathsf{Clip}_{C}(\bm{m}_{t,i}-\bm{m}_{t-1})sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) essentially bounds the distance between client’s local momentum 𝒎t,isubscript𝒎𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT and the previous aggregated momentum 𝒎t1subscript𝒎𝑡1\bm{m}_{t-1}bold_italic_m start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT, thus restricts the impact from Byzantine clients. Then, the global model 𝜽tsubscript𝜽𝑡\bm{\theta}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT can be updated by 𝜽t=𝜽t1ηt𝒎tsubscript𝜽𝑡subscript𝜽𝑡1subscript𝜂𝑡subscript𝒎𝑡\bm{\theta}_{t}=\bm{\theta}_{t-1}-\eta_{t}\bm{m}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT - italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT with learning rate ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The convergence rate under Byzantine attacks is shown by the following lemma.

Lemma 1 (Convergence Rate of LFH [22]).

With some parameter tuning, the convergence rate of the Byzantine-robust algorithm LFH is asymptotically (ignoring constants and higher order terms) of the order

1Tt=1T𝔼(𝜽t1)2ρ2T1+||nless-than-or-similar-to1𝑇superscriptsubscript𝑡1𝑇𝔼superscriptnormsubscript𝜽𝑡12superscript𝜌2𝑇1𝑛\displaystyle\frac{1}{T}\sum\nolimits_{t=1}^{T}\mathbb{E}\|\nabla\ell(\bm{% \theta}_{t-1})\|^{2}\lesssim\sqrt{\frac{\rho^{2}}{T}\frac{1+|\mathcal{B}|}{n}}divide start_ARG 1 end_ARG start_ARG italic_T end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT blackboard_E ∥ ∇ roman_ℓ ( bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≲ square-root start_ARG divide start_ARG italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_T end_ARG divide start_ARG 1 + | caligraphic_B | end_ARG start_ARG italic_n end_ARG end_ARG (5)

where ()\ell(\cdot)roman_ℓ ( ⋅ ) is the loss function, T𝑇Titalic_T is the total number of training iterations, |||\mathcal{B}|| caligraphic_B | is the number of Byzantine clients, n𝑛nitalic_n is the number of all clients, and ρ𝜌\rhoitalic_ρ is a parameter that quantifies the variance of honest clients’ stochastic gradients:

𝔼𝒈t,i𝔼[𝒈t,i]2ρ2𝔼superscriptnormsubscript𝒈𝑡𝑖𝔼delimited-[]subscript𝒈𝑡𝑖2superscript𝜌2\displaystyle\mathbb{E}\|\bm{g}_{t,i}-\mathbb{E}[\bm{g}_{t,i}]\|^{2}\leqslant% \rho^{2}blackboard_E ∥ bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - blackboard_E [ bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ] ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (6)

Interpretation of Lemma 1. When there are no Byzantine clients, LFH recovers the optimal rate of ρnT𝜌𝑛𝑇\frac{\rho}{\sqrt{nT}}divide start_ARG italic_ρ end_ARG start_ARG square-root start_ARG italic_n italic_T end_ARG end_ARG. In the presence of a ||/n𝑛|\mathcal{B}|/n| caligraphic_B | / italic_n fraction of Byzantine clients, the rate has an additional term ρ||/nT𝜌𝑛𝑇\rho\sqrt{\frac{|\mathcal{B}|/n}{T}}italic_ρ square-root start_ARG divide start_ARG | caligraphic_B | / italic_n end_ARG start_ARG italic_T end_ARG end_ARG, which depends on the fraction ||/n𝑛|\mathcal{B}|/n| caligraphic_B | / italic_n but does not improve with increasing clients.

3 Problem Statement and Motivation

3.1 Problem Statement

System Model. Our system model follows the general setting of Fed-SGD [29]. There are multiple parties in the FL system: one aggregation server and n𝑛nitalic_n participating clients {𝖢1,,𝖢n}subscript𝖢1subscript𝖢𝑛\{\mathsf{C}_{1},\cdots,\mathsf{C}_{n}\}{ sansserif_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , sansserif_C start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }. The server holds a global model 𝜽tdsubscript𝜽𝑡superscript𝑑\bm{\theta}_{t}\in\mathbb{R}^{d}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT and each client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, i{1,,n}𝑖1𝑛i\in\{1,\cdots,n\}italic_i ∈ { 1 , ⋯ , italic_n } possesses a private training dataset 𝒟isubscript𝒟𝑖\mathcal{D}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The server communicates with each client through a secure (private and authenticated) channel. During the iterative training process, the server broadcasts the global model in the current iteration to all clients and aggregates the received gradient/momentum from all clients (or a subset of clients) to update the global model until convergence. The final global model is returned after the training process as the output.

Threat Model. The considered adversary aims to perform a 1) privacy attack and/or 2) Byzantine attack with the following threat model, respectively.

1) Privacy Attack. Following the conventional FL setting, we assume the server has no access to the client’s local training data, but may have an incentive to infer clients’ private information. In our initial solution called DP-BREM, we assume a trusted server that can obtain clients’ local models/updates. The adversary is a third party or the participating clients (can be any set of clients) who have access to the intermediate and final global models and may use them to infer the private data of an honest client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Hence, the privacy goal is to ensure the global model (and its update) satisfies DP. In our final solution DP-BREM+, in addition to third parties and clients, the adversary also includes the server that tries to infer additional information from the local updates (and may deviate from the protocol for privacy inference). Such a model is also adopted in previous work [35]. Following [35], we assume a minority of malicious clients who can deviate from the protocol arbitrarily.

2) Byzantine Attack. Recall that the goal of Byzantine attacks is to destroy the convergence of the global model (discussed in Section 2.3). We only consider malicious clients as the adversaries for Byzantine attacks because the server’s primary goal is to train a robust model, thus no incentive to implement Byzantine attacks. These malicious clients (assumed to be a minority of all participating clients) can deviate from the protocol arbitrarily and have full control of both their local training data and their submission to the servers, but do not influence other honest clients.

Objectives. The goal of this paper is to achieve both record-level DP and Byzantine robustness at the same time. We aim to provide high utility (i.e., high accuracy of the global model) with the required DP guarantee under the existence of Byzantine attacks from malicious clients. Our ultimate privacy goal is to provide DP guarantees against an untrusted server and other clients, but we start by assuming a trusted server first in our initial solution.

3.2 Challenges and Baseline

Challenges: replacing average-based aggregator leads to large sensitivity of DP. Though there are many works on achieving either DP or Byzantine robustness, it’s nontrivial to achieve both with high utility. The main reason is that most Byzantine-robust methods replace the averaging aggregator with median-based strategies or some complex robust aggregators, which leads to a large sensitivity of DP compared to the averaging operation, as illustrated in Example 1.

Example 1 (Sensitivity Computation: Average vs. Median).

Consider a dataset with 5 samples: 𝒟={1,3,5,7,9}𝒟13579\mathcal{D}=\{1,3,5,7,9\}caligraphic_D = { 1 , 3 , 5 , 7 , 9 }, and its neighboring dataset 𝒟superscript𝒟\mathcal{D}^{\prime}caligraphic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is obtained by changing one value in 𝒟𝒟\mathcal{D}caligraphic_D with at most 1, such as 𝒟={1,3,𝟒,7,9}superscript𝒟13479\mathcal{D}^{\prime}=\{1,3,\bm{4},7,9\}caligraphic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = { 1 , 3 , bold_4 , 7 , 9 }. Then, the sensitivity of average-query is max𝒟,𝒟|𝖺𝗏𝗀(𝒟)𝖺𝗏𝗀(𝒟)|=1/5=0.2subscript𝒟superscript𝒟𝖺𝗏𝗀𝒟𝖺𝗏𝗀superscript𝒟150.2\max\limits_{\mathcal{D},\mathcal{D}^{\prime}}|\mathsf{avg}(\mathcal{D})-% \mathsf{avg}(\mathcal{D}^{\prime})|=1/5=0.2roman_max start_POSTSUBSCRIPT caligraphic_D , caligraphic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | sansserif_avg ( caligraphic_D ) - sansserif_avg ( caligraphic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) | = 1 / 5 = 0.2. However, the sensitivity of median-query is max𝒟,𝒟|𝗆𝖾𝖽𝗂𝖺𝗇(𝒟)𝗆𝖾𝖽𝗂𝖺𝗇(𝒟)|=1subscript𝒟superscript𝒟𝗆𝖾𝖽𝗂𝖺𝗇𝒟𝗆𝖾𝖽𝗂𝖺𝗇superscript𝒟1\max\limits_{\mathcal{D},\mathcal{D}^{\prime}}|\mathsf{median}(\mathcal{D})-% \mathsf{median}(\mathcal{D}^{\prime})|=1roman_max start_POSTSUBSCRIPT caligraphic_D , caligraphic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | sansserif_median ( caligraphic_D ) - sansserif_median ( caligraphic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) | = 1. Moreover, when increasing the size of the dataset, the sensitivity of the average query will be reduced (and then less noise to be added), while the sensitivity of the median query is the same.

DP-LFH: baseline via direct combination of LFH and DP-SGD. As shown in Section 2.3, the Byzantine-robust scheme LFH [22] utilizes an average-based aggregator, which can be regarded as a non-private robust solution to address the disadvantage of median-based aggregator. A straightforward method to add DP protection on top of LFH is to combine it with the DP-SGD algorithm. However, LFH requires each client to compute the local momentum 𝒎t,isubscript𝒎𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT for server aggregation, while DP-SGD aggregates gradients and accounts for the privacy cost via the composition of iterative gradient update. Without a trusted server, a straightforward solution to combine DP with LFH is to use DP-SGD at each client to privatize the local gradient, and then compute the momentum from the privatized gradient (thus there is no additional privacy cost due to post-processing). Formally, client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT computes

𝒈t,i=1|𝒟i,t|[𝒙𝒟i,t𝖢𝗅𝗂𝗉R(𝜽(𝒙,𝜽t1))+𝒩(0,R2σ2𝐈d)],subscript𝒈𝑡𝑖1subscript𝒟𝑖𝑡delimited-[]subscript𝒙subscript𝒟𝑖𝑡subscript𝖢𝗅𝗂𝗉𝑅subscript𝜽𝒙subscript𝜽𝑡1𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\displaystyle\bm{g}_{t,i}=\frac{1}{|\mathcal{D}_{i,t}|}\left[\sum\nolimits_{% \bm{x}\in\mathcal{D}_{i,t}}\mathsf{Clip}_{R}(\nabla_{\bm{\theta}}\ell(\bm{x},% \bm{\theta}_{t-1}))+\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})\right],bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT | end_ARG [ ∑ start_POSTSUBSCRIPT bold_italic_x ∈ caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_ℓ ( bold_italic_x , bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) ) + caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ] , (7)

where 𝐈dsubscript𝐈𝑑\mathbf{I}_{d}bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT is an identity matrix with size d×d𝑑𝑑d\times ditalic_d × italic_d (d𝑑ditalic_d is the model size, i.e., 𝜽tdsubscript𝜽𝑡superscript𝑑\bm{\theta}_{t}\in\mathbb{R}^{d}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT), the record-level clip** parameter 𝖢𝗅𝗂𝗉R()subscript𝖢𝗅𝗂𝗉𝑅\mathsf{Clip}_{R}(\cdot)sansserif_Clip start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( ⋅ ) restricts the sensitivity when adding/removing one record from the local dataset, and Gaussian noise 𝒩(0,R2σ2𝐈d)𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) introduces DP property on 𝒈t,isubscript𝒈𝑡𝑖\bm{g}_{t,i}bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT. Since DP is immune to post-processing, the remaining steps can be implemented in the same way as the original LFH, without incurring additional privacy costs. This baseline solution DP-LFH achieves record-level DP against an untrusted server. However, it has several limitations, which lead to both poor privacy-utility tradeoff and robustness.

Limitation 1: large aggregated noise. Since each client locally adds DP noise, the overall noise after aggregation will be larger than the case of the central setting under the same privacy budget (i.e., the value of ϵitalic-ϵ\epsilonitalic_ϵ), because only the server adds noise in the central setting. Therefore, DP-LFH has a poor privacy-utility tradeoff.

Limitation 2: large impact on Byzantine robustness. Since the DP noise is added locally to each client’s gradient before momentum-based clip**, it leads to a negative impact on Byzantine robustness: the noisy client momentum 𝒎t,isubscript𝒎𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT has larger variance than the noise-free one, which leads to larger bias and variance on the clip** step 𝖢𝗅𝗂𝗉C(𝒎t,i𝒎t1)subscript𝖢𝗅𝗂𝗉𝐶subscript𝒎𝑡𝑖subscript𝒎𝑡1\mathsf{Clip}_{C}(\bm{m}_{t,i}-\bm{m}_{t-1})sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ). Furthermore, this impact will be enlarged when there are more Byzantine clients, which is explained as follows. Since the parameter ρ2superscript𝜌2\rho^{2}italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT defined in (6) quantifies the variance of client’s gradient, and the DP noise is added to the local gradient in (7), the parameter ρ𝜌\rhoitalic_ρ of the convergence rate shown in (5) is replaced by ρ+dσ𝜌𝑑𝜎\rho+\sqrt{d}\sigmaitalic_ρ + square-root start_ARG italic_d end_ARG italic_σ (ignoring constants) for DP-LFH, i.e., the convergence rate of DP-LFH is asymptotic of the order

1Tt=1T𝔼(𝜽t1)2(ρ+dσ)2T1+||nless-than-or-similar-to1𝑇superscriptsubscript𝑡1𝑇𝔼superscriptnormsubscript𝜽𝑡12superscript𝜌𝑑𝜎2𝑇1𝑛\displaystyle\frac{1}{T}\sum\nolimits_{t=1}^{T}\mathbb{E}\|\nabla\ell(\bm{% \theta}_{t-1})\|^{2}\lesssim\sqrt{\frac{(\rho+\sqrt{d}\sigma)^{2}}{T}\frac{1+|% \mathcal{B}|}{n}}divide start_ARG 1 end_ARG start_ARG italic_T end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT blackboard_E ∥ ∇ roman_ℓ ( bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≲ square-root start_ARG divide start_ARG ( italic_ρ + square-root start_ARG italic_d end_ARG italic_σ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_T end_ARG divide start_ARG 1 + | caligraphic_B | end_ARG start_ARG italic_n end_ARG end_ARG (8)

Therefore, either a large d𝑑ditalic_d (i.e., large model) or a large σ𝜎\sigmaitalic_σ (i.e., small privacy budget ϵitalic-ϵ\epsilonitalic_ϵ) will enlarge the impact from Byzantine clients due to the order O(dσ2||)𝑂𝑑superscript𝜎2O(\sqrt{d\sigma^{2}|\mathcal{B}|})italic_O ( square-root start_ARG italic_d italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | caligraphic_B | end_ARG ) of convergence rate. We note that Guerraoui et al.’s work [19] also shares a similar insight: they show that DP with local noise and Byzantine robustness are incompatible, especially when the dimension of model parameters d𝑑ditalic_d is large.

Limitation 3: no privacy amplification from client-level sampling due to momentum. According to the recursive representation 𝒎t,i=(1β)𝒈t,i+β𝒎t1,isubscript𝒎𝑡𝑖1𝛽subscript𝒈𝑡𝑖𝛽subscript𝒎𝑡1𝑖\bm{m}_{t,i}=(1-\beta)\bm{g}_{t,i}+\beta\bm{m}_{t-1,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT = ( 1 - italic_β ) bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT + italic_β bold_italic_m start_POSTSUBSCRIPT italic_t - 1 , italic_i end_POSTSUBSCRIPT, client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s momentum in t𝑡titalic_t-th iteration 𝒎t,isubscript𝒎𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT is essentially a weighted summation of all previous privatized client gradients:

𝒎t,i=(1β)(𝒈t,i+β𝒈t1,i++βt2𝒈2,i)+βt1𝒈1,isubscript𝒎𝑡𝑖1𝛽subscript𝒈𝑡𝑖𝛽subscript𝒈𝑡1𝑖superscript𝛽𝑡2subscript𝒈2𝑖superscript𝛽𝑡1subscript𝒈1𝑖\displaystyle\bm{m}_{t,i}=(1-\beta)(\bm{g}_{t,i}+\beta\bm{g}_{t-1,i}+\cdots+% \beta^{t-2}\bm{g}_{2,i})+\beta^{t-1}\bm{g}_{1,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT = ( 1 - italic_β ) ( bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT + italic_β bold_italic_g start_POSTSUBSCRIPT italic_t - 1 , italic_i end_POSTSUBSCRIPT + ⋯ + italic_β start_POSTSUPERSCRIPT italic_t - 2 end_POSTSUPERSCRIPT bold_italic_g start_POSTSUBSCRIPT 2 , italic_i end_POSTSUBSCRIPT ) + italic_β start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT bold_italic_g start_POSTSUBSCRIPT 1 , italic_i end_POSTSUBSCRIPT (9)

where 𝒈1,i,𝒈2,i,,𝒈t,isubscript𝒈1𝑖subscript𝒈2𝑖subscript𝒈𝑡𝑖\bm{g}_{1,i},\bm{g}_{2,i},\cdots,\bm{g}_{t,i}bold_italic_g start_POSTSUBSCRIPT 1 , italic_i end_POSTSUBSCRIPT , bold_italic_g start_POSTSUBSCRIPT 2 , italic_i end_POSTSUBSCRIPT , ⋯ , bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT are already privatized via local noise. Assume the server samples a subset of clients for aggregation in each iteration. Assume that client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s momentum 𝒎t,isubscript𝒎𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT is not selected in the t𝑡titalic_t-th iteration, thus the aggregate is independent of 𝒈t,isubscript𝒈𝑡𝑖\bm{g}_{t,i}bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT. However, in a later iteration (i.e., τ>t𝜏𝑡\tau>titalic_τ > italic_t), if client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s momentum 𝒎τ,isubscript𝒎𝜏𝑖\bm{m}_{\tau,i}bold_italic_m start_POSTSUBSCRIPT italic_τ , italic_i end_POSTSUBSCRIPT is involved in the aggregation, it will depend on 𝒈t,isubscript𝒈𝑡𝑖\bm{g}_{t,i}bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT according to (9). Therefore, we need to account for the privacy cost of 𝒈t,isubscript𝒈𝑡𝑖\bm{g}_{t,i}bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT in all iterations. There is no privacy amplification benefit from sampling clients, leading to high privacy costs or low utility.

4 DP-BREM

To address the limitations of DP-LFH, we start from the assumption of a trusted server that can obtain clients’ local models/updates and generate DP noise, and propose an initial solution called DP-BREM (in Section 4.1). It is a differentially-private version of LFH with carefully designed enhancements, achieving a similar level of robustness as the non-private LFH. Since DP-BREM adds DP noise to the momentum (as versus adding noise to the gradient in DP-SGD), our privacy accountant shown in Section 4.2 is different from the traditional privacy accountant of DP-SGD. We also provide the convergence analysis in Section 4.3, where the provable convergence of LFH is maintained with only a small difference. Based on DP-BREM, we then relax the server’s trust assumption in our final solution DP-BREM+ (in Section 5), by adopting secure multiparty computation techniques including secure aggregation with input validation and joint noise generation, which achieves the same DP guarantee with the same amount of noise as in DP-BREM, without trusting the server.

4.1 Algorithm Design

The mathematical notations involved in our algorithm design and theoretical analysis are summarized in Table 5 (see Appendix A). The illustration of our design is shown in Figure 1, and the algorithm is shown in Algorithm 1, where all clients need to implement local updates (in Line-3), but only a subset of their momentum vectors are aggregated by the server (in Line-4). The details of client updates and server aggregation are described below.

Client Updates. The client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT first samples a random batch 𝒟i,tsubscript𝒟𝑖𝑡\mathcal{D}_{i,t}caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT from the local dataset 𝒟isubscript𝒟𝑖\mathcal{D}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with sampling rate pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, clips the per-record gradient 𝜽(𝒙,𝜽t1)subscript𝜽𝒙subscript𝜽𝑡1\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta}_{t-1})∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_ℓ ( bold_italic_x , bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) by R𝑅Ritalic_R and multiplies the sum by a constant factor 1pi|𝒟i|1subscript𝑝𝑖subscript𝒟𝑖\frac{1}{p_{i}|\mathcal{D}_{i}|}divide start_ARG 1 end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG to get the averaged gradient

𝒈¯t,i=1pi|𝒟i|𝒙𝒟i,t𝖢𝗅𝗂𝗉R(𝜽(𝒙,𝜽t1))subscript¯𝒈𝑡𝑖1subscript𝑝𝑖subscript𝒟𝑖subscript𝒙subscript𝒟𝑖𝑡subscript𝖢𝗅𝗂𝗉𝑅subscript𝜽𝒙subscript𝜽𝑡1\bar{\bm{g}}_{t,i}=\frac{1}{p_{i}|\mathcal{D}_{i}|}\sum\nolimits_{\bm{x}\in% \mathcal{D}_{i,t}}\mathsf{Clip}_{R}(\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta% }_{t-1}))over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT bold_italic_x ∈ caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_ℓ ( bold_italic_x , bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) ) (10)

where 𝖢𝗅𝗂𝗉R()subscript𝖢𝗅𝗂𝗉𝑅\mathsf{Clip}_{R}(\cdot)sansserif_Clip start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( ⋅ ) is the clip** function defined in (4), but is used here to bound the sensitivity for DP (refer to DP-SGD discussed in Section 2.1). Note that the batch size |𝒟i,t|subscript𝒟𝑖𝑡|\mathcal{D}_{i,t}|| caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT | is random and 𝔼[|𝒟i,t|]=pi|𝒟i|𝔼delimited-[]subscript𝒟𝑖𝑡subscript𝑝𝑖subscript𝒟𝑖\mathbb{E}[|\mathcal{D}_{i,t}|]=p_{i}|\mathcal{D}_{i}|blackboard_E [ | caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT | ] = italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT |. Then, the local momentum can be computed by

𝒎¯t,i={𝒈¯t,i,if t=1(1β)𝒈¯t,i+β𝒎¯t1,i,if t>1subscript¯𝒎𝑡𝑖casessubscript¯𝒈𝑡𝑖if 𝑡11𝛽subscript¯𝒈𝑡𝑖𝛽subscript¯𝒎𝑡1𝑖if 𝑡1\displaystyle\bar{\bm{m}}_{t,i}=\begin{cases}\bar{\bm{g}}_{t,i},&\text{if }t=1% \\ (1-\beta)\bar{\bm{g}}_{t,i}+\beta\bar{\bm{m}}_{t-1,i},&\text{if }t>1\\ \end{cases}over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT = { start_ROW start_CELL over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT , end_CELL start_CELL if italic_t = 1 end_CELL end_ROW start_ROW start_CELL ( 1 - italic_β ) over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT + italic_β over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 , italic_i end_POSTSUBSCRIPT , end_CELL start_CELL if italic_t > 1 end_CELL end_ROW (11)

where β[0,1)𝛽01\beta\in[0,1)italic_β ∈ [ 0 , 1 ) is the momentum parameter.

Refer to caption
Figure 1: Illustration of our DP-BREM algorithm.
Algorithm 1 DP-BREM
0:  Initialization 𝜽0dsubscript𝜽0superscript𝑑\bm{\theta}_{0}\in\mathbb{R}^{d}bold_italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, clip** bounds R𝑅Ritalic_R and C𝐶Citalic_C , learning rate ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT of the global model.
1:  for t=1,,T𝑡1𝑇t=1,\cdots,Titalic_t = 1 , ⋯ , italic_T do
2:     The server broadcasts the previous model 𝜽t1subscript𝜽𝑡1\bm{\theta}_{t-1}bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT to all clients {𝖢i}i=1nsuperscriptsubscriptsubscript𝖢𝑖𝑖1𝑛\{\mathsf{C}_{i}\}_{i=1}^{n}{ sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and selects a subset of client index t{1,,n}subscript𝑡1𝑛\mathcal{I}_{t}\subseteq\{1,\cdots,n\}caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⊆ { 1 , ⋯ , italic_n }, where each client is selected with probability q𝑞qitalic_q.
3:     Each client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for i{1,,n}𝑖1𝑛i\in\{1,\cdots,n\}italic_i ∈ { 1 , ⋯ , italic_n } implements the local updates via (10) and (11), while only selected clients need to send the local momentum 𝒎t,isubscript𝒎𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT (for it𝑖subscript𝑡i\in\mathcal{I}_{t}italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT) to the server.
4:     The server aggregates received clients’ momentum (only for it𝑖subscript𝑡i\in\mathcal{I}_{t}italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT) with centered clip** and DP noise via (12), then updates the global model 𝜽tsubscript𝜽𝑡\bm{\theta}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT via (13).
5:  end for
5:  The final model parameter 𝜽Tsubscript𝜽𝑇\bm{\theta}_{T}bold_italic_θ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT.

Server Aggregation. The server implements centered clip** with clip** parameter C>0𝐶0C>0italic_C > 0 to bound the difference between client momentum 𝒎¯t,isubscript¯𝒎𝑡𝑖\bar{\bm{m}}_{t,i}over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT and the previous noisy global momentum 𝒎~t1subscript~𝒎𝑡1\tilde{\bm{m}}_{t-1}over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT for robustness. Then, it adds Gaussian noise with standard deviation Rσ𝑅𝜎R\sigmaitalic_R italic_σ (thus the variance is R2σ2superscript𝑅2superscript𝜎2R^{2}\sigma^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT) to the sum of clipped terms to get the noisy global momentum 𝒎~tsubscript~𝒎𝑡\tilde{\bm{m}}_{t}over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT

𝒎~t=𝒎~t1+1|t|[it𝖢𝗅𝗂𝗉C(𝒎¯t,i𝒎~t1)+𝒩(0,R2σ2𝐈d)]subscript~𝒎𝑡subscript~𝒎𝑡11subscript𝑡delimited-[]subscript𝑖subscript𝑡subscript𝖢𝗅𝗂𝗉𝐶subscript¯𝒎𝑡𝑖subscript~𝒎𝑡1𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\displaystyle\tilde{\bm{m}}_{t}=\tilde{\bm{m}}_{t-1}+\frac{1}{|\mathcal{I}_{t}% |}\left[\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i% }-\tilde{\bm{m}}_{t-1})+\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})\right]over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG [ ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) + caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ] (12)

where 𝐈dsubscript𝐈𝑑\mathbf{I}_{d}bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT is an identity matrix with size d×d𝑑𝑑d\times ditalic_d × italic_d, and only the sampled clients in tsubscript𝑡\mathcal{I}_{t}caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (which is obtained in Line-2 of Algorithm 1 with sampling rate q𝑞qitalic_q) are aggregated in t𝑡titalic_t-th iteration. Note that adding noise 𝒩(0,R2σ2𝐈d)𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) to the summation of clipped client momentum it𝖢𝗅𝗂𝗉C(𝒎¯t,i𝒎~t1)subscript𝑖subscript𝑡subscript𝖢𝗅𝗂𝗉𝐶subscript¯𝒎𝑡𝑖subscript~𝒎𝑡1\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde% {\bm{m}}_{t-1})∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) is equivalent to adding noise 1|t|𝒩(0,R2σ2𝐈d)1subscript𝑡𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\frac{1}{|\mathcal{I}_{t}|}\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})divide start_ARG 1 end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) to the average result 1|t|it𝖢𝗅𝗂𝗉C(𝒎¯t,i𝒎~t1)1subscript𝑡subscript𝑖subscript𝑡subscript𝖢𝗅𝗂𝗉𝐶subscript¯𝒎𝑡𝑖subscript~𝒎𝑡1\frac{1}{|\mathcal{I}_{t}|}\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C% }(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})divide start_ARG 1 end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ). Then, the server updates the global model 𝜽tsubscript𝜽𝑡\bm{\theta}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT with learning rate ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT

𝜽t=𝜽t1ηt𝒎~tsubscript𝜽𝑡subscript𝜽𝑡1subscript𝜂𝑡subscript~𝒎𝑡\displaystyle\bm{\theta}_{t}=\bm{\theta}_{t-1}-\eta_{t}\tilde{\bm{m}}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT - italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (13)

Remark: clip** bounds and sampling rates. In our algorithm, we use two clip** bounds and two sampling rates. For clip** bounds, each client uses record-level bound R𝑅Ritalic_R to bound the per-record gradient in (10) for a finite sensitivity in record-level DP; while the server uses client-level bound C𝐶Citalic_C to bound the difference between client momentum 𝒎¯t,isubscript¯𝒎𝑡𝑖\bar{\bm{m}}_{t,i}over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT and the previous noisy global momentum 𝒎~t1subscript~𝒎𝑡1\tilde{\bm{m}}_{t-1}over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT in (12), which achieves Byzantine robustness as in LFH. For sampling rates, the client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT samples a batch of records 𝒟i,tsubscript𝒟𝑖𝑡\mathcal{D}_{i,t}caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT from the local dataset 𝒟isubscript𝒟𝑖\mathcal{D}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with sampling rate pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, which provides privacy amplification for DP from record-level sampling; while the server samples a subset of clients with sampling rate q𝑞qitalic_q (where the sampled clients set is denoted by tsubscript𝑡\mathcal{I}_{t}caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT), which provides privacy amplification for DP from client-level sampling.

Remark: comparison with non-private LFH. Comparing with the original non-private Byzantine-robust method LFH [22] (see Section 2.3), our differentially-private version has three differences. First, comparing with (1), the client gradient in (10) is computed by averaging the clipped per-record gradient (with clip** bound R𝑅Ritalic_R), which bounds the sensitivity of final aggregation when adding/removing one record from the local dataset. Second, comparing with (3), the server adds Gaussian noise when computing the aggregated global momentum 𝒎~tsubscript~𝒎𝑡\tilde{\bm{m}}_{t}over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in (12) to guarantee DP. Third, instead of aggregating all clients’ momentum, our method also considers aggregating a subset of them, reflected by the index set tsubscript𝑡\mathcal{I}_{t}caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in (12). It provides additional privacy amplification from client-level sampling with sampling rate q𝑞qitalic_q. Note that the original privacy amplification is provided by record-level sampling.

4.2 Privacy Analysis

Before presenting the final privacy analysis of DP-BREM, we first show how we compute the sensitivity for the summation of clipped client momentum in (12) for privacy analysis of one iteration, shown in Lemma 2. We note that clients may have different sizes of local datasets 𝒟isubscript𝒟𝑖\mathcal{D}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and can use different record-level sampling rates pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, thus the record-level sensitivity (denoted by Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) for different clients can be different.

Lemma 2 (DP Sensitivity).

We use \|\cdot\|∥ ⋅ ∥ to denote L2-norm 2\|\cdot\|_{2}∥ ⋅ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. In the t𝑡titalic_t-th round, denote the query function Qt(𝒟)jt𝖢𝗅𝗂𝗉C(𝐦t,j𝐦~t1)subscript𝑄𝑡𝒟subscript𝑗subscript𝑡subscript𝖢𝗅𝗂𝗉𝐶subscript𝐦𝑡𝑗subscript~𝐦𝑡1Q_{t}(\mathcal{D})\coloneqq\sum\nolimits_{j\in\mathcal{I}_{t}}\mathsf{Clip}_{C% }(\bm{m}_{t,j}-\tilde{\bm{m}}_{t-1})italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( caligraphic_D ) ≔ ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_m start_POSTSUBSCRIPT italic_t , italic_j end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ), where 𝐦~t1subscript~𝐦𝑡1\tilde{\bm{m}}_{t-1}over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT is public and 𝒟={𝒟j}jt𝒟subscriptsubscript𝒟𝑗𝑗subscript𝑡\mathcal{D}=\{\mathcal{D}_{j}\}_{j\in\mathcal{I}_{t}}caligraphic_D = { caligraphic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT. Consider the neighboring dataset 𝒟={𝒟j}ji,jt𝒟isuperscript𝒟subscriptsubscript𝒟𝑗formulae-sequence𝑗𝑖𝑗subscript𝑡superscriptsubscript𝒟𝑖\mathcal{D}^{\prime}=\{\mathcal{D}_{j}\}_{j\neq i,j\in\mathcal{I}_{t}}\cup% \mathcal{D}_{i}^{\prime}caligraphic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = { caligraphic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j ≠ italic_i , italic_j ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∪ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT that differs in one record from client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s local data (it𝑖subscript𝑡i\in\mathcal{I}_{t}italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT), i.e., |𝒟i𝒟i|=1subscript𝒟𝑖superscriptsubscript𝒟𝑖1|\mathcal{D}_{i}-\mathcal{D}_{i}^{\prime}|=1| caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | = 1, then the sensitivity with respect to client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is

Simax𝒟,DQt(𝒟)Qt(𝒟)=min{2C,Rpi|𝒟i|}subscript𝑆𝑖subscript𝒟superscript𝐷normsubscript𝑄𝑡𝒟subscript𝑄𝑡superscript𝒟2𝐶𝑅subscript𝑝𝑖subscript𝒟𝑖\displaystyle S_{i}\coloneqq\max_{\mathcal{D},D^{\prime}}\|Q_{t}(\mathcal{D})-% Q_{t}(\mathcal{D^{\prime}})\|=\min\left\{2C,\frac{R}{p_{i}|\mathcal{D}_{i}|}\right\}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≔ roman_max start_POSTSUBSCRIPT caligraphic_D , italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ∥ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( caligraphic_D ) - italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( caligraphic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∥ = roman_min { 2 italic_C , divide start_ARG italic_R end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG } (14)
Proof.

(Sketch) According to (10), the sensitivity of 𝒈¯t,isubscript¯𝒈𝑡𝑖\bar{\bm{g}}_{t,i}over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT is Rpi|𝒟i|𝑅subscript𝑝𝑖subscript𝒟𝑖\frac{R}{p_{i}|\mathcal{D}_{i}|}divide start_ARG italic_R end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG because each clipped term 𝖢𝗅𝗂𝗉R()subscript𝖢𝗅𝗂𝗉𝑅\mathsf{Clip}_{R}(\cdot)sansserif_Clip start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( ⋅ ) has bounded L2-norm, i.e., 𝖢𝗅𝗂𝗉R()Rnormsubscript𝖢𝗅𝗂𝗉𝑅𝑅\|\mathsf{Clip}_{R}(\cdot)\|\leqslant R∥ sansserif_Clip start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( ⋅ ) ∥ ⩽ italic_R. Then, due to the recursive representation of local momentum in (11), the sensitivity of 𝒎t,isubscript𝒎𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT is Rpi|𝒟i|𝑅subscript𝑝𝑖subscript𝒟𝑖\frac{R}{p_{i}|\mathcal{D}_{i}|}divide start_ARG italic_R end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG. Finally, the client-level clip** 𝖢𝗅𝗂𝗉C()subscript𝖢𝗅𝗂𝗉𝐶\mathsf{Clip}_{C}(\cdot)sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( ⋅ ) introduces another upper bound for the sensitivity. Refer to Appendix A for the full-version proof. ∎

Remark: comparison with the privacy accountant of DP-SGD momentum. As discussed in Section 3.2, the privacy accountant of DP-SGD with momentum (i.e., account for privacy cost of gradient, then do post-processing for momentum) requires clients to add noise in the local gradients, which leads to poor utility especially when Byzantine attacks exist. In Lemma 2, we account for the privacy cost of aggregated momentum, where the sensitivity is carefully computed from the bounded record-level gradient. Therefore, our scheme solves the three limitations shown in Section 3.2, which is explained as follows. First, only the server adds noise (which is the same as the central setting), thus the privacy-utility tradeoff is not impacted. Second, the noise is added after the centered clip** 𝖢𝗅𝗂𝗉C(𝒎¯t,i𝒎~t1)subscript𝖢𝗅𝗂𝗉𝐶subscript¯𝒎𝑡𝑖subscript~𝒎𝑡1\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ), thus it only introduces unbiased error. We also show that (in Section 4.3) the impact from the added noise is separate from the impact from Byzantine attacks, as versus the impact from the local noise is enlarged with Byzantine attacks in DP-LFH (see Section 3.2). Third, since privacy is accounted on momentum, and only the aggregated momentum leaks privacy, our solution enjoys privacy amplification from client-level sampling.

The final privacy analysis of DP-BREM is shown in Theorem 1. It presents how to compute the privacy budget ϵitalic-ϵ\epsilonitalic_ϵ and privacy parameter δ𝛿\deltaitalic_δ when the parameters (such as T𝑇Titalic_T, σ𝜎\sigmaitalic_σ, q𝑞qitalic_q, etc.) of the algorithm are given. We use an advanced privacy accountant tool called Gaussian DP (GDP) [11] (refer to Appendix F), then convert it to (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP. Note that in our privacy analysis, clients can use different record-level sampling rates pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, thus different sensitivity Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT shown in (14). Therefore, the final privacy budget (denoted by ϵisubscriptitalic-ϵ𝑖\epsilon_{i}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) of DP-BREM may be different for different clients, which provides personalized privacy if these parameters are different for each client.

Theorem 1 (Privacy Analysis).

DP-BREM (in Algorithm 1) satisfies record-level (ϵi,δ)subscriptitalic-ϵ𝑖𝛿(\epsilon_{i},\delta)( italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_δ )-DP for an honest client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with ϵisubscriptitalic-ϵ𝑖\epsilon_{i}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and δ𝛿\deltaitalic_δ satisfying

δ=Φ(ϵiμi+μi2)eϵiΦ(ϵiμiμi2),𝛿Φsubscriptitalic-ϵ𝑖subscript𝜇𝑖subscript𝜇𝑖2superscript𝑒subscriptitalic-ϵ𝑖Φsubscriptitalic-ϵ𝑖subscript𝜇𝑖subscript𝜇𝑖2\displaystyle\delta=\Phi\left(-\frac{\epsilon_{i}}{\mu_{i}}+\frac{\mu_{i}}{2}% \right)-e^{\epsilon_{i}}\cdot\Phi\left(-\frac{\epsilon_{i}}{\mu_{i}}-\frac{\mu% _{i}}{2}\right),italic_δ = roman_Φ ( - divide start_ARG italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG + divide start_ARG italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG ) - italic_e start_POSTSUPERSCRIPT italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⋅ roman_Φ ( - divide start_ARG italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG - divide start_ARG italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG ) , (15)

where Φ()Φ\Phi(\cdot)roman_Φ ( ⋅ ) denotes the cumulative distribution function (CDF) of standard normal distribution, and μisubscript𝜇𝑖\mu_{i}italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is defined by

μi=qpiT(e1/(2σi2)1),with σi=σmax{R2C,pi|𝒟i|}formulae-sequencesubscript𝜇𝑖𝑞subscript𝑝𝑖𝑇superscript𝑒12superscriptsubscript𝜎𝑖21with subscript𝜎𝑖𝜎𝑅2𝐶subscript𝑝𝑖subscript𝒟𝑖\displaystyle\mu_{i}=qp_{i}\sqrt{T(e^{1/(2\sigma_{i}^{2})}-1)},\quad\text{with% }\sigma_{i}=\sigma\cdot\max\left\{\frac{R}{2C},p_{i}|\mathcal{D}_{i}|\right\}italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_q italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT square-root start_ARG italic_T ( italic_e start_POSTSUPERSCRIPT 1 / ( 2 italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT - 1 ) end_ARG , with italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_σ ⋅ roman_max { divide start_ARG italic_R end_ARG start_ARG 2 italic_C end_ARG , italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | } (16)
Proof.

This result is obtained by the composition of multiple iterations and the privacy amplification from sampling. See Appendix B for the detailed proof. ∎

4.3 Convergence Analysis

Before presenting the final convergence analysis of our solution, we first show the aggregation error for one iteration in Theorem 2.

Theorem 2 (Aggregation Error).

Denote 𝐦t1||i𝐦t,isuperscriptsubscript𝐦𝑡1subscript𝑖subscript𝐦𝑡𝑖\bm{m}_{t}^{*}\coloneqq\frac{1}{|\mathcal{H}|}\sum\nolimits_{i\in\mathcal{H}}% \bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ≔ divide start_ARG 1 end_ARG start_ARG | caligraphic_H | end_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H end_POSTSUBSCRIPT bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT as the ground truth aggregated raw momentum, where 𝐦t,isubscript𝐦𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT is the client momentum computed from gradient without record-level clip**. Assume the local momentum of all honest clients {𝐦t,i}isubscriptsubscript𝐦𝑡𝑖𝑖\{\bm{m}_{t,i}\}_{i\in\mathcal{H}}{ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_H end_POSTSUBSCRIPT are i.i.d. with expectation 𝛍𝔼[𝐦t,i]𝛍𝔼delimited-[]subscript𝐦𝑡𝑖\bm{\mu}\coloneqq\mathbb{E}[\bm{m}_{t,i}]bold_italic_μ ≔ blackboard_E [ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ], and the variance is bounded (in terms of L2-norm)

𝔼𝒎t,i𝝁2ρ2𝔼superscriptnormsubscript𝒎𝑡𝑖𝝁2superscript𝜌2\displaystyle\mathbb{E}\|\bm{m}_{t,i}-\bm{\mu}\|^{2}\leqslant\rho^{2}blackboard_E ∥ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (17)

After some parameter tuning (the detailed tuning is shown under (C) in Appendix C) of the clip** bounds:

RO(ρn/(||+dσ/q)),CO(R)formulae-sequenceproportional-to𝑅𝑂𝜌𝑛𝑑𝜎𝑞proportional-to𝐶𝑂𝑅\displaystyle R\propto O\left(\rho\sqrt{n/(|\mathcal{B}|+\sqrt{d}\sigma/q)}% \right),\quad C\propto O(R)italic_R ∝ italic_O ( italic_ρ square-root start_ARG italic_n / ( | caligraphic_B | + square-root start_ARG italic_d end_ARG italic_σ / italic_q ) end_ARG ) , italic_C ∝ italic_O ( italic_R ) (18)

we have the following aggregation error due to clip**, DP noise, and Byzantine clients:

𝔼𝒎~t𝒎t2O(ρ2(||+dσ/q)n)𝔼superscriptnormsubscript~𝒎𝑡superscriptsubscript𝒎𝑡2𝑂superscript𝜌2𝑑𝜎𝑞𝑛\displaystyle\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}\leqslant O% \left(\frac{\rho^{2}(|\mathcal{B}|+\sqrt{d}\sigma/q)}{n}\right)blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_O ( divide start_ARG italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( | caligraphic_B | + square-root start_ARG italic_d end_ARG italic_σ / italic_q ) end_ARG start_ARG italic_n end_ARG ) (19)

where |||\mathcal{B}|| caligraphic_B | is the number of Byzantine clients, d𝑑ditalic_d is the dimension of model parameter 𝛉tsubscript𝛉𝑡\bm{\theta}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, σ𝜎\sigmaitalic_σ is the noise multiplier (for DP) shown in (12), q𝑞qitalic_q is the client-level sampling rate shown in Line-2 of Algorithm 1, and ρ𝜌\rhoitalic_ρ is defined in (17). The formal version of (19) is shown in (C) of Appendix C.

Proof.

(Sketch) Directly bounding 𝔼𝒎~t𝒎t2𝔼superscriptnormsubscript~𝒎𝑡superscriptsubscript𝒎𝑡2\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT is not easy, thus we utilize the upper bounds of 𝔼𝒎~t𝝁2𝔼superscriptnormsubscript~𝒎𝑡𝝁2\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and 𝔼𝝁𝒎t2𝔼superscriptnorm𝝁superscriptsubscript𝒎𝑡2\mathbb{E}\|\bm{\mu}-\bm{m}_{t}^{*}\|^{2}blackboard_E ∥ bold_italic_μ - bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT to get the final result, where 𝝁𝔼[𝒎t,i]𝝁𝔼delimited-[]subscript𝒎𝑡𝑖\bm{\mu}\coloneqq\mathbb{E}[\bm{m}_{t,i}]bold_italic_μ ≔ blackboard_E [ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ] is the expected local momentum (we assume clients’ local momentum are i.i.d.). When upper bounding 𝔼𝒎~t𝝁2𝔼superscriptnormsubscript~𝒎𝑡𝝁2\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, we can decompose it to three types of errors: error of honest clients (due to randomness and bias introduced by clip**), error of Byzantine clients (due to Byzantine perturbation), and error introduced by the added DP noise. Furthermore, we can have the optimized parameter tuning of C𝐶Citalic_C and R𝑅Ritalic_R to minimize the summation of the above three types of errors. Refer to Appendix C for the full-version proof. ∎

Interpretation of Theorem 2. The value of 𝔼𝒎~t𝒎t2𝔼superscriptnormsubscript~𝒎𝑡superscriptsubscript𝒎𝑡2\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT quantifies the aggregation error, i.e., how the aggregated privatized momentum 𝒎~tsubscript~𝒎𝑡\tilde{\bm{m}}_{t}over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (with clip**, DP noise, and Byzantine clients’ impact) differs from the "pure" momentum aggregation 𝒎tsuperscriptsubscript𝒎𝑡\bm{m}_{t}^{*}bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, where only honest clients participate and without clip** and DP noise. According to (19), the aggregation error is proportional to ρ2superscript𝜌2\rho^{2}italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and ||n+dσnq𝑛𝑑𝜎𝑛𝑞\frac{|\mathcal{B}|}{n}+\frac{\sqrt{d}\sigma}{nq}divide start_ARG | caligraphic_B | end_ARG start_ARG italic_n end_ARG + divide start_ARG square-root start_ARG italic_d end_ARG italic_σ end_ARG start_ARG italic_n italic_q end_ARG, where ρ2superscript𝜌2\rho^{2}italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT quantifies the variance of honest clients’ local momentum, ||n𝑛\frac{|\mathcal{B}|}{n}divide start_ARG | caligraphic_B | end_ARG start_ARG italic_n end_ARG is the fraction of Byzantine clients, and σnq=O(1/ϵ)𝜎𝑛𝑞𝑂1italic-ϵ\frac{\sigma}{nq}=O(1/\epsilon)divide start_ARG italic_σ end_ARG start_ARG italic_n italic_q end_ARG = italic_O ( 1 / italic_ϵ ) for ϵitalic-ϵ\epsilonitalic_ϵ-DP. In other words, the aggregation error will be enlarged when: honest clients’ variance is large, or the Byzantine attacker corrupts more clients, or the training model is complex (i.e., the model dimension d𝑑ditalic_d is large), or we need stronger privacy (i.e., a smaller ϵitalic-ϵ\epsilonitalic_ϵ), or the number of clients n is small. Furthermore, due to the format of ||n+dσnq𝑛𝑑𝜎𝑛𝑞\frac{|\mathcal{B}|}{n}+\frac{\sqrt{d}\sigma}{nq}divide start_ARG | caligraphic_B | end_ARG start_ARG italic_n end_ARG + divide start_ARG square-root start_ARG italic_d end_ARG italic_σ end_ARG start_ARG italic_n italic_q end_ARG, the impact from DP noise is independent of the increase of Byzantine clients |||\mathcal{B}|| caligraphic_B | (versus Limitation 2 of DP-LFH in Section 3.2). On the other hand, according to the parameter tuning in (18), we could theoretically set a smaller record-level clip** bound R𝑅Ritalic_R when σ𝜎\sigmaitalic_σ, d𝑑ditalic_d, and |||\mathcal{B}|| caligraphic_B | are large, or ρ𝜌\rhoitalic_ρ and n𝑛nitalic_n are small. The tuning of client-level clip** bound C𝐶Citalic_C should be adjusted according to the value of R𝑅Ritalic_R. Recall that R𝑅Ritalic_R is for DP, while C𝐶Citalic_C is for robustness.

Table 2: Comparison of Convergence Rate
Where to add noise Convergence Rate
LFH [22] None O(ρ1+||)𝑂𝜌1O(\rho\sqrt{1+|\mathcal{B}|})italic_O ( italic_ρ square-root start_ARG 1 + | caligraphic_B | end_ARG )
DP-LFH Clients’ gradients O((ρ+dσ)1+||)𝑂𝜌𝑑𝜎1O\left((\rho+\sqrt{d}\sigma)\sqrt{1+|\mathcal{B}|}\right)italic_O ( ( italic_ρ + square-root start_ARG italic_d end_ARG italic_σ ) square-root start_ARG 1 + | caligraphic_B | end_ARG )
DP-BREM Aggregated momentum O(ρ1+||+dσ)𝑂𝜌1𝑑𝜎O\left(\rho\sqrt{1+|\mathcal{B}|+\sqrt{d}\sigma}\right)italic_O ( italic_ρ square-root start_ARG 1 + | caligraphic_B | + square-root start_ARG italic_d end_ARG italic_σ end_ARG )

By following the convergence analysis in [22] and using the result in (19), we have the convergence rate shown below.

Theorem 3 (Convergence Rate of DP-BREM).

The convergence rate of DP-BREM in Algorithm 1 is asymptotically (ignoring constants and higher order terms) of the order

1Tt=1T𝔼(𝜽t1)2ρ2T||+(1+dσ)/qnless-than-or-similar-to1𝑇superscriptsubscript𝑡1𝑇𝔼superscriptnormsubscript𝜽𝑡12superscript𝜌2𝑇1𝑑𝜎𝑞𝑛\displaystyle\frac{1}{T}\sum\nolimits_{t=1}^{T}\mathbb{E}\|\nabla\ell(\bm{% \theta}_{t-1})\|^{2}\lesssim\sqrt{\frac{\rho^{2}}{T}\frac{|\mathcal{B}|+(1+% \sqrt{d}\sigma)/q}{n}}divide start_ARG 1 end_ARG start_ARG italic_T end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT blackboard_E ∥ ∇ roman_ℓ ( bold_italic_θ start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≲ square-root start_ARG divide start_ARG italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_T end_ARG divide start_ARG | caligraphic_B | + ( 1 + square-root start_ARG italic_d end_ARG italic_σ ) / italic_q end_ARG start_ARG italic_n end_ARG end_ARG (20)

where ()\ell(\cdot)roman_ℓ ( ⋅ ) is the loss function, T𝑇Titalic_T is the total number of training iterations, and other parameters are the same as in (19).

Proof.

See Appendix D. ∎

Remark: comparison with LFH and DP-LFH. The convergence rate of the non-private LFH, DP-LFH, and the proposed solution DP-BREM, showing in (5), (8), and (20) respectively, are summarized in Table 2. Though both DP-LFH and DP-BREM pay an additional term of dσ/q𝑑𝜎𝑞\sqrt{d}\sigma/qsquare-root start_ARG italic_d end_ARG italic_σ / italic_q to get the DP property, they have different impacts on the convergence. As discussed in Limitation 2 of Section 3.2, the additional term dσ/q𝑑𝜎𝑞\sqrt{d}\sigma/qsquare-root start_ARG italic_d end_ARG italic_σ / italic_q of DP-LFH (due to DP noise added to clients’ gradient) is on the term ρ𝜌\rhoitalic_ρ, thus it will enlarge the impact of Byzantine clients (i.e., the term |||\mathcal{B}|| caligraphic_B |). However, the additional term dσ/q𝑑𝜎𝑞\sqrt{d}\sigma/qsquare-root start_ARG italic_d end_ARG italic_σ / italic_q of our solution DP-BREM (due to DP noise added to the aggregated momentum) is on the term 1+||11+|\mathcal{B}|1 + | caligraphic_B |, which has a squared-root order. Therefore, DP noise only has a limited impact on the convergence of DP-BREM when there are Byzantine clients. We will validate the above theoretical analysis via experimental results in Section 6.

5 DP-BREM+ with Secure Aggregation

The private and robust FL solution DP-BREM (in Section 4) assumes a trusted server which can access clients’ momentum. In this section, we propose DP-BREM+, which assumes a malicious server and utilizes secure aggregation techniques, achieving the same DP and robustness guarantees as DP-BREM. As discussed in Section 3.1, we consider the server as malicious only for data privacy, while clients are malicious for both data privacy and Byzantine attacks.

5.1 Challenges

Considering the server is malicious for data privacy, the noisy aggregate of momentum with centered clip** shown in (12) must be implemented securely with the goals of 1) privacy, i.e., each party, including clients and the server, learns nothing but the differentially-private output; and 2) integrity, i.e., the output is correctly computed. Since the noisy aggregated momentum of the previous iteration 𝒎~t1subscript~𝒎𝑡1\tilde{\bm{m}}_{t-1}over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT already satisfies DP, we can regard it as public information and only need to focus on securely computing the term it𝖢𝗅𝗂𝗉C(𝒎¯t,i𝒎~t1)+𝒩(0,R2σ2𝐈d)subscript𝑖subscript𝑡subscript𝖢𝗅𝗂𝗉𝐶subscript¯𝒎𝑡𝑖subscript~𝒎𝑡1𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde% {\bm{m}}_{t-1})+\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) + caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) in (12).

Secure Aggregation with Verified Inputs (SAVI). The key crypto technique we leverage to achieve the above objectives is SAVI [35], which is a type of protocols that securely aggregate only well-formed inputs. The security goals include both privacy and integrity. Specifically, privacy means that no party should be able to learn anything about the raw input of an honest client, other than what can be learned from the final aggregation result. Integrity means that the output of the protocol returns the correct aggregate of well-formed input, where 1) an input u𝑢uitalic_u passes the input integrity check with a public validation predicate 𝖵𝖺𝗅𝗂𝖽()𝖵𝖺𝗅𝗂𝖽\mathsf{Valid}(\cdot)sansserif_Valid ( ⋅ ) if and only if 𝖵𝖺𝗅𝗂𝖽(u)=1𝖵𝖺𝗅𝗂𝖽𝑢1\mathsf{Valid}(u)=1sansserif_Valid ( italic_u ) = 1, and 2) the aggregation is correctly computed. An instantiation of the SAVI protocol is EIFFeL [35] (described in Appendix G).

Challenge: Secure Generation of Gaussian Noise. A SAVI protocol can potentially solve the problem of securely aggregating the clipped vectors (by enforcing a norm-bound on the client momentum difference). However, the Gaussian noise 𝒩(0,R2σ2𝐈d)𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) needs to be securely generated and aggregated as well. In DP-BREM with a trusted server, the Gaussian noise 𝒩(0,R2σ2𝐈d)𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) is generated by the server to guarantee DP. However, when the server is assumed as malicious, the added Gaussian noise for DP cannot be directly generated by the server.

A straightforward solution is to follow [36] that assumes the existence of another semi-honest server (but does not collude with the original server) that will generate DP noise and execute the privacy engine. However, the assumption of another non-colluding server may not be practical and we assume only a single server.

Another alternative solution is to leverage Distributed DP (DDP) [39], where Gaussian noise is generated by clients in a distributed way: each client generates a Gaussian noise locally, and the aggregation of Gaussian noise also follows a Gaussian distribution with an enlarged standard deviation. Since only the aggregated result is released (with the help of crypto techniques), each client can add a smaller noise with the guarantee that the aggregated noise satisfies the required DP. However, this solution has two limitations in our scenario. First, distributed noise generation needs to add more noise to achieve the same privacy compared with server-side noise generation due to the collusion of malicious clients. Second, malicious clients can generate arbitrary values as the local Gaussian noise, which has a large impact on the robustness.

A possible solution to address the first limitation is to jointly generate Gaussian noise as in [34], where no party learns or controls the true value of the noise (or a portion of the noise). However, the protocol in [34] is designed only for additive secret sharing schemes, which only works for honest-but-curious parties and does not tolerate malicious parties. Moreover, in [34], the Gaussian noise is jointly generated by honest-but-curious and non-colluding parties, which does not address the second limitation as the clients can be malicious in our threat model discussed in Section 3.1.

Overview of DP-BREM+. To achieve secure aggregation with verified inputs and secure Gaussian noise generation under the threat model of a malicious server and malicious minority of clients, our DP-BREM+ 1) leverages an existing SAVI protocol called EIFFeL [35] to achieve secure input validation; and 2) introduces a new protocol to achieve secure noise generation that is compatible with EIFFeL. The idea of jointly generating Gaussian noise in DP-BREM+ is inspired by [34], but our design is based on Shamir’s secret sharing [37] with robust reconstruction by following the design in EIFFeL, thus guarantees security under malicious minority. We present the preliminaries of Shamir’s secret sharing and EIFFeL protocol in Appendix G.

5.2 Design of DP-BREM+

As discussed in Section 5.1, the main task of DP-BREM+ is to securely compute the term it𝖢𝗅𝗂𝗉C(𝒎¯t,i𝒎~t1)+𝒩(0,R2σ2𝐈d)subscript𝑖subscript𝑡subscript𝖢𝗅𝗂𝗉𝐶subscript¯𝒎𝑡𝑖subscript~𝒎𝑡1𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde% {\bm{m}}_{t-1})+\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) + caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) shown in (12). After computing local momentum 𝒎¯t,isubscript¯𝒎𝑡𝑖\bar{\bm{m}}_{t,i}over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT via (11), each client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT first implements centered clip** to get 𝒛i𝖢𝗅𝗂𝗉C(𝒎¯t,i𝒎~t1)subscript𝒛𝑖subscript𝖢𝗅𝗂𝗉𝐶subscript¯𝒎𝑡𝑖subscript~𝒎𝑡1\bm{z}_{i}\coloneqq\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≔ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ), which is the private input for validation and aggregation.

Three-Phase Design. In DP-BREM+, clients and the server jointly implement three phases: 1) secure input validation to validate the client momentum is properly centered clipped by C𝐶Citalic_C, 2) secure noise generation, where clients generate shares of Gaussian noise which can be aggregated in Phase 3 to ensure DP, and 3) aggregation of valid inputs and noise to obtain the noisy global model. We assume the arithmetic circuit is computed over a finite field 𝔽2Ksubscript𝔽superscript2𝐾\mathbb{F}_{2^{K}}blackboard_F start_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_POSTSUBSCRIPT. The illustration of DP-BREM+ is shown in Figure 2. Due to limited space, we present the detailed steps \small1⃝-\small7⃝ in Appendix H.

Phase 1: Secure Input Validation. The validation function for an input 𝒛isubscript𝒛𝑖\bm{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT considered in DP-BREM+ is defined as 𝖵𝖺𝗅𝗂𝖽(𝒛i)𝟙(𝒛iC)𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖1normsubscript𝒛𝑖𝐶\mathsf{Valid}(\bm{z}_{i})\coloneqq\mathbbm{1}(\|\bm{z}_{i}\|\leqslant C)sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≔ blackboard_1 ( ∥ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ ⩽ italic_C ), where 𝖵𝖺𝗅𝗂𝖽(𝒛i)=1𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖1\mathsf{Valid}(\bm{z}_{i})=1sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = 1 if and only if the condition 𝒛iCnormsubscript𝒛𝑖𝐶\|\bm{z}_{i}\|\leqslant C∥ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ ⩽ italic_C holds. Since honest clients compute 𝒛i=𝖢𝗅𝗂𝗉C(𝒎¯t,i𝒎~t1)subscript𝒛𝑖subscript𝖢𝗅𝗂𝗉𝐶subscript¯𝒎𝑡𝑖subscript~𝒎𝑡1\bm{z}_{i}=\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ), verifying whether 𝒛isubscript𝒛𝑖\bm{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is well-formed, with bounded L2-norm via 𝖵𝖺𝗅𝗂𝖽()𝖵𝖺𝗅𝗂𝖽\mathsf{Valid}(\cdot)sansserif_Valid ( ⋅ ), for all clients ensures centered clip** of client momentum 𝒎¯t,isubscript¯𝒎𝑡𝑖\bar{\bm{m}}_{t,i}over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT (to achieve robustness as DP-BREM). We follow the design in EIFFeL [35] for secure input validation, which returns the validation result 𝖵𝖺𝗅𝗂𝖽(𝒛i)𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖\mathsf{Valid}(\bm{z}_{i})sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (either 1 or 0) for client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s private input 𝒛isubscript𝒛𝑖\bm{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, corresponding to steps \small1⃝, \small2⃝, and \small3⃝ shown in Figure 2. Then, clients and the server can jointly verify all inputs {𝒛i}itsubscriptsubscript𝒛𝑖𝑖subscript𝑡\{\bm{z}_{i}\}_{i\in\mathcal{I}_{t}}{ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT, and obtain the set of valid inputs 𝖵𝖺𝗅𝗂𝖽subscript𝖵𝖺𝗅𝗂𝖽\mathcal{I}_{\mathsf{Valid}}caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT, where 𝖵𝖺𝗅𝗂𝖽(𝒛i)=1𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖1\mathsf{Valid}(\bm{z}_{i})=1sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = 1 for all i𝖵𝖺𝗅𝗂𝖽𝑖subscript𝖵𝖺𝗅𝗂𝖽i\in\mathcal{I}_{\mathsf{Valid}}italic_i ∈ caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT. In the later step, only inputs in 𝖵𝖺𝗅𝗂𝖽subscript𝖵𝖺𝗅𝗂𝖽\mathcal{I}_{\mathsf{Valid}}caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT are aggregated.

Phase 2: Secure Noise Generation. We develop a new protocol for secure distributed Gaussian noise generation, which returns the shares (held by each client) of a random vector 𝝃𝝃\bm{\xi}bold_italic_ξ of length d𝑑ditalic_d from the Gaussian distribution 𝒩(0,R2σ2𝐈d)𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ), corresponding to steps \small4⃝ and \small5⃝ shown in Figure 2. The shares of noise can be reconstructed into a single Gaussian noise (for ensuring DP) with the guarantee that no parties know or control the generated noise, which protects the information of private inputs after the noisy aggregate is released.

Phase 3: Aggregation of Valid Inputs and Noise. Finally, the server and clients can aggregate the valid inputs (obtained in Phase 1) and the generated Gaussian noise (obtained in Phase 2) by implementing steps \small6⃝ and \small7⃝ shown in Figure 2, ensuring nothing except the noisy aggregate can be learned.

Remark on Efficiency. DP-BREM+’s usage of EIFFeL’s secure input validation is due to efficiency considerations. Instead of having clients perform clip** and using secure input validation, one alternative is to use standard secure multi-party computation (MPC) for the clip** and aggregation. However, doing this under MPC would result in a very large computation/communication overhead due to the multiplication, min-operation, division, and L2-norm computation in the clip** operation 𝖢𝗅𝗂𝗉C()subscript𝖢𝗅𝗂𝗉𝐶\mathsf{Clip}_{C}(\cdot)sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( ⋅ ) defined in (4). In contrast, the secure input validation protocol only requires the verifiers to check all the multiplication gates very efficiently with just one identity test. The compatibility with secure input validation is one of the advantages of DP-BREM.

Complexity. According to EIFFeL [35], the computation/communication complexity of secure aggregation with input validation is O(mnd)𝑂𝑚𝑛𝑑O(mnd)italic_O ( italic_m italic_n italic_d ) for clients and O(n2+mdmin{n,m2})𝑂superscript𝑛2𝑚𝑑𝑛superscript𝑚2O(n^{2}+md\min\{n,m^{2}\})italic_O ( italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m italic_d roman_min { italic_n , italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT } ) for the server in terms of the number of clients n𝑛nitalic_n, number of malicious clients m𝑚mitalic_m, and data dimension d𝑑ditalic_d. For the proposed secure noise generation (only clients are involved), the computation/communication complexity for total n𝑛nitalic_n clients is O(mnd)𝑂𝑚𝑛𝑑O(mnd)italic_O ( italic_m italic_n italic_d ).

Refer to caption
Figure 2: Illustration of DP-BREM+ (see Appendix H for detailed steps \small1⃝-\small7⃝)
Refer to caption
Figure 3: MNIST - Varying the percentage of Byzantine clients δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT (with ϵ=3italic-ϵ3\epsilon=3italic_ϵ = 3).
Refer to caption
Figure 4: CIFAR-10 - Varying the percentage of Byzantine clients δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT (with ϵ=4italic-ϵ4\epsilon=4italic_ϵ = 4).

5.3 Security Analysis

In comparison, EIFFeL [35] is a secure aggregation protocol with verified inputs (without guaranteeing DP), while our solution DP-BREM+ is a secure noisy aggregation protocol with verified inputs and jointly generated Gaussian noise, which provides DP on the aggregated results. Therefore, the only difference is the Gaussian noise that will be aggregated to the final result. We show the formal security guarantee of DP-BREM+ in the following theorem.

Theorem 4 (Security Guarantees of DP-BREM+).

For the validation function 𝖵𝖺𝗅𝗂𝖽()𝖵𝖺𝗅𝗂𝖽\mathsf{Valid}(\cdot)sansserif_Valid ( ⋅ ) considered in Section 5.2, given a security parameter κ𝜅\kappaitalic_κ, the secure noisy aggregation protocol in DP-BREM+ satisfies:

1) Integrity. For a negligible function negl()negl\text{negl}(\cdot)negl ( ⋅ ), the output of the protocol returns the noisy aggregate of a subset of clients 𝖵𝖺𝗅𝗂𝖽subscript𝖵𝖺𝗅𝗂𝖽\mathcal{I}_{\mathsf{Valid}}caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT and Gaussian noise 𝛏𝛏\bm{\xi}bold_italic_ξ, such that all clients in 𝖵𝖺𝗅𝗂𝖽subscript𝖵𝖺𝗅𝗂𝖽\mathcal{I}_{\mathsf{Valid}}caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT have well-formed inputs:

Pr[output=i𝖵𝖺𝗅𝗂𝖽𝒛i+𝝃]1negl(κ)Proutputsubscript𝑖subscript𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖𝝃1negl𝜅\displaystyle\Pr[\text{output}=\sum\nolimits_{i\in\mathcal{I}_{\mathsf{Valid}}% }\bm{z}_{i}+\bm{\xi}]\geqslant 1-\text{negl}(\kappa)roman_Pr [ output = ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + bold_italic_ξ ] ⩾ 1 - negl ( italic_κ )

where random vector 𝛏𝒩(0,R2σ2𝐈d)similar-to𝛏𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\bm{\xi}\sim\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})bold_italic_ξ ∼ caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ), and 𝖵𝖺𝗅𝗂𝖽(𝐳i)=1𝖵𝖺𝗅𝗂𝖽subscript𝐳𝑖1\mathsf{Valid}(\bm{z}_{i})=1sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = 1 for all i𝖵𝖺𝗅𝗂𝖽𝑖subscript𝖵𝖺𝗅𝗂𝖽i\in\mathcal{I}_{\mathsf{Valid}}italic_i ∈ caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT. Note that the set 𝖵𝖺𝗅𝗂𝖽subscript𝖵𝖺𝗅𝗂𝖽\mathcal{I}_{\mathsf{Valid}}caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT contains all honest clients (denoted by Hsubscript𝐻\mathcal{I}_{H}caligraphic_I start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT) and the malicious clients who submitted well-formed input (denoted by Msuperscriptsubscript𝑀\mathcal{I}_{M}^{*}caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT), i.e., 𝖵𝖺𝗅𝗂𝖽=HMsubscript𝖵𝖺𝗅𝗂𝖽subscript𝐻superscriptsubscript𝑀\mathcal{I}_{\mathsf{Valid}}=\mathcal{I}_{H}\cup\mathcal{I}_{M}^{*}caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT = caligraphic_I start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT ∪ caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT.

2) Privacy. For a set of malicious clients MsubscriptM\mathcal{I}_{M}caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT and a malicious server 𝖲𝖲\mathsf{S}sansserif_S, there exists a probabilistic polynomial-time (P.P.T.) simulator 𝖲𝗂𝗆()𝖲𝗂𝗆\mathsf{Sim}(\cdot)sansserif_Sim ( ⋅ ) such that:

𝖱𝖾𝖺𝗅({zi}iH,ΩM𝖲)𝖢𝖲𝗂𝗆(iH𝒛i+𝝃,H,ΩM𝖲)subscript𝖢𝖱𝖾𝖺𝗅subscriptsubscript𝑧𝑖𝑖subscript𝐻subscriptΩsubscript𝑀𝖲𝖲𝗂𝗆subscript𝑖subscript𝐻subscript𝒛𝑖𝝃subscript𝐻subscriptΩsubscript𝑀𝖲\displaystyle\mathsf{Real}\left(\{z_{i}\}_{i\in\mathcal{I}_{H}},\Omega_{% \mathcal{I}_{M}\cup\mathsf{S}}\right)\equiv_{\mathsf{C}}\mathsf{Sim}\left(\sum% \nolimits_{i\in\mathcal{I}_{H}}\bm{z}_{i}+\bm{\xi},\mathcal{I}_{H},\Omega_{% \mathcal{I}_{M}\cup\mathsf{S}}\right)sansserif_Real ( { italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT end_POSTSUBSCRIPT , roman_Ω start_POSTSUBSCRIPT caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ∪ sansserif_S end_POSTSUBSCRIPT ) ≡ start_POSTSUBSCRIPT sansserif_C end_POSTSUBSCRIPT sansserif_Sim ( ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + bold_italic_ξ , caligraphic_I start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT , roman_Ω start_POSTSUBSCRIPT caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ∪ sansserif_S end_POSTSUBSCRIPT )

where {zi}iHsubscriptsubscript𝑧𝑖𝑖subscript𝐻\{z_{i}\}_{i\in\mathcal{I}_{H}}{ italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT end_POSTSUBSCRIPT denotes the input of all the honest clients, 𝖱𝖾𝖺𝗅𝖱𝖾𝖺𝗅\mathsf{Real}sansserif_Real denotes a random variable representing the joint view of all the parties in the protocol’s execution, ΩM𝖲subscriptΩsubscript𝑀𝖲\Omega_{\mathcal{I}_{M}\cup\mathsf{S}}roman_Ω start_POSTSUBSCRIPT caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ∪ sansserif_S end_POSTSUBSCRIPT indicates a polynomial-time algorithm implementing the "next-message" function of the parties in M𝖲subscript𝑀𝖲\mathcal{I}_{M}\cup\mathsf{S}caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ∪ sansserif_S (see [35, Appendix 11.5]), and 𝖢subscript𝖢\equiv_{\mathsf{C}}≡ start_POSTSUBSCRIPT sansserif_C end_POSTSUBSCRIPT denotes computational indistinguishability. In summary, the server and clients do not learn anything besides the final aggregated result.

Proof.

See Appendix I. ∎

6 Experimental Evaluation

Refer to caption
Figure 5: MNIST - Varying privacy budget ϵitalic-ϵ\epsilonitalic_ϵ (with δB=30%subscript𝛿𝐵percent30\delta_{B}=30\%italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = 30 % Byzantine clients).
Refer to caption
Figure 6: CIFAR-10 - Varying privacy budget ϵitalic-ϵ\epsilonitalic_ϵ (with δB=15%subscript𝛿𝐵percent15\delta_{B}=15\%italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = 15 % Byzantine clients).

In this section, we demonstrate the effectiveness of the proposed DP-BREM/DP-BREM+ on achieving both good privacy-utility tradeoff and Byzantine robustness via experimental results on MNIST [25] and CIFAR-10 [24] datasets with non-IID setting (refer to Appendix J.1 for more details on the datasets and model architectures). All experiments are developed via PyTorch111Our source code will be available after the acceptance of the paper..

Byzantine Attacks. We consider four existing Byzantine attacks in our experiments, including ALIE ("a little is enough") [3], IPM (inner-product manipulation) [45], LF (label-flip**), and the state-of-the-art MTB ("manipulating-the-Byzantine") [38]. Refer to Appendix J.1 for more details.

Compared Methods. We compare the performance of six approaches against Byzantine attacks, including DP-BREM/+ (our approach)222Since DP-BREM+ achieves the same DP and robustness guarantees as DP-BREM, we did not perform the empirical experiments with secure aggregation because the accuracy results will be exactly the same as DP-BREM. We use DP-BREM/+ to denote both DP-BREM and DP-BREM+, and the implementation follows Algorithm 1., a variant of DP-FedSGD [30] with both record and client norm clip**, DDP-RP [43], DP-RSA [50], a variant of CM [47] with DP noise, and DP-LFH. The comparison (on trust assumption and mechanism overview) of these approaches is provided in Table 1, and Appendix J.1 shows more details of each approach. In summary, DP-BREM/+, DP-FedSGD, and DP-CM add central noise to the aggregation, but DP-BREM+ does not require a trusted server due to the secure aggregation technique. DDP-RP adds partial local noise to the client’s update with secure aggregation. DP-RSA and DP-LFH add local noise to the client’s update. We fix δ=106𝛿superscript106\delta=10^{-6}italic_δ = 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT for (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP in all experiments. For the setting of other parameters, refer to Appendix J.2.

Evaluation Metric. We evaluate the testing accuracy of the global model within T𝑇Titalic_T iterations. Considering the accuracy curve might be unstable under Byzantine attacks, we average the accuracy between 0.9T0.9𝑇0.9T0.9 italic_T and T𝑇Titalic_T as the final accuracy for comparison. Note that both DP noise and Byzantine attacks reduce the accuracy. A protocol achieves good Byzantine robustness if its accuracy does not decrease too much with an increased number of Byzantine clients.

6.1 Robustness Evaluation with DP

We consider a fixed privacy budget ϵitalic-ϵ\epsilonitalic_ϵ and implement each attack with different percentages of Byzantine clients δB=||nsubscript𝛿𝐵𝑛\delta_{B}=\frac{|\mathcal{B}|}{n}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = divide start_ARG | caligraphic_B | end_ARG start_ARG italic_n end_ARG for the four attacks, and compare the accuracy among all approaches. The results for MNIST (with ϵ=3italic-ϵ3\epsilon=3italic_ϵ = 3) and CIFAR-10 (with ϵ=4italic-ϵ4\epsilon=4italic_ϵ = 4) datasets are shown in Figures 3 and 4. Though the detailed results differ under different attacks and for two datasets, we have some general observations:

1) When there is no attack, i.e., δB=0subscript𝛿𝐵0\delta_{B}=0italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = 0, DP-BREM/+ achieves almost the same accuracy as DP-FedSGD, indicating the Byzantine-robust design (client momentum with centered clip**) has almost no impact on the utility in this case.

2) After increasing δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT, our DP-BREM/+ has the smallest accuracy decrease, indicating its success in providing Byzantine robustness. However, the accuracy of DP-LFH reduces sharply, demonstrating that the large aggregated local DP noise makes the robust aggregator more vulnerable to Byzantine attacks, which is consistent with our discussions of Limitation 2 in Section 3.2.

3) Though DP-FedSGD has client-level gradient clip**, which can restrict malicious clients’ impact, it is still vulnerable to some types of Byzantine attacks (such as IPM and MTB) under larger δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT values.

4) CM with DP noise (or DP-CM) has a relatively small accuracy decrease for a relative small δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT. It is the benefit of the median-based robust aggregator. But the sensitivity is larger than the average-based aggregators, as discussed in Example 1, the aggregated DP noise is too large to obtain a high accuracy, even when δB=0subscript𝛿𝐵0\delta_{B}=0italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = 0.

5) DDP-RP is more vulnerable to LF attack because it only checks the element-wise range. Also, the model replacement strategy in LF attack is more likely to change the positions that have small values in benign gradient vectors.

6) DP-RSA has relatively poor accuracy compared with other approaches, even when δB=0subscript𝛿𝐵0\delta_{B}=0italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = 0. It is caused by the sign-SGD aggregator, which only aggregates element-wise signs instead of values, leading to information loss. Moreover, the local DP noise makes the Byzantine attacks easier to succeed.

6.2 Privacy-Utility Tradeoff with Attack

We consider a fixed percentage of Byzantine clients δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT for each attack under different values of privacy budget ϵitalic-ϵ\epsilonitalic_ϵ, and compare the accuracy of all approaches. The results for MNIST (with δB=30%subscript𝛿𝐵percent30\delta_{B}=30\%italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = 30 %) and CIFAR-10 (with δB=15%subscript𝛿𝐵percent15\delta_{B}=15\%italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = 15 %) datasets are shown in Figures 5 and 6. For both datasets, we consider four different levels of privacy, where ϵ=infitalic-ϵinfimum\epsilon=\infitalic_ϵ = roman_inf means the standard deviation of DP noise is 0, but we still implement record-level clip** to illustrate how the noise affects the results while kee** other settings including the clip** step the same. Though the detailed results differ under different attacks and for two datasets, DP-BREM/+ has the best accuracy among all approaches, especially under IPM and MTB attacks. Note that when σ=0𝜎0\sigma=0italic_σ = 0 for DP noise (i.e., ϵ=infitalic-ϵinfimum\epsilon=\infitalic_ϵ = roman_inf), both DP-BREM/+ and DP-LFH reduce to LFH, thus they have the same results in this case. We can observe that with a moderate privacy budget, such as ϵ2italic-ϵ2\epsilon\geqslant 2italic_ϵ ⩾ 2, DP noise only has a negligible impact on the accuracy. But if ϵitalic-ϵ\epsilonitalic_ϵ is too small, such as ϵ=1italic-ϵ1\epsilon=1italic_ϵ = 1 in Figure 5, DP-BREM/+ suffers a relatively larger impact (but still acceptable) from DP noise. Note that when there exist Byzantine attacks, reducing the DP noise to σ=0𝜎0\sigma=0italic_σ = 0 (i.e., ϵ=infitalic-ϵinfimum\epsilon=\infitalic_ϵ = roman_inf) does not significantly improve the accuracy of DP-BREM/+ compared with ϵ<infitalic-ϵinfimum\epsilon<\infitalic_ϵ < roman_inf, because the performance is largely impacted by Byzantine clients’ perturbations. However, the accuracy of DP-LFH is greatly reduced when ϵ<infitalic-ϵinfimum\epsilon<\infitalic_ϵ < roman_inf, since the local DP noise impacts the robustness of the aggregator. This observation is consistent with our theoretical analysis in Limitation 2 of DP-LFH (Section 3.2).

6.3 Other Results

Efficiency Evaluation of DP and Byzantine Robustness. We note that DP and Byzantine Robustness designs in our solution only introduce a small computation overhead, because 1) the clip** step of DP can be implemented efficiently; 2) our robustness is essentially a clipped summation of client momentum without any complex computations. Due to limited resources, we implemented the distributed training of FL on a single machine (by running all the clients and the server code sequentially). We evaluate the efficiency of DP-BREM via the running time (per round per client) on the MNIST dataset. The results shown in Table 3 indicate that the DP noise and Byzantine robustness only incur 8%16%similar-topercent8percent168\%\sim 16\%8 % ∼ 16 % additional running time (depending on batch size).

Table 3: Running time1 (in milliseconds) per round per client in MNIST dataset
Batch Size Baseline (FedSGD) FedSGD+DP (efficient2  ) DP-BREM (DP+robust) FedSGD+DP (inefficient3  )
30 11.80 13.31 13.72 41.06
60 18.23 19.79 20.27 76.70
120 31.22 33.18 33.70 149.32
  • 1

    Our GPU device is NVIDIA Tesla P100-PCIE-16GB. Using other GPU devices may have different results.

  • 2

    By default, our implementation uses efficient per-record gradient clip** by following Opacus library’s implementation with parallel clip** and optimized einsum (refer to https://opacus.ai/api/_modules/opacus/optimizers/optimizer.html#DPOptimizer)

  • 3

    To illustrate the improvement of efficient clip**, we also show the results of the inefficient implementation, which clips per-record gradient sequentially and without using optimized einsum.

Impact of R𝑅Ritalic_R in DP-BREM/+. Figure 7 (in Appendix J) shows how the accuracy changes w.r.t. the record-level clip** bound R𝑅Ritalic_R in DP-BREM/+. The results demonstrate that when there are fewer Byzantine clients (i.e., smaller δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT) or the noise multiplier σ𝜎\sigmaitalic_σ is smaller (i.e., larger ϵitalic-ϵ\epsilonitalic_ϵ), we need to set a larger R𝑅Ritalic_R to obtain better accuracy. This observation is consistent with the theoretical analysis of parameter tuning discussed in Theorem 2 and its interpretation.

Impact of q𝑞qitalic_q in DP-BREM/+. In all previous experiments, we set client-level sampling rate q=1𝑞1q=1italic_q = 1 by default. As discussed in Sec. 4.1, aggregating a subset Itsubscript𝐼𝑡I_{t}italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT of clients in (12) is one of the major differences from LFH. In Table 4, we show how the utility can be improved by this design under different attack percentages δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT if we can choose an optimal q𝑞qitalic_q (which can be regarded as a tunable parameter). Intuitively, when there is no attack, a smaller q𝑞qitalic_q can provide more privacy amplification, i.e., a smaller σ𝜎\sigmaitalic_σ is needed for the same value of ϵitalic-ϵ\epsilonitalic_ϵ in DP; but if q𝑞qitalic_q is too small, the small aggregate population will lead to a larger variance of the aggregation. When there exists Byzantine attacks, a smaller value of q𝑞qitalic_q can reduce the attack impact for each round because only a portion of Byzantine clients are selected for aggregation. Therefore, with increased δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT, the optimal q𝑞qitalic_q (highlighted in Table 4) is decreased.

Table 4: Model accuracy when varying q𝑞qitalic_q of DP-BREM/+ with ϵ=2italic-ϵ2\epsilon=2italic_ϵ = 2 under MTB attack in CIFAR-10 dataset.
δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT q=1𝑞1q=1italic_q = 1 q=0.8𝑞0.8q=0.8italic_q = 0.8 q=0.6𝑞0.6q=0.6italic_q = 0.6 q=0.4𝑞0.4q=0.4italic_q = 0.4 q=0.2𝑞0.2q=0.2italic_q = 0.2
0% 0.503 0.525 0.504 0.491 0.485
10% 0.435 0.434 0.465 0.449 0.438
20% 0.255 0.284 0.297 0.328 0.241

7 Related Work

Due to limited space, we only discuss the most relevant defenses below and put other related work in Appendix K. Other works either only achieve DP or Byzantine robustness (but not both), or combine secure aggregation with Byzantine robustness without realizing DP.

Wang et al. [43] proposed an FL scheme (DDP-RP) to provide Distributed DP (via encryption) and robustness (via range proof technologies); however, this scheme only verifies whether the local model weights are in a bounded range, which provides weak robustness. In comparison, our solution utilizes client momentum and centered clip** to guarantee Byzantine robustness with provable convergence analysis. Zhu et al. [50] replaces the value aggregation with sign aggregation, which provides robustness because each client has a limited impact on the aggregation. The DP noise is added to the local gradient before the sign operation. Since it only aggregates the element-wise sign (instead of the value) of clients’ gradients, it has degraded convergence due to information loss. Also, [50] only accounts for the privacy cost of one iteration, instead of the composition of all iterations in FL. Thus, the privacy cost computed in [50] is underestimated. As a comparison, our solution is based on the original SGD (with momentum), and we account for the privacy cost of all iterations. Our experimental results have confirmed that DP-BREM outperforms both of these approaches.

8 Conclusions

This paper aims to achieve FL in the cross-silo setting with both DP and Byzantine robustness. We first proposed DP-BREM, a DP version of LFH-based FL protocol with a robust aggregator based on client momentum, where the server adds noise to the aggregated momentum. Then we further developed DP-BREM+ which relaxes the server’s trust assumption, by combining secure aggregation techniques with verifiable inputs and a new protocol for secure joint noise generation. DP-BREM+ achieves the same DP and robustness guarantees as DP-BREM, under a malicious server (for privacy) and malicious minority clients. We theoretically analyze the error and convergence of DP-BREM, and conduct extensive experiments that empirically show the advantage of DP-BREM/+ in terms of privacy-utility tradeoff and Byzantine robustness over five baseline protocols. In the future, we will extend our work to other types of robust aggregators.

References

  • [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016.
  • [2] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. In AISTATS, 2020.
  • [3] Gilad Baruch, Moran Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses for distributed learning. 2019.
  • [4] Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.
  • [5] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In NeurIPS, 2017.
  • [6] George EP Box and Mervin E Muller. A note on the generation of random normal deviates. The annals of mathematical statistics, 1958.
  • [7] Zhiqi Bu, **shuo Dong, Qi Long, and Weijie J Su. Deep learning with gaussian differential privacy. Harvard Data Science Review, 2020(23), 2020.
  • [8] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. In ACM on Measurement and Analysis of Computing Systems, 2017.
  • [9] Henry Corrigan-Gibbs and Dan Boneh. Prio: Private, robust, and scalable computation of aggregate statistics. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2017.
  • [10] Ronald Cramer, Ivan Damgård, and Yuval Ishai. Share conversion, pseudorandom secret-sharing and applications to secure computation. In Theory of Cryptography Conference (TCC), 2005.
  • [11] **shuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. To appear in Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2019.
  • [12] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), 2006.
  • [13] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Now Publishers, 2014.
  • [14] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. Boosting and differential privacy. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2010.
  • [15] Minghong Fang, Xiaoyu Cao, **yuan Jia, and Neil Gong. Local model poisoning attacks to byzantine-robust federated learning. In USENIX Security Symposium, 2020.
  • [16] Paul Feldman. A practical scheme for non-interactive verifiable secret sharing. In IEEE Annual Symposium on Foundations of Computer Science (SFCS), pages 427–438, 1987.
  • [17] Shuhong Gao. A new algorithm for decoding reed-solomon codes. Communications, information and network security, pages 55–68, 2003.
  • [18] Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
  • [19] Rachid Guerraoui, Nirupam Gupta, Rafaël Pinot, Sébastien Rouault, and John Stephan. Differential privacy and byzantine resilience in sgd: Do they add up? In ACM Symposium on Principles of Distributed Computing, 2021.
  • [20] Lie He, Sai Praneeth Karimireddy, and Martin Jaggi. Secure byzantine-robust machine learning. arXiv preprint arXiv:2006.04747, 2020.
  • [21] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.
  • [22] Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Learning from history for byzantine robust optimization. In ICML, pages 5311–5319, 2021.
  • [23] Marcel Keller. Mp-spdz: A versatile framework for multi-party computation. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1575–1590, 2020.
  • [24] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [25] Yann LeCun. The mnist database of handwritten digits. 1998.
  • [26] Jeffrey Li, Mikhail Khodak, Sebastian Caldas, and Ameet Talwalkar. Differentially private meta-learning. In ICLR, 2020.
  • [27] Shu Lin and Daniel J Costello. Error control coding. Prentice hall Lebanon, IN, 2001.
  • [28] Yehuda Lindell and Ariel Nof. A framework for constructing fast mpc over arithmetic circuits with malicious adversaries and an honest-majority. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 259–276, 2017.
  • [29] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
  • [30] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In ICLR, 2018.
  • [31] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting unintended feature leakage in collaborative learning. In IEEE Symposium on Security and Privacy (S&P), 2019.
  • [32] El Mahdi El Mhamdi, Rachid Guerraoui, and Sébastien Rouault. The hidden vulnerability of distributed learning in byzantium. In ICML, 2018.
  • [33] Ilya Mironov. Rényi differential privacy. In IEEE Computer Security Foundations Symposium (CSF), 2017.
  • [34] Sikha Pentyala, Davis Railsback, Ricardo Maia, Rafael Dowsley, David Melanson, Anderson Nascimento, and Martine De Cock. Training differentially private models with secure multiparty computation. arXiv preprint, 2022.
  • [35] Amrita Roy Chowdhury, Chuan Guo, Somesh Jha, and Laurens van der Maaten. Eiffel: Ensuring integrity for federated learning. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 2535–2549, 2022.
  • [36] Amrita Roy Chowdhury, Chenghong Wang, Xi He, Ashwin Machanavajjhala, and Somesh Jha. Crypte: Crypto-assisted differential privacy on untrusted servers. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2020.
  • [37] Adi Shamir. How to share a secret. Communications of the ACM, 22(11):612–613, 1979.
  • [38] Virat Shejwalkar and Amir Houmansadr. Manipulating the byzantine: Optimizing model poisoning attacks and defenses for federated learning. In NDSS, 2021.
  • [39] Elaine Shi, TH Hubert Chan, Eleanor Rieffel, Richard Chow, and Dawn Song. Privacy-preserving aggregation of time-series data. In Annual Network & Distributed System Security Symposium (NDSS), 2011.
  • [40] **hyun So, Başak Güler, and A Salman Avestimehr. Byzantine-resilient secure federated learning. IEEE Journal on Selected Areas in Communications, 2020.
  • [41] Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, Rui Zhang, and Yi Zhou. A hybrid approach to privacy-preserving federated learning. In ACM Workshop on Artificial Intelligence and Security, 2019.
  • [42] Raj Kiriti Velicheti, Derek Xia, and Oluwasanmi Koyejo. Secure byzantine-robust distributed learning via clustering. arXiv preprint arXiv:2110.02940, 2021.
  • [43] Fayao Wang, Yuanyuan He, Yunchuan Guo, Peizhi Li, and Xinyu Wei. Privacy-preserving robust federated learning with distributed differential privacy. In IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2022.
  • [44] Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, and Dimitris S Papailiopoulos. Attack of the tails: Yes, you really can backdoor federated learning. In NeurIPS, 2020.
  • [45] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Fall of empires: Breaking byzantine-tolerant sgd by inner product manipulation. In Uncertainty in Artificial Intelligence, 2020.
  • [46] Runhua Xu, Nathalie Baracaldo, Yi Zhou, Ali Anwar, and Heiko Ludwig. Hybridalpha: An efficient approach for privacy-preserving federated learning. In ACM Workshop on Artificial Intelligence and Security, 2019.
  • [47] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In ICML, 2018.
  • [48] Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M Alvarez, Jan Kautz, and Pavlo Molchanov. See through gradients: Image batch recovery via gradinversion. In CVPR, 2021.
  • [49] Qinqing Zheng, Shuxiao Chen, Qi Long, and Weijie Su. Federated f-differential privacy. In AISTATS, 2021.
  • [50] Heng Zhu and Qing Ling. Bridging differential privacy and byzantine-robustness via model aggregation. In International Joint Conference on Artificial Intelligence (IJCAI), 2022.
  • [51] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In NeurIPS, 2019.

Appendix A Proof of Lemma 2 (Aggregation Sensitivity)

Proof.

For the local momentum computation in (11), we can rewrite it as

𝒎t,i=(1β)(𝒈¯t,i+β𝒈¯t1,i++βt2𝒈¯2,i)+βt1𝒈¯1,isubscript𝒎𝑡𝑖1𝛽subscript¯𝒈𝑡𝑖𝛽subscript¯𝒈𝑡1𝑖superscript𝛽𝑡2subscript¯𝒈2𝑖superscript𝛽𝑡1subscript¯𝒈1𝑖\displaystyle\bm{m}_{t,i}=(1-\beta)(\bar{\bm{g}}_{t,i}+\beta\bar{\bm{g}}_{t-1,% i}+\cdots+\beta^{t-2}\bar{\bm{g}}_{2,i})+\beta^{t-1}\bar{\bm{g}}_{1,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT = ( 1 - italic_β ) ( over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT + italic_β over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t - 1 , italic_i end_POSTSUBSCRIPT + ⋯ + italic_β start_POSTSUPERSCRIPT italic_t - 2 end_POSTSUPERSCRIPT over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT 2 , italic_i end_POSTSUBSCRIPT ) + italic_β start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT 1 , italic_i end_POSTSUBSCRIPT

For a neighboring dataset 𝒟isuperscriptsubscript𝒟𝑖\mathcal{D}_{i}^{\prime}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT which differs only one record from client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s data, i.e., |𝒟i𝒟i|=1subscript𝒟𝑖superscriptsubscript𝒟𝑖1|\mathcal{D}_{i}-\mathcal{D}_{i}^{\prime}|=1| caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | = 1, we denote the corresponding local gradient (with per-record gradient clip**) and momentum as 𝒈¯t,isuperscriptsubscript¯𝒈𝑡𝑖\bar{\bm{g}}_{t,i}^{\prime}over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and 𝒎t,isuperscriptsubscript𝒎𝑡𝑖\bm{m}_{t,i}^{\prime}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, respectively. Since 𝒈¯τ,isubscript¯𝒈𝜏𝑖\bar{\bm{g}}_{\tau,i}over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_τ , italic_i end_POSTSUBSCRIPT is computed by (10) for τ=1,,t𝜏1𝑡\tau=1,\cdots,titalic_τ = 1 , ⋯ , italic_t, we have

𝒈¯τ,i𝒈¯τ,i=1pi|𝒟i|𝖢𝗅𝗂𝗉R(𝜽(𝒙,𝜽τ1))Rpi|𝒟i|normsubscript¯𝒈𝜏𝑖superscriptsubscript¯𝒈𝜏𝑖1subscript𝑝𝑖subscript𝒟𝑖normsubscript𝖢𝗅𝗂𝗉𝑅subscript𝜽𝒙subscript𝜽𝜏1𝑅subscript𝑝𝑖subscript𝒟𝑖\displaystyle\|\bar{\bm{g}}_{\tau,i}-\bar{\bm{g}}_{\tau,i}^{\prime}\|=\frac{1}% {p_{i}|\mathcal{D}_{i}|}\|\mathsf{Clip}_{R}(\nabla_{\bm{\theta}}\ell(\bm{x},% \bm{\theta}_{\tau-1}))\|\leqslant\frac{R}{p_{i}|\mathcal{D}_{i}|}∥ over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_τ , italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_τ , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ = divide start_ARG 1 end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG ∥ sansserif_Clip start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_ℓ ( bold_italic_x , bold_italic_θ start_POSTSUBSCRIPT italic_τ - 1 end_POSTSUBSCRIPT ) ) ∥ ⩽ divide start_ARG italic_R end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG

where 𝒙=𝒟i𝒟i𝒙subscript𝒟𝑖superscriptsubscript𝒟𝑖\bm{x}=\mathcal{D}_{i}-\mathcal{D}_{i}^{\prime}bold_italic_x = caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. Then,

𝒎t,i𝒎t,inormsubscript𝒎𝑡𝑖superscriptsubscript𝒎𝑡𝑖\displaystyle\|\bm{m}_{t,i}-\bm{m}_{t,i}^{\prime}\|∥ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ (1β)[𝒈¯t,i𝒈¯t,i+β𝒈¯t1,i𝒈¯t1,i+\displaystyle\leqslant(1-\beta)[\|\bar{\bm{g}}_{t,i}-\bar{\bm{g}}_{t,i}^{% \prime}\|+\beta\|\bar{\bm{g}}_{t-1,i}-\bar{\bm{g}}_{t-1,i}^{\prime}\|+⩽ ( 1 - italic_β ) [ ∥ over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ + italic_β ∥ over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t - 1 , italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t - 1 , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ +
+βt2𝒈¯2,i𝒈¯2,i]+βt1𝒈¯1,i𝒈¯1,i\displaystyle\quad\cdots+\beta^{t-2}\|\bar{\bm{g}}_{2,i}-\bar{\bm{g}}_{2,i}^{% \prime}\|]+\beta^{t-1}\|\bar{\bm{g}}_{1,i}-\bar{\bm{g}}_{1,i}^{\prime}\|⋯ + italic_β start_POSTSUPERSCRIPT italic_t - 2 end_POSTSUPERSCRIPT ∥ over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT 2 , italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT 2 , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ ] + italic_β start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ∥ over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT 1 , italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT 1 , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥
[(1β)(1+β++βt2)+βt1]Rpi|𝒟i|absentdelimited-[]1𝛽1𝛽superscript𝛽𝑡2superscript𝛽𝑡1𝑅subscript𝑝𝑖subscript𝒟𝑖\displaystyle\leqslant[(1-\beta)(1+\beta+\cdots+\beta^{t-2})+\beta^{t-1}]\cdot% \frac{R}{p_{i}|\mathcal{D}_{i}|}⩽ [ ( 1 - italic_β ) ( 1 + italic_β + ⋯ + italic_β start_POSTSUPERSCRIPT italic_t - 2 end_POSTSUPERSCRIPT ) + italic_β start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ] ⋅ divide start_ARG italic_R end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG
=[(1β)1βt11β+βt1]Rpi|𝒟i|=Rpi|𝒟i|absentdelimited-[]1𝛽1superscript𝛽𝑡11𝛽superscript𝛽𝑡1𝑅subscript𝑝𝑖subscript𝒟𝑖𝑅subscript𝑝𝑖subscript𝒟𝑖\displaystyle=\left[(1-\beta)\cdot\frac{1-\beta^{t-1}}{1-\beta}+\beta^{t-1}% \right]\cdot\frac{R}{p_{i}|\mathcal{D}_{i}|}=\frac{R}{p_{i}|\mathcal{D}_{i}|}= [ ( 1 - italic_β ) ⋅ divide start_ARG 1 - italic_β start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_β end_ARG + italic_β start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ] ⋅ divide start_ARG italic_R end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG = divide start_ARG italic_R end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG

where the first inequality is obtained by generalizing the triangle inequality; Therefore,

Qt(𝒟)Qt(𝒟)normsubscript𝑄𝑡𝒟subscript𝑄𝑡superscript𝒟\displaystyle\quad\|Q_{t}(\mathcal{D})-Q_{t}(\mathcal{D^{\prime}})\|∥ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( caligraphic_D ) - italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( caligraphic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∥
=jt𝖢𝗅𝗂𝗉C(𝒎t,j𝒎~t1)jt𝖢𝗅𝗂𝗉C(𝒎t,j𝒎~t1)absentnormsubscript𝑗subscript𝑡subscript𝖢𝗅𝗂𝗉𝐶subscript𝒎𝑡𝑗subscript~𝒎𝑡1subscript𝑗subscript𝑡subscript𝖢𝗅𝗂𝗉𝐶superscriptsubscript𝒎𝑡𝑗subscript~𝒎𝑡1\displaystyle=\|\sum\nolimits_{j\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bm{m}_{t% ,j}-\tilde{\bm{m}}_{t-1})-\sum\nolimits_{j\in\mathcal{I}_{t}}\mathsf{Clip}_{C}% (\bm{m}_{t,j}^{\prime}-\tilde{\bm{m}}_{t-1})\|= ∥ ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_m start_POSTSUBSCRIPT italic_t , italic_j end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) - ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_m start_POSTSUBSCRIPT italic_t , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) ∥
=(𝖺)𝖢𝗅𝗂𝗉C(𝒎t,i𝒎~t1)𝖢𝗅𝗂𝗉C(𝒎t,i𝒎~t1)𝖺normsubscript𝖢𝗅𝗂𝗉𝐶subscript𝒎𝑡𝑖subscript~𝒎𝑡1subscript𝖢𝗅𝗂𝗉𝐶superscriptsubscript𝒎𝑡𝑖subscript~𝒎𝑡1\displaystyle\overset{(\mathsf{a})}{=}\|\mathsf{Clip}_{C}(\bm{m}_{t,i}-\tilde{% \bm{m}}_{t-1})-\mathsf{Clip}_{C}(\bm{m}_{t,i}^{\prime}-\tilde{\bm{m}}_{t-1})\|start_OVERACCENT ( sansserif_a ) end_OVERACCENT start_ARG = end_ARG ∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT - over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) ∥
(𝖻)min{2C,𝒎t,i𝒎t,i}=min{2C,Rpi|𝒟i|}𝖻2𝐶normsubscript𝒎𝑡𝑖superscriptsubscript𝒎𝑡𝑖2𝐶𝑅subscript𝑝𝑖subscript𝒟𝑖\displaystyle\overset{(\mathsf{b})}{\leqslant}\min\{2C,\|\bm{m}_{t,i}-\bm{m}_{% t,i}^{\prime}\|\}=\min\{2C,\frac{R}{p_{i}|\mathcal{D}_{i}|}\}start_OVERACCENT ( sansserif_b ) end_OVERACCENT start_ARG ⩽ end_ARG roman_min { 2 italic_C , ∥ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ } = roman_min { 2 italic_C , divide start_ARG italic_R end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG }

where (𝖺)𝖺(\mathsf{a})( sansserif_a ) is obtained due to 𝒎t,j=𝒎t,jsubscript𝒎𝑡𝑗superscriptsubscript𝒎𝑡𝑗\bm{m}_{t,j}=\bm{m}_{t,j}^{\prime}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_j end_POSTSUBSCRIPT = bold_italic_m start_POSTSUBSCRIPT italic_t , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT for ji𝑗𝑖j\neq iitalic_j ≠ italic_i; and (𝖻)𝖻(\mathsf{b})( sansserif_b ) is obtained according to Lemma 3. Now we finished the main proof of Lemma 2. ∎

Table 5: Notations
Symbols Description
𝜽tsubscript𝜽𝑡\bm{\theta}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (global) model in t𝑡titalic_t-th iteration, where 𝜽tdsubscript𝜽𝑡superscript𝑑\bm{\theta}_{t}\in\mathbb{R}^{d}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT
𝒟isubscript𝒟𝑖\mathcal{D}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT local training data of client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
𝒟i,tsubscript𝒟𝑖𝑡\mathcal{D}_{i,t}caligraphic_D start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT data batch sampled from 𝒟isubscript𝒟𝑖\mathcal{D}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in t𝑡titalic_t-th iteration
𝒈t,i,𝒎t,isubscript𝒈𝑡𝑖subscript𝒎𝑡𝑖\bm{g}_{t,i},\bm{m}_{t,i}bold_italic_g start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT , bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT client gradient and momentum at t𝑡titalic_t-th iteration
𝒎tsubscript𝒎𝑡\bm{m}_{t}bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT aggregation of 𝒎t,isubscript𝒎𝑡𝑖\bm{m}_{t,i}bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT among multiple clients
𝒈¯t,i,𝒎¯t,isubscript¯𝒈𝑡𝑖subscript¯𝒎𝑡𝑖\bar{\bm{g}}_{t,i},\bar{\bm{m}}_{t,i}over¯ start_ARG bold_italic_g end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT , over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT client gradient and momentum with record-level clip**
𝒎~tsubscript~𝒎𝑡\tilde{\bm{m}}_{t}over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT aggregation of 𝒎¯t,isubscript¯𝒎𝑡𝑖\bar{\bm{m}}_{t,i}over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT with DP noise
pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT record-level sampling rate (implemented by client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT)
q𝑞qitalic_q client-level sampling rate (implemented by the server)
R𝑅Ritalic_R record-level clip** bound (for DP)
C𝐶Citalic_C client-level clip** bound (for robustness)
\mathcal{H}caligraphic_H the set of honest clients that follow the protocol honestly
\mathcal{B}caligraphic_B the set of Byzantine clients that are malicious
δBsubscript𝛿𝐵\delta_{B}italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT the percentage of Byzantine clients, i.e., δB=||/n×100%subscript𝛿𝐵𝑛percent100\delta_{B}=|\mathcal{B}|/n\times 100\%italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = | caligraphic_B | / italic_n × 100 %

The above proof used the following lemma.

Lemma 3.

For any vectors x𝑥xitalic_x and δ𝛿\deltaitalic_δ, we have

𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)min{2C,δ}normsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿2𝐶norm𝛿\displaystyle\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|\leqslant\min% \{2C,\|\delta\|\}∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ ⩽ roman_min { 2 italic_C , ∥ italic_δ ∥ }

where 𝖢𝗅𝗂𝗉C(x)min{1,C/x}xsubscript𝖢𝗅𝗂𝗉𝐶𝑥1𝐶norm𝑥𝑥\mathsf{Clip}_{C}(x)\coloneqq\min\{1,C/\|x\|\}\cdot xsansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) ≔ roman_min { 1 , italic_C / ∥ italic_x ∥ } ⋅ italic_x and \|\cdot\|∥ ⋅ ∥ denotes L2-norm.

Proof.

Our proof of Lemma 3 mainly uses the triangle inequality of a norm. Note that for L-2 norm \|\cdot\|∥ ⋅ ∥, we havea+b2=a2+2ab+b2superscriptnorm𝑎𝑏2superscriptnorm𝑎22superscript𝑎top𝑏superscriptnorm𝑏2\|a+b\|^{2}=\|a\|^{2}+2a^{\top}b+\|b\|^{2}∥ italic_a + italic_b ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∥ italic_a ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 2 italic_a start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_b + ∥ italic_b ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT for any vectors a𝑎aitalic_a and b𝑏bitalic_b. We first show that 𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)δnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿norm𝛿\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|\leqslant\|\delta\|∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ ⩽ ∥ italic_δ ∥, which is proved by enumerating all cases as follows.

Case 1. Assume xCnorm𝑥𝐶\|x\|\leqslant C∥ italic_x ∥ ⩽ italic_C and x+δCnorm𝑥𝛿𝐶\|x+\delta\|\leqslant C∥ italic_x + italic_δ ∥ ⩽ italic_C. Then,

𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)=x(x+δ)=δnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿norm𝑥𝑥𝛿norm𝛿\displaystyle\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|=\|x-(x+% \delta)\|=\|\delta\|∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ = ∥ italic_x - ( italic_x + italic_δ ) ∥ = ∥ italic_δ ∥

Case 2. Assume x>Cnorm𝑥𝐶\|x\|>C∥ italic_x ∥ > italic_C and x+δCnorm𝑥𝛿𝐶\|x+\delta\|\leqslant C∥ italic_x + italic_δ ∥ ⩽ italic_C. Then, 0<Cx<10𝐶norm𝑥10<\frac{C}{\|x\|}<10 < divide start_ARG italic_C end_ARG start_ARG ∥ italic_x ∥ end_ARG < 1 and

𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)2superscriptnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿2\displaystyle\quad\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|^{2}∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=Cxx(x+δ)2=(1Cx)x+δ2absentsuperscriptnorm𝐶norm𝑥𝑥𝑥𝛿2superscriptnorm1𝐶norm𝑥𝑥𝛿2\displaystyle=\left\|\frac{C}{\|x\|}\cdot x-(x+\delta)\right\|^{2}=\left\|% \left(1-\frac{C}{\|x\|}\right)\cdot x+\delta\right\|^{2}= ∥ divide start_ARG italic_C end_ARG start_ARG ∥ italic_x ∥ end_ARG ⋅ italic_x - ( italic_x + italic_δ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∥ ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x ∥ end_ARG ) ⋅ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=(1Cx)2x2+2(1Cx)xδ+δ2absentsuperscript1𝐶norm𝑥2superscriptnorm𝑥221𝐶norm𝑥superscript𝑥top𝛿superscriptnorm𝛿2\displaystyle=\left(1-\frac{C}{\|x\|}\right)^{2}\|x\|^{2}+2\left(1-\frac{C}{\|% x\|}\right)x^{\top}\delta+\|\delta\|^{2}= ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x ∥ end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 2 ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x ∥ end_ARG ) italic_x start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_δ + ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=(1Cx)[(1Cx)x2+2xδ+δ2]+Cδ2xabsent1𝐶norm𝑥delimited-[]1𝐶norm𝑥superscriptnorm𝑥22superscript𝑥top𝛿superscriptnorm𝛿2𝐶superscriptnorm𝛿2norm𝑥\displaystyle=\left(1-\frac{C}{\|x\|}\right)\left[\left(1-\frac{C}{\|x\|}% \right)\|x\|^{2}+2x^{\top}\delta+\|\delta\|^{2}\right]+\frac{C\cdot\|\delta\|^% {2}}{\|x\|}= ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x ∥ end_ARG ) [ ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x ∥ end_ARG ) ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 2 italic_x start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_δ + ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + divide start_ARG italic_C ⋅ ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x ∥ end_ARG
=(1Cx)[x+δ2Cx]+Cδ2xabsent1𝐶norm𝑥delimited-[]superscriptnorm𝑥𝛿2𝐶norm𝑥𝐶superscriptnorm𝛿2norm𝑥\displaystyle=\left(1-\frac{C}{\|x\|}\right)\left[\|x+\delta\|^{2}-C\|x\|% \right]+\frac{C\cdot\|\delta\|^{2}}{\|x\|}= ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x ∥ end_ARG ) [ ∥ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - italic_C ∥ italic_x ∥ ] + divide start_ARG italic_C ⋅ ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x ∥ end_ARG
<0+1δ2=δ2absent01superscriptnorm𝛿2superscriptnorm𝛿2\displaystyle<0+1\cdot\|\delta\|^{2}=\|\delta\|^{2}< 0 + 1 ⋅ ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where x+δ2C2<Cxsuperscriptnorm𝑥𝛿2superscript𝐶2𝐶norm𝑥\|x+\delta\|^{2}\leqslant C^{2}<C\|x\|∥ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT < italic_C ∥ italic_x ∥. Therefore, 𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)<δnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿norm𝛿\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|<\|\delta\|∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ < ∥ italic_δ ∥ in this case.

Case 3. Assume xCnorm𝑥𝐶\|x\|\leqslant C∥ italic_x ∥ ⩽ italic_C and x+δ>Cnorm𝑥𝛿𝐶\|x+\delta\|>C∥ italic_x + italic_δ ∥ > italic_C. Then, 0<Cx+δ<10𝐶norm𝑥𝛿10<\frac{C}{\|x+\delta\|}<10 < divide start_ARG italic_C end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG < 1 and

𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)2superscriptnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿2\displaystyle\quad\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|^{2}∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=xCx+δ(x+δ)2=(1Cx+δ)(x+δ)δ2absentsuperscriptnorm𝑥𝐶norm𝑥𝛿𝑥𝛿2superscriptnorm1𝐶norm𝑥𝛿𝑥𝛿𝛿2\displaystyle=\left\|x-\frac{C}{\|x+\delta\|}\cdot(x+\delta)\right\|^{2}=\left% \|\left(1-\frac{C}{\|x+\delta\|}\right)\cdot(x+\delta)-\delta\right\|^{2}= ∥ italic_x - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG ⋅ ( italic_x + italic_δ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∥ ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG ) ⋅ ( italic_x + italic_δ ) - italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=(1Cx+δ)2x+δ22(1Cx+δ)(x+δ)δ+δ2absentsuperscript1𝐶norm𝑥𝛿2superscriptnorm𝑥𝛿221𝐶norm𝑥𝛿superscript𝑥𝛿top𝛿superscriptnorm𝛿2\displaystyle=\left(1-\frac{C}{\|x+\delta\|}\right)^{2}\|x+\delta\|^{2}-2\left% (1-\frac{C}{\|x+\delta\|}\right)(x+\delta)^{\top}\delta+\|\delta\|^{2}= ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 2 ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG ) ( italic_x + italic_δ ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_δ + ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=(1Cx+δ)[(1Cx+δ)x+δ22(x+δ)δ+δ2]absent1𝐶norm𝑥𝛿delimited-[]1𝐶norm𝑥𝛿superscriptnorm𝑥𝛿22superscript𝑥𝛿top𝛿superscriptnorm𝛿2\displaystyle=\left(1-\frac{C}{\|x+\delta\|}\right)\left[\left(1-\frac{C}{\|x+% \delta\|}\right)\|x+\delta\|^{2}-2(x+\delta)^{\top}\delta+\|\delta\|^{2}\right]= ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG ) [ ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG ) ∥ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 2 ( italic_x + italic_δ ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_δ + ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
+Cδ2x+δ𝐶superscriptnorm𝛿2norm𝑥𝛿\displaystyle\qquad+\frac{C\cdot\|\delta\|^{2}}{\|x+\delta\|}+ divide start_ARG italic_C ⋅ ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG
=(1Cx+δ)[(x+δ)δ2Cx+δ]+Cδ2x+δabsent1𝐶norm𝑥𝛿delimited-[]superscriptnorm𝑥𝛿𝛿2𝐶norm𝑥𝛿𝐶superscriptnorm𝛿2norm𝑥𝛿\displaystyle=\left(1-\frac{C}{\|x+\delta\|}\right)\left[\|(x+\delta)-\delta\|% ^{2}-C\|x+\delta\|\right]+\frac{C\cdot\|\delta\|^{2}}{\|x+\delta\|}= ( 1 - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG ) [ ∥ ( italic_x + italic_δ ) - italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - italic_C ∥ italic_x + italic_δ ∥ ] + divide start_ARG italic_C ⋅ ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG
<0+1δ2=δ2absent01superscriptnorm𝛿2superscriptnorm𝛿2\displaystyle<0+1\cdot\|\delta\|^{2}=\|\delta\|^{2}< 0 + 1 ⋅ ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where (x+δ)δ2=x2C2<Cx+δsuperscriptnorm𝑥𝛿𝛿2superscriptnorm𝑥2superscript𝐶2𝐶norm𝑥𝛿\|(x+\delta)-\delta\|^{2}=\|x\|^{2}\leqslant C^{2}<C\|x+\delta\|∥ ( italic_x + italic_δ ) - italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT < italic_C ∥ italic_x + italic_δ ∥. Therefore, 𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)<δnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿norm𝛿\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|<\|\delta\|∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ < ∥ italic_δ ∥ in this case.

Case 4. Assume x>Cnorm𝑥𝐶\|x\|>C∥ italic_x ∥ > italic_C and x+δ>Cnorm𝑥𝛿𝐶\|x+\delta\|>C∥ italic_x + italic_δ ∥ > italic_C. Then,

𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)2superscriptnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿2\displaystyle\quad\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|^{2}∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=CxxCx+δ(x+δ)2absentsuperscriptnorm𝐶norm𝑥𝑥𝐶norm𝑥𝛿𝑥𝛿2\displaystyle=\left\|\frac{C}{\|x\|}\cdot x-\frac{C}{\|x+\delta\|}\cdot(x+% \delta)\right\|^{2}= ∥ divide start_ARG italic_C end_ARG start_ARG ∥ italic_x ∥ end_ARG ⋅ italic_x - divide start_ARG italic_C end_ARG start_ARG ∥ italic_x + italic_δ ∥ end_ARG ⋅ ( italic_x + italic_δ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=C2x2x2C22x(x+δ)xx+δ+C2x+δ2x+δ2absentsuperscript𝐶2superscriptnorm𝑥2superscriptnorm𝑥2superscript𝐶22superscript𝑥top𝑥𝛿norm𝑥norm𝑥𝛿superscript𝐶2superscriptnorm𝑥𝛿2superscriptnorm𝑥𝛿2\displaystyle=\frac{C^{2}}{\|x\|^{2}}\cdot\|x\|^{2}-\frac{C^{2}\cdot 2x^{\top}% (x+\delta)}{\|x\|\cdot\|x+\delta\|}+\frac{C^{2}}{\|x+\delta\|^{2}}\cdot\|x+% \delta\|^{2}= divide start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ⋅ ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - divide start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ 2 italic_x start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_x + italic_δ ) end_ARG start_ARG ∥ italic_x ∥ ⋅ ∥ italic_x + italic_δ ∥ end_ARG + divide start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ⋅ ∥ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=2C2C2[x2+(x2+2xδ+δ2)]xx+δ+C2δ2xx+δabsent2superscript𝐶2superscript𝐶2delimited-[]superscriptnorm𝑥2superscriptnorm𝑥22superscript𝑥top𝛿superscriptnorm𝛿2norm𝑥norm𝑥𝛿superscript𝐶2superscriptnorm𝛿2norm𝑥norm𝑥𝛿\displaystyle=2C^{2}-\frac{C^{2}\cdot[\|x\|^{2}+(\|x\|^{2}+2x^{\top}\delta+\|% \delta\|^{2})]}{\|x\|\cdot\|x+\delta\|}+\frac{C^{2}\cdot\|\delta\|^{2}}{\|x\|% \cdot\|x+\delta\|}= 2 italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - divide start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ [ ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 2 italic_x start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_δ + ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ] end_ARG start_ARG ∥ italic_x ∥ ⋅ ∥ italic_x + italic_δ ∥ end_ARG + divide start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x ∥ ⋅ ∥ italic_x + italic_δ ∥ end_ARG
=2C2C2[x2+x+δ2]xx+δ+C2xx+δδ2absent2superscript𝐶2superscript𝐶2delimited-[]superscriptnorm𝑥2superscriptnorm𝑥𝛿2norm𝑥norm𝑥𝛿superscript𝐶2norm𝑥norm𝑥𝛿superscriptnorm𝛿2\displaystyle=2C^{2}-\frac{C^{2}\cdot[\|x\|^{2}+\|x+\delta\|^{2}]}{\|x\|\cdot% \|x+\delta\|}+\frac{C^{2}}{\|x\|\cdot\|x+\delta\|}\cdot\|\delta\|^{2}= 2 italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - divide start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ [ ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] end_ARG start_ARG ∥ italic_x ∥ ⋅ ∥ italic_x + italic_δ ∥ end_ARG + divide start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x ∥ ⋅ ∥ italic_x + italic_δ ∥ end_ARG ⋅ ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
<2C2C22+1δ2=δ2absent2superscript𝐶2superscript𝐶221superscriptnorm𝛿2superscriptnorm𝛿2\displaystyle<2C^{2}-C^{2}\cdot 2+1\cdot\|\delta\|^{2}=\|\delta\|^{2}< 2 italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ 2 + 1 ⋅ ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∥ italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where x2+x+δ2xx+δ2superscriptnorm𝑥2superscriptnorm𝑥𝛿2norm𝑥norm𝑥𝛿2\frac{\|x\|^{2}+\|x+\delta\|^{2}}{\|x\|\cdot\|x+\delta\|}\geqslant 2divide start_ARG ∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x ∥ ⋅ ∥ italic_x + italic_δ ∥ end_ARG ⩾ 2 due to x2+x+δ22(xx+δ)=(xx+δ)20superscriptnorm𝑥2superscriptnorm𝑥𝛿22norm𝑥norm𝑥𝛿superscriptnorm𝑥norm𝑥𝛿20\|x\|^{2}+\|x+\delta\|^{2}-2(\|x\|\cdot\|x+\delta\|)=(\|x\|-\|x+\delta\|)^{2}\geqslant 0∥ italic_x ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ italic_x + italic_δ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 2 ( ∥ italic_x ∥ ⋅ ∥ italic_x + italic_δ ∥ ) = ( ∥ italic_x ∥ - ∥ italic_x + italic_δ ∥ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩾ 0, and C2xx+δ<1superscript𝐶2norm𝑥norm𝑥𝛿1\frac{C^{2}}{\|x\|\cdot\|x+\delta\|}<1divide start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_x ∥ ⋅ ∥ italic_x + italic_δ ∥ end_ARG < 1 due to x>Cnorm𝑥𝐶\|x\|>C∥ italic_x ∥ > italic_C and x+δ>Cnorm𝑥𝛿𝐶\|x+\delta\|>C∥ italic_x + italic_δ ∥ > italic_C. Therefore, 𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)<δnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿norm𝛿\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|<\|\delta\|∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ < ∥ italic_δ ∥ in this case.

The Final Result. By summarizing all cases above, we have 𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)δnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿norm𝛿\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|\leqslant\|\delta\|∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ ⩽ ∥ italic_δ ∥. On the other hand, since 𝖢𝗅𝗂𝗉C(x)Cnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥𝐶\|\mathsf{Clip}_{C}(x)\|\leqslant C∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) ∥ ⩽ italic_C for any x𝑥xitalic_x, it is obvious that

𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)𝖢𝗅𝗂𝗉C(x)+𝖢𝗅𝗂𝗉C(x+δ)2Cnormsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿normsubscript𝖢𝗅𝗂𝗉𝐶𝑥normsubscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿2𝐶\displaystyle\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|\leqslant\|% \mathsf{Clip}_{C}(x)\|+\|\mathsf{Clip}_{C}(x+\delta)\|\leqslant 2C∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ ⩽ ∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) ∥ + ∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ ⩽ 2 italic_C

Thus, the upper bound of 𝖢𝗅𝗂𝗉C(x)𝖢𝗅𝗂𝗉C(x+δ)normsubscript𝖢𝗅𝗂𝗉𝐶𝑥subscript𝖢𝗅𝗂𝗉𝐶𝑥𝛿\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|∥ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x ) - sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_x + italic_δ ) ∥ is min{2C,δ}2𝐶norm𝛿\min\{2C,\|\delta\|\}roman_min { 2 italic_C , ∥ italic_δ ∥ }. ∎

Appendix B Proof of Theorem 1 (Privacy Analysis)

Proof.

Since the added Gaussian noise in (12) has standard deviation Rσ𝑅𝜎R\sigmaitalic_R italic_σ, and the aggregation sensitivity is shown in (14), then the noise multiplier (defined by the ratio between Gaussian noise’s standard deviation and the sensitivity) is

σi=RσSi=max{Rσ2C,σpi|𝒟i|}=σmax{R2C,pi|𝒟i|}subscript𝜎𝑖𝑅𝜎subscript𝑆𝑖𝑅𝜎2𝐶𝜎subscript𝑝𝑖subscript𝒟𝑖𝜎𝑅2𝐶subscript𝑝𝑖subscript𝒟𝑖\displaystyle\sigma_{i}=\frac{R\sigma}{S_{i}}=\max\left\{\frac{R\sigma}{2C},% \sigma p_{i}|\mathcal{D}_{i}|\right\}=\sigma\cdot\max\left\{\frac{R}{2C},p_{i}% |\mathcal{D}_{i}|\right\}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG italic_R italic_σ end_ARG start_ARG italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG = roman_max { divide start_ARG italic_R italic_σ end_ARG start_ARG 2 italic_C end_ARG , italic_σ italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | } = italic_σ ⋅ roman_max { divide start_ARG italic_R end_ARG start_ARG 2 italic_C end_ARG , italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | }

Also, due to the client-level sampling (i.e., each client was selected by the server w.p. q𝑞qitalic_q) and record-level sampling (i.e., each record was selected by client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT w.p. pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT), the overall sampling rate is qpi𝑞subscript𝑝𝑖qp_{i}italic_q italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Then, by applying the privacy accountant of Gaussian DP (GDP) [11] shown in Lemma 10 (see more details in Appendix F), DP-BREM satisfies μisubscript𝜇𝑖\mu_{i}italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT-GDP with μisubscript𝜇𝑖\mu_{i}italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT shown in (16). Finally, by converting μisubscript𝜇𝑖\mu_{i}italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT-GDP to (ϵi,δ)subscriptitalic-ϵ𝑖𝛿(\epsilon_{i},\delta)( italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_δ )-DP via Lemma 9, we get (15), which finishes the proof. ∎

Remark: privacy accountant in practice. Eq. (15) provides the formula of δ𝛿\deltaitalic_δ when ϵisubscriptitalic-ϵ𝑖\epsilon_{i}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is given and μisubscript𝜇𝑖\mu_{i}italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is computed from (16). In practice, however, we need to compute the value of privacy budget ϵisubscriptitalic-ϵ𝑖\epsilon_{i}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with a fixed δ𝛿\deltaitalic_δ, where δ𝛿\deltaitalic_δ is conventionally set to be less than 1/n1𝑛1/n1 / italic_n. In our experiments, we utilize the computation tool333https://github.com/woodyx218/Deep-Learning-with-GDP-Pytorch in [7] to solve ϵisubscriptitalic-ϵ𝑖\epsilon_{i}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT from (15). For the value of σisubscript𝜎𝑖\sigma_{i}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in (16), we usually have pi|𝒟i|>R2Csubscript𝑝𝑖subscript𝒟𝑖𝑅2𝐶p_{i}|\mathcal{D}_{i}|>\frac{R}{2C}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | > divide start_ARG italic_R end_ARG start_ARG 2 italic_C end_ARG in practice, then σi=σpi|𝒟i|subscript𝜎𝑖𝜎subscript𝑝𝑖subscript𝒟𝑖\sigma_{i}=\sigma p_{i}|\mathcal{D}_{i}|italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_σ italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT |. In this case, the clip** bounds R𝑅Ritalic_R and C𝐶Citalic_C are just hyperparameters that may affect the utility of the algorithm, but has no influence on the privacy analysis.

Appendix C Proof of Theorem 2 (Aggregation Error)

Before proving Theorem 2, we first show some notations and assumptions. In t𝑡titalic_t-th iteration, denote the selected honest clients t=tsubscript𝑡subscript𝑡\mathcal{H}_{t}=\mathcal{H}\cup\mathcal{I}_{t}caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = caligraphic_H ∪ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and selected Byzantine clients t=tsubscript𝑡subscript𝑡\mathcal{B}_{t}=\mathcal{B}\cup\mathcal{I}_{t}caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = caligraphic_B ∪ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. For momentum updates in t𝑡titalic_t-th iteration, we simplify the following notation (ignoring the subscript t𝑡titalic_t) for convenience,

𝒚0𝒎~t1,𝒚i𝒚0+𝒛iwith 𝒛i𝖢𝗅𝗂𝗉C(𝒎¯t,i𝒚0)formulae-sequencesubscript𝒚0subscript~𝒎𝑡1formulae-sequencesubscript𝒚𝑖subscript𝒚0subscript𝒛𝑖with subscript𝒛𝑖subscript𝖢𝗅𝗂𝗉𝐶subscript¯𝒎𝑡𝑖subscript𝒚0\displaystyle\bm{y}_{0}\coloneqq\tilde{\bm{m}}_{t-1},\quad\bm{y}_{i}\coloneqq% \bm{y}_{0}+\bm{z}_{i}\quad\text{with }\bm{z}_{i}\coloneqq\mathsf{Clip}_{C}(% \bar{\bm{m}}_{t,i}-\bm{y}_{0})bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ≔ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≔ bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≔ sansserif_Clip start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT )

where 𝒎¯t,isubscript¯𝒎𝑡𝑖\bar{\bm{m}}_{t,i}over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT is the client momentum computed from gradient with record-level clip**. Then, we can rewrite the noisy global momentum as 𝒎~t=it𝒚i+𝝃|t|subscript~𝒎𝑡subscript𝑖subscript𝑡subscript𝒚𝑖𝝃subscript𝑡\tilde{\bm{m}}_{t}=\frac{\sum_{i\in\mathcal{I}_{t}}\bm{y}_{i}+\bm{\xi}}{|% \mathcal{I}_{t}|}over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + bold_italic_ξ end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG, where 𝝃𝒩(0,R2σ2)similar-to𝝃𝒩0superscript𝑅2superscript𝜎2\bm{\xi}\sim\mathcal{N}(0,R^{2}\sigma^{2})bold_italic_ξ ∼ caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ).

We assume {𝒎t,i}isubscriptsubscript𝒎𝑡𝑖𝑖\{\bm{m}_{t,i}\}_{i\in\mathcal{H}}{ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_H end_POSTSUBSCRIPT are i.i.d. with expectation 𝝁𝔼[𝒎t,i]𝝁𝔼delimited-[]subscript𝒎𝑡𝑖\bm{\mu}\coloneqq\mathbb{E}[\bm{m}_{t,i}]bold_italic_μ ≔ blackboard_E [ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ] and variance is bounded (in terms of L2-norm) 𝔼𝒎t,i𝝁2ρ2𝔼superscriptnormsubscript𝒎𝑡𝑖𝝁2superscript𝜌2\mathbb{E}\|\bm{m}_{t,i}-\bm{\mu}\|^{2}\leqslant\rho^{2}blackboard_E ∥ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Therefore, the record-level gradient clipped ones {𝒎¯t,i}isubscriptsubscript¯𝒎𝑡𝑖𝑖\{\bar{\bm{m}}_{t,i}\}_{i\in\mathcal{H}}{ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_H end_POSTSUBSCRIPT are also i.i.d., and we denote the expectation 𝝁¯𝔼[𝒎¯t,i]¯𝝁𝔼delimited-[]subscript¯𝒎𝑡𝑖\bar{\bm{\mu}}\coloneqq\mathbb{E}[\bar{\bm{m}}_{t,i}]over¯ start_ARG bold_italic_μ end_ARG ≔ blackboard_E [ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ]. Due to the clip** operation, the variance is reduced, and we assume 𝔼𝒎¯t,i𝝁¯2[ρR/(R+c)]2𝔼superscriptnormsubscript¯𝒎𝑡𝑖¯𝝁2superscriptdelimited-[]𝜌𝑅𝑅𝑐2\mathbb{E}\|\bar{\bm{m}}_{t,i}-\bar{\bm{\mu}}\|^{2}\leqslant[\rho R/(R+c)]^{2}blackboard_E ∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_μ end_ARG ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ [ italic_ρ italic_R / ( italic_R + italic_c ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, where R𝑅Ritalic_R is the record-level clip** bound and c𝑐citalic_c is some positive constant. Also, there is a gap between 𝝁𝝁\bm{\mu}bold_italic_μ and 𝝁¯¯𝝁\bar{\bm{\mu}}over¯ start_ARG bold_italic_μ end_ARG and we assume 𝝁¯𝝁2(κ/R)2superscriptnorm¯𝝁𝝁2superscript𝜅𝑅2\|\bar{\bm{\mu}}-\bm{\mu}\|^{2}\leqslant(\kappa/R)^{2}∥ over¯ start_ARG bold_italic_μ end_ARG - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ ( italic_κ / italic_R ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. We assume 𝒚0subscript𝒚0\bm{y}_{0}bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is not very far away from both 𝝁𝝁\bm{\mu}bold_italic_μ and 𝝁¯¯𝝁\bar{\bm{\mu}}over¯ start_ARG bold_italic_μ end_ARG: 𝒚0𝝁¯2ϕ2superscriptnormsubscript𝒚0¯𝝁2superscriptitalic-ϕ2\|\bm{y}_{0}-\bar{\bm{\mu}}\|^{2}\leqslant\phi^{2}∥ bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_μ end_ARG ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_ϕ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and 𝒚0𝝁2τ2superscriptnormsubscript𝒚0𝝁2superscript𝜏2\|\bm{y}_{0}-\bm{\mu}\|^{2}\leqslant\tau^{2}∥ bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

Proof.

Our proof heavily relies on several useful lemmas shown in Appendix E, where Lemma 4 splits the L2-norm of summation of vectors into weighted summation of vectors’L2-norm, and Lemma 5 provides the optimal strategy to choose these weights.

We first consider the bound of 𝔼𝒎~t𝝁2𝔼superscriptnormsubscript~𝒎𝑡𝝁2\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Recall that the selected client set is t=ttsubscript𝑡subscript𝑡subscript𝑡\mathcal{I}_{t}=\mathcal{H}_{t}\cup\mathcal{B}_{t}caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∪ caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, where the honest clients set tsubscript𝑡\mathcal{H}_{t}caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and Byzantine clients set tsubscript𝑡\mathcal{B}_{t}caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are disjoint. For any positive values γ1,γ2,γ3>0subscript𝛾1subscript𝛾2subscript𝛾30\gamma_{1},\gamma_{2},\gamma_{3}>0italic_γ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_γ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_γ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT > 0 with γ1+γ2+γ3=1subscript𝛾1subscript𝛾2subscript𝛾31\gamma_{1}+\gamma_{2}+\gamma_{3}=1italic_γ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_γ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_γ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = 1, we have

|t|2𝔼𝒎~t𝝁2=|t|2𝔼(it𝒚i)+𝝃|t|𝝁2superscriptsubscript𝑡2𝔼superscriptnormsubscript~𝒎𝑡𝝁2superscriptsubscript𝑡2𝔼superscriptnormsubscript𝑖subscript𝑡subscript𝒚𝑖𝝃subscript𝑡𝝁2\displaystyle\qquad|\mathcal{I}_{t}|^{2}\cdot\mathbb{E}\|\tilde{\bm{m}}_{t}-% \bm{\mu}\|^{2}=|\mathcal{I}_{t}|^{2}\cdot\mathbb{E}\left\|\frac{\left(\sum_{i% \in\mathcal{I}_{t}}\bm{y}_{i}\right)+\bm{\xi}}{|\mathcal{I}_{t}|}-\bm{\mu}% \right\|^{2}| caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ blackboard_E ∥ divide start_ARG ( ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + bold_italic_ξ end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=(𝖺)𝔼it(𝒚i𝝁)+jt(𝒚j𝝁)+𝝃2𝖺𝔼superscriptnormsubscript𝑖subscript𝑡subscript𝒚𝑖𝝁subscript𝑗subscript𝑡subscript𝒚𝑗𝝁𝝃2\displaystyle\overset{(\mathsf{a})}{=}\mathbb{E}\left\|\sum\nolimits_{i\in% \mathcal{H}_{t}}(\bm{y}_{i}-\bm{\mu})+\sum\nolimits_{j\in\mathcal{B}_{t}}(\bm{% y}_{j}-\bm{\mu})+\bm{\xi}\right\|^{2}start_OVERACCENT ( sansserif_a ) end_OVERACCENT start_ARG = end_ARG blackboard_E ∥ ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - bold_italic_μ ) + ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - bold_italic_μ ) + bold_italic_ξ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
(𝖻)1γ1𝔼it(𝒚i𝝁)2𝒯1+1γ2𝔼jt(𝒚j𝝁)2𝒯2+1γ3𝔼𝝃2𝒯3𝖻1subscript𝛾1subscript𝔼superscriptnormsubscript𝑖subscript𝑡subscript𝒚𝑖𝝁2subscript𝒯11subscript𝛾2subscript𝔼superscriptnormsubscript𝑗subscript𝑡subscript𝒚𝑗𝝁2subscript𝒯21subscript𝛾3subscript𝔼superscriptnorm𝝃2subscript𝒯3\displaystyle\overset{(\mathsf{b})}{\leqslant}\frac{1}{\gamma_{1}}\underbrace{% \mathbb{E}\left\|\sum\nolimits_{i\in\mathcal{H}_{t}}(\bm{y}_{i}-\bm{\mu})% \right\|^{2}}_{\mathcal{T}_{1}}+\frac{1}{\gamma_{2}}\underbrace{\mathbb{E}% \left\|\sum\nolimits_{j\in\mathcal{B}_{t}}(\bm{y}_{j}-\bm{\mu})\right\|^{2}}_{% \mathcal{T}_{2}}+\frac{1}{\gamma_{3}}\underbrace{\mathbb{E}\|\bm{\xi}\|^{2}}_{% \mathcal{T}_{3}}start_OVERACCENT ( sansserif_b ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG 1 end_ARG start_ARG italic_γ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG under⏟ start_ARG blackboard_E ∥ ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - bold_italic_μ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT caligraphic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_γ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG under⏟ start_ARG blackboard_E ∥ ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - bold_italic_μ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT caligraphic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG italic_γ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG under⏟ start_ARG blackboard_E ∥ bold_italic_ξ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT caligraphic_T start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT

where (𝖺)𝖺(\mathsf{a})( sansserif_a ) used the fact that t=ttsubscript𝑡subscript𝑡subscript𝑡\mathcal{I}_{t}=\mathcal{H}_{t}\cup\mathcal{B}_{t}caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∪ caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and tt=subscript𝑡subscript𝑡\mathcal{H}_{t}\cap\mathcal{B}_{t}=\emptysetcaligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∩ caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∅; (𝖻)𝖻(\mathsf{b})( sansserif_b ) used the result in Lemma 4. From the above inequality, the error can be decomposed into three terms: 𝒯1subscript𝒯1\mathcal{T}_{1}caligraphic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT corresponds to the error of honest clients (who follow the protocol honestly) due to the randomness of clients’ training data and bias introduced by clip**, 𝒯2subscript𝒯2\mathcal{T}_{2}caligraphic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT corresponds to the error of Byzantine clients (who submit arbitrary 𝒎¯t,isubscript¯𝒎𝑡𝑖\bar{\bm{m}}_{t,i}over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT but will be clipped by the server), and 𝒯3subscript𝒯3\mathcal{T}_{3}caligraphic_T start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT corresponds to the error introduced by added Gaussian noise for privacy purpose. We will analyze each of the three errors in turn.

Bounding 𝒯1subscript𝒯1\mathcal{T}_{1}caligraphic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. Since 𝔼X2=𝔼[X]2+𝔼X𝔼[X]2𝔼superscriptnorm𝑋2superscriptnorm𝔼delimited-[]𝑋2𝔼superscriptnorm𝑋𝔼delimited-[]𝑋2\mathbb{E}\|X\|^{2}=\|\mathbb{E}[X]\|^{2}+\mathbb{E}\|X-\mathbb{E}[X]\|^{2}blackboard_E ∥ italic_X ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∥ blackboard_E [ italic_X ] ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + blackboard_E ∥ italic_X - blackboard_E [ italic_X ] ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT for any random vector X𝑋Xitalic_X, we can rewrite 𝒯1subscript𝒯1\mathcal{T}_{1}caligraphic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT as

𝒯1=it(𝔼[𝒚i]𝝁)2𝒯11+𝔼it(𝒚i𝔼[𝒚i])2𝒯12subscript𝒯1subscriptsuperscriptnormsubscript𝑖subscript𝑡𝔼delimited-[]subscript𝒚𝑖𝝁2subscript𝒯11subscript𝔼superscriptnormsubscript𝑖subscript𝑡subscript𝒚𝑖𝔼delimited-[]subscript𝒚𝑖2subscript𝒯12\displaystyle\mathcal{T}_{1}=\underbrace{{\left\|\sum\nolimits_{i\in\mathcal{H% }_{t}}(\mathbb{E}[\bm{y}_{i}]-\bm{\mu})\right\|^{2}}}_{\mathcal{T}_{11}}+% \underbrace{{\mathbb{E}\left\|\sum\nolimits_{i\in\mathcal{H}_{t}}(\bm{y}_{i}-% \mathbb{E}[\bm{y}_{i}])\right\|^{2}}}_{\mathcal{T}_{12}}caligraphic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = under⏟ start_ARG ∥ ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( blackboard_E [ bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] - bold_italic_μ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT caligraphic_T start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + under⏟ start_ARG blackboard_E ∥ ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - blackboard_E [ bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT caligraphic_T start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT end_POSTSUBSCRIPT

where 𝒯11subscript𝒯11\mathcal{T}_{11}caligraphic_T start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT corresponds to the bias introduced by the clip** operations, and 𝒯12subscript𝒯12\mathcal{T}_{12}caligraphic_T start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT is the variance of honest clients’ submissions. Rewrite 𝒛i=αi(𝒎¯t,i𝒚0)subscript𝒛𝑖subscript𝛼𝑖subscript¯𝒎𝑡𝑖subscript𝒚0\bm{z}_{i}=\alpha_{i}\cdot(\bar{\bm{m}}_{t,i}-\bm{y}_{0})bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ), where αi=min{1,C𝒎¯t,i𝒚0}(0,1]subscript𝛼𝑖1𝐶normsubscript¯𝒎𝑡𝑖subscript𝒚001\alpha_{i}=\min\{1,\frac{C}{\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|}\}\in(0,1]italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_min { 1 , divide start_ARG italic_C end_ARG start_ARG ∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ end_ARG } ∈ ( 0 , 1 ]. Let 𝟙isubscript1𝑖\mathbbm{1}_{i}blackboard_1 start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be an indicator variable denoting if the momentum difference 𝒎¯t,i𝒚0subscript¯𝒎𝑡𝑖subscript𝒚0\bar{\bm{m}}_{t,i}-\bm{y}_{0}over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT was clipped. Therefore, if 𝒎¯t,i𝒚0Cnormsubscript¯𝒎𝑡𝑖subscript𝒚0𝐶\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|\leqslant C∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ ⩽ italic_C, then 𝟙i=0subscript1𝑖0\mathbbm{1}_{i}=0blackboard_1 start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 and αi=1subscript𝛼𝑖1\alpha_{i}=1italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1; if 𝒎¯t,i𝒚0>Cnormsubscript¯𝒎𝑡𝑖subscript𝒚0𝐶\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|>C∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ > italic_C, then 𝟙i=1subscript1𝑖1\mathbbm{1}_{i}=1blackboard_1 start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 and 0<αi<10subscript𝛼𝑖10<\alpha_{i}<10 < italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT < 1. Then, for each it𝑖subscript𝑡i\in\mathcal{H}_{t}italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, we have

𝔼𝒛i(𝒎¯t,i𝒚0)=𝔼[(1αi)𝒎¯t,i𝒚0]𝔼normsubscript𝒛𝑖subscript¯𝒎𝑡𝑖subscript𝒚0𝔼delimited-[]1subscript𝛼𝑖normsubscript¯𝒎𝑡𝑖subscript𝒚0\displaystyle\quad\mathbb{E}\|\bm{z}_{i}-(\bar{\bm{m}}_{t,i}-\bm{y}_{0})\|=% \mathbb{E}[(1-\alpha_{i})\cdot\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|]blackboard_E ∥ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ∥ = blackboard_E [ ( 1 - italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ⋅ ∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ ]
𝔼[𝟙i𝒎¯t,i𝒚0]𝔼[𝟙i𝒎¯t,i𝒚02]C𝔼𝒎¯t,i𝒚02Cabsent𝔼delimited-[]subscript1𝑖normsubscript¯𝒎𝑡𝑖subscript𝒚0𝔼delimited-[]subscript1𝑖superscriptnormsubscript¯𝒎𝑡𝑖subscript𝒚02𝐶𝔼superscriptnormsubscript¯𝒎𝑡𝑖subscript𝒚02𝐶\displaystyle\leqslant\mathbb{E}[\mathbbm{1}_{i}\cdot\|\bar{\bm{m}}_{t,i}-\bm{% y}_{0}\|]\leqslant\frac{\mathbb{E}[\mathbbm{1}_{i}\cdot\|\bar{\bm{m}}_{t,i}-% \bm{y}_{0}\|^{2}]}{C}\leqslant\frac{\mathbb{E}\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}% \|^{2}}{C}⩽ blackboard_E [ blackboard_1 start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ ∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ ] ⩽ divide start_ARG blackboard_E [ blackboard_1 start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ ∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] end_ARG start_ARG italic_C end_ARG ⩽ divide start_ARG blackboard_E ∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C end_ARG

where

𝔼𝒎¯t,i𝒚02=𝔼(𝒎¯t,i𝝁¯)+(𝝁¯𝒚0)2𝔼superscriptnormsubscript¯𝒎𝑡𝑖subscript𝒚02𝔼superscriptnormsubscript¯𝒎𝑡𝑖¯𝝁¯𝝁subscript𝒚02\displaystyle\qquad\mathbb{E}\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|^{2}=\mathbb{E}% \|(\bar{\bm{m}}_{t,i}-\bar{\bm{\mu}})+(\bar{\bm{\mu}}-\bm{y}_{0})\|^{2}blackboard_E ∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = blackboard_E ∥ ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_μ end_ARG ) + ( over¯ start_ARG bold_italic_μ end_ARG - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
(𝖺)𝔼𝒎¯t,i𝝁¯2γ+𝔼𝝁¯𝒚021γ[ρR/(R+c)]2γ+ϕ21γ𝖺𝔼superscriptnormsubscript¯𝒎𝑡𝑖¯𝝁2𝛾𝔼superscriptnorm¯𝝁subscript𝒚021𝛾superscriptdelimited-[]𝜌𝑅𝑅𝑐2𝛾superscriptitalic-ϕ21𝛾\displaystyle\overset{(\mathsf{a})}{\leqslant}\frac{\mathbb{E}\|\bar{\bm{m}}_{% t,i}-\bar{\bm{\mu}}\|^{2}}{\gamma}+\frac{\mathbb{E}\|\bar{\bm{\mu}}-\bm{y}_{0}% \|^{2}}{1-\gamma}\leqslant\frac{[\rho R/(R+c)]^{2}}{\gamma}+\frac{\phi^{2}}{1-\gamma}start_OVERACCENT ( sansserif_a ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG blackboard_E ∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_μ end_ARG ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ end_ARG + divide start_ARG blackboard_E ∥ over¯ start_ARG bold_italic_μ end_ARG - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_γ end_ARG ⩽ divide start_ARG [ italic_ρ italic_R / ( italic_R + italic_c ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ end_ARG + divide start_ARG italic_ϕ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_γ end_ARG
=(𝖻)[ρR/(R+c)+ϕ]2𝖻superscriptdelimited-[]𝜌𝑅𝑅𝑐italic-ϕ2\displaystyle\overset{(\mathsf{b})}{=}[\rho R/(R+c)+\phi]^{2}start_OVERACCENT ( sansserif_b ) end_OVERACCENT start_ARG = end_ARG [ italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where (𝖺)𝖺(\mathsf{a})( sansserif_a ) is obtained by using Lemma 4 for any γ(0,1)𝛾01\gamma\in(0,1)italic_γ ∈ ( 0 , 1 ); (𝖻)𝖻(\mathsf{b})( sansserif_b ) is obtained by taking γ=ρR/(R+c)ρR/(R+c)+ϕ𝛾𝜌𝑅𝑅𝑐𝜌𝑅𝑅𝑐italic-ϕ\gamma=\frac{\rho R/(R+c)}{\rho R/(R+c)+\phi}italic_γ = divide start_ARG italic_ρ italic_R / ( italic_R + italic_c ) end_ARG start_ARG italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ end_ARG. Therefore,

𝔼[𝒚i]𝝁2=(𝖺)𝔼[𝒚0+𝒛i𝒎¯t,i]+(𝝁¯𝝁)2superscriptnorm𝔼delimited-[]subscript𝒚𝑖𝝁2𝖺superscriptnorm𝔼delimited-[]subscript𝒚0subscript𝒛𝑖subscript¯𝒎𝑡𝑖¯𝝁𝝁2\displaystyle\qquad\|\mathbb{E}[\bm{y}_{i}]-\bm{\mu}\|^{2}\overset{(\mathsf{a}% )}{=}\|\mathbb{E}[\bm{y}_{0}+\bm{z}_{i}-\bar{\bm{m}}_{t,i}]+(\bar{\bm{\mu}}-% \bm{\mu})\|^{2}∥ blackboard_E [ bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_OVERACCENT ( sansserif_a ) end_OVERACCENT start_ARG = end_ARG ∥ blackboard_E [ bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ] + ( over¯ start_ARG bold_italic_μ end_ARG - bold_italic_μ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
(𝖻)𝔼[𝒛i(𝒎¯t,i𝒚0)]2γ+𝝁¯𝝁21γ𝖻superscriptnorm𝔼delimited-[]subscript𝒛𝑖subscript¯𝒎𝑡𝑖subscript𝒚02𝛾superscriptnorm¯𝝁𝝁21𝛾\displaystyle\overset{(\mathsf{b})}{\leqslant}\frac{\|\mathbb{E}[\bm{z}_{i}-(% \bar{\bm{m}}_{t,i}-\bm{y}_{0})]\|^{2}}{\gamma}+\frac{\|\bar{\bm{\mu}}-\bm{\mu}% \|^{2}}{1-\gamma}start_OVERACCENT ( sansserif_b ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG ∥ blackboard_E [ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ] ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ end_ARG + divide start_ARG ∥ over¯ start_ARG bold_italic_μ end_ARG - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_γ end_ARG
(𝖼)(𝔼𝒛i(𝒎¯t,i𝒚0))2γ+𝝁¯𝝁21γ𝖼superscript𝔼normsubscript𝒛𝑖subscript¯𝒎𝑡𝑖subscript𝒚02𝛾superscriptnorm¯𝝁𝝁21𝛾\displaystyle\overset{(\mathsf{c})}{\leqslant}\frac{\left(\mathbb{E}\|\bm{z}_{% i}-(\bar{\bm{m}}_{t,i}-\bm{y}_{0})\|\right)^{2}}{\gamma}+\frac{\|\bar{\bm{\mu}% }-\bm{\mu}\|^{2}}{1-\gamma}start_OVERACCENT ( sansserif_c ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG ( blackboard_E ∥ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ( over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ∥ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ end_ARG + divide start_ARG ∥ over¯ start_ARG bold_italic_μ end_ARG - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_γ end_ARG
(𝖽)[ρR/(R+c)+ϕ]4γC2+(κ/R)21γ𝖽superscriptdelimited-[]𝜌𝑅𝑅𝑐italic-ϕ4𝛾superscript𝐶2superscript𝜅𝑅21𝛾\displaystyle\overset{(\mathsf{d})}{\leqslant}\frac{[\rho R/(R+c)+\phi]^{4}}{% \gamma C^{2}}+\frac{(\kappa/R)^{2}}{1-\gamma}start_OVERACCENT ( sansserif_d ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG [ italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ ] start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG ( italic_κ / italic_R ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_γ end_ARG
=(𝖾)[[ρR/(R+c)+ϕ]2C+κ/R]2𝖾superscriptdelimited-[]superscriptdelimited-[]𝜌𝑅𝑅𝑐italic-ϕ2𝐶𝜅𝑅2\displaystyle\overset{(\mathsf{e})}{=}\left[\frac{[\rho R/(R+c)+\phi]^{2}}{C}+% \kappa/R\right]^{2}start_OVERACCENT ( sansserif_e ) end_OVERACCENT start_ARG = end_ARG [ divide start_ARG [ italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C end_ARG + italic_κ / italic_R ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where (𝖺)𝖺(\mathsf{a})( sansserif_a ) is obtained from the definitions 𝒚i=𝒚0+𝒛isubscript𝒚𝑖subscript𝒚0subscript𝒛𝑖\bm{y}_{i}=\bm{y}_{0}+\bm{z}_{i}bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and 𝔼[𝒎¯t,i]=𝝁¯𝔼delimited-[]subscript¯𝒎𝑡𝑖¯𝝁\mathbb{E}[\bar{\bm{m}}_{t,i}]=\bar{\bm{\mu}}blackboard_E [ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ] = over¯ start_ARG bold_italic_μ end_ARG; (𝖻)𝖻(\mathsf{b})( sansserif_b ) is obtained by using Lemma 4 for any γ(0,1)𝛾01\gamma\in(0,1)italic_γ ∈ ( 0 , 1 ); (𝖼)𝖼(\mathsf{c})( sansserif_c ) is derived from Jensen’s Inequality, i.e., 𝔼[f(X)]f(𝔼[X])𝔼delimited-[]𝑓𝑋𝑓𝔼delimited-[]𝑋\mathbb{E}[f(X)]\geqslant f(\mathbb{E}[X])blackboard_E [ italic_f ( italic_X ) ] ⩾ italic_f ( blackboard_E [ italic_X ] ) for convex function f(X)X𝑓𝑋norm𝑋f(X)\coloneqq\|X\|italic_f ( italic_X ) ≔ ∥ italic_X ∥; (𝖽)𝖽(\mathsf{d})( sansserif_d ) is obtained by plugging in the previous two inequalities; (𝖾)𝖾(\mathsf{e})( sansserif_e ) is obtained by taking γ=[ρR/(R+c)+ϕ]2[ρR/(R+c)+ϕ]2+Cκ/R𝛾superscriptdelimited-[]𝜌𝑅𝑅𝑐italic-ϕ2superscriptdelimited-[]𝜌𝑅𝑅𝑐italic-ϕ2𝐶𝜅𝑅\gamma=\frac{[\rho R/(R+c)+\phi]^{2}}{[\rho R/(R+c)+\phi]^{2}+C\kappa/R}italic_γ = divide start_ARG [ italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG [ italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_C italic_κ / italic_R end_ARG. Now, we can bound 𝒯11subscript𝒯11\mathcal{T}_{11}caligraphic_T start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT by:

𝒯11|t|it𝔼[𝒚i]𝝁2|t|2[[ρR/(R+c)+ϕ]2C+κR]2subscript𝒯11subscript𝑡subscript𝑖subscript𝑡superscriptnorm𝔼delimited-[]subscript𝒚𝑖𝝁2superscriptsubscript𝑡2superscriptdelimited-[]superscriptdelimited-[]𝜌𝑅𝑅𝑐italic-ϕ2𝐶𝜅𝑅2\displaystyle\mathcal{T}_{11}\leqslant|\mathcal{H}_{t}|\sum_{i\in\mathcal{H}_{% t}}\left\|\mathbb{E}[\bm{y}_{i}]-\bm{\mu}\right\|^{2}\leqslant|\mathcal{H}_{t}% |^{2}\left[\frac{[\rho R/(R+c)+\phi]^{2}}{C}+\frac{\kappa}{R}\right]^{2}caligraphic_T start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT ⩽ | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ blackboard_E [ bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT [ divide start_ARG [ italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C end_ARG + divide start_ARG italic_κ end_ARG start_ARG italic_R end_ARG ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where the first inequality is obtained by using Lemma 4. On the other hand, we can bound 𝒯12subscript𝒯12\mathcal{T}_{12}caligraphic_T start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT by

𝒯12subscript𝒯12\displaystyle\mathcal{T}_{12}caligraphic_T start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT =(𝖺)𝔼it𝒚i𝔼[𝒚i]2(𝖻)𝔼it𝒎¯t,i𝔼[𝒎¯t,i]2𝖺𝔼subscript𝑖subscript𝑡superscriptnormsubscript𝒚𝑖𝔼delimited-[]subscript𝒚𝑖2𝖻𝔼subscript𝑖subscript𝑡superscriptnormsubscript¯𝒎𝑡𝑖𝔼delimited-[]subscript¯𝒎𝑡𝑖2\displaystyle\overset{(\mathsf{a})}{=}\mathbb{E}\sum\nolimits_{i\in\mathcal{H}% _{t}}\|\bm{y}_{i}-\mathbb{E}[\bm{y}_{i}]\|^{2}\overset{(\mathsf{b})}{\leqslant% }\mathbb{E}\sum\nolimits_{i\in\mathcal{H}_{t}}\|\bar{\bm{m}}_{t,i}-\mathbb{E}[% \bar{\bm{m}}_{t,i}]\|^{2}start_OVERACCENT ( sansserif_a ) end_OVERACCENT start_ARG = end_ARG blackboard_E ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - blackboard_E [ bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_OVERACCENT ( sansserif_b ) end_OVERACCENT start_ARG ⩽ end_ARG blackboard_E ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - blackboard_E [ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ] ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
|t|[ρR/(R+c)+ϕ]2absentsubscript𝑡superscriptdelimited-[]𝜌𝑅𝑅𝑐italic-ϕ2\displaystyle\leqslant|\mathcal{H}_{t}|\cdot[\rho R/(R+c)+\phi]^{2}⩽ | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | ⋅ [ italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where (𝖺)𝖺(\mathsf{a})( sansserif_a ) used the assumption that {𝒎¯t,i}itsubscriptsubscript¯𝒎𝑡𝑖𝑖subscript𝑡\{\bar{\bm{m}}_{t,i}\}_{i\in\mathcal{H}_{t}}{ over¯ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT are independent, then the random variables {𝒚i}itsubscriptsubscript𝒚𝑖𝑖subscript𝑡\{\bm{y}_{i}\}_{i\in\mathcal{H}_{t}}{ bold_italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT are also independent; (𝖻)𝖻(\mathsf{b})( sansserif_b ) used contractivity of a clip** (projection) step. Therefore,

𝒯1subscript𝒯1\displaystyle\mathcal{T}_{1}caligraphic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT =𝒯11+𝒯12|t|2(ψ2C+κR+c)2+|t|ψ2absentsubscript𝒯11subscript𝒯12superscriptsubscript𝑡2superscriptsuperscript𝜓2𝐶𝜅𝑅𝑐2subscript𝑡superscript𝜓2\displaystyle=\mathcal{T}_{11}+\mathcal{T}_{12}\leqslant|\mathcal{H}_{t}|^{2}% \left(\frac{\psi^{2}}{C}+\frac{\kappa}{R+c}\right)^{2}+|\mathcal{H}_{t}|\psi^{2}= caligraphic_T start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT + caligraphic_T start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT ⩽ | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( divide start_ARG italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C end_ARG + divide start_ARG italic_κ end_ARG start_ARG italic_R + italic_c end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
4|t|2ψ4C2+|t|ψ2(2|t|ψ2C+|t|ψ)2absent4superscriptsubscript𝑡2superscript𝜓4superscript𝐶2subscript𝑡superscript𝜓2superscript2subscript𝑡superscript𝜓2𝐶subscript𝑡𝜓2\displaystyle\leqslant\frac{4|\mathcal{H}_{t}|^{2}\psi^{4}}{C^{2}}+|\mathcal{H% }_{t}|\psi^{2}\leqslant\left(\frac{2|\mathcal{H}_{t}|\psi^{2}}{C}+\sqrt{|% \mathcal{H}_{t}|}\psi\right)^{2}⩽ divide start_ARG 4 | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_ψ start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ ( divide start_ARG 2 | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C end_ARG + square-root start_ARG | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG italic_ψ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where ψρR/(R+c)+ϕ𝜓𝜌𝑅𝑅𝑐italic-ϕ\psi\coloneqq\rho R/(R+c)+\phiitalic_ψ ≔ italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ, and the second inequality holds with the assumption Cψ2R/κ𝐶superscript𝜓2𝑅𝜅C\leqslant\psi^{2}R/\kappaitalic_C ⩽ italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_R / italic_κ (thus we have ψ2CκRsuperscript𝜓2𝐶𝜅𝑅\frac{\psi^{2}}{C}\geqslant\frac{\kappa}{R}divide start_ARG italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C end_ARG ⩾ divide start_ARG italic_κ end_ARG start_ARG italic_R end_ARG)

Bounding 𝒯2subscript𝒯2\mathcal{T}_{2}caligraphic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. For any Byzantine client 𝖢jsubscript𝖢𝑗\mathsf{C}_{j}sansserif_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT with jt𝑗subscript𝑡j\in\mathcal{B}_{t}italic_j ∈ caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, the error is bounded by the clip** step

𝔼𝒚j𝝁2𝔼superscriptnormsubscript𝒚𝑗𝝁2\displaystyle\mathbb{E}\|\bm{y}_{j}-\bm{\mu}\|^{2}blackboard_E ∥ bold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT =𝔼𝒛j+(𝒚0𝝁)2(𝖺)𝔼𝒛j2γ+𝔼𝒚0𝝁21γabsent𝔼superscriptnormsubscript𝒛𝑗subscript𝒚0𝝁2𝖺𝔼superscriptnormsubscript𝒛𝑗2𝛾𝔼superscriptnormsubscript𝒚0𝝁21𝛾\displaystyle=\mathbb{E}\|\bm{z}_{j}+(\bm{y}_{0}-\bm{\mu})\|^{2}\overset{(% \mathsf{a})}{\leqslant}\frac{\mathbb{E}\|\bm{z}_{j}\|^{2}}{\gamma}+\frac{% \mathbb{E}\|\bm{y}_{0}-\bm{\mu}\|^{2}}{1-\gamma}= blackboard_E ∥ bold_italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + ( bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_italic_μ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_OVERACCENT ( sansserif_a ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG blackboard_E ∥ bold_italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ end_ARG + divide start_ARG blackboard_E ∥ bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_γ end_ARG
(𝖻)C2γ+τ21γ=(𝖼)(C+τ)2𝖻superscript𝐶2𝛾superscript𝜏21𝛾𝖼superscript𝐶𝜏2\displaystyle\overset{(\mathsf{b})}{\leqslant}\frac{C^{2}}{\gamma}+\frac{\tau^% {2}}{1-\gamma}\overset{(\mathsf{c})}{=}(C+\tau)^{2}start_OVERACCENT ( sansserif_b ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ end_ARG + divide start_ARG italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_γ end_ARG start_OVERACCENT ( sansserif_c ) end_OVERACCENT start_ARG = end_ARG ( italic_C + italic_τ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where (𝖺)𝖺(\mathsf{a})( sansserif_a ) is obtained by using in Lemma 4 for any γ(0,1)𝛾01\gamma\in(0,1)italic_γ ∈ ( 0 , 1 ); (𝖻)𝖻(\mathsf{b})( sansserif_b ) is obtained by the definition of 𝒛jsubscript𝒛𝑗\bm{z}_{j}bold_italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and the assumption; (𝖼)𝖼(\mathsf{c})( sansserif_c ) is obtained by taking γ=CC+τ𝛾𝐶𝐶𝜏\gamma=\frac{C}{C+\tau}italic_γ = divide start_ARG italic_C end_ARG start_ARG italic_C + italic_τ end_ARG. Then, by using Lemma 4, we have

𝒯2|t|jt𝔼𝒚j𝝁2|t|2(C+τ)2subscript𝒯2subscript𝑡subscript𝑗subscript𝑡𝔼superscriptnormsubscript𝒚𝑗𝝁2superscriptsubscript𝑡2superscript𝐶𝜏2\displaystyle\mathcal{T}_{2}\leqslant|\mathcal{B}_{t}|\cdot\sum_{j\in\mathcal{% B}_{t}}\mathbb{E}\|\bm{y}_{j}-\bm{\mu}\|^{2}\leqslant|\mathcal{B}_{t}|^{2}(C+% \tau)^{2}caligraphic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⩽ | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | ⋅ ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E ∥ bold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_C + italic_τ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

Bounding 𝒯3subscript𝒯3\mathcal{T}_{3}caligraphic_T start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT. Since the random noise 𝝃𝒩(0,R2σ2𝐈d)dsimilar-to𝝃𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑superscript𝑑\bm{\xi}\sim\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})\in\mathbb{R}^{d}bold_italic_ξ ∼ caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, we have 𝒯3=dR2σ2subscript𝒯3𝑑superscript𝑅2superscript𝜎2\mathcal{T}_{3}=dR^{2}\sigma^{2}caligraphic_T start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = italic_d italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

Putting into Together. Combining all terms, we have

𝔼𝒎~t𝝁2𝔼superscriptnormsubscript~𝒎𝑡𝝁2\displaystyle\qquad\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
1|t|2[1γ1(2|t|ψ2C+|t|ψ)2+|t|2(C+τ)2γ2+dR2σ2γ3]absent1superscriptsubscript𝑡2delimited-[]1subscript𝛾1superscript2subscript𝑡superscript𝜓2𝐶subscript𝑡𝜓2superscriptsubscript𝑡2superscript𝐶𝜏2subscript𝛾2𝑑superscript𝑅2superscript𝜎2subscript𝛾3\displaystyle\leqslant\frac{1}{|\mathcal{I}_{t}|^{2}}\left[\frac{1}{\gamma_{1}% }\left(\frac{2|\mathcal{H}_{t}|\psi^{2}}{C}+\sqrt{|\mathcal{H}_{t}|}\psi\right% )^{2}+\frac{|\mathcal{B}_{t}|^{2}(C+\tau)^{2}}{\gamma_{2}}+\frac{dR^{2}\sigma^% {2}}{\gamma_{3}}\right]⩽ divide start_ARG 1 end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG [ divide start_ARG 1 end_ARG start_ARG italic_γ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ( divide start_ARG 2 | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C end_ARG + square-root start_ARG | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG italic_ψ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_C + italic_τ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG + divide start_ARG italic_d italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG ]
=(𝖺)1|t|2[(2|t|ψ2C+|t|ψ)+|t|(C+τ)+dRσ]2𝖺1superscriptsubscript𝑡2superscriptdelimited-[]2subscript𝑡superscript𝜓2𝐶subscript𝑡𝜓subscript𝑡𝐶𝜏𝑑𝑅𝜎2\displaystyle\overset{(\mathsf{a})}{=}\frac{1}{|\mathcal{I}_{t}|^{2}}\left[% \left(\frac{2|\mathcal{H}_{t}|\psi^{2}}{C}+\sqrt{|\mathcal{H}_{t}|}\psi\right)% +|\mathcal{B}_{t}|(C+\tau)+\sqrt{d}R\sigma\right]^{2}start_OVERACCENT ( sansserif_a ) end_OVERACCENT start_ARG = end_ARG divide start_ARG 1 end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG [ ( divide start_ARG 2 | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C end_ARG + square-root start_ARG | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG italic_ψ ) + | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | ( italic_C + italic_τ ) + square-root start_ARG italic_d end_ARG italic_R italic_σ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
(𝖻)1|t|2[2κ|t|(ρ+ϕ)2ϕ2R+(|t|+dσ)R+|t|(ρ+ϕ)+|t|τ]2𝖻1superscriptsubscript𝑡2superscriptdelimited-[]2𝜅subscript𝑡superscript𝜌italic-ϕ2superscriptitalic-ϕ2𝑅subscript𝑡𝑑𝜎𝑅subscript𝑡𝜌italic-ϕsubscript𝑡𝜏2\displaystyle\overset{(\mathsf{b})}{\leqslant}\frac{1}{|\mathcal{I}_{t}|^{2}}% \left[\frac{2\kappa|\mathcal{H}_{t}|(\rho+\phi)^{2}}{\phi^{2}\cdot R}+(|% \mathcal{B}_{t}|+\sqrt{d}\sigma)R+\sqrt{|\mathcal{H}_{t}|}(\rho+\phi)+|% \mathcal{B}_{t}|\tau\right]^{2}start_OVERACCENT ( sansserif_b ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG 1 end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG [ divide start_ARG 2 italic_κ | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | ( italic_ρ + italic_ϕ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_ϕ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_R end_ARG + ( | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | + square-root start_ARG italic_d end_ARG italic_σ ) italic_R + square-root start_ARG | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG ( italic_ρ + italic_ϕ ) + | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_τ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=(𝖼)1|t|2[2(ρ+ϕ)ϕ2κ|t|(|t|+dσ)+|t|(ρ+ϕ)+|t|τ]2𝖼1superscriptsubscript𝑡2superscriptdelimited-[]2𝜌italic-ϕitalic-ϕ2𝜅subscript𝑡subscript𝑡𝑑𝜎subscript𝑡𝜌italic-ϕsubscript𝑡𝜏2\displaystyle\overset{(\mathsf{c})}{=}\frac{1}{|\mathcal{I}_{t}|^{2}}\left[% \frac{2(\rho+\phi)}{\phi}\sqrt{2\kappa|\mathcal{H}_{t}|(|\mathcal{B}_{t}|+% \sqrt{d}\sigma)}+\sqrt{|\mathcal{H}_{t}|}(\rho+\phi)+|\mathcal{B}_{t}|\tau% \right]^{2}start_OVERACCENT ( sansserif_c ) end_OVERACCENT start_ARG = end_ARG divide start_ARG 1 end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG [ divide start_ARG 2 ( italic_ρ + italic_ϕ ) end_ARG start_ARG italic_ϕ end_ARG square-root start_ARG 2 italic_κ | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | ( | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | + square-root start_ARG italic_d end_ARG italic_σ ) end_ARG + square-root start_ARG | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG ( italic_ρ + italic_ϕ ) + | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_τ ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=[2(ρ+ϕ)|t|ϕ2κ|t|(|t|+dσ)+|t|(ρ+ϕ)|t|+|t|τ|t|Φ]2absentsuperscriptdelimited-[]subscript2𝜌italic-ϕsubscript𝑡italic-ϕ2𝜅subscript𝑡subscript𝑡𝑑𝜎subscript𝑡𝜌italic-ϕsubscript𝑡subscript𝑡𝜏subscript𝑡absentΦ2\displaystyle=\left[\underbrace{\frac{2(\rho+\phi)}{|\mathcal{I}_{t}|\phi}% \sqrt{2\kappa|\mathcal{H}_{t}|(|\mathcal{B}_{t}|+\sqrt{d}\sigma)}+\frac{\sqrt{% |\mathcal{H}_{t}|}(\rho+\phi)}{|\mathcal{I}_{t}|}+\frac{|\mathcal{B}_{t}|\tau}% {|\mathcal{I}_{t}|}}_{\eqqcolon\Phi}\right]^{2}= [ under⏟ start_ARG divide start_ARG 2 ( italic_ρ + italic_ϕ ) end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_ϕ end_ARG square-root start_ARG 2 italic_κ | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | ( | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | + square-root start_ARG italic_d end_ARG italic_σ ) end_ARG + divide start_ARG square-root start_ARG | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG ( italic_ρ + italic_ϕ ) end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG + divide start_ARG | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_τ end_ARG start_ARG | caligraphic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG end_ARG start_POSTSUBSCRIPT ≕ roman_Φ end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (21)

where (𝖺)𝖺(\mathsf{a})( sansserif_a ) is obtained by taking γk=ΦkΦ1+Φ2+Φ3subscript𝛾𝑘subscriptΦ𝑘subscriptΦ1subscriptΦ2subscriptΦ3\gamma_{k}=\frac{\sqrt{\Phi_{k}}}{\sqrt{\Phi_{1}}+\sqrt{\Phi_{2}}+\sqrt{\Phi_{% 3}}}italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = divide start_ARG square-root start_ARG roman_Φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG end_ARG start_ARG square-root start_ARG roman_Φ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG + square-root start_ARG roman_Φ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG + square-root start_ARG roman_Φ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG end_ARG for k=1,2,3𝑘123k=1,2,3italic_k = 1 , 2 , 3, where Φ1(2|t|ψ2C+|t|ψ)2,Φ2|t|2(C+τ)2,Φ3dR2σ2formulae-sequencesubscriptΦ1superscript2subscript𝑡superscript𝜓2𝐶subscript𝑡𝜓2formulae-sequencesubscriptΦ2superscriptsubscript𝑡2superscript𝐶𝜏2subscriptΦ3𝑑superscript𝑅2superscript𝜎2\Phi_{1}\coloneqq\left(\frac{2|\mathcal{H}_{t}|\psi^{2}}{C}+\sqrt{|\mathcal{H}% _{t}|}\psi\right)^{2},\Phi_{2}\coloneqq|\mathcal{B}_{t}|^{2}(C+\tau)^{2},\Phi_% {3}\coloneqq dR^{2}\sigma^{2}roman_Φ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≔ ( divide start_ARG 2 | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_C end_ARG + square-root start_ARG | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG italic_ψ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , roman_Φ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ≔ | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_C + italic_τ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , roman_Φ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ≔ italic_d italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT; (𝖻)𝖻(\mathsf{b})( sansserif_b ) is obtained by considering ψ=ρR/(R+c)+ϕ(ρ+ϕ)𝜓𝜌𝑅𝑅𝑐italic-ϕ𝜌italic-ϕ\psi=\rho R/(R+c)+\phi\leqslant(\rho+\phi)italic_ψ = italic_ρ italic_R / ( italic_R + italic_c ) + italic_ϕ ⩽ ( italic_ρ + italic_ϕ ) and taking the clip** bound C=ϕ2κR𝐶superscriptitalic-ϕ2𝜅𝑅C=\frac{\phi^{2}}{\kappa}Ritalic_C = divide start_ARG italic_ϕ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_κ end_ARG italic_R, which makes the previous assumption Cψ2R/κ𝐶superscript𝜓2𝑅𝜅C\leqslant\psi^{2}R/\kappaitalic_C ⩽ italic_ψ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_R / italic_κ holds; (𝖼)𝖼(\mathsf{c})( sansserif_c ) is obtained by taking R=ρ+ϕϕ2κ|t||t|+dσ𝑅𝜌italic-ϕitalic-ϕ2𝜅subscript𝑡subscript𝑡𝑑𝜎R=\frac{\rho+\phi}{\phi}\sqrt{\frac{2\kappa|\mathcal{H}_{t}|}{|\mathcal{B}_{t}% |+\sqrt{d}\sigma}}italic_R = divide start_ARG italic_ρ + italic_ϕ end_ARG start_ARG italic_ϕ end_ARG square-root start_ARG divide start_ARG 2 italic_κ | caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG start_ARG | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | + square-root start_ARG italic_d end_ARG italic_σ end_ARG end_ARG, where |t|||qsubscript𝑡𝑞|\mathcal{H}_{t}|\approx|\mathcal{H}|q| caligraphic_H start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | ≈ | caligraphic_H | italic_q and |t|||qsubscript𝑡𝑞|\mathcal{B}_{t}|\approx|\mathcal{B}|q| caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | ≈ | caligraphic_B | italic_q. Since ||+||=n𝑛|\mathcal{H}|+|\mathcal{B}|=n| caligraphic_H | + | caligraphic_B | = italic_n and ||/n<1/2𝑛12|\mathcal{B}|/n<1/2| caligraphic_B | / italic_n < 1 / 2, we can approximate the tuning by RO(ρn/(|t|+dσ/q)R\propto O(\rho\sqrt{n/(|\mathcal{B}_{t}|+\sqrt{d}\sigma/q})italic_R ∝ italic_O ( italic_ρ square-root start_ARG italic_n / ( | caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | + square-root start_ARG italic_d end_ARG italic_σ / italic_q end_ARG ).

The Final Result. On the other hand, we have

𝔼𝝁𝒎t2𝔼superscriptnorm𝝁superscriptsubscript𝒎𝑡2\displaystyle\mathbb{E}\|\bm{\mu}-\bm{m}_{t}^{*}\|^{2}blackboard_E ∥ bold_italic_μ - bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT =1||2𝔼i(𝒎t,i𝝁)2absent1superscript2𝔼superscriptnormsubscript𝑖subscript𝒎𝑡𝑖𝝁2\displaystyle=\frac{1}{|\mathcal{H}|^{2}}\mathbb{E}\left\|\sum\nolimits_{i\in% \mathcal{H}}(\bm{m}_{t,i}-\bm{\mu})\right\|^{2}= divide start_ARG 1 end_ARG start_ARG | caligraphic_H | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG blackboard_E ∥ ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H end_POSTSUBSCRIPT ( bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_μ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=1||2𝔼i𝒎t,i𝝁2ρ2||absent1superscript2𝔼subscript𝑖superscriptnormsubscript𝒎𝑡𝑖𝝁2superscript𝜌2\displaystyle=\frac{1}{|\mathcal{H}|^{2}}\mathbb{E}\sum\nolimits_{i\in\mathcal% {H}}\left\|\bm{m}_{t,i}-\bm{\mu}\right\|^{2}\leqslant\frac{\rho^{2}}{|\mathcal% {H}|}= divide start_ARG 1 end_ARG start_ARG | caligraphic_H | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG blackboard_E ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_H end_POSTSUBSCRIPT ∥ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ divide start_ARG italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG | caligraphic_H | end_ARG (22)

where the first equality is obtained by the definition of 𝒎tsuperscriptsubscript𝒎𝑡\bm{m}_{t}^{*}bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT; the second equality is obtained by the fact that all honest clients’ momentum {𝒎t,i}isubscriptsubscript𝒎𝑡𝑖𝑖\{\bm{m}_{t,i}\}_{i\in\mathcal{H}}{ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_H end_POSTSUBSCRIPT are independent with each other; and the third equality is obtained by the assumption 𝒎t,i𝝁2ρ2superscriptnormsubscript𝒎𝑡𝑖𝝁2superscript𝜌2\left\|\bm{m}_{t,i}-\bm{\mu}\right\|^{2}\leqslant\rho^{2}∥ bold_italic_m start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT for i𝑖i\in\mathcal{H}italic_i ∈ caligraphic_H. Finally, we have

𝔼𝒎~t𝒎t2𝔼superscriptnormsubscript~𝒎𝑡superscriptsubscript𝒎𝑡2\displaystyle\qquad\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=𝔼(𝒎~t𝝁)+(𝝁𝒎t)2(𝖺)𝔼𝒎~t𝝁2γ+𝔼𝝁𝒎t21γabsent𝔼superscriptnormsubscript~𝒎𝑡𝝁𝝁superscriptsubscript𝒎𝑡2𝖺𝔼superscriptnormsubscript~𝒎𝑡𝝁2𝛾𝔼superscriptnorm𝝁superscriptsubscript𝒎𝑡21𝛾\displaystyle=\mathbb{E}\|(\tilde{\bm{m}}_{t}-\bm{\mu})+(\bm{\mu}-\bm{m}_{t}^{% *})\|^{2}\overset{(\mathsf{a})}{\leqslant}\frac{\mathbb{E}\|\tilde{\bm{m}}_{t}% -\bm{\mu}\|^{2}}{\gamma}+\frac{\mathbb{E}\|\bm{\mu}-\bm{m}_{t}^{*}\|^{2}}{1-\gamma}= blackboard_E ∥ ( over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_μ ) + ( bold_italic_μ - bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_OVERACCENT ( sansserif_a ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_μ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ end_ARG + divide start_ARG blackboard_E ∥ bold_italic_μ - bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_γ end_ARG
(𝖻)Φ2γ+ρ2/||1γ=(𝖼)(Φ+ρ||)2𝖻superscriptΦ2𝛾superscript𝜌21𝛾𝖼superscriptΦ𝜌2\displaystyle\overset{(\mathsf{b})}{\leqslant}\frac{\Phi^{2}}{\gamma}+\frac{% \rho^{2}/|\mathcal{H}|}{1-\gamma}\overset{(\mathsf{c})}{=}\left(\Phi+\frac{% \rho}{\sqrt{|\mathcal{H}|}}\right)^{2}start_OVERACCENT ( sansserif_b ) end_OVERACCENT start_ARG ⩽ end_ARG divide start_ARG roman_Φ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_γ end_ARG + divide start_ARG italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / | caligraphic_H | end_ARG start_ARG 1 - italic_γ end_ARG start_OVERACCENT ( sansserif_c ) end_OVERACCENT start_ARG = end_ARG ( roman_Φ + divide start_ARG italic_ρ end_ARG start_ARG square-root start_ARG | caligraphic_H | end_ARG end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (23)

where (𝖺)𝖺(\mathsf{a})( sansserif_a ) is obtained by using Lemma 4 for any γ(0,1)𝛾01\gamma\in(0,1)italic_γ ∈ ( 0 , 1 ); (𝖻)𝖻(\mathsf{b})( sansserif_b ) is obtained from (C) and (C), where ΦΦ\Phiroman_Φ is defined in (C); (𝖼)𝖼(\mathsf{c})( sansserif_c ) is obtained by taking γ=ΦΦ+ρ||𝛾ΦΦ𝜌\gamma=\frac{\Phi}{\Phi+\frac{\rho}{\sqrt{|\mathcal{H}|}}}italic_γ = divide start_ARG roman_Φ end_ARG start_ARG roman_Φ + divide start_ARG italic_ρ end_ARG start_ARG square-root start_ARG | caligraphic_H | end_ARG end_ARG end_ARG. Furthermore, if we assume ϕO(ρ)italic-ϕ𝑂𝜌\phi\leqslant O(\rho)italic_ϕ ⩽ italic_O ( italic_ρ ) and τO(ρ)𝜏𝑂𝜌\tau\leqslant O(\rho)italic_τ ⩽ italic_O ( italic_ρ ), we can rewrite (C) as the following version

𝔼𝒎~t𝒎t2O(ρ2(||+dσ/q)n)𝔼superscriptnormsubscript~𝒎𝑡superscriptsubscript𝒎𝑡2𝑂superscript𝜌2𝑑𝜎𝑞𝑛\displaystyle\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}\leqslant O% \left(\frac{\rho^{2}(|\mathcal{B}|+\sqrt{d}\sigma/q)}{n}\right)blackboard_E ∥ over~ start_ARG bold_italic_m end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_m start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ italic_O ( divide start_ARG italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( | caligraphic_B | + square-root start_ARG italic_d end_ARG italic_σ / italic_q ) end_ARG start_ARG italic_n end_ARG )

which finishes the proof of Theorem 2. ∎

Appendix D Proof of Theorem 3 (Convergence Rate)

Proof.

Comparing with the aggregation error of O(ρ2||/n)𝑂superscript𝜌2𝑛O(\rho^{2}|\mathcal{B}|/n)italic_O ( italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | caligraphic_B | / italic_n ) (ignoring constants and higher order terms) in [22, Lemma 9], our aggregation error shown in (19) replaces the term |||\mathcal{B}|| caligraphic_B | by ||+dσ/q𝑑𝜎𝑞|\mathcal{B}|+\sqrt{d}\sigma/q| caligraphic_B | + square-root start_ARG italic_d end_ARG italic_σ / italic_q, which means a slower convergence due to DP noise. Then, following the result in [22, Theorem VI] and its informal version in (5), we get the convergence rate of our algorithm as in (20). Note that our aggregation utilizes a client-level sampling rate q𝑞qitalic_q, i.e., approximate nq𝑛𝑞nqitalic_n italic_q clients participate in the aggregation for one iteration. We need to replace the term of 1n1𝑛\frac{1}{n}divide start_ARG 1 end_ARG start_ARG italic_n end_ARG in (5) by 1nq1𝑛𝑞\frac{1}{nq}divide start_ARG 1 end_ARG start_ARG italic_n italic_q end_ARG in (20). ∎

Appendix E Useful Lemmas

Lemma 4.

For any positive real values α1,,αK+subscript𝛼1subscript𝛼𝐾superscript\alpha_{1},\cdots,\alpha_{K}\in\mathbb{R}^{+}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_α start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT and any d𝑑ditalic_d-dimensional vectors 𝐱1,,𝐱Kdsubscript𝐱1subscript𝐱𝐾superscript𝑑\bm{x}_{1},\cdots,\bm{x}_{K}\in\mathbb{R}^{d}bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , bold_italic_x start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, the following inequality holds

k=1K𝒙k2(k=1Kαk)(k=1K𝒙k2αk)superscriptnormsuperscriptsubscript𝑘1𝐾subscript𝒙𝑘2superscriptsubscript𝑘1𝐾subscript𝛼𝑘superscriptsubscript𝑘1𝐾superscriptnormsubscript𝒙𝑘2subscript𝛼𝑘\displaystyle\left\|\sum\nolimits_{k=1}^{K}\bm{x}_{k}\right\|^{2}\leqslant% \left(\sum\nolimits_{k=1}^{K}\alpha_{k}\right)\cdot\left(\sum\nolimits_{k=1}^{% K}\frac{\|\bm{x}_{k}\|^{2}}{\alpha_{k}}\right)∥ ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⩽ ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ⋅ ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT divide start_ARG ∥ bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG )

where \|\cdot\|∥ ⋅ ∥ denotes the L2-norm of a vector.

Proof.

Denote xkisubscript𝑥𝑘𝑖x_{ki}italic_x start_POSTSUBSCRIPT italic_k italic_i end_POSTSUBSCRIPT as the i𝑖iitalic_i-th element of the vector 𝒙ksubscript𝒙𝑘\bm{x}_{k}bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, then we have

k=1K𝒙k2=i=1d(k=1Kxki)2=i=1d(k=1Kαkxkiαk)2superscriptnormsuperscriptsubscript𝑘1𝐾subscript𝒙𝑘2superscriptsubscript𝑖1𝑑superscriptsuperscriptsubscript𝑘1𝐾subscript𝑥𝑘𝑖2superscriptsubscript𝑖1𝑑superscriptsuperscriptsubscript𝑘1𝐾subscript𝛼𝑘subscript𝑥𝑘𝑖subscript𝛼𝑘2\displaystyle\quad\left\|\sum\nolimits_{k=1}^{K}\bm{x}_{k}\right\|^{2}=\sum% \nolimits_{i=1}^{d}\left(\sum\nolimits_{k=1}^{K}x_{ki}\right)^{2}=\sum% \nolimits_{i=1}^{d}\left(\sum\nolimits_{k=1}^{K}\sqrt{\alpha_{k}}\cdot\frac{x_% {ki}}{\sqrt{\alpha_{k}}}\right)^{2}∥ ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_k italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT square-root start_ARG italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ⋅ divide start_ARG italic_x start_POSTSUBSCRIPT italic_k italic_i end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
i=1d[k=1K(αk)2k=1K(xkiαk)2]absentsuperscriptsubscript𝑖1𝑑delimited-[]superscriptsubscript𝑘1𝐾superscriptsubscript𝛼𝑘2superscriptsubscript𝑘1𝐾superscriptsubscript𝑥𝑘𝑖subscript𝛼𝑘2\displaystyle\leqslant\sum\nolimits_{i=1}^{d}\left[\sum\nolimits_{k=1}^{K}% \left(\sqrt{\alpha_{k}}\right)^{2}\cdot\sum\nolimits_{k=1}^{K}\left(\frac{x_{% ki}}{\sqrt{\alpha_{k}}}\right)^{2}\right]⩽ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( square-root start_ARG italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( divide start_ARG italic_x start_POSTSUBSCRIPT italic_k italic_i end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
=(k=1Kαk)(k=1K1αki=1dxki2)=(k=1Kαk)(k=1K𝒙k2αk)absentsuperscriptsubscript𝑘1𝐾subscript𝛼𝑘superscriptsubscript𝑘1𝐾1subscript𝛼𝑘superscriptsubscript𝑖1𝑑superscriptsubscript𝑥𝑘𝑖2superscriptsubscript𝑘1𝐾subscript𝛼𝑘superscriptsubscript𝑘1𝐾superscriptnormsubscript𝒙𝑘2subscript𝛼𝑘\displaystyle=\left(\sum\nolimits_{k=1}^{K}\alpha_{k}\right)\cdot\left(\sum% \nolimits_{k=1}^{K}\frac{1}{\alpha_{k}}\sum\nolimits_{i=1}^{d}x_{ki}^{2}\right% )=\left(\sum\nolimits_{k=1}^{K}\alpha_{k}\right)\cdot\left(\sum\nolimits_{k=1}% ^{K}\frac{\|\bm{x}_{k}\|^{2}}{\alpha_{k}}\right)= ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ⋅ ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_k italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) = ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ⋅ ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT divide start_ARG ∥ bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG )

where the inequality is caused by Cauchy-Schwarz inequality. ∎

Lemma 5.

Consider the following optimization problem

f=minx1,,xKk=1Kckxk,such thatxk>0,k=1Kxk=1formulae-sequencesuperscript𝑓subscriptsubscript𝑥1subscript𝑥𝐾superscriptsubscript𝑘1𝐾subscript𝑐𝑘subscript𝑥𝑘such thatformulae-sequencesubscript𝑥𝑘0superscriptsubscript𝑘1𝐾subscript𝑥𝑘1\displaystyle f^{*}=\min_{x_{1},\cdots,x_{K}}~{}~{}\sum\nolimits_{k=1}^{K}% \frac{c_{k}}{x_{k}},\qquad\text{such that}\quad x_{k}>0,\quad\sum\nolimits_{k=% 1}^{K}x_{k}=1italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_min start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_x start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT divide start_ARG italic_c start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG start_ARG italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG , such that italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT > 0 , ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1

where c1,,cK>0subscript𝑐1subscript𝑐𝐾0c_{1},\cdots,c_{K}>0italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_c start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT > 0. Then, we have f=(j=1Kcj)2superscript𝑓superscriptsuperscriptsubscript𝑗1𝐾subscript𝑐𝑗2f^{*}=(\sum\nolimits_{j=1}^{K}\sqrt{c_{j}})^{2}italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = ( ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT square-root start_ARG italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, where the optimal solution is xk=ckj=1Kcj(k=1,,K)subscript𝑥𝑘subscript𝑐𝑘superscriptsubscript𝑗1𝐾subscript𝑐𝑗for-all𝑘1𝐾x_{k}=\frac{\sqrt{c_{k}}}{\sum\nolimits_{j=1}^{K}\sqrt{c_{j}}}~{}(\forall k=1,% \cdots,K)italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = divide start_ARG square-root start_ARG italic_c start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT square-root start_ARG italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG end_ARG ( ∀ italic_k = 1 , ⋯ , italic_K ).

Proof.

The augmented Lagrange function is (xk;λ)=k=1Kckxk+λ(k=1Kxk1)subscript𝑥𝑘𝜆superscriptsubscript𝑘1𝐾subscript𝑐𝑘subscript𝑥𝑘𝜆superscriptsubscript𝑘1𝐾subscript𝑥𝑘1\mathcal{L}(x_{k};\lambda)=\sum\nolimits_{k=1}^{K}\frac{c_{k}}{x_{k}}+\lambda% \cdot(\sum\nolimits_{k=1}^{K}x_{k}-1)caligraphic_L ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ; italic_λ ) = ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT divide start_ARG italic_c start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG start_ARG italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG + italic_λ ⋅ ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - 1 ). By taking Karush-Kuhn-Tucker (KKT) conditions, we have

{xk=0λ=0{ckxk2+λ=0k=1Kxk=1{xk=ckλλ=j=1Kcjcasessubscript𝑥𝑘0otherwise𝜆0otherwisecasessubscript𝑐𝑘superscriptsubscript𝑥𝑘2𝜆0otherwisesuperscriptsubscript𝑘1𝐾subscript𝑥𝑘1otherwisecasessubscript𝑥𝑘subscript𝑐𝑘𝜆otherwise𝜆superscriptsubscript𝑗1𝐾subscript𝑐𝑗otherwise\displaystyle\begin{cases}\frac{\partial\mathcal{L}}{\partial x_{k}}=0\\ \frac{\partial\mathcal{L}}{\partial\lambda}=0\end{cases}\Rightarrow~{}~{}% \begin{cases}-\frac{c_{k}}{x_{k}^{2}}+\lambda=0\\ \sum\nolimits_{k=1}^{K}x_{k}=1\end{cases}\Rightarrow~{}~{}\begin{cases}x_{k}=% \sqrt{\frac{c_{k}}{\lambda}}\\ \sqrt{\lambda}=\sum\nolimits_{j=1}^{K}\sqrt{c_{j}}\end{cases}{ start_ROW start_CELL divide start_ARG ∂ caligraphic_L end_ARG start_ARG ∂ italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG = 0 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL divide start_ARG ∂ caligraphic_L end_ARG start_ARG ∂ italic_λ end_ARG = 0 end_CELL start_CELL end_CELL end_ROW ⇒ { start_ROW start_CELL - divide start_ARG italic_c start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG start_ARG italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + italic_λ = 0 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1 end_CELL start_CELL end_CELL end_ROW ⇒ { start_ROW start_CELL italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = square-root start_ARG divide start_ARG italic_c start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG start_ARG italic_λ end_ARG end_ARG end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL square-root start_ARG italic_λ end_ARG = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT square-root start_ARG italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG end_CELL start_CELL end_CELL end_ROW

then we have f=(j=1Kcj)2superscript𝑓superscriptsuperscriptsubscript𝑗1𝐾subscript𝑐𝑗2f^{*}=\left(\sum\nolimits_{j=1}^{K}\sqrt{c_{j}}\right)^{2}italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = ( ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT square-root start_ARG italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, which finished the proof. ∎

Appendix F Gaussian Differential Privacy (GDP)

Privacy Accountant. Since deep learning needs to iterate over the training data and apply gradient computation multiple times during the training process, each access to the training data incurs some privacy cost from the overall privacy budget ϵitalic-ϵ\epsilonitalic_ϵ. The total privacy cost of repeated applications of additive noise mechanisms follow from the composition theorems and their refinements [13]. The task of kee** track of the accumulated privacy loss in the course of execution of a composite mechanism, and enforcing the applicable privacy policy, can be performed by the privacy accountant. Abadi et al. [1] proposed moments accountant to provide a tighter bound on the privacy loss compared to the generic advanced composition theorem [14]. Another new and more state-of-the-art privacy accountant method is Gaussian Differential Privacy (GDP) [11, 7], which was shown to obtain a tighter result than moments accountant.

Gaussian Differential Privacy. GDP is a new privacy notion which faithfully retains hypothesis testing interpretation of differential privacy. By leveraging the central limit theorem of Gaussian distribution, GDP has been shown to possess an analytically tractable privacy accountant (vs. moments accountant must be done by numerical computation). Furthermore, GDP can be converted to a collection of (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP guarantees (refer to Lemma 9). Note that even in terms of (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP, the GDP approach gives a tighter privacy accountant than moments accountant. GDP utilizes a single parameter μ0𝜇0\mu\geqslant 0italic_μ ⩾ 0 (called privacy parameter) to quantify the privacy of a randomized mechanism. Similar to the privacy budget ϵitalic-ϵ\epsilonitalic_ϵ defined in DP, a larger μ𝜇\muitalic_μ in GDP indicates less privacy guarantee. Comparing with (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP, the new notion μ𝜇\muitalic_μ-GDP can losslessly reason about common primitives associated with differential privacy, including composition, privacy amplification by sampling, and group privacy. In the following, we briefly introduce some important properties (that will be used in the analysis of our approach) of GDP as below. The formal definition and more detailed results can be found in the original paper [11].

Lemma 6 (Gaussian Mechanism for GDP [11]).

Consider the problem of privately releasing a univariate statistic f(D)𝑓𝐷f(D)italic_f ( italic_D ) of a dataset D𝐷Ditalic_D. Define the sensitivity of f()𝑓f(\cdot)italic_f ( ⋅ ) as sf=supD,D|f(D)f(D)|subscript𝑠𝑓subscriptsupremum𝐷superscript𝐷𝑓𝐷𝑓superscript𝐷s_{f}=\sup_{D,D^{\prime}}|f(D)-f(D^{\prime})|italic_s start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = roman_sup start_POSTSUBSCRIPT italic_D , italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | italic_f ( italic_D ) - italic_f ( italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) |, where the supremum is over all neighboring datasets. Then, the Gaussian mechanism (D)=f(D)+ξ𝐷𝑓𝐷𝜉\mathcal{M}(D)=f(D)+\xicaligraphic_M ( italic_D ) = italic_f ( italic_D ) + italic_ξ, where ξ𝒩(0,sf2/μ2)similar-to𝜉𝒩0superscriptsubscript𝑠𝑓2superscript𝜇2\xi\sim\mathcal{N}(0,s_{f}^{2}/\mu^{2})italic_ξ ∼ caligraphic_N ( 0 , italic_s start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), satisfies μ𝜇\muitalic_μ-GDP.

Lemma 7 (Composition Theorem of GDP [11]).

The m𝑚mitalic_m-fold composition of μisubscript𝜇𝑖\mu_{i}italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT-GDP mechanisms is μ12++μm2superscriptsubscript𝜇12superscriptsubscript𝜇𝑚2\sqrt{\mu_{1}^{2}+\cdots+\mu_{m}^{2}}square-root start_ARG italic_μ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ⋯ + italic_μ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG-GDP.

Lemma 8 (Group Privacy of GDP [11]).

If a mechanism is μ𝜇\muitalic_μ-GDP, then it is Kμ𝐾𝜇K\muitalic_K italic_μ-GDP for a group with size K𝐾Kitalic_K.

Lemma 9 (μ𝜇\muitalic_μ-GDP to (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP [11]).

A mechanism is μ𝜇\muitalic_μ-GDP if and only if it is (ϵ,δ(ϵ))italic-ϵ𝛿italic-ϵ(\epsilon,\delta(\epsilon))( italic_ϵ , italic_δ ( italic_ϵ ) )-DP for all ϵ0italic-ϵ0\epsilon\geqslant 0italic_ϵ ⩾ 0, where

δ(ϵ)=Φ(ϵμ+μ2)eϵΦ(ϵμμ2),𝛿italic-ϵΦitalic-ϵ𝜇𝜇2superscript𝑒italic-ϵΦitalic-ϵ𝜇𝜇2\displaystyle\delta(\epsilon)=\Phi\left(-\frac{\epsilon}{\mu}+\frac{\mu}{2}% \right)-e^{\epsilon}\cdot\Phi\left(-\frac{\epsilon}{\mu}-\frac{\mu}{2}\right),italic_δ ( italic_ϵ ) = roman_Φ ( - divide start_ARG italic_ϵ end_ARG start_ARG italic_μ end_ARG + divide start_ARG italic_μ end_ARG start_ARG 2 end_ARG ) - italic_e start_POSTSUPERSCRIPT italic_ϵ end_POSTSUPERSCRIPT ⋅ roman_Φ ( - divide start_ARG italic_ϵ end_ARG start_ARG italic_μ end_ARG - divide start_ARG italic_μ end_ARG start_ARG 2 end_ARG ) ,

and ΦΦ\Phiroman_Φ denotes the CDF of standard normal (Gaussian) distribution.

Lemma 10 (Privacy Central Limit Theorem of GDP [7]).

Denote p𝑝pitalic_p as the sampling probability of one example in the training dataset, T𝑇Titalic_T as the total number of iterations and σ𝜎\sigmaitalic_σ as the noise scale (i.e., the ratio between the standard deviation of Gaussian noise and the gradient norm bound). Then, algorithm DP-SDG asymptotically satisfies μ𝜇\muitalic_μ-GDP with privacy parameter μ=pT(e1/σ21)𝜇𝑝𝑇superscript𝑒1superscript𝜎21\mu=p\sqrt{T(e^{1/\sigma^{2}}-1)}italic_μ = italic_p square-root start_ARG italic_T ( italic_e start_POSTSUPERSCRIPT 1 / italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT - 1 ) end_ARG.

In this paper, we use μ𝜇\muitalic_μ-GDP as our primary privacy accountant method due to its good property on composition and accountant of privacy amplification in Lemma 10, and then convert the result to (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP via Lemma 9. We note that other privacy accountant methods, such as moments accountant [1] and Rényi DP (RDP) [33], are also applicable to the proposed scheme and theoretical analysis, but might lead to suboptimal results.

Appendix G Preliminaries for Crypto Primitives

Shamir’s Secret Sharing with Robust Reconstruction. Due to the assumption of a malicious minority, the utilized crypto primitives should be able to tolerate the wrong or missing messages of malicious clients. Shamir’s t𝑡titalic_t-out-of-n𝑛nitalic_n Secret Sharing Scheme [37] allows distributing a secret s𝑠sitalic_s among n𝑛nitalic_n parties such that: 1) the complete secret can be reconstructed from any combination of t𝑡titalic_t shares; 2) any set of t1𝑡1t-1italic_t - 1 or fewer shares reveals no information about s𝑠sitalic_s, where t𝑡titalic_t is the threshold of the secret sharing scheme. We denote [s]isubscriptdelimited-[]𝑠𝑖[s]_{i}[ italic_s ] start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as the share held by the i𝑖iitalic_i-th party. Shamir’s secret sharing scheme is linear, which means a party can locally perform: 1) addition of two shares, 2) addition of a constant, and 3) multiplication by a constant. Furthermore, Shamir’s secret sharing scheme is closely related to Reed-Solomon error correcting codes [27], which is a group of polynomial-based error correcting codes. Shamir’s secret sharing scheme results in a [n,t,nt+1]𝑛𝑡𝑛𝑡1[n,t,n-t+1][ italic_n , italic_t , italic_n - italic_t + 1 ] Reed-Solomon code that can tolerate up to q𝑞qitalic_q errors and e𝑒eitalic_e erasures (message dropouts) such that 2q+e<nt+12𝑞𝑒𝑛𝑡12q+e<n-t+12 italic_q + italic_e < italic_n - italic_t + 1. Given any subset of ne𝑛𝑒n-eitalic_n - italic_e shares 𝒬(|𝒬|ne)𝒬𝒬𝑛𝑒\mathcal{Q}~{}(|\mathcal{Q}|\geqslant n-e)caligraphic_Q ( | caligraphic_Q | ⩾ italic_n - italic_e ) with up to q𝑞qitalic_q errors, any standard Reed Solomon decoding algorithm, such as Gao’s decoding algorithm [17], can robustly reconstruct the secret s𝑠sitalic_s. Due to the property of robust reconstruction, Shamir’s secret sharing is able to guarantee security with malicious minority (as versus additive secret sharing [10] guarantees security with honest-but-curious parties).

EIFFeL: An Instantiation of SAVI Protocol. EIFFeL [35] is a SAVI protocol (with privacy and integrity guarantees) that securely aggregates only well-informed inputs. Its threat model assumes a malicious server (for privacy only) and a set of malicious clients (for both breaching privacy and submitting malformed inputs) that can arbitrarily deviate from the protocol, while the remaining honest clients are assumed to follow the protocol correctly and have well-formed inputs. EIFFeL ensures privacy by using Shamir’s secret sharing scheme [37]. Integrity is guaranteed via 1) secret-shared non-interactive proofs (SNIP) [9], which is an information-theoretic zero-knowledge proof for secret-shared data; and 2) verifiable secret shares [16], which validates the correctness of the secret shares. Note that the original SNIP utilizes additive secret sharing scheme [10], and its deployment setting uses 2absent2\geqslant 2⩾ 2 honest and non-colluding servers as the verifiers. In contrast, by leveraging Shamir’s secret sharing with robust reconstruction, EIFFeL extends SNIP to a malicious threat model in a single server setting, where all the other clients (some of them are malicious) and the server jointly act as the verifiers for the verification of client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT’s input. Therefore, EIFFeL is compatible to our system model (a single server) and the threat model discussed in Section 3.1.

Appendix H Detailed Steps of DP-BREM+ in Figure 2

\small1⃝ Proof and Shares Generation: 𝒛i,𝖵𝖺𝗅𝗂𝖽()[𝒛i]j,[πi]j(ji)formulae-sequencesubscript𝒛𝑖𝖵𝖺𝗅𝗂𝖽subscriptdelimited-[]subscript𝒛𝑖𝑗subscriptdelimited-[]subscript𝜋𝑖𝑗for-all𝑗𝑖\bm{z}_{i},\mathsf{Valid}(\cdot)\rightarrow[\bm{z}_{i}]_{j},[\pi_{i}]_{j}~{}(% \forall j\neq i)bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , sansserif_Valid ( ⋅ ) → [ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , [ italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ≠ italic_i ). For generating the proof, client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT first evaluates the circuit 𝖵𝖺𝗅𝗂𝖽()𝖵𝖺𝗅𝗂𝖽\mathsf{Valid}(\cdot)sansserif_Valid ( ⋅ ) on its private input 𝒛isubscript𝒛𝑖\bm{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to obtain the value of every wire in the arithmetic circuit corresponding to the computation of 𝖵𝖺𝗅𝗂𝖽(𝒛i)𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖\mathsf{Valid}(\bm{z}_{i})sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), then uses these wire values to generate the proof πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (refer to [9, 35] for the detailed format). Then, client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT splits the private input 𝒛isubscript𝒛𝑖\bm{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and proof πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to generate shares [𝒛i]jsubscriptdelimited-[]subscript𝒛𝑖𝑗[\bm{z}_{i}]_{j}[ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and [πi]j(ji)subscriptdelimited-[]subscript𝜋𝑖𝑗for-all𝑗𝑖[\pi_{i}]_{j}~{}(\forall j\neq i)[ italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ≠ italic_i ), and send them to other clients {𝖢j}jisubscriptsubscript𝖢𝑗for-all𝑗𝑖\{\mathsf{C}_{j}\}_{\forall j\neq i}{ sansserif_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT ∀ italic_j ≠ italic_i end_POSTSUBSCRIPT via Shamir’s secret sharing.

\small2⃝ Proof Summary Computation: [𝒛i]j,[πi]j(ji)[σi]j(ji)subscriptdelimited-[]subscript𝒛𝑖𝑗subscriptdelimited-[]subscript𝜋𝑖𝑗for-all𝑗𝑖subscriptdelimited-[]subscript𝜎𝑖𝑗for-all𝑗𝑖[\bm{z}_{i}]_{j},[\pi_{i}]_{j}~{}(\forall j\neq i)\rightarrow[\sigma_{i}]_{j}~% {}(\forall j\neq i)[ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , [ italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ≠ italic_i ) → [ italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ≠ italic_i ). Each client except 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT first verifies the validity of the received secret shares via verifiable secret shares [16], and then locally constructs the shares of every wire in 𝖵𝖺𝗅𝗂𝖽(𝒛i)𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖\mathsf{Valid}(\bm{z}_{i})sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) via affine operations on the shares [𝒛i]jsubscriptdelimited-[]subscript𝒛𝑖𝑗[\bm{z}_{i}]_{j}[ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and[πi]jsubscriptdelimited-[]subscript𝜋𝑖𝑗[\pi_{i}]_{j}[ italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT to get the shares of proof summary [σi]jsubscriptdelimited-[]subscript𝜎𝑖𝑗[\sigma_{i}]_{j}[ italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (refer to [35] for the detailed format), which will be sent to the server.

\small3⃝ Proof Summary Verification: [σi]j(ji)𝖵𝖺𝗅𝗂𝖽(𝒛i)subscriptdelimited-[]subscript𝜎𝑖𝑗for-all𝑗𝑖𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖[\sigma_{i}]_{j}~{}(\forall j\neq i)\rightarrow\mathsf{Valid}(\bm{z}_{i})[ italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ≠ italic_i ) → sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). After receiving shares of proof summary [σi]j(ji)subscriptdelimited-[]subscript𝜎𝑖𝑗for-all𝑗𝑖[\sigma_{i}]_{j}(\forall j\neq i)[ italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ≠ italic_i ) from clients {𝖢j}jisubscriptsubscript𝖢𝑗for-all𝑗𝑖\{\mathsf{C}_{j}\}_{\forall j\neq i}{ sansserif_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT ∀ italic_j ≠ italic_i end_POSTSUBSCRIPT, the server recovers the value of σisubscript𝜎𝑖\sigma_{i}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT via robust reconstruction, which is resilient to incorrect shares submitted by the malicious clients, and then checks the values in proof summaries. Finally, the validation result 𝖵𝖺𝗅𝗂𝖽(𝒛i)=1𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖1\mathsf{Valid}(\bm{z}_{i})=1sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = 1 if and only if σisubscript𝜎𝑖\sigma_{i}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT has the correct value.

\small4⃝ Random Numbers Generation: l,d{([uk]j,[vk]j)}k=1d/2(j)𝑙𝑑superscriptsubscriptsubscriptdelimited-[]subscript𝑢𝑘𝑗subscriptdelimited-[]subscript𝑣𝑘𝑗𝑘1𝑑2for-all𝑗l,d\rightarrow\{([u_{k}]_{j},[v_{k}]_{j})\}_{k=1}^{\lceil d/2\rceil}~{}(% \forall j)italic_l , italic_d → { ( [ italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , [ italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⌈ italic_d / 2 ⌉ end_POSTSUPERSCRIPT ( ∀ italic_j ). In this step, clients jointly generate the shares of d/2𝑑2\lceil d/2\rceil⌈ italic_d / 2 ⌉-pairs of random numbers {(uk,vk)}k=1d/2superscriptsubscriptsubscript𝑢𝑘subscript𝑣𝑘𝑘1𝑑2\{(u_{k},v_{k})\}_{k=1}^{\lceil d/2\rceil}{ ( italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⌈ italic_d / 2 ⌉ end_POSTSUPERSCRIPT, where all of them are i.i.d. from uniform distribution in the range [0,1]01[0,1][ 0 , 1 ]. Denote l𝑙litalic_l as the fractional precision of the power 2 ring representation of real numbers. To obtain the share of one random number u𝑢uitalic_u, each client 𝖢i(i)subscript𝖢𝑖for-all𝑖\mathsf{C}_{i}~{}(\forall i)sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ∀ italic_i ) generates l𝑙litalic_l random bits in the binary filed 𝔽2subscript𝔽2\mathbb{F}_{2}blackboard_F start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, denoted by a binary vector 𝒃isubscript𝒃𝑖\bm{b}_{i}bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with length l𝑙litalic_l, then generate and distributes the shares [𝒃i]jsubscriptdelimited-[]subscript𝒃𝑖𝑗[\bm{b}_{i}]_{j}[ bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT to other clients (via Shamir’s secret sharing). After receiving all shares from other clients, each client 𝖢j(j)subscript𝖢𝑗for-all𝑗\mathsf{C}_{j}~{}(\forall j)sansserif_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ) locally adds these shares to get [𝒃]j=[i𝒃i]j𝔽2lsubscriptdelimited-[]𝒃𝑗subscriptdelimited-[]subscript𝑖subscript𝒃𝑖𝑗superscriptsubscript𝔽2𝑙[\bm{b}]_{j}=[\sum_{i}\bm{b}_{i}]_{j}\in\mathbb{F}_{2}^{l}[ bold_italic_b ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = [ ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ blackboard_F start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT, where vector 𝒃𝔽2l𝒃superscriptsubscript𝔽2𝑙\bm{b}\in\mathbb{F}_{2}^{l}bold_italic_b ∈ blackboard_F start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT is actually the bitwise XOR of vectors {𝒃i}isubscriptsubscript𝒃𝑖for-all𝑖\{\bm{b}_{i}\}_{\forall i}{ bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT ∀ italic_i end_POSTSUBSCRIPT because the computation is implemented in the binary field 𝔽2lsuperscriptsubscript𝔽2𝑙\mathbb{F}_{2}^{l}blackboard_F start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT. We define the binary vector 𝒃𝒃\bm{b}bold_italic_b as the binary representation of the fractional part of u[0,1]𝑢01u\in[0,1]italic_u ∈ [ 0 , 1 ]. Note that the Shamir’s secret sharing scheme of Phase 1 is implemented in a finite filed 𝔽2Ksubscript𝔽superscript2𝐾\mathbb{F}_{2^{K}}blackboard_F start_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_POSTSUBSCRIPT, where K>l𝐾𝑙K>litalic_K > italic_l. Therefore, the client 𝖢jsubscript𝖢𝑗\mathsf{C}_{j}sansserif_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT can locally compute the arithmetic share [u]j𝔽2Ksubscriptdelimited-[]𝑢𝑗subscript𝔽superscript2𝐾[u]_{j}\in\mathbb{F}_{2^{K}}[ italic_u ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ blackboard_F start_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_POSTSUBSCRIPT from the share of binary representation [𝒃]j𝔽2lsubscriptdelimited-[]𝒃𝑗superscriptsubscript𝔽2𝑙[\bm{b}]_{j}\in\mathbb{F}_{2}^{l}[ bold_italic_b ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ blackboard_F start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT. Since all possible discrete values with power 2 ring representation evenly span the range [0,1]01[0,1][ 0 , 1 ], the generated random real number u𝑢uitalic_u is uniformly distributed in [0,1]01[0,1][ 0 , 1 ].

\small5⃝ Transformation to Gaussian Distribution: {([uk]j,[vk]j)}k=1d/2(j)[𝝃]j(j)superscriptsubscriptsubscriptdelimited-[]subscript𝑢𝑘𝑗subscriptdelimited-[]subscript𝑣𝑘𝑗𝑘1𝑑2for-all𝑗subscriptdelimited-[]𝝃𝑗for-all𝑗\{([u_{k}]_{j},[v_{k}]_{j})\}_{k=1}^{\lceil d/2\rceil}~{}(\forall j)% \rightarrow[\bm{\xi}]_{j}~{}(\forall j){ ( [ italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , [ italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⌈ italic_d / 2 ⌉ end_POSTSUPERSCRIPT ( ∀ italic_j ) → [ bold_italic_ξ ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ). For each pair of (uk,vk)subscript𝑢𝑘subscript𝑣𝑘(u_{k},v_{k})( italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ), clients can jointly compute a secret sharing of ak=2ln(uk)cos(2πvk)subscript𝑎𝑘2subscript𝑢𝑘2𝜋subscript𝑣𝑘a_{k}=\sqrt{-2\ln(u_{k})}\cdot\cos(2\pi v_{k})italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = square-root start_ARG - 2 roman_ln ( italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG ⋅ roman_cos ( 2 italic_π italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) and of bk=2ln(uk)sin(2πvk)subscript𝑏𝑘2subscript𝑢𝑘2𝜋subscript𝑣𝑘b_{k}=\sqrt{-2\ln(u_{k})}\cdot\sin(2\pi v_{k})italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = square-root start_ARG - 2 roman_ln ( italic_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG ⋅ roman_sin ( 2 italic_π italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) by utilizing Secure Multiparty Computation (MPC) protocols [23] that guarantees security (i.e., privacy and integrity) with malicious minority. According to Box and Muller Transformation [6], aksubscript𝑎𝑘a_{k}italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and bksubscript𝑏𝑘b_{k}italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT are i.i.d. random variables from the Gaussian distribution with mean 0 and variance 1. Then, by locally implementing secure multiplication with a constant (i.e., Rσ𝑅𝜎R\sigmaitalic_R italic_σ), aksubscript𝑎𝑘a_{k}italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and bksubscript𝑏𝑘b_{k}italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT are i.i.d random numbers following a Gaussian distribution with the desired standard deviation of Rσ𝑅𝜎R\sigmaitalic_R italic_σ. Finally, by concatenating shares of d𝑑ditalic_d numbers in {(ak,bk)}k=1d/2superscriptsubscriptsubscript𝑎𝑘subscript𝑏𝑘𝑘1𝑑2\{(a_{k},b_{k})\}_{k=1}^{\lceil d/2\rceil}{ ( italic_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⌈ italic_d / 2 ⌉ end_POSTSUPERSCRIPT, clients obtains the shares of random vector 𝝃𝝃\bm{\xi}bold_italic_ξ with length d𝑑ditalic_d from Gaussian distribution 𝒩(0,R2σ2𝐈d)𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ).

\small6⃝ Shares Aggregation: {[𝒛i]j}i𝖵𝖺𝗅𝗂𝖽,[𝝃]j(j)[i𝖵𝖺𝗅𝗂𝖽𝒛i+𝝃]j(j)subscriptsubscriptdelimited-[]subscript𝒛𝑖𝑗𝑖subscript𝖵𝖺𝗅𝗂𝖽subscriptdelimited-[]𝝃𝑗for-all𝑗subscriptdelimited-[]subscript𝑖subscript𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖𝝃𝑗for-all𝑗\{[\bm{z}_{i}]_{j}\}_{i\in\mathcal{I}_{\mathsf{Valid}}},[\bm{\xi}]_{j}~{}(% \forall j)\rightarrow[\sum_{i\in\mathcal{I}_{\mathsf{Valid}}}\bm{z}_{i}+\bm{% \xi}]_{j}~{}(\forall j){ [ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT end_POSTSUBSCRIPT , [ bold_italic_ξ ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ) → [ ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + bold_italic_ξ ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ). Due to the linearity of Shamir’s secret sharing scheme, each client 𝖢jsubscript𝖢𝑗\mathsf{C}_{j}sansserif_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT can locally compute the share of the noisy aggregate by adding the shares of all valid inputs and the share of Gaussian noise: [i𝖵𝖺𝗅𝗂𝖽𝒛i+𝝃]j=i𝖵𝖺𝗅𝗂𝖽[𝒛i]j+[𝝃]jsubscriptdelimited-[]subscript𝑖subscript𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖𝝃𝑗subscript𝑖subscript𝖵𝖺𝗅𝗂𝖽subscriptdelimited-[]subscript𝒛𝑖𝑗subscriptdelimited-[]𝝃𝑗[\sum_{i\in\mathcal{I}_{\mathsf{Valid}}}\bm{z}_{i}+\bm{\xi}]_{j}=\sum_{i\in% \mathcal{I}_{\mathsf{Valid}}}[\bm{z}_{i}]_{j}+[\bm{\xi}]_{j}[ ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + bold_italic_ξ ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + [ bold_italic_ξ ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, and sends that share to the server.

\small7⃝ Noisy Aggregate Reconstruction: [i𝖵𝖺𝗅𝗂𝖽𝒛i+𝝃]j(j)i𝖵𝖺𝗅𝗂𝖽𝒛i+𝝃subscriptdelimited-[]subscript𝑖subscript𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖𝝃𝑗for-all𝑗subscript𝑖subscript𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖𝝃[\sum_{i\in\mathcal{I}_{\mathsf{Valid}}}\bm{z}_{i}+\bm{\xi}]_{j}~{}(\forall j)% \rightarrow\sum_{i\in\mathcal{I}_{\mathsf{Valid}}}\bm{z}_{i}+\bm{\xi}[ ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + bold_italic_ξ ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( ∀ italic_j ) → ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + bold_italic_ξ. After receiving all shares of the noisy aggregate, the server recovers it using robust reconstruction.

Appendix I Proof of Theorem 4 (Security Analysis)

Refer to caption
Figure 7: MNIST: Varying record-level clip** bound R𝑅Ritalic_R for DP-BREM under different settings.

Integrity. We prove that DP-BREM+ satisfies the integrity constraint using the following lemmas, where Lemma 11 and Lemma 13 are derived from EIFFeL [35].

Lemma 11 (Integrity of Input).

DP-BREM+ rejects all malformed inputs with probability 1negl(κ)1negl𝜅1-\text{negl}(\kappa)1 - negl ( italic_κ ).

Lemma 12 (Integrity of Gaussian Noise).

In Phase 2 of DP-BREM+, each client holds the share of random vector 𝛏𝛏\bm{\xi}bold_italic_ξ that follows the Gaussian distribution 𝒩(0,R2σ2𝐈d)𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ).

Proof.

In the step \small4⃝ of Phase 2, the jointly generated random number u𝑢uitalic_u follows uniform distribution in range [0,1]01[0,1][ 0 , 1 ] as long as there is at least one honest client because u𝑢uitalic_u’s binary representation 𝒃𝒃\bm{b}bold_italic_b is the result of bitwise XOR of clients’ local random vectors {𝒃i}isubscriptsubscript𝒃𝑖for-all𝑖\{\bm{b}_{i}\}_{\forall i}{ bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT ∀ italic_i end_POSTSUBSCRIPT. In step \small5⃝, since the utilized MPC protocol [23, 28] guarantees computation integrity (meaning that the output is correctly computed) with malicious minority, the uniform distribution generated in step \small4⃝ will be correctly transformed to Gaussian distribution. ∎

Since clients locally add shares of valid inputs and noise together, DP-BREM+ satisfies integrity of aggregate shown in Lemma 13. Our integrity guarantee in Lemma 13 directly follows EIFFeL [35], though the integrity of noise has different definition compared with the integrity of input. Note that the integrity of EIFFeL (and ours) relies on robust reconstruction property of Shamir’s secret sharing [37], and the details can be found from the paper[35].

Lemma 13 (Integrity of Aggregate).

The aggregated output of DP-BREM+ must contain the inputs of all honest clients and the generated Gaussian noise.

Aggregate=iH𝒛i+iM𝒛i+𝝃Aggregatesubscript𝑖subscript𝐻subscript𝒛𝑖subscript𝑖superscriptsubscript𝑀subscript𝒛𝑖𝝃\displaystyle\text{Aggregate}=\sum\nolimits_{i\in\mathcal{I}_{H}}\bm{z}_{i}+% \sum\nolimits_{i\in\mathcal{I}_{M}^{*}}\bm{z}_{i}+\bm{\xi}Aggregate = ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + bold_italic_ξ

where random vector ξ𝒩(0,R2σ2𝐈d)similar-to𝜉𝒩0superscript𝑅2superscript𝜎2subscript𝐈𝑑\xi\sim\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})italic_ξ ∼ caligraphic_N ( 0 , italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ), Hsubscript𝐻\mathcal{I}_{H}caligraphic_I start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT is the set of all honest clients, Msuperscriptsubscript𝑀\mathcal{I}_{M}^{*}caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the set of malicious clients with well-formed inputs (i.e., M=𝖵𝖺𝗅𝗂𝖽\Hsuperscriptsubscript𝑀\subscript𝖵𝖺𝗅𝗂𝖽subscript𝐻\mathcal{I}_{M}^{*}=\mathcal{I}_{\mathsf{Valid}}\backslash\mathcal{I}_{H}caligraphic_I start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = caligraphic_I start_POSTSUBSCRIPT sansserif_Valid end_POSTSUBSCRIPT \ caligraphic_I start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT)

Privacy. DP-BREM+ guarantees: nothing can be learned about a private input 𝒛isubscript𝒛𝑖\bm{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for an honest client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, except:

1) 𝒛isubscript𝒛𝑖\bm{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT passes the integrity check, i.e., 𝖵𝖺𝗅𝗂𝖽(𝒛i)=1𝖵𝖺𝗅𝗂𝖽subscript𝒛𝑖1\mathsf{Valid}(\bm{z}_{i})=1sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = 1 .

2) anything that can be learned from the noisy aggregation of well-formed inputs (thus achieving the same DP guarantee as the original DP-BREM).

We prove this privacy property using the following lemmas, where Lemma 14 and Lemma 16 are derived from EIFFeL [35].

Lemma 14.

In Phase 1, for an honest client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, DP-BREM+ reveals nothing about the private input 𝐳isubscript𝐳𝑖\bm{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT except 𝖵𝖺𝗅𝗂𝖽(𝐳i)=1𝖵𝖺𝗅𝗂𝖽subscript𝐳𝑖1\mathsf{Valid}(\bm{z}_{i})=1sansserif_Valid ( bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = 1.

Lemma 15.

In Phase 2, DP-BREM+ reveals nothing about the generated Gaussian noise.

Proof.

In step \small4⃝, no entity learns the uniformly random number u𝑢uitalic_u as long as there is at least one honest client due to the bitwise XOR operation. In step \small5⃝, nothing is revealed because the utilized MPC protocol [28] guarantees information theoretic privacy about the input shares during computation for distribution transmission. Note that the step \small5⃝ only generates the shares hold by clients without outputting the final result. ∎

Lemma 16.

In Phase 3, for an honest client 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, DP-BREM+ reveals nothing about the private input 𝐳isubscript𝐳𝑖\bm{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT except whatever can be leaned from the noisy aggregate.

Appendix J Supplements of Experiments

Refer to caption
Figure 8: MNIST: Iteration curve with privacy budget ϵ=3italic-ϵ3\epsilon=3italic_ϵ = 3 and Byzantine clients δB=30%subscript𝛿𝐵percent30\delta_{B}=30\%italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = 30 % over n=100𝑛100n=100italic_n = 100 clients.
Refer to caption
Figure 9: CIFAR-10: Iteration Curve with privacy budget ϵ=4italic-ϵ4\epsilon=4italic_ϵ = 4 and Byzantine clients δB=15%subscript𝛿𝐵percent15\delta_{B}=15\%italic_δ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = 15 % over n=100𝑛100n=100italic_n = 100 clients.

J.1 Experimental Setup

FL Implementation. Due to limited resources, we simulate the distributed training of FL by running a single machine sequentially for clients and the server. The real-world implementation of FL is out of the scope of this paper.

Datasets (non-IID) and Model Architecture. We use two datasets for our experiments: MNIST [25] and CIFAR-10 [24], where the default value of the number of total clients is n=100𝑛100n=100italic_n = 100. For MNIST dataset, we use the CNN model from PyTorch example444https://github.com/pytorch/opacus. For CIFAR-10 dataset, we use the CNN model from the TensorFlow tutorial555https://www.tensorflow.org/tutorials/images/cnn, like the previous works [49, 30]. To simulate the heterogeneous data distributions, we make non-i.i.d. partitions of the datasets, which is a similar setup as [49] and is described below:

1) Non-IID MNIST: The MNIST dataset contains 60,000 training images and 10,000 testing images of 10 classes. There are 100 clients, each holds 600 training images. We sort the training data by digit label and evenly divide it into 400 shards. Each client is assigned four random shards of the data, so that most of the clients have examples of three or four digits.

2) Non-IID CIFAR-10: The CIFAR-10 dataset contains 50,000 training images and 10,000 test images of 10 classes. There are 100 clients, each holds 500 training images. We sample the training images for each client using a Dirichlet distribution with hyperparameter 0.9.

Byzantine Attacks. We consider four different Byzantine attacks in our experiments.

1) ALIE ("a little is enough") [3]. The attacker uses the empirical variance (estimated from the data of corrupted clients) to determine the perturbation range, in which the attack can deviate from the mean without being detected or filtered out.

2) IPM (inner-product manipulation) [45]. The attacker manipulates the submitted gradient to be the negative direction of the mean of other honest clients’ gradients, thus the negative inner-product of the true gradient and the aggregation prevents the descent of the loss. Note that the original IPM attack assumes the omniscient attacker (i.e., knows the data/gradient of all other clients), which is contradicted to our assumption that the attacker only has access to the data of the corrupted clients (otherwise, the privacy is already leaked and no need to provide DP). Thus, in the experiments, we use the data of corrupted clients to estimate the aggregated gradient of honest clients, and then manipulate the inner-product (i.e., non-omniscient attack).

3) LF (label-flip**). The attacker modifies the labels of all examples of corrupted clients’ data and trains a new model with multiple iterations, then uses model replacement strategy [2] to enhance the impact on the global model.

4) MTB ("manipulating-the-Byzantine") [38]. The attacker computes a benign reference aggregate using some benign data samples obtained from corrupted clients, then computes a malicious perturbation vector, and an optimized scaling factor to get the malicious update with the goal of evading detection by robust aggregation algorithms. The optimization of the scaling factor can be tailored or agnostic to the aggregator. Considering our scheme and the baselines do not detect malicious clients, we use the agnostic setting (including min-max and min-sum) for simplicity because tailoring MTB attack to all defense aggregators is nontrivial. In our experiments, we implement the min-max attack since it has a larger impact on the global model.

Byzantine Defenses with DP. We compare the performance of our approaches with the following five competitors against Byzantine attacks. All of them satisfy record-level DP via record-level clip** and DP noise added to the local gradient/momentum. Note that privacy budget ϵitalic-ϵ\epsilonitalic_ϵ in Theorem 1 is the same for different clients because clients have the same size of local datasets |𝒟i|subscript𝒟𝑖|\mathcal{D}_{i}|| caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | and same record-level sampling rate (i.e., same |𝒟i|subscript𝒟𝑖|\mathcal{D}_{i}|| caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | and pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for different clients 𝖢isubscript𝖢𝑖\mathsf{C}_{i}sansserif_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT).

1) DP-FedSGD. Note that the original DP-FedSGD in [30] clips the client gradient to achieve client-level DP. For a fair comparison, we also implement record-level gradient clip** on top of the original DP-FedSGD to guarantee record-level DP. Though DP-FedSGD is not designed for robustness, its client-level clip** can restrict malicious clients’ capability, thus providing some level of Byzantine robustness. We take this as a baseline to illustrate that client-level clip** can provide some level of robustness, but may not be enough to defend against strong attackers (either advanced attack strategy or a larger number of malicious clients).

2) DP-CM. As a baseline that adds DP to median-based robust aggregators (discussed in Section 3.2), we implement the Byzantine-robust aggregator Coordinate-wise Median (CM) [47] with DP noise added to the median result. Note that only DP-CM uses median-based aggregation, while other methods use average-based aggregation. As discussed in Section 3.1 and Example 1, the median-based aggregation has large sensitivity and poor privacy-utility tradeoff.

3) DDP-RP [43]. By leveraging encryption techniques, DDP-RP guarantees Distributed DP with secure aggregation. It allows clients to add smaller noise in the local gradient than the Local DP, with the knowledge of the lower bound of trusted clients, thus providing enhanced privacy-utility tradeoff than local DP protocols. To guarantee Byzantine robustness, DDP-RP uses range-proof (RP) technologies to securely verify whether the local model/gradient weights are in a (predefined) bounded range.

4) DP-RSA [50]. It replaces the value aggregation to sign aggregation, which provides robustness because each client has limited impact on the aggregation. The DP noise is added to the local gradient before the sign operation.

5) DP-LFH. The baseline (shown in Section 3.2) directly combines DP-SGD based momentum with LFH. Each client adds DP noise to the local gradient, and then computes the local momentum that will be aggregated with centered clip** by the server.

J.2 Parameters Setting

Basic Parameters.

  • Total number of iterations T𝑇Titalic_T: 1000 for MNIST; 2000 for CIFAR-10

  • Learning rate ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT: For MNIST datasets, ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is linearly reduced from 0.1 to 0.01 w.r.t. iterations. For CIFAR-10 dataset, ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is linearly reduced from 0.05 to 0.0025 w.r.t. iterations.

DP-related Parameters.

  • Record-level sampling rate pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT: 0.05 for all i𝑖iitalic_i

  • Client-level sampling rate q𝑞qitalic_q: the default value is 1. We evaluate the influence of q𝑞qitalic_q (from 0.2 to 1) on the accuracy in Table 4.

  • Record-level clip** bound R𝑅Ritalic_R: linearly reduced from R0subscript𝑅0R_{0}italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to 0.3R00.3subscript𝑅00.3R_{0}0.3 italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT w.r.t. iterations. Note that in Figure 7, the different value of R𝑅Ritalic_R in x-axis is the value of the above R0subscript𝑅0R_{0}italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. For MNIST, we set R0=10subscript𝑅010R_{0}=10italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 10 by default, but R0=5subscript𝑅05R_{0}=5italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 5 only for the case of ϵ=1italic-ϵ1\epsilon=1italic_ϵ = 1 in Figure 5. For CIFAR-10, we set R0=20subscript𝑅020R_{0}=20italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 20 by default, but R0=15subscript𝑅015R_{0}=15italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 15 only for the case of ϵ=2italic-ϵ2\epsilon=2italic_ϵ = 2 in Figure 6.

  • Privacy parameter δ𝛿\deltaitalic_δ in DP: 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT

  • Noise multiplier σ𝜎\sigmaitalic_σ: For MNIST (with T=1000𝑇1000T=1000italic_T = 1000 and each client has |𝒟|i=60000/100=600subscript𝒟𝑖60000100600|\mathcal{D}|_{i}=60000/100=600| caligraphic_D | start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 60000 / 100 = 600 examples), σ{0.15,0.06,0.029,0}𝜎0.150.060.0290\sigma\in\{0.15,0.06,0.029,0\}italic_σ ∈ { 0.15 , 0.06 , 0.029 , 0 } for ϵ{1,3,8,inf}italic-ϵ138inf\epsilon\in\{1,3,8,\text{inf}\}italic_ϵ ∈ { 1 , 3 , 8 , inf }. For CIFAR-10 (with T=2000𝑇2000T=2000italic_T = 2000 and each client has |𝒟|i=50000/100=500subscript𝒟𝑖50000100500|\mathcal{D}|_{i}=50000/100=500| caligraphic_D | start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 50000 / 100 = 500 examples), σ{0.14,0.077,0.042,0}𝜎0.140.0770.0420\sigma\in\{0.14,0.077,0.042,0\}italic_σ ∈ { 0.14 , 0.077 , 0.042 , 0 } for ϵ{2,4,9,inf}italic-ϵ249inf\epsilon\in\{2,4,9,\text{inf}\}italic_ϵ ∈ { 2 , 4 , 9 , inf }.

Robustness-related Parameters.

  • Client-level clip** bound C𝐶Citalic_C (only for DP-BREM and DP-LFH): linearly reduced from C0subscript𝐶0C_{0}italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to 0.3C00.3subscript𝐶00.3C_{0}0.3 italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT w.r.t. iterations, where C0=1subscript𝐶01C_{0}=1italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 1 for MNIST, and C0=5subscript𝐶05C_{0}=5italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 5 for CIFAR-10.

  • Momentum parameter β=0.9𝛽0.9\beta=0.9italic_β = 0.9

J.3 More Experimental Results

Iteration Curve. Figures 8 and 9 show how the accuracy changes with the training iterations in MNIST (with total iteration T=1000𝑇1000T=1000italic_T = 1000) and CIFAR-10 (with T=2000𝑇2000T=2000italic_T = 2000), respectively. Due to the existence of Byzantine attacks, the iteration curve is not as smooth as in the attack-free case.

Appendix K Other Related Work

FL with DP. Differential Privacy (DP) was originally designed for the centralized scenario where a trusted database server, which has direct access to all client’s data in the clear, wishes to answer queries or publish statistics in a privacy-preserving manner by randomizing query results. In FL, McMahan et al. [30] proposed DP-FedSGD and DP-FedAvg, which provide client-level privacy with a trusted server. Geyer et al. [18] uses an algorithm similar to DP-FedSGD for the architecture search problem, and the privacy guarantee acts on client-level and the trusted server too. Li et al. [26] studies online transfer learning and introduces a notion called task global privacy that works on record-level. However, the online setting assumes the client only interacts with the server once and does not extend to the federated setting. Zheng et al. [49] introduced two privacy notions, that describe privacy guarantee against an individual malicious client and against a group of malicious clients (but not against the server) on record-level privacy, based on a new privacy notion called f𝑓fitalic_f-differential privacy. Note that, our solutions achieve record-level DP under either a trusted server or a malicious server.

Byzantine-Robust FL. Recently, there have been extensive works on Byzantine-robust federated/distributed learning with a trustworthy server, and most of them play with median statistics of gradient contributions. Blanchard et al. [5] proposed Krum which uses the Euclidean distance to determine which gradient contributions should be removed. Yin et al. [47] proposed two robust distributed gradient descent algorithms, one based on the coordinate-wise median, and the other on the coordinate-wise trimmed mean. Mhamdi et al. [32] proposed a meta-aggregation rule called Bulyan, a two-step meta-aggregation algorithm based on the Krum and trimmed median, which filters malicious updates followed by computing the trimmed median of the remaining updates.

Private and Byzantine-Robust FL. Recently, some works tried to simultaneously achieve both privacy and robustness of FL. He et al. [20] proposed a Byzantine-resilient and privacy-preserving solution, which makes distance-based robust aggregation rules (such as Krum [5]) compatible with secure aggregation via MPC and secret sharing. So et al. [40] developed a similar scheme based on Krum, but rely on different cryptographic techniques, such as verifiable Shamir’s secret sharing and Reed-Solomon code. Velicheti et al. [42] achieved both privacy and Byzantine robustness via incorporating secure averaging among randomly clustered clients before filtering malicious updates through robust aggregation. However, these works only ensure the security of the aggregation step and do not achieve DP for the aggregated model.