DP-BREM: Differentially-Private and Byzantine-Robust Federated Learning
with Client Momentum

Xiaolan Gu
University of Arizona
[email protected]

Ming Li
University of Arizona
[email protected]

Li Xiong
Emory University
[email protected]

Abstract

Federated Learning (FL) allows multiple participating clients to train machine learning models collaboratively while keeping their datasets local and only exchanging the gradient or model updates with a coordinating server. Existing FL protocols are vulnerable to attacks that aim to compromise data privacy and/or model robustness. Recently proposed defenses focused on ensuring either privacy or robustness, but not both. In this paper, we focus on simultaneously achieving differential privacy (DP) and Byzantine robustness for cross-silo FL, based on the idea of learning from history. The robustness is achieved via client momentum, which averages the updates of each client over time, thus reducing the variance of the honest clients and exposing the small malicious perturbations of Byzantine clients that are undetectable in a single round but accumulate over time. In our initial solution DP-BREM, DP is achieved by adding noise to the aggregated momentum, and we account for the privacy cost from the momentum, which is different from the conventional DP-SGD that accounts for the privacy cost from the gradient. Since DP-BREM assumes a trusted server (who can obtain clients’ local models or updates), we further develop the final solution called DP-BREM⁺, which achieves the same DP and robustness properties as DP-BREM without a trusted server by utilizing secure aggregation techniques, where DP noise is securely and jointly generated by the clients. Both theoretical analysis and experimental results demonstrate that our proposed protocols achieve better privacy-utility tradeoff and stronger Byzantine robustness than several baseline methods, under different DP budgets and attack settings.

1 Introduction

Federated learning (FL) [29] is an emerging paradigm that enables multiple clients to collaboratively learn models without explicitly sharing their data. The clients upload their local model updates to a coordinating server, which then shares the global average with the clients in an iterative process. This offers a promising solution to mitigate the potential privacy leakage of sensitive information about individuals (since the data stays local with each client), such as typing history, shopping transactions, geographical locations, and medical records. However, recent works have demonstrated that FL may not always provide sufficient privacy and robustness guarantees. In terms of privacy leakage, exchanging the model updates throughout the training process can still reveal sensitive information [4, 31] and cause deep leakage such as pixel-wise accurate image recovery [51, 48], either to a third party (including other participating clients) or the central server. In terms of robustness, the decentralized design of FL systems opens up the training process to manipulation by malicious clients, who aim to either prevent the convergence of the global model (a.k.a. Byzantine attacks) [15, 3, 45], or implant a backdoor trigger into the global model to cause targeted misclassification (a.k.a. backdoor attacks) [2, 44].

To mitigate the privacy leakage in FL, Differential Privacy (DP) [12, 13] has been adopted as a rigorous privacy notion. Existing frameworks [30, 18, 26] applied DP in FL to provide client-level privacy under the assumption of a trusted server: whether a client has participated in the training process cannot be inferred by a third party from the released global model. Other works in FL [49, 26, 46, 41] focused on record-level privacy: whether a data record at a client has participated during training cannot be inferred by the server or other adversaries that have access to the model updates or the global model. Record-level privacy is more relevant in cross-silo (as opposed to cross-device) FL scenarios, such as multiple hospitals collaboratively learning a prediction model for COVID-19, in which case what needs to be protected is the privacy of each patient (corresponding to each record in a hospital’s dataset). In this paper, we focus on cross-silo FL with record-level DP, where each client possesses a set of raw records, and each record corresponds to an individual’s private data.

To defend against Byzantine attacks, robust FL protocols have been proposed to ensure that the training procedure is robust to a fraction of potentially malicious clients. This problem has received significant attention from the community. Most existing approaches replace the averaging step at the server with a robust aggregation rule, such as the median [8, 47, 5, 32]. However, recent state-of-the-art attacks [3, 45, 38] have demonstrated the failure of the above robust aggregators. Furthermore, a recent work [22] shows that there exist realistic scenarios where these robust aggregators fail to converge, even if there are no Byzantine attackers and the data distribution is identical (i.i.d.) across the clients, and proposes a new solution called Learning From History (LFH) to address this issue. LFH achieves robustness via client momentum with the motivation of averaging the updates of each client over time, thus reducing the variance of the honest clients and exposing the small malicious perturbations of Byzantine clients that are undetectable in a single round but accumulate over time.

In this paper, we focus on achieving record-level DP and Byzantine robustness simultaneously in cross-silo FL. Existing FL protocols with DP-SGD [1] do not achieve the robustness property intrinsically. Directly applying an existing robust aggregator to the privatized client gradients leads to poor utility because these aggregators (such as median [47, 5, 32]) usually have large sensitivity and therefore require large DP noise. It is desirable to achieve DP guarantees based on average-based aggregators. Although LFH [22] is an average-based Byzantine-robust FL protocol, it aggregates client momentum instead of gradient, thus it is non-trivial to achieve DP on top of LFH. We show that a direct combination of LFH with DP-SGD momentum has several limitations, leading to both poor utility and robustness. Therefore, we aim to address these limitations in our solution.

To achieve an enhanced privacy-utility tradeoff, we start from the assumption that the server is trusted and develop a Differentially-Private and Byzantine-Robust fEderated learning algorithm with client Momentum (DP-BREM), which is essentially a DP version of the Byzantine-robust method LFH [22]. Instead of adding DP noise to the gradient and then aggregating momentum as post-processing, we add DP noise to the aggregated momentum with carefully computed sensitivity to account for the privacy cost. Since the noise is added to the final aggregate (instead of to the intermediate local gradients), our basic solution DP-BREM maintains the non-private LFH’s robustness as much as possible, which we show both theoretically (via convergence analysis) and empirically (via experimental results). Then, we relax our trust assumption to a malicious server (for privacy only) and develop our final solution DP-BREM⁺. It utilizes secure multiparty computation (MPC) techniques, including secure aggregation and secure noise generation, to achieve the same DP and robustness guarantees as in DP-BREM. In Table 1, we compare DP-BREM and DP-BREM⁺ with existing approaches (or their variants) that achieve both DP and Byzantine robustness (DDP-RP [43] and DP-RSA [50] are described in Sec. 7). These approaches will be evaluated and compared in experiments. Our main contributions are:

Table 1: Comparison of FL approaches with DP and Byzantine-robustness

Approach	DP $^{\mathsection}$: Trust Assumption of Server	DP: Noise Generator	DP: Perturbation Mechanism	DP: Standard Deviation of Noise in Aggregate	Byzantine Robustness Mechanism
DP-FedSGD [30] with both record and client norm clippings	trusted	server	$\sum_{i}g_{i}+\mathcal{N}(0,\sigma^{2})$	$\sigma$	client norm clipping
CM [47] with DP noise	trusted	server	$\text{median}(\{g_{i}\}_{i=1}^{n})+\mathcal{N}(0,\sigma^{2})$	$\sigma$	coordinate-wise median (CM)
DDP-RP [43] $^{\diamond}$	honest-but-curious	clients (distributively)	$\sum_{i}[g_{i}+\mathcal{N}(0,\frac{\sigma^{2}}{\tau})]$	$\sqrt{\frac{n}{\tau}}\cdot\sigma$	element-wise range proof
DP-RSA [50]	untrusted	client	$\sum_{i}[\text{sign}(g_{i})+\mathcal{N}(0,\sigma^{2})]$	$\sqrt{n}\cdot\sigma$	aggregation of sign-SGD
DP-LFH (baseline in Sec. 3.2)	untrusted	client	$\sum_{i}[m_{i}+\mathcal{N}(0,\sigma^{2})]$	$\sqrt{n}\cdot\sigma$	LFH [22]: client momentum and centered clipping
DP-BREM (our initial solution)	trusted	server	$\sum_{i}m_{i}+\mathcal{N}(0,\sigma^{2})$	$\sigma$	LFH [22]: client momentum and centered clipping
DP-BREM⁺ (our final solution) $^{\dagger}$	untrusted	clients (jointly)	$\sum_{i}m_{i}+\mathcal{N}(0,\sigma^{2})$	$\sigma$	LFH [22]: client momentum and centered clipping

$\mathsection$ We show the privacy-utility tradeoff by fixing the same privacy cost (in terms of DP) and then comparing the standard deviation of the noise on the aggregation, where a smaller standard deviation means the DP noise has less negative impact on the utility. Note that different approaches have different aggregation strategies, where $g_{i}$ denotes local gradient, and $m_{i}$ denotes local momentum.

$\diamond$ DDP-RP assumes an honest-but-curious server, i.e., one that follows protocol instructions honestly but may try to learn additional information. It guarantees distributed DP (DDP) with secure aggregation techniques, where clients add partial noise with a smaller standard deviation, depending on the number of honest clients or its lower bound, denoted by $\tau$. Thus, it provides better privacy-utility tradeoff than local DP (LDP).

$\dagger$ DP-BREM⁺ achieves the same DP and robustness guarantees as DP-BREM, but has a different trust assumption on the server. It achieves the same noise as central DP, i.e., the DP noise is added to the aggregation, but does not require a trusted server because the noise is securely generated by clients (via the proposed noise generation protocol) and securely added to the aggregation (via a secure aggregation protocol).

1) We propose a novel differentially private and Byzantine-robust FL protocol called DP-BREM, which adds DP noise to the aggregated client momentum with carefully computed sensitivity. Our privacy analysis (shown in Theorem 1) accounts for the privacy cost from momentum, which is different from the conventional DP-SGD that accounts for the privacy cost from the gradient. We also provide the convergence analysis of DP-BREM (shown in Theorem 3), which indicates that there is only a small sacrifice in the convergence rate to satisfy DP (compared to the large sacrifice in convergence of the baseline solution shown in Section 3.2).

2) Considering that DP-BREM is developed under the assumption of a trusted server, we propose the final solution called DP-BREM⁺ (in Section 5), which achieves the same privacy and robustness properties as DP-BREM, even under a malicious server (for privacy only), using secure multiparty computation techniques. DP-BREM⁺ is built based on the framework of secure aggregation with verifiable inputs (SAVI) [35], but extends it to guarantee the integrity of DP noise via a novel secure distributed noise generation protocol. Our extended SAVI protocol is general enough to be applied to other DP and robust FL protocols that are average-based.

3) We conduct extensive experiments using MNIST and CIFAR-10 datasets (in Section 6) to demonstrate the effectiveness of our protocols. The results show that they achieve better utility under the same record-level DP guarantees, as well as strong robustness against Byzantine clients under state-of-the-art attacks, compared to the baseline methods.

2 Preliminaries

2.1 Differential Privacy (DP)

Differential Privacy (DP) is a rigorous mathematical framework for the release of information derived from private data. Applied to machine learning, a differentially private training mechanism allows the public release of model parameters with a strong privacy guarantee: adversaries are limited in what they can learn about the original training data based on analyzing the parameters, even when they have access to arbitrary side information. The formal definition is as follows:

Definition 1 ( $(\epsilon,\delta)$ -DP [13, 12]).

For $\epsilon\in[0,\infty)$ and $\delta\in[0,1)$ , a randomized mechanism $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}$ with a domain $\mathcal{D}$ (e.g., possible training datasets) and range $\mathcal{R}$ (e.g., all possible trained models) satisfies $(\epsilon,\delta)$ -Differential Privacy (DP) if for any two neighboring datasets $D,D^{\prime}\in\mathcal{D}$ that differ in only one record and for any subset of outputs $S\subseteq\mathcal{R}$ , it holds that

$$\mathbb{P}[\mathcal{M}(D)\in S]\leqslant e^{\epsilon}\cdot\mathbb{P}[\mathcal{M}(D^{\prime})\in S]+\delta$$

where $\epsilon$ and $\delta$ are privacy parameters (or privacy budget), and a smaller $\epsilon$ and $\delta$ indicate a more private mechanism.

Gaussian Mechanism. A common paradigm for approximating a deterministic real-valued function $f:\mathcal{D}\rightarrow\mathbb{R}$ with a differentially-private mechanism is via additive noise calibrated to $f$’s sensitivity $s_{f}$, which is defined as the maximum of the absolute distance $|f(D)-f(D^{\prime})|$ over all pairs of neighboring datasets $D,D^{\prime}$. The Gaussian Mechanism is defined by $\mathcal{M}(D)=f(D)+\mathcal{N}(0,s_{f}^{2}\cdot\sigma^{2})$, where $\mathcal{N}(0,s_{f}^{2}\cdot\sigma^{2})$ is noise drawn from a Gaussian distribution. It was shown that $\mathcal{M}$ satisfies $(\epsilon,\delta)$-DP if $\delta\geqslant\frac{5}{4}e^{-(\sigma\epsilon)^{2}/2}$ and $\epsilon<1$ [13]. Note that we use an advanced privacy analysis tool proposed in [11], which works for all $\epsilon>0$.
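As a minimal illustration of the Gaussian mechanism, the following Python sketch perturbs a bounded sum query. The function name `gaussian_mechanism`, the toy records, and the value $\sigma=2$ are assumptions made for this example only; the sensitivity of 1 assumes each record lies in $[0,1]$ and that neighboring datasets differ by adding or removing one record.

```python
import random


def gaussian_mechanism(value, sensitivity, sigma, rng):
    """Return value + N(0, (sensitivity * sigma)^2), as in M(D) = f(D) + noise."""
    return value + rng.gauss(0.0, sensitivity * sigma)


rng = random.Random(0)            # fixed seed only to make the sketch repeatable
records = [0.2, 0.9, 0.4, 0.7]    # hypothetical records, each assumed to lie in [0, 1]
true_sum = sum(records)           # f(D); one record changes the sum by at most 1
noisy_sum = gaussian_mechanism(true_sum, sensitivity=1.0, sigma=2.0, rng=rng)
```

A larger `sigma` gives a smaller privacy cost but a noisier release; translating `sigma` into a concrete $(\epsilon,\delta)$ pair is left to the privacy analysis and is not attempted here.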

DP-SGD Algorithm. The most well-known differentially-private algorithm in machine learning is DP-SGD [1], which introduces two modifications to the vanilla stochastic gradient descent (SGD). First, a clipping step is applied to the gradient so that its norm is bounded, which yields a finite sensitivity. Second, Gaussian noise is added to the summation of clipped gradients, which is equivalent to applying the Gaussian mechanism to the updated iterates. The privacy accountant of DP-SGD is shown in Appendix F.
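The two modifications can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the reference DP-SGD implementation: gradients are plain lists of floats, and the helper names `clip` and `dp_sgd_gradient` are hypothetical.

```python
import math
import random


def clip(vec, bound):
    """Scale vec so that its L2 norm is at most bound."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def dp_sgd_gradient(per_record_grads, clip_bound, sigma, rng):
    """Clip each per-record gradient, sum, add Gaussian noise, then average."""
    dim = len(per_record_grads[0])
    total = [0.0] * dim
    for grad in per_record_grads:
        for j, v in enumerate(clip(grad, clip_bound)):
            total[j] += v
    # Noise standard deviation is clip_bound * sigma in every coordinate.
    noisy = [s + rng.gauss(0.0, clip_bound * sigma) for s in total]
    return [v / len(per_record_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]   # toy per-record gradients
noisy_grad = dp_sgd_gradient(grads, clip_bound=1.0, sigma=1.0, rng=rng)
```

Clipping bounds the contribution of any single record to the sum by `clip_bound`, which is what allows the noise to be calibrated independently of the data.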

2.2 Federated Learning (FL) with DP

Federated Learning (FL) [21, 29] is a collaborative learning setting to train machine learning models. We consider the horizontal cross-silo FL setting, which involves multiple clients, each holding their own private dataset of the same set of features, and a central server that implements the aggregation. Unlike the traditional centralized approach, data is not stored at a central server; instead, clients train models locally and exchange updated parameters with the server, which aggregates the received local model parameters and sends them to the clients. Based on the participating clients and scale, federated learning can be classified into two types: cross-device FL where clients are typically mobile devices and the client number can reach up to a scale of millions; cross-silo FL (our focus) where clients are organizations or companies and the client number is usually small (e.g., within a hundred).

FL with DP. In FL, the neighboring datasets $D$ and $D^{\prime}$ in Definition 1 can be defined at two distinct levels: record-level and client-level. In cross-device FL, each device usually stores one individual’s data, so the device’s whole dataset should be protected. This corresponds to client-level DP, where $D^{\prime}$ is obtained by adding or removing one client/device’s whole training dataset from $D$. In cross-silo FL, each record corresponds to one individual’s data, so record-level DP should be provided, where $D^{\prime}$ is obtained by adding or removing a single training record/example from $D$. Since we consider cross-silo FL, achieving record-level DP is our privacy goal.

2.3 Byzantine Attacks and Defenses

In a Byzantine attack, the adversary aims to destroy the convergence of the model. Due to the decentralized design, FL systems are vulnerable to Byzantine clients, who may not follow the protocol and can send arbitrary updates to the server. Also, they may have complete knowledge of the system and can collude with each other. Most state-of-the-art defense mechanisms [8, 47, 5, 32] rely on median statistics of client gradients. However, recent attacks [3, 45] have empirically demonstrated the failure of the above robust aggregations.

LFH: Non-private Byzantine-Robust Defense. Recently, Karimireddy et al. [22] showed that most state-of-the-art robust aggregators require strong assumptions and may not converge even in the complete absence of Byzantine attackers. Then, they proposed a new Byzantine-robust scheme called "learning from history" (LFH) that essentially utilizes two simple strategies: client momentum (during local update) and centered clipping (during server aggregation). In each iteration $t$, client $\mathsf{C}_{i}$ receives the global model parameter $\bm{\theta}_{t-1}$ from the server, and computes the local gradient of the random dataset batch $\mathcal{D}_{i,t}\subseteq\mathcal{D}_{i}$ by

$$\bm{g}_{t,i}=\frac{1}{|\mathcal{D}_{i,t}|}\sum\nolimits_{\bm{x}\in\mathcal{D}_{i,t}}\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta}_{t-1})\tag{1}$$

where $\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta}_{t-1})$ is the per-record gradient w.r.t. the loss function $\ell(\cdot)$ . The client momentum can be computed via

$$\bm{m}_{t,i}=(1-\beta)\bm{g}_{t,i}+\beta\bm{m}_{t-1,i}\tag{2}$$

where $\beta\in[0,1)$ . After receiving $\bm{m}_{t,i}$ from all clients, the server implements aggregation with centered clip** via

$$\bm{m}_{t}=\bm{m}_{t-1}+\frac{1}{n}\sum\nolimits_{i=1}^{n}\mathsf{Clip}_{C}(\bm{m}_{t,i}-\bm{m}_{t-1})\tag{3}$$

where $\mathsf{Clip}_{C}(\cdot)$ with scalar $C>0$ is the clipping function:

$$\mathsf{Clip}_{C}(\bm{x})\coloneqq\bm{x}\cdot\min\{1,~C/\|\bm{x}\|\}\tag{4}$$

and $\|\bm{x}\|$ is the L2-norm of any vector $\bm{x}$. The clipping operation $\mathsf{Clip}_{C}(\bm{m}_{t,i}-\bm{m}_{t-1})$ essentially bounds the distance between the client’s local momentum $\bm{m}_{t,i}$ and the previous aggregated momentum $\bm{m}_{t-1}$, and thus restricts the impact of Byzantine clients. Then, the global model $\bm{\theta}_{t}$ can be updated by $\bm{\theta}_{t}=\bm{\theta}_{t-1}-\eta_{t}\bm{m}_{t}$ with learning rate $\eta_{t}$. The convergence rate under Byzantine attacks is shown by the following lemma.
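The update rules (2) to (4) can be sketched as follows. This is an illustrative Python sketch rather than the implementation of [22]: vectors are plain lists, the helper names are hypothetical, and the toy momenta (three honest clients and one Byzantine client sending a very large vector) are made up for the example.

```python
import math


def clip(vec, bound):
    """Clip_C in (4): scale vec so that its L2 norm is at most bound."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def client_momentum(grad, prev_momentum, beta):
    """Client momentum in (2)."""
    return [(1 - beta) * g + beta * m for g, m in zip(grad, prev_momentum)]


def centered_clip_aggregate(client_momenta, prev_global, clip_c):
    """Server aggregation with centered clipping in (3)."""
    n = len(client_momenta)
    update = [0.0] * len(prev_global)
    for momentum in client_momenta:
        diff = [a - b for a, b in zip(momentum, prev_global)]
        for j, v in enumerate(clip(diff, clip_c)):
            update[j] += v / n
    return [p + u for p, u in zip(prev_global, update)]


prev_global = [1.0, 0.0]
momenta = [[1.0, 0.0], [1.2, 0.0], [0.8, 0.0], [1000.0, 0.0]]  # last one is Byzantine
robust = centered_clip_aggregate(momenta, prev_global, clip_c=1.0)
plain_average = [sum(m[j] for m in momenta) / len(momenta) for j in range(2)]
```

In this toy round the Byzantine client can move the first coordinate of the aggregate by at most $C/n=0.25$, whereas the plain average is pulled to roughly 250.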

Lemma 1 (Convergence Rate of LFH [22]).

With some parameter tuning, the convergence rate of the Byzantine-robust algorithm LFH is asymptotically (ignoring constants and higher order terms) of the order

$$\frac{1}{T}\sum\nolimits_{t=1}^{T}\mathbb{E}\|\nabla\ell(\bm{\theta}_{t-1})\|^{2}\lesssim\sqrt{\frac{\rho^{2}}{T}\cdot\frac{1+|\mathcal{B}|}{n}}\tag{5}$$

where $\ell(\cdot)$ is the loss function, $T$ is the total number of training iterations, $|\mathcal{B}|$ is the number of Byzantine clients, $n$ is the number of all clients, and $\rho$ is a parameter that quantifies the variance of honest clients’ stochastic gradients:

$$\mathbb{E}\|\bm{g}_{t,i}-\mathbb{E}[\bm{g}_{t,i}]\|^{2}\leqslant\rho^{2}\tag{6}$$

Interpretation of Lemma 1. When there are no Byzantine clients, LFH recovers the optimal rate of $\frac{\rho}{\sqrt{nT}}$ . In the presence of a $|\mathcal{B}|/n$ fraction of Byzantine clients, the rate has an additional term $\rho\sqrt{\frac{|\mathcal{B}|/n}{T}}$ , which depends on the fraction $|\mathcal{B}|/n$ but does not improve with increasing clients.
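To make this interpretation explicit, the bound in (5) can be split using the elementary inequality $\sqrt{a+b}\leqslant\sqrt{a}+\sqrt{b}$ for $a,b\geqslant 0$. The following is only a worked restatement of Lemma 1 under the same convention of ignoring constants, not an additional result:

$$\sqrt{\frac{\rho^{2}}{T}\cdot\frac{1+|\mathcal{B}|}{n}}=\sqrt{\frac{\rho^{2}}{nT}+\frac{\rho^{2}}{T}\cdot\frac{|\mathcal{B}|}{n}}\leqslant\frac{\rho}{\sqrt{nT}}+\rho\sqrt{\frac{|\mathcal{B}|/n}{T}}$$

The first term shrinks as $n$ grows, whereas the second depends on $n$ only through the fraction $|\mathcal{B}|/n$, so adding clients at a fixed Byzantine fraction does not reduce it.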

3 Problem Statement and Motivation

3.1 Problem Statement

System Model. Our system model follows the general setting of Fed-SGD [29]. There are multiple parties in the FL system: one aggregation server and $n$ participating clients $\{\mathsf{C}_{1},\cdots,\mathsf{C}_{n}\}$ . The server holds a global model $\bm{\theta}_{t}\in\mathbb{R}^{d}$ and each client $\mathsf{C}_{i}$ , $i\in\{1,\cdots,n\}$ possesses a private training dataset $\mathcal{D}_{i}$ . The server communicates with each client through a secure (private and authenticated) channel. During the iterative training process, the server broadcasts the global model in the current iteration to all clients and aggregates the received gradient/momentum from all clients (or a subset of clients) to update the global model until convergence. The final global model is returned after the training process as the output.

Threat Model. The considered adversary aims to perform a 1) privacy attack and/or 2) Byzantine attack with the following threat model, respectively.

1) Privacy Attack. Following the conventional FL setting, we assume the server has no access to the client’s local training data, but may have an incentive to infer clients’ private information. In our initial solution called DP-BREM, we assume a trusted server that can obtain clients’ local models/updates. The adversary is a third party or the participating clients (can be any set of clients) who have access to the intermediate and final global models and may use them to infer the private data of an honest client $\mathsf{C}_{i}$ . Hence, the privacy goal is to ensure the global model (and its update) satisfies DP. In our final solution DP-BREM⁺, in addition to third parties and clients, the adversary also includes the server that tries to infer additional information from the local updates (and may deviate from the protocol for privacy inference). Such a model is also adopted in previous work [35]. Following [35], we assume a minority of malicious clients who can deviate from the protocol arbitrarily.

2) Byzantine Attack. Recall that the goal of Byzantine attacks is to destroy the convergence of the global model (discussed in Section 2.3). We only consider malicious clients as the adversaries for Byzantine attacks because the server’s primary goal is to train a robust model, and thus it has no incentive to implement Byzantine attacks. These malicious clients (assumed to be a minority of all participating clients) can deviate from the protocol arbitrarily and have full control of both their local training data and their submissions to the server, but do not influence other honest clients.

Objectives. The goal of this paper is to achieve both record-level DP and Byzantine robustness at the same time. We aim to provide high utility (i.e., high accuracy of the global model) with the required DP guarantee under the existence of Byzantine attacks from malicious clients. Our ultimate privacy goal is to provide DP guarantees against an untrusted server and other clients, but we start by assuming a trusted server first in our initial solution.

3.2 Challenges and Baseline

Challenges: replacing average-based aggregator leads to large sensitivity of DP. Though there are many works on achieving either DP or Byzantine robustness, it’s nontrivial to achieve both with high utility. The main reason is that most Byzantine-robust methods replace the averaging aggregator with median-based strategies or some complex robust aggregators, which leads to a large sensitivity of DP compared to the averaging operation, as illustrated in Example 1.

Example 1 (Sensitivity Computation: Average vs. Median).

Consider a dataset with 5 samples: $\mathcal{D}=\{1,3,5,7,9\}$, and its neighboring dataset $\mathcal{D}^{\prime}$ is obtained by changing one value in $\mathcal{D}$ by at most 1, such as $\mathcal{D}^{\prime}=\{1,3,\bm{4},7,9\}$. Then, the sensitivity of the average query is $\max\limits_{\mathcal{D},\mathcal{D}^{\prime}}|\mathsf{avg}(\mathcal{D})-\mathsf{avg}(\mathcal{D}^{\prime})|=1/5=0.2$. However, the sensitivity of the median query is $\max\limits_{\mathcal{D},\mathcal{D}^{\prime}}|\mathsf{median}(\mathcal{D})-\mathsf{median}(\mathcal{D}^{\prime})|=1$. Moreover, when the size of the dataset increases, the sensitivity of the average query decreases (so less noise needs to be added), while the sensitivity of the median query stays the same.
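Example 1 can be checked numerically with a short Python sketch. The helper `gaps` is hypothetical and evaluates only one particular neighboring pair (the middle value lowered by 1), so it illustrates the gap for that pair rather than computing the maximum over all neighbors.

```python
from statistics import mean, median


def gaps(values):
    """Change the middle value by 1 and report |avg gap| and |median gap|."""
    neighbor = list(values)
    neighbor[len(values) // 2] -= 1
    avg_gap = abs(mean(values) - mean(neighbor))
    med_gap = abs(median(values) - median(neighbor))
    return avg_gap, med_gap


D = [1, 3, 5, 7, 9]                      # dataset from Example 1
small = gaps(D)                          # neighbor is [1, 3, 4, 7, 9]
large = gaps(list(range(1, 203, 2)))     # 101 odd values: 1, 3, ..., 201
```

For the 5-sample dataset the average moves by 0.2 and the median by 1; for the 101-sample dataset the average moves by only 1/101 while the median still moves by 1.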

DP-LFH: baseline via direct combination of LFH and DP-SGD. As shown in Section 2.3, the Byzantine-robust scheme LFH [22] utilizes an average-based aggregator, which can be regarded as a non-private robust solution to address the disadvantage of median-based aggregator. A straightforward method to add DP protection on top of LFH is to combine it with the DP-SGD algorithm. However, LFH requires each client to compute the local momentum $\bm{m}_{t,i}$ for server aggregation, while DP-SGD aggregates gradients and accounts for the privacy cost via the composition of iterative gradient update. Without a trusted server, a straightforward solution to combine DP with LFH is to use DP-SGD at each client to privatize the local gradient, and then compute the momentum from the privatized gradient (thus there is no additional privacy cost due to post-processing). Formally, client $\mathsf{C}_{i}$ computes

$$\bm{g}_{t,i}=\frac{1}{|\mathcal{D}_{i,t}|}\left[\sum\nolimits_{\bm{x}\in\mathcal{D}_{i,t}}\mathsf{Clip}_{R}(\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta}_{t-1}))+\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})\right],\tag{7}$$

where $\mathbf{I}_{d}$ is an identity matrix with size $d\times d$ ($d$ is the model size, i.e., $\bm{\theta}_{t}\in\mathbb{R}^{d}$), the record-level clipping function $\mathsf{Clip}_{R}(\cdot)$ restricts the sensitivity when adding/removing one record from the local dataset, and the Gaussian noise $\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ provides the DP property for $\bm{g}_{t,i}$. Since DP is immune to post-processing, the remaining steps can be implemented in the same way as the original LFH, without incurring additional privacy costs. This baseline solution DP-LFH achieves record-level DP against an untrusted server. However, it has several limitations, which lead to both poor privacy-utility tradeoff and robustness.

Limitation 1: large aggregated noise. Since each client locally adds DP noise, the overall noise after aggregation will be larger than the case of the central setting under the same privacy budget (i.e., the value of $\epsilon$ ), because only the server adds noise in the central setting. Therefore, DP-LFH has a poor privacy-utility tradeoff.
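The gap between local and central noise can be illustrated with a small simulation. This Python sketch assumes, as in Table 1, that each of $n$ clients adds independent scalar noise $\mathcal{N}(0,\sigma^{2})$ in the local setting and that a single draw of $\mathcal{N}(0,\sigma^{2})$ is added in the central setting; the function name and the values $n=16$, $\sigma=1$ are chosen only for illustration.

```python
import math
import random
import statistics


def aggregate_noise_std(n_clients, sigma, local_noise):
    """Standard deviation of the total noise in the sum of client contributions."""
    return math.sqrt(n_clients) * sigma if local_noise else sigma


rng = random.Random(0)
n_clients, sigma, trials = 16, 1.0, 4000

# Local setting: every client adds its own noise before aggregation.
local_totals = [sum(rng.gauss(0.0, sigma) for _ in range(n_clients))
                for _ in range(trials)]
# Central setting: noise is added once to the aggregate.
central_totals = [rng.gauss(0.0, sigma) for _ in range(trials)]

empirical_local = statistics.pstdev(local_totals)      # close to sqrt(16) * 1 = 4
empirical_central = statistics.pstdev(central_totals)  # close to 1
```

The empirical values track the analytical ones returned by `aggregate_noise_std`, i.e., a factor of $\sqrt{n}$ more noise in the aggregate when it is added locally.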

Limitation 2: large impact on Byzantine robustness. Since the DP noise is added locally to each client’s gradient before momentum-based clipping, it has a negative impact on Byzantine robustness: the noisy client momentum $\bm{m}_{t,i}$ has larger variance than the noise-free one, which leads to larger bias and variance in the clipping step $\mathsf{Clip}_{C}(\bm{m}_{t,i}-\bm{m}_{t-1})$. Furthermore, this impact is amplified when there are more Byzantine clients, which is explained as follows. Since the parameter $\rho^{2}$ defined in (6) quantifies the variance of the client’s gradient, and the DP noise is added to the local gradient in (7), the parameter $\rho$ of the convergence rate shown in (5) is replaced by $\rho+\sqrt{d}\sigma$ (ignoring constants) for DP-LFH, i.e., the convergence rate of DP-LFH is asymptotically of the order

$$\frac{1}{T}\sum\nolimits_{t=1}^{T}\mathbb{E}\|\nabla\ell(\bm{\theta}_{t-1})\|^{2}\lesssim\sqrt{\frac{(\rho+\sqrt{d}\sigma)^{2}}{T}\cdot\frac{1+|\mathcal{B}|}{n}}\tag{8}$$

Therefore, either a large $d$ (i.e., large model) or a large $\sigma$ (i.e., small privacy budget $\epsilon$ ) will enlarge the impact from Byzantine clients due to the order $O(\sqrt{d\sigma^{2}|\mathcal{B}|})$ of convergence rate. We note that Guerraoui et al.’s work [19] also shares a similar insight: they show that DP with local noise and Byzantine robustness are incompatible, especially when the dimension of model parameters $d$ is large.
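To give a feel for the size of this effect, the Python sketch below evaluates the right-hand sides of (5) and (8) for made-up parameter values. Because both bounds ignore constants, only the ratio between them is meaningful; the function names and all numbers are assumptions for illustration.

```python
import math


def lfh_rate(rho, total_iters, n_clients, n_byzantine):
    """Right-hand side of (5), ignoring constants."""
    return math.sqrt(rho ** 2 / total_iters * (1 + n_byzantine) / n_clients)


def dp_lfh_rate(rho, sigma, dim, total_iters, n_clients, n_byzantine):
    """Right-hand side of (8): rho is replaced by rho + sqrt(d) * sigma."""
    return lfh_rate(rho + math.sqrt(dim) * sigma, total_iters, n_clients, n_byzantine)


# Hypothetical setting: 20 clients, 4 of them Byzantine, a model with 10,000 parameters.
base = lfh_rate(rho=1.0, total_iters=1000, n_clients=20, n_byzantine=4)
noisy = dp_lfh_rate(rho=1.0, sigma=0.1, dim=10_000, total_iters=1000,
                    n_clients=20, n_byzantine=4)
ratio = noisy / base   # equals (rho + sqrt(d) * sigma) / rho
```

With these values the ratio is about 11, even though the per-coordinate noise level $\sigma=0.1$ is small relative to $\rho$, because the noise contribution scales with $\sqrt{d}$.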

Limitation 3: no privacy amplification from client-level sampling due to momentum. According to the recursive representation $\bm{m}_{t,i}=(1-\beta)\bm{g}_{t,i}+\beta\bm{m}_{t-1,i}$ , client $\mathsf{C}_{i}$ ’s momentum in $t$ -th iteration $\bm{m}_{t,i}$ is essentially a weighted summation of all previous privatized client gradients:

$$\bm{m}_{t,i}=(1-\beta)(\bm{g}_{t,i}+\beta\bm{g}_{t-1,i}+\cdots+\beta^{t-2}\bm{g}_{2,i})+\beta^{t-1}\bm{g}_{1,i}\tag{9}$$

where $\bm{g}_{1,i},\bm{g}_{2,i},\cdots,\bm{g}_{t,i}$ are already privatized via local noise. Assume the server samples a subset of clients for aggregation in each iteration. Assume that client $\mathsf{C}_{i}$ ’s momentum $\bm{m}_{t,i}$ is not selected in the $t$ -th iteration, thus the aggregate is independent of $\bm{g}_{t,i}$ . However, in a later iteration (i.e., $\tau>t$ ), if client $\mathsf{C}_{i}$ ’s momentum $\bm{m}_{\tau,i}$ is involved in the aggregation, it will depend on $\bm{g}_{t,i}$ according to (9). Therefore, we need to account for the privacy cost of $\bm{g}_{t,i}$ in all iterations. There is no privacy amplification benefit from sampling clients, leading to high privacy costs or low utility.
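The dependence of a later momentum on every earlier gradient can be verified with scalar gradients. In this Python sketch the function names and the toy gradient values are hypothetical, and the first momentum is taken to be $\bm{m}_{1,i}=\bm{g}_{1,i}$, which is the initialization that makes the recursion agree with (9).

```python
def recursive_momentum(grads, beta):
    """Apply m_t = (1 - beta) * g_t + beta * m_{t-1}, starting from m_1 = g_1."""
    momentum = grads[0]
    for grad in grads[1:]:
        momentum = (1 - beta) * grad + beta * momentum
    return momentum


def unrolled_momentum(grads, beta):
    """Closed form (9): weighted sum of all gradients g_1, ..., g_t."""
    t = len(grads)
    head = sum(beta ** (t - k) * grads[k - 1] for k in range(2, t + 1))
    return (1 - beta) * head + beta ** (t - 1) * grads[0]


beta = 0.9
grads = [1.0, 2.0, 3.0, 4.0, 5.0]        # g_1, ..., g_5 for one client
shifted = [1.0, 2.0, 4.0, 4.0, 5.0]      # same, but g_3 changed by 1

m5 = recursive_momentum(grads, beta)
m5_closed_form = unrolled_momentum(grads, beta)
# Even if the client was not sampled at t = 3, m_5 still depends on g_3
# with weight (1 - beta) * beta^2.
effect_of_g3 = recursive_momentum(shifted, beta) - m5
```

Here `effect_of_g3` equals $(1-\beta)\beta^{2}=0.081$, which is why the privacy cost of $\bm{g}_{t,i}$ has to be accounted for in every iteration regardless of client sampling.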

4 DP-BREM

To address the limitations of DP-LFH, we start from the assumption of a trusted server that can obtain clients’ local models/updates and generate DP noise, and propose an initial solution called DP-BREM (in Section 4.1). It is a differentially-private version of LFH with carefully designed enhancements, achieving a similar level of robustness as the non-private LFH. Since DP-BREM adds DP noise to the momentum (as opposed to adding noise to the gradient in DP-SGD), our privacy accountant shown in Section 4.2 is different from the traditional privacy accountant of DP-SGD. We also provide the convergence analysis in Section 4.3, where the provable convergence of LFH is maintained with only a small difference. Based on DP-BREM, we then relax the server’s trust assumption in our final solution DP-BREM⁺ (in Section 5), by adopting secure multiparty computation techniques including secure aggregation with input validation and joint noise generation, which achieves the same DP guarantee with the same amount of noise as in DP-BREM, without trusting the server.

4.1 Algorithm Design

The mathematical notations involved in our algorithm design and theoretical analysis are summarized in Table 5 (see Appendix A). The illustration of our design is shown in Figure 1, and the algorithm is shown in Algorithm 1, where all clients need to implement local updates (in Line-3), but only a subset of their momentum vectors are aggregated by the server (in Line-4). The details of client updates and server aggregation are described below.

Client Updates. The client $\mathsf{C}_{i}$ first samples a random batch $\mathcal{D}_{i,t}$ from the local dataset $\mathcal{D}_{i}$ with sampling rate $p_{i}$ , clips the per-record gradient $\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta}_{t-1})$ by $R$ and multiplies the sum by a constant factor $\frac{1}{p_{i}|\mathcal{D}_{i}|}$ to get the averaged gradient

$$\bar{\bm{g}}_{t,i}=\frac{1}{p_{i}|\mathcal{D}_{i}|}\sum\nolimits_{\bm{x}\in\mathcal{D}_{i,t}}\mathsf{Clip}_{R}(\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta}_{t-1}))\tag{10}$$

where $\mathsf{Clip}_{R}(\cdot)$ is the clipping function defined in (4), but is used here to bound the sensitivity for DP (refer to DP-SGD discussed in Section 2.1). Note that the batch size $|\mathcal{D}_{i,t}|$ is random and $\mathbb{E}[|\mathcal{D}_{i,t}|]=p_{i}|\mathcal{D}_{i}|$. Then, the local momentum can be computed by

$$\bar{\bm{m}}_{t,i}=\begin{cases}\bar{\bm{g}}_{t,i},&\text{if }t=1\\ (1-\beta)\bar{\bm{g}}_{t,i}+\beta\bar{\bm{m}}_{t-1,i},&\text{if }t>1\end{cases}\tag{11}$$

where $\beta\in[0,1)$ is the momentum parameter.
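The client update in (10) and (11) can be sketched in Python as follows. The sketch makes two simplifying assumptions that are not stated above: the batch is drawn by including each record independently with probability $p_{i}$ (which is consistent with a random batch size of expectation $p_{i}|\mathcal{D}_{i}|$), and the per-record gradients of the whole local dataset are passed in as a list for simplicity. The helper names are hypothetical.

```python
import math
import random


def clip(vec, bound):
    """Clip_R: scale vec so that its L2 norm is at most bound."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def client_update(per_record_grads, prev_momentum, p, clip_r, beta, rng):
    """One local step: averaged clipped gradient (10), then momentum (11).

    per_record_grads holds one gradient per record of the local dataset D_i;
    prev_momentum is None in the first iteration (t = 1).
    """
    dim = len(per_record_grads[0])
    total = [0.0] * dim
    for grad in per_record_grads:
        if rng.random() < p:                      # assumed per-record sampling
            for j, v in enumerate(clip(grad, clip_r)):
                total[j] += v
    # Normalize by the expected batch size p * |D_i|, not the realized one.
    g_bar = [s / (p * len(per_record_grads)) for s in total]
    if prev_momentum is None:
        return g_bar
    return [(1 - beta) * g + beta * m for g, m in zip(g_bar, prev_momentum)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]                  # toy per-record gradients
m1 = client_update(grads, None, p=1.0, clip_r=1.0, beta=0.9, rng=rng)
m2 = client_update(grads, [1.0, 1.0], p=1.0, clip_r=1.0, beta=0.9, rng=rng)
```

Dividing by the fixed quantity $p_{i}|\mathcal{D}_{i}|$ rather than the realized batch size means that adding or removing one record changes $\bar{\bm{g}}_{t,i}$ by at most $R/(p_{i}|\mathcal{D}_{i}|)$ in norm.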

Figure 1: Illustration of our DP-BREM algorithm.

Algorithm 1 DP-BREM

Input: Initialization $\bm{\theta}_{0}\in\mathbb{R}^{d}$, clipping bounds $R$ and $C$, learning rate $\eta_{t}$ of the global model.

1: for $t=1,\cdots,T$ do

2: The server broadcasts the previous model $\bm{\theta}_{t-1}$ to all clients $\{\mathsf{C}_{i}\}_{i=1}^{n}$ and selects a subset of client indices $\mathcal{I}_{t}\subseteq\{1,\cdots,n\}$, where each client is selected with probability $q$.

3: Each client $\mathsf{C}_{i}$ for $i\in\{1,\cdots,n\}$ implements the local updates via (10) and (11), while only selected clients need to send the local momentum $\bar{\bm{m}}_{t,i}$ (for $i\in\mathcal{I}_{t}$) to the server.

4: The server aggregates received clients’ momentum (only for $i\in\mathcal{I}_{t}$) with centered clipping and DP noise via (12), then updates the global model $\bm{\theta}_{t}$ via (13).

5: end for

Output: The final model parameter $\bm{\theta}_{T}$

Server Aggregation. The server implements centered clipping with clipping parameter $C>0$ to bound the difference between client momentum $\bar{\bm{m}}_{t,i}$ and the previous noisy global momentum $\tilde{\bm{m}}_{t-1}$ for robustness. Then, it adds Gaussian noise with standard deviation $R\sigma$ (thus the variance is $R^{2}\sigma^{2}$) to the sum of clipped terms to get the noisy global momentum $\tilde{\bm{m}}_{t}$

$$\tilde{\bm{m}}_{t}=\tilde{\bm{m}}_{t-1}+\frac{1}{|\mathcal{I}_{t}|}\left[\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})+\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})\right] \tag{12}$$

where $\mathbf{I}_{d}$ is an identity matrix with size $d\times d$, and only the sampled clients in $\mathcal{I}_{t}$ (which is obtained in Line-2 of Algorithm 1 with sampling rate $q$) are aggregated in the $t$-th iteration. Note that adding noise $\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ to the summation of clipped client momentum $\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})$ is equivalent to adding noise $\frac{1}{|\mathcal{I}_{t}|}\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ to the average result $\frac{1}{|\mathcal{I}_{t}|}\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})$. Then, the server updates the global model $\bm{\theta}_{t}$ with learning rate $\eta_{t}$

$$\bm{\theta}_{t}=\bm{\theta}_{t-1}-\eta_{t}\tilde{\bm{m}}_{t} \tag{13}$$
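The server step in (12) and (13) can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: $\mathsf{Clip}_{C}$ is taken to be norm rescaling, vectors are plain Python lists, the function names are hypothetical, and the case of an empty $\mathcal{I}_{t}$ (which the text does not discuss) simply raises an error.

```python
import math
import random

def clip(v, bound):
    """Rescale v so that its L2-norm is at most `bound` (assumed form of Clip in (4))."""
    norm = math.sqrt(sum(x * x for x in v))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [scale * x for x in v]

def server_step(theta_prev, m_tilde_prev, client_momenta, C, R, sigma, eta, rng):
    """One round of Eq. (12) followed by Eq. (13) for the selected clients."""
    if not client_momenta:
        raise ValueError("no client was selected in this round")
    d = len(theta_prev)
    total = [0.0] * d
    for m in client_momenta:
        # Centered clipping: bound each client's deviation from the previous global momentum.
        diff = [a - b for a, b in zip(m, m_tilde_prev)]
        total = [a + b for a, b in zip(total, clip(diff, C))]
    # Gaussian noise with standard deviation R * sigma is added to the sum, per coordinate.
    noisy = [s + rng.gauss(0.0, R * sigma) for s in total]
    k = len(client_momenta)
    m_tilde = [mp + x / k for mp, x in zip(m_tilde_prev, noisy)]
    theta = [th - eta * mt for th, mt in zip(theta_prev, m_tilde)]
    return theta, m_tilde
```

Because every clipped difference has norm at most $C$, a single Byzantine client can move the sum by at most $C$ in one round, no matter how large the vector it submits.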

Remark: clipping bounds and sampling rates. In our algorithm, we use two clipping bounds and two sampling rates. For clipping bounds, each client uses record-level bound $R$ to bound the per-record gradient in (10) for a finite sensitivity in record-level DP; while the server uses client-level bound $C$ to bound the difference between client momentum $\bar{\bm{m}}_{t,i}$ and the previous noisy global momentum $\tilde{\bm{m}}_{t-1}$ in (12), which achieves Byzantine robustness as in LFH. For sampling rates, the client $\mathsf{C}_{i}$ samples a batch of records $\mathcal{D}_{i,t}$ from the local dataset $\mathcal{D}_{i}$ with sampling rate $p_{i}$, which provides privacy amplification for DP from record-level sampling; while the server samples a subset of clients with sampling rate $q$ (where the sampled clients set is denoted by $\mathcal{I}_{t}$), which provides privacy amplification for DP from client-level sampling.

Remark: comparison with non-private LFH. Compared with the original non-private Byzantine-robust method LFH [22] (see Section 2.3), our differentially-private version has three differences. First, compared with (1), the client gradient in (10) is computed by averaging the clipped per-record gradients (with clipping bound $R$), which bounds the sensitivity of the final aggregation when adding/removing one record from the local dataset. Second, compared with (3), the server adds Gaussian noise when computing the aggregated global momentum $\tilde{\bm{m}}_{t}$ in (12) to guarantee DP. Third, instead of aggregating all clients’ momentum, our method also considers aggregating a subset of them, reflected by the index set $\mathcal{I}_{t}$ in (12). This provides additional privacy amplification from client-level sampling with sampling rate $q$, on top of the original privacy amplification provided by record-level sampling.

4.2 Privacy Analysis

Before presenting the final privacy analysis of DP-BREM, we first show how we compute the sensitivity for the summation of clipped client momentum in (12) for privacy analysis of one iteration, shown in Lemma 2. We note that clients may have different sizes of local datasets $\mathcal{D}_{i}$ and can use different record-level sampling rates $p_{i}$ , thus the record-level sensitivity (denoted by $S_{i}$ ) for different clients can be different.

Lemma 2 (DP Sensitivity).

We use $\|\cdot\|$ to denote the L2-norm $\|\cdot\|_{2}$. In the $t$-th round, denote the query function $Q_{t}(\mathcal{D})\coloneqq\sum\nolimits_{j\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,j}-\tilde{\bm{m}}_{t-1})$, where $\tilde{\bm{m}}_{t-1}$ is public and $\mathcal{D}=\{\mathcal{D}_{j}\}_{j\in\mathcal{I}_{t}}$. Consider the neighboring dataset $\mathcal{D}^{\prime}=\{\mathcal{D}_{j}\}_{j\neq i,j\in\mathcal{I}_{t}}\cup\mathcal{D}_{i}^{\prime}$ that differs in one record from client $\mathsf{C}_{i}$’s local data ($i\in\mathcal{I}_{t}$), i.e., $|\mathcal{D}_{i}-\mathcal{D}_{i}^{\prime}|=1$. Then the sensitivity with respect to client $\mathsf{C}_{i}$ is

$$S_{i}\coloneqq\max_{\mathcal{D},\mathcal{D}^{\prime}}\|Q_{t}(\mathcal{D})-Q_{t}(\mathcal{D}^{\prime})\|=\min\left\{2C,\frac{R}{p_{i}|\mathcal{D}_{i}|}\right\} \tag{14}$$

Proof.

(Sketch) According to (10), the sensitivity of $\bar{\bm{g}}_{t,i}$ is $\frac{R}{p_{i}|\mathcal{D}_{i}|}$ because each clipped term $\mathsf{Clip}_{R}(\cdot)$ has bounded L2-norm, i.e., $\|\mathsf{Clip}_{R}(\cdot)\|\leqslant R$. Then, due to the recursive representation of local momentum in (11), the sensitivity of $\bar{\bm{m}}_{t,i}$ is also $\frac{R}{p_{i}|\mathcal{D}_{i}|}$. Finally, the client-level clipping $\mathsf{Clip}_{C}(\cdot)$ introduces another upper bound for the sensitivity. Refer to Appendix A for the full-version proof. ∎
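One way to see why the momentum inherits the sensitivity of the gradient is to unroll (11). The derivation below is a sketch under two assumptions that are not stated in this form in the text: neighboring datasets differ by adding or removing one record, and $\mathsf{Clip}_{C}$ is the rescaling $\bm{v}\mapsto\bm{v}\cdot\min\{1,C/\|\bm{v}\|\}$. The full proof in Appendix A may proceed differently. Write $\Delta\coloneqq\frac{R}{p_{i}|\mathcal{D}_{i}|}$ for the per-round sensitivity of the clipped gradient, and let primes denote quantities computed on the neighboring dataset.

```latex
\begin{aligned}
\bar{\bm{m}}_{t,i}
  &= \beta^{t-1}\,\bar{\bm{g}}_{1,i} + (1-\beta)\sum_{s=2}^{t}\beta^{t-s}\,\bar{\bm{g}}_{s,i}, \\
\|\bar{\bm{m}}_{t,i}-\bar{\bm{m}}_{t,i}^{\prime}\|
  &\leqslant \Big[\beta^{t-1} + (1-\beta)\sum_{s=2}^{t}\beta^{t-s}\Big]\Delta
   = \big[\beta^{t-1} + (1-\beta^{t-1})\big]\Delta
   = \Delta .
\end{aligned}
```

The momentum weights sum to one, so the changed record cannot shift $\bar{\bm{m}}_{t,i}$ by more than $\Delta$ even if it is sampled in every round. Under the assumed form, $\mathsf{Clip}_{C}$ is the Euclidean projection onto the ball of radius $C$, which is non-expansive, so the clipped difference changes by at most $\Delta$. It also changes by at most $2C$, because both clipped vectors have norm at most $C$. Only the $i$-th summand of $Q_{t}$ depends on $\mathcal{D}_{i}$, which gives the bound $\min\{2C,\Delta\}$ in (14).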

Remark: comparison with the privacy accountant of DP-SGD momentum. As discussed in Section 3.2, the privacy accountant of DP-SGD with momentum (i.e., account for the privacy cost of the gradient, then do post-processing for the momentum) requires clients to add noise to the local gradients, which leads to poor utility, especially when Byzantine attacks exist. In Lemma 2, we account for the privacy cost of the aggregated momentum, where the sensitivity is carefully computed from the bounded record-level gradient. Therefore, our scheme solves the three limitations shown in Section 3.2, as explained below. First, only the server adds noise (which is the same as the central setting), thus the privacy-utility tradeoff is not impacted. Second, the noise is added after the centered clipping $\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})$, thus it only introduces unbiased error. We also show (in Section 4.3) that the impact of the added noise is separate from the impact of Byzantine attacks, whereas in DP-LFH the impact of the local noise is enlarged by Byzantine attacks (see Section 3.2). Third, since privacy is accounted on the momentum, and only the aggregated momentum leaks privacy, our solution enjoys privacy amplification from client-level sampling.

The final privacy analysis of DP-BREM is shown in Theorem 1. It presents how to compute the privacy budget $\epsilon$ and privacy parameter $\delta$ when the parameters (such as $T$ , $\sigma$ , $q$ , etc.) of the algorithm are given. We use an advanced privacy accountant tool called Gaussian DP (GDP) [11] (refer to Appendix F), then convert it to $(\epsilon,\delta)$ -DP. Note that in our privacy analysis, clients can use different record-level sampling rates $p_{i}$ , thus different sensitivity $S_{i}$ shown in (14). Therefore, the final privacy budget (denoted by $\epsilon_{i}$ ) of DP-BREM may be different for different clients, which provides personalized privacy if these parameters are different for each client.

Theorem 1 (Privacy Analysis).

DP-BREM (in Algorithm 1) satisfies record-level $(\epsilon_{i},\delta)$ -DP for an honest client $\mathsf{C}_{i}$ with $\epsilon_{i}$ and $\delta$ satisfying

$$\delta=\Phi\left(-\frac{\epsilon_{i}}{\mu_{i}}+\frac{\mu_{i}}{2}\right)-e^{\epsilon_{i}}\cdot\Phi\left(-\frac{\epsilon_{i}}{\mu_{i}}-\frac{\mu_{i}}{2}\right), \tag{15}$$

where $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of standard normal distribution, and $\mu_{i}$ is defined by

$$\mu_{i}=qp_{i}\sqrt{T(e^{1/\sigma_{i}^{2}}-1)},\quad\text{with }\sigma_{i}=\sigma\cdot\max\left\{\frac{R}{2C},p_{i}|\mathcal{D}_{i}|\right\} \tag{16}$$

Proof.

This result is obtained by the composition of multiple iterations and the privacy amplification from sampling. See Appendix B for the detailed proof. ∎
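To make Theorem 1 concrete, the sketch below evaluates (16) and (15) numerically and inverts (15) by bisection to obtain $\epsilon_{i}$ for a given $\delta$. It is an illustration only: the function names are hypothetical, the bisection range is an assumption, and the inversion relies on the right-hand side of (15) being decreasing in $\epsilon_{i}$. The standard normal CDF is computed with `math.erfc` for accuracy in the tails.

```python
import math

def phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def gdp_mu(T, q, p_i, n_i, sigma, R, C):
    """Eq. (16): mu_i for a client with n_i = |D_i| records and record sampling rate p_i."""
    sigma_i = sigma * max(R / (2.0 * C), p_i * n_i)
    return q * p_i * math.sqrt(T * (math.exp(1.0 / sigma_i ** 2) - 1.0))

def delta_from_eps(eps, mu):
    """Eq. (15): the delta paired with a given epsilon."""
    return phi(-eps / mu + mu / 2.0) - math.exp(eps) * phi(-eps / mu - mu / 2.0)

def eps_from_delta(delta, mu, eps_max=100.0, iters=100):
    """Smallest epsilon in [0, eps_max] whose delta from (15) is at most `delta`."""
    lo, hi = 0.0, eps_max
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if delta_from_eps(mid, mu) > delta:
            lo = mid
        else:
            hi = mid
    return hi
```

Note that $\sigma_{i}=R\sigma/S_{i}$ with $S_{i}$ from (14), so a client with a larger expected batch size $p_{i}|\mathcal{D}_{i}|$ sees a larger effective noise multiplier for the same server-side noise.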

4.3 Convergence Analysis

Before presenting the final convergence analysis of our solution, we first show the aggregation error for one iteration in Theorem 2.

Theorem 2 (Aggregation Error).

Denote $\bm{m}_{t}^{*}\coloneqq\frac{1}{|\mathcal{H}|}\sum\nolimits_{i\in\mathcal{H}}\bm{m}_{t,i}$ as the ground truth aggregated raw momentum, where $\bm{m}_{t,i}$ is the client momentum computed from gradient without record-level clipping. Assume the local momentum of all honest clients $\{\bm{m}_{t,i}\}_{i\in\mathcal{H}}$ are i.i.d. with expectation $\bm{\mu}\coloneqq\mathbb{E}[\bm{m}_{t,i}]$, and the variance is bounded (in terms of L2-norm)

$$\mathbb{E}\|\bm{m}_{t,i}-\bm{\mu}\|^{2}\leqslant\rho^{2} \tag{17}$$

After some parameter tuning (the detailed tuning is shown under (C) in Appendix C) of the clipping bounds:

$$R\propto O\left(\rho\sqrt{n/(|\mathcal{B}|+\sqrt{d}\sigma/q)}\right),\quad C\propto O(R) \tag{18}$$

we have the following aggregation error due to clipping, DP noise, and Byzantine clients:

$$\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}\leqslant O\left(\frac{\rho^{2}(|\mathcal{B}|+\sqrt{d}\sigma/q)}{n}\right) \tag{19}$$

where $|\mathcal{B}|$ is the number of Byzantine clients, $d$ is the dimension of model parameter $\bm{\theta}_{t}$ , $\sigma$ is the noise multiplier (for DP) shown in (12), $q$ is the client-level sampling rate shown in Line-2 of Algorithm 1, and $\rho$ is defined in (17). The formal version of (19) is shown in (C) of Appendix C.

Proof.

(Sketch) Directly bounding $\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}$ is not easy, thus we utilize the upper bounds of $\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}$ and $\mathbb{E}\|\bm{\mu}-\bm{m}_{t}^{*}\|^{2}$ to get the final result, where $\bm{\mu}\coloneqq\mathbb{E}[\bm{m}_{t,i}]$ is the expected local momentum (we assume clients’ local momentum are i.i.d.). When upper bounding $\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}$, we can decompose it into three types of errors: error of honest clients (due to randomness and bias introduced by clipping), error of Byzantine clients (due to Byzantine perturbation), and error introduced by the added DP noise. Furthermore, we can tune $C$ and $R$ to minimize the sum of the above three types of errors. Refer to Appendix C for the full-version proof. ∎

Interpretation of Theorem 2. The value of $\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}$ quantifies the aggregation error, i.e., how the aggregated privatized momentum $\tilde{\bm{m}}_{t}$ (with clipping, DP noise, and Byzantine clients’ impact) differs from the "pure" momentum aggregation $\bm{m}_{t}^{*}$, where only honest clients participate and without clipping and DP noise. According to (19), the aggregation error is proportional to $\rho^{2}$ and $\frac{|\mathcal{B}|}{n}+\frac{\sqrt{d}\sigma}{nq}$, where $\rho^{2}$ quantifies the variance of honest clients’ local momentum, $\frac{|\mathcal{B}|}{n}$ is the fraction of Byzantine clients, and $\frac{\sigma}{nq}=O(1/\epsilon)$ for $\epsilon$-DP. In other words, the aggregation error will be enlarged when: honest clients’ variance is large, or the Byzantine attacker corrupts more clients, or the training model is complex (i.e., the model dimension $d$ is large), or we need stronger privacy (i.e., a smaller $\epsilon$), or the number of clients $n$ is small. Furthermore, because the two terms in $\frac{|\mathcal{B}|}{n}+\frac{\sqrt{d}\sigma}{nq}$ are additive, the impact from DP noise is independent of the increase of Byzantine clients $|\mathcal{B}|$ (versus Limitation 2 of DP-LFH in Section 3.2). On the other hand, according to the parameter tuning in (18), we could theoretically set a smaller record-level clipping bound $R$ when $\sigma$, $d$, and $|\mathcal{B}|$ are large, or $\rho$ and $n$ are small. The client-level clipping bound $C$ should be adjusted according to the value of $R$. Recall that $R$ is for DP, while $C$ is for robustness.

Table 2: Comparison of Convergence Rate

Method	Where to add noise	Convergence Rate
LFH [22]	None	$O(\rho\sqrt{1+|\mathcal{B}|})$
DP-LFH	Clients’ gradients	$O\left((\rho+\sqrt{d}\sigma)\sqrt{1+|\mathcal{B}|}\right)$
DP-BREM	Aggregated momentum	$O\left(\rho\sqrt{1+|\mathcal{B}|+\sqrt{d}\sigma}\right)$

By following the convergence analysis in [22] and using the result in (19), we have the convergence rate shown below.

Theorem 3 (Convergence Rate of DP-BREM).

The convergence rate of DP-BREM in Algorithm 1 is asymptotically (ignoring constants and higher order terms) of the order

$$\frac{1}{T}\sum\nolimits_{t=1}^{T}\mathbb{E}\|\nabla\ell(\bm{\theta}_{t-1})\|^{2}\lesssim\sqrt{\frac{\rho^{2}}{T}\frac{|\mathcal{B}|+(1+\sqrt{d}\sigma)/q}{n}} \tag{20}$$

where $\ell(\cdot)$ is the loss function, $T$ is the total number of training iterations, and other parameters are the same as in (19).

Proof.

See Appendix D. ∎

Remark: comparison with LFH and DP-LFH. The convergence rates of the non-private LFH, DP-LFH, and the proposed solution DP-BREM, shown in (5), (8), and (20) respectively, are summarized in Table 2. Though both DP-LFH and DP-BREM pay an additional term of $\sqrt{d}\sigma/q$ to get the DP property, this term affects their convergence differently. As discussed in Limitation 2 of Section 3.2, the additional term $\sqrt{d}\sigma/q$ of DP-LFH (due to DP noise added to clients’ gradients) is added to the term $\rho$, thus it will enlarge the impact of Byzantine clients (i.e., the term $|\mathcal{B}|$). However, the additional term $\sqrt{d}\sigma/q$ of our solution DP-BREM (due to DP noise added to the aggregated momentum) is added to the term $1+|\mathcal{B}|$, which is of square-root order. Therefore, DP noise only has a limited impact on the convergence of DP-BREM when there are Byzantine clients. We will validate the above theoretical analysis via experimental results in Section 6.

5 DP-BREM⁺ with Secure Aggregation

The private and robust FL solution DP-BREM (in Section 4) assumes a trusted server which can access clients’ momentum. In this section, we propose DP-BREM⁺, which assumes a malicious server and utilizes secure aggregation techniques, achieving the same DP and robustness guarantees as DP-BREM. As discussed in Section 3.1, we consider the server as malicious only for data privacy, while clients are malicious for both data privacy and Byzantine attacks.

5.1 Challenges

Considering the server is malicious for data privacy, the noisy aggregate of momentum with centered clipping shown in (12) must be implemented securely with the goals of 1) privacy, i.e., each party, including clients and the server, learns nothing but the differentially-private output; and 2) integrity, i.e., the output is correctly computed. Since the noisy aggregated momentum of the previous iteration $\tilde{\bm{m}}_{t-1}$ already satisfies DP, we can regard it as public information and only need to focus on securely computing the term $\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})+\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ in (12).

Secure Aggregation with Verified Inputs (SAVI). The key cryptographic technique we leverage to achieve the above objectives is SAVI [35], a class of protocols that securely aggregate only well-formed inputs. The security goals include both privacy and integrity. Specifically, privacy means that no party should be able to learn anything about the raw input of an honest client, other than what can be learned from the final aggregation result. Integrity means that the protocol outputs the correct aggregate of well-formed inputs, where 1) an input $u$ passes the input integrity check with a public validation predicate $\mathsf{Valid}(\cdot)$ if and only if $\mathsf{Valid}(u)=1$, and 2) the aggregation is correctly computed. An instantiation of the SAVI protocol is EIFFeL [35] (described in Appendix G).

Challenge: Secure Generation of Gaussian Noise. A SAVI protocol can potentially solve the problem of securely aggregating the clipped vectors (by enforcing a norm-bound on the client momentum difference). However, the Gaussian noise $\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ needs to be securely generated and aggregated as well. In DP-BREM with a trusted server, the Gaussian noise $\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ is generated by the server to guarantee DP. However, when the server is assumed to be malicious, the added Gaussian noise for DP cannot be directly generated by the server.

A straightforward solution is to follow [36], which assumes the existence of another semi-honest server (one that does not collude with the original server) that generates DP noise and executes the privacy engine. However, the assumption of another non-colluding server may not be practical, and we assume only a single server.

Another alternative solution is to leverage Distributed DP (DDP) [39], where Gaussian noise is generated by clients in a distributed way: each client generates a Gaussian noise locally, and the aggregation of Gaussian noise also follows a Gaussian distribution with an enlarged standard deviation. Since only the aggregated result is released (with the help of crypto techniques), each client can add a smaller noise with the guarantee that the aggregated noise satisfies the required DP. However, this solution has two limitations in our scenario. First, distributed noise generation needs to add more noise to achieve the same privacy compared with server-side noise generation due to the collusion of malicious clients. Second, malicious clients can generate arbitrary values as the local Gaussian noise, which has a large impact on the robustness.
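The first limitation can be illustrated with a short calculation. The sketch below follows one common way of accounting for collusion in distributed noise generation and is an assumption for illustration, not the exact analysis of [39]: if up to $m$ of the $n$ clients may collude and subtract their own noise shares, each client sizes its share so that the remaining $n-m$ shares alone reach the target standard deviation. The function names are hypothetical.

```python
import math

def per_client_std(target_std, n, m=0):
    """Std of each client's local Gaussian share so that the n - m
    non-colluding shares alone sum to noise with std `target_std`."""
    if not 0 <= m < n:
        raise ValueError("need 0 <= m < n")
    return target_std / math.sqrt(n - m)

def total_std(target_std, n, m=0):
    """Std of the aggregated noise when all n clients actually add their share."""
    return per_client_std(target_std, n, m) * math.sqrt(n)
```

With no collusion the aggregate matches the target exactly, while tolerating $m>0$ colluding clients inflates the aggregate standard deviation by a factor of $\sqrt{n/(n-m)}$ compared with server-side noise generation.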

A possible solution to address the first limitation is to jointly generate Gaussian noise as in [34], where no party learns or controls the true value of the noise (or a portion of the noise). However, the protocol in [34] is designed only for additive secret sharing schemes, which only works for honest-but-curious parties and does not tolerate malicious parties. Moreover, in [34], the Gaussian noise is jointly generated by honest-but-curious and non-colluding parties, which does not address the second limitation as the clients can be malicious in our threat model discussed in Section 3.1.

Overview of DP-BREM⁺. To achieve secure aggregation with verified inputs and secure Gaussian noise generation under the threat model of a malicious server and malicious minority of clients, our DP-BREM⁺ 1) leverages an existing SAVI protocol called EIFFeL [35] to achieve secure input validation; and 2) introduces a new protocol to achieve secure noise generation that is compatible with EIFFeL. The idea of jointly generating Gaussian noise in DP-BREM⁺ is inspired by [34], but our design is based on Shamir’s secret sharing [37] with robust reconstruction by following the design in EIFFeL, thus guarantees security under malicious minority. We present the preliminaries of Shamir’s secret sharing and EIFFeL protocol in Appendix G.

5.2 Design of DP-BREM⁺

As discussed in Section 5.1, the main task of DP-BREM⁺ is to securely compute the term $\sum\nolimits_{i\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})+\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ shown in (12). After computing local momentum $\bar{\bm{m}}_{t,i}$ via (11), each client $\mathsf{C}_{i}$ first implements centered clipping to get $\bm{z}_{i}\coloneqq\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})$, which is the private input for validation and aggregation.

Three-Phase Design. In DP-BREM⁺, clients and the server jointly implement three phases: 1) secure input validation to validate that each client momentum is properly center-clipped by $C$, 2) secure noise generation, where clients generate shares of Gaussian noise which can be aggregated in Phase 3 to ensure DP, and 3) aggregation of valid inputs and noise to obtain the noisy global model. We assume the arithmetic circuit is computed over a finite field $\mathbb{F}_{2^{K}}$. The illustration of DP-BREM⁺ is shown in Figure 2. Due to limited space, we present the detailed steps ①-⑦ in Appendix H.

Phase 1: Secure Input Validation. The validation function for an input $\bm{z}_{i}$ considered in DP-BREM⁺ is defined as $\mathsf{Valid}(\bm{z}_{i})\coloneqq\mathbbm{1}(\|\bm{z}_{i}\|\leqslant C)$, where $\mathsf{Valid}(\bm{z}_{i})=1$ if and only if the condition $\|\bm{z}_{i}\|\leqslant C$ holds. Since honest clients compute $\bm{z}_{i}=\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\tilde{\bm{m}}_{t-1})$, verifying whether $\bm{z}_{i}$ is well-formed, with bounded L2-norm via $\mathsf{Valid}(\cdot)$, for all clients ensures centered clipping of client momentum $\bar{\bm{m}}_{t,i}$ (to achieve robustness as DP-BREM). We follow the design in EIFFeL [35] for secure input validation, which returns the validation result $\mathsf{Valid}(\bm{z}_{i})$ (either 1 or 0) for client $\mathsf{C}_{i}$’s private input $\bm{z}_{i}$, corresponding to steps ①, ②, and ③ shown in Figure 2. Then, clients and the server can jointly verify all inputs $\{\bm{z}_{i}\}_{i\in\mathcal{I}_{t}}$, and obtain the set of valid inputs $\mathcal{I}_{\mathsf{Valid}}$, where $\mathsf{Valid}(\bm{z}_{i})=1$ for all $i\in\mathcal{I}_{\mathsf{Valid}}$. In the later step, only inputs in $\mathcal{I}_{\mathsf{Valid}}$ are aggregated.
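The functionality that Phase 1 computes can be written down in the clear as follows. This sketch only shows the predicate and the resulting index set. It does not model EIFFeL, which evaluates the check on secret shares so that no party sees $\bm{z}_{i}$. The function names and the small floating-point tolerance `tol` are hypothetical additions for illustration.

```python
import math

def valid(z, C, tol=1e-9):
    """Valid(z) = 1 if the L2-norm of z is at most C, else 0.
    `tol` absorbs floating-point rounding and is not part of the definition."""
    norm = math.sqrt(sum(x * x for x in z))
    return 1 if norm <= C + tol else 0

def valid_index_set(inputs, C):
    """Indices of the selected clients whose inputs pass the check.
    `inputs` maps a client index to that client's vector z_i."""
    return [i for i, z in inputs.items() if valid(z, C) == 1]
```

An honest client always passes because its input is produced by $\mathsf{Clip}_{C}$, while a malicious client that submits an oversized vector is excluded from the aggregate instead of being clipped by the server.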

Phase 2: Secure Noise Generation. We develop a new protocol for secure distributed Gaussian noise generation, which returns the shares (held by each client) of a random vector $\bm{\xi}$ of length $d$ from the Gaussian distribution $\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$, corresponding to steps ④ and ⑤ shown in Figure 2. The shares of noise can be reconstructed into a single Gaussian noise (for ensuring DP) with the guarantee that no parties know or control the generated noise, which protects the information of private inputs after the noisy aggregate is released.

Phase 3: Aggregation of Valid Inputs and Noise. Finally, the server and clients can aggregate the valid inputs (obtained in Phase 1) and the generated Gaussian noise (obtained in Phase 2) by implementing steps ⑥ and ⑦ shown in Figure 2, ensuring nothing except the noisy aggregate can be learned.
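The reason shares of inputs and shares of noise can be combined before anything is revealed is the additive homomorphism of Shamir's secret sharing: adding shares pointwise yields shares of the sum. The toy sketch below illustrates only this property. It is not the DP-BREM⁺ protocol: it uses a prime field with a hypothetical modulus instead of $\mathbb{F}_{2^{K}}$, shares integers rather than encoded real vectors, and uses plain Lagrange interpolation without the robust reconstruction needed to tolerate malicious parties.

```python
import random

PRIME = 2**61 - 1  # hypothetical field modulus (a Mersenne prime), chosen for illustration

def share(secret, n, threshold, rng):
    """Split `secret` into n Shamir shares; any `threshold` of them reconstruct it."""
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def reconstruct(shares):
    """Lagrange interpolation at 0 over the prime field."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for k, (xk, _) in enumerate(shares):
            if k != j:
                num = (num * (-xk)) % PRIME
                den = (den * (xj - xk)) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret

def add_shares(a, b):
    """Pointwise sum of two share lists: the result is a sharing of the sum of the secrets."""
    return [(x, (ya + yb) % PRIME) for (x, ya), (_, yb) in zip(a, b)]
```

For example, sharing 12 and 30 among five parties with threshold three, adding the shares, and reconstructing from any three summed shares returns 42 without either summand being reconstructed on its own.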

Remark on Efficiency. DP-BREM⁺’s usage of EIFFeL’s secure input validation is due to efficiency considerations. Instead of having clients perform clipping and using secure input validation, one alternative is to use standard secure multi-party computation (MPC) for the clipping and aggregation. However, doing this under MPC would result in a very large computation/communication overhead due to the multiplication, min-operation, division, and L2-norm computation in the clipping operation $\mathsf{Clip}_{C}(\cdot)$ defined in (4). In contrast, the secure input validation protocol only requires the verifiers to check all the multiplication gates very efficiently with just one identity test. The compatibility with secure input validation is one of the advantages of DP-BREM.

Complexity. According to EIFFeL [35], the computation/communication complexity of secure aggregation with input validation is $O(mnd)$ for clients and $O(n^{2}+md\min\{n,m^{2}\})$ for the server in terms of the number of clients $n$ , number of malicious clients $m$ , and data dimension $d$ . For the proposed secure noise generation (only clients are involved), the computation/communication complexity for total $n$ clients is $O(mnd)$ .

5.3 Security Analysis

In comparison, EIFFeL [35] is a secure aggregation protocol with verified inputs (without guaranteeing DP), while our solution DP-BREM⁺ is a secure noisy aggregation protocol with verified inputs and jointly generated Gaussian noise, which provides DP on the aggregated results. Therefore, the only difference is the Gaussian noise that will be aggregated to the final result. We show the formal security guarantee of DP-BREM⁺ in the following theorem.

Theorem 4 (Security Guarantees of DP-BREM⁺).

For the validation function $\mathsf{Valid}(\cdot)$ considered in Section 5.2, given a security parameter $\kappa$ , the secure noisy aggregation protocol in DP-BREM⁺ satisfies:

1) Integrity. For a negligible function $\text{negl}(\cdot)$ , the output of the protocol returns the noisy aggregate of a subset of clients $\mathcal{I}_{\mathsf{Valid}}$ and Gaussian noise $\bm{\xi}$ , such that all clients in $\mathcal{I}_{\mathsf{Valid}}$ have well-formed inputs:

$$\Pr\left[\text{output}=\sum\nolimits_{i\in\mathcal{I}_{\mathsf{Valid}}}\bm{z}_{i}+\bm{\xi}\right]\geqslant 1-\text{negl}(\kappa)$$

where random vector $\bm{\xi}\sim\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ , and $\mathsf{Valid}(\bm{z}_{i})=1$ for all $i\in\mathcal{I}_{\mathsf{Valid}}$ . Note that the set $\mathcal{I}_{\mathsf{Valid}}$ contains all honest clients (denoted by $\mathcal{I}_{H}$ ) and the malicious clients who submitted well-formed input (denoted by $\mathcal{I}_{M}^{*}$ ), i.e., $\mathcal{I}_{\mathsf{Valid}}=\mathcal{I}_{H}\cup\mathcal{I}_{M}^{*}$ .

2) Privacy. For a set of malicious clients $\mathcal{I}_{M}$ and a malicious server $\mathsf{S}$ , there exists a probabilistic polynomial-time (P.P.T.) simulator $\mathsf{Sim}(\cdot)$ such that:

$$\mathsf{Real}\left(\{\bm{z}_{i}\}_{i\in\mathcal{I}_{H}},\Omega_{\mathcal{I}_{M}\cup\mathsf{S}}\right)\equiv_{\mathsf{C}}\mathsf{Sim}\left(\sum\nolimits_{i\in\mathcal{I}_{H}}\bm{z}_{i}+\bm{\xi},\mathcal{I}_{H},\Omega_{\mathcal{I}_{M}\cup\mathsf{S}}\right)$$

where $\{\bm{z}_{i}\}_{i\in\mathcal{I}_{H}}$ denotes the input of all the honest clients, $\mathsf{Real}$ denotes a random variable representing the joint view of all the parties in the protocol’s execution, $\Omega_{\mathcal{I}_{M}\cup\mathsf{S}}$ indicates a polynomial-time algorithm implementing the "next-message" function of the parties in $\mathcal{I}_{M}\cup\mathsf{S}$ (see [35, Appendix 11.5]), and $\equiv_{\mathsf{C}}$ denotes computational indistinguishability. In summary, the server and clients do not learn anything besides the final aggregated result.

Proof.

See Appendix I. ∎

6 Experimental Evaluation

In this section, we demonstrate the effectiveness of the proposed DP-BREM/DP-BREM⁺ on achieving both good privacy-utility tradeoff and Byzantine robustness via experimental results on MNIST [25] and CIFAR-10 [24] datasets with a non-IID setting (refer to Appendix J.1 for more details on the datasets and model architectures). All experiments are developed via PyTorch (our source code will be available after the acceptance of the paper).

Byzantine Attacks. We consider four existing Byzantine attacks in our experiments, including ALIE ("a little is enough") [3], IPM (inner-product manipulation) [45], LF (label-flipping), and the state-of-the-art MTB ("manipulating-the-Byzantine") [38]. Refer to Appendix J.1 for more details.

Compared Methods. We compare the performance of six approaches against Byzantine attacks, including DP-BREM/⁺ (our approach), a variant of DP-FedSGD [30] with both record and client norm clipping, DDP-RP [43], DP-RSA [50], a variant of CM [47] with DP noise, and DP-LFH. Since DP-BREM⁺ achieves the same DP and robustness guarantees as DP-BREM, we did not perform the empirical experiments with secure aggregation because the accuracy results will be exactly the same as DP-BREM. We use DP-BREM/⁺ to denote both DP-BREM and DP-BREM⁺, and the implementation follows Algorithm 1. The comparison (on trust assumption and mechanism overview) of these approaches is provided in Table 1, and Appendix J.1 shows more details of each approach. In summary, DP-BREM/⁺, DP-FedSGD, and DP-CM add central noise to the aggregation, but DP-BREM⁺ does not require a trusted server due to the secure aggregation technique. DDP-RP adds partial local noise to the client’s update with secure aggregation. DP-RSA and DP-LFH add local noise to the client’s update. We fix $\delta=10^{-6}$ for $(\epsilon,\delta)$-DP in all experiments. For the setting of other parameters, refer to Appendix J.2.

Evaluation Metric. We evaluate the testing accuracy of the global model within $T$ iterations. Considering the accuracy curve might be unstable under Byzantine attacks, we average the accuracy between $0.9T$ and $T$ as the final accuracy for comparison. Note that both DP noise and Byzantine attacks reduce the accuracy. A protocol achieves good Byzantine robustness if its accuracy does not decrease too much with an increased number of Byzantine clients.
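The final-accuracy metric can be written as a short helper. The exact indexing is an assumption for illustration: the text only states that accuracy is averaged between $0.9T$ and $T$, and here that is read as the last tenth of the recorded per-iteration accuracies. The function name is hypothetical.

```python
def final_accuracy(acc_per_iter):
    """Average test accuracy over the last 10% of iterations.
    acc_per_iter[t] is the accuracy after iteration t + 1, for T = len(acc_per_iter) >= 1."""
    T = len(acc_per_iter)
    start = (9 * T) // 10  # integer arithmetic avoids rounding issues with 0.9 * T
    tail = acc_per_iter[start:]
    return sum(tail) / len(tail)
```

Averaging over a window rather than reading off the last iterate smooths out the oscillations that Byzantine perturbations can cause near the end of training.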

6.1 Robustness Evaluation with DP

We consider a fixed privacy budget $\epsilon$ and implement each attack with different percentages of Byzantine clients $\delta_{B}=\frac{|\mathcal{B}|}{n}$ for the four attacks, and compare the accuracy among all approaches. The results for MNIST (with $\epsilon=3$ ) and CIFAR-10 (with $\epsilon=4$ ) datasets are shown in Figures 3 and 4. Though the detailed results differ under different attacks and for two datasets, we have some general observations:

1) When there is no attack, i.e., $\delta_{B}=0$, DP-BREM/⁺ achieves almost the same accuracy as DP-FedSGD, indicating the Byzantine-robust design (client momentum with centered clipping) has almost no impact on the utility in this case.

2) After increasing $\delta_{B}$ , our DP-BREM/⁺ has the smallest accuracy decrease, indicating its success in providing Byzantine robustness. However, the accuracy of DP-LFH reduces sharply, demonstrating that the large aggregated local DP noise makes the robust aggregator more vulnerable to Byzantine attacks, which is consistent with our discussions of Limitation 2 in Section 3.2.

3) Though DP-FedSGD has client-level gradient clipping, which can restrict malicious clients’ impact, it is still vulnerable to some types of Byzantine attacks (such as IPM and MTB) under larger $\delta_{B}$ values.

4) CM with DP noise (or DP-CM) has a relatively small accuracy decrease for a relatively small $\delta_{B}$, which is the benefit of the median-based robust aggregator. However, because its sensitivity is larger than that of the average-based aggregators (as discussed in Example 1), the aggregated DP noise is too large to obtain a high accuracy, even when $\delta_{B}=0$.

5) DDP-RP is more vulnerable to LF attack because it only checks the element-wise range. Also, the model replacement strategy in LF attack is more likely to change the positions that have small values in benign gradient vectors.

6) DP-RSA has relatively poor accuracy compared with other approaches, even when $\delta_{B}=0$. This is caused by the sign-SGD aggregator, which only aggregates element-wise signs instead of values, leading to information loss. Moreover, the local DP noise makes it easier for Byzantine attacks to succeed.

6.2 Privacy-Utility Tradeoff with Attack

We consider a fixed percentage of Byzantine clients $\delta_{B}$ for each attack under different values of privacy budget $\epsilon$, and compare the accuracy of all approaches. The results for MNIST (with $\delta_{B}=30\%$) and CIFAR-10 (with $\delta_{B}=15\%$) datasets are shown in Figures 5 and 6. For both datasets, we consider four different levels of privacy, where $\epsilon=\infty$ means the standard deviation of DP noise is 0, but we still implement record-level clipping to illustrate how the noise affects the results while keeping other settings, including the clipping step, the same. Though the detailed results differ under different attacks and for the two datasets, DP-BREM/⁺ has the best accuracy among all approaches, especially under IPM and MTB attacks. Note that when $\sigma=0$ for DP noise (i.e., $\epsilon=\infty$), both DP-BREM/⁺ and DP-LFH reduce to LFH, thus they have the same results in this case. We can observe that with a moderate privacy budget, such as $\epsilon\geqslant 2$, DP noise only has a negligible impact on the accuracy. But if $\epsilon$ is too small, such as $\epsilon=1$ in Figure 5, DP-BREM/⁺ suffers a relatively larger impact (but still acceptable) from DP noise. Note that when there exist Byzantine attacks, reducing the DP noise to $\sigma=0$ (i.e., $\epsilon=\infty$) does not significantly improve the accuracy of DP-BREM/⁺ compared with $\epsilon<\infty$, because the performance is largely impacted by Byzantine clients’ perturbations. However, the accuracy of DP-LFH is greatly reduced when $\epsilon<\infty$, since the local DP noise impacts the robustness of the aggregator. This observation is consistent with our theoretical analysis in Limitation 2 of DP-LFH (Section 3.2).

6.3 Other Results

Efficiency Evaluation of DP and Byzantine Robustness. We note that the DP and Byzantine robustness designs in our solution only introduce a small computation overhead, because 1) the clipping step of DP can be implemented efficiently; 2) our robustness is essentially a clipped summation of client momentum without any complex computations. Due to limited resources, we implemented the distributed training of FL on a single machine (by running all the clients and the server code sequentially). We evaluate the efficiency of DP-BREM via the running time (per round per client) on the MNIST dataset. The results shown in Table 3 indicate that the DP noise and Byzantine robustness only incur $8\%\sim 16\%$ additional running time (depending on batch size).

Table 3: Running time¹ (in milliseconds) per round per client on the MNIST dataset

Batch Size	Baseline (FedSGD)	FedSGD+DP (efficient²)	DP-BREM (DP+robust)	FedSGD+DP (inefficient³)
30	11.80	13.31	13.72	41.06
60	18.23	19.79	20.27	76.70
120	31.22	33.18	33.70	149.32

¹ Our GPU device is NVIDIA Tesla P100-PCIE-16GB. Using other GPU devices may have different results.
² By default, our implementation uses efficient per-record gradient clipping by following Opacus library’s implementation with parallel clipping and optimized einsum (refer to https://opacus.ai/api/_modules/opacus/optimizers/optimizer.html#DPOptimizer).
³ To illustrate the improvement of efficient clipping, we also show the results of the inefficient implementation, which clips per-record gradient sequentially and without using optimized einsum.
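The overhead range quoted above can be recomputed directly from Table 3. The following is a minimal sketch: the dictionary keys and the helper name `overhead_pct` are our own illustrative choices, and only the numbers are taken from the table.

```python
# Running times (ms per round per client) copied from Table 3.
# The key names are illustrative labels, not identifiers from the paper's code.
table3 = {
    30: {"fedsgd": 11.80, "dp_efficient": 13.31, "dp_brem": 13.72, "dp_inefficient": 41.06},
    60: {"fedsgd": 18.23, "dp_efficient": 19.79, "dp_brem": 20.27, "dp_inefficient": 76.70},
    120: {"fedsgd": 31.22, "dp_efficient": 33.18, "dp_brem": 33.70, "dp_inefficient": 149.32},
}


def overhead_pct(baseline, value):
    """Extra running time of `value` relative to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Overhead of DP-BREM (DP + robustness) over plain FedSGD, per batch size.
overheads = {
    batch: overhead_pct(row["fedsgd"], row["dp_brem"]) for batch, row in table3.items()
}

for batch, pct in overheads.items():
    print(f"batch size {batch}: DP-BREM overhead = {pct:.1f}%")
```

The relative overhead shrinks as the batch size grows, which is consistent with the stated range of roughly 8% to 16%.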

Impact of $R$ in DP-BREM/⁺. Figure 7 (in Appendix J) shows how the accuracy changes w.r.t. the record-level clipping bound $R$ in DP-BREM/⁺. The results demonstrate that when there are fewer Byzantine clients (i.e., smaller $\delta_{B}$) or the noise multiplier $\sigma$ is smaller (i.e., larger $\epsilon$), we need to set a larger $R$ to obtain better accuracy. This observation is consistent with the theoretical analysis of parameter tuning discussed in Theorem 2 and its interpretation.

Impact of $q$ in DP-BREM/⁺. In all previous experiments, we set the client-level sampling rate $q=1$ by default. As discussed in Sec. 4.1, aggregating a subset $\mathcal{I}_{t}$ of clients in (12) is one of the major differences from LFH. In Table 4, we show how the utility can be improved by this design under different attack percentages $\delta_{B}$ if we can choose an optimal $q$ (which can be regarded as a tunable parameter). Intuitively, when there is no attack, a smaller $q$ can provide more privacy amplification, i.e., a smaller $\sigma$ is needed for the same value of $\epsilon$ in DP; but if $q$ is too small, the small aggregate population will lead to a larger variance of the aggregation. When there exist Byzantine attacks, a smaller value of $q$ can reduce the attack impact for each round because only a portion of Byzantine clients are selected for aggregation. Therefore, as $\delta_{B}$ increases, the optimal $q$ (highlighted in Table 4) decreases.

Table 4: Model accuracy when varying $q$ of DP-BREM/⁺ with $\epsilon=2$ under MTB attack in CIFAR-10 dataset.

$\delta_{B}$	$q=1$	$q=0.8$	$q=0.6$	$q=0.4$	$q=0.2$
0%	0.503	0.525	0.504	0.491	0.485
10%	0.435	0.434	0.465	0.449	0.438
20%	0.255	0.284	0.297	0.328	0.241

7 Related Work

Due to limited space, we only discuss the most relevant defenses below and put other related work in Appendix K. Other works either only achieve DP or Byzantine robustness (but not both), or combine secure aggregation with Byzantine robustness without realizing DP.

Wang et al. [43] proposed an FL scheme (DDP-RP) to provide Distributed DP (via encryption) and robustness (via range proof technologies); however, this scheme only verifies whether the local model weights are in a bounded range, which provides weak robustness. In comparison, our solution utilizes client momentum and centered clipping to guarantee Byzantine robustness with provable convergence analysis. Zhu et al. [50] replace the value aggregation with sign aggregation, which provides robustness because each client has a limited impact on the aggregation. The DP noise is added to the local gradient before the sign operation. Since it only aggregates the element-wise sign (instead of the value) of clients’ gradients, it has degraded convergence due to information loss. Also, [50] only accounts for the privacy cost of one iteration, instead of the composition of all iterations in FL. Thus, the privacy cost computed in [50] is underestimated. In comparison, our solution is based on the original SGD (with momentum), and we account for the privacy cost of all iterations. Our experimental results have confirmed that DP-BREM outperforms both of these approaches.

8 Conclusions

This paper aims to achieve FL in the cross-silo setting with both DP and Byzantine robustness. We first proposed DP-BREM, a DP version of LFH-based FL protocol with a robust aggregator based on client momentum, where the server adds noise to the aggregated momentum. Then we further developed DP-BREM⁺ which relaxes the server’s trust assumption, by combining secure aggregation techniques with verifiable inputs and a new protocol for secure joint noise generation. DP-BREM⁺ achieves the same DP and robustness guarantees as DP-BREM, under a malicious server (for privacy) and malicious minority clients. We theoretically analyze the error and convergence of DP-BREM, and conduct extensive experiments that empirically show the advantage of DP-BREM/⁺ in terms of privacy-utility tradeoff and Byzantine robustness over five baseline protocols. In the future, we will extend our work to other types of robust aggregators.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016.
[2] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. In AISTATS, 2020.
[3] Gilad Baruch, Moran Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses for distributed learning. 2019.
[4] Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.
[5] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In NeurIPS, 2017.
[6] George EP Box and Mervin E Muller. A note on the generation of random normal deviates. The annals of mathematical statistics, 1958.
[7] Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J Su. Deep learning with gaussian differential privacy. Harvard Data Science Review, 2020(23), 2020.
[8] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. In ACM on Measurement and Analysis of Computing Systems, 2017.
[9] Henry Corrigan-Gibbs and Dan Boneh. Prio: Private, robust, and scalable computation of aggregate statistics. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2017.
[10] Ronald Cramer, Ivan Damgård, and Yuval Ishai. Share conversion, pseudorandom secret-sharing and applications to secure computation. In Theory of Cryptography Conference (TCC), 2005.
[11] Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. To appear in Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2019.
[12] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), 2006.
[13] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Now Publishers, 2014.
[14] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. Boosting and differential privacy. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2010.
[15] Minghong Fang, Xiaoyu Cao, Jinyuan Jia, and Neil Gong. Local model poisoning attacks to byzantine-robust federated learning. In USENIX Security Symposium, 2020.
[16] Paul Feldman. A practical scheme for non-interactive verifiable secret sharing. In IEEE Annual Symposium on Foundations of Computer Science (SFCS), pages 427–438, 1987.
[17] Shuhong Gao. A new algorithm for decoding reed-solomon codes. Communications, information and network security, pages 55–68, 2003.
[18] Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
[19] Rachid Guerraoui, Nirupam Gupta, Rafaël Pinot, Sébastien Rouault, and John Stephan. Differential privacy and byzantine resilience in sgd: Do they add up? In ACM Symposium on Principles of Distributed Computing, 2021.
[20] Lie He, Sai Praneeth Karimireddy, and Martin Jaggi. Secure byzantine-robust machine learning. arXiv preprint arXiv:2006.04747, 2020.
[21] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.
[22] Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Learning from history for byzantine robust optimization. In ICML, pages 5311–5319, 2021.
[23] Marcel Keller. Mp-spdz: A versatile framework for multi-party computation. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1575–1590, 2020.
[24] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[25] Yann LeCun. The mnist database of handwritten digits. 1998.
[26] Jeffrey Li, Mikhail Khodak, Sebastian Caldas, and Ameet Talwalkar. Differentially private meta-learning. In ICLR, 2020.
[27] Shu Lin and Daniel J Costello. Error control coding. Prentice hall Lebanon, IN, 2001.
[28] Yehuda Lindell and Ariel Nof. A framework for constructing fast mpc over arithmetic circuits with malicious adversaries and an honest-majority. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 259–276, 2017.
[29] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
[30] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In ICLR, 2018.
[31] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting unintended feature leakage in collaborative learning. In IEEE Symposium on Security and Privacy (S&P), 2019.
[32] El Mahdi El Mhamdi, Rachid Guerraoui, and Sébastien Rouault. The hidden vulnerability of distributed learning in byzantium. In ICML, 2018.
[33] Ilya Mironov. Rényi differential privacy. In IEEE Computer Security Foundations Symposium (CSF), 2017.
[34] Sikha Pentyala, Davis Railsback, Ricardo Maia, Rafael Dowsley, David Melanson, Anderson Nascimento, and Martine De Cock. Training differentially private models with secure multiparty computation. arXiv preprint, 2022.
[35] Amrita Roy Chowdhury, Chuan Guo, Somesh Jha, and Laurens van der Maaten. Eiffel: Ensuring integrity for federated learning. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 2535–2549, 2022.
[36] Amrita Roy Chowdhury, Chenghong Wang, Xi He, Ashwin Machanavajjhala, and Somesh Jha. Crypte: Crypto-assisted differential privacy on untrusted servers. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2020.
[37] Adi Shamir. How to share a secret. Communications of the ACM, 22(11):612–613, 1979.
[38] Virat Shejwalkar and Amir Houmansadr. Manipulating the byzantine: Optimizing model poisoning attacks and defenses for federated learning. In NDSS, 2021.
[39] Elaine Shi, TH Hubert Chan, Eleanor Rieffel, Richard Chow, and Dawn Song. Privacy-preserving aggregation of time-series data. In Annual Network & Distributed System Security Symposium (NDSS), 2011.
[40] Jinhyun So, Başak Güler, and A Salman Avestimehr. Byzantine-resilient secure federated learning. IEEE Journal on Selected Areas in Communications, 2020.
[41] Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, Rui Zhang, and Yi Zhou. A hybrid approach to privacy-preserving federated learning. In ACM Workshop on Artificial Intelligence and Security, 2019.
[42] Raj Kiriti Velicheti, Derek Xia, and Oluwasanmi Koyejo. Secure byzantine-robust distributed learning via clustering. arXiv preprint arXiv:2110.02940, 2021.
[43] Fayao Wang, Yuanyuan He, Yunchuan Guo, Peizhi Li, and Xinyu Wei. Privacy-preserving robust federated learning with distributed differential privacy. In IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2022.
[44] Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, and Dimitris S Papailiopoulos. Attack of the tails: Yes, you really can backdoor federated learning. In NeurIPS, 2020.
[45] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Fall of empires: Breaking byzantine-tolerant sgd by inner product manipulation. In Uncertainty in Artificial Intelligence, 2020.
[46] Runhua Xu, Nathalie Baracaldo, Yi Zhou, Ali Anwar, and Heiko Ludwig. Hybridalpha: An efficient approach for privacy-preserving federated learning. In ACM Workshop on Artificial Intelligence and Security, 2019.
[47] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In ICML, 2018.
[48] Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M Alvarez, Jan Kautz, and Pavlo Molchanov. See through gradients: Image batch recovery via gradinversion. In CVPR, 2021.
[49] Qinqing Zheng, Shuxiao Chen, Qi Long, and Weijie Su. Federated f-differential privacy. In AISTATS, 2021.
[50] Heng Zhu and Qing Ling. Bridging differential privacy and byzantine-robustness via model aggregation. In International Joint Conference on Artificial Intelligence (IJCAI), 2022.
[51] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In NeurIPS, 2019.

Appendix A Proof of Lemma 2 (Aggregation Sensitivity)

Proof.

For the local momentum computation in (11), we can rewrite it as

$$\bm{m}_{t,i}=(1-\beta)(\bar{\bm{g}}_{t,i}+\beta\bar{\bm{g}}_{t-1,i}+\cdots+\beta^{t-2}\bar{\bm{g}}_{2,i})+\beta^{t-1}\bar{\bm{g}}_{1,i}$$

For a neighboring dataset $\mathcal{D}_{i}^{\prime}$ which differs in only one record from client $\mathsf{C}_{i}$’s data, i.e., $|\mathcal{D}_{i}-\mathcal{D}_{i}^{\prime}|=1$, we denote the corresponding local gradient (with per-record gradient clipping) and momentum as $\bar{\bm{g}}_{t,i}^{\prime}$ and $\bm{m}_{t,i}^{\prime}$, respectively. Since $\bar{\bm{g}}_{\tau,i}$ is computed by (10) for $\tau=1,\cdots,t$, we have

$$\|\bar{\bm{g}}_{\tau,i}-\bar{\bm{g}}_{\tau,i}^{\prime}\|=\frac{1}{p_{i}|\mathcal{D}_{i}|}\|\mathsf{Clip}_{R}(\nabla_{\bm{\theta}}\ell(\bm{x},\bm{\theta}_{\tau-1}))\|\leqslant\frac{R}{p_{i}|\mathcal{D}_{i}|}$$

where $\bm{x}=\mathcal{D}_{i}-\mathcal{D}_{i}^{\prime}$ . Then,

$$\begin{aligned}
\|\bm{m}_{t,i}-\bm{m}_{t,i}^{\prime}\| &\leqslant(1-\beta)\left[\|\bar{\bm{g}}_{t,i}-\bar{\bm{g}}_{t,i}^{\prime}\|+\beta\|\bar{\bm{g}}_{t-1,i}-\bar{\bm{g}}_{t-1,i}^{\prime}\|+\cdots+\beta^{t-2}\|\bar{\bm{g}}_{2,i}-\bar{\bm{g}}_{2,i}^{\prime}\|\right]+\beta^{t-1}\|\bar{\bm{g}}_{1,i}-\bar{\bm{g}}_{1,i}^{\prime}\| \\
&\leqslant\left[(1-\beta)(1+\beta+\cdots+\beta^{t-2})+\beta^{t-1}\right]\cdot\frac{R}{p_{i}|\mathcal{D}_{i}|} \\
&=\left[(1-\beta)\cdot\frac{1-\beta^{t-1}}{1-\beta}+\beta^{t-1}\right]\cdot\frac{R}{p_{i}|\mathcal{D}_{i}|}=\frac{R}{p_{i}|\mathcal{D}_{i}|}
\end{aligned}$$

where the first inequality follows from applying the triangle inequality term by term. Therefore,

$$\begin{aligned}
&\|Q_{t}(\mathcal{D})-Q_{t}(\mathcal{D}^{\prime})\| \\
&=\left\|\sum\nolimits_{j\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bm{m}_{t,j}-\tilde{\bm{m}}_{t-1})-\sum\nolimits_{j\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bm{m}_{t,j}^{\prime}-\tilde{\bm{m}}_{t-1})\right\| \\
&\overset{(\mathsf{a})}{=}\|\mathsf{Clip}_{C}(\bm{m}_{t,i}-\tilde{\bm{m}}_{t-1})-\mathsf{Clip}_{C}(\bm{m}_{t,i}^{\prime}-\tilde{\bm{m}}_{t-1})\| \\
&\overset{(\mathsf{b})}{\leqslant}\min\{2C,\|\bm{m}_{t,i}-\bm{m}_{t,i}^{\prime}\|\}\leqslant\min\left\{2C,\frac{R}{p_{i}|\mathcal{D}_{i}|}\right\}
\end{aligned}$$

where $(\mathsf{a})$ holds because $\bm{m}_{t,j}=\bm{m}_{t,j}^{\prime}$ for $j\neq i$; $(\mathsf{b})$ follows from Lemma 3; and the last inequality uses the bound on $\|\bm{m}_{t,i}-\bm{m}_{t,i}^{\prime}\|$ derived above. This completes the main proof of Lemma 2. ∎
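The momentum sensitivity bound can be checked numerically. The sketch below assumes the recursion $\bm{m}_{1,i}=\bar{\bm{g}}_{1,i}$, $\bm{m}_{t,i}=(1-\beta)\bar{\bm{g}}_{t,i}+\beta\bm{m}_{t-1,i}$, which unrolls to the expression used in the proof; all parameter values, the random gradients, and the helper names are hypothetical and chosen only for illustration.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))


def momentum(grads, beta):
    """m_1 = g_1 and m_t = (1 - beta) * g_t + beta * m_{t-1} (assumed recursion)."""
    m = list(grads[0])
    for g in grads[1:]:
        m = [(1 - beta) * gj + beta * mj for gj, mj in zip(g, m)]
    return m


def random_vector_with_norm(rng, dim, norm):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = norm / l2_norm(v)
    return [scale * c for c in v]


rng = random.Random(0)
R, p, num_records = 1.0, 0.1, 100       # hypothetical values
beta, T, dim = 0.9, 50, 5               # hypothetical values
sensitivity = R / (p * num_records)     # per-step gradient bound R / (p_i |D_i|)

# Clipped gradients on D_i, and on a neighboring D_i' whose per-step
# gradients deviate by at most `sensitivity` in L2-norm.
grads = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(T)]
grads_prime = []
for g in grads:
    shift = random_vector_with_norm(rng, dim, rng.uniform(0.0, sensitivity))
    grads_prime.append([gj + sj for gj, sj in zip(g, shift)])

m = momentum(grads, beta)
m_prime = momentum(grads_prime, beta)
gap = l2_norm([a - b for a, b in zip(m, m_prime)])

# The weights of the unrolled momentum sum to one, which is why the bound
# on the momentum equals the bound on a single gradient.
weight_sum = sum((1 - beta) * beta**k for k in range(T - 1)) + beta ** (T - 1)

print(gap <= sensitivity, abs(weight_sum - 1.0) < 1e-12)
```

Because the geometric weights sum to one, accumulating gradients into momentum does not increase the record-level sensitivity beyond that of a single clipped gradient.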

Table 5: Notations

Symbols	Description
$\bm{\theta}_{t}$	(global) model in $t$ -th iteration, where $\bm{\theta}_{t}\in\mathbb{R}^{d}$
$\mathcal{D}_{i}$	local training data of client $\mathsf{C}_{i}$
$\mathcal{D}_{i,t}$	data batch sampled from $\mathcal{D}_{i}$ in $t$ -th iteration
$\bm{g}_{t,i},\bm{m}_{t,i}$	client gradient and momentum at $t$ -th iteration
$\bm{m}_{t}$	aggregation of $\bm{m}_{t,i}$ among multiple clients
$\bar{\bm{g}}_{t,i},\bar{\bm{m}}_{t,i}$	client gradient and momentum with record-level clipping
$\tilde{\bm{m}}_{t}$	aggregation of $\bar{\bm{m}}_{t,i}$ with DP noise
$p_{i}$	record-level sampling rate (implemented by client $\mathsf{C}_{i}$ )
$q$	client-level sampling rate (implemented by the server)
$R$	record-level clipping bound (for DP)
$C$	client-level clipping bound (for robustness)
$\mathcal{H}$	the set of honest clients that follow the protocol honestly
$\mathcal{B}$	the set of Byzantine clients that are malicious
$\delta_{B}$	the percentage of Byzantine clients, i.e., $\delta_{B}=|\mathcal{B}|/n\times 100\%$

The above proof used the following lemma.

Lemma 3.

For any vectors $x$ and $\delta$ , we have

$$\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|\leqslant\min\{2C,\|\delta\|\}$$

where $\mathsf{Clip}_{C}(x)\coloneqq\min\{1,C/\|x\|\}\cdot x$ and $\|\cdot\|$ denotes L2-norm.

Proof.

Our proof of Lemma 3 mainly uses the triangle inequality of a norm. Note that for the L2-norm $\|\cdot\|$, we have $\|a+b\|^{2}=\|a\|^{2}+2a^{\top}b+\|b\|^{2}$ for any vectors $a$ and $b$. We first show that $\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|\leqslant\|\delta\|$, which is proved by enumerating all cases as follows.

Case 1. Assume $\|x\|\leqslant C$ and $\|x+\delta\|\leqslant C$ . Then,

$$\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|=\|x-(x+\delta)\|=\|\delta\|$$

Case 2. Assume $\|x\|>C$ and $\|x+\delta\|\leqslant C$ . Then, $0<\frac{C}{\|x\|}<1$ and

$$\begin{aligned}
&\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|^{2} \\
&=\left\|\frac{C}{\|x\|}\cdot x-(x+\delta)\right\|^{2}=\left\|\left(1-\frac{C}{\|x\|}\right)\cdot x+\delta\right\|^{2} \\
&=\left(1-\frac{C}{\|x\|}\right)^{2}\|x\|^{2}+2\left(1-\frac{C}{\|x\|}\right)x^{\top}\delta+\|\delta\|^{2} \\
&=\left(1-\frac{C}{\|x\|}\right)\left[\left(1-\frac{C}{\|x\|}\right)\|x\|^{2}+2x^{\top}\delta+\|\delta\|^{2}\right]+\frac{C\cdot\|\delta\|^{2}}{\|x\|} \\
&=\left(1-\frac{C}{\|x\|}\right)\left[\|x+\delta\|^{2}-C\|x\|\right]+\frac{C\cdot\|\delta\|^{2}}{\|x\|} \\
&<0+1\cdot\|\delta\|^{2}=\|\delta\|^{2}
\end{aligned}$$

where $\|x+\delta\|^{2}\leqslant C^{2}<C\|x\|$ . Therefore, $\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|<\|\delta\|$ in this case.

Case 3. Assume $\|x\|\leqslant C$ and $\|x+\delta\|>C$ . Then, $0<\frac{C}{\|x+\delta\|}<1$ and

$$\begin{aligned}
&\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|^{2} \\
&=\left\|x-\frac{C}{\|x+\delta\|}\cdot(x+\delta)\right\|^{2}=\left\|\left(1-\frac{C}{\|x+\delta\|}\right)\cdot(x+\delta)-\delta\right\|^{2} \\
&=\left(1-\frac{C}{\|x+\delta\|}\right)^{2}\|x+\delta\|^{2}-2\left(1-\frac{C}{\|x+\delta\|}\right)(x+\delta)^{\top}\delta+\|\delta\|^{2} \\
&=\left(1-\frac{C}{\|x+\delta\|}\right)\left[\left(1-\frac{C}{\|x+\delta\|}\right)\|x+\delta\|^{2}-2(x+\delta)^{\top}\delta+\|\delta\|^{2}\right]+\frac{C\cdot\|\delta\|^{2}}{\|x+\delta\|} \\
&=\left(1-\frac{C}{\|x+\delta\|}\right)\left[\|(x+\delta)-\delta\|^{2}-C\|x+\delta\|\right]+\frac{C\cdot\|\delta\|^{2}}{\|x+\delta\|} \\
&<0+1\cdot\|\delta\|^{2}=\|\delta\|^{2}
\end{aligned}$$

where $\|(x+\delta)-\delta\|^{2}=\|x\|^{2}\leqslant C^{2}<C\|x+\delta\|$ . Therefore, $\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|<\|\delta\|$ in this case.

Case 4. Assume $\|x\|>C$ and $\|x+\delta\|>C$ . Then,

$$\begin{aligned}
&\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|^{2} \\
&=\left\|\frac{C}{\|x\|}\cdot x-\frac{C}{\|x+\delta\|}\cdot(x+\delta)\right\|^{2} \\
&=\frac{C^{2}}{\|x\|^{2}}\cdot\|x\|^{2}-\frac{C^{2}\cdot 2x^{\top}(x+\delta)}{\|x\|\cdot\|x+\delta\|}+\frac{C^{2}}{\|x+\delta\|^{2}}\cdot\|x+\delta\|^{2} \\
&=2C^{2}-\frac{C^{2}\cdot[\|x\|^{2}+(\|x\|^{2}+2x^{\top}\delta+\|\delta\|^{2})]}{\|x\|\cdot\|x+\delta\|}+\frac{C^{2}\cdot\|\delta\|^{2}}{\|x\|\cdot\|x+\delta\|} \\
&=2C^{2}-\frac{C^{2}\cdot[\|x\|^{2}+\|x+\delta\|^{2}]}{\|x\|\cdot\|x+\delta\|}+\frac{C^{2}}{\|x\|\cdot\|x+\delta\|}\cdot\|\delta\|^{2} \\
&<2C^{2}-C^{2}\cdot 2+1\cdot\|\delta\|^{2}=\|\delta\|^{2}
\end{aligned}$$

where $\frac{\|x\|^{2}+\|x+\delta\|^{2}}{\|x\|\cdot\|x+\delta\|}\geqslant 2$ due to $\|x\|^{2}+\|x+\delta\|^{2}-2(\|x\|\cdot\|x+\delta\|)=(\|x\|-\|x+\delta\|)^{2}\geqslant 0$ , and $\frac{C^{2}}{\|x\|\cdot\|x+\delta\|}<1$ due to $\|x\|>C$ and $\|x+\delta\|>C$ . Therefore, $\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|<\|\delta\|$ in this case.

The Final Result. By summarizing all cases above, we have $\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|\leqslant\|\delta\|$ . On the other hand, since $\|\mathsf{Clip}_{C}(x)\|\leqslant C$ for any $x$ , it is obvious that

$$\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|\leqslant\|\mathsf{Clip}_{C}(x)\|+\|\mathsf{Clip}_{C}(x+\delta)\|\leqslant 2C$$

Thus, the upper bound of $\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|$ is $\min\{2C,\|\delta\|\}$ . ∎
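As a sanity check of Lemma 3, the following sketch implements $\mathsf{Clip}_{C}$ as defined above and evaluates the inequality on random inputs covering all four cases. The sampling scheme, the dimension, and the tolerance are arbitrary illustrative choices; the zero vector is returned unchanged, which is our convention for the case $\|x\|=0$.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))


def clip(x, C):
    """Clip_C(x) = min{1, C / ||x||} * x; vectors with ||x|| <= C are unchanged."""
    norm = l2_norm(x)
    scale = 1.0 if norm <= C else C / norm
    return [scale * c for c in x]


def lemma3_gap(x, delta, C):
    """Left-hand side minus right-hand side of Lemma 3 (should be <= 0)."""
    shifted = [a + b for a, b in zip(x, delta)]
    lhs = l2_norm([a - b for a, b in zip(clip(x, C), clip(shifted, C))])
    rhs = min(2.0 * C, l2_norm(delta))
    return lhs - rhs


rng = random.Random(0)
C = 1.0
worst_gap = -math.inf
for _ in range(20000):
    # Mix small and large magnitudes so that clipped and unclipped cases all occur.
    scale_x = rng.choice([0.1, 1.0, 10.0])
    scale_d = rng.choice([0.1, 1.0, 10.0])
    x = [scale_x * rng.gauss(0.0, 1.0) for _ in range(4)]
    delta = [scale_d * rng.gauss(0.0, 1.0) for _ in range(4)]
    worst_gap = max(worst_gap, lemma3_gap(x, delta, C))

# A small tolerance absorbs floating-point rounding in the equality case (Case 1).
print(worst_gap <= 1e-9)
```

Such a randomized check does not replace the proof, but it is a cheap way to catch implementation mistakes in a clipping routine.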

Appendix B Proof of Theorem 1 (Privacy Analysis)

Proof.

Since the Gaussian noise added in (12) has standard deviation $R\sigma$ and the aggregation sensitivity is given in (14), the noise multiplier (defined as the ratio between the Gaussian noise’s standard deviation and the sensitivity) is

$$\sigma_{i}=\frac{R\sigma}{S_{i}}=\max\left\{\frac{R\sigma}{2C},\sigma p_{i}|\mathcal{D}_{i}|\right\}=\sigma\cdot\max\left\{\frac{R}{2C},p_{i}|\mathcal{D}_{i}|\right\}$$

Also, due to the client-level sampling (i.e., each client was selected by the server w.p. $q$ ) and record-level sampling (i.e., each record was selected by client $\mathsf{C}_{i}$ w.p. $p_{i}$ ), the overall sampling rate is $qp_{i}$ . Then, by applying the privacy accountant of Gaussian DP (GDP) [11] shown in Lemma 10 (see more details in Appendix F), DP-BREM satisfies $\mu_{i}$ -GDP with $\mu_{i}$ shown in (16). Finally, by converting $\mu_{i}$ -GDP to $(\epsilon_{i},\delta)$ -DP via Lemma 9, we get (15), which finishes the proof. ∎

Remark: privacy accountant in practice. Eq. (15) provides the formula of $\delta$ when $\epsilon_{i}$ is given and $\mu_{i}$ is computed from (16). In practice, however, we need to compute the value of the privacy budget $\epsilon_{i}$ with a fixed $\delta$, where $\delta$ is conventionally set to be less than $1/n$. In our experiments, we utilize the computation tool (https://github.com/woodyx218/Deep-Learning-with-GDP-Pytorch) from [7] to solve for $\epsilon_{i}$ in (15). For the value of $\sigma_{i}$ in (16), we usually have $p_{i}|\mathcal{D}_{i}|>\frac{R}{2C}$ in practice, so that $\sigma_{i}=\sigma p_{i}|\mathcal{D}_{i}|$. In this case, the clipping bounds $R$ and $C$ are just hyperparameters that may affect the utility of the algorithm, but have no influence on the privacy analysis.
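The per-client noise multiplier and the overall sampling rate used in the proof are simple to compute. The sketch below is illustrative only: the function names and the example parameter values are hypothetical, and the conversion to $\mu_{i}$-GDP via (16) is deliberately not reproduced here.

```python
import math


def aggregation_sensitivity(R, C, p_i, num_records):
    """S_i = min{2C, R / (p_i |D_i|)}, the sensitivity from Lemma 2."""
    return min(2.0 * C, R / (p_i * num_records))


def noise_multiplier(sigma, R, C, p_i, num_records):
    """sigma_i = (R * sigma) / S_i = sigma * max{R / (2C), p_i |D_i|}."""
    return (R * sigma) / aggregation_sensitivity(R, C, p_i, num_records)


def overall_sampling_rate(q, p_i):
    """A record is used only if its client is sampled (w.p. q) and the record is sampled (w.p. p_i)."""
    return q * p_i


# Hypothetical setting: 6000 records per client, 1% record-level sampling.
sigma, R, C, p_i, num_records, q = 1.0, 1.0, 1.0, 0.01, 6000, 1.0
sigma_i = noise_multiplier(sigma, R, C, p_i, num_records)
closed_form = sigma * max(R / (2.0 * C), p_i * num_records)

print(math.isclose(sigma_i, closed_form), overall_sampling_rate(q, p_i))
```

In this example $p_{i}|\mathcal{D}_{i}|=60$ exceeds $R/(2C)=0.5$, so $\sigma_{i}=\sigma p_{i}|\mathcal{D}_{i}|$ and the clipping bounds drop out of the privacy accounting, as noted in the remark.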

Appendix C Proof of Theorem 2 (Aggregation Error)

Before proving Theorem 2, we first introduce some notations and assumptions. In the $t$-th iteration, denote the selected honest clients by $\mathcal{H}_{t}=\mathcal{H}\cap\mathcal{I}_{t}$ and the selected Byzantine clients by $\mathcal{B}_{t}=\mathcal{B}\cap\mathcal{I}_{t}$. For momentum updates in the $t$-th iteration, we simplify the following notation (ignoring the subscript $t$) for convenience,

$$\bm{y}_{0}\coloneqq\tilde{\bm{m}}_{t-1},\quad\bm{y}_{i}\coloneqq\bm{y}_{0}+\bm{z}_{i}\quad\text{with }\bm{z}_{i}\coloneqq\mathsf{Clip}_{C}(\bar{\bm{m}}_{t,i}-\bm{y}_{0})$$

where $\bar{\bm{m}}_{t,i}$ is the client momentum computed from gradient with record-level clipping. Then, we can rewrite the noisy global momentum as $\tilde{\bm{m}}_{t}=\frac{\sum_{i\in\mathcal{I}_{t}}\bm{y}_{i}+\bm{\xi}}{|\mathcal{I}_{t}|}$, where $\bm{\xi}\sim\mathcal{N}(0,R^{2}\sigma^{2})$.
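The rewritten aggregation can be sketched in a few lines. Since $\bm{y}_{i}=\bm{y}_{0}+\bm{z}_{i}$, the noisy global momentum equals $\bm{y}_{0}+(\sum_{i\in\mathcal{I}_{t}}\bm{z}_{i}+\bm{\xi})/|\mathcal{I}_{t}|$. The code below is a minimal illustration under the assumption that $\bm{\xi}$ is drawn independently per coordinate with standard deviation $R\sigma$; the function names and the toy inputs are hypothetical and this is not the authors' implementation.

```python
import math
import random


def clip(x, C):
    """Clip_C(x) = min{1, C / ||x||} * x."""
    norm = math.sqrt(sum(c * c for c in x))
    scale = 1.0 if norm <= C else C / norm
    return [scale * c for c in x]


def noisy_centered_clipping(prev_momentum, client_momenta, C, R, sigma, rng):
    """One aggregation step: y_0 + (sum_i Clip_C(m_i - y_0) + xi) / |I_t|."""
    dim = len(prev_momentum)
    total = [0.0] * dim
    for m in client_momenta:
        z = clip([mi - yi for mi, yi in zip(m, prev_momentum)], C)
        total = [t + zi for t, zi in zip(total, z)]
    # Assumption: isotropic Gaussian noise with per-coordinate std R * sigma.
    noise = [rng.gauss(0.0, R * sigma) for _ in range(dim)]
    k = len(client_momenta)
    return [y0 + (t + xi) / k for y0, t, xi in zip(prev_momentum, total, noise)]


# Toy example: three honest clients and one Byzantine client with a huge update.
rng = random.Random(0)
y0 = [0.0, 0.0]
honest = [[1.0, 0.0]] * 3
byzantine = [[1000.0, 0.0]]
result = noisy_centered_clipping(y0, honest + byzantine, C=2.0, R=1.0, sigma=0.0, rng=rng)
print(result)
```

With $\sigma=0$ the Byzantine contribution is clipped to norm $C=2$, so the aggregate moves to about $(1.25, 0)$ rather than being dominated by the outlier.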

We assume $\{\bm{m}_{t,i}\}_{i\in\mathcal{H}}$ are i.i.d. with expectation $\bm{\mu}\coloneqq\mathbb{E}[\bm{m}_{t,i}]$ and bounded variance (in terms of L2-norm), i.e., $\mathbb{E}\|\bm{m}_{t,i}-\bm{\mu}\|^{2}\leqslant\rho^{2}$. Therefore, the momenta computed from record-level clipped gradients, $\{\bar{\bm{m}}_{t,i}\}_{i\in\mathcal{H}}$, are also i.i.d., and we denote their expectation by $\bar{\bm{\mu}}\coloneqq\mathbb{E}[\bar{\bm{m}}_{t,i}]$. Due to the clipping operation, the variance is reduced, and we assume $\mathbb{E}\|\bar{\bm{m}}_{t,i}-\bar{\bm{\mu}}\|^{2}\leqslant[\rho R/(R+c)]^{2}$, where $R$ is the record-level clipping bound and $c$ is some positive constant. Also, there is a gap between $\bm{\mu}$ and $\bar{\bm{\mu}}$, and we assume $\|\bar{\bm{\mu}}-\bm{\mu}\|^{2}\leqslant(\kappa/R)^{2}$. Finally, we assume $\bm{y}_{0}$ is not very far away from both $\bm{\mu}$ and $\bar{\bm{\mu}}$: $\|\bm{y}_{0}-\bar{\bm{\mu}}\|^{2}\leqslant\phi^{2}$ and $\|\bm{y}_{0}-\bm{\mu}\|^{2}\leqslant\tau^{2}$.

Proof.

Our proof heavily relies on several useful lemmas shown in Appendix E, where Lemma 4 bounds the squared L2-norm of a sum of vectors by a weighted sum of the vectors’ squared L2-norms, and Lemma 5 provides the optimal strategy to choose these weights.

We first consider the bound of $\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}$ . Recall that the selected client set is $\mathcal{I}_{t}=\mathcal{H}_{t}\cup\mathcal{B}_{t}$ , where the honest clients set $\mathcal{H}_{t}$ and Byzantine clients set $\mathcal{B}_{t}$ are disjoint. For any positive values $\gamma_{1},\gamma_{2},\gamma_{3}>0$ with $\gamma_{1}+\gamma_{2}+\gamma_{3}=1$ , we have

$$\begin{aligned}
&|\mathcal{I}_{t}|^{2}\cdot\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}=|\mathcal{I}_{t}|^{2}\cdot\mathbb{E}\left\|\frac{\left(\sum_{i\in\mathcal{I}_{t}}\bm{y}_{i}\right)+\bm{\xi}}{|\mathcal{I}_{t}|}-\bm{\mu}\right\|^{2} \\
&\overset{(\mathsf{a})}{=}\mathbb{E}\left\|\sum\nolimits_{i\in\mathcal{H}_{t}}(\bm{y}_{i}-\bm{\mu})+\sum\nolimits_{j\in\mathcal{B}_{t}}(\bm{y}_{j}-\bm{\mu})+\bm{\xi}\right\|^{2} \\
&\overset{(\mathsf{b})}{\leqslant}\frac{1}{\gamma_{1}}\underbrace{\mathbb{E}\left\|\sum\nolimits_{i\in\mathcal{H}_{t}}(\bm{y}_{i}-\bm{\mu})\right\|^{2}}_{\mathcal{T}_{1}}+\frac{1}{\gamma_{2}}\underbrace{\mathbb{E}\left\|\sum\nolimits_{j\in\mathcal{B}_{t}}(\bm{y}_{j}-\bm{\mu})\right\|^{2}}_{\mathcal{T}_{2}}+\frac{1}{\gamma_{3}}\underbrace{\mathbb{E}\|\bm{\xi}\|^{2}}_{\mathcal{T}_{3}}
\end{aligned}$$

where $(\mathsf{a})$ used the fact that $\mathcal{I}_{t}=\mathcal{H}_{t}\cup\mathcal{B}_{t}$ and $\mathcal{H}_{t}\cap\mathcal{B}_{t}=\emptyset$; $(\mathsf{b})$ used the result in Lemma 4. From the above inequality, the error can be decomposed into three terms: $\mathcal{T}_{1}$ corresponds to the error of honest clients (who follow the protocol honestly) due to the randomness of clients’ training data and the bias introduced by clipping, $\mathcal{T}_{2}$ corresponds to the error of Byzantine clients (who submit arbitrary $\bar{\bm{m}}_{t,i}$, which will be clipped by the server), and $\mathcal{T}_{3}$ corresponds to the error introduced by the Gaussian noise added for privacy. We will analyze each of the three errors in turn.

Bounding $\mathcal{T}_{1}$ . Since $\mathbb{E}\|X\|^{2}=\|\mathbb{E}[X]\|^{2}+\mathbb{E}\|X-\mathbb{E}[X]\|^{2}$ for any random vector $X$ , we can rewrite $\mathcal{T}_{1}$ as

$$\mathcal{T}_{1}=\underbrace{\left\|\sum\nolimits_{i\in\mathcal{H}_{t}}(\mathbb{E}[\bm{y}_{i}]-\bm{\mu})\right\|^{2}}_{\mathcal{T}_{11}}+\underbrace{\mathbb{E}\left\|\sum\nolimits_{i\in\mathcal{H}_{t}}(\bm{y}_{i}-\mathbb{E}[\bm{y}_{i}])\right\|^{2}}_{\mathcal{T}_{12}}$$

where $\mathcal{T}_{11}$ corresponds to the bias introduced by the clipping operations, and $\mathcal{T}_{12}$ is the variance of honest clients’ submissions. Rewrite $\bm{z}_{i}=\alpha_{i}\cdot(\bar{\bm{m}}_{t,i}-\bm{y}_{0})$, where $\alpha_{i}=\min\{1,\frac{C}{\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|}\}\in(0,1]$. Let $\mathbbm{1}_{i}$ be an indicator variable denoting whether the momentum difference $\bar{\bm{m}}_{t,i}-\bm{y}_{0}$ was clipped. Therefore, if $\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|\leqslant C$, then $\mathbbm{1}_{i}=0$ and $\alpha_{i}=1$; if $\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|>C$, then $\mathbbm{1}_{i}=1$ and $0<\alpha_{i}<1$. Then, for each $i\in\mathcal{H}_{t}$, we have

$$\begin{aligned}
&\mathbb{E}\|\bm{z}_{i}-(\bar{\bm{m}}_{t,i}-\bm{y}_{0})\|=\mathbb{E}[(1-\alpha_{i})\cdot\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|] \\
&\leqslant\mathbb{E}[\mathbbm{1}_{i}\cdot\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|]\leqslant\frac{\mathbb{E}[\mathbbm{1}_{i}\cdot\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|^{2}]}{C}\leqslant\frac{\mathbb{E}\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|^{2}}{C}
\end{aligned}$$

where

$$\begin{aligned}
&\mathbb{E}\|\bar{\bm{m}}_{t,i}-\bm{y}_{0}\|^{2}=\mathbb{E}\|(\bar{\bm{m}}_{t,i}-\bar{\bm{\mu}})+(\bar{\bm{\mu}}-\bm{y}_{0})\|^{2} \\
&\overset{(\mathsf{a})}{\leqslant}\frac{\mathbb{E}\|\bar{\bm{m}}_{t,i}-\bar{\bm{\mu}}\|^{2}}{\gamma}+\frac{\mathbb{E}\|\bar{\bm{\mu}}-\bm{y}_{0}\|^{2}}{1-\gamma}\leqslant\frac{[\rho R/(R+c)]^{2}}{\gamma}+\frac{\phi^{2}}{1-\gamma} \\
&\overset{(\mathsf{b})}{=}[\rho R/(R+c)+\phi]^{2}
\end{aligned}$$

where $(\mathsf{a})$ is obtained by using Lemma 4 for any $\gamma\in(0,1)$ ; $(\mathsf{b})$ is obtained by taking $\gamma=\frac{\rho R/(R+c)}{\rho R/(R+c)+\phi}$ . Therefore,

$$\begin{aligned}
&\|\mathbb{E}[\bm{y}_{i}]-\bm{\mu}\|^{2}\overset{(\mathsf{a})}{=}\|\mathbb{E}[\bm{y}_{0}+\bm{z}_{i}-\bar{\bm{m}}_{t,i}]+(\bar{\bm{\mu}}-\bm{\mu})\|^{2} \\
&\overset{(\mathsf{b})}{\leqslant}\frac{\|\mathbb{E}[\bm{z}_{i}-(\bar{\bm{m}}_{t,i}-\bm{y}_{0})]\|^{2}}{\gamma}+\frac{\|\bar{\bm{\mu}}-\bm{\mu}\|^{2}}{1-\gamma} \\
&\overset{(\mathsf{c})}{\leqslant}\frac{\left(\mathbb{E}\|\bm{z}_{i}-(\bar{\bm{m}}_{t,i}-\bm{y}_{0})\|\right)^{2}}{\gamma}+\frac{\|\bar{\bm{\mu}}-\bm{\mu}\|^{2}}{1-\gamma} \\
&\overset{(\mathsf{d})}{\leqslant}\frac{[\rho R/(R+c)+\phi]^{4}}{\gamma C^{2}}+\frac{(\kappa/R)^{2}}{1-\gamma} \\
&\overset{(\mathsf{e})}{=}\left[\frac{[\rho R/(R+c)+\phi]^{2}}{C}+\kappa/R\right]^{2}
\end{aligned}$$

where $(\mathsf{a})$ is obtained from the definitions $\bm{y}_{i}=\bm{y}_{0}+\bm{z}_{i}$ and $\mathbb{E}[\bar{\bm{m}}_{t,i}]=\bar{\bm{\mu}}$ ; $(\mathsf{b})$ is obtained by using Lemma 4 for any $\gamma\in(0,1)$ ; $(\mathsf{c})$ is derived from Jensen’s Inequality, i.e., $\mathbb{E}[f(X)]\geqslant f(\mathbb{E}[X])$ for convex function $f(X)\coloneqq\|X\|$ ; $(\mathsf{d})$ is obtained by plugging in the previous two inequalities; $(\mathsf{e})$ is obtained by taking $\gamma=\frac{[\rho R/(R+c)+\phi]^{2}}{[\rho R/(R+c)+\phi]^{2}+C\kappa/R}$ . Now, we can bound $\mathcal{T}_{11}$ by:

\displaystyle\mathcal{T}_{11}\leqslant|\mathcal{H}_{t}|\sum_{i\in\mathcal{H}_{% t}}\left\|\mathbb{E}[\bm{y}_{i}]-\bm{\mu}\right\|^{2}\leqslant|\mathcal{H}_{t}% |^{2}\left[\frac{[\rho R/(R+c)+\phi]^{2}}{C}+\frac{\kappa}{R}\right]^{2}

where the first inequality is obtained by using Lemma 4. On the other hand, we can bound $\mathcal{T}_{12}$ by

	$\displaystyle\mathcal{T}_{12}$	$\displaystyle\overset{(\mathsf{a})}{=}\mathbb{E}\sum\nolimits_{i\in\mathcal{H}_{t}}\|\bm{y}_{i}-\mathbb{E}[\bm{y}_{i}]\|^{2}\overset{(\mathsf{b})}{\leqslant}\mathbb{E}\sum\nolimits_{i\in\mathcal{H}_{t}}\|\bar{\bm{m}}_{t,i}-\mathbb{E}[\bar{\bm{m}}_{t,i}]\|^{2}$
		$\displaystyle\leqslant|\mathcal{H}_{t}|\cdot[\rho R/(R+c)+\phi]^{2}$

where $(\mathsf{a})$ uses the assumption that $\{\bar{\bm{m}}_{t,i}\}_{i\in\mathcal{H}_{t}}$ are independent, so the random variables $\{\bm{y}_{i}\}_{i\in\mathcal{H}_{t}}$ are also independent; $(\mathsf{b})$ uses the contractivity of the clipping (projection) step. Therefore,

	$\displaystyle\mathcal{T}_{1}$	$\displaystyle=\mathcal{T}_{11}+\mathcal{T}_{12}\leqslant|\mathcal{H}_{t}|^{2}\left(\frac{\psi^{2}}{C}+\frac{\kappa}{R}\right)^{2}+|\mathcal{H}_{t}|\psi^{2}$
		$\displaystyle\leqslant\frac{4|\mathcal{H}_{t}|^{2}\psi^{4}}{C^{2}}+|\mathcal{H}_{t}|\psi^{2}\leqslant\left(\frac{2|\mathcal{H}_{t}|\psi^{2}}{C}+\sqrt{|\mathcal{H}_{t}|}\psi\right)^{2}$

where $\psi\coloneqq\rho R/(R+c)+\phi$ , and the second inequality holds under the assumption $C\leqslant\psi^{2}R/\kappa$ (which gives $\frac{\psi^{2}}{C}\geqslant\frac{\kappa}{R}$ ).

Bounding $\mathcal{T}_{2}$ . For any Byzantine client $\mathsf{C}_{j}$ with $j\in\mathcal{B}_{t}$ , the error is bounded by the clipping step:

	$\displaystyle\mathbb{E}\|\bm{y}_{j}-\bm{\mu}\|^{2}$	$\displaystyle=\mathbb{E}\|\bm{z}_{j}+(\bm{y}_{0}-\bm{\mu})\|^{2}\overset{(\mathsf{a})}{\leqslant}\frac{\mathbb{E}\|\bm{z}_{j}\|^{2}}{\gamma}+\frac{\mathbb{E}\|\bm{y}_{0}-\bm{\mu}\|^{2}}{1-\gamma}$
		$\displaystyle\overset{(\mathsf{b})}{\leqslant}\frac{C^{2}}{\gamma}+\frac{\tau^{2}}{1-\gamma}\overset{(\mathsf{c})}{=}(C+\tau)^{2}$

where $(\mathsf{a})$ is obtained by using Lemma 4 for any $\gamma\in(0,1)$ ; $(\mathsf{b})$ is obtained from the definition of $\bm{z}_{j}$ and the assumption; $(\mathsf{c})$ is obtained by taking $\gamma=\frac{C}{C+\tau}$ . Then, by using Lemma 4, we have

\displaystyle\mathcal{T}_{2}\leqslant|\mathcal{B}_{t}|\cdot\sum_{j\in\mathcal{% B}_{t}}\mathbb{E}\|\bm{y}_{j}-\bm{\mu}\|^{2}\leqslant|\mathcal{B}_{t}|^{2}(C+% \tau)^{2}

Bounding $\mathcal{T}_{3}$ . Since the random noise $\bm{\xi}\sim\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})\in\mathbb{R}^{d}$ , we have $\mathcal{T}_{3}=dR^{2}\sigma^{2}$ .

Putting It All Together. Combining all terms, we have

		$\displaystyle\qquad\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}$
		$\displaystyle\leqslant\frac{1}{|\mathcal{I}_{t}|^{2}}\left[\frac{1}{\gamma_{1}}\left(\frac{2|\mathcal{H}_{t}|\psi^{2}}{C}+\sqrt{|\mathcal{H}_{t}|}\psi\right)^{2}+\frac{|\mathcal{B}_{t}|^{2}(C+\tau)^{2}}{\gamma_{2}}+\frac{dR^{2}\sigma^{2}}{\gamma_{3}}\right]$
		$\displaystyle\overset{(\mathsf{a})}{=}\frac{1}{|\mathcal{I}_{t}|^{2}}\left[\left(\frac{2|\mathcal{H}_{t}|\psi^{2}}{C}+\sqrt{|\mathcal{H}_{t}|}\psi\right)+|\mathcal{B}_{t}|(C+\tau)+\sqrt{d}R\sigma\right]^{2}$
		$\displaystyle\overset{(\mathsf{b})}{\leqslant}\frac{1}{|\mathcal{I}_{t}|^{2}}\left[\frac{2\kappa|\mathcal{H}_{t}|(\rho+\phi)^{2}}{\phi^{2}\cdot R}+(|\mathcal{B}_{t}|+\sqrt{d}\sigma)R+\sqrt{|\mathcal{H}_{t}|}(\rho+\phi)+|\mathcal{B}_{t}|\tau\right]^{2}$
		$\displaystyle\overset{(\mathsf{c})}{=}\frac{1}{|\mathcal{I}_{t}|^{2}}\left[\frac{2(\rho+\phi)}{\phi}\sqrt{2\kappa|\mathcal{H}_{t}|(|\mathcal{B}_{t}|+\sqrt{d}\sigma)}+\sqrt{|\mathcal{H}_{t}|}(\rho+\phi)+|\mathcal{B}_{t}|\tau\right]^{2}$
		$\displaystyle=\left[\underbrace{\frac{2(\rho+\phi)}{|\mathcal{I}_{t}|\phi}\sqrt{2\kappa|\mathcal{H}_{t}|(|\mathcal{B}_{t}|+\sqrt{d}\sigma)}+\frac{\sqrt{|\mathcal{H}_{t}|}(\rho+\phi)}{|\mathcal{I}_{t}|}+\frac{|\mathcal{B}_{t}|\tau}{|\mathcal{I}_{t}|}}_{\eqqcolon\Phi}\right]^{2}$		(21)

where $(\mathsf{a})$ is obtained by taking $\gamma_{k}=\frac{\sqrt{\Phi_{k}}}{\sqrt{\Phi_{1}}+\sqrt{\Phi_{2}}+\sqrt{\Phi_{3}}}$ for $k=1,2,3$ (the optimal choice by Lemma 5), where $\Phi_{1}\coloneqq\left(\frac{2|\mathcal{H}_{t}|\psi^{2}}{C}+\sqrt{|\mathcal{H}_{t}|}\psi\right)^{2},\Phi_{2}\coloneqq|\mathcal{B}_{t}|^{2}(C+\tau)^{2},\Phi_{3}\coloneqq dR^{2}\sigma^{2}$ ; $(\mathsf{b})$ is obtained by considering $\psi=\rho R/(R+c)+\phi\leqslant(\rho+\phi)$ and taking the clipping bound $C=\frac{\phi^{2}}{\kappa}R$ , which makes the previous assumption $C\leqslant\psi^{2}R/\kappa$ hold because $\phi\leqslant\psi$ ; $(\mathsf{c})$ is obtained by taking $R=\frac{\rho+\phi}{\phi}\sqrt{\frac{2\kappa|\mathcal{H}_{t}|}{|\mathcal{B}_{t}|+\sqrt{d}\sigma}}$ , where $|\mathcal{H}_{t}|\approx|\mathcal{H}|q$ and $|\mathcal{B}_{t}|\approx|\mathcal{B}|q$ . Since $|\mathcal{H}|+|\mathcal{B}|=n$ and $|\mathcal{B}|/n<1/2$ , we can approximate the tuning by $R\propto O\left(\rho\sqrt{n/(|\mathcal{B}|+\sqrt{d}\sigma/q)}\right)$ .
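Step $(\mathsf{c})$ balances the two $R$-dependent terms of step $(\mathsf{b})$, which have the form $a/R+bR$ with $a=2\kappa|\mathcal{H}_{t}|(\rho+\phi)^{2}/\phi^{2}$ and $b=|\mathcal{B}_{t}|+\sqrt{d}\sigma$. The following Python sketch checks numerically that the stated $R$ minimizes this sum and attains $2\sqrt{ab}$. The function names and all numeric values are illustrative assumptions, not settings from the experiments.

```python
import math

def tuned_R(rho, phi, kappa, n_honest, n_byz, d, sigma):
    """Record-level bound R chosen in step (c) of the derivation."""
    b = n_byz + math.sqrt(d) * sigma
    return (rho + phi) / phi * math.sqrt(2.0 * kappa * n_honest / b)

def r_dependent_terms(R, rho, phi, kappa, n_honest, n_byz, d, sigma):
    """The two R-dependent terms a/R + b*R inside the bracket of step (b)."""
    a = 2.0 * kappa * n_honest * (rho + phi) ** 2 / phi ** 2
    b = n_byz + math.sqrt(d) * sigma
    return a / R + b * R

# Hypothetical values chosen only to exercise the formulas.
params = dict(rho=1.0, phi=0.5, kappa=2.0, n_honest=80, n_byz=20, d=100, sigma=0.1)
R_star = tuned_R(**params)
value_at_R_star = r_dependent_terms(R_star, **params)

# Closed form of step (c): 2 (rho + phi) / phi * sqrt(2 kappa |H_t| (|B_t| + sqrt(d) sigma)).
closed_form = (
    2.0 * (params["rho"] + params["phi"]) / params["phi"]
    * math.sqrt(2.0 * params["kappa"] * params["n_honest"]
                * (params["n_byz"] + math.sqrt(params["d"]) * params["sigma"]))
)
print(R_star, value_at_R_star, closed_form)
```

Any other positive $R$ gives a value at least as large, which is what makes the inequality in $(\mathsf{b})$ tight at this choice.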

The Final Result. On the other hand, we have

	$\displaystyle\mathbb{E}\|\bm{\mu}-\bm{m}_{t}^{*}\|^{2}$	$\displaystyle=\frac{1}{|\mathcal{H}|^{2}}\mathbb{E}\left\|\sum\nolimits_{i\in\mathcal{H}}(\bm{m}_{t,i}-\bm{\mu})\right\|^{2}$
		$\displaystyle=\frac{1}{|\mathcal{H}|^{2}}\mathbb{E}\sum\nolimits_{i\in\mathcal{H}}\left\|\bm{m}_{t,i}-\bm{\mu}\right\|^{2}\leqslant\frac{\rho^{2}}{|\mathcal{H}|}$		(22)

where the first equality follows from the definition of $\bm{m}_{t}^{*}$ ; the second equality follows from the fact that the honest clients’ momentum vectors $\{\bm{m}_{t,i}\}_{i\in\mathcal{H}}$ are independent of each other; and the final inequality follows from the assumption $\left\|\bm{m}_{t,i}-\bm{\mu}\right\|^{2}\leqslant\rho^{2}$ for $i\in\mathcal{H}$ . Finally, we have

		$\displaystyle\qquad\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}$
		$\displaystyle=\mathbb{E}\|(\tilde{\bm{m}}_{t}-\bm{\mu})+(\bm{\mu}-\bm{m}_{t}^{*})\|^{2}\overset{(\mathsf{a})}{\leqslant}\frac{\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{\mu}\|^{2}}{\gamma}+\frac{\mathbb{E}\|\bm{\mu}-\bm{m}_{t}^{*}\|^{2}}{1-\gamma}$
		$\displaystyle\overset{(\mathsf{b})}{\leqslant}\frac{\Phi^{2}}{\gamma}+\frac{\rho^{2}/|\mathcal{H}|}{1-\gamma}\overset{(\mathsf{c})}{=}\left(\Phi+\frac{\rho}{\sqrt{|\mathcal{H}|}}\right)^{2}$		(23)

where $(\mathsf{a})$ is obtained by using Lemma 4 for any $\gamma\in(0,1)$ ; $(\mathsf{b})$ is obtained from (21) and (22), where $\Phi$ is defined in (21); $(\mathsf{c})$ is obtained by taking $\gamma=\frac{\Phi}{\Phi+\frac{\rho}{\sqrt{|\mathcal{H}|}}}$ . Furthermore, if we assume $\phi\leqslant O(\rho)$ and $\tau\leqslant O(\rho)$ , we can rewrite (23) as

\displaystyle\mathbb{E}\|\tilde{\bm{m}}_{t}-\bm{m}_{t}^{*}\|^{2}\leqslant O% \left(\frac{\rho^{2}(|\mathcal{B}|+\sqrt{d}\sigma/q)}{n}\right)

which finishes the proof of Theorem 2. ∎

Appendix D Proof of Theorem 3 (Convergence Rate)

Proof.

Compared with the aggregation error of $O(\rho^{2}|\mathcal{B}|/n)$ (ignoring constants and higher-order terms) in [22, Lemma 9], our aggregation error in (19) replaces the term $|\mathcal{B}|$ by $|\mathcal{B}|+\sqrt{d}\sigma/q$ , which means slower convergence due to DP noise. Then, following the result in [22, Theorem VI] and its informal version in (5), we obtain the convergence rate of our algorithm as in (20). Note that our aggregation uses a client-level sampling rate $q$ , i.e., approximately $nq$ clients participate in the aggregation in each iteration. We therefore replace the term $\frac{1}{n}$ in (5) by $\frac{1}{nq}$ in (20). ∎

Appendix E Useful Lemmas

Lemma 4.

For any positive real values $\alpha_{1},\cdots,\alpha_{K}\in\mathbb{R}^{+}$ and any $d$ -dimensional vectors $\bm{x}_{1},\cdots,\bm{x}_{K}\in\mathbb{R}^{d}$ , the following inequality holds

\displaystyle\left\|\sum\nolimits_{k=1}^{K}\bm{x}_{k}\right\|^{2}\leqslant% \left(\sum\nolimits_{k=1}^{K}\alpha_{k}\right)\cdot\left(\sum\nolimits_{k=1}^{% K}\frac{\|\bm{x}_{k}\|^{2}}{\alpha_{k}}\right)

where $\|\cdot\|$ denotes the L2-norm of a vector.

Proof.

Denote $x_{ki}$ as the $i$ -th element of the vector $\bm{x}_{k}$ , then we have

	$\displaystyle\quad\left\|\sum\nolimits_{k=1}^{K}\bm{x}_{k}\right\|^{2}=\sum\nolimits_{i=1}^{d}\left(\sum\nolimits_{k=1}^{K}x_{ki}\right)^{2}=\sum\nolimits_{i=1}^{d}\left(\sum\nolimits_{k=1}^{K}\sqrt{\alpha_{k}}\cdot\frac{x_{ki}}{\sqrt{\alpha_{k}}}\right)^{2}$
	$\displaystyle\leqslant\sum\nolimits_{i=1}^{d}\left[\sum\nolimits_{k=1}^{K}\left(\sqrt{\alpha_{k}}\right)^{2}\cdot\sum\nolimits_{k=1}^{K}\left(\frac{x_{ki}}{\sqrt{\alpha_{k}}}\right)^{2}\right]$
	$\displaystyle=\left(\sum\nolimits_{k=1}^{K}\alpha_{k}\right)\cdot\left(\sum\nolimits_{k=1}^{K}\frac{1}{\alpha_{k}}\sum\nolimits_{i=1}^{d}x_{ki}^{2}\right)=\left(\sum\nolimits_{k=1}^{K}\alpha_{k}\right)\cdot\left(\sum\nolimits_{k=1}^{K}\frac{\|\bm{x}_{k}\|^{2}}{\alpha_{k}}\right)$

where the inequality follows from the Cauchy-Schwarz inequality. ∎
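As a sanity check rather than a proof, Lemma 4 can be tested on random inputs. The Python sketch below evaluates both sides of the inequality; the helper name `lemma4_sides` and the sampling ranges are assumptions made for illustration. With $K=2$ and $(\alpha_{1},\alpha_{2})=(\gamma,1-\gamma)$ it reduces to the two-term form $\|\bm{a}+\bm{b}\|^{2}\leqslant\|\bm{a}\|^{2}/\gamma+\|\bm{b}\|^{2}/(1-\gamma)$ used throughout Appendix C.

```python
import random

def lemma4_sides(vectors, alphas):
    """Return (lhs, rhs) of Lemma 4 for a list of d-dimensional vectors and positive weights."""
    d = len(vectors[0])
    total = [sum(v[i] for v in vectors) for i in range(d)]
    lhs = sum(t * t for t in total)                      # || sum_k x_k ||^2
    sq_norms = [sum(x * x for x in v) for v in vectors]  # ||x_k||^2 for each k
    rhs = sum(alphas) * sum(n / a for n, a in zip(sq_norms, alphas))
    return lhs, rhs

rng = random.Random(0)
violations = 0
for _ in range(1000):
    K, d = rng.randint(1, 5), rng.randint(1, 8)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]
    alphas = [rng.uniform(0.1, 3.0) for _ in range(K)]
    lhs, rhs = lemma4_sides(vectors, alphas)
    if lhs > rhs * (1.0 + 1e-12) + 1e-12:
        violations += 1
print("violations:", violations)
```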

Lemma 5.

Consider the following optimization problem

\displaystyle f^{*}=\min_{x_{1},\cdots,x_{K}}~{}~{}\sum\nolimits_{k=1}^{K}% \frac{c_{k}}{x_{k}},\qquad\text{such that}\quad x_{k}>0,\quad\sum\nolimits_{k=% 1}^{K}x_{k}=1

where $c_{1},\cdots,c_{K}>0$ . Then, we have $f^{*}=(\sum\nolimits_{j=1}^{K}\sqrt{c_{j}})^{2}$ , where the optimal solution is $x_{k}=\frac{\sqrt{c_{k}}}{\sum\nolimits_{j=1}^{K}\sqrt{c_{j}}}~{}(\forall k=1,% \cdots,K)$ .

Proof.

The Lagrangian is $\mathcal{L}(x_{k};\lambda)=\sum\nolimits_{k=1}^{K}\frac{c_{k}}{x_{k}}+\lambda\cdot(\sum\nolimits_{k=1}^{K}x_{k}-1)$ . Applying the Karush-Kuhn-Tucker (KKT) conditions, we have

\displaystyle\begin{cases}\frac{\partial\mathcal{L}}{\partial x_{k}}=0\\ \frac{\partial\mathcal{L}}{\partial\lambda}=0\end{cases}\Rightarrow~{}~{}% \begin{cases}-\frac{c_{k}}{x_{k}^{2}}+\lambda=0\\ \sum\nolimits_{k=1}^{K}x_{k}=1\end{cases}\Rightarrow~{}~{}\begin{cases}x_{k}=% \sqrt{\frac{c_{k}}{\lambda}}\\ \sqrt{\lambda}=\sum\nolimits_{j=1}^{K}\sqrt{c_{j}}\end{cases}

Substituting $x_{k}=\sqrt{c_{k}}/\sum\nolimits_{j=1}^{K}\sqrt{c_{j}}$ into the objective gives $f^{*}=\left(\sum\nolimits_{j=1}^{K}\sqrt{c_{j}}\right)^{2}$ , which finishes the proof. ∎
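Lemma 5 is what justifies each choice of $\gamma$ in Appendix C; for example, with $K=2$, $c_{1}=C^{2}$ and $c_{2}=\tau^{2}$ it gives $\gamma=C/(C+\tau)$ and the value $(C+\tau)^{2}$. The Python sketch below evaluates the closed-form solution and compares it against random feasible points. The helper names and the example values of $c_{k}$ are assumptions for illustration.

```python
import math
import random

def objective(c, x):
    """Objective of Lemma 5: sum_k c_k / x_k."""
    return sum(ck / xk for ck, xk in zip(c, x))

def lemma5_solution(c):
    """Closed-form minimizer x_k = sqrt(c_k) / sum_j sqrt(c_j) and optimal value f*."""
    roots = [math.sqrt(ck) for ck in c]
    s = sum(roots)
    return [r / s for r in roots], s * s

c = [4.0, 9.0, 1.0]                      # hypothetical coefficients
x_star, f_star = lemma5_solution(c)      # f* = (2 + 3 + 1)^2 = 36

# No random point on the simplex should beat the closed-form optimum.
rng = random.Random(1)
best_random = float("inf")
for _ in range(1000):
    raw = [rng.uniform(0.01, 1.0) for _ in c]
    x = [r / sum(raw) for r in raw]
    best_random = min(best_random, objective(c, x))
print(f_star, best_random)
```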

Appendix F Gaussian Differential Privacy (GDP)

Privacy Accountant. Since deep learning needs to iterate over the training data and apply gradient computation multiple times during the training process, each access to the training data incurs some privacy cost from the overall privacy budget $\epsilon$ . The total privacy cost of repeated applications of additive noise mechanisms follows from the composition theorems and their refinements [13]. The task of keeping track of the accumulated privacy loss in the course of executing a composite mechanism, and enforcing the applicable privacy policy, can be performed by the privacy accountant. Abadi et al. [1] proposed the moments accountant to provide a tighter bound on the privacy loss than the generic advanced composition theorem [14]. A more recent privacy accountant method is Gaussian Differential Privacy (GDP) [11, 7], which was shown to obtain a tighter result than the moments accountant.

Gaussian Differential Privacy. GDP is a privacy notion that faithfully retains the hypothesis-testing interpretation of differential privacy. By leveraging a privacy central limit theorem (Lemma 10), GDP admits an analytically tractable privacy accountant (whereas the moments accountant must be evaluated numerically). Furthermore, GDP can be converted to a collection of $(\epsilon,\delta)$ -DP guarantees (refer to Lemma 9). Note that even in terms of $(\epsilon,\delta)$ -DP, the GDP approach gives a tighter privacy accountant than the moments accountant. GDP uses a single parameter $\mu\geqslant 0$ (called the privacy parameter) to quantify the privacy of a randomized mechanism. Similar to the privacy budget $\epsilon$ defined in DP, a larger $\mu$ in GDP indicates a weaker privacy guarantee. Compared with $(\epsilon,\delta)$ -DP, the notion of $\mu$ -GDP can losslessly reason about common primitives associated with differential privacy, including composition, privacy amplification by sampling, and group privacy. In the following, we briefly introduce the properties of GDP that are used in the analysis of our approach. The formal definition and more detailed results can be found in the original paper [11].

Lemma 6 (Gaussian Mechanism for GDP [11]).

Consider the problem of privately releasing a univariate statistic $f(D)$ of a dataset $D$ . Define the sensitivity of $f(\cdot)$ as $s_{f}=\sup_{D,D^{\prime}}|f(D)-f(D^{\prime})|$ , where the supremum is over all neighboring datasets. Then, the Gaussian mechanism $\mathcal{M}(D)=f(D)+\xi$ , where $\xi\sim\mathcal{N}(0,s_{f}^{2}/\mu^{2})$ , satisfies $\mu$ -GDP.

Lemma 7 (Composition Theorem of GDP [11]).

The $m$ -fold composition of $\mu_{i}$ -GDP mechanisms is $\sqrt{\mu_{1}^{2}+\cdots+\mu_{m}^{2}}$ -GDP.

Lemma 8 (Group Privacy of GDP [11]).

If a mechanism is $\mu$ -GDP, then it is $K\mu$ -GDP for a group with size $K$ .

Lemma 9 ( $\mu$ -GDP to $(\epsilon,\delta)$ -DP [11]).

A mechanism is $\mu$ -GDP if and only if it is $(\epsilon,\delta(\epsilon))$ -DP for all $\epsilon\geqslant 0$ , where

\displaystyle\delta(\epsilon)=\Phi\left(-\frac{\epsilon}{\mu}+\frac{\mu}{2}% \right)-e^{\epsilon}\cdot\Phi\left(-\frac{\epsilon}{\mu}-\frac{\mu}{2}\right),

and $\Phi$ denotes the CDF of standard normal (Gaussian) distribution.
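The conversion in Lemma 9 is straightforward to evaluate numerically. The Python sketch below computes $\delta(\epsilon)$ for a given $\mu$ and inverts it by bisection to find the smallest $\epsilon$ meeting a target $\delta$. The function names, the search bound `eps_hi`, and the example value $\mu=1$ are assumptions of this sketch; the bisection relies on $\delta(\epsilon)$ being non-increasing in $\epsilon$.

```python
import math

def std_normal_cdf(x):
    """CDF of the standard normal distribution, written via erfc for accurate tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def gdp_delta(eps, mu):
    """delta(eps) from Lemma 9 for a mu-GDP mechanism (mu > 0)."""
    return (std_normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * std_normal_cdf(-eps / mu - mu / 2.0))

def gdp_epsilon(mu, delta, eps_hi=100.0, iters=200):
    """Smallest eps in [0, eps_hi] with delta(eps) <= delta, found by bisection."""
    if gdp_delta(0.0, mu) <= delta:
        return 0.0
    if gdp_delta(eps_hi, mu) > delta:
        raise ValueError("target delta not reachable within [0, eps_hi]")
    lo, hi = 0.0, eps_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gdp_delta(mid, mu) > delta:
            lo = mid
        else:
            hi = mid
    return hi

eps = gdp_epsilon(mu=1.0, delta=1e-6)   # mu = 1 is an illustrative value
print(eps, gdp_delta(eps, 1.0))
```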

Lemma 10 (Privacy Central Limit Theorem of GDP [7]).

Denote $p$ as the sampling probability of one example in the training dataset, $T$ as the total number of iterations and $\sigma$ as the noise scale (i.e., the ratio between the standard deviation of Gaussian noise and the gradient norm bound). Then, algorithm DP-SGD asymptotically satisfies $\mu$ -GDP with privacy parameter $\mu=p\sqrt{T(e^{1/\sigma^{2}}-1)}$ .

In this paper, we use $\mu$ -GDP as our primary privacy accountant method because it composes cleanly (Lemma 7) and accounts for privacy amplification by sampling (Lemma 10), and then convert the result to $(\epsilon,\delta)$ -DP via Lemma 9. We note that other privacy accountant methods, such as the moments accountant [1] and Rényi DP (RDP) [33], are also applicable to the proposed scheme and theoretical analysis, but might lead to suboptimal results.
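The closed forms in Lemma 7, Lemma 8 and Lemma 10 translate directly into a few lines of code. The Python sketch below is illustrative only: the function names and the example values of $p$, $T$ and $\sigma$ are assumptions, and $\sigma$ here follows the definition in Lemma 10, which need not match the noise multiplier reported in Appendix J.2.

```python
import math

def gdp_mu_clt(p, T, sigma):
    """Privacy parameter from Lemma 10: mu = p * sqrt(T * (exp(1/sigma^2) - 1))."""
    # expm1 keeps precision for large sigma; it overflows if 1/sigma^2 exceeds about 709.
    return p * math.sqrt(T * math.expm1(1.0 / sigma ** 2))

def gdp_compose(mus):
    """Lemma 7: composing mu_i-GDP mechanisms gives sqrt(sum mu_i^2)-GDP."""
    return math.sqrt(sum(m * m for m in mus))

def gdp_group(mu, K):
    """Lemma 8: a mu-GDP mechanism is (K * mu)-GDP for groups of size K."""
    return K * mu

mu_total = gdp_mu_clt(p=0.01, T=1000, sigma=1.0)   # hypothetical settings
print(mu_total, gdp_compose([mu_total, mu_total]), gdp_group(mu_total, 2))
```

The resulting $\mu$ can then be converted to an $(\epsilon,\delta)$ pair through the formula in Lemma 9.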

Appendix G Preliminaries for Crypto Primitives

Shamir’s Secret Sharing with Robust Reconstruction. Due to the assumption of a malicious minority, the crypto primitives we use must tolerate wrong or missing messages from malicious clients. Shamir’s $t$ -out-of- $n$ Secret Sharing Scheme [37] allows distributing a secret $s$ among $n$ parties such that: 1) the complete secret can be reconstructed from any combination of $t$ shares; 2) any set of $t-1$ or fewer shares reveals no information about $s$ , where $t$ is the threshold of the secret sharing scheme. We denote $[s]_{i}$ as the share held by the $i$ -th party. Shamir’s secret sharing scheme is linear, which means a party can locally perform: 1) addition of two shares, 2) addition of a constant, and 3) multiplication by a constant. Furthermore, Shamir’s secret sharing scheme is closely related to Reed-Solomon error correcting codes [27], a family of polynomial-based error correcting codes. Shamir’s secret sharing scheme results in an $[n,t,n-t+1]$ Reed-Solomon code that can tolerate up to $q$ errors and $e$ erasures (message dropouts) such that $2q+e<n-t+1$ . Given any subset of $n-e$ shares $\mathcal{Q}~{}(|\mathcal{Q}|\geqslant n-e)$ with up to $q$ errors, any standard Reed-Solomon decoding algorithm, such as Gao’s decoding algorithm [17], can robustly reconstruct the secret $s$ . Due to this robust reconstruction property, Shamir’s secret sharing guarantees security with a malicious minority (whereas additive secret sharing [10] guarantees security only with honest-but-curious parties).
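To make the threshold and linearity properties concrete, the following Python sketch implements plain $t$-out-of-$n$ Shamir sharing with Lagrange reconstruction. It is a simplified illustration under stated assumptions: it works over a prime field (the modulus `PRIME` is our choice, whereas Appendix H uses $\mathbb{F}_{2^{K}}$), it omits robust Reed-Solomon decoding, so it cannot correct erroneous shares, and the seeded generator is for reproducibility only and is not cryptographically secure.

```python
import random

PRIME = 2**61 - 1  # Mersenne prime; field choice is an assumption of this sketch

def share(secret, t, n, rng):
    """Split `secret` into n shares; any t of them reconstruct it."""
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for a in reversed(coeffs):      # Horner evaluation at x
            y = (y * x + a) % PRIME
        shares.append((x, y))
    return shares

def reconstruct(shares):
    """Lagrange interpolation at 0 from at least t error-free shares."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = (num * (-xm)) % PRIME
                den = (den * (xj - xm)) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret

rng = random.Random(0)  # NOT secure randomness; use a CSPRNG in practice
a = share(123, t=3, n=5, rng=rng)
b = share(456, t=3, n=5, rng=rng)
# Linearity: adding shares point-wise yields shares of the sum of the secrets.
summed = [(x, (ya + yb) % PRIME) for (x, ya), (_, yb) in zip(a, b)]
print(reconstruct(a[:3]), reconstruct(summed[1:4]))
```

The point-wise addition in the last step is the same local operation that clients perform when aggregating shares of inputs and noise.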

EIFFeL: An Instantiation of SAVI Protocol. EIFFeL [35] is a SAVI protocol (with privacy and integrity guarantees) that securely aggregates only well-formed inputs. Its threat model assumes a malicious server (for privacy only) and a set of malicious clients (for both breaching privacy and submitting malformed inputs) that can arbitrarily deviate from the protocol, while the remaining honest clients are assumed to follow the protocol correctly and have well-formed inputs. EIFFeL ensures privacy by using Shamir’s secret sharing scheme [37]. Integrity is guaranteed via 1) secret-shared non-interactive proofs (SNIP) [9], an information-theoretic zero-knowledge proof for secret-shared data; and 2) verifiable secret shares [16], which validate the correctness of the secret shares. Note that the original SNIP uses an additive secret sharing scheme [10], and its deployment setting uses $\geqslant 2$ honest and non-colluding servers as the verifiers. In contrast, by leveraging Shamir’s secret sharing with robust reconstruction, EIFFeL extends SNIP to a malicious threat model in a single-server setting, where all the other clients (some of which are malicious) and the server jointly act as the verifiers of client $\mathsf{C}_{i}$ ’s input. Therefore, EIFFeL is compatible with our system model (a single server) and the threat model discussed in Section 3.1.

Appendix H Detailed Steps of DP-BREM⁺ in Figure 2

① Proof and Shares Generation: $\bm{z}_{i},\mathsf{Valid}(\cdot)\rightarrow[\bm{z}_{i}]_{j},[\pi_{i}]_{j}~{}(\forall j\neq i)$ . For generating the proof, client $\mathsf{C}_{i}$ first evaluates the circuit $\mathsf{Valid}(\cdot)$ on its private input $\bm{z}_{i}$ to obtain the value of every wire in the arithmetic circuit corresponding to the computation of $\mathsf{Valid}(\bm{z}_{i})$ , then uses these wire values to generate the proof $\pi_{i}$ (refer to [9, 35] for the detailed format). Then, client $\mathsf{C}_{i}$ splits the private input $\bm{z}_{i}$ and proof $\pi_{i}$ to generate shares $[\bm{z}_{i}]_{j}$ and $[\pi_{i}]_{j}~{}(\forall j\neq i)$ , and sends them to the other clients $\{\mathsf{C}_{j}\}_{\forall j\neq i}$ via Shamir’s secret sharing.

② Proof Summary Computation: $[\bm{z}_{i}]_{j},[\pi_{i}]_{j}~{}(\forall j\neq i)\rightarrow[\sigma_{i}]_{j}~{}(\forall j\neq i)$ . Each client except $\mathsf{C}_{i}$ first verifies the validity of the received secret shares via verifiable secret shares [16], and then locally constructs the shares of every wire in $\mathsf{Valid}(\bm{z}_{i})$ via affine operations on the shares $[\bm{z}_{i}]_{j}$ and $[\pi_{i}]_{j}$ to get the shares of proof summary $[\sigma_{i}]_{j}$ (refer to [35] for the detailed format), which will be sent to the server.

③ Proof Summary Verification: $[\sigma_{i}]_{j}~{}(\forall j\neq i)\rightarrow\mathsf{Valid}(\bm{z}_{i})$ . After receiving shares of proof summary $[\sigma_{i}]_{j}(\forall j\neq i)$ from clients $\{\mathsf{C}_{j}\}_{\forall j\neq i}$ , the server recovers the value of $\sigma_{i}$ via robust reconstruction, which is resilient to incorrect shares submitted by the malicious clients, and then checks the values in proof summaries. Finally, the validation result $\mathsf{Valid}(\bm{z}_{i})=1$ if and only if $\sigma_{i}$ has the correct value.

④ Random Numbers Generation: $l,d\rightarrow\{([u_{k}]_{j},[v_{k}]_{j})\}_{k=1}^{\lceil d/2\rceil}~{}(\forall j)$ . In this step, clients jointly generate the shares of $\lceil d/2\rceil$ pairs of random numbers $\{(u_{k},v_{k})\}_{k=1}^{\lceil d/2\rceil}$ , all of which are i.i.d. from the uniform distribution on $[0,1]$ . Denote $l$ as the fractional precision of the power-of-2 ring representation of real numbers. To obtain the share of one random number $u$ , each client $\mathsf{C}_{i}~{}(\forall i)$ generates $l$ random bits in the binary field $\mathbb{F}_{2}$ , denoted by a binary vector $\bm{b}_{i}$ of length $l$ , then generates and distributes the shares $[\bm{b}_{i}]_{j}$ to the other clients (via Shamir’s secret sharing). After receiving all shares from the other clients, each client $\mathsf{C}_{j}~{}(\forall j)$ locally adds these shares to get $[\bm{b}]_{j}=[\sum_{i}\bm{b}_{i}]_{j}\in\mathbb{F}_{2}^{l}$ , where the vector $\bm{b}\in\mathbb{F}_{2}^{l}$ is the bitwise XOR of the vectors $\{\bm{b}_{i}\}_{\forall i}$ because addition in $\mathbb{F}_{2}^{l}$ is bitwise XOR. We define the binary vector $\bm{b}$ as the binary representation of the fractional part of $u\in[0,1]$ . Note that the Shamir’s secret sharing scheme of Phase 1 is implemented in a finite field $\mathbb{F}_{2^{K}}$ , where $K>l$ . Therefore, the client $\mathsf{C}_{j}$ can locally compute the arithmetic share $[u]_{j}\in\mathbb{F}_{2^{K}}$ from the share of the binary representation $[\bm{b}]_{j}\in\mathbb{F}_{2}^{l}$ . Since all possible discrete values with the power-of-2 ring representation evenly span the range $[0,1]$ , the generated random real number $u$ is uniformly distributed in $[0,1]$ .
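The reason one honest client suffices is that XOR-ing any fixed bit string with a uniformly random one gives a uniformly random result. The Python sketch below shows this combination in the clear, without secret sharing, so it illustrates only the arithmetic and not the privacy of the step. The function names and the bit length are assumptions; in this sketch the output lies on the grid $\{0,2^{-l},\dots,1-2^{-l}\}$.

```python
import secrets

def local_bits(l):
    """One client's contribution: l independent random bits."""
    return [secrets.randbits(1) for _ in range(l)]

def xor_combine(bit_vectors):
    """Bitwise XOR of all clients' vectors, i.e. their sum in F_2^l."""
    combined = [0] * len(bit_vectors[0])
    for bits in bit_vectors:
        combined = [a ^ b for a, b in zip(combined, bits)]
    return combined

def bits_to_unit_interval(bits):
    """Interpret b_1 b_2 ... b_l as the binary fraction 0.b_1 b_2 ... b_l."""
    return sum(b * 2.0 ** -k for k, b in enumerate(bits, start=1))

l = 16                                            # hypothetical fractional precision
contributions = [local_bits(l) for _ in range(5)] # five simulated clients
u = bits_to_unit_interval(xor_combine(contributions))
print(u)
```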

⑤ Transformation to Gaussian Distribution: $\{([u_{k}]_{j},[v_{k}]_{j})\}_{k=1}^{\lceil d/2\rceil}~{}(\forall j)\rightarrow[\bm{\xi}]_{j}~{}(\forall j)$ . For each pair $(u_{k},v_{k})$ , clients can jointly compute a secret sharing of $a_{k}=\sqrt{-2\ln(u_{k})}\cdot\cos(2\pi v_{k})$ and of $b_{k}=\sqrt{-2\ln(u_{k})}\cdot\sin(2\pi v_{k})$ by using Secure Multiparty Computation (MPC) protocols [23] that guarantee security (i.e., privacy and integrity) with a malicious minority. According to the Box-Muller transform [6], $a_{k}$ and $b_{k}$ are i.i.d. random variables from the Gaussian distribution with mean 0 and variance 1. Then, after each client locally multiplies its shares by the constant $R\sigma$ , the scaled $a_{k}$ and $b_{k}$ are i.i.d. random numbers following a Gaussian distribution with the desired standard deviation $R\sigma$ . Finally, by concatenating the shares of $d$ numbers in $\{(a_{k},b_{k})\}_{k=1}^{\lceil d/2\rceil}$ , clients obtain the shares of a random vector $\bm{\xi}$ of length $d$ from the Gaussian distribution $\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ .
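In the clear, the transformation of this step is only a few lines. The Python sketch below applies the Box-Muller transform to $\lceil d/2\rceil$ uniform pairs and scales the result; it ignores the MPC layer entirely, and the function names, the seed, and the value used for the scale $R\sigma$ are assumptions. Drawing $u$ from $(0,1]$ instead of $[0,1]$ is a choice of this sketch to avoid $\ln(0)$.

```python
import math
import random

def box_muller(u, v):
    """Map two independent uniforms (u in (0, 1]) to two independent N(0, 1) samples."""
    r = math.sqrt(-2.0 * math.log(u))
    return r * math.cos(2.0 * math.pi * v), r * math.sin(2.0 * math.pi * v)

def gaussian_noise(d, scale, rng):
    """Length-d vector with i.i.d. N(0, scale^2) entries built from ceil(d/2) pairs."""
    out = []
    for _ in range(math.ceil(d / 2)):
        u = 1.0 - rng.random()   # rng.random() is in [0, 1), so u is in (0, 1]
        v = rng.random()
        a, b = box_muller(u, v)
        out.extend((scale * a, scale * b))
    return out[:d]               # drop the spare sample when d is odd

rng = random.Random(0)
xi = gaussian_noise(d=20001, scale=2.0, rng=rng)   # scale plays the role of R * sigma
mean = sum(xi) / len(xi)
std = math.sqrt(sum((x - mean) ** 2 for x in xi) / len(xi))
print(len(xi), mean, std)
```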

⑥ Shares Aggregation: $\{[\bm{z}_{i}]_{j}\}_{i\in\mathcal{I}_{\mathsf{Valid}}},[\bm{\xi}]_{j}~{}(\forall j)\rightarrow[\sum_{i\in\mathcal{I}_{\mathsf{Valid}}}\bm{z}_{i}+\bm{\xi}]_{j}~{}(\forall j)$ . Due to the linearity of Shamir’s secret sharing scheme, each client $\mathsf{C}_{j}$ can locally compute the share of the noisy aggregate by adding the shares of all valid inputs and the share of Gaussian noise: $[\sum_{i\in\mathcal{I}_{\mathsf{Valid}}}\bm{z}_{i}+\bm{\xi}]_{j}=\sum_{i\in\mathcal{I}_{\mathsf{Valid}}}[\bm{z}_{i}]_{j}+[\bm{\xi}]_{j}$ , and sends that share to the server.

⑦ Noisy Aggregate Reconstruction: $[\sum_{i\in\mathcal{I}_{\mathsf{Valid}}}\bm{z}_{i}+\bm{\xi}]_{j}~{}(\forall j)\rightarrow\sum_{i\in\mathcal{I}_{\mathsf{Valid}}}\bm{z}_{i}+\bm{\xi}$ . After receiving all shares of the noisy aggregate, the server recovers it using robust reconstruction.

Appendix I Proof of Theorem 4 (Security Analysis)

Integrity. We prove that DP-BREM⁺ satisfies the integrity constraint using the following lemmas, where Lemma 11 and Lemma 13 are derived from EIFFeL [35].

Lemma 11 (Integrity of Input).

DP-BREM⁺ rejects all malformed inputs with probability $1-\text{negl}(\kappa)$ .

Lemma 12 (Integrity of Gaussian Noise).

In Phase 2 of DP-BREM⁺, each client holds the share of random vector $\bm{\xi}$ that follows the Gaussian distribution $\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ .

Proof.

In step ④ of Phase 2, the jointly generated random number $u$ follows the uniform distribution on $[0,1]$ as long as there is at least one honest client, because $u$ ’s binary representation $\bm{b}$ is the bitwise XOR of the clients’ local random vectors $\{\bm{b}_{i}\}_{\forall i}$ . In step ⑤, since the MPC protocol we use [23, 28] guarantees computation integrity (meaning that the output is correctly computed) with a malicious minority, the uniform distribution generated in step ④ is correctly transformed to the Gaussian distribution. ∎

Since clients locally add shares of valid inputs and noise together, DP-BREM⁺ satisfies the integrity of the aggregate shown in Lemma 13. Our integrity guarantee in Lemma 13 directly follows EIFFeL [35], though the integrity of noise is defined differently from the integrity of input. Note that the integrity of EIFFeL (and ours) relies on the robust reconstruction property of Shamir’s secret sharing [37]; the details can be found in [35].

Lemma 13 (Integrity of Aggregate).

The aggregated output of DP-BREM⁺ must contain the inputs of all honest clients and the generated Gaussian noise.

\displaystyle\text{Aggregate}=\sum\nolimits_{i\in\mathcal{I}_{H}}\bm{z}_{i}+% \sum\nolimits_{i\in\mathcal{I}_{M}^{*}}\bm{z}_{i}+\bm{\xi}

where the random vector $\bm{\xi}\sim\mathcal{N}(0,R^{2}\sigma^{2}\mathbf{I}_{d})$ , $\mathcal{I}_{H}$ is the set of all honest clients, and $\mathcal{I}_{M}^{*}$ is the set of malicious clients with well-formed inputs (i.e., $\mathcal{I}_{M}^{*}=\mathcal{I}_{\mathsf{Valid}}\backslash\mathcal{I}_{H}$ ).

Privacy. DP-BREM⁺ guarantees: nothing can be learned about a private input $\bm{z}_{i}$ for an honest client $\mathsf{C}_{i}$ , except:

1) $\bm{z}_{i}$ passes the integrity check, i.e., $\mathsf{Valid}(\bm{z}_{i})=1$ .

2) anything that can be learned from the noisy aggregation of well-formed inputs (thus achieving the same DP guarantee as the original DP-BREM).

We prove this privacy property using the following lemmas, where Lemma 14 and Lemma 16 are derived from EIFFeL [35].

Lemma 14.

In Phase 1, for an honest client $\mathsf{C}_{i}$ , DP-BREM⁺ reveals nothing about the private input $\bm{z}_{i}$ except $\mathsf{Valid}(\bm{z}_{i})=1$ .

Lemma 15.

In Phase 2, DP-BREM⁺ reveals nothing about the generated Gaussian noise.

Proof.

In step ④, no entity learns the uniformly random number $u$ as long as there is at least one honest client, due to the bitwise XOR operation. In step ⑤, nothing is revealed because the MPC protocol we use [28] guarantees information-theoretic privacy of the input shares during the computation of the distribution transformation. Note that step ⑤ only generates the shares held by clients without outputting the final result. ∎

Lemma 16.

In Phase 3, for an honest client $\mathsf{C}_{i}$ , DP-BREM⁺ reveals nothing about the private input $\bm{z}_{i}$ except whatever can be learned from the noisy aggregate.

Appendix J Supplements of Experiments

J.1 Experimental Setup

FL Implementation. Due to limited resources, we simulate the distributed training of FL by running a single machine sequentially for clients and the server. The real-world implementation of FL is out of the scope of this paper.

Datasets (non-IID) and Model Architecture. We use two datasets for our experiments: MNIST [25] and CIFAR-10 [24], where the default number of total clients is $n=100$ . For the MNIST dataset, we use the CNN model from the PyTorch example (https://github.com/pytorch/opacus). For the CIFAR-10 dataset, we use the CNN model from the TensorFlow tutorial (https://www.tensorflow.org/tutorials/images/cnn), as in previous works [49, 30]. To simulate heterogeneous data distributions, we make non-i.i.d. partitions of the datasets, following a setup similar to [49], as described below:

1) Non-IID MNIST: The MNIST dataset contains 60,000 training images and 10,000 testing images of 10 classes. There are 100 clients, each holding 600 training images. We sort the training data by digit label and evenly divide it into 400 shards. Each client is assigned four random shards of the data, so that most of the clients have examples of three or four digits.

2) Non-IID CIFAR-10: The CIFAR-10 dataset contains 50,000 training images and 10,000 test images of 10 classes. There are 100 clients, each holding 500 training images. We sample the training images for each client using a Dirichlet distribution with hyperparameter 0.9.
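The shard-based MNIST partition can be sketched as follows. This is a minimal illustration, not the experiment code: the function name `shard_partition` and the seed are assumptions, and the synthetic, perfectly balanced label list stands in for the real MNIST labels (whose classes are not exactly balanced, so real shards may straddle two digits).

```python
import random

def shard_partition(labels, num_clients, shards_per_client, rng):
    """Sort indices by label, cut into equal shards, and deal random shards to clients."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_clients * shards_per_client
    shard_size = len(labels) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    rng.shuffle(shards)  # random assignment of shards to clients
    return [
        [i for shard in shards[c * shards_per_client:(c + 1) * shards_per_client]
         for i in shard]
        for c in range(num_clients)
    ]

# Synthetic stand-in for MNIST labels: 60,000 examples, 10 balanced classes.
labels = [i % 10 for i in range(60000)]
parts = shard_partition(labels, num_clients=100, shards_per_client=4,
                        rng=random.Random(0))
digits_per_client = [len({labels[i] for i in part}) for part in parts]
print(len(parts[0]), max(digits_per_client))
```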

Byzantine Attacks. We consider four different Byzantine attacks in our experiments.

1) ALIE ("a little is enough") [3]. The attacker uses the empirical variance (estimated from the data of corrupted clients) to determine the perturbation range, in which the attack can deviate from the mean without being detected or filtered out.

2) IPM (inner-product manipulation) [45]. The attacker manipulates the submitted gradient to point in the negative direction of the mean of the honest clients’ gradients, so that the negative inner product between the true gradient and the aggregate prevents the descent of the loss. Note that the original IPM attack assumes an omniscient attacker (i.e., one who knows the data/gradients of all other clients), which contradicts our assumption that the attacker only has access to the data of the corrupted clients (otherwise, privacy is already leaked and there is no need to provide DP). Thus, in the experiments, we use the data of corrupted clients to estimate the aggregated gradient of honest clients, and then manipulate the inner product (i.e., a non-omniscient attack).

3) LF (label-flipping). The attacker modifies the labels of all examples of corrupted clients’ data and trains a new model with multiple iterations, then uses the model replacement strategy [2] to enhance the impact on the global model.

4) MTB ("manipulating-the-Byzantine") [38]. The attacker computes a benign reference aggregate using some benign data samples obtained from corrupted clients, then computes a malicious perturbation vector, and an optimized scaling factor to get the malicious update with the goal of evading detection by robust aggregation algorithms. The optimization of the scaling factor can be tailored or agnostic to the aggregator. Considering our scheme and the baselines do not detect malicious clients, we use the agnostic setting (including min-max and min-sum) for simplicity because tailoring MTB attack to all defense aggregators is nontrivial. In our experiments, we implement the min-max attack since it has a larger impact on the global model.
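For intuition, the non-omniscient ALIE and IPM updates described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the attack implementations used in the experiments: the function names, the deviation factor `z`, and the scaling factor `eps` are hypothetical, and the statistics are estimated from the corrupted clients' own benign gradients.

```python
import math

def coordinate_mean(grads):
    """Coordinate-wise mean of a list of gradient vectors."""
    n = len(grads)
    return [sum(g[k] for g in grads) / n for k in range(len(grads[0]))]

def coordinate_std(grads, mean):
    """Coordinate-wise (population) standard deviation."""
    n = len(grads)
    return [math.sqrt(sum((g[k] - mean[k]) ** 2 for g in grads) / n)
            for k in range(len(mean))]

def alie_update(grads, z=1.0):
    """ALIE-style update: shift the estimated mean by z standard deviations."""
    mean = coordinate_mean(grads)
    std = coordinate_std(grads, mean)
    return [m - z * s for m, s in zip(mean, std)]

def ipm_update(grads, eps=0.5):
    """IPM-style update: point against the estimated mean gradient."""
    return [-eps * m for m in coordinate_mean(grads)]

# Benign gradients computed on the corrupted clients' data (toy values).
estimated = [[1.0, 2.0], [3.0, 2.0]]
print(alie_update(estimated), ipm_update(estimated))
```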

Byzantine Defenses with DP. We compare the performance of our approaches with the following five competitors against Byzantine attacks. All of them satisfy record-level DP via record-level clipping and DP noise added to the local gradient/momentum. Note that the privacy budget $\epsilon$ in Theorem 1 is the same for different clients because clients have the same local dataset size $|\mathcal{D}_{i}|$ and the same record-level sampling rate (i.e., the same $|\mathcal{D}_{i}|$ and $p_{i}$ for different clients $\mathsf{C}_{i}$ ).

1) DP-FedSGD. Note that the original DP-FedSGD in [30] clips the client gradient to achieve client-level DP. For a fair comparison, we also implement record-level gradient clipping on top of the original DP-FedSGD to guarantee record-level DP. Though DP-FedSGD is not designed for robustness, its client-level clipping can restrict malicious clients’ capability, thus providing some level of Byzantine robustness. We take this as a baseline to illustrate that client-level clipping can provide some level of robustness, but may not be enough to defend against strong attackers (either an advanced attack strategy or a larger number of malicious clients).

2) DP-CM. As a baseline that adds DP to median-based robust aggregators (discussed in Section 3.2), we implement the Byzantine-robust aggregator Coordinate-wise Median (CM) [47] with DP noise added to the median result. Note that only DP-CM uses median-based aggregation, while other methods use average-based aggregation. As discussed in Section 3.1 and Example 1, the median-based aggregation has large sensitivity and poor privacy-utility tradeoff.

3) DDP-RP [43]. By leveraging encryption techniques, DDP-RP guarantees Distributed DP with secure aggregation. Given a known lower bound on the number of trusted clients, it allows clients to add smaller noise to the local gradient than Local DP requires, thus providing a better privacy-utility tradeoff than local DP protocols. To guarantee Byzantine robustness, DDP-RP uses range-proof (RP) techniques to securely verify whether the local model/gradient weights are in a (predefined) bounded range.

4) DP-RSA [50]. It replaces the value aggregation to sign aggregation, which provides robustness because each client has limited impact on the aggregation. The DP noise is added to the local gradient before the sign operation.
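To make the bounded-impact argument concrete, here is a simplified sketch of sign aggregation with noise added before the sign operation. It is not the full DP-RSA algorithm of [50]; the helper names and the `noise_std` parameter are assumptions introduced only for illustration.

```python
import random


def sign(value):
    """Return -1, 0, or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def noisy_sign(gradient, noise_std, rng):
    """Add Gaussian noise to each coordinate, then keep only the sign."""
    return [sign(g + rng.gauss(0.0, noise_std)) for g in gradient]


def aggregate_signs(sign_vectors):
    """Sum the sign vectors coordinate by coordinate."""
    return [sum(column) for column in zip(*sign_vectors)]


rng = random.Random(0)
gradients = [
    [0.5, -2.0],        # honest client
    [3.0, -0.1],        # honest client
    [-1000.0, 1000.0],  # Byzantine client with a huge update
]
# noise_std = 0.0 here only so that the toy result is deterministic.
votes = aggregate_signs([noisy_sign(g, 0.0, rng) for g in gradients])
```

However large the Byzantine update is, it contributes at most one vote per coordinate, so the two honest clients still determine the sign of the aggregate.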

5) DP-LFH. The baseline (shown in Section 3.2) directly combines DP-SGD based momentum with LFH. Each client adds DP noise to the local gradient, and then computes the local momentum that will be aggregated with centered clip** by the server.
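The momentum-plus-centered-clipping step can be sketched as follows. This is a minimal illustration under stated assumptions: the momentum recursion $\bm{m}_{t}=(1-\beta)\bm{g}_{t}+\beta\bm{m}_{t-1}$ with $\bm{m}_{1}=\bm{g}_{1}$, averaging of the clipped differences over the number of clients, and no DP noise. All function names are hypothetical.

```python
import math


def clip(vec, bound):
    """Scale vec down so that its L2 norm is at most bound."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= bound:
        return list(vec)
    return [v * bound / norm for v in vec]


def update_momentum(prev_momentum, gradient, beta):
    """Client-side momentum; the first round uses the gradient itself."""
    if prev_momentum is None:
        return list(gradient)
    return [(1 - beta) * g + beta * m for g, m in zip(gradient, prev_momentum)]


def centered_clip_aggregate(momenta, prev_aggregate, bound):
    """Server-side centered clipping around the previous aggregate."""
    total = [0.0] * len(prev_aggregate)
    for momentum in momenta:
        diff = [m - p for m, p in zip(momentum, prev_aggregate)]
        for k, value in enumerate(clip(diff, bound)):
            total[k] += value
    n = len(momenta)  # assumption: average over all participating clients
    return [p + s / n for p, s in zip(prev_aggregate, total)]


# Two honest momenta and one Byzantine momentum far from the previous aggregate.
aggregate = centered_clip_aggregate(
    [[1.0, 0.0], [1.0, 0.0], [-100.0, 0.0]], [0.0, 0.0], bound=1.0
)
```

Because every difference is clipped to norm at most `bound`, the Byzantine client can shift the aggregate by at most `bound / n` per round.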

J.2 Parameters Setting

Basic Parameters.

• Total number of iterations $T$: 1000 for MNIST; 2000 for CIFAR-10.
• Learning rate $\eta_{t}$: for MNIST, $\eta_{t}$ is linearly reduced from 0.1 to 0.01 w.r.t. iterations; for CIFAR-10, $\eta_{t}$ is linearly reduced from 0.05 to 0.0025 w.r.t. iterations.
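A linearly decaying learning rate of this kind can be sketched as below. The helper name and the convention that the start value applies at $t=1$ and the end value at $t=T$ are assumptions, since the text only states the two endpoints.

```python
def linear_schedule(t, total_iters, start, end):
    """Linearly interpolate from start (at t = 1) to end (at t = total_iters).

    The 1-based indexing and the exact endpoints are assumptions.
    """
    if total_iters == 1:
        return start
    fraction = (t - 1) / (total_iters - 1)
    return start + fraction * (end - start)


# Learning rates for the MNIST setting (T = 1000, from 0.1 down to 0.01).
mnist_first = linear_schedule(1, 1000, 0.1, 0.01)
mnist_last = linear_schedule(1000, 1000, 0.1, 0.01)
```

The same helper would also cover other quantities in this appendix that are described as linearly reduced over the iterations.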

DP-related Parameters.

• Record-level sampling rate $p_{i}$: 0.05 for all $i$.
• Client-level sampling rate $q$: the default value is 1. We evaluate the influence of $q$ (from 0.2 to 1) on the accuracy in Table 4.
• Record-level clipping bound $R$: linearly reduced from $R_{0}$ to $0.3R_{0}$ w.r.t. iterations. Note that in Figure 7, the values of $R$ on the x-axis are values of $R_{0}$. For MNIST, we set $R_{0}=10$ by default, except that $R_{0}=5$ for the case of $\epsilon=1$ in Figure 5. For CIFAR-10, we set $R_{0}=20$ by default, except that $R_{0}=15$ for the case of $\epsilon=2$ in Figure 6.
• Privacy parameter $\delta$ in DP: $10^{-6}$.
• Noise multiplier $\sigma$: for MNIST (with $T=1000$ and $|\mathcal{D}_{i}|=60000/100=600$ examples per client), $\sigma\in\{0.15,0.06,0.029,0\}$ for $\epsilon\in\{1,3,8,\infty\}$. For CIFAR-10 (with $T=2000$ and $|\mathcal{D}_{i}|=50000/100=500$ examples per client), $\sigma\in\{0.14,0.077,0.042,0\}$ for $\epsilon\in\{2,4,9,\infty\}$.

Robustness-related Parameters.

• Client-level clipping bound $C$ (only for DP-BREM and DP-LFH): linearly reduced from $C_{0}$ to $0.3C_{0}$ w.r.t. iterations, where $C_{0}=1$ for MNIST and $C_{0}=5$ for CIFAR-10.
• Momentum parameter: $\beta=0.9$.

J.3 More Experimental Results

Iteration Curve. Figures 8 and 9 show how the accuracy changes over the training iterations on MNIST (with $T=1000$ total iterations) and CIFAR-10 (with $T=2000$), respectively. Because of the Byzantine attacks, the iteration curves are not as smooth as in the attack-free case.

Appendix K Other Related Work

FL with DP. Differential Privacy (DP) was originally designed for the centralized scenario, where a trusted database server with direct access to all clients' data in the clear wishes to answer queries or publish statistics in a privacy-preserving manner by randomizing the query results. In FL, McMahan et al. [30] proposed DP-FedSGD and DP-FedAvg, which provide client-level privacy with a trusted server. Geyer et al. [18] use an algorithm similar to DP-FedSGD for the architecture search problem; their privacy guarantee is also at the client level and also assumes a trusted server. Li et al. [26] study online transfer learning and introduce a notion called task global privacy that works at the record level. However, the online setting assumes that the client interacts with the server only once and does not extend to the federated setting. Zheng et al. [49] introduced two record-level privacy notions based on $f$-differential privacy, which describe the privacy guarantee against an individual malicious client and against a group of malicious clients (but not against the server). In contrast, our solutions achieve record-level DP under either a trusted server or a malicious server.

Byzantine-Robust FL. There has recently been extensive work on Byzantine-robust federated/distributed learning with a trustworthy server, most of which relies on median statistics of the gradient contributions. Blanchard et al. [5] proposed Krum, which uses the Euclidean distance to determine which gradient contributions should be removed. Yin et al. [47] proposed two robust distributed gradient descent algorithms, one based on the coordinate-wise median and the other on the coordinate-wise trimmed mean. Mhamdi et al. [32] proposed Bulyan, a two-step meta-aggregation rule based on Krum and the trimmed median, which first filters malicious updates and then computes the trimmed median of the remaining updates.

Private and Byzantine-Robust FL. Some recent works have tried to achieve privacy and robustness of FL simultaneously. He et al. [20] proposed a Byzantine-resilient and privacy-preserving solution, which makes distance-based robust aggregation rules (such as Krum [5]) compatible with secure aggregation via MPC and secret sharing. So et al. [40] developed a similar scheme based on Krum, but it relies on different cryptographic techniques, such as verifiable Shamir's secret sharing and Reed-Solomon codes. Velicheti et al. [42] achieved both privacy and Byzantine robustness by incorporating secure averaging among randomly clustered clients before filtering malicious updates through robust aggregation. However, these works only ensure the security of the aggregation step and do not achieve DP for the aggregated model.

$$\begin{aligned}
\|\bm{m}_{t,i}-\bm{m}_{t,i}^{\prime}\| &\leqslant (1-\beta)\left[\|\bar{\bm{g}}_{t,i}-\bar{\bm{g}}_{t,i}^{\prime}\|+\beta\|\bar{\bm{g}}_{t-1,i}-\bar{\bm{g}}_{t-1,i}^{\prime}\|+\cdots+\beta^{t-2}\|\bar{\bm{g}}_{2,i}-\bar{\bm{g}}_{2,i}^{\prime}\|\right]+\beta^{t-1}\|\bar{\bm{g}}_{1,i}-\bar{\bm{g}}_{1,i}^{\prime}\| \\
&\leqslant \left[(1-\beta)(1+\beta+\cdots+\beta^{t-2})+\beta^{t-1}\right]\cdot\frac{R}{p_{i}|\mathcal{D}_{i}|} \\
&= \left[(1-\beta)\cdot\frac{1-\beta^{t-1}}{1-\beta}+\beta^{t-1}\right]\cdot\frac{R}{p_{i}|\mathcal{D}_{i}|}=\frac{R}{p_{i}|\mathcal{D}_{i}|}
\end{aligned}$$
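The key step in the bound above is that the coefficients of the unrolled momentum sum to one, so the momentum has the same sensitivity as a single clipped gradient. A small numerical check of that identity is sketched below; the function name is hypothetical, and the weights assume the initialization $\bm{m}_{1,i}=\bar{\bm{g}}_{1,i}$ implied by the $\beta^{t-1}$ coefficient on the first gradient.

```python
def momentum_weights(t, beta):
    """Coefficients on g_t, g_{t-1}, ..., g_2, g_1 in the unrolled momentum."""
    # g_t down to g_2 carry the factor (1 - beta) * beta^k for k = 0..t-2.
    weights = [(1 - beta) * beta**k for k in range(t - 1)]
    # g_1 carries beta^(t-1), assuming m_1 = g_1.
    weights.append(beta ** (t - 1))
    return weights


# With beta = 0.9 as in the experiments, the weights sum to 1 for any t.
total_weight = sum(momentum_weights(1000, 0.9))
```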

$$\begin{aligned}
&\|Q_{t}(\mathcal{D})-Q_{t}(\mathcal{D}^{\prime})\| \\
&= \Big\|\sum\nolimits_{j\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bm{m}_{t,j}-\tilde{\bm{m}}_{t-1})-\sum\nolimits_{j\in\mathcal{I}_{t}}\mathsf{Clip}_{C}(\bm{m}_{t,j}^{\prime}-\tilde{\bm{m}}_{t-1})\Big\| \\
&\overset{(\mathsf{a})}{=} \|\mathsf{Clip}_{C}(\bm{m}_{t,i}-\tilde{\bm{m}}_{t-1})-\mathsf{Clip}_{C}(\bm{m}_{t,i}^{\prime}-\tilde{\bm{m}}_{t-1})\| \\
&\overset{(\mathsf{b})}{\leqslant} \min\{2C,\|\bm{m}_{t,i}-\bm{m}_{t,i}^{\prime}\|\}\leqslant\min\left\{2C,\frac{R}{p_{i}|\mathcal{D}_{i}|}\right\}
\end{aligned}$$

$$\begin{aligned}
&\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|^{2} \\
&= \left\|\frac{C}{\|x\|}\cdot x-(x+\delta)\right\|^{2}=\left\|\left(1-\frac{C}{\|x\|}\right)\cdot x+\delta\right\|^{2} \\
&= \left(1-\frac{C}{\|x\|}\right)^{2}\|x\|^{2}+2\left(1-\frac{C}{\|x\|}\right)x^{\top}\delta+\|\delta\|^{2} \\
&= \left(1-\frac{C}{\|x\|}\right)\left[\left(1-\frac{C}{\|x\|}\right)\|x\|^{2}+2x^{\top}\delta+\|\delta\|^{2}\right]+\frac{C\cdot\|\delta\|^{2}}{\|x\|} \\
&= \left(1-\frac{C}{\|x\|}\right)\left[\|x+\delta\|^{2}-C\|x\|\right]+\frac{C\cdot\|\delta\|^{2}}{\|x\|} \\
&< 0+1\cdot\|\delta\|^{2}=\|\delta\|^{2}
\end{aligned}$$

$$\begin{aligned}
&\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|^{2} \\
&= \left\|x-\frac{C}{\|x+\delta\|}\cdot(x+\delta)\right\|^{2}=\left\|\left(1-\frac{C}{\|x+\delta\|}\right)\cdot(x+\delta)-\delta\right\|^{2} \\
&= \left(1-\frac{C}{\|x+\delta\|}\right)^{2}\|x+\delta\|^{2}-2\left(1-\frac{C}{\|x+\delta\|}\right)(x+\delta)^{\top}\delta+\|\delta\|^{2} \\
&= \left(1-\frac{C}{\|x+\delta\|}\right)\left[\left(1-\frac{C}{\|x+\delta\|}\right)\|x+\delta\|^{2}-2(x+\delta)^{\top}\delta+\|\delta\|^{2}\right]+\frac{C\cdot\|\delta\|^{2}}{\|x+\delta\|} \\
&= \left(1-\frac{C}{\|x+\delta\|}\right)\left[\|(x+\delta)-\delta\|^{2}-C\|x+\delta\|\right]+\frac{C\cdot\|\delta\|^{2}}{\|x+\delta\|} \\
&< 0+1\cdot\|\delta\|^{2}=\|\delta\|^{2}
\end{aligned}$$

$$\begin{aligned}
&\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|^{2} \\
&= \left\|\frac{C}{\|x\|}\cdot x-\frac{C}{\|x+\delta\|}\cdot(x+\delta)\right\|^{2} \\
&= \frac{C^{2}}{\|x\|^{2}}\cdot\|x\|^{2}-\frac{C^{2}\cdot 2x^{\top}(x+\delta)}{\|x\|\cdot\|x+\delta\|}+\frac{C^{2}}{\|x+\delta\|^{2}}\cdot\|x+\delta\|^{2} \\
&= 2C^{2}-\frac{C^{2}\cdot\left[\|x\|^{2}+(\|x\|^{2}+2x^{\top}\delta+\|\delta\|^{2})\right]}{\|x\|\cdot\|x+\delta\|}+\frac{C^{2}\cdot\|\delta\|^{2}}{\|x\|\cdot\|x+\delta\|} \\
&= 2C^{2}-\frac{C^{2}\cdot\left[\|x\|^{2}+\|x+\delta\|^{2}\right]}{\|x\|\cdot\|x+\delta\|}+\frac{C^{2}}{\|x\|\cdot\|x+\delta\|}\cdot\|\delta\|^{2} \\
&< 2C^{2}-C^{2}\cdot 2+1\cdot\|\delta\|^{2}=\|\delta\|^{2}
\end{aligned}$$
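The three derivations above each conclude that clipping does not increase the distance between two vectors. A quick randomized sanity check of the combined bound $\|\mathsf{Clip}_{C}(x)-\mathsf{Clip}_{C}(x+\delta)\|\leqslant\min\{2C,\|\delta\|\}$ is sketched below. It is only a numerical illustration, not a substitute for the proof, and the sampling distributions and helper names are arbitrary choices made here.

```python
import math
import random


def norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def clip(vec, bound):
    """Scale vec down so that its L2 norm is at most bound."""
    n = norm(vec)
    return list(vec) if n <= bound else [v * bound / n for v in vec]


def max_violation(trials, dim, bound, seed=0):
    """Largest observed value of gap - min(2 * bound, ||delta||)."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        x = [rng.gauss(0.0, 2.0) for _ in range(dim)]
        delta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        shifted = [a + b for a, b in zip(x, delta)]
        clipped_diff = [a - b for a, b in zip(clip(x, bound), clip(shifted, bound))]
        worst = max(worst, norm(clipped_diff) - min(2 * bound, norm(delta)))
    return worst


# A positive value (beyond rounding error) would contradict the bound.
worst_case = max_violation(trials=2000, dim=5, bound=1.0)
```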
