License: arXiv.org perpetual non-exclusive license
arXiv:2312.10789v1 [cs.CR] 17 Dec 2023

Federated learning with differential privacy and an untrusted aggregator (technical report)

Kunlong Liu (University of California, Santa Barbara) and Trinabh Gupta (University of California, Santa Barbara)
Abstract.

Federated learning for training models over mobile devices is gaining popularity. Current systems for this task exhibit significant trade-offs between model accuracy, privacy guarantee, and device efficiency. For instance, Oort (OSDI 2021) provides excellent accuracy and efficiency but requires a trusted central server. On the other hand, Orchard (OSDI 2020) provides good accuracy and the rigorous guarantee of differential privacy over an untrusted server, but creates huge overhead for the devices. This paper describes Aero, a new federated learning system that significantly improves this trade-off. Aero provides good accuracy and differential privacy over an untrusted server while keeping the device overhead low. The key idea of Aero is to tune system architecture and design to a specific set of popular federated learning algorithms. This tuning requires novel optimizations and techniques, e.g., a new protocol to securely aggregate updates from devices. An evaluation of Aero demonstrates that it provides comparable accuracy to plain federated learning (without differential privacy), and it improves efficiency (cpu and network) over Orchard by up to $10^{5}\times$.

1. Introduction

Federated learning (FL) is a recent paradigm in machine learning that embraces a decentralized training architecture (mcmahan2017communication, ). In contrast to the traditional, central model of learning where users ship their training data to a central server, users in FL download the latest model parameters from the server, perform local training to generate updates to the parameters, and send only the updates to the server. Federated learning has gained popularity in training models for mobiles (hard2018federated, ; hartmann2019federated, ; kairouz2021practical, ; DPFTRL, ) as it can save network bandwidth and it is privacy-friendly—raw data stays at the devices.

Current systems for federated learning exhibit significant trade-offs between model accuracy, privacy, and device efficiency. For instance, one class of systems that includes Oort (lai2021oort, ), FedScale (fedscale-icml22, ), and FedML (chaoyanghe2020fedml, ) provides excellent accuracy (comparable to centralized learning) and device efficiency. But these systems provide only a weak notion of privacy. This point is subtle. At first glimpse, it appears that in any federated learning system, since users ship updates to model parameters rather than the raw training data, this data (user images, text messages, search queries, etc.) remains confidential. However, research shows that updates can be reverse-engineered to reveal the raw data (zhu2019deep, ; melis2019exploiting, ; shokri2017membership, ). Thus, if the server is compromised, so is the users’ data. In other words, the server must be trusted.

On the other hand, systems such as HybridAlpha (xu2019hybridalpha, ) and Orchard (roth2020orchard, ) offer good accuracy and a differential privacy guarantee for users’ data. Informally, differential privacy says that an adversary cannot deduce a user’s training data by inspecting the updates or the learned model parameters (dwork2011firm, ; dwork2006calibrating, ; dwork2014algorithmic, ; abadi2016deep, ). In fact, Orchard guarantees differential privacy while assuming a byzantine server. But the downside is the high overhead for the devices. For example, to train a CNN model with 1.2 million parameters (reddi2020adaptive, ), Orchard requires from each device $\approx$14 minutes of training time on a six-core processor and $\approx$840 MiB in network transfers per round of training (§6.2). The full training requires at least a few hundred rounds. Further, for a few randomly chosen devices, this per-round cost spikes to $\approx$214 hours of cpu time and $\approx$11 TiB of network transfers. Clearly, this is quite high. (Footnote 1: Another class of systems provides a particular type of differential privacy called local differential privacy (LDP) (duchi2013local, ; ijcai2021-217, ; truex2020ldp, ). These systems are efficient but LDP creates a high accuracy loss (truex2020ldp, ; grafberger2021fedless, ; ijcai2021-217, ) (§2.3, §7).)

This paper describes a new federated learning system, Aero, that significantly improves the tradeoff between accuracy, privacy, and device overhead. Aero provides good accuracy, the differential privacy guarantee in the same threat model as Orchard, and low device overhead. For instance, most of the time Aero’s devices incur overhead in milliseconds of cpu time and KiBs of network transfers.

The key idea in Aero is that it does not aim to be a general-purpose federated learning system, but rather focuses on a particular class of algorithms (§3.1). These algorithms sample devices that contribute updates in a round using a simple probability parameter (e.g., a device is selected with a probability of $10^{-5}$), then aggregate updates across devices by averaging them, and generate noise needed for differential privacy from a Gaussian distribution. Admittedly, this is only one class of algorithms, but this class comprises popular algorithms such as DP-FedAvg (mcmahan2017learning, ) and DP-FedSGD (mcmahan2017learning, ) that are the ones commonly used and deployed (hard2018federated, ; hartmann2019federated, ; fedscale-icml22, ; chaoyanghe2020fedml, ). With this restriction, Aero tunes system architecture and design to these algorithms, thereby gaining on performance by orders of magnitude.

This tuning is non-trivial and requires novel techniques and optimizations (§3.2, §4). As one example, the devices must verify that the byzantine server did not behave maliciously. One prior technique is to use a summation tree (roth2019honeycrisp, ), where the server explicitly shows its work aggregating updates across devices in a tree form; the devices then collectively check nodes of this tree. This checking, in turn, adds overhead to the devices. Aero addresses this tension between privacy and overhead by leveraging the sampling characteristic of DP-FedAvg and similar algorithms: the total number of devices that participate in the system (e.g., one billion) is much larger than the ones that are sampled to contribute updates. Leveraging this characteristic, Aero employs multiple, finer-grained summation trees (rather than a monolithic tree) to massively divide the checking work across the large device pool (§4.3). Aero further optimizes how each device verifies the nodes of the tree using a technique called polynomial identity testing (§4.3). The aforementioned is just one example of optimization; Aero uses multiple throughout its architecture (§3.2) and design (§4).
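The summation-tree idea described above can be illustrated with a minimal sketch. It is not Aero's or Honeycrisp's actual protocol: plain integers stand in for ciphertexts, commitments and the bulletin board are omitted, and the function names `build_sum_tree` and `check_node` are hypothetical. It only shows why publishing every partial sum lets any device check one node independently.

```python
def build_sum_tree(leaves):
    """Server side: return the tree as a list of levels, leaves first, root last."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Pair up adjacent nodes; an unpaired last node is carried up unchanged.
        nxt = [sum(prev[i:i + 2]) for i in range(0, len(prev), 2)]
        levels.append(nxt)
    return levels


def check_node(levels, level, index):
    """Device side: verify that one non-leaf node equals the sum of its children."""
    children = levels[level - 1][2 * index: 2 * index + 2]
    return levels[level][index] == sum(children)


# Five device updates (integers standing in for encrypted updates).
tree = build_sum_tree([1, 2, 3, 4, 5])
root = tree[-1][0]

# Every non-leaf node passes its check when the server is honest.
honest = all(check_node(tree, lvl, idx)
             for lvl in range(1, len(tree))
             for idx in range(len(tree[lvl])))

# A server that tampers with one intermediate node is caught by that node's checker.
tree[1][1] += 1
tampered_detected = not check_node(tree, 1, 1)
```

Because each check touches only a node and its children, the work can be spread across many devices, which is the property the multiple finer-grained trees exploit.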

We implemented Aero by extending the FedScale FL system (fedscale-icml22, ) (§5). FedScale supports plain federated learning without differential privacy or protection against a byzantine server. However, it is flexible, allows a programmer to specify models in the PyTorch framework (pytorch, ), and includes a host of models and datasets, with model sizes ranging from 49K to 3.9M, for easy evaluation. Our evaluation of Aero’s prototype (§6) shows that Aero trains models with comparable accuracy to FedScale, in particular, the plain FedAvg algorithm in FedScale (§6.1). Aero also improves overhead relative to Orchard by up to five orders of magnitude, to a point where the overhead is low to moderate. For instance, for a 1.2M parameter CNN on the FEMNIST dataset (reddi2020adaptive, ; cohen2017emnist, ), and for a total population of $10^{9}$ devices where $10^{4}$ contribute updates per round, an Aero device requires 15 ms of cpu time and 3.12 KiB of network transfers per round. Occasionally (with a probability of $10^{-5}$ in a round) this overhead increases when a device contributes updates, to a moderate 13.4 minutes of latency on a six-core processor and 234 MiB in network transfers.

Prior to Aero, one must choose two of the three properties of high accuracy, rigorous privacy guarantee, and low device overhead. With Aero, one can train models in a federated manner with a balance across these properties, at least for a particular class of federated learning algorithms. Thus, Aero’s main contribution is that it finally shows a way for a data analyst to train models while asking the data providers to place no trust in the analyst or their company.

2. Problem and background

This section outlines the problem and gives a short background on Orchard (roth2020orchard, ) that forms both a baseline and an inspiration for Aero.

2.1. Scenario and threat model

We consider a scenario consisting of a data analyst and a large number of mobile devices, e.g., hundreds of millions. The analyst, perhaps at a large company such as Google, is interested in learning a machine learning model over the data on the devices. For instance, the analyst may want to train a recurrent neural network (RNN) to provide auto-completion suggestions for the Android keyboard (hard2018federated, ).

One restriction we place on this scenario is that the training must be done in a federated manner. (We refer the reader to prior work (tan2021cryptgpu, ; knott2021crypTen, ) for a discussion on training in the centralized, non-federated model.) As noted earlier (§1), federated learning proceeds in rounds, where in each round devices download the latest model parameters from a server, generate updates to these parameters locally, and send the updates to the server. The server aggregates the model updates. This repeats until the model achieves a target accuracy.

In this scenario, a malicious server, or even a malicious device, can execute many attacks. For instance, a malicious server can infer the training data of a device from the updates contributed by the device (zhu2019deep, ; melis2019exploiting, ; shokri2017membership, ). Similarly, a malicious device that receives model parameters from the server can execute an inference attack to learn another device’s input.

We assume the strong OB+MC threat model from Orchard. The server is honest-but-curious most of the time but occasionally byzantine (OB), while the devices are mostly correct (MC), but a small fraction can be malicious. The rationale behind the occasionality of the server’s maliciousness is that the server’s operator, e.g., Google, is reputed and subject to significant scrutiny from the press and the users, and thus unlikely to be byzantine for long. However, it may occasionally come under attack, e.g., from a rogue employee. The rationale behind the smallness of the fraction of malicious devices is that with billions of devices, it is unlikely that an adversary will control more than a small percentage. For instance, for a billion devices, only controlling 3% is already significantly larger than a large botnet. We further assume that a configurable percentage of honest devices may be offline during any given round of training.

2.2. Goals

Under the OB+MC threat model, we want our system to meet the following goals.

Privacy. It must guarantee the gold standard definition of privacy, i.e., differential privacy (DP) (dwork2011firm, ; dwork2006calibrating, ; dwork2014algorithmic, ; abadi2016deep, ). Informally, a system offers DP for model training if the probability of learning a particular set of model parameters is (approximately) independent of whether a device’s input is included in the training. This means that DP prevents inference attacks (Naseri2022LocalAC, ) where a particular device’s input is revealed, as models are (approximately) independent of its input.
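For reference, the informal statement above corresponds to the standard $(\epsilon, \delta)$-differential privacy definition from the cited literature. In the sketch below, the symbols $\mathcal{A}$ (the randomized training procedure), $D$ and $D'$ (two collections of device inputs that differ in one device's data), and $O$ (any set of possible output models) are notation introduced here for illustration, not taken from the paper.

```latex
\Pr[\mathcal{A}(D) \in O] \;\le\; e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in O] + \delta
\qquad \text{for all neighboring } D, D' \text{ and all output sets } O .
```

Smaller $\epsilon$ and $\delta$ mean the two output distributions are closer, so the learned model reveals less about whether any single device participated.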

Accuracy. During periods when the server or the devices that contribute in a round are not byzantine, our system must produce models with accuracy comparable to models trained via plain federated learning. That is, we want the impact of differential privacy to be low. Further, we want the system to mitigate a malicious device’s impact on accuracy and prevent it from supplying arbitrary updates.

Efficiency and scalability. We want the system to support models with a large number of parameters while imposing a low to moderate device-side overhead. For the former, a reference point is the android keyboard auto-completion model (an RNN) with 1.4M parameters (hard2018federated, ). For the device overhead, if a device participates regularly in training, e.g., in every round, then it should incur no more than a few seconds of cpu and a few MiBs in network transfers per round. However, we assume that devices can tolerate occasional amounts of additional work, contributing tens of minutes of cpu and a few hundred MiBs in network transfers.

2.3. Possible solution approaches

Meeting all of the goals described above is quite challenging. For illustration, consider the following solution approaches.

Local differential privacy. One option to guarantee differential privacy is to pick a federated learning algorithm that incorporates local differential privacy (LDP) (duchi2013local, ). In LDP-based federated learning, devices add statistical noise to their updates before uploading them to the server. The added noise protects updates against a malicious server which now cannot execute an inference attack, but LDP significantly degrades model accuracy relative to plain federated learning as every device must add noise. For instance, the LDP-FL (ijcai2021-217, ) system trains a VGG model over CIFAR10 with 10% accuracy compared to 62% with plain federated learning.

Trusted server. One alternative is to use central differential privacy (abadi2016deep, ), where a central entity adds a smaller amount of noise to device updates to ensure differential privacy. This approach mitigates the accuracy issue, but provides weak to no privacy as the central entity sees devices’ updates.
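The accuracy gap between the local and central approaches can be made concrete with a short calculation. The sketch assumes, for illustration only, that both approaches use Gaussian noise with the same per-contribution standard deviation `sigma`: under LDP every one of `n_devices` adds its own independent noise, whereas under central DP the noise is added once to the sum. Real LDP mechanisms differ in detail, and the function name is hypothetical.

```python
import math


def aggregate_noise_std(sigma, n_devices, local):
    """Standard deviation of the total noise in the summed update.

    local=True  : each device adds independent N(0, sigma^2) noise (LDP-style),
                  so variances add up across devices.
    local=False : a single N(0, sigma^2) sample is added to the sum (central DP).
    """
    if local:
        return sigma * math.sqrt(n_devices)
    return sigma


# With 10,000 contributing devices, the summed LDP noise is 100x larger.
ldp_std = aggregate_noise_std(1.0, 10_000, local=True)
central_std = aggregate_noise_std(1.0, 10_000, local=False)
ratio = ldp_std / central_std
```

Under these assumptions the noise in the aggregate grows with the square root of the number of contributors, which is one way to see why LDP costs accuracy.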

Server-side secure multiparty computation (MPC). One way to reduce trust in the central server is to break it down into multiple non-colluding pieces, e.g., three servers, that run in separate administrative domains. Then, one would run a secure multi-party computation protocol (yao1982protocols, ; goldreich2019play, ) among these servers such that they holistically perform the required computation (noise generation, addition, etc.) while no individual server sees the input or intermediate state of the computation. The problem is that we must still put significant trust in the server—that an adversary cannot compromise, say, two administrative domains.

Large-scale MPC. One can remove trust in the server by instead running MPC among the devices (essentially the devices perform the server’s work). The problem now shifts to efficiency and scalability: general-purpose MPC protocols are expensive and do not scale well with the number of participants (scaleMamba, ; damgaard2012multiparty, ). Indeed, scaling MPC to a few hundred or thousand participants is an active research area (gordon2021more, ; ben2021large, ), let alone hundreds of millions of participants.

State-of-the-art: Orchard.

Figure 1. An overview of Orchard (roth2020orchard, ). $\Delta_{k}$ denotes the $k$-th device’s update. The superscript $t$ denotes the round number. Orchard runs the four phases of setup, generate, add, and release for every round.

Orchard (roth2020orchard, ) takes a middle ground. It runs small(er)-scale MPC among devices while assuming an (occasionally) byzantine server. In particular, it forms a committee of a few tens of devices picked randomly from the entire population of devices; this committee then runs MPC among its members. Figure 1 shows an overview of Orchard. Orchard supports the full-batch gradient descent algorithm where all devices contribute updates in a round. For every round, Orchard runs the following four phases.

In the setup phase, Orchard samples the committee. Its members use MPC to generate keys for two cryptographic primitives: additive homomorphic encryption (AHE) (rivest1978data, ; fan2012somewhat, ) and zero-knowledge proof (ZK-proof) (groth16on, ; goldwasser2019knowledge, ).

In the generate phase, devices download the latest model parameters from the server. After local training, they encrypt their model updates (denoted by $\Delta$ in Figure 1) and generate a ZK-proof to prove to the server that the ciphertexts are well-formed and the model updates are bounded. The encryption hides the updates from the byzantine server and the proof limits the impact of malicious devices (they cannot supply arbitrary updates and thus ruin the model accuracy).

In the add phase, the server homomorphically adds the encrypted updates. The server also generates proof that it performed the addition correctly so that all devices can collaboratively verify the addition. This verification is necessary to prevent a byzantine server from launching subtle attacks to break DP. (We will discuss these attacks further in §4.3.)

In the release phase, the committee from the setup phase uses MPC to decrypt the output ciphertexts from the add phase. The committee also generates and adds the DP noise to the output, before releasing it, to guarantee (central) DP.
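The add phase relies on the additive property of AHE: combining two ciphertexts yields a ciphertext of the sum of the plaintexts. The toy sketch below illustrates that property using the Paillier cryptosystem with tiny primes. Paillier is a swapped-in example chosen for brevity; it is not the scheme the paper cites for Orchard or Aero, and the parameters here offer no security.

```python
import math
import random

# Toy key material: two small primes (illustrative only, not secure).
P, Q = 1009, 1013
N = P * Q                     # public key
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1), secret
MU = pow(LAM, -1, N)          # modular inverse of LAM mod N (Python 3.8+)


def encrypt(m, rng=random):
    """Encrypt an integer 0 <= m < N using generator g = N + 1."""
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    # (1 + N)^m mod N^2 equals 1 + m*N mod N^2.
    return ((1 + m * N) % N2) * pow(r, N, N2) % N2


def add(c1, c2):
    """Homomorphic addition: multiply ciphertexts modulo N^2."""
    return c1 * c2 % N2


def decrypt(c):
    """Recover the plaintext with the secret values LAM and MU."""
    return ((pow(c, LAM, N2) - 1) // N) * MU % N


# The party doing the addition never sees 20 or 22, only ciphertexts.
total_ct = add(encrypt(20), encrypt(22))
total = decrypt(total_ct)
```

In Orchard the decryption step is what the committee performs inside MPC during the release phase, after adding the DP noise.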

The challenge with Orchard is that even though it uses MPC at a smaller scale, the MPC overhead is still high. First, in the setup phase, committee devices generate fresh keys for each round, and generating one AHE key pair inside general-purpose MPC requires $\approx$180 seconds of cpu and 1 GiB of network transfers. Second, the overhead of the add phase to verify the server’s work grows linearly with the model size and becomes impractical as soon as the model has a few hundred thousand parameters. Third, in the release phase, committee devices decrypt ciphertexts and generate DP noise inside general-purpose MPC, costing, for example, $\approx$2600 seconds of cpu and $\approx$38 GiB of network transfers per device for a model with just 4K parameters.

In general, providing high accuracy, differential privacy, and device efficiency simultaneously in a threat model where there is no trusted party proves incredibly challenging.

3. Overview of Aero

The high-level idea in Aero is to focus on a specific type of federated learning algorithms comprising DP-FedAvg (mcmahan2017learning, ), DP-FedSGD (mcmahan2017learning, ), and DP-FTRL (kairouz2021practical, ). These algorithms have similar characteristics; for instance, they all sample noise for differential privacy from a Gaussian distribution. To keep Aero easier to explain and understand, we take the most popular DP-FedAvg as the canonical algorithm and describe Aero in its context.

3.1. DP-FedAvg without amplification



Main:
 1:  parameters
 2:      device selection probability $q \in (0, 1]$
 3:      DP noise scale $z$
 4:      total # of devices $W$
 5:      clipping bound on device updates $S$
 6:  Initialize model $\theta^{0}$, DP privacy budget accountant $\mathcal{M}$
 7:  for each round $t = 0, 1, 2, \ldots$ do
 8:      $C^{t} \leftarrow$ (sample users with probability $q$)
 9:      for each user $k \in C^{t}$ do
10:          $\Delta_{k}^{t} \leftarrow \textsc{UserUpdate}(k, \theta^{t}, S)$
11:      // Add updates and Gaussian DP noise with $\sigma = zS$
12:      $\Delta^{t} \leftarrow \sum_{k} \Delta_{k}^{t} + \mathcal{N}(0, I\sigma^{2})$
13:      $\theta^{t+1} \leftarrow \theta^{t} + (\Delta^{t} / (qW))$  // Update model
14:      Update $\mathcal{M}$ based on noise scale $z$ and parameter $q$

UserUpdate($k$, $\theta^{0}$, $S$):
15:  parameters $B, E, \eta$  // $\eta$ is learning rate
16:  $\theta \leftarrow \theta^{0}$
17:  for each local epoch $i$ in $1$ to $E$ do
18:      $\mathcal{B} \leftarrow$ ($k$’s data split into size $B$ batches)
19:      for batch $b \in \mathcal{B}$ do
20:          $\theta \leftarrow \theta - \eta \nabla \ell(\theta; b)$  // $\ell$ is loss fn. (model err.)
21:          $\theta \leftarrow \theta^{0} + Clip(\theta - \theta^{0}, S)$
22:  return $\Delta_{k} = \theta - \theta^{0}$  // Already clipped

Figure 2. Pseudocode for the DP-FedAvg algorithm. $Clip(\cdot, S)$ scales its input vector such that its norm (Euclidean distance from the origin) is less than $S$. $\mathcal{M}$ is the privacy budget accountant of Abadi et al. (abadi2016deep, ) that tracks the values of the DP parameters $\epsilon$ and $\delta$.
Figure 3. An overview of Aero’s architecture and the four phases of its protocol.

DP-FedAvg proceeds in discrete rounds (Figure 2). In each round $t$, it samples a small subset of user devices using a probability parameter $q$ (line 8), and asks the sampled devices to provide updates to the global model parameters (line 10). The devices locally generate the updates before clipping them by a value $S$ and uploading them (line 21); this clipping is necessary for differential privacy and it bounds the norm (sensitivity) of a device’s update. DP-FedAvg then aggregates these updates (line 12) and (separately) adds noise sampled from a Gaussian distribution. The standard deviation of the Gaussian distribution depends on a noise scale parameter $z$ and the clipping bound $S$; both are input parameters for DP-FedAvg. Finally, DP-FedAvg updates a privacy accountant $\mathcal{M}$ that computes, based on the noise scale $z$ and sampling probability $q$, two parameters $\epsilon$ and $\delta$ associated with differential privacy (line 14). These parameters capture the strength of the guarantee: how much the model parameters learned after a round vary depending on a device’s input. A lower value of $\delta$ and $\epsilon$ is desirable, and the literature recommends ensuring that $\epsilon$ stays close to or below $1$, and $\delta$ is less than $1/W$, where $W$ is the total number of devices (mcmahan2017learning, ).
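The aggregation step of Figure 2 (lines 11 to 13) can be sketched in plain Python. This is a minimal illustration under stated assumptions: local training is elided and the per-device updates are passed in directly, the privacy accountant is omitted, and the names `clip` and `dp_fedavg_round` are hypothetical. It is a plaintext, single-machine sketch and involves none of Aero's cryptography.

```python
import math
import random


def clip(vec, S):
    """Scale vec down so that its Euclidean norm is at most S."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= S:
        return list(vec)
    return [x * S / norm for x in vec]


def dp_fedavg_round(theta, updates, q, W, z, S, rng):
    """One round: sum clipped updates, add Gaussian noise, rescale by q*W."""
    sigma = z * S                      # noise std is fixed before training starts
    dim = len(theta)
    total = [0.0] * dim
    for delta in updates:
        clipped = clip(delta, S)       # bounds each device's contribution
        for i in range(dim):
            total[i] += clipped[i]
    noisy = [total[i] + rng.gauss(0.0, sigma) for i in range(dim)]
    return [theta[i] + noisy[i] / (q * W) for i in range(dim)]


# Two sampled devices out of W = 4 with q = 0.5 (toy values for illustration).
rng = random.Random(0)
theta_next = dp_fedavg_round(
    theta=[0.0, 0.0],
    updates=[[3.0, 4.0], [0.3, 0.4]],
    q=0.5, W=4, z=1.0, S=1.0, rng=rng,
)
```

Note that the noise depends only on `z` and `S`, never on the update values, which is the property the next paragraphs build on.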

DP-FedAvg has three characteristics that are crucial for Aero. The first is the sampling of devices (lines 8 to 10 in Figure 2). For instance, the sampling parameter $q$ could be $10^{-5}$ such that 10,000 out of, say, $10^{9}$ total devices contribute updates in a round. The second characteristic is that the noise is sampled from a Gaussian distribution whose standard deviation $\sigma$ is predetermined (set before the algorithm is run). This is in contrast to other DP algorithms that utilize techniques such as the Sparse Vector Technique (SVT) that generate noise depending on the value of the updates (roth2010interactive, ; dwork2009complexity, ). The third characteristic is averaging of updates: DP-FedAvg simply adds updates and noise (line 12 in Figure 2) rather than combining them using a more complex function. Aero heavily leverages these characteristics.
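The sampling arithmetic in the first characteristic can be checked with a few lines. The sketch assumes each device is selected independently with probability `q`, so the number of contributors follows a binomial distribution; the function name is hypothetical.

```python
import math


def contributors_per_round(W, q):
    """Mean and standard deviation of the number of sampled devices."""
    mean = W * q
    std = math.sqrt(W * q * (1 - q))
    return mean, std


# One billion devices, each selected with probability 1e-5.
mean, std = contributors_per_round(10**9, 1e-5)
```

Under this assumption roughly ten thousand devices contribute per round, with a spread of about a hundred, so almost all of the population is idle in any given round and available for lightweight verification work.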

Finally, we remark that Aero can support DP-FedAvg only without the amplification assumption for DP. This is because the adversary (the byzantine server) can observe all traffic and knows which devices contribute updates for training. In contrast, the amplification assumption requires the server to be oblivious to the contributors, which in turn improves the privacy budget. We leave the addition of expensive oblivious approaches (which hide who is contributing updates besides hiding the updates themselves) to future work.

3.2. Architecture of Aero

Aero borrows two system components from Orchard: an aggregator and a public bulletin board (Figure 3). The aggregator runs server-side inside a data center and therefore consists of one or more powerful machines. Its main role is to combine updates from user devices without learning their content. The bulletin board is an immutable append-only log. The aggregator (which is potentially malicious) and the devices use the bulletin board to reliably broadcast messages and store states across rounds, e.g., the latest values of DP parameters $\epsilon$ and $\delta$. Like Orchard (roth2020orchard, ), Aero assumes that free web services such as Wikipedia, or a public blockchain, could serve as the bulletin board.

Like Orchard, Aero also consists of committees of devices, but instead of a single committee as in Orchard, Aero has three types of committees tailored to the needs of DP-FedAvg (and similar algorithms). A master committee handles system setup, including key generation for cryptographic primitives. A DP-noise committee handles Gaussian noise generation. And multiple decryption committees perform decryption operations to release updates to the global model parameters at the end of a training round. Aero samples each committee afresh each round, dividing the committee workload across the large population of devices.

An architecture with separate committee types is deliberate and crucial. It helps tailor a committee’s protocol to its tasks to significantly improve efficiency. Besides, the use of multiple committees of the same type, i.e., multiple decryption committees, helps Aero scale with model size as each committee works on a subset of model parameters.

Notably, the ability to use separate committee types is possible only because of the specifics of DP-FedAvg. For instance, the fact that Gaussian noise generation does not depend on the value of the updates allows Aero to separate the DP-noise committee from the decryption committees.

3.3. Protocol overview of Aero

To begin training a model, a data analyst supplies input parameters (the model architecture, the initial model parameters, and the other input parameters for DP-FedAvg) to the aggregator. The aggregator then initiates a protocol that proceeds in discrete rounds. In each round, it executes one iteration of the for loop in the Main procedure of DP-FedAvg (line 7 in Figure 2). Each round further consists of the four phases of setup, generate, add, and release (Figure 3).

In the setup phase, the aggregator samples the various committees for the round. The master committee then receives and validates the input parameters, and generates keys for an AHE and a ZK-proof scheme. Aero’s setup phase is similar to Orchard (§2.3) with the key difference that Aero’s master committee uses techniques to reuse keys across rounds rather than generating them fresh for each round using MPC.

Next, in the generate phase, (i) devices select themselves to generate updates for the round, and (ii) the DP-noise committee generates the Gaussian noise for DP. Both types of devices use techniques to perform their work efficiently. For instance, the DP-noise committee generates noise in a distributed manner while avoiding MPC.

Next, in the add phase, the aggregator adds the model updates to the Gaussian noise without learning the plaintext content of either of them. This is done through the use of the AHE scheme. The entire population of devices collectively verifies the aggregator’s work. Again, the key is efficiency for the devices, for which the aggregator and the devices use a new verifiable aggregation protocol.

Finally, in the release phase, each decryption committee receives the secret key for the AHE scheme from the master committee and decrypts a few ciphertexts from the add phase. The key point is that a decryption committee avoids general-purpose MPC by using a specialized decryption protocol.

4. Design of Aero

We now go over the design details of Aero phase-by-phase. The main challenge in each phase is keeping the device overhead low while protecting against the malicious aggregator and the malicious subset of devices. We highlight these challenges, and Aero’s key design choices and techniques.

But before proceeding, we briefly discuss committee formation, which is common to multiple phases. To form committees, Aero uses the sortition protocol from Orchard (which in turn used Algorand’s protocol (gilad2017algorand, )). This protocol relies on a publicly verifiable source of randomness so that the results of the election are verifiable by all devices. At the end of the protocol, the aggregator publishes the list (public keys) of the committee members by putting it on the bulletin board. An important aspect of committee formation is committee size and the number of malicious devices in a committee: provision of a larger number of malicious devices $A$ relative to the committee size $C$ increases costs but ensures higher resiliency. Like Orchard, Aero makes a probabilistic argument (roth2019honeycrisp, ) to select $C$ and $A$ such that the probability of the number of malicious devices exceeding $A$ is small. For example, if the overall population contains up to $f = 3\%$ malicious devices (§2.1), then the probability that a randomly sampled subset of $C = 45$ devices contains more than $A = 2C/5 = 18$ malicious devices is less than $9.6 \cdot 10^{-14}$.
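The flavor of this probabilistic argument can be reproduced with a binomial tail computation. The sketch assumes that each committee member is malicious independently with probability `f`, which is a simplification introduced here; the cited argument may use a different model or a looser bound, so the exact value need not match the figure quoted above. The function name is hypothetical.

```python
from math import comb


def prob_more_than(C, A, f):
    """P[X > A] for X ~ Binomial(C, f): more than A malicious members out of C."""
    return sum(comb(C, i) * f**i * (1 - f)**(C - i) for i in range(A + 1, C + 1))


# Committee of 45 devices, tolerance of 18 malicious members, 3% malicious population.
p_bad_committee = prob_more_than(C=45, A=18, f=0.03)
```

Raising `A` for a fixed `C` shrinks this probability further, at the price of a protocol that must tolerate more malicious members.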

4.1. Setup phase

Much of Aero’s setup phase is similar to Orchard. During this phase, (i) the aggregator samples the master committee, which then (ii) receives and validates inputs for the round (i.e., receives model parameters $\theta^{t}$ for the current round $t$, the device selection probability $q$, noise scale $z$, and clipping bound $S$, and generates new values of the DP parameters $\epsilon,\delta$), and (iii) generates keys for cryptographic primitives (§3.3). We do not focus on the first two pieces to avoid repetition with Orchard, but include them in the supplementary material for completeness (Appendix §A.2). Instead, the key challenge in Aero is the overhead of key generation.

Recall (§2.3) that Orchard uses MPC among the master committee members to correctly run the key generation function and ensure that even if the malicious members of the committee collude, they cannot recover the AHE secret key. The overhead of this MPC is high: $\approx$1 GiB of network transfers and 180 seconds of cpu time per committee device. How can this overhead be reduced?

One idea (roth2021mycelium, ) is to reuse keys across rounds rather than generate them afresh for each round. Indeed, this is what Aero does: the master committee in round 1 generates the keys and shares them with the committee for the next round, and this committee then shares the keys with the committee for the third round, and so on. But one has to be careful.

Consider the following attack. Say that the malicious aggregator receives a victim device $k$’s update $Enc(pk,\Delta_{k}^{t})$ in round $t$. Then, in the next round $t+1$, the aggregator colludes with a malicious device in the overall population to use $Enc(pk,\Delta_{k}^{t})$ as the device’s update. This attack enables the aggregator to violate differential privacy as the victim device’s input does not satisfy the required clipping bound $S$ in round $t+1$ due to its multiple copies (§3.1). Orchard does not suffer from this attack as it generates fresh keys in each round: the ciphertext for round $t$ decrypts to a random message with round $t+1$’s key. However, prior work that reuses keys in this manner (in particular, Mycelium (roth2021mycelium, )) does suffer from this attack.

Thus, Aero must apply the reuse-of-keys idea with care. Aero adjusts the generate and add phases of its protocol (§3.3) to prevent the aforementioned attack. We are not in a position yet to describe these changes, but we will detail them shortly when we describe these other phases (§4.2, §4.3). Meanwhile, the changes in the setup phase relative to Orchard are the following: for the AHE secret key $sk$, Aero implements an efficient verifiable secret redistribution scheme (gupta2006extended, ; roth2021mycelium, ) such that committee members at round $t+1$ securely obtain the relevant shares of the key from the committee at round $t$. For the public keys (AHE public key $pk$, and both the ZK-proof public proving and verification keys), the committee for round $t$ signs a certificate containing these keys and uploads it to the bulletin board, and the committee for round $t+1$ downloads it from the board.
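The idea behind key resharing can be illustrated with plain Shamir share redistribution. This is a minimal sketch and not Aero’s implementation: it omits the verifiability checks of the cited redistribution scheme, uses a toy prime field, and the helper names (`make_shares`, `redistribute`, `reconstruct`), the threshold of 3, and the five-member committees are all hypothetical.

```python
import secrets

P = 2**127 - 1  # a Mersenne prime; illustrative field modulus, not Aero's


def make_shares(secret, threshold, ids):
    """Shamir-share `secret` with a random polynomial of degree threshold - 1."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(threshold - 1)]
    return {i: sum(c * pow(i, e, P) for e, c in enumerate(coeffs)) % P for i in ids}


def lagrange_at_zero(ids):
    """Coefficients lam[i] such that f(0) = sum(lam[i] * f(i)) over `ids`."""
    lam = {}
    for i in ids:
        num, den = 1, 1
        for j in ids:
            if j != i:
                num = num * (-j) % P
                den = den * (i - j) % P
        lam[i] = num * pow(den, -1, P) % P  # modular inverse (Python 3.8+)
    return lam


def reconstruct(shares):
    """Recover the secret from a dict {member id: share}."""
    lam = lagrange_at_zero(list(shares))
    return sum(lam[i] * s for i, s in shares.items()) % P


def redistribute(old_shares, threshold, new_ids):
    """Old members sub-share their own shares; new members combine the pieces.

    The secret itself is never reconstructed at any single party.
    """
    senders = list(old_shares)[:threshold]
    lam = lagrange_at_zero(senders)
    sub = {i: make_shares(old_shares[i], threshold, new_ids) for i in senders}
    return {j: sum(lam[i] * sub[i][j] for i in senders) % P for j in new_ids}


sk = secrets.randbelow(P)                       # stand-in for the AHE secret key
shares_t = make_shares(sk, 3, [1, 2, 3, 4, 5])  # committee at round t
shares_t1 = redistribute(shares_t, 3, [1, 2, 3, 4, 5])  # committee at round t+1
recovered = reconstruct({j: shares_t1[j] for j in (2, 4, 5)})
print(recovered == sk)
```

The new shares lie on a fresh polynomial with the same constant term, so any threshold-sized subset of the round $t+1$ committee can still recover the key while the old shares become useless on their own.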

The savings by switching from key generation to key resharing are substantial for the network, with a slight increase in cpu. While the MPC solution incurs $\approx$1 GiB of network transfers and 180 seconds of cpu time per committee device, key resharing requires 125 MiB and 187 seconds, respectively (§6.2). The cpu is higher because key resharing requires certain expensive field exponentiation operations (gupta2006extended, ).

4.2. Generate phase

Recall from §3.3 that during this phase (i) Aero must pick a subset of devices to generate updates to the model parameters, (ii) the DP-noise committee must generate Gaussian noise for differential privacy, and (iii) both types of devices must encrypt their generated data (updates and noise).

Device sampling for updates. One design choice is to ask the aggregator to sample devices that will contribute updates. The problem with this option is that the (malicious) aggregator may choose the devices non-uniformly; for instance, it may pick an honest device more often than the device should be picked, violating differential privacy. An alternative is to ask the devices to sample themselves with probability $q$ (as required by DP-FedAvg; line 8 in Figure 2). But then a malicious device may pick itself in every round, which would allow it to significantly affect model accuracy.

Aero adopts a hybrid and efficient design in which devices sample themselves but the aggregator verifies the sampling. Let $B^{t}$ be a publicly verifiable source of randomness for round $t$; this is the same randomness that is used in the sortition protocol to sample committees for the round. Then, each device $k$ with public key $\pi_{k}$ computes $PRG(\pi_{k}||B^{t})$, where $PRG$ is a pseudorandom generator. Next, the device scales the PRG output to a value between 0 and 1, and checks if the result is less than $q$. For instance, if the PRG output is 8 bytes, then the device divides this number by $2^{64}-1$. If selected, the device runs the UserUpdate procedure (line 10 in Figure 2) to generate updates for the round. This approach of sampling is efficient as devices only perform local computations.
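The self-sampling check can be sketched in a few lines. This is an illustration under stated assumptions: SHA-256 stands in for the PRG (the text does not name a concrete generator), and the function names `sample_value` and `is_selected` are ours.

```python
import hashlib


def sample_value(device_pk: bytes, round_randomness: bytes) -> float:
    """Map PRG(pk || B^t) to a value in [0, 1].

    SHA-256 is an assumed stand-in for the PRG; the first 8 bytes of its
    output are read as an integer and divided by 2^64 - 1.
    """
    digest = hashlib.sha256(device_pk + round_randomness).digest()
    return int.from_bytes(digest[:8], "big") / (2**64 - 1)


def is_selected(device_pk: bytes, round_randomness: bytes, q: float) -> bool:
    """A device contributes an update in this round iff its value is below q."""
    return sample_value(device_pk, round_randomness) < q


# Both inputs are public, so the aggregator can recompute the same decision.
print(is_selected(b"device-public-key", b"round-7-randomness", 1e-5))
```

Because the decision depends only on the device’s public key and the round’s public randomness, a device cannot select itself more often than the randomness allows, and the aggregator cannot bias the selection.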

Gaussian noise generation. The default option is to make the DP-noise committee generate the noise using MPC, but as noted several times in this paper, this option is expensive. Instead, Aero adapts prior work (truex2019hybrid, ) on distributed Gaussian noise generation. The Gaussian distribution has the property that if an element sampled from $\mathcal{N}(0,a)$ is added to another independently sampled element from $\mathcal{N}(0,b)$, then the sum is a sample of $\mathcal{N}(0,a+b)$ (truex2019hybrid, ; xu2019hybridalpha, ; dwork2006our, ). This works well for the simple case when all $C$ committee members of the DP-noise committee are honest. Given the standard deviation of the Gaussian distribution, $\sigma=z\cdot S$, the devices can independently compute their additive share. That is, to generate samples from $\mathcal{N}(0,I\sigma^{2})$ (line 12 in Figure 2), each committee member can sample its share of the noise from the distribution $\mathcal{N}(0,I\frac{\sigma^{2}}{C})$.

The challenge in Aero is therefore: how do we account for the $A$ malicious devices in the DP-noise committee? These devices may behave arbitrarily and may thus generate either no noise or large amounts of it. Adding unnecessary noise hurts accuracy, not privacy. In contrast, failing to add noise may violate privacy. We thus consider the worst case in which malicious devices fail to add any noise and ask honest devices to compensate. Each honest device thus samples its noise share from the distribution $\mathcal{N}(0,I\frac{\sigma^{2}}{C-A})$. (Footnote 2: Aero can further compensate for honest-but-offline devices. Say, for example, that $B$ of $C-A$ honest devices must be provisioned to be offline. Then, Aero subtracts $B$ from $C-A$ to get the number of honest-but-online devices.)

This algorithm generates noise cheaply without expensive MPC. The downside is that it may generate more noise than necessary, hurting accuracy. To mitigate this risk, we carefully choose the committee size to minimize the ratio of additional noise. Specifically, we choose $C$ to keep the ratio $(C-A)/C$ close to 1. For instance, instead of picking a committee containing a few tens of devices similar to the master committee, we pick a somewhat larger DP-noise committee: $(A,C)=(40,280)$. (Footnote 3: Using a probabilistic argument for committee size selection as before (§4), if $f=3\%$ devices in the overall population are malicious, then the chances of sampling 280 devices with more than 1/7th malicious is $4.1\cdot 10^{-14}$.)
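The noise-share computation and the resulting worst-case noise inflation can be sketched as follows. This is an illustrative sketch: the function names are ours, Python’s `random.gauss` stands in for whatever sampler a real device would use (it is not a cryptographically secure source), and the example value of $\sigma$ is arbitrary.

```python
import math
import random


def share_std(sigma: float, committee_size: int, max_malicious: int) -> float:
    """Std dev of one honest member's share.

    Chosen so that C - A honest shares alone sum to variance sigma^2.
    """
    return sigma / math.sqrt(committee_size - max_malicious)


def noise_share(dim, sigma, committee_size, max_malicious, rng=random):
    """One member's additive share of the noise vector, one sample per parameter."""
    std = share_std(sigma, committee_size, max_malicious)
    return [rng.gauss(0.0, std) for _ in range(dim)]


def worst_case_variance_ratio(committee_size: int, max_malicious: int) -> float:
    """Total variance over target variance if all C members add their share."""
    return committee_size / (committee_size - max_malicious)


# DP-noise committee from the text: (A, C) = (40, 280).
print(worst_case_variance_ratio(280, 40))   # 280 / 240
print(len(noise_share(4, 1.0, 280, 40)))    # a 4-parameter share
```

With $(A,C)=(40,280)$, the variance is inflated by at most a factor of $280/240 = 7/6$ when every member is honest, which is why a larger committee with a small malicious fraction keeps the accuracy penalty modest.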

Encryption and ZK-proofs. Once the devices generate their updates or shares of the Gaussian noise, they encrypt the content using the public key of the AHE scheme to prevent the aggregator from learning the content. Further, they certify using a ZK-proof scheme that the encryption is done correctly and the data being encrypted is bounded by the clipping value $S$ (so that malicious devices may not supply arbitrary updates). This encryption and ZK-proof generation is the same as in Orchard, but Aero requires additional changes. Recall from the setup phase that Aero must ensure a ciphertext generated in a round is used only in that round, to prevent complications due to reuse of keys (§4.1). To do this, each device concatenates the round number $t$ (as a timestamp) to the plaintext message before encrypting it. Further, the ZK-proof includes additional constraints that prove that a prefix of the plaintext message equals the current round number.

[Figure 4 body not recovered: the add phase protocol of Aero, with a commit step, an add step, and a verify step performed by every device in the system.]

Figure 4. Aero’s verifiable aggregation. This description does not include the PIT optimization (described in text) that applies to line 11.

4.3. Add phase

Recall that during the add phase (i) the aggregator adds ciphertexts containing device updates to those containing shares of Gaussian noise, (ii) the devices collectively verify the aggregator’s addition (§3.3).

This work during the add phase has subtle requirements. So first, we expand on these requirements while considering a toy example with two honest devices and one malicious device. The first honest device’s input is $\textsc{Enc}(pk,\Delta)$, where $\Delta$ is its update, while the second honest device’s input is $\textsc{Enc}(pk,n)$, where $n$ is the Gaussian noise. For this toy example, first (R1), the aggregator must not omit $\textsc{Enc}(pk,n)$ from the aggregate as the added noise would then be insufficient to protect $\Delta$ and guarantee DP. Second (R2), the aggregator must not let the malicious device use $\textsc{Enc}(pk,\Delta)$ as its input. Relatedly, the aggregator itself must not modify $\textsc{Enc}(pk,\Delta)$ to $\textsc{Enc}(pk,k\cdot\Delta)$, where $k$ is a scalar, using the additively homomorphic properties of the encryption scheme. The reason is that these changes can violate the clipping requirement that a device’s input is bounded by $S$ (e.g., $2\cdot\Delta$ may be larger than $S$). And, third (R3), the aggregator must ensure that the above (the malicious device or the aggregator copying a device’s input) does not happen across rounds, as recall that Aero uses the same encryption key in multiple rounds (§4.1).

One option to satisfy these requirements is to use the verifiable aggregation protocol of Orchard (roth2019honeycrisp, ) that is based on summation trees. The main challenge is resource costs. Briefly, in this protocol, the aggregator arranges the ciphertexts to be aggregated as leaf nodes of a tree, and publishes the nodes of the tree leading to the root node. For example, the leaf nodes will be $\textsc{Enc}(pk,\Delta)$ and $\textsc{Enc}(pk,n)$, and the root node will be $\textsc{Enc}(pk,\Delta)+\textsc{Enc}(pk,n)$, for the toy example above. Then, devices in the entire population inspect parts of this tree: download a few children and their parents and check that the addition is done correctly, that the leaf nodes haven’t been modified by the aggregator, and the leaf nodes that should be included are indeed included. The problem is that Orchard requires a device to download and check about $3\cdot s$ nodes of the tree (roth2019honeycrisp, ; roth2020orchard, ), where $s$ is a configurable parameter whose default value is six. But for realistic models, each node is made of many ciphertexts (e.g., the 1.2M parameter CNN model requires $\ell=293$ ciphertexts), and 18 such nodes add to 738 MiB.

Aero improves this protocol using two ideas. First, Aero observes that the entire population of devices that must collectively check the tree is massive (e.g., $10^{9}$). Besides, although the tree has bulky nodes with many ciphertexts, the total number of nodes is not high due to sampling (e.g., only 10,000 devices contribute updates in a round). Thus, Aero moves away from one summation tree with “bulky” nodes, to $\ell$ summation trees with “small” nodes, where $\ell$ is the number of ciphertexts comprising a device’s update (e.g., $\ell=293$ for the 1.2M parameter model). Then, each device probabilistically selects a handful of trees, and checks a few nodes within each selected tree.

Second, Aero optimizes how a device tests whether the sum of two ciphertexts equals a third ciphertext. Aero recognizes that ciphertexts can be expressed as polynomials and the validity of their addition can be checked efficiently using a technique called polynomial identity testing (PIT) (schwartz1980fast, ; zippel1979probabilistic, ). Roughly, PIT says that the sum of polynomials can be checked by evaluating them at a random point and checking the sum of these evaluations. Using PIT, Aero replaces the ciphertexts at the non-leaf nodes of the summation trees with their much smaller evaluations at a random point.

We now describe Aero’s protocol in detail, first without the PIT optimization, and then with it.

Incorporating finer-grained summation trees. Aero’s protocol has three steps: commit, add, and verify (Figure 4). In the commit step, all devices commit to their ciphertexts before submitting them to the aggregator (line 13 in Figure 4). The aggregator publishes a Merkle tree of these commitments to the bulletin board. Committing before submitting ensures that a malicious device cannot copy and submit an honest device’s input (requirement R2 above). Similarly, this design ensures that the aggregator cannot change a device’s input (again requirement R2).

In the add step, the aggregator adds the ciphertexts via summation trees. Specifically, if device updates have $\ell$ ciphertexts, the aggregator creates $\ell$ summation trees, one per ciphertext (line 6 in Figure 4). The leaf vertices of the $j$-th tree are the $j$-th ciphertexts in the devices’ inputs, while each parent is the sum of its children ciphertexts, and the root is the $j$-th ciphertext in the aggregation result. The aggregator publishes the vertices of the summation trees on the bulletin board (line 7 in Figure 4), allowing an honest device to check that its input is not omitted (requirement R1 above).
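The structure of the $\ell$ summation trees can be sketched with a toy model. This is not Aero’s code: integers modulo a toy modulus stand in for AHE ciphertexts (so that ciphertext addition is ordinary modular addition), commitments and ZK-proofs are omitted, and the names `build_tree`, `build_all_trees`, and `check_node` are ours.

```python
Q = 2**61 - 1  # toy modulus; integers mod Q stand in for AHE ciphertexts


def build_tree(leaves):
    """Return the tree as a list of levels, leaves first.

    Each parent is the sum of its (one or two) children.
    """
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = [
            (prev[i] + (prev[i + 1] if i + 1 < len(prev) else 0)) % Q
            for i in range(0, len(prev), 2)
        ]
        levels.append(nxt)
    return levels


def build_all_trees(device_inputs):
    """One tree per ciphertext position j.

    device_inputs[k][j] is device k's j-th ciphertext.
    """
    ell = len(device_inputs[0])
    return [build_tree([inp[j] for inp in device_inputs]) for j in range(ell)]


def check_node(levels, level, index):
    """Verifier's spot check: does a non-leaf node equal the sum of its children?"""
    children = levels[level - 1][2 * index : 2 * index + 2]
    return levels[level][index] == sum(children) % Q


# Three devices, each with ell = 3 "ciphertexts".
inputs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
trees = build_all_trees(inputs)
print([tree[-1][0] for tree in trees])  # roots: the aggregate, one per position
print(check_node(trees[1], 1, 0))       # spot check of one non-leaf node
```

Splitting one tree with bulky nodes into $\ell$ trees with single-ciphertext nodes lets each verifier download only the few small nodes it actually checks.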

In the verify step, each device in the system selects $q\cdot\ell$ summation trees, where $q$ is the device sampling probability (line 9 in Figure 4), and checks $s$ leaf nodes and $2s$ non-leaf nodes in each tree. ($s=6$ in our implementation.) Specifically, the device checks that the leaf node ciphertexts are committed to in the commit step (requirement R2), and the ZK-proofs of the ciphertexts are valid, e.g., the first part of the plaintext message in the ciphertexts equals the current round number (requirement R3). For the non-leafs, the device checks that they sum to their children.

Incorporating PIT. Checking the non-leaf vertices is a main source of overhead for the protocol above. The reason is that even though each non-leaf is a single ciphertext, this ciphertext is large: for the quantum-secure AHE scheme Aero uses (§5), a ciphertext is 131 KiB, made of two polynomials of $2^{12}$ coefficients each, where each coefficient is 16 bytes.

As mentioned earlier, Aero reduces this overhead by using polynomial identity testing (PIT) (schwartz1980fast, ; zippel1979probabilistic, ). This test says that given a $d$-degree polynomial $g(x)$ whose coefficients are in a field $\mathbb{F}$, one can test whether $g(x)$ is a zero polynomial by picking a number $r\in\mathbb{F}$ uniformly and testing whether $g(r)==0$. This works because a nonzero $d$-degree polynomial has at most $d$ solutions to $g(x)==0$ and $d$ is much less than $|\mathbb{F}|$, so a nonzero polynomial passes the test with probability at most $d/|\mathbb{F}|$.

Using PIT, Aero replaces the ciphertexts at the non-leafs with their evaluations at a random point $r$. Then, during the “Verify” step, a device checks (line 11 in Fig. 4) whether these evaluations (rather than ciphertexts) add up. Thus, instead of downloading three ciphertexts with $2\cdot 2^{12}$ field elements each, a device downloads 2 elements of $\mathbb{F}$ per ciphertext.
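The PIT-based check can be illustrated on a single polynomial. This is a sketch under stated assumptions: a toy 61-bit prime field replaces the AHE scheme’s coefficient field, one polynomial stands in for each of the two polynomials of a ciphertext, and the function names are ours.

```python
import secrets

P = 2**61 - 1  # toy prime field; the AHE scheme's coefficient field differs


def evaluate(coeffs, r):
    """Horner evaluation at r mod P (lowest-degree coefficient first)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * r + c) % P
    return acc


def add_poly(a, b):
    """Coefficient-wise sum, i.e., the homomorphic addition being verified."""
    return [(x + y) % P for x, y in zip(a, b)]


def pit_check(left_eval, right_eval, parent_eval):
    """Verifier side: compare single field elements instead of full polynomials."""
    return (left_eval + right_eval) % P == parent_eval


n = 2**12  # number of coefficients per polynomial, as in the text
left = [secrets.randbelow(P) for _ in range(n)]
right = [secrets.randbelow(P) for _ in range(n)]
parent = add_poly(left, right)
r = secrets.randbelow(P)  # in Aero, r is published by the master committee

ok = pit_check(evaluate(left, r), evaluate(right, r), evaluate(parent, r))

# A parent whose constant coefficient was altered fails the check for every r;
# other alterations escape with probability at most d / |F|.
bad_parent = [(parent[0] + 1) % P] + parent[1:]
caught = not pit_check(evaluate(left, r), evaluate(right, r), evaluate(bad_parent, r))
print(ok, caught)
```

The verifier thus handles one field element per polynomial (two per ciphertext) rather than $2\cdot 2^{12}$ coefficients, at the cost of a small, quantifiable error probability.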

A requirement for PIT is generation of $r$, which must be sampled uniformly from the coefficient field. For this task, Aero extends the master committee to publish an $r$ to the bulletin board in the add step, using a known protocol to securely and efficiently generate a random number (damgard2006unconditionally, ; damgaard2012multiparty, ).

4.4. Release phase

During the release phase, Aero must decrypt the $\ell$ ciphertexts from the add phase, i.e., the $\ell$ root nodes of the $\ell$ summation trees. The default, but expensive, option is to use MPC among the members of the decryption committees.

Aero addresses this efficiency challenge using known ideas and applying them; i.e., Aero’s contribution in this phase is not new techniques, but the observation that existing ideas can be applied. Nevertheless, applying these ideas requires some care and work.

First, recall that Aero has multiple decryption committees (§3.2). Naturally, to reduce per-device work, each committee decrypts a few of the $\ell$ ciphertexts. A design question for Aero is how many committees it should use. On the one hand, more committees are desirable (best case is $\ell$). However, more committees also mean that each has to be larger to ensure that none of them samples more than $A$ out of $C$ malicious devices, breaking the threshold assumptions of a committee. Meanwhile, a larger committee means more overhead. In practice (§6), we take a middle ground and configure Aero to use ten decryption committees.

Second, Aero reduces each committee’s work relative to the MPC baseline, using a fast distributed decryption protocol to decrypt the ciphertexts (chen2019efficient, ). The use of this protocol is possible as a decryption committee’s task is only that of decryption, given how we formed and assigned work to different types of committees (§3.2). This fast protocol requires the committee devices to mainly perform local computations with little interaction with each other. The caveat is that for this protocol to be applicable, the committee members must know an upper bound on the number of additive homomorphic operations on the ciphertexts they are decrypting. (Footnote 4: This bound is needed to add a “smudging noise” to the committee’s decryption output to ensure that the output does not leak information on the inputs to the aggregation (asharov2012multiparty, ).) Fortunately, in Aero’s setting this bound is known: it is the maximum number of devices whose data the aggregator adds in the add phase ($M_{max}$ in Figure 4). The benefit of distributed decryption (and moving work outside MPC) meanwhile is substantial.

4.5. Privacy proof

Aero’s protocol provides the required differential privacy guarantee (§2.2). The supplementary material (Appendix A.1) contains a proof. But, briefly, the key reasons are that (i) in the generate phase, honest devices sample themselves to make sure that they are not sampled more than expected, (ii) the verifiable aggregation protects these devices’ input, and (iii) key resharing and fast decryption protocols keep secret keys hidden.

5. Prototype implementation

Figure 5. An overview of Aero’s implementation.

We implemented a prototype of Aero atop FedScale (fedscale-icml22, ), which is a scalable system for federated learning capable of handling a large number of devices. By default, FedScale supports algorithms such as FedAvg and FedSGD (without differential privacy). Further, it allows a data analyst to specify the model using the popular PyTorch framework.


| Dataset | Model | Size | FedScale | Aero |
|---|---|---|---|---|
| FEMNIST (cohen2017emnist, ) | LeNet (lecun1995learning, ) | 49K | 75% | 74% |
| FEMNIST | CNND (reddi2020adaptive, ) | 1.2M | 78% | 68% |
| FEMNIST | CNNF (mcmahan2017communication, ) | 1.7M | 79% | 68% |
| FEMNIST | AlexNet (krizhevsky2012imagenet, ) | 3.9M | 78% | 40% |
| CIFAR10 (Krizhevsky09learningmultiple, ) | LeNet (lecun1995learning, ) | 62K | 48% | 48% |
| CIFAR10 | ResNet20 (he2016deep, ) | 272K | 59% | 48% |
| CIFAR10 | ResNet56 (he2016deep, ) | 855K | 54% | 35% |
| Speech (warden2018speech, ) | MobileNetV2 (howard2017mobilenets, ) | 2M | 57% | 4% |

Figure 6. Test accuracy for different models after 480 rounds of training and differential privacy parameters $(\epsilon,\delta)$ set to (5.03, $W^{-1.1}$). As shown later, increasing $\epsilon$ can recover the accuracy loss.
Figure 7. Test accuracy versus rounds for Aero and FedScale for (a) the 1.7M parameter CNNF model and (b) the 2M parameter MobileNetV2 model.

Our Aero prototype extends FedScale in the following way (Figure 5). First, it extends the programming layer of FedScale with Opacus (opacus, ), which is a library that adjusts a PyTorch model to make it suitable for differentially private federated learning; for instance, Opacus replaces the batch normalization layer of a neural network with group normalization. Second, our prototype extends the device-side code of FedScale with additional components needed for the various committees and phases in Aero (key resharing, Gaussian noise generation, verifiable aggregation, and distributed decryption; §4). FedScale is written in Python while the code we added is in Rust; thus, we use PyO3 to wrap the Rust code with Python interfaces. Third, our prototype extends the FedScale server-side code with Aero’s aggregator code and the code to coordinate the various phases of Aero. In total, we added $\approx$4,300 lines of Rust to FedScale.

Our prototype configures the cryptographic primitives for 128-bit security. For additively homomorphic encryption, we use the BFV encryption scheme. We set the polynomial degree in BFV to $2^{12}$ and use the default parameters from Microsoft SEAL (seal, ). For ZK-proofs, we use ark_groth16 (arkgroth16, ), which implements the zkSNARK of Jens Groth (groth16on, ).

6. Evaluation

We evaluate Aero in two parts. First, we compare it with plain federated learning, specifically, the FedScale system. This comparison sheds light on the cost of privacy both in terms of model accuracy and resource consumption on the devices. Second, we compare Aero to Orchard, which is the state-of-the-art system for training models in a federated manner in the same threat model as Aero. This comparison helps understand the effectiveness of Aero’s techniques in reducing overhead. Our main results are the following:

  • Aero can train models with comparable accuracy to FedScale (plain federated learning). For instance, for a CNN model over the FEMNIST dataset, Aero produces a model with 79.2% accuracy with DP parameter $\epsilon=5.53$, relative to 79.3% in FedScale, after 480 rounds of training.

  • Aero’s cpu and network overhead is low to moderate: for a 1.2M parameter model, devices spend 15 ms of cpu and 3.12 KiB of network transfers most of the time, and occasionally (with a probability of $10^{-5}$ in a round) 13.4 min. of processor time and 234 MiB of network transfers.

  • Aero’s techniques improve over Orchard by up to $2.3\cdot 10^{5}\times$.

Testbed. Our testbed has machines of type c5.24xlarge on Amazon EC2. Each machine has 96 vcpus, 192 GiB RAM, and 25 Gbps network bandwidth. We use a single machine for running Aero’s server. Meanwhile, we co-locate multiple devices on a machine: each device is assigned six cpus given that modern mobiles have processors with four to eight cpus.

Default system configuration. Unless specified otherwise, we configure the systems to assume $W=10^{9}$ total devices. For Aero, we set the default device sampling probability $q$ in DP-FedAvg to $10^{-5}$; i.e., the expected number of devices that contribute updates in a round is $10^{4}$. We also configure Aero to use ten decryption committees, where each committee has a total of $C=45$ devices of which $A=18$ may be malicious. The first decryption committee also serves as the master committee. We configure the DP-noise committee with $(A,C)=(40,280)$. For Orchard, we configure its committee to have 40 devices of which 16 may be malicious. (Footnote 5: Aero’s committees are larger because it must ensure, using a union bound, that the chance of sampling more than $A$ malicious devices across any of its committees is the same as in Orchard.)

6.1. Comparison with FedScale

Accuracy. We evaluate several datasets and models to compare Aero with FedScale, specifically, the FedAvg algorithm in FedScale. Figure 6 shows these datasets and models. We use CNND and CNNF for two different CNN models: one dropout model (reddi2020adaptive, ) and the other from the FedAvg paper (mcmahan2017communication, ).

Aero’s accuracy depends on the DP parameters $\epsilon$ and $\delta$. For Figure 6 experiments, we set $\epsilon=5.04$ and $\delta=1/W^{1.1}$. Further, for both systems, we set all other training parameters (batch size, the number of device-side training epochs, etc.) per the examples provided by FedScale for each dataset.

Figure 6 compares the accuracies after 480 rounds of training (these models converge in roughly 400-500 rounds). Generally, Aero’s accuracy loss grows with the number of model parameters. The reason is that DP-FedAvg adds noise for every parameter and thus the norm of the noise increases with the number of parameters.
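The dependence of the noise norm on the number of parameters can be made explicit: for noise drawn from $\mathcal{N}(0,I\sigma^{2})$ over $d$ parameters, the expected squared L2 norm is $d\sigma^{2}$, so the root-mean-square norm is $\sigma\sqrt{d}$. The sketch below evaluates this for two model sizes from Figure 6; the function name is ours, $\sigma=1$ is an arbitrary placeholder, and the parameter counts are the rounded values from the figure.

```python
import math


def rms_noise_norm(num_params: int, sigma: float) -> float:
    """Root-mean-square L2 norm of a sample from N(0, sigma^2 I_d): sigma * sqrt(d)."""
    return sigma * math.sqrt(num_params)


# Rounded FEMNIST model sizes from Figure 6; sigma = 1.0 is a placeholder.
lenet = rms_noise_norm(49_000, 1.0)
alexnet = rms_noise_norm(3_900_000, 1.0)
print(f"AlexNet / LeNet noise-norm ratio: {alexnet / lenet:.1f}")
```

For a fixed $\sigma$, moving from the 49K parameter LeNet to the 3.9M parameter AlexNet scales the noise norm by roughly $\sqrt{3.9\cdot 10^{6}/4.9\cdot 10^{4}}\approx 8.9$, whereas the clipped signal norm of each update stays bounded by $S$.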

Although Aero’s accuracy loss is (very) high for a larger number of parameters, this loss is recoverable by increasing $\epsilon$ (but still keeping it at a recommended value). Figure 7 shows accuracy for two values of $\epsilon$ for two example models. Increasing $\epsilon$ from 5.04 to 5.53 recovers the accuracy loss. For instance, for the CNNF model, FedScale’s accuracy is 79.3% after 480 rounds, while Aero’s is 79.2%. The reason is that as $\epsilon$ increases, more devices can contribute updates ($q$ increases), which increases the signal relative to the differential privacy noise. Overall, Aero can give accuracy competitive with plain federated learning for models with parameters ranging from tens of thousands to a few million.


| Model | Size | cpu (ms): FedScale | cpu (ms): Aero | network (KiB): FedScale | network (KiB): Aero |
|---|---|---|---|---|---|
| LeNet | 49K | 3.36E-4 | 2 | 3.93E-5 | 0.96 |
| CNND | 1.2M | 9.50E-4 | 55 | 9.66E-4 | 3.87 |
| CNNF | 1.7M | 9.49E-4 | 77 | 1.35E-3 | 5.15 |
| AlexNet | 3.9M | 1.75E-3 | 170 | 3.12E-3 | 11.0 |

Figure 8. Per device per round average cost for different models.
Figure 9. cpu time per device per round of training for different device roles in Aero and Orchard. Panels: (a) Generator, (b) Verifier, (c) Decryption committee.

Figure 10. Network transfers per device per round of training for different device roles in Aero and Orchard. Panels: (a) Generator, (b) Verifier, (c) Decryption committee.

Device overhead. Another cost of privacy relative to plain federated learning is increased device overhead. Figure 8 summarizes the average cpu and network cost per round per device for the four models on the FEMNIST dataset. (We picked the FEMNIST dataset just as an example, but the results for the other datasets are qualitatively the same.)

Overall, an Aero device on average (considering the different types of Aero devices) spends $5.9\cdot 10^{3}$ to $9.7\cdot 10^{4}\times$ higher cpu and $3.5\cdot 10^{3}$ to $2.4\cdot 10^{4}\times$ higher network relative to FedScale. This overhead is due to the fact that FedScale does not use any cryptographic operations, while Aero devices use many, for example, encryption and ZK-proofs during the generate phase, and verifiable aggregation during the add phase. However, Aero’s overhead, at least on average, is low (Figure 8). Further, as we will show next, Aero’s worst-case overhead is also moderate.
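The overhead ranges quoted above follow directly from the per-model averages in Figure 8. The short calculation below recomputes the ratios from those values; the variable names are ours.

```python
# (model, FedScale cpu ms, Aero cpu ms, FedScale network KiB, Aero network KiB),
# copied from Figure 8.
rows = [
    ("LeNet", 3.36e-4, 2, 3.93e-5, 0.96),
    ("CNND", 9.50e-4, 55, 9.66e-4, 3.87),
    ("CNNF", 9.49e-4, 77, 1.35e-3, 5.15),
    ("AlexNet", 1.75e-3, 170, 3.12e-3, 11.0),
]

# Ratio of Aero's average cost to FedScale's, per model.
cpu_ratios = {m: aero_cpu / fs_cpu for m, fs_cpu, aero_cpu, _, _ in rows}
net_ratios = {m: aero_net / fs_net for m, _, _, fs_net, aero_net in rows}

print(f"cpu: {min(cpu_ratios.values()):.2e} to {max(cpu_ratios.values()):.2e}")
print(f"network: {min(net_ratios.values()):.2e} to {max(net_ratios.values()):.2e}")
```

The smallest cpu ratio comes from LeNet and the largest from AlexNet, while for the network the order is reversed: LeNet has the largest ratio and AlexNet the smallest.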

6.2. Comparison to Orchard

Both Aero and Orchard have multiple types of devices. Aero has devices that participate in the master committee, generate updates (or Gaussian noise), verify the aggregator’s work, and participate in the decryption committee. Similarly, Orchard has generator, verifier, and committee devices. We compare overhead for these devices separately.

Generator device overhead. The overhead for the generators changes only with the model size (after excluding the training time to generate the plain updates). Thus, we vary the number of model parameters and report overhead.

Figure 9(a) shows the cpu time and Figure 10(a) shows the network transfers with a varying number of model parameters. These overheads grow linearly with the number of model parameters (the network overhead is not a straight line as it includes a fixed cost of 60 MiB to download ZK-proof proving keys). The reason is that the dominant operations for a generator device are generating ZK-proofs and shipping ciphertexts to the aggregator. The number of both operations is proportional to the number of parameters (§4.2).

In terms of absolute overhead, a specific data point of interest is a million-parameter model, e.g., the CNND model with 1.2M parameters. For this size, a generator device spends 1.01 hours in cpu time, or equivalently 13.4 minutes of latency (wall-clock time) over six cores. The generator also sends 101 MiB of data over the network. These overheads are moderate, considering the fact that the probability that a device will be a generator in a round is small: $10^{-5}$.
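One way to read "moderate" is to weight the generator cost by the probability of being selected. The sketch below does only that arithmetic. It assumes the selection probability is the same in every round and counts only the generator role (not the verifier or committee roles), so it is an illustration rather than a figure reported by the evaluation.

```python
# Values quoted in the text for a 1.2M-parameter model (CNND).
GENERATOR_CPU_HOURS = 1.01      # cpu time spent by a generator device
GENERATOR_NETWORK_MIB = 101.0   # data sent by a generator device
P_GENERATOR = 1e-5              # probability of being a generator in a round


def expected_per_round(cost, probability):
    """Expected cost per round when the cost is paid only with the given probability."""
    return cost * probability


# Convert to the units used in Figure 8 (ms and KiB) before weighting.
expected_cpu_ms = expected_per_round(GENERATOR_CPU_HOURS * 3600 * 1000, P_GENERATOR)
expected_net_kib = expected_per_round(GENERATOR_NETWORK_MIB * 1024, P_GENERATOR)

print(f"expected generator cpu per round:     {expected_cpu_ms:.2f} ms")
print(f"expected generator network per round: {expected_net_kib:.2f} KiB")
```

Under these assumptions, the generator role contributes tens of milliseconds of cpu and about one KiB of network transfer per round in expectation.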

Finally, the cpu and network overhead for Aero and Orchard is roughly the same. The reason is that the dominant operations for the two systems are common: ZK-proofs and upload of ciphertexts.

Verifier device overhead. Figure 9(b) shows cpu and Figure 10(b) shows network overhead for the verifier devices that participate in the verifiable aggregation protocol (§4.3). These experiments fix the number of model parameters to 1.2M and vary the probability $q$ with which a verifier device samples summation trees to inspect (recall that a verifier device in Aero checks $q\cdot\ell$ summation trees). For Orchard, overhead does not change with $q$ ($q$ is effectively 1).

Overall, Aero’s verifier devices, which are the bulk of the devices in the system, are efficient, consuming a few milliseconds of cpu and a few KiBs of network transfers. For instance, for $q=10^{-5}$, Aero incurs 15 ms of cpu time and 3.12 KiB of network transfers, while Orchard incurs 1.96 seconds ($130\times$) and 738 MiB ($2.36\cdot 10^{5}\times$).

Comparing Aero with Orchard, a verifier in Aero consumes less cpu than Orchard for smaller values of $q$ but more cpu for larger $q$. This trend is due to constants: even though an Aero device checks $q\cdot\ell$ summation trees and $3s$ ciphertexts in each tree versus $\ell\cdot 3s$ ciphertexts in Orchard, Aero devices verify the ZK-proofs to address the reuse-of-keys issue, while Orchard does not have such a requirement (§4.3). Each proof check takes $\approx 700$ ms on a single cpu of c5.24xlarge. Indeed, Aero w/o ZK-proof check (another line in the plot) is strictly better than Orchard.

Aero’s network overhead increases linearly with $q$, while Orchard’s stays constant as it does not do sampling (Figure 10(b)). Notably, when $q=1$, i.e., when Aero and Orchard check the same number of ciphertexts, an Aero verifier consumes 251 MB, which is $\approx 1/3$ of Orchard’s. This is because polynomial identity testing allows an Aero verifier to download evaluations of ciphertext polynomials rather than the full polynomials from non-leaf vertices (§4.3).
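The idea behind polynomial identity testing can be sketched as follows: to check that a parent vertex holds the sum of its two children, a verifier compares evaluations at a random point instead of comparing all coefficients. This is a toy illustration over a prime field, not Aero's protocol from §4.3. The modulus, polynomial length, and function names are assumptions, and the sketch ignores how the evaluation point is chosen and how the evaluations are tied to the aggregator's commitments.

```python
import secrets

P = 2**61 - 1  # toy prime field modulus (assumption, not Aero's ciphertext modulus)


def evaluate(coeffs, x, p=P):
    """Horner evaluation of sum(coeffs[i] * x**i) modulo p."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


def add_poly(a, b, p=P):
    """Coefficient-wise sum of two polynomials modulo p."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [(x + y) % p for x, y in zip(a, b)]


def random_point(p=P):
    """Sample a nonzero evaluation point uniformly from 1..p-1."""
    return secrets.randbelow(p - 1) + 1


def check_sum_at_point(left_eval, right_eval, parent_eval, p=P):
    """Accept iff left(r) + right(r) == parent(r); only three field elements are needed."""
    return (left_eval + right_eval) % p == parent_eval % p


# Two child polynomials and an honestly computed parent.
left = [secrets.randbelow(P) for _ in range(64)]
right = [secrets.randbelow(P) for _ in range(64)]
parent = add_poly(left, right)

r = random_point()
honest_ok = check_sum_at_point(evaluate(left, r), evaluate(right, r), evaluate(parent, r))

# A parent with one altered coefficient differs from the true sum by x**3.
tampered = list(parent)
tampered[3] = (tampered[3] + 1) % P
tampered_ok = check_sum_at_point(evaluate(left, r), evaluate(right, r), evaluate(tampered, r))

print(honest_ok, tampered_ok)
```

In general, two distinct polynomials of degree at most $d$ agree at a uniformly random point with probability at most $d$ divided by the number of candidate points (the Schwartz-Zippel bound), which is why one evaluation per polynomial can stand in for the full coefficient vector.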

Committee device overhead. Figure 9(c) and Figure 10(c) show the cpu and network overhead of decryption committee devices as a function of the model size. (In Aero, the first decryption committee also serves as the master committee.)

Aero’s overheads are much lower than Orchard’s: for 1.2M parameters, cpu time is 206 s in Aero versus 214 hours in Orchard (i.e., $3751\times$ lower), and network is 234 MiB in Aero versus 11 TiB in Orchard (i.e., $4.8\cdot 10^{4}\times$ lower). This improvement is for two reasons. First, Aero divides the decryption of multiple ciphertexts across committees, and thus each performs less work. Second, Aero uses the distributed decryption protocol (§4.4), while Orchard uses the general-purpose SCALE-MAMBA MPC (scaleMamba, ).

7. Related work

Aero’s goal is to add the rigorous guarantee of differential privacy to federated learning—at low device overhead. This section compares Aero to prior work with similar goals.

Local differential privacy (LDP). In LDP, devices locally add noise to their updates before submitting them for aggregation (duchi2013local, ; erlingsson2014rappor, ; ijcai2021-217, ; he2020secure, ; pathak2010multiparty, ; truex2020ldp, ; seif2020wireless, ; bhowmick2018protection, ; hao2019towards, ; sun2020federated, ; grafberger2021fedless, ; nguyen2016collecting, ; wang2019collecting, ; niu2019secure, ; lu2019blockchain, ; chen2018machine, ; mugunthan2020blockflow, ; chen2020practical, ; ding2021differentially, ). On the plus side, the privacy guarantee in LDP does not depend on the behavior of the aggregator, as devices add noise locally. Further, LDP is scalable as it adds small device-side overhead relative to plain federated learning. However, on the negative side, since each device perturbs its update, the trained model can have a large error.
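The accuracy gap between the local and central models can be illustrated with a small simulation. The sketch assumes scalar updates, Gaussian noise, and the same noise scale $\sigma$ per device in LDP as for the single noise draw in CDP. These are simplifying assumptions for illustration and do not reproduce the noise calibration of any cited system.

```python
import math
import random


def ldp_aggregate(updates, sigma, rng):
    """LDP: every device perturbs its own update before the sum is taken."""
    return sum(u + rng.gauss(0.0, sigma) for u in updates)


def cdp_aggregate(updates, sigma, rng):
    """CDP: noise is added once, to the exact sum of the updates."""
    return sum(updates) + rng.gauss(0.0, sigma)


def empirical_error_rms(aggregate, updates, sigma, trials, rng):
    """Root-mean-square error of the noisy sum relative to the true sum."""
    true_sum = sum(updates)
    errors = [aggregate(updates, sigma, rng) - true_sum for _ in range(trials)]
    return math.sqrt(sum(e * e for e in errors) / trials)


rng = random.Random(0)      # fixed seed so the run is repeatable
updates = [0.5] * 1000      # 1000 devices, each with a toy scalar update
ldp_err = empirical_error_rms(ldp_aggregate, updates, 1.0, 400, rng)
cdp_err = empirical_error_rms(cdp_aggregate, updates, 1.0, 400, rng)

print(f"LDP error (RMS): {ldp_err:.2f}")
print(f"CDP error (RMS): {cdp_err:.2f}")
```

With $n$ devices, the error of the LDP sum grows as $\sigma\sqrt{n}$ because $n$ independent noise terms accumulate, whereas the CDP sum carries a single noise term of scale $\sigma$.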

Central differential privacy (CDP). Given the accuracy loss in LDP, many systems target CDP (froelicher2017unlynx, ; zeng2022aggregating, ; sebert2022protecting, ; truex2019hybrid, ; xu2019hybridalpha, ; chase2017private, ; rastogi2010differentially, ; stevens2021efficient, ; hynes2018efficient, ; roth2019honeycrisp, ; roth2020orchard, ; xu2022detrust, ). The core challenge is hiding sensitive device updates from the aggregator.

Several systems in this category target a setting of a few tens of devices to a few thousand devices (xu2022detrust, ; xu2019hybridalpha, ; truex2019hybrid, ; stevens2021efficient, ; sebert2022protecting, ). These systems require all devices to participate in one or more cryptographic primitives, and thus their overhead grows with the number of devices. For example, in secure aggregation based FLDP (stevens2021efficient, ), each device generates a secret key, then masks its update using the key, before sending the masked update to the aggregator. Then, the devices securely sum their masks to subtract them from the aggregator’s result. This latter protocol requires each device to secretly share its mask with all others.
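The mask-then-unmask flow described above can be sketched in a few lines. This is a toy model, not FLDP itself: masks are drawn uniformly at random rather than derived from a key, updates are assumed to be small fixed-point integers modulo $2^{32}$, and the sum of masks is computed in the clear as a stand-in for the secure summation that the devices would run among themselves. All names are illustrative.

```python
import secrets

MODULUS = 2**32  # assumed fixed-point encoding of updates modulo 2^32


def random_mask(length):
    """A uniformly random mask vector; stands in for a key-derived mask."""
    return [secrets.randbelow(MODULUS) for _ in range(length)]


def mask_update(update, mask):
    """Device side: hide the update by adding the mask coordinate-wise."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(vectors):
    """Coordinate-wise sum of vectors modulo MODULUS."""
    return [sum(column) % MODULUS for column in zip(*vectors)]


def unmask(masked_sum, mask_sum):
    """Remove the summed masks from the aggregator's result."""
    return [(s - m) % MODULUS for s, m in zip(masked_sum, mask_sum)]


updates = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # one vector per device
masks = [random_mask(3) for _ in updates]

# The aggregator only ever sees masked updates.
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
masked_sum = aggregate(masked)

# Stand-in for the secure sum of masks computed among the devices.
mask_sum = aggregate(masks)

result = unmask(masked_sum, mask_sum)
print(result)
```

The step that is replaced by a stand-in here, summing the masks without revealing any individual mask, is the part that requires every device to interact with the others, which is why the overhead of such schemes grows with the number of devices.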

Chase et al. (chase2017private, ) do not require their protocol to scale with the number of devices: two devices aggregate updates from all others before generating and adding DP noise via Yao’s MPC protocol (yao1982protocols, ). The issue is that if the adversary compromises the two devices, it learns the updates.

Honeycrisp (roth2019honeycrisp, ), Orchard (roth2020orchard, ), and Mycelium (roth2021mycelium, ) target a setting of a billion devices. One of their key insights is to run expensive cryptographic protocols among a small, randomly-sampled committee, while leveraging an untrusted resourceful aggregator to help with the aggregation. Among the three systems, Orchard supports learning tasks, while Honeycrisp supports aggregate statistics and Mycelium supports graph analytics. The limitation of Orchard is that it imposes a large overhead on the devices (§2.3, §6). Aero improves over Orchard by several orders of magnitude (§6).

An alternative to cryptography is to use trusted hardware, e.g., Intel SGX (hynes2018efficient, ). These systems add negligible overhead over plain federated learning, but trusting the hardware design and manufacturer is a strong assumption (fei2021security, ; TrustZoneAttacks, ; ArmSEVAttacks, ).

No differential privacy. Many systems provide a weaker notion of privacy than differential privacy, for functionality such as federated machine learning (aono2017privacy, ; rathee2022elsa, ; dong2020eastfly, ; fu2020vfl, ; jiang2020federated, ; jiang2021flashe, ; liu2019secure, ; ma2021privacy, ; mandal2019privfl, ; 254465, ; sav2020poseidon, ; xu2022hercules, ; beguier2020efficient, ; chen2021ppt, ; chowdhury2021eiffel, ; ergun2021sparsified, ; fereidooni2021safelearn, ; guo2020secure, ; hao2021efficient, ; kadhe2020fastsecagg, ; li2021secure, ; liu2020boosting, ; so2021turbo, ; xu2019verifynet, ; zhang2021dubhe, ; mo2021ppfl, ; hashemi2021byzantine, ; quoc2021secfl, ; sav2022privacy, ), statistics (corrigan2017prio, ), and aggregation (bell2020secure, ; bonawitz2017practical, ; liu2022dhsa, ; wan2022information, ; liu2022efficient, ). For instance, BatchCrypt (254465, ) uses Paillier AHE (damgaard2001generalisation, ) to hide updates from the aggregator. The promise is that the adversary learns only the aggregate of the data of many devices. The fundamental issue is that aggregation does not provide a rigorous guarantee: one can learn individual training data from the trained model parameters (zhu2019deep, ; melis2019exploiting, ; briland2017deep, ; shokri2017membership, ).
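To illustrate what additively homomorphic encryption provides, the following toy Paillier sketch lets an aggregator add encrypted updates without decrypting them. The primes are far too small to be secure, and the sketch does not reproduce BatchCrypt's batching or quantization. Function names and parameters are illustrative.

```python
import math
import secrets

# Toy parameters: two Mersenne primes. Real keys use primes of 1024 bits or more.
P_PRIME = 2**31 - 1
Q_PRIME = 2**61 - 1


def keygen(p=P_PRIME, q=Q_PRIME):
    """Return (public n, secret (lam, mu)) for Paillier with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt 0 <= m < n as (1 + m*n) * r^n mod n^2 with fresh randomness r."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return ((1 + m * n) % n_sq) * pow(r, n, n_sq) % n_sq


def decrypt(n, secret_key, c):
    """Recover m via L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = secret_key
    l_value = (pow(c, lam, n * n) - 1) // n
    return (l_value * mu) % n


def add_ciphertexts(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    return (c1 * c2) % (n * n)


n, secret_key = keygen()
device_updates = [5, 7, 30]                       # toy integer updates
ciphertexts = [encrypt(n, u) for u in device_updates]

encrypted_sum = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_sum = add_ciphertexts(n, encrypted_sum, c)

total = decrypt(n, secret_key, encrypted_sum)
print(total)
```

The decrypted value is the exact sum with no noise added, which is the point made above: hiding individual updates behind an aggregate does not by itself yield a differential privacy guarantee.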

8. Summary

Federated learning over a large number of mobile devices is getting significant attention both in industry and academia. One big challenge of current practical systems, those that provide good accuracy and efficiency, is the trust they require: the data analyst must say “let’s trust that the server will not be compromised”. Aero adds an alternative. It shows that one can perform FL with good accuracy, moderate overhead, and the rigorous guarantee of differential privacy without trusting a central server or the data analyst. Aero improves the trade-off by focusing on a specific class of learning algorithms and tuning system architecture and design to these algorithms (§4). The main evaluation highlight is that Aero has comparable accuracy to plain federated learning, and improves efficiency by up to five orders of magnitude over Orchard, the prior system with comparably strong guarantees (§6).

References

  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In ACM Conference on Computer and Communications Security (CCS), pages 308–318, 2016.
  • [2] Y. Aono, T. Hayashi, L. Wang, S. Moriai, et al. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Transactions on Information Forensics and Security, pages 1333–1345, 2017.
  • [3] arkworks. ark-groth16. https://github.com/arkworks-rs/groth16.
  • [4] G. Asharov, A. Jain, A. López-Alt, E. Tromer, V. Vaikuntanathan, and D. Wichs. Multiparty computation with low communication, computation and interaction via threshold FHE. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), pages 483–501, 2012.
  • [5] C. Beguier, M. Andreux, and E. W. Tramel. Efficient sparse secure aggregation for federated learning. arXiv preprint arXiv:2007.14861, 2020.
  • [6] J. H. Bell, K. A. Bonawitz, A. Gascón, T. Lepoint, and M. Raykova. Secure single-server aggregation with (poly) logarithmic overhead. In ACM Conference on Computer and Communications Security (CCS), pages 1253–1269, 2020.
  • [7] A. Ben-Efraim, K. Cong, E. Omri, E. Orsini, N. P. Smart, and E. Soria-Vazquez. Large scale, actively secure computation from lpn and free-xor garbled circuits. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2021.
  • [8] G. Beniamini. Trust issues: Exploiting trustzone TEEs. https://googleprojectzero.blogspot.com/2017/07/trust-issues-exploiting-trustzone-tees.html. Accessed: 2022-01-30.
  • [9] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.
  • [10] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy-preserving machine learning. In ACM Conference on Computer and Communications Security (CCS), pages 1175–1191, 2017.
  • [11] Z. Brakerski. Fully homomorphic encryption without modulus switching from classical GapSVP. In Advances in Cryptology—CRYPTO, pages 868–886, 2012.
  • [12] M. Chase, R. Gilad-Bachrach, K. Laine, K. Lauter, and P. Rindal. Private collaborative neural network learning. IACR Cryptol. ePrint Arch., 2017.
  • [13] C. Chen, J. Zhou, B. Wu, W. Fang, L. Wang, Y. Qi, and X. Zheng. Practical privacy preserving POI recommendation. ACM Transactions on Intelligent Systems and Technology (TIST), pages 1–20, 2020.
  • [14] H. Chen, W. Dai, M. Kim, and Y. Song. Efficient multi-key homomorphic encryption with packed ciphertexts with application to oblivious neural network inference. In ACM Conference on Computer and Communications Security (CCS), pages 395–412, 2019.
  • [15] Q. Chen, Z. Wang, W. Zhang, and X. Lin. PPT: A privacy-preserving global model training protocol for federated learning in P2P networks. arXiv preprint arXiv:2105.14408, 2021.
  • [16] X. Chen, J. Ji, C. Luo, W. Liao, and P. Li. When machine learning meets blockchain: A decentralized, privacy-preserving and secure design. In IEEE International Conference on Big Data (Big Data), pages 1178–1187, 2018.
  • [17] A. R. Chowdhury, C. Guo, S. Jha, and L. van der Maaten. EIFFeL: Ensuring integrity for federated learning. arXiv preprint arXiv:2112.12727, 2021.
  • [18] G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik. EMNIST: Extending MNIST to handwritten letters. In International Joint Conference on Neural Networks (IJCNN), pages 2921–2926, 2017.
  • [19] H. Corrigan-Gibbs and D. Boneh. Prio: Private, robust, and scalable computation of aggregate statistics. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 259–282, 2017.
  • [20] I. Damgård and M. Jurik. A generalisation, a simplification and some applications of Paillier’s probabilistic public-key system. In Proceedings of International Workshop on Practice and Theory in Public Key Cryptography: Public Key Cryptography, pages 119–136, 2001.
  • [21] I. Damgård, V. Pastro, N. Smart, and S. Zakarias. Multiparty computation from somewhat homomorphic encryption. In Advances in Cryptology—CRYPTO, pages 643–662, 2012.
  • [22] I. Damgård, M. Fitzi, E. Kiltz, J. B. Nielsen, and T. Toft. Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation. In Theory of Cryptography Conference (TCC), 2006.
  • [23] J. Ding, G. Liang, J. Bi, and M. Pan. Differentially private and communication efficient collaborative learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021.
  • [24] Y. Dong, X. Chen, L. Shen, and D. Wang. EaSTFLy: Efficient and secure ternary federated learning. Computers & Security, page 101824, 2020.
  • [25] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In Symposium on Foundations of Computer Science (FOCS), 2013.
  • [26] C. Dwork. A firm foundation for private data analysis. Communications of the ACM, pages 86–95, 2011.
  • [27] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), page 486, 2006.
  • [28] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), pages 265–284, 2006.
  • [29] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In ACM Symposium on Theory of Computing (STOC), 2009.
  • [30] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, page 9(3–4):211–407, 2014.
  • [31] I. Ergun, H. U. Sami, and B. Guler. Sparsified secure aggregation for privacy-preserving federated learning. arXiv preprint arXiv:2112.12872, 2021.
  • [32] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In ACM Conference on Computer and Communications Security (CCS), pages 1054–1067, 2014.
  • [33] J. Fan and F. Vercauteren. Somewhat practical fully homomorphic encryption. IACR Cryptol. ePrint Arch., page 144, 2012.
  • [34] S. Fei, Z. Yan, W. Ding, and H. Xie. Security vulnerabilities of SGX and countermeasures: A survey. ACM Computing Surveys, 2021.
  • [35] H. Fereidooni, S. Marchal, M. Miettinen, A. Mirhoseini, H. Möllering, T. D. Nguyen, P. Rieger, A.-R. Sadeghi, T. Schneider, H. Yalame, et al. SAFELearn: secure aggregation for private federated learning. In IEEE Security and Privacy Workshops (SPW), pages 56–62, 2021.
  • [36] D. Froelicher, P. Egger, J. S. Sousa, J. L. Raisaro, Z. Huang, C. Mouchet, B. Ford, and J.-P. Hubaux. UnLynx: A decentralized system for privacy-conscious data sharing. Proceedings on Privacy Enhancing Technologies, pages 232–250, 2017.
  • [37] A. Fu, X. Zhang, N. Xiong, Y. Gao, H. Wang, and J. Zhang. VFL: a verifiable federated learning with privacy-preserving for big data in industrial IoT. IEEE Transactions on Industrial Informatics, 2020.
  • [38] Y. Gilad, R. Hemo, S. Micali, G. Vlachos, and N. Zeldovich. Algorand: Scaling byzantine agreements for cryptocurrencies. In ACM Symposium on Operating Systems Principles (SOSP), page 51–68, 2017.
  • [39] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game. In ACM Symposium on Theory of Computing (STOC), page 218–229, 1987.
  • [40] S. Goldwasser, S. Micali, and C. Rackoff. The Knowledge Complexity of Interactive Proof-Systems, page 203–225. Association for Computing Machinery, 2019.
  • [41] S. D. Gordon, D. Starin, and A. Yerukhimovich. The more the merrier: reducing the cost of large scale mpc. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2021.
  • [42] A. Grafberger, M. Chadha, A. Jindal, J. Gu, and M. Gerndt. FedLess: Secure and scalable federated learning using serverless computing. In IEEE International Conference on Big Data (Big Data), pages 164–173, 2021.
  • [43] J. Groth. On the size of pairing-based non-interactive arguments. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), pages 305–326, 2016.
  • [44] J. Guo, Z. Liu, K.-Y. Lam, J. Zhao, Y. Chen, and C. Xing. Secure weighted aggregation for federated learning. arXiv preprint arXiv:2010.08730, 2020.
  • [45] V. Gupta and K. Gopinath. An extended verifiable secret redistribution protocol for archival systems. In International Conference on Availability, Reliability and Security (ARES), pages 8–pp, 2006.
  • [46] M. Hao, H. Li, G. Xu, H. Chen, and T. Zhang. Efficient, private and robust federated learning. In Annual Computer Security Applications Conference, pages 45–60, 2021.
  • [47] M. Hao, H. Li, G. Xu, S. Liu, and H. Yang. Towards efficient and privacy-preserving federated deep learning. In IEEE International Conference on Communications (ICC), pages 1–6, 2019.
  • [48] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
  • [49] F. Hartmann, S. Suh, A. Komarzewski, T. D. Smith, and I. Segall. Federated learning for ranking browser history suggestions. arXiv preprint arXiv:1911.11807, 2019.
  • [50] H. Hashemi, Y. Wang, C. Guo, and M. Annavaram. Byzantine-robust and privacy-preserving framework for FedML. arXiv preprint arXiv:2105.02295, 2021.
  • [51] C. He, S. Li, J. So, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu, L. Shen, P. Zhao, Y. Kang, Y. Liu, R. Raskar, Q. Yang, M. Annavaram, and S. Avestimehr. Fedml: A research library and benchmark for federated machine learning. Advances in Neural Information Processing Systems, Best Paper Award at Federate Learning Workshop, 2020.
  • [52] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [53] L. He, S. P. Karimireddy, and M. Jaggi. Secure byzantine-robust machine learning. arXiv preprint arXiv:2006.04747, 2020.
  • [54] B. Hitaj, G. Ateniese, and F. Perez-Cruz. Deep models under the GAN: Information leakage from collaborative deep learning. In ACM Conference on Computer and Communications Security (CCS), page 603–618, 2017.
  • [55] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [56] N. Hynes, R. Cheng, and D. Song. Efficient deep learning on multi-source private data. arXiv preprint arXiv:1807.06689, 2018.
  • [57] M. Jiang, T. Jung, R. Karl, and T. Zhao. Federated dynamic GNN with secure aggregation. arXiv preprint arXiv:2009.07351, 2020.
  • [58] Z. Jiang, W. Wang, and Y. Liu. Flashe: Additively symmetric homomorphic encryption for cross-silo federated learning. arXiv preprint arXiv:2109.00675, 2021.
  • [59] S. Kadhe, N. Rajaraman, O. O. Koyluoglu, and K. Ramchandran. Fastsecagg: Scalable secure aggregation for privacy-preserving federated learning. In ICML Workshop on Federated Learning for User Privacy and Data Confidentiality, 2020.
  • [60] P. Kairouz, B. McMahan, S. Song, O. Thakkar, A. Thakurta, and Z. Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pages 5213–5225, 2021.
  • [61] B. Knott, S. Venkataraman, A. Hannun, S. Sengupta, M. Ibrahim, and L. van der Maaten. Crypten: Secure multi-party computation meets machine learning. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • [62] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [63] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • [64] KU Leuven COSIC. SCALE-MAMBA. https://github.com/KULeuven-COSIC/SCALE-MAMBA.
  • [65] F. Lai, Y. Dai, S. S. Singapuram, J. Liu, X. Zhu, H. V. Madhyastha, and M. Chowdhury. FedScale: Benchmarking model and system performance of federated learning at scale. In International Conference on Machine Learning (ICML), 2022.
  • [66] F. Lai, X. Zhu, H. V. Madhyastha, and M. Chowdhury. Oort: Efficient federated learning via guided participant selection. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 19–35, 2021.
  • [67] Y. LeCun, L. D. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, et al. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural Networks: the statistical mechanics perspective, 1995.
  • [68] K. H. Li, P. P. B. de Gusmão, D. J. Beutel, and N. D. Lane. Secure aggregation for federated learning in flower. In Proceedings of ACM International Workshop on Distributed Machine Learning, pages 8–14, 2021.
  • [69] M. Li, Y. Zhang, Z. Lin, and Y. Solihin. Exploiting unprotected I/O operations in AMD’s secure encrypted virtualization. In USENIX Security Symposium, 2019.
  • [70] C. Liu, S. Chakraborty, and D. Verma. Secure model fusion for distributed learning using partial homomorphic encryption. In Policy-Based Autonomic Data Governance, pages 154–179. Springer, 2019.
  • [71] Y. Liu, Z. Ma, X. Liu, S. Ma, S. Nepal, R. H. Deng, and K. Ren. Boosting privately: Federated extreme gradient boosting for mobile crowdsensing. In International Conference on Distributed Computing Systems (ICDCS), pages 1–11, 2020.
  • [72] Z. Liu, S. Chen, J. Ye, J. Fan, H. Li, and X. Li. DHSA: efficient doubly homomorphic secure aggregation for cross-silo federated learning. The Journal of Supercomputing, 2022.
  • [73] Z. Liu, J. Guo, K.-Y. Lam, and J. Zhao. Efficient dropout-resilient aggregation for privacy-preserving machine learning. IEEE Transactions on Information Forensics and Security, 2022.
  • [74] Y. Lu, X. Huang, Y. Dai, S. Maharjan, and Y. Zhang. Blockchain and federated learning for privacy-preserved data sharing in industrial IoT. IEEE Transactions on Industrial Informatics, pages 4177–4186, 2019.
  • [75] J. Ma, S.-A. Naas, S. Sigg, and X. Lyu. Privacy-preserving federated learning based on multi-key homomorphic encryption. arXiv preprint arXiv:2104.06824, 2021.
  • [76] K. Mandal and G. Gong. PrivFL: Practical privacy-preserving federated regressions on high-dimensional data over mobile networks. In Proceedings of ACM SIGSAC Conference on Cloud Computing Security Workshop, pages 57–68, 2019.
  • [77] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282, 2017.
  • [78] B. McMahan and A. Thakurta. Federated learning with formal differential privacy guarantees. https://ai.googleblog.com/2022/02/federated-learning-with-formal.html. Accessed: 2022-12-12.
  • [79] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018.
  • [80] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018.
  • [81] L. Melis, C. Song, E. D. Cristofaro, and V. Shmatikov. Exploiting unintended feature leakage in collaborative learning. In IEEE Symposium on Security and Privacy (S&P), pages 691–706, 2019.
  • [82] Microsoft. Microsoft SEAL (release 3.7). https://github.com/Microsoft/SEAL.
  • [83] F. Mo, H. Haddadi, K. Katevas, E. Marin, D. Perino, and N. Kourtellis. PPFL: privacy-preserving federated learning with trusted execution environments. arXiv preprint arXiv:2104.14380, 2021.
  • [84] V. Mugunthan, R. Rahman, and L. Kagal. BlockFLow: An accountable and privacy-preserving solution for federated learning. arXiv preprint arXiv:2007.03856, 2020.
  • [85] M. Naseri, J. Hayes, and E. D. Cristofaro. Local and central differential privacy for robustness and privacy in federated learning. Proceedings of the Network and Distributed System Security Symposium, 2022.
  • [86] T. T. Nguyên, X. Xiao, Y. Yang, S. C. Hui, H. Shin, and J. Shin. Collecting and analyzing data from smart device users with local differential privacy. arXiv preprint arXiv:1606.05053, 2016.
  • [87] A. Nitulescu. zk-snarks: A gentle introduction. https://www.di.ens.fr/~nitulesc/files/Survey-SNARKs.pdf, 2020.
  • [88] C. Niu, F. Wu, S. Tang, L. Hua, R. Jia, C. Lv, Z. Wu, and G. Chen. Secure federated submodel learning. arXiv preprint arXiv:1911.02254, 2019.
  • [89] M. A. Pathak, S. Rane, and B. Raj. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems (NeurIPS), pages 1876–1884, 2010.
  • [90] pytorch. Pytorch. https://github.com/pytorch/pytorch.
  • [91] D. L. Quoc and C. Fetzer. SecFL: Confidential federated learning using TEEs. arXiv preprint arXiv:2110.00981, 2021.
  • [92] V. Rastogi and S. Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of ACM SIGMOD International Conference on Management of data, pages 735–746, 2010.
  • [93] M. Rathee, C. Shen, S. Wagh, and R. A. Popa. ELSA: Secure aggregation for federated learning with malicious actors. Cryptology ePrint Archive, 2022.
  • [94] S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečnỳ, S. Kumar, and H. B. McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2020.
  • [95] R. L. Rivest, L. Adleman, M. L. Dertouzos, et al. On data banks and privacy homomorphisms. Foundations of secure computation, 1978.
  • [96] A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In ACM Symposium on Theory of Computing (STOC), 2010.
  • [97] E. Roth, K. Newatia, Y. Ma, K. Zhong, S. Angel, and A. Haeberlen. Mycelium: Large-scale distributed graph queries with differential privacy. In ACM Symposium on Operating Systems Principles (SOSP), page 327–343, 2021.
  • [98] E. Roth, D. Noble, B. H. Falk, and A. Haeberlen. Honeycrisp: Large-scale differentially private aggregation without a trusted core. In ACM Symposium on Operating Systems Principles (SOSP), pages 196–210, 2019.
  • [99] E. Roth, H. Zhang, A. Haeberlen, and B. C. Pierce. Orchard: Differentially private analytics at scale. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 1065–1081, 2020.
  • [100] S. Sav, A. Diaa, A. Pyrgelis, J.-P. Bossuat, and J.-P. Hubaux. Privacy-preserving federated recurrent neural networks. arXiv preprint arXiv:2207.13947, 2022.
  • [101] S. Sav, A. Pyrgelis, J. R. Troncoso-Pastoriza, D. Froelicher, J.-P. Bossuat, J. S. Sousa, and J.-P. Hubaux. Poseidon: Privacy-preserving federated neural network learning. In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2021.
  • [102] J. T. Schwartz. Fast probabilistic algorithms for verification of polynomial identities. J. ACM, 27(4):701–717, 1980.
  • [103] A. G. Sébert, R. Sirdey, O. Stan, and C. Gouy-Pailler. Protecting data from all parties: Combining FHE and DP in federated learning. arXiv preprint arXiv:2205.04330, 2022.
  • [104] M. Seif, R. Tandon, and M. Li. Wireless federated learning with local differential privacy. In IEEE International Symposium on Information Theory (ISIT), pages 2604–2609, 2020.
  • [105] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (S&P), pages 3–18, 2017.
  • [106] J. So, B. Güler, and A. S. Avestimehr. Turbo-aggregate: Breaking the quadratic aggregation barrier in secure federated learning. IEEE Journal on Selected Areas in Information Theory, pages 479–489, 2021.
  • [107] T. Stevens, C. Skalka, C. Vincent, J. Ring, S. Clark, and J. Near. Efficient differentially private secure aggregation for federated learning via hardness of learning with errors. arXiv preprint arXiv:2112.06872, 2021.
  • [108] L. Sun and L. Lyu. Federated model distillation with noise-free differential privacy. arXiv preprint arXiv:2009.05537, 2020.
  • [109] L. Sun, J. Qian, and X. Chen. LDP-FL: Practical private aggregation in federated learning with local differential privacy. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1571–1578, 2021.
  • [110] S. Tan, B. Knott, Y. Tian, and D. J. Wu. Cryptgpu: Fast privacy-preserving machine learning on the gpu. In 2021 IEEE Symposium on Security and Privacy (SP), pages 1021–1038. IEEE, 2021.
  • [111] S. Truex, N. Baracaldo, A. Anwar, T. Steinke, H. Ludwig, R. Zhang, and Y. Zhou. A hybrid approach to privacy-preserving federated learning. In ACM Workshop on Artificial Intelligence and Security, pages 1–11, 2019.
  • [112] S. Truex, L. Liu, K.-H. Chow, M. E. Gursoy, and W. Wei. LDP-Fed: Federated learning with local differential privacy. In Proceedings of the ACM International Workshop on Edge Systems, Analytics and Networking, pages 61–66, 2020.
  • [113] K. Wan, H. Sun, M. Ji, and G. Caire. Information theoretic secure aggregation with uncoded groupwise keys. arXiv preprint arXiv:2204.11364, 2022.
  • [114] N. Wang, X. Xiao, Y. Yang, J. Zhao, S. C. Hui, H. Shin, J. Shin, and G. Yu. Collecting and analyzing multidimensional data with local differential privacy. In IEEE International Conference on Data Engineering (ICDE), pages 638–649, 2019.
  • [115] P. Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
  • [116] G. Xu, X. Han, S. Xu, T. Zhang, H. Li, X. Huang, and R. H. Deng. Hercules: Boosting the performance of privacy-preserving federated learning. arXiv preprint arXiv:2207.04620, 2022.
  • [117] G. Xu, H. Li, S. Liu, K. Yang, and X. Lin. VerifyNet: Secure and verifiable federated learning. IEEE Transactions on Information Forensics and Security, pages 911–926, 2019.
  • [118] R. Xu, N. Baracaldo, Y. Zhou, A. Anwar, S. Kadhe, and H. Ludwig. DeTrust-FL: Privacy-preserving federated learning in decentralized trust setting. In IEEE International Conference on Cloud Computing (CLOUD), 2022.
  • [119] R. Xu, N. Baracaldo, Y. Zhou, A. Anwar, and H. Ludwig. HybridAlpha: An efficient approach for privacy-preserving federated learning. In ACM Workshop on Artificial Intelligence and Security, pages 13–23, 2019.
  • [120] A. C. Yao. Protocols for secure computations. In Annual Symposium on Foundations of Computer Science (SFCS), pages 160–164, 1982.
  • [121] A. Yousefpour, I. Shilov, A. Sablayrolles, D. Testuggine, K. Prasad, M. Malek, J. Nguyen, S. Ghosh, A. Bharadwaj, J. Zhao, G. Cormode, and I. Mironov. Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021.
  • [122] D. Zeng, S. Liu, and Z. Xu. Aggregating gradients in encoded domain for federated learning. arXiv preprint arXiv:2205.13216, 2022.
  • [123] C. Zhang, S. Li, J. Xia, W. Wang, F. Yan, and Y. Liu. BatchCrypt: Efficient homomorphic encryption for cross-silo federated learning. In USENIX Annual Technical Conference (USENIX ATC), pages 493–506, 2020.
  • [124] S. Zhang, Z. Li, Q. Chen, W. Zheng, J. Leng, and M. Guo. Dubhe: Towards data unbiasedness with homomorphic encryption in federated learning client selection. In International Conference on Parallel Processing, pages 1–10, 2021.
  • [125] K. Zhu, P. Van Hentenryck, and F. Fioretto. Bias and variance of post-processing in differential privacy. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11177–11184, 2021.
  • [126] L. Zhu, Z. Liu, and S. Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • [127] R. Zippel. Probabilistic algorithms for sparse polynomials. In International Symposium on Symbolic and Algebraic Manipulation, pages 216–226, 1979.

Appendix A Supplementary material

A.1. Privacy proof

The goal of Aero is to provide differential privacy for a class of federated learning algorithms. We will take DP-FedAvg as an example and prove that Aero indeed meets its goal in multiple steps. The proof for other algorithms such as DP-FedSGD is similar. The outline of the proof is as follows.

First, we will introduce a slightly modified version of DP-FedAvg that makes explicit the behavior of the malicious aggregator and devices. This modified version changes line 8 and line 12 in Figure 2. For instance, we will modify line 8 in Figure 2 to show that a byzantine aggregator may allow malicious devices to be sampled in a round. Appendix A.1.1 introduces the modified version and shows that the changes do not impact DP-FedAvg’s differential privacy guarantee.

Next, we will prove that Aero executes the modified DP-FedAvg algorithm faithfully, by showing that enough Gaussian noise will be added (Appendix A.1.2) and if an adversary introduces an error into the aggregation, it will be caught with high probability (Appendix A.1.3).

Finally, we will prove that after aggregation, Aero’s decryption protocol does not leak any information beyond the allowed output of DP-FedAvg (Appendix A.1.4).

We will not cover security of the protocols used in the setup phase, e.g., the sortition protocol to form committees, because Aero does not innovate on these protocols. With the above proof structure, we will show Aero provides the differential privacy guarantee to honest devices’ data.

Before proceeding to the proof, we introduce a few definitions. In the main body of the paper, for simplicity, we did not distinguish between honest-but-offline and malicious devices for the DP-noise committee (§4.2). Instead, we considered all offline devices as malicious for this committee, since it’s not possible to tell whether an offline device is malicious or not.

However, for devices that generate updates, we do protect honest-but-offline devices’ data. So when talking about generator devices, we use the following terms:

  • honest-and-online devices,

  • honest-but-offline devices,

  • honest devices: including both honest-and-online and honest-but-offline devices

  • malicious devices

As for the DP-noise committee members, we still follow the previous definition, namely

  • honest members: honest and online members

  • malicious members: malicious or honest-but-offline members

A.1.1. DP-FedAvg

As mentioned above, Aero must take into account the behavior of malicious entities for the computation in line 8 and line 12 of Figure 2.

There are two reasons Aero must modify line 8: first, a malicious aggregator may filter out honest-but-offline devices’ data in the aggregation; second, a malicious aggregator may add malicious devices’ data in the aggregation.

To capture the power of a malicious aggregator, we change line 8 to be

$C^{t} \leftarrow$ subset of (honest users sampled with probability $q$) + (some malicious devices)

We must modify line 12 because the DP-noise committee is likely to add more noise as noted in the design of the generate phase (§4.2). To capture this additional noise, we change line 12 to be

\[ \Delta^{t} \leftarrow \sum_{k} \Delta_{k}^{t} + \mathcal{N}(0, I\sigma^{2}) + \text{some additional bounded noise} \]
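The two modified lines can be sketched in code. This is only an illustration of the changes to lines 8 and 12, not Aero's implementation: updates are scalars for brevity, and the names `modified_round`, `drop`, and `extra_noise` are hypothetical.

```python
import random

def modified_round(honest_updates, malicious_updates, q, sigma,
                   drop=frozenset(), extra_noise=0.0, rng=None):
    """One aggregation step of the modified DP-FedAvg (illustrative sketch).

    honest_updates: dict mapping a device id to its clipped scalar update.
    malicious_updates: updates the adversary injects into the round.
    drop: ids of sampled honest devices that the aggregator filters out.
    extra_noise: additional bounded noise (passed in as a fixed value here).
    """
    rng = rng or random.Random()
    # Line 8: sample honest devices with probability q; the adversary may then
    # filter out any subset of them and add malicious devices.
    sampled = [k for k in honest_updates if rng.random() < q]
    kept = [k for k in sampled if k not in drop]
    updates = [honest_updates[k] for k in kept] + list(malicious_updates)
    # Line 12: sum of updates + Gaussian noise + additional bounded noise.
    aggregate = sum(updates) + rng.gauss(0.0, sigma) + extra_noise
    return aggregate, kept
```

Setting `q=1.0` and `sigma=0.0` removes all randomness, which makes it easy to see exactly which updates enter the sum.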

In the remainder of this section, we will prove that the modifications preserve the privacy guarantee of the original DP-FedAvg algorithm (brendan2018learning, ).

Device sampling (line 8). For device sampling, there are two possible attacks: either some malicious devices’ data will be included or some honest devices’ data will be filtered out.

The first case is not a problem, since it’s equivalent to post-processing: for example, after aggregation, the malicious aggregator can add malicious updates. Post-processing of a differentially private result does not affect the differential privacy guarantee (this follows from the post-processing lemma in the differential privacy literature (zhu2021bias, ; dwork2006calibrating, )).

To prove that filtering out honest devices’ data will not impact the privacy guarantee, we first give intuition and then a rigorous proof. Intuitively, a larger sampling probability (larger $q$) leads to more privacy loss, because a device’s data is more likely to be used in training. Informally, if each device is expected to contribute an update fewer times, the privacy loss is expected to be smaller. When the aggregator filters out honest devices’ data, each device contributes updates no more frequently than it would without filtering, so the privacy loss is expected to be no larger than without filtering.

In the original DP-FedAvg paper (brendan2018learning, ), the DP guarantee relies on the moments accountant introduced by Abadi et al. (abadi2016deep, ), whose tail bound can be converted to $(\epsilon,\delta)$-differential privacy. To be precise, the proof uses a lemma (which we will introduce shortly) that gives the moments bound on Gaussian noise with random sampling; this bound in turn yields $(\epsilon,\delta)$-differential privacy. We cite this lemma as Lemma A.1 in the appendix.

Next we show that when the original random sampling is replaced with random sampling plus filtering, the moments bound still holds. Without loss of generality, we focus on functions whose sensitivity is 1, for example, the UserUpdate function that does local training and gradient clipping with $S=1$ in DP-FedAvg (Figure 2). Notice that for any function $f'$ whose sensitivity is $S' \neq 1$, we can always construct a function $f = f'/S'$ whose sensitivity is 1.
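The normalization step can be illustrated with a short sketch. It assumes flat L2 clipping of the update vector (scaling by $\min(1, S/\|\Delta\|_2)$); the helper names `l2_norm`, `clip`, and `normalize` are hypothetical and not part of Aero.

```python
import math

def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def clip(update, S):
    """Scale `update` so that its L2 norm is at most S (flat clipping)."""
    norm = l2_norm(update)
    scale = min(1.0, S / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def normalize(f_prime, S_prime):
    """Given f' with norm bound S', return f = f'/S' with norm bound 1."""
    return lambda d: [x / S_prime for x in f_prime(d)]
```

For example, `normalize(lambda d: clip(d, 2.5), 2.5)` yields a function whose output norm is at most 1, so results stated for sensitivity 1 carry over after rescaling.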

Lemma A.1 ().

Given any function $f$ whose norm $\|f(\cdot)\|_{2} \leq 1$, let $z \geq 1$ be some noise scale and $\sigma = z \cdot \|f(\cdot)\|_{2}$, let $d = \{d_{1}, \ldots, d_{n}\}$ be a database, and let $\mathcal{J}$ be a sample from $[n]$ where each $i \in [n]$ is chosen independently with probability $q \leq \frac{1}{16\sigma}$. Then for any positive integer $\lambda \leq \sigma^{2} \ln \frac{1}{q\sigma}$, the function $\mathcal{G}(d) = \sum_{i \in \mathcal{J}} f(d_{i}) + \mathcal{N}(0, \sigma^{2}\boldsymbol{I})$ satisfies

\[ \alpha_{\mathcal{G}}(\lambda) \leq \frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}} + O(q^{3}\lambda^{2}/\sigma^{3}). \]

Notice that if the bound on $\alpha_{\mathcal{G}}$ does not change when some data points are filtered out after sampling, the differential privacy guarantee will not change either. We refer readers to (abadi2016deep, ) for more details.
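As a rough numerical illustration of how a moments bound turns into $(\epsilon,\delta)$, the sketch below uses the two properties of the moments accountant from Abadi et al.: moments add up across rounds, and $\delta = \min_{\lambda} \exp(\alpha(\lambda) - \lambda\epsilon)$. It keeps only the leading term of the bound in Lemma A.1 and drops the $O(\cdot)$ term, so the resulting numbers are illustrative only; the function names are hypothetical.

```python
import math

def moment_bound(q, sigma, lam):
    """Leading term of the Lemma A.1 bound (the O(.) term is ignored)."""
    return q * q * lam * (lam + 1) / ((1 - q) * sigma ** 2)

def epsilon(q, sigma, rounds, delta):
    """Approximate epsilon after `rounds` rounds for a target delta."""
    if q > 1 / (16 * sigma):
        raise ValueError("Lemma A.1 requires q <= 1/(16*sigma)")
    # The lemma holds for positive integers lam <= sigma^2 * ln(1/(q*sigma)).
    lam_max = math.floor(sigma ** 2 * math.log(1 / (q * sigma)))
    if lam_max < 1:
        raise ValueError("no admissible lambda for these parameters")
    # Composition: moments add across rounds.
    # Tail bound: epsilon = min over lam of (alpha(lam) + ln(1/delta)) / lam.
    return min(
        (rounds * moment_bound(q, sigma, lam) + math.log(1 / delta)) / lam
        for lam in range(1, lam_max + 1)
    )
```

Under these simplifications, lowering `q` while keeping the other arguments fixed lowers the returned value, which matches the intuition that less frequent participation costs less privacy.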

Claim A.1 ().

Let $\mathcal{J}$ be a sample from $[n]$ where each $i \in [n]$ is chosen independently with probability $q \leq \frac{1}{16\sigma}$. Let $\mathcal{J}' \subseteq \mathcal{J}$ be some arbitrary subset of $\mathcal{J}$. Then for any positive integer $\lambda \leq \sigma^{2} \ln \frac{1}{q\sigma}$, the function $\mathcal{G}'(d) = \sum_{i \in \mathcal{J}'} f(d_{i}) + \mathcal{N}(0, \sigma^{2}\boldsymbol{I})$ also satisfies

\[ \alpha_{\mathcal{G}'}(\lambda) \leq \frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}} + O(q^{3}\lambda^{2}/\sigma^{3}). \]
Proof.

Abusing notation slightly, define $\mathcal{G}_{f,\mathcal{J}}$ to be

\[ \mathcal{G}_{f,\mathcal{J}}(d) = \sum_{i \in \mathcal{J}} f(d_{i}) + \mathcal{N}(0, \sigma^{2}\boldsymbol{I}). \]

To prove this claim, since by definition $\mathcal{G}'(d) = \mathcal{G}_{f,\mathcal{J}'}(d)$, we just need to show

\[ \alpha_{\mathcal{G}_{f,\mathcal{J}'}}(\lambda) \leq \frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}} + O(q^{3}\lambda^{2}/\sigma^{3}). \]

To do this, suppose for now we have an $f'$ on $\mathcal{J}$ whose sensitivity is no greater than 1 and that gives the same output as $f$ on $\mathcal{J}'$, namely

\[ \|f'(\cdot)\|_{2} \leq 1, \quad \mathcal{G}_{f',\mathcal{J}}(d) = \mathcal{G}_{f,\mathcal{J}'}(d). \]

By applying Lemma A.1 to $\mathcal{G}_{f',\mathcal{J}}$, we get

\[ \alpha_{\mathcal{G}_{f',\mathcal{J}}}(\lambda) \leq \frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}} + O(q^{3}\lambda^{2}/\sigma^{3}). \]

Since $\mathcal{G}_{f',\mathcal{J}}(d) = \mathcal{G}_{f,\mathcal{J}'}(d)$,

\[ \alpha_{\mathcal{G}_{f',\mathcal{J}}}(\lambda) = \alpha_{\mathcal{G}_{f,\mathcal{J}'}}(\lambda). \]

Combining the preceding two equations, we get

\[ \alpha_{\mathcal{G}_{f,\mathcal{J}'}}(\lambda) = \alpha_{\mathcal{G}_{f',\mathcal{J}}}(\lambda) \leq \frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}} + O(q^{3}\lambda^{2}/\sigma^{3}). \]

The remaining task is to construct such an $f'$ on $\mathcal{J}$, where $\|f'(\cdot)\|_{2} \leq 1$, that gives the same output as $f$ on $\mathcal{J}'$. One possible $f'$ is as follows:

\[ f'(d_{i}) = \begin{cases} f(d_{i}) & \text{if } i \in \mathcal{J}', \\ 0 & \text{if } i \in \mathcal{J} - \mathcal{J}'. \end{cases} \]

It is easy to check that $\sum_{i \in \mathcal{J}'} f(d_{i}) = \sum_{i \in \mathcal{J}} f'(d_{i})$.

Next, for the sensitivity of $f'$, it is not difficult to see that $\|f'(\cdot)\|_{2} \leq \|f(\cdot)\|_{2}$, since by removing or adding one entry to the database, $f'$ will incur either the same change as $f$ or no change. ∎
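The construction in the proof can be checked on a toy example. The sketch below uses scalar data and a hypothetical helper `make_f_prime`; it is only meant to illustrate that summing $f'$ over the full sample equals summing $f$ over the filtered sample, and that $f'$ never has a larger magnitude than $f$.

```python
import math

def make_f_prime(f, kept):
    """Build f' from f: identical to f on indices in `kept`, zero elsewhere."""
    def f_prime(i, d_i):
        return f(d_i) if i in kept else 0.0
    return f_prime

# Toy data: f clips a scalar to [-1, 1], so |f(x)| <= 1.
f = lambda x: max(-1.0, min(1.0, x))
d = [0.3, -0.7, 1.5, 0.2]
J = [0, 1, 3]        # indices chosen by random sampling
J_kept = {0, 3}      # indices that survive the aggregator's filtering

f_prime = make_f_prime(f, J_kept)
sum_filtered = sum(f(d[i]) for i in sorted(J_kept))  # sum of f over J'
sum_full = sum(f_prime(i, d[i]) for i in J)          # sum of f' over J
```

Here `sum_filtered` and `sum_full` agree, so Lemma A.1 applied to $f'$ on the unfiltered sample covers the filtered aggregation as well.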

So far we have proved that if only a subset of selected devices are included in aggregation or more malicious devices’ data is included, the differential privacy guarantee will not be impacted.

Gaussian Noise (line 12). Recall that we also add some additional noise in line 12 in Figure 2. For the Gaussian noise, we need to prove additional noise will not impact privacy, which is not difficult to show, since this change is also equivalent to post-processing.

With the above proof, we have shown that our modified DP-FedAvg provides the same privacy guarantee as the original DP-FedAvg.

A.1.2. DP-noise committee

This section shows that the $\mathcal{N}(0, I\sigma^{2})$ part of line 12 of our modified DP-FedAvg is executed faithfully.

We have already covered in the generate phase (§4.2) that an $\mathcal{N}(0, I\sigma^{2})$ amount of noise will be generated by the DP-noise committee as long as the committee does not violate its threshold, that is, as long as fewer than $A$ out of the $C$ devices of the DP-noise committee are malicious. Thus, in this section we derive the probability that a committee has fewer than some threshold of honest members. We compute this probability in the same way as Honeycrisp (roth2019honeycrisp, ), which is a building block for Orchard (roth2020orchard, ).

Claim A.2 ().

(Aero) If a randomly sampled DP-noise committee has size $C$ and the probability of a committee member being malicious is $f$, then the probability that the committee has fewer than $(1-t) \cdot C$ honest members is upper-bounded by $p = e^{-fC} \left(\frac{ef}{t}\right)^{tC}$, when $1 > t \geq f$.

Proof.

We treat the events that each member is malicious as independent. Let $X_{i}$ be a random variable

\[ X_{i} = \begin{cases} 1, & \text{if member } i \text{ is malicious,} \\ 0, & \text{if member } i \text{ is honest.} \end{cases} \]

Let $X = \sum_{i} X_{i}$ be the random variable representing the number of malicious members. If $t \geq f$, the Chernoff bound shows that

\[ \Pr(X \geq tC) \leq e^{-fC} \left(\frac{ef}{t}\right)^{tC}. \]

So the probability of fewer than $(1-t) \cdot C$ members being honest will be upper-bounded by $p$.

With high probability (at least $1-p$), the committee has at least $(1-t)C$ honest members. As long as each honest member contributes $\frac{1}{(1-t)C} \cdot \mathcal{N}(0, \sigma^{2}\boldsymbol{I})$ noise, the total amount of noise will be no less than $\mathcal{N}(0, \sigma^{2}\boldsymbol{I})$.
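The last step can be spelled out as follows, reading each member's contribution as a Gaussian whose variance is a $\frac{1}{(1-t)C}$ share of $\sigma^{2}$ (an interpretation assumed here, since independent Gaussians add in variance). If $h \geq (1-t)C$ members are honest, their independent contributions sum to:

```latex
\[
\sum_{i=1}^{h} \mathcal{N}\!\left(0, \frac{\sigma^{2}}{(1-t)C}\,\boldsymbol{I}\right)
  = \mathcal{N}\!\left(0, \frac{h}{(1-t)C}\,\sigma^{2}\boldsymbol{I}\right),
\qquad \frac{h}{(1-t)C} \geq 1 .
\]
```

The variance of the total is therefore at least $\sigma^{2}$ in every coordinate, and any surplus plays the role of the additional noise in the modified line 12.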

To give some examples of what committee sizes could be: when $f = 0.03$, we may set $t = 1/7, C = 280$ to achieve $p = 4.1 \cdot 10^{-14}$; when $f = 0.05$, we may set $t = 1/8, C = 350$ to achieve $p = 9.78 \cdot 10^{-7}$, or $C = 450$ to achieve $p = 1.87 \cdot 10^{-8}$; and when $f = 0.10$, we may set $t = 1/5, C = 350$ to achieve $p = 1.34 \cdot 10^{-6}$, or $C = 450$ to achieve $p = 2.82 \cdot 10^{-8}$.
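These example values can be reproduced by evaluating the bound of Claim A.2 directly. The function name `committee_failure_bound` is hypothetical; the formula is the one stated in the claim.

```python
import math

def committee_failure_bound(f, t, C):
    """Chernoff bound p = e^{-fC} * (e*f/t)^{tC} from Claim A.2.

    f: probability that a committee member is malicious
    t: tolerated fraction of malicious members (requires f <= t < 1)
    C: committee size
    """
    if not (f <= t < 1):
        raise ValueError("the bound requires f <= t < 1")
    return math.exp(-f * C) * (math.e * f / t) ** (t * C)

# The parameter choices listed in the text.
examples = [
    (0.03, 1 / 7, 280),
    (0.05, 1 / 8, 350),
    (0.05, 1 / 8, 450),
    (0.10, 1 / 5, 350),
    (0.10, 1 / 5, 450),
]
bounds = [committee_failure_bound(f, t, C) for f, t, C in examples]
```

For fixed $f$ and $t$ with $t > f$, the bound shrinks exponentially in $C$, which is why growing the committee from 350 to 450 members gains roughly two orders of magnitude.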

A.1.3. Aggregation

This section proves that the additions in line 12 of our modified DP-FedAvg are executed faithfully in Aero; otherwise, the aggregator will be caught with high probability.

We will first prove the integrity of additions as in Aero’s add phase protocol and in Honeycrisp. Next, we will prove the freshness guarantee that no ciphertexts from previous rounds can be included in the current aggregation. Finally, we will show how these two results together imply that line 12 of modified DP-FedAvg is executed faithfully by Aero.

Integrity. We will start with the integrity claim from Honeycrisp (roth2019honeycrisp, ).

Claim A.3 ().

At the end of the add phase, if no device has found malicious activity by the aggregator $\mathcal{A}$, the sum of the ciphertexts published by $\mathcal{A}$ is correct (with high probability) and no inputs of malicious nodes are dependent on inputs of honest nodes (in the same round).

Proof.

We refer readers to Honeycrisp (roth2019honeycrisp, ) for more details. Here we will just give a short version for demonstration.

We will first prove no inputs of malicious devices are dependent on inputs of honest devices in the same round. Next we will prove if a malicious aggregator introduces an error into a summation tree, it will be caught with high probability.

First, assume for the sake of contradiction that it is possible for a malicious device to set its ciphertext to a $c$ that comes from an honest device in the current round. The adversary needs to produce a $t = \mathrm{Hash}(r \,\|\, c \,\|\, \pi)$ and include $t$ in the Merkle tree $MC$ before the honest device reveals $c$. Under the random oracle assumption, this is not possible. So no inputs of malicious devices are dependent on inputs of honest devices in the same round.
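The commit-then-reveal pattern behind this argument can be sketched as follows. This is an assumption-laden illustration rather than Aero's code: SHA-256 stands in for the random oracle, $r$, $c$, and $\pi$ are treated as opaque byte strings, and the length-prefixed encoding and the names `commit` and `verify` are hypothetical.

```python
import hashlib
import hmac

def commit(r, c, pi):
    """Return t = Hash(r || c || pi) with SHA-256 as the hash.

    Each field is length-prefixed so that the concatenation is unambiguous
    (an encoding choice made for this sketch).
    """
    h = hashlib.sha256()
    for field in (r, c, pi):
        h.update(len(field).to_bytes(4, "big"))
        h.update(field)
    return h.digest()

def verify(t, r, c, pi):
    """Check a revealed (r, c, pi) against a previously published t."""
    return hmac.compare_digest(t, commit(r, c, pi))
```

Because $t$ must be placed in the Merkle tree before any ciphertext is revealed, a device that wants to reuse an honest device's $c$ would have to produce a matching $t$ without knowing $c$.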

Next we need to show that if $\mathcal{A}$ introduces an error into a summation tree $ST$, it will be caught with high probability. In particular, here we will just show the case where $\mathcal{A}$ introduces an error into one of the leaf nodes. A similar analysis can be done for non-leaf nodes, and is presented in Honeycrisp (roth2019honeycrisp, ).

Suppose the total number of verifiers is $W$ and the total number of leaf nodes in one summation tree is $M'$ (Figure 4). Here we make the assumption that $M' \approx qW$, where $q$ is the sampling probability. We will justify this assumption later in this section.

Suppose $\mathcal{A}$ introduces an error into a particular leaf node $j \in [0, M'-1]$. The probability that an honest device picks a $v_{init}$ such that $j \in [v_{init}, v_{init}+s]$ is

\[ \frac{qs}{M'}. \]

Since there are at least $(1-f)W$ honest devices, the probability of no honest device checking $j$ is

\[ \left(1 - \frac{qs}{M'}\right)^{(1-f)W} = \left(1 - \frac{s}{W}\right)^{(1-f)W} \leq e^{-(1-f)s}. \]
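The leaf-node bound above can be evaluated numerically. The sketch assumes $M' \approx qW$, so that each honest verifier covers a given leaf with probability $s/W$; the function names and the population size used in the example are hypothetical.

```python
import math

def miss_probability(W, f, s):
    """Probability that none of the (1-f)*W honest verifiers checks a leaf,
    assuming each one covers it independently with probability s/W."""
    return (1 - s / W) ** ((1 - f) * W)

def miss_bound(f, s):
    """The exponential upper bound e^{-(1-f)s}; it does not depend on W."""
    return math.exp(-(1 - f) * s)

# Illustrative parameters: 10,000 verifiers, 3% malicious, s = 5.
p_miss = miss_probability(10_000, 0.03, 5)
p_bound = miss_bound(0.03, 5)
```

The inequality follows from $1 - x \leq e^{-x}$ with $x = s/W$, and since the bound is independent of $W$, the number of leaves $s$ that each verifier checks is the parameter that controls detection.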

Similarly, we can prove that if $\mathcal{A}$ introduces an error into non-leaf nodes, it will be caught with high probability. For instance, if $f=3\%, s=5$, the probability that the error goes unchecked is about 0.007; if $f=3\%, s=20$, this probability is about $10^{-11}$.
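As a numerical companion to the bound above, the following minimal Python sketch evaluates both the exact expression $(1-s/W)^{(1-f)W}$ and its upper bound $e^{-(1-f)s}$. The function names and the value of $W$ are illustrative assumptions and are not taken from Aero's implementation; the exact value depends on $W$, so the printed numbers need not coincide with the rounded figures quoted in the text.

```python
import math

def miss_probability(s: int, f: float, W: int) -> float:
    """Probability that none of the (1 - f) * W honest verifiers checks leaf j,
    assuming M' = q * W so that each verifier covers j with probability s / W."""
    return (1.0 - s / W) ** ((1.0 - f) * W)

def miss_probability_bound(s: int, f: float) -> float:
    """Upper bound e^{-(1 - f) s}, which does not depend on W."""
    return math.exp(-(1.0 - f) * s)

if __name__ == "__main__":
    W = 1_000_000  # hypothetical number of verifiers
    for s in (5, 20):
        exact = miss_probability(s, f=0.03, W=W)
        bound = miss_probability_bound(s, f=0.03)
        print(f"s={s}: exact={exact:.3e}, bound={bound:.3e}")
```

The exact value never exceeds the bound, and both shrink exponentially as $s$ grows.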

As noted above, the proof relies on an assumption about the threshold of Sybils (pseudonym leaf nodes) the aggregator can introduce into the summation tree. Intuitively, if the aggregator can introduce as many Sybils as it wants to make $M' \gg qW$, then the probability of each node being covered will decrease by a large factor, thus impacting privacy. For example, if $W=100, q=0.01$, $M'$ is expected to be close to $qW=1$ and the 100 verifiers expect to verify only 1 leaf node. However, if the aggregator introduces 99 additional leaf nodes into the summation tree while the 100 verifiers still expect to verify only 1 leaf node, many malicious nodes are very likely to be missed by the verifiers. Before claiming that there is a threshold of Sybils in Aero, we need to introduce an assumption from Honeycrisp (roth2019honeycrisp, ), which we will also use in our claim.

Assumption A.1.

(Honeycrisp) All devices know an upper bound $W_{max}$ and a lower bound $W_{min}$ of the number of potential participating devices in the system. If the true number of devices is $W_{tot}$, then by definition: $W_{min} \le W_{tot} \le W_{max}$. We assume $\frac{W_{max}-W_{tot}}{W_{min}}$ is always below some constant (low) threshold (this determines the portion of Sybils the aggregator $\mathcal{A}$ could make without getting caught). $\frac{W_{max}}{W_{min}}$ should also be below some (more generous) constant threshold.

With the above assumption, similarly, we claim in Aero:

Claim A.4.

(Aero) There is an upper bound on the portion of Sybils a malicious aggregator can introduce without being caught. Precisely, all devices know an upper bound $M_{max}$ and a lower bound $M_{min}$ of the number of potential leaf nodes in the summation tree. If the true number of leaf nodes is $M_{tot}$, then by definition: $M_{min} \le M_{tot} \le M_{max}$. In Aero, $\frac{M_{max}-M_{tot}}{M_{min}}$ is always below some constant (low) threshold (this determines the portion of Sybils the aggregator $\mathcal{A}$ could make without getting caught). $\frac{M_{max}}{M_{min}}$ will also be below some (more generous) constant threshold.

Proof.

Note that in this claim, by "leaf nodes", we mean leaf nodes corresponding to an honest generator device. Once we know an upper bound $M_{max}$ and a lower bound $M_{min}$ of the number of leaf nodes, in the worst case, a malicious aggregator could introduce at most $M_{max}-M_{min}$ Sybils. If the true number of leaf nodes is $M_{tot}$, then the aggregator could introduce at most $M_{max}-M_{tot}$ Sybils.

Let’s first consider $M_{min}$ and $M_{max}$.

Informally, if the number of devices $W$ is large enough, the number of leaf nodes (generators) is likely to be close to the expectation $qW$. Suppose for now there are two constants capturing this "closeness": $k_{min}$, $k_{max}$, where $k_{min} \approx 1, k_{max} \approx 1$. Then, we have

$$k_{min}\cdot qW_{min} \le M_{min} \le k_{max}\cdot qW_{min},$$
$$k_{min}\cdot qW_{max} \le M_{max} \le k_{max}\cdot qW_{max}.$$

Consider the fraction $\frac{M_{max}}{M_{min}}$:

$$\frac{M_{max}}{M_{min}} \le \frac{k_{max}\cdot qW_{max}}{k_{min}\cdot qW_{min}} = \frac{k_{max}}{k_{min}}\cdot\frac{W_{max}}{W_{min}}.$$

$\frac{W_{max}}{W_{min}}$ is below some constant threshold as in Assumption A.1. If $\frac{k_{max}}{k_{min}}$ is smaller than some constant threshold, then it’s reasonable to assume, as in Honeycrisp, that $\frac{M_{max}}{M_{min}}$ is below some constant threshold. We will discuss the values of $k_{max}$ and $k_{min}$ at the end of the proof.

As for the fraction $\frac{M_{max}-M_{tot}}{M_{min}}$, it’s not difficult to see that

$$\frac{M_{max}-M_{tot}}{M_{min}} \le \frac{k_{max}W_{max}-k_{min}W_{tot}}{k_{min}W_{min}}.$$

Similarly, if both $k_{max}$ and $k_{min}$ are close to 1, this fraction will be close to $\frac{W_{max}-W_{tot}}{W_{min}}$, which is below some small constant (low) threshold as specified in Assumption A.1. So in Aero, it is reasonable to assume this fraction is also below some constant (low) threshold.
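For completeness, the inequality for $\frac{M_{max}-M_{tot}}{M_{min}}$ follows by bounding the numerator and the denominator separately. The sketch below assumes, by analogy with the bounds on $M_{min}$ and $M_{max}$, that $M_{tot} \ge k_{min}\cdot qW_{tot}$; this bound on $M_{tot}$ is an assumption that the text does not state explicitly.

```latex
\begin{align*}
M_{max} - M_{tot} &\le k_{max}\cdot qW_{max} - k_{min}\cdot qW_{tot}, \\
M_{min} &\ge k_{min}\cdot qW_{min}, \\
\frac{M_{max}-M_{tot}}{M_{min}}
  &\le \frac{q\,(k_{max}W_{max} - k_{min}W_{tot})}{q\,k_{min}W_{min}}
   = \frac{k_{max}W_{max} - k_{min}W_{tot}}{k_{min}W_{min}}.
\end{align*}
```

The sampling probability $q$ cancels, and setting $k_{min} = k_{max} = 1$ recovers $\frac{W_{max}-W_{tot}}{W_{min}}$ from Assumption A.1.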

Lastly, let’s discuss $k_{max}$ and $k_{min}$. Consider a general case where the total number of devices is $W$ and the sampling probability is $q$. The number of sampled devices (leaf nodes) is $X$. The Chernoff bound states that

$$\Pr(X < k_{min}qW) < \left(\frac{e^{k_{min}-1}}{k_{min}^{k_{min}}}\right)^{qW}.$$

With $qW=5000$, $k_{min}=0.9$, the above probability will be smaller than $5.77\cdot 10^{-12}$. Similarly, with $qW=10000, k_{min}=0.93$, the probability will be smaller than $1.27\cdot 10^{-11}$. Since Aero is designed for large-scale training, it’s reasonable to assume $k_{min}$ is close to 1. The argument is similar for $k_{max}$ (e.g. $k_{max}=1.01$).
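The figures above can be reproduced with a few lines of Python. This is a minimal sketch of the Chernoff lower-tail bound evaluated in log space to avoid underflow; the function name is an illustrative choice.

```python
import math

def chernoff_lower_tail(k_min: float, expected: float) -> float:
    """Chernoff bound on Pr(X < k_min * expected) for 0 < k_min < 1:
    (e^{k_min - 1} / k_min^{k_min})^{expected}, evaluated in log space."""
    log_bound = expected * ((k_min - 1.0) - k_min * math.log(k_min))
    return math.exp(log_bound)

if __name__ == "__main__":
    for qW, k_min in ((5000, 0.9), (10000, 0.93)):
        print(f"qW={qW}, k_min={k_min}: bound={chernoff_lower_tail(k_min, qW):.2e}")
```

Because the exponent scales linearly with $qW$, a smaller expected number of generators requires a noticeably smaller $k_{min}$ to reach the same failure probability.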

However, if $qW$ is, for example, 500, to achieve a similar probability of $1.17\cdot 10^{-11}$, $k_{min}$ will be about 0.7. In this setting, it’s not reasonable to assume $M' = qW$ as in Claim A.3 any more. Instead one will need to assume a different bound, e.g. $M' < 2qW$, and re-calculate the probability. For example, if $M' = 2qW$, in Claim A.3, the probability of an honest device checking a particular leaf node will still be the same. However, the probability of no honest device checking a particular node will be instead

$$\left(1-\frac{qs}{M'}\right)^{(1-f)W} < \left(1-\frac{s}{2W}\right)^{(1-f)W} \le e^{-(1-f)s/2}$$

Verifiers therefore need to verify more nodes in the summation tree (a larger $s$) for Claim A.3 to hold and thus to ensure privacy. ∎
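The adjustment can be made explicit. As a sketch, suppose $M' \le c\cdot qW$ for some constant $c \ge 1$ (the symbol $c$ is introduced here for illustration only; $c = 2$ corresponds to the example in the proof). Then the probability that no honest device checks a particular node satisfies:

```latex
\left(1-\frac{qs}{M'}\right)^{(1-f)W}
  \le \left(1-\frac{s}{cW}\right)^{(1-f)W}
  \le e^{-(1-f)s/c}
```

Under this assumption, having each verifier check $c\cdot s$ nodes instead of $s$ restores the original bound $e^{-(1-f)s}$.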

Freshness. Reusing the encryption key will not affect the security proof in each round, but will lead to attacks across rounds, where information from previous rounds can be leaked. The adversary may obtain the victim device’s ciphertext $Enc(m)$ from a previous round and use $k\cdot Enc(m)$, where $k$ is a large constant, to participate in a later round. Since $k$ is large, $k\cdot Enc(m)$ will dominate the aggregated result. After decryption, the adversary will be able to learn approximately $m$. We define freshness as the guarantee that only freshly generated ciphertexts are included in the aggregation.
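To see why a scaled stale ciphertext dominates the aggregate, the following sketch works directly on plaintext values. It relies only on the additive homomorphism of the encryption scheme (scaling a ciphertext by $k$ scales its plaintext by $k$) and ignores wrap-around at the plaintext modulus. All names and numbers are illustrative assumptions.

```python
import random

def aggregate_with_replay(victim_update: int, other_updates: list[int], k: int) -> int:
    """Plaintext view of the aggregate when the adversary submits k * Enc(m):
    by additive homomorphism the decrypted sum is k * m + sum(others)."""
    return k * victim_update + sum(other_updates)

def recover_victim_update(aggregate: int, k: int) -> int:
    """Adversary's estimate of m: divide the decrypted aggregate by k and
    round to the nearest integer (integer arithmetic only)."""
    return (aggregate + k // 2) // k

if __name__ == "__main__":
    rng = random.Random(0)
    m = 37                                                   # victim's value from an earlier round
    others = [rng.randint(-100, 100) for _ in range(1000)]   # current-round updates
    k = 10**9                                                # large scaling constant
    total = aggregate_with_replay(m, others, k)
    # The estimate equals m whenever |sum(others)| < k / 2.
    print(recover_victim_update(total, k))
```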

We need to prove that if each device puts the round number in the first slot of its ciphertext and generates a corresponding ZK proof, and this ZK proof is verified in the add phase, then the adversary will not be able to use ciphertexts from previous rounds in the current round.

First, according to the knowledge soundness property of zkSNARK, which states that it is not possible for a prover to construct a valid proof without knowing the witness (e.g. secret inputs), an adversary cannot construct a valid proof for a new round (nitulescuzk, ). This means the adversary can only insert an invalid proof into the summation tree.

Next, if the adversary inserts a ciphertext from a previous round with an invalid proof into the summation tree, it will be caught with high probability, since, as claimed before, each leaf node will be checked by some honest device with high probability. That honest device will detect the error.

How Aero supports modified DP-FedAvg. We’ve covered the integrity and freshness of Aero’s add phase/aggregation. Next we’ll show the modified DP-FedAvg will be executed faithfully. In order to show this, we claim the following:

Claim A.5.

Data from honest generator devices will be included at most once in the aggregation; data from honest DP-noise committee members will be included exactly once.

Proof.

It’s not difficult to see that both honest generators’ and honest DP-noise committee members’ data will be included at most once, as defined in Claim A.3. So for honest generator devices, the proof is done.

As for honest DP-noise committee members, we just need to prove that data from an honest member will be included (at least once). Recall that an honest committee member is always online during a round; otherwise, it is considered malicious. Also recall that in the add phase, after the aggregator constructs the summation tree and the Merkle tree, the aggregator needs to send each device a Merkle proof that its data is included in the summation tree (Step 7 in Figure 4). An honest committee member can thus make sure its data is included in the aggregation by verifying this Merkle proof. ∎
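The inclusion check in the proof above is a standard Merkle proof verification. The sketch below illustrates the idea with SHA-256; the tree layout, the domain-separation prefixes, and the rule of duplicating the last node on odd levels are assumptions made for illustration and do not describe Aero's actual Merkle tree format.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root_and_proof(leaves: list[bytes], index: int) -> tuple[bytes, list[tuple[bytes, bool]]]:
    """Aggregator side: build a Merkle tree over `leaves` and return the root
    plus the sibling path for leaves[index]. Each path entry is
    (sibling_hash, sibling_is_on_the_right)."""
    level = [_h(b"leaf:" + leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last node on odd levels
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        level = [_h(b"node:" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof

def verify_inclusion(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Device side: recompute the root from the leaf and the sibling path."""
    node = _h(b"leaf:" + leaf)
    for sibling, sibling_is_right in proof:
        pair = node + sibling if sibling_is_right else sibling + node
        node = _h(b"node:" + pair)
    return node == root

if __name__ == "__main__":
    leaves = [f"ciphertext-{i}".encode() for i in range(5)]
    root, proof = merkle_root_and_proof(leaves, 3)
    print(verify_inclusion(leaves[3], proof, root))
```

A device only needs its own leaf, the logarithmic-size sibling path, and the published root to run `verify_inclusion`.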

This claim captures the requirement for faithfully executing the modified DP-FedAvg.

Firstly, data from honest generators (including online and offline) will be included at most once, which is exactly what our new sampling method does (Appendix §A.1.1): some honest generators might be filtered while others not; those included in the aggregation will be added exactly once.

Secondly, data from honest DP-noise committee members (only online) will be included exactly once, which ensures that enough Gaussian noise will be added.

Combined with the integrity and freshness of the underlying aggregation, this ensures that the modified DP-FedAvg protocol is executed faithfully in Aero.

A.1.4. Decryption

The last step in Aero’s protocol is decryption. In this section, we’ll prove Aero’s decryption protocol will not leak any information, except the decrypted result.

Let’s first review the BFV scheme (brakerski2012fully, ; fan2012somewhat, ). Let the field for the coefficients of the ciphertext polynomials be $Q$, the polynomials themselves be from a polynomial ring $R_Q$, the distribution for the coefficients of error polynomials be the required Gaussian distribution $\phi$ (standard deviation = 3.2), and the secret key $sk$ be a polynomial of the same degree $N$ as the ciphertext polynomials but with coefficients from the ternary distribution ($\{-1,0,1\}$), which we denote as $\psi$. Then, given a small constant $\gamma \ll 1$, the BFV scheme has the following procedures:

  • $Keygen$: $s \leftarrow \psi, a \leftarrow R_Q, e \leftarrow \phi$. Compute $b = as + e$ and output $pk = (a, b)$ and $sk = s$

  • $Enc(pk, m)$: $e_1 \leftarrow \phi, e_2 \leftarrow \phi, r \leftarrow \psi$, output $(c_1 = ar + e_1, c_2 = br + e_2 + m/\gamma)$

  • $Dec(c_1, c_2)$: output $m = \lfloor (c_2 - c_1 s)\cdot\gamma \rceil$
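The three procedures above can be sketched in plain Python as follows. This is a toy illustration only: the parameters ($N = 16$, $Q = 2^{20}$, plaintext modulus $T = 16$) are far too small to be secure, the ring is taken to be $\mathbb{Z}_Q[x]/(x^N+1)$, and $\gamma$ is instantiated as $T/Q$. These choices are assumptions made for the example, not Aero's parameters.

```python
import random

N, Q, T = 16, 1 << 20, 16      # toy ring degree, ciphertext modulus, plaintext modulus
DELTA = Q // T                 # plays the role of 1 / gamma, i.e. gamma = T / Q

def poly_mul(a, b):
    """Multiply in Z_Q[x] / (x^N + 1) (negacyclic convolution)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % Q
            else:                          # x^N = -1 in this ring
                out[k - N] = (out[k - N] - ai * bj) % Q
    return out

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def poly_sub(a, b):
    return [(x - y) % Q for x, y in zip(a, b)]

def sample_error(rng):                     # phi: rounded Gaussian, sigma = 3.2
    return [round(rng.gauss(0, 3.2)) % Q for _ in range(N)]

def sample_ternary(rng):                   # psi: coefficients in {-1, 0, 1}
    return [rng.choice((-1, 0, 1)) % Q for _ in range(N)]

def keygen(rng):
    s = sample_ternary(rng)
    a = [rng.randrange(Q) for _ in range(N)]
    e = sample_error(rng)
    b = poly_add(poly_mul(a, s), e)        # b = a*s + e
    return (a, b), s

def encrypt(pk, m, rng):
    a, b = pk
    r, e1, e2 = sample_ternary(rng), sample_error(rng), sample_error(rng)
    c1 = poly_add(poly_mul(a, r), e1)      # c1 = a*r + e1
    scaled_m = [(DELTA * x) % Q for x in m]
    c2 = poly_add(poly_add(poly_mul(b, r), e2), scaled_m)  # c2 = b*r + e2 + m/gamma
    return c1, c2

def decrypt(sk, ct):
    c1, c2 = ct
    noisy = poly_sub(c2, poly_mul(c1, sk))  # m/gamma + (r*e + e2 - s*e1)
    return [round(x * T / Q) % T for x in noisy]

if __name__ == "__main__":
    rng = random.Random(1)
    pk, sk = keygen(rng)
    m = [rng.randrange(T) for _ in range(N)]
    print(decrypt(sk, encrypt(pk, m, rng)) == m)
```

Decryption succeeds because the residual term $re + e_2 - se_1$ is much smaller than half of `DELTA`, so rounding removes it.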

Since $m$ will finally be known to the aggregator and revealing $m$ is safe, without loss of generality, assume $(c_1, c_2)$ is the encryption of 0. Further, as defined in the release phase (§4.4),

$$e_{small} = c_2 - c_1\cdot s = re + e_2 - se_1.$$

This small error must remain hidden during the decryption process; otherwise, it may reveal information about the secret key $s$ or the polynomial $r$. To hide $e_{small}$, Aero’s scheme applies the smudging lemma (asharov2012multiparty, ). This lemma states that to achieve $2^{-\lambda}$ statistical distance between $e_{smudging}$ and $e_{small}+e_{smudging}$, $e_{smudging}$ just needs to be sampled from a uniform distribution whose bound is $\lambda$ bits more than the upper bound of $e_{small}$. We denote this smudging distribution by $\phi'$.
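The effect of smudging can be checked exactly for small parameters. The sketch below computes the statistical distance between a uniform sample on $[-R, R]$ and the same sample shifted by a fixed small error, in the standard formulation of the smudging lemma, with $R$ chosen $\lambda$ bits larger than the bound on the small error. The parameter values and function names are illustrative assumptions.

```python
from fractions import Fraction

def statistical_distance_shifted_uniform(bound: int, shift: int) -> Fraction:
    """Exact statistical distance between U and U + shift, where U is uniform
    on the integers in [-bound, bound]."""
    support = 2 * bound + 1
    p = Fraction(1, support)
    lo, hi = -bound + min(0, shift), bound + max(0, shift)
    total = Fraction(0)
    for z in range(lo, hi + 1):
        in_u = -bound <= z <= bound
        in_shifted = -bound <= z - shift <= bound
        total += abs((p if in_u else 0) - (p if in_shifted else 0))
    return total / 2

def smudging_bound(e_small_bound: int, lam: int) -> int:
    """Smudging range that is lam bits wider than the bound on e_small."""
    return e_small_bound << lam

if __name__ == "__main__":
    B, lam = 4, 6                      # toy bound on |e_small| and toy lambda
    R = smudging_bound(B, lam)
    worst = max(statistical_distance_shifted_uniform(R, e) for e in range(-B, B + 1))
    print(worst, worst < Fraction(1, 2 ** lam))
```

For a shift $e$ the distance is $|e| / (2R + 1)$, which stays below $2^{-\lambda}$ whenever $|e| \le B$ and $R = B \cdot 2^{\lambda}$.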

Recall that in the release phase, the decryption committee reveals $c_1\cdot s + e_{smudging}$ to the aggregator, where $e_{smudging} \leftarrow \phi'$. So the adversary’s view is

$$(c_2, c_1, c_1 s + e_{smudging}),$$

which is equivalent to

$$(c_2, c_1, -c_2 + c_1 s + e_{smudging}).$$

If we can prove this view is indistinguishable from

$$(c_2, c_1, e_{smudging}')$$

where $e_{smudging}'$ is some freshly sampled error from the smudging distribution $\phi'$, then revealing $c_1\cdot s + e_{smudging}$ to the aggregator will not leak more information than telling the aggregator a uniformly random number, since $(c_1, c_2)$ are already known to the aggregator.

To prove the above claim, let’s start with

$$(c_2, c_1, -c_2 + c_1 s + e_{smudging}).$$

Expanding $c_2$ and $c_1$, we get that the above is the same as

$$(asr + re + e_2, ar + e_1, -re - e_2 + se_1 + e_{smudging}).$$

With the smudging lemma, the above is indistinguishable from

$$((as+e)r + e_2, ar + e_1, e_{smudging}').$$

Notice that $e_{smudging}'$ has nothing to do with the secret key $s$, and we can apply the Ring-LWE assumption to convert $(as+e, a)$ back to $(b, a)$, as otherwise $(as+e, a)$ is not indistinguishable from $(b, a)$. The above is indistinguishable from

$$(br + e_2, ar + e_1, e_{smudging}') = (c_2, c_1, e_{smudging}').$$

Since $e_{smudging}'$ depends on neither the secret key nor the honest devices' data, revealing the partial decryption result does not leak information about either of them.
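The smudging step in the argument above can be illustrated numerically. The following is a minimal sketch, not the paper's construction: it assumes the smudging noise is uniform over an integer interval $[-B, B]$ and works with scalars rather than ring elements; the names `uniform_pmf`, `shift`, `statistical_distance`, `small_error`, and `smudging_bound` are hypothetical. It computes the exact statistical distance between fresh smudging noise and the same noise shifted by a small, key-dependent error.

```python
from fractions import Fraction


def uniform_pmf(bound):
    """PMF of the uniform distribution on the integers in [-bound, bound]."""
    p = Fraction(1, 2 * bound + 1)
    return {x: p for x in range(-bound, bound + 1)}


def shift(pmf, offset):
    """PMF of X + offset, where X is distributed according to pmf."""
    return {x + offset: p for x, p in pmf.items()}


def statistical_distance(p, q):
    """Total variation distance between two PMFs given as dicts."""
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0) - q.get(x, 0)) for x in support) / 2


# Assumed toy values: a small error term hidden by much larger noise.
small_error = 3          # stands in for a term such as -re - e_2 + s*e_1
smudging_bound = 1000    # B: the smudging noise is uniform on [-B, B]

fresh = uniform_pmf(smudging_bound)      # plays the role of e'_smudging
smudged = shift(fresh, small_error)      # small error + e_smudging
dist = statistical_distance(fresh, smudged)
print(dist)
```

Under these assumptions the distance is exactly $|e| / (2B + 1)$ for a shift $e$ with $|e| \le 2B + 1$, so it shrinks as the smudging bound grows relative to the error being hidden. This is the sense in which the smudged term reveals almost nothing about the smaller one.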

A.2. Details of the setup phase

During the setup phase, Aero (i) forms the master committee, which then (ii) receives and validates inputs for the round, and (iii) generates keys for cryptographic primitives. We present the details for only the second piece here, as the first and the third are discussed in detail earlier (§4).

For the second piece, the master committee must check, before launching a training task, that enough privacy budget remains to run it. To do so, the committee members compute the new DP parameters and check that they are below a threshold. The details are as follows.

Recall that once the master committee is formed, each committee member receives the model parameters $\theta^{t}$ for the current round $t$, the user selection probability $q$, noise scale $z$, and clipping bound $S$ from the aggregator for this round of training (required by DP-FedAvg; Figure 2).

Each committee member locally computes new values of the DP parameters $\epsilon, \delta$ using the moment accounting algorithm $\mathcal{M}$ (line 14 in DP-FedAvg). This computation requires the DP parameters $\epsilon', \delta'$ from round $t-1$ in addition to the inputs $z, q$. The committee member downloads $\epsilon', \delta'$ from a public bulletin board, where they are signed by more than a threshold of honest members of the previous round's master committee, and then calculates the new $\epsilon, \delta$. If the new values are below their recommended values, the committee member signs a certificate containing the parameters ($\theta^{t}$, $q$, $z$, $S$), the new values of $\epsilon, \delta$, and the keys for cryptographic primitives, and publishes it to the bulletin board to start this training task.
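The control flow of this check can be sketched as follows. This is a hedged illustration, not Aero's implementation: all names (`SignedBudget`, `Certificate`, `check_and_certify`, `placeholder_accountant`) are hypothetical, signature verification is assumed to have already happened, and the certificate omits the model parameters and keys for brevity. In particular, `placeholder_accountant` is an arbitrary stand-in used only to exercise the logic; it is not the moment accounting algorithm $\mathcal{M}$.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Budget = Tuple[float, float]  # an (epsilon, delta) pair
# Assumed accountant interface: (previous budget, z, q) -> new budget.
Accountant = Callable[[Budget, float, float], Budget]


@dataclass(frozen=True)
class SignedBudget:
    """DP parameters from round t-1 as read from the bulletin board."""
    budget: Budget
    num_valid_signatures: int  # assumed to be verified elsewhere


@dataclass(frozen=True)
class Certificate:
    """What a member would sign; model parameters and keys are omitted here."""
    q: float
    z: float
    clip_bound: float
    budget: Budget


def check_and_certify(prev: SignedBudget, sig_threshold: int,
                      q: float, z: float, clip_bound: float,
                      accountant: Accountant,
                      max_budget: Budget) -> Optional[Certificate]:
    """Return a certificate if the round fits in the budget, else None."""
    # The previous parameters need more than a threshold of signatures.
    if prev.num_valid_signatures <= sig_threshold:
        return None
    eps, delta = accountant(prev.budget, z, q)
    max_eps, max_delta = max_budget
    # Both new values must be strictly below their recommended values.
    if eps >= max_eps or delta >= max_delta:
        return None
    return Certificate(q=q, z=z, clip_bound=clip_bound, budget=(eps, delta))


def placeholder_accountant(prev: Budget, z: float, q: float) -> Budget:
    """Arbitrary stand-in, NOT the moments accountant: adds a made-up cost."""
    eps, delta = prev
    return (eps + q / z, delta + 1e-7)


prev_round = SignedBudget(budget=(1.0, 1e-6), num_valid_signatures=7)
cert = check_and_certify(prev_round, sig_threshold=5, q=0.5, z=1.0,
                         clip_bound=1.0, accountant=placeholder_accountant,
                         max_budget=(2.0, 1e-5))
print(cert)
```

Passing the accountant in as a callable keeps the budget check separate from the accounting method, so a real implementation of $\mathcal{M}$ could be substituted without changing the surrounding logic.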