Federated learning with differential privacy and an untrusted aggregator (technical report)

Kunlong Liu (University of California, Santa Barbara) and Trinabh Gupta (University of California, Santa Barbara)

Abstract.

Federated learning for training models over mobile devices is gaining popularity. Current systems for this task exhibit significant trade-offs between model accuracy, privacy guarantee, and device efficiency. For instance, Oort (OSDI 2021) provides excellent accuracy and efficiency but requires a trusted central server. On the other hand, Orchard (OSDI 2020) provides good accuracy and the rigorous guarantee of differential privacy over an untrusted server, but creates huge overhead for the devices. This paper describes Aero, a new federated learning system that significantly improves this trade-off. Aero provides good accuracy and differential privacy over an untrusted server while keeping the device overhead low. The key idea of Aero is to tune system architecture and design to a specific set of popular federated learning algorithms. This tuning requires novel optimizations and techniques, e.g., a new protocol to securely aggregate updates from devices. An evaluation of Aero demonstrates that it provides comparable accuracy to plain federated learning (without differential privacy), and that it improves efficiency (cpu and network) over Orchard by up to $10^{5}\times$ .


1. Introduction

Federated learning (FL) is a recent paradigm in machine learning that embraces a decentralized training architecture (mcmahan2017communication, ). In contrast to the traditional, central model of learning where users ship their training data to a central server, users in FL download the latest model parameters from the server, perform local training to generate updates to the parameters, and send only the updates to the server. Federated learning has gained popularity in training models for mobiles (hard2018federated, ; hartmann2019federated, ; kairouz2021practical, ; DPFTRL, ) as it can save network bandwidth and it is privacy-friendly—raw data stays at the devices.

Current systems for federated learning exhibit significant trade-offs between model accuracy, privacy, and device efficiency. For instance, one class of systems that includes Oort (lai2021oort, ), FedScale (fedscale-icml22, ), and FedML (chaoyanghe2020fedml, ) provides excellent accuracy (comparable to centralized learning) and device efficiency. But these systems provide only a weak notion of privacy. This point is subtle. At first glance, it appears that in any federated learning system, since users ship updates to model parameters rather than the raw training data, this data (user images, text messages, search queries, etc.) remains confidential. However, research shows that updates can be reverse-engineered to reveal the raw data (zhu2019deep, ; melis2019exploiting, ; shokri2017membership, ). Thus, if the server is compromised, so is the users’ data. In other words, the server must be trusted.

On the other hand, systems such as HybridAlpha (xu2019hybridalpha, ) and Orchard (roth2020orchard, ) offer good accuracy and a differential privacy guarantee for users’ data. Informally, differential privacy says that an adversary cannot deduce a user’s training data by inspecting the updates or the learned model parameters (dwork2011firm, ; dwork2006calibrating, ; dwork2014algorithmic, ; abadi2016deep, ). In fact, Orchard guarantees differential privacy while assuming a byzantine server. But the downside is the high overhead for the devices. For example, to train a CNN model with 1.2 million parameters (reddi2020adaptive, ), Orchard requires from each device $\approx$ 14 minutes of training time on a six-core processor and $\approx$ 840 MiB in network transfers per round of training (§6.2). The full training requires at least a few hundred rounds. Further, for a few randomly chosen devices, this per-round cost spikes to $\approx$ 214 hours of cpu time and $\approx$ 11 TiB of network transfers. Clearly, this is quite high.¹

¹ Another class of systems provides a particular type of differential privacy called local differential privacy (LDP) (duchi2013local, ; ijcai2021-217, ; truex2020ldp, ). These systems are efficient but LDP creates a high accuracy loss (truex2020ldp, ; grafberger2021fedless, ; ijcai2021-217, ) (§2.3, §7).

This paper describes a new federated learning system, Aero, that significantly improves the tradeoff between accuracy, privacy, and device overhead. Aero provides good accuracy, the differential privacy guarantee in the same threat model as Orchard, and low device overhead. For instance, most of the time Aero’s devices incur overhead in milliseconds of cpu time and KiBs of network transfers.

The key idea in Aero is that it does not aim to be a general-purpose federated learning system, but instead focuses on a particular class of algorithms (§3.1). These algorithms sample devices that contribute updates in a round using a simple probability parameter (e.g., a device is selected with a probability of $10^{-5}$ ), then aggregate updates across devices by averaging them, and generate noise needed for differential privacy from a Gaussian distribution. Admittedly, this is only one class of algorithms, but this class comprises popular algorithms such as DP-FedAvg (mcmahan2017learning, ) and DP-FedSGD (mcmahan2017learning, ) that are the ones commonly used and deployed (hard2018federated, ; hartmann2019federated, ; fedscale-icml22, ; chaoyanghe2020fedml, ). With this restriction, Aero tunes system architecture and design to these algorithms, thereby gaining on performance by orders of magnitude.

This tuning is non-trivial and requires novel techniques and optimizations (§3.2, §4). As one example, the devices must verify that the byzantine server did not behave maliciously. One prior technique is to use a summation tree (roth2019honeycrisp, ), where the server explicitly shows its work aggregating updates across devices in a tree form; the devices then collectively check nodes of this tree. This checking, in turn, adds overhead to the devices. Aero addresses this tension between privacy and overhead by leveraging the sampling characteristic of DP-FedAvg and similar algorithms: the total number of devices that participate in the system (e.g., one billion) is much larger than the ones that are sampled to contribute updates. Leveraging this characteristic, Aero employs multiple, finer-grained summation trees (rather than a monolithic tree) to massively divide the checking work across the large device pool (§4.3). Aero further optimizes how each device verifies the nodes of the tree using a technique called polynomial identity testing (§4.3). The aforementioned is just one example of optimization; Aero uses multiple throughout its architecture (§3.2) and design (§4).

We implemented Aero by extending the FedScale FL system (fedscale-icml22, ) (§5). FedScale supports plain federated learning without differential privacy or protection against a byzantine server. However, it is flexible, allows a programmer to specify models in the PyTorch framework (pytorch, ), and includes a host of models and datasets, with model sizes ranging from 49K to 3.9M parameters, for easy evaluation. Our evaluation of Aero’s prototype (§6) shows that Aero trains models with comparable accuracy to FedScale, in particular, the plain FedAvg algorithm in FedScale (§6.1). Aero also improves overhead relative to Orchard by up to five orders of magnitude, to a point where the overhead is low to moderate. For instance, for a 1.2M parameter CNN on the FEMNIST dataset (reddi2020adaptive, ; cohen2017emnist, ), and for a total population of $10^{9}$ devices where $10^{4}$ contribute updates per round, an Aero device requires 15 ms of cpu time and 3.12 KiB of network transfers per round. Occasionally (with a probability of $10^{-5}$ in a round) this overhead increases when a device contributes updates, to a moderate 13.4 minutes of latency on a six-core processor and 234 MiB in network transfers.

Prior to Aero, one had to choose two of the three properties of high accuracy, rigorous privacy guarantee, and low device overhead. With Aero, one can train models in a federated manner with a balance across these properties, at least for a particular class of federated learning algorithms. Thus, Aero’s main contribution is that it finally shows a way for a data analyst to train models while asking the data providers to place no trust in the analyst or their company.

2. Problem and background

This section outlines the problem and gives a short background on Orchard (roth2020orchard, ) that forms both a baseline and an inspiration for Aero.

2.1. Scenario and threat model

We consider a scenario consisting of a data analyst and a large number of mobile devices, e.g., hundreds of millions. The analyst, perhaps at a large company such as Google, is interested in learning a machine learning model over the data on the devices. For instance, the analyst may want to train a recurrent neural network (RNN) to provide auto-completion suggestions for the Android keyboard (hard2018federated, ).

One restriction we place on this scenario is that the training must be done in a federated manner. (We refer the reader to prior work (tan2021cryptgpu, ; knott2021crypTen, ) for a discussion on training in the centralized, non-federated model.) As noted earlier (§1), federated learning proceeds in rounds, where in each round devices download the latest model parameters from a server, generate updates to these parameters locally, and send the updates to the server. The server aggregates the model updates. This repeats until the model achieves a target accuracy.

In this scenario, a malicious server, or even a malicious device, can execute many attacks. For instance, a malicious server can infer the training data of a device from the updates contributed by the device (zhu2019deep, ; melis2019exploiting, ; shokri2017membership, ). Similarly, a malicious device that receives model parameters from the server can execute an inference attack to learn another device’s input.

We assume the strong OB+MC threat model from Orchard. The server is honest-but-curious most of the time but occasionally byzantine (OB), while the devices are mostly correct (MC), but a small fraction can be malicious. The rationale behind the occasionality of the server’s maliciousness is that the server’s operator, e.g., Google, is reputed and subject to significant scrutiny from the press and the users, and thus unlikely to be byzantine for long. However, it may occasionally come under attack, e.g., from a rogue employee. The rationale behind the smallness of the fraction of malicious devices is that with billions of devices, it is unlikely that an adversary will control more than a small percentage. For instance, for a billion devices, only controlling 3% is already significantly larger than a large botnet. We further assume that a configurable percentage of honest devices may be offline during any given round of training.

2.2. Goals

Under the OB+MC threat model, we want our system to meet the following goals.

Privacy. It must guarantee the gold standard definition of privacy, i.e., differential privacy (DP) (dwork2011firm, ; dwork2006calibrating, ; dwork2014algorithmic, ; abadi2016deep, ). Informally, a system offers DP for model training if the probability of learning a particular set of model parameters is (approximately) independent of whether a device’s input is included in the training. This means that DP prevents inference attacks (Naseri2022LocalAC, ) where a particular device’s input is revealed, as models are (approximately) independent of its input.

Accuracy. During periods when the server or the devices that contribute in a round are not byzantine, our system must produce models with accuracy comparable to models trained via plain federated learning. That is, we want the impact of differential privacy to be low. Further, we want the system to mitigate a malicious device’s impact on accuracy and prevent it from supplying arbitrary updates.

Efficiency and scalability. We want the system to support models with a large number of parameters while imposing a low to moderate device-side overhead. For the former, a reference point is the Android keyboard auto-completion model (an RNN) with 1.4M parameters (hard2018federated, ). For the device overhead, if a device participates regularly in training, e.g., in every round, then it should incur no more than a few seconds of cpu and a few MiBs in network transfers per round. However, we assume that devices can tolerate occasional amounts of additional work, contributing tens of minutes of cpu and a few hundred MiBs in network transfers.

2.3. Possible solution approaches

Meeting all of the goals described above is quite challenging. For illustration, consider the following solution approaches.

Local differential privacy. One option to guarantee differential privacy is to pick a federated learning algorithm that incorporates local differential privacy (LDP) (duchi2013local, ). In LDP-based federated learning, devices add statistical noise to their updates before uploading them to the server. The added noise protects updates against a malicious server which now cannot execute an inference attack, but LDP significantly degrades model accuracy relative to plain federated learning as every device must add noise. For instance, the LDP-FL (ijcai2021-217, ) system trains a VGG model over CIFAR10 with 10% accuracy compared to 62% with plain federated learning.

Trusted server. One alternative is to use central differential privacy (abadi2016deep, ), where a central entity adds a smaller amount of noise to device updates to ensure differential privacy. This approach mitigates the accuracy issue, but provides weak to no privacy as the central entity sees devices’ updates.

Server-side secure multiparty computation (MPC). One way to reduce trust in the central server is to break it down into multiple non-colluding pieces, e.g., three servers, that run in separate administrative domains. Then, one would run a secure multi-party computation protocol (yao1982protocols, ; goldreich2019play, ) among these servers such that they holistically perform the required computation (noise generation, addition, etc.) while no individual server sees the input or intermediate state of the computation. The problem is that we must still put significant trust in the server—that an adversary cannot compromise, say, two administrative domains.

Large-scale MPC. One can remove trust in the server by instead running MPC among the devices (essentially the devices perform the server’s work). The problem now shifts to efficiency and scalability: general-purpose MPC protocols are expensive and do not scale well with the number of participants (scaleMamba, ; damgaard2012multiparty, ). Indeed, scaling MPC to a few hundred or thousand participants is an active research area (gordon2021more, ; ben2021large, ), let alone hundreds of millions of participants.

State-of-the-art: Orchard.

Figure 1. An overview of Orchard (roth2020orchard, ). $\Delta_{k}$ denotes $k$ -th device’s update. The superscript $t$ denotes the round number. Orchard runs the four phases of setup, generate, add, and release for every round.

Orchard (roth2020orchard, ) takes a middle ground. It runs small(er)-scale MPC among devices while assuming an (occasionally) byzantine server. In particular, it forms a committee of a few tens of devices picked randomly from the entire population of devices; this committee then runs MPC among its members. Figure 1 shows an overview of Orchard. Orchard supports the full-batch gradient descent algorithm where all devices contribute updates in a round. For every round, Orchard runs the following four phases.

In the setup phase, Orchard samples the committee. Its members use MPC to generate keys for two cryptographic primitives: additive homomorphic encryption (AHE) (rivest1978data, ; fan2012somewhat, ) and zero-knowledge proof (ZK-proof) (groth16on, ; goldwasser2019knowledge, ).

In the generate phase, devices download the latest model parameters from the server. After local training, they encrypt their model updates (denoted by $\Delta$ in Figure 1) and generate a ZK-proof to prove to the server that the ciphertexts are well-formed and the model updates are bounded. The encryption hides the updates from the byzantine server and the proof limits the impact of malicious devices (they cannot supply arbitrary updates and thus ruin the model accuracy).

In the add phase, the server homomorphically adds the encrypted updates. The server also generates proof that it performed the addition correctly so that all devices can collaboratively verify the addition. This verification is necessary to prevent a byzantine server from launching subtle attacks to break DP. (We will discuss these attacks further in §4.3.)

In the release phase, the committee from the setup phase uses MPC to decrypt the output ciphertexts from the add phase. The committee also generates and adds the DP noise to the output, before releasing it, to guarantee (central) DP.

The challenge with Orchard is that even though it uses MPC at a smaller scale, the MPC overhead is still high. First, in the setup phase, committee devices generate fresh keys for each round, and generating one AHE key pair inside general-purpose MPC requires $\approx$ 180 seconds of cpu and 1 GiB of network transfers. Second, the overhead of the add phase to verify the server’s work grows linearly with the model size and becomes impractical as soon as the model has a few hundred thousand parameters. Third, in the release phase, committee devices decrypt ciphertexts and generate DP noise inside general-purpose MPC, costing, for example, $\approx$ 2600 seconds of cpu and $\approx$ 38 GiB of network transfers per device for a model with just 4K parameters.

In general, providing high accuracy, differential privacy, and device efficiency simultaneously in a threat model where there is no trusted party proves incredibly challenging.

3. Overview of Aero

The high-level idea in Aero is to focus on a specific type of federated learning algorithms comprising DP-FedAvg (mcmahan2017learning, ), DP-FedSGD (mcmahan2017learning, ), and DP-FTRL (kairouz2021practical, ). These algorithms have similar characteristics; for instance, they all sample noise for differential privacy from a Gaussian distribution. To keep Aero easier to explain and understand, we take the most popular DP-FedAvg as the canonical algorithm and describe Aero in its context.

3.1. DP-FedAvg without amplification

Main:
1: parameters
2:   device selection probability $q\in(0,1]$
3:   DP noise scale $z$
4:   total # of devices $W$
5:   clipping bound on device updates $S$
6: Initialize model $\theta^{0}$ , DP privacy budget accountant $\mathcal{M}$
7: for each round $t=0,1,2,\ldots$
8:   $C^{t}\leftarrow$ (sample users with probability $q$ )
9:   for each user $k\in C^{t}$
10:     $\Delta_{k}^{t}\leftarrow\textsc{UserUpdate}(k,\theta^{t},S)$
11:   // Add updates and Gaussian DP noise with $\sigma=zS$
12:   $\Delta^{t}\leftarrow\sum_{k}\Delta_{k}^{t}+\mathcal{N}(0,I\sigma^{2})$
13:   $\theta^{t+1}\leftarrow\theta^{t}+(\Delta^{t}/(qW))$ // Update model
14:   Update $\mathcal{M}$ based on noise scale $z$ and parameter $q$

UserUpdate( $k$ , $\theta^{0}$ , $S$ ):
15: parameters $B,E,\eta$ // $\eta$ is learning rate
16: $\theta\leftarrow\theta^{0}$
17: for each local epoch $i$ from $1$ to $E$
18:   $\mathcal{B}\leftarrow$ ( $k$ ’s data split into size $B$ batches)
19:   for batch $b\in\mathcal{B}$
20:     $\theta\leftarrow\theta-\eta\nabla\ell(\theta;b)$ // $\ell$ is loss fn. (model err.)
21:     $\theta\leftarrow\theta^{0}+Clip(\theta-\theta^{0},S)$
22: return $\Delta_{k}=\theta-\theta^{0}$ // Already clipped

Figure 2. Pseudocode for the DP-FedAvg algorithm. $Clip(\cdot,S)$ scales its input vector such that its norm (Euclidean distance from the origin) is less than $S$ . $\mathcal{M}$ is the privacy budget accountant of Abadi et al. (abadi2016deep, ) that tracks the values of the DP parameters $\epsilon$ and $\delta$ .
DP-FedAvg proceeds in discrete rounds (Figure 2). In each round $t$ , it samples a small subset of user devices using a probability parameter $q$ (line 8), and asks the sampled devices to provide updates to the global model parameters (line 10). The devices locally generate the updates before clipping them by a value $S$ and uploading them (line 21); this clipping is necessary for differential privacy and it bounds the norm (sensitivity) of a device’s update. DP-FedAvg then aggregates these updates (line 12) and (separately) adds noise sampled from a Gaussian distribution. The standard deviation of the Gaussian distribution depends on a noise scale parameter $z$ and the clipping bound $S$ ; both are input parameters for DP-FedAvg. Finally, DP-FedAvg updates a privacy accountant $\mathcal{M}$ that computes, based on the noise scale $z$ and sampling probability $q$ , two parameters $\epsilon$ and $\delta$ associated with differential privacy (line 14). These parameters capture the strength of the guarantee: how much the model parameters learned after a round vary depending on a device’s input. Lower values of $\delta$ and $\epsilon$ are desirable, and the literature recommends ensuring that $\epsilon$ stays close to or below $1$ , and $\delta$ is less than $1/W$ , where $W$ is the total number of devices (mcmahan2017learning, ).
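The round structure described above can be illustrated with a minimal Python sketch. It is not Aero's or FedScale's implementation: the function names `clip` and `dp_fedavg_round` are hypothetical, local training is replaced by precomputed per-device updates, clipping is applied once per update rather than after every batch, and the privacy accountant (line 14) is omitted.

```python
import math
import random


def clip(vec, S):
    """Scale vec so that its Euclidean norm is at most S."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= S or norm == 0.0:
        return list(vec)
    return [x * (S / norm) for x in vec]


def dp_fedavg_round(theta, local_deltas, q, W, z, S, rng):
    """One iteration of the Main loop (lines 8 to 13 of Figure 2).

    local_deltas maps a device id to its unclipped local update and
    stands in for the training performed inside UserUpdate.
    """
    # Line 8: each device is selected independently with probability q.
    selected = [k for k in sorted(local_deltas) if rng.random() < q]
    # Lines 10 and 21: each update is clipped to norm at most S.
    clipped = [clip(local_deltas[k], S) for k in selected]
    # Lines 11-12: sum the updates and add Gaussian noise with sigma = z * S.
    sigma = z * S
    total = [rng.gauss(0.0, sigma) for _ in theta]
    for delta in clipped:
        total = [a + b for a, b in zip(total, delta)]
    # Line 13: normalize by the expected number of contributors, q * W.
    new_theta = [t + d / (q * W) for t, d in zip(theta, total)]
    return new_theta, selected
```

Setting `z = 0` and `q = 1` in this sketch disables the noise and the sampling, which makes the clipping and averaging steps easy to check by hand.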

DP-FedAvg has three characteristics that are crucial for Aero. The first is the sampling of devices (lines 8 to 10 in Figure 2). For instance, the sampling parameter $q$ could be $10^{-5}$ such that 10,000 out of, say, $10^{9}$ total devices contribute updates in a round. The second characteristic is that the noise is sampled from a Gaussian distribution whose standard deviation $\sigma$ is predetermined (set before the algorithm is run). This is in contrast to other DP algorithms that utilize techniques such as the Sparse Vector Technique (SVT) that generate noise depending on the value of the updates (roth2010interactive, ; dwork2009complexity, ). The third characteristic is averaging of updates: DP-FedAvg simply adds updates and noise (line 12 in Figure 2) rather than combining them using a more complex function. Aero heavily leverages these characteristics.

Finally, we remark that Aero can support DP-FedAvg only without the amplification assumption for DP. This is because the adversary (the byzantine server) can observe all traffic and knows which devices contribute updates for training. In contrast, the amplification assumption requires the server to be oblivious to the contributors, which in turn improves the privacy budget. We leave the addition of expensive oblivious approaches (which hide who is contributing updates besides hiding the updates themselves) to future work.

3.2. Architecture of Aero

Aero borrows two system components from Orchard: an aggregator and a public bulletin board (Figure 3). The aggregator runs server-side inside a data center and therefore consists of one or more powerful machines. Its main role is to combine updates from user devices without learning their content. The bulletin board is an immutable append-only log. The aggregator (which is potentially malicious) and the devices use the bulletin board to reliably broadcast messages and store states across rounds, e.g., the latest values of DP parameters $\epsilon$ and $\delta$ . Like Orchard (roth2020orchard, ), Aero assumes that free web services such as Wikipedia, or a public blockchain, could serve as the bulletin board.

Like Orchard, Aero also consists of committees of devices, but instead of a single committee as in Orchard, Aero has three types of committees tailored to the needs of DP-FedAvg (and similar algorithms). A master committee handles system setup, including key generation for cryptographic primitives. A DP-noise committee handles Gaussian noise generation. And multiple decryption committees perform decryption operations to release updates to the global model parameters at the end of a training round. Aero samples each committee afresh each round, dividing the committee workload across the large population of devices.

An architecture with separate committee types is deliberate and crucial. It helps tailor a committee’s protocol to its tasks to significantly improve efficiency. Besides, the use of multiple committees of the same type, i.e., multiple decryption committees, helps Aero scale with model size as each committee works on a subset of model parameters.
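As an illustration of how work might be divided among several decryption committees, the following Python sketch assigns contiguous ranges of parameter indices to committees. The helper `assign_chunks` and the contiguous, near-equal split are assumptions made for this example; the text above only states that each committee works on a subset of the model parameters.

```python
def assign_chunks(num_params, num_committees):
    """Split indices 0..num_params-1 into contiguous, near-equal ranges.

    Hypothetical helper: returns one range per decryption committee.
    """
    base, extra = divmod(num_params, num_committees)
    ranges, start = [], 0
    for c in range(num_committees):
        # The first `extra` committees take one additional index.
        size = base + (1 if c < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

With such a split, the per-committee workload stays roughly constant as the model grows, provided the number of committees grows with it.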

Notably, the ability to use separate committee types is possible only because of the specifics of DP-FedAvg. For instance, the fact that Gaussian noise generation does not depend on the value of the updates allows Aero to separate the DP-noise committee from the decryption committees.

3.3. Protocol overview of Aero

To begin training a model, a data analyst supplies input parameters (the model architecture, the initial model parameters, and the other input parameters for DP-FedAvg) to the aggregator. The aggregator then initiates a round-based protocol consisting of discrete rounds. In each round, it executes one iteration of the for loop in the Main procedure of DP-FedAvg (line 7 in Figure 2). Each round further consists of the four phases of setup, generate, add, and release (Figure 3).

In the setup phase, the aggregator samples the various committees for the round. The master committee then receives and validates the input parameters, and generates keys for an AHE and a ZK-proof scheme. Aero’s setup phase is similar to Orchard (§2.3) with the key difference that Aero’s master committee uses techniques to reuse keys across rounds rather than generating them fresh for each round using MPC.

Next, in the generate phase, (i) devices select themselves to generate updates for the round, and (ii) the DP-noise committee generates the Gaussian noise for DP. Both types of devices use techniques to perform their work efficiently. For instance, the DP-noise committee generates noise in a distributed manner while avoiding MPC.

Next, in the add phase, the aggregator adds the model updates to the Gaussian noise without learning the plaintext content of either of them. This is done through the use of the AHE scheme. The entire population of devices collectively verifies the aggregator’s work. Again, the key is efficiency for the devices, for which the aggregator and the devices use a new verifiable aggregation protocol.

Finally, in the release phase, each decryption committee receives the secret key for the AHE scheme from the master committee and decrypts a few ciphertexts from the add phase. The key point is that a decryption committee avoids general-purpose MPC by using a specialized decryption protocol.

4. Design of Aero

We now go over the design details of Aero phase-by-phase. The main challenge in each phase is keeping the device overhead low while protecting against the malicious aggregator and the malicious subset of devices. We highlight these challenges, and Aero’s key design choices and techniques.

But before proceeding, we briefly discuss committee formation, which is common to multiple phases. To form committees, Aero uses the sortition protocol from Orchard (which in turn used Algorand’s protocol (gilad2017algorand, )). This protocol relies on a publicly verifiable source of randomness so that the results of the election are verifiable by all devices. At the end of the protocol, the aggregator publishes the list (public keys) of the committee members by putting it on the bulletin board. An important aspect of committee formation is committee size and the number of malicious devices in a committee: provision of a larger number of malicious devices $A$ relative to the committee size $C$ increases costs but ensures higher resiliency. Like Orchard, Aero makes a probabilistic argument (roth2019honeycrisp, ) to select $C$ and $A$ such that the probability of the number of malicious devices exceeding $A$ is small. For example, if the overall population contains up to $f=3\%$ malicious devices (§2.1), then the probability that a randomly sampled subset of $C=45$ devices contains more than $A=2C/5=18$ malicious devices is less than $9.6\cdot 10^{-14}$ .
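The committee-size example can be sanity-checked with a short Python sketch. It assumes a simple binomial model in which each committee member is independently malicious with probability $f$ ; this may differ from the analysis behind the quoted bound, so the sketch only checks that the binomial tail lies below $9.6\cdot 10^{-14}$ . The function name `prob_more_than` is hypothetical.

```python
from math import comb


def prob_more_than(C, A, f):
    """P[X > A] for X ~ Binomial(C, f).

    X models the number of malicious devices among C committee members,
    assuming each member is malicious independently with probability f.
    """
    return sum(comb(C, i) * f**i * (1 - f) ** (C - i) for i in range(A + 1, C + 1))


C = 45
A = 2 * C // 5  # 18, as in the example above
tail = prob_more_than(C, A, 0.03)
assert tail < 9.6e-14
```

Varying `C` and `A` in this function shows the trade-off mentioned above: tolerating more malicious members lowers the failure probability but raises the cost of the committee's protocol.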

4.1. Setup phase

Much of Aero’s setup phase is similar to Orchard. During this phase, (i) the aggregator samples the master committee, which then (ii) receives and validates inputs for the round (i.e., receives model parameters $\theta^{t}$ for the current round $t$ , the device selection probability $q$ , noise scale $z$ , and clipping bound $S$ , and generates new values of the DP parameters $\epsilon,\delta$ ), and (iii) generates keys for cryptographic primitives (§3.3). We do not focus on the first two pieces to avoid repetition with Orchard, but include them in the supplementary material for completeness (Appendix §A.2). Instead, the key challenge in Aero is the overhead of key generation.

Recall (§2.3) that Orchard uses MPC among the master committee members to correctly run the key generation function and ensure that even if the malicious members of the committee collude, they cannot recover the AHE secret key. The overhead of this MPC is high: $\approx$ 1 GiB of network transfers and 180 seconds of cpu time per committee device. How can this overhead be reduced?

One idea (roth2021mycelium, ) is to reuse keys across rounds rather than generate them afresh for each round. Indeed, this is what Aero does: the master committee in round 1 generates the keys and shares them with the committee for the next round, and this committee then shares the keys with the committee for the third round, and so on. But one has to be careful.

Consider the following attack. Say that the malicious aggregator receives a victim device $k$ ’s update $Enc(pk,\Delta_{k}^{t})$ in round $t$ . Then, in the next round $t+1$ , the aggregator colludes with a malicious device in the overall population to use $Enc(pk,\Delta_{k}^{t})$ as the device’s update. This attack enables the aggregator to violate differential privacy as the victim device’s input does not satisfy the required clipping bound $S$ in round $t+1$ due to its multiple copies (§3.1). Orchard does not suffer from this attack as it generates fresh keys in each round: the ciphertext for round $t$ decrypts to a random message with round $t+1$ ’s key. However, prior work that reuses keys in this manner (in particular, Mycelium (roth2021mycelium, )) does suffer from this attack.

Thus, Aero must apply the reuse-of-keys idea with care. Aero adjusts the generate and add phases of its protocol (§3.3) to prevent the aforementioned attack. We are not in a position yet to describe these changes, but we will detail them shortly when we describe these other phases (§4.2, §4.3). Meanwhile, the changes in the setup phase relative to Orchard are the following: for the AHE secret key $sk$ , Aero implements an efficient verifiable secret redistribution scheme (gupta2006extended, ; roth2021mycelium, ) such that committee members at round $t+1$ securely obtain the relevant shares of the key from the committee at round $t$ . For the public keys (AHE public key $pk$ , and both the ZK-proof public proving and verification keys), the committee for round $t$ signs a certificate containing these keys and uploads it to the bulletin board, and the committee for round $t+1$ downloads it from the board.

The savings by switching from key generation to key resharing are substantial for the network, with a slight increase in cpu. While the MPC solution incurs $\approx$ 1 GiB of network transfers and 180 seconds of cpu time per committee device, key resharing requires 125 MiB and 187 seconds, respectively (§6.2). The cpu is higher because key resharing requires certain expensive field exponentiation operations (gupta2006extended, ).
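The trade-off in these figures can be made explicit with a few lines of Python arithmetic. The inputs are the approximate per-device costs quoted above, so the derived ratios are approximate as well.

```python
MIB_PER_GIB = 1024

# Approximate per-device costs quoted in the text.
mpc_network_mib = 1 * MIB_PER_GIB  # key generation inside MPC: ~1 GiB
reshare_network_mib = 125          # key resharing: 125 MiB
mpc_cpu_s = 180                    # key generation inside MPC: ~180 s
reshare_cpu_s = 187                # key resharing: 187 s

# Network transfers shrink by roughly 8x ...
network_reduction = mpc_network_mib / reshare_network_mib
# ... while cpu time grows by roughly 4%.
cpu_increase_pct = 100 * (reshare_cpu_s - mpc_cpu_s) / mpc_cpu_s
```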

4.2. Generate phase

Recall from §3.3 that during this phase (i) Aero must pick a subset of devices to generate updates to the model parameters, (ii) the DP-noise committee must generate Gaussian noise for differential privacy, and (iii) both types of devices must encrypt their generated data (updates and noise).

Device sampling for updates. One design choice is to ask the aggregator to sample devices that will contribute updates. The problem with this option is that the (malicious) aggregator may choose the devices non-uniformly; for instance, it may pick an honest device more often than the device should be picked, violating differential privacy. An alternative is to ask the devices to sample themselves with probability $q$ (as required by DP-FedAvg; line 8 in Figure 2). But then a malicious device may pick itself in every round, which would allow it to significantly affect model accuracy.

Aero adopts a hybrid and efficient design in which devices sample themselves but the aggregator verifies the sampling. Let $B^{t}$ be a publicly verifiable source of randomness for round $t$ ; this is the same randomness that is used in the sortition protocol to sample committees for the round. Then, each device $k$ with public key $\pi_{k}$ computes $PRG(\pi_{k}||B^{t})$ , where $PRG$ is a pseudorandom generator. Next, the device scales the PRG output to a value between 0 and 1, and checks if the result is less than $q$ . For instance, if the PRG output is 8 bytes, then the device divides this number by $2^{64}-1$ . If selected, the device runs the UserUpdate procedure (line 10 in Figure 2) to generate updates for the round. This approach of sampling is efficient as devices only perform local computations.
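
The self-sampling check above can be sketched as follows. This is a minimal illustration, not Aero's implementation: SHA-256 is assumed as a stand-in for the $PRG$, and the function and parameter names are hypothetical. Because $\pi_{k}$ and $B^{t}$ are public, the aggregator can rerun the same function to verify a device's claim.

```python
import hashlib


def is_selected(device_pk: bytes, round_randomness: bytes, q: float) -> bool:
    """Return True if the device with key `device_pk` samples itself this round.

    Assumption: SHA-256 over pk || B^t, truncated to 8 bytes, plays the role
    of the PRG; the source text does not name a concrete PRG.
    """
    digest = hashlib.sha256(device_pk + round_randomness).digest()
    value = int.from_bytes(digest[:8], "big")
    # Scale the 8-byte output to [0, 1] by dividing by 2^64 - 1, then compare to q.
    return value / (2**64 - 1) < q


# Anyone holding the public inputs reaches the same decision as the device.
claimed = is_selected(b"device-public-key", b"round-randomness", 1e-5)
verified = is_selected(b"device-public-key", b"round-randomness", 1e-5)
```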

Gaussian noise generation. The default option is to make the DP-noise committee generate the noise using MPC, but as noted several times in this paper, this option is expensive. Instead, Aero adapts prior work (truex2019hybrid, ) on distributed Gaussian noise generation. The Gaussian distribution has the property that if an element sampled from $\mathcal{N}(0,a)$ is added to another, independently sampled element from $\mathcal{N}(0,b)$, then the sum is a sample of $\mathcal{N}(0,a+b)$ (truex2019hybrid, ; xu2019hybridalpha, ; dwork2006our, ). This works well for the simple case when all $C$ committee members of the DP-noise committee are honest. Given the standard deviation of the Gaussian distribution, $\sigma=z\cdot S$, the devices can independently compute their additive share. That is, to generate samples from $\mathcal{N}(0,I\sigma^{2})$ (line 12 in Figure 2), each committee member can sample its share of the noise from the distribution $\mathcal{N}(0,I\frac{\sigma^{2}}{C})$.

The challenge in Aero is therefore: how do we account for the $A$ malicious devices in the DP-committee? These devices may behave arbitrarily and may thus generate either no noise or large amounts of it. Adding unnecessary noise hurts accuracy, not privacy. In contrast, failing to add noise may violate privacy. We thus consider the worst case in which malicious users fail to add any noise and ask honest devices to compensate. Each honest client thus samples its noise share from the distribution $\mathcal{N}(0,I\frac{\sigma^{2}}{C-A})$.²

² Aero can further compensate for honest-but-offline devices. Say, for example, that $B$ of the $C-A$ honest devices must be provisioned to be offline. Then, Aero subtracts $B$ from $C-A$ to get the number of honest-but-online devices.
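
The per-device share can be sketched as below. This is an illustrative sketch under stated assumptions: the function names and the values of $z$ and $S$ are hypothetical, and Python's `random.gauss` stands in for whatever sampler a real deployment would use. It shows that when only the honest, online members contribute, their shares sum to variance $\sigma^{2}$.

```python
import math
import random


def noise_share_std(sigma: float, committee_size: int, max_malicious: int,
                    max_offline: int = 0) -> float:
    """Standard deviation of one honest member's share: sigma / sqrt(C - A - B)."""
    contributors = committee_size - max_malicious - max_offline
    if contributors <= 0:
        raise ValueError("need at least one honest, online committee member")
    return sigma / math.sqrt(contributors)


def sample_noise_share(sigma, committee_size, max_malicious, dim, rng):
    """One member's share: `dim` independent draws from N(0, sigma^2 / (C - A))."""
    std = noise_share_std(sigma, committee_size, max_malicious)
    return [rng.gauss(0.0, std) for _ in range(dim)]


z, S = 1.0, 1.0          # hypothetical noise scale and clipping bound
sigma = z * S
C, A = 280, 40           # DP-noise committee configuration used in the text

# Worst case: only the C - A honest members add noise, and variances add up.
share_var = noise_share_std(sigma, C, A) ** 2
total_var_worst_case = (C - A) * share_var   # equals sigma^2 up to rounding
```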

This algorithm generates noise cheaply without expensive MPC. The downside is that it may generate more noise than necessary, hurting accuracy. To mitigate this risk, we carefully choose the committee size to minimize the ratio of additional noise. Specifically, we choose $C$ to keep the ratio $(C-A)/C$ close to 1. For instance, instead of picking a committee containing a few tens of devices similar to the master committee, we pick a somewhat larger DP-noise committee: $(A,C)=(40,280)$.³

³ Using a probabilistic argument for committee size selection as before (§4), if $f=3\%$ devices in the overall population are malicious, then the chances of sampling 280 devices with more than 1/7th malicious is $4.1\cdot 10^{-14}$.
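
To make the effect of committee size concrete, the short sketch below computes the variance inflation $C/(C-A)$, i.e., the inverse of the ratio $(C-A)/C$ discussed above, under the assumption that all $C$ members follow the protocol and each adds a share of variance $\sigma^{2}/(C-A)$. The function name is hypothetical; the two configurations are the ones that appear in this paper.

```python
def noise_inflation(committee_size: int, max_malicious: int) -> float:
    """Total variance relative to sigma^2 when all C members add their share."""
    return committee_size / (committee_size - max_malicious)


# A committee sized like a decryption committee: (A, C) = (18, 45).
small = noise_inflation(45, 18)    # 45 / 27, about 1.67x the target variance
# The larger DP-noise committee: (A, C) = (40, 280).
large = noise_inflation(280, 40)   # 280 / 240, about 1.17x the target variance
```

With the larger committee, the worst-case excess variance drops from roughly 67% to roughly 17% of $\sigma^{2}$.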

Encryption and ZK-proofs. Once the devices generate their updates or shares of the Gaussian noise, they encrypt the content using the public key of the AHE scheme to prevent the aggregator from learning the content. Further, they certify using a ZK-proof scheme that the encryption is done correctly and the data being encrypted is bounded by the clipping value $S$ (so that malicious devices may not supply arbitrary updates). This encryption and ZK-proof generation is the same as in Orchard, but Aero requires additional changes. Recall from the setup phase that Aero must ensure a ciphertext generated in a round is used only in that round, to prevent complications due to reuse of keys (§4.1). To do this, each device concatenates the round number $t$ (as a timestamp) to the plaintext message before encrypting it. Further, the ZK-proof includes additional constraints that prove that a prefix of the plaintext message equals the current round number.

Figure 4. Aero’s verifiable aggregation (the add phase protocol of Aero, consisting of a commit step, an add step, and a verify step that every device in the system performs). This description does not include the PIT optimization (described in text) that applies to line 11.

4.3. Add phase

Recall that during the add phase (i) the aggregator adds ciphertexts containing device updates to those containing shares of Gaussian noise, (ii) the devices collectively verify the aggregator’s addition (§3.3).

This work during the add phase has subtle requirements. So first, we expand on these requirements while considering a toy example with two honest devices and one malicious device. The first honest device’s input is ${\small\textsc{Enc}}(pk,\Delta)$, where $\Delta$ is its update, while the second honest device’s input is ${\small\textsc{Enc}}(pk,n)$, where $n$ is the Gaussian noise. For this toy example, first (R1), the aggregator must not omit ${\small\textsc{Enc}}(pk,n)$ from the aggregate as the added noise would then be insufficient to protect $\Delta$ and guarantee DP. Second (R2), the aggregator must not let the malicious device use ${\small\textsc{Enc}}(pk,\Delta)$ as its input. Relatedly, the aggregator itself must not modify ${\small\textsc{Enc}}(pk,\Delta)$ to ${\small\textsc{Enc}}(pk,k\cdot\Delta)$, where $k$ is a scalar, using the additively homomorphic properties of the encryption scheme. The reason is that these changes can violate the clipping requirement that a device’s input is bounded by $S$ (e.g., $2\cdot\Delta$ may be larger than $S$). And, third (R3), the aggregator must ensure that the above (the malicious device or the aggregator copying a device’s input) does not happen across rounds, because Aero uses the same encryption key in multiple rounds (§4.1).
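
The clipping violation behind R2 can be seen on plaintext values. The sketch below assumes L2-norm clipping (the source only says that inputs are "bounded by $S$"), and the helper names are hypothetical. It shows that counting the same clipped update twice yields a contribution of norm $2S$, which exceeds the bound.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def clip(update, bound):
    """Scale `update` down so that its L2 norm is at most `bound`."""
    norm = l2_norm(update)
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


S = 1.0
delta = clip([3.0, 4.0], S)            # norm 5 is scaled down to norm S
duplicated = [2 * x for x in delta]    # the same input included twice
violates_bound = l2_norm(duplicated) > S
```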

One option to satisfy these requirements is to use the verifiable aggregation protocol of Orchard (roth2019honeycrisp, ) that is based on summation trees. The main challenge is resource costs. Briefly, in this protocol, the aggregator arranges the ciphertexts to be aggregated as leaf nodes of a tree, and publishes the nodes of the tree leading to the root node. For example, the leaf nodes will be ${\small\textsc{Enc}}(pk,\Delta)$ and ${\small\textsc{Enc}}(pk,n)$ , and the root node will be ${\small\textsc{Enc}}(pk,\Delta)+{\small\textsc{Enc}}(pk,n)$ , for the toy example above. Then, devices in the entire population inspect parts of this tree: download a few children and their parents and check that the addition is done correctly, that the leaf nodes haven’t been modified by the aggregator, and the leaf nodes that should be included are indeed included. The problem is that Orchard requires a device to download and check about $3\cdot s$ nodes of the tree (roth2019honeycrisp, ; roth2020orchard, ), where $s$ is a configurable parameter whose default value is six. But for realistic models, each node is made of many ciphertexts (e.g., the 1.2M parameter CNN model requires $\ell=293$ ciphertexts), and 18 such nodes add to 738 MiB.

Aero improves this protocol using two ideas. First, Aero observes that the entire population of devices that must collectively check the tree is massive (e.g., $10^{9}$). Besides, although the tree has bulky nodes with many ciphertexts, the total number of nodes is not high due to sampling (e.g., only 10,000 devices contribute updates in a round). Thus, Aero moves away from one summation tree with “bulky” nodes, to $\ell$ summation trees with “small” nodes, where $\ell$ is the number of ciphertexts comprising a device’s update (e.g., $\ell=293$ for the 1.2M parameter model). Then, each device probabilistically selects a handful of trees, and checks a few nodes within each selected tree.

Second, Aero optimizes how a device tests whether the sum of two ciphertexts equals a third ciphertext. Aero recognizes that ciphertexts can be expressed as polynomials and the validity of their addition can be checked efficiently using a technique called polynomial identity testing (PIT) (schwartz1980fast, ; zippel1979probabilistic, ). Roughly, PIT says that the sum of polynomials can be checked by evaluating them at a random point and checking the sum of these evaluations. Using PIT, Aero replaces the ciphertexts at the non-leaf nodes of the summation trees with their much smaller evaluations at a random point.

We now describe Aero’s protocol in detail, first without the PIT optimization, and then with it.

Incorporating finer-grained summation trees. Aero’s protocol has three steps: commit, add, and verify (Figure 4). In the commit step, all devices commit to their ciphertexts before submitting them to the aggregator (lines 1–3 in Figure 4). The aggregator publishes a Merkle tree of these commitments to the bulletin board. Committing before submitting ensures that a malicious device cannot copy and submit an honest device’s input (requirement R2 above). Similarly, this design ensures that the aggregator cannot change a device’s input (again requirement R2).
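
A rough sketch of the commit step follows. The source does not specify the commitment scheme or the Merkle tree layout, so the hash-based commitment (with a fixed-length nonce), the duplicate-last-node padding rule, and all names here are assumptions made for illustration.

```python
import hashlib


def commit(ciphertext: bytes, nonce: bytes) -> bytes:
    """Hash-based commitment to a ciphertext (assumed construction).

    The nonce is assumed to have a fixed length so that nonce || ciphertext
    parses unambiguously.
    """
    return hashlib.sha256(nonce + ciphertext).digest()


def merkle_root(leaves):
    """Root of a binary Merkle tree; an unpaired node is paired with itself."""
    if not leaves:
        raise ValueError("no commitments to aggregate")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


# Devices commit first; the aggregator then publishes the root of all commitments.
commitments = [commit(b"ciphertext-%d" % i, b"nonce-%02d" % i) for i in range(5)]
published_root = merkle_root(commitments)
```

Once the root is published, a ciphertext submitted later can be checked against a commitment that was fixed beforehand, which is what prevents copying or altering an input after the fact.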

In the add step, the aggregator adds the ciphertexts via summation trees. Specifically, if device updates have $\ell$ ciphertexts, the aggregator creates $\ell$ summation trees, one per ciphertext (line 6 in Figure 4). The leaf vertices of the $j$ -th tree are the $j$ -th ciphertexts in the devices’ inputs, while each parent is the sum of its children ciphertexts, and the root is the $j$ -th ciphertext in the aggregation result. The aggregator publishes the vertices of the summation trees on the bulletin board (line 7 in Figure 4), allowing an honest device to check that its input is not omitted (requirement R1 above).
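
The structure of the add step can be sketched with plain integers standing in for AHE ciphertexts (an assumption for illustration; in Aero the leaves are ciphertexts and the addition is homomorphic). The function names and the rule of carrying an unpaired node up unchanged are also assumptions.

```python
def build_summation_tree(leaves):
    """Return the tree as a list of levels, leaves first.

    Each parent is the sum of its (up to two) children; an unpaired node is
    carried up unchanged.
    """
    if not leaves:
        raise ValueError("a summation tree needs at least one leaf")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels


def build_all_trees(device_inputs):
    """One tree per ciphertext position j: leaves are the j-th entry of each input."""
    ell = len(device_inputs[0])
    return [build_summation_tree([inp[j] for inp in device_inputs])
            for j in range(ell)]


def check_parent(levels, depth, index):
    """Spot-check one non-leaf vertex against its children."""
    children = levels[depth - 1][2 * index: 2 * index + 2]
    return levels[depth][index] == sum(children)


# Three devices, each contributing ell = 3 "ciphertexts".
inputs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
trees = build_all_trees(inputs)
roots = [tree[-1][0] for tree in trees]   # the aggregation result, one per position
```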

In the verify step, each device in the system selects $q\cdot\ell$ summation trees, where $q$ is the device sampling probability (line 9 in Figure 4), and checks $s$ leaf nodes and $2s$ non-leaf nodes in each tree. ( $s=6$ in our implementation.) Specifically, the device checks that the leaf node ciphertexts are committed to in the commit step (requirement R2), and the ZK-proofs of the ciphertexts are valid, e.g., the first part of the plaintext message in the ciphertexts equals the current round number (requirement R3). For the non-leafs, the device checks that they sum to their children.

Incorporating PIT. Checking the non-leaf vertices is a main source of overhead for the protocol above. The reason is that even though each non-leaf is a single ciphertext, this ciphertext is large: for the quantum-secure AHE scheme Aero uses (§5), a ciphertext is 131 KiB, made of two polynomials of $2^{12}$ coefficients each, where each coefficient is 16 bytes.

As mentioned earlier, Aero reduces this overhead by using polynomial identity testing (PIT) (schwartz1980fast, ; zippel1979probabilistic, ). This test says that given a $d$-degree polynomial $g(x)$ whose coefficients are in a field $\mathbb{F}$, one can test whether $g(x)$ is a zero polynomial by picking a number $r\in\mathbb{F}$ uniformly and testing whether $g(r)==0$. This works because a nonzero $d$-degree polynomial has at most $d$ roots (solutions to $g(x)==0$), so a nonzero $g$ passes the test with probability at most $d/|\mathbb{F}|$, and $d$ is much less than $|\mathbb{F}|$.

Using PIT, Aero replaces the ciphertexts at the non-leafs with their evaluations at a random point $r$ . Then, during the “Verify” step, a device checks (line 11 in Fig. 4) whether these evaluations (rather than ciphertexts) add up. Thus, instead of downloading three ciphertexts with $2\cdot 2^{12}$ field elements each, a device downloads 2 elements of $\mathbb{F}$ per ciphertext.
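
The check can be illustrated on a single pair of polynomials. In this sketch the prime $2^{61}-1$ stands in for the coefficient field and the polynomials are short random lists; both are assumptions for illustration and do not reflect the BFV parameters Aero uses. Function names are hypothetical.

```python
import random

P = 2**61 - 1  # a Mersenne prime, used here as a stand-in field modulus (assumption)


def evaluate(coeffs, r, p=P):
    """Horner evaluation mod p; coeffs[i] is the coefficient of x^i."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * r + c) % p
    return acc


def add_poly(a, b, p=P):
    """Coefficient-wise sum of two equal-length polynomials mod p."""
    return [(x + y) % p for x, y in zip(a, b)]


def pit_check(left_eval, right_eval, parent_eval, p=P):
    """Accept iff the children's evaluations add up to the parent's evaluation."""
    return (left_eval + right_eval) % p == parent_eval


rng = random.Random(1)
left = [rng.randrange(P) for _ in range(8)]
right = [rng.randrange(P) for _ in range(8)]
parent = add_poly(left, right)
r = rng.randrange(1, P)   # the public random evaluation point

# A verifier needs only the three evaluations, not the full polynomials.
ok = pit_check(evaluate(left, r), evaluate(right, r), evaluate(parent, r))
```

An incorrect parent differs from the true sum by a nonzero polynomial, so its evaluation at a uniformly chosen $r$ matches only with small probability.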

A requirement for PIT is generation of $r$ , which must be sampled uniformly from the coefficient field. For this task, Aero extends the master committee to publish an $r$ to the bulletin board in the add step, using a known protocol to securely and efficiently generate a random number (damgard2006unconditionally, ; damgaard2012multiparty, ).

4.4. Release phase

During the release phase, Aero must decrypt the $\ell$ ciphertexts from the add phase, i.e., the $\ell$ root nodes of the $\ell$ summation trees. The default, but expensive, option is to use MPC among the members of the decryption committees.

Aero addresses this efficiency challenge by applying known ideas; that is, Aero’s contribution in this phase is not new techniques, but the observation that existing ideas can be applied. Nevertheless, applying these ideas requires some care and work.

First, recall that Aero has multiple decryption committees (§3.2). Naturally, to reduce per-device work, each committee decrypts a few of the $\ell$ ciphertexts. A design question for Aero is how many committees it should use. On the one hand, more committees are desirable (the best case is $\ell$). On the other hand, more committees also mean that each has to be larger to ensure that none of them samples more than $A$ out of $C$ malicious devices, breaking the threshold assumptions of a committee. Meanwhile, a larger committee means more overhead. In practice (§6), we take a middle ground and configure Aero to use ten decryption committees.

Second, Aero reduces each committee’s work relative to the MPC baseline, using a fast distributed decryption protocol to decrypt the ciphertexts (chen2019efficient, ). The use of this protocol is possible as a decryption committee’s task is only of decryption given how we formed and assigned work to different types of committees (§3.2). This fast protocol requires the committee devices to mainly perform local computations with little interaction with each other. The caveat is that for this protocol to be applicable, the committee members must know an upper bound on the number of additive homomorphic operations on the ciphertexts they are decrypting.⁴ Fortunately, in Aero’s setting this bound is known: it is the maximum number of devices whose data the aggregator adds in the add phase ($M_{max}$ in Figure 4). The benefit of distributed decryption (and moving work outside MPC) meanwhile is substantial.

⁴ This bound is needed to add a “smudging noise” to the committee’s decryption output to ensure that the output does not leak information on the inputs to the aggregation (asharov2012multiparty, ).

4.5. Privacy proof

Aero’s protocol provides the required differential privacy guarantee (§2.2). The supplementary material (Appendix A.1) contains a proof. But, briefly, the key reasons are that (i) in the generate phase, honest devices sample themselves to make sure that they are not sampled more than expected, (ii) the verifiable aggregation protects these devices’ input, and (iii) key resharing and fast decryption protocols keep secret keys hidden.

5. Prototype implementation

We implemented a prototype of Aero atop FedScale (fedscale-icml22, ), which is a scalable system for federated learning capable of handling a large number of devices. By default, FedScale supports algorithms such as FedAvg and FedSGD (without differential privacy). Further, it allows a data analyst to specify the model using the popular PyTorch framework.

Dataset	Model	Size	FedScale	Aero
FEMNIST (cohen2017emnist, )	LeNet (lecun1995learning, )	49K	75%	74%
	CNND (reddi2020adaptive, )	1.2M	78%	68%
	CNNF (mcmahan2017communication, )	1.7M	79%	68%
	AlexNet (krizhevsky2012imagenet, )	3.9M	78%	40%
CIFAR10 (Krizhevsky09learningmultiple, )	LeNet (lecun1995learning, )	62K	48%	48%
	ResNet20 (he2016deep, )	272K	59%	48%
	ResNet56 (he2016deep, )	855K	54%	35%
Speech (warden2018speech, )	MobileNetV2 (howard2017mobilenets, )	2M	57%	4%

Figure 6. Test accuracy for different models after 480 rounds of training and differential privacy parameters $(\epsilon,\delta)$ set to $(5.03, W^{-1.1})$. As shown later, increasing $\epsilon$ can recover the accuracy loss.

Our Aero prototype extends FedScale in the following way (Figure 5). First, it extends the programming layer of FedScale with Opacus (opacus, ), which is a library that adjusts a PyTorch model to make it suitable for differentially private federated learning; for instance, Opacus replaces the batch normalization layer of a neural network with group normalization. Second, our prototype extends the device-side code of FedScale with additional components needed for the various committees and phases in Aero (key resharing, Gaussian noise generation, verifiable aggregation, and distributed decryption; §4). FedScale is written in Python while the code we added is in Rust; thus, we use PyO3 to wrap the Rust code with Python interfaces. Third, our prototype extends the FedScale server-side code with Aero’s aggregator code and the code to coordinate the various phases of Aero. In total, we added $\approx$ 4,300 lines of Rust to FedScale.

Our prototype configures the cryptographic primitives for 128-bit security. For additively homomorphic encryption, we use the BFV encryption scheme. We set the polynomial degree in BFV to $2^{12}$ and use the default parameters from Microsoft SEAL (seal, ). For ZK-proofs, we use ark_groth16 (arkgroth16, ), which implements the zkSNARK of Jens Groth (groth16on, ).

6. Evaluation

We evaluate Aero in two parts. First, we compare it with plain federated learning, specifically, the FedScale system. This comparison sheds light on the cost of privacy both in terms of model accuracy and resource consumption on the devices. Second, we compare Aero to Orchard, which is the state-of-the-art system for training models in a federated manner in the same threat model as Aero. This comparison helps understand the effectiveness of Aero’s techniques in reducing overhead. Our main results are the following:

• Aero can train models with comparable accuracy to FedScale (plain federated learning). For instance, for a CNN model over the FEMNIST dataset, Aero produces a model with 79.2% accuracy with DP parameter $\epsilon=5.53$, relative to 79.3% in FedScale, after 480 rounds of training.
• Aero’s cpu and network overhead is low to moderate: for a 1.2M parameter model, devices spend 15 ms of cpu and 3.12 KiB of network transfers most of the time, and occasionally (with a probability of $10^{-5}$ in a round) 13.4 min. of processor time and 234 MiB of network transfers.
• Aero’s techniques improve over Orchard by up to $2.3\cdot 10^{5}\times$.

Testbed. Our testbed has machines of type c5.24xlarge on Amazon EC2. Each machine has 96 vcpus, 192 GiB RAM, and 25 Gbps network bandwidth. We use a single machine for running Aero’s server. Meanwhile, we co-locate multiple devices on a machine: each device is assigned six cpus given that modern mobiles have processors with four to eight cpus.

Default system configuration. Unless specified otherwise, we configure the systems to assume $W=10^{9}$ total devices. For Aero, we set the default device sampling probability $q$ in DP-FedAvg to $10^{-5}$; i.e., the expected number of devices that contribute updates in a round is $10^{4}$. We also configure Aero to use ten decryption committees, where each committee has a total of $C=45$ devices of which $A=18$ may be malicious. The first decryption committee also serves as the master committee. We configure the DP-noise committee with $(A,C)=(40,280)$. For Orchard, we configure its committee to have 40 devices of which 16 may be malicious.⁵

⁵ Aero’s committees are larger because it must ensure, using a union bound, that the chance of sampling more than $A$ malicious devices across any of its committees is the same as in Orchard.

6.1. Comparison with FedScale

Accuracy. We evaluate several datasets and models to compare Aero with FedScale, specifically, the FedAvg algorithm in FedScale. Figure 6 shows these datasets and models. We use CNND and CNNF for two different CNN models: one dropout model (reddi2020adaptive, ) and the other from the FedAvg paper (mcmahan2017communication, ).

Aero’s accuracy depends on the DP parameters $\epsilon$ and $\delta$. For the experiments in Figure 6, we set $\epsilon=5.04$ and $\delta=1/W^{1.1}$. Further, for both systems, we set all other training parameters (batch size, the number of device-side training epochs, etc.) per the examples provided by FedScale for each dataset.

Figure 6 compares the accuracies after 480 rounds of training (these models converge in roughly 400-500 rounds). Generally, Aero’s accuracy loss grows with the number of model parameters. The reason is that DP-FedAvg adds noise for every parameter and thus the norm of the noise increases with the number of parameters.
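
The growth of the noise norm with model size can be made explicit with a short calculation. It assumes, as in §4.2, that the noise is $n\sim\mathcal{N}(0,I\sigma^{2})$ with $\sigma=z\cdot S$, and it writes $d$ for the number of model parameters (a symbol introduced here for illustration):

```latex
\mathbb{E}\left[\lVert n\rVert_2^{2}\right]
  = \sum_{i=1}^{d}\mathbb{E}\left[n_i^{2}\right]
  = d\,\sigma^{2},
\qquad\text{so}\qquad
\sqrt{\mathbb{E}\left[\lVert n\rVert_2^{2}\right]}
  = \sigma\sqrt{d}
  = z\,S\,\sqrt{d}.
```

Each device's update, in contrast, is bounded by the clipping value $S$ regardless of $d$, so for a fixed number of contributing devices the noise grows relative to the signal roughly as $\sqrt{d}$.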

Although Aero’s accuracy loss is (very) high for a larger number of parameters, this loss is recoverable by increasing $\epsilon$ (but still keeping it at a recommended value). Figure 7 shows accuracy for two values of $\epsilon$ for two example models. Increasing $\epsilon$ from 5.04 to 5.53 recovers the accuracy loss. For instance, for the CNNF model, FedScale’s accuracy is 79.3% after 480 rounds, while Aero’s is 79.2%. The reason is that as $\epsilon$ increases, more devices can contribute updates ($q$ increases), which increases the signal relative to the differential privacy noise. Overall, Aero can give accuracy competitive with plain federated learning for models with parameters ranging from tens of thousands to a few million.

Model	Size	FedScale cpu (ms)	Aero cpu (ms)	FedScale network (KiB)	Aero network (KiB)
LeNet	49K	3.36E-4	2	3.93E-5	0.96
CNND	1.2M	9.50E-4	55	9.66E-4	3.87
CNNF	1.7M	9.49E-4	77	1.35E-3	5.15
AlexNet	3.9M	1.75E-3	170	3.12E-3	11.0

Figure 8. Per device per round average cost for different models.

Device overhead. Another cost of privacy relative to plain federated learning is increased device overhead. Figure 8 summarizes the average cpu and network cost per round per device for the four models on the FEMNIST dataset. (We picked the FEMNIST dataset just as an example, but the results for the other datasets are qualitatively the same.)

Overall, an Aero device on average (considering the different types of Aero devices) spends $5.9\cdot 10^{3}\times$ to $9.7\cdot 10^{4}\times$ higher cpu and $3.5\cdot 10^{3}\times$ to $2.4\cdot 10^{4}\times$ higher network relative to FedScale. This overhead is due to the fact that FedScale does not use any cryptographic operations, while Aero devices use many, for example, encryption and ZK-proofs during the generate phase, and verifiable aggregation during the add phase. However, Aero’s absolute overhead is low, at least on average (Figure 8). Further, as we will show next, Aero’s worst-case overhead is also moderate.
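
The relative overheads quoted above follow directly from the entries of Figure 8. The sketch below recomputes them; the values are copied from the figure, while the dictionary and function names are hypothetical.

```python
# Per model: (FedScale cpu ms, Aero cpu ms, FedScale network KiB, Aero network KiB),
# copied from Figure 8.
FIGURE_8 = {
    "LeNet":   (3.36e-4, 2,   3.93e-5, 0.96),
    "CNND":    (9.50e-4, 55,  9.66e-4, 3.87),
    "CNNF":    (9.49e-4, 77,  1.35e-3, 5.15),
    "AlexNet": (1.75e-3, 170, 3.12e-3, 11.0),
}


def overhead_ratios(table):
    """Map each model to (cpu ratio, network ratio) of Aero over FedScale."""
    return {model: (aero_cpu / fs_cpu, aero_net / fs_net)
            for model, (fs_cpu, aero_cpu, fs_net, aero_net) in table.items()}


ratios = overhead_ratios(FIGURE_8)
cpu_ratios = [cpu for cpu, _ in ratios.values()]
net_ratios = [net for _, net in ratios.values()]
cpu_range = (min(cpu_ratios), max(cpu_ratios))   # LeNet is lowest, AlexNet highest
net_range = (min(net_ratios), max(net_ratios))   # AlexNet is lowest, LeNet highest
```

The cpu ratio is smallest for LeNet and largest for AlexNet, whereas the network ratio is largest for LeNet, whose FedScale network cost is the smallest.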

6.2. Comparison to Orchard

Both Aero and Orchard have multiple types of devices. Aero has devices that participate in the master committee, generate updates (or Gaussian noise), verify the aggregator’s work, and participate in the decryption committee. Similarly, Orchard has generator, verifier, and committee devices. We compare overhead for these devices separately.

Generator device overhead. The overhead for the generators changes only with the model size (after excluding the training time to generate the plain updates). Thus, we vary the number of model parameters and report overhead.

Figure 8(a) shows the cpu time and Figure 9(a) shows the network transfers with a varying number of model parameters. These overheads grow linearly with the number of model parameters (the network overhead is not a straight line as it includes a fixed cost of 60 MiB to download ZK-proof proving keys). The reason is that the dominant operations for a generator device are generating ZK-proofs and shipping ciphertexts to the aggregator. The number of both operations is proportional to the number of parameters (§4.2).

In terms of absolute overhead, a specific data point of interest is a million-parameter model, e.g., the CNND model with 1.2M parameters. For this size, a generator device spends 1.01 hours in cpu time, or equivalently 13.4 minutes of latency (wall-clock time) over six cores. The generator also sends 101 MiB of data over the network. These overheads are moderate, considering the fact that the probability that a device will be a generator in a round is small: $10^{-5}$ .

Finally, the cpu and network overhead for Aero and Orchard is roughly the same. The reason is that the dominant operations for the two systems are common: ZK-proofs and upload of ciphertexts.

Verifier device overhead. Figure 8(b) shows cpu and Figure 9(b) shows network overhead for the verifier devices that participate in the verifiable aggregation protocol (§4.3). These experiments fix the number of model parameters to 1.2M and vary the probability $q$ with which a verifier device samples summation trees to inspect (recall that a verifier device in Aero checks $q\cdot\ell$ summation trees). For Orchard, overhead does not change with $q$ ( $q$ is effectively 1).

Overall, Aero’s verifier devices, which are the bulk of the devices in the system, are efficient, consuming a few milliseconds of cpu and a few KiBs of network transfers. For instance, for $q=10^{-5}$, Aero incurs 15 ms of cpu time and 3.12 KiB of network transfers, while Orchard incurs 1.96 seconds (130$\times$) and 738 MiB ($2.36\cdot 10^{5}\times$).

Comparing Aero with Orchard, a verifier in Aero consumes lower cpu than Orchard for smaller values of $q$ but a higher cpu for larger $q$ . This trend is due to constants: even though an Aero device checks $q\cdot\ell$ summation trees and $3s$ ciphertexts in each tree versus $\ell\cdot 3s$ ciphertexts in Orchard, Aero devices verify the ZK-proofs to address the reuse-of-keys issue, while Orchard does not have such a requirement (§4.3). Each proof check takes $\approx$ 700 ms on a single cpu of c5.24xlarge. Indeed, Aero w/o ZK-proof check (another line in the plot) is strictly better than Orchard.

Aero’s network overhead increases linearly with $q$, while Orchard’s stays constant as it does not do sampling (Figure 9(b)). Notably, when $q=1$, i.e., when Aero and Orchard check the same number of ciphertexts, an Aero verifier consumes 251 MB, which is $\approx 1/3$ of Orchard’s. This is because polynomial identity testing allows an Aero verifier to download evaluations of ciphertext polynomials rather than the full polynomials from non-leaf vertices (§4.3).

Committee device overhead. Figure 8(c) and Figure 9(c) show the cpu and network overhead of decryption committee devices as a function of the model size. (In Aero, the first decryption committee also serves as the master committee.)

Aero’s overheads are much lower than Orchard’s—for 1.2M parameters, cpu time is 206 s in Aero versus 214 hours in Orchard (i.e., $3751\times$ lower), and network is 234 MiB in Aero versus 11 TiB in Orchard (i.e., $4.8\cdot 10^{4}\times$ lower). This improvement is for two reasons. First, Aero divides the decryption of multiple ciphertexts across committees, and thus each performs less work. Second, Aero uses the distributed decryption protocol (§4.4), while Orchard uses the general-purpose SCALE-MAMBA MPC (scaleMamba, ).

7. Related work

Aero’s goal is to add the rigorous guarantee of differential privacy to federated learning—at low device overhead. This section compares Aero to prior work with similar goals.

Local differential privacy (LDP). In LDP, devices locally add noise to their updates before submitting them for aggregation (duchi2013local, ; erlingsson2014rappor, ; ijcai2021-217, ; he2020secure, ; pathak2010multiparty, ; truex2020ldp, ; seif2020wireless, ; bhowmick2018protection, ; hao2019towards, ; sun2020federated, ; grafberger2021fedless, ; nguyen2016collecting, ; wang2019collecting, ; niu2019secure, ; lu2019blockchain, ; chen2018machine, ; mugunthan2020blockflow, ; chen2020practical, ; ding2021differentially, ). On the plus side, the privacy guarantee in LDP does not depend on the behavior of the aggregator, as devices add noise locally. Further, LDP is scalable as it adds small device-side overhead relative to plain federated learning. However, on the negative side, since each device perturbs its update, the trained model can have a large error.

Central differential privacy (CDP). Given the accuracy loss in LDP, many systems target CDP (froelicher2017unlynx, ; zeng2022aggregating, ; sebert2022protecting, ; truex2019hybrid, ; xu2019hybridalpha, ; chase2017private, ; rastogi2010differentially, ; stevens2021efficient, ; hynes2018efficient, ; roth2019honeycrisp, ; roth2020orchard, ; xu2022detrust, ). The core challenge is hiding sensitive device updates from the aggregator.

Several systems in this category target a setting of a few tens of devices to a few thousand devices (xu2022detrust, ; xu2019hybridalpha, ; truex2019hybrid, ; stevens2021efficient, ; sebert2022protecting, ). These systems require all devices to participate in one or more cryptographic primitives, and thus their overhead grows with the number of devices. For example, in secure aggregation based FLDP (stevens2021efficient, ), each device generates a secret key, then masks its update using the key, before sending the masked update to the aggregator. Then, the devices securely sum their masks to subtract them from the aggregator’s result. This latter protocol requires each device to secretly share its mask with all others.

Chase et al. (chase2017private, ) do not require their protocol to scale with the number of devices: two devices aggregate updates from all others before generating and adding DP noise via Yao’s MPC protocol (yao1982protocols, ). The issue is that if the adversary compromises the two devices, it learns the updates.

Honeycrisp (roth2019honeycrisp, ), Orchard (roth2020orchard, ), and Mycelium (roth2021mycelium, ) target a setting of a billion devices. One of their key insights is to run expensive cryptographic protocols among a small, randomly-sampled committee, while leveraging an untrusted resourceful aggregator to help with the aggregation. Among the three systems, Orchard supports learning tasks, while Honeycrisp supports aggregate statistics and Mycelium supports graph analytics. The limitation of Orchard is that it imposes a large overhead on the devices (§2.3, §6). Aero improves over Orchard by several orders of magnitude (§6).

An alternative to cryptography is to use trusted hardware, e.g., Intel SGX (hynes2018efficient, ). These systems add negligible overhead over plain federated learning, but trusting the hardware design and manufacturer is a strong assumption (fei2021security, ; TrustZoneAttacks, ; ArmSEVAttacks, ).

No differential privacy. Many systems provide a weaker notion of privacy than differential privacy, for functionality such as federated machine learning (aono2017privacy, ; rathee2022elsa, ; dong2020eastfly, ; fu2020vfl, ; jiang2020federated, ; jiang2021flashe, ; liu2019secure, ; ma2021privacy, ; mandal2019privfl, ; 254465, ; sav2020poseidon, ; xu2022hercules, ; beguier2020efficient, ; chen2021ppt, ; chowdhury2021eiffel, ; ergun2021sparsified, ; fereidooni2021safelearn, ; guo2020secure, ; hao2021efficient, ; kadhe2020fastsecagg, ; li2021secure, ; liu2020boosting, ; so2021turbo, ; xu2019verifynet, ; zhang2021dubhe, ; mo2021ppfl, ; hashemi2021byzantine, ; quoc2021secfl, ; sav2022privacy, ), statistics (corrigan2017prio, ), and aggregation (bell2020secure, ; bonawitz2017practical, ; liu2022dhsa, ; wan2022information, ; liu2022efficient, ). For instance, BatchCrypt (254465, ) uses Paillier AHE (damgaard2001generalisation, ) to hide updates from the aggregator. The promise is that the adversary learns only the aggregate of the data of many devices. The fundamental issue is that aggregation does not provide a rigorous guarantee: one can learn individual training data from the trained model parameters (zhu2019deep, ; melis2019exploiting, ; briland2017deep, ; shokri2017membership, ).

8. Summary

Federated learning over a large number of mobile devices is getting significant attention in both industry and academia. One big challenge of current practical systems, those that provide good accuracy and efficiency, is the trust they require: the data analyst must say “let’s trust that the server will not be compromised”. Aero adds an alternative. It shows that one can perform FL with good accuracy, moderate overhead, and the rigorous guarantee of differential privacy without trusting a central server or the data analyst. Aero improves the trade-off by focusing on a specific class of learning algorithms and tuning system architecture and design to these algorithms (§4). The main evaluation highlight is that Aero has comparable accuracy to plain federated learning, and improves efficiency over Orchard, the prior work with comparably strong guarantees, by up to five orders of magnitude (§6).

References

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In ACM Conference on Computer and Communications Security (CCS), pages 308–318, 2016.
[2] Y. Aono, T. Hayashi, L. Wang, S. Moriai, et al. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Transactions on Information Forensics and Security, pages 1333–1345, 2017.
[3] arkworks. ark-groth16. https://github.com/arkworks-rs/groth16.
[4] G. Asharov, A. Jain, A. López-Alt, E. Tromer, V. Vaikuntanathan, and D. Wichs. Multiparty computation with low communication, computation and interaction via threshold FHE. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), pages 483–501, 2012.
[5] C. Beguier, M. Andreux, and E. W. Tramel. Efficient sparse secure aggregation for federated learning. arXiv preprint arXiv:2007.14861, 2020.
[6] J. H. Bell, K. A. Bonawitz, A. Gascón, T. Lepoint, and M. Raykova. Secure single-server aggregation with (poly) logarithmic overhead. In ACM Conference on Computer and Communications Security (CCS), pages 1253–1269, 2020.
[7] A. Ben-Efraim, K. Cong, E. Omri, E. Orsini, N. P. Smart, and E. Soria-Vazquez. Large scale, actively secure computation from lpn and free-xor garbled circuits. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2021.
[8] G. Beniamini. Trust issues: Exploiting trustzone TEEs. https://googleprojectzero.blogspot.com/2017/07/trust-issues-exploiting-trustzone-tees.html. Accessed: 2022-01-30.
[9] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.
[10] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy-preserving machine learning. In ACM Conference on Computer and Communications Security (CCS), pages 1175–1191, 2017.
[11] Z. Brakerski. Fully homomorphic encryption without modulus switching from classical GapSVP. In Advances in Cryptology—CRYPTO, pages 868–886, 2012.
[12] M. Chase, R. Gilad-Bachrach, K. Laine, K. Lauter, and P. Rindal. Private collaborative neural network learning. IACR Cryptol. ePrint Arch., 2017.
[13] C. Chen, J. Zhou, B. Wu, W. Fang, L. Wang, Y. Qi, and X. Zheng. Practical privacy preserving POI recommendation. ACM Transactions on Intelligent Systems and Technology (TIST), pages 1–20, 2020.
[14] H. Chen, W. Dai, M. Kim, and Y. Song. Efficient multi-key homomorphic encryption with packed ciphertexts with application to oblivious neural network inference. In ACM Conference on Computer and Communications Security (CCS), pages 395–412, 2019.
[15] Q. Chen, Z. Wang, W. Zhang, and X. Lin. PPT: A privacy-preserving global model training protocol for federated learning in P2P networks. arXiv preprint arXiv:2105.14408, 2021.
[16] X. Chen, J. Ji, C. Luo, W. Liao, and P. Li. When machine learning meets blockchain: A decentralized, privacy-preserving and secure design. In IEEE International Conference on Big Data (Big Data), pages 1178–1187, 2018.
[17] A. R. Chowdhury, C. Guo, S. Jha, and L. van der Maaten. EIFFeL: Ensuring integrity for federated learning. arXiv preprint arXiv:2112.12727, 2021.
[18] G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik. EMNIST: Extending MNIST to handwritten letters. In International Joint Conference on Neural Networks (IJCNN), pages 2921–2926, 2017.
[19] H. Corrigan-Gibbs and D. Boneh. Prio: Private, robust, and scalable computation of aggregate statistics. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 259–282, 2017.
[20] I. Damgård and M. Jurik. A generalisation, a simplification and some applications of Paillier’s probabilistic public-key system. In Proceedings of International Workshop on Practice and Theory in Public Key Cryptography: Public Key Cryptography, pages 119–136, 2001.
[21] I. Damgård, V. Pastro, N. Smart, and S. Zakarias. Multiparty computation from somewhat homomorphic encryption. In Advances in Cryptology—CRYPTO, pages 643–662, 2012.
[22] I. Damgård, M. Fitzi, E. Kiltz, J. B. Nielsen, and T. Toft. Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation. In Theory of Cryptography Conference (TCC), 2006.
[23] J. Ding, G. Liang, J. Bi, and M. Pan. Differentially private and communication efficient collaborative learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021.
[24] Y. Dong, X. Chen, L. Shen, and D. Wang. EaSTFLy: Efficient and secure ternary federated learning. Computers & Security, page 101824, 2020.
[25] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In Symposium on Foundations of Computer Science (FOCS), 2013.
[26] C. Dwork. A firm foundation for private data analysis. Communications of the ACM, pages 86–95, 2011.
[27] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), page 486, 2006.
[28] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), pages 265–284, 2006.
[29] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In ACM Symposium on Theory of Computing (STOC), 2009.
[30] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, page 9(3–4):211–407, 2014.
[31] I. Ergun, H. U. Sami, and B. Guler. Sparsified secure aggregation for privacy-preserving federated learning. arXiv preprint arXiv:2112.12872, 2021.
[32] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In ACM Conference on Computer and Communications Security (CCS), pages 1054–1067, 2014.
[33] J. Fan and F. Vercauteren. Somewhat practical fully homomorphic encryption. IACR Cryptol. ePrint Arch., page 144, 2012.
[34] S. Fei, Z. Yan, W. Ding, and H. Xie. Security vulnerabilities of SGX and countermeasures: A survey. ACM Computing Surveys, 2021.
[35] H. Fereidooni, S. Marchal, M. Miettinen, A. Mirhoseini, H. Möllering, T. D. Nguyen, P. Rieger, A.-R. Sadeghi, T. Schneider, H. Yalame, et al. SAFELearn: secure aggregation for private federated learning. In IEEE Security and Privacy Workshops (SPW), pages 56–62, 2021.
[36] D. Froelicher, P. Egger, J. S. Sousa, J. L. Raisaro, Z. Huang, C. Mouchet, B. Ford, and J.-P. Hubaux. UnLynx: A decentralized system for privacy-conscious data sharing. Proceedings on Privacy Enhancing Technologies, pages 232–250, 2017.
[37] A. Fu, X. Zhang, N. Xiong, Y. Gao, H. Wang, and J. Zhang. VFL: a verifiable federated learning with privacy-preserving for big data in industrial IoT. IEEE Transactions on Industrial Informatics, 2020.
[38] Y. Gilad, R. Hemo, S. Micali, G. Vlachos, and N. Zeldovich. Algorand: Scaling byzantine agreements for cryptocurrencies. In ACM Symposium on Operating Systems Principles (SOSP), page 51–68, 2017.
[39] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game. In ACM Symposium on Theory of Computing (STOC), page 218–229, 1987.
[40] S. Goldwasser, S. Micali, and C. Rackoff. The Knowledge Complexity of Interactive Proof-Systems, page 203–225. Association for Computing Machinery, 2019.
[41] S. D. Gordon, D. Starin, and A. Yerukhimovich. The more the merrier: reducing the cost of large scale mpc. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2021.
[42] A. Grafberger, M. Chadha, A. Jindal, J. Gu, and M. Gerndt. FedLess: Secure and scalable federated learning using serverless computing. In IEEE International Conference on Big Data (Big Data), pages 164–173, 2021.
[43] J. Groth. On the size of pairing-based non-interactive arguments. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), pages 305–326, 2016.
[44] J. Guo, Z. Liu, K.-Y. Lam, J. Zhao, Y. Chen, and C. Xing. Secure weighted aggregation for federated learning. arXiv preprint arXiv:2010.08730, 2020.
[45] V. Gupta and K. Gopinath. An extended verifiable secret redistribution protocol for archival systems. In International Conference on Availability, Reliability and Security (ARES), pages 8–pp, 2006.
[46] M. Hao, H. Li, G. Xu, H. Chen, and T. Zhang. Efficient, private and robust federated learning. In Annual Computer Security Applications Conference, pages 45–60, 2021.
[47] M. Hao, H. Li, G. Xu, S. Liu, and H. Yang. Towards efficient and privacy-preserving federated deep learning. In IEEE International Conference on Communications (ICC), pages 1–6, 2019.
[48] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
[49] F. Hartmann, S. Suh, A. Komarzewski, T. D. Smith, and I. Segall. Federated learning for ranking browser history suggestions. arXiv preprint arXiv:1911.11807, 2019.
[50] H. Hashemi, Y. Wang, C. Guo, and M. Annavaram. Byzantine-robust and privacy-preserving framework for FedML. arXiv preprint arXiv:2105.02295, 2021.
[51] C. He, S. Li, J. So, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu, L. Shen, P. Zhao, Y. Kang, Y. Liu, R. Raskar, Q. Yang, M. Annavaram, and S. Avestimehr. Fedml: A research library and benchmark for federated machine learning. Advances in Neural Information Processing Systems, Best Paper Award at Federate Learning Workshop, 2020.
[52] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[53] L. He, S. P. Karimireddy, and M. Jaggi. Secure byzantine-robust machine learning. arXiv preprint arXiv:2006.04747, 2020.
[54] B. Hitaj, G. Ateniese, and F. Perez-Cruz. Deep models under the GAN: Information leakage from collaborative deep learning. In ACM Conference on Computer and Communications Security (CCS), page 603–618, 2017.
[55] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[56] N. Hynes, R. Cheng, and D. Song. Efficient deep learning on multi-source private data. arXiv preprint arXiv:1807.06689, 2018.
[57] M. Jiang, T. Jung, R. Karl, and T. Zhao. Federated dynamic GNN with secure aggregation. arXiv preprint arXiv:2009.07351, 2020.
[58] Z. Jiang, W. Wang, and Y. Liu. Flashe: Additively symmetric homomorphic encryption for cross-silo federated learning. arXiv preprint arXiv:2109.00675, 2021.
[59] S. Kadhe, N. Rajaraman, O. O. Koyluoglu, and K. Ramchandran. Fastsecagg: Scalable secure aggregation for privacy-preserving federated learning. In ICML Workshop on Federated Learning for User Privacy and Data Confidentiality, 2020.
[60] P. Kairouz, B. McMahan, S. Song, O. Thakkar, A. Thakurta, and Z. Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pages 5213–5225, 2021.
[61] B. Knott, S. Venkataraman, A. Hannun, S. Sengupta, M. Ibrahim, and L. van der Maaten. Crypten: Secure multi-party computation meets machine learning. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[62] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[63] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[64] KU Leuven COSIC. SCALE-MAMBA. https://github.com/KULeuven-COSIC/SCALE-MAMBA.
[65] F. Lai, Y. Dai, S. S. Singapuram, J. Liu, X. Zhu, H. V. Madhyastha, and M. Chowdhury. FedScale: Benchmarking model and system performance of federated learning at scale. In International Conference on Machine Learning (ICML), 2022.
[66] F. Lai, X. Zhu, H. V. Madhyastha, and M. Chowdhury. Oort: Efficient federated learning via guided participant selection. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 19–35, 2021.
[67] Y. LeCun, L. D. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, et al. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural Networks: the statistical mechanics perspective, 1995.
[68] K. H. Li, P. P. B. de Gusmão, D. J. Beutel, and N. D. Lane. Secure aggregation for federated learning in flower. In Proceedings of ACM International Workshop on Distributed Machine Learning, pages 8–14, 2021.
[69] M. Li, Y. Zhang, Z. Lin, and Y. Solihin. Exploiting unprotected I/O operations in AMD’s secure encrypted virtualization. In USENIX Security Symposium, 2019.
[70] C. Liu, S. Chakraborty, and D. Verma. Secure model fusion for distributed learning using partial homomorphic encryption. In Policy-Based Autonomic Data Governance, pages 154–179. Springer, 2019.
[71] Y. Liu, Z. Ma, X. Liu, S. Ma, S. Nepal, R. H. Deng, and K. Ren. Boosting privately: Federated extreme gradient boosting for mobile crowdsensing. In International Conference on Distributed Computing Systems (ICDCS), pages 1–11, 2020.
[72] Z. Liu, S. Chen, J. Ye, J. Fan, H. Li, and X. Li. DHSA: efficient doubly homomorphic secure aggregation for cross-silo federated learning. The Journal of Supercomputing, 2022.
[73] Z. Liu, J. Guo, K.-Y. Lam, and J. Zhao. Efficient dropout-resilient aggregation for privacy-preserving machine learning. IEEE Transactions on Information Forensics and Security, 2022.
[74] Y. Lu, X. Huang, Y. Dai, S. Maharjan, and Y. Zhang. Blockchain and federated learning for privacy-preserved data sharing in industrial IoT. IEEE Transactions on Industrial Informatics, pages 4177–4186, 2019.
[75] J. Ma, S.-A. Naas, S. Sigg, and X. Lyu. Privacy-preserving federated learning based on multi-key homomorphic encryption. arXiv preprint arXiv:2104.06824, 2021.
[76] K. Mandal and G. Gong. PrivFL: Practical privacy-preserving federated regressions on high-dimensional data over mobile networks. In Proceedings of ACM SIGSAC Conference on Cloud Computing Security Workshop, pages 57–68, 2019.
[77] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282, 2017.
[78] B. McMahan and A. Thakurta. Federated learning with formal differential privacy guarantees. https://ai.googleblog.com/2022/02/federated-learning-with-formal.html. Accessed: 2022-12-12.
[79] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018.
[80] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018.
[81] L. Melis, C. Song, E. D. Cristofaro, and V. Shmatikov. Exploiting unintended feature leakage in collaborative learning. In IEEE Symposium on Security and Privacy (S&P), pages 691–706, 2019.
[82] Microsoft. Microsoft SEAL (release 3.7). https://github.com/Microsoft/SEAL.
[83] F. Mo, H. Haddadi, K. Katevas, E. Marin, D. Perino, and N. Kourtellis. PPFL: privacy-preserving federated learning with trusted execution environments. arXiv preprint arXiv:2104.14380, 2021.
[84] V. Mugunthan, R. Rahman, and L. Kagal. BlockFLow: An accountable and privacy-preserving solution for federated learning. arXiv preprint arXiv:2007.03856, 2020.
[85] M. Naseri, J. Hayes, and E. D. Cristofaro. Local and central differential privacy for robustness and privacy in federated learning. Proceedings of the Network and Distributed System Security Symposium, 2022.
[86] T. T. Nguyên, X. Xiao, Y. Yang, S. C. Hui, H. Shin, and J. Shin. Collecting and analyzing data from smart device users with local differential privacy. arXiv preprint arXiv:1606.05053, 2016.
[87] A. Nitulescu. zk-snarks: A gentle introduction. https://www.di.ens.fr/~nitulesc/files/Survey-SNARKs.pdf, 2020.
[88] C. Niu, F. Wu, S. Tang, L. Hua, R. Jia, C. Lv, Z. Wu, and G. Chen. Secure federated submodel learning. arXiv preprint arXiv:1911.02254, 2019.
[89] M. A. Pathak, S. Rane, and B. Raj. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems (NeurIPS), pages 1876–1884, 2010.
[90] pytorch. Pytorch. https://github.com/pytorch/pytorch.
[91] D. L. Quoc and C. Fetzer. SecFL: Confidential federated learning using TEEs. arXiv preprint arXiv:2110.00981, 2021.
[92] V. Rastogi and S. Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of ACM SIGMOD International Conference on Management of data, pages 735–746, 2010.
[93] M. Rathee, C. Shen, S. Wagh, and R. A. Popa. ELSA: Secure aggregation for federated learning with malicious actors. Cryptology ePrint Archive, 2022.
[94] S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečnỳ, S. Kumar, and H. B. McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2020.
[95] R. L. Rivest, L. Adleman, M. L. Dertouzos, et al. On data banks and privacy homomorphisms. Foundations of secure computation, 1978.
[96] A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In ACM Symposium on Theory of Computing (STOC), 2010.
[97] E. Roth, K. Newatia, Y. Ma, K. Zhong, S. Angel, and A. Haeberlen. Mycelium: Large-scale distributed graph queries with differential privacy. In ACM Symposium on Operating Systems Principles (SOSP), page 327–343, 2021.
[98] E. Roth, D. Noble, B. H. Falk, and A. Haeberlen. Honeycrisp: Large-scale differentially private aggregation without a trusted core. In ACM Symposium on Operating Systems Principles (SOSP), pages 196–210, 2019.
[99] E. Roth, H. Zhang, A. Haeberlen, and B. C. Pierce. Orchard: Differentially private analytics at scale. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 1065–1081, 2020.
[100] S. Sav, A. Diaa, A. Pyrgelis, J.-P. Bossuat, and J.-P. Hubaux. Privacy-preserving federated recurrent neural networks. arXiv preprint arXiv:2207.13947, 2022.
[101] S. Sav, A. Pyrgelis, J. R. Troncoso-Pastoriza, D. Froelicher, J.-P. Bossuat, J. S. Sousa, and J.-P. Hubaux. Poseidon: Privacy-preserving federated neural network learning. In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2021.
[102] J. T. Schwartz. Fast probabilistic algorithms for verification of polynomial identities. J. ACM, 27(4):701–717, 1980.
[103] A. G. Sébert, R. Sirdey, O. Stan, and C. Gouy-Pailler. Protecting data from all parties: Combining FHE and DP in federated learning. arXiv preprint arXiv:2205.04330, 2022.
[104] M. Seif, R. Tandon, and M. Li. Wireless federated learning with local differential privacy. In IEEE International Symposium on Information Theory (ISIT), pages 2604–2609, 2020.
[105] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (S&P), pages 3–18, 2017.
[106] J. So, B. Güler, and A. S. Avestimehr. Turbo-aggregate: Breaking the quadratic aggregation barrier in secure federated learning. IEEE Journal on Selected Areas in Information Theory, pages 479–489, 2021.
[107] T. Stevens, C. Skalka, C. Vincent, J. Ring, S. Clark, and J. Near. Efficient differentially private secure aggregation for federated learning via hardness of learning with errors. arXiv preprint arXiv:2112.06872, 2021.
[108] L. Sun and L. Lyu. Federated model distillation with noise-free differential privacy. arXiv preprint arXiv:2009.05537, 2020.
[109] L. Sun, J. Qian, and X. Chen. LDP-FL: Practical private aggregation in federated learning with local differential privacy. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1571–1578, 2021.
[110] S. Tan, B. Knott, Y. Tian, and D. J. Wu. Cryptgpu: Fast privacy-preserving machine learning on the gpu. In 2021 IEEE Symposium on Security and Privacy (SP), pages 1021–1038. IEEE, 2021.
[111] S. Truex, N. Baracaldo, A. Anwar, T. Steinke, H. Ludwig, R. Zhang, and Y. Zhou. A hybrid approach to privacy-preserving federated learning. In ACM Workshop on Artificial Intelligence and Security, pages 1–11, 2019.
[112] S. Truex, L. Liu, K.-H. Chow, M. E. Gursoy, and W. Wei. LDP-Fed: Federated learning with local differential privacy. In Proceedings of the ACM International Workshop on Edge Systems, Analytics and Networking, pages 61–66, 2020.
[113] K. Wan, H. Sun, M. Ji, and G. Caire. Information theoretic secure aggregation with uncoded groupwise keys. arXiv preprint arXiv:2204.11364, 2022.
[114] N. Wang, X. Xiao, Y. Yang, J. Zhao, S. C. Hui, H. Shin, J. Shin, and G. Yu. Collecting and analyzing multidimensional data with local differential privacy. In IEEE International Conference on Data Engineering (ICDE), pages 638–649, 2019.
[115] P. Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
[116] G. Xu, X. Han, S. Xu, T. Zhang, H. Li, X. Huang, and R. H. Deng. Hercules: Boosting the performance of privacy-preserving federated learning. arXiv preprint arXiv:2207.04620, 2022.
[117] G. Xu, H. Li, S. Liu, K. Yang, and X. Lin. VerifyNet: Secure and verifiable federated learning. IEEE Transactions on Information Forensics and Security, pages 911–926, 2019.
[118] R. Xu, N. Baracaldo, Y. Zhou, A. Anwar, S. Kadhe, and H. Ludwig. DeTrust-FL: Privacy-preserving federated learning in decentralized trust setting. In IEEE International Conference on Cloud Computing (CLOUD), 2022.
[119] R. Xu, N. Baracaldo, Y. Zhou, A. Anwar, and H. Ludwig. HybridAlpha: An efficient approach for privacy-preserving federated learning. In ACM Workshop on Artificial Intelligence and Security, pages 13–23, 2019.
[120] A. C. Yao. Protocols for secure computations. In Annual Symposium on Foundations of Computer Science (SFCS), pages 160–164, 1982.
[121] A. Yousefpour, I. Shilov, A. Sablayrolles, D. Testuggine, K. Prasad, M. Malek, J. Nguyen, S. Ghosh, A. Bharadwaj, J. Zhao, G. Cormode, and I. Mironov. Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021.
[122] D. Zeng, S. Liu, and Z. Xu. Aggregating gradients in encoded domain for federated learning. arXiv preprint arXiv:2205.13216, 2022.
[123] C. Zhang, S. Li, J. Xia, W. Wang, F. Yan, and Y. Liu. BatchCrypt: Efficient homomorphic encryption for cross-silo federated learning. In USENIX Annual Technical Conference (USENIX ATC), pages 493–506, 2020.
[124] S. Zhang, Z. Li, Q. Chen, W. Zheng, J. Leng, and M. Guo. Dubhe: Towards data unbiasedness with homomorphic encryption in federated learning client selection. In International Conference on Parallel Processing, pages 1–10, 2021.
[125] K. Zhu, P. Van Hentenryck, and F. Fioretto. Bias and variance of post-processing in differential privacy. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11177–11184, 2021.
[126] L. Zhu, Z. Liu, and S. Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[127] R. Zippel. Probabilistic algorithms for sparse polynomials. In International Symposium on Symbolic and Algebraic Manipulation, pages 216–226, 1979.

Appendix A Supplementary material

A.1. Privacy proof

The goal of Aero is to provide differential privacy for a class of federated learning algorithms. We will take DP-FedAvg as an example and prove that Aero indeed meets its goal in multiple steps. The proof for other algorithms such as DP-FedSGD is similar. The outline of the proof is as follows.

First, we will introduce a slightly modified version of DP-FedAvg that makes explicit the behavior of the malicious aggregator and devices. This modified version changes line 8 and line 12 in Figure 2. For instance, we will modify line 8 in Figure 2 to show that a byzantine aggregator may allow malicious devices to be sampled in a round. Appendix A.1.1 introduces the modified version and shows that the changes do not impact DP-FedAvg’s differential privacy guarantee.

Next, we will prove that Aero executes the modified DP-FedAvg algorithm faithfully, by showing that enough Gaussian noise will be added (Appendix A.1.2) and if an adversary introduces an error into the aggregation, it will be caught with high probability (Appendix A.1.3).

Finally, we will prove that after aggregation, Aero’s decryption protocol does not leak any information beyond the allowed output of DP-FedAvg (Appendix A.1.4).

We will not cover security of the protocols used in the setup phase, e.g., the sortition protocol to form committees, because Aero does not innovate on these protocols. With the above proof structure, we will show Aero provides the differential privacy guarantee to honest devices’ data.

Before proceeding to the proof, we introduce a few definitions. In the main body of the paper, for simplicity, we did not distinguish between honest-but-offline and malicious devices for the DP-noise committee (§4.2). Instead, we considered all offline devices as malicious for this committee, since it’s not possible to tell whether an offline device is malicious or not.

However, for devices that generate updates, we do protect honest-but-offline devices’ data. So when talking about generator devices, we use the following terms:

• honest-and-online devices,
• honest-but-offline devices,
• honest devices: including both honest-and-online and honest-but-offline devices,
• malicious devices.

As for the DP-noise committee members, we still follow the previous definition, namely

• honest members: honest and online members,
• malicious members: malicious or honest-but-offline members.

A.1.1. DP-FedAvg

As mentioned above, Aero must take into account the behavior of malicious entities for the computation in line 8 and line 12 of Figure 2.

There are two reasons Aero must modify line 8: first, a malicious aggregator may filter out honest-but-offline devices’ data in the aggregation; second, a malicious aggregator may add malicious devices’ data in the aggregation.

To capture the power of a malicious aggregator, we change line 8 to be

$C^{t}\leftarrow$ subset of (sampled honest users with probability q) + (some malicious devices)

We must modify line 12 because the DP-noise committee is likely to add more noise as noted in the design of the generate phase (§4.2). To capture this additional noise, we change line 12 to be

$\Delta^{t}\leftarrow\sum_{k}\Delta_{k}^{t}+\mathcal{N}(0,I\sigma^{2})$ + some additional bounded noise
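The two modified lines can be sketched in code as follows. This is a minimal illustration under stated assumptions, not Aero's implementation: the helper names (`sample_round`, `clip`, `noisy_aggregate`), the aggregator's filtering policy `keep`, the use of scalar updates, and the choice to model the additional noise as a second Gaussian draw are all hypothetical.

```python
import random

def sample_round(honest_ids, malicious_ids, q, keep, rng):
    """Modified line 8: a subset of the honest devices sampled with
    probability q, plus malicious devices admitted by the aggregator."""
    sampled = [i for i in honest_ids if rng.random() < q]
    # A malicious aggregator may drop any sampled honest device.
    # `keep` is a hypothetical stand-in for that adversarial choice.
    kept = [i for i in sampled if keep(i)]
    return sampled, kept + list(malicious_ids)

def clip(update, s=1.0):
    """Clip a scalar update so its magnitude is at most s (sensitivity S)."""
    return max(-s, min(s, update))

def noisy_aggregate(updates, sigma, extra_sigma, rng):
    """Modified line 12: sum of clipped updates, plus N(0, sigma^2),
    plus additional noise (modeled here, as an assumption, as Gaussian)."""
    total = sum(clip(u) for u in updates)
    total += rng.gauss(0.0, sigma)        # noise required by DP-FedAvg
    total += rng.gauss(0.0, extra_sigma)  # surplus noise from the committee
    return total
```

In this sketch the honest devices that reach the aggregate are always a subset of those sampled with probability $q$, which is the property the rest of this section relies on.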

In the remainder of this section, we will prove that the modifications preserve the privacy guarantee of the original DP-FedAvg algorithm (brendan2018learning, ).

Device sampling (line 8). For device sampling, there are two possible attacks: either some malicious devices’ data will be included or some honest devices’ data will be filtered out.

The first case is not a problem, since it’s equivalent to post-processing: for example, after aggregation, the malicious aggregator can add malicious updates. Post-processing of a differentially private result does not affect the differential privacy guarantee (this follows from the post-processing lemma in the differential privacy literature (zhu2021bias, ; dwork2006calibrating, )).

To prove that filtering out honest devices’ data will not impact the privacy guarantee, we’ll first give intuition and then a rigorous proof. Intuitively, a larger sampling probability (larger $q$ ) leads to more privacy loss, because a device’s data is more likely to be used in training. So informally, if each device is expected to contribute an update fewer times, the privacy loss is expected to be less. Now coming back to the case where the aggregator filters out honest devices’ data, it is obvious that each device is expected to contribute updates no more frequently than without filtering, which means the privacy loss is expected to be no more than without filtering.

In the original paper of DP-FedAvg (brendan2018learning, ), the DP guarantee relies on the moments accountant introduced by Abadi et al. (abadi2016deep, ), whose tail bound can be converted to $(\epsilon,\delta)$ -differential privacy. To be precise, the proof uses a lemma (which we will introduce shortly) that gives the moments bound on Gaussian noise with random sampling, which is equivalent to $(\epsilon,\delta)$ -differential privacy. We cite this lemma as Lemma A.1 in the appendix.

Next, we show that if the original random sampling is replaced with random sampling plus filtering, the moments bound still holds. Without loss of generality, we focus on functions whose sensitivity is 1, for example, the UserUpdate function that does local training and gradient clipping with $S=1$ in DP-FedAvg (Figure 2). Notice that for any function $f^{\prime}$ whose sensitivity is $S^{\prime}\neq 1$, we can always construct a function $f=f^{\prime}/S^{\prime}$ whose sensitivity is 1.

Lemma A.1.

Given any function $f$ whose norm satisfies $\|f(\cdot)\|_{2}\leq 1$, let $z\geq 1$ be some noise scale and $\sigma=z\cdot\|f(\cdot)\|_{2}$, let $d=\{d_{1},...,d_{n}\}$ be a database, and let $\mathcal{J}$ be a sample from $[n]$ where each $i\in[n]$ is chosen independently with probability $q\leq\frac{1}{16\sigma}$. Then for any positive integer $\lambda\leq\sigma^{2}\ln\frac{1}{q\sigma}$, the function $\mathcal{G}(d)=\sum_{i\in\mathcal{J}}f(d_{i})+\mathcal{N}(0,\sigma^{2}\boldsymbol{I})$ satisfies

$$\alpha_{\mathcal{G}}(\lambda)\leq\frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}}+O(q^{3}\lambda^{2}/\sigma^{3}).$$

Notice that if the bound on $\alpha_{\mathcal{G}}$ does not change when some data points are filtered out after sampling, the differential privacy guarantee will not change either. We refer readers to (abadi2016deep, ) for more details.
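To make the side conditions of Lemma A.1 concrete, the following sketch checks them for given parameters and evaluates the leading term of the bound. The function names are hypothetical, and the $O(\cdot)$ term is deliberately dropped, so the returned value is only the dominant part of the bound and not a certified privacy guarantee.

```python
import math

def preconditions_hold(q, sigma, lam):
    """Side conditions of Lemma A.1:
    q <= 1/(16*sigma) and positive integer lam <= sigma^2 * ln(1/(q*sigma))."""
    if not (0 < q <= 1 / (16 * sigma)):
        return False
    if not (isinstance(lam, int) and lam >= 1):
        return False
    return lam <= sigma ** 2 * math.log(1 / (q * sigma))

def leading_term(q, sigma, lam):
    """Leading term q^2 * lam * (lam + 1) / ((1 - q) * sigma^2) of the bound.
    The higher-order O(.) term is omitted in this sketch."""
    return q ** 2 * lam * (lam + 1) / ((1 - q) * sigma ** 2)

# Example parameters (illustrative only): q = 0.01, sigma = 1, lam = 4.
if preconditions_hold(0.01, 1.0, 4):
    print(leading_term(0.01, 1.0, 4))
```

Neither function takes the sampled set as an argument: the bound depends only on $q$, $\sigma$, and $\lambda$, which is why a filtering step applied after sampling can leave it unchanged.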

Claim A.1.

Let $\mathcal{J}$ be a sample from $[n]$ where each $i\in[n]$ is chosen independently with probability $q\leq\frac{1}{16\sigma}$. Let $\mathcal{J}^{\prime}\subseteq\mathcal{J}$ be some arbitrary subset of $\mathcal{J}$. Then for any positive integer $\lambda\leq\sigma^{2}\ln\frac{1}{q\sigma}$, the function $\mathcal{G^{\prime}}(d)=\sum_{i\in\mathcal{J}^{\prime}}f(d_{i})+\mathcal{N}(0,\sigma^{2}\boldsymbol{I})$ also satisfies

$$\alpha_{\mathcal{G^{\prime}}}(\lambda)\leq\frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}}+O(q^{3}\lambda^{2}/\sigma^{3}).$$

Proof.

Let’s abuse the notation a little bit and define $\mathcal{G}_{f,\mathcal{J}}$ to be

\mathcal{G}_{f,\mathcal{J}}(d)=\sum_{i\in\mathcal{J}}f(d_{i})+\mathcal{N}(0,\sigma^{2}\boldsymbol{I}).

To prove this claim, since by definition $\mathcal{G^{\prime}}(d)=\mathcal{G}_{f,\mathcal{J^{\prime}}}(d)$ , we just need to show

\alpha_{\mathcal{G}_{f,\mathcal{J^{\prime}}}}(\lambda)\leq\frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}}+O(q^{3}\lambda^{2}/\sigma^{3}).

To do this, suppose for now we have an $f^{\prime}$ on $\mathcal{J}$ whose sensitivity is no greater than 1 and which gives the same output as $f$ on $\mathcal{J}^{\prime}$ , namely

\|f^{\prime}(\cdot)\|_{2}\leq 1,\quad\mathcal{G}_{f^{\prime},\mathcal{J}}(d)=\mathcal{G}_{f,\mathcal{J^{\prime}}}(d).

By applying Lemma A.1 on $\mathcal{G}_{f^{\prime},\mathcal{J}}$ , we get

\alpha_{\mathcal{G}_{f^{\prime},\mathcal{J}}}(\lambda)\leq\frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}}+O(q^{3}\lambda^{2}/\sigma^{3}).

Since $\mathcal{G}_{f^{\prime},\mathcal{J}}(d)=\mathcal{G}_{f,\mathcal{J^{\prime}}}(d)$ ,

\alpha_{\mathcal{G}_{f^{\prime},\mathcal{J}}}(\lambda)=\alpha_{\mathcal{G}_{f,\mathcal{J^{\prime}}}}(\lambda).

Combining the preceding two equations, we get

\alpha_{\mathcal{G}_{f,\mathcal{J^{\prime}}}}(\lambda)=\alpha_{\mathcal{G}_{f^{\prime},\mathcal{J}}}(\lambda)\leq\frac{q^{2}\lambda(\lambda+1)}{(1-q)\sigma^{2}}+O(q^{3}\lambda^{2}/\sigma^{3}).

The remaining task is to construct such an $f^{\prime}$ on $\mathcal{J}$ , where $\|f^{\prime}(\cdot)\|_{2}\leq 1$ , that gives the same output as $f$ on $\mathcal{J}^{\prime}$ . One possible $f^{\prime}$ is as follows:

f^{\prime}(d_{i})=\begin{cases}f(d_{i})&\quad\text{if }i\in\mathcal{J}^{\prime},\\ 0&\quad\text{if }i\in\mathcal{J}-\mathcal{J}^{\prime}.\end{cases}

It’s easy to prove $\sum_{i\in\mathcal{J}^{\prime}}f(d_{i})=\sum_{i\in\mathcal{J}}f^{\prime}(d_{i})$ .

Next, for the sensitivity of $f^{\prime}$ , it is not difficult to see $\|f^{\prime}(\cdot)\|_{2}\leq\|f(\cdot)\|_{2}$ , since by removing or adding one entry to the database, $f^{\prime}$ will incur either the same change as $f$ or no change. ∎

So far we have proved that if only a subset of selected devices are included in aggregation or more malicious devices’ data is included, the differential privacy guarantee will not be impacted.
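The zero-padding construction used in the proof of Claim A.1 can be checked on a small example. The following is a minimal sketch under stated assumptions: `clip` is a hypothetical scalar stand-in for a norm-bounded $f$ , the filter that picks $\mathcal{J}^{\prime}$ is arbitrary, and the Gaussian noise term is omitted because it is identical on both sides.

```python
import math
import random


def clip(x: float) -> float:
    """Hypothetical f with |f(.)| <= 1: clip a scalar update to [-1, 1]."""
    return max(-1.0, min(1.0, x))


def make_f_prime(kept: set):
    """f' from the proof: equals f on J' and 0 on J - J'."""
    def f_prime(i: int, d_i: float) -> float:
        return clip(d_i) if i in kept else 0.0
    return f_prime


rng = random.Random(0)
n, q = 1000, 0.05
data = [rng.uniform(-3, 3) for _ in range(n)]

sampled = [i for i in range(n) if rng.random() < q]   # J: independent sampling
kept = {i for i in sampled if rng.random() < 0.7}     # J': an arbitrary subset of J

f_prime = make_f_prime(kept)
sum_filtered = sum(clip(data[i]) for i in sampled if i in kept)  # sum over J' of f
sum_padded = sum(f_prime(i, data[i]) for i in sampled)           # sum over J of f'

same_sum = math.isclose(sum_filtered, sum_padded, abs_tol=1e-12)
norm_ok = all(abs(f_prime(i, data[i])) <= 1.0 for i in sampled)
```

Both `same_sum` and `norm_ok` hold by construction, which mirrors the two facts the proof needs: the filtered sum equals the padded sum, and $f^{\prime}$ never exceeds the norm bound of $f$ .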

Gaussian Noise (line 12). Recall that we also add some additional noise in line 12 in Figure 2. For the Gaussian noise, we need to prove additional noise will not impact privacy, which is not difficult to show, since this change is also equivalent to post-processing.

With the above proof, we have shown that our modified DP-FedAvg provides the same privacy guarantee as the original DP-FedAvg.

A.1.2. DP-noise committee

This section shows that the $\mathcal{N}(0,I\sigma^{2})$ part of line 12 of our modified DP-FedAvg is executed faithfully.

We have already covered in the generate phase (§4.2) that the DP-noise committee generates $\mathcal{N}(0,I\sigma^{2})$ noise as long as the committee does not violate its threshold, that is, fewer than $A$ out of its $C$ devices are malicious. Thus, in this section we derive the probability of a committee having fewer than some threshold of honest members. We compute this probability in the same way as Honeycrisp (roth2019honeycrisp, ), which is a building block for Orchard (roth2020orchard, ).

Claim A.2 ().

(Aero) If a randomly sampled DP-noise committee has size $C$ and each committee member is malicious with probability $f$ , then the probability that the committee has fewer than $(1-t)\cdot C$ honest members is upper-bounded by $\mathit{p}=e^{-fC}(\frac{ef}{t})^{tC}$ , when $1>t\geq f.$

Proof.

We treat each member being malicious as independent events. Let $X_{i}$ be a random variable

X_{i}=\begin{cases}1,&\quad\text{if member i is malicious,}\\ 0,&\quad\text{if member i is honest.}\end{cases}

Let $X=\sum_{i}X_{i}$ be the random variable representing the number of malicious members. If $t\geq f$ , the Chernoff bound shows that

Pr(X\geq tC)\leq e^{-fC}(\frac{ef}{t})^{tC}.

So the probability of fewer than $(1-t)\cdot C$ members being honest will be upper-bounded by $\mathit{p}.$ ∎

With (high) probability, the lower bound of honest committee members will be $(1-t)C$ . As long as each honest member contributes $\frac{1}{(1-t)C}\cdot\mathcal{N}(0,\sigma^{2}\boldsymbol{I})$ noise, the total amount of noise will be no less than $\mathcal{N}(0,\sigma^{2}\boldsymbol{I})$ .

To give some examples of what committee sizes could be, when $f=0.03$ , we may set $t=1/7,C=280$ , to achieve $\mathit{p}=4.1\cdot 10^{-14}$ ; when $f=0.05$ , we may set $t=1/8,C=350$ to achieve $\mathit{p}=9.78\cdot 10^{-7}$ or $C=450$ to achieve $\mathit{p}=1.87\cdot 10^{-8}$ ; and, when $f=0.10$ , we may set $t=1/5,C=350$ to achieve $\mathit{p}=1.34\cdot 10^{-6}$ or $C=450$ to achieve $\mathit{p}=2.82\cdot 10^{-8}$ .
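The committee-size examples can be reproduced directly from the bound in Claim A.2. The sketch below evaluates $\mathit{p}=e^{-fC}(\frac{ef}{t})^{tC}$ in log space to avoid underflow; the function name `committee_failure_bound` is hypothetical.

```python
import math


def committee_failure_bound(f: float, t: float, C: int) -> float:
    """Chernoff bound p = exp(-f*C) * (e*f/t)^(t*C) from Claim A.2.

    f: probability that a committee member is malicious (0 < f <= t)
    t: tolerated malicious fraction (t < 1)
    C: committee size
    """
    if not (0 < f <= t < 1):
        raise ValueError("the bound requires 0 < f <= t < 1")
    # log p = -f*C + t*C * (1 + ln(f/t))
    log_p = -f * C + t * C * (1.0 + math.log(f / t))
    return math.exp(log_p)


settings = [
    (0.03, 1 / 7, 280),
    (0.05, 1 / 8, 350),
    (0.05, 1 / 8, 450),
    (0.10, 1 / 5, 350),
    (0.10, 1 / 5, 450),
]
bounds = [committee_failure_bound(f, t, C) for f, t, C in settings]
```

Evaluating `bounds` gives the five values of $\mathit{p}$ quoted above, to the stated precision. The same function can be used to search for the smallest $C$ that meets a target failure probability for a given $f$ and $t$ .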

A.1.3. Aggregation

This section will prove that the additions in line 12 in our modified DP-FedAvg are executed faithfully in Aero. Otherwise, the aggregator will be caught with a high probability.

We will first prove the integrity of additions as in Aero’s add phase protocol and in Honeycrisp. Next, we will prove the freshness guarantee that no ciphertexts from previous rounds can be included in the current aggregation. Finally, we will prove how these two proofs together show that line 12 of modified DP-FedAvg is executed faithfully by Aero.

Integrity. We will start with the integrity claim from Honeycrisp (roth2019honeycrisp, ).

Claim A.3 ().

At the end of add phase, if no device has found malicious activity by the aggregator $\mathcal{A}$ , the sum of the ciphertexts published by $\mathcal{A}$ is correct (with high probability) and no inputs of malicious nodes are dependent on inputs of honest nodes (in the same round).

Proof.

We refer readers to Honeycrisp (roth2019honeycrisp, ) for more details. Here we will just give a short version for demonstration.

We will first prove no inputs of malicious devices are dependent on inputs of honest devices in the same round. Next we will prove if a malicious aggregator introduces an error into a summation tree, it will be caught with high probability.

First, assume for the sake of contradiction that it is possible for a malicious device to set its ciphertext to be $c$ that is from an honest device in the current round. The adversary needs to produce a $t=Hash(r||c||\pi)$ and include $t$ in the Merkle tree $MC$ , before the honest device reveals $c$ . Under the Random Oracle assumption in cryptography, this is not possible. So no inputs of malicious devices are dependent on inputs of honest devices in the same round.
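The commit-then-reveal step in this argument can be sketched as follows. This is an illustration only: SHA-256 stands in for the random oracle $Hash$ , the length-prefixed encoding of $r||c||\pi$ is an assumption made here to keep the concatenation unambiguous, and the helper names are hypothetical.

```python
import hashlib
import hmac
import os


def commit(r: bytes, c: bytes, pi: bytes) -> bytes:
    """Compute t = Hash(r || c || pi), with SHA-256 as an assumed hash."""
    h = hashlib.sha256()
    for part in (r, c, pi):
        # Length prefix (an assumption of this sketch) avoids ambiguous splits.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.digest()


def verify_opening(t: bytes, r: bytes, c: bytes, pi: bytes) -> bool:
    """Check that (r, c, pi) opens the previously published commitment t."""
    return hmac.compare_digest(commit(r, c, pi), t)


# A device commits first and reveals (r, c, pi) only after the tree is fixed.
nonce = os.urandom(16)
ciphertext = b"placeholder ciphertext bytes"
proof = b"placeholder proof bytes"
commitment = commit(nonce, ciphertext, proof)
opens = verify_opening(commitment, nonce, ciphertext, proof)
```

A device that wanted to reuse an honest device's ciphertext would have to publish a matching commitment before that ciphertext is revealed, which is what the random oracle assumption rules out.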

Next we need to show if $\mathcal{A}$ introduces an error into a summation tree $ST$ , it will be caught with high probability. In particular, here we will just show the case where $\mathcal{A}$ introduces an error into one of the leaf nodes. Similar analysis can be done for non-leaf nodes, which is presented in Honeycrisp (roth2019honeycrisp, ).

Suppose the total number of verifiers is $W$ and the total number of leaf nodes in one summation tree is $M^{\prime}$ (Figure 4). Here we make the assumption that $M^{\prime}\approx qW$ , where $q$ is the sampling probability. We will prove this assumption later in this section.

Suppose $\mathcal{A}$ introduces an error into a particular leaf node $j\in[0,M^{\prime}-1]$ . The probability of an honest device picking any $v_{init}$ to have $j\in[v_{init},v_{init}+s]$ is

\frac{qs}{M^{\prime}}.

Since there are at least $(1-f)W$ honest devices, the probability of no honest device checking $j$ is

(1-\frac{qs}{M^{\prime}})^{(1-f)W}=(1-\frac{s}{W})^{(1-f)W}\leq e^{-(1-f)s}.

Similarly, we can prove that if $\mathcal{A}$ introduces an error into non-leaf nodes, it will be caught with high probability. For instance, if $f=3\%,s=5$ , the probability that the error goes unchecked is about 0.007; if $f=3\%,s=20$ , the probability is about $10^{-11}$ .

∎
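For the leaf-node case, the exact expression and its exponential upper bound can be compared numerically. The sketch below assumes $M^{\prime}=qW$ as in the proof, so that $\frac{qs}{M^{\prime}}=\frac{s}{W}$ ; the function names and the value of $W$ are illustrative, and the non-leaf analysis from Honeycrisp is not covered.

```python
import math


def miss_probability(s: int, W: int, f: float) -> float:
    """(1 - s/W)^((1-f)*W): no honest verifier checks a given leaf (M' = qW)."""
    return (1.0 - s / W) ** ((1.0 - f) * W)


def miss_bound(s: int, f: float) -> float:
    """Upper bound e^{-(1-f)*s} on the miss probability."""
    return math.exp(-(1.0 - f) * s)


W = 10**6            # illustrative number of verifiers
f = 0.03             # fraction of malicious devices
exact = miss_probability(5, W, f)
bound = miss_bound(5, f)
```

With $f=3\%$ and $s=5$ the bound evaluates to about 0.0078, and the exact expression stays below it for any $W>s$ , since $\ln(1-x)\leq-x$ .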

As noted above, the proof relies on an assumption about the threshold of Sybils (pseudonymous leaf nodes) the aggregator can introduce into the summation tree. Intuitively, if the aggregator can introduce as many Sybils as it wants to make $M^{\prime}\gg qW$ , then the probability of each node being covered will decrease by a large factor, thus impacting privacy. For example, if $W=100,q=0.01$ , $M^{\prime}$ is expected to be close to $qW=1$ and the 100 verifiers expect to verify only 1 leaf node. However, if the aggregator introduces 99 additional leaf nodes into the summation tree while the 100 verifiers still expect to verify only 1 leaf node, many malicious nodes are very likely to be missed by the verifiers. Before claiming that Aero has a threshold on Sybils, we need to introduce an assumption from Honeycrisp (roth2019honeycrisp, ), which we will also use in our claim.

Assumption A.1 ().

(Honeycrisp) All devices know an upper bound $W_{max}$ and a lower bound $W_{min}$ of the number of potential participating devices in the system. If the true number of devices is $W_{tot}$ , then by definition: $W_{min}\leq W_{tot}\leq W_{max}$ . We assume $\frac{W_{max}-W_{tot}}{W_{min}}$ is always below some constant (low) threshold (this determines the portion of Sybils the aggregator $\mathcal{A}$ could make without getting caught). $\frac{W_{max}}{W_{min}}$ should also be below some (more generous) constant threshold.

With the above assumption, similarly, we claim in Aero:

Claim A.4 ().

(Aero) There is an upper bound on the portion of Sybils a malicious aggregator can introduce without being caught. Precisely, all devices know an upper bound $M_{max}$ and a lower bound $M_{min}$ of the number of potential leaf nodes in the summation tree. If the true number of leaf nodes is $M_{tot}$ , then by definition: $M_{min}\leq M_{tot}\leq M_{max}$ . In Aero, $\frac{M_{max}-M_{tot}}{M_{min}}$ is always below some constant (low) threshold (this determines the portion of Sybils the aggregator $\mathcal{A}$ could make without getting caught). $\frac{M_{max}}{M_{min}}$ will also be below some (more generous) constant threshold.

Proof.

Note that in this claim, by “leaf nodes”, we mean leaf nodes corresponding to an honest generator device. Once we know an upper bound $M_{max}$ and a lower bound $M_{min}$ of the number of leaf nodes, in the worst case, a malicious aggregator could introduce at most $M_{max}-M_{min}$ Sybils. If the true number of leaf nodes is $M_{tot}$ , then the aggregator could introduce at most $M_{max}-M_{tot}$ Sybils.

Let’s first consider $M_{min}$ and $M_{max}$ .

Informally, if the number of devices $W$ is large enough, the number of leaf nodes (generators) is likely to be close to the expectation $qW$ . Suppose for now there are two constants capturing this “closeness”: $k_{min}$ , $k_{max}$ , where $k_{min}\approx 1,k_{max}\approx 1$ . Then, we have

k_{min}\cdot qW_{min}\leq M_{min}\leq k_{max}\cdot qW_{min},

k_{min}\cdot qW_{max}\leq M_{max}\leq k_{max}\cdot qW_{max}.

Consider the fraction $\frac{M_{max}}{M_{min}}.$

\frac{M_{max}}{M_{min}}\leq\frac{k_{max}\cdot qW_{max}}{k_{min}\cdot qW_{min}}=\frac{k_{max}}{k_{min}}\cdot\frac{W_{max}}{W_{min}}

$\frac{W_{max}}{W_{min}}$ is below some constant threshold as in Assumption A.1. If $\frac{k_{max}}{k_{min}}$ is smaller than some constant threshold, then it’s reasonable to assume like in Honeycrisp that $\frac{M_{max}}{M_{min}}$ is below some constant threshold. We will discuss the values of $k_{max}$ and $k_{min}$ at the end of the proof.

As for the fraction $\frac{M_{max}-M_{tot}}{M_{min}}$ , it’s not difficult to see that,

\frac{M_{max}-M_{tot}}{M_{min}}\leq\frac{k_{max}W_{max}-k_{min}W_{tot}}{k_{min}W_{min}}

Similarly, if both $k_{max}$ and $k_{min}$ are close to 1, this fraction will be close to $\frac{W_{max}-W_{tot}}{W_{min}}$ , which is below some constant (low) threshold as specified in Assumption A.1. So in Aero, it is reasonable to assume this fraction is also below some constant (low) threshold.

Lastly, let’s discuss $k_{max}$ and $k_{min}$ . Consider a general case where the total number of devices is $W$ and the sampling probability is $q$ . The number of sampled devices (leaf nodes) is $X$ . Chernoff bound states that,

Pr(X<k_{min}qW)<(\frac{e^{k_{min}-1}}{{k_{min}}^{k_{min}}})^{qW}.

With $qW=5000$ , $k_{min}=0.9$ , the above probability will be smaller than $5.77\cdot 10^{-12}$ . Similarly, with $qW=10000,k_{min}=0.93$ , the probability will be smaller than $1.27\cdot 10^{-11}$ . Since Aero is designed for large-scale training, it’s reasonable to assume $k_{min}$ is close to 1. The argument is similar for $k_{max}$ (e.g. $k_{max}=1.01$ ).
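The two numerical examples follow from evaluating the Chernoff lower-tail bound above. The sketch below does so in log space; the function name `lower_tail_bound` is hypothetical.

```python
import math


def lower_tail_bound(k_min: float, qW: float) -> float:
    """Bound (e^{k-1} / k^k)^{qW} on Pr(X < k * qW), for 0 < k < 1."""
    if not 0 < k_min < 1:
        raise ValueError("the lower-tail bound is stated for 0 < k_min < 1")
    # log of the bound: qW * (k - 1 - k * ln k)
    return math.exp(qW * (k_min - 1.0 - k_min * math.log(k_min)))


p_5000 = lower_tail_bound(0.9, 5000)
p_10000 = lower_tail_bound(0.93, 10000)
```

Here `p_5000` and `p_10000` match the two probabilities quoted above to the stated precision. Since the exponent scales linearly with $qW$ , keeping the same failure probability with a smaller expected number of generators forces $k_{min}$ further away from 1.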

However, if $qW$ is, for example, 500, to achieve a similar probability of $1.17\cdot 10^{-11}$ , $k_{min}$ will be about 0.7. In this setting, it’s not reasonable to assume $M^{\prime}=qW$ as in Claim A.3 any more. Instead one will need to assume a different bound, e.g. $M^{\prime}<2qW$ , and re-calculate the probability. For example, if $M^{\prime}=2qW$ , in Claim A.3, the probability of an honest device checking a particular leaf node will still be the same. However, the probability of no honest device checking a particular node will be instead

(1-\frac{qs}{M^{\prime}})^{(1-f)W}<(1-\frac{s}{2W})^{(1-f)W}\leq e^{-(1-f)s/2}

Verifiers are expected to verify more nodes in the summation tree to make Claim A.3 true and thus ensure privacy. ∎

Freshness. Reusing the encryption key will not affect the security proof in each round, but will lead to attacks across rounds, where information from previous rounds can be leaked. The adversary may obtain the victim device’s ciphertext $Enc(m)$ from a previous round and use $k\cdot Enc(m)$ , where $k$ is a large constant, to participate in a later round. Since $k$ is large, $k\cdot Enc(m)$ will dominate the aggregated result. After decryption, the adversary will be able to learn approximately $m$ . We define freshness as the guarantee that only freshly generated ciphertexts are included in the aggregation.
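The arithmetic behind this replay attack can be illustrated without any cryptography. In the sketch below, plain integers stand in for additively homomorphic ciphertexts, so the "decrypted aggregate" is just an integer sum; the input ranges, the value of $k$ , and all variable names are assumptions made for the example.

```python
import random

rng = random.Random(1)

victim_m = 37                                   # victim's value from an earlier round
honest_now = [rng.randint(-50, 50) for _ in range(1000)]  # current-round inputs

k = 10**9                                       # large constant chosen by the adversary
replayed = k * victim_m                         # plaintext analogue of k * Enc(m)

# What decrypting the aggregate would reveal if the replay were accepted.
aggregate = sum(honest_now) + replayed

# Dividing by k washes out the honest contributions, which are tiny next to k.
recovered = round(aggregate / k)
```

Because the honest contributions sum to at most 50,000 in magnitude, which is far below $k/2$ , `recovered` equals `victim_m`. This is the leakage that the freshness guarantee is meant to prevent.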

We need to prove that by asking each device to put the round number in the first slot of ciphertext, generating corresponding ZK-proof, and in the add phase verifying the ZK proof, the adversary will not be able to use ciphertexts from previous rounds in the current round.

First, according to the knowledge soundness property of zkSNARK, which states that it is not possible for a prover to construct a proof without knowing the witness (e.g. secret inputs), an adversary can’t construct a proof for a new round (nitulescuzk, ). This means the adversary can only insert a non-valid proof into the summation tree.

Next, if the adversary inserts one ciphertext from a previous round with a non-valid proof into the summation tree, with high probability, it will be caught, since as claimed before, with high probability each leaf node will be checked by some honest devices. The honest devices will be able to detect this error.

How Aero supports modified DP-FedAvg. We’ve covered the integrity and freshness of Aero’s add phase/aggregation. Next we’ll show the modified DP-FedAvg will be executed faithfully. In order to show this, we claim the following:

Claim A.5 ().

Data from honest generator devices will be included at most once in the aggregation; data from honest DP-noise committee members will be included exactly once.

Proof.

It’s not difficult to see that both honest generators’ and honest DP-noise committee members’ data will be included at most once, as defined in Claim A.3. So for honest generator devices, the proof is done.

As for an honest DP-noise committee member, we just need to prove that its data will be included (at least once). Recall that an honest committee member is always online during a round; otherwise it is considered malicious. Also recall that in the add phase, after the aggregator constructs the summation tree and the Merkle tree, the aggregator needs to send each device a Merkle proof that its data is included in the summation tree (Step 7 in Figure 4). An honest committee member can thus make sure its data is included in the aggregation by verifying this Merkle proof. ∎
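The inclusion check used in this proof is a standard Merkle proof verification. The sketch below is a generic binary Merkle tree, not Aero's actual tree layout: SHA-256, the leaf and node domain-separation bytes, and the rule of duplicating the last node on odd levels are all assumptions of this example.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, from hashed leaves up to the single root."""
    level = [H(b"\x00" + leaf) for leaf in leaves]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]          # duplicate last node on odd levels
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [H(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]


def prove(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(root, leaf, path):
    """Recompute the root from a leaf and its proof path."""
    node = H(b"\x00" + leaf)
    for sibling, sibling_is_left in path:
        pair = sibling + node if sibling_is_left else node + sibling
        node = H(b"\x01" + pair)
    return node == root


leaves = [f"ciphertext-{i}".encode() for i in range(5)]
levels = build_levels(leaves)
root = levels[-1][0]
all_included = all(verify(root, leaf, prove(levels, i))
                   for i, leaf in enumerate(leaves))
```

A device that holds only `root`, its own leaf, and the logarithmic-size path can run `verify` locally; a path for a leaf that is not in the tree fails the check.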

This claim captures the requirement for faithfully executing the modified DP-FedAvg.

Firstly, data from honest generators (including online and offline) will be included at most once, which is exactly what our new sampling method does (Appendix §A.1.1): some honest generators might be filtered while others not; those included in the aggregation will be added exactly once.

Secondly, data from honest DP-noise committee members (only online) will be included exactly once, which ensures that enough Gaussian noise will be added.

Combined with the integrity and freshness of the underlying aggregation, this shows that the modified DP-FedAvg protocol is executed faithfully in Aero.

A.1.4. Decryption

The last step in Aero’s protocol is decryption. In this section, we’ll prove Aero’s decryption protocol will not leak any information, except the decrypted result.

Let’s first review the BFV scheme (brakerski2012fully, ; fan2012somewhat, ). Let the field for the coefficients of the ciphertext polynomials be $Q$ , the polynomials themselves be from a polynomial ring $R_{Q}$ , the distribution for the coefficients of error polynomials be the required Gaussian distribution $\phi$ (standard deviation 3.2), and the secret key $sk$ be a polynomial of the same degree $N$ as the ciphertext polynomials but with coefficients from the ternary distribution ( $\{-1,0,1\}$ ), which we denote as $\psi$ . Then, given a small constant $\gamma\ll 1$ , the BFV scheme has the following procedures:

• $Keygen$ : $s\leftarrow\psi,a\leftarrow R_{Q},e\leftarrow\phi$ . Compute $b=as+e$ and output $pk=(a,b)$ and $sk=s$

• $Enc(pk,m)$ : $e_{1}\leftarrow\phi,e_{2}\leftarrow\phi,r\leftarrow\psi$ , output $(c_{1}=ar+e_{1},c_{2}=br+e_{2}+m/\gamma)$

• $Dec(c_{1},c_{2})$ : output $m=\lfloor(c_{2}-c_{1}s)\cdot\gamma\rceil$
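The three procedures can be exercised with a toy implementation. The sketch below is for illustration only and is not secure: the ring degree `N`, modulus `Q`, and scaling factor `DELTA` (playing the role of $1/\gamma$ ) are tiny assumed values, the ring is taken to be $\mathbb{Z}_{Q}[x]/(x^{N}+1)$ , and polynomial multiplication is done naively.

```python
import random

N = 16             # toy ring degree (assumption; real parameters are far larger)
Q = 1 << 40        # toy ciphertext modulus (assumption)
DELTA = 1 << 20    # 1/gamma: scaling factor applied to the message (assumption)
rng = random.Random(0)


def polymul(a, b):
    """Multiply in Z_Q[x]/(x^N + 1) (negacyclic convolution)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % Q
            else:
                out[k - N] = (out[k - N] - ai * bj) % Q   # x^N = -1
    return out


def polyadd(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]


def polysub(a, b):
    return [(x - y) % Q for x, y in zip(a, b)]


def ternary():      # psi
    return [rng.choice((-1, 0, 1)) for _ in range(N)]


def gaussian():     # phi, rounded Gaussian with standard deviation 3.2
    return [round(rng.gauss(0, 3.2)) for _ in range(N)]


def centered(x):
    """Map a residue mod Q to the interval (-Q/2, Q/2]."""
    x %= Q
    return x - Q if x > Q // 2 else x


def keygen():
    s, e = ternary(), gaussian()
    a = [rng.randrange(Q) for _ in range(N)]
    b = polyadd(polymul(a, s), e)            # b = a*s + e
    return (a, b), s


def enc(pk, m):
    a, b = pk
    e1, e2, r = gaussian(), gaussian(), ternary()
    c1 = polyadd(polymul(a, r), e1)                                      # a*r + e1
    c2 = polyadd(polyadd(polymul(b, r), e2), [DELTA * x for x in m])     # b*r + e2 + m/gamma
    return c1, c2


def dec(sk, ct):
    c1, c2 = ct
    noisy = polysub(c2, polymul(c1, sk))     # m/gamma + small error
    return [round(centered(x) / DELTA) for x in noisy]


def add_ct(ct_a, ct_b):
    """Component-wise addition of ciphertexts adds the underlying messages."""
    return polyadd(ct_a[0], ct_b[0]), polyadd(ct_a[1], ct_b[1])


pk, sk = keygen()
m1 = [7 * i for i in range(N)]
m2 = [100 - i for i in range(N)]
roundtrip = dec(sk, enc(pk, m1))
summed = dec(sk, add_ct(enc(pk, m1), enc(pk, m2)))
```

Decryption succeeds because the error left in $c_{2}-c_{1}s$ is a few hundred at most with these parameters, far below `DELTA / 2`, so rounding removes it. The `add_ct` helper shows the additive property that aggregation relies on.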

Since $m$ will finally be known to the aggregator and revealing $m$ is safe, without loss of generality, assume $(c_{1},c_{2})$ is the encryption of 0. Further, as defined in the release phase (§4.4),

e_{small}=c_{2}-c_{1}\cdot s=re+e_{2}-se_{1}.

This small error must remain hidden during the decryption process; otherwise, it may reveal information about the secret key $s$ or the polynomial $r$ . To hide $e_{small}$ , Aero’s scheme applies the smudging lemma (asharov2012multiparty, ). This lemma states that to achieve $2^{-\lambda}$ statistical distance between $e_{small}$ and $e_{small}+e_{smudging}$ , $e_{smudging}$ just needs to be sampled from a uniform distribution whose bound is $\lambda$ bits more than the upper bound of $e_{small}$ . Suppose the smudging distribution is $\phi^{\prime}$ , which is some uniform distribution whose bound is $\lambda$ bits more than the upper bound of $e_{small}$ .
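The effect of smudging can be made concrete for a single coefficient. The sketch below computes, by brute force, the exact statistical distance between a uniform smudging value and the same value shifted by a small error; it assumes an integer uniform distribution on $\{-B,\dots,B\}$ with $B=E\cdot 2^{\lambda}$ , where $E$ bounds the small error. The function name and the toy values of $E$ and $\lambda$ are hypothetical.

```python
from fractions import Fraction


def statistical_distance(shift: int, bound: int) -> Fraction:
    """Exact distance between U{-bound..bound} and shift + U{-bound..bound}."""
    p = Fraction(1, 2 * bound + 1)
    lo = -bound - abs(shift)
    hi = bound + abs(shift)
    mismatched = 0
    for x in range(lo, hi + 1):
        in_plain = -bound <= x <= bound
        in_shifted = -bound <= x - shift <= bound
        if in_plain != in_shifted:
            mismatched += 1          # this point has mass p under exactly one law
    return p * mismatched / 2


E = 8                 # assumed upper bound on |e_small| for one coefficient
LAMBDA = 10           # assumed statistical security parameter (toy value)
B = E * 2 ** LAMBDA   # smudging bound: LAMBDA bits above the error bound

worst_case = statistical_distance(E, B)
```

For this setting the distance is $\frac{E}{2B+1}$ , which is below $2^{-\lambda}$ , so adding $e_{small}$ to the smudging value changes its distribution only negligibly.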

Recall that in the release phase, the decryption committee reveals $c_{1}\cdot s+e_{smudging}$ to the aggregator, where $e_{smudging}\leftarrow\phi^{\prime}$ . So the adversary’s view is

(c_{2},c_{1},c_{1}s+e_{smudging}),

which is equivalent to

(c_{2},c_{1},-c_{2}+c_{1}s+e_{smudging}).

If we can prove this view is indistinguishable from

(c_{2},c_{1},e_{smudging}^{\prime})

where $e_{smudging}^{\prime}$ is some freshly sampled error from the smudging distribution $\phi^{\prime}$ , then revealing $c_{1}\cdot s+e_{smudging}$ to the aggregator will not leak more information than telling the aggregator a uniformly random number, since $(c_{1},c_{2})$ are already known to the aggregator.

To prove the above claim, let’s start with

(c_{2},c_{1},-c_{2}+c_{1}s+e_{smudging}).

Expanding $c_{2}$ and $c_{1}$ , we get that the above is same as

(asr+re+e_{2},ar+e_{1},-re-e_{2}+se_{1}+e_{smudging}).

With the smudging lemma, the above is indistinguishable from

((as+e)r+e_{2},ar+e_{1},e_{smudging}^{\prime}).

Notice that $e_{smudging}^{\prime}$ has nothing to do with the secret key $s$ , so we can apply the Ring-LWE assumption to convert $(as+e,a)$ back to $(b,a)$ , since otherwise $(as+e,a)$ would not be indistinguishable from $(b,a)$ . The above is indistinguishable from

(br+e_{2},ar+e_{1},e_{smudging}^{\prime})=(c_{2},c_{1},e^{\prime}_{smudging}).

Since $e_{smudging}^{\prime}$ doesn’t depend on either the secret key or honest devices’ data, revealing the partial decryption result will not leak information about either the secret key or honest devices’ data.

A.2. Details of the setup phase

During the setup phase, Aero (i) forms the master committee, which then (ii) receives and validates inputs for the round, and (iii) generates keys for cryptographic primitives. We present the details for only the second piece here, as the first and the third are discussed in detail earlier (§4).

For the second piece, the master committee needs to check whether there is enough privacy budget to run a training task before launching it. To do this, the committee members need to calculate the new DP parameters before they launch the training task and check whether the new parameters are below some threshold. The details are as follows.

Recall that once the master committee is formed, each committee member receives the model parameters $\theta^{t}$ for the current round $t$ , the user selection probability $q$ , noise scale $z$ , and clipping bound $S$ from the aggregator for this round of training (required by DP-FedAvg; Figure 2).

Each committee member locally computes new values of the DP parameters $\epsilon,\delta$ using the moment accounting algorithm $\mathcal{M}$ (line 14 in DP-FedAvg). This computation requires the DP parameters $\epsilon^{\prime},\delta^{\prime}$ from round $t-1$ in addition to the inputs $z,q$ . The committee member downloads the former from a public bulletin board, where it is signed by more than a threshold of honest members of the previous round’s master committee. After getting the former DP parameters, the committee member calculates the new $\epsilon,\delta$ . If the new values are below their recommended values, the committee member signs a certificate containing the parameters ( $\theta^{t}$ , $q$ , $z$ , $S$ ), the new values of $\epsilon,\delta$ , and keys for cryptographic primitives, and publishes it to the bulletin board to start this training task.
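The control flow of this budget check can be sketched as below. The sketch is deliberately abstract: the accounting algorithm $\mathcal{M}$ is passed in as a callable, and the `toy_accountant` used in the example is a placeholder that is NOT the moments accountant. All names, the limits, and the per-round cost are hypothetical, and signing and bulletin-board access are omitted.

```python
from typing import Callable, Tuple

Budget = Tuple[float, float]   # (epsilon, delta)


def approve_round(prev: Budget, q: float, z: float,
                  accountant: Callable[[Budget, float, float], Budget],
                  limit: Budget) -> Tuple[bool, Budget]:
    """Compute the new (epsilon, delta) and decide whether to start the round.

    prev:       (epsilon', delta') from round t-1, as read from the bulletin board
    accountant: stand-in for the accounting algorithm M (assumed interface)
    limit:      recommended values that the new parameters must stay below
    """
    new_eps, new_delta = accountant(prev, q, z)
    ok = new_eps < limit[0] and new_delta < limit[1]
    return ok, (new_eps, new_delta)


def toy_accountant(prev: Budget, q: float, z: float) -> Budget:
    """Placeholder only: charges a made-up per-round cost, not a real DP bound."""
    eps, delta = prev
    return eps + q / z, delta


ok, new_budget = approve_round(prev=(1.0, 1e-6), q=0.1, z=1.0,
                               accountant=toy_accountant, limit=(2.0, 1e-5))
```

A committee member would sign and publish the certificate only when `ok` is true; otherwise the training task is not launched.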