Search | arXiv e-print repository

Releasing Large-Scale Human Mobility Histograms with Differential Privacy

Authors: Christopher Bian, Albert Cheu, Yannis Guzman, Marco Gruteser, Peter Kairouz, Ryan McKenna, Edo Roth

Abstract: Environmental Insights Explorer (EIE) is a Google product that reports aggregate statistics about human mobility, including various methods of transit used by people across roughly 50,000 regions globally. These statistics are used to estimate carbon emissions and provided to policymakers to inform their decisions on transportation policy and infrastructure. Due to the inherent sensitivity of this… ▽ More Environmental Insights Explorer (EIE) is a Google product that reports aggregate statistics about human mobility, including various methods of transit used by people across roughly 50,000 regions globally. These statistics are used to estimate carbon emissions and provided to policymakers to inform their decisions on transportation policy and infrastructure. Due to the inherent sensitivity of this type of user data, it is crucial that the statistics derived and released from it are computed with appropriate privacy protections. In this work, we use a combination of federated analytics and differential privacy to release these required statistics, while operating under strict error constraints to ensure utility for downstream stakeholders. In this work, we propose a new mechanism that achieves $ ε\approx 2 $-DP while satisfying these strict utility constraints, greatly improving over natural baselines. We believe this mechanism may be of more general interest for the broad class of group-by-sum workloads. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2404.10764 [pdf, other]

Confidential Federated Computations

Authors: Hubert Eichner, Daniel Ramage, Kallista Bonawitz, Dzmitry Huba, Tiziano Santoro, Brett McLarnon, Timon Van Overveldt, Nova Fallen, Peter Kairouz, Albert Cheu, Katharine Daly, Adria Gascon, Marco Gruteser, Brendan McMahan

Abstract: Federated Learning and Analytics (FLA) have seen widespread adoption by technology platforms for processing sensitive on-device data. However, basic FLA systems have privacy limitations: they do not necessarily require anonymization mechanisms like differential privacy (DP), and provide limited protections against a potentially malicious service provider. Adding DP to a basic FLA system currently… ▽ More Federated Learning and Analytics (FLA) have seen widespread adoption by technology platforms for processing sensitive on-device data. However, basic FLA systems have privacy limitations: they do not necessarily require anonymization mechanisms like differential privacy (DP), and provide limited protections against a potentially malicious service provider. Adding DP to a basic FLA system currently requires either adding excessive noise to each device's updates, or assuming an honest service provider that correctly implements the mechanism and only uses the privatized outputs. Secure multiparty computation (SMPC) -based oblivious aggregations can limit the service provider's access to individual user updates and improve DP tradeoffs, but the tradeoffs are still suboptimal, and they suffer from scalability challenges and susceptibility to Sybil attacks. This paper introduces a novel system architecture that leverages trusted execution environments (TEEs) and open-sourcing to both ensure confidentiality of server-side computations and provide externally verifiable privacy properties, bolstering the robustness and trustworthiness of private federated computations. △ Less

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2208.08540 [pdf, other]

Necessary Conditions in Multi-Server Differential Privacy

Authors: Albert Cheu, Chao Yan

Abstract: We consider protocols where users communicate with multiple servers to perform a computation on the users' data. An adversary exerts semi-honest control over many of the parties but its view is differentially private with respect to honest users. Prior work described protocols that required multiple rounds of interaction or offered privacy against a computationally bounded adversary. Our work pres… ▽ More We consider protocols where users communicate with multiple servers to perform a computation on the users' data. An adversary exerts semi-honest control over many of the parties but its view is differentially private with respect to honest users. Prior work described protocols that required multiple rounds of interaction or offered privacy against a computationally bounded adversary. Our work presents limitations of non-interactive protocols that offer privacy against unbounded adversaries. We show these protocols demand exponentially more samples for some learning and estimation tasks than centrally private counterparts. This means performing as well as the central model requires interactivity or computational differential privacy, or both. △ Less

Submitted 17 August, 2022; originally announced August 2022.

Comments: 22 pages

arXiv:2112.10032 [pdf, other]

Pure Differential Privacy from Secure Intermediaries

Authors: Albert Cheu, Chao Yan

Abstract: Recent work in differential privacy has explored the prospect of combining local randomization with a secure intermediary. Specifically, there are a variety of protocols in the secure shuffle model (where an intermediary randomly permutes messages) as well as the secure aggregation model (where an intermediary adds messages). Most of these protocols are limited to approximate differential privacy.… ▽ More Recent work in differential privacy has explored the prospect of combining local randomization with a secure intermediary. Specifically, there are a variety of protocols in the secure shuffle model (where an intermediary randomly permutes messages) as well as the secure aggregation model (where an intermediary adds messages). Most of these protocols are limited to approximate differential privacy. An exception is the shuffle protocol by Ghazi, Golowich, Kumar, Manurangsi, Pagh, and Velingker (arXiv:2002.01919): it computes bounded sums under pure differential privacy. Its additive error is $\tilde{O}(1/\varepsilon^{3/2})$, where $\varepsilon$ is the privacy parameter. In this work, we give a new protocol that ensures $O(1/\varepsilon)$ error under pure differential privacy. We also show how to use it to test uniformity of distributions over $[d]$. The tester's sample complexity has an optimal dependence on $d$. Our work relies on a novel class of secure intermediaries which are of independent interest. △ Less

Submitted 27 December, 2021; v1 submitted 18 December, 2021; originally announced December 2021.

Comments: Updated abstract, prior work, and acknowledgements

arXiv:2107.11839 [pdf, ps, other]

Differential Privacy in the Shuffle Model: A Survey of Separations

Authors: Albert Cheu

Abstract: Differential privacy is often studied in one of two models. In the central model, a single analyzer has the responsibility of performing a privacy-preserving computation on data. But in the local model, each data owner ensures their own privacy. Although it removes the need to trust the analyzer, local privacy comes at a price: a locally private protocol is less accurate than a centrally private c… ▽ More Differential privacy is often studied in one of two models. In the central model, a single analyzer has the responsibility of performing a privacy-preserving computation on data. But in the local model, each data owner ensures their own privacy. Although it removes the need to trust the analyzer, local privacy comes at a price: a locally private protocol is less accurate than a centrally private counterpart when solving many learning and estimation problems. Protocols in the shuffle model are designed to attain the best of both worlds: recent work has shown high accuracy is possible with only a mild trust assumption. This survey paper gives an overview of novel shuffle protocols, along with lower bounds that establish the limits of the new model. We also summarize work that show the promise of interactivity in the shuffle model. △ Less

Submitted 24 May, 2022; v1 submitted 25 July, 2021; originally announced July 2021.

Comments: 24 May '22 version includes more recent work on sums, histograms, and an appendix comparing the shuffle model with other distributed models

arXiv:2106.09805 [pdf, ps, other]

Shuffle Private Stochastic Convex Optimization

Authors: Albert Cheu, Matthew Joseph, Jieming Mao, Binghui Peng

Abstract: In shuffle privacy, each user sends a collection of randomized messages to a trusted shuffler, the shuffler randomly permutes these messages, and the resulting shuffled collection of messages must satisfy differential privacy. Prior work in this model has largely focused on protocols that use a single round of communication to compute algorithmic primitives like means, histograms, and counts. We p… ▽ More In shuffle privacy, each user sends a collection of randomized messages to a trusted shuffler, the shuffler randomly permutes these messages, and the resulting shuffled collection of messages must satisfy differential privacy. Prior work in this model has largely focused on protocols that use a single round of communication to compute algorithmic primitives like means, histograms, and counts. We present interactive shuffle protocols for stochastic convex optimization. Our protocols rely on a new noninteractive protocol for summing vectors of bounded $\ell_2$ norm. By combining this sum subroutine with mini-batch stochastic gradient descent, accelerated gradient descent, and Nesterov's smoothing method, we obtain loss guarantees for a variety of convex loss functions that significantly improve on those of the local model and sometimes match those of the central model. △ Less

Submitted 14 March, 2022; v1 submitted 17 June, 2021; originally announced June 2021.

Comments: Accepted to ICLR 2022

arXiv:2104.02739 [pdf, other]

Differentially Private Histograms in the Shuffle Model from Fake Users

Authors: Albert Cheu, Maxim Zhilyaev

Abstract: There has been much recent work in the shuffle model of differential privacy, particularly for approximate $d$-bin histograms. While these protocols achieve low error, the number of messages sent by each user -- the message complexity -- has so far scaled with $d$ or the privacy parameters. The message complexity is an informative predictor of a shuffle protocol's resource consumption. We present… ▽ More There has been much recent work in the shuffle model of differential privacy, particularly for approximate $d$-bin histograms. While these protocols achieve low error, the number of messages sent by each user -- the message complexity -- has so far scaled with $d$ or the privacy parameters. The message complexity is an informative predictor of a shuffle protocol's resource consumption. We present a protocol whose message complexity is two when there are sufficiently many users. The protocol essentially pairs each row in the dataset with a fake row and performs a simple randomization on all rows. We show that the error introduced by the protocol is small, using rigorous analysis as well as experiments on real-world data. We also prove that corrupt users have a relatively low impact on our protocol's estimates. △ Less

Submitted 5 August, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

Comments: 29 pages. // May 3 updates: (1) experiments that compared results with prior work. (2) moved count-min section to the main body (3) expanded introduction // May 29 updates: (1) further improved introduction (2) enhanced reduction in communication complexity (3) included reference to GKMP20 protocol // Aug 5 update: updated the affiliation footnotes

arXiv:2009.08000 [pdf, other]

The Limits of Pan Privacy and Shuffle Privacy for Learning and Estimation

Authors: Albert Cheu, Jonathan Ullman

Abstract: There has been a recent wave of interest in intermediate trust models for differential privacy that eliminate the need for a fully trusted central data collector, but overcome the limitations of local differential privacy. This interest has led to the introduction of the shuffle model (Cheu et al., EUROCRYPT 2019; Erlingsson et al., SODA 2019) and revisiting the pan-private model (Dwork et al., IT… ▽ More There has been a recent wave of interest in intermediate trust models for differential privacy that eliminate the need for a fully trusted central data collector, but overcome the limitations of local differential privacy. This interest has led to the introduction of the shuffle model (Cheu et al., EUROCRYPT 2019; Erlingsson et al., SODA 2019) and revisiting the pan-private model (Dwork et al., ITCS 2010). The message of this line of work is that, for a variety of low-dimensional problems -- such as counts, means, and histograms -- these intermediate models offer nearly as much power as central differential privacy. However, there has been considerably less success using these models for high-dimensional learning and estimation problems. In this work, we show that, for a variety of high-dimensional learning and estimation problems, both the shuffle model and the pan-private model inherently incur an exponential price in sample complexity relative to the central model. For example, we show that, private agnostic learning of parity functions over $d$ bits requires $Ω(2^{d/2})$ samples in these models, and privately selecting the most common attribute from a set of $d$ choices requires $Ω(d^{1/2})$ samples, both of which are exponential separations from the central model. Our work gives the first non-trivial lower bounds for these problems for both the pan-private model and the general multi-message shuffle model. △ Less

Submitted 3 December, 2020; v1 submitted 16 September, 2020; originally announced September 2020.

Comments: Corrected a proof in the Appendix. Added a reference to parallel and concurrent work

arXiv:2004.10941 [pdf, other]

Private Query Release Assisted by Public Data

Authors: Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, Zhiwei Steven Wu

Abstract: We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class $\mathcal{H}$ of statistical queries with error no more than $α$ using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples. We study the limits of this task in… ▽ More We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class $\mathcal{H}$ of statistical queries with error no more than $α$ using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples. We study the limits of this task in terms of the private and public sample complexities. First, we show that we can solve the problem for any query class $\mathcal{H}$ of finite VC-dimension using only $d/α$ public samples and $\sqrt{p}d^{3/2}/α^2$ private samples, where $d$ and $p$ are the VC-dimension and dual VC-dimension of $\mathcal{H}$, respectively. In comparison, with only private samples, this problem cannot be solved even for simple query classes with VC-dimension one, and without any private samples, a larger public sample of size $d/α^2$ is needed. Next, we give sample complexity lower bounds that exhibit tight dependence on $p$ and $α$. For the class of decision stumps, we give a lower bound of $\sqrt{p}/α$ on the private sample complexity whenever the public sample size is less than $1/α^2$. Given our upper bounds, this shows that the dependence on $\sqrt{p}$ is necessary in the private sample complexity. We also give a lower bound of $1/α$ on the public sample complexity for a broad family of query classes, which by our upper bound, is tight in $α$. △ Less

Submitted 22 April, 2020; originally announced April 2020.

arXiv:2004.09481 [pdf, ps, other]

Connecting Robust Shuffle Privacy and Pan-Privacy

Authors: Victor Balcer, Albert Cheu, Matthew Joseph, Jieming Mao

Abstract: In the \emph{shuffle model} of differential privacy, data-holding users send randomized messages to a secure shuffler, the shuffler permutes the messages, and the resulting collection of messages must be differentially private with regard to user data. In the \emph{pan-private} model, an algorithm processes a stream of data while maintaining an internal state that is differentially private with re… ▽ More In the \emph{shuffle model} of differential privacy, data-holding users send randomized messages to a secure shuffler, the shuffler permutes the messages, and the resulting collection of messages must be differentially private with regard to user data. In the \emph{pan-private} model, an algorithm processes a stream of data while maintaining an internal state that is differentially private with regard to the stream data. We give evidence connecting these two apparently different models. Our results focus on \emph{robustly} shuffle private protocols, whose privacy guarantees are not greatly affected by malicious users. First, we give robustly shuffle private protocols and upper bounds for counting distinct elements and uniformity testing. Second, we use pan-private lower bounds to prove robustly shuffle private lower bounds for both problems. Focusing on the dependence on the domain size $k$, we find that robust approximate shuffle privacy and approximate pan-privacy have additive error $Θ(\sqrt{k})$ for counting distinct elements. For uniformity testing, we give a robust approximate shuffle private protocol with sample complexity $\tilde O(k^{2/3})$ and show that an $Ω(k^{2/3})$ dependence is necessary for any robust pure shuffle private tester. Finally, we show that this connection is useful in both directions: we give a pan-private adaptation of recent work on shuffle private histograms and use it to recover further separations between pan-privacy and interactive local privacy. △ Less

Submitted 11 August, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

Comments: This version removes a uniformity tester which we found does not satisfy pure d.p

arXiv:1911.06879 [pdf, ps, other]

Separating Local & Shuffled Differential Privacy via Histograms

Authors: Victor Balcer, Albert Cheu

Abstract: Recent work in differential privacy has highlighted the shuffled model as a promising avenue to compute accurate statistics while kee** raw data in users' hands. We present a protocol in this model that estimates histograms with error independent of the domain size. This implies an arbitrarily large gap in sample complexity between the shuffled and local models. On the other hand, the models are… ▽ More Recent work in differential privacy has highlighted the shuffled model as a promising avenue to compute accurate statistics while kee** raw data in users' hands. We present a protocol in this model that estimates histograms with error independent of the domain size. This implies an arbitrarily large gap in sample complexity between the shuffled and local models. On the other hand, the models are equivalent when we impose the constraints of pure differential privacy and single-message randomizers. △ Less

Submitted 13 April, 2020; v1 submitted 15 November, 2019; originally announced November 2019.

Comments: 14 pages, two tables. Accepted to ITC 2020

arXiv:1909.09630 [pdf, other]

Manipulation Attacks in Local Differential Privacy

Authors: Albert Cheu, Adam Smith, Jonathan Ullman

Abstract: Local differential privacy is a widely studied restriction on distributed algorithms that collect aggregates about sensitive user data, and is now deployed in several large systems. We initiate a systematic study of a fundamental limitation of locally differentially private protocols: they are highly vulnerable to adversarial manipulation. While any algorithm can be manipulated by adversaries who… ▽ More Local differential privacy is a widely studied restriction on distributed algorithms that collect aggregates about sensitive user data, and is now deployed in several large systems. We initiate a systematic study of a fundamental limitation of locally differentially private protocols: they are highly vulnerable to adversarial manipulation. While any algorithm can be manipulated by adversaries who lie about their inputs, we show that any non-interactive locally differentially private protocol can be manipulated to a much greater extent. Namely, when the privacy level is high or the input domain is large, an attacker who controls a small fraction of the users in the protocol can completely obscure the distribution of the users' inputs. We also show that existing protocols differ greatly in their resistance to manipulation, even when they offer the same accuracy guarantee with honest execution. Our results suggest caution when deploying local differential privacy and reinforce the importance of efficient cryptographic techniques for emulating mechanisms from central differential privacy in distributed settings. △ Less

Submitted 20 September, 2019; originally announced September 2019.

arXiv:1808.01394 [pdf, other]

doi 10.1007/978-3-030-17653-2_13

Distributed Differential Privacy via Shuffling

Authors: Albert Cheu, Adam Smith, Jonathan Ullman, David Zeber, Maxim Zhilyaev

Abstract: We consider the problem of designing scalable, robust protocols for computing statistics about sensitive data. Specifically, we look at how best to design differentially private protocols in a distributed setting, where each user holds a private datum. The literature has mostly considered two models: the "central" model, in which a trusted server collects users' data in the clear, which allows gre… ▽ More We consider the problem of designing scalable, robust protocols for computing statistics about sensitive data. Specifically, we look at how best to design differentially private protocols in a distributed setting, where each user holds a private datum. The literature has mostly considered two models: the "central" model, in which a trusted server collects users' data in the clear, which allows greater accuracy; and the "local" model, in which users individually randomize their data, and need not trust the server, but accuracy is limited. Attempts to achieve the accuracy of the central model without a trusted server have so far focused on variants of cryptographic MPC, which limits scalability. In this paper, we initiate the analytic study of a shuffled model for distributed differentially private algorithms, which lies between the local and central models. This simple-to-implement model, a special case of the ESA framework of [Bittau et al., '17], augments the local model with an anonymous channel that randomly permutes a set of user-supplied messages. For sum queries, we show that this model provides the power of the central model while avoiding the need to trust a central server and the complexity of cryptographic secure function evaluation. More generally, we give evidence that the power of the shuffled model lies strictly between those of the central and local models: for a natural restriction of the model, we show that shuffled protocols for a widely studied selection problem require exponentially higher sample complexity than do central-model protocols. △ Less

Submitted 16 May, 2019; v1 submitted 3 August, 2018; originally announced August 2018.

Comments: Updated to correct a typo

arXiv:1711.04213 [pdf, other]

Skyline Identification in Multi-Armed Bandits

Authors: Albert Cheu, Ravi Sundaram, Jonathan Ullman

Abstract: We introduce a variant of the classical PAC multi-armed bandit problem. There is an ordered set of $n$ arms $A[1],\dots,A[n]$, each with some stochastic reward drawn from some unknown bounded distribution. The goal is to identify the $skyline$ of the set $A$, consisting of all arms $A[i]$ such that $A[i]$ has larger expected reward than all lower-numbered arms $A[1],\dots,A[i-1]$. We define a natu… ▽ More We introduce a variant of the classical PAC multi-armed bandit problem. There is an ordered set of $n$ arms $A[1],\dots,A[n]$, each with some stochastic reward drawn from some unknown bounded distribution. The goal is to identify the $skyline$ of the set $A$, consisting of all arms $A[i]$ such that $A[i]$ has larger expected reward than all lower-numbered arms $A[1],\dots,A[i-1]$. We define a natural notion of an $\varepsilon$-approximate skyline and prove matching upper and lower bounds for identifying an $\varepsilon$-skyline. Specifically, we show that in order to identify an $\varepsilon$-skyline from among $n$ arms with probability $1-δ$, $$ Θ\bigg(\frac{n}{\varepsilon^2} \cdot \min\bigg\{ \log\bigg(\frac{1}{\varepsilon δ}\bigg), \log\bigg(\frac{n}δ\bigg) \bigg\} \bigg) $$ samples are necessary and sufficient. When $\varepsilon \gg 1/n$, our results improve over the naive algorithm, which draws enough samples to approximate the expected reward of every arm; the algorithm of (Auer et al., AISTATS'16) for Pareto-optimal arm identification is likewise superseded. Our results show that the sample complexity of the skyline problem lies strictly in between that of best arm identification (Even-Dar et al., COLT'02) and that of approximating the expected reward of every arm. △ Less

Submitted 9 January, 2018; v1 submitted 11 November, 2017; originally announced November 2017.

Comments: 18 pages, 2 Figures; an ALT'18/ISIT'18 submission

Showing 1–14 of 14 results for author: Cheu, A