-
Alistair: Efficient On-device Budgeting for Differentially-Private Ad-Measurement Systems
Authors:
Pierre Tholoniat,
Kelly Kostopoulou,
Peter McNeely,
Prabhpreet Singh Sodhi,
Anirudh Varanasi,
Benjamin Case,
Asaf Cidon,
Roxana Geambasu,
Mathias Lécuyer
Abstract:
With the impending removal of third-party cookies from major browsers and the introduction of new privacy-preserving advertising APIs, the research community has a timely opportunity to assist industry in qualitatively improving the Web's privacy. This paper discusses our efforts, within a W3C community group, to enhance existing privacy-preserving advertising measurement APIs. We analyze designs…
▽ More
With the impending removal of third-party cookies from major browsers and the introduction of new privacy-preserving advertising APIs, the research community has a timely opportunity to assist industry in qualitatively improving the Web's privacy. This paper discusses our efforts, within a W3C community group, to enhance existing privacy-preserving advertising measurement APIs. We analyze designs from Google, Apple, Meta and Mozilla, and augment them with a more rigorous and efficient differential privacy (DP) budgeting component. Our approach, called Alistair, enforces well-defined DP guarantees and enables advertisers to conduct more private measurement queries accurately. By framing the privacy guarantee in terms of an individual form of DP, we can make DP budgeting more efficient than in current systems that use a traditional DP definition. We incorporate Alistair into Chrome and evaluate it on microbenchmarks and advertising datasets. Across all workloads, Alistair significantly outperforms baselines in enabling more advertising measurements under comparable DP protection.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Turbo: Effective Caching in Differentially-Private Databases
Authors:
Kelly Kostopoulou,
Pierre Tholoniat,
Asaf Cidon,
Roxana Geambasu,
Mathias Lécuyer
Abstract:
Differentially-private (DP) databases allow for privacy-preserving analytics over sensitive datasets or data streams. In these systems, user privacy is a limited resource that must be conserved with each query. We propose Turbo, a novel, state-of-the-art caching layer for linear query workloads over DP databases. Turbo builds upon private multiplicative weights (PMW), a DP mechanism that is powerf…
▽ More
Differentially-private (DP) databases allow for privacy-preserving analytics over sensitive datasets or data streams. In these systems, user privacy is a limited resource that must be conserved with each query. We propose Turbo, a novel, state-of-the-art caching layer for linear query workloads over DP databases. Turbo builds upon private multiplicative weights (PMW), a DP mechanism that is powerful in theory but ineffective in practice, and transforms it into a highly-effective caching mechanism, PMW-Bypass, that uses prior query results obtained through an external DP mechanism to train a PMW to answer arbitrary future linear queries accurately and "for free" from a privacy perspective. Our experiments on public Covid19 and CitiBike datasets show that Turbo with PMW-Bypass conserves 1.7-15.9x more budget compared to vanilla PMW and simpler cache designs, a significant improvement. Moreover, Turbo provides support for range query workloads, such as timeseries or streams, where opportunities exist to further conserve privacy budget through DP parallel composition and warm-starting of PMW state. Our work provides a theoretical foundation and general system design for effective caching in DP databases.
△ Less
Submitted 23 October, 2023; v1 submitted 28 June, 2023;
originally announced June 2023.
-
Advancing Differential Privacy: Where We Are Now and Future Directions for Real-World Deployment
Authors:
Rachel Cummings,
Damien Desfontaines,
David Evans,
Roxana Geambasu,
Yangsibo Huang,
Matthew Jagielski,
Peter Kairouz,
Gautam Kamath,
Sewoong Oh,
Olga Ohrimenko,
Nicolas Papernot,
Ryan Rogers,
Milan Shen,
Shuang Song,
Weijie Su,
Andreas Terzis,
Abhradeep Thakurta,
Sergei Vassilvitskii,
Yu-Xiang Wang,
Li Xiong,
Sergey Yekhanin,
Da Yu,
Huanyu Zhang,
Wanrong Zhang
Abstract:
In this article, we present a detailed review of current practices and state-of-the-art methodologies in the field of differential privacy (DP), with a focus of advancing DP's deployment in real-world applications. Key points and high-level contents of the article were originated from the discussions from "Differential Privacy (DP): Challenges Towards the Next Frontier," a workshop held in July 20…
▽ More
In this article, we present a detailed review of current practices and state-of-the-art methodologies in the field of differential privacy (DP), with a focus of advancing DP's deployment in real-world applications. Key points and high-level contents of the article were originated from the discussions from "Differential Privacy (DP): Challenges Towards the Next Frontier," a workshop held in July 2022 with experts from industry, academia, and the public sector seeking answers to broad questions pertaining to privacy and its implications in the design of industry-grade systems.
This article aims to provide a reference point for the algorithmic and design decisions within the realm of privacy, highlighting important challenges and potential research directions. Covering a wide spectrum of topics, this article delves into the infrastructure needs for designing private systems, methods for achieving better privacy/utility trade-offs, performing privacy attacks and auditing, as well as communicating privacy with broader audiences and stakeholders.
△ Less
Submitted 12 March, 2024; v1 submitted 14 April, 2023;
originally announced April 2023.
-
Packing Privacy Budget Efficiently
Authors:
Pierre Tholoniat,
Kelly Kostopoulou,
Mosharaf Chowdhury,
Asaf Cidon,
Roxana Geambasu,
Mathias Lécuyer,
Junfeng Yang
Abstract:
Machine learning (ML) models can leak information about users, and differential privacy (DP) provides a rigorous way to bound that leakage under a given budget. This DP budget can be regarded as a new type of compute resource in workloads of multiple ML models training on user data. Once it is used, the DP budget is forever consumed. Therefore, it is crucial to allocate it most efficiently to trai…
▽ More
Machine learning (ML) models can leak information about users, and differential privacy (DP) provides a rigorous way to bound that leakage under a given budget. This DP budget can be regarded as a new type of compute resource in workloads of multiple ML models training on user data. Once it is used, the DP budget is forever consumed. Therefore, it is crucial to allocate it most efficiently to train as many models as possible. This paper presents the scheduler for privacy that optimizes for efficiency. We formulate privacy scheduling as a new type of multidimensional knapsack problem, called privacy knapsack, which maximizes DP budget efficiency. We show that privacy knapsack is NP-hard, hence practical algorithms are necessarily approximate. We develop an approximation algorithm for privacy knapsack, DPK, and evaluate it on microbenchmarks and on a new, synthetic private-ML workload we developed from the Alibaba ML cluster trace. We show that DPK: (1) often approaches the efficiency-optimal schedule, (2) consistently schedules more tasks compared to a state-of-the-art privacy scheduling algorithm that focused on fairness (1.3-1.7x in Alibaba, 1.0-2.6x in microbenchmarks), but (3) sacrifices some level of fairness for efficiency. Therefore, using DPK, DP ML operators should be able to train more models on the same amount of user data while offering the same privacy guarantee to their users.
△ Less
Submitted 26 December, 2022;
originally announced December 2022.
-
How to Combine Membership-Inference Attacks on Multiple Updated Models
Authors:
Matthew Jagielski,
Stanley Wu,
Alina Oprea,
Jonathan Ullman,
Roxana Geambasu
Abstract:
A large body of research has shown that machine learning models are vulnerable to membership inference (MI) attacks that violate the privacy of the participants in the training data. Most MI research focuses on the case of a single standalone model, while production machine-learning platforms often update models over time, on data that often shifts in distribution, giving the attacker more informa…
▽ More
A large body of research has shown that machine learning models are vulnerable to membership inference (MI) attacks that violate the privacy of the participants in the training data. Most MI research focuses on the case of a single standalone model, while production machine-learning platforms often update models over time, on data that often shifts in distribution, giving the attacker more information. This paper proposes new attacks that take advantage of one or more model updates to improve MI. A key part of our approach is to leverage rich information from standalone MI attacks mounted separately against the original and updated models, and to combine this information in specific ways to improve attack effectiveness. We propose a set of combination functions and tuning methods for each, and present both analytical and quantitative justification for various options. Our results on four public datasets show that our attacks are effective at using update information to give the adversary a significant advantage over attacks on standalone models, but also compared to a prior MI attack that takes advantage of model updates in a related machine-unlearning setting. We perform the first measurements of the impact of distribution shift on MI attacks with model updates, and show that a more drastic distribution shift results in significantly higher MI risk than a gradual shift. Our code is available at https://www.github.com/stanleykywu/model-updates.
△ Less
Submitted 12 May, 2022;
originally announced May 2022.
-
Toward Training at ImageNet Scale with Differential Privacy
Authors:
Alexey Kurakin,
Shuang Song,
Steve Chien,
Roxana Geambasu,
Andreas Terzis,
Abhradeep Thakurta
Abstract:
Differential privacy (DP) is the de facto standard for training machine learning (ML) models, including neural networks, while ensuring the privacy of individual examples in the training set. Despite a rich literature on how to train ML models with differential privacy, it remains extremely challenging to train real-life, large neural networks with both reasonable accuracy and privacy.
We set ou…
▽ More
Differential privacy (DP) is the de facto standard for training machine learning (ML) models, including neural networks, while ensuring the privacy of individual examples in the training set. Despite a rich literature on how to train ML models with differential privacy, it remains extremely challenging to train real-life, large neural networks with both reasonable accuracy and privacy.
We set out to investigate how to do this, using ImageNet image classification as a poster example of an ML task that is very challenging to resolve accurately with DP right now. This paper shares initial lessons from our effort, in the hope that it will inspire and inform other researchers to explore DP training at scale. We show approaches that help make DP training faster, as well as model types and settings of the training process that tend to work better in the DP setting. Combined, the methods we discuss let us train a Resnet-18 with DP to $47.9\%$ accuracy and privacy parameters $ε= 10, δ= 10^{-6}$. This is a significant improvement over "naive" DP training of ImageNet models, but a far cry from the $75\%$ accuracy that can be obtained by the same network without privacy. The model we use was pretrained on the Places365 data set as a starting point. We share our code at https://github.com/google-research/dp-imagenet, calling for others to build upon this new baseline to further improve DP at scale.
△ Less
Submitted 8 February, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Privacy Budget Scheduling
Authors:
Tao Luo,
Mingen Pan,
Pierre Tholoniat,
Asaf Cidon,
Roxana Geambasu,
Mathias Lécuyer
Abstract:
Machine learning (ML) models trained on personal data have been shown to leak information about users. Differential privacy (DP) enables model training with a guaranteed bound on this leakage. Each new model trained with DP increases the bound on data leakage and can be seen as consuming part of a global privacy budget that should not be exceeded. This budget is a scarce resource that must be care…
▽ More
Machine learning (ML) models trained on personal data have been shown to leak information about users. Differential privacy (DP) enables model training with a guaranteed bound on this leakage. Each new model trained with DP increases the bound on data leakage and can be seen as consuming part of a global privacy budget that should not be exceeded. This budget is a scarce resource that must be carefully managed to maximize the number of successfully trained models.
We describe PrivateKube, an extension to the popular Kubernetes datacenter orchestrator that adds privacy as a new type of resource to be managed alongside other traditional compute resources, such as CPU, GPU, and memory. The abstractions we design for the privacy resource mirror those defined by Kubernetes for traditional resources, but there are also major differences. For example, traditional compute resources are replenishable while privacy is not: a CPU can be regained after a model finishes execution while privacy budget cannot. This distinction forces a re-design of the scheduler. We present DPF (Dominant Private Block Fairness) -- a variant of the popular Dominant Resource Fairness (DRF) algorithm -- that is geared toward the non-replenishable privacy resource but enjoys similar theoretical properties as DRF.
We evaluate PrivateKube and DPF on microbenchmarks and an ML workload on Amazon Reviews data. Compared to existing baselines, DPF allows training more models under the same global privacy guarantee. This is especially true for DPF over Rényi DP, a highly composable form of DP.
△ Less
Submitted 29 June, 2021;
originally announced June 2021.
-
Pythia: Grammar-Based Fuzzing of REST APIs with Coverage-guided Feedback and Learning-based Mutations
Authors:
Vaggelis Atlidakis,
Roxana Geambasu,
Patrice Godefroid,
Marina Polishchuk,
Baishakhi Ray
Abstract:
This paper introduces Pythia, the first fuzzer that augments grammar-based fuzzing with coverage-guided feedback and a learning-based mutation strategy for stateful REST API fuzzing. Pythia uses a statistical model to learn common usage patterns of a target REST API from structurally valid seed inputs. It then generates learning-based mutations by injecting a small amount of noise deviating from c…
▽ More
This paper introduces Pythia, the first fuzzer that augments grammar-based fuzzing with coverage-guided feedback and a learning-based mutation strategy for stateful REST API fuzzing. Pythia uses a statistical model to learn common usage patterns of a target REST API from structurally valid seed inputs. It then generates learning-based mutations by injecting a small amount of noise deviating from common usage patterns while still maintaining syntactic validity. Pythia's mutation strategy helps generate grammatically valid test cases and coverage-guided feedback helps prioritize the test cases that are more likely to find bugs. We present experimental evaluation on three production-scale, open-source cloud services showing that Pythia outperforms prior approaches both in code coverage and new bugs found. Using Pythia, we found 29 new bugs which we are in the process of reporting to the respective service owners.
△ Less
Submitted 23 May, 2020;
originally announced May 2020.
-
Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform
Authors:
Mathias Lecuyer,
Riley Spahn,
Kiran Vodrahalli,
Roxana Geambasu,
Daniel Hsu
Abstract:
Companies increasingly expose machine learning (ML) models trained over sensitive user data to untrusted domains, such as end-user devices and wide-access model stores. We present Sage, a differentially private (DP) ML platform that bounds the cumulative leakage of training data through models. Sage builds upon the rich literature on DP ML algorithms and contributes pragmatic solutions to two of t…
▽ More
Companies increasingly expose machine learning (ML) models trained over sensitive user data to untrusted domains, such as end-user devices and wide-access model stores. We present Sage, a differentially private (DP) ML platform that bounds the cumulative leakage of training data through models. Sage builds upon the rich literature on DP ML algorithms and contributes pragmatic solutions to two of the most pressing systems challenges of global DP: running out of privacy budget and the privacy-utility tradeoff. To address the former, we develop block composition, a new privacy loss accounting method that leverages the growing database regime of ML workloads to keep training models endlessly on a sensitive data stream while enforcing a global DP guarantee for the stream. To address the latter, we develop privacy-adaptive training, a process that trains a model on growing amounts of data and/or with increasing privacy parameters until, with high probability, the model meets developer-configured quality criteria. They illustrate how a systems focus on characteristics of ML workloads enables pragmatic solutions that are not apparent when one focuses on individual algorithms, as most DP ML literature does.
△ Less
Submitted 6 September, 2019; v1 submitted 3 September, 2019;
originally announced September 2019.
-
Certified Robustness to Adversarial Examples with Differential Privacy
Authors:
Mathias Lecuyer,
Vaggelis Atlidakis,
Roxana Geambasu,
Daniel Hsu,
Suman Jana
Abstract:
Adversarial examples that fool machine learning models, particularly deep neural networks, have been a topic of intense research interest, with attacks and defenses being developed in a tight back-and-forth. Most past defenses are best effort and have been shown to be vulnerable to sophisticated attacks. Recently a set of certified defenses have been introduced, which provide guarantees of robustn…
▽ More
Adversarial examples that fool machine learning models, particularly deep neural networks, have been a topic of intense research interest, with attacks and defenses being developed in a tight back-and-forth. Most past defenses are best effort and have been shown to be vulnerable to sophisticated attacks. Recently a set of certified defenses have been introduced, which provide guarantees of robustness to norm-bounded attacks, but they either do not scale to large datasets or are limited in the types of models they can support. This paper presents the first certified defense that both scales to large networks and datasets (such as Google's Inception network for ImageNet) and applies broadly to arbitrary model types. Our defense, called PixelDP, is based on a novel connection between robustness against adversarial examples and differential privacy, a cryptographically-inspired formalism, that provides a rigorous, generic, and flexible foundation for defense.
△ Less
Submitted 29 May, 2019; v1 submitted 9 February, 2018;
originally announced February 2018.
-
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Authors:
Mathias Lecuyer,
Riley Spahn,
Roxana Geambasu,
Tzu-Kuo Huang,
Siddhartha Sen
Abstract:
Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations to limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working-set of in-use da…
▽ More
Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations to limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working-set of in-use data in accessible storage; unused data can be evicted to a highly protected store. However, many of today's big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads to improve performance or scalability. We present Pyramid, a limited-exposure data management system that builds upon count featurization to enhance data protection. As such, Pyramid uniquely introduces both the idea and proof-of-concept for leveraging training set minimization methods to instill rigor and selectivity into big data management. We integrated Pyramid into Spark Velox, a framework for ML-based targeting and personalization. We evaluate it on three applications and show that Pyramid approaches state-of-the-art models while training on less than 1% of the raw data.
△ Less
Submitted 21 May, 2017;
originally announced May 2017.
-
FairTest: Discovering Unwarranted Associations in Data-Driven Applications
Authors:
Florian Tramèr,
Vaggelis Atlidakis,
Roxana Geambasu,
Daniel Hsu,
Jean-Pierre Hubaux,
Mathias Humbert,
Ari Juels,
Huang Lin
Abstract:
In a world where traditional notions of privacy are increasingly challenged by the myriad companies that collect and analyze our data, it is important that decision-making entities are held accountable for unfair treatments arising from irresponsible data usage. Unfortunately, a lack of appropriate methodologies and tools means that even identifying unfair or discriminatory effects can be a challe…
▽ More
In a world where traditional notions of privacy are increasingly challenged by the myriad companies that collect and analyze our data, it is important that decision-making entities are held accountable for unfair treatments arising from irresponsible data usage. Unfortunately, a lack of appropriate methodologies and tools means that even identifying unfair or discriminatory effects can be a challenge in practice. We introduce the unwarranted associations (UA) framework, a principled methodology for the discovery of unfair, discriminatory, or offensive user treatment in data-driven applications. The UA framework unifies and rationalizes a number of prior attempts at formalizing algorithmic fairness. It uniquely combines multiple investigative primitives and fairness metrics with broad applicability, granular exploration of unfair treatment in user subgroups, and incorporation of natural notions of utility that may account for observed disparities. We instantiate the UA framework in FairTest, the first comprehensive tool that helps developers check data-driven applications for unfair user treatment. It enables scalable and statistically rigorous investigation of associations between application outcomes (such as prices or premiums) and sensitive user attributes (such as race or gender). Furthermore, FairTest provides debugging capabilities that let programmers rule out potential confounders for observed unfair effects. We report on use of FairTest to investigate and in some cases address disparate impact, offensive labeling, and uneven rates of algorithmic error in four data-driven applications. As examples, our results reveal subtle biases against older populations in the distribution of error in a predictive health application and offensive racial labeling in an image tagger.
△ Less
Submitted 16 August, 2016; v1 submitted 8 October, 2015;
originally announced October 2015.
-
XRay: Enhancing the Web's Transparency with Differential Correlation
Authors:
Mathias Lecuyer,
Guillaume Ducoffe,
Francis Lan,
Andrei Papancea,
Theofilos Petsios,
Riley Spahn,
Augustin Chaintreau,
Roxana Geambasu
Abstract:
Today's Web services - such as Google, Amazon, and Facebook - leverage user data for varied purposes, including personalizing recommendations, targeting advertisements, and adjusting prices. At present, users have little insight into how their data is being used. Hence, they cannot make informed choices about the services they choose. To increase transparency, we developed XRay, the first fine-gra…
▽ More
Today's Web services - such as Google, Amazon, and Facebook - leverage user data for varied purposes, including personalizing recommendations, targeting advertisements, and adjusting prices. At present, users have little insight into how their data is being used. Hence, they cannot make informed choices about the services they choose. To increase transparency, we developed XRay, the first fine-grained, robust, and scalable personal data tracking system for the Web. XRay predicts which data in an arbitrary Web account (such as emails, searches, or viewed products) is being used to target which outputs (such as ads, recommended products, or prices). XRay's core functions are service agnostic and easy to instantiate for new services, and they can track data within and across services. To make predictions independent of the audited service, XRay relies on the following insight: by comparing outputs from different accounts with similar, but not identical, subsets of data, one can pinpoint targeting through correlation. We show both theoretically, and through experiments on Gmail, Amazon, and YouTube, that XRay achieves high precision and recall by correlating data from a surprisingly small number of extra accounts.
△ Less
Submitted 7 October, 2014; v1 submitted 8 July, 2014;
originally announced July 2014.