Search | arXiv e-print repository

Privately Answering Queries on Skewed Data via Per Record Differential Privacy

Authors: Jeremy Seeman, William Sexton, David Pujol, Ashwin Machanavajjhala

Abstract: We consider the problem of the private release of statistics (like aggregate payrolls) where it is critical to preserve the contribution made by a small number of outlying large entities. We propose a privacy formalism, per-record zero concentrated differential privacy (PzCDP), where the privacy loss associated with each record is a public function of that record's value. Unlike other formalisms w… ▽ More We consider the problem of the private release of statistics (like aggregate payrolls) where it is critical to preserve the contribution made by a small number of outlying large entities. We propose a privacy formalism, per-record zero concentrated differential privacy (PzCDP), where the privacy loss associated with each record is a public function of that record's value. Unlike other formalisms which provide different privacy losses to different records, PzCDP's privacy loss depends explicitly on the confidential data. We define our formalism, derive its properties, and propose mechanisms which satisfy PzCDP that are uniquely suited to publishing skewed or heavy-tailed statistics, where a small number of records contribute substantially to query answers. This targeted relaxation helps overcome the difficulties of applying standard DP to these data products. △ Less

Submitted 19 October, 2023; originally announced October 2023.

Comments: 14 pages, 5 figures

arXiv:2309.08574 [pdf, other]

DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms

Authors: Shweta Patwa, Danyu Sun, Amir Gilad, Ashwin Machanavajjhala, Sudeepa Roy

Abstract: Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental operations in data analysis include analyzing aggregated statistics, e.g., count, sum, or median, on a subset of data satisfying some conditions. Whe… ▽ More Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental operations in data analysis include analyzing aggregated statistics, e.g., count, sum, or median, on a subset of data satisfying some conditions. When synthetic data is generated, users may be interested in knowing if their aggregated queries generating such statistics can be reliably answered on the synthetic data, for instance, to decide if the synthetic data is suitable for specific tasks. However, the standard data generation systems do not provide "per-query" quality guarantees on the synthetic data, and the users have no way of knowing how much the aggregated statistics on the synthetic data can be trusted. To address this problem, we present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other while guaranteeing differential privacy. We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally. △ Less

Submitted 15 September, 2023; originally announced September 2023.

arXiv:2308.16298 [pdf, other]

Publishing Wikipedia usage data with strong privacy guarantees

Authors: Temilola Adeleye, Skye Berghel, Damien Desfontaines, Michael Hay, Isaac Johnson, Cléo Lemoisson, Ashwin Machanavajjhala, Tom Magerlein, Gabriele Modena, David Pujol, Daniel Simmons-Marengo, Hal Triedman

Abstract: For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day. This data helps Wikipedia editors determine where to focus their efforts to improve the online encyclopedia, and enables academic research. In June 2023, the Wikimedia Foundation, helped by Tumult Labs, addressed a long-standing request from Wikipedia editors… ▽ More For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day. This data helps Wikipedia editors determine where to focus their efforts to improve the online encyclopedia, and enables academic research. In June 2023, the Wikimedia Foundation, helped by Tumult Labs, addressed a long-standing request from Wikipedia editors and academic researchers: it started publishing these statistics with finer granularity, including the country of origin in the daily counts of page views. This new data publication uses differential privacy to provide robust guarantees to people browsing or editing Wikipedia. This paper describes this data publication: its goals, the process followed from its inception to its deployment, the algorithms used to produce the data, and the outcomes of the data release. △ Less

Submitted 1 September, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

Comments: 11 pages, 10 figures, Theory and Practice of Differential Privacy (TPDP) 2023

arXiv:2212.10310 [pdf, other]

PreFair: Privately Generating Justifiably Fair Synthetic Data

Authors: David Pujol, Amir Gilad, Ashwin Machanavajjhala

Abstract: When a database is protected by Differential Privacy (DP), its usability is limited in scope. In this scenario, generating a synthetic version of the data that mimics the properties of the private data allows users to perform any operation on the synthetic data, while maintaining the privacy of the original data. Therefore, multiple works have been devoted to devising systems for DP synthetic data… ▽ More When a database is protected by Differential Privacy (DP), its usability is limited in scope. In this scenario, generating a synthetic version of the data that mimics the properties of the private data allows users to perform any operation on the synthetic data, while maintaining the privacy of the original data. Therefore, multiple works have been devoted to devising systems for DP synthetic data generation. However, such systems may preserve or even magnify properties of the data that make it unfair, endering the synthetic data unfit for use. In this work, we present PreFair, a system that allows for DP fair synthetic data generation. PreFair extends the state-of-the-art DP data generation mechanisms by incorporating a causal fairness criterion that ensures fair synthetic data. We adapt the notion of justifiable fairness to fit the synthetic data generation scenario. We further study the problem of generating DP fair synthetic data, showing its intractability and designing algorithms that are optimal under certain assumptions. We also provide an extensive experimental evaluation, showing that PreFair generates synthetic data that is significantly fairer than the data generated by leading DP data generation mechanisms, while remaining faithful to the private data. △ Less

Submitted 27 March, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: 15 pages, 11 figures

arXiv:2212.09884 [pdf, other]

Multi-Analyst Differential Privacy for Online Query Answering

Authors: David Pujol, Albert Sun, Brandon Fain, Ashwin Machanavajjhala

Abstract: Most differentially private mechanisms are designed for the use of a single analyst. In reality, however, there are often multiple stakeholders with different and possibly conflicting priorities that must share the same privacy loss budget. This motivates the problem of equitable budget-sharing for multi-analyst differential privacy. Our previous work defined desiderata that any mechanism in this… ▽ More Most differentially private mechanisms are designed for the use of a single analyst. In reality, however, there are often multiple stakeholders with different and possibly conflicting priorities that must share the same privacy loss budget. This motivates the problem of equitable budget-sharing for multi-analyst differential privacy. Our previous work defined desiderata that any mechanism in this space should satisfy and introduced methods for budget-sharing in the offline case where queries are known in advance. We extend our previous work on multi-analyst differentially private query answering to the case of online query answering, where queries come in one at a time and must be answered without knowledge of the following queries. We demonstrate that the unknown ordering of queries in the online case results in a fundamental limit in the number of queries that can be answered while satisfying the desiderata. In response, we develop two mechanisms, one which satisfies the desiderata in all cases but is subject to the fundamental limitations, and another that randomizes the input order ensuring that existing online query answering mechanisms can satisfy the desiderata. △ Less

Submitted 19 December, 2022; originally announced December 2022.

Comments: 11 pages 3 figures

arXiv:2212.04133 [pdf, other]

Tumult Analytics: a robust, easy-to-use, scalable, and expressive framework for differential privacy

Authors: Skye Berghel, Philip Bohannon, Damien Desfontaines, Charles Estes, Sam Haney, Luke Hartman, Michael Hay, Ashwin Machanavajjhala, Tom Magerlein, Gerome Miklau, Amritha Pai, William Sexton, Ruchit Shrestha

Abstract: In this short paper, we outline the design of Tumult Analytics, a Python framework for differential privacy used at institutions such as the U.S. Census Bureau, the Wikimedia Foundation, or the Internal Revenue Service. In this short paper, we outline the design of Tumult Analytics, a Python framework for differential privacy used at institutions such as the U.S. Census Bureau, the Wikimedia Foundation, or the Internal Revenue Service. △ Less

Submitted 8 December, 2022; originally announced December 2022.

arXiv:2209.03310 [pdf, other]

Bayesian and Frequentist Semantics for Common Variations of Differential Privacy: Applications to the 2020 Census

Authors: Daniel Kifer, John M. Abowd, Robert Ashmead, Ryan Cumings-Menon, Philip Leclerc, Ashwin Machanavajjhala, William Sexton, Pavel Zhuravlev

Abstract: The purpose of this paper is to guide interpretation of the semantic privacy guarantees for some of the major variations of differential privacy, which include pure, approximate, Rényi, zero-concentrated, and $f$ differential privacy. We interpret privacy-loss accounting parameters, frequentist semantics, and Bayesian semantics (including new results). The driving application is the interpretation… ▽ More The purpose of this paper is to guide interpretation of the semantic privacy guarantees for some of the major variations of differential privacy, which include pure, approximate, Rényi, zero-concentrated, and $f$ differential privacy. We interpret privacy-loss accounting parameters, frequentist semantics, and Bayesian semantics (including new results). The driving application is the interpretation of the confidentiality protections for the 2020 Census Public Law 94-171 Redistricting Data Summary File released August 12, 2021, which, for the first time, were produced with formal privacy guarantees. △ Less

Submitted 7 September, 2022; originally announced September 2022.

arXiv:2209.01286 [pdf, other]

DPXPlain: Privately Explaining Aggregate Query Answers

Authors: Yuchao Tao, Amir Gilad, Ashwin Machanavajjhala, Sudeepa Roy

Abstract: Differential privacy (DP) is the state-of-the-art and rigorous notion of privacy for answering aggregate database queries while preserving the privacy of sensitive information in the data. In today's era of data analysis, however, it poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data itself, or is it due to… ▽ More Differential privacy (DP) is the state-of-the-art and rigorous notion of privacy for answering aggregate database queries while preserving the privacy of sensitive information in the data. In today's era of data analysis, however, it poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data itself, or is it due to the extra noise that must be added to preserve DP? In the second case, even the observation made by the users on query results may be wrong. In the first case, can we still mine interesting explanations from the sensitive data while protecting its privacy? To address these challenges, we present a three-phase framework DPXPlain, which is the first system to the best of our knowledge for explaining group-by aggregate query answers with DP. In its three phases, DPXPlain (a) answers a group-by aggregate query with DP, (b) allows users to compare aggregate values of two groups and with high probability assesses whether this comparison holds or is flipped by the DP noise, and (c) eventually provides an explanation table containing the approximately `top-k' explanation predicates along with their relative influences and ranks in the form of confidence intervals, while guaranteeing DP in all steps. We perform an extensive experimental analysis of DPXPlain with multiple use-cases on real and synthetic data showing that DPXPlain efficiently provides insightful explanations with good accuracy and utility. △ Less

Submitted 2 September, 2022; originally announced September 2022.

arXiv:2204.08986 [pdf, other]

The 2020 Census Disclosure Avoidance System TopDown Algorithm

Authors: John M. Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Micah Heineck, Christine Heiss, Robert Johns, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, Brett Moran, William Sexton, Matthew Spence, Pavel Zhuravlev

Abstract: The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-Concentrated Differential Privacy. Another ke… ▽ More The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-Concentrated Differential Privacy. Another key aspect of the TDA are invariants, statistics that the Census Bureau has determined, as matter of policy, to exclude from the privacy-loss accounting. The TDA post-processes the measurements together with the invariants to produce a Microdata Detail File (MDF) that contains one record for each person and one record for each housing unit enumerated in the 2020 Census. The MDF is passed to the 2020 Census tabulation system to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File. This paper describes the mathematics and testing of the TDA for this purpose. △ Less

Submitted 19 April, 2022; originally announced April 2022.

arXiv:2203.05084 [pdf, other]

IncShrink: Architecting Efficient Outsourced Databases using Incremental MPC and Differential Privacy

Authors: Chenghong Wang, Johes Bater, Kartik Nayak, Ashwin Machanavajjhala

Abstract: In this paper, we consider secure outsourced growing databases that support view-based query answering. These databases allow untrusted servers to privately maintain a materialized view, such that they can use only the materialized view to process query requests instead of accessing the original data from which the view was derived. To tackle this, we devise a novel view-based secure outsourced gr… ▽ More In this paper, we consider secure outsourced growing databases that support view-based query answering. These databases allow untrusted servers to privately maintain a materialized view, such that they can use only the materialized view to process query requests instead of accessing the original data from which the view was derived. To tackle this, we devise a novel view-based secure outsourced growing database framework, Incshrink. The key features of this solution are: (i) Incshrink maintains the view using incremental MPC operators which eliminates the need for a trusted third party upfront, and (ii) to ensure high performance, Incshrink guarantees that the leakage satisfies DP in the presence of updates. To the best of our knowledge, there are no existing systems that have these properties. We demonstrate Incshrink's practical feasibility in terms of efficiency and accuracy with extensive empirical evaluations on real-world datasets and the TPC-ds benchmark. The evaluation results show that Incshrink provides a 3-way trade-off in terms of privacy, accuracy, and efficiency guarantees, and offers at least a 7,800 times performance advantage over standard secure outsourced databases that do not support the view-based query paradigm. △ Less

Submitted 9 March, 2022; originally announced March 2022.

arXiv:2112.09238 [pdf, other]

Benchmarking Differentially Private Synthetic Data Generation Algorithms

Authors: Yuchao Tao, Ryan McKenna, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau

Abstract: This work presents a systematic benchmark of differentially private synthetic data generation algorithms that can generate tabular data. Utility of the synthetic data is evaluated by measuring whether the synthetic data preserve the distribution of individual and pairs of attributes, pairwise correlation as well as on the accuracy of an ML classification model. In a comprehensive empirical evaluat… ▽ More This work presents a systematic benchmark of differentially private synthetic data generation algorithms that can generate tabular data. Utility of the synthetic data is evaluated by measuring whether the synthetic data preserve the distribution of individual and pairs of attributes, pairwise correlation as well as on the accuracy of an ML classification model. In a comprehensive empirical evaluation we identify the top performing algorithms and those that consistently fail to beat baseline approaches. △ Less

Submitted 15 February, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

arXiv:2111.04671 [pdf, other]

doi 10.1109/MSEC.2021.3105773

Equity and Privacy: More Than Just a Tradeoff

Authors: David Pujol, Ashwin Machanavajjhala

Abstract: While the entire field of privacy preserving data analytics is focused on the privacy-utility tradeoff, recent work has shown that privacy preserving data publishing can introduce different levels of utility across different population groups. It is important to understand this new tradeoff between privacy and equity as privacy technology is being deployed in situations where the data products wil… ▽ More While the entire field of privacy preserving data analytics is focused on the privacy-utility tradeoff, recent work has shown that privacy preserving data publishing can introduce different levels of utility across different population groups. It is important to understand this new tradeoff between privacy and equity as privacy technology is being deployed in situations where the data products will be used for research and policy making. Will marginal populations see disproportionately less utility from privacy technology? If there is an inequity how can we address it? △ Less

Submitted 8 November, 2021; originally announced November 2021.

Comments: 3 pages, 1 figure. Published in IEEE Security & Privacy ( Volume: 19, Issue: 6, Nov.-Dec. 2021)

arXiv:2107.10659 [pdf, ps, other]

Differentially Private Algorithms for 2020 Census Detailed DHC Race \& Ethnicity

Authors: Sam Haney, William Sexton, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau

Abstract: This article describes a proposed differentially private (DP) algorithms that the US Census Bureau is considering to release the Detailed Demographic and Housing Characteristics (DHC) Race & Ethnicity tabulations as part of the 2020 Census. The tabulations contain statistics (counts) of demographic and housing characteristics of the entire population of the US crossed with detailed races and tribe… ▽ More This article describes a proposed differentially private (DP) algorithms that the US Census Bureau is considering to release the Detailed Demographic and Housing Characteristics (DHC) Race & Ethnicity tabulations as part of the 2020 Census. The tabulations contain statistics (counts) of demographic and housing characteristics of the entire population of the US crossed with detailed races and tribes at varying levels of geography. We describe two differentially private algorithmic strategies, one based on adding noise drawn from a two-sided Geometric distribution that satisfies "pure"-DP, and another based on adding noise from a Discrete Gaussian distribution that satisfied a well studied variant of differential privacy, called Zero Concentrated Differential Privacy (zCDP). We analytically estimate the privacy loss parameters ensured by the two algorithms for comparable levels of error introduced in the statistics. △ Less

Submitted 22 July, 2021; originally announced July 2021.

Comments: Presented at Theory and Practice of Differential Privacy Workshop (TPDP) 2021

arXiv:2106.12118 [pdf, other]

HDMM: Optimizing error of high-dimensional statistical queries under differential privacy

Authors: Ryan McKenna, Gerome Miklau, Michael Hay, Ashwin Machanavajjhala

Abstract: In this work we describe the High-Dimensional Matrix Mechanism (HDMM), a differentially private algorithm for answering a workload of predicate counting queries. HDMM represents query workloads using a compact implicit matrix representation and exploits this representation to efficiently optimize over (a subset of) the space of differentially private algorithms for one that is unbiased and answers… ▽ More In this work we describe the High-Dimensional Matrix Mechanism (HDMM), a differentially private algorithm for answering a workload of predicate counting queries. HDMM represents query workloads using a compact implicit matrix representation and exploits this representation to efficiently optimize over (a subset of) the space of differentially private algorithms for one that is unbiased and answers the input query workload with low expected error. HDMM can be deployed for both $ε$-differential privacy (with Laplace noise) and $(ε, δ)$-differential privacy (with Gaussian noise), although the core techniques are slightly different for each. We demonstrate empirically that HDMM can efficiently answer queries with lower expected error than state-of-the-art techniques, and in some cases, it nearly matches existing lower bounds for the particular class of mechanisms we consider. △ Less

Submitted 22 June, 2021; originally announced June 2021.

Comments: arXiv admin note: text overlap with arXiv:1808.03537

arXiv:2106.05131 [pdf, other]

Prior-Aware Distribution Estimation for Differential Privacy

Authors: Yuchao Tao, Johes Bater, Ashwin Machanavajjhala

Abstract: Joint distribution estimation of a dataset under differential privacy is a fundamental problem for many privacy-focused applications, such as query answering, machine learning tasks and synthetic data generation. In this work, we examine the joint distribution estimation problem given two data points: 1) differentially private answers of a workload computed over private data and 2) a prior empiric… ▽ More Joint distribution estimation of a dataset under differential privacy is a fundamental problem for many privacy-focused applications, such as query answering, machine learning tasks and synthetic data generation. In this work, we examine the joint distribution estimation problem given two data points: 1) differentially private answers of a workload computed over private data and 2) a prior empirical distribution from a public dataset. Our goal is to find a new distribution such that estimating the workload using this distribution is as accurate as the differentially private answer, and the relative entropy, or KL divergence, of this distribution is minimized with respect to the prior distribution. We propose an approach based on iterative optimization for solving this problem. An application of our solution won second place in the NIST 2020 Differential Privacy Temporal Map Challenge, Sprint 2. △ Less

Submitted 9 June, 2021; originally announced June 2021.

arXiv:2103.15942 [pdf, other]

doi 10.1145/3448016.3457306

DP-Sync: Hiding Update Patterns in Secure Outsourced Databases with Differential Privacy

Authors: Chenghong Wang, Johes Bater, Kartik Nayak, Ashwin Machanavajjhala

Abstract: In this paper, we have introduced a new type of leakage associated with modern encrypted databases called update pattern leakage. We formalize the definition and security model of DP-Sync with DP update patterns. We also proposed the framework DP-Sync, which extends existing encrypted database schemes to DP-Sync with DP update patterns. DP-Sync guarantees that the entire data update history over t… ▽ More In this paper, we have introduced a new type of leakage associated with modern encrypted databases called update pattern leakage. We formalize the definition and security model of DP-Sync with DP update patterns. We also proposed the framework DP-Sync, which extends existing encrypted database schemes to DP-Sync with DP update patterns. DP-Sync guarantees that the entire data update history over the outsourced data structure is protected by differential privacy. This is achieved by imposing differentially-private strategies that dictate the data owner's synchronization of local~data. △ Less

Submitted 6 April, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

arXiv:2103.14435 [pdf, other]

Synthesizing Linked Data Under Cardinality and Integrity Constraints

Authors: Amir Gilad, Shweta Patwa, Ashwin Machanavajjhala

Abstract: The generation of synthetic data is useful in multiple aspects, from testing applications to benchmarking to privacy preservation. Generating the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs) is an important aspect of this problem. Given instances of two relations, where one has a foreign key dependence on the other and is missing its foreign key… ▽ More The generation of synthetic data is useful in multiple aspects, from testing applications to benchmarking to privacy preservation. Generating the links between relations, subject to cardinality constraints (CCs) and integrity constraints (ICs) is an important aspect of this problem. Given instances of two relations, where one has a foreign key dependence on the other and is missing its foreign key ($FK$) values, and two types of constraints: (1) CCs that apply to the join view and (2) ICs that apply to the table with missing $FK$ values, our goal is to impute the missing $FK$ values such that the constraints are satisfied. We provide a novel framework for the problem based on declarative CCs and ICs. We further show that the problem is NP-hard and propose a novel two-phase solution that guarantees the satisfaction of the ICs. Phase I yields an intermediate solution accounting for the CCs alone, and relies on a hybrid approach based on CC types. For one type, the problem is modeled as an Integer Linear Program. For the others, we describe an efficient and accurate solution. We then combine the two solutions. Phase II augments this solution by incorporating the ICs and uses a coloring of the conflict hypergraph to infer the values of the $FK$ column. Our extensive experimental study shows that our solution scales well when the data and number of constraints increases. We further show that our solution maintains low error rates for the CCs. △ Less

Submitted 26 March, 2021; originally announced March 2021.

arXiv:2011.01192 [pdf, other]

doi 10.14778/3467861.3467870

Budget Sharing for Multi-Analyst Differential Privacy

Authors: David Pujol, Yikai Wu, Brandon Fain, Ashwin Machanavajjhala

Abstract: Large organizations that collect data about populations (like the US Census Bureau) release summary statistics that are used by multiple stakeholders for resource allocation and policy making problems. These organizations are also legally required to protect the privacy of individuals from whom they collect data. Differential Privacy (DP) provides a solution to release useful summary data while pr… ▽ More Large organizations that collect data about populations (like the US Census Bureau) release summary statistics that are used by multiple stakeholders for resource allocation and policy making problems. These organizations are also legally required to protect the privacy of individuals from whom they collect data. Differential Privacy (DP) provides a solution to release useful summary data while preserving privacy. Most DP mechanisms are designed to answer a single set of queries. In reality, there are often multiple stakeholders that use a given data release and have overlap** but not-identical queries. This introduces a novel joint optimization problem in DP where the privacy budget must be shared among different analysts. We initiate study into the problem of DP query answering across multiple analysts. To capture the competing goals and priorities of multiple analysts, we formulate three desiderata that any mechanism should satisfy in this setting -- The Sharing Incentive, Non-Interference, and Adaptivity -- while still optimizing for overall error. We demonstrate how existing DP query answering mechanisms in the multi-analyst settings fail to satisfy at least one of the desiderata. We present novel DP algorithms that provably satisfy all our desiderata and empirically show that they incur low error on realistic tasks. △ Less

Submitted 4 November, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: 13 pages, 5 figures. Proceedings of the VLDB Endowment (PVLDB) Vol. 14 No. 10. Presented at the International Conference on Very Large Data Bases (VLDB) 2021

ACM Class: H.2.7

Journal ref: PVLDB, 14(10): 1805-1817, 2021

arXiv:2004.08887 [pdf, other]

DP-Cryptography: Marrying Differential Privacy and Cryptography in Emerging Applications

Authors: Sameer Wagh, Xi He, Ashwin Machanavajjhala, Prateek Mittal

Abstract: Differential privacy (DP) has arisen as the state-of-the-art metric for quantifying individual privacy when sensitive data are analyzed, and it is starting to see practical deployment in organizations such as the US Census Bureau, Apple, Google, etc. There are two popular models for deploying differential privacy - standard differential privacy (SDP), where a trusted server aggregates all the data… ▽ More Differential privacy (DP) has arisen as the state-of-the-art metric for quantifying individual privacy when sensitive data are analyzed, and it is starting to see practical deployment in organizations such as the US Census Bureau, Apple, Google, etc. There are two popular models for deploying differential privacy - standard differential privacy (SDP), where a trusted server aggregates all the data and runs the DP mechanisms, and local differential privacy (LDP), where each user perturbs their own data and perturbed data is analyzed. Due to security concerns arising from aggregating raw data at a single server, several real world deployments in industry have embraced the LDP model. However, systems based on the LDP model tend to have poor utility - "a gap" in the utility achieved as compared to systems based on the SDP model. In this work, we survey and synthesize emerging directions of research at the intersection of differential privacy and cryptography. First, we survey solutions that combine cryptographic primitives like secure computation and anonymous communication with differential privacy to give alternatives to the LDP model that avoid a trusted server as in SDP but close the gap in accuracy. These primitives introduce performance bottlenecks and necessitate efficient alternatives. Second, we synthesize work in an area we call "DP-Cryptography" - cryptographic primitives that are allowed to leak differentially private outputs. These primitives have orders of magnitude better performance than standard cryptographic primitives. DP-cryptographic primitives are perfectly suited for implementing alternatives to LDP, but are also applicable to scenarios where standard cryptographic primitives do not have practical implementations. Through this unique lens of research taxonomy, we survey ongoing research in these directions while also providing novel directions for future research. △ Less

Submitted 19 April, 2020; originally announced April 2020.

arXiv:2004.04656 [pdf, other]

doi 10.1145/3318464.3389762

Computing Local Sensitivities of Counting Queries with Joins

Authors: Yuchao Tao, Xi He, Ashwin Machanavajjhala, Sudeepa Roy

Abstract: Local sensitivity of a query Q given a database instance D, i.e. how much the output Q(D) changes when a tuple is added to D or deleted from D, has many applications including query analysis, outlier detection, and in differential privacy. However, it is NP-hard to find local sensitivity of a conjunctive query in terms of the size of the query, even for the class of acyclic queries. Although the c… ▽ More Local sensitivity of a query Q given a database instance D, i.e. how much the output Q(D) changes when a tuple is added to D or deleted from D, has many applications including query analysis, outlier detection, and in differential privacy. However, it is NP-hard to find local sensitivity of a conjunctive query in terms of the size of the query, even for the class of acyclic queries. Although the complexity is polynomial when the query size is fixed, the naive algorithms are not efficient for large databases and queries involving multiple joins. In this paper, we present a novel approach to compute local sensitivity of counting queries involving join operations by tracking and summarizing tuple sensitivities -- the maximum change a tuple can cause in the query result when it is added or removed. We give algorithms for the sensitivity problem for full acyclic join queries using join trees, that run in polynomial time in both the size of the database and query for an interesting sub-class of queries, which we call 'doubly acyclic queries' that include path queries, and in polynomial time in combined complexity when the maximum degree in the join tree is bounded. Our algorithms can be extended to certain non-acyclic queries using generalized hypertree decompositions. We evaluate our approach experimentally, and show applications of our algorithms to obtain better results for differential privacy by orders of magnitude. △ Less

Submitted 9 April, 2020; originally announced April 2020.

Comments: To be published in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

arXiv:1908.10268 [pdf, other]

Answering Summation Queries for Numerical Attributes under Differential Privacy

Authors: Yikai Wu, David Pujol, Ios Kotsogiannis, Ashwin Machanavajjhala

Abstract: In this work we explore the problem of answering a set of sum queries under Differential Privacy. This is a little understood, non-trivial problem especially in the case of numerical domains. We show that traditional techniques from the literature are not always the best choice and a more rigorous approach is necessary to develop low error algorithms. In this work we explore the problem of answering a set of sum queries under Differential Privacy. This is a little understood, non-trivial problem especially in the case of numerical domains. We show that traditional techniques from the literature are not always the best choice and a more rigorous approach is necessary to develop low error algorithms. △ Less

Submitted 27 August, 2019; originally announced August 2019.

Comments: TPDP 2019, 7 pages

arXiv:1907.02159 [pdf, other]

Capacity Bounded Differential Privacy

Authors: Kamalika Chaudhuri, Jacob Imola, Ashwin Machanavajjhala

Abstract: Differential privacy, a notion of algorithmic stability, is a gold standard for measuring the additional risk an algorithm's output poses to the privacy of a single record in the dataset. Differential privacy is defined as the distance between the output distribution of an algorithm on neighboring datasets that differ in one entry. In this work, we present a novel relaxation of differential privac… ▽ More Differential privacy, a notion of algorithmic stability, is a gold standard for measuring the additional risk an algorithm's output poses to the privacy of a single record in the dataset. Differential privacy is defined as the distance between the output distribution of an algorithm on neighboring datasets that differ in one entry. In this work, we present a novel relaxation of differential privacy, capacity bounded differential privacy, where the adversary that distinguishes output distributions is assumed to be capacity-bounded -- i.e. bounded not in computational power, but in terms of the function class from which their attack algorithm is drawn. We model adversaries in terms of restricted f-divergences between probability distributions, and study properties of the definition and algorithms that satisfy them. △ Less

Submitted 3 July, 2019; originally announced July 2019.

Comments: 10 pages, 2 figures, Neurips 2019

arXiv:1905.12744 [pdf, other]

Fair Decision Making using Privacy-Protected Data

Authors: Satya Kuppam, Ryan Mckenna, David Pujol, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau

Abstract: Data collected about individuals is regularly used to make decisions that impact those same individuals. We consider settings where sensitive personal data is used to decide who will receive resources or benefits. While it is well known that there is a tradeoff between protecting privacy and the accuracy of decisions, we initiate a first-of-its-kind study into the impact of formally private mechan… ▽ More Data collected about individuals is regularly used to make decisions that impact those same individuals. We consider settings where sensitive personal data is used to decide who will receive resources or benefits. While it is well known that there is a tradeoff between protecting privacy and the accuracy of decisions, we initiate a first-of-its-kind study into the impact of formally private mechanisms (based on differential privacy) on fair and equitable decision-making. We empirically investigate novel tradeoffs on two real-world decisions made using U.S. Census data (allocation of federal funds and assignment of voting rights benefits) as well as a classic apportionment problem. Our results show that if decisions are made using an $ε$-differentially private version of the data, under strict privacy constraints (smaller $ε$), the noise added to achieve privacy may disproportionately impact some groups over others. We propose novel measures of fairness in the context of randomized differentially private algorithms and identify a range of causes of outcome disparities. △ Less

Submitted 24 January, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: 12 pages, 4 figures

arXiv:1902.07756 [pdf, other]

Crypt$ε$: Crypto-Assisted Differential Privacy on Untrusted Servers

Authors: Amrita Roy Chowdhury, Chenghong Wang, Xi He, Ashwin Machanavajjhala, Somesh Jha

Abstract: Differential privacy (DP) has steadily become the de-facto standard for achieving privacy in data analysis, which is typically implemented either in the "central" or "local" model. The local model has been more popular for commercial deployments as it does not require a trusted data collector. This increased privacy, however, comes at a cost of utility and algorithmic expressibility as compared to… ▽ More Differential privacy (DP) has steadily become the de-facto standard for achieving privacy in data analysis, which is typically implemented either in the "central" or "local" model. The local model has been more popular for commercial deployments as it does not require a trusted data collector. This increased privacy, however, comes at a cost of utility and algorithmic expressibility as compared to the central model. In this work, we propose, Crypt$ε$, a system and programming framework that (1) achieves the accuracy guarantees and algorithmic expressibility of the central model (2) without any trusted data collector like in the local model. Crypt$ε$ achieves the "best of both worlds" by employing two non-colluding untrusted servers that run DP programs on encrypted data from the data owners. Although straightforward implementations of DP programs using secure computation tools can achieve the above goal theoretically, in practice they are beset with many challenges such as poor performance and tricky security proofs. To this end, Crypt$ε$ allows data analysts to author logical DP programs that are automatically translated to secure protocols that work on encrypted data. These protocols ensure that the untrusted servers learn nothing more than the noisy outputs, thereby guaranteeing DP (for computationally bounded adversaries) for all Crypt$ε$ programs. Crypt$ε$ supports a rich class of DP programs that can be expressed via a small set of transformation and measurement operators followed by arbitrary post-processing. Further, we propose performance optimizations leveraging the fact that the output is noisy. We demonstrate Crypt$ε$'s feasibility for practical DP analysis with extensive empirical evaluations on real datasets. △ Less

Submitted 10 March, 2020; v1 submitted 20 February, 2019; originally announced February 2019.

arXiv:1810.01816 [pdf, other]

Shrinkwrap: Differentially-Private Query Processing in Private Data Federations

Authors: Johes Bater, Xi He, William Ehrich, Ashwin Machanavajjhala, Jennie Rogers

Abstract: A private data federation is a set of autonomous databases that share a unified query interface offering in-situ evaluation of SQL queries over the union of the sensitive data of its members. Owing to privacy concerns, these systems do not have a trusted data collector that can see all their data and their member databases cannot learn about individual records of other engines. Federations current… ▽ More A private data federation is a set of autonomous databases that share a unified query interface offering in-situ evaluation of SQL queries over the union of the sensitive data of its members. Owing to privacy concerns, these systems do not have a trusted data collector that can see all their data and their member databases cannot learn about individual records of other engines. Federations currently achieve this goal by evaluating queries obliviously using secure multiparty computation. This hides the intermediate result cardinality of each query operator by exhaustively padding it. With cascades of such operators, this padding accumulates to a blow-up in the output size of each operator and a proportional loss in query performance. Hence, existing private data federations do not scale well to complex SQL queries over large datasets. We introduce Shrinkwrap, a private data federation that offers data owners a differentially private view of the data held by others to improve their performance over oblivious query processing. Shrinkwrap uses computational differential privacy to minimize the padding of intermediate query results, achieving up to 35X performance improvement over oblivious query processing. When the query needs differentially private output, Shrinkwrap provides a trade-off between result accuracy and query evaluation performance. △ Less

Submitted 3 October, 2018; originally announced October 2018.

arXiv:1808.03555 [pdf, other]

doi 10.1145/3183713.3196921

Ektelo: A Framework for Defining Differentially-Private Computations

Authors: Dan Zhang, Ryan McKenna, Ios Kotsogiannis, George Bissias, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau

Abstract: The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, Ektelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming… ▽ More The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, Ektelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming to one of a small number of operator classes. While past programming frameworks have helped to ensure the privacy of programs, the novelty of our framework is its significant support for authoring accurate and efficient (as well as private) programs. After describing the design and architecture of the Ektelo system, we show that Ektelo is expressive, allows for safer implementations through code reuse, and that it allows both privacy novices and experts to easily design algorithms. We demonstrate the use of Ektelo by designing several new state-of-the-art algorithms. △ Less

Submitted 24 May, 2019; v1 submitted 10 August, 2018; originally announced August 2018.

Comments: Journal version under submission

arXiv:1808.03537 [pdf, other]

doi 10.14778/3231751.3231769

Optimizing error of high-dimensional statistical queries under differential privacy

Authors: Ryan McKenna, Gerome Miklau, Michael Hay, Ashwin Machanavajjhala

Abstract: Differentially private algorithms for answering sets of predicate counting queries on a sensitive database have many applications. Organizations that collect individual-level data, such as statistical agencies and medical institutions, use them to safely release summary tabulations. However, existing techniques are accurate only on a narrow class of query workloads, or are extremely slow, especial… ▽ More Differentially private algorithms for answering sets of predicate counting queries on a sensitive database have many applications. Organizations that collect individual-level data, such as statistical agencies and medical institutions, use them to safely release summary tabulations. However, existing techniques are accurate only on a narrow class of query workloads, or are extremely slow, especially when analyzing more than one or two dimensions of the data. In this work we propose HDMM, a new differentially private algorithm for answering a workload of predicate counting queries, that is especially effective for higher-dimensional datasets. HDMM represents query workloads using an implicit matrix representation and exploits this compact representation to efficiently search (a subset of) the space of differentially private algorithms for one that answers the input query workload with high accuracy. We empirically show that HDMM can efficiently answer queries with lower error than state-of-the-art techniques on a variety of low and high dimensional datasets. △ Less

Submitted 10 August, 2018; originally announced August 2018.

Journal ref: PVLDB, 11 (10): 1206-1219, 2018

arXiv:1804.00370 [pdf, other]

Differentially Private Hierarchical Count-of-Counts Histograms

Authors: Yu-Hsuan Kuo, Cho-Chun Chiu, Daniel Kifer, Michael Hay, Ashwin Machanavajjhala

Abstract: We consider the problem of privately releasing a class of queries that we call hierarchical count-of-counts histograms. Count-of-counts histograms partition the rows of an input table into groups (e.g., group of people in the same household), and for every integer j report the number of groups of size j. Hierarchical count-of-counts queries report count-of-counts histograms at different granularit… ▽ More We consider the problem of privately releasing a class of queries that we call hierarchical count-of-counts histograms. Count-of-counts histograms partition the rows of an input table into groups (e.g., group of people in the same household), and for every integer j report the number of groups of size j. Hierarchical count-of-counts queries report count-of-counts histograms at different granularities as per hierarchy defined on an attribute in the input data (e.g., geographical location of a household at the national, state and county levels). In this paper, we introduce this problem, along with appropriate error metrics and propose a differentially private solution that generates count-of-counts histograms that are consistent across all levels of the hierarchy. △ Less

Submitted 13 September, 2018; v1 submitted 1 April, 2018; originally announced April 2018.

Comments: 13 pages

arXiv:1712.10266 [pdf, other]

APEx: Accuracy-Aware Differentially Private Data Exploration

Authors: Chang Ge, Xi He, Ihab F. Ilyas, Ashwin Machanavajjhala

Abstract: Organizations are increasingly interested in allowing external data scientists to explore their sensitive datasets. Due to the popularity of differential privacy, data owners want the data exploration to ensure provable privacy guarantees. However, current systems for answering queries with differential privacy place an inordinate burden on the data analysts to understand differential privacy, man… ▽ More Organizations are increasingly interested in allowing external data scientists to explore their sensitive datasets. Due to the popularity of differential privacy, data owners want the data exploration to ensure provable privacy guarantees. However, current systems for answering queries with differential privacy place an inordinate burden on the data analysts to understand differential privacy, manage their privacy budget, and even implement new algorithms for noisy query answering. Moreover, current systems do not provide any guarantees to the data analyst on the quality they care about, namely accuracy of query answers. We present APEx, a novel system that allows data analysts to pose adaptively chosen sequences of queries along with required accuracy bounds. By translating queries and accuracy bounds into differentially private algorithms with the least privacy loss, APEx returns query answers to the data analyst that meet the accuracy bounds, and proves to the data owner that the entire data exploration process is differentially private. Our comprehensive experimental study on real datasets demonstrates that APEx can answer a variety of queries accurately with moderate to small privacy loss, and can support data exploration for entity resolution with high accuracy under reasonable privacy settings. △ Less

Submitted 10 May, 2019; v1 submitted 29 December, 2017; originally announced December 2017.

Comments: Full version of the ACM SIGMOD 2019 paper

arXiv:1712.05888 [pdf, other]

One-sided Differential Privacy

Authors: Stelios Doudalis, Ios Kotsogiannis, Samuel Haney, Ashwin Machanavajjhala, Sharad Mehrotra

Abstract: In this paper, we study the problem of privacy-preserving data sharing, wherein only a subset of the records in a database are sensitive, possibly based on predefined privacy policies. Existing solutions, viz, differential privacy (DP), are over-pessimistic and treat all information as sensitive. Alternatively, techniques, like access control and personalized differential privacy, reveal all non-s… ▽ More In this paper, we study the problem of privacy-preserving data sharing, wherein only a subset of the records in a database are sensitive, possibly based on predefined privacy policies. Existing solutions, viz, differential privacy (DP), are over-pessimistic and treat all information as sensitive. Alternatively, techniques, like access control and personalized differential privacy, reveal all non-sensitive records truthfully, and they indirectly leak information about sensitive records through exclusion attacks. Motivated by the limitations of prior work, we introduce the notion of one-sided differential privacy (OSDP). We formalize the exclusion attack and we show how OSDP protects against it. OSDP offers differential privacy like guarantees, but only to the sensitive records. OSDP allows the truthful release of a subset of the non-sensitive records. The sample can be used to support applications that must output true data, and is well suited for publishing complex types of data, e.g. trajectories. Though some non-sensitive records are suppressed to avoid exclusion attacks, our experiments show that the suppression results in a small loss in utility in most cases. Additionally, we present a recipe for turning DP mechanisms for answering counting queries into OSDP techniques for the same task. Our OSDP algorithms leverage the presence of non-sensitive records and are able to offer up to a 25x improvement in accuracy over state-of-the-art DP-solutions. △ Less

Submitted 15 December, 2017; originally announced December 2017.

arXiv:1705.09561 [pdf, other]

Differentially private significance tests for regression coefficients

Authors: Andrés F. Barrientos, Jerome P. Reiter, Ashwin Machanavajjhala, Yan Chen

Abstract: Many data producers seek to provide users access to confidential data without unduly compromising data subjects' privacy and confidentiality. One general strategy is to require users to do analyses without seeing the confidential data; for example, analysts only get access to synthetic data or query systems that provide disclosure-protected outputs of statistical models. With synthetic data or red… ▽ More Many data producers seek to provide users access to confidential data without unduly compromising data subjects' privacy and confidentiality. One general strategy is to require users to do analyses without seeing the confidential data; for example, analysts only get access to synthetic data or query systems that provide disclosure-protected outputs of statistical models. With synthetic data or redacted outputs, the analyst never really knows how much to trust the resulting findings. In particular, if the user did the same analysis on the confidential data, would regression coefficients of interest be statistically significant or not? We present algorithms for assessing this question that satisfy differential privacy. We describe conditions under which the algorithms should give accurate answers about statistical significance. We illustrate the properties of the proposed methods using artificial and genuine data. △ Less

Submitted 11 June, 2018; v1 submitted 26 May, 2017; originally announced May 2017.

arXiv:1705.07872 [pdf, other]

Providing Access to Confidential Research Data Through Synthesis and Verification: An Application to Data on Employees of the U.S. Federal Government

Authors: Andrés F. Barrientos, Alexander Bolton, Tom Balmat, Jerome P. Reiter, John M. de Figueiredo, Ashwin Machanavajjhala, Yan Chen, Charley Kneifel, Mark DeLong

Abstract: Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. One approach suggested in the literature is that data stewards generate and release synthetic data, i.e., data simu… ▽ More Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. One approach suggested in the literature is that data stewards generate and release synthetic data, i.e., data simulated from statistical models, while also providing users access to a verification server that allows them to assess the quality of inferences from the synthetic data. We present an application of the synthetic data plus verification server approach to longitudinal data on employees of the U.S. federal government. As part of the application, we present a novel model for generating synthetic career trajectories, as well as strategies for generating high dimensional, longitudinal synthetic datasets. We also present novel verification algorithms for regression coefficients that satisfy differential privacy. We illustrate the integrated use of synthetic data plus verification via analysis of differentials in pay by race. The integrated system performs as intended, allowing users to explore the synthetic data for potential pay differentials and learn through verifications which findings in the synthetic data hold up and which do not. The analysis on the confidential data reveals pay differentials across races not documented in published studies. △ Less

Submitted 16 June, 2018; v1 submitted 22 May, 2017; originally announced May 2017.

arXiv:1702.00535 [pdf, other]

Composing Differential Privacy and Secure Computation: A case study on scaling private record linkage

Authors: Xi He, Ashwin Machanavajjhala, Cheryl Flynn, Divesh Srivastava

Abstract: Private record linkage (PRL) is the problem of identifying pairs of records that are similar as per an input matching rule from databases held by two parties that do not trust one another. We identify three key desiderata that a PRL solution must ensure: 1) perfect precision and high recall of matching pairs, 2) a proof of end-to-end privacy, and 3) communication and computational costs that scale… ▽ More Private record linkage (PRL) is the problem of identifying pairs of records that are similar as per an input matching rule from databases held by two parties that do not trust one another. We identify three key desiderata that a PRL solution must ensure: 1) perfect precision and high recall of matching pairs, 2) a proof of end-to-end privacy, and 3) communication and computational costs that scale subquadratically in the number of input records. We show that all of the existing solutions for PRL - including secure 2-party computation (S2PC), and their variants that use non-private or differentially private (DP) blocking to ensure subquadratic cost - violate at least one of the three desiderata. In particular, S2PC techniques guarantee end-to-end privacy but have either low recall or quadratic cost. In contrast, no end-to-end privacy guarantee has been formalized for solutions that achieve subquadratic cost. This is true even for solutions that compose DP and S2PC: DP does not permit the release of any exact information about the databases, while S2PC algorithms for PRL allow the release of matching records. In light of this deficiency, we propose a novel privacy model, called output constrained differential privacy, that shares the strong privacy protection of DP, but allows for the truthful release of the output of a certain function applied to the data. We apply this to PRL, and show that protocols satisfying this privacy model permit the disclosure of the true matching records, but their execution is insensitive to the presence or absence of a single non-matching record. We find that prior work that combine DP and S2PC techniques even fail to satisfy this end-to-end privacy model. Hence, we develop novel protocols that provably achieve this end-to-end privacy guarantee, together with the other two desiderata of PRL. △ Less

Submitted 1 September, 2017; v1 submitted 1 February, 2017; originally announced February 2017.

arXiv:1701.00752 [pdf]

Privacy-Preserving Data Analysis for the Federal Statistical Agencies

Authors: John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter

Abstract: Government statistical agencies collect enormously valuable data on the nation's population and business activities. Wide access to these data enables evidence-based policy making, supports new research that improves society, facilitates training for students in data science, and provides resources for the public to better understand and participate in their society. These data also affect the pri… ▽ More Government statistical agencies collect enormously valuable data on the nation's population and business activities. Wide access to these data enables evidence-based policy making, supports new research that improves society, facilitates training for students in data science, and provides resources for the public to better understand and participate in their society. These data also affect the private sector. For example, the Employment Situation in the United States, published by the Bureau of Labor Statistics, moves markets. Nonetheless, government agencies are under increasing pressure to limit access to data because of a growing understanding of the threats to data privacy and confidentiality. "De-identification" - strip** obvious identifiers like names, addresses, and identification numbers - has been found inadequate in the face of modern computational and informational resources. Unfortunately, the problem extends even to the release of aggregate data statistics. This counter-intuitive phenomenon has come to be known as the Fundamental Law of Information Recovery. It says that overly accurate estimates of too many statistics can completely destroy privacy. One may think of this as death by a thousand cuts. Every statistic computed from a data set leaks a small amount of information about each member of the data set - a tiny cut. This is true even if the exact value of the statistic is distorted a bit in order to preserve privacy. But while each statistical release is an almost harmless little cut in terms of privacy risk for any individual, the cumulative effect can be to completely compromise the privacy of some individuals. △ Less

Submitted 3 January, 2017; originally announced January 2017.

Comments: A Computing Community Consortium (CCC) white paper, 7 pages

arXiv:1512.04817 [pdf, other]

Principled Evaluation of Differentially Private Algorithms using DPBench

Authors: Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, Dan Zhang

Abstract: Differential privacy has become the dominant standard in the research community for strong privacy protection. There has been a flood of research into query answering algorithms that meet this standard. Algorithms are becoming increasingly complex, and in particular, the performance of many emerging algorithms is {\em data dependent}, meaning the distribution of the noise added to query answers ma… ▽ More Differential privacy has become the dominant standard in the research community for strong privacy protection. There has been a flood of research into query answering algorithms that meet this standard. Algorithms are becoming increasingly complex, and in particular, the performance of many emerging algorithms is {\em data dependent}, meaning the distribution of the noise added to query answers may change depending on the input data. Theoretical analysis typically only considers the worst case, making empirical study of average case performance increasingly important. In this paper we propose a set of evaluation principles which we argue are essential for sound evaluation. Based on these principles we propose DPBench, a novel evaluation framework for standardized evaluation of privacy algorithms. We then apply our benchmark to evaluate algorithms for answering 1- and 2-dimensional range queries. The result is a thorough empirical study of 15 published algorithms on a total of 27 datasets that offers new insights into algorithm behavior---in particular the influence of dataset scale and shape---and a more complete characterization of the state of the art. Our methodology is able to resolve inconsistencies in prior empirical studies and place algorithm performance in context through comparison to simple baselines. Finally, we pose open research questions which we hope will guide future algorithm design. △ Less

Submitted 15 December, 2015; originally announced December 2015.

arXiv:1508.07306 [pdf, ps, other]

On the Privacy Properties of Variants on the Sparse Vector Technique

Authors: Yan Chen, Ashwin Machanavajjhala

Abstract: The sparse vector technique is a powerful differentially private primitive that allows an analyst to check whether queries in a stream are greater or lesser than a threshold. This technique has a unique property -- the algorithm works by adding noise with a finite variance to the queries and the threshold, and guarantees privacy that only degrades with (a) the maximum sensitivity of any one query… ▽ More The sparse vector technique is a powerful differentially private primitive that allows an analyst to check whether queries in a stream are greater or lesser than a threshold. This technique has a unique property -- the algorithm works by adding noise with a finite variance to the queries and the threshold, and guarantees privacy that only degrades with (a) the maximum sensitivity of any one query in stream, and (b) the number of positive answers output by the algorithm. Recent work has developed variants of this algorithm, which we call {\em generalized private threshold testing}, and are claimed to have privacy guarantees that do not depend on the number of positive or negative answers output by the algorithm. These algorithms result in a significant improvement in utility over the sparse vector technique for a given privacy budget, and have found applications in frequent itemset mining, feature selection in machine learning and generating synthetic data. In this paper we critically analyze the privacy properties of generalized private threshold testing. We show that generalized private threshold testing does not satisfy ε-differential privacy for any finite ε. We identify a subtle error in the privacy analysis of this technique in prior work. Moreover, we show an adversary can use generalized private threshold testing to recover counts from the datasets (especially small counts) exactly with high accuracy, and thus can result in individuals being reidentified. We demonstrate our attacks empirically on real datasets. △ Less

Submitted 28 August, 2015; originally announced August 2015.

Comments: 8 pages

arXiv:1411.5428 [pdf, ps, other]

Differentially Private Algorithms for Empirical Machine Learning

Authors: Ben Stoddard, Yan Chen, Ashwin Machanavajjhala

Abstract: An important use of private data is to build machine learning classifiers. While there is a burgeoning literature on differentially private classification algorithms, we find that they are not practical in real applications due to two reasons. First, existing differentially private classifiers provide poor accuracy on real world datasets. Second, there is no known differentially private algorithm… ▽ More An important use of private data is to build machine learning classifiers. While there is a burgeoning literature on differentially private classification algorithms, we find that they are not practical in real applications due to two reasons. First, existing differentially private classifiers provide poor accuracy on real world datasets. Second, there is no known differentially private algorithm for empirically evaluating the private classifier on a private test dataset. In this paper, we develop differentially private algorithms that mirror real world empirical machine learning workflows. We consider the private classifier training algorithm as a blackbox. We present private algorithms for selecting features that are input to the classifier. Though adding a preprocessing step takes away some of the privacy budget from the actual classification process (thus potentially making it noisier and less accurate), we show that our novel preprocessing techniques significantly increase classifier accuracy on three real-world datasets. We also present the first private algorithms for empirically constructing receiver operating characteristic (ROC) curves on a private test set. △ Less

Submitted 21 November, 2014; v1 submitted 19 November, 2014; originally announced November 2014.

arXiv:1404.3722 [pdf, other]

Design of Policy-Aware Differentially Private Algorithms

Authors: Samuel Haney, Ashwin Machanavajjhala, Bolin Ding

Abstract: The problem of designing error optimal differentially private algorithms is well studied. Recent work applying differential privacy to real world settings have used variants of differential privacy that appropriately modify the notion of neighboring databases. The problem of designing error optimal algorithms for such variants of differential privacy is open. In this paper, we show a novel transfo… ▽ More The problem of designing error optimal differentially private algorithms is well studied. Recent work applying differential privacy to real world settings have used variants of differential privacy that appropriately modify the notion of neighboring databases. The problem of designing error optimal algorithms for such variants of differential privacy is open. In this paper, we show a novel transformational equivalence result that can turn the problem of query answering under differential privacy with a modified notion of neighbors to one of query answering under standard differential privacy, for a large class of neighbor definitions. We utilize the Blowfish privacy framework that generalizes differential privacy. Blowfish uses a {\em policy graph} to instantiate different notions of neighboring databases. We show that the error incurred when answering a workload $\mathbf{W}$ on a database $\mathbf{x}$ under a Blowfish policy graph $G$ is identical to the error required to answer a transformed workload $f_G(\mathbf{W})$ on database $g_G(\mathbf{x})$ under standard differential privacy, where $f_G$ and $g_G$ are linear transformations based on $G$. Using this result, we develop error efficient algorithms for releasing histograms and multidimensional range queries under different Blowfish policies. We believe the tools we develop will be useful for finding mechanisms to answer many other classes of queries with low error under other policy graphs. △ Less

Submitted 20 November, 2015; v1 submitted 14 April, 2014; originally announced April 2014.

arXiv:1312.3913 [pdf, other]

doi 10.1145/2588555.2588581

Blowfish Privacy: Tuning Privacy-Utility Trade-offs using Policies

Authors: Xi He, Ashwin Machanavajjhala, Bolin Ding

Abstract: Privacy definitions provide ways for trading-off the privacy of individuals in a statistical database for the utility of downstream analysis of the data. In this paper, we present Blowfish, a class of privacy definitions inspired by the Pufferfish framework, that provides a rich interface for this trade-off. In particular, we allow data publishers to extend differential privacy using a policy, whi… ▽ More Privacy definitions provide ways for trading-off the privacy of individuals in a statistical database for the utility of downstream analysis of the data. In this paper, we present Blowfish, a class of privacy definitions inspired by the Pufferfish framework, that provides a rich interface for this trade-off. In particular, we allow data publishers to extend differential privacy using a policy, which specifies (a) secrets, or information that must be kept secret, and (b) constraints that may be known about the data. While the secret specification allows increased utility by lessening protection for certain individual properties, the constraint specification provides added protection against an adversary who knows correlations in the data (arising from constraints). We formalize policies and present novel algorithms that can handle general specifications of sensitive information and certain count constraints. We show that there are reasonable policies under which our privacy mechanisms for k-means clustering, histograms and range queries introduce significantly lesser noise than their differentially private counterparts. We quantify the privacy-utility trade-offs for various policies analytically and empirically on real datasets. △ Less

Submitted 23 June, 2014; v1 submitted 13 December, 2013; originally announced December 2013.

Comments: Full version of the paper at SIGMOD'14 Snowbird, Utah USA

arXiv:1302.6556 [pdf, other]

On Sharing Private Data with Multiple Non-Colluding Adversaries

Authors: Theodoros Rekatsinas, Amol Deshpande, Ashwin Machanavajjhala

Abstract: We present SPARSI, a theoretical framework for partitioning sensitive data across multiple non-colluding adversaries. Most work in privacy-aware data sharing has considered disclosing summaries where the aggregate information about the data is preserved, but sensitive user information is protected. Nonetheless, there are applications, including online advertising, cloud computing and crowdsourcing… ▽ More We present SPARSI, a theoretical framework for partitioning sensitive data across multiple non-colluding adversaries. Most work in privacy-aware data sharing has considered disclosing summaries where the aggregate information about the data is preserved, but sensitive user information is protected. Nonetheless, there are applications, including online advertising, cloud computing and crowdsourcing markets, where detailed and fine-grained user-data must be disclosed. We consider a new data sharing paradigm and introduce the problem of privacy-aware data partitioning, where a sensitive dataset must be partitioned among k untrusted parties (adversaries). The goal is to maximize the utility derived by partitioning and distributing the dataset, while minimizing the amount of sensitive information disclosed. The data should be distributed so that an adversary, without colluding with other adversaries, cannot draw additional inferences about the private information, by linking together multiple pieces of information released to her. The assumption of no collusion is both reasonable and necessary in the above application domains that require release of private user information. SPARSI enables us to formally define privacy-aware data partitioning using the notion of sensitive properties for modeling private information and a hypergraph representation for describing the interdependencies between data entries and private information. We show that solving privacy-aware partitioning is, in general, NP-hard, but for specific information disclosure functions, good approximate solutions can be found using relaxation techniques. Finally, we present a local search algorithm applicable to generic information disclosure functions. We apply SPARSI together with the proposed algorithms on data from a real advertising scenario and show that we can partition data with no disclosure to any single advertiser. △ Less

Submitted 11 March, 2013; v1 submitted 26 February, 2013; originally announced February 2013.

Comments: 14 pages, 6 figures, 2 tables

arXiv:1205.0435 [pdf, other]

Scalable Social Coordination using Enmeshed Queries

Authors: Jianjun Chen, Ashwin Machanavajjhala, George Varghese

Abstract: Social coordination allows users to move beyond awareness of their friends to efficiently coordinating physical activities with others. While specific forms of social coordination can be seen in tools such as Evite, Meetup and Groupon, we introduce a more general model using what we call enmeshed queries. An enmeshed query allows users to declaratively specify an intent to coordinate by specifying… ▽ More Social coordination allows users to move beyond awareness of their friends to efficiently coordinating physical activities with others. While specific forms of social coordination can be seen in tools such as Evite, Meetup and Groupon, we introduce a more general model using what we call enmeshed queries. An enmeshed query allows users to declaratively specify an intent to coordinate by specifying social attributes such as the desired group size and who/what/when, and the database returns matching queries. Enmeshed queries are continuous, but new queries (and not data) answer older queries; the variable group size also makes enmeshed queries different from entangled queries, publish-subscribe systems, and dating services. We show that even offline group coordination using enmeshed queries is NP-hard. We then introduce efficient heuristics that use selective indices such as location and time to reduce the space of possible matches; we also add refinements such as delayed evaluation and using the relative matchability of users to determine search order. We describe a centralized implementation and evaluate its performance against an optimal algorithm. We show that the combination of not stop** prematurely (after finding a match) and delayed evaluation results in an algorithm that finds 86% of the matches found by an optimal algorithm, and takes an average of 40 usec per query using 1 core of a 2.5 Ghz server machine. Further, the algorithm has good latency, is reasonably fair to large group size requests, and can be scaled to global workloads using multiple cores and multiple servers. We conclude by describing potential generalizations that add prices, recommendations, and data mining to basic enmeshed queries. △ Less

Submitted 17 August, 2012; v1 submitted 2 May, 2012; originally announced May 2012.

Comments: 11 pages, 9 figures

arXiv:1203.6406 [pdf, other]

An Analysis of Structured Data on the Web

Authors: Nilesh Dalvi, Ashwin Machanavajjhala, Bo Pang

Abstract: In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, or the problem of creating structured tables using extraction from the entire web, is gathering lots of research interest. We perform a study to understand and quantify the value of Web-scale extraction, and how structured information is distributed amongst top aggregator websites… ▽ More In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, or the problem of creating structured tables using extraction from the entire web, is gathering lots of research interest. We perform a study to understand and quantify the value of Web-scale extraction, and how structured information is distributed amongst top aggregator websites and tail sites for various interesting domains. We believe this is the first study of its kind, and gives us new insights for information extraction over the Web. △ Less

Submitted 28 March, 2012; originally announced March 2012.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 7, pp. 680-691 (2012)

arXiv:1203.5387 [pdf, ps, other]

Finding Connected Components on Map-reduce in Logarithmic Rounds

Authors: Vibhor Rastogi, Ashwin Machanavajjhala, Laukik Chitnis, Anish Das Sarma

Abstract: Given a large graph G = (V,E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting d the diameter of the graph, and n the number of nodes in the largest component, all prior map-reduce techn… ▽ More Given a large graph G = (V,E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting d the diameter of the graph, and n the number of nodes in the largest component, all prior map-reduce techniques either require d rounds, or require about n|V| + |E| communication per round. We propose two randomized map-reduce algorithms -- (i) Hash-Greater-To-Min, which provably requires at most 3log(n) rounds with high probability, and at most 2(|V| + |E|) communication per round, and (ii) Hash-to-Min, which has a worse theoretical complexity, but in practice completes in at most 2log(d) rounds and 3(|V| + |E|) communication per rounds. Our techniques for connected components can be applied to clustering as well. We propose a novel algorithm for agglomerative single linkage clustering in map-reduce. This is the first algorithm that can provably compute a clustering in at most O(log(n)) rounds, where n is the size of the largest cluster. We show the effectiveness of all our algorithms through detailed experiments on large synthetic as well as real-world datasets. △ Less

Submitted 12 November, 2012; v1 submitted 24 March, 2012; originally announced March 2012.

arXiv:1111.3689 [pdf, other]

CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks

Authors: Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, Philip Bohannon

Abstract: De-duplication---identification of distinct records referring to the same real-world entity---is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, {\em blocking} has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off {\em recall} of identified duplicates for {\em efficiency}. Tra… ▽ More De-duplication---identification of distinct records referring to the same real-world entity---is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, {\em blocking} has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off {\em recall} of identified duplicates for {\em efficiency}. Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records. However, with the presence of large, diverse sets of structured data on the web and the need to organize it effectively on content portals, de-duplication systems need to scale in a new dimension to handle a large number of schemas, tasks and data sets, while handling ever larger problem sizes. In addition, when working in a map-reduce framework it is important that canopy formation be implemented as a {\em hash function}, making the canopy design problem more challenging. We present CBLOCK, a system that addresses these challenges. CBLOCK learns hash functions automatically from attribute domains and a labeled dataset consisting of duplicates. Subsequently, CBLOCK expresses blocking functions using a hierarchical tree structure composed of atomic hash functions. The application may guide the automated blocking process based on architectural constraints, such as by specifying a maximum size of each block (based on memory requirements), impose disjointness of blocks (in a grid environment), or specify a particular objective function trading off recall for efficiency. As a post-processing step to automatically generated blocks, CBLOCK {\em rolls-up} smaller blocks to increase recall. We present experimental results on two large-scale de-duplication datasets at Yahoo!---consisting of over 140K movies and 40K restaurants respectively---and demonstrate the utility of CBLOCK. △ Less

Submitted 15 November, 2011; originally announced November 2011.

arXiv:1105.4254 [pdf]

Personalized Social Recommendations - Accurate or Private?

Authors: Ashwin Machanavajjhala, Aleksandra Korolova, Atish Das Sarma

Abstract: With the recent surge of social networks like Facebook, new forms of recommendations have become possible - personalized recommendations of ads, content, and even new friend and product connections based on one's social interactions. Since recommendations may use sensitive social information, it is speculated that these recommendations are associated with privacy risks. The main contribution of th… ▽ More With the recent surge of social networks like Facebook, new forms of recommendations have become possible - personalized recommendations of ads, content, and even new friend and product connections based on one's social interactions. Since recommendations may use sensitive social information, it is speculated that these recommendations are associated with privacy risks. The main contribution of this work is in formalizing these expected trade-offs between the accuracy and privacy of personalized social recommendations. In this paper, we study whether "social recommendations", or recommendations that are solely based on a user's social network, can be made without disclosing sensitive links in the social graph. More precisely, we quantify the loss in utility when existing recommendation algorithms are modified to satisfy a strong notion of privacy, called differential privacy. We prove lower bounds on the minimum loss in utility for any recommendation algorithm that is differentially private. We adapt two privacy preserving algorithms from the differential privacy literature to the problem of social recommendations, and analyze their performance in comparison to the lower bounds, both analytically and experimentally. We show that good private social recommendations are feasible only for a small subset of the users in the social network or for a lenient setting of privacy parameters. △ Less

Submitted 21 May, 2011; originally announced May 2011.

Comments: VLDB2011

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 7, pp. 440-450 (2011)

arXiv:1004.5600 [pdf, other]

On the (Im)possibility of Preserving Utility and Privacy in Personalized Social Recommendations

Authors: Ashwin Machanavajjhala, Aleksandra Korolova, Atish Das Sarma

Abstract: With the recent surge of social networks like Facebook, new forms of recommendations have become possible -- personalized recommendations of ads, content, and even new social and product connections based on one's social interactions. In this paper, we study whether "social recommendations", or recommendations that utilize a user's social network, can be made without disclosing sensitive links b… ▽ More With the recent surge of social networks like Facebook, new forms of recommendations have become possible -- personalized recommendations of ads, content, and even new social and product connections based on one's social interactions. In this paper, we study whether "social recommendations", or recommendations that utilize a user's social network, can be made without disclosing sensitive links between users. More precisely, we quantify the loss in utility when existing recommendation algorithms are modified to satisfy a strong notion of privacy called differential privacy. We propose lower bounds on the minimum loss in utility for any recommendation algorithm that is differentially private. We also propose two recommendation algorithms that satisfy differential privacy, analyze their performance in comparison to the lower bound, both analytically and experimentally, and show that good private social recommendations are feasible only for a few users in the social network or for a lenient setting of privacy parameters. △ Less

Submitted 30 April, 2010; originally announced April 2010.

arXiv:0904.0682 [pdf, ps, other]

Privacy in Search Logs

Authors: Michaela Goetz, Ashwin Machanavajjhala, Guozhang Wang, Xiaokui Xiao, Johannes Gehrke

Abstract: Search engine companies collect the "database of intentions", the histories of their users' search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information. In this paper we analyze algorithms for publishing frequent keywords, queries and clicks of a search log. We first show how… ▽ More Search engine companies collect the "database of intentions", the histories of their users' search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information. In this paper we analyze algorithms for publishing frequent keywords, queries and clicks of a search log. We first show how methods that achieve variants of $k$-anonymity are vulnerable to active attacks. We then demonstrate that the stronger guarantee ensured by $ε$-differential privacy unfortunately does not provide any utility for this problem. We then propose an algorithm ZEALOUS and show how to set its parameters to achieve $(ε,δ)$-probabilistic privacy. We also contrast our analysis of ZEALOUS with an analysis by Korolova et al. [17] that achieves $(ε',δ')$-indistinguishability. Our paper concludes with a large experimental study using real applications where we compare ZEALOUS and previous work that achieves $k$-anonymity in search log publishing. Our results show that ZEALOUS yields comparable utility to $k-$anonymity while at the same time achieving much stronger privacy guarantees. △ Less

Submitted 11 May, 2011; v1 submitted 4 April, 2009; originally announced April 2009.

arXiv:0705.2787 [pdf, ps, other]

Worst-Case Background Knowledge for Privacy-Preserving Data Publishing

Authors: David J. Martin, Daniel Kifer, Ashwin Machanavajjhala, Johannes Gehrke, Joseph Y. Halpern

Abstract: Recent work has shown the necessity of considering an attacker's background knowledge when reasoning about privacy in data publishing. However, in practice, the data publisher does not know what background knowledge the attacker possesses. Thus, it is important to consider the worst-case. In this paper, we initiate a formal study of worst-case background knowledge. We propose a language that can… ▽ More Recent work has shown the necessity of considering an attacker's background knowledge when reasoning about privacy in data publishing. However, in practice, the data publisher does not know what background knowledge the attacker possesses. Thus, it is important to consider the worst-case. In this paper, we initiate a formal study of worst-case background knowledge. We propose a language that can express any background knowledge about the data. We provide a polynomial time algorithm to measure the amount of disclosure of sensitive information in the worst case, given that the attacker has at most a specified number of pieces of information in this language. We also provide a method to efficiently sanitize the data so that the amount of disclosure in the worst case is less than a specified threshold. △ Less

Submitted 18 May, 2007; originally announced May 2007.

Comments: 10 pages

Showing 1–48 of 48 results for author: Machanavajjhala, A