Search | arXiv e-print repository

Conciliating Privacy and Utility in Data Releases via Individual Differential Privacy and Microaggregation

Authors: Jordi Soria-Comas, David Sánchez, Josep Domingo-Ferrer, Sergio Martínez, Luis Del Vasto-Terrientes

Abstract: $ε$-Differential privacy (DP) is a well-known privacy model that offers strong privacy guarantees. However, when applied to data releases, DP significantly deteriorates the analytical utility of the protected outcomes. To keep data utility at reasonable levels, practical applications of DP to data releases have used weak privacy parameters (large $ε$), which dilute the privacy guarantees of DP. In this work, we tackle this issue by using an alternative formulation of the DP privacy guarantees, named $ε$-individual differential privacy (iDP), which causes less data distortion while providing the same protection as DP to subjects. We enforce iDP in data releases by relying on attribute masking plus a pre-processing step based on data microaggregation. The goal of this step is to reduce the sensitivity to record changes, which determines the amount of noise required to enforce iDP (and DP). Specifically, we propose data microaggregation strategies designed for iDP whose sensitivities are significantly lower than those used in DP. As a result, we obtain iDP-protected data with significantly better utility than with DP. We report on experiments that show how our approach can provide strong privacy (small $ε$) while yielding protected data that do not significantly degrade the accuracy of secondary data analysis.

Submitted 21 December, 2023; originally announced December 2023.

Comments: 17 pages, 6 figures

arXiv:2010.10881 [pdf, other]

Multi-Dimensional Randomized Response

Authors: Josep Domingo-Ferrer, Jordi Soria-Comas

Abstract: In our data world, a host of not necessarily trusted controllers gather data on individual subjects. To preserve her privacy and, more generally, her informational self-determination, the individual has to be empowered by giving her agency on her own data. Maximum agency is afforded by local anonymization, which allows each individual to anonymize her own data before handing them to the data controller. Randomized response (RR) is a local anonymization approach able to yield multi-dimensional full sets of anonymized microdata that are valid for exploratory analysis and machine learning. This is so because an unbiased estimate of the distribution of the true data of individuals can be obtained from their pooled randomized data. Furthermore, RR offers rigorous privacy guarantees. The main weakness of RR is the curse of dimensionality when applied to several attributes: as the number of attributes grows, the accuracy of the estimated true data distribution quickly degrades. We propose several complementary approaches to mitigate the dimensionality problem. First, we present two basic protocols, separate RR on each attribute and joint RR for all attributes, and discuss their limitations. Then we introduce an algorithm to form clusters of attributes so that attributes in different clusters can be viewed as independent and joint RR can be performed within each cluster. After that, we introduce an adjustment algorithm for the randomized data set that repairs some of the accuracy loss due to assuming independence between attributes when using RR separately on each attribute or due to assuming independence between clusters in cluster-wise RR. We also present empirical work to illustrate the proposed methods.

Submitted 19 December, 2020; v1 submitted 21 October, 2020; originally announced October 2020.

Comments: IEEE Transactions on Knowledge and Data Engineering, to appear. (First version submitted on May 8, 2019 as TKDE-2019-05-0430; first revision submitted on July 13, 2020 as TKDE-2019-05-0430.R1; second revision submitted on Nov. 5, 2020 as TKDE-2019-05-0430.R2 and accepted without changes on Dec. 16, 2020)
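
The abstract of this entry says that an unbiased estimate of the true data distribution can be obtained from pooled randomized responses. As a rough illustration only, here is a minimal single-attribute sketch in Python. It is not the paper's protocol: the function names, the uniform off-diagonal design matrix, and the parameter `p_truth` are all assumptions made for this example, and it requires NumPy.

```python
import numpy as np


def rr_matrix(num_categories, p_truth):
    """Hypothetical RR design matrix: P[u, v] = Pr(report v | true value u).

    The true category is reported with probability p_truth; otherwise one of
    the other categories is reported uniformly at random (an assumed design).
    """
    off_diagonal = (1.0 - p_truth) / (num_categories - 1)
    P = np.full((num_categories, num_categories), off_diagonal)
    np.fill_diagonal(P, p_truth)
    return P


def randomize(true_values, P, rng):
    """Each individual randomizes her own value using row P[true value]."""
    num_categories = P.shape[0]
    return np.array([rng.choice(num_categories, p=P[v]) for v in true_values])


def estimate_distribution(reported, P):
    """Estimate the true distribution from the pooled reported values.

    The expected reported frequencies are lam = P^T pi, so solving
    P^T pi = lam for pi gives an unbiased estimate (it may fall slightly
    outside [0, 1] for small samples).
    """
    num_categories = P.shape[0]
    lam = np.bincount(reported, minlength=num_categories) / len(reported)
    return np.linalg.solve(P.T, lam)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_distribution = np.array([0.5, 0.3, 0.2])  # made-up example data
    P = rr_matrix(num_categories=3, p_truth=0.7)
    true_values = rng.choice(3, size=20000, p=true_distribution)
    reported = randomize(true_values, P, rng)
    print(estimate_distribution(reported, P))
```

With several attributes, applying this jointly needs one category per combination of attribute values, which is one way to see the dimensionality problem the abstract mentions.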

arXiv:1803.02139 [pdf, ps, other]

Connecting Randomized Response, Post-Randomization, Differential Privacy and t-Closeness via Deniability and Permutation

Authors: Josep Domingo-Ferrer, Jordi Soria-Comas

Abstract: We explore some novel connections between the main privacy models in use and we recall a few known ones. We show these models to be more related than commonly understood, around two main principles: deniability and permutation. In particular, randomized response turns out to be very modern in spite of it having been introduced over 50 years ago: it is a local anonymization method and it allows understanding the protection offered by $ε$-differential privacy when $ε$ is increased to improve utility. A similar understanding of the effect of large $ε$ in terms of deniability is obtained from the connection between $ε$-differential privacy and t-closeness. Finally, the post-randomization method (PRAM) is shown to be viewable as permutation and to be connected with randomized response and differential privacy. Since the latter is also connected with t-closeness, it follows that the permutation principle can explain the guarantees offered by all those models. Thus, calibrating permutation is very relevant in anonymization, and we conclude by sketching two ways of doing it.

Submitted 6 March, 2018; originally announced March 2018.

Comments: Submitted manuscript

MSC Class: 68P99 ACM Class: H.2.7; K.4.1

arXiv:1612.02298 [pdf, other]

doi 10.1109/TIFS.2017.2663337

Individual Differential Privacy: A Utility-Preserving Formulation of Differential Privacy Guarantees

Authors: Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez, David Megías

Abstract: Differential privacy is a popular privacy model within the research community because of the strong privacy guarantee it offers, namely that the presence or absence of any individual in a data set does not significantly influence the results of analyses on the data set. However, enforcing this strict guarantee in practice significantly distorts data and/or limits data uses, thus diminishing the analytical utility of the differentially private results. In an attempt to address this shortcoming, several relaxations of differential privacy have been proposed that trade off privacy guarantees for improved data utility. In this work, we argue that the standard formalization of differential privacy is stricter than required by the intuitive privacy guarantee it seeks. In particular, the standard formalization requires indistinguishability of results between any pair of neighbor data sets, while indistinguishability between the actual data set and its neighbor data sets should be enough. This limits the data controller's ability to adjust the level of protection to the actual data, hence resulting in significant accuracy loss. In this respect, we propose individual differential privacy, an alternative differential privacy notion that offers the same privacy guarantees as standard differential privacy to individuals (even though not to groups of individuals). This new notion allows the data controller to adjust the distortion to the actual data set, which results in less distortion and more analytical accuracy. We propose several mechanisms to attain individual differential privacy and we compare the new notion against standard differential privacy in terms of the accuracy of the analytical results.

Submitted 8 February, 2017; v1 submitted 7 December, 2016; originally announced December 2016.

arXiv:1512.05110 [pdf, other]

doi 10.1016/j.knosys.2014.11.011

From t-closeness to differential privacy and vice versa in data anonymization

Authors: J. Domingo-Ferrer, J. Soria-Comas

Abstract: k-Anonymity and ε-differential privacy are two mainstream privacy models, the former introduced to anonymize data sets and the latter to limit the knowledge gain that results from including one individual in the data set. Whereas basic k-anonymity only protects against identity disclosure, t-closeness was presented as an extension of k-anonymity that also protects against attribute disclosure. We show here that, if not quite equivalent, t-closeness and ε-differential privacy are strongly related to one another when it comes to anonymizing data sets. Specifically, k-anonymity for the quasi-identifiers combined with ε-differential privacy for the confidential attributes yields stochastic t-closeness (an extension of t-closeness), with t a function of k and ε. Conversely, t-closeness can yield ε-differential privacy when t = exp(ε/2) and the assumptions made by t-closeness about the prior and posterior views of the data hold.

Submitted 21 December, 2015; v1 submitted 16 December, 2015; originally announced December 2015.

Journal ref: Knowledge-Based Systems, Vol. 74, pp. 151-158, 2015

arXiv:1512.02909 [pdf, other]

doi 10.1109/TKDE.2015.2435777

t-Closeness through Microaggregation: Strict Privacy with Enhanced Utility Preservation

Authors: Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez, Sergio Martínez

Abstract: Microaggregation is a technique for disclosure limitation aimed at protecting the privacy of data subjects in microdata releases. It has been used as an alternative to generalization and suppression to generate $k$-anonymous data sets, where the identity of each subject is hidden within a group of $k$ subjects. Unlike generalization, microaggregation perturbs the data and this additional masking freedom allows improving data utility in several ways, such as increasing data granularity, reducing the impact of outliers and avoiding discretization of numerical data. $k$-Anonymity, on the other hand, does not protect against attribute disclosure, which occurs if the variability of the confidential values in a group of $k$ subjects is too small. To address this issue, several refinements of $k$-anonymity have been proposed, among which $t$-closeness stands out as providing one of the strictest privacy guarantees. Existing algorithms to generate $t$-close data sets are based on generalization and suppression (they are extensions of $k$-anonymization algorithms based on the same principles). This paper proposes and shows how to use microaggregation to generate $k$-anonymous $t$-close data sets. The advantages of microaggregation are analyzed, and then several microaggregation algorithms for $k$-anonymous $t$-closeness are presented and empirically evaluated.

Submitted 9 December, 2015; originally announced December 2015.

Journal ref: IEEE Transactions on Knowledge & Data Engineering 27(11): 3098-3110 (2015)

arXiv:1512.02897 [pdf, other]

doi 10.1016/j.inffus.2015.11.002

Utility-Preserving Differentially Private Data Releases Via Individual Ranking Microaggregation

Authors: David Sánchez, Josep Domingo-Ferrer, Sergio Martínez, Jordi Soria-Comas

Abstract: Being able to release and exploit open data gathered in information systems is crucial for researchers, enterprises and the overall society. Yet, these data must be anonymized before release to protect the privacy of the subjects to whom the records relate. Differential privacy is a privacy model for anonymization that offers more robust privacy guarantees than previous models, such as $k$-anonymity and its extensions. However, it is often disregarded that the utility of differentially private outputs is quite limited, either because of the amount of noise that needs to be added to obtain them or because utility is only preserved for a restricted type and/or a limited number of queries. On the contrary, $k$-anonymity-like data releases make no assumptions on the uses of the protected data and, thus, do not restrict the number and type of doable analyses. Recently, some authors have proposed mechanisms to offer general-purpose differentially private data releases. This paper extends such works with a specific focus on the preservation of the utility of the protected data. Our proposal builds on microaggregation-based anonymization, which is more flexible and utility-preserving than alternative anonymization methods used in the literature, in order to reduce the amount of noise needed to satisfy differential privacy. In this way, we improve the utility of differentially private data releases. Moreover, the noise reduction we achieve does not depend on the size of the data set, but just on the number of attributes to be protected, which is a more desirable behavior for large data sets. The utility benefits brought by our proposal are empirically evaluated and compared with related works for several data sets and metrics.

Submitted 16 December, 2015; v1 submitted 9 December, 2015; originally announced December 2015.

Journal ref: Information Fusion 30:1-14 (2016)
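
The abstract of this entry describes microaggregating the data first so that less noise is needed to satisfy differential privacy. The following Python sketch illustrates that idea for one numerical attribute; it is a hedged illustration, not the authors' algorithm. It assumes values bounded in `[lo, hi]`, a fixed number of records, groups of `k` consecutive values in sorted order, and an L1 sensitivity of `(hi - lo) / k` for the vector of group means when one record changes. The function names are hypothetical and NumPy is required.

```python
import numpy as np


def microaggregate_by_rank(values, k):
    """Group the sorted values into runs of k and return the group means.

    Returns (order, group_of_rank, centroids). The last group absorbs the
    remainder when len(values) is not a multiple of k.
    """
    values = np.asarray(values, dtype=float)
    n = len(values)
    order = np.argsort(values)
    num_groups = max(n // k, 1)
    group_of_rank = np.minimum(np.arange(n) // k, num_groups - 1)
    sorted_values = values[order]
    centroids = np.array(
        [sorted_values[group_of_rank == g].mean() for g in range(num_groups)]
    )
    return order, group_of_rank, centroids


def noisy_microaggregated_release(values, k, epsilon, lo, hi, rng):
    """Replace each value by its group mean plus Laplace noise.

    Assumption for this sketch: the centroid vector has L1 sensitivity
    (hi - lo) / k, so the Laplace scale is (hi - lo) / (k * epsilon)
    instead of (hi - lo) / epsilon for raw values.
    """
    values = np.clip(np.asarray(values, dtype=float), lo, hi)
    order, group_of_rank, centroids = microaggregate_by_rank(values, k)
    scale = (hi - lo) / (k * epsilon)
    noisy_centroids = centroids + rng.laplace(0.0, scale, size=centroids.shape)
    released = np.empty_like(values)
    released[order] = noisy_centroids[group_of_rank]  # back to input order
    return released


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ages = [23, 35, 41, 29, 52, 47, 38, 31, 60, 26]  # made-up example data
    print(noisy_microaggregated_release(ages, k=3, epsilon=1.0, lo=18, hi=90, rng=rng))
```

Under these assumptions the noise scale shrinks by a factor of `k` relative to perturbing each raw value, at the cost of the information lost by averaging within groups.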

arXiv:1503.02563 [pdf]

doi 10.1109/IEOM.2015.7093833

Co-Utility: Self-Enforcing Protocols without Coordination Mechanisms

Authors: Josep Domingo-Ferrer, Jordi Soria-Comas, Oana Ciobotaru

Abstract: Performing some task among a set of agents requires the use of some protocol that regulates the interactions between them. If those agents are rational, they may try to subvert the protocol for their own benefit, in an attempt to reach an outcome that provides greater utility. We revisit the traditional notion of self-enforcing protocols implemented using existing game-theoretic solution concepts, we describe its shortcomings in real-world applications, and we propose a new notion of self-enforcing protocols, namely co-utile protocols. The latter represent a solution concept that can be implemented without a coordination mechanism in situations when traditional self-enforcing protocols need a coordination mechanism. Co-utile protocols are preferable in decentralized systems of rational agents because of their efficiency and fairness. We illustrate the application of co-utile protocols to information technology, specifically to preserving the privacy of query profiles of database/search engine users.

Submitted 9 March, 2015; originally announced March 2015.

Comments: Proceedings of the 2015 International Conference on Industrial Engineering and Operations Management-IEOM 2015, Dubai, United Arab Emirates, March 3-5, 2015. To appear in IEEE Xplore

MSC Class: 91Axx ACM Class: K.4.1

arXiv:1307.0966 [pdf, other]

Improving data utility in differential privacy and k-anonymity

Authors: Jordi Soria-Comas

Abstract: We focus on two mainstream privacy models: k-anonymity and differential privacy. Once a privacy model has been selected, the goal is to enforce it while preserving as much data utility as possible. The main objective of this thesis is to improve the data utility in k-anonymous and differentially private data releases. k-Anonymity has several drawbacks. On the disclosure limitation side, there is a lack of protection against attribute disclosure and against informed intruders. On the data utility side, dealing with a large number of quasi-identifier attributes is problematic. We propose a relaxation of k-anonymity that deals with these issues. Differential privacy limits disclosure risk through noise addition. The Laplace distribution is commonly used for the random noise. We show that the Laplace distribution is not optimal: the same disclosure limitation guarantee can be attained by adding less noise. Optimal univariate and multivariate noises are characterized and constructed. Common mechanisms to attain differential privacy do not take into account the users' prior knowledge; they implicitly assume zero initial knowledge about the query response. We propose a mechanism that focuses on limiting the knowledge gain over the prior knowledge. Microaggregation-based k-anonymity and differential privacy can be combined to produce microdata releases with the strong privacy guarantees of differential privacy and improved data accuracy. The last contribution delves into the relation between t-closeness and differential privacy. We see that for a specific distance and under some reasonable assumptions on the intruder's knowledge, t-closeness leads to differential privacy.

Submitted 3 July, 2013; originally announced July 2013.

Comments: Ph.D. Thesis defended on June 14, 2013, at the Department of Computer Engineering and Mathematics of Universitat Rovira i Virgili. Advisor: Josep Domingo-Ferrer

ACM Class: K.4.1

Showing 1–9 of 9 results for author: Soria-Comas, J