Obtaining $(\epsilon, \delta)$-differential privacy guarantees when using a Poisson mechanism to synthesize contingency tables
James Jackson
Lancaster University, Lancaster, UK
Robin Mitra
University College London, London, UK
Brian Francis
Lancaster University, Lancaster, UK
Iain Dove
Abstract
We show that differential privacy type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain $(\epsilon, \delta)$-probabilistic differential privacy guarantees via the Poisson distribution’s cumulative distribution function. We demonstrate this empirically with the synthesis of an administrative-type confidential database.
1 Introduction
Differential privacy (DP) (Dwork et al., 2006) is a property of a perturbation mechanism that formally quantifies how accurately any individual’s true values can be established, given all other individuals’ true values are known. Originally developed as a way to protect the privacy of summary statistics (queries), it soon expanded as a way to protect entire data sets. Differentially private data synthesis (DIPS) has since become a popular area of research; see, for example, Abowd and Vilhuber (2008); Machanavajjhala et al. (2008); Charest (2011); McClure and Reiter (2012); Bowen and Liu (2020); Quick (2021); Drechsler (2023).
In Jackson et al. (2022b, a), we proposed a synthesis approach for contingency tables, which takes place at the tabular level and uses saturated count models. This approach effectively uses a count distribution to apply noise to the counts in the original data’s contingency table, and therefore shares traits with DP mechanisms which apply noise in a similar way. Note that as microdata composed entirely of categorical variables can be expressed in contingency table format, this approach is suitable in the case of categorical data more generally.
In this paper, we consider the ability to obtain DP guarantees when using the Poisson distribution to synthesize counts in contingency tables. We show that although $\epsilon$-DP cannot be satisfied, $(\epsilon, \delta)$-DP guarantees can be obtained through the use of the Poisson’s cumulative distribution function (CDF).
The motivation behind this work is that, with the exception of Quick (2021), the use of count distributions has largely been overlooked as a way to satisfy DP. An obvious benefit of using count distributions is that negative counts cannot be obtained. As the Poisson has only one parameter and hence is likely to be sub-optimal, the intention is that in the future the Poisson could be replaced with more complex count distributions, such as the (discretised) gamma family distribution, where additional parameters provide scope for fine-tuning.
The paper is structured as follows. Section 2 introduces some terminology and definitions. Section 3 looks at existing DP mechanisms for contingency tables, such as the (discretised) Laplace and Gaussian mechanisms. Section 4 gives our novel contribution, the ability to obtain $(\epsilon, \delta)$-DP guarantees when using a Poisson synthesis mechanism. Section 5 gives an empirical example using an administrative database. Section 6 gives some concluding remarks.
2 Terminology and definitions
Rinott et al. (2018) set out how DP extends into a contingency table setting. Following their notation, let $\mathbf{a} = (a_1, \ldots, a_K) \in \mathcal{A}$ and $\mathbf{b} = (b_1, \ldots, b_K) \in \mathcal{B}$ denote vectors of counts in the original and synthetic data’s contingency tables, respectively, where $K$ denotes the number of cells and $\mathcal{A}$ and $\mathcal{B}$ denote the range of obtainable original and synthetic counts (respectively). For contingency tables, we suppose that $\mathcal{A} = \mathcal{B} = \mathbb{N}_0^K$, where $\mathbb{N}_0$ is the set of non-negative integers.
Moreover, we describe $\mathbf{a}$ and $\mathbf{a}'$ as neighbours, denoted by $\mathbf{a} \sim \mathbf{a}'$, whenever all but one of the counts in $\mathbf{a}$ and $\mathbf{a}'$ are identical and the differing count differs by exactly one. Henceforth, without loss of generality, we suppose $\mathbf{a}$ and $\mathbf{a}'$ differ in their $j$th element only, i.e. $a'_j = a_j + 1$ and $a'_i = a_i$ for $i \neq j$, $i = 1, \ldots, K$. Thus $\mathbf{a}$ represents the data held by the intruder (who knows all but one of the individuals’ true values) and $\mathbf{a}'$ represents the completed data where the “unknown individual” has been added to the cell in which they truly belong.
The $\epsilon$-DP definition revolves around the likelihood ratio, or, more accurately, around a series of likelihood ratios.
Definition 1 ($\epsilon$-DP)
A perturbation mechanism $M$ satisfies $\epsilon$-DP ($\epsilon > 0$) if, for all neighbouring original tables $\mathbf{a} \sim \mathbf{a}'$ and all synthetic tables $\mathbf{b} \in \mathcal{B}$:
\[ \exp(-\epsilon) \leq \frac{P(M(\mathbf{a}') = \mathbf{b})}{P(M(\mathbf{a}) = \mathbf{b})} \leq \exp(\epsilon) \tag{1} \]
Definition 1 is the special case of the standard DP definition, given in Dwork et al. (2006), for when the ranges of $\mathbf{a}$ and $\mathbf{b}$ are discrete. Although we appreciate that in some instances the denominator in (1) could be equal to zero, for the mechanisms we consider here this probability is always non-zero.
For any $\mathbf{a}$, $\mathbf{a}'$ and $\mathbf{b}$, whenever the ratio is either small or large, relatively too much is gleaned about the unknown individual’s true values. It is worth noting, too, that the above definition considers all possible synthetic data sets in $\mathcal{B}$, illustrating that DP is not a risk metric for a particular synthetic data set but rather a property of a synthesis mechanism.
Somewhat confusingly, there are two similar but different relaxations of $\epsilon$-DP. The first is $(\epsilon, \delta)$-differential privacy (Dwork and Roth, 2014). The second is known as $(\epsilon, \delta)$-probabilistic differential privacy (Machanavajjhala et al., 2008). These are given below in Definitions 2 and 3. In the remainder of this paper, we focus on $(\epsilon, \delta)$-probabilistic DP. Yet whenever $(\epsilon, \delta)$-probabilistic DP is satisfied, $(\epsilon, \delta)$-DP is also satisfied (Goetz et al., 2012).
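Definition 2 ($(\epsilon, \delta)$-DP)
A perturbation mechanism $M$ satisfies $(\epsilon, \delta)$-DP ($\epsilon > 0$; $0 \leq \delta \leq 1$) if, for all neighbouring original tables $\mathbf{a} \sim \mathbf{a}'$ and all sets of synthetic tables $S \subseteq \mathcal{B}$:
\[ P(M(\mathbf{a}') \in S) \leq \exp(\epsilon) \, P(M(\mathbf{a}) \in S) + \delta \tag{2} \]
Definition 3 ($(\epsilon, \delta)$-probabilistic DP)
A perturbation mechanism $M$ satisfies $(\epsilon, \delta)$-probabilistic DP ($\epsilon > 0$; $0 \leq \delta \leq 1$) if, for all neighbouring original tables $\mathbf{a} \sim \mathbf{a}'$:
\[ P\left[ \exp(-\epsilon) \leq \frac{P(M(\mathbf{a}') = \mathbf{b})}{P(M(\mathbf{a}) = \mathbf{b})} \leq \exp(\epsilon) \right] \geq 1 - \delta \tag{3} \]
where the outer probability is taken over the synthetic table $\mathbf{b}$ produced by the mechanism.
Theorem 1 ($(\epsilon, \delta)$-probabilistic DP implies $(\epsilon, \delta)$-DP)
If a perturbation mechanism satisfies $(\epsilon, \delta)$-probabilistic DP, then it also satisfies $(\epsilon, \delta)$-DP. (Proof: see Goetz et al. (2012))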
3 Examples of existing DP mechanisms
We now give examples of existing DP mechanisms suitable for synthesizing counts in contingency tables. Note that for the Laplace and Gaussian mechanisms, discretised noise needs to be added (unless one is willing to accept non-integer “counts”). This can simply involve adding continuous noise before rounding the adjusted values to the nearest integer. Similarly, negative values can be rounded to zero.
Example 1 (The Laplace mechanism)
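A random variable $X \sim$ Laplace$(\mu, \lambda)$ has probability density function:
\[ f(x \mid \mu, \lambda) = \frac{1}{2\lambda} \exp\left( -\frac{|x - \mu|}{\lambda} \right) \]
The Laplace mechanism satisfies $\epsilon$-DP by using the Laplace$(0, 1/\epsilon)$ distribution to add random noise to the original counts $\mathbf{a}$. Specifically, for every original count $a_i$, the Laplace mechanism generates a Laplace$(a_i, 1/\epsilon)$ random variate. To show that this mechanism does indeed satisfy DP, we suppose that $a'_i = a_i$ for $i \neq j$ and that $a'_j = a_j + 1$ (i.e. the assumptions made in Section 2). As all other cells contribute identical terms, only the $j$th synthetic value $b_j$ matters. Firstly, when $b_j \geq a_j + 1$:
\[ \frac{f(b_j \mid a_j + 1, 1/\epsilon)}{f(b_j \mid a_j, 1/\epsilon)} = \frac{\exp(-\epsilon |b_j - a_j - 1|)}{\exp(-\epsilon |b_j - a_j|)} = \exp(\epsilon) \tag{4} \]
Similarly, when $b_j \leq a_j$, (4) is equal to $\exp(-\epsilon)$, and when $a_j < b_j < a_j + 1$ it lies between these two values, equalling $\exp(0)$ at the midpoint $b_j = a_j + 1/2$. Hence the DP definition in (1) holds.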
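As an illustration of the discretised Laplace mechanism described above, the following is a minimal Python sketch rather than a definitive implementation. The function names are hypothetical, the noise has scale $1/\epsilon$, and the neighbouring count is assumed to be one larger than the intruder's count. Rounding and truncating at zero are post-processing steps applied to the noisy values.

```python
import random

def laplace_noise(eps, rng):
    """Draw Laplace(0, 1/eps) noise as the difference of two Exp(eps) draws."""
    return rng.expovariate(eps) - rng.expovariate(eps)

def laplace_mechanism(counts, eps, rng=None):
    """Add Laplace(0, 1/eps) noise to each count, round, and truncate at zero."""
    rng = rng or random.Random()
    noisy = (c + laplace_noise(eps, rng) for c in counts)
    return [max(0, round(x)) for x in noisy]

def laplace_log_ratio(b, a, eps):
    """Log density ratio log f(b | a + 1) - log f(b | a) for continuous noise.

    Assumes the two neighbouring counts are a and a + 1 (an illustrative
    convention); the result always lies in [-eps, eps].
    """
    return eps * (abs(b - a) - abs(b - a - 1))

if __name__ == "__main__":
    rng = random.Random(1)
    print(laplace_mechanism([0, 3, 12, 1], eps=1.0, rng=rng))
    print(laplace_log_ratio(7.0, 0, 1.0), laplace_log_ratio(-3.0, 0, 1.0))
```

The log-ratio helper makes the bound easy to inspect: it returns `eps` for values at or above the larger count, `-eps` for values at or below the smaller count, and intermediate values in between.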
Example 2 (The Gaussian mechanism)
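A random variable $X \sim$ Normal$(\mu, \sigma^2)$ has probability density function:
\[ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]
In a similar way to the Laplace mechanism, the Gaussian mechanism applies Normal$(0, \sigma^2)$ random noise to the original counts, resulting in a mechanism that satisfies $(\epsilon, \delta)$-differential privacy. Using the same assumptions and notation as previous (neighbouring counts $a_j$ and $a_j + 1$ in the single differing cell, with synthetic value $b_j$), it follows that:
\[ \frac{f(b_j \mid a_j + 1, \sigma^2)}{f(b_j \mid a_j, \sigma^2)} = \exp\left( \frac{(b_j - a_j)^2 - (b_j - a_j - 1)^2}{2\sigma^2} \right) = \exp\left( \frac{2(b_j - a_j) - 1}{2\sigma^2} \right) \]
Recall that $(\epsilon, \delta)$-probabilistic DP is satisfied whenever
\[ P\left[ \exp(-\epsilon) \leq \frac{f(b_j \mid a_j + 1, \sigma^2)}{f(b_j \mid a_j, \sigma^2)} \leq \exp(\epsilon) \right] \geq 1 - \delta \]
which, in this instance, occurs whenever
\[ P\left[ \, |2(b_j - a_j) - 1| \leq 2\sigma^2 \epsilon \, \right] \geq 1 - \delta \]
The probability can be obtained from $\Phi$, the normal distribution’s CDF (Balle and Wang, 2018), as $\Phi\left( \sigma\epsilon + \frac{1}{2\sigma} \right) - \Phi\left( \frac{1}{2\sigma} - \sigma\epsilon \right)$.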
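The probability for the Gaussian mechanism can be evaluated numerically. The sketch below is illustrative only: it assumes continuous (undiscretised) Normal$(0, \sigma^2)$ noise and neighbouring counts that differ by one, so that the likelihood ratio lies in $[\exp(-\epsilon), \exp(\epsilon)]$ exactly when the noisy value falls within $\sigma^2 \epsilon$ of the midpoint between the two counts. The function names are hypothetical.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_prob_dp_delta(eps, sigma):
    """Return delta such that the ratio is within the eps-bounds w.p. 1 - delta.

    Assumes continuous Normal(0, sigma^2) noise and neighbouring counts that
    differ by exactly one (illustrative assumptions).
    """
    shift = 1.0 / (2.0 * sigma)
    prob_within = (std_normal_cdf(sigma * eps + shift)
                   - std_normal_cdf(shift - sigma * eps))
    return 1.0 - prob_within

if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0, 4.0):
        print(sigma, round(gaussian_prob_dp_delta(1.0, sigma), 4))
```

For a fixed $\epsilon$, increasing `sigma` widens the interval relative to the noise scale, so the returned $\delta$ shrinks.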
Example 3 (Multinomial-Dirichlet synthesizer)
A multinomial-Dirichlet synthesis mechanism (Abowd and Vilhuber, 2008) can also yield DP guarantees. The original counts $\mathbf{a}$ can be converted to cell probabilities simply by dividing by $n$ (the number of individuals in the data). A Dirichlet prior with concentration parameters $\boldsymbol{\alpha}$ is placed on these cell probabilities (see Abowd and Vilhuber (2008) for more on this approach). Using the same “without loss of generality” assumptions as previous, it follows that
(5)
Recall again that DP is satisfied whenever
As the expression in (5) is always greater than or equal to one, and hence always greater than $1/\exp(\epsilon)$, DP is satisfied whenever
As and , this simplifies to
Considering all counts gives that DP is satisfied whenever
4 Satisfying $(\epsilon, \delta)$-probabilistic DP with a Poisson synthesis mechanism
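When using saturated count models to synthesize contingency tables, as set out in Jackson et al. (2022b), a count distribution, e.g. the Poisson, applies noise to original counts. We assume that a constant pseudocount $\alpha$ is added to every element of $\mathbf{a}$ (i.e. to all original counts, not just to zero counts as in Jackson et al. (2022b)), which opens up the possibility that original counts of zero can be synthesized to non-zeros. When using the Poisson we apply the following mechanism, which we denote by $M$, to obtain a set of synthetic counts:
\[ b_i \mid a_i \sim \text{Poisson}(a_i + \alpha), \quad i = 1, \ldots, K \]
Supposing once again that $\mathbf{a}$ and $\mathbf{a}'$ differ in their $j$th element only (with $a'_j = a_j + 1$), we have:
\[ \frac{P(M(\mathbf{a}') = \mathbf{b})}{P(M(\mathbf{a}) = \mathbf{b})} = \frac{\exp(-(a_j + \alpha + 1)) \, (a_j + \alpha + 1)^{b_j} / b_j!}{\exp(-(a_j + \alpha)) \, (a_j + \alpha)^{b_j} / b_j!} = \exp(-1) \left( \frac{a_j + \alpha + 1}{a_j + \alpha} \right)^{b_j} \tag{6} \]
This quantity is bounded below by $\exp(-1)$, with this minimum occurring when $b_j = 0$. It is unbounded above, however, as $b_j$ can take any integer up to infinity; i.e. the expression in (6) tends to infinity as $b_j$ tends to infinity. Thus $\epsilon$-DP cannot be satisfied.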
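To make the behaviour of the likelihood ratio concrete, the following minimal Python sketch draws synthetic counts from a Poisson with mean equal to the original count plus a pseudocount, and evaluates the ratio of the two Poisson probabilities for neighbouring counts $a$ and $a + 1$. The function names and the simple Poisson sampler (Knuth's multiplication method, suitable only for modest means) are illustrative choices, not part of the original method description.

```python
import math
import random

def poisson_draw(mean, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def poisson_mechanism(counts, alpha, rng=None):
    """Synthesize each count as a Poisson(count + alpha) draw."""
    rng = rng or random.Random()
    return [poisson_draw(c + alpha, rng) for c in counts]

def likelihood_ratio(b, a, alpha):
    """P(b | count a + 1) / P(b | count a) under Poisson(count + alpha).

    Requires a + alpha > 0. The ratio equals exp(-1) at b = 0 and grows
    without bound as b increases.
    """
    return math.exp(-1.0) * ((a + alpha + 1.0) / (a + alpha)) ** b

if __name__ == "__main__":
    rng = random.Random(2023)
    print(poisson_mechanism([0, 2, 10, 1], alpha=0.5, rng=rng))
    for b in (0, 1, 5, 20):
        print(b, likelihood_ratio(b, a=0, alpha=0.5))
```

Printing the ratio for increasing `b` shows why no finite $\epsilon$ can bound it from above.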
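Instead, we now consider the $(\epsilon, \delta)$-probabilistic DP relaxation, first considering the left-hand inequality of the DP definition (Def. 1):
\[ \exp(-\epsilon) \leq \exp(-1) \left( \frac{a_j + \alpha + 1}{a_j + \alpha} \right)^{b_j} \]
When $\epsilon \geq 1$, this inequality holds with probability 1. When $\epsilon < 1$, the probability that this inequality holds can be determined through the Poisson’s CDF, since $b_j$ is a realization from a Poisson random variable. This probability is given as:
\[ P\left[ b_j \geq \frac{1 - \epsilon}{\log\left( \frac{a_j + \alpha + 1}{a_j + \alpha} \right)} \right] = 1 - F\left( \left\lceil \frac{1 - \epsilon}{\log\left( \frac{a_j + \alpha + 1}{a_j + \alpha} \right)} \right\rceil - 1 \right) \tag{7} \]
where $F$ is the CDF of the Poisson distribution from which $b_j$ is drawn.
We next consider the right-hand inequality of Def. 1:
\[ \exp(-1) \left( \frac{a_j + \alpha + 1}{a_j + \alpha} \right)^{b_j} \leq \exp(\epsilon) \]
For all $\epsilon > 0$, this inequality holds with probability
\[ P\left[ b_j \leq \frac{1 + \epsilon}{\log\left( \frac{a_j + \alpha + 1}{a_j + \alpha} \right)} \right] = F\left( \left\lfloor \frac{1 + \epsilon}{\log\left( \frac{a_j + \alpha + 1}{a_j + \alpha} \right)} \right\rfloor \right) \tag{8} \]
Recall that in $(\epsilon, \delta)$-probabilistic DP, $1 - \delta$ is the probability that DP is satisfied, i.e. the probability that both inequalities hold. A non-trivial question when $\epsilon < 1$ is how to combine the probabilities given in (7) and (8) and hence compute $\delta$. This is an area of future research.
When $\epsilon \geq 1$, however, the left-hand inequality of Def. 1 always holds, thus we need only focus on (8). Although non-trivial for any $\epsilon$ and $\alpha$, (8) is minimised when $a_j = 0$. Note that a formal proof has been omitted here, but this is supported by extensive empirical simulations. Thus,
\[ 1 - \delta = F\left( \left\lfloor \frac{1 + \epsilon}{\log\left( \frac{\alpha + 1}{\alpha} \right)} \right\rfloor \right) \tag{9} \]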
This also demonstrates the role of $\alpha$ as a tuning parameter for risk. In general, a larger $\alpha$ value corresponds to a lower $\delta$ value. Yet $\delta$ is not a decreasing function of $\alpha$. Briefly, this is because increasing $\alpha$ increases the value of the expression inside the brackets in (9), but it also increases the mean of the Poisson random variable from which a synthetic count is drawn. Figure 1 illustrates the nature of the relationship between $\epsilon$ and $\delta$ for different values of $\alpha$. For example, one setting of $\alpha$ shown there satisfies approximately (3, 0.3)-probabilistic DP and (1.5, 0.6)-probabilistic DP.
Figure 1: The relationship between $\epsilon$ and $\delta$ in the Poisson synthesis mechanism for different values of $\alpha$.
In contingency tables where there are no zero counts, an $(\epsilon, \delta)$-DP guarantee can be obtained when $\alpha = 0$. In this instance, $\delta$ is determined by the smallest original count, $a_{\min} = \min_i a_i$, i.e.:
\[ 1 - \delta = F\left( \left\lfloor \frac{1 + \epsilon}{\log\left( \frac{a_{\min} + 1}{a_{\min}} \right)} \right\rfloor \right) \tag{10} \]
In a sense, in this example we have violated the traditional $(\epsilon, \delta)$-probabilistic DP definition given in (3) because $\delta$ is dependent on a particular set of original counts – not all original counts.
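The calculation of $\delta$ from the Poisson CDF for $\epsilon \geq 1$ can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the function names are hypothetical, and the Poisson mean used for the synthetic count is exposed as a parameter. By default it is taken to be the completed count plus the pseudocount ($a + \alpha + 1$), which is an assumption about which of the two neighbouring tables generates the synthetic data. Setting `a=0` corresponds to the worst case with a pseudocount, and setting `alpha=0` with `a` equal to the smallest original count corresponds to a table with no zero counts.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), summing the pmf term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return min(total, 1.0)

def prob_upper_inequality(eps, alpha, a=0, mean=None):
    """Probability that the likelihood ratio is at most exp(eps)."""
    if a + alpha <= 0:
        raise ValueError("a + alpha must be positive")
    if mean is None:
        # ASSUMPTION: the synthetic count is drawn using the completed count.
        mean = a + alpha + 1.0
    log_step = math.log((a + alpha + 1.0) / (a + alpha))
    cutoff = math.floor((1.0 + eps) / log_step)
    return poisson_cdf(cutoff, mean)

def delta_for(eps, alpha, a=0, mean=None):
    """delta = 1 - P(upper inequality holds); intended for eps >= 1."""
    return 1.0 - prob_upper_inequality(eps, alpha, a, mean)

if __name__ == "__main__":
    for eps in (1.0, 2.0, 3.0, 6.5):
        print(eps, round(delta_for(eps, alpha=0.1), 3))
```

Because the cutoff is obtained by taking a floor, $\delta$ is a step function of $\epsilon$: it stays constant over ranges of $\epsilon$ and drops each time the cutoff increases by one.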
We can easily replace the Poisson with any other count distribution (e.g. the negative binomial, Poisson inverse-Gaussian, Delaporte, Sichel, etc.), which of course would lead to a different expression for the ratio in (6).
5 An empirical example
5.1 The English School Census administrative database
The English School Census (ESC) is a large administrative database belonging to the UK’s Department for Education (DfE), which holds information about pupils attending state-funded schools in the UK. Owing to the presence of sensitive data, strict privacy guarantees would be required for data from the ESC to be made available to researchers. There is therefore great appeal to DP-type approaches, where more formal guarantees of privacy can be obtained.
Access to the real ESC data is currently restricted, even for the sake of demonstrating the effectiveness of privacy methods. For this reason, staff at the Office for National Statistics (ONS) created a substitute data set using publicly-available data sources, such as published ESC data and 2011 UK census data. A key feature of this data set, ESC, is that it replicates some of the statistical properties present in the actual ESC. We take a subset of this data which has approximately individuals (rows) and 5 categorical variables (columns). As all variables are categorical, the data set can be expressed as a contingency table with around cells. More information about the data set – as well as the data set itself – is available at Blanchard et al. (2022).
5.2 Applying the Poisson synthesis mechanism
We now apply the Poisson synthesis mechanism to the ESC data, considering different values of $\alpha$ and a range of $\epsilon$ values.
Figure 2 gives combinations of $(\epsilon, \delta)$ values that can be achieved for the ESC data when using $\alpha$ values of 0.1, 0.2, 0.5 and 1. For example, for one of these $\alpha$ values an $\epsilon$ value of 1 is required to obtain a $\delta$ value of 0.05, whereas for another a $\delta$ value of 0.05 is obtained only for $\epsilon$ values greater than 6.
DP methods in general are known to have a detrimental effect on utility. To gain a simple insight into general utility (Snoke et al., 2018), the boxplots in Figure 3 compare the percentage differences between original and synthetic counts for various values of $\alpha$, and for original counts between 1 and 10. Unsurprisingly, increasing $\alpha$ increases the percentage differences, i.e. has an adverse effect on utility. This loss of utility is magnified in specific analyses, especially when the analyst wishes to quantify uncertainty.
Figure 2: Combinations of $(\epsilon, \delta)$ such that $(\epsilon, \delta)$-probabilistic DP is achieved when the Poisson is used, for various max and equal to 1.5, 2, 2.5 and 3.
Figure 3: For different values of $\alpha$, boxplots showing percentage differences between original and synthetic counts (utility) for original counts in the range 1–10.
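The utility comparison summarised in the boxplots can be sketched as follows. This is an illustrative helper only: the function name is hypothetical, and the percentage difference is assumed here to be $100 (b - a) / a$ for an original count $a$ and synthetic count $b$, a definition the text does not state explicitly.

```python
def percentage_differences(original, synthetic, low=1, high=10):
    """Group percentage differences by original count, for counts in [low, high].

    ASSUMPTION: the percentage difference is 100 * (synthetic - original) / original.
    Original counts of zero are excluded because the ratio is undefined.
    """
    grouped = {}
    for a, b in zip(original, synthetic):
        if low <= a <= high:
            grouped.setdefault(a, []).append(100.0 * (b - a) / a)
    return grouped

if __name__ == "__main__":
    original = [0, 1, 2, 2, 11]
    synthetic = [3, 2, 1, 2, 9]
    for count, diffs in sorted(percentage_differences(original, synthetic).items()):
        print(count, diffs)
```

Each list in the returned dictionary holds the values that would feed one boxplot, one per original count.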
6 Discussion
To summarise, in this paper we have shown how to obtain $(\epsilon, \delta)$-DP guarantees when using a Poisson synthesis mechanism to protect the privacy of counts in contingency tables. For a given $\epsilon$, the corresponding value of $\delta$ that is achievable with the Poisson is relatively high; much higher than that which is achievable with other DP mechanisms. Going forward, we believe other count distributions, such as the negative binomial, are likely to be more favourable (i.e. will give better utility results), while also providing the same DP-type risk guarantees, because such distributions would introduce further tuning parameters in addition to $\alpha$. Previous work suggests that such tuning parameters apply noise in a more efficient fashion (Jackson et al., 2022a). These tuning parameters could be set to obtain certain $\epsilon$ or $\delta$ values.
We end with an interesting note in relation to DP. Somewhat counterintuitively, the reason why multinomial-based synthesis mechanisms (e.g. the multinomial-Dirichlet synthesizer) can satisfy $\epsilon$-DP, but the Poisson cannot, is that multinomial mechanisms impose a maximum value that any synthetic count can take, namely the total number of individuals in the synthetic data. With count distributions, any original count can be synthesized to any non-negative integer. To help explain why this causes the DP definition to fail, recall that, with contingency tables, DP definitions effectively assume that the intruder is trying to locate the cell to which just one individual belongs; i.e. in the intruder’s data set one, and only one, cell count is one less than it actually is. Suppose that a particular count in the intruder’s data set is equal to 1, but that the corresponding synthetic count, generated by simulating from the Poisson with a pseudocount $\alpha$ close to zero, is equal to 5. It is 11.7 times more likely that this synthetic count originated from a cell with a count of 2 than from a count of 1, so the intruder can infer that this particular cell is a likely origin of the target. It is interesting therefore that, with DP, disclosure risk is deemed to be at its greatest when the scope for potential movement between original and synthetic counts is at its greatest. This largely goes against the objectives of traditional SDC methods, which typically reduce risk by increasing the divergence from the original counts.
References
Abowd and Vilhuber (2008)
Abowd, J. M. and Vilhuber, L. (2008) How Protective Are Synthetic Data?
In Privacy in Statistical Databases 2008 (eds. J. Domingo-Ferrer and Y. Saygın), 239–246. Berlin, Heidelberg: Springer.
Balle and Wang (2018)
Balle, B. and Wang, Y.-X. (2018) Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising.
In Proceedings of the 35th International Conference on Machine Learning (eds. J. Dy and A. Krause), vol. 80 of Proceedings of Machine Learning Research, 394–403. PMLR.
URL: https://proceedings.mlr.press/v80/balle18a.html.
Blanchard et al. (2022)
Blanchard, S., Jackson, J. E., Mitra, R., Francis, B. J. and Dove, I. (2022) A constructed English School Census substitute.
DOI: 10.17635/lancaster/researchdata/533.
Bowen and Liu (2020)
Bowen, C. M. and Liu, F. (2020) Comparative Study of Differentially Private Data Synthesis Methods.
Statistical Science, 35, 280–307.
URL: https://doi.org/10.1214/19-STS742.
Charest (2011)
Charest, A.-S. (2011) How can we analyze differentially-private synthetic datasets?
Journal of Privacy and Confidentiality, 2.
URL: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/589.
Drechsler (2023)
Drechsler, J. (2023) Differential privacy for government agencies—are we there yet?
Journal of the American Statistical Association, 118, 761–773.
URL: https://doi.org/10.1080/01621459.2022.2161385.
Dwork et al. (2006)
Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006) Calibrating noise to sensitivity in private data analysis.
In Theory of Cryptography (eds. S. Halevi and T. Rabin), 265–284. Berlin, Heidelberg: Springer.
Dwork and Roth (2014)
Dwork, C. and Roth, A. (2014) The Algorithmic Foundations of Differential Privacy.
Foundations and Trends® in Theoretical Computer Science, 9, 211–407.
URL: http://dx.doi.org/10.1561/0400000042.
Goetz et al. (2012)
Goetz, M., Machanavajjhala, A., Wang, G., Xiao, X. and Gehrke, J. (2012) Publishing Search Logs - A Comparative Study of Privacy Guarantees.
IEEE Trans. Knowl. Data Eng., 24, 520–532.
Jackson et al. (2022a)
Jackson, J., Mitra, R., Francis, B. and Dove, I. (2022a) On integrating the number of synthetic data sets into the a priori synthesis approach.
In Privacy in Statistical Databases 2022 (eds. J. Domingo-Ferrer and M. Laurent), 205–219. Cham: Springer International Publishing.
Jackson et al. (2022b)
— (2022b) Using Saturated Count Models for User-Friendly Synthesis of Large Confidential Administrative Databases.
Journal of the Royal Statistical Society Series A: Statistics in Society, 185, 1613–1643.
URL: https://doi.org/10.1111/rssa.12876.
Machanavajjhala et al. (2008)
Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J. and Vilhuber, L. (2008) Privacy: Theory meets practice on the map.
In 2008 IEEE 24th international conference on data engineering, 277–286. IEEE.
McClure and Reiter (2012)
McClure, D. and Reiter, J. P. (2012) Differential Privacy and Statistical Disclosure Risk Measures: An Investigation with Binary Synthetic Data.
Transactions on Data Privacy, 5, 535–552.
Quick (2021)
Quick, H. (2021) Generating Poisson-distributed differentially private synthetic data.
Journal of the Royal Statistical Society: Series A (Statistics in Society), 184, 1093–1108.
URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssa.12711.
Rinott et al. (2018)
Rinott, Y., O’Keefe, C. M., Shlomo, N., Skinner, C. et al. (2018) Confidentiality and Differential Privacy in the Dissemination of Frequency Tables.
Statistical Science, 33, 358–385.
Snoke et al. (2018)
Snoke, J., Raab, G. M., Nowok, B., Dibben, C. and Slavkovic, A. (2018) General and specific utility measures for synthetic data.
Journal of the Royal Statistical Society: Series A (Statistics in Society), 181, 663–688.