Metric Differential Privacy at the User-Level
Abstract.
Metric differential privacy (DP) provides heterogeneous privacy guarantees based on the distance between a pair of inputs. It is a widely popular notion of privacy because it captures the natural privacy semantics of many applications (such as location data) and results in better utility than standard DP. However, prior work in metric DP has primarily focused on the item-level setting, where every user reports only a single data item. A more realistic setting is that of user-level DP, where each user contributes multiple items and privacy is desired at the granularity of the user’s entire contribution. In this paper, we initiate the study of metric DP at the user level. Specifically, we use the earth-mover’s distance ($d_{\mathsf{EM}}$) as our metric to obtain a notion of privacy, as it captures both the magnitude and spatial aspects of changes in a user’s data.
We make three main technical contributions. First, we design two novel mechanisms under $d_{\mathsf{EM}}$-DP to answer linear queries and item-wise queries. Our analysis for the latter involves a generalization of the privacy amplification by shuffling result, which may be of independent interest. Second, we provide a black-box reduction from the general unbounded setting to bounded $d_{\mathsf{EM}}$-DP (where the size of the dataset is fixed and public) with a novel sampling-based mechanism. Third, we show that our proposed mechanisms can provably provide improved utility over user-level DP for certain types of linear queries and frequency estimation.
1. Introduction
Differential privacy (DP) is the state-of-the-art technique that enables useful data analysis while still providing a strong privacy guarantee at the granularity of individuals (Dwork, 2006). Over nearly two decades, DP has enjoyed significant academic attention and has proven its efficacy in practical applications as well. It has been successfully deployed in diverse settings, including the US census (Abowd, 2018), Apple’s iOS platform (Cormode et al., 2018), and Google Chrome (Erlingsson et al., 2014).
Intuitively, a DP guarantee makes a pair of inputs indistinguishable from each other. The standard DP guarantee requires all pairs of inputs to be indistinguishable, thereby providing a uniform privacy guarantee to all pairs. This implies that every pair of inputs is considered equally sensitive. However, many practical applications call for more tailored privacy semantics based on the heterogeneity of the data. In particular, input pairs that are closer or more similar to each other are considered to be more sensitive. For instance, for location data, revealing the exact city of residence is far more sensitive than revealing just the country. Metric DP (Chatzikokolakis et al., 2013) is a notion of DP that formally captures this heterogeneity in privacy semantics. Specifically, similarity is measured via a distance metric, and the privacy guarantee degrades linearly with the distance between the pair of inputs. In addition to offering a more nuanced privacy definition, metric DP also improves utility compared to standard DP. This improvement stems from metric DP requiring only similar pairs of inputs to be indistinguishable, which results in significantly lower noise than standard DP.
Prior work in metric DP has primarily focused on the item-level setting, where every user reports only a single data item (e.g., a single record in a dataset). However, in many practical applications, a user contributes multiple items to a dataset. Privacy is then desired at the granularity of the user’s entire contribution. This has spurred a large body of work known as user-level DP (Amin et al., 2019; Bassily and Sun, 2023; Cummings et al., 2022; Acharya et al., 2023). However, all of this work considers only standard DP and is thus susceptible to the same limitations in utility as noted earlier. To this end, we initiate the study of metric DP at the user level. While there have been some prior attempts at this, that work is limited to specific settings such as text data (Fernandes et al., 2019). To the best of our knowledge, this is the first work to give a general definition of metric DP at the user level.
The immediate task is to define a metric on the entire collection of a user’s data. Recall that metric DP caters to the privacy semantics that similar data is more sensitive. The challenge here is that the similarity between two collections (sets) of data points has to be measured along two dimensions: the distance between the individual data items, and the fraction of the data items in the set that are different. In particular, note that in addition to small changes in the item-wise distances, changes to a smaller fraction of the data also indicate more similarity and hence correspond to more sensitive information (see below for concrete examples). This necessitates a measure that can express both of these quantities as a single metric. We tackle this challenge by using the earth-mover’s distance ($d_{\mathsf{EM}}$; Givens and Shortt, 1984) on the normalized representation of the user’s data. Informally, the $d_{\mathsf{EM}}$ between two distributions is the minimum cost of transporting one distribution to the other, where the cost is determined by the quantity of data items moved multiplied by the distance (measured via the item-level metric) over which they are moved. Our resulting privacy definition, denoted as $d_{\mathsf{EM}}$-DP, yields the following privacy semantics. Under $d_{\mathsf{EM}}$-DP, the strength of the privacy guarantee (indistinguishability) between a pair of inputs (sets of data items) grows inversely with the product of two quantities: the fraction of one input that must be changed to obtain the other, and the average distance of that change (Def. 3.1). $d_{\mathsf{EM}}$ therefore takes into account both the structure of the distributions and the raw difference in their values. Consequently, these two parameters (the fraction changed and the average distance) provide flexibility in interpretation and offer a nuanced privacy definition suitable for many practical applications. We illustrate this with the following examples:
Location Data. We will use our location dataset as a canonical example throughout the paper. Suppose that the location dataset consists of daily locations of users collected over a period of time. Here, the fraction parameter can be interpreted in terms of the length of the time window that the change pertains to, and the distance parameter corresponds to the extent of the change in location. Then, $d_{\mathsf{EM}}$-DP makes it harder to distinguish between locations that are close to each other and collected over a smaller time window. This is natural, since locations gathered over an extended period, such as a month, may reveal routine patterns that are less sensitive than locations recorded on a single day (for instance, a single-day location might reveal a non-routine visit to a friend or hospital).
Textual Data. Consider a natural language dataset of user conversations where each user’s data is represented as a set of words. Typically, word embeddings map each word into a high-dimensional space, and word similarity is measured using a distance, such as the Euclidean distance, between the embeddings of two words. Now, the fraction parameter corresponds to what fraction of the user’s conversation has changed between the two datasets, while the distance parameter corresponds to the extent of the changes in the textual content. Thus, two conversations are harder to distinguish if there is only a fine-grained difference in their textual semantics (such as transitioning from text about algebra to trigonometry, versus changing it from “math” to “classical music”), and if it pertains to just a small fraction of the conversation (indicating that the user rarely discussed the topic, which typically implies more sensitive information).
Graph Data. Consider a graph in which the connections are private. Suppose there is additional public information in the form of a covariate, which captures some auxiliary information about a user, for instance, the interests of a user. Here, similarity between users is measured via covariate distance. The fraction parameter corresponds to the fraction of a user’s connections that has changed between the two graphs, and the distance parameter corresponds to the extent of the change in their interests. Thus, two graphs are harder to distinguish if the change to the interests is fine-grained (for instance, shifting from movies featuring Dwayne Johnson to Vin Diesel, instead of from “action” to “rom-com”), and if it pertains to only a few of the user’s connections (say, a small, private group of friends). This again captures natural privacy semantics, as users are more likely to share common interests with their close friends than with a larger group, such as all workplace colleagues.
1.1. Details of Our Contributions
We consider users who hold datasets , each containing elements from a data domain of size . Let denote the distance metric defined over . WLOG, we consider to be a normalized distance metric, i.e., all measures of distance are normalized to be at most . Let denote the normalized version of the dataset . between any pair of datasets can be defined by first normalizing them to , and then using to measure the minimum cost of transporting to . The global dataset is given by , and there is an aggregator who wants to privately compute a query . In the central model, the aggregator already holds from each user, and applies a private mechanism to obtain a private estimate for . In the local model, the users do not trust the aggregator, and communicate private messages to the aggregator. The aggregator then post-processes these messages to output a private estimate of . For simplicity, in this work we assume the mechanisms to be non-interactive.
We also make a distinction between bounded and unbounded data. Note that boundedness here refers to the size of each user’s dataset and not the number of the users – throughout the paper, we assume that the number of users, , is fixed and publicly known. In our specific context, bounded data corresponds to the case where the size of each user’s dataset is publicly known, and the mechanism only needs to preserve privacy between datasets of the same size. Furthermore, in the central model, each user’s dataset has the same public size. The benefit of this simplification is that algorithm analysis is easier. Such a bounded data setting has been considered in many previous works (Li et al., 2016). We also consider the general unbounded data setting where each user can have datasets of varying sizes, with the size being private as well.
For each model and type of boundedness, we summarize how one would apply $d_{\mathsf{EM}}$-DP, along with the resulting semantics, in Table 1. We also include the corresponding notion of standard user-level DP (Liu et al., 2023), which provides a uniform privacy guarantee to all pairs of datasets and serves as our baseline. In what follows, we elaborate on our main contributions.
Model | Granularity | Data Boundedness | Privacy Guarantee | Semantics | Notes |
Local (applies to each user’s dataset) | User | Unbounded | -user-level DP (Def. 2.1) | Two input datasets are indistinguishable with parameters | Recently proposed in (Acharya et al., 2023). Acts as our baseline for the local model. |
Local | User | Bounded | -bounded $d_{\mathsf{EM}}$-DP (Def. 3.1) | Two input datasets are indistinguishable with parameters . | The size of each dataset, , is public. Proofs of privacy are easier due to Lemma 2.1. |
Local | User | Unbounded | -unbounded $d_{\mathsf{EM}}$-DP (Def. 3.1) | Two input datasets are indistinguishable with parameters . | Implies user-level DP when since . |
Local | Item | N/A | Item-level metric DP (Def. 2.3) | Two input items are protected with parameters | Proposed in (Chatzikokolakis et al., 2015). |
Central (applies to the global dataset) | User | Unbounded | -user-level DP (Def. 2.2) | Let where . Two input global datasets s.t. they differ only on the dataset of a single user are indistinguishable with parameters | Studied widely (Bassily and Sun, 2023; Liu et al., 2020, 2023). Acts as our baseline for the central model. |
Central | User | Bounded | -bounded $d_{\mathsf{EM}}$-DP (Def. 3.2) | Two input global datasets s.t. they differ only on are indistinguishable with parameters | Each has size which is public. |
Central | User | Unbounded | -discrete $d_{\mathsf{EM}}$-DP (Def. 5.1) | Two input global datasets s.t. they differ only on and are indistinguishable with parameters . | Using group privacy, we can show the following parameters where . Implies user-level DP when since . |
1.1.1. Mechanism Design
We provide novel mechanisms for answering two types of queries under $d_{\mathsf{EM}}$-DP.
Linear Query
First, we study how to release linear queries , where is the normalized representation of the global dataset and is a real-valued matrix with bounded entries. While computing the sensitivity of a linear query is easy under user-level DP, proving a sensitivity under -DP is quite challenging. Specifically, it requires analysis of a coupling between two possible datasets, along with a stronger assumption that is “Lipschitz” in a sense, rather than just being bounded. To this end, we first prove the following bound:
Theorem 1.1.
(Informal version of Thm. 4.1): The sensitivity of is upper bounded by
where the notation indicates the column of indexed by .
Using the above result, we show that the sensitivity of , which is a maximum over the space of all datasets, can be reduced to a Lipschitz property of that is much easier to compute. In Sec. 6.1, we show that a special class of linear queries, which we call linear embedding queries, satisfies the above mentioned Lipschitzness and can provide provably better utility than user-level DP.
Unordered Release of Item-wise Queries
We design a mechanism for performing item-wise queries on the entire dataset. Our approach is to simply apply a private mechanism to each item and release the set of noisy outputs after shuffling them. Here, the per-item mechanism can be an arbitrary mechanism satisfying item-level metric DP, which makes our mechanism completely general-purpose (see Sec. 4.2 for some concrete examples). We consider the bounded data setting since the size of the dataset is revealed. The main technical novelty lies in providing a tight privacy analysis of the above mechanism. Specifically, prior work shows that the above mechanism satisfies bounded $d_{\mathsf{EM}}$-DP (Fernandes et al., 2019) by using the interplay between couplings and privacy via composition. However, we show that composition is not the right tool for a tight privacy analysis since it does not take into account that the output of our mechanism is an unordered list, i.e., the noisy outputs are released in a random order. Instead, we generalize a tight result from privacy amplification by shuffling (Feldman et al., 2022) to metric DP.
Theorem 1.2.
(Informal version of Thm. 4.3) Suppose that is an - DP algorithm with respect to . Let be a dataset. Then, releasing satisfies - DP.
This analysis reduces the cost of releasing points in the multiset from to , allowing for better utility. We keep the analysis general – we consider releasing the shuffled multiset of any black-box mechanism , that satisfies metric DP in the data domain , applied to each data point. Consequently, this result has broader applications to the shuffle model of privacy, and may be of independent interest.
1.1.2. Extending $d_{\mathsf{EM}}$-DP to the Unbounded Setting
We start our mechanism designs by considering the bounded data setting in both the local and central models of privacy (see Table 1), as this enables easier privacy analysis (Sec. 4). However, the bounded setting might be restrictive in practice, as it cannot support use cases where users have different amounts of data or where the data sizes are also private. To this end, we extend $d_{\mathsf{EM}}$-DP to the more general unbounded setting. We show that when user data is relatively homogeneous (such as when it is i.i.d.), the privacy analysis of the unbounded setting may be reduced to the bounded setting.
Specifically, in Sec. 5 we create a black-box projection mechanism which projects any unbounded dataset onto a dataset where each user contributes a fixed, predefined amount of data. This enables running any bounded $d_{\mathsf{EM}}$-DP mechanism on the projected data. Our projection mechanism samples a fixed number of items with replacement from each user’s dataset. The privacy analysis follows by showing that the $d_{\mathsf{EM}}$ between any two datasets remains relatively unchanged by sampling, up to a small additive factor determined by a Chernoff bound.
One caveat is that the introduced additive factor necessitates a slight adjustment to the privacy semantics of -DP. Instead of protecting any change of distance with a privacy parameter , we consider a small threshold such that all changes less than are protected with a uniform parameter . In essence, this privacy guarantee provides -DP at the granularity of units of distance . We refer to this notion as discrete user-level -DP (Def. 5.1) and have the following result:
Theorem 1.3.
(Informal version of Thm. 5.3) Suppose that for users, is a mechanism which satisfies -bounded -DP. The algorithm which, given arbitrary user datasets , takes i.i.d. samples from each and then applies on each of the sampled data items, satisfies -discrete -DP for all .
The two notions of privacy are nearly equivalent for small , showing that unbounded -DP can be reduced to bounded -DP with an almost exact translation of the privacy guarantee.
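The sampling step of the projection mechanism can be sketched as follows. This is a minimal illustration only: the function name `project_to_fixed_size`, the parameter `m` (the fixed per-user sample count), and the assumption that every user's dataset is non-empty are ours, not the paper's.

```python
import random


def project_to_fixed_size(user_datasets, m, rng):
    """Project each user's dataset onto exactly m items.

    Each user's items are sampled i.i.d. with replacement, so every
    projected dataset has the same (public) size m regardless of how
    much data the user originally held. Assumes every dataset is
    non-empty (hypothetical simplification).
    """
    return [rng.choices(items, k=m) for items in user_datasets]


rng = random.Random(0)
users = [[1, 2, 3], [7], [4, 5, 6, 8, 9]]   # unbounded: sizes differ
projected = project_to_fixed_size(users, 4, rng)
print([len(p) for p in projected])
```

Any mechanism designed for the bounded setting could then be run on `projected`, since all per-user datasets now share one size.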
1.1.3. Demonstrating Improvements Over User-level DP
We compare the privacy and utility of our proposed $d_{\mathsf{EM}}$-DP mechanisms with baseline mechanisms satisfying user-level DP. Specifically, in Sec. 6.1, we study a special type of linear query called linear embedding queries, and in Sec. 6.2, we study the problem of private frequency estimation. For simplicity, we consider the bounded data setting.
Let’s start by understanding the relationship between --DP and -user-level DP. The following observations hold in both the central and local models:
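• The $d_{\mathsf{EM}}$-DP parameter is at most the user-level DP parameter: Since we assume the item-level metric is normalized, we always have $d_{\mathsf{EM}} \le 1$. Thus, in this case $d_{\mathsf{EM}}$-DP implies user-level DP. Moreover, for any pair of inputs with $d_{\mathsf{EM}} < 1$, the privacy protection of $d_{\mathsf{EM}}$-DP is actually stronger. (If the $d_{\mathsf{EM}}$-DP parameter is strictly smaller, then user-level DP is strictly weaker than $d_{\mathsf{EM}}$-DP; the more appropriate baseline is to use equal parameters.)
• The $d_{\mathsf{EM}}$-DP parameter exceeds the user-level DP parameter: In this case, some pairs of inputs (with a large distance between them) are protected less strongly than they are under user-level DP. However, as indicated in our aforementioned real-life examples, input pairs with high $d_{\mathsf{EM}}$ (i.e., dissimilar input pairs) are typically less sensitive.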
Now, we interpret the theoretical error bounds for linear embedding queries. From Table 2(a), the error for releasing a -dimensional linear embedding query under user-level DP is , while it is for $d_{\mathsf{EM}}$-DP. When the two privacy parameters are equal, these utilities are identical, but $d_{\mathsf{EM}}$-DP offers stronger privacy. When the $d_{\mathsf{EM}}$-DP parameter is larger, the utility of $d_{\mathsf{EM}}$-DP is higher than that of user-level DP, with the two guarantees offering differing privacy semantics. Thus, in both cases, there is a clear benefit to using $d_{\mathsf{EM}}$-DP. These observations are the same in the local model.
Finally, for frequency estimation in the local model, Table 2(b) shows that the error of user-level DP is , while it is for $d_{\mathsf{EM}}$-DP. For constant and , the utility is improved. In the central model, the error of the user-level DP algorithm is while it is for $d_{\mathsf{EM}}$-DP. The algorithm under $d_{\mathsf{EM}}$-DP has the added benefit that it can be implemented in the shuffle model of privacy, which requires less trust and parallels prior work in the shuffle model (Feldman et al., 2022). There is a utility improvement for . When , we leave it as an interesting open problem whether $d_{\mathsf{EM}}$-DP can offer utility improvements over user-level DP.
(a) Linear Embedding Queries
Algorithm | Privacy Guarantee | Privacy Model | Error | Notes |
Laplace Mechanism | -user-level DP | Central, Bounded | (Lemma 6.3) | gives same utility but stronger privacy ; gives better utility but different privacy for . |
PrivEMDLinear | -$d_{\mathsf{EM}}$-DP | Central, Bounded | (Lemma 6.2) | |
(b) Frequency Estimation
Algorithm | Privacy Guarantee | Privacy Model | Error | Notes |
Hadamard Response | -user-level DP | Local, Bounded | (Lemma 6.4) | Assuming ; $d_{\mathsf{EM}}$-DP gives better utility for . |
PrivEMDItemWise | -$d_{\mathsf{EM}}$-DP | Local, Bounded | (Thm. 6.6) | |
Laplace Mechanism | -user-level DP | Central, Bounded | (Lemma 6.7) | Assuming ; $d_{\mathsf{EM}}$-DP gives better utility when . |
PrivEMDItemWise | -$d_{\mathsf{EM}}$-DP | Central, Bounded (this algorithm works in the shuffle model, which requires less trust than the central model) | (Corollary 6.8) | |
2. Background
2.1. Differential Privacy
Intuitively, DP is a property of a mechanism which ensures that its output distribution remains insensitive to changes in the data of a single individual. The standard DP guarantee, which is also known as item-level DP, considers each user to contribute only a single item to a global dataset. In this paper, we consider differential privacy at the user level. We start by considering the local model:
Definition 2.1 (Unbounded User-level Local DP (Acharya et al., 2023)).
We say a mechanism acting on a dataset satisfies -unbounded user-level local DP if, for all and all outputs
(1) |
Note that here we consider the more general unbounded data setting where the two datasets can have arbitrary sizes.
Next, we present the definition for the central model.
Definition 2.2 (Unbounded User-level Central DP (Liu et al., 2023)).
Let denote a global dataset from users where . We say , if can be obtained from by changing the dataset of a single user from to . We say a mechanism acting on a dataset satisfies -unbounded user-level central DP if, for all such that , and all outputs
(2) |
Note that there is no restriction in the sizes of the datasets in the above definition.
Next, we define metric DP (-DP) that enables the privacy guarantee to depend on the distance between the pair of inputs. We start by introducing it at the item-level (so we consider changing an item to another item ). For simplicity, we consider the local model, so the mechanism acts on just a single item:
Definition 2.3 (Local -DP (Alvim et al., 2018)).
We say satisfies -local -DP if for all data elements , and all outputs
2.2. Earth-Mover’s Distance
Notations. We denote the set of all possible datasets as . We will also view a dataset as a probability distribution defined by its normalized histogram . To do so, let denote the probability simplex indexed by —i.e. the set of all vectors such that and . For a dataset , then denotes the probability distribution defined by , meaning . A natural way to extend the notion of distance from items in to distributions in is to use the Earth-Mover’s (or -Wasserstein) distance (Givens and Shortt, 1984), which we now define. For a joint distribution , let denote the distribution conditioned on observing , and let denote the marginal distribution of . We define and similarly.
Definition 2.4.
For distributions , a joint distribution on is a coupling between and if and . We let denote the set of couplings between and .
A coupling can be viewed as a “transportation plan” between and , in the sense that if places probability mass at a point , then probability mass from at is transported to at (or vice-versa). We define the cost of a coupling as the expected transportation distance given by . The earth-mover’s distance () between is equal to the minimum possible cost of a coupling between and :
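In symbols, the definition just described can be sketched as below. The notation is assumed here for illustration: $P$ and $Q$ are the two distributions, $\Gamma(P, Q)$ is their set of couplings, $d$ is the item-level metric, and $\mathcal{X}$ is the data domain.

```latex
d_{\mathsf{EM}}(P, Q)
  \;=\; \min_{\gamma \in \Gamma(P, Q)} \; \mathbb{E}_{(x, y) \sim \gamma}\bigl[ d(x, y) \bigr]
  \;=\; \min_{\gamma \in \Gamma(P, Q)} \; \sum_{x, y \in \mathcal{X}} \gamma(x, y)\, d(x, y).
```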
Since we assume that the item-level metric is bounded by $1$, we have $d_{\mathsf{EM}} \le 1$.
Next we present the Birkhoff-von Neumann theorem, which is useful in our privacy analysis in Sec. 4.2. The theorem states that if both distributions are empirical distributions with the same number of points, then the $d_{\mathsf{EM}}$ between them is the cost of the coupling that moves the entire mass in each point to the same destination:
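As an illustration of this property, the following sketch computes the earth-mover's distance between two equal-size datasets by searching over one-to-one matchings. It is a brute-force toy, not the paper's algorithm; the function names and the one-dimensional normalized metric (with an assumed diameter of 10) are hypothetical.

```python
from itertools import permutations


def emd_equal_size(data_a, data_b, dist):
    """Earth-mover's distance between two equal-size datasets.

    For empirical distributions with the same number of points, some
    optimal coupling is a one-to-one matching, so it suffices to
    minimize the average matched distance over all permutations.
    Brute force: only practical for very small datasets.
    """
    if len(data_a) != len(data_b):
        raise ValueError("datasets must have the same size")
    m = len(data_a)
    best_total = min(
        sum(dist(data_a[i], data_b[perm[i]]) for i in range(m))
        for perm in permutations(range(m))
    )
    return best_total / m


def normalized_abs(x, y, diameter=10.0):
    """Toy item-level metric on [0, diameter], normalized to at most 1."""
    return abs(x - y) / diameter


# One of four items moves by a normalized distance of 0.1, so the
# distance is (fraction changed) * (average distance) = 0.25 * 0.1.
a = [0.0, 0.0, 5.0, 10.0]
b = [0.0, 1.0, 5.0, 10.0]
print(emd_equal_size(a, b, normalized_abs))  # approximately 0.025
```

The example also shows the two-dimensional semantics from Sec. 1: the result is the product of the fraction of data changed and the average distance moved.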
3. Definition of $d_{\mathsf{EM}}$-DP
In this section, we introduce our generalization of metric DP to the user level. We start with the local model. We use the $d_{\mathsf{EM}}$ metric to measure the distance between two datasets since it captures the intuition that changes which move smaller amounts of data by smaller distances are more sensitive (as discussed in Sec. 1).
Definition 3.1 ((Un)Bounded Local -DP).
Let be a mechanism which acts on a dataset . We say satisfies -bounded local -DP if for any two datasets such that , and for any output , we have
(4) |
If the above equation holds for all datasets , regardless of whether , we say that satisfies -unbounded local -DP.
For bounded $d_{\mathsf{EM}}$-DP, the size of the dataset is not protected, which is acceptable for applications where the amount of data is not sensitive. We explicitly differentiate between bounded and unbounded data since privacy analysis is easier under bounded $d_{\mathsf{EM}}$-DP by leveraging Lemma 2.1 (see Section 4).
In the central model, our goal is to protect changes in a single user’s dataset, transitioning from to , with a privacy guarantee that depends on . We consider the bounded data setting where each dataset has a publicly known fixed size .
Definition 3.2 (Bounded Central -DP).
Let denote a global dataset from users where . We say if can be obtained from by changing the dataset of a single user from to . We say a mechanism satisfies -bounded -DP if, for all such that , and all outputs , we have
In the above definition, the two global datasets are indistinguishable with a privacy parameter . Since we consider the bounded data setting, neither the number of total users, , nor the size of the individual datasets, , are protected.
It is important to note that the above definition cannot be directly translated to the unbounded data setting. This limitation arises from the fact that if each is allowed to have an arbitrary size, then changing a single could potentially change the entirety of in the worst-case (where user contributes the entire global dataset). This essentially reduces the central model (Def. 3.2) to the local model (Def. 3.1).
We circumvent this challenge and provide a privacy definition for the unbounded data setting in Sec. 5, by controlling the amount of data from each user.
Setting the Privacy Parameters. There are some semantic differences between the parameter in Definitions 3.1 and 3.2, and in Definitions 2.1 and 2.2. The privacy parameter is unitless. On the other hand, is not unitless – it has a unit inversely proportional to . While is usually not considered acceptable for standard DP, it is not unreasonable to set in our case. This is acceptable if a strong privacy guarantee is needed only for input pairs that are close to each other since . For all , let refer to the minimum privacy parameter that is acceptable over all data changes of the form
A -fraction of is changed by average distance .
Then, may be set as
,
and we can verify that Definition 3.1 will protect an input pair with the corresponding budget . The parameter has the same interpretation as in standard DP, and should be set .
Concrete Example. Throughout this paper, we consider a dataset of users, each of whom contributes location data points over the period of a month. We use the length of the shortest path on Earth’s surface as our metric . Suppose we want to protect a user’s location over any particular day within a radius of miles, and the user’s location over the entire time period within a distance of miles. In the normalized metric space, these distances are and , respectively (the maximum surface distance between two points on Earth is miles). They correspond to a fraction and of the metric space changing, respectively. Suppose we want to protect both of these inputs with privacy parameter . Hence, we set
This value is much higher than typical privacy parameters used in DP, and yet it is able to adequately protect the desired inputs. Finally, we will set .
4. Mechanisms for $d_{\mathsf{EM}}$-DP
Now, we describe our mechanisms for releasing queries under $d_{\mathsf{EM}}$-DP. Throughout this section, we focus on the bounded data setting, and consider both the local and central models. In Sec. 4.1, we show how to bound the sensitivity of linear queries, which can then be released with the addition of calibrated noise. Then, in Sec. 4.2, we show that we can release a noisy representation of the dataset under $d_{\mathsf{EM}}$-DP by applying any item-level metric DP mechanism to each item in the dataset, and aggregating the outputs.
4.1. Linear Queries
A non-adaptive linear query on a dataset computes the value of , where is a matrix with rows. The linearity comes from the linear transformation ; our linear queries are normalized since they operate on rather than . Such normalized queries can be used for answering the fraction of users satisfying a predicate (Blum et al., 2013). Nevertheless, one can estimate the non-normalized query by multiplying by an estimate of .
Let us represent by a function where , the th column of . The linear query can then be re-written as
(5) |
Thus, we may interpret a linear query on a dataset as the expected value of the query function over a random item from the dataset. Linear queries are simple but capable of expressing many indispensable tools in data analysis, and they are well-studied in differential privacy (Blum et al., 2013; Hardt and Talwar, 2010; Dwork et al., 2014). We will design a simple mechanism satisfying $d_{\mathsf{EM}}$-DP for releasing a linear query, based on bounding the sensitivity of the query under the $d_{\mathsf{EM}}$. The sensitivity measures the maximum change in the output, measured according to some norm on the output space, relative to a change in the inputs by a certain $d_{\mathsf{EM}}$. This is given by:
Naively, it is intractable to compute this sensitivity since there are exponentially many datasets of a given size. Additionally, this sensitivity might not always be bounded. For instance, consider two points that are close under the item-level metric, but whose query values are very far from each other. In this case, we cannot bound the sensitivity, since the two distributions which put all their mass on the first and second point, respectively, will have a small $d_{\mathsf{EM}}$ but a large change in the query output. However, if the query function is Lipschitz, meaning
then it is possible to bound using . We do this by observing that, for any coupling between , each mass that moves a distance may change by up to (based on Eq. (5)). This allows us to compute the following bound.
Theorem 4.1.
Let be a linear query of the form in (5), where is -Lipschitz. Then, we have .
Remarks.
The above theorem implies that a reasonable upper bound for can be made when the query function is smooth in terms of . In Sec. 6.1 we outline a special type of linear query for which this is the case. Additionally, the aforementioned example illustrates that this sensitivity analysis is tight. This means that , in addition to defining the privacy semantics, also influences the types of queries that can be answered with good utility.
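Proof Sketch. Consider the minimum-cost coupling between the two distributions. The change in the query output can then be bounded by the cost of transporting one distribution onto the other, multiplied by the amount that the query function can change per unit of distance when each mass is transported, which is at most the Lipschitz constant. ∎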
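Written out, the argument in the proof sketch is the following chain of inequalities. The notation is assumed for illustration: $F(P) = \mathbb{E}_{x \sim P}[f(x)]$ is the linear query, $f$ is $L$-Lipschitz with respect to the item-level metric $d$, and $\gamma^{*}$ is a minimum-cost coupling between $P$ and $Q$.

```latex
\begin{align*}
\bigl\| F(P) - F(Q) \bigr\|
  &= \Bigl\| \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \Bigr\| \\
  &= \Bigl\| \mathbb{E}_{(x, y) \sim \gamma^{*}}\bigl[ f(x) - f(y) \bigr] \Bigr\|
     && \text{(the marginals of $\gamma^{*}$ are $P$ and $Q$)} \\
  &\le \mathbb{E}_{(x, y) \sim \gamma^{*}}\bigl[ \| f(x) - f(y) \| \bigr]
     && \text{(convexity of the norm)} \\
  &\le L \cdot \mathbb{E}_{(x, y) \sim \gamma^{*}}\bigl[ d(x, y) \bigr]
     && \text{($f$ is $L$-Lipschitz)} \\
  &= L \cdot d_{\mathsf{EM}}(P, Q).
\end{align*}
```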
Using the upper bound on , we follow a well-known approach for privately releasing a point with known sensitivity under a norm: Sample a point uniformly from the ball , and release , where , the Gamma distribution with shape and scale (Hardt and Talwar, 2010). Here, is a scale parameter that may be different in the central or local model, since the sensitivity of is less in the bounded central model. This mechanism, PrivEMDLinear, is outlined in Alg. 1. Combining Thm. 4.1 with a standard privacy analysis, we can show that PrivEMDLinear satisfies - DP.
Lemma 4.2.
PrivEMDLinear (Alg. 1) with scale satisfies -unbounded local -DP and with scale satisfies -bounded central -DP.
Remarks. When using the -norm, PrivEMDLinear becomes the multidimensional Laplace mechanism. We may instantiate PrivEMDLinear with any noise mechanism that preserves -metric DP in the space . In particular, under the -norm, we can add Gaussian noise of width (Dwork et al., 2014), and this will give better utility than adding noise based on the Gamma distribution at the cost of -bounded local DP.
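The noise-addition step can be sketched as follows for the $\ell_2$ norm. This is a hedged illustration rather than the paper's Alg. 1: the function names are hypothetical, the Gamma shape `dim + 1` follows the standard norm-ball mechanism of Hardt and Talwar (2010), and the scale `lipschitz / privacy_param` is our assumption for the local model.

```python
import math
import random


def sample_unit_l2_ball(dim, rng):
    """Draw a point uniformly from the unit l2 ball in R^dim."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random() ** (1.0 / dim)   # radius density ~ r^(dim-1)
    return [radius * v / norm for v in g]


def k_norm_noise(dim, scale, rng):
    """Noise with density proportional to exp(-||z||_2 / scale).

    A uniform ball sample is stretched by a Gamma(dim + 1, scale)
    radius (assumed parameters, following Hardt and Talwar).
    """
    z = sample_unit_l2_ball(dim, rng)
    r = rng.gammavariate(dim + 1, scale)   # (shape, scale)
    return [r * v for v in z]


def release_linear_query(dataset, f, lipschitz, privacy_param, rng):
    """Release the average of f over the dataset plus calibrated noise.

    f maps an item to a list of floats and is assumed to be
    `lipschitz`-Lipschitz w.r.t. the normalized item metric.
    """
    dim = len(f(dataset[0]))
    mean = [sum(f(x)[j] for x in dataset) / len(dataset) for j in range(dim)]
    noise = k_norm_noise(dim, lipschitz / privacy_param, rng)
    return [m + e for m, e in zip(mean, noise)]


rng = random.Random(0)
# Toy query: normalized distance of each item from a reference point 0.5,
# which is 1-Lipschitz by the triangle inequality.
print(release_linear_query([0.1, 0.2, 0.3], lambda x: [abs(x - 0.5)], 1.0, 5.0, rng))
```

Swapping `sample_unit_l2_ball` for a sampler over a different norm ball would give the corresponding variant; the $\ell_2$ choice here is only for concreteness.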
Concrete Example. In our location example, consider releasing the average distance of each point in from a particular city in the local model. This can be expressed with , where is the city; by the triangle inequality this is -Lipschitz. PrivEMDLinear could then be applied to release plus noise of expected magnitude per user; the total noise will be , corresponding to an error of just miles.
4.2. Unordered Release of Item-wise Queries
We now consider the problem of directly releasing a private query applied to each item in the dataset. This can provide a more fine-grained result than the aforementioned linear queries, which output the average over all the items. We release the query results as an unordered list to take advantage of the fact that subsequent computation (such as aggregation) often does not depend on the ordering of the data (Feldman et al., 2022). Specifically, our second mechanism, PrivEMDItemWise, applies a mechanism satisfying item-level metric DP to each item individually. We use this mechanism as a black box, making PrivEMDItemWise completely general-purpose. For example, one could let it be a private item-release mechanism (a number of metric DP mechanisms for releasing items in specific applications are mentioned in Sec. 7) and use PrivEMDItemWise to form a histogram of the dataset. It could also be a classifier, and PrivEMDItemWise can then release a simplified representation of the dataset.
Once PrivEMDItemWise applies to each element in the dataset, it shuffles the results (to remove any ordering of the data) and outputs the shuffled list. This appears in Alg. 2, and a precursor to this algorithm appeared in (Fernandes et al., 2019).
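The apply-then-shuffle structure can be sketched as follows. This is only an illustration of the control flow, not a reproduction of Alg. 2: the item-level mechanism is passed in as a black box, and `toy_mechanism` is a hypothetical placeholder whose noise level is not calibrated to any privacy parameter.

```python
import random


def priv_item_wise(items, mechanism, rng):
    """Apply a black-box item-level mechanism to every item, then shuffle."""
    outputs = [mechanism(x, rng) for x in items]
    # Shuffling removes any link between an output and its input position.
    rng.shuffle(outputs)
    return outputs


def toy_mechanism(x, rng, domain_size=4, keep_prob=0.75):
    """Hypothetical placeholder: keep x with probability keep_prob,
    otherwise report a uniformly random element of the domain."""
    if rng.random() < keep_prob:
        return x
    return rng.randrange(domain_size)


rng = random.Random(0)
released = priv_item_wise([0, 0, 1, 3, 2, 2], toy_mechanism, rng)
```

Because the output is an unordered list, downstream steps such as building a histogram can consume `released` directly.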
As PrivEMDItemWise does not hide the size of , we show it satisfies bounded DP. We use the following argument: for a neighboring dataset , by Lemma 2.1 there exists a permutation satisfying (3). Observe that we release the query responses in an unordered fashion by explicitly shuffling them. This allows us to pair up the element with and analyze the privacy guarantee of releasing versus . Prior work does this with composition (Fernandes et al., 2019).
However, composition is not the right tool for obtaining a tight privacy analysis. The reason is that composition assumes that each is output sequentially, and in particular it is possible to identify which point came from and which came from . In our case, we output an unordered list, and it is not possible to link which point came from an index . Based on this observation, our key idea is to leverage privacy amplification by shuffling (Feldman et al., 2022) instead, which can yield a much smaller privacy parameter when the output is order invariant.
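To give a sense of the scale of amplification, the sketch below evaluates the closed-form bound commonly quoted from Feldman et al. (2022) for the standard setting, in which each of n users holds one item and applies an ε0-local DP randomizer before shuffling. It is not the metric generalization developed in this section, and the constants should be checked against the cited paper before being relied upon. The function name `shuffle_amplified_eps` is hypothetical.

```python
import math


def shuffle_amplified_eps(eps0, n, delta):
    """Central eps after shuffling n reports from an eps0-local DP randomizer.

    Assumption: closed-form bound as commonly quoted from Feldman et al. (2022),
    valid only when eps0 <= log(n / (16 * log(2 / delta))).
    """
    if eps0 > math.log(n / (16.0 * math.log(2.0 / delta))):
        raise ValueError("eps0 is too large for this bound to apply")
    e = math.exp(eps0)
    inner = 8.0 * math.sqrt(e * math.log(4.0 / delta)) / math.sqrt(n) + 8.0 * e / n
    return math.log1p((e - 1.0) / (e + 1.0) * inner)


eps_small_n = shuffle_amplified_eps(1.0, 10_000, 1e-6)
eps_large_n = shuffle_amplified_eps(1.0, 1_000_000, 1e-6)
```

The amplified parameter shrinks as the number of shuffled reports grows, which is the effect that a composition-based analysis cannot capture.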
In particular, our core technical contribution is to generalize the privacy amplification by shuffling to -DP. Specifically, we analyze the privacy guarantee between two multisets and when satisfies -DP, in terms of the vector of distances . The parameters we will be interested in are , or the number of nonzero elements in , and since in our different privacy models we will be able to bound both. Formally:
Theorem 4.3.
Suppose that is a metric space such that , and that is an ( -DP algorithm. Let and be two vectors, and we define . Let be a constant, and suppose it holds that . Then, for all outputs , we have that
where
Remarks. In particular, if , the above bound is roughly , which grows with just (as ). The standard result is only applicable when satisfies -local DP, and just is changed to (since each user owns just one item). We generalize the state-of-the-art privacy amplification by shuffling analysis of Feldman et al. (Feldman et al., 2022) to -DP, and our result may be of independent interest. We state our result generally in terms of the vector since we will be applying it with different known bounds on these quantities.
Proof Sketch. We first generalize the analysis of amplification by shuffling to the datasets and , where . We show the resulting privacy parameter is given by where
Then, we apply group privacy times to show the general result holds with parameter . The function is concave so the worst-case amplification is simply . ∎
Comparison with Composition. Analyzing Thm. 4.3 using the state-of-the-art composition results (Kairouz et al., 2015) and gives us
However, we cannot form a satisfying bound on the -norm of —it is only possible to say which is tight when e.g. = 1. The bound is thus missing the factor of —composition here does not leverage the fact that all items are released in a random order.
Combining (3) and Thm. 4.3, we obtain an improved privacy guarantee for PrivEMDItemWise. We may state this guarantee in both the bounded local and central models. In the local model, recall that each user is applying PrivEMDItemWise to their data. In the central model, the central server applies PrivEMDItemWise to the entire dataset, and thus releases the frequencies of itemwise queries.
Theorem 4.4.
For any , PrivEMDItemWise shown in Alg. 2 satisfies bounded local - DP, where
and
Similarly, PrivEMDItemWise satisfies bounded central - DP, where
Remarks. Thm. 4.4 gives the tightest possible privacy parameters, but we may also give an asymptotic formula as follows. For desired privacy parameters , one should set
(6) |
and respectively
(7) |
in order to achieve DP in the bounded local (respectively central) model.
Assuming , this means that the privacy parameter will be roughly (resp. ) for releasing the samples; this is asymptotically better than the analysis with composition which would require setting (resp. ). Even with higher for , the budget is still (resp. ), which are both significant asymptotic improvements.
Concrete Example. Our improved analysis makes the most significant improvements in the central model. Here, we would have to apply PrivEMDItemWise with for each of the location data points per user. Using the guarantee of Thm. 4.4, it is possible to set – a several orders of magnitude improvement.
5. Generalization to Unbounded DP
The mechanisms presented so far face two challenges when applied to the unbounded data setting. First, a direct privacy analysis of the unbounded data setting is difficult since we cannot leverage Lemma 2.1, which significantly simplifies the analysis (for the bounded data setting). Second, and more importantly, the unbounded central model offers no utility improvement over the local model. In the worst-case scenario, a single user may contribute nearly all the data in the dataset, effectively reducing any algorithm to satisfying only local -DP. This issue has been noted in previous work in user-level DP (Liu et al., 2023).
In this section, we tackle these challenges by showing a blackbox reduction from unbounded -DP to bounded -DP. Our reduction works in both the local and central models. The key idea of the reduction is to smoothly project a dataset of any size to a dataset of a given fixed size, such that the distance between any two input datasets and the distance between their projections are roughly the same. Then, it is easy to show that applying any bounded -DP algorithm to the smooth projections is sufficient to guarantee unbounded -DP for the entire scheme.
Our proposed projection mechanism is smooth in a near-multiplicative sense, albeit with a small additive penalty when the between the two datasets is small. We account for this subtlety by slightly modifying the privacy semantics of -DP in the unbounded setting so that they do not grow arbitrarily strong as . Instead, we introduce a distance threshold such that for all enjoys a uniform privacy guarantee of . This refined privacy definition, termed discrete -DP, is formalized (in the local model) as:
Definition 5.1.
[Discrete Local -DP] Let be a mechanism which acts on a dataset . We say satisfies -discrete local -DP if, for any two datasets such that ,
Like in standard DP, the above definition uses the parameter because it is a unitless privacy parameter—the unit of the metric is expressed in the parameter .
Fact 5.1.
For any such that , satisfies
Fact 5.1 is implied from Definition 5.1 by a direct application of group privacy (Dwork, 2006) (proven in Lemma A.2). This guarantee can be interpreted as providing -DP at the granularity of units of distance . Note that for all , we have . Thus, -discrete local DP is roughly equivalent to -unbounded local -DP, except if . In this case, the privacy parameter will not go below . This adjustment does not significantly alter the overall privacy semantics of -DP; one may simply set as described in Sec. 4.
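The behaviour described above can be illustrated numerically. The sketch below assumes that the group-privacy bound takes the form of the base parameter multiplied by the number of threshold-sized steps needed to cover the distance, rounded up; this form is an assumption made for illustration, and the exact statement is the one in Fact 5.1. The names `eps`, `threshold`, and `discrete_privacy_level` are hypothetical.

```python
import math


def discrete_privacy_level(eps, threshold, distance):
    """Privacy parameter between two datasets at the given distance.

    Assumption: group privacy charges eps once per threshold-sized step,
    and the parameter never drops below eps, even for very close datasets.
    """
    steps = max(1, math.ceil(distance / threshold))
    return eps * steps


close = discrete_privacy_level(eps=0.5, threshold=2.0, distance=0.1)
far = discrete_privacy_level(eps=0.5, threshold=2.0, distance=5.0)
```

Under this assumption, datasets closer than the threshold all receive the same guarantee, while the parameter grows roughly linearly in the distance beyond it.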
In the central model, we make a similar definition:
Definition 5.2.
[Discrete Central -DP] Let denote a global dataset from users (of any size). We say if can be obtained from by changing to for just one user , such that . We say a mechanism satisfies -discrete central -DP if, for all such that , we have
As before, -discrete central -DP is roughly equivalent to -bounded central -DP when all user datasets have size . However, we will see that Definition 5.2 is the appropriate generalization to unbounded user datasets under our projection mechanism which is described below.
Our projection mechanism first generates a fixed number of samples with replacement from each user’s dataset . Next, it applies a blackbox bounded -DP mechanism, , to the projected dataset . By blackbox application we mean that can be any arbitrary mechanism as long as it satisfies bounded -DP. The projection mechanism, in the central model, is outlined in Alg. 3 (in the local model, each user samples from their own , so we would simply have ). The smoothness of the projection from to follows from the following claim:
Lemma 5.2.
Let be probability distributions, and let be the minimum cost coupling between . Let denote i.i.d. samples from , and let and . Then,
Proof Sketch:
Intuitively, for any coupling between and , we can simulate sampling times from by first sampling from , and then sampling . This view shows there is a transportation plan from and of expected cost . Using Bernstein’s inequality, we can show with probability at least , is upper bounded by . ∎
Thus, sampling is a smooth projection from to , with the caveat that there is an additive term that comes into play if is too small to guarantee convergence. This is the reason for our relaxation to discrete -DP.
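The projection step itself is simple to express. The following sketch assumes each user's dataset is a non-empty list and that the bounded mechanism is supplied as a black-box callable; the names `project_user` and `bounded_reduction` are hypothetical and the code is not a transcription of Alg. 3.

```python
import random


def project_user(dataset, m, rng):
    """Draw m samples with replacement from one user's dataset."""
    if not dataset:
        raise ValueError("cannot sample from an empty dataset")
    return rng.choices(dataset, k=m)


def bounded_reduction(user_datasets, m, bounded_mechanism, rng):
    """Project every user to exactly m items, then run the bounded mechanism.

    The projection provides no privacy on its own; privacy comes entirely
    from bounded_mechanism, which is treated as a black box here.
    """
    projected = [project_user(d, m, rng) for d in user_datasets]
    return bounded_mechanism(projected)


rng = random.Random(0)
users = [[1, 2, 3], [7], [4, 4, 5, 6, 9, 9, 9]]
# Identity "mechanism", used only to inspect the projected datasets.
projected = bounded_reduction(users, m=5, bounded_mechanism=lambda p: p, rng=rng)
```

After projection every user holds exactly `m` items, regardless of how many they started with.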
In summary, we have the following privacy guarantee:
Theorem 5.3.
Remarks.
If the number of samples is at least , then Thm. 5.3 shows there is only a small multiplicative cost to considering just bounded -DP (in the respective local or central model). In this case, the bounded algorithm will need to roughly satisfy -bounded -DP, and this is roughly the same as the resulting -discrete DP algorithm. There is no privacy disadvantage to taking a large number of samples, and the utility may also increase due to more information about the dataset being captured (recall that the projection does not provide privacy; privacy is provided by ). Thus, the number of samples may be set to be large, with computational cost being the only constraint.
Proof Sketch.
Let and denote the sampled data for and , respectively. By convexity of DP, it suffices to analyze the privacy guarantee for any coupling between the random variables . In particular, we use the optimal coupling between to define this coupling. The resulting privacy parameter is bounded in terms of the expected cost by Lemma 5.2.
∎
BoundedEMDReduction can be used to bound the contribution of each user in the central setting, allowing us to apply the simpler Definition 3.2. In addition, it can be used to adapt PrivEMDItemWise to the unbounded data setting. One caveat is that utility may not be preserved if the number of user samples is too small or, in the central setting, if the users' data distributions are heterogeneous. In particular, if users have varying numbers of samples, each from different distributions, applying BoundedEMDReduction equalizes the frequency of all user data. Nonetheless, it is often reasonable to assume the users have homogeneous data distributions (Liu et al., 2020; Acharya et al., 2023).
6. Applications of Proposed Mechanisms
In this section, we compare the utilities of PrivEMDLinear and PrivEMDItemWise to existing mechanisms satisfying user-level DP. For simplicity, we assume the bounded data setting.
Notations.
We define the following quantities of a real matrix . First, the operator norm of , denoted by , is given by . We can show that is equal to the maximum norm of a column of . Furthermore, , more commonly written as , is the spectral norm and is equal to the maximum singular value of . Matrix norms satisfy the important submultiplicative property, which states that for any matrices and .
Next, let denote the identity matrix, and again suppose that with . If has full row rank, then there exists a matrix such that . We call such a matrix a right inverse of . Finally, for and , let denote the Kronecker product of two real matrices, whose entry in is .
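For concreteness, the two constructions just defined can be checked on small matrices. The sketch below uses plain nested lists and the standard block definition of the Kronecker product, in which the block at position (i, j) is the (i, j) entry of the first matrix times the second matrix; the helper names `kron` and `matmul` are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    r, s = len(b), len(b[0])
    out = [[0] * (len(a[0]) * s) for _ in range(len(a) * r)]
    for i, row_a in enumerate(a):
        for j, a_ij in enumerate(row_a):
            for k in range(r):
                for l in range(s):
                    out[i * r + k][j * s + l] = a_ij * b[k][l]
    return out


def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# A 2 x 3 matrix with full row rank and one of its right inverses.
A = [[1, 0, 0], [0, 1, 0]]
A_right_inv = [[1, 0], [0, 1], [0, 0]]
identity_check = matmul(A, A_right_inv)

K = kron([[1, 2]], [[0, 1], [1, 0]])
```

A right inverse is generally not unique: changing the last row of `A_right_inv` to any values still gives the identity when multiplied on the right of `A`.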
6.1. Linear Embedding Queries
Many applications of metric DP assume there is an embedding function , which maps an item to its semantic representation in (each of the examples in Sec. 1 have an embedding representation). The metric is then the distance between and ; in this section, we consider the distance.
Since also communicates information about the item , we define linear embedding queries as linear queries applied to an item’s embedding . Formally,
where for a matrix (meaning that is a linear function). Assume each row of is normalized so that . Each coordinate of is equal to . This can be interpreted as the average similarity of each item in with the vector . Our analysis will assume that and , which is usually the case in practice. Note that we may write as , where is the collection of embedding vectors in .
6.1.1. Local Model
Existing user-level DP solutions ask each user to privately release the query . The aggregator computes the average . The current best solutions have the following error guarantee (Duchi et al., 2013; Bassily, 2019):
Lemma 6.1.
(From Proposition 3 in (Duchi et al., 2013)) There exists an -bounded user-level DP in the local model algorithm which produces an estimate such that, for all ,
To interpret the term , we can use the inequality , which is tight for certain choices of and . By assumption, we know and , both of which can also be tight. The bound is thus .
On the other hand, for -DP, by Thm. 4.1, we know that is at most the Lipschitz constant of given by:
Hence, each user can apply PrivEMDLinear with the Gaussian mechanism with , which gives the following utility guarantee:
Lemma 6.2.
There exists an -bounded -DP algorithm in the local model which produces an estimate such that, for all ,
Remarks. We use the Gaussian mechanism because it performs better under the error than the pure -bounded local DP illustrated in Alg. 1. However, this forces us to use . We leave it as an interesting open question whether similar error can be achieved with . Compared to Lemma 6.1, the above bound differs by a factor of (and small terms)—when , we know that -DP provides better privacy. When , PrivEMDLinear offers lower error than Lemma 6.1.
6.1.2. Central Model
In the central model, linear query release has been extensively studied, and optimal algorithms under item-level DP are known (Hardt and Talwar, 2010; Bhaskara et al., 2012; Nikolov et al., 2013). These algorithms can be easily adapted to user-level DP, which will provide the following guarantee:
Lemma 6.3.
(From Thm. 1.3 in (Hardt and Talwar, 2010)) There exists an -bounded user-level DP algorithm in the central model which produces an estimate such that, for all ,
6.2. Frequency Estimation
Here, we evaluate the error of PrivEMDItemWise for private frequency estimation, where the goal is to obtain a private estimate of the (normalized) histogram . This problem has been extensively studied in privacy (Hay et al., 2009; Xu et al., 2013; Suresh, 2019; Kairouz et al., 2016; Acharya et al., 2018; Chen et al., 2020; Acharya et al., 2023); the high-level goal is to minimize the distance between and . However, when the data domain is a general metric space , not all perturbations to are the same. Thus, we will measure similarity between with , as we do in our privacy definition.
To ease the analysis while still demonstrating the effectiveness of our mechanisms, we fix to be the following “clustered” metric space. Let , where , and . For some , the distance is given by the following:
We can think of this metric space as a collection of clusters consisting of the items for each . Points in a cluster are more related, being at distance apart, than items in two different clusters, which are distance apart. We will assume that privacy is only needed between two items in the same cluster, so we will set .
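The clustered metric can be written down directly. The sketch below assumes that items are represented as (cluster, index) pairs and uses the hypothetical parameter names `d_within` and `d_across` for the two distances described above; the concrete values are illustrative only.

```python
def clustered_distance(x, y, d_within, d_across):
    """Distance in the clustered space; items are (cluster, index) pairs.

    Assumption: identical items are at distance 0, distinct items in the
    same cluster at d_within, and items in different clusters at d_across.
    """
    if x == y:
        return 0.0
    if x[0] == y[0]:
        return d_within
    return d_across


same_item = clustered_distance((0, 1), (0, 1), d_within=1.0, d_across=10.0)
same_cluster = clustered_distance((0, 1), (0, 2), d_within=1.0, d_across=10.0)
other_cluster = clustered_distance((0, 1), (3, 1), d_within=1.0, d_across=10.0)
```

The triangle inequality holds for this function whenever `d_within` is at most twice `d_across`, which covers the case of interest where within-cluster distances are the smaller ones.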
6.2.1. Algorithms in the Local Model
At a high level, in the local model each user applies a private mechanism (with discrete and ) to each sample and releases it. The central server forms an aggregate vector . Let denote the transition probability matrix of ; we have by linearity of expectation that . Assuming that has a right inverse , the central server returns the estimate , which is unbiased. All previous work in distribution estimation under local DP can be expressed in this way (Kairouz et al., 2016; Acharya et al., 2018; Chen et al., 2020; Acharya et al., 2023). We summarize this in Alg. 4.
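The randomize, aggregate, and invert pipeline can be illustrated with the standard k-randomized response, whose transition matrix has a closed-form inverse. This is a sketch of the general recipe under that assumption, not the mechanism analyzed later in this section, and it ignores how the per-sample budget `eps` would be derived from a user-level budget. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def krr_probs(k, eps):
    """Diagonal (p) and off-diagonal (q) entries of the k-RR transition matrix."""
    e = math.exp(eps)
    return e / (e + k - 1), 1.0 / (e + k - 1)


def krr(x, k, eps, rng):
    """Report x with probability p, otherwise a uniform different symbol."""
    p, _ = krr_probs(k, eps)
    if rng.random() < p:
        return x
    other = rng.randrange(k - 1)
    return other if other < x else other + 1


def debias(reports, k, eps):
    """Unbiased frequency estimate; equivalent to applying the inverse of
    the transition matrix to the empirical distribution of the reports."""
    p, q = krr_probs(k, eps)
    n = len(reports)
    counts = Counter(reports)
    return [(counts.get(v, 0) / n - q) / (p - q) for v in range(k)]


rng = random.Random(0)
data = [0] * 600 + [1] * 300 + [2] * 100
reports = [krr(x, 3, 1.0, rng) for x in data]
estimate = debias(reports, 3, 1.0)
```

The estimate sums to one by construction, although individual entries may be slightly negative because the inverse is applied to a noisy empirical vector.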
The state-of-the-art approach for frequency estimation is the Hadamard response (Acharya et al., 2018; Chen et al., 2020), which is based on Hadamard matrices (which form a robust encoding of ). Specifically, the matrix is given by , where is a Hadamard matrix and are constants chosen so that is normalized and that each element is proportional to either or . This mechanism has the following utility:
Lemma 6.4.
(From Thm. 3.1 in (Chen et al., 2020)) There exists a mechanism such that FreqEstLocal satisfies -bounded user-level DP and returns an estimator such that
Remarks. In order to adapt the Hadamard response to the user-level setting, we suppose each user applies to each sample with privacy budget , and -user level DP follows from composition (Kairouz et al., 2015). The term is a sampling error which does not depend on , and the second term is the cost of privacy. The cost of privacy usually dominates, and furthermore its dependence on is not significant. This is because reduces both the effect of each sample on the final estimate, and the privacy budget per sample, countervailing itself.
Under -DP, we can use a transition probability matrix that is less noisy. Specifically, each user may apply to their dataset using PrivEMDItemWise, and by Thm. 4.3, needs to satisfy -DP. Note that for our choice of and , this is often a less restrictive requirement than (-DP) since . We first derive an error bound on FreqEstLocal in terms of (specifically its right inverse), which we will then optimize later.
Theorem 6.5.
For the metric space and any mechanism satisfying ( -DP where ( is specifically defined in Thm. 4.4), FreqEstLocal satisfies -bounded -DP in the local model and returns an estimator such that
(8) |
where is a right inverse of , , and is a column vector of s indexed by .
Remarks.
The first term in the RHS of (8) is the cost of equalizing mass within clusters, and the second term is the cost of equalizing the mass across clusters (since the matrix essentially projects to act between clusters). For small , the first term approaches , and the latter term may also approach because will not often map a point outside its cluster under -DP (and thus, ).
Proof Sketch. Our bound forms a transportation plan between and by first mapping the mass within each cluster arbitrarily, which incurs at most cost, and then equalizing the mass between clusters, which incurs at most cost. Both of the error terms can then be bounded by viewing as the sum of independent variables drawn from a Dirichlet distribution with mean , and applying a standard variance analysis.∎
We apply Thm. 6.5 with being a generalization of -randomized response (Kairouz et al., 2016) which is adapted to -DP. Specifically, has probabilities given by, for each ,
Using this mechanism, the higher-order terms of Eq. (8) will approach with , as follows:
Theorem 6.6.
For the metric space , FreqEstLocal with the mechanism satisfies - DP in the local model and returns an estimator such that
where is defined in Eq. (6).
Remarks: Specifically, for our choice of , we have
Similar to Lemma 6.4, the term is the cost of sampling. The term is the cost of privacy, and it dominates when . We will compare Thm. 6.6 with Lemma 6.4 when —then the cost of privacy dominates. Specifically, the cost of Lemma 6.4 is , and the cost of Thm. 6.6 is . Given , the error will be smaller if
i.e. if there is a gap between of size at least . This is possible if , and for these instances DP offers better utility than user-level DP. In Thm. 6.6, the super-linear factor of comes from the fact that the -RR is suboptimal in terms of (Acharya et al., 2018).
6.2.2. Algorithms in the Central Model
The Laplace mechanism has been shown to be optimal for many instances of frequency estimation (Dwork et al., 2014). To attain user-level privacy, the baseline Laplace mechanism releases, for each , the values , where . The distribution function is then the normalization of . This gives us the following guarantees.
Lemma 6.7.
For the metric space , the Laplace mechanism described above satisfies -user level DP, and produces an estimate such that
Again, this utility does not depend on , since each user contributes fraction of the whole dataset which is independent of . Consistent with central DP, the error decreases with , which is much faster than the in the local model.
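The baseline just described can be sketched as follows. The noise scale is left as a parameter because its calibration to the user-level sensitivity and the privacy budget is not reproduced here; clipping negative noisy counts to zero before normalizing is an assumption of this sketch, as are the function names.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_histogram(items, k, scale, rng):
    """Noisy normalized histogram over the domain {0, ..., k - 1}.

    Assumption: `scale` is calibrated by the caller; negative noisy counts
    are clipped to zero before normalization.
    """
    counts = [0] * k
    for x in items:
        counts[x] += 1
    noisy = [max(0.0, c + laplace_noise(scale, rng)) for c in counts]
    total = sum(noisy)
    if total == 0.0:
        # Degenerate case: fall back to the uniform distribution.
        return [1.0 / k] * k
    return [v / total for v in noisy]


rng = random.Random(0)
hist = laplace_histogram([0, 0, 1, 2, 2, 2], k=4, scale=2.0, rng=rng)
```

Clipping and normalizing are post-processing steps, so they do not affect the privacy guarantee of the noisy counts.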
It is possible to adapt FreqEstLocal to bounded central -DP by simply pretending to be one user who holds the global dataset . The privacy analysis of Thm. 4.4, and the utility analysis of Thm. 6.6 may be combined for the following corollary:
Corollary 6.8.
Remarks. In particular
The same sampling error is present, but the cost of privacy is reduced from a dependence in Thm. 6.6 to just . To compare just the cost of privacy in Corollary 6.8 to Lemma 6.7, we will assume we are in the regime . Then, the cost in Corollary 6.8 is . The error of Corollary 6.8 will be less when
Thus, the utility is improved when is bigger than by a factor of at least , which is achieved when . One final advantage of Corollary 6.8 is that it may be implemented in the shuffle model of DP, which requires less trust than the central model. This parallels prior results of the shuffle model of DP (Feldman et al., 2022).
7. Related Work
Item-level DP. DP was originally considered at the item-level (Dwork, 2006), where a privacy guarantee is made when one item in the sensitive dataset is changed. Of the most relevance to our setting are results in distribution estimation (Hay et al., 2009; Xu et al., 2013; Suresh, 2019); these results study more complex estimation problems than frequency. Also, we consider linear query release, for which there is a long line of work (Hardt and Talwar, 2010; Bhaskara et al., 2012; Nikolov et al., 2013; Blum et al., 2013; Li et al., 2015). The mechanism in (Hardt and Talwar, 2010) is often optimal and easy to adapt to our setting; we compare our algorithms with it.
User-level DP. With a vast increase in data collected about users, user-level privacy is gaining more interest (Amin et al., 2019; Narayanan et al., 2022; Bassily and Sun, 2023; Levy et al., 2021; Liu et al., 2020; Cummings et al., 2022). The most relevant work to ours is on user-level private mean estimation (Cummings et al., 2022) and histogram estimation (Liu et al., 2023; Acharya et al., 2023), though these problems are again more complex than the ones we study. Another related area is the problem of deciding the amount of data to pick from each user in cases where the users have different amounts of data (Amin et al., 2019; Liu et al., 2023; Cummings et al., 2022), which is related to our unbounded DP setup. These techniques apply to more specialized settings than our general blackbox reduction, and they are not immediately comparable.
Local DP. Local DP has also received a lot of attention recently. The most relevant results to our work are on locally-private linear query release (Duchi et al., 2013; Bassily, 2019) and distribution estimation (Duchi et al., 2013; Kairouz et al., 2016; Acharya et al., 2018; Chen et al., 2020; Acharya et al., 2023). We directly compare our work to the optimal algorithms in (Bassily, 2019) and (Chen et al., 2020) for our problems, which can be adapted to user-level DP easily. The other related line of work is privacy amplification from the local model to the central model, given access to a trusted shuffler (Erlingsson et al., 2019; Girgis et al., 2021; Feldman et al., 2022). We extend the state-of-the-art analysis in (Feldman et al., 2022) to general metric DP.
Metric DP.
Metric DP was first proposed in (Chatzikokolakis et al., 2013) in the central model. In the local model, this has led to work on releasing numeric data (Roy Chowdhury et al., 2022), location data (Andrés et al., 2013; Bordenabe et al., 2014; Chatzikokolakis et al., 2015; Weggenmann and Kerschbaum, 2021) and text (Feyisetan et al., 2019, 2020; Feyisetan and Kasiviswanathan, 2021; Imola et al., 2022). Unlike these works, we consider privacy in a general metric space. The most related work is that of (Fernandes et al., 2019), which proposes metric DP based on the for releasing text embeddings. As explained in the introduction, we consider a much more general setting than (Fernandes et al., 2019).
8. Conclusion
We have proposed metric DP at the user level using the earth-mover’s distance . This captures both the magnitude and structural aspects of changes in the data, resulting in a tailored privacy semantic. We have designed two novel privacy mechanisms under -DP which improve the utility over standard DP. Additionally, we have shown that general (unbounded) -DP can be reduced to the simpler (bounded) case where all users have the same amount of data. Finally, we have demonstrated that -DP .
References
- Abowd (2018) John M Abowd. 2018. The US Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2867–2867.
- Acharya et al. (2023) Jayadev Acharya, Yuhan Liu, and Ziteng Sun. 2023. Discrete distribution estimation under user-level local differential privacy. In International Conference on Artificial Intelligence and Statistics. PMLR, 8561–8585.
- Acharya et al. (2018) Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. 2018. Communication Efficient, Sample Optimal, Linear Time Locally Private Discrete Distribution Estimation. CoRR abs/1802.04705 (2018). arXiv:1802.04705 https://arxiv.org/abs/1802.04705
- Alvim et al. (2018) Mário Alvim, Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Anna Pazii. 2018. Invited Paper: Local Differential Privacy on Metric Spaces: Optimizing the Trade-Off with Utility. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF). 262–267. https://doi.org/10.1109/CSF.2018.00026
- Amin et al. (2019) Kareem Amin, Alex Kulesza, Andres Munoz, and Sergei Vassilvitskii. 2019. Bounding user contributions: A bias-variance trade-off in differential privacy. In International Conference on Machine Learning. PMLR, 263–271.
- Andrés et al. (2013) Miguel E Andrés, Nicolás E Bordenabe, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. 2013. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. 901–914.
- Balle et al. (2018) Borja Balle, Gilles Barthe, and Marco Gaboardi. 2018. Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in neural information processing systems 31 (2018).
- Barthe and Olmedo (2013) Gilles Barthe and Federico Olmedo. 2013. Beyond differential privacy: Composition theorems and relational logic for f-divergences between probabilistic programs. In International Colloquium on Automata, Languages, and Programming. Springer, 49–60.
- Bassily (2019) Raef Bassily. 2019. Linear queries estimation with local differential privacy. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 721–729.
- Bassily and Sun (2023) Raef Bassily and Ziteng Sun. 2023. User-level private stochastic convex optimization with optimal rates. In International Conference on Machine Learning. PMLR, 1838–1851.
- Bhaskara et al. (2012) Aditya Bhaskara, Daniel Dadush, Ravishankar Krishnaswamy, and Kunal Talwar. 2012. Unconditional differentially private mechanisms for linear queries. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing. 1269–1284.
- Blum et al. (2013) Avrim Blum, Katrina Ligett, and Aaron Roth. 2013. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM) 60, 2 (2013), 1–25.
- Bordenabe et al. (2014) Nicolás E. Bordenabe, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. 2014. Optimal Geo-Indistinguishable Mechanisms for Location Privacy. Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (Nov. 2014), 251–262. https://doi.org/10.1145/2660267.2660345 arXiv: 1402.5029.
- Chatzikokolakis et al. (2013) Konstantinos Chatzikokolakis, Miguel E Andrés, Nicolás Emilio Bordenabe, and Catuscia Palamidessi. 2013. Broadening the scope of differential privacy using metrics. In PETS.
- Chatzikokolakis et al. (2015) Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Marco Stronati. 2015. Constructing elastic distinguishability metrics for location privacy. arXiv preprint arXiv:1503.00756 (2015).
- Chen et al. (2020) Wei-Ning Chen, Peter Kairouz, and Ayfer Ozgur. 2020. Breaking the communication-privacy-accuracy trilemma. Advances in Neural Information Processing Systems 33 (2020), 3312–3324.
- Cormode et al. (2018) Graham Cormode, Somesh Jha, Tejas Kulkarni, Ninghui Li, Divesh Srivastava, and Tianhao Wang. 2018. Privacy at scale: Local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data. 1655–1658.
- Csiszár (1975) Imre Csiszár. 1975. I-divergence geometry of probability distributions and minimization problems. The annals of probability (1975), 146–158.
- Cummings et al. (2022) Rachel Cummings, Vitaly Feldman, Audra McMillan, and Kunal Talwar. 2022. Mean estimation with user-level privacy under data heterogeneity. Advances in Neural Information Processing Systems 35 (2022), 29139–29151.
- Duchi et al. (2013) John C Duchi, Michael I Jordan, and Martin J Wainwright. 2013. Local privacy, data processing inequalities, and minimax rates. arXiv preprint arXiv:1302.3203 (2013).
- Dwork (2006) Cynthia Dwork. 2006. Differential privacy. In International colloquium on automata, languages, and programming. Springer, 1–12.
- Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
- Erlingsson et al. (2019) Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. 2019. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2468–2479.
- Erlingsson et al. (2014) Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. 1054–1067.
- Feldman et al. (2022) Vitaly Feldman, Audra McMillan, and Kunal Talwar. 2022. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 954–964.
- Fernandes et al. (2019) Natasha Fernandes, Mark Dras, and Annabelle McIver. 2019. Generalised differential privacy for text document processing. In Principles of Security and Trust: 8th International Conference, POST 2019, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2019, Prague, Czech Republic, April 6–11, 2019, Proceedings 8. Springer International Publishing, 123–148.
- Feyisetan et al. (2020) Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. 2020. Privacy-and utility-preserving textual analysis via calibrated multivariate perturbations. In Proceedings of the 13th international conference on web search and data mining. 178–186.
- Feyisetan et al. (2019) Oluwaseyi Feyisetan, Tom Diethe, and Thomas Drake. 2019. Leveraging hierarchical representations for preserving privacy and utility in text. In 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 210–219.
- Feyisetan and Kasiviswanathan (2021) Oluwaseyi Feyisetan and Shiva Kasiviswanathan. 2021. Private release of text embedding vectors. In Proceedings of the First Workshop on Trustworthy Natural Language Processing. 15–27.
- Girgis et al. (2021) Antonious M Girgis, Deepesh Data, Suhas Diggavi, Ananda Theertha Suresh, and Peter Kairouz. 2021. On the renyi differential privacy of the shuffle model. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2321–2341.
- Givens and Shortt (1984) Clark R Givens and Rae Michael Shortt. 1984. A class of Wasserstein metrics for probability distributions. Michigan Mathematical Journal 31, 2 (1984), 231–240.
- Hardt and Talwar (2010) Moritz Hardt and Kunal Talwar. 2010. On the geometry of differential privacy. In Proceedings of the forty-second ACM symposium on Theory of computing. 705–714.
- Hay et al. (2009) Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. 2009. Boosting the accuracy of differentially-private histograms through consistency. arXiv preprint arXiv:0904.0942 (2009).
- Imola et al. (2022) Jacob Imola, Shiva Kasiviswanathan, Stephen White, Abhinav Aggarwal, and Nathanael Teissier. 2022. Balancing utility and scalability in metric differential privacy. In Uncertainty in Artificial Intelligence. PMLR, 885–894.
- Kairouz et al. (2016) Peter Kairouz, Keith Bonawitz, and Daniel Ramage. 2016. Discrete distribution estimation under local privacy. In International Conference on Machine Learning. PMLR, 2436–2444.
- Kairouz et al. (2015) Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2015. The composition theorem for differential privacy. In International conference on machine learning. PMLR, 1376–1385.
- Konig (2001) Dénes Konig. 2001. Theorie der endlichen und unendlichen Graphen. Vol. 72. American Mathematical Soc.
- Levy et al. (2021) Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, and Ananda Theertha Suresh. 2021. Learning with User-Level Privacy. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 12466–12479. https://proceedings.neurips.cc/paper_files/paper/2021/file/67e235e7f2fa8800d8375409b566e6b6-Paper.pdf
- Li et al. (2015) Chao Li, Gerome Miklau, Michael Hay, Andrew McGregor, and Vibhor Rastogi. 2015. The matrix mechanism: optimizing linear counting queries under differential privacy. The VLDB journal 24 (2015), 757–781.
- Li et al. (2016) Ninghui Li, Min Lyu, Dong Su, and Weining Yang. 2016. Differential Privacy: From Theory to Practice. Morgan and Claypool. https://ieeexplore.ieee.org/document/7731575
- Liu et al. (2020) Yuhan Liu, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Michael Riley. 2020. Learning discrete distributions: user vs item-level privacy. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 20965–20976. https://proceedings.neurips.cc/paper_files/paper/2020/file/f06edc8ab534b2c7ecbd4c2051d9cb1e-Paper.pdf
- Liu et al. (2023) Yuhan Liu, Ananda Theertha Suresh, Wennan Zhu, Peter Kairouz, and Marco Gruteser. 2023. Algorithms for bounding contribution for histogram estimation under user-level privacy. In International Conference on Machine Learning. PMLR, 21969–21996.
- Narayanan et al. (2022) Shyam Narayanan, Vahab Mirrokni, and Hossein Esfandiari. 2022. Tight and robust private mean estimation with few users. In International Conference on Machine Learning. PMLR, 16383–16412.
- Nikolov et al. (2013) Aleksandar Nikolov, Kunal Talwar, and Li Zhang. 2013. The geometry of differential privacy: the sparse and approximate cases. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing. 351–360.
- Roy Chowdhury et al. (2022) Amrita Roy Chowdhury, Bolin Ding, Somesh Jha, Weiran Liu, and Jingren Zhou. 2022. Strengthening Order Preserving Encryption with Differential Privacy. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (Los Angeles, CA, USA) (CCS ’22). Association for Computing Machinery, New York, NY, USA, 2519–2533. https://doi.org/10.1145/3548606.3560610
- Suresh (2019) Ananda Theertha Suresh. 2019. Differentially private anonymized histograms. Advances in Neural Information Processing Systems 32 (2019).
- Weggenmann and Kerschbaum (2021) Benjamin Weggenmann and Florian Kerschbaum. 2021. Differential privacy for directional data. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 1205–1222.
- Xu et al. (2013) Jia Xu, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, Ge Yu, and Marianne Winslett. 2013. Differentially private histogram publication. The VLDB journal 22 (2013), 797–822.
Appendix A Omitted Technical Details
An alternative characterization of differential privacy is through the hockey-stick divergence (Barthe and Olmedo, 2013). For probability distributions defined on a space , this is given by the following:
Definition A.1.
Let , and let be distributions on a space . The Hockey Stick Divergence is given by
It is easy to show that implies (2), so Definition A.1 provides an alternative way to prove privacy.
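For intuition, the divergence can be evaluated directly on finite distributions. The following is a minimal sketch, assuming the standard form in which the divergence at level e^eps sums max(0, P(y) - e^eps Q(y)) over outcomes y. The function names and the randomized-response example are illustrative and are not taken from the paper.

```python
import math

def hockey_stick(p, q, eps):
    """Hockey-stick divergence at level e^eps between finite distributions.

    p and q are lists of probabilities over the same outcomes. Assumes the
    standard form sum_y max(0, p[y] - e^eps * q[y]).
    """
    return sum(max(0.0, a - math.exp(eps) * b) for a, b in zip(p, q))

def randomized_response(eps0):
    """Output laws of binary randomized response on inputs 0 and 1."""
    keep = math.exp(eps0) / (1.0 + math.exp(eps0))
    return [keep, 1.0 - keep], [1.0 - keep, keep]

p, q = randomized_response(1.0)
# The mechanism is 1.0-DP, so the divergence at eps = 1.0 is (numerically) zero.
print(hockey_stick(p, q, 1.0))
# At a smaller eps, the divergence is the delta needed for (eps, delta)-DP.
print(hockey_stick(p, q, 0.5))
```

Under this reading, a bound on the divergence at level e^eps by delta is exactly the statement that the pair of output distributions is (eps, delta)-indistinguishable in one direction.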
Definition A.1 satisfies a number of useful properties. First, because it is an f-divergence (Csiszár, 1975), it satisfies the data-processing inequality: for any function , we have
This property is used to show that DP is invariant to post-processing by any function . The second property, again holding for all f-divergences, is convexity. This states that for two pairs of distributions and a real number we have
Stated in terms of couplings, we may generalize convexity as follows:
Lemma A.1.
Suppose are random variables with probability distributions . Suppose is a randomized function. Then, for any coupling , we have
Proof.
We may write
Applying convexity, we have
and the claim follows. ∎
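The coupling form of convexity can be checked numerically on a small example. The sketch below assumes the lemma takes the usual form, namely that the divergence between the two output distributions is at most the expected divergence between the mechanism's outputs on a coupled pair of inputs. The mechanism `mech`, the coupling `pi`, and all function names are hypothetical and chosen only for illustration.

```python
import math

def hockey_stick(p, q, eps):
    """Standard hockey-stick divergence between finite distributions."""
    return sum(max(0.0, a - math.exp(eps) * b) for a, b in zip(p, q))

def output_dist(prior, mech):
    """Law of M(X) when X ~ prior and mech[x] is the law of M(x)."""
    n_out = len(mech[0])
    return [sum(prior[x] * mech[x][y] for x in range(len(prior)))
            for y in range(n_out)]

# Hypothetical mechanism on three inputs and two outputs (rows sum to 1).
mech = [[0.8, 0.2], [0.5, 0.5], [0.3, 0.7]]
# A coupling of the two input laws: pi[x][x2] is a joint probability.
pi = [[0.3, 0.1, 0.0],
      [0.0, 0.2, 0.1],
      [0.0, 0.0, 0.3]]
p = [sum(row) for row in pi]                                # law of X
q = [sum(pi[x][x2] for x in range(3)) for x2 in range(3)]   # law of X'

eps = 0.05
lhs = hockey_stick(output_dist(p, mech), output_dist(q, mech), eps)
rhs = sum(pi[x][x2] * hockey_stick(mech[x], mech[x2], eps)
          for x in range(3) for x2 in range(3))
print(lhs, rhs, lhs <= rhs + 1e-12)
```

A coupling that keeps the two inputs equal with high probability makes the right-hand side small, which is how the lemma is typically used.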
Third, satisfies a “weak” triangle inequality (also known as group privacy):
Lemma A.2.
For distributions on , we have .
Proof.
For any , we may view through its dual form as
Thus, let denote the maximal set such that
We may rewrite this as
showing the claim. ∎
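A quick numerical check of a weak triangle inequality follows. It assumes one common form of the bound: the divergence from P to R at level e^(eps1 + eps2) is at most the divergence from P to Q at level e^eps1 plus e^eps1 times the divergence from Q to R at level e^eps2. The exact constants in Lemma A.2 may differ, and the three distributions are made up for illustration.

```python
import math

def hockey_stick(p, q, eps):
    """Standard hockey-stick divergence between finite distributions."""
    return sum(max(0.0, a - math.exp(eps) * b) for a, b in zip(p, q))

# Three illustrative distributions on a three-point space.
p = [0.6, 0.3, 0.1]
q = [0.4, 0.4, 0.2]
r = [0.2, 0.3, 0.5]
eps1, eps2 = 0.1, 0.2

# Divergence across the two-step chain P -> Q -> R at the combined level.
lhs = hockey_stick(p, r, eps1 + eps2)
# Assumed group-privacy style bound built from the two single-step divergences.
rhs = hockey_stick(p, q, eps1) + math.exp(eps1) * hockey_stick(q, r, eps2)
print(lhs, rhs, lhs <= rhs + 1e-12)
```

Chaining such a bound along a sequence of intermediate distributions is what yields group-privacy style guarantees.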
Appendix B Omitted Proofs from Section 4
B.1. Proof of Theorem 4.1
B.2. Proof of Lemma 4.2
B.3. Proof of Theorem 4.3
See Theorem 4.3. We will first assume the following lemma:
Lemma B.1.
Suppose that is an -metric DP algorithm, where . Let be a set of inputs such that , and let be a constant such that . Then, we have that
where
To prove Theorem 4.3, let
Let , and WLOG suppose that for . Our goal is to show that
By Lemma B.1, we have for each that
where
Applying Lemma A.2 times, we see
We now show that is a concave function of ; to do this we write , where and is a suitable constant. We will show that . Taking derivatives, it is easy to show that has the same sign as . Thus, we will show that . We may write
Now, we have
We are done by observing that , and . Having shown concavity, we establish the maximum occurs when each is equal to . This gives us a bound of
B.4. Proof of Lemma B.1
This lemma can be viewed as a generalization of amplification by shuffling, which has the same setup but sets and merely requires that satisfy -local DP. We generalize the approach of Feldman et al. (2022), starting with the following preliminary claims.
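To make the shuffling setup concrete, here is a minimal sketch of the standard shuffle model: each user applies a local randomizer and only a uniformly shuffled list of reports is released. Binary randomized response is used purely as an illustrative local randomizer; it is an assumption of this sketch and is not the metric DP algorithm analyzed in the lemma.

```python
import math
import random

def randomized_response(bit, eps0, rng):
    """eps0-local-DP randomizer on a single bit (illustrative choice)."""
    keep = math.exp(eps0) / (1.0 + math.exp(eps0))
    return bit if rng.random() < keep else 1 - bit

def shuffle_model(data, eps0, seed=0):
    """Randomize each user's bit locally, then release a uniform shuffle."""
    rng = random.Random(seed)
    reports = [randomized_response(x, eps0, rng) for x in data]
    rng.shuffle(reports)  # a uniformly random permutation hides report order
    return reports

data = [1, 0, 0, 1, 1, 0, 1, 0]
reports = shuffle_model(data, eps0=1.0)
# After shuffling, only the multiset of reports carries information.
print(sorted(reports), sum(reports))
```

Because the order is destroyed, an analyst sees only how many reports take each value, which is the source of the amplification effect.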
B.4.1. Preliminary Lemmas
Lemma B.2.
(Generalization of Lemma 3.3 in Feldman et al. (2022)). Let be a set of indices, and for , let be two families of distributions and be coefficients such that
Then, there exists a post-processing mechanism such that
where , , and , and is a uniformly random shuffle.
Proof.
Let be distributions where is defined over and satisfies and (with reversed probabilities if ), and for is defined over and satisfies and . Let be a function returning a distribution satisfying
Observe that by definition, the following probability distributions are equal for :
Let denote the number of indices such that , and define similarly. We will show that there exists a post-processing function such that, for both , we have
(9)
We will do this by conditioning on the event that
where satisfy . Now, define the vector . Conditioned on , is distributed according to the following process: First, select a random partition such that and , corresponding to the indices (after shuffling) where are equal to , or . Next, let be a random injection from to . Then, is distributed according to:
(10)
(11)
(12)
The above process is independent of given . In particular, it does not care whether we replace with , and thus it serves as our process satisfying (9) for both values of . Having established this, it is easy to show that , for , and , for . ∎
Having reduced the shuffling problem to a divergence between two fixed probability distributions, we follow the method of Feldman et al. (2022) to compute this divergence. We use the following two results:
Lemma B.3.
(Restatement of Lemma A.1 from (Feldman et al., 2022)): Suppose , and . Define and . Then, , where
The next result, advanced joint convexity, originally appeared in the privacy amplification by sampling literature and can be used to improve the parameter when computing between two distributions which are nearly the same.
Lemma B.4.
(Restatement of Theorem 2 from (Balle et al., 2018)) Let be probability distributions satisfying and for distributions and . Given , define and . Then,
Finally, we require a result from local DP:
Lemma B.5.
(Restatement of Theorem 2.5 from (Kairouz et al., 2015)) Let be two distributions and be a parameter such that . Then, there exist distributions such that
With these results in order, we are ready to complete the proof.
B.4.2. Completing the proof of Lemma B.1
Using the definition of -DP and the fact that , we have
Applying Lemma B.5 to the first equation, we obtain
(13)
(14)
where . Applying the lemma to the second and third sets of equations, we obtain
(15)
(16)
(17)
(18)
where . Subtracting (15) and (16), we obtain that
(19)
(20)
Taking the average of (19) and (20), we obtain
(21)
where . Now, equations (13) and (14) imply that
This implies
(22)
Applying Lemma B.2, there exists a function such that
where , , and . By the post-processing inequality, we have for any that
Observe we can write
Define and . We can rewrite the above as
Applying Lemma B.4, we have
where and . By convexity, the RHS above is at most
Now, we finally set . Lemma B.3 (using the assumption that ) implies . From this, we obtain our desired result that
where
B.5. Proof of Theorem 4.4
See Theorem 4.4.
First, consider the local model. Fix any two itemsets and such that . By Lemma 2.1, there exists a permutation such that
Let
(23)
(24)
By Theorem 4.3, we know that , where . The final privacy parameters for a fixed will be and ; the worst-case privacy parameters are thus and . Since is an increasing function, the latter term reduces to .
In the bounded central model, the same logic applies, except that have size , differ in only coordinates, and
We apply Theorem 4.3 to obtain , where , and we complete the proof similarly.
Appendix C Omitted Proofs from Section 5
C.1. Proof of Lemma 5.2
See Lemma 5.2. For , define , and observe that . Now, let denote . Observe each is i.i.d. and satisfies and . Due to the last two facts, we have . By Bernstein’s inequality, we have, for all ,
where and . By setting
we ensure that the probability is at most . We have
Finally,
C.2. Proof of Theorem 5.3
See Theorem 5.3. First, we will consider the local model. Let denote two datasets such that . Let , denote the set of samples when (resp. ) is used. Our goal is to show that . Observe we may define the objects to be the probability distributions of (which lie in ). By Lemma A.1, for any coupling , we have
Let denote the event that we have . When holds, then by assumption. When this does not hold, then trivially . Conditioning on the above expectation, we have
Now, let denote the optimal coupling between . We will take , the -fold Kronecker product of . Observe this is indeed a coupling between , and each coordinate of is simply a sample from . Thus, the event above is equivalent to
where the notation indicates that and , and each . By Lemma 5.2, we know that , and thus the above expectation is at most . This proof may be generalized easily to the central model.
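The product-coupling step can be checked on a toy example. The sketch below is illustrative only: the names `product_coupling` and `marginal` and the two-point coupling `pi` are hypothetical. It builds the n-fold product of a coupling and confirms that its marginals are the n-fold products of the original marginals, as the argument requires.

```python
import itertools
import math

def product_coupling(pi, n):
    """n-fold product of a coupling pi, given as a dict {(x, x2): prob}."""
    joint = {}
    for pairs in itertools.product(pi.items(), repeat=n):
        xs = tuple(pair[0][0] for pair in pairs)    # samples from the first law
        x2s = tuple(pair[0][1] for pair in pairs)   # samples from the second law
        prob = math.prod(pair[1] for pair in pairs)
        joint[(xs, x2s)] = joint.get((xs, x2s), 0.0) + prob
    return joint

def marginal(joint, side):
    """Marginal of a joint dict on its first (side=0) or second (side=1) entry."""
    out = {}
    for key, prob in joint.items():
        out[key[side]] = out.get(key[side], 0.0) + prob
    return out

# A made-up coupling of two distributions on the points "a" and "b".
pi = {("a", "a"): 0.5, ("a", "b"): 0.2, ("b", "b"): 0.3}
p, q = marginal(pi, 0), marginal(pi, 1)

joint = product_coupling(pi, 2)
left, right = marginal(joint, 0), marginal(joint, 1)
ok_left = all(abs(pr - math.prod(p[x] for x in xs)) < 1e-12
              for xs, pr in left.items())
ok_right = all(abs(pr - math.prod(q[x] for x in xs)) < 1e-12
               for xs, pr in right.items())
print(ok_left, ok_right)
```

Each coordinate pair of the product coupling is an independent draw from the original coupling, which is what allows a concentration bound to be applied to the sum of per-coordinate distances.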
Appendix D Omitted Proofs from Section 6
D.1. Proof of Lemma 6.2
D.2. Proof of Theorem 6.5
See Theorem 6.5. First, we will introduce notation. For a cluster label , let denote the elements of in cluster . Define to be the indices of in (so that indices outside are zeroed out). Define similarly, and observe that are not normalized.
For any estimate , consider the following transportation plan from to : For each , transfer to arbitrarily, and put any excess weight in the bin for an arbitrary . The cost incurred by this is at most , where denotes total mass of its argument. Finally, equalize the weights in the coordinates . The cost incurred for this step is at most . Thus, the total cost is
Observe that the term is simply the distance between and , where is the matrix that maps a vector to its sum along each coordinate in . Thus, we may form the upper bound
(25)
Now, we will bound (25) given this estimator. In the following, let denote the th row of the matrix . Observe that
Define , and notice that . Thus,
where the last step holds because the are independent. Now, we have
Putting it all together, we have
To control the term in (25), using similar steps, we may write
Similarly, for any we have
and this implies
Substituting into (25), we obtain the desired bound.
D.3. Proof of Theorem 6.6
See Theorem 6.6. For positive constants , the matrix is given by
where
The matrix is actually invertible, and
where
It is easy to show the identity that . Each row of looks like one copy of , copies of , and copies of . Thus,
Substituting, we obtain
Next, it’s easy to see that
Each row of the latter consists of one copy of and copies of . This gives us
Substituting, we obtain
Applying Theorem 6.5, we obtain
finishing the claim. To obtain an asymptotic bound (with budget ), we plug in (6), which says that we may set
In the first case, we have
and this implies
In the second, we have
This implies
In both cases, the desired bound has been shown.
D.4. Proof of Lemma 6.7
We use the bound that . In each coordinate, the expected error introduced by the Laplace noise is at most , and thus . Normalizing will only reduce this error.
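As a worked check of the per-coordinate claim, the expected magnitude of Laplace noise equals its scale. Writing b for the scale (a symbol assumed here, since the scale used by the mechanism is fixed elsewhere in the paper), a direct computation gives:

```latex
\mathbb{E}|Z|
  = \int_{-\infty}^{\infty} |z| \, \frac{1}{2b} e^{-|z|/b} \, dz
  = \frac{1}{b} \int_{0}^{\infty} z \, e^{-z/b} \, dz
  = \frac{1}{b} \cdot b^{2}
  = b,
\qquad Z \sim \mathrm{Lap}(b).
```

Summing this over coordinates gives an expected L1 error equal to the number of coordinates times b, by linearity of expectation.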