Instance-Optimal Private Density Estimation in the Wasserstein Distance
Abstract
Estimating the density of a distribution from samples is a fundamental problem in statistics. In many practical settings, the Wasserstein distance is an appropriate error metric for density estimation. For example, when estimating population densities in a geographic region, a small Wasserstein distance means that the estimate is able to capture roughly where the population mass is. In this work we study differentially private density estimation in the Wasserstein distance. We design and analyze instance-optimal algorithms for this problem that can adapt to easy instances.
For distributions over $\mathbb{R}$, we consider a strong notion of instance-optimality: an algorithm that uniformly achieves the instance-optimal estimation rate is competitive with an algorithm that is told that the distribution is either $P$ or $Q_P$ for some distribution $Q_P$ whose probability density function (pdf) is within a factor of 2 of the pdf of $P$. For distributions over $\mathbb{R}^2$, we use a different notion of instance optimality. We say that an algorithm is instance-optimal if it is competitive with an algorithm that is given a constant-factor multiplicative approximation of the density of the distribution. We characterize the instance-optimal estimation rates in both these settings and show that they are uniformly achievable (up to polylogarithmic factors). Our approach for $\mathbb{R}^2$ extends to arbitrary metric spaces as it goes via hierarchically separated trees. As a special case, our results lead to instance-optimal private learning in TV distance for discrete distributions.
Contents
- 1 Introduction
- 2 Preliminaries
- 3 On Instance Optimality
- 4 Additional Related Work
- 5 Distribution Estimation on Hierarchically Separated Trees
- 6 Instance Optimal Density Estimation on $\mathbb{R}$ in Wasserstein distance
- A Preliminaries
- B Experiment Details
- C Appendix for Section 5
- D Local Minimality in the High Dimensional Setting
- E Differentially Private Quantiles
- F Proofs in Section 6
1 Introduction
Distribution estimation is a fundamental problem in statistics. In this work, we focus on the problem of learning the density of a distribution over a low-dimensional real space. Our motivation for studying this problem comes from practical problems such as estimating the population density in a geographical area (a bounded two-dimensional space, e.g. $[0,1]^2$), learning the distribution of the accuracy of a machine learning model (i.e. a distribution over $[0,1]$), and estimating the average temperature across latitude, longitude, and altitude (i.e. a distribution over $\mathbb{R}^3$).
In this work, we are interested in the non-parametric version of this question, where we make no assumptions on the form of the distribution we are learning. This is frequently of interest in practice, where population densities, for example, may change over time (become more or less concentrated), and it is difficult to specify a meaningful parametric class that will simultaneously capture all densities of interest. Since estimation is often done using sensitive data (e.g. health data), our interest is in the differentially private version of this question, and consequently all our results are for that version. While we believe our results in the non-private setting are also novel and interesting, we view the private results as our main contribution.
Any statistical algorithm learning from samples is inexact. The appropriate gauge to measure the (in)accuracy of a density estimation algorithm depends on how this density estimate is used. In this work, we focus on the Wasserstein distance between the original distribution and the learnt distribution as our measure of accuracy. Known by many names (Earthmover distance, Kantorovich distance, Optimal Transport distance), this distance is defined over any metric $d$ as the minimum over all couplings $\pi$ from $P$ to $Q$ of the quantity $\mathbb{E}_{(x,y) \sim \pi}[d(x,y)]$. It is arguably one of the most natural ways to define distances between distributions over a metric space and has been extensively studied (see Section 4). We note that Wasserstein distance is particularly salient in many practical applications of density estimation where the geometry of the space is significant. As a simple example, when creating population density estimates, if the population is concentrated in a few cities, then outputting a distribution concentrated close to these cities (even if not exactly at the cities) is intuitively better than outputting a distribution that is more spread out. Metrics such as TV distance that do not incorporate the geometry of the space do not capture this nuance. Additionally, Wasserstein distance is versatile and can be adapted to the setting of interest by varying the metric. In the case of the metric being the discrete metric with $d(x,y) = \mathbb{1}[x \neq y]$, it reduces to the commonly used total variation distance. Our focus in this work is on the case of the Euclidean metric on $\mathbb{R}$ or $\mathbb{R}^2$, though our results apply both to higher-dimensional Euclidean space and to any finite metric. In the case of $\mathbb{R}$ (with the standard Euclidean metric), the Wasserstein distance is equal to the total area between the cumulative distribution functions.
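The one-dimensional characterization as the area between CDFs lends itself to a short numerical check. The sketch below is only an illustration and not one of the paper's algorithms; the function name `wasserstein_1d` and the toy data are our own assumptions. It computes the 1-Wasserstein distance between two empirical distributions on the real line by integrating the absolute difference of their CDFs.

```python
import numpy as np

def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between the empirical distributions of xs and ys,
    computed as the area between their empirical CDFs."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    # Every point at which either CDF can jump.
    grid = np.sort(np.concatenate([xs, ys]))
    # CDF values on each interval [grid[i], grid[i+1]).
    cdf_x = np.searchsorted(xs, grid[:-1], side="right") / len(xs)
    cdf_y = np.searchsorted(ys, grid[:-1], side="right") / len(ys)
    return float(np.sum(np.abs(cdf_x - cdf_y) * np.diff(grid)))

# Toy version of the "cities" example: an estimate concentrated near the truth
# versus one that is spread out. Both have TV distance 1 from the truth.
truth = [0.30, 0.30, 0.30, 0.30]
nearby = [0.31, 0.31, 0.31, 0.31]
spread = [0.10, 0.40, 0.70, 1.00]
d_near = wasserstein_1d(truth, nearby)
d_spread = wasserstein_1d(truth, spread)
```

In this toy example the nearby estimate is far closer to the truth in Wasserstein distance than the spread-out one, even though total variation distance cannot tell the two apart.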
The problem of learning a distribution under Wasserstein distance has a long history, starting with [Dud69] proving worst-case bounds on the rate of convergence of the Wasserstein distance between the empirical distribution and the target distribution over . Similarly, this question for the case of the discrete metric () has been very well studied. However, most known results for this problem look at it from the point of view of worst-case analysis. This can paint a rather pessimistic picture. For example, the minimax rate of -privately learning a discrete distribution over in TV distance (i.e. Wasserstein with the discrete metric described above) scales linearly with , which can be prohibitive for large support size . For Wasserstein distance with norm, the rate of convergence of the empirical distribution suffers a curse of dimensionality, with the worst-case error between the distribution and the empirical distribution being for distributions over . For the differentially private version of this question, recent works [BSV22, HVZ23] have shown that the optimal Wasserstein minimax error between the sample and the private estimate is . This worst-case analysis viewpoint fails to distinguish between algorithms that perform very differently on the types of instances one may see in practice. In particular, many practical distributions may be more feasible to estimate than suggested by the minimax rate. As an example, Figure 1 shows the cumulative distribution function of a bimodal distribution on with very sparse support, and the cdf learnt by a minimax optimal algorithm, as well as an algorithm we present in this work . As is clear from the figure, the minimax optimal algorithm is easily outperformed. This phenomenon only gets worse in higher dimensions. Similarly, if the distribution in lies on a -dimensional subspace, the worst-case error scaling with is significantly larger than our algorithm’s scaling of .
![Refer to caption](x1.png)
![Refer to caption](x2.png)
This motivates viewing this question through the lens of instance optimality. (See the related work section for a discussion of other beyond-worst-case approaches to this question, and Section 3 for a more in-depth discussion of our approach.) Briefly, instance-optimal algorithms are those that, on any given instance of the problem, perform competitively with what any algorithm can do on this instance. Let $\mathcal{C}$ be a class of algorithms of interest (e.g. all $\varepsilon$-differentially private algorithms) and let $\mathrm{cost}(A, P)$ be a cost measure for an algorithm $A$ on an instance $P$. In our setting, we have a distribution $P$ over a metric space, and given a set $X$ of $n$ samples from $P$, we want to learn an estimate for the distribution. Our measure of performance is the Wasserstein distance $\mathcal{W}$, so $\mathrm{cost}(A, P) = \mathbb{E}[\mathcal{W}(A(X), P)]$. We would ideally like to say that an algorithm $A$ is $\alpha$-instance optimal in a class $\mathcal{C}$ if for all instances $P$, and all $A' \in \mathcal{C}$,
$$\mathrm{cost}(A, P) \le \alpha \cdot \mathrm{cost}(A', P). \tag{InstanceOptimality-Ideal}$$
The reader will have noticed that this definition is, however, impossible to achieve except for trivial classes $\mathcal{C}$. The algorithm that ignores its input and always outputs $P$ makes the right hand side 0. However, this algorithm performs poorly on any distribution far from $P$ and so is not a reasonable benchmark. A common approach in many works is to measure the performance of the competing algorithm not just on the given instance, but on a small neighborhood around it. Thus we say that an algorithm $A$ is $\alpha$-instance optimal amongst a class $\mathcal{C}$ with respect to a neighborhood function $\mathcal{N}$ if for all instances $P$, and all $A' \in \mathcal{C}$,
$$\mathrm{cost}(A, P) \le \alpha \cdot \sup_{Q \in \mathcal{N}(P)} \mathrm{cost}(A', Q).$$
In other words, the benchmark we evaluate against is the cost of the best algorithm for the neighborhood $\mathcal{N}(P)$ that knows this neighborhood. We would like our algorithm $A$, which is not tailor-made for $\mathcal{N}(P)$, to nevertheless be competitive against this benchmark.
This definition is general, and captures most notions of instance optimality that have been studied in the literature. The set must be carefully defined for this notion to be meaningful; we can always define to be the set of all instances whence this notion reduces to worst-case analysis. In many previous works, this neighborhood map has been defined to capture the belief that any natural algorithm must not have significantly different performance on different members of . For example, [FLN01, ABC17, VV16, OS15, GKN20] include in appropriate renamings of to capture some kind of permutation invariance of natural algorithms. In statistics, one often enforces that the cardinality of is 2, often called the hardest one-dimensional subproblem [CL15, AD20, DLSV23]. Some recent works in privacy [HLY21, DKSS23] have defined instance optimality w.r.t. neighboring datasets obtained by deleting a small number of data points. Any reasonable definition of instance optimality for a problem must justify its choice of the neighborhood map; similar choices must be justifiable in every other notion of beyond worst case analysis [Rou21]. In instance-optimality definitions, this choice of neighborhood is what encapsulates what class of domain-specific algorithms our algorithm competes against. A good definition thus depends on the context and on the kind of domain knowledge we imagine an expert designing a custom algorithm for an application may have. Ideally, the definition is broad (i.e. the neighborhoods are sufficiently contained) so that in a large class of applications, we expect the domain knowledge to not be enough to rule out any member of . We discuss this general definition of instance optimality further in Section 3. We remark that for reasonable neighborhood maps, this is an extremely strong requirement: an instance-optimal algorithm must simultaneously do well on every single input, in fact as well as any other algorithm that is given this neighborhood in advance!
Instance optimality guarantees are most useful when there is a big difference between the utility guarantees achievable for typical cases and the worst-case utility guarantees. Wasserstein estimation is an example of such a problem. We will see that the achievable utility bounds for, say, concentrated distributions are much better than those for worst-case distributions. Our definition of instance optimality is particularly suitable for metric spaces, and our notion of neighborhood allows the target utility bound to adapt to the distribution. We note that for estimation in Wasserstein distance with practically important metrics such as those induced by norms, it is unclear if existing instance optimality definitions (using the notions of neighborhood discussed above) capture this. For example, for discrete distributions, setting the neighborhood to be all permutations of the distribution destroys all structure of the distribution (e.g. concentration), and hence performance on this neighborhood may not capture the relative ease of estimation of a concentrated distribution. Similar problems apply to other previously studied definitions of instance optimality, which are not well-suited to density estimation with error metrics that incorporate the geometry of the metric space. See Section 3 and Section 4 for further discussion on the inadequacy of existing instance optimality definitions for our setting of interest.
Our notion of neighborhood will correspond to small balls in one of the strictest notions of distance between distributions. Recall that for distributions $P, Q$ on $\mathcal{X}$, $D_\infty(P, Q)$ is defined as $\sup_{x \in \mathcal{X}} \left| \ln \frac{P(x)}{Q(x)} \right|$. Our neighborhood map $\mathcal{N}$ will have the property that for all $P$, and for all $Q \in \mathcal{N}(P)$, $D_\infty(P, Q) \le \ln 2$. This corresponds to the benchmark algorithm being given as auxiliary input a multiplicative constant-factor approximation to the probability density function (and we can replace the constant $2$ by any constant). In particular, an algorithm that knows the support of the distribution will not be able to do much better than our algorithm that gets no such information. Notice that this implies that our algorithm is able to exploit sparsity in the data distribution, since it is competitive with an algorithm that is told the support. In the one-dimensional real case we can achieve an even stronger notion of instance-optimality. In this case $\mathcal{N}(P)$ is defined to be $\{P, Q\}$ where $Q$ is a distribution with $D_\infty(P, Q) \le \ln 2$. This is a strengthening of the rate defined by the hardest one-dimensional subproblem.
We also give a definition that captures another aspect of instance optimality, related to the notion of super efficiency, that we term local minimality in Section 3. Informally, local minimality says that if any comparator algorithm does better than on , then there is a distribution in the neighborhood of where does better than the comparator. Approximate local minimality relaxes the latter condition to being better than some constant times the comparator. The two definitions of approximate local minimality and instance optimality are in general incomparable (see Section 3) but for suitable smooth algorithms, we show that these definitions are equivalent. Our algorithms, both for the 1-dimensional and the case of general metric spaces approximately satisfy both these definitions.
In order to show that the instance optimality definition is achievable, we give both algorithmic upper bounds and theoretical lower bounds that match up to logarithmic factors. The algorithms we use in our upper bounds are built largely from ingredients previously used for similar problems. We see this as an asset since these algorithms are implementable in practice. A key ingredient that we do introduce is the use of randomised HST approximation of finite metric spaces. This replaces the deterministic hierarchical decompositions that were used in prior work, allowing us to gain tighter utility guarantees. Our main conceptual contribution is to introduce what we believe to be the right notion of instance optimality for this problem, including the definition of a meaningful neighbourhood function. The main technical challenge is in the lower bounds, which require carefully building nets of distributions within each neighborhood that allow us to use a slight generalisation of DP Assouad’s Lemma to give a lower bound on the target estimation rate for each distribution $P$.
1.1 Our Results
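Given an estimation algorithm $A$, the estimation rate of $A$ for a distribution $P$ from $n$ samples is:
$$\mathcal{R}_{A,n}(P) = \mathbb{E}_{X \sim P^n}\left[\mathcal{W}(A(X), P)\right], \tag{1}$$
where the expectation is also taken over the internal randomness of $A$.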
We start by stating an informal version of our result in the one-dimensional real case.
Theorem 1.1 (Informal 1-dimensional result).
Let . There is an -differentially private algorithm such that, for all distributions supported in , for all natural numbers , there exists a distribution (with ) such that the following is satisfied.
For any -DP algorithm , with probability at least over the randomness of and additional randomness of the algorithm,
where
In this one-dimensional case, our algorithm is based on DP quantile estimation. The additive term can be made polynomially small. The lower bound is based on (differentially private) simple hypothesis testing where for each distribution , we find a distribution in that is indistinguishable from given samples but also sufficiently far from in Wasserstein distance.
Extending the quantiles based approach from the one dimensional setting to even the two dimensional setting is challenging, as there is no “right” way to generalize quantiles to dimensions 2 or beyond. Several previous works on Wasserstein density estimation (e.g. [BNNR09]) have used a hierarchical decomposition approach to address this question. A hierarchical approach has also been used in various more practical works on private density estimation (e.g. [CB22, QYL12, BKM+21, MJT+22, ZXX16]). These works focus on practical performance and do not offer tight theoretical bounds. A hierarchical approach was also used by [GHK+23], who proved theoretical bounds for a related problem, but not through the lens of instance optimality. We compare our results to theirs in more detail later in this section.
The use of deterministic hierarchical decompositions in all these papers means that some points that are very close (but on opposite sides of the boundaries of the hierarchical decomposition) get mapped to relatively far points, resulting in high distortion factors that are not appropriate for instance optimality.
Inspired by the above approaches but noting their constraints, we use a randomized embedding into hierarchically separated trees instead of a deterministic one. We define our algorithm on any hierarchically separated tree metric and use the fact that there is a randomized embedding of on a hierarchically separated tree metric space with low distortion. This, along with some other important technical modifications (such as truncating low values to ), allows us to analyze a variant of the above practical algorithms theoretically and show that it satisfies our strong notion of instance optimality, up to polylogarithmic factors in the number of samples.
Theorem 1.2 (Informal two-dimensional result).
There is a polynomial time -differentially private algorithm that for any distribution on , any integer , and any -DP algorithm with probability at least 0.75, satisfies
where . Here, the expectation is taken over the internal coin tosses of as well as over the choice of the i.i.d. samples .
In fact, since our algorithm is defined on any hierarchically separated tree metric space, it has the added bonus of giving instance optimality results for any finite metric space (since powerful results [Bar96, FRT03] show that any finite metric space can be embedded in a hierarchically separated tree metric space with a distortion factor at most logarithmic in the size of the metric space).
Theorem 1.3 (Informal finite metric result).
Let be an arbitrary metric space with diameter . There is a polynomial time -differentially private algorithm such that for any distribution on any integer and any -DP algorithm with probability at least 0.75, satisfies
where . Here, the expectation is taken over the internal coin tosses of as well as over the choice of the i.i.d. samples .
Our lower bound result is actually slightly stronger than stated in Theorem 1.3 since it holds not only for -DP, but also for -DP. At this point, we also compare specifically to the paper of [GHK+23] who give an algorithm for obtaining two-dimensional heatmaps and analyze it theoretically. They focus on the empirical version of a variant of this problem as opposed to the population version, and aim to compete with the best -sparse distribution. Their algorithm takes the sparsity parameter as input in order to set parameters and achieves additive error (and a constant multiplicative factor). On the other hand, our algorithm also performs better for sparse distributions but is automatically adaptive to the sparsity (and hence doesn’t need to take it as an input). Additionally the additive term in our work can be made polynomially small (for any polynomial) in at a logarithmic cost to the multiplicative error (regardless of the sparsity of the distribution). On the other hand, for large their results have additive error that scales with . Their use of a deterministic hierarchical decomposition makes their algorithm unsuitable for our notion of instance optimality (as discussed earlier), and it is unclear if their algorithm can be directly extended to all finite metric spaces.
Note that instance optimality for all finite metric spaces implies instance optimality results for a wide variety of applications not addressed in prior work. For example, our results immediately extend to other low-dimensional real spaces with arbitrary metrics (e.g. those induced by norms). They also give non-trivial improvements on worst-case analysis for higher-dimensional spaces, which are not the main focus of our work. In higher dimensions, we can use a fine grid, at a small additive cost in the Wasserstein distance, to create a finite metric space to which our result applies. Since the dependence on the size of the metric space in the result above is logarithmic, this translates to a multiplicative overhead that is polynomial in the dimension. While this is still a significant overhead, all previous results on density estimation in the Wasserstein distance (in both the private and non-private literature) are worst case, where the sample complexity is exponential in the dimension. Since our results only have a polynomial dependence on the dimension over the optimal error, this is a non-trivial improvement over worst-case error, even when the dimension is large.
Another immediate application of our results is to give (to the best of our knowledge) new bounds for private estimation of discrete distributions in TV distance. Generally, for learning a discrete distribution defined by probabilities , our results lead to a rate (up to polylogarithmic factors) of
This can give significant improvements over the worst case bounds for practically important distributions. The minimax rate is linear in the support size , namely (for sufficiently small ). Now, consider the following power-law distribution over support size : . (Power-law distributions arise frequently in practice: for example, the frequencies of family names and the sizes of power outages follow power laws.) Applying our result above gives a bound that is , which is much better than the worst case bound for large support distributions.
Our result also applies to other practically important settings such as building lists of popular sequences such as n-grams over words. We leave open the questions of designing instance-optimal algorithms for other practically important questions in private learning and statistics, and of designing better instance optimal algorithms for higher dimensional spaces. We also leave open the question of removing the polylogarithmic factors in our instance optimality bounds.
1.2 Techniques
1.2.1 Distributions over $\mathbb{R}$
We start by describing the rate we obtain for distributions over $\mathbb{R}$. In order to state the rate, we will use $q_\alpha$ to represent the $\alpha$-quantile of the distribution $P$, together with a certain restricted version of $P$ described below. The rate consists of three terms, which we describe next (we suppress logarithmic factors).
The first term is the expected Wasserstein distance between the true distribution and the empirical distribution over $n$ samples, and is the non-private term. The remaining two terms represent the cost of privacy. The first of these is a specific interquantile distance. The second can be thought of as capturing the weight of the tails, represented by the Wasserstein distance between $P$ and a ‘restricted’ version of $P$ with its tails chopped off (i.e. the cumulative distribution function is 0 below a lower quantile, 1 above an upper quantile, and identical to that of $P$ otherwise). Observe that all of the terms above are smaller for distributions with small support or greater concentration, and hence the rate adapts to the hardness of the distributions.
Upper Bounds: The upper bound involves estimating roughly equally spaced quantiles of the empirical distribution differentially privately (using a known private CDF estimation algorithm), and placing roughly mass at each of the estimated quantile points. For the analysis, the intuition for each of the terms is as follows: since we only have access to the empirical distribution, the non-private term comes from that. Next, if the quantile estimates are good, then the pointwise CDF differences between the empirical distribution and the estimated distribution are at most (due to the discretization), and so we will pay multiplied by the interquantile distance of the empirical distribution. This aligns with the accuracy of state-of-the-art DP quantile estimation algorithms. Finally, since the distribution is restricted to the estimated quantiles, the distribution is before the first estimated quantile and above the last estimated quantile and so we pay the Wasserstein distance between the empirical distribution and a restricted version of the empirical distribution. Some care needs to be taken while reasoning about expectation versus high probability (for various terms), and in relating population quantities to empirical quantities (which we do using various concentration inequalities). Details can be found in Section 6.2.
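To make the discretization step concrete, here is a minimal non-private sketch. It omits the private quantile estimation entirely, so it is not the paper's algorithm; the function names, the choice of quantile levels $(i + 1/2)/k$, and the toy bimodal data are all our own assumptions. It places mass $1/k$ at each of $k$ equally spaced empirical quantiles and measures the resulting Wasserstein error against the empirical distribution.

```python
import numpy as np

def quantile_atoms(samples, k):
    """Return k atoms, each meant to carry mass 1/k, located at the
    (i + 1/2)/k empirical quantiles. NOTE: no privacy is provided here."""
    levels = (np.arange(k) + 0.5) / k
    return np.quantile(samples, levels)

def w1_same_size(a, b):
    """W1 between two uniform distributions on equally many points:
    in one dimension the optimal coupling matches sorted points."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
# Bimodal data with sparse support: two tight clusters.
samples = np.concatenate([rng.normal(0.2, 0.01, 500),
                          rng.normal(0.8, 0.01, 500)])
k = 20
atoms = quantile_atoms(samples, k)
# Expand each atom to n/k copies so both sides have n equally weighted points.
err = w1_same_size(samples, np.repeat(atoms, len(samples) // k))
```

Each atom sits inside the block of consecutive order statistics it represents, so the discretization error is at most the data range divided by $k$, and it is much smaller when the data are concentrated. A private version would replace `np.quantile` with a differentially private quantile or CDF estimator.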
Lower Bounds: We prove that the private and non-private terms are lower bounds separately. Both proofs follow the same framework. The idea is that given knowledge of two distributions and , we can use a (private) Wasserstein estimation algorithm to construct a hypothesis test distinguishing from . If the (private) estimate for and with samples gives error smaller than , we can use this to distinguish from . This would give a contradiction if and are (privately) indistinguishable with samples. Hence, this would give a lower bound of on the error of the Wasserstein estimation algorithm on or .
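The reduction from estimation to testing can be spelled out with the triangle inequality. In the derivation below, $\hat{P} = A(X)$ denotes the estimate and the threshold $\mathcal{W}(P,Q)/2$ is our assumption about the error level intended in the paragraph above; the test simply outputs whichever of $P$ and $Q$ is closer to $\hat{P}$.

```latex
% Suppose X ~ P^n and the estimate satisfies  W(\hat{P}, P) < W(P, Q) / 2.
\begin{align*}
\mathcal{W}(\hat{P}, Q)
  &\ge \mathcal{W}(P, Q) - \mathcal{W}(\hat{P}, P)
     && \text{(triangle inequality)} \\
  &>   \mathcal{W}(P, Q) - \tfrac{1}{2}\,\mathcal{W}(P, Q)
   =   \tfrac{1}{2}\,\mathcal{W}(P, Q) \\
  &>   \mathcal{W}(\hat{P}, P).
\end{align*}
% Hence \hat{P} is strictly closer to P than to Q, and the test
% "output the closer of P and Q" answers correctly. By symmetry the same
% holds when X ~ Q^n, so an accurate estimator yields an accurate test.
```

Because the test is a post-processing of the estimator's output, it inherits the estimator's privacy guarantee, which is what allows a private indistinguishability argument to yield a private estimation lower bound.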
Thus the task reduces to constructing a distribution that satisfies three properties: 1) it is (privately) indistinguishable from given samples, 2) the Wasserstein distance between and is sufficiently large, 3) . The main technical work is in identifying a distribution that satisfies these properties.
For the privacy term, we construct the distribution $Q$ by taking half the mass from the lowest quantile interval of $P$ (scaling the density function there by $1/2$) and moving it to the highest quantile interval of $P$ of the same mass (scaling the density function there by $3/2$). The third property is satisfied by definition, so we reason about the other two. Intuitively, since the Wasserstein distance captures how hard it is to ‘move’ $P$ to $Q$, this mass needs to move at least the interquantile distance to change $P$ to $Q$. This implies that the Wasserstein distance is at least the interquantile distance scaled by the amount of mass moved, as described in the rate. Additionally, mass that is further out in the tail needs to move more, which is captured by the Wasserstein distance between the distribution and its ‘restriction’. Hence, the Wasserstein distance between $P$ and $Q$ is lower bounded by these two terms of interest. The intuition behind Property 1 is that it is hard for any $\varepsilon$-DP algorithm to pinpoint the location of a small fraction (on the order of $1/(\varepsilon n)$) of the points in the dataset. Overall, this shows the privacy lower bound.
The non-private lower bound requires a more careful construction of . We divide into various scales and carefully adjust them differently to obtain the desired properties. Formally, to construct from , we consider and all quantiles of the form and for . For , we add mass to , by setting the density to be and balance out the extra mass by setting to be between . For (i.e. the tail), we add mass to , by setting to be and balance out the extra mass by setting to be between .
The third property is again trivially satisfied. For the first property, observe that to ‘move’ to the extra mass between has to ‘travel’ between and , and so the Wasserstein distance between and can be lower bounded by a sum of various scaled interquantile distances. We attempt to upper bound the expected Wasserstein distance between and by a similar term. It is more intuitive to reason about this using an alternative (equivalent) formulation of Wasserstein distance as the area between the CDF curves of and . The intuition is that the expected pointwise CDF difference between and in the interval , would be roughly (by properties of a Binomial) and hence the contribution of this interval to the area would be roughly and similarly for the corresponding interval , . Hence, the expected Wasserstein distance would be a sum of these scaled quantile interval distances. We formalize this intuition using a result of Bobkov and Ledoux [BL19] that characterizes the expected Wasserstein distance between and as an integral of a function of the CDF of . We now have a bound in terms of the sum of scaled quantile interval distances, but we want to bound it by a sum of scaled interquantile distances. We can telescope the sum to indeed bound it by a sum of scaled interquantile distances. This establishes that . Next, we show that is indistinguishable from by analyzing the KL divergence between and . The main idea is that high density intervals are modified by a small multiplicative factor of roughly , but low density intervals (with mass less than ) are modified by a constant multiplicative factor, so overall the contribution of each interval to the KL divergence is sufficiently small. This establishes indistinguishability with samples. For formal details we refer the reader to Section 6.1.
1.2.2 Distributions on HSTs
Since the main technical challenge of proving Theorem 1.3 is proving the equivalent result for distributions on HST metric spaces, we focus on that problem in this section. Standard results on low distortion embeddings of metric spaces into HST metric spaces can be used to translate the HST result to and to general metric spaces with overhead.
Definition 1.4 (Hierarchically Separated Tree).
A hierarchically separated tree (HST) is a rooted weighted tree such that the edges between level $\ell$ and level $\ell+1$ all have the same weight (denoted $r_\ell$) and the weights are geometrically decreasing, so $r_{\ell+1} = r_\ell / 2$. Let $D$ be the depth of the tree.
HSTs can be defined with any geometric scaling but we will only need a factor of 2 in this work. HSTs may also have arbitrary degree. A HST defines a metric on its leaf nodes by defining the distance between any two leaf nodes to be the weight of the minimum weight path between them.
HST metric spaces are particularly well-behaved when working with the Wasserstein distance since the Wasserstein distance on a HST has a simple closed form. A distribution $P$ on the underlying metric space in a HST induces a function $G_P$ on the nodes of the tree, where the value $G_P(\nu)$ of a node $\nu$ is the weight in $P$ of the leaf nodes in the subtree rooted at $\nu$. For every level $\ell$ of the tree, let $P_\ell$ be the distribution induced on the nodes at level $\ell$, where the probability of node $\nu$ is $G_P(\nu)$. Thus $P_\ell$ is a discrete distribution on a domain of size $N_\ell$, where $N_\ell$ is the number of nodes in level $\ell$ of the tree.
Lemma 1.5 (Closed form Wasserstein distance formula).
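Given two distributions $P$ and $Q$ defined in a HST metric space, the Wasserstein distance between $P$ and $Q$ has the closed formula:
$$\mathcal{W}(P, Q) = \sum_{\nu} r_\nu \, \left| G_P(\nu) - G_Q(\nu) \right|,$$
where $G_P(\nu)$ is the mass that $P$ places on the leaves of the subtree rooted at $\nu$, $r_\nu$ is the weight of the edge connecting $\nu$ to its parent, and the sum is over all nodes $\nu$ in the tree.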
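For a tree metric, this closed form amounts to summing, over every non-root node, the weight of the edge to its parent times the absolute difference in subtree mass between the two distributions. The sketch below checks this on a small tree. The array-based tree encoding, the function names, and the 7-node example are our own assumptions, chosen only for illustration.

```python
def subtree_mass(parent, leaf_mass):
    """Mass of the subtree rooted at each node.
    parent[v] is the parent of node v (-1 for the root, which is node 0);
    nodes are numbered so that parent[v] < v. leaf_mass maps leaf -> mass."""
    mass = [0.0] * len(parent)
    for leaf, m in leaf_mass.items():
        mass[leaf] = m
    # Visit children before parents and push mass upwards.
    for v in range(len(parent) - 1, 0, -1):
        mass[parent[v]] += mass[v]
    return mass

def tree_wasserstein(parent, edge_weight, p, q):
    """Sum over non-root nodes of (edge weight to parent) * |mass difference|."""
    mp, mq = subtree_mass(parent, p), subtree_mass(parent, q)
    return sum(edge_weight[v] * abs(mp[v] - mq[v])
               for v in range(1, len(parent)))

# A depth-2 binary tree: root 0, internal nodes 1 and 2, leaves 3..6.
# Edge weights halve at each level, as in a 2-HST.
parent = [-1, 0, 0, 1, 1, 2, 2]
edge_weight = [0.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]

siblings = tree_wasserstein(parent, edge_weight, {3: 1.0}, {4: 1.0})
cousins = tree_wasserstein(parent, edge_weight, {3: 1.0}, {5: 1.0})
```

For two point masses the result equals the tree distance between the two leaves: 0.5 + 0.5 for siblings, and 0.5 + 1 + 1 + 0.5 for leaves in different halves of the tree.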
We will call a node -active under the distribution if . Let be the set of -active nodes under and be the set of -active nodes at level . Then there exists an algorithm such that given a distribution , , and ,
where the max is over all the levels of the tree and . Further, this bound matches (up to logarithmic factors) the lower bound where . The error rate does indeed adapt to easy instances as we expected. The error decomposes into three components. The first component is the non-private sampling error; the error that would occur even if privacy was not required. The second component indicates that we can not privately estimate the value of nodes that have probability less than . The third component is the error due to privacy on the active nodes. If is highly concentrated then we expect most nodes to either be -active or have weight 0, so the first two terms in are small. There should also be few active nodes, making the last term also small. Conversely, if has a large region of low density then we expect a large number of inactive nodes, as well as non-zero inactive nodes that are at higher levels of the tree and hence contribute more to the final term. Thus, in distributions with high dispersion we expect the right hand side to be large.
Upper Bounds: As in the one-dimensional setting, we want to restrict to only privately estimating the density at a small number () of points. While we could try to mimic the one-dimensional solution by privately estimating a solution to the -median problem, it’s not clear how to prove such an approach is instance-optimal. It turns out that a simpler solution more amenable to analysis will suffice. Our algorithm has two stages: first we attempt to find the set of -active nodes, then we estimate the weight of these active nodes. Since these nodes have weight greater than , we can privately estimate them to within constant multiplicative error. Any nodes that are not detected as active are initially ascribed a weight of 0. The error due to not estimating the non-active nodes is absorbed into the third error term. The final step is to project the noisy density function into the space of distributions on the underlying metric space. The error of the upper bound algorithm is summed over all levels of the tree, although since the depth of the tree is logarithmic in the size of the metric space, this is within a logarithmic factor of the maximum over the levels.
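The two-stage idea can be illustrated on a single level of the tree. The sketch below is not the paper's algorithm: it assumes a plain Laplace mechanism for both stages, an even split of the privacy budget, a hypothetical `threshold` on noisy frequencies standing in for the activity threshold, and simple renormalisation standing in for the projection step.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def two_stage_level_estimate(counts, n, eps, threshold, rng):
    """counts must list every node of the level, including empty ones."""
    scale = 4.0 / eps  # histogram L1 sensitivity 2, budget eps/2 per stage
    # Stage 1: flag nodes whose noisy frequency clears the threshold.
    active = [v for v in counts
              if (counts[v] + laplace(scale, rng)) / n > threshold]
    # Stage 2: fresh noisy estimate for flagged nodes, zero for the rest.
    est = {v: 0.0 for v in counts}
    for v in active:
        est[v] = max(0.0, (counts[v] + laplace(scale, rng)) / n)
    # Stand-in for the projection step: renormalise to a distribution.
    total = sum(est.values())
    if total > 0:
        est = {v: w / total for v, w in est.items()}
    return est

rng = random.Random(0)
counts = {"a": 900, "b": 100, "c": 0, "d": 0}
estimate = two_stage_level_estimate(counts, n=1000, eps=1.0,
                                    threshold=0.05, rng=rng)
```

Thresholding and renormalisation are post-processing of the noisy counts, so in this sketch they do not consume additional privacy budget; handling all levels of a tree would require splitting the budget further.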
Lower Bound: We first observe that in order to estimate the distribution well in Wasserstein distance, an algorithm must estimate each level of the tree well in TV distance. This is derived from Lemma 1.5. This allows us to reduce to the problem of lower bounding the error of density estimation of discrete distributions in TV distance. The main tool we use is a differentially private version of Assouad’s method. Similar to how the technique in the previous section allowed us to relate lower bounding estimation rates to simple hypothesis testing, Assouad’s lemma allows us to relate lower bounding estimation rates to multiple hypothesis testing. Note that unlike the technique in the previous section, Assouad’s lemma allows us to prove lower bounds on the expected error, rather than lower bounds on high probability error bounds. It involves constructing nets of distributions in that are pairwise far in the relevant metric of interest (which for us is the TV distance) but such that the multiple hypothesis testing problem between the distributions is sufficiently hard. For proving that the third term belongs in the lower bound, the standard statement of DP Assouad’s lemma [ASZ21] suffices, where one builds a set of distributions indexed by a hypercube. For the first and second terms, we need to slightly generalise the statement to allow for sets of distributions indexed by a product of hypercubes. We use the approximate DP version of Assouad’s lemma, so while our upper bounds are for pure differential privacy, our lower bounds hold for both pure and approximate differential privacy.
Let us start with the third term. Suppose the number of active nodes is even (a small tweak is made if there is an odd number of active nodes). We pair up the active nodes and index each pair by a coordinate of the hypercube. For each corner of the hypercube, , for each coordinate , if , we move mass from one node in the th pair to the other node. If then we leave the th pair of nodes alone. Since each active node has mass , it’s clear that each resulting distribution belongs in . We can also show that these distributions form a sufficiently hard multiple hypothesis testing problem. By DP Assouad’s (Lemma 5.8), this allows us to lower bound the estimation error by , which is within of the third term when the number of active nodes is . We treat the case where there is a single active node separately.
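The hypercube-indexed family can be written down explicitly on a small example. This is a minimal sketch in which the node names, the pairing, and the amount of mass moved (`shift`) are hypothetical choices made for illustration.

```python
from itertools import product

def shifted_distribution(p, pairs, corner, shift):
    """Move `shift` mass within the i-th pair whenever corner[i] == 1."""
    q = dict(p)
    for (src, dst), bit in zip(pairs, corner):
        if bit == 1:
            q[src] -= shift
            q[dst] += shift
    return q

def tv_distance(p, q):
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

def hamming(u, v):
    return sum(1 for a, b in zip(u, v) if a != b)

p = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
pairs = [("a", "b"), ("c", "d")]
shift = 0.05
family = {u: shifted_distribution(p, pairs, u, shift)
          for u in product([0, 1], repeat=len(pairs))}
```

In this toy family the TV distance between two corners equals `shift` times their Hamming distance, which is the additive structure that an Assouad-style argument exploits, and every member stays within a factor of two of `p` pointwise.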
For the second term, we want to pair up the inactive nodes in a similar manner and move half their mass from one node to the other. However, since we want to remain within , we cannot pair up two arbitrary inactive nodes. Thus, we divide the inactive nodes into scales, where nodes within a certain scale all have weight within a multiplicative factor of two. We then pair up nodes within each scale and have a different hypercube for each scale. Again, it’s clear that these distributions are all in and we can show that these distributions form a sufficiently hard multiple hypothesis testing problem. The proof for the first term follows similarly.
2 Preliminaries
For all distributions , we will use to denote the density of (when it exists) and to denote the cumulative distribution function of . Given a space , let be the set of distributions on the space . Given a logical statement , let if is false and 1 if is true. For example, and .
A number of distances between distributions are important in this work. We start by defining the infinity divergence, which is important in the notion of instance optimality we use.
Definition 2.1 (-divergence).
Given two distributions and with the same support, the -Rényi divergence , if and are discrete, and , if and are continuous distributions on , and have density functions. If and don’t have the same support, then .
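For discrete distributions the divergence is easy to compute. The sketch below assumes the standard one-sided form, the maximum over the support of the log-ratio of probabilities; a symmetric variant is included in case the two-sided form is the one intended. Function names are illustrative.

```python
import math

def renyi_infinity(p, q):
    """max_x log(p(x) / q(x)); infinite if the supports differ."""
    support_p = {x for x, m in p.items() if m > 0}
    support_q = {x for x, m in q.items() if m > 0}
    if support_p != support_q:
        return math.inf
    return max(math.log(p[x] / q[x]) for x in support_p)

def renyi_infinity_symmetric(p, q):
    """Larger of the two one-sided divergences."""
    return max(renyi_infinity(p, q), renyi_infinity(q, p))

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
one_sided = renyi_infinity(p, q)        # log 2, attained at "a"
mismatched = renyi_infinity(p, {"a": 1.0, "b": 0.0})
```

A bound of log 2 on the symmetric variant says exactly that the two probability vectors are within a factor of two of each other pointwise.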
We will use to denote the KL-divergence, to denote the squared Hellinger divergence and to denote the total variation distance. These quantities are defined in Appendix A.
Wasserstein Distance:
The error metric that we use to judge our performance on the density estimation task is the 1-Wasserstein distance (which we will simply call the Wasserstein distance where it is clear from context). In this subsection, we define the Wasserstein distance.
Definition 2.2.
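For any separable metric space $(\mathcal{X}, d)$, let $\mathcal{M}(\mathcal{X})$ represent Borel measures on $\mathcal{X}$. Then, the 1-Wasserstein distance between $P, Q \in \mathcal{M}(\mathcal{X})$ is defined as
$$\mathcal{W}(P, Q) = \inf_{\pi} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)\, d\pi(x, y),$$
where the infimum is over all measures $\pi$ on the product space $\mathcal{X} \times \mathcal{X}$ with marginals $P$ and $Q$ respectively.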
Finally, for one-dimensional real spaces where the metric of interest is the $\ell_1$ norm, we will use the following equivalent formulation of the Wasserstein distance extensively.
Lemma 2.3 (Wasserstein formula over $\mathbb{R}$).
Let $P, Q$ represent probability distributions on $\mathbb{R}$ with finite expectation. Then, the 1-Wasserstein distance between $P$ and $Q$ is equal to
$$\mathcal{W}(P, Q) = \int_{\mathbb{R}} \lvert F_P(t) - F_Q(t) \rvert \, dt,$$
where $F_P$ and $F_Q$ represent the cumulative distribution functions of $P$ and $Q$.
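For finitely supported distributions on the real line, the integral of the absolute CDF gap reduces to a finite sum over consecutive support points. The following is a small sketch under that assumption, with distributions given as dictionaries from points to masses (an illustrative representation).

```python
def wasserstein_1d(p, q):
    """Integrate |F_p - F_q| over the real line for finitely supported p, q."""
    points = sorted(set(p) | set(q))
    total, cdf_gap = 0.0, 0.0
    for left, right in zip(points, points[1:]):
        # The CDF gap is constant on [left, right).
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        total += abs(cdf_gap) * (right - left)
    return total

unit_shift = wasserstein_1d({0.0: 1.0}, {1.0: 1.0})
spread = wasserstein_1d({0.0: 0.5, 2.0: 0.5}, {1.0: 1.0})
```

Both toy values equal 1: in the first case all mass moves a distance of one, and in the second each half of the mass moves a distance of one.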
Given a metric space , the Wasserstein metric is a well-defined metric on the set of the probability distributions over .
2.1 Differential Privacy
We start by defining the Hamming distance between datasets.
Definition 2.4 (Hamming Distance).
For any $n$ and two datasets $D, D' \in \mathcal{X}^n$, the Hamming distance between the two datasets is defined as the number of entries that they disagree on, i.e. $d_{\mathrm{Ham}}(D, D') = \sum_{i=1}^{n} \mathbb{1}[D_i \neq D'_i]$.
Next, we define differential privacy.
Definition 2.5 (Differential Privacy [DMNS17, DKM+06]).
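A randomized algorithm $\mathcal{A} : \mathcal{X}^n \to \mathcal{Y}$ is said to be $(\varepsilon, \delta)$-differentially private if for every pair of datasets $D, D' \in \mathcal{X}^n$ such that $d_{\mathrm{Ham}}(D, D') \le 1$ and for all subsets $S \subseteq \mathcal{Y}$,
$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] + \delta.$$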
Differential privacy satisfies two important properties that we will utilise: it is closed under post-processing and composes naturally. More information about these properties and an example of a basic differentially private algorithm are given in Appendix A.
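As a concrete illustration of these definitions, the sketch below computes the Hamming distance between two datasets and releases a counting query with Laplace noise. The Laplace mechanism is used here as a generic textbook example and is not necessarily the one presented in Appendix A; the function names are illustrative.

```python
import random

def hamming_distance(d1, d2):
    """Number of positions at which two equal-length datasets differ."""
    if len(d1) != len(d2):
        raise ValueError("datasets must have the same size")
    return sum(1 for a, b in zip(d1, d2) if a != b)

def laplace_count(dataset, predicate, eps, rng):
    """Noisy count of entries satisfying `predicate`.

    Replacing one entry changes the count by at most 1, so Laplace noise of
    scale 1/eps gives eps-differential privacy (with delta = 0).
    """
    true_count = sum(1 for x in dataset if predicate(x))
    noise = rng.expovariate(eps) - rng.expovariate(eps)  # Laplace(1/eps)
    return true_count + noise

d = [1, 2, 3, 4]
d_neighbour = [1, 5, 3, 4]
dist = hamming_distance(d, d_neighbour)
noisy = laplace_count(d, lambda x: x > 2, eps=1.0, rng=random.Random(0))
```

Any function of `noisy` alone (rounding, clipping, thresholding) is post-processing and keeps the same privacy guarantee.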
3 On Instance Optimality
In this section, we discuss the notion of instance optimality, and argue that it provides a useful benchmark that captures the idea of going beyond the worst case. The notion of instance optimality we propose can be seen as a generalisation of the hardest one-dimensional subproblem, or hardest local alternative, introduced by [CL15]. Suppose we have a family of distributions on a space and our goal is to learn the parameter where is a metric space with metric . The estimation rate of an algorithm is defined below in expectation; we will sometimes show results (e.g. in the one-dimensional case) where the estimation rate is instead defined with probability at least over the randomness of the algorithm and the data (see Equation 1). Formally, given an estimation algorithm , we can define the estimation rate of to be the function where
Since the estimation rate is a function of the distribution , the estimation rate of an algorithm may be lower at “easy” distributions and larger at “harder” distributions. As a classic example, consider the estimation rate of Bernoulli parameter estimation where simply outputs the empirical mean. Then , so this algorithm performs better when the Bernoulli parameter is close to 0 or 1, and has its worst-case error when .
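The Bernoulli example can be made concrete. The sketch below assumes the error is measured by root-mean-squared error, for which the empirical mean of n samples has error exactly sqrt(p(1-p)/n); the function name is illustrative.

```python
import math

def empirical_mean_rmse(p, n):
    """Root-mean-squared error of the empirical mean of n Bernoulli(p) samples."""
    return math.sqrt(p * (1.0 - p) / n)

n = 100
rates = {p: empirical_mean_rmse(p, n) for p in (0.01, 0.1, 0.5, 0.9, 0.99)}
```

The rate is symmetric about 1/2, largest at p = 1/2, and shrinks as p approaches 0 or 1, which is the sense in which the same algorithm is more accurate on "easy" instances.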
Cai and Low [CL15] proposed three desiderata that a target estimation rate should satisfy in order to be a meaningful benchmark:
1. The target rate varies significantly across the family of distributions.
2. The target rate is an achievable estimation rate: there exists an algorithm and a constant such that, for all distributions in the family, the estimation rate of the algorithm is at most the constant times the target rate.
3. Outperforming the benchmark at one distribution leads to worse performance at another distribution.
In this section we will discuss the definition of instance optimality we will use in this work by defining the target estimation rate that will serve as our benchmark estimation rate. The main theorems of this paper establish that our chosen benchmark achieves desiderata 1 and 2 above. It is not immediately obvious that desideratum 3 holds. We will show in Section 3.2, through the introduction of a related notion of instance optimality which we call local minimality, that desideratum 3 holds in many important settings, including the problem studied in this paper.
3.1 Local Estimation Rates
We will start by defining a target estimation rate. We’ll say an algorithm is -instance optimal if it uniformly achieves this target estimation rate up to a multiplicative factor. For each distribution , we define a neighbourhood .
Definition 3.1.
Given a function , where is the power set of , we define the optimal estimation rate with respect to to be:
(2) |
An algorithm is -instance optimal with respect to if for all ,
If an algorithm uniformly achieves the optimal estimation rate with respect to a function , then this implies that for all distributions , the error of the algorithm on is competitive with an algorithm that is told the additional information that the distribution is in . Given a function , it is possible that there does not exist an algorithm that uniformly achieves . For example, as discussed in the introduction, if , then is not uniformly achievable. Conversely, if is not chosen carefully, then the estimation rate may not define a meaningful benchmark, e.g. an estimation rate that adapts to easy instances.
A different formalization may be more probabilistic: the algorithm designer may have in mind a distribution over distributions that they care about, and their objective may be to minimize . Suppose that for the chosen by the algorithm designer, and for our neighborhood map , the function does not vary too much over on average. Formally, let
and let . Then for any algorithm that is -instance optimal with respect to , we can write
In other words, as long as the algorithm ’s performance is relatively constant over on average over the distribution of interest, the instance optimal algorithm (that is not tailored to ) is competitive with . A similar result holds for a multiplicative definition of .
This discussion can help guide the choice of the neighborhood function that is appropriate for a particular application. In the case of density estimation in the Wasserstein distance, we will define to be a small ball around . We believe this captures the kind of domain information an algorithm designer may have. For example, one may have a small amount of public data samples, in which case the posterior over distributions in a ball will be relatively constant. If the algorithm designer’s custom algorithm needs to do well for all distributions in this set, an instance-optimal algorithm will be competitive with this custom algorithm.
Previous work in instance optimality has largely focused on two notions of neighborhood. In [FLN01, ABC17, VV16, OS15], where the objects of interest are discrete subsets with no a priori structure, it is natural to ask that the algorithm work well for any permutation of the inputs. For example, if the goal is to compute the set of maximal points from a 2-d point set, the algorithm designer would typically want an algorithm that works well for any permutation of the set of input points. In our setting where the points of interest have a metric structure, this is not an appropriate notion. In fact, even for the discrete case studied in Section 5.2.2, permutation invariance cannot capture natural prior beliefs that may arise in practice. For example, for power-law distributions that one often sees in private learning applications [ZKM+20, CB22, CCD+23], a small number of samples are sufficient to get a good estimate of the heavy bins, and rule out a large fraction of permutations of the input space.
A second line of work arising from the statistics literature [CL15] has looked at defining instance-optimality with respect to neighborhoods of size two. While this approach has been very successful for many problems, we find it inappropriate for density estimation (outside of density estimation on the real line), as neighborhoods of size two are too weak to capture the difficulty of problems of interest. Even in the simple case of discrete distributions, this neighborhood is provably insufficient to get instance-optimality results with any competitive ratio. Indeed, for any two given distributions on with TV distance , samples suffice to distinguish them, whereas learning a near uniform distribution on atoms requires samples. In the private setting, the need to use multiple distributions to prove lower bounds is well-studied. Our approach shares this similarity of using a multi-instance lower bounding argument with packing lower bounds in privacy, and local Fano’s and Le Cam’s methods in statistics. Our work shows that some of the same lower bounding techniques can be used to prove instance-optimality results with respect to natural neighborhood maps, going well beyond the worst-case results those works prove.
In the special case of density estimation in the Wasserstein distance on the real line, instance optimality with respect to neighborhoods of size 2 is achievable. In the standard version of this benchmark, the second distribution in the neighborhood can be any distribution and is chosen to maximise the resulting rate. However, this notion may not be an appropriate notion of instance optimality by itself. To see this, consider a distribution supported on an interval . Moving a small amount of mass from one end of the interval to the other would create an indistinguishable distribution that is far from in Wasserstein distance, and a hypothesis testing argument can be used to show that the target estimation rate defined above (for the hardest one-dimensional sub-problem) depends on the interval size . This implies that the adaptivity of algorithms to the support size of the distribution (crucial in Wasserstein estimation) is not captured by this notion of instance optimality. Instead, we add a further restriction to the definition to make it more appropriate for our setting: we only consider distributions that are in a small ball around the true distribution (in the infinity divergence), and ask that an algorithm is competitive with an algorithm that is told the additional information that the distribution is one of the two candidates (in the worst case over distributions that are in this ball). That is, we define the benchmark estimation rate to be
(3) |
Note that all such distributions have the same support as , which allows us to capture the adaptivity of algorithms to the support size of the distribution. Specifically, we define the following target estimation rate in the one-dimensional setting. In the case of estimating distributions on a bounded subset of , we will show that this error rate is achievable, up to logarithmic factors.
We also note that our notion of instance optimality more naturally captures the accuracy of algorithms even for basic tasks. Note that for the Bernoulli case, our technique achieves a bound of which also appears to be better than the instance-optimal lower bounds in [MSU22], which take the form . This apparent contradiction can be explained by the use in [MSU22] of the hardest one-dimensional sub-problem to define the instance-optimal rate, i.e., is for a worst-case Bernoulli . On the other hand, the notion of instance-optimality we use would only consider Bernoullis such that . When is close to , the lower bound in [MSU22] would on this instance consider to be , which can have a large -distance from , and so isn’t in the neighborhood used in our notion of instance-optimality. Hence, the target rate one would obtain from our definition is smaller when is close to . Our algorithm can achieve this improved rate, as it is likely to output as an estimate of in this case, pushing small counts down to zero.
Recent differentially private algorithms such as those in [HLY21, DKSS23] have shown instance-optimality for problems such as mean estimation. Relatedly, other works have designed algorithms that adapt to the local/smooth/deletion sensitivity of the underlying function. An instance in these works is a dataset rather than a distribution, and it is not clear how to extend the corresponding notion of neighborhood to our setting. Our neighborhood notion perhaps comes closest to the deletion neighborhoods considered in some of these works.
Finally, we remark that while we have stated our results as being competitive with the worst-case instance in , they apply for the average case over a specific distribution over . Since that specific distribution is adversarial, we don’t view this version as more natural than the worst case.
3.2 Locally Minimal Algorithms
In this section we address the third desideratum of [CL15]. An important concept in statistics is that of efficiency of an estimator, which informally compares the rate of convergence of the estimator with a benchmark that in general is not beatable. This idea has been used to argue that for some fundamental estimation problems, the Maximum Likelihood Estimator (MLE) is the best possible. Hodges showed an example of a superefficient estimator that is asymptotically as good as the MLE everywhere, but beats the MLE on a certain set of inputs. The statistics community has argued in multiple ways that these superefficient estimators do not limit our ability to argue that the MLE is “optimal”. We refer the reader to [vdV97, Wol65, Vov09] for a discussion of superefficiency. One of the more compelling arguments here is a result saying that the set of points where superefficiency is achieved has Lebesgue measure zero. This in particular implies that in a small neighborhood around any point, there is a point (in fact many points) where the superefficient estimator does no better than the MLE. In the partial order on estimators, the MLE is thus minimal and this is true even when looking at the performance of the estimator only on a small neighborhood around a given point.
This motivates a slightly different notion capturing the goodness of the algorithm locally.
Definition 3.2.
Let be a class of algorithms. We say that an algorithm is -locally minimal with respect to a neighborhood map , if for all instances , and all , there is a such that .
In words, local minimality says that for any other , the algorithm is competitive with for some instance in the neighborhood of . Put differently, no can be uniformly much better than on the neighborhood, even one that knows .
We show that in general, this notion is incomparable to our notion of instance optimality. Nevertheless, under reasonable assumptions, the two notions are closely related.
Example 3.3 (Local Minimality $\not\Rightarrow$ Instance Optimality).
Consider a pair of instances with . Let contain two algorithms , and with
Then one can verify that is (-)locally minimal in . However, it is not -instance optimal for any finite as it fails to satisfy the definition at .
Example 3.4 (Instance Optimality $\not\Rightarrow$ Local Minimality).
Consider a set of instances with . Let contain algorithms with
Then one can verify that is (-)instance optimal in . However, it is not -locally minimal at .
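Both notions can be checked mechanically on finite toy problems. The sketch below encodes Definitions 3.1 and 3.2 for a finite class of algorithms, with estimation rates stored in a table; the instance names, neighborhood maps, and rate values are hypothetical and are not the ones used in Examples 3.3 and 3.4.

```python
def optimal_rate(rates, neighborhood, p):
    """min over algorithms of the worst rate over the neighborhood of p."""
    return min(max(rates[a][q] for q in neighborhood[p]) for a in rates)

def is_instance_optimal(rates, neighborhood, alg, alpha):
    return all(rates[alg][p] <= alpha * optimal_rate(rates, neighborhood, p)
               for p in neighborhood)

def is_locally_minimal(rates, neighborhood, alg, alpha):
    """For every p and competitor, some q near p where alg is competitive."""
    return all(
        any(rates[alg][q] <= alpha * rates[other][q] for q in neighborhood[p])
        for p in neighborhood for other in rates)

# Toy 1: locally minimal, but not instance optimal for any finite alpha.
rates1 = {"A": {"P1": 0.0, "P2": 1.0}, "B": {"P1": 0.0, "P2": 0.0}}
nbhd1 = {"P1": ["P1", "P2"], "P2": ["P1", "P2"]}

# Toy 2: instance optimal with alpha = 1, but not locally minimal at P1.
rates2 = {"A": {"P1": 1.0, "P2": 100.0, "P3": 100.0},
          "B": {"P1": 0.01, "P2": 1.0, "P3": 100.0}}
nbhd2 = {"P1": ["P1", "P2"], "P2": ["P2", "P3"], "P3": ["P3"]}
```

In the first toy the benchmark at P2 is zero, so no finite factor suffices for instance optimality; in the second, algorithm B beats A by a large factor everywhere on the neighborhood of P1 even though A meets the benchmark at every instance.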
Under smoothness assumptions on with respect to , one can argue that the two notions are essentially equivalent.
Proposition 3.5.
Let be such that for all instances and for all , . Further, suppose that is compact for any . If is -instance optimal in with respect to , then it is -locally minimal.
Proof.
Let be an instance and let be a competing algorithm. By definition of -instance optimality,
By compactness, this implies that there is a achieving the supremum. In other words, there exists such that
Since , our smoothness assumption implies that
Combining the last two inequalities, this satisfies
Since and were arbitrary, this implies that is -locally minimal. ∎
Proposition 3.6.
Let be such that for all instances and for all , . If is -locally minimal in with respect to , then it is -instance optimal.
Proof.
Let be an instance and let be a competing algorithm. By definition of -local minimality, there is a such that
Since , our smoothness assumption implies that
Combining the last two inequalities, this satisfies
Since and were arbitrary, this implies that is -instance optimal. ∎
A similar pair of results holds when the comparator algorithm is smooth with respect to the neighborhood map.
3.3 Relaxed Definitions
We finish by noting relaxations of the above definitions that share the same semantic meaning (our algorithms will achieve these relaxed notions).
Definition 3.7.
Given a function , where is the power set of , we define the optimal estimation rate with respect to to be:
(4) |
An algorithm is -instance optimal with respect to if for all ,
Definition 3.8.
Let be a class of algorithms. We say that an algorithm is -locally minimal with respect to a neighborhood map , if for all instances , and all , there is a such that .
Note that we think of and as non-negative. The reason these are relaxed definitions is because we allow for an additive approximation factor in addition to a multiplicative factor, and also compare to a benchmark rate that depends on a potentially smaller number of samples (and is hence easier to achieve). The original definition of instance optimality (Definition 3.1) can be obtained by setting and .
In our work, for most settings of interest, we roughly achieve and to be an arbitrarily small polynomial in the inverse of the number of samples at a cost to the multiplicative factor. We don’t view this as a significant issue since we expect the benchmark rate with samples to behave asymptotically similarly to that with samples in most cases. We leave it as an open question as to whether the original definition of instance optimality can be achieved.
4 Additional Related Work
Instance Optimality for Differentially Private Statistics:
Several recent works have focused on formulating and giving ‘instance optimal’ differentially private algorithms for various statistical tasks. The work of McMillan, Smith and Ullman [MSU22] is most directly related to our work; they gave locally minimax optimal algorithms for parameter estimation for one-dimensional exponential families in the central model of differential privacy. The work of Duchi and Ruan [DR18] also gives locally-minimax optimal algorithms for various one-dimensional parameter estimation problems under the stronger constraint of local differential privacy. The notion of local minimax optimality both these papers use is based on the hardest one-dimensional sub-problem described in Section 3.1. While our results for density estimation in satisfy this notion, they also satisfy a stronger notion described in Section 3.1. Additionally, as discussed in [MSU22], this definition is provably unsuitable for higher dimensions; we instead suggest a looser definition of instance optimality that is more promising in higher dimensions. More importantly, our paper is primarily focused on the non-parametric setting, and hence our techniques are different than the ones used in those papers, which focused primarily on parameter estimation.
Other Beyond Worst-Case Results in Central Differential Privacy:
Several additional works in the differential privacy literature study algorithms with accuracy that varies with the input dataset. Nearly all of them look at the empirical setting where we are concerned with the specific input dataset, rather than a distribution it may be drawn from. While initial algorithms in differential privacy added noise based on a worst-case notion of global sensitivity, these works give various algorithmic frameworks that help develop algorithms with guarantees that adapt to the hardness of the input dataset. These include algorithms based on smooth sensitivity [NRS07, BS19], the propose-test-release framework [DL09, BA20], Lipschitz extensions [BBDS13, KNRS13, CZ13, RS16], and sensitivity pre-processing [CD20]. However, none of these works study a formal notion of instance optimality.
In contrast, some more recent works do study definitions of instance optimality in the empirical setting. A work of Asi and Duchi [AD20] studies two notions of instance optimality: one by comparing the performance of an algorithm on a dataset against the performance of the best unbiased algorithm on that dataset, and another based on an analogue of the ‘hardest one-dimensional sub-problem’ for the empirical setting (they compare the performance of an algorithm on a dataset with all benchmark algorithms that know that the input dataset is either of two possible datasets but whose performance is evaluated as the worst over the two datasets). They give a general mechanism known as the inverse sensitivity mechanism that they show is nearly instance optimal under these definitions for various problems such as median and mean estimation. Our work is focused on population quantities as opposed to empirical quantities; while these are related, they can be very different. For example, as pointed out in McMillan, Smith and Ullman [MSU22], using the inverse sensitivity mechanism in [AD20] to estimate the mean of a Gaussian (by using a locally minimax optimal algorithm for empirical mean) will result in infinite mean squared error, whereas other approaches that reason directly about the population quantities can get much better error.
In [DKSS23] and [HLY21], different notions of instance optimality are defined. Roughly, they compare the performance of an algorithm on a dataset with a benchmark algorithm that knows the input dataset but whose performance is evaluated as the worst-case performance over large subsets of the input dataset. While the details of the definitions in these papers vary slightly, both papers give instance-optimal algorithms for mean estimation under their respective definitions. For one-dimensional distributions, our algorithmic technique at a high level shares ideas with these algorithms—the algorithms in their papers try to adapt to the range of values in the dataset, whereas we try to adapt to the level of concentration of the distribution. However, the details of how this is done and the associated analyses vary. Our algorithm for general metric spaces uses different techniques. Our work differs from these works in a few other prominent ways: firstly, they are primarily concerned with estimating functionals of the underlying dataset, whereas we are concerned with density estimation in Wasserstein distance—these are problems with different output types and different error metrics. Finally, it is not clear if notions such as subset-based instance optimality that are well defined in the empirical setting transfer meaningfully to the distributional setting.
Instance-Optimal Statistical Estimation without Privacy Constraints:
Donoho and Liu [DL91] formulated the notion of the ‘hardest one-dimensional sub-problem’ as a way of capturing instance optimality for statistical estimation and gave non-private instance optimal algorithms for some one-dimensional parameter estimation problems. Cai and Low [CL15] formulated an instance-optimality type definition for non-parametric estimation problems. Our results for Wasserstein density estimation over use a stronger version of this notion of instance optimality. In higher dimensions, this notion is provably unachievable, and so we define a different notion.
The other line of work most related to ours is on instance-optimal learning of discrete distributions [OS15, VV16, HO19]. In their setting, instance optimality is defined by comparing the performance of an algorithm on a discrete distribution to the minimax error of any algorithm on the class of discrete distributions with probability vectors that are permutations of the probability vector of . We note that this notion is not well suited to many metric spaces, because permutations may not preserve properties such as concentration of the distribution, and hence this notion of instance optimality may provide an overly pessimistic view of the performance of an algorithm. Our notion of instance optimality (in terms of neighborhood) compares against algorithms with a different type of prior knowledge, i.e., the location of where the distribution concentrates, and approximate values of the probabilities at each point. We note that these are technically incomparable, and may be useful in different settings. For estimation in Wasserstein distance, knowledge of where the distribution is concentrated could be very useful in algorithm design, and so comparing to algorithms with this type of knowledge is more appropriate. See Section 3 for more discussion.
Finally, there is another line of work on getting similar instance optimal guarantees for other statistical problems [ADJ+11, ADJ+12, AJOS13b, AJOS13a]. For the closeness testing problem (given two sequences, determine if they are produced by the same distribution, or different distributions), Acharya, Das, Jafarpour, Orlitsky, Pan and Suresh [ADJ+11, ADJ+12] developed a test (without any knowledge about the generating distributions) that achieves the same error with samples that an optimal label-invariant test that knows the distributions and would achieve with samples.
Other work on Differentially Private Statistics:
There is a lot of other work on private statistical estimation, and we survey the most relevant parts of the literature here. There is a long line of work on minimax parameter/distribution estimation on various parametric distribution families: product distributions [BUV18, KLSU19, ASZ20, CWZ19, Sin23], Gaussian, sub-Gaussian distributions (and more generally exponential families) [KV18, KLSU19, AAK21, BGS+21, KMS+22b, KMS22a, HKM22, KMV22, AL22, LKO22, TCK+22, HKMN23, AKT+23, BHS23, KDH23], mixtures of Gaussian distributions [KSSU19, AAL23b, AAL23a], heavy-tailed distributions [KSU20, Nar23], discrete distributions with finite support [DHS15, ASZ20], distributions with finite covers [BKSW21] and more. This line of work focuses on minimax guarantees in the parametric setting, i.e. optimizing the worst-case error of an algorithm over the entire class of distributions. Our work, on the other hand, works in the non-parametric setting, where we do not make assumptions about the distribution the dataset is drawn from, but instead give ‘instance-optimal’ algorithms that adapt to the hardness of the distribution the input dataset is drawn from.
There is also a line of work on differentially private CDF estimation [DNPR10, CSS11, BNS16, BNSV15, ALMM19, KLM+20, CLN+23], and quantile estimation [KSS22, GJK21, ASSU24]. Our algorithm for density estimation over uses a quantile estimation algorithm (based on a CDF estimator) as a subroutine. Finally, there is a line of work on differentially private testing [ASZ17, CDK17, CKM+19], and the work characterizing the sample complexity of simple hypothesis tests forms an important part of our analysis of the instance-optimal rate for distributions over .
Work on Estimation in Wasserstein Distance:
In addition to the recent works [BSV22, HVZ23] on private Wasserstein learning on , there is a plethora of works studying it in the non-private setting.
One line of work studies the convergence in Wasserstein distance of the empirical measure (on samples) to the true measure, as a function of the measure and the number of samples [Dud69, DY95, CR12, DSS11, BG14, FG15, BL19, WB19, Lei20, Fou23]. Some of the later works above can be viewed as studying this problem from a beyond worst-case analysis viewpoint. They give upper and lower bounds for the expected value of this quantity, in terms of various notions of ‘dimension’ of the underlying measure, such as the covering number of the support of the distribution, the upper and lower ‘Wasserstein dimensions’ of the measure, and others. Our work shows that the empirical measure, appropriately massaged, is approximately instance-optimal for density estimation without privacy constraints (for the notions of instance optimality we consider), and hence these works give us a handle on the instance-optimal rate as a function of the distribution and sample size . Some more recent work studies minimax estimation in Wasserstein distance [SP19, NWB19], and show that without additional assumptions on the distribution, the empirical measure is minimax optimal. Our work extends this result to show that in the general non-parametric setting, the empirical measure is also approximately instance-optimal; to the best of our knowledge, instance optimal estimation in Wasserstein distance (even without privacy constraints) has not been previously studied.
5 Distribution Estimation on Hierarchically Separated Trees
Let us now turn to distribution estimation on arbitrary finite metric spaces. We will use the fact that any metric on a finite space can be embedded in a hierarchically separated tree (HST) metric to reduce the problem of density estimation in Wasserstein distance on an arbitrary metric space to density estimation in Wasserstein distance on an HST. In Section 5.2 we’ll characterise the target estimation rate . In Section 5.3, we’ll then provide an -DP algorithm and prove that it achieves this target estimation rate up to logarithmic factors.
5.1 Preliminaries on Hierarchically Separated Trees
A key component of our proof strategy is the reduction to Hierarchically Separated Trees (HSTs). HSTs are a special class of tree metrics that are able to embed arbitrary metric spaces with low distortion. They are particularly well-behaved when working with the Wasserstein distance since the Wasserstein distance on an HST has a simple closed form.
Definition 5.1 (Hierarchically Separated Tree).
A hierarchically separated tree (HST) is a rooted weighted tree such that the edges between level and all have the same weight (denoted ) and the weights are geometrically decreasing so . Let be the depth of the tree.
An HST defines a metric on its leaf nodes by defining the distance between any two leaf nodes to be the weight of the minimum weight path between the two nodes. We will rely on two main facts about HSTs in this work.
Lemma 5.2 (Low distortion metric embeddings [FRT03]).
Let be a metric space with points. There exists a randomized, polynomial time algorithm that produces an HST where the leaf nodes of the tree correspond to the elements of the metric space and the induced tree metric is such that for all
-
•
-
•
The depth of the HST is logarithmic in the size of the metric space, .
An immediate consequence of the metric distortion in Lemma 5.2 is that the Wasserstein distance in the original metric space is also preserved up to a factor in expectation. Thus, Lemma 5.2 allows us to translate the problem of learning densities on an arbitrary metric space in Wasserstein distance to learning densities in Wasserstein distance on an HST. This is a useful tool since HST metrics are generally easier to work with and, as we’ll see below, the Wasserstein distance is particularly well-behaved on an HST. In order to use Lemma 5.2 to translate the problem of density estimation on a bounded ball in into density estimation on an HST, one discretizes the metric, paying a small additive term.
Corollary 5.3.
Given , there is a probabilistic embedding of into an HST such that for all :
-
•
-
•
The distortion is logarithmic in , so taking to be polynomially small, one gets the distortion to be . It is easy to see that this implies that the Wasserstein distance is preserved in both directions up to , up to an additive error.
A distribution on the underlying metric space in an HST induces a function on the nodes of the tree where the value of a node is given by the weight in of the leaf nodes in the subtree rooted at . For every level of the tree, let be the distribution induced on the nodes at level where the probability of node is . Thus is a discrete distribution on a domain of size , where is the number of nodes in level of the tree.
Lemma 5.4 (Closed form Wasserstein distance formula).
Given two distributions and defined on an HST metric space, the Wasserstein distance between and has the closed formula:
where is the length of the edge connecting to its parent, and the sum is over all nodes in the tree.
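As an illustration of Lemma 5.4, the following is a minimal Python sketch. It assumes the closed form is the standard tree formula: the sum, over all non-root nodes, of the length of the edge to the parent times the absolute difference in subtree mass. The names `PARENT`, `EDGE_LEN`, `subtree_mass`, and `tree_wasserstein`, and the toy two-level tree, are illustrative assumptions and do not come from the paper.

```python
from collections import defaultdict

# Toy two-level HST (hypothetical): edge lengths decrease geometrically with depth.
PARENT = {"root": None, "a": "root", "b": "root",
          "a1": "a", "a2": "a", "b1": "b", "b2": "b"}
EDGE_LEN = {"a": 2.0, "b": 2.0, "a1": 1.0, "a2": 1.0, "b1": 1.0, "b2": 1.0}


def subtree_mass(parent, leaf_mass):
    """Value of each node: total leaf mass in the subtree rooted at that node."""
    mass = defaultdict(float)
    for leaf, m in leaf_mass.items():
        node = leaf
        while node is not None:  # walk from the leaf up to the root
            mass[node] += m
            node = parent.get(node)
    return mass


def tree_wasserstein(parent, edge_len, p_leaf, q_leaf):
    """Sum over non-root nodes of (parent-edge length) * |P(subtree) - Q(subtree)|."""
    p_mass = subtree_mass(parent, p_leaf)
    q_mass = subtree_mass(parent, q_leaf)
    nodes = set(p_mass) | set(q_mass)
    return sum(edge_len[v] * abs(p_mass[v] - q_mass[v])
               for v in nodes if v in edge_len)


# Point masses at a1 and b1: the path a1 -> a -> root -> b -> b1 has length 1 + 2 + 2 + 1.
dist_far = tree_wasserstein(PARENT, EDGE_LEN, {"a1": 1.0}, {"b1": 1.0})
# Point masses at the siblings a1 and a2: the path a1 -> a -> a2 has length 1 + 1.
dist_near = tree_wasserstein(PARENT, EDGE_LEN, {"a1": 1.0}, {"a2": 1.0})
```

For point masses the value agrees with the tree distance between the two leaves, which is a quick sanity check on the formula.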
5.2 The Target Estimation Rate
Recall the definition of our neighbourhood.
We call a node an -active node under the distribution if the weight in of the subtree rooted at is greater than . Let be the set of -active nodes under and be the set of -active nodes at level .
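The definition of active nodes can be sketched in a few lines of Python. The activity threshold is left as an explicit parameter `threshold`, since its exact value is not reproduced here; the tree, the leaf masses, and all identifiers are hypothetical and serve only as an illustration.

```python
# Hypothetical two-level tree, with the level of each node (root is level 0).
PARENT = {"root": None, "a": "root", "b": "root",
          "a1": "a", "a2": "a", "b1": "b", "b2": "b"}
LEVEL = {"root": 0, "a": 1, "b": 1, "a1": 2, "a2": 2, "b1": 2, "b2": 2}


def active_nodes(parent, level, leaf_mass, threshold):
    """Return the nodes whose subtree mass exceeds `threshold`, overall and per level."""
    mass = {}
    for leaf, m in leaf_mass.items():
        node = leaf
        while node is not None:  # accumulate leaf mass along the path to the root
            mass[node] = mass.get(node, 0.0) + m
            node = parent.get(node)
    active = {v for v, m in mass.items() if m > threshold}
    by_level = {}
    for v in active:
        by_level.setdefault(level[v], set()).add(v)
    return active, by_level


# A concentrated distribution: almost all mass sits under node "a".
leaf_mass = {"a1": 0.65, "a2": 0.30, "b1": 0.04, "b2": 0.01}
active, by_level = active_nodes(PARENT, LEVEL, leaf_mass, threshold=0.1)
```

In this toy instance only the root, `a`, and the two leaves under `a` are active, which matches the intuition that concentrated distributions have few active nodes.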
Theorem 5.5.
Given a distribution on , , , and , let where is the Lambert W function so , then
where the max is over all the levels of the tree.
Note that so the dependence on and in Theorem 5.5 matches the upper bound in Theorem 5.13. The error rate does indeed adapt to easy instances as we expected. The error decomposes into three components. The first component is the non-private sampling error; the error that would occur even if privacy was not required. The second component indicates that we can not estimate the value of nodes that have probability less than . The third component is the error due to privacy on the active nodes. If is highly concentrated then we expect most nodes to either be -active or have weight 0, so the first two terms in are small. There should also be few active nodes, making the last term smaller as well. Conversely, if has a large region of low density then we expect a large number of inactive nodes, as well as non-zero inactive nodes that are at higher levels of the tree and hence contribute more to the final term. Thus, in distributions with high dispersion we expect the right hand side to be large.
The proof of Theorem 5.5 will involve two main steps. First, we will reduce the lower bound on the HST to a lower bound on a star metric, or equivalently estimation of a discrete distribution in TV distance. We’ll then use a variant of Assouad’s inequality to prove the lower bounds on estimating discrete distributions in TV distance.
5.2.1 Reduction to Estimation in TV distance of Discrete Distributions
The key observation is that in order to estimate the distribution well in Wasserstein distance, an algorithm must estimate each level of the tree well in TV distance. Any estimate of also induces an estimate of ; let be an estimate of the distribution and be the induced estimate of the distribution at level . Then for any distribution
The following observation ensures that our notions of instance optimality in both the Wasserstein metric and the per-level TV distance are compatible at every level .
Theorem 5.6.
For every level , define the neighborhood of as by . Then,
where the error of is measured in the Wasserstein distance and is measured in the TV distance.
5.2.2 Characterizing Target Estimation Rate for Discrete Distributions
In light of Theorem 5.6, we will focus on characterizing the difficulty of estimating the distribution at a single level of the tree for the remainder of this section. Since this is fundamentally a statement about estimating discrete distributions in TV distance, we will state everything in this section in terms of general discrete distributions. Let , and let be a distribution on . Define . Our goal is to give a lower bound for , where the metric is the TV distance.
Theorem 5.7.
Given and , let where is the Lambert W function so . Given a distribution ,
Theorem 5.5 follows immediately from Theorem 5.6 and Theorem 5.7. The main tool we will use is a differentially private version of Assouad’s method. This gives us a method for lower bounding the error by constructing nets of distributions that are pairwise far in the relevant metric of interest, which for us is the TV distance. The following is a slight variant of the differentially private version of Assouad’s lemma given in [ASZ21]. Rather than building a set of distributions indexed by a hypercube, we will build a set of distributions indexed by a product of hypercubes. Since this is an extension of the version that appears in [ASZ21], we include a proof in Appendix C for completeness.
Lemma 5.8.
[An extension of -DP Assouad’s method [ASZ21]] Let be a sequence of natural numbers such that , and . Given a family of distributions on a space , a parameter where is a metric space with metric , suppose that there exists a set of distributions indexed by the product of hypercubes where such that for a sequence ,
(5)
For each coordinate , , consider the mixture distributions obtained by averaging over all distributions with a fixed value at the th coordinate:
and let be a binary classifier. Then
where the min on the LHS is over all -DP mechanisms, and on the right hand side is over all -DP binary classifiers. Moreover, if for all , , there exists a coupling between and with , then
Note that an upper bound on implies there exists a coupling between and such that .
We will separately prove that each of the three terms in Theorem 5.7 belong in the lower bound. Each proof will follow the same underlying structure. Given a distribution , the main technical step is carefully designing a family of distributions in that satisfy the conditions of Lemma 5.8. Lemma 5.9 and Lemma 5.10 give lower bounds on the noise due to privacy. Lemma 5.11 gives lower bounds based on the error due to sampling.
Let
where is the Lambert W function satisfying . In both lemma proofs we will use the inequality that if , then
(6)
Lemma 5.9.
Given a distribution , , and ,
Proof.
Let be the number of active nodes. If then the RHS is 0 and so we are done. Otherwise, assume and let . Using the notation from Lemma 5.8, let and for all . We will drop the reference to in the notation since only is significant.
Pair up the active nodes to form pairs of active nodes denoted by . Given , define the distribution as follows: for all , and if and and if . For all other , . It is immediate that for all , . For any pair , , so that Equation 5 is satisfied with . Further, given , and only differ on the probability of and , so and by Equation 6, Noting that completes the proof. ∎
Lemma 5.10.
For all , , and distributions on , if , then
Since , the condition that is a mild condition. For example, it is satisfied whenever .
Similar to the proof of Lemma 5.9, we are going to pair up the coordinates and move mass between the coordinates to create the distributions indexed by the product of hypercubes. Since we want all the distributions we create to be in , we will divide the space into scales such that all elements in the same scale have approximately the same probability of occurring. We’ll then move mass within these scales. For , let .
Proof.
Given , let and .
Let us first consider the case that there exists a scale with and where is the element in . Define by and for all , . Since , . In this case we will use Lemma 5.8 with , otherwise, and corresponds to the set of distributions . Then noting that and using eqn (6) we have that and so that and we are done.
Next suppose that for all scales such that we have . Let be the smallest such that . Since the scales are geometrically decreasing,
It follows that . Further,
Thus we can (up to constants) ignore scales such that and assume that is even for all scales.
Let . Now, within each scale , pair the elements to form distinct pairs . Given , define by and if and and if . For all other elements, . Then, it is easy to see that for all , . Further, using notation from Lemma 5.8, Equation 5 is satisfied with since and , which is less than whenever . By eqn (6), whenever so by Lemma 5.8 we have
which completes the proof. ∎
Next we lower bound the statistical term.
Lemma 5.11.
For all , , and distributions , if and , then
To streamline the notation, we will use to denote . In order to prove Lemma 5.11, we will need the following standard result from the statistics literature which allows us to lower bound the performance of any simple classifier distinguishing two distributions and by the KL divergence between and . We give a specific result for distinguishing Bernoulli random variables since we’ll use this in the proof of Lemma 5.11.
Lemma 5.12.
Given any pair of distributions and on the same domain,
where the minimum is over all binary classifiers. In particular, if and where then
where again the minimum is over all binary classifiers.
Proof of Lemma 5.11.
As in the proof of Lemma 5.10, first suppose there exists a scale with and there exists such that
Then define a distribution by and for all , . Then since . Then we will use Lemma 5.8 with and for , and corresponds to . Now,
(for more detail on the proof of this inequality see the proof of Lemma 5.12) so
and . Thus by Lemma 5.8,
and we are done.
On the other hand, suppose that for all scales such that we have
where . As in the proof of Lemma 5.10, we will argue that we can ignore any singleton scales, and assume that is even for all scales. Let so
Therefore, and so
(7)
where the first inequality follows from whenever , and the second follows because for all such that .
Assume that is even for all . Within each scale , pair the elements to form distinct pairs per scale. For all , let , and note that for all and , . Given , define by and if and and if . For all other elements, . Then, for all , . Further, using notation from Lemma 5.8, we have . Also, for any , and only differ on and where and . Therefore, by Lemma 5.12, and the post-processing inequality,
Lemma 5.8 then implies the result. ∎
5.3 An -DP Distribution Estimation Algorithm
Now, let us return to HSTs and designing an estimation algorithm that achieves the target estimation rate, up to logarithmic factors. As in the one-dimensional setting, we want to restrict to only privately estimating the density at a small number () of points. While we could try to mimic the one-dimensional solution by privately estimating a solution to the -median problem, it’s not clear how to prove that such an approach is instance-optimal. It turns out that a simpler solution more amenable to analysis will suffice. Our algorithm has two stages; first we attempt to find the set of -active nodes, then we estimate the weight of these active nodes. Since these nodes have weight greater than , we can privately estimate them to within constant multiplicative error.
Let be the underlying metric space so . For any set of nodes and a function defined on the nodes, define the function as if and otherwise. Given two functions and defined on the nodes, we define
where is the length of the edge connecting to its parent, and the sum is over all nodes in the tree. So by Lemma 5.4, . Note that satisfies the triangle inequality.
A high-level outline of the proposed algorithm is given in Algorithm 1. Now, we state the main theorem of this section.
Theorem 5.13.
Given any , PrivDensityEstTree is -DP.
Given a distribution , with probability ,
This bound has the same three terms as our lower bound on in Theorem 5.5 corresponding again to the empirical error (the error inherent even in the absence of a privacy requirement), the error from the private algorithm not being able to estimate the probability of events that occur with probability less than , and the error due to the noise added to the active nodes. The maximum over the levels that appeared in the lower bound is replaced with a sum over the levels in the upper bound, so, up to logarithmic factors, the upper bound is within a factor of of the lower bound. Since we can not hope to locate the set of -active nodes exactly with a private algorithm, we find a set that is guaranteed to satisfy
Note that so the error introduced here by not estimating perfectly is at most a logarithmic multiplicative factor.
The first step of our algorithm is to estimate the empirical distribution. We use a truncated version of the standard empirical distribution. This allows us to achieve an error rate of even when is small.
The proof of the following lemma is contained in Appendix C.
Lemma 5.14.
For any distribution , if then with probability ,
The goal of Algorithm 3 is to estimate the set of -active nodes.
The next lemma allows us to bound how close to the goal we get. The proof is contained in Appendix C.
Lemma 5.15.
Let be the set of active nodes found in Algorithm 1. Then with probability ,
We also prove the following lemma relating the error due to estimating the active nodes to a quantity depending on the true active nodes.
Lemma 5.16.
If then
The key component of this proof is that any discrepancy between the weight of the nodes on and that assigned by was already paid for in . The final step in Algorithm 1 is to project the noisy function into the space of distributions on the underlying metric space. We’d like to do this in a way that preserves, up to a constant, the distance between and . We will do this iteratively starting from the root node, by ensuring that the values of each node’s children add up to its assigned value. Since we know the root node has value 1, this results in a valid distribution. We start from the top of the tree since errors in higher nodes of the tree contribute more to the Wasserstein distance. While errors in higher nodes of the tree can propagate to lower levels, the predominant influence on the overall error is retained at the top level due to the geometric nature of the edge weights.
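To make the top-down idea concrete, here is a minimal Python sketch of one possible projection. It is an assumption rather than the paper's procedure: at each node, negative child values are clipped to zero and the remaining child values are rescaled so that they sum to the parent's projected value, with an even split when every child is zero. It is not claimed to satisfy the guarantee of Lemma 5.17. The identifiers `CHILDREN`, `NOISY`, and `project_top_down` are hypothetical.

```python
# Hypothetical tree and noisy node values (as might result from adding noise).
CHILDREN = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
NOISY = {"root": 1.0, "a": 0.7, "b": 0.2,
         "a1": 0.5, "a2": -0.1, "b1": 0.0, "b2": 0.0}


def project_top_down(children, root, noisy):
    """Return node values with root = 1 and each node's children summing to its value."""
    proj = {root: 1.0}
    stack = [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        if not kids:
            continue  # leaf: nothing to distribute
        # Assumption: clip negative noisy values before rescaling.
        raw = [max(noisy.get(c, 0.0), 0.0) for c in kids]
        total = sum(raw)
        for c, r in zip(kids, raw):
            # Rescale proportionally; split evenly if no child has positive value.
            share = r / total if total > 0 else 1.0 / len(kids)
            proj[c] = proj[v] * share
            stack.append(c)
    return proj


proj = project_top_down(CHILDREN, "root", NOISY)
leaf_total = sum(proj[leaf] for leaf in ("a1", "a2", "b1", "b2"))
```

Because the root is fixed to 1 and every parent's value is fully distributed among its children, the leaf values are nonnegative and sum to 1, so they form a valid distribution on the underlying space.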
Lemma 5.17.
For any real-valued function on the nodes of the HST such that where is the root node and given any distribution ,
Proof of Theorem 5.13.
The privacy follows from the fact that each user contributes to at most queries in LocateActiveNodes and at most one coordinate in the computation of in line 4 in PrivDensityEstTree.
For the utility, we will consider each level individually. First suppose that .
(8)
where the first inequality follows from Lemma 5.17, the second inequality follows from the triangle inequality and Lemma 5.4, and the third follows from Lemma 5.16 and Lemma 5.15. Finally,
The final statement then follows from Lemma 5.14 and basic concentration bounds on the Laplacian distribution.
If , then the proof goes through for all terms except the final term related to the noise due to privacy. We consider two cases. Let . First suppose that . Then no node that is in a level above , but is not a direct ancestor of , is in . Therefore, since the projection algorithm is top-down, will be concentrated on . Hence the error of level is simply , which can be charged to the first term plus the sum of the weights of the inactive nodes, which appears in the second term. Next, suppose that . Then the sum of the weights of the inactive nodes (in term two) dominates the error due to adding noise to . ∎
6 Instance Optimal Density Estimation on in Wasserstein distance
Let us now consider the setting of estimating distributions on . In this setting, the target estimation rate is that of an algorithm that knows that the distribution is either or for a distribution such that . This definition of instance-optimality strengthens that corresponding to the so-called hardest one-dimensional subproblem [DL91], since this is a harder estimation rate to achieve. A formal description of the target estimation rate is given in Section 3.1 and Section 3.3. In Section 6.1, we lower bound this estimation rate using hypothesis testing techniques. Then, in Section 6.2, we give an algorithm that, up to polylogarithmic factors, uniformly achieves the lower bound, and hence approximately achieves the instance-optimal estimation rate. Our instance optimality results apply to all continuous distributions in a bounded interval with density functions (though it is likely that they apply more generally). All omitted proofs can be found in Appendix F.
6.1 General Lower Bound
To state the main theorem in this section, we will introduce some notation. We start by defining the restriction of a distribution.
Definition 6.1.
For any distribution over with a density function, the restriction of with respect to is defined as the distribution with the following CDF function F’:
If , then is a step function that goes to at that point and is prior to that point.
Also recall the following definition of quantiles.
Definition 6.2.
For , the -quantile of a distribution over is defined as follows:
When the distribution is clear from context, we will sometimes abuse notation and use when we mean . The main theorem we will prove in this section is the following:
Theorem 6.3.
There exists a constant such that given a continuous distribution on with bounded expectation and ,
where is the empirical distribution on samples drawn independently from .
The same result can be extended to -DP algorithms as well for
We discuss each of the terms in turn. Note that the final term is related to the expected Wasserstein distance between the empirical distribution and the true distribution. There is now a long line of work characterizing this quantity in terms of the distribution (See Section 4), but essentially, if the distribution is more concentrated, this term is smaller. The first term is a very particular inter-quantile distance that is also much smaller for concentrated distributions, and can be large for relatively dispersed distributions. The second term characterizes the length of the tails of the distribution—longer tails make this Wasserstein distance larger. Overall, this rate is significantly lower for more concentrated distributions with small support, and relatively large for more dispersed distributions. We prove this theorem over the following couple of sections; in Section 6.1.1 we characterize the cost of private instance optimality, and in Section 6.1.2 we characterize the cost of achieving instance optimality without privacy (this non-private characterization is also new to our work, to the best of our knowledge). Combining the theorems in those sections gives the above result.
6.1.1 The Privacy Term
The main theorem we will prove in this section is the following.
Theorem 6.4.
Fix , . For all distributions over that have a density function and finite expectation, there exists another distribution such that , that is indistinguishable from given samples such that for all -DP algorithms , with probability at least over the draws , , the following holds for some constant .
We start with some notation. For any distribution with a density, let denote its density function. Throughout this section, we will use to represent the -quantile of distribution . Let be the ‘starting point’ of distribution (defined as if the infimum exists, and otherwise).
Next, we describe some results on differentially private testing that we will use. We say that a testing algorithm distinguishes two distributions and with samples, if given the promise that a dataset of size is drawn from either or , with probability at least , it outputs if the dataset was drawn from and if it was drawn from . We now state a theorem lower bounding the sample complexity of differentially private hypothesis testing.
Theorem 6.5 ([CKM+19, Theorem 1.2]).
Fix . For every pair of distributions over , if there exists an -DP testing algorithm (footnote 4: the same bounds, and hence all our results in this subsection, can be extended to -DP (with ) by using an equivalence of pure and approximate DP for identity and closeness testing [ASZ17, Lemma 5]) that distinguishes and with samples, then
where
and is the squared Hellinger distance between , and , where is such that if , then is the maximum value such that
else is the maximum value such that
We now are ready to start proving our main theorem.
Proof.
(of Theorem 6.4) The idea is to construct from by moving mass from the leftmost quantiles to the rightmost quantile. We do this such that is statistically close enough to such that the two distributions can not be distinguished with samples, but is also far from in Wasserstein distance. This produces a lower bound of on how well an algorithm can simultaneously estimate and since if there was an algorithm that produced good estimates of and in Wasserstein distance with samples, then we could tell them apart, and this would give a contradiction.
Let be a quantity to be set later. Formally, we define as the distribution with the following density function.
Note that by the definition of , we have that .
We will prove that the sample complexity of telling apart and under -DP is , using known results on hypothesis testing. Then, we will argue that the Wasserstein distance between and is sufficiently large. Setting appropriately will complete the proof.
Define to be the smallest such that there exists an -DP testing algorithm that distinguishes and ; called the sample complexity of privately distinguishing and .
Lemma 6.6.
The proof of this lemma is in Appendix F. We next argue that and are sufficiently far away in Wasserstein distance.
Lemma 6.7.
.
The proof of this lemma is also in Appendix F.
Finally, we are ready to prove the theorem. Assume, for the sake of contradiction, that with probability larger than over the draw of two datasets , , and the randomness used by invocations of algorithm we have that . Then, given a dataset of size , we can perform the following test: run the differentially private algorithm on the dataset and compute and and output the distribution with lower distance. Then, note that which implies that with probability at least , if the dataset was sampled from (by the accuracy guarantee). A similar argument shows that with probability at least , if the dataset was sampled from . Hence, with samples we have defined a test that distinguishes and . However, for for some constant , by Lemma 6.6 we get that any differentially private test distinguishing and requires more than samples, which is a contradiction. Hence, with probability at least over the draw of two datasets , , and the randomness used by invocations of algorithm we have that , where the last inequality is by invoking Lemma 6.7 with .
∎
6.1.2 Empirical Term
In this section, we prove the following result.
Theorem 6.8.
Fix sufficiently large natural numbers and let be sufficiently small constants. For all algorithms , the following holds. For all continuous distributions over with a density and with bounded expectation, there exists another distribution (with ), that is indistinguishable from given samples, such that with probability at least over the draws , , the following holds.
where is the -quantile of .
Before going into the proof, we state the following result on the sample complexity of testing. This is a folklore result but for a proof of the lower bound see [BY02] and the upper bound see [Can17].
Theorem 6.9.
Fix . For every pair of distributions over , if there exists a testing algorithm that distinguishes and with samples, then
where represents the squared Hellinger distance between and .
Throughout the proof, we will use to represent the -quantile of distribution .
Proof of Theorem 6.8.
is constructed by adding progressively more mass to up until and subtracting proportionate amounts of mass from afterwards. Intuitively, this is done in such a way that to ‘change’ to , for all one has to move roughly mass from to . This ensures that the Wasserstein distance between and is larger than the expected Wasserstein distance between and its empirical distribution on samples . This is carefully done to ensure that is indistinguishable from .
Formally, consider in the range . For all , we set . For all , we set . Next, consider in the range . For all , we set . For all , we set . Note that has bounded expectation by assumption, and hence, so does . Additionally, note that .
There are two key considerations balanced in the design of . On one hand, we need to be indistinguishable from given samples. On the other hand, we need to be sufficiently far away from in Wasserstein distance. This ensures that given an accurate algorithm for estimating the density of the distribution (in Wasserstein distance) given access to samples from it, we can design a test distinguishing and with that many samples, thereby contradicting their indistinguishability.
Detailed proofs of claims below can be found in Appendix F. First, we show that is indistinguishable from .
Lemma 6.10.
Next, we establish a lower bound on the Wasserstein distance between and .
Lemma 6.11.
Next, we upper bound the expected Wasserstein distance between the distribution and its empirical distribution on samples.
Lemma 6.12.
We now prove a simple claim regarding restrictions.
Claim 6.13 (Restrictions preserve Wasserstein distance).
For all datasets , and any natural number we have that
Finally, we are ready to put the above lemmas together to prove Theorem 6.8. Fix . Assume, for sake of contradiction, that with probability larger than over the draw of two datasets , , and the randomness used by invocations of algorithm we have that . Then, given a dataset of size , we perform the following test: run the differentially private algorithm on the dataset and compute and and output the distribution with lower distance. Then, note that which implies that with probability at least , if (by the accuracy guarantee). A similar argument shows that with probability at least , if . Hence, with samples we have defined a test that distinguishes and . However, by Lemma 6.10 bounding the divergence between and , Theorem 6.9 on sample complexity lower bounds for testing, and Lemma A.4 on the relationship between KL and Hellinger distance, we get that any statistical test distinguishing and requires more than samples, which is a contradiction. Hence, with probability at least over the draw of two datasets , , and the randomness used by invocations of algorithm we must have that
(9)
Next, note that by Lemma 6.12 (with value ), we have that
Analyzing the middle term in the above sum, we have that
Substituting this back in the previous sum, we have that
where in the last inequality we use the fact that . Hence, by Lemma 6.11 (which gives a lower bound on ) in conjunction with the above equation, we have that for some sufficiently small constant . Substituting back in Equation 9, we have that with probability at least over the draw of two datasets , , and the randomness used by invocations of algorithm we have that
as required.
∎
6.2 Upper Bound
In this section, we describe an algorithm that achieves the instance optimal rate described in the previous section (up to polylogarithmic factors in some of the terms).
We will be looking at distributions supported on a discrete, ordered interval . Note that by a simple coupling argument, any continuous distribution on is at most away in Wasserstein distance from a distribution on this grid. The dependence on in our bounds for discrete distributions will be inverse polylogarithmic (or better), and so our algorithms for estimating distributions in the interval also work to give similar bounds for continuous distributions on , up to a small additive factor of , which can be set to any inverse polynomial in the dataset size without significantly affecting our bounds.
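The coupling argument behind the discretization step can be checked numerically. The sketch below rounds each sample to the nearest grid point and measures the 1-Wasserstein distance between the original and the rounded empirical distributions. The function names, the interval `[0, 1]`, and the grid spacing `0.01` are illustrative assumptions. The only facts used are that nearest-point rounding moves each point by at most half the grid spacing, and that on the real line the 1-Wasserstein distance between two equal-size empirical distributions equals the average gap between sorted samples.

```python
import random


def round_to_grid(x, lo, spacing):
    """Snap x to the nearest point of the grid lo, lo + spacing, lo + 2 * spacing, ..."""
    return lo + spacing * round((x - lo) / spacing)


def w1_equal_size(xs, ys):
    """1-Wasserstein distance between two empirical distributions of equal size.

    On the real line the optimal coupling matches sorted samples in order.
    """
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


random.seed(0)
spacing = 0.01
samples = [random.uniform(0.0, 1.0) for _ in range(1000)]
gridded = [round_to_grid(x, 0.0, spacing) for x in samples]

# Coupling each sample with its rounded copy costs at most spacing / 2 per point,
# so the distance between the two empirical distributions is at most spacing / 2.
cost = w1_equal_size(samples, gridded)
```

Shrinking `spacing` shrinks this additive discretization cost proportionally, which is why the grid width can be taken polynomially small in the dataset size.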
Formally, we will prove the following theorem (See Theorem 6.15 for a more detailed statement).
Theorem 6.14.
Fix , , and such that is an integer. Let for some sufficiently large constant . There exists an algorithm that for any distribution on satisfies the following. When run with input a random sample , outputs a distribution such that with probability at least over the randomness of and the algorithm,
where is the empirical distribution on samples drawn independently from , represents the -quantile of distribution , and for a sufficiently large constant .
Since , this upper bound matches the lower bound in Theorem 6.3 in its dependence on and its dependence on (up to logarithmic factors in ). The algorithm that we will analyze proceeds by estimating sufficiently many quantiles from the empirical distribution and distributing mass evenly between the chosen quantiles. The number of quantiles is chosen carefully to ensure that the estimated -quantiles are also approximately -quantiles for the empirical distribution (and hence also approximately for the true distribution), and to ensure that the CDF of the output distribution closely tracks the CDF of the empirical distribution. Through a careful analysis, we are able to leverage these properties to give instance optimality guarantees for the accuracy of the algorithm.
6.2.1 Algorithm for density estimation
Algorithm 5 is our algorithm for density estimation, and proceeds by differentially privately estimating sufficiently many quantiles of the distribution and placing equal mass on each of them. We argue that a simple CDF based differentially private quantiles estimator satisfies a specific guarantee that will be key to our analysis. See Appendix E for more details about the quantiles algorithm and formal statements and proofs therein.
Observe that Algorithm 5 inherits the privacy of , since it simply postprocesses the quantiles it receives from that subroutine, and hence is also -DP.
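The two-stage structure described above (private quantile estimation followed by post-processing) can be sketched as follows. This is a minimal illustration and not Algorithm 5 itself: the quantile step here is a naive Laplace-noised histogram whose cumulative sums give a noisy CDF, swapped in as a stand-in for the Appendix E estimator, and it has a much cruder accuracy guarantee. The names `noisy_cdf_quantiles`, `equal_mass_distribution`, `grid`, `k`, and `eps` are hypothetical.

```python
import numpy as np

def noisy_cdf_quantiles(data, grid, k, eps, rng):
    """Stand-in eps-DP quantile estimator (NOT the Appendix E algorithm)."""
    # Histogram of the data over the grid points. Replacing one sample changes
    # two counts by 1 each, so the L1-sensitivity is 2 (replacement neighbours).
    idx = np.clip(np.searchsorted(grid, data), 0, len(grid) - 1)
    counts = np.bincount(idx, minlength=len(grid)).astype(float)
    noisy = counts + rng.laplace(scale=2.0 / eps, size=len(grid))
    # Post-process the noisy counts into a monotone CDF estimate on the grid.
    cdf = np.maximum.accumulate(np.cumsum(noisy)) / len(data)
    targets = np.arange(1, k + 1) / k            # levels 1/k, 2/k, ..., 1
    pos = np.clip(np.searchsorted(cdf, targets), 0, len(grid) - 1)
    return grid[pos]

def equal_mass_distribution(quantiles):
    """Place mass 1/k on each of the k estimated quantiles (pure post-processing)."""
    support, counts = np.unique(quantiles, return_counts=True)
    return support, counts / len(quantiles)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)                # hypothetical discretised interval
data = rng.choice(grid, size=5000)               # synthetic data on the grid
q = noisy_cdf_quantiles(data, grid, k=10, eps=1.0, rng=rng)
support, weights = equal_mass_distribution(q)
```

Only `noisy_cdf_quantiles` touches the data, so the privacy of the whole pipeline follows from that step alone, mirroring the post-processing argument in the text.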
We are now in a position to state our main theorem, which bounds the Wasserstein distance between the distribution output by our algorithm and the underlying probability distribution .
Theorem 6.15.
Fix , , and such that is an integer. Let for some sufficiently large constant . Let be any distribution supported on , and .
Then, Algorithm 5, when given inputs , privacy parameter , interval end points , and granularity , outputs a distribution such that with probability at least over the randomness of and the algorithm,
where is the uniform distribution on , represents the -quantile of distribution , are sufficiently large constants, and , where is a sufficiently large constant.
We note that using more sophisticated differentially private CDF estimators to estimate quantiles (such as ones in [BNSV15, CLN+23]), we can also obtain a version of the same theorem for approximate differential privacy, with a better dependence on the size of the domain (only as opposed to , where is the number of times has to be applied to to get it to be ).
Footnote 5: The theorem would be of the same form as Theorem 6.15, except that Algorithm 5 would be -DP, with the lower bound on instead being , and being set instead to .
To prove Theorem 6.15, we first relate the Wasserstein distance of interest (between the true distribution and the algorithm’s output distribution) to a quantity related to an appropriately chosen restriction. Let represent the -quantile of and represent the -quantile of and represent the -quantiles of . We also note that all these distributions (and others that will come up in the proof) are bounded distributions over the real line, so we can freely apply the triangle inequality for Wasserstein distance and the cumulative distribution formula for Wasserstein distance (Lemma 2.3). The proof of the main theorem will follow from the following lemmas (all proved in Appendix F).
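As a small illustration of the two tools invoked above, the following sketch computes the distance between discrete distributions on a common grid from their CDFs. It assumes the distance in question is the 1-Wasserstein distance, for which the one-dimensional identity "distance equals the integral of the absolute CDF difference" holds; the function name `w1_from_cdfs` and the example distributions are hypothetical.

```python
import numpy as np

def w1_from_cdfs(grid, p, q):
    """1-Wasserstein distance between discrete distributions p, q on a sorted common grid."""
    grid = np.asarray(grid, dtype=float)
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))
    # Both CDFs are piecewise constant between grid points, so the integral of
    # |F_p - F_q| is a sum of (gap * cell width) over consecutive grid points.
    return float(np.sum(cdf_gap[:-1] * np.diff(grid)))

grid = np.array([0.0, 0.5, 1.0])
left = np.array([1.0, 0.0, 0.0])    # point mass at 0
mid = np.array([0.0, 1.0, 0.0])     # point mass at 0.5
right = np.array([0.0, 0.0, 1.0])   # point mass at 1

d_lr = w1_from_cdfs(grid, left, right)
d_lm = w1_from_cdfs(grid, left, mid)
d_mr = w1_from_cdfs(grid, mid, right)
```

In this example the triangle inequality holds with equality, since the intermediate point mass lies on the transport path between the other two.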
Lemma 6.16.
Let be a sufficiently large constant, and let be sufficiently large. With probability at least over the randomness in data samples and Algorithm 5,
Lemma 6.17 (Wasserstein in terms of quantiles).
For all datasets (with data entries in ), with probability at least over the randomness of Algorithm 5, we have that
where is the uniform distribution over .
Now, we argue about the concentration of the Wasserstein distance between restrictions of the empirical distribution and restrictions of the true distribution.
Claim 6.18.
Fix and sufficiently large constants . Let be sufficiently large such that (as in Theorem 6.15). For all such that , with probability at least over the randomness in the data,
Now, we give the proof of our main theorem.
Proof of Theorem 6.15.
Using Lemma 6.16, Claim 6.18 and the triangle inequality, we have that with probability at least over the randomness of the data and the algorithm,
Finally, applying Lemma 6.17 and taking a union bound over failure probabilities, we get that with probability at least over the randomness of the data and the algorithm,
as required. ∎
References
- [AAK21] Ishaq Aden-Ali, Hassan Ashtiani, and Gautam Kamath. On the sample complexity of privately learning unbounded high-dimensional gaussians. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Algorithmic Learning Theory, 16-19 March 2021, Virtual Conference, Worldwide, volume 132 of Proceedings of Machine Learning Research, pages 185–216. PMLR, 2021.
- [AAL23a] Mohammad Afzali, Hassan Ashtiani, and Christopher Liaw. Mixtures of gaussians are privately learnable with a polynomial number of samples. CoRR, abs/2309.03847, 2023.
- [AAL23b] Jamil Arbas, Hassan Ashtiani, and Christopher Liaw. Polynomial time and private learning of unbounded gaussian mixture models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 1018–1040. PMLR, 2023.
- [ABC17] Peyman Afshani, Jérémy Barbay, and Timothy M. Chan. Instance-optimal geometric algorithms. J. ACM, 64(1), March 2017.
- [AD20] Hilal Asi and John C. Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- [ADJ+11] Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, and Shengjun Pan. Competitive closeness testing. In Sham M. Kakade and Ulrike von Luxburg, editors, COLT 2011 - The 24th Annual Conference on Learning Theory, June 9-11, 2011, Budapest, Hungary, volume 19 of JMLR Proceedings, pages 47–68. JMLR.org, 2011.
- [ADJ+12] Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, Shengjun Pan, and Ananda Theertha Suresh. Competitive classification and closeness testing. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, volume 23 of JMLR Proceedings, pages 22.1–22.18. JMLR.org, 2012.
- [AJOS13a] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. A competitive test for uniformity of monotone distributions. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29 - May 1, 2013, volume 31 of JMLR Workshop and Conference Proceedings, pages 57–65. JMLR.org, 2013.
- [AJOS13b] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Optimal probability estimation with applications to prediction and classification. In Shai Shalev-Shwartz and Ingo Steinwart, editors, COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, volume 30 of JMLR Workshop and Conference Proceedings, pages 764–796. JMLR.org, 2013.
- [AKT+23] Daniel Alabi, Pravesh K. Kothari, Pranay Tankala, Prayaag Venkat, and Fred Zhang. Privately estimating a gaussian: Efficient, robust, and optimal. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 483–496. ACM, 2023.
- [AL22] Hassan Ashtiani and Christopher Liaw. Private and polynomial time algorithms for learning gaussians and beyond. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pages 1075–1076. PMLR, 2022.
- [ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite littlestone dimension. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 852–860. ACM, 2019.
- [ASSU24] Maryam Aliakbarpour, Rose Silver, Thomas Steinke, and Jonathan R. Ullman. Differentially private medians and interior points for non-pathological data. In Venkatesan Guruswami, editor, 15th Innovations in Theoretical Computer Science Conference, ITCS 2024, January 30 to February 2, 2024, Berkeley, CA, USA, volume 287 of LIPIcs, pages 3:1–3:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2024.
- [ASZ17] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private testing of identity and closeness of discrete distributions. CoRR, abs/1707.05128, 2017.
- [ASZ20] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private assouad, fano, and le cam. CoRR, abs/2004.06830, 2020.
- [ASZ21] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private Assouad, Fano, and Le Cam. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 48–78. PMLR, 16–19 Mar 2021.
- [BA20] Victor-Emmanuel Brunel and Marco Avella-Medina. Propose, test, release: Differentially private estimation with high probability. CoRR, abs/2002.08774, 2020.
- [Bar96] Yair Bartal. Probabilistic approximations of metric spaces and its algorithmic applications. In 37th Annual Symposium on Foundations of Computer Science, FOCS ’96, Burlington, Vermont, USA, 14-16 October, 1996, pages 184–193. IEEE Computer Society, 1996.
- [BBDS13] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. Differentially private data analysis of social networks via restricted sensitivity. In Robert D. Kleinberg, editor, Innovations in Theoretical Computer Science, ITCS ’13, Berkeley, CA, USA, January 9-12, 2013, pages 87–96. ACM, 2013.
- [BG14] Emmanuel Boissard and Thibaut Le Gouic. On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 50(2):539 – 563, 2014.
- [BGS+21] Gavin Brown, Marco Gaboardi, Adam D. Smith, Jonathan R. Ullman, and Lydia Zakynthinou. Covariance-aware private mean estimation without private covariance estimation. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 7950–7964, 2021.
- [BHS23] Gavin Brown, Samuel B. Hopkins, and Adam Smith. Fast, sample-efficient, affine-invariant private mean and covariance estimation for subgaussian distributions, 2023.
- [BKM+21] Eugene Bagdasaryan, Peter Kairouz, Stefan Mellem, Adrià Gascón, Kallista Bonawitz, Deborah Estrin, and Marco Gruteser. Towards sparse federated analytics: Location heatmaps under distributed differential privacy with secure aggregation. arXiv preprint arXiv:2111.02356, 2021.
- [BKSW21] Mark Bun, Gautam Kamath, Thomas Steinke, and Zhiwei Steven Wu. Private hypothesis selection. IEEE Trans. Inf. Theory, 67(3):1981–2000, 2021.
- [BL19] Sergey G. Bobkov and Michel Ledoux. One-dimensional empirical measures, order statistics, and kantorovich transport distances. Memoirs of the American Mathematical Society, 2019.
- [BM23] Daniel Bartl and Shahar Mendelson. On a variance dependent Dvoretzky-Kiefer-Wolfowitz inequality. arXiv e-prints, page arXiv:2308.04757, August 2023.
- [BNNR09] Khanh Do Ba, Huy L. Nguyen, Huy Ngoc Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48:428–442, 2009.
- [BNS16] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory Comput., 12(1):1–61, 2016.
- [BNSV15] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. CoRR, abs/1504.07553, 2015.
- [BS19] Mark Bun and Thomas Steinke. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 181–191, 2019.
- [BSV22] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Private measures, random walks, and synthetic data, 2022.
- [BUV18] Mark Bun, Jonathan R. Ullman, and Salil P. Vadhan. Fingerprinting codes and the price of approximate differential privacy. SIAM J. Comput., 47(5):1888–1938, 2018.
- [BY02] Z. Bar-Yossef. The Complexity of Massive Data Set Computations. University of California, Berkeley, 2002.
- [Can17] Clément L. Canonne. A short note on distinguishing discrete distributions, 2017.
- [CB22] Graham Cormode and Akash Bharadwaj. Sample-and-threshold differential privacy: Histograms and applications. In International Conference on Artificial Intelligence and Statistics, pages 1420–1431. PMLR, 2022.
- [CCD+23] Karan Chadha, Junye Chen, John Duchi, Vitaly Feldman, Hanieh Hashemi, Omid Javidbakht, Audra McMillan, and Kunal Talwar. Differentially private heavy hitter detection using federated analytics, 2023.
- [CD20] Rachel Cummings and David Durfee. Individual sensitivity preprocessing for data privacy. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 528–547. SIAM, 2020.
- [CDK17] Bryan Cai, Constantinos Daskalakis, and Gautam Kamath. Priv’it: Private and sample efficient identity testing. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 635–644. PMLR, 2017.
- [CKM+19] Clément L. Canonne, Gautam Kamath, Audra McMillan, Adam D. Smith, and Jonathan R. Ullman. The structure of optimal private tests for simple hypotheses. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 310–321. ACM, 2019.
- [CL15] T. Tony Cai and Mark G. Low. A framework for estimation of convex functions. Statistica Sinica, 25(2):423–456, 2015.
- [CLN+23] Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, and Uri Stemmer. Optimal differentially private learning of thresholds and quasi-concave optimization. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 472–482. ACM, 2023.
- [CR12] Guillermo D. Cañas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 2501–2509, 2012.
- [CSS11] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Trans. Inf. Syst. Secur., 14(3):26:1–26:24, 2011.
- [CWZ19] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. CoRR, abs/1902.04495, 2019.
- [CZ13] Shixi Chen and Shuigeng Zhou. Recursive mechanism: towards node differential privacy and unrestricted joins. In Kenneth A. Ross, Divesh Srivastava, and Dimitris Papadias, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pages 653–664. ACM, 2013.
- [DHS15] Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning of structured discrete distributions. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2566–2574, 2015.
- [DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In International Conference on the Theory and Applications of Cryptographic Techniques, EUROCRYPT ’06, pages 486–503, St. Petersburg, Russia, 2006.
- [DKSS23] Travis Dick, Alex Kulesza, Ziteng Sun, and Ananda Theertha Suresh. Subset-based instance optimality in private estimation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 7992–8014. PMLR, 2023.
- [DL91] David L. Donoho and Richard C. Liu. Geometrizing Rates of Convergence, II. The Annals of Statistics, 19(2):633 – 667, 1991.
- [DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Michael Mitzenmacher, editor, Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, pages 371–380. ACM, 2009.
- [DLSV23] Trung Dang, Jasper C.H. Lee, Maoyuan Song, and Paul Valiant. Optimality in mean estimation: Beyond worst-case, beyond sub-gaussian, and beyond $1+\alpha$ moments. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [DMNS17] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. Journal of Privacy and Confidentiality, 7(3):17–51, 2017.
- [DNPR10] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Leonard J. Schulman, editor, Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010, pages 715–724. ACM, 2010.
- [DR18] John C. Duchi and Feng Ruan. The right complexity measure in locally private estimation: It is not the fisher information. CoRR, abs/1806.05756, 2018.
- [DSS11] Steffen Dereich, Michael Scheutzow, and Reik Schottstedt. Constructive quantization: Approximation by empirical measures. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 49:1183–1203, 2011.
- [Dud69] R. M. Dudley. The speed of mean glivenko-cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969.
- [DY95] Vladimir Dobric and Joseph E. Yukich. Asymptotics for transportation cost in high dimensions. Journal of Theoretical Probability, 8:97–118, 1995.
- [FG15] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707, August 2015.
- [FLN01] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01, pages 102–113, New York, NY, USA, 2001. Association for Computing Machinery.
- [Fou23] Nicolas Fournier. Convergence of the empirical measure in expected Wasserstein distance: non asymptotic explicit bounds in $\mathbb{R}^d$, 2023.
- [FRT03] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’03, pages 448–455, New York, NY, USA, 2003. Association for Computing Machinery.
- [GHK+23] Badih Ghazi, Junfeng He, Kai Kohlhoff, Ravi Kumar, Pasin Manurangsi, Vidhya Navalpakkam, and Nachiappan Valliappan. Differentially private heatmaps. In Brian Williams, Yiling Chen, and Jennifer Neville, editors, Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 7696–7704. AAAI Press, 2023.
- [GJK21] Jennifer Gillenwater, Matthew Joseph, and Alex Kulesza. Differentially private quantiles. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 3713–3722. PMLR, 2021.
- [GKN20] Tomer Grossman, Ilan Komargodski, and Moni Naor. Instance Complexity and Unlabeled Certificates in the Decision Tree Model. In Thomas Vidick, editor, 11th Innovations in Theoretical Computer Science Conference (ITCS 2020), volume 151 of Leibniz International Proceedings in Informatics (LIPIcs), pages 56:1–56:38, Dagstuhl, Germany, 2020. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.
- [HKM22] Samuel B. Hopkins, Gautam Kamath, and Mahbod Majid. Efficient mean estimation with pure differential privacy via a sum-of-squares exponential mechanism. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1406–1417. ACM, 2022.
- [HKMN23] Samuel B. Hopkins, Gautam Kamath, Mahbod Majid, and Shyam Narayanan. Robustness implies privacy in statistical estimation. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 497–506. ACM, 2023.
- [HLY21] Ziyue Huang, Yuting Liang, and Ke Yi. Instance-optimal mean estimation under differential privacy. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 25993–26004, 2021.
- [HO19] Yi Hao and Alon Orlitsky. Doubly-competitive distribution estimation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2614–2623. PMLR, 2019.
- [HVZ23] Yiyun He, Roman Vershynin, and Yizhe Zhu. Algorithmically effective differentially private synthetic data, 2023.
- [KDH23] Rohith Kuditipudi, John C. Duchi, and Saminul Haque. A pretty fast algorithm for adaptive private mean estimation. In Gergely Neu and Lorenzo Rosasco, editors, The Thirty Sixth Annual Conference on Learning Theory, COLT 2023, 12-15 July 2023, Bangalore, India, volume 195 of Proceedings of Machine Learning Research, pages 2511–2551. PMLR, 2023.
- [KLM+20] Haim Kaplan, Katrina Ligett, Yishay Mansour, Moni Naor, and Uri Stemmer. Privately learning thresholds: Closing the exponential gap. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 2263–2285. PMLR, 2020.
- [KLSU19] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan R. Ullman. Privately learning high-dimensional distributions. In Alina Beygelzimer and Daniel Hsu, editors, Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pages 1853–1902. PMLR, 2019.
- [KMS22a] Gautam Kamath, Argyris Mouzakis, and Vikrant Singhal. New lower bounds for private estimation and a generalized fingerprinting lemma. In NeurIPS, 2022.
- [KMS+22b] Gautam Kamath, Argyris Mouzakis, Vikrant Singhal, Thomas Steinke, and Jonathan R. Ullman. A private and computationally-efficient estimator for unbounded gaussians. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pages 544–572. PMLR, 2022.
- [KMV22] Pravesh Kothari, Pasin Manurangsi, and Ameya Velingker. Private robust estimation by stabilizing convex relaxations. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pages 723–777. PMLR, 2022.
- [KNRS13] Shiva Prasad Kasiviswanathan, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. Analyzing graphs with node differential privacy. In Amit Sahai, editor, Theory of Cryptography - 10th Theory of Cryptography Conference, TCC 2013, Tokyo, Japan, March 3-6, 2013. Proceedings, volume 7785 of Lecture Notes in Computer Science, pages 457–476. Springer, 2013.
- [KSS22] Haim Kaplan, Shachar Schnapp, and Uri Stemmer. Differentially private approximate quantiles. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 10751–10761. PMLR, 2022.
- [KSSU19] Gautam Kamath, Or Sheffet, Vikrant Singhal, and Jonathan R. Ullman. Differentially private algorithms for learning mixtures of separated gaussians. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 168–180, 2019.
- [KSU20] Gautam Kamath, Vikrant Singhal, and Jonathan R. Ullman. Private mean estimation of heavy-tailed distributions. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 2204–2235. PMLR, 2020.
- [KU20] Gautam Kamath and Jonathan R. Ullman. A primer on private statistics. CoRR, abs/2005.00010, 2020.
- [KV18] Vishesh Karwa and Salil P. Vadhan. Finite sample differentially private confidence intervals. In Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, volume 94 of LIPIcs, pages 44:1–44:9. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
- [Lei20] Jing Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767 – 798, 2020.
- [LKO22] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pages 1167–1246. PMLR, 2022.
- [MJT+22] Audra McMillan, Omid Javidbakht, Kunal Talwar, Elliot Briggs, Mike Chatzidakis, Junye Chen, John Duchi, Vitaly Feldman, Yusuf Goren, Michael Hesse, Vojta Jina, Anil Katti, Albert Liu, Cheney Lyford, Joey Meyer, Alex Palmer, David Park, Wonhee Park, Gianni Parsa, Paul Pelzl, Rehan Rishi, Congzheng Song, Shan Wang, and Shundong Zhou. Private federated statistics in an interactive setting. arXiv preprint arXiv:2211.10082, 2022.
- [MSU22] Audra McMillan, Adam D. Smith, and Jonathan R. Ullman. Instance-optimal differentially private estimation. CoRR, abs/2210.15819, 2022.
- [Nar23] Shyam Narayanan. Better and simpler lower bounds for differentially private statistical estimation. CoRR, abs/2310.06289, 2023.
- [NRS07] Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. Smooth sensitivity and sampling in private data analysis. In David S. Johnson and Uriel Feige, editors, Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11-13, 2007, pages 75–84. ACM, 2007.
- [NWB19] Jonathan Niles-Weed and Quentin Berthet. Minimax estimation of smooth densities in wasserstein distance. The Annals of Statistics, 2019.
- [OS15] Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: Why is good-turing good. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2143–2151, 2015.
- [QYL12] Wahbeh Qardaji, Weining Yang, and Ninghui Li. Differentially private grids for geospatial data. Proceedings - International Conference on Data Engineering, September 2012.
- [Rou21] Tim Roughgarden. Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2021.
- [RS16] Sofya Raskhodnikova and Adam D. Smith. Lipschitz extensions for node-private graph statistics and the generalized exponential mechanism. In Irit Dinur, editor, IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 495–504. IEEE Computer Society, 2016.
- [Sin23] Vikrant Singhal. A polynomial time, pure differentially private estimator for binary product distributions. CoRR, abs/2304.06787, 2023.
- [SP19] Shashank Singh and Barnabás Póczos. Minimax distribution estimation in wasserstein distance, 2019.
- [TCK+22] Eliad Tsfadia, Edith Cohen, Haim Kaplan, Yishay Mansour, and Uri Stemmer. Friendlycore: Practical differentially private aggregation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 21828–21863. PMLR, 2022.
- [vdV97] A. W. van der Vaart. Superefficiency, pages 397–410. Springer New York, New York, NY, 1997.
- [Vov09] Vladimir Vovk. Superefficiency from the Vantage Point of Computability. Statistical Science, 24(1):73 – 86, 2009.
- [VV16] Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In Daniel Wichs and Yishay Mansour, editors, Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 142–155. ACM, 2016.
- [WB19] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25(4 A):2620–2648, 2019.
- [Wol65] J. Wolfowitz. Asymptotic efficiency of the maximum likelihood estimator. Theory of Probability & Its Applications, 10(2):247–260, 1965.
- [ZKM+20] Wennan Zhu, Peter Kairouz, Brendan McMahan, Haicheng Sun, and Wei Li. Federated heavy hitters discovery with differential privacy. In International Conference on Artificial Intelligence and Statistics, pages 3837–3847. PMLR, 2020.
- [ZXX16] Jun Zhang, Xiaokui Xiao, and Xing Xie. Privtree: A differentially private algorithm for hierarchical decompositions. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 155–170, New York, NY, USA, 2016. Association for Computing Machinery.
Appendix A Preliminaries
A.1 Distribution Distances
A number of other distances between distributions are used in this work.
Definition A.1 (KL-divergence).
Given two distributions and with , the KL divergence , if and are discrete, and if and are distributions on , and have density functions. If , then .
Definition A.2 (Hellinger distance).
Given two distributions and , the Hellinger distance (where we think of and as vectors representing the probability masses, and the square root is taken component-wise), if and are discrete. If and are distributions on , and have density functions, then .
Note that we use to represent the squared Hellinger distance. Next, we define total variation distance, which will come up in our high-dimensional results.
Definition A.3 (Total Variation distance).
Given two discrete distributions and , the Total Variation distance (where we think of and as vectors representing the probability masses). More generally, for any two probability measures and defined on , the total variation distance is defined as where represents the probability of under measure and likewise for .
We use the following relationship between Hellinger distance and KL divergence.
Lemma A.4.
For all distributions such that KL-divergence of is well defined, we have that
(10) |
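For discrete distributions, the three distances defined in this subsection can be computed directly from probability vectors. The sketch below is illustrative only: it assumes the common normalisation in which the Hellinger distance carries a factor of 1/sqrt(2) (the paper's normalisation is not recoverable from this text), and the function names are hypothetical. Under that assumed normalisation, the standard inequality "squared Hellinger is at most half the KL divergence" holds; the exact form and constant of Lemma A.4 may differ.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete p, q; infinite if p puts mass where q has none."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((p > 0) & (q == 0)):
        return float("inf")
    mask = p > 0                       # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def hellinger(p, q):
    """Hellinger distance with the 1/sqrt(2) normalisation (an assumption here)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0))

def total_variation(p, q):
    """Total variation distance: half the L1 distance between the mass vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * float(np.sum(np.abs(p - q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
kl, h, tv = kl_divergence(p, q), hellinger(p, q), total_variation(p, q)
```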
A.2 Differential Privacy
Lemma A.5 (Post-Processing [DMNS17]).
If Algorithm is -differentially private, and is any randomized function, then the algorithm is -differentially private.
Secondly, differential privacy is robust to adaptive composition.
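Lemma A.6 (Composition of $\varepsilon$-differential privacy [DMNS17]).
If $\mathcal{A}$ is an adaptive composition of differentially private algorithms $\mathcal{A}_1, \dots, \mathcal{A}_T$, where $\mathcal{A}_t$ is $\varepsilon_t$-differentially private, then $\mathcal{A}$ is $\left(\sum_{t=1}^{T} \varepsilon_t\right)$-differentially private.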
Finally, we discuss the Laplace mechanism, which we will use in one of our algorithms.
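Definition A.7 ($\ell_1$-Sensitivity).
The $\ell_1$-sensitivity of a function $f : \mathcal{X}^n \to \mathbb{R}^k$ is
$$\Delta_f = \max_{\text{neighboring } X, X' \in \mathcal{X}^n} \|f(X) - f(X')\|_1.$$
Lemma A.8 (Laplace Mechanism).
Let $f : \mathcal{X}^n \to \mathbb{R}^k$ be a function with $\ell_1$-sensitivity $\Delta_f$. Then the Laplace mechanism is the algorithm
$$\mathcal{A}_f(X) = f(X) + (Z_1, \dots, Z_k),$$
where $Z_i \sim \mathrm{Lap}(\Delta_f / \varepsilon)$ (and $Z_1, \dots, Z_k$ are mutually independent). Algorithm $\mathcal{A}_f$ is $\varepsilon$-DP.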
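The Laplace mechanism can be sketched in a few lines. The snippet below is an illustrative implementation under the assumption that the caller supplies a correct bound on the $\ell_1$-sensitivity; the names `laplace_noise` and `laplace_mechanism` are hypothetical, and floating-point subtleties of real deployments are ignored.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(values, l1_sensitivity, epsilon, rng=None):
    """Add independent Laplace(l1_sensitivity / epsilon) noise to each coordinate.

    `values` is the exact output f(X) as a list of floats. The caller is
    responsible for `l1_sensitivity` being a valid bound for f (an assumption
    of this sketch, not something the code can check).
    """
    if epsilon <= 0 or l1_sensitivity <= 0:
        raise ValueError("epsilon and l1_sensitivity must be positive")
    rng = rng or random.Random()
    scale = l1_sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]
```

For example, `laplace_mechanism([10.0, 20.0, 30.0], l1_sensitivity=2.0, epsilon=1.0)` would release three noisy counts, each perturbed by noise of scale 2.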
Appendix B Experiment Details
Below we describe the experiment referenced in the introduction.
The distribution: We have taken a distribution on , which is concentrated on two points and , with and . These algorithms have been run with samples from this distribution.
Minimax Optimal Algorithm: The minimax-optimal algorithm here is the algorithm PSMM from [HVZ23] that considers a fixed partitioning of the interval into equal intervals and places the empirical mass in each interval on an arbitrary point in each interval. Here we consider this algorithm with , so that no noise is added. We have run it here with buckets.
Instance-optimal Algorithm: The instance-optimal algorithm finds quantiles as in Algorithm 5. In this particular implementation, we used the recursive exponential mechanism of [KSS22], but we expect other quantile algorithms would work similarly. In this particular case, we use quantiles with .
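To make the qualitative comparison concrete, the following is a simplified, non-private sketch of the two estimators on a two-point distribution. All numerical values (the support points 0.31 and 0.33, equal masses, 1000 samples, 10 buckets, 20 quantiles) are hypothetical and are not the values used in the experiment. The quantile estimator here uses exact empirical quantiles rather than the private algorithm of [KSS22], and placing bucket mass at the midpoint is one arbitrary choice of representative point.

```python
import random

def w1(points_p, masses_p, points_q, masses_q):
    """W1 on the real line as the integral of |F_P - F_Q| for finitely supported P and Q."""
    mp, mq = {}, {}
    for x, w in zip(points_p, masses_p):
        mp[x] = mp.get(x, 0.0) + w
    for x, w in zip(points_q, masses_q):
        mq[x] = mq.get(x, 0.0) + w
    events = sorted(set(mp) | set(mq))
    total = fp = fq = 0.0
    for left, right in zip(events, events[1:]):
        fp += mp.get(left, 0.0)
        fq += mq.get(left, 0.0)
        total += abs(fp - fq) * (right - left)  # CDFs are constant on [left, right)
    return total

def bucket_estimate(samples, num_buckets):
    """Fixed partition of [0, 1]: the empirical mass of each bucket is placed at its midpoint."""
    counts = [0] * num_buckets
    for x in samples:
        counts[min(int(x * num_buckets), num_buckets - 1)] += 1
    n = len(samples)
    return [(i + 0.5) / num_buckets for i in range(num_buckets)], [c / n for c in counts]

def quantile_estimate(samples, k):
    """Mass 1/k on each of k evenly spaced empirical quantiles (non-private simplification)."""
    s = sorted(samples)
    n = len(s)
    points = [s[min(int((i + 0.5) / k * n), n - 1)] for i in range(k)]
    return points, [1.0 / k] * k

def run_demo(seed=0, n=1000, num_buckets=10, k=20):
    rng = random.Random(seed)
    true_points, true_masses = [0.31, 0.33], [0.5, 0.5]  # hypothetical instance
    samples = [true_points[0] if rng.random() < 0.5 else true_points[1] for _ in range(n)]
    err_bucket = w1(true_points, true_masses, *bucket_estimate(samples, num_buckets))
    err_quantile = w1(true_points, true_masses, *quantile_estimate(samples, k))
    return err_bucket, err_quantile
```

In this toy instance both support points fall in the same bucket, so the fixed-partition estimate is off by roughly the bucket width, whereas the quantile-based estimate places its mass on the sample points themselves.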
Appendix C Appendix for Section 5
See Theorem 5.6.
Proof of Theorem 5.6.
Given a distribution , let
so Let . We want to define an algorithm on the distributions in that achieves maximum error rate . Define a randomised function which given a node at level , is sampled from the distribution restricted to the leaf nodes that are children of . Given a set of nodes at level , define to be the set where is applied to each set element individually. Then define . Since is applied individually to each element in , is -DP.
Given a distribution , define a distribution on the leaves of the tree as follows:
where is the parent node of at level . Note , and . Now,
where the first inequality follows by definition of and the fact . Since , this implies that for all distributions in ,
which implies for all levels , and so we are done. ∎
See Lemma 5.8.
Proof of Lemma 5.8.
We will follow the proof of Theorem 3 in [ASZ21]. Given an estimator , define a classifier by projecting on the product of hypercubes so
By the triangle inequality and the definition of , for any ,
Therefore, we can restrict to a lower bound on the performance of DP classifiers:
(11) |
Also, for any -DP classifier ,
where the first inequality follows from the fact that the max is greater than the average, and the second follows from assumption (5). For each pair, we divide into two groups;
Combining with Equation 11, we have the first statement. Next, since for each pair , there exists a coupling between and such that , we can use the DP version of Le Cam’s method from [ASZ21] to obtain, for any classifier ,
which implies the final result. ∎
See Lemma 5.12.
Proof of Lemma 5.12.
A standard result in the statistics literature states that for any pair of distributions and ,
where the minimum is over all binary classifiers. If and where then
where the first inequality holds since for and by assumption and the second follows again because of the constraint on . ∎
See Lemma 5.14.
Lemma 5.14 is an immediate corollary of the following lemma.
Lemma C.1.
For any distribution , if then with probability ,
Proof of Lemma C.1.
We’ll consider each level of the tree individually then use a union bound over all the levels to obtain our final bound. Let be the empirical distribution without truncation. The following conditions are sufficient to ensure that the bounds hold for a single level :
We begin by showing that these conditions are sufficient. If then these conditions imply that the empirical density for node is truncated, and hence the error at that node is either or (when and , respectively), as required. If then either the estimate is not truncated and the error is less than , as required, or the estimate is truncated and the error is . Under the above conditions, if then truncation will only occur if
in which case , as required. Similarly, if then truncation will only occur if , as required.
We will now show that these conditions hold simultaneously with probability at least for all the nodes at level . If then using the multiplicative form of Chernoff bound,
Firstly, since , . Further, and so . Therefore,
(12) |
Let . Then, using a union bound and Equation 12, we have
There exist at most elements in that do not belong in . We will prove that, independently, each of these elements satisfies the required condition with probability , and then a union bound proves the final result. If then using the multiplicative form of the Chernoff bound (if are all i.i.d. and , then ),
Next, if then using the additive form of Chernoff bound (If are all i.i.d. and , then )
By symmetry, if then
∎
See Lemma 5.15.
Proof of Lemma 5.15.
First notice that if a node is an -active node, then all of its ancestor nodes are also -active. So, it suffices to show that (with high probability) if at any stage a node makes it to Line 7 of Algorithm 3, then if then and if then .
By Lemma 5.14, with probability , all nodes satisfy
Further, if one samples independent samples from then with probability ,
So conditioning on both these events if ,
so they will not survive Line 7 of Algorithm 3. If then
Each level has at most in so we query at most nodes in the tree when running LocateActiveNodes since each node has at most 2 children. Therefore, we can set . ∎
See Lemma 5.16.
Proof of Lemma 5.16.
The key component of this proof is that any discrepancy between the weight of the nodes on and that assigned by was already paid for in .
as required. ∎
See Lemma 5.17.
Proof of Lemma 5.17.
We first note that for any pair of sequences of real values and , and constant such that ,
Also if then
Let be the function after only levels have been updated. So matches on all levels except . Let be a node in the th level of the HST. If we suppose the sum is over the normalised children of a node , , and for all the children of , and , we can see that the contribution to the Wasserstein distance by the children increases by an additive factor of . Iterating, we can see that
which is 4 times the Wasserstein distance.
∎
Appendix D Local Minimality in the High Dimensional Setting
Theorem D.1.
Given any , and a distribution , and let , then for all -DP algorithms , there exists a distribution such that with probability ,
where is the output of with samples.
Proof.
First, let us obtain a slightly simpler upper bound on . From Equation 8 in the proof of Theorem 5.13, we have that for each level ,
from Lemma 5.15 we have that with probability ,
and if one samples independent samples from then we have that with probability ,
Therefore, for all we have for some constant . Hence,
For the same reason as in the proof of Theorem 5.13, we can upper bound by by dealing with the case separately. Therefore,
Further, by Theorem 5.7 and Theorem 5.6, given and , let where is the Lambert W function so . Given a distribution , there exists a constant such that
Let , then so
Now, let so for all ,
Appendix E Differentially Private Quantiles
Estimating appropriately chosen quantiles is the main part of our algorithm for approximating the distribution over in Wasserstein distance, and so in this section, we describe some known differentially private algorithms for this task and derive some corollaries that we use extensively in our application. We will use to represent CDF functions, with representing the CDF of distribution . We start by stating an important theorem on private CDF estimation. This follows from the binary tree mechanism [CSS11, DNPR10]. A version of this theorem for approximate differential privacy is described in a survey by Kamath and Ullman [KU20, Theorem 4.1]. The version presented here for pure differential privacy follows from a very similar argument, except using Laplace noise instead of Gaussian noise (and basic composition instead of advanced composition to analyze privacy). Their accuracy guarantee is also stated in expectation, but a similar analysis yields a high probability bound, as in the theorem below.
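The binary tree mechanism with Laplace noise can be sketched as follows. This is an illustrative simplification and not the exact algorithm of [KU20]: it assumes the domain is indexed by $0, \dots, m-1$ with $m$ a power of two, that neighboring datasets differ by replacing one entry (so the dataset size is public), and that every dyadic node below the root receives noise of the same scale. The function names are hypothetical.

```python
import random

def _laplace(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_cdf(data, domain_size, epsilon, rng=None):
    """Estimate the empirical CDF at every domain index using noisy dyadic counts.

    ASSUMPTIONS: `data` holds integers in range(domain_size); `domain_size` is a
    power of two and at least 2; neighbors differ by replacing one entry, so one
    change alters at most two counts per level and the l1-sensitivity of all
    released counts is 2 * log2(domain_size).
    """
    rng = rng or random.Random()
    depth = domain_size.bit_length() - 1
    assert depth >= 1 and domain_size == 1 << depth
    n = len(data)
    scale = 2.0 * depth / epsilon
    noisy = {}
    for level in range(1, depth + 1):
        width = domain_size >> level  # size of each dyadic block at this level
        counts = [0] * (1 << level)
        for x in data:
            counts[x // width] += 1
        noisy[level] = [c + _laplace(scale, rng) for c in counts]
    cdf = []
    for j in range(domain_size):
        end = j + 1  # estimate the count of entries in [0, end)
        if end == domain_size:
            cdf.append(1.0)  # the total count n is public under replacement
            continue
        total, start = 0.0, 0
        for level in range(1, depth + 1):
            width = domain_size >> level
            if start + width <= end:  # at most one block per level is needed
                total += noisy[level][start // width]
                start += width
        cdf.append(total / n)
    return cdf
```

Each prefix count is a sum of at most $\log_2 m$ noisy dyadic counts, which is why the error of the CDF estimate depends only polylogarithmically on the domain size.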
Theorem E.1.
[KU20, Theorem 4.1]
Let , let be an ordered, finite domain, and let be a dataset. Let be the uniform distribution on . Then, there exists an -DP algorithm that on input and the domain outputs a vector over such that with probability at least over the randomness of :
CDF estimation is intimately related to quantile estimation, and we use the following quantitative statement that will follow from a simple application of Theorem E.1.
Theorem E.2.
Fix any , , , and such that is an integer. Let be a sufficiently large constant. Then, there exists an -DP algorithm , that on input interval end points , granularity , , and desired quantile values , outputs quantiles such that with probability at least over the randomness of , for all ,
and
where is the uniform distribution on the entries of .
Proof.
Algorithm operates by running the algorithm referenced in Theorem E.1 on and domain , and postprocessing its outputs to get quantile estimates as follows. For every quantile that we are asked to estimate, simply scans the vector output by algorithm in order, and outputs the first domain element whose CDF estimate in crosses . Conditioned on the accuracy of the CDF estimation algorithm , we have that this output satisfies
Additionally, since is the first domain element whose estimate in crosses , we also have that
Hence, with probability at least , we have this property for all . ∎
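The postprocessing step in this proof can be sketched as a linear scan over the CDF estimates. The snippet below is illustrative only: the function name is hypothetical, "crosses" is interpreted as "is at least", and the fallback to the last domain element when noise keeps every estimate below the target is an added assumption. Since the scan only reads the private CDF estimates, it incurs no additional privacy cost by post-processing (Lemma A.5).

```python
def quantiles_from_cdf(domain, cdf_estimates, targets):
    """For each target level, return the first domain element whose CDF estimate reaches it.

    `domain` is an ordered list of points and `cdf_estimates[i]` is the
    (possibly noisy) CDF estimate at `domain[i]`.
    """
    output = []
    for q in targets:
        chosen = domain[-1]  # ASSUMPTION: fall back to the largest element
        for point, value in zip(domain, cdf_estimates):
            if value >= q:
                chosen = point
                break
        output.append(chosen)
    return output
```

For instance, with domain `[0.0, 0.25, 0.5, 0.75]`, estimates `[0.1, 0.4, 0.8, 1.0]`, and targets `[0.25, 0.5, 0.9]`, the scan selects the points 0.25, 0.5, and 0.75.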
We now state a corollary of this theorem that we will use extensively in our presentation.
Corollary E.3.
Fix any , , and such that is an integer. Let , such that set to is an integer greater than or equal to , where and are sufficiently large constants. (Footnote 6: is set to be sufficiently small in order to relate the accuracy of the quantiles algorithm to a parameter depending on , and is set sufficiently large that is not less than . The dependence on comes up in the proof of Claim 6.18.)
Then, there exists an -DP algorithm (the same one referenced in Theorem E.2), that on input interval end points , granularity , , and desired quantile values , outputs quantiles such that with probability at least , for all ,
where is the uniform distribution on the entries of and for all , is the -quantile of .
Proof.
First, note that is set such that .
Hence, by Theorem E.2, we have that with probability at least ,
for all ,
and
Condition on the event above for the rest of the proof. Note that the first equation implies that for all ,
which implies that .
Next, note that we also have that for all ,
This implies that for all , . ∎
Appendix F Proofs in Section 6
F.1 Omitted Proofs in Section 6.1.1
Proof of Lemma 6.6.
We evaluate the various terms in Theorem 6.5.
We start by evaluating . Consider the first term in the outer maximum. For all , we have that . For all other , one can see that the value of the integrand is . Hence, the value of the first term is . Now, consider the second term in the outer maximum. For all , the value of the integrand is . For all , the value of the integrand is . Hence, the second term is . Putting these together, we get that .
When , we have that , and so we have that the largest value of that makes , is . When , we have that the value of that makes , is .
Finally, we describe the distributions and and compute the squared Hellinger distance between them. There are two cases, based on the range of . First, consider . First, we calculate . This value is equal to for , and is also equal to for . Similarly, consider ; it is equal to for , and is equal to for . It is also equal to for . Since , and by the above calculations, we have that , and . Upper bounding the squared Hellinger distance between and by the TV distance (See Lemma A.4), we get that (where we have used that ).
Next, consider . First, consider . This value is equal to for , and is also equal to for . Similarly, consider ; it is equal to for , and is equal to at . It is also equal to at . Note that . and are the distributions created by normalizing and by dividing by a factor of . Now, we upper bound the squared Hellinger distance between and by the TV distance (See Lemma A.4), to get that .
Substituting into the lower bound for sample complexity of distinguishing and , this tells us that for all , .
∎
Proof of Lemma 6.7.
Note that has bounded expectation (and hence, so does ). Hence, we can use the following form of the Wasserstein distance:
Now, given the settings of and , we can precisely write the forms of their cumulative distribution function as follows. Note that for , we have that . For , we have . Finally, for , we have that and , which gives us that .
Hence, we have that
∎
F.2 Omitted proofs in Section 6.1.2
Proof of Lemma 6.10.
The KL divergence is defined as . This can be broken up into a sum over the dyadic quantiles as:
where the third inequality from last is by the fact that the geometric series converges to , the second inequality from last is from the fact that , and for . ∎
Proof of Lemma 6.11.
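First, we recall the definition of the 1-Wasserstein distance in terms of the cumulative distribution function. For distributions $P$ and $Q$ on $\mathbb{R}$ with finite expectation,
$$\mathcal{W}(P, Q) = \int_{-\infty}^{\infty} |F_P(t) - F_Q(t)| \, dt.$$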
Fix any . Observe that by construction, for all and for all , . Similarly, fix any . Observe that for all , and for all , we have that . Substituting the above bounds in the formula for the Wasserstein distance, we get that
Pulling the summation over outside the integral and grouping terms,
Switching the order of summation (summing over first), and grouping terms, we get
Telescoping the inner sums over we get that
A change of variables (where we now set to ) then gives
where the last inequality is by pulling the first two terms from the summation in second term to the summation in the first term, and using the fact that for , we have that
∎
Proof of Lemma 6.12.
We first state a theorem of Bobkov and Ledoux [BL19].
Theorem F.1 (Theorem 3.5, [BL19]).
There is an absolute constant , such that for all distributions over , for every ,
where
and
Now, we are ready to prove the main theorem. Fix natural number . Restricted to , is an increasing function, and hence for , we have that .
Similarly, restricted to , is a decreasing function, and hence for , we have that .
Using this, we can now upper bound the expected Wasserstein distance between and its empirical distribution using Theorem F.1. Hence, we upper bound the terms and . We start by upper bounding . Note that for all , we have that . Hence,
where the last inequality is because for , we have that .
Next, we bound . Note that for all and for all , we have that . Hence,
Then, using the upper bound in Theorem F.1, substituting in the bounds for and , and simplifying, we get the claim. ∎
Proof of Claim 6.13.
By the definition of Wasserstein distance and restrictions of distributions, we have that
∎
F.3 Omitted Proofs in Section 6.2
Before going into the proofs, we state the standard Chernoff concentration bound that we will use multiple times.
Theorem F.2 (Binomial Concentration).
Let with expectation , and . Then,
Proof of Lemma 6.16.
(13) | ||||
(14) |
Note that for all , we have that the cumulative distribution functions of and its restricted version are identical and likewise for . Additionally, the cumulative distribution functions for the restricted versions of the two distributions are identical to each other outside of this interval. Hence, we can simplify the middle term in the RHS of the inequality above as follows:
Next, we reason about the remaining terms.
Consider the term . First, condition on the event in Corollary E.3 (on the accuracy of the private quantiles for the empirical distribution), which tells us that with probability at least , we have for all , that
(15) |
which implies in particular that .
Next, we argue that with high probability. By the definition of quantiles, we have that . The number of entries in the dataset less than is hence a Binomial random variable with mean less than , and hence, we have by Theorem F.2 (with set to ) that with probability at least , the number of entries in the dataset less than is at most , which means the total mass less than in the empirical distribution is less than . This implies that by the definition of quantiles.
Additionally, note that for all , . The number of entries in the dataset that are less than is hence a Binomial with success probability less than . By Theorem F.2, we can again argue that with probability at least , there is a constant such that the total mass of the empirical distribution on values less than is less than . Hence, . This implies by Equation 15, that for some constant . Hence, for all , we have that .
Hence, taking a union bound, with probability at least ,
By a symmetric argument, we also have that with probability at least ,
Taking a union bound to ensure that all terms in Equation 14 are bounded as required, the proof is complete. ∎
Proof of Lemma 6.17.
First, we condition on the event in Corollary E.3 (on the accuracy of differentially private quantile estimates) that for all ,
note that this event happens with probability at least over the randomness of the algorithm.
Observe that this implies that increases by somewhere in the range (for all ) and remains constant outside these intervals.
Now, we show that for all , we have that .
If there exists , we have that , and , which implies that . If there exists no such , then we have that , and the corresponding interval collapses to a single point (which will fall in another interval considered below).
Next, fix any . Note that if there exists , we have for all such that , and . This implies that for all such , . If there exists no such , then we have that , and this is not relevant since the corresponding interval collapses to a single point (that is considered in another interval).
Finally, for , we have that , and , so we have that .
Note that every is considered in some interval above and hence we have shown that for all , we have that .
Finally, using the formula for Wasserstein distance (and the definition of a restriction), we have that
(16) | ||||
(17) | ||||
(18) | ||||
(19) |
∎
Before the proof of Claim 6.18, we state the following variance-dependent version of the DKW inequality that uniformly bounds the absolute difference in CDFs between the true and empirical distribution.
Theorem F.3 (See for example Theorem 1.2 in [BM23]).
Fix . There are absolute constants such that for all ,
We also state the following lemma on Binomial random variables, which is a simple consequence of a lemma of Bobkov and Ledoux [BL19].
Lemma F.4 (Lemma 3.8 in [BL19]).
Let be the sum of independent Bernoulli random variables with and (for all ). Also assume . Then, for some sufficiently small constant ,
Proof of Claim 6.18.
Now, by the formula for Wasserstein distance, the definition of restriction, and Fubini’s theorem, we have that
By Lemma F.4, using the fact that , where each term in the sum is an independent Bernoulli random variable with expectation , with (ensuring that the conditions of the lemma are met), we get that , which gives
Now, consider the random variable . Note that (for an appropriately chosen ), and so we are in the regime where we can apply Theorem F.3 for an appropriately chosen .
In particular, we have that for , .
Setting , we have that , and (the second inequality for sufficiently large ). In particular, this implies for , , which implies that , as long as for some sufficiently large constant .
Now, using Theorem F.3, we have that with probability at least ,
Condition on this for the rest of the proof. Then, we can write the following set of equations.
as required.
∎
F.4 Local Minimality in the One-Dimensional Setting
In this subsection, we argue that the instance-optimal algorithm discussed in Section 6.2 is also locally minimal (see Section 3.2 for a discussion of local minimality).
First, we state a corollary of our upper bound for continuous distributions, Theorem 6.14. This corollary follows by discretizing the distribution and applying the previous upper bound to the discretized distribution. The parameters of the discretized distribution are related to that of the original distribution via simple coupling arguments.
Corollary F.5.
Fix , , . Let be any continuous distribution supported on . Consider any (such that divides ), and let for some sufficiently large constant . Then, there exists an algorithm, that when given inputs , privacy parameter , interval end points , granularity , and access to algorithm , outputs a distribution such that with probability at least over the randomness of and the algorithm,
where is the uniform distribution on , represents the -quantile of distribution , and , where is a sufficiently large constant.
We state a lemma of Bobkov and Ledoux that we will use in the main proof of this section.
Lemma F.6 (Lemma 3.8 in [BL19]).
Let be the sum of independent Bernoulli random variables with and (for all ). Then, for some sufficiently small constant ,
Now, we are ready to state and prove the local minimality result. Note that the statement will reference the rates defined by Equation 1 in the introduction.
Theorem F.7.
Let , . For any continuous distribution over with a density, let . Fix , and let , with for some constant . There exists an algorithm such that for all continuous distributions , for all algorithms , there exists a distribution such that
Proof.
Let , and set for a sufficiently large constant . Then, by Corollary F.5 with appropriately chosen we have that with probability at least 0.95, for any distribution (and hence, in particular, any distribution ),
where is the constant referenced in Theorem 6.3. We will show that for distribution , each of the corresponding distribution-dependent terms is closely related to the terms for .
First, consider . Firstly, note that for all , , and , since , which implies that for all . Similarly, note that for all , , and . Hence, we have that
Next, consider . Recall that , and . Then, (noting that and ), we have that
Finally, consider . By Fubini’s theorem and applying both inequalities in Lemma F.6, we have that
where is a sufficiently large constant and the fourth inequality holds since .
By the above observations connecting the distribution-dependent terms with the corresponding terms for , we have that for all , with probability at least ,
(20) | ||||
Now, we proceed with the analysis in two cases. First, consider the case when the first and third terms inside the bracket on the RHS of Equation 20 are larger than the second term inside the bracket. Then, we have that for all , with probability at least ,
By Theorem 6.3 and the fact that , for all algorithms , there exists a distribution such that ,
Hence, for all algorithms and the corresponding distribution , with probability at least ,
Next, consider the case where the first and third terms inside the bracket on the RHS of Equation 20 are smaller than the second term inside the bracket. Then, we have that for all , with probability at least ,
By Theorem 6.3, for all algorithms , there exists a distribution such that
Hence, we have that for all algorithms and for the corresponding distribution , with probability at least ,
as required. This completes the proof. ∎