Suppose that is a probability measure on with , where is the Euclidean norm on . Let be i.i.d. samples of . How many samples are needed so that the empirical distribution is “close” to ? The answer depends on the notion of “close” we use. If we want the covariance matrix of to be close, in the operator norm, to the covariance matrix of , then how many samples are needed is already a very deep question, though by now, in some aspects, this question has been settled after a series of works [32, 2, 3, 39, 33, 21, 15, 35, 44, 1]. In general, after certain rescaling, samples suffice to accurately approximate the covariance matrix of . On the other hand, if we want and to be close in the Wasserstein distance, we need to be exponentially large in (see, e.g., [12]).
To circumvent this curse of dimensionality, the notions of sliced, max-sliced and projection robust Wasserstein distances have been introduced in recent years and used in applications [31, 7, 9, 10, 13, 11, 14, 22, 28, 43, 18, 23, 24]. They were further studied in [26, 42, 19, 4, 25, 27]. The max-sliced $p$-Wasserstein distance between two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ is
$$\overline{W}_p(\mu,\nu) = \sup_{\|v\|_2 = 1} W_p(v_\#\mu, v_\#\nu), \tag{1.1}$$
where $v_\#\mu$ is the pushforward probability measure of $\mu$ by the map $x \mapsto \langle x, v\rangle$, i.e., if $\mu$ is the distribution of a random vector $X$ in $\mathbb{R}^d$, then $v_\#\mu$ is the distribution of the random variable $\langle X, v\rangle$. The quantity $W_p(v_\#\mu, v_\#\nu)$ denotes the $p$-Wasserstein distance between the measures $v_\#\mu$ and $v_\#\nu$ on $\mathbb{R}$. The sliced Wasserstein distance (which we do not study in this paper) is the notion where in (1.1), we replace the supremum over $v$ by the integral of $W_p(v_\#\mu, v_\#\nu)^p$ over $v$ on the unit sphere and then take the $p$th root. The projection robust Wasserstein distance (which we also study in this paper) is the notion where in (1.1), we take the $p$-Wasserstein distance between the pushforward measures of $\mu$ and $\nu$ by a projection onto a subspace of a fixed dimension $k$ and then take the supremum over all such subspaces. When $k=1$, this is the max-sliced Wasserstein distance $\overline{W}_p$.
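The definition in (1.1) can be sketched numerically for two empirical measures with the same number of atoms. The code below is an illustration under stated assumptions, not a method from this paper: it uses the standard fact that, on the real line, the sorted (monotone) coupling is optimal for equal-size empirical measures, and it replaces the supremum over the unit sphere by a maximum over finitely many random directions, so it only yields a lower estimate. The names `wasserstein_1d`, `random_unit_vector` and `max_sliced_lower_estimate` are hypothetical.

```python
import math
import random


def wasserstein_1d(xs, ys, p=1.0):
    """p-Wasserstein distance between two empirical measures on the real line
    with equally many atoms; the sorted coupling is optimal for p >= 1."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many atoms")
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)


def random_unit_vector(dim, rng):
    """Uniform direction on the unit sphere: a normalized Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in g))
        if norm > 0.0:
            return [t / norm for t in g]


def max_sliced_lower_estimate(points_a, points_b, p=1.0, n_dirs=200, seed=0):
    """Maximum of the projected 1-D distances over n_dirs random directions.
    Only finitely many directions are tried, so this never exceeds the true
    supremum in (1.1)."""
    rng = random.Random(seed)
    dim = len(points_a[0])
    best = 0.0
    for _ in range(n_dirs):
        v = random_unit_vector(dim, rng)
        proj_a = [sum(xi * vi for xi, vi in zip(x, v)) for x in points_a]
        proj_b = [sum(yi * vi for yi, vi in zip(y, v)) for y in points_b]
        best = max(best, wasserstein_1d(proj_a, proj_b, p))
    return best
```

For a point cloud and its translate by a fixed vector, every projected distance equals the absolute inner product of the direction with the translation, so the estimate is bounded by the Euclidean norm of the translation; this gives a quick sanity check of the sketch.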
1.1. Max-sliced 1-Wasserstein distance
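When $p=1$, by the Kantorovich-Rubinstein theorem, the max-sliced $1$-Wasserstein distance $\overline{W}_1(\mu,\nu)$ between two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ coincides with the following quantity:
$$\overline{W}_1(\mu,\nu) = \sup_{v,\,f}\left|\int f(\langle x,v\rangle)\,d\mu(x) - \int f(\langle x,v\rangle)\,d\nu(x)\right|, \tag{1.2}$$
where the supremum is over all the $v$ on the unit sphere of $\mathbb{R}^d$ and over all the 1-Lipschitz functions $f:\mathbb{R}\to\mathbb{R}$ (i.e., $|f(s)-f(t)|\le|s-t|$ for all $s,t\in\mathbb{R}$). Consider the following problem: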
Problem 1.
Suppose that is a probability measure on . Let be i.i.d. samples of . Estimate .
There are known estimates (some of which are sharp) of under certain regularity assumptions on the measure , e.g., log-concavity of [25, Theorem 1] and [4, Theorem 1.6], or satisfying the spiked transport model and the transport inequality [26, Theorem 1], or satisfying the projection Bernstein tail condition or the projection Poincaré inequality [19, Theorem 3.5 and Theorem 3.6], or being isotropic with its marginal distributions having uniformly bounded 4th moments [4, Proposition 4.1] (see also [4, Remark 4.2]).
As for the most general setting, under the only assumption of being supported on , it was shown in [25, Proposition 1] that , where is a universal constant. In [27, Theorem 2], this was improved to . In these two bounds, the rate of convergence is optimal in , but both bounds involve the dimension .
There is a dimension-free bound for that holds with the same generality. More precisely, if is supported on , then , where is a universal constant. This follows by taking and optimizing the in the term in [42, Theorem 1]. This estimate is dimension-free but comes at the cost of a slower convergence rate in .
In short, the literature concerning Problem 1 can be summarized as follows.
(1) If is supported on , then , where is a constant that depends only on .
(2) If is supported on , then , where is a universal constant.
(3) If in addition, satisfies certain regularity assumptions, then , where is a universal constant.
These results together suggest the following question. Does the dimension-free bound , where is a universal constant, actually hold for every supported on even without any regularity assumptions?
In the first main result of this paper, we answer this question affirmatively. We obtain essentially matching dimension-free upper and lower bounds for in the most general setting. This essentially settles Problem 1.
Theorem 1.1.
Suppose that is a probability measure on with and . Let be i.i.d. random vectors in sampled according to . Then
|
|
|
where is a universal constant.
We also obtain a version of Theorem 1.1 for probability measures on Banach spaces. Besides being a result of intrinsic interest in the study of probability in Banach spaces (see [17]), this result is essential for proving the second main result of this paper, Theorem 1.4, on the max-sliced 2-Wasserstein distance for probability measures on Euclidean spaces. Indeed, in proving the latter result, we will take the Banach space to be the space of all matrices equipped with the operator norm. In the Banach space setting, to define the metric , in (1.2), instead of taking the supremum over on the unit sphere, we take the supremum over all linear functionals , where is the unit ball of the dual space centered at the origin. See Section 1.3 for the precise definition.
Theorem 1.2.
Suppose that is a probability measure on a Banach space with separable dual and that and . Let be i.i.d. random elements of sampled according to . Then
|
|
|
|
|
|
|
|
|
|
where are i.i.d. uniform random variables and are i.i.d. standard Gaussian random variables that are independent from , and is a universal constant.
1.2. Max-sliced 2-Wasserstein distance
We now turn to the problem of estimating the expected max-sliced 2-Wasserstein distance .
Unlike in Theorem 1.1, for the max-sliced 2-Wasserstein distance, the convergence rate is not always the same. Even in dimension one, for certain log-concave measures on , for , the quantity is of order [6]. However, if is uniformly distributed on two points , one can easily see that is of order , which is much slower than when .
Similarly, for the max-sliced 2-Wasserstein distance, under certain regularity assumptions on (e.g., is log-concave [4, 25]), we have or . (We ignore the dimension factors for the moment.) On the other hand, even if is isotropic and its marginal distributions have uniformly bounded 4th moments, the quantity could already be as large as for some universal constant [4, Example 3.3].
Thus, in the most general setting (i.e., no regularity assumptions on ), the best convergence rate in for the max-sliced 2-Wasserstein distance we can hope for is .
Corollary 1.3.
Let . Suppose that is a probability measure on . Let be i.i.d. random vectors in sampled according to . Then for all ,
|
|
|
where is a universal constant.
Proof.
For two probability measures on , it is easy to see that . Thus by Theorem 1.1, the result follows.
∎
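One standard way to obtain a comparison of this kind is sketched below. It assumes that both measures are supported on the Euclidean unit ball and that $p \ge 1$; the constant $2^{p-1}$ comes from this sketch and is not taken from the source. For a unit vector $v$, write $\mu_v$ and $\nu_v$ (our notation) for the pushforwards of the two measures by $x \mapsto \langle x, v\rangle$. Both are supported on $[-1,1]$, so $|s-t| \le 2$ on the support of any coupling $\gamma$ of $\mu_v$ and $\nu_v$, and hence $|s-t|^p \le 2^{p-1}|s-t|$ there.

```latex
\begin{align*}
W_p(\mu_v,\nu_v)^p
  &= \inf_{\gamma}\int |s-t|^p \, d\gamma(s,t) \\
  &\le 2^{p-1}\inf_{\gamma}\int |s-t| \, d\gamma(s,t)
   = 2^{p-1}\, W_1(\mu_v,\nu_v).
\end{align*}
```

Taking the supremum over unit vectors $v$ on both sides bounds the $p$th power of the max-sliced $p$-Wasserstein distance by $2^{p-1}$ times the max-sliced 1-Wasserstein distance.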
Corollary 1.3 removes the dimension factor in the estimate of in [27, Theorem 2].
The upper bound in Corollary 1.3 is attained, up to the constant , when is uniformly distributed on two points with being any vector with .
While the bound in Corollary 1.3 is sharp in , additional information on the covariance matrix of may yield a better bound that depends on this covariance matrix. Before discussing this further, we mention some simple connections between the max-sliced 2-Wasserstein distance and sample covariance matrices. The literature on sample covariance matrices gives us important intuition regarding the convergence in the max-sliced 2-Wasserstein distance.
If is a probability measure on with , then the max-sliced 2-Wasserstein distance between and (the probability measure with an atom of mass at the origin) is equal to
|
|
|
where is a matrix and denotes the operator norm. Thus, for in , we have
|
|
|
|
|
|
|
|
|
|
So in order for to be small, it is necessary that cannot be too much larger than .
Given that , the quantity should be assessed relative to .
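The identity above, relating the max-sliced 2-Wasserstein distance to the point mass at the origin and an operator norm, can be checked in a few lines. The sketch assumes that the matrix in question is the second-moment matrix $\Sigma = \int x x^{T}\, d\mu(x)$ (our notation, which may differ from the source's), and writes $\mu_v$ for the pushforward of $\mu$ by $x \mapsto \langle x, v\rangle$ and $\delta_0$ for the point mass at $0$. Since the only coupling of a measure with a point mass is the product coupling, $W_2(\mu_v,\delta_0)^2 = \int \langle x, v\rangle^2 \, d\mu(x)$, and therefore:

```latex
\begin{align*}
\sup_{\|v\|_2=1} W_2(\mu_v,\delta_0)^2
  &= \sup_{\|v\|_2=1} \int \langle x, v\rangle^2 \, d\mu(x) \\
  &= \sup_{\|v\|_2=1} v^{T}\Sigma v
   = \|\Sigma\|_{\mathrm{op}}.
\end{align*}
```

The last equality uses that $\Sigma$ is symmetric and positive semidefinite, so its operator norm equals its largest eigenvalue.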
Problem 2.
Suppose that is a probability measure on . Let . How many i.i.d. samples of are needed to make small?
In [4, Theorem 1.3], it was shown that if is centered and isotropic (i.e., ) with
where , then with high probability,
(1.4) |
|
|
|
where is a constant that depends only on and . By [35], the sample covariance error term is of order with high probability. Thus, under the assumptions mentioned above, suffices in Problem 2.
The literature on sample covariance matrices (see e.g., [32, 38, 36]) suggests that for a general isotropic probability measure supported on but without the assumption , the number of samples should suffice in Problem 2. More generally, if is supported on but not necessarily isotropic, should suffice in Problem 2.
In this paper, we show that these are indeed true for symmetric and its symmetrized empirical distribution. A probability measure on is symmetric if for all measurable .
Theorem 1.4.
Let . Suppose that is a symmetric probability measure on supported on . Let be i.i.d. random vectors sampled according to . Then
|
|
|
where and is a universal constant.
The factors in Theorem 1.4 cannot always be removed. Indeed, consider the probability measure uniformly distributed on the points , where is the unit vector basis for . Then by (1.2), we have
|
|
|
where . If we view as bins and each as a ball into a bin, then is the maximum number of balls in a bin after balls are thrown into bins. So by [30, Theorem 1], when ,
|
|
|
where is a universal constant. Thus, in this example, the factors in Theorem 1.4 cannot be removed.
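The balls-in-bins picture used in this example can be simulated directly. The sketch below is only an illustration: it throws balls uniformly at random into bins and records the maximum load, which can then be compared with the mean load per bin. The function names `max_load` and `average_max_load` and the trial count are hypothetical choices, not taken from [30].

```python
import random
from collections import Counter


def max_load(n_balls, n_bins, rng):
    """Throw n_balls uniformly at random into n_bins; return the fullest bin's count."""
    counts = Counter(rng.randrange(n_bins) for _ in range(n_balls))
    return max(counts.values())


def average_max_load(n_balls, n_bins, trials=200, seed=0):
    """Monte Carlo average of the maximum load over independent trials."""
    rng = random.Random(seed)
    total = sum(max_load(n_balls, n_bins, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    n = d = 1000
    # The mean load per bin is n / d; the maximum load exceeds it by a
    # factor that grows slowly with d.
    print("mean load:", n / d)
    print("average max load:", average_max_load(n, d))
```

Running this for increasing values of the number of bins, with as many balls as bins, shows the gap between the maximum and the mean load growing slowly, which is the mechanism behind the factor discussed above.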
The following lower bound result shows that the upper bound in Theorem 1.4 is sharp for every covariance matrix up to the factor.
Proposition 1.5.
Let be a positive semidefinite matrix such that . Then there exists a symmetric probability measure on supported on such that and for every ,
|
|
|
where are i.i.d. random vectors sampled according to .
1.3. Some definitions
Throughout this paper, unless specified otherwise, we always use the Euclidean metric on . If is a bounded function, then . A function is -Lipschitz if for all . The operator norm (or equivalently the largest singular value) of a matrix is denoted by
If is a metric space and , then the covering number is the smallest size of for which every element of has distance at most from an element of . The packing number is the largest size of for which all elements of have distance more than away from each other. We always have .
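The relation between covering and packing numbers can be illustrated on a finite point set. The sketch below greedily builds a subset whose points are pairwise more than a given distance apart and which cannot be enlarged; such a maximal separated set is automatically a cover at the same scale, which is the standard reason the covering number is at most the packing number. The function name `greedy_net` is hypothetical, the points are assumed to be tuples in Euclidean space, and the greedy output is not claimed to attain either the covering or the packing number.

```python
import math


def greedy_net(points, eps):
    """Return a subset of `points` that is eps-separated and covers `points`.

    A point is added as a new center only if it is more than eps away from
    every center chosen so far. Rejected points are within eps of some center,
    so the centers form a cover; accepted points are pairwise more than eps
    apart, so they form a packing.
    """
    centers = []
    for x in points:
        if all(math.dist(x, c) > eps for c in centers):
            centers.append(x)
    return centers


if __name__ == "__main__":
    grid = [(i * 0.1, j * 0.1) for i in range(11) for j in range(11)]
    net = greedy_net(grid, eps=0.25)
    print(len(net), "centers for", len(grid), "points")
```

The size of the returned set lies between the covering number and the packing number of the point set at scale `eps`, when covers are required to consist of points of the set.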
If is a Banach space, then the unit ball of is denoted by . The dual space of all bounded linear functionals is denoted by .
Pushforward measure: If is a probability measure on a separable Banach space and is a map, then is the pushforward measure of by , i.e., if is a random element of with distribution , then has distribution . In particular, is a probability measure on .
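Classical Wasserstein distance: If $\mu$ and $\nu$ are probability measures on $\mathbb{R}^k$ and $p \ge 1$, then the $p$-Wasserstein distance between $\mu$ and $\nu$ is
$$W_p(\mu,\nu) = \inf_{\gamma}\left(\int_{\mathbb{R}^k\times\mathbb{R}^k} \|x-y\|_2^p \, d\gamma(x,y)\right)^{1/p},$$
where the infimum is over all distributions $\gamma$ on $\mathbb{R}^k\times\mathbb{R}^k$ with $\mu$ and $\nu$ being its marginal distributions for its first and second components.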
Max-sliced and projection robust Wasserstein distances: If and are probability measures on and , , then
|
|
|
where the supremum is over all of the form , for , with in the unit ball of . Here we use the Euclidean distance on to define the Wasserstein distance on the right hand side.
When , we have
|
|
|
|
|
|
|
|
where the supremum is over all and all the -Lipschitz functions .