Contraction rates and projection subspace estimation with Gaussian process priors in high dimension
Abstract
This work explores the dimension reduction problem for Bayesian nonparametric regression and density estimation. More precisely, we are interested in estimating a functional parameter over the unit ball in , which depends only on a -dimensional subspace of , with . It is well-known that rescaled Gaussian process priors over the function space achieve smoothness adaptation and posterior contraction with near minimax-optimal rates. Moreover, hierarchical extensions of this approach, equipped with subspace projection, can also adapt to the intrinsic dimension ([Tok11]). When the ambient dimension does not vary with , the minimax rate remains of the order . However, this is up to multiplicative constants that can become prohibitively large when grows. The dependences between the contraction rate and the ambient dimension have not been fully explored yet and this work provides a first insight: we let the dimension grow with and, by combining the arguments of [Tok11] and [JT21], we derive a growth rate for that still leads to posterior consistency with minimax rate. The optimality of this growth rate is then discussed. Additionally, we provide a set of assumptions under which consistent estimation of leads to a correct estimation of the subspace projection, assuming that is known.
1 Introduction
With the ever-increasing availability of high-dimensional data in various fields of science and technology, dimension reduction methods have become increasingly important, especially in non-parametric estimation, to counteract the curse of dimensionality. Suppose we want to estimate an unknown function that depends only on a -dimensional linear subspace , with . For regression and density estimation problems, minimax rates without sparsity assumptions are both of the order where is the smoothness of and is the sample size ([Bir86], [Sto82]). The aim of dimension reduction is to convert this -dimensional problem into a -dimensional one in order to obtain the much more attractive rate .
As the above rates are given up to a multiplicative constant, which may itself depend on the ambient dimension , another problem arises: determining if the number of available data is sufficient in regard to the problem’s dimension. This is generally done by allowing the ambient dimension to grow with , letting , and then observing which growth rate still permits minimax estimation at rate . Note that the subspace also depends on , thus we write .
For fixed intrinsic dimension , we distinguish two cases, whether the subspace is parallel to the axes or not. In the first case (when is parallel to the axes), the dimension-reduction problem is referred to as variable selection. In this context, it is known that for non-parametric regression, the sparsity pattern can be consistently recovered when grows exponentially with the sample size ([CD12], [YT15]). More precisely, [CD12] show that there exist two constants such that
• if , there exists a consistent estimator of the sparsity pattern,
• if , no such estimator exists.
This phase transition phenomenon seems to be similar in the linear regression framework (see [Ver12] and [Wai09]).
In the second case (when nothing is assumed on ), the estimation of a minimal subspace which contains all the information on is sometimes referred to as sufficient dimension reduction ([Coo98]). Among the various methods proposed for estimating , sliced inverse regression (SIR) ([Li91]) is one of the most studied. The first article including the framework of growing ambient dimension shows the consistency of SIR only under ([ZMP06]). Later, [LZL18] show that the phase transition phenomenon occurs at a growth rate in . In other words, SIR-based estimators are consistent only if and this growth rate appears to be optimal ([Lin+21]).
The difference between growth rates encountered in variable selection and in sufficient dimension reduction has led recently to the emergence of methods combining both approaches. If depends on a -dimensional subspace which can be described by linear combination of only a small number of variables, then we can perform both variable selection and sufficient dimension reduction over the selected variables. This method is studied for example in [Lin+21], [LZL19], [TSY20], and [ZMZ22] and allows a return to the exponential growth of the dimension .
The aim of this article is to perform both function and subspace estimation in the case where no hypotheses are made on and to derive the maximum dimension growth rate. Our analysis is done in the nonparametric Bayesian framework introduced by [GGV00]. Among the advantages of this approach, the use of very versatile priors, such as Gaussian processes [VV08], makes it possible to adapt to both smoothness and dimension at near minimax rates ([VV09], [TZG10], [JT21]) with a single Bayesian procedure, and avoids the complications associated with kernel methods (see for example the introduction of [STG13]).
The work of Tokdar, Zhu, and Ghosh [TZG10] is one of the first to include a hierarchical prior with a parameter on the subspace. They use a uniform prior on the Grassmannian of dimension and a logistic Gaussian process prior for the conditional density function. The authors are able to derive posterior consistency for both the conditional density function and the subspace but they do not provide contraction rates. Near minimax contraction rates are then derived in [Tok11] by extending the framework introduced by [VV09]. Finally, [JT21] show that for variable selection, the estimation of the regression function and that of the sparsity pattern can be realized simultaneously at near minimax rates even with dimension growing exponentially with the sample size. The growth rate is linked to the smoothness of via .
The paper is organized as follows. In Section 2, we introduce a hierarchical Gaussian process-based prior for both regression and density estimation models. This prior consists of a dimension parameter for , an invariant prior over linear -dimensional subspaces of , a -dimensional Gaussian process, and a rescaling parameter to ensure smoothness adaptability. Our first result (Theorem 3.1 in Section 3) shows that, for the estimation problem of , near minimax contraction rates can be achieved for dimensions growing not faster than which is interestingly the already mentioned growth rate where we drop out the exponential. We are not able to prove the optimality of this result but some clues are given below (see Remark 5.2); notably, this growth rate is equivalent to when , which is known to be the breakpoint of the consistency of the SIR estimator. In Section 4, we show that for fixed ambient dimension , the hierarchical Bayes procedure contracts to a subspace that contains and we conjecture that this subspace is exactly . Our estimation result of combines the standard arguments used in [Tok11] and [JT21], which are based on [VV09]. To prove the contraction around the central subspace , we show that an error on the estimation of leads to an error on the estimation of from which we obtain a contradiction on the previously established minimax estimation of . The proofs of the main results (Theorems 3.1 and 4.1) are postponed to Appendices 5.1 and 5.2 while Appendix 5.3 is dedicated to useful lemmas.
2 Problem formulation
2.1 Notation and definitions
The abundant technical notation used throughout this article makes this section very useful. We begin with the definition of standard functional spaces. Let be a bounded convex subset of , with . For , write with a nonnegative integer and . The Hölder space is the space of all functions that are -times differentiable and whose partial derivatives of order , with nonnegative integers such that , are Lipschitz functions of order , that is, there exists a constant such that
for all pairs and where is the Euclidean norm. We use the following asymptotic notation: if and are two real functions over an arbitrary set , then we write if there exists a constant such that for all . The notation is defined in the same way and we write when both and hold. To model the central subspace , we will use isometries instead of the Grassmannian. For , we denote by the space of linear isometries over . In addition, the introduction of canonical subspaces and of “component filters” notation will be very convenient when dealing with the sparsity. For and , we denote by the number of ones in , by the sub-vector with components selected according to , and for , by the vector in with if and if is the -th one in . Moreover, for any integer , we denote by the vector , where is the canonical basis on . The dimension of the ambient space is implicit in this notation. Finally, for , we denote by the linear span of and by the linear span of . Clearly, is the orthogonal complement of .
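The "component filter" notation can be illustrated by a short sketch. The function names `count_ones`, `select`, and `embed` are hypothetical and chosen only for this illustration; they are assumed to correspond, respectively, to the number of ones in a binary vector, the sub-vector selected by it, and the zero-padded vector in the ambient space.

```python
from typing import List, Sequence


def count_ones(eta: Sequence[int]) -> int:
    """Number of ones in the binary filter eta."""
    return sum(eta)


def select(x: Sequence[float], eta: Sequence[int]) -> List[float]:
    """Sub-vector of x made of the components selected by eta."""
    if len(x) != len(eta):
        raise ValueError("x and eta must have the same length")
    return [xi for xi, e in zip(x, eta) if e == 1]


def embed(y: Sequence[float], eta: Sequence[int]) -> List[float]:
    """Vector of length len(eta): zero where eta is 0, and the j-th
    entry of y at the position of the j-th one of eta."""
    if len(y) != count_ones(eta):
        raise ValueError("y must have as many entries as eta has ones")
    it = iter(y)
    return [next(it) if e == 1 else 0.0 for e in eta]
```

Under these conventions, `select(embed(y, eta), eta)` returns `y`, so embedding followed by selection is the identity on the selected coordinates.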
The proof of Theorem 3.1 involves measuring the complexity of the space where the prior puts its mass. This measure is carried out via metric entropy. Given a subset of a metric space and a radius , we can define the following numbers:
• the -packing number is the maximum number of points in such that the distance between every pair is at least ,
• the -covering number is the minimum number of balls of radius needed to cover .
The logarithms of the packing and the covering number are called the entropy and the metric entropy respectively.
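As a minimal sketch of how these two numbers relate, the greedy procedure below builds, for a finite set of points, a subset whose points are pairwise at distance at least the radius and whose balls of that radius cover the whole set. Its size is therefore at most the packing number and at least the covering number (with centers taken in the set). The function name `greedy_net` and the use of the Euclidean distance are assumptions made for this illustration only.

```python
import math
from typing import List, Sequence


def greedy_net(points: Sequence[Sequence[float]], eps: float) -> List[Sequence[float]]:
    """Greedily keep a point when it is at distance >= eps from all kept points.

    The kept points are eps-separated, and every discarded point lies at
    distance < eps from some kept point, so the kept points also form a cover.
    """
    centers: List[Sequence[float]] = []
    for p in points:
        if all(math.dist(p, c) >= eps for c in centers):
            centers.append(p)
    return centers
```

For instance, applying `greedy_net` to a fine sample of a bounded set gives a quick upper bound on the covering number of that sample at the chosen radius.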
2.2 Bayesian framework for density estimation and regression
Our main result will be stated for two statistical settings: density estimation and fixed or random design regression with Gaussian error. As we will work with subspaces that are not orthogonal with the axes, the usual support for the density or the regression function will be replaced by the unit ball . For a given number of observations , the density or the regression function will be characterized by a functional parameter . The ambient dimension is allowed to grow with but is supposed to depend only on a subspace with fixed dimension . A prior on and on the subspace itself will be later introduced to ensure the dimension adaptability. The prior on the true parameter will consist of a projected Gaussian random variable with values in the Banach space . Now let us describe the two previously introduced statistical settings.
Density estimation
Suppose we observe an i.i.d. sample from a law over , which admits a continuous density relative to the Lebesgue measure on . The prior puts its mass on a space that is far too large compared to the space of continuous densities. So to correctly retrieve , we will work with the parametrized density where, for ,
(2.1)
Here the exponential forces the prior to charge only nonnegative functions while the renormalization ensures that integrates to one. The true density will then be encoded by the parameter such that . In this way, all the assumptions on the true parameter can be transferred to the density . That is, is supposed to depend only on the -dimensional subspace of .
The natural metric between two densities and is the Hellinger distance defined by , where is the -norm with respect to the Lebesgue measure. Consequently, if the parameter space is embedded with a prior , we will say that the posterior contracts to at rate if, for any sufficiently large constant ,
(2.2)
where is the joint law of .
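The following one-dimensional sketch illustrates the exponential-and-renormalize transformation and the Hellinger distance described above. It assumes the transformation maps a parameter `w` to the density proportional to `exp(w)` on the interval [-1, 1] (the unit ball in dimension one), and it approximates integrals by a midpoint rule. The names `density_from_parameter` and `hellinger` are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def density_from_parameter(w: Callable[[float], float], n: int = 2000) -> Tuple[List[float], float]:
    """Values of exp(w) / integral(exp(w)) at the n cell midpoints of [-1, 1].

    Returns the density values and the cell width h.
    """
    h = 2.0 / n
    xs = [-1.0 + (i + 0.5) * h for i in range(n)]
    vals = [math.exp(w(x)) for x in xs]
    total = sum(vals) * h  # midpoint-rule approximation of the normalizing constant
    return [v / total for v in vals], h


def hellinger(p: List[float], q: List[float], h: float) -> float:
    """L2 distance between the square roots of two densities given on the same grid."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) * h)
```

Note that adding a constant to the parameter leaves the density unchanged, which is why the parameter encoding a given density is only determined up to an additive constant in this sketch.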
Regression with Gaussian error
In a regression problem, the covariates can either be predetermined for each observation (the fixed design case) or be part of the observations themselves. In the latter case, the covariates can be considered as random; this corresponds to the random design case. The notion of posterior contraction differs slightly between these two situations and some clarifications are in order.
Fixed design
In this setting, we consider a sample of real observations satisfying the model , with where the for are fixed covariates and where the are i.i.d. univariate Gaussian random variables with zero mean and standard deviation . As previously, the regression function is supposed to depend only on a -dimensional subspace of .
We will use directly as a prior for the regression function because can be viewed by restriction as a Gaussian process over the space of design points. To quantify the posterior contraction, we introduce the design dependent semi-metric defined as the -norm for the empirical measure of the design points. If the space of regression functions over is embedded with a prior , we will say that the posterior contracts to at rate if, for any sufficiently large constant ,
(2.3)
where is the joint law of .
Random design
Here, we observe i.i.d. pairs such that , with i.i.d. , and where the ’s are random covariates over independent of the ’s and admitting a common density that is bounded away from zero. For the sake of simplicity, the standard deviation is restricted to the interval but these bounds can be relaxed, see Remark 5.1 for details. Again, the regression function is supposed to depend only on a -dimensional subspace of . Moreover, we use directly as a prior for the regression function. The natural metric for this problem is the -norm denoted by where is identified with the law of one covariate. This metric is not equivalent to the Hellinger metric, which is used in the proof of Theorem 3.1, unless all regression functions are uniformly bounded by a constant . This condition can be fulfilled by projecting the prior on the space of all functions uniformly bounded by , as proposed in [GN11], but this would force us to rewrite the proof of Theorem 3.1 only for this setting. Instead, we directly post-process the posterior to integrate this constraint as in [YD16]. Then, the formulation of posterior consistency becomes as follows. Considering a prior over the regression functions, we will say that the posterior contracts to at rate if, for and any sufficiently large constants ,
(2.4)
where is the joint law of and where is the truncated version of .
3 Main result for the functional parameter
In order for the true parameter to be recovered, we suppose that its restriction to the -dimensional subspace does not depend on the ambient dimension .
Assumption 3.1 (Sparsity of the true parameter).
There exist , , and a sequence of linear isometries such that for all , we have , and , for all .
In this way, each can be viewed as a sparse continuation in dimension of an underlying fixed function called the core function. The use of isometries instead of vector subspaces permits us to avoid the manipulation of the Grassmannian. We will use instead the more convenient orthogonal group . The next property is straightforward.
Property 3.1.
For , is constant on the intersection between and the affine subspaces , for .
In parallel to the dimension adaptability, the present setting allows the core function to be arbitrarily smooth (in a Hölder sense) while maintaining near-minimax contraction rates.
Assumption 3.2 (Smoothness of ).
There exists such that .
3.1 Prior specification
Here we specify the hierarchical prior on the parameter space. The true parameter is characterized by a sparsity pattern , where the intrinsic dimension is the one of the relevant subspace and is an isometry for the orientation; its smoothness is modeled by a rescaling parameter, and the core function is modeled by a standard squared exponential Gaussian process which has infinitely smooth sample paths. Indeed, this process has proven to be fruitful in combination with a scale parameter and allows smoothness adaptation (see [VV09]).
For , let be a standard squared exponential Gaussian process on ; that is, a centered Gaussian process with covariance kernel
where is the Euclidean norm.
Let , , and . We define and a rescaled Gaussian process with sparsity pattern , where is the diagonal matrix with diagonal vector . Then, the process is constant on affine subspaces , for (as in Property 3.1) and if is the orthogonal projection onto , then , for all .
To work properly with , we have to verify that its law identifies with the law of a -dimensional standard squared exponential process. To do so, define
a bijection with inverse for . Then, for all .
Let us introduce . Then, for all , we have
So is a standard squared exponential Gaussian process in dimension that does not depend on nor . Moreover, we have .
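The invariance of the projected process along directions orthogonal to the subspace can be checked on its covariance. The sketch below assumes the kernel normalization exp(-||s - t||^2) used in [VV09] and a hypothetical process obtained by evaluating the squared exponential process at `a` times the first `d` coordinates of `Q x`, where `Q` is an isometry given by its rows. The helper names and the example rotation are illustrative assumptions, not the paper's construction.

```python
import math
from typing import List, Sequence


def sq_exp_kernel(s: Sequence[float], t: Sequence[float]) -> float:
    """Squared exponential covariance exp(-||s - t||^2) (assumed normalization)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(s, t)))


def first_coords(q_rows: Sequence[Sequence[float]], x: Sequence[float], d: int) -> List[float]:
    """First d coordinates of Q x, where q_rows lists the rows of Q."""
    return [sum(r * c for r, c in zip(row, x)) for row in q_rows[:d]]


def projected_kernel(x, y, q_rows, d: int, a: float) -> float:
    """Covariance of the hypothetical process x -> W(a * (Q x)_{1..d})."""
    px = [a * c for c in first_coords(q_rows, x, d)]
    py = [a * c for c in first_coords(q_rows, y, d)]
    return sq_exp_kernel(px, py)


def rotation_about_z(angle: float) -> List[List[float]]:
    """A simple isometry of R^3, used here only as an example."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

With `d = 1`, shifting a point along the second or third row of `Q` leaves `projected_kernel` unchanged, whereas shifting along the first row changes it. This is the covariance counterpart of the statement that the process is constant on affine subspaces orthogonal to the selected subspace.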
From now on, will refer to the restriction on of this process. Then, the hierarchical prior on the parameter with stochastic subspace selection is defined as the law of , where is the scaling parameter, is the prior on the subspace dimension, and is the prior on the orientation.
Assumption 3.3.
The intrinsic dimension of the subspace is assumed to be bounded by a known deterministic number .
Consequently, is defined by a probability vector with for all . Moreover, we define the scaling parameter such that there exists a collection of probability measures on , , with . We require the law of the stochastic isometry to be translation invariant. That is, for every subset of and for all , we need . Therefore, the law of is taken as the unit Haar measure on , the only probability measure that is translation invariant on . In addition, all , and are supposed to be independent of .
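For readers who wish to simulate from such a prior, one standard way to draw an isometry from the unit Haar measure on the orthogonal group is to orthonormalize independent standard Gaussian vectors. The sketch below uses a modified Gram-Schmidt procedure in plain Python; the function name `haar_orthogonal` is hypothetical, and this is only one possible sampler, not necessarily the one used by the authors.

```python
import math
import random
from typing import List


def haar_orthogonal(dim: int, rng: random.Random) -> List[List[float]]:
    """Rows of a random orthogonal matrix, obtained by Gram-Schmidt
    orthonormalization of i.i.d. standard Gaussian vectors."""
    rows: List[List[float]] = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components of v along the rows already built.
        for u in rows:
            c = sum(ui * vi for ui, vi in zip(u, v))
            v = [vi - c * ui for ui, vi in zip(u, v)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-12:  # redraw in the (probability zero) degenerate case
            rows.append([vi / norm for vi in v])
    return rows
```

The invariance of the standard Gaussian distribution under isometries is what makes the law of the resulting matrix invariant under multiplication by a fixed isometry.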
For convenience, the notation will refer to a probability measure as well as its density.
Assumption 3.4 (Rescaling measures).
There exist constants , , and such that for all and , the density satisfies
1. for all sufficiently large , ;
2. for all , ;
3. .
Assumptions similar to Assumption 3.4 are standard; see for instance Equation (3.4) in [VV09] or Assumption 5 in [JT21]. For example, this assumption is satisfied if, for all and , is the restriction to of an exponential law with parameter independent of and (indeed, if has density function , with differentiable and strictly increasing, then has density function ).
The next section gives some precision about the reproducing kernel Hilbert space (RKHS) of . The content is a bit technical and can be skipped at first reading.
3.2 Reproducing kernel Hilbert space of
One of the advantages of choosing a Gaussian process prior is that the contraction rate depends explicitly on the small ball probability and on the relative position of the parameter with respect to the RKHS associated with the process. This section is dedicated to the basic properties of this space. For elementary definitions and for some precision about the link between the contraction rate and the RKHS, we refer the reader to [VV08] and [VV08a].
Notation.
We denote by the space of continuous functions on which are constant on affine subspaces , for .
We introduce the operator
so that , where is the process introduced above rescaled by and restricted to . It is a bijective linear map and also an isometry if the domain and the codomain are endowed with the uniform norm. In particular, the map is continuous. According to Lemma 7.1 in [VV08a], if is the RKHS of , then the RKHS of is equal to . Let us detail its elements. The stochastic process RKHS of (as defined in [VV08a]) is composed of functions for which there exists such that
(3.1)
where is the spectral measure of the -rescaled squared exponential process in dimension with spectral density (see Lemma 4.1 in [VV09], and the following discussion). We can view as a random Gaussian element with values in the Banach space . Thus, according to Theorem 2.1 in [VV08a], the stochastic process RKHS and the Banach space RKHS coincide and we can apply Lemma 7.1 from the same reference. The space is then the set of functions
(3.2)
where runs through and the RKHS norm is .
We remark that functions of the RKHS of have the same sparsity-pattern as the trajectories of .
Remark 3.1.
Functions are constant on affine subspaces for .
As mentioned at the beginning of this section, contraction rates under Gaussian process prior depend on two quantities: the small ball probability and the relative position of the parameter with respect to the RKHS. For a parameter and , these two quantities define the concentration function , with
(3.3)
3.3 Posterior consistency
Before we state the theorem, we need a last assumption, which determines how the ambient dimension is allowed to grow with the sample size .
Assumption 3.5 (Growth of ).
The ambient dimension satisfies
for some small constant and where .
An examination of shows that if and that otherwise. Thereby, a standard rate of order for is achieved with parameter . The fastest rate tends to the order when tends to zero. Although it is always possible to set extremely close to zero in order to obtain the best rate for , one should keep in mind that the contraction rate may then be suboptimal, as discussed at the end of this section.
Theorem 3.1.
Let with , a large constant that depends on , and as in Assumption 3.5. Then, if the parameter space is embedded with the prior and under Assumptions 3.1-3.5, the posterior contracts at rate for density estimation (as defined in (2.2)) as well as for regression with fixed or random design (as defined in (2.3) and (2.4)).
An examination of shows that the contraction rate is improved as the smoothness of grows, unlike . This highlights a trade-off between the contraction rate and the growth of the design dimension: fast contraction rates imply slowly increasing dimension and conversely.
4 Subspace recovery for the density estimation problem
In this section, we propose to recover the central subspace for the density estimation problem. To avoid identifiability issues caused by the spherical support, we suppose that the ambient dimension does not depend on . Hence, we denote the ambient dimension by with and the central subspace by where corresponds to in Assumption 3.1. This assumption is justified by the following considerations. If the ambient dimension grows with , the Hellinger metric relative to the Lebesgue measure on tends to give more importance to the center of the support, as tends to infinity. For example, consider a parameter in dimension two that is everywhere constant except in a small region on the border of , and such that the central subspace is of dimension two. The importance of this small region in the support , in the Hellinger sense, decreases exponentially with , much faster than the estimation of the true parameter in Theorem 3.1. Consequently, for sufficiently large , a constant function together with some one-dimensional subspace characterize a density that is in the Hellinger ball of radius centered on ; so we have no hope of recovering the true subspace by simply using the posterior consistency.
As a consequence, the true density , the parameter , and the central subspace do not depend on anymore. The true density is characterized by via the transformation (2.1). Moreover, is supposed to depend only on a -dimensional subspace of and can be viewed as the sparse continuation of an underlying function . In the same way, can be viewed as the sparse continuation of a function over , except that the renormalisation of depends on . Note that is not necessarily a density on so the notation will designate from now on the -distance between the square roots of and even if and are not densities.
Let us introduce some additional notation. Let be the set of all optimal isometries:
and, for , let be the set of isometries that send the subspace to a subspace containing :
Recovering means the following: for some rate ,
where is the operator norm with respect to the Euclidean distance in . However, under the assumptions of Theorem 3.1, the only information we have on the true subspace is posterior consistency to the density with rate . This will only allow us to recover a subspace of containing . A crucial assumption to eliminate the subspaces of dimension smaller than and the subspaces that do not contain is to suppose that is non-constant in all directions. More precisely, the default of constancy for each direction has to be detectable in Hellinger distance, as formalized in the following assumption.
Assumption 4.1.
There exist a constant and a window size such that for every vector line in (directed by a unit vector ), there exists such that for all , for all , and for every constant ,
where .
Assumption 4.1 seems a bit technical at first glance but it can be shown that it is satisfied as soon as is differentiable over with points such that the gradients at these points are linearly independent.
Theorem 4.1.
Theorem 4.1 ensures that the central subspace can be recovered as soon as the intrinsic dimension is known. Subspaces of dimension smaller than are also eliminated but the theorem does not reject those of dimension greater than . We conjecture that the prior mass on those spaces tends to vanish, for reasons similar to those exposed in [JT21]. Indeed, introducing a penalization on larger dimensions if necessary, it should be possible to show that the posterior cannot contract as fast as the minimax rate for if a subspace of greater dimension is chosen. As discussed in the introduction of this section, the estimation of the central subspace is made under the assumption that is fixed with mainly because of the identifiability issue caused by the ellipsoid support. We believe that this restriction can be relaxed by extending the support to the full ambient space , as in [JT21]. In this case, the square over which we integrate the Hellinger distance in the proof of Theorem 4.1 can be taken as the product space of a square of side in directions and times . Then, the integrated error should no longer depend on and consistency to the true subspace should follow. Further investigations in this direction might be worthwhile.
Acknowledgments
We acknowledge the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-21-CE40-0007 (GAP Project).
5 Appendix
5.1 Proof of Theorem 3.1
As a reminder, we first exhibit some facts about the convergence rate:
(5.1)
So is a large multiple of the minimax rate times a logarithmic factor. The constant is chosen to be arbitrarily large in order to absorb undesired terms in the proof.
The proof of Theorem 3.1 is based on Theorem 2.1 in [GGV00]. The general outline is a combination of the arguments of [Tok11] (itself derived from [VV09]) and [JT21]. Concretely, it suffices to show that there exists a sequence of sets (referred to as a sieve), such that the following three conditions hold for all sufficiently large :
(5.2)
(5.3)
(5.4)
This is the purpose of the next sections. The first condition (5.2), referred to as the prior mass condition, ensures that the prior puts a sufficient amount of mass around the true parameter. Condition (5.3), called the sieve condition, forces the sieve to capture most of the mass of the prior, while the entropy condition (5.4) constrains its size. These three conditions map one to one with the conditions of Theorem 2.1 in [GGV00], as shown in [VV08] for density estimation and regression with fixed design. For regression with random design, we recall in the next section some arguments scattered across the Bayesian literature.
5.1.1 Regression with random design
Here, we show that Theorem 2.1 in [GGV00] can be applied in the regression with random design setting, as soon as Conditions (5.2), (5.3), and (5.4) are satisfied. The procedure consists in showing that the posterior contracts to the density of a pair and then to retrieve from this density. For a function , we define , where is the density of a univariate Gaussian variable with mean and standard deviation and is the density of one covariate. Then, the density of one observation under regression with random design is . We first prove that Condition (5.2) implies Condition (2.4) in [GGV00] with . Detailed calculations can be found in [FS23], Section A.2. We have to compare the uniform neighborhood of with the Kullback-Leibler neighborhood
where is the Kullback-Leibler divergence between and and is the Kullback-Leibler variation. Using the following identities from [FS23]
we deduce that, if with , then
where . Consequently, according to (5.2), and multiplying by if necessary, we have
One can remark that for Condition (2.4) in [GGV00] to be satisfied, we must have which is the case as soon as .
Condition (2.3) in [GGV00] is immediately deduced from (5.3). For Condition (2.2), we use the inequality
(5.5)
see again [FS23] for details. Then, assuming that , we have, according to (5.4) and multiplying by 3 if necessary,
where the first inequality comes from the definition of the packing number and the covering number and where the third inequality follows from (5.5). Theorem 2.1 in [GGV00] then ensures posterior consistency to at rate in Hellinger distance. Now, because we also have the converse inequality
and that when nothing is assumed on and with , we obtain posterior contraction to at rate in the -distance:
where .
Remark 5.1.
The restriction to for the standard deviation can be relaxed. In fact, if , then it suffices to consider Theorem 2.1 in [GGV00] with . Condition (2.4) in [GGV00] is then immediately satisfied and, for Condition (2.3), the proof of (5.3) can be adapted to replace 5 by . On the contrary, if , Condition (2.2) in [GGV00] can be satisfied by multiplying by .
5.1.2 Prior mass condition (5.2)
We verify here that . Let us introduce the following notation.
Notation.
For , we denote by the function such that , for all . Hence, .
We first reduce the problem to deterministic dimension and direction by conditioning with and integrating over :
Now, we want to bound from below the integrand on a significant subset of . We remark that if is such that , then
We show that the right-hand side is bounded from below by and then, we bound from below the measure of the set of satisfying .
From now on, we use without specification the constants of Lemmas 5.5, 5.6, and 5.7 and we fix . Let where . We suppose large enough so that . Then,
(5.6)
and, because , we have
(5.7)
According to Lemma 5.3 in [VV08a], for , we can write
where is the concentration function in (3.3). Now we want to control the concentration function using Lemmas 5.5 and 5.7. The inequality (5.6) and the previous restriction on ensure that the conditions of Lemma 5.7 are satisfied with , while (5.7) and Lemma 5.5 give
Using the expression (3.3) of the concentration function, a combination of the two lemmas gives
where the last inequality holds because and for large enough. Let us define the constant and note that there exists a constant such that for sufficiently large , . Then, there exists a constant such that
With the help of the reminder (5.1), we see that
hence . Then, by choosing such that , we can achieve
(5.8)
At this point, the problem amounts to bounding from below the measure of the set of satisfying . We denote this set by . The core function is continuous on the compact subset , so there exists a constant such that is -Hölder with Hölder constant . Then, for all ,
From now on, it is apparently sufficient to compute the measure of a ball in with radius . In fact, . However, this leads to a design dimension not larger than . To obtain of order , we have to consider a larger subset.
Notation.
Let be a linear subspace of . We denote by the set of isometries that fix :
Then, for all , we have
For , we define
Then, . Since the Haar measure is translation invariant, it is sufficient to cover with translations of to obtain a lower bound on the measure of , that is, to cover with sets where belongs to some net and then remark that .
Lemma 5.1.
We have,
Proof of Lemma 5.1.
Let . The first step consists in constructing a net such that there exist and with . Let be an orthonormal basis adapted to the direct sum . For every -tuple of orthonormal vectors , we fix an isometry such that for all . Moreover, we denote by a set of -tuples of orthonormal vectors in such that, for all -tuples of orthonormal vectors, there exists satisfying
We claim that we can take . Indeed, there exists such that
By Lemma 5.8, we can extend to an orthonormal basis of such that
(5.9)
Then, writing and taking such that for all , we have . Moreover, because and for , we can define such that
Then, we have and according to (5.9),
and,
So and the net is appropriate. Finally, by taking as in Lemma 5.10, we obtain
hence the result. ∎
Consequently, we have established that
Recall that we have the following lower bound:
In order to establish the prior mass condition, it suffices to derive the greatest design dimension for which we can reach
For large enough, a design dimension as specified in Assumption 3.5 is appropriate for a sufficiently small constant .
Remark 5.2.
The exponent in Lemma 5.1 is probably not far from optimal. Indeed, ignoring the constants, changing this exponent to with would lead to a growth rate of which, when is close to zero, is of larger order than . Since the breakpoint of some popular subspace estimators, such as SIR, is of order , it would be surprising to be able to estimate a function faster than its central subspace.
5.1.3 Sieve condition (5.3)
The second condition can be verified as in [JT21]. As in the previous section, we first treat the case of a deterministic rescaling parameter, dimension, and direction, and then integrate with respect to , , and .
We suppose that is large enough so that . We introduce the quantities for some large constant and, for , the quantity such that , for a large constant . The sieve is defined as follows:
with
where is the unit ball in the Banach space .
The nesting property of Lemma 4.7 in [VV09] remains true in the present setting, that is, for ,
Consequently, if , then
By Borell’s inequality (see [VV08a], Theorem 5.1, or [Bor75]), for every ,
where is the cumulative distribution function of the standard normal distribution. Now, because
we have
For large enough, we have and , so according to Lemma 5.7 and because , we have
for sufficiently large . So by taking a very large multiple of , we can reach . The second assertion of Lemma 4.10 in [VV09] gives which leads to the upper bound
Taking into account the random rescaling parameter , we have, for sufficiently large ,
where the last inequality holds because and are supposed to be large enough.
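The way Borell's inequality is used above can be checked numerically. The sketch below evaluates the generic right-hand side $1 - \Phi(\Phi^{-1}(e^{-\varphi_0}) + M)$ and compares it with $e^{-M^2/8}$ when $M \ge 4\sqrt{\varphi_0}$ and $\varphi_0 \ge \log 2$, which is the standard computation behind this step. The names `phi0` and `M` are ours, and the specific sieve radii and constants of the present proof are not reproduced here, so this only illustrates the mechanism.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def upper_tail(x: float) -> float:
    """P(Z > x) for a standard normal Z, computed stably through erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def borell_bound(phi0: float, M: float) -> float:
    """Generic Borell bound 1 - Phi(Phi^{-1}(exp(-phi0)) + M).

    phi0 plays the role of the concentration function at the chosen radius
    (assumed name); exp(-phi0) is the corresponding small-ball probability.
    """
    small_ball_probability = math.exp(-phi0)
    return upper_tail(_STD_NORMAL.inv_cdf(small_ball_probability) + M)


# With M = 4 * sqrt(phi0) the bound is dominated by exp(-M^2 / 8) = exp(-2 * phi0).
for phi0 in (1.0, 5.0, 20.0, 50.0):
    M = 4.0 * math.sqrt(phi0)
    print(phi0, borell_bound(phi0, M), math.exp(-M * M / 8.0))
```

The upper tail is computed with `erfc` rather than `1 - cdf`, because the latter loses all precision once the argument exceeds about 8.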
Now considering the prior on the sparsity pattern, we obtain
5.1.4 Entropy condition (5.4)
We use again the notation and quantities of the previous section. According to Lemma 5.6, for all and , the metric entropy of is bounded as:
The simple estimate then gives
(5.10)
The metric entropy of is derived as follows:
To extend these inequalities to the full sieve, we need the following lemma from [Tok11].
Lemma 5.2 (Tokdar 2011, Lemma 1).
Let , and . Then
where is the unit ball in .
By examining the representation result in (3.2) for , we see that, for all , we have . Hence, Lemma 5.2 gives
If is a net over such that for all , there exist and with , where is the minimum of when runs through , then
This clearly implies
and hence
Consequently, the -entropy of can be bounded by the cardinality of the net times the maximal -entropy of sets :
It only remains to bound the cardinality of .
Lemma 5.3.
For , there exists a net over such that
where
and such that
Proof.
Firstly, we remark that
Thus, for , we seek to construct such that there exists satisfying and . Let be an orthonormal basis adapted to the direct sum . We introduce a set of orthonormal bases of such that, for every orthonormal basis of , there exists such that
and we reuse the set of Lemma 5.1, replacing by . For all and , we fix an isometry such that , for all . By construction, there exist and such that
Then we choose . Using Lemma 5.8, we extend to an orthonormal basis over such that and we define , for . Now we choose such that
This leads to , for all , hence . We can thus define the net as the set of all isometries for and . According to Lemma 5.10, this yields the upper bound
Observing that the upper bound in (5.10) does not hide a constant depending on , we can write
Then, the lemma yields the following inequality:
where , which, after taking logarithms and for sufficiently large , gives the desired result.
5.2 Proof of Theorem 4.1
5.2.1 Case
The idea of the proof is to show that the non-constancy of in all directions results in a significant difference (in the Hellinger sense) between the true density and any density that is more parsimonious than . If this difference can be bounded from below, then the set of over-parsimonious densities is expected to have almost-null posterior mass as soon as the contraction rate falls below the lower bound.
Let and let be a density that satisfies the model with parameters and . Then, is constant on , for any . Moreover, the intersection between and is non-null so is constant in at least one direction, say . We will use Assumption 4.1 and integrate the Hellinger distance over a small square inside the region where is non-constant in . As usual, we denote .
Let us introduce the operator
In particular, we have . We use the notation of Assumption 4.1 with instead of .
Let be an orthonormal basis adapted to the direct sum and let be a solid square with edges parallel to this basis, of size and centered on . Then, and the inequality of Assumption 4.1 is valid when . Considering the basis previously introduced, integrating over amounts to integrating with respect to each variable. To simplify, we bundle these variables into three groups: a variable parallel to , a variable parallel to and a variable parallel to . In this coordinate system, we can write and we have . Then
where is the inverse image via of the range of the integral in . Hence
Then because , there exists such that
(5.11)
Now we can use Assumption 4.1 and bound from below the Hellinger distance in the last integral, which gives
Finally, as soon as the contraction rate achieves .
5.2.2 Case
Case , with and .
To simplify the presentation, we first restrict ourselves to the case and . Assumption 4.1 specializes as follows: for all , there exists such that, for all and all constant ,
We use the fact that the non-constancy of over induces a non-constancy over any one-dimensional space not parallel to . It is then possible to set a lower bound on the Hellinger distance between and any density that is constant on a space not parallel to . For , we denote and . If is not in , then there exists such that for all , we have . Then, the intersections of and with the unit circle are separated by at least .
With this setting, any square of size centered at is included in . Let be a solid square of size , parallel to the line and centered on . The line intersects the border of at two points (see Figure 1), and using elementary plane geometry, we can show that the orthogonal projections of these points over are at a distance from . Similarly, the line intersects the border of at two points whose orthogonal projections on are at a distance from .
Let be an orthogonal basis of adapted to the decomposition and such that and . In this system of coordinates, can be written and for all , we have
We will also use the fact that . Then, for every density that is constant in the direction , we have
Finally, as soon as .
Case , with arbitrary .
Given a non-optimal isometry , we need to quantify how far from the inverse image of the subspace via is. This result, elementary when , is stated for arbitrary in the following lemma. A proof is given in Section 5.3.
Lemma 5.4.
Let . If for all , we have , , then there exists , , such that the distance between and is at least , where .
Now we work under the assumptions of Lemma 5.4. Let be the linear span of and its orthogonal projection on (or any vector of if the orthogonal projection is zero). Then has a non-zero intersection with . Let be this one-dimensional intersection.
Let be a solid hypercube centered on , with size , and aligned with an orthogonal basis adapted to the direct sum . With the restrictions on , is included in .
We will bound from below the quantity by using the preceding two-dimensional case on slices of . For , the plane contains one element parallel to and one element parallel to , so the situation is analogous to the previous case, replacing by (Figure 2). With all this in mind, for every density that is constant in the direction , one has
which is sufficient to conclude.
The case can be proven in a similar way.
5.3 Lemmas
The next three lemmas are related to Lemmas 4.3, 4.5, and 4.6 in [VV09], hence their proofs can be omitted.
Lemma 5.5.
Let and . If , then, for all and , there exist constants and that depend only on such that
Lemma 5.6.
Let , , and . Then, there exists a constant that depends only on such that, for ,
Lemma 5.7.
Let , and . Then, for , there exist constants and that depend only on and such that, for all and ,
Lemma 5.8.
Let and let be an orthonormal basis of . For , let be a collection of orthonormal vectors in such that
Then we can complete this collection to obtain an orthonormal basis of satisfying
Proof of Lemma 5.8.
We denote by the subspace . Let us determine the distance between a vector and its orthogonal projection on , for . By the Cauchy–Schwarz inequality, we have
for all . Then
(5.12)
Thus the problem reduces to finding a family of orthonormal vectors in with elements as close as possible to the vectors , for . This is related to what is known as the Procrustes problem. We denote by the matrix and we use Theorem 4.1 stated in [Hig89]:
Theorem 5.9 ([Hig89]).
If admits a polar decomposition , and if has orthonormal columns, then
Let us show that the columns of can be chosen in . A singular value decomposition of can be written , where has orthonormal columns, , and is diagonal. Therefore, . Taking and , we have the polar decomposition where has orthonormal columns. Because , it is possible to choose with columns in , whence the desired result. Now, taking , we have, for every unit vector ,
Moreover, using that for all , we finally have
thus . According to Theorem 5.9, the last inequality is also true if we replace by . Because the columns of are in , the family is orthonormal and moreover satisfies (5.12) by the triangle inequality. ∎
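The construction used in this proof, namely building the polar factor from a singular value decomposition and invoking its nearness property, can be illustrated numerically. The sketch below is not part of the proof: the matrix sizes, the perturbation level and all variable names are our own assumptions, and the comparison against random competitors only samples the inequality of Theorem 5.9 rather than proving it.

```python
import numpy as np


def polar_orthonormal_factor(A):
    """Polar decomposition A = U @ H computed from a thin SVD.

    U has orthonormal columns and H is symmetric positive semidefinite.
    """
    P, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = P @ diag(s) @ Vt
    U = P @ Vt                                        # orthonormal polar factor
    H = Vt.T @ (s[:, None] * Vt)                      # H = V @ diag(s) @ V^T
    return U, H


def random_orthonormal_columns(m, k, rng):
    """Random m x k matrix with orthonormal columns (QR of a Gaussian matrix)."""
    Q, _ = np.linalg.qr(rng.standard_normal((m, k)))
    return Q


rng = np.random.default_rng(0)
m, k = 8, 3                      # illustrative sizes (assumption)
E = np.eye(m)[:, :k]             # first k canonical basis vectors
A = E + 0.05 * rng.standard_normal((m, k))  # nearly orthonormal columns
U, H = polar_orthonormal_factor(A)

# Compare the distance from A to U with the distance to other orthonormal frames,
# in two unitarily invariant norms (Frobenius and spectral).
competitors = [E] + [random_orthonormal_columns(m, k, rng) for _ in range(200)]
for norm_ord in ("fro", 2):
    best = np.linalg.norm(A - U, norm_ord)
    smallest_gap = min(np.linalg.norm(A - Q, norm_ord) - best for Q in competitors)
    print(norm_ord, best, smallest_gap)  # smallest_gap should be nonnegative
```

Since `U = P @ Vt`, the columns of `U` lie in the column space of `A`, which mirrors the step of the proof where the columns of the polar factor are chosen in the prescribed subspace.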
Notation.
Let with and let be the set of all -tuples of orthonormal vectors in .
Lemma 5.10.
Let with and . Then there exists a set such that for all , there exists such that
Proof of Lemma 5.10.
Let us construct . Let be a set of balls in with radius which cover and such that . We denote by the set of -tuples of balls such that contains at least one element of . Then, for each , there exists such that . For each , choose one particular -tuple such that and let be the set of these -tuples when runs through . It is clear that satisfies the first condition of the lemma. Moreover,
Let us estimate the last quantity. We use the inequality
where is the maximum number of disjoint balls with radius and with center in . Recall that
Consider the measure of the hyperspherical cap defined by the intersection of and a ball with center in and with radius . The colatitude angle of the cap is and, according to [Li11],
Since ,
and, using the facts that , , and , we have
The ratio of two Gamma functions can be bounded as follows
for (see [Wat59] and [LQ12], Section 2.3). Choosing , we obtain
hence the result. ∎
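Two ingredients of this proof lend themselves to a quick numerical check: the relative measure of a hyperspherical cap and a bound on a ratio of Gamma functions. The sketch below makes several assumptions that are ours: the cap fraction is computed from the integral representation $\int_0^\phi \sin^{n-2}\theta\,d\theta \big/ \int_0^\pi \sin^{n-2}\theta\,d\theta$ rather than from the closed formula of [Li11], the cap cut out by a ball of radius $r$ centered on the unit sphere is taken to have colatitude angle $2\arcsin(r/2)$, and the Gamma bound shown is Watson's classical inequality $\sqrt{x + 1/4} < \Gamma(x+1)/\Gamma(x+1/2) \le \sqrt{x + 1/\pi}$ for $x \ge 0$, which may differ in form from the inequality used above.

```python
import math


def cap_angle_from_ball_radius(r: float) -> float:
    """Colatitude angle of the cap cut out of the unit sphere by a ball of radius r
    centered on the sphere (chord length r = 2 * sin(angle / 2))."""
    return 2.0 * math.asin(r / 2.0)


def sphere_cap_fraction(n: int, phi: float, steps: int = 2000) -> float:
    """Fraction of the unit sphere of R^n lying within colatitude angle phi of a pole.

    The numerator int_0^phi sin^(n-2) is computed with composite Simpson's rule;
    the denominator int_0^pi sin^(n-2) equals sqrt(pi) * Gamma((n-1)/2) / Gamma(n/2).
    """
    if n < 2 or not 0.0 <= phi <= math.pi or steps % 2:
        raise ValueError("need n >= 2, 0 <= phi <= pi and an even number of steps")
    h = phi / steps

    def f(t: float) -> float:
        return math.sin(t) ** (n - 2)

    total = f(0.0) + f(phi)
    total += 4.0 * sum(f((2 * i - 1) * h) for i in range(1, steps // 2 + 1))
    total += 2.0 * sum(f(2 * i * h) for i in range(1, steps // 2))
    numerator = total * h / 3.0
    denominator = math.sqrt(math.pi) * math.exp(
        math.lgamma((n - 1) / 2.0) - math.lgamma(n / 2.0)
    )
    return numerator / denominator


def watson_bounds(x: float):
    """Return (lower, ratio, upper) for ratio = Gamma(x + 1) / Gamma(x + 1/2)."""
    ratio = math.exp(math.lgamma(x + 1.0) - math.lgamma(x + 0.5))
    return math.sqrt(x + 0.25), ratio, math.sqrt(x + 1.0 / math.pi)


# Disjoint caps of equal size cannot number more than 1 / (cap fraction).
n, r = 10, 0.5  # illustrative values (assumption)
fraction = sphere_cap_fraction(n, cap_angle_from_ball_radius(r))
print(fraction, math.floor(1.0 / fraction))
print(watson_bounds(10.0))
```

The volumetric packing bound printed at the end is the elementary counterpart of the counting step in the proof; the Gamma ratio enters through the normalizing constant in the denominator.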
Proof of Lemma 5.4.
Suppose that, for all , we have and, for all , . Let us show that for all vectors of the canonical basis, . We begin with the first vectors . Define an operator which maps to . Then, for , we have . Now, we reuse the arguments of the proof of Lemma 5.8, with . We can write where is a rectangular matrix with orthonormal columns in and where is symmetric. Moreover, taking , and , , we have
So, by Theorem 4.1 in [Hig89] (Theorem 5.9 in the present document), . Then, is an orthonormal basis of such that , for . Let be an isometry such that , . Then
The same reasoning applies to the remaining vectors, , by replacing by , and taking , with . The isometry is now the one that maps to for . As a result, for all , , we have
which contradicts the fact that . Finally, . ∎
Acknowledgement
We acknowledge the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-21-CE40-0007 (GAP Project).
References
- [Bir86] Lucien Birgé “On estimating a density using Hellinger distance and some other strange facts” In Probability Theory and Related Fields 71, 1986, pp. 271–291
- [Bor75] Christer Borell “The Brunn–Minkowski inequality in Gauss space” In Inventiones mathematicae 30, 1975, pp. 207–216
- [CD12] Laëtitia Comminges and Arnak S. Dalalyan “Tight conditions for consistency of variable selection in the context of high dimensionality” In The Annals of Statistics 40.5, 2012, pp. 2667–2696
- [Coo98] R. Dennis Cook “Regression graphics: Ideas for studying regressions through graphics” John Wiley & Sons, 1998
- [FS23] Gianluca Finocchio and Johannes Schmidt-Hieber “Posterior contraction for deep Gaussian process priors” In Journal of Machine Learning Research 24.66, 2023, pp. 1–49 URL: http://jmlr.org/papers/v24/21-0556.html
- [GGV00] Subhashis Ghosal, Jayanta K. Ghosh and Aad W. Van Der Vaart “Convergence rates of posterior distributions” In The Annals of Statistics 28.2, 2000, pp. 500–531
- [GN11] Evarist Giné and Richard Nickl “Rates of contraction for posterior distributions in $L^r$-metrics, $1 \le r \le \infty$” In The Annals of Statistics 39.6, 2011, pp. 2883–2911
- [Hig89] Nicholas J. Higham “Matrix nearness problems and applications” In Applications of Matrix Theory Oxford University Press, 1989, pp. 1–27
- [JT21] Sheng Jiang and Surya T. Tokdar “Variable selection consistency of Gaussian process regression” In The Annals of Statistics 49.5, 2021, pp. 2491–2505
- [Li11] Shengqiao Li “Concise formulas for the area and volume of a hyperspherical cap” In Asian Journal of Mathematics and Statistics 4.1 ANSInet, 2011, pp. 66–70
- [Li91] Ker-Chau Li “Sliced inverse regression for dimension reduction” In Journal of the American Statistical Association 86.414 Taylor & Francis, 1991, pp. 316–327
- [Lin+21] Qian Lin, Xinran Li, Dongming Huang and Jun S. Liu “On the optimality of sliced inverse regression in high dimensions” In The Annals of Statistics 49.1 Institute of Mathematical Statistics, 2021, pp. 1–20
- [LQ12] Qiu-Ming Luo and Feng Qi “Bounds for the ratio of two gamma functions—From Wendel’s and related inequalities to logarithmically completely monotonic functions” In Banach Journal of Mathematical Analysis 6.2 Tusi Mathematical Research Group, 2012, pp. 132–158
- [LZL18] Qian Lin, Zhigen Zhao and Jun S. Liu “On consistency and sparsity for sliced inverse regression in high dimensions” In The Annals of Statistics 46.2, 2018, pp. 580–610
- [LZL19] Qian Lin, Zhigen Zhao and Jun S. Liu “Sparse sliced inverse regression via lasso” In Journal of the American Statistical Association 114.528 Taylor & Francis, 2019, pp. 1726–1739
- [STG13] Weining Shen, Surya T. Tokdar and Subhashis Ghosal “Adaptive Bayesian multivariate density estimation with Dirichlet mixtures” In Biometrika 100.3 Oxford University Press, 2013, pp. 623–640
- [Sto82] Charles J. Stone “Optimal global rates of convergence for nonparametric regression” In The Annals of Statistics 10.4 Institute of Mathematical Statistics, 1982, pp. 1040–1053
- [Tok11] Surya T. Tokdar “Dimension adaptability of Gaussian process models with variable selection and projection” Preprint. Available at arXiv:1112.0716, 2011
- [TSY20] Kai Tan, Lei Shi and Zhou Yu “Sparse SIR: Optimal rates and adaptive estimation” In The Annals of Statistics 48.1 Institute of Mathematical Statistics, 2020, pp. 64–85
- [TZG10] Surya T. Tokdar, Yu M. Zhu and Jayanta K. Ghosh “Bayesian density regression with logistic Gaussian process and subspace projection” In Bayesian Analysis 5.2 Institute of Mathematical Statistics, 2010, pp. 319
- [Ver12] Nicolas Verzelen “Minimax risks for sparse regressions: Ultra-high dimensional phenomenons” In Electronic Journal of Statistics 6 Institute of Mathematical Statistics and Bernoulli Society, 2012, pp. 38–90
- [VV08] Aad W. Van Der Vaart and J. Harry Van Zanten “Rates of contraction of posterior distributions based on Gaussian process priors” In The Annals of Statistics 36.3, 2008, pp. 1435–1463
- [VV08a] Aad W. Van Der Vaart and J. Harry Van Zanten “Reproducing kernel Hilbert spaces of Gaussian priors” In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh. Inst. Math. Stat. (IMS) Collect. 3, 2008, pp. 200–222
- [VV09] Aad W. Van Der Vaart and J. Harry Van Zanten “Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth” In The Annals of Statistics 37.5B, 2009, pp. 2655–2675
- [Wai09] Martin J. Wainwright “Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso)” In IEEE Transactions on Information Theory 55.5 IEEE, 2009, pp. 2183–2202
- [Wat59] G.N. Watson “A note on gamma functions” In Edinburgh Mathematical Notes 42 Cambridge University Press, 1959, pp. 7–9
- [YD16] Yun Yang and David B. Dunson “Bayesian manifold regression” In The Annals of Statistics 44.2, 2016, pp. 876–905
- [YT15] Yun Yang and Surya T. Tokdar “Minimax-optimal nonparametric regression in high dimensions” In The Annals of Statistics 43.2, 2015, pp. 652–674
- [ZMP06] Lixing Zhu, Baiqi Miao and Heng Peng “On sliced inverse regression with high-dimensional covariates” In Journal of the American Statistical Association 101.474 Taylor & Francis, 2006, pp. 630–643
- [ZMZ22] Jing Zeng, Qing Mai and Xin Zhang “Subspace estimation with automatic dimension and variable selection in sufficient dimension reduction” In Journal of the American Statistical Association Taylor & Francis, 2022, pp. 1–13