Contraction rates and projection subspace estimation with Gaussian process priors in high dimension

Elie Odin, Institut de Mathématiques de Toulouse; UMR5219. Université de Toulouse; CNRS. UT3, F-31062 Toulouse, France.
François Bachoc, Institut de Mathématiques de Toulouse; UMR5219. Université de Toulouse; CNRS. UT3, F-31062 Toulouse, France.
Agnès Lagnoux, Institut de Mathématiques de Toulouse; UMR5219. Université de Toulouse; CNRS. UT2J, F-31058 Toulouse, France.

(02 2023)

Abstract

This work explores the dimension reduction problem for Bayesian nonparametric regression and density estimation. More precisely, we are interested in estimating a functional parameter $f$ over the unit ball in $\mathbb{R}^{d}$ , which depends only on a $d_{0}$ -dimensional subspace of $\mathbb{R}^{d}$ , with $d_{0}<d$ . It is well-known that rescaled Gaussian process priors over the function space achieve smoothness adaptation and posterior contraction with near minimax-optimal rates. Moreover, hierarchical extensions of this approach, equipped with subspace projection, can also adapt to the intrinsic dimension $d_{0}$ ([Tok11]). When the ambient dimension $d$ does not vary with $n$ , the minimax rate remains of the order $n^{-\beta/(2\beta+d_{0})}$ . However, this is up to multiplicative constants that can become prohibitively large when $d$ grows. The dependences between the contraction rate and the ambient dimension have not been fully explored yet and this work provides a first insight: we let the dimension $d$ grow with $n$ and, by combining the arguments of [Tok11] and [JT21], we derive a growth rate for $d$ that still leads to posterior consistency with minimax rate. The optimality of this growth rate is then discussed. Additionally, we provide a set of assumptions under which consistent estimation of $f$ leads to a correct estimation of the subspace projection, assuming that $d_{0}$ is known.

1 Introduction

With the ever-increasing availability of high-dimensional data in various fields of science and technology, dimension reduction methods have become more and more important, especially in non-parametric estimation, to counteract the curse of dimensionality. Suppose we want to estimate an unknown function $f:\mathbb{R}^{d}\to\mathbb{R}$ that depends only on a $d_{0}$-dimensional linear subspace $\mathcal{S}\subset\mathbb{R}^{d}$, with $d_{0}\ll d$. For regression and density estimation problems, minimax rates without sparsity assumptions are both of the order $n^{-\beta/(2\beta+d)}$ where $\beta$ is the smoothness of $f$ and $n$ is the sample size ([Bir86], [Sto82]). The aim of dimension reduction is to convert this $d$-dimensional problem into a $d_{0}$-dimensional one in order to obtain the much more attractive rate $n^{-\beta/(2\beta+d_{0})}$.

As the above rates are given up to a multiplicative constant, which may itself depend on the ambient dimension $d$ , another problem arises: determining if the number of available data is sufficient in regard to the problem’s dimension. This is generally done by allowing the ambient dimension $d$ to grow with $n$ , letting $d=d_{n}$ , and then observing which growth rate still permits minimax estimation at rate $n^{-\beta/(2\beta+d_{0})}$ . Note that the subspace $\mathcal{S}$ also depends on $n$ , thus we write $\mathcal{S}=\mathcal{S}_{n}$ .

For fixed intrinsic dimension $d_{0}$ , we distinguish two cases, whether the subspace $\mathcal{S}_{n}$ is parallel to the axes or not. In the first case (when $\mathcal{S}_{n}$ is parallel to the axes), the dimension-reduction problem is referred to as variable selection. In this context, it is known that for non-parametric regression, the sparsity pattern can be consistently recovered when $d_{n}$ grows exponentially with the sample size ([CD12], [YT15]). More precisely, [CD12] show that there exist two constants $c_{*}<c^{*}$ such that

• if $\frac{\log d_{n}}{n}<c_{*}$, there exists a consistent estimator of the sparsity pattern,
• if $\frac{\log d_{n}}{n}>c^{*}$, no such estimator exists.

This phase transition phenomenon seems to be similar in the linear regression framework (see [Ver12] and [Wai09]).

In the second case (when nothing is assumed on $\mathcal{S}_{n}$ ), the estimation of a minimal subspace which contains all the information on $f$ is sometimes referred to as sufficient dimension reduction ([Coo98]). Among the various methods proposed for estimating $\mathcal{S}_{n}$ , sliced inverse regression (SIR) ([Li91]) is one of the most studied. The first article including the framework of growing ambient dimension $d_{n}$ shows the consistency of SIR only under $d_{n}=O(n^{1/2})$ ([ZMP06]). Later, [LZL18] show that the phase transition phenomenon occurs at a growth rate $d_{n}$ in $o(n)$ . In other words, SIR-based estimators are consistent only if $d_{n}/n\mathop{\longrightarrow}\limits_{n\to+\infty}0$ and this growth rate appears to be optimal ([Lin+21]).

The difference between the growth rates encountered in variable selection and in sufficient dimension reduction has recently led to the emergence of methods combining both approaches. If $f$ depends on a $d_{0}$-dimensional subspace $\mathcal{S}_{n}$ which can be described by linear combinations of only a small number of variables, then we can perform both variable selection and sufficient dimension reduction over the selected variables. This method is studied for example in [Lin+21], [LZL19], [TSY20], and [ZMZ22] and allows a return to the exponential growth of the dimension $d_{n}$.

The aim of this article is to perform both function and subspace estimation in the case where no hypotheses are made on $\mathcal{S}_{n}$ and to derive the maximum dimension growth rate. Our analysis is done in the nonparametric Bayesian framework introduced by [GGV00]. Among the advantages of this approach, the use of very versatile priors, such as Gaussian processes [VV08], makes it possible to adapt to both smoothness and dimension at near minimax rates ([VV09], [TZG10], [JT21]) with a single Bayesian procedure, and avoids the complications associated with kernel methods (see for example the introduction of [STG13]).

The work of Tokdar, Zhu, and Ghosh [TZG10] is one of the first to include a hierarchical prior with a parameter on the subspace. They use a uniform prior on the Grassmannian of dimension $d_{0}$ and a logistic Gaussian process prior for the conditional density function. The authors are able to derive posterior consistency for both the conditional density function and the subspace but they do not provide contraction rates. Near minimax contraction rates are then derived in [Tok11] by extending the framework introduced by [VV09]. Finally, [JT21] show that for variable selection, the estimation of the regression function and that of the sparsity pattern can be realized simultaneously at near minimax rates even with dimension $d_{n}$ growing exponentially with the sample size. The growth rate is linked to the smoothness $\beta$ of $f$ via $\log(d_{n})=O(n^{d_{0}/(2\beta+d_{0})})$.

The paper is organized as follows. In Section 2, we introduce a hierarchical Gaussian process-based prior for both regression and density estimation models. This prior consists of a dimension parameter for $d_{0}$, an invariant prior over linear $d_{0}$-dimensional subspaces of $\mathbb{R}^{d_{n}}$, a $d_{0}$-dimensional Gaussian process, and a rescaling parameter to ensure smoothness adaptability. Our first result (Theorem 3.1 in Section 3) shows that, for the estimation problem of $f$, near minimax contraction rates can be achieved for dimensions $d_{n}$ growing no faster than $n^{d_{0}/(2\beta+d_{0})}$ up to a logarithmic factor. Interestingly, this is the growth rate mentioned above for variable selection, $\log(d_{n})=O(n^{d_{0}/(2\beta+d_{0})})$, with the logarithm on $d_{n}$ removed. We are not able to prove the optimality of this result but some clues are given below (see Remark 5.2); notably, this growth rate is equivalent to $n$ when $\beta\to 0$, which is known to be the breakpoint of the consistency of the SIR estimator. In Section 4, we show that for fixed ambient dimension $d$, the hierarchical Bayes procedure contracts to a subspace that contains $\mathcal{S}$ and we conjecture that this subspace is exactly $\mathcal{S}$. Our estimation result of $f$ combines the standard arguments used in [Tok11] and [JT21], which are based on [VV09]. To prove the contraction around the central subspace $\mathcal{S}$, we show that an error on the estimation of $\mathcal{S}$ leads to an error on the estimation of $f$, which contradicts the previously established minimax estimation of $f$. The proofs of the main results (Theorems 3.1 and 4.1) are postponed to Appendices 5.1 and 5.2 while Appendix 5.3 is dedicated to useful lemmas.

2 Problem formulation

2.1 Notation and definitions

This article relies on a substantial amount of technical notation, which we collect in this section. We begin with the definition of standard functional spaces. Let $K$ be a bounded convex subset of $\mathbb{R}^{d}$, with $d\in\mathds{N}^{*}$. For $\alpha>0$, write $\alpha=k+r$ with $k$ a nonnegative integer and $r\in(0,1]$. The Hölder space $\mathfrak{C}^{\alpha}(K)$ is the space of all functions $f:K\to\mathbb{R}$ that are $k$-times differentiable and whose partial derivatives of order $(k_{1},\ldots,k_{d})$, with $k_{1},\ldots,k_{d}$ nonnegative integers such that $k_{1}+\cdots+k_{d}=k$, are Lipschitz functions of order $r$, that is, there exists a constant $D$ such that

\left|\frac{\partial^{k}}{\partial_{1}^{k_{1}}\cdots\partial_{d}^{k_{d}}}(f(x)-f(y))\right|\ \leq\ D\left\|{x-y}\right\|^{r},

for all pairs $(x,y)\in K^{2}$ and where $\left\|{\cdot}\right\|$ is the Euclidean norm.

We use the following asymptotic notation: if $f$ and $g$ are two real functions over an arbitrary set $S$, then we write $f\lesssim g$ if there exists a constant $c$ such that $|f(s)|\leq c\cdot|g(s)|$ for all $s\in S$. The notation $\gtrsim$ is defined in the same way and we write $f\asymp g$ when both $f\lesssim g$ and $f\gtrsim g$ hold.

To model the central subspace $\mathcal{S}$, we will use isometries instead of the Grassmannian. For $d\in\mathds{N}^{*}$, we denote by $\mathcal{O}_{d}$ the space of linear isometries over $\mathbb{R}^{d}$. In addition, the introduction of canonical subspaces and of “component filters” notation will be very convenient when dealing with the sparsity. For $x\in\mathbb{R}^{d}$ and $\mathbb{v}\in\{0,1\}^{d}$, we denote by $|\mathbb{v}|$ the number of ones in $\mathbb{v}$, by $x_{\mathbb{v}}:=(x_{j}:v_{j}=1,\ 1\leq j\leq d)\in\mathbb{R}^{|\mathbb{v}|}$ the sub-vector with components selected according to $\mathbb{v}$, and for $y\in\mathbb{R}^{|\mathbb{v}|}$, by $y^{\mathbb{v}}:=(\tilde{y}_{j})_{1\leq j\leq d}$ the vector in $\mathbb{R}^{d}$ with $\tilde{y}_{j}=0$ if $v_{j}=0$ and $\tilde{y}_{j}=y_{i}$ if $v_{j}$ is the $i$-th one in $\mathbb{v}$. Moreover, for any integer $b\in\llbracket 1,d\rrbracket$, we denote by $\mathbb{b}$ the vector $\sum_{i=1}^{b}e_{i}$, where $\{e_{i}:1\leq i\leq d\}$ is the canonical basis of $\mathbb{R}^{d}$. The dimension $d$ of the ambient space is implicit in this notation. Finally, for $\mathbb{v}\in\{0,1\}^{d}$, we denote by $E_{\mathbb{v}}$ the linear span of $\{e_{i}:v_{i}=1\}$ and by $E_{1-\mathbb{v}}$ the linear span of $\{e_{i}:v_{i}=0\}$. Clearly, $E_{1-\mathbb{v}}$ is the orthogonal complement of $E_{\mathbb{v}}$.
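The component-filter notation can be illustrated with a short numerical sketch. The helper names `select`, `embed`, and `first_ones` are hypothetical and are not part of the paper; they only mirror the definitions of $x_{\mathbb{v}}$, $y^{\mathbb{v}}$, and $\mathbb{b}$ given above.

```python
import numpy as np

def select(x, v):
    """x_v: sub-vector of x keeping the coordinates j with v_j = 1."""
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=bool)
    return x[v]

def embed(y, v):
    """y^v: vector of R^d equal to y on the ones of v and to 0 elsewhere."""
    v = np.asarray(v, dtype=bool)
    out = np.zeros(v.size)
    out[v] = np.asarray(y, dtype=float)
    return out

def first_ones(b, d):
    """The filter e_1 + ... + e_b in R^d."""
    return (np.arange(d) < b).astype(int)

v = np.array([1, 0, 1, 0])
x = np.array([3.0, 5.0, 7.0, 9.0])
x_v = select(x, v)       # components of x selected by v
x_back = embed(x_v, v)   # orthogonal projection of x onto E_v
```

Composing the two maps, $(x_{\mathbb{v}})^{\mathbb{v}}$ is the orthogonal projection of $x$ onto $E_{\mathbb{v}}$, so the residual `x - x_back` lies in $E_{1-\mathbb{v}}$.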

The proof of Theorem 3.1 involves measuring the complexity of the space where the prior puts its mass. This measure is carried out via metric entropy. Given a subset $B$ of a metric space $(E,d)$ and a radius $\varepsilon>0$ , we can define the following numbers:

• the $\varepsilon$-packing number $D(\varepsilon,B,d)$ is the maximum number of points in $B$ such that the distance between every pair is at least $\varepsilon$,
• the $\varepsilon$-covering number $N(\varepsilon,B,d)$ is the minimum number of balls of radius $\varepsilon$ needed to cover $B$.

The logarithms of the packing and the covering number are called the entropy and the metric entropy respectively.
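For a finite set of points, both numbers can be bracketed by a simple greedy construction. The sketch below is only an illustration under our own assumptions (Euclidean distance, a random finite cloud standing in for $B$, balls centered at points of $B$); it is not used in the proofs. A maximal $\varepsilon$-separated subset is an $\varepsilon$-packing and its points are also the centers of an $\varepsilon$-cover, so its cardinality lies between $N(\varepsilon,B,d)$ and $D(\varepsilon,B,d)$.

```python
import numpy as np

def greedy_separated_subset(points, eps):
    """Greedily keep points whose distance to all kept points is >= eps.

    The result is maximal: every input point is at distance < eps
    from at least one kept point.
    """
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) >= eps for c in centers):
            centers.append(p)
    return np.array(centers)

rng = np.random.default_rng(0)
B = rng.uniform(-1.0, 1.0, size=(500, 2))   # hypothetical finite set B
eps = 0.5
centers = greedy_separated_subset(B, eps)
# len(centers) is a lower bound on the packing number of B
# and an upper bound on its covering number.
```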

2.2 Bayesian framework for density estimation and regression

Our main result will be stated for two statistical settings: density estimation and fixed or random design regression with Gaussian error. As we will work with subspaces that are not orthogonal with the axes, the usual support $[0,1]^{d}$ for the density or the regression function will be replaced by the unit ball $\mathbb{U}_{d}:=\{x\in\mathbb{R}^{d},\left\|{x}\right\|\leq 1\}$ . For a given number of observations $n$ , the density or the regression function will be characterized by a functional parameter $f_{n}^{*}:\mathbb{U}_{d_{n}}\to\mathbb{R}$ . The ambient dimension $d_{n}$ is allowed to grow with $n$ but $f_{n}^{*}$ is supposed to depend only on a subspace $\mathcal{S}_{n}$ with fixed dimension $d_{0}$ . A prior on $d_{0}$ and on the subspace itself will be later introduced to ensure the dimension adaptability. The prior on the true parameter $f_{n}^{*}$ will consist of a projected Gaussian random variable $W_{n}$ with values in the Banach space $(\mathcal{C}(\mathbb{U}_{d_{n}}),\left\|{\cdot}\right\|_{\infty})$ . Now let us describe the two previously introduced statistical settings.

Density estimation

Suppose we observe an i.i.d. sample $X_{1},\ldots,X_{n}$ from a law $P_{n}^{*}$ over $\mathbb{U}_{d_{n}}$ , which admits a continuous density $p_{n}^{*}$ relative to the Lebesgue measure on $\mathbb{R}^{d_{n}}$ . The prior $W_{n}$ puts its mass on a space that is far too large compared to the space of continuous densities. So to correctly retrieve $p_{n}^{*}$ , we will work with the parametrized density $p_{n,W_{n}}$ where, for $w\in\mathcal{C}(\mathbb{U}_{d_{n}})$ ,

(2.1)

p_{n,w}(x)\ :=\ \frac{e^{w(x)}}{\int_{\mathbb{U}_{d_{n}}}e^{w(y)}dy}.

Here the exponential forces the prior to charge only nonnegative functions while the renormalization ensures that $p_{n,w}$ integrates to one. The true density $p_{n}^{*}$ will then be encoded by the parameter $f_{n}^{*}\in\mathcal{C}(\mathbb{U}_{d_{n}})$ such that $p_{n}^{*}=p_{n,f_{n}^{*}}$ . In this way, all the assumptions on the true parameter $f_{n}^{*}$ can be transferred to the density $p_{n}^{*}$ . That is, $p_{n}^{*}$ is supposed to depend only on the $d_{0}$ -dimensional subspace $\mathcal{S}_{n}$ of $\mathbb{R}^{d_{n}}$ .

The natural metric between two densities $p$ and $p^{\prime}$ is the Hellinger distance defined by $h(p,p^{\prime})=\left\|{\sqrt{p}-\sqrt{p^{\prime}}}\right\|_{2}$ , where $\left\|{\cdot}\right\|_{2}$ is the $L^{2}$ -norm with respect to the Lebesgue measure. Consequently, if the parameter space is embedded with a prior $\Pi_{n}$ , we will say that the posterior contracts to $p_{n}^{*}$ at rate $(\varepsilon_{n})_{n\in\mathds{N}}$ if, for any sufficiently large constant $M$ ,

(2.2)

\operatorname{\mathds{P}}_{n}^{*}\left[\Pi_{n}\left(f\in\mathcal{C}(\mathbb{U}_{d_{n}}):h(p_{n,f},p^{*}_{n})>M\varepsilon_{n}\ |\ X_{1},\ldots,X_{n}\right)\right]\mathop{\longrightarrow}\limits_{n\to+\infty}0,

where $\operatorname{\mathds{P}}_{n}^{*}$ is the joint law of $(X_{1},\ldots,X_{n})$ .
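As a rough numerical illustration of the parametrization (2.1) and of the Hellinger distance, the following sketch works in dimension one, where $\mathbb{U}_{1}=[-1,1]$, and replaces integrals by the trapezoidal rule on a grid. The grid size and the test parameter $w(x)=\sin(3x)$ are arbitrary choices of ours.

```python
import numpy as np

grid = np.linspace(-1.0, 1.0, 2001)   # discretization of U_1 = [-1, 1]
dx = grid[1] - grid[0]

def integrate(values):
    """Trapezoidal rule on the uniform grid."""
    return dx * (values.sum() - 0.5 * (values[0] + values[-1]))

def density_from_parameter(w_values):
    """Grid values of p_w = exp(w) / int exp(w), as in (2.1)."""
    unnormalized = np.exp(w_values - w_values.max())  # shift for stability
    return unnormalized / integrate(unnormalized)

def hellinger(p, q):
    """h(p, q) = || sqrt(p) - sqrt(q) ||_2 on the grid."""
    return np.sqrt(integrate((np.sqrt(p) - np.sqrt(q)) ** 2))

w = np.sin(3.0 * grid)
p = density_from_parameter(w)
p_shifted = density_from_parameter(w + 10.0)          # same density as p
p_flat = density_from_parameter(np.zeros_like(grid))  # uniform density
```

The sketch also makes visible that $w\mapsto p_{n,w}$ is invariant under the addition of a constant to $w$: `p` and `p_shifted` coincide up to rounding.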

Regression with Gaussian error

In a regression problem, the covariates can either be predetermined for each observation (the fixed design case) or be part of the observations themselves. In the latter case, the covariates can be considered as random; this corresponds to the random design case. The notion of posterior contraction differs slightly between these two situations and some clarifications are in order.

Fixed design

In this setting, we consider a sample of $n$ real observations $Y_{1},\ldots,Y_{n}$ satisfying the model $Y_{i}=f_{n}^{*}(x_{i})+\epsilon_{i}$, where the $x_{i}\in\mathbb{U}_{d_{n}}$, for $i\in\llbracket 1,n\rrbracket$, are $n$ fixed covariates and the $\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2})$ are $n$ i.i.d. univariate Gaussian random variables with zero mean and standard deviation $\sigma$. As previously, the regression function $f_{n}^{*}:\{x_{i}:i\in\llbracket 1,n\rrbracket\}\to\mathbb{R}$ is supposed to depend only on a $d_{0}$-dimensional subspace of $\mathbb{R}^{d_{n}}$.

We will use $W_{n}$ directly as a prior for the regression function because $W_{n}$ can be viewed by restriction as a Gaussian process over the space $\mathcal{X}_{n}:=\{x_{i}:i\in\llbracket 1,n\rrbracket\}$ of design points. To quantify the posterior contraction, we introduce the design dependent semi-metric $\left\|{\cdot}\right\|_{n}$ defined as the $L^{2}(\operatorname{\mathds{P}}_{n}^{x})$ -norm for the empirical measure $\operatorname{\mathds{P}}_{n}^{x}=n^{-1}\sum_{i=1}^{n}\delta_{x_{i}}$ of the design points. If the space of regression functions over $\mathcal{X}_{n}$ is embedded with a prior $\Pi_{n}$ , we will say that the posterior contracts to $f_{n}^{*}$ at rate $(\varepsilon_{n})_{n\in\mathds{N}}$ if, for any sufficiently large constant $M$ ,

(2.3)

\operatorname{\mathds{P}}_{n}^{*}\left[\Pi_{n}\left(f\in\mathcal{C}(\mathcal{X}_{n}):\left\|{f-f_{n}^{*}}\right\|_{n}>M\varepsilon_{n}\ |\ Y_{1},\ldots,Y_{n}\right)\right]\mathop{\longrightarrow}\limits_{n\to+\infty}0,

where $\operatorname{\mathds{P}}_{n}^{*}$ is the joint law of $(Y_{1},\ldots,Y_{n})$ .

Random design

Here, we observe $n$ i.i.d. pairs $(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})$ such that $Y_{i}=f_{n}^{*}(X_{i})+\epsilon_{i}$, with i.i.d. $\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2}),\ \sigma\in[1,2]$, and where the $X_{i}$’s are random covariates over $\mathbb{U}_{d_{n}}$ independent of the $\epsilon_{i}$’s and admitting a common density $G_{n}$ that is bounded away from zero. For the sake of simplicity, the standard deviation $\sigma$ is restricted to the interval $[1,2]$ but these bounds can be relaxed, see Remark 5.1.1 for details. Again, the regression function $f_{n}^{*}:\mathbb{U}_{d_{n}}\to\mathbb{R}$ is supposed to depend only on a $d_{0}$-dimensional subspace of $\mathbb{R}^{d_{n}}$. Moreover, we use $W_{n}$ directly as a prior for the regression function. The natural metric for this problem is the $L^{2}(G_{n})$-norm denoted by $\left\|{\cdot}\right\|_{2,G_{n}}$ where $G_{n}$ is identified with the law of one covariate. This metric is not equivalent to the Hellinger metric, which is used in the proof of Theorem 3.1, unless all regression functions are uniformly bounded by a constant $Q>0$. This condition can be fulfilled by projecting the prior on the space of all functions uniformly bounded by $Q$, as proposed in [GN11], but this would force us to rewrite the proof of Theorem 3.1 only for this setting. Instead, we directly post-process the posterior to integrate this constraint as in [YD16]. Then, the formulation of posterior consistency becomes as follows. Considering a prior $\Pi_{n}$ over the regression functions, we will say that the posterior contracts to $f_{n}^{*}$ at rate $(\varepsilon_{n})_{n\in\mathds{N}}$ if, for $Q>0$ and any sufficiently large constant $M$,

(2.4)

\operatorname{\mathds{P}}_{n}^{*}\left[\Pi_{n}\left(f\in\mathcal{C}(\mathbb{U}_{d_{n}}):\bigl\|{f^{Q}-{f_{n}^{*}}^{Q}}\bigr\|_{2,G_{n}}>M\varepsilon_{n}\ |\ (X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\right)\right]\mathop{\longrightarrow}\limits_{n\to+\infty}0,

where $\operatorname{\mathds{P}}_{n}^{*}$ is the joint law of $(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})$ and where $f^{Q}:=(f\vee(-Q))\wedge Q$ is the version of $f$ truncated to $[-Q,Q]$.
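The truncation and the resulting distance can be sketched numerically as follows. This is only an illustration: the covariate law (uniform on a cube contained in $\mathbb{U}_{3}$), the two functions, and the helper names are hypothetical, and the $L^{2}(G_{n})$-norm is approximated by a Monte Carlo average over the sampled covariates.

```python
import numpy as np

def truncate(values, Q):
    """Pointwise truncation f^Q = (f v (-Q)) ^ Q."""
    return np.clip(values, -Q, Q)

def truncated_l2_distance(f, g, X, Q):
    """Monte Carlo estimate of ||f^Q - g^Q||_{2,G} from covariates X ~ G."""
    diff = truncate(f(X), Q) - truncate(g(X), Q)
    return np.sqrt(np.mean(diff ** 2))

rng = np.random.default_rng(1)
# Hypothetical design: uniform on [-0.5, 0.5]^3, which lies inside U_3.
X = rng.uniform(-0.5, 0.5, size=(10_000, 3))

f = lambda X: 5.0 * X[:, 0]           # hypothetical regression function
g = lambda X: np.zeros(X.shape[0])    # hypothetical competitor

dist_truncated = truncated_l2_distance(f, g, X, Q=1.0)
dist_untruncated = truncated_l2_distance(f, g, X, Q=1e6)
```

Since truncation is 1-Lipschitz, the truncated distance never exceeds the untruncated one.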

3 Main result for the functional parameter

In order for the true parameter $f_{n}^{*}$ to be recovered, we suppose that its restriction to the $d_{0}$ -dimensional subspace $\mathcal{S}_{n}$ does not depend on the ambient dimension $d_{n}$ .

Assumption 3.1 (Sparsity of the true parameter).

There exist $n_{0},d_{0}\in\mathds{N}$ , $f_{0}\in\mathcal{C}(\mathbb{U}_{d_{0}})$ , and a sequence of linear isometries $q_{n}^{*}\in\mathcal{O}_{d_{n}}$ such that for all $n\geq n_{0}$ , we have $d_{n}\geq d_{0}$ , and $f_{n}^{*}(x)=f_{0}\left((q_{n}^{*}(x))_{\mathbf{d_{0}}}\right)$ , for all $x\in\mathbb{U}_{d_{n}}$ .

In this way, each $f_{n}^{*}$ can be viewed as a sparse continuation in dimension $d_{n}$ of an underlying fixed function $f_{0}$ called the core function. The use of isometries instead of vector subspaces permits us to avoid the manipulation of the Grassmannian. We will use instead the more convenient orthogonal group $\mathcal{O}_{d_{n}}$ . The next property is straightforward.

Property 3.1.

For $n\geq n_{0}$ , $f_{n}^{*}$ is constant on the intersection between $\mathbb{U}_{d_{n}}$ and the affine subspaces $(q_{n}^{*})^{-1}(E_{1-\mathbf{d_{0}}})+x$ , for $x\in\mathbb{R}^{d_{n}}$ .
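Property 3.1 can be checked numerically on a toy example. In the sketch below, the core function `f0`, the dimensions, and the isometry `q_star` (taken as the orthogonal factor of a QR decomposition of a Gaussian matrix) are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, d0 = 6, 2

# Hypothetical isometry q*: orthogonal factor of a Gaussian matrix.
q_star, _ = np.linalg.qr(rng.standard_normal((d, d)))

def f0(y):
    """Hypothetical core function on U_{d0}."""
    return np.sin(y[0]) + y[1] ** 2

def f_star(x):
    """f*(x) = f0((q* x)_{d0}): only the first d0 coordinates of q* x matter."""
    return f0((q_star @ x)[:d0])

# A point of the unit ball, of norm 0.5.
x = rng.standard_normal(d)
x *= 0.5 / np.linalg.norm(x)

# A direction in (q*)^{-1}(E_{1 - d0}): pull back a vector supported
# on the last d - d0 coordinates (q*^{-1} = q*^T for an isometry).
z = np.zeros(d)
z[d0:] = rng.standard_normal(d - d0)
direction = q_star.T @ z
direction *= 0.3 / np.linalg.norm(direction)

gap = abs(f_star(x + direction) - f_star(x))   # zero up to rounding
```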

In parallel to the dimension adaptability, the present setting allows the core function $f_{0}$ to be arbitrarily smooth (in a Hölder sense) while maintaining near-minimax contraction rates.

Assumption 3.2 (Smoothness of $f_{0}$ ).

There exists $\beta>0$ such that $f_{0}\in\mathfrak{C}^{\beta}(\mathbb{U}_{d_{0}})$ .

3.1 Prior specification

Here we specify the hierarchical prior on the parameter space. The true parameter $f_{n}^{*}$ is characterized by a sparsity pattern $(d_{0},q_{n}^{*})$ , where the intrinsic dimension $d_{0}$ is the one of the relevant subspace and $q_{n}^{*}\in\mathcal{O}_{d_{n}}$ is an isometry for the orientation; its smoothness is modeled by a rescaling parameter, and the core function $f_{0}$ is modeled by a standard squared exponential Gaussian process which has infinitely smooth sample paths. Indeed, this process has proven to be fruitful in combination with a scale parameter and allows smoothness adaptation (see [VV09]).

For $n>0$ , let $W=(W(x):x\in\mathbb{R}^{d_{n}})$ be a standard squared exponential Gaussian process on $\mathbb{R}^{d_{n}}$ ; that is, a centered Gaussian process with covariance kernel

\operatorname{\mathbb{E}}[W(s)W(t)]\ =\ \exp\bigl(-\left\|{s-t}\right\|^{2}\bigr),\qquad\text{for all }s,t\in\mathbb{R}^{d_{n}},

where $\left\|{\cdot}\right\|$ is the Euclidean norm.

Let $a>0$ , $b\in\llbracket 1,d_{n}\rrbracket$ , and $q\in\mathcal{O}_{d_{n}}$ . We define $W_{x}^{a,b,q}:=W(a\operatorname{Diag}(\mathbb{b})\cdot q(x))$ and $W^{a,b,q}:=(W_{x}^{a,b,q}:x\in\mathbb{R}^{d_{n}})$ a rescaled Gaussian process with sparsity pattern $(b,q)$ , where $\operatorname{Diag}(\mathbb{b})$ is the diagonal matrix with diagonal vector $\mathbb{b}$ . Then, the process $W^{a,b,q}$ is constant on affine subspaces $q^{-1}(E_{1-\mathbb{b}})+x$ , for $x\in\mathbb{R}^{d_{n}}$ (as in Property 3.1) and if $R:=q^{-1}\operatorname{Diag}(\mathbb{b})q$ is the orthogonal projection onto $q^{-1}(E_{\mathbb{b}})$ , then $W^{a,b,q}_{x}=W^{a,b,q}_{Rx}$ , for all $x\in\mathbb{R}^{d_{n}}$ .
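The two claims of this paragraph can be verified at the level of covariances. From the definition, the covariance of $W^{a,b,q}$ at $s$ and $t$ is $\exp(-a^{2}\|\operatorname{Diag}(\mathbb{b})\,q(s-t)\|^{2})$. The following sketch, with arbitrary values of $d$, $b$, $a$ and a randomly generated isometry, checks that $R$ is an orthogonal projection and that the kernel is unchanged when $s$ is replaced by $Rs$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, b, a = 5, 2, 1.7
q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # an arbitrary isometry
D = np.diag((np.arange(d) < b).astype(float))      # Diag(b)

def kernel(s, t):
    """Covariance of W^{a,b,q} at s and t, derived from the kernel of W."""
    u = a * D @ q @ (s - t)
    return np.exp(-u @ u)

R = q.T @ D @ q   # q^{-1} = q^T for an isometry

s = rng.standard_normal(d)
t = rng.standard_normal(d)

# A direction in q^{-1}(E_{1-b}): pull back the last canonical vector.
e_last = np.zeros(d)
e_last[-1] = 1.0
shift = q.T @ e_last
```

A covariance equal to one between $s$ and $s+\texttt{shift}$, for a process of unit variance, is consistent with the process being constant along that direction.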

To work properly with $W^{a,b,q}$ , we have to verify that its law identifies with the law of a $b$ -dimensional standard squared exponential process. To do so, define

\phi:\ \mathbb{R}^{b}\ \to\ q^{-1}(E_{\mathbb{b}}),\qquad x\ \mapsto\ \frac{1}{a}q^{-1}(x^{\mathbb{b}}),

a bijection with inverse $\phi^{-1}(t)=a(qt)_{\mathbb{b}}$ for $t\in q^{-1}(E_{\mathbb{b}})$ . Then, $W^{a,b,q}_{\phi(x)}=W(x^{\mathbb{b}})$ for all $x\in\mathbb{R}^{b}$ .

Let us introduce $\widetilde{W}:=(W^{a,b,q}_{\phi(x)},x\in\mathbb{R}^{b})$. Then, for all $(x,y)\in\mathbb{R}^{b}\times\mathbb{R}^{b}$, we have

\operatorname{\mathbb{E}}[\widetilde{W}(x)\widetilde{W}(y)]\ =\ \operatorname{\mathbb{E}}[W(x^{\mathbb{b}})W(y^{\mathbb{b}})]\ =\ e^{-\left\|{x^{\mathbb{b}}-y^{\mathbb{b}}}\right\|^{2}}\ =\ e^{-\left\|{x-y}\right\|^{2}}.

So $\widetilde{W}$ is a standard squared exponential Gaussian process in dimension $b$ that depends on neither $a$ nor $q$. Moreover, we have $W^{a,b,q}_{t}=\widetilde{W}(\phi^{-1}(Rt))$.
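The identification with a $b$-dimensional standard squared exponential process can be checked numerically at the level of covariances. The values of $d$, $b$, $a$ and the isometry below are arbitrary, and `kernel_abq` is our own name for the covariance of $W^{a,b,q}$ derived from its definition.

```python
import numpy as np

rng = np.random.default_rng(4)
d, b, a = 5, 2, 2.5
q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # an arbitrary isometry
mask = np.arange(d) < b                            # the filter b

def phi(x):
    """phi(x) = q^{-1}(x^b) / a, mapping R^b into q^{-1}(E_b)."""
    padded = np.zeros(d)
    padded[mask] = x
    return q.T @ padded / a

def phi_inverse(t):
    """phi^{-1}(t) = a (q t)_b for t in q^{-1}(E_b)."""
    return a * (q @ t)[mask]

def kernel_abq(s, t):
    """Covariance of W^{a,b,q} at s and t."""
    u = a * (q @ (s - t))[mask]
    return np.exp(-u @ u)

x = rng.standard_normal(b)
y = rng.standard_normal(b)
lhs = kernel_abq(phi(x), phi(y))           # covariance after reparametrization
rhs = np.exp(-np.sum((x - y) ** 2))        # standard kernel in dimension b
```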

From now on, $W^{a,b,q}$ will refer to the restriction on $\mathbb{U}_{d_{n}}$ of this process. Then, the hierarchical prior on the parameter $f\in\mathcal{C}(\mathbb{U}_{d_{n}})$ with stochastic subspace selection is defined as the law $\Pi_{n}$ of $W^{A,\Gamma,\Theta}$ , where $A$ is the scaling parameter, $\Gamma\in\llbracket 1,d_{n}\rrbracket$ is the prior on the subspace dimension, and $\Theta$ is the prior on the orientation.

Assumption 3.3.

The intrinsic dimension $d_{0}$ of the subspace is assumed to be bounded by a known deterministic number $d_{\mathrm{max}}$ .

Consequently, $\Gamma$ is defined by a probability vector $(\pi_{\Gamma}(d):1\leq d\leq d_{\mathrm{max}})$ with $\pi_{\Gamma}(d)>0$ for all $d$. Moreover, we define the scaling parameter $A$ such that there exists a collection of probability measures $\pi_{n,d}$ on $(0,\infty)$, $0\leq d\leq d_{\mathrm{max}}\wedge d_{n}$, with $A\ |\ (\Gamma=d)\sim\pi_{n,d}$. We require the law of the stochastic isometry $\Theta$ to be translation invariant. That is, for every measurable subset $\mathcal{Q}$ of $\mathcal{O}_{d_{n}}$ and for all $q\in\mathcal{O}_{d_{n}}$, we need $\operatorname{\mathds{P}}(\Theta\in q\cdot\mathcal{Q})=\operatorname{\mathds{P}}(\Theta\in\mathcal{Q})$. Therefore, the law of $\Theta$ is taken as the unit Haar measure on $\mathcal{O}_{d_{n}}$, the only probability measure that is translation invariant on $\mathcal{O}_{d_{n}}$. In addition, $A$, $\Gamma$, and $\Theta$ are all supposed to be independent of $W$.
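In practice, a draw from the Haar measure on the orthogonal group can be obtained with the standard QR-based construction: take the orthogonal factor of a matrix with i.i.d. standard Gaussian entries and correct the signs so that the triangular factor has a positive diagonal. The sketch below is one possible implementation and is not taken from the paper; the function name is ours.

```python
import numpy as np

def sample_haar_orthogonal(d, rng):
    """Draw a d x d orthogonal matrix from the Haar measure on O_d."""
    Z = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(Z)
    # The QR factorization is unique only up to column signs; forcing the
    # diagonal of R to be positive removes this ambiguity.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs   # multiplies column j of Q by signs[j]

rng = np.random.default_rng(5)
theta = sample_haar_orthogonal(4, rng)
```

Without the sign correction, the distribution of `Q` depends on the convention used by the underlying QR routine and need not be invariant.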

For convenience, the notation $\pi_{n,d}$ will refer to a probability measure as well as its density.

Assumption 3.4 (Rescaling measures).

There exist constants $D_{1},D_{2},C_{1}$ , $C_{2}$ , and $c>1$ such that for all $n\in\mathds{N}^{*}$ and $d<d_{\mathrm{max}}\wedge d_{n}$ , the density $\pi_{n,d}$ satisfies

1. for all sufficiently large $a$, $\pi_{n,d}(a)\geq D_{1}e^{-C_{1}a^{d}(\log a)^{d+1}}$;
2. for all $a>c$, $\pi_{n,d}(a)\leq D_{2}e^{-C_{2}a^{d}(\log a)^{d+1}}$;
3. $\pi_{n,d}([0,c])=0$.

Assumptions similar to Assumption 3.4 are standard, see for instance Equation (3.4) in [VV09] or Assumption 5 in [JT21]. For example, this assumption is satisfied if, for all $n\in\mathds{N}^{*}$ and $d<d_{\mathrm{max}}\wedge d_{n}$, the law of $A^{d}(\log A)^{d+1}\ |\ (\Gamma=d)$ is an exponential law, with parameter independent of $d$ and $n$, conditioned on $A>c$ (indeed, if $g(A)$ has density function $f$, with $g$ differentiable and strictly increasing, then $A$ has density function $(f\circ g)\cdot g^{\prime}$).
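The example above suggests a simple way to simulate the scaling parameter. In the sketch below, $g(a)=a^{d}(\log a)^{d+1}$ is strictly increasing on $(1,\infty)$, the restriction to $A>c$ is handled through the memoryless property of the exponential law, and $g$ is inverted by bisection. The rate, the value of $c$, and the function names are hypothetical choices.

```python
import math
import random

def g(a, d):
    """g(a) = a^d (log a)^(d+1), strictly increasing for a > 1."""
    return a ** d * math.log(a) ** (d + 1)

def g_inverse(t, d, lo):
    """Solve g(a) = t for a >= lo by bisection, assuming g(lo) <= t."""
    hi = max(2.0 * lo, 2.0)
    while g(hi, d) < t:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid, d) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_scale(d, c, rate, rng):
    """Draw A such that g(A) is exponential(rate) conditioned on A > c."""
    # By memorylessness, an exponential variable conditioned on exceeding
    # g(c) is distributed as g(c) plus a fresh exponential variable.
    t = g(c, d) + rng.expovariate(rate)
    return g_inverse(t, d, c)

rng = random.Random(6)
draws = [sample_scale(d=2, c=1.5, rate=1.0, rng=rng) for _ in range(1000)]
```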

The next section gives some details on the reproducing kernel Hilbert space (RKHS) of $W^{a,b,q}$. The content is somewhat technical and can be skipped on a first reading.

3.2 Reproducing kernel Hilbert space of $W^{a,b,q}$

One of the advantages of choosing a Gaussian process prior is that the contraction rate depends explicitly on the small ball probability and on the relative position of the parameter with respect to the RKHS associated with the process. This section is dedicated to the basic properties of this space. For elementary definitions and for details on the link between the contraction rate and the RKHS, we refer the reader to [VV08] and [VV08a].

Notation.

We denote by $\mathcal{C}(\mathbb{U}_{d}\ |\ q^{-1}(E_{\mathbb{b}}))$ the space of continuous functions on $\mathbb{U}_{d}$ which are constant on affine subspaces $q^{-1}(E_{1-\mathbb{b}})+x$ , for $x\in\mathbb{U}_{d}$ .

We introduce the operator

\Lambda:\begin{cases}\mathcal{C}(\mathbb{U}_{b})&\to\ \mathcal{C}(\mathbb{U}_{d}\ |\ q^{-1}(E_{\mathbb{b}}))\\ f&\mapsto\ \Lambda f:\begin{cases}\mathbb{U}_{d}&\to\ \mathbb{R}\\ x&\mapsto\ f((qx)_{\mathbb{b}}),\end{cases}\end{cases}

so that $W^{a,b,q}=\Lambda(\widetilde{W}^{a})$, where $\widetilde{W}^{a}=(\widetilde{W}_{at},t\in\mathbb{U}_{b})$ is the process $\widetilde{W}$ introduced above rescaled by $a$ and restricted to $\mathbb{U}_{b}$. The operator $\Lambda$ is a bijective linear map and also an isometry if the domain and the codomain are endowed with the uniform norm. In particular, the map $\Lambda$ is continuous. According to Lemma 7.1 in [VV08a], if $\widetilde{\mathbb{H}}_{a}$ is the RKHS of $\widetilde{W}^{a}$, then the RKHS $\mathbb{H}_{a,b,q}$ of $W^{a,b,q}$ is equal to $\Lambda(\widetilde{\mathbb{H}}_{a})$. Let us detail its elements. The stochastic process RKHS of $\widetilde{W}^{a}$ (as defined in [VV08a]) is composed of functions $h:\mathbb{U}_{b}\to\mathbb{R}$ for which there exists $\psi\in L^{2}_{\mathbb{C}}(\mu^{se}_{a,b})$ such that

(3.1)

h(t)\ =\ \operatorname{\mathfrak{Re}}\int_{\mathbb{R}^{b}}e^{-i\lambda\cdot t}\psi(\lambda)d\mu^{se}_{a,b}(\lambda),\quad t\in\mathbb{U}_{b},

where $\mu^{se}_{a,b}$ is the spectral measure of the $a$-rescaled squared exponential process in dimension $b$ with spectral density $f^{se}_{a,b}:t\mapsto(2a\sqrt{\pi})^{-b}\exp(-\frac{1}{4}\|t/a\|^{2})$ (see Lemma 4.1 in [VV09], and the following discussion). We can view $\widetilde{W}^{a}$ as a random Gaussian element with values in the Banach space $(\mathcal{C}(\mathbb{U}_{b}),\left\|{\cdot}\right\|_{\infty})$. Thus, according to Theorem 2.1 in [VV08a], the stochastic process RKHS and the Banach space RKHS coincide and we can apply Lemma 7.1 from the same reference. The space $\mathbb{H}_{a,b,q}=\Lambda(\widetilde{\mathbb{H}}_{a})$ is then the set of functions

(3.2)

\overline{h}:x\in\mathbb{U}_{d}\ \mapsto\ \operatorname{\mathfrak{Re}}\int_{\mathbb{R}^{b}}e^{-i\langle\lambda,(qx)_{\mathbb{b}}\rangle}\psi(\lambda)d\mu^{se}_{a,b}(\lambda),

where $\psi$ runs through $L^{2}_{\mathbb{C}}(\mu^{se}_{a,b})$ and the RKHS norm is $\left\|{\overline{h}}\right\|_{\mathbb{H}_{a,b,q}}=\left\|{\psi}\right\|_{L^{2}(\mu^{se}_{a,b})}$.
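The spectral density used above can be sanity-checked numerically in dimension $b=1$: integrating $\cos(\lambda t)$ against $f^{se}_{a,1}$ should return the covariance $\exp(-a^{2}t^{2})$ of the $a$-rescaled process at lag $t$. The sketch below uses a plain Riemann sum on a truncated grid; the values of $a$ and $t$ and the grid are arbitrary choices.

```python
import numpy as np

def spectral_density(lam, a, b=1):
    """f^{se}_{a,b}(lam) = (2 a sqrt(pi))^{-b} exp(-||lam / a||^2 / 4), for b = 1."""
    return (2.0 * a * np.sqrt(np.pi)) ** (-b) * np.exp(-0.25 * (lam / a) ** 2)

a = 1.3
lam = np.linspace(-40.0 * a, 40.0 * a, 200001)   # truncated frequency grid
dlam = lam[1] - lam[0]

def kernel_from_spectrum(t):
    """Riemann-sum approximation of the integral of cos(lam t) f(lam)."""
    return np.sum(np.cos(lam * t) * spectral_density(lam, a)) * dlam

t = 0.7
approx = kernel_from_spectrum(t)
exact = np.exp(-(a * t) ** 2)   # covariance of the a-rescaled process at lag t
```

At $t=0$ the same integral gives the total mass of the spectral measure, which equals one.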

We remark that functions of the RKHS of $W^{a,b,q}$ have the same sparsity-pattern as the trajectories of $W^{a,b,q}$ .

Remark 3.1.

Functions $\overline{h}\in\mathbb{H}_{a,b,q}$ are constant on affine subspaces $q^{-1}(E_{1-\mathbb{b}})+x$ for $x\in\mathbb{U}_{d}$ .

As mentioned at the beginning of this section, contraction rates under Gaussian process prior depend on two quantities: the small ball probability and the relative position of the parameter with respect to the RKHS. For a parameter $f\in\mathcal{C}(\mathbb{U}_{d}\ |\ q^{-1}(E_{\mathbb{b}}))$ and $\varepsilon>0$ , these two quantities define the concentration function $\phi_{f}^{a,b,q}$ , with

(3.3)

\phi_{f}^{a,b,q}(\varepsilon)\ :=\ \inf_{h\in\mathbb{H}_{a,b,q}:\left\|{h-f}\right\|_{\infty}<\varepsilon}\left\|{h}\right\|_{\mathbb{H}_{a,b,q}}^{2}-\log\operatorname{\mathds{P}}\left(\left\|{W^{a,b,q}}\right\|_{\infty}<\varepsilon\right).

3.3 Posterior consistency

Before we state the theorem, we need a last assumption, which determines how the ambient dimension $d_{n}$ is allowed to grow with the sample size $n$ .

Assumption 3.5 (Growth of $d_{n}$ ).

The ambient dimension $d_{n}$ satisfies

d_{n}\ \leq\ C_{D}\cdot n^{\tfrac{d_{0}}{2\beta+d_{0}}}\cdot(\log n)^{2\kappa-1},

for some small constant $C_{D}>0$ and where $\kappa=(d_{0}+1)\beta/(2\beta+d_{0})$ .

An examination of $\kappa$ shows that $\kappa\geq 1/2$ if $\beta\geq 1/2$ and that $\kappa>\beta$ otherwise. Thereby, a standard rate of order $n^{1/2}$ for $d_{n}$ is achieved with parameter $\beta=d_{0}/2$ . The fastest rate tends to the order $n\cdot(\log n)^{-1}$ when $\beta$ tends to zero. Although it is always possible to set $\beta$ extremely close to zero in order to obtain the best rate for $d_{n}$ , one should keep in mind that the contraction rate may then be suboptimal, as discussed at the end of this section.
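These observations on $\kappa$ and on the bound of Assumption 3.5 are easy to check numerically. In the sketch below, the value $C_{D}=1$ and the choice $d_{0}=3$ are placeholders of ours; the constant of the assumption is not specified here.

```python
import math

def kappa(beta, d0):
    """kappa = (d0 + 1) beta / (2 beta + d0)."""
    return (d0 + 1) * beta / (2 * beta + d0)

def dimension_bound(n, beta, d0, C_D=1.0):
    """Right-hand side of Assumption 3.5 (C_D = 1.0 is a placeholder)."""
    exponent = d0 / (2 * beta + d0)
    return C_D * n ** exponent * math.log(n) ** (2 * kappa(beta, d0) - 1)

d0 = 3
kappa_at_half = kappa(0.5, d0)            # boundary case beta = 1/2
kappa_rough = kappa(0.2, d0)              # a case with beta < 1/2
sqrt_exponent = d0 / (2 * (d0 / 2) + d0)  # exponent of n when beta = d0 / 2

n = 10 ** 6
near_zero = dimension_bound(n, 1e-9, d0)  # beta close to zero
limit = n / math.log(n)                   # limiting order n / log n
```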

Theorem 3.1.

Let $\varepsilon_{n}=C_{\varepsilon}\cdot\underline{\varepsilon}_{n}(\log n)^{\kappa}$ with $\underline{\varepsilon}_{n}=n^{-\beta/(2\beta+d_{0})}$ , $C_{\varepsilon}$ a large constant that depends on $f_{0}$ , and $\kappa$ as in Assumption 3.5. Then, if the parameter space is embedded with the prior $\Pi_{n}$ and under Assumptions 3.1-3.5, the posterior contracts at rate $(\varepsilon_{n})_{n\in\mathds{N}}$ for density estimation (as defined in (2.2)) as well as for regression with fixed or random design (as defined in (2.3) and (2.4)).

An examination of $\varepsilon_{n}$ shows that the contraction rate improves as the smoothness $\beta$ of $f_{0}$ grows, whereas the admissible growth of $d_{n}$ becomes slower. This highlights a trade-off between the contraction rate and the growth of the design dimension: fast contraction rates imply a slowly increasing dimension, and conversely.
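To make the trade-off concrete, here is an illustrative instance with $d_{0}=2$. The two values of $\beta$ are chosen for illustration only and do not appear in the analysis; constants $C_{\varepsilon}$ and $C_{D}$ are omitted.

```latex
\begin{aligned}
d_{0}=2,\ \beta=1 &:\quad \kappa=\tfrac{3}{4},
  \qquad \varepsilon_{n}\asymp n^{-1/4}(\log n)^{3/4},
  \qquad d_{n}\lesssim n^{1/2}(\log n)^{1/2},\\
d_{0}=2,\ \beta=3 &:\quad \kappa=\tfrac{9}{8},
  \qquad \varepsilon_{n}\asymp n^{-3/8}(\log n)^{9/8},
  \qquad d_{n}\lesssim n^{1/4}(\log n)^{5/4}.
\end{aligned}
```

Going from $\beta=1$ to $\beta=3$ improves the polynomial part of the contraction rate from $n^{-1/4}$ to $n^{-3/8}$, while the polynomial part of the admissible dimension drops from $n^{1/2}$ to $n^{1/4}$.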

The proof of Theorem 3.1, deferred to the Appendix (Section 5.1), combines the arguments of [Tok11] and [JT21].

4 Subspace recovery for the density estimation problem

In this section, we address the recovery of the central subspace for the density estimation problem. To avoid identifiability issues caused by the spherical support, we suppose that the ambient dimension $d_{n}$ does not depend on $n$. Hence, we denote the ambient dimension by $d$ with $d\geq d_{0}$ and the central subspace by $\mathcal{S}:=(q^{*})^{-1}(E_{\mathbf{d_{0}}})$ where $q^{*}$ corresponds to $q^{*}_{n}$ in Assumption 3.1. This assumption is justified by the following considerations. If the ambient dimension grows with $n$, the Hellinger metric relative to the Lebesgue measure on $\mathbb{U}_{d_{n}}$ tends to give more importance to the center of the support, as $n$ tends to infinity. For example, consider a parameter $f_{0}:\mathbb{U}_{2}\to\mathbb{R}$ in dimension two that is everywhere constant except in a small region on the boundary of $\mathbb{U}_{2}$, and such that the central subspace $\mathcal{S}_{n}$ is of dimension two. The importance of this small region in the support $\mathbb{U}_{d_{n}}$, in the Hellinger sense, decreases exponentially with $n$, much faster than the contraction rate towards the true parameter $f_{n}^{*}$ in Theorem 3.1. Consequently, for sufficiently large $n$, a constant function $f_{0}:[0,1]\to\mathbb{R}$ together with some one-dimensional subspace $\mathcal{S}^{\prime}$ characterizes a density that is in the Hellinger ball of radius $\varepsilon_{n}$ centered at $f_{n}^{*}$; so we have no hope of recovering the true subspace by simply using the posterior consistency.

As a consequence, the true density $p^{*}$ , the parameter $f^{*}$ , and the central subspace do not depend on $n$ anymore. The true density $p^{*}=p_{f^{*}}$ is characterized by $f^{*}$ via the transformation (2.1). Moreover, $f^{*}$ is supposed to depend only on a $d_{0}$ -dimensional subspace of $\mathbb{R}^{d}$ and can be viewed as the sparse continuation of an underlying function $f_{0}\in\mathcal{C}(\mathbb{U}_{d_{0}})$ . In the same way, $p^{*}$ can be viewed as the sparse continuation of a function $p_{0}$ over $\mathbb{U}_{d_{0}}$ , except that the renormalisation of $p^{*}$ depends on $d$ . Note that $p_{0}$ is not necessarily a density on $\mathbb{U}_{d_{0}}$ so the notation $h(f,g)$ will designate from now on the $L^{2}$ -distance between the square roots of $f$ and $g$ even if $f$ and $g$ are not densities.

Let us introduce some more notation. Let $\mathcal{Q}^{*}$ be the set of all optimal isometries:

\mathcal{Q}^{*}\ :=\ \{q\in\mathcal{O}_{d}:q^{-1}(E_{\mathbf{d_{0}}})=\mathcal{S}\},

and, for $d^{\prime}>d_{0}$ , let $\mathcal{Q}^{*}_{d^{\prime}}$ be the set of isometries that send the subspace $E_{\mathbf{d^{\prime}}}$ to a subspace containing $\mathcal{S}$ :

\mathcal{Q}^{*}_{d^{\prime}}\ :=\ \{q\in\mathcal{O}_{d}:q^{-1}(E_{\mathbf{d^{\prime}}})\supset\mathcal{S}\}.

Recovering $\mathcal{S}$ means the following: for some rate $\delta_{n}\to 0$ ,

\operatorname{\mathds{P}}_{n}^{*}\left[\Pi_{n}\left(\Gamma\neq d_{0}\text{ or % }\min_{q\in\mathcal{Q}^{*}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|% \Theta-q\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\geq\delta_{n}\ |% \ X_{1},\ldots,X_{n}\right)\right]\mathop{\longrightarrow}\limits_{n\to+\infty% }0,

where ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}$ is the operator norm with respect to the Euclidean norm in $\mathbb{R}^{d}$. However, under the assumptions of Theorem 3.1, the only information we have on the true subspace is posterior consistency to the density $p^{*}$ with rate $\varepsilon_{n}$. This will only allow us to recover a subspace of $\mathbb{R}^{d}$ containing $\mathcal{S}$. A crucial assumption to eliminate the subspaces of dimension smaller than $d_{0}$ and the subspaces that do not contain $\mathcal{S}$ is to suppose that $p_{0}$ is non-constant in all directions. More precisely, the lack of constancy in each direction has to be detectable in Hellinger distance, as formalized in the following assumption.

Assumption 4.1.

There exist a constant $D$ and a window size $L<1$ such that for every vector line (one-dimensional linear subspace) $\Delta$ of $\mathbb{R}^{d_{0}}$, directed by a unit vector $\boldsymbol{\Delta}$, there exists $o\in\mathcal{B}_{d_{0}}(1-L)$ such that for all $0<l\leq L$, for all $t\in\mathcal{B}_{d_{0}}(L/2)+o$, and for every constant $c>0$,

h^{2}({p_{0}}_{|I};c)\ \geq\ D\cdot l^{2},

where $I:=\ ]o+t-\frac{l}{2}\boldsymbol{\Delta};o+t+\frac{l}{2}\boldsymbol{\Delta}[$ .

Assumption 4.1 seems a bit technical at first glance but it can be shown that it is satisfied as soon as $p_{0}$ is differentiable over $\mathbb{U}_{d_{0}}$ with $d_{0}$ points such that the gradients at these points are linearly independent.

Theorem 4.1.

Under Assumption 4.1 and the assumptions of Theorem 3.1, we have, for some rate $(\delta_{n})_{n}$ tending to zero,

(4.1)

\Pi_{n}\left(\Gamma<d_{0}\ |\ X_{1},\ldots,X_{n}\right)\mathop{\longrightarrow}\limits_{n\to+\infty}0,\qquad\text{in }\operatorname{\mathds{P}}_{n}^{*}\text{-probability},

(4.2)

\Pi_{n}\left(\Gamma=d_{0}\text{ and }\min_{q\in\mathcal{Q}^{*}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\Theta-q\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\geq\delta_{n}\ |\ X_{1},\ldots,X_{n}\right)\mathop{\longrightarrow}\limits_{n\to+\infty}0,\qquad\text{in }\operatorname{\mathds{P}}_{n}^{*}\text{-probability},

(4.3)

\Pi_{n}\left(\Gamma>d_{0}\text{ and }\min_{q\in\mathcal{Q}^{*}_{\Gamma}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\Theta-q\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\geq\delta_{n}\ |\ X_{1},\ldots,X_{n}\right)\mathop{\longrightarrow}\limits_{n\to+\infty}0,\qquad\text{in }\operatorname{\mathds{P}}_{n}^{*}\text{-probability}.

Theorem 4.1 ensures that the central subspace $\mathcal{S}$ can be recovered as soon as the intrinsic dimension $d_{0}$ is known. Subspaces of dimension smaller than $d_{0}$ are also eliminated but the theorem does not reject those of dimension greater than $d_{0}$. We conjecture that the prior mass on those spaces tends to vanish, for reasons similar to those given in [JT21]. Indeed, introducing a penalization on larger dimensions if necessary, it should be possible to show that the posterior cannot contract as fast as the minimax rate for $d_{0}$ if a subspace of greater dimension is chosen. As discussed in the introduction of this section, the estimation of the central subspace is carried out under the assumption that $d$ does not vary with $n$, mainly because of the identifiability issue caused by the spherical support. We believe that this restriction can be relaxed by extending the support $\mathbb{U}_{d}$ to the full ambient space $\mathbb{R}^{d}$, as in [JT21]. In this case, the domain over which we integrate the Hellinger distance in the proof of Theorem 4.1 can be taken as the product of a square of side $L$ in directions $\boldsymbol{\Delta}$ and $\boldsymbol{\Lambda}$ with $\mathbb{R}^{d-2}$. Then, the integrated error should no longer depend on $d$ and consistency to the true subspace should follow. Further investigations in this direction might be worthwhile.

The proof of Theorem 4.1 is deferred to Appendix 5.2.

Acknowledgments

We acknowledge the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-21-CE40-0007 (GAP Project).

5 Appendix

5.1 Proof of Theorem 3.1

As a reminder, we first exhibit some facts about the convergence rate:

(5.1)

\varepsilon_{n}\ =\ C_{\varepsilon}\cdot n^{-\tfrac{\beta}{2\beta+d_{0}}}\cdot(\log n)^{\kappa},\qquad n\varepsilon_{n}^{2}\ =\ C_{\varepsilon}^{2}\cdot n^{\tfrac{d_{0}}{2\beta+d_{0}}}\cdot(\log n)^{2\kappa}.

So $\varepsilon_{n}$ is a large multiple of the minimax rate times a logarithmic factor. The constant $C_{\varepsilon}$ is chosen to be arbitrarily large in order to absorb undesired terms in the proof.
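For the reader's convenience, the exponent in $n\varepsilon_{n}^{2}$ and its relation to the bound of Assumption 3.5 can be spelled out as follows. The second line is only one way of reading that assumption, namely as the requirement that $d_{n}\log n$ be a small multiple of $n\varepsilon_{n}^{2}$.

```latex
\begin{aligned}
n\varepsilon_{n}^{2}
  &= C_{\varepsilon}^{2}\cdot n^{1-\tfrac{2\beta}{2\beta+d_{0}}}\cdot(\log n)^{2\kappa}
   = C_{\varepsilon}^{2}\cdot n^{\tfrac{d_{0}}{2\beta+d_{0}}}\cdot(\log n)^{2\kappa},\\
d_{n}\log n
  &\leq C_{D}\cdot n^{\tfrac{d_{0}}{2\beta+d_{0}}}\cdot(\log n)^{2\kappa}
   = \frac{C_{D}}{C_{\varepsilon}^{2}}\cdot n\varepsilon_{n}^{2}.
\end{aligned}
```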

The proof of Theorem 3.1 is based on Theorem 2.1 in [GGV00]. The general outline is a combination of the arguments of [Tok11] (itself derived from [VV09]) and [JT21]. Concretely, it suffices to show that there exists a sequence of sets $\mathbb{B}_{n}\subset\mathcal{C}(\mathbb{U}_{d_{n}})$ (referred to as a sieve), such that the following three conditions hold for all sufficiently large $n$ :

(5.2)

\Pi_{n}\left(\left\|{W^{A,\Gamma,\Theta}-f_{n}^{*}}\right\|_{\infty}\leq 2\varepsilon_{n}\right)\ \geq\ \exp(-n\varepsilon_{n}^{2}),

(5.3)

\Pi_{n}\left(W^{A,\Gamma,\Theta}\notin\mathbb{B}_{n}\right)\ \leq\ \exp(-5n\varepsilon_{n}^{2}),

(5.4)

\log N\left(3\varepsilon_{n},\mathbb{B}_{n},\left\|{\cdot}\right\|_{\infty}\right)\ \leq\ n\varepsilon_{n}^{2}.

This is the purpose of the next sections. The first condition (5.2), referred to as the prior mass condition, ensures that the prior puts a sufficient amount of mass around the true parameter. Condition (5.3), called the sieve condition, forces the sieve $\mathbb{B}_{n}$ to capture most of the mass of the prior, while the entropy condition (5.4) constrains its size. These three conditions correspond one to one to the conditions of Theorem 2.1 in [GGV00], as shown in [VV08] for density estimation and regression with fixed design. For regression with random design, we recall in the next section some arguments scattered across the Bayesian literature.

5.1.1 Regression with random design

Here, we show that Theorem 2.1 in [GGV00] can be applied in the regression with random design setting, as soon as Conditions (5.2), (5.3), and (5.4) are satisfied. The procedure consists in showing that the posterior contracts to the density of a pair $(X_{i},Y_{i})$ and then retrieving $f^{*}_{n}$ from this density. For a function $f:\mathbb{U}_{d_{n}}\to\mathbb{R}$, we define $P_{f}:\mathbb{U}_{d_{n}}\times\mathbb{R}\to\mathbb{R}_{+},\ (x,y)\mapsto G_{n}(x)\cdot\Phi_{f(x),\sigma}(y)$, where $\Phi_{\mu,\sigma}$ is the density of a univariate Gaussian variable with mean $\mu$ and standard deviation $\sigma$ and $G_{n}$ is the density of one covariate. Then, the density of one observation $(X,Y)$ under regression with random design is $P_{f_{n}^{*}}$. We first prove that Condition (5.2) implies Condition (2.4) in [GGV00] with $C=1$. Detailed calculations can be found in [FS23], Section A.2. We have to compare the uniform neighborhood of $f_{n}^{*}$ with the Kullback-Leibler neighborhood

B_{2}(P_{f_{n}^{*}};\varepsilon):=\{g:\mathrm{KL}(P_{f_{n}^{*}},P_{g})\leq% \varepsilon^{2},\ V_{2,0}(P_{f_{n}^{*}},P_{g})\leq\varepsilon^{2}\},

where $\mathrm{KL}(P_{f},P_{g}):=P_{f}\left[\log(dP_{f}/dP_{g})\right]$ is the Kullback-Leibler divergence between $P_{f}$ and $P_{g}$ and $V_{2,0}(P_{f},P_{g}):=P_{f}\left[\log(dP_{f}/dP_{g})-\mathrm{KL}(P_{f},P_{g})\right]^{2}$ is the Kullback-Leibler variation. Using the following identities from [FS23]

\begin{aligned}
\mathrm{KL}(P_{f},P_{g})\ &=\ \frac{1}{2\sigma^{2}}\left\|{f-g}\right\|_{2,G_{n}}^{2},\\
V_{2}(P_{f},P_{g})\ &:=\ P_{f}\left[\log\left(\frac{dP_{f}}{dP_{g}}\right)^{2}\right]\ =\ \frac{1}{\sigma^{2}}\left\|{f-g}\right\|_{2,G_{n}}^{2}\ +\ \left(\frac{1}{2\sigma^{2}}\left\|{(f-g)^{2}}\right\|_{2,G_{n}}\right)^{2},\\
V_{2,0}(P_{f},P_{g})\ &=\ V_{2}(P_{f},P_{g})\ -\ \mathrm{KL}(P_{f},P_{g})^{2},
\end{aligned}

we deduce that, if $\left\|{f-g}\right\|_{\infty}\leq 2\varepsilon$ with $2\varepsilon<1$ , then

\begin{aligned}
\mathrm{KL}(P_{f},P_{g})\ &\leq\ \frac{1}{2\sigma^{2}}\left\|{f-g}\right\|_{\infty}^{2}\ \leq\ \frac{2\varepsilon^{2}}{\sigma^{2}},\\
V_{2,0}(P_{f},P_{g})\ &\leq\ 4C_{\sigma}^{2}\cdot\varepsilon^{2},
\end{aligned}

where $C_{\sigma}:=\sqrt{1/\sigma^{2}+1/(4\sigma^{4})}$ . Consequently, according to (5.2), and multiplying $\varepsilon_{n}$ by $4C_{\sigma}^{2}$ if necessary, we have

\Pi_{n}\left(B_{2}(P_{f_{n}^{*}};\varepsilon_{n})\right)\ \geq\ \exp\left(-% \frac{1}{4C_{\sigma}^{2}}n\varepsilon_{n}^{2}\right).

One can remark that for Condition (2.4) in [GGV00] to be satisfied, we must have $(4C_{\sigma}^{2})^{-1}\leq 1$ which is the case as soon as $\sigma\leq 2$ .
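The bound $V_{2,0}(P_{f},P_{g})\leq 4C_{\sigma}^{2}\varepsilon^{2}$ and the condition on $\sigma$ can be unpacked as follows. This is a sketch that assumes only that $G_{n}$ is a probability density, so that $\left\|{\cdot}\right\|_{2,G_{n}}\leq\left\|{\cdot}\right\|_{\infty}$, and that $\left\|{f-g}\right\|_{\infty}\leq 2\varepsilon<1$.

```latex
\begin{aligned}
V_{2,0}(P_{f},P_{g})\ \leq\ V_{2}(P_{f},P_{g})
  &\leq \frac{1}{\sigma^{2}}\left\|f-g\right\|_{\infty}^{2}
      + \frac{1}{4\sigma^{4}}\left\|f-g\right\|_{\infty}^{4}\\
  &\leq \frac{(2\varepsilon)^{2}}{\sigma^{2}} + \frac{(2\varepsilon)^{4}}{4\sigma^{4}}
   \ \leq\ (2\varepsilon)^{2}\left(\frac{1}{\sigma^{2}}+\frac{1}{4\sigma^{4}}\right)
   \ =\ 4C_{\sigma}^{2}\cdot\varepsilon^{2},\\
4C_{\sigma}^{2} = \frac{4}{\sigma^{2}}+\frac{1}{\sigma^{4}}
  &\geq \frac{4}{\sigma^{2}}\ \geq\ 1
   \qquad\text{whenever } \sigma\leq 2.
\end{aligned}
```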

Condition (2.3) in [GGV00] is immediately deduced from (5.3). For Condition (2.2), we use the inequality

(5.5)

h(P_{f},P_{g})\ \leq\ \frac{1}{2\sigma}\left\|{f-g}\right\|_{\infty},

see again [FS23] for details. Then, assuming that $\sigma\geq 1$ , we have, according to (5.4) and multiplying $\varepsilon_{n}$ by 3 if necessary,

D\left(\varepsilon_{n},\mathbb{B}_{n},h\right)\ \leq\ N\left(\frac{\varepsilon% _{n}}{2},\mathbb{B}_{n},h\right)\ \leq\ N\left(\frac{\varepsilon_{n}}{2\sigma}% ,\mathbb{B}_{n},h\right)\ \leq\ N\left(\varepsilon_{n},\mathbb{B}_{n},\left\|{% \cdot}\right\|_{\infty}\right)\ \leq\ \exp(n\varepsilon_{n}^{2}),

where the first inequality comes from the definition of the packing number $D$ and the covering number $N$ and where the third inequality follows from (5.5). Theorem 2.1 in [GGV00] then ensures posterior consistency to $P_{f_{n}^{*}}$ at rate $\varepsilon_{n}$ in Hellinger distance. Now, because we also have the converse inequality

h^{2}(P_{f},P_{g})\ \geq\ \frac{1}{4\sigma^{2}}\exp\left(-\frac{Q^{2}}{2\sigma% ^{2}}\right)\cdot\left\|{f-g}\right\|_{2,G_{n}}^{2}\qquad\text{if }\left\|{f}% \right\|_{\infty}\leq Q\text{ and }\left\|{g}\right\|_{\infty}\leq Q

and because $h(P_{f^{Q}},P_{g^{Q}})\leq h(P_{f},P_{g})$ for arbitrary $f$ and $g$, where $f^{Q}:=(f\vee-Q)\wedge Q$ denotes the truncation of $f$ to $[-Q,Q]$, we obtain posterior contraction to $f_{n}^{*}$ at rate $\varepsilon_{n}$ in the $L^{2}(G_{n})$-distance:

\operatorname{\mathds{P}}_{n}^{*}\left[\Pi_{n}\left(g\in\mathcal{C}(\mathbb{U}% _{d_{n}}):\bigl{\|}{{f_{n}^{*}}^{Q}-g^{Q}}\bigr{\|}_{2,G_{n}}>D_{\sigma}^{Q}% \cdot\varepsilon_{n}\ |\ (X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\right)\right]% \mathop{\longrightarrow}\limits_{n\to+\infty}0,

where $D_{\sigma}^{Q}:=M\cdot 2\sigma\cdot\exp\left(Q^{2}/(4\sigma^{2})\right)$ .

Remark 5.1.

The restriction to $[1,2]$ for the standard deviation $\sigma$ can be relaxed. In fact, if $\sigma>2$ , then it suffices to consider Theorem 2.1 in [GGV00] with $C=(4C^{2}_{\sigma})^{-1}$ . Condition (2.4) in [GGV00] is then immediately satisfied and, for Condition (2.3), the proof of (5.3) can be adapted to replace 5 by $4+C$ . On the contrary, if $0<\sigma<1$ , Condition (2.2) in [GGV00] can be satisfied by multiplying $\varepsilon_{n}$ by $\sigma^{-1}$ .

5.1.2 Prior mass condition (5.2)

We verify here that $\Pi_{n}\left(\left\|{W^{A,\Gamma,\Theta}-f_{n}^{*}}\right\|_{\infty}\leq 2% \varepsilon_{n}\right)\ \geq\ \exp(-n\varepsilon_{n}^{2})$ . Let us introduce the following notation.

Notation.

For $q\in\mathcal{O}_{d_{n}}$ , we denote by $f_{n,q}:\mathbb{U}_{d_{n}}\to\mathbb{R}$ the function such that $f_{n,q}(x)=f_{0}\left((qx)_{\mathbf{d_{0}}}\right)$ , for all $x\in\mathbb{U}_{d_{n}}$ . Hence, $f_{n}^{*}=f_{n,q_{n}^{*}}$ .

We first reduce the problem to deterministic dimension and direction by conditioning on $\Gamma=d_{0}$ and integrating over $\mathcal{O}_{d_{n}}$:

\Pi_{n}\left(\bigl{\|}{W^{A,\Gamma,\Theta}-f_{n}^{*}}\bigr{\|}_{\infty}\leq 2% \varepsilon_{n}\right)\ \geq\ \pi_{\Gamma}(d_{0})\int_{\mathcal{O}_{d_{n}}}\Pi% _{n}\left(\bigl{\|}{W^{A,d_{0},q}-f_{n}^{*}}\bigr{\|}_{\infty}\leq 2% \varepsilon_{n}\right)dq.

Now, we want to bound from below the integrand on a significant subset of $\mathcal{O}_{d_{n}}$ . We remark that if $q\in\mathcal{O}_{d_{n}}$ is such that $\left\|{f_{n}^{*}-f_{n,q}}\right\|_{\infty}\leq\varepsilon_{n}$ , then

\Pi_{n}\left(\bigl{\|}{W^{A,d_{0},q}-f_{n}^{*}}\bigr{\|}_{\infty}\leq 2% \varepsilon_{n}\right)\ \geq\ \Pi_{n}\left(\bigl{\|}{W^{A,d_{0},q}-f_{n,q}}% \bigr{\|}_{\infty}\leq\varepsilon_{n}\right).

We show that the right-hand side is bounded from below by $\exp(-\frac{1}{2}n\varepsilon_{n}^{2})$ and then, we bound from below the measure of the set of $q\in\mathcal{O}_{d_{n}}$ satisfying $\left\|{f_{n}^{*}-f_{n,q}}\right\|_{\infty}\leq\varepsilon_{n}$ .

From now on, we use without specification the constants of Lemmas 5.5, 5.6, and 5.7 and we fix $a_{0}>1$ . Let $a\in[K_{n},2K_{n}]$ where $K_{n}=\left(\frac{2C_{f_{0}}}{\varepsilon_{n}}\right)^{1/\beta}$ . We suppose $n$ large enough so that $\varepsilon_{n}/2\ <\ \min(\varepsilon_{0}^{a_{0},d_{0}};C_{f_{0}}\cdot a_{0}^% {-\beta};1/2)$ . Then,

(5.6)

K_{n}\ >\ \left(\frac{C_{f_{0}}}{C_{f_{0}}\cdot a_{0}^{-\beta}}\right)^{1/% \beta}\ =\ a_{0},

and, because $a\geq\left(\frac{2C_{f_{0}}}{\varepsilon_{n}}\right)^{1/\beta}$ , we have

(5.7)

\frac{\varepsilon_{n}}{2}\ \geq\ C_{f_{0}}\cdot a^{-\beta}.

According to Lemma 5.3 in [VV08a], for $q\in\mathcal{O}_{d_{n}}$ , we can write

\begin{aligned}
\Pi_{n}\left(\bigl{\|}{W^{A,d_{0},q}-f_{n,q}}\bigr{\|}_{\infty}\leq\varepsilon_{n}\right)\ &\geq\ \int_{K_{n}}^{2K_{n}}\Pi_{n}\left(\bigl{\|}{W^{a,d_{0},q}-f_{n,q}}\bigr{\|}_{\infty}\leq\varepsilon_{n}\right)\pi_{n,d_{0}}(a)da\\
&\geq\ \int_{K_{n}}^{2K_{n}}\exp\left(-\phi_{f_{n,q}}^{a,d_{0},q}(\varepsilon_{n}/2)\right)\pi_{n,d_{0}}(a)da,
\end{aligned}

where $\phi_{f_{n,q}}^{a,d_{0},q}$ is the concentration function in (3.3). Now we want to control the concentration function using Lemmas 5.5 and 5.7. The inequality (5.6) and the previous restriction on $n$ ensure that the conditions of Lemma 5.7 are satisfied with $\varepsilon=\varepsilon_{n}/2$ , while (5.7) and Lemma 5.5 give

\inf\left\{\left\|{\overline{h}}\right\|_{\mathbb{H}_{a,d_{0},q}}^{2}:% \overline{h}\in\mathbb{H}_{a,d_{0},q},\ \left\|{\overline{h}-f_{n,q}}\right\|_% {\infty}\leq\varepsilon_{n}/2\right\}\ \leq\ D_{f_{0}}\cdot a^{d_{0}}.

Using the expression (3.3) of the concentration function, a combination of the two lemmas gives

\begin{aligned}
\phi_{f_{n,q}}^{a,d_{0},q}(\varepsilon_{n}/2)\ &\leq\ D_{f_{0}}\cdot a^{d_{0}}+C_{a_{0},d_{0}}\cdot a^{d_{0}}\log(2a/\varepsilon_{n})^{d_{0}+1}\\
&=\ \left(D_{f_{0}}\log(2a/\varepsilon_{n})^{-d_{0}-1}+C_{a_{0},d_{0}}\right)a^{d_{0}}\log(2a/\varepsilon_{n})^{d_{0}+1}\\
&\leq\ \left(D_{f_{0}}\log(a_{0})^{-d_{0}-1}+C_{a_{0},d_{0}}\right)a^{d_{0}}\log(2a/\varepsilon_{n})^{d_{0}+1},
\end{aligned}

where the last inequality holds because $a\geq a_{0}$ and $\varepsilon_{n}\leq 1$ for $n$ large enough. Let us define the constant $C_{f_{0},a_{0},d_{0}}:=D_{f_{0}}\log(a_{0})^{-d_{0}-1}+C_{a_{0},d_{0}}$ and note that there exists a constant $C^{\prime}_{f_{0}}$ such that for sufficiently large $n$ , $\log(4K_{n}/\varepsilon_{n})=\log\left(4(2C_{f_{0}})^{1/\beta}\varepsilon_{n}^% {-(1+1/\beta)}\right)\leq C^{\prime}_{f_{0}}\log(1/\varepsilon_{n})$ . Then, there exists a constant $C^{\prime}_{f_{0},a_{0},d_{0}}$ such that

\begin{aligned}
\int_{K_{n}}^{2K_{n}}\exp\left(-\phi_{f_{n,q}}^{a,d_{0},q}(\varepsilon_{n}/2)\right)\pi_{n,d_{0}}(a)da\ &\geq\ \int_{K_{n}}^{2K_{n}}\exp\left(-C_{f_{0},a_{0},d_{0}}\cdot a^{d_{0}}\log(2a/\varepsilon_{n})^{d_{0}+1}\right)\pi_{n,d_{0}}(a)da\\
&\geq\ \exp\left(-C_{f_{0},a_{0},d_{0}}(2K_{n})^{d_{0}}\log(4K_{n}/\varepsilon_{n})^{d_{0}+1}\right)\pi_{n,d_{0}}(2K_{n})\\
&\geq\ \exp\left(-C^{\prime}_{f_{0},a_{0},d_{0}}\cdot\varepsilon_{n}^{-d_{0}/\beta}\log(1/\varepsilon_{n})^{d_{0}+1}\right)\qquad(\text{Assumption }\ref{ass:resc}).
\end{aligned}

With the help of the reminder (5.1), we see that

\varepsilon_{n}^{-\tfrac{d_{0}}{\beta}}\ =\ C_{\varepsilon}^{-\tfrac{d_{0}}{% \beta}}\cdot n^{\tfrac{d_{0}}{2\beta+d_{0}}}\cdot(\log n)^{-\tfrac{(d_{0}+1)d_% {0}}{2\beta+d_{0}}}\quad\text{and}\quad\left(\log\frac{1}{\varepsilon_{n}}% \right)^{d_{0}+1}\ <\ (\log n)^{d_{0}+1},

hence $\varepsilon_{n}^{-d_{0}/\beta}\log(1/\varepsilon_{n})^{d_{0}+1}<C_{\varepsilon% }^{-d_{0}/\beta}n\varepsilon_{n}^{2}$ . Then, by choosing $C_{\varepsilon}$ such that $C_{\varepsilon}^{d_{0}/\beta}\geq 2C^{\prime}_{f_{0},a_{0},d_{0}}$ , we can achieve

(5.8)

\Pi_{n}\left(\bigl{\|}{W^{A,d_{0},q}-f_{n,q}}\bigr{\|}_{\infty}\leq\varepsilon% _{n}\right)\ \geq\ \exp\left(-\frac{C^{\prime}_{f_{0},a_{0},d_{0}}\cdot n% \varepsilon_{n}^{2}}{C_{\varepsilon}^{d_{0}/\beta}}\right)\ \geq\ \exp\left(-% \frac{1}{2}n\varepsilon_{n}^{2}\right).
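The exponent bookkeeping behind (5.8) can be made explicit. The following sketch uses only (5.1) and the definition of $\kappa$, and assumes $C_{\varepsilon}\geq 1$ for the last inequality.

```latex
\begin{aligned}
\frac{\kappa d_{0}}{\beta} &= \frac{(d_{0}+1)d_{0}}{2\beta+d_{0}},
\qquad
(d_{0}+1)-\frac{(d_{0}+1)d_{0}}{2\beta+d_{0}}
  = \frac{2\beta(d_{0}+1)}{2\beta+d_{0}} = 2\kappa,\\
\varepsilon_{n}^{-d_{0}/\beta}\cdot(\log n)^{d_{0}+1}
  &= C_{\varepsilon}^{-d_{0}/\beta}\cdot n^{\tfrac{d_{0}}{2\beta+d_{0}}}\cdot(\log n)^{2\kappa}
   = C_{\varepsilon}^{-d_{0}/\beta-2}\cdot n\varepsilon_{n}^{2}
   \ \leq\ C_{\varepsilon}^{-d_{0}/\beta}\cdot n\varepsilon_{n}^{2}.
\end{aligned}
```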

At this point, the problem amounts to bounding from below the measure of the set of $q\in\mathcal{O}_{d_{n}}$ satisfying $\left\|{f_{n}^{*}-f_{n,q}}\right\|_{\infty}\leq\varepsilon_{n}$. We denote this set by $\mathcal{A}_{\varepsilon_{n}}$. The core function $f_{0}$ is continuous on the compact subset $\mathbb{U}_{d_{0}}$, so there exists a constant $D_{1}>0$ such that $f_{0}$ is $\beta$-Hölder with Hölder constant $D_{1}$. Then, for all $q,q^{\prime}\in\mathcal{O}_{d_{n}}$,

\left\|{f_{n,q^{\prime}}-f_{n,q}}\right\|_{\infty}\ =\ \sup_{x\in\mathbb{U}_{d_{n}}}\left|f_{0}\left((q^{\prime}x)_{\mathbf{d_{0}}}\right)-f_{0}\left((qx)_{\mathbf{d_{0}}}\right)\right|\ \leq\ D_{1}\cdot{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|q^{\prime}-q\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{\beta}.

At first sight, it would suffice to compute the measure of a ball in $\mathcal{O}_{d_{n}}$ with radius $(\varepsilon_{n}/D_{1})^{1/\beta}$. Indeed, $B_{\mathcal{O}_{d_{n}}}\left(q_{n}^{*},\ (\varepsilon_{n}/D_{1})^{1/\beta}\right)\subset\mathcal{A}_{\varepsilon_{n}}$. However, this only allows a design dimension $d_{n}$ of order at most $n^{d_{0}/(4\beta+2d_{0})}$. To obtain $d_{n}$ of order $n^{d_{0}/(2\beta+d_{0})}$, we have to consider a larger subset.
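A heuristic way to see where the exponent $d_{0}/(4\beta+2d_{0})$ comes from is the following. It is only a sketch and rests on an assumption not made in the text: that the Haar measure of a ball of radius $r$ in $\mathcal{O}_{d_{n}}$, a manifold of dimension $d_{n}(d_{n}-1)/2$, behaves like $(cr)^{d_{n}(d_{n}-1)/2}$ for some unspecified constant $c$. Logarithmic factors are tracked only roughly.

```latex
\begin{aligned}
-\log\operatorname{\mathds{P}}\Bigl(\Theta\in B_{\mathcal{O}_{d_{n}}}\bigl(q_{n}^{*},r_{n}\bigr)\Bigr)
  &\approx \frac{d_{n}(d_{n}-1)}{2}\log\frac{1}{c\,r_{n}},
  \qquad r_{n}=\left(\frac{\varepsilon_{n}}{D_{1}}\right)^{1/\beta},
  \qquad \log\frac{1}{r_{n}}\asymp\log n,\\
d_{n}^{2}\log n\ \lesssim\ n\varepsilon_{n}^{2}
  &\ \Longrightarrow\
  d_{n}\ \lesssim\ \sqrt{\frac{n\varepsilon_{n}^{2}}{\log n}}
  \ \asymp\ n^{\tfrac{d_{0}}{2(2\beta+d_{0})}}\cdot(\log n)^{\kappa-\tfrac{1}{2}}
  \ =\ n^{\tfrac{d_{0}}{4\beta+2d_{0}}}\cdot(\log n)^{\kappa-\tfrac{1}{2}}.
\end{aligned}
```

By contrast, an exponent that is linear in $d_{n}$, such as $d_{0}(d_{n}-1)$, leads to a constraint of the form $d_{n}\log n\lesssim n\varepsilon_{n}^{2}$ and hence to the order $n^{d_{0}/(2\beta+d_{0})}$.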

Notation.

Let $F\subset\mathbb{R}^{d_{n}}$ be a linear subspace of $\mathbb{R}^{d_{n}}$ . We denote by $\mathcal{O}_{d_{n}}(F)$ the set of isometries that fix $F$ :

\mathcal{O}_{d_{n}}(F)\ :=\ \{q^{\prime}\in\mathcal{O}_{d_{n}}:q^{\prime}_{|F}% =\operatorname{Id}\}.

Then, for all $q^{\prime}\in\mathcal{O}_{d_{n}}((q_{n}^{*})^{-1}(E_{\mathbf{d_{0}}}))$ , we have

f_{n,q_{n}^{*}q^{\prime}}\ =\ f_{n,q_{n}^{*}}\circ q^{\prime}\ =\ f_{n,q_{n}^{% *}}\qquad\text{and}\qquad\left\|{f_{n}^{*}-f_{n,q}}\right\|_{\infty}\ =\ \left% \|{f_{n,q_{n}^{*}q^{\prime}}-f_{n,q}}\right\|_{\infty}\ \leq\ D_{1}\cdot{\left% |\kern-1.07639pt\left|\kern-1.07639pt\left|q^{*}_{n}q^{\prime}-q\right|\kern-1% .07639pt\right|\kern-1.07639pt\right|}^{\beta}.

For $\varepsilon>0$ , we define

\mathcal{Q}_{q_{n}^{*},\varepsilon}\ :=\ \{q\in\mathcal{O}_{d_{n}}:\exists q^{% \prime}\in\mathcal{O}_{d_{n}}((q_{n}^{*})^{-1}(E_{\mathbf{d_{0}}})),\ {\left|% \kern-1.07639pt\left|\kern-1.07639pt\left|q_{n}^{*}q^{\prime}-q\right|\kern-1.% 07639pt\right|\kern-1.07639pt\right|}\leq\varepsilon\}.

Then, $\mathcal{Q}_{q^{*}_{n},(\varepsilon_{n}/D_{1})^{1/\beta}}\subset\mathcal{A}_{% \varepsilon_{n}}$ . Since the Haar measure is translation invariant, it is sufficient to cover $\mathcal{O}_{d_{n}}$ with translations of $\mathcal{Q}_{q_{n}^{*},\varepsilon}$ to obtain a lower bound on the measure of $\mathcal{Q}_{q_{n}^{*},\varepsilon}$ , that is, to cover $\mathcal{O}_{d_{n}}$ with sets $\overline{q}\mathcal{Q}_{q_{n}^{*},\varepsilon}$ where $\overline{q}$ belongs to some net $\mathcal{R}\subset\mathcal{O}_{d_{n}}$ and then remark that $\operatorname{\mathds{P}}(\Theta\in\mathcal{Q}_{q_{n}^{*},\varepsilon})\geq 1/% \left|\mathcal{R}\right|$ .

Lemma 5.1.

We have,

\operatorname{\mathds{P}}(\Theta\in\mathcal{Q}_{q_{n}^{*},\varepsilon})\ \geq% \ \left(\frac{2}{\pi d_{n}}\right)^{\tfrac{d_{0}}{2}}\cdot\left(\frac{% \varepsilon}{16\sqrt{d_{0}d_{n}}}\right)^{d_{0}(d_{n}-1)}.

Proof of Lemma 5.1.

Let $q^{\prime\prime}\in\mathcal{O}_{d_{n}}$. The first step consists in constructing a net $\mathcal{R}\subset\mathcal{O}_{d_{n}}$ such that there exist $\overline{q}\in\mathcal{R}$ and $q\in\mathcal{Q}_{q_{n}^{*},\varepsilon}$ with $q^{\prime\prime}=\overline{q}q$. Let $(u_{1},\ldots,u_{d_{0}},u_{d_{0}+1},\ldots,u_{d_{n}})$ be an orthonormal basis adapted to the direct sum $\mathbb{R}^{d_{n}}=(q_{n}^{*})^{-1}(E_{\mathbf{d_{0}}})\ \overset{\perp}{\bigoplus}\ (q_{n}^{*})^{-1}(E_{1-\mathbf{d_{0}}})$. For every $d_{0}$-tuple of orthonormal vectors $g=(g_{1},\ldots,g_{d_{0}})$, we fix an isometry $r_{g}\in\mathcal{O}_{d_{n}}$ such that $r_{g}(q_{n}^{*}u_{i})=g_{i}$ for all $i\in\llbracket 1,d_{0}\rrbracket$. Moreover, we denote by $\mathcal{G}$ a set of $d_{0}$-tuples of orthonormal vectors in $\mathbb{R}^{d_{n}}$ such that, for every $d_{0}$-tuple $f=(f_{1},\ldots,f_{d_{0}})$ of orthonormal vectors, there exists $g\in\mathcal{G}$ satisfying

\sup_{i\in\llbracket 1,d_{0}\rrbracket}\left\|{g_{i}-f_{i}}\right\|\ \leq\ % \frac{\varepsilon}{2\sqrt{d_{0}d_{n}}}.

We claim that we can take $\mathcal{R}:=\{r_{g}:g\in\mathcal{G}\}$ . Indeed, there exists $g\in\mathcal{G}$ such that

\sup_{i\in\llbracket 1,d_{0}\rrbracket}\left\|{g_{i}-q^{\prime\prime}(u_{i})}% \right\|\ \leq\ \frac{\varepsilon}{2\sqrt{d_{0}d_{n}}}.

By Lemma 5.8, we can extend $g$ to an orthonormal basis of $\mathbb{R}^{d_{n}}$ such that

(5.9)

\sup_{j\in\llbracket 1,d_{n}\rrbracket}\left\|{g_{j}-q^{\prime\prime}(u_{j})}% \right\|\ \leq\ \frac{\varepsilon}{\sqrt{d_{n}}}.

Then, writing $\overline{q}=r_{g}$ and taking $q$ such that $q(u_{j})=r_{g}^{-1}(q^{\prime\prime}u_{j})$ for all $j\in\llbracket 1,d_{n}\rrbracket$ , we have $q^{\prime\prime}=\overline{q}q$ . Moreover, because $r_{g}^{-1}(g_{j})\in E_{1-\mathbf{d_{0}}}$ and $(q_{n}^{*})^{-1}r_{g}^{-1}(g_{j})\in(q_{n}^{*})^{-1}(E_{1-\mathbf{d_{0}}})$ for $j\in\llbracket d_{0}+1,d_{n}\rrbracket$ , we can define $q^{\prime}$ such that

\begin{cases}q^{\prime}(u_{i})=u_{i},&\text{if }i\in\llbracket 1,d_{0}% \rrbracket,\\ q^{\prime}(u_{j})=(q_{n}^{*})^{-1}r_{g}^{-1}(g_{j}),&\text{if }j\in\llbracket d% _{0}+1,d_{n}\rrbracket.\end{cases}

Then, we have $q^{\prime}\in\mathcal{O}_{d_{n}}((q_{n}^{*})^{-1}(E_{\mathbf{d_{0}}}))$ and according to (5.9),

\left\|{q_{n}^{*}q^{\prime}(u_{i})-q(u_{i})}\right\|\ =\ \left\|{q_{n}^{*}(u_{% i})-r_{g}^{-1}(q^{\prime\prime}u_{i})}\right\|\ =\ \left\|{r_{g}q_{n}^{*}(u_{i% })-q^{\prime\prime}(u_{i})}\right\|\ \leq\ \frac{\varepsilon}{\sqrt{d_{n}}},% \qquad\text{for }i\in\llbracket 1,d_{0}\rrbracket,

and,

\left\|{q_{n}^{*}q^{\prime}(u_{j})-q(u_{j})}\right\|\ =\ \left\|{r_{g}^{-1}(g_% {j})-q(u_{j})}\right\|\ =\ \left\|{g_{j}-r_{g}(qu_{j})}\right\|\ \leq\ \frac{% \varepsilon}{\sqrt{d_{n}}},\qquad\text{for }j\in\llbracket d_{0}+1,d_{n}\rrbracket.

So ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|q_{n}^{*}q^{\prime}-q\right|% \kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\varepsilon$ and the net $\mathcal{R}:=\{r_{g}:g\in\mathcal{G}\}$ is appropriate. Finally, by taking $\mathcal{G}$ as in Lemma 5.10, we obtain

\left|\mathcal{R}\right|\ \leq\ \left(\frac{\pi d_{n}}{2}\right)^{\tfrac{d_{0}% }{2}}\cdot\left(\frac{16\sqrt{d_{0}d_{n}}}{\varepsilon}\right)^{d_{0}(d_{n}-1)},

hence the result. ∎

Consequently, we have established that

\operatorname{\mathds{P}}(\Theta\in\mathcal{A}_{\varepsilon_{n}})\ \geq\ \left% (\frac{2}{\pi d_{n}}\right)^{\tfrac{d_{0}}{2}}\cdot\left(\left(\frac{% \varepsilon_{n}}{D_{1}}\right)^{\frac{1}{\beta}}\frac{1}{16\sqrt{d_{0}d_{n}}}% \right)^{d_{0}(d_{n}-1)}.

Recall that we have the following lower bound:

\Pi_{n}\left(\bigl{\|}{W^{A,\Gamma,\Theta}-f_{n}^{*}}\bigr{\|}_{\infty}\leq 2% \varepsilon_{n}\right)\ \geq\ \pi_{\Gamma}(d_{0})\cdot\operatorname{\mathds{P}% }(\Theta\in\mathcal{A}_{\varepsilon_{n}})\cdot\exp\left(-\frac{1}{2}n% \varepsilon_{n}^{2}\right).

In order to establish the prior mass condition, it suffices to derive the greatest design dimension $d_{n}$ for which we can reach

\operatorname{\mathds{P}}(\Theta\in\mathcal{A}_{\varepsilon_{n}})\ \geq\ \pi_{% \Gamma}(d_{0})^{-1}\exp\left(-\frac{1}{2}n\varepsilon_{n}^{2}\right).

For $n$ large enough, a design dimension $d_{n}$ as specified in Assumption 3.5 is appropriate for sufficiently small constant $C_{D}$ .
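The last claim can be sketched as follows. The constants $c$ and $c^{\prime}$ below are hypothetical placeholders depending on $\beta$, $d_{0}$, and $D_{1}$ only; we use that, for large $n$, both $\log(1/\varepsilon_{n})\leq\log n$ and $\log d_{n}\leq\log n$ hold under Assumption 3.5.

```latex
\begin{aligned}
-\log\operatorname{\mathds{P}}(\Theta\in\mathcal{A}_{\varepsilon_{n}})
  &\leq \frac{d_{0}}{2}\log\frac{\pi d_{n}}{2}
     + d_{0}(d_{n}-1)\left[\frac{1}{\beta}\log\frac{D_{1}}{\varepsilon_{n}}
     + \log\bigl(16\sqrt{d_{0}d_{n}}\bigr)\right]\\
  &\leq c\cdot\log n + c\cdot d_{n}\log n
   \ \leq\ c^{\prime}\cdot d_{n}\log n\\
  &\leq c^{\prime}\,C_{D}\cdot n^{\tfrac{d_{0}}{2\beta+d_{0}}}\cdot(\log n)^{2\kappa}
   \ =\ \frac{c^{\prime}\,C_{D}}{C_{\varepsilon}^{2}}\cdot n\varepsilon_{n}^{2}.
\end{aligned}
```

Since $\log\pi_{\Gamma}(d_{0})$ is a fixed constant while $n\varepsilon_{n}^{2}\to\infty$, choosing $C_{D}$ such that $c^{\prime}C_{D}/C_{\varepsilon}^{2}<1/2$ gives the required bound for $n$ large enough.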

Remark 5.2.

The exponent $d_{0}(d_{n}-1)$ in Lemma 5.1 is probably not far from optimal. Indeed, ignoring the constants, changing this exponent to $d_{n}^{\alpha}$ with $\alpha<1$ would lead to a growth rate of $n^{d_{0}/(\alpha(2\beta+d_{0}))}$ which, when $\beta$ is close to zero, is of order larger than $n$. Since the breakpoint of some popular subspace estimators, such as SIR, is of order $n$, it would be surprising to estimate a function faster than its central subspace.
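The growth rate quoted in this remark follows from the same bookkeeping as for Assumption 3.5. As a rough sketch, ignoring constants and assuming the hypothetical exponent $d_{n}^{\alpha}$ simply replaces $d_{0}(d_{n}-1)$:

```latex
\begin{aligned}
d_{n}^{\alpha}\log n\ \lesssim\ n\varepsilon_{n}^{2}
  &\ \Longrightarrow\
  d_{n}\ \lesssim\ n^{\tfrac{d_{0}}{\alpha(2\beta+d_{0})}}\cdot(\log n)^{\tfrac{2\kappa-1}{\alpha}},\\
\beta\to 0
  &:\quad \frac{d_{0}}{\alpha(2\beta+d_{0})}\ \to\ \frac{1}{\alpha}\ >\ 1.
\end{aligned}
```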

5.1.3 Sieve condition (5.3)

The second condition can be verified as in [JT21]. As in the previous section, we first treat the case of deterministic rescaling parameter, dimension, and direction, and then integrate with respect to $A$, $\Gamma$, and $\Theta$.

We suppose that $n$ is large enough so that $d_{n}>d_{\mathrm{max}}$ . We introduce the quantities $M_{n}:=C_{M}\sqrt{n\varepsilon_{n}^{2}}$ for some large constant $C_{M}$ and, for $1\leq b\leq d_{\mathrm{max}}$ , the quantity $r_{n,b}$ such that $r_{n,b}^{b}(\log n)^{b+1}=C_{r}n\varepsilon_{n}^{2}$ , for a large constant $C_{r}$ . The sieve $\mathbb{B}_{n}$ is defined as follows:

\mathbb{B}_{n}\ :=\ \bigcup_{q\in\mathcal{O}_{d_{n}}}\mathcal{B}_{n,q},

with

\mathcal{B}_{n,q}\ :=\ \bigcup_{b=1}^{d_{\mathrm{max}}}\mathcal{B}_{n,b,q}% \qquad\text{and}\quad\mathcal{B}_{n,b,q}\ :=\ M_{n}\sqrt{r_{n,b}}\cdot\mathbb{% H}_{1}^{r_{n,b},b,q}+\varepsilon_{n}B_{1},

where $B_{1}$ is the unit ball in the Banach space $(\mathcal{C}^{0}(\mathbb{U}_{d_{n}}),\left\|{\cdot}\right\|_{\infty})$ .

The nesting property of Lemma 4.7 in [VV09] remains true in the present setting, that is, for $a\leq a^{\prime}$ ,

\sqrt{a}\cdot\mathbb{H}_{1}^{a,b,q}\ \subseteq\ \sqrt{a^{\prime}}\cdot\mathbb{% H}_{1}^{a^{\prime},b,q}.

Consequently, if $1\leq a\leq r_{n,b}$ , then

M_{n}\mathbb{H}_{1}^{a,b,q}+\varepsilon_{n}B_{1}\ \subseteq\ M_{n}\sqrt{\frac{% r_{n,b}}{a}}\cdot\mathbb{H}_{1}^{r_{n,b},b,q}+\varepsilon_{n}B_{1}\ \subseteq% \ \mathcal{B}_{n,b,q}.

By Borell’s inequality (see [VV08a], Theorem 5.1, or [Bor75]), for every $a\in[1,r_{n,b}]$ ,

\begin{aligned}
\Pi_{n}(W^{a,b,q}\notin\mathbb{B}_{n})\ &\leq\ \Pi_{n}(W^{a,b,q}\notin\mathcal{B}_{n,b,q})\\
&\leq\ \Pi_{n}(W^{a,b,q}\notin M_{n}\mathbb{H}_{1}^{a,b,q}+\varepsilon_{n}B_{1})\\
&\leq\ 1-\Phi\left(\Phi^{-1}\left(\Pi_{n}\left(\left\|{W^{a,b,q}}\right\|_{\infty}\leq\varepsilon_{n}\right)\right)+M_{n}\right),
\end{aligned}

where $\Phi$ is the cumulative distribution function of the standard normal distribution. Now, because

\Pi_{n}\left(\left\|{W^{a,b,q}}\right\|_{\infty}\leq\varepsilon_{n}\right)\ % \geq\ \Pi_{n}\left(\left\|{W^{r_{n,b},b,q}}\right\|_{\infty}\leq\varepsilon_{n% }\right)\ =\ \exp\left(-\phi_{0}^{r_{n,b},b,q}(\varepsilon_{n})\right),

we have

\Pi_{n}(W^{a,b,q}\notin\mathbb{B}_{n})\ \leq\ 1-\Phi\left(\Phi^{-1}\left(e^{-% \phi_{0}^{r_{n,b},b,q}(\varepsilon_{n})}\right)+M_{n}\right).

For $n$ large enough, we have $\varepsilon_{n}\leq\min\{\varepsilon_{0}^{a_{0},b}:b\in\llbracket 1,d_{\mathrm% {max}}\rrbracket\}$ and $r_{n,b}\geq a_{0}$ , so according to Lemma 5.7 and because $b\leq d_{\mathrm{max}}$ , we have

\phi_{0}^{r_{n,b},b,q}(\varepsilon_{n})\ \lesssim\ r_{n,b}^{b}\log\left(\frac{% r_{n,b}}{\varepsilon_{n}}\right)^{b+1}\ \lesssim\ r_{n,b}^{b}(\log n)^{b+1}\ % \lesssim\ n\varepsilon_{n}^{2},

for sufficiently large $n$. Hence, taking $M_{n}^{2}$ to be a sufficiently large multiple of $n\varepsilon_{n}^{2}$ ensures that $M_{n}\geq 4\sqrt{\phi_{0}^{r_{n,b},b,q}(\varepsilon_{n})}$. The second assertion of Lemma 4.10 in [VV09] then gives $M_{n}\geq-2\Phi^{-1}\left(\exp\bigl(-\phi_{0}^{r_{n,b},b,q}(\varepsilon_{n})\bigr)\right)$, which leads to the upper bound

\Pi_{n}(W^{a,b,q}\notin\mathbb{B}_{n})\ \leq\ 1-\Phi(M_{n}/2)\ \leq\ \exp(-M_{n}^{2}/8).
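As a numerical sanity check of the last two steps, the following Python sketch (standard library only) evaluates the right-hand side of Borell's inequality with $M=4\sqrt{\phi_{0}}$ and compares it with $1-\Phi(M/2)$ and $\exp(-M^{2}/8)$. The function names and the sample values of $\phi_{0}$ are illustrative assumptions and do not come from the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used only for the quantile Phi^{-1}


def upper_tail(x: float) -> float:
    """Return 1 - Phi(x), via erfc so that tiny tails are not rounded to 0."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def borell_bound(phi0: float, m: float) -> float:
    """Return 1 - Phi(Phi^{-1}(exp(-phi0)) + m)."""
    return upper_tail(STD_NORMAL.inv_cdf(math.exp(-phi0)) + m)


def check_chain(phi0: float) -> bool:
    """Check borell_bound <= 1 - Phi(M/2) <= exp(-M^2/8) for M = 4 sqrt(phi0)."""
    m = 4.0 * math.sqrt(phi0)
    return borell_bound(phi0, m) <= upper_tail(m / 2.0) <= math.exp(-m * m / 8.0)


# Hypothetical values of the small-ball exponent phi0, chosen for illustration.
for phi0 in (0.5, 2.0, 10.0, 50.0):
    print(phi0, check_chain(phi0))
```

The check relies only on the elementary Gaussian tail bound $1-\Phi(x)\leq\exp(-x^{2}/2)$ for $x\geq 0$, applied with $x=M/2$.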

Taking into account the random rescaling parameter $A$ , we have, for sufficiently large $n$ ,

\begin{aligned}
\Pi_{n}(W^{A,b,q}\notin\mathbb{B}_{n})\ &\leq\ \int_{c}^{r_{n,b}}\Pi_{n}(W^{a,b,q}\notin\mathbb{B}_{n})\pi_{n,b}(a)\,da\ +\ \pi_{n,b}(A\geq r_{n,b})\\
&\leq\ \exp(-M_{n}^{2}/8)\ +\ D_{2}\int_{r_{n,b}}^{\infty}\exp\left(-C_{2}a^{b}(\log a)^{b+1}\right)da\qquad(\text{by the assumption on }\pi_{n,b})\\
&\leq\ \exp(-M_{n}^{2}/8)\ +\ D_{2}\int_{r_{n,b}}^{\infty}C_{2}a^{b-1}((b+1)\log^{b}a+b\log^{b+1}a)\exp\left(-C_{2}a^{b}(\log a)^{b+1}\right)da\\
&\leq\ \exp(-M_{n}^{2}/8)\ +\ D_{2}\exp\left(-C_{2}r_{n,b}^{b}(\log r_{n,b})^{b+1}\right)\\
&\leq\ \frac{1}{2}\exp(-5n\varepsilon_{n}^{2})\ +\ \frac{1}{2}\exp(-5n\varepsilon_{n}^{2})\\
&=\ \exp(-5n\varepsilon_{n}^{2}),
\end{aligned}

where the last inequality holds because $C_{r}$ and $C_{M}$ are assumed to be large enough.
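The third and fourth lines of this chain use that, with $g(a)=C_{2}a^{b}(\log a)^{b+1}$, the integrand $\exp(-g)$ is bounded by $g^{\prime}\exp(-g)$ as soon as $g^{\prime}\geq 1$, and that $g^{\prime}\exp(-g)$ integrates exactly to $\exp(-g(r_{n,b}))$. A minimal Python sketch of this step, with hypothetical values $b=2$, $C_{2}=1$, $r_{n,b}=3$ and a truncated midpoint rule standing in for the integral:

```python
import math


def exponent(a: float, b: int, c2: float) -> float:
    """g(a) = C_2 a^b (log a)^(b+1), the exponent in the assumed prior tail."""
    return c2 * a**b * math.log(a) ** (b + 1)


def exponent_derivative(a: float, b: int, c2: float) -> float:
    """g'(a) = C_2 a^(b-1) ((b+1) log^b a + b log^(b+1) a)."""
    la = math.log(a)
    return c2 * a ** (b - 1) * ((b + 1) * la**b + b * la ** (b + 1))


def tail_integral(r: float, b: int, c2: float,
                  width: float = 20.0, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of exp(-g) over [r, r + width].

    The contribution beyond r + width is numerically negligible here.
    """
    h = width / steps
    return h * sum(math.exp(-exponent(r + (k + 0.5) * h, b, c2))
                   for k in range(steps))


# Hypothetical values, chosen only for illustration.
r, b, c2 = 3.0, 2, 1.0
print("g'(r) =", exponent_derivative(r, b, c2))
print("integral of exp(-g) ~", tail_integral(r, b, c2))
print("exp(-g(r))          =", math.exp(-exponent(r, b, c2)))
```

With these values $g^{\prime}(r)$ is well above 1, so the approximate tail integral falls below $\exp(-g(r))$, as in the display.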

Now considering the prior on the sparsity pattern, we obtain

\Pi_{n}(W^{A,\Gamma,\Theta}\notin\mathbb{B}_{n})\ \leq\ \sum_{b=1}^{d_{\mathrm{max}}}\Pi_{n}(\Gamma=b)\int_{\mathcal{O}_{d_{n}}}\Pi_{n}(W^{A,b,q}\notin\mathbb{B}_{n})\,dq\ \leq\ \exp(-5n\varepsilon_{n}^{2}).

5.1.4 Entropy condition (5.4)

We use again the notation and quantities of the previous section. According to Lemma 5.6, for all $q\in\mathcal{O}_{d_{n}}$ and $b\in\llbracket 1,d_{\mathrm{max}}\rrbracket$ , the metric entropy of $\mathcal{B}_{n,b,q}$ is bounded as:

\begin{aligned}
\log N\left(2\varepsilon_{n},M_{n}\sqrt{r_{n,b}}\mathbb{H}_{1}^{r_{n,b},b,q}+\varepsilon_{n}B_{1},\left\|{\cdot}\right\|_{\infty}\right)\ &\leq\ \log N\left(\varepsilon_{n},M_{n}\sqrt{r_{n,b}}\mathbb{H}_{1}^{r_{n,b},b,q},\left\|{\cdot}\right\|_{\infty}\right)\\
&\lesssim\ r_{n,b}^{b}\log\left(M_{n}\sqrt{r_{n,b}}\varepsilon_{n}^{-1}\right)^{b+1}.
\end{aligned}

The simple estimate $\log\left(M_{n}\sqrt{r_{n,b}}\varepsilon_{n}^{-1}\right)\asymp\log n$ then gives

(5.10)

\log N\left(2\varepsilon_{n},\mathcal{B}_{n,b,q},\left\|{\cdot}\right\|_{\infty}\right)\ \lesssim\ n\varepsilon_{n}^{2}.

The metric entropy of $\mathcal{B}_{n,q}$ is derived as follows:

N\left(2\varepsilon_{n},\mathcal{B}_{n,q},\left\|{\cdot}\right\|_{\infty}\right)\ \leq\ \sum_{b=1}^{d_{\mathrm{max}}}N\left(2\varepsilon_{n},\mathcal{B}_{n,b,q},\left\|{\cdot}\right\|_{\infty}\right)\ \leq\ d_{\mathrm{max}}\max_{1\leq b\leq d_{\mathrm{max}}}N\left(2\varepsilon_{n},\mathcal{B}_{n,b,q},\left\|{\cdot}\right\|_{\infty}\right).

To extend these inequalities to the full sieve, we need the following lemma from [Tok11].

Lemma 5.2 (Tokdar 2011, Lemma 1).

Let $a>0$ , $b<d_{n}$ and $q,\tilde{q}\in\mathcal{O}_{d_{n}}$ . Then

\mathbb{H}_{1}^{a,b,q}\ \subseteq\ \mathbb{H}_{1}^{a,b,\tilde{q}}\ +\ a\sqrt{2b}\cdot{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|q-\tilde{q}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}B_{1},

where $B_{1}$ is the unit ball in $(\mathcal{C}^{0}(\mathbb{U}_{d_{n}}),\left\|{\cdot}\right\|_{\infty})$ .

By examining the representation result in (3.2) for $\mathbb{H}_{a,b,q}$, we see that, for all $q^{\prime}\in\mathcal{O}_{d_{n}}(q^{-1}(E_{\mathbf{b}}))$, we have $\mathbb{H}_{a,b,q}=\mathbb{H}_{a,b,qq^{\prime}}$. Hence, applying Lemma 5.2 to $qq^{\prime}$ and $\tilde{q}$ gives

\mathbb{H}_{1}^{a,b,q}\ =\ \mathbb{H}_{1}^{a,b,qq^{\prime}}\ \subseteq\ \mathbb{H}_{1}^{a,b,\tilde{q}}\ +\ a\sqrt{2b}\cdot{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|qq^{\prime}-\tilde{q}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}B_{1}.

If $\mathcal{R}_{n}$ is a net over $\mathcal{O}_{d_{n}}$ such that for all $q\in\mathcal{O}_{d_{n}}$, there exist $q^{\prime}\in\mathcal{O}_{d_{n}}(q^{-1}(E_{\mathbf{d_{0}}}))$ and $\overline{q}\in\mathcal{R}_{n}$ with ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|qq^{\prime}-\overline{q}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\zeta_{n}$, where $\zeta_{n}$ is the minimum of $\varepsilon_{n}/(M_{n}r_{n,b}^{3/2}\sqrt{2d_{n}})$ when $b$ runs through $\llbracket 1,d_{\mathrm{max}}\rrbracket$, then

\begin{aligned}
M_{n}\sqrt{r_{n,b}}\cdot\mathbb{H}_{1}^{r_{n,b},b,q}\ &\subseteq\ M_{n}\sqrt{r_{n,b}}\cdot\mathbb{H}_{1}^{r_{n,b},b,\overline{q}}\ +\ M_{n}r_{n,b}^{3/2}\sqrt{2b}\cdot{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|qq^{\prime}-\overline{q}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}B_{1}\\
&\subseteq\ M_{n}\sqrt{r_{n,b}}\cdot\mathbb{H}_{1}^{r_{n,b},b,\overline{q}}\ +\ \varepsilon_{n}B_{1}\\
&=\ \mathcal{B}_{n,b,\overline{q}}.
\end{aligned}

This clearly implies

\mathcal{B}_{n,q}\ \subseteq\ \mathcal{B}_{n,\overline{q}}+\varepsilon_{n}B_{1},

and hence

\mathbb{B}_{n}\ =\ \bigcup_{q\in\mathcal{O}_{d_{n}}}\mathcal{B}_{n,q}\ \subseteq\ \bigcup_{\overline{q}\in\mathcal{R}_{n}}\left(\mathcal{B}_{n,\overline{q}}+\varepsilon_{n}B_{1}\right).

Consequently, the $3\varepsilon_{n}$-covering number of $\mathbb{B}_{n}$ can be bounded by the cardinality of the net $\mathcal{R}_{n}$ times $d_{\mathrm{max}}$ times the maximal $2\varepsilon_{n}$-covering number of the sets $\mathcal{B}_{n,b,\overline{q}}$:

\begin{aligned}
N\left(3\varepsilon_{n},\mathbb{B}_{n},\left\|{\cdot}\right\|_{\infty}\right)\ &\leq\ \sum_{\overline{q}\in\mathcal{R}_{n}}N\left(3\varepsilon_{n},\mathcal{B}_{n,\overline{q}}+\varepsilon_{n}B_{1},\left\|{\cdot}\right\|_{\infty}\right)\\
&\leq\ \sum_{\overline{q}\in\mathcal{R}_{n}}N\left(2\varepsilon_{n},\mathcal{B}_{n,\overline{q}},\left\|{\cdot}\right\|_{\infty}\right)\\
&\leq\ \left|\mathcal{R}_{n}\right|\cdot d_{\mathrm{max}}\max_{\begin{subarray}{c}1\leq b\leq d_{\mathrm{max}}\\ \overline{q}\in\mathcal{R}_{n}\end{subarray}}N\left(2\varepsilon_{n},\mathcal{B}_{n,b,\overline{q}},\left\|{\cdot}\right\|_{\infty}\right).
\end{aligned}

It only remains to bound the cardinality of $\mathcal{R}_{n}$.

Lemma 5.3.

For $\zeta>0$ , there exists a net $\mathcal{R}$ over $\mathcal{O}_{d_{n}}$ such that

\bigcup_{\overline{q}\in\mathcal{R}}\mathcal{A}_{\overline{q}}\ =\ \mathcal{O}_{d_{n}},

where

\mathcal{A}_{\overline{q}}\ :=\ \{q\in\mathcal{O}_{d_{n}}\ |\ \exists q^{\prime}\in\mathcal{O}_{d_{n}}(q^{-1}(E_{\mathbf{d_{0}}})),\ {\left|\kern-1.07639pt\left|\kern-1.07639pt\left|qq^{\prime}-\overline{q}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\zeta\},

and such that

\left|\mathcal{R}\right|\ \leq\ \left(\frac{\pi\sqrt{d_{0}d_{n}}}{2}\right)^{d_{0}}\left(\frac{16\sqrt{d_{0}d_{n}}}{\zeta}\right)^{d_{0}(d_{n}+d_{0}-2)}.

Proof.

First, we remark that

\mathcal{A}_{\overline{q}}\ =\ \bigl\{q\in\mathcal{O}_{d_{n}}\ |\ \exists q^{\prime\prime}\in\mathcal{O}_{d_{n}},\ q^{\prime\prime}_{|q^{-1}(E_{\mathbf{d_{0}}})}=q_{|q^{-1}(E_{\mathbf{d_{0}}})}\ \text{and}\ {\left|\kern-1.07639pt\left|\kern-1.07639pt\left|q^{\prime\prime}-\overline{q}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\zeta\bigr\}.

Thus, for $q\in\mathcal{O}_{d_{n}}$, we seek to construct $\overline{q}$ such that there exists $q^{\prime\prime}\in\mathcal{O}_{d_{n}}$ satisfying $q^{\prime\prime}_{|q^{-1}(E_{\mathbf{d_{0}}})}=q_{|q^{-1}(E_{\mathbf{d_{0}}})}$ and ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|q^{\prime\prime}-\overline{q}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\zeta$. Let $(u_{1},\ldots,u_{d_{0}},u_{d_{0}+1},\ldots,u_{d_{n}})$ be an orthonormal basis adapted to the direct sum $\mathbb{R}^{d_{n}}=q^{-1}(E_{\mathbf{d_{0}}})\ \overset{\perp}{\bigoplus}\ q^{-1}(E_{1-\mathbf{d_{0}}})$. We introduce a set $\mathcal{F}$ of orthonormal bases of $E_{\mathbf{d_{0}}}$ such that, for every orthonormal basis $f^{\prime}$ of $E_{\mathbf{d_{0}}}$, there exists $f\in\mathcal{F}$ such that

\sup_{i\in\llbracket 1,d_{0}\rrbracket}\left\|{f_{i}-f^{\prime}_{i}}\right\|\ \leq\ \frac{\zeta}{2\sqrt{d_{0}d_{n}}},

and we reuse the set $\mathcal{G}$ of Lemma 5.1, replacing $\varepsilon$ by $\zeta$ . For all $g\in\mathcal{G}$ and $f\in\mathcal{F}$ , we fix an isometry $r_{g,f}\in\mathcal{O}_{d_{n}}$ such that $r_{g,f}(g_{i})=f_{i}$ , for all $i\in\llbracket 1,d_{0}\rrbracket$ . By construction, there exist $f\in\mathcal{F}$ and $g\in\mathcal{G}$ such that

\sup_{i\in\llbracket 1,d_{0}\rrbracket}\left\|{f_{i}-q(u_{i})}\right\|\ \leq\ \frac{\zeta}{2\sqrt{d_{0}d_{n}}}\quad\text{and}\quad\sup_{i\in\llbracket 1,d_{0}\rrbracket}\left\|{g_{i}-u_{i}}\right\|\ \leq\ \frac{\zeta}{2\sqrt{d_{0}d_{n}}}.

Then we choose $\overline{q}=r_{g,f}$. Using Lemma 5.8, we extend $g$ to an orthonormal basis of $\mathbb{R}^{d_{n}}$ such that $\sup_{j\in\llbracket 1,d_{n}\rrbracket}\left\|{g_{j}-u_{j}}\right\|\leq\zeta/\sqrt{d_{n}}$ and we define $f_{j}:=r_{g,f}(g_{j})\in E_{\mathbf{d_{0}}}^{\perp}$, for $j\in\llbracket d_{0}+1,d_{n}\rrbracket$. Now we choose $q^{\prime\prime}\in\mathcal{O}_{d_{n}}$ such that

\begin{cases}q^{\prime\prime}(u_{i})=q(u_{i})&\text{if }i\in\llbracket 1,d_{0}\rrbracket,\\ q^{\prime\prime}(u_{j})=f_{j}&\text{if }j\in\llbracket d_{0}+1,d_{n}\rrbracket.\end{cases}

This leads to $\left\|{q^{\prime\prime}(u_{j})-\overline{q}(u_{j})}\right\|\leq\zeta/\sqrt{d_{n}}$, for all $j\in\llbracket 1,d_{n}\rrbracket$, hence ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|q^{\prime\prime}-\overline{q}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\zeta$. We can thus define the net $\mathcal{R}$ as the set of all isometries $r_{g,f}$ for $g\in\mathcal{G}$ and $f\in\mathcal{F}$. According to Lemma 5.10, this yields the upper bound

\begin{aligned}
\left|\mathcal{R}\right|\ &=\ \left|\mathcal{G}\right|\cdot\left|\mathcal{F}\right|\\
&\leq\ \left(\frac{\pi d_{n}}{2}\right)^{\tfrac{d_{0}}{2}}\left(\frac{16\sqrt{d_{0}d_{n}}}{\zeta}\right)^{d_{0}(d_{n}-1)}\left(\frac{\pi d_{0}}{2}\right)^{\tfrac{d_{0}}{2}}\left(\frac{16\sqrt{d_{0}d_{n}}}{\zeta}\right)^{d_{0}(d_{0}-1)}.\qed
\end{aligned}
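The product of the two factors above can be checked against the closed form stated in Lemma 5.3. The Python sketch below works on the logarithmic scale to avoid overflow; the function names and the sample values of $d_{0}$, $d_{n}$ and $\zeta$ are illustrative assumptions.

```python
import math


def log_lemma_5_10_bound(n: int, d: int, eps: float) -> float:
    """log of (pi n / 2)^(d/2) (8 / eps)^(d (n - 1)), the bound of Lemma 5.10."""
    return 0.5 * d * math.log(math.pi * n / 2.0) + d * (n - 1) * math.log(8.0 / eps)


def log_net_bound(d0: int, dn: int, zeta: float) -> float:
    """log of the bound on |R| stated in Lemma 5.3."""
    root = math.sqrt(d0 * dn)
    return (d0 * math.log(math.pi * root / 2.0)
            + d0 * (dn + d0 - 2) * math.log(16.0 * root / zeta))


def log_net_bound_from_factors(d0: int, dn: int, zeta: float) -> float:
    """log(|G| bound) + log(|F| bound), with eps = zeta / (2 sqrt(d0 dn))."""
    eps = zeta / (2.0 * math.sqrt(d0 * dn))
    return log_lemma_5_10_bound(dn, d0, eps) + log_lemma_5_10_bound(d0, d0, eps)


# Hypothetical dimensions and net resolution, chosen only for illustration.
for d0, dn, zeta in ((1, 2, 0.5), (2, 10, 0.1), (3, 50, 0.01)):
    print(d0, dn, zeta, log_net_bound(d0, dn, zeta),
          log_net_bound_from_factors(d0, dn, zeta))
```

The two printed values coincide, and the logarithm of the bound grows like $d_{0}d_{n}\log(\sqrt{d_{0}d_{n}}/\zeta)$, which is the term that has to be compared with $n\varepsilon_{n}^{2}$.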

Since the constant hidden in the upper bound (5.10) does not depend on $q$, we can write

\max_{\begin{subarray}{c}1\leq b\leq d_{\mathrm{max}}\\ \overline{q}\in\mathcal{R}_{n}\end{subarray}}\log N\left(2\varepsilon_{n},\mathcal{B}_{n,b,\overline{q}},\left\|{\cdot}\right\|_{\infty}\right)\ \lesssim\ n\varepsilon_{n}^{2}.

Then, taking logarithms and applying Lemma 5.3 with $\zeta=\zeta_{n}$ yields the following inequality:

\log N\left(3\varepsilon_{n},\mathbb{B}_{n},\left\|{\cdot}\right\|_{\infty}\right)\ \lesssim\ d_{0}\log\left(\frac{\pi\sqrt{d_{0}d_{n}}}{2}\right)+d_{0}(d_{n}+d_{0}-2)\log\left(\frac{16M_{n}r_{n}^{3/2}d_{n}\sqrt{2d_{0}}}{\varepsilon_{n}}\right)+\log d_{\mathrm{max}}+n\varepsilon_{n}^{2},

where $r_{n}:=\max\{r_{n,b}:b\in\llbracket 1,d_{\mathrm{max}}\rrbracket\}$, which, for sufficiently large $n$, gives the desired result.

5.2 Proof of Theorem 4.1

5.2.1 Case $\Gamma<d_{0}$

The idea of the proof is to show that the non-constancy of $p_{0}$ in all directions results in a significant difference (in the Hellinger sense) between the true density $p^{*}$ and any density that is more parsimonious than $p^{*}$. If this difference can be bounded from below, then the set of over-parsimonious densities is expected to have an almost-null posterior mass as soon as the contraction rate falls below the lower bound.

Let $q\in\mathcal{O}_{d}$ and let $\tilde{p}$ be a density that satisfies the model with parameters $\Gamma$ and $q$ . Then, $\tilde{p}$ is constant on $q^{-1}(E_{1-\boldsymbol{\Gamma}})+x$ , for any $x\in\mathbb{U}_{d}$ . Moreover, the intersection between $\mathcal{S}$ and $q^{-1}(E_{1-\boldsymbol{\Gamma}})$ is non-null so $\tilde{p}_{|\mathcal{S}}$ is constant in at least one direction, say $\boldsymbol{\Delta}\in\mathcal{S}$ . We will use Assumption 4.1 and integrate the Hellinger distance over a small square inside the region where $p^{*}$ is non-constant in $\boldsymbol{\Delta}$ . As usual, we denote $\Delta:=\operatorname{Span}(\boldsymbol{\Delta})$ .

Let us introduce the operator

\begin{aligned}
\Psi:\ \mathbb{R}^{d_{0}}\ &\to\ \mathcal{S}\\
x\ &\mapsto\ (q^{*})^{-1}(x^{\mathbf{d_{0}}}).
\end{aligned}

In particular, we have $p^{*}\circ\Psi=p_{0}$ . We use the notation of Assumption 4.1 with $\Psi^{-1}(\boldsymbol{\Delta})$ instead of $\boldsymbol{\Delta}$ .

Let $(\boldsymbol{\Delta},u_{1},\ldots,u_{d_{0}-1};v_{1},\ldots,v_{d-d_{0}})$ be an orthonormal basis adapted to the direct sum $\mathbb{R}^{d}=\Delta\oplus(\Delta^{\perp}\cap\mathcal{S})\oplus\mathcal{S}^{\perp}$ and let $R$ be a solid square with edges parallel to this basis, of size $L/\sqrt{d}$ and centered on $\Psi(o)$. Then, $R\subset\mathcal{B}_{d}(L/2)+\Psi(o)$ and the inequality of Assumption 4.1 is valid when $t\in R$. In the basis introduced above, integrating over $R$ amounts to integrating with respect to each variable. To simplify, we bundle these variables into three groups: a variable $\delta$ parallel to $\Delta$, a variable $u$ parallel to $\Delta^{\perp}\cap\mathcal{S}$ and a variable $v$ parallel to $\mathcal{S}^{\perp}$. In this coordinate system, we can write $\Psi(o)=(\Psi(o)_{1},\Psi(o)_{2},0)$ and we have $p^{*}(\delta,u,v)=p_{0}(\Psi^{-1}(\delta,u,0))$. Then

\begin{aligned}
h^{2}(p^{*}_{|R}\,;\,\tilde{p}_{|R})\ &=\ \iiint_{R}\left|\sqrt{p^{*}(\delta,u,0)}-\sqrt{\tilde{p}(0,u,v)}\right|^{2}d\delta\,du\,dv\\
&=\ \iint\left(\int\left|\sqrt{p_{0}(\Psi^{-1}(\delta,u,0))}-\sqrt{\tilde{p}(0,u,v)}\right|^{2}d\delta\right)du\,dv\\
&=\ \iint h^{2}\left({p_{0}}_{|I_{u}}\,;\,\tilde{p}(0,u,v)\right)du\,dv,
\end{aligned}

where $I_{u}$ is the inverse image via $\Psi$ of the range of the integral in $\delta$ . Hence

\Psi(I_{u})\ =\ (\Psi(o)_{1},u,0)\ +\ \Big]-\frac{L}{2\sqrt{d}}\boldsymbol{\Delta}\,;\,\frac{L}{2\sqrt{d}}\boldsymbol{\Delta}\Big[\qquad\text{with }u\in\Psi(o)_{2}\ +\ \Big]-\frac{L}{2\sqrt{d}}\,;\,\frac{L}{2\sqrt{d}}\Big[^{\,d_{0}-1}.

Then because $\Psi^{-1}(\Psi(o)_{1},u,0)\in o+\mathcal{B}_{d_{0}}(L/2)$ , there exists $t\in\mathcal{B}_{d_{0}}(L/2)$ such that

(5.11)

I_{u}\ =\ o\ +\ t\ +\ \Big]-\frac{L}{2\sqrt{d}}\Psi^{-1}(\boldsymbol{\Delta})\,;\,\frac{L}{2\sqrt{d}}\Psi^{-1}(\boldsymbol{\Delta})\Big[.

Now we can use Assumption 4.1 and bound from below the Hellinger distance in the last integral, which gives

h^{2}(p^{*}_{|R}\,;\,\tilde{p}_{|R})\ \geq\ \iint D\cdot\frac{L^{2}}{d}\,du\,dv\ =\ D\cdot\left(\frac{L}{\sqrt{d}}\right)^{d+1}.

Finally, $\Pi_{n}(\Gamma<d_{0}\ |\ X_{1},\ldots,X_{n})=0$ as soon as the contraction rate achieves $\varepsilon_{n}\leq\sqrt{D}\left(\frac{L}{\sqrt{d}}\right)^{\frac{d+1}{2}}$ .
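To give an idea of how this threshold behaves with the ambient dimension, the following Python sketch evaluates the lower bound $D\cdot(L/\sqrt{d})^{d+1}$ and the corresponding threshold $\sqrt{D}\,(L/\sqrt{d})^{(d+1)/2}$ on the rate. The values $D=1$ and $L=0.5$ are hypothetical and only serve the illustration.

```python
import math


def hellinger_sq_lower_bound(D: float, L: float, d: int) -> float:
    """D (L / sqrt(d))^(d + 1): lower bound on the squared Hellinger distance over R."""
    return D * (L / math.sqrt(d)) ** (d + 1)


def rate_threshold(D: float, L: float, d: int) -> float:
    """sqrt(D) (L / sqrt(d))^((d + 1) / 2): the rate must fall below this value."""
    return math.sqrt(D) * (L / math.sqrt(d)) ** ((d + 1) / 2.0)


# Hypothetical constants of Assumption 4.1, chosen only for illustration.
D, L = 1.0, 0.5
for d in (2, 5, 10, 20):
    print(d, hellinger_sq_lower_bound(D, L, d), rate_threshold(D, L, d))
```

For fixed $D$ and $L$, the threshold decays faster than exponentially in $d$, so the condition on $\varepsilon_{n}$ becomes increasingly demanding as $d$ grows.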

5.2.2 Case $\Gamma=d_{0}$

Case $\boldsymbol{\Gamma=d_{0}}$ , with $\mathbf{d=2}$ and $\mathbf{d_{0}=1}$ .

To simplify the presentation, we first restrict ourselves to the case $d=2$ and $d_{0}=1$. Assumption 4.1 specializes as follows: for all $0<l\leq L$, there exists $o\in[-1+L,1-L]$ such that, for all $t\in[-l/2,l/2]$ and every constant $c>0$,

h^{2}\left({p_{0}}_{|]o+t-\frac{l}{2};o+t+\frac{l}{2}[}\,;\,c\right)\ =\ \int_{o+t-\frac{l}{2}}^{o+t+\frac{l}{2}}\left|\sqrt{p_{0}(\lambda)}-\sqrt{c}\right|^{2}d\lambda\ \geq\ D\cdot l^{2}.

We use the fact that the non-constancy of $p^{*}$ over $\mathcal{S}$ induces a non-constancy over any one-dimensional space not parallel to $\mathcal{S}^{\perp}$. It is then possible to set a lower bound on the Hellinger distance between $p^{*}$ and any density that is constant on a space not parallel to $\mathcal{S}^{\perp}$. For $q\in\mathcal{O}_{2}$, we denote $E:=q^{-1}(E_{\mathbf{d_{0}}})$ and $F:=E^{\perp}$. If $q$ is not in $\mathcal{Q}^{*}$, then there exists $0<\vartheta\leq\pi/2$ such that for all $\overline{q}\in\mathcal{Q}^{*}$, we have ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\overline{q}-q\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}>\vartheta$. Then, the intersections of $F$ and $\mathcal{S}^{\perp}$ with the unit circle are separated by at least $\vartheta$.

With this setting, any square of size $L/\sqrt{2}$ centered in $\Psi(o)$ is included in $\mathbb{U}_{2}$ . Let $R$ be a solid square of size $L/\sqrt{2}$ , parallel to the line $F$ and centered on $\Psi(o)$ . The line $F+\Psi(o)$ intersects the border of $R$ at two points (see Figure 1), and using arguments from geometry on the two-dimensional Euclidean space, we can show that the orthogonal projections of these points over $\mathcal{S}$ are at a distance $\zeta\geq\frac{L\vartheta}{4\sqrt{2}}\sqrt{4-\vartheta^{2}}$ from $\Psi(o)$ . Similarly, the line $E+\Psi(o)$ intersects the border of $R$ at two points whose orthogonal projections on $\mathcal{S}$ are at a distance $\chi\leq\frac{L}{2\sqrt{2}}\sqrt{1-\vartheta^{2}+\vartheta^{4}/4}$ from $\Psi(o)$ .

Figure 1: Illustration of the proof of Theorem 4.1 in the case $\Gamma=d_{0}$ with $d=2$ and $d_{0}=1$.

Let $(\mathbf{u},\mathbf{v})$ be an orthogonal basis of $\mathbb{R}^{2}$ adapted to the decomposition $E\oplus F$ and such that $\mathrm{pr}_{\mathcal{S}}(\mathbf{u})=\frac{2\sqrt{2}}{L}\chi\cdot\Psi(1)$ and $\mathrm{pr}_{\mathcal{S}}(\mathbf{v})=\frac{2\sqrt{2}}{L}\zeta\cdot\Psi(1)$. In this system of coordinates, $\Psi(o)$ can be written $(o_{1},o_{2})$ and for all $u,v\in\mathbb{R}$, we have

\Psi^{-1}\left(\mathrm{pr}_{\mathcal{S}}(u,v)\right)\ =\ \chi\cdot\frac{2\sqrt{2}}{L}u\ +\ \zeta\cdot\frac{2\sqrt{2}}{L}v.

We will also use the fact that $p^{*}(u,v)=p_{0}\left(\Psi^{-1}\left(\mathrm{pr}_{\mathcal{S}}(u,v)\right)\right)$. Then, for every density $\tilde{p}$ constant in the direction $F$, we have

\begin{aligned}
h^{2}(p^{*}_{|R}\,;\,\tilde{p}_{|R})\ &=\ \iint_{R}\left|\sqrt{p^{*}(u,v)}-\sqrt{\tilde{p}(u,0)}\right|^{2}du\,dv\\
&=\ \iint_{R}\left|\sqrt{p_{0}\left(\Psi^{-1}\left(\mathrm{pr}_{\mathcal{S}}(u,v)\right)\right)}-\sqrt{\tilde{p}(u,0)}\right|^{2}du\,dv\\
&=\ \int_{o_{1}-L/(2\sqrt{2})}^{o_{1}+L/(2\sqrt{2})}\int_{o_{2}-L/(2\sqrt{2})}^{o_{2}+L/(2\sqrt{2})}\left|p_{0}\Bigl(\chi\cdot\frac{2\sqrt{2}}{L}u\ +\ \zeta\cdot\frac{2\sqrt{2}}{L}v\Bigr)^{1/2}-\sqrt{\tilde{p}(u,0)}\right|^{2}dv\,du\\
&=\ \int_{-L/(2\sqrt{2})}^{L/(2\sqrt{2})}\int_{-L/(2\sqrt{2})}^{L/(2\sqrt{2})}\left|p_{0}\Bigl(o+\chi\cdot\frac{2\sqrt{2}}{L}u\ +\ \zeta\cdot\frac{2\sqrt{2}}{L}v\Bigr)^{1/2}-\sqrt{\tilde{p}(u,0)}\right|^{2}dv\,du\\
&=\ \int_{-L/(2\sqrt{2})}^{L/(2\sqrt{2})}\frac{L}{2\zeta\cdot\sqrt{2}}\left(\int_{-\zeta}^{\zeta}\left|p_{0}\Bigl(o+\chi\cdot\frac{2\sqrt{2}}{L}u\ +\ w\Bigr)^{1/2}-\sqrt{\tilde{p}(u,0)}\right|^{2}dw\right)\,du\\
&\geq\ \int_{-L/(2\sqrt{2})}^{L/(2\sqrt{2})}\frac{L}{2\zeta\cdot\sqrt{2}}\cdot D\cdot 4\zeta^{2}\,du\ =\ DL^{2}\cdot\zeta\ \geq\ D\cdot\frac{L^{3}}{4\sqrt{2}}\vartheta\sqrt{4-\vartheta^{2}}.
\end{aligned}

In the last line, the first inequality follows from Assumption 4.1 and the second from the lower bound on $\zeta$.

Finally, $\Pi_{n}\left(\Gamma=d_{0}\text{ and }\min_{q\in\mathcal{Q}^{*}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\Theta-q\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\geq\vartheta\ |\ X_{1},\ldots,X_{n}\right)=0$ as soon as $\varepsilon_{n}<\sqrt{DL^{2}\cdot\zeta}$.

Case $\boldsymbol{\Gamma=d_{0}}$ , with arbitrary $\mathbf{d>d_{0}}$ .

Given a non-optimal isometry $q$ , we need to quantify how far from $\mathcal{S}^{\perp}$ the inverse image of the subspace $E_{1-\mathbf{d_{0}}}$ via $q$ is. This result, elementary when $d=2$ , is stated for arbitrary $d>d_{0}$ in the following lemma. A proof is given in Appendix 5.3.

Lemma 5.4.

Let $q\in\mathcal{O}_{d}$. If for all $\overline{q}\in\mathcal{Q}^{*}$, we have ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\overline{q}-q\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}>\vartheta$, $0<\vartheta\leq\pi/2$, then there exists $r\in E_{1-\mathbf{d_{0}}}$, $\left\|{r}\right\|=1$, such that the distance between $q^{-1}(r)$ and $\mathcal{S}^{\perp}\cap\mathbb{S}_{d}$ is at least $\vartheta/(2d)=:\overline{\vartheta}$, where $\mathbb{S}_{d}:=\{x\in\mathbb{R}^{d}:\left\|{x}\right\|=1\}$.

Now we work under the assumptions of Lemma 5.4. Let $G$ be the linear span of $q^{-1}(r)$ and its orthogonal projection $\boldsymbol{\Lambda}$ on $\mathcal{S}^{\perp}$ (or any vector of $\mathcal{S}^{\perp}$ if the orthogonal projection is zero). Then $G$ has a non-zero intersection with $\mathcal{S}$ . Let $\Delta$ be this one-dimensional intersection.

Let $R$ be a solid hypercube centered on $\Psi(o)$, with size $\overline{L}:=L/\sqrt{d}$, and aligned with an orthogonal basis $(\boldsymbol{\Delta},u_{1},\ldots,u_{d_{0}-1},\boldsymbol{\Lambda},v_{1},\ldots,v_{d-d_{0}-1})$ adapted to the direct sum $\mathbb{R}^{d}=\mathcal{S}\oplus\mathcal{S}^{\perp}$. With the restrictions on $o$, $R$ is included in $\mathbb{U}_{d}$.

We will bound from below the quantity $h^{2}(p^{*}_{|R};\tilde{p}_{|R})$ by using the preceding two-dimensional case on slices of $R$. For $t\in\{0\}\times\prod_{i=1}^{d_{0}-1}[o-\overline{L}/2\cdot u_{i};o+\overline{L}/2\cdot u_{i}]\times\{0\}\times\prod_{j=1}^{d-d_{0}-1}[o-\overline{L}/2\cdot v_{j};o+\overline{L}/2\cdot v_{j}]$, the plane $G+t$ contains one element parallel to $\mathcal{S}$ and one element parallel to $\mathcal{S}^{\perp}$, so the situation is analogous to the previous case, replacing $\zeta$ by $\overline{\zeta}:=\frac{\overline{L}}{2}\overline{\vartheta}\sqrt{4-\overline{\vartheta}^{2}}$ (Figure 2). With all this in mind, for every density $\tilde{p}$ constant in the direction $q^{-1}(r)$, one has

h^{2}(p^{*}_{|R}\,;\,\tilde{p}_{|R})\ =\ \int_{t}h^{2}(p^{*}_{|R\cap(G+t)};\tilde{p}_{|R\cap(G+t)})dt\ \geq\ \int_{t}2D\overline{L}^{2}\overline{\zeta}dt\ =\ 2D\overline{L}^{d}\overline{\zeta},

which is sufficient to conclude.

The case $\Gamma>d_{0}$ can be proven in a similar way.

Figure 2: Illustration of the proof of Theorem 4.1 in the case $\Gamma=d_{0}$ for arbitrary $d>d_{0}$.

5.3 Lemmas

The next three lemmas are related to Lemmas 4.3, 4.5, and 4.6 in [VV09], hence their proofs can be omitted.

Lemma 5.5.

Let $n\in\mathds{N}^{*}$ and $\beta>0$ . If $f_{0}\in\mathfrak{C}^{\beta}(\mathbb{U}_{d_{0}})$ , then, for all $a>0$ and $q_{n}\in\mathcal{O}_{d_{n}}$ , there exist constants $C_{f_{0}}$ and $D_{f_{0}}$ that depend only on $f_{0}$ such that

\inf\left\{\left\|{\overline{h}}\right\|_{\mathbb{H}_{a,d_{0},q_{n}}}^{2}\ :\ \overline{h}\in\mathbb{H}_{a,d_{0},q_{n}},\ \left\|{\overline{h}-f_{n,q_{n}}}\right\|_{\infty}\leq C_{f_{0}}\cdot a^{-\beta}\right\}\ \leq\ D_{f_{0}}\cdot a^{d_{0}}.

Lemma 5.6.

Let $n\in\mathds{N}^{*}$ , $a>0$ , $b\leq d_{\mathrm{max}}$ and $q_{n}\in\mathcal{O}_{d_{n}}$ . Then, there exists a constant $L_{b}$ that depends only on $b$ such that, for $\varepsilon<1/2$ ,

\log N(\varepsilon,\mathbb{H}_{1}^{a,b,q_{n}},\left\|{\cdot}\right\|_{\infty})\ \leq\ L_{b}\cdot a^{b}\left(\log\frac{1}{\varepsilon}\right)^{b+1}.

Lemma 5.7.

Let $n\in\mathds{N}^{*}$, $b\leq d_{\mathrm{max}}$ and $q_{n}\in\mathcal{O}_{d_{n}}$. Then, for $a_{0}>0$, there exist constants $C_{a_{0},b}$ and $\varepsilon_{0}^{a_{0},b}$ that depend only on $a_{0}$ and $b$ such that, for all $a\geq a_{0}$ and $\varepsilon<\varepsilon_{0}^{a_{0},b}$,

-\log\operatorname{\mathds{P}}\left(\left\|{W^{a,b,q_{n}}}\right\|_{\infty}\leq\varepsilon\right)\ \leq\ C_{a_{0},b}\cdot a^{b}\left(\log\frac{a}{\varepsilon}\right)^{b+1}.

Lemma 5.8.

Let $n\in\mathds{N}^{*}$ and let $(e_{1},\ldots,e_{n})$ be an orthonormal basis of $\mathbb{R}^{n}$ . For $d\leq n$ , let $(g_{1},\ldots,g_{d})\in\mathbb{R}^{n\times d}$ be a collection of orthonormal vectors in $\mathbb{R}^{n}$ such that

\left\|{e_{i}-g_{i}}\right\|\leq\varepsilon,\quad\text{for all }i\in\llbracket 1,d\rrbracket.

Then we can complete this collection to obtain an orthonormal basis $(g_{1},\ldots,g_{n})$ of $\mathbb{R}^{n}$ satisfying

\left\|{e_{j}-g_{j}}\right\|\leq 2\sqrt{d}\cdot\varepsilon,\quad\text{for all }j\in\llbracket 1,n\rrbracket.

Proof of Lemma 5.8.

We denote by $F$ the subspace $\operatorname{Span}(g_{1},\ldots,g_{d})$. Let us determine the distance between a vector $e_{j}$ and its orthogonal projection on $F^{\perp}$, for $j\in\llbracket d+1,n\rrbracket$. Since $\langle e_{j},e_{i}\rangle=0$ for $i\leq d<j$, the Cauchy-Schwarz inequality gives

\left|\langle e_{j},g_{i}\rangle\right|\ =\ \left|\langle e_{j},g_{i}-e_{i}\rangle\right|\ \leq\ \left\|{e_{j}}\right\|\left\|{g_{i}-e_{i}}\right\|\ \leq\ \varepsilon,

for all $i\in\llbracket 1,d\rrbracket$ . Then

(5.12)

\left\|{e_{j}-P_{F^{\perp}}(e_{j})}\right\|\ =\ \left\|{P_{F}(e_{j})}\right\|\ =\ \left(\sum_{i=1}^{d}\langle e_{j},g_{i}\rangle^{2}\left\|{g_{i}}\right\|^{2}\right)^{1/2}\ \leq\ \sqrt{d}\cdot\varepsilon.

Thus the problem reduces to finding a family of $n-d$ orthonormal vectors in $F^{\perp}$ with elements as close as possible to the vectors $P_{F^{\perp}}(e_{j})$, for $j\in\llbracket d+1,n\rrbracket$. This is related to what is known as the orthogonal Procrustes problem. We denote by $A$ the matrix $A:=\left(P_{F^{\perp}}(e_{d+1})|\cdots|P_{F^{\perp}}(e_{n})\right)\in\mathbb{R}^{n\times(n-d)}$ and we use Theorem 4.1 stated in [Hig89]:

Theorem 5.9 ([Hig89]).

If $A$ admits a polar decomposition $A=UH$, and if $Q\in\mathbb{R}^{n\times(n-d)}$ has orthonormal columns, then

{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A-U\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}\ \leq\ {\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A-Q\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}.

Let us show that the columns of $U$ can be chosen in $F^{\perp}$. A singular value decomposition of $A$ can be written $A=WD\prescript{\mathrm{t}}{}{V}$, where $W$ has orthonormal columns, $V\in\mathcal{O}_{n-d}$, and $D\in\mathbb{R}^{(n-d)\times(n-d)}$ is diagonal. Therefore, $A=(W\prescript{\mathrm{t}}{}{V})VD\prescript{\mathrm{t}}{}{V}$. Taking $U:=W\prescript{\mathrm{t}}{}{V}$ and $H:=VD\prescript{\mathrm{t}}{}{V}$, we have the polar decomposition $A=UH$ where $U$ has orthonormal columns. Because $\operatorname{Im}(A)=\operatorname{Span}\left(P_{F^{\perp}}(e_{j}),j\in\llbracket d+1,n\rrbracket\right)\subset F^{\perp}$, it is possible to choose $W$ with columns in $F^{\perp}$, whence the desired result. Now, taking $Q=\left(e_{d+1}|\cdots|e_{n}\right)$, we have, for every unit vector $x\in\mathbb{R}^{n-d}$,

P_{F^{\perp}}(Qx)\ =\ Ax.

Moreover, using that $\left|\langle Qx,g_{i}\rangle\right|\leq\left\|{Qx}\right\|\left\|{g_{i}-e_{i}}\right\|\leq\varepsilon$ for all $i\in\llbracket 1,d\rrbracket$, we finally have

\left\|{Qx-Ax}\right\|^{2}\ =\ \left\|{Qx-P_{F^{\perp}}(Qx)}\right\|^{2}\ \leq\ d\varepsilon^{2},

thus ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A-Q\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq\sqrt{d}\cdot\varepsilon$. According to Theorem 5.9, the last inequality is also true if we replace $Q$ by $U$. Because the columns $u_{d+1},\ldots,u_{n}$ of $U$ are in $F^{\perp}$, the family $(g_{1},\ldots,g_{d},u_{d+1},\ldots,u_{n})$ is orthonormal. Moreover, by (5.12) and the triangle inequality, $\left\|{e_{j}-u_{j}}\right\|\leq\left\|{e_{j}-P_{F^{\perp}}(e_{j})}\right\|+\left\|{P_{F^{\perp}}(e_{j})-u_{j}}\right\|\leq 2\sqrt{d}\cdot\varepsilon$ for all $j\in\llbracket d+1,n\rrbracket$, which is the claimed bound. ∎
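The construction can be illustrated in the simplest case $n=3$, $d=2$. There $A$ is the single column $P_{F^{\perp}}(e_{3})$, so its polar factor is just $A/\left\|A\right\|$ and no singular value decomposition is needed. The Python sketch below is a sanity check under this assumption, not a general implementation; the rotation angles are arbitrary illustrative choices.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def sub(x, y):
    return [a - b for a, b in zip(x, y)]


def rotate(v, s, t):
    """Rotate v in R^3 about the e2 axis by angle t, then about the e1 axis by angle s."""
    x, y, z = v
    x, z = math.cos(t) * x + math.sin(t) * z, -math.sin(t) * x + math.cos(t) * z
    y, z = math.cos(s) * y - math.sin(s) * z, math.sin(s) * y + math.cos(s) * z
    return [x, y, z]


def complete_basis(g, e_last):
    """Return the polar factor of the single column A = P_{F^perp}(e_last), F = Span(g).

    For a one-column matrix the polar decomposition A = U H reduces to
    U = A / ||A|| and H = ||A||.
    """
    a = list(e_last)
    for gi in g:  # remove the component of e_last along each g_i
        c = dot(e_last, gi)
        a = [ak - c * gk for ak, gk in zip(a, gi)]
    length = norm(a)
    return [ak / length for ak in a]


# Hypothetical example: (g1, g2) is a small rotation of (e1, e2) in R^3.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
g = [rotate(basis[0], 0.05, 0.08), rotate(basis[1], 0.05, 0.08)]
eps = max(norm(sub(basis[i], g[i])) for i in range(2))
u3 = complete_basis(g, basis[2])

print("eps =", eps)
print("||e3 - u3|| =", norm(sub(basis[2], u3)))
print("2 sqrt(d) eps =", 2.0 * math.sqrt(2.0) * eps)
```

In this example the completed vector stays well within the $2\sqrt{d}\cdot\varepsilon$ tolerance of Lemma 5.8.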

Notation.

Let $d,n\in\mathds{N}^{*}$ with $d<n$ and let $\mathcal{B}_{\mathrm{on}}^{n}(d)$ be the set of all $d$ -tuples of orthonormal vectors in $\mathbb{R}^{n}$ .

Lemma 5.10.

Let $d,n\in\mathds{N}^{*}$ with $d\leq n$ and $0<\varepsilon\leq 1$ . Then there exists a set $\mathcal{G}\subset\mathcal{B}_{\mathrm{on}}^{n}(d)$ such that for all $e\in\mathcal{B}_{\mathrm{on}}^{n}(d)$ , there exists $g\in\mathcal{G}$ such that

\max_{i\in\llbracket 1,d\rrbracket}\left\|{e_{i}-g_{i}}\right\|_{2}\ \leq\ \varepsilon\qquad\text{and}\qquad\left|\mathcal{G}\right|\ \leq\ \left(\frac{\pi n}{2}\right)^{d/2}\left(\frac{8}{\varepsilon}\right)^{d(n-1)}.

Proof of Lemma 5.10.

Let us construct $\mathcal{G}$. Let $\mathcal{T}$ be a set of balls in $\mathbb{R}^{n}$ with radius $\varepsilon/2$ which cover $\mathbb{S}^{n-1}$ and such that $\left|\mathcal{T}\right|=N(\varepsilon/2,\mathbb{S}^{n-1},\left\|{\cdot}\right\|_{2})$. We denote by $\overline{\mathcal{T}}^{d}$ the set of $d$-tuples of balls $(B_{1},\ldots,B_{d})\in\mathcal{T}^{d}$ such that $B_{1}\times\cdots\times B_{d}$ contains at least one element of $\mathcal{B}_{\mathrm{on}}^{n}(d)$. Then, for each $e\in\mathcal{B}_{\mathrm{on}}^{n}(d)$, there exists $(B_{1},\ldots,B_{d})\in\overline{\mathcal{T}}^{d}$ such that $e\in B_{1}\times\cdots\times B_{d}$. For each $B\in\overline{\mathcal{T}}^{d}$, choose one particular $d$-tuple $g\in\mathcal{B}_{\mathrm{on}}^{n}(d)$ such that $g\in B$ and let $\mathcal{G}$ be the set of these $d$-tuples when $B$ runs through $\overline{\mathcal{T}}^{d}$. Since $e$ and $g$ then lie componentwise in the same balls of radius $\varepsilon/2$, it is clear that $\mathcal{G}$ satisfies the first condition of the lemma. Moreover,

\left|\mathcal{G}\right|\ =\ \bigl|\overline{\mathcal{T}}^{d}\bigr|\ \leq\ \left|\mathcal{T}^{d}\right|\ =\ N\left(\varepsilon/2,\mathbb{S}^{n-1},\left\|{\cdot}\right\|_{2}\right)^{d}.

Let us estimate the last quantity. We use the inequality

N\left(\varepsilon,\mathbb{S}^{n-1},\left\|{\cdot}\right\|_{2}\right)\ \leq\ D\left(\varepsilon,\mathbb{S}^{n-1},\left\|{\cdot}\right\|_{2}\right),

where $D\left(\varepsilon,\mathbb{S}^{n-1},\left\|{\cdot}\right\|_{2}\right)$ is the maximum number of disjoint balls with radius $\varepsilon/2$ and with center in $\mathbb{S}^{n-1}$ . Recall that

\mathcal{A}\left(\mathbb{S}^{n-1}\right)\ =\ \frac{2\pi^{n/2}}{\Gamma(n/2)}\qquad\text{and}\qquad\mathcal{V}\left(B_{n-1}(\varepsilon)\right)\ =\ \frac{\pi^{\frac{n-1}{2}}\varepsilon^{n-1}}{\Gamma\left(\frac{n+1}{2}\right)}.

Consider the measure $\nu(\varepsilon/2)$ of the hyperspherical cap defined by the intersection of $\mathbb{S}^{n-1}$ and a ball with center in $\mathbb{S}^{n-1}$ and with radius $\varepsilon/2$ . The colatitude angle of the cap is $\phi=2\arcsin(\varepsilon/4)$ and, according to [Li11],

\nu(\varepsilon/2)\ =\ \frac{(n-1)\pi^{\frac{n-1}{2}}}{\Gamma\left(\frac{n+1}{2}\right)}\int_{0}^{\phi}\sin^{n-2}(\theta)d\theta.

Since $\phi\geq\varepsilon/2$ ,

\int_{0}^{\phi}\sin^{n-2}(\theta)d\theta\ \geq\ \int_{0}^{\phi}\left(\frac{\sin\phi}{\phi}\cdot\theta\right)^{n-2}d\theta\ =\ \left(\frac{\sin\phi}{\phi}\right)^{n-2}\frac{\phi^{n-1}}{n-1}\ \geq\ \left(\frac{\sin\phi}{\phi}\right)^{n-2}\frac{1}{n-1}\left(\frac{\varepsilon}{2}\right)^{n-1}

and, using the facts that $\varepsilon\leq 1$ , $\phi\leq\varepsilon$ , and $(\sin\phi)/\phi\geq 1/2$ , we have

D\left(\varepsilon,\mathbb{S}^{n-1},\left\|{\cdot}\right\|_{2}\right)\ \leq\ \frac{\mathcal{A}(\mathbb{S}^{n-1})}{\nu(\varepsilon/2)}\ <\ \frac{\mathcal{A}(\mathbb{S}^{n-1})}{\mathcal{V}\left(B_{n-1}(\varepsilon/2)\right)\cdot\left(\frac{1}{2}\right)^{n-2}}\ =\ \sqrt{\pi}\cdot\left(\frac{4}{\varepsilon}\right)^{n-1}\cdot\frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma(n/2)}.
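The elementary facts used in this step ($\varepsilon/2\leq\phi\leq\varepsilon$, $(\sin\phi)/\phi\geq 1/2$, and the resulting lower bound on the integral of $\sin^{n-2}$) can be checked numerically. The following Python sketch does so for a few hypothetical values of $\varepsilon$ and $n$, approximating the integral with a midpoint rule.

```python
import math


def cap_angle(eps: float) -> float:
    """Colatitude phi of the cap cut on the unit sphere by a ball of radius eps/2
    centered on the sphere (chord length eps/2 = 2 sin(phi/2))."""
    return 2.0 * math.asin(eps / 4.0)


def sine_power_integral(phi: float, n: int, steps: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of sin^(n-2) over [0, phi]."""
    h = phi / steps
    return h * sum(math.sin((k + 0.5) * h) ** (n - 2) for k in range(steps))


def integral_lower_bound(eps: float, n: int) -> float:
    """(1/2)^(n-2) (eps/2)^(n-1) / (n-1), using (sin phi)/phi >= 1/2 and phi >= eps/2."""
    return 0.5 ** (n - 2) * (eps / 2.0) ** (n - 1) / (n - 1)


# Hypothetical radii and dimensions, chosen only for illustration.
for eps in (0.1, 0.5, 1.0):
    phi = cap_angle(eps)
    for n in (2, 3, 10):
        ok = sine_power_integral(phi, n) >= integral_lower_bound(eps, n)
        print(eps, n, eps / 2.0 <= phi <= eps, math.sin(phi) / phi >= 0.5, ok)
```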

The ratio of two Gamma functions can be bounded as follows

\sqrt{x+1/4}\ <\ \frac{\Gamma(x+1)}{\Gamma(x+1/2)}\ <\ \sqrt{x+1/2},

for $x>-1/2$ (see [Wat59] and [LQ12], Section 2.3). Choosing $x=(n-1)/2$ , we obtain

N\left(\varepsilon,\mathbb{S}^{n-1},\left\|{\cdot}\right\|_{2}\right)\ <\ \sqrt{\frac{\pi n}{2}}\cdot\left(\frac{4}{\varepsilon}\right)^{n-1},

hence the result. ∎
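As a complement, the Gamma ratio bounds with $x=(n-1)/2$ and the resulting bound on $\left|\mathcal{G}\right|$ can be evaluated numerically. The Python sketch below works on the logarithmic scale; the function names are illustrative assumptions, and the range of $n$ is arbitrary.

```python
import math


def gamma_ratio(x: float) -> float:
    """Gamma(x + 1) / Gamma(x + 1/2), computed through lgamma to avoid overflow."""
    return math.exp(math.lgamma(x + 1.0) - math.lgamma(x + 0.5))


def log_sphere_covering_bound(eps: float, n: int) -> float:
    """log of sqrt(pi n / 2) (4 / eps)^(n-1), the bound on N(eps, S^{n-1}, ||.||_2)."""
    return 0.5 * math.log(math.pi * n / 2.0) + (n - 1) * math.log(4.0 / eps)


def log_tuple_net_bound(eps: float, n: int, d: int) -> float:
    """log of (pi n / 2)^(d/2) (8 / eps)^(d (n-1)), the bound on |G| in Lemma 5.10.

    It is d times the covering bound evaluated at radius eps / 2.
    """
    return d * log_sphere_covering_bound(eps / 2.0, n)


# Check sqrt(x + 1/4) < Gamma(x+1)/Gamma(x+1/2) < sqrt(x + 1/2) at x = (n - 1)/2.
for n in (1, 2, 5, 50, 199):
    x = (n - 1) / 2.0
    print(n, math.sqrt(x + 0.25) < gamma_ratio(x) < math.sqrt(x + 0.5))

print(log_tuple_net_bound(1.0, 3, 2))
```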

Proof of Lemma 5.4.

Suppose that, for all $r\in E_{\mathbf{d_{0}}}$ , we have $d(q^{-1}(r),\mathcal{S}\cap\mathbb{S}_{d})<\overline{\vartheta}$ and, for all $r^{\prime}\in E_{1-\mathbf{d_{0}}}$ , $d(q^{-1}(r^{\prime}),\mathcal{S}^{\perp}\cap\mathbb{S}_{d})<\overline{\vartheta}$ . Let us show that for all vectors $e_{i}$ of the canonical basis, $\left\|{q^{-1}(e_{i})-\overline{q}^{-1}(e_{i})}\right\|<2\sqrt{d}\cdot% \overline{\vartheta}$ . We begin with the first $d_{0}$ vectors $(e_{1},\ldots,e_{d_{0}})$ . Define $\operatorname{p_{\mathcal{S}}}$ an operator which maps $r\in E_{\mathbf{d_{0}}}$ to $\operatorname*{arg\,min}_{u\in\mathcal{S}\cap\mathbb{S}_{d}}\left\|{q^{-1}(r)-% u}\right\|$ . Then, for $i=1,\ldots,d_{0}$ , we have $\left\|{q^{-1}(e_{i})-\operatorname{p_{\mathcal{S}}}(q^{-1}(e_{i}))}\right\|<% \overline{\vartheta}$ . Now, we reuse the arguments of the proof of Lemma 5.8, with $A:=\left(\operatorname{p_{\mathcal{S}}}(q^{-1}(e_{1}))|\cdots|\operatorname{p_% {\mathcal{S}}}(q^{-1}(e_{d_{0}}))\right)$ . We can write $A=UH$ where $U=(u_{1}|\cdots|u_{d_{0}})$ is a rectangular matrix with orthonormal columns in $\mathcal{S}$ and where $H$ is symmetric. Moreover, taking $Q:=(q^{-1}(e_{1})|\cdots|q^{-1}(e_{d_{0}}))$ , and $x\in E_{\mathbf{d_{0}}}\cap\mathbb{S}_{d}$ , $x=\sum_{i=1}^{d_{0}}a_{i}e_{i}$ , we have

\left\|{Qx-Ax}\right\|\ =\ \left\|{\sum_{i=1}^{d_{0}}a_{i}\cdot\left(q^{-1}(e_{i})-\operatorname{p_{\mathcal{S}}}(q^{-1}(e_{i}))\right)}\right\|\ <\ \sqrt{d}\cdot\overline{\vartheta}.

So, by Theorem 4.1 in [Hig89] (Theorem 5.9 in the present document), ${\left|\!\left|\!\left|A-U\right|\!\right|\!\right|}<\sqrt{d}\cdot\overline{\vartheta}$. Then, $(u_{1},\ldots,u_{d_{0}})$ is an orthonormal basis of $\mathcal{S}$ such that $\left\|{\operatorname{p_{\mathcal{S}}}(q^{-1}(e_{i}))-u_{i}}\right\|<\sqrt{d}\cdot\overline{\vartheta}$, for $i=1,\ldots,d_{0}$. Let $\overline{q}\in Q^{*}$ be an isometry such that $\overline{q}^{-1}(e_{i})=u_{i}$, $i=1,\ldots,d_{0}$. Then

\left\|{q^{-1}(e_{i})-\overline{q}^{-1}(e_{i})}\right\|\ \leq\ \left\|{q^{-1}(e_{i})-\operatorname{p_{\mathcal{S}}}(q^{-1}(e_{i}))}\right\|+\left\|{\operatorname{p_{\mathcal{S}}}(q^{-1}(e_{i}))-\overline{q}^{-1}(e_{i})}\right\|\ <\ 2\sqrt{d}\cdot\overline{\vartheta},\qquad i=1,\ldots,d_{0}.

The same reasoning applies to the remaining vectors, $(e_{d_{0}+1},\ldots,e_{d})$, replacing $\mathcal{S}$ by $\mathcal{S}^{\perp}$ and taking $A^{\prime}=U^{\prime}H^{\prime}$, with $U^{\prime}=(u_{d_{0}+1}|\cdots|u_{d})$. The isometry $\overline{q}\in Q^{*}$ is now the one whose inverse maps $e_{i}$ to $u_{i}$ for $i=1,\ldots,d$. As a result, for all $x\in\mathbb{S}_{d}$, $x=\sum_{i=1}^{d}a_{i}e_{i}$, we have

\left\|{q^{-1}(x)-\overline{q}^{-1}(x)}\right\|\ =\ \left\|{\sum_{i=1}^{d}a_{i}\cdot\left(q^{-1}(e_{i})-\overline{q}^{-1}(e_{i})\right)}\right\|\ <\ 2d\cdot\overline{\vartheta}\ =\ \vartheta,

which contradicts the fact that ${\left|\!\left|\!\left|\overline{q}-q\right|\!\right|\!\right|}>\vartheta$. Finally, $d(q,Q^{*})<\vartheta$. ∎
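The proof above twice passes from a bound on each column to a bound on the whole linear map: if every column $v_{i}$ satisfies $\left\|{v_{i}}\right\|\leq c$, then for a unit vector $a$ the triangle and Cauchy–Schwarz inequalities give $\left\|{\sum_{i}a_{i}v_{i}}\right\|\leq\sum_{i}|a_{i}|\cdot c\leq\sqrt{d}\cdot c$. The sketch below checks this step numerically on random inputs. It is only an illustration under stated assumptions: the helper names (`norm`, `random_unit_vector`, `apply_columns`, `worst_ratio`), the random columns standing in for the differences $q^{-1}(e_{i})-\overline{q}^{-1}(e_{i})$, and the dimensions tried are all hypothetical choices and are not taken from the paper.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in v))


def random_unit_vector(d, rng):
    """Draw a vector uniformly on the unit sphere of R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    s = norm(v)
    return [t / s for t in v]


def apply_columns(columns, a):
    """Return sum_i a[i] * columns[i], i.e. the matrix with these columns applied to a."""
    d = len(columns[0])
    return [sum(a[i] * columns[i][k] for i in range(len(columns))) for k in range(d)]


def worst_ratio(d, trials, seed=0):
    """Largest observed value of ||sum_i a_i v_i|| / (sqrt(d) * max_i ||v_i||)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        # Hypothetical stand-ins for the column differences bounded in the proof.
        columns = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(d)]
        max_col = max(norm(c) for c in columns)
        a = random_unit_vector(d, rng)
        worst = max(worst, norm(apply_columns(columns, a)) / (math.sqrt(d) * max_col))
    return worst


ratios = {d: worst_ratio(d, trials=2000) for d in (2, 5, 10)}
print(ratios)
```

Every observed ratio should stay at or below 1, which is how a per-column bound of $2\sqrt{d}\cdot\overline{\vartheta}$ turns into the bound $2d\cdot\overline{\vartheta}$ on the unit sphere.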

Acknowledgement

We acknowledge the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-21-CE40-0007 (GAP Project).

References

[Bir86] Lucien Birgé “On estimating a density using Hellinger distance and some other strange facts” In Probability Theory and Related Fields 71, 1986, pp. 271–291
[Bor75] Christer Borell “The Brunn–Minkowski inequality in Gauss space” In Inventiones mathematicae 30, 1975, pp. 207–216
[CD12] Laëtitia Comminges and Arnak S. Dalalyan “Tight conditions for consistency of variable selection in the context of high dimensionality” In The Annals of Statistics 40.5, 2012, pp. 2667–2696
[Coo98] R. Dennis Cook “Regression graphics: Ideas for studying regressions through graphics” John Wiley & Sons, 1998
[FS23] Gianluca Finocchio and Johannes Schmidt-Hieber “Posterior contraction for deep Gaussian process priors” In Journal of Machine Learning Research 24.66, 2023, pp. 1–49 URL: http://jmlr.org/papers/v24/21-0556.html
[GGV00] Subhashis Ghosal, Jayanta K. Ghosh and Aad W. Van Der Vaart “Convergence rates of posterior distributions” In The Annals of Statistics 28.2, 2000, pp. 500–531
[GN11] Evarist Giné and Richard Nickl “Rates of contraction for posterior distributions in $L^{r}$-metrics, $1\leq r\leq\infty$” In The Annals of Statistics 39.6, 2011, pp. 2883–2911
[Hig89] Nicholas J. Higham “Matrix nearness problems and applications” In Applications of Matrix Theory Oxford University Press, 1989, pp. 1–27
[JT21] Sheng Jiang and Surya T. Tokdar “Variable selection consistency of Gaussian process regression” In The Annals of Statistics 49.5, 2021, pp. 2491–2505
[Li11] Shengqiao Li “Concise formulas for the area and volume of a hyperspherical cap” In Asian Journal of Mathematics and Statistics 4.1 ANSInet, 2011, pp. 66–70
[Li91] Ker-Chau Li “Sliced inverse regression for dimension reduction” In Journal of the American Statistical Association 86.414 Taylor & Francis, 1991, pp. 316–327
[Lin+21] Qian Lin, Xinran Li, Dongming Huang and Jun S. Liu “On the optimality of sliced inverse regression in high dimensions” In The Annals of Statistics 49.1 Institute of Mathematical Statistics, 2021, pp. 1–20
[LQ12] Qiu-Ming Luo and Feng Qi “Bounds for the ratio of two gamma functions—From Wendel’s and related inequalities to logarithmically completely monotonic functions” In Banach Journal of Mathematical Analysis 6.2 Tusi Mathematical Research Group, 2012, pp. 132–158
[LZL18] Qian Lin, Zhigen Zhao and Jun S. Liu “On consistency and sparsity for sliced inverse regression in high dimensions” In The Annals of Statistics 46.2, 2018, pp. 580–610
[LZL19] Qian Lin, Zhigen Zhao and Jun S. Liu “Sparse sliced inverse regression via lasso” In Journal of the American Statistical Association 114.528 Taylor & Francis, 2019, pp. 1726–1739
[STG13] Weining Shen, Surya T. Tokdar and Subhashis Ghosal “Adaptive Bayesian multivariate density estimation with Dirichlet mixtures” In Biometrika 100.3 Oxford University Press, 2013, pp. 623–640
[Sto82] Charles J. Stone “Optimal global rates of convergence for nonparametric regression” In The Annals of Statistics 10.4 Institute of Mathematical Statistics, 1982, pp. 1040–1053
[Tok11] Surya T. Tokdar “Dimension adaptability of Gaussian process models with variable selection and projection” Preprint. Available at arXiv:1112.0716, 2011
[TSY20] Kai Tan, Lei Shi and Zhou Yu “Sparse SIR: Optimal rates and adaptive estimation” In The Annals of Statistics 48.1 Institute of Mathematical Statistics, 2020, pp. 64–85
[TZG10] Surya T. Tokdar, Yu M. Zhu and Jayanta K. Ghosh “Bayesian density regression with logistic Gaussian process and subspace projection” In Bayesian Analysis 5.2 Institute of Mathematical Statistics, 2010, pp. 319
[Ver12] Nicolas Verzelen “Minimax risks for sparse regressions: Ultra-high dimensional phenomenons” In Electronic Journal of Statistics 6 Institute of Mathematical Statistics and Bernoulli Society, 2012, pp. 38–90
[VV08] Aad W. Van Der Vaart and J. Harry Van Zanten “Rates of contraction of posterior distributions based on Gaussian process priors” In The Annals of Statistics 36.3, 2008, pp. 1435–1463
[VV08a] Aad W. Van Der Vaart and J. Harry Van Zanten “Reproducing kernel Hilbert spaces of Gaussian priors” In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh. Inst. Math. Stat. (IMS) Collect. 3, 2008, pp. 200–222
[VV09] Aad W. Van Der Vaart and J. Harry Van Zanten “Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth” In The Annals of Statistics 37.5B, 2009, pp. 2655–2675
[Wai09] Martin J. Wainwright “Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (Lasso)” In IEEE Transactions on Information Theory 55.5 IEEE, 2009, pp. 2183–2202
[Wat59] G. N. Watson “A note on gamma functions” In Edinburgh Mathematical Notes 42 Cambridge University Press, 1959, pp. 7–9
[YD16] Yun Yang and David B. Dunson “Bayesian manifold regression” In The Annals of Statistics 44.2, 2016, pp. 876–905
[YT15] Yun Yang and Surya T. Tokdar “Minimax-optimal nonparametric regression in high dimensions” In The Annals of Statistics 43.2, 2015, pp. 652–674
[ZMP06] Lixing Zhu, Baiqi Miao and Heng Peng “On sliced inverse regression with high-dimensional covariates” In Journal of the American Statistical Association 101.474 Taylor & Francis, 2006, pp. 630–643
[ZMZ22] Jing Zeng, Qing Mai and Xin Zhang “Subspace estimation with automatic dimension and variable selection in sufficient dimension reduction” In Journal of the American Statistical Association Taylor & Francis, 2022, pp. 1–13

