Nearest Neighbor Sampling for Covariate Shift Adaptation

François Portier
[email protected]
Department of Statistics,
Univ Rennes, Ensai, CNRS, CREST - UMR 9194, F-35000 Rennes, France

Lionel Truquet
[email protected]
Department of Statistics,
Univ Rennes, Ensai, CNRS, CREST - UMR 9194, F-35000 Rennes, France

Ikko Yamane
[email protected]
Department of Computer Science,
Univ Rennes, Ensai, CNRS, CREST - UMR 9194, F-35000 Rennes, France

Abstract

Many existing covariate shift adaptation methods estimate sample weights applied to loss values to mitigate the gap between the source and the target distributions. However, estimating the optimal weights typically involves computationally expensive matrix inversion and hyper-parameter tuning. In this paper, we propose a new covariate shift adaptation method that avoids estimating the weights. The basic idea is to work directly on unlabeled target data, labeled according to the $k$-nearest neighbors in the source dataset. Our analysis reveals that setting $k=1$ is an optimal choice. This property removes the need to tune the only hyper-parameter $k$ and leads to a running time quasi-linear in the sample size. Our results include sharp rates of convergence for our estimator, with a tight control of the mean square error and explicit constants. In particular, the variance of our estimators has the same rate of convergence as for standard parametric estimation despite their non-parametric nature. The proposed estimator shares similarities with some matching-based treatment effect estimators used, e.g., in biostatistics, econometrics, and epidemiology. Our experiments show that it achieves a drastic reduction in running time with remarkable accuracy.

1 Introduction

Traditional machine learning methods assume that the source data distribution $P$ and the target data distribution $Q$ are identical. However, this assumption can be violated in practice when there is a distribution shift (Chen et al., 2022) between them. Various types of shift have been studied in the literature, and one of the most common scenarios is covariate shift (Shimodaira, 2000), in which there is a shift in the input distribution, $P_X \neq Q_X$, while the conditional distribution of the output variable given the input variable is the same, $P_{Y \mid X} = Q_{Y \mid X}$, where $X$ is the input and $Y$ is the output variable. The goal of covariate shift adaptation is to adapt a supervised learning algorithm to the target distribution using labeled source data and unlabeled target data.

A standard approach to covariate shift is weighting source examples (Shimodaira, 2000), and many studies in the same line of research have focused on improving the weights (Huang et al., 2006; Gretton et al., 2008; Yamada et al., 2013; Kanamori et al., 2009; Sugiyama et al., 2007, 2008; Aminian et al., 2022). We refer the reader to Section 6 for more details on related work. Since we rarely know a priori how the input distributions can be shifted, non-parametric methods are particularly useful for covariate shift adaptation. Some of the existing methods allow one to use non-parametric models through kernels. However, such kernel-based methods take at least quadratic time to compute kernel matrices. Some methods further need to solve linear systems and take cubic time in the sample size unless one resorts to approximations (Williams and Seeger, 2000; Le et al., 2013). Moreover, their performance is often sensitive to the choice of the kernel hyper-parameters. Typically, one performs a grid search with $K$-fold cross-validation to select the hyper-parameters, which multiplies the running time by about $K\lvert\Gamma\rvert$, where $\Gamma$ is the set of candidates for the hyper-parameters. In addition, the criterion for hyper-parameter selection is not obvious either, because we do not have access to the labels for the target data. One can use weighted validation scores computed on the labeled source data with importance sampling, but it is not straightforward to choose which weights to use for the cross-validation when the weights themselves are what we are choosing.

In this paper, we propose a non-parametric covariate shift adaptation method that is scalable and has no hyper-parameter. Our idea is to generate synthetic labels for unlabeled target data using a non-parametric conditional sampler constructed from source data. Under the assumption of covariate shift, the target data attached with the generated labels behave like labeled target data. This sampling technique allows any supervised learning method to be simply applied to the generated data to produce a model already adapted to the target distribution.

While the proposed approach is quite general and can be employed with various sampling methods for the synthetic labeling part, our main result is that a $k$-nearest neighbor ($k$-NN) based sampling method achieves an error of order $(k/n)^{1/d} + 1/\sqrt{n} + 1/\sqrt{m}$ for estimating an expectation on the target domain, where $d$ is the data dimensionality, and $n$ and $m$ are the source and the target sample sizes, respectively. Importantly, our error bounds suggest that $k=1$ is the most favorable choice. This property, which is revealed by a precise scaling of the variance term in $1/\sqrt{n}$, is a non-trivial and remarkable result, given the $1/\sqrt{k}$-rate of the variance associated with the $k$-NN estimator of the conditional distribution (Portier, 2021, Corollary 1). It contrasts with the well-known application of $k$-NN to standard density estimation (Dasgupta and Kpotufe, 2014), classification (Gadat et al., 2016; Cannings et al., 2020), or regression problems (Devroye et al., 1994; Jiang, 2019), in which we typically need to let $k$ grow at a polynomial rate in the sample size in order to achieve a good balance in the bias-variance trade-off. This important difference in the rate of convergence, leading to the choice of $k=1$ neighbor, has also been noticed in other estimation problems such as the $k$-NN entropy estimator (Berrett et al., 2019) or the integral approximation problem (Leluc et al., 2023; Blanchet et al., 2024). Textbooks dealing with the $k$-NN algorithm include Györfi et al. (2006), Devroye et al. (2013), and Biau and Devroye (2015).
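As a rough numerical illustration of how the stated error order depends on $k$, the following sketch evaluates $(k/n)^{1/d} + 1/\sqrt{n} + 1/\sqrt{m}$ for a few values of $k$. The function name `error_order` and the sample sizes are assumptions made for this example only, and all constants in the actual bound are ignored, so the printed numbers indicate scaling rather than true errors.

```python
import math


def error_order(k, n, m, d):
    """Order of the bound (k/n)^(1/d) + 1/sqrt(n) + 1/sqrt(m); constants are ignored."""
    return (k / n) ** (1.0 / d) + 1.0 / math.sqrt(n) + 1.0 / math.sqrt(m)


# Hypothetical sample sizes and dimension, chosen only for illustration.
n, m, d = 10_000, 10_000, 2
for k in (1, 10, 100):
    print(k, round(error_order(k, n, m, d), 4))
```

Only the first term depends on $k$, and it increases with $k$, which is consistent with the statement that $k=1$ is the most favorable choice for this bound.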

In addition to being optimal with respect to the estimation error, setting $k=1$ circumvents the cumbersome hyper-parameter tuning while providing computational efficiency at the same time. Our $1$-NN-based algorithm takes only quasi-linear time $\mathcal{O}((n+m)\log n)$ on average using the optimized $k$-d tree (Bentley, 1975; Friedman et al., 1977). Indeed, our experiments show that the proposed method terminates faster than previous methods by large margins. Note that obtaining a computationally efficient method for covariate shift adaptation, in particular one that scales to large data sets, is a recurrent problem in the existing literature. In fact, many existing methods resort to implementation heuristics, such as using a fixed number of kernel centers, to reduce the computational burden at the cost of statistical guarantees (Kanamori et al., 2009; Sugiyama et al., 2007, 2008; Yamada et al., 2013).

Our method simulates the missing labels of the target sample, which in turn can be used for a variety of downstream supervised learning tasks. Even though the main focus of this paper is the estimation of expectations in the target domain, to illustrate the usefulness of our method in a typical machine learning downstream task, we also demonstrate consistency properties of parametric M-estimators in the target domain. This is particularly useful when the parametric model is misspecified, since in this case, even the population minimizer changes when the covariate distribution is shifted.

The problem of interest here is closely related to a well-known matching problem studied in the context of treatment effect estimation. In particular, $k$-NN estimators have been used to estimate the so-called average treatment effect to tackle missingness of potential outcomes. See, e.g., Rosenbaum (1995) and Abadie and Imbens (2006) for an error bound obtained in this specific instance of the problem. In Section 6, we discuss the main differences between the two problems and why their result is not generally applicable to ours.

In summary, the key contributions of this paper are the following. (i) Our method is non-parametric. It does not introduce a model in covariate shift adaptation, so it has minimal impact on the model trained for the downstream task. (ii) Our method is fast. Adaptation only takes quasi-linear time. (iii) There is no hyper-parameter to be tuned. (iv) The proposed method only incurs an error of order $(k/n)^{1/d} + 1/\sqrt{n} + 1/\sqrt{m}$ for estimating an expectation on the target domain.

The outline is as follows. In Section 2, the problem of covariate shift adaptation is formally introduced along with the mathematical notation. Section 3 contains the description of the method. Section 4 is dedicated to the main theoretical results, while Section 5 investigates the empirical risk minimization problem in the presence of covariate shift. Section 6 describes several alternative approaches to similar problems and compares them with our proposal. In Section 7, several avenues for further research are discussed and, finally, the numerical experiments are provided in Section 8.

2 Problem setup

Let $\mathcal{X}$ and $\mathcal{Y}$ be measurable spaces. Let $P \equiv P_{X,Y}$ and $Q \equiv Q_{X,Y}$ be probability distributions defined on $\mathcal{X}\times\mathcal{Y}$. Throughout the paper, we assume that $P$ and $Q$ admit the decomposition

\[
P = P_{Y \mid X} P_X \quad\text{and}\quad Q = Q_{Y \mid X} Q_X,
\]

where $P_{Y \mid X=x}$ and $Q_{Y \mid X=x}$ are probability distributions defined on $\mathcal{Y}$ for each $x \in \mathcal{X}$ (see Footnote 1). Here, $P_X$ and $Q_X$ are the marginal distributions of $X$ when $(X,Y)$ is distributed with $P$ and $Q$, respectively. We shall simply call $P_{Y \mid X}$ (or $Q_{Y \mid X}$) the conditional distribution of $Y$ given $X$ in the source domain (or the target domain).

Footnote 1. More formally, we denote by $P_{Y \mid X=(\cdot)}(dy)$ a regular conditional measure (Bogachev and Ruas, 2007, Definition 10.4.1) such that the marginal distribution of $Y$ can be expressed as $P_Y(dy) = \int P_{Y \mid X=x}(dy)\,P_X(dx)$. We also use $P_{Y \mid X}(dy \mid \cdot)$ for $P_{Y \mid X=(\cdot)}(dy)$. The same goes for $Q$.

Definition 1 (Source sample, source distribution).

For each integer $n \geq 1$, let $(X_i, Y_i)_{i=1}^n$ be a collection of independent and identically distributed random variables with distribution $P$. We refer to $(X_i, Y_i)_{i=1}^n$ as the (labeled) source sample and $P$ as the source distribution.

Definition 2 (Target sample, target distribution).

For each integer $m \geq 1$, let $(X^*_i)_{i=1}^m$ be a collection of independent and identically distributed random variables with distribution $Q_X$, the marginal of $Q$. We refer to $(X^*_i)_{i=1}^m$ as the (unlabeled) target sample and $Q$ as the target distribution.

Definition 3 (Covariate shift).

Covariate shift is a situation in which the source and the target distribution have different marginal distributions for X𝑋Xitalic_X while sharing a common conditional distribution:

  (C1) $P_{Y \mid X} = Q_{Y \mid X}$, $P_X$- and $Q_X$-a.s., but $P_X \neq Q_X$.

This paper focuses on the following simple but versatile estimation problem under covariate shift.

Definition 4 (Mean estimation under covariate shift).

For each pair of integers $n \geq 1$, $m \geq 1$, and a known integrable function $h\colon \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, the goal of mean estimation under covariate shift is to estimate the mean of $h$ under the target distribution,

\[
Q(h) \equiv \int h(x,y)\,Q(dx,dy),
\]

given access to the source sample $(X_i, Y_i)_{i=1}^n \sim P$ and the target sample $(X^*_i)_{i=1}^m \sim Q_X$ under Assumption (C1).

For instance, when $h(x,y) = \ell(f(x), y)$ for a loss function $\ell\colon \mathcal{Y}^2 \to \mathbb{R}$ and a hypothesis function $f\colon \mathcal{X}\to\mathcal{Y}$, estimation of $Q(h)$ becomes risk estimation, which is the central subtask in empirical risk minimization.
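To make explicit why labeled source data and unlabeled target data together carry enough information about $Q(h)$, the following short derivation may help. It uses only the decomposition of $Q$ from this section and Assumption (C1), and is a sketch for intuition rather than a formal statement from the paper.

```latex
\[
Q(h)
= \int_{\mathcal{X}} \left( \int_{\mathcal{Y}} h(x,y)\, Q_{Y \mid X=x}(dy) \right) Q_X(dx)
= \int_{\mathcal{X}} \left( \int_{\mathcal{Y}} h(x,y)\, P_{Y \mid X=x}(dy) \right) Q_X(dx).
\]
```

The inner integral involves only the conditional distribution $P_{Y \mid X}$, about which the source sample is informative, while the outer integral involves only $Q_X$, from which the target sample is drawn.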

3 Proposed method

The basic idea of our proposed method is to use the source sample for learning to label the target data. Specifically, using the source sample $(X_i, Y_i)_{i=1}^n$, we construct a stochastic labeling function $\hat{\texttt{S}}$ that takes any target data point $X^*_i$ as input and outputs a random label $Y^*_{n,i} \in \mathcal{Y}$. (The subscript $n$ of $Y^*_{n,i}$ explicitly denotes the dependence on the source sample.) Once we succeed in generating labels for target data that behave like true target labels, we will be able to apply any supervised learning method directly to the target sample for the downstream task. For our mean estimation problem, we can simply average the outputs of $h$ evaluated at the target data with the generated labels.

When do the generated labels behave like the true target labels? Let $\hat{P}_{Y \mid X^*_i}$ denote the probability distribution of an output $Y^*_{n,i}$ of $\hat{\texttt{S}}$ for input $X^*_i$. We wish to obtain $\hat{\texttt{S}}$ such that the probability distribution $\hat{Q} \equiv \hat{P}_{Y \mid X} Q_X$ of $(X^*_i, Y^*_{n,i})$ will be a good estimate of $Q = Q_{Y \mid X} Q_X$. For this, we want $\hat{P}_{Y \mid X^*_i}$ to be a good estimate of $P_{Y \mid X^*_i}$. In fact, if $\hat{P}_{Y \mid X^*_i} = P_{Y \mid X^*_i}$, the generated sample $(X^*_i, Y^*_{n,i})$ will follow the target distribution $Q \equiv Q_{Y \mid X} Q_X = P_{Y \mid X} Q_X$ under Assumption (C1). In this sense, our task boils down to designing a good conditional sampler $\hat{\texttt{S}}$ mimicking sampling from $P_{Y \mid X}$. Algorithm 1 describes an outline of this general framework.

Algorithm 1 Conditional Sampling Adaptation
Input: Conditional sampler $\hat{\texttt{S}}$ and target sample $(X^*_j)_{j=1}^m$.
$Y^*_{n,j} \leftarrow \hat{\texttt{S}}(X^*_j)$ for each $j \in \{1, \dots, m\}$. // Generate a label conditioned on $X^*_j$.
return $m^{-1}\sum_{j=1}^m h(X^*_j, Y^*_{n,j})$.

In this paper, we propose a method using a non-parametric conditional sampler $\hat{\texttt{S}}$ based on the $k$-Nearest Neighbor ($k$-NN) method, which randomly picks one of the $k$-nearest neighbors of the input $X^*_j$ among the source instances $(X_i)_{i=1}^n$ and outputs the corresponding label (Algorithm 2). We refer to this method as $k$-NN-based Conditional Sampling Adaptation ($k$-NN-CSA).

Algorithm 2 k𝑘kitalic_k-Nearest Neighbor Conditional Sampler
Input: Source sample $(X_i, Y_i)_{i=1}^n$ and target input $X^*_j$.
$(i_1, \dots, i_k) \leftarrow$ the indices of the $k$-nearest neighbors of $X^*_j$ among the source instances $(X_i)_{i=1}^n$.
Pick $i^* \in \{i_1, \dots, i_k\}$ uniformly at random.
return $Y^*_{n,j} := Y_{i^*}$.
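Algorithms 1 and 2 can be sketched together in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it uses only the standard library, finds neighbors by brute force under the Euclidean distance (an assumption, as is every function name below), and represents points as tuples of floats.

```python
import math
import random


def knn_conditional_sampler(source_x, source_y, x_star, k=1, rng=random):
    """Sketch of Algorithm 2: return the label of one of the k nearest source points."""
    # Brute-force search: order source indices by Euclidean distance to x_star.
    order = sorted(range(len(source_x)), key=lambda i: math.dist(source_x[i], x_star))
    neighbors = order[:k]
    # Pick one neighbor uniformly at random and output its source label.
    i_star = rng.choice(neighbors)
    return source_y[i_star]


def knn_csa_mean(h, source_x, source_y, target_x, k=1, rng=random):
    """Sketch of Algorithm 1: average h over target points paired with generated labels."""
    total = 0.0
    for x_star in target_x:
        y_star = knn_conditional_sampler(source_x, source_y, x_star, k, rng)
        total += h(x_star, y_star)
    return total / len(target_x)


# Toy data (hypothetical): three labeled source points, two unlabeled target points.
source_x = [(0.0,), (1.0,), (2.0,)]
source_y = [0.0, 10.0, 20.0]
target_x = [(0.9,), (2.2,)]
print(knn_csa_mean(lambda x, y: y, source_x, source_y, target_x, k=1))  # 15.0
```

With $k=1$ the procedure is deterministic given the data (absent ties), so no random seed is needed; for $k>1$ one can pass a seeded `random.Random` instance as `rng` to make runs reproducible.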

Computing time

Recent advances in nearest neighbor search rely on tree search to reduce the computing time. The seminal paper by Bentley (1975) introduced the $k$-d tree method. Building such a tree requires $\mathcal{O}(n\log n)$ time and, once the tree is available, searching for the nearest neighbor of a given point can be done in $\mathcal{O}(\log n)$ time (Friedman et al., 1977). As a consequence, the time complexity of $k$-NN-CSA is $\mathcal{O}(n\log n + km\log n)$.
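The build-once, query-many structure behind this complexity can be illustrated in one dimension, where a sorted array with binary search plays the role of the tree. The sketch below is an assumption-laden analog for $k=1$ and scalar inputs, not a $k$-d tree and not the implementation used in the paper; the function names are hypothetical.

```python
import bisect


def build_index(source_x, source_y):
    """Sort the 1-D source sample once: O(n log n)."""
    pairs = sorted(zip(source_x, source_y))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return xs, ys


def nearest_label(xs, ys, x_star):
    """Binary search for the nearest source point: O(log n) per query."""
    pos = bisect.bisect_left(xs, x_star)
    if pos == 0:
        return ys[0]
    if pos == len(xs):
        return ys[-1]
    # The nearest neighbor is one of the two points around the insertion position.
    left, right = xs[pos - 1], xs[pos]
    return ys[pos - 1] if x_star - left <= right - x_star else ys[pos]


def one_nn_csa_mean_1d(h, source_x, source_y, target_x):
    """1-NN-CSA estimate for scalar inputs: O(n log n + m log n) overall."""
    xs, ys = build_index(source_x, source_y)
    return sum(h(x, nearest_label(xs, ys, x)) for x in target_x) / len(target_x)
```

For example, with source inputs `[2.0, 0.0, 1.0]`, labels `[20.0, 0.0, 10.0]`, target inputs `[0.9, 2.2, -5.0]`, and `h = lambda x, y: y`, the generated labels are 10.0, 20.0, and 0.0, so the estimate is 10.0.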

4 Theoretical analysis

We now present the theory behind our approach in a didactic way by first introducing a key decomposition and then studying separately each of the terms involved: the sampling error and the $k$-NN conditional sampling error. We will see that $k$-NN-CSA with $k=1$ ($1$-NN-CSA for short) achieves the best theoretical performance among all choices of $k$.

4.1 The key decomposition

For the analysis of $k$-NN-CSA, recall that $\hat{Q} = \hat{P}_{Y \mid X} Q_X$ is an estimate of the target distribution $Q = P_{Y \mid X} Q_X$ that depends on the source sample $(X_i, Y_i)_{i=1}^n$, whose probability distribution is $P$. We introduce the bootstrap sample as a collection of random variables generated according to $\hat{Q}$.

Definition 5 (Bootstrap sample).

For each $m \geq 1$ and $n \geq 1$, let $(X_i^*, Y_{n,i}^*)_{1 \leq i \leq m}$ be a collection of random variables identically distributed with $\hat{Q}$ and conditionally independent given $(X_i, Y_i)_{i=1}^n$.

Let $h\colon \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ be a measurable function. The quantity of interest is

\[
\hat{Q}^*(h) = m^{-1}\sum_{i=1}^m h(X_i^*, Y_{n,i}^*),
\]

which is the CSA estimate of $Q(h) = \int h(x,y)\,Q(dx,dy)$ as introduced in Algorithm 1. The following decomposition is crucial in our analysis:

\[
\begin{aligned}
(\hat{Q}^* - Q)(h) &= \underbrace{(\hat{Q}^* - \hat{Q})(h)}_{\text{\emph{Marginal sampling error}}} + \underbrace{(\hat{Q} - Q)(h)}_{\text{\emph{Conditional sampling error}}} \\
&\bigg( = (\hat{Q}^*_X - Q_X)\hat{P}_{Y \mid X}(h) + Q_X(\hat{P}_{Y \mid X} - P_{Y \mid X})(h) \bigg),
\end{aligned}
\tag{1}
\]

where Q^X()1ni=1m𝟙Xi=()subscriptsuperscript^𝑄𝑋1𝑛superscriptsubscript𝑖1𝑚subscript1subscriptsuperscript𝑋𝑖\hat{Q}^{*}_{X}(\cdot)\equiv\frac{1}{n}\sum_{i=1}^{m}\mathds{1}_{X^{*}_{i}=(% \cdot)}over^ start_ARG italic_Q end_ARG start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( ⋅ ) ≡ divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT blackboard_1 start_POSTSUBSCRIPT italic_X start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( ⋅ ) end_POSTSUBSCRIPT is the empirical measure defined with (Xi)i=1msuperscriptsubscriptsubscriptsuperscript𝑋𝑖𝑖1𝑚(X^{*}_{i})_{i=1}^{m}( italic_X start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. The first term is the error due to the use of Q^Xsubscriptsuperscript^𝑄𝑋\hat{Q}^{*}_{X}over^ start_ARG italic_Q end_ARG start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT in place of QXsubscript𝑄𝑋Q_{X}italic_Q start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT, which tends to zero as m𝑚mitalic_m grows. The second term represents the error due to the use of P^Y|Xsubscript^𝑃conditional𝑌𝑋\hat{P}_{Y\>|\>X}over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT italic_Y | italic_X end_POSTSUBSCRIPT in place of PY|Xsubscript𝑃conditional𝑌𝑋P_{Y\>|\>X}italic_P start_POSTSUBSCRIPT italic_Y | italic_X end_POSTSUBSCRIPT. 
When using the k𝑘kitalic_k-nearest neighbor algorithm to obtain P^Y|Xsubscript^𝑃conditional𝑌𝑋\hat{P}_{Y\>|\>X}over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT italic_Y | italic_X end_POSTSUBSCRIPT, we show that this term is of order (k/n)1/d+1/nsuperscript𝑘𝑛1𝑑1𝑛(k/n)^{1/d}+1/\sqrt{n}( italic_k / italic_n ) start_POSTSUPERSCRIPT 1 / italic_d end_POSTSUPERSCRIPT + 1 / square-root start_ARG italic_n end_ARG, which differs from the standard non-parametric convergence rate in (k/n)1/d+1/ksuperscript𝑘𝑛1𝑑1𝑘(k/n)^{1/d}+1/\sqrt{k}( italic_k / italic_n ) start_POSTSUPERSCRIPT 1 / italic_d end_POSTSUPERSCRIPT + 1 / square-root start_ARG italic_k end_ARG found in regression problems.
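To make the sampling estimator $\hat{Q}^*(h)$ concrete, here is a minimal sketch in plain Python. It assumes that $\hat{P}_{Y|X}$ is the $k$-NN measure, so that $Y_{n,i}^*$ is drawn uniformly among the labels of the $k$ source points nearest to $X_i^*$. The function names `nearest_neighbor_indices` and `sampling_estimate` are illustrative and do not come from the paper, and the brute-force neighbor search is used only for readability (a tree-based search would be needed for the quasi-linear running time).

```python
import random


def nearest_neighbor_indices(x, source_x, k):
    """Indices of the k source points closest to x (squared Euclidean distance)."""
    dist = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in source_x]
    return sorted(range(len(source_x)), key=dist.__getitem__)[:k]


def sampling_estimate(h, source_x, source_y, target_x, k=1, rng=None):
    """Average of h(X_i*, Y_i*) over the target inputs.

    Y_i* is drawn uniformly among the labels of the k nearest source
    neighbors of X_i* (an assumption matching the k-NN measure).
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    total = 0.0
    for x_star in target_x:
        neighbors = nearest_neighbor_indices(x_star, source_x, k)
        y_star = source_y[rng.choice(neighbors)]
        total += h(x_star, y_star)
    return total / len(target_x)


# Toy usage: labeled source points and unlabeled target points in dimension 1.
source_x = [(0.0,), (1.0,), (2.0,)]
source_y = [0.0, 10.0, 20.0]
target_x = [(0.1,), (1.9,), (2.2,)]
estimate = sampling_estimate(lambda x, y: y, source_x, source_y, target_x, k=1)
```

With $k=1$ the draw is deterministic: each target point simply receives the label of its nearest source point, so no randomness remains beyond the data themselves.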

4.2 Marginal sampling error

First, we will show that the marginal sampling error, $(\hat{Q}^*-\hat{Q})(h)$ (the first term in our decomposition (1)), is of order $1/\sqrt{m}$. The analysis relies on martingale tools. Define $\mathcal{F}_n=\sigma((X_1,Y_1),\ldots,(X_n,Y_n))$. For each $1\leq i\leq m$, we have

$$\mathbb{E}[h(X_i^*,Y_{n,i}^*)\,|\,\mathcal{F}_n]=\hat{Q}(h).$$

This property implies that $\sum_{i=1}^m\{h(X_i^*,Y_{n,i}^*)-\int h(x,y)\,\hat{Q}(dx,dy)\}$ is a martingale and can therefore be analyzed using the Lindeberg CLT conditionally on the initial sample, hence fixing the distribution $\hat{Q}$. The next property is reminiscent of certain results about the bootstrap method where sampling is done with the basic empirical measure; see, e.g., Van der Vaart (2000). We need this type of result without specifying the measure $\hat{Q}$ so that we can incorporate a variety of sampling schemes such as $\hat{Q}=\hat{P}_{Y|X}Q_X$. The proof is given in Appendix B.1.

Proposition 1.

Suppose that $\hat{Q}$ satisfies the following strong law of large numbers: for each $h$ such that $Q(h)<\infty$, we have $\lim_{n\to\infty}\hat{Q}(h)=Q(h)$ almost surely. Then, if $m:=m_n\to\infty$ as $n\to\infty$, we have the following central limit theorem: for each function $h$ such that $Q(h^2)<\infty$, we have, conditionally on $\mathcal{F}_n$, almost surely,

$$\sqrt{m}\{\hat{Q}^*(h)-\hat{Q}(h)\}\leadsto\mathcal{N}(0,V)\qquad\text{as }n\to\infty,$$

where $V=\lim_{n\to\infty}\{\hat{Q}(h^2)-\hat{Q}(h)^2\}$.

As a corollary of the previous results, we can already deduce that if $m$ goes to $\infty$ and $\hat{Q}$ satisfies a strong law of large numbers, then $\hat{Q}^*(h)$ converges to $Q(h)$ provided that $Q(h^2)$ exists. This is a general consistency result that justifies the use of any resampling distribution $\hat{Q}$ that converges to $Q$. In practical situations, it is useful to know a finite-sample bound on the error. This is the purpose of the next proposition, in which we give a non-asymptotic control of the sampling error. A proof is given in Appendix B.2.

Proposition 2.

Suppose that $h$ is bounded by a constant $U_h>0$. Let $\delta\in(0,1)$. Then with probability greater than $1-\delta$,

$$\left|\hat{Q}^*(h)-\hat{Q}(h)\right|\leq\frac{U_h}{m}\log(2/\delta)+\sqrt{2\frac{\hat{v}_n}{m}\log(2/\delta)},$$

where $\hat{v}_n=\hat{Q}(h^2)-(\hat{Q}h)^2$.
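The right-hand side of Proposition 2 is simple to evaluate numerically, which helps when choosing the number $m$ of target draws. The sketch below is only an illustration: the function name `bernstein_sampling_bound` and its argument names are ours, and the variance proxy $\hat{v}_n$ is passed in rather than estimated.

```python
import math


def bernstein_sampling_bound(u_h, v_hat, m, delta):
    """Evaluate U_h/m * log(2/delta) + sqrt(2 * v_hat/m * log(2/delta))."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if m < 1:
        raise ValueError("m must be a positive integer")
    log_term = math.log(2.0 / delta)
    return u_h * log_term / m + math.sqrt(2.0 * v_hat * log_term / m)


# Example: |h| <= 1, variance proxy 0.25, confidence level 95%.
bounds = [bernstein_sampling_bound(1.0, 0.25, m, 0.05) for m in (100, 1_000, 10_000)]
```

For large $m$ the square-root term dominates, so the bound shrinks at the $1/\sqrt{m}$ rate announced at the start of this section.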

Notes.

A natural “averaging” alternative to the above “sampling” estimator $\hat{Q}^*$ can also be investigated using the same tools. Instead of sampling $Y_{n,i}^*$ according to $\hat{P}_n(dy|X_i^*)$, one might consider taking the expectation, leading to

$$\overline{Q}(h)=\frac{1}{m}\sum_{i=1}^m\int h(X_i^*,y)\,\hat{P}_n(dy|X_i^*).$$

This estimate can be studied in a similar way as before, and the two results above remain valid with small changes. In particular, Proposition 2 holds true with a smaller variance term since, by Jensen’s inequality, $Q_X\big(\big(\int h(X_i^*,y)\,\hat{P}_n(dy|X_i^*)\big)^2\big)\leq\hat{Q}(h^2)$. This alternative $\overline{Q}(h)$ requires more computing time (when measured in terms of evaluations of $h$) and is less appealing for stochastic gradient descent algorithms or in semiparametric estimation problems, as discussed in Section 7. Estimators similar to $\overline{Q}(h)$ have been studied in the average treatment effects literature (Rosenbaum, 1995; Abadie and Imbens, 2006); see Section 6 for a precise discussion.
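The averaging variant can be sketched as follows, again in plain Python and assuming that $\hat{P}_n(dy|x)$ is the $k$-NN measure, so the inner integral becomes a mean over the $k$ nearest source labels. The names `knn_indices` and `averaging_estimate` are hypothetical.

```python
def knn_indices(x, source_x, k):
    """Indices of the k source points closest to x (squared Euclidean distance)."""
    dist = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in source_x]
    return sorted(range(len(source_x)), key=dist.__getitem__)[:k]


def averaging_estimate(h, source_x, source_y, target_x, k):
    """Mean over target inputs of the k-NN average of h(X_i*, Y_j)."""
    total = 0.0
    for x_star in target_x:
        neighbors = knn_indices(x_star, source_x, k)
        # k evaluations of h per target point, versus one for the sampling estimator.
        total += sum(h(x_star, source_y[j]) for j in neighbors) / k
    return total / len(target_x)


source_x = [(0.0,), (1.0,), (2.0,)]
source_y = [0.0, 10.0, 20.0]
target_x = [(0.1,), (1.9,), (2.2,)]
q_bar = averaging_estimate(lambda x, y: y, source_x, source_y, target_x, k=2)
```

The loop makes the cost remark visible: the averaging estimator evaluates $h$ a total of $km$ times, whereas the sampling estimator needs only $m$ evaluations. For $k=1$ the two estimators coincide.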

4.3 Conditional sampling error of the nearest neighbor estimate

Our aim in this section is to obtain a bound on $\hat{Q}(h)-Q(h)$ (the second term in our decomposition (1)) when $\hat{P}_{Y|X}$ is the $k$-nearest neighbor measure.

Let $x\in\mathbb{R}^d$ and $\lVert\cdot\rVert$ be the Euclidean norm on $\mathbb{R}^d$. Denote the closed ball of radius $\tau\geq 0$ around $x$ by $B(x,\tau):=\{z\in\mathbb{R}^d\mid\lVert x-z\rVert\leq\tau\}$. For $n\geq 1$ and $k\in\{1,\ldots,n\}$, the $k$-nearest neighbor ($k$-NN for short) radius at $x$ is denoted by $\hat{\tau}_{n,k,x}$ and defined as the smallest radius $\tau\geq 0$ such that the ball $B(x,\tau)$ contains at least $k$ points from the collection $\{X_1,\ldots,X_n\}$. That is,

$$\hat{\tau}_{n,k,x}:=\inf\left\{\tau\geq 0\,:\,\sum_{i=1}^n 1_{B(x,\tau)}(X_i)\geq k\right\},$$

where $1_A(x)$ is $1$ if $x\in A$ and $0$ otherwise. The $k$-NN estimate of $P_{Y|X}(dy\,|\,x)$ is given by

$$\hat{P}_{Y|X}(dy\,|\,x)=k^{-1}\sum_{i=1}^n 1_{\lVert X_i-x\rVert\leq\hat{\tau}_{n,k,x}}\,\delta_{Y_i}(dy),$$

where $\delta_y(\cdot)$ is the Dirac measure at $y\in\mathcal{Y}$ defined by $\delta_y(A)=1_A(y)$ for any measurable set $A\subseteq\mathcal{Y}$. Consequently, the $k$-NN estimate of the integral $\int h(y,x)\,P_{Y|X}(dy\,|\,x)$ is defined as

$$\int h(y,x)\,\hat{P}_{Y|X}(dy\,|\,x)=k^{-1}\sum_{i=1}^n 1_{\lVert X_i-x\rVert\leq\hat{\tau}_{n,k,x}}\,h(Y_i,x).$$
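The two definitions above translate almost line by line into code. The following sketch uses only the Python standard library; the names `knn_radius` and `knn_conditional_mean` are ours, and `h` takes its arguments in the order $(y,x)$ used in this subsection.

```python
import math


def knn_radius(x, source_x, k):
    """k-NN radius at x: the k-th smallest Euclidean distance to a source point."""
    distances = sorted(math.dist(x, xi) for xi in source_x)
    return distances[k - 1]


def knn_conditional_mean(h, x, source_x, source_y, k):
    """k^{-1} times the sum of h(Y_i, x) over source points within the k-NN radius.

    When distances are tied at the radius, more than k points are included,
    exactly as in the displayed formula; under a continuous density such
    ties have probability zero.
    """
    tau = knn_radius(x, source_x, k)
    return sum(
        h(y, x) for xi, y in zip(source_x, source_y) if math.dist(x, xi) <= tau
    ) / k


source_x = [(0.0,), (1.0,), (3.0,)]
source_y = [1.0, 2.0, 4.0]
radius = knn_radius((0.4,), source_x, k=2)
cond_mean = knn_conditional_mean(lambda y, x: y, (0.4,), source_x, source_y, k=2)
```

In this toy example the two nearest source points to $x=0.4$ are $0$ and $1$, so the radius is the distance to the second one and the estimate is the mean of their labels.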

To obtain some guarantee on the behavior of the nearest neighbors estimate, we consider the case in which the covariates $X$ admit a density with respect to the Lebesgue measure. We will need in addition that the support $S_X$ is well shaped and that the density is lower bounded. These are standard regularity conditions to obtain some upper bound on the $k$-NN radius.

  1. (X1)

    The random variable $X$ admits a density $p_X$ with compact support $S_X\subset\mathbb{R}^d$.

  2. (X2)

    There are $c>0$ and $T>0$ such that

    $$\lambda(S_X\cap B(x,\tau))\geq c\,\lambda(B(x,\tau)),\qquad\forall\tau\in(0,T],\ \forall x\in S_X,$$

    where $\lambda$ is the Lebesgue measure.

  3. (X3)

    There are constants $0<b_X\leq U_X<+\infty$ such that $b_X\leq p_X(x)\leq U_X$ for all $x\in S_X$.

To obtain our main result on the estimation property of the $k$-NN measure, we need some assumptions on the target measure $Q_X$.

  1. (X4)

    The probability measure $Q_X$ admits a bounded density $q_X$ with support $S_X$. We will take $U_X$ large enough that it is also an upper bound of $q_X$.

Two additional assumptions, distinct from the preceding ones on $X$, will be needed to deal with the function $h$ and the probability distribution of $(Y,X)$.

  1. (H1)

    For any $x$ in $S_X$,

    $$\lvert\mathbb{E}[h(Y,x)\,|\,X=x]-\mathbb{E}[h(Y,x)\,|\,X=x+u]\rvert\leq g_h(x)\lVert u\rVert$$

    with $\int g_h^2(x)\,Q_X(dx)<\infty$.

  2. (H2)

    There exists $\sigma_+^2>0$ such that $\sup_{x\in S_X}\operatorname{Var}(h(Y,x)\,|\,X)\leq\sigma_+^2$ a.s., where $\operatorname{Var}(h(Y,x)\,|\,X)$ is the conditional variance of $h(Y,x)$ given $X$.

In what follows, we give a control of the RMSE of $\hat{Q}h$. Let $\|X\|_2=\sqrt{\mathbb{E}(X^2)}$, let $[x]$ denote the integer part of a real number $x$, and let $\Gamma(x):=\int_0^\infty u^{x-1}\exp(-u)\,du$ for $x>0$. Finally, we denote by $V_d:=\lambda(B(0,1))=\frac{\pi^{d/2}}{\Gamma(d/2+1)}$ the volume of the unit Euclidean ball in dimension $d$ for the Lebesgue measure.
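As a quick numerical check of the constant $V_d$, the volume of the unit Euclidean ball can be computed from the standard closed form $\pi^{d/2}/\Gamma(d/2+1)$. The helper name `unit_ball_volume` is ours.

```python
import math


def unit_ball_volume(d):
    """Lebesgue volume of the unit Euclidean ball in dimension d."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


# Familiar special cases: length 2 (d=1), area pi (d=2), volume 4*pi/3 (d=3).
volumes = {d: unit_ball_volume(d) for d in (1, 2, 3)}
```

The volume decreases to zero as $d$ grows beyond 5, which is one reason why constants depending on $V_d$ can become large in high dimension.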

We give an upper bound for the RMSE with explicit constants with respect to the dimension $d$. Additionally, we give a lower bound for the variance, which has a standard parametric rate of convergence. The proof is given in Appendix C.1.

Proposition 3.

Suppose that Assumptions (X1), (X2), (X3), (X4), (H1), and (H2) are fulfilled. We have

$$\hat{Q}h-Qh=S_h+B_h,$$

where $B_h$ is a bias term (defined in the proof) that satisfies, for any $n\geq 1$,

$$\mathbb{E}|B_h|^2\leq\frac{2\Gamma\left(1+[2/d]\right)}{M_{1,d}^{2/d}}\int g_h^2(x)\,Q_X(dx)\cdot\frac{k^{2/d}}{n^{2/d}},$$

and $S_h$ is a variance term (defined in the proof) that satisfies, for any $n\geq 2$,

$$\frac{\sigma_-^2M_{1,d}^2}{4M_{2,d}^2}n^{-1}\leq\min_{1\leq k\leq n}\mathbb{E}\left[S_h^2\right]\leq\max_{1\leq k\leq n}\mathbb{E}\left[S_h^2\right]\leq\frac{2^{d+3}\sigma_+^2M_{2,d}^2}{M_{1,d}^2}n^{-1}.$$

For the lower bound to be true, it is assumed that the mapping $h$ does not depend on $x$, i.e., $h(y,x)=h(y)$, and $\sigma_-^2=\inf_{x\in S_X}\operatorname{Var}\left(h(Y)\,|\,X=x\right)$.

Notes.

(i) The two terms $B_h$ and $S_h$ correspond respectively to the bias term and the variance term. The upper bound obtained for the bias term is usual in $k$-NN regression analysis. However, the upper and lower bounds on the variance are particular to our framework, as they show that the variance behaves as in usual parametric estimation. Consequently, our rates of convergence are sharper than the optimal rate of convergence $n^{-\frac{1}{2+d}}$ for nonparametric estimation of Lipschitz functions. This can be explained by the fact that several $k$-NN estimators are averaged to estimate $Qh$, which is a standard expectation and not a conditional expectation.

(ii) Since the rate of convergence of the variance term $S_h$ does not depend on $k$, $k$ might be chosen according to the upper bound on the bias term, which gives $k=1$. One can deduce the following convergence rates, depending on the dimension. For $d=1$, we get the rate $n^{-1/2}$. For $d=2$, the contributions of both terms, $B_h$ and $S_h$, coincide and we get the rate $n^{-1/2}$. For $d\geq 3$, the rate is $n^{-1/d}$.
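The case distinction in Note (ii) amounts to taking the slower of the bias rate $n^{-1/d}$ and the standard-deviation rate $n^{-1/2}$ when $k=1$. A tiny sketch, with the hypothetical helper names `rmse_rate_exponent` and `predicted_rmse_rate`, makes the arithmetic explicit (constants are ignored).

```python
def rmse_rate_exponent(d):
    """Exponent a such that the RMSE of the 1-NN estimate is of order n**(-a)."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    bias_exponent = 1.0 / d   # from the bias bound (k/n)**(1/d) with k = 1
    std_exponent = 0.5        # from the parametric variance bound of order 1/n
    return min(bias_exponent, std_exponent)


def predicted_rmse_rate(n, d):
    """Rate n**(-a) up to a multiplicative constant that is not modeled here."""
    return n ** (-rmse_rate_exponent(d))


exponents = {d: rmse_rate_exponent(d) for d in (1, 2, 3, 10)}
```

The variance term dominates for $d=1$, the two terms balance at $d=2$, and the bias term dominates for $d\geq 3$.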

For the global mean square error, which incorporates the marginal sampling error as well as the $k$-NN conditional sampling error, we give the following result in the optimal case $k=1$. The proof can be found in Appendix D.

Theorem 1.

Suppose that Assumptions (X1), (X2), (X3), (X4) and (H1) hold true with $\sup_{x\in S_X}\mathbb{E}\left[h^2(Y,x)\,|\,X\right]$ bounded. If $k=1$, then there exists $C>0$ depending only on the distribution of $(X,Y)$, on $X^*$, and on $h$ such that

$$\mathbb{E}\left|\hat{Q}^*(h)-Q(h)\right|^2\leq C\left\{\frac{1}{m}+\frac{1}{n^{2/d}}+\frac{1}{n}\right\}.$$

We next give a non-asymptotic control of $\hat{Q}h-Qh$ when $h$ is a bounded function, using Bernstein’s concentration inequality. This bound complements the bound for the MSE. However, for technical reasons, this high-probability bound requires that $k$ grows at least logarithmically with respect to $n$, in contrast to Proposition 3. In our numerical experiments, we will also include the case $k=\log n$ for comparison. The proof of the next result is given in Appendix C.2.

Proposition 4.

Suppose that Assumptions (X1), (X2), (X3), (X4), (H1), and (H2) are fulfilled. Suppose that there exists a constant $C>0$ such that $C\log n\leq k\leq n/2$ and that $h$ is bounded by $U_h$. Let $\delta\in(0,1/3)$. With probability greater than $1-3\delta$, we have

$$\left|\hat{Q}h-Qh\right|\leq L_0\left(\frac{k}{n}\right)^{1/d}+\frac{2L_2}{n}\log(2/\delta)+\sqrt{\frac{2L_1\sigma^2}{n}\log(2/\delta)},$$

where

$$L_0=\left(\frac{2}{cb_XV_d}\right)^{1/d}\int g_h(x)Q_X(dx),\quad L_1=\frac{4\sigma_+^2U_X^2}{b_X^2c^2},\quad L_2=\frac{4U_hU_X}{3b_Xc}.$$

Notes.

(i) The proof needs a bound on $\sup_{x\in S_X}\hat{\tau}_{n,k,x}$, which is given in (Portier, 2021, Lemma 4). For this, we need $k$ to grow at least logarithmically w.r.t. $n$, as stated in the assumptions.

(ii) The sum of the last two terms in the upper bound, which corresponds to the variance of our estimator, is of order $1/\sqrt{n}$, and the conditional variance of $h$ appears as a multiplicative factor. Combining Proposition 2 and Proposition 4, we finally obtain that $\hat{Q}^*(h)-Q(h)$ is of order $1/\sqrt{m}+1/\sqrt{n}+(k/n)^{1/d}$ (up to log factors).

(iii) In Corollary 1 in Appendix E, the upper bound given in Proposition 2 is refined using Proposition 4, which allows us (roughly speaking) to replace the empirical variance by the true variance.
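The relative size of the three terms in the bound of Proposition 4 can be explored numerically. The following sketch simply evaluates the right-hand side term by term. The constants $L_0$, $L_1$, $L_2$ and $\sigma^2$ are set to 1 purely as placeholders (their actual values depend on the unknown distributions), and the function name is hypothetical.

```python
import math


def high_probability_bound(n, k, d, delta, L0, L1, L2, sigma2):
    """Return the three terms of the Proposition 4 bound for given constants."""
    log_term = math.log(2.0 / delta)
    bias = L0 * (k / n) ** (1.0 / d)                            # L0 (k/n)^(1/d)
    bernstein_linear = 2.0 * L2 * log_term / n                  # (2 L2 / n) log(2/delta)
    bernstein_sqrt = math.sqrt(2.0 * L1 * sigma2 * log_term / n)
    return bias, bernstein_linear, bernstein_sqrt


if __name__ == "__main__":
    # Placeholder constants equal to 1; k is taken as ceil(log n) for illustration.
    for d in (1, 2, 5):
        for n in (10**3, 10**5):
            k = math.ceil(math.log(n))
            print(d, n, high_probability_bound(n, k, d, 0.05, 1.0, 1.0, 1.0, 1.0))
```

With such placeholder constants, the term $(k/n)^{1/d}$ decays more slowly than $n^{-1/2}$ as soon as $d>2$, so it is the dominant term for large $n$ in higher dimensions.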

5 Applications to empirical risk minimization

In this section, we illustrate our results with some applications to empirical risk minimization. This is of particular interest in our context because the optimal linear model for the source distribution might differ from the optimal linear model for the target. In such a case, covariate shift adaptation is beneficial, since the source minimizer will be away from the target minimizer.

5.1 Mathematical background

Suppose that $\mathcal{R}^*_{m,n}(\theta)=\frac{1}{m}\sum_{i=1}^{m}m_\theta\left(Y^*_{n,i},X_i^*\right)$, where for each $\theta\in\Theta\subset\mathbb{R}^d$, $m_\theta$ is a measurable function from $\mathbb{R}\times\mathbb{R}^p$ to $\mathbb{R}$. Set

$$\hat{\theta}^*\in\arg\min_{\theta\in\Theta}\mathcal{R}^*_{m,n}(\theta).$$

Similarly, we define

$$\theta^*:=\arg\min_{\theta\in\Theta}\mathcal{R}^*(\theta)$$

with $\mathcal{R}^*(\theta)=\mathbb{E}\,m_\theta\left(Y^*,X^*\right)$, where $(Y^*,X^*)$ is a copy of $(Y_i^*,X_i^*)$. Note that the expected value is taken with respect to the unobserved label $Y_i^*$ and not the generated label $Y_{n,i}^*$. We assume here that for a reference measure $\mu$ on $\mathcal{Y}$, there exists for each $x\in\mathcal{X}$ a conditional density $p(\cdot|x)$ such that $(x,y)\mapsto p(y|x)$ is jointly measurable and, for any Borel set $B$,

$$\mathbb{P}\left(Y\in B|X=x\right)=\int_B p(y|x)\mu(dy).$$

One can then include the case of classification ($\mu=\delta_0+\delta_1$), counts ($\mu$ is the counting measure on the set of nonnegative integers) or regression ($\mu$ is the Lebesgue measure on $\mathbb{R}$).

5.2 Consistency of general empirical risk minimizers

We will use the following assumptions.

  1. (A1)

    There exist a measurable function $h:\mathbb{R}\rightarrow\mathbb{R}_+$ and a function $\eta:\mathbb{R}_+\rightarrow\mathbb{R}_+$ satisfying

    $$\sup_{x\in\mathcal{X}}\sup_{\theta\in\Theta}\left|m_\theta(y,x)\right|\leq h(y),$$
    $$\sup_{x\in\mathcal{X}}\sup_{|\theta-\theta'|\leq\delta}\left|m_\theta(y,x)-m_{\theta'}(y,x)\right|\leq h(y)\eta(\delta),$$

    and such that $\mathbb{E}\left[h(Y)^2|X\right]$ is a bounded random variable and $\lim_{\delta\rightarrow 0}\eta(\delta)=0$.

  2. (A2)

    There exists a measurable function $g_h:\mathcal{X}\rightarrow\mathbb{R}_+$ such that $\int g_h(x)Q_X(dx)<\infty$ and

    $$\int h(y)\left|p(y|x+u)-p(y|x)\right|\mu(dy)\leq g_h(x)|u|,\quad (x,x+u)\in S_X^2.$$

The above assumptions are satisfied, for instance, in the logistic regression framework with compact covariates. In this case, $h$ is a constant function and $\eta(\delta)=\delta$. Note also that $p(1|x)$ may differ from $\left(1+\exp\left(-x^T\theta\right)\right)^{-1}$, as long as $x\mapsto p(1|x)\in(0,1)$ is Lipschitz on $S_X$.
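As a quick check of (A1) in this example, assume for illustration that the loss is the logistic negative log-likelihood $m_\theta(y,x)=\log\left(1+e^{x^T\theta}\right)-y\,x^T\theta$ with $y\in\{0,1\}$, and that $|x|\leq R$ on the covariate domain and $|\theta|\leq B$ on $\Theta$ (the specific form of the loss and the constants $R$, $B$ are notation introduced here, not taken from the paper). Since $\log(1+e^t)\leq\log 2+|t|$ and the derivative of $t\mapsto\log(1+e^t)-yt$ lies in $[-1,1]$, the mean value theorem gives

$$\left|m_\theta(y,x)\right|\leq\log 2+2RB,\qquad\left|m_\theta(y,x)-m_{\theta'}(y,x)\right|\leq\left|x^T(\theta-\theta')\right|\leq R\,|\theta-\theta'|,$$

so that (A1) holds with the constant function $h\equiv\max(\log 2+2RB,\,R)$ and $\eta(\delta)=\delta$.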

In what follows, an assertion of the form $X_{m,n}=o_{\mathbb{P}}(1)$ as $m,n\rightarrow\infty$ means that for any $\epsilon,\zeta>0$, there exists $A>0$ such that

$$\min(m,n)\geq A\Rightarrow\mathbb{P}\left(|X_{m,n}|>\epsilon\right)\leq\zeta.$$

Additionally, the assertion $X_{m,n}=O_{\mathbb{P}}(1)$ means that for any $\varepsilon>0$, there exist $A,M>0$ such that

$$\sup_{m,n\geq A}\mathbb{P}\left(|X_{m,n}|>M\right)\leq\varepsilon.$$

The proof of the following result is in Appendix F.1.

Theorem 2.

Suppose that Assumptions (X1), (X2), (X3), (X4), (A1), and (A2) hold true, that $\Theta$ is a compact subset of $\mathbb{R}^d$, and that $\theta^*$ is the unique minimizer of $\mathcal{R}^*$. Then $\hat{\theta}^*-\theta^*=o_{\mathbb{P}}(1)$ as $m,n\rightarrow\infty$. Moreover, the excess risk satisfies $\mathcal{R}^*(\hat{\theta}^*)-\mathcal{R}^*(\theta^*)=o_{\mathbb{P}}(1)$.

5.3 Convergence rate for linear least-squares estimators

We now illustrate our results with an upper bound on the excess risk for linear least-squares estimators in the misspecified case. Here, the targeted risk is given by

$$\mathcal{R}^*(\theta)=\mathbb{E}\left[\left(Y^*-{X^*}^T\theta\right)^2\right]$$

and any optimal linear rule should simply satisfy

$$\theta^*\in\arg\min_{\theta\in\mathbb{R}^d}\mathcal{R}^*(\theta).$$

Note that $\theta^*$ is unique if the matrix $\mathbb{E}\left[X^*{X^*}^T\right]$ is of full rank. The empirical risk is defined by

$$\mathcal{R}^*_m(\theta)=\frac{1}{m}\sum_{i=1}^m\left(Y^*_{n,i}-{X_i^*}^T\theta\right)^2$$

and $\hat{\theta}^*$, the empirical risk minimizer, is given by

$$\hat{\theta}^*\in\arg\min_{\theta\in\mathbb{R}^d}\mathcal{R}^*_m(\theta).$$
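The construction of $\hat{\theta}^*$ can be sketched in a few lines of Python for scalar inputs. This is a minimal illustration under our own assumptions: the generated labels $Y^*_{n,i}$ are taken to be the responses of the nearest source inputs ($k=1$), the feature vector is $(1,x)$ so that the linear rule has an intercept, and the function names and toy data are hypothetical.

```python
def one_nn_labels(target_x, source_x, source_y):
    """Label each scalar target input with the response of its nearest source input."""
    labels = []
    for x_star in target_x:
        j = min(range(len(source_x)), key=lambda i: abs(source_x[i] - x_star))
        labels.append(source_y[j])
    return labels


def least_squares_line(xs, ys):
    """Minimize the mean of (y - a - b * x)^2 over (a, b) in closed form."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


if __name__ == "__main__":
    source_x = [0.0, 1.0, 2.0, 3.0]
    source_y = [0.0, 1.0, 4.0, 9.0]   # nonlinear responses: the linear model is misspecified
    target_x = [0.1, 0.9, 2.2, 2.9]   # unlabeled target inputs
    pseudo_y = one_nn_labels(target_x, source_x, source_y)
    print(pseudo_y, least_squares_line(target_x, pseudo_y))
```

The fit uses only the target inputs and their generated labels; the source sample enters solely through the nearest neighbor labeling step.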

The excess risk satisfies the following upper bound whose proof is given in Appendix F.1.

Theorem 3.

Suppose that Assumptions (X1), (X2), (X3), (X4) hold true. Suppose that the mapping $x\mapsto\mathbb{E}\left[Y|X=x\right]$ is Lipschitz and that the conditional expectation $\mathbb{E}[Y^2|X]$ is bounded. Suppose also that $\Gamma=\mathbb{E}[X^*{X^*}^T]$ is positive definite. Then, we have

$$\mathcal{R}^*(\hat{\theta}^*)-\mathcal{R}^*(\theta^*)=O_{\mathbb{P}}\left(m^{-1}+n^{-1}+n^{-2/d}\right).$$

Notes.

The assumptions do not require the linear model for the $(X_i,Y_i)$'s to be valid, i.e., one can consider cases where $\mathbb{E}[Y|X]$ is not linear. Also, when the source data follow a non-linear model of the form $Y=r\left(X\right)+\varepsilon$, where $\varepsilon$ and $X$ are independent, our regularity assumption means that $r$ is Lipschitz on the compact set $S_X$.

6 Related work

A standard approach to covariate shift problems is to use some re-weighting in order to “transfer” the source distribution with density $p_X$ to the target distribution with density $q_X$. This approach relies on the following type of estimates:

$$\hat{Q}_w(h)=n^{-1}\sum_{i=1}^n w(X_i)h(X_i,Y_i),$$

where ideally the function $w$ would take the form $q_X/p_X$. Such a choice has the nice property that the expected value $\mathbb{E}[w(X_i)h(X_i,Y_i)]$ is equal to the targeted quantity $Q(h)$. However, this weight cannot be computed directly, as $p_X$ and $q_X$ are unknown in practice. There are different ways to estimate $w$, and our goal here is to distinguish between two leading approaches.
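The unbiasedness property of the ideal weight $w=q_X/p_X$ can be checked in one line. The computation below assumes that $Q(h)$ denotes the expectation of $h$ under the target distribution, that $P_{Y|X}=Q_{Y|X}$ (the covariate shift assumption of the introduction), and that $q_X$ vanishes wherever $p_X$ does:

$$\mathbb{E}[w(X_i)h(X_i,Y_i)]=\int\!\!\int h(x,y)\,P_{Y|X=x}(dy)\,\frac{q_X(x)}{p_X(x)}\,p_X(x)\,dx=\int\!\!\int h(x,y)\,Q_{Y|X=x}(dy)\,q_X(x)\,dx=Q(h).$$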

Plug-in approach

The plug-in approach computes the weights using two estimates $\hat{p}_X$ and $\hat{q}_X$ in place of $p_X$ and $q_X$, respectively; i.e., it simply uses $\hat{w}=\hat{q}_X/\hat{p}_X$ instead of $w$ in the above formula; see for instance (Shimodaira, 2000; Sugiyama et al., 2007, 2008). Note that hyper-parameters must be selected for $\hat{q}_X$ and $\hat{p}_X$, and that the $n$ evaluations $(\hat{q}_X(X_i),\hat{p}_X(X_i))$ might be heavy in terms of computing time.

For the sake of clarity, we focus on a specific instance of the covariate shift problem in which the target probability density $q_X$ is known and $\hat{p}_X$ is the kernel density estimate (KDE), i.e., $\hat{p}_X^{KDE}(x)=(1/n)\sum_{i=1}^n K_b(x-X_i)$, where $K_b$ typically is a Gaussian density with mean $0$ and variance $b^2$ (a hyper-parameter to be tuned). Note that such a situation does not involve any change for our sampling procedure, whereas it is clearly advantageous for the weighted approach, for which one unknown, $q_X$, is now given.
In this case, the analysis of $\hat{Q}_{\hat{w}}(h)$ can be carried out using the decomposition $\hat{Q}_{\hat{w}}(h)-Q(h)=\hat{Q}_{\hat{w}}(h-Th)+\hat{Q}_{\hat{w}}(Th)-Q(h)$, with $Th(x)=\mathbb{E}[h(X,Y)|X=x]$. The first term, $\hat{Q}_{\hat{w}}(h-Th)$, is a sum of centered random variables which (under some conditions) satisfies the so-called Lindeberg condition, so that the central limit theorem implies that $\sqrt{n}\hat{Q}_{\hat{w}}(h-Th)$ is asymptotically Gaussian. The second term, $\hat{Q}_{\hat{w}}(Th)-Q(h)$, is more complicated, and its analysis can be derived using results in Delyon and Portier (2016); Clémençon and Portier (2018).
Those results assert (under some conditions) that $\hat{Q}_{\hat{w}}(Th)-Q(h)=O_p((nb^d)^{-1}+b)$ (in case $p_X$ is Lipschitz). As a consequence, we obtain, optimizing over $b$ (the two terms balance at $b\asymp n^{-1/(1+d)}$), that $\hat{Q}_{\hat{w}}(h)-Q(h)=O_p(n^{-1/2}+n^{-1/(1+d)})$. This is easily compared to our bound when $q_X$ is known, $n^{-1/2}+n^{-1/d}$, which is smaller than the one given before.
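For comparison with the sampling approach, the plug-in estimator discussed above can be sketched as follows for scalar inputs, assuming a known target density and a Gaussian KDE of the source density. The function names, the argument `q_density` and the bandwidth value are hypothetical; this is an illustration of the general recipe rather than the implementation used in the cited works.

```python
import math


def gaussian_kde(x, sample, b):
    """Kernel density estimate at x with a Gaussian kernel of standard deviation b."""
    norm = 1.0 / (b * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-0.5 * ((x - xi) / b) ** 2) for xi in sample) / len(sample)


def plug_in_estimate(h, q_density, source_x, source_y, b):
    """Weighted source average of h(x, y) with weights q_density(X_i) / KDE(X_i)."""
    total = 0.0
    for xi, yi in zip(source_x, source_y):
        weight = q_density(xi) / gaussian_kde(xi, source_x, b)  # one KDE evaluation per point
        total += weight * h(xi, yi)
    return total / len(source_x)


if __name__ == "__main__":
    # Tiny hypothetical example: target density 2x on [0, 1], bandwidth chosen by hand.
    source_x = [0.1, 0.3, 0.5, 0.7, 0.9]
    source_y = [x ** 2 for x in source_x]
    print(plug_in_estimate(lambda x, y: y, lambda x: 2.0 * x, source_x, source_y, 0.2))
```

Each of the $n$ weight evaluations sums over the whole source sample, so this naive version costs $O(n^2)$ kernel evaluations, in addition to the need to tune the bandwidth $b$.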

Direct weight estimation

Huang et al. (2006) proposed Kernel Mean Matching (KMM) for estimating the ratios of the probability density functions of the source and the target distribution. They used the estimated ratios for weighting the source sample. Gretton et al. (2008) further studied this method theoretically and empirically. Sugiyama et al. (2007, 2008) proposed a method that estimates the ratios as a function by minimizing the Kullback-Leibler divergence between the source density function multiplied by the ratio function and the target density function. The estimated function can predict ratios even outside of the source sample, which enables cross-validation for hyper-parameter tuning. Kanamori et al. (2009) proposed constrained and unconstrained least-squares methods for estimating the ratio function, called Least-Squares Importance Fitting (LSIF) and unconstrained LSIF (uLSIF). Yamada et al. (2013) developed a variant called Relative uLSIF (RuLSIF), which replaces the denominator of the ratio with a convex mixture of the source and the target density functions to circumvent issues caused by near-zero denominators. Zhang et al. (2021) proposed a covariate shift adaptation method that directly minimizes an upper bound of the target risk in order to avoid estimation of weights. Although the method shows strong empirical performance, it does not exactly minimize the target risk, and hence its minimizer converges to a biased solution.

Connection to treatment effect estimation

One of the quantities of great interest in treatment effect estimation is the average treatment effect on the treated (ATT), $\mathbb{E}[Y^{(1)}-Y^{(0)}\mid W=1]$, where $W\in\{0,1\}$ is a treatment assignment variable, and $Y^{(1)}$ and $Y^{(0)}$ are potential outcomes corresponding to the treatments $1$ and $0$. (A common scenario is that we have a treated group, represented by treatment 1, and a non-treated, or control, group, represented by treatment 0.) Suppose that we wish to estimate the ATT using i.i.d. observations of $W$ and its outcome $Y:=Y^{(W)}$ together with covariates $X$, $\{(Y_i,W_i,X_i)\}_{i=1}^N$.
Under the standard assumptions (see, e.g., Hernan and Robins (2023)), including the conditional exchangeability $Y^{(w)}\perp\!\!\!\perp W\mid X$, the positivity $P(\{W=w\mid X\})>0$, and the consistency $W=w\implies Y^{(w)}=Y^{(W)}=Y$, for each $w\in\{0,1\}$, the ATT equals the difference between

𝔼[Y(1)|W=1]𝔼delimited-[]conditionalsuperscript𝑌1𝑊1\displaystyle\mathbb{E}[Y^{(1)}\>|\>W=1]blackboard_E [ italic_Y start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT | italic_W = 1 ] =yPY(1)|X,W=1(dy)PX|W=1(dx)absent𝑦subscript𝑃conditionalsuperscript𝑌1𝑋𝑊1𝑑𝑦subscript𝑃conditional𝑋𝑊1𝑑𝑥\displaystyle=\int yP_{Y^{(1)}\>|\>X,W=1}(dy)P_{X\>|\>W=1}(dx)= ∫ italic_y italic_P start_POSTSUBSCRIPT italic_Y start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT | italic_X , italic_W = 1 end_POSTSUBSCRIPT ( italic_d italic_y ) italic_P start_POSTSUBSCRIPT italic_X | italic_W = 1 end_POSTSUBSCRIPT ( italic_d italic_x )
=yPY|X,W=1(dy)PX|W=1(dx)absent𝑦subscript𝑃conditional𝑌𝑋𝑊1𝑑𝑦subscript𝑃conditional𝑋𝑊1𝑑𝑥\displaystyle=\int yP_{Y\>|\>X,W=1}(dy)P_{X\>|\>W=1}(dx)= ∫ italic_y italic_P start_POSTSUBSCRIPT italic_Y | italic_X , italic_W = 1 end_POSTSUBSCRIPT ( italic_d italic_y ) italic_P start_POSTSUBSCRIPT italic_X | italic_W = 1 end_POSTSUBSCRIPT ( italic_d italic_x )
=𝔼[Y|W=1]absent𝔼delimited-[]conditional𝑌𝑊1\displaystyle=\mathbb{E}[Y\>|\>W=1]= blackboard_E [ italic_Y | italic_W = 1 ] (2)

and

$$
\begin{aligned}
\mathbb{E}[Y^{(0)} \mid W = 1] &= \int y \, P_{Y^{(0)} \mid X, W=0}(dy) \, P_{X \mid W=1}(dx) \\
&= \int y \, P_{Y \mid X, W=0}(dy) \, r(x) \, P_{X \mid W=0}(dx) \\
&= \mathbb{E}[r(X) Y \mid W = 0],
\end{aligned}
\tag{3}
$$

where $r(x)$ is the density ratio defined such that $r(x) \, dP_{X \mid W=0}(x) = dP_{X \mid W=1}(x)$. We can easily estimate the first term $\mathbb{E}[Y^{(1)} \mid W = 1]$ (Eq. (2)) by the conditional sample average $\frac{1}{N_1} \sum_{i=1}^{N} Y_i \cdot \mathds{1}_{W_i = 1}$, where $N_1 := \sum_{i=1}^{N} \mathds{1}_{W_i = 1}$. Estimating the second term $\mathbb{E}[Y^{(0)} \mid W = 1]$ (Eq. (3)) is more involved.
The sample average with the condition $W = 0$, $\frac{1}{N_0} \sum_{i=1}^{N} Y_i \cdot \mathds{1}_{W_i = 0}$, where $N_0 := \sum_{i=1}^{N} \mathds{1}_{W_i = 0}$, would be biased: it targets $\mathbb{E}[Y \mid W = 0]$, which in general differs from $\mathbb{E}[Y^{(0)} \mid W = 1]$. The bias, however, is only due to the difference between the conditional distributions of $X$ given $W = 1$ and of $X$ given $W = 0$, quantified by $r(X)$, similarly to covariate shift (see Eq. (3)).
One way to correct the bias is to use an estimate $\widehat{r}$ of the ratio $r$ in the weighted average $\frac{1}{N_0} \sum_{i=1}^{N} \widehat{r}(X_i) \cdot Y_i \cdot \mathds{1}_{W_i = 0}$, similarly to the reweighting approach to covariate shift adaptation, leading to the following estimate:

$$
\widehat{\operatorname{ATT}} = \frac{1}{N_1} \sum_{i \colon W_i = 1} Y_i - \frac{1}{N_0} \sum_{i \colon W_i = 0} \widehat{r}(X_i) \cdot Y_i.
$$
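The reweighted estimate above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `kde_density_ratio` and `att_reweighted` are hypothetical, and the density ratio is estimated here by a ratio of Gaussian kernel density estimates, which is one possible choice of $\widehat{r}$ among many.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_density_ratio(x_control, x_treated):
    """Estimate r(x) = p(x | W=1) / p(x | W=0) at the control points.

    Assumption: both densities are estimated with Gaussian KDEs.
    Inputs have shape (n_points, d); gaussian_kde expects (d, n_points).
    """
    kde_treated = gaussian_kde(x_treated.T)
    kde_control = gaussian_kde(x_control.T)
    return kde_treated(x_control.T) / kde_control(x_control.T)


def att_reweighted(y, w, r_hat_control):
    """Mean of treated outcomes minus the weighted mean of control outcomes."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w)
    treated_mean = y[w == 1].mean()
    # (1 / N_0) * sum over controls of r_hat(X_i) * Y_i
    control_term = np.mean(r_hat_control * y[w == 0])
    return treated_mean - control_term
```

With all weights equal to one, `att_reweighted` reduces to the plain difference of group means, which is the biased estimate discussed above.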

Another popular approach is nearest neighbor matching (Abadie and Imbens, 2006). See also Rosenbaum (1995) for a broad introduction to matching problems for evaluating treatment effects. In Abadie and Imbens (2006), the ATT is estimated by

$$
\overline{\operatorname{ATT}} = \frac{1}{N_1} \sum_{i \colon W_i = 1} \left[ Y_i - \hat{Y}_i(0) \right],
$$

where $\hat{Y}_i(0) = \frac{1}{k} \sum_{j=1}^{N} Y_j (1 - W_j) \mathds{1}_{\{\|X_i - X_j\| \leq \tau_{n,k,X_i}\}}$ is the average of the $Y_j$'s over the $k$ nearest neighbors (NNs) of $X_i$ in the untreated group. The estimator takes the form

$$
\overline{\operatorname{ATT}} = \frac{1}{N_1} \sum_{i \colon W_i = 1} Y_i - \frac{1}{N_1} \sum_{i \colon W_i = 0} \frac{K_k(i)}{k} Y_i,
$$

where $K_k(i) = \sum_{j=1}^{N} W_j \mathds{1}_{\{\|X_i - X_j\| \leq \tau_{n,k,X_j}\}}$ is the number of times observation $i$ is used as a match, i.e., the number of times observation $i$ is among the $k$ NNs of observations in the treated group. Note that $\overline{\operatorname{ATT}}$ coincides with $\widehat{\operatorname{ATT}}$ if $\hat{r}(X_i)$ is defined as $\frac{K_k(i) N_0}{k N_1}$. Recently, Lin et al. (2023) showed that the latter quantity can indeed be interpreted as an estimate of the density ratio, but its consistency requires $k \rightarrow \infty$, while Abadie and Imbens (2006) considered a fixed value of $k$, as in our problem. To see an analogy with our method, one can consider the case in which $h$ does not depend on $x$, i.e., $h(y, x) = g(y)$ for some function $g$.
Using the notation of the present paper ($W = 0$ and $W = 1$ indicate the source and the target domain, respectively, with $N_0 = n$ and $N_1 = m$), the second term of $\overline{\operatorname{ATT}}$ above generalizes to the form

$$
\frac{1}{m} \sum_{i=1}^{n} \frac{K_k(i)}{k} g(Y_i) = \frac{1}{m} \sum_{i=1}^{m} \int g(y) \, \hat{P}_n(dy \mid X_i^*)
\tag{4}
$$

with $K_k(i) = \sum_{j=1}^{m} \mathds{1}_{\{\|X_i - X_j^*\| \leq \tau_{n,k,X_j^*}\}}$. The previous estimate corresponds to the one introduced in the notes following Proposition 2. On the other hand, our estimator applied to this case is given by

$$
\frac{1}{m} \sum_{i=1}^{m} g(Y_{n,i}^*).
\tag{5}
$$

The two estimators differ when $k > 1$ but coincide when $k = 1$. Indeed, $\hat{P}_n(dy \mid X_i^*)$ has a single atom when $k = 1$, so that sampling from it and averaging over it give the same value. Here are a few remarks.

  • When $k > 1$, Eq. (5) requires fewer evaluations of $g$ than Eq. (4). This is relevant when evaluating the function is time-consuming or costly, for example when each evaluation requires an observation from a physical experiment.

  • Our theoretical analysis is rather different from that of Abadie and Imbens (2006). Since they rely on the expression on the left-hand side of Eq. (4), it is unclear whether they can handle the case where $h$ depends on $x$ (required for prediction purposes). In contrast, our approach is based on the decomposition given in Section 4.2, with a sampling error and an estimation error, leveraging $\int \int g(y) \hat{P}_n(dy \mid x) Q(dx)$ as a centering term. Our results are more general because they include the case where $h$ depends on $x$, and because we can deal with both estimates (4) and (5) at the same time, as mentioned in the notes following Proposition 2. Moreover, Proposition 3 implies a lower bound for Eq. (4), and we believe this result to be new in the treatment effect literature.
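The relation between Eq. (4) and Eq. (5) can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, ties between neighbors are ignored, and $k$ is assumed not to exceed the number of source points. It computes the match counts $K_k(i)$ with SciPy's `cKDTree`, evaluates both sides of Eq. (4), and draws one label per target point as in Eq. (5).

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_indices(x_source, x_target, k):
    """Indices, shape (m, k), of the k nearest source points of each target point."""
    tree = cKDTree(x_source)
    _, idx = tree.query(x_target, k=k)
    # For k=1, query returns a 1-D array; reshape for uniform handling.
    return np.asarray(idx).reshape(len(x_target), k)


def matching_form(g_y_source, idx):
    """Right-hand side of Eq. (4): average over targets of the k-NN mean of g(Y)."""
    return g_y_source[idx].mean(axis=1).mean()


def weighted_form(g_y_source, idx):
    """Left-hand side of Eq. (4): (1/m) * sum_i K_k(i) / k * g(Y_i)."""
    m, k = idx.shape
    # K_k(i): number of target points having source point i among their k NNs.
    match_counts = np.bincount(idx.ravel(), minlength=len(g_y_source))
    return np.sum(match_counts / k * g_y_source) / m


def sampling_form(g_y_source, idx, rng):
    """Eq. (5): one label drawn uniformly among the k neighbors of each target."""
    m, k = idx.shape
    picks = idx[np.arange(m), rng.integers(0, k, size=m)]
    return g_y_source[picks].mean()
```

For `k = 1` the three functions return the same value, since each target point has a single candidate label; for `k > 1` the first two still agree while `sampling_form` is a randomized version that evaluates `g` only once per target point.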

Other references

The idea of nonparametric sampling is a standard one in the field of texture synthesis. In particular, 1-NN resampling has often been used as a fast method to generate new textures from a small sample. See Truquet (2011) for a literature review in this context. Our conditional sampling framework bears resemblance to traditional bootstrap sampling, as there is random generation according to some estimated distribution. In contrast, the original bootstrap method usually relies on draws from the standard empirical measure $(1/n) \sum_{i=1}^{n} \delta_{X_i, Y_i}$. Here another distribution, $\hat{P}$, has been used to generate new samples. Moreover, our goal is entirely different. While the bootstrap technique was initially introduced for inference, here the goal is to estimate an unknown quantity $Q(h)$ which appears in many machine learning tasks. Kpotufe and Martinet (2021) theoretically study covariate shift adaptation under the assumption that a labeled sample is available from both the source and the target distribution. Although they consider a $k$-nearest-neighbor-based method, it is essentially different from ours since they perform the $k$-NN method on the union of the source and the target sample. Lee (2013) proposed pseudo-labeling unlabeled data in the context of semi-supervised learning. Wang (2023) proposed a hyper-parameter selection method for kernel ridge regression under covariate shift using pseudo-labeling.
That work focuses on model selection in regression problems, whereas we study mean estimation, which can be applied to a wider range of supervised learning problems.

7 Extensions

Several ways to extend our method beyond the mean estimation problem are considered in this section.

Heterogeneity in target distributions

The case where the target covariate distribution $Q_X$ changes across the data may be of interest if one wishes to aggregate several pieces of target data whose covariate distributions are not necessarily the same. This might occur when the target data are obtained by gathering individuals from different countries, so that the distributions differ, or when the time elapsed between measurements has caused changes in the distribution.

While such heterogeneity in the target data might seem more complicated at first glance, it can actually be examined using a similar decomposition and the same tools as those used to obtain the non-asymptotic bound in Theorem 1. More formally, the target distribution is here $Q = (1/m) \sum_{i=1}^{m} Q_i$ with $Q_i = P_{Y|X} Q_{X,i}$. For each $m \geq 1$ and $n \geq 1$, let $(X_i^*, Y_{n,i}^*)_{1 \leq i \leq m}$ be a collection of random variables conditionally independent given $(X_i, Y_i)_{i=1}^{n}$ and such that for each $i = 1, \ldots, m$, $(X_i^*, Y_{n,i}^*) \sim \hat{Q}_{i,n}$ with $\hat{Q}_{i,n} = \hat{P}_{Y|X} Q_{X,i}$. The quantity of interest and the proposed estimator are therefore slightly different from before, given by, respectively,

$$
Q(h) = m^{-1} \sum_{i=1}^{m} Q_i(h), \qquad \hat{Q}^*(h) = m^{-1} \sum_{i=1}^{m} h(X_i^*, Y_{n,i}^*).
$$

The decomposition is

$$
\hat{Q}^*(h) - Q(h) = m^{-1} \sum_{i=1}^{m} \left\{ h(X_i^*, Y_{n,i}^*) - \hat{Q}_{i,n}(h) \right\} + m^{-1} \sum_{i=1}^{m} \left\{ \hat{Q}_{i,n}(h) - Q_i(h) \right\}.
$$

The non-asymptotic analysis of the sampling error is similar to before, as the Bernstein inequality applies to non-identically distributed variables. We obtain the rate $O(m^{-1/2})$ as before by simply requiring a bound on the variance of each random variable. The other term, concerning the conditional distribution, can be analyzed by writing

$$
\left\lVert m^{-1} \sum_{i=1}^{m} \left\{ \hat{Q}_{i,n}(h) - Q_i(h) \right\} \right\rVert_{L_2} \leq m^{-1} \sum_{i=1}^{m} \lVert \hat{Q}_{i,n}(h) - Q_i(h) \rVert_{L_2}
$$

and therefore we can directly apply Proposition 3 (provided the assumptions are satisfied uniformly in $i$). We finally obtain the rate $m^{-1/2} + n^{-1/d} + n^{-1/2}$, similar to the one obtained before.

Stochastic gradient descent

Our sampling approach can be easily combined with the well-known stochastic gradient descent algorithm (and more generally with stochastic approximation), where only a small part of the data is used at each step to update the estimator. This keeps the number of operations per iteration small (in contrast with gradient-based optimization over the full dataset).

To illustrate this idea, consider the empirical risk minimization problem described in Section 5, where one is interested in solving $\min_{\theta} \{ R^*(\theta) := \mathbb{E}[m_{\theta}(Y^*, X^*)] \}$, where $\theta \mapsto m_{\theta}(y, x)$ is differentiable. Suppose that $n$ source samples have been obtained, making the conditional distribution $\hat{P}_{Y|X}$ available for sampling new points. Then, at step $i \geq 1$, the algorithm might proceed by first generating $X_i^*$ and then $Y_{n,i}^* \sim \hat{P}_{Y|X_i^*}$.
This amounts to finding the nearest neighbor of $X_i^*$ among the source data, which takes only of order $\log(n)$ operations using a $k$-d tree. Once this is done, the update is simply

$$
\theta_i = \theta_{i-1} - \gamma_i \nabla_{\theta} m_{\theta}(Y_{n,i}^*, X_i^*).
$$

As a result, each iteration of the above is similar to a standard stochastic gradient descent step, the only difference being the additional $1$-nearest neighbor search. We stress that this contrasts with the reweighting approach, for which a new sample, say $X_i$, would require evaluating $\hat{p}_X(X_i)$ and would therefore need all $n$ distances between $X_j$, $j = 1, \ldots, n$, and the new $X_i$ to be computed.
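The update above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sgd_with_1nn_labels`, the linear model, the squared loss $m_{\theta}(y, x) = \tfrac{1}{2}(y - \theta^{\top} x)^2$, and the step-size schedule in the usage note are all assumptions made for the example. The tree is built once from the source covariates, and each step performs a single nearest neighbor query.

```python
import numpy as np
from scipy.spatial import cKDTree


def sgd_with_1nn_labels(x_source, y_source, x_target_stream, theta0, step_size):
    """SGD on target covariates with labels copied from the nearest source point.

    Assumed loss (illustrative): 0.5 * (y - theta @ x) ** 2.
    step_size is a function mapping the iteration index i >= 1 to gamma_i.
    """
    tree = cKDTree(x_source)  # built once from the n source covariates
    theta = np.array(theta0, dtype=float)
    for i, x_star in enumerate(x_target_stream, start=1):
        _, nn = tree.query(x_star, k=1)  # index of the nearest source point
        y_star = y_source[nn]            # pseudo-label for the target point
        grad = (theta @ x_star - y_star) * x_star  # gradient of the squared loss
        theta = theta - step_size(i) * grad
    return theta
```

A call such as `sgd_with_1nn_labels(xs, ys, xt, np.zeros(d), lambda i: 0.1 / np.sqrt(i))` processes the rows of `xt` one at a time; no density ratio is ever evaluated.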

Semiparametric estimation

Simulating the labels to obtain a new sample is also convenient in semiparametric problems, where quantities of interest often involve additional estimated parameters. Typical semiparametric problems involve expectations of functions that are indexed by an unknown parameter, $\mathbb{E}[h_{\theta}(X,Y)]$, where $\theta$ is estimated from the data using some transformation $\hat{\theta}$ of the sample. In such a situation, while it is unclear how to estimate $\theta$ by reweighting without more information on $\hat{\theta}$, one can directly use our sampling approach by introducing $m^{-1}\sum_{j=1}^{m}h_{\hat{\theta}^{*}}(X_{j}^{*},Y_{j}^{*})$, where $\hat{\theta}^{*}=\hat{\theta}(X_{1}^{*},Y_{1}^{*},\ldots,X_{m}^{*},Y_{m}^{*})$. This makes it possible to obtain a semiparametric estimate with covariate shift adaptation. See Van der Vaart (2000), Chapters 19.4 and 25.8, for more details and examples in parametric or semiparametric estimation.
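As a hedged illustration of this plug-in idea, the sketch below estimates the target variance of $Y$, that is, $h_{\theta}(x,y)=(y-\theta)^2$ with $\theta$ the target mean of $Y$. The choice of $h_\theta$, the function name `plugin_target_variance`, and the use of 1-nearest-neighbor labels are assumptions made for illustration only; this is not the implementation used in the experiments.

```python
import numpy as np
from scipy.spatial import cKDTree


def plugin_target_variance(x_src, y_src, x_tgt):
    """Plug-in estimate of the target variance of Y (illustrative sketch).

    x_src: (n, d) source inputs, y_src: (n,) source responses,
    x_tgt: (m, d) unlabeled target inputs.
    """
    # Simulate labels: each target point receives the response of its
    # nearest source point (a 1-NN conditional sampler).
    _, nn = cKDTree(x_src).query(x_tgt, k=1)
    y_star = y_src[nn]
    # theta_hat^* is computed from the simulated sample itself ...
    theta_star = y_star.mean()
    # ... and then plugged into h_theta(x, y) = (y - theta)^2.
    return np.mean((y_star - theta_star) ** 2)
```

No weight function has to be derived for $\hat{\theta}$ here: the estimator of $\theta$ is simply re-run on the simulated sample.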

8 Experiments

The main purpose of the experiments is to compare our $k$-NN-CSA approach with several state-of-the-art competitors in multiple situations, from mean estimation to empirical risk minimization, with synthetic and real-world data.

We consider the following instances of our proposed method.

1-NN-CSA:

the Conditional Sampling Covariate-shift Adaptation (CSA) (Algorithm 1) with the $k$-Nearest Neighbor ($k$-NN) conditional sampler (Algorithm 2) with $k=1$.

$\log n$-NN-CSA:

the same as above but with $k=\log n$.
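The two instances above can be sketched in a few lines on top of a k-d tree. The sketch assumes that the $k$-NN conditional sampler draws one label uniformly at random among the $k$ nearest source neighbors of each target point; the function name `knn_csa_estimate` and its arguments are hypothetical, and Algorithms 1 and 2 remain the authoritative description.

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_csa_estimate(x_src, y_src, x_tgt, h, k=1, rng=None):
    """Estimate Q(h) from labeled source data and unlabeled target data.

    h must be vectorized: h(x, y) maps (m, d) and (m,) arrays to an (m,) array.
    """
    rng = np.random.default_rng(rng)
    m = len(x_tgt)
    # Indices of the k nearest source points for every target point.
    _, idx = cKDTree(x_src).query(x_tgt, k=k)
    idx = idx.reshape(m, -1)  # query returns a 1-D array when k == 1
    # Assumed sampler: pick one of the k neighbors uniformly at random.
    pick = rng.integers(0, idx.shape[1], size=m)
    y_star = y_src[idx[np.arange(m), pick]]
    # Average h over the simulated target sample (X*, Y*).
    return np.mean(h(x_tgt, y_star))
```

With `k=1` the random draw is degenerate, so the estimate is deterministic given the data. For the second instance, $\log n$ has to be turned into an integer before being passed as `k`; the rounding rule is not specified here and would be a further assumption.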

We use the Python module cKDTree (Archibald, 2008) from SciPy (Virtanen et al., 2020) for nearest neighbor search in our methods. We compare them with the following existing covariate-shift adaptation methods.

KDE-R-W (KDE-Ratio-Weighting):

the weighting method using the ratio of the Kernel Density Estimates (KDEs) of $p_X$ and $q_X$ (see Section 6).

KMM-W (KMM-Weighting):

the weighting method estimating $q_X/p_X$ using Kernel Mean Matching (KMM) (Huang et al., 2006; Gretton et al., 2008). We use the Gaussian kernel.

KLIEP-W (KLIEP-Weighting):

the weighting method estimating $q_X/p_X$ by the Kullback-Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2007, 2008). A linear combination of Gaussian basis functions centered at the sample points is used for modeling the weight function.

KLIEP100-W:

the same as KLIEP-W but with 100 randomly subsampled basis functions (Sugiyama et al., 2007, 2008) for reducing the time and space complexities.

RuLSIF-W (RuLSIF-Weighting):

the weighting method using $q_X/(\alpha p_X+(1-\alpha)q_X)$ estimated by Relative unconstrained Least-Squares Importance Fitting (RuLSIF) (Yamada et al., 2013), where $\alpha\in[0,1]$ is a hyper-parameter. We use the default value $\alpha=0.1$. As a model of the weight function, the Gaussian basis functions centered at the sample points are used.

RuLSIF100-W:

the same as RuLSIF-W but with 100 randomly subsampled basis functions (Yamada et al., 2013) for reducing the time- and space-complexities.
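For contrast with the sampling approach, the simplest weighting baseline in the list above (KDE-R-W) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses SciPy's `gaussian_kde` with its default bandwidth rule, and the function name `kde_ratio_weighted_mean` is hypothetical. The bandwidth and other details of the KDE-R-W variant actually used in the experiments are not reproduced here.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_ratio_weighted_mean(x_src, y_src, x_tgt, h):
    """Importance-weighted estimate of Q(h) with KDE-based weights (sketch)."""
    # gaussian_kde expects data of shape (d, n), hence the transposes.
    p_hat = gaussian_kde(x_src.T)  # density estimate of the source inputs
    q_hat = gaussian_kde(x_tgt.T)  # density estimate of the target inputs
    # Estimated density ratio q_X / p_X evaluated at the source points.
    w = q_hat(x_src.T) / p_hat(x_src.T)
    # Weighted average of h over the labeled source sample.
    return np.mean(w * h(x_src, y_src))
```

Both density estimates have to be evaluated at all $n$ source points, each evaluation summing over all kernel centers, which illustrates where the quadratic cost of kernel-based weighting comes from.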

See Section 6 for more explanations of those methods. For KMM-W, KLIEP-W, and RuLSIF-W, we used the implementations from the Awesome Domain Adaptation Python Toolbox (ADAPT) (de Mathelin et al., 2021). All the computations were performed on the Grid’5000 cluster (Balouek et al., 2013). For the methods using Gaussian basis functions (KLIEP-W, KLIEP100-W, RuLSIF-W, RuLSIF100-W), we use 5-fold cross-validation for choosing the Gaussian bandwidth from $\{0.001,0.01,0.1,1,10\}$. KMM-W does not offer a way to do cross-validation, so we fixed its bandwidth to $1$. More details are in the supplementary material.

Furthermore, we also report the results for the following baseline method and ideal method.

NoCorrection:

the method that takes the average $\frac{1}{n}\sum_{i=1}^{n}h(X_i,Y_i)$ using only the source sample $(X_i,Y_i)_{i=1}^{n}\sim P^{n}$, ignoring the target sample.

OracleY:

the result of taking the average $\frac{1}{m}\sum_{i=1}^{m}h(X^*_i,Y^\circ_i)$ using a sample $(X^*_i,Y^\circ_i)_{i=1}^{m}\sim Q^{m}$. Note that the $Y^\circ_i$ are not available in the practical scenarios of our interest and are hidden from the other methods.

We conduct experiments in three setups, detailed below, with different sample sizes $n$ ($=m$) and data dimensionalities $d$: $(n,d)\in\{50,100,500,1000,5000,10000\}\times\{1,2,5,10\}$. Each experiment is repeated 50 times with different random seeds.

Setup of Experiment E1 (mean estimation with synthetic data):

The task here is to estimate $Q(h)=\iint y\,P_{Y|X}(dy\,|\,x)\,Q_X(dx)$ under the following setup. We define $h$ by $h(x,y)=y$, $P_X$ as the uniform distribution over $[-1,1]^d$, $Q_X$ as that over $[0,1]\times[-1,1]^{d-1}$, and $P_{Y|X=x}$ as the normal distribution with mean $x$ and variance $0.1$. Figure 7(a) in Appendix G shows an illustration of the setup. In this setup, we have $P(h)=\iint y\,P_{Y|X}(dy\,|\,x)\,P_X(dx)=0$ while $Q(h)=\iint y\,P_{Y|X}(dy\,|\,x)\,Q_X(dx)=0.5$. Because of this difference, covariate shift adaptation is essential for correctly estimating $Q(h)$.
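A single run of this setup can be simulated roughly as follows. This is a sketch rather than the experimental code: the function name `simulate_e1` is hypothetical, and for $d>1$ it assumes that the conditional mean of $Y$ is the first coordinate $x_1$ of $x$, which is what yields $Q(h)=0.5$.

```python
import numpy as np
from scipy.spatial import cKDTree


def simulate_e1(n, d, seed=0):
    """One simulated run of the mean-estimation setup with m = n (sketch)."""
    rng = np.random.default_rng(seed)
    # Source inputs: uniform on [-1, 1]^d.
    x_src = rng.uniform(-1.0, 1.0, size=(n, d))
    # Target inputs: uniform on [0, 1] x [-1, 1]^(d-1).
    x_tgt = rng.uniform(-1.0, 1.0, size=(n, d))
    x_tgt[:, 0] = rng.uniform(0.0, 1.0, size=n)
    # Responses: normal with variance 0.1; mean x_1 is an assumption for d > 1.
    sd = np.sqrt(0.1)
    y_src = x_src[:, 0] + sd * rng.standard_normal(n)
    y_oracle = x_tgt[:, 0] + sd * rng.standard_normal(n)  # hidden in practice
    # 1-NN labels for the target inputs.
    _, nn = cKDTree(x_src).query(x_tgt, k=1)
    return {
        "NoCorrection": y_src.mean(),   # targets P(h) = 0
        "1-NN-CSA": y_src[nn].mean(),   # targets Q(h) = 0.5
        "OracleY": y_oracle.mean(),     # targets Q(h) = 0.5
    }
```

Calling, for example, `simulate_e1(5000, 2)` returns the three estimates for one random seed; the uncorrected average concentrates around $P(h)$ rather than $Q(h)$.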

Comparison of estimation errors for Experiment E1:

The results are presented in Figure 1. First, the errors of NoCorrection do not decrease as the sample size increases and remain large in all cases because of the bias due to the covariate shift. The methods with covariate-shift adaptation always had smaller errors than this baseline. Excluding OracleY, an ideal method unavailable in practice, KLIEP100-W, KMM-W, 1-NN-CSA, and $\log n$-NN-CSA were among the best for the smaller dimensionalities $d\in\{1,2\}$ (Figures 1(a) and 1(b)). For the larger dimensionalities $d\in\{5,10\}$, KMM-W and 1-NN-CSA outperformed the other methods. In particular, 1-NN-CSA gave outstanding performance in many cases, except for $d=5$ and $n\in\{100,500,1000\}$, for which KMM-W was even better. The errors of most methods roughly follow power laws, where the slope of a line corresponds to the exponent of the convergence rate (steeper is better). 1-NN-CSA and $\log n$-NN-CSA seem to have the steepest slopes for $d=10$, although comparison is difficult for the lower dimensionalities.

Comparison of running times in Experiment E1:

Figure 2 shows the comparison of running times. 1-NN-CSA and $\log n$-NN-CSA were much faster than the other methods in all cases except for $(d,n)=(10,50)$. Their advantage is most pronounced for larger sample sizes. For instance, 1-NN-CSA and $\log n$-NN-CSA were at least $100$ times faster than the other methods for $(n,d)=(10000,1)$ (Figure 2(a)).

Figure 1: Mean Squared Errors (MSE) for Experiment E1 (estimation of $\int y\,Q(dy)$). Panels: (a) $d=1$, (b) $d=2$, (c) $d=5$, (d) $d=10$. The horizontal axis is for the sample sizes $n$ ($=m$), and the vertical axis is for the mean absolute error of each estimate. The four figures are for different data dimensionalities.
Figure 2: Running times for Experiment E1. Panels: (a) $d=1$, (b) $d=2$, (c) $d=5$, (d) $d=10$. The horizontal axis is for the sample sizes $n$ ($=m$), and the vertical axis is for the mean running time of each method. The four figures are for different data dimensionalities.

Setup of Experiment E2 (risk estimation with synthetic data):

In this experiment, we compare the methods in the context of risk estimation of a fixed function $f_0$. Setting $f_0\colon x\mapsto -x_1$, where $x_1$ is the first coordinate of $x$, we estimate the expected loss (i.e., risk) of $f_0(X)$ with the square loss in predicting the response $Y$ when $(X,Y)$ follows $Q$. In other words, we set $h$ as $h(x,y):=(y-f_0(x))^2$, and the goal is to estimate the risk $Q(h)=\int (y-f_0(x))^2\,Q(dx,dy)$. We use the uniform distribution over $[-1,1]^d$ for $P_X$ and that over $[0,1]\times[0,1]^{d-1}$ for $Q_X$. The conditional distribution $P_{Y|X=x}$ is the normal distribution with mean $\lvert x_1\rvert$ and variance $0.1$ for any $x\in\mathcal{X}:=[-1,1]^d$. Under this setup, the function $f_0$ performs poorly on the support of $Q_X$ and should incur a large risk. See Figure 7(b) for an illustration of the setup. In this setup, the risks under $P_X$ and $Q_X$ differ substantially because $f_0$ fits $(X,Y)$ well on one half of the support of $P_X$ but not on the support of $Q_X$.
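The size of this gap can be worked out by hand. The derivation below is a sketch that writes $Y=\lvert X_1\rvert+\varepsilon$ with $\varepsilon$ independent of $X$, centered, and of variance $\sigma^2$; the symbols $\varepsilon$ and $\sigma^2$ are notation introduced here for illustration only.

```latex
% Under Q: X_1 ~ U[0,1], so Y - f_0(X) = |X_1| + X_1 + \varepsilon = 2 X_1 + \varepsilon.
Q(h) = \mathbb{E}\bigl[(2X_1+\varepsilon)^2\bigr]
     = 4\,\mathbb{E}[X_1^2] + \sigma^2
     = \tfrac{4}{3} + \sigma^2 .

% Under P: X_1 ~ U[-1,1]. On {X_1 < 0}, f_0(X) = |X_1| and the residual is \varepsilon;
% on {X_1 > 0}, the residual is 2 X_1 + \varepsilon as above.
P(h) = \tfrac{1}{2}\,\sigma^2 + \tfrac{1}{2}\Bigl(\tfrac{4}{3} + \sigma^2\Bigr)
     = \tfrac{2}{3} + \sigma^2 .
```

The two risks therefore differ by $2/3$ whatever the noise level; with the stated variance $\sigma^2=0.1$ this gives $Q(h)\approx 1.43$ against $P(h)\approx 0.77$.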

Comparison of estimation errors for Experiment E2:

We present the estimation errors for Experiment E2 in Figure 3. KMM-W, 1-NN-CSA, and $\log n$-NN-CSA gave similar results, almost matching those of OracleY, while KMM-W and 1-NN-CSA were advantageous for $d=5$, and 1-NN-CSA outperformed the other methods for $d=10$. We can notice that KDE-R-W, KMM-W, RuLSIF-W, and RuLSIF100-W did not always improve errors over NoCorrection (Figures 3(c) and 3(d)). Some methods such as KMM-W and KLIEP-W showed great performance in some cases while giving poor results in other cases. In contrast, 1-NN-CSA showed stable and often the best performance in these experiments.

Figure 3: Estimation errors for Experiment E2 (estimation of $\int (y-f_0(x))^2\,Q(dx,dy)$). Panels: (a) $d=1$, (b) $d=2$, (c) $d=5$, (d) $d=10$.
Figure 4: Running times in Experiment E2. Panels: (a) $d=1$, (b) $d=2$, (c) $d=5$, (d) $d=10$.

Comparison of running times in Experiment E2:

The running times in Experiment E2 (Figure 4) were very similar to those in Experiment E1 (Figure 2), but we can see more clearly that 1-NN-CSA and $\log n$-NN-CSA outperform the other methods even for the smallest sample size.

Setup of Experiment E3 (linear regression with synthetic data):

Next, we present experiments on linear regression. Using samples from the same source and target distributions as in Experiment E2, we perform ordinary least squares after covariate shift adaptation. More precisely, we aim to optimize the parameters $\theta\in\mathbb{R}^d$ of the model $f_\theta\colon\mathbb{R}^d\to\mathbb{R},\ x\mapsto\theta^\top x$ so that the Mean Squared Error (MSE) $\mathbb{E}[(Y^*-f_\theta(X^*))^2]$ in the target domain is minimized. To do so, we minimize the MSE estimated by each covariate shift adaptation method.
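For the nearest-neighbor approach, minimizing the estimated MSE amounts to running ordinary least squares on the target inputs paired with their simulated labels. A minimal sketch follows, assuming the intercept-free model $x\mapsto\theta^\top x$ described above; the function name `ols_after_1nn_csa` is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def ols_after_1nn_csa(x_src, y_src, x_tgt):
    """Least-squares fit of theta on the 1-NN labeled target sample (sketch)."""
    # Label each target input with the response of its nearest source input.
    _, nn = cKDTree(x_src).query(x_tgt, k=1)
    y_star = y_src[nn]
    # Minimize (1/m) * sum_j (Y*_j - theta^T X*_j)^2 over theta.
    theta, *_ = np.linalg.lstsq(x_tgt, y_star, rcond=None)
    return theta
```

Any off-the-shelf learner could replace the least-squares step, since the adaptation only changes the training sample and not the loss.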

Comparison of estimation errors for Experiment E3:

The results are summarized in Figure 5. (Footnote 3: We plot the MSEs minus $0.0095$ to better present the curves in the region close to the minimum population MSE $0.01$ while keeping the values positive.) KMM-W performed better than any other method for the higher dimensions $d\in\{5,10\}$ and the small-to-moderate sample sizes $50\leq n\leq 500$, 1-NN-CSA being the second best. For $n=10000$, 1-NN-CSA showed performance better than or comparable to KMM-W.

Figure 5: Mean Squared Errors (MSE) (minus $0.0095$) for Experiment E3 (linear regression). Panels: (a) $d=1$, (b) $d=2$, (c) $d=5$, (d) $d=10$.

Comparison of running times in Experiment E3:

As in Experiments E1–E2, 1-NN-CSA and $\log n$-NN-CSA finished their computations faster than the other adaptation methods by large margins (Figure 6).

Figure 6: Computation times (seconds) in Experiment E3. Panels: (a) $d=1$, (b) $d=2$, (c) $d=5$, (d) $d=10$.

In Experiments E1–E3, the proposed methods, 1-NN-CSA and $\log n$-NN-CSA, finished their computations much faster than the other adaptation methods without compromising statistical performance. $\log n$-NN-CSA did not show advantages in accuracy and incurred higher computation costs. We can conclude that 1-NN-CSA is preferable to $\log n$-NN-CSA. One reason that we were not able to conduct experiments with sample sizes larger than $10000$ is that the existing adaptation methods have too demanding computational requirements. For instance, the running time of RuLSIF-W in Figure 6(c) grows about $100$ times as the sample size increases by $10$ times, taking more than $10^3$ seconds for $n=10000$. For $n=10^5$, we would need at least $10^3\times 100$ seconds, that is, more than $27$ hours of compute for a single run. In contrast, since the time complexity of 1-NN-CSA is $\mathcal{O}(n\log n)$ and its running time is less than one second for $n=10^4$, we can estimate its running time for $n=10^5$ as at most $1\times(10^5\log 10^5)/(10^4\log 10^4)=12.5$ seconds. 1-NN-CSA would stay feasible in applications of even larger scales.
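The back-of-the-envelope extrapolation above can be reproduced with a few lines. The helper name `extrapolate_time` is hypothetical; the reference timings ($10^3$ seconds and $1$ second at $n=10^4$) and the growth models (quadratic and $n\log n$) are the ones quoted in the text, and this is only a rough scaling argument, not a measurement.

```python
import math


def extrapolate_time(t_ref, n_ref, n_new, cost):
    """Scale a reference running time by the ratio of a cost model."""
    return t_ref * cost(n_new) / cost(n_ref)


# RuLSIF-W: about 1e3 seconds at n = 1e4, growing 100x per 10x in n (quadratic).
rulsif_seconds = extrapolate_time(1e3, 10**4, 10**5, lambda n: n**2)
rulsif_hours = rulsif_seconds / 3600.0

# 1-NN-CSA: at most 1 second at n = 1e4, with O(n log n) complexity.
nn_csa_seconds = extrapolate_time(1.0, 10**4, 10**5, lambda n: n * math.log(n))
```

The base of the logarithm cancels in the ratio, so the $n\log n$ estimate does not depend on it.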

The previous methods construct the distance matrix between pairs of data points, which takes running time and memory space quadratic in the sample size. Additionally, RuLSIF-W computes the inverse of the distance matrix, which takes cubic running time. KMM-W and KLIEP-W solve convex optimization problems with iterative procedures, for which the implementations from de Mathelin et al. (2021) use stopping criteria based on objective function values. This resulted in good accuracy and milder growth in running time in our experiments. However, tuning the solvers can be involved in practice. In contrast, $k$-NN-CSA does not have such subtle issues around optimization solvers: we only have to perform nearest neighbor search.

In all cases, 1-NN-CSA showed clear power-law behavior, with nearly straight lines on the logarithmic scales. This is a significant advantage for predicting the returns on investing in a larger sample size.
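One way to exploit such power-law behavior is to fit a straight line to the error curve in log-log coordinates and read off the exponent. The sketch below assumes errors of the form $C\,n^{\text{slope}}$; the function name `power_law_slope` is hypothetical, and any numbers fed to it in an illustration should be understood as synthetic, not as results from the experiments.

```python
import numpy as np


def power_law_slope(ns, errors):
    """Fit errors ~ C * n**slope by least squares in log-log coordinates."""
    # polyfit returns the coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(np.log(ns), np.log(errors), deg=1)
    return slope, np.exp(intercept)
```

Given the fitted pair `(slope, C)`, the error expected at a larger sample size `n_new` is `C * n_new**slope`, which is the kind of prediction a clean power law makes possible.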

Experiment E4 (linear regression and logistic regression with benchmark datasets):

We use the regression benchmark datasets diabetes (available at https://archive.ics.uci.edu/ml/index.php) and california (Pace and Barry, 1997; available at https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html), and the classification datasets twonorm (Breiman, 1996; available at https://www.cs.utoronto.ca/~delve/data/datasets.html) and breast_cancer (available from the same repository as diabetes). We apply ridge regression and logistic regression, respectively. The evaluation metric is the mean squared error for the regression tasks and the classification accuracy for the classification tasks. We synthetically introduce covariate shift by subsampling test data. See Appendix H for more details.

Remark:

For fair comparison, the benchmark experiments presented in this paper follow the standard protocol used in previous research (Gretton et al., 2008; Kanamori et al., 2009; Yamada et al., 2013; Sugiyama et al., 2007, 2008): we apply biased resampling to synthetically simulate a target dataset under covariate shift. It is thus important to note that the resulting datasets are not completely real-world data. Nevertheless, this ensures that the methods are tested in isolation from other types of distribution shifts while using real data for the source covariate distribution as well as the conditional distributions.

Table 1: MSE/accuracy for regression/classification benchmark datasets. We repeat the experiment using 50 different random subsamples and calculate the average scores (and standard errors). The results comparable to the best in terms of Wilcoxon’s signed rank test with significance level $1\%$ are shown in bold fonts.
Method             Regression (MSE)                 Classification (accuracy)
                   diabetes     california          breast_cancer      twonorm
1-NN-CSA           3470 (35)    0.146 (0.001)       0.9633 (0.002)     0.9327 (0.002)
$\log n$-NN-CSA    3605 (40)    0.150 (0.001)       0.9595 (0.002)     0.9293 (0.002)
KDE-R-W            3673 (52)    3.864 (1.067)       0.9596 (0.002)     0.5260 (0.009)
KMM-W              3831 (60)    3.702 (1.160)       0.9594 (0.002)     0.9583 (0.001)
KLIEP-W            3221 (31)    2.896 (0.798)       0.9648 (0.002)     0.9482 (0.001)
KLIEP100-W         3223 (31)    3.034 (0.843)       0.9648 (0.002)     0.9480 (0.001)
RuLSIF-W           3235 (31)    3.039 (0.843)       0.7794 (0.015)     0.9512 (0.001)
RuLSIF100-W        3238 (31)    3.045 (0.844)       0.7794 (0.015)     0.9539 (0.001)
Table 2: Total running times in seconds spent for the training, including the hyper-parameter tuning (if any), for benchmark datasets. We repeat the experiment using 50 different random subsamples and calculate the average running times (and standard errors). The results comparable to the best in terms of Wilcoxon’s signed rank test with significance level $1\%$ are shown in bold.
Method             diabetes           california         breast_cancer      twonorm
1-NN-CSA           0.0015 (0.0000)    0.0084 (0.0001)    0.0036 (0.0000)    0.0051 (0.0000)
$\log n$-NN-CSA    0.0016 (0.0000)    0.0128 (0.0001)    0.0037 (0.0000)    0.0052 (0.0000)
KDE-R-W            0.0078 (0.0000)    0.2121 (0.0008)    0.0117 (0.0000)    0.0124 (0.0000)
KMM-W              0.0373 (0.0015)    0.4067 (0.0038)    0.0542 (0.0014)    0.0220 (0.0006)
KLIEP-W            7.602 (0.051)      29.98 (0.34)       8.67 (0.07)        8.86 (0.16)
KLIEP100-W         7.501 (0.045)      16.91 (0.07)       8.68 (0.07)        8.26 (0.10)
RuLSIF-W           0.0575 (0.0014)    1.686 (0.011)      0.0529 (0.0020)    0.2014 (0.0016)
RuLSIF100-W        0.0401 (0.0007)    0.1237 (0.0004)    0.0454 (0.0014)    0.0391 (0.0002)

Results for Experiment E4:

Table 1 shows the obtained MSEs and classification accuracies. 1-NN-CSA and $\log n$-NN-CSA gave the best performance for california and performance comparable to the best for breast_cancer. For the other datasets, different methods performed the best depending on the dataset. On the other hand, in terms of running time, 1-NN-CSA was consistently faster than the previous methods (Table 2).

Our experiments show that the proposed method is almost always faster than the previous methods and gives great accuracy in many cases, even though it is not always the best. 1-NN-CSA is highly recommended as an off-the-shelf method applicable even at larger scales, although the previous methods such as KMM-W, KLIEP-W, and RuLSIF-W should not be neglected as far as the computational budget allows. The times spent for adaptation are summarized in Table 2, showing that the proposed methods 1-NN-CSA and $\log n$-NN-CSA are much faster than the other methods.

9 Conclusion

We proposed a $k$-NN-based covariate shift adaptation method. We provided error bounds, which suggest that setting $k=1$ is among the best choices. This resulted in a scalable non-parametric method with no hyper-parameter. For future research directions, one could complete our results for parametric inference on the target domain, in particular by finding the asymptotic distribution of $M$-estimators. For the average treatment effect, Abadie and Imbens (2006) derived asymptotic normality of their estimator, and it could be interesting to get a similar result in our context. Investigating non-parametric estimation on the target domain could also be an interesting direction. However, non-parametric estimators computed with the source sample can already be optimal when the ratio of densities is bounded. See for instance Ma et al. (2023) in the reproducing kernel Hilbert space framework. It could then be interesting to extend our result to cases with unbounded density ratios. Finally, it may be interesting to extend our approach with approximate nearest neighbor methods for further scalability.

Acknowledgement

Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr). IY was supported by the Allocation d’Installation Scientifique (AIS) 2023 from Rennes Métropole.

References

  • Abadie and Imbens (2006) Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74(1), 235–267.
  • Aminian et al. (2022) Aminian, G., M. Abroshan, M. Mahdi Khalili, L. Toni, and M. Rodrigues (2022, 28–30 Mar). An information-theoretical approach to semi-supervised learning under covariate-shift. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, Volume 151 of Proceedings of Machine Learning Research, pp.  7433–7449. PMLR.
  • Archibald (2008) Archibald, A. M. (2008). cKDTree.
  • Balouek et al. (2013) Balouek, D., A. Carpen Amarie, G. Charrier, F. Desprez, E. Jeannot, E. Jeanvoine, A. Lèbre, D. Margery, N. Niclausse, L. Nussbaum, O. Richard, C. Pérez, F. Quesnel, C. Rohr, and L. Sarzyniec (2013). Adding virtualization capabilities to the Grid’5000 testbed. In I. I. Ivanov, M. van Sinderen, F. Leymann, and T. Shan (Eds.), Cloud Computing and Services Science, Volume 367 of Communications in Computer and Information Science, pp.  3–20. Springer International Publishing.
  • Bentley (1975) Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517.
  • Berrett et al. (2019) Berrett, T. B., R. J. Samworth, and M. Yuan (2019). Efficient multivariate entropy estimation via k-nearest neighbour distances. The Annals of Statistics 47(1), 288–318.
  • Biau and Devroye (2015) Biau, G. and L. Devroye (2015). Lectures on the nearest neighbor method, Volume 246. Springer.
  • Blanchet et al. (2024) Blanchet, J., H. Chen, Y. Lu, and L. Ying (2024). When can regression-adjusted control variate help? Rare events, Sobolev embedding and minimax optimality. Advances in Neural Information Processing Systems 36.
  • Bogachev and Ruas (2007) Bogachev, V. I. and M. A. S. Ruas (2007). Measure theory, Volume 2. Springer Science & Business Media.
  • Breiman (1996) Breiman, L. (1996). Bias, variance, and arcing classifiers.
  • Cannings et al. (2020) Cannings, T. I., T. B. Berrett, and R. J. Samworth (2020). Local nearest neighbour classification with applications to semi-supervised learning. The Annals of Statistics 48(3), 1789–1814.
  • Chen et al. (2022) Chen, L., M. Zaharia, and J. Y. Zou (2022). Estimating and explaining model performance when both covariates and labels shift. In Advances in Neural Information Processing Systems, Volume 35, pp.  11467–11479. Curran Associates, Inc.
  • Clémençon and Portier (2018) Clémençon, S. and F. Portier (2018). Beating Monte Carlo integration: A nonasymptotic study of kernel smoothing methods. In International Conference on Artificial Intelligence and Statistics, pp.  548–556. PMLR.
  • Dasgupta and Kpotufe (2014) Dasgupta, S. and S. Kpotufe (2014). Optimal rates for k-nn density and mode estimation. In Advances in Neural Information Processing Systems, Volume 27.
  • de Mathelin et al. (2021) de Mathelin, A., M. Atiq, G. Richard, A. de la Concha, M. Yachouti, F. Deheeger, M. Mougeot, and N. Vayatis (2021). ADAPT: Awesome Domain Adaptation Python Toolbox. arXiv:2107.03049 [cs.LG].
  • Delyon and Portier (2016) Delyon, B. and F. Portier (2016). Integral approximation by kernel smoothing. Bernoulli 22(4), 2177–2208.
  • Devroye et al. (1994) Devroye, L., L. Györfi, A. Krzyżak, and G. Lugosi (1994). On the strong universal consistency of nearest neighbor regression function estimates. Ann. Statist. 22(3), 1371–1385.
  • Devroye et al. (2013) Devroye, L., L. Györfi, and G. Lugosi (2013). A probabilistic theory of pattern recognition, Volume 31. Springer Science & Business Media.
  • Dua and Graff (2017) Dua, D. and C. Graff (2017). UCI machine learning repository.
  • Friedman et al. (1977) Friedman, J. H., J. L. Bentley, and R. A. Finkel (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS) 3(3), 209–226.
  • Gadat et al. (2016) Gadat, S., T. Klein, and C. Marteau (2016). Classification in general finite dimensional spaces with the $k$-nearest neighbor rule. Ann. Statist. 44(3), 982–1009.
  • Gretton et al. (2008) Gretton, A., A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf (2008, December). Covariate Shift by Kernel Mean Matching. In Dataset Shift in Machine Learning, pp.  131–160. The MIT Press.
  • Györfi et al. (2006) Györfi, L., M. Kohler, A. Krzyzak, and H. Walk (2006). A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media.
  • Hernan and Robins (2023) Hernan, M. A. and J. M. Robins (2023). Causal Inference: What If. Chapman & Hall/CRC Monographs on Statistics & Applied Probab. CRC Press.
  • Huang et al. (2006) Huang, J., A. Gretton, K. Borgwardt, B. Schölkopf, and A. Smola (2006). Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, Volume 19. MIT Press.
  • Jiang (2019) Jiang, H. (2019). Non-asymptotic uniform rates of consistency for $k$-NN regression. In AAAI Proceedings, Volume 33, pp.  3999–4006.
  • Kanamori et al. (2009) Kanamori, T., S. Hido, and M. Sugiyama (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research 10(48), 1391–1445.
  • Kpotufe and Martinet (2021) Kpotufe, S. and G. Martinet (2021). Marginal singularity and the benefits of labels in covariate-shift. The Annals of Statistics 49(6), 3299–3323.
  • Le et al. (2013) Le, Q., T. Sarlós, A. Smola, et al. (2013). Fastfood—approximating kernel expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, Volume 28.
  • Lee (2013) Lee, D.-H. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML.
  • Leluc et al. (2023) Leluc, R., F. Portier, J. Segers, and A. Zhuman (2023). Speeding up monte carlo integration: Control neighbors for optimal convergence. To appear in Bernoulli. ArXiv:2305.06151.
  • Lin et al. (2023) Lin, Z., P. Ding, and F. Han (2023). Estimation based on nearest neighbor matching: from density ratio to average treatment effect. Econometrica 91(6), 2187–2217.
  • Ma et al. (2023) Ma, C., R. Pathak, and M. J. Wainwright (2023). Optimally tackling covariate shift in rkhs-based nonparametric regression. The Annals of Statistics 51(2), 738–761.
  • Pace and Barry (1997) Pace, R. K. and R. Barry (1997). Sparse spatial autoregressions. Statistics & Probability Letters 33(3), 291–297.
  • Portier (2021) Portier, F. (2021). Nearest neighbor process: weak convergence and non-asymptotic bound. To appear in Bernoulli. ArXiv:2110.15083.
  • Rosenbaum (1995) Rosenbaum, P. R. (1995). Observational Studies. Springer.
  • Shimodaira (2000) Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90(2), 227–244.
  • Sugiyama et al. (2007) Sugiyama, M., S. Nakajima, H. Kashima, P. v. Bünau, and M. Kawanabe (2007). Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, NIPS 2007, pp.  1433–1440.
  • Sugiyama et al. (2008) Sugiyama, M., T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe (2008, December). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics 60(4), 699–746.
  • Tropp (2012) Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12, 389–434.
  • Truquet (2011) Truquet, L. (2011). On a nonparametric resampling scheme for markov random fields. Electronic Journal of Statistics 5, 1503–1536.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
  • Virtanen et al. (2020) Virtanen, P., R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, 261–272.
  • Wang (2023) Wang, K. (2023, March). Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift. arXiv:2302.10160 [cs, math, stat].
  • Wendel (1948) Wendel, J. G. (1948). Note on the gamma function. The American Mathematical Monthly 55(9), 563.
  • Williams and Seeger (2000) Williams, C. K. I. and M. Seeger (2000). Using the Nyström Method to Speed up Kernel Machines. In Advances in Neural Information Processing Systems, Volume 13 of NIPS 2000, pp.  661–667. MIT Press.
  • Yamada et al. (2013) Yamada, M., T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama (2013, May). Relative Density-Ratio Estimation for Robust Distribution Comparison. Neural Computation 25(5), 1324–1370.
  • Zhang et al. (2021) Zhang, T., I. Yamane, N. Lu, and M. Sugiyama (2021, June). A One-Step Approach to Covariate Shift Adaptation. SN Computer Science 2(4), 319.

Appendix A Preliminary results

The first preliminary result concerns the order of magnitude of $P_{X}(B(x,\tau))$, for which we obtain a lower bound and an upper bound.

Lemma 1.

Under (X1), (X2), and (X3), it holds, for every $x\in S_{X}$ and $\tau\in[0,T]$,

$$M_{1,d}\tau^{d}\leq P_{X}(B(x,\tau))\leq M_{2,d}\tau^{d},$$

with $M_{1,d}=cb_{X}V_{d}$ and $M_{2,d}=U_{X}V_{d}$.

Proof.

The proof of the lower bound follows from

$$P_{X}(B(x,\tau))=\int_{B(x,\tau)\cap S_{X}}p_{X}(y)\,dy\geq b_{X}\int_{B(x,\tau)\cap S_{X}}dy\geq b_{X}c\int_{B(x,\tau)}dy,$$

where we have used (X3) to get the first inequality and then (X2) to obtain the second one. We conclude by the change of variable $y=x+\tau u$. The proof of the upper bound is similar:

$$P_{X}(B(x,\tau))=\int_{B(x,\tau)\cap S_{X}}p_{X}(y)\,dy\leq U_{X}\int_{B(x,\tau)\cap S_{X}}dy\leq U_{X}\int_{B(x,\tau)}dy.$$
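As a sanity check of Lemma 1, the sketch below evaluates both bounds exactly in a toy setting. The setting is an assumption of this illustration and is not taken from the paper: $X$ is uniform on $S_{X}=[0,1]$, so that $d=1$, $V_{1}=2$, $b_{X}=U_{X}=1$, and one may take $c=1/2$ and $T=1$. The function names are hypothetical.

```python
def ball_probability(x, tau):
    """P_X(B(x, tau)) for X uniform on [0, 1]: length of [x - tau, x + tau] inside [0, 1]."""
    return max(0.0, min(x + tau, 1.0) - max(x - tau, 0.0))


def lemma1_bounds(tau, c=0.5, b_x=1.0, u_x=1.0, v_d=2.0, d=1):
    """Return (M_{1,d} tau^d, M_{2,d} tau^d) with M_{1,d} = c b_X V_d and M_{2,d} = U_X V_d."""
    return c * b_x * v_d * tau**d, u_x * v_d * tau**d


def check_lemma1(grid_size=101, tol=1e-12):
    """Check the two-sided bound on a grid of centres x in [0, 1] and radii tau in [0, 1]."""
    for i in range(grid_size):
        x = i / (grid_size - 1)
        for j in range(grid_size):
            tau = j / (grid_size - 1)
            lower, upper = lemma1_bounds(tau)
            p = ball_probability(x, tau)
            # The small tolerance only absorbs floating-point rounding.
            if not (lower - tol <= p <= upper + tol):
                return False
    return True


print(check_lemma1())
```

In this toy case the lower bound is attained at the boundary points $x\in\{0,1\}$, where only half of the ball lies in the support, which is exactly the role played by the constant $c$.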

The same type of result can be obtained for $P_{X}(B(x_{1},\tau_{1})\cup B(x_{2},\tau_{2}))$ as follows.

Lemma 2.

Under (X1), (X2), and (X3), it holds, for every $(x_{1},x_{2})\in S_{X}\times S_{X}$ and $(\tau_{1},\tau_{2})\in[0,T]^{2}$,

$$\frac{1}{2}M_{1,d}(\tau_{1}^{d}+\tau_{2}^{d})\leq P_{X}(B(x_{1},\tau_{1})\cup B(x_{2},\tau_{2}))\leq M_{2,d}(\tau_{1}^{d}+\tau_{2}^{d}),$$

with $M_{1,d}=cb_{X}V_{d}$ and $M_{2,d}=U_{X}V_{d}$.

Proof.

The proof of the upper bound follows from the union bound and Lemma 1. For the lower bound, start by noting that for any events $A$ and $B$, $1_{A\cup B}\geq(1_{A}+1_{B})/2$. Then the conclusion follows from Lemma 1.

Based on the previous results, an upper and a lower bound are obtained on the moments of the nearest neighbor radius $\hat{\tau}_{n,k,x}$. A similar upper bound is stated as Lemma 3 in Leluc et al. (2023).

Lemma 3.

Let $q$ be a positive real number. Under (X1), (X2), and (X3), there exist two positive real numbers $c_{q,d}$ and $C_{q,d}$, depending on $q$, $d$, and the distribution of $X$, such that

$$c_{q,d}\frac{k^{q/d}}{(n+1)^{q/d}}\leq\mathbb{E}\hat{\tau}_{n,k,x}^{q}\leq C_{q,d}\frac{k^{q/d}}{(n+1)^{q/d}}.\tag{6}$$

More precise expressions of the two constants are

$$c_{q,d}=\frac{M_{2,d}^{-q/d}}{2\Gamma\left(1+[q/d]\right)},\quad C_{q,d}=2\Gamma\left(1+[q/d]\right)M_{1,d}^{-q/d},$$

where $[x]$ denotes the integer part of the real number $x$.
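The bound (6) can be illustrated by simulation. The following sketch is not part of the paper: it assumes $X$ uniform on $[0,1]$ (so $d=1$, $M_{1,d}=1$ with $c=1/2$, and $M_{2,d}=2$), estimates $\mathbb{E}\hat{\tau}_{n,k,x}^{q}$ by Monte Carlo, and compares it with the two sides of (6) using the explicit constants stated above. All function names and the chosen values of $n$, $k$, $q$ are hypothetical.

```python
import math
import random


def knn_radius(x, sample, k):
    """Distance from x to its k-th nearest neighbour in a one-dimensional sample."""
    return sorted(abs(x - xi) for xi in sample)[k - 1]


def mean_radius_power(x, n, k, q, reps=2000, seed=0):
    """Monte Carlo estimate of E[tau_hat_{n,k,x}^q] for X uniform on [0, 1] (assumed)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        total += knn_radius(x, sample, k) ** q
    return total / reps


def lemma3_bounds(n, k, q, d=1, m1=1.0, m2=2.0):
    """Lower and upper sides of (6) with the explicit constants c_{q,d} and C_{q,d}."""
    s = q / d
    gamma_term = math.gamma(1 + math.floor(s))
    rate = (k / (n + 1)) ** s
    return m2 ** (-s) / (2 * gamma_term) * rate, 2 * gamma_term * m1 ** (-s) * rate


n, k, q = 200, 3, 1.0
lower, upper = lemma3_bounds(n, k, q)
estimate = mean_radius_power(0.5, n, k, q)
print(lower, estimate, upper)
```

For the interior point $x=0.5$ in this toy model, $|x-X_{i}|$ is uniform on $[0,1/2]$, so the exact value is $k/(2(n+1))$, which the estimate should approach and which lies between the two bounds.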

Proof.

We have $\hat{\tau}_{n,k,x}=Z_{(k)}(x)$, the $k$th order statistic of $Z_{i}(x)=|x-X_{i}|$. Moreover, for any measurable and non-negative function $f$,

$$\mathbb{E}f\left(Z_{(k)}(x)\right)=\mathbb{E}f\circ F_{x}^{-1}\left(U_{(k)}\right),$$

where $U_{(k)}$ is the $k$th order statistic of a sample of $n$ uniform random variables, and since $F_{x}(z)=\mathbb{P}\left(X_{1}\in B(x,z)\right)\in\left[M_{1,d}z^{d},M_{2,d}z^{d}\right]$,

$$F_{x}^{-1}(u)=\inf\left\{z\in\mathbb{R}:F_{x}(z)\geq u\right\}\in\left[\frac{u^{1/d}}{M_{2,d}^{1/d}},\frac{u^{1/d}}{M_{1,d}^{1/d}}\right].$$

Note that the range of $Z_{i}(x)$ is $[0,\operatorname{diam}(S_{X})]$ and we use a constant $c$ in the definition of $M_{1,d}=cb_{X}V_{d}$ such that

$$\inf_{x\in S_{X}}\lambda_{d}\left(B(x,z)\right)\geq cV_{d}z^{d}\text{ for }0\leq z\leq\operatorname{diam}(S_{X}).$$

If $z=u^{1/d}/M_{1,d}^{1/d}\geq\operatorname{diam}(S_{X})$, we have $F_{x}(z)=1\geq u$ and we still have $F_{x}^{-1}(u)\leq z$.

Moreover, if $g$ is measurable and nonnegative,

$$\mathbb{E}g\left(U_{(k)}\right)=n!\int g(u_{k})\mathds{1}_{0\leq u_{1}\leq\cdots\leq u_{n}\leq 1}\,du_{1}\cdots du_{n}=\frac{\Gamma(n+1)}{\Gamma(k)\Gamma(n-k+1)}\int_{0}^{1}g(u)u^{k-1}(1-u)^{n-k}\,du.$$

When $f(z)=z^{q}$ for some $q>0$, we get

$$\mathbb{E}\left[Z_{(k)}^{q}\right]\leq M_{1,d}^{-q/d}\mathbb{E}\left[U_{(k)}^{q/d}\right]=M_{1,d}^{-q/d}\frac{\Gamma(n+1)\Gamma(k+q/d)}{\Gamma(k)\Gamma(n+q/d+1)},$$

$$\mathbb{E}\left[Z_{(k)}^{q}\right]\geq M_{2,d}^{-q/d}\mathbb{E}\left[U_{(k)}^{q/d}\right]=M_{2,d}^{-q/d}\frac{\Gamma(n+1)\Gamma(k+q/d)}{\Gamma(k)\Gamma(n+q/d+1)}.$$
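The closed form for $\mathbb{E}[U_{(k)}^{q/d}]$ used in the last two displays can be checked numerically. The sketch below is an illustration only, with hypothetical function names and arbitrary values of $n$, $k$, $s=q/d$. It compares the Gamma-function expression with a midpoint-rule evaluation of the integral against the Beta$(k,n-k+1)$ density of $U_{(k)}$.

```python
import math


def order_stat_moment_closed_form(n, k, s):
    """Gamma(n+1) Gamma(k+s) / (Gamma(k) Gamma(n+s+1)), computed on the log scale."""
    return math.exp(
        math.lgamma(n + 1) + math.lgamma(k + s) - math.lgamma(k) - math.lgamma(n + s + 1)
    )


def order_stat_moment_numeric(n, k, s, steps=200_000):
    """Midpoint-rule value of E[U_(k)^s] using the Beta(k, n-k+1) density of U_(k)."""
    log_norm = math.lgamma(n + 1) - math.lgamma(k) - math.lgamma(n - k + 1)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        # Integrand g(u) u^(k-1) (1-u)^(n-k) with g(u) = u^s.
        total += u ** (s + k - 1) * (1.0 - u) ** (n - k)
    return math.exp(log_norm) * total * h


n, k, s = 10, 3, 0.5
print(order_stat_moment_closed_form(n, k, s), order_stat_moment_numeric(n, k, s))
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow when $n$ is large, which is a design choice of this sketch rather than something the proof requires.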

For $x\geq 1$ and $s>0$, let $N_{1,s}=\inf_{x\geq 1}\frac{\Gamma(x+s)}{x^{s}\Gamma(x)}$ and $N_{2,s}=\sup_{x\geq 1}\frac{\Gamma(x+s)}{x^{s}\Gamma(x)}$. We then get

$$M_{2,d}^{-q/d}\frac{N_{1,q/d}}{N_{2,q/d}}\frac{k^{q/d}}{(n+1)^{q/d}}\leq\mathbb{E}\left[Z_{(k)}(x)^{q}\right]\leq M_{1,d}^{-q/d}\frac{N_{2,q/d}}{N_{1,q/d}}\frac{k^{q/d}}{(n+1)^{q/d}}.$$

By Wendel’s inequality (Wendel, 1948), for $s\in(0,1)$, we have $N_{1,s}\geq\inf_{x\geq 1}\left(\frac{x}{x+s}\right)^{1-s}\geq 1/2$ and $N_{2,s}\leq 1$. For $s\geq 1$, using the equality $\Gamma(z+1)=z\Gamma(z)$, one can deduce that

$$N_{1,s}\geq 1/2,\quad N_{2,s}\leq\Gamma(2+[s]).$$

Indeed if $s=s^{\prime}+\ell$ with $\ell\in\mathbb{N}$ and $0\leq s^{\prime}<1$,

$$\frac{\Gamma(x+s)}{x^{s}\Gamma(x)}=\prod_{j=1}^{\ell}\left(1+\frac{j+s^{\prime}-1}{x}\right)\frac{\Gamma(x+s^{\prime})}{x^{s^{\prime}}\Gamma(x)}$$

and

$$1\leq\prod_{j=1}^{\ell}\left(1+\frac{j+s^{\prime}-1}{x}\right)\leq\prod_{j=1}^{\ell}\left(1+j\right)=\Gamma(\ell+2).$$

This completes the proof of Lemma 3. ∎
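The two bounds $N_{1,s}\geq 1/2$ and $N_{2,s}\leq\Gamma(2+[s])$ obtained in the proof can be spot-checked on a finite grid. The sketch below is only a numerical illustration under assumed grid values, not a substitute for the argument via Wendel's inequality. The function names are hypothetical.

```python
import math


def gamma_ratio(x, s):
    """Gamma(x + s) / (x^s Gamma(x)), evaluated on the log scale for stability."""
    return math.exp(math.lgamma(x + s) - math.lgamma(x) - s * math.log(x))


def ratio_within_bounds(xs, ss, tol=1e-9):
    """Check 1/2 <= Gamma(x+s)/(x^s Gamma(x)) <= Gamma(2 + [s]) on a finite grid."""
    for s in ss:
        upper = math.gamma(2 + math.floor(s))
        for x in xs:
            r = gamma_ratio(x, s)
            # The tolerance only absorbs floating-point rounding near the bounds.
            if not (0.5 - tol <= r <= upper + tol):
                return False
    return True


xs = [1.0, 1.5, 2.0, 5.0, 10.0, 100.0, 1000.0]  # arbitrary grid of x >= 1
ss = [0.1, 0.5, 0.9, 1.0, 1.5, 2.0, 3.7]  # arbitrary grid of s > 0
print(ratio_within_bounds(xs, ss))
```

A grid check of this kind cannot establish the infimum and supremum over all $x\geq 1$, but it is a quick way to catch a mistyped constant.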

Appendix B Proofs of the results on the marginal sampling error (Section 4.2)

B.1 Proof of Proposition 1

The proof relies on the Lindeberg central limit theorem as given in Proposition 2.27 in Van der Vaart (2000) conditionally to nsubscript𝑛\mathcal{F}_{n}caligraphic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT. We need to show the two properties:

m1i=1m𝔼[(h(Xi,Yn,i)𝔼[h(Xi,Yn,i)|n])2|n]V,superscript𝑚1superscriptsubscript𝑖1𝑚𝔼delimited-[]conditionalsuperscriptsuperscriptsubscript𝑋𝑖superscriptsubscript𝑌𝑛𝑖𝔼delimited-[]conditionalsuperscriptsubscript𝑋𝑖superscriptsubscript𝑌𝑛𝑖subscript𝑛2subscript𝑛𝑉\displaystyle m^{-1}\sum_{i=1}^{m}\mathbb{E}[(h(X_{i}^{*},Y_{n,i}^{*})-\mathbb% {E}[h(X_{i}^{*},Y_{n,i}^{*})\>|\>\mathcal{F}_{n}])^{2}\>|\>\mathcal{F}_{n}]\to V,italic_m start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT blackboard_E [ ( italic_h ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_Y start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - blackboard_E [ italic_h ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_Y start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) | caligraphic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | caligraphic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] → italic_V ,
m1i=1m𝔼[h(Xi,Yn,i)21{|h(Xi,Yn,i)|>ϵn}|n]0,superscript𝑚1superscriptsubscript𝑖1𝑚𝔼delimited-[]conditionalsuperscriptsuperscriptsubscript𝑋𝑖superscriptsubscript𝑌𝑛𝑖2subscript1superscriptsubscript𝑋𝑖superscriptsubscript𝑌𝑛𝑖italic-ϵ𝑛subscript𝑛0\displaystyle m^{-1}\sum_{i=1}^{m}\mathbb{E}[h(X_{i}^{*},Y_{n,i}^{*})^{2}1_{\{% \lvert h(X_{i}^{*},Y_{n,i}^{*})\rvert>\epsilon\sqrt{n}\}}\>|\>\mathcal{F}_{n}]% \to 0,italic_m start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT blackboard_E [ italic_h ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_Y start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT 1 start_POSTSUBSCRIPT { | italic_h ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_Y start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) | > italic_ϵ square-root start_ARG italic_n end_ARG } end_POSTSUBSCRIPT | caligraphic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] → 0 ,

where each convergence needs to happen with probability $1$. Equivalently, using that $(X_i^*,Y_{n,i}^*)_{i=1,\ldots,m}$ are identically distributed according to $\hat{Q}$, we need to show that

$$
\begin{aligned}
& \hat{Q}(h^2)-\hat{Q}(h)^2\to V,\\
& \hat{Q}(h^2 1_{\{\lvert h\rvert>\epsilon\sqrt{n}\}})\to 0 \text{ for each } \epsilon>0.
\end{aligned}
$$

The first result is a direct consequence of the assumption. For the second, fix $M>0$. For all $n$ sufficiently large, we have $M\leq\epsilon\sqrt{n}$, implying that

$$
\hat{Q}(h^2 1_{\{\lvert h\rvert>\epsilon\sqrt{n}\}})\leq\hat{Q}(h^2 1_{\{\lvert h\rvert>M\}}),
$$

which converges to $Q(h^2 1_{\{\lvert h\rvert>M\}})$ by assumption. Since $Q(h^2)$ is finite, one can choose $M$ large enough to make $Q(h^2 1_{\{\lvert h\rvert>M\}})$ arbitrarily small, which gives the second result.

B.2 Proof of Proposition 2

Set $Z_{n,i}^*=h(Y_{n,i}^*,X_i^*)-\hat{Q}(h)$. We have $\lvert Z_{n,i}^*\rvert\leq 2U_h$ and $\mathrm{Var}(Z_{n,i}^*\mid\mathcal{F}_n)=\hat{v}_n$, where $\hat{v}_n=\hat{Q}(h^2)-(\hat{Q}(h))^2$. Note that $\hat{Q}^*(h)-\hat{Q}(h)=\frac{1}{m}\sum_{i=1}^{m}Z_{n,i}^*$. Bernstein's concentration inequality, applied to both tails, leads to

$$
\mathbb{P}\left(\left|\hat{Q}^*h-\hat{Q}h\right|>u \,\middle|\, \mathcal{F}_n\right)\leq 2\exp\left(-\frac{\frac{1}{2}u^2 m}{\hat{Q}(h^2)-(\hat{Q}(h))^2+\frac{2}{3}U_h u}\right).
$$

Then setting

$$
\hat{u}_h(\delta)=\frac{\frac{4}{3}U_h}{m}\log(2/\delta)+\sqrt{2\,\frac{\hat{Q}h^2-(\hat{Q}h)^2}{m}\log(2/\delta)},
$$

we get

$$
\mathbb{P}\left(\left|\hat{Q}^*h-\hat{Q}h\right|>\hat{u}_h(\delta) \,\middle|\, \mathcal{F}_n\right)\leq\delta.
$$

Integrating both sides then gives

$$
\mathbb{P}\left(\left|\hat{Q}^*h-\hat{Q}h\right|>\hat{u}_h(\delta)\right)\leq\delta,
$$

which leads to the stated bound.
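As a numerical sanity check of the choice of $\hat{u}_h(\delta)$, the following Python sketch evaluates the radius and the Bernstein exponent $\frac{1}{2}u^2m/(\hat{v}_n+\frac{2}{3}U_hu)$ at that radius. The function names and the numerical inputs are illustrative assumptions and do not come from the paper; the only point is that the exponent at $u=\hat{u}_h(\delta)$ is at least $\log(2/\delta)$.

```python
import math


def bernstein_radius(q_h, q_h2, u_bound, m, delta):
    """Radius u_hat_h(delta) built from Q_hat(h), Q_hat(h^2), U_h, m and delta."""
    v_hat = q_h2 - q_h ** 2  # conditional variance Q_hat(h^2) - Q_hat(h)^2
    log_term = math.log(2.0 / delta)
    linear_part = (4.0 / 3.0) * u_bound * log_term / m
    sqrt_part = math.sqrt(2.0 * v_hat * log_term / m)
    return linear_part + sqrt_part


def bernstein_exponent(u, q_h, q_h2, u_bound, m):
    """Exponent (1/2) u^2 m / (v_hat + (2/3) U_h u) of the Bernstein bound."""
    v_hat = q_h2 - q_h ** 2
    return 0.5 * u ** 2 * m / (v_hat + (2.0 / 3.0) * u_bound * u)


# Hypothetical inputs, chosen only for illustration.
q_h, q_h2, u_bound, m, delta = 0.3, 0.25, 1.0, 500, 0.05
radius = bernstein_radius(q_h, q_h2, u_bound, m, delta)
exponent = bernstein_exponent(radius, q_h, q_h2, u_bound, m)
print(radius, exponent, math.log(2.0 / delta))
```

The radius is a convenient upper bound on the exact root of the quadratic equation obtained by setting the exponent equal to $\log(2/\delta)$, which is why the check is an inequality rather than an equality.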

Appendix C Proofs of the results on the $k$-NN conditional sampling error (Section 4.3)

Here, we give proofs of the results on the $k$-NN conditional sampling error appearing in Section 4.3.

C.1 Proof of Proposition 3

We start with a useful bias-variance decomposition. Introduce

$$
\begin{aligned}
\epsilon_i(x) &= h(Y_i,x)-\int h(y,x)\,P_{Y|X}(dy\mid X_i),\\
\Delta(x,X_i) &= \int h(y,x)\left(P_{Y|X}(dy\mid X_i)-P_{Y|X}(dy\mid x)\right).
\end{aligned}
$$

We have

$$
\begin{aligned}
&\int h(y,x)\left(\hat{P}_{Y|X}(dy\mid x)-P_{Y|X}(dy\mid x)\right)\\
&\qquad= k^{-1}\sum_{i=1}^{n}\epsilon_i(x)\,1_{\{B(x,\hat{\tau}_{n,k,x})\}}(X_i)+k^{-1}\sum_{i=1}^{n}\Delta(x,X_i)\,1_{\{B(x,\hat{\tau}_{n,k,x})\}}(X_i).
\end{aligned}
$$

Integrating with respect to $Q_X(dx)$, we obtain

$$
(\hat{Q}-Q)(h)=B_h+S_h \tag{7}
$$

with

$$
\begin{aligned}
B_h &= k^{-1}\sum_{i=1}^{n}\int\Delta(x,X_i)\,1_{\{B(x,\hat{\tau}_{n,k,x})\}}(X_i)\,Q_X(dx),\\
S_h &= k^{-1}\sum_{i=1}^{n}\int\epsilon_i(x)\,1_{\{B(x,\hat{\tau}_{n,k,x})\}}(X_i)\,Q_X(dx).
\end{aligned}
$$

The term $B_h$ is a bias term and the term $S_h$ (which has mean $0$) is a variance term.
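To make the decomposition concrete, the following Python sketch checks the pointwise identity (before integration with respect to $Q_X$) on simulated data. Everything specific here is an assumption made for illustration: the dimension is $d=1$, the test function is $h(y,x)=y$, the regression function is $x\mapsto\sin(2\pi x)$, and the helper names `regression`, `knn_indices` and `decompose` are hypothetical. With these choices, $\epsilon_i(x)=Y_i-\sin(2\pi X_i)$ and $\Delta(x,X_i)=\sin(2\pi X_i)-\sin(2\pi x)$.

```python
import math
import random


def regression(x):
    """Assumed conditional mean of Y given X = x (illustrative choice)."""
    return math.sin(2.0 * math.pi * x)


def knn_indices(x, xs, k):
    """Indices of the k source points closest to x (one-dimensional case)."""
    return sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]


def decompose(x, xs, ys, k):
    """Return (k-NN error at x, noise term, bias term) for h(y, x) = y."""
    idx = knn_indices(x, xs, k)
    estimate = sum(ys[i] for i in idx) / k
    # Average of epsilon_i(x) over the k nearest neighbors.
    noise = sum(ys[i] - regression(xs[i]) for i in idx) / k
    # Average of Delta(x, X_i) over the k nearest neighbors.
    bias = sum(regression(xs[i]) - regression(x) for i in idx) / k
    return estimate - regression(x), noise, bias


rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [regression(x) + rng.gauss(0.0, 0.1) for x in xs]
error, noise, bias = decompose(0.3, xs, ys, k=5)
print(error, noise + bias)
```

The two printed numbers agree up to floating-point rounding, since the error of the $k$-NN average splits exactly into the averaged noise and the averaged regression gap.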

The proof is divided into $3$ steps. The first step bounds the bias term. The second step deals with the variance upper bound. The third step is concerned with the variance lower bound.

The bias.

First, (H1) gives that for any $X_i\in S_X\cap B(x,\hat{\tau}_{n,k,x})$ and $x\in S_X$,

$$
\left|\Delta(x,X_i)\right|\leq g_h(x)\,\hat{\tau}_{n,k,x}.
$$

Consequently, using (X4) and the fact that $\sum_{i=1}^{n}1_{\|X_i-x\|\leq\hat{\tau}_{n,k,x}}=k$, we have

$$
\left|k^{-1}\sum_{i=1}^{n}\int\Delta(x,X_i)\,1_{\{B(x,\hat{\tau}_{n,k,x})\}}(X_i)\,Q_X(dx)\right|\leq\int\hat{\tau}_{n,k,x}\,g_h(x)\,Q_X(dx)
$$

and, from Jensen's inequality,

$$
\left|k^{-1}\sum_{i=1}^{n}\int\Delta(x,X_i)\,1_{\{B(x,\hat{\tau}_{n,k,x})\}}(X_i)\,Q_X(dx)\right|^2\leq\int\hat{\tau}_{n,k,x}^2\,g_h^2(x)\,Q_X(dx).
$$

From Lemma 3, we have $\sup_{x\in S_X}\mathbb{E}[\hat{\tau}_{n,k,x}^2]\leq C_{2,d}\,k^{2/d}(n+1)^{-2/d}$, and the control of the bias is given by

$$
\mathbb{E}\left[\left|B_h^2\right|\right]\leq C_{2,d}\int g_h^2(x)\,Q_X(dx)\cdot\frac{k^{2/d}}{(n+1)^{2/d}}.
$$

The variance upper-bound.

For the proof, we assume that $1\leq k<n/2$. We have for each $(x,x')\in S_X^2$ and $(i,j)\in\{1,\dots,n\}^2$,

$$
\mathbb{E}[\epsilon_i(x)\epsilon_j(x')\mid X_1,\ldots,X_n]=
\begin{cases}
0 & \text{if } i\neq j,\\
\mathbb{E}\left[\epsilon_i(x)\epsilon_i(x')\mid X_i\right]\leq\sqrt{\mathbb{E}\left[\epsilon_i(x)^2\mid X_i\right]\mathbb{E}\left[\epsilon_i(x')^2\mid X_i\right]}\leq\sigma_+^2 & \text{if } i=j.
\end{cases}
$$

For the second case, we used (H2) and the Cauchy-Schwarz inequality. As a consequence, the variance is given by

$$
\begin{aligned}
\mathbb{E}\left[S_h^2\right]
&=\mathbb{E}\left[\left(k^{-1}\sum_{i=1}^{n}\int 1_{\|X_i-x\|\leq\hat{\tau}_{n,k,x}}\,\epsilon_i(x)\,Q_X(dx)\right)^2\right]\\
&=k^{-2}\sum_{i,j}^{n}\mathbb{E}\left[\left(\int 1_{\|X_i-x\|\leq\hat{\tau}_{n,k,x}}\,\epsilon_i(x)\,Q_X(dx)\right)\left(\int 1_{\|X_j-x'\|\leq\hat{\tau}_{n,k,x'}}\,\epsilon_j(x')\,Q_X(dx')\right)\right]\\
&=k^{-2}\sum_{i,j}^{n}\mathbb{E}\left[\int\!\!\int 1_{\|X_i-x\|\leq\hat{\tau}_{n,k,x}}\,1_{\|X_j-x'\|\leq\hat{\tau}_{n,k,x'}}\,\mathbb{E}\left[\epsilon_i(x)\epsilon_j(x')\mid X_1,\dots,X_n\right]Q_X(dx)\,Q_X(dx')\right]\\
&\leq k^{-2}\sigma_+^2\sum_{i=1}^{n}\mathbb{E}\left[\int\!\!\int 1_{\|X_i-x\|\leq\hat{\tau}_{n,k,x}}\,1_{\|X_i-x'\|\leq\hat{\tau}_{n,k,x'}}\,Q_X(dx)\,Q_X(dx')\right]\\
&=k^{-2}\sigma_+^2\sum_{i=1}^{n}\mathbb{E}\left[\hat{Y}_i^2\right]
\end{aligned}
$$

with $\hat{Y}_i=\int\mathds{1}_{\|X_i-x\|\leq\hat{\tau}_{n,k,x}}\,Q_X(dx)$. Let $Z_i(x)=\|x-X_i\|$ and let $Z_{(k)}(x)$ be the $k$-th order statistic of the sample $\left(Z_i(x)\right)_{1\leq i\leq n}$. One can observe that

$$
Z_i(x)\leq Z_{(k)}(x)\Longleftrightarrow Z_i(x)<Z_{(k)}^{-i}(x),
$$

where $Z_{(k)}^{-i}(x)$ is the $k$-th order statistic of the sample $\left(Z_j(x)\right)_{1\leq j\neq i\leq n}$. Note that the two sigma fields generated respectively by $\left\{Z_i(x):x\in S_X\right\}$ and $\left\{Z_{(k)}^{-i}(x):x\in S_X\right\}$ are independent. For a fixed mapping $\rho:S_X\rightarrow\mathrm{diam}(S_X)$, we first bound

$$
\begin{aligned}
I &= \mathbb{E}\left|\int\mathds{1}_{Z_1(x)\leq\rho(x)}\,Q(dx)\right|^2\\
&= \int\!\!\int\mathbb{P}\left(Z_1(x)\leq\rho(x),\,Z_1(y)\leq\rho(y)\right)Q(dx)\,Q(dy)\\
&= 2\int\!\!\int\mathds{1}\{\rho(x)\leq\rho(y)\}\,\mathbb{P}\left(Z_1(x)\leq\rho(x),\,Z_1(y)\leq\rho(y)\right)Q(dx)\,Q(dy)\\
&= 2\int\!\!\int\mathds{1}\{\rho(x)\leq\rho(y),\,\|x-y\|\leq 2\rho(y)\}\,\mathbb{P}\left(Z_1(x)\leq\rho(x),\,Z_1(y)\leq\rho(y)\right)Q(dx)\,Q(dy)\\
&\leq 2M_{2,d}\int\!\!\int\mathds{1}\{\rho(x)\leq\rho(y),\,\|x-y\|\leq 2\rho(y)\}\,\rho(y)^d\,Q(dx)\,Q(dy)\\
&\leq 2M_{2,d}\int Q\left(B(y,2\rho(y))\right)\rho(y)^d\,Q(dy)\\
&\leq 2^{d+1}M_{2,d}^2\int\rho(y)^{2d}\,Q(dy).
\end{aligned}
$$

The fourth line holds because, when $\|x-y\|>2\rho(y)$ and $\rho(x)\leq\rho(y)$, the two balls $B(x,\rho(x))$ and $B(y,\rho(y))$ do not intersect. Using Lemma 3, we then get

\[
\mathbb{E}\hat{Y}_{i}^{2} \leq 2^{d+1}M_{2,d}^{2}\int \mathbb{E}\left\{Z_{(k)}^{-i}(y)\right\}^{2d} Q(dy) \leq \frac{2^{d+3}M_{2,d}^{2}}{M_{1,d}^{2}}\,\frac{k^{2}}{n^{2}}.
\]

This leads to the variance upper-bound.

The variance lower-bound.

Here we assume that $\sigma^{2}_{-} := \inf_{x\in S_{X}}\operatorname{Var}\left(h(Y)\mid X=x\right)>0$. From previous computations, we have

\[
\operatorname{Var}\left(\hat{Q}h\right) \geq \frac{\sigma^{2}_{-}}{k^{2}}\sum_{i=1}^{n}\mathbb{E}\left[\hat{Y}_{i}^{2}\right].
\]

We have $\mathbb{E}\hat{Y}_{i}^{2} \geq \left(\mathbb{E}\hat{Y}_{i}\right)^{2}$, so it suffices to find a lower bound for $\mathbb{E}\hat{Y}_{i}$. From the arguments used in the proof of the upper bound, we have

\[
\mathbb{E}\hat{Y}_{i} = \int_{S_{X}}\mathbb{P}\left(Z_{i}(x)\leq Z_{(k)}^{-i}(x)\right)Q(dx) \geq M_{1,d}\int_{S_{X}}\mathbb{E}\left|Z_{(k)}^{-i}(x)\right|^{d}Q(dx) \geq \frac{k}{2nM_{2,d}},
\]

where the last inequality follows from Lemma 3. This shows the lower-bound and the proof of Proposition 3 is now complete.
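As a sanity check on the order $k/n$ appearing in both bounds, the following minimal sketch estimates the masses $\hat{Y}_{i}$ numerically. It assumes that $\hat{Y}_{i}$ is the $Q$-mass of the points $x$ having $X_{i}$ among their $k$ nearest neighbors, and it replaces $Q$ by the empirical measure of a simulated target sample; the function name `knn_mass` and the chosen distributions are purely illustrative. Without ties, each $x$ has exactly $k$ nearest neighbors, so the masses sum to $k$ and their average is $k/n$.

```python
import math
import random


def knn_mass(source, targets, k):
    """Fraction of target points having source[i] among their k nearest neighbors.

    This is a Monte Carlo stand-in (assumption) for the Q-mass Y_hat_i.
    """
    n = len(source)
    counts = [0] * n
    for x in targets:
        # Source indices sorted by Euclidean distance to the target point x.
        order = sorted(range(n), key=lambda i: math.dist(source[i], x))
        for i in order[:k]:
            counts[i] += 1
    return [c / len(targets) for c in counts]


rng = random.Random(0)
n, m, k, d = 200, 500, 3, 2
# Hypothetical source law: uniform on the unit square.
source = [[rng.random() for _ in range(d)] for _ in range(n)]
# Hypothetical target law: independent Beta(2, 1) coordinates.
targets = [[rng.betavariate(2, 1) for _ in range(d)] for _ in range(m)]

y_hat = knn_mass(source, targets, k)
total_mass = sum(y_hat)      # equals k up to rounding error
mean_mass = total_mass / n   # equals k / n up to rounding error
```

The identity holds for any target sample, which is why the average mass matches the $k/n$ scaling of the bounds regardless of the shift between the two distributions.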

C.2 Proof of Proposition 4

Define

\[
\overline{\tau}_{n,k} = \left(\frac{2k}{nM_{1,d}}\right)^{1/d}.
\]

The following lemma (Portier, 2021, Lemma 4) controls the size of the $k$-NN balls uniformly over all $x\in S_{X}$.

Lemma 4 (Portier (2021, Lemma 4)).

Suppose that (X1), (X2) and (X3) hold true. Then for all $n\geq 1$, $\delta\in(0,1)$ and $1\leq k\leq n$ such that $24d\log(12n/\delta)\leq k\leq T^{d}nb_{X}cV_{d}/2$, it holds, with probability at least $1-\delta$:

\[
\sup_{x\in S_{X}}\hat{\tau}_{n,k,x} \leq \overline{\tau}_{n,k}.
\]
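To get a feeling for the lemma, the sketch below compares the largest $k$-NN radius over a small grid of query points with $\overline{\tau}_{n,k}$ for uniform source data on $[0,1]^{2}$. The value used for $M_{1,d}$ ($\pi/4$, motivated by a quarter disc at a corner of the square) is an assumption made for this toy setting, and the names `knn_radius` and `tau_bar` are hypothetical. The grid only approximates the supremum over $S_{X}$ from below.

```python
import math
import random


def knn_radius(x, source, k):
    """Distance from x to its k-th nearest neighbor in `source`."""
    dists = sorted(math.dist(x, s) for s in source)
    return dists[k - 1]


def tau_bar(n, k, m1, d):
    """Deterministic radius (2k / (n * M_{1,d}))^(1/d) defined in the text."""
    return (2.0 * k / (n * m1)) ** (1.0 / d)


rng = random.Random(1)
n, d, delta = 2000, 2, 0.1
source = [[rng.random(), rng.random()] for _ in range(n)]

# Assumed value of M_{1,d} for the uniform law on the unit square.
m1 = math.pi / 4.0

# Smallest k satisfying the lower condition 24 d log(12 n / delta) <= k.
k = math.ceil(24 * d * math.log(12 * n / delta))

# 5 x 5 grid of query points, including the corners of the square.
grid = [[i / 4.0, j / 4.0] for i in range(5) for j in range(5)]
sup_radius = max(knn_radius(x, source, k) for x in grid)
bound = tau_bar(n, k, m1, d)
```

In this configuration the largest radius is attained near the corners, where only a quarter of each ball lies inside the support.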

We now deal with the variance term of our estimator. The variance term of the nearest-neighbors estimator is given by $V_{n}=k^{-1}\sum_{i=1}^{n}\hat{Y}_{i}$, where

\[
\hat{Y}_{i} = \int \epsilon_{i}(x)\,\mathds{1}_{B(x,\hat{\tau}_{n,k,x})}(X_{i})\,Q_{X}(dx).
\]

Set $\hat{\tau}_{k}=\sup_{x\in S_{X}}\hat{\tau}_{n,k,x}$. From our assumptions and Jensen’s inequality, we have

\[
|\hat{Y}_{i}| \leq 2U_{h}M_{2,d}\hat{\tau}_{k}^{d},\quad \operatorname{Var}\left(\hat{Y}_{i}\mid X_{1},\ldots,X_{n}\right) \leq \sigma_{+}^{2}M_{2,d}^{2}\hat{\tau}_{k}^{2d}.
\]

Applying Bernstein’s inequality for independent random variables (we recall that the $Y_{i}$'s are independent conditionally on the $X_{i}$'s), we get for $t>0$,

\[
\mathbb{P}\left(|V_{n}|>t \mid X_{1},\ldots,X_{n}\right) \leq 2\exp\left(-\frac{\frac{1}{2}k^{2}t^{2}}{n\sigma_{+}^{2}M_{2,d}^{2}\hat{\tau}_{k}^{2d}+\frac{2}{3}U_{h}M_{2,d}\hat{\tau}_{k}^{d}kt}\right).
\]

This leads to

\[
\mathbb{P}\left(|V_{n}|>t,\ \hat{\tau}_{k}\leq\overline{\tau}_{n,k} \mid X_{1},\ldots,X_{n}\right) \leq 2\exp\left(-\frac{\frac{1}{2}t^{2}n}{L_{1}\sigma_{+}^{2}+L_{2}t}\right).
\]

Note that this upper bound is not random, and we get

\[
\mathbb{P}\left(|V_{n}|>t,\ \hat{\tau}_{k}\leq\overline{\tau}_{n,k}\right) \leq 2\exp\left(-\frac{\frac{1}{2}t^{2}n}{L_{1}\sigma_{+}^{2}+L_{2}t}\right). \tag{8}
\]

Setting

\[
\widetilde{t}_{n}(\delta,h) = \frac{L_{2}}{n}\log(2/\delta)+\sqrt{\frac{L_{2}^{2}}{n^{2}}\log^{2}(2/\delta)+\frac{2L_{1}\sigma_{+}^{2}}{n}\log(2/\delta)},
\]

which is smaller than

\[
t_{n}(\delta,h) = L_{0}\left(\frac{k}{n}\right)^{1/d}+\frac{2L_{2}}{n}\log(2/\delta)+\sqrt{\frac{2L_{1}\sigma_{+}^{2}}{n}\log(2/\delta)},
\]

we then get

\begin{align*}
\mathbb{P}\left(|V_{n}|>t_{n}(\delta,h)\right) &\leq \mathbb{P}\left(|V_{n}|>\widetilde{t}_{n}(\delta,h),\ \hat{\tau}_{k}\leq\overline{\tau}_{n,k}\right)+\mathbb{P}\left(\hat{\tau}_{k}>\overline{\tau}_{n,k}\right) \\
&\leq 2\delta,
\end{align*}

where the last inequality is a consequence of (8) and Lemma 4. Moreover, from the proof of Theorem 3 and Lemma 4, the bias part can be dominated by $L_{0}(k/n)^{1/d}$ with probability at least $1-\delta$. This concludes the proof.
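The choice of $\widetilde{t}_{n}(\delta,h)$ can be checked numerically: it is the positive root of $\frac{1}{2}nt^{2}=(L_{1}\sigma_{+}^{2}+L_{2}t)\log(2/\delta)$, so the right-hand side of (8) equals $\delta$ at that point, and $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ gives $\widetilde{t}_{n}(\delta,h)\leq t_{n}(\delta,h)$. The sketch below illustrates this with arbitrary, hypothetical values of the constants $L_{0}$, $L_{1}$, $L_{2}$ and $\sigma_{+}^{2}$; the function names are ours.

```python
import math


def bernstein_tail(t, n, l1, l2, sigma2):
    """Right-hand side of the exponential bound (8)."""
    return 2.0 * math.exp(-0.5 * t * t * n / (l1 * sigma2 + l2 * t))


def t_tilde(n, delta, l1, l2, sigma2):
    """Positive root of 0.5 * n * t^2 = (l1 * sigma2 + l2 * t) * log(2 / delta)."""
    log_term = math.log(2.0 / delta)
    a = l2 * log_term / n
    return a + math.sqrt(a * a + 2.0 * l1 * sigma2 * log_term / n)


def t_n(n, k, d, delta, l0, l1, l2, sigma2):
    """Threshold t_n(delta, h): bias term plus the two deviation terms."""
    log_term = math.log(2.0 / delta)
    return (l0 * (k / n) ** (1.0 / d)
            + 2.0 * l2 * log_term / n
            + math.sqrt(2.0 * l1 * sigma2 * log_term / n))


# Hypothetical constants, chosen only for illustration.
n, k, d, delta = 1000, 1, 2, 0.05
l0, l1, l2, sigma2 = 1.0, 2.0, 3.0, 0.5

tt = t_tilde(n, delta, l1, l2, sigma2)
tail_at_tt = bernstein_tail(tt, n, l1, l2, sigma2)  # equals delta up to rounding
threshold = t_n(n, k, d, delta, l0, l1, l2, sigma2)
```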

Appendix D Proof of Theorem 1

First, note that the boundedness of $\sup_{x\in S_{X}}\mathbb{E}\left[h^{2}(Y,x)\mid X\right]$ entails (H2). Setting $A_{m,n}=\hat{Q}^{*}(h)-\hat{Q}(h)$ and $B_{n}=\hat{Q}(h)-Q(h)$, Proposition 3 guarantees that

\[
\mathbb{E}B_{n}^{2} \leq C_{1}\left\{\frac{1}{n^{2/d}}+\frac{1}{n}\right\},
\]

for some $C_{1}>0$ depending only on the distribution of $(X,Y)$, on $X^{*}$ and on $h$. It remains to show that a similar bound can be obtained for $A_{m,n}$. Since

\[
\mathbb{E}\left[A_{m,n}^{2}\mid\mathcal{F}_{n}\right] \leq \frac{\hat{Q}(h^{2})-\hat{Q}(h)^{2}}{m} \leq \frac{\hat{Q}(h^{2})}{m},
\]

it only remains to show that $\mathbb{E}\hat{Q}(h^{2})$ is bounded with respect to $n$. The approach is similar to the control of the variance term for $k=1$ studied in the proof of Proposition 3. If $L$ is an upper bound for $\sup_{x\in S_{X}}\mathbb{E}\left[h^{2}(Y,x)\mid X\right]$, we have

\begin{align*}
\mathbb{E}\hat{Q}(h^{2}) &= \sum_{i=1}^{n}\int_{S_{X}}Q_{X}(dx)\,\mathbb{E}\left[h^{2}(Y_{1},x)\,\mathds{1}_{\|X_{1}-x\|\leq\min_{2\leq j\leq n}\|X_{j}-x\|}\right] \\
&\leq nL\int_{S_{X}}\mathbb{P}\left(\|X_{1}-x\|\leq\min_{2\leq j\leq n}\|X_{j}-x\|\right)Q_{X}(dx) \\
&\leq nLM_{2,d}\int_{S_{X}}\mathbb{E}\min_{2\leq j\leq n}\|X_{j}-x\|^{d}\,Q_{X}(dx) \\
&\leq \frac{2LM_{2,d}}{M_{1,d}}.
\end{align*}

The last upper bound is obtained from Lemma 3 using the fact that $\min_{2\leq j\leq n}\|X_{j}-x\|$ has the same probability distribution as $\hat{\tau}_{n-1,1,x}$. We deduce the result by taking $C$ as the maximum of $C_{1}$ and $2LM_{2,d}/M_{1,d}$. $\blacksquare$
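For concreteness, here is a minimal sketch of the quantity $\hat{Q}^{*}(h)$ controlled above, under the assumption that it is the average of $h$ over the target inputs, each labeled with the output of its nearest source point (the case $k=1$). The helper names `nearest_label` and `nn_sampling_estimate`, as well as the toy data, are illustrative only.

```python
import math


def nearest_label(x, source_x, source_y):
    """Label of the source point closest to x (1-NN, Euclidean distance)."""
    i = min(range(len(source_x)), key=lambda j: math.dist(source_x[j], x))
    return source_y[i]


def nn_sampling_estimate(h, target_x, source_x, source_y):
    """Average of h(y, x) over target inputs x labeled by their nearest source point."""
    values = [h(nearest_label(x, source_x, source_y), x) for x in target_x]
    return sum(values) / len(values)


# Toy one-dimensional example (hypothetical data).
source_x = [[0.0], [1.0], [2.0]]
source_y = [10.0, 20.0, 30.0]
target_x = [[0.1], [0.9], [1.9], [2.2]]

# With h(y, x) = y, the estimate is the mean of the borrowed labels.
estimate = nn_sampling_estimate(lambda y, x: y, target_x, source_x, source_y)
```

A brute-force search is used here for readability; a tree-based nearest neighbor search would be the natural choice for large samples.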

Appendix E A corollary bounding the sampling error for $k$-NN sampling

Corollary 1.

Suppose that (X1), (X2), (X3) and (X4) are fulfilled, that (H1) and (H2) are fulfilled for both $h$ and $h^{2}$, and that $h$ is bounded. Let $k=k_{n}$ satisfy the condition of Lemma 4, let $\delta\in(0,1/7)$, and set $s^{2}(h)=Qh^{2}-(Qh)^{2}$ and $\hat{s}^{2}(h)=\hat{Q}h^{2}-(\hat{Q}h)^{2}$. Then, with probability greater than $1-7\delta$,

\[
\hat{Q}^{*}h-\hat{Q}h \leq \frac{(4/3)U_{h}}{m}\log(2/\delta)+\sqrt{2\frac{v^{2}_{n}(\delta,h)}{m}\log(2/\delta)}+\sqrt{2\frac{s^{2}(h)}{m}\log(2/\delta)}, \tag{9}
\]

where

\[
v^{2}_{n}(\delta,h) = t_{n}(\delta,h^{2})+t_{n}(\delta,h)^{2}+2t_{n}(\delta,h)\,Q|h|
\]

and $t_{n}(\delta,h)$ is defined in the statement of Proposition 4.

Proof.

We first use the result of Proposition 2. In particular, setting

\[
\hat{u}_{h}(\delta) = \frac{(4/3)U_{h}}{m}\log(2/\delta)+\sqrt{2\frac{\hat{Q}h^{2}-(\hat{Q}h)^{2}}{m}\log(2/\delta)},
\]

we have

\[
\mathbb{P}\left(\left|\hat{Q}^{*}h-\hat{Q}h\right|>\hat{u}_{h}(\delta)\right) \leq \delta.
\]

We then use the decomposition

\[
\hat{Q}h^{2}-(\hat{Q}h)^{2} = s^{2}(h)+\hat{Q}h^{2}-Qh^{2}-\left(\hat{Q}h-Qh\right)^{2}-2Qh\left(\hat{Q}h-Qh\right).
\]

From Proposition 4, we know that

\[
\left|\hat{Q}h^{2}-Qh^{2}\right| \leq t_{n}(\delta,h^{2})
\]

with probability greater than $1-3\delta$ and

\[
\left|\hat{Q}h-Qh\right| \leq t_{n}(\delta,h)
\]

with probability greater than $1-3\delta$. Collecting these three bounds, we easily obtain the conclusion of the second point of Corollary 1. ∎
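The way the three bounds combine can be traced numerically. The sketch below checks the variance decomposition used in the proof and the resulting inequality $\hat{s}^{2}(h)\leq s^{2}(h)+v^{2}_{n}(\delta,h)$, which holds whenever the two deviations are at most $t_{n}(\delta,h^{2})$ and $t_{n}(\delta,h)$; it also evaluates the right-hand side of (9). All numerical values and function names are hypothetical.

```python
import math


def variance_decomposition(q_hat_h2, q_hat_h, q_h2, q_h):
    """Return both sides of the decomposition of the empirical variance."""
    s2 = q_h2 - q_h ** 2
    lhs = q_hat_h2 - q_hat_h ** 2
    rhs = (s2 + (q_hat_h2 - q_h2)
           - (q_hat_h - q_h) ** 2
           - 2.0 * q_h * (q_hat_h - q_h))
    return lhs, rhs


def v_n_squared(t_h2, t_h, q_abs_h):
    """v_n^2 = t_n(h^2) + t_n(h)^2 + 2 t_n(h) Q|h|."""
    return t_h2 + t_h ** 2 + 2.0 * t_h * q_abs_h


def sampling_error_bound(u_h, m, delta, v2, s2):
    """Right-hand side of (9)."""
    log_term = math.log(2.0 / delta)
    return ((4.0 / 3.0) * u_h / m * log_term
            + math.sqrt(2.0 * v2 / m * log_term)
            + math.sqrt(2.0 * s2 / m * log_term))


# Hypothetical population and estimated moments of a non-negative h.
q_h, q_h2 = 0.5, 0.4
q_hat_h, q_hat_h2 = 0.52, 0.43
s2 = q_h2 - q_h ** 2

lhs, rhs = variance_decomposition(q_hat_h2, q_hat_h, q_h2, q_h)

# Deviation levels taken equal to the observed deviations (illustration only).
t_h2, t_h = abs(q_hat_h2 - q_h2), abs(q_hat_h - q_h)
v2 = v_n_squared(t_h2, t_h, q_abs_h=q_h)
bound = sampling_error_bound(u_h=1.0, m=500, delta=0.05, v2=v2, s2=s2)
```

Since $\sqrt{s^{2}+v^{2}}\leq\sqrt{s^{2}}+\sqrt{v^{2}}$, the data-dependent threshold $\hat{u}_{h}(\delta)$ is then bounded by the right-hand side of (9).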

Appendix F Proofs of the results on the empirical risk minimization (Section 5)

Here, we present the proofs of the results on the application to empirical risk minimization.

F.1 Proof of Theorem 2.

From (A1), $\sup_{\theta\in\Theta}\left|m_{\theta}\left(Y_{1}^{*},X^{*}_{1}\right)\right|$ is integrable and $\theta\mapsto\mathcal{R}^{*}(\theta)$ is continuous over the compact set $\Theta$. As a consequence, weak consistency will follow from Theorem 5.7 in Van der Vaart (2000) if we show that

\[ \sup_{\theta\in\Theta}\left|\mathcal{R}^{*}_{m,n}(\theta)-\mathcal{R}^{*}(\theta)\right|=o_{\mathbb{P}}(1). \tag{10} \]

Pointwise convergence holds true from assumptions (A1), (A2) as each mapping $m_{\theta}$ satisfies Assumptions (H1), (H2). One can then apply Theorem 1 and the Markov inequality to get, for any $\theta\in\Theta$,

\[ \mathcal{R}^{*}_{m,n}(\theta)-\mathcal{R}^{*}(\theta)=o_{\mathbb{P}}(1),\quad m,n\rightarrow\infty. \]

We now prove uniform convergence. Let $\delta>0$. One can cover the compact set $\Theta$ with finitely many balls $B(\theta_{i},\delta)$, $1\leq i\leq k$. For $\theta\in\Theta\cap B(\theta_{i},\delta)$, we have

\[ \left|\mathcal{R}^{*}_{m,n}(\theta)-\mathcal{R}^{*}_{m,n}(\theta_{i})\right|\leq\eta(\delta)\frac{1}{m}\sum_{i=1}^{m}h\left(Y^{*}_{n,i}\right). \]

Moreover, from Assumption (A2) and Theorem 1 with the Markov inequality, we know that

\[ \frac{1}{m}\sum_{i=1}^{m}h\left(Y^{*}_{n,i}\right)\to\mathbb{E}h\left(Y_{1}^{*}\right),\quad m,n\rightarrow\infty, \]

in probability. We also have

\[ \left|\mathcal{R}^{*}(\theta)-\mathcal{R}^{*}(\theta_{i})\right|\leq\eta(\delta)\,\mathbb{E}h\left(Y_{1}^{*}\right). \]

Finally, one can use the bound

\begin{align*}
\sup_{\theta\in\Theta}\left|\mathcal{R}^{*}_{m,n}(\theta)-\mathcal{R}^{*}(\theta)\right|
&\leq \max_{1\leq i\leq k}\left|\mathcal{R}^{*}_{m,n}(\theta_{i})-\mathcal{R}^{*}(\theta_{i})\right|+\eta(\delta)\left\{\frac{1}{m}\sum_{i=1}^{m}h\left(Y^{*}_{n,i}\right)+\mathbb{E}h\left(Y_{1}^{*}\right)\right\}\\
&= 2\eta(\delta)\,\mathbb{E}h\left(Y_{1}^{*}\right)+o_{\mathbb{P}}(1).
\end{align*}

Given that $\delta$ is arbitrary, the above implies (10) and the weak consistency of $\hat{\theta}^{*}$ follows. The second assertion about the excess risk then follows easily using that

\[ \mathcal{R}^{*}(\hat{\theta}^{*})-\mathcal{R}^{*}(\theta^{*})\leq 2\sup_{\theta\in\Theta}\left|\mathcal{R}^{*}_{m,n}(\theta)-\mathcal{R}^{*}(\theta)\right|. \]

∎
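For readers who want the intermediate step, the excess-risk inequality used at the end of this proof can be obtained from the usual three-term decomposition. The sketch below assumes that $\hat{\theta}^{*}$ minimizes $\mathcal{R}^{*}_{m,n}$ over $\Theta$, so that the middle bracket is non-positive, and that $\theta^{*}\in\Theta$.

```latex
\begin{align*}
\mathcal{R}^{*}(\hat{\theta}^{*})-\mathcal{R}^{*}(\theta^{*})
&= \left[\mathcal{R}^{*}(\hat{\theta}^{*})-\mathcal{R}^{*}_{m,n}(\hat{\theta}^{*})\right]
 + \left[\mathcal{R}^{*}_{m,n}(\hat{\theta}^{*})-\mathcal{R}^{*}_{m,n}(\theta^{*})\right]
 + \left[\mathcal{R}^{*}_{m,n}(\theta^{*})-\mathcal{R}^{*}(\theta^{*})\right]\\
&\leq 2\sup_{\theta\in\Theta}\left|\mathcal{R}^{*}_{m,n}(\theta)-\mathcal{R}^{*}(\theta)\right|.
\end{align*}
```

The first and third brackets are each bounded by the supremum, which gives the factor 2.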

F.2 Proof of Theorem 3.

Let $Z_{i}^{*}=\Gamma^{-1/2}X_{i}^{*}$ and define

\[ \Sigma_{m}=\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{*}{Z_{i}^{*}}^{T},\quad N_{m}=\frac{1}{m}\sum_{i=1}^{m}Z_{i}^{*}\left[Y^{*}_{n,i}-{X_{i}^{*}}^{T}\theta^{*}\right]. \]

The proof first requires some analysis of the smallest eigenvalue of $\Sigma_{m}$. From the matrix Chernoff inequality given in Tropp (2012), see Corollary 5.2 and Remark 5.3, we have

\[ \mathbb{P}\left(\lambda_{\min}(\Sigma_{m})\leq 1-\eta\right)\leq d\exp\left(-\eta^{2}m/(2B)\right), \]

where $B=\lambda_{\min}(\Gamma)^{-1}d\max_{x\in S_{X}}|x|_{\infty}^{2}$ is defined so as to satisfy $\|Z\|_{2}^{2}\leq\lambda_{\min}(\Gamma)^{-1}\|X\|_{2}^{2}\leq B$ with probability $1$. Inverting the previous inequality, we obtain that with probability at least $1-\delta$,

\[ \lambda_{\min}(\Sigma_{m})>1-\sqrt{(2B/m)\log(d\delta^{-1})}, \]

and therefore as soon as $(8B/n)\log(d\delta^{-1})\leq 1$, we have that $\lambda_{\min}(\Sigma_{m})\geq 1/2$. On the previous event, we have that

\[ \Gamma^{1/2}(\hat{\theta}^{*}-\theta^{*})=\Sigma_{m}^{-1}N_{m}. \]

It follows that

\[ \mathcal{R}^{*}(\hat{\theta}^{*})-\mathcal{R}^{*}(\theta^{*})=\|\Gamma^{1/2}(\hat{\theta}^{*}-\theta^{*})\|_{2}^{2}=\|\Sigma_{m}^{-1}N_{m}\|_{2}^{2}\leq 2\|N_{m}\|_{2}^{2}. \]

We conclude using Theorem 1 with $h(y,x)=\Gamma^{-1/2}x\left(y-x^{T}\theta^{*}\right)$. Note that $\mathbb{E}\left[h\left(Y^{*},X^{*}\right)\right]=0$ by definition of $\theta^{*}$. ∎
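The eigenvalue step of this proof can be illustrated numerically. The following is a minimal sketch, not the authors' code: it evaluates the high-probability lower bound $1-\sqrt{(2B/m)\log(d\delta^{-1})}$ and compares it with the smallest eigenvalue of $\Sigma_{m}$ on a toy example. The toy distribution (covariates uniform on $[-1,1]^{d}$, so that $\Gamma=I/3$), the function names, and the values of $d$, $m$, and $\delta$ are all assumptions made for illustration.

```python
import numpy as np


def lambda_min_lower_bound(B, m, d, delta):
    """High-probability lower bound 1 - sqrt((2B/m) * log(d/delta))."""
    return 1.0 - np.sqrt((2.0 * B / m) * np.log(d / delta))


def empirical_lambda_min(Z):
    """Smallest eigenvalue of Sigma_m = (1/m) * sum_i Z_i Z_i^T (rows of Z are Z_i)."""
    sigma_m = Z.T @ Z / Z.shape[0]
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(sigma_m)[0])


# Toy setting (assumption): X uniform on [-1, 1]^d, hence Gamma = E[X X^T] = I/3,
# lambda_min(Gamma)^{-1} = 3 and max |x|_inf^2 = 1, which gives B = 3 * d.
d, m, delta = 2, 2000, 0.05
B = 3.0 * d * 1.0

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(m, d))
Z = np.sqrt(3.0) * X  # Z = Gamma^{-1/2} X in this toy setting

bound = lambda_min_lower_bound(B, m, d, delta)
lam_min = empirical_lambda_min(Z)
print(f"lower bound = {bound:.3f}, empirical lambda_min = {lam_min:.3f}")
```

In this toy setting the empirical eigenvalue sits well above the bound, which is expected since the matrix Chernoff inequality is a worst-case guarantee.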

Appendix G Illustration for Experiments E1–E3

Illustrations of the data used in Experiments E1–E3 can be found in Figure 7.

(a) Data for Experiment E1
(b) Data for Experiments E2 and E3
Figure 7: Visualization of data for Experiments E1–E3

Appendix H Details of the benchmark data experiments

We use the following datasets.

Data splitting and sampling bias simulation

We split the original data into training and test sets and simulate covariate shift by rejection sampling from the test set, with the rejection probability determined by the value of a covariate. For california, twonorm, and breast cancer, we follow the procedure of Sugiyama et al. (2007): we include each target data point $X_{i}$ in the target set with probability $\min(1,4X_{i,c}^{2})$ and reject it otherwise, where $X_{i,c}$ is the $c$-th attribute of $X_{i}$. For diabetes, we use a different biasing procedure because the technique of Sugiyama et al. (2007) rejects too many data points to perform our experiment on this dataset. We instead follow the procedure of an example from the ADAPT package (de Mathelin et al., 2021; https://adapt-python.github.io/adapt/examples/Sample_bias_example.html): we accept each data point $X_{i}$ with probability proportional to $\exp(-20\times\lvert X_{i,\text{age}}+0.06\rvert)$, where $X_{i,\text{age}}$ is the age attribute of $X_{i}$, and reject (i.e., exclude) it otherwise.
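The two biasing rules described above can be sketched as follows. This is an illustrative sketch on synthetic data, not the experiment code: the function names are hypothetical, the inputs are random stand-ins for the real datasets, and realizing "probability proportional to" as a draw without replacement with normalized weights is an assumption about the implementation.

```python
import numpy as np


def sugiyama_accept_mask(X, c, rng):
    """Boolean mask keeping row i with probability min(1, 4 * X[i, c]**2)."""
    p_accept = np.minimum(1.0, 4.0 * X[:, c] ** 2)
    return rng.random(X.shape[0]) < p_accept


def exp_bias_indices(x_age, size, rng):
    """Draw `size` distinct row indices with probability proportional to
    exp(-20 * |x_age + 0.06|). Sampling without replacement is an assumption."""
    weights = np.exp(-20.0 * np.abs(x_age + 0.06))
    return rng.choice(x_age.shape[0], size=size, replace=False,
                      p=weights / weights.sum())


rng = np.random.default_rng(0)

# Synthetic stand-in for a test split with features scaled to [0, 1] (assumption).
X = rng.uniform(0.0, 1.0, size=(1000, 3))
mask = sugiyama_accept_mask(X, c=0, rng=rng)
X_target = X[mask]

# Synthetic stand-in for a centered "age" column (assumption).
x_age = rng.normal(0.0, 0.05, size=200)
idx = exp_bias_indices(x_age, size=50, rng=rng)

print(X_target.shape[0], "points kept by the first rule;", idx.size, "by the second")
```

Under the first rule, any point with $|X_{i,c}|\geq 1/2$ is always kept, so the target set over-represents large values of the chosen attribute.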

Pre-processing

We use one-hot encoding for all categorical features. We center and normalize all the data using the mean and the dimension-wise standard deviation of the source set. We apply the same centering and normalization to the output variables for regression datasets.
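The centering and scaling step can be sketched as below. This is a minimal illustration under stated assumptions rather than the experiment code: the helper names are hypothetical, the data are synthetic, and the guard against zero standard deviation is an added safeguard that the text does not mention. The key point it shows is that the statistics come from the source set only and are reused for the target set and for mapping predictions back.

```python
import numpy as np


def fit_standardizer(source):
    """Column-wise mean and standard deviation computed on the source set only."""
    mean = source.mean(axis=0)
    std = source.std(axis=0)
    # Added safeguard (assumption): avoid division by zero for constant columns.
    std = np.where(std > 0, std, 1.0)
    return mean, std


def standardize(data, mean, std):
    """Center and scale with previously fitted statistics."""
    return (data - mean) / std


def destandardize(scaled, mean, std):
    """Inverse operation, used to map predictions back to the original scale."""
    return scaled * std + mean


rng = np.random.default_rng(0)
X_source = rng.normal(5.0, 2.0, size=(200, 3))   # synthetic source inputs
X_target = rng.normal(6.0, 1.0, size=(100, 3))   # synthetic target inputs
y_source = rng.normal(10.0, 4.0, size=200)       # synthetic regression outputs

x_mean, x_std = fit_standardizer(X_source)
y_mean, y_std = fit_standardizer(y_source)

X_source_s = standardize(X_source, x_mean, x_std)
X_target_s = standardize(X_target, x_mean, x_std)  # target reuses source statistics
y_source_s = standardize(y_source, y_mean, y_std)

# Stand-in for model predictions on the standardized scale.
y_pred_s = y_source_s[:5]
y_pred = destandardize(y_pred_s, y_mean, y_std)
```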

After training and prediction, we post-process the output using the inverse operation. Table 3 shows basic information about the datasets after the bias-sampling and pre-processing.

Table 3: Basic information of the datasets

| | california | twonorm | diabetes | breast cancer |
|---|---|---|---|---|
| Input dimension $d$ | 8 | 20 | 10 | 9 |
| Source sample size $n$ | 1000 | 100 | 150 | 200 |
| Target sample size $m$ | 1000 | 500 | 150 | 100 |