Nearest Neighbor Sampling for Covariate Shift Adaptation
Abstract
Many existing covariate shift adaptation methods estimate sample weights given to loss values to mitigate the gap between the source and the target distribution. However, estimating the optimal weights typically involves computationally expensive matrix inversion and hyper-parameter tuning. In this paper, we propose a new covariate shift adaptation method which avoids estimating the weights. The basic idea is to directly work on unlabeled target data, labeled according to the $k$-nearest neighbors in the source dataset. Our analysis reveals that setting $k = 1$ is an optimal choice. This property removes the necessity of tuning the only hyper-parameter $k$ and leads to a running time quasi-linear in the sample size. Our results include sharp rates of convergence for our estimator, with a tight control of the mean square error and explicit constants. In particular, the variance of our estimators has the same rate of convergence as for standard parametric estimation despite their non-parametric nature. The proposed estimator shares similarities with some matching-based treatment effect estimators used, e.g., in biostatistics, econometrics, and epidemiology. Our experiments show that it achieves a drastic reduction in running time with remarkable accuracy.
1 Introduction
Traditional machine learning methods assume that the source data distribution and the target data distribution are identical. However, this assumption can be violated in practice when there is a distribution shift (Chen et al., 2022) between them. Various types of shift have been studied in the literature, and one of the most common scenarios is covariate shift (Shimodaira, 2000), in which there is a shift in the input distribution, $P_X \neq Q_X$, while the conditional distribution of the output variable given the input variable is the same, $P_{Y|X} = Q_{Y|X}$, where $X$ is the input and $Y$ is the output variable. The goal of covariate shift adaptation is to adapt a supervised learning algorithm to the target distribution using labeled source data and unlabeled target data.
A standard approach to covariate shift is weighting source examples (Shimodaira, 2000), and many studies in the same line of research have focused on improving the weights (Huang et al., 2006; Gretton et al., 2008; Yamada et al., 2013; Kanamori et al., 2009; Sugiyama et al., 2007, 2008; Aminian et al., 2022). We refer the reader to Section 6 for more details on related work. Since we rarely know a priori how the input distributions are shifted, non-parametric methods are particularly useful for covariate shift adaptation. Some of the existing methods allow one to use non-parametric models through kernels. However, such kernel-based methods take at least quadratic time to compute kernel matrices. Some methods further need to solve linear systems and take cubic time in the sample size unless one resorts to approximations (Williams and Seeger, 2000; Le et al., 2013). Moreover, their performance is often sensitive to the choice of hyper-parameters of the kernel. Typically, one performs a grid search with $K$-fold cross-validation for selecting the hyper-parameters, which amplifies the running time by about $K$ times the number of candidates for the hyper-parameters. Moreover, the criterion for the hyper-parameter selection is not obvious either, because we do not have access to the labels for the target data. One can use weighted validation scores using the labeled source data with importance sampling, but it is not straightforward to choose which weights to use for the cross-validation when the weights themselves are what we are choosing.
In this paper, we propose a non-parametric covariate shift adaptation method that is scalable and has no hyper-parameter. Our idea is to generate synthetic labels for unlabeled target data using a non-parametric conditional sampler constructed from source data. Under the assumption of covariate shift, the target data attached with the generated labels behave like labeled target data. This sampling technique allows any supervised learning method to be simply applied to the generated data to produce a model already adapted to the target distribution.
While the proposed approach is quite general and can be employed with various sampling methods for the synthetic labeling part, our main result is that a $k$-nearest neighbor ($k$-NN) based sampling method achieves an error of order for estimating an expectation on the target domain, where $d$ is the data dimensionality, and $n$ and $m$ are the source and the target sample size, respectively. Importantly, our error bounds suggest that $k = 1$ is the most favorable. This property, which is revealed by a precise scaling of the variance term in $k$, is a non-trivial and remarkable result, given the $1/k$-rate of the variance associated to the $k$-NN estimator of the conditional distribution (Portier, 2021, Corollary 1), and it contrasts with the well-known application of $k$-NN to standard density estimation (Dasgupta and Kpotufe, 2014), classification (Gadat et al., 2016; Cannings et al., 2020), or regression problems (Devroye et al., 1994; Jiang, 2019), in which we typically need to let $k$ grow at a polynomial rate in the sample size in order to achieve a good balance in the bias-variance trade-off. This important difference in the rate of convergence, leading to a number of neighbor, has also been noticed in other estimation problems such as the $k$-NN entropy estimator (Berrett et al., 2019) or the integral approximation problem (Leluc et al., 2023; Blanchet et al., 2024). Textbooks dealing with the $k$-NN algorithm include (Györfi et al., 2006; Devroye et al., 2013; Biau and Devroye, 2015).
In addition to being optimal with respect to the estimation error, setting $k = 1$ circumvents the cumbersome hyper-parameter tuning while providing computational efficiency at the same time. Our $1$-NN-based algorithm takes only quasi-linear time on average using the optimized $k$-d tree (Bentley, 1975; Friedman et al., 1977). Indeed, our experiments show that the proposed method terminates faster than previous methods, by large margins. Note that obtaining a computationally efficient method for covariate shift adaptation, in particular one that scales to large data sets, is a recurrent problem in the existing literature. In fact, many existing methods resorted to implementation heuristics such as using a fixed number of kernel centers for reducing the computational burden at the cost of statistical guarantees (Kanamori et al., 2009; Sugiyama et al., 2007, 2008; Yamada et al., 2013).
Our method simulates the missing labels of the target sample, which in turn can be used for a variety of downstream supervised learning tasks. Even though the main focus of this paper is the estimation of expectations in the target domain, to illustrate the usefulness of our method in a typical machine learning downstream task, we also demonstrate consistency properties of parametric M-estimators in the target domain. This is particularly useful when the parametric model is misspecified, since in this case, even the population minimizer changes when the covariate distribution is shifted.
The problem of interest here is closely related to a well-known matching problem studied in the context of treatment effect estimation. In particular, $k$-NN estimators have been used to estimate the so-called average treatment effect to tackle missingness of potential outcomes. See, e.g., Rosenbaum (1995); Abadie and Imbens (2006), for an error bound obtained in this specific instance of the problem. In Section 6, we discuss the main differences between the two problems and why their result is not generally applicable to ours.
In summary, the key contributions of this paper are the following. (i) Our method is non-parametric. It does not introduce a model in covariate shift adaptation so that it will have a minimum impact on the model trained for the downstream task. (ii) Our method is fast. Adaptation only takes a quasi-linear time. (iii) There is no hyper-parameter to be tuned. (iv) The proposed method only incurs an error of order for estimating an expectation on the target domain.
The outline is as follows. In Section 2, the problem of covariate shift adaptation is formally introduced along with the mathematical notation. Section 3 contains the description of the method. Section 4 is dedicated to the main theoretical results while Section 5 investigates the empirical risk minimization problem in presence of covariate shift adaptation. Section 6 provides a description of several alternative approaches to a similar type of problem as well as some points of comparison with our proposal. In Section 7, several avenues for further research are discussed and finally, the numerical experiments are provided in Section 8.
2 Problem setup
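Let $\mathcal{X}$ and $\mathcal{Y}$ be measurable spaces. Let $P$ and $Q$ be probability distributions defined on $\mathcal{X} \times \mathcal{Y}$. Throughout the paper, we assume that $P$ and $Q$ admit the decomposition
$$P(dx, dy) = P_{Y|X}(dy \mid x)\, P_X(dx), \qquad Q(dx, dy) = Q_{Y|X}(dy \mid x)\, Q_X(dx),$$
where $P_{Y|X}(\cdot \mid x)$ and $Q_{Y|X}(\cdot \mid x)$ are probability distributions defined on $\mathcal{Y}$ for each $x \in \mathcal{X}$. [Footnote 1: More formally, we denote by $P_{Y|X}$ a regular conditional measure (Bogachev and Ruas, 2007, Definition 10.4.1) such that the marginal distribution of can be expressed as . We also use for . The same goes for $Q_{Y|X}$.] Here, $P_X$ and $Q_X$ are the marginal distributions of $X$ when $(X, Y)$ is distributed with $P$ and $Q$, respectively. We shall simply call $P_{Y|X}$ (or $Q_{Y|X}$) the conditional distribution of $Y$ given $X$ in the source domain (or the target domain).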
Definition 1 (Source sample, source distribution).
For each integer , let be a collection of independent and identically distributed random variables with . We refer to as the (labeled) source sample and as the source distribution.
Definition 2 (Target sample, target distribution).
For each integer , let be a collection of independent and identically distributed random variables with . We refer to as the (unlabeled) target sample and as the target distribution.
Definition 3 (Covariate shift).
Covariate shift is a situation in which the source and the target distribution have different marginal distributions for while sharing a common conditional distribution:
(C1) $P_{Y|X} = Q_{Y|X}$, $P_X$- and $Q_X$-a.s., but $P_X \neq Q_X$.
This paper focuses on the following simple but versatile estimation problem under covariate shift.
Definition 4 (Mean estimation under covariate shift).
For each pair of integers , , and a known integrable function , the goal of mean estimation under covariate shift is to estimate the mean of under the target distribution,
given access to the source sample and the target sample under Assumption (C1).
For instance, when for a loss function and a hypothesis function , estimation of becomes risk estimation, which is the central subtask in empirical risk minimization.
3 Proposed method
The basic idea of our proposed method is to use the source sample for learning to label the target data. Specifically, using the source sample , we will construct a stochastic labeling function that inputs any target data point and outputs a random label . (The subscript of is for explicitly denoting the dependence on the source sample.) Once we succeed in generating labels for target data that behave like true target labels, we will be able to perform any supervised learning method directly on the target sample for the downstream task. For our mean estimation problem, we can simply average the output evaluated at the target data with the generated labels.
When do the generated labels behave like the true target labels? Let denote the probability distribution of an output of for input . We wish to obtain such that the probability distribution of will be a good estimate of . For this, we want to be a good estimate of . In fact, if , the generated sample will follow the target distribution under Assumption (C1). In this sense, our task boils down to designing a good conditional sampler mimicking sampling from . Algorithm 1 describes an outline of this general framework.
In this paper, we propose a method using a non-parametric conditional sampler based on the $k$-Nearest Neighbor ($k$-NN) method, which randomly picks one of the $k$-nearest neighbors of the input among the source instances and outputs the corresponding label (Algorithm 2). We refer to this method as $k$-NN-based Conditional Sampling Adaptation ($k$-NN-CSA).
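The sampler described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' Algorithm 2: the function names (`knn_indices`, `knn_csa_labels`, `csa_mean`) and the toy data are hypothetical, and the neighbor search is brute force rather than tree-based.

```python
import math
import random

def knn_indices(x, source_x, k):
    """Indices of the k source points closest to x (Euclidean distance, brute force)."""
    order = sorted(range(len(source_x)), key=lambda i: math.dist(x, source_x[i]))
    return order[:k]

def knn_csa_labels(target_x, source_x, source_y, k=1, rng=None):
    """Draw one synthetic label per target point, uniformly among its k nearest source labels."""
    rng = rng or random.Random(0)
    return [source_y[rng.choice(knn_indices(x, source_x, k))] for x in target_x]

def csa_mean(h, target_x, labels):
    """Average h over the target points paired with their generated labels."""
    return sum(h(x, y) for x, y in zip(target_x, labels)) / len(target_x)

# Toy data (hypothetical): one-dimensional covariates stored as 1-tuples.
source_x = [(0.0,), (1.0,), (2.0,), (3.0,)]
source_y = [0, 1, 4, 9]
target_x = [(0.9,), (2.2,), (2.9,)]

labels = knn_csa_labels(target_x, source_x, source_y, k=1)
estimate = csa_mean(lambda x, y: y, target_x, labels)
```

With $k = 1$ the draw is deterministic given the source sample, since each target point has a single nearest neighbor (ties aside); for larger $k$ the label is chosen at random among the $k$ candidates.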
Computing time
Recent advances in nearest neighbor search rely on tree search to reduce the computing time. The seminal paper by Bentley (1975) introduced the $k$-d tree method. Building such a tree requires $O(n \log n)$ operations and, once the tree is available, the search for the nearest neighbor of a given point can be done in $O(\log n)$ expected time (Friedman et al., 1977). As a consequence, the time complexity of $1$-NN-CSA is $O((n + m) \log n)$ on average.
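To make the tree-based search concrete, here is a small pure-Python sketch of a $k$-d tree with a nearest-neighbor query. It is an illustration only: the names `build_kdtree` and `nearest` and the toy points are hypothetical, and this naive builder re-sorts at every level, so it is slower to build than an optimized implementation.

```python
import math

def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (coords, label) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])            # cycle through the coordinates
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2                      # median point becomes the node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, x, best=None):
    """Return (distance, coords, label) of the stored point closest to x."""
    if node is None:
        return best
    coords, label = node["point"]
    dist = math.dist(x, coords)
    if best is None or dist < best[0]:
        best = (dist, coords, label)
    diff = x[node["axis"]] - coords[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, x, best)
    # Visit the other side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, x, best)
    return best

# Toy labeled source sample (hypothetical values).
source = [((2.0, 3.0), "a"), ((5.0, 4.0), "b"), ((9.0, 6.0), "c"),
          ((4.0, 7.0), "d"), ((8.0, 1.0), "e"), ((7.0, 2.0), "f")]
tree = build_kdtree(source)
dist, coords, label = nearest(tree, (9.0, 2.0))
```

The tree is built once from the source sample and then queried once per target point, which is where the quasi-linear overall cost comes from.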
4 Theoretical analysis
We now present the theory behind our approach in a didactic way by introducing a key decomposition first and then studying separately each of the terms involved: the sampling error and the $k$-NN conditional sampling error. We will see that $k$-NN-CSA with $k = 1$ ($1$-NN-CSA for short) achieves the best theoretical performance among all choices of $k$.
4.1 The key decomposition
For the analysis of $k$-NN-CSA, recall that is an estimate of the target distribution that depends on the source sample , whose probability distribution is . We introduce the bootstrap sample as a collection of random variables generated according to .
Definition 5 (Bootstrap sample).
For each and , let be a collection of random variables identically distributed with and conditionally independent given .
Let be a measurable function. The quantity of interest is
which is the CSA estimate of as introduced in Algorithm 1. The following decomposition is crucial in our analysis:
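$$\hat{Q}^{*} h - Q h = \big(\hat{Q}^{*} h - \hat{Q} h\big) + \big(\hat{Q} h - Q h\big), \tag{1}$$
where $\hat{Q}^{*}$ is the empirical measure defined with the bootstrap sample, so that $\hat{Q}^{*} h$ is the CSA estimate, and $\hat{Q} h$ is the integral of $h$ with respect to the estimated target distribution. The first term is the error due to the use of $\hat{Q}^{*}$ in place of $\hat{Q}$, which tends to zero as $m$ grows. The second term represents the error due to the use of the estimated conditional distribution in place of the true one. When using the $k$-nearest neighbor algorithm to obtain the estimate, we show that this term is of order $n^{-1/2} + n^{-1/d}$, which differs from the standard non-parametric convergence rate in $n^{-1/(d+2)}$ found in regression problems.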
4.2 Marginal sampling error
First, we will show that the marginal sampling error, (the first term in our decomposition (1)), is of order . The analysis relies on martingale tools. Define . For each , we have
This property implies that is a martingale and therefore can be analyzed using the Lindeberg-CLT conditionally on the initial sample hence fixing the distribution . The next property is reminiscent of certain results about the bootstrap method where sampling is done with the basic empirical measure, see e.g., Van der Vaart (2000). We need this type of results without specifying the measure so that we can incorporate a variety of sampling schemes such as . The proof is given in Appendix B.1.
Proposition 1.
Suppose that satisfies the following strong law of large number: for each such that , we have almost surely. Then, if as , we have the following central limit theorem: for each function such that , we have, conditionally to , almost surely,
where .
As a corollary of the previous results, we can already deduce that if goes to and satisfies a strong law of large numbers, then converges to provided that exists. This is a general consistency result that justifies the use of any resampling distribution that converges to . In practical situations, it is useful to know a finite-sample bound on the error. This is the purpose of the next proposition, in which we give a non-asymptotic control of the sampling error. A proof is given in Appendix B.2.
Proposition 2.
Suppose that is bounded by a constant . Let . Then with probability greater than ,
where .
Notes.
A natural “averaging” alternative to the above “sampling” estimator can also be investigated using the same tools. Instead of sampling according to , one might consider taking the expectation, leading to
This estimate can be studied in a similar way as before and the two above results are still valid with small changes. In particular Proposition 2 holds true with smaller variance term as, by Jensen’s inequality, . This alternative requires more computing time (when measured in terms of evaluation of ) and is less appealing for stochastic gradient descent algorithm or in semiparametric estimation problems, as discussed in Section 7. Estimators similar to have been studied in average treatment effects literature (Rosenbaum, 1995; Abadie and Imbens, 2006); see Section 6 for precise discussion.
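The difference between the "sampling" and the "averaging" estimators can be illustrated as follows. This is a sketch with hypothetical function names and toy one-dimensional data; it only shows that averaging evaluates $h$ on all $k$ neighbors of each target point, whereas sampling evaluates it once.

```python
import random

def nearest(x, source_x, k):
    """Indices of the k source covariates closest to x (one-dimensional, brute force)."""
    return sorted(range(len(source_x)), key=lambda i: abs(x - source_x[i]))[:k]

def sampling_estimate(h, target_x, source_x, source_y, k, rng):
    """One evaluation of h per target point, at a label drawn among the k neighbors."""
    total = 0.0
    for x in target_x:
        i = rng.choice(nearest(x, source_x, k))
        total += h(x, source_y[i])
    return total / len(target_x)

def averaging_estimate(h, target_x, source_x, source_y, k):
    """k evaluations of h per target point, averaged over the k neighbors."""
    total = 0.0
    for x in target_x:
        total += sum(h(x, source_y[i]) for i in nearest(x, source_x, k)) / k
    return total / len(target_x)

# Toy data (hypothetical values).
source_x = [0.0, 1.0, 2.0, 3.0]
source_y = [0.0, 1.0, 4.0, 9.0]
target_x = [0.9, 2.2]
h = lambda x, y: y

avg_k2 = averaging_estimate(h, target_x, source_x, source_y, k=2)
avg_k1 = averaging_estimate(h, target_x, source_x, source_y, k=1)
smp_k1 = sampling_estimate(h, target_x, source_x, source_y, k=1, rng=random.Random(0))
smp_k2 = sampling_estimate(h, target_x, source_x, source_y, k=2, rng=random.Random(0))
```

For $k = 1$ the two estimators return the same value, because the estimated conditional distribution then has a single atom.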
4.3 Conditional sampling error of the nearest neighbor estimate
Our aim in this section is to obtain a bound on (the second term in our decomposition (1)) when is the -nearest neighbor measure.
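Let $\mathcal{X} = \mathbb{R}^d$ and $\|\cdot\|$ be the Euclidean norm on $\mathbb{R}^d$. Denote the closed ball of radius $\tau \geq 0$ around $x$ by $B(x, \tau)$. For $n \geq 1$ and $k \in \{1, \ldots, n\}$, the $k$-nearest neighbor ($k$-NN for short) radius at $x$ is denoted by $\hat{\tau}_{n,k}(x)$ and defined as the smallest radius $\tau \geq 0$ such that the ball $B(x, \tau)$ contains at least $k$ points from the collection $X_1, \ldots, X_n$. That is,
$$\hat{\tau}_{n,k}(x) = \inf\Big\{\tau \geq 0 : \sum_{i=1}^{n} \mathbb{1}_{B(x, \tau)}(X_i) \geq k\Big\},$$
where $\mathbb{1}_{A}(x)$ is $1$ if $x \in A$ and $0$ elsewhere. The $k$-NN estimate of $P_{Y|X}(\cdot \mid x)$ is given by
$$\hat{P}_{Y|X}(dy \mid x) = \frac{1}{k} \sum_{i=1}^{n} \mathbb{1}_{B(x, \hat{\tau}_{n,k}(x))}(X_i)\, \delta_{Y_i}(dy),$$
where $\delta_{y}$ is the Dirac measure at $y$ defined by $\delta_{y}(A) = \mathbb{1}_{A}(y)$ for any measurable set $A$. Consequently, the $k$-NN estimate of the integral $Q h$ is then defined as
$$\hat{Q} h = \int \Big\{ \int h(x, y)\, \hat{P}_{Y|X}(dy \mid x) \Big\}\, Q_X(dx).$$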
To obtain some guarantee on the behavior of the nearest neighbors estimate, we consider the case in which covariates admit a density with respect to the Lebesgue measure. We will need in addition that the support is well shaped and that the density is lower bounded. These are standard regularity conditions to obtain some upper bound on the -NN radius.
(X1) The random variable admits a density with compact support .
(X2) There is and such that
where is the Lebesgue measure.
(X3) There is such that , for all .
To obtain our main result, on the estimation property of the -NN measure, we need some assumptions on the target measure .
(X4) The probability measure admits a bounded density with support . We will take large enough such that it will also be an upper bound of .
Two additional assumptions, different from the one before about , will be needed to deal with the function and the probability distribution of .
(H1) For any in ,
with .
(H2) There exists such that , where is the conditional variance of given .
In what follows, we give a control of the RMSE of . Let , the integer part of a real number and let for . Finally, we denote by the volume of the unit Euclidean ball in dimension $d$ for the Lebesgue measure.
We give an upper-bound for the RMSE with explicit constants with respect to the dimension . Additionally, we give a lower bound for the variance which has a standard parametric rate of convergence. The proof is given in Appendix C.1.
Proposition 3.
Suppose that Assumptions (X1), (X2), (X3), (X4), (H1), and (H2) are fulfilled. We have
where is a bias term (defined in the proof) that satisfies, for any ,
and is a variance term (defined in the proof) that satisfies, for any ,
For the lower bound to be true, it is assumed that the mapping does not depend on , i.e. and .
Notes.
(i) The two terms and correspond respectively to the bias term and the variance term. The upper bound obtained for the bias term is usual in -NN regression analysis. However, the upper and lower bound on the variance are particular to our framework as they show that the variance behaves as in usual parametric estimation. Consequently, our rates of convergence are sharper than the optimal rate of convergence for nonparametric estimation of Lipschitz functions. This can be explained by the fact that several -NN estimators are averaged to estimate , which is a standard expectation and not a conditional expectation.
(ii) Since the rate of convergence of the variance term does not depend on $k$, $k$ might be chosen according to the upper bound on the bias term, which gives $k = 1$. One can deduce the following convergence rates, depending on the dimension. For $d = 1$, we get the rate $n^{-1/2}$. For $d = 2$, the contributions of the bias and the variance terms coincide and we get the rate $n^{-1/2}$. For $d \geq 3$, the rate is $n^{-1/d}$.
For the global mean square error, which incorporates the marginal sampling error as well as the $k$-NN conditional sampling error, we give the following result in the optimal case $k = 1$. The proof can be found in Appendix D.
Theorem 1.
We next give a non-asymptotic control of when is a bounded function using Bernstein’s concentration inequality. This bound complements the bound for the MSE. However, for technical reasons, this high-probability bound requires that $k$ grows at least logarithmically with respect to $n$, in contrast to Proposition 3. In our numerical experiments, we will also include the case for comparison. The proof of the next result is given in Appendix C.2.
Notes.
(i) The proof needs a bound on which is given in (Portier, 2021, Lemma 4). For this we need that $k$ grows logarithmically w.r.t. $n$, as stated in the assumptions.
5 Applications to empirical risk minimization
In this section, we illustrate our results with some applications to empirical risk minimization. This is of particular interest in our context as the optimal linear model for the source distribution might be different from the ideal linear model for the target. In such a case, using covariate adaptation is always better as the source minimizer will be away from the target minimizer.
5.1 Mathematical background
Suppose that , where for each , is a measurable function from to . Set
Similarly, we define
with and is a copy of . Note that the expected value is taken for the unobserved label and not the generated label . We assume here that for a reference measure on , there exists for each a conditional density such that is jointly measurable and for any Borel set ,
One can then include the case of classification (), counts ( is the counting measure on the set of nonnegative integers) or regression ( is the Lebesgue measure on ).
5.2 Consistency of general empirical risk minimizers
We will use the following assumptions.
-
(A1)
There exist a measurable function and satisfying
and such that is a bounded random variable and .
-
(A2)
There exists a measurable function such that and
The above assumptions are satisfied, for instance, in the logistic regression framework with compact covariates. In this case, is a constant function and . Note also that could be different from as soon as is Lipschitz on .
In what follows, an assertion of the form as means that for any , there exists such that
Additionally, the assertion means that for any , there exist such that
The proof of the following result is in Appendix F.1.
5.3 Convergence rate for linear least-squares estimators
We now illustrate our results with an upper-bound on the excess risk for linear least-squares estimators in the misspecified case. Here, the targeted risk is given by
and any optimal linear rule should simply satisfy:
Note that is unique if the matrix is of full rank. The empirical risk is defined by
and , the empirical risk minimizer, is given by
The excess risk satisfies the following upper bound whose proof is given in Appendix F.1.
Theorem 3.
Notes.
The assumptions do not require the linear model for the ’s to be valid, i.e., one can consider cases where is not linear. Also, when the source data follows a non-linear model of the form where and are independent, our regularity assumptions mean that is Lipschitz on the compact set .
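The misspecified least-squares setting can be illustrated with a small sketch: a linear model is fitted to target covariates labeled by their nearest source neighbor, and compared with the fit obtained on the source sample alone. Everything here is an assumption made for illustration (the helper names `one_nn_label` and `fit_line`, the noiseless quadratic truth, and the deterministic grids standing in for random samples).

```python
def one_nn_label(x, source_x, source_y):
    """Label of the source point whose covariate is closest to x (brute force)."""
    i = min(range(len(source_x)), key=lambda j: abs(x - source_x[j]))
    return source_y[i]

def fit_line(xs, ys):
    """Least-squares intercept and slope for the model y ~ a + b * x (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: quadratic truth, so the linear model is misspecified.
source_x = [i / 10 for i in range(31)]          # source covariates on [0, 3]
source_y = [x * x for x in source_x]
target_x = [2.0 + i / 10 for i in range(11)]    # target covariates on [2, 3]

pseudo_y = [one_nn_label(x, source_x, source_y) for x in target_x]
a_src, b_src = fit_line(source_x, source_y)     # best line under the source design
a_tgt, b_tgt = fit_line(target_x, pseudo_y)     # best line under the target design
```

In this toy example the source fit has slope close to 3 while the adapted fit has slope close to 5, which shows how the population minimizer moves with the covariate distribution when the model is misspecified.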
6 Related work
A standard approach to covariate shift problems is to use some re-weighting in order to “transfer” the source distribution with density to the target distribution with density . This approach relies on the following type of estimates:
where ideally the function would take the form . Such a choice has the nice property that the expected value is equal to the targeted quantity . This however cannot be directly computed as and are unknown in practice. There are actually different ways to estimate , and our goal here is to distinguish between two leading approaches.
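The re-weighting estimate can be sketched as follows for the idealized case in which the density ratio is known. The setting is an assumption chosen for illustration: source covariates uniform on $[0, 1]$ (density $1$), target density $2x$ on $[0, 1]$, so that the ideal weight is $w(x) = 2x$; a deterministic midpoint grid stands in for a random source sample, and the name `weighted_mean` is hypothetical.

```python
def weighted_mean(h, source, w):
    """(1/n) * sum_i w(X_i) * h(X_i, Y_i) over the labeled source sample."""
    return sum(w(x) * h(x, y) for x, y in source) / len(source)

n = 1000
# Midpoint grid on [0, 1] standing in for a uniform source sample, with Y = X.
source = [((i + 0.5) / n, (i + 0.5) / n) for i in range(n)]

h = lambda x, y: y
ideal_weight = lambda x: 2.0 * x                # target density 2x over source density 1

target_mean = weighted_mean(h, source, ideal_weight)     # approximates E_Q[Y] = 2/3
source_mean = weighted_mean(h, source, lambda x: 1.0)    # approximates E_P[Y] = 1/2
```

In practice the ratio is unknown and has to be estimated, which is where the two approaches discussed next differ.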
Plug-in approach
The plug-in approach is when the weights are computed using two estimates and in place of and , respectively; i.e., simply use instead in the above formula, see for instance (Shimodaira, 2000; Sugiyama et al., 2007, 2008). Note that the selection of hyper-parameters for and is needed and the evaluation might be heavy in terms of computing time.
For the sake of clarity, we focus on a specific instance of covariate shift problem in which the target probability density is known and is the kernel density estimate (KDE), i.e., , where typically is a Gaussian density with mean and variance (a hyper-parameter to be tuned). Note that such a situation does not involve any changes for our sampling procedure whereas it is clearly advantageous for the weighted approach for which one unknown, , is now given. In this case, the analysis of can be carried out using the decomposition , with . The first term above is a sum of centered random variable which (provided some conditions) satisfies the so-called Lindeberg condition so that the central limit theorem implies that is asymptotically Gaussian. The second term above is more complicated and the analysis can be derived using results in Delyon and Portier (2016); Clémençon and Portier (2018). Those results assert (under some conditions) that (in case is Lipschitz ). As a consequence, we obtain, optimizing over , that . This is easily compared to our bound, when is known, , which is smaller than the one given before.
Direct weight estimation
Huang et al. (2006) proposed Kernel Mean Matching (KMM) for estimating the ratios of the probability density functions of the source and the target distribution. They used the estimated ratios for weighting the source sample. Gretton et al. (2008) further studied this method theoretically and empirically. Sugiyama et al. (2007, 2008) proposed a method that estimates the ratios as a function by minimizing the Kullback-Leibler divergence between the source density function multiplied by the ratio function and the target density function. The estimated function can predict ratios even outside of the source sample, which enables cross-validation for hyper-parameter tuning. Kanamori et al. (2009) proposed constrained and unconstrained least squares methods for estimating the ratio function called Least-Squares Importance Fitting (LSIF) and unconstrained LSIF (uLSIF). Yamada et al. (2013) developed a variant of uLSIF called Relative uLSIF (RuLSIF), which replaces the denominator of the ratio with a convex mixture of the source and the target density functions to circumvent issues caused by near-zero denominators. Zhang et al. (2021) proposed a covariate shift adaptation method that directly minimizes an upper bound of the target risk in order to avoid estimation of weights. The method shows great empirical performance, but it does not exactly minimize the target risk and hence the minimizer converges to a biased solution.
Connection to treatment effect estimation
One of the quantities of great interest in treatment effect estimation is the average treatment effect on the treated (ATT), , where is a treatment assignment variable, and are potential outcomes corresponding to the treatment and . [Footnote 2: A common scenario is that we have a treated group (represented by treatment 1) and a non-treated, or control, group (represented by treatment 0).] Suppose that we wish to estimate the ATT using i.i.d. observations of and its outcome together with covariates , . Under the standard assumptions (see e.g., Hernan and Robins (2023)) including the conditional exchangeability , the positivity , and the consistency , for each , the ATT equals the difference between
(2) |
and
(3) |
where is the density ratio defined such that . We can easily estimate the first term (Eq. (2)) by the conditional sample average , where . Estimating the second term (Eq. (3)) is more involved. The sample average with the condition , , where , would be biased to , but the bias is only due to the change in the conditional distributions of given and given quantified by , similarly to the covariate shift (see Eq. (3)). One way to correct the bias is to use an estimate of the ratio for the weighted average , similarly to the reweighting approach to covariate shift adaptation, leading to the following estimate:
Another popular approach is nearest neighbor matching (Abadie and Imbens, 2006). See also Rosenbaum (1995) for a broad introduction to matching problems for evaluating treatment effects. In Abadie and Imbens (2006), the ATT is estimated by
where is the average of ’s over the first NNs of in the untreated group. The estimator takes the form
where is the number of times observation is used as a match, i.e., the number of times observation is among the NNs of variables in the treated group. Note that coincides with if is defined as . Recently, Lin et al. (2023) showed that the latter quantity can be indeed interpreted as an estimate of the density ratio but its consistency requires while Abadie and Imbens (2006) considered a fixed value of , as in our problem. To see an analogy with our method, one can consider the case in which does not depend on , i.e. for some function . Using the notation from the present paper ( and indicate the target and the source domain, respectively, with and ) the second term of above generalizes to the form
(4) |
with . The previous estimate corresponds to the one introduced in the notes following Proposition 2. On the other hand, our estimator applied to this case is given by
(5) |
Both estimators are different when but they coincide as soon as . In fact, has one single atom when , so that sampling from it and evaluating the average are the same. Here are a few remarks.
• Our theoretical analysis is rather different from that of Abadie and Imbens (2006). Since they rely on the expression in the left side of Eq. (4), it is unclear whether or not they can handle the case when depends on (required for prediction purposes). In contrast, our approach is based on the decomposition given in Section 4.1, with sampling error and estimation error, leveraging as a centering term. Our results are more general because they include the case when depends on , and we can also deal with both estimates (4) and (5) at the same time, as mentioned in the notes following Proposition 2. Moreover, Proposition 3 implies a lower bound for Eq. (4), and we believe this result to be new in the treatment effect literature.
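The link between nearest neighbor matching and re-weighting can be checked numerically: averaging labels over the neighbors of each target point gives the same value as a weighted sum over source points, with weights proportional to the number of times each source point is used as a match. The sketch below uses hypothetical names and toy data, and the normalization of the weights is an assumption made for this illustration.

```python
def nn_indices(x, source_x, k):
    """Indices of the k source covariates closest to x (one-dimensional, brute force)."""
    return sorted(range(len(source_x)), key=lambda i: abs(x - source_x[i]))[:k]

def match_counts(target_x, source_x, k):
    """Number of times each source point is among the k nearest neighbors of a target point."""
    counts = [0] * len(source_x)
    for x in target_x:
        for i in nn_indices(x, source_x, k):
            counts[i] += 1
    return counts

# Toy data (hypothetical values).
source_x = [0.0, 1.0, 2.0, 3.0]
source_y = [0.0, 1.0, 4.0, 9.0]
target_x = [0.9, 2.2, 2.9]
k, m = 2, len(target_x)

counts = match_counts(target_x, source_x, k)

# Matching form: average the neighbor labels for each target point, then over target points.
matching_form = sum(
    sum(source_y[i] for i in nn_indices(x, source_x, k)) / k for x in target_x
) / m

# Weighted form: each source label is weighted by its match count divided by k.
weighted_form = sum(c / k * y for c, y in zip(counts, source_y)) / m
```

Both forms evaluate to the same number here, which is the sense in which the match counts act as (unnormalized) density ratio estimates.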
Other references
The idea of nonparametric sampling is a standard one in the field of texture synthesis. In particular, the choice of 1-NN resampling was often used as a fast method to generate new textures from a small sample. See Truquet (2011) for a literature review in this context. Our conditional sampling framework bears resemblance with traditional bootstrap sampling as there is random generation according to some estimated distribution. In contrast, the original bootstrap method is usually made up using draws from the standard empirical measure . Here another distribution, , has been used to generate new samples. Moreover, our goal is totally different here. While the bootstrap technique was initially introduced for making inference, here the goal is to estimate an unknown quantity which appears in many machine learning tasks. Kpotufe and Martinet (2021) theoretically study covariate shift adaptation under the assumption that we have access to a labeled sample both from the source and the target distribution. Although they consider a -nearest-neighbor-based method, it is essentially different from ours since they perform the -NN method on the union of the source and the target sample. Lee (2013) proposed pseudo-labeling unlabeled data in the context of semi-supervised learning. Wang (2023) proposed a hyper-parameter selection method for kernel ridge regression under covariate shift using pseudo-labeling. The author focuses on model selection in regression problems while we study the mean estimation that can be applied to a wider range of supervised learning problems.
7 Extensions
Several ways to extend our method beyond the mean estimation problem are considered in this section.
Heterogeneity in target distributions
The case where the target covariates distribution changes across the data might be of interest if one wishes to aggregate several pieces of target data whose covariates distributions are not necessarily the same. This might occur when the target data is obtained by gathering individuals from different countries, and consequently, the distributions are not the same anymore or when the time between the measurements has caused some changes in the distribution.
While such heterogeneity in the target data might seem more complicated at first glance, it can actually be examined using a similar decomposition and the same tools as those used to obtain the non-asymptotic bound in Theorem 1. More formally, the target distribution is here with . For each and , let be a collection of random variables conditionally independent given and such that for each , with . The quantity of interest and the proposed estimator are therefore slightly different from before, given by, respectively,
The decomposition is
The non-asymptotic analysis of the sampling error is similar to before, as the Bernstein inequality is tailored to non-identically distributed variables. We obtain the same rate as before by simply requiring a bound on the variance of each random variable. The other term concerning the conditional distribution can be analyzed by writing
and therefore we can directly apply Proposition 3 (given the assumptions are satisfied for each uniformly). We finally obtain the rate , similar to the one obtained before.
Stochastic gradient descent
Our sampling approach can be easily combined with the well-known stochastic gradient descent algorithm (and more generally with stochastic approximation), where only a small part of the data is used at each step to update the estimator. As a consequence, only a small number of operations is required at each iteration (in contrast with gradient-based optimization).
To illustrate this idea, consider the empirical risk minimization problem described in Section 5 where one is interested in solving where is differentiable. Suppose that source samples have been obtained, making the conditional distribution available for sampling new points. Then the algorithm at step might proceed by first generating and then . This means finding the nearest neighbor to among the source data, which requires only operations using the $k$-d tree. Once this is done, the update is simply
As a result, each iteration of the above procedure is similar to standard stochastic gradient descent, the only difference being the additional nearest neighbor search. We stress that this contrasts with the re-weighting approach, for which a new sample, say , would require evaluating and would therefore need to compute all distances between , and the new .
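The iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sgd_1nn_csa`, the user-supplied `grad_loss` callable, and the decaying step-size schedule are hypothetical choices made for the example. Each step draws one target covariate, labels it with the response of its nearest source neighbor (found with SciPy's `cKDTree`), and applies a plain stochastic gradient update.

```python
import numpy as np
from scipy.spatial import cKDTree


def sgd_1nn_csa(x_source, y_source, x_target, grad_loss, theta0,
                n_steps=1000, step_size=0.1, seed=0):
    """SGD on target covariates labeled by their nearest source neighbor.

    grad_loss(theta, x, y) must return the gradient of the loss in theta
    for a single pair (x, y). Names and schedule are illustrative only.
    """
    rng = np.random.default_rng(seed)
    tree = cKDTree(x_source)  # built once, reused at every step
    theta = np.array(theta0, dtype=float)
    for t in range(1, n_steps + 1):
        # Draw one target covariate uniformly from the target sample.
        x = x_target[rng.integers(len(x_target))]
        # Label it with the response of its nearest source neighbor.
        _, j = tree.query(x, k=1)
        y = y_source[j]
        # Standard SGD update; the 1/sqrt(t) decay is an assumption.
        theta -= (step_size / np.sqrt(t)) * grad_loss(theta, x, y)
    return theta
```

The only extra cost compared with standard stochastic gradient descent is the single tree query per step; the tree itself is built once before the loop.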
Semiparametric estimation
Simulating the labels to obtain a new sample is also convenient in semiparametric problems, where quantities of interest often involve additional estimated parameters. Typical semiparametric problems involve expectations of functions that are indexed by an unknown parameter, , and is estimated from the data using some transformation of the sample. In such a situation, while it is unclear how to estimate using reweighting without more information on , one can directly use our sampling approach by introducing where . This yields a semiparametric estimate with covariate shift adaptation. See Van der Vaart (2000), Chapters and , for more details and examples in parametric or semiparametric estimation.
8 Experiments
The main purpose of the experiments is to compare our $k$-NN-CSA approach with several state-of-the-art competitors in several settings, ranging from mean estimation to empirical risk minimization, with synthetic and real-world data.
We consider the following instances of our proposed method.
- 1-NN-CSA:
- $\log n$-NN-CSA: the same as above but with .
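As a rough illustration of how such estimators can be computed, the sketch below labels each target covariate with a response taken from its nearest source neighbors and averages a function of the resulting pairs. It is a minimal sketch under assumptions: the function name `knn_csa_mean` and the argument `h` are hypothetical, and for `k > 1` drawing one neighbor uniformly among the `k` nearest is our reading of the sampling step rather than a confirmed detail of the authors' code.

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_csa_mean(x_source, y_source, x_target, h, k=1, rng=None):
    """Estimate the target mean of h(X, Y) with k-NN pseudo-labels.

    x_source: (n, d) source covariates, y_source: (n,) source responses,
    x_target: (m, d) unlabeled target covariates, h: vectorized function.
    """
    rng = np.random.default_rng(rng)
    tree = cKDTree(x_source)
    # Indices of the k nearest source points for every target point.
    _, idx = tree.query(x_target, k=k)
    if k > 1:
        # Assumption: pick one of the k neighbors uniformly at random.
        pick = rng.integers(0, k, size=len(x_target))
        idx = idx[np.arange(len(x_target)), pick]
    y_pseudo = y_source[idx]
    # Average h over target covariates paired with pseudo-labels.
    return float(np.mean(h(x_target, y_pseudo)))
```

With `k=1` no randomness is involved, so the estimate is a deterministic function of the two samples.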
We use the cKDTree class (Archibald, 2008) from the scipy.spatial module of SciPy (Virtanen et al., 2020) for nearest neighbor search in our methods. We compare them with the following existing covariate-shift adaptation methods.
- KDE-R-W (KDE-Ratio-Weighting): the weighting method using the ratio of the Kernel Density Estimates (KDEs) of and (see Section 6).
- KMM-W (KMM-Weighting):
- KLIEP-W (KLIEP-Weighting):
- KLIEP100-W:
- RuLSIF-W (RuLSIF-Weighting): the weighting method using estimated by Relative unconstrained Least-Squares Importance Fitting (RuLSIF) (Yamada et al., 2013), where is a hyper-parameter. We use the default value . As a model of the weight function, the Gaussian basis functions centered at the sample points are used.
- RuLSIF100-W: the same as RuLSIF-W but with 100 randomly subsampled basis functions (Yamada et al., 2013) for reducing the time and space complexities.
See Section 6 for more explanations of those methods. For KMM-W, KLIEP-W, and RuLSIF-W, we use the implementations from the Awesome Domain Adaptation Python Toolbox (ADAPT) (de Mathelin et al., 2021). All computations were performed on the Grid’5000 cluster (Balouek et al., 2013). For the methods using Gaussian basis functions (KLIEP-W, KLIEP100-W, RuLSIF-W, RuLSIF100-W), we use 5-fold cross-validation to choose the Gaussian bandwidth from . KMM-W does not offer a way to do cross-validation, and we fixed to . More details are in the supplementary material.
Furthermore, we also report the results for the following baseline method and ideal method.
- NoCorrection: the method that takes the average using only the source sample , ignoring the target sample.
- OracleY: the result of taking the average using a sample . Note that are not available in the practical scenarios of interest and are hidden from the other methods.
We conduct experiments in three setups, detailed below, with different sample sizes () and data dimensionalities : . Each experiment is repeated 50 times with different random seeds.
Setup of Experiment E1 (mean estimation with synthetic data):
The task here is to estimate under the following setup. We define by , as the uniform distribution over , as that over , and as the normal distribution with mean and variance . Figure 7(a) in Appendix G shows an illustration of the setup. In this setup, we have while . Because of this difference, covariate shift adaptation is essential for correctly estimating .
Comparison of estimation errors for Experiment E1:
The results are presented in Figure 1. First, the errors of NoCorrection do not decrease as the sample sizes increase, ending up with large errors in all cases because of the bias due to the covariate shift. The other methods, which perform covariate-shift adaptation, always had smaller errors than this baseline. Excluding OracleY, an ideal method unavailable in practice, KLIEP100-W, KMM-W, 1-NN-CSA, and $\log n$-NN-CSA were among the best for smaller dimensionalities (Figures 1(a) and 1(b)). For the larger dimensionalities , KMM-W and 1-NN-CSA outperformed the other methods. In particular, 1-NN-CSA performed outstandingly in many cases except and , for which KMM-W was even better. The errors of most methods roughly follow power laws: in logarithmic scales, the slope of a line corresponds to the power of the convergence rate (steeper is better). 1-NN-CSA and $\log n$-NN-CSA seem to have the steepest slopes for , although comparison is difficult for the lower dimensionalities.
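The power-law reading of such error curves can be made concrete with a least-squares fit in log-log scale. The helper below is a generic sketch, not part of the paper's code: the name `fit_power_law` is hypothetical, and it simply fits a straight line to the logarithms of the sample sizes and errors, returning the slope (the empirical exponent) and the multiplicative constant.

```python
import numpy as np


def fit_power_law(sample_sizes, errors):
    """Fit errors ~ C * n**slope by least squares in log-log scale."""
    log_n = np.log(np.asarray(sample_sizes, dtype=float))
    log_err = np.log(np.asarray(errors, dtype=float))
    # A straight line in log-log scale: log_err = slope * log_n + intercept.
    slope, intercept = np.polyfit(log_n, log_err, deg=1)
    return slope, np.exp(intercept)
```

A more negative slope indicates a faster decay of the error with the sample size, which is what "steeper is better" refers to.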
Comparison of running times in Experiment E1:
Figure 2 compares the running times. 1-NN-CSA and $\log n$-NN-CSA were much faster than the other methods in all cases except for . Their advantage is most pronounced for larger sample sizes. For instance, they were at least times faster than the other methods for (Figure 2(a)).
![Refer to caption](x1.png)
![Refer to caption](x2.png)
![Refer to caption](x3.png)
![Refer to caption](x4.png)
![Refer to caption](x5.png)
![Refer to caption](x6.png)
![Refer to caption](x7.png)
![Refer to caption](x8.png)
Setup of Experiment E2 (risk estimation with synthetic data):
In this experiment, we compare the methods in the context of risk estimation of a fixed function . Setting , where is the first coordinate of , we estimate the expected loss (i.e., risk) of with the square loss in predicting the response when follows . In other words, we set as , and the goal is to estimate the risk . We use the uniform distribution over for and that over for . The conditional distribution is the normal distribution with mean and variance for any . Under this setup, the function performs poorly on the support of and should incur a large risk. See Figure 7(b) for an illustration of the setup. In this setup, the risks under and largely differ because fits well in a half of the support of but not in that of .
Comparison of estimation errors for Experiment E2:
We present the estimation errors for Experiment E2 in Figure 3. KMM-W, 1-NN-CSA, and $\log n$-NN-CSA gave similar results, almost matching those of OracleY, while KMM-W and 1-NN-CSA were advantageous for , and 1-NN-CSA outperformed the other methods for . We can notice that KDE-R-W, KMM-W, RuLSIF-W, and RuLSIF100-W did not always improve on the errors of NoCorrection (Figures 1(c) and 1(d)). Some methods, such as KMM-W and KLIEP-W, performed very well in some cases but poorly in others. In contrast, 1-NN-CSA showed stable and often the best performance in these experiments.
![Refer to caption](x9.png)
![Refer to caption](x10.png)
![Refer to caption](x11.png)
![Refer to caption](x12.png)
![Refer to caption](x13.png)
![Refer to caption](x14.png)
![Refer to caption](x15.png)
![Refer to caption](x16.png)
Comparison of running times in Experiment E2:
Setup of Experiment E3 (linear regression with synthetic data):
Next, we present linear regression experiments. Using samples from the same source and target distributions as in Experiment E2, we perform ordinary least squares after covariate shift adaptation. More precisely, we aim to optimize the parameters of the model so that the Mean Squared Error (MSE) in the target domain is minimized. To do so, we minimize the MSE estimated by each covariate shift adaptation method.
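For the nearest-neighbor approach, minimizing the estimated target MSE amounts to running least squares on the target covariates paired with pseudo-labels. The sketch below illustrates this under assumptions: the function name `ols_1nn_csa` is hypothetical, and the linear model with an intercept is an illustrative choice since the exact model is not restated here.

```python
import numpy as np
from scipy.spatial import cKDTree


def ols_1nn_csa(x_source, y_source, x_target):
    """Least squares on target covariates with 1-NN pseudo-labels."""
    tree = cKDTree(x_source)
    # Nearest source neighbor of every target covariate.
    _, idx = tree.query(x_target, k=1)
    y_pseudo = y_source[idx]
    # Design matrix with an intercept column (an assumed model form).
    design = np.column_stack([np.ones(len(x_target)), x_target])
    coef, *_ = np.linalg.lstsq(design, y_pseudo, rcond=None)
    return coef  # intercept first, then one slope per covariate
```

When the regression function is nonlinear, the fitted line depends on where the covariates concentrate, which is why fitting on pseudo-labeled target points can differ markedly from fitting on the source sample alone.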
Comparison of estimation errors for Experiment E3:
The results are summarized in Figure 5. (Footnote 3: We plot the MSEs after subtracting a constant, to better present the curves in the region close to the minimum population MSE while keeping values positive.) KMM-W performed better than any other method for the higher dimensions and the small-to-moderate sample sizes , with 1-NN-CSA being the second best. For , 1-NN-CSA showed performance better than or comparable to KMM-W.
![Refer to caption](x17.png)
![Refer to caption](x18.png)
![Refer to caption](x19.png)
![Refer to caption](x20.png)
Comparison of running times in Experiment E3:
As in Experiments E1–E2, 1-NN-CSA and $\log n$-NN-CSA finished their computations faster than the other adaptation methods by large margins (Figure 6).
![Refer to caption](x21.png)
![Refer to caption](x22.png)
![Refer to caption](x23.png)
![Refer to caption](x24.png)
In Experiments E1–E3, the proposed methods, 1-NN-CSA and $\log n$-NN-CSA, finished their computations much faster than the other adaptation methods without compromising statistical performance. $\log n$-NN-CSA showed no advantage in accuracy while incurring higher computational costs. We can conclude that 1-NN-CSA is preferable to $\log n$-NN-CSA. One reason that we were not able to conduct experiments with sample sizes larger than is that the existing adaptation methods have overly demanding computational requirements. For instance, the running time of RuLSIF-W in Figure 6(c) grows about 100 times as the sample size increases by 10 times, taking more than seconds for . For , we would need at least seconds, that is hours of compute for a single run. In contrast, given that the time complexity of 1-NN-CSA is and that its running time is less than one second for , we can estimate its running time for as seconds. 1-NN-CSA would remain feasible in applications of even larger scale.
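The kind of back-of-the-envelope extrapolation used in this paragraph can be written out explicitly. The helpers below are hypothetical and use no measured values from the experiments: a 100-fold growth in time for a 10-fold growth in sample size corresponds to a power law with exponent 2, and the quasi-linear case assumes a running time proportional to n log n, which is our assumption for illustration.

```python
import math


def extrapolate_power(t_ref, n_ref, n_new, exponent):
    """Extrapolate a running time assuming t(n) proportional to n**exponent."""
    return t_ref * (n_new / n_ref) ** exponent


def extrapolate_quasilinear(t_ref, n_ref, n_new):
    """Extrapolate assuming t(n) proportional to n * log(n) (an assumption)."""
    return t_ref * (n_new * math.log(n_new)) / (n_ref * math.log(n_ref))
```

Under these assumptions, increasing the sample size by a factor of 100 multiplies a quadratic-time method's running time by 10,000, but a quasi-linear method's running time only by 100 times the ratio of the logarithms.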
The previous methods construct the distance matrix between pairs of data points, which takes running time and memory quadratic in the sample size. Additionally, RuLSIF-W computes the inverse of the distance matrix, taking cubic running time. KMM-W and KLIEP-W solve convex optimization problems with iterative procedures, for which the implementations from de Mathelin et al. (2021) use stopping criteria based on objective function values. This resulted in good accuracy and milder growth in running time in our experiments. However, tuning the solvers can be involved in practice. In contrast, $k$-NN-CSA does not have such subtle issues around optimization solvers: we only have to perform nearest neighbor search.
In all cases, we can observe that 1-NN-CSA showed a clear power law, with nearly straight lines in logarithmic scales. This is a significant advantage for predicting the returns of investing in a larger sample size.
Experiment E4 (linear regression and logistic regression with benchmark datasets):
We use the regression benchmark datasets diabetes (available at https://archive.ics.uci.edu/ml/index.php) and california (Pace and Barry, 1997; available at https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html), and the classification datasets twonorm (Breiman, 1996; available at https://www.cs.utoronto.ca/~delve/data/datasets.html) and breast_cancer. We apply ridge regression and logistic regression, respectively. The evaluation metric is the mean squared error for the regression tasks and the classification accuracy for the classification tasks. We synthetically introduce covariate shift by subsampling test data. See Appendix H for more details.
Remark:
For a fair comparison, the benchmark experiments presented in this paper follow the standard protocol used in previous research (Gretton et al., 2008; Kanamori et al., 2009; Yamada et al., 2013; Sugiyama et al., 2007, 2008): we apply biased resampling to synthetically simulate a target dataset under covariate shift. It is thus important to note that they are not completely real-world data. Nevertheless, this ensures that the methods are tested in isolation from other types of distribution shifts while using real data for the source covariate distribution as well as the conditional distributions.
Table 1: MSEs (regression) and accuracies (classification) obtained in Experiment E4.
Method | diabetes (MSE) | california (MSE) | breast_cancer (accuracy) | twonorm (accuracy) |
---|---|---|---|---|
1NN-CSA | 3470 (35) | 0.146 (0.001) | 0.9633 (0.002) | 0.9327 (0.002) |
lognNN-CSA | 3605 (40) | 0.150 (0.001) | 0.9595 (0.002) | 0.9293 (0.002) |
KDE-R-W | 3673 (52) | 3.864 (1.067) | 0.9596 (0.002) | 0.5260 (0.009) |
KMM-W | 3831 (60) | 3.702 (1.160) | 0.9594 (0.002) | 0.9583 (0.001) |
KLIEP-W | 3221 (31) | 2.896 (0.798) | 0.9648 (0.002) | 0.9482 (0.001) |
KLIEP100-W | 3223 (31) | 3.034 (0.843) | 0.9648 (0.002) | 0.9480 (0.001) |
RuLSIF-W | 3235 (31) | 3.039 (0.843) | 0.7794 (0.015) | 0.9512 (0.001) |
RuLSIF100-W | 3238 (31) | 3.045 (0.844) | 0.7794 (0.015) | 0.9539 (0.001) |
Table 2: Times spent for adaptation in Experiment E4.
Method | diabetes | california | breast_cancer | twonorm |
---|---|---|---|---|
1NN-CSA | 0.0015 (0.0000) | 0.0084 (0.0001) | 0.0036 (0.0000) | 0.0051 (0.0000) |
lognNN-CSA | 0.0016 (0.0000) | 0.0128 (0.0001) | 0.0037 (0.0000) | 0.0052 (0.0000) |
KDE-R-W | 0.0078 (0.0000) | 0.2121 (0.0008) | 0.0117 (0.0000) | 0.0124 (0.0000) |
KMM-W | 0.0373 (0.0015) | 0.4067 (0.0038) | 0.0542 (0.0014) | 0.0220 (0.0006) |
KLIEP-W | 7.602 (0.051) | 29.98 (0.34) | 8.67 (0.07) | 8.86 (0.16) |
KLIEP100-W | 7.501 (0.045) | 16.91 (0.07) | 8.68 (0.07) | 8.26 (0.10) |
RuLSIF-W | 0.0575 (0.0014) | 1.686 (0.011) | 0.0529 (0.0020) | 0.2014 (0.0016) |
RuLSIF100-W | 0.0401 (0.0007) | 0.1237 (0.0004) | 0.0454 (0.0014) | 0.0391 (0.0002) |
Results for Experiment E4:
Table 1 shows the obtained MSEs and classification accuracies. 1-NN-CSA and $\log n$-NN-CSA gave the best performance for california and performance comparable to the best for breast_cancer. For the other datasets, different methods performed best depending on the dataset. On the other hand, in terms of running time, 1-NN-CSA was consistently faster than the previous methods (Table 2).
Our experiments show that the proposed method is almost always faster than the previous methods and gives good accuracy in many cases, even though it is not always the best. 1-NN-CSA is highly recommended as an off-the-shelf method applicable even at larger scales, although the previous methods such as KMM-W, KLIEP-W, and RuLSIF-W should not be neglected, as far as the computational budget allows. The times spent for adaptation are summarized in Table 2, showing that the proposed methods 1-NN-CSA and $\log n$-NN-CSA are much faster than the other methods.
9 Conclusion
We proposed a $k$-NN-based covariate shift adaptation method. We provided error bounds, which suggest that setting $k = 1$ is among the best choices. This resulted in a scalable non-parametric method with no hyper-parameter. For future research directions, one could complete our results for parametric inference on the target domain, in particular by finding the asymptotic distribution of -estimators. For the average treatment effect, Abadie and Imbens (2006) derived asymptotic normality of their estimator, and it could be interesting to get a similar result in our context. Investigating non-parametric estimation on the target domain could also be an interesting direction. However, non-parametric estimators computed with the source sample can already be optimal when the ratio of densities is bounded; see for instance Ma et al. (2023) in the reproducing kernel Hilbert space framework. It could then be interesting to extend our result to cases with unbounded density ratios. Finally, it may be interesting to extend our approach with approximate nearest neighbor methods for further scalability.
Acknowledgement
Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr). IY was supported by the Allocation d’Installation Scientifique (AIS) 2023 from Rennes Métropole.
References
- Abadie and Imbens (2006) Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74(1), 235–267.
- Aminian et al. (2022) Aminian, G., M. Abroshan, M. Mahdi Khalili, L. Toni, and M. Rodrigues (2022, 28–30 Mar). An information-theoretical approach to semi-supervised learning under covariate-shift. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, Volume 151 of Proceedings of Machine Learning Research, pp. 7433–7449. PMLR.
- Archibald (2008) Archibald, A. M. (2008). cKDTree.
- Balouek et al. (2013) Balouek, D., A. Carpen Amarie, G. Charrier, F. Desprez, E. Jeannot, E. Jeanvoine, A. Lèbre, D. Margery, N. Niclausse, L. Nussbaum, O. Richard, C. Pérez, F. Quesnel, C. Rohr, and L. Sarzyniec (2013). Adding virtualization capabilities to the Grid’5000 testbed. In I. I. Ivanov, M. van Sinderen, F. Leymann, and T. Shan (Eds.), Cloud Computing and Services Science, Volume 367 of Communications in Computer and Information Science, pp. 3–20. Springer International Publishing.
- Bentley (1975) Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517.
- Berrett et al. (2019) Berrett, T. B., R. J. Samworth, and M. Yuan (2019). Efficient multivariate entropy estimation via k-nearest neighbour distances. The Annals of Statistics 47(1), 288–318.
- Biau and Devroye (2015) Biau, G. and L. Devroye (2015). Lectures on the nearest neighbor method, Volume 246. Springer.
- Blanchet et al. (2024) Blanchet, J., H. Chen, Y. Lu, and L. Ying (2024). When can regression-adjusted control variate help? Rare events, Sobolev embedding and minimax optimality. Advances in Neural Information Processing Systems 36.
- Bogachev and Ruas (2007) Bogachev, V. I. and M. A. S. Ruas (2007). Measure theory, Volume 2. Springer Science & Business Media.
- Breiman (1996) Breiman, L. (1996). Bias, variance, and arcing classifiers.
- Cannings et al. (2020) Cannings, T. I., T. B. Berrett, and R. J. Samworth (2020). Local nearest neighbour classification with applications to semi-supervised learning. The Annals of Statistics 48(3), 1789–1814.
- Chen et al. (2022) Chen, L., M. Zaharia, and J. Y. Zou (2022). Estimating and explaining model performance when both covariates and labels shift. In Advances in Neural Information Processing Systems, Volume 35, pp. 11467–11479. Curran Associates, Inc.
- Clémençon and Portier (2018) Clémençon, S. and F. Portier (2018). Beating Monte Carlo integration: A nonasymptotic study of kernel smoothing methods. In International Conference on Artificial Intelligence and Statistics, pp. 548–556. PMLR.
- Dasgupta and Kpotufe (2014) Dasgupta, S. and S. Kpotufe (2014). Optimal rates for k-NN density and mode estimation. In Advances in Neural Information Processing Systems, Volume 27.
- de Mathelin et al. (2021) de Mathelin, A., M. Atiq, G. Richard, A. de la Concha, M. Yachouti, F. Deheeger, M. Mougeot, and N. Vayatis (2021). ADAPT : Awesome Domain Adaptation Python Toolbox. arXiv:2107.03049 [cs.LG].
- Delyon and Portier (2016) Delyon, B. and F. Portier (2016). Integral approximation by kernel smoothing. Bernoulli 22(4), 2177–2208.
- Devroye et al. (1994) Devroye, L., L. Györfi, A. Krzyżak, and G. Lugosi (1994). On the strong universal consistency of nearest neighbor regression function estimates. Ann. Statist. 22(3), 1371–1385.
- Devroye et al. (2013) Devroye, L., L. Györfi, and G. Lugosi (2013). A probabilistic theory of pattern recognition, Volume 31. Springer Science & Business Media.
- Dua and Graff (2017) Dua, D. and C. Graff (2017). UCI machine learning repository.
- Friedman et al. (1977) Friedman, J. H., J. L. Bentley, and R. A. Finkel (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS) 3(3), 209–226.
- Gadat et al. (2016) Gadat, S., T. Klein, and C. Marteau (2016). Classification in general finite dimensional spaces with the $k$-nearest neighbor rule. Ann. Statist. 44(3), 982–1009.
- Gretton et al. (2008) Gretton, A., A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf (2008, December). Covariate Shift by Kernel Mean Matching. In Dataset Shift in Machine Learning, pp. 131–160. The MIT Press.
- Györfi et al. (2006) Györfi, L., M. Kohler, A. Krzyzak, and H. Walk (2006). A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media.
- Hernan and Robins (2023) Hernan, M. A. and J. M. Robins (2023). Causal Inference: What If. Chapman & Hall/CRC Monographs on Statistics & Applied Probab. CRC Press.
- Huang et al. (2006) Huang, J., A. Gretton, K. Borgwardt, B. Schölkopf, and A. Smola (2006). Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, Volume 19. MIT Press.
- Jiang (2019) Jiang, H. (2019). Non-asymptotic uniform rates of consistency for $k$-NN regression. In AAAI Proceedings, Volume 33, pp. 3999–4006.
- Kanamori et al. (2009) Kanamori, T., S. Hido, and M. Sugiyama (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research 10(48), 1391–1445.
- Kpotufe and Martinet (2021) Kpotufe, S. and G. Martinet (2021). Marginal singularity and the benefits of labels in covariate-shift. The Annals of Statistics 49(6), 3299–3323.
- Le et al. (2013) Le, Q., T. Sarlós, A. Smola, et al. (2013). Fastfood—approximating kernel expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, Volume 28.
- Lee (2013) Lee, D.-H. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML.
- Leluc et al. (2023) Leluc, R., F. Portier, J. Segers, and A. Zhuman (2023). Speeding up Monte Carlo integration: Control neighbors for optimal convergence. To appear in Bernoulli. arXiv:2305.06151.
- Lin et al. (2023) Lin, Z., P. Ding, and F. Han (2023). Estimation based on nearest neighbor matching: from density ratio to average treatment effect. Econometrica 91(6), 2187–2217.
- Ma et al. (2023) Ma, C., R. Pathak, and M. J. Wainwright (2023). Optimally tackling covariate shift in RKHS-based nonparametric regression. The Annals of Statistics 51(2), 738–761.
- Pace and Barry (1997) Pace, R. K. and R. Barry (1997). Sparse spatial autoregressions. Statistics & Probability Letters 33(3), 291–297.
- Portier (2021) Portier, F. (2021). Nearest neighbor process: weak convergence and non-asymptotic bound. To appear in Bernoulli. ArXiv:2110.15083.
- Rosenbaum (1995) Rosenbaum, P. R. (1995). Observational Studies. Springer.
- Shimodaira (2000) Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90(2), 227–244.
- Sugiyama et al. (2007) Sugiyama, M., S. Nakajima, H. Kashima, P. v. Bünau, and M. Kawanabe (2007). Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, NIPS 2007, pp. 1433–1440.
- Sugiyama et al. (2008) Sugiyama, M., T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe (2008, December). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics 60(4), 699–746.
- Tropp (2012) Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12, 389–434.
- Truquet (2011) Truquet, L. (2011). On a nonparametric resampling scheme for Markov random fields. Electronic Journal of Statistics 5, 1503–1536.
- Van der Vaart (2000) Van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
- Virtanen et al. (2020) Virtanen, P., R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, 261–272.
- Wang (2023) Wang, K. (2023, March). Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift. arXiv:2302.10160 [cs, math, stat].
- Wendel (1948) Wendel, J. G. (1948). Note on the gamma function. The American Mathematical Monthly 55(9), 563.
- Williams and Seeger (2000) Williams, C. K. I. and M. Seeger (2000). Using the Nyström Method to Speed up Kernel Machines. In Advances in Neural Information Processing Systems, Volume 13 of NIPS 2000, pp. 661–667. MIT Press.
- Yamada et al. (2013) Yamada, M., T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama (2013, May). Relative Density-Ratio Estimation for Robust Distribution Comparison. Neural Computation 25(5), 1324–1370.
- Zhang et al. (2021) Zhang, T., I. Yamane, N. Lu, and M. Sugiyama (2021, June). A One-Step Approach to Covariate Shift Adaptation. SN Computer Science 2(4), 319.
Appendix A Preliminary results
The first preliminary result concerns the order of magnitude of , for which we obtain a lower bound and an upper bound.
Proof.
The same type of result can be obtained for as follows.
Proof.
The proof of the upper bound follows from the union bound and Lemma 1. For the lower bound, start by noting that for any events and , . Then the conclusion follows from Lemma 1.
∎
Based on the previous results, an upper and a lower bound are obtained on the moments of the nearest neighbor radius . A similar upper bound is stated as Lemma 3 in Leluc et al. (2023).
Lemma 3.
Proof.
We have the kth-order statistics of . Moreover, for any measurable and non-negative function ,
where is the kth-order statistics of a sample of uniform random variables and since ,
Note that the range of is and we use a constant in the definition of such that
If , we have and we still have .
Appendix B Proofs of the results on the marginal sampling error (Section 4.2)
B.1 Proof of Proposition 1
The proof relies on the Lindeberg central limit theorem as given in Proposition 2.27 in Van der Vaart (2000), applied conditionally on . We need to show the two properties:
where each convergence needs to happen with probability . Equivalently, using that is identically distributed according to , we need to show that
The first result is a direct consequence of the assumption. Fix . For all sufficiently large, we have , implying that
which converges to by assumption. Since is finite, one can choose large enough to make arbitrarily small.
B.2 Proof of Proposition 2
Set . We have and . Note that . Bernstein’s concentration inequality leads to
Then setting
we get
and then integrate both sides to obtain
which leads to the stated bound.
Appendix C Proofs of the results on the -NN conditional sampling error (Section 4.3)
Here, we give proofs of the results on the -NN conditional sampling error appearing in Section 4.3.
C.1 Proof of Proposition 3
We start with a useful bias-variance decomposition. Introduce
We have
Integrating with respect to , we obtain
(7) |
with
The term is a bias term and the term (which has mean ) is a variance term.
The proof is divided into steps. The first step takes care of bounding the bias term. The second step deals with the variance upper-bound. The third step is concerned with the variance lower bound.
The bias.
The variance upper-bound.
For the proof, we assume that . We have for each and ,
For the second case, we used (H2) and the Cauchy-Schwarz inequality. As a consequence, the variance is given by
with . Let and , the th order statistics of the sample . One can observe that
where is the th order statistic of the sample . Note that the two sigma fields generated respectively by and are independent. For one mapping , we first bound
The fourth inequality is due to the fact that when , the two balls and do not intersect. We then get using Lemma ,
This leads to the variance upper-bound.
The variance lower-bound.
C.2 Proof of Proposition 4
Define
The following Lemma (Portier, 2021, Lemma 4) controls the size of the -NN balls uniformly over all .
Lemma 4 (Portier (2021, Lemma 4)).
We now deal with the variance term of our estimator. The variance term of the nearest-neighbors estimator is given by , where
and Set . From our assumptions and Jensen’s inequality, we have
Applying Bernstein’s inequality for i.i.d. random variables (we recall that the are independent conditionally on the ), we get for ,
This leads to
Note that this upper-bound is not random and we get
(8) |
Setting
which is smaller than
we then get
where the last inequality is a consequence of (8) and Lemma 4. Moreover, from the proof of Theorem 3 and Lemma 4, the bias part can be dominated by with probability at least . This concludes the proof.
Appendix D Proof of Theorem 1
First, note that the boundedness of entails H2. Setting and , Proposition 3 guarantees that
for some only depending on the distribution of , and on . It remains to show that the same bound can be obtained for . Since,
It only remains to show that is bounded with respect to . The approach is similar to the control of the variance term for studied in the proof of Proposition 3. If is an upper-bound for , we have
The last upper bound is obtained from Lemma 3 using the fact that has the same probability distribution as . We deduce the result taking as the maximum between and .
Appendix E A corollary bounding the sampling error for -NN sampling
Corollary 1.
Proof.
We first use the result of Proposition 2. In particular, setting
we have
We then use the decomposition
From Proposition 4, we know that
with probability greater than and
with probability greater than . Collecting these three bounds, we easily obtain the conclusion of the second point of Corollary 1. ∎
Appendix F Proofs of the results on the empirical risk minimization (Section 5)
Here, we present proofs of the result on the application to empirical risk minimization.
F.1 Proof of Theorem 2.
From (A1), is integrable and is continuous over the compact set . As a consequence, weak consistency will follow from Theorem in Van der Vaart (2000) if we show that
(10) |
Pointwise convergence holds true from Assumptions (A1) and (A2), as each mapping satisfies Assumptions (H1) and (H2). One can then apply Theorem 1 and the Markov inequality to get, for any ,
We now prove uniform convergence. Let . One can cover the compact set with finitely many balls , . For , we have
Moreover, from Assumptions (A2) and Theorem 1 with the Markov inequality, we know that
in probability. We also have
Finally, one can use the bound
Given that is arbitrary, the above implies (10) and the weak consistency of follows. The second assertion about the excess risk then follows easily using that
F.2 Proof of Theorem 3.
Let and define
The proof first requires some analysis of the smallest eigenvalues of . From the matrix Chernoff inequality given in Tropp (2012), see Corollary and Remark , we have
where is defined so as to satisfy , with probability . Inverting the previous inequality, we obtain that with probability at least ,
and therefore as soon as , we have that . On the previous event, we have that
It follows that
We conclude using Theorem 1 with . Note that by definition of .
Appendix G Illustration for Experiments E1–E3
Illustrations of data used in Experiments E1–E3 can be found in Figure 7(b).
![Refer to caption](x25.png)
![Refer to caption](x26.png)
Appendix H Details of the benchmark data experiments
We use the following datasets.
- california: Regression dataset called “California Housing”, available from https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html.
- diabetes: Regression dataset available from https://archive.ics.uci.edu/ml/index.php (Dua and Graff, 2017).
- breast cancer: Classification dataset available from https://archive.ics.uci.edu/ml/index.php (Dua and Graff, 2017).
- twonorm: Classification dataset available from https://www.cs.utoronto.ca/~delve/data/datasets.html.
Data splitting and sampling bias simulation
We split the original data into a training set and a test set, and simulate covariate shift by rejection sampling from the test set, with a rejection probability determined by the value of a covariate. For california, twonorm, and breast cancer, we follow the procedure of Sugiyama et al. (2007): we include each target data point in the target set with probability or reject it otherwise, where is the -th attribute of . For diabetes, we use a different biasing procedure because the technique of Sugiyama et al. (2007) rejects too many data points to perform our experiment on this dataset. We instead follow an example from the ADAPT package (de Mathelin et al., 2021; https://adapt-python.github.io/adapt/examples/Sample_bias_example.html): for each data point , we accept it with probability proportional to , where is the age attribute of , and reject (i.e., exclude) it otherwise.
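The splitting and rejection-sampling step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used for the experiments: the function names `split_and_bias` and `example_accept_prob`, the split fraction, and in particular the logistic acceptance rule are assumptions introduced here, and they do not reproduce the acceptance probabilities of Sugiyama et al. (2007) or of the ADAPT example.

```python
import numpy as np


def split_and_bias(X, y, accept_prob, test_fraction=0.5, seed=0):
    """Split the data, then thin the test part by rejection sampling.

    accept_prob is a callable mapping an (n, d) array to acceptance
    probabilities in [0, 1]. It is a placeholder (assumption): the exact
    rule used in the paper is not reproduced here.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    # Accept each test point independently with its own probability.
    p = np.clip(accept_prob(X[test_idx]), 0.0, 1.0)
    keep = rng.random(n_test) < p
    target_idx = test_idx[keep]

    source = (X[train_idx], y[train_idx])
    target = (X[target_idx], y[target_idx])
    return source, target


def example_accept_prob(X):
    # Hypothetical rule: favour points whose first covariate is large.
    return 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))


# Synthetic data, used only to exercise the sketch.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=2000)

(source_X, source_y), (target_X, target_y) = split_and_bias(
    X, y, example_accept_prob
)
```

Because the acceptance probability depends only on the covariates, the conditional distribution of the output given the input is unchanged, while the input distribution of the retained target points differs from that of the source set.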
Pre-processing
We use one-hot encoding for all categorical features. We center and normalize all the data using the mean and the dimension-wise standard deviation of the source set. We apply the same centering and normalization to the output variables for regression datasets.
After training and prediction, we post-process the output using the inverse operation. Table 3 shows basic information about the datasets after the bias-sampling and pre-processing.
| | california | twonorm | diabetes | breast cancer |
|---|---|---|---|---|
| Input dimension | 8 | 20 | 10 | 9 |
| Source sample size | 1000 | 100 | 150 | 200 |
| Target sample size | 1000 | 500 | 150 | 100 |
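The pre-processing and post-processing steps described in this appendix can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the helper names (`fit_standardizer`, `standardize`, `unstandardize`, `one_hot`) and the synthetic arrays are hypothetical, and the guard for constant columns is an addition not mentioned in the text. The point it illustrates is that the centering and scaling statistics come from the source set only and are then applied to both sets.

```python
import numpy as np


def fit_standardizer(A):
    """Return the mean and the dimension-wise standard deviation of A."""
    mean = A.mean(axis=0)
    std = A.std(axis=0)
    # Assumption: leave constant columns unscaled to avoid dividing by zero.
    std = np.where(std > 0, std, 1.0)
    return mean, std


def standardize(A, mean, std):
    return (A - mean) / std


def unstandardize(A, mean, std):
    # Inverse operation, applied to predictions after training.
    return A * std + mean


def one_hot(codes, n_categories):
    """One-hot encode integer category codes in {0, ..., n_categories - 1}."""
    return np.eye(n_categories)[codes]


# Synthetic data, used only to exercise the sketch.
rng = np.random.default_rng(0)
source_X = rng.normal(loc=3.0, scale=2.0, size=(200, 4))
target_X = rng.normal(loc=4.0, scale=2.0, size=(100, 4))
source_y = rng.normal(loc=10.0, scale=5.0, size=200)

# Statistics are computed on the source set only ...
x_mean, x_std = fit_standardizer(source_X)
y_mean, y_std = fit_standardizer(source_y)

# ... and applied to both the source and the target inputs.
source_Z = standardize(source_X, x_mean, x_std)
target_Z = standardize(target_X, x_mean, x_std)

# Outputs of regression datasets are scaled the same way and mapped back.
y_scaled = standardize(source_y, y_mean, y_std)
y_back = unstandardize(y_scaled, y_mean, y_std)
```

With this choice the standardized source inputs have zero mean and unit standard deviation in every dimension, whereas the standardized target inputs generally do not, which reflects the simulated shift.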