Beyond Discrepancy: A Closer Look at the Theory of Distribution Shift
Abstract
Many machine learning models appear to deploy effortlessly under distribution shift, and perform well on a target distribution that is considerably different from the training distribution. Yet, learning theory of distribution shift bounds performance on the target distribution as a function of the discrepancy between the source and target, rarely guaranteeing high target accuracy. Motivated by this gap, this work takes a closer look at the theory of distribution shift for a classifier from a source to a target distribution. Instead of relying on the discrepancy, we adopt an Invariant-Risk-Minimization (IRM)-like assumption connecting the distributions, and characterize conditions under which data from a source distribution is sufficient for accurate classification of the target. When these conditions are not met, we show when only unlabeled data from the target is sufficient, and when labeled target data is needed. In all cases, we provide rigorous theoretical guarantees in the large sample regime.
1 Introduction
Classical learning theory operates within the statistical learning framework, in which the training and testing datasets are assumed to be drawn from the same distribution [1]. However, this assumption is rarely met in practice, where models often succeed in ever-changing real-world environments that rarely match the precise conditions of their training data. This motivates the problem of distribution shift, in which a learner trains on a source distribution, with the goal of generalizing well over a distinct target distribution.
Thus far, the theory of distribution shift has consistently taken a worst-case approach, typically bounding generalization error in terms of some notion of discrepancy between the source and target distributions [2, 3, 4]. In cases where the source and target distributions are completely unrelated, or the source provides little information about the decision boundary of the target, discrepancy-based analyses correctly capture the difficulty of generalization. However, in practice, many large models appear to generalize effortlessly to target distributions with non-zero discrepancy.
Motivated by this gap, we take a closer look at the theory of distribution shift. In our setting, we consider a source distribution and a target distribution, with the goal of building an accurate classifier over the target, primarily via training samples from the source. To accomplish this, we first select a feature map under which the source and target distributions are similar. To make predictions, we then use $k$-nearest neighbors ($k$-NN) inside feature space over data sampled from the source.
Instead of a worst-case, discrepancy-based approach, we study generalization under an Invariant Risk Minimization (IRM)-like assumption which we term the “Statistical IRM Assumption”. IRM assumes the existence of a feature map and a classifier (over feature space) so that their composition achieves optimal accuracy over both source and target distributions [5]. We adapt this assumption to the nearest neighbors setting, and replace the existence of the classifier with the assumption that some feature map maps points from the target close to those from the source while retaining information sufficient for optimal prediction. This property allows us to leverage the fact that nearest neighbors enjoys strong generalization properties within the support of its training distribution.
One might hope that such a condition is sufficient for generalization from source data alone. Unfortunately, the existence of a suitable feature map does not imply its identifiability: there may be many poor feature maps in the class that appear suitable when only source data are available. We show (Theorem 4) that to guarantee generalization to the target using only source data, the source must be rich enough so that this cannot happen, i.e. that all maps that lead to optimal classification over the source distribution appropriately unify the source and target. We further exhibit a learning rule which leads to provable generalization to the target under this additional condition (Theorem 3).
We next consider the case where the learner has access to unlabeled target data in addition to labeled source data. Here, the target data provides crucial new information about which feature maps transform target data close to source data in feature space. We find that it is necessary and sufficient (Theorems 6 and 5) that all maps which both lead to optimal classification over the source, and map target data close to source data, further appropriately unify the source and target classification tasks.
When generalization is not possible with the addition of unlabeled target data, some labeled target data is needed. In this setting, the goal is to minimize the amount of labeled target data used: if large amounts of labeled target data are obtainable, we could simply use standard learning algorithms directly on the target data. We introduce a complexity measure on the embedding class, which we term the distance dimension, and use it to provide an upper bound on the amount of target data needed for generalization. In particular, we show that the natural procedure of minimizing the empirical risk (over the embedding class) on the target distribution of the source data-trained nearest neighbor classifier meets this upper bound.
1.1 An Illustrative Example
Figure 1 illustrates three learning problems in which we seek to generalize from the bold source data to the faded target data. In each case, the set of possible feature maps consists of the two projections of the plane onto the $x$- and $y$-axis, respectively. Here, the Statistical IRM Assumption manifests itself in the following way: the learner knows that perfect classification can be performed on the source and the target through the intermediate projection onto either the $x$- or $y$-axis. If the correct projection can be identified, a classifier generalizing to the target can be built by composing the correct projection with a classifier that accurately classifies source data in feature space.
The possibility of generalizing directly to the target is illustrated by Figure 1(a). In this case, using source data alone, it can be deduced that one of the two projections is not suitable, given that projecting under it significantly reduces accuracy over the source distribution. Thus, a classifier that allows for generalization to the target can be constructed through composition with the other projection.
By contrast, in panel (b), we see that both projections admit good classification over the source distribution. However, note that only one of them leads to good generalization over the target distribution, and that there is no way to pin down which embedding should be used with source data alone. That said, given access to unlabeled target data, the unsuitable projection can be eliminated from contention: it fails to uniformly map target data close to source data in feature space, another condition for correctly relating the source and target.
In panel (c), we see an instance in which no amount of source data and unlabeled target data will allow the learner to distinguish a winner between the two possible feature maps. In this case, labeled target data is needed. However, note that only a relatively small amount of labeled target data will be needed: all that is required is enough points to validate that a source-trained classifier arising from first projecting onto one axis has inferior performance to an analogous classifier where data are projected onto the other axis.
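The reasoning used for panel (a) can be made concrete with a minimal sketch. The data below are invented for illustration and are not the data of Figure 1; the names `proj_x`, `proj_y`, and `loo_1nn_accuracy` are hypothetical. Each candidate projection is scored by leave-one-out 1-NN accuracy on the source sample alone, so a projection that destroys source accuracy can be discarded without seeing any target data.

```python
# Hypothetical toy source sample: the label is determined by the first
# coordinate, while the second coordinate interleaves the two classes.
source = [((i, 2 * i), 0) for i in range(5)] + \
         [((10 + i, 2 * i + 1), 1) for i in range(5)]


def proj_x(p):
    """Projection onto the first coordinate."""
    return (p[0],)


def proj_y(p):
    """Projection onto the second coordinate."""
    return (p[1],)


def dist(u, v):
    """Euclidean distance between two tuples of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def loo_1nn_accuracy(data, phi):
    """Leave-one-out accuracy of 1-NN computed in the feature space of phi."""
    correct = 0
    for i, (x, y) in enumerate(data):
        others = [(dist(phi(x), phi(x2)), y2)
                  for j, (x2, y2) in enumerate(data) if j != i]
        # min returns the first closest point, so ties are broken by order.
        _, predicted = min(others, key=lambda pair: pair[0])
        correct += int(predicted == y)
    return correct / len(data)


acc_x = loo_1nn_accuracy(source, proj_x)
acc_y = loo_1nn_accuracy(source, proj_y)
```

On this sample `acc_x` is 1.0 and `acc_y` is 0.0, so only the first projection remains a candidate, mirroring the elimination argument above.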
1.2 Guarantees Beyond Discrepancy
The examples of Figure 1 also serve to showcase the potential for generalization guarantees in scenarios where worst-case analyses indicate that generalization to the target should be hard.
There are a few veins of the discrepancy literature [6]. One prominent vein considers bounding generalization error in terms of divergence measures between the source and target [2, 3, 4]. Another considers density ratios between target and source [7, 8]. In each case, the idea is that the degradation of prediction quality on the target will be small when the source and target distributions are not “too far” from each other.
Consider again the examples of Figure 1. A density ratio analysis indicates that generalization to the target in (a) is impossible from pure source data, and expensive in a transfer learning setting (c), as the source has no mass in large chunks of the support of the target. Divergence measures paint a similar picture. Thus, our assumption allows us to consider the possibility of cheap generalization to targets which may have a completely different support from the source in the original data space, but are related in some deeper manner. In such scenarios, discrepancy-based analyses may often be overly pessimistic.
2 Related Work
As alluded to above, the theory literature has primarily studied distribution shift through the lens of discrepancy [2, 3, 7, 4, 8, 9, 10]. In the transfer learning literature, in which one considers the possibility of updating a model trained on the source distribution with a relatively small amount of target data, divergence-based analyses have also been prevalent [11, 12, 13, 14]. A notable line of work attains strong guarantees in certain cases where the divergence between source and target distributions is large, but the decision boundaries on the source and target are similar, by homing in on the information about the decision boundary contained in the source distribution [6, 15].
Much of the attention towards the selection of feature representations has been devoted to the problem of “domain generalization”, wherein the learner tries to generalize to a large set of testing environments using samples from a smaller set of source environments, which provide training data [16, 17]. As mentioned above, the IRM literature hinges on the existence of a feature map, and a classifier whose composition achieves optimal accuracy over both source and target distributions [5]. Another line of work considers a different assumption, namely the existence of some suitable feature map for which the conditional distributions of transformed features given a label are shared across all environments [18, 19, 20].
An important part of this work considers the case where some relatively small amount of labeled target data is available to the learner, and can be exploited in the determination of a suitable feature map. In the theory literature, this setting is most closely explored by the work on “few-shot representation learning” [21, 22], where the goal is to use data on a set of source tasks to learn a low dimensional representation that connects tasks together, allowing for generalization to a related target task without too many extra samples.
In considering the case where the learner has access to unlabeled target samples, we enter the “unsupervised domain adaptation” setting. Here, one often uses unlabeled target data to find some feature space under which source and target supports align [23, 24]. The literature has shown that unlabeled data has provable utility in certain common settings, e.g. under covariate shift [25, 26]. Unsupervised domain adaptation has also been studied through the lens of discrepancy [9, 10].
3 Preliminaries
Let the instance space $\mathcal{X}$ be a compact metric space, and $\mathcal{Y}$ be a finite label set. A data distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ is defined by a Borel measure $\mu$ over $\mathcal{X}$, and a conditional probability function $\eta(y \mid x) = \Pr[Y = y \mid X = x]$.
We assume our distributions satisfy some measure-theoretic regularity conditions. In particular, we assume our Borel measures are open measures, and that the Lebesgue Differentiation Theorem always holds. See Appendix B for details to this end.
For a classifier $f : \mathcal{X} \to \mathcal{Y}$, we define its risk over $\mathcal{D}$ as the probability it misclassifies, i.e. we define $R(f, \mathcal{D}) = \Pr_{(x, y) \sim \mathcal{D}}[f(x) \neq y]$. The classifier with the lowest possible risk is called the Bayes optimal classifier, defined as $f^*(x) = \arg\max_{y \in \mathcal{Y}} \eta(y \mid x)$.
3.1 Problem Statement and Goal
In this work, we are interested in the problem of distribution shift, in which the goal is to build a classifier with low risk over a target distribution , primarily using data from a source distribution, . We denote the Bayes risk on source and target via and .
The challenge in this setting is that the source and target can put mass in drastically different regions of the instance space, making direct generalization from the source distribution to the target distribution difficult or impossible in the worst case.
3.2 Feature Maps
We consider classification after first applying a transformation into a feature space , also a compact metric space.
We assume we are given , a class of feature maps . Here, each represents a potential feature map under which the source and target distributions could plausibly be connected. Let denote the distance metric induced on by , i.e. . We assume all are continuous, and is compact with respect to the supremum distance metric. We also include further technical assumptions on in Appendix B.3.
We note the following important examples of feature map collections which meet these regularity assumptions when the domain is a compact subset of $\mathbb{R}^d$.
Example 1.
Let denote the set of all projections from onto a set of coordinates. Formally, we may write , where for each with , we let .
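As a rough illustration of Example 1, the sketch below enumerates coordinate projections for a fixed number of retained coordinates. The function name `coordinate_projections` and the choice to fix the subset size `m` are assumptions made for illustration; the exact constraint on the coordinate subsets is not recoverable from the text above.

```python
from itertools import combinations


def coordinate_projections(d, m):
    """Return a dict mapping each size-m index subset of range(d)
    to the projection that keeps exactly those coordinates."""
    def make(idx):
        # Bind idx per projection so each closure keeps its own subset.
        return lambda x: tuple(x[i] for i in idx)
    return {idx: make(idx) for idx in combinations(range(d), m)}


projections = coordinate_projections(4, 2)
image = projections[(0, 2)]((1.0, 2.0, 3.0, 4.0))
```

The class is finite, with one map per subset, so for `d = 4` and `m = 2` there are six candidate feature maps.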
Example 2.
Let denote the set of all linear maps corresponding to matrices in with each entry contained in .
For any data distribution over , we denote via the distribution defined via where , often writing , where and are the induced marginal and conditional distributions of . We assume that the induced marginals are also open measures. Measure-theoretic details of induced distributions are discussed in Appendix D.
3.3 Nearest Neighbors
We consider the $k$-nearest neighbor classifier arising from an i.i.d. sample of size $n$ and a metric over the instances, where ties are broken arbitrarily. It is well known that under mild regularity conditions, $k_n \to \infty$ and $k_n / n \to 0$ imply that $k$-nearest neighbors will converge to the Bayes optimal classifier [27]. Motivated by technical concerns, we will make a slightly stronger assumption on the growth of $k_n$.
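Because we consider classification in feature space, we will often consider the composition of $k$-NN with feature maps in the class. To this end, for a feature map $\phi$ we consider the classifier that predicts the label of a point $x$ by running $k$-NN on $\phi(x)$ against the embedded sample $\{(\phi(x_i), y_i)\}$.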
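The composition of $k$-NN with a feature map can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction used in the proofs: the names `knn_in_feature_space` and `euclidean` are hypothetical, the feature-space metric is assumed Euclidean, and ties are broken by sample order.

```python
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two tuples of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def knn_in_feature_space(phi, sample, k, metric=euclidean):
    """Build a classifier that embeds points with phi, then runs k-NN
    against the embedded labeled sample."""
    embedded = [(phi(x), y) for x, y in sample]

    def classify(x):
        z = phi(x)
        # sorted is stable, so equidistant points keep their sample order.
        neighbors = sorted(embedded, key=lambda pair: metric(z, pair[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    return classify


sample = [((0.0, 5.0), "a"), ((0.1, -3.0), "a"), ((0.2, 9.0), "a"),
          ((1.0, 5.1), "b"), ((1.1, -2.9), "b"), ((1.2, 9.1), "b")]
first_coordinate = lambda p: (p[0],)
classifier = knn_in_feature_space(first_coordinate, sample, k=3)
prediction_far = classifier((0.05, 100.0))
prediction_b = classifier((1.15, -50.0))
```

The query `(0.05, 100.0)` is far from every sample point in the original space, yet it lands among the `"a"` points once the second coordinate is projected away, which is the behavior the feature map is meant to provide.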
3.4 Margin Conditions
Finally, we restrict our attention to data distributions in which Bayes-optimal classification is clearly non-ambiguous, and regions in which Bayes-optimal predictions differ are separated by a margin. We formalize this as follows.
Definition 1.
A data distribution over is -separated if there exist , and disjoint sets , so that the following hold:
1. The sets cover the support: .
2. On the set where is the Bayes-optimal decision, no other label has similar conditional probability: If , then , .
3. These sets themselves are separated by a margin: .
When is -separated, we say that has margin , and label margin . The conditions of well-separated distributions are met in most practical cases, where classification is rarely ambiguous, and arbitrarily close examples are usually classified identically.
4 The Statistical IRM Assumption
Generalizing from source data in a feature space induced by some feature map is only possible if the class contains a map that appropriately unifies the classification tasks on the source and target. We use this section to motivate and define some desirable properties of feature maps vis-à-vis this goal, and to introduce the Statistical IRM Assumption, formalizing our requirement for the existence of quality maps in the class.
4.1 Desirable Properties of Feature Maps
In Invariant Risk Minimization, the fundamental assumption is the existence of a feature map and an “invariant predictor” for which is Bayes-optimal on all training and testing environments. This allows a learner to assume that selecting a feature space through which good performance on training environments is attainable is not a completely futile approach to constructing a generalizing classifier.
In this spirit, we first interest ourselves in feature maps which preserve the possibility of optimal classification on our single source distribution. We consider a slightly stronger but natural notion that encodes the idea that no information relevant to the classification task on the source should be lost under the map.
Definition 2.
We say a feature map source-preserves if the induced source distribution is separated, and the Bayes risk on the induced source distribution is equal to that of the original source distribution.
Let denote the set of all source preserving feature maps in .
Thus, source-preserving feature maps retain all information needed for optimal classification in the sense that the risk of the Bayes optimal in original space and feature space should be the same under the correct embedding. We also require that some margin is preserved in the arising feature space.
While not strictly necessary under the IRM assumption, it is also desirable that an embedding maps examples that are similar with respect to the classification task to similar parts of the feature space, regardless of which distribution they come from. We formalize a condition capturing this idea via the following.
Definition 3.
We say a feature map contracts and if the induced source is separated with margin , and for each , there is some such that
where is a fixed constant. Let denote the set of all contracting feature maps in .
Ultimately, we are interested in feature spaces in which we can generalize to the target by classifying target data as we would source data. This possibility is captured by the notion of the invariant predictor in the IRM assumption. We interest ourselves in feature maps with a similar property - ones for which the optimal classification decision is locally the same across and .
Definition 4.
We say a feature map Bayes-unifies and if for all and ,
Let denote the set of all Bayes-unifying feature maps in .
Under a feature map which Bayes-unifies, any points which are mapped closer together than half the induced margin are classified the same under the source and target distributions.
4.2 Stating the Statistical IRM Assumption
It’s intuitive that if a feature map both preserves the Bayes risk on the source, and unifies the classification tasks of source and target, then converging to the Bayes risk on the target is possible when source data populate the support of the induced target.
Thus, we would like a feature map which possesses all of these properties. Our fundamental assumption is that there exists at least one such feature map in the class; we term this the Statistical IRM Assumption.
Assumption 1 (Statistical IRM Assumption).
We assume there is some such that
1. it source-preserves;
2. it contracts the source and target;
3. it Bayes-unifies source and target.
We say that with all of these properties realizes the Statistical IRM Assumption, and let denote the set of all maps in which realize the Statistical IRM Assumption.
This assumption is an analogue of the IRM assumption, adapted to our single-source, single-target setting. Like IRM, it allows for the possibility of optimal classification on both source and target via the selection of an appropriate feature space. Contraction, which is not an assumption in IRM, allows for that optimal classification to be realized via a local classification scheme such as $k$-NN.
4.3 The Statistical IRM Theorem
One would expect that if a learner were handed a realizing feature map, generalization to the target should be possible with source data alone: because the classification task on the source and target is unified in the feature space arising from this map, and every example in the target support is mapped close to the training support, the learner is able to construct a good classifier for the target by simply constructing a good classifier on the induced source.
We formalize this intuition via the following theorem, which states that given knowledge of a realizing feature map, generalization to the target can be accomplished with source data only via the construction of a $k$-NN classifier in feature space.
Theorem 1 (Statistical IRM Theorem).
Suppose realizes the Statistical IRM assumption. Then for all , there exists such that for all , with probability over ,
Thus, from the perspective of target generalization from source data, it suffices to determine a feature map which realizes the Statistical IRM Assumption. In what follows, we characterize the statistical identifiability of (and thus the learnability of ) under this assumption in each of our data availability settings.
5 The Distance Dimension of the Embedding Class
Our investigation of the identifiability of realizing feature maps relies on one further notion: that of embedding classes with bounded complexity. To this end, we introduce a complexity measure on the embedding class which will play a key role in each of the settings we consider. We begin with an intermediate definition.
Definition 5.
For a given , we define its induced distance comparer as the map
We also define as the induced distance comparer class of .
Distance comparers are a natural tool for our analysis – all nearest-neighbor computations inside the feature space can be expressed in such terms. This observation gives rise to a natural complexity measure for the determination of a suitable feature map, which we term the distance dimension.
Definition 6.
The distance dimension of , denoted , is the VC dimension of the induced comparer class .
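The exact form of the distance comparer is not recoverable from the text above, so the following sketch rests on an assumption: it takes a comparer to act on a triple of points and to return 1 when the second point is at least as close to the first as the third is, measured in feature space. Under that assumed form, the brute-force check below tests whether a finite class of comparers shatters a given set of triples, which is the property the VC dimension counts. All names are hypothetical.

```python
def abs_metric(u, v):
    """L1 distance between two tuples of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def make_comparer(phi, metric=abs_metric):
    """Assumed comparer form: 1 if a is at least as close to x as b is,
    with distances measured after applying phi."""
    def comparer(x, a, b):
        return int(metric(phi(x), phi(a)) <= metric(phi(x), phi(b)))
    return comparer


def realized_labelings(comparers, triples):
    """Set of distinct 0/1 labelings the class induces on the triples."""
    return {tuple(c(*t) for t in triples) for c in comparers}


def is_shattered(comparers, triples):
    """True if every one of the 2^n labelings of the triples is realized."""
    return len(realized_labelings(comparers, triples)) == 2 ** len(triples)


proj_x = lambda p: (p[0],)
proj_y = lambda p: (p[1],)
comparers = [make_comparer(proj_x), make_comparer(proj_y)]

t1 = ((0, 0), (1, 5), (5, 1))
t2 = ((0, 0), (2, 9), (9, 2))
one_triple = is_shattered(comparers, [t1])
two_triples = is_shattered(comparers, [t1, t2])
```

A class of two maps realizes at most two labelings, so it can shatter a single triple but never a pair; under the assumed comparer form, its VC dimension is therefore at most 1.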
In providing upper bounds, it will be important that the distance dimension be finite. We note that it is easily bounded for the two important classes of feature maps mentioned in Section 3.
6 Direct Generalization from the Source
We first study the possibility of constructing a classifier that generalizes to the target using only labeled samples from the source. In this setting, a learner takes a labeled source sample as input and outputs a classifier, with the goal of achieving a small risk on the target.
While one might hope that the Statistical IRM assumption alone is sufficient for generalization to the target, this is unfortunately false. In fact, we have already seen an example of this phenomenon in Figure 1(c). Here, we essentially argued that one of the two projections was a realizing feature map: it preserves the source risk, unifies the classification tasks on source and target, and maps all target points close to source points. However, we argued that the two projections were statistically indistinguishable in this setting, leaving the learner in need of more information.
To the end of a general characterization of learnability in this setting, recall our discussion of Figure 1(a). Here, one projection could be determined as realizing the Statistical IRM assumption given that it was the only map in the class that preserved the source distribution: it is clear from the figure that the other projection does not preserve the source, and so cannot possibly realize the Statistical IRM assumption. It is vital that this reasoning could be carried out with source data alone.
More generally, by the Statistical IRM Theorem, it is sufficient for generalization from source data alone that the learner be able to identify from source data alone. Such a realizing feature map must of course satisfy all three requirements of Assumption 1. However, note that only one of these requirements, namely source-preservation, depends on the source distribution alone – the others, namely contraction and Bayes-unification, are defined in terms of the target distribution. As such, only source-preservation can be tested using source data.
That said, if the learner can be assured that all source-preserving feature maps realize the Statistical IRM assumption, it can identify realizing feature maps by identifying source-preserving feature maps. We formalize this intuitive idea with the following theorem, which shows that PAC guarantees for target generalization are obtainable when this additional condition holds.
Theorem 3.
Suppose the Statistical IRM Assumption holds, the distance dimension , and that Then there is a learning rule such that for every , there exists such that if , with probability over ,
We relegate specification of the learning rule to the Appendix. It is founded on minimizing the empirical risk on the source data over feature maps in the class, but further leverages the knowledge that source-preserving feature maps induce separated distributions over feature space. After selecting a candidate feature map which empirically matches the requirements for source preservation, it uses $k$-NN in the implied feature space to make predictions.
In light of the discussion above, this condition is intuitively necessary as well: without it, there may be some feature map which is source-preserving, but which e.g. fails to Bayes-unify the source and the target. It is simple to see that attempting to generalize to the target via such a feature space could be catastrophic, and thus that blindly choosing between source-preserving feature maps will eventually lead the learner astray. On the other hand, not classifying through a feature space subjects the learner to the standard pitfalls of out-of-distribution generalization. We formalize these ideas via the following hardness result.
Theorem 4.
Fix a source , a target , and some embedding class for which the Statistical IRM assumption holds. Suppose that is non-empty, and that a learner successfully generalizes to (with high probability) using only samples from . Then for all , there exist data distributions such that the following hold:
1. .
2. There is a realizing the Statistical IRM assumption on alternative source and alternative target .
3. For all , there exists such that with probability at least over ,
Thus, in the case that some feature maps preserve the source but do not realize the Statistical IRM assumption, there is always some nearly identical problem instance where the Statistical IRM assumption is realized by a map in the class, but which causes a given learner to have unbounded sample complexity (for some choice of accuracy and confidence parameters).
7 Combining Labeled Samples from the Source and Unlabeled Samples from the Target
We now consider the less restrictive setting in which unlabeled target data are also available. Here, a learner takes a labeled source sample and an unlabeled target sample as input, and outputs a classifier.
The story given additional access to unlabeled data is similar to the source-only setting: the Statistical IRM assumption alone is insufficient for guaranteeing successful generalization when additional unlabeled target data are available. In other words, the combination of labeled source and unlabeled target is generally insufficient for identifying a feature map that realizes the Statistical IRM assumption.
For a simple example to this end, we return to panel (c) of Figure 1. For the source and target distributions shown, it is evident that no amount of labeled data from the source and unlabeled data from the target will allow us to decide whether we should project data onto the $x$-axis or the $y$-axis. This is because the only difference between them is the manner in which the target is labeled. By contrast, the example depicted by Figure 1 panel (b) illustrates a case in which the additional unlabeled data from the target proves sufficient: because one of the two projections fails to map target points close to source points, we can conclude that the realizing map must be the other projection.
As in the source-only setting, identifying a feature map realizing the Statistical IRM assumption requires testing the three conditions of Assumption 1. The key to understanding the utility of additional unlabeled target data is that it allows the learner to test not only which feature maps preserve the source, but also which maps contract the source and target. On the other hand, it is insufficient to determine which feature maps Bayes-unify, as this notion intrinsically depends on the labeling under the target. This motivates a sufficient condition for learnability similar to the one we saw in the previous section, namely that all feature maps which both preserve the source and contract the source and target further Bayes-unify. The following theorem shows that this is indeed a sufficient condition for learnability in this setting.
Theorem 5.
Suppose the Statistical IRM Assumption holds, the distance dimension , and that Then there is a learning rule such that for all , there exist and , such that if and , with probability over and ,
The learning rule is specified in detail in the Appendix. It proceeds by first selecting a feature map which both empirically preserves the source, and maps each unlabeled point in the target sample close to some source point in the source sample. As above, it uses $k$-NN in the selected feature space to make predictions.
In accordance with the intuition developed above, this condition is also necessary. The issue of course is that without access to labeled target data, testing whether a feature map Bayes-unifies is impossible. We formalize this via the following hardness result.
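The role of unlabeled target data in this selection step can be sketched as below. This is not the rule from the Appendix: it illustrates only the contraction check, on invented data, with a hypothetical threshold `radius` standing in for the margin-dependent quantity in Definition 3. A map is kept when every target point lands within `radius` of some source point in feature space.

```python
def dist(u, v):
    """Euclidean distance between two tuples of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def coverage_gap(phi, source_points, target_points):
    """Largest feature-space distance from a target point to its
    nearest source point."""
    return max(min(dist(phi(t), phi(s)) for s in source_points)
               for t in target_points)


def contracting_candidates(maps, source_points, target_points, radius):
    """Names of maps whose coverage gap is within the assumed radius."""
    return [name for name, phi in maps.items()
            if coverage_gap(phi, source_points, target_points) <= radius]


# Invented data: both projections separate the source classes, but only
# the first coordinate of the target overlaps with that of the source.
source_points = [(0, 0), (0, 1), (5, 5), (5, 6)]
target_points = [(0, 20), (5, 21), (1, 22)]
maps = {"proj_x": lambda p: (p[0],), "proj_y": lambda p: (p[1],)}

gap_x = coverage_gap(maps["proj_x"], source_points, target_points)
gap_y = coverage_gap(maps["proj_y"], source_points, target_points)
kept = contracting_candidates(maps, source_points, target_points, radius=1.0)
```

No target labels are used anywhere in the check, which is why it can rule out a non-contracting map but cannot test Bayes-unification.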
Theorem 6.
Suppose realizes the Statistical IRM Assumption for and , and that there is some for which . Then for all learners , there exists a conditional data distribution, such that the following hold:
1. for all .
2. realizes the Statistical IRM Assumption for and .
3. There exists such that for arbitrarily large values of and , with probability at least over and ,
Theorem 6 shows that no combination of embedding class and learner can circumvent the impossibility of testing Bayes unification with unlabeled target data. For any embedding class and learning algorithm, one can always find a pair of source and target distributions on which the Statistical IRM Assumption is realized, but on which the learning algorithm will fail.
8 Efficient Use of Labeled Samples from the Target
The discussion above implies that even under the Statistical IRM assumption, there are many situations where labeled target data is required for generalization. In such cases, we would hope that we can exploit the information encoded in the Statistical IRM Assumption to achieve generalization through labeled source data and a small amount of labeled target data. In this section we show that the Statistical IRM assumption allows for significant convergence rate speed-ups in many settings.
Recall that the Statistical IRM theorem states that given a realizing feature map, generalization to the target can be accomplished purely through source data; the limitation imposed by a lack of labeled target data is the difficulty in identifying such a feature map. This inspires the strategy of allocating all of the labeled target data towards determining a realizing feature map.
In this spirit, we analyze the natural scheme of constructing a classifier by composing nearest-neighbors trained solely on source data with the feature map that minimizes the empirical risk on the labeled target sample, finding that the number of target examples required for guarantees can be controlled in terms of the “distance dimension” of the embedding class.
Theorem 7.
Suppose realizes the Statistical IRM assumption. Then for every , there exists such that if
then with probability at least over , ,
where is output of .
Thus, the amount of labeled target data required for generalization when the Statistical IRM assumption is realized can be largely controlled through our complexity measure on the embedding class. We say “largely” given that the amount of data required from the target has a logarithmic dependence on the amount of data drawn from the source. This implies a near distributional-independence between source and target in the sample complexity.
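The scheme analyzed in Theorem 7 can be sketched in a few lines. The version below is a simplification under stated assumptions: it uses 1-NN in place of $k$-NN for brevity, a finite class of two hypothetical projections, and invented data in the spirit of panel (c), where only target labels can separate the candidates. Nearest neighbors are always computed against source data; the labeled target sample is used solely to score feature maps.

```python
def dist(u, v):
    """Euclidean distance between two tuples of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def source_trained_1nn(phi, source):
    """1-NN classifier over the source sample embedded with phi."""
    embedded = [(phi(x), y) for x, y in source]

    def classify(x):
        z = phi(x)
        return min(embedded, key=lambda pair: dist(z, pair[0]))[1]

    return classify


def empirical_target_risk(phi, source, target):
    """Fraction of labeled target points the source-trained rule gets wrong."""
    classify = source_trained_1nn(phi, source)
    return sum(classify(x) != y for x, y in target) / len(target)


def erm_feature_map(maps, source, target):
    """Name of the map minimizing empirical risk on the target sample."""
    return min(maps,
               key=lambda name: empirical_target_risk(maps[name], source, target))


# Invented data: both projections classify the source perfectly and map
# target points onto source points, yet they disagree on target labels.
source = [((0, 0), 0), ((1, 1), 0), ((5, 5), 1), ((6, 6), 1)]
target = [((0, 5), 1), ((6, 1), 0), ((1, 6), 1)]
maps = {"proj_x": lambda p: (p[0],), "proj_y": lambda p: (p[1],)}

risk_x = empirical_target_risk(maps["proj_x"], source, target)
risk_y = empirical_target_risk(maps["proj_y"], source, target)
best = erm_feature_map(maps, source, target)
```

Three labeled target points suffice here to pick `proj_y`, since the target sample only has to discriminate between candidate maps rather than train a classifier from scratch.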
We note that the above margin assumptions are not required for the analysis leading to Theorem 7. Thus – comparing e.g. to rates of convergence under the canonical Tsybakov noise assumption in , under which nonparametric classifiers necessarily incur rates of – the guarantees of Theorem 7 represent significant convergence rate speed-ups over naively training a non-parametric classifier with target data in many cases where the distance dimension is polynomial in the dimension of the instance space [28].
9 Discussion
In this work, we study the problem of distribution shift under a variant of the IRM assumption, wherein it is known that a feature map in a class unifies classification on source and target. We investigate the identifiability of such maps, characterizing learnability in settings where worst-case approaches indicate that learning should be impossible or expensive.
Our work suggests that the study of IRM-like assumptions is a promising direction for shedding light on new situations where guaranteeing generalization under distribution shift is possible. It also highlights that a primary issue in learning under IRM-like assumptions may be the statistical identifiability of suitable feature maps.
Acknowledgements: This work was supported by the National Science Foundation under the following grants: NSF CIF-2402817, SaTC-2241100, CCF-2217058, and ARO-MURI W911NF2110317.
RB was also partially supported by the German Research Foundation through the Cluster of Excellence “Machine Learning - New Perspectives for Science” (EXC 2064/1 number 390727645).
References
- [1] L. G. Valiant. A theory of the learnable. Communications of the ACM, pages 1134–1142, 1984.
- [2] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. NIPS’06, page 137–144, Cambridge, MA, USA, 2006. MIT Press.
- [3] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Vaughan. A theory of learning from different domains. Machine Learning, 79:151–175, 05 2010.
- [4] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence, 2012.
- [5] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization, 2020.
- [6] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. CoRR, abs/2002.04747, 2020.
- [7] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset shift in machine learning. 2009.
- [8] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
- [9] A. Tuan Nguyen, Toan Tran, Yarin Gal, Philip H. S. Torr, and Atılım Güneş Baydin. KL guided domain adaptation, 2022.
- [10] Ziqiao Wang and Yongyi Mao. Information-theoretic analysis of unsupervised domain adaptation, 2023.
- [11] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, 2007.
- [12] Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions, 2012.
- [13] Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation based on generalized discrepancy. Journal of Machine Learning Research, 20(1):1–30, 2019.
- [14] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms, 2023.
- [15] Steve Hanneke, Samory Kpotufe, and Yasaman Mahdaviyeh. Limits of model selection under transfer learning. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 5781–5812. PMLR, 12–15 Jul 2023.
- [16] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation, 2013.
- [17] Han Zhao, Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura, and Geoffrey J. Gordon. Multiple source domain adaptation with adversarial training of neural networks, 2017.
- [18] Yining Chen, Elan Rosenfeld, Mark Sellke, Tengyu Ma, and Andrej Risteski. Iterative feature matching: Toward provable domain generalization with logarithmic environments, 2021.
- [19] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation, 2015.
- [20] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex Kot. Domain generalization with adversarial feature learning, 2018.
- [21] Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably, 2021.
- [22] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning, 2016.
- [23] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. ICML’11, 2011.
- [24] Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hyejin Oh, Georges El Fakhri, Je-Won Kang, and Jonghye Woo. Deep unsupervised domain adaptation: A review of recent advances and perspectives, 2022.
- [25] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. ALT’12, 2012.
- [26] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. Correcting sample selection bias by unlabeled data. NIPS’06, 2006.
- [27] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3437–3445, 2014.
- [28] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers under the margin condition, 2011.
- [29] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. 2014.
Appendix A Further Notation
Given , we let denote the closed ball centered at of radius . For a feature map, , we also let denote the set of all points with distance (under ) at most from .
Recall that for , we let denote the metric over induced by , i.e. . We extend this in the natural way to sets, letting
Finally, for a pair of feature maps , we let denote the supremum metric between and . That is,
Appendix B Further Technical Assumptions
B.1 Lebesgue Differentiation Theorem
We assume that for any finite Borel measure over , the Lebesgue differentiation theorem holds. That is, for all measurable functions , up to a null set under ,
B.2 Open Measures
As referenced in Section 3 above, we assume that our Borel measures satisfy a further regularity condition – namely, that they are open measures.
Definition 7.
A Borel measure, , over metric space is open, if for all measurable sets , if and only if there exists and such that and .
A very typical example of such a measure is any distribution that has a finite density function. In this work, we will restrict ourselves to considering open measures with the following assumption: and are open, and for all , the induced source and target measures, and are open over the metric space . Here we are noting that and are only non-zero over subsets of the image of , , and thus we restrict our attention to when considering openness.
This technical assumption allows us to simplify our results as it prohibits cases in which can be a pathological distribution that concentrates in an area of that leads to bad generalization. We also believe that such an assumption is relatively mild – all distributions over are arbitrarily close to open Borel measures – we can simply add spherical noise to each sampled point.
B.3 Assumptions on
We include two further technical assumptions about . We begin by assuming that all feature maps send an infinite number of points to a given .
Assumption 2.
For all and all , the set of points that have the same image as in under is infinite. That is,
Observe that this assumption is clearly met by the examples given in Section 3.2. Furthermore, it is likely to be met by any reasonable family of continuous maps that perform any kind of dimension reduction.
Next, we define dominance, which will be useful for formulating our other assumption.
Definition 8.
We say that feature map dominates feature map at point if
We now define an embedding class to be indomitable when it avoids instances of one feature map dominating another.
Definition 9.
is indomitable if for all distinct and for all , the following holds. For all , there exists maps such that:
1. .
2. does not dominate at .
3. does not dominate at .
We will now assume that is indeed indomitable.
Assumption 3.
is indomitable.
Observe that this assumption is satisfied by both examples of feature maps given in Section 3.2. More generally, the fact that our definition permits a lack of dominance to hold for some two maps that are close to and makes our definition mild enough to hold for most continuous classes of feature maps.
Appendix C -nearest neighbors
First, we fix as a sequence of integers with the following properties.
Definition 10.
Let be a sequence of integers so that , and .
Observe that would suffice as an example of such a sequence.
Next, our goal is to define the -nearest neighbors classifier over a labeled data set of points, . To do so, we begin by describing a tie-breaking procedure used in cases where training points are equidistant from a given test point.
Definition 11.
An ordering , over a dataset is any ordered permutation of . We say that if occurs before in the permutation.
We now show how to use to break ties when computing nearest neighbors.
Definition 12.
Let . Let be an ordering over dataset . For , we say that if either of the two conditions hold:
-
1.
.
-
2.
and .
In essence, ties are broken by choosing the datapoint that appears earlier in the ordering. We now define a nearest neighbor as follows.
Definition 13.
Let be a dataset, and let be an ordering of . For , we say that is a -nearest neighbor of if
We also let denote the set of all -nearest neighbors of when using the ordering .
Observe that by construction, . This is because the ordering allows us to strictly order points based on their distances from with ties broken by .
We are now ready to define the nearest neighbors classifier.
Definition 14.
Let be a dataset, and an ordering over . Then for , we define
Here, we break ties in arbitrarily (which could be done with an ordering of ).
Throughout the paper, we typically omit from our notation for . This is because in all cases, we assume that some ordering is implicitly chosen (independent of the data points) ahead of time.
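The definitions above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it assumes Euclidean distance, represents the ordering as a hypothetical list `order` of ranks (lower rank means earlier in the ordering), and breaks ties among labels by taking the smallest label, which is one arbitrary but fixed choice.

```python
import math
from collections import Counter


def knn_classify(data, x, k, order=None):
    """Plurality vote among the k nearest neighbors of x.

    data  : list of (point, label) pairs; points are tuples of floats.
    order : order[i] is the rank of data[i] in a fixed ordering chosen
            independently of the data (assumption: defaults to index order).
    """
    if order is None:
        order = list(range(len(data)))
    # Sort by (distance, rank): equidistant points are ranked by the ordering,
    # so the resulting order over the dataset is strict.
    ranked = sorted(
        range(len(data)),
        key=lambda i: (math.dist(data[i][0], x), order[i]),
    )
    votes = Counter(data[i][1] for i in ranked[:k])
    top = max(votes.values())
    # Ties between labels are broken arbitrarily; here, by the smallest label.
    return min(label for label, count in votes.items() if count == top)
```

Because the sort key is the pair (distance, rank), exactly `k` neighbors are always selected, matching the observation that the set of nearest neighbors has a fixed size.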
C.1 Composing with feature maps.
We now define the classifier , where is a feature map. One important detail for doing so, is that we will continue to use an ordering over , rather than an ordering over . This will allow us to use a single ordering throughout all of our learning algorithms that deal with learning a feature map.
Recall that for any feature map, , denotes the distance metric
Using this, we give analogs of Definitions 12 and 13 by essentially replacing with .
Definition 15.
Let and . Let be an ordering over dataset . For , we say that if either of the two conditions hold:
-
1.
.
-
2.
and .
Definition 16.
Let . Let be a dataset, and let be an ordering of . For , we say that is a -nearest neighbor of under if
We also let denote the set of all -nearest neighbors of when using the ordering .
Finally, we define as follows.
Definition 17.
Let , let be a dataset, and an ordering over . Then for , we define
Here, we break ties in arbitrarily (which could be done with an ordering of ).
The key point of this definition is that all tie-breaking mechanisms are done independently of . In particular, we have the following.
Lemma 1.
Let be a dataset of points, and an ordering over . Let be two feature maps in . Suppose that for all , if and only if . Then .
Proof.
This is immediate from the previous definitions as all ties are broken in an identical manner for both and . ∎
As before, to avoid cumbersome notation, we will assume that an ordering, , of is fixed.
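To illustrate Lemma 1, the sketch below composes nearest neighbors with a feature map while keeping the tie-breaking ordering on the original dataset indices. It is a hedged example under our own assumptions: the helper names `knn_phi`, `project`, and `scaled_project` are hypothetical, distances in feature space are Euclidean, and label ties are broken by the smallest label. The two maps differ by a positive scaling, so they induce the same distance comparisons and therefore the same classifier.

```python
import math
from collections import Counter


def knn_phi(phi, data, x, k):
    """k-nearest neighbors under the metric induced by the feature map phi.

    Ties in distance are broken by the index of the point in `data`,
    i.e. by an ordering over the original points, not over their images.
    """
    fx = phi(x)
    ranked = sorted(
        range(len(data)),
        key=lambda i: (math.dist(phi(data[i][0]), fx), i),
    )
    votes = Counter(data[i][1] for i in ranked[:k])
    top = max(votes.values())
    return min(label for label, count in votes.items() if count == top)


def project(x):
    """Hypothetical feature map: keep only the first coordinate."""
    return (x[0],)


def scaled_project(x):
    """Same comparisons as `project`, with every distance multiplied by 3."""
    return (3.0 * x[0],)


data = [((0.0, 5.0), 0), ((1.0, -3.0), 0), ((4.0, 0.0), 1),
        ((5.0, 9.0), 1), ((2.0, 2.0), 0)]
queries = [(1.0, 100.0), (4.5, 0.0), (3.0, -7.0)]
agree = all(
    knn_phi(project, data, q, 3) == knn_phi(scaled_project, data, q, 3)
    for q in queries
)
```

Since the ordering never depends on the feature map, any two maps that agree on all distance comparisons select the same neighbors, which is the content of the lemma.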
Appendix D Induced conditional distributions
In this section, we rigorously define the conditional data distribution of . Recall that if denote the random variables corresponding to , then is defined as the data distribution , where is a feature map. We write , where denotes the measure corresponding to over , and is the conditional data distribution, . Our goal in this section is to similarly write .
First, observe that for any measurable subset , . This directly follows from the definition of the random variable .
Next, to define , first recall that denotes the probability that given that . By assumption this is well defined for all and , and moreover for any the function defined by is measurable. To define , we first define for all as follows.
Definition 18.
is a measure over so that for all measurable sets ,
The fact that is a well-defined measure follows directly from the rules of integration. In essence, is the probability of observing with and . We now show the following:
Lemma 2.
is absolutely continuous with respect to for all
Proof.
This immediately follows from the fact that for all . Thus for any measurable set ,
Thus for any , we can simply choose so that . ∎
We now use the Radon-Nikodym theorem on to define .
Lemma 3.
For all , there exists a measurable function such that
for all measurable sets .
Proof.
This directly follows from the Radon-Nikodym theorem. ∎
We then define using these functions, .
Definition 19.
For all and , we define .
Appendix E Technical Lemmas
E.1 Useful bounds related to
We now prove several results regarding the distance dimension of , . These will be useful for proving all of our subsequent results.
We begin by defining a useful hypothesis class for analyzing nearest neighbors.
Definition 20.
Let be a set of labeled points in . For , define as
Finally, we let
Observe that is a set of binary classifiers. We now show that it has bounded VC-dimension.
Lemma 4.
There exists an absolute constant, such that has VC-dimension bounded as
Proof.
Suppose shatters a set of points in . The key observation is that for any , the way labels a given point is determined by the -nearest neighbors of in . Furthermore, by Lemma 1, these labels are fully determined by the set of all comparisons,
These indicator variables precisely correspond to the definition of a distance comparer (Definition 5). It follows that the number of distinct ways that can label is at most the number of ways can label all possible comparisons. Since by definition, , by Sauer’s Lemma, the number of ways can label is at most . However, since shatters , there exist precisely such labelings. It follows that . From here, straightforward algebra implies that , as desired. ∎
Next, we define a hypothesis class that will be useful for bounding the margin of a data distribution.
Definition 21.
For and , define as the map
Let .
Roughly speaking, the class will prove useful in allowing us to uniformly bound measured distances over a data distribution. We now bound its VC-dimension as follows.
Lemma 5.
There exists an absolute constant, such that has VC-dimension bounded as
Proof.
Suppose shatters the set . We say that induces ordering, over by ranking the pairs in increasing distance. That is,
Our strategy is to double count the number of distinct orderings, , over that can be constructed using . Here, two orderings are distinct if they ever differ for some pair of entries from .
First, suppose that is even (which we can assume by deleting a pair from if needed). Since shatters , for all with , there exists such that
Observe that for , and must induce distinct orderings over as the bottom -elements of their orderings are distinct. Since there are choices for , this shows that there are at least orderings.
Second, there are possible quadruples, . Suppose that and satisfy that
for all . By the definition of a distance comparer (Definition 5), this implies that and induce the same ordering over . Thus it suffices to count the number of ways can label the set of possible quadruples. By Sauer’s Lemma, this is at most .
Combining our two observations, it follows that . Standard algebra yields that for some absolute constant . ∎
Finally, for dealing with margins and labels simultaneously, we introduce the following hypothesis class.
Definition 22.
Let be a set of labeled points in . For , define as follows:
We let
We now bound its VC-dimension.
Lemma 6.
There exists an absolute constant, such that has VC-dimension bounded as
Proof.
Suppose shatters . We will double count the number of subsets of that can be obtained as the pre-image of under some . For any , define as
Then the key observation is that
Thus, a subset of is the pre-image of under if it is precisely the intersection of the pre-images of under and .
By Sauer’s Lemma and Lemma 5, there are at most subsets that are the pre-image of under some .
We now similarly bound the pre-images under . To this end, observe that the value of over all is completely determined by the way in which classifies . This quantity, in turn, is fully determined by the way in which (Definition 20) labels the set . Thus, applying Sauer’s Lemma along with Lemma 4, we see that there are at most possible subsets.
However, since shatters , we know that subsets can be formed in this manner. Thus, it follows that for some constant ,
Taking logs and applying standard algebraic manipulations yields the desired result. ∎
We end with one final useful hypothesis class that is a generalization of Definition 21.
Definition 23.
Let be an integer. For and , define as the map
Let .
This class will assist us with computing the distance between the source and target distributions simultaneously over all embeddings.
Lemma 7.
There exists an absolute constant, such that has VC-dimension bounded as
Proof.
Suppose shatters where . The key observation is that the subset shattered by is precisely determined by the behavior of over the set of pairs, for and . By Sauer’s Lemma along with Lemma 5, there are at most such subsets possible.
It follows that . Applying standard manipulations again yields that for some constant . ∎
E.2 Useful properties of data distributions
Lemma 8.
Let be a well-separated data distribution with label margin, . Let be a classifier such that , where denotes the Bayes-optimal classifier. Then
Proof.
Let . Let denote the set of all points for which and disagree. Then we have,
Here, we are using the fact that for (as is well-separated). Finally, using the fact that has excess risk at most , we find that which implies that , as desired. ∎
We now use the fact that is compact to prove a useful Lemma.
Lemma 9.
For all , there exists such that for all and all ,
Proof.
Assume towards a contradiction, that for some , for all , there exists such that . Let be a sequence and let be corresponding feature maps and points for this sequence.
Since is compact, we can take an infinite subsequence of so that for some . Similarly, since is compact, we can take an infinite subsequence so that for some . Because is continuous, there exists such that
Select such that , , and . Then, applying the triangle inequality, we have
Furthermore, since ,
However, this is a contradiction to the definition of . ∎
E.3 Useful Definitions for Analyzing Margins
We begin by precisely characterizing feature maps that preserve a data distribution .
Lemma 10.
Let be a well-separated distribution, and let be the sets as defined in Definition 1. Then preserves if and only if there exists such that
Proof.
The first direction is immediate. If exists, then it is clear that is well-separated with corresponding sets .
In the second direction, assume towards a contradiction that no such exists. Because preserves , there exist corresponding sets that partition the support of . Because no such exists, we must have some such that for – otherwise we could have used the margin of as a valid choice for .
However, it then becomes clear that there exists a ball of non-zero radius centered at that is mapped into . This means it is classified as by while it is classified as by . Since is well-separated, there is a unique Bayes-optimal classifier over the support of , and this shows that does not incur Bayes-optimal risk over . Thus does not preserve , which is a contradiction.
∎
We now generalize the idea of the margin of a data distribution as follows.
Definition 24.
Let be a well-separated data distribution, and let be a feature map. Then the margin variable of , is a random variable, defined as follows. Let . Then
can be thought of as a randomly observed margin. We will be particularly interested in observing small values of , as this will be reflective of the margin of . In particular, we have the following.
Lemma 11.
Let be a well-separated data distribution and let be a feature map. If preserves , let denote the margin of . Otherwise, let . Then for every there exists such that
Proof.
Let , and let be the sets corresponding to Definition 1. Suppose preserves . Then the sets must be the corresponding sets for , and it follows that
On the other hand, if does not preserve , then we must have
as if this distance were positive, then would clearly preserve . Thus in either case, there exists so that .
It follows that there exists and such that . Let
Because is continuous, and lie within the support of . It follows that . is also a lower bound on the probability that we observe , which means it is a lower bound on the probability that This gives the desired result. ∎
We now use a similar idea to describe distances between the supports of two measures.
Definition 25.
Let be measures over . Then is defined as
where is a random variable following distribution .
can be thought of as representing the distance that a point drawn from has from when using distance metric determined by .
It will also be useful to define a finite-sample version of , that does not rely on the sets, .
Definition 26.
Let be measures over . Let . Then is defined as
where is a random variable following distribution , and are drawn i.i.d from .
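As a rough illustration of the finite-sample quantity, the sketch below computes the distance, under the metric induced by a feature map, from a single point to the nearest of a finite sample. This is only our reading of the definition: the function name `finite_sample_distance`, the use of Euclidean distance in feature space, and the example feature map `first_coordinate` are all assumptions made for illustration.

```python
import math


def finite_sample_distance(phi, x, sample):
    """Distance (under the metric induced by phi) from x to its nearest
    point in `sample`, where `sample` stands in for i.i.d. draws from
    the second measure and x for a draw from the first."""
    fx = phi(x)
    return min(math.dist(fx, phi(s)) for s in sample)


def identity(x):
    return x


def first_coordinate(x):
    """Hypothetical feature map that discards the second coordinate."""
    return (x[0],)


sample = [(0.0, 0.0), (3.0, 4.0)]
d_identity = finite_sample_distance(identity, (0.0, 1.0), sample)
d_projected = finite_sample_distance(first_coordinate, (0.0, 1.0), sample)
```

Adding points to the sample can only decrease this value, which is the intuition behind its convergence as the sample size grows.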
We now show that converges to .
Lemma 12.
Let be any feature map. Then converges in distribution to .
Proof.
For any , the probability that is precisely the probability that some is chosen so that has distance less than from . For all such , and let satisfy . Furthermore pick so that . It follows that will hold if one of the points selected from will be within distance from . However, this event occurs with high probability for being sufficiently large. ∎
Appendix F Proof of Theorem 1
First, we characterize areas of that are likely to be correctly classified by composing nearest neighbors with .
Definition 27.
Let be a feature map that preserves . Let , and let be a distance. We let denote the set of all points such that there exists for which the following hold.
-
1.
.
-
2.
.
Here represents a small amount of mass that must be close to , and and determine a region in which that mass is concentrated. The idea will be that can be accurately classified using points sampled from . We now formalize this with the following lemma.
Lemma 13.
Fix . Then there exists such that for all , for all and with probability at least over ,
Proof.
Because is well-separated, let denote the regions that correspond to Definition 1. Let be the point as defined in the definition of . By applying Lemma 10, observe that there exists such that , and . This holds because if it didn’t, then the triangle inequality would show that and have distance less than .
It now suffices to show that with probability at least over , .
To do this, let , and let . We can view as being constructed by first drawing , and then drawing the labels of each of its points.
Observe that if contains at least points, then the nearest neighbors (according to ) of will all be drawn from . To this end, by Hoeffding’s inequality, we see that
Thus, for sufficiently large (depending only on ), this quantity is at least , and . This means that with probability at least , the contains at least points.
Next, suppose that this event occurs. We now select the labels for our points. Because of our method of generating , we can assume that these labels are i.i.d and drawn for points in . Let the label of the th nearest neighbor of be denoted as . For all , define as the random variable that is if , if , and otherwise. The key observation is that if and only if for all as this will imply that is the plurality choice.
Because is well separated, it has label margin . Therefore, is a random variable bounded in with expected value at least . It follows by Hoeffding’s inequality, that
Because , it follows that for a sufficiently large value of , this quantity is at least . Thus taking a union bound over all gives the desired result. ∎
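For concreteness, the Hoeffding step used in this argument can be written out as follows. This is a sketch in our own notation, which is an assumption rather than the paper's: $Z_1, \dots, Z_k$ denote i.i.d. random variables taking values in $[-1, 1]$ with $\mathbb{E}[Z_i] \geq \gamma > 0$, where $\gamma$ plays the role of the label margin and $k$ the number of neighbors.

```latex
\begin{align*}
\Pr\left[\sum_{i=1}^{k} Z_i \leq 0\right]
  &\leq \Pr\left[\sum_{i=1}^{k} \bigl(Z_i - \mathbb{E}[Z_i]\bigr) \leq -k\gamma\right] \\
  &\leq \exp\left(-\frac{2 (k\gamma)^2}{k \cdot 2^2}\right)
   = \exp\left(-\frac{k\gamma^2}{2}\right).
\end{align*}
```

The first inequality uses $\sum_i \mathbb{E}[Z_i] \geq k\gamma$, and the second is Hoeffding's inequality for variables with range of length $2$. The bound decays exponentially in $k$, so a union bound over the finitely many competing labels remains small once $k$ is large.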
Here, observe that we are comparing the nearest neighbors classifier using to the Bayes-optimal classifier over , where is the original feature map we are considering. In other words, this lemma implies that small perturbations to the feature map do not affect classification.
Next, we show that the entire support of the target distribution, , can be covered using the regions .
Lemma 14.
Let . Then there exists such that the following holds. For all that realize SIRM () and for which has margin at least ,
Proof.
Let . By the Definition of , . Now let be arbitrary.
Because contracts , there exists such that , where is the margin of . It follows that
Finally, by Lemma 9, there exists such that for all , . This implies that for all . Finally, we take
It suffices to show that . To do so, observe that is closed and therefore compact (as is compact by assumption). Take an open cover of by balls of radius . Then it has a finite sub-cover. Each of these balls has positive mass under , and furthermore every ball where must fully contain at least one of these balls. It follows that , where is the minimum mass of one of these balls. Since , it follows that , as desired. ∎
Lemma 15.
Let . Then there exists such that for all that relate such that has margin at least , if , then with probability at least over ,
Proof.
Let relate , and suppose has margin . Let be as in Lemma 14. Because relates , observe that for all ,
To see this, observe that Definition 3 implies that , and Definition 4 implies that must be labeled by the same as is by .
Next, select from Lemma 13. It follows that since all , for , we have that with probability at least over ,
A standard application of Markov’s inequality, which converts the expected loss into a loss bound that holds with high probability, completes the proof. ∎
We are now prepared to prove Theorem 1.
Proof.
Any that relates has positive margin, and so the previous lemma applies for sufficiently large . Since , it immediately follows that converges in risk to the Bayes-optimal classifier of , as desired. ∎
Appendix G Proof of Theorem 3
G.1 Description of our learning rule
We begin with our learning rule, , that achieves the bound given in Theorem 3.
G.2 Bounding the error in estimating the loss
Our method for estimating the loss that a nearest neighbors classifier incurs over the source distribution is given in Algorithm 2. We simply evaluate the empirical risk using nearest neighbors over the designated loss set, .
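The loss estimate described here can be sketched as follows. This is an illustrative sketch only, not a transcription of Algorithm 2: the names `knn_phi` and `empirical_nn_risk`, the Euclidean metric in feature space, and the index-based tie-breaking are our assumptions.

```python
import math
from collections import Counter


def knn_phi(phi, train, x, k):
    """k-nearest neighbors on `train` under the metric induced by phi,
    with distance ties broken by index and label ties by smallest label."""
    fx = phi(x)
    ranked = sorted(
        range(len(train)),
        key=lambda i: (math.dist(phi(train[i][0]), fx), i),
    )
    votes = Counter(train[i][1] for i in ranked[:k])
    top = max(votes.values())
    return min(label for label, count in votes.items() if count == top)


def empirical_nn_risk(phi, train, loss_set, k):
    """Fraction of held-out labeled points in `loss_set` that the
    classifier built from `train` and `phi` misclassifies."""
    errors = sum(1 for x, y in loss_set if knn_phi(phi, train, x, k) != y)
    return errors / len(loss_set)


def identity(x):
    return x


train = [((0.0,), 0), ((10.0,), 1)]
loss_set = [((1.0,), 0), ((9.0,), 1), ((2.0,), 1), ((8.0,), 1)]
risk = empirical_nn_risk(identity, train, loss_set, 1)
```

Keeping the loss set disjoint from the training points is what allows the empirical risk to be treated as an average over an independent sample.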
We now bound the accuracy of this method using the following Lemma.
Lemma 16.
Let be an arbitrary data distribution. Let be a set of labeled points, and let be an i.i.d sample that is independent of . Then there exists such that for all , with probability at least over , for all ,
Proof.
Fix , and define as the event that the empirical risk induced by each is representative of the true risk. That is,
Our goal is to show that holds with probability at least , for sufficiently large . The key observation is that for all ,
where is as defined in Definition 20. Thus, it follows that
To analyze the latter quantity, a standard application of the fundamental theorem of statistical learning (see Shalev-Shwartz and Ben-David [29]) implies that holds with probability provided that .
Fix . By Lemma 4, . Substituting this, along with , we see that
where is some constant that depends on . Since asymptotically dominates this quantity, it follows that for sufficiently large , we indeed have , which proves the desired result. ∎
G.3 Bounding the error in estimating the margin
Our method for estimating the margin of a distribution, , is given in Algorithm 4. The main idea is to split the set, , into two equal parts, , and . We then use a nearest neighbors classifier over to label the points in both and . Finally, we measure the distance between differently labeled points from and respectively. For technical reasons, when comparing distances between and , we only compare points that have the same index. This allows us to exploit independence between each comparison we make.
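The procedure described above might be sketched as follows. This is a hedged illustration and not a transcription of Algorithm 4: the names `knn_phi` and `estimate_margin` are hypothetical, distances are Euclidean in feature space, and aggregating the index-matched comparisons by taking their minimum is our assumption about how the individual distances are combined.

```python
import math
from collections import Counter


def knn_phi(phi, train, x, k):
    """k-nearest neighbors on `train` under the metric induced by phi,
    with distance ties broken by index and label ties by smallest label."""
    fx = phi(x)
    ranked = sorted(
        range(len(train)),
        key=lambda i: (math.dist(phi(train[i][0]), fx), i),
    )
    votes = Counter(train[i][1] for i in ranked[:k])
    top = max(votes.values())
    return min(label for label, count in votes.items() if count == top)


def estimate_margin(phi, labeled, pool, k):
    """Split `pool` into two halves, label both halves with nearest
    neighbors on `labeled`, and return the smallest feature-space distance
    between same-index points that receive different labels
    (math.inf if no such pair exists)."""
    half = len(pool) // 2
    first, second = pool[:half], pool[half:2 * half]
    best = math.inf
    # Only same-index pairs are compared, so the comparisons are independent.
    for a, b in zip(first, second):
        if knn_phi(phi, labeled, a, k) != knn_phi(phi, labeled, b, k):
            best = min(best, math.dist(phi(a), phi(b)))
    return best


def identity(x):
    return x


labeled = [((0.0,), 0), ((10.0,), 1)]
pool = [(1.0,), (2.0,), (9.0,), (3.0,)]
margin_estimate = estimate_margin(identity, labeled, pool, 1)
```

Comparing only index-matched pairs discards most of the available pairs, but each retained comparison depends on a fresh pair of points, which is the independence the analysis relies on.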
We now show that this method is likely to accurately estimate margins by showing that it gives good estimates for , which is described in Definition 24.
Lemma 17.
There exists , such that for all , if is a set of labeled points, with probability at least over and , at least one of the two conditions will hold:
-
1.
.
-
2.
.
Proof.
For and , let be as defined in Definition 22, so that
Observe that
This is because if and only if either and are given the same labels, or if they have distance (under ) of at least .
Define as the random variable where for ,
The variable is closely related to ; the only difference is that we replace the Bayes-optimal classifier, with .
To relate to our previous quantities, observe that
Because the set of classifiers, has bounded VC-dimension, , we can apply uniform convergence to see that must be close to its expectation with high probability over all . More precisely, by applying the same argument as in the proof of Lemma 16, we have that for sufficiently large, with probability at least over , for all ,
By substituting the definition of along with our observation about , it follows that
We now turn our attention to showing that must indeed serve as a reasonable approximation for . To do so, observe that if and are constructed from the same random variables, , then they only differ if and differ over either or .
Suppose that Then it follows by Lemma 8 that , where is the label margin of . It follows by the rules of probability that the probability that is at most summed with the probability that . That is,
(1) |
However, if is sufficiently large, then we have that with probability at least over , for all ,
This in turn implies that
(2) |
By taking a union bound, it follows with probability at least , that Equations 1 and 2 simultaneously hold over all .
Finally, if , then for sufficiently large, condition 2 from the statement of the Lemma must hold. Otherwise, if , condition 1 holds. Thus, in either case, one of the two conditions holds, which completes the proof. ∎
G.4 Proving the theorem
We first show a Lemma that implies that the feature map selected by our algorithm, , is likely to realize the SIRM assumption on .
Lemma 18.
Let be any SIRM realizing feature map, and suppose that has margin . Then for all , there exists such that if , with probability at least over , (defined in Line 6 of direct_generalize_nn) is a SIRM realizing feature map for , and has margin at least .
Proof.
Assume towards a contradiction, that for , there exist arbitrarily large values of for which with probability at least , has margin less than .
For sufficiently large, with probability at least , applying Lemmas 16 and 17 we have that, for all ,
and that one of the two conditions holds as well:
-
1.
.
-
2.
.
Because these equations hold for , the smallest observed empirical loss must be at most . It follows that must incur empirical loss at most , which implies that condition 2 must apply to .
Furthermore, because , we have that with high probability, will match the Bayes-optimal classifier, for all points in . It follows that the observed margin, will be at least .
Combining all of this, we see that
Now let be a sequence of integers going to infinity so that for each , with probability at least , has a margin less than . Because is fixed, it follows that for sufficiently large , there exist and such that all of the equations above hold.
Because is compact, there exists an infinite subsequence of the s for which converges (using the distance metric over ) to some . Relabel our sequence so that without loss of generality, .
The key observation is that the variable, is Lipschitz with respect to the distance metric over . In particular, if , then .
Using this, observe that for sufficiently large values of , we have that . Substituting this, it follows that for all sufficiently large ,
Since can be arbitrarily large it follows that which implies must have margin at least .
However, this in turn implies that for all sufficiently large , too must have margin at least . Here we are again exploiting the fact that the margin is Lipschitz.
This finally gives us a contradiction, as we previously assumed that all had margin less than .
∎
We are now prepared to prove Theorem 3.
Proof.
Fix . The previous Lemma implies that for sufficiently large values of , with probability we will select some that has margin at least . Lemma 15 implies that for sufficiently large (in a way that only depends on ), with probability at least over ,
Crucially, is completely independent of , which is learned purely using and . Taking a union bound implies the desired result. ∎
Appendix H Proof of Theorem 4
Proof.
Fix . Let relate , and let be a feature map that source-preserves but fails to relate . We will construct and using and .
Because fails to relate , there exists such that . Note that if this doesn’t hold, then we can simply use the construction from the proof of Theorem 6. The point will be central to constructing both and . We begin by constructing .
Constructing :
Let be a small value. Then by Assumptions 2 and 3, there exists and such that the following conditions hold:
1. .
2. .
3. .
Here, and are chosen using Assumption 3, while the existence of and is based on Assumption 2. We also let be two points such that
Next, let be a measure over obtained by the following steps.
1. Begin with , the measure of over .
2. Remove all points in that lie within a distance of from the set .
3. Pick such that any two points in have distance larger than . Insert balls of probability mass centered at each of these points, so that for .
Observe that is constructed from by adding a region of mass (and appropriately down-sizing all other regions). Furthermore, if is appropriately chosen, then the region being removed from can also be forced to have size at most . It follows that . Next, we define the conditional distribution, with the following steps. Let be two labels in .
1. For .
2. For .
3. For .
4. For .
5. For all other , .
Basically, we force the conditional distribution near and to be and respectively. For , this is reversed. This construction only modifies at points where is modified, and it follows that .
Furthermore, observe that and both source-preserve . This occurs because and are chosen to be small enough so that the 4 balls, are all mapped to disjoint areas under both and .
Constructing :
Next, we will construct by giving a choice of two possible target distributions, and . We let be a point mass that is concentrated at . We let and . This gives us and .
Observe that is SIRM related to by , and is SIRM related to by . This is because is mapped to by , and the same holds respectively for and .
Finishing the proof:
We now show that our learner will have a large error over some choice of . To do so, suppose is randomly chosen from this set. It follows that our learning rule's expected loss is:
From here, the desired result follows by a straightforward application of Markov's inequality.
∎
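The Markov step can be spelled out in generic form. The following is a sketch under assumptions: $L$ stands for the learner's loss on the randomly chosen target, assumed to lie in $[0,1]$, and $\mu$ is a placeholder for the lower bound on its expectation. The specific constants of the theorem are not reproduced here.

```latex
% Assumptions: 0 \le L \le 1 and E[L] \ge \mu; a is any threshold with 0 < a < \mu.
% Apply Markov's inequality to the nonnegative variable 1 - L.
\Pr[L \le a] = \Pr[1 - L \ge 1 - a]
  \le \frac{\mathbb{E}[1 - L]}{1 - a}
  \le \frac{1 - \mu}{1 - a},
\qquad\text{so}\qquad
\Pr[L > a] \ge \frac{\mu - a}{1 - a}.
```

Taking $a = \mu/2$, for instance, gives $\Pr[L > \mu/2] \ge \mu/2$, since the denominator $1 - \mu/2$ is at most 1.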
Appendix I Proof of Theorem 5
I.1 Description of the learning rule
We give the learning rule that achieves the bound given in Theorem 5.
I.2 Analyzing the procedure,
We begin by describing the process used to estimate how far data from is from data from under a feature map, . The subroutine is given in Algorithm 6, where is a labeled set of points drawn from , and is an unlabeled set of points drawn from , the marginal -distribution of .
The idea is that for each , we assign it its own set of points sampled from . Then, we take the distance from to the closest point in its assigned set. Finally, taking the max of all of these gives us an approximation of the furthest distance any has from when using the distance metric, .
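The max-of-nearest-distances idea can be sketched in a few lines of Python. This is an illustrative sketch, not a transcription of Algorithm 6: the names `estimate_max_distance`, `query_points`, `reference_points`, and `feature_map` are hypothetical, the equal-size disjoint blocks are an assumed way of giving each point "its own set", and which sample plays the query role and which the reference role should be taken from Algorithm 6 itself.

```python
import math


def estimate_max_distance(reference_points, query_points, feature_map):
    """Largest nearest-neighbor distance, measured after applying feature_map.

    Each query point is compared only against its own disjoint block of
    reference points (assumed: equal-size consecutive blocks).
    """
    num_queries = len(query_points)
    block = len(reference_points) // num_queries
    if block == 0:
        raise ValueError("need at least one reference point per query point")

    worst = 0.0
    for i, q in enumerate(query_points):
        fq = feature_map(q)
        assigned = reference_points[i * block:(i + 1) * block]
        # Distance from this query point to the closest point in its block.
        nearest = min(math.dist(fq, feature_map(r)) for r in assigned)
        # Keep the largest such distance seen so far.
        worst = max(worst, nearest)
    return worst
```

Using disjoint blocks keeps the nearest-neighbor distances of different query points independent of one another, which is the kind of independence a union bound over query points relies on.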
We now show that this procedure approximates the quantity, , which is defined in Definition 26.
Lemma 19.
There exists such that if , and , then with probability at least over , and , for all ,
where is the largest integer at most .
Proof.
For convenience, let denote . For and , let be as defined in Definition 23, so that
Furthermore, let us relabel (defined in line 6 of Algorithm 6) so that . It follows that
This is because if and only if some has distance less than from .
Next, we relate this quantity to as follows. Observe that
Because the set of classifiers, has bounded VC-dimension, , we can apply uniform convergence to see that must be close to its expectation with high probability over all . More precisely, by applying the same argument as in the proof of Lemma 16, we have that for sufficiently large, with probability at least over , for all ,
By substituting the definition of along with our observation about , it follows that for all ,
By the rules of probability, and the definition of an infimum, it follows that
Substituting the value of gives the desired result. ∎
I.3 Bounding the performance of a given feature map,
We now consider a fixed feature map that realizes the SIRM assumption on . The idea behind doing so is that this allows us to give a baseline over how and should be expected to behave.
Lemma 20.
Let realize the SIRM assumption on . Suppose has margin , and that
Finally, let where by the fact that contracts . Then for all , there exists such that if , , with probability at least over , , the following three things hold:
1. .
2. .
3. .
Proof.
We bound the probability of each of these three things occurring separately, and then apply a union bound.
First of all, for sufficiently large, the first claim holds with probability at least . This follows directly from a combination of Lemma 16 along with Lemma 15 being applied to (as technically SIRM realizes on ). In particular, we have that with probability at least that
(3) |
Second, observe that Lemma 8 implies that the probability that differs from the Bayes-optimal is at most , where is the label margin of . It follows that with probability at least , correctly labels all the points in and (see Algorithm 4). Thus, it follows that , as with correct labels it is impossible to observe two differently labeled points that are closer than .
Finally, by Lemma 9, there exists such that
Let The argument in the proof of Lemma 9 implies that . Thus, for sufficiently large, it follows that with probability at least over that there will exist some .
Since is compact, it follows that we can take a finite covering of with balls of radius . If there are such balls, with probability at least over that there will exist some for all balls in our covering. Observe that this implies that for all , there will exist some .
We now use this to show that the third condition is likely to hold. Pick sufficiently large so that . It follows that with probability at least over and , for all , for all , there exists some .
Suppose that this holds. For each (Algorithm 6), let be the point for which is minimized. Thus by the definition of . However, by our claim above, we see that some must be in which implies that . Thus .
Since this occurs for all , it follows that the maximum distance we observe is at most , which means , as desired.
Since our three events all occur with probability , it follows that if is sufficiently large, they simultaneously occur with probability at least . This completes the proof.
∎
I.4 Proving the Theorem
We are now prepared to prove Theorem 5. We start with the following Lemma.
Lemma 21.
Let be as defined in Lemma 20. Then for all , there exists such that for all , , with probability at least over , , the outputted feature map satisfies the following: let denote the margin of and let denote
Then
Proof.
Assume towards a contradiction that this fails to occur and fix for that fails. For sufficiently large, with probability at least , the premises of Lemmas 16, 17, 19, and 20 are all simultaneously met. In particular, Lemmas 16 and 17 imply that
and that one of the two conditions holds as well:
1. .
2. .
However, Lemma 20 implies that
Since has source loss at most more than the optimal, it follows that condition 2 must hold. Thus, by additionally adding Lemma 19, we see that has the following properties:
-
1.
-
2.
.
However, recall that must maximize the quantity . Since this quantity is at least for (by Lemma 20), it follows that
(4) |
In particular, this equation holds with probability at least . However, by our assumption for arbitrarily large values of , fails to have the desired properties with probability .
Thus with probability at least , for arbitrarily large values of , there exists such that Equation 4 holds but for which the desired property fails.
Let be any subsequence of integers so that the corresponding feature maps, converge to some . Note that this exists because is compact.
The key observation is that because and are both Lipschitz with respect to (clearly small changes in the feature map cannot change these variables much), it follows that for all , there exists such that for all ,
Since is arbitrarily large, observe that this implies that
Finally, since gets arbitrarily close to , it follows that
Taking , and noting the definitions of and , it follows that precisely fulfills the conditions given in the statement of the lemma, and this completes the proof.
∎
We now prove Theorem 5.
Proof.
Fix . The previous Lemma implies that for sufficiently large values of , with probability we will select some that has margin at least . We now use an argument identical to the argument given for Theorem 3 and conclude the proof. ∎
Appendix J Proof of Theorem 6
Proof.
For such that source-preserves , define as the classifier over defined by
with ties being broken arbitrarily.
Let and let . Because and both contract , it follows that and are precisely well-defined over . However, because , this implies that there exists such that
This holds from the fact that must match the Bayes-optimal, , whereas must fail to (otherwise would indeed SIRM realize on ).
Define
Essentially this is a noiseless distribution that is purely classified by . Let and . The key observation is that if we randomly select and then apply our learning algorithm to , then our learner must incur expected risk at least . This is because whatever it outputs, it has a 50-50 chance of misclassifying instances from in which and disagree. Thus our learner has expected risk at least . Since is fixed, this implies the desired result. ∎
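The averaging step can be checked by brute force on a toy finite domain. The sketch below rests on assumptions that are not in the source: a finite domain with explicit point masses, binary labels, and two noiseless labelings `h0` and `h1` standing in for the two candidate targets. It enumerates every fixed predictor and confirms that none has average risk below half the mass of the disagreement region.

```python
from itertools import product


def risk(predictor, labeling, mass):
    """Probability mass of the points where predictor and labeling differ."""
    return sum(p for pred, lab, p in zip(predictor, labeling, mass) if pred != lab)


def best_average_risk(h0, h1, mass, labels=(0, 1)):
    """Minimum, over all fixed predictors, of the risk averaged over both targets."""
    best = float("inf")
    for predictor in product(labels, repeat=len(mass)):
        avg = 0.5 * risk(predictor, h0, mass) + 0.5 * risk(predictor, h1, mass)
        best = min(best, avg)
    return best


def disagreement_mass(h0, h1, mass):
    """Probability mass of the points on which the two labelings disagree."""
    return sum(p for a, b, p in zip(h0, h1, mass) if a != b)
```

On any point where the two labelings disagree, every prediction is wrong for at least one of them, so the minimum average risk equals half the disagreement mass in this toy setting.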
Appendix K Bounds on the Distance Dimension and Transfer Learning Guarantees
K.1 Proof of Theorem 2
Proof.
(Theorem 2) We will construct two maps, , and such that for any and ,
This will immediately imply the result as it is well known that linear classifiers over have VC dimension .
Letting be the matrix associated with , we have
Thus, letting (cast as a vector in ) and suffices. ∎
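A lifting of this kind can be verified numerically. The sketch below assumes the comparer for a linear feature map with matrix A compares the squared distances ‖A(x − a)‖² and ‖A(x − b)‖²; the exact form of the comparer in the source is not reproduced here, and the helper names `lift_weights` and `lift_triple` are hypothetical. Under that assumption, the gap between the two squared distances is an inner product between a flattened Gram matrix and a flattened function of the triple (x, a, b).

```python
def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def sq_norm(v):
    return sum(x * x for x in v)


def lift_weights(A):
    """Flatten the Gram matrix A^T A into a vector of length d * d."""
    d = len(A[0])
    return [sum(row[i] * row[j] for row in A) for i in range(d) for j in range(d)]


def lift_triple(x, a, b):
    """Flatten (x - a)(x - a)^T - (x - b)(x - b)^T into a vector of length d * d."""
    u = [xi - ai for xi, ai in zip(x, a)]
    v = [xi - bi for xi, bi in zip(x, b)]
    d = len(x)
    return [u[i] * u[j] - v[i] * v[j] for i in range(d) for j in range(d)]


def direct_gap(A, x, a, b):
    """||A(x - a)||^2 - ||A(x - b)||^2 computed directly."""
    u = [xi - ai for xi, ai in zip(x, a)]
    v = [xi - bi for xi, bi in zip(x, b)]
    return sq_norm(matvec(A, u)) - sq_norm(matvec(A, v))


def lifted_gap(A, x, a, b):
    """The same quantity as a linear function of the lifted triple."""
    return sum(w * z for w, z in zip(lift_weights(A), lift_triple(x, a, b)))
```

Because the sign of the gap decides the comparison, each comparer in this sketch acts as a linear classifier on the lifted triples, which is what lets a VC bound for linear classifiers be applied.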
K.2 Proof of Theorem 7
Proof.
(Theorem 7) Suppose realizes the Statistical IRM assumption for . Define
and
is thus the event that the source sample gives rise to a -nearest neighbors classifier which has small excess risk on when composed with the realizing projection . is the event that the empirical risks of -nearest neighbor classifiers composed with feature maps are uniformly representative of their true risks on the target .
Our goal is to show that and jointly hold with probability at least , as this would imply that our learned classifier has risk at most , as desired. By Lemma 15, holds with probability at least , so it suffices to show that holds with probability at least as well.
Fix any set of points, . It suffices to show that , as integrating over all possibilities of would give the desired result. Consider the hypothesis class, where is defined as
Observe that is a binary classifier over its domain. It follows that given ,
To analyze the latter quantity, it suffices to show that , as a standard application of the fundamental theorem of statistical learning [29] implies holds with probability provided that .
To this end, suppose shatters a set of points . Let , and let . The key observation is that for any , the way labels a given point is determined by the -nearest neighbors of in . Furthermore, these labels are fully determined by the set of all comparisons,
Note that is a set of induced distance comparers (Definition 5). It follows that the number of distinct ways that can label is at most the number of ways can label all possible comparisons arising from and with . Thus, by the definition of the distance dimension and Sauer’s Lemma, the number of ways can label is at most
At the same time, because shatters , there exist precisely such labelings. Thus, we have
From here, straightforward algebra implies that , as desired.
∎
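The counting argument can be illustrated numerically. The sketch below is not the proof's exact inequality: it assumes, purely for illustration, that a candidate shattered set of m points together with n sample points gives rise to m * n * n comparisons, and it uses the Sauer-Shelah sum as the bound on the number of labelings. The names `sauer_bound` and `largest_shatterable` are hypothetical.

```python
from math import comb


def sauer_bound(num_points, vc_dim):
    """Sauer-Shelah bound on the labelings of num_points by a class of dimension vc_dim."""
    return sum(comb(num_points, i) for i in range(min(vc_dim, num_points) + 1))


def largest_shatterable(n, dist_dim, limit=10_000):
    """Largest m reached before 2**m first exceeds the Sauer bound.

    Assumption (illustrative only): m points induce m * n * n comparisons.
    """
    m = 0
    while m < limit and 2 ** (m + 1) <= sauer_bound((m + 1) * n * n, dist_dim):
        m += 1
    return m
```

Shattering m points requires 2**m distinct labelings, while the Sauer bound grows only polynomially in m, so the loop stops at a finite value. That is the numerical counterpart of the "straightforward algebra" step.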