Beyond Discrepancy: A Closer Look at the Theory of Distribution Shift

Robi Bhattacharjee
University of Tübingen and Tübingen AI Center
[email protected]
Nick Rittler
University of California - San Diego
[email protected]
Kamalika Chaudhuri
University of California - San Diego
[email protected]

Abstract

Many machine learning models appear to deploy effortlessly under distribution shift, and perform well on a target distribution that is considerably different from the training distribution. Yet, learning theory of distribution shift bounds performance on the target distribution as a function of the discrepancy between the source and target, rarely guaranteeing high target accuracy. Motivated by this gap, this work takes a closer look at the theory of distribution shift for a classifier from a source to a target distribution. Instead of relying on the discrepancy, we adopt an Invariant-Risk-Minimization (IRM)-like assumption connecting the distributions, and characterize conditions under which data from a source distribution is sufficient for accurate classification of the target. When these conditions are not met, we show when only unlabeled data from the target is sufficient, and when labeled target data is needed. In all cases, we provide rigorous theoretical guarantees in the large sample regime.

1 Introduction

Classical learning theory operates within the statistical learning framework, in which the training and testing datasets are assumed to be drawn from the same distribution [1]. However, this assumption is rarely met in practice, where models often succeed in ever-changing real-world environments that rarely match the precise conditions of their training data. This motivates the problem of distribution shift, in which a learner trains on a source distribution, with the goal of generalizing well over a distinct target distribution.

Thus far, the theory of distribution shift has consistently taken a worst-case approach, typically bounding generalization error in terms of some notion of discrepancy between the source and target distributions [2, 3, 4]. In cases where the source and target distributions are completely unrelated, or the source provides little information about the decision boundary of the target, discrepancy-based analyses correctly capture the difficulty of generalization. However, in practice, many large models appear to generalize effortlessly to target distributions with non-zero discrepancy.

Motivated by this gap, we take a closer look at the theory of distribution shift. In our setting, we consider a source distribution $\mathcal{D}_{s}$ and a target distribution $\mathcal{D}_{t}$ , with the goal of building an accurate classifier over $\mathcal{D}_{t}$ , primarily via training samples from $\mathcal{D}_{s}$ . To accomplish this, we first select a feature map, $\hat{\phi}\in\Phi$ , under which the source and target distributions are similar. To make predictions, we then use $k_{n}$ -nearest neighbors ( $k_{n}$ -NN) inside feature space over data sampled from $\mathcal{D}_{s}$ .

Instead of a worst-case, discrepancy-based approach, we study generalization under an Invariant Risk Minimization (IRM)-like assumption which we term the “Statistical IRM Assumption”. IRM assumes the existence of a feature map $\psi^{*}$ and a classifier (over feature space) $h^{*}$ so that their composition $h^{*}\circ\psi^{*}$ achieves optimal accuracy over both source and target distributions [5]. We adapt this assumption to the nearest neighbors setting, and replace the existence of $h^{*}$ with the assumption that some feature map $\phi^{*}\in\Phi$ maps points from the target $\mathcal{D}_{t}$ close to those from the source $\mathcal{D}_{s}$ while retaining information sufficient for optimal prediction. This property allows us to leverage the fact that nearest neighbors enjoys strong generalization properties within the support of its training distribution.

One might hope that such a condition is sufficient for generalization from source data alone. Unfortunately, the existence of a suitable feature map does not imply its identifiability - there may be many poor feature maps in $\Phi$ that appear suitable when only source data are available. We show (Theorem 4) that to guarantee generalization to the target using only source data, the source must be rich enough so that this cannot happen, i.e. that all maps that lead to optimal classification over the source distribution appropriately unify the source and target. We further exhibit a learning rule which leads to provable generalization to the target under this additional condition (Theorem 3).

[Figure 1(a): Pure source data is sufficient, as $\phi_{x}$ significantly reduces accuracy over the source distribution.]

We next consider the case where the learner has access to unlabeled target data in addition to labeled source data. Here, the target data provides crucial new information about which feature maps transform target data close to source data in feature space. We find that it is necessary and sufficient (Theorems 6 and 5) that all maps which both lead to optimal classification over the source, and map target data close to source data, further appropriately unify the source and target classification tasks.

When generalization is not possible with the addition of unlabeled target data, some labeled target data is needed. In this setting, the goal is to minimize the amount of labeled target data used – if large amounts of labeled target data are obtainable, we could simply use standard learning algorithms directly on the target data. We introduce a complexity measure on the embedding class, $\Phi$ , which we term the distance dimension, and use it to provide an upper bound on the amount of target data needed for generalization. In particular, we show that the natural procedure of minimizing the empirical risk (over $\phi\in\Phi$ ) on the target distribution of the source data-trained nearest neighbor classifier meets this upper bound.

1.1 An Illustrative Example

Figure 1 illustrates three learning problems in which we seek to generalize from the bold source data to the faded target data. In each case, the set of possible feature maps is $\Phi=\{\phi_{x},\phi_{y}\}$ , the projections onto the $x$ -axis and $y$ -axis, respectively. Here, the Statistical IRM Assumption manifests itself in the following way: the learner knows that perfect classification can be performed on the source and the target through the intermediate projection onto either the $x$ or $y$ -axis. If the correct projection can be identified, a classifier generalizing to the target can be built by composing the correct projection with a classifier that accurately classifies source data in feature space.

The possibility of generalizing directly to the target is illustrated by Figure 1(a). In this case, using source data alone, it can be deduced that $\phi_{x}$ is not suitable, given that projection under $\phi_{x}$ significantly reduces accuracy over the source distribution. Thus, a classifier can be constructed through composition with $\phi_{y}$ that allows for generalization to the target.

By contrast, in panel (b), we see that both $\phi_{x}$ and $\phi_{y}$ admit good classification over the source distribution. However, note that only $\phi_{y}$ leads to good generalization over the target distribution, and that there is no way to pin down which embedding should be used with source data alone. That said, given access to unlabeled target data, $\phi_{x}$ can be eliminated from contention – it fails to uniformly map target data close to source data in feature space, another condition for correctly relating the source and target.

In panel (c), we see an instance in which no amount of source data and unlabeled target data will allow the learner to distinguish a winner between the two possible feature maps. In this case, labeled target data is needed. However, note that only a relatively small amount of labeled target data will be needed – all that is required is enough points to validate that a source-trained classifier arising from first projecting onto the $x$ -axis has inferior performance to an analogous classifier where data are projected to the $y$ -axis.
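The validation step described for panel (c) can be sketched in code. The following is a minimal illustration rather than the procedure analyzed in this work: it fits $k$ -NN on source data under each candidate projection and keeps the projection with the lowest error on a handful of labeled target points. The function names, the toy data used to exercise it, Euclidean distance in feature space, and non-negative integer labels are all assumptions of this sketch.

```python
import numpy as np

def knn_predict(Z_src, y_src, Z_query, k):
    """Majority vote among the k nearest source features (Euclidean distance)."""
    dist = np.linalg.norm(Z_query[:, None, :] - Z_src[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    # Labels are assumed to be non-negative integers so that bincount applies.
    return np.array([np.bincount(row).argmax() for row in y_src[nearest]])

def select_by_target_risk(feature_maps, X_src, y_src, X_tgt, y_tgt, k):
    """Pick the map whose source-trained k-NN has the lowest labeled-target error."""
    X_src = np.asarray(X_src, dtype=float)
    X_tgt = np.asarray(X_tgt, dtype=float)
    y_src, y_tgt = np.asarray(y_src), np.asarray(y_tgt)
    risks = []
    for phi in feature_maps:
        preds = knn_predict(phi(X_src), y_src, phi(X_tgt), k)
        risks.append(float(np.mean(preds != y_tgt)))
    return int(np.argmin(risks)), risks
```

On toy data where both projections classify the source perfectly and both map target points onto source points, two labeled target points are enough for this sketch to separate the candidates, which mirrors the qualitative point made about panel (c).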

1.2 Guarantees Beyond Discrepancy

The examples of Figure 1 also serve to showcase the potential for generalization guarantees in scenarios where worst-case analyses indicate that generalization to the target should be hard.

There are a few veins of the discrepancy literature [6]. One prominent vein considers bounding generalization error in terms of divergence measures between the source and target [2, 3, 4]. Another considers density ratios between target and source [7, 8]. In each case, the idea is that the degradation of prediction quality on the target will be small when the source and target distributions are not “too far” from each other.

Consider again the examples of Figure 1. A density ratio analysis indicates that generalization to the target in (a) is impossible from pure source data, and expensive in a transfer learning setting (c), as the source has no mass in large chunks of the support of the target. Divergence measures paint a similar picture. Thus, our assumption allows us to consider the possibility of cheap generalization to targets which may have a completely different support from the source in the original data space, but are related in some deeper manner. In such scenarios, discrepancy-based analyses may often be overly pessimistic.

2 Related Work

As alluded to above, the theory literature has primarily studied distribution shift through the lens of discrepancy [2, 3, 7, 4, 8, 9, 10]. In the transfer learning literature, in which one considers the possibility of updating a model trained on the source distribution with a relatively small amount of target data, divergence-based analyses have also been prevalent [11, 12, 13, 14]. A notable line of work attains strong guarantees in certain cases where the divergence between source and target distributions is large, but the decision boundaries on the source and target are similar, by homing in on the information about the decision boundary contained in the source distribution [6, 15].

Much of the attention towards the selection of feature representations has been devoted to the problem of “domain generalization”, wherein the learner tries to generalize to a large set of testing environments using samples from a smaller set of source environments, which provide training data [16, 17]. As mentioned above, the IRM literature hinges on the existence of a feature map $\psi^{*}$ and a classifier $h^{*}$ whose composition achieves optimal accuracy over both source and target distributions [5]. Another line of work considers a different assumption, namely the existence of some suitable feature map $\psi^{*}$ for which the conditional distributions of transformed features $\psi^{*}(x)$ given a label are shared across all environments [18, 19, 20].

An important part of this work considers the case where some relatively small amount of labeled target data is available to the learner, and can be exploited in the determination of a suitable feature map. In the theory literature, this setting is most closely explored by the work on “few-shot representation learning” [21, 22], where the goal is to use data on a set of source tasks to learn a low dimensional representation that connects tasks together, allowing for generalization to a related target task without too many extra samples.

In considering the case where the learner has access to unlabeled target samples, we enter the “unsupervised domain adaptation” setting. Here, one often uses unlabeled target data to find some feature space under which source and target supports align [23, 24]. The literature has shown that unlabeled data has provable utility in certain common settings, e.g. under covariate shift [25, 26]. Unsupervised domain adaptation has also been studied through the lens of discrepancy [9, 10].

3 Preliminaries

Let the instance space $(\mathcal{X},d_{\mathcal{X}})$ be a compact metric space, and $\mathcal{Y}$ be a finite label set. A data distribution $\mathcal{D}=(\mu,\eta)$ over $\mathcal{X}\times\mathcal{Y}$ is defined by a Borel measure $\mu$ over $\mathcal{X}$ , and a conditional probability function $\eta(y|x):=\Pr_{(X,Y)\sim\mathcal{D}}[Y=y\mid X=x]$ .

We assume our distributions satisfy some measure-theoretic regularity conditions. In particular, we assume our Borel measures are open measures, and that the Lebesgue Differentiation Theorem always holds. See Appendix B for details to this end.

For a classifier $h:\mathcal{X}\to\mathcal{Y}$ , we define its risk $R(h,\mathcal{D})$ over $\mathcal{D}$ as the probability it misclassifies, i.e. we define $R(h,\mathcal{D}):=\Pr_{(X,Y)\sim\mathcal{D}}[h(X)\neq Y]$ . The classifier with the lowest possible risk is called the Bayes optimal classifier, defined as $g_{\mathcal{D}}(x)=\operatorname*{arg\,max}_{y\in\mathcal{Y}}\eta(y|x)$ .

3.1 Problem Statement and Goal

In this work, we are interested in the problem of distribution shift, in which the goal is to build a classifier with low risk over a target distribution $\mathcal{D}_{t}=(\mu_{t},\eta_{t})$ , primarily using data from a source distribution, $\mathcal{D}_{s}=(\mu_{s},\eta_{s})$ . We denote the Bayes risk on source and target via $R_{s}^{*}$ and $R_{t}^{*}$ .

The challenge in this setting is that $\mu_{s}$ and $\mu_{t}$ can put mass in drastically different regions in $\mathcal{X}$ making direct generalization from the source distribution to the target distribution difficult or impossible in the worst case.

3.2 Feature Maps

We consider classification after first applying a transformation into a feature space $(\mathcal{Z},d_{\mathcal{Z}})$ , also a compact metric space.

We assume we are given $\Phi$ , a class of feature maps $\phi:\mathcal{X}\to\mathcal{Z}$ . Here, each $\phi\in\Phi$ represents a potential feature map under which the source and target distributions could plausibly be connected. Let $d_{\phi}$ denote the distance metric induced on $\mathcal{X}$ by $\phi$ , i.e. $d_{\phi}(x,x^{\prime})=d_{\mathcal{Z}}\left(\phi(x),\phi(x^{\prime})\right)$ . We assume all $\phi\in\Phi$ are continuous, and $\Phi$ is compact with respect to the supremum distance metric. We also include further technical assumptions on $\Phi$ in Appendix B.3.

We note the following important examples of feature map collections, which meet these regularity assumptions when the domain is a compact subset of $\mathbb{R}^{D}$ .

Example 1.

Let $\textnormal{Cor}_{D,K}$ denote the set of all projections $\mathbb{R}^{D}\to\mathbb{R}^{K}$ onto a set of $K$ coordinates. Formally, we may write $\textnormal{Cor}_{D,K}=\{\phi_{J}:J\subseteq[D],\ |J|=K\}$ , where for each $J\subseteq[D]$ with $J=\{j_{1},\dots,j_{K}\}$ , we let $\phi_{J}(x)=(x_{j_{1}},x_{j_{2}},\dots,x_{j_{K}})$ .

Example 2.

Let $\textnormal{Proj}_{D,K}$ denote the set of all linear maps corresponding to matrices in $\mathbb{R}^{D\times K}$ with each entry contained in $[-1,1]$ .
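To make the two classes concrete, the following minimal sketch enumerates $\textnormal{Cor}_{D,K}$ and draws a single element of $\textnormal{Proj}_{D,K}$ . The function names are hypothetical, coordinates are indexed from 0 rather than from 1, and the linear map is applied as a right-multiplication of row vectors by a $D\times K$ matrix; these are conventions of the sketch only.

```python
from itertools import combinations
import numpy as np

def coordinate_projections(D, K):
    """Enumerate Cor_{D,K}: one projection phi_J per K-subset J of {0, ..., D-1}."""
    maps = {}
    for J in combinations(range(D), K):
        # Bind J as a default argument so each closure keeps its own index set.
        maps[J] = lambda X, J=J: np.asarray(X)[..., list(J)]
    return maps

def random_bounded_linear_map(D, K, seed=0):
    """Draw one element of Proj_{D,K}: a D x K matrix with entries in [-1, 1]."""
    A = np.random.default_rng(seed).uniform(-1.0, 1.0, size=(D, K))
    return lambda X: np.asarray(X) @ A
```

The coordinate class is finite, with $\binom{D}{K}$ elements, while the linear class is a continuum parameterized by $DK$ bounded entries.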

For any data distribution $\mathcal{D}=(\mu,\eta)$ over $\mathcal{X}\times\mathcal{Y}$ , we denote via $\mathcal{D}^{\phi}$ the distribution defined via $(\phi(X),Y)$ where $(X,Y)\sim\mathcal{D}$ , often writing $\mathcal{D}^{\phi}=(\mu^{\phi},\eta^{\phi})$ , where $\mu^{\phi}$ and $\eta^{\phi}$ are the induced marginal and conditional distributions of $\mathcal{D}^{\phi}$ . We assume that the induced marginals are also open measures. Measure-theoretic details of induced distributions are discussed in Appendix D.

3.3 Nearest Neighbors

We let $\mathcal{N}_{S}:\mathcal{X}\to\mathcal{Y}$ denote the $k_{n}$ -nearest neighbor classifier arising from an i.i.d. sample $S\sim\mathcal{D}^{n}$ and a metric over the instances, where ties are broken arbitrarily. It is well known that under mild regularity conditions, $k_{n}/n\to 0$ and $k_{n}\to\infty$ imply that $k_{n}$ -nearest neighbors will converge to the Bayes optimal classifier [27]. Motivated by technical concerns, we will make the slightly stronger assumption that $k_{n}/n\to 0$ and $k_{n}/\log(n)\to\infty$ .

Because we consider classification in feature space, we will often consider the composition of $k_{n}$ -NN with maps $\phi\in\Phi$ . To this end, we let $\mathcal{N}_{S}^{\phi}:\mathcal{X}\to\mathcal{Y}$ denote the map defined by

\mathcal{N}_{S}^{\phi}(x)=\mathcal{N}_{\{(\phi(x),y):(x,y)\in S\}}\big(\phi(x)\big).
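As an illustration of this composition, the sketch below builds $\mathcal{N}_{S}^{\phi}$ from a feature map and a labeled sample. It is a minimal version under stated assumptions: the name `knn_in_feature_space` is hypothetical, the feature space is Euclidean, and vote ties go to the smallest label, which is one admissible instance of arbitrary tie-breaking.

```python
import numpy as np

def knn_in_feature_space(phi, X_src, y_src, k):
    """Build x -> N_S^phi(x): k-NN on the sample S after mapping through phi.

    phi maps an (n, D) array to an (n, d) array. Euclidean distance in feature
    space is an assumption of this sketch; any metric d_Z could be substituted.
    """
    Z_src = phi(np.asarray(X_src, dtype=float))
    y_src = np.asarray(y_src)

    def predict(X):
        Z = phi(np.asarray(X, dtype=float))
        # Pairwise feature-space distances, shape (num_queries, n).
        dist = np.linalg.norm(Z[:, None, :] - Z_src[None, :, :], axis=2)
        nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
        preds = []
        for row in y_src[nearest]:
            labels, counts = np.unique(row, return_counts=True)
            preds.append(labels[np.argmax(counts)])  # vote ties go to the smallest label
        return np.array(preds)

    return predict
```

Because distances are computed only between features, a query that is far from every training point in $\mathcal{X}$ can still be classified sensibly when $\phi$ maps it near the training features.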

3.4 Margin Conditions

Finally, we restrict our attention to data distributions in which Bayes-optimal classification is unambiguous, and in which regions where Bayes-optimal predictions differ are separated by a margin. We formalize this as follows.

Definition 1.

A data distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$ is $(\rho,\Delta)$ -separated if there exist $\rho,\Delta>0$ , and disjoint sets $\{\mu^{y}:y\in\mathcal{Y}\}$ , so that the following hold:

1. The sets cover the support: $\textnormal{supp}(\mu)=\cup_{y\in\mathcal{Y}}\mu^{y}$ .
2. On the set where $y$ is the Bayes-optimal decision, no other label has similar conditional probability: If $y\neq y^{\prime}$ , then $\forall x\in\mu^{y}$ , $\eta(y|x)>\eta(y^{\prime}|x)+\Delta$ .
3. These sets themselves are separated by a margin: $\min_{y\neq y^{\prime}}d(\mu^{y},\mu^{y^{\prime}})=\rho$ .

When $\mathcal{D}$ is $(\rho,\Delta)$ -separated, we say that $\mathcal{D}$ has margin $\rho$ , and label margin $\Delta$ . The conditions of well-separated distributions are met in most practical cases, where classification is rarely ambiguous, and arbitrarily close examples are usually classified identically.
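For intuition, the margin $\rho$ has a simple finite-sample proxy: the smallest distance between two sample points with different labels. The sketch below computes it under the assumptions that labels are noise-free (so each observed label equals the Bayes-optimal one) and that the metric is Euclidean; the function name is hypothetical. Under these assumptions the sample quantity can only overestimate $\rho$ , since every cross-label pair of sample points is at distance at least $\rho$ .

```python
import numpy as np

def empirical_margin(Z, y):
    """Smallest distance between two sample points carrying different labels."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y)
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    differs = y[:, None] != y[None, :]
    # With a single label present there is no margin to measure.
    return float(dist[differs].min()) if differs.any() else float("inf")
```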

4 The Statistical IRM Assumption

Generalizing from source data in a feature space induced by some $\phi\in\Phi$ is only possible if $\Phi$ contains a map that appropriately unifies the classification tasks on $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ . We use this section to motivate and define some desirable properties of feature maps vis-à-vis this goal, and to introduce the Statistical IRM Assumption, formalizing our requirement for the existence of quality maps in $\Phi$ .

4.1 Desirable Properties of Feature Maps

In Invariant Risk Minimization, the fundamental assumption is the existence of a feature map $\psi^{*}:\mathcal{X}\to\mathcal{Z}$ and an “invariant predictor” $h^{*}:\mathcal{Z}\to\mathcal{Y}$ for which $h^{*}\circ\psi^{*}$ is Bayes-optimal on all training and testing environments. This allows a learner to assume that selecting a feature space through which good performance on training environments is attainable is not a completely futile approach to constructing a generalizing classifier.

In this spirit, we first interest ourselves in feature maps which preserve the possibility of optimal classification on our single source distribution. We consider a slightly stronger but natural notion that encodes the idea that no information relevant to the classification task on the source should be lost under the mapping.

Definition 2.

We say a feature map $\phi$ source-preserves if the induced source distribution $\mathcal{D}_{s}^{\phi}$ is separated, and the Bayes risk on $\mathcal{D}_{s}^{\phi}$ is equal to that of $\mathcal{D}_{s}$ , i.e.

R(g_{\mathcal{D}_{s}^{\phi}},\mathcal{D}_{s}^{\phi})=R_{s}^{*}.

Let $\mathcal{S}(\Phi)$ denote the set of all source-preserving feature maps in $\Phi$ .

Thus, source-preserving feature maps retain all information needed for optimal classification, in the sense that the risk of the Bayes-optimal classifier in the original space $\mathcal{X}$ and in the feature space $\mathcal{Z}$ is the same under the embedding. We also require that some margin is preserved in the resulting feature space.

While not strictly necessary under the IRM assumption, it is also desirable that an embedding maps examples that are similar with respect to the classification task to similar parts of the feature space, regardless of which distribution they come from. We formalize a condition capturing this idea via the following.

Definition 3.

We say a feature map $\phi$ contracts $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ if the induced source $\mathcal{D}_{s}^{\phi}$ is separated with margin $\rho^{\phi}$ , and for each $z_{t}\in\textnormal{supp}(\mu_{t}^{\phi})$ , there is some $z_{s}\in\textnormal{supp}(\mu_{s}^{\phi})$ such that

d_{\mathcal{Z}}(z_{t},z_{s})<\frac{\rho^{\phi}}{\Lambda},

where $\Lambda>2$ is a fixed constant. Let $\mathcal{C}(\Phi)$ denote the set of all contracting feature maps in $\Phi$ .
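Contraction depends only on the two marginals in feature space, so it can be probed with unlabeled target data. The sketch below is a finite-sample check of the displayed inequality, not a test with any stated guarantee: the function name is hypothetical, the margin `rho` is taken as given, the default `lam=3.0` is an arbitrary choice satisfying $\Lambda>2$ , and distances are Euclidean.

```python
import numpy as np

def contracts_empirically(Z_src, Z_tgt, rho, lam=3.0):
    """True if every target feature lies within rho / lam of some source feature."""
    Z_src = np.asarray(Z_src, dtype=float)
    Z_tgt = np.asarray(Z_tgt, dtype=float)
    # dist[i, j] is the distance from target feature i to source feature j.
    dist = np.linalg.norm(Z_tgt[:, None, :] - Z_src[None, :, :], axis=2)
    return bool((dist.min(axis=1) < rho / lam).all())
```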

Ultimately, we are interested in feature spaces in which we can generalize to the target by classifying target data as we would source data. This possibility is captured by the notion of the invariant predictor $h^{*}$ in the IRM assumption. We interest ourselves in feature maps with a similar property - ones for which the optimal classification decision is locally the same across $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ .

Definition 4.

We say a feature map $\phi$ Bayes-unifies $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ if for all $x_{s}\in\textnormal{supp}(\mu_{s})$ and $x_{t}\in\textnormal{supp}(\mu_{t})$ ,

d_{\phi}(x_{s},x_{t})<\frac{\rho^{\phi}}{2}\implies g_{\mathcal{D}_{s}}(x_{s})=g_{\mathcal{D}_{t}}(x_{t}).

Let $\mathcal{U}(\Phi)$ denote the set of all Bayes-unifying feature maps in $\Phi$ .

Under a feature map which Bayes-unifies, any points which are mapped closer together than half the induced margin are classified the same under the source and target distributions.
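Unlike contraction, this condition involves the target labeling. The following minimal sketch checks the implication on finite samples, assuming that the Bayes-optimal predictions `g_src` and `g_tgt` are available for the sampled points (for instance because labels are noise-free), that the margin `rho` is given, and that feature-space distances are Euclidean. The function name is hypothetical.

```python
import numpy as np

def bayes_unifies_empirically(Z_src, g_src, Z_tgt, g_tgt, rho):
    """True if no source/target pair closer than rho / 2 has differing predictions."""
    Z_src = np.asarray(Z_src, dtype=float)
    Z_tgt = np.asarray(Z_tgt, dtype=float)
    dist = np.linalg.norm(Z_src[:, None, :] - Z_tgt[None, :, :], axis=2)
    close = dist < rho / 2.0
    disagree = np.asarray(g_src)[:, None] != np.asarray(g_tgt)[None, :]
    return not bool((close & disagree).any())
```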

4.2 Stating the Statistical IRM Assumption

It’s intuitive that if a feature map both preserves the Bayes risk on the source, and unifies the classification tasks of source and target, then converging to the Bayes risk on the target is possible when source data populate the support of the induced target.

Thus, we would like a feature map which possesses all of these properties. Our fundamental assumption is that there exists at least one such feature map in $\Phi$ – we term this the Statistical IRM Assumption.

Assumption 1 (Statistical IRM Assumption).

We assume there is some $\phi^{*}\in\Phi$ such that

1. $\phi^{*}$ source-preserves $\mathcal{D}_{s}$
2. $\phi^{*}$ contracts the source $\mathcal{D}_{s}$ and target $\mathcal{D}_{t}$
3. $\phi^{*}$ Bayes-unifies source $\mathcal{D}_{s}$ and target $\mathcal{D}_{t}$

We say that $\phi^{*}$ with all of these properties realizes the Statistical IRM Assumption, and let $\Phi^{*}$ denote the set of all maps in $\Phi$ which realize the Statistical IRM Assumption.

This assumption is an analogue of the IRM assumption, adapted to our single-source, single-target setting. Like IRM, it allows for the possibility of optimal classification on both source and target via the selection of an appropriate feature space. Contraction, which is not an assumption in IRM, allows for that optimal classification to be realized via a local classification scheme such as $k_{n}$ -NN.

4.3 The Statistical IRM Theorem

One would expect that if a learner were handed $\phi^{*}\in\Phi^{*}$ , generalization to the target should be possible with source data alone – because the classification task on $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ is unified in the feature space arising from $\phi^{*}$ , and every example in the target support is mapped close to the training support, the learner is able to construct a good classifier for the target by simply constructing a good classifier on the induced source.

We formalize this intuition via the following theorem, which states that given knowledge of a realizing feature map $\phi^{*}$ , generalization to the target can be accomplished with source data only via the construction of a $k_{n}$ -NN classifier in feature space.

Theorem 1 (Statistical IRM Theorem).

Suppose $\phi^{*}$ realizes the Statistical IRM assumption. Then for all $\epsilon,\delta>0$ , there exists $N$ such that for all $n\geq N$ , with probability $\geq 1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$ ,

R(\mathcal{N}_{S}^{\phi^{*}},\mathcal{D}_{t})\leq R_{t}^{*}+\epsilon.

Thus, from the perspective of target generalization from source data, it suffices to determine a feature map $\phi\in\Phi$ which realizes the Statistical IRM Assumption. In what follows, we characterize the statistical identifiability of $\phi^{*}$ (and thus the learnability of $\mathcal{D}_{t}$ ) under this assumption in each of our data availability settings.

5 The Distance Dimension of $\Phi$

Our investigation of the identifiability of realizing feature maps relies on one further notion – one of embedding classes with bounded complexity. To this end, we introduce a complexity measure on $\Phi$ which will play a key role in each of the settings we consider. We begin with an intermediate definition.

Definition 5.

For a given $\phi:\mathcal{X}\to\mathcal{Z}$ , we define its induced distance comparer $\Delta_{\phi}:\mathcal{X}^{4}\to\{0,1\}$ as the map

\Delta_{\phi}(x_{1},x_{2},x_{3},x_{4})=\mathbbm{1}\left(d_{\mathcal{Z}}\left(\phi(x_{1}),\phi(x_{2})\right)\geq d_{\mathcal{Z}}\left(\phi(x_{3}),\phi(x_{4})\right)\right).

We also define $\Delta\Phi:=\{\Delta_{\phi}:\phi\in\Phi\}$ as the induced distance comparer class of $\Phi$ .

Distance comparers are a natural tool for our analysis – all nearest-neighbor computations inside the feature space $\mathcal{Z}$ can be expressed in such terms. This observation gives rise to a natural complexity measure for the determination of a suitable feature map, which we term the distance dimension.

Definition 6.

The distance dimension of $\Phi$ , denoted $\partial(\Phi)$ , is the VC dimension of the induced comparer class $\Delta\Phi$ .
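To illustrate why comparers suffice for nearest-neighbor computations, the sketch below implements $\Delta_{\phi}$ and then finds a nearest neighbor using comparer calls alone, never reading a distance value directly. The names are hypothetical, $\phi$ is assumed to act on a single point given as a 1-D array, and $d_{\mathcal{Z}}$ is taken to be Euclidean.

```python
import numpy as np

def distance_comparer(phi):
    """Return Delta_phi: 1 if d_Z(phi(x1), phi(x2)) >= d_Z(phi(x3), phi(x4)), else 0."""
    def delta(x1, x2, x3, x4):
        d12 = np.linalg.norm(phi(x1) - phi(x2))
        d34 = np.linalg.norm(phi(x3) - phi(x4))
        return int(d12 >= d34)
    return delta

def nearest_index(delta, query, points):
    """Find a nearest neighbor of `query` using only calls to the comparer."""
    best = 0
    for i in range(1, len(points)):
        # Switch when the current best is at least as far from the query as points[i].
        if delta(query, points[best], query, points[i]) == 1:
            best = i
    return best
```

Two feature maps that induce the same comparer return the same neighbors on every input, which is why the complexity of $\Delta\Phi$ , rather than of $\Phi$ itself, is the relevant quantity.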

In providing upper bounds, it will be important that the distance dimension be finite. We note that it is easily bounded for the two important classes of feature maps mentioned in Section 3.

Theorem 2.

Suppose $\textnormal{Cor}_{D,K}$ and $\textnormal{Proj}_{D,K}$ are defined as in Examples 1 and 2, respectively. Then

\partial(\textnormal{Cor}_{D,K})\leq K\log D\ \text{ and }\ \partial(\textnormal{Proj}_{D,K})\leq D^{2}.
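The first bound can be recovered by a short counting argument. We sketch it as an illustration only: it may differ from the proof given for the theorem, and it assumes the logarithm is taken base 2. A class of $m$ binary functions can shatter at most $\log_{2}m$ points, since shattering $d$ points requires $2^{d}$ distinct labelings, and each $\phi_{J}$ induces exactly one comparer. Hence

```latex
|\Delta\textnormal{Cor}_{D,K}| \;\leq\; |\textnormal{Cor}_{D,K}| \;=\; \binom{D}{K} \;\leq\; D^{K},
\qquad\text{so}\qquad
\partial(\textnormal{Cor}_{D,K}) \;\leq\; \log_{2}\binom{D}{K} \;\leq\; K\log_{2}D .
```

This argument relies on the class being finite and so says nothing about $\textnormal{Proj}_{D,K}$ , which is a continuum.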

6 Direct Generalization from $\mathcal{D}_{s}$

We first study the possibility of constructing a classifier that generalizes to $\mathcal{D}_{t}$ using only labeled samples from $\mathcal{D}_{s}$ . In this setting, a learner $L$ takes input $S\sim\mathcal{D}_{s}^{n}$ , and outputs a classifier $L(S):\mathcal{X}\to\mathcal{Y}$ , with the goal of achieving a small risk on $\mathcal{D}_{t}$ .

While one might hope that the Statistical IRM assumption alone is sufficient for generalization to the target, this is unfortunately false. In fact, we have already seen an example of this phenomenon in Figure 1(c). Here, we essentially argued that $\phi_{y}$ was a realizing feature map: it preserves the source risk, unifies the classification tasks on source and target, and maps all target points close to source points. However, we argued that $\phi_{x}$ and $\phi_{y}$ were statistically indistinguishable in this setting, leaving the learner in need of more information.

Towards a general characterization of learnability in this setting, recall our discussion of Figure 1(a). Here, the projection $\phi_{y}$ could be determined as realizing the Statistical IRM assumption given that it was the only map in $\Phi$ that preserved the source distribution – it’s clear from the figure that $\phi_{x}$ does not preserve the source, and so cannot possibly realize the Statistical IRM assumption. It is vital that this reasoning could be carried out with source data alone.

More generally, by the Statistical IRM Theorem, it is sufficient for generalization from source data alone that the learner be able to identify $\phi^{*}$ from source data alone. Such a realizing feature map must of course satisfy all three requirements of Assumption 1. However, note that only one of these requirements, namely source-preservation, depends on the source distribution alone – the others, namely contraction and Bayes-unification, are defined in terms of the target distribution. As such, only source-preservation can be tested using source data.

That said, if the learner can be assured that all source-preserving feature maps realize the Statistical IRM assumption, i.e. $\Phi^{*}=\mathcal{S}(\Phi)$ , it can identify realizing feature maps by identifying source-preserving feature maps. We formalize this intuitive idea with the following theorem, which shows that PAC guarantees for target generalization are obtainable when the additional condition $\Phi^{*}=\mathcal{S}(\Phi)$ holds.

Theorem 3.

Suppose the Statistical IRM Assumption holds, the distance dimension $\partial(\Phi)<\infty$ , and that $\Phi^{*}=\mathcal{S}(\Phi).$ Then there is a learning rule $L$ such that for every $\epsilon,\delta>0$ , there exists $N$ such that if $n\geq N$ , with probability $\geq 1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$ ,

R(L(S),\mathcal{D}_{t})\leq R_{t}^{*}+\epsilon.

We relegate specification of the learning rule to the Appendix. It is founded on minimizing the empirical risk on the source data over feature maps in $\Phi$ , but further leverages the knowledge that source-preserving feature maps induce separated distributions over feature space. After selecting a candidate feature map which empirically matches the requirements for source preservation, it uses $k_{n}$ -NN in the implied feature space to make predictions.
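The flavor of such a rule can be conveyed with a deliberately simplified stand-in. The sketch below is not the learning rule of the Appendix: it only scores each candidate map by the leave-one-out $k$ -NN error on the source sample and returns the minimizer, and it omits the use of separation altogether. The function names, Euclidean feature-space distances, and non-negative integer labels are assumptions of this sketch.

```python
import numpy as np

def loo_knn_error(Z, y, k):
    """Leave-one-out k-NN error on (Z, y); labels are non-negative integers."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y)
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point may not vote for itself
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    preds = np.array([np.bincount(row).argmax() for row in y[nearest]])
    return float(np.mean(preds != y))

def select_by_source_risk(feature_maps, X_src, y_src, k):
    """Return the index of the map with the smallest leave-one-out source error."""
    X_src = np.asarray(X_src, dtype=float)
    errors = [loo_knn_error(phi(X_src), y_src, k) for phi in feature_maps]
    return int(np.argmin(errors)), errors
```

On data resembling Figure 1(a), where projecting onto the $x$ -axis mixes the two classes, this score rejects $\phi_{x}$ from source data alone. When several maps attain the same source error, the sketch offers no principled way to choose among them, which is precisely the situation the condition $\Phi^{*}=\mathcal{S}(\Phi)$ rules out.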

In light of the discussion above, the condition that $\Phi^{*}=\mathcal{S}(\Phi)$ is intuitively necessary as well – without it, there may be some feature map which is source-preserving, but which e.g. fails to Bayes-unify the source and the target. It’s simple to see that attempting to generalize to the target via such a feature space could be catastrophic, and thus that blindly choosing between source-preserving feature maps will eventually lead the learner astray. On the other hand, not classifying through a feature space subjects the learner to the standard pitfalls of out-of-distribution generalization. We formalize these ideas via the following hardness result.

Theorem 4.

Fix a source $\mathcal{D}_{s}$ , a target $\mathcal{D}_{t}$ , and some embedding class $\Phi$ for which the Statistical IRM assumption holds. Suppose that $\mathcal{S}(\Phi)\setminus\Phi^{*}$ is non-empty, and that a learner $L$ successfully generalizes to $\mathcal{D}_{t}$ (with high probability) using only samples from $\mathcal{D}_{s}$ . Then for all $\epsilon>0$ , there exist data distributions $\mathcal{D}_{s}^{\prime},\mathcal{D}_{t}^{\prime}$ such that the following hold:

1. $W(\mathcal{D}_{s},\mathcal{D}_{s}^{\prime})<\epsilon$ .
2. There is a $\phi\in\Phi$ realizing the Statistical IRM assumption on alternative source $\mathcal{D}_{s}^{\prime}$ and alternative target $\mathcal{D}_{t}^{\prime}$ .
3. For all $N$ , there exists $n>N$ such that with probability at least $\frac{1}{4}$ over $S\sim\mathcal{D}_{s}^{n}$ ,

R(L(S),\mathcal{D}_{t}^{\prime})>R(g_{\mathcal{D}_{t}^{\prime}},\mathcal{D}_{t}^{\prime})+\frac{1}{4}.

Thus, in the case that some feature maps preserve the source but do not realize the Statistical IRM assumption, there is always some nearly identical problem instance where the Statistical IRM assumption is realized by $\Phi$ , but which causes a given learner to have unbounded sample complexity (for some choice of $\epsilon$ and $\delta$ ).

7 Combining Labeled Samples from $\mathcal{D}_{s}$ and Unlabeled Samples from $\mathcal{D}_{t}$

We now consider the less restrictive unlabeled target data are also available. Here, a learner $L$ takes input $S\sim\mathcal{D}_{s}^{n}$ and $U\sim\mu_{t}^{m}$ , and outputs a classifier $L(S,U):\mathcal{X}\to\mathcal{Y}$ .

The story given additional access to unlabeled data is similar to source-only setting: the Statistical IRM assumption alone is insufficient for guaranteeing successful generalization when additional unlabeled target data are available. In other words, the combination of labeled source and unlabeled target is generally insufficient for identifying a $\phi\in\Phi$ that realizes the Statistical IRM assumption.

For a simple example to this end, we return to panel (c) of Figure 1. For the source and target distributions shown, it is evident that no amount of labeled data from the source and unlabeled data from the target will allow us to decide whether we should project data onto the $x$-axis or the $y$-axis. This is because the only difference between them is the manner in which $\mathcal{D}_{t}$ is labeled. By contrast, the example depicted by Figure 1 panel (b) illustrates a case in which the additional unlabeled data from $\mathcal{D}_{t}$ proves sufficient: because projecting onto the $x$-axis fails to map target points close to source points, we can conclude that $\phi$ must be the projection onto the $y$-axis.

As in the source-only setting, identifying a feature map realizing the Statistical IRM assumption requires testing the three conditions of Assumption 1. The key to understanding the utility of additional unlabeled target data is that it allows the learner to test not only which feature maps preserve the source, but also which maps contract $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$. On the other hand, it is insufficient to determine which feature maps Bayes-unify, as this notion intrinsically depends on labeling under $\mathcal{D}_{t}$. This motivates a sufficient condition for learnability similar to the one we saw in the previous section: namely, that all feature maps which both preserve the source and contract the source and target also Bayes-unify. The following theorem shows that this is indeed a sufficient condition for learnability in this setting.

Theorem 5.

Suppose the Statistical IRM Assumption holds, the distance dimension $\partial(\Phi)<\infty$ , and that $\Phi^{*}=\mathcal{S}(\Phi)\cap\mathcal{C}(\Phi).$ Then there is a learning rule $L$ such that for all $\epsilon,\delta>0$ , there exist $N$ and $M$ , such that if $n\geq N$ and $m\geq M$ , with probability $\geq 1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$ and $U\sim\mu_{t}^{m}$ ,

R(L(S,U),\mathcal{D}_{t})\leq R_{t}^{*}+\epsilon.

The learning rule is specified in detail in the Appendix. It proceeds by first selecting a feature map which both empirically preserves the source, and maps each unlabeled target point in $U$ close to some source point in $S$. As above, it uses $k_{n}$-NN in the selected feature space to make predictions.
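The contraction half of this selection step can be sketched in a few lines. The following is only an illustration under stated assumptions, not the rule from the Appendix: the candidate class is a small finite dictionary, the tolerance `radius` is a hypothetical parameter, and the source-preservation test is omitted entirely. All function names are introduced here for illustration.

```python
import math

def feature_dist(phi, x, xp):
    """d_phi(x, x'): Euclidean distance between phi(x) and phi(x')."""
    return math.dist(phi(x), phi(xp))

def max_gap_to_source(phi, source_xs, target_xs):
    """Largest distance, under phi, from an unlabeled target point to its nearest source point."""
    return max(min(feature_dist(phi, u, s) for s in source_xs) for u in target_xs)

def select_contracting_maps(candidates, source_xs, target_xs, radius):
    """Keep the candidate maps sending every target point within `radius` of some source point.

    `radius` is a hypothetical tolerance; the source-preservation check is not modeled here.
    """
    return [name for name, phi in candidates.items()
            if max_gap_to_source(phi, source_xs, target_xs) <= radius]

# Toy data: the target is shifted along the x-axis but shares its y-values with the source.
proj_x = lambda p: (p[0],)
proj_y = lambda p: (p[1],)
source_xs = [(0.0, 0.1), (0.2, 0.5), (0.4, 0.9)]
target_xs = [(3.0, 0.1), (3.5, 0.9)]
kept = select_contracting_maps({"proj_x": proj_x, "proj_y": proj_y},
                               source_xs, target_xs, radius=0.1)
print(kept)
```

On this toy data only the projection onto the $y$-axis survives, since projecting onto the $x$-axis leaves every target point far from all source points.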

In accordance with the intuition developed above, the condition that $\Phi^{*}=\mathcal{S}(\Phi)\cap\mathcal{C}(\Phi)$ is necessary. The issue of course is that without access to labeled target data, testing whether a feature map Bayes-unifies is impossible. We formalize this via the following hardness result.

Theorem 6.

Suppose $\Phi$ realizes the Statistical IRM Assumption for $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ , and that there is some $\phi\in\mathcal{S}(\Phi)\cap\mathcal{C}(\Phi)$ for which $\phi\not\in\Phi^{*}$ . Then for all learners $L$ , there exists a conditional data distribution, $\eta^{\prime}$ such that the following hold:

1. $\eta^{\prime}(y|x)=\eta(y|x)$ for all $x\in\textnormal{supp}(\mu_{s})$.
2. $\Phi$ realizes the Statistical IRM Assumption for $\mathcal{D}_{s}$ and $\mathcal{D}_{t}^{\prime}=(\mu_{t},\eta^{\prime})$.
3. There exist $\delta,\epsilon>0$ such that for arbitrarily large values of $n$ and $m$, with probability at least $\delta$ over $S\sim\mathcal{D}_{s}^{n}$ and $U\sim\mu_{t}^{m}$,

R(L(S,U),\mathcal{D}_{t}^{\prime})>R(g_{\mathcal{D}_{t}^{\prime}},\mathcal{D}_{t}^{\prime})+\epsilon.

Theorem 6 shows that no combination of embedding class $\Phi$ and learner $L$ can circumvent the impossibility of testing Bayes unification with unlabeled target data. For any embedding class and learning algorithm, one can always find a pair of source and target distributions on which $\Phi$ realizes the Statistical IRM Assumption, but on which the learning algorithm will fail.

8 Efficient Use of Labeled Samples from $\mathcal{D}_{t}$

Algorithm 1 Selection of an Appropriate Feature Map via Target Loss Validation

1: procedure feature_validate($S\sim\mathcal{D}_{s}^{n}$, $T\sim\mathcal{D}_{t}^{m}$)
2: $\hat{\phi}=\operatorname*{arg\,min}_{\phi\in\Phi}\frac{1}{m}\sum_{(x,y)\in T}\mathbbm{1}\left(\mathcal{N}^{\phi}_{S}(x)\neq y\right)$
3: return $\mathcal{N}^{\hat{\phi}}_{S}$
4: end procedure

The discussion above implies that even under the Statistical IRM assumption, there are many situations where labeled target data is required for generalization. In such cases, we would hope that we can exploit the information encoded in the Statistical IRM Assumption to achieve generalization through labeled source data and a small amount of labeled target data. In this section we show that the Statistical IRM assumption allows for significant convergence rate speed-ups in many settings.

Recall that the Statistical IRM theorem states that given a realizing feature map, generalization to the target can be accomplished purely through source data; the limitation imposed by a lack of labeled target data is the difficulty in identifying such a feature map. This inspires the strategy of allocating all of the labeled target data towards determining a realizing feature map.

In this spirit, we analyze the natural scheme of constructing a classifier by composing nearest-neighbors trained solely on source data with the map $\phi\in\Phi$ that minimizes the empirical risk over $T\sim\mathcal{D}_{t}^{m}$ , finding that the number of target examples required for guarantees can be controlled in terms of the “distance dimension” of the class $\partial(\Phi)$ .
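A minimal sketch of this scheme is given below. It assumes a finite dictionary of candidate feature maps standing in for $\Phi$, Euclidean feature distances, and a fixed neighbor count `k` in place of $k_{n}$; the helper names `knn_predict` and `target_error` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def knn_predict(phi, S, x, k):
    """k-NN majority vote in the feature space of phi, using source data S only.

    Distance ties are broken by position in S, which plays the role of a fixed ordering.
    """
    ranked = sorted(range(len(S)), key=lambda i: (math.dist(phi(x), phi(S[i][0])), i))
    votes = Counter(S[i][1] for i in ranked[:k])
    # Label ties are broken by a fixed ordering of the labels (smallest label wins).
    return max(sorted(votes), key=lambda y: votes[y])

def feature_validate(candidates, S, T, k):
    """Pick the candidate map whose source-trained k-NN has the lowest empirical target error."""
    def target_error(name):
        phi = candidates[name]
        return sum(knn_predict(phi, S, x, k) != y for x, y in T) / len(T)
    best = min(candidates, key=target_error)
    return best, (lambda x: knn_predict(candidates[best], S, x, k))

# Toy data: the label depends on the y-coordinate; x is spuriously correlated in the source only.
S = [((0.1, 0.1), 0), ((0.2, 0.3), 0), ((0.3, 0.2), 0),
     ((0.7, 0.8), 1), ((0.8, 0.7), 1), ((0.9, 0.9), 1)]
T = [((0.9, 0.2), 0), ((0.1, 0.8), 1), ((0.8, 0.1), 0), ((0.2, 0.9), 1)]
candidates = {"proj_x": lambda p: (p[0],), "proj_y": lambda p: (p[1],)}
best, classifier = feature_validate(candidates, S, T, k=3)
print(best, classifier((0.5, 0.95)))
```

Note that the target sample `T` is used only to choose among the candidate maps; the returned classifier still votes over source points alone.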

Theorem 7.

Suppose $\Phi$ realizes the Statistical IRM assumption. Then for every $\epsilon,\delta>0$ , there exists $N$ such that if

n\geq N,\quad m\geq\Omega\left(\frac{\partial(\Phi)\log\left(n+\partial(\Phi)\right)+\log\frac{1}{\delta}}{\epsilon^{2}}\right),

then with probability at least $1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$ , $T\sim\mathcal{D}_{t}^{m}$ ,

R(\mathcal{N}^{\hat{\phi}}_{S},\mathcal{D}_{t})\leq R_{t}^{*}+\epsilon,

where $\mathcal{N}^{\hat{\phi}}_{S}$ is the output of $\textsc{feature\_validate}(S,T)$.

Thus, the amount of labeled target data required for generalization when $\Phi$ realizes the Statistical IRM assumption can be largely controlled through our complexity measure on the class $\Phi$ . We say “largely” given that $m$ , the amount of data required from $\mathcal{D}_{t}$ , has a logarithmic dependence on $n$ , the amount of data drawn from $\mathcal{D}_{s}$ . This implies a near distributional-independence between source and target in the sample complexity.
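To get a feel for the logarithmic dependence on $n$, the expression inside the $\Omega(\cdot)$ of Theorem 7 can be evaluated numerically. This is only an illustration: the leading constant `c` is unspecified in the theorem and is set to 1 here as an assumption, and the function name is hypothetical.

```python
import math

def target_sample_bound(dist_dim, n, eps, delta, c=1.0):
    """Evaluate c * (d * log(n + d) + log(1/delta)) / eps^2.

    `c` stands in for the unspecified constant hidden by the Omega notation.
    """
    return c * (dist_dim * math.log(n + dist_dim) + math.log(1.0 / delta)) / eps ** 2

# Growing the source sample by factors of 1000 changes the target requirement only mildly.
for n in (10**3, 10**6, 10**9):
    print(n, round(target_sample_bound(dist_dim=5, n=n, eps=0.1, delta=0.05)))
```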

We note that the above margin assumptions are not required for the analysis leading to Theorem 7. Compare this, for example, to rates of convergence under the canonical Tsybakov noise assumption in $\mathcal{X}=\mathbb{R}^{d}$, under which nonparametric classifiers necessarily incur rates of $\tilde{\Omega}(m^{-1/(1+d)})$. The guarantees of Theorem 7 thus represent significant convergence rate speed-ups over naively training a non-parametric classifier with target data in many cases where the distance dimension $\partial(\Phi)$ is polynomial in the dimension of the instance space [28].

9 Discussion

In this work, we study the problem of distribution shift under a variant of the IRM assumption, wherein it is known that a feature map in a class $\Phi$ unifies classification on source and target. We investigate the identifiability of such maps, characterizing learnability in settings where worst-case approaches indicate that learning should be impossible or expensive.

Our work suggests that the study of IRM-like assumptions is a promising direction for shedding light on new situations where guaranteeing generalization under distribution shift is possible. It also highlights that a primary issue in learning under IRM-like assumptions may be the statistical identifiability of suitable feature maps.

Acknowledgements: This work was supported by the National Science Foundation under the following grants: NSF CIF-2402817, SaTC-2241100, CCF-2217058, and ARO-MURI W911NF2110317.

RB was also partially supported by the German Research Foundation through the Cluster of Excellence “Machine Learning - New Perspectives for Science” (EXC 2064/1 number 390727645).

References

[1] L. G. Valiant. A theory of the learnable. Communications of the ACM, pages 1134–1142, 1984.
[2] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. NIPS’06, page 137–144, Cambridge, MA, USA, 2006. MIT Press.
[3] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Vaughan. A theory of learning from different domains. Machine Learning, 79:151–175, 05 2010.
[4] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence, 2012.
[5] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization, 2020.
[6] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. CoRR, abs/2002.04747, 2020.
[7] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset shift in machine learning. 2009.
[8] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
[9] A. Tuan Nguyen, Toan Tran, Yarin Gal, Philip H. S. Torr, and Atılım Güneş Baydin. KL guided domain adaptation, 2022.
[10] Ziqiao Wang and Yongyi Mao. Information-theoretic analysis of unsupervised domain adaptation, 2023.
[11] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, 2007.
[12] Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions, 2012.
[13] Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation based on generalized discrepancy. Journal of Machine Learning Research, 20(1):1–30, 2019.
[14] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms, 2023.
[15] Steve Hanneke, Samory Kpotufe, and Yasaman Mahdaviyeh. Limits of model selection under transfer learning. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 5781–5812. PMLR, 12–15 Jul 2023.
[16] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation, 2013.
[17] Han Zhao, Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura, and Geoffrey J. Gordon. Multiple source domain adaptation with adversarial training of neural networks, 2017.
[18] Yining Chen, Elan Rosenfeld, Mark Sellke, Tengyu Ma, and Andrej Risteski. Iterative feature matching: Toward provable domain generalization with logarithmic environments, 2021.
[19] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation, 2015.
[20] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex Kot. Domain generalization with adversarial feature learning, 2018.
[21] Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably, 2021.
[22] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning, 2016.
[23] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. ICML’11, 2011.
[24] Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hye** Oh, Georges El Fakhri, Je-Won Kang, and Jonghye Woo. Deep unsupervised domain adaptation: A review of recent advances and perspectives, 2022.
[25] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. ALT’12, 2012.
[26] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Scholkopf. Correcting sample selection bias by unlabeled data. NIPS’06, 2006.
[27] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3437–3445, 2014.
[28] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers under the margin condition, 2011.
[29] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. 2014.

Appendix A Further Notation

Given $x\in\mathcal{X}$ , we let $B(x,r)=\{x^{\prime}:d(x,x^{\prime})\leq r\}$ denote the closed ball centered at $x$ of radius $r$ . For a feature map, $\phi\in\Phi$ , we also let $B_{\phi}(x,r)=\{x^{\prime}:d_{\phi}(x,x^{\prime})\leq r\}$ denote the set of all points with distance (under $\phi$ ) at most $r$ from $x$ .

Recall that for $\phi\in\Phi$ , we let $d_{\phi}$ denote the metric over $\mathcal{X}$ induced by $\phi$ , i.e. $d_{\phi}(x,x^{\prime})=d\left(\phi(x),\phi(x^{\prime})\right)$ . We extend this in the natural way to sets, letting

d_{\phi}(A,B)=\inf_{a\in A,b\in B}d_{\phi}(a,b).

Finally, for a pair of feature maps $\phi,\phi^{\prime}$ , we let $d(\phi,\phi^{\prime})$ denote the supremum metric between $\phi$ and $\phi^{\prime}$ . That is,

d(\phi,\phi^{\prime})=\sup_{x\in\mathcal{X}}d_{\mathcal{Z}}\left(\phi(x),\phi^{\prime}(x)\right).

Appendix B Further Technical Assumptions

B.1 Lebesgue Differentiation Theorem

We assume that for any finite Borel measure $\mu$ over $\mathcal{X}$ , the Lebesgue differentiation theorem holds. That is, for all measurable functions $f:\mathcal{X}\to\mathbb{R}$ , up to a null set under $\mu$ ,

\lim_{r\to 0^{+}}\frac{1}{\mu(B(x,r))}\int_{B(x,r)}f(x^{\prime})\,d\mu(x^{\prime})=f(x).

B.2 Open Measures

As referenced in Section 3 above, we assume that our Borel measures satisfy a further regularity condition – namely, that they are open measures.

Definition 7.

A Borel measure, $\mu$, over metric space $(\mathcal{M},d)$ is open if for all measurable sets $A$, $\mu(A)>0$ if and only if there exists $x\in\mathcal{M}$ and $r>0$ such that $B(x,r)\subseteq A$ and $\mu(B(x,r))>0$.

A very typical example of such a measure is any distribution that has a finite density function. In this work, we will restrict ourselves to considering open measures with the following assumption: $\mu_{s}$ and $\mu_{t}$ are open, and for all $\phi\in\Phi$ , the induced source and target measures, $\mu_{s}^{\phi}$ and $\mu_{t}^{\phi}$ are open over the metric space $(\phi(\mathcal{X}),d)\subseteq(\mathcal{Z},d)$ . Here we are noting that $\mu_{s}^{\phi}$ and $\mu_{t}^{\phi}$ are only non-zero over subsets of the image of $\phi$ , $\phi(\mathcal{X})$ , and thus we restrict our attention to $\phi(\mathcal{X})$ when considering openness.

This technical assumption allows us to simplify our results, as it prohibits cases in which $\mu_{t}^{\phi}$ can be a pathological distribution that concentrates in an area of $\textnormal{supp}(\mu_{s}^{\phi})$ that leads to bad generalization. We also believe that such an assumption is relatively mild: all distributions over $\mathcal{Z}$ are arbitrarily close to open Borel measures, since we can simply add spherical noise to each sampled point.

B.3 Assumptions on $\Phi$

We include two further technical assumptions about $\Phi$ . We begin by assuming that all feature maps send an infinite number of points to a given $z\in\mathcal{Z}$ .

Assumption 2.

For all $x\in\mathcal{X}$ and all $\phi\in\Phi$ , the set of points $x^{\prime}$ that have the same image as $x$ in $\mathcal{Z}$ under $\phi$ is infinite. That is,

|\{x^{\prime}:\phi(x^{\prime})=\phi(x)\}|=\infty.

Observe that this assumption is clearly met by the examples given in Section 3.2. Furthermore, it is likely to be met by any reasonable family of continuous maps that perform any kind of dimension reduction.

Next, we define dominance, which will be useful for formulating our other assumption.

Definition 8.

We say that feature map $\phi_{1}$ dominates feature map $\phi_{2}$ at point $x$ if

\{x^{\prime}:\phi_{1}(x^{\prime})=\phi_{1}(x)\}\supseteq\{x^{\prime}:\phi_{2}(x^{\prime})=\phi_{2}(x)\}.
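The dominance relation can be checked mechanically on a small finite domain. The sketch below is purely illustrative: a finite grid necessarily violates Assumption 2 (infinite fibers), and the helper names `fiber` and `dominates` are introduced here rather than taken from the paper.

```python
def fiber(phi, x, domain):
    """Points of the (finite) domain that share x's image under phi."""
    return {xp for xp in domain if phi(xp) == phi(x)}

def dominates(phi1, phi2, x, domain):
    """True if phi1 dominates phi2 at x: phi1's fiber at x contains phi2's fiber at x."""
    return fiber(phi1, x, domain) >= fiber(phi2, x, domain)

# A 3x3 grid with three simple maps.
domain = [(i, j) for i in range(3) for j in range(3)]
proj_x = lambda p: p[0]
proj_y = lambda p: p[1]
const = lambda p: 0

x0 = (0, 0)
print(dominates(const, proj_x, x0, domain))   # the constant map collapses everything
print(dominates(proj_x, proj_y, x0, domain))  # the two projections have crossing fibers
```

The constant map dominates every other map at every point, while the two coordinate projections dominate each other nowhere; an indomitable class is one in which the first kind of situation can always be avoided by an arbitrarily small perturbation.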

We now define an embedding class to be indomitable when it avoids instances of one feature map dominating another.

Definition 9.

$\Phi$ is indomitable if for all distinct $\phi_{1},\phi_{2}\in\Phi$ and for all $x\in\mathcal{X}$, the following holds. For all $\epsilon>0$, there exist maps $\phi_{1}^{\epsilon},\phi_{2}^{\epsilon}\in\Phi$ such that:

1. $d(\phi_{1},\phi_{1}^{\epsilon}),d(\phi_{2},\phi_{2}^{\epsilon})<\epsilon$.
2. $\phi_{1}^{\epsilon}$ does not dominate $\phi_{2}^{\epsilon}$ at $x$.
3. $\phi_{2}^{\epsilon}$ does not dominate $\phi_{1}^{\epsilon}$ at $x$.

We will now assume that $\Phi$ is indeed indomitable.

Assumption 3.

$\Phi$ is indomitable.

Observe that this assumption is satisfied by both examples of feature maps given in Section 3.2. More generally, the fact that our definition permits a lack of dominance to hold for some two maps that are close to $\phi_{1}$ and $\phi_{2}$ makes our definition mild enough to hold for most continuous classes of feature maps.

Appendix C $k_{n}$ -nearest neighbors

First, we fix $k_{n}$ as a sequence of integers with the following properties.

Definition 10.

Let $k_{n}$ be a sequence of integers so that $\lim_{n\to\infty}\frac{k_{n}}{\log n}=\infty$ , and $\lim_{n\to\infty}\frac{k_{n}}{n}=0$ .

Observe that $k_{n}=\log^{2}n$ would suffice as an example of such a sequence.

Next, our goal is to define the $k_{n}$-nearest neighbors classifier over a labeled data set of $n$ points, $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$. To do so, we begin by describing a tie-breaking procedure used in cases where training points are equidistant from a given test point.

Definition 11.

An ordering $\pi$ , over a dataset $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ is any ordered permutation of $S$ . We say that $(x_{i},y_{i})<_{\pi}(x_{j},y_{j})$ if $(x_{i},y_{i})$ occurs before $(x_{j},y_{j})$ in the permutation.

We now show how to use $\pi$ to break ties when computing nearest neighbors.

Definition 12.

Let $x\in\mathcal{X}$. Let $\pi$ be an ordering over dataset $S$. For $(x_{i},y_{i}),(x_{j},y_{j})\in S$, we say that $d(x,x_{i})<_{\pi}d(x,x_{j})$ if either of the following two conditions holds:

1. $d(x,x_{i})<d(x,x_{j})$.
2. $d(x,x_{i})=d(x,x_{j})$ and $(x_{i},y_{i})<_{\pi}(x_{j},y_{j})$.

In essence, ties are broken by choosing the datapoint that appears earlier in the ordering. We now define a nearest neighbor as follows.

Definition 13.

Let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ be a dataset, and let $\pi$ be an ordering of $S$ . For $x\in\mathcal{X}$ , we say that $(x_{i},y_{i})$ is a $k_{n}$ -nearest neighbor of $x$ if

|\{j:d(x,x_{j})<_{\pi}d(x,x_{i})\}|<k_{n}.

We also let $S_{k_{n}}^{\pi}(x)$ denote the set of all $k_{n}$ -nearest neighbors of $x$ when using the ordering $\pi$ .

Observe that by construction, $|S_{k_{n}}^{\pi}(x)|=k_{n}$ . This is because the ordering $<_{\pi}$ allows us to strictly order points based on their distances from $x$ with ties broken by $\pi$ .

We are now ready to define the nearest neighbors classifier.

Definition 14.

Let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ be a dataset, and $\pi$ an ordering over $S$ . Then for $x\in\mathcal{X}$ , we define

\mathcal{N}_{S,\pi}(x)=\operatorname*{arg\,max}_{y\in\mathcal{Y}}\sum_{(x_{i},y_{i})\in S_{k_{n}}^{\pi}(x)}\mathbbm{1}(y=y_{i}).

Here, we break ties in $\mathcal{Y}$ arbitrarily (which could be done with an ordering of $\mathcal{Y}$ ).

Throughout the paper, we typically omit $\pi$ from our notation for $\mathcal{N}_{S,\pi}(x)$. This is because in all cases, we assume that some ordering $\pi$ is implicitly chosen (independent of the data points) ahead of time.
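Definitions 10 through 14 translate fairly directly into code. The sketch below makes several assumptions for concreteness: the ordering $\pi$ is represented as a list of indices into $S$, distances are Euclidean, $k_{n}$ is taken to be roughly $\log^{2}n$ (clipped to lie between 1 and $n$), and label ties follow an explicit `label_order`. The function names are illustrative only.

```python
import math
from collections import Counter

def k_n(n):
    """One admissible choice of k_n: about log^2 n, clipped to the range [1, n]."""
    return max(1, min(n, math.ceil(math.log(n) ** 2)))

def nearest_neighbors(S, pi, x, k):
    """Indices of the k nearest neighbors of x; distance ties go to the point earlier in pi."""
    rank = {idx: pos for pos, idx in enumerate(pi)}  # position of each index in the ordering
    return sorted(range(len(S)), key=lambda i: (math.dist(x, S[i][0]), rank[i]))[:k]

def knn_classify(S, pi, x, k, label_order):
    """Majority vote over the k nearest neighbors; label ties follow label_order."""
    votes = Counter(S[i][1] for i in nearest_neighbors(S, pi, x, k))
    return max(label_order, key=lambda y: votes[y])

# The query 1.0 is equidistant from the first two points, so the ordering decides the neighbor.
S = [((0.0,), "a"), ((2.0,), "b"), ((5.0,), "b")]
x = (1.0,)
print(knn_classify(S, [0, 1, 2], x, 1, ["a", "b"]))
print(knn_classify(S, [1, 0, 2], x, 1, ["a", "b"]))
```

Because the sort key is the pair (distance, position in $\pi$), the induced order on training points is strict, which is what guarantees that exactly $k$ neighbors are returned.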

C.1 Composing with feature maps.

We now define the classifier $\mathcal{N}_{S}^{\phi}$, where $\phi:\mathcal{X}\to\mathcal{Z}$ is a feature map. One important detail in doing so is that we will continue to use an ordering over $S$, rather than an ordering over $\phi(S)=\{(\phi(x_{1}),y_{1}),\dots,(\phi(x_{n}),y_{n})\}$. This will allow us to use a single ordering throughout all of our learning algorithms that deal with learning a feature map.

Recall that for any feature map, $\phi$ , $d_{\phi}:\mathcal{X}^{2}\to[0,\infty)$ denotes the distance metric

d_{\phi}(x,x^{\prime})=d_{\mathcal{Z}}\left(\phi(x),\phi(x^{\prime})\right).

Using this, we give analogs to Definitions 12 and 13 by essentially replacing $d$ with $d_{\phi}$.

Definition 15.

Let $x\in\mathcal{X}$ and $\phi\in\Phi$. Let $\pi$ be an ordering over dataset $S$. For $(x_{i},y_{i}),(x_{j},y_{j})\in S$, we say that $d_{\phi}(x,x_{i})<_{\pi}d_{\phi}(x,x_{j})$ if either of the following two conditions holds:

1. $d_{\phi}(x,x_{i})<d_{\phi}(x,x_{j})$.
2. $d_{\phi}(x,x_{i})=d_{\phi}(x,x_{j})$ and $(x_{i},y_{i})<_{\pi}(x_{j},y_{j})$.

Definition 16.

Let $\phi\in\Phi$ . Let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ be a dataset, and let $\pi$ be an ordering of $S$ . For $x\in\mathcal{X}$ , we say that $(x_{i},y_{i})$ is a $k_{n}$ -nearest neighbor of $x$ under $\phi$ if

|\{j:d_{\phi}(x,x_{j})<_{\pi}d_{\phi}(x,x_{i})\}|<k_{n}.

We also let $S_{k_{n},\phi}^{\pi}(x)$ denote the set of all $k_{n}$ -nearest neighbors of $x$ when using the ordering $\pi$ .

Finally, we define $\mathcal{N}_{S}^{\phi}$ as follows.

Definition 17.

Let $\phi\in\Phi$ , let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ be a dataset, and $\pi$ an ordering over $S$ . Then for $x\in\mathcal{X}$ , we define

\mathcal{N}_{S,\pi}^{\phi}(x)=\operatorname*{arg\,max}_{y\in\mathcal{Y}}\sum_{(x_{i},y_{i})\in S_{k_{n},\phi}^{\pi}(x)}\mathbbm{1}(y=y_{i}).

Here, we break ties in $\mathcal{Y}$ arbitrarily (which could be done with an ordering of $\mathcal{Y}$ ).

The key point of this definition is that all tie-breaking mechanisms are done independently of $\phi$ . In particular, we have the following.

Lemma 1.

Let $S$ be a dataset of $n$ points, and $\pi$ an ordering over $S$. Let $\phi,\phi^{\prime}$ be two feature maps in $\Phi$. Suppose for $x\in\mathcal{X}$ that for all $i,j$, $d_{\phi}(x,x_{i})\leq d_{\phi}(x,x_{j})$ if and only if $d_{\phi^{\prime}}(x,x_{i})\leq d_{\phi^{\prime}}(x,x_{j})$. Then $\mathcal{N}_{S,\pi}^{\phi}(x)=\mathcal{N}_{S,\pi}^{\phi^{\prime}}(x)$.

Proof.

This is immediate from the previous definitions as all ties are broken in an identical manner for both $\phi$ and $\phi^{\prime}$ . ∎

As before, to avoid cumbersome notation, we will assume that an ordering, $\pi$ , of $S$ is fixed.

Appendix D Induced conditional distributions

In this section, we rigorously define the conditional data distribution of $\mathcal{D}^{\phi}$ . Recall that if $(X,Y)\sim\mathcal{D}$ denote the random variables corresponding to $\mathcal{D}$ , then $\mathcal{D}^{\phi}$ is defined as the data distribution $(\phi(X),Y)$ , where $\phi:\mathcal{X}\to\mathcal{Z}$ is a feature map. We write $\mathcal{D}=(\mu,\eta)$ , where $\mu$ denotes the measure corresponding to $X$ over $\mathcal{X}$ , and $\eta$ is the conditional data distribution, $p(y|X)$ . Our goal in this section is to similarly write $\mathcal{D}^{\phi}=(\mu^{\phi},\eta^{\phi})$ .

First, observe that for any measurable subset $B\subseteq\mathcal{Z}$ , $\mu^{\phi}(B)=\mu\left(\phi^{-1}(B)\right)$ . This directly follows from the definition of the random variable $\phi(X)$ .

Next, to define $\eta^{\phi}$ , first recall that $\eta(y|x)$ denotes the probability that $Y=y$ given that $X=x$ . By assumption this is well defined for all $x\in\mathcal{X}$ and $y\in\mathcal{Y}$ , and moreover for any $y\in\mathcal{Y}$ the function $\mathcal{X}\to[0,1]$ defined by $x\mapsto\eta(y|x)$ is measurable. To define $\eta^{\phi}$ , we first define $\upsilon^{y}$ for all $y\in\mathcal{Y}$ as follows.

Definition 18.

$\upsilon^{y}$ is a measure over $\mathcal{Z}$ so that for all measurable sets $B$ ,

\upsilon^{y}(B)=\int_{\phi^{-1}(B)}\eta(y|x)d\mu(x).

The fact that $\upsilon^{y}$ is a well-defined measure follows directly from the rules of integration. In essence, $\upsilon^{y}(B)$ is the probability of observing $(X,Y)$ with $\phi(X)\in B$ and $Y=y$. We now show the following:

Lemma 2.

$\upsilon^{y}$ is absolutely continuous with respect to $\mu^{\phi}$ for all $y$.

Proof.

This immediately follows from the fact that $\eta(y|x)\leq 1$ for all $y,x$ . Thus for any measurable set $B$ ,

\upsilon^{y}(B)=\int_{\phi^{-1}(B)}\eta(y|x)d\mu(x)\leq\int_{\phi^{-1}(B)}d\mu(x)=\mu(\phi^{-1}(B))=\mu^{\phi}(B).

Thus for any $\epsilon>0$ , we can simply choose $\delta=\epsilon$ so that $\mu^{\phi}(B)<\delta\implies\upsilon^{y}(B)<\epsilon$ . ∎

We now use the Radon-Nikodym theorem on $\upsilon^{y}$ to define $\eta^{\phi}$.

Lemma 3.

For all $y\in\mathcal{Y}$ , there exists a measurable function $f^{y}:\mathcal{Z}\to[0,1]$ such that

\upsilon^{y}(B)=\int_{B}f^{y}(z)d\mu^{\phi}(z),

for all measurable sets $B$ .

Proof.

This directly follows from the Radon-Nikodym theorem. ∎

We then define $\eta^{\phi}$ using these functions, $f^{y}$ .

Definition 19.

For all $z\in\mathcal{Z}$ and $y\in\mathcal{Y}$ , we define $\eta^{\phi}(y|z)=f^{y}(z)$ .
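In the special case of a finite instance space, the construction above reduces to a ratio of point masses, which the following sketch computes. The finite setting, the dictionary-based representation of $\mu$ and $\eta$, and the function name `induced_distribution` are all assumptions made for illustration.

```python
from collections import defaultdict

def induced_distribution(mu, eta, phi):
    """Return (mu_phi, eta_phi) for a finite instance space.

    mu:  dict mapping x to its probability mass.
    eta: dict mapping x to a dict y -> P(Y = y | X = x).
    phi: feature map from points x to hashable features z.
    """
    mu_phi = defaultdict(float)
    upsilon = defaultdict(lambda: defaultdict(float))  # upsilon[y][z] = P(phi(X) = z, Y = y)
    for x, p in mu.items():
        z = phi(x)
        mu_phi[z] += p
        for y, q in eta[x].items():
            upsilon[y][z] += q * p
    # Discrete Radon-Nikodym derivative: mass of upsilon^y at z divided by mass of mu_phi at z.
    eta_phi = {z: {y: upsilon[y][z] / mz for y in upsilon}
               for z, mz in mu_phi.items() if mz > 0}
    return dict(mu_phi), eta_phi

# Three points; projecting onto the first coordinate merges the first two.
mu = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.5}
eta = {(0, 0): {0: 1.0, 1: 0.0}, (0, 1): {0: 0.0, 1: 1.0}, (1, 0): {0: 0.5, 1: 0.5}}
mu_phi, eta_phi = induced_distribution(mu, eta, lambda p: p[0])
print(mu_phi, eta_phi)
```

Here two deterministically labeled points with opposite labels collapse to the same feature, so the induced conditional distribution at that feature becomes a fair coin.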

Appendix E Technical Lemmas

E.1 Useful bounds related to $\Phi$

We now prove several results regarding the distance dimension of $\Phi$ , $\partial(\Phi)$ . These will be useful for proving all of our subsequent results.

We begin by defining a useful hypothesis class for analyzing nearest neighbors.

Definition 20.

Let $S=\{(x_{1}^{*},y_{1}^{*}),\dots,(x_{n}^{*},y_{n}^{*})\}$ be a set of $n$ labeled points in $\mathcal{X}\times\mathcal{Y}$ . For $\phi\in\Phi$ , define $h_{S,\phi}:\mathcal{X}\times\mathcal{Y}\to\{0,1\}$ as

h_{S,\phi}\left((x,y)\right)=\begin{cases}1&\mathcal{N}_{S}^{\phi}(x)=y\\ 0&\text{otherwise}\end{cases}.

Finally, we let $\mathcal{H}(S,\Phi)=\{h_{S,\phi}:\phi\in\Phi\}.$

Observe that $\mathcal{H}(S,\Phi)$ is a set of binary classifiers. We now show that it has bounded VC-dimension.

Lemma 4.

There exists an absolute constant, $c_{1}>0$ such that $\mathcal{H}(S,\Phi)$ has VC-dimension bounded as

vc(\mathcal{H}(S,\Phi))\leq c_{1}\partial(\Phi)\log\left(n+\partial(\Phi)\right).

Proof.

Suppose $\mathcal{H}(S,\Phi)$ shatters a set $V$ of $v$ points in $\mathcal{X}\times\mathcal{Y}$, $V=\{(x_{1},y_{1}),\dots,(x_{v},y_{v})\}.$ The key observation is that for any $h_{S,\phi}\in\mathcal{H}(S,\Phi)$, the way $h_{S,\phi}$ labels a given point $(x,y)$ is determined by the $k_{n}$-nearest neighbors of $\phi(x)$ in $\{\phi(x_{1}^{*}),\dots,\phi(x_{n}^{*})\}$. Furthermore, by Lemma 1, these labels are fully determined by the set of all $\binom{n}{2}$ comparisons,

\left\{\mathbbm{1}\left(d_{\phi}(x,x_{i}^{*})\geq d_{\phi}(x,x_{j}^{*})\right):1\leq i<j\leq n\right\}.

These indicator variables precisely correspond to the definition of a distance comparer (Definition 5). It follows that the number of distinct ways that $\mathcal{H}(S,\Phi)$ can label $V$ is at most the number of ways $\Delta\Phi$ can label all $v\binom{n}{2}$ possible comparisons, $\{(x_{i},x_{j}^{*},x_{k}^{*}):1\leq i\leq v,1\leq j<k\leq n\}.$ Since by definition $vc(\Delta\Phi)=\partial(\Phi)$, by Sauer’s Lemma, the number of ways $\mathcal{H}(S,\Phi)$ can label $V$ is at most $\left(v\binom{n}{2}\right)^{\partial(\Phi)}$. However, since $\mathcal{H}(S,\Phi)$ shatters $V$, there exist precisely $2^{v}$ such labelings. It follows that $v\leq\log\left(\left(v\binom{n}{2}\right)^{\partial(\Phi)}\right)$. From here, straightforward algebra implies that $v=O\left(\partial(\Phi)\log\left(n+\partial(\Phi)\right)\right)$, as desired. ∎
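The final algebraic step of this proof can be spelled out as follows. This is one possible route rather than necessarily the authors' own: it takes the counting bound $2^{v}\leq\left(v\binom{n}{2}\right)^{\partial(\Phi)}$ from the proof as given, assumes $n\geq 2$ and $\partial=\partial(\Phi)\geq 1$, and uses the elementary tangent-line inequality $\ln v\leq\ln c+v/c-1$, valid for all $c,v>0$, with $c=2a$. The shorthand $a$ and $b$ is introduced here only for the calculation.

```latex
\begin{align*}
2^{v}\le\Bigl(v\binom{n}{2}\Bigr)^{\partial}
  &\implies v\le a\ln v+b,
  \qquad a=\frac{\partial}{\ln 2},\quad
  b=\partial\log_{2}\binom{n}{2}\le 2\partial\log_{2}n,\\
\ln v\le\ln(2a)+\frac{v}{2a}-1
  &\implies v\le\frac{v}{2}+a\ln(2a)-a+b\\
  &\implies v\le 2a\ln(2a)+2b
   \le\frac{2\partial}{\ln 2}\ln\Bigl(\frac{2\partial}{\ln 2}\Bigr)+4\partial\log_{2}n
   =O\bigl(\partial\log(n+\partial)\bigr).
\end{align*}
```

The last equality holds because both $\partial\ln\partial$ and $\partial\log_{2}n$ are at most a constant multiple of $\partial\log(n+\partial)$ when $n\geq 2$.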

Next, we define a hypothesis class that will be useful for bounding the margin of a data distribution.

Definition 21.

For $\phi\in\Phi$ and $r>0$ , define $q_{\phi,r}:\mathcal{X}^{2}\to\{0,1\}$ as the map

q_{\phi,r}(x,x^{\prime})=\begin{cases}1&d_{\phi}(x,x^{\prime})<r\\ 0&\text{otherwise}\end{cases}.

Let $\mathcal{Q}(\Phi)=\{q_{\phi,r}:\phi\in\Phi,r\in[0,\infty)\}$ .

Roughly speaking, the class $\mathcal{Q}(\Phi)$ will prove useful in allowing us to uniformly bound measured distances over a data distribution. We now bound its VC-dimension as follows.

Lemma 5.

There exists an absolute constant, $c_{2}>0$ such that $\mathcal{Q}(\Phi)$ has VC-dimension bounded as

vc(\mathcal{Q}(\Phi))\leq c_{2}\partial(\Phi)\log\left(\partial(\Phi)\right).

Proof.

Suppose $\mathcal{Q}(\Phi)$ shatters the set $X=\{(x_{1},x_{1}^{\prime}),(x_{2},x_{2}^{\prime}),\dots,(x_{v},x_{v}^{\prime})\}$ . We say that $\phi$ induces ordering, $\geq_{\phi}$ over $X$ by ranking the pairs in increasing distance. That is,

(x_{i},x_{i}^{\prime})\geq_{\phi}(x_{j},x_{j}^{\prime})\longleftrightarrow d_{\phi}(x_{i},x_{i}^{\prime})\geq d_{\phi}(x_{j},x_{j}^{\prime}).

Our strategy is to double count the number of distinct orderings, $\geq_{\phi}$ , over $X$ that can be constructed using $\phi$ . Here, two orderings are distinct if they ever differ for some pair of entries from $X$ .

First, suppose that $v$ is even (which we can assume by deleting a pair from $X$ if needed). Since $\mathcal{Q}(\Phi)$ shatters $X$, for all $S\subset X$ with $|S|=\frac{v}{2}$, there exist $\phi_{S}\in\Phi$ and $r$ such that

d_{\phi_{S}}(x_{i},x_{i}^{\prime})<r\leftrightarrow(x_{i},x_{i}^{\prime})\in S.

Observe that for $S\neq S^{\prime}$ , $\phi_{S}$ and $\phi_{S^{\prime}}$ must induce distinct orderings over $X$ as the bottom $v/2$ -elements of their orderings are distinct. Since there are $\binom{v}{v/2}\geq 2^{v/2}$ choices for $S$ , this shows that there are at least $2^{v/2}$ orderings.

Second, there are $v^{2}$ possible quadruples, $(x_{i},x_{i}^{\prime},x_{j},x_{j}^{\prime})$ . Suppose that $\phi$ and $\phi^{\prime}$ satisfy that

\Delta_{\phi}(x_{i},x_{i}^{\prime},x_{j},x_{j}^{\prime})=\Delta_{\phi^{\prime}}(x_{i},x_{i}^{\prime},x_{j},x_{j}^{\prime}),

for all $i,j$. By the definition of a distance comparer (Definition 5), this implies that $\phi$ and $\phi^{\prime}$ induce the same ordering over $X$. Thus it suffices to count the number of ways $\Delta\Phi=\{\Delta_{\phi}:\phi\in\Phi\}$ can label the set of $v^{2}$ possible quadruples. By Sauer’s Lemma, this is at most $bv^{2\partial(\Phi)}$ for some absolute constant $b$.

Combining our two observations, it follows that $2^{v/2}\leq bv^{2\partial(\Phi)}$. Standard algebra yields that $v\leq c^{\prime}\partial(\Phi)\log(\partial(\Phi))$ for some absolute constant $c^{\prime}$. ∎

Finally, for dealing with margins and labels simultaneously, we introduce the following hypothesis class.

Definition 22.

Let $S=\{(x_{1}^{*},y_{1}^{*}),\dots,(x_{n}^{*},y_{n}^{*})\}$ be a set of $n$ labeled points in $\mathcal{X}\times\mathcal{Y}$ . For $\phi\in\Phi$ , define $q_{\phi,r,S}:\mathcal{X}^{2}\to\{0,1\}$ as follows:

q_{\phi,r,S}(x,x^{\prime})=\begin{cases}1&d_{\phi}(x,x^{\prime})<r\text{ and }\mathcal{N}_{S}^{\phi}(x)\neq\mathcal{N}_{S}^{\phi}(x^{\prime})\\ 0&\text{otherwise}\end{cases}.

We let $\mathcal{Q}(\Phi,S)=\{q_{\phi,r,S}:\phi\in\Phi,r\in[0,\infty)\}.$

We now bound its VC-dimension.

Lemma 6.

There exists an absolute constant, $c_{3}>0$ such that $\mathcal{Q}(\Phi,S)$ has VC-dimension bounded as

vc(\mathcal{Q}(\Phi,S))\leq c_{3}\partial(\Phi)\log\left(n+\partial(\Phi)% \right).

Proof.

Suppose $\mathcal{Q}(\Phi,S)$ shatters $V=\{(x_{1},x_{1}^{\prime}),\dots(x_{v},x_{v}^{\prime})\}$ . We will double count the number of subsets of $V$ that can be obtained as the pre-image of $1$ under some $q_{\phi,r,S}\in\mathcal{Q}(\Phi,S)$ . For any $\phi\in\Phi$ , define $t_{\phi}:\mathcal{X}^{2}\to\{0,1\}$ as

t_{\phi}(x,x^{\prime})=\begin{cases}1&\mathcal{N}_{S}^{\phi}(x)\neq\mathcal{N}_{S}^{\phi}(x^{\prime})\\ 0&\text{otherwise}\end{cases}.

Then the key observation is that

q_{\phi,r,S}(x,x^{\prime})=t_{\phi}(x,x^{\prime})q_{\phi,r}(x,x^{\prime}).

Thus, a subset of $\{(x_{1},x_{1}^{\prime}),\dots,(x_{v},x_{v}^{\prime})\}$ is the pre-image of $1$ under $q_{\phi,r,S}$ if it is precisely the intersection of the pre-images of $1$ under $q_{\phi,r}$ and $t_{\phi}$ .
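The product decomposition above can be written out directly. The sketch below is illustrative and rests on two assumptions: that $q_{\phi,r}(x,x^{\prime})$ is the indicator of $d_{\phi}(x,x^{\prime})<r$, and that $\mathcal{N}_{S}^{\phi}$ is a $k$-nearest-neighbor plurality vote under $d_{\phi}$ with Euclidean distance in feature space. All function names are hypothetical.

```python
import math
from collections import Counter

def d_phi(phi, x, xp):
    """Distance between x and xp after applying the feature map phi."""
    return math.dist(phi(x), phi(xp))

def knn_label(phi, S, x, k):
    """Plurality label among the k points of S nearest to x under d_phi."""
    nearest = sorted(S, key=lambda pt: d_phi(phi, pt[0], x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def q_pair(phi, r, x, xp):
    """q_{phi,r}: 1 when the pair is closer than r under d_phi (assumed form)."""
    return int(d_phi(phi, x, xp) < r)

def t_pair(phi, S, k, x, xp):
    """t_phi: 1 when the k-NN rule labels x and xp differently."""
    return int(knn_label(phi, S, x, k) != knn_label(phi, S, xp, k))

def q_labeled(phi, r, S, k, x, xp):
    """q_{phi,r,S}, written directly from its case definition."""
    close = d_phi(phi, x, xp) < r
    differ = knn_label(phi, S, x, k) != knn_label(phi, S, xp, k)
    return 1 if (close and differ) else 0

def identity(x):
    return x

# Toy one-dimensional sample with two well-separated label regions.
S = [((0.0,), 0), ((0.1,), 0), ((0.2,), 0),
     ((2.0,), 1), ((2.1,), 1), ((2.2,), 1)]
pairs = [((0.05,), (2.05,)), ((0.05,), (0.15,)), ((0.2,), (2.0,))]
```

On any pair and radius, `q_labeled` agrees with `t_pair(...) * q_pair(...)`, which is the identity the counting argument relies on.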

By Sauer’s Lemma and Lemma 5, there are at most $O\left(v^{c_{2}\partial(\Phi)\log(\partial(\Phi))}\right)$ subsets that are the pre-image of $1$ under some $q_{\phi,r}$ .

We now similarly bound the pre-images under $t_{\phi}$ . To this end, observe that the value of $t_{\phi}$ over all $(x_{i},x_{i}^{\prime})$ is completely determined by the way in which $\mathcal{N}_{S}^{\phi}$ classifies $\{x_{1},\dots,x_{v},x_{1}^{\prime},\dots,x_{v}^{\prime}\}$ . This, in turn, is fully determined by the way in which $h_{S,\phi}$ (Definition 20) labels the set $\{x_{1},\dots,x_{v},x_{1}^{\prime},\dots,x_{v}^{\prime}\}\times\mathcal{Y}$ . Thus, applying Sauer’s Lemma along with Lemma 4, we see that there are at most $O\left((2v|\mathcal{Y}|)^{c_{1}\partial(\Phi)\log\left(n+\partial(\Phi)\right)}\right)$ possible subsets.

However, since $\mathcal{Q}(\Phi,S)$ shatters $V$ , we know that $2^{v}$ subsets can be formed in this manner. Thus, it follows that for some constant $c_{3}^{\prime}$ ,

2^{v}\leq c_{3}^{\prime}\left(v^{c_{2}\partial(\Phi)\log(\partial(\Phi))}\right)\left((2v|\mathcal{Y}|)^{c_{1}\partial(\Phi)\log\left(n+\partial(\Phi)\right)}\right).

Taking logs and applying standard algebraic manipulations yields the desired result. ∎

We end with one final useful hypothesis class that is a generalization of Definition 21.

Definition 23.

Let $n>1$ be an integer. For $\phi\in\Phi$ and $r>0$ , define $q_{\phi,r,n}:\mathcal{X}^{n+1}\to\{0,1\}$ as the map

q_{\phi,r,n}(x,x_{1},\dots,x_{n})=\begin{cases}1&\exists_{1\leq i\leq n}\,d_{\phi}(x,x_{i})<r\\ 0&\text{otherwise}\end{cases}.

Let $\mathcal{Q}_{n}(\Phi)=\{q_{\phi,r,n}:\phi\in\Phi,r\in[0,\infty)\}$ .

This class will assist us with computing the distance between the source and target distributions simultaneously over all embeddings.
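A small sketch may make the role of this class concrete. Under the assumption that $d_{\phi}$ is Euclidean distance in feature space, $q_{\phi,r,n}$ equals $1$ exactly when the smallest of the $n$ distances $d_{\phi}(x,x_{i})$ is below $r$, so it is determined by the pairwise indicators $q_{\phi,r}(x,x_{i})$. The names below are hypothetical.

```python
import math

def d_phi(phi, x, xp):
    """Distance between x and xp after applying the feature map phi."""
    return math.dist(phi(x), phi(xp))

def q_pair(phi, r, x, xp):
    """Pairwise indicator q_{phi,r}: 1 when d_phi(x, xp) < r (assumed form)."""
    return int(d_phi(phi, x, xp) < r)

def q_n(phi, r, x, xs):
    """q_{phi,r,n}: 1 when some x_i in xs lies within distance r of x."""
    return int(any(q_pair(phi, r, x, xi) for xi in xs))

def first_coordinate(x):
    """Example feature map that keeps only the first coordinate."""
    return (x[0],)

x = (0.0, 5.0)
xs = [(1.0, 0.0), (3.0, 0.0)]
print(q_n(first_coordinate, 1.5, x, xs))  # closest point is at distance 1.0
print(q_n(first_coordinate, 0.5, x, xs))
```

Because the value only depends on the $n$ pairwise comparisons, a set of $v$ inputs to $q_{\phi,r,n}$ involves at most $nv$ pairs, which is the reduction used in the proof of Lemma 7.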

Lemma 7.

There exists an absolute constant, $c_{4}>0$ such that $\mathcal{Q}_{n}(\Phi)$ has VC-dimension bounded as

vc(\mathcal{Q}_{n}(\Phi))\leq c_{4}\partial(\Phi)\log\left(\partial(\Phi)+n% \right).

Proof.

Suppose $\mathcal{Q}_{n}(\Phi)$ shatters $V=((x_{1},x^{1}),(x_{2},x^{2}),\dots,(x_{v},x^{v}))$ where $x^{i}=(x_{1}^{i},x_{2}^{i},\dots,x_{n}^{i})\in\mathcal{X}^{n}$ . The key observation is that the subset of $V$ labeled $1$ by $q_{\phi,r,n}$ is precisely determined by the behavior of $q_{\phi,r}$ over the set of $nv$ pairs, $(x_{i},x_{j}^{i})$ for $1\leq i\leq v$ and $1\leq j\leq n$ . By Sauer’s Lemma along with Lemma 5, there are at most $O((nv)^{c_{2}\partial(\Phi)\log\partial(\Phi)})$ such subsets possible.

Since $V$ is shattered, all $2^{v}$ subsets must be realized, so it follows that $2^{v}\leq C(nv)^{c_{2}\partial(\Phi)\log\partial(\Phi)}$ . Applying standard manipulations again yields that $v\leq c_{4}\partial(\Phi)\log\left(\partial(\Phi)+n\right)$ for some constant $c_{4}$ . ∎

E.2 Useful properties of data distributions

Lemma 8.

Let $\mathcal{D}$ be a well-separated data distribution with label margin, $\Delta$ . Let $h:\mathcal{X}\to\mathcal{Y}$ be a classifier such that $R(h,\mathcal{D})<R(g_{\mathcal{D}},\mathcal{D})+\epsilon$ , where $g_{\mathcal{D}}$ denotes the Bayes-optimal classifier. Then

\Pr_{(x,y)\sim\mathcal{D}}[h(x)\neq g_{\mathcal{D}}(x)]<\frac{\epsilon}{\Delta}.

Proof.

Let $\mathcal{D}=(\mu,\eta)$ . Let $A\subset\mathcal{X}$ denote the set of all points for which $h$ and $g_{\mathcal{D}}$ disagree. Then we have,

\begin{split}R(h,\mathcal{D})&=1-\int_{\mathcal{X}}\eta(h(x)|x)d\mu(x)\\ &=1-\int_{A}\eta(h(x)|x)d\mu(x)-\int_{\mathcal{X}\setminus A}\eta(h(x)|x)d\mu(x)\\ &\geq 1-\left(\int_{A}\left(\eta(g_{\mathcal{D}}(x)|x)-\Delta\right)d\mu(x)\right)-\left(\int_{\mathcal{X}\setminus A}\eta(g_{\mathcal{D}}(x)|x)d\mu(x)\right)\\ &=1+\Delta\mu(A)-\int_{\mathcal{X}}\eta(g_{\mathcal{D}}(x)|x)d\mu(x)\\ &=R(g_{\mathcal{D}},\mathcal{D})+\Delta\mu(A).\end{split}

Here, we are using the fact that $\eta(h(x)|x)\leq\eta(g_{\mathcal{D}}(x)|x)-\Delta$ for $g_{\mathcal{D}}(x)\neq h(x)$ (as $\mathcal{D}$ is well-separated). Finally, using the fact that $h$ has excess risk at most $\epsilon$ , we find that $\Delta\mu(A)<\epsilon$ which implies that $\mu(A)<\frac{\epsilon}{\Delta}$ , as desired. ∎
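The inequality of Lemma 8 can be checked exhaustively on a small finite example. The sketch below is illustrative only: the four-point space, the marginal `mu`, and the conditional probabilities `eta1` are made-up values, and only the label-margin part of well-separatedness is used. It enumerates every binary classifier $h$ and confirms that the mass on which $h$ disagrees with the Bayes-optimal classifier is at most the excess risk divided by $\Delta$.

```python
from itertools import product

mu = [0.1, 0.2, 0.3, 0.4]        # marginal mass of each of the four points
eta1 = [0.9, 0.2, 0.7, 0.35]     # P(y = 1 | x) at each point

def eta(y, x):
    """Conditional probability of label y at point x."""
    return eta1[x] if y == 1 else 1.0 - eta1[x]

def risk(h):
    """Misclassification risk of the classifier h (a tuple of labels)."""
    return 1.0 - sum(mu[x] * eta(h[x], x) for x in range(len(mu)))

# Bayes-optimal classifier and label margin Delta of this toy distribution.
bayes = tuple(int(p > 0.5) for p in eta1)
label_margin = min(abs(2 * p - 1) for p in eta1)

def disagreement_mass(h):
    """Mass of the set A on which h and the Bayes-optimal classifier differ."""
    return sum(mu[x] for x in range(len(mu)) if h[x] != bayes[x])

# Smallest value of (excess risk / Delta) - mu(A) over all 16 classifiers.
worst_slack = min(
    (risk(h) - risk(bayes)) / label_margin - disagreement_mass(h)
    for h in product((0, 1), repeat=len(mu))
)
print(worst_slack)
```

The slack is nonnegative up to floating-point error, and it is attained with equality by classifiers that err only on points whose label gap equals $\Delta$.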

We now use the fact that $\Phi$ is compact to prove a useful Lemma.

Lemma 9.

For all $\epsilon>0$ , there exists $\delta>0$ such that for all $\phi\in\Phi$ and all $x\in\mathcal{X}$ ,

\phi\left(B(x,\delta)\right)\subseteq B(\phi(x),\epsilon).

Proof.

Assume towards a contradiction, that for some $\epsilon>0$ , for all $\delta>0$ , there exists $\phi,x$ such that $\phi\left(B(x,\delta)\right)\not\subseteq B(\phi(x),\epsilon)$ . Let $\delta_{i}\to 0$ be a sequence and let $\phi_{i},x_{i}$ be corresponding feature maps and points for this sequence.

Since $\Phi$ is compact, we can take an infinite subsequence of $\phi_{i}$ so that $\phi_{i}\to\phi$ for some $\phi$ . Similarly, since $\mathcal{X}$ is compact, we can take an infinite subsequence so that $x_{i}\to x$ for some $x$ . Because $\phi$ is continuous, there exists $\delta>0$ such that

\phi\left(B(x,\delta)\right)\subseteq B(\phi(x),\frac{\epsilon}{2}).

Select $i$ such that $d(x,x_{i})<\frac{\delta}{2}$ , $\delta_{i}<\frac{\delta}{2}$ , and $d(\phi,\phi_{i})<\frac{\epsilon}{2}$ . Then, applying the triangle inequality, we have

\begin{split}B(x_{i},\delta_{i})&\subseteq B(x_{i},\frac{\delta}{2})\\ &\subseteq B(x,\delta).\end{split}

Furthermore, since $d(\phi,\phi_{i})<\frac{\epsilon}{2}$ ,

\begin{split}\phi_{i}(B(x,\delta))&\subseteq\{z:d_{\mathcal{Z}}\left(z,\phi(B(% x,\delta))\right)<\frac{\epsilon}{2}\}\\ &\subseteq\{z:d_{\mathcal{Z}}\left(z,B(\phi(x),\frac{\epsilon}{2})\right)<% \frac{\epsilon}{2}\}\\ &\subseteq B(\phi(x),\epsilon)\end{split}

However, this is a contradiction to the definition of $\delta_{i}$ . ∎

E.3 Useful Definitions for Analyzing Margins

We begin by precisely characterizing feature maps that preserve a data distribution $\mathcal{D}$ .

Lemma 10.

Let $\mathcal{D}$ be a well-separated distribution, and let $\{\mu^{y}:y\in\mathcal{Y}\}$ be the sets as defined in Definition 1. Then $\phi$ preserves $\mathcal{D}$ if and only if there exists $\rho^{\phi}>0$ such that

\min_{y\neq y^{\prime}}d_{\phi}(\mu^{y},\mu^{y^{\prime}})=\rho^{\phi}.

Proof.

The first direction is immediate. If $\rho^{\phi}$ exists, then it is clear that $\mathcal{D}^{\phi}$ is well-separated with corresponding sets $\phi(\mu^{y})$ .

In the second direction, assume towards a contradiction that no such $\rho^{\phi}$ exists. Because $\phi$ preserves $\mathcal{D}$ , there exist corresponding sets $\{\mu_{\phi}^{y}:y\in\mathcal{Y}\}$ that partition the support of $\mathcal{D}^{\phi}$ . Because no such $\rho^{\phi}$ exists, we must have some $x\in\mu^{y}$ such that $\phi(x)\in\mu_{\phi}^{y^{\prime}}$ for $y\neq y^{\prime}$ ; otherwise we could have used the margin of $\mathcal{D}^{\phi}$ as a valid choice for $\rho^{\phi}$ .

However, it then becomes clear that there exists a ball of non-zero radius centered at $x$ that is mapped into $\mu_{\phi}^{y^{\prime}}$ . This means it is classified as $y^{\prime}$ by $g_{\mathcal{D}^{\phi}}$ while it is classified as $y$ by $g_{\mathcal{D}}$ . Since $\mathcal{D}$ is well-separated, there is a unique Bayes-optimal classifier over the support of $\mathcal{D}$ , and this shows that $g_{\mathcal{D}^{\phi}}$ does not incur Bayes-optimal risk over $\mathcal{D}$ . Thus $\phi$ does not preserve $\mathcal{D}$ , which is a contradiction.

∎

We now generalize the idea of the margin of a data distribution as follows.

Definition 24.

Let $\mathcal{D}=(\mu,\eta)$ be a well-separated data distribution, and let $\phi\in\Phi$ be a feature map. Then the margin variable of $\mathcal{D}^{\phi}$ is the random variable $\alpha^{\phi}$ defined as follows. Let $(x,x^{\prime})\sim\mu^{2}$ . Then

\alpha^{\phi}=\begin{cases}d_{\phi}(x,x^{\prime})&g_{\mathcal{D}}(x)\neq g_{\mathcal{D}}(x^{\prime})\\ \infty&\text{otherwise}\end{cases}.

$\alpha^{\phi}$ can be thought of as a randomly observed margin. We will be particularly interested in observing small values of $\alpha^{\phi}$ , as these will be reflective of the margin of $\mathcal{D}^{\phi}$ . In particular, we have the following.

Lemma 11.

Let $\mathcal{D}$ be a well-separated data distribution and let $\phi\in\Phi$ be a feature map. If $\phi$ preserves $\mathcal{D}$ , let $\rho^{\phi}$ denote the margin of $\mathcal{D}^{\phi}$ . Otherwise, let $\rho^{\phi}=0$ . Then for every $\gamma>0$ there exists $\delta>0$ such that

\Pr_{\alpha^{\phi}}[\alpha^{\phi}\leq\rho^{\phi}+\gamma]\geq\delta.

Proof.

Let $\mathcal{D}=(\mu,\eta)$ , and let $\{\mu^{y}:y\in\mathcal{Y}\}$ be the sets corresponding to Definition 1. Suppose $\phi$ preserves $\mathcal{D}$ . Then the sets $\{\phi(\mu^{y}):y\in\mathcal{Y}\}$ must be the corresponding sets for $\mathcal{D}^{\phi}$ , and it follows that

\min_{y\neq y^{\prime}}d_{\mathcal{Z}}\left(\phi(\mu^{y}),\phi(\mu^{y^{\prime}% })\right)=\rho^{\phi}.

On the other hand, if $\phi$ does not preserve $\mathcal{D}$ , then we must have

\min_{y\neq y^{\prime}}d_{\mathcal{Z}}\left(\phi(\mu^{y}),\phi(\mu^{y^{\prime}% })\right)=0,

as if this distance were positive, then $\phi$ would clearly preserve $\mathcal{D}$ . Thus in either case, there exist $y,y^{\prime}$ so that $d_{\mathcal{Z}}\left(\phi(\mu^{y}),\phi(\mu^{y^{\prime}})\right)=\rho^{\phi}$ .

It follows that there exists $x\in\mu^{y}$ and $x^{\prime}\in\mu^{y^{\prime}}$ such that $d_{\phi}(x,x^{\prime})\leq\rho^{\phi}+\gamma/2$ . Let

\delta=\mu(B_{\phi}(x,\gamma/4))\mu(B_{\phi}(x^{\prime},\gamma/4)).

Because $\phi$ is continuous, $\phi(x)$ and $\phi(x^{\prime})$ lie within the support of $\mathcal{D}^{\phi}$ . It follows that $\delta>0$ . $\delta$ is also a lower bound on the probability that we observe $(x_{1},x_{2})\in B_{\phi}(x,\gamma/4)\times B_{\phi}(x^{\prime},\gamma/4)$ , which means it is a lower bound on the probability that $\alpha^{\phi}\leq\rho^{\phi}+\gamma.$ This gives the desired result. ∎
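The behavior described by Definition 24 and Lemma 11 can be simulated on a toy distribution. The sketch below is a hedged illustration: the two-cluster distribution in the plane, the two feature maps, and all function names are assumptions introduced here. One map keeps the coordinate that separates the labels (so the margin is $1$), the other discards it (so the map does not preserve the distribution and $\rho^{\phi}=0$).

```python
import math
import random

def sample_point(rng):
    """Label 0 has first coordinate in [0, 1]; label 1 has it in [2, 3]."""
    label = rng.randrange(2)
    x0 = rng.uniform(0.0, 1.0) + 2.0 * label
    x1 = rng.uniform(0.0, 1.0)
    return (x0, x1), label

def alpha_sample(phi, rng):
    """One draw of the margin variable: d_phi(x, x') if labels differ, else infinity."""
    (x, y), (xp, yp) = sample_point(rng), sample_point(rng)
    return math.dist(phi(x), phi(xp)) if y != yp else math.inf

def smallest_alpha(phi, draws=2000, seed=0):
    """Smallest observed value of the margin variable over independent draws."""
    rng = random.Random(seed)
    return min(alpha_sample(phi, rng) for _ in range(draws))

def keep_first(x):
    """Feature map that keeps the separating coordinate."""
    return (x[0],)

def keep_second(x):
    """Feature map that discards the separating coordinate."""
    return (x[1],)

print(smallest_alpha(keep_first), smallest_alpha(keep_second))
```

In this example labels are a deterministic function of the point, so the true label plays the role of $g_{\mathcal{D}}$. The smallest observed value never falls below the margin under the first map, while under the second map it drops toward $0$ as more pairs are drawn.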

We now use a similar idea to describe distances between the supports of two measures.

Definition 25.

Let $\mu_{s},\mu_{t}$ be measures over $\mathcal{X}$ . Then $\beta^{\phi}$ is defined as

\beta^{\phi}=d_{\phi}(x_{t},\textnormal{supp}(\mu_{s})),

where $x_{t}$ is a random variable following distribution $\mu_{t}$ .

$\beta^{\phi}$ can be thought of as representing the distance that a point drawn from $\mathcal{D}_{t}$ has from $\mathcal{D}_{s}$ when using distance metric determined by $\phi$ .

It will also be useful to define a finite-sample version of $\beta^{\phi}$ that does not rely on the set $\textnormal{supp}(\mu_{s})$ .

Definition 26.

Let $\mu_{s},\mu_{t}$ be measures over $\mathcal{X}$ . Let $n>0$ . Then $\beta_{n}^{\phi}$ is defined as

\beta_{n}^{\phi}=\min_{1\leq i\leq n}d_{\phi}(x_{t},x_{s}^{i}),

where $x_{t}$ is a random variable following distribution $\mu_{t}$ , and $x_{s}^{1},\dots,x_{s}^{n}$ are drawn i.i.d from $\mu_{s}$ .

We now show that $\beta_{n}$ converges to $\beta$ .

Lemma 12.

Let $\phi$ be any feature map. Then $\beta_{1}^{\phi},\beta_{2}^{\phi},\dots$ converges in distribution to $\beta^{\phi}$ .

Proof.

For any $r>0$ , the probability that $\beta^{\phi}<r$ is precisely the probability that some $x_{t}\in\textnormal{supp}(\mu_{t})$ is chosen so that $x_{t}$ has distance less than $r$ from $\textnormal{supp}(\mu_{s})$ . For any such $x_{t}$ , let $x_{s}\in\textnormal{supp}(\mu_{s})$ satisfy $d_{\phi}(x_{t},x_{s})<r$ . Furthermore pick $\epsilon$ so that $2\epsilon<r-d_{\phi}(x_{t},x_{s})$ . It follows that $\beta_{n}^{\phi}<r$ will hold if one of the $n$ points selected from $\mu_{s}$ is within distance $\epsilon$ of $x_{s}$ . Since $x_{s}$ lies in the support of $\mu_{s}$ , this event occurs with probability tending to $1$ as $n$ grows. ∎
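The convergence in Lemma 12 can be observed in a one-dimensional simulation. The sketch below is an illustration under stated assumptions: the feature map is the identity, $\mu_{s}$ is uniform on $[0,1]$, and $\mu_{t}$ is uniform on $[0,1.5]$, so that $\beta^{\phi}$ has the closed form $\max(0,x_{t}-1)$ on the target support. These distributions and the function names are hypothetical choices made for the example.

```python
import random

def beta(x_t):
    """Distance from x_t to supp(mu_s) = [0, 1] under the identity feature map."""
    return max(0.0, x_t - 1.0, -x_t)

def beta_n(x_t, source_sample):
    """Finite-sample version: distance from x_t to the nearest sampled source point."""
    return min(abs(x_t - x_s) for x_s in source_sample)

def mean_gap(n, trials=500, seed=0):
    """Average of beta_n - beta over fresh draws of x_t and of the source sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x_t = rng.uniform(0.0, 1.5)                          # target draw
        sample = [rng.uniform(0.0, 1.0) for _ in range(n)]   # n source draws
        total += beta_n(x_t, sample) - beta(x_t)
    return total / trials

for n in (5, 50, 200):
    print(n, mean_gap(n))
```

Since every sampled source point lies in the support, $\beta_{n}^{\phi}\geq\beta^{\phi}$ always holds, and the average gap shrinks as $n$ grows, which is consistent with convergence in distribution.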

Appendix F Proof of Theorem 1

First, we characterize areas of $\mathcal{X}$ that are likely to be correctly classified by composing nearest neighbors with $\phi$ .

Definition 27.

Let $\phi$ be a feature map that preserves $\mathcal{D}_{s}$ . Let $0<p<1$ , and let $r>0$ be a distance. We let $\mathcal{X}_{p,r}^{\phi}$ denote the set of all points $x$ such that there exists $x^{\prime}$ for which the following hold.

1.

$d_{\phi}(x,x^{\prime})<\frac{\rho^{\phi}}{2}-r$ .
2.

$\mu_{s}\left(B_{\phi}(x^{\prime},r)\right)\geq p$ .

Here $p$ represents a small amount of mass that must be close to $x$ , and $x^{\prime}$ and $r$ determine a region in which that mass is concentrated. The idea will be that $x$ can be accurately classified using points sampled from $B(x^{\prime},r)$ . We now formalize this with the following lemma.

Lemma 13.

Fix $p,r>0$ . Then there exists $N$ such that for all $n\geq N$ , for all $\phi\in\Phi$ and $x\in\mathcal{X}_{p,r}^{\phi}$ with probability at least $1-\frac{1}{n^{4}}$ over $S\sim\mathcal{D}_{s}^{\phi}$ ,

\mathcal{N}_{S}^{\phi}(x)=g_{\mathcal{D}_{s}^{\phi}}\left(\operatorname*{arg\,% min}_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))% \right).

Proof.

Because $\mathcal{D}_{s}$ is well-separated, let $\mu_{s}^{y}$ denote the regions that correspond to Definition 1. Let $x^{\prime}$ be the point as defined in the definition of $\mathcal{X}_{p,r}^{\phi}$ . By applying Lemma 10, observe that there exists $y\in\mathcal{Y}$ such that $\operatorname*{arg\,min}_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))\in\phi(\mu_{s}^{y})$ , and $x^{\prime}\in\mu_{s}^{y}$ . This holds because if it didn’t, then the triangle inequality would show that $\phi(\mu_{s}^{y})$ and $\phi(\mu_{s}^{y^{\prime}})$ have distance less than $\rho^{\phi}$ for some $y^{\prime}\neq y$ .

It now suffices to show that with probability at least $1-\frac{1}{n^{4}}$ over $S\sim\mathcal{D}_{s}^{\phi}$ , $\mathcal{N}_{S}^{\phi}(x)=y$ .

To do this, let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ , and let $X=\{x_{1},\dots,x_{n}\}$ . We can view $S$ as being constructed by first drawing $X$ , and then drawing the labels of each of its points.

Observe that if $B_{\phi}(x^{\prime},r)$ contains at least $k_{n}$ points, then the $k_{n}$ nearest neighbors (according to $d_{\phi}$ ) of $x$ will all be drawn from $\mu_{s}^{y}$ . To this end, by Hoeffding’s inequality, we see that

\Pr\left[|X\cap B_{\phi}(x^{\prime},r)|>\frac{np}{2}\right]\geq 1-\exp\left(-\frac{n^{2}p^{2}}{2n}\right)=1-\exp\left(-\frac{np^{2}}{2}\right).

Thus, for $n$ sufficiently large (depending only on $p$ ), this quantity is at least $1-\frac{1}{2n^{4}}$ , and $\frac{np}{2}>k_{n}$ . This means that with probability at least $1-\frac{1}{2n^{4}}$ , the ball $B_{\phi}(x^{\prime},r)$ contains at least $k_{n}$ points.

Next, suppose that this event occurs. We now select the labels for our points. Because of our method of generating $S$ , we can assume that these labels are i.i.d and drawn for points in $\mu_{s}^{y}$ . Let the label of the $i$ th nearest neighbor of $x$ be denoted as $y_{i}$ . For all $y^{\prime}\neq y$ , define $J_{i}^{y^{\prime}}$ as the random variable that is $1$ if $y_{i}=y$ , $-1$ if $y_{i}=y^{\prime}$ , and $0$ otherwise. The key observation is that $\mathcal{N}_{S}^{\phi}(x)=y$ if and only if $\sum_{i=1}^{k_{n}}J_{i}^{y^{\prime}}>0$ for all $y^{\prime}\neq y$ , as this will imply that $y$ is the plurality choice.

Because $\mathcal{D}_{s}$ is well separated, it has label margin $\Delta$ . Therefore, $J_{i}^{y^{\prime}}$ is a random variable bounded in $[-1,1]$ with expected value at least $\Delta$ . It follows by Hoeffding’s inequality, that

\Pr[\sum_{i=1}^{k_{n}}J_{i}^{y^{\prime}}>0]\geq 1-\exp\left(\frac{-2\Delta^{2}% k_{n}^{2}}{4k_{n}}\right)=1-\exp\left(\frac{-\Delta^{2}k_{n}}{2}\right).

Because $k_{n}\geq\omega(\log n)$ , it follows that for a sufficiently large value of $n$ , this quantity is at least $1-\frac{1}{2n^{4}|\mathcal{Y}|}$ . Thus taking a union bound over all $y^{\prime}\in\mathcal{Y}\setminus\{y\}$ gives the desired result. ∎

Here observe that we are comparing the nearest neighbors classifier using $\phi^{\prime}$ to bayes-optimal over $\mathcal{D}_{s}^{\phi}$ , where $\phi$ is the original feature map we are considering. In other words, this lemma implies that small perturbations to the feature map do not affect classification.
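The last step of the proof of Lemma 13 requires $k_{n}$ to be large enough that $\exp\left(-\Delta^{2}k_{n}/2\right)\leq\frac{1}{2n^{4}|\mathcal{Y}|}$ . The following sketch solves this inequality for the smallest admissible number of neighbors; the function name and the example values of $n$, $|\mathcal{Y}|$ and $\Delta$ are hypothetical.

```python
import math

def min_neighbors(n, num_labels, label_margin):
    """Smallest k with exp(-label_margin**2 * k / 2) <= 1 / (2 * n**4 * num_labels).

    Returns the pair (k, target), where target is the failure probability
    allotted to each competing label in the union bound.
    """
    target = 1.0 / (2 * n ** 4 * num_labels)
    k = math.ceil(2.0 * math.log(1.0 / target) / label_margin ** 2)
    return k, target

for n in (10**2, 10**3, 10**4):
    k, target = min_neighbors(n, num_labels=2, label_margin=0.3)
    print(n, k, math.exp(-0.3 ** 2 * k / 2) <= target)
```

The required $k$ grows like $\frac{8}{\Delta^{2}}\ln n$ plus a constant, so any choice with $k_{n}=\omega(\log n)$ eventually exceeds it for fixed $\Delta$ and $|\mathcal{Y}|$.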

Next, we show that the entire support of the target distribution, $\mathcal{D}_{t}=(\mu_{t},\eta_{t})$ , can be covered using the regions $\mathcal{X}_{p,r}$ .

Lemma 14.

Let $\rho>0$ . Then there exists $p,r>0$ such that the following holds. For all $\phi\in\Phi$ that realize SIRM ( $\mathcal{D}_{s},\mathcal{D}_{t}$ ) and for which $\mathcal{D}_{s}^{\phi}$ has margin at least $\rho$ ,

\textnormal{supp}(\mathcal{D}_{s}),\textnormal{supp}(\mathcal{D}_{t})\subseteq% \mathcal{X}_{p,r}^{\phi}.

Proof.

Let $r=\rho\left(\frac{1}{2}-\frac{1}{\Lambda}\right)$ . By the Definition of $\Lambda$ , $r>0$ . Now let $x\in\textnormal{supp}(\mathcal{D}_{t})$ be arbitrary.

Because $\phi$ contracts $(\mathcal{D}_{s},\mathcal{D}_{t})$ , there exists $x^{\prime}\in\textnormal{supp}(\mu_{s})$ such that $d_{\phi}(x,x^{\prime})<\frac{\rho^{\phi}}{\Lambda}$ , where $\rho^{\phi}$ is the margin of $D_{s}^{\phi}$ . It follows that

\begin{split}d_{\phi}(x,x^{\prime})&<\frac{\rho^{\phi}}{\Lambda}\\ &=\frac{1}{2}\rho^{\phi}-\left(\frac{1}{2}-\frac{1}{\Lambda}\right)\rho^{\phi}% \\ &\leq\frac{1}{2}\rho^{\phi}-\left(\frac{1}{2}-\frac{1}{\Lambda}\right)\rho\\ &=\frac{\rho^{\phi}}{2}-r.\end{split}

Next, by Lemma 9, there exists $s>0$ such that for all $\phi\in\Phi$ and all $x\in\mathcal{X}$ , $\phi(B(x,s))\subseteq B(\phi(x),r)$ . This implies that $B(x^{\prime},s)\subseteq B_{\phi}(x^{\prime},r)$ for all $x^{\prime}$ . Finally, we take

p=\inf_{x\in\textnormal{supp}(\mu_{s})}\mu_{s}(B(x,s)).

It suffices to show that $p>0$ . To do so, observe that $\textnormal{supp}(\mu_{s})$ is closed and therefore compact (as $\mathcal{X}$ is compact by assumption). Take an open cover of $\textnormal{supp}(\mu_{s})$ by balls of radius $s/2$ centered at points of $\textnormal{supp}(\mu_{s})$ . Then it has a finite sub-cover. Each of these balls has positive mass under $\mu_{s}$ , and furthermore every ball $B(x,s)$ where $x\in\textnormal{supp}(\mu_{s})$ must fully contain at least one of these balls. It follows that $\mu_{s}(B(x,s))\geq q$ , where $q>0$ is the minimum mass of one of these balls. Since $q>0$ , it follows that $p>0$ , as desired. ∎

Lemma 15.

Let $\rho>0$ . Then there exists $N>0$ such that for all $\phi$ that relate $(\mathcal{D}_{s},\mathcal{D}_{t})$ such that $\mathcal{D}_{s}^{\phi}$ has margin at least $\rho$ , if $n\geq N$ , then with probability at least $1-\frac{1}{n^{2}}$ over $S\sim\mathcal{D}_{s}^{n}$ ,

R(\mathcal{N}_{S}^{\phi},\mathcal{D}_{t})<R_{t}^{*}+\frac{1}{n^{2}}.

Proof.

Let $\phi$ relate $\mathcal{D}_{s},\mathcal{D}_{t}$ , and suppose $\mathcal{D}_{s}^{\phi}$ has margin $\rho^{\phi}$ . Let $r,p$ be as in Lemma 14. Because $\phi$ relates $\mathcal{D}_{s},\mathcal{D}_{t}$ , observe that for all $x\in\textnormal{supp}(\mu_{t})$ ,

g_{\mathcal{D}_{t}}(x)=g_{\mathcal{D}_{s}^{\phi}}\left(\operatorname*{arg\,min}_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))\right).

To see this, observe that Definition 3 implies that $\min_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))<\frac{\rho^{\phi}}{2}$ , and Definition 4 implies that the minimizing $z$ must be labeled by $g_{\mathcal{D}_{s}^{\phi}}$ the same as $x$ is by $g_{\mathcal{D}_{t}}$ .

Next, select $N$ from Lemma 13. Since every $x\in\textnormal{supp}(\mu_{t})$ lies in $\mathcal{X}_{p,r}^{\phi}$ , it follows that for $n\geq N$ , with probability at least $1-\frac{1}{n^{4}}$ over $S\sim\mathcal{D}_{s}^{n}$ ,

\mathcal{N}_{S}^{\phi}(x)=g_{\mathcal{D}_{s}^{\phi}}\left(\operatorname*{arg\,min}_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))\right)=g_{\mathcal{D}_{t}}(x).

A standard application of Markov’s inequality, which converts the bound on the expected loss into a bound on the loss that holds with high probability, completes the proof. ∎

We are now prepared to prove Theorem 1.

Proof.

Any $\phi$ that relates $(\mathcal{D}_{s},\mathcal{D}_{t})$ has positive margin, and so the previous lemma applies for sufficiently large $n$ . Since $\frac{1}{n^{2}}\to 0$ , it immediately follows that $\mathcal{N}_{S}^{\phi}$ converges in risk to the Bayes-optimal classifier of $\mathcal{D}_{t}$ , as desired. ∎

Appendix G Proof of Theorem 3

G.1 Description of our learning rule

We begin with our learning rule, $L$ , that achieves the bound given in Theorem 3.

Algorithm 2

\textsc{direct\_generalize\_nn}(S\sim\mathcal{D}_{s}^{n})

S_{tr}\leftarrow\{(x_{i},y_{i}):1\leq i\leq n/4\}

S_{loss}\leftarrow\{(x_{i},y_{i}):n/4<i\leq n/2\}

S_{margin}\leftarrow\{(x_{i},y_{i}):n/2<i\leq 3n/4\}

S_{final}\leftarrow\{(x_{i},y_{i}):3n/4<i\leq n\}

\epsilon\leftarrow n^{-1/3}

\Phi_{\epsilon}=\left\{\phi:\textsc{source\_loss}(\phi,S_{loss})<\epsilon\right\}

\hat{\phi}\leftarrow\operatorname*{arg\,max}_{\phi\in\Phi_{\epsilon}}\textsc{% source\_margin}(\phi,S_{margin})

8:return

\mathcal{N}_{S_{final}}^{\hat{\phi}}
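The control flow of Algorithm 2 can be sketched as follows. This is a hedged illustration rather than the exact procedure: it assumes a finite list of candidate feature maps in place of the class $\Phi$, it passes $S_{tr}$ to the loss and margin estimators (following the signatures of Algorithms 3 and 4), and it returns the pair $(\hat{\phi},S_{final})$ from which $\mathcal{N}_{S_{final}}^{\hat{\phi}}$ would be built. The estimators are supplied as function arguments, and all parameter names are hypothetical.

```python
def direct_generalize_nn(S, candidates, source_loss, source_margin):
    """Select a feature map with small source loss and maximal estimated margin.

    S             -- list of labeled source points
    candidates    -- finite list of feature maps (stand-in for the class Phi)
    source_loss   -- callable (phi, S_tr, S_loss) -> empirical loss
    source_margin -- callable (phi, S_tr, S_margin) -> estimated margin
    """
    n = len(S)
    quarter = n // 4
    S_tr = S[:quarter]
    S_loss = S[quarter:2 * quarter]
    S_margin = S[2 * quarter:3 * quarter]
    S_final = S[3 * quarter:]

    eps = n ** (-1.0 / 3.0)

    # Keep only feature maps whose estimated source loss is below eps.
    feasible = [phi for phi in candidates
                if source_loss(phi, S_tr, S_loss) < eps]
    if not feasible:
        raise ValueError("no candidate feature map has small source loss")

    # Among those, pick the one with the largest estimated source margin.
    phi_hat = max(feasible, key=lambda phi: source_margin(phi, S_tr, S_margin))
    return phi_hat, S_final
```

Passing the estimators in as arguments keeps the sketch independent of any particular nearest-neighbor implementation; in practice they would be the routines described in Sections G.2 and G.3.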

G.2 Bounding the error in estimating the loss

Our method for estimating the loss that a nearest neighbors classifier incurs over the source distribution is given in Algorithm 3. We simply evaluate the empirical risk using nearest neighbors over the designated loss set, $S_{loss}$ .

We now bound the accuracy of this method using the following Lemma.

Lemma 16.

Let $\mathcal{D}$ be an arbitrary data distribution. Let $S_{tr}$ be a set of $n$ labeled points, and let $S_{loss}\sim\mathcal{D}^{n}$ be an i.i.d sample that is independent of $S_{tr}$ . Then there exists $N>0$ such that for all $n\geq N$ , with probability at least $1-\frac{1}{n^{2}}$ over $S_{loss}\sim\mathcal{D}^{n}$ , for all $\phi\in\Phi$ ,

|source\_loss(\phi,S_{tr},S_{loss})-R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D})|<n^{-1/3}.

Proof.

Fix $\epsilon=n^{-1/3}$ , and define $E$ as the event that the empirical risk induced by each $\phi\in\Phi$ is representative of the true risk. That is,

E=\mathbbm{1}\left(\sup_{\phi\in\Phi}\left|R(\mathcal{N}_{S_{tr}}^{\phi},% \mathcal{D})-\frac{1}{n}\sum_{(x,y)\in S_{loss}}\mathbbm{1}\left(\mathcal{N}_{% S_{tr}}^{\phi}(x)\neq y\right)\right|<\epsilon\right).

Our goal is to show that $E$ holds with probability at least $1-\frac{1}{n^{2}}$ , for sufficiently large $n$ . The key observation is that for all $\phi\in\Phi$ ,

\mathbbm{1}\left(\mathcal{N}_{S_{tr}}^{\phi}(x)\neq y\right)=1-h_{S,\phi}((x,y% )),

where $h_{S,\phi}(x,y)$ is as defined in Definition 20. Thus, it follows that

\begin{split}E&=\mathbbm{1}\left(\sup_{\phi\in\Phi}\left|R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D})-\frac{1}{n}\sum_{(x,y)\in S_{loss}}\mathbbm{1}\left(\mathcal{N}_{S_{tr}}^{\phi}(x)\neq y\right)\right|<\epsilon\right)\\ &=\mathbbm{1}\left(\sup_{h\in\mathcal{H}(S_{tr},\Phi)}\left|\mathbb{E}_{(x,y)\sim\mathcal{D}}[1-h(x,y)]-\frac{1}{n}\sum_{(x,y)\in S_{loss}}\left(1-h(x,y)\right)\right|<\epsilon\right)\\ &=\mathbbm{1}\left(\sup_{h\in\mathcal{H}(S_{tr},\Phi)}\left|\mathbb{E}_{(x,y)\sim\mathcal{D}}[h(x,y)]-\frac{1}{n}\sum_{(x,y)\in S_{loss}}h(x,y)\right|<\epsilon\right).\end{split}

To analyze the latter quantity, a standard application of the fundamental theorem of statistical learning (see Shalev-Shwartz and Ben-David) implies that $E$ holds with probability $1-\delta$ provided that $|S_{loss}|=n\geq\Omega\left(\frac{vc\left(\mathcal{H}(S_{tr},\Phi)\right)+\ln\frac{1}{\delta}}{\epsilon^{2}}\right)$ .

Fix $\delta=\frac{1}{n^{2}}$ . By Lemma 4, $vc\left(\mathcal{H}(S_{tr},\Phi)\right)\leq c_{1}\partial(\Phi)\log\left(n+% \partial(\Phi)\right)$ . Substituting this, along with $\epsilon=n^{-1/3}$ , we see that

\begin{split}\frac{vc\left(\mathcal{H}(S_{tr},\Phi)\right)+\ln\frac{1}{\delta}% }{\epsilon^{2}}&\leq\frac{c_{1}\partial(\Phi)\log\left(n+\partial(\Phi)\right)% +2\ln n}{n^{-2/3}}\\ &\leq Cn^{2/3}\log n,\end{split}

where $C$ is some constant that depends on $\partial(\Phi)$ . Since $n$ asymptotically dominates this quantity, it follows that for sufficiently large $n$ , we indeed have $n\geq\Omega\left(\frac{vc\left(\mathcal{H}(S_{tr},\Phi)\right)+\ln\frac{1}{\delta}}{\epsilon^{2}}\right)$ , which proves the desired result. ∎

Algorithm 3

source\_loss(\phi,S_{tr},S_{loss})

1:return

\frac{1}{|S_{loss}|}\sum_{(x,y)\in S_{loss}}\mathbbm{1}\left(\mathcal{N}^{\phi% }_{S_{tr}}(x)\neq y\right)
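A direct transcription of Algorithm 3 is short. The sketch below is a minimal illustration under assumptions that are not fixed by the text: $\mathcal{N}_{S_{tr}}^{\phi}$ is taken to be a $k$-nearest-neighbor plurality vote with Euclidean distance in feature space, the value of `k` is a free parameter, and the toy data and feature maps are made up.

```python
import math
from collections import Counter

def knn_label(phi, S_tr, x, k):
    """Plurality label among the k points of S_tr nearest to x under d_phi."""
    nearest = sorted(S_tr, key=lambda pt: math.dist(phi(pt[0]), phi(x)))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def source_loss(phi, S_tr, S_loss, k=3):
    """Fraction of S_loss that the k-NN rule built on S_tr misclassifies."""
    errors = sum(knn_label(phi, S_tr, x, k) != y for x, y in S_loss)
    return errors / len(S_loss)

def identity(x):
    """Feature map that leaves points unchanged."""
    return x

def collapse(x):
    """Degenerate feature map that sends every point to the origin."""
    return (0.0,)

S_tr = [((0.0,), 0), ((0.1,), 0), ((0.2,), 0),
        ((2.0,), 1), ((2.1,), 1), ((2.2,), 1)]
S_loss = [((0.05,), 0), ((2.05,), 1), ((0.15,), 0), ((2.15,), 1)]

print(source_loss(identity, S_tr, S_loss))
print(source_loss(collapse, S_tr, S_loss))
```

A feature map that keeps the two label regions apart incurs no empirical loss on this sample, while one that collapses them cannot do better than predicting a single label.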

G.3 Bounding the error in estimating the margin

Our method for estimating the margin of a distribution, $\mathcal{D}_{s}^{\phi}$ , is given in Algorithm 4. The main idea is to split the set, $S_{source}$ , into two equal parts, $S_{source}^{a}$ , and $S_{source}^{b}$ . We then use a nearest neighbors classifier over $S_{tr}$ to label the points in both $S_{source}^{a}$ and $S_{source}^{b}$ . Finally, we measure the distance between differently labeled points from $S_{source}^{a}$ and $S_{source}^{b}$ respectively. For technical reasons, when comparing distances between $S_{source}^{a}$ and $S_{source}^{b}$ , we only compare points that have the same index. This allows us to exploit independence between each comparison we make.

We now show that this method is likely to accurately estimate margins by showing that it gives good estimates for $\alpha^{\phi}$ , which is described in Definition 24.

Lemma 17.

There exists $N$ , such that for all $n\geq N$ , if $S_{tr}$ is a set of $n$ labeled points, with probability at least $1-\frac{1}{n}$ over $S_{source}\sim\mathcal{D}_{s}^{n}$ and $S_{loss}\sim\mathcal{D}_{s}^{n}$ , at least one of the two conditions will hold:

1.

$source\_loss(\phi,S_{tr},S_{loss})>R_{s}^{*}+O(n^{-1/4})$ .
2.

$\Pr[\alpha^{\phi}<source\_margin(\phi,S_{tr},S_{source})]<O(n^{-1/4})$ .

Proof.

For $\phi\in\Phi$ and $r\geq 0$ , let $q_{\phi,r,S_{tr}}$ be as defined in Definition 22, so that

q_{\phi,r,S_{tr}}(x,x^{\prime})=\begin{cases}1&d_{\phi}(x,x^{\prime})<r\text{ and }\mathcal{N}_{S_{tr}}^{\phi}(x)\neq\mathcal{N}_{S_{tr}}^{\phi}(x^{\prime})\\ 0&\text{otherwise}\end{cases}.

Observe that

source\_margin(\phi,S_{tr},S_{source})=\max\left\{r:\frac{1}{n}\sum_{i=1}^{n}q% _{\phi,r,S_{tr}}(x_{i}^{a},x_{i}^{b})=0\right\}.

This is because $q_{\phi,r,S_{tr}}(x_{i}^{a},x_{i}^{b})=0$ if and only if either $x_{i}^{a}$ and $x_{i}^{b}$ are given the same labels, or if they have distance (under $\phi$ ) of at least $r$ .

Define $\alpha_{S_{tr}}^{\phi}$ as the random variable where for $x,x^{\prime}\sim\mu_{s}$ ,

\alpha_{S_{tr}}^{\phi}=\begin{cases}d_{\phi}(x,x^{\prime})&\mathcal{N}_{S_{tr}% }^{\phi}(x)\neq\mathcal{N}_{S_{tr}}^{\phi}(x^{\prime})\\ \infty&\text{otherwise}\end{cases}.

The variable $\alpha_{S_{tr}}^{\phi}$ is closely related to $\alpha^{\phi}$ ; the only difference is that we replace the Bayes-optimal classifier, $g_{\mathcal{D}_{s}}$ , with $\mathcal{N}_{S_{tr}}^{\phi}$ .

To relate $\alpha_{S_{tr}}^{\phi}$ to our previous quantities, observe that

\Pr[\alpha_{S_{tr}}^{\phi}<r]=\mathbb{E}_{(x,x^{\prime})\sim\mu_{s}^{2}}[q_{\phi,r,S_{tr}}(x,x^{\prime})].

Because the set of classifiers, $\mathcal{Q}(\Phi,S_{tr})=\{q_{\phi,r,S_{tr}}:\phi\in\Phi,r\geq 0\}$ has VC-dimension bounded by $c_{3}\partial(\Phi)\log(n+\partial(\Phi))$ (Lemma 6), we can apply uniform convergence to see that the empirical average of $q_{\phi,r,S_{tr}}$ must be close to $\Pr[\alpha_{S_{tr}}^{\phi}<r]$ with high probability, uniformly over all $\phi,r$ . More precisely, by applying the same argument as in the proof of Lemma 16, we have that for $n$ sufficiently large, with probability at least $1-\frac{1}{n^{2}}$ over $S_{source}\sim\mathcal{D}_{s}^{n}$ , for all $q_{\phi,r,S_{tr}}\in\mathcal{Q}(\Phi,S_{tr})$ ,

\left|\mathbb{E}_{(x,x^{\prime})\sim\mu_{s}^{2}}[q_{\phi,r,S_{tr}}(x,x^{\prime% })]-\frac{1}{n}\sum_{i=1}^{n}q_{\phi,r,S_{tr}}(x_{i}^{a},x_{i}^{b})\right|<n^{% -1/3}.

By substituting the definition of $\alpha_{S_{tr}}^{\phi}$ along with our observation about $source\_margin(\phi,S_{tr},S_{source})$ , it follows that

\Pr[\alpha_{S_{tr}}^{\phi}<source\_margin(\phi,S_{tr},S_{source})]<n^{-1/3}.

We now turn our attention to showing that $\alpha_{S_{tr}}^{\phi}$ must indeed serve as a reasonable approximation for $\alpha^{\phi}$ . To do so, observe that if $\alpha_{S_{tr}}^{\phi}$ and $\alpha^{\phi}$ are constructed from the same random variables, $x,x^{\prime}\sim\mu_{s}$ , then they only differ if $\mathcal{N}_{S_{tr}}^{\phi}$ and $g_{\mathcal{D}_{s}}$ differ over either $x$ or $x^{\prime}$ .

Suppose that $R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D}_{s})=R(g_{\mathcal{D}_{s}},\mathcal{D}_{s})+\epsilon^{\phi}.$ Then it follows by Lemma 8 that $\Pr_{x\sim\mu_{s}}[g_{\mathcal{D}_{s}}(x)\neq\mathcal{N}_{S_{tr}}^{\phi}(x)]\leq\frac{\epsilon^{\phi}}{\Delta}$ , where $\Delta$ is the label margin of $\mathcal{D}_{s}$ . It follows by a union bound that the probability that $\alpha^{\phi}<r$ is at most $\frac{\epsilon^{\phi}}{\Delta}$ summed with the probability that $\alpha_{S_{tr}}^{\phi}<r$ . That is,

\Pr[\alpha^{\phi}<source\_margin(\phi,S_{tr},S_{source})]<n^{-1/3}+\frac{\epsilon^{\phi}}{\Delta}

(1)

However, if $n$ is sufficiently large, then we have that with probability at least $1-\frac{1}{n^{2}}$ over $S_{loss}\sim\mathcal{D}_{s}^{n}$ , for all $\phi$ ,

\left|source\_loss(\phi,S_{tr},S_{loss})-\left(R(g_{\mathcal{D}_{s}},\mathcal{% D}_{s})+\epsilon^{\phi}\right)\right|<O(n^{-1/3}).

This in turn implies that

source\_loss(\phi,S_{tr},S_{loss})>R(g_{\mathcal{D}_{s}},\mathcal{D}_{s})+% \epsilon^{\phi}-O(n^{-1/3})

(2)

By taking a union bound, it follows with probability at least $1-\frac{2}{n^{2}}$ , that Equations 1 and 2 simultaneously hold over all $\phi$ .

Finally, if $\epsilon^{\phi}\geq n^{-1/4}$ , then for $n$ sufficiently large, Equation 2 implies that condition 1 from the statement of the Lemma must hold. Otherwise, if $\epsilon^{\phi}<n^{-1/4}$ , Equation 1 implies that condition 2 holds. Thus in either case, one of the two conditions holds, which completes the proof. ∎

Algorithm 4

source\_margin(\phi,S_{tr},S_{source})

S_{source}=S_{source}^{a}\cup S_{source}^{b}

|S_{source}^{a}|=|S_{source}^{b}|

S_{source}^{a}=\{(x_{1}^{a},y_{1}^{a}),\dots,(x_{n}^{a},y_{n}^{a})\}

X^{a}=\{x_{1}^{a},\dots,x_{n}^{a}\}

S_{source}^{b}=\{(x_{1}^{b},y_{1}^{b}),\dots,(x_{n}^{b},y_{n}^{b})\}

X^{b}=\{x_{1}^{b},\dots,x_{n}^{b}\}

6:for

i=1,\dots,n

d_{i}=d_{\phi}(x_{i}^{a},x_{i}^{b})

8: if

\mathcal{N}_{S_{tr}}^{\phi}(x_{i}^{a})=\mathcal{N}_{S_{tr}}^{\phi}(x_{i}^{b})

then

d_{i}=\infty

10: end if

11:end for

12:return

\min_{1\leq i\leq n}d_{i}
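Algorithm 4 can be sketched in the same style. As before, this is an illustration under assumptions: $\mathcal{N}_{S_{tr}}^{\phi}$ is taken to be a $k$-nearest-neighbor plurality vote with Euclidean distance in feature space, `k` is a free parameter, and the toy data and feature maps are made up. Only pairs with the same index in the two halves are compared, as in the description of Section G.3.

```python
import math
from collections import Counter

def knn_label(phi, S_tr, x, k):
    """Plurality label among the k points of S_tr nearest to x under d_phi."""
    nearest = sorted(S_tr, key=lambda pt: math.dist(phi(pt[0]), phi(x)))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def source_margin(phi, S_tr, S_source, k=3):
    """Smallest d_phi distance between same-index points of the two halves
    of S_source that the k-NN rule labels differently (infinity if none)."""
    half = len(S_source) // 2
    xs_a = [x for x, _ in S_source[:half]]
    xs_b = [x for x, _ in S_source[half:2 * half]]
    best = math.inf
    for xa, xb in zip(xs_a, xs_b):
        # Pairs with equal predicted labels contribute infinity.
        if knn_label(phi, S_tr, xa, k) != knn_label(phi, S_tr, xb, k):
            best = min(best, math.dist(phi(xa), phi(xb)))
    return best

def identity(x):
    """Feature map that leaves points unchanged."""
    return x

def collapse(x):
    """Degenerate feature map that sends every point to the origin."""
    return (0.0,)

S_tr = [((0.0,), 0), ((0.1,), 0), ((0.2,), 0),
        ((2.0,), 1), ((2.1,), 1), ((2.2,), 1)]
S_source = [((0.1,), 0), ((0.2,), 0), ((2.0,), 1), ((2.1,), 1)]

print(source_margin(identity, S_tr, S_source))
print(source_margin(collapse, S_tr, S_source))
```

Note that the labels stored in `S_source` are not used; only the labels predicted by the nearest-neighbor rule enter the estimate. When no pair receives different predicted labels the estimate is infinite, which is why Algorithm 2 first restricts attention to feature maps with small estimated source loss.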

G.4 Proving the theorem

We first show a Lemma that implies that the feature map selected by our algorithm, $\hat{\phi}$ , is likely to realize the SIRM assumption on $(\mathcal{D}_{s},\mathcal{D}_{t})$ .

Lemma 18.

Let $\phi^{*}\in\Phi$ be any SIRM realizing feature map, and suppose that $\mathcal{D}_{s}^{\phi^{*}}$ has margin $\rho^{*}$ . Then for all $\delta>0$ , there exists $N$ such that if $n\geq N$ , with probability at least $1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$ , $\hat{\phi}$ (defined in Line 6 of direct_generalize_nn) is a SIRM realizing feature map for $(\mathcal{D}_{s},\mathcal{D}_{t})$ , and has margin at least $\frac{\rho^{*}}{2}$ .

Proof.

Assume towards a contradiction, that for $\delta>0$ , there exist arbitrarily large values of $n$ for which with probability at least $\delta$ , $\hat{\phi}$ has margin less than $\frac{\rho^{*}}{2}$ .

For $n$ sufficiently large, with probability at least $1-O(\frac{1}{n})$ , applying Lemmas 16 and 17 we have that, for all $\phi\in\Phi$ ,

|source\_loss(\phi,S_{tr},S_{loss})-R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D})% |<n^{-1/3},

and that one of the two conditions hold as well:

1.

$source\_loss(\phi,S_{tr},S_{loss})>R_{s}^{*}+O(n^{-1/4})$ .
2.

$\Pr[\alpha^{\phi}<source\_margin(\phi,S_{tr},S_{source})]<O(n^{-1/4})$ .

Because these equations hold for $\phi^{*}$ , the smallest observed empirical loss must be at most $R_{s}^{*}+n^{-1/3}$ . It follows that $\hat{\phi}$ must incur empirical loss at most $R_{s}^{*}+2n^{-1/3}$ which implies that condition 2. must apply to $\hat{\phi}$ .

Furthermore, because $k_{n}>\log n$ , we have that with high probability, $\mathcal{N}_{S_{tr}}^{\phi^{*}}$ will match the Bayes-optimal classifier, $g_{\mathcal{D}_{s}}$ for all points in $S_{source}$ . It follows that the observed margin, $source\_margin(\phi^{*},S_{tr},S_{source})$ will be at least $\rho^{*}$ .

Combining all of this, we see that

\Pr[\alpha^{\hat{\phi}}<\rho^{*}]<n^{-1/4}.

Now let $n_{1},n_{2},\dots$ be a sequence of integers going to infinity so that for each $n_{i}$ , with probability at least $\delta$ , $\hat{\phi}$ has a margin less than $\frac{\rho^{*}}{2}$ . Because $\delta$ is fixed, it follows that for sufficiently large $n_{i}$ , there exist $\hat{\phi}_{i}$ and $S_{i}$ such that all of the equations above hold.

Because $\Phi$ is compact, there exists an infinite subsequence of the $n_{i}$ s for which $\hat{\phi}_{i}$ converges (using the distance metric over $\Phi$ ) to some $\phi$ . Relabel our sequence so that without loss of generality, $\hat{\phi}_{i}\to\phi$ .

The key observation is that the variable $\alpha^{\phi}$ is Lipschitz with respect to the distance metric over $\Phi$ . In particular, if $d(\phi,\phi^{\prime})<r$ , then $|\alpha^{\phi}-\alpha^{\phi^{\prime}}|<2r$ .

Using this, observe that for sufficiently large values of $i$ , we have that $d(\hat{\phi}_{i},\phi)<\frac{\rho^{*}}{8}$ . Substituting this, it follows that for all sufficiently large $i$ ,

\Pr[\alpha^{\phi}<\frac{3\rho^{*}}{4}]\leq\Pr[\alpha^{\hat{\phi}_{i}}<\rho^{*}]<n_{i}^{-1/4}.

Since $n_{i}$ can be arbitrarily large it follows that $\Pr[\alpha^{\phi}<\frac{3\rho^{*}}{4}]=0$ which implies $\mathcal{D}_{s}^{\phi}$ must have margin at least $\frac{3\rho^{*}}{4}$ .

However, this in turn implies that for all sufficiently large $i$ , $\hat{\phi}_{i}$ too must have margin at least $\frac{3\rho^{*}}{4}-\frac{\rho^{*}}{4}=\frac{\rho^{*}}{2}$ . Here we are again exploiting the fact that the margin is Lipschitz.

This finally gives us a contradiction, as we previously assumed that all $\hat{\phi}_{i}$ had margin less than $\frac{\rho^{*}}{2}$ .

∎

We are now prepared to prove Theorem 3.

Proof.

Fix $\epsilon,\delta>0$ . The previous Lemma implies that for sufficiently large values of $n$ , with probability $1-\frac{\delta}{2}$ we will select some $\hat{\phi}$ that has margin at least $\frac{\rho^{*}}{2}$ . Lemma 15 implies that for $n$ sufficiently large (in a way that only depends on $\rho^{*},\mathcal{D}_{s}$ ), with probability at least $1-\frac{\delta}{2}$ over $S_{final}\sim\mathcal{D}_{s}^{n/4}$ ,

R(\mathcal{N}_{S_{final}}^{\hat{\phi}},\mathcal{D}_{t})<R_{t}^{*}+\epsilon.

Crucially, $S_{final}$ is completely independent of $\hat{\phi}$ , which is learned purely using $S_{tr},S_{loss},$ and $S_{source}$ . Taking a union bound implies the desired result. ∎

Appendix H Proof of Theorem 4

Proof.

Fix $\epsilon>0$ . Let $\phi_{1}$ relate $(\mathcal{D}_{s},\mathcal{D}_{t})$ , and let $\phi_{2}\in\mathcal{S}(\Phi)\setminus\Phi^{*}$ be a feature map that source-preserves $\mathcal{D}_{s}$ but fails to relate $(\mathcal{D}_{s},\mathcal{D}_{t})$ . We will construct $\mathcal{D}_{s}^{\prime}$ and $\mathcal{D}_{t}^{\prime}$ using $\phi_{1}$ and $\phi_{2}$ .

Because $\phi_{2}$ fails to relate $(\mathcal{D}_{s},\mathcal{D}_{t})$ , there exists $x\in\textnormal{supp}(\mu_{t})$ such that $\phi_{2}(x)=z\notin\textnormal{supp}(\mu_{s}^{\phi_{2}})$ . Note that if this doesn’t hold, then we can simply use the construction from the proof of Theorem 6. The point $x$ will be central to constructing both $\mathcal{D}_{s}^{\prime}$ and $\mathcal{D}_{t}^{\prime}$ . We begin by constructing $\mathcal{D}_{s}^{\prime}$ .

Constructing $\mathcal{D}_{s}^{\prime}$ :

Let $\alpha>0$ be a small value. Then by Assumptions 2 and 3, there exist $x_{1},x_{2}\in\mathcal{X}$ and $\phi_{1}^{\alpha},\phi_{2}^{\alpha}\in\Phi$ such that the following conditions hold:

1.

$d(\phi_{1},\phi_{1}^{\alpha}),d(\phi_{2},\phi_{2}^{\alpha})<\alpha$ .
2.

$\phi_{1}^{\alpha}(x_{1})=\phi_{1}^{\alpha}(x)\neq\phi_{1}^{\alpha}(x_{2})$ .
3.

$\phi_{2}^{\alpha}(x_{1})\neq\phi_{2}^{\alpha}(x)=\phi_{2}^{\alpha}(x_{2})$ .

Here, $\phi_{1}^{\alpha}$ and $\phi_{2}^{\alpha}$ are chosen using Assumption 3, while the existence of $x_{1}$ and $x_{2}$ is based on Assumption 2. We also let $x_{1}^{\prime},x_{2}^{\prime}$ be two points such that

0<d_{\phi_{1}^{\alpha}}(x_{1},x_{1}^{\prime})\ll d_{\phi_{1}^{\alpha}}(x_{1},x_{2}),\text{ and }0<d_{\phi_{2}^{\alpha}}(x_{2},x_{2}^{\prime})\ll d_{\phi_{2}^{\alpha}}(x_{1},x_{2}).

Next, let $\mu_{s}^{\prime}$ be a measure over $\mathcal{X}$ obtained by the following steps.

1.

Begin with $\mu_{s}$ , the measure of $\mathcal{D}_{s}$ over $\mathcal{X}$ .
2.

Remove all points in $\textnormal{supp}(\mu_{s})$ that lie within a distance of $r$ from the set $\{x_{1},x_{1}^{\prime},x_{2},x_{2}^{\prime}\}$ .
3.

Pick $s$ such that any two points in $\{x_{1},x_{1}^{\prime},x_{2},x_{2}^{\prime}\}$ have distance larger than $4s$ . Insert balls of probability mass $\epsilon/8$ centered at each of these points, so that $\mu_{s}^{\prime}(B(x,s))=\frac{\epsilon}{8}$ for $x\in\{x_{1},x_{1}^{\prime},x_{2},x_{2}^{\prime}\}$ .

Observe that $\mu_{s}^{\prime}$ is constructed from $\mu_{s}$ by adding a region of mass $\epsilon/2$ (and appropriately down-sizing all other regions). Furthermore, if $r$ is appropriately chosen, then the region being removed from $\mu_{s}$ can also be forced to have size at most $\epsilon/2$ . It follows that $W(\mu_{s},\mu_{s}^{\prime})<\epsilon$ . Next, we define the conditional distribution, $\eta_{s}^{\prime}$ with the following steps. Let $y_{1}\neq y_{2}$ be two labels in $\mathcal{Y}$ .

1.

For $x\in B(x_{1},s):\eta_{s}^{\prime}(y_{1}|x)=1$ .
2.

For $x\in B(x_{1}^{\prime},s):\eta_{s}^{\prime}(y_{2}|x)=1$ .
3.

For $x\in B(x_{2},s):\eta_{s}^{\prime}(y_{2}|x)=1$ .
4.

For $x\in B(x_{2}^{\prime},s):\eta_{s}^{\prime}(y_{1}|x)=1$ .
5.

For all other $x$ , $\eta_{s}^{\prime}(y|x)=\eta_{s}(y|x)$ .

Basically, we force the conditional distribution near $x_{1}$ and $x_{2}$ to be $y_{1}$ and $y_{2}$ respectively. For $x_{1}^{\prime},x_{2}^{\prime}$ , this is reversed. This construction only modifies $\eta_{s}$ at points where $\mu_{s}$ is modified, and it follows that $W(\mathcal{D}_{s},\mathcal{D}_{s}^{\prime})<\epsilon$ .

Furthermore, observe that $\phi_{1}^{\alpha}$ and $\phi_{2}^{\alpha}$ both source-preserve $\mathcal{D}_{s}^{\prime}$ . This occurs because $r$ and $s$ are chosen to be small enough so that the 4 balls, $B(x_{1},s),B(x_{2},s),B(x_{1}^{\prime},s),B(x_{2}^{\prime},s)$ are all mapped to disjoint areas under both $\phi_{1}$ and $\phi_{2}$ .

Constructing $\mathcal{D}_{t}^{\prime}$ :

Next, we will construct $\mathcal{D}_{t}^{\prime}$ by giving a choice of two possible target distributions, $\mathcal{D}_{t}^{1}$ and $\mathcal{D}_{t}^{2}$ . We let $\mu_{t}^{\prime}$ be a point mass that is concentrated at $x$ . We let $\eta_{t}^{1}(y_{1}|x)=1$ and $\eta_{t}^{2}(y_{2}|x)=1$ . This gives us $\mathcal{D}_{t}^{1}$ and $\mathcal{D}_{t}^{2}$ .

Observe that $\mathcal{D}_{t}^{1}$ is SIRM related to $\mathcal{D}_{s}^{\prime}$ by $\phi_{1}^{\alpha}$ , and $\mathcal{D}_{t}^{2}$ is SIRM related to $\mathcal{D}_{s}^{\prime}$ by $\phi_{2}^{\alpha}$ . This is because $x$ is mapped to $x_{1}$ by $\phi_{1}^{\alpha}$ , and the same holds respectively for $x_{2}$ and $\phi_{2}^{\alpha}$ .

Finishing the proof:

We now show that our learner will have a large error for some choice of $\mathcal{D}_{t}^{\prime}\in\{\mathcal{D}_{t}^{1},\mathcal{D}_{t}^{2}\}$ . To do so, suppose $\mathcal{D}_{t}^{\prime}$ is chosen uniformly at random from this set. It follows that our learning rule's expected loss is:

\begin{split}\mathbb{E}_{\mathcal{D}_{t}^{\prime}\sim\{\mathcal{D}_{t}^{1},\mathcal{D}_{t}^{2}\}}\mathbb{E}_{S\sim(\mathcal{D}_{s}^{\prime})^{n}}R(L(S),\mathcal{D}_{t}^{\prime})&=\mathbb{E}_{S\sim(\mathcal{D}_{s}^{\prime})^{n}}\mathbb{E}_{\mathcal{D}_{t}^{\prime}\sim\{\mathcal{D}_{t}^{1},\mathcal{D}_{t}^{2}\}}R(L(S),\mathcal{D}_{t}^{\prime})\\ &=\mathbb{E}_{S\sim(\mathcal{D}_{s}^{\prime})^{n}}\mathbb{E}_{i\sim\{1,2\}}\mathbbm{1}(L(S)(x)\neq y_{i})\\ &=\frac{1}{2}.\end{split}

From here, the desired result follows by a straightforward application of Markov's inequality.

∎
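The averaging step in this proof can be checked with a short Python sketch. It assumes binary labels, so that any prediction at $x$ is either $y_{1}$ or $y_{2}$ , and it illustrates one standard way to pass from an expected risk of $\frac{1}{2}$ to a probability statement for a risk taking values in $[0,1]$ . The function names and the threshold $\frac{1}{4}$ are illustrative choices, not quantities taken from the theorem.

```python
def average_risk_at_x(prediction, candidate_labels=("y1", "y2")):
    """Mean 0-1 loss of a fixed prediction at x over the two candidate targets."""
    errors = [prediction != label for label in candidate_labels]
    return sum(errors) / len(errors)

def reverse_markov(mean, threshold):
    """Lower bound on Pr[R >= threshold] for R in [0, 1] with E[R] = mean.

    Follows from E[R] <= threshold * Pr[R < threshold] + 1 * Pr[R >= threshold].
    """
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must lie in [0, 1)")
    return max(0.0, (mean - threshold) / (1.0 - threshold))

# Whatever the learner predicts at x, it is wrong on exactly one of the two targets.
risk_by_guess = {guess: average_risk_at_x(guess) for guess in ("y1", "y2")}

# Illustrative threshold: with mean risk 1/2, risk >= 1/4 has probability >= 1/3.
bound = reverse_markov(0.5, 0.25)
```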

Appendix I Proof of Theorem 5

I.1 Description of the learning rule

We give the learning rule that achieves the bound given in Theorem 5.

Algorithm 5

\textsc{presrv\_contract\_nn}(S\sim\mathcal{D}_{s}^{n},U\sim\mu_{t}^{m})

S_{tr}\leftarrow\{(x_{i},y_{i}):1\leq i\leq n/5\}

S_{loss}\leftarrow\{(x_{i},y_{i}):n/5<i\leq 2n/5\}

S_{margin}\leftarrow\{(x_{i},y_{i}):2n/5<i\leq 3n/5\}

S_{margin,t}\leftarrow\{(x_{i},y_{i}):3n/5<i\leq 4n/5\}

S_{final}\leftarrow\{(x_{i},y_{i}):4n/5<i\leq n\}

\epsilon\leftarrow n^{-1/3}

\Phi_{\epsilon}=\left\{\phi:\textsc{source\_loss}(\phi,S_{tr},S_{loss})<\epsilon\right\}

\rho_{s}(\phi)=\textsc{source\_margin}(\phi,S_{tr},S_{margin})

\rho_{t}(\phi)=\textsc{target\_margin}(\phi,S_{margin,t},U)

10:

\hat{\phi}\leftarrow\operatorname*{arg\,max}_{\phi\in\Phi_{\epsilon}}\rho_{s}(\phi)-\Lambda\rho_{t}(\phi)

11:return

\mathcal{N}_{S_{tr}}^{\hat{\phi}}
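The selection step of Algorithm 5 can be sketched in Python under two simplifying assumptions: the arg max over $\Phi$ is replaced by a search over a finite list of candidate feature maps, and the three subroutines are passed in as callables. The names `five_way_split` and `select_feature_map` are hypothetical, and the stub subroutines in the usage example are placeholders rather than the procedures analyzed in this appendix.

```python
def five_way_split(S):
    """Split S into S_tr, S_loss, S_margin, S_margin_t, S_final (consecutive fifths)."""
    n = len(S)
    cuts = [i * n // 5 for i in range(6)]
    return [S[cuts[i]:cuts[i + 1]] for i in range(5)]

def select_feature_map(S, U, candidates, source_loss, source_margin,
                       target_margin, Lambda):
    """Pick the candidate maximising rho_s(phi) - Lambda * rho_t(phi) among
    candidates whose estimated source loss is below epsilon = n^(-1/3)."""
    S_tr, S_loss, S_margin, S_margin_t, _S_final = five_way_split(S)
    eps = len(S) ** (-1.0 / 3.0)
    feasible = [phi for phi in candidates if source_loss(phi, S_tr, S_loss) < eps]
    if not feasible:
        raise ValueError("no candidate passes the source-loss filter")

    def score(phi):
        rho_s = source_margin(phi, S_tr, S_margin)
        rho_t = target_margin(phi, S_margin_t, U)
        return rho_s - Lambda * rho_t

    return max(feasible, key=score)

# Toy usage with stub subroutines that look values up in tables.
S = [((float(i),), 0) for i in range(10)]
U = [(0.0,)]
loss = {"a": 0.0, "b": 0.9, "c": 0.1}
rho_s = {"a": 1.0, "b": 5.0, "c": 2.0}
rho_t = {"a": 0.1, "b": 0.0, "c": 0.8}
chosen = select_feature_map(
    S, U, ["a", "b", "c"],
    source_loss=lambda p, *_: loss[p],
    source_margin=lambda p, *_: rho_s[p],
    target_margin=lambda p, *_: rho_t[p],
    Lambda=2.0,
)
```

In this toy run, candidate "b" has the largest raw score but is excluded by the source-loss filter, which is the role $\Phi_{\epsilon}$ plays in the algorithm.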

I.2 Analyzing the procedure, $target\_margin$

We begin by describing the process used to estimate how far data from $\mathcal{D}_{t}$ is from data from $\mathcal{D}_{s}$ under a feature map, $\phi$ . The subroutine is given in Algorithm 6, where $S$ is a labeled set of points drawn from $\mathcal{D}_{s}$ , and $U$ is an unlabeled set of points drawn from $\mu_{t}$ , the marginal $\mathcal{X}$ -distribution of $\mathcal{D}_{t}$ .

Algorithm 6

target\_margin(\phi,S_{margin,t},U)

U=\{u_{1},\dots,u_{m}\}

S_{margin,t}=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}

l=\min(m,\lfloor\sqrt{n}\rfloor)

U^{\prime}=\{u_{1},\dots,u_{l}\}

5:for

1\leq i\leq l

X_{i}=\{x_{il-l+1},\dots,x_{il}\}

7:end for

8:return

\max_{1\leq i\leq l}\min_{x\in X_{i}}d_{\phi}(u_{i},x)

The idea is that each $u\in U^{\prime}$ is assigned its own set of $l$ points sampled from $\mu_{s}$ . Then, we take the distance from $u$ to the closest point in its assigned set. Finally, taking the max of all of these gives us an approximation of the furthest distance any $u\in\textnormal{supp}(\mu_{t})$ has from $\textnormal{supp}(\mu_{s})$ when using the distance metric, $d_{\phi}$ .
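A minimal Python sketch of Algorithm 6 follows. It assumes that $d_{\phi}(u,x)$ is the Euclidean distance between $\phi(u)$ and $\phi(x)$ and that points are tuples of floats. The error raised for empty input is an addition of the sketch, not part of the algorithm.

```python
import math

def target_margin(phi, S_margin_t, U):
    """Max over the first l target points of the distance to the nearest point
    in that target point's own block of l source points."""
    xs = [x for x, _label in S_margin_t]    # labels are not used here
    l = min(len(U), math.isqrt(len(xs)))    # l = min(m, floor(sqrt(n)))
    if l == 0:
        raise ValueError("need at least one source point and one target point")
    worst = 0.0
    for i in range(l):
        block = xs[i * l:(i + 1) * l]       # X_{i+1} in the 1-indexed notation
        u = phi(U[i])
        nearest = min(math.dist(u, phi(x)) for x in block)
        worst = max(worst, nearest)
    return worst

# Toy usage in one dimension with the identity feature map.
identity = lambda x: x
S_margin_t = [((float(i),), 0) for i in range(9)]   # source points 0, 1, ..., 8
U = [(0.5,), (10.0,), (7.0,)]
estimate = target_margin(identity, S_margin_t, U)
```

In the toy run the second target point, 10, is compared only with its own block {3, 4, 5}, even though the source point 8 is closer. Keeping the blocks disjoint is what makes the $l$ observed distances independent.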

We now show that this procedure approximates the quantity, $\beta_{l}^{\phi}$ , which is defined in Definition 26.

Lemma 19.

There exists $N>0$ such that if $n>N$ , and $m\geq\sqrt{n}$ , then with probability at least $1-\frac{1}{n}$ over $S_{margin,t}\sim\mathcal{D}_{s}^{n}$ , and $U\sim\mathcal{D}_{t}^{m}$ , for all $\phi\in\Phi$ ,

\Pr[\beta_{l_{n}}^{\phi}\geq target\_margin(\phi,S_{margin,t},U)]\leq n^{-1/6},

where $l_{n}$ is the largest integer at most $\sqrt{n}$ .

Proof.

For convenience, let $l$ denote $l_{n}$ . For $\phi\in\Phi$ and $r\geq 0$ , let $q_{\phi,r,l}$ be as defined in Definition 23, so that

q_{\phi,r,l}(u,x_{1},\dots,x_{l})=\begin{cases}1&\exists_{1\leq i\leq l}\,d_{\phi}(u,x_{i})<r\\ 0&\text{otherwise}\end{cases}.

Furthermore, let us relabel $X_{i}$ (defined in line 6 of Algorithm 6) so that $X_{i}=(x_{1}^{i},\dots,x_{l}^{i})$ . It follows that

target\_margin(\phi,S_{margin,t},U)=\inf\left\{r:\frac{1}{l}\sum_{i=1}^{l}q_{\phi,r,l}(u_{i},x_{1}^{i},\dots,x_{l}^{i})=1\right\}.

This is because $q_{\phi,r,l}(u_{i},x_{1}^{i},\dots,x_{l}^{i})=1$ if and only if some $x_{j}^{i}$ has distance less than $r$ from $u_{i}$ .

Next, we relate this quantity to $\beta_{l}^{\phi}$ as follows. Observe that

\Pr[\beta_{l}^{\phi}\geq r]=\mathbb{E}_{u\sim\mu_{t},x_{1},\dots,x_{l}\sim\mu_{s}^{l}}[1-q_{\phi,r,l}(u,x_{1},\dots,x_{l})].

Because the set of classifiers, $\mathcal{Q}_{l}(\Phi)=\{q_{\phi,r,l}:\phi\in\Phi,r\geq 0\}$ has bounded VC-dimension, $c_{4}\partial(\Phi)\log(l+\partial(\Phi))$ , we can apply uniform convergence to see that the empirical estimate of $\Pr[\beta_{l}^{\phi}\geq r]$ must be close to its expectation with high probability, uniformly over all $\phi,r$ . More precisely, by applying the same argument as in the proof of Lemma 16, we have that for $n$ sufficiently large, with probability at least $1-\frac{1}{l^{2}}$ over $S_{margin,t}\sim\mathcal{D}_{s}^{l}$ and $U\sim\mathcal{D}_{t}^{l}$ , for all $q_{\phi,r,l}\in\mathcal{Q}_{l}(\Phi)$ ,

\left|\mathbb{E}_{u\sim\mu_{t},x_{1},\dots,x_{l}\sim\mu_{s}^{l}}[1-q_{\phi,r,l}(u,x_{1},\dots,x_{l})]-\frac{1}{l}\sum_{i=1}^{l}\left(1-q_{\phi,r,l}(u_{i},x_{1}^{i},\dots,x_{l}^{i})\right)\right|<l^{-1/3}.

By substituting the definition of $\beta_{l}^{\phi}$ along with our observation about $target\_margin(\phi,S_{margin,t},U)$ , it follows that for all $r>target\_margin(\phi,S_{margin,t},U)$ ,

\Pr[\beta_{l}^{\phi}\geq r]<l^{-1/3}.

By the rules of probability, and the definition of an infimum, it follows that

\Pr[\beta_{l}^{\phi}\geq target\_margin(\phi,S_{margin,t},U)]\leq l^{-1/3}.

Substituting the value of $l$ gives the desired result. ∎

I.3 Bounding the performance of a given feature map, $\phi^{*}\in\Phi$

We now consider a fixed feature map $\phi^{*}$ that realizes the SIRM assumption on $(\mathcal{D}_{s},\mathcal{D}_{t})$ . The idea behind doing so is that this allows us to give a baseline over how $source\_margin$ and $target\_margin$ should be expected to behave.

Lemma 20.

Let $\phi^{*}$ realize the SIRM assumption on $(\mathcal{D}_{s},\mathcal{D}_{t})$ . Suppose $\mathcal{D}_{s}^{\phi^{*}}$ has margin $\rho^{*}>0$ , and that

\max_{x_{t}\in\textnormal{supp}(\mu_{t})}\min_{x_{s}\in\textnormal{supp}(\mu_{s})}d_{\phi^{*}}(x_{t},x_{s})=\beta^{*}.

Finally, let $\rho^{*}-\Lambda\beta^{*}=\gamma^{*}$ where $\gamma^{*}>0$ by the fact that $\phi^{*}$ contracts $(\mathcal{D}_{s},\mathcal{D}_{t})$ . Then for all $\delta>0$ , there exists $N$ such that if $n\geq N$ , $m\geq\sqrt{n}$ , with probability at least $1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$ , $U\sim\mu_{t}^{m}$ , the following three things hold:

1.

$source\_loss(\phi^{*},S_{tr},S_{loss})<R_{s}^{*}+n^{-1/3}$ .
2.

$source\_margin(\phi^{*},S_{tr},S_{margin})\geq\rho^{*}$ .
3.

$target\_margin(\phi^{*},S_{margin,t},U)\leq\beta^{*}+\frac{\gamma^{*}}{2}$ .

Proof.

We bound the probability of each of these three things occurring separately, and then apply a union bound.

First of all, for $n$ sufficiently large, the first claim holds with probability at least $1-O(\frac{1}{n^{2}})$ . This follows directly from a combination of Lemma 16 along with Lemma 15 being applied to $\mathcal{D}_{s}$ (as $\phi^{*}$ technically SIRM realizes on $(\mathcal{D}_{s},\mathcal{D}_{s})$ ). In particular, we have with probability at least $1-\frac{1}{n^{2}}$ that

R(\mathcal{N}_{S_{tr}}^{\phi^{*}},\mathcal{D}_{s})<R_{s}^{*}+\frac{1}{n^{2}}.

(3)

Second, observe that Lemma 8 implies that the probability that $\mathcal{N}_{S_{tr}}^{\phi^{*}}$ differs from the Bayes-optimal is at most $\frac{1}{n^{2}\Delta}$ , where $\Delta$ is the label margin of $\mathcal{D}_{s}$ . It follows that with probability at least $1-O(\frac{1}{n})$ , $\mathcal{N}_{S_{tr}}^{\phi^{*}}$ correctly labels all the points in $S_{margin}^{a}$ and $S_{margin}^{b}$ (see Algorithm 4). Thus, it follows that $source\_margin(\phi^{*},S_{tr},S_{margin})\geq\rho^{*}$ , as with correct labels it is impossible to observe two differently labeled points that are closer than $\rho^{*}$ .

Finally, by Lemma 9, there exists $\tau^{*}>0$ such that

\phi^{*}(B(x,\tau^{*}))\subseteq B\left(\phi^{*}(x),\frac{\gamma^{*}}{2}\right).

Let $p=\min_{x\in\textnormal{supp}(\mu_{s})}\mu_{s}(B(x,\frac{\tau^{*}}{2})).$ The argument in the proof of Lemma 9 implies that $p>0$ . Thus, for any fixed $x\in\textnormal{supp}(\mu_{s})$ , with probability at least $1-(1-p)^{l}\geq 1-e^{-pl}$ over $x_{1},\dots,x_{l}\sim\mu_{s}^{l}$ , there will exist some $x_{i}\in B(x,\frac{\tau^{*}}{2})$ .

Since $\textnormal{supp}(\mu_{s})$ is compact, it follows that we can take a finite covering of $\textnormal{supp}(\mu_{s})$ with balls of radius $\frac{\tau^{*}}{2}$ . If there are $C$ such balls, then by a union bound, with probability at least $1-C(1-p)^{l}\geq 1-Ce^{-pl}$ over $x_{1},\dots,x_{l}\sim\mu_{s}^{l}$ , there will exist some $x_{i}\in B(x,\frac{\tau^{*}}{2})$ for all balls $B(x,\frac{\tau^{*}}{2})$ in our covering. Observe that this implies that for all $x\in\textnormal{supp}(\mu_{s})$ , there will exist some $x_{i}\in B(x,\tau^{*})$ .

We now use this to show that the third condition is likely to hold. Pick $n$ sufficiently large so that $Ce^{-pl}<\frac{1}{n^{2}}$ . It follows that with probability at least $1-O(\frac{1}{n})$ over $S_{margin,t}\sim\mathcal{D}_{s}^{n/5}$ and $U\sim\mathcal{D}_{t}^{m}$ , for all $1\leq i\leq l$ , for all $x\in\textnormal{supp}(\mu_{s})$ , there exists some $x_{j}^{i}\in B(x,\tau^{*})$ .

Suppose that this holds. For each $u_{i}\in U^{\prime}$ (Algorithm 6), let $x_{i}^{*}\in\textnormal{supp}(\mu_{s})$ be the point for which $d_{\phi^{*}}(u_{i},x_{i}^{*})$ is minimized. Thus $d_{\phi^{*}}(u_{i},x_{i}^{*})\leq\beta^{*}$ by the definition of $\beta^{*}$ . However, by our claim above, we see that some $x_{j}^{i}$ must be in $B(x_{i}^{*},\tau^{*})$ , which implies that $\phi^{*}(x_{j}^{i})\in B(\phi^{*}(x_{i}^{*}),\frac{\gamma^{*}}{2})$ . Thus $d_{\phi^{*}}(x_{j}^{i},u_{i})\leq\beta^{*}+\frac{\gamma^{*}}{2}$ .

Since this occurs for all $1\leq i\leq l$ , it follows that the maximum distance we observe is at most $\beta^{*}+\frac{\gamma^{*}}{2}$ , which means $target\_margin(\phi^{*},S_{margin,t},U)\leq\beta^{*}+\frac{\gamma^{*}}{2}$ , as desired.

Since our three events all occur with probability $1-O(1/n)$ , it follows that if $n$ is sufficiently large, they simultaneously occur with probability at least $1-\delta$ . This completes the proof.

∎
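The covering step in this proof rests on two elementary facts: $(1-p)^{l}\leq e^{-pl}$ , and the requirement $Ce^{-pl}<\frac{1}{n^{2}}$ holds once $l=\lfloor\sqrt{n}\rfloor$ is large enough. The Python sketch below checks both numerically. The values of $p$ and $C$ used in the example are arbitrary illustrations, since the proof only asserts that such constants exist.

```python
import math

def miss_probability(p, l):
    """Exact probability that l i.i.d. draws all miss a region of mass p."""
    return (1.0 - p) ** l

def covering_failure_bound(p, C, l):
    """Union bound C * exp(-p * l) on the chance that some ball of the cover is missed."""
    return C * math.exp(-p * l)

def cover_condition_holds(p, C, n):
    """Check C * exp(-p * l) < 1 / n^2 for the block size l = floor(sqrt(n))."""
    l = math.isqrt(n)
    return covering_failure_bound(p, C, l) < 1.0 / n ** 2

# Illustrative constants (assumptions of this sketch): p = 0.05, C = 20.
small_n_ok = cover_condition_holds(0.05, 20, 10 ** 4)   # l = 100
large_n_ok = cover_condition_holds(0.05, 20, 10 ** 6)   # l = 1000
```

With these constants the condition fails at $n=10^{4}$ and holds at $n=10^{6}$ , which is the sense in which the proof asks for $n$ sufficiently large.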

I.4 Proving the Theorem

We are now prepared to prove Theorem 5. We start with the following Lemma.

Lemma 21.

Let $\phi^{*}$ be as defined in Lemma 20. Then for all $\delta>0$ , there exists $N>0$ such that for all $n\geq N$ , $m\geq\sqrt{n}$ , with probability at least $1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$ , $U\sim\mathcal{D}_{t}^{m}$ , the outputted feature map $\hat{\phi}$ satisfies the following: let $\hat{\rho}$ denote the margin of $\mathcal{D}_{s}^{\hat{\phi}}$ and let $\hat{\beta}$ denote

\hat{\beta}=\max_{x_{t}\in\textnormal{supp}(\mu_{t})}\min_{x_{s}\in\textnormal{supp}(\mu_{s})}d_{\hat{\phi}}(x_{t},x_{s}).

Then

\hat{\rho}-\Lambda\hat{\beta}\geq\frac{\gamma^{*}}{4}.

Proof.

Assume towards a contradiction that this fails to occur, and fix a $\delta>0$ for which it fails. For $n$ sufficiently large, with probability at least $1-\frac{\delta}{2}$ , the premises of Lemmas 16, 17, 19, and 20 are all simultaneously met. In particular, Lemmas 16 and 17 imply that

|source\_loss(\phi,S_{tr},S_{loss})-R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D})|<O(n^{-1/3}),

and that one of the following two conditions holds as well:

1.

$source\_loss(\phi,S_{tr},S_{loss})>R_{s}^{*}+n^{-1/4}$ .
2.

$\Pr[\alpha^{\phi}<source\_margin(\phi,S_{tr},S_{margin})]<n^{-1/4}$ .

However, Lemma 20 implies that

source\_loss(\phi^{*},S_{tr},S_{loss})<R_{s}^{*}+n^{-1/3}.

Since $\hat{\phi}$ has source loss at most $n^{-1/3}$ more than the optimal, it follows that condition 2 must hold. Thus, by additionally adding Lemma 19, we see that $\hat{\phi}$ has the following properties:

1.

$\Pr[\alpha^{\hat{\phi}}<source\_margin(\hat{\phi},S_{tr},S_{margin})]<n^{-1/4}$
2.

$\Pr[\beta_{l_{n}}^{\hat{\phi}}\geq target\_margin(\hat{\phi},S_{margin,t},U)]\leq n^{-1/6}$ .

However, recall that $\hat{\phi}$ must maximize the quantity $source\_margin(\phi,S_{tr},S_{margin})-\Lambda\,target\_margin(\phi,S_{margin,t},U)$ . Since this quantity is at least $\frac{\gamma^{*}}{2}$ for $\phi^{*}$ (by Lemma 20), it follows that

\Pr[\alpha^{\hat{\phi}}-\Lambda\beta_{l_{n}}^{\hat{\phi}}<\frac{\gamma^{*}}{2}]<O(n^{-1/6}).

(4)

In particular, this equation holds with probability at least $1-\frac{\delta}{2}$ . However, by our assumption for arbitrarily large values of $n$ , $\hat{\phi}$ fails to have the desired properties with probability $\delta$ .

Thus with probability at least $\frac{\delta}{2}$ , for arbitrarily large values of $n_{i}$ , there exists $\hat{\phi}_{n_{i}}$ such that Equation 4 holds but for which the desired property fails.

Let $n_{1},n_{2},\dots$ be any subsequence of integers so that the corresponding feature maps, $\hat{\phi}_{n_{i}}$ converge to some $\phi\in\Phi$ . Note that this exists because $\Phi$ is compact.

The key observation is that because $\alpha^{\phi}$ and $\beta_{l_{n}}^{\phi}$ are both Lipschitz with respect to $\phi$ (clearly small changes in the feature map cannot change these variables much), it follows that for all $\gamma>0$ , there exists $j$ such that for all $i>j$ ,

\Pr[\alpha^{\phi}-\Lambda\beta_{l_{n_{i}}}^{\phi}<\frac{\gamma^{*}}{2}-\gamma]<O(n_{i}^{-1/6}).

(5)

However, since $\beta_{t}^{\phi}\to\beta^{\phi}$ in distribution (Lemma 12), it follows that for $i$ sufficiently large,

\Pr[\alpha^{\phi}-\Lambda\beta^{\phi}<\frac{\gamma^{*}}{2}-2\gamma]<O(n_{i}^{-1/6}).

(6)

Since $n_{i}$ is arbitrarily large, observe that this implies that

\min(\alpha^{\phi})-\Lambda\max(\beta^{\phi})\geq\frac{\gamma^{*}}{2}-2\gamma.

Finally, since $\hat{\phi}$ gets arbitrarily close to $\phi$ , it follows that

\min(\alpha^{\hat{\phi}})-\Lambda\max(\beta^{\hat{\phi}})\geq\frac{\gamma^{*}}{2}-3\gamma.

Taking $\gamma=\frac{\gamma^{*}}{12}$ , and noting the definitions of $\alpha^{\hat{\phi}}$ and $\beta^{\hat{\phi}}$ , it follows that $\hat{\phi}$ precisely fulfills the conditions given in the statement of the lemma, and this completes the proof.

∎

We now prove Theorem 5.

Proof.

Fix $\epsilon,\delta>0$ . The previous Lemma implies that for sufficiently large values of $n$ , with probability $1-\frac{\delta}{2}$ we will select some $\hat{\phi}$ that has margin at least $\frac{\gamma^{*}}{4}$ . We now use an argument identical to the argument given for Theorem 3 and conclude the proof. ∎

Appendix J Proof of Theorem 6

Proof.

For $\phi\in\Phi$ such that $\phi$ source-preserves $\mathcal{D}_{s}$ , define $g^{\phi}$ as the classifier over $\mathcal{X}$ defined by

g^{\phi}(x)=g_{\mathcal{D}_{s}^{\phi}}\left(\operatorname*{arg\,min}_{z\in% \textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))\right),

with ties being broken arbitrarily.

Let $\phi_{1}\in(\mathcal{S}(\Phi)\cap\Phi_{con})\setminus\Phi_{relates}$ and let $\phi_{2}\in\mathcal{S}(\Phi)$ . Because $\phi_{1}$ and $\phi_{2}$ both contract $(\mathcal{D}_{s},\mathcal{D}_{t})$ , it follows that $g^{\phi_{1}}$ and $g^{\phi_{2}}$ are well-defined over $\textnormal{supp}(\mu_{t})$ . However, because $\phi_{1}\notin\Phi^{*}$ , this implies that there exists $\epsilon>0$ such that

\mu_{t}\{x:g^{\phi_{1}}(x)\neq g^{\phi_{2}}(x)\}=\epsilon.

This holds from the fact that $g^{\phi_{2}}$ must match the Bayes-optimal, $g_{\mathcal{D}_{t}}$ , whereas $g^{\phi_{1}}$ must fail to (otherwise $\phi_{1}$ would indeed SIRM realize on $(\mathcal{D}_{s},\mathcal{D}_{t})$ ).

Define

\eta_{t}^{i}(y|x)=\begin{cases}1&g^{\phi_{i}}(x)=y\\ 0&\text{otherwise}\end{cases}.

Essentially this is a noiseless distribution that is purely classified by $g^{\phi_{i}}$ . Let $\mathcal{D}_{t}^{1}=(\mu_{t},\eta_{t}^{1})$ and $\mathcal{D}_{t}^{2}=(\mu_{t},\eta_{t}^{2})$ . The key observation is that if we randomly select $\mathcal{D}_{t}^{\prime}\sim\{\mathcal{D}_{t}^{1},\mathcal{D}_{t}^{2}\}$ and then apply our learning algorithm to $(\mathcal{D}_{s},\mathcal{D}_{t}^{\prime})$ , then our learner must incur expected risk at least $\frac{\epsilon}{2}$ . This is because whatever it outputs, it has a 50-50 chance of misclassifying instances on which $g^{\phi_{1}}$ and $g^{\phi_{2}}$ disagree. Thus our learner has expected risk at least $\frac{\epsilon}{2}$ . Since $\epsilon>0$ is fixed, this implies the desired result. ∎

Appendix K Bounds on the Distance Dimension and Transfer Learning Guarantees

K.1 Proof of Theorem 2

Proof.

(Theorem 2) We will construct two maps, $\alpha:Lin_{D,D}\to\mathbb{R}^{D^{2}}$ , and $\beta:(\mathbb{R}^{D})^{4}\to\mathbb{R}^{D^{2}}$ such that for any $\phi\in Lin_{D,D}$ and $x_{1},x_{2},x_{3},x_{4}\in(\mathbb{R}^{D})$ ,

\Delta\phi(x_{1},x_{2},x_{3},x_{4})=sgn\left(\langle\alpha(\phi),\beta(x_{1},x_{2},x_{3},x_{4})\rangle\right).

This will immediately imply the result, as it is well known that homogeneous linear classifiers over $\mathbb{R}^{n}$ have VC dimension $n$ .

Letting $A_{\phi}$ be the $D\times D$ matrix associated with $\phi$ , we have

\begin{split}\Delta\phi(x_{1},x_{2},x_{3},x_{4})&=sgn\left(d(\phi(x_{1}),\phi(x_{2}))^{2}-d(\phi(x_{3}),\phi(x_{4}))^{2}\right)\\ &=sgn\left((x_{1}-x_{2})^{t}A_{\phi}^{t}A_{\phi}(x_{1}-x_{2})-(x_{3}-x_{4})^{t}A_{\phi}^{t}A_{\phi}(x_{3}-x_{4})\right)\\ &=sgn\left(\langle A_{\phi}^{t}A_{\phi},(x_{1}-x_{2})(x_{1}-x_{2})^{t}\rangle-\langle A_{\phi}^{t}A_{\phi},(x_{3}-x_{4})(x_{3}-x_{4})^{t}\rangle\right)\\ &=sgn\left(\langle A_{\phi}^{t}A_{\phi},(x_{1}-x_{2})(x_{1}-x_{2})^{t}-(x_{3}-x_{4})(x_{3}-x_{4})^{t}\rangle\right)\end{split}

Thus, letting $\alpha(\phi)=A_{\phi}^{t}A_{\phi}$ (cast as a vector in $\mathbb{R}^{D^{2}}$ ) and $\beta(x_{1},x_{2},x_{3},x_{4})=(x_{1}-x_{2})(x_{1}-x_{2})^{t}-(x_{3}-x_{4})(x_{3}-x_{4})^{t}$ suffices. ∎
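The identity used in this proof can be checked numerically. The Python sketch below compares the distance comparer computed directly from a matrix $A_{\phi}$ with the sign of the inner product $\langle\alpha(\phi),\beta(x_{1},x_{2},x_{3},x_{4})\rangle$ . It uses integer entries so that the arithmetic is exact, and it adopts the convention $sgn(0)=0$ , which is an assumption of the sketch. The helper names are hypothetical.

```python
import random

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sign(t):
    return (t > 0) - (t < 0)

def alpha(A):
    """A^t A flattened row by row into a vector of length D^2."""
    D = len(A)
    return [sum(A[k][i] * A[k][j] for k in range(D))
            for i in range(D) for j in range(D)]

def beta(x1, x2, x3, x4):
    """(x1 - x2)(x1 - x2)^t - (x3 - x4)(x3 - x4)^t, flattened the same way."""
    u = [a - b for a, b in zip(x1, x2)]
    v = [a - b for a, b in zip(x3, x4)]
    D = len(u)
    return [u[i] * u[j] - v[i] * v[j] for i in range(D) for j in range(D)]

def direct_comparer(A, x1, x2, x3, x4):
    """Sign of d(phi(x1), phi(x2))^2 - d(phi(x3), phi(x4))^2 with phi(x) = A x."""
    return sign(sq_dist(matvec(A, x1), matvec(A, x2))
                - sq_dist(matvec(A, x3), matvec(A, x4)))

def linear_comparer(A, x1, x2, x3, x4):
    """Sign of the inner product <alpha(phi), beta(x1, x2, x3, x4)>."""
    return sign(sum(a * b for a, b in zip(alpha(A), beta(x1, x2, x3, x4))))

rng = random.Random(0)
D = 3
A = [[rng.randint(-4, 4) for _ in range(D)] for _ in range(D)]
points = [[rng.randint(-4, 4) for _ in range(D)] for _ in range(4)]
same = direct_comparer(A, *points) == linear_comparer(A, *points)
```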

K.2 Proof of Theorem 7

Proof.

(Theorem 7) Suppose $\phi^{*}\in\Phi$ realizes the Statistical IRM assumption for $\mathcal{D}_{s},\mathcal{D}_{t}$ . Define

E_{1}=\mathbbm{1}\left(R(\mathcal{N}_{S}^{\phi^{*}},\mathcal{D}_{t})-R^{*}_{t}<\frac{\epsilon}{2}\right),

and

E_{2}=\mathbbm{1}\left(\sup_{\phi\in\Phi}\left|R(\mathcal{N}^{\phi}_{S},\mathcal{D}_{t})-\frac{1}{m}\sum_{(x,y)\in T}\mathbbm{1}\left(\mathcal{N}^{\phi}_{S}(x)\neq y\right)\right|<\frac{\epsilon}{2}\right).

$E_{1}$ is thus the event that the source sample $S\sim\mathcal{D}_{s}^{n}$ gives rise to a $k_{n}$ -nearest neighbors classifier which has small excess risk on $\mathcal{D}_{t}$ when composed with the realizing projection $\phi^{*}$ . $E_{2}$ is the event that the empirical risks of $k_{n}$ -nearest neighbor classifiers composed with feature maps $\phi\in\Phi$ are uniformly representative of their true risks on the target $\mathcal{D}_{t}$ .

Our goal is to show that $E_{1}$ and $E_{2}$ jointly hold with probability at least $1-\delta$ , as this would imply that our learned classifier has risk at most $R^{*}_{t}+\epsilon$ , as desired. By Lemma 15, $E_{1}$ holds with probability at least $1-\frac{\delta}{2}$ , so it suffices to show that $E_{2}$ holds with probability at least $1-\frac{\delta}{2}$ as well.

Fix any set of $n$ points, $\hat{S}$ . It suffices to show that $\Pr_{T\sim\mathcal{D}_{t}^{m}}[E_{2}=1|S=\hat{S}]\geq 1-\frac{\delta}{2}$ , as integrating over all possibilities of $S$ would give the desired result. Consider the hypothesis class, $\mathcal{H}_{\hat{S}}=\{h_{\phi}:\phi\in\Phi\}$ where $h_{\phi}:\mathcal{X}\times\mathcal{Y}\to\{0,1\}$ is defined as

h_{\phi}(x,y)=\mathbbm{1}\left(\mathcal{N}^{\phi}_{\hat{S}}(x)\neq y\right).

Observe that $h\in\mathcal{H}_{\hat{S}}$ is a binary classifier over its domain. It follows that given $S=\hat{S}$ ,

\begin{split}E_{2}&=\mathbbm{1}\left(\sup_{\phi\in\Phi}\left|R(\mathcal{N}^{\phi}_{S},\mathcal{D}_{t})-\frac{1}{m}\sum_{(x,y)\in T}\mathbbm{1}\left(\mathcal{N}^{\phi}_{S}(x)\neq y\right)\right|<\frac{\epsilon}{2}\right)\\ &=\mathbbm{1}\left(\sup_{h_{\phi}\in\mathcal{H}_{\hat{S}}}\left|\mathbb{E}_{(x,y)\sim\mathcal{D}_{t}}[h_{\phi}(x,y)]-\frac{1}{m}\sum_{(x,y)\in T}h_{\phi}(x,y)\right|<\frac{\epsilon}{2}\right).\end{split}

To analyze the latter quantity, it suffices to show that $vc(\mathcal{H}_{\hat{S}})\leq O\left(\partial(\Phi)\log\left(n+\partial(\Phi)\right)\right)$ , as a standard application of the fundamental theorem of statistical learning [29] implies $E_{2}$ holds with probability $1-\frac{\delta}{2}$ provided that $m\geq\Omega\left(\frac{vc(\mathcal{H}_{\hat{S}})+\ln\frac{1}{\delta}}{\epsilon^{2}}\right)$ .

To this end, suppose $\mathcal{H}_{\hat{S}}$ shatters a set of $v$ points $V\subset\mathcal{X}\times\mathcal{Y}$ . Let $V=\{(x_{1},y_{1}),\dots,(x_{v},y_{v})\}$ , and let $\hat{S}=\{(x_{1}^{\prime},y_{1}^{\prime}),\dots,(x_{n}^{\prime},y_{n}^{\prime})\}$ . The key observation is that for any $h_{\phi}\in\mathcal{H}_{\hat{S}}$ , the way $h_{\phi}$ labels a given point $(x,y)\in V$ is determined by the $k_{n}$ -nearest neighbors of $\phi(x)$ in $\{\phi(x^{\prime}_{1}),\dots,\phi(x^{\prime}_{n})\}$ . Furthermore, these labels are fully determined by the set of all $\binom{n}{2}$ comparisons,

\mathcal{C}_{\phi,x}=\left\{\mathbbm{1}\left(d(\phi(x),\phi(x^{\prime}_{i}))\geq d(\phi(x),\phi(x^{\prime}_{j}))\right):1\leq i<j\leq n\right\}.

Note that $\mathcal{C}_{\phi,x}$ is a set of induced distance comparers (Definition 5). It follows that the number of distinct ways that $\mathcal{H}_{\hat{S}}$ can label $V$ is at most the number of ways $\Delta\Phi$ can label all $v\binom{n}{2}$ possible comparisons arising from $x_{i}\in V$ and $x^{\prime}_{j},x^{\prime}_{k}\in\hat{S}$ with $j<k$ . Thus, by the definition of the distance dimension $\partial(\Phi)$ and Sauer’s Lemma, the number of ways $\mathcal{H}_{\hat{S}}$ can label $V$ is at most

\left(ev\binom{n}{2}\right)^{\partial(\Phi)}

At the same time, because $\mathcal{H}_{\hat{S}}$ shatters $V$ , there exist precisely $2^{v}$ such labelings. Thus, we have

v\leq\partial(\Phi)\log\left(ev\binom{n}{2}\right).

From here, straightforward algebra implies that $v=O\left(\partial(\Phi)\log\left(n+\partial(\Phi)\right)\right)$ , as desired.

∎
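The key observation in this proof, that the $k_{n}$ nearest neighbors of $\phi(x)$ are determined by the $\binom{n}{2}$ comparison bits in $\mathcal{C}_{\phi,x}$ , can be illustrated with a small Python sketch. It recovers the neighbor set from the bits alone and compares it with a direct distance sort. The tie-breaking rule (among equidistant points, the larger index counts as closer) is an assumption of the sketch, chosen to match the non-strict inequality in $\mathcal{C}_{\phi,x}$ . All function names are hypothetical.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def comparison_bits(phi, x, xs):
    """The bits 1(d(phi(x), phi(x_i)) >= d(phi(x), phi(x_j))) for all i < j."""
    z = phi(x)
    d = [sq_dist(z, phi(xp)) for xp in xs]
    n = len(xs)
    return {(i, j): d[i] >= d[j] for i in range(n) for j in range(i + 1, n)}

def neighbours_from_bits(bits, n, k):
    """Indices of the k nearest points, computed from the comparison bits only."""
    def rank(i):
        ahead = 0
        for j in range(n):
            if j < i and not bits[(j, i)]:   # d_j < d_i: j is strictly closer
                ahead += 1
            elif j > i and bits[(i, j)]:     # d_i >= d_j: j is closer or tied
                ahead += 1
        return ahead
    return sorted(i for i in range(n) if rank(i) < k)

def neighbours_direct(phi, x, xs, k):
    """Reference answer: sort by distance, ties broken towards larger index."""
    z = phi(x)
    order = sorted(range(len(xs)), key=lambda i: (sq_dist(z, phi(xs[i])), -i))
    return sorted(order[:k])

rng = random.Random(0)
phi = lambda x: (2 * x[0] + x[1],)           # an arbitrary linear feature map
xs = [(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(12)]
query = (1, -2)
bits = comparison_bits(phi, query, xs)
agree = neighbours_from_bits(bits, len(xs), 3) == neighbours_direct(phi, query, xs, 3)
```

Since the neighbor set, and hence the majority label, depends on $\phi$ only through these bits, counting labelings of $V$ reduces to counting the behaviors of the distance comparers, which is what the distance dimension controls.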
