Beyond Discrepancy: A Closer Look at the Theory of Distribution Shift

Robi Bhattacharjee
University of Tübingen and Tübingen AI Center

Nick Rittler
University of California, San Diego

Kamalika Chaudhuri
University of California, San Diego
Abstract

Many machine learning models appear to deploy effortlessly under distribution shift, and perform well on a target distribution that is considerably different from the training distribution. Yet, learning theory of distribution shift bounds performance on the target distribution as a function of the discrepancy between the source and target, rarely guaranteeing high target accuracy. Motivated by this gap, this work takes a closer look at the theory of distribution shift for a classifier from a source to a target distribution. Instead of relying on the discrepancy, we adopt an Invariant-Risk-Minimization (IRM)-like assumption connecting the distributions, and characterize conditions under which data from a source distribution is sufficient for accurate classification of the target. When these conditions are not met, we show when only unlabeled data from the target is sufficient, and when labeled target data is needed. In all cases, we provide rigorous theoretical guarantees in the large sample regime.

1 Introduction

Classical learning theory operates within the statistical learning framework, in which the training and testing datasets are assumed to be drawn from the same distribution [1]. However, this assumption is rarely met in practice, where models often succeed in ever-changing real world environments rarely matching the precise conditions of their training data. This motivates the problem of distribution shift, in which a learner trains on a source distribution, with the goal of generalizing well over a distinct target distribution.

Thus far, the theory of distribution shift has consistently taken a worst-case approach, typically bounding generalization error in terms of some notion of discrepancy between the source and target distributions [2, 3, 4]. In cases where the source and target distributions are completely unrelated, or the source provides little information about the decision boundary of the target, discrepancy-based analyses correctly capture the difficulty of generalization. However, in practice, many large models appear to generalize effortlessly to target distributions with non-zero discrepancy.

Motivated by this gap, we take a closer look at the theory of distribution shift. In our setting, we consider a source distribution $\mathcal{D}_s$ and a target distribution $\mathcal{D}_t$, with the goal of building an accurate classifier over $\mathcal{D}_t$, primarily via training samples from $\mathcal{D}_s$. To accomplish this, we first select a feature map, $\hat{\phi} \in \Phi$, under which the source and target distributions are similar. To make predictions, we then use $k_n$-nearest neighbors ($k_n$-NN) inside feature space over data sampled from $\mathcal{D}_s$.

Instead of a worst-case, discrepancy-based approach, we study generalization under an Invariant Risk Minimization (IRM)-like assumption which we term the “Statistical IRM Assumption”. IRM assumes the existence of a feature map $\psi^*$ and a classifier (over feature space) $h^*$ so that their composition $h^* \circ \psi^*$ achieves optimal accuracy over both source and target distributions [5]. We adapt this assumption to the nearest neighbors setting, and replace the existence of $h^*$ with the assumption that some feature map $\phi^* \in \Phi$ maps points from the target $\mathcal{D}_t$ close to those from the source $\mathcal{D}_s$ while retaining information sufficient for optimal prediction. This property allows us to leverage the fact that nearest neighbors enjoys strong generalization properties within the support of its training distribution.

One might hope that such a condition is sufficient for generalization from source data alone. Unfortunately, the existence of a suitable feature map does not imply its identifiability - there may be many poor feature maps in $\Phi$ that appear suitable when only source data are available. We show (Theorem 4) that to guarantee generalization to the target using only source data, the source must be rich enough so that this cannot happen, i.e. that all maps that lead to optimal classification over the source distribution appropriately unify the source and target. We further exhibit a learning rule which leads to provable generalization to the target under this additional condition (Theorem 3).

(a) Pure source data is sufficient, as $\phi_x$ significantly reduces accuracy over the source distribution.
(b) Unlabeled target data coupled with labeled source data is sufficient – $\phi_x$ can be eliminated because it does not map target data close to source data in the implied feature space.
(c) Labeled target data is needed to determine which projection is correct – without it both projections appear completely symmetric, but with competing notions of how to label in feature space.
Figure 1: Examples of similar distribution shift problems with different data demands. Faded data points are sampled from the target distribution, while the bold points are selected from the source. In all cases, we wish to generalize to the target via the selection of a feature map from $\Phi = \{\phi_x, \phi_y\}$, with $\phi_x$ and $\phi_y$ denoting projection onto the $x$ and $y$ axis, respectively.

We next consider the case where the learner has access to unlabeled target data in addition to labeled source data. Here, the target data provides crucial new information about which feature maps transform target data close to source data in feature space. We find that it is necessary and sufficient (Theorems 6 and 5) that all maps which both lead to optimal classification over the source, and map target data close to source data, further appropriately unify the source and target classification tasks.

When generalization is not possible with the addition of unlabeled target data, some labeled target data is needed. In this setting, the goal is to minimize the amount of labeled target data used – if large amounts of labeled target data are obtainable, we could simply use standard learning algorithms directly on the target data. We introduce a complexity measure on the embedding class, $\Phi$, which we term the distance dimension, and use it to provide an upper bound on the amount of target data needed for generalization. In particular, we show that the natural procedure of minimizing the empirical risk (over $\phi \in \Phi$) on the target distribution of the source data-trained nearest neighbor classifier meets this upper bound.

1.1 An Illustrative Example

Figure 1 illustrates three learning problems in which we seek to generalize from the bold source data to the faded target data. In each case, the set of possible feature maps is $\Phi = \{\phi_x, \phi_y\}$, the projections onto the $x$ and $y$-axis, respectively. Here, the Statistical IRM Assumption manifests itself in the following way: the learner knows that perfect classification can be performed on the source and the target through the intermediate projection onto either the $x$ or $y$-axis. If the correct projection can be identified, a classifier generalizing to the target can be built by composing the correct projection with a classifier that accurately classifies source data in feature space.

The possibility of generalizing directly to the target is illustrated by Figure 1(a). In this case, using source data alone, it can be deduced that $\phi_x$ is not suitable, given that projection under $\phi_x$ significantly reduces accuracy over the source distribution. Thus, a classifier can be constructed through composition with $\phi_y$ that allows for generalization to the target.

By contrast, in panel (b), we see that both $\phi_x$ and $\phi_y$ admit good classification over the source distribution. However, note that only $\phi_y$ leads to good generalization over the target distribution, and that there is no way to pin down which embedding should be used with source data alone. That said, given access to unlabeled target data, $\phi_x$ can be eliminated from contention – it fails to uniformly map target data close to source data in feature space, another condition for correctly relating the source and target.

In panel (c), we see an instance in which no amount of source data and unlabeled target data will allow the learner to distinguish a winner between the two possible feature maps. In this case, labeled target data is needed. However, note that only a relatively small amount of labeled target data will be needed – all that is required is enough points to validate that a source-trained classifier arising from first projecting onto the $x$-axis has inferior performance to an analogous classifier where data are projected to the $y$-axis.
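The scenario of panel (c) can be mimicked with a few synthetic points. The following is a minimal sketch rather than the paper's procedure: the coordinates, the helper names `knn_predict` and `target_risk`, and the choice of a single neighbor are all illustrative assumptions. It trains a nearest neighbor rule on source data in each candidate feature space and compares the resulting error on a handful of labeled target points.

```python
from collections import Counter

def phi_x(p):
    """Projection onto the x-axis."""
    return (p[0],)

def phi_y(p):
    """Projection onto the y-axis."""
    return (p[1],)

def knn_predict(train, phi, x, k=1):
    """Predict the label of x by k-NN over source data in the feature space of phi."""
    z = phi(x)

    def sq_dist(pair):
        return sum((a - b) ** 2 for a, b in zip(phi(pair[0]), z))

    nearest = sorted(train, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def target_risk(train, phi, labeled_target, k=1):
    """Fraction of labeled target points misclassified by the source-trained rule."""
    errors = sum(knn_predict(train, phi, x, k) != y for x, y in labeled_target)
    return errors / len(labeled_target)

# Hypothetical data: on the source, both coordinates separate the classes.
source = [((-1.0, -1.0), 0), ((-1.2, -0.8), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
# On the target, only the y-coordinate still agrees with the labels.
target = [((1.0, -1.0), 0), ((-1.0, 1.0), 1)]

risks = {"phi_x": target_risk(source, phi_x, target),
         "phi_y": target_risk(source, phi_y, target)}
best = min(risks, key=risks.get)
print(risks, best)
```

With this toy data, two labeled target points already separate the candidates, which is the sense in which only a small amount of labeled target data is needed.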

1.2 Guarantees Beyond Discrepancy

The examples of Figure 1 also serve to showcase the potential for generalization guarantees in scenarios where worst-case analyses indicate that generalization to the target should be hard.

There are a few veins of the discrepancy literature [6]. One prominent vein considers bounding generalization error in terms of divergence measures between the source and target [2, 3, 4]. Another considers density ratios between target and source [7, 8]. In each case, the idea is that the degradation of prediction quality on the target will be small when the source and target distributions are not “too far” from each other.

Consider again the examples of Figure 1. A density ratio analysis indicates that generalization to the target in (a) is impossible from pure source data, and expensive in a transfer learning setting (c), as the source has no mass in large chunks of the support of the target. Divergence measures paint a similar picture. Thus, our assumption allows us to consider the possibility of cheap generalization to targets which may have a completely different support from the source in the original data space, but are related in some deeper manner. In such scenarios, discrepancy-based analyses may often be overly pessimistic.

2 Related Work

As alluded to above, the theory literature has primarily studied distribution shift through the lens of discrepancy [2, 3, 7, 4, 8, 9, 10]. In the transfer learning literature, in which one considers the possibility of updating a model trained on the source distribution with a relatively small amount of target data, divergence-based analyses have also been prevalent [11, 12, 13, 14]. A notable line of work attains strong guarantees in certain cases where the divergence between source and target distributions is large, but the decision boundary on the source and target are similar by honing in on the information about the decision boundary contained in the source distribution [6, 15].

Much of the attention towards the selection of feature representations has been devoted to the problem of “domain generalization”, wherein the learner tries to generalize to a large set of testing environments using samples from a smaller set of source environments, which provide training data [16, 17]. As mentioned above, the IRM literature hinges on the existence of a feature map $\psi^*$ and a classifier $h^*$ whose composition achieves optimal accuracy over both source and target distributions [5]. Another line of work considers a different assumption, namely the existence of some suitable feature map $\psi^*$ for which the conditional distributions of transformed features $\psi^*(x)$ given a label are shared across all environments [18, 19, 20].

An important part of this work considers the case where some relatively small amount of labeled target data is available to the learner, and can be exploited in the determination of a suitable feature map. In the theory literature, this setting is most closely explored by the work on “few-shot representation learning” [21, 22], where the goal is to use data on a set of source tasks to learn a low dimensional representation that connects tasks together, allowing for generalization to a related target task without too many extra samples.

In considering the case where the learner has access to unlabeled target samples, we enter the “unsupervised domain adaptation” setting. Here, one often uses unlabeled target data to find some feature space under which source and target supports align [23, 24]. The literature has shown that unlabeled data has provable utility in certain common settings, e.g. under covariate shift [25, 26]. Unsupervised domain adaptation has also been studied through the lens of discrepancy [9, 10].

3 Preliminaries

Let the instance space $(\mathcal{X}, d_{\mathcal{X}})$ be a compact metric space, and $\mathcal{Y}$ be a finite label set. A data distribution $\mathcal{D} = (\mu, \eta)$ over $\mathcal{X} \times \mathcal{Y}$ is defined by a Borel measure $\mu$ over $\mathcal{X}$, and a conditional probability function $\eta(y|x) := \Pr_{(X,Y) \sim \mathcal{D}}[Y = y \mid X = x]$.

We assume our distributions satisfy some measure-theoretic regularity conditions. In particular, we assume our Borel measures are open measures, and that the Lebesgue Differentiation Theorem always holds. See Appendix B for details to this end.

For a classifier $h : \mathcal{X} \to \mathcal{Y}$, we define its risk $R(h, \mathcal{D})$ over $\mathcal{D}$ as the probability it misclassifies, i.e. we define $R(h, \mathcal{D}) := \Pr_{(X,Y) \sim \mathcal{D}}[h(X) \neq Y]$. The classifier with the lowest possible risk is called the Bayes optimal classifier, defined as $g_{\mathcal{D}}(x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \eta(y|x)$.
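To make these two definitions concrete, the following sketch evaluates them on a finite instance space, where the measure and the conditional probabilities can be written as dictionaries. The two-point distribution and the names `bayes_classifier` and `risk` are assumptions chosen for illustration; the paper itself works with general Borel measures.

```python
def bayes_classifier(eta):
    """Map each instance x to argmax_y eta(y | x)."""
    return {x: max(probs, key=probs.get) for x, probs in eta.items()}

def risk(h, mu, eta):
    """Probability that h misclassifies under D = (mu, eta)."""
    return sum(mu[x] * (1.0 - eta[x][h[x]]) for x in mu)

# Hypothetical distribution on two instances with binary labels.
mu = {"a": 0.5, "b": 0.5}
eta = {"a": {0: 0.9, 1: 0.1}, "b": {0: 0.2, 1: 0.8}}

g = bayes_classifier(eta)
always_zero = {"a": 0, "b": 0}
print(g, risk(g, mu, eta), risk(always_zero, mu, eta))
```

Here the Bayes optimal classifier attains a risk of roughly 0.15, while the constant classifier incurs roughly 0.45, so no classifier can do better than the former on this distribution.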

3.1 Problem Statement and Goal

In this work, we are interested in the problem of distribution shift, in which the goal is to build a classifier with low risk over a target distribution $\mathcal{D}_t = (\mu_t, \eta_t)$, primarily using data from a source distribution, $\mathcal{D}_s = (\mu_s, \eta_s)$. We denote the Bayes risk on source and target via $R_s^*$ and $R_t^*$.

The challenge in this setting is that $\mu_s$ and $\mu_t$ can put mass in drastically different regions in $\mathcal{X}$, making direct generalization from the source distribution to the target distribution difficult or impossible in the worst case.

3.2 Feature Maps

We consider classification after first applying a transformation into a feature space $(\mathcal{Z}, d_{\mathcal{Z}})$, also a compact metric space.

We assume we are given $\Phi$, a class of feature maps $\phi : \mathcal{X} \to \mathcal{Z}$. Here, each $\phi \in \Phi$ represents a potential feature map under which the source and target distributions could plausibly be connected. Let $d_{\phi}$ denote the distance metric induced on $\mathcal{X}$ by $\phi$, i.e. $d_{\phi}(x, x') = d_{\mathcal{Z}}\left(\phi(x), \phi(x')\right)$. We assume all $\phi \in \Phi$ are continuous, and $\Phi$ is compact with respect to the supremum distance metric. We also include further technical assumptions on $\Phi$ in Appendix B.3.

Note the following important examples of feature map collections, which meet these regularity assumptions when the domain is a compact subset of $\mathbb{R}^D$.

Example 1.

Let $\textnormal{Cor}_{D,K}$ denote the set of all projections from $\mathbb{R}^D \to \mathbb{R}^K$ onto a set of $K$ coordinates. Formally, we may write $\textnormal{Cor}_{D,K} = \{\phi_J : J \subset [D],\ |J| = K\}$, where for each $J \subseteq [D]$ with $J = \{j_1, \dots, j_k\}$, we let $\phi_J(x) = (x_{j_1}, x_{j_2}, \dots, x_{j_k})$.
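Since this class is finite, it can be enumerated directly. The sketch below does so with the standard library; the function name `coordinate_projections` is hypothetical, and indices run from 0 rather than from 1 as in the set $[D]$.

```python
from itertools import combinations

def coordinate_projections(D, K):
    """Return {J: phi_J} for every size-K subset J of coordinate indices 0..D-1."""
    maps = {}
    for J in combinations(range(D), K):
        # Bind J as a default argument so each closure keeps its own subset.
        maps[J] = lambda x, J=J: tuple(x[j] for j in J)
    return maps

cor = coordinate_projections(D=4, K=2)
print(len(cor))                        # number of maps in the class
print(cor[(0, 2)]((10, 20, 30, 40)))   # keeps coordinates 0 and 2
```

The class contains one map per size-$K$ subset of coordinates, so its cardinality is the binomial coefficient $\binom{D}{K}$.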

Example 2.

Let $\textnormal{Proj}_{D,K}$ denote the set of all linear maps corresponding to matrices in $\mathbb{R}^{D \times K}$ with each entry contained in $[-1, 1]$.

For any data distribution $\mathcal{D} = (\mu, \eta)$ over $\mathcal{X} \times \mathcal{Y}$, we denote via $\mathcal{D}^{\phi}$ the distribution defined via $(\phi(X), Y)$ where $(X, Y) \sim \mathcal{D}$, often writing $\mathcal{D}^{\phi} = (\mu^{\phi}, \eta^{\phi})$, where $\mu^{\phi}$ and $\eta^{\phi}$ are the induced marginal and conditional distributions of $\mathcal{D}^{\phi}$. We assume that the induced marginals are also open measures. Measure-theoretic details of induced distributions are discussed in Appendix D.

3.3 Nearest Neighbors

We let $\mathcal{N}_S : \mathcal{X} \to \mathcal{Y}$ denote the $k_n$-nearest neighbor classifier arising from an i.i.d. sample $S \sim \mathcal{D}^n$ and a metric over the instances, where ties are broken arbitrarily. It is well known that under mild regularity conditions, $k_n/n \to 0$ and $k_n \to \infty$ imply that $k_n$-nearest neighbors will converge to the Bayes optimal classifier [27]. Motivated by technical concerns, we will make the slightly stronger assumption that $k_n/n \to 0$ and $k_n/\log(n) \to \infty$.
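As one illustration of a schedule meeting both requirements, consider $k_n = \lceil \log(n)^2 \rceil$: the ratio $k_n/n$ vanishes, while $k_n/\log(n)$ grows like $\log(n)$. This particular choice and the name `k_schedule` are assumptions made for the example and are not prescribed by the paper. The sketch below tabulates both ratios for a few sample sizes.

```python
import math

def k_schedule(n):
    """Hypothetical neighbor count k_n = ceil(log(n)^2), intended for n >= 2."""
    return max(1, math.ceil(math.log(n) ** 2))

for n in (10**2, 10**4, 10**6):
    k = k_schedule(n)
    # Columns: n, k_n, k_n / n (should shrink), k_n / log(n) (should grow).
    print(n, k, k / n, k / math.log(n))
```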

Because we consider classification in feature space, we will often consider the composition of $k_n$-NN with maps $\phi \in \Phi$. To this end, we let $\mathcal{N}_S^{\phi} : \mathcal{X} \to \mathcal{Y}$ denote the map defined by

$$\mathcal{N}_S^{\phi}(x) = \mathcal{N}_{\{(\phi(x), y) : (x, y) \in S\}}\big(\phi(x)\big).$$

3.4 Margin Conditions

Finally, we restrict our attention to data distributions in which Bayes-optimal classification is unambiguous, and regions in which Bayes-optimal predictions differ are separated by a margin. We formalize this as follows.

Definition 1.

A data distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ is $(\rho, \Delta)$-separated if there exist $\rho, \Delta > 0$, and disjoint sets $\{\mu^y : y \in \mathcal{Y}\}$, so that the following hold:

  1. The sets cover the support: $\textnormal{supp}(\mu) = \cup_{y \in \mathcal{Y}} \mu^y$.

  2. On the set where $y$ is the Bayes-optimal decision, no other label has similar conditional probability: if $y \neq y'$, then $\forall x \in \mu^y$, $\eta(y|x) > \eta(y'|x) + \Delta$.

  3. These sets themselves are separated by a margin: $\min_{y \neq y'} d(\mu^y, \mu^{y'}) = \rho$.

When $\mathcal{D}$ is $(\rho, \Delta)$-separated, we say that $\mathcal{D}$ has margin $\rho$, and label margin $\Delta$. The conditions of well-separated distributions are met in most practical cases, where classification is rarely ambiguous, and arbitrarily close examples are usually classified identically.
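The margin can be probed on a finite labeled sample by taking the smallest distance between two points with different labels. The sketch below is a rough proxy only, under two assumptions that are not part of the definition: labels are noiseless (each sampled label equals the Bayes-optimal one), and the metric is Euclidean. Because a sample only covers part of the support, the value it returns can overestimate the true margin but, under the noiseless assumption, cannot fall below it. The name `empirical_margin` is hypothetical.

```python
import math
from itertools import combinations

def empirical_margin(sample):
    """Smallest distance between two sampled points carrying different labels."""
    gaps = [math.dist(x1, x2)
            for (x1, y1), (x2, y2) in combinations(sample, 2)
            if y1 != y2]
    # With a single observed class there is no cross-label pair to measure.
    return min(gaps) if gaps else math.inf

# Hypothetical sample: class 0 near x = 0, class 1 near x = 3 and beyond.
sample = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((3.0, 0.0), 1), ((4.0, 1.0), 1)]
print(empirical_margin(sample))
```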

4 The Statistical IRM Assumption

Generalizing from source data in a feature space induced by some $\phi \in \Phi$ is only possible if $\Phi$ contains a map that appropriately unifies the classification tasks on $\mathcal{D}_s$ and $\mathcal{D}_t$. We use this section to motivate and define some desirable properties of feature maps vis-à-vis this goal, and to introduce the Statistical IRM Assumption, formalizing our requirement for the existence of quality maps in $\Phi$.

4.1 Desirable Properties of Feature Maps

In Invariant Risk Minimization, the fundamental assumption is the existence of a feature map $\psi^* : \mathcal{X} \to \mathcal{Z}$ and an “invariant predictor” $h^* : \mathcal{Z} \to \mathcal{Y}$ for which $h^* \circ \psi^*$ is Bayes-optimal on all training and testing environments. This allows a learner to assume that selecting a feature space through which good performance on training environments is attainable is not a completely futile approach to constructing a generalizing classifier.

In this spirit, we first interest ourselves in feature maps which preserve the possibility of optimal classification on our single source distribution. We consider a slightly stronger but natural notion that encodes the idea that no information relevant to the classification task on the source should be lost under the mapping.

Definition 2.

We say a feature map $\phi$ source-preserves if the induced source distribution $\mathcal{D}_s^{\phi}$ is separated, and the Bayes risk on $\mathcal{D}_s^{\phi}$ is equal to that of $\mathcal{D}_s$, i.e.

$$R(g_{\mathcal{D}_s^{\phi}}, \mathcal{D}_s^{\phi}) = R_s^*.$$

Let $\mathcal{S}(\Phi)$ denote the set of all source-preserving feature maps in $\Phi$.

Thus, source-preserving feature maps retain all information needed for optimal classification, in the sense that the risk of the Bayes optimal classifier in the original space $\mathcal{X}$ and in the feature space $\mathcal{Z}$ is the same under the correct embedding. We also require that some margin is preserved in the arising feature space.

While not strictly necessary under the IRM assumption, it is also desirable that an embedding maps examples that are similar with respect to the classification task to similar parts of the feature space, regardless of which distribution they come from. We formalize a condition capturing this idea via the following.

Definition 3.

We say a feature map $\phi$ contracts $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ if the induced source $\mathcal{D}_{s}^{\phi}$ is separated with margin $\rho^{\phi}$, and for each $z_{t}\in\textnormal{supp}(\mu_{t}^{\phi})$, there is some $z_{s}\in\textnormal{supp}(\mu_{s}^{\phi})$ such that

$$d_{\mathcal{Z}}(z_{t},z_{s})<\frac{\rho^{\phi}}{\Lambda},$$

where $\Lambda>2$ is a fixed constant. Let $\mathcal{C}(\Phi)$ denote the set of all contracting feature maps in $\Phi$.

Ultimately, we are interested in feature spaces in which we can generalize to the target by classifying target data as we would source data. This possibility is captured by the notion of the invariant predictor $h^{*}$ in the IRM assumption. We interest ourselves in feature maps with a similar property - ones for which the optimal classification decision is locally the same across $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$.

Definition 4.

We say a feature map $\phi$ Bayes-unifies $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ if for all $x_{s}\in\textnormal{supp}(\mu_{s})$ and $x_{t}\in\textnormal{supp}(\mu_{t})$,

$$d_{\phi}(x_{s},x_{t})<\frac{\rho^{\phi}}{2}\implies g_{\mathcal{D}_{s}}(x_{s})=g_{\mathcal{D}_{t}}(x_{t}).$$

Let $\mathcal{U}(\Phi)$ denote the set of all Bayes-unifying feature maps in $\Phi$.

Under a feature map which Bayes-unifies, any points which are mapped closer together than half the induced margin are classified the same under the source and target distributions.
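To make Definition 4 concrete, the following Python sketch lists the pairs of sample points that would witness a failure of Bayes-unification. It is an illustration under assumptions that are not part of the definition: finite labeled samples stand in for the supports, observed labels stand in for the Bayes-optimal predictions $g_{\mathcal{D}_{s}}$ and $g_{\mathcal{D}_{t}}$, the feature space is Euclidean, and the margin is supplied by the caller. The names `unification_violations`, `proj_x` and `proj_y`, as well as the toy data, are hypothetical.

```python
import math

def unification_violations(phi, source, target, margin):
    """Return (x_s, x_t) pairs that are closer than margin / 2 in feature
    space but carry different labels (labels are a proxy for Bayes labels)."""
    return [
        (x_s, x_t)
        for x_s, y_s in source
        for x_t, y_t in target
        if math.dist(phi(x_s), phi(x_t)) < margin / 2 and y_s != y_t
    ]

def proj_x(x):
    # Hypothetical feature map: keep the first coordinate only.
    return (x[0],)

def proj_y(x):
    # Hypothetical feature map: keep the second coordinate only.
    return (x[1],)

# Toy data (assumed): the label follows the second coordinate in both samples.
source = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
target = [((1.0, 0.1), 0), ((0.0, 0.9), 1)]

violations_y = unification_violations(proj_y, source, target, margin=1.0)
violations_x = unification_violations(proj_x, source, target, margin=1.0)
```

On this toy data, projecting onto the second coordinate produces no violating pairs, while projecting onto the first coordinate produces two. Note that the check consumes target labels.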

4.2 Stating the Statistical IRM Assumption

It’s intuitive that if a feature map both preserves the Bayes risk on the source, and unifies the classification tasks of source and target, then converging to the Bayes risk on the target is possible when source data populate the support of the induced target.

Thus, we would like a feature map which possesses all of these properties. Our fundamental assumption is that there exists at least one such feature map in $\Phi$ – we term this the Statistical IRM Assumption.

Assumption 1 (Statistical IRM Assumption).

We assume there is some $\phi^{*}\in\Phi$ such that

  1. $\phi^{*}$ source-preserves $\mathcal{D}_{s}$

  2. $\phi^{*}$ contracts the source $\mathcal{D}_{s}$ and target $\mathcal{D}_{t}$

  3. $\phi^{*}$ Bayes-unifies source $\mathcal{D}_{s}$ and target $\mathcal{D}_{t}$

We say that $\phi^{*}$ with all of these properties realizes the Statistical IRM Assumption, and let $\Phi^{*}$ denote the set of all maps in $\Phi$ which realize the Statistical IRM Assumption.

This assumption is an analogue of the IRM assumption, adapted to our single-source, single-target setting. Like IRM, it allows for the possibility of optimal classification on both source and target via the selection of an appropriate feature space. Contraction, which is not an assumption in IRM, allows for that optimal classification to be realized via a local classification scheme such as $k_{n}$-NN.

4.3 The Statistical IRM Theorem

One would expect that if a learner were handed $\phi^{*}\in\Phi^{*}$, generalization to the target should be possible with source data alone. Because the classification task on $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ is unified in the feature space arising from $\phi^{*}$, and every example in the target support is mapped close to the training support, the learner is able to construct a good classifier for the target by simply constructing a good classifier on the induced source.

We formalize this intuition via the following theorem, which states that given knowledge of a realizing feature map $\phi^{*}$, generalization to the target can be accomplished with source data only via the construction of a $k_{n}$-NN classifier in feature space.

Theorem 1 (Statistical IRM Theorem).

Suppose $\phi^{*}$ realizes the Statistical IRM assumption. Then for all $\epsilon,\delta>0$, there exists $N$ such that for all $n\geq N$, with probability $\geq 1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$,

$$R(\mathcal{N}_{S}^{\phi^{*}},\mathcal{D}_{t})\leq R_{t}^{*}+\epsilon.$$

Thus, from the perspective of target generalization from source data, it suffices to determine a feature map $\phi\in\Phi$ which realizes the Statistical IRM Assumption. In what follows, we characterize the statistical identifiability of $\phi^{*}$ (and thus the learnability of $\mathcal{D}_{t}$) under this assumption in each of our data availability settings.
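The predictor $\mathcal{N}_{S}^{\phi^{*}}$ in Theorem 1 can be illustrated with a small Python sketch: embed the labeled source sample with a given feature map, then classify any point by a majority vote among its nearest embedded source neighbors. This is a minimal sketch under assumptions of our own choosing: a fixed odd $k$ replaces the growing sequence $k_{n}$, the feature space is Euclidean, and the feature map `proj_y`, the helper `knn_in_feature_space` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_in_feature_space(phi, source, k):
    """Build a k-NN classifier that uses only labeled source data,
    with all distances measured after applying the feature map phi."""
    embedded = [(phi(x), y) for x, y in source]

    def classify(x):
        z = phi(x)
        nearest = sorted(embedded, key=lambda zy: math.dist(z, zy[0]))[:k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]

    return classify

def proj_y(x):
    # Hypothetical realizing feature map: keep the second coordinate only.
    return (x[1],)

# Toy data (assumed): the target is shifted along the first coordinate,
# while the label depends only on the second coordinate.
source = [((0.0, -1.0), 0), ((0.1, -0.8), 0), ((0.2, -1.2), 0),
          ((0.0, 1.0), 1), ((0.1, 0.9), 1), ((0.2, 1.1), 1)]
target = [((5.0, -0.9), 0), ((5.2, 1.05), 1)]

predict = knn_in_feature_space(proj_y, source, k=3)
target_risk = sum(predict(x) != y for x, y in target) / len(target)
```

No target example is used to build `predict`; the target sample appears only when the empirical risk is measured.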

5 The Distance Dimension of $\Phi$

Our investigation of the identifiability of realizing feature maps relies on one further notion – one of embedding classes with bounded complexity. To this end, we introduce a complexity measure on $\Phi$ which will play a key role in each of the settings we consider. We begin with an intermediate definition.

Definition 5.

For a given $\phi:\mathcal{X}\to\mathcal{Z}$, we define its induced distance comparer $\Delta_{\phi}:\mathcal{X}^{4}\to\{0,1\}$ as the map

$$\Delta_{\phi}(x_{1},x_{2},x_{3},x_{4})=\mathbbm{1}\left(d_{\mathcal{Z}}\left(\phi(x_{1}),\phi(x_{2})\right)\geq d_{\mathcal{Z}}\left(\phi(x_{3}),\phi(x_{4})\right)\right).$$

We also define $\Delta\Phi:=\{\Delta_{\phi}:\phi\in\Phi\}$ as the induced distance comparer class of $\Phi$.

Distance comparers are a natural tool for our analysis – all nearest-neighbor computations inside the feature space $\mathcal{Z}$ can be expressed in such terms. This observation gives rise to a natural complexity measure for the determination of a suitable feature map, which we term the distance dimension.
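As an informal illustration of why distance comparers suffice for nearest-neighbor computations, the Python sketch below builds $\Delta_{\phi}$ for a given feature map and then finds a nearest neighbor using only calls to the comparer, never a raw distance. The Euclidean metric, the feature map `proj_y`, the function names and the toy points are assumptions made for the example.

```python
import math

def distance_comparer(phi, dist=math.dist):
    """Return the comparer delta(x1, x2, x3, x4), which is 1 exactly when
    dist(phi(x1), phi(x2)) >= dist(phi(x3), phi(x4)) and 0 otherwise."""
    def delta(x1, x2, x3, x4):
        return int(dist(phi(x1), phi(x2)) >= dist(phi(x3), phi(x4)))
    return delta

def nearest_neighbor_via_comparer(delta, query, candidates):
    """Find a nearest neighbor of query using only comparer outputs."""
    best = candidates[0]
    for c in candidates[1:]:
        # delta == 1 means the current best is at least as far as c.
        if delta(query, best, query, c) == 1:
            best = c
    return best

def proj_y(x):
    # Hypothetical feature map: keep the second coordinate only.
    return (x[1],)

delta = distance_comparer(proj_y)
candidates = [(0.0, 1.0), (0.0, 0.0), (0.0, -1.0)]
nearest = nearest_neighbor_via_comparer(delta, (5.0, 0.1), candidates)
```

Since the classifier's behavior depends on $\phi$ only through such binary comparisons, it is natural to measure the complexity of $\Phi$ through the class of comparers it induces.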

Definition 6.

The distance dimension of $\Phi$, denoted $\partial(\Phi)$, is the VC dimension of the induced comparer class $\Delta\Phi$.

In providing upper bounds, it will be important that the distance dimension be finite. We note that it is easily bounded for the two important classes of feature maps mentioned in Section 3.

Theorem 2.

Suppose $\textnormal{Cor}_{D,K}$ and $\textnormal{Proj}_{D,K}$ are defined as in Examples 1 and 2, respectively. Then

$$\partial(\textnormal{Cor}_{D,K})\leq K\log D\ \text{ and }\ \partial(\textnormal{Proj}_{D,K})\leq D^{2}.$$

6 Direct Generalization from $\mathcal{D}_{s}$

We first study the possibility of constructing a classifier that generalizes to $\mathcal{D}_{t}$ using only labeled samples from $\mathcal{D}_{s}$. In this setting, a learner $L$ takes input $S\sim\mathcal{D}_{s}^{n}$, and outputs a classifier $L(S):\mathcal{X}\to\mathcal{Y}$, with the goal of achieving a small risk on $\mathcal{D}_{t}$.

While one might hope that the Statistical IRM assumption alone is sufficient for generalization to the target, this is unfortunately false. In fact, we have already seen an example of this phenomenon in Figure 1(c). Here, we essentially argued that $\phi_{y}$ was a realizing feature map: it preserves the source risk, unifies the classification tasks on source and target, and maps all target points close to source points. However, we argued that $\phi_{x}$ and $\phi_{y}$ were statistically indistinguishable in this setting, leaving the learner in need of more information.

To the end of a general characterization of learnability in this setting, recall our discussion of Figure 1(a). Here, the projection $\phi_{y}$ could be determined as realizing the Statistical IRM assumption given that it was the only map in $\Phi$ that preserved the source distribution – it’s clear from the figure that $\phi_{x}$ does not preserve the source, and so cannot possibly realize the Statistical IRM assumption. It is vital that this reasoning could be carried out with source data alone.

More generally, by the Statistical IRM Theorem, it is sufficient for generalization from source data alone that the learner be able to identify $\phi^{*}$ from source data alone. Such a realizing feature map must of course satisfy all three requirements of Assumption 1. However, note that only one of these requirements, namely source-preservation, depends on the source distribution alone – the others, namely contraction and Bayes-unification, are defined in terms of the target distribution. As such, only source-preservation can be tested using source data.

That said, if the learner can be assured that all source-preserving feature maps realize the Statistical IRM assumption, i.e. $\Phi^{*}=\mathcal{S}(\Phi)$, it can identify realizing feature maps by identifying source-preserving feature maps. We formalize this intuitive idea with the following theorem, which shows that PAC guarantees for target generalization are obtainable when the additional condition $\Phi^{*}=\mathcal{S}(\Phi)$ holds.

Theorem 3.

Suppose the Statistical IRM Assumption holds, the distance dimension $\partial(\Phi)<\infty$, and that $\Phi^{*}=\mathcal{S}(\Phi)$. Then there is a learning rule $L$ such that for every $\epsilon,\delta>0$, there exists $N$ such that if $n\geq N$, with probability $\geq 1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$,

$$R(L(S),\mathcal{D}_{t})\leq R_{t}^{*}+\epsilon.$$

We relegate specification of the learning rule to the Appendix. It is founded on minimizing the empirical risk on the source data over feature maps in $\Phi$, but further leverages the knowledge that source-preserving feature maps induce separated distributions over feature space. After selecting a candidate feature map which empirically matches the requirements for source preservation, it uses $k_{n}$-NN in the implied feature space to make predictions.
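The idea of screening feature maps with source data alone can be illustrated for a finite candidate set. The Python sketch below is not the learning rule from the Appendix; it only shows one plausible empirical proxy for separation, namely the smallest feature-space distance between two differently labeled source points. The threshold `min_margin`, the Euclidean metric, the candidate maps and the toy data (chosen in the spirit of Figure 1(a)) are all assumptions.

```python
import math

def empirical_margin(phi, sample):
    """Smallest feature-space distance between two differently labeled
    points; math.inf if the sample contains a single label."""
    gaps = [math.dist(phi(x1), phi(x2))
            for x1, y1 in sample
            for x2, y2 in sample
            if y1 != y2]
    return min(gaps) if gaps else math.inf

def source_separating_candidates(candidates, sample, min_margin):
    """Names of the candidate maps whose embedded sample keeps the
    two classes at least min_margin apart."""
    return [name for name, phi in candidates.items()
            if empirical_margin(phi, sample) >= min_margin]

# Hypothetical candidate class: the two coordinate projections.
candidates = {
    "proj_x": lambda x: (x[0],),
    "proj_y": lambda x: (x[1],),
}

# Toy source sample (assumed): the label is determined by the second coordinate.
source = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 1.0), 1)]

kept = source_separating_candidates(candidates, source, min_margin=0.5)
```

Projecting onto the first coordinate collapses differently labeled source points onto each other, so its empirical margin is zero and only `proj_y` survives the screen. No target data enters this computation.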

In light of the discussion above, the condition that $\Phi^{*}=\mathcal{S}(\Phi)$ is intuitively necessary as well – without it, there may be some feature map which is source-preserving, but which e.g. fails to Bayes-unify the source and the target. It’s simple to see that attempting to generalize to the target via such a feature space could be catastrophic, and thus that blindly choosing between source-preserving feature maps will eventually lead the learner astray. On the other hand, not classifying through a feature space subjects the learner to the standard pitfalls of out of distribution generalization. We formalize these ideas via the following hardness result.

Theorem 4.

Fix a source $\mathcal{D}_{s}$, a target $\mathcal{D}_{t}$, and some embedding class $\Phi$ for which the Statistical IRM assumption holds. Suppose that $\mathcal{S}(\Phi)\setminus\Phi^{*}$ is non-empty, and that a learner $L$ successfully generalizes to $\mathcal{D}_{t}$ (with high probability) using only samples from $\mathcal{D}_{s}$. Then for all $\epsilon>0$, there exist data distributions $\mathcal{D}_{s}^{\prime},\mathcal{D}_{t}^{\prime}$ such that the following hold:

  1. $W(\mathcal{D}_{s},\mathcal{D}_{s}^{\prime})<\epsilon$.

  2. There is a $\phi\in\Phi$ realizing the Statistical IRM assumption on alternative source $\mathcal{D}_{s}^{\prime}$ and alternative target $\mathcal{D}_{t}^{\prime}$.

  3. For all $N$, there exists $n>N$ such that with probability at least $\frac{1}{4}$ over $S\sim\mathcal{D}_{s}^{n}$,

    $$R(L(S),\mathcal{D}_{t}^{\prime})>R(g_{\mathcal{D}_{t}^{\prime}},\mathcal{D}_{t}^{\prime})+\frac{1}{4}.$$

Thus, in the case that some feature maps preserve the source but do not realize the Statistical IRM assumption, there is always some nearly identical problem instance where the Statistical IRM assumption is realized by $\Phi$, but which causes a given learner to have unbounded sample complexity (for some choice of $\epsilon$ and $\delta$).

7 Combining Labeled Samples from $\mathcal{D}_{s}$ and Unlabeled Samples from $\mathcal{D}_{t}$

We now consider the less restrictive setting in which unlabeled target data are also available. Here, a learner $L$ takes input $S\sim\mathcal{D}_{s}^{n}$ and $U\sim\mu_{t}^{m}$, and outputs a classifier $L(S,U):\mathcal{X}\to\mathcal{Y}$.

The story given additional access to unlabeled data is similar to the source-only setting: the Statistical IRM assumption alone is insufficient for guaranteeing successful generalization when additional unlabeled target data are available. In other words, the combination of labeled source and unlabeled target data is generally insufficient for identifying a $\phi\in\Phi$ that realizes the Statistical IRM assumption.

For a simple example to this end, we return to panel (c) of Figure 1. For the source and target distributions shown, it is evident that no amount of labeled data from the source and unlabeled data from the target will allow us to decide whether we should project data onto the $x$-axis or the $y$-axis. This is because the only difference between them is the manner in which $\mathcal{D}_{t}$ is labeled. By contrast, the example depicted by Figure 1 panel (b) illustrates a case in which the additional unlabeled data from $\mathcal{D}_{t}$ proves sufficient: because projecting onto the $x$-axis fails to map target points close to source points, we can conclude that $\phi$ must be the projection onto the $y$-axis.

As in the source-only setting, identifying a feature map realizing the Statistical IRM assumption requires testing the three conditions of Assumption 1. The utility of additional unlabeled target data is that it allows the learner to test not only which feature maps preserve the source, but also which maps contract $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$. On the other hand, it is insufficient to determine which feature maps Bayes-unify, as this notion intrinsically depends on labeling under $\mathcal{D}_{t}$. This motivates a sufficient condition for learnability similar to the one we saw in the previous section, namely that all feature maps which both preserve the source and contract the source and target further Bayes-unify. The following theorem shows that this is indeed a sufficient condition for learnability in this setting.

Theorem 5.

Suppose the Statistical IRM Assumption holds, the distance dimension $\partial(\Phi)<\infty$, and that $\Phi^{*}=\mathcal{S}(\Phi)\cap\mathcal{C}(\Phi)$. Then there is a learning rule $L$ such that for all $\epsilon,\delta>0$, there exist $N$ and $M$, such that if $n\geq N$ and $m\geq M$, with probability $\geq 1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$ and $U\sim\mu_{t}^{m}$,

$$R(L(S,U),\mathcal{D}_{t})\leq R_{t}^{*}+\epsilon.$$

The learning rule is specified in detail in the Appendix. It proceeds by first selecting a feature map which both empirically preserves the source, and maps each unlabeled target point in $U$ close to some source point in $S$. As above, it uses $k_{n}$-NN in the selected feature space to make predictions.
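The selection step described above can be sketched for a finite candidate set as follows. This is an illustration under our own assumptions rather than the rule given in the Appendix: the empirical margin of the embedded source sample stands in for $\rho^{\phi}$, a positive margin stands in for source preservation, the constant `lam` plays the role of $\Lambda$, and the candidate maps and toy data (chosen in the spirit of Figure 1(b)) are hypothetical.

```python
import math

def empirical_margin(phi, labeled):
    """Smallest feature-space distance between differently labeled points."""
    gaps = [math.dist(phi(a), phi(b))
            for a, ya in labeled for b, yb in labeled if ya != yb]
    return min(gaps) if gaps else math.inf

def max_target_gap(phi, labeled, unlabeled):
    """Largest distance from an embedded target point to its closest
    embedded source point."""
    zs = [phi(x) for x, _ in labeled]
    return max(min(math.dist(phi(u), z) for z in zs) for u in unlabeled)

def select_feature_maps(candidates, labeled, unlabeled, lam=3.0):
    """Keep maps with a positive empirical margin rho whose target points
    all land within rho / lam of some source point."""
    kept = []
    for name, phi in candidates.items():
        rho = empirical_margin(phi, labeled)
        if rho > 0 and max_target_gap(phi, labeled, unlabeled) < rho / lam:
            kept.append(name)
    return kept

# Hypothetical candidate class: the two coordinate projections.
candidates = {
    "proj_x": lambda x: (x[0],),
    "proj_y": lambda x: (x[1],),
}

# Toy data (assumed): both projections separate the source sample, but the
# unlabeled target sample is far from the source along the first coordinate.
source = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
unlabeled_target = [(5.0, 0.05), (5.0, 0.95)]

kept = select_feature_maps(candidates, source, unlabeled_target)
```

Both candidates pass the source-only screen here, and it is the unlabeled target sample that rules out `proj_x`. Neither screen looks at target labels, so a map that fails only Bayes-unification would not be detected.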

In accordance with the intuition developed above, the condition that $\Phi^{*}=\mathcal{S}(\Phi)\cap\mathcal{C}(\Phi)$ is necessary. The issue of course is that without access to labeled target data, testing whether a feature map Bayes-unifies is impossible. We formalize this via the following hardness result.

Theorem 6.

Suppose $\Phi$ realizes the Statistical IRM Assumption for $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$, and that there is some $\phi\in\mathcal{S}(\Phi)\cap\mathcal{C}(\Phi)$ for which $\phi\notin\Phi^{*}$. Then for all learners $L$, there exists a conditional data distribution $\eta^{\prime}$ such that the following hold:

  1. $\eta^{\prime}(y|x)=\eta(y|x)$ for all $x\in\textnormal{supp}(\mu_{s})$.

  2. $\Phi$ realizes the Statistical IRM Assumption for $\mathcal{D}_{s}$ and $\mathcal{D}_{t}^{\prime}=(\mu_{t},\eta^{\prime})$.

  3. There exist $\delta,\epsilon>0$ such that for arbitrarily large values of $n$ and $m$, with probability at least $\delta$ over $S\sim\mathcal{D}_{s}^{n}$ and $U\sim\mu_{t}^{m}$,

    $$R(L(S,U),\mathcal{D}_{t}^{\prime})>R(g_{\mathcal{D}_{t}^{\prime}},\mathcal{D}_{t}^{\prime})+\epsilon.$$

Theorem 6 shows that no combination of embedding class $\Phi$ and learner $L$ can circumvent the impossibility of testing Bayes-unification with unlabeled target data. For any embedding class and learning algorithm, one can always find a pair of source and target distributions on which $\Phi$ realizes the Statistical IRM Assumption, but on which the learning algorithm will fail.

8 Efficient Use of Labeled Samples from $\mathcal{D}_{t}$

Algorithm 1 Selection of an Appropriate Feature Map via Target Loss Validation
procedure feature_validate($S\sim\mathcal{D}_{s}^{n}$, $T\sim\mathcal{D}_{t}^{m}$)
     $\hat{\phi}=\operatorname*{arg\,min}_{\phi\in\Phi}\frac{1}{m}\sum_{(x,y)\in T}\mathbbm{1}\left(\mathcal{N}^{\phi}_{S}(x)\neq y\right)$
     return $\mathcal{N}^{\hat{\phi}}_{S}$
end procedure

The discussion above implies that even under the Statistical IRM assumption, there are many situations where labeled target data is required for generalization. In such cases, we would hope to exploit the information encoded in the Statistical IRM Assumption to achieve generalization through labeled source data and a small amount of labeled target data. In this section we show that the Statistical IRM assumption allows for significant convergence rate speed-ups in many settings.

Recall that the Statistical IRM theorem states that, given a realizing feature map, generalization to the target can be accomplished purely through source data; the limitation imposed by a lack of labeled target data is the difficulty of identifying such a feature map. This inspires the strategy of allocating all of the labeled target data towards determining a realizing feature map.

In this spirit, we analyze the natural scheme of constructing a classifier by composing nearest-neighbors trained solely on source data with the map $\phi\in\Phi$ that minimizes the empirical risk over $T\sim\mathcal{D}_{t}^{m}$, finding that the number of target examples required for guarantees can be controlled in terms of the “distance dimension” $\partial(\Phi)$ of the class.
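The scheme just described can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not part of the paper: the class $\Phi$ is replaced by a finite list `candidate_maps`, feature space is Euclidean, the number of neighbors `k` is fixed by hand, and the helper names (`knn_predict`, `target_error`) and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(source, phi, x, k):
    """Majority vote among the k source points closest to x under d_phi."""
    zx = phi(x)
    # sorted() is stable, so distance ties keep the fixed order of `source`.
    ranked = sorted(source, key=lambda xy: math.dist(phi(xy[0]), zx))
    votes = Counter(y for _, y in ranked[:k])
    # Label ties are broken by taking the smallest label.
    return max(sorted(votes), key=lambda y: votes[y])

def feature_validate(source, target, candidate_maps, k):
    """Pick the map with the lowest empirical 0-1 loss on the labeled target sample."""
    def target_error(phi):
        mistakes = sum(knn_predict(source, phi, x, k) != y for x, y in target)
        return mistakes / len(target)
    phi_hat = min(candidate_maps, key=target_error)
    return phi_hat, (lambda x: knn_predict(source, phi_hat, x, k))

# Toy data (assumed): the first coordinate determines the label in both
# domains; the second coordinate agrees with the label only in the source.
phi_a = lambda p: (p[0],)
phi_b = lambda p: (p[1],)
source = [((-2, 0), 0), ((-1, 0), 0), ((1, 1), 1), ((2, 1), 1)]
target = [((-1.5, 1), 0), ((1.5, 0), 1)]

phi_hat, classifier = feature_validate(source, target, [phi_a, phi_b], k=1)
print(phi_hat is phi_a, classifier((3, 0)), classifier((-3.0, 1)))
```

Note that the target sample is used only to choose among feature maps; the nearest-neighbor votes themselves always come from source data.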

Theorem 7.

Suppose $\Phi$ realizes the Statistical IRM assumption. Then for every $\epsilon,\delta>0$, there exists $N$ such that if

$$n\geq N,\quad m\geq\Omega\left(\frac{\partial(\Phi)\log\left(n+\partial(\Phi)\right)+\log\frac{1}{\delta}}{\epsilon^{2}}\right),$$

then with probability at least $1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$, $T\sim\mathcal{D}_{t}^{m}$,

$$R(\mathcal{N}^{\hat{\phi}}_{S},\mathcal{D}_{t})\leq R_{t}^{*}+\epsilon,$$

where $\mathcal{N}^{\hat{\phi}}_{S}$ is the output of $\textsc{feature\_validate}(S,T)$.

Thus, the amount of labeled target data required for generalization when $\Phi$ realizes the Statistical IRM assumption can be largely controlled through our complexity measure on the class $\Phi$. We say “largely” given that $m$, the amount of data required from $\mathcal{D}_{t}$, has a logarithmic dependence on $n$, the amount of data drawn from $\mathcal{D}_{s}$. This implies a near distributional-independence between source and target in the sample complexity.
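To get a feel for the logarithmic dependence on $n$, the expression inside the $\Omega(\cdot)$ of Theorem 7 can be evaluated numerically. The sketch below is illustrative only: the constant `c` hidden by the $\Omega$ is unknown and set to 1 as a placeholder, and the values chosen for the distance dimension, $\epsilon$, and $\delta$ are arbitrary assumptions.

```python
import math

def target_sample_bound(dist_dim, n, eps, delta, c=1.0):
    """c * (dist_dim * log(n + dist_dim) + log(1/delta)) / eps^2.

    `c` stands in for the unspecified constant in the Omega(.) of Theorem 7.
    """
    return c * (dist_dim * math.log(n + dist_dim) + math.log(1.0 / delta)) / eps ** 2

small = target_sample_bound(dist_dim=10, n=10**3, eps=0.1, delta=0.05)
large = target_sample_bound(dist_dim=10, n=10**6, eps=0.1, delta=0.05)
print(small, large, large / small)
```

With these placeholder values, increasing the source sample size by a factor of one thousand less than doubles the expression.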

We note that the above margin assumptions are not required for the analysis leading to Theorem 7. Thus, compared e.g. to rates of convergence under the canonical Tsybakov noise assumption in $\mathcal{X}=\mathbb{R}^{d}$, under which nonparametric classifiers necessarily incur rates of $\tilde{\Omega}(m^{-1/(1+d)})$, the guarantees of Theorem 7 represent significant convergence rate speed-ups over naively training a nonparametric classifier with target data in many cases where the distance dimension $\partial(\Phi)$ is polynomial in the dimension of the instance space [28].

9 Discussion

In this work, we study the problem of distribution shift under a variant of the IRM assumption, wherein it is known that a feature map in a class $\Phi$ unifies classification on source and target. We investigate the identifiability of such maps, characterizing learnability in settings where worst-case approaches indicate that learning should be impossible or expensive.

Our work suggests that the study of IRM-like assumptions is a promising direction for shedding light on new situations where guaranteeing generalization under distribution shift is possible. It also highlights that a primary issue in learning under IRM-like assumptions may be the statistical identifiability of suitable feature maps.

Acknowledgements: This work was supported by the National Science Foundation under the following grants: NSF CIF-2402817, SaTC-2241100, CCF-2217058, and ARO-MURI W911NF2110317.

RB was also partially supported by the German Research Foundation through the Cluster of Excellence “Machine Learning - New Perspectives for Science” (EXC 2064/1 number 390727645).

References

  • [1] L. G. Valiant. A theory of the learnable. Communications of the ACM, pages 1134–1142, 1984.
  • [2] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. NIPS’06, page 137–144, Cambridge, MA, USA, 2006. MIT Press.
  • [3] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Vaughan. A theory of learning from different domains. Machine Learning, 79:151–175, 05 2010.
  • [4] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence, 2012.
  • [5] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization, 2020.
  • [6] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. CoRR, abs/2002.04747, 2020.
  • [7] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset shift in machine learning. 2009.
  • [8] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • [9] A. Tuan Nguyen, Toan Tran, Yarin Gal, Philip H. S. Torr, and Atılım Güneş Baydin. KL guided domain adaptation, 2022.
  • [10] Ziqiao Wang and Yongyi Mao. Information-theoretic analysis of unsupervised domain adaptation, 2023.
  • [11] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, 2007.
  • [12] Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions, 2012.
  • [13] Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation based on generalized discrepancy. Journal of Machine Learning Research, 20(1):1–30, 2019.
  • [14] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms, 2023.
  • [15] Steve Hanneke, Samory Kpotufe, and Yasaman Mahdaviyeh. Limits of model selection under transfer learning. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 5781–5812. PMLR, 12–15 Jul 2023.
  • [16] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation, 2013.
  • [17] Han Zhao, Shanghang Zhang, Guanhang Wu, João P. Costeira, José M. F. Moura, and Geoffrey J. Gordon. Multiple source domain adaptation with adversarial training of neural networks, 2017.
  • [18] Yining Chen, Elan Rosenfeld, Mark Sellke, Tengyu Ma, and Andrej Risteski. Iterative feature matching: Toward provable domain generalization with logarithmic environments, 2021.
  • [19] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation, 2015.
  • [20] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex Kot. Domain generalization with adversarial feature learning, 2018.
  • [21] Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably, 2021.
  • [22] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning, 2016.
  • [23] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. ICML’11, 2011.
  • [24] Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hye** Oh, Georges El Fakhri, Je-Won Kang, and Jonghye Woo. Deep unsupervised domain adaptation: A review of recent advances and perspectives, 2022.
  • [25] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. ALT’12, 2012.
  • [26] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. Correcting sample selection bias by unlabeled data. NIPS’06, 2006.
  • [27] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3437–3445, 2014.
  • [28] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers under the margin condition, 2011.
  • [29] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. 2014.

Appendix A Further Notation

Given $x\in\mathcal{X}$, we let $B(x,r)=\{x^{\prime}:d(x,x^{\prime})\leq r\}$ denote the closed ball centered at $x$ of radius $r$. For a feature map $\phi\in\Phi$, we also let $B_{\phi}(x,r)=\{x^{\prime}:d_{\phi}(x,x^{\prime})\leq r\}$ denote the set of all points with distance (under $\phi$) at most $r$ from $x$.

Recall that for $\phi\in\Phi$, we let $d_{\phi}$ denote the metric over $\mathcal{X}$ induced by $\phi$, i.e. $d_{\phi}(x,x^{\prime})=d\left(\phi(x),\phi(x^{\prime})\right)$. We extend this in the natural way to sets, letting

$$d_{\phi}(A,B)=\inf_{a\in A,b\in B}d_{\phi}(a,b).$$

Finally, for a pair of feature maps $\phi,\phi^{\prime}$, we let $d(\phi,\phi^{\prime})$ denote the supremum metric between $\phi$ and $\phi^{\prime}$. That is,

$$d(\phi,\phi^{\prime})=\sup_{x\in\mathcal{X}}d_{\mathcal{X}}\left(\phi(x),\phi^{\prime}(x)\right).$$
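The three distances in this appendix can be sketched in Python on finite point sets. This is an illustration under assumptions that are not in the paper: feature space is Euclidean, the infimum over sets becomes a minimum over finite lists, and the supremum over $\mathcal{X}$ is replaced by a maximum over a finite grid (which can only under-estimate the true supremum). All function names and the example maps are hypothetical.

```python
import math

def d_phi(phi, x, x_prime):
    """Distance between two inputs measured after applying the feature map."""
    return math.dist(phi(x), phi(x_prime))

def set_distance(phi, A, B):
    """Smallest d_phi distance between a point of A and a point of B (finite sets)."""
    return min(d_phi(phi, a, b) for a in A for b in B)

def sup_distance(phi, phi_prime, points):
    """Largest gap between the two maps over a finite grid of inputs."""
    return max(math.dist(phi(x), phi_prime(x)) for x in points)

# Example maps (assumed): projection onto the first coordinate, and a shifted copy.
first = lambda p: (p[0],)
shifted = lambda p: (p[0] + 0.5,)

A = [(0, 0), (1, 5)]
B = [(3, 0), (4, 9)]
grid = [(i, j) for i in range(-2, 3) for j in range(-2, 3)]

print(d_phi(first, (0, 0), (0, 7)))       # points in the same fiber are at distance 0
print(set_distance(first, A, B))
print(sup_distance(first, shifted, grid))
```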

Appendix B Further Technical Assumptions

B.1 Lebesgue Differentiation Theorem

We assume that for any finite Borel measure $\mu$ over $\mathcal{X}$, the Lebesgue differentiation theorem holds. That is, for all measurable functions $f:\mathcal{X}\to\mathbb{R}$, up to a null set under $\mu$,

$$\lim_{r\to 0^{+}}\frac{1}{\mu(B(x,r))}\int_{B(x,r)}f(x^{\prime})\,d\mu(x^{\prime})=f(x).$$
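As a quick sanity check of what this assumption asserts, consider a one-dimensional illustration that is not taken from the paper: $\mathcal{X}=\mathbb{R}$, $\mu$ the Lebesgue measure restricted to a bounded interval containing a neighborhood of $x=1$, and $f(t)=t^{2}$. The ball averages can then be computed in closed form:

```latex
\frac{1}{\mu(B(1,r))}\int_{B(1,r)} t^{2}\,d\mu(t)
  = \frac{1}{2r}\int_{1-r}^{1+r} t^{2}\,dt
  = \frac{(1+r)^{3}-(1-r)^{3}}{6r}
  = 1+\frac{r^{2}}{3}
  \;\xrightarrow[r\to 0^{+}]{}\; 1 = f(1).
```

The averages over shrinking balls recover the value of $f$ at the center, which is the behavior the assumption requires $\mu$-almost everywhere.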

B.2 Open Measures

As referenced in Section 3 above, we assume that our Borel measures satisfy a further regularity condition – namely, that they are open measures.

Definition 7.

A Borel measure $\mu$ over a metric space $(\mathcal{M},d)$ is open if, for all measurable sets $A$, $\mu(A)>0$ if and only if there exist $x\in\mathcal{M}$ and $r>0$ such that $B(x,r)\subseteq A$ and $\mu(B(x,r))>0$.

A very typical example of such a measure is any distribution that has a finite density function. In this work, we will restrict ourselves to considering open measures with the following assumption: $\mu_{s}$ and $\mu_{t}$ are open, and for all $\phi\in\Phi$, the induced source and target measures, $\mu_{s}^{\phi}$ and $\mu_{t}^{\phi}$, are open over the metric space $(\phi(\mathcal{X}),d)\subseteq(\mathcal{Z},d)$. Here we are noting that $\mu_{s}^{\phi}$ and $\mu_{t}^{\phi}$ are only non-zero over subsets of the image of $\phi$, $\phi(\mathcal{X})$, and thus we restrict our attention to $\phi(\mathcal{X})$ when considering openness.

This technical assumption allows us to simplify our results as it prohibits cases in which $\mu_{t}^{\phi}$ can be a pathological distribution that concentrates in an area of $\textnormal{supp}(\mu_{s}^{\phi})$ that leads to bad generalization. We also believe that such an assumption is relatively mild: all distributions over $\mathcal{Z}$ are arbitrarily close to open Borel measures, since we can simply add spherical noise to each sampled point.

B.3 Assumptions on $\Phi$

We include two further technical assumptions about $\Phi$. We begin by assuming that all feature maps send an infinite number of points to a given $z\in\mathcal{Z}$.

Assumption 2.

For all $x\in\mathcal{X}$ and all $\phi\in\Phi$, the set of points $x^{\prime}$ that have the same image as $x$ in $\mathcal{Z}$ under $\phi$ is infinite. That is,

$$|\{x^{\prime}:\phi(x^{\prime})=\phi(x)\}|=\infty.$$

Observe that this assumption is clearly met by the examples given in Section 3.2. Furthermore, it is likely to be met by any reasonable family of continuous maps that perform any kind of dimension reduction.

Next, we define dominance, which will be useful for formulating our other assumption.

Definition 8.

We say that feature map $\phi_{1}$ dominates feature map $\phi_{2}$ at point $x$ if

$$\{x^{\prime}:\phi_{1}(x^{\prime})=\phi_{1}(x)\}\supseteq\{x^{\prime}:\phi_{2}(x^{\prime})=\phi_{2}(x)\}.$$
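Dominance compares the fibers of two maps at a point. The sketch below checks the containment on a finite integer grid, which is only a proxy: the definition concerns the full fibers in $\mathcal{X}$, and a finite grid cannot represent the infinite fibers required by Assumption 2. The maps and helper names are hypothetical and chosen for illustration; they are not the examples from Section 3.2.

```python
def fiber(phi, x, points):
    """Points of the finite set `points` that share x's image under phi."""
    return {p for p in points if phi(p) == phi(x)}

def dominates(phi1, phi2, x, points):
    """Grid proxy for dominance: phi1's fiber at x contains phi2's fiber at x."""
    return fiber(phi1, x, points) >= fiber(phi2, x, points)

grid = [(i, j) for i in range(-3, 4) for j in range(-3, 4)]
x = (0, 0)

first = lambda p: p[0]                          # keeps only the first coordinate
second = lambda p: p[1]                         # keeps only the second coordinate
first_and_parity = lambda p: (p[0], p[1] % 2)   # retains strictly more information than `first`

print(dominates(first, first_and_parity, x, grid))  # coarser map dominates the finer one
print(dominates(first_and_parity, first, x, grid))
print(dominates(first, second, x, grid), dominates(second, first, x, grid))
```

The two coordinate projections have fibers that cross at $x$ without either containing the other, so neither dominates.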

We now define an embedding class to be indomitable when it avoids instances of one feature map dominating another.

Definition 9.

$\Phi$ is indomitable if for all distinct $\phi_{1},\phi_{2}\in\Phi$ and for all $x\in\mathcal{X}$, the following holds. For all $\epsilon>0$, there exist maps $\phi_{1}^{\epsilon},\phi_{2}^{\epsilon}\in\Phi$ such that:

  1. $d(\phi_{1},\phi_{1}^{\epsilon}),d(\phi_{2},\phi_{2}^{\epsilon})<\epsilon$.

  2. $\phi_{1}^{\epsilon}$ does not dominate $\phi_{2}^{\epsilon}$ at $x$.

  3. $\phi_{2}^{\epsilon}$ does not dominate $\phi_{1}^{\epsilon}$ at $x$.

We will now assume that $\Phi$ is indeed indomitable.

Assumption 3.

$\Phi$ is indomitable.

Observe that this assumption is satisfied by both examples of feature maps given in Section 3.2. More generally, the fact that our definition permits a lack of dominance to hold for some two maps that are close to $\phi_{1}$ and $\phi_{2}$ makes our definition mild enough to hold for most continuous classes of feature maps.

Appendix C $k_{n}$-nearest neighbors

First, we fix $k_{n}$ as a sequence of integers with the following properties.

Definition 10.

Let $k_{n}$ be a sequence of integers so that $\lim_{n\to\infty}\frac{k_{n}}{\log n}=\infty$, and $\lim_{n\to\infty}\frac{k_{n}}{n}=0$.

Observe that $k_{n}=\log^{2}n$ (rounded to an integer) would suffice as an example of such a sequence.
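The two limits in Definition 10 can be checked numerically for this choice. The sketch below is only a finite-sample illustration, not a proof; rounding up with `math.ceil` and the particular sample sizes are assumptions made here so that $k_{n}$ is an integer.

```python
import math

def k_n(n):
    """Integer version of log^2(n), rounded up (the rounding is an assumption)."""
    return math.ceil(math.log(n) ** 2)

sizes = [10**2, 10**4, 10**6, 10**8]
growth = [k_n(n) / math.log(n) for n in sizes]   # should increase without bound
fraction = [k_n(n) / n for n in sizes]           # should shrink towards 0

for n, g, f in zip(sizes, growth, fraction):
    print(n, k_n(n), round(g, 2), f)
```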

Next, our goal is to define the $k_{n}$-nearest neighbors classifier over a labeled data set of $n$ points, $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$. To do so, we begin by describing a tie-breaking procedure used in cases where training points are equidistant from a given test point.

Definition 11.

An ordering $\pi$ over a dataset $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ is any ordered permutation of $S$. We say that $(x_{i},y_{i})<_{\pi}(x_{j},y_{j})$ if $(x_{i},y_{i})$ occurs before $(x_{j},y_{j})$ in the permutation.

We now show how to use $\pi$ to break ties when computing nearest neighbors.

Definition 12.

Let $x\in\mathcal{X}$. Let $\pi$ be an ordering over dataset $S$. For $(x_{i},y_{i}),(x_{j},y_{j})\in S$, we say that $d(x,x_{i})<_{\pi}d(x,x_{j})$ if either of the two conditions hold:

  1. $d(x,x_{i})<d(x,x_{j})$.

  2. $d(x,x_{i})=d(x,x_{j})$ and $(x_{i},y_{i})<_{\pi}(x_{j},y_{j})$.

In essence, ties are broken by choosing the datapoint that appears earlier in the ordering. We now define a nearest neighbor as follows.

Definition 13.

Let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ be a dataset, and let $\pi$ be an ordering of $S$. For $x\in\mathcal{X}$, we say that $(x_{i},y_{i})$ is a $k_{n}$-nearest neighbor of $x$ if

$$|\{j:d(x,x_{j})<_{\pi}d(x,x_{i})\}|<k_{n}.$$

We also let $S_{k_{n}}^{\pi}(x)$ denote the set of all $k_{n}$-nearest neighbors of $x$ when using the ordering $\pi$.

Observe that by construction, $|S_{k_{n}}^{\pi}(x)|=k_{n}$. This is because the ordering $<_{\pi}$ allows us to strictly order points based on their distances from $x$ with ties broken by $\pi$.

We are now ready to define the nearest neighbors classifier.

Definition 14.

Let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ be a dataset, and $\pi$ an ordering over $S$. Then for $x\in\mathcal{X}$, we define

$$\mathcal{N}_{S,\pi}(x)=\operatorname*{arg\,max}_{y\in\mathcal{Y}}\sum_{(x_{i},y_{i})\in S_{k_{n}}^{\pi}(x)}\mathbbm{1}(y=y_{i}).$$

Here, we break ties in $\mathcal{Y}$ arbitrarily (which could be done with an ordering of $\mathcal{Y}$).

Throughout the paper, we typically omit $\pi$ from our notation for $\mathcal{N}_{S,\pi}(x)$. This is because in all cases, we assume that some ordering $\pi$ is implicitly chosen (independent of the data points) ahead of time.
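The definitions above translate directly into a short Python sketch. It is illustrative only and rests on assumptions not fixed by the paper: the ordering $\pi$ is taken to be the list order of `S`, the metric defaults to Euclidean distance, and label ties are broken by taking the smallest label. The function names are hypothetical.

```python
import math
from collections import Counter

def k_nearest(S, x, k, dist=math.dist):
    """Indices of the k nearest neighbors of x.

    Points are ranked by (distance, position in S), so distance ties are
    broken in favor of the point that appears earlier in the ordering.
    """
    order = sorted(range(len(S)), key=lambda i: (dist(x, S[i][0]), i))
    return order[:k]

def knn_classify(S, x, k, dist=math.dist):
    """Majority label among the k nearest neighbors; label ties go to the smallest label."""
    votes = Counter(S[i][1] for i in k_nearest(S, x, k, dist))
    top = max(votes.values())
    return min(y for y, count in votes.items() if count == top)

# The first two points are equidistant from the query (1.0,).
S = [((0.0,), 0), ((2.0,), 1), ((-2.0,), 1), ((5.0,), 0)]
query = (1.0,)

print(k_nearest(S, query, 1))      # the earlier of the two tied points is selected
print(knn_classify(S, query, 1))
print(knn_classify(S, query, 3))
```

Because every pair of points is strictly ordered by the (distance, position) key, `k_nearest` always returns exactly `k` indices, mirroring the observation that the neighbor set has size exactly $k_{n}$.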

C.1 Composing with feature maps.

We now define the classifier $\mathcal{N}_{S}^{\phi}$, where $\phi:\mathcal{X}\to\mathcal{Z}$ is a feature map. One important detail is that we will continue to use an ordering over $S$, rather than an ordering over $\phi(S)=\{(\phi(x_{1}),y_{1}),\dots,(\phi(x_{n}),y_{n})\}$. This will allow us to use a single ordering throughout all of our learning algorithms that deal with learning a feature map.

Recall that for any feature map, ϕitalic-ϕ\phiitalic_ϕ, dϕ:𝒳2[0,):subscript𝑑italic-ϕsuperscript𝒳20d_{\phi}:\mathcal{X}^{2}\to[0,\infty)italic_d start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT : caligraphic_X start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT → [ 0 , ∞ ) denotes the distance metric

dϕ(x,x)=d𝒵(ϕ(x),ϕ(x)).subscript𝑑italic-ϕ𝑥superscript𝑥subscript𝑑𝒵italic-ϕ𝑥italic-ϕsuperscript𝑥d_{\phi}(x,x^{\prime})=d_{\mathcal{Z}}\left(\phi(x),\phi(x^{\prime})\right).italic_d start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_x , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_d start_POSTSUBSCRIPT caligraphic_Z end_POSTSUBSCRIPT ( italic_ϕ ( italic_x ) , italic_ϕ ( italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) .

Using this, we give analogs to Definitions 15 and 16 by essentially replacing d𝑑ditalic_d with dϕsubscript𝑑italic-ϕd_{\phi}italic_d start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT.

Definition 15.

Let $x\in\mathcal{X}$ and $\phi\in\Phi$. Let $\pi$ be an ordering over dataset $S$. For $(x_{i},y_{i}),(x_{j},y_{j})\in S$, we say that $d_{\phi}(x,x_{i})<_{\pi}d_{\phi}(x,x_{j})$ if either of the following two conditions holds:

  1. $d_{\phi}(x,x_{i})<d_{\phi}(x,x_{j})$.

  2. $d_{\phi}(x,x_{i})=d_{\phi}(x,x_{j})$ and $(x_{i},y_{i})<_{\pi}(x_{j},y_{j})$.

Definition 16.

Let $\phi\in\Phi$. Let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ be a dataset, and let $\pi$ be an ordering of $S$. For $x\in\mathcal{X}$, we say that $(x_{i},y_{i})$ is a $k_{n}$-nearest neighbor of $x$ under $\phi$ if

$$|\{j:d_{\phi}(x,x_{j})<_{\pi}d_{\phi}(x,x_{i})\}|<k_{n}.$$

We also let $S_{k_{n},\phi}^{\pi}(x)$ denote the set of all $k_{n}$-nearest neighbors of $x$ when using the ordering $\pi$.

Finally, we define $\mathcal{N}_{S}^{\phi}$ as follows.

Definition 17.

Let $\phi\in\Phi$, let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$ be a dataset, and $\pi$ an ordering over $S$. Then for $x\in\mathcal{X}$, we define

$$\mathcal{N}_{S,\pi}^{\phi}(x)=\operatorname*{arg\,max}_{y\in\mathcal{Y}}\sum_{(x_{i},y_{i})\in S_{k_{n},\phi}^{\pi}(x)}\mathbbm{1}(y=y_{i}).$$

Here, we break ties in $\mathcal{Y}$ arbitrarily (which could be done with an ordering of $\mathcal{Y}$).

The key point of this definition is that all tie-breaking mechanisms are independent of $\phi$. In particular, we have the following.

Lemma 1.

Let $S$ be a dataset of $n$ points, and $\pi$ an ordering over $S$. Let $\phi,\phi^{\prime}$ be two feature maps in $\Phi$. Suppose that, for some $x\in\mathcal{X}$ and all $i,j$, $d_{\phi}(x,x_{i})\leq d_{\phi}(x,x_{j})$ if and only if $d_{\phi^{\prime}}(x,x_{i})\leq d_{\phi^{\prime}}(x,x_{j})$. Then $\mathcal{N}_{S,\pi}^{\phi}(x)=\mathcal{N}_{S,\pi}^{\phi^{\prime}}(x)$.

Proof.

This is immediate from the previous definitions, as all ties are broken in an identical manner for both $\phi$ and $\phi^{\prime}$. Indeed, the hypothesis implies that strict inequalities and equalities between distances from $x$ agree under $\phi$ and $\phi^{\prime}$, so the relation $<_{\pi}$, and hence the neighbor sets $S_{k_{n},\phi}^{\pi}(x)$ and $S_{k_{n},\phi^{\prime}}^{\pi}(x)$, coincide. ∎

As before, to avoid cumbersome notation, we will assume that an ordering, $\pi$, of $S$ is fixed.
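The definitions of this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes real-valued features with $d_{\mathcal{Z}}(z,z^{\prime})=|z-z^{\prime}|$, uses the position of a point in the list `S` as the ordering $\pi$, and the names `knn_predict`, `LABELS`, `phi`, and `phi_scaled` are hypothetical. The final loop checks the conclusion of Lemma 1 for two feature maps that induce the same distance comparisons.

```python
from collections import Counter


def knn_predict(S, phi, x, k, label_order):
    """Majority vote over the k nearest neighbors of x under d_phi.

    S is a list of (x_i, y_i) pairs; the list order plays the role of pi.
    Assumes real-valued features, so d_phi(x, x') = |phi(x) - phi(x')|.
    """
    zx = phi(x)
    # Sort indices by (distance, position in S): equal distances are resolved
    # by the ordering of S, which does not depend on phi.
    ranked = sorted(range(len(S)), key=lambda i: (abs(phi(S[i][0]) - zx), i))
    votes = Counter(S[i][1] for i in ranked[:k])
    best = max(votes.values())
    # Ties between labels are resolved by a fixed ordering of the label set.
    return next(y for y in label_order if votes.get(y, 0) == best)


S = [(0, "a"), (2, "b"), (4, "a"), (6, "b")]
LABELS = ("a", "b")


def phi(x):
    return x


def phi_scaled(x):
    # d_{phi_scaled} = 2 * d_phi, so all distance comparisons agree with phi.
    return 2 * x + 1


for x in range(-2, 9):
    for k in (1, 2, 3):
        assert knn_predict(S, phi, x, k, LABELS) == knn_predict(S, phi_scaled, x, k, LABELS)
```

Because ties in distance are resolved by the index in `S` and ties in votes by `LABELS`, neither tie-breaking rule looks at the feature map, which is the property Lemma 1 relies on.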

Appendix D Induced conditional distributions

In this section, we rigorously define the conditional data distribution of $\mathcal{D}^{\phi}$. Recall that if $(X,Y)\sim\mathcal{D}$ denote the random variables corresponding to $\mathcal{D}$, then $\mathcal{D}^{\phi}$ is defined as the data distribution of $(\phi(X),Y)$, where $\phi:\mathcal{X}\to\mathcal{Z}$ is a feature map. We write $\mathcal{D}=(\mu,\eta)$, where $\mu$ denotes the measure corresponding to $X$ over $\mathcal{X}$, and $\eta$ is the conditional data distribution, $p(y|X)$. Our goal in this section is to similarly write $\mathcal{D}^{\phi}=(\mu^{\phi},\eta^{\phi})$.

First, observe that for any measurable subset $B\subseteq\mathcal{Z}$, $\mu^{\phi}(B)=\mu\left(\phi^{-1}(B)\right)$. This directly follows from the definition of the random variable $\phi(X)$.

Next, to define $\eta^{\phi}$, first recall that $\eta(y|x)$ denotes the probability that $Y=y$ given that $X=x$. By assumption, this is well defined for all $x\in\mathcal{X}$ and $y\in\mathcal{Y}$, and moreover, for any $y\in\mathcal{Y}$, the function $\mathcal{X}\to[0,1]$ defined by $x\mapsto\eta(y|x)$ is measurable. To define $\eta^{\phi}$, we first define $\upsilon^{y}$ for all $y\in\mathcal{Y}$ as follows.

Definition 18.

$\upsilon^{y}$ is a measure over $\mathcal{Z}$ so that for all measurable sets $B$,

$$\upsilon^{y}(B)=\int_{\phi^{-1}(B)}\eta(y|x)d\mu(x).$$

The fact that $\upsilon^{y}$ is a well-defined measure follows directly from the rules of integration. In essence, $\upsilon^{y}(B)$ is the probability of observing $(X,Y)$ with $\phi(X)\in B$ and $Y=y$. We now show the following:

Lemma 2.

$\upsilon^{y}$ is absolutely continuous with respect to $\mu^{\phi}$ for all $y$.

Proof.

This immediately follows from the fact that $\eta(y|x)\leq 1$ for all $y,x$. Thus for any measurable set $B$,

$$\upsilon^{y}(B)=\int_{\phi^{-1}(B)}\eta(y|x)d\mu(x)\leq\int_{\phi^{-1}(B)}d\mu(x)=\mu(\phi^{-1}(B))=\mu^{\phi}(B).$$

Thus for any $\epsilon>0$, we can simply choose $\delta=\epsilon$ so that $\mu^{\phi}(B)<\delta\implies\upsilon^{y}(B)<\epsilon$. ∎

We now use the Radon-Nikodym theorem on $\upsilon^{y}$ to define $\eta^{\phi}$.

Lemma 3.

For all $y\in\mathcal{Y}$, there exists a measurable function $f^{y}:\mathcal{Z}\to[0,1]$ such that

$$\upsilon^{y}(B)=\int_{B}f^{y}(z)d\mu^{\phi}(z),$$

for all measurable sets $B$.

Proof.

This directly follows from the Radon-Nikodym theorem, which applies because $\upsilon^{y}$ is absolutely continuous with respect to $\mu^{\phi}$ (Lemma 2). Since $\upsilon^{y}(B)\leq\mu^{\phi}(B)$ for all measurable $B$, the density $f^{y}$ can be chosen to take values in $[0,1]$. ∎

We then define $\eta^{\phi}$ using these functions, $f^{y}$.

Definition 19.

For all $z\in\mathcal{Z}$ and $y\in\mathcal{Y}$, we define $\eta^{\phi}(y|z)=f^{y}(z)$.
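When $\mathcal{X}$ is finite, the construction above reduces to elementary arithmetic: $\mu^{\phi}$ sums the mass of each fiber $\phi^{-1}(\{z\})$, and the Radon-Nikodym derivative $f^{y}$ is a ratio of point masses. The following Python sketch illustrates this special case only; the function name `induced_distribution` and the toy distribution are assumptions made for illustration and do not appear in the paper.

```python
from collections import defaultdict
from fractions import Fraction as F


def induced_distribution(mu, eta, phi):
    """Compute (mu_phi, eta_phi) for a finite instance space.

    mu:  dict x -> P(X = x)
    eta: dict x -> dict y -> P(Y = y | X = x)
    phi: feature map x -> z
    Returns mu_phi as dict z -> mass and eta_phi as dict (y, z) -> probability.
    """
    mu_phi = defaultdict(F)    # mu_phi({z}) = mu(phi^{-1}({z}))
    upsilon = defaultdict(F)   # upsilon^y({z}) = sum over the fiber of eta(y|x) * mu(x)
    for x, px in mu.items():
        z = phi(x)
        mu_phi[z] += px
        for y, pyx in eta[x].items():
            upsilon[(y, z)] += pyx * px
    # Discrete Radon-Nikodym derivative: f^y(z) = upsilon^y({z}) / mu_phi({z}).
    eta_phi = {(y, z): m / mu_phi[z] for (y, z), m in upsilon.items() if mu_phi[z] > 0}
    return dict(mu_phi), eta_phi


# Toy example: three points, binary labels, phi collapses 0 and 2 together.
mu = {0: F(1, 2), 1: F(1, 4), 2: F(1, 4)}
eta = {
    0: {0: F(1, 4), 1: F(3, 4)},
    1: {0: F(1), 1: F(0)},
    2: {0: F(1, 2), 1: F(1, 2)},
}
mu_phi, eta_phi = induced_distribution(mu, eta, lambda x: x % 2)
print(mu_phi)
print(eta_phi)
```

In this example $\eta^{\phi}(1|0)$ is the $\mu$-weighted average of $\eta(1|0)$ and $\eta(1|2)$, which matches the intuition that $\eta^{\phi}$ averages $\eta$ over each fiber of $\phi$.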

Appendix E Technical Lemmas

E.1 Useful bounds related to $\Phi$

We now prove several results regarding the distance dimension of $\Phi$, $\partial(\Phi)$. These will be useful for proving all of our subsequent results.

We begin by defining a useful hypothesis class for analyzing nearest neighbors.

Definition 20.

Let $S=\{(x_{1}^{*},y_{1}^{*}),\dots,(x_{n}^{*},y_{n}^{*})\}$ be a set of $n$ labeled points in $\mathcal{X}\times\mathcal{Y}$. For $\phi\in\Phi$, define $h_{S,\phi}:\mathcal{X}\times\mathcal{Y}\to\{0,1\}$ as

$$h_{S,\phi}\left((x,y)\right)=\begin{cases}1&\mathcal{N}_{S}^{\phi}(x)=y\\ 0&\text{otherwise}\end{cases}.$$

Finally, we let $\mathcal{H}(S,\Phi)=\{h_{S,\phi}:\phi\in\Phi\}.$

Observe that $\mathcal{H}(S,\Phi)$ is a set of binary classifiers. We now show that it has bounded VC-dimension.

Lemma 4.

There exists an absolute constant $c_{1}>0$ such that $\mathcal{H}(S,\Phi)$ has VC-dimension bounded as

$$vc(\mathcal{H}(S,\Phi))\leq c_{1}\partial(\Phi)\log\left(n+\partial(\Phi)\right).$$

Proof.

Suppose $\mathcal{H}(S,\Phi)$ shatters a set $V$ of $v$ points in $\mathcal{X}\times\mathcal{Y}$, $V=\{(x_{1},y_{1}),\dots,(x_{v},y_{v})\}.$ The key observation is that for any $h_{S,\phi}\in\mathcal{H}(S,\Phi)$, the way $h_{S,\phi}$ labels a given point $(x,y)$ is determined by the $k_{n}$-nearest neighbors of $\phi(x)$ in $\{\phi(x_{1}^{*}),\dots,\phi(x_{n}^{*})\}$. Furthermore, by Lemma 1, these labels are fully determined by the set of all $\binom{n}{2}$ comparisons,

$$\left\{\mathbbm{1}\left(d_{\phi}(x,x_{i}^{*})\geq d_{\phi}(x,x_{j}^{*})\right):1\leq i<j\leq n\right\}.$$

These indicator variables precisely correspond to the definition of a distance comparer (Definition 5). It follows that the number of distinct ways that $\mathcal{H}(S,\Phi)$ can label $V$ is at most the number of ways $\Delta\Phi$ can label all $v\binom{n}{2}$ possible comparisons, $\{(x_{i},x_{j}^{*},x_{k}^{*}):1\leq i\leq v,1\leq j<k\leq n\}.$ Since by definition $vc(\Delta\Phi)=\partial(\Phi)$, Sauer's Lemma implies that the number of ways $\mathcal{H}(S,\Phi)$ can label $V$ is at most $\left(v\binom{n}{2}\right)^{\partial(\Phi)}$. However, since $\mathcal{H}(S,\Phi)$ shatters $V$, there exist precisely $2^{v}$ such labelings. It follows that $v\leq\log\left(\left(v\binom{n}{2}\right)^{\partial(\Phi)}\right)$. From here, straightforward algebra implies that $v=O\left(\partial(\Phi)\log\left(n+\partial(\Phi)\right)\right)$, as desired. ∎
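For completeness, one possible way to carry out the algebra at the end of the proof of Lemma 4 is sketched below. The shorthand $d=\partial(\Phi)\geq 1$, the assumption $n\geq 2$, the use of base-2 logarithms, and the specific constants are choices made here for illustration; the paper only asserts the final $O(\cdot)$ bound.

```latex
% Shorthand (assumed): d = \partial(\Phi) \ge 1, n \ge 2, logarithms base 2.
% Uses \binom{n}{2} \le n^2 and \ln v \le v/a + \ln a - 1 for every a > 0.
\begin{align*}
v &\le d\log_2\!\Big(v\tbinom{n}{2}\Big) \le d\log_2 v + 2d\log_2 n, \\
d\log_2 v &\le \tfrac{v}{2} + d\log_2\!\Big(\tfrac{2d}{\ln 2}\Big)
  && \text{(taking } a = \tfrac{2d}{\ln 2}\text{)}, \\
v &\le 2d\log_2\!\Big(\tfrac{2d}{\ln 2}\Big) + 4d\log_2 n
   = 2d\log_2\!\Big(\tfrac{2dn^{2}}{\ln 2}\Big)
   \le 2d\log_2\!\big(3(n+d)^{3}\big)
   = O\big(d\log(n+d)\big).
\end{align*}
```

The last inequality uses $2/\ln 2<3$ and $dn^{2}\leq(n+d)^{3}$.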

Next, we define a hypothesis class that will be useful for bounding the margin of a data distribution.

Definition 21.

For $\phi\in\Phi$ and $r>0$, define $q_{\phi,r}:\mathcal{X}^{2}\to\{0,1\}$ as the map

$$q_{\phi,r}(x,x^{\prime})=\begin{cases}1&d_{\phi}(x,x^{\prime})<r\\ 0&\text{otherwise}\end{cases}.$$

Let $\mathcal{Q}(\Phi)=\{q_{\phi,r}:\phi\in\Phi,r\in[0,\infty)\}$.

Roughly speaking, the class $\mathcal{Q}(\Phi)$ will prove useful in allowing us to uniformly bound measured distances over a data distribution. We now bound its VC-dimension as follows.
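To build intuition for why $\mathcal{Q}(\Phi)$ is controlled by distance comparisons, note that for a fixed $\phi$ the labelings produced by $q_{\phi,r}$ are nested in $r$: a single feature map can only label "prefixes" of the pairs sorted by $d_{\phi}$. The Python sketch below illustrates this under the assumption of real-valued features with absolute-value distance; the names `q`, `labelings_over_r`, and `PAIRS` are hypothetical.

```python
def q(phi, r, x, x_prime):
    """q_{phi,r}(x, x') = 1 if d_phi(x, x') < r, else 0 (real-valued features assumed)."""
    return 1 if abs(phi(x) - phi(x_prime)) < r else 0


def labelings_over_r(phi, pairs):
    """All distinct labelings of `pairs` produced by q_{phi,r} as r ranges over [0, inf)."""
    dists = sorted({abs(phi(a) - phi(b)) for a, b in pairs})
    # With a strict inequality, r equal to the smallest distance labels nothing,
    # and one threshold per distinct distance (plus one beyond the largest)
    # realizes every labeling achievable by varying r.
    thresholds = dists + [dists[-1] + 1]
    return [tuple(q(phi, r, a, b) for a, b in pairs) for r in thresholds]


PAIRS = [(0, 1), (0, 3), (2, 7), (5, 5)]
labelings = labelings_over_r(lambda x: x, PAIRS)
for labeling in labelings:
    print(labeling)
```

With four pairs there are $2^{4}=16$ possible labelings, but a single feature map realizes at most five of them here, so shattering requires many feature maps that order the pairs differently. This is the quantity that the proof of Lemma 5 counts.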

Lemma 5.

There exists an absolute constant $c_{2}>0$ such that $\mathcal{Q}(\Phi)$ has VC-dimension bounded as

$$vc(\mathcal{Q}(\Phi))\leq c_{2}\partial(\Phi)\log\left(\partial(\Phi)\right).$$

Proof.

Suppose $\mathcal{Q}(\Phi)$ shatters the set $X=\{(x_{1},x_{1}^{\prime}),(x_{2},x_{2}^{\prime}),\dots,(x_{v},x_{v}^{\prime})\}$. We say that $\phi$ induces an ordering, $\geq_{\phi}$, over $X$ by ranking the pairs in increasing distance. That is,

$$(x_{i},x_{i}^{\prime})\geq_{\phi}(x_{j},x_{j}^{\prime})\longleftrightarrow d_{\phi}(x_{i},x_{i}^{\prime})\geq d_{\phi}(x_{j},x_{j}^{\prime}).$$

Our strategy is to double count the number of distinct orderings, $\geq_{\phi}$, over $X$ that can be constructed using some $\phi\in\Phi$. Here, two orderings are distinct if they differ on some pair of entries from $X$.

First, suppose that $v$ is even (which we can assume by deleting a pair from $X$ if needed). Since $\mathcal{Q}(\Phi)$ shatters $X$, for all $S\subset X$ with $|S|=\frac{v}{2}$, there exist $\phi_{S}$ and $r$ such that

$$d_{\phi_{S}}(x_{i},x_{i}^{\prime})<r\leftrightarrow(x_{i},x_{i}^{\prime})\in S.$$

Observe that for $S\neq S^{\prime}$, $\phi_{S}$ and $\phi_{S^{\prime}}$ must induce distinct orderings over $X$, as the bottom $v/2$ elements of their orderings are distinct. Since there are $\binom{v}{v/2}\geq 2^{v/2}$ choices for $S$, this shows that there are at least $2^{v/2}$ orderings.

Second, there are $v^{2}$ possible quadruples, $(x_{i},x_{i}^{\prime},x_{j},x_{j}^{\prime})$. Suppose that $\phi$ and $\phi^{\prime}$ satisfy

$$\Delta_{\phi}(x_{i},x_{i}^{\prime},x_{j},x_{j}^{\prime})=\Delta_{\phi^{\prime}}(x_{i},x_{i}^{\prime},x_{j},x_{j}^{\prime}),$$

for all $i,j$. By the definition of a distance comparer (Definition 5), this implies that $\phi$ and $\phi^{\prime}$ induce the same ordering over $X$. Thus it suffices to count the number of ways $\Delta\Phi=\{\Delta_{\phi}:\phi\in\Phi\}$ can label the set of $v^{2}$ possible quadruples. By Sauer's Lemma, this is at most $bv^{2\partial(\Phi)}$ for some absolute constant $b>0$.

Combining our two observations, it follows that $2^{v/2}\leq bv^{2\partial(\Phi)}$. Standard algebra yields that $v\leq c^{\prime}\partial(\Phi)\log(\partial(\Phi))$ for some absolute constant $c^{\prime}$. ∎
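The "standard algebra" in the last step of the proof of Lemma 5 can be spelled out as follows. This is one possible derivation, not necessarily the authors': the shorthand $d=\partial(\Phi)$, the base-2 logarithms, and the explicit constants are assumptions made here, and the final $O(d\log d)$ form is stated for $d\geq 2$ so that $\log d$ is bounded away from zero.

```latex
% Shorthand (assumed): d = \partial(\Phi) \ge 2, logarithms base 2,
% b is the constant from the Sauer's Lemma step.
% Uses \ln v \le v/a + \ln a - 1 for every a > 0.
\begin{align*}
\tfrac{v}{2} &\le \log_2 b + 2d\log_2 v
  && \text{(taking } \log_2 \text{ of } 2^{v/2}\le b\,v^{2d}\text{)}, \\
2d\log_2 v &\le \tfrac{v}{4} + 2d\log_2\!\Big(\tfrac{8d}{\ln 2}\Big)
  && \text{(taking } a = \tfrac{8d}{\ln 2}\text{)}, \\
v &\le 4\log_2 b + 8d\log_2\!\Big(\tfrac{8d}{\ln 2}\Big)
   \le 4\log_2 b + 8d\log_2(12d)
   = O(d\log d).
\end{align*}
```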

Finally, for dealing with margins and labels simultaneously, we introduce the following hypothesis class.

Definition 22.

Let $S = \{(x_1^*, y_1^*), \dots, (x_n^*, y_n^*)\}$ be a set of $n$ labeled points in $\mathcal{X} \times \mathcal{Y}$. For $\phi \in \Phi$, define $q_{\phi,r,S} : \mathcal{X}^2 \to \{0,1\}$ as follows:

$$q_{\phi,r,S}(x, x') = \begin{cases} 1 & d_{\phi}(x, x') < r \text{ and } \mathcal{N}_S^{\phi}(x) \neq \mathcal{N}_S^{\phi}(x') \\ 0 & \text{otherwise} \end{cases}.$$

We let $\mathcal{Q}(\Phi, S) = \{q_{\phi,r,S} : \phi \in \Phi, r \in [0, \infty)\}$.

We now bound its VC-dimension.

Lemma 6.

There exists an absolute constant $c_3 > 0$ such that $\mathcal{Q}(\Phi, S)$ has VC-dimension bounded as

$$vc(\mathcal{Q}(\Phi, S)) \leq c_3 \partial(\Phi) \log\left(n + \partial(\Phi)\right).$$
Proof.

Suppose $\mathcal{Q}(\Phi, S)$ shatters $V = \{(x_1, x_1'), \dots, (x_v, x_v')\}$. We will double count the number of subsets of $V$ that can be obtained as the pre-image of $1$ under some $q_{\phi,r,S} \in \mathcal{Q}(\Phi, S)$. For any $\phi \in \Phi$, define $t_{\phi} : \mathcal{X}^2 \to \{0,1\}$ as

$$t_{\phi}(x, x') = \begin{cases} 1 & \mathcal{N}_S^{\phi}(x) \neq \mathcal{N}_S^{\phi}(x') \\ 0 & \text{otherwise} \end{cases}.$$

Then the key observation is that

$$q_{\phi,r,S}(x, x') = t_{\phi}(x, x')\, q_{\phi,r}(x, x').$$

Thus, a subset of $\{(x_1, x_1'), \dots, (x_v, x_v')\}$ is the pre-image of $1$ under $q_{\phi,r,S}$ if and only if it is precisely the intersection of the pre-images of $1$ under $q_{\phi,r}$ and $t_{\phi}$.
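The factorization can be checked on a toy example. The sketch below rests on several assumptions that are not confirmed by this section: $\mathcal{N}_S^{\phi}$ is taken to be the 1-nearest-neighbor classifier on $S$ under $d_{\phi}$, $d_{\phi}(x, x')$ is taken to be the Euclidean distance between $\phi(x)$ and $\phi(x')$, and $q_{\phi,r}$ is taken to be the indicator of $d_{\phi}(x, x') < r$. The feature map `phi`, the sample `S`, and the pairs in `V` are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, ...]
FeatureMap = Callable[[Point], Point]
Labeled = List[Tuple[Point, int]]


def d_phi(phi: FeatureMap, x: Point, xp: Point) -> float:
    """Distance between x and xp measured in feature space (assumed Euclidean)."""
    return math.dist(phi(x), phi(xp))


def nn_label(phi: FeatureMap, S: Labeled, x: Point) -> int:
    """Assumed form of N_S^phi: label of the nearest labeled point under d_phi."""
    return min(S, key=lambda pair: d_phi(phi, pair[0], x))[1]


def q_pair(phi: FeatureMap, r: float, x: Point, xp: Point) -> int:
    """Assumed form of q_{phi,r}: 1 iff the pair is closer than r."""
    return int(d_phi(phi, x, xp) < r)


def t(phi: FeatureMap, S: Labeled, x: Point, xp: Point) -> int:
    """t_phi: 1 iff the two points receive different nearest-neighbor labels."""
    return int(nn_label(phi, S, x) != nn_label(phi, S, xp))


def q_labeled(phi: FeatureMap, r: float, S: Labeled, x: Point, xp: Point) -> int:
    """q_{phi,r,S}: 1 iff the pair is closer than r and labeled differently."""
    return int(d_phi(phi, x, xp) < r and nn_label(phi, S, x) != nn_label(phi, S, xp))


def phi(x: Point) -> Point:
    """Hypothetical feature map: doubles the first coordinate, drops the second."""
    return (2.0 * x[0], 0.0)


S: Labeled = [((0.0, 0.0), 0), ((1.0, 0.0), 1)]
V = [
    ((0.1, 0.0), (0.9, 0.0)),  # close under r = 2 and labeled differently
    ((0.1, 0.0), (0.2, 5.0)),  # close, but same label
    ((0.0, 0.0), (3.0, 0.0)),  # labeled differently, but far apart
]
r = 2.0

pre_q = {p for p in V if q_pair(phi, r, *p) == 1}
pre_t = {p for p in V if t(phi, S, *p) == 1}
pre_qS = {p for p in V if q_labeled(phi, r, S, *p) == 1}

# The pre-image under q_{phi,r,S} is the intersection of the other two pre-images.
assert pre_qS == pre_q & pre_t
```

This is why the proof can bound the number of achievable subsets by the product of the two separate counts.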

By Sauer's Lemma and Lemma 5, there are at most $O\left(v^{c_2 \partial(\Phi) \log(\partial(\Phi))}\right)$ subsets that are the pre-image of $1$ under some $q_{\phi,r}$.

We now similarly bound the pre-images under $t_{\phi}$. To this end, observe that the value of $t_{\phi}$ over all $(x_i, x_i')$ is completely determined by the way in which $\mathcal{N}_S^{\phi}$ classifies $\{x_1, \dots, x_v, x_1', \dots, x_v'\}$. This quantity, in turn, is fully determined by the way in which $h_{s,\phi}$ (Definition 20) labels the set $\{x_1, \dots, x_v, x_1', \dots, x_v'\} \times \mathcal{Y}$. Thus, applying Sauer's Lemma along with Lemma 4, we see that there are at most $O\left((2v|\mathcal{Y}|)^{c_1 \partial(\Phi) \log\left(n + \partial(\Phi)\right)}\right)$ possible subsets.

However, since $\mathcal{Q}(\Phi, S)$ shatters $V$, we know that $2^v$ subsets can be formed in this manner. Thus, it follows that for some constant $c_3'$,

$$2^v \leq c_3' \left(v^{c_2 \partial(\Phi) \log(\partial(\Phi))}\right) \left((2v|\mathcal{Y}|)^{c_1 \partial(\Phi) \log\left(n + \partial(\Phi)\right)}\right).$$

Taking logs and applying standard algebraic manipulations yields the desired result. ∎

We end with one final useful hypothesis class that is a generalization of Definition 21.

Definition 23.

Let $n > 1$ be an integer. For $\phi \in \Phi$ and $r > 0$, define $q_{\phi,r,n} : \mathcal{X}^{n+1} \to \{0,1\}$ as the map

$$q_{\phi,r,n}(x, x_1, \dots, x_n) = \begin{cases} 1 & \exists_{1 \leq i \leq n}\, d_{\phi}(x, x_i) < r \\ 0 & \text{otherwise} \end{cases}.$$

Let $\mathcal{Q}_n(\Phi) = \{q_{\phi,r,n} : \phi \in \Phi, r \in [0, \infty)\}$.

This class will assist us with computing the distance between the source and target distributions simultaneously over all embeddings.

Lemma 7.

There exists an absolute constant $c_4 > 0$ such that $\mathcal{Q}_n(\Phi)$ has VC-dimension bounded as

$$vc(\mathcal{Q}_n(\Phi)) \leq c_4 \partial(\Phi) \log\left(\partial(\Phi) + n\right).$$
Proof.

Suppose $\mathcal{Q}_n(\Phi)$ shatters $V = ((x_1, x^1), (x_2, x^2), \dots, (x_v, x^v))$ where $x^i = (x_1^i, x_2^i, \dots, x_n^i) \in \mathcal{X}^n$. The key observation is that the subset of $V$ on which $q_{\phi,r,n}$ takes the value $1$ is precisely determined by the behavior of $q_{\phi,r}$ over the set of $nv$ pairs $(x_i, x_j^i)$ for $1 \leq i \leq v$ and $1 \leq j \leq n$. By Sauer's Lemma along with Lemma 5, there are at most $O\left((nv)^{c_2 \partial(\Phi) \log \partial(\Phi)}\right)$ such subsets possible.

Since $V$ is shattered, all $2^v$ subsets must arise, so it follows that $2^v \leq C(nv)^{c_2 \partial(\Phi) \log \partial(\Phi)}$ for some constant $C$. Applying standard manipulations again yields that $v \leq c_4 \partial(\Phi) \log\left(\partial(\Phi) + n\right)$ for some constant $c_4$. ∎
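The reduction used in the proof of Lemma 7 can be illustrated with a small sketch: the value of $q_{\phi,r,n}$ on a tuple is the logical OR of $n$ pairwise threshold tests. As assumptions for this example, $d_{\phi}$ is taken to be the Euclidean distance between feature vectors, $q_{\phi,r}$ is taken to be the indicator of $d_{\phi}(x, x') < r$, and the feature map `identity` and the points are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]
FeatureMap = Callable[[Point], Point]


def d_phi(phi: FeatureMap, x: Point, xp: Point) -> float:
    """Distance between x and xp measured in feature space (assumed Euclidean)."""
    return math.dist(phi(x), phi(xp))


def q_pair(phi: FeatureMap, r: float, x: Point, xp: Point) -> int:
    """Assumed form of q_{phi,r}: 1 iff the pair is closer than r."""
    return int(d_phi(phi, x, xp) < r)


def q_n(phi: FeatureMap, r: float, x: Point, xs: Sequence[Point]) -> int:
    """q_{phi,r,n}: 1 iff some x_i lies within distance r of x in feature space."""
    return int(any(d_phi(phi, x, xi) < r for xi in xs))


def identity(x: Point) -> Point:
    """Hypothetical feature map used only for this illustration."""
    return x


x = (0.0, 0.0)
xs = [(3.0, 0.0), (0.0, 0.5), (2.0, 2.0)]

for r in (0.25, 1.0, 5.0):
    pairwise = [q_pair(identity, r, x, xi) for xi in xs]
    # The n-ary test is determined entirely by the n pairwise tests.
    assert q_n(identity, r, x, xs) == max(pairwise)
```

Because each of the $v$ tuples contributes $n$ pairs, the behavior of $q_{\phi,r,n}$ on $V$ is fixed by the labeling of $nv$ pairs, which is what lets the proof invoke Lemma 5.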

E.2 Useful properties of data distributions

Lemma 8.

Let $\mathcal{D}$ be a well-separated data distribution with label margin $\Delta$. Let $h : \mathcal{X} \to \mathcal{Y}$ be a classifier such that $R(h, \mathcal{D}) < R(g_{\mathcal{D}}, \mathcal{D}) + \epsilon$, where $g_{\mathcal{D}}$ denotes the Bayes-optimal classifier. Then

$$\Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq g_{\mathcal{D}}(x)] < \frac{\epsilon}{\Delta}.$$
Proof.

Let $\mathcal{D} = (\mu, \eta)$. Let $A \subset \mathcal{X}$ denote the set of all points for which $h$ and $g_{\mathcal{D}}$ disagree. Then we have,

$$\begin{split}
R(h, \mathcal{D}) &= 1 - \int_{\mathcal{X}} \eta(h(x)|x)\, d\mu(x) \\
&= 1 - \int_{A} \eta(h(x)|x)\, d\mu(x) - \int_{\mathcal{X} \setminus A} \eta(h(x)|x)\, d\mu(x) \\
&\geq 1 - \left(\int_{A} \left(\eta(g_{\mathcal{D}}(x)|x) - \Delta\right) d\mu(x)\right) - \left(\int_{\mathcal{X} \setminus A} \eta(g_{\mathcal{D}}(x)|x)\, d\mu(x)\right) \\
&= 1 + \Delta \mu(A) - \int_{\mathcal{X}} \eta(g_{\mathcal{D}}(x)|x)\, d\mu(x) \\
&= R(g_{\mathcal{D}}, \mathcal{D}) + \Delta \mu(A).
\end{split}$$

Here, we are using the fact that $\eta(h(x)|x) \leq \eta(g_{\mathcal{D}}(x)|x) - \Delta$ for $g_{\mathcal{D}}(x) \neq h(x)$ (as $\mathcal{D}$ is well-separated). Finally, using the fact that $h$ has excess risk less than $\epsilon$, we find that $\Delta \mu(A) < \epsilon$, which implies that $\mu(A) < \frac{\epsilon}{\Delta}$, as desired. ∎
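The inequality $R(h, \mathcal{D}) \geq R(g_{\mathcal{D}}, \mathcal{D}) + \Delta \mu(A)$ can be verified on a small finite distribution. The sketch below is a toy check under stated assumptions, not a restatement of the paper's setting: $\mathcal{X}$ has three points, the values of `mu` and `eta` are made up, and the label margin is taken to be the smallest gap between the top conditional probability and any other label's probability, matching the way $\Delta$ is used in the proof.

```python
from typing import List

# Hypothetical finite distribution D = (mu, eta) on X = {0, 1, 2}, Y = {0, 1}.
mu: List[float] = [0.5, 0.3, 0.2]        # marginal mass of each point
eta: List[List[float]] = [               # eta[x][y] = Pr[y | x]
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
]


def risk(h: List[int]) -> float:
    """R(h, D) = 1 - sum_x mu(x) * eta(h(x) | x)."""
    return 1.0 - sum(m * eta[x][h[x]] for x, m in enumerate(mu))


# Bayes-optimal classifier: pick the most likely label at every point.
g = [max(range(len(row)), key=row.__getitem__) for row in eta]

# Label margin as used in the proof: gap between the best label and the rest.
delta = min(
    row[g[x]] - p
    for x, row in enumerate(eta)
    for y, p in enumerate(row)
    if y != g[x]
)

h = [0, 1, 1]                             # a classifier that errs on point 2
mu_A = sum(m for x, m in enumerate(mu) if h[x] != g[x])
excess = risk(h) - risk(g)

# Excess risk dominates delta * mu(A), up to floating-point tolerance.
assert excess >= delta * mu_A - 1e-12
```

Here the bound holds with equality because $h$ disagrees with $g_{\mathcal{D}}$ only at the point whose gap equals $\Delta$. For any $\epsilon$ larger than the excess risk, the lemma's conclusion $\mu(A) < \epsilon / \Delta$ follows.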

We now use the fact that $\Phi$ is compact to prove a useful lemma.

Lemma 9.

For all $\epsilon > 0$, there exists $\delta > 0$ such that for all $\phi \in \Phi$ and all $x \in \mathcal{X}$,

$$\phi\left(B(x, \delta)\right) \subseteq B(\phi(x), \epsilon).$$
Proof.

Assume towards a contradiction that for some $\epsilon > 0$ and every $\delta > 0$, there exist $\phi, x$ such that $\phi\left(B(x, \delta)\right) \not\subseteq B(\phi(x), \epsilon)$. Let $\delta_i \to 0$ be a sequence and let $\phi_i, x_i$ be corresponding feature maps and points for this sequence.

Since $\Phi$ is compact, we can take an infinite subsequence of $\phi_i$ so that $\phi_i \to \phi$ for some $\phi$. Similarly, since $\mathcal{X}$ is compact, we can take an infinite subsequence so that $x_i \to x$ for some $x$. Because $\phi$ is continuous, there exists $\delta > 0$ such that

$$\phi\left(B(x, \delta)\right) \subseteq B\left(\phi(x), \frac{\epsilon}{2}\right).$$

Select $i$ such that $d(x, x_i) < \frac{\delta}{2}$, $\delta_i < \frac{\delta}{2}$, and $d(\phi, \phi_i) < \frac{\epsilon}{2}$. Then, applying the triangle inequality, we have

$$\begin{split}
B(x_i, \delta_i) &\subseteq B\left(x_i, \frac{\delta}{2}\right) \\
&\subseteq B(x, \delta).
\end{split}$$

Furthermore, since $d(\phi, \phi_i) < \frac{\epsilon}{2}$,

$$\begin{split}
\phi_i(B(x, \delta)) &\subseteq \left\{z : d_{\mathcal{Z}}\left(z, \phi(B(x, \delta))\right) < \frac{\epsilon}{2}\right\} \\
&\subseteq \left\{z : d_{\mathcal{Z}}\left(z, B\left(\phi(x), \frac{\epsilon}{2}\right)\right) < \frac{\epsilon}{2}\right\} \\
&\subseteq B(\phi(x), \epsilon).
\end{split}$$

However, this contradicts the definition of $\delta_i$. ∎

E.3 Useful Definitions for Analyzing Margins

We begin by precisely characterizing feature maps that preserve a data distribution 𝒟𝒟\mathcal{D}caligraphic_D.

Lemma 10.

Let $\mathcal{D}$ be a well-separated distribution, and let $\{\mu^y : y \in \mathcal{Y}\}$ be the sets as defined in Definition 1. Then $\phi$ preserves $\mathcal{D}$ if and only if there exists $\rho^{\phi} > 0$ such that

$$\min_{y \neq y'} d_{\phi}(\mu^y, \mu^{y'}) = \rho^{\phi}.$$
Proof.

The first direction is immediate. If $\rho^{\phi}$ exists, then it is clear that $\mathcal{D}^{\phi}$ is well-separated with corresponding sets $\phi(\mu^y)$.

In the second direction, assume towards a contradiction that no such $\rho^{\phi}$ exists. Because $\phi$ preserves $\mathcal{D}$, there exist corresponding sets $\{\mu_{\phi}^y : y \in \mathcal{Y}\}$ that partition the support of $\mathcal{D}^{\phi}$. Because no such $\rho^{\phi}$ exists, we must have some $x \in \mu^y$ such that $\phi(x) \in \mu_{\phi}^{y'}$ for $y \neq y'$; otherwise we could have used the margin of $\mathcal{D}^{\phi}$ as a valid choice for $\rho^{\phi}$.

However, it then becomes clear that there exists a ball of non-zero radius centered at $x$ that is mapped into $\mu_{\phi}^{y'}$. This means it is classified as $y'$ by $g_{\mathcal{D}^{\phi}}$ while it is classified as $y$ by $g_{\mathcal{D}}$. Since $\mathcal{D}$ is well-separated, there is a unique Bayes-optimal classifier over the support of $\mathcal{D}$, and this shows that $g_{\mathcal{D}^{\phi}}$ does not incur Bayes-optimal risk over $\mathcal{D}$. Thus $\phi$ does not preserve $\mathcal{D}$, which is a contradiction. ∎

We now generalize the idea of the margin of a data distribution as follows.

Definition 24.

Let $\mathcal{D} = (\mu, \eta)$ be a well-separated data distribution, and let $\phi \in \Phi$ be a feature map. Then the margin variable of $\mathcal{D}^{\phi}$ is a random variable $\alpha^{\phi}$ defined as follows. Let $(x, x') \sim \mu^2$. Then

$$\alpha^{\phi} = \begin{cases} d(x, x') & g_{\mathcal{D}}(x) \neq g_{\mathcal{D}}(x') \\ \infty & \text{otherwise} \end{cases}.$$

$\alpha^{\phi}$ can be thought of as a randomly observed margin. We will be particularly interested in observing small values of $\alpha^{\phi}$, as this will be reflective of the margin of $\mathcal{D}^{\phi}$. In particular, we have the following.
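A Monte Carlo sketch may help make the margin variable concrete. It draws independent pairs from a finite marginal and records the pair's distance when the Bayes-optimal labels differ, and infinity otherwise. Several choices here are assumptions for illustration only: the distance is measured in feature space (Euclidean distance between $\phi(x)$ and $\phi(x')$), the support, weights, labeling rule, and feature map are hypothetical, and the helper name `sample_margin_variable` is made up.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def sample_margin_variable(
    points: Sequence[Point],
    weights: Sequence[float],
    bayes_label: Callable[[Point], int],
    phi: Callable[[Point], Point],
    rng: random.Random,
) -> float:
    """Draw (x, x') ~ mu^2 and return the observed margin, or inf if labels agree."""
    x, xp = rng.choices(points, weights=weights, k=2)
    if bayes_label(x) != bayes_label(xp):
        # Assumption: the distance is taken between feature vectors.
        return math.dist(phi(x), phi(xp))
    return math.inf


# Hypothetical well-separated support: class 0 on the left, class 1 on the right.
points: List[Point] = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (3.0, 0.0)]
weights = [0.25, 0.25, 0.25, 0.25]


def bayes_label(x: Point) -> int:
    """Hypothetical Bayes-optimal rule for the toy support."""
    return int(x[0] >= 1.0)


def identity(x: Point) -> Point:
    """Hypothetical feature map used only for this illustration."""
    return x


rng = random.Random(0)
draws = [sample_margin_variable(points, weights, bayes_label, identity, rng) for _ in range(2000)]

rho = 1.5      # smallest cross-class distance in this toy support
gamma = 0.25
frac_small = sum(a <= rho + gamma for a in draws) / len(draws)
print(min(draws), frac_small)
```

No draw can fall below the smallest cross-class distance, while draws within $\gamma$ of it occur with positive frequency, so repeated observations of $\alpha^{\phi}$ reveal the margin from above.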

Lemma 11.

Let $\mathcal{D}$ be a well-separated data distribution and let $\phi \in \Phi$ be a feature map. If $\phi$ preserves $\mathcal{D}$, let $\rho^{\phi}$ denote the margin of $\mathcal{D}^{\phi}$. Otherwise, let $\rho^{\phi} = 0$. Then for every $\gamma > 0$ there exists $\delta > 0$ such that

$$\Pr_{\alpha^{\phi}}[\alpha^{\phi} \leq \rho^{\phi} + \gamma] \geq \delta.$$
Proof.

Let $\mathcal{D}=(\mu,\eta)$, and let $\{\mu^{y}:y\in\mathcal{Y}\}$ be the sets corresponding to Definition 1. Suppose $\phi$ preserves $\mathcal{D}$. Then the sets $\{\phi(\mu^{y}):y\in\mathcal{Y}\}$ must be the corresponding sets for $\mathcal{D}^{\phi}$, and it follows that

$$\min_{y\neq y'}d_{\mathcal{Z}}\left(\phi(\mu^{y}),\phi(\mu^{y'})\right)=\rho^{\phi}.$$

On the other hand, if $\phi$ does not preserve $\mathcal{D}$, then we must have

$$\min_{y\neq y'}d_{\mathcal{Z}}\left(\phi(\mu^{y}),\phi(\mu^{y'})\right)=0,$$

as if this distance were positive, then $\phi$ would clearly preserve $\mathcal{D}$. Thus in either case, there exist $y,y'$ such that $d\left(\phi(\mu^{y}),\phi(\mu^{y'})\right)=\rho^{\phi}$.

It follows that there exist $x\in\mu^{y}$ and $x'\in\mu^{y'}$ such that $d_{\phi}(x,x')\leq\rho^{\phi}+\gamma/2$. Let

$$\delta=\mu(B_{\phi}(x,\gamma/4))\,\mu(B_{\phi}(x',\gamma/4)).$$

Because $\phi$ is continuous, $\phi(x)$ and $\phi(x')$ lie within the support of $\mathcal{D}^{\phi}$. It follows that $\delta>0$. Moreover, $\delta$ is a lower bound on the probability that we observe $(x_{1},x_{2})\in B_{\phi}(x,\gamma/4)\times B_{\phi}(x',\gamma/4)$. For any such pair, the triangle inequality gives $d_{\phi}(x_{1},x_{2})\leq\gamma/4+(\rho^{\phi}+\gamma/2)+\gamma/4=\rho^{\phi}+\gamma$, so $\delta$ is also a lower bound on the probability that $\alpha^{\phi}\leq\rho^{\phi}+\gamma$. This gives the desired result. ∎
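The margin variable can be illustrated numerically. The following is a minimal sketch under stated assumptions, not part of the original analysis: the feature map is taken to be the identity, labels are deterministic, and the toy distribution (uniform on $[0,1]$ with label 0 or on $[2,3]$ with label 1, so the class regions are at distance exactly 1) is hypothetical. The names `sample_alpha`, `sample_point`, `label`, and `dist` are illustrative only.

```python
import math
import random

def sample_alpha(sample_point, label, dist, rng):
    """Draw one realization of the margin variable: the distance between two
    independent draws if their labels differ, and infinity otherwise."""
    x, x_prime = sample_point(rng), sample_point(rng)
    if label(x) != label(x_prime):
        return dist(x, x_prime)
    return math.inf

# Hypothetical toy distribution: uniform on [0, 1] (label 0) or on [2, 3]
# (label 1), each with probability 1/2. The class regions are at distance 1.
def sample_point(rng):
    shift = 2.0 if rng.random() < 0.5 else 0.0
    return rng.uniform(0.0, 1.0) + shift

def label(x):
    return 0 if x < 1.5 else 1

def dist(a, b):
    return abs(a - b)

rng = random.Random(0)
draws = [sample_alpha(sample_point, label, dist, rng) for _ in range(20000)]

margin, gamma = 1.0, 0.2
# Empirical frequency of the event {alpha <= margin + gamma}. Lemma 11 states
# that the probability of this event is bounded below by some delta > 0.
freq = sum(a <= margin + gamma for a in draws) / len(draws)
smallest = min(draws)
```

In this toy setting no finite draw falls below the margin, while draws within $\gamma$ of the margin occur with small but positive frequency, which is the behavior Lemma 11 describes.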

We now use a similar idea to describe distances between the supports of two measures.

Definition 25.

Let $\mu_{s},\mu_{t}$ be measures over $\mathcal{X}$. Then $\beta^{\phi}$ is defined as

$$\beta^{\phi}=d_{\phi}(x_{t},\textnormal{supp}(\mu_{s})),$$

where $x_{t}$ is a random variable following distribution $\mu_{t}$.

$\beta^{\phi}$ can be thought of as the distance from a point drawn from $\mathcal{D}_{t}$ to $\mathcal{D}_{s}$, measured in the distance metric determined by $\phi$.

It will also be useful to define a finite-sample version of $\beta^{\phi}$, denoted $\beta_{n}^{\phi}$, that does not rely on the set $\textnormal{supp}(\mu_{s})$.

Definition 26.

Let $\mu_{s},\mu_{t}$ be measures over $\mathcal{X}$. Let $n>0$. Then $\beta_{n}^{\phi}$ is defined as

$$\beta_{n}^{\phi}=\min_{1\leq i\leq n}d_{\phi}(x_{t},x_{s}^{i}),$$

where $x_{t}$ is a random variable following distribution $\mu_{t}$, and $x_{s}^{1},\dots,x_{s}^{n}$ are drawn i.i.d. from $\mu_{s}$.

We now show that $\beta_{n}$ converges to $\beta$.

Lemma 12.

Let $\phi$ be any feature map. Then $\beta_{1}^{\phi},\beta_{2}^{\phi},\dots$ converges in distribution to $\beta^{\phi}$.

Proof.

For any $r>0$, the probability that $\beta^{\phi}<r$ is precisely the probability that the drawn point $x_{t}\in\textnormal{supp}(\mu_{t})$ has distance less than $r$ from $\textnormal{supp}(\mu_{s})$. For any such $x_{t}$, let $x_{s}\in\textnormal{supp}(\mu_{s})$ satisfy $d(x_{t},x_{s})<r$. Furthermore, pick $\epsilon>0$ so that $2\epsilon<r-d(x_{t},x_{s})$. It follows that $\beta_{n}^{\phi}<r$ holds whenever one of the $n$ points drawn from $\mu_{s}$ lies within distance $\epsilon$ of $x_{s}$. Since $x_{s}$ lies in the support of $\mu_{s}$, this event occurs with high probability once $n$ is sufficiently large. ∎
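The convergence of the finite-sample distance to the distance from the support can be seen in a small simulation. This is a sketch under assumptions that are not in the original: the feature map is the identity, the source measure is uniform on $[0,1]$, and the target point is held fixed at $1.5$ (one realization of $x_{t}$), so its distance to the source support is $0.5$. The function name `beta_n` and the toy setup are illustrative.

```python
import random

def beta_n(x_t, source_sample, dist):
    """Finite-sample quantity: distance from x_t to its nearest source sample."""
    return min(dist(x_t, x_s) for x_s in source_sample)

def dist(a, b):
    return abs(a - b)

# Hypothetical toy setup: the source measure is uniform on [0, 1] and the
# target point is fixed at 1.5, so its distance to supp(mu_s) is 0.5.
rng = random.Random(0)
x_t = 1.5
beta = 0.5
source = [rng.uniform(0.0, 1.0) for _ in range(10000)]

# Nested prefixes of a single sample, so beta_n is non-increasing in n.
sizes = (10, 100, 1000, 10000)
values = {n: beta_n(x_t, source[:n], dist) for n in sizes}
```

Each value in `values` is at least `beta`, and the gap shrinks as more source points are drawn, mirroring the argument in the proof of Lemma 12.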

Appendix F Proof of Theorem 1

First, we characterize areas of $\mathcal{X}$ that are likely to be correctly classified by composing nearest neighbors with $\phi$.

Definition 27.

Let $\phi$ be a feature map that preserves $\mathcal{D}_{s}$. Let $0<p<1$, and let $r>0$ be a distance. We let $\mathcal{X}_{p,r}^{\phi}$ denote the set of all points $x$ such that there exists $x'$ for which the following hold.

  1. $d_{\phi}(x,x')<\frac{\rho^{\phi}}{2}-r$.

  2. $\mu_{s}\left(B_{\phi}(x',r)\right)\geq p$.

Here $p$ represents a small amount of mass that must be close to $x$, and $x'$ and $r$ determine a region in which that mass is concentrated. The idea will be that $x$ can be accurately classified using points sampled from $B(x',r)$. We now formalize this with the following lemma.

Lemma 13.

Fix $p,r>0$. Then there exists $N$ such that for all $n\geq N$, for all $\phi\in\Phi$ and $x\in\mathcal{X}_{p,r}^{\phi}$, with probability at least $1-\frac{1}{n^{4}}$ over $S\sim\mathcal{D}_{s}^{\phi}$,

$$\mathcal{N}_{S}^{\phi}(x)=g_{\mathcal{D}_{s}^{\phi}}\left(\operatorname*{arg\,min}_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))\right).$$
Proof.

Because $\mathcal{D}_{s}$ is well-separated, let $\mu_{s}^{y}$ denote the regions that correspond to Definition 1. Let $x'$ be the point from the definition of $\mathcal{X}_{p,r}^{\phi}$. By applying Lemma 10, observe that there exists $y\in\mathcal{Y}$ such that $\operatorname*{arg\,min}_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))\in\phi(\mu_{s}^{y})$ and $x'\in\mu_{s}^{y}$. This holds because otherwise the triangle inequality would show that $\phi(\mu_{s}^{y})$ and $\phi(\mu_{s}^{y'})$ have distance less than $\rho^{\phi}$ for some $y'\neq y$.

It now suffices to show that with probability at least $1-\frac{1}{n^{4}}$ over $S\sim\mathcal{D}_{s}^{\phi}$, $\mathcal{N}_{S}^{\phi}(x)=y$.

To do this, let $S=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$, and let $X=\{x_{1},\dots,x_{n}\}$. We can view $S$ as being constructed by first drawing $X$, and then drawing the labels of each of its points.

Observe that if $B_{\phi}(x',r)$ contains at least $k_{n}$ points, then the $k_{n}$ nearest neighbors (according to $d_{\phi}$) of $x$ will all be drawn from $\mu_{s}^{y}$. To this end, by Hoeffding's inequality, we see that

$$\Pr\left[|X\cap B_{\phi}(x',r)|>\frac{np}{2}\right]\geq 1-\exp\left(-\frac{n^{2}p^{2}}{2n}\right)=1-\exp\left(-\frac{np^{2}}{2}\right).$$

Thus, for $n$ sufficiently large (depending only on $p$), this quantity is at least $1-\frac{1}{2n^{4}}$, and $\frac{np}{2}>k_{n}$. This means that with probability at least $1-\frac{1}{2n^{4}}$, the ball $B_{\phi}(x',r)$ contains at least $k_{n}$ points.

Next, suppose that this event occurs. We now select the labels for our points. Because of our method of generating $S$, we can assume that these labels are drawn i.i.d. for points in $\mu_{s}^{y}$. Let the label of the $i$th nearest neighbor of $x$ be denoted $y_{i}$. For all $y'\neq y$, define $J_{i}^{y'}$ as the random variable that is $1$ if $y_{i}=y$, $-1$ if $y_{i}=y'$, and $0$ otherwise. The key observation is that $\mathcal{N}_{S}^{\phi}(x)=y$ if and only if $\sum_{i=1}^{k_{n}}J_{i}^{y'}>0$ for all $y'\neq y$, as this is exactly the condition that $y$ is the plurality choice.

Because $\mathcal{D}_{s}$ is well separated, it has label margin $\Delta$. Therefore, $J_{i}^{y'}$ is a random variable bounded in $[-1,1]$ with expected value at least $\Delta$. It follows by Hoeffding's inequality that

$$\Pr\left[\sum_{i=1}^{k_{n}}J_{i}^{y'}>0\right]\geq 1-\exp\left(\frac{-2\Delta^{2}k_{n}^{2}}{4k_{n}}\right)=1-\exp\left(\frac{-\Delta^{2}k_{n}}{2}\right).$$

Because $k_{n}\geq\omega(\log n)$, it follows that for a sufficiently large value of $n$, this quantity is at least $1-\frac{1}{2n^{4}|\mathcal{Y}|}$. Thus taking a union bound over all $y'\in\mathcal{Y}\setminus\{y\}$ gives the desired result. ∎

Here, observe that we are comparing the nearest neighbors classifier using $\phi'$ to the Bayes-optimal classifier over $\mathcal{D}_{s}^{\phi}$, where $\phi$ is the original feature map we are considering. In other words, this lemma implies that small perturbations to the feature map do not affect classification.
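The voting step in the proof of Lemma 13 can be made concrete with a short sketch. It assumes that each indicator takes the value $+1$ when a neighbor has label $y$, $-1$ when it has the competing label $y'$, and $0$ for any other label, so that the sum over neighbors equals the difference in vote counts. The function names `j_sums`, `is_strict_plurality`, and `vote_failure_bound` are hypothetical, and the last one simply evaluates the expression $\exp(-\Delta^{2}k_{n}/2)$ appearing in the proof.

```python
import math
from collections import Counter

def j_sums(neighbor_labels, y, label_set):
    """For each competitor y' != y, sum J_i over the neighbors, where J_i is
    +1 if the i-th neighbor has label y, -1 if it has label y', and 0 otherwise.
    The sum equals count(y) - count(y')."""
    counts = Counter(neighbor_labels)
    return {yp: counts[y] - counts[yp] for yp in label_set if yp != y}

def is_strict_plurality(neighbor_labels, y, label_set):
    """y wins the vote outright exactly when every J-sum is positive."""
    return all(s > 0 for s in j_sums(neighbor_labels, y, label_set).values())

def vote_failure_bound(delta, k):
    """Evaluate exp(-delta^2 * k / 2), the bound from the proof on the
    probability that a single J-sum over k neighbors fails to be positive."""
    return math.exp(-(delta ** 2) * k / 2.0)

# Hypothetical labels of the k = 7 nearest neighbors of a query point.
labels = ["a", "a", "b", "a", "c", "b", "a"]
label_set = {"a", "b", "c"}
sums = j_sums(labels, "a", label_set)
```

With these seven labels, `"a"` beats `"b"` by two votes and `"c"` by three, so every sum is positive and `"a"` is the plurality label. The bound decays exponentially in the number of neighbors, which is why a neighbor count growing faster than $\log n$ suffices for the union bound.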

Next, we show that the entire support of the target distribution, $\mathcal{D}_{t}=(\mu_{t},\eta_{t})$, can be covered using the regions $\mathcal{X}_{p,r}$.

Lemma 14.

Let $\rho>0$. Then there exist $p,r>0$ such that the following holds. For all $\phi\in\Phi$ that realize SIRM $(\mathcal{D}_{s},\mathcal{D}_{t})$ and for which $\mathcal{D}_{s}^{\phi}$ has margin at least $\rho$,

$$\textnormal{supp}(\mathcal{D}_{s}),\textnormal{supp}(\mathcal{D}_{t})\subseteq\mathcal{X}_{p,r}^{\phi}.$$
Proof.

Let $r=\rho\left(\frac{1}{2}-\frac{1}{\Lambda}\right)$. By the definition of $\Lambda$, $r>0$. Now let $x\in\textnormal{supp}(\mathcal{D}_{t})$ be arbitrary.

Because $\phi$ contracts $(\mathcal{D}_{s},\mathcal{D}_{t})$, there exists $x'\in\textnormal{supp}(\mu_{s})$ such that $d_{\phi}(x,x')<\frac{\rho^{\phi}}{\Lambda}$, where $\rho^{\phi}$ is the margin of $\mathcal{D}_{s}^{\phi}$. It follows that

$$\begin{aligned}d_{\phi}(x,x')&<\frac{\rho^{\phi}}{\Lambda}\\ &=\frac{1}{2}\rho^{\phi}-\left(\frac{1}{2}-\frac{1}{\Lambda}\right)\rho^{\phi}\\ &\leq\frac{1}{2}\rho^{\phi}-\left(\frac{1}{2}-\frac{1}{\Lambda}\right)\rho\\ &=\frac{\rho^{\phi}}{2}-r.\end{aligned}$$

Next, by Lemma 9, there exists $s>0$ such that for all $\phi\in\Phi$, $\phi(B(x,s))\subseteq B(\phi(x),r)$. This implies that $B(x',s)\subseteq B_{\phi}(x',r)$ for all $x'$. Finally, we take

$$p=\inf_{x\in\textnormal{supp}(\mu_{s})}\mu_{s}(B(x,s)).$$

It suffices to show that $p>0$. To do so, observe that $\textnormal{supp}(\mu_{s})$ is closed and therefore compact (as $\mathcal{X}$ is compact by assumption). Take an open cover of $\textnormal{supp}(\mu_{s})$ by balls of radius $s/2$. Then it has a finite sub-cover. Each of these balls has positive mass under $\mu_{s}$, and furthermore every ball $B(x,s)$ with $x\in\textnormal{supp}(\mu_{s})$ must fully contain at least one of these balls. It follows that $\mu_{s}(B(x,s))\geq q$, where $q>0$ is the minimum mass of one of these balls. Since $q>0$, it follows that $p>0$, as desired. ∎
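The choice of $r$ in the proof of Lemma 14 can be sanity-checked numerically. The sketch below assumes $\Lambda>2$, which is what the claim $r>0$ requires, and uses hypothetical function names `radius` and `contraction_fits`. It checks the inequality $\rho^{\phi}/\Lambda\leq\rho^{\phi}/2-r$ that places a target point inside the region of Definition 27; a small tolerance absorbs floating-point rounding in the boundary case.

```python
def radius(rho, lam):
    """r = rho * (1/2 - 1/Lambda); positive whenever Lambda > 2 (assumed)."""
    return rho * (0.5 - 1.0 / lam)

def contraction_fits(rho_phi, rho, lam, tol=1e-12):
    """Check rho_phi / Lambda <= rho_phi / 2 - r, where r is computed from the
    margin lower bound rho. The proof needs this whenever rho_phi >= rho."""
    return rho_phi / lam <= rho_phi / 2.0 - radius(rho, lam) + tol

# Illustrative values: margin lower bound rho = 1 and Lambda = 4.
r = radius(1.0, 4.0)
tight_case = contraction_fits(1.0, 1.0, 4.0)    # rho_phi == rho: equality
larger_margin = contraction_fits(2.0, 1.0, 4.0)  # rho_phi > rho: slack
smaller_margin = contraction_fits(0.5, 1.0, 4.0)  # rho_phi < rho: fails
```

The failing third case shows why the lemma restricts attention to feature maps whose source margin is at least $\rho$: the radius is fixed in terms of $\rho$, so a smaller margin leaves no room for it.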

Lemma 15.

Let $\rho>0$. Then there exists $N>0$ such that for all $\phi$ that relate $(\mathcal{D}_{s},\mathcal{D}_{t})$ such that $\mathcal{D}_{s}^{\phi}$ has margin at least $\rho$, if $n\geq N$, then with probability at least $1-\frac{1}{n^{2}}$ over $S\sim\mathcal{D}_{s}^{n}$,

$$R(\mathcal{N}_{S}^{\phi},\mathcal{D}_{t})<R_{t}^{*}+\frac{1}{n^{2}}.$$
Proof.

Let $\phi$ relate $\mathcal{D}_{s},\mathcal{D}_{t}$, and suppose $\mathcal{D}_{s}^{\phi}$ has margin $\rho^{\phi}$. Let $r,p$ be as in Lemma 14. Because $\phi$ relates $\mathcal{D}_{s},\mathcal{D}_{t}$, observe that for all $x\in\textnormal{supp}(\mu_{t})$,

$$g_{\mathcal{D}_{t}}(x)=g_{\mathcal{D}_{s}^{\phi}}\left(\operatorname*{arg\,min}_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))\right).$$

To see this, observe that Definition 3 implies that $\min_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))<\frac{\rho^{\phi}}{2}$, and Definition 4 implies that $z$ must be labeled by $g_{\mathcal{D}_{s}^{\phi}}$ the same as $x$ is by $g_{\mathcal{D}_{t}}$.

Next, select $N$ from Lemma 13. Since every $x\in\textnormal{supp}(\mu_{t})$ lies in $\mathcal{X}_{p,r}^{\phi}$, it follows that for $n\geq N$, with probability at least $1-\frac{1}{n^{4}}$ over $S\sim\mathcal{D}_{s}^{n}$,

$$\mathcal{N}_{S}^{\phi}(x)=g_{\mathcal{D}_{s}^{\phi}}\left(\operatorname*{arg\,min}_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))\right)=g_{\mathcal{D}_{t}}(x).$$

A standard application of Markov's inequality, which converts the bound on the expected loss into a loss bound that holds with high probability, completes the proof. ∎
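One way to fill in the Markov step is sketched below. It assumes the standard fact that, for the 0-1 loss, the excess risk of a classifier is at most the probability that it disagrees with the Bayes optimal classifier. This is a reconstruction under that assumption, not necessarily the authors' exact argument.

```latex
% Step 1: bound the expected excess risk by exchanging the order of expectation.
\begin{align*}
\mathbb{E}_{S}\left[R(\mathcal{N}_S^{\phi},\mathcal{D}_t)-R_t^*\right]
  &\le \mathbb{E}_{S}\,\Pr_{x\sim\mu_t}\left[\mathcal{N}_S^{\phi}(x)\neq g_{\mathcal{D}_t}(x)\right]
   = \mathbb{E}_{x\sim\mu_t}\,\Pr_{S}\left[\mathcal{N}_S^{\phi}(x)\neq g_{\mathcal{D}_t}(x)\right]
   \le \frac{1}{n^4}.
\intertext{Step 2: apply Markov's inequality to the non-negative excess risk.}
\Pr_{S}\left[R(\mathcal{N}_S^{\phi},\mathcal{D}_t)-R_t^*\ge \frac{1}{n^2}\right]
  &\le n^2\cdot\frac{1}{n^4}=\frac{1}{n^2}.
\end{align*}
```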

We are now prepared to prove Theorem 1.

Proof.

Any $\phi$ that relates $(\mathcal{D}_{s},\mathcal{D}_{t})$ has positive margin, and so the previous lemma applies for sufficiently large $n$. Since $\frac{1}{n^{2}}\to 0$, it immediately follows that $\mathcal{N}_{S}^{\phi}$ converges in risk to the Bayes optimal of $\mathcal{D}_{t}$, as desired. ∎

Appendix G Proof of Theorem 3

G.1 Description of our learning rule

We begin with our learning rule, $L$, that achieves the bound given in Theorem 3.

Algorithm 2 $\textsc{direct\_generalize\_nn}(S\sim\mathcal{D}_{s}^{n})$
1: $S_{tr}\leftarrow\{(x_{i},y_{i}):1\leq i\leq n/4\}$
2: $S_{loss}\leftarrow\{(x_{i},y_{i}):n/4<i\leq n/2\}$
3: $S_{margin}\leftarrow\{(x_{i},y_{i}):n/2<i\leq 3n/4\}$
4: $S_{final}\leftarrow\{(x_{i},y_{i}):3n/4<i\leq n\}$
5: $\epsilon\leftarrow n^{-1/3}$
6: $\Phi_{\epsilon}=\left\{\phi:\textsc{source\_loss}(\phi,S_{tr},S_{loss})<\epsilon\right\}$
7: $\hat{\phi}\leftarrow\operatorname*{arg\,max}_{\phi\in\Phi_{\epsilon}}\textsc{source\_margin}(\phi,S_{tr},S_{margin})$
8: return $\mathcal{N}_{S_{final}}^{\hat{\phi}}$
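The learning rule can be sketched in Python under some simplifying assumptions that are not part of the paper: $\Phi$ is replaced by a finite list of candidate maps, distances are Euclidean, and the margin estimate follows the same-index pairing described in Section G.3. The helper names (`nn_predict`, `first_coord`, the toy data) are hypothetical, and `source_loss` follows the three-argument signature of Algorithm 3.

```python
import math

def nn_predict(phi, train, x):
    """Label of the training point whose phi-image is nearest to phi(x)."""
    zx = phi(x)
    return min(train, key=lambda pt: math.dist(phi(pt[0]), zx))[1]

def source_loss(phi, s_tr, s_loss):
    """Empirical 0-1 risk on s_loss of the 1-NN classifier built on s_tr."""
    return sum(nn_predict(phi, s_tr, x) != y for x, y in s_loss) / len(s_loss)

def source_margin(phi, s_tr, s_margin):
    """Smallest phi-distance between same-index pairs that receive different labels."""
    half = len(s_margin) // 2
    pairs = zip(s_margin[:half], s_margin[half:2 * half])
    dists = [math.dist(phi(xa), phi(xb)) for (xa, _), (xb, _) in pairs
             if nn_predict(phi, s_tr, xa) != nn_predict(phi, s_tr, xb)]
    return min(dists, default=math.inf)

def direct_generalize_nn(sample, candidates):
    """Sketch of Algorithm 2 over a finite candidate list (an assumption)."""
    n = len(sample)
    q = n // 4
    s_tr, s_loss = sample[:q], sample[q:2 * q]
    s_margin, s_final = sample[2 * q:3 * q], sample[3 * q:]
    eps = n ** (-1.0 / 3.0)
    phi_eps = [phi for phi in candidates if source_loss(phi, s_tr, s_loss) < eps]
    if not phi_eps:
        # The paper does not specify this case; raising is a choice made here.
        raise ValueError("no candidate has empirical source loss below eps")
    phi_hat = max(phi_eps, key=lambda phi: source_margin(phi, s_tr, s_margin))
    return phi_hat, (lambda x: nn_predict(phi_hat, s_final, x))

def identity(x):
    return x

def first_coord(x):
    return (x[0],)

# Toy sample: the label depends only on the first coordinate.
sample = [
    ((0.0, 0.0), 0), ((0.2, 0.0), 0), ((0.8, 10.0), 1), ((1.0, 10.0), 1),  # S_tr
    ((0.1, 10.0), 0), ((0.9, 0.0), 1), ((0.1, 0.0), 0), ((0.9, 10.0), 1),  # S_loss
    ((0.1, 1.0), 0), ((0.15, 2.0), 0), ((0.9, 3.0), 1), ((0.05, 4.0), 0),  # S_margin
    ((0.0, 3.0), 0), ((0.1, 7.0), 0), ((0.9, 2.0), 1), ((1.0, 8.0), 1),    # S_final
]
phi_hat, clf = direct_generalize_nn(sample, [identity, first_coord])
```

In this toy sample the identity map is filtered out by the loss threshold, because the second coordinate dominates its distances, so the rule selects `first_coord`.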

G.2 Bounding the error in estimating the loss

Our method for estimating the loss that a nearest neighbors classifier incurs over the source distribution is given in Algorithm 3. We simply evaluate the empirical risk using nearest neighbors over the designated loss set, $S_{loss}$.

We now bound the accuracy of this method using the following Lemma.

Lemma 16.

Let $\mathcal{D}$ be an arbitrary data distribution. Let $S_{tr}$ be a set of $n$ labeled points, and let $S_{loss}\sim\mathcal{D}^{n}$ be an i.i.d. sample that is independent of $S_{tr}$. Then there exists $N>0$ such that for all $n\geq N$, with probability at least $1-\frac{1}{n^{2}}$ over $S_{loss}\sim\mathcal{D}^{n}$, for all $\phi\in\Phi$,

$$|\textsc{source\_loss}(\phi,S_{tr},S_{loss})-R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D})|<n^{-1/3}.$$
Proof.

Fix $\epsilon=n^{-1/3}$, and define $E$ as the event that the empirical risk induced by each $\phi\in\Phi$ is representative of the true risk. That is,

$$E=\mathbbm{1}\left(\sup_{\phi\in\Phi}\left|R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D})-\frac{1}{n}\sum_{(x,y)\in S_{loss}}\mathbbm{1}\left(\mathcal{N}_{S_{tr}}^{\phi}(x)\neq y\right)\right|<\epsilon\right).$$

Our goal is to show that $E$ holds with probability at least $1-\frac{1}{n^{2}}$, for sufficiently large $n$. The key observation is that for all $\phi\in\Phi$,

$$\mathbbm{1}\left(\mathcal{N}_{S_{tr}}^{\phi}(x)\neq y\right)=1-h_{S,\phi}((x,y)),$$

where $h_{S,\phi}(x,y)$ is as defined in Definition 20. Thus, it follows that

$$\begin{split}E&=\mathbbm{1}\left(\sup_{\phi\in\Phi}\left|R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D})-\frac{1}{n}\sum_{(x,y)\in S_{loss}}\mathbbm{1}\left(\mathcal{N}_{S_{tr}}^{\phi}(x)\neq y\right)\right|<\epsilon\right)\\&=\mathbbm{1}\left(\sup_{h\in\mathcal{H}(S_{tr},\Phi)}\left|\mathbb{E}_{(x,y)\sim\mathcal{D}}[1-h_{S,\phi}(x,y)]-\frac{1}{n}\sum_{(x,y)\in S_{loss}}\left(1-h_{S,\phi}(x,y)\right)\right|<\epsilon\right)\\&=\mathbbm{1}\left(\sup_{h\in\mathcal{H}(S_{tr},\Phi)}\left|\mathbb{E}_{(x,y)\sim\mathcal{D}}[h_{S,\phi}(x,y)]-\frac{1}{n}\sum_{(x,y)\in S_{loss}}h_{S,\phi}(x,y)\right|<\epsilon\right).\end{split}$$

To analyze the latter quantity, a standard application of the fundamental theorem of statistical learning (see Shalev-Shwartz and Ben-David) implies that $E$ holds with probability at least $1-\delta$ provided that $|S_{loss}|=n\geq\Omega\left(\frac{vc\left(\mathcal{H}(S_{tr},\Phi)\right)+\ln\frac{1}{\delta}}{\epsilon^{2}}\right)$.

Fix $\delta=\frac{1}{n^{2}}$. By Lemma 4, $vc\left(\mathcal{H}(S_{tr},\Phi)\right)\leq c_{1}\partial(\Phi)\log\left(n+\partial(\Phi)\right)$. Substituting this, along with $\epsilon=n^{-1/3}$, we see that

$$\begin{split}\frac{vc\left(\mathcal{H}(S_{tr},\Phi)\right)+\ln\frac{1}{\delta}}{\epsilon^{2}}&\leq\frac{c_{1}\partial(\Phi)\log\left(n+\partial(\Phi)\right)+2\ln n}{n^{-2/3}}\\&\leq Cn^{2/3}\log n,\end{split}$$

where $C$ is some constant that depends on $\partial(\Phi)$. Since $n$ asymptotically dominates this quantity, it follows that for sufficiently large $n$, we indeed have $n\geq\Omega\left(\frac{vc\left(\mathcal{H}(S_{tr},\Phi)\right)+\ln\frac{1}{\delta}}{\epsilon^{2}}\right)$, which proves the desired result. ∎
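The claim that $n$ eventually dominates $Cn^{2/3}\log n$ can be checked numerically. The snippet below is only an illustration: the value of `C` is hypothetical, and the constant hidden in the $\Omega(\cdot)$ is ignored.

```python
import math

def required_samples(n, c):
    """The bound C * n^(2/3) * log(n) on the sample-size requirement."""
    return c * n ** (2.0 / 3.0) * math.log(n)

def slack(n, c):
    """Ratio n / (C n^(2/3) log n) = n^(1/3) / (C log n); it grows without bound."""
    return n / required_samples(n, c)

C = 5.0  # hypothetical constant, chosen only for illustration
for n in (10**3, 10**6, 10**9):
    # A ratio above 1 means n samples exceed the requirement at this n.
    print(n, round(slack(n, C), 3))
```

For small $n$ the ratio can be below 1, which is why the lemma only holds for $n\geq N$.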

Algorithm 3 $\textsc{source\_loss}(\phi,S_{tr},S_{loss})$
1: return $\frac{1}{|S_{loss}|}\sum_{(x,y)\in S_{loss}}\mathbbm{1}\left(\mathcal{N}^{\phi}_{S_{tr}}(x)\neq y\right)$
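Algorithm 3 translates almost directly into code. The sketch below assumes Euclidean distance between $\phi$-images and represents each sample as a list of `(point, label)` pairs. The helper `nn_predict` and the toy data are hypothetical.

```python
import math

def nn_predict(phi, train, x):
    """1-NN label of x, with distances measured between phi-images."""
    zx = phi(x)
    nearest = min(train, key=lambda pt: math.dist(phi(pt[0]), zx))
    return nearest[1]

def source_loss(phi, s_tr, s_loss):
    """Fraction of points in s_loss misclassified by 1-NN over s_tr under phi."""
    errors = sum(1 for x, y in s_loss if nn_predict(phi, s_tr, x) != y)
    return errors / len(s_loss)

def identity(x):
    return x

def first_coord(x):
    return (x[0],)

# Toy data: the label depends on the first coordinate only.
s_tr = [((0.0, 0.0), 0), ((1.0, 5.0), 1)]
s_loss = [((0.1, 4.0), 0), ((0.9, 1.0), 1)]
loss_identity = source_loss(identity, s_tr, s_loss)
loss_first = source_loss(first_coord, s_tr, s_loss)
```

In this toy example the identity map misclassifies both held-out points, while projecting onto the first coordinate classifies both correctly.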

G.3 Bounding the error in estimating the margin

Our method for estimating the margin of a distribution, $\mathcal{D}_{s}^{\phi}$, is given in Algorithm 4. The main idea is to split the set, $S_{source}$, into two equal parts, $S_{source}^{a}$ and $S_{source}^{b}$. We then use a nearest neighbors classifier over $S_{tr}$ to label the points in both $S_{source}^{a}$ and $S_{source}^{b}$. Finally, we measure the distance between differently labeled points from $S_{source}^{a}$ and $S_{source}^{b}$ respectively. For technical reasons, when comparing distances between $S_{source}^{a}$ and $S_{source}^{b}$, we only compare points that have the same index. This allows us to exploit independence between each comparison we make.
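The description above can be sketched as follows. Algorithm 4 itself is not reproduced in this section, so this is a reading of the prose rather than the authors' exact procedure: it assumes Euclidean distance between $\phi$-images, takes the first and second halves of $S_{source}$ as the two parts, and returns infinity when no same-index pair is labeled differently. The helper `nn_predict` and the toy data are hypothetical.

```python
import math

def nn_predict(phi, train, x):
    """1-NN label of x, with distances measured between phi-images."""
    zx = phi(x)
    return min(train, key=lambda pt: math.dist(phi(pt[0]), zx))[1]

def source_margin(phi, s_tr, s_source):
    """Smallest phi-distance over same-index pairs that 1-NN labels differently."""
    half = len(s_source) // 2
    part_a, part_b = s_source[:half], s_source[half:2 * half]
    best = math.inf
    # Only the i-th point of part_a is compared with the i-th point of part_b,
    # so the comparisons use disjoint, independently drawn pairs.
    for (xa, _), (xb, _) in zip(part_a, part_b):
        if nn_predict(phi, s_tr, xa) != nn_predict(phi, s_tr, xb):
            best = min(best, math.dist(phi(xa), phi(xb)))
    return best

def identity(x):
    return x

s_tr = [((0.0,), 0), ((1.0,), 1)]
s_source = [((0.1,), 0), ((0.2,), 0), ((0.8,), 1), ((0.3,), 0)]
margin = source_margin(identity, s_tr, s_source)
same_label_margin = source_margin(identity, s_tr, [((0.1,), 0), ((0.2,), 0)])
```

Here only the pair (0.1, 0.8) receives different labels, so the estimate is their distance. The labels stored in `s_source` are ignored, since the points are relabeled by the nearest neighbors classifier.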

We now show that this method is likely to accurately estimate margins by showing that it gives good estimates for $\alpha^{\phi}$, which is described in Definition 24.

Lemma 17.

There exists $N$ such that for all $n\geq N$, if $S_{tr}$ is a set of $n$ labeled points, then with probability at least $1-\frac{1}{n}$ over $S_{source}\sim\mathcal{D}_{s}^{n}$ and $S_{loss}\sim\mathcal{D}_{s}^{n}$, at least one of the following two conditions holds:

  1. $\textsc{source\_loss}(\phi,S_{tr},S_{loss})>R_{s}^{*}+O(n^{-1/4})$.

  2. $\Pr[\alpha^{\phi}<\textsc{source\_margin}(\phi,S_{tr},S_{source})]<O(n^{-1/4})$.

Proof.

For $\phi\in\Phi$ and $r\geq 0$, let $q_{\phi,r,S_{tr}}$ be as defined in Definition 22, so that

$$q_{\phi,r,S_{tr}}(x,x^{\prime})=\begin{cases}1&d_{\phi}(x,x^{\prime})<r\text{ and }\mathcal{N}_{S_{tr}}^{\phi}(x)\neq\mathcal{N}_{S_{tr}}^{\phi}(x^{\prime})\\0&\text{otherwise}\end{cases}.$$

Observe that

$$\textsc{source\_margin}(\phi,S_{tr},S_{source})=\max\left\{r:\frac{1}{n}\sum_{i=1}^{n}q_{\phi,r,S_{tr}}(x_{i}^{a},x_{i}^{b})=0\right\}.$$

This is because $q_{\phi,r,S_{tr}}(x_{i}^{a},x_{i}^{b})=0$ if and only if either $x_{i}^{a}$ and $x_{i}^{b}$ are given the same labels, or they have distance (under $\phi$) of at least $r$.

Define $\alpha_{S_{tr}}^{\phi}$ as the random variable where for $x,x^{\prime}\sim\mu_{s}$,

$$\alpha_{S_{tr}}^{\phi}=\begin{cases}d_{\phi}(x,x^{\prime})&\mathcal{N}_{S_{tr}}^{\phi}(x)\neq\mathcal{N}_{S_{tr}}^{\phi}(x^{\prime})\\\infty&\text{otherwise}\end{cases}.$$

The variable $\alpha_{S_{tr}}^{\phi}$ is closely related to $\alpha^{\phi}$; the only difference is that we replace the Bayes-optimal classifier, $g_{\mathcal{D}_{s}}$, with $\mathcal{N}_{S_{tr}}^{\phi}$.

To relate $\alpha_{S_{tr}}^{\phi}$ to our previous quantities, observe that

$$\Pr[\alpha_{S_{tr}}^{\phi}\leq r]=\mathbb{E}_{(x,x^{\prime})\sim\mu_{s}^{2}}[q_{\phi,r,S_{tr}}(x,x^{\prime})].$$

Because the set of classifiers $\mathcal{Q}(\Phi,S_{tr})=\{q_{\phi,r,S_{tr}}:\phi\in\Phi,r\geq 0\}$ has bounded VC-dimension, $c_{3}\partial(\Phi)\log(n+\partial(\Phi))$, we can apply uniform convergence to see that, with high probability, $\Pr[\alpha_{S_{tr}}^{\phi}\leq r]$ must be close to its empirical estimate simultaneously over all $\phi,r$.
More precisely, by applying the same argument as in the proof of Lemma 16, we have that for $n$ sufficiently large, with probability at least $1-\frac{1}{n^{2}}$ over $S_{source}\sim\mathcal{D}_{s}^{n}$, for all $q_{\phi,r,S_{tr}}\in\mathcal{Q}(\Phi,S_{tr})$,

$$\left|\mathbb{E}_{(x,x^{\prime})\sim\mu_{s}^{2}}[q_{\phi,r,S_{tr}}(x,x^{\prime})]-\frac{1}{n}\sum_{i=1}^{n}q_{\phi,r,S_{tr}}(x_{i}^{a},x_{i}^{b})\right|<n^{-1/3}.$$

By substituting the definition of $\alpha_{S_{tr}}^{\phi}$ along with our observation about $source\_margin(\phi,S_{tr},S_{source})$, it follows that

$$\Pr[\alpha_{S_{tr}}^{\phi}<source\_margin(\phi,S_{tr},S_{source})]<n^{-1/3}.$$
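One way to spell out this substitution is sketched below. The shorthand $\hat{r}$ for the observed margin is introduced here only for readability; the sketch uses nothing beyond the uniform convergence bound and the characterization of $source\_margin$ as the largest radius with zero empirical mean.

```latex
\begin{align*}
\hat{r} &:= source\_margin(\phi, S_{tr}, S_{source}),
  \qquad \frac{1}{n}\sum_{i=1}^{n} q_{\phi,\hat{r},S_{tr}}(x_i^a, x_i^b) = 0, \\
\Pr[\alpha_{S_{tr}}^{\phi} < \hat{r}]
  &\leq \Pr[\alpha_{S_{tr}}^{\phi} \leq \hat{r}]
   = \mathbb{E}_{(x,x')\sim\mu_s^2}\left[q_{\phi,\hat{r},S_{tr}}(x,x')\right] \\
  &< \frac{1}{n}\sum_{i=1}^{n} q_{\phi,\hat{r},S_{tr}}(x_i^a, x_i^b) + n^{-1/3}
   = n^{-1/3}.
\end{align*}
```

Because the uniform convergence bound holds for every radius at once, it may be applied at the data-dependent radius $\hat{r}$.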

We now turn our attention to showing that $\alpha_{S_{tr}}^{\phi}$ must indeed serve as a reasonable approximation for $\alpha^{\phi}$. To do so, observe that if $\alpha_{S_{tr}}^{\phi}$ and $\alpha^{\phi}$ are constructed from the same random variables, $x,x^{\prime}\sim\mu_{s}$, then they differ only if $\mathcal{N}_{S_{tr}}^{\phi}$ and $g_{\mathcal{D}_{s}}$ differ on either $x$ or $x^{\prime}$.

Suppose that $R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D}_{s})=R(g_{\mathcal{D}_{s}},\mathcal{D}_{s})+\epsilon^{\phi}$. Then it follows by Lemma 8 that $\Pr_{x\sim\mu_{s}}[g_{\mathcal{D}_{s}}(x)\neq\mathcal{N}_{S_{tr}}^{\phi}(x)]\leq\frac{\epsilon^{\phi}}{\Delta}$, where $\Delta$ is the label margin of $\mathcal{D}_{s}$.
It follows by a union bound that the probability that $\alpha^{\phi}<r$ is at most $\frac{\epsilon^{\phi}}{\Delta}$ summed with the probability that $\alpha_{S_{tr}}^{\phi}<r$. That is,

$$\Pr[\alpha^{\phi}<source\_margin(\phi,S_{tr},S_{source})]<n^{-1/3}+\frac{\epsilon^{\phi}}{\Delta}\qquad(1)$$

However, if $n$ is sufficiently large, then we have that with probability at least $1-\frac{1}{n^{2}}$ over $S_{loss}\sim\mathcal{D}_{s}^{n}$, for all $\phi$,

$$\left|source\_loss(\phi,S_{tr},S_{loss})-\left(R(g_{\mathcal{D}_{s}},\mathcal{D}_{s})+\epsilon^{\phi}\right)\right|<O(n^{-1/3}).$$

This in turn implies that

$$source\_loss(\phi,S_{tr},S_{loss})>R(g_{\mathcal{D}_{s}},\mathcal{D}_{s})+\epsilon^{\phi}-O(n^{-1/3})\qquad(2)$$

By taking a union bound, it follows that with probability at least $1-\frac{2}{n^{2}}$, Equations 1 and 2 simultaneously hold over all $\phi$.

Finally, if $\epsilon^{\phi}\geq n^{-1/4}$, then for $n$ sufficiently large, condition 2 from the statement of the Lemma must hold. Otherwise, if $\epsilon^{\phi}<n^{-1/4}$, condition 1 holds. Thus in either case, one of the two conditions holds, which completes the proof. ∎
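The two cases can be made explicit as follows. This is a sketch that assumes $n\geq 1$ and that the constant hidden in $O(n^{-1/3})$ does not depend on $\phi$; it only combines the two bounds above with the threshold $n^{-1/4}$.

```latex
\begin{align*}
\epsilon^{\phi} \geq n^{-1/4}
  &\implies source\_loss(\phi,S_{tr},S_{loss})
     > R(g_{\mathcal{D}_s},\mathcal{D}_s) + n^{-1/4} - O(n^{-1/3}), \\
\epsilon^{\phi} < n^{-1/4}
  &\implies \Pr[\alpha^{\phi} < source\_margin(\phi,S_{tr},S_{source})]
     < n^{-1/3} + \frac{n^{-1/4}}{\Delta}
     \leq \left(1 + \frac{1}{\Delta}\right) n^{-1/4}.
\end{align*}
```

Since $n^{-1/3}$ decays faster than $n^{-1/4}$, the first case leaves a gap of order $n^{-1/4}$ above the Bayes risk, while the second case yields a probability of order $O(n^{-1/4})$.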

Algorithm 4 $source\_margin(\phi,S_{tr},S_{source})$
1: $S_{source}=S_{source}^{a}\cup S_{source}^{b}$, $|S_{source}^{a}|=|S_{source}^{b}|$.
2: $S_{source}^{a}=\{(x_{1}^{a},y_{1}^{a}),\dots,(x_{n}^{a},y_{n}^{a})\}$.
3: $X^{a}=\{x_{1}^{a},\dots,x_{n}^{a}\}$.
4: $S_{source}^{b}=\{(x_{1}^{b},y_{1}^{b}),\dots,(x_{n}^{b},y_{n}^{b})\}$.
5: $X^{b}=\{x_{1}^{b},\dots,x_{n}^{b}\}$.
6: for $i=1,\dots,n$ do
7:      $d_{i}=d_{\phi}(x_{i}^{a},x_{i}^{b})$.
8:      if $\mathcal{N}_{S_{tr}}^{\phi}(x_{i}^{a})=\mathcal{N}_{S_{tr}}^{\phi}(x_{i}^{b})$ then
9:          $d_{i}=\infty$
10:     end if
11: end for
12: return $\min_{1\leq i\leq n}d_{i}$.
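The procedure above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions that are not fixed by the pseudocode: `phi` is a callable returning a tuple of floats, $d_{\phi}$ is the Euclidean distance between feature vectors, $\mathcal{N}_{S_{tr}}^{\phi}$ is a $k$-nearest-neighbor majority vote in feature space, and the two halves of $S_{source}$ are passed as separate lists. The helper names `euclidean` and `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(phi, s_tr, x, k):
    """Majority label among the k points of s_tr nearest to x in feature space."""
    fx = phi(x)
    nearest = sorted(s_tr, key=lambda pt: euclidean(phi(pt[0]), fx))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def source_margin(phi, s_tr, s_source_a, s_source_b, k=1):
    """Smallest feature-space distance over pairs with different predicted labels.

    Returns math.inf when every pair receives the same predicted label.
    """
    margin = math.inf
    for (xa, _), (xb, _) in zip(s_source_a, s_source_b):
        # Pairs with equal predictions correspond to d_i = infinity
        # and therefore never lower the minimum.
        if knn_predict(phi, s_tr, xa, k) != knn_predict(phi, s_tr, xb, k):
            margin = min(margin, euclidean(phi(xa), phi(xb)))
    return margin


# Toy usage with a hypothetical identity feature map on one-dimensional inputs.
identity = lambda x: x
s_tr = [((0.0,), 0), ((4.0,), 1)]
s_a = [((1.0,), 0), ((0.5,), 0)]
s_b = [((4.0,), 1), ((1.5,), 0)]
toy_margin = source_margin(identity, s_tr, s_a, s_b)
```

Note that the labels stored in the source sample are never read: as in the pseudocode, only the predictions of the nearest-neighbor classifier decide whether a pair contributes to the minimum.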

G.4 Proving the theorem

We first show a Lemma that implies that the feature map selected by our algorithm, $\hat{\phi}$, is likely to realize the SIRM assumption on $(\mathcal{D}_{s},\mathcal{D}_{t})$.

Lemma 18.

Let $\phi^{*}\in\Phi$ be any SIRM realizing feature map, and suppose that $\mathcal{D}_{s}^{\phi^{*}}$ has margin $\rho^{*}$. Then for all $\delta>0$, there exists $N$ such that if $n\geq N$, with probability at least $1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$, $\hat{\phi}$ (defined in Line 6 of direct_generalize_nn) is a SIRM realizing feature map for $(\mathcal{D}_{s},\mathcal{D}_{t})$, and has margin at least $\frac{\rho^{*}}{2}$.

Proof.

Assume towards a contradiction that for some $\delta>0$, there exist arbitrarily large values of $n$ for which, with probability at least $\delta$, $\hat{\phi}$ has margin less than $\frac{\rho^{*}}{2}$.

For $n$ sufficiently large, with probability at least $1-O(\frac{1}{n})$, applying Lemmas 16 and 17, we have that for all $\phi\in\Phi$,

$$|source\_loss(\phi,S_{tr},S_{loss})-R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D})|<n^{-1/3},$$

and that one of the following two conditions holds as well:

  1. $source\_loss(\phi,S_{tr},S_{loss})>R_{s}^{*}+O(n^{-1/4})$.

  2. $\Pr[\alpha^{\phi}<source\_margin(\phi,S_{tr},S_{source})]<O(n^{-1/4})$.

Because these equations hold for $\phi^{*}$, the smallest observed empirical loss must be at most $R_{s}^{*}+n^{-1/3}$. It follows that $\hat{\phi}$ must incur empirical loss at most $R_{s}^{*}+2n^{-1/3}$, which implies that condition 2 must apply to $\hat{\phi}$.
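To see why condition 1 is eventually ruled out for $\hat{\phi}$, compare the two rates. In the sketch below, $c>0$ is a hypothetical name for the constant hidden in the $O(n^{-1/4})$ term of condition 1; its value is not specified in the text.

```latex
\begin{align*}
2n^{-1/3} < c\,n^{-1/4}
  \iff n^{1/12} > \frac{2}{c}
  \iff n > \left(\frac{2}{c}\right)^{12}.
\end{align*}
```

For instance, with $c=1$ this holds once $n>2^{12}=4096$. Beyond that point the empirical loss of $\hat{\phi}$ lies below the threshold in condition 1, so condition 2 is the one that holds.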

Furthermore, because $k_{n}>\log n$, we have that with high probability, $\mathcal{N}_{S_{tr}}^{\phi^{*}}$ will match the Bayes-optimal classifier, $g_{\mathcal{D}_{s}}$, for all points in $S_{source}$. It follows that the observed margin, $source\_margin(\phi^{*},S_{tr},S_{source})$, will be at least $\rho^{*}$.

Combining all of this, we see that

$$\Pr[\alpha^{\hat{\phi}}<\rho^{*}]<n^{-1/4}.$$

Now let $n_{1},n_{2},\dots$ be a sequence of integers going to infinity so that for each $n_{i}$, with probability at least $\delta$, $\hat{\phi}$ has a margin less than $\frac{\rho^{*}}{2}$. Because $\delta$ is fixed, it follows that for sufficiently large $n_{i}$, there exist $\hat{\phi}_{i}$ and $S_{i}$ such that all of the equations above hold.

Because $\Phi$ is compact, there exists an infinite subsequence of the $n_{i}$s for which $\hat{\phi}_{i}$ converges (using the distance metric over $\Phi$) to some $\phi$. Relabel our sequence so that without loss of generality, $\hat{\phi}_{i}\to\phi$.

The key observation is that the variable $\alpha^{\phi}$ is Lipschitz with respect to the distance metric over $\Phi$. In particular, if $|\phi-\phi^{\prime}|<r$, then $\alpha^{\phi}-\alpha^{\phi^{\prime}}<2r$.
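A short derivation of this bound is sketched below. It assumes (this is not stated in the passage) that $d_{\phi}(x,x^{\prime})=\|\phi(x)-\phi(x^{\prime})\|$ and that the distance over $\Phi$ is the supremum distance $\sup_{x}\|\phi(x)-\phi^{\prime}(x)\|$, so that $|\phi-\phi^{\prime}|<r$ bounds the displacement of every point by $r$.

```latex
\begin{align*}
\left| d_{\phi}(x,x') - d_{\phi'}(x,x') \right|
  &= \big|\, \|\phi(x)-\phi(x')\| - \|\phi'(x)-\phi'(x')\| \,\big| \\
  &\leq \|\phi(x)-\phi'(x)\| + \|\phi(x')-\phi'(x')\| \\
  &< 2r.
\end{align*}
```

Both variables use the labels of the same Bayes-optimal classifier on $x$ and $x^{\prime}$, so they are infinite on exactly the same pairs; on the remaining pairs they equal the two distances compared above.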

Using this, observe that for sufficiently large values of $i$, we have that $d(\hat{\phi}_{i},\phi)<\frac{\rho^{*}}{8}$. Substituting this, it follows that for all sufficiently large $i$,

$$\Pr[\alpha^{\phi}<\frac{3\rho^{*}}{4}]\leq\Pr[\alpha^{\hat{\phi}}<\rho^{*}]<n_{i}^{-1/4}.$$
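The first inequality is an inclusion of events. A sketch, taking $r=\frac{\rho^{*}}{8}$ in the Lipschitz property so that the two variables differ by less than $\frac{\rho^{*}}{4}$ whenever they are finite:

```latex
\begin{align*}
\alpha^{\phi} < \frac{3\rho^{*}}{4}
  &\implies \alpha^{\hat{\phi}_i} < \alpha^{\phi} + \frac{\rho^{*}}{4} < \rho^{*}, \\
\text{hence}\quad
\Pr\left[\alpha^{\phi} < \frac{3\rho^{*}}{4}\right]
  &\leq \Pr\left[\alpha^{\hat{\phi}_i} < \rho^{*}\right].
\end{align*}
```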

Since $n_i$ can be arbitrarily large, it follows that $\Pr[\alpha^{\phi} < \frac{3\rho^*}{4}] = 0$, which implies that $\mathcal{D}_s^{\phi}$ must have margin at least $\frac{3\rho^*}{4}$.

However, this in turn implies that for all sufficiently large $i$, $\hat{\phi}_i$ too must have margin at least $\frac{3\rho^*}{4} - \frac{\rho^*}{4} = \frac{\rho^*}{2}$. Here we are again exploiting the fact that the margin is Lipschitz.

This finally gives us a contradiction, as we previously assumed that all $\hat{\phi}_i$ had margin less than $\frac{\rho^*}{2}$.

We are now prepared to prove Theorem 3.

Proof.

Fix $\epsilon, \delta > 0$. The previous Lemma implies that for sufficiently large values of $n$, with probability $1 - \frac{\delta}{2}$ we will select some $\hat{\phi}$ that has margin at least $\frac{\rho^*}{2}$. Lemma 15 implies that for $n$ sufficiently large (in a way that only depends on $\rho^*, \mathcal{D}_s$), with probability at least $1 - \frac{\delta}{2}$ over $S_{final} \sim \mathcal{D}_s^{n/4}$,

$$R(\mathcal{N}_{S_{final}}^{\hat{\phi}}, \mathcal{D}_t) < R_t^* + \epsilon.$$

Crucially, $S_{final}$ is completely independent of $\hat{\phi}$, which is learned purely using $S_{tr}$, $S_{loss}$, and $S_{source}$. Taking a union bound implies the desired result. ∎

Appendix H Proof of Theorem 4

Proof.

Fix $\epsilon > 0$. Let $\phi_1$ relate $(\mathcal{D}_s, \mathcal{D}_t)$, and let $\phi_2 \in \mathcal{S}(\Phi) \setminus \Phi^*$ be a feature map that source-preserves $\mathcal{D}_s$ but fails to relate $(\mathcal{D}_s, \mathcal{D}_t)$. We will construct $\mathcal{D}_s'$ and $\mathcal{D}_t'$ using $\phi_1$ and $\phi_2$.

Because $\phi_2$ fails to relate $(\mathcal{D}_s, \mathcal{D}_t)$, there exists $x \in \textnormal{supp}(\mu_t)$ such that $\phi_2(x) = z \notin \textnormal{supp}(\mu_s^{\phi_2})$. Note that if this doesn't hold, then we can simply use the construction from the proof of Theorem 6. The point $x$ will be central to constructing both $\mathcal{D}_s'$ and $\mathcal{D}_t'$. We begin by constructing $\mathcal{D}_s'$.

Constructing $\mathcal{D}_s'$:

Let $\alpha > 0$ be a small value. Then by Assumptions 2 and 3, there exist $x_1, x_2 \in \mathcal{X}$ and $\phi_1^{\alpha}, \phi_2^{\alpha} \in \Phi$ such that the following conditions hold:

  1. $d(\phi_1, \phi_1^{\alpha}), d(\phi_2, \phi_2^{\alpha}) < \alpha$.

  2. $\phi_1^{\alpha}(x_1) = \phi_1^{\alpha}(x) \neq \phi_1^{\alpha}(x_2)$.

  3. $\phi_2^{\alpha}(x_1) \neq \phi_2^{\alpha}(x) \neq \phi_2^{\alpha}(x_2)$.

Here, $\phi_1^{\alpha}$ and $\phi_2^{\alpha}$ are chosen using Assumption 3, while the existence of $x_1$ and $x_2$ is based on Assumption 2. We also let $x_1', x_2'$ be two points such that

$$0 < d_{\phi_1^{\alpha}}(x_1, x_1') \ll d_{\phi_1^{\alpha}}(x_1, x_2), \text{ and } 0 < d_{\phi_2^{\alpha}}(x_2, x_2') \ll d_{\phi_2^{\alpha}}(x_1, x_2).$$

Next, let $\mu_s'$ be a measure over $\mathcal{X}$ obtained by the following steps.

  1. Begin with $\mu_s$, the measure of $\mathcal{D}_s$ over $\mathcal{X}$.

  2. Remove all points in $\textnormal{supp}(\mu_s)$ that lie within a distance of $r$ from the set $\{x_1, x_1', x_2, x_2'\}$.

  3. Pick $s$ such that any two points in $\{x_1, x_1', x_2, x_2'\}$ have distance larger than $4s$. Insert balls of probability mass $\epsilon/8$ centered at each of these points, so that $\mu_s'(B(x, s)) = \frac{\epsilon}{4}$ for $x \in \{x_1, x_1', x_2, x_2'\}$.
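
The three steps above can be mimicked on a toy example to check the mass bookkeeping. The sketch below is only an illustration under assumptions that are not in the original: a finite uniform grid on $[0, 1]$ stands in for $\mu_s$, point masses of size $\epsilon/8$ stand in for the inserted balls, the four anchor locations and the helper names `perturb_measure` and `total_variation` are hypothetical, and total variation is used as a simple proxy that is not claimed to coincide with the distance $W$.

```python
def perturb_measure(mu, anchors, r, eps):
    """Toy version of the three construction steps on a finite measure.

    mu      : dict mapping a point (float) to its probability mass
    anchors : four points standing in for x1, x1', x2, x2'
    r       : removal radius; eps : perturbation budget
    """
    # Step 2: drop every support point within distance r of an anchor.
    kept = {p: m for p, m in mu.items()
            if all(abs(p - a) > r for a in anchors)}
    kept_mass = sum(kept.values())
    removed_mass = 1.0 - kept_mass
    # Down-size the surviving region so that it carries mass 1 - eps/2.
    scale = (1.0 - eps / 2) / kept_mass
    new_mu = {p: m * scale for p, m in kept.items()}
    # Step 3: insert mass eps/8 at each anchor (eps/2 in total).
    for a in anchors:
        new_mu[a] = new_mu.get(a, 0.0) + eps / 8
    return new_mu, removed_mass


def total_variation(mu, nu):
    """Total variation distance between two finite measures."""
    points = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(p, 0.0) - nu.get(p, 0.0)) for p in points)


# Uniform measure on a 1000-point grid in [0, 1] (illustrative only).
grid = [i / 999 for i in range(1000)]
mu_s = {p: 1.0 / len(grid) for p in grid}
eps = 0.1
anchors = [0.2, 0.21, 0.7, 0.71]  # hypothetical x1, x1', x2, x2'
mu_s_prime, removed = perturb_measure(mu_s, anchors, r=0.004, eps=eps)
```

In this toy setting the removed mass stays below $\epsilon/2$ and the inserted mass is $\epsilon/2$, so the two measures differ by less than $\epsilon$ in total variation.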

Observe that $\mu_s'$ is constructed from $\mu_s$ by adding a region of mass $\epsilon/2$ (and appropriately down-sizing all other regions). Furthermore, if $r$ is appropriately chosen, then the region being removed from $\mu_s$ can also be forced to have size at most $\epsilon/2$. It follows that $W(\mu_s, \mu_s') < \epsilon$. Next, we define the conditional distribution $\eta_s'$ with the following steps. Let $y_1 \neq y_2$ be two labels in $\mathcal{Y}$.

  1. For $x \in B(x_1, s)$: $\eta_s'(y_1 | x) = 1$.

  2. For $x \in B(x_1', s)$: $\eta_s'(y_2 | x) = 1$.

  3. For $x \in B(x_2, s)$: $\eta_s'(y_2 | x) = 1$.

  4. For $x \in B(x_2', s)$: $\eta_s'(y_1 | x) = 1$.

  5. For all other $x$, $\eta_s'(y | x) = \eta(y | x)$.

Basically, we force the conditional distribution near $x_1$ and $x_2$ to be $y_1$ and $y_2$ respectively. For $x_1', x_2'$, this is reversed. This construction only modifies $\eta_s$ at points where $\mu_s$ is modified, and it follows that $W(\mathcal{D}_s, \mathcal{D}_s') < \epsilon$.
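
The labeling rule can be written as a small function. This is a minimal sketch on a one-dimensional toy space, not the construction itself: the function name `eta_s_prime`, the dictionary keys, the numeric centers, and the fallback distribution `base_eta` are all hypothetical, and the balls are assumed to be open.

```python
def eta_s_prime(x, centers, s, eta_s):
    """Return the label distribution at x as a dict {label: probability}.

    centers : dict with keys "x1", "x1p", "x2", "x2p" (x1p, x2p are x1', x2')
    s       : ball radius
    eta_s   : callable giving the unmodified conditional distribution
    """
    # Labels forced inside each ball; the primed points get the flipped label.
    forced = {"x1": "y1", "x1p": "y2", "x2": "y2", "x2p": "y1"}
    for name, label in forced.items():
        if abs(x - centers[name]) < s:
            return {label: 1.0}  # deterministic label inside the ball
    return eta_s(x)              # unchanged everywhere else


# Hypothetical centers; any two are more than 4 * s apart, so the balls are disjoint.
centers = {"x1": 0.2, "x1p": 0.21, "x2": 0.7, "x2p": 0.71}
s = 0.002


def base_eta(x):
    """Placeholder for the original conditional distribution."""
    return {"y1": 0.5, "y2": 0.5}
```

For example, `eta_s_prime(0.2, centers, s, base_eta)` puts all mass on `y1`, while a point just next to it inside the ball around `x1p` is labeled `y2`.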

Furthermore, observe that $\phi_1^{\alpha}$ and $\phi_2^{\alpha}$ both source-preserve $\mathcal{D}_s'$. This occurs because $r$ and $s$ are chosen to be small enough so that the 4 balls, $B(x_1, s), B(x_2, s), B(x_1', s), B(x_2', s)$, are all mapped to disjoint areas under both $\phi_1$ and $\phi_2$.

Constructing $\mathcal{D}_t'$:

Next, we will construct $\mathcal{D}_t'$ by giving a choice of two possible target distributions, $\mathcal{D}_t^1$ and $\mathcal{D}_t^2$. We let $\mu_t'$ be a point mass that is concentrated at $x$. We let $\eta_t^1(y_1 | x) = 1$ and $\eta_t^2(y_2 | x) = 1$. This gives us $\mathcal{D}_t^1$ and $\mathcal{D}_t^2$.

Observe that $\mathcal{D}_t^1$ is SIRM related to $\mathcal{D}_s'$ by $\phi_1^{\alpha}$, and $\mathcal{D}_t^2$ is SIRM related to $\mathcal{D}_s'$ by $\phi_2^{\alpha}$. This is because $x$ is mapped to $x_1$ by $\phi_1^{\alpha}$, and the same holds respectively for $x_2$ and $\phi_2^{\alpha}$.

Finishing the proof:

We now show that our learner will have a large error over some choice of $\mathcal{D}_t' \in \{\mathcal{D}_t^1, \mathcal{D}_t^2\}$. To do so, suppose $\mathcal{D}_t'$ is randomly chosen from this set. It follows that our learning rule's expected loss is:

$$\begin{split}\mathbb{E}_{\mathcal{D}_t' \sim \{\mathcal{D}_t^1, \mathcal{D}_t^2\}} \mathbb{E}_{S \sim (\mathcal{D}_s')^n} R(L(S), \mathcal{D}_t') &= \mathbb{E}_{S \sim (\mathcal{D}_s')^n} \mathbb{E}_{\mathcal{D}_t' \sim \{\mathcal{D}_t^1, \mathcal{D}_t^2\}} R(L(S), \mathcal{D}_t') \\ &= \mathbb{E}_{S \sim (\mathcal{D}_s')^n} \mathbb{E}_{i \sim \{1, 2\}} \mathbbm{1}(L(S)(x) \neq y_i) \\ &= \frac{1}{2}.\end{split}$$

From here, the desired result follows by a straightforward application of Markov's inequality.
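
The averaging step can be checked by direct enumeration. The sketch below assumes, for illustration only, that the learner's prediction at $x$ is one of the two labels $y_1, y_2$; the names `zero_one_risk`, `average_risk`, and `worst_target_risk` are hypothetical.

```python
def zero_one_risk(prediction, true_label):
    """0-1 risk on a target that is a point mass at x with a fixed label."""
    return 1.0 if prediction != true_label else 0.0


labels = ("y1", "y2")  # y_i is the label of x under the target D_t^i

# The learner only sees samples from the shared source, so its prediction
# at x cannot depend on which target was drawn. Enumerate both predictions.
average_risk = {
    pred: sum(zero_one_risk(pred, y) for y in labels) / len(labels)
    for pred in labels
}
worst_target_risk = {
    pred: max(zero_one_risk(pred, y) for y in labels) for pred in labels
}
```

Whichever label is predicted, the risk averaged over the two targets is $\frac{1}{2}$, and one of the two targets incurs risk $1$. Averaging over the sample $S$ does not change this, since the bound holds for every fixed prediction.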

Appendix I Proof of Theorem 5

I.1 Description of the learning rule

We give the learning rule that achieves the bound given in Theorem 5.

Algorithm 5 presrv_contract_nn(S𝒟sn,Uμtm)presrv_contract_nnformulae-sequencesimilar-to𝑆superscriptsubscript𝒟𝑠𝑛similar-to𝑈superscriptsubscript𝜇𝑡𝑚\textsc{presrv\_contract\_nn}(S\sim\mathcal{D}_{s}^{n},U\sim\mu_{t}^{m})presrv_contract_nn ( italic_S ∼ caligraphic_D start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , italic_U ∼ italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT )
1:Str{(xi,yi):1in/5}subscript𝑆𝑡𝑟conditional-setsubscript𝑥𝑖subscript𝑦𝑖1𝑖𝑛5S_{tr}\leftarrow\{(x_{i},y_{i}):1\leq i\leq n/5\}italic_S start_POSTSUBSCRIPT italic_t italic_r end_POSTSUBSCRIPT ← { ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) : 1 ≤ italic_i ≤ italic_n / 5 }
2:Sloss{(xi,yi):n/5<i2n/5}subscript𝑆𝑙𝑜𝑠𝑠conditional-setsubscript𝑥𝑖subscript𝑦𝑖𝑛5𝑖2𝑛5S_{loss}\leftarrow\{(x_{i},y_{i}):n/5<i\leq 2n/5\}italic_S start_POSTSUBSCRIPT italic_l italic_o italic_s italic_s end_POSTSUBSCRIPT ← { ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) : italic_n / 5 < italic_i ≤ 2 italic_n / 5 }
3:Smargin{(xi,yi):2n/5<i3n/5}subscript𝑆𝑚𝑎𝑟𝑔𝑖𝑛conditional-setsubscript𝑥𝑖subscript𝑦𝑖2𝑛5𝑖3𝑛5S_{margin}\leftarrow\{(x_{i},y_{i}):2n/5<i\leq 3n/5\}italic_S start_POSTSUBSCRIPT italic_m italic_a italic_r italic_g italic_i italic_n end_POSTSUBSCRIPT ← { ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) : 2 italic_n / 5 < italic_i ≤ 3 italic_n / 5 }
4:Smargin,t{(xi,yi):3n/5<i4n/5}subscript𝑆𝑚𝑎𝑟𝑔𝑖𝑛𝑡conditional-setsubscript𝑥𝑖subscript𝑦𝑖3𝑛5𝑖4𝑛5S_{margin,t}\leftarrow\{(x_{i},y_{i}):3n/5<i\leq 4n/5\}italic_S start_POSTSUBSCRIPT italic_m italic_a italic_r italic_g italic_i italic_n , italic_t end_POSTSUBSCRIPT ← { ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) : 3 italic_n / 5 < italic_i ≤ 4 italic_n / 5 }
5: $S_{final}\leftarrow\{(x_{i},y_{i}):4n/5<i\leq n\}$
6: $\epsilon\leftarrow n^{-1/3}$
7: $\Phi_{\epsilon}=\left\{\phi:\textsc{source\_loss}(\phi,S_{tr},S_{loss})<\epsilon\right\}$
8: $\rho_{s}(\phi)=\textsc{source\_margin}(\phi,S_{tr},S_{margin})$
9: $\rho_{t}(\phi)=\textsc{target\_margin}(\phi,S_{margin,t},U)$
10: $\hat{\phi}\leftarrow\operatorname*{arg\,max}_{\phi\in\Phi_{\epsilon}}\rho_{s}(\phi)-\Lambda\rho_{t}(\phi)$
11: return $\mathcal{N}_{S_{tr}}^{\hat{\phi}}$
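The final selection step (steps 6, 7 and 10) can be illustrated with a minimal Python sketch. It assumes a finite list of candidate feature maps and three callables that each take a single candidate and already close over the relevant data splits; the name `select_feature_map`, the finite candidate list, and this calling convention are illustrative assumptions rather than part of the procedure as stated.

```python
def select_feature_map(candidates, source_loss, source_margin, target_margin, n, lam):
    """Among candidates with estimated source loss below eps = n^(-1/3),
    return the one maximizing rho_s(phi) - lam * rho_t(phi).

    candidates: finite iterable of feature maps (an assumption of this sketch).
    source_loss, source_margin, target_margin: hypothetical callables phi -> float.
    lam: the trade-off constant written as Lambda in the algorithm.
    """
    eps = n ** (-1.0 / 3.0)
    # Phi_eps: candidates whose estimated source loss is below eps
    feasible = [phi for phi in candidates if source_loss(phi) < eps]
    if not feasible:
        return None  # no candidate passes the loss filter
    # arg max of source margin minus lam times target margin
    return max(feasible, key=lambda phi: source_margin(phi) - lam * target_margin(phi))
```

A candidate with a large source margin is still rejected if it fails the loss filter, and among the survivors a large target margin is penalized in proportion to `lam`.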

I.2 Analyzing the procedure, $target\_margin$

We begin by describing the process used to estimate how far data from $\mathcal{D}_t$ is from data from $\mathcal{D}_s$ under a feature map, $\phi$. The subroutine is given in Algorithm 6, where $S$ is a labeled set of points drawn from $\mathcal{D}_s$, and $U$ is an unlabeled set of points drawn from $\mu_t$, the marginal $\mathcal{X}$-distribution of $\mathcal{D}_t$.

Algorithm 6 $target\_margin(\phi,S_{margin,t},U)$
1: $U=\{u_{1},\dots,u_{m}\}$.
2: $S_{margin,t}=\{(x_{1},y_{1}),\dots,(x_{n},y_{n})\}$.
3: $l=\min(m,[\sqrt{n}])$.
4: $U^{\prime}=\{u_{1},\dots,u_{l}\}$.
5: for $1\leq i\leq l$ do
6:     $X_{i}=\{x_{il-l+1},\dots,x_{il}\}$
7: end for
8: return $\max_{1\leq i\leq l}\min_{x\in X_{i}}d_{\phi}(u_{i},x)$.

The idea is that each $u_i\in U^{\prime}$ is assigned its own set of $l$ points sampled from $\mu_s$. We then take the distance from $u_i$ to the closest point in its assigned set. Finally, taking the maximum of these distances gives us an approximation of the furthest distance any $u\in\textnormal{supp}(\mu_t)$ has from $\textnormal{supp}(\mu_s)$ when using the distance metric, $d_\phi$.
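The subroutine can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes that $d_\phi(u,x)$ is the Euclidean distance between $\phi(u)$ and $\phi(x)$, that `phi` is a vectorized map from a `(k, d)` array to a 2-D `(k, d')` array, and that labels are ignored since only the source inputs are used.

```python
import math
import numpy as np

def target_margin(phi, source_x, target_u):
    """Max over u_i of the min distance from u_i to its own block X_i of l source points.

    Assumes d_phi(u, x) = ||phi(u) - phi(x)||_2 (an assumption of this sketch).
    """
    source_x = np.asarray(source_x, dtype=float)
    target_u = np.asarray(target_u, dtype=float)
    n, m = len(source_x), len(target_u)
    l = min(m, math.isqrt(n))           # l = min(m, floor(sqrt(n)))
    if l == 0:
        raise ValueError("need at least one source point and one target point")
    feats_u = phi(target_u[:l])         # U' = {u_1, ..., u_l}
    feats_x = phi(source_x[: l * l])    # only the first l*l source points are used
    blocks = feats_x.reshape(l, l, -1)  # blocks[i] holds phi applied to X_{i+1}
    # dists[i, j] = distance from u_{i+1} to the j-th point of its own block
    dists = np.linalg.norm(blocks - feats_u[:, None, :], axis=2)
    return float(dists.min(axis=1).max())
```

Giving each $u_i$ a disjoint block keeps the $l$ nearest-neighbor distances independent of one another, which is what the analysis below relies on.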

We now show that this procedure approximates the quantity, $\beta_l^\phi$, which is defined in Definition 26.

Lemma 19.

There exists $N>0$ such that if $n>N$, and $m\geq\sqrt{n}$, then with probability at least $1-\frac{1}{n}$ over $S_{margin,t}\sim\mathcal{D}_s^n$, and $U\sim\mathcal{D}_t^m$, for all $\phi\in\Phi$,

$$\Pr[\beta_{l_n}^{\phi}\geq target\_margin(\phi,S_{margin,t},U)]\leq n^{-1/6},$$

where $l_n$ is the largest integer at most $\sqrt{n}$.

Proof.

For convenience, let $l$ denote $l_n$. For $\phi\in\Phi$ and $r\geq 0$, let $q_{\phi,r,l}$ be as defined in Definition 23, so that

$$q_{\phi,r,l}(u,x_1,\dots,x_l)=\begin{cases}1&\exists_{1\leq i\leq l}\,d_\phi(u,x_i)<r\\0&\text{otherwise}\end{cases}.$$

Furthermore, let us relabel $X_i$ (defined in line 6 of Algorithm 6) so that $X_i=(x_1^i,\dots,x_l^i)$. It follows that

$$target\_margin(\phi,S_{margin,t},U)=\inf\left\{r:\frac{1}{l}\sum_{i=1}^{l}q_{\phi,r,l}(u_i,x_1^i,\dots,x_l^i)=1\right\}.$$

This is because $q_{\phi,r,l}(u_i,x_1^i,\dots,x_l^i)=1$ if and only if some $x_j^i$ has distance less than $r$ from $u_i$.
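To make the infimum characterization concrete, the following Python sketch evaluates the indicator and its empirical average on a small hand-made distance matrix. The helper names `q_indicator` and `empirical_coverage` and the numbers in `dist_matrix` are hypothetical and chosen only for illustration.

```python
import numpy as np

def q_indicator(dists_to_block, r):
    """1 if some point of the block lies at distance < r from u, else 0."""
    return int(np.any(np.asarray(dists_to_block) < r))

def empirical_coverage(dist_matrix, r):
    """(1/l) * sum_i q(u_i, X_i), where dist_matrix[i, j] is the distance from u_i to x_j^i."""
    return float(np.mean([q_indicator(row, r) for row in dist_matrix]))

# Hypothetical distances: row i lists the distances from u_i to the points of X_i.
dist_matrix = np.array([[0.5, 0.5],
                        [8.0, 7.0]])
max_min = float(dist_matrix.min(axis=1).max())  # value returned by the max-min rule
```

Because the indicator uses a strict inequality, the average is below 1 at `r = max_min` and equals 1 for every `r` above it, so the infimum of the admissible radii coincides with the max-min value.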

Next, we relate this quantity to $\beta_l^\phi$ as follows. Observe that

$$\Pr[\beta_{l_n}\geq r]=\mathbb{E}_{u\sim\mu_t,x_1,\dots,x_l\sim\mu_s^l}[1-q_{\phi,r,l}(u,x_1,\dots,x_l)].$$

Because the set of classifiers, $\mathcal{Q}_l(\Phi)=\{q_{\phi,r,l}:\phi\in\Phi,r\geq 0\}$, has bounded VC-dimension, $c_4\partial(\Phi)\log(l+\partial(\Phi))$, we can apply uniform convergence to see that the empirical average of $q_{\phi,r,l}$ over our sample must be close to its expectation, $\Pr[\beta_l<r]$, with high probability simultaneously over all $\phi,r$. More precisely, by applying the same argument as in the proof of Lemma 16, we have that for $n$ sufficiently large, with probability at least $1-\frac{1}{l^2}$ over $S_{margin,t}\sim\mathcal{D}_s^l$ and $U\sim\mathcal{D}_t^l$, for all $q_{\phi,r,l}\in\mathcal{Q}_l(\Phi)$,

$$\left|\mathbb{E}_{u\sim\mu_t,x_1,\dots,x_l\sim\mu_s^l}[1-q_{\phi,r,l}(u,x_1,\dots,x_l)]-\frac{1}{l}\sum_{i=1}^{l}\left(1-q_{\phi,r,l}(u_i,x_1^i,\dots,x_l^i)\right)\right|<l^{-1/3}.$$

By substituting the definition of $\beta_l^\phi$ along with our observation about $target\_margin(\phi,S_{margin,t},U)$, it follows that for all $r>target\_margin(\phi,S_{margin,t},U)$,

$$\Pr[\beta_l^\phi\geq r]<l^{-1/3}.$$

By the rules of probability, and the definition of an infimum, it follows that

$$\Pr[\beta_l^\phi\geq target\_margin(\phi,S_{margin,t},U)]\leq l^{-1/3}.$$

Substituting the value of $l$ gives the desired result. ∎

I.3 Bounding the performance of a given feature map, $\phi^*\in\Phi$

We now consider a fixed feature map $\phi^*$ that realizes the SIRM assumption on $(\mathcal{D}_s,\mathcal{D}_t)$. Doing so gives us a baseline for how $source\_margin$ and $target\_margin$ should be expected to behave.

Lemma 20.

Let $\phi^*$ realize the SIRM assumption on $(\mathcal{D}_s,\mathcal{D}_t)$. Suppose $\mathcal{D}_s^{\phi^*}$ has margin $\rho^*>0$, and that

$$\max_{x_t\in\textnormal{supp}(\mu_t)}\min_{x_s\in\textnormal{supp}(\mu_s)}d_{\phi^*}(x_t,x_s)=\beta^*.$$

Finally, let $\rho^*-\Lambda\beta^*=\gamma^*$, where $\gamma^*>0$ by the fact that $\phi^*$ contracts $(\mathcal{D}_s,\mathcal{D}_t)$. Then for all $\delta>0$, there exists $N$ such that if $n\geq N$ and $m\geq\sqrt{n}$, then with probability at least $1-\delta$ over $S\sim\mathcal{D}_s^n$ and $U\sim\mu_t^m$, the following three things hold:

  1. $source\_loss(\phi^*,S_{tr},S_{loss})<R_s^*+n^{-1/3}$.

  2. $source\_margin(\phi^*,S_{tr},S_{margin})\geq\rho^*$.

  3. $target\_margin(\phi^*,S_{margin,t},U)\leq\beta^*+\frac{\gamma^*}{2}$.

Proof.

We bound the probability of each of these three things occurring separately, and then apply a union bound.

First of all, for $n$ sufficiently large, the first claim holds with probability at least $1-O(\frac{1}{n^2})$. This follows directly from a combination of Lemma 16 along with Lemma 15 being applied to $\mathcal{D}_s$ (as $\phi^*$ technically realizes the SIRM assumption on $(\mathcal{D}_s,\mathcal{D}_s)$). In particular, we have that with probability at least $1-\frac{1}{n^2}$,

$$R(\mathcal{N}_{S_{tr}}^{\phi^*},\mathcal{D}_s)<R_s^*+\frac{1}{n^2}.\tag{3}$$

Second, observe that Lemma 8 implies that the probability that $\mathcal{N}_{S_{tr}}^{\phi^*}$ differs from the Bayes-optimal classifier is at most $\frac{1}{n^2\Delta}$, where $\Delta$ is the label margin of $\mathcal{D}_s$. It follows that with probability at least $1-O(\frac{1}{n})$, $\mathcal{N}_{S_{tr}}^{\phi^*}$ correctly labels all the points in $S_{margin}^a$ and $S_{margin}^b$ (see Algorithm 4). Thus, it follows that $source\_margin(\phi^*,S_{tr},S_{margin})\geq\rho^*$, as with correct labels it is impossible to observe two differently labeled points that are closer than $\rho^*$.

Finally, by Lemma 9, there exists $\tau^*>0$ such that

$$\phi^*(B(x,\tau^*))\subseteq B\left(\phi^*(x),\frac{\gamma^*}{2}\right).$$

Let $p=\min_{x\in\textnormal{supp}(\mu_s)}\mu_s(B(x,\frac{\tau^*}{2}))$. The argument in the proof of Lemma 9 implies that $p>0$. Thus, for $n$ sufficiently large and any fixed $x\in\textnormal{supp}(\mu_s)$, it follows that with probability at least $1-(1-p)^l\geq 1-e^{-pl}$ over $x_1,\dots,x_l\sim\mu_s^l$, there will exist some $x_i\in B(x,\frac{\tau^*}{2})$.

Since $\textnormal{supp}(\mu_s)$ is compact, it follows that we can take a finite covering of $\textnormal{supp}(\mu_s)$ with balls of radius $\frac{\tau^*}{2}$. If there are $C$ such balls, then by a union bound, with probability at least $1-C(1-p)^l\geq 1-Ce^{-pl}$ over $x_1,\dots,x_l\sim\mu_s^l$, there will exist some $x_i\in B(x,\frac{\tau^*}{2})$ for all balls $B(x,\frac{\tau^*}{2})$ in our covering. Observe that, by the triangle inequality, this implies that for all $x\in\textnormal{supp}(\mu_s)$, there will exist some $x_i\in B(x,\tau^*)$.

We now use this to show that the third condition is likely to hold. Pick $n$ sufficiently large so that $Ce^{-pl}<\frac{1}{n^2}$. It follows that with probability at least $1-O(\frac{1}{n})$ over $S_{margin,t}\sim\mathcal{D}_s^{n/5}$ and $U\sim\mathcal{D}_t^m$, for all $1\leq i\leq l$, for all $x\in\textnormal{supp}(\mu_s)$, there exists some $x_j^i\in B(x,\tau^*)$.
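Such an $n$ always exists because $e^{-p\sqrt{n}}$ decays faster than any polynomial in $n$. The following Python sketch checks the condition numerically; the values `C = 100` and `p = 0.05` used in the usage note are hypothetical and serve only to show the scale at which the inequality starts to hold.

```python
import math

def cover_failure_bound(C, p, n):
    """C * exp(-p * l) with l = floor(sqrt(n)): bound on the chance that some ball is missed."""
    l = math.isqrt(n)
    return C * math.exp(-p * l)

def condition_holds(C, p, n):
    """True when C * exp(-p * l) < 1 / n^2 for this n."""
    return cover_failure_bound(C, p, n) < 1.0 / n**2
```

With these illustrative values, `condition_holds(100, 0.05, 10_000)` is false, since $l=100$ gives a bound near $0.67$, while `condition_holds(100, 0.05, 10**6)` is true, since $l=1000$ gives a bound near $2\times 10^{-20}$ against $10^{-12}$.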

Suppose that this holds. For each $u_{i}\in U^{\prime}$ (Algorithm 6), let $x_{i}^{*}\in\textnormal{supp}(\mu_{s})$ be the point for which $d_{\phi^{*}}(u_{i},x_{i}^{*})$ is minimized. Thus $d_{\phi^{*}}(u_{i},x_{i}^{*})\leq\beta^{*}$ by the definition of $\beta^{*}$.
Moreover, by our claim above, some $x_{j}^{i}$ must be in $B(x_{i}^{*},\tau^{*})$, which implies that $\phi^{*}(x_{j}^{i})\in B(\phi^{*}(x_{i}^{*}),\frac{\gamma^{*}}{2})$. Thus $d_{\phi^{*}}(x_{j}^{i},u_{i})\leq\beta^{*}+\frac{\gamma^{*}}{2}$.
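The last bound is a triangle inequality in feature space. As a short worked step, assuming $d_{\phi^{*}}(a,b)$ denotes the distance between $\phi^{*}(a)$ and $\phi^{*}(b)$ (our reading of the notation, not restated in this section), the chain can be sketched as:

```latex
\begin{align*}
d_{\phi^{*}}(x_{j}^{i},u_{i})
  &\leq d_{\phi^{*}}(x_{j}^{i},x_{i}^{*}) + d_{\phi^{*}}(x_{i}^{*},u_{i}) \\
  &\leq \frac{\gamma^{*}}{2} + \beta^{*}.
\end{align*}
```

The first term is controlled by the ball of radius $\frac{\gamma^{*}}{2}$ around $\phi^{*}(x_{i}^{*})$, and the second by the definition of $\beta^{*}$.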

Since this occurs for all $1\leq i\leq l$, it follows that the maximum distance we observe is at most $\beta^{*}+\frac{\gamma^{*}}{2}$, which means $target\_margin(\phi,S_{margin,t},U)\leq\beta^{*}+\frac{\gamma^{*}}{2}$, as desired.

Since our three events all occur with probability $1-O(1/n)$, it follows that if $n$ is sufficiently large, they simultaneously occur with probability at least $1-\delta$. This completes the proof.

I.4 Proving the Theorem

We are now prepared to prove Theorem 5. We start with the following Lemma.

Lemma 21.

Let $\phi^{*}$ be as defined in Lemma 20. Then for all $\delta>0$, there exists $N>0$ such that for all $n\geq N$, $m\geq\sqrt{n}$, with probability at least $1-\delta$ over $S\sim\mathcal{D}_{s}^{n}$, $U\sim\mathcal{D}_{t}^{m}$, the outputted feature map $\hat{\phi}$ satisfies the following: let $\hat{\rho}$ denote the margin of $\mathcal{D}_{s}^{\hat{\phi}}$ and let $\hat{\beta}$ denote

$$\hat{\beta}=\max_{x_{t}\in\textnormal{supp}(\mu_{t})}\min_{x_{s}\in\textnormal{supp}(\mu_{s})}d_{\hat{\phi}}(x_{t},x_{s}).$$

Then

$$\hat{\rho}-\Lambda\hat{\beta}\geq\frac{\gamma^{*}}{4}.$$
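To make the max-min quantity $\hat{\beta}$ concrete, the following is a minimal sketch that evaluates it on finite point sets. The finite lists standing in for $\textnormal{supp}(\mu_{t})$ and $\textnormal{supp}(\mu_{s})$, the Euclidean distance in feature space, and the names `beta_hat`, `identity`, and `first_coord` are all illustrative assumptions, not part of the paper's algorithm.

```python
import math

def beta_hat(phi, target_points, source_points):
    """Largest feature-space distance from a target point to its nearest source point."""
    return max(
        min(math.dist(phi(x_t), phi(x_s)) for x_s in source_points)
        for x_t in target_points
    )

# Hypothetical toy data: the two domains differ only in the second coordinate.
source = [(0.0, 0.0), (1.0, 0.0)]
target = [(0.0, 3.0), (1.0, 4.0)]

def identity(x):
    """Feature map that keeps the shifted coordinate."""
    return x

def first_coord(x):
    """Feature map that discards the shifted coordinate."""
    return (x[0],)

beta_identity = beta_hat(identity, target, source)      # target stays far from source
beta_projected = beta_hat(first_coord, target, source)  # target lands on source
```

In this toy setting, the projection maps every target point onto a source point, so its value of the quantity is zero, while the identity map leaves the target far from the source.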
Proof.

Assume towards a contradiction that this fails to occur, and fix a $\delta>0$ for which it fails. For $n$ sufficiently large, with probability at least $1-\frac{\delta}{2}$, the premises of Lemmas 16, 17, 19, and 20 are all simultaneously met. In particular, Lemmas 16 and 17 imply that

$$|source\_loss(\phi,S_{tr},S_{loss})-R(\mathcal{N}_{S_{tr}}^{\phi},\mathcal{D})|<O(n^{-1/3}),$$

and that one of the following two conditions holds as well:

  1. $source\_loss(\phi,S_{tr},S_{loss})>R_{s}^{*}+n^{-1/4}$.

  2. $\Pr[\alpha^{\phi}<source\_margin(\phi,S_{tr},S_{margin})]<n^{-1/4}$.

However, Lemma 20 implies that

$$source\_loss(\phi^{*},S_{tr},S_{loss})<R_{s}^{*}+n^{-1/3}.$$

Since $\hat{\phi}$ has source loss at most $n^{-1/3}$ more than the optimal, and $n^{-1/3}<n^{-1/4}$ for $n>1$, condition 1 cannot hold, so condition 2 must hold. Thus, by additionally applying Lemma 19, we see that $\hat{\phi}$ has the following properties:

  1. $\Pr[\alpha^{\hat{\phi}}<source\_margin(\hat{\phi},S_{tr},S_{margin})]<n^{-1/4}$.

  2. $\Pr[\beta_{l_{n}}^{\hat{\phi}}\geq target\_margin(\hat{\phi},S_{margin,t},U)]\leq n^{-1/6}$.

However, recall that $\hat{\phi}$ must maximize the quantity $source\_margin(\phi,S_{tr},S_{margin})-\Lambda\,target\_margin(\phi,S_{margin,t},U)$. Since this quantity is at least $\frac{\gamma^{*}}{2}$ for $\phi^{*}$ (by Lemma 20), it follows that

$$\Pr\left[\alpha^{\hat{\phi}}-\Lambda\beta_{l_{n}}^{\hat{\phi}}<\frac{\gamma^{*}}{2}\right]<O(n^{-1/6}). \tag{4}$$

In particular, this equation holds with probability at least $1-\frac{\delta}{2}$. However, by our assumption, for arbitrarily large values of $n$, $\hat{\phi}$ fails to have the desired properties with probability $\delta$.

Thus with probability at least $\frac{\delta}{2}$, for arbitrarily large values of $n_{i}$, there exists $\hat{\phi}_{n_{i}}$ such that Equation 4 holds but for which the desired property fails.

Let $n_{1},n_{2},\dots$ be any subsequence of integers such that the corresponding feature maps $\hat{\phi}_{n_{i}}$ converge to some $\phi\in\Phi$. Note that such a subsequence exists because $\Phi$ is compact.

The key observation is that because $\alpha^{\phi}$ and $\beta_{l_{n}}^{\phi}$ are both Lipschitz with respect to $\phi$ (clearly small changes in the feature map cannot change these variables much), it follows that for all $\gamma>0$, there exists $j$ such that for all $i>j$,

$$\Pr\left[\alpha^{\phi}-\Lambda\beta_{l_{n_{i}}}^{\phi}<\frac{\gamma^{*}}{2}-\gamma\right]<O(n_{i}^{-1/6}). \tag{5}$$

However, since $\beta_{t}^{\phi}\to\beta^{\phi}$ in distribution (Lemma 12), it follows that for $i$ sufficiently large,

$$\Pr\left[\alpha^{\phi}-\Lambda\beta^{\phi}<\frac{\gamma^{*}}{2}-2\gamma\right]<O(n_{i}^{-1/6}). \tag{6}$$

Since $n_{i}$ is arbitrarily large, observe that this implies that

$$(\alpha^{\phi})-\Lambda\max(\beta^{\phi})\geq\frac{\gamma^{*}}{2}-2\gamma.$$

Finally, since $\hat{\phi}$ gets arbitrarily close to $\phi$, it follows that

$$(\alpha^{\hat{\phi}})-\Lambda\max(\beta^{\hat{\phi}})\geq\frac{\gamma^{*}}{2}-3\gamma.$$

Taking $\gamma=\frac{\gamma^{*}}{12}$, and noting the definitions of $\alpha^{\hat{\phi}}$ and $\beta^{\hat{\phi}}$, it follows that $\hat{\phi}$ precisely fulfills the conditions given in the statement of the lemma, and this completes the proof.
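For reference, the arithmetic behind the choice $\gamma=\frac{\gamma^{*}}{12}$ can be written out as follows; it shows that the final lower bound is exactly the constant $\frac{\gamma^{*}}{4}$ appearing in the lemma statement.

```latex
\[
\frac{\gamma^{*}}{2}-3\gamma
  =\frac{\gamma^{*}}{2}-3\cdot\frac{\gamma^{*}}{12}
  =\frac{\gamma^{*}}{2}-\frac{\gamma^{*}}{4}
  =\frac{\gamma^{*}}{4}.
\]
```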

We now prove Theorem 5.

Proof.

Fix $\epsilon,\delta>0$. The previous Lemma implies that for sufficiently large values of $n$, with probability $1-\frac{\delta}{2}$ we will select some $\hat{\phi}$ that has margin at least $\frac{\gamma^{*}}{4}$. We now use an argument identical to the argument given for Theorem 3 and conclude the proof. ∎

Appendix J Proof of Theorem 6

Proof.

For $\phi\in\Phi$ such that $\phi$ source-preserves $\mathcal{D}_{s}$, define $g^{\phi}$ as the classifier over $\mathcal{X}$ defined by

$$g^{\phi}(x)=g_{\mathcal{D}_{s}^{\phi}}\left(\operatorname*{arg\,min}_{z\in\textnormal{supp}(\mathcal{D}_{s}^{\phi})}d_{\mathcal{Z}}(z,\phi(x))\right),$$

with ties being broken arbitrarily.
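The classifier $g^{\phi}$ labels a point by the source label of its nearest neighbor in feature space. A minimal sketch, assuming a finite list of feature vectors as a stand-in for $\textnormal{supp}(\mathcal{D}_{s}^{\phi})$ and Euclidean distance for $d_{\mathcal{Z}}$; the names `make_g_phi`, `source_support`, and `source_label` are hypothetical:

```python
import math

def make_g_phi(phi, source_support, source_label):
    """Build g^phi from a feature map and a labeled finite source support.

    phi            -- hypothetical feature map, x -> tuple of floats
    source_support -- finite stand-in for supp(D_s^phi), a list of feature vectors
    source_label   -- stand-in for the source Bayes-optimal classifier on features
    """
    def g(x):
        z_x = phi(x)
        # min() returns the first minimizer, which is one arbitrary tie-breaking rule.
        nearest = min(source_support, key=lambda z: math.dist(z, z_x))
        return source_label(nearest)
    return g

def phi(x):
    """Toy feature map: keep only the first coordinate."""
    return (x[0],)

def label(z):
    """Toy labeling of source features: threshold at 0.5."""
    return int(z[0] > 0.5)

support = [(0.0,), (1.0,)]
g = make_g_phi(phi, support, label)

near_zero = g((0.2, 5.0))   # feature 0.2 is closest to the support point 0.0
near_one = g((0.9, -3.0))   # feature 0.9 is closest to the support point 1.0
```

Because the toy feature map ignores the second coordinate, points far from the source in input space still receive a well-defined label.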

Let $\phi_{1}\in(\mathcal{S}(\Phi)\cap\Phi_{con})\setminus\Phi_{relates}$ and let $\phi_{2}\in\mathcal{S}(\Phi)$. Because $\phi_{1}$ and $\phi_{2}$ both contract $(\mathcal{D}_{s},\mathcal{D}_{t})$, it follows that $g^{\phi_{1}}$ and $g^{\phi_{2}}$ are precisely well-defined over $\textnormal{supp}(\mu_{t})$. However, because $\phi_{1}\notin\Phi^{*}$, this implies that there exists $\epsilon>0$ such that

$$\mu_{t}\{x:g^{\phi_{1}}(x)\neq g^{\phi_{2}}(x)\}=\epsilon.$$

This holds from the fact that $g^{\phi_{2}}$ must match the Bayes-optimal, $g_{\mathcal{D}_{t}}$, whereas $g^{\phi_{1}}$ must fail to (otherwise $\phi_{1}$ would indeed SIRM realize on $(\mathcal{D}_{s},\mathcal{D}_{t})$).

Define

$$\eta_{t}^{i}(y|x)=\begin{cases}1 & g^{\phi_{i}}(x)=y\\ 0 & \text{otherwise}\end{cases}.$$

Essentially this is a noiseless distribution that is purely classified by $g^{\phi_{i}}$. Let $\mathcal{D}_{t}^{1}=(\mu_{t},\eta_{t}^{1})$ and $\mathcal{D}_{t}^{2}=(\mu_{t},\eta_{t}^{2})$. The key observation is that if we randomly select $\mathcal{D}_{t}^{\prime}\sim\{\mathcal{D}_{t}^{1},\mathcal{D}_{t}^{2}\}$ and then apply our learning algorithm to $(\mathcal{D}_{s},\mathcal{D}_{t}^{\prime})$, then our learner must incur expected risk at least $\frac{\epsilon}{2}$.
This is because whatever it outputs, it has a 50-50 chance of misclassifying the instances on which $g^{\phi_{1}}$ and $g^{\phi_{2}}$ disagree. Thus our learner has expected risk at least $\frac{\epsilon}{2}$. Since $\epsilon>0$ is fixed, this implies the desired result. ∎
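The averaging step can be checked by brute force on a small finite domain. The sketch below assumes the learner's output does not depend on which of the two targets was drawn (our reading of "whatever it outputs"), and uses a hypothetical three-point domain with binary labels; `risk` and `min_average_risk` are illustrative names.

```python
from itertools import product

def risk(h, g, mass):
    """Probability under `mass` that classifier h disagrees with labeling g."""
    return sum(m for x, m in mass.items() if h[x] != g[x])

def min_average_risk(g1, g2, mass, labels=(0, 1)):
    """Smallest value of (risk vs g1 + risk vs g2) / 2 over all fixed classifiers h."""
    points = list(mass)
    best = float("inf")
    for assignment in product(labels, repeat=len(points)):
        h = dict(zip(points, assignment))
        best = min(best, 0.5 * (risk(h, g1, mass) + risk(h, g2, mass)))
    return best

# Toy target marginal; the two labelings disagree only on "c", so epsilon = 0.25.
mass = {"a": 0.5, "b": 0.25, "c": 0.25}
g1 = {"a": 0, "b": 1, "c": 0}
g2 = {"a": 0, "b": 1, "c": 1}

epsilon = sum(m for x, m in mass.items() if g1[x] != g2[x])
lower_bound = min_average_risk(g1, g2, mass)
```

On every point where the two labelings disagree, a fixed classifier matches at most one of them, so the two risks sum to at least $\epsilon$ and their average is at least $\frac{\epsilon}{2}$; in this toy case the bound is attained.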

Appendix K Bounds on the Distance Dimension and Transfer Learning Guarantees

K.1 Proof of Theorem 2

Proof.

(Theorem 2) We will construct two maps, $\alpha:Lin_{D,D}\to\mathbb{R}^{D^{2}}$ and $\beta:(\mathbb{R}^{D})^{4}\to\mathbb{R}^{D^{2}}$, such that for any $\phi\in Lin_{D,D}$ and $x_{1},x_{2},x_{3},x_{4}\in\mathbb{R}^{D}$,

$$\Delta\phi(x_{1},x_{2},x_{3},x_{4})=sgn\left(\langle\alpha(\phi),\beta(x_{1},x_{2},x_{3},x_{4})\rangle\right).$$

This will immediately imply the result, as it is well known that linear classifiers (through the origin) over $\mathbb{R}^{n}$ have VC dimension $n$.

Letting Aϕsubscript𝐴italic-ϕA_{\phi}italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT be the D×D𝐷𝐷D\times Ditalic_D × italic_D matrix associated with ϕitalic-ϕ\phiitalic_ϕ, we have

distϕ(x1,x2,x3,x4)=sgn(d(ϕ(x1),ϕ(x2))2d(ϕ(x3),ϕ(x4))2)=sgn((x1x2)tAϕtAϕ(x1x2)(x3x4)tAϕtAϕ(x3x4))=sgn(AϕtAϕ,(x1x2)(x1x2)tAϕtAϕ,(x3x4)(x3x4)t)=sgn(AϕtAϕ,(x1x2)(x1x2)t(x3x4)(x3x4)t)𝑑𝑖𝑠𝑡italic-ϕsubscript𝑥1subscript𝑥2subscript𝑥3subscript𝑥4𝑠𝑔𝑛𝑑superscriptitalic-ϕsubscript𝑥1italic-ϕsubscript𝑥22𝑑superscriptitalic-ϕsubscript𝑥3italic-ϕsubscript𝑥42𝑠𝑔𝑛superscriptsubscript𝑥1subscript𝑥2𝑡superscriptsubscript𝐴italic-ϕ𝑡subscript𝐴italic-ϕsubscript𝑥1subscript𝑥2superscriptsubscript𝑥3subscript𝑥4𝑡superscriptsubscript𝐴italic-ϕ𝑡subscript𝐴italic-ϕsubscript𝑥3subscript𝑥4𝑠𝑔𝑛superscriptsubscript𝐴italic-ϕ𝑡subscript𝐴italic-ϕsubscript𝑥1subscript𝑥2superscriptsubscript𝑥1subscript𝑥2𝑡superscriptsubscript𝐴italic-ϕ𝑡subscript𝐴italic-ϕsubscript𝑥3subscript𝑥4superscriptsubscript𝑥3subscript𝑥4𝑡𝑠𝑔𝑛superscriptsubscript𝐴italic-ϕ𝑡subscript𝐴italic-ϕsubscript𝑥1subscript𝑥2superscriptsubscript𝑥1subscript𝑥2𝑡subscript𝑥3subscript𝑥4superscriptsubscript𝑥3subscript𝑥4𝑡\begin{split}dist\phi(x_{1},x_{2},x_{3},x_{4})&=sgn\left(d(\phi(x_{1}),\phi(x_% {2}))^{2}-d(\phi(x_{3}),\phi(x_{4}))^{2}\right)\\ &=sgn\left((x_{1}-x_{2})^{t}A_{\phi}^{t}A_{\phi}(x_{1}-x_{2})-(x_{3}-x_{4})^{t% }A_{\phi}^{t}A_{\phi}(x_{3}-x_{4})\right)\\ &=sgn\left(\langle A_{\phi}^{t}A_{\phi},(x_{1}-x_{2})(x_{1}-x_{2})^{t}\rangle-% \langle A_{\phi}^{t}A_{\phi},(x_{3}-x_{4})(x_{3}-x_{4})^{t}\rangle\right)\\ &=sgn\left(\langle A_{\phi}^{t}A_{\phi},(x_{1}-x_{2})(x_{1}-x_{2})^{t}-(x_{3}-% x_{4})(x_{3}-x_{4})^{t}\rangle\right)\end{split}start_ROW start_CELL italic_d italic_i italic_s italic_t italic_ϕ ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ) end_CELL start_CELL = italic_s italic_g italic_n ( italic_d ( italic_ϕ ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , italic_ϕ ( italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - italic_d ( italic_ϕ ( 
italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) , italic_ϕ ( italic_x start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = italic_s italic_g italic_n ( ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) - ( italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ) ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = italic_s italic_g italic_n ( ⟨ italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT , ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⟩ - ⟨ italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT , ( italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ) ( italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ) 
start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⟩ ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = italic_s italic_g italic_n ( ⟨ italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT , ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - ( italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ) ( italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⟩ ) end_CELL end_ROW

Thus, letting $\alpha(\phi)=A_{\phi}^{t}A_{\phi}$ (cast as a vector in $\mathbb{R}^{D^{2}}$) and $\beta(x_{1},x_{2},x_{3},x_{4})=(x_{1}-x_{2})(x_{1}-x_{2})^{t}-(x_{3}-x_{4})(x_{3}-x_{4})^{t}$ suffices. ∎
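As a numerical sanity check of this linearization, the sketch below compares the direct squared-distance gap with the inner product $\langle\alpha(\phi),\beta(x_{1},x_{2},x_{3},x_{4})\rangle$. It assumes, as the displayed derivation suggests, that $\phi(x)=A_{\phi}x$ is linear and that $d$ is the Euclidean distance; the function names `direct_gap` and `linear_gap` and the dimensions used are illustrative choices, not part of the paper.

```python
import numpy as np

def direct_gap(A, x1, x2, x3, x4):
    """d(phi(x1), phi(x2))^2 - d(phi(x3), phi(x4))^2 for phi(x) = A x (assumed linear)."""
    return np.sum((A @ (x1 - x2)) ** 2) - np.sum((A @ (x3 - x4)) ** 2)

def linear_gap(A, x1, x2, x3, x4):
    """The same quantity written as <alpha(phi), beta(x1, x2, x3, x4)> in R^{D^2}."""
    alpha = (A.T @ A).ravel()                        # alpha(phi) = A^t A, flattened
    u, w = x1 - x2, x3 - x4
    beta = (np.outer(u, u) - np.outer(w, w)).ravel() # beta = u u^t - w w^t, flattened
    return alpha @ beta

rng = np.random.default_rng(0)
D, d = 5, 3  # hypothetical input and feature dimensions
trials = []
for _ in range(1000):
    A = rng.normal(size=(d, D))
    xs = rng.normal(size=(4, D))
    trials.append(np.sign(direct_gap(A, *xs)) == np.sign(linear_gap(A, *xs)))
all_agree = all(trials)
print(all_agree)
```

Because the comparison is a linear threshold function of $\beta$ with weight vector $\alpha(\phi)$, the sign computed either way should coincide up to floating-point rounding.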

K.2 Proof of Theorem 7

Proof.

(Theorem 7) Suppose $\phi^{*}\in\Phi$ realizes the Statistical IRM assumption for $\mathcal{D}_{s},\mathcal{D}_{t}$. Define

$$E_{1}=\mathbbm{1}\left(R(\mathcal{N}_{S}^{\phi^{*}},\mathcal{D}_{t})-R^{*}_{t}<\frac{\epsilon}{2}\right),$$

and

$$E_{2}=\mathbbm{1}\left(\sup_{\phi\in\Phi}\left|R(\mathcal{N}^{\phi}_{S},\mathcal{D}_{t})-\frac{1}{m}\sum_{(x,y)\in T}\mathbbm{1}\left(\mathcal{N}^{\phi}_{S}(x)\neq y\right)\right|<\frac{\epsilon}{2}\right).$$

$E_{1}$ is thus the event that the source sample $S\sim\mathcal{D}_{s}^{n}$ gives rise to a $k_{n}$-nearest neighbors classifier which has small excess risk on $\mathcal{D}_{t}$ when composed with the realizing projection $\phi^{*}$. $E_{2}$ is the event that the empirical risks of $k_{n}$-nearest neighbors classifiers composed with feature maps $\phi\in\Phi$ are uniformly representative of their true risks on the target $\mathcal{D}_{t}$.

Our goal is to show that $E_{1}$ and $E_{2}$ jointly hold with probability at least $1-\delta$, as this would imply that our learned classifier has risk at most $R^{*}_{t}+\epsilon$, as desired. By Lemma 15, $E_{1}$ holds with probability at least $1-\frac{\delta}{2}$, so it suffices to show that $E_{2}$ holds with probability at least $1-\frac{\delta}{2}$ as well.
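The reduction to bounding each event separately is a union bound over the two failure events; spelled out as a short worked step (not in the original text):

$$\Pr[E_{1}=1\text{ and }E_{2}=1]\;\geq\;1-\Pr[E_{1}=0]-\Pr[E_{2}=0]\;\geq\;1-\frac{\delta}{2}-\frac{\delta}{2}\;=\;1-\delta.$$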

Fix any set of $n$ points, $\hat{S}$. It suffices to show that $\Pr_{T\sim\mathcal{D}_{t}^{m}}[E_{2}=1\,|\,S=\hat{S}]\geq 1-\frac{\delta}{2}$, as integrating over all possibilities of $S$ would give the desired result. Consider the hypothesis class $\mathcal{H}_{\hat{S}}=\{h_{\phi}:\phi\in\Phi\}$, where $h_{\phi}:\mathcal{X}\times\mathcal{Y}\to\{0,1\}$ is defined as

$$h_{\phi}(x,y)=\mathbbm{1}\left(\mathcal{N}^{\phi}_{\hat{S}}(x)\neq y\right).$$

Observe that each $h\in\mathcal{H}_{\hat{S}}$ is a binary classifier over its domain $\mathcal{X}\times\mathcal{Y}$. It follows that given $S=\hat{S}$,

$$\begin{split}E_{2}&=\mathbbm{1}\left(\sup_{\phi\in\Phi}\left|R(\mathcal{N}^{\phi}_{S},\mathcal{D}_{t})-\frac{1}{m}\sum_{(x,y)\in T}\mathbbm{1}\left(\mathcal{N}^{\phi}_{S}(x)\neq y\right)\right|<\frac{\epsilon}{2}\right)\\ &=\mathbbm{1}\left(\sup_{h_{\phi}\in\mathcal{H}_{\hat{S}}}\left|\mathbb{E}_{(x,y)\sim\mathcal{D}_{t}}[h_{\phi}(x,y)]-\frac{1}{m}\sum_{(x,y)\in T}h_{\phi}(x,y)\right|<\frac{\epsilon}{2}\right).\end{split}$$

To analyze the latter quantity, it suffices to show that $vc(\mathcal{H}_{\hat{S}})\leq O\left(\partial(\Phi)\log\left(n+\partial(\Phi)\right)\right)$, as a standard application of the fundamental theorem of statistical learning [29] implies that $E_{2}$ holds with probability at least $1-\frac{\delta}{2}$ provided that $m\geq\Omega\left(\frac{vc(\mathcal{H}_{\hat{S}})+\ln\frac{1}{\delta}}{\epsilon^{2}}\right)$.

To this end, suppose $\mathcal{H}_{\hat{S}}$ shatters a set of $v$ points $V\subset\mathcal{X}\times\mathcal{Y}$. Let $V=\{(x_{1},y_{1}),\dots,(x_{v},y_{v})\}$, and let $\hat{S}=\{(x_{1}^{\prime},y_{1}^{\prime}),\dots,(x_{n}^{\prime},y_{n}^{\prime})\}$. The key observation is that for any $h_{\phi}\in\mathcal{H}_{\hat{S}}$, the way $h_{\phi}$ labels a given point $(x,y)\in V$ is determined by the $k_{n}$-nearest neighbors of $\phi(x)$ in $\{\phi(x^{\prime}_{1}),\dots,\phi(x^{\prime}_{n})\}$. Furthermore, these labels are fully determined by the set of all $\binom{n}{2}$ comparisons,

$$\mathcal{C}_{\phi,x}=\left\{\mathbbm{1}\left(d(\phi(x),\phi(x^{\prime}_{i}))\geq d(\phi(x),\phi(x^{\prime}_{j}))\right):1\leq i<j\leq n\right\}.$$
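The claim that the comparison bits alone pin down the nearest-neighbor prediction can be checked with a small sketch. It assumes Euclidean distance, binary labels in $\{0,1\}$, an odd $k$, and no distance ties; the helper names (`comparison_bits`, `knn_from_bits`, `knn_direct`) and the linear map used for $\phi$ are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def distances(phi, x, train_x):
    """Euclidean distances from phi(x) to each phi(x'_i)."""
    q = phi(x)
    return np.array([np.linalg.norm(q - phi(xp)) for xp in train_x])

def comparison_bits(phi, x, train_x):
    """C_{phi,x}: bit (i, j) is 1 iff d(phi(x), phi(x'_i)) >= d(phi(x), phi(x'_j)), for i < j."""
    dist = distances(phi, x, train_x)
    n = len(train_x)
    return {(i, j): int(dist[i] >= dist[j]) for i in range(n) for j in range(i + 1, n)}

def knn_from_bits(bits, train_y, k):
    """Majority vote of the k nearest neighbors, recovered from the comparison bits only."""
    closer_count = np.zeros(len(train_y), dtype=int)  # how many points beat point i
    for (i, j), bit in bits.items():
        if bit:
            closer_count[i] += 1  # x'_j is at least as close as x'_i
        else:
            closer_count[j] += 1  # x'_i is strictly closer than x'_j
    nearest = np.argsort(closer_count, kind="stable")[:k]
    return int(2 * train_y[nearest].sum() > k)

def knn_direct(phi, x, train_x, train_y, k):
    """Reference k-NN majority vote computed from the distances themselves."""
    nearest = np.argsort(distances(phi, x, train_x))[:k]
    return int(2 * train_y[nearest].sum() > k)

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 4))          # hypothetical linear feature map phi(x) = A x
phi = lambda z: A @ z
train_x = rng.normal(size=(30, 4))
train_y = rng.integers(0, 2, size=30)
queries = rng.normal(size=(50, 4))
same = all(
    knn_from_bits(comparison_bits(phi, q, train_x), train_y, 5)
    == knn_direct(phi, q, train_x, train_y, 5)
    for q in queries
)
print(same)
```

In this sketch, counting how many comparisons each training point loses recovers its distance rank, which is all the $k$-nearest neighbors vote needs.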

Note that $\mathcal{C}_{\phi,x}$ is a set of induced distance comparers (Definition 5). It follows that the number of distinct ways that $\mathcal{H}_{\hat{S}}$ can label $V$ is at most the number of ways $\Delta\Phi$ can label all $v\binom{n}{2}$ possible comparisons arising from $x_{i}\in V$ and $x^{\prime}_{j},x^{\prime}_{k}\in\hat{S}$ with $j<k$. Thus, by the definition of the distance dimension $\partial(\Phi)$ and Sauer’s Lemma, the number of ways $\mathcal{H}_{\hat{S}}$ can label $V$ is at most

$$\left(ev\binom{n}{2}\right)^{\partial(\Phi)}.$$

At the same time, because $\mathcal{H}_{\hat{S}}$ shatters $V$, there exist precisely $2^{v}$ such labelings. Thus, we have

$$v\leq\partial(\Phi)\log\left(ev\binom{n}{2}\right).$$

From here, straightforward algebra implies that $v=O\left(\partial(\Phi)\log\left(n+\partial(\Phi)\right)\right)$, as desired.
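One way to carry out this algebra is sketched below. It assumes the logarithm is base 2 (as the count $2^{v}$ suggests), $v\geq 1$, $n\geq 2$, and writes $d=\partial(\Phi)\geq 1$ and $N=\binom{n}{2}\leq\frac{n^{2}}{2}$; the constants are not optimized and are not taken from the paper. The only tool used is $\ln x\leq x-1$, which gives $\ln v\leq\frac{v}{a}+\ln a-1$ for every $a>0$.

$$\begin{aligned}
d\log_{2}v&\leq\frac{v}{2}+d\log_{2}\frac{2d}{e\ln 2}&&\text{(taking }a=\tfrac{2d}{\ln 2}\text{)},\\
v&\leq d\log_{2}v+d\log_{2}(eN)\leq\frac{v}{2}+d\log_{2}\frac{2dN}{\ln 2},\\
v&\leq 2d\log_{2}\frac{2dN}{\ln 2}\leq 2d\log_{2}\frac{dn^{2}}{\ln 2}\leq 2d\left(2\log_{2}n+\log_{2}d+1\right).
\end{aligned}$$

Since $2\log_{2}n+\log_{2}d+1\leq 4\log_{2}(n+d)$ whenever $n+d\geq 2$, this yields $v=O\left(d\log(n+d)\right)$.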