Proxy Methods for Domain Adaptation

Katherine Tsai$^{1}$, Stephen R. Pfohl$^{2}$, Olawale Salaudeen$^{1}$, Nicole Chiou$^{3}$, Matt J. Kusner$^{4}$, Alexander D’Amour$^{5}$, Sanmi Koyejo$^{3,5}$, Arthur Gretton$^{5,6}$

$^{1}$University of Illinois Urbana-Champaign, $^{2}$Google Research, $^{3}$Stanford University, $^{4}$University College London, $^{5}$Google DeepMind, $^{6}$Gatsby Computational Neuroscience Unit

Abstract

We study the problem of domain adaptation under distribution shift, where the shift is due to a change in the distribution of an unobserved, latent variable that confounds both the covariates and the labels. In this setting, neither the covariate shift nor the label shift assumptions apply. Our approach to adaptation employs proximal causal learning, a technique for estimating causal effects in settings where proxies of unobserved confounders are available. We demonstrate that proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables. We consider two settings, (i) Concept Bottleneck: an additional “concept” variable is observed that mediates the relationship between the covariates and labels; (ii) Multi-domain: training data from multiple source domains is available, where each source domain exhibits a different distribution over the latent confounder. We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings. In our experiments, we show that our approach outperforms other methods, notably those which explicitly recover the latent confounder.

1 Introduction

The goal of domain adaptation is to transfer an accurate model from a labeled source domain to an unlabeled target domain, which has a different but related distribution (pan2010domain; koh2021wilds; malinin2021shifts). It is motivated by the fact that labeling data is often labor intensive, and sometimes requires domain expertise. For example, the distribution of patients diagnosed with a condition from hospital $A$ and hospital $B$ may differ due to patients’ socioeconomic status, demographics, and other factors. However, labeled data might only be available at hospital $A$ and not at hospital $B$ (e.g., due to less funding). As a result, an accurate model for patients from hospital $A$ may perform poorly for patients from hospital $B$.

In order to provide guarantees on the accuracy of a transferred model, one of two classical assumptions is typically made: label shift or covariate shift. Label shift (buck1966comparison; lipton2018detecting) assumes that the distribution of a label $P(Y)$ shifts between source and target domains, but the conditional distribution $P(X\mid Y)$ does not. Conversely, covariate shift (shimodaira2000improving) assumes that the covariate distribution $P(X)$ shifts between domains, but the distribution $P(Y\mid X)$ stays the same. Each assumption provides theoretical guarantees on the generalization of a transferred classifier. In fact, without any assumptions, the source and target domains could differ arbitrarily, making guarantees impossible. However, these assumptions are often too restrictive to apply in real-world settings (zhang2015multi; schrouff2022diagnosing). For instance, if covariates $X$ and labels $Y$ are confounded by a third variable $U$, it is possible for neither $P(X\mid Y)$ nor $P(Y\mid X)$ to be equal across domains. For example, demographic information $U$ could confound the relationship between a diagnosis $Y$ and a radiological image $X$. In this example, if two hospitals have different distributions over demographics, both label shift and covariate shift adaptation methods will fail to transfer a classifier across hospitals.

To address this, recent work has introduced a latent shift assumption: the distribution of $U$ , an unobserved latent confounder of $X$ and $Y$ , shifts between the source and target domain (alabdulmohsin2023adapting). In this setting, all distributions of $X$ and $Y$ (without conditioning on $U$ ) may differ across the domains, violating label and covariate shift assumptions.

Contributions. We propose techniques for domain adaptation under the latent shift assumption that are guaranteed to identify the optimal predictor $\mathbb{E}[Y\mid x]$ in the target domain. We make use of proxy methods (miao2018identifying), which are a recently developed framework for causal effect estimation in the presence of a hidden confounder $U$ , given indirect proxy information on $U$ . Compared to prior work (alabdulmohsin2023adapting), our techniques do not require: identifying the distribution of the latent variable $U$ , that $U$ be discrete, or further linear independence assumptions. We consider two settings: (1) Concept Bottleneck: we observe in both domains a proxy $W$ of the unobserved confounder $U$ and a concept $C$ that mediates the direct relationship between $X$ and $Y$ (alabdulmohsin2023adapting), or (2) Multi-Domain: we do not observe $C$ in either domain, but have access to observations from multiple source domains. For both settings, we provide guarantees for identifying $\mathbb{E}[Y\mid x]$ without observing $Y$ in the target domain. When $\mathbb{E}[Y\mid x]$ is identifiable, we develop practical two-stage kernel estimators to perform adaptation.

2 Related Work

The development of techniques for learning robust models and adapting to distribution shift has a long history in machine learning, but recently has received increased attention (shen2021towards; zhou2022domain; wang2022generalizing).

Causality for domain adaptation. Our work is inspired by techniques that formulate the covariate/label shift settings as assumptions on the causal structure for domain adaptation and distributional robustness (e.g., scholkopf2012causal; peters2015causal; zhang2015multi; subbaswamy2019preventing; rothenhausler2021anchor; veitch2021counterfactual; magliacane2018domain; arjovsky2019invariant; ganin2016domain; ben2010theory; oberst2021regularizing).

Proximal causal inference. Our identification technique is inspired by approaches used to identify causal effects with unobserved confounding with observed proxies (kuroki2014measurement; miao2018identifying; deaner2018proxy; tchetgen2020introduction; mastouri2021proximal; cui2023semiparametric; xu2023kernel). These approaches design ‘bridge functions’ to connect quantities involving a proxy $W$ with those of the label $Y$ . The beauty of this approach is that these bridge functions are implicitly a marginalization over $U$ . This allows these approaches to identify causal quantities without identifying distributions involving $U$ .

Latent shift. Our work is most closely related to alabdulmohsin2023adapting, who introduced the setting of latent shift with proxies $W$ and concepts $C$ . They showed that the optimal predictor $\mathbb{E}[Y\mid x]$ is identifiable in the target domain if $W$ and $C$ are observed in the source domain and $X$ is observed in the target domain. To do so, they required (a) identification of distributions involving $U$ , (b) that $U$ is a discrete variable, (c) knowledge of the dimensionality of $U$ , and (d) additional linear independence assumptions. In contrast, our work derives identification results for arbitrary $U$ , and does not require any of (a)-(d). However, there is no free lunch: to achieve this, we require that proxies $W$ are observed in the target, and either that: (i) concepts $C$ are also observed in the target, or (ii) we observe multiple source domains. For (ii) we do not require $C$ in either the source or the target, but for full identification we require that $U$ is discrete.

3 Problem Framework

(a) Covariate shift

(b) Label shift

(d) Multi-Domain shift

Figure 1: Causal diagrams. Shaded circles denote unobserved variables and solid circles denote observed variables. $X$ is the covariate, $Y$ is the response, $C$ is the concept, $W$ is the proxy, $Z$ is the domain-related variable, and $U$ is the latent variable.

Let $P(\cdot)$ and $Q(\cdot)$ denote the probability distribution functions of the source domain and target domain, respectively. Let $p$ and $q$ indicate source and target quantities. Our goal is to study identification and estimation of the optimal target predictor $\mathbb{E}_{q}[Y\mid x]$ when $Y$ is not observed in the target domain.

Concept Bottleneck. The first setting we study is described by the graph in Figure 1(c). We have two additional variables: (i) proxies $W$, which provide auxiliary information about $U$, or can be seen as a noisy version of it (kuroki2014measurement), and (ii) concepts $C$, which mediate or ‘bottleneck’ the relationship between the covariates $X$ and labels $Y$ (goyal2019explaining; koh2020concept). For example, koh2020concept describe a setting where the concepts $C$ are high-level clinical and morphological features of a knee X-ray $X$, which mediate the relationship with osteoporosis severity $Y$. In this example, $U$ could describe demographic variations that alter symptoms $X,C$ and outcome $Y$, and the proxies $W$ could include patient background and clinical history (e.g., prior diagnoses, medications, procedures, etc.). For the source domain we assume we observe $(X,C,W,Y)\!\sim\!P$ and for the target domain we observe $(X,C,W)\!\sim\!Q$.

We formalize the notion of latent shift, as introduced in alabdulmohsin2023adapting.

Assumption 1 (Concept Bottleneck, alabdulmohsin2023adapting).

The shift between $P$ and $Q$ is located in unobserved $U$, i.e., there is a latent shift $P(U)\neq Q(U)$, but $P(V\mid U)=Q(V\mid U)$, where $V\subseteq\{W,X,C,Y\}$.

This assumption states that every variable conditioned on $U$ is invariant across domains. However, as $P(U)\!\neq\!Q(U)$, the marginal distributions are not: $P(V)\!\neq\!Q(V)$ for $V\!\subseteq\!\{W,X,C,Y\}$. This assumption is a generalization of covariate shift $P(Y\mid X,U)\!=\!Q(Y\mid X,U)$ (shimodaira2000improving) and label shift $P(X\mid Y,U)\!=\!Q(X\mid Y,U)$ (buck1966comparison), with associated graphs in Figure 1(a)–1(b).

Assumption 2 (Structural assumption).

Graphs in Figure 1 are faithful and Markov (spirtes2000causation).

Under Assumption 2, we have the following conditional independence properties for the graph in Figure 1(c):

Y\perp\!\!\!\perp X\mid\{U,C\},\quad W\perp\!\!\!\perp\{X,C\}\mid U.

With this conditional independence structure, $\{U,C\}$ blocks the information from $X$ to $Y$ and $U$ blocks the information flow from $W$ to $\{X,C\}$ . We will see in Section 4 that these assumptions allow us to obtain $Q(Y\mid x)$ from $Q(W,C\mid x)$ in the target domain, where the latter is a function of observed quantities.

Multi-domain. In the second setting, suppose we do not observe the concepts $C$ in any domain, but instead observe data from multiple source domains, according to the graph in Figure 1(d). For instance, we may want to learn a classifier for a target hospital that has only unlabelled data, using data from several source hospitals with labelled data. Here, let $Z$ be a random variable in $\mathcal{Z}$ denoting a prior over the source domains, and let $P(U|Z)$ be the distribution of $U$ given $Z$. We make $k_{Z}$ draws from $Z$, indexed by $r\in\{1,\ldots,k_{Z}\}$, and write $\{z_{1},\ldots,z_{k_{Z}}\}=:\mathcal{Z}_{p}\subseteq\mathcal{Z}$. For each source domain $z_{r}$, we observe $(X,W,Y)\!\sim\!P(X,W,Y|z_{r})\!:=\!P_{r}(X,W,Y)$. We index the target domain by $k_{Z}+1$ and observe only $(X,W)\!\sim\!P(X,W|z_{k_{Z}+1})\!:=\!Q(X,W)$. In general, let $P_{r}(V)\!:=\!P(V|z_{r})$ and $Q(V)\!:=\!P(V|z_{k_{Z}+1})$ for any $V\subseteq\{W,X,Y,U\}$. For this setting we replace Assumption 1 with the following shift assumption.

Assumption 3 (Multi-Domain).

For each $z,z^{\prime}\in\mathcal{Z}_{p}$ such that $z\neq z^{\prime}$, we have $P(U|z)\neq P(U|z^{\prime})\neq Q(U)$.

Note that Assumption 2 implies the following conditional independence property for the graph in Figure 1(d):

\{Y,X,W\}\perp\!\!\!\perp Z\mid U.

Under Assumption 3, we allow all joint distributions to be different

P(W,X,U,Y|z)\neq P(W,X,U,Y|z^{\prime})\neq Q(W,X,U,Y)

for $z\neq z^{\prime}\in\mathcal{Z}_{p}$ .

4 Identification under Latent Shifts

Our identification techniques are inspired by proximal causal inference (tchetgen2020introduction). The key idea is to design so-called “bridge” functions to identify distributions confounded by unobserved variables. We first show that with additional proxies and concepts, $\mathbb{E}_{q}[Y\mid x]$ is identifiable under any latent shift.

4.1 Identification with Concepts

To prove identifiability, we need certain assumptions to hold for the shift. The first is a regularity assumption, also known as a completeness condition, and is commonly used to identify causal estimands (d2011completeness; miao2018identifying).

Assumption 4 (Informative variables).

Let $g$ be any mean squared integrable function. Both the source domain and the target domain, $(f,F)\in\{(p,P),(q,Q)\}$ , satisfy $\mathbb{E}_{f}[g(U)\mid x,c]=0$ for all $x\in\mathcal{X},c\in\mathcal{C}$ if and only if $g(U)=0$ almost surely with respect to $F(U)$ .

At a high level, completeness states that $X$ must have sufficient variability related to the change of $U$. This is a common assumption made in proximal causal inference (cf. Condition (ii) in miao2018identifying and Assumption 3 in mastouri2021proximal). For more details on the justification of the completeness assumption, see the supplementary material of miao2022identifying.

Second, we need a guarantee on the support of $u\in\mathcal{U}$. Intuitively, if a $u\in\mathcal{U}$ has non-zero probability in the target domain, it should have non-zero probability in the source domain as well. Otherwise, it is impossible to adjust to certain shifts (as we never see these regimes in the source domain). This is similar to the positivity assumption commonly made in the causality literature (hernan2006estimating).

Assumption 5 (Positivity).

For any $u\in\mathcal{U}$ , if $Q(u)>0$ then $P(u)>0$ .

If data are generated according to Figure 1(c), and the regularity Assumptions 8–10 hold (see Appendix A.2), miao2018identifying first showed the existence of the solutions $h_{0}^{p}(w,c),h_{0}^{q}(w,c)$ of the following equations:

\begin{aligned}
\mathbb{E}_{p}[Y\mid c,x] &= \int_{\mathcal{W}}h_{0}^{p}(w,c)\,\mathrm{d}P(w\mid c,x), \\
\mathbb{E}_{q}[Y\mid c,x] &= \int_{\mathcal{W}}h_{0}^{q}(w,c)\,\mathrm{d}Q(w\mid c,x).
\end{aligned}

(4.1)

The terms $h_{0}^{p}(w,c),h_{0}^{q}(w,c)$ are called ‘bridge’ functions as they connect the proxy $W$ to the label $Y$ . If we are able to identify $h_{0}^{q}(w,c)$ then we can identify $\mathbb{E}_{q}[Y\mid x]$ , by using eq. (4.1) to obtain $\mathbb{E}_{q}[Y\mid C,x]$ and marginalizing over $Q(C\mid x)$ .

We show that it is possible to connect identification of $h_{0}^{q}(w,c)$ with that of $h_{0}^{p}(w,c)$ , leading directly to identification of $\mathbb{E}_{q}[Y\mid x]$ .

Theorem 4.1.

Assume that $h_{0}^{p}$ and $h_{0}^{q}$ exist (i.e., regularity Assumptions 8–10 hold). Then given Assumptions 1, 2, 4, 5 we have that, for any $c\in\mathcal{C}$ ,

\int_{\mathcal{W}}h_{0}^{p}(w,c)\,\mathrm{d}P(w\mid u)=\int_{\mathcal{W}}h_{0}^{q}(w,c)\,\mathrm{d}Q(w\mid u),

almost surely with respect to $Q(U)$ . This implies that

\mathbb{E}_{q}[Y\mid x]=\int_{\mathcal{W}\times\mathcal{C}}h_{0}^{p}(w,c)\,\mathrm{d}Q(w,c\mid x).

The proof is given in Appendix B.1. Hence, given $h_{0}^{p}$ and $(W,X,C)$ from the target $Q$ , we are able to adapt to arbitrary distribution shifts in unobserved $U$ . The advantage of this approach is that it will not require estimating any distributions involving $U$ . We demonstrate this in Section 5.
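To make Theorem 4.1 concrete, the following is a minimal numerical sketch on a fully discrete toy model. All tables, category counts, and names (`P_w_u`, `h_p`, `adapted`, and so on) are hypothetical choices made for illustration and are not taken from the paper. The source bridge is obtained by solving the discrete analogue of (4.1) with a least-squares solver, and is then applied to the target quantity $Q(w,c\mid x)$ only.

```python
import numpy as np

rng = np.random.default_rng(0)
kU, kW, kX, kC = 3, 3, 5, 2  # hypothetical numbers of categories


def random_conditional(shape):
    """Random positive table normalised over axis 0 (a conditional distribution)."""
    table = rng.uniform(0.1, 1.0, size=shape)
    return table / table.sum(axis=0, keepdims=True)


# Mechanisms shared by source and target (Assumption 1).
P_w_u = 0.7 * np.eye(kW) + 0.1             # P(w | u): invertible, columns sum to 1
P_x_u = random_conditional((kX, kU))        # P(x | u)
P_c_xu = random_conditional((kC, kX, kU))   # P(c | x, u)
EY_cu = rng.normal(size=(kC, kU))           # E[Y | c, u]

# Latent shift: only the marginal of U changes between domains.
P_u = np.array([0.6, 0.3, 0.1])
Q_u = np.array([0.1, 0.3, 0.6])


def posterior_u_given_cx(prior_u):
    """F(u | c, x), shape (kC, kX, kU), for a domain with latent prior F(u)."""
    joint = P_c_xu * P_x_u[None, :, :] * prior_u[None, None, :]
    return joint / joint.sum(axis=2, keepdims=True)


def posterior_u_given_x(prior_u):
    """F(u | x), shape (kX, kU)."""
    joint = P_x_u * prior_u[None, :]
    return joint / joint.sum(axis=1, keepdims=True)


# Source domain: solve the discrete version of (4.1), one system per concept value.
post_p = posterior_u_given_cx(P_u)
h_p = np.zeros((kW, kC))
for c in range(kC):
    P_w_given_cx = post_p[c] @ P_w_u.T     # rows x, columns w: P(w | c, x)
    EY_given_cx = post_p[c] @ EY_cu[c]     # E_p[Y | c, x]
    h_p[:, c] = np.linalg.lstsq(P_w_given_cx, EY_given_cx, rcond=None)[0]

# Target domain: apply the source bridge to Q(w, c | x).
Q_u_x = posterior_u_given_x(Q_u)
Q_wc_x = np.einsum("wu,cxu,xu->wcx", P_w_u, P_c_xu, Q_u_x)
adapted = np.einsum("wc,wcx->x", h_p, Q_wc_x)

# Reference values: the true E_q[Y | x] and the unadapted E_p[Y | x].
target_truth = np.einsum("cu,cxu,xu->x", EY_cu, P_c_xu, Q_u_x)
source_only = np.einsum("cu,cxu,xu->x", EY_cu, P_c_xu, posterior_u_given_x(P_u))

print(np.max(np.abs(adapted - target_truth)))      # numerically zero
print(np.max(np.abs(source_only - target_truth)))  # generally not zero
```

In this toy model the latent tables are used only to simulate the observable conditionals; the bridge itself is computed from $P(w\mid c,x)$ and $\mathbb{E}_{p}[Y\mid c,x]$, which mirrors the claim that no distribution involving $U$ needs to be estimated.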

While concepts can ensure identifiability, they may not be available in practice. In this case, a natural question is whether the optimal target predictor $\mathbb{E}_{q}[Y\mid x]$ is still identifiable. In the next section we show that if we instead have access to data from multiple source domains, $\mathbb{E}_{q}[Y\mid x]$ may again be identifiable.

4.2 The Blessings of Multiple Domains

We now turn to the multi-domain setting. The graphical structure in Figure 1(d) is similar to the structure in Figure 1(c) with $C$ replaced by $X$, $X$ replaced by $Z$, and the arrow between $U$ and $Z$ flipped. Although the bridge function proposed by miao2018identifying assumes an edge from $U$ to $Z$, changing the direction from $Z$ to $U$ does not change the conditional independence structure (pearl2009causality). The main difference is that we will only be able to guarantee full identification when $U$ is discrete. We start by demonstrating this, and then give an example of the inherent difficulty of identification when $U$ is continuous.

To begin, for simplicity, assume $U$ and $W$ are discrete (with dimensionalities $k_{U}$ and $k_{W}$ ). We have finitely many samples from $Z$ , denoted as $z_{1},\ldots,z_{k_{Z}}$ , corresponding to our training domains. We seek a bridge function (in this case, a matrix $M_{0}(w_{i},x)$ ) satisfying

\mathbb{E}_{r}[Y\mid x]=\sum_{i=1}^{k_{W}}M_{0}(w_{i},x)P_{r}(w_{i}\mid x),

(4.2)

for all $r=1,\ldots,k_{Z}$ , where $\mathbb{E}_{r}[Y\mid x]$ is the conditional expectation obtained in domain $r$ , and $P_{r}(W\mid x)=P(W\mid x,z_{r})$ .

In order to identify $M_{0}(w_{i},x)$ , and then $\mathbb{E}_{q}[Y\mid x]$ , we need enough source domains to capture the variability of $U$ . The following result describes how many we need.

Proposition 4.2.

Suppose that we have $k_{Z}$ source domains and $W$ , $U$ have $k_{W}$ and $k_{U}$ categories respectively. Then, if $k_{W},k_{Z}\geq k_{U}$ and subject to appropriate rank conditions (see proof in Appendix B.2), the bridge function is identifiable and does not depend on the specific $z$ .

This result generalizes the identification analysis developed in miao2018identifying. If the number of observed source domains $k_{Z}$ is at least the number of categories $k_{U}$ of the latent $U$, then subject to appropriate identifiability requirements (detailed in Appendix B.2), we can recover the bridge $M_{0}(w_{i},x)$.
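As an illustration of Proposition 4.2, the sketch below solves the linear system (4.2) in a toy model with $k_{U}=k_{W}=k_{Z}=3$ and then applies the recovered matrix to a target domain. The probability tables and all names are hypothetical, and the source priors are chosen to be linearly independent so that the rank conditions hold by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
kU = kW = kZ = 3   # hypothetical: as many proxy categories and domains as latent states
kX = 4

# Mechanisms shared across all domains.
P_w_u = 0.7 * np.eye(kW) + 0.1                 # P(w | u), columns sum to 1
P_x_u = rng.uniform(0.1, 1.0, size=(kX, kU))
P_x_u /= P_x_u.sum(axis=0, keepdims=True)      # P(x | u)
EY_xu = rng.normal(size=(kX, kU))              # E[Y | x, u]

# Row r is P(u | z_r); rows are linearly independent (Assumption 3 plus rank condition).
source_priors = 0.7 * np.eye(kZ) + 0.1
target_prior = np.array([0.02, 0.08, 0.90])    # Q(u)


def posterior_u_given_x(prior_u):
    """F(u | x), shape (kX, kU), for a domain with latent prior F(u)."""
    joint = P_x_u * prior_u[None, :]
    return joint / joint.sum(axis=1, keepdims=True)


# Observable source quantities: P_r(w | x) and E_r[Y | x] for each domain r.
post_src = np.stack([posterior_u_given_x(p) for p in source_priors])  # (kZ, kX, kU)
P_r_w_x = np.einsum("wu,rxu->rxw", P_w_u, post_src)
E_r_y_x = np.einsum("xu,rxu->rx", EY_xu, post_src)

# Solve (4.2) separately for every x: kZ equations in kW unknowns.
M0 = np.zeros((kW, kX))
for x in range(kX):
    M0[:, x] = np.linalg.solve(P_r_w_x[:, x, :], E_r_y_x[:, x])

# Target domain: only Q(w | x) is needed.
post_tgt = posterior_u_given_x(target_prior)
Q_w_x = post_tgt @ P_w_u.T                     # (kX, kW)
adapted = np.einsum("wx,xw->x", M0, Q_w_x)

target_truth = np.sum(EY_xu * post_tgt, axis=1)   # E_q[Y | x]
average_of_sources = E_r_y_x.mean(axis=0)         # a naive baseline for comparison

print(np.max(np.abs(adapted - target_truth)))             # numerically zero
print(np.max(np.abs(average_of_sources - target_truth)))  # generally not zero
```

With fewer than $k_{U}$ linearly independent source domains the system above would be underdetermined, which is the discrete counterpart of the requirement $k_{Z}\geq k_{U}$.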

Now, consider the case where $U$ is discrete but all observed variables $W,X,Y$ are continuous. In this case we have the following system

\mathbb{E}_{r}[Y\mid x]=\int_{\mathcal{W}}m_{0}(w,x)\mathrm{d}P_{r}(w\mid x),

(4.3)

for $r=1,\ldots,k_{Z}$ . The proof of existence of $m_{0}$ is a modification of Proposition A.2, as shown in Proposition A.3. In order to identify target $\mathbb{E}_{q}[Y\mid x]$ , we need the following assumption.

Assumption 6.

Let $g$ be a square integrable function on $U$ . For each $x\in\mathcal{X}$ and for all $z\in\mathcal{Z}_{p}$ , $\mathbb{E}[g(U)\mid x,z]=0$ if and only if $g(U)=0$ , $P(U)$ almost surely.

Given this assumption we can prove identifiability.

Proposition 4.3.

Suppose that Assumptions 1–3, 6 hold, that $m_{0}$ exists, that $(W,X,Y)$ are observed for the sources $z\in\mathcal{Z}_{p}$, and that $(W,X)$ is observed from the target domain. Then $\mathbb{E}_{q}[Y\mid x]$ is identifiable, and for any $x\in\mathcal{X}$, we can write

\mathbb{E}_{q}[Y\mid x]=\int_{\mathcal{W}}m_{0}(w,x)\,\mathrm{d}Q(w\mid x).

(4.4)

The proof is given in Appendix B.3. Crucially, this result is valid only when Assumption 6 holds, and it remains unclear when it is expected to hold. Proposition 4.2 suggests that Assumption 6 is not vacuous when $U$ is finite dimensional. We plan to investigate this further in future work.

Now let us consider the case where $U$ is continuous. In this case, unfortunately, Assumption 6 may not hold, preventing identification of $\mathbb{E}_{q}[Y\mid x]$ . This is illustrated in the following example.

Example 4.4.

Recall the decomposition of both sides of (4.3). Under Assumption 2 and given the existence of $m_{0}$ (Proposition A.2),

\begin{aligned}
\mathbb{E}_{p}[Y\mid x,z] &= \int_{\mathcal{W}}m_{0}(w,x)\,\mathrm{d}P(w\mid x,z) \\
&= \int_{\mathcal{U}}\int_{\mathcal{W}}m_{0}(w,x)\,\mathrm{d}P(w\mid u)\,\mathrm{d}P(u\mid x,z); && \text{(4.5)} \\
\mathbb{E}_{p}[Y\mid x,z] &= \int_{\mathcal{U}}\mathbb{E}_{p}[Y\mid x,u]\,\mathrm{d}P(u\mid x,z). && \text{(4.6)}
\end{aligned}

For every $x$, Eqs. (4.5) and (4.6) represent projections onto $P(u\mid x,z_{r})$, $r\in\{1,\ldots,k_{Z}\}$. Consider $\mathcal{U}:=[-\pi,\pi]$ with periodic boundary conditions, and for a given $x$ define $P(u\mid x,z_{r})=(2\pi)^{-1}(1+\cos(ru)),\forall r\in\mathbb{N}^{+}$ (note that the cosines $\cos(ru)$ are mutually orthogonal on $[-\pi,\pi]$). We now construct an example where (4.5) holds for some $z$ but not for others. Define the difference

\mathbb{E}_{p}[Y\mid x,u]-\int_{\mathcal{W}}m_{0}(w,x)\,\mathrm{d}P(w\mid u)=\cos((k_{Z}+1)u)=:g(u).

(4.7)

In this case, $g(u)\neq 0$, yet $g$ integrates to zero against $P(u\mid x,z_{r})$ for every $r\leq k_{Z}$. In particular, (4.5) holds for all $r\leq k_{Z}$, but not for $P(u\mid x,z_{k_{Z}+1})$.

This example illustrates a larger point: that for continuous $U$ , no finite set of projections will suffice to completely characterize the square integrable functions on $\mathcal{U}$ . That said, as more projections are employed, and subject to appropriate assumptions on the smoothness of (4.7), the error will reduce as more domains are observed. The characterization of this convergence will be the topic of future work. In experiments, we show that the adaptation can still be effective even when the latent variable $U|z_{r}$ is continuous valued and follows different Beta distributions for each distinct $r$ , given just two training source domains.
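The orthogonality argument in Example 4.4 can be checked numerically. The sketch below is an illustration only: the number of domains `k_z = 4` and the grid size are arbitrary choices. It integrates $g(u)=\cos((k_{Z}+1)u)$ against each density $P(u\mid x,z_{r})$ using a uniform periodic grid, on which the rectangle rule is exact for low-degree trigonometric polynomials.

```python
import numpy as np

k_z = 4          # hypothetical number of observed source domains
n_grid = 4096    # uniform periodic grid on [-pi, pi)
u = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
du = 2.0 * np.pi / n_grid

g = np.cos((k_z + 1) * u)   # the residual defined in (4.7)


def projection(r):
    """Integrate g(u) against P(u | x, z_r) = (1 + cos(r u)) / (2 pi)."""
    density = (1.0 + np.cos(r * u)) / (2.0 * np.pi)
    return float(np.sum(g * density) * du)


observed = [projection(r) for r in range(1, k_z + 1)]   # domains seen in training
unseen = projection(k_z + 1)                            # the held-out domain

print(observed)  # all numerically zero: the residual is invisible to the sources
print(unseen)    # 0.5: the residual shows up in domain k_z + 1
```

The value $1/2$ for the unseen domain follows from $\int_{-\pi}^{\pi}\cos^{2}((k_{Z}+1)u)\,\mathrm{d}u=\pi$.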

5 Kernel Bridge Function Estimation

We introduce kernel methods to estimate the bridge functions and subsequently leverage the estimates to adapt to distribution shifts. Section 4 shows that bridge functions for both settings can be adapted to the target domain, so we drop the domain specific indices and use $h_{0}$ and $m_{0}$ to denote the bridge functions. We begin by introducing the notation.

Notation. Let $\otimes$ be the tensor product, $\overline{\otimes}$ be the columnwise Khatri-Rao product, and $\odot$ be the Hadamard product. For any space $\mathcal{V}\in\{\mathcal{X},\mathcal{C},\mathcal{W},\mathcal{Y}\}$, let $k:\mathcal{V}\times\mathcal{V}\rightarrow\mathbb{R}$ be a positive semidefinite kernel function and let $\phi(v)=k(v,\cdot)$ for any $v\in\mathcal{V}$ be the feature map. We denote by $\mathcal{H}_{\mathcal{V}}$ the RKHS on $\mathcal{V}$ associated with the kernel function $k$. The RKHS has two properties: (i) for all $f\in\mathcal{H}_{\mathcal{V}}$ and $v\in\mathcal{V}$, $f(v)=\langle{f},{k(v,\cdot)}\rangle$, and (ii) $k(v,\cdot)\in\mathcal{H}_{\mathcal{V}}$. We denote by $\langle{\cdot},{\cdot}\rangle$ the inner product and by $|\!|\!|\cdot|\!|\!|_{{\mathcal{H}_{\mathcal{V}}}}$ the induced norm. For notational simplicity, we denote the product space $\mathcal{H}_{\mathcal{V}}\times\mathcal{H}_{\mathcal{V}^{\prime}}$ associated with operation $\mathcal{H}_{\mathcal{V}}\otimes\mathcal{H}_{\mathcal{V}^{\prime}}$ as $\mathcal{H}_{\mathcal{V}\mathcal{V}^{\prime}}$. We define the kernel mean embedding as $\mu_{V}=\mathbb{E}[\phi(V)]=\int k(v,\cdot)p(v)\mathrm{d}v$ (smola2007hilbert) and the conditional mean embedding as $\mu_{V\mid y}=\int k(v,\cdot)p(v\mid y)\mathrm{d}v$ (song2009hilbert; singh2019kernel). For $V\in\{W,X,C\}$, we denote the $a$-th batch of i.i.d. samples as $V_{a}=\{v_{a,i}\}_{i=1}^{n_{a}}$. Define the Gram matrices as $\mathcal{K}_{V_{a}}=\begin{bmatrix}k(v_{a,i},v_{a,j})\end{bmatrix}_{i,j}\in\mathbb{R}^{n_{a}\times n_{a}}$ and the cross Gram matrices as $\mathcal{K}_{V_{ab}}=\begin{bmatrix}k(v_{a,i},v_{b,j})\end{bmatrix}_{i,j}\in\mathbb{R}^{n_{a}\times n_{b}}$. Let $\Phi_{V_{a}}=\begin{bmatrix}\phi(v_{a,1}),\ldots,\phi(v_{a,n_{a}})\end{bmatrix}^{\top}\in\mathcal{H}_{\mathcal{V}}^{n_{a}}$ be the vectorized feature map such that $\Phi_{V_{a}}(v^{\prime})=\begin{bmatrix}k(v_{a,1},v^{\prime}),\ldots,k(v_{a,n_{a}},v^{\prime})\end{bmatrix}^{\top}\in\mathbb{R}^{n_{a}}$.
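The estimators below repeatedly use the fact that the Gram matrix of tensor-product features $\phi(x)\otimes\phi(c)$ equals the Hadamard product of the individual Gram matrices. The following sketch checks this identity with explicit finite-dimensional features, assuming linear kernels purely for illustration so that the feature maps can be written down; it also spells out the columnwise Khatri-Rao product. The helper name `khatri_rao_columns` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_x, d_c = 6, 3, 2
X = rng.normal(size=(n, d_x))   # explicit features phi(x_i) = x_i (linear kernel)
C = rng.normal(size=(n, d_c))   # explicit features phi(c_i) = c_i (linear kernel)

K_X = X @ X.T                   # Gram matrix of X
K_C = C @ C.T                   # Gram matrix of C

# Row i holds the tensor-product feature phi(x_i) (x) phi(c_i).
tensor_features = np.stack([np.kron(X[i], C[i]) for i in range(n)])
K_tensor = tensor_features @ tensor_features.T


def khatri_rao_columns(A, B):
    """Columnwise Khatri-Rao product: column j is kron(A[:, j], B[:, j])."""
    return np.stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])], axis=1)


# With samples stored as columns, the Khatri-Rao product stacks the same features.
kr_features = khatri_rao_columns(X.T, C.T)

print(np.allclose(K_tensor, K_X * K_C))             # Gram of tensor features = Hadamard product
print(np.allclose(kr_features.T, tensor_features))  # Khatri-Rao columns = tensor features
```

For kernels with infinite-dimensional feature maps the same identity holds at the level of inner products, which is why only Gram matrices appear in the closed-form estimators.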

5.1 Adaptation with Concepts

Suppose that the bridge function satisfies $h_{0}\in\mathcal{H}_{\mathcal{W}\mathcal{C}}$, where $\mathcal{H}_{\mathcal{W}\mathcal{C}}$ is an RKHS. It follows from Theorem 4.1 that

\begin{aligned}
\mathbb{E}_{q}[Y\mid X=x] &= \mathbb{E}_{q}[h_{0}(W,C)\mid x] \\
&= \mathbb{E}_{q}[\langle{h_{0}},{\phi(W)\otimes\phi(C)}\rangle\mid x] \\
&= \langle{h_{0}},{\mu_{WC\mid x}^{q}}\rangle.
\end{aligned}

(5.1)

To adapt to the distribution shifts, we estimate the bridge function $h_{0}$ in the source domain and the conditional mean embedding $\mu_{WC\mid x}^{q}=\mathbb{E}_{q}[\phi(W)\otimes\phi(C)\mid x]$ in the target domain. The empirical estimate of the conditional mean embedding, along with its consistency proof, is provided in song2009hilbert and grunewalder2012conditional; we therefore focus on the estimation procedure for the bridge function $h_{0}$.

To estimate the bridge function $h_{0}$ , we employ the regression method developed in mastouri2021proximal. Recall $\mathbb{E}[Y\mid c,x]=\mathbb{E}[h_{0}(W,c)\mid c,x]$ . We define the population risk function in the source domain as:

\begin{aligned}
R(h_{0}) &= \mathbb{E}_{p}[(Y-G_{h_{0}}(X,C))^{2}]; \\
G_{h_{0}}(x,c) &= \langle{h_{0}},{\mu_{W\mid c,x}^{p}\otimes\phi(c)}\rangle.
\end{aligned}

(5.2)

The procedure to optimize (5.2) involves two stages. In the first stage, we estimate the conditional mean embedding $\mu_{W\mid c,x}^{p}=\mathbb{E}_{p}[\phi(W)\mid c,x]$, which we will use as a plug-in estimator to estimate $h_{0}$ in the second stage. Given $n_{1}$ i.i.d. samples $(X_{1},W_{1},C_{1})=\{(x_{1,i},w_{1,i},c_{1,i})\}_{i=1}^{n_{1}}$ from the source distribution $p$ and a regularization parameter $\lambda_{1}>0$, we denote $\mathcal{K}_{X_{1}}\in\mathbb{R}^{n_{1}\times n_{1}}$, $\mathcal{K}_{C_{1}}\in\mathbb{R}^{n_{1}\times n_{1}}$ as the Gram matrices and $\Phi_{X_{1}}\in\mathcal{H}_{\mathcal{X}}^{n_{1}}$, $\Phi_{C_{1}}\in\mathcal{H}_{\mathcal{C}}^{n_{1}}$ as the $n_{1}$-dimensional vectorized feature maps of $X_{1}$, $C_{1}$ respectively. Following the procedure developed in song2009hilbert, the estimate of $\mu_{W\mid c,x}^{p}$ is

\begin{aligned}
\widehat{\mu}_{W\mid c,x}^{p} &= \sum_{i=1}^{n_{1}}b_{i}(x,c)\phi(w_{1,i}); \\
b(x,c) &= (\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}\left(\Phi_{X_{1}}(x)\odot\Phi_{C_{1}}(c)\right).
\end{aligned}

(5.3)
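The first-stage weights $b(x,c)$ in (5.3) amount to a single regularized linear solve. The sketch below is one possible implementation under stated assumptions: a Gaussian kernel with unit lengthscale (the kernel is not fixed by the text above), a noise-free toy proxy, and hypothetical helper names `rbf_gram` and `stage1_weights`. Applying the weights to the raw samples $w_{1,i}$ gives an estimate of $\mathbb{E}[W\mid x,c]$, which serves as a quick sanity check.

```python
import numpy as np


def rbf_gram(A, B, lengthscale=1.0):
    """Gaussian kernel matrix [k(a_i, b_j)] for row-wise samples A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))


def stage1_weights(X1, C1, x_query, c_query, lam1):
    """Weights b(x, c) of (5.3); one column per query point."""
    n1 = X1.shape[0]
    K = rbf_gram(X1, X1) * rbf_gram(C1, C1)               # K_X1 (Hadamard) K_C1
    rhs = rbf_gram(X1, x_query) * rbf_gram(C1, c_query)   # Phi_X1(x) (Hadamard) Phi_C1(c)
    return np.linalg.solve(K + lam1 * n1 * np.eye(n1), rhs)


# Toy data (an assumption for illustration): W is a smooth function of (x, c).
rng = np.random.default_rng(0)
n1 = 200
X1 = rng.uniform(-1.0, 1.0, size=(n1, 1))
C1 = rng.uniform(-1.0, 1.0, size=(n1, 1))
W1 = np.sin(X1) + C1

x_query = np.array([[0.3]])
c_query = np.array([[-0.2]])
b = stage1_weights(X1, C1, x_query, c_query, lam1=1e-3)

# sum_i b_i w_i estimates E[W | x, c]; the toy truth is sin(0.3) - 0.2.
cond_mean_W = float(b[:, 0] @ W1[:, 0])
print(cond_mean_W, np.sin(0.3) - 0.2)
```

In practice the lengthscale and $\lambda_{1}$ would be tuned, for example by cross-validation; the values above are placeholders.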

In the second stage, we replace $\mu_{W\mid c,x}^{p}$ with $\widehat{\mu}_{W\mid c,x}^{p}$ in (5.2) and define the empirical risk. Given $n_{2}$ i.i.d. samples $(X_{2},Y_{2},C_{2})=\{({x}_{2,i},{y}_{2,i},{c}_{2,i})\}_{i=1}^{n_{2}}$ from the source distribution and a regularization parameter $\lambda_{2}>0$, we estimate $h_{0}$ by solving

\mathop{\mathrm{argmin}}_{h_{0}\in\mathcal{H}_{\mathcal{W}\mathcal{C}}}\frac{1}{2n_{2}}\sum_{i=1}^{n_{2}}\left(y_{2,i}-\langle{h_{0}},{\widehat{\mu}^{p}_{W\mid c_{2,i},x_{2,i}}\otimes\phi(c_{2,i})}\rangle\right)^{2}+\lambda_{2}|\!|\!|h_{0}|\!|\!|_{{\mathcal{H}_{\mathcal{W}\mathcal{C}}}}^{2}.

(5.4)

We follow the same analysis procedure derived in mastouri2021proximal. The solution to (5.4) is given in the following proposition.

Proposition 5.1.

Let $\mathcal{K}_{W_{1}}\in\mathbb{R}^{n_{1}\times n_{1}}$ , $\mathcal{K}_{C_{2}}\in\mathbb{R}^{n_{2}\times n_{2}}$ be the Gram matrices of $W_{1}$ and $C_{2}$ , respectively. Let $\mathcal{K}_{X_{12}}\in\mathbb{R}^{n_{1}\times n_{2}}$ , $\mathcal{K}_{C_{12}}\in\mathbb{R}^{n_{1}\times n_{2}}$ be the cross Gram matrices of $(X_{1},X_{2})$ and $(C_{1},C_{2})$ , respectively. For any $\lambda_{2}>0$ , there exists a unique optimal solution to (5.4) of the form

\begin{aligned}
\widehat{h}_{0} &= \sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\alpha_{ij}\phi(w_{1,i})\otimes\phi(c_{2,j}); \\
\text{vec}(\alpha) &= (I\overline{\otimes}\Gamma)(\lambda_{2}n_{2}I+\Sigma)^{-1}y_{2},
\end{aligned}

where $\Sigma=(\Gamma^{\top}\mathcal{K}_{W_{1}}\Gamma)\odot\mathcal{K}_{C_{2}}$, $\Gamma=(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}(\mathcal{K}_{X_{12}}\odot\mathcal{K}_{C_{12}})$, and $y_{2}=\begin{bmatrix}y_{2,1},\ldots,y_{2,n_{2}}\end{bmatrix}^{\top}$.

Proposition 5.1 is an application of the Representer theorem (scholkopf2001generalized) – the optimal estimate of the infinite dimensional operator is a finite rank operator spanned by the feature space of $W_{1}$ and $C_{2}$ .

Finally, given the estimate $\widehat{\mu}_{WC\mid x}^{q}$ and a new sample $x_{\text{new}}$, we can construct the empirical predictor of (5.1) as

\widehat{y}_{\text{pred}}=\langle{\widehat{h}_{0}},{\widehat{\mu}_{WC\mid x_{\text{new}}}^{q}}\rangle.

This completes the full adaptation procedure.
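The complete procedure of this subsection can be sketched in a few lines of NumPy. The code below is a minimal illustration rather than a reference implementation: it assumes Gaussian kernels with unit lengthscale, a synthetic data generator compatible with the concept-bottleneck graph, and a ridge-regularized target embedding whose parameter `lam_q` is a hypothetical name scaled by analogy with (5.3). The coefficient matrix `alpha` is the matrix form of $\text{vec}(\alpha)$ in Proposition 5.1, with column $j$ equal to $\beta_{j}\Gamma_{:,j}$ for $\beta=(\lambda_{2}n_{2}I+\Sigma)^{-1}y_{2}$.

```python
import numpy as np


def rbf_gram(A, B, lengthscale=1.0):
    """Gaussian kernel matrix [k(a_i, b_j)] for row-wise samples (an assumed kernel)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))


def fit_bridge(X1, C1, W1, X2, C2, y2, lam1, lam2):
    """Two-stage estimate; h_hat = sum_ij alpha[i, j] phi(w_1i) (x) phi(c_2j)."""
    n1, n2 = X1.shape[0], X2.shape[0]
    K1 = rbf_gram(X1, X1) * rbf_gram(C1, C1)
    K12 = rbf_gram(X1, X2) * rbf_gram(C1, C2)
    Gamma = np.linalg.solve(K1 + lam1 * n1 * np.eye(n1), K12)        # stage 1
    Sigma = (Gamma.T @ rbf_gram(W1, W1) @ Gamma) * rbf_gram(C2, C2)
    beta = np.linalg.solve(lam2 * n2 * np.eye(n2) + Sigma, y2)       # stage 2
    alpha = Gamma * beta[None, :]   # column j of alpha is beta_j * Gamma[:, j]
    return alpha, Gamma, Sigma, beta


def evaluate_bridge(alpha, W1, C2, W, C):
    """h_hat(w_t, c_t) for paired rows of W and C."""
    return np.einsum("ij,it,jt->t", alpha, rbf_gram(W1, W), rbf_gram(C2, C))


def predict_target(alpha, W1, C2, Xq, Wq, Cq, X_new, lam_q):
    """Inner product of h_hat with the estimated target embedding of (W, C) given x."""
    nq = Xq.shape[0]
    E = np.linalg.solve(rbf_gram(Xq, Xq) + lam_q * nq * np.eye(nq), rbf_gram(Xq, X_new))
    return E.T @ evaluate_bridge(alpha, W1, C2, Wq, Cq)


rng = np.random.default_rng(0)


def sample_domain(n, a, b):
    """Toy generator (an assumption): U ~ Beta(a, b) confounds W, X, C and Y."""
    U = rng.beta(a, b, size=(n, 1))
    W = U + 0.1 * rng.normal(size=(n, 1))
    X = U + 0.3 * rng.normal(size=(n, 1))
    C = X + U + 0.1 * rng.normal(size=(n, 1))
    y = (C + U)[:, 0] + 0.1 * rng.normal(size=n)
    return X, C, W, y


X1, C1, W1, _ = sample_domain(80, 2.0, 5.0)    # source batch for stage 1
X2, C2, _, y2 = sample_domain(80, 2.0, 5.0)    # source batch for stage 2
Xq, Cq, Wq, _ = sample_domain(80, 5.0, 2.0)    # unlabeled target batch

alpha, Gamma, Sigma, beta = fit_bridge(X1, C1, W1, X2, C2, y2, lam1=1e-2, lam2=1e-2)
X_new = np.linspace(0.0, 1.0, 5)[:, None]
y_pred = predict_target(alpha, W1, C2, Xq, Wq, Cq, X_new, lam_q=1e-2)
print(y_pred)
```

Only `fit_bridge` touches labels, and only source data enter it; `predict_target` uses target samples of $(X,W,C)$ alone, in line with the observation model of the concept-bottleneck setting.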

On classification tasks. For classification tasks, where the label is $Y\in\{1,\ldots,k_{Y}\}$, we treat the multi-task regressor as a classifier. We encode $Y$ by a one-hot encoder and then regress on the encoded $\widetilde{Y}\in\{0,1\}^{k_{Y}}$. Each label $\ell$ has a corresponding bridge function $h_{0,\ell}$ for $\ell\in\{1,\ldots,k_{Y}\}$. For $i=1,\ldots,n_{2}$, let the encoded $y_{2,i}$ be $\widetilde{y}_{2,i}=\begin{bmatrix}\widetilde{y}_{2,i,1},\ldots,\widetilde{y}_{2,i,k_{Y}}\end{bmatrix}^{\top}\in\{0,1\}^{k_{Y}}$. Then for each $\ell$, we can estimate $h_{0,\ell}$ by replacing $y_{2,i}$ in (5.4) with $\widetilde{y}_{2,i,\ell}\in\{0,1\}$. For each new sample $x_{\text{new}}$, the predicted score of label $\ell$ is $\widehat{y}_{\text{pred},\ell}=\langle{\widehat{h}_{0,\ell}},{\widehat{\mu}^{q}_{WC\mid x_{\text{new}}}}\rangle$, and we select the label that has the highest prediction score: $\mathop{\mathrm{argmax}}_{\ell}\widehat{y}_{\text{pred},\ell}$.
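Because the second-stage system matrix does not depend on the labels, the $k_{Y}$ per-label regressions can share one linear solve with a one-hot matrix on the right-hand side. The sketch below illustrates this with a random positive semidefinite matrix standing in for $\Sigma$; the function names are hypothetical and labels are indexed from $0$ rather than $1$ for convenience.

```python
import numpy as np


def one_hot(labels, k_Y):
    """Encode labels in {0, ..., k_Y - 1} as rows of {0, 1}^{k_Y}."""
    encoded = np.zeros((labels.shape[0], k_Y))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


def fit_label_coefficients(Sigma, labels, k_Y, lam2):
    """Solve (lam2 * n2 * I + Sigma) B = Y_tilde; column l holds the coefficients for label l."""
    n2 = Sigma.shape[0]
    return np.linalg.solve(lam2 * n2 * np.eye(n2) + Sigma, one_hot(labels, k_Y))


def classify(scores):
    """Pick the label with the highest predicted score in each row."""
    return np.argmax(scores, axis=1)


rng = np.random.default_rng(0)
n2, k_Y = 30, 3
A = rng.normal(size=(n2, n2))
Sigma = A @ A.T                      # stand-in for the second-stage Gram matrix
labels = rng.integers(0, k_Y, size=n2)

B = fit_label_coefficients(Sigma, labels, k_Y, lam2=1e-2)
train_scores = Sigma @ B             # fitted scores on the stage-2 samples
train_pred = classify(train_scores)
print(B.shape, train_pred[:5])
```

Each column of `B` coincides with the coefficient vector obtained by running the scalar regression on the corresponding binary indicator, so no accuracy is lost by batching the solves.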

5.2 Adaptation with Multiple Domains

In the multiple source domain setting, the estimation of $m_{0}$ follows similarly to that of $h_{0}$ . Assuming that $m_{0}\in\mathcal{H}_{\mathcal{W}\mathcal{X}}$ , then (4.3) can be written as

\mathbb{E}_{r}[Y\mid x]=\langle{m_{0}},{\mu_{W\mid r,x}\otimes\phi(x)}\rangle,

for $r=1,\ldots,k_{Z}$ . The task is to estimate $m_{0}$ from the source domain and then apply it to the target domain. We can define the population risk function as

\begin{aligned}
R(m_{0}) &= \sum_{r=1}^{k_{Z}}\mathbb{E}_{r}[(Y-G_{m_{0}}(r,X))^{2}]; \\
G_{m_{0}}(r,x) &= \langle{m_{0}},{\mu_{W\mid r,x}\otimes\phi(x)}\rangle.
\end{aligned}

(5.5)

We employ the same two-stage estimation procedure as for $h_{0}$: (i) we first estimate $\mu_{W\mid r,x}$ and then (ii) plug in the estimate $\widehat{\mu}_{W\mid r,x}$ to estimate $m_{0}$.

At the $r$-th domain, we observe the samples $\{(w_{r,i},x_{r,i},r)\}_{i=1}^{n_{r}}$. As with (5.3), we learn a conditional mean embedding $\widehat{\mu}_{W\mid r,x}=\sum_{i=1}^{n_{r}}d_{r,i}(x)\phi(w_{r,i})$, where $d_{r}(x)=(\mathcal{K}_{X_{r}}+\lambda_{3}I)^{-1}\Phi_{X_{r}}(x)\in\mathbb{R}^{n_{r}}$ and $\lambda_{3}>0$, for $r=1,\ldots,k_{Z}$. In the second stage, given another batch of independent samples $\{(y_{r,i},x_{r,i},r)\}_{i=1}^{n_{r}}$ for $r=1,\ldots,k_{Z}$, we minimize:

\frac{1}{2\sum_{r=1}^{k_{Z}}n_{r}}\sum_{r=1}^{k_{Z}}\sum_{i=1}^{n_{r}}\left(y_{r,i}-\langle{m_{0}},{\widehat{\mu}_{W\mid r,x_{r,i}}\otimes\phi(x_{r,i})}\rangle\right)^{2}+\lambda_{4}|\!|\!|m_{0}|\!|\!|_{{\mathcal{H}_{\mathcal{W}\mathcal{X}}}}^{2}.

(5.6)

The minimizer $\widehat{m}_{0}$ admits a closed-form solution similar in form to that of $\widehat{h}_{0}$ in Proposition 5.1 (see Appendix C.2 for details). Finally, with the estimated conditional mean embedding $\widehat{\mu}_{W\mid x}^{q}$ and a new sample $x_{\text{new}}$ from the target test set, we have

\widehat{y}_{\text{pred}}=\langle{\widehat{m}_{0}},{\widehat{\mu}_{W\mid x_{\text{new}}}^{q}\otimes\phi(x_{\text{new}})}\rangle.

We convert the regression task with $m_{0}$ to the classification task by learning $k_{Y}$ bridge functions, where each bridge function $m_{0,\ell}$ corresponds to label $\ell$ .
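The two-stage procedure above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of Appendix C.2: it uses a Gaussian kernel with a fixed bandwidth, and it solves (5.6) in a dual form obtained by applying the representer theorem to the features $\phi(x)\otimes\widehat{\mu}_{W\mid r,x}$ , so the coefficient parametrization may differ from the one in the paper. The names `rbf`, `cme_weights`, `fit_bridge`, and `predict` are hypothetical.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def cme_weights(X_fit, X_eval, lam):
    """Column j holds d(x_j) = (K_X + lam I)^{-1} Phi_X(x_j)."""
    K = rbf(X_fit, X_fit)
    return np.linalg.solve(K + lam * np.eye(len(X_fit)), rbf(X_fit, X_eval))

def fit_bridge(stage1, stage2, lam3, lam4):
    """stage1: [(W_r, X_r)], stage2: [(X_r, y_r)], one pair per source domain."""
    n1 = sum(len(W) for W, _ in stage1)
    n2 = sum(len(X) for X, _ in stage2)
    # Block-diagonal weights: the embedding for a stage-2 point of domain r
    # only uses the stage-1 samples of W from the same domain.
    D = np.zeros((n1, n2))
    i = j = 0
    for (W1, X1), (X2, _) in zip(stage1, stage2):
        D[i:i + len(W1), j:j + len(X2)] = cme_weights(X1, X2, lam3)
        i, j = i + len(W1), j + len(X2)
    W = np.vstack([W for W, _ in stage1])
    X = np.vstack([X for X, _ in stage2])
    y = np.concatenate([y for _, y in stage2])
    # Gram matrix of the features phi(x) (tensor) mu_hat_{W|r,x}.
    G = rbf(X, X) * (D.T @ rbf(W, W) @ D)
    # Stationary point of (1/(2N)) ||y - G c||^2 + lam4 * c^T G c with N = n2.
    coef = np.linalg.solve(G + 2.0 * n2 * lam4 * np.eye(n2), y)
    return {"W": W, "X": X, "D": D, "coef": coef}

def predict(model, W_q, X_q, X_new, lam_q):
    """Plug the target embedding mu_hat^q_{W|x_new} into the fitted bridge."""
    D_q = cme_weights(X_q, X_new, lam_q)
    inner_w = model["D"].T @ rbf(model["W"], W_q) @ D_q
    return (rbf(model["X"], X_new) * inner_w).T @ model["coef"]
```

Here `W_q` and `X_q` are unlabeled target samples of $(W,X)$ used only to estimate the target conditional mean embedding; no target labels enter the procedure.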

6 Experiments

Figure: Classification task on simulated data.

Table 1: Multi-domain adaptation result. The values are the average AUROC of 10 independent replicates of the data. Each task has three source domains with different $P_{r}(U)$ and one target domain. The proposed method has outperformed other baselines and is close to the Oracle in task 2.

Task	ORACLE	Cat-ERM	Avg-ERM	SA	MK	WCSC	DANN	MMD	Proposed
Task 1	$0.9425\pm 0.0039$	$0.8030\pm 0.0155$	$0.7916\pm 0.0148$	$0.7918\pm 0.0148$	$0.5848\pm 0.0593$	$0.5221\pm 0.0299$	$0.8039\pm 0.0229$	$0.8055\pm 0.0248$	$\mathbf{0.8848}\pm 0.0120$
Task 2	$0.9431\pm 0.0061$	$0.8942\pm 0.0084$	$0.8953\pm 0.0079$	$0.8953\pm 0.0079$	$0.8054\pm 0.0204$	$0.8144\pm 0.0474$	$0.9158\pm 0.0125$	$0.9149\pm 0.0135$	$\mathbf{0.9318}\pm 0.0063$
Task 3	$0.8876\pm 0.0085$	$0.8483\pm 0.0134$	$0.8427\pm 0.0130$	$0.8408\pm 0.0132$	$0.8002\pm 0.0311$	$0.7428\pm 0.0311$	$0.8480\pm 0.0166$	$0.8470\pm 0.0181$	$\mathbf{0.8569}\pm 0.0095$

We verify our theory with both simulated and real data, demonstrating robustness to latent shifts and transferability of the bridge functions.

For the setting with concept variables present, we compare our method with baselines: Empirical Risk Minimization (ERM), Covariate shift weighting (COVAR) (shimodaira2000improving), Label shift weighting (LABEL) (buck1966comparison), and the spectral (LSA-S) and Wasserstein Autoencoder (LSA-WAE) latent shift adaptation approaches (alabdulmohsin2023adapting). For the multi-domain setting, we compare our method with baselines: Simple Adaptation (SA) (mansour2008domain), Weighted Combination of Source Classifiers (WCSC) (zhang2015multi), and Marginal Kernel (MK) (blanchard2011generalizing). We also compare with multi-domain generalization baselines (muandet2013domain): Domain Adversarial Neural Networks (DANN) (ganin2016domain) and Maximum Mean Discrepancy (MMD) (GreBorRasSchetal12). Additionally, we adapt the ERM method to the multi-domain setting by concatenating the source samples to learn one ERM model (Cat-ERM) or by averaging the results of the per-source-domain ERM models (Avg-ERM). The ORACLE model is trained on target distribution samples and evaluated on held-out target distribution samples. The tuning parameters for all models, including the proposed model, are selected using five-fold cross-validation. Details regarding the setups are in Appendix D.

Classification task. The task designed in alabdulmohsin2023adapting is a binary classification problem with $Y\in\{0,1\}$ , where the latent variable $U\in\{0,1\}$ is a Bernoulli random variable. Additionally, $X\in\mathbb{R}^{2}$ and $W\in\mathbb{R}$ are continuous random variables and $C\in\mathbb{R}^{3}$ is a discrete variable. We have one source domain with $P(U=1)=0.1$ . We evaluate the models on target distributions with $Q(U=1)\in\{0.1,\ldots,0.9\}$ . The goal of this task is to investigate whether the adaptation method is robust to arbitrary shifts in the distribution of $U$ .

The ORACLE and ERM models are implemented as multilayer perceptrons (MLPs). The kernel function used in the proposed method is the Gaussian kernel.

We compare the proposed method with the spectral (LSA-S) and Wasserstein Autoencoder (LSA-WAE) latent shift adaptation approaches developed in alabdulmohsin2023adapting. While all three methods are designed to adjust for shift under the same graph in Figure 0(c), our method additionally takes samples of $W,C,X$ from the target domain for training, while LSA-S and LSA-WAE only take $X$ . For all three methods, only $X$ is observed in the test data.

While the identification theory developed in (alabdulmohsin2023adapting) does not require $W,C$ in the target domain, we are aware that in practice, having more information in the target domain may improve estimation. To make the methods more directly comparable, we design an additional step to incorporate $W$ from the target in the LSA-S algorithm. We describe this procedure in more detail in Appendix D.1.

Results are shown in Figure 2. The proposed method is more robust to the shift than the baselines and is close to the ORACLE model. Giving LSA-S access to $W$ in the target domain does not improve its performance relative to LSA-S without $W$ . We also compare results under different noise levels and observe similar trends, as discussed in Appendix D.

dSprites dataset regression task. We test the proposed procedure on the dSprites (dsprites17) dataset, an image dataset described by five latent parameters (shape, scale, rotation, posX, and posY). Motivated by dsprites17’s experiments, we design a regression task where the dSprites images (64 $\times$ 64 = 4096-dimensional) are $X\in\mathbb{R}^{64\times 64}$ and subject to a nonlinear confounder $U\in[0,2\pi]$ which is a rotation of the image. $W\in\mathbb{R}$ and $C\in\mathbb{R}$ are continuous random variables. For this experiment, we have $7000$ training samples and $3000$ test samples. Further details about the procedure are in Appendix D.

In the results in Figure 2, we vary $a$ , which controls the region of the source distribution on which the target distribution concentrates. We design the experiment such that increasing $a$ shifts the target distribution to increasingly low-mass regions of the source distribution. We compute the mean squared error of each method on test examples from the target distribution.

We find that, while the baseline methods degrade as the target distribution shift increases, the proposed method adapts and maintains low error, nearly matching the error achieved by the oracle, which is trained on target distribution samples.

6.1 Multi-Domain Adaptation

In the multi-domain setting, we use the same classification dataset provided in alabdulmohsin2023adapting, as described in Section D.6. We assume that $C$ is not observed in any domain and generate multiple datasets drawn with different distributions on $U$ .

Classification task. We construct three different tasks with different settings of $P(U)$ over the source and target domains. For each task, we construct three source domains and one target domain, drawing $3200$ random training samples for each source domain and $9600$ random training samples for the target domain. The source domains of Tasks 1–3 have different combinations of distributions on $U$ , documented in Appendix D.3.

The backbone models for ORACLE, Cat-ERM, Avg-ERM, and SA (mansour2008domain) are simple MLPs; MK (blanchard2011generalizing) is a weighted kernel support vector machine; WCSC (zhang2015multi) is a re-weighted kernel density estimator. SA (mansour2008domain) assumes that $Q(X)$ is a convex combination of $P_{r}(X)$ for $r=1,\ldots,k_{Z}$ ; WCSC (zhang2015multi) assumes that $Q(X\mid Y)$ is a linear mixture of $P_{r}(X\mid Y)$ for $r=1,\ldots,k_{Z}$ ; MK (blanchard2011generalizing) assumes that each domain is an i.i.d. realization from the general distribution.

The results are shown in Table 1. Overall, we find our approach performs better than ERM and baseline multi-domain adaptation methods. All methods perform better in the setting of Task 2 than for Task 1, informally demonstrating the effect of the closeness of the source domains to the target domain. For Task 3, while our proposed approach performs best, ERM also performs well, and substantially better than the domain adaptation baselines.

Regression task. We consider two regression tasks, where $U$ is either a Bernoulli or a Beta random variable. We present the results in Appendix D.

6.2 Concept and multi-domain adaptation with MIMIC-CXR

We conduct a small-scale experiment using a sample of chest X-ray data extracted from the MIMIC-CXR dataset (johnson2019mimic). We briefly describe the experimental design and results here, and include a complete description in Appendix D.7. We consider classification of the absence of a radiological finding from low-dimensional embeddings of the X-rays (sellergren2022simplified), using the absence of a radiological finding in the radiology report as the target of prediction. This corresponds to the “No Finding” label defined by irvin2019chexpert.

We consider distribution shifts similar to settings in makar2022causally, where patient sex is considered as a possible “shortcut” in the classification of the absence of a radiological finding. We impose distribution shift through structured resampling of the data where $P(U=1)=P(Y=1\mid\textrm{Sex}=\textrm{Female})=P(Y=0\mid\textrm{Sex}=\textrm{Male})$ and $P(\textrm{Sex}=\textrm{Female})=P(\textrm{Sex}=\textrm{Male})=0.5$ is held constant. We perform both concept adaptation and multi-domain adaptation experiments with the MIMIC-CXR data. For the concept adaptation experiment, we consider the concept variable $C$ to be the embedding of a radiology report associated with the chest X-ray. We experiment with the use of patient age as a potential proxy $W$ for $U$ due to a hypothesized correlation between the presence of radiological findings and patient age.

The results are summarized in Figure 3. For both experiments, we find that the performance of baseline models fit using only information from the source domain(s) degrades under distribution shift. In the concept adaptation experiment, adaptation is relatively successful, as much of the performance of comparator models fit using target domain data is recovered by the adaptation procedure.

However, we find that the multi-domain adaptation procedure is not successful. In this case, we find that while the multi-domain adaptation procedure marginally outperforms a model fit using the concatenated source domain data under distribution shift, it recovers substantially less of the performance of the target domain model than the concept adaptation procedure does. Furthermore, the adapted model does not outperform the kernel estimators that only leverage information from the source domains. The lack of success in this setting could potentially be explained by insufficient number or diversity of domains relative to the level of noise induced by sampling variability and limited sample size.

7 Discussion

We propose a strategy for adaptation under distribution shift in a latent variable using a bridge function approach (miao2018identifying; tchetgen2020introduction). This approach allows for identification of the optimal predictor in the target domain without identifying the distribution of the latent variable and without distributional assumptions on the form of the latent. We require that proxies of the latent variable are present and that (i) mediating concepts are available or (ii) data from multiple source domains are present.

We argue our approach is useful for two reasons. First, the latent distribution in general is only identifiable under strict distributional assumptions (locatello2019challenging). Second, recovery of the latent variable may be challenging in practice even if it is identifiable (rissanen2021critical). For example, because most latent variable estimation methods are designed to model the data generating process (kingma2013auto), one might allocate substantial modeling capacity to variability in the data and the latent variable that are irrelevant to modeling the shift in the conditional distribution of $Y\mid X$ . By contrast, we model only the components of the observable variables relevant to the adaptation.

Acknowledgments: We thank Zhu Li and Dimitri Meunier for helpful discussions. AG was partly supported by the Gatsby Charitable Foundation. OS was partly supported by the UIUC Beckman Institute Graduate Research Fellowship, NSF-NRT 1735252. KT was partly supported by NSF Graduate Research Fellowship Program. SK was partly supported by the NSF III 2046795, IIS 1909577, CCF 1934986, NIH 1R01MH116226-01A, NIFA award 2020-67021-32799, the Alfred P. Sloan Foundation, and Google Inc. This study was funded by Google LLC and/or a subsidiary thereof (‘Google’).

References

Appendix A Identification of the Distribution

In this section, we demonstrate the existence of the bridge functions $h_{0}$ and $m_{0}$ under certain regularity conditions. We first discuss the discrete case and then generalize to the continuous case.

A.1 The Discrete Case of the Bridge Function $h_{0}$

The idea of the bridge function $h_{0}$ may seem abstract in the continuous setting. When every variable is discrete, however, the bridge function can be constructed by solving a series of matrix problems. This idea originates from miao2018identifying, and we apply the technique to construct the bridge function when every variable $(W,U,C,X,Y)$ is discrete.

Let

	$\displaystyle\mathbf{P}(W\mid u)$	$\displaystyle=\begin{bmatrix}P(w_{1}\mid u)&\ldots&P(w_{k_{W}}\mid u)\end{bmatrix}^{\top}\in\mathbb{R}^{k_{W}};$
	$\displaystyle\mathbf{P}(W\mid U)$	$\displaystyle=\begin{bmatrix}\mathbf{P}(W\mid u_{1})&\ldots&\mathbf{P}(W\mid u_{k_{U}})\end{bmatrix}\in\mathbb{R}^{k_{W}\times k_{U}},$

be a column vector, and a matrix, respectively. We define similarly

	$\displaystyle\mathbf{P}(U\mid x,c)$	$\displaystyle=\begin{bmatrix}P(u_{1}\mid c,x)&\ldots&P(u_{k_{U}}\mid c,x)\end{bmatrix}^{\top}\in\mathbb{R}^{k_{U}};$
	$\displaystyle\mathbf{P}(U\mid X,c)$	$\displaystyle=\begin{bmatrix}\mathbf{P}(U\mid x_{1},c)&\ldots&\mathbf{P}(U\mid x_{k_{X}},c)\end{bmatrix}\in\mathbb{R}^{k_{U}\times k_{X}},$

for $c\in\mathcal{C}$ . We define

	$\displaystyle\mathbf{P}(Y\mid X,c)$	$\displaystyle=\begin{bmatrix}\mathbf{P}(Y\mid x_{1},c)&\ldots&\mathbf{P}(Y\mid x_{k_{X}},c)\end{bmatrix}\in\mathbb{R}^{k_{Y}\times k_{X}};$
	$\displaystyle\mathbf{P}(Y\mid U,c)$	$\displaystyle=\begin{bmatrix}\mathbf{P}(Y\mid u_{1},c)&\ldots&\mathbf{P}(Y\mid u_{k_{U}},c)\end{bmatrix}\in\mathbb{R}^{k_{Y}\times k_{U}};$
	$\displaystyle\mathbf{P}(W\mid X,c)$	$\displaystyle=\begin{bmatrix}\mathbf{P}(W\mid x_{1},c)&\ldots&\mathbf{P}(W\mid x_{k_{X}},c)\end{bmatrix}\in\mathbb{R}^{k_{W}\times k_{X}},$

analogously. As an alternative to finding a $h_{0}(w,c)$ such that

\mathbb{E}[Y\mid c,x]=\sum_{i=1}^{k_{W}}h_{0}(w_{i},c)p(w_{i}\mid c,x),

the proxy problem is converted to finding a $\widetilde{H}_{0}(Y,W,c)$ such that

\mathbf{P}(Y\mid X,c)=\widetilde{H}_{0}(Y,W,c)\mathbf{P}(W\mid X,c),\quad c\in\mathcal{C}.

First, under the condition that $W\perp\!\!\!\perp\{X,C\}\mid U$ , we can write

\mathbf{P}(W\mid X,c)=\mathbf{P}(W\mid U)\mathbf{P}(U\mid X,c).

(A.1)

Similarly, under the condition that $Y\perp\!\!\!\perp X\mid\{U,C\}$ , we have

\mathbf{P}(Y\mid X,c)=\mathbf{P}(Y\mid U,c)\mathbf{P}(U\mid X,c)

(A.2)

We introduce the following assumption:

Assumption 7.

Columns of $\mathbf{P}(W\mid U)$ are linearly independent. For every $c\in\mathcal{C}$ , the columns of $\mathbf{P}(W\mid X,c)$ satisfy $\mathbf{P}(W\mid x,c)\in\mathcal{N}(\mathbf{P}(W\mid U)^{*})^{\perp}$ for all $x\in\mathcal{X}$ .

Assumption 7 is the requirement for the least-squares problem to have a unique solution. Hence, by Assumption 7, we have

\mathbf{P}(U\mid X,c)=\mathbf{P}(W\mid U)^{\dagger}\mathbf{P}(W\mid X,c),

where $\mathbf{P}(W\mid U)^{\dagger}$ is the generalized inverse of $\mathbf{P}(W\mid U)$ . Plugging the above equation into (A.2), we see that

\mathbf{P}(Y\mid X,c)=\underbrace{\mathbf{P}(Y\mid U,c)\mathbf{P}(W\mid U)^{\dagger}}_{\widetilde{H}_{0}(Y,W,c)}\mathbf{P}(W\mid X,c).
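The discrete construction can be checked numerically. The sketch below, for a single fixed $c$ , draws hypothetical column-stochastic matrices for $\mathbf{P}(W\mid U)$ , $\mathbf{P}(Y\mid U,c)$ and $\mathbf{P}(U\mid X,c)$ , builds the observable matrices through (A.1) and (A.2), and verifies that the bridge matrix formed with the Moore-Penrose pseudoinverse (`numpy.linalg.pinv`) reproduces $\mathbf{P}(Y\mid X,c)$ . The dimensions and the random tables are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
k_U, k_W, k_Y, k_X = 3, 4, 2, 5  # hypothetical sizes with k_W >= k_U

def random_stochastic(rows, cols):
    """Random matrix whose columns are probability vectors."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

P_W_given_U = random_stochastic(k_W, k_U)    # P(W | U)
P_Y_given_Uc = random_stochastic(k_Y, k_U)   # P(Y | U, c)
P_U_given_Xc = random_stochastic(k_U, k_X)   # P(U | X, c)

# First part of Assumption 7: linearly independent columns of P(W | U).
assert np.linalg.matrix_rank(P_W_given_U) == k_U

# Observable quantities implied by (A.1) and (A.2).
P_W_given_Xc = P_W_given_U @ P_U_given_Xc
P_Y_given_Xc = P_Y_given_Uc @ P_U_given_Xc

# Bridge matrix: P(Y | U, c) times the generalized inverse of P(W | U).
H0 = P_Y_given_Uc @ np.linalg.pinv(P_W_given_U)

bridge_holds = np.allclose(H0 @ P_W_given_Xc, P_Y_given_Xc)
```

Note that `H0` is built from the latent-dependent tables here only to verify the identity; the point of the identity is that `H0` depends on neither $\mathbf{P}(U\mid X,c)$ nor the marginal of $U$ .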

A.2 Existence of the Bridge Function $h_{0}$

Sufficient conditions for the existence of $h_{0}$ were originally discussed in miao2018identifying; we adapt them to our setting and provide a brief review in this section. We assume the following completeness assumption and regularity conditions. The completeness assumption is equivalent to Condition (iii) in miao2018identifying.

Assumption 8.

For any mean squared integrable function $g$ and for $c\in\mathcal{C}$ , $\mathbb{E}[g(X)\mid W,c]=0$ almost surely if and only if $g(X)=0$ almost surely.

Let $f$ be either $p$ or $q$ . We consider $K_{c}:L_{2}(W\mid c)\rightarrow L_{2}(X\mid c)$ as the conditional expectation operator associated with the kernel function

k(w,x,c)=\frac{f(w,x\mid c)}{f(w\mid c)f(x\mid c)}.

Then it follows that $\mathbb{E}[Y\mid c,x]=K_{c}h_{0}$ :

	$\displaystyle\mathbb{E}[Y\mid c,x]$	$\displaystyle=\int_{\mathcal{W}}h_{0}(w,c)f(w\mid x,c)\mathrm{d}w$
		$\displaystyle=\int k(w,x,c)h_{0}(w,c)f(w\mid c)\mathrm{d}w=K_{c}h_{0}.$

To find the solution $h_{0}$ , we assume the following.

Assumption 9.

For any $c\in\mathcal{C}$ , $\int_{\mathcal{W}}\int_{\mathcal{X}}f(w\mid c,x)f(x\mid c,w)\mathrm{d}w\mathrm{d}x<\infty$ .

This is a sufficient condition to ensure that $K_{c}$ is a compact operator (carrasco2007linear, Example 2.3). Hence, by the definition of a compact operator, there exists a singular system $\{\lambda_{c,i},\phi_{c,i},\psi_{c,i}\}_{i\in\mathbb{N}}$ of $K_{c}$ for every $c\in\mathcal{C}$ .

Assumption 10.

For fixed $c\in\mathcal{C}$ :

1.

$\mathbb{E}[Y\mid X,c]\in L_{2}(X\mid c);$
2.

$\sum_{i\in\mathbb{N}}\lambda_{c,i}^{-2}\left|\langle{\mathbb{E}[Y\mid X,c]},{\psi_{c,i}}\rangle\right|^{2}<\infty$ .

The above two assumptions are restatements of Conditions (v)–(vii) in miao2018identifying. We adapt the results from Proposition 1 in miao2018identifying to the graph in Figure 0(c) which replaces the node $X$ by $C$ and node $Z$ by $X$ .

Proposition A.1 (Existence of $h_{0}$ , adapted from Proposition 1 in miao2018identifying).

Under Assumptions 2 and 8–10, the solution to (4.1) exists.

Proof.

The proof follows directly from the result of Picard’s theorem. Assumption 9 implies that $K_{c}$ is a compact operator. Assumption 8 implies that $\mathcal{N}(K_{c}^{*})^{\perp}=L_{2}(X\mid c)$ . Therefore, under the first statement in Assumption 10, we have $\mathbb{E}[Y\mid X,c]\in\mathcal{N}(K_{c}^{*})^{\perp}$ . Along with the second statement in Assumption 10, we can apply Lemma A.3. ∎

A.3 Existence of Bridge Function $m_{0}$

The proof of the existence of $m_{0}^{p}$ is similar to the analysis of $h_{0}$ . Let $K_{x}:L_{2}(W\mid x)\rightarrow L_{2}(Z\mid x)$ be the integral operator associated with the kernel function $k(w,x,z)=p(w,z\mid x)/(p(w\mid x)p(z\mid x))$ . Then, we can write

\displaystyle\mathbb{E}_{p}[Y\mid x,z]=\int k(w,x,z)p(w\mid x)m_{0}(w,x)\mathrm{d}w=K_{x}m_{0}.

Proposition A.2 (Existence of $m_{0}$ , Proposition 1 in miao2018identifying).

Assume that

1.

for any mean squared integrable function $g$ and for $x\in\mathcal{X}$ , $\mathbb{E}[g(Z)\mid W,x]=0$ almost surely if and only if $g(Z)=0$ almost surely;
2.

For any $x\in\mathcal{X}$ , $\int_{\mathcal{W}}\int_{\mathcal{Z}}f(w\mid x,z)f(z\mid x,w)\mathrm{d}w\mathrm{d}z<\infty$ ;
3.

For any $x\in\mathcal{X}$ , $\mathbb{E}[Y\mid Z,x]\in L_{2}(Z\mid x)$ ;
4.

For any $x\in\mathcal{X}$ , $\sum_{i\in\mathbb{N}}\lambda_{x,i}^{-2}\left|\langle{\mathbb{E}[Y\mid Z,x]},{\psi_{x,i}}\rangle\right|^{2}<\infty$ , where $(\lambda_{x,i},\phi_{x,i},\psi_{x,i})$ is the singular system of $K_{x}$ .

Then the solution of $m_{0}^{p}$ exists.

The proof of Proposition A.2 is similar to the proof of Proposition 1 in (miao2018identifying), where we replace $P(y|z,x)$ in Proposition 1 of miao2018identifying with $\mathbb{E}[Y\mid Z,x]$ . The proof of existence of $m_{0}^{q}$ follows along the same lines as that of Proposition A.2.

A.4 Auxiliary Lemma

We state Picard’s theorem as follows.

Lemma A.3 (Picard’s Theorem).

Let $K:\mathcal{H}_{1}\rightarrow\mathcal{H}_{2}$ be a compact operator with singular system $\{\lambda_{j},\varphi_{j},\psi_{j}\}_{j=1}^{\infty}$ and $\phi$ be a given function in $\mathcal{H}_{2}$ . Then the equation of the first kind $Kh=\phi$ has a solution if and only if

1.

$\phi\in\mathcal{N}(K^{*})^{\perp}$ , where $\mathcal{N}(K^{*})=\{h:K^{*}h=0\}$ is the null space of the adjoint operator $K^{*}$ .
2.

$\sum_{j=1}^{+\infty}\lambda_{j}^{-2}\left|\langle{\phi},{\psi_{j}}\rangle% \right|^{2}<\infty$ .
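A finite-dimensional analogue may help build intuition for Lemma A.3. In the sketch below the operator is a matrix, its singular system comes from `numpy.linalg.svd`, and the second condition holds automatically because there are finitely many nonzero singular values; in the infinite-dimensional setting of the lemma it is a genuine decay requirement. The function name `picard_solve` and the tolerances are hypothetical choices.

```python
import numpy as np

def picard_solve(K, phi, tol=1e-10):
    """Solve K h = phi through the singular system of the matrix K.

    Returns the candidate solution h and a flag indicating whether phi is
    orthogonal to the null space of K^T (condition 1 of the lemma).
    """
    U, s, Vt = np.linalg.svd(K, full_matrices=True)
    r = int((s > tol).sum())  # number of nonzero singular values
    # Columns U[:, r:] span the null space of K^T.
    in_range = np.allclose(U[:, r:].T @ phi, 0.0, atol=1e-8)
    # h = sum_j lambda_j^{-1} <phi, psi_j> varphi_j, with psi_j = U[:, j]
    # and varphi_j = Vt[j].
    coeffs = (U[:, :r].T @ phi) / s[:r]
    h = Vt[:r].T @ coeffs
    return h, in_range

K = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
h_ok, solvable = picard_solve(K, np.array([1.0, 4.0, 0.0]))
_, solvable_bad = picard_solve(K, np.array([1.0, 4.0, 1.0]))
```

The first right-hand side lies in the range of `K` and is solved exactly; the second has a component in the null space of the adjoint, so no solution exists.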

Appendix B Transferring Bridge Functions

In this section, we discuss the identifiability results.

B.1 Proof of Theorem 4.1

For $f\in\{p,q\}$ , recall that

$\displaystyle\mathbb{E}_{f}[Y\mid c,x]$	$\displaystyle=\int_{\mathcal{W}}h_{0}^{f}(w,c)f(w\mid c,x)\mathrm{d}w$
	$\displaystyle=\int_{\mathcal{W}}\int_{\mathcal{U}}h_{0}^{f}(w,c)f(w\mid c,u)f(% u\mid c,x)\mathrm{d}u\mathrm{d}w$
	$\displaystyle=\int_{\mathcal{W}}\int_{\mathcal{U}}h_{0}^{f}(w,c)f(w\mid u)f(u% \mid c,x)\mathrm{d}u\mathrm{d}w$	$\displaystyle(W\perp\!\!\!\perp C\mid U).$

Similarly, we can write

\displaystyle\mathbb{E}_{f}[Y\mid c,x]

\displaystyle=\int_{\mathcal{U}}\mathbb{E}_{f}[Y\mid c,u]f(u\mid c,x)\mathrm{d}u

\displaystyle(Y\perp\!\!\!\perp X\mid\{U,C\}).

Under Assumption 4, we have

\mathbb{E}_{f}[Y\mid c,U]=\int_{\mathcal{W}}h_{0}^{f}(w,c)f(w\mid U)\mathrm{d}w\quad

(B.1)

almost surely with respect to $F(U)$ , $F\in\{P,Q\}$ .

Suppose that $u\in\mathcal{U}$ is such that $Q(u)>0$ . Then, by Assumption 5, we must have $P(u)>0$ . Hence, conditioned on the selected $u$ and $c$ and under Assumption 1, we have

	$\displaystyle\mathbb{E}_{p}[Y\mid c,u]$	$\displaystyle=\int_{\mathcal{W}}h_{0}^{p}(w,c)p(w\mid u)\mathrm{d}w;$
	$\displaystyle\mathbb{E}_{q}[Y\mid c,u]$	$\displaystyle=\int_{\mathcal{W}}h_{0}^{q}(w,c)p(w\mid u)\mathrm{d}w$	$\displaystyle(p(w\mid u)=q(w\mid u),\;\forall c\in\mathcal{C},w\in\mathcal{W},% u\in\mathcal{U}).$

We then can write

\mathbb{E}_{p}[Y\mid c,u]-\mathbb{E}_{q}[Y\mid c,u]=\int_{\mathcal{W}}h_{0}^{p}(w,c)p(w\mid u)\mathrm{d}w-\int_{\mathcal{W}}h_{0}^{q}(w,c)q(w\mid u)\mathrm{d}w.

Note that, by Assumption 1, we have $\mathbb{E}_{p}[Y\mid c,u]=\mathbb{E}_{q}[Y\mid c,u]$ and hence the left hand side of the above equation is $0$ and we can conclude that:

\int_{\mathcal{W}}h_{0}^{p}(w,c)p(w\mid U)\mathrm{d}w=\int_{\mathcal{W}}h_{0}^{q}(w,c)q(w\mid U)\mathrm{d}w

$Q(U)$ almost surely. This completes the first part of the proof.

To show the second part of the theorem, note that we can write

	$\displaystyle\mathbb{E}_{q}[Y\mid x]$	$\displaystyle=\mathbb{E}_{q}[\mathbb{E}_{q}[Y\mid C,x]\mid x]$
		$\displaystyle=\mathbb{E}_{q}[\mathbb{E}_{q}[h_{0}^{q}(W,C)\mid C,x]\mid x].$
Since $p(w\mid u)=q(w\mid u)$ by Assumption 1, we can factorize the above equation as
	$\displaystyle\mathbb{E}_{q}[Y\mid x]$	$\displaystyle=\int_{\mathcal{C}}\left[\int_{\mathcal{U}}\left\{\int_{\mathcal{% W}}h_{0}^{q}(w,c)p(w\mid u)\mathrm{d}w\right\}q(u\mid c,x)\mathrm{d}u\right]q(% c\mid x)\mathrm{d}c.$
Let the support of $U$ conditioned on $c,x$ be $\mathcal{U}_{c,x}^{1}=\{u:Q(u\mid c,x)>0\}$ and $\mathcal{U}_{c,x}^{0}=\{u:Q(u\mid c,x)=0\}$ . Hence, we have $\mathcal{U}=\mathcal{U}_{c,x}^{0}\cup\mathcal{U}_{c,x}^{1}$ , and $\mathcal{U}_{c,x}^{0}\cap\mathcal{U}_{c,x}^{1}=\emptyset$ such that $\int_{\mathcal{U}_{c,x}^{0}}q(u\mid c,x)\mathrm{d}u=0$ and $\int_{\mathcal{U}_{c,x}^{1}}q(u\mid c,x)\mathrm{d}u=1$ . Then, we can further decompose the above as
	$\displaystyle\mathbb{E}_{q}[Y\mid x]$	$\displaystyle=\int_{\mathcal{C}}\left[\int_{\mathcal{U}_{c,x}^{0}}\left\{\int_% {\mathcal{W}}h_{0}^{q}(w,c)p(w\mid u)\mathrm{d}w\right\}q(u\mid c,x)\mathrm{d}% u\right]q(c\mid x)\mathrm{d}c$
		$\displaystyle\quad+\int_{\mathcal{C}}\left[\int_{\mathcal{U}_{c,x}^{1}}\left\{% \int_{\mathcal{W}}h_{0}^{q}(w,c)p(w\mid u)\mathrm{d}w\right\}q(u\mid c,x)% \mathrm{d}u\right]q(c\mid x)\mathrm{d}c$
		$\displaystyle=\int_{\mathcal{C}}\left[\int_{\mathcal{U}_{c,x}^{1}}\left\{\int_% {\mathcal{W}}h_{0}^{q}(w,c)p(w\mid u)\mathrm{d}w\right\}q(u\mid c,x)\mathrm{d}% u\right]q(c\mid x)\mathrm{d}c.$
Given $c,x$ , the support of $Q(U\mid c,x)$ is included in the support of $Q(U)$ . Hence, if $u\in\mathcal{U}_{c,x}^{1}$ , we must have $Q(u)>0$ and therefore $P(u)>0$ by Assumption 5, so by the first part of the theorem we can swap $h_{0}^{q}$ with $h_{0}^{p}$ :
		$\displaystyle=\int_{\mathcal{C}}\left[\int_{\mathcal{U}_{c,x}^{1}}\left\{\int_% {\mathcal{W}}h_{0}^{p}(w,c)p(w\mid u)\mathrm{d}w\right\}q(u\mid c,x)\mathrm{d}% u\right]q(c\mid x)\mathrm{d}c.$
Since $\int_{\mathcal{U}_{c,x}^{0}}\left\{\int_{\mathcal{W}}h_{0}^{p}(w,c)p(w\mid u)% \mathrm{d}w\right\}q(u\mid c,x)\mathrm{d}u=0$ , we can add it to the above term and arrive at
		$\displaystyle=\int_{\mathcal{C}}\left[\int_{\mathcal{U}}\left\{\int_{\mathcal{% W}}h_{0}^{p}(w,c)p(w\mid u)\mathrm{d}w\right\}q(u\mid c,x)\mathrm{d}u\right]q(% c\mid x)\mathrm{d}c$
		$\displaystyle=\int_{\mathcal{C}}\int_{\mathcal{W}}h_{0}^{p}(w,c)q(w,c\mid x)% \mathrm{d}w\mathrm{d}c.$	(B.2)

Since we can identify $h_{0}^{p}$ from the observable $(W,X,Y,C)$ of the source domain by solving the linear system (4.1), given observable $(W,C,X)$ from the target domain, we can identify $\mathbb{E}_{q}[Y\mid x]$ .

B.2 Proof of Proposition 4.2

The following proof is a generalization of the proof of miao2018identifying, suited to the multidomain case. All variables besides $Z$ are assumed to be discrete-valued and multivariate: $V$ can take $k_{v}$ values for $V\in\{U,X,Y,W\}$ .

Let $\mathbf{P}(W\mid U)=\begin{bmatrix}\mathbf{P}(W\mid u_{1})&\ldots&\mathbf{P}(W\mid u_{k_{U}})\end{bmatrix}\in\mathbb{R}^{k_{W}\times k_{U}}$ .

Similarly, define

\mathbf{P}(Y\mid U,x)=\begin{bmatrix}\mathbf{P}(Y\mid u_{1},x)&\ldots&\mathbf{P}(Y\mid u_{k_{U}},x)\end{bmatrix}\in\mathbb{R}^{k_{Y}\times k_{U}}.

This notation carries through to the remaining variables.

The approach we will take differs from the concept case (and standard proxy case) in the following way: we do not observe $Z$ in the training or test domains, nor do we know its true dimension (indeed $Z$ may be continuous valued). Rather, we assume that we have at least $k_{Z}$ distinct draws $z_{r}$ from $Z$ in training, where $r\in\{1,\ldots,k_{Z}\}$ is the domain index, and that $k_{Z}\geq k_{U}.$ We also suppose that in test, we observe a distinct draw $z_{k_{Z}+1}$ which was not seen in training.

Our goal is to obtain a bridge function, which in the categorical case will be a bridge matrix $M_{w,x}\in\mathbb{R}^{k_{Y}\times k_{W}}$ . Define $P_{r}(V\mid x):=P(V\mid x,z_{r})$ for $V\in\{U,Y,W\}$ . We assume that for each $x$ ,

\mathrm{rank}\left(P_{1:k_{Z}}(U\mid x)\right)=k_{U},\qquad P_{1:k_{Z}}(U\mid x):=\begin{bmatrix}P_{1}(U\mid x)&\ldots&P_{k_{Z}}(U\mid x)\end{bmatrix},

which implies that $P(U\mid x,z_{r})$ varies with $z_{r},$ and that we see a sufficient diversity of domains to span the space of vectors on $U$ .

The graphical model supports the conditional independence relation

\{Y,X,W\}\perp\!\!\!\perp Z\mid U,

however we will only require the standard proxy assumptions

	$\displaystyle W\perp\!\!\!\perp X,Z\mid U,$
	$\displaystyle Y\perp\!\!\!\perp Z\mid X,U.$

Next, as in the concept case, we require

P(Y|U,x)=M_{w,x}P(W|U),

where we assume $\mathrm{rank}(P(W|U))=k_{U}$ (as in the first condition of Assumption 7). The matrix $M_{w,x}$ is invariant to the distribution $P(U)$ by construction. If we can solve for $M_{w,x}$ , then given a novel domain corresponding to the draw $z_{k_{Z}+1}$ , we have

	$\displaystyle P(Y\mid U,x)P_{k_{Z}+1}(U\mid x)$	$\displaystyle=M_{w,x}P(W\mid U)P_{k_{Z}+1}(U\mid x)$
	$\displaystyle P_{k_{Z}+1}(Y\mid x)$	$\displaystyle=M_{w,x}P_{k_{Z}+1}(W\mid x).$

This allows us to compute conditional expectations under $P(Y\mid x)$ in the novel domain, based on observations of $(W,X)$ in this domain.

To solve for $M_{w,x}$ , we project both sides on a basis over $U$ arising from the training domains,

\displaystyle P(Y|U,x)P_{1:k_{Z}}(U\mid x)=M_{w,x}P(W|U)P_{1:k_{Z}}(U\mid x),

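where we define $P_{1:k_{Z}}(Y\mid x)=\begin{bmatrix}P_{1}(Y\mid x)&\ldots&P_{k_{Z}}(Y\mid x)\end{bmatrix}$ , and likewise $P_{1:k_{Z}}(W\mid x)$ . By the conditional independence assumptions, $P(Y\mid U,x)P_{r}(U\mid x)=P_{r}(Y\mid x)$ and $P(W\mid U)P_{r}(U\mid x)=P_{r}(W\mid x)$ for each $r$ , so the above becomes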

	$\displaystyle P_{1:k_{Z}}(Y\mid x)$	$\displaystyle=M_{w,x}P_{1:k_{Z}}(W\mid x)$
	$\displaystyle M_{w,x}$	$\displaystyle=P_{1:k_{Z}}(Y\mid x)P_{1:k_{Z}}^{\dagger}(W\mid x).$		(B.3)

This demonstrates that we can recover the domain-invariant $M_{w,x}$ purely from observed data.
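The recovery in (B.3) can be illustrated with a small numerical example for a fixed $x$ . The probability tables below are hypothetical values with $k_{U}=k_{W}=k_{Y}=2$ and $k_{Z}=3$ source domains; the bridge matrix is computed from the stacked observable conditionals only and then applied to an unseen domain.

```python
import numpy as np

# Hypothetical tables for a fixed x; columns index the values of U.
P_W_given_U = np.array([[0.9, 0.2],
                        [0.1, 0.8]])    # P(W | U), rank k_U
P_Y_given_Ux = np.array([[0.8, 0.3],
                         [0.2, 0.7]])   # P(Y | U, x)

# Column r is P_r(U | x) for source domain r; the matrix has rank k_U.
P_U_src = np.array([[0.5, 0.2, 0.7],
                    [0.5, 0.8, 0.3]])
assert np.linalg.matrix_rank(P_U_src) == P_U_src.shape[0]

# Observable stacked conditionals P_{1:k_Z}(W | x) and P_{1:k_Z}(Y | x).
P_W_src = P_W_given_U @ P_U_src
P_Y_src = P_Y_given_Ux @ P_U_src

# (B.3): bridge matrix from observables, via the pseudoinverse.
M = P_Y_src @ np.linalg.pinv(P_W_src)

# Unseen domain with a different distribution over U.
p_u_new = np.array([0.9, 0.1])
p_w_new = P_W_given_U @ p_u_new       # observable in the new domain
p_y_new = P_Y_given_Ux @ p_u_new      # ground truth, used only for checking

transfers = np.allclose(M @ p_w_new, p_y_new)
```

The latent tables appear here only to simulate the observables and to check the answer; the estimate `M` itself uses `P_W_src` and `P_Y_src` alone.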

One domain is not enough: We illustrate with an example, where we again consider the case where all variables are categorical:

P(Y|x)=M_{w,x}P(W|x),

(B.4)

where $P(Y\mid x)$ is a $k_{Y}\times 1$ vector of probabilities, $P(W\mid x)$ is a $k_{W}\times 1$ vector of probabilities, and $M_{w,x}$ is a $k_{Y}\times k_{W}$ matrix for which we wish to solve. We have too few equations for the number of unknowns: (B.4) provides $k_{Y}$ equations for $k_{Y}k_{W}$ unknowns.

One solution to (B.4) is the matrix of conditional probabilities $M_{w,x}=P(Y|W,x)$ . This matrix is not invariant to changes to $P(U)$ , however:

P(Y|W,x)=P(Y|U,x)P(U|W,x).

The posterior $P(U|W,x)$ changes when the prior $P(U)$ changes. In contrast, the solution in (B.3) is guaranteed to be domain invariant.
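A small numerical example makes the failure of the single-domain solution concrete. The tables below are hypothetical values for a fixed $x$ with binary $U$ , $W$ and $Y$ ; the helper `naive_bridge` is an illustrative name. The matrix $P(Y\mid W,x)$ computed in the source domain solves (B.4) there, but gives the wrong answer once the distribution of $U$ shifts.

```python
import numpy as np

# Hypothetical tables for a fixed x; columns index the values of U.
P_W_given_U = np.array([[0.9, 0.2],
                        [0.1, 0.8]])    # column u is P(W | u)
P_Y_given_Ux = np.array([[0.8, 0.3],
                         [0.2, 0.7]])   # column u is P(Y | u, x)

def naive_bridge(p_u):
    """Return P(Y | W, x) under the prior p_u = P(U | x), shape k_Y x k_W."""
    joint = P_W_given_U * p_u[None, :]   # entry (w, u) is P(w, u | x)
    # Column w of P_U_given_W is the posterior P(U | w, x).
    P_U_given_W = (joint / joint.sum(axis=1, keepdims=True)).T
    return P_Y_given_Ux @ P_U_given_W

p_u_src = np.array([0.5, 0.5])
p_u_new = np.array([0.9, 0.1])
M_naive = naive_bridge(p_u_src)

# M_naive satisfies (B.4) in the source domain ...
src_ok = np.allclose(M_naive @ (P_W_given_U @ p_u_src), P_Y_given_Ux @ p_u_src)
# ... but not in the shifted domain, because the posterior over U changed.
new_ok = np.allclose(M_naive @ (P_W_given_U @ p_u_new), P_Y_given_Ux @ p_u_new)
```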

B.3 Proof of Proposition 4.3

For all $r=1,\ldots,k_{Z}$ , we can write

$\displaystyle\mathbb{E}_{r}[Y\mid x]=\mathbb{E}[{Y\mid x,z_{r}}]$	$\displaystyle=\int_{\mathcal{W}}m_{0}(w,x)\mathrm{d}P(w\mid x,z_{r})$
	$\displaystyle=\int_{\mathcal{U}}\int_{\mathcal{W}}m_{0}(w,x)\mathrm{d}P(w\mid u% )\mathrm{d}P(u\mid x,z_{r});$	(B.5)
$\displaystyle\mathbb{E}[{Y\mid x,z_{r}}]$	$\displaystyle=\int_{\mathcal{U}}\mathbb{E}[Y\mid x,u]\mathrm{d}P(u\mid x,z_{r}).$	(B.6)

By Assumption 6, the integrands of (B.5)–(B.6) have the following property

\displaystyle\mathbb{E}[Y\mid x,u]=\int_{\mathcal{W}}m_{0}(w,x)\mathrm{d}P(w\mid u),

(B.7)

almost surely with respect to $P(U)$ . We will show that $m_{0}$ can be transferred to identify the distribution in the target domain.

We define the support set ${\mathcal{S}}_{q}(x)=\{u:Q(u\mid x)>0\}$ . Therefore, we can write

	$\displaystyle\mathbb{E}_{q}[Y\mid x]$	$\displaystyle=\int_{\mathcal{U}}\mathbb{E}[Y\mid u,x]\mathrm{d}Q(u\mid x)$
		$\displaystyle=\int_{{\mathcal{S}}_{q}(x)}\mathbb{E}[Y\mid u,x]\mathrm{d}Q(u% \mid x).$

Furthermore, since we have ${\mathcal{S}}_{q}(x)\subseteq\{u:P(u)>0\}$ , we can apply (B.7) to obtain

	$\displaystyle\mathbb{E}_{q}[Y\mid x]$	$\displaystyle=\int_{\mathcal{U}}\int_{\mathcal{W}}m_{0}(w,x)\mathrm{d}P(w\mid u)\mathrm{d}Q(u\mid x)$
		$\displaystyle=\mathbb{E}_{q}[{m}_{0}(W,x)\mid x].$

This completes the proof.

Appendix C Estimation Procedure

The estimation procedure of $\widehat{h}_{0}$ is discussed in Section C.1 and the estimation procedure of $\widehat{m}_{0}$ is discussed in Section C.2. In Section C.3, we discuss the case when either $Z$ or $C$ is a discrete variable.

C.1 Proof of Proposition 5.1

The proof of Proposition 5.1 simply follows the result in (mastouri2021proximal) which extends from the representer theorem (scholkopf2001generalized). There exists a $\gamma\in\mathbb{R}^{n_{2}}$ such that

\displaystyle\widehat{h}_{0}=\sum_{j=1}^{n_{2}}\gamma_{j}\widehat{\mu}_{W\mid c_{2,j},x_{2,j}}\otimes\phi(c_{2,j}).

(C.1)

From song2009hilbert, we have $\widehat{\mu}_{W\mid c_{2,j},x_{2,j}}=\sum_{i=1}^{n_{1}}b_{i}(c_{2,j},x_{2,j})% \phi(w_{1,i})$ and $b_{i}$ is the $i$ -th element of $b$ , a function on $\mathcal{C}\times\mathcal{X}$ : $b(c,x)=(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}% \left(\Phi_{X_{1}}(x)\odot\Phi_{C_{1}}(c)\right)$ . If we expand (C.1) with the previous expression, we have

\widehat{h}_{0}=\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\alpha_{ij}\phi(w_{1,i})\otimes\phi(c_{2,j}),

where $\alpha_{ij}=b_{i}(c_{2,j},x_{2,j})\gamma_{j}$ . Hence, the rest of the proof focuses on finding the expression for $\alpha_{ij}$ . Following the proof technique developed in (mastouri2021proximal), we introduce the following two lemmas to assist the analysis.

Lemma C.1.

The square of the operator norm of $\widehat{h}_{0}$ , denoted as $|\!|\!|\widehat{h}_{0}|\!|\!|_{{\mathcal{H}_{\mathcal{W}\mathcal{C}}}}^{2}$ , can be represented as

|\!|\!|\widehat{h}_{0}|\!|\!|_{{\mathcal{H}_{\mathcal{W}\mathcal{C}}}}^{2}=\operatorname{vec}(\alpha)^{\top}(\mathcal{K}_{C_{2}}\otimes\mathcal{K}_{W_{1}})\operatorname{vec}(\alpha).

Proof of Lemma C.1.

Write

	$\displaystyle\langle{\widehat{h}_{0}},{\widehat{h}_{0}}\rangle$	$\displaystyle=\left\langle{\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\alpha_{ij}\phi% (w_{1,i})\otimes\phi(c_{2,j})},{\sum_{m=1}^{n_{1}}\sum_{r=1}^{n_{2}}\alpha_{mr% }\phi(w_{1,m})\otimes\phi(c_{2,r})}\right\rangle$
		$\displaystyle=\sum_{i,m=1}^{n_{1}}\sum_{j,r=1}^{n_{2}}\alpha_{ij}\alpha_{mr}k(% w_{1,i},w_{1,m})k(c_{2,j},c_{2,r})$
		$\displaystyle=\mathop{\mathrm{tr}}\left(\alpha^{\top}\mathcal{K}_{W_{1}}\alpha% \mathcal{K}_{C_{2}}\right)$
		$\displaystyle=\operatorname{vec}(\alpha)^{\top}\operatorname{vec}(\mathcal{K}_% {W_{1}}\alpha\mathcal{K}_{C_{2}}).$
Using the fact that $\operatorname{vec}(ABC)=(C^{\top}\otimes A)\operatorname{vec}(B)$ , the above display can be written as
		$\displaystyle=\operatorname{vec}(\alpha)^{\top}(\mathcal{K}_{C_{2}}\otimes% \mathcal{K}_{W_{1}})\operatorname{vec}(\alpha).$

∎
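
As a quick numerical sanity check of Lemma C.1, the following NumPy sketch compares the double-sum, trace, and vectorized forms of the squared norm. The helper `gaussian_gram`, the sample sizes, and the random inputs are illustrative assumptions and are not taken from the paper's implementation.

```python
import numpy as np


def gaussian_gram(a, b, length_scale=1.0):
    """Gaussian Gram matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * length_scale**2))


rng = np.random.default_rng(0)
n1, n2 = 5, 4
w1 = rng.normal(size=(n1, 2))      # illustrative first-stage samples of W
c2 = rng.normal(size=(n2, 2))      # illustrative second-stage samples of C
K_W1 = gaussian_gram(w1, w1)
K_C2 = gaussian_gram(c2, c2)
alpha = rng.normal(size=(n1, n2))  # arbitrary coefficient matrix

# Double sum over (i, m, j, r), as in the second line of the proof.
norm_sum = np.einsum("ij,mr,im,jr->", alpha, alpha, K_W1, K_C2)

# Trace form, as in the third line of the proof.
norm_trace = np.trace(alpha.T @ K_W1 @ alpha @ K_C2)

# Vectorized form; vec stacks columns, hence order="F".
vec_alpha = alpha.flatten(order="F")
norm_vec = vec_alpha @ np.kron(K_C2, K_W1) @ vec_alpha
```

The three quantities agree up to floating-point error; note that the identity relies on the column-major convention for $\operatorname{vec}$.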

Lemma C.2.

For any $c\in\mathcal{C}$ , $x\in\mathcal{X}$ ,

\langle{\widehat{h}_{0}},{\phi(c)\otimes\widehat{\mu}_{W\mid c,x}}\rangle=\Phi% _{C_{2}}(c)^{\top}\alpha^{\top}\mathcal{K}_{W_{1}}(\mathcal{K}_{X_{1}}\odot% \mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}(\Phi_{C_{1}}(c)\odot\Phi_{X_{1}}(x% )).

Proof of Lemma C.2.

Write

	$\displaystyle\langle{\widehat{h}_{0}},{\phi(c)\otimes\widehat{\mu}_{W\mid c,x}}\rangle$	$\displaystyle=\left\langle\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\alpha_{ij}\phi(% w_{1,i})\otimes\phi(c_{2,j}),\phi(c)\otimes\sum_{r=1}^{n_{1}}b_{r}(c,x)\phi(w_% {1,r})\right\rangle$
		$\displaystyle=\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\sum_{r=1}^{n_{1}}\alpha_{ij% }k(w_{1,i},w_{1,r})k(c_{2,j},c)b_{r}(c,x).$
Summing over $i,j$, the above display is equivalent to
		$\displaystyle=\sum_{r=1}^{n_{1}}\Phi_{C_{2}}(c)^{\top}\alpha^{\top}\Phi_{W_{1}% }(w_{1,r})b_{r}(c,x)$
		$\displaystyle=\Phi_{C_{2}}(c)^{\top}\alpha^{\top}\mathcal{K}_{W_{1}}b(c,x)$
		$\displaystyle=\Phi_{C_{2}}(c)^{\top}\alpha^{\top}\mathcal{K}_{W_{1}}(\mathcal{% K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}\left(\Phi_{X_{1}}(x% )\odot\Phi_{C_{1}}(c)\right)$
		$\displaystyle={\left(\Phi_{X_{1}}(x)\odot\Phi_{C_{1}}(c)\right)^{\top}(% \mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}\mathcal{K}% _{W_{1}}}\alpha\Phi_{C_{2}}(c).$

∎

With Lemmas C.1 and C.2, we can write (5.4) as

\displaystyle\frac{1}{2n_{2}}\|y_{2}-D^{\top}\operatorname{vec}(\alpha)\|_{2}^% {2}+\lambda_{2}\operatorname{vec}(\alpha)^{\top}E\operatorname{vec}(\alpha),

(C.2)

where

\displaystyle D=\mathcal{K}_{C_{2}}\overline{\otimes}\left\{\mathcal{K}_{W_{1}% }(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}(\mathcal% {K}_{X_{12}}\odot\mathcal{K}_{C_{12}})\right\},\quad E=\mathcal{K}_{C_{2}}% \otimes\mathcal{K}_{W_{1}}.

Setting the gradient of (C.2) with respect to $\operatorname{vec}(\alpha)$ to zero, we obtain

	$\displaystyle\operatorname{vec}(\alpha)$	$\displaystyle=\left(DD^{\top}+\lambda_{2}n_{2}E\right)^{-1}Dy_{2}.$
Applying the Woodbury matrix identity, the above display is equivalent to
		$\displaystyle=E^{-1}D(\lambda_{2}n_{2}I+D^{\top}E^{-1}D)^{-1}y_{2}.$	(C.3)

Using the fact that for matrices $A,B,C,F$ , $(A\otimes B)(C\overline{\otimes}F)=AC\overline{\otimes}BF$ , we can simplify $E^{-1}D$ as

	$\displaystyle E^{-1}D$	$\displaystyle=\left(\mathcal{K}_{C_{2}}^{-1}\otimes{\mathcal{K}_{W_{1}}^{-1}}% \right)\left[\mathcal{K}_{C_{2}}\overline{\otimes}\left\{\mathcal{K}_{W_{1}}(% \mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}(\mathcal{K% }_{X_{12}}\odot\mathcal{K}_{C_{12}})\right\}\right]$
		$\displaystyle=I\overline{\otimes}(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+% \lambda_{1}n_{1}I)^{-1}(\mathcal{K}_{X_{12}}\odot\mathcal{K}_{C_{12}})$
		$\displaystyle=I\overline{\otimes}\Gamma.$

Hence, using the fact that $(A\overline{\otimes}B)^{\top}(C\overline{\otimes}F)=(A^{\top}C)\odot B^{\top}F$ , we have

\displaystyle D^{\top}E^{-1}D=(\mathcal{K}_{C_{2}}\overline{\otimes}\mathcal{K% }_{W_{1}}\Gamma)^{\top}(I\overline{\otimes}\Gamma)=\mathcal{K}_{C_{2}}\odot(% \Gamma^{\top}\mathcal{K}_{W_{1}}\Gamma)

Hence, we can write (C.3) as

\operatorname{vec}(\alpha)=(I\overline{\otimes}\Gamma)\left\{\lambda_{2}n_{2}I% +\mathcal{K}_{C_{2}}\odot(\Gamma^{\top}\mathcal{K}_{W_{1}}\Gamma)\right\}^{-1}% y_{2}.
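
The closed form above can be sketched in NumPy as follows. The sketch computes $\alpha=\Gamma\,\mathrm{diag}(v)$ with $v=\{\lambda_{2}n_{2}I+\mathcal{K}_{C_{2}}\odot(\Gamma^{\top}\mathcal{K}_{W_{1}}\Gamma)\}^{-1}y_{2}$, which is what multiplying by $I\overline{\otimes}\Gamma$ amounts to, and compares it with the direct solve $(DD^{\top}+\lambda_{2}n_{2}E)^{-1}Dy_{2}$. The function names (`gaussian_gram`, `khatri_rao`, `fit_alpha`), the length scale, and the random data are assumptions made for illustration only.

```python
import numpy as np


def gaussian_gram(a, b, length_scale=1.5):
    """Gaussian Gram matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * length_scale**2))


def khatri_rao(a, b):
    """Column-wise Kronecker product: column k equals kron(a[:, k], b[:, k])."""
    return np.einsum("ik,jk->ijk", a, b).reshape(a.shape[0] * b.shape[0], a.shape[1])


def fit_alpha(K_W1, K_X1, K_C1, K_X12, K_C12, K_C2, y2, lam1, lam2):
    """Closed-form coefficients alpha (n1 x n2) of the estimated bridge function."""
    n1, n2 = K_X12.shape
    Gamma = np.linalg.solve(K_X1 * K_C1 + lam1 * n1 * np.eye(n1), K_X12 * K_C12)
    Sigma = K_C2 * (Gamma.T @ K_W1 @ Gamma)
    v = np.linalg.solve(lam2 * n2 * np.eye(n2) + Sigma, y2)
    # (I khatri-rao Gamma) v is the column-major vec of Gamma @ diag(v).
    return Gamma * v[None, :], Gamma


rng = np.random.default_rng(0)
n1, n2, d = 6, 5, 3
x1, c1, w1 = (rng.normal(size=(n1, d)) for _ in range(3))  # first-stage sample
x2, c2 = (rng.normal(size=(n2, d)) for _ in range(2))      # second-stage sample
y2 = rng.normal(size=n2)
lam1, lam2 = 0.1, 0.1

K_W1 = gaussian_gram(w1, w1)
K_X1, K_C1 = gaussian_gram(x1, x1), gaussian_gram(c1, c1)
K_X12, K_C12 = gaussian_gram(x1, x2), gaussian_gram(c1, c2)
K_C2 = gaussian_gram(c2, c2)

alpha, Gamma = fit_alpha(K_W1, K_X1, K_C1, K_X12, K_C12, K_C2, y2, lam1, lam2)

# Direct solve of the stationarity condition, before the Woodbury step.
D = khatri_rao(K_C2, K_W1 @ Gamma)
E = np.kron(K_C2, K_W1)
vec_alpha_direct = np.linalg.solve(D @ D.T + lam2 * n2 * E, D @ y2)
```

The closed form only requires solving one $n_{1}\times n_{1}$ and one $n_{2}\times n_{2}$ linear system, whereas the direct solve involves an $n_{1}n_{2}\times n_{1}n_{2}$ matrix, which is the practical reason for the simplification.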

C.2 Proof of Kernel Bridge Function $m_{0}$

We begin by stating the result.

Proposition C.3.

Let $\mathcal{K}_{W_{3}}\in\mathbb{R}^{n_{3}\times n_{3}}$ , $\mathcal{K}_{X_{4}}\in\mathbb{R}^{n_{4}\times n_{4}}$ be the Gram matrices of $W_{3}$ and $X_{4}$ , respectively. Let $\mathcal{K}_{X_{34}}\in\mathbb{R}^{n_{3}\times n_{4}}$ , $\mathcal{K}_{Z_{34}}\in\mathbb{R}^{n_{3}\times n_{4}}$ be the cross Gram matrices of $(X_{3},X_{4})$ and $(Z_{3},Z_{4})$ , respectively. For any $\lambda_{4}>0$ , there exists a unique optimal solution to (5.6) of the form

	$\displaystyle\widehat{m}_{0}$	$\displaystyle=\sum_{i=1}^{n_{3}}\sum_{j=1}^{n_{4}}\alpha_{ij}\phi(w_{3,i})% \otimes\phi(x_{4,j});$
	$\displaystyle\text{vec}(\alpha)$	$\displaystyle=(I\overline{\otimes}\Gamma)(\lambda_{4}n_{4}I+\Sigma)^{-1}y_{4},$

where $\Sigma=(\Gamma^{\top}\mathcal{K}_{W_{3}}\Gamma)\odot\mathcal{K}_{X_{4}}$ , $\Gamma=(\mathcal{K}_{X_{3}}\odot\mathcal{K}_{Z_{3}}+\lambda_{3}n_{3}I)^{-1}(% \mathcal{K}_{X_{34}}\odot\mathcal{K}_{Z_{34}})$ , and $y_{4}=\begin{bmatrix}y_{4,1},\ldots,y_{4,n_{4}}\end{bmatrix}^{\top}$ .

The proof of Proposition C.3 follows exactly as the proof of Proposition 5.1, with $X$ replaced by $Z$ and $C$ replaced by $X$ .

C.3 Estimation with discrete $Z$ or $C$

When $C$ (or $Z$) is a discrete variable, a more efficient alternative to the estimator introduced in Section 5.1, which requires kernelized features of $C$ (or $Z$), is to solve a separate regression of $W$ on $X$ for each $c\in\mathcal{C}$ (or $z\in\mathcal{Z}$). Defining the index set $\Xi_{1}(c)=\{i:c_{1,i}=c,\,i=1,\ldots,n_{1}\}$, we modify (5.3) as

	$\displaystyle\widehat{\mu}_{W\mid c,x}^{p}$	$\displaystyle=\sum_{i=1}^{n_{1}}b_{i}(x)\phi(w_{1,i})\mathds{1}(c_{1,i}=c);$
	$\displaystyle b(x)$	$\displaystyle=(\mathcal{K}_{X_{1,c}}+\lambda_{1}I)^{-1}{\Phi_{X_{1,c}}(x)},$

where $\mathcal{K}_{X_{1,c}}=[k(x_{1,i},x_{1,j})]_{i,j}$ and $\Phi_{X_{1,c}}=\begin{bmatrix}\phi(x_{1,i})\end{bmatrix}_{i}^{\top}$ with $i,j\in\Xi_{1}(c)$. Alternatively, one can apply the form in (5.3) with a binary kernel on $C$ (or $Z$).
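
The two options for discrete $C$ can be compared numerically. The sketch below computes the embedding weights once by a separate regression restricted to $\Xi_{1}(c)$ and once with the full form using a binary kernel on $C$. With the same ridge term, the two coincide because $\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}$ is block diagonal under a binary kernel. Function names, the ridge value `reg`, and the toy data are illustrative assumptions; how the ridge term is scaled with the sample size is left to the caller.

```python
import numpy as np


def gaussian_gram(a, b, length_scale=1.0):
    """Gaussian Gram matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * length_scale**2))


def cme_weights_per_category(x1, c1, x, c, reg):
    """Weights on phi(w_{1,i}): ridge regression using only samples with c1 == c."""
    idx = np.flatnonzero(c1 == c)
    K = gaussian_gram(x1[idx], x1[idx])
    phi = gaussian_gram(x1[idx], x[None, :])[:, 0]
    weights = np.zeros(len(c1))
    weights[idx] = np.linalg.solve(K + reg * np.eye(len(idx)), phi)
    return weights


def cme_weights_binary_kernel(x1, c1, x, c, reg):
    """Weights on phi(w_{1,i}): full form with a binary (indicator) kernel on C."""
    K_X = gaussian_gram(x1, x1)
    K_C = (c1[:, None] == c1[None, :]).astype(float)
    phi = gaussian_gram(x1, x[None, :])[:, 0] * (c1 == c).astype(float)
    return np.linalg.solve(K_X * K_C + reg * np.eye(len(c1)), phi)


rng = np.random.default_rng(0)
x1 = rng.normal(size=(12, 2))
c1 = np.array([0, 1, 2] * 4)       # toy discrete concept with three values
x_query, c_query = rng.normal(size=2), 1
reg = 0.5

b_split = cme_weights_per_category(x1, c1, x_query, c_query, reg)
b_full = cme_weights_binary_kernel(x1, c1, x_query, c_query, reg)
```

The per-category version solves one small system per value of $c$ instead of one $n_{1}\times n_{1}$ system, which is where the efficiency gain comes from.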

Appendix D Experiments

In this section, we discuss the experimental settings and implementation details. We start by introducing the implementation details of all the baselines and the proposed method. Then, we discuss the experimental settings.

D.1 Baselines of Adaptation with Concepts and Proxies

We introduce the baseline methods for the adaptation task with $C$ and $W$. These are COVARS, LABELS, ORACLE, LSA-W, LSA-S, and LSA-S w/ target $W$, alongside the proposed method. To select the parameters for the regression task on dSprites, we apply five-fold cross-validation with mean squared error as the metric to select the kernel length scale and the ridge regularization penalty.

COVARS. We fit a domain classifier using logistic regression, compute instance weights following shimodaira2000improving, and learn a weighted kernel ridge regressor with a Gaussian kernel function on the source training samples.

LABELS. The label shift baseline assumes oracle access to labels in the target domain. For the classification task, we compute instance weights $q(Y)/p(Y)$ using the observed frequencies in the validation set for the source domain and the training set for the target domain. For the regression task, we compute the weights by fitting a Gaussian kernel density estimator using the source validation set and the target training set separately. We then use the fitted densities to estimate $q(Y)/p(Y)$ for each sample in the source training set. Finally, we learn a sample-weighted kernel ridge regressor with a Gaussian kernel on the source training samples.
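
For the classification case, the frequency-based weights $q(Y)/p(Y)$ can be sketched as below. The function name `label_shift_weights` and the toy label arrays are hypothetical; the sketch only illustrates the ratio-of-frequencies step, not the subsequent weighted kernel ridge fit.

```python
import numpy as np


def label_shift_weights(y_source_ref, y_target_ref, y_train, n_classes):
    """Per-sample weights q(y)/p(y) estimated from observed class frequencies."""
    p = np.bincount(y_source_ref, minlength=n_classes) / len(y_source_ref)
    q = np.bincount(y_target_ref, minlength=n_classes) / len(y_target_ref)
    # Classes never observed in the source reference set get weight zero.
    ratio = np.divide(q, p, out=np.zeros_like(q), where=p > 0)
    return ratio[y_train]


y_source_val = np.array([0, 0, 0, 1])    # toy stand-in for the source validation labels
y_target_train = np.array([0, 1, 1, 1])  # toy stand-in for the target training labels
y_source_train = np.array([0, 1, 0])     # toy stand-in for the source training labels

weights = label_shift_weights(y_source_val, y_target_train, y_source_train, n_classes=2)
```

In this toy example $p=(0.75,0.25)$ and $q=(0.25,0.75)$, so samples with $y=0$ are down-weighted by $1/3$ and samples with $y=1$ are up-weighted by $3$.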

ORACLE. For regression tasks, we learn a kernel ridge regressor with a Gaussian kernel on target training samples. For the classification task, we use a standard MLP trained with samples from the target domain. Details of the model structure are documented in Section D.2.

LSA-W. The estimation procedure follows Section 6 in alabdulmohsin2023adapting. In this case, we discretize the values of $W$ by applying the additional transform $\mathrm{sign}(w)$ to each sample $w$.

LSA-S. The estimation procedure follows Algorithms 2–5 in alabdulmohsin2023adapting.

LSA-S w/ target $W$. We briefly describe the procedure to incorporate target $W$ into LSA-S. alabdulmohsin2023adapting showed that $Q(Y\mid x)$ can be decomposed as

	$\displaystyle Q(Y\mid x)$	$\displaystyle=\sum_{\widetilde{u}}\underbrace{P(Y\mid\widetilde{u},x)}_{(a)}% \underbrace{Q(\widetilde{u}\mid x)}_{(b)}$		(D.1)
		$\displaystyle\propto\sum_{\widetilde{u}}\underbrace{P(Y\mid\widetilde{u},x)}_{% (a)}\underbrace{P(\widetilde{u}\mid x)}_{(c)}\underbrace{\frac{Q(\widetilde{u}% )}{P(\widetilde{u})}}_{(d)}\frac{P(x)}{Q(x)},$		(D.2)

where $\widetilde{u}$ is a permutation of original $u$ . Both LSA-WAE and LSA-S are multi-stage procedures to compute (a), (c), (d) individually and combine the results using formula (D.2) to obtain the predicted target distribution. Step (a) corresponds to Algorithm 5, (c) corresponds to Equation (17), and (d) corresponds to Algorithm 4 in (alabdulmohsin2023adapting).

With the additional $W$ from the target domain, we can obtain (b) by slightly modifying one estimation step in LSA-S. We test this procedure, which we call LSA-S w/ target $W$, with (c) and (d) replaced by (b). Suppose that $U$ takes values in $\{1,\ldots,k_{U}\}$ and let $\widetilde{U}$ be a permutation of $U$. Define the matrix ${\bf G}$ as:

{\bf G}=\begin{bmatrix}\langle{\widehat{P}(W\mid\widetilde{U}=1)},{\widehat{P}% (W\mid\widetilde{U}=1)}\rangle&\cdots&\langle{\widehat{P}(W\mid\widetilde{U}=1% )},{\widehat{P}(W\mid\widetilde{U}=k_{U})}\rangle\\ \vdots&\ddots&\vdots\\ \langle{\widehat{P}(W\mid\widetilde{U}=k_{U})},{\widehat{P}(W\mid\widetilde{U}% =1)}\rangle&\cdots&\langle{\widehat{P}(W\mid\widetilde{U}=k_{U})},{\widehat{P}% (W\mid\widetilde{U}=k_{U})}\rangle\end{bmatrix},

where $\widehat{P}(W\mid\widetilde{U}=i)$ is the estimated conditional kernel density function obtained by Algorithm 3 in alabdulmohsin2023adapting. Step (b) is computed by solving the following least-squares problem:

	$\displaystyle\widehat{Q}(\widetilde{\mathbf{U}}\mid x)=$	$\displaystyle\arg\min\left\|\begin{bmatrix}\langle{\widehat{Q}(W\mid x)},{\widehat{P}(W\mid\widetilde{U}=1)}\rangle\\ \vdots\\ \langle{\widehat{Q}(W\mid x)},{\widehat{P}(W\mid\widetilde{U}=k_{U})}\rangle\end{bmatrix}-{\bf G}\begin{bmatrix}Q(\widetilde{U}=1\mid x)\\ \vdots\\ Q(\widetilde{U}=k_{U}\mid x)\end{bmatrix}\right\|_{F}^{2},$
		$\displaystyle\text{subject to }\quad 0\leq Q(\widetilde{U}=i\mid x)\leq 1,% \quad i=1,\ldots,k_{U};$
		$\displaystyle\quad\quad\quad\quad\quad\sum_{i=1}^{k_{U}}Q(\widetilde{U}=i\mid x% )=1.$

Then, we compute the predicted conditional probability based on (D.1).
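
The constrained least-squares problem above has no closed form, and the text does not specify a solver. One possible approach, shown here purely as an assumption, is projected gradient descent with a Euclidean projection onto the probability simplex (which also enforces the box constraints). The names `project_simplex` and `solve_simplex_least_squares` and the toy matrix `G` are hypothetical.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto {q : q >= 0, sum(q) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def solve_simplex_least_squares(G, g, n_iter=2000):
    """Minimize ||g - G q||^2 over the simplex by projected gradient descent."""
    q = np.full(G.shape[0], 1.0 / G.shape[0])
    # Step size 1/L, where L = 2 * ||G||_2^2 is the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(G, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * G.T @ (G @ q - g)
        q = project_simplex(q - step * grad)
    return q


# Toy check: G plays the role of the Gram matrix of the estimated P(W | U = i),
# and g the vector of inner products with the estimated Q(W | x).
G = np.array([[1.0, 0.3], [0.3, 1.0]])
q_true = np.array([0.8, 0.2])
g = G @ q_true
q_hat = solve_simplex_least_squares(G, g)
```

When $g$ lies exactly in the image of the simplex under ${\bf G}$ and ${\bf G}$ is invertible, as in this toy check, the minimizer recovers the mixing weights.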

Proposed Method. For the regression task using the dSprites dataset, we employ the Gaussian kernel function as the feature map for both $X$ and $W$. In the classification task, we also use the Gaussian kernel function for $X$ and $W$. Additionally, we use a columnwise binary kernel for $C$, which applies a binary kernel to each entry and takes the product of the outputs. To compute $\widehat{h}_{0}$, we apply a one-hot encoder to $Y$ and apply the result in Proposition 5.1. To choose the kernel length scale for the classification task, we use the validation set with the AUROC metric.
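
A minimal sketch of the columnwise binary kernel and the one-hot encoding is given below, assuming $C$ is stored as an integer array with one row per sample. The function names and toy inputs are illustrative.

```python
import numpy as np


def columnwise_binary_kernel(c_a, c_b):
    """k(c, c') = prod_d 1[c_d == c'_d], evaluated for all pairs of rows."""
    equal = c_a[:, None, :] == c_b[None, :, :]
    # A product of 0/1 indicators equals 1 exactly when every entry matches.
    return np.all(equal, axis=2).astype(float)


def one_hot(y, n_classes):
    """One-hot encode integer labels, one row per sample."""
    return np.eye(n_classes)[y]


c = np.array([[0, 1, 1],
              [0, 1, 1],
              [1, 1, 0]])
K_C = columnwise_binary_kernel(c, c)
Y = one_hot(np.array([0, 1, 1]), n_classes=2)
```

With one-hot labels, the closed form of Proposition 5.1 can be applied to each label column separately, since the solution is linear in $y_{2}$.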

D.2 Baselines of Multi-Source Adaptation

For the first three baselines (Cat-ERM, Avg-ERM, and SA), we use a standard MLP as the backbone. It has a single hidden layer of size $100$ with ReLU activation functions. The network is trained using the Adam optimizer (kingma2014adam) with learning rate $10^{-3}$. The batch size is set to $200$ and the maximum number of iterations is set to $300$.

Cat-ERM. We concatenate all the samples across environments into one dataset. Then, we train the model with a standard MLP model as specified above.

Avg-ERM. For each environment, we train a standard MLP model. During testing, we take the average of predictions from all models.

Simple Adaptation (SA) (mansour2008domain). To implement the method, we build kernel density estimators with a Gaussian kernel function to estimate the density $p_{r}(x)$ for $r=1,\ldots,k_{Z}$. We then reweight the output of each domain's classifier, a standard MLP, with the normalized weight $p_{r}(x_{\text{new}})/\left\{\sum_{r^{\prime}}p_{r^{\prime}}(x_{\text{new}})\right\}$. The kernel length scale is chosen using five-fold cross-validation with the AUROC metric.
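
The reweighting step can be sketched as follows. The kernel density estimator is written out by hand, and the per-domain classifier outputs are replaced by hypothetical constants, so this is only an illustration of how the normalized density weights combine the predictions.

```python
import numpy as np


def gaussian_kde_density(x_train, x_query, bandwidth):
    """Gaussian kernel density estimate at each row of x_query."""
    n, d = x_train.shape
    sq_dist = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * bandwidth**2) ** (d / 2.0)
    return np.exp(-sq_dist / (2.0 * bandwidth**2)).sum(axis=1) / (n * norm)


def simple_adaptation(prob_per_domain, dens_per_domain):
    """Combine per-domain predictions (k x m) using normalized densities (k x m)."""
    weights = dens_per_domain / dens_per_domain.sum(axis=0, keepdims=True)
    return (weights * prob_per_domain).sum(axis=0), weights


rng = np.random.default_rng(0)
# Two toy source domains with well-separated covariates.
x_domains = [rng.normal(-2.0, 1.0, size=(200, 1)), rng.normal(2.0, 1.0, size=(200, 1))]
x_new = np.array([[-2.0], [2.0]])
dens = np.stack([gaussian_kde_density(x_d, x_new, bandwidth=0.5) for x_d in x_domains])

# Hypothetical classifier outputs P(Y=1 | x_new) from each domain's model.
probs = np.array([[0.9, 0.9], [0.1, 0.1]])
combined, weights = simple_adaptation(probs, dens)
```

Each query point is dominated by the prediction of the domain under which it is most likely, so `combined` is close to $0.9$ at $x=-2$ and close to $0.1$ at $x=2$ in this toy setup.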

Marginal Kernel (MK) (blanchard2011generalizing). This method involves a kernel SVM with a product kernel on $(\mathcal{X},P(X))$. For any $x,x^{\prime}\in\mathcal{X}$ and distributions $P,P^{\prime}$ on $X$, the kernel function is defined as $k((x,P),(x^{\prime},P^{\prime}))=k_{1}(x,x^{\prime})k_{2}(P,P^{\prime})$. Let $n$ be the number of samples. Here $k_{1}$ is a Gaussian kernel function, and $k_{2}$ is the mean of the Gram matrix $[k(x_{i},x_{j}^{\prime})]_{ij}\in\mathbb{R}^{n\times n}$, where $x_{i}$ for $i=1,\ldots,n$ are i.i.d. samples from $P$ and $x_{j}^{\prime}$ for $j=1,\ldots,n$ are i.i.d. samples from $P^{\prime}$. To accommodate the large dataset, we precompute the Gram matrix and apply it to a linear classifier trained using Stochastic Gradient Descent (SGD) implemented in the package scikit-learn (scikit-learn). The kernel length scale is chosen using five-fold cross-validation with the AUROC metric.
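
The product kernel can be sketched as below, with $k_{2}$ computed as the mean of the cross Gram matrix between two samples. The function names, length scales, and toy samples are assumptions for illustration; the SVM or SGD training step is not shown.

```python
import numpy as np


def gaussian_gram(a, b, length_scale=1.0):
    """Gaussian Gram matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * length_scale**2))


def distribution_kernel(sample_p, sample_q, length_scale=1.0):
    """k_2(P, P'): mean of the Gram matrix between samples from P and P'."""
    return float(gaussian_gram(sample_p, sample_q, length_scale).mean())


def marginal_kernel(x, sample_p, x_prime, sample_q, ls_x=1.0, ls_dist=1.0):
    """Product kernel k((x, P), (x', P')) = k_1(x, x') * k_2(P, P')."""
    k1 = float(gaussian_gram(x[None, :], x_prime[None, :], ls_x)[0, 0])
    return k1 * distribution_kernel(sample_p, sample_q, ls_dist)


rng = np.random.default_rng(0)
sample_p = rng.normal(0.0, 1.0, size=(50, 2))   # toy i.i.d. sample from P
sample_q = rng.normal(1.0, 1.0, size=(50, 2))   # toy i.i.d. sample from P'
x, x_prime = sample_p[0], sample_q[0]

k_pq = marginal_kernel(x, sample_p, x_prime, sample_q)
k_qp = marginal_kernel(x_prime, sample_q, x, sample_p)
k_pp = marginal_kernel(x, sample_p, x, sample_p)
```

Because $k_{2}$ depends only on the pair of domains, it can be computed once per domain pair and reused for all points in those domains.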

Weighted Combination of Source Classifiers (WCSC) (zhang2015multi). For each source environment, we estimate the conditional density of $X$ given $y$ using a kernel density estimator with the Gaussian kernel function. The rest of the estimation procedure follows Section 2 in zhang2015multi. The kernel length scale is chosen using five-fold cross-validation with the AUROC metric.

Proposed Method. We use a columnwise Gaussian kernel function as the feature map of $X$ and a Gaussian kernel function as the feature map of $W$. The conditional mean embedding $\widehat{\mu}_{W\mid x,z}^{p}$ is estimated using the approach introduced in Section C.3. The analytical solution of $\widehat{m}_{0}$ is given in Proposition C.3. All kernel length scales and the regularization parameters $\lambda_{3}$, $\lambda_{4}$ are selected using five-fold cross-validation with the AUROC metric.

ORACLE. The model is $\langle{\widehat{m}_{0}},{\widehat{\mu}_{W\mid x}^{q}}\rangle$, where both the bridge function $\widehat{m}_{0}$ and $\widehat{\mu}_{W\mid x}^{q}$ are estimated using the target dataset, with the number of training samples equal to the number of training samples in the source domain. All kernel length scales and the regularization parameters $\lambda_{3}$, $\lambda_{4}$ are selected using five-fold cross-validation with the AUROC metric.

D.3 Classification Task

The classification task discussed in Section D.6 was first introduced by alabdulmohsin2023adapting. Letting $\mathbf{o}(\cdot)$ be the one-hot encoder, we follow their data generation procedure and generate samples as follows:

	$\displaystyle U\sim$	$\displaystyle\;\textrm{Categorical}(\bm{\pi});$
	$\displaystyle W\mid U=u\sim$	$\displaystyle\;\mathcal{N}\big(\mathbf{o}(u)\mathbf{M}_{W\mid U},1\big);$
	$\displaystyle X\mid U=u\sim$	$\displaystyle\;\mathcal{N}\big(\mathbf{o}(u)\mathbf{M}_{X\mid U},\mathbf{I}_{k_{X}}\big);$
	$\displaystyle C_{i}\mid X=x,U=u\sim$	$\displaystyle\;\textrm{Bernoulli}\Big(\mathrm{logit}^{-1}\big([x\mathbf{M}_{C\mid X,U=u}+\mathbf{o}(u)\mathbf{M}_{C\mid U}]_{i}\big)\Big);$
	$\displaystyle Y\mid C=c,U=u\sim$	$\displaystyle\;\textrm{Bernoulli}\Big(\mathrm{logit}^{-1}\big(c\mathbf{M}_{Y\mid C,U=u}+\mathbf{o}(u)\mathbf{M}_{Y\mid U}\big)\Big),$

where the matrices are defined as

	$\displaystyle\mathbf{M}_{W\mid U}:=\begin{bmatrix}-1&1\end{bmatrix}^{\top},\quad\mathbf{M}_{X\mid U}:=a_{w}\begin{bmatrix}-1&1\\ 1&-1\end{bmatrix},\quad\mathbf{M}_{C\mid U}:=\begin{bmatrix}-2&2&2\\ -1&1&2\end{bmatrix};$
	$\displaystyle\mathbf{M}_{C\mid X,U=u_{0}}:=3\begin{bmatrix}-2&2&-1\\ 1&-2&-3\end{bmatrix},\quad\mathbf{M}_{C\mid X,U=u_{1}}:=3\begin{bmatrix}2&-2&1\\ -1&2&3\end{bmatrix};$
	$\displaystyle\mathbf{M}_{Y\mid U}:=\begin{bmatrix}2&2\end{bmatrix}^{\top},\quad\mathbf{M}_{Y\mid C,U=u_{0}}:=\begin{bmatrix}3&-2&-1\end{bmatrix}^{\top},\quad\mathbf{M}_{Y\mid C,U=u_{1}}:=\begin{bmatrix}3&-1&-2\end{bmatrix}^{\top}.$

We set the coefficient $a_{w}=1$ in Figure 2. Figure 4 displays additional results where $a_{w}=2,3$. We generate $7000$ training samples, $1000$ validation samples, and $2000$ testing samples for the classification task with concepts and proxies.
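
The data generation process above can be sketched in NumPy as follows. Since $\bm{\pi}$ is not specified at this point, the sketch takes $P(U=0)$ as an argument named `p_u0`, which is an assumption, and treats $U$ as binary with values $u_{0}=0$ and $u_{1}=1$ to match the two conditional matrices. All function and variable names are illustrative.

```python
import numpy as np

M_W_U = np.array([-1.0, 1.0])                           # mean of W for u = 0, 1
M_X_U_BASE = np.array([[-1.0, 1.0], [1.0, -1.0]])       # scaled by a_w below
M_C_U = np.array([[-2.0, 2.0, 2.0], [-1.0, 1.0, 2.0]])
M_C_XU = 3.0 * np.array([[[-2.0, 2.0, -1.0], [1.0, -2.0, -3.0]],   # U = u_0
                         [[2.0, -2.0, 1.0], [-1.0, 2.0, 3.0]]])    # U = u_1
M_Y_U = np.array([2.0, 2.0])
M_Y_CU = np.array([[3.0, -2.0, -1.0],                   # U = u_0
                   [3.0, -1.0, -2.0]])                  # U = u_1


def inv_logit(t):
    return 1.0 / (1.0 + np.exp(-t))


def sample_classification_task(n, p_u0, a_w=1.0, seed=0):
    """Draw (U, W, X, C, Y); p_u0 = P(U = 0) is an assumed parameterization."""
    rng = np.random.default_rng(seed)
    u = (rng.random(n) >= p_u0).astype(int)             # U = 0 with probability p_u0
    w = rng.normal(M_W_U[u], 1.0)                       # o(u) M_{W|U} selects a row
    x = rng.normal(a_w * M_X_U_BASE[u], 1.0)            # identity covariance
    logits_c = np.einsum("nd,ndk->nk", x, M_C_XU[u]) + M_C_U[u]
    c = rng.binomial(1, inv_logit(logits_c))
    logits_y = np.einsum("nk,nk->n", c, M_Y_CU[u]) + M_Y_U[u]
    y = rng.binomial(1, inv_logit(logits_y))
    return u, w, x, c, y


u, w, x, c, y = sample_classification_task(n=7000, p_u0=0.9)
```

Multiplying the one-hot vector $\mathbf{o}(u)$ by a matrix simply selects the row indexed by $u$, which is why the sketch indexes the arrays with `u` directly.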

In the multi-domain case, we construct $3$ different tasks. Task $1$ is composed of $z_{1},z_{2},z_{3}$ such that $P(U=0\mid{z_{1}})=0.1$, $P(U=0\mid{z_{2}})=0.2$, $P(U=0\mid{z_{3}})=0.3$ and a target domain with $Q(U=0)=0.9$. For Task $2$, we select $z_{4},z_{5},z_{6}$ such that $P(U=0\mid z_{4})=0.4$, $P(U=0\mid{z_{5}})=0.5$, $P(U=0\mid{z_{6}})=0.6$ and $Q(U=0)=0.9$. For Task $3$, we select $z_{7},z_{8},z_{9}$ such that $P(U=0\mid z_{7})=0.7$, $P(U=0\mid{z_{8}})=0.8$, $P(U=0\mid{z_{9}})=0.9$ and $Q(U=0)=0.4$. The results are shown in Tables 1 and 2.

D.4 Comparison to Domain Generalization Baselines

Table 2: Multi-domain generalization vs. (proposed) adaptation result. The values are the average AUROC of $10$ independent runs drawn from the data generating process. Each task has three source domains with different $P_{r}(U)$ and one target domain. The proposed method outperforms all domain generalization benchmarks across all tasks.

	ORACLE	ARM	CDANN	CORAL	DANN	GroupDRO	IRM	MMD	VREx	Proposed
Task 1	$0.9425\pm 0.0039$	$0.8065\pm 0.0247$	$0.8061\pm 0.0252$	$0.8030\pm 0.0236$	$0.8039\pm 0.0229$	$0.7954\pm 0.0323$	$0.7989\pm 0.0283$	$0.8055\pm 0.0248$	$0.8010\pm 0.0279$	$\mathbf{0.8848}\pm 0.0120$
Task 2	$0.9431\pm 0.0061$	$0.9143\pm 0.0150$	$0.9159\pm 0.0125$	$0.9158\pm 0.0132$	$0.9158\pm 0.0125$	$0.9160\pm 0.0125$	$0.9131\pm 0.0135$	$0.9149\pm 0.0135$	$0.9136\pm 0.0124$	$\mathbf{0.9318}\pm 0.0063$
Task 3	$0.8876\pm 0.0085$	$0.8470\pm 0.0171$	$0.8456\pm 0.0164$	$0.8473\pm 0.0163$	$0.8480\pm 0.0166$	$0.8487\pm 0.0185$	$0.8469\pm 0.0186$	$0.8470\pm 0.0181$	$0.8470\pm 0.0132$	$\mathbf{0.8569}\pm 0.0095$

Given that we observe multiple domains at test time, a natural question is: does adaptation give us an advantage over generalization? In generalization, we cannot assume access to any observations from the target domain. We compare our adaptation method with multi-domain generalization baselines (muandet2013domain): Adaptive Risk Minimization (ARM) (zhang2021adaptive), Conditional Domain Adversarial Neural Networks (CDANN) (long2018conditional), Correlation Alignment (CORAL) (sun2016deep), Domain Adversarial Neural Networks (DANN) (ganin2016domain), Distributionally Robust Optimization for Group Shifts (GroupDRO) (sagawa2019distributionally), Invariant Risk Minimization (IRM) (arjovsky2019invariant), Maximum Mean Discrepancy (MMD) (Borgwardt2006IntegratingSB), and Risk Extrapolation (VREx) (krueger2021out).

In Table 2, we show that our proposed method for domain adaptation in the multi-domain setting outperforms the state-of-the-art multi-domain generalization methods.

D.5 Regression Tasks

We consider three tasks. We first introduce the simulated tasks and then discuss the task on the dSprites data (dsprites17).

D.5.1 Simulated Dataset

We consider the following data generation process.

Simulated regression task 1.

$\displaystyle U$	$\displaystyle=Ber(a);$
$\displaystyle X$	$\displaystyle=\mathcal{N}(0,1);$
$\displaystyle Y$	$\displaystyle=-X{\bf 1}_{\left(U=0\right)}+X{\bf 1}_{\left(U=1\right)};$	(D.3)
$\displaystyle W$	$\displaystyle=\mathcal{N}(-1,0.01){\bf 1}_{\left(U=0\right)}+\mathcal{N}(1,0.0% 1){\bf 1}_{\left(U=1\right)}.$

There are two source domains. We set $a=0.1$ for source domain $z_{1}$ and $a=0.9$ for source domain $z_{2}$. According to the data generation process (D.3), with $a=P(U=1)$, $Y$ is mostly negatively correlated with $X$ in domain $z_{1}$ and mostly positively correlated with $X$ in domain $z_{2}$. For each domain, we synthesize $2000$ training samples and $1000$ testing samples. We sweep across $a\in\{0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\}$ in the target domain. We run $10$ replications, and the results are shown in Figure 5. In the next task, we set $U$ to be a continuous random variable following a Beta distribution.

In this task, we expect the Cat-ERM method to fail drastically: we anticipate that the predicted $Y$ versus $X$ is a flat line, since the prediction would be an average of the downward sloping line $(U=0)$ and the upward sloping line $(U=1)$. The result in Figure 5 supports our hypothesis, as the mean squared error remains nearly flat as we vary the target distribution $Q(U)$.
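
The flat-line behaviour of a pooled fit can be illustrated with a small simulation of (D.3). The sketch below uses a no-intercept least-squares slope as a simple stand-in for the MLP used by Cat-ERM, takes $\mathrm{Ber}(a)$ to mean $P(U=1)=a$, and reads $0.01$ as the variance of $W$; these choices and the function names are assumptions.

```python
import numpy as np


def sample_task1(n, a, rng):
    """Draw (U, X, Y, W) from simulated regression task 1."""
    u = rng.binomial(1, a, size=n)                     # assumes P(U = 1) = a
    x = rng.normal(0.0, 1.0, size=n)
    y = np.where(u == 1, x, -x)                        # Y = -X if U = 0, +X if U = 1
    w = rng.normal(np.where(u == 1, 1.0, -1.0), 0.1)   # assumes 0.01 is a variance
    return u, x, y, w


def ols_slope(x, y):
    """Least-squares slope of y on x without intercept."""
    return float(x @ y / (x @ x))


rng = np.random.default_rng(0)
_, x_z1, y_z1, _ = sample_task1(2000, a=0.1, rng=rng)  # source domain z_1
_, x_z2, y_z2, _ = sample_task1(2000, a=0.9, rng=rng)  # source domain z_2

slope_z1 = ols_slope(x_z1, y_z1)
slope_z2 = ols_slope(x_z2, y_z2)
slope_cat = ols_slope(np.concatenate([x_z1, x_z2]), np.concatenate([y_z1, y_z2]))
```

In expectation the per-domain slope is $2a-1$, so the two source domains give slopes near $-0.8$ and $0.8$, while the concatenated data give a slope near $0$ regardless of the target $Q(U)$.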

Simulated regression task 2.

	$\displaystyle U$	$\displaystyle=Beta(a,b)$
	$\displaystyle X$	$\displaystyle=\mathcal{N}(0,1)$
	$\displaystyle Y$	$\displaystyle=(2U-1)X$
	$\displaystyle W$	$\displaystyle=\mathcal{N}(-1,0.01)(1-U)+\mathcal{N}(1,0.01)U.$

There are two source domains, corresponding to two draws from $P(Z)$, which we write as $z_{r}=(a,b)$. We set $a=2,b=4$ for the first source domain $r=1$, and $a=4,b=2$ for the second source domain $r=2$. The corresponding distributions over $U$ are shown in Figure 6. Under this setting, we test the target domain with $a,b=1,\ldots,5$, with distributions shown in Figure 6. For each domain, we synthesize $2000$ training samples and $1000$ testing samples. We run $10$ replications, and the results are shown in Figure 5.

D.6 Adaptation with Concepts and Proxies

D.6.1 dSprites Dataset

We test the proposed procedure on the dSprites dataset (dsprites17), an image dataset described by five latent parameters (shape, scale, rotation, posX, and posY). Motivated by dsprites17’s experiments, we design a regression task where the dSprites images (64 $\times$ 64 = 4096-dimensional) are $X\in\mathbb{R}^{64\times 64}$ and subject to a nonlinear confounder $U\in[0,2\pi]$ which is a rotation of the image (Figure 7). We fix all other latent parameters – shape is heart, scale is maximized, and all others are set to their 0’th position. $W\in\mathbb{R}$ and $C\in\mathbb{R}$ are continuous random variables. The data generation process is defined as follows

	$\displaystyle U^{p}$	$\displaystyle\sim 2\pi\text{Beta}(2,4),\quad U^{q}\sim\text{Uniform}(a,2\pi);$
	$\displaystyle X$	$\displaystyle=\text{Rotate}(\text{image},U\text{ rads})+\eta,\quad\eta\sim% \mathcal{N}(0,0.01I_{64});$
	$\displaystyle C$	$\displaystyle=\Bigg(\frac{0.1\|X^{T}A\|_{2}^{2}-5000}{2000}\Bigg)^{2}+U+\gamma;$
	$\displaystyle A$	$\displaystyle\sim\text{Uniform}(0,1),\,A\in\mathbb{R}^{4096\times 10},\quad% \gamma\sim\mathcal{N}(0,0.5);$
	$\displaystyle Y$	$\displaystyle=\frac{1}{4}C+\frac{1}{20}\sin(U)+\varepsilon,\quad\varepsilon% \sim\mathcal{N}(0,0.1);$
	$\displaystyle W$	$\displaystyle=\cos(U)+\nu,\quad\nu\sim\mathcal{N}(0,0.25).$

When fitting all models (both the baselines and the proposed method), we project the images from $\mathbb{R}^{4096}$ to $\mathbb{R}^{16}$ via Gaussian Random Projection using the scikit-learn implementation (Bingham2001RandomPI; scikit-learn). Additionally, for the proposed method, we use a Gaussian kernel as the feature map for $X$ and $C$.
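
The projection step can be illustrated with a hand-written Gaussian random projection in NumPy. This is a sketch of the general technique rather than the scikit-learn code path used in the experiments; the scaling of the projection matrix, the function name, and the placeholder input array are assumptions.

```python
import numpy as np


def gaussian_random_projection(x, n_components, rng):
    """Project the rows of x with a random matrix with N(0, 1/n_components) entries."""
    n_features = x.shape[1]
    components = rng.normal(
        0.0, 1.0 / np.sqrt(n_components), size=(n_features, n_components)
    )
    return x @ components, components


rng = np.random.default_rng(0)
x_flat = rng.normal(size=(10, 4096))   # placeholder for flattened 64 x 64 images
x_proj, components = gaussian_random_projection(x_flat, n_components=16, rng=rng)

# New samples must reuse the same matrix so that all data share one embedding.
x_new_proj = rng.normal(size=(3, 4096)) @ components
```

The same projection matrix should be applied to source and target samples; drawing a fresh matrix per domain would make the projected features incomparable.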

We generate $7000$ training samples and $3000$ test samples in our experiments. Then, we use five-fold cross-validation to select hyperparameters for the baselines and the proposed method for each $a$ (where $U^{q}\sim\text{Uniform}(a,2\pi)$). The hyperparameters are (i) the ridge regression penalty and (ii) the Gaussian kernel scaling factor. Once we select a set of hyperparameters for a value of $a$, we perform 10 new random data regenerations to get transfer errors with 95% confidence intervals (Figure 2).

D.7 Classification of radiological findings with MIMIC-CXR

We conduct a small-scale experiment with chest X-ray data extracted from the MIMIC-CXR dataset (johnson2019mimic). We consider classification of the absence of a radiological finding in a chest X-ray. For this, we use the set of labels extracted by irvin2019chexpert. These labels correspond to 14 categories of radiological findings extracted based on mentions in the associated radiology reports. We specifically consider classification of the “No Finding” ( $Y=1$ ) label, corresponding to cases where no pathology was identified as positive or uncertain in the radiology report.

To define the dataset, we consider the set of 217,536 chest X-rays with defined Chexpert labels (irvin2019chexpert), MIMIC-IV entries, and pretrained embeddings (sellergren2022simplified). We then filter this dataset to the 212,567 examples considered as a part of the “train” partition provided by the MIMIC-CXR database (johnson2019mimic). We then partition the data into training, validation, and testing splits such that 80%, 10%, and 10% of the examples belong to each partition, respectively. For adaptation, we consider BioBERT (lee2020biobert) 768-dimensional embeddings of the radiology reports as concepts $C$ and the patient’s age as a proxy variable $W$ . For simplicity, we use the patient anchor_age defined through linkage to the MIMIC-IV database, regardless of the patient’s age at the time of the chest X-ray. Similar to the dSprites experiment, we further reduce the dimensionality of $X$ and $C$ to $\mathbb{R}^{64}$ using Gaussian Random Projection fit on the full training partition (170,053 examples).

To define distribution shifts, we adopt a problem formulation similar to that of makar2022causally, where patient sex is considered as a possible “shortcut” in the classification of the absence of a radiological finding. As in makar2022causally, we impose distribution shift through structured resampling of the data where $P(U=1)=P(Y=1\mid\textrm{Sex}=\textrm{Female})=P(Y=0\mid\textrm{Sex}=\textrm{Male})$. For example, when $P(U=1)=0.1$, we have $P(Y=1\mid\textrm{Sex}=\textrm{Female})=0.1$ and $P(Y=1\mid\textrm{Sex}=\textrm{Male})=0.9$. We implement the shift through a weighted sampling procedure that maintains the label shift invariance within patient sex subgroups, i.e., preserves $X\mid Y,A$ under the distribution shift, where $A$ corresponds to patient sex. This procedure further fixes the total proportion of male and female patients in the population at 50%. For our experiments, we consider nine domains corresponding to cases where $P(U=1)\in\{0.1,0.2,\ldots,0.9\}$.
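
One way to realize such a weighted sampling scheme is sketched below. Each $(Y,A)$ cell receives a total sampling mass equal to its target joint probability, and examples are drawn uniformly within a cell, which leaves $X\mid Y,A$ unchanged. This is an illustrative assumption about the mechanics rather than the exact procedure used in the experiments; the function names and the synthetic labels are hypothetical.

```python
import numpy as np


def shift_weights(y, sex_female, p_u1):
    """Sampling weights giving P(Y=1|Female)=p_u1, P(Y=1|Male)=1-p_u1, P(Female)=0.5."""
    weights = np.zeros(len(y), dtype=float)
    for female in (0, 1):
        p_y1 = p_u1 if female == 1 else 1.0 - p_u1
        for label in (0, 1):
            target_mass = 0.5 * (p_y1 if label == 1 else 1.0 - p_y1)
            cell = (sex_female == female) & (y == label)
            if cell.any():
                # Spread the cell's target mass uniformly over its members.
                weights[cell] = target_mass / cell.sum()
    return weights / weights.sum()


def resample(rng, weights, n_draws):
    """Indices of a weighted sample drawn with replacement."""
    return rng.choice(len(weights), size=n_draws, replace=True, p=weights)


rng = np.random.default_rng(0)
n = 5000
y = rng.binomial(1, 0.3, size=n)            # synthetic labels, not MIMIC-CXR data
sex_female = rng.binomial(1, 0.5, size=n)   # synthetic sex indicator

weights = shift_weights(y, sex_female, p_u1=0.1)
idx = resample(rng, weights, n_draws=20000)
y_s, female_s = y[idx], sex_female[idx]
```

In the resampled data the label prevalence is approximately $0.1$ among female patients and $0.9$ among male patients, with the two groups balanced.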

We perform both concept adaptation and multi-domain adaptation experiments with the MIMIC-CXR data. For the concept adaptation experiment, we perform weighted sampling with replacement of 1,000 examples from each of the training, validation, and testing partitions defined previously, separately for each domain. We fix the source domain to the case where $P(U=1)=0.1$ and then adapt to each of the nine target domains. For the multi-domain adaptation experiment, we randomly sample 500 examples per domain and partition from the sets of 1,000 examples defined for the concept experiment. For this experiment, we consider a case where two source domains corresponding to $P(U=1)=0.1$ and $P(U=1)=0.2$ are available. To match the size of the aggregate source domain data with the size of the target domain, we sample 250 examples per partition for each source domain. We repeat the sampling procedure five times and report the mean $\pm$ standard deviation of performance metrics over the five replicates.

For both experiments, we perform two-fold cross-validation for the kernel length-scale parameters using data from the source domain(s). Here, we compare to ridge logistic regression models fit in the source and target domains, with the ridge penalty selected by five-fold cross-validation. We use LR-TARGET to refer to logistic regression models fit in a target domain, LR-SOURCE to refer to models fit in a source domain, and Cat-LR to refer to logistic regression models fit with concatenated data from the multiple source domains. We use Bridge-SOURCE to refer to the kernel estimator that leverages the bridge function ($h_{0}$ or $m_{0}$ for the concept and multi-domain adaptation settings, respectively) and conditional mean embedding ($\mu_{WC\mid x}$ or $\mu_{W\mid z,x}$) fit on the source domain data. Bridge-TARGET refers to the kernel estimator where both the bridge function and conditional mean embedding are fit on the target domain data.

	$\displaystyle P(Y\mid U,x)P_{k_{z}+1}(U\mid x)$	$\displaystyle=M_{w,x}P(W\mid U)P_{k_{z}+1}(U\mid x)$
	$\displaystyle P_{k_{z}+1}(Y\mid x)$	$\displaystyle=M_{w,x}P_{k_{z}+1}(W\mid x).$
