License: CC BY 4.0
arXiv:2403.07442v1 [cs.LG] 12 Mar 2024

Proxy Methods for Domain Adaptation

Katherine Tsai$^{1}$, Stephen R. Pfohl$^{2}$, Olawale Salaudeen$^{1}$, Nicole Chiou$^{3}$, Matt J. Kusner$^{4}$, Alexander D’Amour$^{5}$, Sanmi Koyejo$^{3,5}$, Arthur Gretton$^{5,6}$

$^{1}$University of Illinois Urbana-Champaign
$^{2}$Google Research
$^{3}$Stanford University
$^{4}$University College London
$^{5}$Google DeepMind
$^{6}$Gatsby Computational Neuroscience Unit

Abstract

We study the problem of domain adaptation under distribution shift, where the shift is due to a change in the distribution of an unobserved, latent variable that confounds both the covariates and the labels. In this setting, neither the covariate shift nor the label shift assumptions apply. Our approach to adaptation employs proximal causal learning, a technique for estimating causal effects in settings where proxies of unobserved confounders are available. We demonstrate that proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables. We consider two settings, (i) Concept Bottleneck: an additional “concept” variable is observed that mediates the relationship between the covariates and labels; (ii) Multi-domain: training data from multiple source domains is available, where each source domain exhibits a different distribution over the latent confounder. We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings. In our experiments, we show that our approach outperforms other methods, notably those which explicitly recover the latent confounder.

1 Introduction

The goal of domain adaptation is to transfer an accurate model from a labeled source domain to an unlabeled target domain, which has a different but related distribution (pan2010domain; koh2021wilds; malinin2021shifts). It is motivated by the fact that labeling data is often labor intensive, and sometimes requires domain expertise. For example, the distribution of patients diagnosed with a condition at hospital $A$ and hospital $B$ may differ due to patients’ socioeconomic status, demographics, and other factors. However, labeled data might only be available at hospital $A$ and not at hospital $B$ (e.g., due to less funding). As a result, an accurate model for patients from hospital $A$ may perform poorly for patients from hospital $B$.

In order to provide guarantees on the accuracy of a transferred model, one of two classical assumptions is typically made: label shift or covariate shift. Label shift (buck1966comparison; lipton2018detecting) assumes that the distribution of the label $P(Y)$ shifts between source and target domains, but the conditional distribution $P(X \mid Y)$ does not. Conversely, covariate shift (shimodaira2000improving) assumes that the covariate distribution $P(X)$ shifts between domains, but the distribution $P(Y \mid X)$ stays the same. Each assumption provides theoretical guarantees on the generalization of a transferred classifier. In fact, without any assumptions, the source and target domains could differ arbitrarily, making guarantees impossible. However, these assumptions are often too restrictive to apply in real-world settings (zhang2015multi; schrouff2022diagnosing). For instance, if covariates $X$ and labels $Y$ are confounded by a third variable $U$, it is possible for neither $P(X \mid Y)$ nor $P(Y \mid X)$ to be equal across domains. For example, demographic information $U$ could confound the relationship between a diagnosis $Y$ and a radiological image $X$. In this example, if two hospitals have different distributions over demographics, both label shift and covariate shift adaptation methods can fail to transfer a classifier across hospitals.
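The failure of both classical assumptions under confounding can be checked numerically. The following is a minimal sketch, assuming binary $U$, $X$, $Y$ with $X$ and $Y$ independent given $U$; all probability tables are hypothetical values chosen for illustration and do not come from the paper. Only the marginal over $U$ differs between the two domains.

```python
import numpy as np

# Hypothetical conditional tables, shared by both domains (rows indexed by u).
p_x_given_u = np.array([[0.8, 0.2],   # P(x | u=0)
                        [0.3, 0.7]])  # P(x | u=1)
p_y_given_u = np.array([[0.9, 0.1],   # P(y | u=0)
                        [0.4, 0.6]])  # P(y | u=1)

def conditionals(p_u):
    """Return P(y | x) indexed [x, y] and P(x | y) indexed [y, x] for a marginal over U."""
    # Joint over (x, y) after summing out the latent confounder u.
    joint_xy = np.einsum("u,ux,uy->xy", p_u, p_x_given_u, p_y_given_u)
    p_y_given_x = joint_xy / joint_xy.sum(axis=1, keepdims=True)
    p_x_given_y = (joint_xy / joint_xy.sum(axis=0, keepdims=True)).T
    return p_y_given_x, p_x_given_y

source_y_x, source_x_y = conditionals(np.array([0.7, 0.3]))  # source marginal P(U)
target_y_x, target_x_y = conditionals(np.array([0.2, 0.8]))  # target marginal Q(U)

print("P(Y=1 | X=0) source vs target:", source_y_x[0, 1], target_y_x[0, 1])
print("P(X=1 | Y=0) source vs target:", source_x_y[0, 1], target_x_y[0, 1])
```

In this toy model both printed pairs differ, so neither the covariate shift condition nor the label shift condition holds, even though every conditional given $U$ is identical across the two domains.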

To address this, recent work has introduced a latent shift assumption: the distribution of $U$, an unobserved latent confounder of $X$ and $Y$, shifts between the source and target domain (alabdulmohsin2023adapting). In this setting, all distributions of $X$ and $Y$ (without conditioning on $U$) may differ across the domains, violating label and covariate shift assumptions.

Contributions. We propose techniques for domain adaptation under the latent shift assumption that are guaranteed to identify the optimal predictor $\mathbb{E}[Y \mid x]$ in the target domain. We make use of proxy methods (miao2018identifying), which are a recently developed framework for causal effect estimation in the presence of a hidden confounder $U$, given indirect proxy information on $U$. Compared to prior work (alabdulmohsin2023adapting), our techniques do not require: identifying the distribution of the latent variable $U$, that $U$ be discrete, or further linear independence assumptions. We consider two settings: (1) Concept Bottleneck: we observe in both domains a proxy $W$ of the unobserved confounder $U$ and a concept $C$ that mediates the direct relationship between $X$ and $Y$ (alabdulmohsin2023adapting), or (2) Multi-Domain: we do not observe $C$ in either domain, but have access to observations from multiple source domains. For both settings, we provide guarantees for identifying $\mathbb{E}[Y \mid x]$ without observing $Y$ in the target domain. When $\mathbb{E}[Y \mid x]$ is identifiable, we develop practical two-stage kernel estimators to perform adaptation.

2 Related Work

The development of techniques for learning robust models and adapting to distribution shift has a long history in machine learning, but recently has received increased attention (shen2021towards; zhou2022domain; wang2022generalizing).

Causality for domain adaptation. Our work is inspired by techniques that formulate the covariate/label shift settings as assumptions on the causal structure for domain adaptation and distributional robustness (e.g., scholkopf2012causal; peters2015causal; zhang2015multi; subbaswamy2019preventing; rothenhausler2021anchor; veitch2021counterfactual; magliacane2018domain; arjovsky2019invariant; ganin2016domain; ben2010theory; oberst2021regularizing).

Proximal causal inference. Our identification technique is inspired by approaches that use observed proxies to identify causal effects under unobserved confounding (kuroki2014measurement; miao2018identifying; deaner2018proxy; tchetgen2020introduction; mastouri2021proximal; cui2023semiparametric; xu2023kernel). These approaches design ‘bridge functions’ to connect quantities involving a proxy $W$ with those of the label $Y$. The appeal of this approach is that these bridge functions implicitly marginalize over $U$. This allows these approaches to identify causal quantities without identifying distributions involving $U$.

Latent shift. Our work is most closely related to alabdulmohsin2023adapting, who introduced the setting of latent shift with proxies $W$ and concepts $C$. They showed that the optimal predictor $\mathbb{E}[Y \mid x]$ is identifiable in the target domain if $W$ and $C$ are observed in the source domain and $X$ is observed in the target domain. To do so, they required (a) identification of distributions involving $U$, (b) that $U$ is a discrete variable, (c) knowledge of the dimensionality of $U$, and (d) additional linear independence assumptions. In contrast, our work derives identification results for arbitrary $U$, and does not require any of (a)-(d). However, there is no free lunch: to achieve this, we require that proxies $W$ are observed in the target, and either that: (i) concepts $C$ are also observed in the target, or (ii) we observe multiple source domains. For (ii) we do not require $C$ in either the source or the target, but for full identification we require that $U$ is discrete.

3 Problem Framework

[Figure 1 panels: (a) Covariate shift, with nodes $Y$, $X$, $U$; (b) Label shift, with nodes $Y$, $X$, $U$; (c) Concept Bottleneck shift, with nodes $C$, $Y$, $X$, $U$, $W$; (d) Multi-Domain shift, with nodes $X$, $Y$, $Z$, $U$, $W$.]

Figure 1: Causal diagrams. Shaded circles denote unobserved variables and solid circles denote observed variables. $X$ is the covariate, $Y$ is the response, $C$ is the concept, $W$ is the proxy, $Z$ is the domain-related variable, and $U$ is the latent variable.

Let $P(\cdot)$ and $Q(\cdot)$ denote the probability distribution functions of the source domain and target domain, respectively. Let $p$ and $q$ indicate source and target quantities. Our goal is to study identification and estimation of the optimal target predictor $\mathbb{E}_q[Y \mid x]$ when $Y$ is not observed in the target domain.

Concept Bottleneck. The first setting we study is described by the graph in Figure 1(c). We have two additional variables: (i) proxies $W$, which provide auxiliary information about $U$, or can be seen as a noisy version of it (kuroki2014measurement), and (ii) concepts $C$, which mediate or ‘bottleneck’ the relationship between the covariates $X$ and labels $Y$ (goyal2019explaining; koh2020concept). For example, koh2020concept describe a setting where the concepts $C$ are high-level clinical and morphological features of a knee X-ray $X$, which mediate the relationship with osteoarthritis severity $Y$. In this example, $U$ could describe demographic variations that alter symptoms $X, C$ and outcome $Y$, and the proxies $W$ could include patient background and clinical history (e.g., prior diagnoses, medications, procedures, etc.). For the source domain we assume we observe $(X, C, W, Y) \sim P$ and for the target domain we observe $(X, C, W) \sim Q$.

We formalize the notion of latent shift, as introduced in alabdulmohsin2023adapting.

Assumption 1 (Concept Bottleneck, alabdulmohsin2023adapting).

The shift between $P$ and $Q$ is located in the unobserved $U$, i.e., there is a latent shift $P(U) \neq Q(U)$, but $P(V \mid U) = Q(V \mid U)$, where $V \subseteq \{W, X, C, Y\}$.

This assumption states that every variable conditioned on $U$ is invariant across domains. However, as $P(U) \neq Q(U)$, the marginal distributions generally are not: $P(V) \neq Q(V)$ for $V \subseteq \{W, X, C, Y\}$. This assumption is a generalization of covariate shift $P(Y \mid X, U) = Q(Y \mid X, U)$ (shimodaira2000improving) and label shift $P(X \mid Y, U) = Q(X \mid Y, U)$ (buck1966comparison), with associated graphs in Figures 1(a) and 1(b).

Assumption 2 (Structural assumption).

Graphs in Figure 1 are faithful and Markov (spirtes2000causation).

Under Assumption 2, we have the following conditional independence properties for the graph in Figure 1(c):

$$Y \perp\!\!\!\perp X \mid \{U, C\}, \quad W \perp\!\!\!\perp \{X, C\} \mid U.$$

With this conditional independence structure, $\{U, C\}$ blocks the information flow from $X$ to $Y$, and $U$ blocks the information flow from $W$ to $\{X, C\}$. We will see in Section 4 that these assumptions allow us to obtain $Q(Y \mid x)$ from $Q(W, C \mid x)$ in the target domain, where the latter is a function of observed quantities.
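To make the role of these two independences concrete, the following is a short sketch in our own notation (it is a restatement for intuition, not a result quoted from the paper). Conditioning on $(c, x)$ and averaging over the latent $U$, the first independence gives the left-hand identity and the second gives the right-hand one:

$$\mathbb{E}_p[Y \mid c, x] = \int_{\mathcal{U}} \mathbb{E}[Y \mid u, c] \, \mathrm{d}P(u \mid c, x), \qquad P(w \mid c, x) = \int_{\mathcal{U}} P(w \mid u) \, \mathrm{d}P(u \mid c, x).$$

Both observed quantities are mixtures over the same unobserved distribution $P(u \mid c, x)$, with mixture components $\mathbb{E}[Y \mid u, c]$ and $P(w \mid u)$ that do not depend on the domain under Assumption 1. The same identities hold in the target with $Q(u \mid c, x)$ in place of $P(u \mid c, x)$. This shared mixing distribution is what allows a function of the proxy $W$ to stand in for the unobserved $U$.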

Multi-domain. In the second setting, suppose we do not observe the concepts $C$ in any domain, but instead observe data from multiple source domains, according to the graph in Figure 1(d). For instance, we may want to learn a classifier for a target hospital that has only unlabelled data, using data from several source hospitals with labelled data. Here, let $Z$ be a random variable in $\mathcal{Z}$ denoting a prior over the source domains, and let $P(U \mid Z)$ be the distribution of $U$ given $Z$. We make $k_Z$ draws from $Z$, indexed by $r \in \{1, \ldots, k_Z\}$, and write $\{z_1, \ldots, z_{k_Z}\} =: \mathcal{Z}_p \subseteq \mathcal{Z}$. For each source domain $z_r$, we observe $(X, W, Y) \sim P(X, W, Y \mid z_r) := P_r(X, W, Y)$. For the target, which we index by $k_Z + 1$, we only observe $(X, W) \sim P(X, W \mid z_{k_Z+1}) := Q(X, W)$. In general let $P_r(V) := P(V \mid z_r)$ and $Q(V) := P(V \mid z_{k_Z+1})$ for any $V \subseteq \{W, X, Y, U\}$. For this setting we replace Assumption 1 with the following shift assumption.

Assumption 3 (Multi-Domain).

For each $z, z' \in \mathcal{Z}_p$ such that $z \neq z'$, we have $P(U \mid z) \neq P(U \mid z') \neq Q(U)$.

Note that Assumption 2 implies the following conditional independence property in Figure 1(d):

$$\{Y, X, W\} \perp\!\!\!\perp Z \mid U.$$

Under Assumption 3, we allow all joint distributions to be different:

$$P(W, X, U, Y \mid z) \neq P(W, X, U, Y \mid z') \neq Q(W, X, U, Y)$$

for $z \neq z' \in \mathcal{Z}_p$.

4 Identification under Latent Shifts

Our identification techniques are inspired by proximal causal inference (tchetgen2020introduction). The key idea is to design so-called “bridge” functions to identify distributions confounded by unobserved variables. We first show that with additional proxies and concepts, $\mathbb{E}_q[Y \mid x]$ is identifiable under any latent shift.

4.1 Identification with Concepts

To prove identifiability, we need certain assumptions to hold for the shift. The first is a regularity assumption, also known as a completeness condition, and is commonly used to identify causal estimands (d2011completeness; miao2018identifying).

Assumption 4 (Informative variables).

Let $g$ be any square-integrable function. Both the source domain and the target domain, $(f, F) \in \{(p, P), (q, Q)\}$, satisfy the following: $\mathbb{E}_f[g(U) \mid x, c] = 0$ for all $x \in \mathcal{X}, c \in \mathcal{C}$ if and only if $g(U) = 0$ almost surely with respect to $F(U)$.

At a high level, completeness states that $X$ must have sufficient variability related to the change of $U$. This is a common assumption made in proximal causal inference (cf. Condition (ii) in miao2018identifying and Assumption 3 in mastouri2021proximal). For more details on the justification of the completeness assumption, see the supplementary material of miao2022identifying.

Second, we need a guarantee on the support of $U$. Intuitively, if a value $u \in \mathcal{U}$ has non-zero probability in the target domain, it should have non-zero probability in the source domain as well. Otherwise, it is impossible to adjust to certain shifts (as we never see these regimes in the source domain). This is similar to the positivity assumption commonly made in the causality literature (hernan2006estimating).

Assumption 5 (Positivity).

For any $u \in \mathcal{U}$, if $Q(u) > 0$ then $P(u) > 0$.

If data are generated according to Figure 1(c), and the regularity Assumptions 8–10 hold (see Appendix A.2), miao2018identifying first showed the existence of the solutions $h_0^p(w, c), h_0^q(w, c)$ of the following equations:

$$\begin{aligned}
\mathbb{E}_p[Y \mid c, x] &= \int_{\mathcal{W}} h_0^p(w, c) \, \mathrm{d}P(w \mid c, x), \\
\mathbb{E}_q[Y \mid c, x] &= \int_{\mathcal{W}} h_0^q(w, c) \, \mathrm{d}Q(w \mid c, x).
\end{aligned} \tag{4.1}$$

The terms $h_0^p(w, c), h_0^q(w, c)$ are called ‘bridge’ functions as they connect the proxy $W$ to the label $Y$. If we are able to identify $h_0^q(w, c)$ then we can identify $\mathbb{E}_q[Y \mid x]$, by using eq. (4.1) to obtain $\mathbb{E}_q[Y \mid C, x]$ and marginalizing over $Q(C \mid x)$.

We show that it is possible to connect identification of $h_0^q(w, c)$ with that of $h_0^p(w, c)$, leading directly to identification of $\mathbb{E}_q[Y \mid x]$.

Theorem 4.1.

Assume that $h_0^p$ and $h_0^q$ exist (i.e., regularity Assumptions 8–10 hold). Then given Assumptions 1, 2, 4, 5 we have that, for any $c \in \mathcal{C}$,

$$\int_{\mathcal{W}} h_0^p(w, c) \, \mathrm{d}P(w \mid u) = \int_{\mathcal{W}} h_0^q(w, c) \, \mathrm{d}Q(w \mid u),$$

almost surely with respect to $Q(U)$. This implies that

$$\mathbb{E}_q[Y \mid x] = \int_{\mathcal{W} \times \mathcal{C}} h_0^p(w, c) \, \mathrm{d}Q(w, c \mid x).$$

The proof is given in Appendix B.1. Hence, given $h_0^p$ and $(W, X, C)$ from the target $Q$, we are able to adapt to arbitrary distribution shifts in the unobserved $U$. The advantage of this approach is that it does not require estimating any distributions involving $U$. We demonstrate this in Section 5.
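The statement of Theorem 4.1 can be checked on a small discrete example. The following is a minimal sketch, assuming all of $U, W, X, C$ are binary and using hypothetical probability tables that respect the graph in the Concept Bottleneck setting; it is not the two-stage kernel estimator developed in the paper. With discrete variables, the source bridge equation becomes one $2 \times 2$ linear system per value of $c$, solved here with `np.linalg.solve`. The latent $U$ is used only to generate the observable tables and the oracle for comparison.

```python
import numpy as np

# Hypothetical mechanisms shared by source and target (Assumption 1).
p_x_u = np.array([[0.8, 0.2], [0.3, 0.7]])       # P(x | u),    indexed [u, x]
p_w_u = np.array([[0.9, 0.1], [0.2, 0.8]])       # P(w | u),    indexed [u, w]
p_c_ux = np.array([[[0.7, 0.3], [0.4, 0.6]],
                   [[0.5, 0.5], [0.2, 0.8]]])    # P(c | u, x), indexed [u, x, c]
ey_uc = np.array([[0.1, 0.4], [0.6, 0.9]])       # E[Y | u, c], indexed [u, c]

def observed(p_u):
    """Quantities that do not involve U: P(w, c | x), P(w | c, x), E[Y | c, x]."""
    joint = np.einsum("u,ux,uw,uxc->uxwc", p_u, p_x_u, p_w_u, p_c_ux)
    p_xwc = joint.sum(axis=0)                                # [x, w, c]
    p_wc_x = p_xwc / p_xwc.sum(axis=(1, 2), keepdims=True)   # P(w, c | x)
    p_w_cx = p_xwc / p_xwc.sum(axis=1, keepdims=True)        # P(w | c, x)
    p_uxc = joint.sum(axis=2)                                # [u, x, c]
    ey_cx = np.einsum("uxc,uc->xc", p_uxc, ey_uc) / p_uxc.sum(axis=0)
    return p_wc_x, p_w_cx, ey_cx

src_wc_x, src_w_cx, src_ey_cx = observed(np.array([0.7, 0.3]))  # source P(U)
tgt_wc_x, _, tgt_ey_cx = observed(np.array([0.2, 0.8]))         # target Q(U)

# Solve the source bridge equation separately for each concept value c.
h0 = np.stack([np.linalg.solve(src_w_cx[:, :, c], src_ey_cx[:, c])
               for c in range(2)], axis=1)                      # h0[w, c]

# Adapt: average the source bridge over the target's P(w, c | x).
adapted = np.einsum("wc,xwc->x", h0, tgt_wc_x)

# Oracle target predictor (needs target labels, unavailable in practice).
oracle = np.einsum("xc,xc->x", tgt_ey_cx, tgt_wc_x.sum(axis=1))
# Unadapted source predictor E_p[Y | x], for comparison.
naive = np.einsum("xc,xc->x", src_ey_cx, src_wc_x.sum(axis=1))
print("adapted:", adapted, "oracle:", oracle, "naive:", naive)
```

In this toy model the adapted predictor matches the oracle up to floating-point error, while the unadapted source predictor does not. The bridge `h0` is computed from source observables alone, and the target contributes only its distribution over $(W, C)$ given $X$.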

While concepts can ensure identifiability, they may not be available in practice. In this case, a natural question is whether the optimal target predictor $\mathbb{E}_q[Y \mid x]$ is still identifiable. In the next section we show that if we instead have access to data from multiple source domains, $\mathbb{E}_q[Y \mid x]$ may again be identifiable.

4.2 The Blessings of Multiple Domains

We now turn to the multi-domain setting. The graphical structure in Figure 1(d) is similar to the structure in Figure 1(c) with $C$ replaced by $X$, $X$ replaced by $Z$, and the arrow between $U$ and $Z$ flipped. Although the bridge function proposed by miao2018identifying assumes an edge from $U$ to $Z$, changing the direction from $Z$ to $U$ does not change the conditional independence structure (pearl2009causality). The main difference is that we will only be able to guarantee full identification when $U$ is discrete. We start by demonstrating this, and then give an example of the inherent difficulty of identification when $U$ is continuous.

To begin, for simplicity, assume $U$ and $W$ are discrete (with dimensionalities $k_U$ and $k_W$). We have finitely many samples from $Z$, denoted as $z_1, \ldots, z_{k_Z}$, corresponding to our training domains. We seek a bridge function (in this case, a matrix $M_0(w_i, x)$) satisfying

$$\mathbb{E}_r[Y \mid x] = \sum_{i=1}^{k_W} M_0(w_i, x) \, P_r(w_i \mid x), \tag{4.2}$$

for all $r = 1, \ldots, k_Z$, where $\mathbb{E}_r[Y \mid x]$ is the conditional expectation obtained in domain $r$, and $P_r(W \mid x) = P(W \mid x, z_r)$.

In order to identify $M_0(w_i, x)$, and then $\mathbb{E}_q[Y \mid x]$, we need enough source domains to capture the variability of $U$. The following result describes how many we need.

Proposition 4.2.

Suppose that we have $k_Z$ source domains and $W$, $U$ have $k_W$ and $k_U$ categories respectively. Then, if $k_W, k_Z \geq k_U$ and subject to appropriate rank conditions (see proof in Appendix B.2), the bridge function is identifiable and does not depend on the specific $z$.

This result generalizes the identification analysis developed in miao2018identifying. If the number of observed source domains $k_Z$ is at least the dimension of the latent $U$, then subject to appropriate identifiability requirements (detailed in Appendix B.2), we can recover the bridge $M_0(w_i, x)$.
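The discrete case can be illustrated with a few lines of linear algebra. The following is a minimal sketch, assuming binary $U$, $W$, $X$, three source domains ($k_Z = 3 \geq k_U = 2$), and hypothetical probability tables in which $W$ depends only on $U$ and $Y$ depends on $(U, X)$. For each value of $x$, eq. (4.2) is a linear system with one row per source domain, solved here by least squares. The last step averages the recovered bridge over the target's $Q(w \mid x)$, the discrete counterpart of the integral form in eq. (4.4); the function names are ours.

```python
import numpy as np

# Hypothetical mechanisms shared by all domains; only P(U | z) varies.
p_x_u = np.array([[0.8, 0.2], [0.3, 0.7]])   # P(x | u),    indexed [u, x]
p_w_u = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(w | u),    indexed [u, w]
ey_ux = np.array([[0.2, 0.5], [0.6, 0.9]])   # E[Y | u, x], indexed [u, x]

def observed(p_u):
    """Return P(w | x) as [x, w] and E[Y | x] as [x] for one domain's P(U | z)."""
    p_ux = p_u[:, None] * p_x_u                      # P(u, x)
    p_u_x = p_ux / p_ux.sum(axis=0, keepdims=True)   # P(u | x), indexed [u, x]
    p_w_x = np.einsum("ux,uw->xw", p_u_x, p_w_u)
    ey_x = np.einsum("ux,ux->x", p_u_x, ey_ux)
    return p_w_x, ey_x

source_p_u = np.array([[0.8, 0.2], [0.5, 0.5], [0.3, 0.7]])  # P(U | z_r), r = 1..3
target_p_u = np.array([0.1, 0.9])                            # Q(U)

def solve_bridge(domains):
    """Least-squares solution of eq. (4.2), one linear system per value of x."""
    obs = [observed(p_u) for p_u in domains]
    m0 = np.empty((2, 2))                                 # m0[w, x]
    for x in range(2):
        a = np.stack([p_w_x[x] for p_w_x, _ in obs])      # rows: domains, cols: w
        b = np.array([ey_x[x] for _, ey_x in obs])        # E_r[Y | x] per domain
        m0[:, x] = np.linalg.lstsq(a, b, rcond=None)[0]
    return m0

m0_all = solve_bridge(source_p_u)         # uses all three source domains
m0_subset = solve_bridge(source_p_u[:2])  # uses only the first two

# Adapt to the target using only its P(w | x); the oracle needs target labels.
tgt_p_w_x, oracle = observed(target_p_u)
adapted = np.einsum("wx,xw->x", m0_all, tgt_p_w_x)
print("adapted:", adapted, "oracle:", oracle)
```

In this toy model the bridge recovered from all three domains coincides with the one recovered from two of them, consistent with the claim that the bridge does not depend on the specific $z$, and the adapted predictor matches the oracle. With fewer than $k_U$ distinct source domains the per-$x$ system would be rank deficient and the bridge would not be pinned down.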

Now, consider the case where $U$ is discrete but all observed variables $W, X, Y$ are continuous. In this case we have the following system

$$\mathbb{E}_r[Y \mid x] = \int_{\mathcal{W}} m_0(w, x) \, \mathrm{d}P_r(w \mid x), \tag{4.3}$$

for $r = 1, \ldots, k_Z$. The proof of existence of $m_0$ is a modification of Proposition A.2, as shown in Proposition A.3. In order to identify the target $\mathbb{E}_q[Y \mid x]$, we need the following assumption.

Assumption 6.

Let $g$ be a square integrable function on $U$. For each $x \in \mathcal{X}$ and for all $z \in \mathcal{Z}_p$, $\mathbb{E}[g(U) \mid x, z] = 0$ if and only if $g(U) = 0$, $P(U)$ almost surely.

Given this assumption we can prove identifiability.

Proposition 4.3.

Suppose that Assumptions 1-3 and 6 hold; that $m_0$ exists; that $(W, X, Y)$ are observed for the sources $z \in \mathcal{Z}_p$; and that $(W, X)$ is observed from the target domain. Then $\mathbb{E}_q[Y \mid x]$ is identifiable, and for any $x \in \mathcal{X}$, we can write

\begin{equation}
\mathbb{E}_q[Y \mid x] = \int_{\mathcal{W}} m_0(w, x)\,\mathrm{d}Q(w \mid x). \tag{4.4}
\end{equation}

The proof is given in Appendix B.3. Crucially, this result is valid only when Assumption 6 holds, and it remains unclear when this can be expected. Proposition 4.2 suggests that Assumption 6 is not vacuous when $U$ is finite dimensional. We plan to investigate this further in future work.

Now let us consider the case where $U$ is continuous. In this case, unfortunately, Assumption 6 may not hold, preventing identification of $\mathbb{E}_q[Y \mid x]$. This is illustrated in the following example.

Example 4.4.

Recall the decomposition of both sides of (4.3). Under Assumption 2 and given the existence of $m_0$ (Proposition A.2),

\begin{align}
\mathbb{E}_p[Y \mid x, z] &= \int_{\mathcal{W}} m_0(w, x)\,\mathrm{d}P(w \mid x, z) \notag \\
&= \int_{\mathcal{U}} \int_{\mathcal{W}} m_0(w, x)\,\mathrm{d}P(w \mid u)\,\mathrm{d}P(u \mid x, z); \tag{4.5} \\
\mathbb{E}_p[Y \mid x, z] &= \int_{\mathcal{U}} \mathbb{E}_p[Y \mid x, u]\,\mathrm{d}P(u \mid x, z). \tag{4.6}
\end{align}

For every $x$, Eqs. (4.5) and (4.6) represent projections onto $P(u \mid x, z_r)$, $r \in \{1, \ldots, k_z\}$. Consider $\mathcal{U} := [-\pi, \pi]$ with periodic boundary conditions, and for a given $x$ define $P(u \mid x, z_r) = (2\pi)^{-1}(1 + \cos(ru))$, $\forall r \in \mathbb{N}^{+}$ (note that the cosines $\cos(ru)$ are mutually orthogonal on $[-\pi, \pi]$). We now construct an example where (4.5) holds for some $z$ but not for others. Define the difference

\begin{equation}
\mathbb{E}_p[Y \mid x, u] - \int_{\mathcal{W}} m_0(w, x)\,\mathrm{d}P(w \mid u) = \cos((k_z + 1)u) =: g(u). \tag{4.7}
\end{equation}

In this case, $g(u) \neq 0$, and in particular, (4.5) holds for all $r \leq k_z$, but not for $P(u \mid x, z_{k_z+1})$.

This example illustrates a larger point: for continuous $U$, no finite set of projections will suffice to completely characterize the square integrable functions on $\mathcal{U}$. That said, subject to appropriate assumptions on the smoothness of (4.7), the error will reduce as more domains are observed. The characterization of this convergence will be the topic of future work. In experiments, we show that the adaptation can still be effective even when the latent variable $U \mid z_r$ is continuous valued and follows different Beta distributions for each distinct $r$, given just two training source domains.
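The orthogonality argument in Example 4.4 can be checked numerically. The sketch below integrates the residual $g(u) = \cos((k_z + 1)u)$ against each density $P(u \mid x, z_r)$ on a grid; the value `k_z = 3`, the grid size, and the helper names `density` and `projection` are assumptions chosen only for illustration.

```python
import numpy as np

k_z = 3                                   # hypothetical number of observed source domains
u = np.linspace(-np.pi, np.pi, 20001)     # grid on U = [-pi, pi]
du = u[1] - u[0]

def density(r):
    # P(u | x, z_r) = (1 + cos(r u)) / (2 pi)
    return (1.0 + np.cos(r * u)) / (2.0 * np.pi)

g = np.cos((k_z + 1) * u)                 # the residual g(u) defined in (4.7)

def projection(r):
    # Trapezoidal approximation of the integral of g(u) P(u | x, z_r) over U.
    f = g * density(r)
    return float(np.sum((f[:-1] + f[1:]) * 0.5) * du)

# Projections onto the k_z observed domains, plus one unobserved domain.
projections = [projection(r) for r in range(1, k_z + 2)]
print(projections)
```

The first $k_z$ projections vanish up to rounding, so the observed domains cannot detect $g$, while the projection onto domain $k_z + 1$ equals $(2\pi)^{-1}\int_{-\pi}^{\pi}\cos^2((k_z+1)u)\,\mathrm{d}u = 1/2$.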

5 Kernel Bridge Function Estimation

We introduce kernel methods to estimate the bridge functions and subsequently leverage the estimates to adapt to distribution shifts. Section 4 shows that bridge functions for both settings can be adapted to the target domain, so we drop the domain-specific indices and use $h_0$ and $m_0$ to denote the bridge functions. We begin by introducing the notation.

Notation. Let $\otimes$ be the tensor product, $\overline{\otimes}$ be the columnwise Khatri-Rao product and $\odot$ be the Hadamard product. For any space $\mathcal{V} \in \{\mathcal{X}, \mathcal{C}, \mathcal{W}, \mathcal{Y}\}$, let $k: \mathcal{V} \times \mathcal{V} \rightarrow \mathbb{R}$ be a positive semidefinite kernel function and $\phi(v) = k(v, \cdot)$ for any $v \in \mathcal{V}$ be the feature map. We denote $\mathcal{H}_{\mathcal{V}}$ to be the RKHS on $\mathcal{V}$ associated with kernel function $k$. The RKHS has two properties: (i) for any $f \in \mathcal{H}_{\mathcal{V}}$, $f(v) = \langle f, k(v, \cdot) \rangle$ for all $v \in \mathcal{V}$, and (ii) $k(v, \cdot) \in \mathcal{H}_{\mathcal{V}}$. We denote $\langle \cdot, \cdot \rangle$ as the inner product and $|\!|\!|\cdot|\!|\!|_{\mathcal{H}_{\mathcal{V}}}$ as the induced norm. For notational simplicity, we denote the product space $\mathcal{H}_{\mathcal{V}} \times \mathcal{H}_{\mathcal{V}'}$ associated with operation $\mathcal{H}_{\mathcal{V}} \otimes \mathcal{H}_{\mathcal{V}'}$ as $\mathcal{H}_{\mathcal{V}\mathcal{V}'}$. We define the kernel mean embedding as $\mu_V = \mathbb{E}[\phi(V)] = \int k(v, \cdot) p(v)\,dv$ (smola2007hilbert) and the conditional mean embedding as $\mu_{V \mid y} = \int k(v, \cdot) p(v \mid y)\,dv$ (song2009hilbert; singh2019kernel). For $V \in \{W, X, C\}$, we denote the $a$-th batch of i.i.d. samples as $V_a = \{v_{a,i}\}_{i=1}^{n_a}$. Define the Gram matrices as $\mathcal{K}_{V_a} = \begin{bmatrix} k(v_{a,i}, v_{a,j}) \end{bmatrix}_{i,j} \in \mathbb{R}^{n_a \times n_a}$, $\mathcal{K}_{V_{ab}} = \begin{bmatrix} k(v_{a,i}, v_{b,j}) \end{bmatrix}_{i,j} \in \mathbb{R}^{n_a \times n_b}$. Let $\Phi_{V_a} = \begin{bmatrix} \phi(v_{a,1}), \ldots, \phi(v_{a,n_a}) \end{bmatrix}^{\top} \in \mathcal{H}_{\mathcal{V}}^{n_a}$ be the vectorized feature map such that $\Phi_{V_a}(v') = \begin{bmatrix} k(v_{a,1}, v'), \ldots, k(v_{a,n_a}, v') \end{bmatrix}^{\top} \in \mathbb{R}^{n_a}$.
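The finite-sample objects in this notation map directly onto array operations. The following is a minimal sketch, assuming a Gaussian kernel (the notation above allows any positive semidefinite kernel) and hypothetical helper names `rbf_kernel` and `khatri_rao_columnwise`; it builds a Gram matrix, a cross Gram matrix, the evaluated feature vector $\Phi_{V_a}(v')$, and a columnwise Khatri-Rao product.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian kernel matrix [k(a_i, b_j)]_{i,j} for samples stored row-wise in A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def khatri_rao_columnwise(A, B):
    """Columnwise Khatri-Rao product: column j equals kron(A[:, j], B[:, j])."""
    assert A.shape[1] == B.shape[1], "A and B need the same number of columns"
    return np.einsum("ik,jk->ijk", A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

rng = np.random.default_rng(0)
V_a = rng.normal(size=(5, 2))            # batch a: n_a = 5 samples in R^2
V_b = rng.normal(size=(4, 2))            # batch b: n_b = 4 samples in R^2

K_a = rbf_kernel(V_a, V_a)               # Gram matrix K_{V_a}, shape (n_a, n_a)
K_ab = rbf_kernel(V_a, V_b)              # cross Gram matrix K_{V_ab}, shape (n_a, n_b)

v_new = rng.normal(size=(1, 2))
Phi_a_v = rbf_kernel(V_a, v_new)[:, 0]   # Phi_{V_a}(v'), shape (n_a,)

# Hadamard product of two Gram matrices and a Khatri-Rao product with the identity.
K_had = K_a * K_a
KR = khatri_rao_columnwise(np.eye(4), K_ab)   # shape (n_b * n_a, n_b)
```

The Hadamard product of Gram matrices corresponds to the Gram matrix of tensor-product features, which is why $\odot$ appears wherever two variables are conditioned on jointly.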

5.1 Adaptation with Concepts

Suppose that the bridge function $h_0 \in \mathcal{H}_{\mathcal{W}\mathcal{C}}$, where $\mathcal{H}_{\mathcal{W}\mathcal{C}}$ is an RKHS. It follows from Theorem 4.1 that

\begin{align}
\mathbb{E}_q[Y \mid X = x] &= \mathbb{E}_q[h_0(W, C) \mid x] \notag \\
&= \mathbb{E}_q[\langle h_0, \phi(W) \otimes \phi(C) \rangle \mid x] \notag \\
&= \langle h_0, \mu_{WC \mid x}^{q} \rangle. \tag{5.1}
\end{align}

To adapt to the distribution shifts, we estimate the bridge function $h_0$ in the source domain and the conditional mean embedding $\mu_{WC \mid x}^{q} = \mathbb{E}_q[\phi(W) \otimes \phi(C) \mid x]$ in the target domain. The empirical estimate of the conditional mean embedding, along with its consistency proof, is provided in (song2009hilbert; grunewalder2012conditional), so we focus on the estimation procedure for the bridge function $h_0$.

To estimate the bridge function $h_0$, we employ the regression method developed in mastouri2021proximal. Recall $\mathbb{E}[Y \mid c, x] = \mathbb{E}[h_0(W, c) \mid c, x]$. We define the population risk function in the source domain as:

\begin{align}
R(h_0) &= \mathbb{E}_p[(Y - G_{h_0}(C, X))^2]; \tag{5.2} \\
G_{h_0}(c, x) &= \langle h_0, \mu_{W \mid c, x}^{p} \otimes \phi(c) \rangle. \notag
\end{align}

The procedure to optimize (5.2) involves two stages. In the first stage, we estimate the conditional mean embedding $\mu_{W \mid c, x}^{p} = \mathbb{E}_p[\phi(W) \mid c, x]$, which we will use as a plug-in estimator to estimate $h_0$ in the second stage. Given $n_1$ i.i.d. samples $(X_1, W_1, C_1) = \{(x_{1,i}, w_{1,i}, c_{1,i})\}_{i=1}^{n_1}$ from the source distribution $p$ and a regularizing parameter $\lambda_1 > 0$, we denote $\mathcal{K}_{X_1} \in \mathbb{R}^{n_1 \times n_1}$, $\mathcal{K}_{C_1} \in \mathbb{R}^{n_1 \times n_1}$ as the Gram matrices and $\Phi_{X_1} \in \mathcal{H}_{\mathcal{X}}^{n_1}$, $\Phi_{C_1} \in \mathcal{H}_{\mathcal{C}}^{n_1}$ as $n_1$-dimensional vectorized feature maps of $X_1$, $C_1$ respectively. Following the procedure developed in song2009hilbert, the estimate of $\mu_{W \mid c, x}^{p}$ is

\begin{align}
\widehat{\mu}_{W \mid c, x}^{p} &= \sum_{i=1}^{n_1} b_i(x, c)\,\phi(w_{1,i}); \tag{5.3} \\
b(x, c) &= (\mathcal{K}_{X_1} \odot \mathcal{K}_{C_1} + \lambda_1 n_1 I)^{-1} \left( \Phi_{X_1}(x) \odot \Phi_{C_1}(c) \right). \notag
\end{align}
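As a concrete illustration of the first stage, the sketch below computes the weight vector $b(x, c)$ from (5.3) with a linear solve. The Gaussian kernel, the synthetic data generator, and the names `rbf_kernel`, `stage1_weights`, and `lam1` are assumptions made for this example rather than the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian kernel matrix [k(a_i, b_j)]_{i,j} for samples stored row-wise in A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def stage1_weights(X1, C1, x, c, lam1):
    """b(x, c) = (K_X1 * K_C1 + lam1 n1 I)^{-1} (Phi_X1(x) * Phi_C1(c)), '*' elementwise."""
    n1 = X1.shape[0]
    K = rbf_kernel(X1, X1) * rbf_kernel(C1, C1)
    rhs = rbf_kernel(X1, x[None, :])[:, 0] * rbf_kernel(C1, c[None, :])[:, 0]
    # Solve the regularized linear system instead of forming an explicit inverse.
    return np.linalg.solve(K + lam1 * n1 * np.eye(n1), rhs)

# Synthetic first-stage sample (X1, W1, C1); the generating process is made up.
rng = np.random.default_rng(0)
n1 = 50
X1 = rng.normal(size=(n1, 2))
C1 = rng.normal(size=(n1, 1))
W1 = X1[:, :1] + C1 + 0.1 * rng.normal(size=(n1, 1))

b = stage1_weights(X1, C1, X1[0], C1[0], lam1=1e-3)

# <mu_hat_{W|c,x}, phi(w')> = sum_i b_i k(w_{1,i}, w'), i.e. the estimated
# conditional expectation of k(W, w') given (c, x).
w_query = np.zeros((1, 1))
embedding_at_w = b @ rbf_kernel(W1, w_query)[:, 0]
```

The embedding itself lives in $\mathcal{H}_{\mathcal{W}}$ and is never formed explicitly; only the weights `b` and kernel evaluations against `W1` are needed downstream.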

In the second stage, we replace $\mu_{W \mid c, x}^{p}$ with $\widehat{\mu}_{W \mid c, x}^{p}$ in (5.2) and define the empirical risk. Given $n_2$ i.i.d. samples $(X_2, Y_2, C_2) = \{(x_{2,i}, y_{2,i}, c_{2,i})\}_{i=1}^{n_2}$ from the source distribution and a regularization parameter $\lambda_2 > 0$, we solve

\begin{equation}
\mathop{\mathrm{argmin}}_{h_0 \in \mathcal{H}_{\mathcal{W}\mathcal{C}}} \frac{1}{2 n_2} \sum_{i=1}^{n_2} \left( y_{2,i} - \langle h_0, \phi(c_{2,i}) \otimes \widehat{\mu}^{p}_{W \mid c_{2,i}, x_{2,i}} \rangle \right)^2 + \lambda_2 |\!|\!| h_0 |\!|\!|_{\mathcal{H}_{\mathcal{W}\mathcal{C}}}^2. \tag{5.4}
\end{equation}

We follow the analysis procedure of mastouri2021proximal. The solution to (5.4) is given in the following proposition.

Proposition 5.1.

Let $\mathcal{K}_{W_1} \in \mathbb{R}^{n_1 \times n_1}$, $\mathcal{K}_{C_2} \in \mathbb{R}^{n_2 \times n_2}$ be the Gram matrices of $W_1$ and $C_2$, respectively. Let $\mathcal{K}_{X_{12}} \in \mathbb{R}^{n_1 \times n_2}$, $\mathcal{K}_{C_{12}} \in \mathbb{R}^{n_1 \times n_2}$ be the cross Gram matrices of $(X_1, X_2)$ and $(C_1, C_2)$, respectively. For any $\lambda_2 > 0$, there exists a unique optimal solution to (5.4) of the form

\begin{align*}
\widehat{h}_0 &= \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \alpha_{ij}\, \phi(w_{1,i}) \otimes \phi(c_{2,j}); \\
\text{vec}(\alpha) &= (I \,\overline{\otimes}\, \Gamma)(\lambda_2 n_2 I + \Sigma)^{-1} y_2,
\end{align*}

where $\Sigma = (\Gamma^{\top} \mathcal{K}_{W_1} \Gamma) \odot \mathcal{K}_{C_2}$, $\Gamma = (\mathcal{K}_{X_1} \odot \mathcal{K}_{C_1} + \lambda_1 n_1 I)^{-1} (\mathcal{K}_{X_{12}} \odot \mathcal{K}_{C_{12}})$, and $y_2 = \begin{bmatrix} y_{2,1}, \ldots, y_{2,n_2} \end{bmatrix}^{\top}$.
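The closed form in Proposition 5.1 can be sketched in a few lines. The code below takes the stated formulas at face value, assumes a Gaussian kernel for every variable, and uses a made-up data generator; `fit_bridge`, `bridge_value`, `sample`, `lam1`, and `lam2` are hypothetical names. It uses the fact that $(I \,\overline{\otimes}\, \Gamma)\beta$ is the column-major vectorization of $\Gamma\,\mathrm{diag}(\beta)$, so $\alpha$ can be formed by scaling the columns of $\Gamma$.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian kernel matrix [k(a_i, b_j)]_{i,j} for samples stored row-wise in A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def fit_bridge(X1, C1, W1, X2, C2, y2, lam1, lam2):
    n1, n2 = X1.shape[0], X2.shape[0]
    # Stage 1: column j of Gamma is b(x_{2,j}, c_{2,j}) from (5.3).
    K_xc = rbf_kernel(X1, X1) * rbf_kernel(C1, C1)
    Gamma = np.linalg.solve(K_xc + lam1 * n1 * np.eye(n1),
                            rbf_kernel(X1, X2) * rbf_kernel(C1, C2))
    # Stage 2: Sigma = (Gamma^T K_W1 Gamma) * K_C2 (elementwise product).
    Sigma = (Gamma.T @ rbf_kernel(W1, W1) @ Gamma) * rbf_kernel(C2, C2)
    beta = np.linalg.solve(lam2 * n2 * np.eye(n2) + Sigma, y2)
    # alpha = Gamma diag(beta); its column-major vec equals (I KR Gamma) beta.
    alpha = Gamma * beta[None, :]
    return alpha, Gamma, Sigma, beta

def bridge_value(alpha, W1, C2, w, c):
    """h_hat(w, c) = sum_{i,j} alpha_ij k(w_{1,i}, w) k(c_{2,j}, c)."""
    return rbf_kernel(W1, w[None, :])[:, 0] @ alpha @ rbf_kernel(C2, c[None, :])[:, 0]

# Toy source data with a latent U driving X, W, C and Y (illustrative only).
rng = np.random.default_rng(0)

def sample(n):
    U = rng.normal(size=(n, 1))
    X = U + 0.5 * rng.normal(size=(n, 1))
    W = U + 0.5 * rng.normal(size=(n, 1))
    C = X + 0.5 * rng.normal(size=(n, 1))
    y = (C + U)[:, 0] + 0.1 * rng.normal(size=n)
    return X, W, C, y

n1, n2 = 40, 30
X1, W1, C1, _ = sample(n1)
X2, _, C2, y2 = sample(n2)
alpha, Gamma, Sigma, beta = fit_bridge(X1, C1, W1, X2, C2, y2, lam1=1e-2, lam2=1e-2)
h_value = bridge_value(alpha, W1, C2, W1[0], C2[0])
```

Note that the two stages use separate samples: `W1` appears only in the first-stage batch and `y2` only in the second, matching the index sets in the proposition.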

Proposition 5.1 is an application of the Representer theorem (scholkopf2001generalized): the optimal estimate of the infinite-dimensional operator is a finite-rank operator spanned by the feature space of $W_1$ and $C_2$.
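The closed form above can be sketched numerically. The following is a minimal NumPy sketch, not the authors' implementation. It assumes that $\overline{\otimes}$ denotes the column-wise Kronecker product (so that $\alpha_{ij} = \Gamma_{ij} v_j$ with $v = (\lambda_2 n_2 I + \Sigma)^{-1} y_2$), that subscripts $1$ and $2$ index the first- and second-stage samples, that $\mathcal{K}_{X_{12}}$ is the cross-Gram matrix between them, and that all kernels are Gaussian with unit lengthscale. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, lengthscale=1.0):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def fit_bridge_h0(W1, X1, C1, X2, C2, y2, lam1, lam2):
    """Evaluate the closed form of Proposition 5.1 as stated in the text."""
    n1, n2 = X1.shape[0], X2.shape[0]
    K_W1 = gaussian_gram(W1, W1)
    K_C2 = gaussian_gram(C2, C2)
    # Gamma = (K_X1 ⊙ K_C1 + lam1 * n1 * I)^{-1} (K_X12 ⊙ K_C12), shape (n1, n2)
    A = gaussian_gram(X1, X1) * gaussian_gram(C1, C1) + lam1 * n1 * np.eye(n1)
    B = gaussian_gram(X1, X2) * gaussian_gram(C1, C2)
    Gamma = np.linalg.solve(A, B)
    # Sigma = (Gamma^T K_W1 Gamma) ⊙ K_C2, shape (n2, n2)
    Sigma = (Gamma.T @ K_W1 @ Gamma) * K_C2
    v = np.linalg.solve(lam2 * n2 * np.eye(n2) + Sigma, y2)
    # Assumed reading of (I ⊗̄ Gamma) v: column j of Gamma scaled by v_j
    alpha = Gamma * v[None, :]
    return alpha, Gamma, Sigma, v

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n1, n2 = 30, 20
W1 = rng.normal(size=(n1, 1))
X1 = rng.normal(size=(n1, 2))
C1 = rng.normal(size=(n1, 3))
X2 = rng.normal(size=(n2, 2))
C2 = rng.normal(size=(n2, 3))
y2 = rng.normal(size=n2)

alpha, Gamma, Sigma, v = fit_bridge_h0(W1, X1, C1, X2, C2, y2, lam1=1e-2, lam2=1e-2)
fitted = Sigma @ v  # fitted values of the second-stage regression on its own samples
```

Under these assumptions, `alpha` holds the coefficients of $\widehat{h}_0$ on the features $\phi(w_{1,i}) \otimes \phi(c_{2,j})$, and its column-major vectorization matches $\text{vec}(\alpha)$.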

Finally, given estimate $\widehat{\mu}^q_{WC\mid x}$ and a new sample $x_{\text{new}}$, we can construct the empirical predictor of (5.1) as

$$\widehat{y}_{\text{pred}} = \langle \widehat{h}_0, \widehat{\mu}^q_{WC\mid x_{\text{new}}} \rangle.$$

This completes the full adaptation procedure.
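To make the final prediction step concrete, the sketch below evaluates $\langle \widehat{h}_0, \widehat{\mu}^q_{WC\mid x_{\text{new}}} \rangle$ from a coefficient matrix `alpha` of $\widehat{h}_0$. It is a hedged illustration: it assumes that the target-domain embedding takes the usual kernel ridge form $\widehat{\mu}^q_{WC\mid x} = \sum_l b_l(x)\, \phi(w^q_l) \otimes \phi(c^q_l)$ with $b(x) = (\mathcal{K}_{X_q} + \lambda n_q I)^{-1}\Phi_{X_q}(x)$, which may differ in its constants from (5.3). The name `predict_h0`, the regularizer `lam_q`, and the random inputs are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, lengthscale=1.0):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def predict_h0(alpha, W1, C2, Wq, Cq, Xq, X_new, lam_q):
    """Return <h0_hat, mu_hat^q_{WC|x}> for every row x of X_new."""
    nq = Xq.shape[0]
    # Embedding weights b(x), one column per new sample (assumed ridge form)
    b = np.linalg.solve(gaussian_gram(Xq, Xq) + lam_q * nq * np.eye(nq),
                        gaussian_gram(Xq, X_new))
    # h0_hat evaluated at each target pair (w_l, c_l):
    # sum_ij alpha_ij k(w_{1,i}, w_l) k(c_{2,j}, c_l)
    h_at_target = np.einsum("ij,il,jl->l", alpha,
                            gaussian_gram(W1, Wq), gaussian_gram(C2, Cq))
    return b.T @ h_at_target

# Random inputs, for illustration only
rng = np.random.default_rng(1)
n1, n2, nq, m = 8, 6, 10, 4
alpha = rng.normal(size=(n1, n2))
W1 = rng.normal(size=(n1, 1))
C2 = rng.normal(size=(n2, 3))
Wq = rng.normal(size=(nq, 1))
Cq = rng.normal(size=(nq, 3))
Xq = rng.normal(size=(nq, 2))
X_new = rng.normal(size=(m, 2))

y_pred = predict_h0(alpha, W1, C2, Wq, Cq, Xq, X_new, lam_q=1e-2)
```

Only $X$ is needed for the new samples. The proxy $W$ and concept $C$ enter solely through the unlabeled target-domain training samples used to build the embedding.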

On classification tasks. For classification tasks, where the label is $Y \in \{1, \ldots, k_Y\}$, we treat the multi-task regressor as a classifier. We encode $Y$ by a one-hot encoder and then regress on the encoded $\widetilde{Y} \in \{0,1\}^{k_Y}$. Each label $\ell$ has a corresponding bridge function $h_{0,\ell}$ for $\ell \in \{1, \ldots, k_Y\}$. For $i = 1, \ldots, n_2$, let the encoded $y_{2,i}$ be $\widetilde{y}_{2,i} = \begin{bmatrix} \widetilde{y}_{2,i,1}, \ldots, \widetilde{y}_{2,i,k_Y} \end{bmatrix}^\top \in \{0,1\}^{k_Y}$. Then for each $\ell$, we can estimate $h_{0,\ell}$ by replacing $y_{2,i}$ in (5.4) with $\widetilde{y}_{2,i,\ell} \in \{0,1\}$. For each new sample $x_{\text{new}}$, the predicted score of label $\ell$ is $\widehat{y}_{\text{pred},\ell} = \langle \widehat{h}_{0,\ell}, \widehat{\mu}^q_{WC\mid x_{\text{new}}} \rangle$, and we select the label that has the highest prediction score: $\mathop{\mathrm{argmax}}_{\ell} \widehat{y}_{\text{pred},\ell}$.
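Because the closed-form solution is linear in the second-stage responses, the $k_Y$ per-label regressions can share one linear solve with a one-hot response matrix. The sketch below illustrates the encoding and the argmax rule only. The matrix `Sigma` is a stand-in Gram matrix built from random points rather than the $\Sigma$ of Proposition 5.1, and the helper names are hypothetical.

```python
import numpy as np

def one_hot(labels, k):
    """Encode labels in {1, ..., k} as rows of a {0, 1}^k matrix."""
    Y = np.zeros((labels.shape[0], k))
    Y[np.arange(labels.shape[0]), labels - 1] = 1.0
    return Y

def fit_label_coefficients(Sigma, labels, k, lam2):
    """Solve (lam2 * n2 * I + Sigma) V = Y_tilde; column l of V belongs to label l + 1."""
    n2 = Sigma.shape[0]
    return np.linalg.solve(lam2 * n2 * np.eye(n2) + Sigma, one_hot(labels, k))

def predict_labels(scores):
    """Pick the label in {1, ..., k} with the highest predicted score."""
    return np.argmax(scores, axis=1) + 1

# Stand-in second-stage Gram matrix from random points (illustration only)
rng = np.random.default_rng(0)
Z = rng.normal(size=(25, 2))
sq = (Z**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * Z @ Z.T
Sigma = np.exp(-np.maximum(sq, 0.0) / 2.0)
labels = rng.integers(1, 4, size=25)  # synthetic labels in {1, 2, 3}

V = fit_label_coefficients(Sigma, labels, k=3, lam2=1e-3)
train_scores = Sigma @ V              # one score column per label
train_pred = predict_labels(train_scores)
```

The scores are regression outputs rather than calibrated probabilities, so only their ordering is used when selecting the label.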

5.2 Adaptation with Multiple Domains

In the multiple source domain setting, the estimation of $m_0$ follows similarly to that of $h_0$. Assuming that $m_0 \in \mathcal{H}_{\mathcal{W}\mathcal{X}}$, then (4.3) can be written as

$$\mathbb{E}_r[Y \mid x] = \mathbb{E}_p[\langle m_0, \mu_{W\mid x,r} \otimes \phi(x) \rangle \mid x],$$

for $r = 1, \ldots, k_Z$. The task is to estimate $m_0$ from the source domain and then apply it to the target domain. We can define the population risk function as

$$\begin{aligned}
R(m_0) &= \sum_{r=1}^{k_Z} \mathbb{E}_r[(Y - G_{m_0}(r, X))^2]; \qquad \text{(5.5)}\\
G_{m_0}(r, x) &= \langle m_0, \mu_{W\mid r,x} \otimes \phi(x) \rangle.
\end{aligned}$$

We employ the same two-stage estimation procedure as for $h_0$: (i) we first estimate $\mu_{W\mid r,x}$, and then (ii) we plug in the estimate $\widehat{\mu}_{W\mid r,x}$ to estimate $m_0$.

In the $r$-th domain, we observe the samples $\{(w_{r,i}, x_{r,i}, r)\}_{i=1}^{n_r}$. As with (5.3), we learn a conditional mean embedding $\widehat{\mu}_{W\mid r,x} = \sum_{i=1}^{n_r} d_{r,i}(x)\phi(w_{r,i})$, where $d_r(x) = (\mathcal{K}_{X_r} + \lambda_3 I)^{-1}\left(\Phi_{X_r}(x)\right) \in \mathbb{R}^{n_r}$ and $\lambda_3 > 0$, for $r = 1, \ldots, k_Z$. In the second stage, given another batch of independent samples $\{(y_{r,i}, x_{r,i}, r)\}_{i=1}^{n_r}$ for $r = 1, \ldots, k_Z$, we minimize

$$\frac{1}{2\sum_{r=1}^{k_Z} n_r} \sum_{r=1}^{k_Z} \sum_{i=1}^{n_r} \left(y_{r,i} - \langle m_0, \phi(x_{r,i}) \otimes \widehat{\mu}_{W\mid r,x_{r,i}} \rangle\right)^2 + \lambda_4 |\!|\!| m_0 |\!|\!|_{\mathcal{H}_{\mathcal{W}\mathcal{X}}}^2. \qquad \text{(5.6)}$$

Then $\widehat{m}_0$ admits an analytical solution of a form similar to that of $\widehat{h}_0$ in Proposition 5.1 (see Appendix C.2 for details). Finally, with the estimated conditional mean embedding $\widehat{\mu}^q_{W\mid x}$ and a new sample $x_{\text{new}}$ from the target test set, we have

$$\widehat{y}_{\text{pred}} = \langle \widehat{m}_0, \widehat{\mu}^q_{W\mid x_{\text{new}}} \otimes \phi(x_{\text{new}}) \rangle.$$

We convert the regression task with $m_0$ into a classification task by learning $k_Y$ bridge functions, where each bridge function $m_{0,\ell}$ corresponds to label $\ell$.
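The multi-domain procedure of this subsection can be sketched end to end as follows. This is an illustrative derivation rather than the solution of Appendix C.2: it writes $\widehat{m}_0 = \sum_j \beta_j\, \phi(x_j) \otimes \widehat{\mu}_j$ over the pooled second-stage samples and solves the stationarity condition of (5.6) as written, which gives $(G + 2 n \lambda_4 I)\beta = y$ with $n = \sum_r n_r$ and $G_{ij} = k(x_i, x_j)\langle \widehat{\mu}_i, \widehat{\mu}_j \rangle$. The regularization constants may therefore differ from the paper's. Gaussian kernels, the target-domain regularizer `lam_q`, all function names, and the synthetic data are assumptions.

```python
import numpy as np

def gram(A, B, ls=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * ls**2))

def cme_weights(X_train, X_eval, lam):
    """Columns are d(x) = (K_X + lam * I)^{-1} Phi_X(x) for each row x of X_eval."""
    n = X_train.shape[0]
    return np.linalg.solve(gram(X_train, X_train) + lam * np.eye(n),
                           gram(X_train, X_eval))

def fit_m0(stage1, stage2, lam3, lam4):
    """stage1: list of (W_r, X_r); stage2: list of (X_r, y_r); one entry per domain."""
    W_all = np.vstack([W for W, _ in stage1])
    X2 = np.vstack([X for X, _ in stage2])
    y2 = np.concatenate([y for _, y in stage2])
    n = y2.shape[0]
    # Column j of D holds the embedding weights of second-stage sample j over the
    # pooled first-stage proxies; entries outside the sample's own domain stay zero.
    D = np.zeros((W_all.shape[0], n))
    row = col = 0
    for (W_r, X1_r), (X2_r, _) in zip(stage1, stage2):
        D[row:row + len(W_r), col:col + len(X2_r)] = cme_weights(X1_r, X2_r, lam3)
        row, col = row + len(W_r), col + len(X2_r)
    G = gram(X2, X2) * (D.T @ gram(W_all, W_all) @ D)
    beta = np.linalg.solve(G + 2.0 * n * lam4 * np.eye(n), y2)
    return {"beta": beta, "D": D, "W": W_all, "X": X2, "G": G}

def predict_m0(model, Wq, Xq, X_new, lam_q):
    """Return <m0_hat, mu_hat^q_{W|x} ⊗ phi(x)> for every row x of X_new."""
    B = cme_weights(Xq, X_new, lam_q)                # target embedding weights
    inner = model["D"].T @ gram(model["W"], Wq) @ B  # <mu_hat_j, mu_hat^q_{W|x}>
    return (gram(model["X"], X_new) * inner).T @ model["beta"]

# Synthetic data with three source domains, for illustration only
rng = np.random.default_rng(0)
stage1, stage2 = [], []
for r in range(3):
    X1 = rng.normal(loc=r, size=(15, 2))
    stage1.append((X1[:, :1] + 0.1 * rng.normal(size=(15, 1)), X1))
    X2 = rng.normal(loc=r, size=(12, 2))
    stage2.append((X2, np.sin(X2[:, 0]) + 0.1 * rng.normal(size=12)))

model = fit_m0(stage1, stage2, lam3=1e-2, lam4=1e-3)
Xq = rng.normal(loc=1.5, size=(20, 2))
Wq = Xq[:, :1] + 0.1 * rng.normal(size=(20, 1))
y_pred = predict_m0(model, Wq, Xq, rng.normal(loc=1.5, size=(5, 2)), lam_q=1e-2)
```

As in the text, prediction requires only unlabeled $(W, X)$ samples from the target domain to form $\widehat{\mu}^q_{W\mid x}$. No target labels enter the sketch.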

6 Experiments

Figure 2: Adaptation results with concept and proxy. (a) Classification task on simulated data. (b) Regression on the dSprites dataset. Shown is the average evaluation metric on held-out target distribution samples across 10 independent replicates of the data. The proposed method is robust to the latent shift compared to the baselines in both cases. (a) We set $P(U=1) = 0.1$. Both the AUROC and the accuracy of the proposed method remain nearly constant across degrees of shift, while the performance of the other baselines drops as $Q(U=1)$ moves to $0.9$. (b) The left panel shows the density function of $U$; the overlapping area of the two distributions shrinks as $a$ moves rightward. The right panel shows that our method is robust even when the overlapping area between the two distributions is small.
Table 1: Multi-domain adaptation result. The values are the average AUROC of $10$ independent replicates of the data. Each task has three source domains with different $P_r(U)$ and one target domain. The proposed method has outperformed other baselines and is close to the ORACLE in Task 2.

| Task | ORACLE | Cat-ERM | Avg-ERM | SA | MK | WCSC | DANN | MMD | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| Task 1 | 0.9425 ± 0.0039 | 0.8030 ± 0.0155 | 0.7916 ± 0.0148 | 0.7918 ± 0.0148 | 0.5848 ± 0.0593 | 0.5221 ± 0.0299 | 0.8039 ± 0.0229 | 0.8055 ± 0.0248 | $\mathbf{0.8848}$ ± 0.0120 |
| Task 2 | 0.9431 ± 0.0061 | 0.8942 ± 0.0084 | 0.8953 ± 0.0079 | 0.8953 ± 0.0079 | 0.8054 ± 0.0204 | 0.8144 ± 0.0474 | 0.9158 ± 0.0125 | 0.9149 ± 0.0135 | $\mathbf{0.9318}$ ± 0.0063 |
| Task 3 | 0.8876 ± 0.0085 | 0.8483 ± 0.0134 | 0.8427 ± 0.0130 | 0.8408 ± 0.0132 | 0.8002 ± 0.0311 | 0.7428 ± 0.0311 | 0.8480 ± 0.0166 | 0.8470 ± 0.0181 | $\mathbf{0.8569}$ ± 0.0095 |

We verify our theory with both simulated and real data, demonstrating robustness to latent shifts and transferability of the bridge functions.

For the setting with concept variables present, we compare our method with the following baselines: Empirical Risk Minimization (ERM), Covariate shift weighting (COVAR) (shimodaira2000improving), Label shift weighting (LABEL) (buck1966comparison), and the spectral (LSA-S) and Wasserstein Autoencoder (LSA-WAE) latent shift adaptation approaches (alabdulmohsin2023adapting). For the multi-domain setting, we compare our method with the following baselines: Simple Adaptation (SA) (mansour2008domain), Weighted Combination of Source Classifiers (WCSC) (zhang2015multi), and Marginal Kernel (MK) (blanchard2011generalizing). We also compare with multi-domain generalization baselines (muandet2013domain): Domain Adversarial Neural Networks (DANN) (ganin2016domain) and Maximum Mean Discrepancy (MMD) (GreBorRasSchetal12). Additionally, we adapt the ERM method to the multi-domain setting either by concatenating the source samples to learn one ERM model (Cat-ERM) or by averaging the predictions of the per-source-domain ERM models (Avg-ERM). The ORACLE model is trained on target distribution samples and evaluated on held-out target distribution samples. The tuning parameters for all models, including the proposed model, are selected using five-fold cross-validation. Details regarding the setups are in Appendix D.
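The difference between the two multi-domain ERM baselines can be illustrated with a small sketch. Here a ridge-regularized linear model is swapped in for the ERM learner purely to keep the example short, and the function names and synthetic data are hypothetical. Cat-ERM pools the source samples into one fit, while Avg-ERM fits one model per source domain and averages their predictions.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    """Ridge-regularized least squares with an intercept (stand-in ERM learner)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

def predict_ridge(theta, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ theta

def cat_erm(domains, X_new):
    """Fit a single model on the concatenation of all source domains."""
    X = np.vstack([X for X, _ in domains])
    y = np.concatenate([y for _, y in domains])
    return predict_ridge(fit_ridge(X, y), X_new)

def avg_erm(domains, X_new):
    """Fit one model per source domain and average their predictions."""
    return np.mean([predict_ridge(fit_ridge(X, y), X_new) for X, y in domains], axis=0)

# Synthetic source domains whose input-output relation differs (illustration only)
rng = np.random.default_rng(0)
domains = []
for slope in (0.5, 1.0, 2.0):
    X = rng.normal(size=(50, 2))
    domains.append((X, slope * X[:, 0] + 0.1 * rng.normal(size=50)))

X_new = rng.normal(size=(5, 2))
pred_cat = cat_erm(domains, X_new)
pred_avg = avg_erm(domains, X_new)
```

Neither variant uses any information from the target domain, which is why both serve as unadapted reference points in the comparisons.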

Classification task. The task designed in alabdulmohsin2023adapting is a binary classification problem with $Y \in \{0,1\}$, where the latent variable $U \in \{0,1\}$ is a Bernoulli random variable. Additionally, $X \in \mathbb{R}^2$ and $W \in \mathbb{R}$ are continuous random variables, and $C \in \mathbb{R}^3$ is a discrete variable. We have one source domain with $P(U=1) = 0.1$. We evaluate the models on target distributions with $Q(U=1) \in \{0.1, \ldots, 0.9\}$. The goal of this task is to investigate whether the adaptation method is robust to arbitrary shifts in $U$.

The ORACLE and ERM models are implemented as multilayer perceptrons (MLPs). The kernel function used in the proposed method is the Gaussian kernel.

We compare the proposed method with the spectral (LSA-S) and Wasserstein Autoencoder (LSA-WAE) latent shift adaptation approaches developed in alabdulmohsin2023adapting. While all three methods are designed to adjust for shift under the same graph in Figure 0(c), our method additionally takes $W, C, X$ as training samples in the target domain, while LSA-S and LSA-WAE only take $X$. For all three methods, only $X$ is observed in the test data.

While the identification theory developed in (alabdulmohsin2023adapting) does not require $W, C$ in the target domain, we are aware that in practice, having more information in the target domain may improve estimation. To make the methods more directly comparable, we design an additional step to incorporate $W$ from the target in the LSA-S algorithm. We describe this procedure in more detail in Appendix D.1.

Results are shown in Figure 2. The proposed method is more robust to the shift than the baselines and is close to the ORACLE model. Giving LSA-S access to observed $W$ in the target domain does not improve its performance over LSA-S without $W$. We also compare results under different noise levels and observe similar trends, as discussed in Appendix D.

dSprites dataset regression task. We test the proposed procedure on the dSprites (dsprites17) dataset, an image dataset described by five latent parameters (shape, scale, rotation, posX, and posY). Motivated by dsprites17’s experiments, we design a regression task where the dSprites images ($64 \times 64 = 4096$-dimensional) are $X \in \mathbb{R}^{64 \times 64}$ and subject to a nonlinear confounder $U \in [0, 2\pi]$, which is a rotation of the image. $W \in \mathbb{R}$ and $C \in \mathbb{R}$ are continuous random variables. For this experiment, we have $7000$ training samples and $3000$ test samples. Further details about the procedure are in Appendix D.

In the results in Figure 2, we vary $a$, which controls the region of the source distribution on which the target distribution concentrates. We design the experiment such that increasing $a$ shifts the target distribution to increasingly low-mass regions of the source distribution. We compute the mean squared error of each method on test examples from the target distribution.

We find that, while the baseline methods degrade as the target distribution shift increases, the proposed method adapts and maintains low error, nearly matching the error achieved by the oracle, which is trained on target distribution samples.

6.1 Multi-Domain Adaptation

Figure 3: Concept and multi-domain adaptation with MIMIC-CXR. Shown are the mean $\pm$ SD AUROC of concept (left) and multi-domain adaptation (right) for classification of “No finding” from embeddings of chest X-rays over five replicates of a sampling procedure that introduces a shift in the prevalence of “No finding” with patient sex subgroups, where radiology report embeddings serve as concept variables $C$ and patient age serves as the proxy $W$. In the concept adaptation experiment, the source domain corresponds to $P(U=1) = P(Y=1 \mid \textrm{Sex}=\textrm{Female}) = P(Y=0 \mid \textrm{Sex}=\textrm{Male}) = 0.1$. In the multi-domain adaptation experiment, we consider two source domains $P(U=1) = \{0.1, 0.2\}$.

In the multi-domain setting, we use the same classification dataset provided in alabdulmohsin2023adapting, as in Section D.6. We assume that $C$ is not observed in any domain and generate multiple datasets drawn with different distributions over $U$.

Classification task. We construct three different tasks with different settings of $P(U)$ over the source and target domains. For each task, we construct three source domains and one target domain, drawing $3200$ random training samples for each source domain and $9600$ random training samples for the target domain. The source domains of Tasks 1–3 have different combinations of distributions over $U$, documented in Appendix D.3.

The backbone models for ORACLE, Cat-ERM, Avg-ERM, and SA (mansour2008domain) are simple MLPs; MK (blanchard2011generalizing) is a weighted kernel support vector machine; WCSC (zhang2015multi) is a re-weighted kernel density estimator. SA (mansour2008domain) assumes that $Q(X)$ is a convex combination of $P_r(X)$ for $r = 1, \ldots, k_Z$; WCSC (zhang2015multi) assumes that $Q(X \mid Y)$ is a linear mixture of $P_r(X \mid Y)$ for $r = 1, \ldots, k_Z$; and MK (blanchard2011generalizing) assumes that each domain is an i.i.d. realization from the general distribution.

The results are shown in Table 1. Overall, we find our approach performs better than ERM and baseline multi-domain adaptation methods. All methods perform better in the setting of Task 2 than for Task 1, informally demonstrating the effect of the closeness of the source domains to the target domain. For Task 3, while our proposed approach performs best, ERM also performs well, and substantially better than the domain adaptation baselines.

Regression task. We consider two regression tasks, where $U$ is either a Bernoulli or a Beta random variable. We present the results in Appendix D.

6.2 Concept and multi-domain adaptation with MIMIC-CXR

We conduct a small-scale experiment using a sample of chest X-ray data extracted from the MIMIC-CXR dataset (johnson2019mimic). We briefly describe the experimental design and results here, and include a complete description in Appendix D.7. We consider classification of the absence of a radiological finding from low-dimensional embeddings of the X-rays (sellergren2022simplified), using the absence of a radiological finding in the radiology report as the target of prediction. This corresponds to the “No Finding” label defined by irvin2019chexpert.

We consider distribution shifts similar to settings in makar2022causally, where patient sex is considered as a possible “shortcut” in the classification of the absence of a radiological finding. We impose distribution shift through structured resampling of the data where $P(U=1) = P(Y=1 \mid \textrm{Sex}=\textrm{Female}) = P(Y=0 \mid \textrm{Sex}=\textrm{Male})$ and $P(\textrm{Sex}=\textrm{Female}) = P(\textrm{Sex}=\textrm{Male}) = 0.5$ is held constant. We perform both concept adaptation and multi-domain adaptation experiments with the MIMIC-CXR data. For the concept adaptation experiment, we consider the concept variable $C$ to be the embedding of a radiology report associated with the chest X-ray. We experiment with the use of patient age as a potential proxy $W$ for $U$ due to a hypothesized correlation between the presence of radiological findings and patient age.
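The structured resampling constraint can be made concrete with a short sketch. With $p = P(U=1)$ and balanced sexes, the four (sex, label) cells have probabilities $0.5p$ for (Female, $Y=1$) and (Male, $Y=0$), and $0.5(1-p)$ for (Female, $Y=0$) and (Male, $Y=1$). The function below draws indices to match those proportions. It is an assumption-laden illustration: sampling with replacement, the function name, and the synthetic arrays are hypothetical, and the authors' actual procedure is described in their Appendix D.7.

```python
import numpy as np

def resample_with_shift(is_female, y, p, n, rng):
    """Return indices of a resample with P(Y=1 | Female) = P(Y=0 | Male) = p
    and P(Female) = P(Male) = 0.5."""
    cell_probs = {
        (1, 1): 0.5 * p,          # Female, Y = 1
        (1, 0): 0.5 * (1.0 - p),  # Female, Y = 0
        (0, 0): 0.5 * p,          # Male,   Y = 0
        (0, 1): 0.5 * (1.0 - p),  # Male,   Y = 1
    }
    idx = []
    for (female, label), prob in cell_probs.items():
        pool = np.flatnonzero((is_female == female) & (y == label))
        # Sampling with replacement is an assumption made for this sketch
        idx.append(rng.choice(pool, size=int(round(n * prob)), replace=True))
    return np.concatenate(idx)

# Synthetic attributes and labels, for illustration only
rng = np.random.default_rng(0)
is_female = rng.integers(0, 2, size=5000)
y = rng.integers(0, 2, size=5000)

idx = resample_with_shift(is_female, y, p=0.1, n=1000, rng=rng)
female = is_female[idx] == 1
rate_y1_female = y[idx][female].mean()
rate_y0_male = 1.0 - y[idx][~female].mean()
```

Sweeping `p` over a grid yields a family of resampled domains that differ only in the strength of the association between sex and the label.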

The results are summarized in Figure 3. For both experiments, we find that the performance of baseline models fit using only information from the source domain(s) degrades under distribution shift. In the concept adaptation experiment, adaptation is relatively successful, as much of the performance of comparator models fit using target domain data is recovered by the adaptation procedure.

However, we find that the multi-domain adaptation procedure is not successful. In this case, we find that while the multi-domain adaptation procedure marginally outperforms a model fit using the concatenated source domain data under distribution shift, it recovers substantially less of the performance of the target domain model than the concept adaptation procedure does. Furthermore, the adapted model does not outperform the kernel estimators that only leverage information from the source domains. The lack of success in this setting could potentially be explained by insufficient number or diversity of domains relative to the level of noise induced by sampling variability and limited sample size.

7 Discussion

We propose a strategy for adaptation under distribution shift in a latent variable using a bridge function approach (miao2018identifying; tchetgen2020introduction). This approach allows for identification of the optimal predictor in the target domain without identifying the distribution of the latent variable and without distributional assumptions on the form of the latent. We require that proxies of the latent variable are present and that (i) mediating concepts are available or (ii) data from multiple source domains are present.

We argue our approach is useful for two reasons. First, the latent distribution in general is only identifiable under strict distributional assumptions (locatello2019challenging). Second, recovery of the latent variable may be challenging in practice even if it is identifiable (rissanen2021critical). For example, because most latent variable estimation methods are designed to model the data generating process (kingma2013auto), one might allocate substantial modeling capacity to variability in the data and the latent variable that are irrelevant to modeling the shift in the conditional distribution of $Y \mid X$. By contrast, we model only the components of the observable variables relevant to the adaptation.

Acknowledgments: We thank Zhu Li and Dimitri Meunier for helpful discussions. AG was partly supported by the Gatsby Charitable Foundation. OS was partly supported by the UIUC Beckman Institute Graduate Research Fellowship, NSF-NRT 1735252. KT was partly supported by NSF Graduate Research Fellowship Program. SK was partly supported by the NSF III 2046795, IIS 1909577, CCF 1934986, NIH 1R01MH116226-01A, NIFA award 2020-67021-32799, the Alfred P. Sloan Foundation, and Google Inc. This study was funded by Google LLC and/or a subsidiary thereof (‘Google’).

References

Appendix A Identification of the Distribution

In this section, we demonstrate the existence of the bridge functions $h_0$ and $m_0$ under certain regularity conditions. We first discuss the discrete case and then generalize to the continuous case.

A.1 The Discrete Case of the Bridge Function $h_0$

The idea of the bridge function $h_0$ may seem abstract in the continuous setting. When every variable is discrete, however, the bridge function can be constructed by solving a series of matrix problems. This idea originates from miao2018identifying, and we apply the technique to show the construction of the bridge function when every variable $(W, U, C, X, Y)$ is discrete.

Let

$$
\begin{aligned}
\mathbf{P}(W \mid u) &= \begin{bmatrix} P(w_1 \mid u) & \ldots & P(w_{k_W} \mid u) \end{bmatrix}^{\top} \in \mathbb{R}^{k_W}; \\
\mathbf{P}(W \mid U) &= \begin{bmatrix} \mathbf{P}(W \mid u_1) & \ldots & \mathbf{P}(W \mid u_{k_U}) \end{bmatrix} \in \mathbb{R}^{k_W \times k_U},
\end{aligned}
$$

be a column vector, and a matrix, respectively. We define similarly

$$
\begin{aligned}
\mathbf{P}(U \mid x, c) &= \begin{bmatrix} P(u_1 \mid c, x) & \ldots & P(u_{k_U} \mid c, x) \end{bmatrix}^{\top} \in \mathbb{R}^{k_U}; \\
\mathbf{P}(U \mid X, c) &= \begin{bmatrix} \mathbf{P}(U \mid x_1, c) & \ldots & \mathbf{P}(U \mid x_{k_X}, c) \end{bmatrix} \in \mathbb{R}^{k_U \times k_X},
\end{aligned}
$$

for $c \in \mathcal{C}$. We define

$$
\begin{aligned}
\mathbf{P}(Y \mid X, c) &= \begin{bmatrix} \mathbf{P}(Y \mid x_1, c) & \ldots & \mathbf{P}(Y \mid x_{k_X}, c) \end{bmatrix} \in \mathbb{R}^{k_Y \times k_X}; \\
\mathbf{P}(Y \mid U, c) &= \begin{bmatrix} \mathbf{P}(Y \mid u_1, c) & \ldots & \mathbf{P}(Y \mid u_{k_U}, c) \end{bmatrix} \in \mathbb{R}^{k_Y \times k_U}; \\
\mathbf{P}(W \mid X, c) &= \begin{bmatrix} \mathbf{P}(W \mid x_1, c) & \ldots & \mathbf{P}(W \mid x_{k_X}, c) \end{bmatrix} \in \mathbb{R}^{k_W \times k_X},
\end{aligned}
$$

analogously. As an alternative to finding an $h_0(w, c)$ such that

$$
\mathbb{E}[Y \mid c, x] = \sum_{i=1}^{k_W} h_0(w_i, c)\, p(w_i \mid c, x),
$$

the proxy problem is converted to finding an $\widetilde{H}_0(Y, W, c)$ such that

$$
\mathbf{P}(Y \mid X, c) = \widetilde{H}_0(Y, W, c)\, \mathbf{P}(W \mid X, c), \quad c \in \mathcal{C}.
$$

First, under the condition that $W \perp\!\!\!\perp \{X, C\} \mid U$, we can write

$$
\mathbf{P}(W \mid X, c) = \mathbf{P}(W \mid U)\, \mathbf{P}(U \mid X, c). \tag{A.1}
$$

Similarly, under the condition that $Y \perp\!\!\!\perp X \mid \{U, C\}$, we have

$$
\mathbf{P}(Y \mid X, c) = \mathbf{P}(Y \mid U, c)\, \mathbf{P}(U \mid X, c). \tag{A.2}
$$

We introduce the following assumption:

Assumption 7.

Columns of $\mathbf{P}(W \mid U)$ are linearly independent. For every $c \in \mathcal{C}$, the columns of $\mathbf{P}(W \mid X, c)$ satisfy $\mathbf{P}(W \mid x, c) \in \mathcal{N}(\mathbf{P}(W \mid U)^{*})^{\perp}$ for all $x \in \mathcal{X}$.

Assumption 7 is the requirement for the least-squares problem to have a unique solution. Hence, by Assumption 7, we have

$$
\mathbf{P}(U \mid X, c) = \mathbf{P}(W \mid U)^{\dagger}\, \mathbf{P}(W \mid X, c),
$$

where $\mathbf{P}(W \mid U)^{\dagger}$ is the generalized inverse of $\mathbf{P}(W \mid U)$. Plugging the above equation into (A.2), we see that

$$
\mathbf{P}(Y \mid X, c) = \underbrace{\mathbf{P}(Y \mid U, c)\, \mathbf{P}(W \mid U)^{\dagger}}_{\widetilde{H}(Y, W, c)}\, \mathbf{P}(W \mid X, c).
$$
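The matrix construction above can be checked numerically. The following is a minimal sketch, not the estimator used in the paper: it assumes randomly generated conditional probability tables with illustrative sizes (all names and dimensions are hypothetical), builds the observable matrices via (A.1) and (A.2), and verifies that the bridge matrix formed from the pseudoinverse of $\mathbf{P}(W \mid U)$ reproduces $\mathbf{P}(Y \mid X, c)$.

```python
import numpy as np

rng = np.random.default_rng(0)
k_W, k_U, k_X, k_Y = 4, 3, 5, 2  # illustrative sizes, chosen so that k_W >= k_U


def random_conditional(n_rows, n_cols):
    """Return a column-stochastic matrix: each column is a probability vector."""
    return rng.dirichlet(np.ones(n_rows), size=n_cols).T


P_W_given_U = random_conditional(k_W, k_U)    # P(W | U),    shape (k_W, k_U)
P_U_given_Xc = random_conditional(k_U, k_X)   # P(U | X, c), shape (k_U, k_X)
P_Y_given_Uc = random_conditional(k_Y, k_U)   # P(Y | U, c), shape (k_Y, k_U)

# Observable quantities, built from the factorizations (A.1) and (A.2).
P_W_given_Xc = P_W_given_U @ P_U_given_Xc
P_Y_given_Xc = P_Y_given_Uc @ P_U_given_Xc

# First part of Assumption 7: columns of P(W | U) are linearly independent.
full_column_rank = np.linalg.matrix_rank(P_W_given_U) == k_U

# Bridge matrix H = P(Y | U, c) P(W | U)^dagger.
H = P_Y_given_Uc @ np.linalg.pinv(P_W_given_U)

# The bridge equation P(Y | X, c) = H P(W | X, c) should hold up to round-off.
bridge_holds = np.allclose(P_Y_given_Xc, H @ P_W_given_Xc)
print(full_column_rank, bridge_holds)
```

Because $\mathbf{P}(W \mid U)$ has full column rank here, its pseudoinverse is a left inverse, which is what makes the substitution into (A.2) exact in this toy setting.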

A.2 Existence of the Bridge Function $h_0$

Sufficient conditions for the existence of $h_0$ were originally discussed in miao2018identifying; we adapt them to our setting and provide a brief review in this section. We assume the following completeness assumption and regularity conditions. The completeness assumption is equivalent to Condition (iii) in miao2018identifying.

Assumption 8.

For any mean squared integrable function $g$ and for $c \in \mathcal{C}$, $\mathbb{E}[g(X) \mid W, c] = 0$ almost surely if and only if $g(X) = 0$ almost surely.

Let $f$ denote the density under either $p$ or $q$. We consider $K_c : L_2(W \mid c) \rightarrow L_2(X \mid c)$, the conditional expectation operator associated with the kernel function

$$
k(w, x, c) = \frac{f(w, x \mid c)}{f(w \mid c)\, f(x \mid c)}.
$$

Then it follows that $\mathbb{E}[Y \mid c, x] = K_c h_0$:

$$
\begin{aligned}
\mathbb{E}[Y \mid c, x] &= \int_{\mathcal{W}} h_0(w, c)\, f(w \mid x, c)\, \mathrm{d}w \\
&= \int k(w, x, c)\, h_0(w, c)\, f(w \mid c)\, \mathrm{d}w = K_c h_0.
\end{aligned}
$$
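The second equality uses $k(w, x, c)\, f(w \mid c) = f(w, x \mid c) / f(x \mid c) = f(w \mid x, c)$. As a rough illustration, the sketch below checks this identity on a finite grid for a single fixed $c$; the joint table and the test function `h` are randomly generated stand-ins (hypothetical, not quantities from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)
n_w, n_x = 4, 3  # illustrative grid sizes for W and X

# Hypothetical joint table f(w, x | c) for one fixed c.
joint = rng.random((n_w, n_x))
joint /= joint.sum()

f_w = joint.sum(axis=1)               # marginal f(w | c)
f_x = joint.sum(axis=0)               # marginal f(x | c)
kernel = joint / np.outer(f_w, f_x)   # k(w, x, c) = f(w, x | c) / (f(w | c) f(x | c))

h = rng.normal(size=n_w)              # arbitrary function h(w, c) on the grid

# Operator form: (K_c h)(x) = sum_w k(w, x, c) h(w, c) f(w | c).
Kh = (kernel * (h * f_w)[:, None]).sum(axis=0)

# Conditional expectation form: sum_w h(w, c) f(w | x, c).
f_w_given_x = joint / f_x             # column j is f(. | x_j, c)
cond_exp = h @ f_w_given_x

operator_matches = np.allclose(Kh, cond_exp)
print(operator_matches)
```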

To find the solution $h_0$, we assume the following.

Assumption 9.

For any $c \in \mathcal{C}$, $\int_{\mathcal{W}} \int_{\mathcal{X}} f(w \mid c, x)\, f(x \mid c, w)\, \mathrm{d}w\, \mathrm{d}x < \infty$.

This is a sufficient condition to ensure that $K_c$ is a compact operator (carrasco2007linear, Example 2.3). Hence, by compactness, there exists a singular system $\{\lambda_{c,i}, \phi_{c,i}, \psi_{c,i}\}_{i \in \mathbb{N}}$ of $K_c$ for every $c \in \mathcal{C}$.

Assumption 10.

For fixed $c \in \mathcal{C}$:

  1. $\mathbb{E}[Y \mid X, c] \in L_2(X \mid c)$;

  2. $\sum_{i \in \mathbb{N}} \lambda_{c,i}^{-2} \left| \langle \mathbb{E}[Y \mid X, c], \psi_{c,i} \rangle \right|^2 < \infty$.

The above two assumptions are restatements of Conditions (v)–(vii) in miao2018identifying. We adapt the results from Proposition 1 in miao2018identifying to the graph in Figure 0(c), which replaces the node $X$ by $C$ and the node $Z$ by $X$.

Proposition A.1 (Existence of $h_0$, adapted from Proposition 1 in miao2018identifying).

Under Assumptions 2 and 8–10, the solution to (4.1) exists.

Proof.

The proof follows directly from Picard's theorem (Lemma A.3). Assumption 9 implies that $K_c$ is a compact operator. Assumption 8 implies that $\mathcal{N}(K_c^{*})^{\perp} = L_2(X \mid c)$. Therefore, under the first statement in Assumption 10, we have $\mathbb{E}[Y \mid X, c] \in \mathcal{N}(K_c^{*})^{\perp}$. Together with the second statement in Assumption 10, this allows us to apply Lemma A.3. ∎

A.3 Existence of Bridge Function $m_0$

The proof of the existence of $m_0^p$ is similar to the analysis of $h_0$. Let $K_x : L_2(W \mid x) \rightarrow L_2(Z \mid x)$ be the integral operator associated with the kernel function $k(w, x, z) = p(w, z \mid x) / (p(w \mid x)\, p(z \mid x))$. Then, we can write

$$
\mathbb{E}_p[Y \mid x, z] = \int k(w, x, z)\, p(w \mid x)\, m_0(w, x)\, \mathrm{d}w = K_x m_0.
$$

Proposition A.2 (Existence of $m_0$, Proposition 1 in miao2018identifying).

Assume that

  1. for any mean squared integrable function $g$ and for $x \in \mathcal{X}$, $\mathbb{E}[g(Z) \mid W, x] = 0$ almost surely if and only if $g(Z) = 0$ almost surely;

  2. for any $x \in \mathcal{X}$, $\int_{\mathcal{W}} \int_{\mathcal{Z}} f(w \mid x, z)\, f(z \mid x, w)\, \mathrm{d}w\, \mathrm{d}z < \infty$;

  3. for any $x \in \mathcal{X}$, $\mathbb{E}[Y \mid Z, x] \in L_2(Z \mid x)$;

  4. for any $x \in \mathcal{X}$, $\sum_{i \in \mathbb{N}} \lambda_{x,i}^{-2} \left| \langle \mathbb{E}[Y \mid Z, x], \psi_{x,i} \rangle \right|^2 < \infty$, where $(\lambda_{x,i}, \phi_{x,i}, \psi_{x,i})$ is the singular system of $K_x$.

Then the solution $m_0^p$ exists.

The proof of Proposition A.2 is similar to the proof of Proposition 1 in miao2018identifying, where we replace $P(y \mid z, x)$ in Proposition 1 of miao2018identifying with $\mathbb{E}[Y \mid Z, x]$. The existence of $m_0^q$ follows by the same argument as Proposition A.2.

A.4 Auxiliary Lemma

We state Picard's theorem as follows.

Lemma A.3 (Picard's Theorem).

Let $K : \mathcal{H}_1 \rightarrow \mathcal{H}_2$ be a compact operator with singular system $\{\lambda_j, \varphi_j, \psi_j\}_{j=1}^{\infty}$ and let $\phi$ be a given function in $\mathcal{H}_2$. Then the equation of the first kind $K h = \phi$ has solutions if and only if

  1. $\phi \in \mathcal{N}(K^{*})^{\perp}$, where $\mathcal{N}(K^{*}) = \{h : K^{*} h = 0\}$ is the null space of the adjoint operator $K^{*}$;

  2. $\sum_{j=1}^{+\infty} \lambda_j^{-2} \left| \langle \phi, \psi_j \rangle \right|^2 < \infty$.
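A finite-dimensional analogue may help build intuition for the two conditions. In the sketch below (an illustration under assumed sizes, not part of the paper's argument), $K$ is a rank-deficient matrix, so the sum in condition 2 is finite automatically and only condition 1 can fail. The tolerance `1e-10` for discarding numerically zero singular values is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 6, 5, 3  # K maps R^n to R^m and has rank r (illustrative sizes)

K = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
phi = K @ rng.normal(size=n)  # phi lies in the range of K by construction

# numpy returns K = U diag(s) Vt. Columns of U play the role of psi_j,
# rows of Vt play the role of varphi_j, and s holds the lambda_j.
U, s, Vt = np.linalg.svd(K)
nonzero = s > 1e-10 * s[0]    # keep singular values that are not numerically zero
psi = U[:, : len(s)][:, nonzero]
varphi = Vt[: len(s)][nonzero]
lam = s[nonzero]

# Condition 1: phi has no component outside the span of the psi_j,
# i.e. phi is orthogonal to the null space of K^*.
in_range = np.allclose(phi - psi @ (psi.T @ phi), 0.0)

# Picard series: h = sum_j lambda_j^{-1} <phi, psi_j> varphi_j.
coeffs = (psi.T @ phi) / lam
h = varphi.T @ coeffs
picard_sum = float(np.sum(coeffs**2))  # condition 2, finite in finite dimensions
solves = np.allclose(K @ h, phi)

# A right-hand side with a component in N(K^*) violates condition 1.
phi_bad = phi + U[:, -1]
bad_in_range = np.allclose(phi_bad - psi @ (psi.T @ phi_bad), 0.0)
print(in_range, solves, bad_in_range)
```

In infinite dimensions the singular values accumulate at zero, so condition 2 becomes a genuine restriction on how fast the coefficients $\langle \phi, \psi_j \rangle$ decay.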

Appendix B Transferring Bridge Functions

In this section, we discuss the identifiability results.

B.1 Proof of Theorem 4.1

For $f \in \{p, q\}$, recall that

$$
\begin{aligned}
\mathbb{E}_f[Y \mid c, x] &= \int_{\mathcal{W}} h_0^f(w, c)\, f(w \mid c, x)\, \mathrm{d}w \\
&= \int_{\mathcal{W}} \int_{\mathcal{U}} h_0^f(w, c)\, f(w \mid c, u)\, f(u \mid c, x)\, \mathrm{d}u\, \mathrm{d}w \\
&= \int_{\mathcal{W}} \int_{\mathcal{U}} h_0^f(w, c)\, f(w \mid u)\, f(u \mid c, x)\, \mathrm{d}u\, \mathrm{d}w && (W \perp\!\!\!\perp C \mid U).
\end{aligned}
$$

Similarly, we can write

$$
\begin{aligned}
\mathbb{E}_f[Y \mid c, x] &= \int_{\mathcal{U}} \mathbb{E}_f[Y \mid c, u]\, f(u \mid c, x)\, \mathrm{d}u && (Y \perp\!\!\!\perp X \mid \{U, C\}).
\end{aligned}
$$

Under Assumption 4, we have

$$
\mathbb{E}_f[Y \mid c, U] = \int_{\mathcal{W}} h_0^f(w, c)\, f(w \mid U)\, \mathrm{d}w \tag{B.1}
$$

almost surely with respect to $F(U)$, $F \in \{P, Q\}$.

Suppose that $u \in \mathcal{U}$ is such that $Q(u) > 0$. Then, by Assumption 5, we must have $P(u) > 0$. Hence, conditioned on the selected $u$ and $c$ and under Assumption 1, we have

$$
\begin{aligned}
\mathbb{E}_p[Y \mid c, u] &= \int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, \mathrm{d}w; \\
\mathbb{E}_q[Y \mid c, u] &= \int_{\mathcal{W}} h_0^q(w, c)\, p(w \mid u)\, \mathrm{d}w && (p(w \mid u) = q(w \mid u),\; \forall c \in \mathcal{C},\, w \in \mathcal{W},\, u \in \mathcal{U}).
\end{aligned}
$$

We can then write

$$
\mathbb{E}_p[Y \mid c, u] - \mathbb{E}_q[Y \mid c, u] = \int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, \mathrm{d}w - \int_{\mathcal{W}} h_0^q(w, c)\, q(w \mid u)\, \mathrm{d}w.
$$

Note that, by Assumption 1, we have $\mathbb{E}_p[Y \mid c, u] = \mathbb{E}_q[Y \mid c, u]$; hence the left-hand side of the above equation is $0$, and we can conclude that

$$
\int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid U)\, \mathrm{d}w = \int_{\mathcal{W}} h_0^q(w, c)\, q(w \mid U)\, \mathrm{d}w
$$

$Q(U)$-almost surely. This completes the first part of the proof.
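A discrete analogue makes this transfer property concrete. The sketch below is an illustration under assumed conditions rather than the paper's kernel estimator: the tables $\mathbf{P}(W \mid U)$ and $\mathbf{P}(Y \mid U, c)$ are shared across domains, only $\mathbf{P}(U \mid X, c)$ differs, and all probabilities are known exactly. All names and sizes are hypothetical, and the cutoff `rcond=1e-10` is an arbitrary choice for the rank-deficient pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(3)
k_W, k_U, k_X, k_Y = 4, 3, 5, 2  # illustrative; k_W >= k_U and k_X >= k_U


def random_conditional(n_rows, n_cols):
    """Return a column-stochastic matrix: each column is a probability vector."""
    return rng.dirichlet(np.ones(n_rows), size=n_cols).T


# Mechanisms shared by the source (p) and target (q) domains.
P_W_given_U = random_conditional(k_W, k_U)
P_Y_given_Uc = random_conditional(k_Y, k_U)

# Only the distribution of the latent variable changes across domains.
P_U_given_Xc_src = random_conditional(k_U, k_X)
P_U_given_Xc_tgt = random_conditional(k_U, k_X)


def observables(P_U_given_Xc):
    """Build P(W | X, c) and P(Y | X, c) from the latent factorization."""
    return P_W_given_U @ P_U_given_Xc, P_Y_given_Uc @ P_U_given_Xc


W_src, Y_src = observables(P_U_given_Xc_src)
W_tgt, Y_tgt = observables(P_U_given_Xc_tgt)

# Solve the bridge equation using source observables only
# (minimum-norm least-squares solution; W_src has rank k_U < k_W).
H_src = Y_src @ np.linalg.pinv(W_src, rcond=1e-10)

source_fit = np.allclose(H_src @ W_src, Y_src)
transfers = np.allclose(H_src @ W_tgt, Y_tgt)  # same bridge works in the target
print(source_fit, transfers)
```

The bridge matrix is fitted without ever accessing the latent tables, yet it satisfies the target-domain equation, mirroring the statement that $h_0^p$ and $h_0^q$ agree after integrating against the shared conditional of $W$ given $U$.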

To show the second part of the theorem, note that we can write

𝔼q[Yx]subscript𝔼𝑞delimited-[]conditional𝑌𝑥\displaystyle\mathbb{E}_{q}[Y\mid x]blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ italic_Y ∣ italic_x ] =𝔼q[𝔼q[YC,x]x]absentsubscript𝔼𝑞delimited-[]conditionalsubscript𝔼𝑞delimited-[]conditional𝑌𝐶𝑥𝑥\displaystyle=\mathbb{E}_{q}[\mathbb{E}_{q}[Y\mid C,x]\mid x]= blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ italic_Y ∣ italic_C , italic_x ] ∣ italic_x ]
=𝔼q[𝔼q[h0q(W,c)C,x]x].absentsubscript𝔼𝑞delimited-[]conditionalsubscript𝔼𝑞delimited-[]conditionalsuperscriptsubscript0𝑞𝑊𝑐𝐶𝑥𝑥\displaystyle=\mathbb{E}_{q}[\mathbb{E}_{q}[h_{0}^{q}(W,c)\mid C,x]\mid x].= blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT ( italic_W , italic_c ) ∣ italic_C , italic_x ] ∣ italic_x ] .
Since p(wu)=q(wu)𝑝conditional𝑤𝑢𝑞conditional𝑤𝑢p(w\mid u)=q(w\mid u)italic_p ( italic_w ∣ italic_u ) = italic_q ( italic_w ∣ italic_u ) by Assumption 1, we can factorize the above equation as
𝔼q[Yx]subscript𝔼𝑞delimited-[]conditional𝑌𝑥\displaystyle\mathbb{E}_{q}[Y\mid x]blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ italic_Y ∣ italic_x ] =𝒞[𝒰{𝒲h0q(w,c)p(wu)dw}q(uc,x)du]q(cx)dc.absentsubscript𝒞delimited-[]subscript𝒰subscript𝒲superscriptsubscript0𝑞𝑤𝑐𝑝conditional𝑤𝑢differential-d𝑤𝑞conditional𝑢𝑐𝑥differential-d𝑢𝑞conditional𝑐𝑥differential-d𝑐\displaystyle=\int_{\mathcal{C}}\left[\int_{\mathcal{U}}\left\{\int_{\mathcal{% W}}h_{0}^{q}(w,c)p(w\mid u)\mathrm{d}w\right\}q(u\mid c,x)\mathrm{d}u\right]q(% c\mid x)\mathrm{d}c.= ∫ start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT [ ∫ start_POSTSUBSCRIPT caligraphic_U end_POSTSUBSCRIPT { ∫ start_POSTSUBSCRIPT caligraphic_W end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT ( italic_w , italic_c ) italic_p ( italic_w ∣ italic_u ) roman_d italic_w } italic_q ( italic_u ∣ italic_c , italic_x ) roman_d italic_u ] italic_q ( italic_c ∣ italic_x ) roman_d italic_c .
Let the support of $U$ conditioned on $c, x$ be $\mathcal{U}_{c,x}^{1} = \{u : Q(u \mid c, x) > 0\}$ and $\mathcal{U}_{c,x}^{0} = \{u : Q(u \mid c, x) = 0\}$. Hence, we have $\mathcal{U} = \mathcal{U}_{c,x}^{0} \cup \mathcal{U}_{c,x}^{1}$ and $\mathcal{U}_{c,x}^{0} \cap \mathcal{U}_{c,x}^{1} = \emptyset$, such that $\int_{\mathcal{U}_{c,x}^{0}} q(u \mid c, x)\, \mathrm{d}u = 0$ and $\int_{\mathcal{U}_{c,x}^{1}} q(u \mid c, x)\, \mathrm{d}u = 1$. Then, we can further decompose the above as
\begin{align*}
\mathbb{E}_q[Y \mid x] &= \int_{\mathcal{C}} \left[ \int_{\mathcal{U}_{c,x}^{0}} \left\{ \int_{\mathcal{W}} h_0^q(w, c)\, p(w \mid u)\, \mathrm{d}w \right\} q(u \mid c, x)\, \mathrm{d}u \right] q(c \mid x)\, \mathrm{d}c \\
&\quad + \int_{\mathcal{C}} \left[ \int_{\mathcal{U}_{c,x}^{1}} \left\{ \int_{\mathcal{W}} h_0^q(w, c)\, p(w \mid u)\, \mathrm{d}w \right\} q(u \mid c, x)\, \mathrm{d}u \right] q(c \mid x)\, \mathrm{d}c \\
&= \int_{\mathcal{C}} \left[ \int_{\mathcal{U}_{c,x}^{1}} \left\{ \int_{\mathcal{W}} h_0^q(w, c)\, p(w \mid u)\, \mathrm{d}w \right\} q(u \mid c, x)\, \mathrm{d}u \right] q(c \mid x)\, \mathrm{d}c.
\end{align*}
Given $c, x$, the support of $Q(U \mid c, x)$ is included in the support of $Q(U)$. Hence, if $u \in \mathcal{U}_{c,x}^{1}$, we must have $Q(u) > 0$ and therefore $P(u) > 0$ by Assumption 5, so we can swap $h_0^q$ with $h_0^p$:
$$
\mathbb{E}_q[Y \mid x] = \int_{\mathcal{C}} \left[ \int_{\mathcal{U}_{c,x}^{1}} \left\{ \int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, \mathrm{d}w \right\} q(u \mid c, x)\, \mathrm{d}u \right] q(c \mid x)\, \mathrm{d}c.
$$
Since $\int_{\mathcal{U}_{c,x}^{0}} \left\{ \int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, \mathrm{d}w \right\} q(u \mid c, x)\, \mathrm{d}u = 0$, we can add it to the above term and arrive at
\begin{align*}
\mathbb{E}_q[Y \mid x] &= \int_{\mathcal{C}} \left[ \int_{\mathcal{U}} \left\{ \int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, \mathrm{d}w \right\} q(u \mid c, x)\, \mathrm{d}u \right] q(c \mid x)\, \mathrm{d}c \\
&= \int_{\mathcal{C}} \int_{\mathcal{W}} h_0^p(w, c)\, q(w, c \mid x)\, \mathrm{d}w\, \mathrm{d}c. \tag{B.2}
\end{align*}

Since we can identify $h_0^p$ from the observables $(W, X, Y, C)$ of the source domain by solving the linear system (4.1), given the observables $(W, C, X)$ from the target domain, we can identify $\mathbb{E}_q[Y \mid x]$.
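As a sanity check on (B.2), the identity can be verified numerically in a small discrete model. The sketch below is illustrative only: every probability table and variable name is made up, all variables are binary, and the model assumes the conditional independences used in the derivation above ($W \perp\!\!\!\perp (C, X) \mid U$ and $Y \perp\!\!\!\perp X \mid C, U$). It solves for the bridge from source observables alone and then compares the proxy formula with the target conditional mean computed from the latent variable.

```python
import numpy as np

# Hypothetical mechanisms shared by source and target (all variables binary).
P_W_given_U = np.array([[0.9, 0.2], [0.1, 0.8]])    # indexed [w, u]
P_X_given_U = np.array([[0.8, 0.3], [0.2, 0.7]])    # indexed [x, u]
P_C0_given_XU = np.array([[0.6, 0.3], [0.5, 0.2]])  # P(C=0 | x, u), indexed [x, u]
P_C_given_XU = np.stack([P_C0_given_XU, 1.0 - P_C0_given_XU])  # indexed [c, x, u]
E_Y_given_CU = np.array([[1.0, 0.2], [0.5, -0.3]])  # E[Y | c, u], indexed [c, u]


def joint_cxu(prior_u):
    """Joint P(c, x, u), indexed [c, x, u], under the given prior on U."""
    return P_C_given_XU * P_X_given_U[None, :, :] * prior_u[None, None, :]


p_U = np.array([0.3, 0.7])  # source prior on the latent U
q_U = np.array([0.8, 0.2])  # target prior on the latent U

# Source observables: p(w | c, x) and E_p[Y | c, x].
p_joint = joint_cxu(p_U)
p_U_cx = p_joint / p_joint.sum(axis=2, keepdims=True)   # p(u | c, x)
p_W_cx = np.einsum("wu,cxu->wcx", P_W_given_U, p_U_cx)
p_EY_cx = np.einsum("cu,cxu->cx", E_Y_given_CU, p_U_cx)

# For each c, solve sum_w h0[w, c] p(w | c, x) = E_p[Y | c, x] over all x.
h0 = np.zeros((2, 2))  # indexed [w, c]
for c in range(2):
    A = p_W_cx[:, c, :].T  # rows indexed by x, columns by w
    h0[:, c] = np.linalg.solve(A, p_EY_cx[c, :])

# Target: ground truth (uses the latent U) versus the proxy formula (B.2).
q_joint = joint_cxu(q_U)
q_CU_x = q_joint / q_joint.sum(axis=(0, 2), keepdims=True)  # q(c, u | x)
truth = np.einsum("cu,cxu->x", E_Y_given_CU, q_CU_x)
q_WC_x = np.einsum("wu,cxu->wcx", P_W_given_U, q_CU_x)      # q(w, c | x)
proxy = np.einsum("wc,wcx->x", h0, q_WC_x)

print(truth, proxy)  # the two vectors should agree up to rounding
```

The bridge `h0` is fitted without access to `p_U` or `q_U`; only the final comparison uses the latent tables, and only to produce the reference value.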

B.2 Proof of Proposition 4.2

The following proof is a generalization of the proof of miao2018identifying, suited to the multidomain case. All variables besides $Z$ are assumed to be discrete-valued and multivariate: $V$ can take $k_V$ values for $V \in \{U, X, Y, W\}$.

Let $\mathbf{P}(W \mid U) = \begin{bmatrix} \mathbf{P}(W \mid u_1) & \ldots & \mathbf{P}(W \mid u_{k_U}) \end{bmatrix} \in \mathbb{R}^{k_W \times k_U}$.

Similarly, define

$$
\mathbf{P}(Y \mid U, x) = \begin{bmatrix} \mathbf{P}(Y \mid u_1, x) & \ldots & \mathbf{P}(Y \mid u_{k_U}, x) \end{bmatrix} \in \mathbb{R}^{k_Y \times k_U}.
$$

This notation carries through to the remaining variables.

The approach we will take differs from the concept case (and standard proxy case) in the following way: we do not observe $Z$ in the training or test domains, nor do we know its true dimension (indeed $Z$ may be continuous valued). Rather, we assume that we have at least $k_Z$ distinct draws $z_r$ from $Z$ in training, where $r \in \{1, \ldots, k_Z\}$ is the domain index, and that $k_Z \geq k_U$. We also suppose that in test, we observe a distinct draw $z_{k_Z+1}$ which was not seen in training.

Our goal is to obtain a bridge function, which in the categorical case will be a bridge matrix $M_{w,x} \in \mathbb{R}^{k_Y \times k_W}$. Define $P_r(V \mid x) := P(V \mid x, z_r)$ for $V \in \{U, Y, W\}$. We assume that for each $x$,

$$
\mathrm{rank}\left(P_{1:k_Z}(U \mid x)\right) = k_U, \qquad P_{1:k_Z}(U \mid x) := \begin{bmatrix} P_1(U \mid x) & \ldots & P_{k_Z}(U \mid x) \end{bmatrix},
$$

which implies that $P(U \mid x, z_r)$ varies with $z_r$, and that we see a sufficient diversity of domains to span the space of vectors on $U$.

The graphical model supports the conditional independence relation

$$
\{Y, X, W\} \perp\!\!\!\perp Z \mid U,
$$

however we will only require the standard proxy assumptions

\begin{align*}
W &\perp\!\!\!\perp X, Z \mid U, \\
Y &\perp\!\!\!\perp Z \mid X, U.
\end{align*}

Next, as in the concept case, we require

$$
P(Y \mid U, x) = M_{w,x} P(W \mid U),
$$

where we assume $\mathrm{rank}(P(W \mid U)) = k_U$ (as in the first condition of Assumption 7). The matrix $M_{w,x}$ is invariant to the distribution $P(U)$ by construction. If we can solve for $M_{w,x}$, then given a novel domain corresponding to the draw $z_{k_Z+1}$, we have

\begin{align*}
P(Y \mid U, x) P_{k_Z+1}(U \mid x) &= M_{w,x} P(W \mid U) P_{k_Z+1}(U \mid x) \\
P_{k_Z+1}(Y \mid x) &= M_{w,x} P_{k_Z+1}(W \mid x).
\end{align*}

This allows us to compute conditional expectations under $P(Y \mid x)$ in the novel domain, based on observations of $(W, X)$ in this domain.

To solve for $M_{w,x}$, we project both sides on a basis over $U$ arising from the training domains,

$$
P(Y \mid U, x) P_{1:k_Z}(U \mid x) = M_{w,x} P(W \mid U) P_{1:k_Z}(U \mid x),
$$

where we define $P_{1:k_Z}(Y \mid x) = \begin{bmatrix} P_1(Y \mid x) & \ldots & P_{k_Z}(Y \mid x) \end{bmatrix}$, and likewise $P_{1:k_Z}(W \mid x)$. Then the above becomes

\begin{align*}
P_{1:k_Z}(Y \mid x) &= M_{w,x} P_{1:k_Z}(W \mid x) \\
M_{w,x} &= P_{1:k_Z}(Y \mid x) P_{1:k_Z}^{\dagger}(W \mid x). \tag{B.3}
\end{align*}

This demonstrates that we can recover the domain-invariant $M_{w,x}$ purely from observed data.
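The recovery in (B.3) can be sketched numerically for a single fixed $x$. This is a minimal illustration under stated assumptions, not the estimator used in the experiments: the cardinalities, the random probability tables, and all variable names are hypothetical, and the population probabilities are used directly rather than estimated from samples.

```python
import numpy as np

rng = np.random.default_rng(0)
k_U, k_W, k_Y, k_Z = 2, 3, 2, 3  # hypothetical cardinalities, with k_Z >= k_U


def random_stochastic(rows, cols):
    """Random matrix whose columns are probability vectors."""
    m = rng.random((rows, cols)) + 0.1
    return m / m.sum(axis=0, keepdims=True)


# Latent mechanisms for one fixed x (never used by the estimator below).
P_W_given_U = random_stochastic(k_W, k_U)  # k_W x k_U
P_Y_given_U = random_stochastic(k_Y, k_U)  # k_Y x k_U
P_U_train = random_stochastic(k_U, k_Z)    # column r is P_r(U | x)

# Observable per-domain marginals, stacked column-wise over training domains.
P_W_train = P_W_given_U @ P_U_train  # P_{1:k_Z}(W | x)
P_Y_train = P_Y_given_U @ P_U_train  # P_{1:k_Z}(Y | x)

# Equation (B.3): bridge matrix computed from observed quantities only.
# An explicit rcond is passed because P_W_train has rank k_U < k_W here.
M_hat = P_Y_train @ np.linalg.pinv(P_W_train, rcond=1e-10)

# Unseen domain with a new latent distribution.
p_U_new = np.array([0.05, 0.95])
pred = M_hat @ (P_W_given_U @ p_U_new)  # M P_{k_Z+1}(W | x)
truth = P_Y_given_U @ p_U_new           # P_{k_Z+1}(Y | x)
print(np.max(np.abs(pred - truth)))     # expected to be at round-off level
```

When $k_W > k_U$ the bridge matrix is not unique, but any solution of (B.3) gives the same prediction on the distributions $P(W \mid x)$ that the model can generate, which is what the check above compares.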

One domain is not enough: We illustrate with an example, where we again consider the case where all variables are categorical:

$$
P(Y \mid x) = M_{w,x} P(W \mid x), \tag{B.4}
$$

where $P(Y \mid x)$ is a $k_Y \times 1$ vector of probabilities, $P(W \mid x)$ is a $k_W \times 1$ vector of probabilities, and $M$ is a $k_Y \times k_W$ matrix for which we wish to solve. We have too few equations for the number of unknowns.

One solution to (B.4) is the matrix of conditional probabilities $M_{w,x} = P(Y \mid W, x)$. This matrix is not invariant to changes to $P(U)$, however:

$$
P(Y \mid W, x) = P(Y \mid U, x) P(U \mid W, x).
$$

The posterior $P(U \mid W, x)$ changes when the prior $P(U)$ changes. In contrast, the solution in (B.3) is guaranteed to be domain invariant.
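The contrast between the two solutions can be made concrete with a two-state example. The numbers and names below are invented for illustration, $x$ is held fixed and dropped from the notation, and the invariant matrix is computed from the latent tables only as a reference (in practice it would come from (B.3)).

```python
import numpy as np

P_W_given_U = np.array([[0.9, 0.2], [0.1, 0.8]])  # columns index u
P_Y_given_U = np.array([[0.7, 0.1], [0.3, 0.9]])  # columns index u


def naive_bridge(prior_u):
    """P(Y | W) as a k_Y x k_W matrix, computed under the given prior on U."""
    joint_wu = P_W_given_U * prior_u[None, :]  # P(w, u), indexed [w, u]
    P_U_given_W = (joint_wu / joint_wu.sum(axis=1, keepdims=True)).T  # [u, w]
    return P_Y_given_U @ P_U_given_W


prior_1 = np.array([0.5, 0.5])  # the single training domain
prior_2 = np.array([0.9, 0.1])  # a shifted test domain

M_naive = naive_bridge(prior_1)                         # solves (B.4) in domain 1 only
M_invariant = P_Y_given_U @ np.linalg.inv(P_W_given_U)  # solves P(Y|U) = M P(W|U)

for prior in (prior_1, prior_2):
    p_W, p_Y = P_W_given_U @ prior, P_Y_given_U @ prior
    err_naive = np.abs(M_naive @ p_W - p_Y).max()
    err_invariant = np.abs(M_invariant @ p_W - p_Y).max()
    print(err_naive, err_invariant)
```

Both matrices satisfy (B.4) in the training domain, so a single domain cannot tell them apart; only the invariant one continues to hold after the prior on $U$ shifts.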

B.3 Proof of Proposition 4.3

For all $r = 1, \ldots, k_Z$, we can write

\begin{align*}
\mathbb{E}_r[Y \mid x] = \mathbb{E}[Y \mid x, z_r] &= \int_{\mathcal{W}} m_0(w, x)\, \mathrm{d}P(w \mid x, z_r) \\
&= \int_{\mathcal{U}} \int_{\mathcal{W}} m_0(w, x)\, \mathrm{d}P(w \mid u)\, \mathrm{d}P(u \mid x, z_r); \tag{B.5} \\
\mathbb{E}[Y \mid x, z_r] &= \int_{\mathcal{U}} \mathbb{E}[Y \mid x, u]\, \mathrm{d}P(u \mid x, z_r). \tag{B.6}
\end{align*}

By Assumption 6, the integrands of (B.5)–(B.6) have the following property

$$
\mathbb{E}[Y \mid x, u] = \int_{\mathcal{W}} m_0(w, x)\, \mathrm{d}P(w \mid u), \tag{B.7}
$$

almost surely with respect to $P(U)$. We will show that $m_0$ can be transferred to identify the distribution in the target domain.

We define the support set $\mathcal{S}_q(x) = \{u : Q(u \mid x) > 0\}$. Therefore, we can write

\begin{align*}
\mathbb{E}_q[Y \mid x] &= \int_{\mathcal{U}} \mathbb{E}[Y \mid u, x]\, \mathrm{d}Q(u \mid x) \\
&= \int_{\mathcal{S}_q(x)} \mathbb{E}[Y \mid u, x]\, \mathrm{d}Q(u \mid x).
\end{align*}

Furthermore, since we have $\mathcal{S}_q(x) \subseteq \{u : P(u) > 0\}$, we can apply (B.7) to obtain

\begin{align*}
\mathbb{E}_q[Y \mid x] &= \int_{\mathcal{U}} \int_{\mathcal{W}} m_0(w, x)\, \mathrm{d}P(w \mid u)\, \mathrm{d}Q(u \mid x) \\
&= \mathbb{E}_q[m_0(W, x) \mid x].
\end{align*}

This completes the proof.

Appendix C Estimation Procedure

The estimation procedure of $\widehat{h}_0$ is discussed in Section C.1 and the estimation procedure of $\widehat{m}_0$ is discussed in Section C.2. In Section C.3, we discuss the case when either $Z$ or $C$ is a discrete variable.

C.1 Proof of Proposition 5.1

The proof of Proposition 5.1 follows the result in (mastouri2021proximal), which extends the representer theorem (scholkopf2001generalized). There exists a $\gamma \in \mathbb{R}^{n_2}$ such that

$$
\widehat{h}_0 = \sum_{j=1}^{n_2} \gamma_j\, \widehat{\mu}_{W \mid c_{2,j}, x_{2,j}} \otimes \phi(c_{2,j}). \tag{C.1}
$$

From song2009hilbert, we have $\widehat{\mu}_{W \mid c_{2,j}, x_{2,j}} = \sum_{i=1}^{n_1} b_i(c_{2,j}, x_{2,j})\, \phi(w_{1,i})$, where $b_i$ is the $i$-th element of $b$, a function on $\mathcal{C} \times \mathcal{X}$: $b(c, x) = (\mathcal{K}_{X_1} \odot \mathcal{K}_{C_1} + \lambda_1 n_1 I)^{-1} \left( \Phi_{X_1}(x) \odot \Phi_{C_1}(c) \right)$. If we expand (C.1) with the previous expression, we have

$$
\widehat{h}_0 = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \alpha_{ij}\, \phi(w_{1,i}) \otimes \phi(c_{2,j}),
$$

where $\alpha_{ij} = b_i(c_{2,j}, x_{2,j})\, \gamma_j$. Hence, the rest of the proof will focus on finding the expression of $\alpha_{ij}$. Following the proof technique developed in (mastouri2021proximal), we introduce the following two lemmas that assist the analysis.

Lemma C.1.

The square of the operator norm of $\widehat{h}_{0}$, denoted as $|\!|\!|\widehat{h}_{0}|\!|\!|_{\mathcal{H}_{\mathcal{W}\mathcal{C}}}^{2}$, can be represented as

$$|\!|\!|\widehat{h}_{0}|\!|\!|_{\mathcal{H}_{\mathcal{W}\mathcal{C}}}^{2}=\operatorname{vec}(\alpha)^{\top}(\mathcal{K}_{C_{2}}\otimes\mathcal{K}_{W_{1}})\operatorname{vec}(\alpha).$$
Proof of Lemma C.1.

Write

$$\begin{aligned}
\langle\widehat{h}_{0},\widehat{h}_{0}\rangle
&=\left\langle\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\alpha_{ij}\,\phi(w_{1,i})\otimes\phi(c_{2,j}),\ \sum_{m=1}^{n_{1}}\sum_{r=1}^{n_{2}}\alpha_{mr}\,\phi(w_{1,m})\otimes\phi(c_{2,r})\right\rangle\\
&=\sum_{i,m=1}^{n_{1}}\sum_{j,r=1}^{n_{2}}\alpha_{ij}\alpha_{mr}\,k(w_{1,i},w_{1,m})\,k(c_{2,j},c_{2,r})\\
&=\operatorname{tr}\left(\alpha^{\top}\mathcal{K}_{W_{1}}\alpha\,\mathcal{K}_{C_{2}}\right)\\
&=\operatorname{vec}(\alpha)^{\top}\operatorname{vec}(\mathcal{K}_{W_{1}}\alpha\,\mathcal{K}_{C_{2}}).
\end{aligned}$$
Using the fact that $\operatorname{vec}(ABC)=(C^{\top}\otimes A)\operatorname{vec}(B)$, the above display can be written as
$$\langle\widehat{h}_{0},\widehat{h}_{0}\rangle=\operatorname{vec}(\alpha)^{\top}(\mathcal{K}_{C_{2}}\otimes\mathcal{K}_{W_{1}})\operatorname{vec}(\alpha).$$
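
The identity in Lemma C.1 can be checked numerically. The sketch below is only an illustration under stated assumptions: it replaces the RKHS feature maps with hypothetical finite-dimensional features (`Phi_W`, `Phi_C`), so that $\widehat{h}_{0}$ becomes an ordinary matrix, and it takes $\operatorname{vec}$ to be column stacking. All names and sizes are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p, q = 4, 3, 6, 5

# Hypothetical explicit features: row i of Phi_W stands in for phi(w_{1,i}),
# row j of Phi_C stands in for phi(c_{2,j}).
Phi_W = rng.normal(size=(n1, p))
Phi_C = rng.normal(size=(n2, q))
alpha = rng.normal(size=(n1, n2))

K_W = Phi_W @ Phi_W.T  # Gram matrix of W_1, shape (n1, n1)
K_C = Phi_C @ Phi_C.T  # Gram matrix of C_2, shape (n2, n2)

# h0 = sum_ij alpha_ij phi(w_{1,i}) (x) phi(c_{2,j}), stored as a p x q matrix.
h0 = Phi_W.T @ alpha @ Phi_C

norm_direct = np.sum(h0 ** 2)                       # <h0, h0>
norm_trace = np.trace(alpha.T @ K_W @ alpha @ K_C)  # trace form
v = alpha.flatten(order="F")                        # column-stacking vec(alpha)
norm_kron = v @ np.kron(K_C, K_W) @ v               # Kronecker form of Lemma C.1
```

With column stacking, `np.kron(K_C, K_W)` matches the ordering $\mathcal{K}_{C_{2}}\otimes\mathcal{K}_{W_{1}}$ used in the lemma; a row-major `vec` would require swapping the two factors.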

Lemma C.2.

For any $c\in\mathcal{C}$, $x\in\mathcal{X}$,

$$\langle\widehat{h}_{0},\phi(c)\otimes\widehat{\mu}_{W\mid c,x}\rangle=\Phi_{C_{2}}(c)^{\top}\alpha^{\top}\mathcal{K}_{W_{1}}(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}\left(\Phi_{C_{1}}(c)\odot\Phi_{X_{1}}(x)\right).$$
Proof of Lemma C.2.

Write

$$\begin{aligned}
\langle\widehat{h}_{0},\phi(c)\otimes\widehat{\mu}_{W\mid c,x}\rangle
&=\left\langle\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\alpha_{ij}\,\phi(w_{1,i})\otimes\phi(c_{2,j}),\ \phi(c)\otimes\sum_{r=1}^{n_{1}}b_{r}(c,x)\phi(w_{1,r})\right\rangle\\
&=\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\sum_{r=1}^{n_{1}}\alpha_{ij}\,k(w_{1,i},w_{1,r})\,k(c_{2,j},c)\,b_{r}(c,x).
\end{aligned}$$
Summing over $i,j$, the above expression equals
$$\begin{aligned}
&=\sum_{r=1}^{n_{1}}\Phi_{C_{2}}(c)^{\top}\alpha^{\top}\Phi_{W_{1}}(w_{1,r})\,b_{r}(c,x)\\
&=\Phi_{C_{2}}(c)^{\top}\alpha^{\top}\mathcal{K}_{W_{1}}b(c,x)\\
&=\Phi_{C_{2}}(c)^{\top}\alpha^{\top}\mathcal{K}_{W_{1}}(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}\left(\Phi_{X_{1}}(x)\odot\Phi_{C_{1}}(c)\right)\\
&=\left(\Phi_{X_{1}}(x)\odot\Phi_{C_{1}}(c)\right)^{\top}(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}\mathcal{K}_{W_{1}}\alpha\,\Phi_{C_{2}}(c).
\end{aligned}$$
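
The middle step of this proof, $\langle\widehat{h}_{0},\phi(c)\otimes\widehat{\mu}_{W\mid c,x}\rangle=\Phi_{C_{2}}(c)^{\top}\alpha^{\top}\mathcal{K}_{W_{1}}b(c,x)$, can also be checked numerically. The sketch below again assumes hypothetical finite-dimensional features and treats $b(c,x)$ as an arbitrary weight vector, since the ridge expression for $b$ is only substituted in the last two lines of the derivation. Names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p, q = 4, 3, 6, 5

Phi_W = rng.normal(size=(n1, p))   # rows stand in for phi(w_{1,i})
Phi_C2 = rng.normal(size=(n2, q))  # rows stand in for phi(c_{2,j})
alpha = rng.normal(size=(n1, n2))
b = rng.normal(size=n1)            # stands in for b(c, x)
phi_c = rng.normal(size=q)         # stands in for phi(c)

K_W = Phi_W @ Phi_W.T              # Gram matrix of W_1
k_C2_c = Phi_C2 @ phi_c            # Phi_{C_2}(c) = [k(c_{2,j}, c)]_j

h0 = Phi_W.T @ alpha @ Phi_C2      # sum_ij alpha_ij phi(w_{1,i}) phi(c_{2,j})^T
mu_hat = Phi_W.T @ b               # sum_r b_r phi(w_{1,r})

lhs = mu_hat @ h0 @ phi_c          # inner product of h0 with the rank-one tensor
rhs = k_C2_c @ alpha.T @ K_W @ b   # closed form from Lemma C.2
```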

With Lemmas C.1 and C.2, we can write (5.4) as

$$\frac{1}{2n_{2}}\|y_{2}-D^{\top}\operatorname{vec}(\alpha)\|_{2}^{2}+\lambda_{2}\operatorname{vec}(\alpha)^{\top}E\operatorname{vec}(\alpha),\tag{C.2}$$

where

$$D=\mathcal{K}_{C_{2}}\,\overline{\otimes}\left\{\mathcal{K}_{W_{1}}(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}(\mathcal{K}_{X_{12}}\odot\mathcal{K}_{C_{12}})\right\},\quad E=\mathcal{K}_{C_{2}}\otimes\mathcal{K}_{W_{1}}.$$

Then, setting the gradient of (C.2) with respect to $\operatorname{vec}(\alpha)$ to zero, we obtain

$$\operatorname{vec}(\alpha)=\left(DD^{\top}+\lambda_{2}n_{2}E\right)^{-1}Dy_{2}.$$
Applying the Woodbury matrix identity, the above display is equivalent to
$$\operatorname{vec}(\alpha)=E^{-1}D(\lambda_{2}n_{2}I+D^{\top}E^{-1}D)^{-1}y_{2}.\tag{C.3}$$

Using the fact that for matrices $A,B,C,F$, $(A\otimes B)(C\,\overline{\otimes}\,F)=AC\,\overline{\otimes}\,BF$, we can simplify $E^{-1}D$ as

$$\begin{aligned}
E^{-1}D
&=\left(\mathcal{K}_{C_{2}}^{-1}\otimes\mathcal{K}_{W_{1}}^{-1}\right)\left[\mathcal{K}_{C_{2}}\,\overline{\otimes}\left\{\mathcal{K}_{W_{1}}(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}(\mathcal{K}_{X_{12}}\odot\mathcal{K}_{C_{12}})\right\}\right]\\
&=I\,\overline{\otimes}\,(\mathcal{K}_{X_{1}}\odot\mathcal{K}_{C_{1}}+\lambda_{1}n_{1}I)^{-1}(\mathcal{K}_{X_{12}}\odot\mathcal{K}_{C_{12}})\\
&=I\,\overline{\otimes}\,\Gamma.
\end{aligned}$$

Hence, using the fact that $(A\,\overline{\otimes}\,B)^{\top}(C\,\overline{\otimes}\,F)=(A^{\top}C)\odot(B^{\top}F)$, we have

$$D^{\top}E^{-1}D=(\mathcal{K}_{C_{2}}\,\overline{\otimes}\,\mathcal{K}_{W_{1}}\Gamma)^{\top}(I\,\overline{\otimes}\,\Gamma)=\mathcal{K}_{C_{2}}\odot(\Gamma^{\top}\mathcal{K}_{W_{1}}\Gamma).$$

Hence, we can write (C.3) as

$$\operatorname{vec}(\alpha)=(I\,\overline{\otimes}\,\Gamma)\left\{\lambda_{2}n_{2}I+\mathcal{K}_{C_{2}}\odot(\Gamma^{\top}\mathcal{K}_{W_{1}}\Gamma)\right\}^{-1}y_{2}.$$
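
The reduction from the $n_{1}n_{2}\times n_{1}n_{2}$ linear system to an $n_{2}\times n_{2}$ one can be sanity-checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes that $\overline{\otimes}$ denotes the column-wise Khatri-Rao product (which is consistent with the two identities used above), that $\operatorname{vec}$ stacks columns, and that all kernels are Gaussian with unit lengthscale. The helper names `rbf` and `khatri_rao`, the data, and the regularization values are hypothetical.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * lengthscale ** 2))

def khatri_rao(a, b):
    """Column-wise Kronecker product: column j equals kron(a[:, j], b[:, j])."""
    return np.einsum("ij,kj->ikj", a, b).reshape(a.shape[0] * b.shape[0], -1)

rng = np.random.default_rng(0)
n1, n2, d = 5, 4, 3
lam1, lam2 = 0.1, 0.1
X1, C1, W1 = (rng.normal(size=(n1, d)) for _ in range(3))  # first-stage sample
X2, C2 = (rng.normal(size=(n2, d)) for _ in range(2))      # second-stage sample
y2 = rng.normal(size=n2)

K_W1, K_C2 = rbf(W1, W1), rbf(C2, C2)
Gamma = np.linalg.solve(
    rbf(X1, X1) * rbf(C1, C1) + lam1 * n1 * np.eye(n1),
    rbf(X1, X2) * rbf(C1, C2),
)  # shape (n1, n2)

# Large system: vec(alpha) = (D D^T + lam2 n2 E)^{-1} D y2.
D = khatri_rao(K_C2, K_W1 @ Gamma)  # shape (n1 * n2, n2)
E = np.kron(K_C2, K_W1)             # shape (n1 * n2, n1 * n2)
vec_direct = np.linalg.solve(D @ D.T + lam2 * n2 * E, D @ y2)

# Reduced system: only an n2 x n2 solve, and E is never formed or inverted.
M = K_C2 * (Gamma.T @ K_W1 @ Gamma)
gamma = np.linalg.solve(lam2 * n2 * np.eye(n2) + M, y2)
vec_reduced = khatri_rao(np.eye(n2), Gamma) @ gamma
alpha = vec_reduced.reshape(n1, n2, order="F")  # undo column stacking
```

Under these assumptions `alpha` equals `Gamma * gamma`, that is, column $j$ of $\Gamma$ scaled by $\gamma_{j}$, which matches $\alpha_{ij}=b_{i}(c_{2,j},x_{2,j})\gamma_{j}$ from the start of the proof.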

C.2 Proof of Kernel Bridge Function $m_{0}$

We begin by stating the result.

Proposition C.3.

Let $\mathcal{K}_{W_{3}}\in\mathbb{R}^{n_{3}\times n_{3}}$, $\mathcal{K}_{X_{4}}\in\mathbb{R}^{n_{4}\times n_{4}}$ be the Gram matrices of $W_{3}$ and $X_{4}$, respectively. Let $\mathcal{K}_{X_{34}}\in\mathbb{R}^{n_{3}\times n_{4}}$, $\mathcal{K}_{Z_{34}}\in\mathbb{R}^{n_{3}\times n_{4}}$ be the cross Gram matrices of $(X_{3},X_{4})$ and $(Z_{3},Z_{4})$, respectively. For any $\lambda_{4}>0$, there exists a unique optimal solution to (5.6) of the form

$$\begin{aligned}
\widehat{m}_{0}&=\sum_{i=1}^{n_{3}}\sum_{j=1}^{n_{4}}\alpha_{ij}\,\phi(w_{3,i})\otimes\phi(x_{4,j});\\
\operatorname{vec}(\alpha)&=(I\,\overline{\otimes}\,\Gamma)(\lambda_{4}n_{4}I+\Sigma)^{-1}y_{4},
\end{aligned}$$

where $\Sigma=(\Gamma^{\top}\mathcal{K}_{W_{3}}\Gamma)\odot\mathcal{K}_{X_{4}}$, $\Gamma=(\mathcal{K}_{X_{3}}\odot\mathcal{K}_{Z_{3}}+\lambda_{3}n_{3}I)^{-1}(\mathcal{K}_{X_{34}}\odot\mathcal{K}_{Z_{34}})$, and $y_{4}=\begin{bmatrix}y_{4,1},\ldots,y_{4,n_{4}}\end{bmatrix}^{\top}$.

The proof of Proposition C.3 follows exactly as the proof of Proposition 5.1, with $X$ replaced by $Z$ and $C$ replaced by $X$.
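
As a rough illustration of how the closed form in Proposition C.3 could be computed, the sketch below fits the coefficient matrix $\alpha$ and evaluates $\sum_{i,j}\alpha_{ij}k(w_{3,i},w)k(x_{4,j},x)$ at new points. It is a sketch under assumptions rather than the authors' code: the function names `fit_m0` and `eval_m0`, the Gaussian kernel, and the synthetic data are hypothetical, and $(I\,\overline{\otimes}\,\Gamma)v$ is implemented as $\Gamma$ with column $j$ scaled by $v_{j}$, which holds if $\overline{\otimes}$ is the column-wise Khatri-Rao product and $\operatorname{vec}$ stacks columns.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * lengthscale ** 2))

def fit_m0(X3, Z3, W3, X4, Z4, y4, lam3, lam4):
    """Return the n3 x n4 coefficient matrix alpha of Proposition C.3."""
    n3, n4 = len(X3), len(X4)
    Gamma = np.linalg.solve(
        rbf(X3, X3) * rbf(Z3, Z3) + lam3 * n3 * np.eye(n3),
        rbf(X3, X4) * rbf(Z3, Z4),
    )
    Sigma = (Gamma.T @ rbf(W3, W3) @ Gamma) * rbf(X4, X4)
    v = np.linalg.solve(lam4 * n4 * np.eye(n4) + Sigma, y4)
    return Gamma * v  # column j of Gamma scaled by v_j

def eval_m0(alpha, W3, X4, w, x):
    """Evaluate sum_ij alpha_ij k(w_{3,i}, w) k(x_{4,j}, x) for paired rows of w, x."""
    return np.einsum("ai,ij,aj->a", rbf(w, W3), alpha, rbf(x, X4))

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
n3, n4, d = 6, 5, 2
X3, Z3, W3 = (rng.normal(size=(n3, d)) for _ in range(3))
X4, Z4 = (rng.normal(size=(n4, d)) for _ in range(2))
y4 = rng.normal(size=n4)

alpha = fit_m0(X3, Z3, W3, X4, Z4, y4, lam3=0.1, lam4=0.1)
w_new, x_new = rng.normal(size=(2, d)), rng.normal(size=(2, d))
pred = eval_m0(alpha, W3, X4, w_new, x_new)
```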

C.3 Estimation with discrete $Z$ or $C$

When $C$ or $Z$ is discrete, a more efficient alternative to the estimator introduced in Section 5.1, which requires kernelized features of $C$ (or $Z$), is to solve a separate regression of $W$ on $X$ for each $c\in\mathcal{C}$ (or $z\in\mathcal{Z}$). Defining the index set $\Xi_{1}(c)=\{i: c_{1,i}=c,\ i=1,\ldots,n_{1}\}$, we modify (5.3) as

$$
\begin{aligned}
\widehat{\mu}_{W\mid c,x}^{p} &= \sum_{i=1}^{n_{1}} b_{i}(x)\,\phi(w_{1,i})\,\mathds{1}(c_{1,i}=c);\\
b(x) &= (\mathcal{K}_{X_{1,c}}+\lambda_{1}I)^{-1}\Phi_{X_{1,c}}(x),
\end{aligned}
$$

where $\mathcal{K}_{X_{1,c}}=[k(x_{1,i},x_{1,j})]_{i,j}$ and $\Phi_{X_{1,c}}=\begin{bmatrix}\phi(x_{1,i})\end{bmatrix}_{i}^{\top}$ with $i,j\in\Xi_{1}(c)$. Alternatively, one can apply the form in (5.3) but use a binary kernel on $C$ (or $Z$).
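The per-category regression can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian kernel on $X$, and the default values of `lam` and `length_scale` are all hypothetical. It returns the weight vector $b(x)$ padded with zeros outside $\Xi_{1}(c)$, so that the embedding is $\sum_i b_i(x)\phi(w_{1,i})$ over all $n_1$ samples.

```python
import numpy as np

def gaussian_gram(a, b, length_scale=1.0):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * length_scale**2))

def cme_weights_discrete(x1, c1, x_new, c_new, lam=1e-2, length_scale=1.0):
    """Weights b(x_new) for category c_new; entries outside Xi_1(c_new) are zero."""
    idx = np.flatnonzero(c1 == c_new)                       # index set Xi_1(c)
    k_cc = gaussian_gram(x1[idx], x1[idx], length_scale)    # K_{X_{1,c}}
    k_cx = gaussian_gram(x1[idx], x_new[None, :], length_scale)[:, 0]  # Phi_{X_{1,c}}(x)
    b_sub = np.linalg.solve(k_cc + lam * np.eye(len(idx)), k_cx)
    b = np.zeros(len(x1))
    b[idx] = b_sub                                          # indicator 1(c_{1,i} = c)
    return b
```

Each category only requires solving a linear system of size $|\Xi_{1}(c)|$, which is the source of the efficiency gain over a single system of size $n_1$.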

Appendix D Experiments

In this section, we discuss the experimental settings and implementation details. We start by introducing the implementation details of all the baselines and the proposed method. Then, we discuss the experimental settings.

D.1 Baselines of Adaptation with Concepts and Proxies

We introduce the baseline methods for the adaptation task with $C$ and $W$. These are COVARS, LABELS, ORACLE, LSA-W, LSA-S, and LSA-S w/ target $W$, together with the proposed method. For the regression task on dSprite, we apply five-fold cross-validation with mean squared error as the metric to select the kernel length scale and the ridge regularization penalty.

COVARS. We fit a domain classifier using logistic regression, compute instance weights following shimodaira2000improving, and learn a weighted kernel ridge regressor with a Gaussian kernel function on the source training samples.
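A common way to turn a domain classifier into instance weights is the density-ratio identity $q(x)/p(x)=\frac{n_{p}}{n_{q}}\cdot\frac{\Pr(\text{target}\mid x)}{\Pr(\text{source}\mid x)}$. The sketch below assumes this identity is the conversion used; the function name, argument names, and the clipping constant `eps` are illustrative and do not come from the paper.

```python
import numpy as np

def covariate_shift_weights(p_target_given_x, n_source, n_target, eps=1e-12):
    """Importance weights q(x)/p(x) from domain-classifier probabilities.

    p_target_given_x: predicted probability that each source sample
    belongs to the target domain (assumed output of a logistic regression).
    """
    p = np.clip(np.asarray(p_target_given_x, dtype=float), eps, 1.0 - eps)
    return (n_source / n_target) * p / (1.0 - p)
```

The resulting weights can then be passed as per-sample weights to a weighted kernel ridge regressor fitted on the source training samples.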

LABELS. The label shift baseline assumes oracle access to labels in the target domain. For the classification task, we compute instance weights $q(Y)/p(Y)$ using the observed frequencies in the validation set for the source domain and the training set for the target domain. For the regression task, we compute the weights by fitting a Gaussian kernel density estimator using the source validation set and the target training set separately. We then use the fitted densities to estimate $q(Y)/p(Y)$ for each sample in the source training set. Finally, we learn a sample-weighted kernel ridge regressor with a Gaussian kernel on the source training samples.
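For the classification case, the frequency-based weights can be sketched as follows. This is a minimal illustration; the function name and the choice to assign weight zero to classes absent from the source validation set are assumptions, not details from the paper.

```python
import numpy as np

def label_shift_weights(y_source_val, y_target_train, y_source_train, n_classes):
    """Per-sample weights q(y)/p(y) from empirical class frequencies."""
    p = np.bincount(y_source_val, minlength=n_classes) / len(y_source_val)
    q = np.bincount(y_target_train, minlength=n_classes) / len(y_target_train)
    # Classes never seen in the source validation set get weight 0 (assumption).
    ratio = np.divide(q, p, out=np.zeros_like(q), where=p > 0)
    return ratio[y_source_train]
```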

ORACLE. For regression tasks, we learn a kernel ridge regressor with a Gaussian kernel on target training samples. For the classification task, we use a standard MLP trained with samples from the target domain. Details of the model structure are documented in Section D.2.

LSA-W. The estimation procedure follows Section 6 in alabdulmohsin2023adapting. In this case, we discretize the values of $W$ by applying the additional transform $\mathrm{sign}(w)$ to each sample $w$.

LSA-S. The estimation procedure follows Algorithm 2–5 in alabdulmohsin2023adapting.

LSA-S w/ target $W$. We briefly describe the procedure to incorporate target $W$ into LSA-S. alabdulmohsin2023adapting showed that $Q(Y\mid x)$ can be decomposed as

$$
\begin{aligned}
Q(Y\mid x) &= \sum_{\widetilde{u}}\underbrace{P(Y\mid\widetilde{u},x)}_{(a)}\,\underbrace{Q(\widetilde{u}\mid x)}_{(b)} && \text{(D.1)}\\
&\propto \sum_{\widetilde{u}}\underbrace{P(Y\mid\widetilde{u},x)}_{(a)}\,\underbrace{P(\widetilde{u}\mid x)}_{(c)}\,\underbrace{\frac{Q(\widetilde{u})}{P(\widetilde{u})}}_{(d)}\,\frac{P(x)}{Q(x)}, && \text{(D.2)}
\end{aligned}
$$

where $\widetilde{u}$ is a permutation of the original $u$. Both LSA-WAE and LSA-S are multi-stage procedures that compute (a), (c), and (d) individually and combine the results using (D.2) to obtain the predicted target distribution. Step (a) corresponds to Algorithm 5, (c) corresponds to Equation (17), and (d) corresponds to Algorithm 4 in alabdulmohsin2023adapting.
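The combination step in (D.2) amounts to a weighted sum over the latent categories followed by normalization over $Y$; the factor $P(x)/Q(x)$ does not depend on $Y$ or $\widetilde{u}$, so it is absorbed by the normalization. The following sketch illustrates this for a single $x$. The function and argument names are hypothetical, and the inputs are assumed to have been estimated already.

```python
import numpy as np

def combine_d2(p_y_given_ux, p_u_given_x, q_u, p_u):
    """Combine (a), (c), (d) into Q(Y | x) for one x.

    p_y_given_ux: (k_U, k_Y) array, term (a)
    p_u_given_x:  (k_U,) array, term (c)
    q_u, p_u:     (k_U,) arrays, whose ratio is term (d)
    """
    latent_weights = p_u_given_x * q_u / p_u      # (c) * (d)
    unnormalized = latent_weights @ p_y_given_ux  # sum over u-tilde
    return unnormalized / unnormalized.sum()      # absorbs P(x)/Q(x)
```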

With the additional $W$ from the target, we can obtain (b) by slightly modifying one estimation step in LSA-S. We evaluate this procedure, named LSA-S w/ target $W$, with (c) and (d) replaced by (b). Suppose that $U$ takes values in $1,\ldots,k_{U}$ and let $\widetilde{U}$ be a permutation of $U$. Define the matrix $\mathbf{G}$ as

$$
\mathbf{G}=\begin{bmatrix}
\langle\widehat{P}(W\mid\widetilde{U}=1),\widehat{P}(W\mid\widetilde{U}=1)\rangle & \cdots & \langle\widehat{P}(W\mid\widetilde{U}=1),\widehat{P}(W\mid\widetilde{U}=k_{U})\rangle\\
\vdots & \ddots & \vdots\\
\langle\widehat{P}(W\mid\widetilde{U}=k_{U}),\widehat{P}(W\mid\widetilde{U}=1)\rangle & \cdots & \langle\widehat{P}(W\mid\widetilde{U}=k_{U}),\widehat{P}(W\mid\widetilde{U}=k_{U})\rangle
\end{bmatrix},
$$

where $\widehat{P}(W\mid\widetilde{U}=i)$ is the estimated conditional kernel density function obtained by Algorithm 3 in alabdulmohsin2023adapting. Step (b) is computed by solving the following constrained least-squares problem:

$$
\begin{aligned}
\widehat{Q}(\widetilde{\mathbf{U}}\mid x)={}&\arg\min\left\|\begin{bmatrix}\langle\widehat{Q}(W\mid x),\widehat{P}(W\mid\widetilde{U}=1)\rangle\\ \vdots\\ \langle\widehat{Q}(W\mid x),\widehat{P}(W\mid\widetilde{U}=k_{U})\rangle\end{bmatrix}-\mathbf{G}\begin{bmatrix}Q(\widetilde{U}=1\mid x)\\ \vdots\\ Q(\widetilde{U}=k_{U}\mid x)\end{bmatrix}\right\|_{F}^{2},\\
&\text{subject to}\quad 0\leq Q(\widetilde{U}=i\mid x)\leq 1,\quad i=1,\ldots,k_{U};\\
&\phantom{\text{subject to}\quad}\sum_{i=1}^{k_{U}}Q(\widetilde{U}=i\mid x)=1.
\end{aligned}
$$

Then, we compute the predicted conditional probability based on (D.1).
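The constrained least-squares problem for step (b) is a quadratic program over the probability simplex. The paper does not state which solver is used; the sketch below swaps in projected gradient descent with a sort-based Euclidean projection onto the simplex, purely for illustration. The function names, the iteration count, and the step-size rule are assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {q : q >= 0, sum(q) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def simplex_least_squares(G, r, n_iter=5000):
    """Minimise ||r - G q||^2 subject to q on the probability simplex."""
    k = G.shape[1]
    q = np.full(k, 1.0 / k)                              # start from the uniform distribution
    step = 1.0 / (np.linalg.norm(G, 2) ** 2 + 1e-12)     # inverse Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = G.T @ (G @ q - r)                         # gradient of 0.5 * ||r - G q||^2
        q = project_simplex(q - step * grad)
    return q
```

Here `r` plays the role of the vector of inner products $\langle\widehat{Q}(W\mid x),\widehat{P}(W\mid\widetilde{U}=i)\rangle$, and the box constraint $0\leq q_i\leq 1$ is implied by the simplex constraint.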

Proposed Method. For the regression task on the dSprite dataset, we use the Gaussian kernel function as the feature map for both $X$ and $W$. In the classification task, we also use the Gaussian kernel function for $X$ and $W$. Additionally, we use a columnwise binary kernel for $C$, which applies a binary kernel to each entry and takes the product of the outputs. To compute $\widehat{h}_{0}$, we apply a one-hot encoder to $Y$ and apply the result in Proposition 5.1. To choose the kernel length scale for the classification task, we use the validation set with the AUROC metric.
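A columnwise binary kernel of this kind can be sketched as follows, assuming the per-entry binary kernel is the indicator of equality, so that the product is 1 exactly when all entries of the two concept vectors match. The function name is hypothetical.

```python
import numpy as np

def columnwise_binary_kernel(c_a, c_b):
    """Gram matrix between rows of c_a (n, d) and c_b (m, d).

    Each entry uses the kernel 1(c_i == c'_i); the product over the d
    columns equals 1 only if every entry agrees.
    """
    entrywise = c_a[:, None, :] == c_b[None, :, :]   # (n, m, d) per-entry kernels
    return np.all(entrywise, axis=2).astype(float)   # product of 0/1 values
```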

D.2 Baselines of Multi-Source Adaptation

For the first three baselines, Cat-ERM, Avg-ERM, and SA, we use a standard MLP as the backbone. It is an MLP with a single hidden layer of size $100$ and ReLU activation functions. The network is trained using the Adam optimizer (kingma2014adam) with learning rate $10^{-3}$. The batch size is set to $200$ and the maximum number of iterations is set to $300$.

Cat-ERM. We concatenate all the samples across environments into one dataset. Then, we train the model with a standard MLP model as specified above.

Avg-ERM. For each environment, we train a standard MLP model. During testing, we take the average of predictions from all models.

Simple Adaptation (SA) (mansour2008domain). To implement the method, we build kernel density estimators with a Gaussian kernel function to estimate the density $p_{r}(x)$ for $r=1,\ldots,k_{Z}$. We then reweight the output of each domain's classifier, a standard MLP, with the normalized weight $p_{r}(x_{\text{new}})/\left\{\sum_{r'}p_{r'}(x_{\text{new}})\right\}$. The kernel length scale is chosen using five-fold cross-validation with the AUROC metric.
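The density-weighted combination in SA can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the per-domain classifiers are passed in as arbitrary callables (`domain_predict_fns`, a hypothetical name), and a hand-written Gaussian kernel density estimator stands in for whichever estimator was actually used.

```python
import numpy as np

def gaussian_kde_density(x_train, x_new, length_scale):
    """Gaussian kernel density estimate at the rows of x_new."""
    d = x_train.shape[1]
    sq = np.sum((x_new[:, None, :] - x_train[None, :, :]) ** 2, axis=2)
    norm = (2.0 * np.pi * length_scale**2) ** (d / 2.0)
    return np.exp(-sq / (2.0 * length_scale**2)).mean(axis=1) / norm

def simple_adaptation(x_new, domain_samples, domain_predict_fns, length_scale=1.0):
    """Combine per-domain predictions with normalized density weights."""
    dens = np.stack(
        [gaussian_kde_density(xs, x_new, length_scale) for xs in domain_samples], axis=1
    )                                                   # (m, k_Z), p_r(x_new)
    weights = dens / dens.sum(axis=1, keepdims=True)    # p_r / sum_{r'} p_{r'}
    preds = np.stack([f(x_new) for f in domain_predict_fns], axis=1)
    return np.sum(weights * preds, axis=1), weights
```

A point far from every source sample can make all densities underflow to zero; a practical implementation would need to guard against that case.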

Marginal Kernel (MK) (blanchard2011generalizing). This method involves a kernel SVM with a product kernel on $(\mathcal{X},P(X))$. For any $x,x'\in\mathcal{X}$ and distributions $P,P'$ on $X$, the kernel function is defined as $k((x,P),(x',P'))=k_{1}(x,x')k_{2}(P,P')$. Let $n$ be the number of samples. Here $k_{1}$ is a Gaussian kernel function, and $k_{2}$ is the mean of the Gram matrix $[k(x_{i},x_{j}')]_{ij}\in\mathbb{R}^{n\times n}$, where $x_{i}$ for $i=1,\ldots,n$ is an i.i.d. sample from $P$ and $x_{j}'$ for $j=1,\ldots,n$ is an i.i.d. sample from $P'$. To accommodate the large dataset, we precompute the Gram matrix and apply it to a linear classifier trained using Stochastic Gradient Descent (SGD) implemented in the package scikit-learn (scikit-learn). The kernel length scale is chosen using five-fold cross-validation with the AUROC metric.
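The product kernel can be evaluated for one pair of inputs as in the sketch below. It assumes that the same Gaussian kernel and length scale are used inside both $k_{1}$ and $k_{2}$, which the text does not specify; the function names are hypothetical.

```python
import numpy as np

def gaussian_gram(a, b, length_scale=1.0):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * length_scale**2))

def marginal_kernel(x, x_prime, sample_p, sample_p_prime, length_scale=1.0):
    """k((x, P), (x', P')) = k1(x, x') * k2(P, P')."""
    k1 = gaussian_gram(x[None, :], x_prime[None, :], length_scale)[0, 0]
    # k2: mean of the Gram matrix between samples drawn from P and P'.
    k2 = gaussian_gram(sample_p, sample_p_prime, length_scale).mean()
    return k1 * k2
```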

Weighted Combination of Source Classifiers (WCSC) (zhang2015multi). For each source environment, we estimate the conditional density of $X\mid y$ using a kernel density estimator with a Gaussian kernel function. The rest of the estimation procedure follows Section 2 in zhang2015multi. The kernel length scale is chosen using five-fold cross-validation with the AUROC metric.

Proposed Method. We use a columnwise Gaussian kernel function as the feature map of $X$ and a Gaussian kernel function as the feature map of $W$. The conditional mean embedding $\widehat{\mu}_{W\mid x,z}^{p}$ is estimated using the approach introduced in Section C.3. The analytical solution of $\widehat{m}_{0}$ is given in Proposition C.3. All kernel length scales and the regularization parameters $\lambda_{3}$, $\lambda_{4}$ are selected using five-fold cross-validation with the AUROC metric.

ORACLE. The model is $\langle\widehat{m}_{0},\widehat{\mu}_{W\mid x}^{q}\rangle$, where both the bridge function $\widehat{m}_{0}$ and $\widehat{\mu}_{W\mid x}^{q}$ are estimated using the target dataset, with the number of training samples equal to that of the source domain. All kernel length scales and the regularization parameters $\lambda_{3}$, $\lambda_{4}$ are selected using five-fold cross-validation with the AUROC metric.

D.3 Classification Task

The classification task discussed in Section D.6 was first introduced in alabdulmohsin2023adapting. Let $\mathbf{o}(\cdot)$ be the one-hot encoder. We follow their procedure and generate samples using the following data generation process:

$$
\begin{aligned}
U &\sim \textrm{Categorical}(\bm{\pi});\\
W\mid U=u &\sim \mathcal{N}\big(\mathbf{o}(u)\mathbf{M}_{W|U},1\big);\\
X\mid U=u &\sim \mathcal{N}\big(\mathbf{o}(u)\mathbf{M}_{X|U},\mathbf{I}_{k_{X}}\big);\\
C_{i}\mid X=x,U=u &\sim \textrm{Bernoulli}\Big(\mathrm{logit}^{-1}\big([x\mathbf{M}_{C|X,U=u}+\mathbf{o}(u)\mathbf{M}_{C|U}]_{i}\big)\Big);\\
Y\mid C=c,U=u &\sim \textrm{Bernoulli}\Big(\mathrm{logit}^{-1}\big(c\mathbf{M}_{Y|C,U=u}+\mathbf{o}(u)\mathbf{M}_{Y|U}\big)\Big),
\end{aligned}
$$

where the matrices are defined as

$$
\begin{aligned}
&\mathbf{M}_{W|U}:=\begin{bmatrix}-1&1\end{bmatrix}^{\top},\quad \mathbf{M}_{X|U}:=a_{w}\begin{bmatrix}-1&1\\ 1&-1\end{bmatrix},\quad \mathbf{M}_{C|U}:=\begin{bmatrix}-2&2&2\\ -1&1&2\end{bmatrix};\\
&\mathbf{M}_{C|X,U=u_{0}}:=3\begin{bmatrix}-2&2&-1\\ 1&-2&-3\end{bmatrix},\quad \mathbf{M}_{C|X,U=u_{1}}:=3\begin{bmatrix}2&-2&1\\ -1&2&3\end{bmatrix};\\
&\mathbf{M}_{Y|U}:=\begin{bmatrix}2&2\end{bmatrix}^{\top},\quad \mathbf{M}_{Y|C,U=u_{0}}:=\begin{bmatrix}3&-2&-1\end{bmatrix}^{\top},\quad \mathbf{M}_{Y|C,U=u_{1}}:=\begin{bmatrix}3&-1&-2\end{bmatrix}^{\top}.
\end{aligned}
$$

The coefficient $a_{w}=1$ in Figure 2. Figure 4 displays additional results for $a_{w}=2,3$. We generate $7000$ training samples, $1000$ validation samples, and $2000$ testing samples for the classification task with concepts and proxies.
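The data generation process can be simulated with a short NumPy routine. The sketch assumes a binary latent variable with $\bm{\pi}=(P(U=0),P(U=1))$, reads $\mathbf{o}(u)\mathbf{M}$ as selecting row $u$ of $\mathbf{M}$, and treats $\mathrm{logit}^{-1}$ as the logistic sigmoid. The function name, argument names, and seeding are illustrative.

```python
import numpy as np

# Matrices from the data generation process; the first axis is indexed by u in {0, 1}.
M_W_U = np.array([-1.0, 1.0])
M_X_U_BASE = np.array([[-1.0, 1.0], [1.0, -1.0]])          # multiplied by a_w
M_C_U = np.array([[-2.0, 2.0, 2.0], [-1.0, 1.0, 2.0]])
M_C_XU = 3.0 * np.array([[[-2.0, 2.0, -1.0], [1.0, -2.0, -3.0]],
                         [[2.0, -2.0, 1.0], [-1.0, 2.0, 3.0]]])
M_Y_U = np.array([2.0, 2.0])
M_Y_CU = np.array([[3.0, -2.0, -1.0], [3.0, -1.0, -2.0]])

def inv_logit(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_classification(n, prob_u0, a_w=1.0, seed=0):
    """Draw n samples (U, W, X, C, Y) with P(U = 0) = prob_u0."""
    rng = np.random.default_rng(seed)
    u = (rng.random(n) >= prob_u0).astype(int)
    w = rng.normal(M_W_U[u], 1.0)                           # (n,)
    x = rng.normal(a_w * M_X_U_BASE[u], 1.0)                # (n, 2), identity covariance
    logits_c = np.einsum("nd,ndk->nk", x, M_C_XU[u]) + M_C_U[u]
    c = (rng.random((n, 3)) < inv_logit(logits_c)).astype(float)
    logits_y = np.einsum("nk,nk->n", c, M_Y_CU[u]) + M_Y_U[u]
    y = (rng.random(n) < inv_logit(logits_y)).astype(int)
    return u, w, x, c, y
```

For example, `sample_classification(7000, prob_u0=0.9)` draws a training set of the stated size for a domain in which $U=0$ dominates.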

In the multi-domain case, we construct $3$ different tasks. Task $1$ is composed of $z_{1},z_{2},z_{3}$ such that $P(U=0\mid z_{1})=0.1$, $P(U=0\mid z_{2})=0.2$, $P(U=0\mid z_{3})=0.3$, and a target domain with $Q(U=0)=0.9$. For task $2$, we select $z_{4},z_{5},z_{6}$ such that $P(U=0\mid z_{4})=0.4$, $P(U=0\mid z_{5})=0.5$, $P(U=0\mid z_{6})=0.6$, and $Q(U=0)=0.9$. For task $3$, we select $z_{7},z_{8},z_{9}$ such that $P(U=0\mid z_{7})=0.7$, $P(U=0\mid z_{8})=0.8$, $P(U=0\mid z_{9})=0.9$, and $Q(U=0)=0.4$. The results are shown in Tables 1 and 2.

Figure 4: Classification results with $a_{w}=2,3$. The figures indicate that LSA-S and LSA-S w/ target $W$ have comparable performance; incorporating the target $W$ does not seem to improve performance.

D.4 Comparison to Domain Generalization Baselines

Table 2: Multi-domain generalization vs. (proposed) adaptation results. The values are the average AUROC of $10$ independent runs drawn from the data generating process. Each task has three source domains with different $P_{r}(U)$ and one target domain. The proposed method outperforms all domain generalization benchmarks across all tasks.

| | ORACLE | ARM | CDANN | CORAL | DANN | GroupDRO | IRM | MMD | VREx | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| Task 1 | $0.9425\pm 0.0039$ | $0.8065\pm 0.0247$ | $0.8061\pm 0.0252$ | $0.8030\pm 0.0236$ | $0.8039\pm 0.0229$ | $0.7954\pm 0.0323$ | $0.7989\pm 0.0283$ | $0.8055\pm 0.0248$ | $0.8010\pm 0.0279$ | $\mathbf{0.8848}\pm 0.0120$ |
| Task 2 | $0.9431\pm 0.0061$ | $0.9143\pm 0.0150$ | $0.9159\pm 0.0125$ | $0.9158\pm 0.0132$ | $0.9158\pm 0.0125$ | $0.9160\pm 0.0125$ | $0.9131\pm 0.0135$ | $0.9149\pm 0.0135$ | $0.9136\pm 0.0124$ | $\mathbf{0.9318}\pm 0.0063$ |
| Task 3 | $0.8876\pm 0.0085$ | $0.8470\pm 0.0171$ | $0.8456\pm 0.0164$ | $0.8473\pm 0.0163$ | $0.8480\pm 0.0166$ | $0.8487\pm 0.0185$ | $0.8469\pm 0.0186$ | $0.8470\pm 0.0181$ | $0.8470\pm 0.0132$ | $\mathbf{0.8569}\pm 0.0095$ |

Given that we observe multiple domains at test time, a natural question is: does adaptation give us an advantage over generalization? In generalization, we cannot assume access to any observations from the target domain. We compare our adaptation method with multi-domain generalization baselines (muandet2013domain): Adaptive Risk Minimization (ARM) (zhang2021adaptive), Conditional Domain Adversarial Neural Networks (CDANN) (long2018conditional), Correlation Alignment (CORAL) (sun2016deep), Domain Adversarial Neural Networks (DANN) (ganin2016domain), Distributionally Robust Optimization for Group Shifts (GroupDRO) (sagawa2019distributionally), Invariant Risk Minimization (IRM) (arjovsky2019invariant), Maximum Mean Discrepancy (MMD) (Borgwardt2006IntegratingSB), and Risk Extrapolation (REx) (krueger2021out).

In Table 2, we show that our proposed method for domain adaptation in the multi-domain setting outperforms the state-of-the-art multi-domain generalization methods.

D.5 Regression Tasks

We consider three tasks. We first introduce the simulated task and then discuss the task on dSprite data (dsprites17).

D.5.1 Simulated Dataset

We consider the following data generation process.

Simulated regression task 1.

\begin{align*}
U &= \mathrm{Ber}(a); \\
X &= \mathcal{N}(0,1); \\
Y &= -X\,\mathbf{1}_{(U=0)} + X\,\mathbf{1}_{(U=1)}; \tag{D.3} \\
W &= \mathcal{N}(-1,0.01)\,\mathbf{1}_{(U=0)} + \mathcal{N}(1,0.01)\,\mathbf{1}_{(U=1)}.
\end{align*}

There are two source domains. We set $a=0.1$ for source domain $z_1$ and $a=0.9$ for source domain $z_2$. According to the data generation process (D.3), $Y$ is mostly negatively correlated with $X$ in domain $z_1$ (where $U=0$ with probability $0.9$) and mostly positively correlated with $X$ in domain $z_2$ (where $U=1$ with probability $0.9$). For each domain, we synthesize $2000$ training samples and $1000$ testing samples. We sweep across $a\in\{0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\}$ in the target domain. We run $10$ replications; the results are shown in Figure 5. In the next task, we set $U$ to be a continuous random variable following a Beta distribution.

In this task, we expect the Cat-ERM method to fail drastically: we anticipate that its predicted $Y$ versus $X$ is a flat line, since the prediction would be an average of the downward-sloping line ($U=0$) and the upward-sloping line ($U=1$). The result in Figure 5 supports our hypothesis, as the mean squared error remains nearly flat as we vary the target distribution $Q(U)$.
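The flat-line argument can be checked numerically. The following is a minimal sketch, assuming that $0.01$ in (D.3) denotes a variance and using a no-intercept least-squares fit as a stand-in for the concatenated ERM baseline (the actual Cat-ERM model is not specified here). The helper names `sample_task1` and `ols_slope` are hypothetical.

```python
import numpy as np

def sample_task1(a, n, rng):
    """Draw n samples (X, Y, W, U) from process (D.3) with P(U=1) = a."""
    u = rng.binomial(1, a, size=n)             # U ~ Ber(a)
    x = rng.normal(0.0, 1.0, size=n)           # X ~ N(0, 1)
    y = np.where(u == 1, x, -x)                # Y = -X if U=0, +X if U=1
    # Assumption: 0.01 is a variance, so the standard deviation is 0.1.
    w = rng.normal(np.where(u == 1, 1.0, -1.0), 0.1)
    return x, y, w, u

def ols_slope(x, y):
    """Least-squares slope of y on x (no intercept)."""
    return float(x @ y / (x @ x))

rng = np.random.default_rng(0)
x1, y1, _, _ = sample_task1(0.1, 2000, rng)    # source domain z_1
x2, y2, _, _ = sample_task1(0.9, 2000, rng)    # source domain z_2

slope_z1 = ols_slope(x1, y1)                   # population value 2*0.1 - 1 = -0.8
slope_z2 = ols_slope(x2, y2)                   # population value 2*0.9 - 1 = +0.8
slope_pooled = ols_slope(np.concatenate([x1, x2]),
                         np.concatenate([y1, y2]))  # population value 0
print(slope_z1, slope_z2, slope_pooled)
```

Because the two source domains are pooled in equal proportion, the opposite slopes cancel and the pooled fit is approximately flat, which is the behavior anticipated for Cat-ERM.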

Simulated regression task 2.

\begin{align*}
U &= \mathrm{Beta}(a,b) \\
X &= \mathcal{N}(0,1) \\
Y &= (2U-1)X \\
W &= \mathcal{N}(-1,0.01)(1-U) + \mathcal{N}(1,0.01)\,U.
\end{align*}

There are two source domains, corresponding to two draws from $P(Z)$, which we write $z_r=(a,b)$. We set $a=2,\,b=4$ for the first source domain $r=1$, and $a=4,\,b=2$ for the second source domain $r=2$. The corresponding distributions over $U$ are shown in Figure 6. Under this setting, we test on target domains with $a,b=1,\ldots,5$, with distributions shown in Figure 6. For each domain, we synthesize $2000$ training samples and $1000$ testing samples. We run $10$ replications; the results are shown in Figure 5.
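As a sanity check on this process: if $U$ is drawn independently of $X$ (as the sampling scheme suggests), then $\mathbb{E}[Y\mid X]=(2\,\mathbb{E}[U]-1)X=\frac{a-b}{a+b}X$, since a $\mathrm{Beta}(a,b)$ variable has mean $a/(a+b)$. The sketch below simulates the process and compares a fitted slope with this value. It assumes that $0.01$ is a variance and that the two Gaussian terms in $W$ are independent draws; the function names are hypothetical.

```python
import numpy as np

def sample_task2(a, b, n, rng):
    """Draw n samples (X, Y, W, U) from simulated regression task 2."""
    u = rng.beta(a, b, size=n)                 # U ~ Beta(a, b)
    x = rng.normal(0.0, 1.0, size=n)           # X ~ N(0, 1)
    y = (2.0 * u - 1.0) * x                    # Y = (2U - 1) X
    # Assumption: independent Gaussian draws with variance 0.01 (std 0.1).
    w = rng.normal(-1.0, 0.1, size=n) * (1.0 - u) + rng.normal(1.0, 0.1, size=n) * u
    return x, y, w, u

def ols_slope(x, y):
    """Least-squares slope of y on x (no intercept)."""
    return float(x @ y / (x @ x))

def expected_slope(a, b):
    """Population slope 2 E[U] - 1 = (a - b) / (a + b)."""
    return (a - b) / (a + b)

rng = np.random.default_rng(0)
for a, b in [(2, 4), (4, 2), (3, 3)]:
    x, y, w, u = sample_task2(a, b, 2000, rng)
    print((a, b), ols_slope(x, y), expected_slope(a, b))
```

Under these assumptions the slope is zero whenever $a=b$, so in that case the best predictor of $Y$ from $X$ alone is the same for every symmetric target distribution.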

D.6 Adaptation with Concepts and Proxies

Figure 5: Top left: results of regression task 1. The proposed method performs close to the ORACLE method, whereas all other competing methods are vulnerable to the distribution shifts. Other figures: results of regression task 2. In each plot, we fix $b$ and vary $a$. For all plots, it appears that when $a=b$, the mean squared errors of all methods converge to a point. This is the case when the target density function of $U$ has a peak centered around $0.5$, as shown in Figure 6, and hence $Y=(2U-1)X$ is close to zero for most samples.
Figure 6: The probability density functions of Beta distributions with different $a,b=1,\ldots,5$.

D.6.1 dSprites Dataset

Figure 7: dSprites image with confound (rotation) applied.

We test the proposed procedure on the dSprites dataset (dsprites17), an image dataset described by five latent parameters (shape, scale, rotation, posX, and posY). Motivated by dsprites17’s experiments, we design a regression task where the dSprites images ($64\times 64=4096$-dimensional) are $X\in\mathbb{R}^{64\times 64}$ and subject to a nonlinear confounder $U\in[0,2\pi]$, which is a rotation of the image (Figure 7). We fix all other latent parameters: shape is heart, scale is maximized, and all others are set to their 0th position. $W\in\mathbb{R}$ and $C\in\mathbb{R}$ are continuous random variables. The data generation process is defined as follows:

\begin{align*}
U^{p} &\sim 2\pi\,\text{Beta}(2,4), \quad U^{q}\sim\text{Uniform}(a,2\pi); \\
X &= \text{Rotate}(\text{image},\,U\text{ rads})+\eta, \quad \eta\sim\mathcal{N}(0,0.01I_{64}); \\
C &= \Bigg(\frac{0.1\|X^{T}A\|_{2}^{2}-5000}{2000}\Bigg)^{2}+U+\gamma; \\
A &\sim \text{Uniform}(0,1),\; A\in\mathbb{R}^{4096\times 10}, \quad \gamma\sim\mathcal{N}(0,0.5); \\
Y &= \frac{1}{4}C+\frac{1}{20}\sin(U)+\varepsilon, \quad \varepsilon\sim\mathcal{N}(0,0.1); \\
W &= \cos(U)+\nu, \quad \nu\sim\mathcal{N}(0,0.25).
\end{align*}
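A rough sketch of this generative process is given below. It rests on several assumptions that are not stated in the text: a rectangular placeholder stands in for the dSprites heart image, the second argument of each normal distribution is read as a variance, the pixel noise $\eta$ is applied independently to every pixel, and the rotation is done with `scipy.ndimage.rotate`. The function `sample_dsprites_like` is a hypothetical name.

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

# Hypothetical stand-in for the fixed 64 x 64 dSprites heart image.
base_image = np.zeros((64, 64))
base_image[16:48, 24:40] = 1.0

# A is drawn once and shared by all samples.
A = rng.uniform(0.0, 1.0, size=(4096, 10))

def sample_dsprites_like(n, target=False, a=0.0):
    """Draw n samples (X, C, Y, W, U) following the process above."""
    if target:
        u = rng.uniform(a, 2.0 * np.pi, size=n)        # U^q ~ Uniform(a, 2*pi)
    else:
        u = 2.0 * np.pi * rng.beta(2.0, 4.0, size=n)   # U^p ~ 2*pi*Beta(2, 4)
    # scipy's rotate expects degrees; reshape=False keeps the 64 x 64 frame.
    x = np.stack([
        rotate(base_image, np.degrees(ui), reshape=False, order=1).ravel()
        for ui in u
    ])
    x = x + rng.normal(0.0, 0.1, size=x.shape)         # eta, variance 0.01 (assumed)
    sq_norm = np.sum((x @ A) ** 2, axis=1)             # ||X^T A||_2^2 per sample
    gamma = rng.normal(0.0, np.sqrt(0.5), size=n)
    c = ((0.1 * sq_norm - 5000.0) / 2000.0) ** 2 + u + gamma
    y = 0.25 * c + np.sin(u) / 20.0 + rng.normal(0.0, np.sqrt(0.1), size=n)
    w = np.cos(u) + rng.normal(0.0, 0.5, size=n)       # nu, variance 0.25 (assumed)
    return x, c, y, w, u

x_src, c_src, y_src, w_src, u_src = sample_dsprites_like(8)
x_tgt, c_tgt, y_tgt, w_tgt, u_tgt = sample_dsprites_like(8, target=True, a=1.0)
print(x_src.shape, x_tgt.shape)
```

With the placeholder image the scale of $\|X^{T}A\|_{2}^{2}$ will differ from that of the real data, so the values of $C$ and $Y$ here are illustrative only.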

When fitting all models, both the baselines and the proposed method, we project the images from $\mathbb{R}^{4096}$ to $\mathbb{R}^{16}$ via Gaussian Random Projection using the scikit-learn implementation (Bingham2001RandomPI; scikit-learn). Additionally, for the proposed method, we use a Gaussian kernel as the feature map for $X,\,C$.
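The projection step can be sketched with scikit-learn's `GaussianRandomProjection`. The input array below is random placeholder data rather than dSprites images, and the kernel bandwidth `gamma` is an arbitrary illustrative value, not a setting reported in the paper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4096))          # placeholder for flattened 64 x 64 images

# Project from R^4096 to R^16 with a random Gaussian matrix.
projector = GaussianRandomProjection(n_components=16, random_state=0)
X_low = projector.fit_transform(X)        # shape (200, 16)

# Gaussian (RBF) kernel Gram matrix on the projected features:
# K[i, j] = exp(-gamma * ||x_i - x_j||^2). gamma is a hypothetical choice.
K = rbf_kernel(X_low, X_low, gamma=1.0 / 16)
print(X_low.shape, K.shape)
```

The same fitted `projector` would be reused with `projector.transform` on held-out images so that training and test features live in the same projected space.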

We generate $7000$ training samples and $3000$ test samples in our experiments. Then, we use five-fold cross-validation to select hyperparameters for the baselines and the proposed method for each $a$ ($U^{q}\sim\text{Uniform}(a,2\pi)$). The hyperparameters are (i) the ridge regression penalty and (ii) the Gaussian kernel scaling factor. Once we select a set of hyperparameters for a value of $a$, we perform 10 new random data regenerations to get transfer errors with 95% confidence intervals (Figure 2).
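The selection and reporting protocol can be illustrated as follows. This is a sketch under stated assumptions: kernel ridge regression stands in for the models being tuned, the grid values and toy data are hypothetical, and the 95% interval is assumed to be a Student-t interval over the 10 regenerations (the text does not say how the intervals are computed).

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))                       # toy projected features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)     # toy regression target

# Five-fold CV over (i) the ridge penalty and (ii) the Gaussian kernel scale.
param_grid = {"alpha": [1e-3, 1e-2, 1e-1, 1.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
best_params = search.best_params_

def mean_with_ci95(errors, t_crit=2.262):
    """Mean and 95% t-interval; 2.262 is the 0.975 t-quantile for 9 d.o.f. (n = 10)."""
    errors = np.asarray(errors, dtype=float)
    mean = errors.mean()
    half_width = t_crit * errors.std(ddof=1) / np.sqrt(errors.size)
    return mean, mean - half_width, mean + half_width

# Hypothetical transfer errors from 10 data regenerations.
transfer_errors = rng.uniform(0.9, 1.1, size=10)
mean_err, ci_low, ci_high = mean_with_ci95(transfer_errors)
print(best_params, mean_err, ci_low, ci_high)
```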

D.7 Classification of radiological findings with MIMIC-CXR

We conduct a small-scale experiment with chest X-ray data extracted from the MIMIC-CXR dataset (johnson2019mimic). We consider classification of the absence of a radiological finding in a chest X-ray. For this, we use the set of labels extracted by irvin2019chexpert. These labels correspond to 14 categories of radiological findings extracted based on mentions in the associated radiology reports. We specifically consider classification of the “No Finding” ($Y=1$) label, corresponding to cases where no pathology was identified as positive or uncertain in the radiology report.

To define the dataset, we consider the set of 217,536 chest X-rays with defined CheXpert labels (irvin2019chexpert), MIMIC-IV entries, and pretrained embeddings (sellergren2022simplified). We then filter this dataset to the 212,567 examples that are part of the “train” partition provided by the MIMIC-CXR database (johnson2019mimic). We then partition the data into training, validation, and testing splits containing 80%, 10%, and 10% of the examples, respectively. For adaptation, we consider BioBERT (lee2020biobert) 768-dimensional embeddings of the radiology reports as concepts $C$ and the patient’s age as a proxy variable $W$. For simplicity, we use the patient anchor_age defined through linkage to the MIMIC-IV database, regardless of the patient’s age at the time of the chest X-ray. Similar to the dSprites experiment, we further reduce the dimensionality of $X$ and $C$ to $\mathbb{R}^{64}$ using Gaussian Random Projection fit on the full training partition (170,053 examples).

To define distribution shifts, we adopt a problem formulation similar to that of makar2022causally, where patient sex is considered as a possible “shortcut” in the classification of the absence of a radiological finding. As in makar2022causally, we impose distribution shift through structured resampling of the data where $P(U=1)=P(Y=1\mid\textrm{Sex}=\textrm{Female})=P(Y=0\mid\textrm{Sex}=\textrm{Male})$. For example, when $P(U=1)=0.1$, we have $P(Y=1\mid\textrm{Sex}=\textrm{Female})=0.1$ and $P(Y=1\mid\textrm{Sex}=\textrm{Male})=0.9$. We implement the shift through a weighted sampling procedure that maintains the label shift invariance within patient sex subgroups, i.e., preserves $X\mid Y,A$ under the distribution shift, where $A$ corresponds to patient sex. This procedure further fixes the total proportion of male and female patients in the population at 50%. For our experiments, we consider nine domains corresponding to cases where $P(U=1)\in\{0.1,0.2,\ldots,0.9\}$.
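One way to realize such a resampling is sketched below. It is an illustration under assumptions, not the procedure used for the experiments: each (label, sex) cell receives total sampling mass equal to its target joint probability, spread uniformly over the examples in the cell, which leaves $X\mid Y,A$ unchanged. The function `shift_weights` and the toy labels are hypothetical.

```python
import numpy as np

def shift_weights(y, is_female, p_u1):
    """Sampling weights for a domain with P(U=1) = p_u1 and a 50/50 sex split."""
    y = np.asarray(y).astype(int)
    f = np.asarray(is_female).astype(int)
    # Target P(Y=1 | sex): p_u1 for female, 1 - p_u1 for male.
    p_y1 = np.where(f == 1, p_u1, 1.0 - p_u1)
    # Target joint P(Y=y, sex) with P(sex) fixed at 0.5.
    target_joint = 0.5 * np.where(y == 1, p_y1, 1.0 - p_y1)
    weights = np.zeros(len(y))
    for y_val in (0, 1):
        for f_val in (0, 1):
            cell = (y == y_val) & (f == f_val)
            if cell.any():
                # Uniform within the cell, so X | Y, A is preserved.
                weights[cell] = target_joint[cell] / cell.sum()
    return weights / weights.sum()

rng = np.random.default_rng(0)
y_all = rng.binomial(1, 0.4, size=5000)        # toy labels
female_all = rng.binomial(1, 0.45, size=5000)  # toy sex indicator

w = shift_weights(y_all, female_all, p_u1=0.1)
idx = rng.choice(len(y_all), size=20000, replace=True, p=w)
y_s, f_s = y_all[idx], female_all[idx]
print(y_s[f_s == 1].mean(), y_s[f_s == 0].mean(), f_s.mean())
```

In the resampled data the prevalence of $Y=1$ is close to $0.1$ among female patients and $0.9$ among male patients, with roughly equal numbers of each.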

We perform both concept adaptation and multi-domain adaptation experiments with the MIMIC-CXR data. For the concept adaptation experiment, we perform weighted sampling with replacement of 1,000 examples from each of the training, validation, and testing partitions defined previously, separately for each domain. We fix the source domain to the case where $P(U=1)=0.1$ and then adapt to each of the nine target domains. For the multi-domain adaptation experiment, we randomly sample 500 examples per domain and partition from the sets of 1,000 examples defined for the concept experiment. For this experiment, we consider a case where two source domains corresponding to $P(U=1)=0.1$ and $P(U=1)=0.2$ are available. To match the size of the aggregate source domain data with the size of the target domain, we sample 250 examples per partition for each source domain. We repeat the sampling procedure five times and report the mean $\pm$ standard deviation of performance metrics over the five replicates.

For both experiments, we perform two-fold cross-validation for the kernel length-scale parameters using data from the source domain(s). Here, we compare to ridge logistic regression models fit in the source and target domains, with the ridge penalty selected by five-fold cross-validation. We use LR-Target to refer to logistic regression models fit in a target domain, LR-SOURCE to refer to models fit in a source domain, and Cat-LR to refer to logistic regression models fit with concatenated data from the multiple source domains. We use Bridge-SOURCE to refer to the kernel estimator that leverages the bridge function ($h_0$ or $m_0$ for the concept and multi-domain adaptation settings, respectively) and conditional mean embedding ($\mu_{WC\mid x}$ or $\mu_{W\mid z,x}$) fit on the source domain data. Bridge-TARGET refers to the kernel estimator where both the bridge function and conditional mean embedding are fit on the target domain data.
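The logistic regression baselines can be sketched with scikit-learn's `LogisticRegressionCV`, which uses an L2 (ridge) penalty by default and selects its strength by cross-validation. The toy domains, the helper names, and the grid size `Cs=10` are assumptions made for illustration; the kernel bridge estimators are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)

def toy_domain(n, shift):
    """Hypothetical stand-in for one domain: features and binary labels."""
    X = rng.normal(size=(n, 8)) + shift
    logits = X[:, 0] - shift
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    return X, y

X_s1, y_s1 = toy_domain(250, 0.0)      # first source domain
X_s2, y_s2 = toy_domain(250, 0.5)      # second source domain
X_t, y_t = toy_domain(500, 1.0)        # target domain (training split)
X_t_test, y_t_test = toy_domain(500, 1.0)

def fit_ridge_lr(X, y):
    """L2-penalized logistic regression, penalty chosen by five-fold CV."""
    return LogisticRegressionCV(Cs=10, cv=5, max_iter=1000).fit(X, y)

lr_source = fit_ridge_lr(X_s1, y_s1)                          # LR-SOURCE
cat_lr = fit_ridge_lr(np.vstack([X_s1, X_s2]),
                      np.concatenate([y_s1, y_s2]))           # Cat-LR
lr_target = fit_ridge_lr(X_t, y_t)                            # LR-Target

scores = {
    "LR-SOURCE": lr_source.score(X_t_test, y_t_test),
    "Cat-LR": cat_lr.score(X_t_test, y_t_test),
    "LR-Target": lr_target.score(X_t_test, y_t_test),
}
print(scores)
```

All three models are evaluated on held-out target data, so LR-Target serves as a reference that has seen labeled target examples, while LR-SOURCE and Cat-LR have not.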

References