Mining Invariance from Nonlinear Multi-Environment Data: Binary Classification

Austin Goddard, Kang Du, Yu Xiang
Department of Electrical and Computer Engineering
University of Utah
{austin.goddard, kang.du, yu.xiang}@utah.edu
Abstract

Making predictions in an unseen environment given data from multiple training environments is a challenging task. We approach this problem from an invariance perspective, focusing on binary classification to shed light on general nonlinear data generation mechanisms. We identify a unique form of invariance that exists solely in a binary setting that allows us to train models invariant over environments. We provide sufficient conditions for such invariance and show it is robust even when environmental conditions vary greatly. Our formulation admits a causal interpretation, allowing us to compare it with various frameworks. Finally, we propose a heuristic prediction method and conduct experiments using real and synthetic datasets.

I Introduction

It is common practice to collect observations of a set of features $X=(X_{1},\dots,X_{m})$ and a response $Y$ from different environments to train a model. The prediction of the response in an unseen environment is often referred to as multi-environment domain adaptation, with practical applications in various fields (e.g., genetics [1] and healthcare [2]). A common assumption in such problems is the principle of invariance, modularity, or autonomy [3, 4, 5, 6, 7, 8]. This invariance assumption states that the conditional distribution of $Y$ given $X$ is invariant with respect to different environments.

The invariant causal prediction (ICP) framework [9], along with its various extensions [10, 11], employs the invariance principle to identify invariant predictors across environments. Following this framework, various domain adaptation approaches have been developed [12, 13, 14]. Specifically, the stabilized regression (SR) [14] approach relies on a weaker form of invariance that depends on expectation as opposed to probability. The common assumption for the approaches mentioned is that the assignment of $Y$ does not change over environments. In a causal sense, from which much of the literature in this area stems, a change in this assignment is referred to as an intervention on $Y$ [8]. When $Y$ is intervened on, the invariance principle, as well as the frameworks mentioned above, fails. In a series of recent works [15, 16], an alternative approach called the invariant matching property (IMP) has been developed to detect linear invariant models in a regression setting even when the assignment of $Y$ is altered over environments.

In this work, we extend the general principles developed in [15, 16] to the binary classification setting as an attempt to generalize to nonlinear settings. The proposed approach works even when data-generating models change over environments (e.g., $Y$ can be generated using a probit model in one environment and a logistic model in another). Additionally, the approach is not constrained by the data type, meaning it can be applied to continuous, discrete, or categorical variables.

II Problem Formulation

Consider the following setting. For different environmental conditions indexed by the set $\mathcal{E}$, we have a random vector $X=(X_{1},\dots,X_{m})$ and a binary random variable $Y$ whose elements form a joint distribution $\mathcal{P}_{e}:=\mathcal{P}_{e}^{X,Y}$ dependent on $e\in\mathcal{E}$. Denote $X$ and $Y$ as $X^{e}$ and $Y^{e}$ for a specific $e\in\mathcal{E}$, respectively. The supports of $X$ and $Y$ are $\mathcal{X}=\mathbb{R}^{m}$ and $\mathcal{Y}=\{0,1\}$, respectively. Let $X_{S}$ be a random vector containing the elements in $X$ indexed by the set $S\subseteq\{1,\dots,m\}$, and let $\mathcal{X}_{S}$ be its support. To simplify notation, let $X_{0}^{e}:=Y^{e}$.
For each $e\in\mathcal{E}$, we keep the distribution $\mathcal{P}_{e}$ general, with the exception that there exists an $X^{e}_{i}$ generated according to the form

$$X^{e}_{i}=g(X^{e}_{S_{i}})+\epsilon^{e},\quad\text{for some } i\in\{1,\dots,m\}, \tag{1}$$

where $X^{e}_{S_{i}}$, for $S_{i}\subseteq\{0,\dots,m\}\setminus\{i\}$, represents the variables that directly affect $X^{e}_{i}$, and $\epsilon^{e}$ is an independent, zero-mean noise variable. We assume the output of the function $g$ is not constant with respect to any of its inputs; $g$ is a constant function when $S_{i}=\varnothing$.

Additionally, while the function $g$ does not change over environments (i.e., does not depend on $e$), the distribution of $\epsilon^{e}$ can change arbitrarily as long as its mean remains zero. Aside from a binary $Y$ and the form of $X_{i}^{e}$ in (1), we make no assumptions on the distribution or functional form of any variable. As such, this formulation applies to any set of features, be it continuous, discrete, or a mixture of the two.
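The assumption in (1) can be illustrated with a small simulation. The sketch below is only an illustration under assumed choices: the function `g`, the two noise laws, and the conditioning values are hypothetical and are not taken from the paper. It checks that the conditional mean of $X_i$ at a fixed input equals $g$ in both environments, even though the noise distribution changes.

```python
import math
import random


def g(x1, y):
    """Hypothetical environment-independent mechanism with S_i = {0, 1}."""
    return math.tanh(x1) + y * x1 ** 2


# Two hypothetical environments: both noise laws have zero mean, but
# their shapes and variances differ (Gaussian vs. centered uniform).
NOISE = {
    "gaussian_env": lambda rng: rng.gauss(0.0, 1.0),
    "uniform_env": lambda rng: rng.uniform(-3.0, 3.0),
}


def conditional_mean(env, x1, y, n=100_000, seed=0):
    """Monte Carlo estimate of E[X_i | X_1 = x1, Y = y] in one environment."""
    rng = random.Random(seed)
    draw = NOISE[env]
    return sum(g(x1, y) + draw(rng) for _ in range(n)) / n


x1, y = 0.8, 1
means = {env: conditional_mean(env, x1, y) for env in NOISE}
print("g(x1, y):", g(x1, y))
print("estimated conditional means:", means)
```

Both estimates should sit close to `g(0.8, 1)`, which is the sense in which the conditional expectation of $X_i$ given its direct causes is invariant under (1).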

We assume only a subset of all environments is observed and denote this set by $\mathcal{E}_{\text{obs}}\subseteq\mathcal{E}$. With $\mathcal{E}_{\text{obs}}=\mathcal{E}_{\text{train}}\cup\{e^{\text{test}}\}$ and $Y^{\text{test}}:=Y^{e^{\text{test}}}$, our goal is to make predictions on $Y^{\text{test}}$, given a set of training environments $\mathcal{E}_{\text{train}}$. As such, we aim to find a function $\phi_{e}:\mathcal{X}\rightarrow\mathcal{W}$ such that the probability of $Y$ given $\phi_{e}(X)$ does not vary over any environment. Specifically, for all $w\in\mathcal{W}$ and $e,h\in\mathcal{E}_{\text{obs}}$,

$$\mathcal{P}_{e}(Y|\phi_{e}(X)=w)=\mathcal{P}_{h}(Y|\phi_{h}(X)=w). \tag{2}$$

As $Y$ is binary, it is equivalent to write (2) in the form $\mathsf{E}_{\mathcal{P}_{e}}[Y|\phi_{e}(X)=w]=\mathsf{E}_{\mathcal{P}_{h}}[Y|\phi_{h}(X)=w]$, for all $w\in\mathcal{W}$ and $e,h\in\mathcal{E}_{\text{obs}}$. It is well known that (2) is satisfied if $\phi_{e}(X)=X_{S_{Y}}$ and, for $S_{Y}\subseteq\{1,\dots,m\}$,

$$Y^{e}=f(X_{S_{Y}}^{e})+\epsilon_{Y}, \tag{3}$$

where $\epsilon_{Y}$ is an independent noise variable that does not vary over environments [9]. However, we are interested in a more general setting where the function $f$ and the distribution of the noise can vary over environments. From a causal perspective, this would indicate that $Y^{e}$ has been intervened on (see Section IV-A). In such a setting, $\phi_{e}(X)=X_{S_{Y}}$ is no longer useful and other approaches must be considered. We now consider one such alternative, starting with a motivating example.

III Motivating Example

Consider the following setting with $X^{e}=(X^{e}_{1},X^{e}_{2},X^{e}_{3})$. Let $X^{e}_{1}$ and $X^{e}_{2}$ be independent and follow $X^{e}_{1}\sim\mathcal{N}(\mu_{1}^{e},\sigma^{2}_{1})$ and $X^{e}_{2}\sim\mathcal{N}(\mu_{2}^{e},\sigma^{2}_{2})$.
The variable $Y^{e}$ is generated such that $Y^{e}|X^{e}_{1},X^{e}_{2}$ forms a probit model. Specifically,

$$Y^{e}=\begin{cases}1,&\text{if }\beta_{1}^{e}X^{e}_{1}+\beta_{2}X^{e}_{2}+\epsilon_{Y}>0,\\ 0,&\text{otherwise}.\end{cases}$$

Following a form similar to (1), $X^{e}_{3}$ is linear given $Y^{e}$ so that

$$X^{e}_{3}=\begin{cases}\gamma_{1}X^{e}_{1}+\epsilon_{3},&\text{if }Y^{e}=1,\\ \gamma_{0}X^{e}_{1}+\epsilon_{3},&\text{if }Y^{e}=0.\end{cases}$$

The noise variables $\epsilon_{Y}$ and $\epsilon_{3}$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$. Suppose we wish to predict $Y^{e}$ given only $X^{e}_{1}$. Predicting $Y^{e}$ for a particular $e\in\mathcal{E}$ becomes difficult as $\beta_{1}^{e}$ and $\mu^{e}_{2}$ vary with environment. Specifically,

$$\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{1}=x_{1}]=\Phi\left(\frac{\beta_{1}^{e}x_{1}+\beta_{2}\mu^{e}_{2}}{\sqrt{(\beta_{2}\sigma_{2})^{2}+\sigma^{2}}}\right), \tag{4}$$

where $\Phi$ is the cumulative distribution function of a standard normal random variable. As (4) varies over environments, it is not practical to use $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{1}]$ to estimate $Y^{e}$ in different environments. Even when conditioning on both $X^{e}_{1}$ and $X^{e}_{2}$ (the variables that directly affect $Y^{e}$), the variance (w.r.t. environment) still remains through $\beta_{1}^{e}$.
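As a sanity check on (4), the closed form can be compared against a Monte Carlo estimate. The following is a minimal sketch, assuming hypothetical parameter values for two environments that differ only in $\beta_{1}^{e}$ and $\mu_{2}^{e}$; none of the numbers or function names come from the paper.

```python
import math
import random


def std_normal_cdf(z):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_y_given_x1(x1, beta1, beta2, mu2, sigma2, sigma):
    """Closed form (4): E[Y | X1 = x1] with X2 and eps_Y marginalized out."""
    scale = math.sqrt((beta2 * sigma2) ** 2 + sigma ** 2)
    return std_normal_cdf((beta1 * x1 + beta2 * mu2) / scale)


def simulate_y_given_x1(x1, beta1, beta2, mu2, sigma2, sigma, n=200_000, seed=0):
    """Monte Carlo estimate of P(Y = 1 | X1 = x1) under the probit model."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x2 = rng.gauss(mu2, sigma2)      # X2 is independent of X1
        eps_y = rng.gauss(0.0, sigma)
        if beta1 * x1 + beta2 * x2 + eps_y > 0:
            hits += 1
    return hits / n


# Hypothetical environments: only beta1 and mu2 change.
ENVS = {"e1": {"beta1": 1.0, "mu2": 0.0}, "e2": {"beta1": -0.5, "mu2": 1.0}}
SHARED = {"beta2": 1.5, "sigma2": 1.0, "sigma": 1.0}
X1_VALUE = 0.7

closed = {k: prob_y_given_x1(X1_VALUE, **v, **SHARED) for k, v in ENVS.items()}
mc = {k: simulate_y_given_x1(X1_VALUE, **v, **SHARED) for k, v in ENVS.items()}
for name in ENVS:
    print(name, "closed form:", round(closed[name], 4), "simulated:", round(mc[name], 4))
```

Within each environment the two numbers should agree up to sampling error, while the values differ across environments, which is the difficulty the text describes.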

We can, however, decompose (4) into variant and invariant components such that $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{1}=x_{1}]$ becomes the following (see the proof of Proposition 1 for a general case),

$$\frac{\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1}]-\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1},Y=0]}{\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1},Y=1]-\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1},Y=0]}, \tag{5}$$

where $\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1}]$ is

$$\Phi\left(\frac{\beta_{1}^{e}x_{1}+\beta_{2}\mu^{e}_{2}}{\sqrt{(\beta_{2}\sigma_{2})^{2}+\sigma^{2}}}\right)(\gamma_{1}-\gamma_{0})x_{1}+\gamma_{0}x_{1}, \tag{6}$$

and $\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1},Y=y]$ is $\gamma_{1}x_{1}$ if $y=1$ and $\gamma_{0}x_{1}$ if $y=0$. We note that the variance (w.r.t. environment) contributed by $\beta_{1}^{e}$ and $\mu_{2}^{e}$ is completely accounted for in the term $\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1}]$ and that $\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1},Y]$ is invariant over environments.
Thus, (2) holds for the function $\phi_{e}(X)=(X_{1},\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1}])$. In addition, we note that conditioning on both $X_{1}$ and $X_{2}$ leads to a similar invariance; we only condition on $X_{1}$ in this example for simplicity.
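For this example, the link between the decomposition and the conditional mean of $X_{3}$ can be spelled out with the law of total expectation. The shorthand $p_{e}(x_{1})$ is introduced here only for readability and is not the paper's notation; the general argument is the one referenced in Proposition 1.

```latex
\begin{aligned}
p_{e}(x_{1}) &:= \mathsf{E}_{\mathcal{P}_{e}}[Y \mid X_{1}=x_{1}], \\
\mathsf{E}_{\mathcal{P}_{e}}[X_{3} \mid X_{1}=x_{1}]
  &= p_{e}(x_{1})\,\mathsf{E}_{\mathcal{P}_{e}}[X_{3} \mid X_{1}=x_{1},Y=1]
   + \bigl(1-p_{e}(x_{1})\bigr)\,\mathsf{E}_{\mathcal{P}_{e}}[X_{3} \mid X_{1}=x_{1},Y=0] \\
  &= p_{e}(x_{1})\,\gamma_{1}x_{1} + \bigl(1-p_{e}(x_{1})\bigr)\,\gamma_{0}x_{1} \\
  &= p_{e}(x_{1})\,(\gamma_{1}-\gamma_{0})\,x_{1} + \gamma_{0}x_{1}, \\
p_{e}(x_{1})
  &= \frac{\mathsf{E}_{\mathcal{P}_{e}}[X_{3} \mid X_{1}=x_{1}] - \gamma_{0}x_{1}}
          {\gamma_{1}x_{1} - \gamma_{0}x_{1}},
  \qquad (\gamma_{1}-\gamma_{0})\,x_{1} \neq 0.
\end{aligned}
```

The last line is well defined only when $(\gamma_{1}-\gamma_{0})x_{1}\neq 0$, that is, when $Y$ actually shifts the conditional mean of $X_{3}$ at the chosen $x_{1}$.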

This invariance does not hold if we replace $X^{e}_{3}$ with any other variable. For example, suppose we were to estimate $Y^{e}$, replacing $X^{e}_{3}$ with $X^{e}_{2}$. We can still decompose (4) similarly to (5) by replacing $X^{e}_{3}$ with $X^{e}_{2}$. As $\mathsf{E}_{\mathcal{P}_{e}}[X_{2}|X_{1}]=\mu^{e}_{2}$ does not contain $\beta_{1}^{e}$, the portion of $\mathsf{E}_{\mathcal{P}_{e}}[Y|X_{1}]$ that contains $\beta_{1}^{e}$ must reside in $\mathsf{E}_{\mathcal{P}_{e}}[X_{2}|X_{1},Y]$; that is, unlike $\mathsf{E}_{\mathcal{P}_{e}}[X_{3}|X_{1},Y]$, the term $\mathsf{E}_{\mathcal{P}_{e}}[X_{2}|X_{1},Y]$ is not invariant over environments. Thus, the function $\phi_{e}(X)=(X_{1},\mathsf{E}_{\mathcal{P}_{e}}[X_{2}|X_{1}])$ will no longer satisfy (2).

To further illustrate the difference in selecting $X_{3}^{e}$ over $X_{2}^{e}$, suppose we wish to estimate $Y$ in a new environment $e^{\text{test}}$. Since we have access to $X^{\text{test}}$, we can easily construct $\mathsf{E}_{\mathcal{P}_{\text{test}}}[X_{i}|X_{1}]$ for either $i\in\{2,3\}$. We cannot, however, use $Y^{\text{test}}$ to construct our estimate, and $\mathsf{E}_{\mathcal{P}_{\text{test}}}[X_{i}|X_{1},Y]$ must be obtained by leveraging invariances over environments. Thus, for either $i\in\{2,3\}$, we construct the estimate

Y^itest=:𝖤𝒫test[Xi|X1]𝖤𝒫e[Xi|X1,Y=0]𝖤𝒫e[Xi|X1,Y=1]𝖤𝒫e[Xi|X1,Y=0],\hat{Y}^{\text{test}}_{i}=:\frac{\operatorname{\sf E}_{\mathcal{P}_{\text{test% }}}[X_{i}|X_{1}]-\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{i}|X_{1},Y=0]}{% \operatorname{\sf E}_{\mathcal{P}_{e}}[X_{i}|X_{1},Y=1]-\operatorname{\sf E}_{% \mathcal{P}_{e}}[X_{i}|X_{1},Y=0]},over^ start_ARG italic_Y end_ARG start_POSTSUPERSCRIPT test end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = : divide start_ARG sansserif_E start_POSTSUBSCRIPT caligraphic_P start_POSTSUBSCRIPT test end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ] - sansserif_E start_POSTSUBSCRIPT caligraphic_P start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_Y = 0 ] end_ARG start_ARG sansserif_E start_POSTSUBSCRIPT caligraphic_P start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_Y = 1 ] - sansserif_E start_POSTSUBSCRIPT caligraphic_P start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_Y = 0 ] end_ARG , (7)

where etrain𝑒subscripttraine\in\mathcal{E}_{\text{train}}italic_e ∈ caligraphic_E start_POSTSUBSCRIPT train end_POSTSUBSCRIPT. As 𝖤𝒫e[X3|X1,Y]subscript𝖤subscript𝒫𝑒conditionalsubscript𝑋3subscript𝑋1𝑌\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1},Y]sansserif_E start_POSTSUBSCRIPT caligraphic_P start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT | italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_Y ] is invariant and 𝖤𝒫e[X2|X1,Y]subscript𝖤subscript𝒫𝑒conditionalsubscript𝑋2subscript𝑋1𝑌\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{2}|X_{1},Y]sansserif_E start_POSTSUBSCRIPT caligraphic_P start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_Y ] is not invariant as discussed above, Y^3testsubscriptsuperscript^𝑌test3\hat{Y}^{\text{test}}_{3}over^ start_ARG italic_Y end_ARG start_POSTSUPERSCRIPT test end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT will provide a good estimate of Ytestsuperscript𝑌testY^{\text{test}}italic_Y start_POSTSUPERSCRIPT test end_POSTSUPERSCRIPT, while Y^2testsubscriptsuperscript^𝑌test2\hat{Y}^{\text{test}}_{2}over^ start_ARG italic_Y end_ARG start_POSTSUPERSCRIPT test end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT will not.

In Fig. 1, we compare $\hat{Y}^{\text{test}}_3$ and $\hat{Y}^{\text{test}}_2$ by simulating $(x^{\text{test}}, y^{\text{test}})$ pairs for a set of specific parameters. The estimate $\hat{Y}^{\text{test}}_2$ does not fit the data, as many $x_1^{\text{test}}$ corresponding to $y^{\text{test}} = 0$ will be incorrectly classified as one. However, this is not the case when $\hat{Y}^{\text{test}}_3$ is used, and the fit is greatly improved (Fig. 1). The poor fit of $\hat{Y}^{\text{test}}_2$ is a result of $\mathsf{E}_{\mathcal{P}_e}[X_2 \mid X_1, Y]$ varying across environments.

Figure 1: Comparisons of $\hat{Y}^{\text{test}}_3$ (left) and $\hat{Y}^{\text{test}}_2$ (right), where $\beta_1^e = 2$, $\mu_2^e = 1$, $\beta_2^{\text{test}} = 0$, and $\mu_2^{\text{test}} = -1$.
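The behavior of the estimate in (7) can be illustrated numerically. The sketch below is not the simulation behind Fig. 1: the toy data-generating model, the function names (`simulate`, `binned_mean`, `bimp_estimate_demo`), and the use of binned averages as a crude stand-in for conditional expectations are all assumptions made for illustration. It keeps the mechanism of $X_3$ fixed across environments while the assignment of $Y$ changes sharply between training and test.

```python
import numpy as np

def simulate(n, beta, rng):
    """Toy environment (assumed, not from the paper): beta controls how Y is assigned."""
    x1 = rng.normal(size=n)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-beta * x1)))
    # Invariant mechanism: the same g(x1, y) = sin(x1) + 2*y in every environment.
    x3 = np.sin(x1) + 2.0 * y + 0.1 * rng.normal(size=n)
    return x1, y, x3

def binned_mean(x, v, edges):
    """Average v within each bin of x; a crude estimate of E[v | x]."""
    keep = (x >= edges[0]) & (x < edges[-1])
    idx = np.digitize(x[keep], edges) - 1
    sums = np.bincount(idx, weights=v[keep], minlength=len(edges) - 1)
    counts = np.bincount(idx, minlength=len(edges) - 1)
    return sums / np.maximum(counts, 1)

def bimp_estimate_demo(seed=0, n=20000):
    rng = np.random.default_rng(seed)
    edges = np.linspace(-1.5, 1.5, 13)  # 12 bins of width 0.25
    x1_tr, y_tr, x3_tr = simulate(n, beta=2.0, rng=rng)   # training environment
    x1_te, y_te, x3_te = simulate(n, beta=-1.0, rng=rng)  # test environment, Y reassigned
    # Training-side quantities: E[X3 | X1, Y=0] and E[X3 | X1, Y=1].
    h0 = binned_mean(x1_tr[y_tr == 0], x3_tr[y_tr == 0], edges)
    h1 = binned_mean(x1_tr[y_tr == 1], x3_tr[y_tr == 1], edges)
    # Test-side quantity: E[X3 | X1], which needs no test labels.
    m_test = binned_mean(x1_te, x3_te, edges)
    y_hat = (m_test - h0) / (h1 - h0)  # the ratio in (7), per bin of X1
    # Test labels are used only to evaluate the estimate.
    target = binned_mean(x1_te, y_te.astype(float), edges)
    return y_hat, target
```

Calling `bimp_estimate_demo()` returns the per-bin estimate alongside the empirical frequency of $Y^{\text{test}} = 1$ in each bin. In this toy setting the two should stay close even though the sign of the dependence of $Y$ on $X_1$ flips in the test environment.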

IV The Binary Invariant Matching Property

A deterministic relationship such as the one in (5) has been previously referred to as matching [15], and can be generalized to the formulation outlined in Section II.

Definition 1.

For $k \in \{1,\dots,m\}$, $S \subseteq \{1,\dots,m\} \setminus k$, and $h(X_S, Y) := \mathsf{E}_{\mathcal{P}_e}[X_k \mid X_S, Y]$, the pair $(k,S)$ satisfies the binary invariant matching property (bIMP)¹ if

$$\mathsf{E}_{\mathcal{P}_e}[Y \mid X_S] = \frac{\mathsf{E}_{\mathcal{P}_e}[X_k \mid X_S] - h(X_S, 0)}{h(X_S, 1) - h(X_S, 0)} \tag{8}$$

holds for all $e \in \mathcal{E}_{\text{obs}}$, where $h(X_S, Y)$ does not depend on $e$.

¹ There are degenerate cases when $h(X_S, 0) = h(X_S, 1)$, for which the tower property implies $\mathsf{E}_{\mathcal{P}_e}[X_k \mid X_S] = \mathsf{E}_{\mathcal{P}_e}[h(X_S, Y) \mid X_S] = h(X_S, 0)$, and the ratio in (8) reduces to $0$ divided by $0$.
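To see where the ratio in (8) comes from, the following short derivation may help. It is a sketch that uses only the tower property and the fact that $Y$ is binary; the shorthand $p_e(X_S)$ is introduced here for convenience and does not appear elsewhere in the paper.

```latex
\begin{align*}
\mathsf{E}_{\mathcal{P}_e}[X_k \mid X_S]
  &= \mathsf{E}_{\mathcal{P}_e}\big[\,\mathsf{E}_{\mathcal{P}_e}[X_k \mid X_S, Y] \,\big|\, X_S\big] \\
  &= h(X_S, 1)\, p_e(X_S) + h(X_S, 0)\,\big(1 - p_e(X_S)\big),
  \qquad p_e(X_S) := \mathsf{E}_{\mathcal{P}_e}[Y \mid X_S], \\
\Longrightarrow \quad
p_e(X_S)
  &= \frac{\mathsf{E}_{\mathcal{P}_e}[X_k \mid X_S] - h(X_S, 0)}{h(X_S, 1) - h(X_S, 0)}
  \qquad \text{whenever } h(X_S, 1) \neq h(X_S, 0).
\end{align*}
```

Read this way, the algebraic identity is available within any single environment. The substantive requirement in Definition 1 is that the same $h$ serves every $e \in \mathcal{E}_{\text{obs}}$.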

As seen in the example, there are a variety of choices for $k$ and $S$, not all of which lead to invariant representations. We now detail sufficient conditions under which a pair $(k,S)$ satisfies the bIMP (see the Appendix for the proof).

Proposition 1.

Let $k \in \{1,\ldots,m\}$ and $S = R \cup Q$, where $R, Q \subseteq \{1,\ldots,m\} \setminus k$ and $R \cap Q = \varnothing$. The pair $(k,S)$ satisfies the bIMP if, for every $e \in \mathcal{E}_{\text{obs}}$,

  1. $X^e_k = g(X^e_R, Y^e) + \epsilon^e$ as in (1),

  2. $X^e_Q \perp\!\!\!\perp X^e_k \mid (X^e_R, Y^e)$.

What remains is to show that the bIMP can be used to satisfy the invariance principle in (2), and thus, can be beneficial in predicting on unknown environments, as shown below.

Theorem 1.

Let $k \in \{1,\ldots,m\}$ and $S = R \cup Q$, where $R, Q \subseteq \{1,\ldots,m\} \setminus k$ and $R \cap Q = \varnothing$. When $\phi_e(X) = (X_R, X_Q, \mathsf{E}_{\mathcal{P}_e}[X_k \mid X_R, X_Q])$, (2) holds if the pair $(k,S)$ satisfies the bIMP.

Proof.

Let $\ell^e(X_R, X_Q) := \mathsf{E}_{\mathcal{P}_e}[X_k \mid X_R, X_Q]$ and $\phi_e(X) = (X_R, X_Q, \ell^e(X_R, X_Q))$. Since $(k,S)$ satisfies the bIMP and $\ell^e(X_R, X_Q)$ is a function of $X_R$ and $X_Q$,

$$\begin{aligned}
\mathsf{E}_{\mathcal{P}_e}[Y \mid \phi_e(X) = (x_R, x_Q, z)]
&= \mathsf{E}_{\mathcal{P}_e}[Y \mid X_R = x_R, X_Q = x_Q, \ell^e(X_R, X_Q) = z] \\
&= \frac{z - g(x_R, 0)}{g(x_R, 1) - g(x_R, 0)}.
\end{aligned} \tag{9}$$

Thus, (2) holds as (9) does not vary over $e \in \mathcal{E}_{\text{obs}}$. ∎
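The last equality in the proof can be unpacked as follows. This is a sketch under the conditions of Proposition 1, together with the additional assumption (made here for illustration) that $\mathsf{E}_{\mathcal{P}_e}[\epsilon^e \mid X_R, Y] = 0$. Under these, $h(x_S, y)$ in (8) coincides with $g(x_R, y)$.

```latex
\begin{align*}
h(x_S, y)
  &= \mathsf{E}_{\mathcal{P}_e}[X_k \mid X_R = x_R,\, X_Q = x_Q,\, Y = y] \\
  &= \mathsf{E}_{\mathcal{P}_e}[X_k \mid X_R = x_R,\, Y = y]
     && \text{(condition 2 of Proposition 1)} \\
  &= g(x_R, y) + \mathsf{E}_{\mathcal{P}_e}[\epsilon^e \mid X_R = x_R,\, Y = y]
   = g(x_R, y)
     && \text{(condition 1 and zero conditional mean)}, \\
\mathsf{E}_{\mathcal{P}_e}[Y \mid X_S = x_S]
  &= \frac{\ell^e(x_R, x_Q) - g(x_R, 0)}{g(x_R, 1) - g(x_R, 0)}
     && \text{(by (8))}.
\end{align*}
```

Because $\ell^e(X_R, X_Q)$ is a function of $X_S$, conditioning on $\phi_e(X) = (x_R, x_Q, z)$ adds no information beyond $X_S = x_S$ and simply fixes the numerator term to $z$.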

Remark 1.

In this work, we focus specifically on settings where $Y^e$ is binary. However, there does exist a corresponding matching property, with sufficient conditions similar to those in Proposition 1, for cases when $Y^e$ is multi-class or continuous. We leave the analysis for the long version of this work.

IV-A A Causal Perspective

While the sufficient conditions in Theorem 1 may seem abstract, we now show that, in fact, they have a specific meaning in a causal sense. To do so, we introduce the structural causal model (SCM) [8]. Here, $X^e$ and $Y^e$ are part of an SCM $\mathcal{S}^e$ that varies over environments such that

$$\mathcal{S}^e : \begin{cases}
Y^e := f_Y^e(X^e_{PA(Y^e)},\ \epsilon^e_Y), \\
X_1^e := f_1^e(X^e_{PA(X_1^e)},\ \epsilon^e_1), \\
\qquad \vdots \\
X_m^e := f_m^e(X^e_{PA(X_m^e)},\ \epsilon^e_m),
\end{cases} \tag{10}$$

where $\epsilon_1^e, \dots, \epsilon_m^e, \epsilon_Y^e$ are independent noise variables. To simplify notation, let $X_0^e := Y^e$. Thus, $PA(X_i^e) \subseteq \{0,\dots,m\}$ denotes the index set of the direct causal parents of $X_i^e$ for all $i \in \{0,\dots,m\}$.

As in Section II, $Y^e$ is binary. Additionally, at least one structural assignment (i.e., $f_i^e(\cdot)$) in $\mathcal{S}^e$ is an additive noise function that does not vary over environments. Specifically, for some $i \in \{0,\dots,m\}$, let $f_i^e(X^e_{PA(X_i^e)}, \epsilon^e_i) = g(X^e_{PA(X_i^e)}) + \epsilon^e_i$, where $\epsilon^e_i$ has zero mean. An intervention on a variable from $\{X_1^e, \dots, X_m^e, Y^e\}$ occurs if its structural assignment changes for some $e \in \mathcal{E}$. Relating the SCM to the formulation in Section II gives insight into the types of interventions that may occur. While many methods [9, 14, 15] make various assumptions on the types of interventions (e.g., shifts in the mean or variance), the setting in (10) allows for very general interventions, including general interventions on $Y^e$, which many other approaches do not allow.

Given $\mathcal{S}^e$ for all $e \in \mathcal{E}_{\text{obs}}$, we can express the conditions of Proposition 1 in the language of SCMs, as detailed below.

Corollary 1.

Let $k \in \{1,\ldots,m\}$ and $S = R \cup Q$, where $R, Q \subseteq \{1,\ldots,m\} \setminus k$ and $R \cap Q = \varnothing$. For the SCM $\mathcal{S}^e$, the pair $(k,S)$ satisfies the bIMP for all $e \in \mathcal{E}_{\text{obs}}$ if the following conditions hold.

  1. $X_k^e = g(X^e_{PA(X_k^e)}) + \epsilon^e_k$,

  2. $X_R^e$ and $Y^e$ constitute the parents of $X_k^e$,

  3. The variables in $X_Q^e$ can be any non-descendants of $X_k^e$.

The first condition in Proposition 1 is analogous to the first and second conditions above, as the parents of $X_k^e$ are $(X_R^e, Y^e)$. Additionally, in an SCM, any variable conditioned on its parents is independent of its non-descendants. As such, the set $X_Q^e$ can consist of any non-descendants of $X_k^e$, bridging the final conditions in Proposition 1 and Corollary 1.

In many cases, the set $Q$ can be quite inclusive despite what may seem like a strong independence condition in Proposition 1. In Corollary 1, we learn that, in a causal sense, $X_Q^e$ can be any non-descendant of $X_k^e$. For example, if half of the predictors in an SCM are ancestors of $Y^e$, while the other half are descendants, then the set $Q$ indexes at least half of all predictors (and potentially many more).
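When the causal graph is known, the sets in Corollary 1 can be read off mechanically. The sketch below assumes the graph is supplied as a dictionary mapping each node to its set of parents; the function names (`descendants`, `candidate_sets`) and the example graph are hypothetical and serve only to illustrate conditions 2 and 3.

```python
def descendants(parents, node):
    """Return all descendants of `node` in a DAG given as {child: set_of_parents}."""
    children = {}
    for child, pars in parents.items():
        for p in pars:
            children.setdefault(p, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def candidate_sets(parents, k, target="Y"):
    """Return (R, Q) as suggested by Corollary 1, or None if Y is not a parent of k."""
    if target not in parents.get(k, set()):
        return None
    # Condition 2: R indexes the parents of X_k other than Y.
    R = set(parents[k]) - {target}
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    # Condition 3: Q may contain any remaining predictor that is not a descendant of X_k.
    Q = nodes - descendants(parents, k) - {k, target} - R
    return R, Q

# Hypothetical graph: X1 -> Y, X2 -> Y, X1 -> X3, Y -> X3, X3 -> X4.
example_graph = {
    "X1": set(),
    "X2": set(),
    "Y": {"X1", "X2"},
    "X3": {"X1", "Y"},
    "X4": {"X3"},
}
```

For this graph, `candidate_sets(example_graph, "X3")` gives `R = {"X1"}` and `Q = {"X2"}`, while `X4` is excluded from `Q` because it is a descendant of `X3`. In practice the graph is rarely known, which is why the method in Section V searches over pairs $(k,S)$ instead.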

V Proposed Method

For each $e \in \mathcal{E}_{\text{train}}$, we have $n_e$ samples, represented as a matrix $\bm{X}^e \in \mathbb{R}^{n_e \times m}$ and a vector $\bm{Y}^e \in \{0,1\}^{n_e}$ (see [17] for a discussion on the impact of different environments). Additionally, we have $n_{\text{test}}$ samples in the test environment, and we denote by $\bm{X}^{\text{test}} \in \mathbb{R}^{n_{\text{test}} \times m}$ and $\bm{Y}^{\text{test}}$ the predictor matrix and target vector for the environment $e^{\text{test}}$, respectively. We denote by $\bm{X}$ the pooled predictor matrix over all $e \in \mathcal{E}_{\text{train}}$, and by $\bm{X}_{Y=y}$ the matrix comprising the rows of $\bm{X}$ in which $Y = y$, for $y \in \{0,1\}$. Let $\bm{X}^{-e}$ be the matrix of pooled samples from all training environments other than $e \in \mathcal{E}_{\text{train}}$.
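The notation above maps directly onto array operations. The following sketch assumes the training data arrive as a dictionary from environment name to an `(X_e, Y_e)` pair of NumPy arrays; that container and the helper names (`pool`, `rows_with_label`, `leave_out`) are assumptions for illustration. The reading of $\bm{X}^{-e}$ as "pooled rows from every training environment except $e$" is our interpretation of the definition.

```python
import numpy as np

def pool(envs):
    """Stack per-environment data; return pooled X, pooled Y, and a per-row environment label."""
    X = np.vstack([X_e for X_e, _ in envs.values()])
    Y = np.concatenate([Y_e for _, Y_e in envs.values()])
    labels = np.concatenate([np.full(len(Y_e), name) for name, (_, Y_e) in envs.items()])
    return X, Y, labels

def rows_with_label(X, Y, y):
    """X_{Y=y}: the rows of the pooled matrix whose response equals y."""
    return X[Y == y]

def leave_out(X, labels, e):
    """X^{-e}: the pooled rows that do not come from environment e."""
    return X[labels != e]

# Tiny hypothetical example with two training environments and m = 2 predictors.
toy_envs = {
    "e1": (np.arange(6.0).reshape(3, 2), np.array([0, 1, 1])),
    "e2": (np.arange(8.0).reshape(4, 2), np.array([1, 0, 0, 1])),
}
X_pooled, Y_pooled, env_labels = pool(toy_envs)
```

Keeping a label vector alongside the pooled matrix means each of the subsets used later can be obtained by a boolean mask rather than by storing separate copies.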

We now leverage insights gained from Theorem 1 and the bIMP to develop a practical method for estimation in unknown environments. At test time, we do not have access to $Y^{\text{test}}$. As such, one cannot say with definitive assurance that (2) holds for all $e \in \mathcal{E}_{\text{obs}}$. Thus, the best that can be done in such settings is to identify a $\phi_e$ such that (2) holds for all $e \in \mathcal{E}_{\text{train}}$, implying that $\mathcal{E}_{\text{train}}$ must have at least two environments.

Thus, our goal in a practical setting is to identify (k,S)𝑘𝑆(k,S)( italic_k , italic_S ) pairs that may satisfy the bIMP overall etrain𝑒subscripttraine\in\mathcal{E}_{\text{train}}italic_e ∈ caligraphic_E start_POSTSUBSCRIPT train end_POSTSUBSCRIPT. Simply put, we test whether 𝖤𝒫e[Xk|XS,Y]subscript𝖤subscript𝒫𝑒conditionalsubscript𝑋𝑘subscript𝑋𝑆𝑌\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]sansserif_E start_POSTSUBSCRIPT caligraphic_P start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | italic_X start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT , italic_Y ] is invariant. To do so, we consider a special form of the model in (1) where Xke=g(XSe,Ye)+ϵesuperscriptsubscript𝑋𝑘𝑒𝑔subscriptsuperscript𝑋𝑒𝑆superscript𝑌𝑒superscriptitalic-ϵ𝑒X_{k}^{e}=g(X^{e}_{S},Y^{e})+\epsilon^{e}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT = italic_g ( italic_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT , italic_Y start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT ) + italic_ϵ start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT with ϵe𝒩(0,(σe)2)similar-tosuperscriptitalic-ϵ𝑒𝒩0superscriptsuperscript𝜎𝑒2\epsilon^{e}\sim\mathcal{N}(0,(\sigma^{e})^{2})italic_ϵ start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT ∼ caligraphic_N ( 0 , ( italic_σ start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) is assigned a different nonlinear additive noise function for each value of Yesuperscript𝑌𝑒Y^{e}italic_Y start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT. Specifically,

$$g(X_{S}^{e},Y^{e})=\begin{cases}g_{1}(X_{S}^{e}),&\text{if $Y^{e}=1$},\\ g_{0}(X_{S}^{e}),&\text{if $Y^{e}=0$}.\end{cases}\tag{11}$$

As $X_{k}^{e}$ can be split into two models, one for each value of $Y^{e}$, we can perform an invariance test on each model. If both are found to be invariant, we can consider $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ as a whole to be invariant. Invariance tests on additive noise models have been widely studied: various tests have been proposed for linear [9] and nonlinear [10] models. We adopt one such approximate test from [10], known as the residual distribution test, for our setting, as further detailed in Algorithm 1.
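The split into two sub-models can be written out explicitly. The short derivation below is only a sketch: it assumes, as an illustration not stated in this section, that the noise satisfies $\operatorname{\sf E}[\epsilon^{e}|X_{S}^{e},Y^{e}]=0$.

```latex
\begin{align*}
g(X_{S}^{e},Y^{e}) &= Y^{e}\,g_{1}(X_{S}^{e}) + (1-Y^{e})\,g_{0}(X_{S}^{e}), \\
\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}\mid X_{S},Y=i] &= g_{i}(X_{S}), \qquad i\in\{0,1\}.
\end{align*}
```

Under this reading, invariance of $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ amounts to the same pair $(g_{1},g_{0})$ describing every training environment, which is what the two per-class tests are meant to check.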

Algorithm 1 Binary Invariant Residual Distribution Test

Input: $\bm{Y}^{e}$ and $\bm{X}^{e}$, for each $e\in\mathcal{E}_{\text{train}}$, significance level $\alpha$, and the pair $(k,S)$

Output: accepted or rejected

Regress $\bm{X}_{k,Y=i}$ on $\bm{X}_{S,Y=i}$ to get $\hat{g}_{i}$, for $i\in\{0,1\}$
for each $e\in\mathcal{E}_{\text{train}}$ and $i\in\{0,1\}$ do
     $\bm{R}^{e}_{i}=\bm{X}^{e}_{k,Y=i}-\hat{g}_{i}(\bm{X}^{e}_{S,Y=i})$
     $\bm{R}^{-e}_{i}=\bm{X}^{-e}_{k,Y=i}-\hat{g}_{i}(\bm{X}^{-e}_{S,Y=i})$
     $\text{pval}_{i}^{e}=t\text{-test}(\bm{R}^{e}_{i},\bm{R}^{-e}_{i})$
Combine p-values in $\text{pval}_{1}^{e}$ and $\text{pval}_{0}^{e}$ via Bonferroni correction
if $\min_{e\in\mathcal{E}_{\text{train}}}\text{pval}_{1}^{e}>\alpha$ and $\min_{e\in\mathcal{E}_{\text{train}}}\text{pval}_{0}^{e}>\alpha$ then
     return accepted
else return rejected
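The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fit_ols`, `mean_shift_pvalue`, and `binary_invariance_test` are hypothetical, the regression is plain OLS, indices are 0-based, the $t$-test is replaced by a Welch statistic with a normal approximation (reasonable only for large samples), and the Bonferroni step is read as scaling each p-value by the number of training environments.

```python
import math
from statistics import NormalDist, mean, variance


def fit_ols(rows, targets):
    """Least-squares fit with intercept via the normal equations (Gauss-Jordan)."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    # Augmented normal-equation matrix [A^T A | A^T y].
    aug = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           + [sum(d[i] * t for d, t in zip(design, targets))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


def predict(coef, row):
    return coef[0] + sum(c * x for c, x in zip(coef[1:], row))


def mean_shift_pvalue(a, b):
    """Two-sided p-value for equal means (Welch statistic, normal approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def binary_invariance_test(envs, k, S, alpha=0.1):
    """envs maps an environment name to (X, Y); X holds feature rows, Y holds 0/1 labels.

    Returns True ("accepted") or False ("rejected") for the pair (k, S).
    """
    n_env = len(envs)
    smallest = {}
    for i in (0, 1):
        # Pool all training environments and fit g_i on the samples with Y = i.
        pooled = [(row, e) for e, (X, Y) in envs.items()
                  for row, y in zip(X, Y) if y == i]
        coef = fit_ols([[row[j] for j in S] for row, _ in pooled],
                       [row[k] for row, _ in pooled])
        resid = [(row[k] - predict(coef, [row[j] for j in S]), e)
                 for row, e in pooled]
        pvals = []
        for e in envs:
            r_in = [r for r, env in resid if env == e]    # residuals in e
            r_out = [r for r, env in resid if env != e]   # residuals in all other environments
            # Assumed Bonferroni reading: scale by the number of environments.
            pvals.append(min(1.0, n_env * mean_shift_pvalue(r_in, r_out)))
        smallest[i] = min(pvals)
    return smallest[0] > alpha and smallest[1] > alpha
```

A call such as `binary_invariance_test(envs, k=0, S=[1])` then tests whether column 0 admits invariant per-class regressions on column 1. Note that this sketch only compares residual means; a fuller residual distribution test might also compare variances.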

We use Algorithm 1 as an approximate test for whether $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ is invariant over environments. We now employ this test to develop a practical method for estimating $Y^{\text{test}}$, which we refer to as bIMP. We adopt a similar approach to that of [14] and [15] in which we test the invariance of $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ for all possible pairs $(k,S)$. We then train models using the $X_{k}^{e}$ and $X^{e}_{S}$ which are accepted according to Algorithm 1.
Our bIMP models are a combination of two separate models trained to estimate both $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ and $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]$. Given both of these estimates, we compute an estimate of $Y^{\text{test}}$ using (8). As it is likely that more than one pair is accepted, the final estimate of $Y^{\text{test}}$ is the average estimate over all accepted pairs.

While we can guarantee invariance via the bIMP, there is no guarantee that the estimation will predict well on $e^{\text{test}}$. As such, in addition to filtering pairs based on invariance, bIMP also filters based on a prediction score. Invariant pairs in $\mathcal{T}_{\text{inv}}$, with predictions computed using (8), are filtered using the mean squared prediction error. The threshold by which the pairs are filtered follows the procedure proposed in [14].

Algorithm 2 bIMP

Input: $\bm{Y}^{e}$, for each $e\in\mathcal{E}_{\text{train}}$, and $\bm{X}^{e}$, for each $e\in\mathcal{E}_{\text{obs}}$

Output: Estimate $\bm{\hat{Y}}^{\text{test}}$

Identify the set of all invariant pairs $\mathcal{T}_{\text{inv}}$ using Algorithm 1
Filter pairs from $\mathcal{T}_{\text{inv}}$ based on prediction score
for each $(k,S)$ in $\mathcal{T}_{\text{inv}}$ do
     Estimate $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ by regressing $\bm{X}_{k}$ on $(\bm{X}_{S},\bm{Y})$
     Estimate $\operatorname{\sf E}_{\mathcal{P}_{\text{test}}}[X_{k}|X_{S}]$ by regressing $\bm{X}_{k}^{\text{test}}$ on $\bm{X}^{\text{test}}_{S}$
     Using (8), compute $\bm{\hat{Y}}_{k,S}^{\text{test}}$ for the pair $(k,S)$
$\bm{\hat{Y}}^{\text{test}}=\frac{1}{|\mathcal{T}_{\text{inv}}|}\sum_{(k,S)\in\mathcal{T}_{\text{inv}}}\bm{\hat{Y}}_{k,S}^{\text{test}}$
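The last two steps of Algorithm 2 can be illustrated with a small sketch. It assumes that (8) is the rearranged form of the identity derived in the proof of Proposition 1, namely $\operatorname{\sf E}[Y|X_{S}=x]=(m(x)-g_{0}(x))/(g_{1}(x)-g_{0}(x))$, where $m(x)$ estimates $\operatorname{\sf E}_{\mathcal{P}_{\text{test}}}[X_{k}|X_{S}=x]$ and $g_{i}(x)$ estimates $\operatorname{\sf E}[X_{k}|X_{S}=x,Y=i]$. The function names, the clipping to $[0,1]$, the fallback value when $g_{1}(x)\approx g_{0}(x)$, and the 0.5 decision threshold are all assumptions made here for illustration.

```python
def estimate_label_prob(m_hat, g1_hat, g0_hat, eps=1e-8):
    """Solve m = p * g1 + (1 - p) * g0 for p, clipped to [0, 1]."""
    gap = g1_hat - g0_hat
    if abs(gap) < eps:
        # The relation carries no information when both class means coincide
        # (assumed fallback, not taken from the paper).
        return 0.5
    return min(1.0, max(0.0, (m_hat - g0_hat) / gap))


def combine_pairs(per_pair_probs, threshold=0.5):
    """Average the per-pair estimates for each test sample, then threshold.

    per_pair_probs holds one list per accepted (k, S) pair, each with one
    estimate per test sample. Returns (labels, averaged_estimates).
    """
    n_pairs = len(per_pair_probs)
    avg = [sum(col) / n_pairs for col in zip(*per_pair_probs)]
    return [1 if p > threshold else 0 for p in avg], avg
```

For example, `combine_pairs([[0.2, 0.9], [0.4, 0.7]])` averages two pairs over two test samples before thresholding. Pairs for which $g_{1}$ and $g_{0}$ nearly coincide contribute little information, which is one reason a prediction-score filter can help.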

The bIMP method proposed gives freedom to the user to select the underlying models with which to estimate $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]$ and $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$. In the case of $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]$, we have complete freedom to select whichever model suits the data, be it linear or nonlinear. For $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$, we are restricted by the additive noise of (1). In addition, we have chosen to model $X_{k}$ using two sub-models, one for each value of $Y$ as in (11). This, however, is not the only option and depends on the invariance test used.
When estimating each model, ordinary least squares (OLS) could be used for linear models, and a generalized additive model (GAM) or Gaussian process regression could be used for nonlinear models. In practice, we found estimating each model using OLS to be the most efficient, as fitting two nonlinear models for all possible $(k,S)$ pairs can be computationally expensive.
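To give a sense of how quickly the search space grows, the sketch below enumerates candidate $(k,S)$ pairs. It assumes, for illustration only, that $S$ ranges over the non-empty subsets of the predictors other than $X_{k}$; the function name `candidate_pairs` and the 0-based indexing are hypothetical.

```python
from itertools import combinations


def candidate_pairs(m):
    """Yield (k, S) with k in 0..m-1 and S a non-empty subset of the other indices."""
    for k in range(m):
        others = [j for j in range(m) if j != k]
        for size in range(1, len(others) + 1):
            for S in combinations(others, size):
                yield k, S


# Number of candidate pairs for m = 3, ..., 7 predictors.
for m in range(3, 8):
    print(m, sum(1 for _ in candidate_pairs(m)))
```

Under this assumption the count is $m(2^{m-1}-1)$, that is, 9 pairs for $m=3$ and 441 pairs for $m=7$, and each pair requires fitting both sub-models $g_{1}$ and $g_{0}$.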

Remark 2.

There are several challenges with this approach that we leave for future work. We observe that nonlinear implementations of the invariance test (Algorithm 1) may lead to erroneously accepted invariant pairs. In addition to this, the complexity of training a nonlinear model for all possible $(k,S)$ pairs can be high. Finally, the effects of model misspecification can be challenging to analyze.

VI Experiments

We provide one synthetic and two real datasets to test the effectiveness of bIMP and compare with the following two baselines: (1) a binary adaptation of Method II from [9] (ICP), and (2) logistic regression (LR). While we do not expect LR to perform well on unknown environments, it serves as a natural baseline. While ICP can handle the binary response setting via logistic regression, SR is specific to regression settings and thus not reported. In all experiments, we set $\alpha=0.1$.

Figure 2: Simulation accuracy over $1000$ simulated datasets.

As there is some degree of freedom in selecting how the sub-models in bIMP are trained, we explore two variants of bIMP: bIMP (linear) and bIMP (GAM). For both variants, we follow the invariance test in Algorithm 1 and estimate $g_{1}$ and $g_{0}$ using OLS. We estimate $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]$ using OLS for bIMP (linear), and a GAM for bIMP (GAM).

Synthetic data. The simulated dataset is generated as follows. We generate data from three environments, $e^{1},e^{2}\in\mathcal{E}_{\text{train}}$, and $e^{\text{test}}$. The number of predictors $m$ is randomly selected from $\{3,\dots,7\}$. For each $i\in\{2,\dots,m\}$ and $e\in\mathcal{E}_{\text{train}}$, $X_{i}^{e}\sim\mathcal{N}(\mu^{e}_{i},1)$, and $\mu^{e}_{i}$ is randomly selected on the interval $[-2,0]$ for $e=e^{1}$, $[0,2]$ for $e=e^{2}$, and $[0,3]$ for $e=e^{\text{test}}$.
Then, where $S_{1}=\{2,\dots,m\}$, $Y^{e}|X_{S_{1}}^{e}$ follows a logistic model such that $\mathcal{P}_{e}(Y=1|X_{S_{1}})=1/(1+e^{-X_{S_{1}}\beta^{e}})$ for $e\in\mathcal{E}_{\text{train}}$.
For $e^{\text{test}}$, $Y^{\text{test}}|X_{S_{1}}^{\text{test}}$ follows a probit model such that $Y^{\text{test}}=1$, if $X_{S_{1}}^{\text{test}}\beta^{\text{test}}+\epsilon<0$, where $\epsilon\sim\mathcal{N}(0,1)$. For all $e\in\mathcal{E}_{\text{obs}}$, randomly select $\beta^{e}$ as $\beta^{e}\sim\mathrm{Unif}[0,1]$. The coefficients are then scaled such that they sum to one. For all $e\in\mathcal{E}_{\text{obs}}$, the variable $X_{1}^{e}$ is then simulated similarly to $X_{k}^{e}$ in (11).
Specifically, $g_{1}(X_{S_{1}}^{e})=X_{S_{1}}^{e}\eta_{1}$ and $g_{0}(X_{S_{1}}^{e})=X_{S_{1}}^{e}\eta_{0}$. The noise term associated with $X_{1}^{e}$ is a standard normal. The coefficients $\eta_{1,i}\sim\mathrm{Unif}[0,1]$ and $\eta_{0,i}\sim\mathrm{Unif}[0,1]$ do not vary over the environment. The number of samples per environment is fixed to $1000$.
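The generation recipe above can be sketched in plain Python as follows. This is an illustrative reading of the description rather than the authors' simulation code: the function names are hypothetical, each $\mu^{e}_{i}$ is assumed to be drawn independently and uniformly on its interval, and the probit rule is implemented exactly as stated ($Y^{\text{test}}=1$ when the noisy score is negative).

```python
import math
import random


def simulate_environment(n, m, mean_range, beta, eta1, eta0, link, rng):
    """Draw n samples (X_1, ..., X_m) and labels Y for one environment."""
    lo, hi = mean_range
    mus = [rng.uniform(lo, hi) for _ in range(m - 1)]      # means of X_2, ..., X_m
    X, Y = [], []
    for _ in range(n):
        xs = [rng.gauss(mu, 1.0) for mu in mus]             # X_{S_1} = (X_2, ..., X_m)
        score = sum(b * x for b, x in zip(beta, xs))
        if link == "logistic":
            y = 1 if rng.random() < 1.0 / (1.0 + math.exp(-score)) else 0
        else:
            # "probit": Y = 1 if score + eps < 0, as stated in the text.
            y = 1 if score + rng.gauss(0.0, 1.0) < 0 else 0
        eta = eta1 if y == 1 else eta0                      # g_1 or g_0 as in (11)
        x1 = sum(c * x for c, x in zip(eta, xs)) + rng.gauss(0.0, 1.0)
        X.append([x1] + xs)
        Y.append(y)
    return X, Y


def normalized_uniform(size, rng):
    """Unif[0, 1] coefficients rescaled to sum to one."""
    w = [rng.random() for _ in range(size)]
    total = sum(w)
    return [v / total for v in w]


def simulate_dataset(seed=0, n=1000):
    rng = random.Random(seed)
    m = rng.randint(3, 7)                                   # number of predictors
    eta1 = [rng.random() for _ in range(m - 1)]             # shared across environments
    eta0 = [rng.random() for _ in range(m - 1)]
    spec = {"e1": ((-2.0, 0.0), "logistic"),
            "e2": ((0.0, 2.0), "logistic"),
            "test": ((0.0, 3.0), "probit")}
    data = {}
    for name, (mean_range, link) in spec.items():
        beta = normalized_uniform(m - 1, rng)               # beta^e differs per environment
        data[name] = simulate_environment(n, m, mean_range, beta, eta1, eta0, link, rng)
    return m, data
```

Calling `simulate_dataset(seed=0)` returns the number of predictors and a dictionary with one `(X, Y)` pair per environment. Because $\beta^{e}$ and the link function change across environments while $\eta_{1}$ and $\eta_{0}$ stay fixed, the assignment of $Y$ varies but the per-class models of $X_{1}$ do not.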

Simulation results on both accuracy and mean squared error (MSE) indicate that bIMP can generalize to the test environment while LR and ICP cannot (Fig. 2). In addition, bIMP (linear) slightly outperforms bIMP (GAM). While we expect LR to behave poorly, ICP also performs poorly as all parents of $Y$ are intervened in every simulation.

| Environment | bIMP (linear) | bIMP (GAM) | LR |
|---|---|---|---|
| born in US | 85.0 | 84.9 | 78.2 |
| overtime | 68.4 | 59.1 | 77.0 |
| caucasian | 85.0 | 85.2 | 78.1 |

TABLE I: census: performance (accuracy) and training environments.

Two real-world datasets. We also include experiments on two real datasets: census [18] and mushroom [19]. The census dataset was gathered from the $1994$ US census and contains $14$ societal and demographic variables such as age, education, marital status, and working class. The target variable is whether or not an individual’s income exceeded $50$k/yr. The data is first split into test and training data by whether or not a person graduated from college. Thus, we train only on those who did not graduate from college, with the aim of extending our trained model to those who did. We further split the training data and run the methods on each set of training environments. The variables used to split the training data into environments are “was the person born in the US”, “do they regularly work more than $40$ hr/week”, and “does the person identify as Caucasian”. The experiment shows that bIMP outperforms LR and ICP in all environments aside from the overtime environment (Table I). The ICP method returns no invariant predictors for any environment, thus no predictions can be made and no accuracy is reported; this is also the case for the mushroom data below.

| Environment | bIMP (linear) | bIMP (GAM) | LR |
|---|---|---|---|
| meadows | 76.0 | 87.5 | 46.2 |
| paths | 88.1 | 90.9 | 11.8 |

TABLE II: mushroom: performance (accuracy) and training environments.

The mushroom dataset contains $16$ features related to naturally growing mushrooms’ size, shape, and color and showcases how the proposed approach can handle discrete and categorical data. We aim to predict whether or not a mushroom is edible based on these factors. The environments on which we predict are the habitats in which the mushrooms grow. Specifically, we train on mushrooms that grow in grass or urban habitats and test on mushrooms that grow in meadows or paths. Results in Table II indicate that bIMP outperforms ICP and LR for both the linear and GAM variants, with the GAM variant performing best.

VII Acknowledgements

We thank the anonymous reviewers for their helpful comments that improved the quality of this work.

Proof of Proposition 1.

First, we show that (8) holds for any $e\in\mathcal{E}_{\text{obs}}$. Without loss of generality, let $X^{e}_{i}$ be continuous for all $i\in\{1,\dots,m\}$. The pdf of $X_{k}^{e}|X_{S}^{e}$ for any $e\in\mathcal{E}_{\text{obs}}$ is

\begin{align}
f_{X_{k}^{e}|X_{S}^{e}}(x_{k}|x)
&=f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,1)\cdot p_{Y^{e}|X_{S}^{e}}(1|x)\nonumber\\
&\qquad+f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0)\cdot p_{Y^{e}|X_{S}^{e}}(0|x)\nonumber\\
&=f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,1)\cdot p_{Y^{e}|X_{S}^{e}}(1|x)\nonumber\\
&\qquad+f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0)\cdot\left[1-p_{Y^{e}|X_{S}^{e}}(1|x)\right]\nonumber\\
&=p_{Y^{e}|X_{S}^{e}}(1|x)\left[f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,1)-f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0)\right]\nonumber\\
&\qquad+f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0).\tag{12}
\end{align}

Then, using (12), we can write $\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S=x]$ as

\begin{align}
\int_{-\infty}^{\infty} x_k\cdot f_{X_k^e|X_S^e}(x_k|x)\,dx_k
&= \operatorname{\mathsf{E}}_{\mathcal{P}_e}[Y|X_S=x]\cdot\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S=x,Y=1] \notag\\
&\qquad - \operatorname{\mathsf{E}}_{\mathcal{P}_e}[Y|X_S=x]\cdot\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S=x,Y=0] \notag\\
&\qquad + \operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S=x,Y=0]. \tag{13}
\end{align}
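
The step from (12) to (13) uses two facts that are easy to pass over. Multiplying (12) by $x_k$ and integrating turns each conditional density into the matching conditional expectation, and, because the response is binary, the conditional probability of $Y=1$ coincides with the conditional mean of $Y$. A short sketch, assuming the conditional expectations involved are finite so that the integral can be split term by term:

\begin{align*}
\operatorname{\mathsf{E}}_{\mathcal{P}_e}[Y|X_S=x] &= 0\cdot p_{Y^e|X_S^e}(0|x) + 1\cdot p_{Y^e|X_S^e}(1|x) = p_{Y^e|X_S^e}(1|x),\\
\int_{-\infty}^{\infty} x_k\cdot f_{X_k^e|X_S^e,Y^e}(x_k|x,y)\,dx_k &= \operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S=x,Y=y], \qquad y\in\{0,1\}.
\end{align*}

The first line is the only place where the binary nature of $Y$ enters the derivation.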

Thus, $\operatorname{\mathsf{E}}_{\mathcal{P}_e}[Y|X_S]$ can be written as

\begin{equation}
\frac{\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S]-\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S,Y=0]}{\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S,Y=1]-\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S,Y=0]}. \tag{14}
\end{equation}

We now show that (I) $\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S=x,Y=y]$ does not depend on $e$ and (II) the denominator of (14) is non-zero. Since $X_S^e=(X_R^e,X_Q^e)$,

\begin{align}
\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S,Y]
&= \operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_R,X_Q,Y] \overset{(a)}{=} \operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_R,Y] \notag\\
&\overset{(b)}{=} \operatorname{\mathsf{E}}_{\mathcal{P}_e}[g(X_R,Y)+\epsilon|X_R,Y] \overset{(c)}{=} g(X_R^e,Y^e), \tag{15}
\end{align}

where $(a)$ follows since $X_Q^e \perp\!\!\!\perp X_k^e \mid X_R^e, Y^e$, $(b)$ follows from the assumption $X_k^e=g(X_R^e,Y^e)+\epsilon^e$, and $(c)$ follows since $\epsilon$ has zero mean.
Thus, $\operatorname{\mathsf{E}}_{\mathcal{P}_e}[X_k|X_S=(x_Q,x_R),Y=y]=g(x_R,y)$ does not depend on $e$. Since the output of the function $g$ is not constant with respect to any of its inputs, as in (1), the denominator of (14), which by (15) equals $g(x_R,1)-g(x_R,0)$, is non-zero. ∎
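
The argument above can be checked numerically. The following is a minimal sketch, not the authors' experimental code: the function `g`, the two environments `e1` and `e2`, and their probabilities are hypothetical choices made only for illustration. It assumes a single binary predictor $X_R$ (so $X_Q$ is empty and $S=R$), Gaussian zero-mean noise, and an assignment of $Y$ that changes across environments.

```python
import random
from statistics import fmean


def g(x_r, y):
    # Hypothetical structural function for X_k. It changes with y,
    # so the denominator of (14) is non-zero.
    return 1.0 + 2.0 * x_r + 3.0 * y + x_r * y


def simulate(p_y_given_xr, n, rng):
    """Draw one environment with P(Y=1 | X_R=r) = p_y_given_xr[r]."""
    rows = []
    for _ in range(n):
        x_r = rng.randrange(2)
        y = 1 if rng.random() < p_y_given_xr[x_r] else 0
        x_k = g(x_r, y) + rng.gauss(0.0, 1.0)  # zero-mean noise
        rows.append((x_r, y, x_k))
    return rows


def check_identity(rows, r):
    """Sample versions of E[Y|X_R=r], the ratio in (14), and both class means."""
    sel = [(y, x_k) for x_r, y, x_k in rows if x_r == r]
    m = fmean(x_k for _, x_k in sel)
    m0 = fmean(x_k for y, x_k in sel if y == 0)
    m1 = fmean(x_k for y, x_k in sel if y == 1)
    p1 = fmean(y for y, _ in sel)
    return p1, (m - m0) / (m1 - m0), m0, m1


rng = random.Random(0)
# The assignment of Y differs between the two (hypothetical) environments.
environments = {"e1": [0.2, 0.7], "e2": [0.8, 0.3]}
results = {}
for name, probs in environments.items():
    rows = simulate(probs, 100_000, rng)
    for r in (0, 1):
        results[(name, r)] = check_identity(rows, r)
        p1, ratio, m0, m1 = results[(name, r)]
        print(f"{name} x_R={r}: E[Y|x]={p1:.4f} ratio={ratio:.4f} "
              f"m0={m0:.3f} m1={m1:.3f}")
```

Within each environment the ratio agrees with the sample mean of $Y$ up to floating-point error, because the sample itself satisfies the decomposition in (13). The class-conditional means `m0` and `m1` should be close to `g(r, 0)` and `g(r, 1)` in both environments, even though the conditional mean of $Y$ differs markedly between them.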

References

  • [1] N. Meinshausen, A. Hauser, J. M. Mooij, J. Peters, P. Versteeg, and P. Bühlmann, “Methods for causal inference from gene perturbation experiments and validation,” Proceedings of the National Academy of Sciences, vol. 113, no. 27, pp. 7361–7368, 2016.
  • [2] A. V. Goddard, Y. Xiang, and C. J. Bryan, “Invariance-based causal prediction to identify the direct causes of suicidal behavior,” Frontiers in Psychiatry, p. 2598, 2022.
  • [3] T. Haavelmo, “The probability approach in econometrics,” Econometrica: Journal of the Econometric Society, vol. 12, pp. 1–115, 1944.
  • [4] J. Aldrich, “Autonomy,” Oxford Economic Papers, vol. 41, no. 1, pp. 15–34, 1989.
  • [5] K. D. Hoover, “The logic of causal inference: Econometrics and the conditional analysis of causation,” Economics & Philosophy, vol. 6, no. 2, pp. 207–234, 1990.
  • [6] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij, “On causal and anticausal learning,” arXiv preprint arXiv:1206.6471, 2012.
  • [7] A. P. Dawid and V. Didelez, “Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview,” Statistics Surveys, vol. 4, pp. 184–231, 2010.
  • [8] J. Pearl, Causality. Cambridge University Press, 2009.
  • [9] J. Peters, P. Bühlmann, and N. Meinshausen, “Causal inference by using invariant prediction: identification and confidence intervals,” Journal of the Royal Statistical Society. Series B (Statistical Methodology), pp. 947–1012, 2016.
  • [10] C. Heinze-Deml, J. Peters, and N. Meinshausen, “Invariant causal prediction for nonlinear models,” Journal of Causal Inference, vol. 6, no. 2, 2018.
  • [11] N. Pfister, P. Bühlmann, and J. Peters, “Invariant causal prediction for sequential data,” Journal of the American Statistical Association, vol. 114, no. 527, pp. 1264–1276, 2019.
  • [12] M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters, “Invariant models for causal transfer learning,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 1309–1342, 2018.
  • [13] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters, “Anchor regression: Heterogeneous data meet causality,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 83, no. 2, pp. 215–246, 2021.
  • [14] N. Pfister, E. G. Williams, J. Peters, R. Aebersold, and P. Bühlmann, “Stabilizing variable selection and regression,” The Annals of Applied Statistics, vol. 15, no. 3, pp. 1220–1246, 2021.
  • [15] K. Du and Y. Xiang, “Learning invariant representations under general interventions on the response,” IEEE Journal on Selected Areas in Information Theory, 2023.
  • [16] ——, “Generalized invariant matching property via lasso,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
  • [17] A. Goddard, Y. Xiang, and I. Soloveychik, “Error probability bounds for invariant causal prediction via multiple access channels,” Asilomar Conference on Signals, Systems, and Computers, 2023.
  • [18] B. Becker and R. Kohavi, “Adult,” UCI Machine Learning Repository, 1996, DOI: https://doi.org/10.24432/C5XW20.
  • [19] “Mushroom,” UCI Machine Learning Repository, 1987, DOI: https://doi.org/10.24432/C5959T.